Mosaic in HDF5 files

HDF5 files contain a tree structure whose leaves are datasets. Non-leaf nodes are called groups and work much like directories in a file system. Each dataset is an array whose elements can be numbers or characters, but also values of compound data types (similar to record types in various programming languages) or fixed-size arrays of numbers or characters. Groups and datasets can have metadata tags called attributes. Each group or dataset can be identified by a path specifying how to reach it from the root group of a file. However, a path is not necessarily unique, because HDF5 provides links that effectively put a single node in several places in the tree. Moreover, HDF5 provides a “reference” data type that makes it possible to refer to a node or to a subset of a dataset.

The design criteria for the HDF5 representation of Mosaic data were efficiency of storage and ease of use from low-level languages such as C or Fortran. As much as possible, Mosaic data is stored as arrays of numbers.

HDF5 has two string layouts: fixed-size strings (character arrays) and variable-length strings. The two layouts are not interchangeable. In order to facilitate software development, Mosaic uses only variable-length strings.

Mosaic data items in an HDF5 file

A Mosaic data item in an HDF5 file can be a dataset or a group containing multiple datasets. It is identified by four attributes, all of which are required:

DATA_MODEL
a variable-length string with the value “MOSAIC”
DATA_MODEL_MAJOR_VERSION
an integer
DATA_MODEL_MINOR_VERSION
an integer
MOSAIC_DATA_TYPE
a variable-length string
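
The identification rule above can be sketched in Python as a check over a mapping from attribute names to values. This is a minimal sketch, not part of the Mosaic software: the function name is hypothetical, and it assumes that the attributes have already been read into a dict-like object with strings decoded to Python str values (the attrs object of an h5py group or dataset is one such mapping).

```python
import numbers

REQUIRED_ATTRIBUTES = (
    "DATA_MODEL",
    "DATA_MODEL_MAJOR_VERSION",
    "DATA_MODEL_MINOR_VERSION",
    "MOSAIC_DATA_TYPE",
)

def is_mosaic_item(attrs):
    """Return True if the attribute mapping identifies a Mosaic data item."""
    # All four attributes are required.
    if any(name not in attrs for name in REQUIRED_ATTRIBUTES):
        return False
    # DATA_MODEL must be the string "MOSAIC".
    if attrs["DATA_MODEL"] != "MOSAIC":
        return False
    # Both version numbers must be integers (bool is rejected on purpose).
    for name in ("DATA_MODEL_MAJOR_VERSION", "DATA_MODEL_MINOR_VERSION"):
        value = attrs[name]
        if isinstance(value, bool) or not isinstance(value, numbers.Integral):
            return False
    # MOSAIC_DATA_TYPE must be a string; its allowed values are not checked here.
    return isinstance(attrs["MOSAIC_DATA_TYPE"], str)
```

The check only covers what is stated above. It does not compare the version numbers against any particular release of the data model.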

References between data items are stored as attributes whose value is an HDF5 object reference.

Universes

A universe is stored as a group containing several datasets. The datasets convention and cell_shape are variable-length strings. The symmetry transformation list is stored in dataset symmetry_transformations as a one-dimensional array, possibly empty, whose elements are of a compound data type with fields

rotation
A 3x3 array of float64 numbers.
translation
An array of 3 float64 numbers.

The fragment tree is stored in several arrays. All string values are stored in a one-dimensional array dataset symbols whose elements are variable-length strings. In the fragment tree, the strings can then be replaced by integers, which are indices into the symbols array. Ideally, each string is stored only once in the symbols array, though this is not a requirement. The tree data structure is stored in two arrays, fragments and atoms. Each node (fragment or atom) has one array entry, whose type is a compound data type with unsigned integer fields. Any size of unsigned integer can be used, but the same type must be used everywhere for a given universe group.
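
The replacement of strings by indices can be illustrated with a small Python sketch. The helper name and the example strings are hypothetical, and a writer is free to build the symbols array differently, since storing each string only once is recommended but not required.

```python
def intern_symbol(symbols, index_of, text):
    """Return the index of text in symbols, appending it on first use."""
    if text not in index_of:
        index_of[text] = len(symbols)
        symbols.append(text)
    return index_of[text]

symbols = []    # future contents of the symbols dataset, in order
index_of = {}   # helper lookup table: string -> index in symbols

# Hypothetical labels; "water" occurs twice but is stored only once.
labels = ["water", "O", "H1", "H2", "water"]
label_indices = [intern_symbol(symbols, index_of, s) for s in labels]
```

The integers in label_indices are the values that would be written to fields such as label_symbol_index, while symbols would be written once as the symbols dataset.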

The entries of the fragments array have the fields

parent_index
The index of the parent node in the fragments array. A value of 0 indicates a root node, which has no parent.
label_symbol_index
The index of the label in the symbols array.
species_symbol_index
The index of the species in the symbols array.
number_of_fragments
The number of sub-fragments. This is redundant information, provided to facilitate reading fragment-related information without analyzing the whole fragment tree.

The first entry (index 0) of the fragments array is unused, in order to allow an index value of 0 to stand for “no parent”. The fragments array thus has one more entry than the number of fragments in the universe.
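
The role of the unused first entry can be seen in a short Python sketch that walks from a fragment up to its root. The example data, the helper name, and the zero-filled placeholder entry are assumptions made for illustration. Only the field names come from the list above.

```python
from collections import namedtuple

# One entry of the fragments array; the field names follow the list above.
Fragment = namedtuple(
    "Fragment",
    "parent_index label_symbol_index species_symbol_index number_of_fragments",
)

# Hypothetical example data: a dimer made of two monomers.
symbols = ["complex", "dimer", "A", "monomer", "B"]
fragments = [
    Fragment(0, 0, 0, 0),  # index 0: unused placeholder (zero fill is an assumption)
    Fragment(0, 0, 1, 2),  # index 1: root fragment "complex" with two sub-fragments
    Fragment(1, 2, 3, 0),  # index 2: sub-fragment "A" of fragment 1
    Fragment(1, 4, 3, 0),  # index 3: sub-fragment "B" of fragment 1
]

def label_path(fragments, symbols, index):
    """Return the labels from the root fragment down to fragments[index]."""
    labels = []
    while index != 0:  # 0 means "no parent", so the walk stops above the root
        entry = fragments[index]
        labels.append(symbols[entry.label_symbol_index])
        index = entry.parent_index
    return labels[::-1]

number_of_fragments = len(fragments) - 1  # the placeholder is not a fragment
```

Because index 0 never refers to a real fragment, the loop needs no separate flag to recognize root nodes.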

The entries of the atoms array have the fields

parent_index
The index of the parent node in the fragments array.
label_symbol_index
The index of the label in the symbols array.
type_symbol_index
The index of the type in the symbols array.
name_symbol_index
The index of the name in the symbols array.
number_of_sites
The number of sites.

The entries of the bonds array have the fields

atom_index_1
The index of the first atom in the atoms array.
atom_index_2
The index of the second atom in the atoms array.
bond_order_symbol_index
The index of the bond-order label in the symbols array.

The entries of the molecules array have a large number of redundant fields (all but the first two) that are provided to allow atoms to be attributed to molecules without analyzing the full fragment tree.

fragment_index
The index of the fragment node in the fragments array.
number_of_copies
The number of copies of the molecule in the universe.
first_atom_index
The index of the first atom in the atoms array.
number_of_atoms
The number of atoms in the molecule.
first_bond_index
The index of the first bond in the bonds array.
number_of_bonds
The number of bonds in the molecule.
first_site_index
The index of the first site of the molecule.
number_of_sites
The number of sites in the molecule.

Since the atoms, sites, and bonds of a molecule have consecutive indices, the redundant “first” index and “number_of” fields are enough to locate the atoms, sites, and bonds of each molecule. Many applications therefore do not need to use the fragments array at all.
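
As an illustration, the following Python sketch turns one entry of the molecules array into index ranges. The function name and the example values are hypothetical, and the entry is represented as a plain dict keyed by the field names listed above. How number_of_copies affects these ranges is not covered by this sketch.

```python
def molecule_index_ranges(entry):
    """Return the index ranges of the atoms, bonds, and sites of one molecule."""
    def span(first, count):
        # Consecutive indices: first, first + 1, ..., first + count - 1
        return range(entry[first], entry[first] + entry[count])

    return {
        "atoms": span("first_atom_index", "number_of_atoms"),
        "bonds": span("first_bond_index", "number_of_bonds"),
        "sites": span("first_site_index", "number_of_sites"),
    }

# Hypothetical entry describing a three-atom molecule.
entry = {
    "fragment_index": 1,
    "number_of_copies": 1,
    "first_atom_index": 3,
    "number_of_atoms": 3,
    "first_bond_index": 2,
    "number_of_bonds": 2,
    "first_site_index": 3,
    "number_of_sites": 3,
}
ranges = molecule_index_ranges(entry)
```

The ranges can be used directly to slice the atoms and bonds arrays without reading the fragments array.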

Finally, the array polymers has one entry for each polymer fragment in the universe. Its fields are

fragment_index
The index of the fragment node in the fragments array.
polymer_type_symbol_index
The index of the polymer-type label in the symbols array.

If the universe has no polymer fragments, the dataset polymers may be omitted.

Configurations

A configuration is stored as a group containing two datasets: positions (required) and cell_parameters (required if the universe’s cell shape is not “infinite”). The reference to the universe is stored in the attribute universe of the group.

The dataset positions is a one-dimensional array whose length is equal to the number of sites in the universe. Its elements are one-dimensional arrays of length 3 whose elements are of type “float32” or “float64”.

The dataset cell_parameters is an array whose elements are of type “float32” or “float64”, and whose shape is defined in the specification.
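
The constraints on a configuration can be summarized in a small Python check. This is a sketch under stated assumptions: the function name is hypothetical, positions is taken to be a nested sequence already read from the file, and the shape of cell_parameters is not verified because it depends on the cell shape as defined in the specification.

```python
def check_configuration(cell_shape, number_of_sites, positions, cell_parameters=None):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # positions has one entry per site of the universe.
    if len(positions) != number_of_sites:
        problems.append("positions must have one entry per site")
    # Each entry is a one-dimensional array of length 3.
    if any(len(p) != 3 for p in positions):
        problems.append("each position must have exactly 3 components")
    # cell_parameters is required unless the cell shape is "infinite".
    if cell_shape != "infinite" and cell_parameters is None:
        problems.append("cell_parameters is required for this cell shape")
    return problems
```

For example, check_configuration("infinite", 2, [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]) reports no problems, while omitting cell_parameters for any other cell shape does.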

Properties

A property data item is stored as a dataset that is a one-dimensional array whose length is equal to the number of atoms or sites in the universe or the universe’s fragment list. Each element of this array is an array whose shape and element type are defined by the property’s data. The reference to the universe is stored in the attribute universe of the dataset. The property’s name and units are stored as variable-length strings in the attributes name and units. The property’s type is stored in the attribute property_type, also as a variable-length string.

Labels

A label data item is stored as a dataset that is a one-dimensional array whose length is equal to the number of atoms or sites in the universe or the universe’s fragment list. Each element of this array is a variable-length string. The reference to the universe is stored in the attribute universe of the dataset. The label’s name is stored in the attribute name as a variable-length string. The label’s type is stored in the attribute label_type, also as a variable-length string.

Selections

A selection is stored as a dataset that is a one-dimensional array of integers. The reference to the universe is stored in the attribute universe of the dataset. The selection’s type is stored in the attribute selection_type as a variable-length string.