.. Written by Konrad Hinsen
.. License: CC-BY 3.0

.. index::
   single: HDF5

Mosaic in HDF5 files
####################

HDF5 files contain a tree structure whose leaves are datasets.
Non-leaf nodes are called groups and work much like directories in a
file system. Each dataset is an array whose elements can be numbers or
characters, but also compound data types (similar to record types in
various programming languages) or fixed-size arrays of numbers or
characters. Groups and datasets can have metadata tags called
attributes.

Each group or dataset can be identified by a path specifying how to
reach it from the root group of a file. However, a path is not
necessarily unique, because HDF5 provides links that effectively put a
single node in several places in the tree. Moreover, HDF5 provides a
data type "reference" that makes it possible to refer to a node or to
a subset of a dataset.

The design criteria for the HDF5 representation of Mosaic data were
efficiency of storage and ease of use from low-level languages such as
C or Fortran. As much as possible, Mosaic data is stored as arrays of
numbers.

HDF5 has two string layouts: fixed-size strings (character arrays) and
variable-length strings. The two layouts are not interchangeable. In
order to facilitate software development, Mosaic uses only
variable-length strings.

Mosaic data items in an HDF5 file
---------------------------------

A Mosaic data item in an HDF5 file can be a dataset or a group
containing multiple datasets. It is identified by four attributes, all
of which are required:

DATA_MODEL
  a variable-length string with the value "MOSAIC"

DATA_MODEL_MAJOR_VERSION
  an integer

DATA_MODEL_MINOR_VERSION
  an integer

MOSAIC_DATA_TYPE
  a variable-length string

References between data items are stored as attributes whose value is
an HDF5 object reference.

Universes
.........

A universe is stored as a group containing several datasets. The
datasets :ref:`convention` and :ref:`cell_shape` are variable-length
strings.
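Like every Mosaic data item, a universe group carries the four
identifying attributes listed earlier. The following is a minimal
sketch of how a reader might check them. It uses a plain Python
dictionary as a stand-in for the attributes of an HDF5 node; the
function name ``is_mosaic_data_item``, the version numbers, and the
value ``"universe"`` for ``MOSAIC_DATA_TYPE`` are assumptions made for
illustration and are not taken from the specification.

```python
# Names and types of the four required attributes, as listed in the text.
REQUIRED_ATTRIBUTES = {
    "DATA_MODEL": str,
    "DATA_MODEL_MAJOR_VERSION": int,
    "DATA_MODEL_MINOR_VERSION": int,
    "MOSAIC_DATA_TYPE": str,
}


def is_mosaic_data_item(attrs):
    """Return True if attrs holds all four required Mosaic attributes.

    attrs is a plain dict standing in for the attributes of an HDF5
    group or dataset (hypothetical helper, not part of any library).
    """
    for name, expected_type in REQUIRED_ATTRIBUTES.items():
        if name not in attrs:
            return False
        if not isinstance(attrs[name], expected_type):
            return False
    # DATA_MODEL must have the fixed value "MOSAIC".
    return attrs["DATA_MODEL"] == "MOSAIC"


universe_attrs = {
    "DATA_MODEL": "MOSAIC",
    "DATA_MODEL_MAJOR_VERSION": 1,    # made-up version number
    "DATA_MODEL_MINOR_VERSION": 0,    # made-up version number
    "MOSAIC_DATA_TYPE": "universe",   # assumed value, for illustration only
}

print(is_mosaic_data_item(universe_attrs))
print(is_mosaic_data_item({"DATA_MODEL": "MOSAIC"}))
```

With a real file, the dictionary would be replaced by whatever
attribute interface the HDF5 library in use provides.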
The :ref:`symmetry transformation` list is stored in dataset
``symmetry_transformations`` as a one-dimensional array, possibly
empty, whose elements are of a compound data type with fields

rotation
  A 3x3 array of float64 numbers.

translation
  An array of 3 float64 numbers.

The fragment tree is stored in several arrays. All string values are
stored in a one-dimensional array dataset ``symbols`` whose elements
are variable-length strings. In the fragment tree, the strings can
then be replaced by integers, which are indices into this symbol
list. Ideally, each string is stored only once in the symbol array,
though this is not a requirement.

The tree structure itself is stored in two arrays, ``fragments`` and
``atoms``. Each node (fragment or atom) has one array entry of a
compound data type whose fields are unsigned integers. Any size of
unsigned integer can be used, but the same type must be used
everywhere for a given universe group.

The entries of the ``fragments`` array have the fields

parent_index
  The index of the parent node in the ``fragments`` array.
  A value of 0 indicates a root node, which has no parent.

label_symbol_index
  The index of the :ref:`label` in the ``symbols`` array.

species_symbol_index
  The index of the :ref:`species` in the ``symbols`` array.

number_of_fragments
  The number of sub-fragments. This is redundant information,
  provided to facilitate reading fragment-related information
  without analyzing the whole fragment tree.

The first entry (index 0) of the ``fragments`` array is unused, so
that an index value of 0 can stand for "no parent". The ``fragments``
array thus has one more entry than the number of fragments in the
universe.

The entries of the ``atoms`` array have the fields

parent_index
  The index of the parent node in the ``fragments`` array.

label_symbol_index
  The index of the :ref:`label` in the ``symbols`` array.

type_symbol_index
  The index of the :ref:`type` in the ``symbols`` array.

name_symbol_index
  The index of the :ref:`name` in the ``symbols`` array.

number_of_sites
  The :ref:`number of sites`.

The entries of the ``bonds`` array have the fields

atom_index_1
  The index of the first atom in the ``atoms`` array.

atom_index_2
  The index of the second atom in the ``atoms`` array.

bond_order_symbol_index
  The index of the bond-order label in the ``symbols`` array.

The entries of the ``molecules`` array have the fields listed below.
All but the first two are redundant; they are provided so that atoms
can be attributed to molecules without analyzing the full fragment
tree.

fragment_index
  The index of the fragment node in the ``fragments`` array.

number_of_copies
  The number of copies of the molecule in the universe.

first_atom_index
  The index of the first atom in the ``atoms`` array.

number_of_atoms
  The number of atoms in the molecule.

first_bond_index
  The index of the first bond in the ``bonds`` array.

number_of_bonds
  The number of bonds in the molecule.

first_site_index
  The index of the first site of the molecule.

number_of_sites
  The number of sites in the molecule.

Since the atoms, sites, and bonds of a molecule have consecutive
indices, the redundant "first_index" and "number_of" values are
sufficient to locate the atoms, sites, and bonds of each molecule.
For many applications nothing more is needed, which makes it
unnecessary to read the ``fragments`` array.

Finally, the array ``polymers`` has one entry for each polymer
fragment in the universe. Its fields are

fragment_index
  The index of the fragment node in the ``fragments`` array.

polymer_type_symbol_index
  The index of the polymer-type label in the ``symbols`` array.

If the universe has no polymer fragments, the dataset ``polymers``
may be omitted.

Configurations
..............

A configuration is stored as a group containing two datasets:
``positions`` (required) and ``cell_parameters`` (required if the
universe's cell shape is not "infinite").
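The rule for which datasets a configuration group must contain can be
written down in a few lines. This is a minimal sketch under stated
assumptions: the function name ``missing_configuration_datasets`` is
hypothetical, the dataset names of a group are passed in as a plain
Python set, and the cell shape ``"cube"`` is used only as an example
of a value other than ``"infinite"``.

```python
def missing_configuration_datasets(dataset_names, cell_shape):
    """Return the required datasets that are absent from a configuration.

    dataset_names: names of the datasets found in the configuration group.
    cell_shape: the cell shape string of the universe the group refers to.
    """
    # "positions" is always required.
    required = ["positions"]
    # "cell_parameters" is required unless the cell shape is "infinite".
    if cell_shape != "infinite":
        required.append("cell_parameters")
    return [name for name in required if name not in dataset_names]


print(missing_configuration_datasets({"positions"}, "infinite"))
print(missing_configuration_datasets({"positions"}, "cube"))
```

An empty result means the group contains every dataset that the rule
requires for the given cell shape.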
The reference to the universe is stored in the attribute ``universe``
of the group.

The dataset ``positions`` is a one-dimensional array whose length is
equal to the number of sites in the universe. Its elements are
one-dimensional arrays of length 3 whose elements are of type
"float32" or "float64".

The dataset ``cell_parameters`` is an array whose elements are of type
"float32" or "float64", and whose shape is defined in the
:ref:`specification`.

Properties
..........

A property data item is stored as a dataset that is a one-dimensional
array whose length is equal to the number of atoms or sites, counted
either in the universe or in the universe's fragment list. Each
element of this array is an array whose shape and element type are
defined by the property's :ref:`data`.

The :ref:`reference` to the universe is stored in the attribute
``universe`` of the dataset. The property's :ref:`name` and
:ref:`units` are stored in attributes of the same name as
variable-length strings. The property's :ref:`type` is stored in the
attribute ``property_type``, also as a variable-length string.

Labels
......

A label data item is stored as a dataset that is a one-dimensional
array whose length is equal to the number of atoms or sites, counted
either in the universe or in the universe's fragment list. Each
element of this array is a variable-length string.

The :ref:`reference` to the universe is stored in the attribute
``universe`` of the dataset. The label's :ref:`name` is stored in the
attribute ``name`` as a variable-length string. The label's
:ref:`type` is stored in the attribute ``label_type``, also as a
variable-length string.

Selections
..........

A selection is stored as a dataset that is a one-dimensional array of
integers.

The :ref:`reference` to the universe is stored in the attribute
``universe`` of the dataset. The selection's :ref:`type` is stored in
the attribute ``selection_type`` as a variable-length string.
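Properties, labels, and selections all depend on the index
conventions of the universe they refer to, so a small worked example
of those conventions may help. The sketch below models the
``symbols`` and ``fragments`` arrays of a universe group as plain
Python lists and follows ``parent_index`` up to a root node. All data
values are made up, the helper ``fragment_path`` is hypothetical, and
joining the labels with a dot is a display choice of this example
rather than something the HDF5 layout prescribes.

```python
# Symbol list: every string of the fragment tree is stored here once.
symbols = ["dimer", "complex", "chain_A", "chain_B", "peptide"]

# Entry 0 is the unused placeholder, so that parent_index 0 can mean
# "no parent". Three real fragments follow.
fragments = [
    None,
    {"parent_index": 0, "label_symbol_index": 0,
     "species_symbol_index": 1, "number_of_fragments": 2},
    {"parent_index": 1, "label_symbol_index": 2,
     "species_symbol_index": 4, "number_of_fragments": 0},
    {"parent_index": 1, "label_symbol_index": 3,
     "species_symbol_index": 4, "number_of_fragments": 0},
]


def fragment_path(fragments, symbols, index):
    """Return the labels from the root fragment down to fragment `index`."""
    labels = []
    while index != 0:  # 0 stands for "no parent": the root was reached
        entry = fragments[index]
        labels.append(symbols[entry["label_symbol_index"]])
        index = entry["parent_index"]
    return ".".join(reversed(labels))


print(fragment_path(fragments, symbols, 3))  # prints "dimer.chain_B"
print(len(fragments) - 1)                    # prints 3, the number of fragments
```

The same lookup pattern applies to the ``atoms`` array, whose
``parent_index`` field also points into ``fragments``.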