Mosaic PDB convention

Mosaic can be used to store molecular models from the Protein Data Bank (PDB). The main application is to use such models as the starting point for molecular simulations. The following conventions describe how a PDB structure is stored in terms of Mosaic data items. Note that only the structure itself can be stored; experimental data (structure factors etc.) and metadata describing the experiment or the refinement process cannot.

The PDB’s official data format is called PDBx/mmCIF. In the conversion from PDBx/mmCIF to Mosaic, as much information as possible is carried over without modification. In particular, residue and atom names are the same.

Crystallographic structures

A crystallographic structure is represented by two required data items:

  • A universe defining the molecular structures and, in the case of crystals, the symmetries. The atoms in the universe have multiple sites if the PDB structure contains alternate locations.
  • A configuration providing the positions for all sites and the shape of the unit cell in the case of crystals.

Additional information from the PDB entry can be provided by optional data items:

  • The occupancy of each site can be provided as a property of type “site” or “template_site” with an empty units string. Each value is a scalar of type “float32” or “float64” in the interval [0..1]. If no occupancy values are provided, the occupancy of all sites is assumed to be 1.
  • An anisotropic displacement parameter for each site can be provided as a property of type “site” or “template_site”. A valid units string must be provided; the preferred units are “nm2”. Each value is an array of shape “6” and of type “float32” or “float64”. The order of the elements is [1][1], [2][2], [3][3], [2][3], [1][3], [1][2]. For the precise definition of the anisotropic displacement parameters, see the PDB documentation for items _atom_site.aniso_U[1][1] to _atom_site.aniso_U[3][3].
  • An isotropic displacement parameter for each site can be provided as a property of type “site” or “template_site”. A valid units string must be provided; the preferred units are “nm2”. Each value is a scalar of type “float32” or “float64”. An isotropic displacement parameter of value x is equivalent to an anisotropic displacement parameter of value [x x x 0 0 0].
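The value layouts described in this list can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical and not part of any Mosaic library, and values are held in ordinary lists rather than typed arrays.

```python
# Hypothetical helpers for illustration; none of these names come from Mosaic.

def occupancies_or_default(occupancies, n_sites):
    """Return one occupancy per site, defaulting to 1.0 when none are given."""
    if occupancies is None:
        return [1.0] * n_sites
    if len(occupancies) != n_sites:
        raise ValueError("need exactly one occupancy per site")
    for value in occupancies:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"occupancy {value} is outside [0, 1]")
    return list(occupancies)


def iso_to_aniso(x):
    """Anisotropic equivalent [x x x 0 0 0] of an isotropic value x."""
    return [x, x, x, 0.0, 0.0, 0.0]


def aniso_to_matrix(u):
    """Expand six values ordered [1][1], [2][2], [3][3], [2][3], [1][3], [1][2]
    into a symmetric 3x3 matrix given as nested lists."""
    if len(u) != 6:
        raise ValueError("expected six elements")
    u11, u22, u33, u23, u13, u12 = u
    return [[u11, u12, u13],
            [u12, u22, u23],
            [u13, u23, u33]]
```

For instance, aniso_to_matrix(iso_to_aniso(x)) yields a diagonal matrix with x on the diagonal, which is the equivalence stated for isotropic parameters.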

If anisotropic displacement parameters are provided, then no isotropic displacement parameters may be given, in order to prevent inconsistencies in the data.
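A reader or writer could enforce this rule with a small check such as the following. It is a minimal sketch: the dictionary of optional items and its keys ("anisotropic_displacement", "isotropic_displacement") are assumptions made for the example, not labels defined by this convention.

```python
# Hypothetical check; the item keys are assumed for illustration only.

def check_displacement_items(optional_items):
    """Raise ValueError if both kinds of displacement parameters are present.

    optional_items: dict mapping an item key to its property data.
    """
    has_aniso = "anisotropic_displacement" in optional_items
    has_iso = "isotropic_displacement" in optional_items
    if has_aniso and has_iso:
        raise ValueError(
            "isotropic displacement parameters may not be given "
            "together with anisotropic ones"
        )
    return has_aniso or has_iso
```

The return value only reports whether any displacement parameters were supplied; the rule itself is enforced by the exception.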

Heterogeneous sequences

In a heterogeneous sequence, a specific position can be taken by different residues in different copies of the molecule. In a PDB entry, heterogeneous sequences are marked as such, and contain multiple residues with the same residue number but different chemical components. All the atoms in these multiple residues have occupancies smaller than 1.

In Mosaic, a heterogeneous sequence is represented by a single polymer fragment. The fragment at a heterogeneous position has no atoms of its own and has as many subfragments as there are residue variants at this position. Each subfragment is one of the residue variants. There are no bonds between atoms of different residue variants, but each of them can have bonds to the neighboring residue(s).
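The nesting described above can be sketched with a small data structure. The Fragment class, the tuple-based atom paths, and the example residue names are all assumptions made for this illustration; they do not reproduce Mosaic's actual fragment definition.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a fragment; not Mosaic's real data model.
@dataclass
class Fragment:
    label: str
    atoms: list = field(default_factory=list)      # atom names
    fragments: list = field(default_factory=list)  # subfragments


def heterogeneous_position(label, variants):
    """Build the fragment for one heterogeneous position.

    variants: dict mapping a residue name to its list of atom names.
    The outer fragment holds no atoms; each variant is a subfragment.
    """
    subfragments = [Fragment(label=name, atoms=list(atoms))
                    for name, atoms in variants.items()]
    return Fragment(label=label, atoms=[], fragments=subfragments)


def bond_crosses_variants(path_a, path_b):
    """True if a bond joins atoms of two different variants at one position.

    Atom paths are tuples: (position, atom) for an ordinary residue and
    (position, variant, atom) for an atom inside a residue variant.
    """
    return (len(path_a) == 3 and len(path_b) == 3
            and path_a[0] == path_b[0] and path_a[1] != path_b[1])


# Example: position "12" is either SER or THR.
pos12 = heterogeneous_position("12", {"SER": ["N", "CA", "OG"],
                                      "THR": ["N", "CA", "OG1"]})
```

With this layout, a bond from ("11", "C") to ("12", "SER", "N") is allowed, while a bond from ("12", "SER", "CA") to ("12", "THR", "CA") would be rejected by bond_crosses_variants.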

NMR structures

An NMR structure is represented by the following data items:

  • A universe defining the molecular structures.
  • One configuration per model contained in the PDB entry.
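As an illustration of this layout, the sketch below splits a list of NMR models into one shared structure description and one list of positions per model. The input format, the function name, and the dictionary used as a stand-in for the universe are hypothetical; the sketch also assumes that every model lists the same atoms in the same order.

```python
# Hypothetical sketch; the "universe" here is a plain dict, not a Mosaic object.

def nmr_items(models):
    """Return (universe, configurations) for a list of models.

    models: list of models, each a list of (atom_name, (x, y, z)) tuples.
    """
    if not models:
        raise ValueError("an NMR entry needs at least one model")
    reference_names = [name for name, _ in models[0]]
    configurations = []
    for index, model in enumerate(models):
        names = [name for name, _ in model]
        if names != reference_names:
            raise ValueError(f"model {index} does not match the first model")
        # One configuration per model: positions only, in site order.
        configurations.append([position for _, position in model])
    universe = {"sites": reference_names}
    return universe, configurations


universe, configurations = nmr_items([
    [("N", (0.0, 0.0, 0.0)), ("CA", (0.1, 0.0, 0.0))],
    [("N", (0.0, 0.1, 0.0)), ("CA", (0.1, 0.1, 0.0))],
])
```

The single universe is shared by all configurations, so the number of configurations equals the number of models in the entry.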