Presented by Yana Valasatava
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
The PDB evolving complexity
PDB archive: > 30 GB
~250 MB in mmCIF format
Structural biology meets the big-data era:
● Growing size: ~120K structures, growing by ~10K structures per year
● Evolving complexity: growing compositional heterogeneity and size
● Increasing usage: > 300,000 users per month from over 160 countries
3J3Q
3J3Q has more than 1 million atoms
The PDB has more than 1 billion atoms
PDBx/mmCIF
Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.
repetitive information
redundant annotations
inefficient representation
PDB/MMTF
The MacroMolecular Transmission Format
MMTF has the following advantages:
❏ it occupies less space (less disk I/O)
❏ it is faster to read (no time-consuming string parsing)
❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders)
Fields:
○ Format data (e.g. the version number of the specification)
○ Metadata (e.g. rFree and resolution)
○ Structure data (e.g. number of models, chains, groups, atoms)
○ Chain data (e.g. list of chain IDs, chain names)
○ Group data (e.g. list of group names, formal charges, bonds)
○ Atom data (e.g. B-factors, coordinates, occupancies)
https://github.com/rcsb/mmtf/blob/master/spec.md
MMTF compression pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive indexing
extract structural data
calculate bonds, SSE
The binary container format of MMTF
Compression pipeline: dictionary encoding
Group  Id  Symb.  AtmId  ResId  ChainIds  ResNum  x (Å)   y (Å)   z (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A         18      14.699  61.369  62.050  1.00  39.19
ATOM   2   C      CA     ARG    A         18      14.500  62.241  60.856  1.00  38.35
ATOM   3   C      C      ARG    A         18      13.762  61.516  59.729  1.00  36.05
{ "groupName": "ARG",
"singleLetterCode": "R",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C" ],
"elementList": [ "N", "C", "C"] }
Index of this group in the group dictionary: 1
Residue sequence: SER-GLY-ARG-SER-SER
groupTypeList: [ 2, 0, 1, 2, 2 ]
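The lookup behind dictionary encoding can be sketched in a few lines of Python. This is a minimal illustration, not the reference decoder: the GLY and SER entries and their positions (0 and 2) are assumptions inferred from the groupTypeList on the slide, and only the ARG entry is taken from it. The name decode_groups is hypothetical.

```python
# Group dictionary: each distinct group type is stored once.
# Only the ARG entry comes from the slide; the GLY and SER entries are
# trimmed-down placeholders assumed for illustration.
group_list = [
    {"groupName": "GLY", "singleLetterCode": "G"},
    {
        "groupName": "ARG",
        "singleLetterCode": "R",
        "chemCompType": "L-PEPTIDE LINKING",
        "atomNameList": ["N", "CA", "C"],
        "elementList": ["N", "C", "C"],
    },
    {"groupName": "SER", "singleLetterCode": "S"},
]

# One small integer per residue instead of a repeated description.
group_type_list = [2, 0, 1, 2, 2]


def decode_groups(type_indices, dictionary):
    """Replace each index with the dictionary entry it points to."""
    return [dictionary[i] for i in type_indices]


residues = decode_groups(group_type_list, group_list)
sequence = "-".join(group["groupName"] for group in residues)
one_letter = "".join(group["singleLetterCode"] for group in residues)
print(sequence)
print(one_letter)
```

The per-group information (atom names, elements, chemical component type) is written once per group type, so a long chain costs one index per residue.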
Compression pipeline: encodings
Group  Id  Symb.  AtmId  ResId  ChainIds  ResNum  x (Å)   y (Å)   z (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A         18      14.699  61.369  62.050  1.00  39.19
ATOM   2   C      CA     ARG    A         18      14.500  62.241  60.856  1.00  38.35
ATOM   3   C      C      ARG    A         18      13.762  61.516  59.729  1.00  36.05
x coordinates (integer + delta encoding):
14.699, 14.500 -> 14699, 14500 -> 14699, -199
Atom IDs (delta + run-length encoding):
1, 2, 3 -> 1, 1, 1 -> 1, 3
integer encoding: floating-point numbers are mapped to integers by multiplying by a fixed factor and rounding
run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
delta encoding: only the differences (deltas) between consecutive numbers are stored
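The three encodings defined above can be sketched as plain Python functions. This is an illustrative sketch, not the MMTF reference implementation: the function names are hypothetical, and the factor of 1000 is an assumption that matches coordinates with three decimal places.

```python
def integer_encode(values, factor=1000):
    """Map floats to integers by scaling and rounding (factor is assumed)."""
    return [round(v * factor) for v in values]


def integer_decode(ints, factor=1000):
    """Undo integer encoding by dividing by the same factor."""
    return [i / factor for i in ints]


def delta_encode(values):
    """Store the first value, then the difference to the previous value."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Rebuild the original values as a running sum of the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Flatten runs of equal values into [value, count, value, count, ...]."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


def run_length_decode(pairs):
    """Expand [value, count, ...] back into the original list."""
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# x coordinates of the three atoms: integer encoding, then delta encoding.
x = [14.699, 14.500, 13.762]
x_encoded = delta_encode(integer_encode(x))

# Atom IDs: delta encoding, then run-length encoding.
atom_ids = [1, 2, 3]
ids_encoded = run_length_encode(delta_encode(atom_ids))

print(x_encoded)
print(ids_encoded)
```

Delta encoding turns slowly varying or regularly increasing columns into small, repetitive numbers, which is what makes the subsequent run-length step effective.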
Compression pipeline: Recursive Indexing
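Recursive indexing into an array of 8-bit integer values: each stored value must lie within the open interval (-128, 127). A value outside it is stored as repeated interval endpoints (127 or -128) followed by the remainder.
[-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]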
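The example above can be reproduced with a short Python sketch. The function names and the default 8-bit limits are assumptions chosen to match the slide, not the API of any MMTF library.

```python
def recursive_index_encode(values, lo=-128, hi=127):
    """Split each value into endpoint steps plus a remainder inside (lo, hi)."""
    out = []
    for v in values:
        while v >= hi:        # too large: emit the upper endpoint and subtract it
            out.append(hi)
            v -= hi
        while v <= lo:        # too small: emit the lower endpoint and subtract it
            out.append(lo)
            v -= lo
        out.append(v)         # remainder now lies strictly between lo and hi
    return out


def recursive_index_decode(encoded, lo=-128, hi=127):
    """Sum consecutive entries until one that is not an endpoint ends the value."""
    out, acc = [], 0
    for e in encoded:
        acc += e
        if e != hi and e != lo:
            out.append(acc)
            acc = 0
    return out


original = [-50, -128, 7, 127, 268]
encoded = recursive_index_encode(original)
decoded = recursive_index_decode(encoded)
print(encoded)
print(decoded)
```

Because an endpoint always signals "more to come", a value that equals an endpoint exactly (such as 127 or -128) is followed by a 0 remainder.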
Overview of data
Full format
• all atoms (useful for structural bioinformatics analysis)
• coordinates with 3 decimal place precision (no loss after decoding)
Reduced format
• C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)
• coordinates with 1 decimal place precision (a further size reduction of almost 40 %)
• exactly the same data structure as the full format (parsers work for both)
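The difference in coordinate precision between the two formats can be illustrated with a small Python sketch. It assumes that precision is applied by scaling to an integer, rounding, and scaling back; the helper name quantize and the sample coordinates are for illustration only.

```python
def quantize(coords, decimals):
    """Round-trip coordinates through integers at the given decimal precision."""
    factor = 10 ** decimals
    return [round(c * factor) / factor for c in coords]


x = [14.699, 14.500, 13.762]   # coordinates given with 3 decimal places

full = quantize(x, 3)          # full format: 3 decimal places
reduced = quantize(x, 1)       # reduced format: 1 decimal place

# Largest deviation introduced by the reduced precision for these values.
max_error = max(abs(a - b) for a, b in zip(x, reduced))

print(full)
print(reduced)
```

With one decimal place the rounding error is at most 0.05 Å per coordinate, which is small relative to typical interatomic distances and therefore acceptable for visualisation.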
MMTF size and parsing speed
* Parsing using Java libraries
Using MMTF
To efficiently store, transmit, and visualize the 3D structures of biological
macromolecules
To perform large-scale structural calculations such as geometric queries or
structural comparisons over the entire PDB archive held in memory
Presented by Anthony Bradley
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
Goals
• Analysis should be easy and simple
• Whole-archive analysis of the PDB should be trivial AND fast
• Big Data tools (e.g. Spark and Hadoop) are available
Pros and cons
Pros:
● Looping through the whole library performing simple analyses
● Simple to parallelize code
● Much more complete data
Cons:
● Tied to Java
● Not a magic unicorn