CADD meeting 08-30-2016

Compact representation and mining of the PDB

Published in: Science

  1. Compact representation of 3D macromolecular structures from the PDB
  2. Presented by Yana Valasatava, Postdoctoral Researcher, Structural Bioinformatics Group, San Diego Supercomputer Center
  3. The PDB: evolving complexity
     PDB archive > 30 GB
     ~250 MB in mmCIF format
     Structural biology efforts meet a big-data era:
     ● Growing size: ~120K structures, growing by ~10K structures per year
     ● Evolving complexity: growing compositional heterogeneity and size
     ● Increasing usage: > 300,000 users per month from over 160 countries
     3J3Q has more than 1 million atoms
     The PDB has more than 1 billion atoms
  4. Scalability issues
     ★ Interactive visualization
       ○ slow network transfer
       ○ slow parsing
       ○ slow rendering
     ★ Mobile visualization
       ○ limited bandwidth
       ○ limited memory
     ★ Large-scale structural analysis
       ○ slow repeated I/O
       ○ slow repeated parsing
  5. PDBx/mmCIF
     Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.
     ● repetitive information
     ● redundant annotations
     ● inefficient representation
  6. PDB/MMTF: The MacroMolecular Transmission Format
     MMTF has the following advantages:
     ❏ it occupies less space (less disk I/O)
     ❏ it is faster to read (no time-consuming string parsing)
     ❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders)
     Fields:
     ○ Format data (e.g. the version number of the specification)
     ○ Metadata (e.g. rFree and resolution)
     ○ Structure data (e.g. number of models, chains, groups, atoms)
     ○ Chain data (e.g. list of chain IDs, chain names)
     ○ Group data (e.g. list of group names, formal charges, bonds)
     ○ Atom data (e.g. B-factors, coordinates, occupancies)
     https://github.com/rcsb/mmtf/blob/master/spec.md
  7. MMTF compression pipeline
     Diagram labels: extract structural data; calculate bonds, SSE; integer encoding; dictionary encoding; run-length encoding; delta encoding; recursive indexing; GZIP
     The binary container format of MMTF
  8. Compression pipeline: dictionary encoding
     Group  Id  Symb.  AtmId  ResId  ChainIds        x, y, z coordinates (Å)   Occ.  B-factor
     ATOM    1  N      N      ARG    A         18    14.699  61.369  62.050    1.00  39.19
     ATOM    2  C      CA     ARG    A         18    14.500  62.241  60.856    1.00  38.35
     ATOM    3  C      C      ARG    A         18    13.762  61.516  59.729    1.00  36.05
     Dictionary entry (index: 1):
       {
         "groupName": "ARG",
         "singleLetterCode": "R",
         "chemCompType": "L-PEPTIDE LINKING",
         "atomNameList": [ "N", "CA", "C" ],
         "elementList": [ "N", "C", "C" ]
       }
     Sequence: SER-GLY-ARG-SER-SER
     groupTypeList: [ 2, 0, 1, 2, 2 ]
  9. Compression pipeline: encodings
     Group  Id  Symb.  AtmId  ResId  ChainIds        x, y, z coordinates (Å)   Occ.  B-factor
     ATOM    1  N      N      ARG    A         18    14.699  61.369  62.050    1.00  39.19
     ATOM    2  C      CA     ARG    A         18    14.500  62.241  60.856    1.00  38.35
     ATOM    3  C      C      ARG    A         18    13.762  61.516  59.729    1.00  36.05
     Atom IDs (delta + run-length): 1,2,3 -> 1,1,1 -> 1,3
     Coordinates (integer + delta): 14.699 -> 14699, 14.500 -> 14500, 169
     ● integer encoding: floating-point numbers are mapped to integers
     ● run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
     ● delta encoding: the differences (deltas) between consecutive numbers are stored
  10. Compression pipeline: recursive indexing
      Group  Id  Symb.  AtmId  ResId  ChainIds        x, y, z coordinates (Å)   Occ.  B-factor
      ATOM    1  N      N      ARG    A         18    14.699  61.369  62.050    1.00  39.19
      ATOM    2  C      CA     ARG    A         18    14.500  62.241  60.856    1.00  38.35
      ATOM    3  C      C      ARG    A         18    13.762  61.516  59.729    1.00  36.05
      Recursive indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]
      The output is an array of 8-bit integers. Values in the open interval (-128, 127) are stored directly; the limits 127 and -128 signal that the next element must be added to them.
  11. Overview of data
      Full format
      • all atoms (useful for structural bioinformatics analysis)
      • coordinates with 3 decimal places of precision (no loss after decoding)
      Reduced format
      • C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)
      • coordinates with 1 decimal place of precision (a further size reduction of almost 40%)
      • exactly the same data structure as the full format (parsers work for both)
  12. MMTF size and parsing speed (chart; parsing measured using Java libraries)
  13. Using MMTF
      ● To efficiently store, transmit, and visualize the 3D structures of biological macromolecules
      ● To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  14. Presented by Anthony Bradley, Postdoctoral Researcher, Structural Bioinformatics Group, San Diego Supercomputer Center
  15. Using MMTF
      ● To efficiently store, transmit, and visualize the 3D structures of biological macromolecules
      ● To perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  16. Goals
      • Analysis should be easy and simple
      • Whole archive analysis of the PDB should be trivial AND fast
      • Big Data tools (e.g. Spark and Hadoop) are available
  17. mmtf-python, mmtf-java: Nobody should (have to) write their own parser. Ever.
  18. MMTF-Spark - Simple API
  19. Continued…
  20. Data mining - speed advantage
  21. Contact finding
  22. Contact finding
  23. Pros and cons
      Pros:
      ● Looping through the whole library performing simple analyses
      ● Simple to parallelize code
      ● Much more complete data
      Cons:
      ● Tied to Java
      ● Not a magic unicorn
  25. Thanks!
      • http://mmtf.rcsb.org/
      • https://github.com/rcsb/mmtf-javascript
      • https://github.com/rcsb/mmtf-java
      • https://github.com/rcsb/mmtf-python
      • http://spark.apache.org/
  26. Acknowledgements: NCI/NIH (U01 CA198942)
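
The dictionary encoding on slide 8 can be illustrated with a minimal Python sketch. It is not taken from the MMTF libraries: the function names `dictionary_encode` and `dictionary_decode` are hypothetical, and the encoder assigns indices in order of first appearance, which is an assumption (the slide's dictionary uses a different ordering, with ARG at index 1).

```python
def dictionary_encode(group_names):
    """Replace each group (residue) name by an index into a list of unique names.

    Hypothetical helper: indices are assigned in order of first appearance.
    """
    group_list = []       # unique group names
    index_of = {}         # group name -> index in group_list
    group_type_list = []  # one index per residue in the sequence
    for name in group_names:
        if name not in index_of:
            index_of[name] = len(group_list)
            group_list.append(name)
        group_type_list.append(index_of[name])
    return group_list, group_type_list


def dictionary_decode(group_list, group_type_list):
    """Recover the residue names from the dictionary and the index list."""
    return [group_list[i] for i in group_type_list]


sequence = ["SER", "GLY", "ARG", "SER", "SER"]
group_list, group_type_list = dictionary_encode(sequence)

# Decoding with the ordering implied by the slide (GLY=0, ARG=1, SER=2)
# reproduces the same sequence from groupTypeList [2, 0, 1, 2, 2].
slide_sequence = dictionary_decode(["GLY", "ARG", "SER"], [2, 0, 1, 2, 2])
```

In the slide, each dictionary entry also carries the atom names and elements of the group, so this per-atom information is stored once per group type rather than once per residue.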

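
The three encodings defined on slide 9 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MMTF reference implementation: the function names are hypothetical, and the multiplier of 1000 is inferred from the slide's example (14.699 -> 14699).

```python
def integer_encode(values, factor=1000):
    """Map floating-point numbers to integers (factor is an assumed multiplier)."""
    return [int(round(v * factor)) for v in values]


def delta_encode(values):
    """Keep the first value, then store the difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Represent each run of equal values as a flat (value, count) pair."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


def run_length_decode(pairs):
    """Invert run_length_encode."""
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# Atom serial numbers from the slide: delta, then run-length.
atom_ids = [1, 2, 3]
atom_ids_encoded = run_length_encode(delta_encode(atom_ids))

# x coordinates from the slide: integer, then delta.
x_coords = [14.699, 14.500, 13.762]
x_encoded = delta_encode(integer_encode(x_coords))
```

Delta encoding turns a steadily increasing column such as the atom serial numbers into a run of identical values, which is what makes the subsequent run-length step effective.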
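
The recursive indexing example on slide 10 can be reproduced with the following Python sketch. The function names are hypothetical and the code is an illustration of the rule shown on the slide, assuming 8-bit signed limits of 127 and -128; it is not the code used by the MMTF libraries.

```python
def recursive_index_encode(values, max_value=127, min_value=-128):
    """Split integers that do not fit strictly between the limits.

    A limit value (max or min) in the output means "add the next element too".
    """
    out = []
    for v in values:
        if v >= 0:
            while v >= max_value:
                out.append(max_value)
                v -= max_value
        else:
            while v <= min_value:
                out.append(min_value)
                v -= min_value
        out.append(v)  # remainder, strictly between the limits
    return out


def recursive_index_decode(encoded, max_value=127, min_value=-128):
    """Sum elements until one that is not a limit value closes the number."""
    out, acc = [], 0
    for v in encoded:
        acc += v
        if v != max_value and v != min_value:
            out.append(acc)
            acc = 0
    return out


original = [-50, -128, 7, 127, 268]
encoded = recursive_index_encode(original)
decoded = recursive_index_decode(encoded)
```

Because every original value ends with a remainder that is not a limit value, the decoder needs no length information to find where one number stops and the next begins.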