CADD meeting 08-30-2016

  1. Compact representation of 3D macromolecular structures from the PDB
  2. Presented by Yana Valasatava, Postdoctoral Researcher, Structural Bioinformatics Group, San Diego Supercomputer Center
  3. The PDB evolving complexity. Structural biology efforts meet a big-data era: ● Growing size: ~120K structures, with annual growth of ~10K structures ● Evolving complexity: growing compositional heterogeneity and size ● Increasing usage: > 300,000 users per month from over 160 countries. PDB archive > 30 GB; ~250 MB in mmCIF format. 3J3Q has more than 1 million atoms. The PDB has more than 1 billion atoms.
  4. Scalability issues ★ Interactive visualization: slow network transfer, slow parsing, slow rendering ★ Mobile visualization: limited bandwidth, limited memory ★ Large-scale structural analysis: slow repeated I/O, slow repeated parsing
  5. PDBx/mmCIF: a flexible, extensible, and verbose format with rich metadata, well suited for archival purposes. Drawbacks: repetitive information, redundant annotations, inefficient representation.
  6. PDB/MMTF: the MacroMolecular Transmission Format. MMTF has the following advantages: ❏ it occupies less space (less disk I/O) ❏ it is faster to read (no time-consuming string parsing) ❏ it contains precalculated information useful for structural analysis and visualisation (covalent bonds and bond orders). Fields: ○ Format data (e.g. the version number of the specification) ○ Metadata (e.g. rFree and resolution) ○ Structure data (e.g. number of models, chains, groups, atoms) ○ Chain data (e.g. list of chain IDs, chain names) ○ Group data (e.g. list of group names, formal charges, bonds) ○ Atom data (e.g. B-factors, coordinates, occupancies). Specification: https://github.com/rcsb/mmtf/blob/master/spec.md
  7. MMTF compression pipeline (diagram): extract structural data; calculate bonds, SSE; integer encoding, dictionary encoding, run-length encoding, delta encoding, recursive indexing; the binary container format of MMTF; GZIP
  8. Compression pipeline: dictionary encoding
     Columns: Group, Id, Symb., AtmId, ResId, ChainIds, x, y, z coordinates (Å), Occ., B-factor
     ATOM 1 N N  ARG A 18 14.699 61.369 62.050 1.00 39.19
     ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35
     ATOM 3 C C  ARG A 18 13.762 61.516 59.729 1.00 36.05
     Dictionary entry (index: 1): { "groupName": "ARG", "singleLetterCode": "R", "chemCompType": "L-PEPTIDE LINKING", "atomNameList": [ "N", "CA", "C" ], "elementList": [ "N", "C", "C" ] }
     SER-GLY-ARG-SER-SER -> groupTypeList: [ 2, 0, 1, 2, 2 ]
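The dictionary encoding on slide 8 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the MMTF library code: the function names `dictionary_encode` and `dictionary_decode` are hypothetical, and the encoder assigns indices in first-seen order, whereas the slide's dictionary evidently uses a different order (GLY = 0, ARG = 1, SER = 2). The decode call therefore passes the slide's dictionary in explicitly.

```python
def dictionary_encode(group_names):
    """Replace repeated group names with indices into a list of unique names.

    Hypothetical helper: indices are assigned in first-seen order.
    """
    group_list = []        # unique group names (the "dictionary")
    index_of = {}          # name -> position in group_list
    group_type_list = []   # one index per residue
    for name in group_names:
        if name not in index_of:
            index_of[name] = len(group_list)
            group_list.append(name)
        group_type_list.append(index_of[name])
    return group_list, group_type_list


def dictionary_decode(group_list, group_type_list):
    """Look each index up in the dictionary to recover the group names."""
    return [group_list[i] for i in group_type_list]


# Decoding with the dictionary order implied by the slide.
slide_groups = dictionary_decode(["GLY", "ARG", "SER"], [2, 0, 1, 2, 2])
# slide_groups is ["SER", "GLY", "ARG", "SER", "SER"]

# Encoding the same sequence with first-seen index assignment.
encoded = dictionary_encode(["SER", "GLY", "ARG", "SER", "SER"])
```

In the real format each dictionary entry also carries per-group data such as `atomNameList` and `elementList`, so that information is stored once per group type rather than once per residue.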
  9. Compression pipeline: encodings
     Columns: Group, Id, Symb., AtmId, ResId, ChainIds, x, y, z coordinates (Å), Occ., B-factor
     ATOM 1 N N  ARG A 18 14.699 61.369 62.050 1.00 39.19
     ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35
     ATOM 3 C C  ARG A 18 13.762 61.516 59.729 1.00 36.05
     Ids (delta + run-length): 1,2,3 -> 1,1,1 -> 1,3
     Coordinates (integer + delta): 14.699 -> 14699, 14.500 -> 14500, 169
     integer encoding: floating point numbers are mapped to integers
     run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
     delta encoding: differences (deltas) between the numbers are stored
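The three encodings named on slide 9 can be sketched in Python as follows. This is an illustrative sketch rather than the MMTF implementation: the function names are hypothetical, and the scaling factor of 1000 is an assumption chosen to match the three-decimal coordinates shown on the slide.

```python
def integer_encode(values, factor=1000):
    """Map floats to integers by scaling; factor=1000 keeps 3 decimal places."""
    return [round(v * factor) for v in values]


def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def run_length_encode(values):
    """Return a flat list of (value, count) pairs for stretches of equal values."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


# Atom serial numbers from the slide: delta encoding, then run-length encoding.
serials = [1, 2, 3]
serial_deltas = delta_encode(serials)               # [1, 1, 1]
serial_encoded = run_length_encode(serial_deltas)   # [1, 3]

# x coordinates from the slide: integer encoding, then delta encoding.
x_coords = [14.699, 14.500, 13.762]
x_ints = integer_encode(x_coords)                   # [14699, 14500, 13762]
x_deltas = delta_encode(x_ints)
```

Delta encoding pays off because neighbouring atoms have similar coordinates, so the stored differences are small numbers that compress better than the absolute values.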
  10. Compression pipeline: Recursive Indexing
      Columns: Group, Id, Symb., AtmId, ResId, ChainIds, x, y, z coordinates (Å), Occ., B-factor
      ATOM 1 N N  ARG A 18 14.699 61.369 62.050 1.00 39.19
      ATOM 2 C CA ARG A 18 14.500 62.241 60.856 1.00 38.35
      ATOM 3 C C  ARG A 18 13.762 61.516 59.729 1.00 36.05
      Recursive indexing into an array of 8-bit integer values, so the open interval is (-128, 127):
      [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]
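The recursive indexing example on slide 10 can be reproduced with the short Python sketch below. The function names and the default 8-bit limits are assumptions for illustration. A value that does not lie strictly inside the interval is written as one or more copies of the limit (127 or -128) followed by the remainder, which is why 127 becomes `127, 0` and 268 becomes `127, 127, 14`.

```python
def recursive_index_encode(values, max_val=127, min_val=-128):
    """Split each value into limit-sized steps plus a remainder inside the interval."""
    out = []
    for v in values:
        if v >= 0:
            while v >= max_val:
                out.append(max_val)
                v -= max_val
        else:
            while v <= min_val:
                out.append(min_val)
                v -= min_val
        out.append(v)  # remainder, strictly between min_val and max_val
    return out


def recursive_index_decode(encoded, max_val=127, min_val=-128):
    """Sum consecutive limit values with the remainder that terminates them."""
    out = []
    acc = 0
    for v in encoded:
        acc += v
        if v != max_val and v != min_val:
            out.append(acc)
            acc = 0
    return out


original = [-50, -128, 7, 127, 268]
encoded = recursive_index_encode(original)
# encoded is [-50, -128, 0, 7, 127, 0, 127, 127, 14]
decoded = recursive_index_decode(encoded)
```

The scheme lets mostly small values be stored in narrow integers while still allowing the occasional large value, at the cost of a few extra array entries.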
  11. Overview of data. Full format: • all atoms (useful for structural bioinformatics analysis) • coordinates with 3 decimal place precision (no loss after decoding). Reduced format: • C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics) • coordinates with 1 decimal place precision (a further reduction in size of almost 40%) • exactly the same data structure as the full format (parsers work for both)
  12. MMTF size and parsing speed (figure; parsing measured using Java libraries)
  13. Using MMTF: to efficiently store, transmit, and visualize the 3D structures of biological macromolecules; to perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  14. Presented by Anthony Bradley, Postdoctoral Researcher, Structural Bioinformatics Group, San Diego Supercomputer Center
  15. Using MMTF: to efficiently store, transmit, and visualize the 3D structures of biological macromolecules; to perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  16. Goals • Analysis should be easy and simple • Whole archive analysis of the PDB should be trivial AND fast • Big Data tools (e.g. Spark and Hadoop) are available
  17. mmtf-python, mmtf-java: nobody should (have to) write their own parser. Ever.
  18. MMTF-Spark - Simple API
  19. MMTF-Spark - Simple API (continued)
  20. Data mining - speed advantage
  21. Contact finding
  22. Contact finding
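The slide text does not record how contacts were computed in the talk, so the following Python sketch only illustrates the general idea of a geometric contact query: report every pair of atoms closer than a distance cutoff. The function name `find_contacts`, the brute-force all-pairs search, and the 2.0 Å cutoff are all assumptions for illustration. The coordinates are the three atoms from the ATOM records on slide 8.

```python
import math
from itertools import combinations


def find_contacts(coords, cutoff):
    """Return index pairs (i, j), i < j, of atoms within `cutoff` of each other.

    Brute-force O(n^2) search, adequate only for small inputs.
    """
    contacts = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            contacts.append((i, j))
    return contacts


# N, CA, and C atoms of ARG 18 from the slide's ATOM records.
atoms = [
    (14.699, 61.369, 62.050),  # N
    (14.500, 62.241, 60.856),  # CA
    (13.762, 61.516, 59.729),  # C
]
contacts = find_contacts(atoms, cutoff=2.0)  # assumed cutoff in Å
```

With this cutoff only the bonded neighbours (N and CA, CA and C) are reported. For whole-archive analysis a spatial index such as a grid or k-d tree would replace the all-pairs loop.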
  23. Pros and cons Pros: ● Looping through the whole library performing simple analyses ● Simple to parallelize code ● Much more complete data Cons: ● Tied to Java ● Not a magic unicorn
  25. Thanks! • http://mmtf.rcsb.org/ • https://github.com/rcsb/mmtf-javascript • https://github.com/rcsb/mmtf-java • https://github.com/rcsb/mmtf-python • http://spark.apache.org/
  26. Acknowledgements NCI/NIH (U01 CA198942)