Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
PDB
RCSB
Compressive Structural Bioinformatics:
Large-scale analysis and visualization of the
Protein Data Bank archive
Pe...
PDB
RCSB
PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
120,000
structures
in June 2016
PDB
RCSB
Growing Structure Size and Complexity
Largest asymmetric structure in PDB Largest symmetric structure in PDB
HIV-...
PDB
RCSB
Growing User Base
PDB
RCSB
 Scalability Issues
• Interactive visualization
• slow network transfer
• slow parsing
• slow rendering
• Mobile...
PDB
RCSB
Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macr...
PDB
RCSB
Macromolecular 3D Structure
Biological macromolecules are polymers constructed
by linking monomers by covalent bo...
PDB
RCSB
PDBx/mmCIF
Flexible, extensible, and verbose format
with rich metadata, well suited for archival
purposes (mmcif....
PDB
RCSB
MMTF
• MacroMolecular Transmission Format (mmtf.rcsb.org)
• Compact
• fast network transfer, less I/O
• Fast to p...
PDB
RCSB
MMTF Compression Pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive
...
PDB
RCSB
Size and Parsing Speed
mmCIF vs. MMTF for 120,000 Structures
Fast
Mac mini with 2.6 GHz Intel Core i5
(4 cores) a...
PDB
RCSB
Efficient hashing algorithm
Inefficient looping algorithm
MMTFmmCIF
50
6
448
404
Find all C-alpha-C-alpha contact...
PDB
RCSB
Download + Parsing time
MMTF vs. mmCIF
Bethesda, MD
85 MMTF
2418 mmCIF
Switzerland
1589 MMTF
4431 mmCIF
Russia
55...
PDB
RCSB
Community Engagement
• Open source specification
• Open source decoding libraries
• Java
• JavaScript
• Python
• ...
PDB
RCSB
Summary
• MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
• Compressed, binary, efficient representation...
PDB
RCSB
Acknowledgements
Funding: NCI/NIH (U01 CA198942)
MMTF Early Adopters
Upcoming SlideShare
Loading in …5
×

Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

444 views

Published on

This presentation gives an overview of the application of data compression for the large-scale analysis and visualization of large 3D macromolecular structures. It describes the MacroMolecular Transmission Format (http://mmtf.rcsb.org) for the compact and efficient representation of 3D structures.

This presentation was given at the NIH BD2K All Hands Meeting, Nov. 29, 2016, in Bethesda, MD (https://datascience.nih.gov/bd2k/AHM)

This work has been supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

  1. 1. PDB RCSB Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive Peter W. Rose, Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  2. 2. PDB RCSB PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units 120,000 structures in June 2016
  3. 3. PDB RCSB Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  4. 4. PDB RCSB Growing User Base
  5. 5. PDB RCSB  Scalability Issues • Interactive visualization • slow network transfer • slow parsing • slow rendering • Mobile visualization • limited bandwidth • limited memory • Large-scale structural analysis • slow repeated I/O • slow repeated parsing
  6. 6. PDB RCSB Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  7. 7. PDB RCSB Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  8. 8. PDB RCSB PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  9. 9. PDB RCSB MMTF • MacroMolecular Transmission Format (mmtf.rcsb.org) • Compact • fast network transfer, less I/O • Fast to parse • binary, no string parsing • Contains information for structural analysis and visualization • covalent bonds and bond orders • consistently calculated secondary structure
  10. 10. PDB RCSB MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  11. 11. PDB RCSB Size and Parsing Speed mmCIF vs. MMTF for 120,000 Structures Fast Mac mini with 2.6 GHz Intel Core i5 (4 cores) and 16GB RAM using 30 GB 7 GB < 2 min 400 min MMT F mmCIF MMT F mmCIF Whole PDB archive GZIP compressed (MMTF reduced/lossy: ~800 MB) Small
  12. 12. PDB RCSB Efficient hashing algorithm Inefficient looping algorithm MMTFmmCIF 50 6 448 404 Find all C-alpha-C-alpha contacts Data Mining using Apache Spark mmCIF vs. MMTF
  13. 13. PDB RCSB Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589 MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative
  14. 14. PDB RCSB Community Engagement • Open source specification • Open source decoding libraries • Java • JavaScript • Python • C/C++ (developed by community members) • Applications using MMTF • 3Dmol.js, JSmol, iCn3D(NCBI), ICM Viewer, PyMol • BioJava, Biopython, MDAnalysis • RCSB PDB website
  15. 15. PDB RCSB Summary • MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) • Compressed, binary, efficient representation of 3D structures • Lossless representation (~4x compression) • Lossy, reduced representation (~37x compression) • Compressive Structural Bioinformatics • Algorithms, application, and workflows using MMTF • 10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  16. 16. PDB RCSB Acknowledgements Funding: NCI/NIH (U01 CA198942) MMTF Early Adopters

×