Successfully reported this slideshow.

Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

1

Share

Upcoming SlideShare
CADD meeting 08-30-2016
CADD meeting 08-30-2016
Loading in …3
×
1 of 16
1 of 16

Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

1

Share

Download to read offline

This presentation gives an overview of the application of data compression for the large-scale analysis and visualization of large 3D macromolecular structures. It describes the MacroMolecular Transmission Format (http://mmtf.rcsb.org) for the compact and efficient representation of 3D structures.

This presentation was given at the NIH BD2K All Hands Meeting, Nov. 29, 2016, in Bethesda, MD (https://datascience.nih.gov/bd2k/AHM)

This work has been supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942.

This presentation gives an overview of the application of data compression for the large-scale analysis and visualization of large 3D macromolecular structures. It describes the MacroMolecular Transmission Format (http://mmtf.rcsb.org) for the compact and efficient representation of 3D structures.

This presentation was given at the NIH BD2K All Hands Meeting, Nov. 29, 2016, in Bethesda, MD (https://datascience.nih.gov/bd2k/AHM)

This work has been supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942.

More Related Content

Related Books

Free with a 14 day trial from Scribd

See all

Related Audiobooks

Free with a 14 day trial from Scribd

See all

Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

  1. 1. PDB RCSB Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive Peter W. Rose, Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  2. 2. PDB RCSB PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units 120,000 structures in June 2016
  3. 3. PDB RCSB Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  4. 4. PDB RCSB Growing User Base
  5. 5. PDB RCSB  Scalability Issues • Interactive visualization • slow network transfer • slow parsing • slow rendering • Mobile visualization • limited bandwidth • limited memory • Large-scale structural analysis • slow repeated I/O • slow repeated parsing
  6. 6. PDB RCSB Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  7. 7. PDB RCSB Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  8. 8. PDB RCSB PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  9. 9. PDB RCSB MMTF • MacroMolecular Transmission Format (mmtf.rcsb.org) • Compact • fast network transfer, less I/O • Fast to parse • binary, no string parsing • Contains information for structural analysis and visualization • covalent bonds and bond orders • consistently calculated secondary structure
  10. 10. PDB RCSB MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  11. 11. PDB RCSB Size and Parsing Speed mmCIF vs. MMTF for 120,000 Structures Fast Mac mini with 2.6 GHz Intel Core i5 (4 cores) and 16GB RAM using 30 GB 7 GB < 2 min 400 min MMT F mmCIF MMT F mmCIF Whole PDB archive GZIP compressed (MMTF reduced/lossy: ~800 MB) Small
  12. 12. PDB RCSB Efficient hashing algorithm Inefficient looping algorithm MMTFmmCIF 50 6 448 404 Find all C-alpha-C-alpha contacts Data Mining using Apache Spark mmCIF vs. MMTF
  13. 13. PDB RCSB Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589 MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative
  14. 14. PDB RCSB Community Engagement • Open source specification • Open source decoding libraries • Java • JavaScript • Python • C/C++ (developed by community members) • Applications using MMTF • 3Dmol.js, JSmol, iCn3D(NCBI), ICM Viewer, PyMol • BioJava, Biopython, MDAnalysis • RCSB PDB website
  15. 15. PDB RCSB Summary • MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) • Compressed, binary, efficient representation of 3D structures • Lossless representation (~4x compression) • Lossy, reduced representation (~37x compression) • Compressive Structural Bioinformatics • Algorithms, application, and workflows using MMTF • 10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  16. 16. PDB RCSB Acknowledgements Funding: NCI/NIH (U01 CA198942) MMTF Early Adopters

×