Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

Peter Rose
Peter RoseDirector, Structural Bioinformatics Laboratory
PDB
RCSB
Compressive Structural Bioinformatics:
Large-scale analysis and visualization of the
Protein Data Bank archive
Peter W. Rose, Anthony R. Bradley,
Alexander S. Rose, Yana Valasatava,
Jose M. Duarte, Andreas Prlić
Structural Bioinformatics Laboratory
San Diego Supercomputer Center
UC San Diego
PDB
RCSB
PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
120,000
structures
in June 2016
PDB
RCSB
Growing Structure Size and Complexity
Largest asymmetric structure in PDB Largest symmetric structure in PDB
HIV-1 capsid: PDB ID 3J3Q
~2.4M unique atoms
Faustovirus major capsid: PDB ID 5J7V
~40M overall atoms
PDB
RCSB
Growing User Base
PDB
RCSB
 Scalability Issues
• Interactive visualization
• slow network transfer
• slow parsing
• slow rendering
• Mobile visualization
• limited bandwidth
• limited memory
• Large-scale structural analysis
• slow repeated I/O
• slow repeated parsing
PDB
RCSB
Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macromolecules
Perform large-scale structural calculations such as geometric queries or structural
comparisons over the entire PDB archive held in memory
PDB
RCSB
Macromolecular 3D Structure
Biological macromolecules are polymers constructed
by linking monomers by covalent bonds
Biological macromolecules: proteins, nucleic acids
PDB
RCSB
PDBx/mmCIF
Flexible, extensible, and verbose format
with rich metadata, well suited for archival
purposes (mmcif.wwpdb.org)
repetitive information
redundant annotations
inefficient representation
PDB
RCSB
MMTF
• MacroMolecular Transmission Format (mmtf.rcsb.org)
• Compact
• fast network transfer, less I/O
• Fast to parse
• binary, no string parsing
• Contains information for structural analysis and visualization
• covalent bonds and bond orders
• consistently calculated secondary structure
PDB
RCSB
MMTF Compression Pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive
indexing
extract structural
data
calculate bonds,
SSE
Binary, extensible container format of MMTF
It's like JSON.
but fast and small.
PDB
RCSB
Size and Parsing Speed
mmCIF vs. MMTF for 120,000 Structures
Fast
Mac mini with 2.6 GHz Intel Core i5
(4 cores) and 16GB RAM using
30 GB
7 GB
< 2 min
400 min
MMT
F
mmCIF MMT
F
mmCIF
Whole PDB archive GZIP compressed
(MMTF reduced/lossy: ~800 MB)
Small
PDB
RCSB
Efficient hashing algorithm
Inefficient looping algorithm
MMTFmmCIF
50
6
448
404
Find all C-alpha-C-alpha contacts
Data Mining using Apache Spark
mmCIF vs. MMTF
PDB
RCSB
Download + Parsing time
MMTF vs. mmCIF
Bethesda, MD
85 MMTF
2418 mmCIF
Switzerland
1589 MMTF
4431 mmCIF
Russia
557 MMTF
failed mmCIF
Japan
79 MMTF
2838 mmCIF
San Diego, CA
36 MMTF
840 mmCIF
Time (seconds) to download* 100 large PDB structures from UCSD
and parse with JavaScript decoder in Chrome browser
*Note: download times are highly variable and not representative
PDB
RCSB
Community Engagement
• Open source specification
• Open source decoding libraries
• Java
• JavaScript
• Python
• C/C++ (developed by community members)
• Applications using MMTF
• 3Dmol.js, JSmol, iCn3D(NCBI), ICM Viewer, PyMol
• BioJava, Biopython, MDAnalysis
• RCSB PDB website
PDB
RCSB
Summary
• MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
• Compressed, binary, efficient representation of 3D structures
• Lossless representation (~4x compression)
• Lossy, reduced representation (~37x compression)
• Compressive Structural Bioinformatics
• Algorithms, application, and workflows using MMTF
• 10 to 100+ fold speedup
Structure Visualization Large Scale PDB Mining
Web-based molecular graphics for large complexes (2016)
Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
PDB
RCSB
Acknowledgements
Funding: NCI/NIH (U01 CA198942)
MMTF Early Adopters
1 of 16

Recommended

CADD meeting 08-30-2016 by
CADD meeting 08-30-2016CADD meeting 08-30-2016
CADD meeting 08-30-2016Yana Valasatava
950 views26 slides
Profiling Web Archives by
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesSawood Alam
3.1K views33 slides
TPDL 2015 - Profiling Web Archives by
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesSawood Alam
2.6K views34 slides
Bioinformatics, Data Integration, and Data Representation Working Group Summa... by
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...GenomeInABottle
803 views8 slides
Web-based visualization of 3D distance restraints integrated with NGL molecul... by
Web-based visualization of 3D distance restraints integrated with NGL molecul...Web-based visualization of 3D distance restraints integrated with NGL molecul...
Web-based visualization of 3D distance restraints integrated with NGL molecul...Alexander Rose
483 views1 slide
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo... by
 MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo... MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...
MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromo...Peter Rose
281 views26 slides

More Related Content

Similar to Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss... by
Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...
Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...Anthony Bradley
472 views1 slide
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar by
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarPurpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarData Con LA
738 views39 slides
Biological databases by
Biological databasesBiological databases
Biological databasesQamar iqbal
856 views52 slides
Extended memory access in PHP by
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHPAndrew Goodwin
128 views31 slides
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers by
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance BarriersCeph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance BarriersCeph Community
453 views30 slides
POLARDB: A database architecture for the cloud by
POLARDB: A database architecture for the cloudPOLARDB: A database architecture for the cloud
POLARDB: A database architecture for the cloudoysteing
670 views27 slides

Similar to Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive(20)

Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss... by Anthony Bradley
Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...
Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...
Anthony Bradley472 views
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar by Data Con LA
Purpose-built NoSQL Database for IoT by Basavaraj SoppannavarPurpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Purpose-built NoSQL Database for IoT by Basavaraj Soppannavar
Data Con LA738 views
Biological databases by Qamar iqbal
Biological databasesBiological databases
Biological databases
Qamar iqbal856 views
Extended memory access in PHP by Andrew Goodwin
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHP
Andrew Goodwin128 views
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers by Ceph Community
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance BarriersCeph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Day Berlin: Ceph on All Flash Storage - Breaking Performance Barriers
Ceph Community 453 views
POLARDB: A database architecture for the cloud by oysteing
POLARDB: A database architecture for the cloudPOLARDB: A database architecture for the cloud
POLARDB: A database architecture for the cloud
oysteing670 views
MySQL NDB Cluster 8.0 SQL faster than NoSQL by Bernd Ocklin
MySQL NDB Cluster 8.0 SQL faster than NoSQL MySQL NDB Cluster 8.0 SQL faster than NoSQL
MySQL NDB Cluster 8.0 SQL faster than NoSQL
Bernd Ocklin545 views
Databases_CSS2.pptx by Silpa87
Databases_CSS2.pptxDatabases_CSS2.pptx
Databases_CSS2.pptx
Silpa8747 views
Hadoop at aadhaar by Regunath B
Hadoop at aadhaarHadoop at aadhaar
Hadoop at aadhaar
Regunath B14.5K views
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva by Gezim Sejdiu
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Gezim Sejdiu492 views
ceph optimization on ssd ilsoo byun-short by NAVER D2
ceph optimization on ssd ilsoo byun-shortceph optimization on ssd ilsoo byun-short
ceph optimization on ssd ilsoo byun-short
NAVER D22.4K views
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark by DataWorks Summit
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit615 views
The Progress on Sagace and Data Integration by Maori Ito
The Progress on Sagace and Data IntegrationThe Progress on Sagace and Data Integration
The Progress on Sagace and Data Integration
Maori Ito529 views

Recently uploaded

Batrachospermum.pptx by
Batrachospermum.pptxBatrachospermum.pptx
Batrachospermum.pptxnisarahmad632316
20 views37 slides
Open Access Publishing in Astrophysics by
Open Access Publishing in AstrophysicsOpen Access Publishing in Astrophysics
Open Access Publishing in AstrophysicsPeter Coles
380 views26 slides
scopus cited journals.pdf by
scopus cited journals.pdfscopus cited journals.pdf
scopus cited journals.pdfKSAravindSrivastava
5 views15 slides
Workshop LLM Life Sciences ChemAI 231116.pptx by
Workshop LLM Life Sciences ChemAI 231116.pptxWorkshop LLM Life Sciences ChemAI 231116.pptx
Workshop LLM Life Sciences ChemAI 231116.pptxMarco Tibaldi
96 views37 slides
Max Welling ChemAI 231116.pptx by
Max Welling ChemAI 231116.pptxMax Welling ChemAI 231116.pptx
Max Welling ChemAI 231116.pptxMarco Tibaldi
132 views35 slides
Class 2 (12 july).pdf by
  Class 2 (12 july).pdf  Class 2 (12 july).pdf
Class 2 (12 july).pdfclimber9977
8 views1 slide

Recently uploaded(20)

Open Access Publishing in Astrophysics by Peter Coles
Open Access Publishing in AstrophysicsOpen Access Publishing in Astrophysics
Open Access Publishing in Astrophysics
Peter Coles380 views
Workshop LLM Life Sciences ChemAI 231116.pptx by Marco Tibaldi
Workshop LLM Life Sciences ChemAI 231116.pptxWorkshop LLM Life Sciences ChemAI 231116.pptx
Workshop LLM Life Sciences ChemAI 231116.pptx
Marco Tibaldi96 views
Max Welling ChemAI 231116.pptx by Marco Tibaldi
Max Welling ChemAI 231116.pptxMax Welling ChemAI 231116.pptx
Max Welling ChemAI 231116.pptx
Marco Tibaldi132 views
Class 2 (12 july).pdf by climber9977
  Class 2 (12 july).pdf  Class 2 (12 july).pdf
Class 2 (12 july).pdf
climber99778 views
PRINCIPLES-OF ASSESSMENT by rbalmagro
PRINCIPLES-OF ASSESSMENTPRINCIPLES-OF ASSESSMENT
PRINCIPLES-OF ASSESSMENT
rbalmagro9 views
himalay baruah acid fast staining.pptx by HimalayBaruah
himalay baruah acid fast staining.pptxhimalay baruah acid fast staining.pptx
himalay baruah acid fast staining.pptx
HimalayBaruah5 views
ENTOMOLOGY PPT ON BOMBYCIDAE AND SATURNIIDAE.pptx by MN
ENTOMOLOGY PPT ON BOMBYCIDAE AND SATURNIIDAE.pptxENTOMOLOGY PPT ON BOMBYCIDAE AND SATURNIIDAE.pptx
ENTOMOLOGY PPT ON BOMBYCIDAE AND SATURNIIDAE.pptx
MN6 views
Ethical issues associated with Genetically Modified Crops and Genetically Mod... by PunithKumars6
Ethical issues associated with Genetically Modified Crops and Genetically Mod...Ethical issues associated with Genetically Modified Crops and Genetically Mod...
Ethical issues associated with Genetically Modified Crops and Genetically Mod...
PunithKumars618 views
"How can I develop my learning path in bioinformatics? by Bioinformy
"How can I develop my learning path in bioinformatics?"How can I develop my learning path in bioinformatics?
"How can I develop my learning path in bioinformatics?
Bioinformy17 views
Matthias Beller ChemAI 231116.pptx by Marco Tibaldi
Matthias Beller ChemAI 231116.pptxMatthias Beller ChemAI 231116.pptx
Matthias Beller ChemAI 231116.pptx
Marco Tibaldi82 views
Company Fashion Show ChemAI 231116.pptx by Marco Tibaldi
Company Fashion Show ChemAI 231116.pptxCompany Fashion Show ChemAI 231116.pptx
Company Fashion Show ChemAI 231116.pptx
Marco Tibaldi69 views
Pollination By Nagapradheesh.M.pptx by MNAGAPRADHEESH
Pollination By Nagapradheesh.M.pptxPollination By Nagapradheesh.M.pptx
Pollination By Nagapradheesh.M.pptx
MNAGAPRADHEESH12 views
Physical Characterization of Moon Impactor WE0913A by Sérgio Sacani
Physical Characterization of Moon Impactor WE0913APhysical Characterization of Moon Impactor WE0913A
Physical Characterization of Moon Impactor WE0913A
Sérgio Sacani42 views
Artificial Intelligence Helps in Drug Designing and Discovery.pptx by abhinashsahoo2001
Artificial Intelligence Helps in Drug Designing and Discovery.pptxArtificial Intelligence Helps in Drug Designing and Discovery.pptx
Artificial Intelligence Helps in Drug Designing and Discovery.pptx
abhinashsahoo2001112 views

Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive

  • 1. PDB RCSB Compressive Structural Bioinformatics: Large-scale analysis and visualization of the Protein Data Bank archive Peter W. Rose, Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  • 2. PDB RCSB PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units 120,000 structures in June 2016
  • 3. PDB RCSB Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  • 5. PDB RCSB  Scalability Issues • Interactive visualization • slow network transfer • slow parsing • slow rendering • Mobile visualization • limited bandwidth • limited memory • Large-scale structural analysis • slow repeated I/O • slow repeated parsing
  • 6. PDB RCSB Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  • 7. PDB RCSB Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  • 8. PDB RCSB PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  • 9. PDB RCSB MMTF • MacroMolecular Transmission Format (mmtf.rcsb.org) • Compact • fast network transfer, less I/O • Fast to parse • binary, no string parsing • Contains information for structural analysis and visualization • covalent bonds and bond orders • consistently calculated secondary structure
  • 10. PDB RCSB MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  • 11. PDB RCSB Size and Parsing Speed mmCIF vs. MMTF for 120,000 Structures Fast Mac mini with 2.6 GHz Intel Core i5 (4 cores) and 16GB RAM using 30 GB 7 GB < 2 min 400 min MMT F mmCIF MMT F mmCIF Whole PDB archive GZIP compressed (MMTF reduced/lossy: ~800 MB) Small
  • 12. PDB RCSB Efficient hashing algorithm Inefficient looping algorithm MMTFmmCIF 50 6 448 404 Find all C-alpha-C-alpha contacts Data Mining using Apache Spark mmCIF vs. MMTF
  • 13. PDB RCSB Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589 MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative
  • 14. PDB RCSB Community Engagement • Open source specification • Open source decoding libraries • Java • JavaScript • Python • C/C++ (developed by community members) • Applications using MMTF • 3Dmol.js, JSmol, iCn3D(NCBI), ICM Viewer, PyMol • BioJava, Biopython, MDAnalysis • RCSB PDB website
  • 15. PDB RCSB Summary • MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) • Compressed, binary, efficient representation of 3D structures • Lossless representation (~4x compression) • Lossy, reduced representation (~37x compression) • Compressive Structural Bioinformatics • Algorithms, application, and workflows using MMTF • 10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  • 16. PDB RCSB Acknowledgements Funding: NCI/NIH (U01 CA198942) MMTF Early Adopters