SlideShare a Scribd company logo
MMTF-Spark: Interactive, Scalable, and
Reproducible Datamining of
3D Macromolecular Structures
NIH/NCI U01CA198942
Peter W. Rose
Director, Structural Bioinformatics Laboratory
San Diego Supercomputer Center
UC San Diego
PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
~140,000
Structures
in May 2018
Growing Structure Size and Complexity
Largest asymmetric structure in PDB Largest symmetric structure in PDB
HIV-1 capsid: PDB ID 3J3Q
~2.4M unique atoms
Faustovirus major capsid: PDB ID 5J7V
~40M overall atoms
à Scalability Issues
•  Interactive visualization
•  slow network transfer
•  slow parsing
•  slow rendering
•  Mobile visualization
•  limited bandwidth
•  limited memory
•  Large-scale structural analysis
•  slow repeated I/O
•  slow repeated parsing
Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macromolecules
Perform large-scale structural calculations such as geometric queries or structural
comparisons over the entire PDB archive held in memory
Macromolecular 3D Structure
Biological macromolecules are polymers constructed
by linking monomers by covalent bonds
Biological macromolecules: proteins, nucleic acids
MMTF
•  MacroMolecular Transmission Format (mmtf.rcsb.org)
•  Compact
•  fast network transfer
•  fast I/O
•  Fast to parse
•  binary, no string parsing
•  Contains information for structural analysis and visualization
•  covalent bonds and bond orders
•  consistently calculated secondary structure
Lossless vs. Lossy Compression
File Size Comparison
(PDB archive ~127,000 entries)
All file sizes are for the gzip compressed files
File Parsing Speed
(PDB archive ~127,000 entries)
Comparison made with BioJava and mmtf-java API (single core)
Download + Parsing time
MMTF vs. mmCIF
Bethesda,	MD	
		85		MMTF	
2418		mmCIF	
Switzerland	
1589 		MMTF	
4431		mmCIF	
Russia	
			557		MMTF	
failed		mmCIF	
Japan	
			79		MMTF	
	2838		mmCIF	
San	Diego,	CA	
	36		MMTF	
840		mmCIF	
Time (seconds) to download* 100 large PDB structures from UCSD
and parse with JavaScript decoder in Chrome browser
*Note: download times are highly variable and not representative
Demo
Who Uses MMTF?
MMTF-Spark
mmtf-spark (Java)
•  High performance processing
•  Suitable for large-scale calculations
•  Integration with other libraries, e.g.,
BioJava
mmtf-pyspark (Python)
•  Interactive scripting
•  2D and 3D visualization
•  ML/DL tool ecosystem
•  Sharable data analysis in Jupyter
Notebooks
Example of a Spark Workflow
MMTF-PySpark in Jupyter Notebook
Interactive and reproducible dataming of the Protein Data Bank
Plans for a Scalable Infrastructure
Expand structural
coverage
•  PDB
•  PDB-redo
•  Swiss-Model
•  Rosetta-Models
•  …
Host in cloud-based infrastructure
Integrate with
other biological
resources
•  Drug info
•  Proteomic
•  Genomic
•  …
Summary
•  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
•  Compressed, binary, efficient representation of 3D structures
•  Lossless representation (~4x compression over gzip)
•  Lossy, reduced representation (~37x compression over gzip)
•  Compressive Structural Bioinformatics
•  Algorithms, application, and workflows using MMTF
•  10 to 100+ fold speedup
Structure Visualization Large Scale PDB Mining
Web-based molecular graphics for large complexes (2016)
Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
Resources•  MMTF Website
•  http://mmtf.rcsb.org
•  MMTF Specification
•  https://github.com/rcsb/mmtf
•  MMTF Libraries
•  https://github.com/rcsb/mmtf-java
•  https://github.com/rcsb/mmtf-javascript
•  https://github.com/rcsb/mmtf-python
•  https://github.com/rcsb/mmtf-c
•  https://github.com/rcsb/mmtf-cpp
•  MMTF-Spark/PySpark
•  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017
•  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018
•  Publications
•  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and
analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
https://doi.org/10.1371/journal.pcbi.1005575
•  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular
structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846
•  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the
21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA,
185-186. https://doi.org/10.1145/2945292.2945324
Funding
This workshop was supported by the National Cancer Institute
of the National Institutes of Health under Award Number
U01CA198942. The content is solely the responsibility of the
authors and does not necessarily represent the official views of
the National Institutes of Health.
PDBx/mmCIF
Flexible, extensible, and verbose format
with rich metadata, well suited for archival
purposes (mmcif.wwpdb.org)
repetitive information
redundant annotations
inefficient representation
MMTF Compression Pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive
indexing
extract structural
data
calculate bonds,
SSE
Binary, extensible container format of MMTF
It's like JSON.
but fast and small.
Columnar Encoding of Arrays
Custom Encoding Examples
Run-length encoding (e.g., occupancy)
Delta and run-length encoding (e.g., serial numbers)
Integer, delta, and recursive index encoding
(e.g., xyz-coordinates as 16-bit integers)
Delta encoding reduced the dynamic range of numbers which makes them more compressible
Data In MMTF File
MMTF contains a subset of data used by
visualization and structural analysis programs
Dictionary Encoding
Unique groups (residues) are stored in a “dictionary”
Unique entities are stored in a “dictionary”
(e.g., polymer type, polymer sequences)
Community Engagement
•  Open source specification
•  Open source libraries

More Related Content

What's hot

Biological databases
Biological databasesBiological databases
Biological databasesAfra Fathima
 
Bioinformatics biological databases
Bioinformatics biological databasesBioinformatics biological databases
Bioinformatics biological databasesSangeeta Das
 
Biological databases
Biological databasesBiological databases
Biological databasesAshfaq Ahmad
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in BioinformaticsArindam Ghosh
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahuKAUSHAL SAHU
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)ZoufishanY
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303Bruno Mmassy
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)ShivaniShewale2
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
Data retreival system
Data retreival systemData retreival system
Data retreival systemShikha Thakur
 

What's hot (20)

Composite and Specialized databases
Composite and Specialized databasesComposite and Specialized databases
Composite and Specialized databases
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Bioinformatics biological databases
Bioinformatics biological databasesBioinformatics biological databases
Bioinformatics biological databases
 
Databases
DatabasesDatabases
Databases
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Data retrieval tools
Data retrieval toolsData retrieval tools
Data retrieval tools
 
Publicly available tools and open resources in Bioinformatics
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in Bioinformatics
 
Datasets with bioschemas
Datasets with bioschemasDatasets with bioschemas
Datasets with bioschemas
 
Gene bank by kk sahu
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
 
Introduction to Biological databases
Introduction to Biological databasesIntroduction to Biological databases
Introduction to Biological databases
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
Computational biology bls 303
Computational biology bls 303Computational biology bls 303
Computational biology bls 303
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Protein information resource (PIR)
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)
 
Data Retrieval Systems
Data Retrieval SystemsData Retrieval Systems
Data Retrieval Systems
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
Data retreival system
Data retreival systemData retreival system
Data retreival system
 
Entrez databases
Entrez databasesEntrez databases
Entrez databases
 

Similar to MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Peter Rose
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...CSCJournals
 
The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects Carole Goble
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018Alexander Pico
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...Jason Riedy
 
Machine Learning for Molecules
Machine Learning for MoleculesMachine Learning for Molecules
Machine Learning for MoleculesIchigaku Takigawa
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBioinformaticsCentre
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
Streaming HYpothesis REasoning
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoningWilliam Smith
 
Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Oladokun Sulaiman
 
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processinginside-BigData.com
 
Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020Bilingual Publishing Group
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaGezim Sejdiu
 
VIZBI 2013 - Overview
VIZBI 2013 - OverviewVIZBI 2013 - Overview
VIZBI 2013 - OverviewDerek Wright
 

Similar to MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures (20)

Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...
 
A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
CADD meeting 08-30-2016
CADD meeting 08-30-2016CADD meeting 08-30-2016
CADD meeting 08-30-2016
 
D1803012022
D1803012022D1803012022
D1803012022
 
NRNB EAC Report 2011
NRNB EAC Report 2011NRNB EAC Report 2011
NRNB EAC Report 2011
 
The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
 
Machine Learning for Molecules
Machine Learning for MoleculesMachine Learning for Molecules
Machine Learning for Molecules
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdf
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Streaming HYpothesis REasoning
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoning
 
Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010
 
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
 
Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020Journal of Computer Science Research | Vol.2, Iss.2 April 2020
Journal of Computer Science Research | Vol.2, Iss.2 April 2020
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
 
VIZBI 2013 - Overview
VIZBI 2013 - OverviewVIZBI 2013 - Overview
VIZBI 2013 - Overview
 

Recently uploaded

platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxmuralinath2
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...Health Advances
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxAlaminAfendy1
 
The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...Sérgio Sacani
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsmuralinath2
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinossaicprecious19
 
THYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingTHYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingJocelyn Atis
 
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPirithiRaju
 
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Sérgio Sacani
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureSérgio Sacani
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.Sérgio Sacani
 
biotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxbiotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxANONYMOUS
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxmuralinath2
 
Topography and sediments of the floor of the Bay of Bengal
Topography and sediments of the floor of the Bay of BengalTopography and sediments of the floor of the Bay of Bengal
Topography and sediments of the floor of the Bay of BengalMd Hasan Tareq
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Sérgio Sacani
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSELF-EXPLANATORY
 
Transport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSETransport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSEjordanparish425
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxmuralinath2
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...muralinath2
 

Recently uploaded (20)

platelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptxplatelets- lifespan -Clot retraction-disorders.pptx
platelets- lifespan -Clot retraction-disorders.pptx
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...The importance of continents, oceans and plate tectonics for the evolution of...
The importance of continents, oceans and plate tectonics for the evolution of...
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
 
Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
THYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursingTHYROID-PARATHYROID medical surgical nursing
THYROID-PARATHYROID medical surgical nursing
 
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
 
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
Gliese 12 b: A Temperate Earth-sized Planet at 12 pc Ideal for Atmospheric Tr...
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a Technosignature
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
biotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptxbiotech-regenration of plants, pharmaceutical applications.pptx
biotech-regenration of plants, pharmaceutical applications.pptx
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
Topography and sediments of the floor of the Bay of Bengal
Topography and sediments of the floor of the Bay of BengalTopography and sediments of the floor of the Bay of Bengal
Topography and sediments of the floor of the Bay of Bengal
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
Transport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSETransport in plants G1.pptx Cambridge IGCSE
Transport in plants G1.pptx Cambridge IGCSE
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 

MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

  • 1. MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures NIH/NCI U01CA198942 Peter W. Rose Director, Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  • 2. PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units ~140,000 Structures in May 2018
  • 3. Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  • 4. à Scalability Issues •  Interactive visualization •  slow network transfer •  slow parsing •  slow rendering •  Mobile visualization •  limited bandwidth •  limited memory •  Large-scale structural analysis •  slow repeated I/O •  slow repeated parsing
  • 5. Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  • 6. Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  • 7. MMTF •  MacroMolecular Transmission Format (mmtf.rcsb.org) •  Compact •  fast network transfer •  fast I/O •  Fast to parse •  binary, no string parsing •  Contains information for structural analysis and visualization •  covalent bonds and bond orders •  consistently calculated secondary structure
  • 8. Lossless vs. Lossy Compression
  • 9. File Size Comparison (PDB archive ~127,000 entries) All file sizes are for the gzip compressed files
  • 10. File Parsing Speed (PDB archive ~127,000 entries) Comparison made with BioJava and mmtf-java API (single core)
  • 11. Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589  MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative Demo
  • 13. MMTF-Spark mmtf-spark (Java) •  High performance processing •  Suitable for large-scale calculations •  Integration with other libraries, e.g., BioJava mmtf-pyspark (Python) •  Interactive scripting •  2D and 3D visualization •  ML/DL tool ecosystem •  Sharable data analysis in Jupyter Notebooks
  • 14. Example of a Spark Workflow
  • 15. MMTF-PySpark in Jupyter Notebook Interactive and reproducible dataming of the Protein Data Bank
  • 16. Plans for a Scalable Infrastructure Expand structural coverage •  PDB •  PDB-redo •  Swiss-Model •  Rosetta-Models •  … Host in cloud-based infrastructure Integrate with other biological resources •  Drug info •  Proteomic •  Genomic •  …
  • 17. Summary •  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) •  Compressed, binary, efficient representation of 3D structures •  Lossless representation (~4x compression over gzip) •  Lossy, reduced representation (~37x compression over gzip) •  Compressive Structural Bioinformatics •  Algorithms, application, and workflows using MMTF •  10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  • 18. Resources•  MMTF Website •  http://mmtf.rcsb.org •  MMTF Specification •  https://github.com/rcsb/mmtf •  MMTF Libraries •  https://github.com/rcsb/mmtf-java •  https://github.com/rcsb/mmtf-javascript •  https://github.com/rcsb/mmtf-python •  https://github.com/rcsb/mmtf-c •  https://github.com/rcsb/mmtf-cpp •  MMTF-Spark/PySpark •  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017 •  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018 •  Publications •  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. https://doi.org/10.1371/journal.pcbi.1005575 •  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846 •  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. https://doi.org/10.1145/2945292.2945324
  • 19. Funding This workshop was supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
  • 20. PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  • 21. MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  • 23. Custom Encoding Examples Run-length encoding (e.g., occupancy) Delta and run-length encoding (e.g., serial numbers) Integer, delta, and recursive index encoding (e.g., xyz-coordinates as 16-bit integers) Delta encoding reduced the dynamic range of numbers which makes them more compressible
  • 24. Data In MMTF File MMTF contains a subset of data used by visualization and structural analysis programs
  • 25. Dictionary Encoding Unique groups (residues) are stored in a “dictionary” Unique entities are stored in a “dictionary” (e.g., polymer type, polymer sequences)
  • 26. Community Engagement •  Open source specification •  Open source libraries