MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

Peter Rose
Peter RoseDirector, Structural Bioinformatics Laboratory
MMTF-Spark: Interactive, Scalable, and
Reproducible Datamining of
3D Macromolecular Structures
NIH/NCI U01CA198942
Peter W. Rose
Director, Structural Bioinformatics Laboratory
San Diego Supercomputer Center
UC San Diego
PDB – A Billion Atom Archive
> 1 billion atoms in the asymmetric units
~140,000
Structures
in May 2018
Growing Structure Size and Complexity
Largest asymmetric structure in PDB Largest symmetric structure in PDB
HIV-1 capsid: PDB ID 3J3Q
~2.4M unique atoms
Faustovirus major capsid: PDB ID 5J7V
~40M overall atoms
à Scalability Issues
•  Interactive visualization
•  slow network transfer
•  slow parsing
•  slow rendering
•  Mobile visualization
•  limited bandwidth
•  limited memory
•  Large-scale structural analysis
•  slow repeated I/O
•  slow repeated parsing
Compressive Structural Bioinformatics
Efficiently store, transmit, and visualize 3D structures of biological macromolecules
Perform large-scale structural calculations such as geometric queries or structural
comparisons over the entire PDB archive held in memory
Macromolecular 3D Structure
Biological macromolecules are polymers constructed
by linking monomers by covalent bonds
Biological macromolecules: proteins, nucleic acids
MMTF
•  MacroMolecular Transmission Format (mmtf.rcsb.org)
•  Compact
•  fast network transfer
•  fast I/O
•  Fast to parse
•  binary, no string parsing
•  Contains information for structural analysis and visualization
•  covalent bonds and bond orders
•  consistently calculated secondary structure
Lossless vs. Lossy Compression
File Size Comparison
(PDB archive ~127,000 entries)
All file sizes are for the gzip compressed files
File Parsing Speed
(PDB archive ~127,000 entries)
Comparison made with BioJava and mmtf-java API (single core)
Download + Parsing time
MMTF vs. mmCIF
Bethesda,	MD	
		85		MMTF	
2418		mmCIF	
Switzerland	
1589 		MMTF	
4431		mmCIF	
Russia	
			557		MMTF	
failed		mmCIF	
Japan	
			79		MMTF	
	2838		mmCIF	
San	Diego,	CA	
	36		MMTF	
840		mmCIF	
Time (seconds) to download* 100 large PDB structures from UCSD
and parse with JavaScript decoder in Chrome browser
*Note: download times are highly variable and not representative
Demo
Who Uses MMTF?
MMTF-Spark
mmtf-spark (Java)
•  High performance processing
•  Suitable for large-scale calculations
•  Integration with other libraries, e.g.,
BioJava
mmtf-pyspark (Python)
•  Interactive scripting
•  2D and 3D visualization
•  ML/DL tool ecosystem
•  Sharable data analysis in Jupyter
Notebooks
Example of a Spark Workflow
MMTF-PySpark in Jupyter Notebook
Interactive and reproducible dataming of the Protein Data Bank
Plans for a Scalable Infrastructure
Expand structural
coverage
•  PDB
•  PDB-redo
•  Swiss-Model
•  Rosetta-Models
•  …
Host in cloud-based infrastructure
Integrate with
other biological
resources
•  Drug info
•  Proteomic
•  Genomic
•  …
Summary
•  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org)
•  Compressed, binary, efficient representation of 3D structures
•  Lossless representation (~4x compression over gzip)
•  Lossy, reduced representation (~37x compression over gzip)
•  Compressive Structural Bioinformatics
•  Algorithms, application, and workflows using MMTF
•  10 to 100+ fold speedup
Structure Visualization Large Scale PDB Mining
Web-based molecular graphics for large complexes (2016)
Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
Resources•  MMTF Website
•  http://mmtf.rcsb.org
•  MMTF Specification
•  https://github.com/rcsb/mmtf
•  MMTF Libraries
•  https://github.com/rcsb/mmtf-java
•  https://github.com/rcsb/mmtf-javascript
•  https://github.com/rcsb/mmtf-python
•  https://github.com/rcsb/mmtf-c
•  https://github.com/rcsb/mmtf-cpp
•  MMTF-Spark/PySpark
•  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017
•  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018
•  Publications
•  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and
analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
https://doi.org/10.1371/journal.pcbi.1005575
•  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular
structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846
•  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the
21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA,
185-186. https://doi.org/10.1145/2945292.2945324
Funding
This workshop was supported by the National Cancer Institute
of the National Institutes of Health under Award Number
U01CA198942. The content is solely the responsibility of the
authors and does not necessarily represent the official views of
the National Institutes of Health.
PDBx/mmCIF
Flexible, extensible, and verbose format
with rich metadata, well suited for archival
purposes (mmcif.wwpdb.org)
repetitive information
redundant annotations
inefficient representation
MMTF Compression Pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive
indexing
extract structural
data
calculate bonds,
SSE
Binary, extensible container format of MMTF
It's like JSON.
but fast and small.
Columnar Encoding of Arrays
Custom Encoding Examples
Run-length encoding (e.g., occupancy)
Delta and run-length encoding (e.g., serial numbers)
Integer, delta, and recursive index encoding
(e.g., xyz-coordinates as 16-bit integers)
Delta encoding reduced the dynamic range of numbers which makes them more compressible
Data In MMTF File
MMTF contains a subset of data used by
visualization and structural analysis programs
Dictionary Encoding
Unique groups (residues) are stored in a “dictionary”
Unique entities are stored in a “dictionary”
(e.g., polymer type, polymer sequences)
Community Engagement
•  Open source specification
•  Open source libraries
1 of 26

Recommended

VisAVis: An Approach to an Intermediate Layer between Ontologies and Relation... by
VisAVis: An Approach to an Intermediate Layer between Ontologies and Relation...VisAVis: An Approach to an Intermediate Layer between Ontologies and Relation...
VisAVis: An Approach to an Intermediate Layer between Ontologies and Relation...Nikolaos Konstantinou
316 views12 slides
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis by
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
3.4K views31 slides
OMIM Database by
OMIM DatabaseOMIM Database
OMIM DatabaseThi K. Tran-Nguyen, PhD
3.7K views8 slides
Database in bioinformatics by
Database in bioinformaticsDatabase in bioinformatics
Database in bioinformaticsVinaKhan1
3.5K views38 slides
Nucleic Acid Sequence databases by
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databasesPranavathiyani G
65K views24 slides
Gen bank databases by
Gen bank databasesGen bank databases
Gen bank databasesHafiz Muhammad Zeeshan Raza
33.9K views17 slides

More Related Content

What's hot

Biological databases by
Biological databasesBiological databases
Biological databasesAfra Fathima
1.3K views36 slides
Bioinformatics biological databases by
Bioinformatics biological databasesBioinformatics biological databases
Bioinformatics biological databasesSangeeta Das
1.6K views4 slides
Databases by
DatabasesDatabases
Databasesafzamalik
1.1K views32 slides
Biological databases by
Biological databasesBiological databases
Biological databasesAshfaq Ahmad
3.5K views17 slides
Data retrieval tools by
Data retrieval toolsData retrieval tools
Data retrieval toolsVidya Kalaivani Rajkumar
14.9K views5 slides

What's hot(20)

Biological databases by Afra Fathima
Biological databasesBiological databases
Biological databases
Afra Fathima1.3K views
Bioinformatics biological databases by Sangeeta Das
Bioinformatics biological databasesBioinformatics biological databases
Bioinformatics biological databases
Sangeeta Das1.6K views
Databases by afzamalik
DatabasesDatabases
Databases
afzamalik1.1K views
Biological databases by Ashfaq Ahmad
Biological databasesBiological databases
Biological databases
Ashfaq Ahmad3.5K views
Publicly available tools and open resources in Bioinformatics by Arindam Ghosh
Publicly available  tools and open resources in BioinformaticsPublicly available  tools and open resources in Bioinformatics
Publicly available tools and open resources in Bioinformatics
Arindam Ghosh6.8K views
Gene bank by kk sahu by KAUSHAL SAHU
Gene bank by kk sahuGene bank by kk sahu
Gene bank by kk sahu
KAUSHAL SAHU2.6K views
DNA data bank of japan (DDBJ) by ZoufishanY
DNA data bank of japan (DDBJ)DNA data bank of japan (DDBJ)
DNA data bank of japan (DDBJ)
ZoufishanY5.6K views
Computational biology bls 303 by Bruno Mmassy
Computational biology bls 303Computational biology bls 303
Computational biology bls 303
Bruno Mmassy4.3K views
Protein information resource (PIR) by ShivaniShewale2
Protein information resource (PIR)Protein information resource (PIR)
Protein information resource (PIR)
ShivaniShewale22.6K views
End-to-End Learning for Answering Structured Queries Directly over Text by Paul Groth
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
Paul Groth2.9K views

Similar to MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

Compressive Structural Bioinformatics: Large-scale analysis and visualization... by
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Peter Rose
903 views16 slides
A consistent and efficient graphical User Interface Design and Querying Organ... by
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...CSCJournals
438 views11 slides
Poster (1) by
Poster (1)Poster (1)
Poster (1)Daniel Osei
65 views1 slide
CADD meeting 08-30-2016 by
CADD meeting 08-30-2016CADD meeting 08-30-2016
CADD meeting 08-30-2016Yana Valasatava
951 views26 slides
D1803012022 by
D1803012022D1803012022
D1803012022IOSR Journals
233 views3 slides
NRNB EAC Report 2011 by
NRNB EAC Report 2011NRNB EAC Report 2011
NRNB EAC Report 2011Alexander Pico
2.3K views45 slides

Similar to MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures(20)

Compressive Structural Bioinformatics: Large-scale analysis and visualization... by Peter Rose
Compressive Structural Bioinformatics: Large-scale analysis and visualization...Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Compressive Structural Bioinformatics: Large-scale analysis and visualization...
Peter Rose903 views
A consistent and efficient graphical User Interface Design and Querying Organ... by CSCJournals
A consistent and efficient graphical User Interface Design and Querying Organ...A consistent and efficient graphical User Interface Design and Querying Organ...
A consistent and efficient graphical User Interface Design and Querying Organ...
CSCJournals438 views
The swings and roundabouts of a decade of fun and games with Research Objects by Carole Goble
The swings and roundabouts of a decade of fun and games with Research Objects The swings and roundabouts of a decade of fun and games with Research Objects
The swings and roundabouts of a decade of fun and games with Research Objects
Carole Goble168 views
Computation and Knowledge by Ian Foster
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
Ian Foster2.3K views
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar... by Jason Riedy
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
ICASSP 2012: Analysis of Streaming Social Networks and Graphs on Multicore Ar...
Jason Riedy940 views
2014 11-13-sbsm032-reproducible research by Yannick Wurm
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
Yannick Wurm508 views
Streaming HYpothesis REasoning by William Smith
Streaming HYpothesis REasoningStreaming HYpothesis REasoning
Streaming HYpothesis REasoning
William Smith213 views
Book of abstract volume 8 no 9 ijcsis december 2010 by Oladokun Sulaiman
Book of abstract volume 8 no 9 ijcsis december 2010Book of abstract volume 8 no 9 ijcsis december 2010
Book of abstract volume 8 no 9 ijcsis december 2010
Oladokun Sulaiman2.6K views
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing by inside-BigData.com
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
inside-BigData.com1.2K views
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o... by Ilkay Altintas, Ph.D.
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva by Gezim Sejdiu
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Gezim Sejdiu492 views
VIZBI 2013 - Overview by Derek Wright
VIZBI 2013 - OverviewVIZBI 2013 - Overview
VIZBI 2013 - Overview
Derek Wright934 views
Managing Multidimensional Historical by Arul Suju
Managing Multidimensional HistoricalManaging Multidimensional Historical
Managing Multidimensional Historical
Arul Suju265 views

Recently uploaded

Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor... by
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Trustlife
114 views17 slides
A giant thin stellar stream in the Coma Galaxy Cluster by
A giant thin stellar stream in the Coma Galaxy ClusterA giant thin stellar stream in the Coma Galaxy Cluster
A giant thin stellar stream in the Coma Galaxy ClusterSérgio Sacani
19 views14 slides
Light Pollution for LVIS students by
Light Pollution for LVIS studentsLight Pollution for LVIS students
Light Pollution for LVIS studentsCWBarthlmew
12 views12 slides
Indian council for child welfare by
Indian council for child welfareIndian council for child welfare
Indian council for child welfareRenuWaghmare2
7 views21 slides
ELECTRON TRANSPORT CHAIN by
ELECTRON TRANSPORT CHAINELECTRON TRANSPORT CHAIN
ELECTRON TRANSPORT CHAINDEEKSHA RANI
11 views16 slides

Recently uploaded(20)

Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor... by Trustlife
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Ellagic Acid and Its Metabolites as Potent and Selective Allosteric Inhibitor...
Trustlife114 views
A giant thin stellar stream in the Coma Galaxy Cluster by Sérgio Sacani
A giant thin stellar stream in the Coma Galaxy ClusterA giant thin stellar stream in the Coma Galaxy Cluster
A giant thin stellar stream in the Coma Galaxy Cluster
Sérgio Sacani19 views
Light Pollution for LVIS students by CWBarthlmew
Light Pollution for LVIS studentsLight Pollution for LVIS students
Light Pollution for LVIS students
CWBarthlmew12 views
Indian council for child welfare by RenuWaghmare2
Indian council for child welfareIndian council for child welfare
Indian council for child welfare
RenuWaghmare27 views
ELECTRON TRANSPORT CHAIN by DEEKSHA RANI
ELECTRON TRANSPORT CHAINELECTRON TRANSPORT CHAIN
ELECTRON TRANSPORT CHAIN
DEEKSHA RANI11 views
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe... by Anmol Vishnu Gupta
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
Study on Drug Drug Interaction Through Prescription Analysis of Type II Diabe...
application of genetic engineering 2.pptx by SankSurezz
application of genetic engineering 2.pptxapplication of genetic engineering 2.pptx
application of genetic engineering 2.pptx
SankSurezz14 views
Applications of Large Language Models in Materials Discovery and Design by Anubhav Jain
Applications of Large Language Models in Materials Discovery and DesignApplications of Large Language Models in Materials Discovery and Design
Applications of Large Language Models in Materials Discovery and Design
Anubhav Jain14 views
selection of preformed arch wires during the alignment stage of preadjusted o... by MaherFouda1
selection of preformed arch wires during the alignment stage of preadjusted o...selection of preformed arch wires during the alignment stage of preadjusted o...
selection of preformed arch wires during the alignment stage of preadjusted o...
MaherFouda17 views
Oral_Presentation_by_Fatma (2).pdf by fatmaalmrzqi
Oral_Presentation_by_Fatma (2).pdfOral_Presentation_by_Fatma (2).pdf
Oral_Presentation_by_Fatma (2).pdf
fatmaalmrzqi8 views
Experimental animal Guinea pigs.pptx by Mansee Arya
Experimental animal Guinea pigs.pptxExperimental animal Guinea pigs.pptx
Experimental animal Guinea pigs.pptx
Mansee Arya40 views
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ... by ILRI
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
Small ruminant keepers’ knowledge, attitudes and practices towards peste des ...
ILRI6 views
별헤는 사람들 2023년 12월호 전명원 교수 자료 by sciencepeople
별헤는 사람들 2023년 12월호 전명원 교수 자료별헤는 사람들 2023년 12월호 전명원 교수 자료
별헤는 사람들 2023년 12월호 전명원 교수 자료
sciencepeople68 views

MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

  • 1. MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures NIH/NCI U01CA198942 Peter W. Rose Director, Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  • 2. PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units ~140,000 Structures in May 2018
  • 3. Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  • 4. à Scalability Issues •  Interactive visualization •  slow network transfer •  slow parsing •  slow rendering •  Mobile visualization •  limited bandwidth •  limited memory •  Large-scale structural analysis •  slow repeated I/O •  slow repeated parsing
  • 5. Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  • 6. Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  • 7. MMTF •  MacroMolecular Transmission Format (mmtf.rcsb.org) •  Compact •  fast network transfer •  fast I/O •  Fast to parse •  binary, no string parsing •  Contains information for structural analysis and visualization •  covalent bonds and bond orders •  consistently calculated secondary structure
  • 8. Lossless vs. Lossy Compression
  • 9. File Size Comparison (PDB archive ~127,000 entries) All file sizes are for the gzip compressed files
  • 10. File Parsing Speed (PDB archive ~127,000 entries) Comparison made with BioJava and mmtf-java API (single core)
  • 11. Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589  MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative Demo
  • 13. MMTF-Spark mmtf-spark (Java) •  High performance processing •  Suitable for large-scale calculations •  Integration with other libraries, e.g., BioJava mmtf-pyspark (Python) •  Interactive scripting •  2D and 3D visualization •  ML/DL tool ecosystem •  Sharable data analysis in Jupyter Notebooks
  • 14. Example of a Spark Workflow
  • 15. MMTF-PySpark in Jupyter Notebook Interactive and reproducible dataming of the Protein Data Bank
  • 16. Plans for a Scalable Infrastructure Expand structural coverage •  PDB •  PDB-redo •  Swiss-Model •  Rosetta-Models •  … Host in cloud-based infrastructure Integrate with other biological resources •  Drug info •  Proteomic •  Genomic •  …
  • 17. Summary •  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) •  Compressed, binary, efficient representation of 3D structures •  Lossless representation (~4x compression over gzip) •  Lossy, reduced representation (~37x compression over gzip) •  Compressive Structural Bioinformatics •  Algorithms, application, and workflows using MMTF •  10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  • 18. Resources•  MMTF Website •  http://mmtf.rcsb.org •  MMTF Specification •  https://github.com/rcsb/mmtf •  MMTF Libraries •  https://github.com/rcsb/mmtf-java •  https://github.com/rcsb/mmtf-javascript •  https://github.com/rcsb/mmtf-python •  https://github.com/rcsb/mmtf-c •  https://github.com/rcsb/mmtf-cpp •  MMTF-Spark/PySpark •  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017 •  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018 •  Publications •  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. https://doi.org/10.1371/journal.pcbi.1005575 •  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846 •  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. https://doi.org/10.1145/2945292.2945324
  • 19. Funding This workshop was supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
  • 20. PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  • 21. MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  • 23. Custom Encoding Examples Run-length encoding (e.g., occupancy) Delta and run-length encoding (e.g., serial numbers) Integer, delta, and recursive index encoding (e.g., xyz-coordinates as 16-bit integers) Delta encoding reduced the dynamic range of numbers which makes them more compressible
  • 24. Data In MMTF File MMTF contains a subset of data used by visualization and structural analysis programs
  • 25. Dictionary Encoding Unique groups (residues) are stored in a “dictionary” Unique entities are stored in a “dictionary” (e.g., polymer type, polymer sequences)
  • 26. Community Engagement •  Open source specification •  Open source libraries