Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

22 views

Published on

Presented at the NIH/NCI Informatics Technology for Cancer Research (ITCR) 2018 meeting (https://itcr.cancer.gov/).

Advances in Structural Bioinformatics are driven by the fast growth in experimental 3D structures and integration with even larger sets of sequence and protein function data. At the same time, the field of Data Science has created new technologies for re-engineering legacy software pipelines to make them scalable, easy to use, reproducible, reusable, and sharable.

Here, we describe the MMTF-Spark/PySpark [1] project that combines three key components to create such an infrastructure: 1. Interactive Jupyter notebooks to run ad hoc analyses, data mining, machine learning, and visualization of 3D structure and sequence datasets, 2. A scalable compute infrastructure to run these analyses interactively across large datasets, e.g., the entire PDB, using previously developed efficient data representations [2, 3] and the Apache Spark framework for distributed parallel computing, 3. A library of methods for data mining and analysis of 3D structure and sequence data, capitalizing on the rich data analytics, visualization, and machine/deep learning tools available in the Python ecosystem.

Scientists face a number of complex and time consuming barriers when applying structural bioinformatics analysis, including complex software setups, non-interoperable data formats and software applications, lack of documentation, simple examples, and tutorials. Given the large datasets, biologists routinely apply computational tools and automation pipelines in their research. However, there is a long tail of ad-hoc, one-off, questions that biologists ask that cannot be answered using available web resources or workflow systems that focus on common tasks. In this project, we provide a self-contained programming environment that caters to scientists with varying computational skills and needs, ranging from biologists with basic programming skills, to structural and computational biologists who want to share their work, to data scientists who seek access to bioinformatics datasets to benchmark new machine learning methods. A key advantage of this environment is interactivity, which enables iterative exploration. By combining documentation, data sets, analysis code, results, and interactive visualizations in Jupyter notebooks, the steps of an interactive session can be captured, reproduced, and shared.

Acknowledgements
This project was supported by the NCI of the NIH under award number U01 CA198942.

References
1. https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-pyspark
2. Bradley AR, et al. (2017) MMTF - an efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
3. Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846.

Published in: Science
  • Be the first to comment

  • Be the first to like this

MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures

  1. 1. MMTF-Spark: Interactive, Scalable, and Reproducible Datamining of 3D Macromolecular Structures NIH/NCI U01CA198942 Peter W. Rose Director, Structural Bioinformatics Laboratory San Diego Supercomputer Center UC San Diego
  2. 2. PDB – A Billion Atom Archive > 1 billion atoms in the asymmetric units ~140,000 Structures in May 2018
  3. 3. Growing Structure Size and Complexity Largest asymmetric structure in PDB Largest symmetric structure in PDB HIV-1 capsid: PDB ID 3J3Q ~2.4M unique atoms Faustovirus major capsid: PDB ID 5J7V ~40M overall atoms
  4. 4. à Scalability Issues •  Interactive visualization •  slow network transfer •  slow parsing •  slow rendering •  Mobile visualization •  limited bandwidth •  limited memory •  Large-scale structural analysis •  slow repeated I/O •  slow repeated parsing
  5. 5. Compressive Structural Bioinformatics Efficiently store, transmit, and visualize 3D structures of biological macromolecules Perform large-scale structural calculations such as geometric queries or structural comparisons over the entire PDB archive held in memory
  6. 6. Macromolecular 3D Structure Biological macromolecules are polymers constructed by linking monomers by covalent bonds Biological macromolecules: proteins, nucleic acids
  7. 7. MMTF •  MacroMolecular Transmission Format (mmtf.rcsb.org) •  Compact •  fast network transfer •  fast I/O •  Fast to parse •  binary, no string parsing •  Contains information for structural analysis and visualization •  covalent bonds and bond orders •  consistently calculated secondary structure
  8. 8. Lossless vs. Lossy Compression
  9. 9. File Size Comparison (PDB archive ~127,000 entries) All file sizes are for the gzip compressed files
  10. 10. File Parsing Speed (PDB archive ~127,000 entries) Comparison made with BioJava and mmtf-java API (single core)
  11. 11. Download + Parsing time MMTF vs. mmCIF Bethesda, MD 85 MMTF 2418 mmCIF Switzerland 1589  MMTF 4431 mmCIF Russia 557 MMTF failed mmCIF Japan 79 MMTF 2838 mmCIF San Diego, CA 36 MMTF 840 mmCIF Time (seconds) to download* 100 large PDB structures from UCSD and parse with JavaScript decoder in Chrome browser *Note: download times are highly variable and not representative Demo
  12. 12. Who Uses MMTF?
  13. 13. MMTF-Spark mmtf-spark (Java) •  High performance processing •  Suitable for large-scale calculations •  Integration with other libraries, e.g., BioJava mmtf-pyspark (Python) •  Interactive scripting •  2D and 3D visualization •  ML/DL tool ecosystem •  Sharable data analysis in Jupyter Notebooks
  14. 14. Example of a Spark Workflow
  15. 15. MMTF-PySpark in Jupyter Notebook Interactive and reproducible dataming of the Protein Data Bank
  16. 16. Plans for a Scalable Infrastructure Expand structural coverage •  PDB •  PDB-redo •  Swiss-Model •  Rosetta-Models •  … Host in cloud-based infrastructure Integrate with other biological resources •  Drug info •  Proteomic •  Genomic •  …
  17. 17. Summary •  MacroMolecular Transmission Format (MMTF, mmtf.rcsb.org) •  Compressed, binary, efficient representation of 3D structures •  Lossless representation (~4x compression over gzip) •  Lossy, reduced representation (~37x compression over gzip) •  Compressive Structural Bioinformatics •  Algorithms, application, and workflows using MMTF •  10 to 100+ fold speedup Structure Visualization Large Scale PDB Mining Web-based molecular graphics for large complexes (2016) Web 3D ‘16, 185-186, DOI: 10.1145/2945292.2945324
  18. 18. Resources•  MMTF Website •  http://mmtf.rcsb.org •  MMTF Specification •  https://github.com/rcsb/mmtf •  MMTF Libraries •  https://github.com/rcsb/mmtf-java •  https://github.com/rcsb/mmtf-javascript •  https://github.com/rcsb/mmtf-python •  https://github.com/rcsb/mmtf-c •  https://github.com/rcsb/mmtf-cpp •  MMTF-Spark/PySpark •  https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-workshop-2017 •  https://github.com/sbl-sdsc/mmtf-pyspark, https://github.com/sbl-sdsc/mmtf-workshop-2018 •  Publications •  Bradley AR, et al. (2017) MMTF—An efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575. https://doi.org/10.1371/journal.pcbi.1005575 •  Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846. https://doi.org/10.1371/journal.pone.0174846 •  Rose AS, et al. (2016) Web-based molecular graphics for large complexes. In Proceedings of the 21st International Conference on Web3D Technology (Web3D '16). ACM, New York, NY, USA, 185-186. https://doi.org/10.1145/2945292.2945324
  19. 19. Funding This workshop was supported by the National Cancer Institute of the National Institutes of Health under Award Number U01CA198942. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
  20. 20. PDBx/mmCIF Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes (mmcif.wwpdb.org) repetitive information redundant annotations inefficient representation
  21. 21. MMTF Compression Pipeline integer encoding dictionary encoding run-length encoding delta encoding GZIP recursive indexing extract structural data calculate bonds, SSE Binary, extensible container format of MMTF It's like JSON. but fast and small.
  22. 22. Columnar Encoding of Arrays
  23. 23. Custom Encoding Examples Run-length encoding (e.g., occupancy) Delta and run-length encoding (e.g., serial numbers) Integer, delta, and recursive index encoding (e.g., xyz-coordinates as 16-bit integers) Delta encoding reduced the dynamic range of numbers which makes them more compressible
  24. 24. Data In MMTF File MMTF contains a subset of data used by visualization and structural analysis programs
  25. 25. Dictionary Encoding Unique groups (residues) are stored in a “dictionary” Unique entities are stored in a “dictionary” (e.g., polymer type, polymer sequences)
  26. 26. Community Engagement •  Open source specification •  Open source libraries

×