Be the first to like this
Presented at the NIH/NCI Informatics Technology for Cancer Research (ITCR) 2018 meeting (https://itcr.cancer.gov/).
Advances in Structural Bioinformatics are driven by the fast growth in experimental 3D structures and integration with even larger sets of sequence and protein function data. At the same time, the field of Data Science has created new technologies for re-engineering legacy software pipelines to make them scalable, easy to use, reproducible, reusable, and sharable.
Here, we describe the MMTF-Spark/PySpark  project that combines three key components to create such an infrastructure: 1. Interactive Jupyter notebooks to run ad hoc analyses, data mining, machine learning, and visualization of 3D structure and sequence datasets, 2. A scalable compute infrastructure to run these analyses interactively across large datasets, e.g., the entire PDB, using previously developed efficient data representations [2, 3] and the Apache Spark framework for distributed parallel computing, 3. A library of methods for data mining and analysis of 3D structure and sequence data, capitalizing on the rich data analytics, visualization, and machine/deep learning tools available in the Python ecosystem.
Scientists face a number of complex and time consuming barriers when applying structural bioinformatics analysis, including complex software setups, non-interoperable data formats and software applications, lack of documentation, simple examples, and tutorials. Given the large datasets, biologists routinely apply computational tools and automation pipelines in their research. However, there is a long tail of ad-hoc, one-off, questions that biologists ask that cannot be answered using available web resources or workflow systems that focus on common tasks. In this project, we provide a self-contained programming environment that caters to scientists with varying computational skills and needs, ranging from biologists with basic programming skills, to structural and computational biologists who want to share their work, to data scientists who seek access to bioinformatics datasets to benchmark new machine learning methods. A key advantage of this environment is interactivity, which enables iterative exploration. By combining documentation, data sets, analysis code, results, and interactive visualizations in Jupyter notebooks, the steps of an interactive session can be captured, reproduced, and shared.
This project was supported by the NCI of the NIH under award number U01 CA198942.
1. https://github.com/sbl-sdsc/mmtf-spark, https://github.com/sbl-sdsc/mmtf-pyspark
2. Bradley AR, et al. (2017) MMTF - an efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLOS Computational Biology 13(6): e1005575.
3. Valasatava Y, et al. (2017) Towards an efficient compression of 3D coordinates of macromolecular structures. PLOS ONE 12(3): e0174846.