Small, fast and useful – MMTF a new paradigm in macromolecular data transmission

The size, number and complexity of macromolecular structures have grown dramatically in recent years, making visualisation and analysis of macromolecules non-trivial and sometimes impossible. At the same time, developments within genomics, web-based game development and Big Data mean that hardware and software now support such analysis. However, existing macromolecular file formats present an I/O bottleneck, meaning the power of such technologies cannot be harnessed. In this work we present a modern MacroMolecular Transmission Format (MMTF). MMTF is 91% smaller than mmCIF and is up to two orders of magnitude faster to parse. Together, these changes provide a paradigm shift in the way structural biology can be carried out. The largest structures can now be visualised on all devices, and the entire archive can be interactively queried and analysed in seconds through an efficient in-memory representation.

Small, fast and useful – MMTF a new paradigm in macromolecular data transmission – mmtf.rcsb.org
Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić, Peter W. Rose
  
Funding and acknowledgements
BD2K Targeted Software Development, Grant Number: U01 CA198942
Cole Christie and Chris Randle
http://mmtf.rcsb.org/
Already several early adopters

Yet another file format???
•  Steep increase in atoms per structure (37% between 2012 and 2016)
•  10,000 new structures added per year
•  68 of the 100 largest structures were deposited in the past three years
•  Largest structure contains 2.5 M atoms
•  EM has seen a sharp rise in recent years
  
Outcomes
•  Small: ~75% compression over mmCIF GZIP
•  Fast: parsing 2 orders of magnitude faster
•  Self-contained: no need for calls to external resources
•  Useful: bonding (bond order) and secondary structure info included in all files
  
What is it?
•  Binary: MessagePack (binary JSON format) used as a data container http://msgpack.org/
•  Custom lossless compression: delta, run-length and dictionary encoding used to compress data
•  Open-source: specification and software libraries developed under Apache/MIT licenses

[Figure panel: Fast]

Get the data
•  Whole PDB archive converted to MMTF weekly
•  Individual files available from a REST API:
   wget http://mmtf.rcsb.org/v0.2/full/4hhb.mmtf.gz
•  Whole archive as a Hadoop sequence file:
   wget http://mmtf.rcsb.org/v0.2/hadoopfiles/full.tar
•  More details: http://mmtf.rcsb.org/download.html
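
As a rough equivalent of the wget call for a single entry, the sketch below builds the per-entry URL and unpacks the gzip layer using only the standard library. The URL pattern is copied from the poster and may have changed since; lower-casing the PDB ID and the helper names are assumptions. The decompressed bytes are MessagePack and still need an MMTF decoder.

```python
import gzip
import urllib.request

# Base path as printed on the poster (v0.2 of the service).
BASE_URL = "http://mmtf.rcsb.org/v0.2/full"


def mmtf_url(pdb_id: str) -> str:
    """Build the download URL for one entry (lower-casing is an assumption)."""
    return f"{BASE_URL}/{pdb_id.lower()}.mmtf.gz"


def gunzip_payload(data: bytes) -> bytes:
    """Strip the gzip layer; the result is the raw MessagePack payload."""
    return gzip.decompress(data)


def download(pdb_id: str) -> bytes:
    """Fetch and decompress one entry. Requires network access."""
    with urllib.request.urlopen(mmtf_url(pdb_id)) as response:
        return gunzip_payload(response.read())
```

Calling `download("4hhb")` would mirror the wget example above, provided the endpoint is still served at that address.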
  	
  
Data mining
•  MMTF allows interactive data mining of the entire PDB archive
•  No need for SQL or setting up a database or schema
•  Queries on the entire archive in only a couple of minutes
  
Three ways to get involved
1.  Use – use our API to do your own processing
2.  Adopt – incorporate MMTF into your toolkit
3.  Contribute – fork us on GitHub
  
Applications: data mining, efficient contact finding, fragment generation

Fragment generation
•  Generate all fragments from the protein chains in the PDB
•  Commonly done in, e.g., ab initio structure prediction
•  I/O is a key bottleneck in this process
•  MMTF allows such analysis to be done in a fraction of the time
•  More experiments can be done per day
•  No need to compromise on dataset size or parameters
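
Fragment generation of this kind is commonly a sliding window over each chain. The sketch below assumes a chain is simply a sequence of residue names and that fragments are contiguous and of fixed length; the function name and the example chain are hypothetical, and the poster does not state the fragment definition that was actually used.

```python
from typing import Iterator, Sequence, Tuple


def fragments(chain: Sequence[str], length: int) -> Iterator[Tuple[int, Tuple[str, ...]]]:
    """Yield (start_index, residues) for every contiguous window of `length`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # A chain shorter than `length` produces an empty range, hence no fragments.
    for start in range(len(chain) - length + 1):
        yield start, tuple(chain[start:start + length])


# Hypothetical five-residue chain, given as one-letter codes.
example = list(fragments("ACDEF", 3))
```

Because every chain is processed independently, this step parallelises trivially once the structures are already decoded in memory, which is where the I/O saving matters.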
  
Using a Mac mini with a 2.6 GHz Intel Core i5 (4 cores) and 16 GB RAM.

[Figure panel: Small]
  
High performance analysis
Hadoop sequence files are optimized for fast parallel and sequential access.
Spark is a fast in-memory big data engine with clean and expressive APIs: http://spark.apache.org/
•  APIs and tools designed using the Apache Spark framework for fast parallel in-memory processing
•  Spark deals with running code in a multi-threaded manner – no need to manage thread pools
•  Python, Java and Scala APIs available
•  Spark used widely in other areas of Bioinformatics (e.g., ADAM in Genomics http://bdgenomics.org/)
  
Efficient contact finding
[Figure legend: efficient hashing algorithm; inefficient looping algorithm]
•  Inter-atomic contacts are often analyzed, e.g., for empirical force fields
•  MMTF allows the efficient contact-finding algorithm to have a strong impact
•  Using mmCIF, the efficient algorithm provides only a ~10% speedup
•  Using MMTF, the same algorithm gives a ~90% speedup
•  MMTF promotes efficient downstream algorithm design
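
The contrast between a looping and a hashing contact search can be illustrated as follows. This is a generic sketch, assuming points are (x, y, z) tuples and a contact is any pair within a distance cutoff; the uniform-grid hashing shown here is one common choice and is not claimed to be the implementation benchmarked on the poster. All names and the cutoff value are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def contacts_looping(points: Sequence[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Compare every pair of points: quadratic in the number of points."""
    pairs: Set[Tuple[int, int]] = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= cutoff:
                pairs.add((i, j))
    return pairs


def cell_of(point: Point, cutoff: float) -> Cell:
    """Hash a point to the grid cell (edge length = cutoff) that contains it."""
    x, y, z = point
    return (floor(x / cutoff), floor(y / cutoff), floor(z / cutoff))


def contacts_hashing(points: Sequence[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Bin points into cells, then compare only points in neighbouring cells."""
    grid: Dict[Cell, List[int]] = defaultdict(list)
    for idx, p in enumerate(points):
        grid[cell_of(p, cutoff)].append(idx)

    pairs: Set[Tuple[int, int]] = set()
    for (cx, cy, cz), members in grid.items():
        # Points within `cutoff` can differ by at most one cell per axis.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j and dist(points[i], points[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs


# Hypothetical coordinates and cutoff, chosen only to exercise both paths.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0), (5.5, 5.0, 5.0), (-0.5, 0.0, 0.0)]
slow = contacts_looping(coords, 1.2)
fast = contacts_hashing(coords, 1.2)
```

Both functions return the same pairs; the hashed version avoids most distance checks, so its benefit only shows once reading and parsing the structure no longer dominates the run time.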
  
Element     Occurrences     % of PDB
Carbon      431,487,468     43 %
Oxygen      174,153,905     17 %
Nitrogen    121,509,487     12 %
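
A table like the one above comes from a map step (count the elements in each structure) followed by a reduce step (merge the counts). The sketch below is a single-process stand-in for that pattern, assuming each structure is just a list of element symbols; it does not use Spark, and the names and toy data are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Dict, Iterable, List


def count_elements(structure: List[str]) -> Counter:
    """Map step: tally the element symbols of one structure."""
    return Counter(structure)


def count_archive(structures: Iterable[List[str]]) -> Counter:
    """Reduce step: merge the per-structure tallies into one."""
    return reduce(lambda a, b: a + b, map(count_elements, structures), Counter())


def percentages(counts: Counter) -> Dict[str, float]:
    """Express each tally as a percentage of all atoms counted."""
    total = sum(counts.values())
    return {element: 100.0 * n / total for element, n in counts.items()}


# Two toy "structures" standing in for decoded archive entries.
archive = [["C", "C", "O", "N"], ["C", "O"]]
totals = count_archive(archive)
shares = percentages(totals)
```

In a distributed engine the map and reduce steps would run across partitions of the archive, but the logic per structure stays the same.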
  
•  Efficient transmission and parsing of data is integral to Big Data initiatives, e.g., ADAM
•  No compressed format for macromolecules
•  Processing and analyzing macromolecules is a bottleneck
•  Visualizing large structures is challenging
  
APIs provided
•  Clean APIs to the data provided in commonly used languages
•  No need to write your own parser
•  No more parsers breaking
https://github.com/rcsb/mmtf-python
https://github.com/rcsb/mmtf-java
https://github.com/rcsb/mmtf-javascript
  
[Figures: Atoms per structure in the PDB; EM atoms added to the PDB; Whole PDB archive GZIP compressed; BioJava; Time to count all the elements in the PDB; Time taken to find all C-alpha-C-alpha contacts using mmCIF and MMTF; Experiments run per 24 hours. The comparisons plot MMTF against mmCIF. Bar labels recovered from the charts: 30 GB, 7 GB, <2 minutes, 400 minutes, 50, 6, 448, 404, 4, 640, 402, 4.]
  
•  The Protein Data Bank (PDB) is a worldwide archive of macromolecular structures
•  Established in 1971, it has seen large growth over the past 30 years
•  Data are currently stored and transmitted in the PDB and mmCIF archival file formats
•  Such formats are not appropriate for web-based and Big Data applications
  

