Small, fast and useful – MMTF a new paradigm in macromolecular data transmission

The size, number and complexity of macromolecular structures have grown dramatically in recent years, making visualisation and analysis of macromolecules non-trivial and sometimes impossible. At the same time, developments within genomics, web-based game development and Big Data mean that hardware and software now support such analysis. However, existing macromolecular file formats present an I/O bottleneck, meaning the power of such technologies cannot be harnessed. In this work we present a modern MacroMolecular Transmission Format (MMTF). MMTF is 91% smaller than mmCIF and is up to two orders of magnitude faster to parse. Together, these changes provide a paradigm shift in the way structural biology can be carried out. The largest structures can now be visualised on all devices, and the entire archive can be interactively queried and analysed in seconds through an efficient in-memory representation.

Published in: Science
  1. Small, fast and useful – MMTF a new paradigm in macromolecular data transmission – mmtf.rcsb.org
     Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić, Peter W. Rose

     Yet another file format???
       • The Protein Data Bank (PDB) is a worldwide archive of macromolecular structures
       • Established in 1972, it has seen large growth over the past 30 years
       • Data are currently stored and transmitted in the PDB and mmCIF archival file formats
       • Such formats are not appropriate for web-based and Big Data applications
       • Steep increase in atoms per structure (37% between 2012 and 2016)
       • 10,000 new structures added per year
       • 68 of the 100 largest structures were deposited in the past three years
       • Largest structure contains 2.5 M atoms
       • EM has seen a sharp rise in recent years
       • Efficient transmission and parsing of data is integral to Big Data initiatives, e.g., ADAM
       • No compressed format for macromolecules
       • Processing and analyzing macromolecules is a bottleneck
       • Visualizing large structures is challenging
       [Figure: Atoms per structure in the PDB]
       [Figure: EM atoms added to the PDB]

     What is it?
       • Binary: MessagePack (binary JSON format) used as a data container, http://msgpack.org/
       • Custom lossless compression: delta, run-length and dictionary encoding used to compress data
       • Open-source: specification and software libraries developed under Apache/MIT licenses

     Outcomes
       • Small: ~75% compression over mmCIF GZIP
       • Fast: parsing 2 orders of magnitude faster
       • Self-contained: no need for calls to external resources
       • Useful: bonding (bond order) and secondary structure info included in all files
       [Figure: Whole PDB archive GZIP compressed. mmCIF: 30 GB; MMTF: 7 GB]

     Get the data
       • Whole PDB archive converted to MMTF weekly
       • Individual files available from a REST API:
         wget http://mmtf.rcsb.org/v0.2/full/4hhb.mmtf.gz
       • Whole archive as a Hadoop sequence file:
         wget http://mmtf.rcsb.org/v0.2/hadoopfiles/full.tar
       • More details: http://mmtf.rcsb.org/download.html

     APIs provided
       • Clean APIs to the data provided in commonly used languages
       • No need to write your own parser
       • No more parsers breaking
       https://github.com/rcsb/mmtf-python
       https://github.com/rcsb/mmtf-java
       https://github.com/rcsb/mmtf-javascript

     High performance analysis
       Hadoop sequence files are optimized for fast parallel and sequential access.
       Spark is a fast in-memory big data engine with clean and expressive APIs, http://spark.apache.org/
       • APIs and tools designed using the Apache Spark framework for fast parallel in-memory processing
       • Spark handles running code in a multi-threaded manner, so there is no need to manage thread pools
       • Python, Java and Scala APIs available
       • Spark is used widely in other areas of bioinformatics (e.g., ADAM in genomics, http://bdgenomics.org/)

     Applications

     Data mining
       • MMTF allows interactive data mining of the entire PDB archive
       • No need for SQL or for setting up a database or schema
       • Queries on the entire archive in only a couple of minutes
       [Figure: Time to count all the elements in the PDB. MMTF: <2 minutes; mmCIF: 400 minutes]

       Element    Occurrences    % of PDB
       Carbon     431,487,468    43%
       Oxygen     174,153,905    17%
       Nitrogen   121,509,487    12%

     Efficient contact finding
       • Inter-atomic contacts are often analyzed, e.g., for empirical force fields
       • With MMTF, an efficient contact-finding algorithm has a strong impact
       • Using mmCIF, the efficient algorithm provides only a ~10% speedup
       • Using MMTF, the same algorithm gives a ~90% speedup
       • MMTF promotes efficient downstream algorithm design
       [Figure: Time taken to find all C-alpha-C-alpha contacts using mmCIF and MMTF; efficient hashing algorithm vs inefficient looping algorithm]

     Fragment generation
       • Generate all fragments from the protein chains in the PDB
       • Commonly done in, e.g., ab initio structure prediction
       • I/O is a key bottleneck in this process
       • MMTF allows such analysis to be done in a fraction of the time
       • More experiments can be done per day
       • No need to compromise on dataset size or parameters
       [Figure: Experiments run per 24 hours, MMTF vs mmCIF]

     Benchmarks were run on a Mac mini with a 2.6 GHz Intel Core i5 (4 cores) and 16 GB RAM.
     [Chart values extracted without labels: 50, 6, 448, 404, 4, 640, 402, 4]

     Already several early adopters
       BioJava

     Three ways to get involved (http://mmtf.rcsb.org/)
       1. Use: use our API to do your own processing
       2. Adopt: incorporate MMTF into your toolkit
       3. Contribute: fork us on GitHub

     Funding and acknowledgements
       BD2K Targeted Software Development, Grant Number: U01 CA198942
       Cole Christie and Chris Randle
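
The poster names delta, run-length and dictionary encoding as the building blocks of MMTF's lossless compression. The following is a minimal sketch of those three ideas in plain Python, not the MMTF codec itself: the function names and the sample data are hypothetical, and the real specification defines its own binary layouts on top of MessagePack.

```python
from typing import Dict, List, Tuple


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    out: List[int] = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values: List[int]) -> List[int]:
    """Collapse runs into a flat [value, count, value, count, ...] list."""
    out: List[int] = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1
        else:
            out.extend([v, 1])
    return out


def run_length_decode(pairs: List[int]) -> List[int]:
    """Expand [value, count, ...] pairs back into the full list."""
    out: List[int] = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


def dictionary_encode(items: List[str]) -> Tuple[List[str], List[int]]:
    """Store each distinct item once and refer to it by index."""
    table: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for item in items:
        if item not in index:
            index[item] = len(table)
            table.append(item)
        codes.append(index[item])
    return table, codes


# Hypothetical data: atom serial numbers that increase by one.
atom_ids = list(range(1, 1001))
packed = run_length_encode(delta_encode(atom_ids))
restored = delta_decode(run_length_decode(packed))

# Hypothetical data: repeated residue names.
table, codes = dictionary_encode(["ALA", "GLY", "ALA", "ALA"])
```

Chaining delta and run-length encoding pays off when neighbouring values differ by a constant, which is why a thousand consecutive serial numbers shrink to a single value/count pair here.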
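
The contact-finding comparison contrasts an "efficient hashing algorithm" with an "inefficient looping algorithm". The sketch below illustrates one common way to realise that contrast, assuming the hashing variant is a spatial grid (cell list); the poster does not say which hashing scheme was used, and the function names, cutoff and coordinates are all hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import product
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]
Pair = Tuple[int, int]
Cell = Tuple[int, int, int]


def contacts_looping(points: List[Point], cutoff: float) -> Set[Pair]:
    """Compare every pair of points: O(n^2) distance checks."""
    found: Set[Pair] = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cutoff:
                found.add((i, j))
    return found


def contacts_hashing(points: List[Point], cutoff: float) -> Set[Pair]:
    """Hash points into cubic cells of side `cutoff`, then only compare
    points that sit in the same or a neighbouring cell."""
    grid: Dict[Cell, List[int]] = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cell = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        grid[cell].append(idx)

    found: Set[Pair] = set()
    for (cx, cy, cz), members in list(grid.items()):
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get avoids creating empty cells in the defaultdict
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        found.add((i, j))
    return found


# Hypothetical coordinates standing in for C-alpha positions.
rng = random.Random(0)
coords = [(rng.uniform(0, 50), rng.uniform(0, 50), rng.uniform(0, 50)) for _ in range(300)]
same_result = contacts_hashing(coords, 8.0) == contacts_looping(coords, 8.0)
```

Both functions return the same set of index pairs; the grid version simply skips pairs that are too far apart to be in contact. This matches the poster's point that a better algorithm only shows its benefit once parsing is no longer the dominant cost.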
