Cloudy with a Touch of Cheminformatics


Published in: Technology, Business
  1. Cloudy with a Touch of Cheminformatics
     Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
     NIH Center for Advancing Translational Science
     Chemaxon UGM, September 26th, 2012, Wellesley, MA
  2. Parallel computing in the cloud
     • Modern cloud vendors make provisioning compute resources easy
       – Allows one to handle unpredictable loads easily
       – Pay only for what you need
     • Chemistry applications don’t usually have very dynamic loads
     • But large scale resources are an opportunity for large scale (parallel) computations
  3. All HPC is not equal
     Legacy HPC
       • Use cloud resources in the same way as a local cluster
       • MIT StarCluster makes this easy to do
     Cloudy HPC
       • Make use of cloud capabilities
       • Old algorithms, new infrastructure
       • Spot instances, SNS, SQS, SimpleDB, S3, etc.
     Big Data HPC
       • Huge datasets
       • Candidates for map-reduce
       • Involves algorithm (re)design
     http://-life-science-informatics-to-the-cloud
  4. Big data & cheminformatics
     • Computation over large chemical databases
       – Pubchem, ChEMBL, GDB-13, …
     • What types of computations?
       – Searches (substructure, pharmacophore, …)
       – QSAR models & predictions over large data
     • Fundamentally, “big chemical data” lets us explore larger chemical spaces
  5. Map-Reduce
     [Figure: input splits (Split 0, Split 1, Split 2) each feed a Map task; map output is copied, sorted and merged, then passed to Reduce tasks that write Part 0 and Part 1]
     map:    (K1, V1) → list(K2, V2)
     reduce: (K2, list(V2)) → list(K3, V3)
     Tom White, Hadoop: The Definitive Guide, 3rd Ed., O’Reilly
  6. Counting atoms
     • The chemical version of the word counting task
     Map input: arbitrary line numbers (K1), SMILES (V1)
       1, Nc1ccc2ncccc2c1N
       2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
       ...
       152366, Nc1ccc2ncccc2c1N
     MAP output: atom symbol (K2), occurrence (V2)
       N 1
       N 1
       N 1
       ...
     Reduce input: atom symbol (K2), list (V2)
       N, list(1,1,1,1,...)
       C, list(1,1,1,1,...)
       ...
     Reduce output: atom symbol (K3), count (V3)
       N,100
       C,5684
       ...
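The atom counting task on this slide can be imitated in plain Java without a Hadoop cluster. The following is only a minimal sketch: the class name `AtomCountSketch` and its methods are hypothetical, and the tokenizer is a deliberately naive assumption that recognises only unbracketed organic-subset atoms (a real job would parse each SMILES with a cheminformatics toolkit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: mimics the map and reduce phases in memory, no Hadoop involved.
class AtomCountSketch {

    // MAP: emit one element symbol per atom found in a SMILES string.
    // Assumption: only unbracketed organic-subset atoms are recognised
    // (B, C, N, O, P, S, F, Cl, Br, I and aromatic b, c, n, o, p, s).
    static List<String> map(String smiles) {
        List<String> symbols = new ArrayList<>();
        for (int i = 0; i < smiles.length(); i++) {
            char ch = smiles.charAt(i);
            if (i + 1 < smiles.length()) {
                String two = smiles.substring(i, i + 2);
                if (two.equals("Cl") || two.equals("Br")) {
                    symbols.add(two);
                    i++; // consume the second letter of the symbol
                    continue;
                }
            }
            if ("BCNOPSFI".indexOf(ch) >= 0) {
                symbols.add(String.valueOf(ch));
            } else if ("bcnops".indexOf(ch) >= 0) {
                // aromatic atoms are counted under their element symbol
                symbols.add(String.valueOf(Character.toUpperCase(ch)));
            }
            // digits, bonds, parentheses and dots are skipped
        }
        return symbols;
    }

    // SHUFFLE + REDUCE: group the emitted symbols and sum their occurrences.
    static Map<String, Long> countAtoms(List<String> smilesLines) {
        return smilesLines.stream()
                .flatMap(s -> map(s).stream())
                .collect(Collectors.groupingBy(sym -> sym, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> input = List.of("Nc1ccc2ncccc2c1N", "Cl.CC1CCc2nc3ccccc3c(C)c2C1");
        countAtoms(input).forEach((symbol, n) -> System.out.println(symbol + "," + n));
    }
}
```

In a real Hadoop job the grouping step is done by the framework's shuffle and sort; here `Collectors.groupingBy` stands in for it.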
  7. The Hadoop ecosystem
     [Figure: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama and Hive, built around the Map Reduce Engine, the Hadoop Distributed Filesystem and Hadoop Common]
     Based on http://ticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
  8. Cheminformatics on Hadoop
     • Hadoop and Atom Counting
     • Hadoop and SD Files
     • Cheminformatics, Hadoop and EC2
     • Pig and Cheminformatics
     But are cheminformatics problems really big enough to justify all of this?
  9. Simplifying Hadoop applications
     • Raw Hadoop programs can be tedious to write
     SMARTS based substructure search:

     package gov.nih.ncgc.hadoop;

     import chemaxon.formats.MolFormatException;
     import chemaxon.formats.MolImporter;
     import chemaxon.license.LicenseManager;
     import chemaxon.license.LicenseProcessingException;
     import chemaxon.sss.search.MolSearch;
     import chemaxon.sss.search.SearchException;
     import chemaxon.struc.Molecule;
     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.conf.Configured;
     import org.apache.hadoop.filecache.DistributedCache;
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapred.FileInputFormat;
     import org.apache.hadoop.mapred.FileOutputFormat;
     import org.apache.hadoop.mapred.JobClient;
     import org.apache.hadoop.mapred.JobConf;
     import org.apache.hadoop.mapred.MapReduceBase;
     import org.apache.hadoop.mapred.Mapper;
     import org.apache.hadoop.mapred.OutputCollector;
     import org.apache.hadoop.mapred.Reducer;
     import org.apache.hadoop.mapred.Reporter;
     import org.apache.hadoop.mapred.TextInputFormat;
     import org.apache.hadoop.mapred.TextOutputFormat;
     import org.apache.hadoop.util.Tool;
     import org.apache.hadoop.util.ToolRunner;

     import java.io.BufferedReader;
     import java.io.FileReader;
     import java.io.IOException;
     import java.util.Iterator;

     /**
      * SMARTS searching over a set of files using Hadoop.
      *
      * @author Rajarshi Guha
      */
     public class SmartsSearch extends Configured implements Tool {
         private final static IntWritable one = new IntWritable(1);
         private final static IntWritable zero = new IntWritable(0);

         public static class MoleculeMapper extends MapReduceBase
                 implements Mapper<LongWritable, Text, Text, IntWritable> {
             private String pattern = null;
             private MolSearch search;

             final static IntWritable one = new IntWritable(1);
             Text matches = new Text();

             // Runs once per map task: install the license and compile the SMARTS query.
             public void configure(JobConf job) {
                 try {
                     Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                     BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                     StringBuilder license = new StringBuilder();
                     String line;
                     while ((line = reader.readLine()) != null) license.append(line);
                     reader.close();
                     LicenseManager.setLicense(license.toString());
                 } catch (IOException e) {
                 } catch (LicenseProcessingException e) {
                 }

                 pattern = job.getStrings("pattern")[0];
                 search = new MolSearch();
                 try {
                     Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                     search.setQuery(queryMol);
                 } catch (MolFormatException e) {
                 }
             }

             // Emits (molecule name, 1) on a match and (molecule name, 0) otherwise.
             public void map(LongWritable key, Text value,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
                 Molecule mol = MolImporter.importMol(value.toString());
                 matches.set(mol.getName());
                 search.setTarget(mol);
                 try {
                     if (search.isMatching()) {
                         output.collect(matches, one);
                     } else {
                         output.collect(matches, zero);
                     }
                 } catch (SearchException e) {
                 }
             }
         }

         public static class SmartsMatchReducer extends MapReduceBase
                 implements Reducer<Text, IntWritable, Text, IntWritable> {
             private IntWritable result = new IntWritable();

             public void reduce(Text key,
                                Iterator<IntWritable> values,
                                OutputCollector<Text, IntWritable> output,
                                Reporter reporter) throws IOException {
                 while (values.hasNext()) {
                     // The left-hand operand was lost from the slide text; comparing
                     // against `one` (keep only matches) is an assumed reconstruction.
                     if (values.next().compareTo(one) == 0) {
                         result.set(1);
                         output.collect(key, result);
                     }
                 }
             }
         }

         public int run(String[] args) throws Exception {
             // The slide passes HeavyAtomCount.class here; this class is used instead.
             JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
             jobConf.setJobName("smartsSearch");

             jobConf.setOutputKeyClass(Text.class);
             jobConf.setOutputValueClass(IntWritable.class);

             jobConf.setMapperClass(MoleculeMapper.class);
             jobConf.setCombinerClass(SmartsMatchReducer.class);
             jobConf.setReducerClass(SmartsMatchReducer.class);

             jobConf.setInputFormat(TextInputFormat.class);
             jobConf.setOutputFormat(TextOutputFormat.class);

             jobConf.setNumMapTasks(5);

             if (args.length != 4) {
                 System.err.println("Usage: ss <in> <out> <pattern> <license file>");
                 System.exit(2);
             }

             FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
             FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
             jobConf.setStrings("pattern", args[2]);

             // make the license file available via the distributed cache
             DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

             JobClient.runJob(jobConf);
             return 0;
         }

         public static void main(String[] args) throws Exception {
             int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
         }
     }
  10. Pig & Pig Latin
     • Pig Latin programs are much simpler to write and get translated to Hadoop code
     • SQL-like, requires UDF to be implemented to perform non-standard tasks

     SMARTS search in Pig Latin:

     A = load 'medium.smi' as (smiles:chararray);
     B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
     store B into 'output.txt';

     UDF for SMARTS search:

     package gov.nih.ncgc.hadoop.pig;

     import chemaxon.formats.MolImporter;
     import chemaxon.sss.search.MolSearch;
     import chemaxon.sss.search.SearchException;
     import chemaxon.struc.Molecule;
     import org.apache.pig.FilterFunc;
     import org.apache.pig.data.Tuple;

     import java.io.IOException;

     public class SMATCH extends FilterFunc {
         // Initialised here so that exec() does not dereference a null search object.
         static MolSearch search = new MolSearch();

         // tuple = (target SMILES, query SMARTS); returns true when the query matches.
         public Boolean exec(Tuple tuple) throws IOException {
             if (tuple == null || tuple.size() < 2) return false;
             String target = (String) tuple.get(0);
             String query = (String) tuple.get(1);
             try {
                 Molecule queryMol = MolImporter.importMol(query, "smarts");
                 search.setQuery(queryMol);
                 search.setTarget(MolImporter.importMol(target, "smiles"));
                 return search.isMatching();
             } catch (SearchException e) {
                 e.printStackTrace();
             }
             return false;
         }
     }
  11. Going beyond chunking?
     • All the preceding use cases are embarrassingly parallel
       – Chunking the input data and applying the same operation to each chunk
       – Very nice when you have a big cluster
     Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
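The "chunk the input and apply the same operation to each chunk" pattern described on this slide can be sketched on a single machine with Java parallel streams. This is only an illustration under stated assumptions: `ChunkingSketch`, `chunk` and `processInChunks` are hypothetical names, and the per-item operation is passed in as a function where a real application would call a cheminformatics routine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: embarrassingly parallel processing of independent chunks.
class ChunkingSketch {

    // Split the input into consecutive chunks of at most chunkSize items.
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    // Apply the same operation to every item, one chunk per parallel task.
    // No chunk depends on any other, so results are simply concatenated in input order.
    static <T, R> List<R> processInChunks(List<T> items, int chunkSize, Function<T, R> op) {
        return chunk(items, chunkSize).parallelStream()
                .map(c -> c.stream().map(op).collect(Collectors.toList()))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> smiles = List.of("CCO", "c1ccccc1", "CC(=O)O", "CCN");
        // String::length is a stand-in for a real per-molecule calculation.
        System.out.println(processInChunks(smiles, 2, String::length));
    }
}
```

Because each chunk is processed independently, the same structure maps directly onto cluster nodes or Hadoop map tasks; only the scheduling changes.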
  12. Going beyond chunking?
     • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
       – Doesn’t necessarily avoid the O(N^2) barrier
       – Bioisostere identification is one case that could be rephrased as a map-reduce problem
     • Map-Reduce Design Patterns
  13. Identifying MMPs
     • First step in identifying bioisosteres is to identify candidate matched molecular pairs
       – Naïve all pairs comparison
       – Predefined list of transformations
         • Birch et al, BMCL, 2009
       – Fragment intersection
         • Hussain et al, JCIM, 2010
       – MCS based approaches (e.g., WizePairZ)
         • Warner et al, JCIM, 2010
  14. Naïve Bioisostere evaluation
     N molecules, N(N-1)/2 comparisons ...
  15. Scaffold seeding
     Seed Fragment:
     Members:
  16. Scaffold seeded bioisosteres
     M(M-1)/2 comparisons within each scaffold series
  17. Seeded bioisosteres – MR style
     MAP
       • Do pairwise MCS analysis on scaffold series
       • For each pair output SMIRKS transform and the pair of SMILES
     REDUCE
       • Collect pairs of SMILES for a given SMIRKS
       • Store in DB, or
       • Filter by activity, or
       • …
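The division of labour between MAP and REDUCE on this slide can be sketched in plain Java. This is not the authors' implementation: `SeededPairsSketch` is a hypothetical name, and the step that derives a SMIRKS transform from a pair of molecules (the MCS analysis) is represented by a caller-supplied function, here a toy placeholder with no chemical meaning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Illustrative only: the shape of the seeded map-reduce job, run in memory.
class SeededPairsSketch {

    // MAP: compare all M(M-1)/2 pairs within one scaffold series and emit
    // {transform, smilesA, smilesB}. transformOf is an assumed stand-in for the
    // MCS-based derivation of a SMIRKS; it may return null when no transform is found.
    static List<String[]> map(List<String> series, BinaryOperator<String> transformOf) {
        List<String[]> emitted = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                String a = series.get(i);
                String b = series.get(j);
                String transform = transformOf.apply(a, b);
                if (transform != null) emitted.add(new String[] {transform, a, b});
            }
        }
        return emitted;
    }

    // REDUCE: collect the pairs of SMILES observed for each transform.
    static Map<String, List<String>> reduce(List<String[]> emitted) {
        Map<String, List<String>> byTransform = new TreeMap<>();
        for (String[] e : emitted) {
            byTransform.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1] + " " + e[2]);
        }
        return byTransform;
    }

    public static void main(String[] args) {
        // Toy placeholder: labels a pair by its string-length difference, not by chemistry.
        BinaryOperator<String> toy = (a, b) -> "toy:" + (b.length() - a.length());
        List<String> series = List.of("CCO", "CCN", "CCCl");
        reduce(map(series, toy)).forEach((t, pairs) -> System.out.println(t + " -> " + pairs));
    }
}
```

Keying the map output by transform is what lets the reducer see every pair that shares a transformation, which is the point where storage or activity filtering would be applied.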
  18. Does seeding help?
     • Doesn’t bypass the O(N^2) barrier – does reduce the constant
     • Depends on how many scaffolds and the number of members for each scaffold
     • Certainly useful when there are a few members per scaffold
     • Highly populated scaffolds can throw things off
     [Figure: log number of pairwise comparisons vs. log number of molecules, by method (all, seeded.7, seeded.21, seeded.100)]
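The comparison counts behind this argument follow directly from the N(N-1)/2 and M(M-1)/2 formulas used in the talk. A minimal sketch of the arithmetic is given below; `ComparisonCountSketch` is a hypothetical name, and giving every scaffold exactly 7 members is a simplifying assumption made only to show the calculation.

```java
import java.util.Arrays;

// Illustrative only: all-pairs versus scaffold-seeded comparison counts.
class ComparisonCountSketch {

    // Naive evaluation: every molecule against every other, N(N-1)/2.
    static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    // Seeded evaluation: pairs are only formed within a scaffold series,
    // so the total is the sum of M(M-1)/2 over the scaffolds.
    static long seededPairs(long[] membersPerScaffold) {
        long total = 0;
        for (long m : membersPerScaffold) {
            total += allPairs(m);
        }
        return total;
    }

    public static void main(String[] args) {
        long molecules = 235_693L;   // unique molecules reported in the talk
        int scaffolds = 231_875;     // scaffolds reported in the talk
        long[] sizes = new long[scaffolds];
        Arrays.fill(sizes, 7L);      // assumption: every scaffold has the average 7 members
        System.out.println("all pairs:    " + allPairs(molecules));
        System.out.println("seeded pairs: " + seededPairs(sizes));
    }
}
```

A single very large series dominates the seeded total, since its term still grows quadratically; this is the sense in which highly populated scaffolds throw things off.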
  19. Data
     • Exhaustively fragmented ChEMBL 13
     • Identified scaffolds with N members ! 1.8 N scaffold
     • Ended up with 231,875 scaffolds
       – Covers 235,693 unique molecules
       – Average of 7 members per scaffold
       – 95% of scaffolds had < 21 members
       – 99.5% had < 74 members
         • The remaining 0.5% are a bit problematic
     [Figure: log comparisons by method (All, Seeded)]
  20. Timing experiments
     • Selected 50 scaffolds with 10 or fewer members
     • Configured so as to have ~ 5 maps
     • Effective running time for the entire job is 3.8 min on Hadoop
       – Only needed 5 of 8 map slots on our “cluster”
     • Takes ~ 6 min without Hadoop
     [Figure: time (s) by job number, jobs 1 to 5]
  21. Timing experiments
     • Selected 1000 scaffolds with 20 or fewer members
       – Ran with 10 scaffolds / map
     • Hadoop run time was ~ 2 hr
       – Most maps were fast (< 20 sec)
     • Serial evaluation would be > 7 hr
     [Figure: number of jobs vs. log time (s)]
  22. A M-R workflow
     • We’re currently focused on just the MMP step as a MR example
     • Could also include fragmentation step as part of the workflow
       – But a pre-calculated set of scaffolds is more sensible
     • Store transformations and members in HBase
     • Link with activity data and apply structure & activity filters on candidate pairs
  23. What Hadoop is not for
     • Doesn’t replace an actual database
     • It’s not uniformly fast or efficient
     • Not good for ad hoc or real-time analysis
     • Generally not effective unless dealing with massive datasets
     • Not all algorithms are amenable to the map-reduce method
  24. Conclusions
     • Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
       – Remotely hosted
       – Embarrassingly parallel / chunked
       – Map/reduce
     • Ability to process larger structure collections lets us explore more chemical space
     • “Big data” isn’t really that big in chemistry
  25. Conclusions
     • Q: But are cheminformatics problems really big enough to justify all of this?
     • A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
     • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
     • A: Yes – especially when we consider problems with a combinatorial flavor
  26. https://