  • 1. Cloudy with a Touch of Cheminformatics
    Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
    NIH Center for Advancing Translational Science
    ChemAxon UGM, September 26th, 2012, Wellesley, MA
  • 2. Parallel computing in the cloud
    • Modern cloud vendors make provisioning compute resources easy
      – Allows one to handle unpredictable loads easily
      – Pay only for what you need
    • Chemistry applications don't usually have very dynamic loads
    • But large-scale resources are an opportunity for large-scale (parallel) computations
  • 3. All HPC is not equal
    • Legacy HPC
      – Use cloud resources in the same way as a local cluster
      – MIT StarCluster makes this easy to do
    • Cloudy HPC
      – Make use of cloud capabilities
      – Old algorithms, new infrastructure
      – Spot instances, SNS, SQS, SimpleDB, S3, etc.
    • Big Data HPC
      – Huge datasets
      – Candidates for map-reduce
      – Involves algorithm (re)design
    http://[…]-life-science-informatics-to-the-cloud
  • 4. Big data & cheminformatics
    • Computation over large chemical databases
      – PubChem, ChEMBL, GDB-13, …
    • What types of computations?
      – Searches (substructure, pharmacophore, …)
      – QSAR models & predictions over large data
    • Fundamentally, "big chemical data" lets us explore larger chemical spaces
  • 5. Map-Reduce
    [Figure: input splits 0 to 2 each feed a map task; map output is sorted, copied and merged, then passed to reduce tasks that write output parts 0 and 1]
    map: (K1, V1) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)
    Tom White, Hadoop: The Definitive Guide, 3rd ed., O'Reilly
  • 6. Counting atoms
    • The chemical version of the word counting task
    • Input: arbitrary line numbers (K1), SMILES (V1)
        1, Nc1ccc2ncccc2c1N
        2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
        ...
        152366, Nc1ccc2ncccc2c1N
    • MAP output: atom symbol (K2), occurrence (V2)
        N 1
        N 1
        ...
    • REDUCE input: atom symbol (K2), list(V2)
        N, list(1,1,1,1,...)
        C, list(1,1,1,1,...)
    • REDUCE output: atom symbol (K3), count (V3)
        N,100
        C,5684
        ...
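The atom-counting task can be sketched without a Hadoop cluster to show what the map and reduce steps each do. This is a minimal illustration, not the authors' implementation: the class name `AtomCountSketch` and the regex tokenizer are assumptions, and the tokenizer only handles the organic-subset symbols (bracket atoms such as `[Na+]` are skipped) where a real job would parse each SMILES with a cheminformatics toolkit.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomCountSketch {
    // Simplified tokenizer (an assumption): bracket atoms are matched first so they
    // can be skipped, then two-letter halogens, then one-letter organic-subset atoms.
    private static final Pattern ATOM =
        Pattern.compile("\\[[^\\]]*\\]|Cl|Br|[BCNOPSFI]|[bcnops]");

    /** Map step: one SMILES (V1) -> list of (symbol K2, occurrence V2 = 1). */
    public static List<Map.Entry<String, Integer>> map(String smiles) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        Matcher m = ATOM.matcher(smiles);
        while (m.find()) {
            String tok = m.group();
            if (tok.startsWith("[")) continue; // bracket atoms are ignored in this sketch
            // Aromatic atoms are lower case in SMILES; fold them onto the element symbol.
            String symbol = tok.length() == 1 ? tok.toUpperCase() : tok;
            out.add(new AbstractMap.SimpleEntry<>(symbol, 1));
        }
        return out;
    }

    /** Reduce step: sum the occurrences per symbol (K3 = symbol, V3 = count). */
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    /** Runs map over every input line, then a single reduce over all emitted pairs. */
    public static Map<String, Integer> countAtoms(List<String> smilesList) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String s : smilesList) pairs.addAll(map(s));
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(countAtoms(Arrays.asList(
            "Nc1ccc2ncccc2c1N",
            "Cl.CC1CCc2nc3ccccc3c(C)c2C1")));
    }
}
```

On a cluster the framework performs the grouping by key between the two steps; here `reduce` does the grouping and the summing in one pass.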
  • 7. The Hadoop ecosystem
    [Figure: Hadoop ecosystem stack. Components shown: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, Hive, Map Reduce Engine, Hadoop Distributed Filesystem, Hadoop Common]
    Based on http://[…]ticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
  • 8. Cheminformatics on Hadoop
    • Hadoop and Atom Counting
    • Hadoop and SD Files
    • Cheminformatics, Hadoop and EC2
    • Pig and Cheminformatics
    But are cheminformatics problems really big enough to justify all of this?
  • 9. Simplifying Hadoop applications
    • Raw Hadoop programs can be tedious to write

    SMARTS based substructure search:

        package gov.nih.ncgc.hadoop;

        import chemaxon.formats.MolFormatException;
        import chemaxon.formats.MolImporter;
        import chemaxon.license.LicenseManager;
        import chemaxon.license.LicenseProcessingException;
        import chemaxon.sss.search.MolSearch;
        import chemaxon.sss.search.SearchException;
        import chemaxon.struc.Molecule;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.filecache.DistributedCache;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.hadoop.mapred.TextInputFormat;
        import org.apache.hadoop.mapred.TextOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.Iterator;

        /**
         * SMARTS searching over a set of files using Hadoop.
         *
         * @author Rajarshi Guha
         */
        public class SmartsSearch extends Configured implements Tool {
            private final static IntWritable one = new IntWritable(1);
            private final static IntWritable zero = new IntWritable(0);

            public static class MoleculeMapper extends MapReduceBase
                    implements Mapper<LongWritable, Text, Text, IntWritable> {
                private String pattern = null;
                private MolSearch search;
                private Text matches = new Text();

                // Runs once per map task: load the license from the distributed
                // cache and compile the SMARTS query.
                public void configure(JobConf job) {
                    try {
                        Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                        BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                        StringBuilder license = new StringBuilder();
                        String line;
                        while ((line = reader.readLine()) != null) license.append(line);
                        reader.close();
                        LicenseManager.setLicense(license.toString());
                    } catch (IOException e) {
                        // errors are silently ignored here
                    } catch (LicenseProcessingException e) {
                        // errors are silently ignored here
                    }

                    pattern = job.getStrings("pattern")[0];
                    search = new MolSearch();
                    try {
                        Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                        search.setQuery(queryMol);
                    } catch (MolFormatException e) {
                        // errors are silently ignored here
                    }
                }

                // Emits (molecule name, 1) for a match and (molecule name, 0) otherwise.
                public void map(LongWritable key, Text value,
                                OutputCollector<Text, IntWritable> output,
                                Reporter reporter) throws IOException {
                    Molecule mol = MolImporter.importMol(value.toString());
                    matches.set(mol.getName());
                    search.setTarget(mol);
                    try {
                        if (search.isMatching()) {
                            output.collect(matches, one);
                        } else {
                            output.collect(matches, zero);
                        }
                    } catch (SearchException e) {
                        // errors are silently ignored here
                    }
                }
            }

            public static class SmartsMatchReducer extends MapReduceBase
                    implements Reducer<Text, IntWritable, Text, IntWritable> {
                private IntWritable result = new IntWritable();

                public void reduce(Text key, Iterator<IntWritable> values,
                                   OutputCollector<Text, IntWritable> output,
                                   Reporter reporter) throws IOException {
                    while (values.hasNext()) {
                        // The left-hand side of this comparison is missing from the
                        // slide text; comparing against `one` (keep only matches) is
                        // an assumption.
                        if (values.next().compareTo(one) == 0) {
                            result.set(1);
                            output.collect(key, result);
                        }
                    }
                }
            }

            public int run(String[] args) throws Exception {
                JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
                jobConf.setJobName("smartsSearch");

                jobConf.setOutputKeyClass(Text.class);
                jobConf.setOutputValueClass(IntWritable.class);

                jobConf.setMapperClass(MoleculeMapper.class);
                jobConf.setCombinerClass(SmartsMatchReducer.class);
                jobConf.setReducerClass(SmartsMatchReducer.class);

                jobConf.setInputFormat(TextInputFormat.class);
                jobConf.setOutputFormat(TextOutputFormat.class);

                jobConf.setNumMapTasks(5);

                if (args.length != 4) {
                    System.err.println("Usage: ss <in> <out> <pattern> <license file>");
                    System.exit(2);
                }

                FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
                FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
                jobConf.setStrings("pattern", args[2]);

                // make the license file available via the distributed cache
                DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

                JobClient.runJob(jobConf);
                return 0;
            }

            public static void main(String[] args) throws Exception {
                int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
            }
        }
  • 10. Pig & Pig Latin
    • Pig Latin programs are much simpler to write and get translated to Hadoop code
    • SQL-like, requires a UDF to be implemented to perform non-standard tasks

    SMARTS search in Pig Latin:

        A = load 'medium.smi' as (smiles:chararray);
        B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
        store B into 'output.txt';

    UDF for SMARTS search:

        package gov.nih.ncgc.hadoop.pig;

        import chemaxon.formats.MolImporter;
        import chemaxon.sss.search.MolSearch;
        import chemaxon.sss.search.SearchException;
        import chemaxon.struc.Molecule;
        import org.apache.pig.FilterFunc;
        import org.apache.pig.data.Tuple;

        import java.io.IOException;

        public class SMATCH extends FilterFunc {
            // created once and reused across calls
            static MolSearch search = new MolSearch();

            // tuple = (target SMILES, query SMARTS); returns true when the query matches
            public Boolean exec(Tuple tuple) throws IOException {
                if (tuple == null || tuple.size() < 2) return false;
                String target = (String) tuple.get(0);
                String query = (String) tuple.get(1);
                try {
                    Molecule queryMol = MolImporter.importMol(query, "smarts");
                    search.setQuery(queryMol);
                    search.setTarget(MolImporter.importMol(target, "smiles"));
                    return search.isMatching();
                } catch (SearchException e) {
                    e.printStackTrace();
                }
                return false;
            }
        }
  • 11. Going beyond chunking?
    • All the preceding use cases are embarrassingly parallel
      – Chunking the input data and applying the same operation to each chunk
      – Very nice when you have a big cluster
    Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
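Chunked, embarrassingly parallel processing can be sketched on a single machine with Java parallel streams. This is an illustration under stated assumptions rather than anything from the talk: `ChunkingSketch`, `chunk` and `processInParallel` are hypothetical names, and the per-item operation is a toy string test standing in for a real structure search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

public class ChunkingSketch {
    /** Split the input into consecutive chunks of at most chunkSize items. */
    public static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    /** Apply the same per-item operation to every chunk in parallel and add up the results. */
    public static <T> int processInParallel(List<T> items, int chunkSize, ToIntFunction<T> op) {
        return chunk(items, chunkSize).parallelStream()
                .mapToInt(c -> c.stream().mapToInt(op).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<String> smiles = Arrays.asList("CCO", "c1ccccc1", "CC(=O)N", "NCCN", "CCN");
        // Toy operation (hypothetical): 1 if the string contains the character 'N', else 0.
        int hits = processInParallel(smiles, 2, s -> s.contains("N") ? 1 : 0);
        System.out.println(hits + " of " + smiles.size() + " entries contain 'N'");
    }
}
```

No chunk depends on any other, which is why this pattern scales with cluster size but does not change the algorithm itself.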
  • 12. Going beyond chunking?
    • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
      – Doesn't necessarily avoid the O(N²) barrier
      – Bioisostere identification is one case that could be rephrased as a map-reduce problem
    • Map-Reduce Design Patterns
  • 13. Identifying MMPs
    • First step in identifying bioisosteres is to identify candidate matched molecular pairs
      – Naïve all pairs comparison
      – Predefined list of transformations
        • Birch et al, BMCL, 2009
      – Fragment intersection
        • Hussain et al, JCIM, 2010
      – MCS based approaches (e.g., WizePairZ)
        • Warner et al, JCIM, 2010
  • 14. Naïve bioisostere evaluation
    • N molecules
    • N(N-1)/2 comparisons
  • 15. Scaffold seeding
    [Figure: a seed fragment and its member molecules]
  • 16. Scaffold seeded bioisosteres
    • M(M-1)/2 comparisons within each scaffold series of M members
  • 17. Seeded bioisosteres, MR style
    • MAP
      – Do pairwise MCS analysis on scaffold series
      – For each pair, output the SMIRKS transform and the pair of SMILES
    • REDUCE
      – Collect pairs of SMILES for a given SMIRKS
      – Store in DB, or
      – Filter by activity, or
      – …
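The shape of this map and reduce pair can be sketched in plain Java. Everything named here is hypothetical: `SeededPairsSketch` is an illustrative class, and `toyTransform` is only a stand-in for the MCS analysis and SMIRKS generation, which in a real job would come from a cheminformatics toolkit. The toy version assumes each member is the scaffold string followed by a substituent.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

public class SeededPairsSketch {
    /** MAP: for one scaffold series, emit (transform key, pair of members) for every pair. */
    public static List<Map.Entry<String, String[]>> map(
            List<String> series, BinaryOperator<String> transform) {
        List<Map.Entry<String, String[]>> out = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                String a = series.get(i);
                String b = series.get(j);
                out.add(new AbstractMap.SimpleEntry<>(transform.apply(a, b), new String[] {a, b}));
            }
        }
        return out;
    }

    /** REDUCE: collect all pairs that share the same transform key. */
    public static Map<String, List<String[]>> reduce(List<Map.Entry<String, String[]>> emitted) {
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (Map.Entry<String, String[]> e : emitted) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    /** Toy stand-in for MCS + SMIRKS: drops a shared prefix of the given length. */
    public static BinaryOperator<String> toyTransform(int scaffoldLength) {
        return (a, b) -> a.substring(scaffoldLength) + ">>" + b.substring(scaffoldLength);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String[]>> emitted = new ArrayList<>();
        // Two hypothetical scaffold series; both scaffold strings are 8 characters long.
        emitted.addAll(map(Arrays.asList("c1ccccc1F", "c1ccccc1Cl", "c1ccccc1Br"), toyTransform(8)));
        emitted.addAll(map(Arrays.asList("C1CCCCC1F", "C1CCCCC1Cl"), toyTransform(8)));
        for (Map.Entry<String, List<String[]>> e : reduce(emitted).entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue().size() + " pair(s)");
        }
    }
}
```

Keying the map output on the transformation is what lets the reduce step see every pair that exhibits the same change, regardless of which scaffold series produced it.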
  • 18. Does seeding help?
    • Doesn't bypass the O(N²) barrier, but does reduce the constant
    • Depends on how many scaffolds there are and the number of members for each scaffold
    • Certainly useful when there are a few members per scaffold
    • Highly populated scaffolds can throw things off
    [Figure: log number of pairwise comparisons vs. log number of molecules, by method: all, seeded.7, seeded.21, seeded.100]
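The two comparison counts are easy to put side by side: N(N-1)/2 for the naïve approach against the sum of M(M-1)/2 over scaffolds for the seeded one. The sketch below is illustrative only. `PairCountSketch` is a hypothetical name, and the example assumes 1000 scaffolds of exactly 7 members each with no molecule shared between scaffolds, which simplifies the real data.

```java
import java.util.Arrays;

public class PairCountSketch {
    /** Naive all-pairs evaluation: N(N-1)/2 comparisons. */
    public static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Seeded evaluation: sum of M(M-1)/2 over the member count M of each scaffold. */
    public static long seededPairs(long[] membersPerScaffold) {
        long total = 0;
        for (long m : membersPerScaffold) {
            total += allPairs(m);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] sizes = new long[1000];
        Arrays.fill(sizes, 7); // assumed: every scaffold has exactly 7 members
        long molecules = 7 * 1000L; // assumed: scaffolds do not share molecules
        System.out.println("all pairs: " + allPairs(molecules));
        System.out.println("seeded:    " + seededPairs(sizes));
    }
}
```

Under these assumptions the naïve count is 24,496,500 and the seeded count is 21,000. A single scaffold with thousands of members would dominate the seeded sum, which is the sense in which highly populated scaffolds throw things off.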
  • 19. Data
    • Exhaustively fragmented ChEMBL 13
    • Identified scaffolds with N_members […] 1.8 N_scaffold
    • Ended up with 231,875 scaffolds
      – Covers 235,693 unique molecules
      – Average of 7 members per scaffold
      – 95% of scaffolds had < 21 members
      – 99.5% had < 74 members
        • The remaining 0.5% are a bit problematic
    [Figure: log number of comparisons by method: All vs. Seeded]
  • 20. Timing experiments
    • Selected 50 scaffolds with 10 or fewer members
    • Configured so as to have ~5 maps
    • Effective running time for the entire job is 3.8 min on Hadoop
      – Only needed 5 of 8 map slots on our "cluster"
    • Takes ~6 min without Hadoop
    [Figure: time (s) for each of job numbers 1 to 5]
  • 21. Timing experiments
    • Selected 1000 scaffolds with 20 or fewer members
      – Ran with 10 scaffolds / map
    • Hadoop run time was ~2 hr
      – Most maps were fast (< 20 sec)
    • Serial evaluation would be > 7 hr
    [Figure: histogram of number of jobs vs. log time (s)]
  • 22. An M-R workflow
    • We're currently focused on just the MMP step as an MR example
    • Could also include the fragmentation step as part of the workflow
      – But a pre-calculated set of scaffolds is more sensible
    • Store transformations and members in HBase
    • Link with activity data and apply structure & activity filters on candidate pairs
  • 23. What Hadoop is not for
    • Doesn't replace an actual database
    • It's not uniformly fast or efficient
    • Not good for ad hoc or real-time analysis
    • Generally not effective unless dealing with massive datasets
    • Not all algorithms are amenable to the map-reduce method
  • 24. Conclusions
    • Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
      – Remotely hosted
      – Embarrassingly parallel / chunked
      – Map/reduce
    • Ability to process larger structure collections lets us explore more chemical space
    • "Big data" isn't really that big in chemistry
  • 25. Conclusions
    • Q: But are cheminformatics problems really big enough to justify all of this?
    • A: Yes. Virtual libraries, integrating chemical structure with other types and scales of data
    • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
    • A: Yes, especially when we consider problems with a combinatorial flavor