Cloudy with a Touch of Cheminformatics
Published in: Technology, Business

  • 1. Cloudy with a Touch of Cheminformatics
      Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
      NIH Center for Advancing Translational Science
      Chemaxon UGM, September 26th, 2012, Wellesley, MA
  • 2. Parallel computing in the cloud
      • Modern cloud vendors make provisioning compute resources easy
          – Allows one to handle unpredictable loads easily
          – Pay only for what you need
      • Chemistry applications don’t usually have very dynamic loads
      • But large scale resources are an opportunity for large scale (parallel) computations
  • 3. All HPC is not equal
      Legacy HPC
      • Use cloud resources in the same way as a local cluster
      • MIT StarCluster makes this easy to do
      Cloudy HPC
      • Make use of cloud capabilities
      • Old algorithms, new infrastructure
      • Spot instances, SNS, SQS, SimpleDB, S3, etc.
      Big Data HPC
      • Huge datasets
      • Candidates for map-reduce
      • Involves algorithm (re)design
      http://-life-science-informatics-to-the-cloud
  • 4. Big data & cheminformatics
      • Computation over large chemical databases
          – PubChem, ChEMBL, GDB-13, …
      • What types of computations?
          – Searches (substructure, pharmacophore, …)
          – QSAR models & predictions over large data
      • Fundamentally, “big chemical data” lets us explore larger chemical spaces
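  • 5. Map-Reduce
      [Diagram: input splits 0 to 2 each feed a Map task; map output is copied, sorted and merged, then passed to Reduce tasks that write output parts 0 and 1]
      Map: $(K_1, V_1) \rightarrow \mathrm{list}(K_2, V_2)$
      Reduce: $(K_2, \mathrm{list}(V_2)) \rightarrow \mathrm{list}(K_3, V_3)$
      Tom White, Hadoop: The Definitive Guide, 3rd Ed., O’Reilly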
  • 6. Counting atoms
      • The chemical version of the word counting task
      Input: arbitrary line numbers (K1) and SMILES (V1)
          1, Nc1ccc2ncccc2c1N
          2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
          ...
          152366, Nc1ccc2ncccc2c1N
      MAP emits symbol (K2) and occurrence (V2)
          N, 1
          N, 1
          N, 1
          N, 1
          ...
      Grouped by symbol (K2) into an atom list (V2)
          N, list(1,1,1,1,...)
          C, list(1,1,1,1,...)
      REDUCE emits symbol (K3) and atom count (V3)
          N, 100
          C, 5684
          ...
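The atom counting flow on this slide can be sketched in plain Java without a Hadoop cluster. This is a minimal illustration, not the authors' code: the class name `AtomCountSketch` is hypothetical, the shuffle is simulated with an in-memory map, and the SMILES tokenizer is a deliberately naive assumption that reads element symbols only (implicit hydrogens are not counted, and two-letter aromatic bracket atoms are not handled).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AtomCountSketch {

    /** Map: (line number K1, SMILES V1) -> symbols K2, each standing for the pair (symbol, 1). */
    static List<String> map(long lineNumber, String smiles) {
        List<String> symbols = new ArrayList<>();
        int n = smiles.length();
        int i = 0;
        while (i < n) {
            char c = smiles.charAt(i);
            if (c == '[') {
                // Bracket atom: skip an optional isotope, read the element, jump to ']'.
                int j = i + 1;
                while (j < n && Character.isDigit(smiles.charAt(j))) j++;
                if (j < n && Character.isLetter(smiles.charAt(j))) {
                    char first = smiles.charAt(j);
                    String symbol = String.valueOf(Character.toUpperCase(first));
                    if (Character.isUpperCase(first) && j + 1 < n
                            && Character.isLowerCase(smiles.charAt(j + 1))) {
                        symbol += smiles.charAt(j + 1);
                    }
                    symbols.add(symbol);
                }
                int close = smiles.indexOf(']', i);
                i = (close < 0) ? n : close + 1;
            } else if (smiles.startsWith("Cl", i) || smiles.startsWith("Br", i)) {
                symbols.add(smiles.substring(i, i + 2));
                i += 2;
            } else {
                // Remaining letters are single-character atoms; aromatic ones are lower case.
                if (Character.isLetter(c)) symbols.add(String.valueOf(Character.toUpperCase(c)));
                i++;
            }
        }
        return symbols;
    }

    /** Reduce: (symbol K2, list of ones V2) -> total count V3 for that symbol. */
    static int reduce(String symbol, List<Integer> occurrences) {
        int total = 0;
        for (int v : occurrences) total += v;
        return total;
    }

    /** Runs map, an in-memory stand-in for the shuffle, and reduce over all input lines. */
    static Map<String, Integer> run(List<String> smilesLines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        long lineNumber = 0;
        for (String smiles : smilesLines) {
            lineNumber++;
            for (String symbol : map(lineNumber, smiles)) {
                shuffled.computeIfAbsent(symbol, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Nc1ccc2ncccc2c1N", "Cl.CC1CCc2nc3ccccc3c(C)c2C1")));
    }
}
```

On a real cluster the grouping step is done by the framework between the map and reduce phases; a cheminformatics toolkit would normally replace the hand-written tokenizer.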
  • 7. The Hadoop ecosystem
      [Diagram of the Hadoop stack: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama and Hive, built on the Map Reduce Engine, the Hadoop Distributed Filesystem and Hadoop Common]
      Based on http://ticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
  • 8. Cheminformatics on Hadoop
      • Hadoop and Atom Counting
      • Hadoop and SD Files
      • Cheminformatics, Hadoop and EC2
      • Pig and Cheminformatics
      But are cheminformatics problems really big enough to justify all of this?
  • 9. Simplifying Hadoop applications
      • Raw Hadoop programs can be tedious to write
      SMARTS based substructure search:

      package gov.nih.ncgc.hadoop;

      import chemaxon.formats.MolFormatException;
      import chemaxon.formats.MolImporter;
      import chemaxon.license.LicenseManager;
      import chemaxon.license.LicenseProcessingException;
      import;
      import;
      import chemaxon.struc.Molecule;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.filecache.DistributedCache;
      import org.apache.hadoop.fs.Path;
      import;
      import;
      import;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;
      import org.apache.hadoop.mapred.TextInputFormat;
      import org.apache.hadoop.mapred.TextOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      import;
      import;
      import;
      import java.util.Iterator;

      /**
       * SMARTS searching over a set of files using Hadoop.
       *
       * @author Rajarshi Guha
       */
      public class SmartsSearch extends Configured implements Tool {
          private final static IntWritable one = new IntWritable(1);
          private final static IntWritable zero = new IntWritable(0);

          public static class MoleculeMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
              private String pattern = null;
              private MolSearch search;

              public void configure(JobConf job) {
                  try {
                      Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                      BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                      StringBuilder license = new StringBuilder();
                      String line;
                      while ((line = reader.readLine()) != null) license.append(line);
                      reader.close();
                      LicenseManager.setLicense(license.toString());
                  } catch (IOException e) {
                  } catch (LicenseProcessingException e) {
                  }
                  pattern = job.getStrings("pattern")[0];
                  search = new MolSearch();
                  try {
                      Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                      search.setQuery(queryMol);
                  } catch (MolFormatException e) {
                  }
              }

              final static IntWritable one = new IntWritable(1);
              Text matches = new Text();

              public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                  Molecule mol = MolImporter.importMol(value.toString());
                  matches.set(mol.getName());
                  search.setTarget(mol);
                  try {
                      if (search.isMatching()) {
                          output.collect(matches, one);
                      } else {
                          output.collect(matches, zero);
                      }
                  } catch (SearchException e) {
                  }
              }
          }

          public static class SmartsMatchReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
              private IntWritable result = new IntWritable();

              public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
                  while (values.hasNext()) {
                      if ( == 0) {
                          result.set(1);
                          output.collect(key, result);
                      }
                  }
              }
          }

          public int run(String[] args) throws Exception {
              JobConf jobConf = new JobConf(getConf(), HeavyAtomCount.class);
              jobConf.setJobName("smartsSearch");

              jobConf.setOutputKeyClass(Text.class);
              jobConf.setOutputValueClass(IntWritable.class);

              jobConf.setMapperClass(MoleculeMapper.class);
              jobConf.setCombinerClass(SmartsMatchReducer.class);
              jobConf.setReducerClass(SmartsMatchReducer.class);

              jobConf.setInputFormat(TextInputFormat.class);
              jobConf.setOutputFormat(TextOutputFormat.class);

              jobConf.setNumMapTasks(5);

              if (args.length != 4) {
                  System.err.println("Usage: ss <in> <out> <pattern> <license file>");
                  System.exit(2);
              }

              FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
              FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
              jobConf.setStrings("pattern", args[2]);

              // make the license file available via dist cache
              DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

              JobClient.runJob(jobConf);
              return 0;
          }

          public static void main(String[] args) throws Exception {
              int res = Configuration(), new SmartsSearch(), args);
          }
      }
  • 10. Pig & Pig Latin
      • Pig Latin programs are much simpler to write and get translated to Hadoop code
      • SQL-like, requires a UDF to be implemented to perform non-standard tasks
      SMARTS search in Pig Latin:

      A = load 'medium.smi' as (smiles:chararray);
      B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
      store B into 'output.txt';

      UDF for SMARTS search:

      package gov.nih.ncgc.hadoop.pig;

      import chemaxon.formats.MolImporter;
      import;
      import;
      import chemaxon.struc.Molecule;
      import org.apache.pig.FilterFunc;
      import;
      import;

      public class SMATCH extends FilterFunc {
          // Created up front so exec() never dereferences a null searcher.
          static MolSearch search = new MolSearch();

          public Boolean exec(Tuple tuple) throws IOException {
              if (tuple == null || tuple.size() < 2) return false;
              String target = (String) tuple.get(0);
              String query = (String) tuple.get(1);
              try {
                  Molecule queryMol = MolImporter.importMol(query, "smarts");
                  search.setQuery(queryMol);
                  search.setTarget(MolImporter.importMol(target, "smiles"));
                  return search.isMatching();
              } catch (SearchException e) {
                  e.printStackTrace();
              }
              return false;
          }
      }
  • 11. Going beyond chunking?
      • All the preceding use cases are embarrassingly parallel
          – Chunking the input data and applying the same operation to each chunk
          – Very nice when you have a big cluster
      Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
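The "chunk the input and apply the same operation to each chunk" pattern can be sketched on a single machine with a thread pool standing in for the cluster. This is an illustration under stated assumptions: `ChunkedFilterSketch` and its methods are hypothetical names, and the predicate passed in is a placeholder for whatever per-molecule test (for example a SMARTS match from a toolkit) would be used in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

class ChunkedFilterSketch {

    /** Splits the input into consecutive chunks of at most chunkSize items. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be positive");
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    /** Applies the same filter to every chunk in parallel and concatenates the hits in input order. */
    static <T> List<T> filterInChunks(List<T> items, int chunkSize, int workers, Predicate<T> keep) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<T>>> pending = new ArrayList<>();
            for (List<T> part : chunk(items, chunkSize)) {
                Callable<List<T>> task = () -> {
                    List<T> hits = new ArrayList<>();
                    for (T item : part) {
                        if (keep.test(item)) hits.add(item);
                    }
                    return hits;
                };
                pending.add(pool.submit(task));
            }
            List<T> all = new ArrayList<>();
            for (Future<List<T>> f : pending) all.addAll(f.get());
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder test: keep strings containing "N" (not a real substructure search).
        List<String> smiles = Arrays.asList("CCO", "CCN", "c1ccccc1", "NC(=O)C");
        System.out.println(filterInChunks(smiles, 2, 2, s -> s.contains("N")));
    }
}
```

No chunk depends on any other, which is what makes the preceding use cases embarrassingly parallel; the open question on the slide is about algorithms where the tasks do interact.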
  • 12. Going beyond chunking?
      • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
          – Doesn’t necessarily avoid the $O(N^2)$ barrier
          – Bioisostere identification is one case that could be rephrased as a map-reduce problem
      • Map-Reduce Design Patterns
  • 13. Identifying MMPs
      • First step in identifying bioisosteres is to identify candidate matched molecular pairs
          – Naïve all pairs comparison
          – Predefined list of transformations
              • Birch et al, BMCL, 2009
          – Fragment intersection
              • Hussain et al, JCIM, 2010
          – MCS based approaches (e.g., WizePairZ)
              • Warner et al, JCIM, 2010
  • 14. Naïve bioisostere evaluation
      N molecules require $N(N-1)/2$ comparisons
      ...
  • 15. Scaffold seeding
      Seed fragment:
      Members:
  • 16. Scaffold seeded bioisosteres
      $M(M-1)/2$ comparisons
      $M(M-1)/2$ comparisons
  • 17. Seeded bioisosteres – MR style
      MAP
      • Do pairwise MCS analysis on scaffold series
      • For each pair output SMIRKS transform and the pair of SMILES
      REDUCE
      • Collect pairs of SMILES for a given SMIRKS
      • Store in DB, or
      • Filter by activity, or
      • …
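The MAP and REDUCE roles on this slide can be sketched in plain Java. This is an illustration only: `SeededPairsSketch` is a hypothetical name, and `toyTransform` is a crude placeholder that strips a shared string prefix. It is not an MCS calculation and does not produce real SMIRKS; a cheminformatics toolkit would supply that step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

class SeededPairsSketch {

    /** Placeholder for the MCS step: drop the shared prefix and report what differs. */
    static String toyTransform(String a, String b) {
        int p = 0;
        while (p < a.length() && p < b.length() && a.charAt(p) == b.charAt(p)) p++;
        if (p == 0) return null; // nothing in common, so no transform is emitted
        return a.substring(p) + ">>" + b.substring(p);
    }

    /** MAP: for one scaffold series, emit {transform, smilesA, smilesB} for each of the M(M-1)/2 pairs. */
    static List<String[]> map(List<String> members, BinaryOperator<String> transformOf) {
        List<String[]> emitted = new ArrayList<>();
        for (int i = 0; i < members.size(); i++) {
            for (int j = i + 1; j < members.size(); j++) {
                String a = members.get(i);
                String b = members.get(j);
                String transform = transformOf.apply(a, b);
                if (transform != null) emitted.add(new String[] {transform, a, b});
            }
        }
        return emitted;
    }

    /** REDUCE: collect all pairs of SMILES that share the same transform key. */
    static Map<String, List<String>> reduce(List<String[]> emitted) {
        Map<String, List<String>> byTransform = new TreeMap<>();
        for (String[] record : emitted) {
            byTransform.computeIfAbsent(record[0], k -> new ArrayList<>())
                       .add(record[1] + " / " + record[2]);
        }
        return byTransform;
    }

    public static void main(String[] args) {
        List<String[]> emitted = new ArrayList<>();
        // Two scaffold series; each would be handled by its own map task.
        emitted.addAll(map(Arrays.asList("c1ccccc1F", "c1ccccc1Cl", "c1ccccc1Br"),
                SeededPairsSketch::toyTransform));
        emitted.addAll(map(Arrays.asList("n1ccccc1F", "n1ccccc1Cl"),
                SeededPairsSketch::toyTransform));
        reduce(emitted).forEach((transform, pairs) -> System.out.println(transform + " -> " + pairs));
    }
}
```

Keying the map output by transform is what lets the reduce step gather the same replacement seen across different scaffolds, which is the point at which activity filters or a database write could be applied.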
  • 18. Does seeding help?
      • Doesn’t bypass the $O(N^2)$ barrier, but does reduce the constant
      • Depends on how many scaffolds there are and the number of members for each scaffold
      • Certainly useful when there are few members per scaffold
      • Highly populated scaffolds can throw things off
      [Plot: log number of pairwise comparisons vs. log number of molecules, by method: all, seeded.7, seeded.21, seeded.100]
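The effect of seeding on the comparison count can be checked with a few lines of arithmetic. The sketch below uses hypothetical names (`ComparisonCountSketch`, `pairs`, `seededPairs`) and assumes every scaffold has the same number of members, which is a simplification. The figures in `main` are the ones reported on the Data slide (231,875 scaffolds, 235,693 unique molecules, an average of 7 members per scaffold).

```java
class ComparisonCountSketch {

    /** Number of unordered pairs among n items: n(n-1)/2. */
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    /** All-pairs cost over the whole collection. */
    static long allPairs(long molecules) {
        return pairs(molecules);
    }

    /** Seeded cost, assuming every scaffold has the same number of members. */
    static long seededPairs(long scaffolds, long membersPerScaffold) {
        return scaffolds * pairs(membersPerScaffold);
    }

    public static void main(String[] args) {
        long all = allPairs(235_693L);
        long seeded = seededPairs(231_875L, 7L);
        System.out.println("all pairs:    " + all);
        System.out.println("seeded pairs: " + seeded);
        System.out.println("ratio:        " + (double) all / seeded);
    }
}
```

Because the per-scaffold cost grows quadratically with the number of members, using the average understates the seeded total: a handful of highly populated scaffolds can contribute most of the comparisons.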
  • 19. Data
      • Exhaustively fragmented ChEMBL 13
      • Identified scaffolds with N members ! 1.8 N scaffold
      • Ended up with 231,875 scaffolds
          – Covers 235,693 unique molecules
          – Average of 7 members per scaffold
          – 95% of scaffolds had < 21 members
          – 99.5% had < 74 members
              • The 0.05% are a bit problematic
      [Plot: log comparisons by method (All vs. Seeded)]
  • 20. Timing experiments
      • Selected 50 scaffolds with 10 or fewer members
      • Configured so as to have ~5 maps
      • Effective running time for the entire job is 3.8 min on Hadoop
          – Only needed 5 of 8 map slots on our “cluster”
      • Takes ~6 min without Hadoop
      [Plot: time (s) for each of job numbers 1 to 5]
  • 21. Timing experiments
      • Selected 1000 scaffolds with 20 or fewer members
          – Ran with 10 scaffolds / map
      • Hadoop run time was ~2 hr
          – Most maps were fast (< 20 sec)
      • Serial evaluation would be > 7 hr
      [Plot: number of jobs vs. log time (s)]
  • 22. An M-R workflow
      • We’re currently focused on just the MMP step as an MR example
      • Could also include the fragmentation step as part of the workflow
          – But a pre-calculated set of scaffolds is more sensible
      • Store transformations and members in HBase
      • Link with activity data and apply structure & activity filters on candidate pairs
  • 23. What Hadoop is not for
      • Doesn’t replace an actual database
      • It’s not uniformly fast or efficient
      • Not good for ad hoc or real-time analysis
      • Generally not effective unless dealing with massive datasets
      • Not all algorithms are amenable to the map-reduce method
  • 24. Conclusions
      • Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
          – Remotely hosted
          – Embarrassingly parallel / chunked
          – Map/reduce
      • Ability to process larger structure collections lets us explore more chemical space
      • “Big data” isn’t really that big in chemistry
  • 25. Conclusions
      • Q: But are cheminformatics problems really big enough to justify all of this?
      • A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
      • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
      • A: Yes – especially when we consider problems with a combinatorial flavor
  • 26. https://