Cloudy with a Touch of Cheminformatics

Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science

Chemaxon UGM
September 26th, 2012
Wellesley, MA
Parallel computing in the cloud

•  Modern cloud vendors make provisioning compute resources easy
   –  Allows one to handle unpredictable loads easily
   –  Pay only for what you need
•  Chemistry applications don't usually have very dynamic loads
•  But large-scale resources are an opportunity for large-scale (parallel) computations
All HPC is not equal

Legacy HPC
•  Use cloud resources in the same way as a local cluster
•  MIT StarCluster makes this easy to do

Cloudy HPC
•  Make use of cloud capabilities
•  Old algorithms, new infrastructure
•  Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC
•  Huge datasets
•  Candidates for map-reduce
•  Involves algorithm (re)design

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
  
Big data & cheminformatics

•  Computation over large chemical databases
   –  PubChem, ChEMBL, GDB-13, …
•  What types of computations?
   –  Searches (substructure, pharmacophore, …)
   –  QSAR models & predictions over large data
•  Fundamentally, "big chemical data" lets us explore larger chemical spaces
Map-Reduce

[Figure: input splits feed parallel map tasks; intermediate output is copied,
sorted, and merged before reduce tasks write the final output parts.]

   map:    (K1, V1) → list(K2, V2)
   reduce: (K2, list(V2)) → list(K3, V3)

Tom White, Hadoop: The Definitive Guide, 3rd Ed., O'Reilly
  
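The two signatures can be simulated in plain Java without any Hadoop at all, which makes the data flow easy to see. This is only a sketch: `MiniMapReduce` and its method names are hypothetical, and the "shuffle" is just in-memory grouping.

```java
import java.util.*;
import java.util.function.BiFunction;

// In-memory sketch of the two map-reduce signatures:
//   map:    (K1, V1)       -> list(K2, V2)
//   reduce: (K2, list(V2)) -> V3
public class MiniMapReduce {
    public static <K1, V1, K2, V2, V3> Map<K2, V3> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, V3> reduceFn) {
        // map phase: emit intermediate (K2, V2) pairs for every input record
        Map<K2, List<V2>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K1, V1> rec : input.entrySet()) {
            for (Map.Entry<K2, V2> emitted : mapFn.apply(rec.getKey(), rec.getValue())) {
                // "shuffle": group intermediate values by key
                grouped.computeIfAbsent(emitted.getKey(), k -> new ArrayList<>())
                       .add(emitted.getValue());
            }
        }
        // reduce phase: fold each (K2, list(V2)) down to a single V3
        Map<K2, V3> result = new LinkedHashMap<>();
        for (Map.Entry<K2, List<V2>> g : grouped.entrySet())
            result.put(g.getKey(), reduceFn.apply(g.getKey(), g.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // classic word count expressed in this shape
        Map<Integer, String> lines = new LinkedHashMap<>();
        lines.put(1, "the cat sat");
        lines.put(2, "the cat");
        Map<String, Integer> counts = run(lines,
                (k, line) -> {
                    List<Map.Entry<String, Integer>> out = new ArrayList<>();
                    for (String w : line.split(" ")) out.add(Map.entry(w, 1));
                    return out;
                },
                (word, ones) -> ones.size());
        System.out.println(counts); // {the=2, cat=2, sat=1}
    }
}
```

In real Hadoop the grouping is done by the framework across machines; here it is a single `LinkedHashMap`, which is enough to reason about what a job computes.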
Counting atoms

•  The chemical version of the word counting task

[Figure: arbitrary line numbers (K1) paired with SMILES (V1), e.g.
"1, Nc1ccc2ncccc2c1N", are mapped to (atom symbol (K2), occurrence (V2))
records; the reducer sums each symbol's occurrence list (K2, list(V2))
into final counts (K3, V3), e.g. N,100 and C,5684.]
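An in-memory sketch of this job, again without Hadoop. The regex tokenizer below is a hypothetical stand-in for a real SMILES parser such as ChemAxon's MolImporter, and only recognizes a handful of element symbols; it exists purely to illustrate the (symbol, 1) → summed-count data flow.

```java
import java.util.*;
import java.util.regex.*;

// Toy version of the atom-counting job: the "map" step emits (symbol, 1)
// for every atom token in a SMILES string; the "reduce" step sums them.
// NOTE: the regex is NOT a real SMILES parser -- it stands in for
// chemaxon.formats.MolImporter purely to show the data flow.
public class AtomCount {
    // two-letter symbols first so "Cl" is not read as C + l
    private static final Pattern ATOM =
            Pattern.compile("Cl|Br|[BCNOSPFI]|[bcnops]");

    public static Map<String, Integer> count(List<String> smiles) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : smiles) {                       // map phase
            Matcher m = ATOM.matcher(s);
            while (m.find()) {
                String sym = m.group();
                // fold aromatic (lowercase) atoms into their element symbol
                sym = Character.toUpperCase(sym.charAt(0)) + sym.substring(1);
                counts.merge(sym, 1, Integer::sum);     // reduce phase (summing)
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Nc1ccc2ncccc2c1N"))); // {C=9, N=3}
    }
}
```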
The Hadoop ecosystem

[Figure: the stack builds on Hadoop Common, with the Hadoop Distributed
Filesystem and the Map Reduce Engine at the core, surrounded by Chukwa,
Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, and Hive.]

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem
  
Cheminformatics on Hadoop

•  Hadoop and Atom Counting
•  Hadoop and SD Files
•  Cheminformatics, Hadoop and EC2
•  Pig and Cheminformatics

   But are cheminformatics problems
   really big enough to justify all of this?
  
Simplifying Hadoop applications

•  Raw Hadoop programs can be tedious to write

SMARTS-based substructure search:

    package gov.nih.ncgc.hadoop;

    import chemaxon.formats.MolFormatException;
    import chemaxon.formats.MolImporter;
    import chemaxon.license.LicenseManager;
    import chemaxon.license.LicenseProcessingException;
    import chemaxon.sss.search.MolSearch;
    import chemaxon.sss.search.SearchException;
    import chemaxon.struc.Molecule;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Iterator;

    /**
     * SMARTS searching over a set of files using Hadoop.
     *
     * @author Rajarshi Guha
     */
    public class SmartsSearch extends Configured implements Tool {
        private final static IntWritable one = new IntWritable(1);
        private final static IntWritable zero = new IntWritable(0);

        public static class MoleculeMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private String pattern = null;
            private MolSearch search;
            private Text matches = new Text();

            public void configure(JobConf job) {
                try {
                    // load the ChemAxon license shipped via the distributed cache
                    Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                    BufferedReader reader =
                            new BufferedReader(new FileReader(licFiles[0].toString()));
                    StringBuilder license = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) license.append(line);
                    reader.close();
                    LicenseManager.setLicense(license.toString());
                } catch (IOException e) {
                } catch (LicenseProcessingException e) {
                }
                pattern = job.getStrings("pattern")[0];
                search = new MolSearch();
                try {
                    Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                    search.setQuery(queryMol);
                } catch (MolFormatException e) {
                }
            }

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                Molecule mol = MolImporter.importMol(value.toString());
                matches.set(mol.getName());
                search.setTarget(mol);
                try {
                    if (search.isMatching()) {
                        output.collect(matches, one);
                    } else {
                        output.collect(matches, zero);
                    }
                } catch (SearchException e) {
                }
            }
        }

        public static class SmartsMatchReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key,
                               Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                while (values.hasNext()) {
                    if (values.next().compareTo(one) == 0) {
                        result.set(1);
                        output.collect(key, result);
                    }
                }
            }
        }

        public int run(String[] args) throws Exception {
            JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
            jobConf.setJobName("smartsSearch");

            jobConf.setOutputKeyClass(Text.class);
            jobConf.setOutputValueClass(IntWritable.class);

            jobConf.setMapperClass(MoleculeMapper.class);
            jobConf.setCombinerClass(SmartsMatchReducer.class);
            jobConf.setReducerClass(SmartsMatchReducer.class);

            jobConf.setInputFormat(TextInputFormat.class);
            jobConf.setOutputFormat(TextOutputFormat.class);
            jobConf.setNumMapTasks(5);

            if (args.length != 4) {
                System.err.println("Usage: ss <in> <out> <pattern> <license file>");
                System.exit(2);
            }

            FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
            FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
            jobConf.setStrings("pattern", args[2]);

            // make the license file available via the distributed cache
            DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

            JobClient.runJob(jobConf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
            System.exit(res);
        }
    }
Pig & Pig Latin

•  Pig Latin programs are much simpler to write and get translated to
   Hadoop code
•  SQL-like; requires a UDF to be implemented to perform non-standard tasks

SMARTS search in Pig Latin:

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

UDF for SMARTS search:

    package gov.nih.ncgc.hadoop.pig;

    import chemaxon.formats.MolImporter;
    import chemaxon.sss.search.MolSearch;
    import chemaxon.sss.search.SearchException;
    import chemaxon.struc.Molecule;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    import java.io.IOException;

    public class SMATCH extends FilterFunc {
        // instantiated eagerly so exec() can reuse a single MolSearch
        static MolSearch search = new MolSearch();

        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                Molecule queryMol = MolImporter.importMol(query, "smarts");
                search.setQuery(queryMol);
                search.setTarget(MolImporter.importMol(target, "smiles"));
                return search.isMatching();
            } catch (SearchException e) {
                e.printStackTrace();
            }
            return false;
        }
    }
Going beyond chunking?

•  All the preceding use cases are embarrassingly parallel
   –  Chunking the input data and applying the same operation to each chunk
   –  Very nice when you have a big cluster

   Are there algorithms in cheminformatics that
   can employ map-reduce at the algorithmic level?
  
Going beyond chunking?

•  Applications that make use of pairwise (or higher order) calculations
   could benefit from a map-reduce incarnation
   –  Doesn't necessarily avoid the O(N²) barrier
   –  Bioisostere identification is one case that could be rephrased as a
      map-reduce problem
•  Map-Reduce Design Patterns
  
Identifying MMPs

•  First step in identifying bioisosteres is to identify candidate matched
   molecular pairs
   –  Naïve all-pairs comparison
   –  Predefined list of transformations
      •  Birch et al, BMCL, 2009
   –  Fragment intersection
      •  Hussain et al, JCIM, 2010
   –  MCS-based approaches (e.g., WizePairZ)
      •  Warner et al, JCIM, 2010
  
Naïve Bioisostere evaluation

N molecules → N(N-1)/2 comparisons

[Figure: every molecule is compared against every other molecule]
Scaffold seeding

Seed Fragment: [structure]

Members: [structures of the molecules containing the seed fragment]
Scaffold seeded bioisosteres

[Figure: comparisons are performed only within each scaffold series,
M(M-1)/2 comparisons per series]
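The arithmetic behind the two pictures can be sketched in a few lines. The class and method names below are hypothetical; the numbers in main are ballpark figures taken from the ChEMBL analysis described later in the deck.

```java
import java.util.*;

// Back-of-the-envelope comparison counts: a naive MMP search does
// N(N-1)/2 molecule comparisons, while scaffold seeding only compares
// members within each series: sum over scaffolds of M(M-1)/2.
public class PairCounts {
    public static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static long seeded(List<Long> membersPerScaffold) {
        long total = 0;
        for (long m : membersPerScaffold) total += allPairs(m);
        return total;
    }

    public static void main(String[] args) {
        // 235,693 molecules compared naively...
        System.out.println(allPairs(235_693L));      // ~2.8e10 comparisons
        // ...vs roughly 231,875 scaffolds averaging 7 members each
        List<Long> scaffolds = Collections.nCopies(231_875, 7L);
        System.out.println(seeded(scaffolds));       // ~4.9e6 comparisons
    }
}
```

Same asymptotic behavior, but the constant drops by several orders of magnitude when the scaffold series are small.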
Seeded bioisosteres – MR style

MAP
•  Do pairwise MCS analysis on scaffold series
•  For each pair output SMIRKS transform and the pair of SMILES

REDUCE
•  Collect pairs of SMILES for a given SMIRKS
•  Store in DB, or
•  Filter by activity, or
•  …
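The reduce side described above amounts to grouping molecule pairs by their SMIRKS transform. A minimal in-memory sketch, with a hypothetical record layout and illustrative (not real MCS-derived) SMIRKS/SMILES strings:

```java
import java.util.*;

// Sketch of the shuffle/reduce step: map emits records of
// {smirks, smiles1, smiles2}; reduce collects, per transform,
// all molecule pairs exhibiting it.
public class TransformGrouper {
    public static Map<String, List<String[]>> group(List<String[]> mapOutput) {
        Map<String, List<String[]>> byTransform = new LinkedHashMap<>();
        for (String[] rec : mapOutput) {
            // rec[0] is the SMIRKS key; rec[1], rec[2] are the SMILES pair
            byTransform.computeIfAbsent(rec[0], k -> new ArrayList<>())
                       .add(new String[]{rec[1], rec[2]});
        }
        return byTransform;
    }

    public static void main(String[] args) {
        List<String[]> emitted = List.of(
                new String[]{"[*:1]F>>[*:1]Cl", "CCF", "CCCl"},
                new String[]{"[*:1]F>>[*:1]Cl", "c1ccccc1F", "c1ccccc1Cl"});
        // both pairs end up under the same F -> Cl transform key
        System.out.println(group(emitted).get("[*:1]F>>[*:1]Cl").size()); // 2
    }
}
```

In the actual workflow the grouped output would land in a store such as HBase, keyed by transform, ready for activity-based filtering.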
Does seeding help?

•  Doesn't bypass the O(N²) barrier – does reduce the constant
•  Depends on how many scaffolds and the number of members for each
   scaffold
•  Certainly useful when there are a few members per scaffold
•  Highly populated scaffolds can throw things off

[Plot: log number of pairwise comparisons vs. log number of molecules,
comparing the all-pairs method with seeding at 7, 21, and 100 members
per scaffold]
Data

•  Exhaustively fragmented ChEMBL 13
•  Identified scaffolds with

      N_members / N_scaffold ≥ 1.8

•  Ended up with 231,875 scaffolds
   –  Covers 235,693 unique molecules
   –  Average of 7 members per scaffold
   –  95% of scaffolds had < 21 members
   –  99.5% had < 74 members
      •  The remaining 0.5% are a bit problematic

[Plot: log comparisons for the all-pairs vs. seeded methods]
Timing experiments

•  Selected 50 scaffolds with 10 or fewer members
•  Configured so as to have ~5 maps
•  Effective running time for the entire job is 3.8 min on Hadoop
   –  Only needed 5 of 8 map slots on our "cluster"
•  Takes ~6 min without Hadoop

[Plot: per-job running time (s) for the five jobs]
Timing experiments

•  Selected 1000 scaffolds with 20 or fewer members
   –  Ran with 10 scaffolds / map
•  Hadoop run time was ~2 hr
   –  Most maps were fast (< 20 sec)
•  Serial evaluation would be > 7 hr

[Plot: histogram of the number of jobs vs. log time (s)]
A M-R workflow

•  We're currently focused on just the MMP step as a MR example
•  Could also include the fragmentation step as part of the workflow
   –  But a pre-calculated set of scaffolds is more sensible
•  Store transformations and members in HBase
•  Link with activity data and apply structure & activity filters on
   candidate pairs
What Hadoop is not for

•  Doesn't replace an actual database
•  It's not uniformly fast or efficient
•  Not good for ad hoc or real-time analysis
•  Generally not effective unless dealing with massive datasets
•  Not all algorithms are amenable to the map-reduce method
Conclusions

•  Cheminformatics applications can be rehosted or rewritten to take
   advantage of cloud resources
   –  Remotely hosted
   –  Embarrassingly parallel / chunked
   –  Map/reduce
•  Ability to process larger structure collections lets us explore more
   chemical space
•  "Big data" isn't really that big in chemistry
Conclusions

•  Q: But are cheminformatics problems really big enough to justify all of this?
•  A: Yes – virtual libraries, integrating chemical structure with other types and scales of data

•  Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
•  A: Yes – especially when we consider problems with a combinatorial flavor
https://github.com/rajarshi/chem.hadoop
