Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT


Published in: Technology

  1. Enabling Discoveries at High Throughput: Small molecule and RNAi HTS at the NCTT
     Rajarshi Guha
     NIH Center for Translational Therapeutics
     May 3, 2011
  2. Outline
     • Informatics for small molecule & RNAi screening
     • HCA & automated decision making
       – Pretty pictures can lead to more efficient screens
     • Large scale cheminformatics
       – We can do it, but do we need to?
  3. NIH Chemical Genomics Center
     • Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
       – NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
       – Post-doc program
     • Mission
       – MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
       – Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
       – New paradigms & applications of HTS for chemical biology / chemical genomics
     • All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
  4. Project Diversity
     (A) Disease areas (B) Target types (C) Detection methods
  5. Assay formats & detection methods in HTS
     Assay formats
     • ligand binding
       – competition binding
     • enzymatic activity
       – biochemical
       – cellular
     • ion or ligand transport
       – ion-sensitive dyes
       – membrane potential dyes
     • protein-protein interactions
       – biochemical
       – cellular
     • cellular signal transduction
       – reporter gene
       – second messenger
     • phenotypic
       – protein redistribution
       – cell viability
       – etc.
     Detection modes
     • luminescence
       – chemiluminescence
       – bioluminescence
       – BRET
       – ALPHA
     • fluorescence
       – FI
       – FRET
       – TRF
       – TR-FRET
       – FP
       – FCS
       – FLT
     • absorbance
     • radioactivity
       – SPA
  6. Detector Systems: “Reading the assay”
     • ViewLux
       – Multimodal CCD-based imager (Abs., Luminescence, Fluorescence)
     • Envision
       – PMT-based reader (ALPHA)
     • Acumen Explorer
       – Laser Scanning Imager (“static” cell cytometry)
     • Hamamatsu FDS 7000 Series
       – rapid kinetics
     • INCell1000
       – Subcellular imaging
  7. qHTS: High Throughput Dose Response
     • Assay concentration ranges over 4 logs (high: ~100 μM)
     • 1536-well plates, inter-plate dilution series
     • Assay volumes 2–5 μL
     • Automated concentration-response data collection, ~1 CRC/sec
     • Informatics pipeline: automated curve fitting and classification; 300K samples
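The dilution series and curve fitting mentioned on this slide can be illustrated with a small sketch. It assumes a four-parameter Hill model for the concentration-response curve, which the slide does not name, and evenly log-spaced concentrations; the class name, method names, and all parameter values are hypothetical.

```java
/** Illustrative sketch only: the Hill model and every number here are assumptions, not from the slides. */
class QhtsSketch {

    /** Concentrations starting at `high` and spanning `logs` orders of magnitude in `n` evenly log-spaced points. */
    static double[] dilutionSeries(double high, double logs, int n) {
        double[] conc = new double[n];
        for (int i = 0; i < n; i++) {
            conc[i] = high * Math.pow(10.0, -logs * i / (n - 1));
        }
        return conc;
    }

    /** Four-parameter Hill equation: predicted response at concentration c. */
    static double hill(double c, double bottom, double top, double ac50, double slope) {
        return bottom + (top - bottom) / (1.0 + Math.pow(ac50 / c, slope));
    }
}
```

For example, `QhtsSketch.dilutionSeries(100e-6, 4.0, 5)` spans 100 μM down to 10 nM, and the Hill response at a concentration equal to the AC50 is halfway between `bottom` and `top`. A fitting routine would search for the four parameters that best reproduce the measured responses.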
  8. Informatics Activities
     • High throughput curve fitting
     • Data integration, automated cherry picking
     • SAR algorithms
       – QSAR modeling
       – Fragment based analysis
       – Activity cliffs
     • Tools: standardizer, tautomers, fragment activity browser, kinome browser and more
     • RNAi hit selection, OTE analysis
     • High content analysis
  9. Kinome Navigator
     • Browse kinase panel data
     • Currently focused on the Abbott dataset
     • View
       – Fragments
       – Target pairs
       – Kinome overlay
     http://tripod.nih.gov
  10. Fragment Browser
     • View activities on a fragment wise basis
     • Compare activity distributions by fragment
     • Currently based around ChEMBL assays but users can browse their own compounds & activities
     http://tripod.nih.gov
  11. Structure Activity Landscapes
     • Rugged gorges or rolling hills?
       – Small structural changes associated with large activity changes represent steep slopes in the landscape
       – But traditionally, QSAR assumes gentle slopes
       – We can characterize the landscape using SALI
     Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
  12. What Can We Do With SALI’s?
     • SALI characterizes cliffs & non-cliffs
     • For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape
     • Models try to encode this landscape
     • Use the landscape to guide descriptor or model selection
     Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
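As a rough illustration of how SALI quantifies a cliff, the sketch below uses the definition from the cited Guha and Van Drie paper, SALI(i, j) = |A_i − A_j| / (1 − sim(i, j)), with Tanimoto similarity over bit fingerprints. The fingerprint representation, class name, and method names are assumptions chosen for the example.

```java
import java.util.BitSet;

/** Illustrative sketch: SALI for a pair of molecules, assuming bit-vector fingerprints. */
class SaliSketch {

    /** Tanimoto similarity: shared on-bits divided by the union of on-bits. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        return or.cardinality() == 0 ? 1.0 : (double) and.cardinality() / or.cardinality();
    }

    /**
     * SALI = |A_i - A_j| / (1 - sim). Identical fingerprints (sim = 1) give a zero
     * denominator, so the result is infinite or NaN and needs special handling.
     */
    static double sali(double activityI, double activityJ, double similarity) {
        return Math.abs(activityI - activityJ) / (1.0 - similarity);
    }
}
```

Two similar molecules with very different activities produce a large SALI value (a cliff), while dissimilar molecules or similar activities produce a small one.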
  13. Predicting the Landscape
     • Rather than predicting activity directly, we can try to predict the SAR landscape
     • Implies that we attempt to directly predict cliffs
       – Observations are now pairs of molecules
     RMSE by model: Original pIC50, 0.97; SALI, AbsDiff, 1.10; SALI, GeoMean, 1.04
     Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
  14. Data Integration
     • It’s nice to simplify data, but we can still be faced with a multitude of data types
     • We want to explore these data in a linked fashion
     • How we explore and what we explore is generally influenced by the task at hand
     • At one point, make inferences over all the data
  15. Data Integration
     User’s Network; Network of Public Data
     Content: Drugs, Compounds, Scaffolds, Assays, Genes, Targets, Pathways, Diseases, Clinical Trials, Documents
     Links: Manually curated; Derived from algorithms
  16. Record View of an Assay
  17. Access Disease Hierarchy & Network
  18. Articles, Patents, Drug Labels, …
  19. NPC Browser
     http://tripod.nih.gov/npc/
  20. Going Beyond Exploration?
     • Simply being able to explore data in an integrated manner is useful as an idea generator
     • Can we integrate heterogeneous data types & sources to get a systems level view?
       – Current research problem in genomics and systems biology
       – Some attempts have been made to merge chemical data with other data types
     Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
  21. RNAi Facility Mission
     • Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
     • Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs.
     Range of Assays
     • Simple Phenotypes (Viability, cytotoxicity, oxidative stress, etc)
     • Pathway (Reporter assays, e.g. luciferase, β-lactamase)
     • Complex Phenotypes (High-content imaging, cell cycle, translocation, etc)
  22. RNAi Effectors
     RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.
  23. Issues Using RNAi Effectors
     • RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.
     • RNAi effectors induce off-target effects!
  24. Examples of Current Projects
     • Protein Quality Control
     • DNA Re-replication
     • Base Excision Repair
     • DNA Damage: ELG1 stabilization
     • Antioxidant Response
     • Hypoxia
     • TNFα Response
     • Interferon Response
     • iPS to RPE
     • Poxvirus
     • Respiratory Viruses
     • Lysosomal Storage Disorders
     • Parkinson’s: Mitochondrial Quality Control
     • Ewing’s Sarcoma
     • Drug Modifiers, Pancreatic Cancer
     • Drug Modifiers, TOP1 Clinical Agents
     • Immunotoxin-Mediated Cell Death
  25. User Accessible Tools
  26. RNAi Libraries
     • Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene.
     • Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene.
     • Dharmacon Human Duet Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools.
     • Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library
     • Qiagen Human Druggable Genome Library, > 7,000 genes, 4 unique siRNAs per gene.
     • Kinome Libraries, purchased from a number of vendors.
     • Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.
     • Considerations are being made for additional species and shRNA resources.
  27. Druggable Genome Screening Campaign
     • Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).
     • 85 genes were selected for follow-up through a variety of threshold-based selection schemes.
     • 27 genes were validated as confident hits using siRNAs from multiple vendors.
     [Figure: plate heat map, pseudo-colored Blue/Green Ratio (normalized to plate median)]
     [Figure: Significant enrichment for core NF-kB components. Percent Reduction in NF-kB Signal, Average Inhibition (%), Qiagen siRNAs vs Ambion siRNAs, for TNFα Receptor, IKKα, RELA, NEMO]
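The slide does not say which threshold-based selection schemes were used. Purely as an illustration of the idea, the sketch below applies one plausible rule: a gene is selected when at least `minActive` of its siRNAs reach an inhibition threshold. The rule, the gene labels, and the numbers in the usage note are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch of one possible threshold-based selection rule (an assumption, not the NCTT scheme). */
class HitSelectionSketch {

    /**
     * Returns genes (sorted by name) for which at least minActive siRNAs
     * show percent inhibition greater than or equal to threshold.
     */
    static List<String> selectGenes(Map<String, double[]> inhibitionByGene, double threshold, int minActive) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, double[]> entry : new TreeMap<>(inhibitionByGene).entrySet()) {
            int active = 0;
            for (double inhibition : entry.getValue()) {
                if (inhibition >= threshold) {
                    active++;
                }
            }
            if (active >= minActive) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Requiring agreement between several independent siRNAs for the same gene is a common guard against off-target artifacts; with four siRNAs per gene, a "2 of 4" rule would correspond to `minActive = 2`.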
  28. Druggable Genome Screening Campaign
     Significant enrichment for proteins that form the 26S proteasome
     [Figure: Percent Reduction in NF-kB Signal, Average Inhibition (%), Qiagen vs Ambion, for PSM genes A1, A2, A3, A4, A5, A6, A7, B2, B3, B4, D2, D7, C4, C5, D14, grouped by PSM protein: α core 20S, β core 20S, RPT 19S, RPN 19S]
     [Figure: proteasome structure, with 19S regulator particles (RPN, RPT) capping the 20S proteasome (α1-7, β1-7). Murata et al, Nature Reviews Mol. Cell Biol.]
     An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway.
  29. Seed Sequence Analysis
     Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.
  30. Seed Sequence Analysis
     Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate NF-kB reporter, adding to the likelihood of this being an off-target effect.
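The seed comparison described on these two slides can be sketched as a grouping step: extract the seed from each siRNA guide strand, then look at the activity of every other siRNA in the library that shares it. The sketch assumes the common heptamer convention (positions 2 to 8 from the 5' end of the guide strand); the slides do not state which seed definition was used, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: group siRNAs by seed, assuming a heptamer seed at guide positions 2-8. */
class SeedSketch {

    /** Seed of a guide strand written 5' to 3': positions 2..8 (1-based), i.e. 7 nucleotides. */
    static String seed(String guide5to3) {
        if (guide5to3.length() < 8) {
            throw new IllegalArgumentException("guide strand too short: " + guide5to3);
        }
        return guide5to3.substring(1, 8);
    }

    /** Maps each seed to the ids of all siRNAs carrying it, preserving input order within a seed. */
    static Map<String, List<String>> groupBySeed(Map<String, String> guideById) {
        Map<String, List<String>> bySeed = new TreeMap<>();
        for (Map.Entry<String, String> entry : guideById.entrySet()) {
            bySeed.computeIfAbsent(seed(entry.getValue()), k -> new ArrayList<>()).add(entry.getKey());
        }
        return bySeed;
    }
}
```

If siRNAs against unrelated genes share a seed and all show the same phenotype, the activity is more likely seed-driven (off-target); if only the siRNAs against one gene are active, an on-target effect is more likely.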
  31. RNAi & Small Molecule Screens
     • What targets mediate activity of siRNA and compound
     • Target ID and validation
     • Pathway elucidation, identification of interactions
     • Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology
     • Reuse pre-existing MLI data
     • Develop new annotated libraries
     • Run parallel RNAi screen
     CAGCATGAGTACTACAGGCCA
     TACGGGAACTACCATAATTTA
     Goal: Develop systems level view of small molecule activity
  32. Matching Phenotypes
     RNAi; Small Molecule
  33. Merging Screening Technologies
     High throughput screening
     • Lead identification
     • Single (few) read outs
     • High-throughput
     • Moderate data volumes
     High content screening
     • Phenotypic profiling
     • Multiple parameters
     • Moderate throughput
     • Very large data volumes
     We’d like to combine the technologies, to obtain rich high-resolution data at high speed
     • Is this feasible? What are the trade-offs?
  34. Merging Screening Technologies
     • A simple solution is to run HTS & HCS as separate, primary & secondary screens
     • Alternatively: Wells to Cells
       – Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
       – Selective imaging of interesting wells
  35. Wells to Cells Workflow
     • Sequential qHTS using laser scanning cytometry followed by high-res microscopy
     • Unit of work is a plate series
     • The same aliquot is analyzed by both techniques
     • A message based system
     • The key is deciding which wells go through the workflow
  36. Wells to Cells Assays
     • Cell cycle, cell translocation, DNA rereplication
     • All assays run against LOPAC1280
     • Consistency between cytometry & microscopy is measured by the R² between log AC50’s
       – Cell cycle, 0.94 – 0.96
       – Cell translocation, 0.66 – 0.94
       – DNA rereplication, still in progress
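The consistency measure on this slide can be sketched as follows, assuming R² is taken as the squared Pearson correlation between the log10 AC50 values from the two instruments (the slide does not say how R² was computed). The class and method names are hypothetical.

```java
/** Illustrative sketch: R^2 between log10 AC50 values from two readouts, as squared Pearson correlation. */
class R2Sketch {

    /** Squared Pearson correlation of log10(x) and log10(y); arrays must be the same length, values positive. */
    static double rSquaredOfLogs(double[] ac50Cytometry, double[] ac50Microscopy) {
        int n = ac50Cytometry.length;
        double[] x = new double[n];
        double[] y = new double[n];
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] = Math.log10(ac50Cytometry[i]);
            y[i] = Math.log10(ac50Microscopy[i]);
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
            syy += (y[i] - meanY) * (y[i] - meanY);
        }
        double r = sxy / Math.sqrt(sxx * syy);
        return r * r;
    }
}
```

Working in log space keeps potent and weak compounds on a comparable scale; a constant fold-shift between instruments (for example, every AC50 doubled) still gives R² of 1, so this measure captures rank and trend agreement rather than absolute agreement.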
  37. Cell Translocation Example Hits
  38. Informatics Platform
     InCell Layout File
     • Advanced correction and normalization methods
     • Sophisticated curve fitting algorithm
     • Good performance, allows parallelization of the entire workflow
  39. Why Messaging?
     • A messaging architecture allows for significant flexibility
       – Persistent, can be kept for process tracking, reporting
       – Asynchronous, allows individual components of the workflow to proceed at their own pace
       – Modular, new components can be introduced at any time without redesigning the whole workflow
     • We employ Oracle AQ, but any message queue can be employed
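The asynchronous hand-off described here can be sketched with an in-process queue. This is a toy stand-in: it uses `java.util.concurrent.LinkedBlockingQueue` rather than Oracle AQ, the message format (`plate:well`) is invented, and a sentinel message marks the end of the plate series.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative sketch: a producer (HTS analytics) and a consumer (imaging) decoupled by a queue. */
class MessagingSketch {
    static final String STOP = "STOP"; // sentinel telling the consumer that no more wells will arrive

    /** Posts each selected well as a message; a separate thread consumes them at its own pace. */
    static List<String> run(List<String> selectedWells) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> imaged = Collections.synchronizedList(new ArrayList<>());

        Thread imagingStation = new Thread(() -> {
            try {
                for (String msg = queue.take(); !msg.equals(STOP); msg = queue.take()) {
                    imaged.add("imaged " + msg); // stand-in for high-res microscopy of the well
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        imagingStation.start();

        try {
            for (String well : selectedWells) {
                queue.put(well); // the producer never waits for imaging to finish
            }
            queue.put(STOP);
            imagingStation.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the workflow", e);
        }
        return new ArrayList<>(imaged);
    }
}
```

A persistent queue such as the one the slide names would additionally keep messages for tracking and reporting; this in-memory version only shows the asynchronous and modular aspects.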
  40. Handling Multiple Platforms
     • Current examples employ InCell hardware
     • We also use Molecular Devices hardware
     • As a result we have two orthogonal image stores / databases
     • Need to integrate them
       – Support seamless data browsing across multiple screens irrespective of imaging platform used
       – Support analytics external to vendor code
  41. A Unified Interface
     • A client sees a single, simple interface to screening image data
       http://host/rest/protocol/plate/well/image
     • Transparently extract image data via the MetaXpress database or via custom code
     • Currently the interface addresses image serving
     • Unified metadata interface in the works
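A client-side view of the URL scheme on this slide can be sketched as a small parser that splits the path into its four components. Only the path layout comes from the slide; the class, field names, and example values are hypothetical, and nothing here reflects the actual server implementation.

```java
import java.net.URI;

/** Illustrative sketch: parse http://host/rest/protocol/plate/well/image into its parts. */
class ImageRequestSketch {
    final String protocol;
    final String plate;
    final String well;
    final String image;

    ImageRequestSketch(String protocol, String plate, String well, String image) {
        this.protocol = protocol;
        this.plate = plate;
        this.well = well;
        this.image = image;
    }

    /** Accepts only paths of the form /rest/{protocol}/{plate}/{well}/{image}. */
    static ImageRequestSketch parse(String url) {
        // "/rest/a/b/c/d".split("/") yields ["", "rest", "a", "b", "c", "d"]
        String[] parts = URI.create(url).getPath().split("/");
        if (parts.length != 6 || !parts[1].equals("rest")) {
            throw new IllegalArgumentException("unexpected image URL: " + url);
        }
        return new ImageRequestSketch(parts[2], parts[3], parts[4], parts[5]);
    }
}
```

Because the client only ever sees these four identifiers, the server is free to resolve them against either vendor's image store, which is the point of the unified interface.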
  42. Trade-offs & Opportunities
     • Automation reduces the ability to handle unforeseen errors
       – Dispense errors and other plate problems
       – Well selection based on curve classes may need to be modified on the fly
     • Well selection does not consider SAR
       – Wells are selected independently of each other
       – If we could model SAR on the fly (or from validation screens), we’d select multiple wells, to obtain positive and negative results
  43. Cloud Computing & Cheminformatics
     • Cloud computing is a hot topic
     • A number of examples of computational chemistry / cheminformatics on the cloud
       – MolPlex, hBar, Numerate, Wingu, Sciligence, Pfizer
     • Many examples use the cloud for remote storage and remote (hosted) computations
     • But providers such as Amazon allow us to run distributed computing applications on the cloud
  44. Map/Reduce
     • Map/Reduce is a programming model for efficient distributed computing
     • M/R made “famous” by Google, but the idea has been around for a long time
     • It works like a Unix pipeline:
       – cat input | grep | sort | uniq -c | cat > output
       – Input | Map | Shuffle & Sort | Reduce | Output
     • Efficiency from
       – Streaming through data, reducing seeks
       – Pipelining
     Owen O’Malley, http://bit.ly/ecHPvB
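The Input | Map | Shuffle & Sort | Reduce | Output pipeline can be mimicked in a few lines of plain Java, without Hadoop, to show what each stage contributes. This is a single-machine toy using word counting; the class and method names are made up for the illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustrative sketch: the map, shuffle/sort, and reduce stages run in-process on word counting. */
class MapReduceSketch {

    /** Map: one input line in, zero or more (word, 1) pairs out. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle & sort: group all values by key, with keys in sorted order. */
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: collapse all values for one key into a single result. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the whole pipeline over the input lines. */
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        SortedMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }
}
```

In a real framework the map calls run on many nodes and the shuffle moves data across the network, but the contract of each stage is the same as in this sketch.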
  45. Map/Reduce
     Owen O’Malley, http://bit.ly/ecHPvB
  46. Hadoop & Cheminformatics
     • Hadoop is an Open Source implementation of the map/reduce paradigm
     • Hadoop is a framework for scalable, distributed computing
       – Hadoop, HDFS, Hive, Pig
     • Importantly, you can play with all this on your laptop and just copy files to the big cluster when you’re ready for production
  47. Why Hadoop?
     • Simple way to make use of large clusters without MPI etc
     • AWS supports Hadoop, so easy to scale up to 100’s or 1000’s of cores
     • Great for Java code, but non-Java code can also make use of Hadoop
     • M/R can be applied to a lot of problems, but one of the simplest is to use it as a “chunker”
  48. Cheminformatics in Parallel
     • Many cheminformatics problems are data parallel
       – Chunk the data and apply the same technique over each chunk
     • This makes many problems amenable for M/R
       – Substructure / pharmacophore search
       – Descriptor calculations, virtual screening
       – Model development (?)
     • In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
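The data-parallel pattern on this slide can be sketched on one machine with a thread pool standing in for cluster nodes: split the input into chunks, run the same serial function over each chunk, and concatenate the results. The per-molecule task is passed in as a function; the string-length task in the usage note is only a placeholder for a real descriptor calculation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustrative sketch: chunk the data, apply the same serial task to each chunk in parallel. */
class ChunkSketch {

    /** Splits items into consecutive chunks of at most `size` elements. */
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return chunks;
    }

    /** Runs serialTask over every item, one chunk per worker task; results keep the input order. */
    static <T, R> List<R> processInParallel(List<T> items, int chunkSize, Function<T, R> serialTask) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // threads stand in for cluster nodes
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> oneChunk : chunk(items, chunkSize)) {
                Callable<List<R>> work = () -> {
                    List<R> out = new ArrayList<>();
                    for (T item : oneChunk) {
                        out.add(serialTask.apply(item)); // the task itself is ordinary serial code
                    }
                    return out;
                };
                futures.add(pool.submit(work));
            }
            List<R> results = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For instance, `ChunkSketch.processInParallel(smilesList, 1000, String::length)` would apply the placeholder task to chunks of 1000 SMILES strings. The task never needs to know it is running in parallel, which is the property the slide highlights.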
  49. Cheminformatics in Parallel
     See http://blog.rguha.net/?tag=hadoop for examples & code
  50. Substructure Searching
     • Substructure searching is a trivial extension of atom counting
     • If a structure matches, emit (name,1)
     • Otherwise (name,0)
     • Reducer simply outputs tuples of the form (name,1)

         import java.io.IOException;
         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Mapper;
         import org.apache.hadoop.mapreduce.Reducer;
         import org.openscience.cdk.CDKConstants;
         import org.openscience.cdk.exception.CDKException;
         import org.openscience.cdk.interfaces.IAtomContainer;

         public class SubSearch {
             …
             // sp (SMILES parser), sqt (SMARTS query tool), one and zero are used below;
             // their declarations fall in the part of the class elided on the slide.

             public static class MoleculeMapper extends
                     Mapper<Object, Text, Text, IntWritable> {
                 private Text matches = new Text();
                 private String pattern;

                 // Read the SMARTS pattern from the job configuration
                 public void setup(Context context) {
                     pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
                 }

                 // One SMILES line in, (name,1) or (name,0) out
                 public void map(Object key, Text value, Context context) throws
                         IOException, InterruptedException {
                     try {
                         IAtomContainer molecule = sp.parseSmiles(value.toString());
                         sqt.setSmarts(pattern);
                         boolean matched = sqt.matches(molecule);
                         matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                         if (matched) context.write(matches, one);
                         else context.write(matches, zero);
                     } catch (CDKException e) {
                         e.printStackTrace();
                     }
                 }
             }

             public static class SMARTSMatchReducer extends
                     Reducer<Text, IntWritable, Text, IntWritable> {
                 private IntWritable result = new IntWritable();

                 // Keep only the molecules that matched
                 public void reduce(Text key, Iterable<IntWritable> values,
                         Context context) throws IOException, InterruptedException {
                     for (IntWritable val : values) {
                         if (val.compareTo(one) == 0) {
                             result.set(1);
                             context.write(key, result);
                         }
                     }
                 }
             }
         }
  51. Running on AWS
     • All the code was debugged on my laptop with relatively small files
     • To test the scalability, I shifted everything to AWS
       – Pharmacophore search
       – 136K structures, single conformer, 560MB
       – Created a single JAR file with CDK & application code
       – Uploaded data files to S3
     • Total cost of experiments was ~ $10
  52. But I Don’t Want to Write Programs
     • All these examples require us to write full fledged Java classes
     • An easier way is to use Pig & Pig Latin, a platform and query language built on top of Hadoop
     • Lets us write SQL-like queries that make use of Hadoop underneath
     • Flexible due to user defined functions (UDF’s)
       – UDF’s encapsulate the cheminformatics
  53. Cheminformatics & Pig

         A = load 'medium.smi' as (smiles:chararray);
         B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
         store B into 'output.txt';

     • Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
     • The complexity is now hidden in the UDF
     • Many toolkit functions could be wrapped as UDF’s, allowing flexible queries with much simpler code
     • See http://blog.rguha.net/?p=748 for the code
  54. Latency
     • Hadoop is suited for batch processing
     • Significant network I/O involved in distributing data to compute nodes
     • Not good for
       – Random ad hoc processing of small subsets
       – Small volume data
       – Real time (low latency) work
     • But latency issues can be addressed somewhat by HBase, Hive and other technologies
  55. More than Chunking?
     • But all the examples so far could have been done via PBS/Condor or any other job scheduler
       – (With Hadoop we don’t have to worry about explicit chunking of the input data)
     • But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
       – Predictive modeling?
       – Graph algorithms?
  56. More than Chunking?
     • Both predictive & graph algorithms are increasingly supported in Hadoop
       – Mahout for M/L algorithms on massive datasets
       – Cloud9 for graph algorithms
     • A number of bioinformatics applications make use of M/R at the algorithmic level
     • They are all big applications
       – Crossbow aligns 3 billion paired/unpaired reads
     • Cheminformatics datasets are not very big
  57. Summary
     • HTS data is an ample playground for interesting analytics; multiple data types make it more fun
     • A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
     • Hadoop and M/R provide great opportunities for handling large data in a flexible manner
     • But can cheminformatics really make use of it?
  58. Acknowledgments
     Informatics
     • Ajit Jadhav
     • Trung Nguyen
     • Noel Southall
     • Ruili Huang
     • Min Shen
     • Hongmao Sun
     • Xin Hu
     • Tongan Zhao
     RNAi & Small Molecule
     • Scott Martin
     • Pinar Tuzmen
     • Yu-Chi Chen
     • Carleen Klump
     • Craig Thomas
     • Jim Inglese
     • Ron Johnson
     • Sam Michael
     • Jennifer Wichterman
  59. Counting Atoms
     • The canonical Hadoop program is to count the frequency of words in a text file
       – Mapper reads a line, outputs a tuple (word, 1)
       – Reducer will receive tuples, keyed on word
       – Summing up the 1’s gives us the frequency of word
     • By default, Hadoop works on a line-by-line basis
     • For cheminformatics problems, SMILES files satisfy this requirement: one line, one molecule
  60. Counting Atoms
     • Uses the CDK to parse SMILES
     • For each molecule loop over atoms
       – Emit (symbol,1)
     • Reducer simply sums the 1’s for each symbol

         import java.io.IOException;
         import org.apache.hadoop.io.IntWritable;
         import org.apache.hadoop.io.Text;
         import org.apache.hadoop.mapreduce.Mapper;
         import org.apache.hadoop.mapreduce.Reducer;
         import org.openscience.cdk.DefaultChemObjectBuilder;
         import org.openscience.cdk.exception.InvalidSmilesException;
         import org.openscience.cdk.interfaces.IAtom;
         import org.openscience.cdk.interfaces.IAtomContainer;
         import org.openscience.cdk.smiles.SmilesParser;

         public class HeavyAtomCount {
             static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

             public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

                 private final static IntWritable one = new IntWritable(1);
                 private Text word = new Text();

                 // One SMILES line in, one (symbol,1) pair out per atom
                 public void map(Object key, Text value, Context context) throws
                         IOException, InterruptedException {
                     try {
                         IAtomContainer molecule = sp.parseSmiles(value.toString());
                         for (IAtom atom : molecule.atoms()) {
                             word.set(atom.getSymbol());
                             context.write(word, one);
                         }
                     } catch (InvalidSmilesException e) {
                         // do nothing for now
                     }
                 }
             }

             public static class IntSumReducer extends Reducer<Text, IntWritable,
                     Text, IntWritable> {
                 private IntWritable result = new IntWritable();

                 // Sum the 1's emitted for each element symbol
                 public void reduce(Text key, Iterable<IntWritable> values,
                         Context context) throws IOException, InterruptedException {
                     int sum = 0;
                     for (IntWritable val : values) {
                         sum += val.get();
                     }
                     result.set(sum);
                     context.write(key, result);
                 }
             }
             ….
         }
  61. Multiline Records
     • Lots of cheminformatics applications require 3D, so SMILES won’t do. Need to support SDF
     • We implement a custom RecordReader to process SD files
     • We’re now ready to tackle pretty much most cheminformatics tasks
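The core job of a record reader for SD files is to accumulate lines until the `$$$$` record terminator and hand back one molecule at a time. The sketch below shows only that splitting logic in plain Java; it does not use Hadoop's `RecordReader` API and is not the implementation from the talk. Class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: split SD file text into one multi-line record per molecule. */
class SdfRecordSketch {

    /** Returns each record including its "$$$$" terminator line; trailing unterminated text is dropped. */
    static List<String> splitRecords(String sdfText) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(sdfText))) {
            String line;
            while ((line = reader.readLine()) != null) {
                current.append(line).append('\n');
                if (line.trim().equals("$$$$")) { // SD files end every record with this line
                    records.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

A Hadoop record reader has one extra complication that this sketch ignores: an input split can begin in the middle of a molecule, so the reader must skip ahead to the first line after a `$$$$` before emitting records.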
  62. Why Hadoop?
     • Java and C++ APIs
       – In Java use Objects, while in C++ bytes
     • Each task can process data sets larger than RAM
     • Automatic re-execution on failure
       – In a large cluster, some nodes are always slow or flaky
       – Framework re-executes failed tasks
     • Locality optimizations
       – M/R queries HDFS for locations of input data
       – Map tasks are scheduled close to the inputs when possible
     Owen O’Malley, http://bit.ly/ecHPvB