Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT
Transcript of "Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT"

  1. Enabling Discoveries at High Throughput
     Small molecule and RNAi HTS at the NCTT
     Rajarshi Guha
     NIH Center for Translation Therapeutics
     May 3, 2011
  2. Outline
     • Informatics for small molecule & RNAi screening
     • HCA & automated decision making
       – Pretty pictures can lead to more efficient screens
     • Large scale cheminformatics
       – We can do it, but do we need to?
  3. NIH Chemical Genomics Center
     • Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
       – NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
       – Post-doc program
     • Mission
       – MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
       – Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
       – New paradigms & applications of HTS for chemical biology / chemical genomics
     • All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
  4. Project Diversity
     [Figure panels: (A) Disease areas, (B) Target types, (C) Detection methods]
  5. Assay formats & detection methods in HTS
     Assay formats
     • ligand binding
       – competition binding
     • enzymatic activity
       – biochemical
       – cellular
     • ion or ligand transport
       – ion-sensitive dyes
       – membrane potential dyes
     • protein-protein interactions
       – biochemical
       – cellular
     • cellular signal transduction
       – reporter gene
       – second messenger
     • phenotypic
       – protein redistribution
       – cell viability
       – etc.
     Detection modes
     • luminescence
       – chemiluminescence
       – bioluminescence
       – BRET
       – ALPHA
     • fluorescence
       – FI
       – FRET
       – TRF
       – TR-FRET
       – FP
       – FCS
       – FLT
     • absorbance
     • radioactivity
       – SPA
  6. Detector Systems: "Reading the assay"
     • ViewLux
       – Multimodal CCD-based imager (Abs., Luminescence, Fluorescence)
     • Envision
       – PMT-based reader (ALPHA)
     • Acumen Explorer
       – Laser Scanning Imager ("static" cell cytometry)
     • Hamamatsu FDS 7000 Series
       – Rapid kinetics
     • INCell1000
       – Subcellular imaging
  7. qHTS: High Throughput Dose Response
     [Figure panels A, B, C]
     • 1536-well plates, inter-plate dilution series
     • Assay concentration ranges over 4 logs (high: ~100 μM)
     • Assay volumes 2–5 μL
     • Automated concentration-response data collection, ~1 CRC/sec
     • Informatics pipeline: automated curve fitting and classification
     • 300K samples
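The slide does not say which curve model the pipeline fits. As a minimal sketch, assuming the commonly used four-parameter Hill equation, this is the function an automated fitter would evaluate for each concentration-response curve (CRC). The class and method names are hypothetical.

```java
/** Four-parameter Hill (logistic) concentration-response model; illustrative only. */
final class HillCurve {
    /**
     * @param conc   compound concentration (same units as ac50), must be > 0
     * @param bottom response at zero concentration
     * @param top    response at saturating concentration
     * @param ac50   concentration giving the half-maximal response
     * @param slope  Hill slope
     */
    static double response(double conc, double bottom, double top, double ac50, double slope) {
        return bottom + (top - bottom) / (1.0 + Math.pow(ac50 / conc, slope));
    }

    /** Sum of squared residuals for candidate parameters: the quantity a fitter would minimise. */
    static double sumSquaredResiduals(double[] conc, double[] observed,
                                      double bottom, double top, double ac50, double slope) {
        double ss = 0.0;
        for (int i = 0; i < conc.length; i++) {
            double residual = observed[i] - response(conc[i], bottom, top, ac50, slope);
            ss += residual * residual;
        }
        return ss;
    }
}
```

A real fitter would search the four parameters to minimise `sumSquaredResiduals` and then assign a curve class from the fit quality; neither step is shown here.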
  8. Informatics Activities
     • High throughput curve fitting
     • Data integration, automated cherry picking
     • SAR algorithms
       – QSAR modeling
       – Fragment based analysis
       – Activity cliffs
     • Tools: standardizer, tautomers, fragment activity browser, kinome browser and more
     • RNAi hit selection, OTE analysis
     • High content analysis
  9. Kinome Navigator
     • Browse kinase panel data
     • Currently focused on the Abbott dataset
     • View
       – Fragments
       – Target pairs
       – Kinome overlay
     http://tripod.nih.gov
  10. Fragment Browser
      • View activities on a fragment-wise basis
      • Compare activity distributions by fragment
      • Currently based around ChEMBL assays, but users can browse their own compounds & activities
      http://tripod.nih.gov
  11. Structure Activity Landscapes
      • Rugged gorges or rolling hills?
        – Small structural changes associated with large activity changes represent steep slopes in the landscape
        – But traditionally, QSAR assumes gentle slopes
        – We can characterize the landscape using SALI
      Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
  12. What Can We Do With SALI?
      • SALI characterizes cliffs & non-cliffs
      • For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape
      • Models try to encode this landscape
      • Use the landscape to guide descriptor or model selection
      Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
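The slides use SALI without spelling it out. In the Guha & Van Drie paper cited above, the index for a pair of molecules is SALI(i, j) = |A_i − A_j| / (1 − sim(i, j)). A minimal sketch follows, assuming binary fingerprints held in `BitSet` and Tanimoto similarity; the class name and fingerprint representation are illustrative assumptions.

```java
import java.util.BitSet;

/** Pairwise SALI from two activities and a fingerprint similarity; illustrative only. */
final class Sali {
    /** Tanimoto similarity of two binary fingerprints: |A and B| / |A or B|. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        int unionCount = union.cardinality();
        return unionCount == 0 ? 1.0 : (double) intersection.cardinality() / unionCount;
    }

    /**
     * SALI for one pair. Large values flag activity cliffs: similar structures, very
     * different activities. Identical fingerprints (similarity 1) give Infinity or NaN,
     * so callers need to handle that case separately.
     */
    static double sali(double activityI, double activityJ, double similarity) {
        return Math.abs(activityI - activityJ) / (1.0 - similarity);
    }
}
```

Computing this for every pair gives the matrix that the landscape views summarise; the choice of fingerprint changes the similarity and therefore which pairs look like cliffs.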
  13. Predicting the Landscape
      • Rather than predicting activity directly, we can try to predict the SAR landscape
      • Implies that we attempt to directly predict cliffs
        – Observations are now pairs of molecules
      [Figure panels]
        Original pIC50: RMSE = 0.97
        SALI, AbsDiff:  RMSE = 1.10
        SALI, GeoMean:  RMSE = 1.04
      Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
  14. Data Integration
      • It's nice to simplify data, but we can still be faced with a multitude of data types
      • We want to explore these data in a linked fashion
      • How we explore and what we explore is generally influenced by the task at hand
      • At one point, make inferences over all the data
  15. Data Integration
      [Figure labels: User's Network; Network of Public Data]
      Content:
        - Drugs
        - Compounds
        - Scaffolds
        - Assays
        - Genes
        - Targets
        - Pathways
        - Diseases
        - Clinical Trials
        - Documents
      Links:
        - Manually curated
        - Derived from algorithms
  16. Record View of an Assay
  17. Access Disease Hierarchy & Network
  18. Articles, Patents, Drug Labels, …
  19. NPC Browser
      http://tripod.nih.gov/npc/
  20. Going Beyond Exploration?
      • Simply being able to explore data in an integrated manner is useful as an idea generator
      • Can we integrate heterogeneous data types & sources to get a systems level view?
        – Current research problem in genomics and systems biology
        – Some attempts have been made to merge chemical data with other data types
      Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68
  21. RNAi Facility Mission
      • Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
      • Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs
      Range of Assays
        – Simple Phenotypes (viability, cytotoxicity, oxidative stress, etc.)
        – Pathway (reporter assays, e.g. luciferase, β-lactamase)
        – Complex Phenotypes (high-content imaging, cell cycle, translocation, etc.)
  22. RNAi Effectors
      RNAi effectors provide an excellent way to conduct gene-specific loss-of-function studies.
  23. Issues Using RNAi Effectors
      • RNAi effectors give a knockdown, not a knockout (70%–80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.
      • RNAi effectors induce off-target effects!
  24. Examples of Current Projects
      • Protein Quality Control
      • DNA Re-replication
      • Base Excision Repair
      • DNA Damage – ELG1 stabilization
      • Antioxidant Response
      • Hypoxia
      • TNFa Response
      • Interferon Response
      • iPS to RPE
      • Poxvirus
      • Respiratory Viruses
      • Lysosomal Storage Disorders
      • Parkinson's – Mitochondrial Quality Control
      • Ewing's Sarcoma
      • Drug Modifiers, Pancreatic Cancer
      • Drug Modifiers, TOP1 Clinical Agents
      • Immunotoxin-Mediated Cell Death
  25. User Accessible Tools
  26. RNAi Libraries
      – Ambion Human Genome-Wide Library, 21,585 genes, 3 unique siRNAs per gene
      – Ambion Mouse Genome-Wide Library, 17,582 genes, 3 unique siRNAs per gene
      – Dharmacon Human Duet Genome-Wide siRNA Libraries, 18,236 genes, siRNA pools
      – Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library
      – Qiagen Human Druggable Genome Library, > 7,000 genes, 4 unique siRNAs per gene
      – Kinome Libraries, purchased from a number of vendors
      • Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.
      • Considerations are being made for additional species and shRNA resources.
  27. Druggable Genome Screening Campaign
      • Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).
      • 85 genes were selected for follow-up through a variety of threshold-based selection schemes.
      • 27 genes were validated as confident hits using siRNAs from multiple vendors.
      [Figure: plate heat map, pseudo-colored Blue/Green Ratio (normalized to plate median)]
      [Figure: Significant enrichment for core NF-kB components. Percent Reduction in NF-kB Signal, Average Inhibition (%), for Qiagen and Ambion siRNAs targeting TNFα Receptor, IKKα, RELA, NEMO]
  28. Druggable Genome Screening Campaign
      Significant enrichment for proteins that form the 28S proteasome
      [Figure: Percent Reduction in NF-kB Signal, Average Inhibition (%), for Qiagen and Ambion siRNAs targeting PSM genes D14, C4, C5, D2, D7, B2, B3, B4, A4, A5, A6, A7, A1, A2, A3, grouped by PSM protein: α core 20S, β core 20S, RPT 19S, RPN 19S]
      [Diagram: 19S regulator particle (RPN, RPT) and 20S proteasome (α1-7, β1-7); Murata et al, Nature Reviews Mol. Cell Biol.]
      An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway.
  29. Seed Sequence Analysis
      Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.
  30. Seed Sequence Analysis
      Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate the NF-kB reporter, adding to the likelihood of this being an off-target effect.
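The two seed-analysis slides rest on one comparison: collect every siRNA in the library that shares a seed with an active siRNA, then ask whether those other siRNAs are also active. A minimal sketch of that grouping follows, assuming the seed is nucleotides 2–8 of the guide strand (a common convention that the slides do not state) and that each siRNA has a single activity value. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Groups siRNA activities by seed sequence; illustrative only. */
final class SeedAnalysis {
    /** Assumed seed definition: nucleotides 2-8 (1-based) of the guide strand. */
    static String seed(String guide) {
        if (guide.length() < 8) {
            throw new IllegalArgumentException("guide strand too short: " + guide);
        }
        return guide.substring(1, 8);
    }

    /** Mean activity of all siRNAs sharing each seed, keyed by seed. */
    static Map<String, Double> meanActivityBySeed(Map<String, Double> activityByGuide) {
        Map<String, double[]> sums = new TreeMap<>(); // seed -> {sum, count}
        for (Map.Entry<String, Double> e : activityByGuide.entrySet()) {
            double[] acc = sums.computeIfAbsent(seed(e.getKey()), k -> new double[2]);
            acc[0] += e.getValue();
            acc[1] += 1;
        }
        Map<String, Double> means = new TreeMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }
}
```

A seed whose siRNAs are active regardless of their intended target points to an off-target effect (the SLC24A1 case); a seed that is inactive elsewhere supports an on-target effect (the PSMA3 case).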
  31. RNAi & Small Molecule Screens
      [Diagram labels]
      • What targets mediate activity of siRNA and compound
        – Pathway elucidation, identification of interactions
        – Target ID and validation
      • Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology
      • Reuse pre-existing MLI data
      • Develop new annotated libraries
      • Run parallel RNAi screen
      CAGCATGAGTACTACAGGCCA
      TACGGGAACTACCATAATTTA
      Goal: Develop systems level view of small molecule activity
  32. Matching Phenotypes
      [Figure panels: RNAi; Small Molecule]
  33. Merging Screening Technologies
      High throughput screening
        • Lead identification
        • Single (few) read outs
        • High throughput
        • Moderate data volumes
      High content screening
        • Phenotypic profiling
        • Multiple parameters
        • Moderate throughput
        • Very large data volumes
      • We'd like to combine the technologies, to obtain rich high-resolution data at high speed
      • Is this feasible? What are the trade-offs?
  34. Merging Screening Technologies
      • A simple solution is to run an HTS & HCS as separate, primary & secondary screens
      • Alternatively: Wells to Cells
        – Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
        – Selective imaging of interesting wells
  35. Wells to Cells Workflow
      • Sequential qHTS using laser scanning cytometry followed by high-res microscopy
      • Unit of work is a plate series
      • The same aliquot is analyzed by both techniques
      • A message based system
      • The key is deciding which wells go through the workflow
  36. Wells to Cells Assays
      • Cell cycle, cell translocation, DNA rereplication
      • All assays run against LOPAC1280
      • Consistency between cytometry & microscopy is measured by the R² between log AC50's
        – Cell cycle, 0.94 – 0.96
        – Cell translocation, 0.66 – 0.94
        – DNA rereplication, still in progress
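The consistency measure above can be sketched as the squared Pearson correlation between the two sets of log AC50 values, one per compound per platform. The slide does not say exactly how R² was computed, so treat this as one reasonable reading; the class name is hypothetical.

```java
/** Squared Pearson correlation between paired values, e.g. log AC50 per compound; illustrative only. */
final class Concordance {
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 points");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // NaN if either series is constant (zero variance)
        return (sxy * sxy) / (sxx * syy);
    }
}
```

Here `x` would hold the cytometry log AC50 values and `y` the microscopy values for the same compounds, in the same order.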
  37. Cell Translocation Example Hits
  38. Informatics Platform
      [Figure label: InCell Layout File]
      • Advanced correction and normalization methods
      • Sophisticated curve fitting algorithm
      • Good performance, allows parallelization of the entire workflow
  39. Why Messaging?
      • A messaging architecture allows for significant flexibility
        – Persistent, can be kept for process tracking, reporting
        – Asynchronous, allows individual components of the workflow to proceed at their own pace
        – Modular, new components can be introduced at any time without redesigning the whole workflow
      • We employ Oracle AQ, but any message queue can be employed
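The asynchronous, modular behaviour listed above can be illustrated with an in-memory queue. This sketch swaps Oracle AQ for `java.util.concurrent.LinkedBlockingQueue`, so it shows the decoupling but not the persistence; the message contents and the stop marker are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Producer and consumer decoupled by a message queue; illustrative stand-in for Oracle AQ. */
final class PlateWorkflow {
    private static final String STOP = "__STOP__"; // hypothetical end-of-stream marker

    /** Publishes one message per plate; a separate consumer thread handles each at its own pace. */
    static List<String> run(List<String> plateIds) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = Collections.synchronizedList(new ArrayList<>());

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String message = queue.take(); // blocks until a message arrives
                    if (message.equals(STOP)) {
                        break;
                    }
                    processed.add("imaged:" + message); // placeholder for the imaging step
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (String plateId : plateIds) {
                queue.put(plateId); // the producer never waits for the consumer to finish
            }
            queue.put(STOP);
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("workflow interrupted", e);
        }
        return new ArrayList<>(processed);
    }
}
```

Adding a new step means adding another consumer on its own queue; the producer code does not change, which is the modularity the slide describes.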
  40. Handling Multiple Platforms
      • Current examples employ InCell hardware
      • We also use Molecular Devices hardware
      • As a result we have two orthogonal image stores / databases
      • Need to integrate them
        – Support seamless data browsing across multiple screens irrespective of imaging platform used
        – Support analytics external to vendor code
  41. A Unified Interface
      • A client sees a single, simple interface to screening image data
        http://host/rest/protocol/plate/well/image
      • Transparently extract image data via the MetaXpress database or via custom code
      • Currently the interface addresses image serving
      • Unified metadata interface in the works
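The URL pattern on this slide puts protocol, plate, well and image into fixed path segments. A minimal sketch of building and parsing such addresses on the client side follows, assuming each segment is a plain identifier with no slashes; the class and the example identifiers are hypothetical, and this is not the actual service code.

```java
import java.net.URI;

/** Client-side view of http://host/rest/protocol/plate/well/image; illustrative only. */
final class ImageAddress {
    final String protocol;
    final String plate;
    final String well;
    final String image;

    ImageAddress(String protocol, String plate, String well, String image) {
        this.protocol = protocol;
        this.plate = plate;
        this.well = well;
        this.image = image;
    }

    /** Builds the URL for this image on the given host. */
    String toUrl(String host) {
        return "http://" + host + "/rest/" + protocol + "/" + plate + "/" + well + "/" + image;
    }

    /** Parses a URL of the same shape back into its four components. */
    static ImageAddress parse(String url) {
        String[] parts = URI.create(url).getPath().split("/");
        // parts[0] is empty because the path starts with '/'
        if (parts.length != 6 || !parts[1].equals("rest")) {
            throw new IllegalArgumentException("not an image URL: " + url);
        }
        return new ImageAddress(parts[2], parts[3], parts[4], parts[5]);
    }
}
```

Because the client only sees this address, the server is free to resolve it against the MetaXpress database or against custom code for the other platform.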
  42. Trade-offs & Opportunities
      • Automation reduces the ability to handle unforeseen errors
        – Dispense errors and other plate problems
        – Well selection based on curve classes may need to be modified on the fly
      • Well selection does not consider SAR
        – Wells are selected independently of each other
        – If we could model SAR on the fly (or from validation screens), we'd select multiple wells, to obtain positive and negative results
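Well selection by curve class, with the selection rule adjustable while a screen is running, could look roughly like the sketch below. The curve class labels and the choice of which classes count as interesting are assumptions for illustration; as the slide notes, each well is judged independently of the others.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Picks wells for imaging by curve class; the allowed classes can be changed mid-run. */
final class WellSelector {
    private final Set<String> allowedClasses = new HashSet<>();

    WellSelector(Collection<String> initialClasses) {
        allowedClasses.addAll(initialClasses);
    }

    /** Start selecting an additional curve class. */
    void allow(String curveClass) {
        allowedClasses.add(curveClass);
    }

    /** Stop selecting a curve class. */
    void disallow(String curveClass) {
        allowedClasses.remove(curveClass);
    }

    /** Returns the well IDs, in sorted order, whose curve class is currently allowed. */
    List<String> select(Map<String, String> curveClassByWell) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(curveClassByWell).entrySet()) {
            if (allowedClasses.contains(e.getValue())) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }
}
```

An SAR-aware selector would replace the per-well test in `select` with one that looks at structurally related wells together, which this sketch does not attempt.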
  43. Cloud Computing & Cheminformatics
      • Cloud computing is a hot topic
      • A number of examples of computational chemistry / cheminformatics on the cloud
        – MolPlex, hBar, Numerate, Wingu, Sciligence, Pfizer
      • Many examples use the cloud for remote storage or remote (hosted) computations
      • But providers such as Amazon allow us to run distributed computing applications on the cloud
  44. Map/Reduce
      • Map/Reduce is a programming model for efficient distributed computing
      • M/R made "famous" by Google, but the idea has been around for a long time
      • It works like a Unix pipeline:
        – cat input | grep | sort             | uniq -c | cat > output
        – Input     | Map  | Shuffle & Sort   | Reduce  | Output
      • Efficiency from
        – Streaming through data, reducing seeks
        – Pipelining
      Owen O'Malley, http://bit.ly/ecHPvB
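The pipeline analogy can be made concrete with a single-process sketch of the three phases. This is plain Java collections, not the Hadoop API, and it counts whitespace-separated tokens; the class name is hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** In-memory illustration of Input | Map | Shuffle & Sort | Reduce | Output. */
final class MiniMapReduce {
    /** Map: one input line in, a list of (token, 1) pairs out. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle & sort: group the values by key, with keys in sorted order. */
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: collapse all values for one key into a single count. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases over all input lines. */
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        SortedMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }
}
```

In a real cluster the map calls run on different nodes and the shuffle moves data across the network; only the per-record `map` and per-key `reduce` logic is written by the user.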
  45. Map/Reduce
      [Figure] Owen O'Malley, http://bit.ly/ecHPvB
  46. Hadoop & Cheminformatics
      • Hadoop is an Open Source implementation of the map/reduce paradigm
      • Hadoop is a framework for scalable, distributed computing
        – Hadoop, HDFS, Hive, Pig
      • Importantly, you can play with all this on your laptop and just copy files to the big cluster when you're ready for production
  47. Why Hadoop?
      • Simple way to make use of large clusters without MPI etc.
      • AWS supports Hadoop, so easy to scale up to 100's or 1000's of cores
      • Great for Java code, but non-Java code can also make use of Hadoop
      • M/R can be applied to a lot of problems, but one of the simplest is to use it as a "chunker"
  48. Cheminformatics in Parallel
      • Many cheminformatics problems are data parallel
        – Chunk the data and apply the same technique over each chunk
      • This makes many problems amenable to M/R
        – Substructure / pharmacophore search
        – Descriptor calculations, virtual screening
        – Model development (?)
      • In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
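The chunking idea is independent of Hadoop. As a minimal sketch, the same pattern on one machine uses a thread pool in place of cluster nodes; the per-item function stands in for any non-parallel cheminformatics routine, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Data-parallel processing: split the input into chunks, run the same code on each. */
final class Chunker {
    /** Splits a list into consecutive chunks of at most `size` items. */
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            chunks.add(new ArrayList<>(items.subList(start, Math.min(start + size, items.size()))));
        }
        return chunks;
    }

    /** Applies a serial per-item function to every chunk in parallel; results keep input order. */
    static <T, R> List<R> processInParallel(List<T> items, int chunkSize, int threads,
                                            Function<T, R> perItem) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> oneChunk : chunk(items, chunkSize)) {
                futures.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T item : oneChunk) {
                        out.add(perItem.apply(item)); // the per-item code itself is serial
                    }
                    return out;
                }));
            }
            List<R> results = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                results.addAll(future.get()); // collected in submission order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, passing a list of SMILES strings and a descriptor function as `perItem` would compute descriptors chunk by chunk without any change to the descriptor code.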
  49. Cheminformatics in Parallel
      See http://blog.rguha.net/?tag=hadoop for examples & code
  50. Substructure Searching
      • Substructure searching is a trivial extension of atom counting
      • If a structure matches, emit (name,1)
      • Otherwise (name,0)
      • Reducer simply outputs tuples of the form (name,1)

      public class SubSearch {
          …   // sp, sqt, one and zero are used below but declared in the part elided on the slide
          public static class MoleculeMapper extends Mapper<Object, Text, Text, IntWritable> {
              private Text matches = new Text();
              private String pattern;

              // Read the SMARTS pattern from the job configuration
              public void setup(Context context) {
                  pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
              }

              // One SMILES line in, one (name, 1) or (name, 0) tuple out
              public void map(Object key, Text value, Context context)
                      throws IOException, InterruptedException {
                  try {
                      IAtomContainer molecule = sp.parseSmiles(value.toString());
                      sqt.setSmarts(pattern);
                      boolean matched = sqt.matches(molecule);
                      matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                      if (matched) context.write(matches, one);
                      else context.write(matches, zero);
                  } catch (CDKException e) {
                      e.printStackTrace();
                  }
              }
          }

          public static class SMARTSMatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              private IntWritable result = new IntWritable();

              // Keep only the molecules that matched
              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  for (IntWritable val : values) {
                      if (val.compareTo(one) == 0) {
                          result.set(1);
                          context.write(key, result);
                      }
                  }
              }
  51. Running on AWS
      • All the code was debugged on my laptop with relatively small files
      • To test the scalability, I shifted everything to AWS
        – Pharmacophore search
        – 136K structures, single conformer, 560MB
        – Created a single JAR file with CDK & application code
        – Uploaded data files to S3
      • Total cost of experiments was ~ $10
  52. But I Don't Want to Write Programs
      • All these examples require us to write full-fledged Java classes
      • An easier way is to use Pig & Pig Latin, a platform and query language built on top of Hadoop
      • Lets us write SQL-like queries that make use of Hadoop underneath
      • Flexible due to user defined functions (UDFs)
        – UDFs encapsulate the cheminformatics
  53. Cheminformatics & Pig
      A = load 'medium.smi' as (smiles:chararray);
      B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
      store B into 'output.txt';
      • Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
      • The complexity is now hidden in the UDF
      • Many toolkit functions could be wrapped as UDFs, allowing flexible queries with much simpler code
      • See http://blog.rguha.net/?p=748 for the code
  54. Latency
      • Hadoop is suited for batch processing
      • Significant network I/O involved in distributing data to compute nodes
      • Not good for
        – Random ad hoc processing of small subsets
        – Small volume data
        – Real time (low latency) work
      • But latency issues can be addressed somewhat by HBase, Hive and other technologies
  55. More than Chunking?
      • But all the examples so far could have been done via PBS/Condor or any other job scheduler
        – (With Hadoop we don't have to worry about explicit chunking of the input data)
      • But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
        – Predictive modeling?
        – Graph algorithms?
  56. More than Chunking?
      • Both predictive & graph algorithms are increasingly supported in Hadoop
        – Mahout for M/L algorithms on massive datasets
        – Cloud9 for graph algorithms
      • A number of bioinformatics applications make use of M/R at the algorithmic level
      • They are all big applications
        – Crossbow aligns 3 billion paired/unpaired reads
      • Cheminformatics datasets are not very big
  57. Summary
      • HTS data is an ample playground for interesting analytics; multiple data types make it more fun
      • A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
      • Hadoop and M/R provide great opportunities for handling large data in a flexible manner
      • But can cheminformatics really make use of it?
  58. Acknowledgments
      Informatics
        • Ajit Jadhav
        • Trung Nguyen
        • Noel Southall
        • Ruili Huang
        • Min Shen
        • Hongmao Sun
        • Xin Hu
        • Tongan Zhao
      RNAi & Small Molecule
        • Scott Martin
        • Pinar Tuzmen
        • Yu-Chi Chen
        • Carleen Klump
        • Craig Thomas
        • Jim Inglese
        • Ron Johnson
        • Sam Michael
        • Jennifer Wichterman
  59. Counting Atoms
      • The canonical Hadoop program is to count the frequency of words in a text file
        – Mapper reads a line, outputs a tuple (word, 1)
        – Reducer will receive tuples, keyed on word
        – Summing up the 1's gives us the frequency of word
      • By default, Hadoop works on a line-by-line basis
      • For cheminformatics problems, SMILES files satisfy this requirement: one line, one molecule
  60. Counting Atoms
      • Uses the CDK to parse SMILES
      • For each molecule loop over atoms
        – Emit (symbol,1)
      • Reducer simply sums the 1's for each symbol

      public class HeavyAtomCount {
          static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

          public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              // One SMILES line in, one (symbol, 1) tuple out per atom
              public void map(Object key, Text value, Context context)
                      throws IOException, InterruptedException {
                  try {
                      IAtomContainer molecule = sp.parseSmiles(value.toString());
                      for (IAtom atom : molecule.atoms()) {
                          word.set(atom.getSymbol());
                          context.write(word, one);
                      }
                  } catch (InvalidSmilesException e) {
                      // do nothing for now
                  }
              }
          }

          public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              private IntWritable result = new IntWritable();

              // Sum the 1's emitted for each element symbol
              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable val : values) {
                      sum += val.get();
                  }
                  result.set(sum);
                  context.write(key, result);
              }
          }
          ….
      }
  61. Multiline Records
      • Lots of cheminformatics applications require 3D; SMILES won't do. Need to support SDF
      • We implement a custom RecordReader to process SD files
      • We're now ready to tackle pretty much most cheminformatics tasks
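The custom RecordReader itself is not shown on the slide. The core of any such reader is recognising record boundaries: in an SD file each molecule record ends with a line containing `$$$$`. A minimal sketch of that splitting logic follows, using only `java.io`; it is not the Hadoop `RecordReader` API, and the class name is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Splits SD file text into one string per molecule record; illustrative only. */
final class SdfSplitter {
    private static final String RECORD_END = "$$$$";

    /** Returns the records in file order, each without its terminating $$$$ line. */
    static List<String> readRecords(Reader in) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().equals(RECORD_END)) {
                    records.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Keep a trailing record that lacks the terminator, unless it is only whitespace
        if (!current.toString().trim().isEmpty()) {
            records.add(current.toString());
        }
        return records;
    }
}
```

A Hadoop version also has to cope with input splits that begin in the middle of a record, which this single-stream sketch does not address.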
  62. Why Hadoop?
      • Java and C++ APIs
        – In Java use Objects, while in C++ bytes
      • Each task can process data sets larger than RAM
      • Automatic re-execution on failure
        – In a large cluster, some nodes are always slow or flaky
        – Framework re-executes failed tasks
      • Locality optimizations
        – M/R queries HDFS for locations of input data
        – Map tasks are scheduled close to the inputs when possible
      Owen O'Malley, http://bit.ly/ecHPvB