Chemogenomics in the cloud: Is the sky the limit?
Published in Technology, Business


  • 1. Chemogenomics in the cloud: Is the sky the limit?
    Rajarshi Guha, Ph.D.
    NIH Center for Translational Therapeutics
    June 28, 2012
  • 2. The cloud as infrastructure
      • Cloud computing is a service for
          – Infrastructure
          – Platform
          – Software
      • Many of the benefits of cloud computing are
          – Economic
          – Political
      • Won’t be discussing the remote hosting aspects of clouds
  • 3. Characteristics of the cloud
    [Diagram: "Cloud Computing" at the center, surrounded by: Pay-per-use; Virtually assemble; Offsite technology; Shared workloads; Massive scale; On-demand self service]
    http://-cloud-computing
  • 4. Parallel computing in the cloud
      • Modern cloud vendors make provisioning compute resources easy
          – Allows one to handle unpredictable loads easily
          – Pay only for what you need
      • Chemistry applications don’t usually have very dynamic loads
      • But large scale resources are an opportunity for large scale (parallel) computations
  • 5. Storing chemical information
      • Fill up a hard drive, mail to Amazon
      • Copy over the network
          – Aspera
          – GridFTP
      • Still need to pay for storage space
      • Lots of options on the cloud: S3, relational DBs
      • See Chris Dagdigian’s talk for views on storage
    http://-trends-from-the-trenches
  • 6. Recoding for the cloud?
      • Only if we really have to
      • Large amounts of legacy code run perfectly well on local clusters
          – May not make sense to recode as a map-reduce job
          – May not be possible to
      • Different levels of HPC on the cloud
          – Legacy HPC
          – ‘Cloudy’ HPC
          – Big Data HPC
    http://-life-science-informatics-to-the-cloud
  • 7. Recoding for the cloud?
      • Legacy HPC
          – Use cloud resources in the same way as a local cluster
          – MIT StarCluster makes this easy to do
      • Cloudy HPC
          – Make use of cloud capabilities
          – Old algorithms, new infrastructure
          – Spot instances, SNS, SQS, SimpleDB, S3, etc.
      • Big Data HPC
          – Huge datasets
          – Candidates for map-reduce
          – Involves algorithm (re)design
    http://-life-science-informatics-to-the-cloud
  • 8. How does the cloud enable science?
      • How does the cloud change computational chemistry, cheminformatics, …
          – The way we do them
          – The scale at which we do them
    Are there problems we can address that we could not have addressed without on-demand, scalable cloud resources?
  • 9. Big data & cheminformatics
      • Computation over large chemical databases
          – PubChem, ChEMBL, …
      • What types of computations?
          – Searches (substructure, pharmacophore, …)
          – QSAR models over large data
          – Predictions for large data
      • Certain applications just need structures
      • Access to correspondingly massive experimental datasets is tough (impossible?)
  • 10. Big data & cheminformatics
      • GDB-13 is a truly big database: 977 million different structures
          – Current search interface is based on NN searches using a reduced representation
          – Could be a good candidate for a Hadoop based analysis
      • More generally, enumerated virtual libraries can also lead to very big data
          – Time required to enumerate is a bottleneck
  • 11. Big data & cheminformatics
      • Fundamentally, “big chemical data” lets us explore larger chemical spaces
          – Can plow through large catalogs
          – e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al.]
      • This can push predictive models to their limits
          – Brings us back to the global vs local arguments
  • 12. The Hadoop ecosystem
      • A framework for the map-reduce algorithm
          – Not something you can download and just run
          – Need to implement the infrastructure and then develop code to run using the infrastructure
      • Low level Hadoop programs can be large, complex and tedious
      • Abstractions have been developed that make Hadoop queries more SQL-like, resulting in much more concise code
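A minimal sketch of the map-reduce pattern that slide 12 refers to, written as plain in-process Python rather than actual Hadoop code. It counts atoms across a few SMILES strings. The regular expression is a deliberately crude tokenizer (organic-subset symbols only, bracket atoms ignored), and the function names `map_atoms`, `shuffle`, `reduce_counts` and `atom_counts` are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Assumption: a crude SMILES tokenizer covering only the organic subset.
# Bracket atoms such as [Na+] are ignored; a real job would use a toolkit parser.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def map_atoms(smiles):
    """Map step: emit an (element, 1) pair for every atom symbol found."""
    for symbol in ATOM.findall(smiles):
        yield symbol.capitalize(), 1  # aromatic 'c' is counted as 'C'


def shuffle(pairs):
    """Shuffle step: group emitted values by key (Hadoop does this for you)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one element."""
    return key, sum(values)


def atom_counts(smiles_list):
    pairs = (pair for smi in smiles_list for pair in map_atoms(smi))
    return dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(atom_counts(["CCO", "c1ccccc1", "CC(=O)Cl"]))
```

In a real Hadoop job the map and reduce functions would run on different nodes and the framework would handle the grouping step; only the division of labour is illustrated here.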
  • 13. The Hadoop ecosystem
    [Diagram of components: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, Hive, Map Reduce Engine, Hadoop Distributed Filesystem, Hadoop Common]
    Based on http://-part-3-matt-aslett-the-hadoop-ecosystem
  • 14. Simplifying Hadoop applications
      • Raw Hadoop programs can be very tedious to write
    [Code listing: SMARTS based substructure search]
  • 15. Pig & Pig Latin
      • Pig Latin programs are much simpler to write and get translated to Hadoop code
      • SQL-like, requires UDF to be implemented to perform non-standard tasks

    SMARTS search in Pig Latin:

        A = load 'medium.smi' as (smiles:chararray);
        B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
        store B into 'output.txt';

    UDF for SMARTS search:

        public class SMATCH extends FilterFunc {
            static SMARTSQueryTool sqt;
            static {
                try {
                    sqt = new SMARTSQueryTool("C");
                } catch (CDKException e) {
                    System.out.println(e);
                }
            }
            static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

            public Boolean exec(Tuple tuple) throws IOException {
                if (tuple == null || tuple.size() < 2) return false;
                String target = (String) tuple.get(0);
                String query = (String) tuple.get(1);
                try {
                    sqt.setSmarts(query);
                    IAtomContainer mol = sp.parseSmiles(target);
                    return sqt.matches(mol);
                } catch (CDKException e) {
                    throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
                }
            }
        }
  • 16. Working on top of Hadoop
      • Hadoop doesn’t know anything about cheminformatics
          – Need to write your own code, UDFs, etc.
      • But application layers have been developed for other purposes
          – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
          – Possible to build virtual screening pipelines based on the Hadoop framework
  • 17. What Hadoop is not for
      • Doesn’t replace an actual database
      • It’s not uniformly fast or efficient
      • Not good for ad hoc or real-time analysis
      • Not effective unless dealing with massive datasets
      • Not all algorithms are amenable to the map-reduce method
          – CPU-bound methods and those requiring communication
  • 18. Cheminformatics on Hadoop
      • Hadoop and Atom Counting
      • Hadoop and SD Files
      • Cheminformatics, Hadoop and EC2
      • Pig and Cheminformatics
    But are cheminformatics problems really big enough to justify all of this?
  • 19. How big is big?
      • Bryk et al. performed an LBVS of 5 million compounds to identify PKR inhibitors
          – Pharmacophore fingerprints + perceptron
          – Required conformer generation
      • Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long
      • Example: an RF model built on 512-bit binary fingerprints gives predictions for 5M fingerprints in 12 min [Single core, 3 GHz Xeon, OS X 10.6.8]
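A back-of-envelope sketch of what the timing on slide 19 implies. The 5M fingerprints and 12 minutes come from the slide, and the 977 million structures come from the GDB-13 figure on slide 10. The extrapolation assumes prediction time scales linearly with the number of fingerprints on the same single core, which is an assumption made here and not a measured result.

```python
# Figures quoted on the slides.
N_SCREENED = 5_000_000      # fingerprints predicted
MINUTES_TAKEN = 12          # single core, 3 GHz Xeon
GDB13_SIZE = 977_000_000    # structures in GDB-13


def predictions_per_second(n, minutes):
    """Throughput implied by n predictions in the given number of minutes."""
    return n / (minutes * 60)


def hours_for(n, rate_per_second):
    """Hours needed for n predictions at a fixed rate (assumes linear scaling)."""
    return n / rate_per_second / 3600


rate = predictions_per_second(N_SCREENED, MINUTES_TAKEN)
gdb13_hours = hours_for(GDB13_SIZE, rate)

if __name__ == "__main__":
    print(f"{rate:.0f} predictions/s")          # about 6944 per second
    print(f"{gdb13_hours:.1f} h for GDB-13")    # about 39.1 hours on one core
```

Under that assumption, even a GDB-13-sized prediction run stays in the range of a day or two on one core, which is consistent with the slide's point that screening itself is rarely the bottleneck.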
  • 20. Going beyond chunking?
      • All the preceding use cases are embarrassingly parallel
          – Chunking the input data and applying the same operation to each chunk
          – Very nice when you have a big cluster
    Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
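The chunking pattern from slide 20 can be sketched in a few lines of Python. This is an illustration only: `score` is a placeholder for a real per-molecule operation (descriptor calculation, model prediction), and a local thread pool stands in for the cluster nodes that would each receive a chunk.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def score(smiles):
    """Placeholder operation (assumption): count the letters in a SMILES string."""
    return sum(ch.isalpha() for ch in smiles)


def process_chunk(chunk):
    """The same operation applied independently to every item of one chunk."""
    return [score(smi) for smi in chunk]


def run_chunked(smiles_list, chunk_size=2, workers=4):
    """Process chunks in parallel; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(process_chunk, chunked(smiles_list, chunk_size))
        return [value for part in parts for value in part]


if __name__ == "__main__":
    print(run_chunked(["CCO", "c1ccccc1", "CC(=O)Cl", "N"]))
```

No chunk needs to know about any other chunk, which is what makes the pattern embarrassingly parallel and also why it says nothing about algorithms that need communication between workers.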
  • 21. Going beyond chunking?
      • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
          – Doesn’t always avoid the O(N^2) barrier
          – Bioisostere identification is one case that could be rephrased as a map-reduce problem
      • Search algorithms such as GAs and particle swarms can make use of map-reduce
          – GA based docking
          – Feature selection for QSAR models
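One way a pairwise calculation might be phrased in map-reduce terms, sketched in plain Python under stated assumptions: fingerprints are represented as sets of on-bit positions, similarity is the Tanimoto coefficient, and the reduce step keeps the nearest neighbour of each molecule. The function names are hypothetical. The map step still visits all N(N-1)/2 pairs, so the O(N^2) cost is distributed, not removed.

```python
from collections import defaultdict
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def map_pair(i, j, fps):
    """Map step: one pair in, one (molecule, (similarity, partner)) out per side."""
    sim = tanimoto(fps[i], fps[j])
    yield i, (sim, j)
    yield j, (sim, i)


def reduce_nearest(key, values):
    """Reduce step: keep the most similar partner for one molecule."""
    return key, max(values)


def nearest_neighbours(fps):
    groups = defaultdict(list)
    for i, j in combinations(sorted(fps), 2):   # all N(N-1)/2 pairs
        for key, value in map_pair(i, j, fps):
            groups[key].append(value)
    return dict(reduce_nearest(k, vs) for k, vs in groups.items())


if __name__ == "__main__":
    fps = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {7, 8}}
    print(nearest_neighbours(fps))
```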
  • 22. Going beyond chunking?
      • Machine learning for massive chemical datasets?
          – MR jobs (descriptor generation) + Mahout (model building) lets us handle this in a straightforward manner
      • But will QSAR models benefit from more data?
          – Helgee et al. suggest global models are preferable
          – But diversity and the structure of the chemical space will affect performance of global models
          – Unsupervised methods may be more relevant
          – Philosophical question?
  • 23. Going beyond chunking?
      • Many clustering algorithms are amenable to map-reduce style
          – K-means, Spectral, EM, minhash, …
          – Many are implemented in Mahout
    Problems where we generate large numbers of combinations can be amenable to map-reduce
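To illustrate why k-means fits the map-reduce style mentioned on slide 23, here is a sketch of a single iteration in plain Python. It is not Mahout's implementation. The map step assigns each point to its nearest centroid and the reduce step averages the points assigned to each centroid. Points are assumed to be numeric tuples (for example, descriptor vectors), and a centroid that receives no points is simply left unchanged.

```python
from collections import defaultdict


def assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists)), point


def recompute(points):
    """Reduce step: the new centroid is the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans_step(points, centroids):
    """One k-means iteration expressed as map, group by key, reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = assign(point, centroids)
        groups[key].append(value)
    return [
        recompute(groups[k]) if groups[k] else centroids[k]  # keep empty clusters
        for k in range(len(centroids))
    ]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_step(pts, [(1.0, 1.0), (9.0, 9.0)]))
```

Repeating `kmeans_step` until the centroids stop moving gives the full algorithm; each iteration corresponds to one map-reduce job over the data.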
  • 24. Networks & integration
      • Network models of molecules and targets are common
          – Allows for the incorporation of lots of associated information
          – Diseases, pathways, OTE’s
      • When linked with clinical data & outcomes, we can generate massive networks
          – Adverse events (FDA AERS)
          – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
    [Figure: Yildirim, M.A. et al.]
  • 25. Networks & integration
      • SAR data can be viewed in a network form
          – SALI, SARI based networks
          – Usually requires pairwise calculations of the metric
      • Current studies have focused on small datasets (< 1000 molecules)
      • Hadoop + Giraph could let us apply this to HTS-scale datasets
    [Figure: Peltason, L. et al., http://]
  • 26. Networks & integration
      • When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
          – Network based similarity
          – Community detection (aka clustering)
          – PageRank style ranking (of targets, compounds, …)
          – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
    [Figure: Bauer-Mehren et al.]
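A small sketch of what "PageRank style ranking" from slide 26 could look like on a compound-target network, using plain power iteration in Python. The graph, the node names and the damping factor of 0.85 are illustrative assumptions; at the scale the slide has in mind this would run on a graph framework, not in a single process.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping node -> list of targets."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical network: two compounds linked to one target.
    network = {"cmpd1": ["targetA"], "cmpd2": ["targetA"], "targetA": ["cmpd1"]}
    ranks = pagerank(network)
    for node in sorted(ranks, key=ranks.get, reverse=True):
        print(node, round(ranks[node], 3))
```

The resulting scores are one example of the network metrics the slide suggests feeding into predictive models.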
  • 27. Conclusions
      • Cheminformatics applications can be rewritten to take advantage of cloud resources
          – Remotely hosted
          – Embarrassingly parallel / chunked
          – Map/reduce
      • Ability to process larger structure collections lets us explore more chemical space
      • Integrating chemistry with clinical & pharmacological data can lead to big datasets
  • 28. Conclusions
      • Q: But are cheminformatics problems really big enough to justify all of this?
      • A: Yes. Virtual libraries, integrating chemical structure with other types and scales of data
      • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
      • A: Yes, especially when we consider problems with a combinatorial flavor