Chemogenomics in the cloud: Is the sky the limit?
Published in: Technology, Business

  • 1. Chemogenomics in the cloud: Is the sky the limit?
      Rajarshi Guha, Ph.D.
      NIH Center for Translational Therapeutics
      June 28, 2012
  • 2. The cloud as infrastructure
      • Cloud computing is a service for
          – Infrastructure
          – Platform
          – Software
      • Many of the benefits of cloud computing are
          – Economic
          – Political
      • Won’t be discussing the remote hosting aspects of clouds
  • 3. Characteristics of the cloud
      [Diagram: "Cloud Computing" at the center, surrounded by]
      • Virtually assemble
      • Pay-per-use
      • Offsite technology
      • Shared workloads
      • Massive scale
      • On-demand self service
      http://-cloud-computing
  • 4. Parallel computing in the cloud
      • Modern cloud vendors make provisioning compute resources easy
          – Allows one to handle unpredictable loads easily
          – Pay only for what you need
      • Chemistry applications don’t usually have very dynamic loads
      • But large scale resources are an opportunity for large scale (parallel) computations
  • 5. Storing chemical information
      • Fill up a hard drive, mail to Amazon
      • Copy over the network
          – Aspera
          – GridFTP
      • Still need to pay for storage space
      • Lots of options on the cloud: S3, relational DBs
      • See Chris Dagdigian’s talk for views on storage
      http://-trends-from-the-trenches
  • 6. Recoding for the cloud?
      • Only if we really have to
      • Large amounts of legacy code, runs perfectly well on local clusters
          – May not make sense to recode as a map-reduce job
          – May not be possible to?
      • Different levels of HPC on the cloud
          – Legacy HPC
          – ‘Cloudy’ HPC
          – Big Data HPC
      http://-life-science-informatics-to-the-cloud
  • 7. Recoding for the cloud?
      • Legacy HPC
          – Use cloud resources in the same way as a local cluster
          – MIT StarCluster makes this easy to do
      • Cloudy HPC
          – Make use of cloud capabilities
          – Old algorithms, new infrastructure
          – Spot instances, SNS, SQS, SimpleDB, S3, etc.
      • Big Data HPC
          – Huge datasets
          – Candidates for map-reduce
          – Involves algorithm (re)design
      http://-life-science-informatics-to-the-cloud
  • 8. How does the cloud enable science?
      • How does the cloud change computational chemistry, cheminformatics, …
          – The way we do them
          – The scale at which we do them
      Are there problems that we can address that we could not have if we didn’t have on-demand, scalable cloud resources?
  • 9. Big data & cheminformatics
      • Computation over large chemical databases
          – PubChem, ChEMBL, …
      • What types of computations?
          – Searches (substructure, pharmacophore, …)
          – QSAR models over large data
          – Predictions for large data
      • Certain applications just need structures
      • Access to correspondingly massive experimental datasets is tough (impossible?)
  • 10. Big data & cheminformatics
      • GDB-13 is a truly big database – 977 million different structures
          – Current search interface is based on NN searches using a reduced representation
          – Could be a good candidate for a Hadoop based analysis
      • More generally, enumerated virtual libraries can also lead to very big data
          – Time required to enumerate is a bottleneck
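To illustrate why enumerated virtual libraries grow so quickly, here is a minimal sketch. The scaffold template and R-group lists are hypothetical, and plain string substitution stands in for a real reaction-based enumerator; only the counting argument is the point.

```python
from itertools import product
from math import prod

# Hypothetical scaffold with two attachment points and made-up R-group lists.
SCAFFOLD = "c1cc({r1})ccc1{r2}"
R1 = ["C", "CC", "OC", "N"]
R2 = ["F", "Cl", "Br"]


def library_size(*rgroup_lists):
    """Full library size is the product of the R-group list sizes."""
    return prod(len(rgroups) for rgroups in rgroup_lists)


def enumerate_library(scaffold, r1_list, r2_list):
    """Lazily yield every scaffold/R-group combination as a SMILES-like string."""
    for r1, r2 in product(r1_list, r2_list):
        yield scaffold.format(r1=r1, r2=r2)
```

With three substitution sites and 1,000 candidate fragments each, `library_size` already reaches 10^9 products, which is the scale of GDB-13 mentioned on the slide.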
  • 11. Big data & cheminformatics
      • Fundamentally, “big chemical data” lets us explore larger chemical spaces
          – Can plow through large catalogs
          – e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al]
      • This can push predictive models to their limits
          – Brings us back to the global vs local arguments
  • 12. The Hadoop ecosystem
      • A framework for the map-reduce algorithm
          – Not something you can download and just run
          – Need to implement the infrastructure and then develop code to run using the infrastructure
      • Low level Hadoop programs can be large, complex and tedious
      • Abstractions have been developed that make Hadoop queries more SQL-like, which results in much more concise code
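As a rough illustration of the map-reduce pattern that Hadoop implements at cluster scale, the following single-process sketch shows the three phases (map, shuffle, reduce). It is not Hadoop code; the function names and the toy job (a histogram of SMILES string lengths) are assumptions chosen for brevity.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # map: record -> (key, value) pairs
            groups[key].append(value)      # shuffle: collect values per key
    # reduce: collapse each key's values into one result
    return {key: reducer(key, values) for key, values in groups.items()}


def length_mapper(smiles):
    """Emit the SMILES string length as the key, with a count of 1."""
    yield len(smiles), 1


def sum_reducer(key, values):
    """Add up the counts emitted for one key."""
    return sum(values)
```

A real framework distributes the map calls across machines and performs the shuffle over the network; the user-supplied mapper and reducer keep the same shape.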
  • 13. The Hadoop ecosystem
      [Diagram of ecosystem components]
      • Chukwa, ZooKeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Hama, Hive
      • Map Reduce Engine
      • Hadoop Distributed Filesystem
      • Hadoop Common
      Based on http://-part-3-matt-aslett-the-hadoop-ecosystem
  • 14. Simplifying Hadoop applications
      • Raw Hadoop programs can be very tedious to write
      [Code listing: SMARTS based substructure search]
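  • 15. Pig & Pig Latin
      • Pig Latin programs are much simpler to write and get translated to Hadoop code
      • SQL-like, requires UDF to be implemented to perform non-standard tasks

      SMARTS search in Pig Latin:

          A = load 'medium.smi' as (smiles:chararray);
          B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
          store B into 'output.txt';

      UDF for SMARTS search:

          package net.rguha.dc.pig;

          import java.io.IOException;
          import org.apache.pig.FilterFunc;
          import org.apache.pig.data.Tuple;
          import org.apache.pig.impl.util.WrappedIOException;
          import org.openscience.cdk.DefaultChemObjectBuilder;
          import org.openscience.cdk.exception.CDKException;
          import org.openscience.cdk.interfaces.IAtomContainer;
          import org.openscience.cdk.smiles.SmilesParser;
          import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;

          public class SMATCH extends FilterFunc {
              // Shared query tool; the real pattern is set per call in exec().
              static SMARTSQueryTool sqt;
              static {
                  try {
                      sqt = new SMARTSQueryTool("C");
                  } catch (CDKException e) {
                      System.out.println(e);
                  }
              }
              static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

              // tuple = (target SMILES, SMARTS query); true keeps the row.
              public Boolean exec(Tuple tuple) throws IOException {
                  if (tuple == null || tuple.size() < 2) return false;
                  String target = (String) tuple.get(0);
                  String query = (String) tuple.get(1);
                  try {
                      sqt.setSmarts(query);
                      IAtomContainer mol = sp.parseSmiles(target);
                      return sqt.matches(mol);
                  } catch (CDKException e) {
                      throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
                  }
              }
          }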
  • 16. Working on top of Hadoop
      • Hadoop doesn’t know anything about cheminformatics
          – Need to write your own code, UDFs etc.
      • But application layers have been developed for other purposes
          – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
          – Possible to build virtual screening pipelines based on the Hadoop framework
  • 17. What Hadoop is not for
      • Doesn’t replace an actual database
      • It’s not uniformly fast or efficient
      • Not good for ad hoc or realtime analysis
      • Not effective unless dealing with massive datasets
      • Not all algorithms are amenable to the map-reduce method
          – CPU bound methods and those requiring communication
  • 18. Cheminformatics on Hadoop
      • Hadoop and Atom Counting
      • Hadoop and SD Files
      • Cheminformatics, Hadoop and EC2
      • Pig and Cheminformatics
      But are cheminformatics problems really big enough to justify all of this?
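The atom-counting item can be sketched in the style of a Hadoop Streaming job, where the mapper and reducer exchange tab-separated `key<TAB>value` text lines and the framework sorts by key between them. This is an illustrative assumption, not the code behind the slide: the tokenizer is deliberately crude (organic-subset atoms only, bracket atoms dropped) and a real job would use a cheminformatics toolkit.

```python
import re
from itertools import groupby

BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# Crude tokenizer: two-letter halogens first, then one-letter organic-subset atoms.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def mapper(lines):
    """For each 'SMILES [name]' line, emit 'symbol<TAB>1' per atom found."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        smiles = BRACKET_ATOM.sub("", fields[0])  # drop bracket atoms entirely
        for symbol in ATOM.findall(smiles):
            yield f"{symbol.capitalize()}\t1"     # aromatic 'c' counts as 'C'


def reducer(sorted_lines):
    """Sum the counts for each symbol; input must be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for symbol, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{symbol}\t{sum(int(value) for _, value in group)}"
```

Locally, `sorted(mapper(lines))` plays the role of the shuffle phase before the output is passed to `reducer`.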
  • 19. How big is big?
      • Bryk et al performed an LBVS of 5 million compounds to identify PKR inhibitors
          – Pharmacophore fingerprints + perceptron
          – Required conformer generation
      • Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long
      • Example: RF models built on 512 bit binary fingerprints give us predictions for 5M fingerprints in 12 min [Single core, 3 GHz Xeon, OS X 10.6.8]
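The timing quoted above can be turned into a back-of-the-envelope projection. The sketch below assumes prediction time scales linearly with the number of fingerprints and that fingerprints are already computed; the 977 million figure is the GDB-13 size quoted earlier in the deck, and the helper names are illustrative.

```python
def throughput(n_items, minutes):
    """Items processed per second."""
    return n_items / (minutes * 60.0)


def projected_hours(n_items, rate_per_second, n_cores=1):
    """Hours needed at a given per-core rate, assuming perfect scaling."""
    return n_items / (rate_per_second * n_cores) / 3600.0


rate = throughput(5_000_000, 12)                  # roughly 6,944 predictions per second
gdb13_hours = projected_hours(977_000_000, rate)  # roughly 39 hours on a single core
```

Under these assumptions even a GDB-13 sized prediction run is about a day and a half on one core, which supports the slide's point that prediction alone is not the bottleneck.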
  • 20. Going beyond chunking?
      • All the preceding use cases are embarrassingly parallel
          – Chunking the input data and applying the same operation to each chunk
          – Very nice when you have a big cluster
      Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
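A minimal sketch of the chunking pattern follows. The per-chunk operation is a placeholder (string length standing in for descriptor calculation or model prediction), and a thread pool stands in for the cluster nodes; both are assumptions made to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_chunk(chunk):
    """Placeholder per-chunk operation: one number per input structure."""
    return [len(smiles) for smiles in chunk]


def parallel_apply(items, size, workers=4):
    """Apply score_chunk to every chunk concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_chunk, chunked(items, size))
        return [value for chunk_result in results for value in chunk_result]
```

No chunk needs anything from any other chunk, which is what makes the pattern embarrassingly parallel and also why it does not require map-reduce at the algorithmic level.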
  • 21. Going beyond chunking?
      • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
          – Doesn’t always avoid the O(N²) barrier
          – Bioisostere identification is one case that could be rephrased as a map-reduce problem
      • Search algorithms such as GAs and particle swarms can make use of map-reduce
          – GA based docking
          – Feature selection for QSAR models
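One way to phrase a pairwise calculation as independent map tasks is to split the collection into blocks and treat each pair of blocks as a work unit. The sketch below does this for Tanimoto similarity on fingerprints stored as sets of on-bit indices; the blocking scheme and function names are assumptions for illustration. The total number of comparisons is still N(N-1)/2, so the quadratic cost is distributed rather than removed.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def block_pairs(n_blocks):
    """Work units: every unordered pair of blocks, including a block with itself."""
    return [(i, j) for i in range(n_blocks) for j in range(i, n_blocks)]


def map_block_pair(blocks, i, j, threshold):
    """Compare block i against block j and emit (id_a, id_b, similarity) hits."""
    if i == j:
        pairs = combinations(blocks[i], 2)
    else:
        pairs = ((x, y) for x in blocks[i] for y in blocks[j])
    for (id_a, fp_a), (id_b, fp_b) in pairs:
        sim = tanimoto(fp_a, fp_b)
        if sim >= threshold:
            yield id_a, id_b, sim


def similar_pairs(fingerprints, block_size, threshold):
    """Collect all pairs at or above the threshold, one block pair at a time."""
    items = list(fingerprints.items())
    blocks = [items[k:k + block_size] for k in range(0, len(items), block_size)]
    hits = []
    for i, j in block_pairs(len(blocks)):  # each iteration could be one map task
        hits.extend(map_block_pair(blocks, i, j, threshold))
    return sorted(hits)
```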
  • 22. Going beyond chunking?
      • Machine learning for massive chemical datasets?
          – MR jobs (descriptor generation) + Mahout (model building) lets us handle this in a straightforward manner
      • But will QSAR models benefit from more data?
          – Helgee et al suggest global models are preferable
          – But diversity and the structure of the chemical space will affect performance of global models
          – Unsupervised methods may be more relevant
          – Philosophical question?
  • 23. Going beyond chunking?
      • Many clustering algorithms are amenable to map-reduce style
          – K-means, Spectral, EM, minhash, …
          – Many are implemented in Mahout
      Problems where we generate large numbers of combinations can be amenable to map-reduce
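K-means is a common example of an algorithm that fits map-reduce at the algorithmic level: the map step assigns each point to its nearest centroid and the reduce step averages the points per centroid. The single-process sketch below illustrates one such iteration; it is not Mahout's implementation, and the function names are illustrative.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans_step(points, centroids):
    """One iteration: map points to centroids, then reduce to new means."""
    assigned = defaultdict(list)
    for point in points:                      # map: emit (centroid index, point)
        assigned[nearest(point, centroids)].append(point)
    new_centroids = list(centroids)           # centroids with no members are kept
    for k, members in assigned.items():       # reduce: mean of each group
        new_centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Repeat the map-reduce step a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

Each iteration is one map-reduce job, so the per-job overhead is paid once per iteration; that is part of why the approach pays off mainly for large datasets.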
  • 24. Networks & integration
      • Network models of molecules and targets are common
          – Allows for the incorporation of lots of associated information
          – Diseases, pathways, OTEs
      [Figure: Yildirim, M.A. et al]
      • When linked with clinical data & outcomes, we can generate massive networks
          – Adverse events (FDA AERS)
          – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
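To show how drug-drug-reaction triples multiply, the sketch below counts them from adverse event reports in a map-reduce shape. The report structure and the example drug names are hypothetical; this is not a reproduction of the Cloudera analysis or of the FDA AERS file format.

```python
from collections import Counter
from itertools import combinations


def triples(report):
    """Map: emit every (drug_a, drug_b, reaction) triple found in one report."""
    drugs = sorted(set(report["drugs"]))
    for drug_a, drug_b in combinations(drugs, 2):
        for reaction in sorted(set(report["reactions"])):
            yield drug_a, drug_b, reaction


def count_triples(reports):
    """Reduce: total occurrences of each triple over all reports."""
    counts = Counter()
    for report in reports:
        counts.update(triples(report))
    return counts
```

A report listing d drugs and r reactions contributes d(d-1)/2 × r triples, so the triple count grows much faster than the number of reports.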
  • 25. Networks & integration
      • SAR data can be viewed in a network form
          – SALI, SARI based networks
          – Usually requires pairwise calculations of the metric
      [Figure: Peltason, L et al, http://]
      • Current studies have focused on small datasets (< 1000 molecules)
      • Hadoop + Giraph could let us apply this to HTS-scale datasets
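As a concrete instance of a pairwise SAR metric, the sketch below computes the Structure-Activity Landscape Index, SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), and keeps pairs above a cutoff as network edges. The activity values, the similarity callback and the edge direction (less active to more active) are assumptions made for the example.

```python
from itertools import combinations


def sali(activity_a, activity_b, similarity):
    """SALI for one pair; identical structures with different activity give infinity."""
    difference = abs(activity_a - activity_b)
    if similarity >= 1.0:
        return float("inf") if difference > 0 else 0.0
    return difference / (1.0 - similarity)


def sali_edges(activities, similarity, cutoff):
    """Edges (less active, more active, SALI) for all pairs with SALI >= cutoff."""
    edges = []
    for a, b in combinations(sorted(activities), 2):
        value = sali(activities[a], activities[b], similarity(a, b))
        if value >= cutoff:
            low, high = (a, b) if activities[a] <= activities[b] else (b, a)
            edges.append((low, high, value))
    return edges
```

Every pair of molecules must be evaluated, which is the pairwise cost that limits such networks to small datasets on a single machine.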
  • 26. Networks & integration
      • When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
          – Network based similarity
          – Community detection (aka clustering)
          – PageRank style ranking (of targets, compounds, …)
          – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
      [Figure: Bauer-Mehren et al]
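PageRank style ranking can be sketched with a plain power iteration over an adjacency dictionary. The compound-target graph in the usage note is hypothetical, and the damping factor of 0.85 is the conventional default rather than a value from the talk.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; graph maps each node to a list of its out-neighbors."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

For a toy graph such as `{"cmpd1": ["targetA"], "cmpd2": ["targetA"], "cmpd3": ["targetB"]}`, the target hit by two compounds ranks above the target hit by one. Each iteration only passes values along edges, which is the message-passing pattern that graph frameworks on Hadoop are built around.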
  • 27. Conclusions
      • Cheminformatics applications can be rewritten to take advantage of cloud resources
          – Remotely hosted
          – Embarrassingly parallel / chunked
          – Map/reduce
      • Ability to process larger structure collections lets us explore more chemical space
      • Integrating chemistry with clinical & pharmacological data can lead to big datasets
  • 28. Conclusions
      • Q: But are cheminformatics problems really big enough to justify all of this?
      • A: Yes – virtual libraries, integrating chemical structure with other types and scales of data
      • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
      • A: Yes – especially when we consider problems with a combinatorial flavor