Chemogenomics in the cloud: Is the sky the limit?


Published in: Technology, Business

  1. Chemogenomics in the cloud: Is the sky the limit?
     Rajarshi Guha, Ph.D.
     NIH Center for Translational Therapeutics
     June 28, 2012
  2. The cloud as infrastructure
     • Cloud computing is a service for
       – Infrastructure
       – Platform
       – Software
     • Many of the benefits of cloud computing are
       – Economic
       – Political
     • Won’t be discussing the remote hosting aspects of clouds
  3. Characteristics of the cloud
     [Diagram: “Cloud Computing” surrounded by its characteristics: virtually assemble, pay-per-use, offsite technology, shared workloads, massive scale, on-demand self service]
     http://-cloud-computing
  4. Parallel computing in the cloud
     • Modern cloud vendors make provisioning compute resources easy
       – Allows one to handle unpredictable loads easily
       – Pay only for what you need
     • Chemistry applications don’t usually have very dynamic loads
     • But large scale resources are an opportunity for large scale (parallel) computations
  5. Storing chemical information
     • Fill up a hard drive, mail to Amazon
     • Copy over the network
       – Aspera
       – GridFTP
     • Still need to pay for storage space
     • Lots of options on the cloud
       – S3, relational DB’s
     • See Chris Dagdigian’s talk for views on storage
       http://-trends-from-the-trenches
  6. Recoding for the cloud?
     • Only if we really have to
     • Large amounts of legacy code run perfectly well on local clusters
       – May not make sense to recode as a map-reduce job
       – May not be possible to
     • Different levels of HPC on the cloud
       – Legacy HPC
       – ‘Cloudy’ HPC
       – Big Data HPC
     http://-life-science-informatics-to-the-cloud
  7. Recoding for the cloud?
     Legacy HPC
     • Use cloud resources in the same way as a local cluster
     • MIT StarCluster makes this easy to do
     Cloudy HPC
     • Make use of cloud capabilities
     • Old algorithms, new infrastructure
     • Spot instances, SNS, SQS, SimpleDB, S3, etc
     Big Data HPC
     • Huge datasets
     • Candidates for map-reduce
     • Involves algorithm (re)design
     http://-life-science-informatics-to-the-cloud
  8. How does the cloud enable science?
     • How does the cloud change computational chemistry, cheminformatics, …
       – The way we do them
       – The scale at which we do them
     Are there problems we can address that we could not have addressed without on-demand, scalable cloud resources?
  9. Big data & cheminformatics
     • Computation over large chemical databases
       – PubChem, ChEMBL, …
     • What types of computations?
       – Searches (substructure, pharmacophore, …)
       – QSAR models over large data
       – Predictions for large data
     • Certain applications just need structures
     • Access to correspondingly massive experimental datasets is tough (impossible?)
  10. Big data & cheminformatics
      • GDB-13 is a truly big database: 977 million different structures
        – Current search interface is based on NN searches using a reduced representation
        – Could be a good candidate for a Hadoop based analysis
      • More generally, enumerated virtual libraries can also lead to very big data
        – Time required to enumerate is a bottleneck
  11. Big data & cheminformatics
      • Fundamentally, “big chemical data” lets us explore larger chemical spaces
        – Can plow through large catalogs
        – e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al]
      • This can push predictive models to their limits
        – Brings us back to the global vs local arguments
  12. The Hadoop ecosystem
      • A framework for the map-reduce algorithm
        – Not something you can download and just run
        – Need to implement the infrastructure and then develop code to run using the infrastructure
      • Low level Hadoop programs can be large, complex and tedious
      • Abstractions have been developed that make Hadoop queries more SQL-like, which results in much more concise code
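To make the map-reduce pattern concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop code: the `shuffle` function stands in for the grouping the framework would do between the two phases, and the regex tokenizer is a crude assumption that only recognizes common organic-subset atoms (bracket atoms and hydrogens are not handled). The function names are hypothetical.

```python
import re
from collections import defaultdict

# Crude element tokenizer (assumption): two-letter halogens are tried first,
# then single-letter organic-subset atoms, including aromatic lowercase forms.
ATOM_RE = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def map_atoms(smiles):
    """Map step: emit one (element, 1) pair per atom token in a SMILES string."""
    for token in ATOM_RE.findall(smiles):
        yield token.capitalize(), 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_atoms(smiles_list):
    """Run map, shuffle and reduce over a list of SMILES strings."""
    pairs = (pair for smi in smiles_list for pair in map_atoms(smi))
    return reduce_counts(shuffle(pairs))


counts = count_atoms(["CCO", "c1ccccc1", "CCl"])
```

In a real job each SMILES record would be handled by a mapper on whichever node holds that block of the input, and only the (element, count) pairs would travel over the network.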
  13. The Hadoop ecosystem
      [Diagram of components: Chukwa, Zookeeper, Flume, Pig, HBase, Mahout, Avro, Whirr, Map Reduce Engine, Hama, Hadoop Distributed Filesystem, Hive, Hadoop Common]
      Based on http://-part-3-matt-aslett-the-hadoop-ecosystem
  14. Simplifying Hadoop applications
      • Raw Hadoop programs can be very tedious to write
      (Example shown: SMARTS-based substructure search)
  15. Pig & Pig Latin
      • Pig Latin programs are much simpler to write and get translated to Hadoop code
      • SQL-like, requires a UDF to be implemented to perform non-standard tasks
      [Code listing: SMARTS search in Pig Latin]
      [Code listing: UDF for SMARTS search]
  16. Working on top of Hadoop
      • Hadoop doesn’t know anything about cheminformatics
        – Need to write your own code, UDF’s etc
      • But application layers have been developed for other purposes
        – Apache Mahout: a library for machine learning on data stored in Hadoop clusters
        – Possible to build virtual screening pipelines based on the Hadoop framework
  17. What Hadoop is not for
      • Doesn’t replace an actual database
      • It’s not uniformly fast or efficient
      • Not good for ad hoc or realtime analysis
      • Not effective unless dealing with massive datasets
      • Not all algorithms are amenable to the map-reduce method
        – CPU bound methods and those requiring communication
  18. Cheminformatics on Hadoop
      • Hadoop and Atom Counting
      • Hadoop and SD Files
      • Cheminformatics, Hadoop and EC2
      • Pig and Cheminformatics
      But are cheminformatics problems really big enough to justify all of this?
  19. How big is big?
      • Bryk et al performed an LBVS of 5 million compounds to identify PKR inhibitors
        – Pharmacophore fingerprints + perceptron
        – Required conformer generation
      • Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long
      • Example: RF models built on 512-bit binary fingerprints give us predictions for 5M fingerprints in 12 min [single core, 3 GHz Xeon, OS X 10.6.8]
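The timing quoted above can be turned into a rough throughput figure. The sketch below is back-of-the-envelope arithmetic only: it assumes throughput stays constant for larger inputs and scales linearly with core count, neither of which the slides claim. The 977 million figure is the GDB-13 size from the earlier slide; the 100-core cluster is hypothetical.

```python
def predictions_per_second(n_items, minutes):
    """Throughput implied by processing n_items in the given number of minutes."""
    return n_items / (minutes * 60)


def extrapolated_hours(n_items, rate_per_core, cores=1):
    """Hours needed for n_items, assuming perfectly linear scaling (an assumption)."""
    return n_items / (rate_per_core * cores) / 3600


# 5M fingerprints in 12 minutes on a single core
rate = predictions_per_second(5_000_000, 12)

# Same model applied to a GDB-13 sized collection (977 million structures)
gdb13_single_core = extrapolated_hours(977_000_000, rate)
gdb13_100_cores = extrapolated_hours(977_000_000, rate, cores=100)

print(f"{rate:.0f} predictions/s per core")
print(f"GDB-13 scale: {gdb13_single_core:.2f} h on 1 core, "
      f"{gdb13_100_cores * 60:.1f} min on 100 cores")
```

Under these assumptions the prediction step alone is about 39 hours on one core for a GDB-13 sized set, which is where chunking across many cores starts to pay off; conformer and descriptor generation are not included in this estimate.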
  20. Going beyond chunking?
      • All the preceding use cases are embarrassingly parallel
        – Chunking the input data and applying the same operation to each chunk
        – Very nice when you have a big cluster
      Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
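As an illustration of the chunking pattern, the sketch below splits a small fingerprint library into chunks and applies the same similarity screen to each one. Everything here is an assumption made for illustration: fingerprints are represented as Python sets of on-bit positions, the function names are hypothetical, and a thread pool stands in for the cluster nodes that would each take a chunk in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def screen_chunk(chunk, query, cutoff):
    """The operation applied to every chunk: keep hits at or above the cutoff."""
    hits = []
    for name, fp in chunk:
        score = tanimoto(fp, query)
        if score >= cutoff:
            hits.append((name, score))
    return hits


def screen(library, query, cutoff=0.5, chunk_size=2, workers=4):
    """Screen each chunk independently, then concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda c: screen_chunk(c, query, cutoff),
                         chunks(library, chunk_size))
    return [hit for part in parts for hit in part]


library = [("a", {1, 2, 3}), ("b", {1, 2}), ("c", {7, 8}), ("d", {1, 2, 3, 4})]
hits = screen(library, query={1, 2, 3})
```

No chunk needs anything from any other chunk, which is what makes the problem embarrassingly parallel: the only coordination is concatenating the results at the end.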
  21. Going beyond chunking?
      • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
        – Doesn’t always avoid the O(N²) barrier
        – Bioisostere identification is one case that could be rephrased as a map-reduce problem
      • Search algorithms such as GA’s and particle swarms can make use of map-reduce
        – GA based docking
        – Feature selection for QSAR models
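One way a pairwise calculation might be phrased in map-reduce terms is sketched below: the dataset is split into blocks, each pair of blocks becomes one map task that emits similarities, and the reduce step keeps the nearest neighbor of each molecule. This is an illustrative sketch, not a method from the talk; the set-based fingerprints and all function names are assumptions, and molecule names are assumed to be unique. Note that the total work is still N(N-1)/2 comparisons, so the O(N²) cost is spread over tasks rather than removed.

```python
from itertools import combinations_with_replacement


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def block_pairs(n_blocks):
    """Enumerate block pairs (i, j) with i <= j; each pair is one map task."""
    return list(combinations_with_replacement(range(n_blocks), 2))


def map_block_pair(blocks, i, j):
    """Map step: emit (molecule, (neighbor, similarity)) for each pair of molecules."""
    for name_a, fp_a in blocks[i]:
        for name_b, fp_b in blocks[j]:
            if i == j and name_a >= name_b:
                continue  # skip self-pairs and duplicates inside one block
            sim = tanimoto(fp_a, fp_b)
            yield name_a, (name_b, sim)
            yield name_b, (name_a, sim)


def reduce_nearest(emitted):
    """Reduce step: keep the most similar neighbor seen for each molecule."""
    best = {}
    for key, (other, sim) in emitted:
        if key not in best or sim > best[key][1]:
            best[key] = (other, sim)
    return best


def nearest_neighbors(blocks):
    """Run every block-pair map task, then reduce per molecule."""
    emitted = (kv for i, j in block_pairs(len(blocks))
               for kv in map_block_pair(blocks, i, j))
    return reduce_nearest(emitted)


blocks = [[("a", {1, 2, 3}), ("b", {1, 2})],
          [("c", {1, 2, 3, 4}), ("d", {9})]]
nearest = nearest_neighbors(blocks)
```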
  22. Going beyond chunking?
      • Machine learning for massive chemical datasets?
        – MR jobs (descriptor generation) + Mahout (model building) lets us handle this in a straightforward manner
      • But will QSAR models benefit from more data?
        – Helgee et al suggest global models are preferable
        – But diversity and the structure of the chemical space will affect performance of global models
        – Unsupervised methods may be more relevant
        – Philosophical question?
  23. Going beyond chunking?
      • Many clustering algorithms are amenable to map-reduce style
        – K-means, Spectral, EM, minhash, …
        – Many are implemented in Mahout
      Problems where we generate large numbers of combinations can be amenable to map-reduce
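K-means shows why such algorithms fit the map-reduce style: each iteration is an assignment step that treats every point independently (map) followed by a per-cluster average (reduce). The sketch below runs one iteration in plain Python on toy 2D points. It is an illustration under those assumptions, not Mahout's implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    idx = min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))
    return idx, point


def reduce_mean(points):
    """Reduce step: average the points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_step(points, centroids):
    """One k-means iteration expressed as a map phase and a reduce phase."""
    groups = defaultdict(list)
    for p in points:
        idx, point = map_assign(p, centroids)
        groups[idx].append(point)
    # A centroid with no assigned points is left where it was.
    return [reduce_mean(groups[k]) if k in groups else centroids[k]
            for k in range(len(centroids))]


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
new_centroids = kmeans_step(points, centroids=[(1, 1), (9, 9)])
```

Only the current centroids need to be shared with every mapper, so the data itself never has to move between iterations.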
  24. Networks & integration
      • Network models of molecules and targets are common
        – Allows for the incorporation of lots of associated information
        – Diseases, pathways, OTE’s
        (Figure: Yildirim, M.A. et al)
      • When linked with clinical data & outcomes, we can generate massive networks
        – Adverse events (FDA AERS)
        – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
  25. Networks & integration
      • SAR data can be viewed in a network form
        – SALI, SARI based networks
        – Usually requires pairwise calculations of the metric
        (Figure: Peltason, L et al, http://)
      • Current studies have focused on small datasets (< 1000 molecules)
      • Hadoop + Giraph could let us apply this to HTS-scale datasets
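For a sense of what the pairwise metric involves, the sketch below computes SALI for every pair in a toy dataset and keeps the pairs above a cutoff as network edges. SALI is commonly defined as |A_i − A_j| / (1 − sim(i, j)), where A is activity and sim is a structural similarity. The set-based fingerprints, the activity values, the cutoff and the handling of identical fingerprints are all assumptions made for this sketch.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sali(act_i, act_j, sim):
    """SALI = |A_i - A_j| / (1 - sim). Identical fingerprints are a special case."""
    if sim >= 1.0:
        # Convention for this sketch only: infinite if activities differ, else 0.
        return float("inf") if act_i != act_j else 0.0
    return abs(act_i - act_j) / (1.0 - sim)


def sali_edges(molecules, cutoff):
    """Return (name_a, name_b, SALI) for every pair with SALI >= cutoff.

    `molecules` maps a name to a (fingerprint set, activity) tuple.
    """
    edges = []
    for (a, (fp_a, act_a)), (b, (fp_b, act_b)) in combinations(molecules.items(), 2):
        value = sali(act_a, act_b, tanimoto(fp_a, fp_b))
        if value >= cutoff:
            edges.append((a, b, value))
    return edges


molecules = {
    "a": ({1, 2, 3}, 5.0),
    "b": ({1, 2, 3, 4}, 8.0),
    "c": ({9}, 5.5),
}
edges = sali_edges(molecules, cutoff=2.0)
```

The loop over `combinations` is the N(N-1)/2 step that limits this analysis to small datasets on a single machine.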
  26. Networks & integration
      • When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
        – Network based similarity
        – Community detection (aka clustering)
        – PageRank style ranking (of targets, compounds, …)
        – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
      (Figure: Bauer-Mehren et al)
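A PageRank style ranking can be sketched with a few lines of power iteration. The compound-target edge list below is made up for illustration, and the damping factor of 0.85 and fixed iteration count are conventional defaults rather than values from the talk. Each iteration has a map-reduce shape: every node emits a share of its rank along each out-link (map), and the shares arriving at each node are summed (reduce).

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list of (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # Dangling nodes (no out-links) spread their rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical compound -> target links
edges = [("cpd1", "T1"), ("cpd2", "T1"), ("cpd3", "T1"), ("cpd3", "T2")]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
```

Here `T1` ends up ranked highest because three compounds point to it; on a real network the same iteration would be distributed by a graph framework rather than run in a single loop.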
  27. Conclusions
      • Cheminformatics applications can be rewritten to take advantage of cloud resources
        – Remotely hosted
        – Embarrassingly parallel / chunked
        – Map/reduce
      • Ability to process larger structure collections lets us explore more chemical space
      • Integrating chemistry with clinical & pharmacological data can lead to big datasets
  28. Conclusions
      • Q: But are cheminformatics problems really big enough to justify all of this?
      • A: Yes: virtual libraries, integrating chemical structure with other types and scales of data
      • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
      • A: Yes, especially when we consider problems with a combinatorial flavor