Chemogenomics in the cloud: Is the sky the limit? Rajarshi Guha, Ph.D. NIH Center for Translational Therapeutics June 28, 2012
The cloud as infrastructure • Cloud computing is a service for – Infrastructure – Platform – Software • Many of the benefits of cloud computing are – Economic – Political • Won’t be discussing the remote hosting aspects of clouds
Characteristics of the cloud [Diagram: "Cloud Computing" surrounded by its characteristics: pay-per-use, virtually assemble, offsite technology, shared workloads, massive scale, on-demand self service] http://www.slideshare.net/haslinatuanhim/slides-cloud-computing
Parallel computing in the cloud • Modern cloud vendors make provisioning compute resources easy – Allows one to handle unpredictable loads easily – Pay only for what you need • Chemistry applications don’t usually have very dynamic loads • But large scale resources are an opportunity for large scale (parallel) computations
Storing chemical information • Fill up a hard drive, mail to Amazon • Copy over the network – Aspera – GridFTP • Still need to pay for storage space • Lots of options on the cloud – S3, relational DBs • See Chris Dagdigian’s talk for views on storage http://www.slideshare.net/chrisdag/2012-trends-from-the-trenches
Recoding for the cloud? • Only if we really have to • Large amounts of legacy code that runs perfectly well on local clusters – May not make sense to recode as a map-reduce job – May not be possible to? • Different levels of HPC on the cloud – Legacy HPC – ‘Cloudy’ HPC – Big Data HPC http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
Recoding for the cloud?
Legacy HPC • Use cloud resources in the same way as a local cluster • MIT StarCluster makes this easy to do
Cloudy HPC • Make use of cloud capabilities • Old algorithms, new infrastructure • Spot instances, SNS, SQS, SimpleDB, S3, etc.
Big Data HPC • Huge datasets • Candidates for map-reduce • Involves algorithm (re)design
http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud
How does the cloud enable science? • How does the cloud change computational chemistry, cheminformatics, … – The way we do them – The scale at which we do them Are there problems we can address now that we could not address without on-demand, scalable cloud resources?
Big data & cheminformatics • Computation over large chemical databases – PubChem, ChEMBL, … • What types of computations? – Searches (substructure, pharmacophore, …) – QSAR models over large data – Predictions for large data • Certain applications just need structures • Access to correspondingly massive experimental datasets is tough (impossible?)
Big data & cheminformatics • GDB-13 is a truly big database – 977 million different structures – Current search interface is based on NN searches using a reduced representation – Could be a good candidate for a Hadoop based analysis • More generally, enumerated virtual libraries can also lead to very big data – Time required to enumerate is a bottleneck
Big data & cheminformatics • Fundamentally, “big chemical data” lets us explore larger chemical spaces – Can plow through large catalogs – e.g., identifying PKR inhibitors by LBVS of the ChemNavigator collection [Bryk et al.] • This can push predictive models to their limits – Brings us back to the global vs local arguments
The Hadoop ecosystem • A framework for the map-reduce algorithm – Not something you can download and just run – Need to implement the infrastructure and then develop code to run using the infrastructure • Low level Hadoop programs can be large, complex and tedious • Abstractions have been developed that make Hadoop queries more SQL-like – results in much more concise code
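To make the map, shuffle and reduce phases concrete, here is a minimal sketch in plain Python (it does not use Hadoop) that counts atoms by element over a handful of SMILES strings. The function names, the sample input and the simplified tokenizer are illustrative assumptions; in particular the tokenizer only handles unbracketed organic-subset atoms.

```python
import re
from collections import defaultdict

# Assumption: SMILES use only unbracketed organic-subset atoms
# (no [Na+], isotopes, etc.). Two-letter halogens are tried first.
ATOM_RE = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnosp]")


def mapper(smiles):
    """Map phase: emit (element, 1) for every atom token in one SMILES."""
    for tok in ATOM_RE.findall(smiles):
        # Aromatic atoms are lower case; fold them onto the element symbol.
        element = tok if len(tok) == 2 else tok.upper()
        yield element, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: sum the counts collected for one element."""
    return key, sum(values)


def atom_counts(smiles_list):
    pairs = (pair for smi in smiles_list for pair in mapper(smi))
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())


counts = atom_counts(["CCO", "c1ccccc1", "ClCCl"])
print(counts)
```

On a real cluster the framework performs the shuffle and distributes mapper and reducer calls across nodes; only the two small functions would be user code.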
Simplifying Hadoop applications • Raw Hadoop programs can be very tedious to write – Example: SMARTS-based substructure search
Pig & Pig Latin • Pig Latin programs are much simpler to write and get translated to Hadoop code • SQL-like, requires UDF to be implemented to perform non-standard tasks
SMARTS search in Pig Latin:
    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';
UDF for SMARTS search:
    public class SMATCH extends FilterFunc {
        static SMARTSQueryTool sqt;
        static {
            try {
                sqt = new SMARTSQueryTool("C");
            } catch (CDKException e) {
                System.out.println(e);
            }
        }
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() < 2) return false;
            String target = (String) tuple.get(0);
            String query = (String) tuple.get(1);
            try {
                sqt.setSmarts(query);
                IAtomContainer mol = sp.parseSmiles(target);
                return sqt.matches(mol);
            } catch (CDKException e) {
                throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
            }
        }
    }
Working on top of Hadoop • Hadoop doesn’t know anything about cheminformatics – Need to write your own code, UDFs etc. • But application layers have been developed for other purposes – Apache Mahout: a library for machine learning on data stored in Hadoop clusters – Possible to build virtual screening pipelines based on the Hadoop framework
What Hadoop is not for • Doesn’t replace an actual database • It’s not uniformly fast or efficient • Not good for ad hoc or realtime analysis • Not effective unless dealing with massive datasets • Not all algorithms are amenable to the map-reduce method – CPU bound methods and those requiring communication
Cheminformatics on Hadoop • Hadoop and Atom Counting • Hadoop and SD Files • Cheminformatics, Hadoop and EC2 • Pig and Cheminformatics But are cheminformatics problems really big enough to justify all of this?
How big is big? • Bryk et al. performed a LBVS of 5 million compounds to identify PKR inhibitors – Pharmacophore fingerprints + perceptron – Required conformer generation • Given that conformer and descriptor generation are one-time tasks, screening 5M compounds doesn’t take long • Example: an RF model built on 512-bit binary fingerprints gives us predictions for 5M fingerprints in 12 min [Single core, 3 GHz Xeon, OS X 10.6.8]
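As a back-of-the-envelope check on "how big is big", the sketch below extrapolates the quoted rate (5M fingerprints in 12 min on one core) to the 977 million structures of GDB-13 mentioned earlier. The function name is hypothetical, and the estimate assumes linear scaling with library size, perfect parallel efficiency, and that fingerprints are already computed; none of these is claimed by the slides.

```python
def screening_hours(n_compounds, n_cores=1,
                    ref_compounds=5_000_000, ref_minutes=12.0):
    """Estimated wall-clock hours for prediction only.

    Assumes throughput scales linearly from the reference run and
    that work divides evenly over n_cores (no I/O or scheduling cost).
    """
    rate_per_minute = ref_compounds / ref_minutes
    return n_compounds / rate_per_minute / n_cores / 60.0


gdb13_single_core = screening_hours(977_000_000)
gdb13_100_cores = screening_hours(977_000_000, n_cores=100)

print(f"GDB-13, 1 core:    {gdb13_single_core:.1f} h")
print(f"GDB-13, 100 cores: {gdb13_100_cores * 60:.1f} min")
```

Under these assumptions even a GDB-13 sized prediction run is under two days on a single core, which is why the one-time conformer and descriptor generation steps, rather than prediction, tend to dominate.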
Going beyond chunking? • All the preceding use cases are embarrassingly parallel – Chunking the input data and applying the same operation to each chunk – Very nice when you have a big cluster Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?
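The chunking pattern described above can be sketched in a few lines of Python. This is an illustration only: the per-chunk operation is a toy stand-in (string length instead of a real descriptor), the helper names are hypothetical, and a thread pool is used purely to keep the example self-contained; on a cluster each chunk would typically go to a separate process or node.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk):
    """Toy per-chunk operation: a stand-in for descriptor calculation."""
    return [(smi, len(smi)) for smi in chunk]


def run_chunked(smiles, chunk_size=2, workers=2):
    """Apply the same operation to every chunk and concatenate results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the chunks.
        parts = pool.map(process_chunk, chunked(smiles, chunk_size))
        return [row for part in parts for row in part]


smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C"]
results = run_chunked(smiles)
print(results)
```

No chunk needs information from any other chunk, which is what makes the pattern embarrassingly parallel and also why it says nothing about the algorithm itself.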
Going beyond chunking? • Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation – Doesn’t always avoid the O(N^2) barrier – Bioisostere identification is one case that could be rephrased as a map-reduce problem • Search algorithms such as GAs, particle swarms can make use of map-reduce – GA based docking – Feature selection for QSAR models
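One way a pairwise calculation can be phrased in map-reduce terms is sketched below: the map step emits a similarity for every pair of molecules, keyed by each member of the pair, and the reduce step keeps the nearest neighbour per key. Fingerprints are modelled as sets of on-bits, and the molecule names and helper functions are made up for the example. Note that the map step still enumerates all N(N-1)/2 pairs, so the O(N^2) cost is distributed rather than removed.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def map_pairs(fps):
    """Map: emit (molecule, (other, similarity)) for both pair members."""
    for (i, fp_i), (j, fp_j) in combinations(fps.items(), 2):
        sim = tanimoto(fp_i, fp_j)
        yield i, (j, sim)
        yield j, (i, sim)


def reduce_nearest(pairs):
    """Reduce: keep the most similar partner for each molecule."""
    best = {}
    for key, (other, sim) in pairs:
        if key not in best or sim > best[key][1]:
            best[key] = (other, sim)
    return best


# Hypothetical fingerprints as sets of on-bit positions.
fps = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {3, 4, 5, 6}}
nearest = reduce_nearest(map_pairs(fps))
print(nearest)
```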
Going beyond chunking? • Machine learning for massive chemical datasets? – MR jobs (descriptor generation) + Mahout (model building) let us handle this in a straightforward manner • But will QSAR models benefit from more data? – Helgee et al. suggest global models are preferable – But diversity and the structure of the chemical space will affect performance of global models – Unsupervised methods may be more relevant – Philosophical question?
Going beyond chunking? • Many clustering algorithms are amenable to map-reduce style – K-means, Spectral, EM, minhash, … – Many are implemented in Mahout Problems where we generate large numbers of combinations can be amenable to map-reduce
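K-means is a convenient example of why clustering fits the map-reduce style: assigning each point to its nearest centroid is a map, and recomputing each centroid as the mean of its members is a reduce keyed by cluster index. The following is a minimal single-machine sketch of that decomposition, not Mahout's implementation; the function names, data and fixed iteration count are assumptions made for illustration.

```python
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def map_assign(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        idx = min(range(len(centroids)),
                  key=lambda k: sq_dist(p, centroids[k]))
        yield idx, p


def reduce_update(assignments, centroids):
    """Reduce: new centroid = mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, p in assignments:
        groups[idx].append(p)
    updated = list(centroids)  # empty clusters keep their old centroid
    for idx, members in groups.items():
        updated[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated


def kmeans(points, centroids, n_iter=10):
    for _ in range(n_iter):
        centroids = reduce_update(map_assign(points, centroids), centroids)
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

Each iteration is one map-reduce job, so the number of passes over the data equals the number of iterations, which is worth keeping in mind given Hadoop's per-job overhead.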
Networks & integration • Network models of molecules and targets are common – Allows for the incorporation of lots of associated information – Diseases, pathways, OTEs [Yildirim, M.A. et al.] • When linked with clinical data & outcomes, we can generate massive networks – Adverse events (FDA AERS) – Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
Networks & integration • SAR data can be viewed in a network form – SALI, SARI based networks – Usually requires pairwise calculations of the metric [Peltason, L. et al.; http://sali.rguha.net/] • Current studies have focused on small datasets (< 1000 molecules) • Hadoop + Giraph could let us apply this to HTS-scale datasets
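To illustrate the pairwise calculation behind a SALI network, the sketch below uses the usual definition SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)) and keeps pairs above a cutoff as directed edges from the less active to the more active molecule. The fingerprints (sets of on-bits), activity values, cutoff and function names are all made-up assumptions for the example.

```python
import math
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sali(act_i, act_j, sim):
    """Structure-activity landscape index for one pair."""
    diff = abs(act_i - act_j)
    if sim >= 1.0:
        # Identical fingerprints: an activity cliff of unbounded height,
        # unless the activities are also identical.
        return math.inf if diff > 0 else 0.0
    return diff / (1.0 - sim)


def sali_edges(fps, activity, cutoff):
    """Return (less_active, more_active, sali) for pairs above the cutoff."""
    edges = []
    for i, j in combinations(sorted(fps), 2):
        value = sali(activity[i], activity[j], tanimoto(fps[i], fps[j]))
        if value >= cutoff:
            lo, hi = sorted((i, j), key=lambda k: activity[k])
            edges.append((lo, hi, value))
    return edges


# Hypothetical data: on-bit sets and activities (e.g. pIC50 values).
fps = {"m1": {1, 2, 3}, "m2": {2, 3, 4}, "m3": {7, 8, 9}}
activity = {"m1": 5.0, "m2": 8.0, "m3": 5.5}
edges = sali_edges(fps, activity, cutoff=3.0)
print(edges)
```

The loop over all pairs is the part that limits current studies to small datasets, and it is the part that a distributed graph framework would need to spread across nodes.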
Networks & integration • When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure – Network based similarity – Community detection (aka clustering) [Bauer-Mehren et al.] – PageRank style ranking (of targets, compounds, …) – Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
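As an illustration of PageRank style ranking on such a network, here is a small power-iteration sketch in plain Python. The three-node compound/target graph, the node names, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example; a production run on a large network would use a distributed graph framework rather than this loop.

```python
def pagerank(links, damping=0.85, n_iter=100):
    """Power-iteration PageRank on a dict {node: [nodes it links to]}."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n
               for v in nodes}
        for src, targets in links.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank


# Hypothetical network: two compounds point at one target,
# which in turn points back at one of the compounds.
links = {
    "cmpd_A": ["target_X"],
    "cmpd_B": ["target_X"],
    "target_X": ["cmpd_A"],
}
ranks = pagerank(links)
top_node = max(ranks, key=ranks.get)
print(top_node, ranks)
```

Each iteration only needs every node's current rank and its out-links, which is the message-passing pattern that vertex-centric frameworks are built around.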
Conclusions • Cheminformatics applications can be rewritten to take advantage of cloud resources – Remotely hosted – Embarrassingly parallel / chunked – Map/reduce • Ability to process larger structure collections lets us explore more chemical space • Integrating chemistry with clinical & pharmacological data can lead to big datasets
Conclusions • Q: But are cheminformatics problems really big enough to justify all of this? • A: Yes – virtual libraries, integrating chemical structure with other types and scales of data • Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level? • A: Yes – especially when we consider problems with a combinatorial flavor