I Don't Care Where My Data and Methods Are: A Web-Service Approach for Distributed Access to Methods, Data and Models


Transcript

  • 1. A Web-Service Approach for Distributed Access to Methods, Data and Models (I Don’t Care Where My Data and Methods Are). Rajarshi Guha, Geoffrey Fox, Kevin E. Gilbert, Marlon Pierce, David Wild. School of Informatics, Indiana University. 235th ACS National Meeting, 6th March, 2008
  • 2. Outline: Overview; Pub3D; Model Exchange
  • 3. Local versus Distributed Resources: Access arbitrary resources (methods, data, applications). All resources look like function calls.
  • 4. Combining Data & Functionality: But what do we mean by “combining” all these resources? Different levels of complexity: keep track of new additions to a database via an RSS feed; use Yahoo Pipes to combine and manipulate output easily; write full-fledged programs in your language of choice. Web services allow us to support all these activities in a uniform manner.
  • 5. Web Services - What’s Available: Cheminformatics functionality: molecular descriptors; similarity (2D, 3D); format conversion; depictions; 3D coordinate generation. Summarized on ChemBioGrid. Dong, X. et al., J. Chem. Inf. Model., 2007, 47, 1303–1307. http://www.chembiogrid.org/projects/proj_core.html
  • 6. Web Services for Modeling: Computational web services can be viewed as wrappers around the actual program. Since predictive models are a common feature in cheminformatics, we’d like to support them as well. This leads to a number of requirements: ability to develop models; store (deploy) models; use the models via the web service infrastructure. We provide a computational infrastructure based on R which supports feature selection, model development, model deployment, and arbitrary R code. Guha, R., J. Chem. Inf. Model., 2008, 48, 456–464
  • 7. What’s Available? Regression (OLS, CNN, RF); classification (LDA); clustering (k-means); feature selection (stepwise and exhaustive). Automated model generation: load X, Y data, build linear and non-linear models with optional LOO CV. Deployed models: random forests for 60 NCI DTP cell lines; cytotoxicity; Ames mutagenicity. Wang, H. et al., J. Chem. Inf. Model., 2007, 47, 2063–2076. Guha, R. and Schürer, S., J. Comp. Aid. Molec. Des., 2008, ASAP
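The "optional LOO CV" step on this slide can be illustrated with a small example. This is a minimal Python sketch, not the R-based service code described in the talk; the function names `fit_ols` and `loo_rmse` and the toy data are assumptions introduced here, and only a single-descriptor linear model is covered.

```python
import math


def fit_ols(xs, ys):
    """Fit y = a + b*x by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_rmse(xs, ys):
    """Leave-one-out CV: refit without each point, predict it, pool the errors."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_ols(train_x, train_y)
        squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data (hypothetical): one descriptor value and one response per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]
print("LOO RMSE:", loo_rmse(descriptor, activity))
```

Each held-out prediction comes from a model that never saw that point, so the pooled error is a less optimistic estimate than the fit error on the full data set.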
  • 8. Web Services - What’s Available: Data sources. We maintain a number of databases: PubChem mirror (mainly for local research); Pub3D. Directly accessible via queries in SQL. We also encapsulate specific queries as web service calls. http://www.chembiogrid.org/projects/proj_db.html
  • 9. REST Frontends to SOAP Services: REST is an architectural style that avoids complex message formats: no extra libraries required; access services using URLs; much simpler interface compared to SOAP. We’ve been putting REST interfaces onto a number of our SOAP services. Current REST services include: database (3D structure, PubChem mirror); molecular descriptors; depictions.
  • 10. REST 3D Structure Service: To get a 3D structure for a PubChem CID: http://www.chembiogrid.org/cheminfo/rest/db/pub3d/2244
  • 11. REST Depiction Service: To get a 2D depiction for arbitrary SMILES: http://www.chembiogrid.org/cheminfo/rest/depict/C(=O)OCC
  • 12. REST Descriptor Service: To get descriptors for an arbitrary SMILES string: http://www.chembiogrid.org/cheminfo/rest/desc/descriptors/CC(=O)OCCN
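The three REST slides share one URL pattern: a base path, a service name, and the query value as the last path segment. A minimal Python sketch of that pattern follows. The helper names are hypothetical and the `encode` option is an assumption: the slides show raw SMILES in the URL and do not say whether the service expected percent-encoding. No request is sent, since the availability of these 2008 endpoints is not known.

```python
from urllib.parse import quote

# Base path taken from the URLs shown on the slides.
BASE = "http://www.chembiogrid.org/cheminfo/rest"


def _segment(value, encode):
    """Return the final path segment, optionally percent-encoded (assumption)."""
    return quote(value, safe="") if encode else value


def pub3d_url(cid):
    """URL for the 3D structure of a PubChem CID."""
    return f"{BASE}/db/pub3d/{int(cid)}"


def depict_url(smiles, encode=False):
    """URL for a 2D depiction of a SMILES string."""
    return f"{BASE}/depict/{_segment(smiles, encode)}"


def descriptor_url(smiles, encode=False):
    """URL for the descriptors of a SMILES string."""
    return f"{BASE}/desc/descriptors/{_segment(smiles, encode)}"


print(pub3d_url(2244))
print(depict_url("C(=O)OCC"))
print(descriptor_url("CC(=O)OCCN", encode=True))
```

With `encode=False` the helpers reproduce the slide URLs exactly; `encode=True` is the safer choice for SMILES containing characters such as `#` or `/`, which would otherwise change how the URL is parsed.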
  • 13. Pub3D: A 3D structure database derived from PubChem. Current version contains a single 3D structure for 17M compounds. Structures obtained using MMFF94; not the lowest energy conformer. Structures can be retrieved in SD format by CID using a web page or web service interfaces. We also store distance moment shape descriptors, allowing us to perform shape similarity searches. http://www.chembiogrid.org/cheminfo/p3d/ Ballester, P.J. and Graham Richards, W., J. Comp. Chem., 2007, 28, 1711–1723
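To make "distance moment shape descriptors" concrete, here is a minimal Python sketch in the spirit of the cited Ballester and Richards method: distances from every atom to four reference points are summarized by three moments each, giving 12 numbers per structure. The choice of moments (mean, variance, skewness), the similarity formula, and the function names are assumptions of this sketch; the slide does not state exactly what Pub3D stores.

```python
import math


def _moments(distances):
    """Mean, variance and skewness of a list of distances."""
    n = len(distances)
    mean = sum(distances) / n
    var = sum((d - mean) ** 2 for d in distances) / n
    if var == 0.0:
        return [mean, 0.0, 0.0]
    third = sum((d - mean) ** 3 for d in distances) / n
    return [mean, var, third / var ** 1.5]


def shape_descriptor(coords):
    """12-value descriptor from atom coordinates given as (x, y, z) tuples."""
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    d_centroid = [math.dist(c, centroid) for c in coords]
    closest = coords[d_centroid.index(min(d_centroid))]    # atom nearest the centroid
    farthest = coords[d_centroid.index(max(d_centroid))]   # atom farthest from it
    d_farthest = [math.dist(c, farthest) for c in coords]
    far_from_far = coords[d_farthest.index(max(d_farthest))]
    descriptor = []
    for ref in (centroid, closest, farthest, far_from_far):
        descriptor.extend(_moments([math.dist(c, ref) for c in coords]))
    return descriptor


def shape_similarity(a, b):
    """1 / (1 + mean absolute difference); equals 1.0 for identical descriptors."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))


# Hypothetical coordinates, for illustration only.
mol = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0), (0.0, 1.2, 0.8), (2.9, 0.3, 1.1)]
print(shape_similarity(shape_descriptor(mol), shape_descriptor(mol)))
```

Because only interatomic distances enter the descriptor, it needs no structural alignment, which is what makes it practical to precompute and index for a collection of this size.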
  • 14. Pub3D Performance: Shape searches can be as fast as 5 s for reasonably large result sets. Fast enough for us to explore the “density of space” around a given query compound. [Figure: Similarity Query Times for 10000 Random CIDs (R = 0.4), Number of Queries versus Time (sec); Nearest Neighbor Count versus Radius (0.0 to 0.6, secondary axis ticks 1, 0.91, 0.83, 0.77, 0.71, 0.67, 0.62) for Levitra (110634), Diazepam (3016), Taxol (36314), Didanosine (50599), Calcascorbin (6247).] Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1713–1722
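The tick values 1, 0.91, 0.83, 0.77, 0.71, 0.67, 0.62 in the figure on this slide line up with the radii 0.0 to 0.6 under the mapping 1/(1 + R). The Python sketch below checks that observation and shows a brute-force neighbor count for a given radius. Reading the ticks as similarity values is an inference from the numbers, not something the slide states, and the function names are hypothetical.

```python
def radius_to_similarity(radius):
    """Map a distance radius R to 1 / (1 + R) (assumed relationship)."""
    return 1.0 / (1.0 + radius)


def neighbor_counts(distances, radii):
    """For each radius, count how many query-to-compound distances fall within it."""
    return [sum(1 for d in distances if d <= r) for r in radii]


radii = [i / 10 for i in range(7)]  # 0.0, 0.1, ..., 0.6
print([round(radius_to_similarity(r), 2) for r in radii])

# Hypothetical distances from one query compound to four database compounds.
print(neighbor_counts([0.05, 0.15, 0.35, 0.70], [0.1, 0.4]))
```

Plotting such counts against the radius is one way to explore the "density of space" around a query compound.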
  • 15. Pub3D & Clustering: Indexing gives good performance. As index size increases, performance degrades; could add more RAM. Clustering the database allows us to scale to significantly larger collections.
  • 17. Pub3D and Conformers: Single, arbitrary conformers aren’t too useful. We have currently generated conformers for a subset of PubChem (4 to 10 heavy atoms): 3 kcal energy window; 243,892 compounds ≈ 2M conformers. Clustering will be vital when conformers are considered: allows us to handle arbitrary numbers of conformers; may need to consider some sort of compression; could use cluster information to optimize queries.
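An energy window keeps only conformers whose energy lies within a fixed amount of the lowest-energy conformer found. A minimal Python sketch of that filter follows. The data layout, the function name, and the assumption that energies are relative values in kcal/mol are all introduced here for illustration; the slide gives only the 3 kcal window.

```python
def within_energy_window(conformers, window=3.0):
    """Keep (conformer_id, energy) pairs within `window` of the minimum energy.

    Energies are assumed to be in kcal/mol; the default window matches the slide.
    """
    if not conformers:
        return []
    e_min = min(energy for _, energy in conformers)
    return [(cid, energy) for cid, energy in conformers if energy - e_min <= window]


# Hypothetical conformer energies for a single compound.
generated = [("c1", 12.4), ("c2", 13.0), ("c3", 15.4), ("c4", 15.5), ("c5", 19.9)]
print(within_energy_window(generated))
```

On the slide's figures, 2,000,000 / 243,892 ≈ 8.2, so each compound in the subset contributes about eight conformers on average.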
  • 18. Model Exchange: The literature contains many published models. No way to utilize them unless we manually rebuild them; in many cases we will not have access to descriptors. Difficult to search for models specifically. We should be able to do the following: search for models; exchange them; execute them. Is this a “format” issue? To some extent, yes.
  • 20. PMML for Model Exchange: Predictive Model Markup Language. A standard supported by the Data Mining Group. Allows you to serialize predictive models to XML: linear, logistic regression; tree models; neural networks, Naïve Bayes models; association models; ensemble models (random forests, arbitrary ensembles). Supported by a number of platforms: IBM, Salford Systems, SAS, SPSS, R.
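As an illustration of serializing a model to XML and then executing it, the Python sketch below writes a linear model as a PMML-style `RegressionModel` with a `RegressionTable` and `NumericPredictor` elements, then evaluates it from the XML alone. It is a fragment for illustration and not a schema-valid PMML document: the header, data dictionary, mining schema and namespace are omitted, the version string is an assumption, and the descriptor names are hypothetical.

```python
import xml.etree.ElementTree as ET


def linear_model_to_pmml(intercept, coefficients):
    """Serialize a linear model to a simplified PMML-style XML string."""
    root = ET.Element("PMML", version="3.2")  # version string is an assumption
    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=repr(intercept))
    for name, coefficient in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=repr(coefficient))
    return ET.tostring(root, encoding="unicode")


def predict_from_pmml(xml_text, inputs):
    """Evaluate the serialized linear model on a dict of descriptor values."""
    root = ET.fromstring(xml_text)
    table = root.find("./RegressionModel/RegressionTable")
    value = float(table.get("intercept"))
    for predictor in table.findall("NumericPredictor"):
        value += float(predictor.get("coefficient")) * inputs[predictor.get("name")]
    return value


# Hypothetical model: two descriptors with made-up coefficients.
xml_text = linear_model_to_pmml(1.5, {"logP": 0.5, "MW": 0.01})
print(xml_text)
print(predict_from_pmml(xml_text, {"logP": 2.0, "MW": 100.0}))
```

The consumer needs only the XML and the descriptor values, not the software that built the model, which is the point of exchanging models in a common format.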
  • 21. PMML for Model Exchange: Since it’s XML, it’s extensible. PMML allows us to specify the model itself, but we need to add extra information to make a model truly searchable: provenance (who made it, when was it made, ...); property (what is being modeled); requirements (what are the inputs, how to get them).
  • 22. PMML for Model Exchange: Provenance: easily solved using Dublin Core. Property: could be addressed using keywords; could reuse pre-existing ontologies. Requirements: this is tricky; need to identify what descriptors were used; what software, version; how to evaluate the descriptors (if possible).
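One way to picture the three kinds of metadata is to attach them to a model document and then search on them. In the minimal Python sketch below, provenance and property use Dublin Core elements (`dc:creator`, `dc:date`, `dc:subject`) inside an `Extension` element, and each requirement records a descriptor with its software and version. The `Requirement` element, the placement of the metadata, and all example values are hypothetical; the talk does not specify a concrete layout.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def annotate(model_xml, creator, date, modeled_property, descriptors):
    """Add provenance, property and requirement metadata to a model document."""
    root = ET.fromstring(model_xml)
    ext = ET.Element("Extension", name="metadata")
    ET.SubElement(ext, f"{{{DC_NS}}}creator").text = creator
    ET.SubElement(ext, f"{{{DC_NS}}}date").text = date
    ET.SubElement(ext, f"{{{DC_NS}}}subject").text = modeled_property
    for d in descriptors:
        # "Requirement" is a made-up element name used only in this sketch.
        ET.SubElement(ext, "Requirement", descriptor=d["name"],
                      software=d["software"], version=d["version"])
    root.insert(0, ext)
    return ET.tostring(root, encoding="unicode")


def matches_property(model_xml, keyword):
    """Keyword search over the modeled property."""
    root = ET.fromstring(model_xml)
    subject = root.find(f"./Extension/{{{DC_NS}}}subject")
    return subject is not None and keyword.lower() in (subject.text or "").lower()


annotated = annotate(
    "<PMML version='3.2'/>",
    creator="A. Modeler",
    date="2008-03-06",
    modeled_property="Ames mutagenicity",
    descriptors=[{"name": "example-descriptor", "software": "example-toolkit", "version": "1.0"}],
)
print(annotated)
print(matches_property(annotated, "mutagenicity"))
```

Keyword matching covers the simplest case of the property search; reusing an existing ontology, as the slide suggests, would replace the free-text subject with a controlled term.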
  • 25. Summary: Pub3D is a shape-searchable version of PubChem. Conformers will have to be considered for it to be useful. Are searches meaningful? Benchmarks required. Web services provide one approach to model deployment. We should be able to search for models explicitly. PMML is one extensible solution that addresses model exchange. Automated model execution is more challenging.
  • 27. Acknowledgments: Kangseok Kim; NIH