Chemical Data Mining: Open Source & Reproducible


Published on

  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Chemical Data Mining: Open Source & Reproducible

  1. R & CDK 1/18 Chemical Data Mining Open Source & Reproducible Rajarshi GuhaNIH Center for Advancing Translational Science August 21, 2012 Philadelphia PA
  2. R & CDKBackground 2/18 Been using it since 2003, developed a number of R packages, mostly public Make extensive use of R at NCGC for small molecule & RNAi screening and high content analysis In paralllel, need to manipulate and process chemical structure data How R is enhanced by other Open Source software How R enables and supports reproducible science
  3. R & CDKWhat is R? 3/18 R is an environment for modeling Contains many prepackaged statistical and mathematical functions No need to implement anything (if you don’t want to) R is a matrix programming language that is good for statistical computing Full fledged, interpreted language Well integrated with statistical functionality Easy to integrate with C, C++, Fortran Good for prototyping
  4. R & CDKWhy cheminformatics in R? 4/18 Much of cheminformatics is data modeling and mining But the numeric data is derived from chemical structure Thus we want to work with molecules & and their parts files containing molecules databases of molecules
  5. R & CDKWhy cheminformatics in R? 5/18 In contrast to bioinformatics (cf. Bioconductor), not a whole lot of cheminformatics support for R For cheminformatics and chemistry, relevant packages include rcdk, rpubchem, chemblr,fingerprint bio3d, ChemmineR, caret A lot of cheminformatics employs various forms of statistics and machine learning - R is exactly the environment for that We just need to add some chemistry capabilities to it
  6. R & CDKWhat does the CDK provide? 6/18 Fundamental chemical objects atoms bonds molecules More complex objects are also available Sequences Reactions Collections of molecules Input/Output for a wide variety of molecular file formats Fingerprints and fragment generation Rigid alignments, pharmacophore searching Substructure searching, SMARTS support Molecular descriptors
  7. R & CDKUsing the CDK in R 7/18 Based on the rJava package Two R packages to install (not counting the dependencies) Provides access to a variety of CDK classes and methods Idiomatic R rcdk CDK Jmol rpubchem rJava fingerprint XML R Programming Environment
  8. R & CDKReading in data 8/18 The CDK supports a variety of file formats rcdk loads all recognized formats, automatically Data can be local or remotemols <- load.molecules( c("data/io/set1.sdf", "data/io/set2.smi", "")) For large SDF’s use an iterating reader Can’t do much with these objects, except via rcdk functions
  9. R & CDKWorking with molecules 9/18 Currently you can access atoms, bonds, get certain atom properties, 2D/3D coordinates Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand
  10. R & CDKAccessing fingerprints 10/18 CDK provides several fingerprints Path-based, MACCS, E-State, PubChem Access them via get.fingerprint(...) Works on one molecule at a time, use lapply to process a list of molecules This method works with the fingerprint package Separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE) Uses C to perform similarity calculations
  11. R & CDKWorking with fingerprints 11/18 The fingerprint package implements 28 similarity and dissimilarity metrics Easy to run enrichment studies We can compare datasets in O(n) time, using the “bit spectrum” 1.0 0.8 Frequency 0.6 0.4 0.2 0.0 0 50 100 150 Bit Position 0 Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
  12. R & CDKVisualization 12/18 rcdk supports visualization of 2D structure images in two ways First, you can bring up a Swing window Second, you can obtain the depiction as a raster imagemols <- load.molecules("data/")## view a single molecule in a Swing windowview.molecule.2d(mols[[1]])## view a table of moleculesview.molecule.2d(mols[1:10])
  13. R & CDKThe QSAR workflow 13/18
  14. R & CDKThe QSAR workflow 14/18 Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets With the numeric data in hand, we can proceed to modeling Before building predictive models, we’d probably explore the dataset Normality of the dependent variable Correlations between descriptors and dependent variable Similarity of subsets Go wild and build all the models that R supports
  15. R & CDKInteracting with chemical databases 15/18 A variety of databases containing structures, physical properties, biological activities Direct access within R lets us streamline our workflow Enabled by public APIs Pubchem PUG and REST ChEMBL REST API (chemblr)
  16. R & CDKReproducible chemical data mining 16/18 The many toolkits and versions .Rda Reproducible Bundle make reproducibility tough DB and HTTP access ensures that an analysis can be always up to date if required If the analysis is not based on a fixed snapshot of data, .R reproducibility cannot be Sweave / Knitr guaranteed Might actually make all those published QSAR models reusable!
  17. R & CDKAcknowledgements 17/18 rcdk Steffen Neumann Miguel Rojas Ranke Johannes CDK Egon Willighagen Christoph Steinbeck ...
  18. R & CDK 18/18 @rguha