Chemical Data Mining: Open Source & Reproducible
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Chemical Data Mining: Open Source & Reproducible

on

  • 1,936 views

 

Statistics

Views

Total Views
1,936
Views on SlideShare
1,847
Embed Views
89

Actions

Likes
3
Downloads
24
Comments
0

1 Embed 89

http://lanyrd.com 89

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Chemical Data Mining: Open Source & Reproducible Presentation Transcript

  • 1. R & CDK 1/18 Chemical Data Mining Open Source & Reproducible Rajarshi GuhaNIH Center for Advancing Translational Science August 21, 2012 Philadelphia PA
  • 2. R & CDKBackground 2/18 Been using it since 2003, developed a number of R packages, mostly public Make extensive use of R at NCGC for small molecule & RNAi screening and high content analysis In paralllel, need to manipulate and process chemical structure data How R is enhanced by other Open Source software How R enables and supports reproducible science
  • 3. R & CDKWhat is R? 3/18 R is an environment for modeling Contains many prepackaged statistical and mathematical functions No need to implement anything (if you don’t want to) R is a matrix programming language that is good for statistical computing Full fledged, interpreted language Well integrated with statistical functionality Easy to integrate with C, C++, Fortran Good for prototyping
  • 4. R & CDKWhy cheminformatics in R? 4/18 Much of cheminformatics is data modeling and mining But the numeric data is derived from chemical structure Thus we want to work with molecules & and their parts files containing molecules databases of molecules
  • 5. R & CDKWhy cheminformatics in R? 5/18 In contrast to bioinformatics (cf. Bioconductor), not a whole lot of cheminformatics support for R For cheminformatics and chemistry, relevant packages include rcdk, rpubchem, chemblr,fingerprint bio3d, ChemmineR, caret A lot of cheminformatics employs various forms of statistics and machine learning - R is exactly the environment for that We just need to add some chemistry capabilities to it
  • 6. R & CDKWhat does the CDK provide? 6/18 Fundamental chemical objects atoms bonds molecules More complex objects are also available Sequences Reactions Collections of molecules Input/Output for a wide variety of molecular file formats Fingerprints and fragment generation Rigid alignments, pharmacophore searching Substructure searching, SMARTS support Molecular descriptors
  • 7. R & CDKUsing the CDK in R 7/18 Based on the rJava package Two R packages to install (not counting the dependencies) Provides access to a variety of CDK classes and methods Idiomatic R rcdk CDK Jmol rpubchem rJava fingerprint XML R Programming Environment
  • 8. R & CDKReading in data 8/18 The CDK supports a variety of file formats rcdk loads all recognized formats, automatically Data can be local or remotemols <- load.molecules( c("data/io/set1.sdf", "data/io/set2.smi", "http://rguha.net/rcdk/remote.sdf")) For large SDF’s use an iterating reader Can’t do much with these objects, except via rcdk functions
  • 9. R & CDKWorking with molecules 9/18 Currently you can access atoms, bonds, get certain atom properties, 2D/3D coordinates Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand
  • 10. R & CDKAccessing fingerprints 10/18 CDK provides several fingerprints Path-based, MACCS, E-State, PubChem Access them via get.fingerprint(...) Works on one molecule at a time, use lapply to process a list of molecules This method works with the fingerprint package Separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE) Uses C to perform similarity calculations
  • 11. R & CDKWorking with fingerprints 11/18 The fingerprint package implements 28 similarity and dissimilarity metrics Easy to run enrichment studies We can compare datasets in O(n) time, using the “bit spectrum” 1.0 0.8 Frequency 0.6 0.4 0.2 0.0 0 50 100 150 Bit Position 0 Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
  • 12. R & CDKVisualization 12/18 rcdk supports visualization of 2D structure images in two ways First, you can bring up a Swing window Second, you can obtain the depiction as a raster imagemols <- load.molecules("data/dhfr_3d.sd")## view a single molecule in a Swing windowview.molecule.2d(mols[[1]])## view a table of moleculesview.molecule.2d(mols[1:10])
  • 13. R & CDKThe QSAR workflow 13/18
  • 14. R & CDKThe QSAR workflow 14/18 Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets With the numeric data in hand, we can proceed to modeling Before building predictive models, we’d probably explore the dataset Normality of the dependent variable Correlations between descriptors and dependent variable Similarity of subsets Go wild and build all the models that R supports
  • 15. R & CDKInteracting with chemical databases 15/18 A variety of databases containing structures, physical properties, biological activities Direct access within R lets us streamline our workflow Enabled by public APIs Pubchem PUG and REST ChEMBL REST API (chemblr)
  • 16. R & CDKReproducible chemical data mining 16/18 The many toolkits and versions .Rda Reproducible Bundle make reproducibility tough DB and HTTP access ensures that an analysis can be always up to date if required If the analysis is not based on a fixed snapshot of data, .R reproducibility cannot be Sweave / Knitr guaranteed Might actually make all those published QSAR models reusable!
  • 17. R & CDKAcknowledgements 17/18 rcdk Steffen Neumann Miguel Rojas Ranke Johannes CDK Egon Willighagen Christoph Steinbeck ...
  • 18. R & CDK 18/18http://sourceforge.net/projects/cdk/ http://github.com/rajarshi/cdk @rguha