Crunching Molecules and Numbers in R


  1. Crunching Molecules and Numbers in R
     Rajarshi Guha, NIH Chemical Genomics Center
     238th ACS National Meeting, 17th August, 2009
     Sections: Background, Molecules in R, Chemical Data, Parallel Paradigms
  2. Outline
     - Some background on R
     - Doing cheminformatics in R
  3. R History
     - S
       - 1976: S developed by John Chambers at Bell Labs
       - 1988: S rewritten in C
       - 1993: Licensed to Insightful Corp.
       - 2004: Bought by Insightful Corp. for $2M
       - 2008: Bought by TIBCO for $25M
     - R
       - 1991: Created by Ihaka & Gentleman
       - 1993: First public release
       - 1995: Released under GPL
       - 2000: R 1.0.0
       - 2009: R 2.9
  4. An overview of R
     - An environment for statistical computation
     - Wide variety of standard and state-of-the-art statistical methods, built in or accessible via packages
     - But also a complete, interpreted programming language
     - Well suited for manipulating and operating on datasets (numerical, categorical or a mixture) of varying shape
     - Impressive visualization facilities (but not very interactive)
  5. An overview of R
     - Syntax is pretty much S-Plus
     - Highly cross-platform
     - Frequent and regular releases, active development by the core group
     - The developer and user communities are extremely active
     - r-help is not just for learning R; you can get a decent statistics education from the list!
     - Used by many top statisticians; many cutting-edge techniques first show up in R
  6. Usability
     - Default mode is a command-line-like prompt
     - GUIs are available
     - But the learning curve is steep
     - Does force you to think about the analysis
     - Not a great tool for casual, once-in-a-while usage
  7. R primitives
     - Numeric, character, list, matrix, data.frame

         > x <- 'Hello World'
         > x <- 1
         > x <- c(1,2,3,4,5,6)
         > x
         [1] 1 2 3 4 5 6
         > x <- data.frame(MW=runif(5, 10, 50),
         +                 hERG=sample(c('active','inactive'), 5, TRUE))
         > x
                 MW     hERG
         1 23.55435   active
         2 42.90365 inactive
         3 49.35149   active
         4 26.85912   active
         5 10.01877   active
  8. Matrix oriented programming
     - Similar in style to Matlab
     - Easily access (multiple) rows and columns
     - Vector/matrix indexing is very powerful and key to efficient R code
     - Perform operations on entire rows or columns
     - Makes subsetting a trivial operation
     - Perfect for QSAR-type analyses
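To make the indexing points on this slide concrete, here is a minimal base R sketch. The descriptor matrix, its row names and its column names (`MW`, `logP`, `TPSA`) are made up for illustration and are not from the talk.

```r
# Hypothetical descriptor matrix: 6 molecules (rows) by 3 descriptors (columns)
set.seed(1)
d <- matrix(runif(18, 0, 10), nrow = 6,
            dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))

first_two <- d[1:2, ]                       # rows 1 and 2, all columns
logp <- d[, "logP"]                         # one column, as a named vector
high <- d[d[, "logP"] > 5, , drop = FALSE]  # rows selected by a logical condition
scaled <- sweep(d, 2, colMeans(d))          # subtract each column's mean from that column
```

`drop = FALSE` keeps the result a matrix even when only one row matches, which avoids a common surprise when subsetting.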
  9. Functional style
     - R’s functional paradigms are closely tied to matrix operations
     - apply, lapply, sapply, tapply allow you to easily operate on groups of objects
       - Elements of a list
       - Rows and/or columns of a matrix
       - Subsets of data, using a grouping variable
     - Anonymous functions are supported
     - Using these functional forms can give a speedup compared to traditional for loops
  10. Non-functional style

         # column std devs
         m <- matrix(runif(100*100), ncol=100)
         sds <- numeric(ncol(m))
         for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

         # mean logP of toxic, non-toxic classes
         m <- data.frame(logp=runif(100),
                         toxic=sample(c('yes','no'), 100, TRUE))
         toxLogP <- 0
         nontoxLogP <- 0
         for (j in 1:nrow(m)) {
           if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
           else nontoxLogP <- nontoxLogP + m[j,1]
         }
         # divide the running sums by the class sizes to get the means
         toxLogP <- toxLogP / sum(m$toxic == 'yes')
         nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
  11. Functional style

         # column std devs
         m <- matrix(runif(100*100), ncol=100)
         apply(m, 2, sd)

         # mean logP of toxic, non-toxic classes
         m <- data.frame(logp=runif(100),
                         toxic=sample(c('yes','no'), 100, TRUE))
         by(m, m$toxic, function(x) mean(x$logp))
  12. Object oriented style
     - R supports multiple object oriented mechanisms
     - Simplest is S3 classes
       - Object orientation is in terms of function names
       - Easy to work with, not always flexible enough
     - S4 classes are much more powerful, but also more complex
     - Many problems can ignore these, as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
     - Becomes important/useful when writing packages, not for day-to-day code
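As an illustration of "object orientation in terms of function names", the following sketch defines a small S3 class and attaches meta-data to a plain vector. The class name `compound` and the constructor `new_compound` are hypothetical names chosen for this example.

```r
# Constructor for a hypothetical S3 class; the class is just an attribute
new_compound <- function(name, smiles) {
  structure(list(name = name, smiles = smiles), class = "compound")
}

# A method is an ordinary function named <generic>.<class>
print.compound <- function(x, ...) {
  cat("<compound", x$name, ":", x$smiles, ">\n")
  invisible(x)
}

# Meta-data can also be attached to plain objects without defining a class
mw <- c(78.11, 46.07)
attr(mw, "units") <- "g/mol"

benzene <- new_compound("benzene", "c1ccccc1")
print(benzene)   # dispatches to print.compound
```

Nothing enforces the structure of an S3 object, which is the flexibility/safety trade-off the slide alludes to; S4 adds formal class definitions at the cost of more ceremony.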
  13. Interfacing with C & Fortran
     - R is interpreted; functional forms help a bit
     - Very useful to refactor inner loops into C (or Fortran)
     - Also useful to provide an R interface to pre-existing C/Fortran code
     - Can lead to dramatic speedups
     - [Figure: speedup (axis 0 to 50) versus fingerprint bit length (79, 166, 1024) for 5000 pairwise Tanimoto similarity calculations; MacBook Pro, 2 GHz, 1 GB RAM]
  14. Visualization
     - R generates publication-quality graphics in a variety of formats
     - A huge number of statistical visualization methods (2D, 3D, OpenGL)
     - Extremely powerful display specifications
       - core commands
       - lattice (a.k.a. trellis graphics)
     - Based on sound statistical theories
     - Standard plots are easy to make, but complex plots do have a learning curve
     - Interactivity is limited, though some packages do alleviate this
  15. Code quality
     - It’s not enough just to write code
     - RUnit is a package that supports unit testing, analogous to JUnit
     - R comes with a well-defined package structure that can be automatically checked for various errors
     - Packages can be uploaded to CRAN, which allows any R user to install them directly from R
     - Extensive documentation format
     - Sweave is an important feature that allows one to include R code and associated text in a single document (literate programming, or reproducible research)
  16. The downsides of R
     - Memory bound (but can use as much memory as you have)
     - Language inconsistencies
       - Indexing starts from 1, but there is no error if you use 0 as an index
       - See blog posts by Radford Neal (U Toronto)
     - Debugging environment not so great (though ESS is good for Emacs users)
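The zero-index behaviour mentioned on this slide is easy to reproduce in base R. The vector below is arbitrary example data.

```r
x <- c(10, 20, 30)

x[1]           # first element: R indexes from 1
x[0]           # no error: returns an empty numeric vector
length(x[0])   # 0
x[4]           # past the end: returns NA rather than raising an error
```

An off-by-one loop that starts at 0 therefore fails silently instead of stopping, which is why this is listed as a language inconsistency.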
  17. Cheminformatics programming
     - Fundamental requirement is support for core chemical concepts
       - Representation and manipulation of these concepts
       - Flexibility
     - Could implement all of this directly in R, but lots of wheels would be reinvented
     - We also want such functionality to be R-like
     - Writing Java or C in R is not R-like
  18. The Chemistry Development Kit
     - Open source Java library for cheminformatics
     - Wide variety of functionality
       - Core chemical concepts (atoms, bonds, molecules)
       - SMARTS, pharmacophores
       - Molecular descriptors and fingerprints
       - 2D depictions
     - Used in a variety of tools, applications and services
     - Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
  19. rcdk - CDK from R
     - [Diagram: components within the R programming environment: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
  20. rcdk Motivations
     - Have access to cheminformatics functionality from within R
     - Support processing of data from chemistry databases
     - Not reimplement cheminformatics methods
     - Have access to all of this in idiomatic R
  21. Basic molecular operations - I/O
     - Read in molecular file formats supported by the CDK
     - Files can be local or remote
     - Parse SMILES strings
     - In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
     - The resultant molecule objects are Java references and can be passed to a variety of rcdk functions

         mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
         mol <- parse.smiles('c1ccccc1CC(=O)')
         mols <- sapply(c('CC', 'CCCC', 'CCCNC'), parse.smiles)
  22. Basic molecular operations
     - Given a molecule, we can extract or add properties
     - Get lists of atoms and bonds and then manipulate them
     - Currently doesn’t support a lot of molecular graph operations

         # get the atoms from a molecule
         mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
         atoms <- get.atoms(mol)

         # get the coordinate matrix of the molecule
         coords <- do.call('rbind', lapply(atoms, get.point3d))
  23. Working with fingerprints
     - rcdk will generate a variety of fingerprints via the CDK
     - Other packages can generate fingerprints
     - The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
     - Provides an S4 class representing binary fingerprints

         m1 <- parse.smiles('c1ccccc1C(COC)N')
         m2 <- parse.smiles('C1CCCCC1C(COC)N')

         # Calculate fingerprints and the Tanimoto similarity between them
         fps <- lapply(list(m1,m2), get.fingerprint, type='maccs')
         distance(fps[[1]], fps[[2]], method='tanimoto')

         # Read fingerprints from a file and compute the pairwise similarity matrix
         fps <- fp.read('fp.txt', lf=moe.lf, size=166, header=TRUE)
         fpsim <- fp.sim.matrix(fps, method='tanimoto')
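For readers unfamiliar with the metric, the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on plain logical vectors in base R; it does not use the fingerprint package or its S4 class, and the helper name `tanimoto` and the two 8-bit vectors are made up for illustration.

```r
# Tanimoto similarity for two equal-length logical (bit) vectors
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both <- sum(a & b)     # bits set in both fingerprints
  either <- sum(a | b)   # bits set in at least one fingerprint
  if (either == 0) return(NA_real_)  # no bits set at all: returning NA is a choice made here
  both / either
}

fp1 <- c(TRUE, TRUE,  FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
fp2 <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE,  TRUE, FALSE)

tanimoto(fp1, fp2)   # 3 shared bits out of 5 set in either: 0.6
```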
  24. rcdk and QSAR
     - [Diagram: Molecular Descriptors, Machine Learning, Property]
  25. rcdk and QSAR
     - Access to descriptors and fingerprints makes for very easy QSAR modeling within R
     - Evaluate the descriptors (individually, by type or all)
     - Get back a data.frame which can be used as input to pretty much any modeling method

         > mols <- load.molecules('big.sdf')
         > dnames <- get.desc.names('topological')
         > descs <- eval.desc(mols, dnames)
         > str(descs)
         'data.frame': 467 obs. of 180 variables:
          $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
          $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
          $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ..
          $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
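Once descriptors are in a data.frame, the modeling step is ordinary R. The sketch below uses simulated data as a stand-in for the output of `eval.desc()`; the descriptor names `d1` to `d3`, the simulated activity and the 70/30 split are all assumptions made for illustration, and linear regression is just one of many possible modeling methods.

```r
set.seed(42)

# Hypothetical stand-in for a descriptor data.frame: 100 molecules, 3 descriptors
descs <- data.frame(d1 = runif(100), d2 = runif(100), d3 = runif(100))

# Simulated property; in practice this would be a measured activity
activity <- 2 * descs$d1 - descs$d2 + rnorm(100, sd = 0.1)
dat <- cbind(descs, activity = activity)

train <- 1:70
test <- 71:100

fit <- lm(activity ~ ., data = dat[train, ])    # fit on the training molecules
pred <- predict(fit, newdata = dat[test, ])     # predict the held-out molecules
r2 <- cor(pred, dat$activity[test])^2           # squared correlation on the test set
```

Swapping `lm` for another method generally only changes the fitting line, since most R modeling functions accept a data.frame and a formula.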
  26. Viewing molecules
     - While numerical modeling is a fundamental task in this environment, visualization is also important
     - Either view structures of individual molecules, or tables of structures and data
     - rcdk supports both (not very well on OS X)

         mol <- parse.smiles('c1ccccc1C(N)CC')
         view.molecule.2d(mol)

         smiles <- c("CCC", "CCN", "CCN(C)(C)", "c1ccccc1Cc1ccccc1",
                     "C1CCC1CC(CN(C)(C))CC(=O)CC")
         mols <- sapply(smiles, parse.smiles)
         view.molecule.2d(mols)
  27. Downsides to rcdk
     - Can’t save state of Java objects
     - Doesn’t take advantage of S4 classes to provide R-side representations of CDK classes
     - Incomplete coverage of the CDK API: sometimes you need to go down to rJava to perform an operation
     - Big datasets are problematic (mainly due to R limitations)
  28. Access to chemical databases
     - Useful to be able to transparently access data from various public data sources
     - PubChem compounds and assays are supported via rpubchem
     - Compound access is primarily by CID, while assay data can be obtained from keyword searches
     - End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
     - R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)
  29. Access to PubChem

         > dat <- get.cids(1:30)
         > str(dat)
         'data.frame': 30 obs. of 11 variables:
          $ CID              : chr "1" "2" "3" "4" ...
          $ IUPACName        : chr "3-acetyloxy-4-(trimethylaz
          $ CanonicalSmile   : chr "CC(=O)OC(CC(=O)[O-])C[N+](
          $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H
          $ MolecularWeight  : num 203.2 204.2 156.1 75.1 169.
         > find.assay.id('LDR')
          [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
         > adat <- get.assay(990)
         > str(adat)
         'data.frame': 51 obs. of 9 variables:
          $ PUBCHEM.SID : int 845800 848472 852502 857608
          $ PUBCHEM.CID : int 648162 6603466 655127 65895
  30. Bioinformatics in R
     - While the focus is on cheminformatics, many problems involve bioinformatics to some degree
     - The Bioconductor project provides a wide variety of packages
       - A lot of it is focused on gene expression analysis
       - A number of packages provide access to various biological databases, annotations, etc.
     - Protein structure analysis is supported in R via Bio3d
     - Never have to leave the comfort of R
     - http://www.bioconductor.org/
     - Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696
  31. Long calculations, big data
     - Many statistical methods require long-running calculations
       - Bootstrap
       - Bayesian methods
     - Many problems involve large datasets
     - A common feature of both scenarios is that they can be trivially parallelized
       - As opposed to requiring a parallel version of the underlying algorithm
     - R has good support for both trivial and non-trivial parallelization methods
     - See R/parallel for a package that will parallelize actual R code
     - Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
  32. Simple parallelization
     - The snow package allows easy use of multiple cores on a single computer or a cluster of computers
     - A simple wrapper over other parallel R libraries
     - Can support PVM, MPI
     - At the very least you can use all the cores on your own machine
     - http://cran.r-project.org/web/packages/snow/
  33. Serial code - Feature Selection
     - Rather than use GA, SA etc., just look at all combinations
     - Inelegant, but no worries about missing the global optimum

         x <- matrix(runif(500*40), ncol=40)
         y <- runif(500)

         library(gtools)
         # all 3-descriptor subsets of the 40 columns, one subset per row
         combos <- combinations(40, 3)
         # fit a linear model on each subset and return its R^2
         apply(combos, 1, function(z) {
           d <- data.frame(y=y, x=x[,z])
           fit <- lm(y~., data=d)
           cor(y, fit$fitted)^2
         })
  34. Simple parallelization - Feature Selection
     - Trivially parallelized

         x <- matrix(runif(500*40), ncol=40)
         y <- runif(500)

         library(gtools)
         combos <- combinations(40, 3)

         library(snow)
         # start two local worker processes and copy x and y to them
         cl <- makeSOCKcluster(2)
         clusterExport(cl, "x")
         clusterExport(cl, "y")
         # same body as the serial version, with apply replaced by parApply
         parApply(cl, combos, 1, function(z) {
           d <- data.frame(y=y, x=x[,z])
           fit <- lm(y~., data=d)
           cor(y, fit$fitted)^2
         })
         stopCluster(cl)
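The talk predates it, but current R ships a `parallel` package whose cluster functions follow the same pattern as snow. The sketch below is a scaled-down variant of the feature selection example under that assumption: it uses 6 descriptors instead of 40, base `combn` instead of gtools, and a helper named `r2` (a name chosen here) that receives the data as arguments so that nothing has to be exported to the workers.

```r
library(parallel)  # ships with R, no extra install needed

set.seed(1)
x <- matrix(runif(100 * 6), ncol = 6)
y <- runif(100)
combos <- t(combn(6, 3))   # all 3-column subsets, one per row (20 rows)

# Scoring function: R^2 of a linear model on the selected columns
r2 <- function(z, x, y) {
  d <- data.frame(y = y, x = x[, z])
  fit <- lm(y ~ ., data = d)
  cor(y, fitted(fit))^2
}

cl <- makeCluster(2)                                    # two local worker processes
par_res <- parApply(cl, combos, 1, r2, x = x, y = y)    # parallel evaluation
stopCluster(cl)

ser_res <- apply(combos, 1, r2, x = x, y = y)           # serial reference
```

Comparing `par_res` with `ser_res` is a cheap sanity check before scaling up; for a problem this small the cluster start-up cost will outweigh any speedup.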
  35. Big data scenarios
     - The idea behind snow can also be used to handle very large datasets
     - Simply chunk the data appropriately and papply over the list of filenames
     - Still requires you to perform chunking and keep track of everything
     - Hadoop is a nice way to avoid all this
       - Throw one or more (very) large files at it, let it deal with chunking and computation
       - For non-trivial file formats, you need to implement a chunker
     - RHIPE provides access to a Hadoop cluster from within R
     - http://hadoop.apache.org/core/
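The manual chunking approach described on this slide can be sketched in base R as follows. The data, the chunk size of 250 rows and the CSV format are arbitrary choices for illustration; the per-file step uses `lapply` here, and a parallel apply over the same list of filenames would take its place on a cluster.

```r
set.seed(7)
big <- data.frame(id = 1:1000, value = runif(1000))  # stand-in for a large dataset

# Split into chunks of 250 rows and write each chunk to its own temporary file
chunk_id <- ceiling(seq_len(nrow(big)) / 250)
files <- vapply(split(big, chunk_id), function(chunk) {
  f <- tempfile(fileext = ".csv")
  write.csv(chunk, f, row.names = FALSE)
  f
}, character(1))

# Process one file at a time, keeping only a small per-chunk summary
partial <- lapply(files, function(f) {
  chunk <- read.csv(f)
  c(n = nrow(chunk), total = sum(chunk$value))
})

# Combine the per-chunk summaries into the overall mean
totals <- Reduce(`+`, partial)
overall_mean <- totals[["total"]] / totals[["n"]]

unlink(files)  # remove the temporary chunk files
```

The bookkeeping visible here (chunk boundaries, filenames, combining partial results) is exactly what the slide says Hadoop takes over.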
  36. Summary
     - rcdk successfully integrates cheminformatics functionality into the R environment
     - Related packages provide access to other forms of chemical data (fingerprints) and data sources
     - An excellent environment for chemical and biological data mining