Crunching Molecules and Numbers in R

Transcript

  • 1. Crunching Molecules and Numbers in R. Rajarshi Guha, NIH Chemical Genomics Center. 238th ACS National Meeting, 17th August, 2009. Sections: Background, Molecules in R, Chemical Data, Parallel Paradigms.
  • 2. Outline
      • Some background on R
      • Doing cheminformatics in R
  • 3. R History
      • S: developed by John Chambers at Bell Labs (1976); rewritten in C (1988); licensed to Insightful Corp. (1993); bought by Insightful Corp. for $2M (2004); bought by TIBCO for $25M (2008)
      • R: created by Ihaka & Gentleman (1991); first public release (1993); released under GPL (1995); R 1.0.0 (2000); R 2.9 (2009)
  • 4. An overview of R
      • An environment for statistical computation
      • Wide variety of standard and state-of-the-art statistical methods, built in or accessible via packages
      • But also a complete, interpreted programming language
      • Well suited for manipulating and operating on datasets (numerical, categorical or a mixture) of varying shape
      • Impressive visualization facilities (but not very interactive)
  • 5. An overview of R
      • Syntax is pretty much S-Plus
      • Highly cross-platform
      • Frequent and regular releases, active development by core group
      • The developer and user communities are extremely active
      • r-help is not just for learning R; you can get a decent statistics education from the list!
      • Used by many top statisticians; many cutting-edge techniques first show up in R
  • 6. Usability
      • Default mode is a command-line-like prompt
      • GUIs are available
      • But the learning curve is steep
      • Does force you to think about the analysis
      • Not a great tool for casual, once-in-a-while usage
  • 7. R primitives
      • Numeric, character, list, matrix, data.frame

        > x <- 'Hello World'
        > x <- 1
        > x <- c(1,2,3,4,5,6)
        > x
        [1] 1 2 3 4 5 6
        > x <- data.frame(MW=runif(5, 10, 50), hERG=sample(c('active','inactive'), 5, TRUE))
        > x
                MW     hERG
        1 23.55435   active
        2 42.90365 inactive
        3 49.35149   active
        4 26.85912   active
        5 10.01877   active
  • 8. Matrix oriented programming
      • Similar in style to Matlab
      • Easily access (multiple) rows, columns
      • Vector/matrix indexing is very powerful and key to efficient R code
      • Perform operations on entire rows or columns
      • Makes subsetting a trivial operation
      • Perfect for QSAR-type analyses
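The indexing idioms listed on this slide can be sketched with a small descriptor matrix. The molecule names, descriptor names and values below are made up for illustration and do not come from the talk.

```r
# Hypothetical descriptor matrix: 5 molecules x 3 descriptors (made-up values).
# matrix() fills column by column, so each row of the c() call is one descriptor.
desc <- matrix(c(180.2, 250.3, 310.4, 410.5, 95.1,   # MW
                 1.2,   3.4,   2.2,   5.1,   0.3,    # logP
                 2,     1,     3,     0,     1),     # HBD
               ncol = 3,
               dimnames = list(paste0("mol", 1:5), c("MW", "logP", "HBD")))

# Access multiple rows and columns at once
first_two <- desc[1:2, c("MW", "logP")]

# Logical indexing: keep molecules with MW < 300 and logP < 3
small <- desc[desc[, "MW"] < 300 & desc[, "logP"] < 3, ]

# Operate on entire columns: center and scale every descriptor
scaled <- scale(desc)

# Column-wise summaries without an explicit loop
col_means <- colMeans(desc)
```

Subsetting with a logical vector, as in `small`, is the pattern that tends to replace explicit row-by-row filtering loops in QSAR-style data preparation.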
  • 9. Functional style
      • R’s functional paradigms are closely tied to matrix operations
      • apply, lapply, sapply, tapply allow you to easily operate on groups of objects: elements of a list, rows and/or columns of a matrix, subsets of data (using a grouping variable)
      • Anonymous functions are supported
      • Use of these functional forms can lead to a speedup compared to traditional for loops
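A minimal sketch of the four functions named on this slide, one call each, using made-up values. The variable names are illustrative only.

```r
# Made-up data: logP values and a class label per molecule
logp  <- c(1.2, 3.4, 2.2, 5.1, 0.3, 4.0)
toxic <- c("yes", "no", "yes", "no", "no", "yes")

# lapply: apply a function to each element of a list, returning a list
frags <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)
sizes <- lapply(frags, length)

# sapply: same idea, simplified to a vector; uses an anonymous function
ranges <- sapply(frags, function(v) max(v) - min(v))

# tapply: summarize subsets of data defined by a grouping variable
mean_logp <- tapply(logp, toxic, mean)

# apply: operate over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
```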
  • 10. Non-functional style

        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        sds <- numeric(ncol(m))
        for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes','no'), 100, TRUE))
        toxLogP <- 0
        nontoxLogP <- 0
        for (j in 1:nrow(m)) {
          if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
          else nontoxLogP <- nontoxLogP + m[j,1]
        }
        # the loop accumulates sums; divide by the class sizes to get the means
        toxLogP <- toxLogP / sum(m$toxic == 'yes')
        nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
  • 11. Functional style

        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        apply(m, 2, sd)

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes','no'), 100, TRUE))
        by(m, m$toxic, function(x) mean(x$logp))
  • 12. Object oriented style
      • R supports multiple object oriented mechanisms
      • Simplest is S3 classes: object orientation is in terms of function names; easy to work with, not always flexible enough
      • S4 classes are much more powerful, but also more complex
      • Many problems can ignore these, as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
      • Becomes important/useful when writing packages, not for day-to-day code
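To illustrate what "object orientation in terms of function names" means for S3, here is a small sketch. The class name `compound` and its fields are hypothetical and are not part of rcdk or any other package mentioned in the talk.

```r
# Constructor for a hypothetical S3 class "compound" (name is illustrative)
compound <- function(smiles, mw) {
  obj <- list(smiles = smiles, mw = mw)
  class(obj) <- "compound"
  obj
}

# S3 dispatch is by function name: <generic>.<class>
print.compound <- function(x, ...) {
  cat("<compound", x$smiles, "MW =", x$mw, ">\n")
  invisible(x)
}

summary.compound <- function(object, ...) {
  c(n_chars = nchar(object$smiles), mw = object$mw)
}

benzene <- compound("c1ccccc1", 78.11)
print(benzene)            # dispatches to print.compound
s <- summary(benzene)     # dispatches to summary.compound

# Crude encapsulation without any class: attach meta-data as an attribute
x <- c(0.5, 1.7, 2.3)
attr(x, "units") <- "logP"
```

The last two lines show the attribute mechanism the slide refers to: plain vectors and data.frames can carry meta-data without a class definition.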
  • 13. Interfacing with C & Fortran
      • R is interpreted; functional forms help a bit
      • Very useful to refactor inner loops into C (or Fortran)
      • Also useful to provide an R interface to pre-existing C/Fortran code
      • Can lead to dramatic speedups
      • [Figure: speedup (axis from 0 to 50) versus bit length (79, 166, 1024) for 5000 pairwise Tanimoto similarity calculations; MacBook Pro, 2 GHz, 1 GB RAM]
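For reference, the kind of inner loop being benchmarked can be written in plain R as follows. This is a sketch of the Tanimoto calculation itself, not the C code or the benchmark from the talk; the random fingerprints and the function names are assumptions made for illustration.

```r
# Tanimoto similarity between two binary fingerprints stored as logical vectors
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both   <- sum(a & b)   # bits set in both fingerprints
  either <- sum(a | b)   # bits set in at least one fingerprint
  if (either == 0) return(0)  # assumption: two empty fingerprints get similarity 0
  both / either
}

# All pairwise similarities for a matrix with one fingerprint per row,
# using matrix algebra instead of a double loop.
# Assumes no row is all zeros (otherwise a 0/0 division yields NaN).
tanimoto_matrix <- function(fps) {
  x <- fps * 1                      # logical -> numeric 0/1
  common <- tcrossprod(x)           # common[i, j] = bits shared by rows i and j
  nbits  <- rowSums(x)              # bits set in each fingerprint
  common / (outer(nbits, nbits, "+") - common)
}

set.seed(1)
fps <- matrix(runif(10 * 166) < 0.3, nrow = 10)   # 10 random 166-bit fingerprints
sim <- tanimoto_matrix(fps)
```

The vectorized form is usually the first thing to try before moving a loop into C; the speedups on the slide apply when even that is too slow.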
  • 14. Visualization
      • R generates publication quality graphics in a variety of formats
      • A huge number of statistical visualization methods (2D, 3D, OpenGL)
      • Extremely powerful display specifications: core commands, lattice (a.k.a. trellis graphics)
      • Based on sound statistical theories
      • Standard plots are easy to make, but complex plots do have a learning curve
      • Interactivity is limited, though some packages do alleviate this
  • 15. Code quality
      • It’s not enough to just write code
      • RUnit is a package that supports unit testing, analogous to JUnit
      • R comes with a well-defined package structure that can be automatically checked for various errors
      • Packages can be uploaded to CRAN, which allows any R user to install them directly from R
      • Extensive documentation format
      • Sweave is an important feature which allows one to include R code and associated text in a single document (literate programming, or reproducible research)
  • 16. The downsides of R
      • Memory bound (but can use as much memory as you have)
      • Language inconsistencies: indexing starts from 1, but there is no error if you use 0 as an index; see blog posts by Radford Neal (U Toronto)
      • Debugging environment not so great (though ESS is good for Emacs users)
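The zero-index behaviour mentioned on this slide is easy to see at the prompt. The short sketch below also shows two related cases (indexing past the end, and `1:length(v)` on an empty vector) that are not on the slide but follow from the same rules.

```r
x <- c(10, 20, 30)

first <- x[1]        # indexing starts at 1
zero  <- x[0]        # no error: returns a zero-length vector
past  <- x[5]        # no error either: returns NA

# A loop over 1:length(v) misbehaves for empty vectors; seq_along() is safer
v <- numeric(0)
bad  <- 1:length(v)      # c(1, 0), so a loop body would run twice
good <- seq_along(v)     # integer(0), so a loop body would not run
```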
  • 17. Cheminformatics programming
      • Fundamental requirement is support for core chemical concepts
      • Representation and manipulation of these concepts
      • Flexibility
      • Could implement all of this directly in R, but lots of wheels would be reinvented
      • We also want such functionality to be R-like
      • Writing Java or C in R is not R-like
  • 18. The Chemistry Development Kit
      • Open source Java library for cheminformatics
      • Wide variety of functionality: core chemical concepts (atoms, bonds, molecules); SMARTS, pharmacophores; molecular descriptors and fingerprints; 2D depictions
      • Used in a variety of tools, applications and services
      • Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
  • 19. rcdk - CDK from R
      • [Diagram of components within the R programming environment: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
  • 20. rcdk Motivations
      • Have access to cheminformatics functionality from within R
      • Support processing of data from chemistry databases
      • Not reimplement cheminformatics methods
      • Have access to all of this in idiomatic R
  • 21. Basic molecular operations - I/O
      • Read in molecular file formats supported by the CDK
      • Files can be local or remote
      • Parse SMILES strings
      • In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
      • The resultant molecule objects are Java references and can be passed to a variety of rcdk functions

        mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
        mol <- parse.smiles('c1ccccc1CC(=O)')
        mols <- sapply(c('CC', 'CCCC', 'CCCNC'), parse.smiles)
  • 22. Basic molecular operations
      • Given a molecule, we can extract or add properties
      • Get lists of atoms and bonds and then manipulate them
      • Currently doesn’t support a lot of molecular graph operations

        # get the atoms from a molecule
        mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
        atoms <- get.atoms(mol)

        # get the coordinate matrix of the molecule
        coords <- do.call('rbind', lapply(atoms, get.point3d))
  • 23. Working with fingerprints
      • rcdk will generate a variety of fingerprints via the CDK
      • Other packages can generate fingerprints
      • The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
      • Provides an S4 class representing binary fingerprints

        m1 <- parse.smiles('c1ccccc1C(COC)N')
        m2 <- parse.smiles('C1CCCCC1C(COC)N')

        # Calculate fingerprints
        fps <- lapply(list(m1,m2), get.fingerprint, type='maccs')
        distance(fps[[1]], fps[[2]], method='tanimoto')

        fps <- fp.read('fp.txt', lf=moe.lf, size=166, header=TRUE)
        fpsim <- fp.sim.matrix(fps, method='tanimoto')
  • 24. rcdk and QSAR
      • [Diagram labels: Molecular Descriptors, Machine Learning, Property]
  • 25. rcdk and QSAR
      • Access to descriptors and fingerprints makes for very easy QSAR modeling within R
      • Evaluate the descriptors (individually, by type or all)
      • Get back a data.frame which can be used as input to pretty much any modeling method

        mols <- load.molecules('big.sdf')
        dnames <- get.desc.names('topological')
        descs <- eval.desc(mols, dnames)
        str(descs)
        'data.frame':   467 obs. of  180 variables:
         $ ATSc1 : num  0.28 0.279 0.279 0.217 0.479 ...
         $ ATSc2 : num  -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
         $ ATSc3 : num  -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ..
         $ ATSc4 : num  -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
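Once descriptors are in a data.frame, the modeling step is ordinary R. The sketch below uses a synthetic data.frame as a stand-in for descriptor output so that it runs without rcdk; the descriptor names, the simulated activity values and the choice of a linear model are all assumptions made for illustration.

```r
set.seed(42)

# Stand-in for the output of a descriptor calculation: 50 molecules x 3 descriptors.
# Descriptor names and values are made up for illustration.
descs <- data.frame(MW   = runif(50, 150, 500),
                    logP = runif(50, -1, 5),
                    TPSA = runif(50, 20, 140))

# Synthetic "observed" property so that the example is self-contained
activity <- 0.01 * descs$MW - 0.5 * descs$logP + rnorm(50, sd = 0.2)

# Drop constant columns, which carry no information for a model
keep  <- sapply(descs, function(col) sd(col) > 0)
descs <- descs[, keep, drop = FALSE]

# Fit on a training subset, predict on the held-out molecules
train <- sample(nrow(descs), 35)
fit   <- lm(activity ~ ., data = cbind(descs, activity = activity)[train, ])
pred  <- predict(fit, newdata = descs[-train, ])
r2    <- cor(pred, activity[-train])^2
```

Any other modeling function that accepts a data.frame could be substituted for `lm` here without changing the surrounding steps.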
  • 26. Viewing molecules
      • While numerical modeling is a fundamental task in this environment, visualization is also important
      • Either view structures of individual molecules or tables of structure and data
      • rcdk supports both (not very well on OS X)

        mol <- parse.smiles('c1ccccc1C(N)CC')
        view.molecule.2d(mol)

        smiles <- c("CCC", "CCN", "CCN(C)(C)", "c1ccccc1Cc1ccccc1",
                    "C1CCC1CC(CN(C)(C))CC(=O)CC")
        mols <- sapply(smiles, parse.smiles)
        view.molecule.2d(mols)
  • 27. Downsides to rcdk
      • Can’t save state of Java objects
      • Doesn’t take advantage of S4 classes to provide R-side representations of CDK classes
      • Incomplete coverage of the CDK API; sometimes need to go down to rJava to perform an operation
      • Big datasets are problematic (mainly due to R limitations)
  • 28. Access to chemical databases
      • Useful to be able to transparently access data from various public data sources
      • PubChem compounds and assays are supported via rpubchem
      • Compound access is primarily by CID, while assay data can be obtained from keyword searches
      • End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
      • R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)
  • 29. Access to PubChem

        > dat <- get.cids(1:30)
        'data.frame':   30 obs. of  11 variables:
         $ CID              : chr  "1" "2" "3" "4" ...
         $ IUPACName        : chr  "3-acetyloxy-4-(trimethylaz
         $ CanonicalSmile   : chr  "CC(=O)OC(CC(=O)[O-])C[N+](
         $ MolecularFormula : chr  "C9H17NO4" "C9H18NO4+" "C7H
         $ MolecularWeight  : num  203.2 204.2 156.1 75.1 169.
        > find.assay.id('LDR')
         [1]  990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
        > adat <- get.assay(990)
        > str(adat)
        'data.frame':   51 obs. of  9 variables:
         $ PUBCHEM.SID : int  845800 848472 852502 857608
         $ PUBCHEM.CID : int  648162 6603466 655127 65895
  • 30. Bioinformatics in R
      • While the focus is on cheminformatics, many problems involve bioinformatics to some degree
      • The Bioconductor project provides a wide variety of packages
      • A lot of it is focused on gene expression analysis
      • A number of packages provide access to various biological databases, annotations etc.
      • Protein structure analysis is supported in R via Bio3d
      • Never have to leave the comfort of R
      • http://www.bioconductor.org/
      • Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696
  • 31. Long calculations, big data
      • Many statistical methods require long-running calculations: bootstrap, Bayesian methods
      • Many problems involve large datasets
      • A common feature of both scenarios is that they can be trivially parallelized, as opposed to requiring a parallel version of the underlying algorithm
      • R has good support for both trivial and non-trivial parallelization methods
      • See R/parallel for a package that will parallelize actual R code
      • Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
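The bootstrap is a convenient example of why such calculations parallelize trivially: every replicate is independent of the others. The sketch below runs serially on simulated data; the data, the number of replicates and the function name `one_replicate` are assumptions made for illustration.

```r
set.seed(7)
x <- rnorm(200)
y <- 2 * x + rnorm(200)

# One bootstrap replicate: resample rows with replacement, refit, return the slope
one_replicate <- function(i) {
  idx <- sample(length(x), replace = TRUE)
  unname(coef(lm(y[idx] ~ x[idx]))[2])
}

# Serial version; each call is independent of all the others
slopes <- sapply(1:200, one_replicate)

# Percentile confidence interval for the slope
ci <- quantile(slopes, c(0.025, 0.975))
```

Because `one_replicate` depends only on `x` and `y`, the `sapply` call could be handed to a parallel apply (after making `x` and `y` available on the workers) without touching the statistical code.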
  • 32. Simple parallelization
      • The snow package allows easy use of multiple cores on a single computer or a cluster of computers
      • A simple wrapper over other parallel R libraries
      • Can support PVM, MPI
      • At the very least you can use all the cores on your own machine
      • http://cran.r-project.org/web/packages/snow/
  • 33. Serial code - Feature Selection
      • Rather than use GA, SA etc., just look at all combinations
      • Inelegant, but no worries about missing the global optimum

        x <- matrix(runif(500*40), ncol=40)
        y <- runif(500)

        library(gtools)
        combos <- combinations(40, 3)
        apply(combos, 1, function(z) {
          d <- data.frame(y=y, x=x[,z])
          fit <- lm(y~., data=d)
          cor(y, fit$fitted)^2
        })
  • 34. Simple parallelization - Feature Selection
      • Trivially parallelized

        x <- matrix(runif(500*40), ncol=40)
        y <- runif(500)

        library(gtools)
        combos <- combinations(40, 3)

        library(snow)
        cl <- makeSOCKcluster(2)
        clusterExport(cl, "x")
        clusterExport(cl, "y")
        parApply(cl, combos, 1, function(z) {
          d <- data.frame(y=y, x=x[,z])
          fit <- lm(y~., data=d)
          cor(y, fit$fitted)^2
        })
  • 35. Big data scenarios
      • The idea behind snow can also be used to handle very large datasets
      • Simply chunk the data appropriately and papply over the list of filenames
      • Still requires you to perform chunking and keep track of everything
      • Hadoop is a nice way to avoid all this
      • Throw one or more (very) large files at it, let it deal with chunking and computation
      • For non-trivial file formats, you need to implement a chunker
      • RHIPE provides access to a Hadoop cluster from within R
      • http://hadoop.apache.org/core/
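The manual "chunk the data and apply over the filenames" approach described on this slide can be sketched in base R as follows. The dataset, the chunk size, the file names and the per-chunk summary are all made up for illustration, and the apply step runs serially here.

```r
# Made-up dataset, written out as several chunk files in a temporary directory
set.seed(3)
big <- data.frame(id = 1:1000, value = runif(1000))
chunk_id  <- ceiling(seq_len(nrow(big)) / 250)        # 4 chunks of 250 rows
chunk_dir <- tempfile("chunks")
dir.create(chunk_dir)

files <- sapply(split(big, chunk_id), function(chunk) {
  f <- file.path(chunk_dir, paste0("chunk_", chunk$id[1], ".csv"))
  write.csv(chunk, f, row.names = FALSE)
  f
})

# Per-chunk work: read one file and return a small summary
process_chunk <- function(f) {
  d <- read.csv(f)
  c(n = nrow(d), total = sum(d$value))
}

# Serial here; a parallel apply over `files` would have the same shape
parts <- lapply(files, process_chunk)

# Combine the per-chunk summaries into an overall result
combined     <- Reduce(`+`, parts)
overall_mean <- unname(combined["total"] / combined["n"])
```

The bookkeeping in this sketch (splitting, naming files, recombining results) is the part the slide says a framework such as Hadoop takes over.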
  • 36. Summary
      • rcdk successfully integrates cheminformatics functionality into the R environment
      • Related packages provide access to other forms of chemical data (fingerprints) and data sources
      • An excellent environment for chemical and biological data mining
