Crunching Molecules and Numbers in R


Transcript

  • 1. Crunching Molecules and Numbers in R. Rajarshi Guha, NIH Chemical Genomics Center. 238th ACS National Meeting, 17th August, 2009. Sections: Background, Molecules in R, Chemical Data, Parallel Paradigms
  • 2. Outline
      • Some background on R
      • Doing cheminformatics in R
  • 3. R History
      • S
          • 1976: S developed by John Chambers at Bell Labs
          • 1988: S rewritten in C
          • 1993: Licensed to Insightful Corp.
          • 2004: Bought by Insightful Corp for $2M
          • 2008: Bought by TIBCO for $25M
      • R
          • 1991: Created by Ihaka & Gentleman
          • 1993: First public release
          • 1995: Released under GPL
          • 2000: R 1.0.0
          • 2009: R 2.9
  • 4. An overview of R
      • An environment for statistical computation
      • Wide variety of standard and state of the art statistical methods built in or accessible via packages
      • But also a complete, interpreted programming language
      • Well suited for manipulating and operating on datasets - numerical, categorical or a mixture - and of varying shape
      • Impressive visualization facilities (but not very interactive)
  • 5. An overview of R
      • Syntax is pretty much S-Plus
      • Highly cross-platform
      • Frequent and regular releases, active development by the core group
      • The dev and user community is extremely active
          • r-help is not just for learning R, you can get a decent statistics education from the list!
      • Used by many top statisticians; many cutting edge techniques first show up in R
  • 6. Usability
      • Default mode is a command line like prompt
      • GUIs are available
      • But the learning curve is steep
      • Does force you to think about the analysis
      • Not a great tool for casual, once-in-a-while usage
  • 7. R primitives
      • Numeric, character, list, matrix, data.frame
        > x <- 'Hello World'
        > x <- 1
        > x <- c(1, 2, 3, 4, 5, 6)
        > x
        [1] 1 2 3 4 5 6
        > x <- data.frame(MW = runif(5, 10, 50),
                          hERG = sample(c('active', 'inactive'), 5, TRUE))
        > x
                MW     hERG
        1 23.55435   active
        2 42.90365 inactive
        3 49.35149   active
        4 26.85912   active
        5 10.01877   active
  • 8. Matrix oriented programming
      • Similar in style to Matlab
      • Easily access (multiple) rows, columns
      • Vector/matrix indexing is very powerful and key to efficient R code
      • Perform operations on entire rows or columns
      • Makes subsetting a trivial operation
      • Perfect for QSAR type analyses
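The indexing and subsetting points on the matrix slide can be illustrated with a small base R sketch. The descriptor names, the activity labels and the random values are hypothetical stand-ins, not data from the talk.

```r
# Hypothetical descriptor matrix: 6 molecules (rows) x 3 descriptors (columns)
set.seed(1)
desc <- matrix(runif(18), nrow = 6,
               dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))
activity <- c("active", "inactive", "active", "active", "inactive", "active")

desc[2, ]                                 # one row: all descriptors of mol2
desc[, c("MW", "logP")]                   # several columns selected by name
actives <- desc[activity == "active", ]   # logical subsetting of rows
colMeans(actives)                         # operate on whole columns at once
scaled <- scale(desc)                     # centre and scale every column
```

Logical subsetting such as `desc[activity == "active", ]` is what makes splitting a QSAR dataset by class a one-liner.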
  • 9. Functional style
      • R's functional paradigms are closely tied to matrix operations
      • apply, lapply, sapply, tapply allow you to easily operate on groups of objects
          • Elements of a list
          • Rows and/or columns of a matrix
          • Subsets of data, using a grouping variable
      • Anonymous functions are supported
      • Use of these functional forms can lead to a speedup compared to traditional for loops
  • 10. Non-functional style
        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        sds <- numeric(ncol(m))
        for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes', 'no'), 100, TRUE))
        toxLogP <- 0
        nontoxLogP <- 0
        for (j in 1:nrow(m)) {
          if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
          else nontoxLogP <- nontoxLogP + m[j,1]
        }
        # divide the running sums by the class sizes to get the means
        toxLogP <- toxLogP / sum(m$toxic == 'yes')
        nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
  • 11. Functional style
        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        apply(m, 2, sd)

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes', 'no'), 100, TRUE))
        by(m, m$toxic, function(x) mean(x$logp))
  • 12. Object oriented style
      • R supports multiple object oriented mechanisms
      • Simplest is S3 classes
          • Object orientation is in terms of function names
          • Easy to work with, not always flexible enough
      • S4 classes are much more powerful, but also more complex
      • Many problems can ignore these as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
      • Becomes important/useful when writing packages, not for day to day code
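As a rough illustration of the two lighter-weight options on the object orientation slide, the sketch below defines an S3 class whose behaviour is selected purely by function name, and then attaches meta-data to a plain vector with attributes. The `molecule` class, its fields and the attribute values are invented for this example and are not part of rcdk.

```r
# Hypothetical S3 class: a list tagged with the class name "molecule"
new_molecule <- function(smiles, mw) {
  structure(list(smiles = smiles, mw = mw), class = "molecule")
}

# S3 dispatch works on function names: print() on an object of class
# "molecule" ends up calling print.molecule()
print.molecule <- function(x, ...) {
  cat("<molecule", x$smiles, "MW =", x$mw, ">\n")
  invisible(x)
}

benzene <- new_molecule("c1ccccc1", 78.11)
print(benzene)

# Crude encapsulation: meta-data stored as attributes on a numeric vector
ic50 <- c(1.2, 0.4, 15.0)
attr(ic50, "units") <- "uM"
attr(ic50, "assay") <- "hypothetical hERG screen"
attributes(ic50)
```

For day to day analysis the attribute approach is often enough; a class only starts to pay off once several functions need to agree on the same structure.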
  • 13. Interfacing with C & Fortran
      • R is interpreted, functional forms help a bit
      • Very useful to refactor inner loops into C (or Fortran)
      • Also useful to provide an R interface to pre-existing C/Fortran code
      • Can lead to dramatic speedups
      • [Figure: speedup (y-axis, 0 to 50) versus fingerprint bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations, Macbook Pro, 2GHz, 1GB RAM]
  • 14. Visualization
      • R generates publication quality graphics in a variety of formats
      • A huge number of statistical visualization methods (2D, 3D, OpenGL)
      • Extremely powerful display specifications
          • core commands
          • lattice (a.k.a. trellis graphics)
      • Based on sound statistical theories
      • Standard plots are easy to make, but complex plots do have a learning curve
      • Interactivity is limited, though some packages do alleviate this
  • 15. Code quality
      • It's not enough to just write code
      • RUnit is a package that supports unit testing, analogous to JUnit
      • R comes with a well defined package structure that can be automatically checked for various errors
      • Packages can be uploaded to CRAN, which allows any R user to install them directly from R
      • Extensive documentation format
      • Sweave is an important feature which allows one to include R code and associated text in a single document - literate programming or reproducible research
  • 16. The downsides of R
      • Memory bound (but can use as much memory as you have)
      • Language inconsistencies
          • Indexing starts from 1, but no error if you use 0 as an index
          • See blog posts by Radford Neal (U Toronto)
      • Debugging environment not so great (though ESS is good for Emacs users)
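The zero-index behaviour mentioned on the downsides slide is easy to reproduce in base R. The last two lines show one well-known consequence of the same design; that example is an addition here rather than something taken from the slides.

```r
x <- c(10, 20, 30)

x[1]          # first element, since indexing starts at 1
x[0]          # no error: returns a zero-length numeric vector
length(x[0])  # 0
x[5]          # out of range is not an error either, it yields NA

# A related trap: 1:length(v) on an empty vector counts down from 1 to 0,
# so a for loop over it runs twice. seq_along() avoids this.
empty <- numeric(0)
1:length(empty)
seq_along(empty)
```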
  • 17. Cheminformatics programming
      • Fundamental requirement is support for core chemical concepts
          • Representation and manipulation of these concepts
          • Flexibility
      • Could implement all of this directly in R - lots of wheels would be reinvented
      • We also want such functionality to be R-like
      • Writing Java or C in R is not R-like
  • 18. The Chemistry Development Kit
      • Open source Java library for cheminformatics
      • Wide variety of functionality
          • Core chemical concepts (atoms, bonds, molecules)
          • SMARTS, pharmacophores
          • Molecular descriptors and fingerprints
          • 2D depictions
      • Used in a variety of tools, applications and services
      • Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
  • 19. rcdk - CDK from R
      • [Diagram: package stack with rcdk at the top; CDK, Jmol and rpubchem in the middle; rJava, fingerprint and XML beneath; all on top of the R Programming Environment]
  • 20. rcdk Motivations
      • Have access to cheminformatics functionality from within R
      • Support processing of data from chemistry databases
      • Not reimplement cheminformatics methods
      • Have access to all of this in idiomatic R
  • 21. Basic molecular operations - I/O
      • Read in molecular file formats supported by the CDK
      • Files can be local or remote
      • Parse SMILES strings
      • In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
      • The resultant molecule objects are Java references and can be passed to a variety of rcdk functions
        mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
        mol <- parse.smiles('c1ccccc1CC(=O)')
        mols <- sapply(c('CC', 'CCCC', 'CCCNC'), parse.smiles)
  • 22. Basic molecular operations
      • Given a molecule, we can extract or add properties
      • Get lists of atoms and bonds and then manipulate them
      • Currently doesn't support a lot of molecular graph operations
        # get the atoms from a molecule
        mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
        atoms <- get.atoms(mol)
        # get the coordinate matrix of the molecule
        coords <- do.call('rbind', lapply(atoms, get.point3d))
  • 23. Working with fingerprints
      • rcdk will generate a variety of fingerprints via the CDK
      • Other packages can generate fingerprints
      • The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
      • Provides an S4 class representing binary fingerprints
        m1 <- parse.smiles('c1ccccc1C(COC)N')
        m2 <- parse.smiles('C1CCCCC1C(COC)N')
        # Calculate fingerprints
        fps <- lapply(list(m1, m2), get.fingerprint, type='maccs')
        distance(fps[[1]], fps[[2]], method='tanimoto')
        fps <- fp.read('fp.txt', lf=moe.lf, size=166, header=TRUE)
        fpsim <- fp.sim.matrix(fps, method='tanimoto')
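For readers unfamiliar with the Tanimoto measure used on the fingerprint slide, here is a minimal base R sketch of the calculation on two binary fingerprints held as logical vectors. The `tanimoto` helper and the 8-bit fingerprints are made up for illustration; this is not how the fingerprint package implements it.

```r
# Tanimoto similarity of two binary fingerprints:
# (bits set in both) / (bits set in either)
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  shared <- sum(a & b)
  either <- sum(a | b)
  if (either == 0) return(NA_real_)  # undefined when no bit is set at all
  shared / either
}

# Hypothetical 8-bit fingerprints
fp1 <- c(1, 1, 0, 1, 0, 0, 1, 0) == 1
fp2 <- c(1, 0, 0, 1, 1, 0, 1, 0) == 1

tanimoto(fp1, fp2)  # 3 shared bits out of 5 set in either: 0.6
```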
  • 24. rcdk and QSAR
      • [Diagram: the QSAR workflow linking Molecular Descriptors, Machine Learning and Property]
  • 25. rcdk and QSAR
      • Access to descriptors and fingerprints makes for very easy QSAR modeling within R
      • Evaluate the descriptors (individually, by type or all)
      • Get back a data.frame which can be used as input to pretty much any modeling method
        mols <- load.molecules('big.sdf')
        dnames <- get.desc.names('topological')
        descs <- eval.desc(mols, dnames)
        str(descs)
        'data.frame': 467 obs. of 180 variables:
         $ ATSc1: num 0.28 0.279 0.279 0.217 0.479 ...
         $ ATSc2: num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
         $ ATSc3: num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ..
         $ ATSc4: num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
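To show what "input to pretty much any modeling method" can look like, the following sketch fits a linear model to a descriptor data.frame. Everything here is synthetic: `descs` is a random stand-in for the output of a descriptor calculation, and the activity values are generated from it, so the numbers carry no chemical meaning.

```r
set.seed(123)

# Hypothetical stand-in for a descriptor data.frame: 50 molecules x 3 descriptors
descs <- data.frame(d1 = rnorm(50), d2 = rnorm(50), d3 = rnorm(50))

# Synthetic activity that depends on two of the descriptors plus noise
activity <- 2 * descs$d1 - descs$d2 + rnorm(50, sd = 0.1)

# Drop descriptors that are constant or contain missing values
keep <- sapply(descs, function(v) !anyNA(v) && sd(v) > 0)
train <- cbind(descs[, keep, drop = FALSE], activity = activity)

# Ordinary least squares on all remaining descriptors
fit <- lm(activity ~ ., data = train)
r2 <- summary(fit)$r.squared
r2
```

Swapping `lm` for another modeling function usually only changes the fitting line, since most R modeling functions accept a formula and a data.frame.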
  • 26. Viewing molecules
      • While numerical modeling is a fundamental task in this environment, visualization is also important
      • Either view structures of individual molecules or tables of structure and data
      • rcdk supports both (not very well on OS X)
        mol <- parse.smiles('c1ccccc1C(N)CC')
        view.molecule.2d(mol)
        smiles <- c("CCC", "CCN", "CCN(C)(C)", "c1ccccc1Cc1ccccc1",
                    "C1CCC1CC(CN(C)(C))CC(=O)CC")
        mols <- sapply(smiles, parse.smiles)
        view.molecule.2d(mols)
  • 27. Downsides to rcdk
      • Can't save state of Java objects
      • Doesn't take advantage of S4 classes to provide R-side representations of CDK classes
      • Incomplete coverage of the CDK API - sometimes need to go down to rJava to perform an operation
      • Big datasets are problematic (mainly due to R limitations)
  • 28. Access to chemical databases
      • Useful to be able to transparently access data from various public data sources
      • PubChem compounds and assays are supported via rpubchem
      • Compound access is primarily by CID, while assay data can be obtained from keyword searches
      • End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
      • R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)
  • 29. Access to PubChem
        > dat <- get.cids(1:30)
        > str(dat)
        'data.frame': 30 obs. of 11 variables:
         $ CID             : chr "1" "2" "3" "4" ...
         $ IUPACName       : chr "3-acetyloxy-4-(trimethylaz
         $ CanonicalSmile  : chr "CC(=O)OC(CC(=O)[O-])C[N+](
         $ MolecularFormula: chr "C9H17NO4" "C9H18NO4+" "C7H
         $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
        > find.assay.id('LDR')
        [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
        > adat <- get.assay(990)
        > str(adat)
        'data.frame': 51 obs. of 9 variables:
         $ PUBCHEM.SID: int 845800 848472 852502 857608
         $ PUBCHEM.CID: int 648162 6603466 655127 65895
  • 30. Bioinformatics in R
      • While the focus is on cheminformatics, many problems involve bioinformatics to some degree
      • The Bioconductor project provides a wide variety of packages
          • A lot of it is focused on gene expression analysis
          • A number of packages provide access to various biological databases, annotations etc.
      • Protein structure analysis is supported in R via Bio3d
      • Never have to leave the comfort of R
      • http://www.bioconductor.org/
      • Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696
  • 31. Long calculations, big data
      • Many statistical methods require long running calculations
          • Bootstrap
          • Bayesian methods
      • Many problems involve large datasets
      • A common feature of both scenarios is that they can be trivially parallelized
          • As opposed to requiring a parallel version of the underlying algorithm
      • R has good support for both trivial and non-trivial parallelization methods
      • See R/parallel for a package that will parallelize actual R code
      • Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
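A bootstrap is a convenient example of the trivially parallel case: each resample is independent, so the replicates can simply be spread over workers. The sketch below uses the `parallel` package that ships with current R, which is a substitution made here (the talk itself uses snow and R/parallel); the data, the worker count of 2 and the `boot_mean` helper are arbitrary choices for the example.

```r
library(parallel)

set.seed(1)
x <- rnorm(200, mean = 5)   # hypothetical sample

# One bootstrap replicate: resample with replacement and take the mean.
# The index argument i is ignored; it only drives the number of replicates.
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

cl <- makeCluster(2)               # two local worker processes
clusterSetRNGStream(cl, 123)       # independent random streams per worker
reps <- parSapply(cl, 1:1000, boot_mean, data = x)
stopCluster(cl)

# Percentile bootstrap interval for the mean
quantile(reps, c(0.025, 0.975))
```

No parallel version of the bootstrap itself is needed; only the outer loop over replicates is distributed.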
  • 32. Simple parallelization
      • The snow package allows easy use of multiple cores on a single computer or a cluster of computers
      • A simple wrapper over other parallel R libraries
      • Can support PVM, MPI
      • At the very least you can use all the cores on your own machine
      • http://cran.r-project.org/web/packages/snow/
  • 33. Serial code - Feature Selection
      • Rather than use GA, SA etc., just look at all combinations
      • Inelegant, but no worries about missing the global optimum
        x <- matrix(runif(500*40), ncol=40)
        y <- runif(500)
        library(gtools)
        combos <- combinations(40, 3)
        apply(combos, 1, function(z) {
          d <- data.frame(y=y, x=x[,z])
          fit <- lm(y ~ ., data=d)
          cor(y, fit$fitted)^2
        })
  • 34. Simple parallelization - Feature Selection
      • Trivially parallelized
        x <- matrix(runif(500*40), ncol=40)
        y <- runif(500)
        library(gtools)
        combos <- combinations(40, 3)
        library(snow)
        cl <- makeSOCKcluster(2)
        clusterExport(cl, "x")
        clusterExport(cl, "y")
        parApply(cl, combos, 1, function(z) {
          d <- data.frame(y=y, x=x[,z])
          fit <- lm(y ~ ., data=d)
          cor(y, fit$fitted)^2
        })
  • 35. Big data scenarios
      • The idea behind snow can also be used to handle very large datasets
      • Simply chunk the data appropriately and papply over the list of filenames
      • Still requires you to perform chunking and keep track of everything
      • Hadoop is a nice way to avoid all this
          • Throw one or more (very) large files at it, let it deal with chunking and computation
          • For non-trivial file formats, you need to implement a chunker
      • RHIPE provides access to a Hadoop cluster from within R
      • http://hadoop.apache.org/core/
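The manual chunking approach can be sketched in base R as follows: split the data into files, compute a partial result per file, then combine the partial results. The four temporary CSV files and the single column `v` are invented for the example, and plain `lapply` stands in for whichever parallel apply function would be used on a cluster.

```r
set.seed(7)
values <- runif(1000)   # hypothetical dataset, small enough to check the answer

# Chunk the data into four files (in practice the chunks would already exist)
chunk_ids <- rep(1:4, each = 250)
files <- vapply(1:4, function(k) {
  f <- tempfile(fileext = ".csv")
  write.csv(data.frame(v = values[chunk_ids == k]), f, row.names = FALSE)
  f
}, character(1))

# Per-chunk step: each file is read and summarised independently,
# so this lapply could be replaced by a parallel equivalent
partial <- lapply(files, function(f) {
  d <- read.csv(f)
  c(sum = sum(d$v), n = nrow(d))
})

# Combine step: add up the partial sums and counts
totals <- Reduce(`+`, partial)
overall_mean <- totals[["sum"]] / totals[["n"]]

unlink(files)   # clean up the temporary chunk files
overall_mean
```

The bookkeeping visible here (creating chunks, tracking filenames, combining results) is exactly the part that Hadoop takes over.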
  • 36. Summary
      • rcdk successfully integrates cheminformatics functionality into the R environment
      • Related packages provide access to other forms of chemical data (fingerprints) and data sources
      • An excellent environment for chemical and biological data mining