Crunching Molecules and Numbers in R - Presentation Transcript
Crunching Molecules and Numbers in R
Rajarshi Guha
NIH Chemical Genomics Center
238th ACS National Meeting
17th August, 2009
Sections: Background, Molecules in R, Chemical Data, Parallel Paradigms

Outline
Some background on R
Doing cheminformatics in R

R History
1976: S developed by John Chambers at Bell Labs
1988: S rewritten in C
1991: R created by Ihaka & Gentleman
1993: S licensed to Insightful Corp.
1993: First public release of R
1995: R released under GPL
2000: R 1.0.0
2004: S bought by Insightful Corp. for $2M
2008: Insightful bought by TIBCO for $25M
2009: R 2.9

An overview of R
An environment for statistical computation
Wide variety of standard and state of the art statistical methods built in or accessible via packages
But also a complete, interpreted programming language
Well suited for manipulating and operating on datasets (numerical, categorical or a mixture) of varying shape
Impressive visualization facilities (but not very interactive)

An overview of R
Syntax is pretty much S-Plus
Highly cross-platform
Frequent and regular releases, active development by core group
The developer and user communities are extremely active
r-help is not just for learning R, you can get a decent statistics education from the list!
Used by many top statisticians, many cutting edge techniques first show up in R

Usability
Default mode is a command-line-like prompt
GUIs are available
But the learning curve is steep
Does force you to think about the analysis
Not a great tool for casual, once-in-a-while usage

R primitives
Numeric, character, list, matrix, data.frame
    > x <- 'Hello World'
    > x <- 1
    > x <- c(1, 2, 3, 4, 5, 6)
    > x
    [1] 1 2 3 4 5 6
    > x <- data.frame(MW = runif(5, 10, 50),
    +                 hERG = sample(c('active', 'inactive'),
    +                               5, TRUE))
    > x
            MW     hERG
    1 23.55435   active
    2 42.90365 inactive
    3 49.35149   active
    4 26.85912   active
    5 10.01877   active
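
The slide lists list and matrix among the primitives but does not show them. A minimal sketch of both follows; the variable names and values are made up for illustration.

```r
# A list can hold elements of different types
cmpd <- list(name = "benzene", mw = 78.11, aromatic = TRUE)
cmpd$mw            # access an element by name

# A matrix is a vector with dimensions; all elements share one type
m <- matrix(1:6, nrow = 2)   # filled column by column
dim(m)             # number of rows, number of columns
m[2, 3]            # element in row 2, column 3
```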

Matrix oriented programming
Similar in style to Matlab
Easily access (multiple) rows, columns
Vector/matrix indexing is very powerful and key to efficient R code
Perform operations on entire rows or columns
Makes subsetting a trivial operation
Perfect for QSAR type analyses
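
As a rough illustration of the indexing and subsetting points above, here is a sketch on a simulated data.frame. The column names and cutoffs are arbitrary assumptions, not taken from the talk.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
d <- data.frame(MW   = runif(10, 100, 500),
                logP = runif(10, -2, 6),
                hERG = sample(c("active", "inactive"), 10, TRUE))

d[1:3, ]                               # first three rows
d[, c("MW", "logP")]                   # two columns, selected by name
d$logP * 2                             # operate on an entire column at once
actives <- d[d$hERG == "active", ]     # subsetting with a logical vector
heavy   <- d[d$MW > 300 & d$logP < 5, ]  # combine conditions with &
```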

Functional style
R’s functional paradigms are closely tied to matrix operations
apply, lapply, sapply, tapply allow you to easily operate on groups of objects
  Elements of a list
  Rows and/or columns of a matrix
  Subsets of data, using a grouping variable
Anonymous functions are supported
Use of these functional forms can lead to speedups compared to traditional for loops
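
A small sketch of the three cases listed above (list elements, matrix rows and columns, grouped subsets) plus an anonymous function. All data here are toy values chosen for illustration.

```r
# Elements of a list: lapply returns a list, sapply simplifies to a vector
smiles <- list("CC", "CCCC", "CCCNC")
lens_list <- lapply(smiles, nchar)
lens <- sapply(smiles, nchar)

# Rows or columns of a matrix: MARGIN 1 is rows, 2 is columns
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
col_max  <- apply(m, 2, max)

# Subsets of data defined by a grouping variable
logp  <- c(1.2, 3.4, 2.2, 0.4)
toxic <- c("yes", "no", "yes", "no")
class_means <- tapply(logp, toxic, mean)

# An anonymous function defined inline
ranges <- apply(m, 2, function(col) max(col) - min(col))
```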

Non-functional style
    # column std devs
    m <- matrix(runif(100 * 100), ncol = 100)
    sds <- numeric(ncol(m))
    for (i in 1:ncol(m)) sds[i] <- sd(m[, i])

    # mean logP of toxic, non-toxic classes
    m <- data.frame(logp = runif(100),
                    toxic = sample(c('yes', 'no'),
                                   100, TRUE))
    toxLogP <- 0; nTox <- 0
    nontoxLogP <- 0; nNontox <- 0
    for (j in 1:nrow(m)) {
      if (m[j, 2] == 'yes') {
        toxLogP <- toxLogP + m[j, 1]
        nTox <- nTox + 1
      } else {
        nontoxLogP <- nontoxLogP + m[j, 1]
        nNontox <- nNontox + 1
      }
    }
    # divide the running sums by the class counts to get the means
    toxLogP <- toxLogP / nTox
    nontoxLogP <- nontoxLogP / nNontox

Functional style
    # column std devs
    m <- matrix(runif(100 * 100), ncol = 100)
    apply(m, 2, sd)

    # mean logP of toxic, non-toxic classes
    m <- data.frame(logp = runif(100),
                    toxic = sample(c('yes', 'no'),
                                   100, TRUE))
    by(m, m$toxic, function(x) mean(x$logp))

Object oriented style
R supports multiple object oriented mechanisms
Simplest is S3 classes
  Object orientation is in terms of function names
  Easy to work with, not always flexible enough
S4 classes are much more powerful, but also more complex
Many problems can ignore these as R primitives provide sufficient support for attaching meta-data to objects (crude encapsulation)
Becomes important/useful when writing packages, not for day to day code
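
To make the S3 versus S4 distinction concrete, here is a minimal sketch. The class names, the weight generic and the values are hypothetical and chosen only for illustration; they are not part of any package mentioned in the talk.

```r
# S3: the class is just an attribute; dispatch happens on function names
mol <- structure(list(smiles = "CCO", mw = 46.07), class = "molecule")
print.molecule <- function(x, ...) {
  cat("Molecule", x$smiles, "MW", x$mw, "\n")
  invisible(x)
}
print(mol)   # dispatches to print.molecule

# S4: formal class definition with typed slots and explicit generics
library(methods)
setClass("Molecule", representation(smiles = "character", mw = "numeric"))
setGeneric("weight", function(object) standardGeneric("weight"))
setMethod("weight", "Molecule", function(object) object@mw)
m4 <- new("Molecule", smiles = "CCO", mw = 46.07)
weight(m4)

# Crude encapsulation without classes: attach meta-data as an attribute
x <- c(1.2, 3.4)
attr(x, "units") <- "logP"
```

The S3 version needs nothing more than a naming convention, which is why it is easy to work with; the S4 version costs more typing but rejects a slot of the wrong type when the object is created.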

Interfacing with C & Fortran
R is interpreted, functional forms help a bit
Very useful to refactor inner loops into C (or Fortran)
Also useful to provide an R interface to pre-existing C/Fortran code
Can lead to dramatic speedups
[Figure: speedup (axis 0 to 50) versus bit length (1024, 166, 79) for 5000 pairwise Tanimoto similarity calculations; Macbook Pro, 2GHz, 1GB RAM]

Visualization
R generates publication quality graphics in a variety of formats
A huge number of statistical visualization methods (2D, 3D, OpenGL)
Extremely powerful display specifications
  core commands
  lattice (a.k.a. trellis graphics)
Based on sound statistical theories
Standard plots are easy to make, but complex plots do have a learning curve
Interactivity is limited, though some packages do alleviate this

Code quality
It’s not enough just to write code
RUnit is a package that supports unit testing, analogous to JUnit
R comes with a well defined package structure that can be automatically checked for various errors
Packages can be uploaded to CRAN, which allows any R user to install them directly from R
Extensive documentation format
Sweave is an important feature which allows one to include R code and associated text in a single document (literate programming or reproducible research)

The downsides of R
Memory bound (but can use as much memory as you have)
Language inconsistencies
  Indexing starts from 1, but no error if you use 0 as an index
  See blog posts by Radford Neal (U Toronto)
Debugging environment not so great (though ESS is good for Emacs users)
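
The zero-index behaviour mentioned above is easy to see at the prompt. A short sketch with made-up values:

```r
x <- c(10, 20, 30)
x[1]          # first element
x[0]          # no error: returns a zero-length vector
length(x[0])
x[0] <- 99    # also no error, and x is left unchanged
x
```

An off-by-one bug in an index calculation can therefore pass silently instead of failing loudly.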

Cheminformatics programming
Fundamental requirement is support for core chemical concepts
Representation and manipulation of these concepts
Flexibility
Could implement all of this directly in R, but lots of wheels would be reinvented
We also want such functionality to be R-like
Writing Java or C in R is not R-like

The Chemistry Development Kit
Open source Java library for cheminformatics
Wide variety of functionality
  Core chemical concepts (atoms, bonds, molecules)
  SMARTS, pharmacophores
  Molecular descriptors and fingerprints
  2D depictions
Used in a variety of tools, applications and services
Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120

rcdk - CDK from R
[Diagram: software stack. Top: rcdk. Middle: CDK, Jmol, rpubchem. Lower: rJava, fingerprint, XML. Base: R Programming Environment]

rcdk Motivations
Have access to cheminformatics functionality from within R
Support processing of data from chemistry databases
Not reimplement cheminformatics methods
Have access to all of this in idiomatic R

Basic molecular operations - I/O
Read in molecular file formats supported by the CDK
Files can be local or remote
Parse SMILES strings
In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
The resultant molecule objects are Java references and can be passed to a variety of rcdk functions
    mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
    mol <- parse.smiles('c1ccccc1CC(=O)')
    mols <- sapply(c('CC', 'CCCC', 'CCCNC'),
                   parse.smiles)

Basic molecular operations
Given a molecule, we can extract or add properties
Get lists of atoms and bonds and then manipulate them
Currently doesn’t support a lot of molecular graph operations
    # get the atoms from a molecule
    mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
    atoms <- get.atoms(mol)
    # get the coordinate matrix of the molecule
    coords <- do.call('rbind',
                      lapply(atoms, get.point3d))

Working with fingerprints
rcdk will generate a variety of fingerprints via the CDK
Other packages can generate fingerprints
The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
Provides an S4 class representing binary fingerprints
    m1 <- parse.smiles('c1ccccc1C(COC)N')
    m2 <- parse.smiles('C1CCCCC1C(COC)N')
    # Calculate fingerprints
    fps <- lapply(list(m1, m2),
                  get.fingerprint, type = 'maccs')
    distance(fps[[1]], fps[[2]], method = 'tanimoto')
    fps <- fp.read('fp.txt', lf = moe.lf,
                   size = 166, header = TRUE)
    fpsim <- fp.sim.matrix(fps, method = 'tanimoto')
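
For readers unfamiliar with the Tanimoto measure used above, here is a base R sketch of the arithmetic. The tanimoto helper and the bit positions are hypothetical; this is not how the fingerprint package implements it, only the standard definition (shared on-bits divided by total distinct on-bits).

```r
# Hypothetical helper: each fingerprint is a vector of unique 'on' bit positions
tanimoto <- function(a, b) {
  common <- length(intersect(a, b))
  common / (length(a) + length(b) - common)
}

fp1 <- c(1, 5, 9, 42, 100)
fp2 <- c(1, 5, 10, 42, 166)
tanimoto(fp1, fp2)   # 3 shared bits out of 7 distinct bits
tanimoto(fp1, fp1)   # identical fingerprints give 1
```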

rcdk and QSAR
[Diagram: Molecular Descriptors, Machine Learning, Property]

rcdk and QSAR
Access to descriptors and fingerprints makes for very easy QSAR modeling within R
Evaluate the descriptors (individually, by type or all)
Get back a data.frame which can be used as input to pretty much any modeling method
    mols <- load.molecules('big.sdf')
    dnames <- get.desc.names('topological')
    descs <- eval.desc(mols, dnames)
    str(descs)
    'data.frame': 467 obs. of 180 variables:
     $ ATSc1: num 0.28 0.279 0.279 0.217 0.479 ...
     $ ATSc2: num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
     $ ATSc3: num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ...
     $ ATSc4: num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
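
Once descriptors are in a data.frame, fitting a model is a few lines. The sketch below uses simulated numbers in place of real descriptors so that it runs without rcdk; the column names d1 to d3 and the activity values are invented, and lm stands in for whichever modeling method is preferred.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
# Simulated stand-in for a descriptor data.frame (column names are made up)
descs <- data.frame(d1 = rnorm(50), d2 = rnorm(50), d3 = rnorm(50))
# Simulated property that depends on two of the descriptors plus noise
activity <- 2 * descs$d1 - descs$d3 + rnorm(50, sd = 0.1)

d <- cbind(descs, activity = activity)
fit <- lm(activity ~ ., data = d)      # regress on all descriptor columns
summary(fit)$r.squared
predict(fit, newdata = d[1:5, ])       # predictions for the first five rows
```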

Viewing molecules
While numerical modeling is a fundamental task in this environment, visualization is also important
Either view structures of individual molecules or tables of structure and data
rcdk supports both (not very well on OS X)
    mol <- parse.smiles('c1ccccc1C(N)CC')
    view.molecule.2d(mol)
    smiles <- c("CCC", "CCN", "CCN(C)(C)",
                "c1ccccc1Cc1ccccc1",
                "C1CCC1CC(CN(C)(C))CC(=O)CC")
    mols <- sapply(smiles, parse.smiles)
    view.molecule.2d(mols)

Downsides to rcdk
Can’t save state of Java objects
Doesn’t take advantage of S4 classes to provide R-side representations of CDK classes
Incomplete coverage of the CDK API: sometimes need to go down to rJava to perform an operation
Big datasets are problematic (mainly due to R limitations)

Access to chemical databases
Useful to be able to transparently access data from various public data sources
PubChem compounds and assays are supported via rpubchem
Compound access is primarily by CID, while assay data can be obtained from keyword searches
End up with a data.frame containing all relevant assay information (along with meta-data as attributes)
R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)

Access to PubChem
    > dat <- get.cids(1:30)
    > str(dat)
    'data.frame': 30 obs. of 11 variables:
     $ CID             : chr "1" "2" "3" "4" ...
     $ IUPACName       : chr "3-acetyloxy-4-(trimethylaz
     $ CanonicalSmile  : chr "CC(=O)OC(CC(=O)[O-])C[N+](
     $ MolecularFormula: chr "C9H17NO4" "C9H18NO4+" "C7H
     $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
    > find.assay.id('LDR')
    [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
    > adat <- get.assay(990)
    > str(adat)
    'data.frame': 51 obs. of 9 variables:
     $ PUBCHEM.SID: int 845800 848472 852502 857608
     $ PUBCHEM.CID: int 648162 6603466 655127 65895

Bioinformatics in R
While the focus is on cheminformatics, many problems involve bioinformatics to some degree
The Bioconductor project provides a wide variety of packages
A lot of it focused on gene expression analysis
A number of packages provide access to various biological databases, annotations etc.
Protein structure analysis is supported in R via Bio3d
Never have to leave the comfort of R
http://www.bioconductor.org/
Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696

Long calculations, big data
Many statistical methods require long running calculations
  Bootstrap
  Bayesian methods
Many problems involve large datasets
A common feature of both scenarios is that they can be trivially parallelized
  As opposed to requiring a parallel version of the underlying algorithm
R has good support for both trivial and non-trivial parallelization methods
See R/parallel for a package that will parallelize actual R code
Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
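
The bootstrap is a good example of a trivially parallel calculation: every replicate is independent of the others. A serial sketch on simulated data follows; the one_rep helper, the data and the replicate count are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed so the sketch is repeatable
x <- runif(200)
y <- 3 * x + rnorm(200, sd = 0.5)

# One bootstrap replicate: resample rows with replacement, refit, keep the slope
one_rep <- function(i) {
  idx <- sample(length(x), replace = TRUE)
  coef(lm(y[idx] ~ x[idx]))[2]
}

slopes <- sapply(1:500, one_rep)
quantile(slopes, c(0.025, 0.975))   # percentile interval for the slope
```

Because one_rep never looks at the result of another replicate, the sapply call is the only line that would need to change to spread the work over several cores.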

Simple parallelization
The snow package allows easy use of multiple cores on a single computer or a cluster of computers
A simple wrapper over other parallel R libraries
Can support PVM, MPI
At the very least you can use all the cores on your own machine
http://cran.r-project.org/web/packages/snow/

Serial code - Feature Selection
Rather than use GA, SA etc., just look at all combinations
Inelegant, but no worries about missing the global optimum
    x <- matrix(runif(500 * 40), ncol = 40)
    y <- runif(500)
    library(gtools)
    combos <- combinations(40, 3)
    apply(combos, 1, function(z) {
      d <- data.frame(y = y, x = x[, z])
      fit <- lm(y ~ ., data = d)
      cor(y, fit$fitted)^2
    })

Simple parallelization - Feature Selection
Trivially parallelized
    x <- matrix(runif(500 * 40), ncol = 40)
    y <- runif(500)
    library(gtools)
    combos <- combinations(40, 3)
    library(snow)
    cl <- makeSOCKcluster(2)
    clusterExport(cl, "x")
    clusterExport(cl, "y")
    parApply(cl, combos, 1, function(z) {
      d <- data.frame(y = y, x = x[, z])
      fit <- lm(y ~ ., data = d)
      cor(y, fit$fitted)^2
    })

Big data scenarios
The idea behind snow can also be used to handle very large datasets
Simply chunk the data appropriately and papply over the list of filenames
Still requires you to perform chunking and keep track of everything
Hadoop is a nice way to avoid all this
Throw one or more (very) large files at it, let it deal with chunking and computation
For non-trivial file formats, you need to implement a chunker
RHIPE provides access to a Hadoop cluster from within R
http://hadoop.apache.org/core/
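
The manual chunking approach described above can be sketched in base R. The file names, chunk sizes and the summary computed here are all made up; a serial lapply stands in for the parallel version.

```r
# Write a toy dataset out as several chunk files (stand-ins for a large dataset)
dir <- tempfile("chunks")
dir.create(dir)
values <- 1:1000
chunks <- split(values, rep(1:4, each = 250))
for (i in seq_along(chunks)) {
  writeLines(as.character(chunks[[i]]),
             file.path(dir, paste0("chunk", i, ".txt")))
}

# Process each file independently, then combine the partial results
files <- list.files(dir, full.names = TRUE)
partial <- lapply(files, function(f) {
  v <- as.numeric(readLines(f))
  c(sum = sum(v), n = length(v))
})
totals <- Reduce(`+`, partial)
totals[["sum"]] / totals[["n"]]   # overall mean from per-chunk summaries
```

Since each file is handled on its own, the lapply call could be swapped for snow's parLapply(cl, files, ...) once a cluster cl has been created. The bookkeeping (writing chunks, listing files, merging results) is the part that Hadoop takes over.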

Summary
rcdk successfully integrates cheminformatics functionality into the R environment
Related packages provide access to other forms of chemical data (fingerprints) and data sources
An excellent environment for chemical and biological data mining