Molecules and
Numbers in R
Rajarshi Guha
Molecules in R
Chemical Data
Crunching Molecules and Numbers in R
Rajarshi Guha
NIH Chemical Genomics Center
238th ACS National Meeting
17th August, 2009
Some background on R
Doing cheminformatics in R
R History
S developed by John Chambers at Bell Labs
S rewritten in C
Licensed to Insightful Corp.
Bought by Insightful Corp. for $2M
Bought by TIBCO for $25M
First public release
Created by Ihaka &
Released under
R 1.0.0
R 2.9
An overview of R
An environment for statistical computation
Wide variety of standard and state-of-the-art statistical
methods built in or accessible via packages
But also a complete, interpreted programming language
Well suited for manipulating and operating on datasets -
numerical, categorical or a mixture - and of varying
Impressive visualization facilities (but not very
An overview of R
Syntax is pretty much S-Plus
Highly cross-platform
Frequent and regular releases, active development by
core group
The dev and user community is extremely active
r-help is not just for learning R, you can get a decent
statistics education from the list!
Used by many top statisticians; many cutting-edge
techniques first show up in R
Default mode is a command-line-like prompt
GUIs available
But learning curve is steep
Does force you to think about the analysis
Not a great tool for casual, once-in-a-while usage
R primitives
Numeric, character, list, matrix, data.frame
> x <- 'Hello World'
> x <- 1
> x <- c(1,2,3,4,5,6)
> x
[1] 1 2 3 4 5 6
> x <- data.frame(MW=runif(5, 10, 50),
+                 hERG=sample(c('active','inactive'),
+                             5, TRUE))
> x
        MW     hERG
1 23.55435   active
2 42.90365 inactive
3 49.35149   active
4 26.85912   active
5 10.01877   active
Matrix oriented programming
Similar in style to Matlab
Easily access (multiple) rows, columns
Vector/matrix indexing is very powerful and key to
efficient R code
Perform operations on entire rows or columns
Makes subsetting a trivial operation
Perfect for QSAR type analyses
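The indexing ideas on this slide can be sketched with base R alone. The descriptor matrix below is made up for illustration (the names `desc`, `MW`, `logP` and `nRot` are assumptions, not taken from the talk); it is only meant to show row/column selection, logical subsetting and whole-column operations.

```r
# Made-up descriptor matrix: 6 molecules (rows) x 3 descriptors (columns)
desc <- matrix(c(120, 250, 310, 180, 420, 95,      # MW
                 1.2, 3.4, 4.1, 2.0, 5.6, 0.3,     # logP
                 2, 5, 7, 3, 9, 1),                # nRot
               ncol = 3,
               dimnames = list(paste0("mol", 1:6), c("MW", "logP", "nRot")))

# Access multiple rows and columns at once
desc[c(1, 3), c("MW", "logP")]

# Logical indexing: molecules with MW below 300 and logP below 3
small <- desc[desc[, "MW"] < 300 & desc[, "logP"] < 3, , drop = FALSE]

# Operate on whole columns: centre and scale every descriptor
scaled <- scale(desc)

# Column-wise summaries without an explicit loop
col.means <- colMeans(desc)
```

The `drop = FALSE` argument keeps the result a matrix even when only one row survives the filter, which tends to matter when the subset is passed on to modeling code.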
Functional style
R’s functional paradigms are closely tied to matrix
apply, lapply, sapply, tapply allow you to easily
operate on groups of objects
Elements of a list
Rows and/or columns of a matrix
Subsets of data, using a grouping variable
Anonymous functions are supported
Use of these functional forms can lead to speedups
compared to traditional for loops
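As a rough illustration of the three cases listed above (elements of a list, rows or columns of a matrix, subsets defined by a grouping variable), here is a small base R sketch. All data values and variable names are invented for the example.

```r
# Made-up data: logP values and an activity class for six molecules
logp  <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
class <- c("active", "inactive", "active", "inactive", "active", "active")

# lapply: apply a function to each element of a list, returns a list
mol.lists <- list(a = 1:3, b = 1:10, c = 1:5)
sizes.list <- lapply(mol.lists, length)

# sapply: same idea, but the result is simplified to a vector
sizes <- sapply(mol.lists, length)

# apply: rows (1) or columns (2) of a matrix, here with an anonymous function
m <- matrix(1:6, nrow = 2)
row.ranges <- apply(m, 1, function(r) max(r) - min(r))

# tapply: subsets of a vector defined by a grouping variable
mean.logp <- tapply(logp, class, mean)
```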
Non-functional style
# column std devs
m <- matrix(runif(100*100), ncol=100)
sds <- numeric(ncol(m))
for (i in 1:ncol(m)) sds[i] <- sd(m[,i])
# mean logP of toxic, non-toxic classes
m <- data.frame(logp=runif(100),
                toxic=sample(c('yes','no'),
                             100, TRUE))
toxLogP <- 0
nontoxLogP <- 0
for (j in 1:nrow(m)) {
  if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
  else nontoxLogP <- nontoxLogP + m[j,1]
}
# turn the accumulated sums into means
toxLogP <- toxLogP / sum(m$toxic == 'yes')
nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
Functional style
# column std devs
m <- matrix(runif(100*100), ncol=100)
apply(m, 2, sd)
# mean logP of toxic, non-toxic classes
m <- data.frame(logp=runif(100),
                toxic=sample(c('yes','no'),
                             100, TRUE))
by(m, m$toxic, function(x) mean(x$logp))
Object oriented style
R supports multiple object oriented mechanisms
Simplest is S3 classes
Object orientation is in terms of function names
Easy to work with, not always flexible enough
S4 classes are much more powerful, but also more
Many problems can ignore these as R primitives provide
sufficient support for attaching meta-data to objects
(crude encapsulation)
Becomes important/useful when writing packages, not
for day to day code
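A minimal sketch of the S3 mechanism and of attaching meta-data through attributes, assuming a hypothetical `assay` class invented purely for this example. It shows how dispatch follows the `generic.class` naming of functions.

```r
# Hypothetical 'assay' class, invented for this sketch.
# Constructor: a plain numeric vector plus meta-data stored as attributes.
assay <- function(values, target, units = "uM") {
  obj <- values
  attr(obj, "target") <- target
  attr(obj, "units")  <- units
  class(obj) <- "assay"
  obj
}

# S3 dispatch is by function name: generic.class
print.assay <- function(x, ...) {
  cat("Assay against", attr(x, "target"),
      "with", length(x), "measurements in", attr(x, "units"), "\n")
  invisible(x)
}

summary.assay <- function(object, ...) {
  c(n = length(object), median = median(as.numeric(object)))
}

ic50 <- assay(c(0.5, 1.2, 10, 3.3), target = "hERG")
print(ic50)          # dispatches to print.assay
s <- summary(ic50)   # dispatches to summary.assay
```

For day to day work the `attr()` calls alone are often enough; the class and its methods start to pay off once the code is shared as a package.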
Interfacing with C & Fortran
R is interpreted, functional forms help a
Very useful to refactor inner loops into C (or
Also useful to provide an R interface to pre-existing C/Fortran code
Can lead to dramatic
[Figure: 5000 pairwise Tanimoto similarity calculations, Macbook Pro; x-axis: bit length (1024, 166, 79)]
R generates publication quality graphics in a variety of
A huge number of statistical visualization methods (2D,
3D, OpenGL)
Extremely powerful display specifications
core commands
lattice (a.k.a trellis graphics)
Based on sound statistical theories
Standard plots are easy to make, but complex
plots do have a learning curve
Interactivity is limited, though some packages do alleviate
Code quality
It’s not enough just to write code
RUnit is a package that supports unit testing, analogous
to JUnit
R comes with a well-defined package structure that can be
automatically checked for various errors
Packages can be uploaded to CRAN which allows any R
user to install them directly from R
Extensive documentation format
Sweave is an important feature which allows one to
include R code and associated text in a single document
- literate programming or reproducible research
The downsides of R
Memory bound (but can use as much memory as you
Language inconsistencies
Indexing starts from 1, but no error if you use 0 as an
See blog posts by Radford Neal (U Toronto)
Debugging environment not so great (though ESS is
good for Emacs users)
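The indexing inconsistency mentioned above is easy to reproduce in base R. The loop at the end is an invented example of how an off-by-one range carried over from a zero-based language can silently give a wrong answer instead of an error.

```r
x <- c(10, 20, 30)

first <- x[1]      # indexing starts at 1
zero  <- x[0]      # no error: silently returns an empty vector
past  <- x[5]      # no error either: returns NA

# A zero-based loop range: x[0] contributes nothing and x[3] is never visited
total <- 0
for (i in 0:2) total <- total + sum(x[i])
```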
Cheminformatics programming
Fundamental requirement is support for core chemical
Representation and manipulation of these concepts
Could implement all of this directly in R - lots of wheels
would be reinvented
We also want such functionality to be R-like
Writing Java or C in R is not R-like
The Chemistry Development Kit
Open source Java library for cheminformatics
Wide variety of functionality
Core chemical concepts (atoms, bonds, molecules)
SMARTS, pharmacophores
Molecular descriptors and fingerprints
2D depictions
Used in a variety of tools, applications and services
Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
rcdk - CDK from R
[Diagram: R Programming Environment, CDK, Jmol]
rcdk Motivations
Have access to cheminformatics functionality from
within R
Support processing of data from chemistry databases
Not reimplement cheminformatics methods
Have access to all of this in idiomatic R
Basic molecular operations - I/O
Read in molecular file formats supported by the CDK
Files can be local or remote
Parse SMILES strings
In contrast to the CDK, rcdk will configure molecules
automatically (unless instructed not to)
The resultant molecule objects are Java references, can
be passed to a variety of rcdk functions
library(rcdk)
mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
mol <- parse.smiles('c1ccccc1CC(=O)')
mols <- sapply(c('CC', 'CCCC', 'CCCNC'),
               parse.smiles)
Basic molecular operations
Given a molecule, we can extract or add properties
Get lists of atoms and bonds and then manipulate them
Currently doesn’t support a lot of molecular graph
# get the atoms from a molecule
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
# get the coordinate matrix of the molecule
coords <- do.call('rbind',
                  lapply(atoms, get.point3d))
Working with fingerprints
rcdk will generate a variety of fingerprints via the CDK
Other packages can generate fingerprints
The fingerprint package supports I/O of fingerprint
data and various similarity operations on fingerprints
Provides an S4 class representing binary fingerprints
library(fingerprint)
m1 <- parse.smiles('c1ccccc1C(COC)N')
m2 <- parse.smiles('C1CCCCC1C(COC)N')
# Calculate fingerprints
fps <- lapply(list(m1, m2),
              get.fingerprint, type='maccs')
distance(fps[[1]], fps[[2]], method='tanimoto')
# Read fingerprints from a file
fps <- fp.read('fp.txt', lf=moe.lf,
               size=166, header=TRUE)
fpsim <- fp.sim.matrix(fps, method='tanimoto')
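To make the similarity measure concrete, the Tanimoto coefficient between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below computes it in base R on logical vectors; it is not the fingerprint package's implementation, and the function name `tanimoto` and the toy 8-bit fingerprints are assumptions made for illustration.

```r
# Tanimoto similarity between two binary fingerprints stored as logical vectors
tanimoto <- function(fp1, fp2) {
  stopifnot(length(fp1) == length(fp2))
  common <- sum(fp1 & fp2)           # bits set in both
  either <- sum(fp1 | fp2)           # bits set in at least one
  if (either == 0) return(NA_real_)  # two empty fingerprints: NA is a choice made here
  common / either
}

# Toy 8-bit fingerprints, made up for illustration
fp.a <- as.logical(c(1, 1, 0, 0, 1, 0, 1, 0))
fp.b <- as.logical(c(1, 0, 0, 0, 1, 1, 1, 0))
fp.c <- as.logical(c(0, 0, 1, 1, 0, 0, 0, 1))

fp.list <- list(a = fp.a, b = fp.b, c = fp.c)

# Pairwise similarity matrix built with nested sapply calls
sim <- sapply(fp.list, function(x) sapply(fp.list, function(y) tanimoto(x, y)))
```

Here `fp.a` and `fp.b` share 3 of the 5 bits set in either, while `fp.a` and `fp.c` have no bits in common.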
rcdk and QSAR
Access to descriptors and fingerprints makes for very
easy QSAR modeling within R
Evaluate the descriptors (individually, by type or all)
Get back a data.frame which can be used as input to
pretty much any modeling method
mols <- load.molecules('big.sdf')
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
str(descs)
'data.frame': 467 obs. of 180 variables:
 $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
 $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
 $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ...
 $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
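Once descriptors are in a data.frame, the modeling step needs nothing beyond standard R. The sketch below uses simulated values in place of real descriptor output so that it runs without rcdk; the column names echo the listing above, but the numbers, the synthetic activity and the 70/30 split are all assumptions made for illustration.

```r
set.seed(123)

# Stand-in for a descriptor data.frame: random values only
n <- 100
descs <- data.frame(ATSc1 = rnorm(n),
                    ATSc2 = rnorm(n),
                    ATSc3 = rnorm(n))

# Hypothetical activity that depends on two of the descriptors plus noise
activity <- 2 * descs$ATSc1 - 1 * descs$ATSc3 + rnorm(n, sd = 0.1)

# Drop columns with missing values or no variance before modeling
keep <- sapply(descs, function(col) !any(is.na(col)) && sd(col) > 0)
descs <- descs[, keep, drop = FALSE]

# Fit a linear model on a training split, predict the held-out rows
train <- sample(n, 70)
d <- cbind(descs, activity = activity)
fit <- lm(activity ~ ., data = d[train, ])
pred <- predict(fit, newdata = d[-train, ])
r2 <- cor(pred, d$activity[-train])^2
```

Swapping `lm` for another modeling function usually only changes the fitting line, since most R modeling functions accept a data.frame and a formula.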
Viewing molecules
While numerical modeling is a fundamental task in this
environment, visualization is also important
Either view structures of individual molecules or tables of
structure and data
rcdk supports both (not very well on OS X)
mol <- parse.smiles('c1ccccc1C(N)CC')
view.molecule.2d(mol)
smiles <- c("CCC", "CCN", "CCN(C)(C)",
            "c1ccccc1Cc1ccccc1")
mols <- sapply(smiles, parse.smiles)
view.molecule.2d(mols)
Downsides to rcdk
Can’t save state of Java objects
Doesn’t take advantage of S4 classes to provide R-side
representations of CDK classes
Incomplete coverage of the CDK API - sometimes need
to go down to rJava to perform an operation
Big datasets are problematic (mainly due to R
Access to chemical databases
Useful to be able to transparently access data from
various public data sources
PubChem compound and assays are supported via
Compound access is primarily by CID, while assay data
can be obtained from keyword searches
End up with a data.frame containing all relevant assay
information (along with meta-data as attributes)
R can also easily access arbitrary RDBMSs (Postgres,
MySQL, Oracle)
Access to PubChem
> dat <- get.cids(1:30)
> str(dat)
'data.frame': 30 obs. of 11 variables:
 $ CID : chr "1" "2" "3" "4" ...
 $ IUPACName : chr "3-acetyloxy-4-(trimethylaz
 $ CanonicalSmile : chr "CC(=O)OC(CC(=O)[O-])C[N+](
 $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H
 $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
> find.assay.id('LDR')
[1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
> adat <- get.assay(990)
> str(adat)
'data.frame': 51 obs. of 9 variables:
 $ PUBCHEM.SID : int 845800 848472 852502 857608
 $ PUBCHEM.CID : int 648162 6603466 655127 65895
Bioinformatics in R
While the focus is on cheminformatics, many problems
involve bioinformatics to some degree
The Bioconductor project provides a wide variety of
A lot of it focused on gene expression analysis
A number of packages provide access to various
biological databases, annotations etc
Protein structure analysis is supported in R via Bio3d
Never have to leave the comfort of R
Grant, B. et al, Bioinformatics, 2006, 22, 2695–2696
Long calculations, big data
Many statistical methods require long running
Bayesian methods
Many problems involve large datasets
A common feature to both scenarios is that they can be
trivially parallelized
As opposed to requiring a parallel version of the underlying
R has good support for both trivial and non-trivial
parallelization methods
See R/parallel for a package that will parallelize
actual R code
Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
Simple parallelization
The snow package allows easy use of multiple cores on a
single computer or a cluster of computers
A simple wrapper over other parallel R libraries
Can support PVM, MPI
At the very least you can use all the cores on your own
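As a sketch of the "use all your cores" case, the example below uses the `parallel` package that ships with current R and offers a snow-style cluster interface; it is a stand-in chosen so the example has no extra dependencies, not the code from the talk. The bootstrap task, the function name `boot.mean` and the data are invented for illustration.

```r
library(parallel)   # bundled with R; provides a snow-style cluster interface

# Hypothetical long-running task: one bootstrap estimate of a mean.
# Seeding inside the function makes each task reproducible on any worker.
boot.mean <- function(seed, values) {
  set.seed(seed)
  mean(sample(values, length(values), replace = TRUE))
}

obs <- c(2.1, 3.5, 4.0, 5.2, 6.8, 7.1, 8.4, 9.9)

cl <- makeCluster(2)                               # two local worker processes
res.par <- parSapply(cl, 1:20, boot.mean, values = obs)
stopCluster(cl)

# Same computation done serially, for comparison
res.ser <- sapply(1:20, boot.mean, values = obs)
```

The extra argument is deliberately not called `x`, `X` or `fun`, since those names are used internally by the cluster apply functions and would clash.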
Serial code - Feature Selection
Rather than use GA, SA etc, just look at all
Inelegant, but no worries about missing the global
library(gtools)  # provides combinations()
x <- matrix(runif(500*40), ncol=40)
y <- runif(500)
combos <- combinations(40, 3)
apply(combos, 1, function(z) {
  d <- data.frame(y=y, x=x[,z])
  fit <- lm(y~., data=d)
  cor(y, fit$fitted)^2
})
Simple parallelization - Feature Selection
Trivially parallelized
library(gtools)  # provides combinations()
library(snow)
x <- matrix(runif(500*40), ncol=40)
y <- runif(500)
combos <- combinations(40, 3)
cl <- makeSOCKcluster(2)
clusterExport(cl, "x")
clusterExport(cl, "y")
parApply(cl, combos, 1, function(z) {
  d <- data.frame(y=y, x=x[,z])
  fit <- lm(y~., data=d)
  cor(y, fit$fitted)^2
})
stopCluster(cl)
Big data scenarios
The idea behind snow can also be used to handle very
large datasets
Simply chunk the data appropriately and papply over
the list of filenames
Still requires you to perform chunking and keep track of
Hadoop is a nice way to avoid all this
Throw one or more (very) large files at it, let it deal with
chunking and computation
For non-trivial file formats, you need to implement a
RHIPE provides access to a Hadoop cluster from within R
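The manual chunking approach can be sketched in base R as follows: split the data into files, apply a per-file function over the list of filenames, then combine the small partial results. The dataset, file layout and the function name `summarize.chunk` are made up for illustration, and the apply step is kept serial here so the example has no dependencies.

```r
# Write a made-up dataset to four chunk files in a temporary directory
chunk.dir <- tempfile("chunks")
dir.create(chunk.dir)
full <- data.frame(id = 1:1000, mw = seq(100, 599.5, by = 0.5))
pieces <- split(full, rep(1:4, each = 250))
files <- sapply(names(pieces), function(k) {
  f <- file.path(chunk.dir, paste0("chunk", k, ".csv"))
  write.csv(pieces[[k]], f, row.names = FALSE)
  f
})

# Per-chunk work: read one file, return only a small summary
summarize.chunk <- function(f) {
  d <- read.csv(f)
  c(n = nrow(d), total = sum(d$mw))
}

# Serial version; with a cluster this lapply could become a parallel apply
parts <- lapply(files, summarize.chunk)

# Combine the partial results into an overall mean
combined <- Reduce(`+`, parts)
overall.mean <- combined[["total"]] / combined[["n"]]
unlink(chunk.dir, recursive = TRUE)
```

Only one chunk is held in memory at a time, at the price of the bookkeeping this slide mentions: choosing chunk sizes, tracking filenames and writing the combine step by hand.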
rcdk successfully integrates cheminformatics
functionality into the R environment
Related packages provide access to other forms of
chemical data (fingerprints) and data sources
An excellent environment for chemical and biological
data mining

More Related Content

What's hot

An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
Kamel Mansouri
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
Kamel Mansouri
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Kamel Mansouri
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
Ann-Marie Roche
How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
Technology for Drug Discovery Research Productivity
Technology for Drug Discovery Research ProductivityTechnology for Drug Discovery Research Productivity
Technology for Drug Discovery Research Productivity
Yogesh Wagh
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL Project
Basics of QSAR Modeling
Basics of QSAR ModelingBasics of QSAR Modeling
Basics of QSAR Modeling
Prachi Pradeep
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
Andrew McEachran
Paper presentation @IPAW'08
Paper presentation @IPAW'08Paper presentation @IPAW'08
Paper presentation @IPAW'08
Paolo Missier
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
Dr. Haxel Consult
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...
Kamel Mansouri

What's hot (20)

An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
Technology for Drug Discovery Research Productivity
Technology for Drug Discovery Research ProductivityTechnology for Drug Discovery Research Productivity
Technology for Drug Discovery Research Productivity
Chemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet EraChemistry data: Distortion and dissemination in the Internet Era
Chemistry data: Distortion and dissemination in the Internet Era
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL Project
Basics of QSAR Modeling
Basics of QSAR ModelingBasics of QSAR Modeling
Basics of QSAR Modeling
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
The EPA CompTox Dashboard as a Data Integration Hub for Environmental Chemist...
Paper presentation @IPAW'08
Paper presentation @IPAW'08Paper presentation @IPAW'08
Paper presentation @IPAW'08
Resume Or
Resume OrResume Or
Resume Or
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
ICIC 2017: Looking at the gift horse: pros and cons of over 20 million patent...
In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...In-silico structure activity relationship study of toxicity endpoints by QSAR...
In-silico structure activity relationship study of toxicity endpoints by QSAR...

Viewers also liked

R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}Rajarshi Guha
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Rajarshi Guha
The Trans-NIH RNAi Initiative : Informatics
The Trans-NIH RNAi Initiative: InformaticsThe Trans-NIH RNAi Initiative: Informatics
The Trans-NIH RNAi Initiative : InformaticsRajarshi Guha
The smaller sukhavati vyuha
The smaller sukhavati vyuhaThe smaller sukhavati vyuha
The smaller sukhavati vyuha
Lin Zhang Sheng
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & RRajarshi Guha
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research DatabaseRajarshi Guha
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp course
Codes & Tiny Houses
Codes & Tiny HousesCodes & Tiny Houses
Codes & Tiny Houses
Historic Shed
Why are we still doing industrial age drug
Why are we still doing industrial age drugWhy are we still doing industrial age drug
Why are we still doing industrial age drugSean Ekins
Haapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminarHaapsalu Kolledži valikseminar
Haapsalu Kolledži valikseminar
Jüri Kaljundi
A Writing Group Strategy for Scientists
A Writing Group Strategy for ScientistsA Writing Group Strategy for Scientists
A Writing Group Strategy for Scientists
Pintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenesPintxo banderilla olmeda origenes
Pintxo banderilla olmeda origenes
Olmeda Orígenes
Eit orginal
Eit orginalEit orginal
Eit orginal
Plans for Creative Writing
Plans for Creative WritingPlans for Creative Writing
Plans for Creative WritingFatheha Rahman
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Safety: A Communicator's Guide to Improving Understanding (Chinese version)
Food Insight
ILASCD - Student-Centered Leadership
ILASCD - Student-Centered LeadershipILASCD - Student-Centered Leadership
ILASCD - Student-Centered Leadership
PJ Caposey
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
DAS SOTI Presented by Nextmark: What We Love, Hate and Desire in Our Digital ...
Centro Fuensanta Valencia. Departamento Hospital General
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Sean Ekins
MEMS sensor catalog with I2C
MEMS sensor catalog with I2CMEMS sensor catalog with I2C
MEMS sensor catalog with I2C
Akira Sasaki

Viewers also liked (20)

R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
R & CDK: A Sturdy Platform in the Oceans of Chemical Data}
Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...Characterization and visualization of compound combination responses in a hig...
Characterization and visualization of compound combination responses in a hig...
The Trans-NIH RNAi Initiative : Informatics
The Trans-NIH RNAi Initiative: InformaticsThe Trans-NIH RNAi Initiative: Informatics
The Trans-NIH RNAi Initiative : Informatics
The smaller sukhavati vyuha
The smaller sukhavati vyuhaThe smaller sukhavati vyuha
The smaller sukhavati vyuha
Robots, Small Molecules & R
Robots, Small Molecules & RRobots, Small Molecules & R
Robots, Small Molecules & R
The BioAssay Research Database
The BioAssay Research DatabaseThe BioAssay Research Database
The BioAssay Research Database
Uram ecp course
Uram ecp courseUram ecp course
Uram ecp course
Crunching Molecules and Numbers in R

  • 1. Crunching Molecules and Numbers in R Rajarshi Guha Background Molecules in R Chemical Data Parallel Paradigms Crunching Molecules and Numbers in R Rajarshi Guha NIH Chemical Genomics Center 238th ACS National Meeting 17th August, 2009
  • 2. Outline: Some background on R. Doing cheminformatics in R.
  • 3. R History.
      S timeline: 1976, S developed by John Chambers at Bell Labs; 1988, S rewritten in C; 1993, licensed to Insightful Corp.; 2004, bought by Insightful Corp. for $2M; 2008, bought by TIBCO for $25M.
      R timeline: 1991, created by Ihaka & Gentleman; 1993, first public release; 1995, released under GPL; 2000, R 1.0.0; 2009, R 2.9.
  • 4. An overview of R: An environment for statistical computation. Wide variety of standard and state-of-the-art statistical methods built in or accessible via packages. But also a complete, interpreted programming language. Well suited for manipulating and operating on datasets (numerical, categorical or a mixture) of varying shape. Impressive visualization facilities (but not very interactive).
  • 5. An overview of R: Syntax is pretty much S-Plus. Highly cross-platform. Frequent and regular releases, active development by a core group. The developer and user communities are extremely active. r-help is not just for learning R; you can get a decent statistics education from the list! Used by many top statisticians; many cutting-edge techniques first show up in R.
  • 6. Usability: Default mode is a command-line-like prompt. GUIs are available, but the learning curve is steep. Does force you to think about the analysis. Not a great tool for casual, once-in-a-while usage.
  • 7. R primitives: Numeric, character, list, matrix, data.frame

        > x <- 'Hello World'
        > x <- 1
        > x <- c(1,2,3,4,5,6)
        > x
        [1] 1 2 3 4 5 6
        > x <- data.frame(MW=runif(5, 10, 50),
        +                 hERG=sample(c('active','inactive'), 5, TRUE))
        > x
                MW     hERG
        1 23.55435   active
        2 42.90365 inactive
        3 49.35149   active
        4 26.85912   active
        5 10.01877   active
  • 8. Matrix oriented programming: Similar in style to Matlab. Easily access (multiple) rows and columns. Vector/matrix indexing is very powerful and key to efficient R code. Perform operations on entire rows or columns. Makes subsetting a trivial operation. Perfect for QSAR-type analyses.
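
The indexing and subsetting points above can be illustrated with a small sketch. The descriptor matrix, its column names and the activity labels are made up for illustration and are not from the talk.

```r
# Hypothetical descriptor matrix: 6 molecules x 3 descriptors (random values)
set.seed(1)
desc <- matrix(runif(18), nrow = 6,
               dimnames = list(paste0("mol", 1:6), c("MW", "logP", "TPSA")))
activity <- c(1, 0, 1, 1, 0, 0)   # made-up class labels, one per molecule

one.row   <- desc[2, ]                 # all descriptors for mol2
two.cols  <- desc[, c("MW", "logP")]   # several columns selected by name
actives   <- desc[activity == 1, ]     # logical subsetting of rows
col.means <- colMeans(desc)            # operate on every column at once
scaled    <- scale(desc)               # centre and scale each column
```

A logical vector as a row index is what makes subsetting a one-liner here: no explicit loop over molecules is needed to pull out the actives.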
  • 9. Functional style: R’s functional paradigms are closely tied to matrix operations. apply, lapply, sapply and tapply allow you to easily operate on groups of objects: elements of a list; rows and/or columns of a matrix; subsets of data, using a grouping variable. Anonymous functions are supported. Use of these functional forms can lead to speedups compared to traditional for loops.
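
As a rough sketch of the three kinds of grouping listed above (list elements, matrix rows or columns, subsets defined by a grouping variable), using only base R; all data values here are invented for illustration:

```r
# Elements of a list: hypothetical "on" bit positions for three fingerprints
fps <- list(a = c(1, 5, 9), b = c(2, 5), c = integer(0))
n.bits <- sapply(fps, length)            # one value per list element

# lapply with an anonymous function always returns a list
squares <- lapply(1:3, function(i) i^2)

# Rows of a matrix (margin 1 = rows, margin 2 = columns)
m <- matrix(1:6, nrow = 2)
row.sums <- apply(m, 1, sum)

# Subsets of data defined by a grouping variable
logp <- c(1.2, 3.4, 2.2, 0.4)
toxic <- c("yes", "no", "yes", "no")
class.means <- tapply(logp, toxic, mean)
```

sapply simplifies its result to a vector where it can, while lapply keeps a list, which is the safer choice when the per-element results may differ in length.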
  • 10. Non-functional style

        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        sds <- numeric(ncol(m))
        for (i in 1:ncol(m)) sds[i] <- sd(m[,i])

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes','no'), 100, TRUE))
        toxLogP <- 0
        nontoxLogP <- 0
        for (j in 1:nrow(m)) {
          if (m[j,2] == 'yes') toxLogP <- toxLogP + m[j,1]
          else nontoxLogP <- nontoxLogP + m[j,1]
        }
        # the loop accumulates sums; divide by the class sizes to get the means
        toxLogP <- toxLogP / sum(m$toxic == 'yes')
        nontoxLogP <- nontoxLogP / sum(m$toxic == 'no')
  • 11. Functional style

        # column std devs
        m <- matrix(runif(100*100), ncol=100)
        apply(m, 2, sd)

        # mean logP of toxic, non-toxic classes
        m <- data.frame(logp=runif(100),
                        toxic=sample(c('yes','no'), 100, TRUE))
        by(m, m$toxic, function(x) mean(x$logp))
  • 12. Object oriented style: R supports multiple object-oriented mechanisms. Simplest is S3 classes: object orientation is in terms of function names; easy to work with, not always flexible enough. S4 classes are much more powerful, but also more complex. Many problems can ignore these, as R primitives provide sufficient support for attaching metadata to objects (crude encapsulation). Becomes important/useful when writing packages, not for day-to-day code.
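
A minimal sketch of what "object orientation in terms of function names" means for S3, and of attaching metadata with attributes. The class name `assay`, its fields and the values are hypothetical, chosen only for illustration.

```r
# Constructor for a hypothetical S3 class called "assay"
assay <- function(name, values) {
  structure(list(name = name, values = values), class = "assay")
}

# S3 dispatch is by naming convention: <generic>.<class>
print.assay <- function(x, ...) {
  cat("Assay", x$name, "with", length(x$values), "measurements\n")
  invisible(x)
}
summary.assay <- function(object, ...) {
  c(n = length(object$values), mean = mean(object$values))
}

a <- assay("hERG", c(0.2, 0.4, 0.9))
print(a)          # dispatches to print.assay
s <- summary(a)   # dispatches to summary.assay

# Crude encapsulation without any class: metadata as an attribute
ic50 <- c(0.2, 0.4, 0.9)
attr(ic50, "units") <- "uM"
```

The attribute approach at the end is often enough for day-to-day scripts; the class only starts to pay off once several functions need to agree on what the object contains.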
  • 13. Interfacing with C & Fortran: R is interpreted; functional forms help a bit. Very useful to refactor inner loops into C (or Fortran). Also useful to provide an R interface to pre-existing C/Fortran code. Can lead to dramatic speedups.
      [Figure: speedup versus fingerprint bit length (79, 166, 1024) for 5000 pairwise Tanimoto similarity calculations; MacBook Pro, 2 GHz, 1 GB RAM.]
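
For context, a pure-R version of a pairwise Tanimoto calculation might look like the sketch below. This is an assumption about the kind of inner loop worth moving to C, not the code benchmarked in the talk; the function name `tanimoto` and the random 166-bit fingerprints are illustrative only.

```r
# Tanimoto similarity between two binary fingerprints (logical vectors):
# bits set in both, divided by bits set in either
tanimoto <- function(x, y) {
  stopifnot(length(x) == length(y))
  either <- sum(x | y)
  if (either == 0) return(NA_real_)   # undefined when no bits are set
  sum(x & y) / either
}

# Five random 166-bit fingerprints, one per row (made-up data)
set.seed(42)
fps <- matrix(sample(c(TRUE, FALSE), 5 * 166, replace = TRUE), nrow = 5)

# The nested loop is the part that is slow in interpreted code
n <- nrow(fps)
sim <- diag(n)
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    sim[i, j] <- sim[j, i] <- tanimoto(fps[i, ], fps[j, ])
  }
}
```

The number of pairs grows quadratically with the number of fingerprints, so the double loop dominates the run time and is the natural candidate for a compiled routine.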
  • 14. Visualization: R generates publication-quality graphics in a variety of formats. A huge number of statistical visualization methods (2D, 3D, OpenGL). Extremely powerful display specifications: core commands and lattice (a.k.a. trellis graphics). Based on sound statistical theories. Standard plots are easy to make, but complex plots do have a learning curve. Interactivity is limited, though some packages do alleviate this.
  • 15. Code quality: It’s not enough to just write code. RUnit is a package that supports unit testing, analogous to JUnit. R comes with a well-defined package structure that can be automatically checked for various errors. Packages can be uploaded to CRAN, which allows any R user to install them directly from R. Extensive documentation format. Sweave is an important feature which allows one to include R code and associated text in a single document (literate programming or reproducible research).
  • 16. The downsides of R: Memory bound (but can use as much memory as you have). Language inconsistencies: indexing starts from 1, but there is no error if you use 0 as an index; see blog posts by Radford Neal (U Toronto). Debugging environment not so great (though ESS is good for Emacs users).
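
The zero-index behaviour mentioned above is easy to reproduce. A short sketch with made-up values:

```r
x <- c(10, 20, 30)

first <- x[1]   # indexing starts at 1
zero  <- x[0]   # no error: returns an empty numeric vector
x[0] <- 99      # no error either: nothing is assigned
```

Because neither line raises an error, an off-by-one bug carried over from a 0-based language can pass silently and only show up later as an unexpectedly empty result.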
  • 17. Cheminformatics programming: Fundamental requirement is support for core chemical concepts, including representation and manipulation of these concepts. Flexibility. Could implement all of this directly in R, but lots of wheels would be reinvented. We also want such functionality to be R-like. Writing Java or C in R is not R-like.
  • 18. The Chemistry Development Kit
      • Open source Java library for cheminformatics
      • Wide variety of functionality
          • Core chemical concepts (atoms, bonds, molecules)
          • SMARTS, pharmacophores
          • Molecular descriptors and fingerprints
          • 2D depictions
      • Used in a variety of tools, applications and services
      • Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2110–2120
  • 19. rcdk - CDK from R
      • [Diagram: components within the R programming environment: rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]
  • 20. rcdk Motivations
      • Have access to cheminformatics functionality from within R
      • Support processing of data from chemistry databases
      • Not reimplement cheminformatics methods
      • Have access to all of this in idiomatic R
  • 21. Basic molecular operations - I/O
      • Read in molecular file formats supported by the CDK
      • Files can be local or remote
      • Parse SMILES strings
      • In contrast to the CDK, rcdk will configure molecules automatically (unless instructed not to)
      • The resultant molecule objects are Java references and can be passed to a variety of rcdk functions

        mols <- load.molecules(c('abc.sdf', 'xyz.smi'))
        mol <- parse.smiles('c1ccccc1CC(=O)')
        mols <- sapply(c('CC', 'CCCC', 'CCCNC'), parse.smiles)
  • 22. Basic molecular operations
      • Given a molecule, we can extract or add properties
      • Get lists of atoms and bonds and then manipulate them
      • Currently doesn't support a lot of molecular graph operations

        # get the atoms from a molecule
        mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
        atoms <- get.atoms(mol)
        # get the coordinate matrix of the molecule
        coords <- do.call('rbind', lapply(atoms, get.point3d))
  • 23. Working with fingerprints
      • rcdk will generate a variety of fingerprints via the CDK
      • Other packages can generate fingerprints
      • The fingerprint package supports I/O of fingerprint data and various similarity operations on fingerprints
      • Provides an S4 class representing binary fingerprints

        m1 <- parse.smiles('c1ccccc1C(COC)N')
        m2 <- parse.smiles('C1CCCCC1C(COC)N')
        # Calculate fingerprints
        fps <- lapply(list(m1, m2), get.fingerprint, type='maccs')
        distance(fps[[1]], fps[[2]], method='tanimoto')
        # Read fingerprints from a file and build a similarity matrix
        fps <- fp.read('fp.txt', lf=moe.lf, size=166, header=TRUE)
        fpsim <- fp.sim.matrix(fps, method='tanimoto')
  • 24. rcdk and QSAR
      • [Diagram: molecular descriptors, machine learning, property]
  • 25. rcdk and QSAR
      • Access to descriptors and fingerprints makes for very easy QSAR modeling within R
      • Evaluate the descriptors (individually, by type or all)
      • Get back a data.frame which can be used as input to pretty much any modeling method

        mols <- load.molecules('big.sdf')
        dnames <- get.desc.names('topological')
        descs <- eval.desc(mols, dnames)
        str(descs)
        'data.frame': 467 obs. of 180 variables:
         $ ATSc1 : num 0.28 0.279 0.279 0.217 0.479 ...
         $ ATSc2 : num -0.0777 -0.0851 -0.0845 -0.0587 -0.2356 ...
         $ ATSc3 : num -0.05803 -0.04706 -0.04616 -0.0519 0.00129 ..
         $ ATSc4 : num -0.00906 0.00279 -0.01147 0.00241 0.00856 ...
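To illustrate the last step, a descriptor data.frame going straight into a modeling function, here is a minimal sketch that runs without rcdk. The `descs` table and the `activity` response are simulated stand-ins for the output of `eval.desc` and a measured property; the column names `d1` to `d3` are hypothetical.

```r
# Simulated descriptor table: 100 molecules, 3 descriptors (stand-in data).
set.seed(42)
descs <- data.frame(d1 = rnorm(100), d2 = rnorm(100), d3 = rnorm(100))

# Simulated response that depends on two of the descriptors plus noise.
activity <- 2 * descs$d1 - descs$d2 + rnorm(100, sd = 0.5)

# Drop constant columns, which carry no information for a linear model.
keep <- sapply(descs, function(col) sd(col) > 0)
descs <- descs[, keep, drop = FALSE]

# Fit an ordinary least squares model on all remaining descriptors.
qsar <- data.frame(activity = activity, descs)
fit <- lm(activity ~ ., data = qsar)
r2 <- summary(fit)$r.squared
```

Because the input is an ordinary data.frame, `lm` could be replaced by another modeling function that accepts a formula and a data argument without changing the preparation steps.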
  • 26. Viewing molecules
      • While numerical modeling is a fundamental task in this environment, visualization is also important
      • Either view structures of individual molecules or tables of structure and data
      • rcdk supports both (not very well on OS X)

        mol <- parse.smiles('c1ccccc1C(N)CC')
        view.molecule.2d(mol)
        smiles <- c("CCC", "CCN", "CCN(C)(C)", "c1ccccc1Cc1ccccc1",
                    "C1CCC1CC(CN(C)(C))CC(=O)CC")
        mols <- sapply(smiles, parse.smiles)
        view.molecule.2d(mols)
  • 27. Downsides to rcdk
      • Can't save state of Java objects
      • Doesn't take advantage of S4 classes to provide R-side representations of CDK classes
      • Incomplete coverage of the CDK API; sometimes need to go down to rJava to perform an operation
      • Big datasets are problematic (mainly due to R limitations)
  • 28. Access to chemical databases
      • Useful to be able to transparently access data from various public data sources
      • PubChem compounds and assays are supported via rpubchem
      • Compound access is primarily by CID, while assay data can be obtained from keyword searches
      • End up with a data.frame containing all relevant assay information (along with metadata as attributes)
      • R can also easily access arbitrary RDBMSs (Postgres, MySQL, Oracle)
  • 29. Access to PubChem

        > dat <- get.cids(1:30)
        'data.frame': 30 obs. of 11 variables:
         $ CID : chr "1" "2" "3" "4" ...
         $ IUPACName : chr "3-acetyloxy-4-(trimethylaz
         $ CanonicalSmile : chr "CC(=O)OC(CC(=O)[O-])C[N+](
         $ MolecularFormula : chr "C9H17NO4" "C9H18NO4+" "C7H
         $ MolecularWeight : num 203.2 204.2 156.1 75.1 169.
        > find.assay.id('LDR')
        [1] 990 1035 1036 1037 1038 1039 1041 1042 1043 1653 1865
        > adat <- get.assay(990)
        > str(adat)
        'data.frame': 51 obs. of 9 variables:
         $ PUBCHEM.SID : int 845800 848472 852502 857608
         $ PUBCHEM.CID : int 648162 6603466 655127 65895
  • 30. Bioinformatics in R
      • While the focus is on cheminformatics, many problems involve bioinformatics to some degree
      • The Bioconductor project provides a wide variety of packages
      • A lot of it is focused on gene expression analysis
      • A number of packages provide access to various biological databases, annotations, etc.
      • Protein structure analysis is supported in R via Bio3d
      • Never have to leave the comfort of R
      • Grant, B. et al., Bioinformatics, 2006, 22, 2695–2696
  • 31. Long calculations, big data
      • Many statistical methods require long-running calculations
          • Bootstrap
          • Bayesian methods
      • Many problems involve large datasets
      • A common feature of both scenarios is that they can be trivially parallelized
      • This is as opposed to requiring a parallel version of the underlying algorithm
      • R has good support for both trivial and non-trivial parallelization methods
      • See R/parallel for a package that will parallelize actual R code
      • Vera, G. et al., BMC Bioinformatics, 2008, 9, 390
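The bootstrap is a convenient example of why such calculations are trivially parallel: every replicate is independent of the others. A minimal serial sketch on simulated data, estimating the standard error of a mean:

```r
# Simulated sample of 200 measurements.
set.seed(7)
x <- rnorm(200, mean = 5)

# Each replicate resamples x with replacement and recomputes the statistic.
# No replicate depends on any other, so the 1000 calls could be spread
# across cores or machines without changing the function body.
boot_means <- sapply(seq_len(1000), function(i) {
  mean(sample(x, replace = TRUE))
})

# Bootstrap estimate of the standard error of the mean.
se <- sd(boot_means)
```

Only the outer `sapply` would need to change to distribute the work; the resampling function itself stays as it is.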
  • 32. Simple parallelization
      • The snow package allows easy use of multiple cores on a single computer or a cluster of computers
      • A simple wrapper over other parallel R libraries
      • Can support PVM, MPI
      • At the very least you can use all the cores on your own machine
  • 33. Serial code - Feature Selection
      • Rather than use GA, SA, etc., just look at all combinations
      • Inelegant, but no worries about missing the global optimum

        x <- matrix(runif(500*40), ncol=40)
        y <- runif(500)
        library(gtools)
        combos <- combinations(40, 3)
        apply(combos, 1, function(z) {
          d <- data.frame(y=y, x=x[,z])
          fit <- lm(y~., data=d)
          cor(y, fit$fitted)^2
        })
  • 34. Simple parallelization - Feature Selection
      • Trivially parallelized

        x <- matrix(runif(500*40), ncol=40)
        y <- runif(500)
        library(gtools)
        combos <- combinations(40, 3)
        library(snow)
        cl <- makeSOCKcluster(2)
        clusterExport(cl, "x")
        clusterExport(cl, "y")
        parApply(cl, combos, 1, function(z) {
          d <- data.frame(y=y, x=x[,z])
          fit <- lm(y~., data=d)
          cor(y, fit$fitted)^2
        })
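The same pattern can be sketched with the `parallel` package, which has been bundled with R since version 2.14 (after this talk) and includes snow-style functions such as `makeCluster` and `parApply`. This is an adaptation under assumptions, not the slide's code: the problem is shrunk to 10 candidate features so it runs quickly, base R's `combn` replaces `gtools::combinations`, and `x` and `y` are passed as arguments rather than exported. The helper name `r2_for` is hypothetical.

```r
library(parallel)

# Simulated data: 200 observations, 10 candidate features.
set.seed(1)
x <- matrix(runif(200 * 10), ncol = 10)
y <- runif(200)

# All 3-feature subsets, one subset per column (3 x 120 matrix).
combos <- combn(10, 3)

# R^2 of a linear model built on the given feature columns.
r2_for <- function(cols, x, y) {
  d <- data.frame(y = y, x = x[, cols])
  fit <- lm(y ~ ., data = d)
  cor(y, fitted(fit))^2
}

# Serial reference result.
serial <- apply(combos, 2, r2_for, x = x, y = y)

# Parallel run on two local worker processes.
cl <- makeCluster(2)
par_res <- parApply(cl, combos, 2, r2_for, x = x, y = y)
stopCluster(cl)   # always release the workers

# Feature subset with the highest R^2.
best <- combos[, which.max(par_res)]
```

Passing the data through the function arguments keeps the helper self-contained; `clusterExport`, as used on the slide, is the alternative when the function refers to variables in the global workspace.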
  • 35. Big data scenarios
      • The idea behind snow can also be used to handle very large datasets
      • Simply chunk the data appropriately and papply over the list of filenames
      • Still requires you to perform chunking and keep track of everything
      • Hadoop is a nice way to avoid all this
      • Throw one or more (very) large files at it, let it deal with chunking and computation
      • For non-trivial file formats, you need to implement a chunker
      • RHIPE provides access to a Hadoop cluster from within R
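The manual chunking approach can be sketched in a few lines: split the data into files, apply a function to each filename, then combine the partial results. The example below is a serial stand-in using `lapply` on simulated values written to temporary files; the helper `process_chunk` is hypothetical, and a parallel apply over the same filename vector would follow the same shape.

```r
# Simulated dataset, split into 4 chunks of 250 values each.
set.seed(3)
values <- runif(1000)
chunk_id <- rep(1:4, each = 250)

# Write each chunk to its own temporary file and keep the filenames.
files <- sapply(split(values, chunk_id), function(chunk) {
  f <- tempfile(fileext = ".txt")
  writeLines(format(chunk, digits = 15), f)
  f
})

# Per-chunk work: read one file and return a small summary.
process_chunk <- function(f) {
  v <- as.numeric(readLines(f))
  c(n = length(v), total = sum(v))
}

# Apply over the list of filenames, then merge the partial summaries.
partials <- lapply(files, process_chunk)
combined <- Reduce(`+`, partials)
overall_mean <- combined[["total"]] / combined[["n"]]

unlink(files)   # clean up the temporary chunk files
```

The bookkeeping visible here (naming files, tracking chunks, merging results) is the part that a Hadoop-based workflow is meant to take over.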
  • 36. Summary
      • rcdk successfully integrates cheminformatics functionality into the R environment
      • Related packages provide access to other forms of chemical data (fingerprints) and data sources
      • An excellent environment for chemical and biological data mining