R Analytics in the Cloud

Introduction
• Radek Maciaszek
• DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.
• MSc in Bioinformatics at Birkbeck, University of London.
• Project at the UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.


Primer in Bioinformatics
• Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.)
• Ageing strategy: solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).
• Goal: find genes responsible for ageing

Caenorhabditis elegans

Central dogma of molecular biology

Genes are encoded by the DNA.
Microarray (100 x 100)
• Database of 50 curated experiments.
• 10k genes compared to each other.

Why R?
• Very popular in bioinformatics
• Functional scripting language
• A Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)


R limitations & Hadoop
• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/


Segue
• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.
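
To make the "Big Computations rather than Big Data" point concrete, here is a rough back-of-the-envelope sketch. The figures of 10k genes and 50 experiments come from the earlier slide; treating each experiment as one matrix column, and timing cor.test() on random data, are assumptions made only for illustration.

```r
n.genes <- 10000        # from the talk: roughly 10k genes
n.experiments <- 50     # assumption: one column per curated experiment

# Memory needed for the expression matrix (8 bytes per double), in megabytes
matrix.mb <- n.genes * n.experiments * 8 / 1e6

# Number of pairwise correlation tests (every gene against every gene)
n.tests <- n.genes^2

# Time a small sample of cor.test() calls and extrapolate to all pairs
set.seed(1)
x <- rnorm(n.experiments)
y <- rnorm(n.experiments)
n.sample <- 200
elapsed <- system.time(
  for (i in seq_len(n.sample)) cor.test(x, y)
)[["elapsed"]]
est.hours <- elapsed / n.sample * n.tests / 3600

cat(sprintf("Matrix size: %.0f MB\n", matrix.mb))
cat(sprintf("Pairwise tests: %.0e\n", n.tests))
cat(sprintf("Estimated single-core time: %.1f hours\n", est.hours))
```

The input is only a few megabytes, while the number of tests is in the hundreds of millions, so the bottleneck is CPU time rather than data volume.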


Segue workflow (emrlapply)

[Diagram: List (local), List (remote), Amazon AWS]

A very quick R example
> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5
$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X,
each element of which is the result of applying FUN to
the corresponding element of X.
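
Because each element is processed independently of the others, an lapply() call can be split across machines without changing its result. A minimal sketch with a user-defined function follows; the names `summarise` and `res` are illustrative and do not come from the talk.

```r
# Same list as in the quick example
m <- list(a = 1:10, b = exp(-3:3))

# Hypothetical helper: returns several statistics for one element
summarise <- function(x) {
  c(mean = mean(x), sd = sd(x), n = length(x))
}

res <- lapply(m, summarise)

# The result keeps the length and the names of the input list
length(res) == length(m)
names(res)

# Processing one element on its own gives the same answer as lapply(),
# which is what makes the call easy to distribute
identical(res$a, summarise(m$a))
```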

Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    # Expression profile of the given probe across all experiments
    A.vector <- experiments.matrix[probe, ]
    p.values <- c()
    # Test the probe against every probe in the matrix
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name, ]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return(p.values)
  }
> # probes: the RNA probes to analyse
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)
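
The function above can be tried locally on a small synthetic matrix before any cluster is involved. The sketch below is an assumption-laden stand-in: the random `experiments.matrix`, its 20 x 50 size and the probe names are invented for illustration, and the loop is swapped for vapply(), which avoids growing the result vector with c().

```r
set.seed(42)

# Hypothetical stand-in data: 20 probes (rows) x 50 experiments (columns)
experiments.matrix <- matrix(
  rnorm(20 * 50), nrow = 20,
  dimnames = list(paste0("probe", 1:20), NULL)
)
probes <- rownames(experiments.matrix)

AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  # One p-value per probe; vapply() preallocates a numeric result
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)

# Bind the per-probe vectors into a probes x probes matrix of p-values
p.matrix <- do.call(rbind, pearson.cor)
rownames(p.matrix) <- probes
dim(p.matrix)
```

Each list element depends only on its own probe, so the same call shape carries over unchanged when lapply() is replaced by a distributed equivalent.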

Moving to the cloud in 3 lines of code!


Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    # Expression profile of the given probe across all experiments
    A.vector <- experiments.matrix[probe, ]
    p.values <- c()
    # Test the probe against every probe in the matrix
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name, ]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return(p.values)
  }
> # probes: the RNA probes to analyse
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> # The three Segue calls: start the cluster, run lapply remotely, shut the cluster down
> myCluster <- createCluster(numInstances=5, masterBidPrice="0.68",
    slaveBidPrice="0.68", masterInstanceType="c1.xlarge",
    slaveInstanceType="c1.xlarge", copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)

Discovering genes

Topomaps of clustered genes
This work was based on a similar approach to:
A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al.,
Science 293, 2087 (2001)
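
The slides do not show how the genes were grouped before being drawn as a topomap. As a loose illustration only, the sketch below uses hierarchical clustering on a correlation-based distance, which is a swapped-in, commonly used technique and not necessarily the method used in this project or in the cited paper. The data and all names are synthetic and hypothetical.

```r
set.seed(7)

# Hypothetical data: two groups of 10 co-expressed genes over 50 experiments
base1 <- rnorm(50)
base2 <- rnorm(50)
expr <- rbind(
  t(replicate(10, base1 + rnorm(50, sd = 0.3))),
  t(replicate(10, base2 + rnorm(50, sd = 0.3)))
)
rownames(expr) <- paste0("gene", 1:20)

# Distance between genes: 1 - Pearson correlation of their profiles
gene.dist <- as.dist(1 - cor(t(expr)))

# Average-linkage hierarchical clustering, cut into two groups
tree <- hclust(gene.dist, method = "average")
groups <- cutree(tree, k = 2)
table(groups)
```

Genes whose expression profiles rise and fall together end up in the same group, which is the intuition behind looking for clusters of genes related to ageing.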

Conclusions
• R is great for statistics.
• It’s easy to scale up R using Segue.
• We are all going to live very long.


Thanks!
• Questions?

• References:
http://code.google.com/r/radek-segue/
http://www.dataminelab.com
