R Analytics in the Cloud

Published in: Technology, Education
  • Check Segue, LISP, R, circle

    1. R Analytics in the Cloud
    2. Introduction: Radek Maciaszek
       • DataMine Lab (www.dataminelab.com): data mining, business intelligence and data warehouse consultancy.
       • MSc in Bioinformatics at Birkbeck, University of London.
       • Project at the UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.
    3. Primer in Bioinformatics
       • Bioinformatics: applying computer science to biology (DNA, proteins, drug discovery, etc.).
       • Ageing strategy: solve the problem in a simple organism and apply the findings to more complex organisms (e.g. humans).
       • Goal: find the genes responsible for ageing.
       • Caenorhabditis elegans
    4. Central dogma of molecular biology
       • Genes are encoded by the DNA.
       • Microarray (100 x 100).
       • Database of 50 curated experiments.
       • 10k genes compared to each other.
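
To give a rough sense of scale for "10k genes compared to each other", the arithmetic can be sketched in R. This assumes exactly 10,000 genes, which the slide only gives as a round figure; the variable names are illustrative.

```r
n.genes <- 10000  # "10k genes" from the slide, taken as a round number (assumption)

# Unordered pairs of distinct genes: n * (n - 1) / 2.
n.pairs <- choose(n.genes, 2)

# Ordered pairs including self-comparisons (every gene against every gene).
n.ordered <- n.genes^2

format(n.pairs, big.mark = ",")
format(n.ordered, big.mark = ",", scientific = FALSE)
```

Either count runs to tens of millions of correlation tests, which is why the talk treats this as a big-computation problem rather than a big-data one.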
    5. Why R?
       • Very popular in bioinformatics.
       • Functional scripting language.
       • A Swiss-army knife for statisticians.
       • Designed by statisticians for statisticians.
       • Lots of ready-to-use packages (CRAN).
    6. R limitations & Hadoop
       • Data needs to fit in memory.
       • Single-threaded.
       • Hadoop integration:
         • Hadoop Streaming
         • Rhipe: http://ml.stat.purdue.edu/rhipe/
         • Segue: http://code.google.com/p/segue/
    7. 7. Segue Works with Amazon Elastic MapReduce. Creates a cluster for you. Designed for Big Computations (rather than Big Data) Implements a cloud version of lapply() function. 7
    8. Segue workflow (emrlapply)
       [Diagram labels: List (local), List (remote), Amazon AWS]
    9. R: a very quick example
       > m <- list(a = 1:10, b = exp(-3:3))
       > lapply(m, mean)
       $a
       [1] 5.5

       $b
       [1] 4.535125

       lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
    10. Segue: large-scale example
        # For one probe, test its correlation against every probe in the matrix.
        AnalysePearsonCorelation <- function(probe) {
          A.vector <- experiments.matrix[probe, ]
          p.values <- c()
          for (probe.name in rownames(experiments.matrix)) {
            B.vector <- experiments.matrix[probe.name, ]
            p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
          }
          return(p.values)
        }

        # 'probes' are the RNA probes.
        pearson.cor <- lapply(probes, AnalysePearsonCorelation)

        Moving to the cloud in 3 lines of code!
    11. Segue: large-scale example (continued)
        # For one probe, test its correlation against every probe in the matrix.
        AnalysePearsonCorelation <- function(probe) {
          A.vector <- experiments.matrix[probe, ]
          p.values <- c()
          for (probe.name in rownames(experiments.matrix)) {
            B.vector <- experiments.matrix[probe.name, ]
            p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
          }
          return(p.values)
        }

        # Local version, replaced by the three Segue calls below:
        # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
        myCluster <- createCluster(numInstances = 5,
                                   masterBidPrice = "0.68",
                                   slaveBidPrice = "0.68",
                                   masterInstanceType = "c1.xlarge",
                                   slaveInstanceType = "c1.xlarge",
                                   copy.image = TRUE)
        pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
        stopCluster(myCluster)
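
The slides do not show the data, so the following is a minimal local sketch on made-up numbers rather than the author's dataset. The toy `experiments.matrix` (6 probes by 50 experiments of random values) and the probe names are assumptions for illustration. The function keeps the slide's name but builds the result with vapply instead of growing a vector in a loop, which is a substitution made here, not something the slides do.

```r
set.seed(42)

# Hypothetical toy data: 6 probes (rows) measured in 50 experiments (columns).
experiments.matrix <- matrix(rnorm(6 * 50), nrow = 6,
                             dimnames = list(paste0("probe", 1:6), NULL))
probes <- rownames(experiments.matrix)

# Same idea as the slide's function: p-values of one probe against all probes.
# vapply is used instead of appending to a vector inside a for loop.
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)

# One row per probe, one column per probe it was compared against.
p.matrix <- do.call(rbind, pearson.cor)
rownames(p.matrix) <- probes
dim(p.matrix)
```

Running this locally on a small matrix is a cheap way to check the per-probe function before paying for cluster time, since only the lapply call needs to change when the job moves to the cloud.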
    12. Discovering genes
        • Topomaps of clustered genes.
        • This work was based on a similar approach to: Stuart K. Kim et al., "A Gene Expression Map for Caenorhabditis elegans", Science 293, 2087 (2001).
    13. Conclusions
        • R is great for statistics.
        • It's easy to scale up R using Segue.
        • We are all going to live very long.
    14. Thanks! Questions?
        References:
        • http://code.google.com/r/radek-segue/
        • http://www.dataminelab.com
