Running R on the Amazon Cloud

Using Amazon EC2, you can crunch massive data on servers with large amounts of RAM and CPU without paying for the hardware.

Comments

  • Ian, thanks for the pointer. I will invest some time in makePSOCKcluster. Right now I am struggling to make rmr run on EMR.
  • It is possible to launch multiple instances of the AMI. I have done it. Then you create the cluster with the makePSOCKcluster function and pass a vector of multiple hostnames as the hosts argument.
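A minimal sketch of the multi-instance approach described in the comment above; the hostnames are placeholders, and passwordless SSH access from the master instance to each worker instance (each with R installed) is assumed:

    library(parallel)

    # Public DNS names of the running EC2 instances (placeholders)
    hosts <- c("ec2-instance-1.compute-1.amazonaws.com",
               "ec2-instance-2.compute-1.amazonaws.com")

    # One socket worker per hostname; repeat a hostname to start
    # several workers on the same instance
    cl <- makePSOCKcluster(hosts)
    clusterCall(cl, function() Sys.info()[["nodename"]])
    stopCluster(cl)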
  • From Louis' AMI it appears that one can use just one instance at a time. It may not be possible to run parallel computations. If the computing need is bigger, then just pick a bigger instance. Am I right in saying that?

Running R on the Amazon Cloud: Presentation Transcript

  • Running R on the Amazon Cloud. Ian Cook, Raleigh-Durham-Chapel Hill R Users Group, June 20, 2013
  • Why?
    • Some R jobs are RAM- and CPU-intensive
    • Powerful hardware is expensive to buy
    • Institutional cluster compute resources can be difficult to procure and to use
    • Amazon Web Services (AWS) provides a fast, cheap, and easy way to use computational resources in the cloud
    • AWS offers a free usage tier that you can use to try this: http://aws.amazon.com/free/
  • What is AWS?
    • A collection of cloud computing services, billed based on usage
    • The best-known AWS service is Amazon Elastic Compute Cloud (EC2), which provides scalable virtual private servers
    • Other AWS services include Elastic MapReduce (EMR), a hosted Hadoop service, and Simple Storage Service (S3), for online file storage
  • How much RAM/CPU can I use on EC2?
    • Up to 32 virtual CPU cores per instance
    • Up to 244 GB RAM per instance
    • Can distribute a task across multiple instances
    • Can resize instances (start small, grow as needed)
    • Instance details at http://aws.amazon.com/ec2/instance-types/instance-details/
    • Pricing at http://aws.amazon.com/ec2/pricing/#on-demand
  • When not to use AWS?
    • It is often cheaper, easier, and more elegant to use tools and techniques that make your R code less RAM- and CPU-intensive (see the sketch after this list):
      – The R package bigmemory allows analysis of datasets larger than available RAM: http://www.bigmemory.org/
      – The R package data.table enables faster operations on large data: http://cran.r-project.org/web/packages/data.table/index.html
      – Good R programming techniques (e.g. vectorization) can make your code run drastically faster on just one CPU core: http://www.noamross.net/blog/2013/4/25/faster-talk.html
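For example, a minimal data.table sketch (the data here is simulated purely for illustration) of the kind of fast grouped aggregation the package enables:

    library(data.table)

    # Simulated example: 10 million rows with a grouping key
    set.seed(1)
    dt <- data.table(group = sample(letters, 1e7, replace = TRUE),
                     value = rnorm(1e7))

    # Grouped mean; data.table computes this far faster than the
    # equivalent base-R aggregate() or tapply() at this scale
    dt[, .(mean_value = mean(value)), by = group]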
  • More ways to speed up R code
    • Rewrite key functions in C++ for much improved performance, and use Dirk Eddelbuettel's Rcpp package to embed the C++ code in your R program (a sketch follows):
      – http://dirk.eddelbuettel.com/code/rcpp.html
      – https://github.com/hadley/devtools/wiki/Rcpp
    • Radford Neal's pqR is a faster version of R: http://radfordneal.wordpress.com/2013/06/22/announcing-pqr-a-faster-version-of-r/
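A minimal Rcpp sketch, assuming Rcpp and a C++ compiler toolchain are installed (this toy function is illustrative, not from the slides):

    library(Rcpp)

    # cppFunction() compiles the C++ source and binds it as an R function
    cppFunction('
    double sumC(NumericVector x) {
      double total = 0;
      for (int i = 0; i < x.size(); ++i) total += x[i];
      return total;
    }')

    sumC(c(1, 2, 3))  # 6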
  • Free Commercial R Distributions
    • Two (very different) commercial distributions of R are freely available. Both offer much improved performance vs. plain R in many cases:
      – Revolution R, an enhanced distribution of open source R with an IDE: http://www.revolutionanalytics.com/products/revolution-r.php
      – TIBCO Enterprise Runtime for R, a high-performance R-compatible statistical engine: http://spotfire.tibco.com/en/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-enterprise-runtime-for-r-terr.aspx
  • RStudio Server AMIs
    • Louis Aslett maintains a set of Amazon Machine Images (AMIs) available for anyone to use
    • These AMIs include the latest versions of R and RStudio Server on Ubuntu
    • They make it very fast and easy to use R on EC2
    • Thanks, Louis!
  • Launch an EC2 Instance
    • Sign up for an AWS account at https://portal.aws.amazon.com/gp/aws/developer/registration/index.html
    • Go to http://www.louisaslett.com/RStudio_AMI/ and click the AMI for your region (US East, Virginia)
    • Complete the process to launch the instance:
      – Choose instance type t1.micro for the free usage tier
      – Open port 80, and optionally port 22 (to use SSH)
      – After you finish, the instance may take about 5 minutes to launch
  • Use RStudio on the EC2 Instance
    • Copy the "Public DNS" for your EC2 instance into your web browser address field (e.g. ec2-xx-xx-xxx-xxx.compute-1.amazonaws.com)
    • Log in with username rstudio and password rstudio and start using RStudio
    • Remember to stop your instance when finished
    • Video instructions at http://www.louisaslett.com/RStudio_AMI/video_guide.html
  • How to use all those CPU cores?
    • The R package parallel enables some tasks in R to run in parallel across multiple CPU cores (see the sketch below)
      – This is explicit parallelism; the task must be parallelizable
      – The CPU cores can be on one machine or spread across multiple machines
    • The parallel package has been included directly in R since version 2.14.0. It derives from the two R packages snow and multicore.
    • http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
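A minimal sketch of explicit parallelism with the parallel package; the Monte Carlo task here is a toy stand-in, not from the slides:

    library(parallel)

    # Toy CPU-bound task: Monte Carlo estimate of pi from n random points
    estimate_pi <- function(n) {
      x <- runif(n)
      y <- runif(n)
      4 * mean(x^2 + y^2 <= 1)
    }

    # One chunk of work per local core; to span multiple machines,
    # use makePSOCKcluster() with a vector of hostnames instead
    cl <- makeCluster(detectCores())
    estimates <- parLapply(cl, rep(1e6, detectCores()), estimate_pi)
    mean(unlist(estimates))  # combined estimate of pi
    stopCluster(cl)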
  • Example: Parallel numerical integration
    • Calculate the volume under a three-dimensional function
    • Adapted from the example in Appendix B, part 4 of "State of the Art in Parallel Computing with R," Schmidberger, Morgan, Eddelbuettel, Yu, Tierney, and Mansmann, Journal of Statistical Software, August 2009, Volume 31, Issue 1: http://www.jstatsoft.org/v31/i01/
      [Slide shows a wireframe plot of the function, with axes x, y, and z]
    • Note that the paper by Schmidberger et al. was written before the package parallel was included in R. The examples in the paper use other packages, including snow, that were precursors of the package parallel.
  • Example: Parallel numerical integration
    Define a three-dimensional function and limits on its domain:

        func <- function(x, y) x^3 - 3*x + y^3 - 3*y
        xint <- c(-1, 2)
        yint <- c(-1, 2)

    Plot a figure of the function:

        library(lattice)
        g <- expand.grid(x = seq(xint[1], xint[2], 0.1),
                         y = seq(yint[1], yint[2], 0.1))
        g$z <- func(g$x, g$y)
        print(wireframe(z ~ x + y, data = g))
  • Example: Parallel numerical integration
    Define the number of increments for integration:

        n <- 10000

    Calculate with nested for loops (very slow!):

        erg <- 0
        xincr <- (xint[2] - xint[1]) / n
        yincr <- (yint[2] - yint[1]) / n
        for (xi in seq(xint[1], xint[2], length.out = n)) {
          for (yi in seq(yint[1], yint[2], length.out = n)) {
            box <- func(xi, yi) * xincr * yincr
            erg <- erg + box
          }
        }
        erg
  • Example: Parallel numerical integration
    Use nested sapply (much faster):

        applyfunc <- function(xrange, xint, yint, n, func) {
          yrange <- seq(yint[1], yint[2], length.out = n)
          xincr <- (xint[2] - xint[1]) / n
          yincr <- (yint[2] - yint[1]) / n
          erg <- sum(sapply(xrange, function(x) sum(func(x, yrange)))) *
                 xincr * yincr
          return(erg)
        }
        xrange <- seq(xint[1], xint[2], length.out = n)
        erg <- sapply(xrange, applyfunc, xint, yint, n, func)
        sum(erg)
  • Example: Parallel numerical integration
    Define a worker function for parallel calculation:

        workerfunc <- function(id, nworkers, xint, yint, n, func) {
          # each worker handles every nworkers-th x value, offset by its id
          xrange <- seq(xint[1], xint[2], length.out = n)[seq(id, n, nworkers)]
          yrange <- seq(yint[1], yint[2], length.out = n)
          xincr <- (xint[2] - xint[1]) / n
          yincr <- (yint[2] - yint[1]) / n
          erg <- sapply(xrange, function(x) sum(func(x, yrange))) *
                 xincr * yincr
          return(sum(erg))
        }
  • Example: Parallel numerical integration
    Start a cluster of local R engines using all your CPU cores:

        library(parallel)
        nworkers <- detectCores()
        cluster <- makeCluster(nworkers)

    Run the calculation in parallel (faster than the serial calculation):

        erg <- clusterApplyLB(cluster, 1:nworkers, workerfunc,
                              nworkers, xint, yint, n, func)
        sum(unlist(erg))

    Stop the cluster:

        stopCluster(cluster)
  • Vectorized Code
    Use vectorized R code (the fastest method!):

        xincr <- (xint[2] - xint[1]) / n
        yincr <- (yint[2] - yint[1]) / n
        erg <- sum(func(seq(xint[1], xint[2], length.out = n),
                        seq(yint[1], yint[2], length.out = n))) *
               xincr * yincr * n
        erg

    Refer back to the slide "When not to use AWS?" This problem is best solved through vectorization instead of using larger computational resources.
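As a quick sanity check (not from the slides), the approaches should agree numerically; for example, comparing the sapply version against the vectorized version:

    # assumes func, xint, yint, n, and applyfunc from the slides above
    xs <- seq(xint[1], xint[2], length.out = n)
    ys <- seq(yint[1], yint[2], length.out = n)
    xincr <- (xint[2] - xint[1]) / n
    yincr <- (yint[2] - yint[1]) / n

    erg_apply  <- sum(sapply(xs, applyfunc, xint, yint, n, func))
    erg_vector <- sum(func(xs, ys)) * xincr * yincr * n

    all.equal(erg_apply, erg_vector)  # TRUE, up to floating-point error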
  • Reminder to Stop EC2 Instances
    • Stop your EC2 instances after use to avoid charges
      – After the one-year free usage period for one micro instance ends, running a micro instance 24x7 results in charges of about $15/month
    • If you use EC2 regularly, configure CloudWatch alarms to automatically notify you, or stop your instances, after a period of low CPU utilization
  • R with Amazon Elastic MapReduce
    • The R package segue provides an integration with Amazon Elastic MapReduce (EMR) for simple parallel computation (a sketch follows):
      – https://code.google.com/p/segue/
      – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
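A minimal sketch of the segue workflow as described in the linked posts (function names follow segue's documentation; valid AWS credentials are required and EMR charges apply):

    library(segue)

    # AWS credentials (placeholders)
    setCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")

    # Launch a small EMR cluster, run an lapply-style job on it, shut it down
    myCluster <- createCluster(numInstances = 2)
    results <- emrlapply(myCluster, as.list(1:10), function(x) x^2)
    stopCluster(myCluster)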
  • Other Useful Links
    • CRAN Task View: High-Performance and Parallel Computing with R: http://cran.r-project.org/web/views/HighPerformanceComputing.html
    • R package AWS.tools: http://cran.r-project.org/web/packages/AWS.tools/index.html
  • Join the Raleigh-Durham-Chapel Hill R Users Group at: http://www.meetup.com/Triangle-useR/