R + 15 minutes = Hadoop cluster

Overview of JD Long's experimental "segue" package which marshals and manages Hadoop clusters (for non-Big Data problems) with Amazon's Elastic MapReduce service.

Presented at the February 2011 meeting of the Greater Boston useR Group.

  1. useR Vignette: R + 15 minutes = Hadoop cluster
     Greater Boston useR Group, February 2011
     by Jeffrey Breen, jbreen@cambridge.aero
  2. Agenda
     ● What's Hadoop?
     ● But I don't have Big Data
     ● Building the cluster
     ● Estimating π stochastically
     ● Want to know more?
  3. MapReduce, Hadoop and Big Data
     ● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
     ● Designed to process huge data sets
       – "huge" = "all of Facebook's web logs"
       – Yahoo! sorted 1TB in 62 seconds in May 2009
       – The HDFS distributed file system makes replication decisions based on knowledge of network topology
     ● Amazon Elastic MapReduce is a full Hadoop stack on EC2
  4. MapReduce = Map + shuffle + Reduce
     (Diagram) Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
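To make the pattern concrete in R terms (this example is my own illustration, not from the talk), here is a minimal, single-machine sketch of the same three steps: a map applied independently to each input, a shuffle that groups intermediate results by key, and a reduce that combines each group.

     # Toy word count in plain R, mirroring the MapReduce pattern.
     # Hadoop runs these phases in parallel across many machines.
     docs <- list("the quick brown fox", "the lazy dog", "the fox")

     # Map: for each document, emit (word, 1) pairs
     mapped <- lapply(docs, function(doc) {
       words <- strsplit(doc, " ")[[1]]
       setNames(rep(1, length(words)), words)
     })

     # Shuffle: group all intermediate values by key (the word)
     pairs    <- unlist(mapped)
     shuffled <- split(pairs, names(pairs))

     # Reduce: combine the values within each group
     counts <- sapply(shuffled, sum)
     counts   # named vector of word counts, e.g. "the" = 3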
  5. But I don't have Big Data
     ● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run
     ● He had a key insight: the input can be a small amount of data (like 1:1000) that serves as random seeds for the simulation code in the "mapper" function
     ● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.
     ● Use Amazon's cloud to scale up quickly as needed
     (A serial, single-machine sketch of this seed-driven pattern follows below.)
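That insight is easy to see in plain R before any cluster is involved. The sketch below is my own illustration; simulateOnce is a hypothetical stand-in for whatever simulation you actually run. It uses lapply() over a list of seeds; segue's emrlapply(), shown on the following slides, is the drop-in replacement that farms the same call out to the Hadoop cluster.

     # Serial analog of the segue pattern: each input element is just a seed.
     simulateOnce <- function(seed) {   # hypothetical simulation function
       set.seed(seed)                   # the seed makes each run reproducible
       mean(rnorm(1e5))                 # stand-in for real simulation work
     }

     seeds   <- as.list(1:1000)              # the "small data" input
     results <- lapply(seeds, simulateOnce)  # emrlapply() distributes this step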
  6. Load the segue library

     > library(segue)
     Loading required package: rJava
     Loading required package: caTools
     Loading required package: bitops
     Segue did not find your AWS credentials. Please run the setCredentials() function.
     > setCredentials("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")

     (The two arguments are placeholders for your actual AWS access key ID and secret access key.)
  7. Start the cluster

     > myCluster <- createCluster(numInstances=5)
     STARTING - 2011-01-04 15:07:53
     […]
     BOOTSTRAPPING - 2011-01-04 15:11:28
     […]
     WAITING - 2011-01-04 15:15:35
     Your Amazon EMR Hadoop Cluster is ready for action.
     Remember to terminate your cluster with stopCluster().
     Amazon is billing you!
  8. Estimate π stochastically

     > estimatePi <- function(seed) {
         set.seed(seed)
         numDraws <- 1e6
         r <- .5   # radius
         x <- runif(numDraws, min=-r, max=r)
         y <- runif(numDraws, min=-r, max=r)
         inCircle <- ifelse( (x^2 + y^2)^.5 < r, 1, 0 )
         return( sum(inCircle) / length(inCircle) * 4 )
       }
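Before paying for cluster time, the function can be sanity-checked locally; this quick check is my own addition, not part of the slide.

     > estimatePi(42)                    # one local run; should land near 3.14
     > mean(sapply(1:10, estimatePi))    # averaging a few seeds tightens the estimate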
  9. Run the simulation

     > seedList <- as.list(1:1e3)
     > myEstimates <- emrlapply( myCluster, seedList, estimatePi )
     RUNNING - 2011-01-04 15:22:28
     […]
     WAITING - 2011-01-04 15:32:18
     > myPi <- Reduce(sum, myEstimates) / length(myEstimates)
     > format(myPi, digits=10)
     [1] "3.141586544"
     > format(pi, digits=10)
     [1] "3.141592654"
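The cluster-startup slide reminds you to terminate the cluster when you are done so Amazon stops billing; assuming segue's stopCluster() takes the cluster object returned by createCluster(), that last step looks like:

     > stopCluster(myCluster)   # shut down the EMR cluster; billing stops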
  10. Won't break the bank
      ● Total cost: $0.15

      Standard On-Demand Instances   Amazon EC2 price/hour   Amazon Elastic MapReduce price/hour
      Small (Default)                $0.085                  $0.015
      Large                          $0.34                   $0.06
      Extra Large                    $0.68                   $0.12
  11. Want to know more?
      ● JD Long's segue package
        ● http://code.google.com/p/segue/
      ● Hadoop
        ● http://hadoop.apache.org/
        ● Book: http://oreilly.com/catalog/0636920010388
      ● My blog
        ● http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-a
