R + 15 minutes = Hadoop cluster
 


Overview of JD Long's experimental "segue" package which marshals and manages Hadoop clusters (for non-Big Data problems) with Amazon's Elastic MapReduce service.

Presented at the February 2011 meeting of the Greater Boston useR Group.

    Presentation Transcript

    • useR Vignette: R + 15 minutes = Hadoop cluster. Greater Boston useR Group, February 2011, by Jeffrey Breen (jbreen@cambridge.aero)
    • Agenda
      ● What's Hadoop?
      ● But I don't have Big Data
      ● Building the cluster
      ● Estimating π stochastically
      ● Want to know more?
      useR Vignette: R + 15 minutes = Hadoop Cluster, Greater Boston useR Meeting, February 2011, Slide 2
    • MapReduce, Hadoop and Big Data
      ● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
      ● Designed to process huge data sets
        – “huge” = “all of Facebook's web logs”
        – Yahoo! sorted 1TB in 62 seconds in May 2009
        – HDFS distributed file system makes replication decisions based on knowledge of network topology
      ● Amazon Elastic MapReduce is a full Hadoop stack on EC2
    • MapReduce = Map + shuffle + Reduce
      Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
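The three phases named on this slide can be illustrated with a toy word count in base R. This is only a local, single-process sketch; the input strings and variable names are made up for illustration, and nothing here touches Hadoop.

```r
# Toy word count in base R, mirroring the three MapReduce phases.
docs <- c("big data big cluster", "small data")

# Map: each input record emits (key, value) pairs; here (word, 1).
mapped <- lapply(docs, function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
})

# Shuffle: bring together all values that share a key.
pairs <- do.call(rbind, mapped)
shuffled <- split(pairs$value, pairs$key)

# Reduce: collapse each key's values to a single result.
wordCounts <- sapply(shuffled, sum)
print(wordCounts)
```

In a real cluster the map and reduce steps run on different nodes, and the shuffle is the network transfer between them.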
    • But I don't have Big Data
      ● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run
      ● Had a key insight: the input could be a small amount of data (like 1:1000) to serve as random seeds for simulation code in the “mapper” function
      ● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.
      ● Use Amazon's cloud to scale up quickly as needed
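The seeds-as-input idea can be tried locally before any cluster is involved. In the sketch below, `simulateOnce` is a hypothetical stand-in for a real simulation, and base R's `lapply` stands in for a distributed apply; both are assumptions for illustration, not part of the talk.

```r
# Hypothetical stand-in for a real simulation: mean of 100 normal draws.
simulateOnce <- function(seed) {
  set.seed(seed)   # the seed is the only input the "mapper" needs
  mean(rnorm(100))
}

seedList <- as.list(1:10)                   # tiny input, one element per run
results  <- lapply(seedList, simulateOnce)  # local stand-in for a distributed apply

# Re-running with the same seeds reproduces the results exactly,
# which is what makes each run independent and safe to retry.
reproducible <- identical(results, lapply(seedList, simulateOnce))
print(reproducible)
```

Because each run depends only on its seed, the runs can be scheduled on any node in any order.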
    • Load the segue library
      > library(segue)
      Loading required package: rJava
      Loading required package: caTools
      Loading required package: bitops
      Segue did not find your AWS credentials. Please run the setCredentials() function.
      > setCredentials("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")
    • Start the cluster
      > myCluster <- createCluster(numInstances=5)
      STARTING - 2011-01-04 15:07:53
      […]
      BOOTSTRAPPING - 2011-01-04 15:11:28
      […]
      WAITING - 2011-01-04 15:15:35
      Your Amazon EMR Hadoop Cluster is ready for action.
      Remember to terminate your cluster with stopCluster().
      Amazon is billing you!
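Since a running cluster keeps accruing charges, it can help to guarantee that the stop step runs even when the work in between fails. The helper below is a hypothetical sketch, not part of segue: `withCluster` and its arguments `startFn`, `stopFn`, and `workFn` are invented names, and the demonstration uses stubs so that it runs without an AWS account.

```r
# Generic helper: start a resource, run some work, and always stop it.
# startFn, stopFn, and workFn are functions supplied by the caller.
withCluster <- function(startFn, stopFn, workFn) {
  cluster <- startFn()
  on.exit(stopFn(cluster), add = TRUE)  # runs even if workFn() raises an error
  workFn(cluster)
}

# Stub demonstration: the "work" fails, yet the stop step still runs.
events <- character(0)
result <- tryCatch(
  withCluster(
    startFn = function() { events <<- c(events, "start"); "fake-cluster" },
    stopFn  = function(cl) events <<- c(events, "stop"),
    workFn  = function(cl) stop("simulated failure")
  ),
  error = function(e) "failed"
)
print(events)
```

With segue, one would presumably pass a function wrapping createCluster() as `startFn` and stopCluster as `stopFn`, assuming stopCluster() accepts the cluster object that createCluster() returns.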
    • Estimate π stochastically
      > estimatePi <- function(seed){
          set.seed(seed)
          numDraws <- 1e6
          r <- .5 #radius
          x <- runif(numDraws, min=-r, max=r)
          y <- runif(numDraws, min=-r, max=r)
          inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)
          return(sum(inCircle) / length(inCircle) * 4)
        }
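The factor of 4 in estimatePi comes from an area ratio. Points are drawn uniformly from a square of side 2r, and the circle of radius r inscribed in it covers a fraction π/4 of that square. Writing N for the number of draws (a symbol introduced here, corresponding to numDraws), the estimator can be sketched as:

```latex
\[
P\bigl(x^2 + y^2 < r^2\bigr) = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\hat{\pi} = 4 \cdot \frac{\#\{\, i : x_i^2 + y_i^2 < r^2 \,\}}{N}
\]
```

The radius cancels, so the choice r = 0.5 in the code is arbitrary.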
    • Run the simulation
      > seedList <- as.list(1:1e3)
      > myEstimates <- emrlapply( myCluster, seedList, estimatePi )
      RUNNING - 2011-01-04 15:22:28
      […]
      WAITING - 2011-01-04 15:32:18
      > myPi <- Reduce(sum, myEstimates) / length(myEstimates)
      > format(myPi, digits=10)
      [1] "3.141586544"
      > format(pi, digits=10)
      [1] "3.141592654"
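As a rough sanity check on the reported estimate, one can compare its error with the standard error expected from this many draws. The sketch below treats every draw as an independent Bernoulli trial with success probability π/4; `piStdError` is a hypothetical helper name, while the draw counts and the estimate are taken from the slides.

```r
# Standard error of the Monte Carlo estimate 4 * (fraction inside the circle).
# Each draw is a Bernoulli trial with success probability p = pi / 4.
piStdError <- function(totalDraws) {
  p <- pi / 4
  4 * sqrt(p * (1 - p) / totalDraws)
}

totalDraws <- 1e3 * 1e6     # 1,000 seeds x 1e6 draws each (from the slides)
se <- piStdError(totalDraws)

observedError <- abs(3.141586544 - pi)  # estimate reported on the slide
withinTwoSe <- observedError < 2 * se
print(c(se = se, observedError = observedError))
```

Averaging 1,000 equal-sized estimates is equivalent to pooling all the draws, which is why the total draw count is the relevant sample size.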
    • Won't break the bank
      ● Total cost: $0.15
      Standard On-Demand Instances   Amazon EC2 price per hour   Amazon Elastic MapReduce price per hour
      Small (Default)                $0.085 per hour             $0.015 per hour
      Large                          $0.34 per hour              $0.06 per hour
      Extra Large                    $0.68 per hour              $0.12 per hour
    • Want to know more?
      ● JD Long's segue package
        ● http://code.google.com/p/segue/
      ● Hadoop
        ● http://hadoop.apache.org/
        ● Book: http://oreilly.com/catalog/0636920010388
      ● My blog
        ● http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-a