"R, Hadoop, and Amazon Web Services (20 December 2011)"


Published on

Dalbey, Timothy. "R, Hadoop, and Amazon Web Services (PPT)." Portland R User Group, 20 December 2011.

Published in: Technology

  1. 1. R, Hadoop and Amazon Web Services
     Portland R Users Group, December 20th, 2011
  2. 2. A general disclaimer
     • Good programmers learn fast and develop expertise in technologies and methodologies in a rather intrepid, exploratory manner.
     • I am by no means an expert in the paradigm we are discussing this evening, but I’d like to share what I have learned in the last year while developing MapReduce applications in R within AWS. Translation: ask anything and everything, but I reserve the right to say “I don’t know, yet.”
     • Also, this is a meetup.com meeting, so it seems only appropriate to keep this short, sweet, high-level and full of solicitous discussion points.
  3. 3. The whole point of this presentation
     • I am selfish (and you should be too!)
       – I like collaborators
       – I like collaborators interested in things I am interested in
       – I believe that dissemination of information related to sophisticated, numerical decision-making processes generally makes the world a better place
       – I believe that the more people use Open Source technology, the more people contribute to Open Source technology and the better Open Source technology gets in general. Hence, my life gets easier and cheaper, which is presumably analogous to “better” in some respect.
       – There is beer at this meetup. Cue short intermission.
     • Otherweiser® (brought on by the aforementioned speaking point), I’d really be very happy if people said to themselves at the end of this presentation, “Hadoop seems easy! I’m going to give it a try.”
  4. 4. Why are we talking about this anyhow?
     “Every two days now we create as much information as we did from the dawn of civilization up until 2003.” (Eric Schmidt, August 2010)
     • We aggregate a lot of data (and have been)
       – Particularly businesses like Google, Amazon, Apple etc.
       – Presumably the government is doing awful things with data too
     • But aggregation isn’t understanding
       – Lawnmower Man aside
       – We need to UNDERSTAND the data: that is, take raw data and make it interpretable.
       – Hence the need for a marriage of Statistics and Programming directed at understanding phenomena expressed in these large data sets
       – Can’t recommend this book enough:
         • The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman
         • http://www.amazon.com/Elements-Statistical-Learning-Prediction-Statistics/dp/0387848576/ref=pd_sim_b_1
     • So everybody is going crazy about this in general.
  5. 5. Also, who is this “self” I speak of?
     • ’Tis I, Timothy Dalbey
       • I work for the Emerging Technologies Group of News Corporation
       • I live in North East Portland and keep an office on 53rd and 5th in New York City
       • Studied Mathematics and Economics as an undergraduate student and Statistics as a graduate student at University of Virginia
       • 2 awesome kids and an awesome partner at home: Liam, Juniper and Lindsay
       • Enthusiastic about technology, science and futuristic endeavors in general
  6. 6. Elastic MapReduce
     • Elastic MapReduce is
       – A service of Amazon Web Services
       – Composed of Amazon Machine Images
         • ssh capability
         • Debian Linux
         • Preloaded with ancient versions of R
       – A complimentary set of Ruby Client Tools
       – A web interface
       – Preconfigured to run Hadoop
  7. 7. Hadoop
     • Popular framework for controlling distributed cluster computations
       – Popularity is important: cue story about MPI at Levy Laboratory and Beowulf clusters…
     • Hadoop is an Apache project
       – http://hadoop.apache.org/
     • Open Source
     • Java
     • Configurable (mostly uses XML config files)
     • Fault Tolerant
     • Lots of ways to interact with Hadoop
       – Pig
       – Hive
       – Streaming
       – Custom .jar
  8. 8. Hadoop is MapReduce
     • What is MapReduce?
       – Originally coined at Google in 2004
       – A super simplified single-node version of the paradigm is as follows:
           cat input.txt | ./mapper.R | sort | ./reducer.R > output.txt
     • That is, MapReduce follows a general process:
       – Read input (cat input.txt)
       – Map (mapper.R)
       – Partition
       – Comparison (sort)
       – Reduce (reducer.R)
       – Output (output.txt)
     • You can use most popular scripting languages
       – Perl, PHP, Python etc.
       – R
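The phases listed on this slide can be walked through inside a single R process. The following is only a sketch on a toy word count: `map_line`, `reduce_key` and `run_wordcount` are illustrative names that do not appear in the slides, and a real Hadoop job would run the map and reduce steps as separate processes on separate nodes.

```r
# Single-process sketch of the MapReduce phases for a word count.
# map_line, reduce_key and run_wordcount are illustrative names, not slide code.

# Map: turn one input line into "word<TAB>1" records.
map_line <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Reduce: sum the counts of all records that share one key.
reduce_key <- function(key, records) {
  counts <- as.integer(sub("^[^\t]*\t", "", records))
  paste(key, sum(counts), sep = "\t")
}

run_wordcount <- function(lines) {
  mapped <- unlist(lapply(lines, map_line))          # map
  if (length(mapped) == 0) return(character(0))
  mapped <- sort(mapped, method = "radix")           # comparison (the `sort` step)
  keys <- sub("\t.*$", "", mapped)
  groups <- split(mapped, keys)                      # partition by key
  unname(mapply(reduce_key, names(groups), groups))  # reduce
}

result <- run_wordcount("It was the best of times, it was the worst of times")
cat(result, sep = "\n")
```

Because the map step only ever sees one line and the reduce step only ever sees one key, either step could in principle be spread over many workers without changing the result.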
  9. 9. But that sort of misses the point
     • MapReduce is a computational paradigm intended for
       – Large Datasets
       – Multi-Node Computation
       – Truly Parallel Processing
     • Master/Slave architecture
       – Nodes are agnostic of one another; only the master node(s) have any idea about the greater scheme of things.
         • The importance of truly parallel processing
     • A good first question before engaging in creating a Hadoop job is:
       – Is this process a good candidate for Hadoop processing in the first place?
  10. 10. Benefits to using AWS for Hadoop Jobs
     • Preconfigured to run Hadoop
       – This in itself is something of a miracle
     • Virtual Servers
       – Use the servers for only as long as you need
       – Configurability
     • Handy command line tools
     • S3 is sitting in the same cloud
       – Your data is sitting in the same space
     • Servers come at $0.06 per hour of compute time: dirt cheap
  11. 11. Specifics
     • Bootstrapping
       – Bootstrapping is a process by which you may customize the nodes via a bash shell script
         • Acquiring data
         • Updating R
         • Installing packages
         • For example:

           #!/bin/bash
           # debian R upgrade
           gpg --keyserver pgpkeys.mit.edu --recv-key 06F90DE5381BA480
           gpg -a --export 06F90DE5381BA480 | sudo apt-key add -
           echo "deb http://streaming.stat.iastate.edu/CRAN/bin/linux/debian lenny-cran/" | sudo tee -a /etc/apt/sources.list
           sudo apt-get update
           sudo apt-get -t lenny-cran install --yes --force-yes r-base r-base-dev

     • Input file
       – Mapper specific
         • Classic example in WordCounter.py
           – Example: “It was the best of times, it was the worst of times…”
           – Note: Big data set!
         • An example from a recent application of mine:
           – "25621"\r"23803"\r"31712"\r…
           – Note: Not such a big data set
     • Mapper & Reducer
       – Both typically draw from STDIN and write to STDOUT
       – Please see the following examples
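  12. 12. The typical “Hello World” MapReduce Mapper

     #! /usr/bin/env Rscript
     # Word-count mapper: reads lines from STDIN, writes one "word<TAB>1" record per word to STDOUT.
     trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
     splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
     con <- file("stdin", open = "r")
     while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
       line <- trimWhiteSpace(line)
       words <- splitIntoWords(line)
       words <- words[nzchar(words)]   # drop empty tokens so blank lines emit nothing
       if (length(words) > 0) cat(paste(words, "\t1\n", sep = ""), sep = "")
     }
     close(con)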
  13. 13. The typical “Hello World” MapReduce Reducer

     #! /usr/bin/env Rscript
     # Word-count reducer: reads "word<TAB>count" records from STDIN and sums the counts per word.
     trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
     splitLine <- function(line) {
       val <- unlist(strsplit(line, "\t"))
       list(word = val[1], count = as.integer(val[2]))
     }
     env <- new.env(hash = TRUE)   # running totals, keyed by word
     con <- file("stdin", open = "r")
     while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
       line <- trimWhiteSpace(line)
       parts <- splitLine(line)
       word <- parts$word
       count <- parts$count
       if (exists(word, envir = env, inherits = FALSE)) {
         oldcount <- get(word, envir = env)
         assign(word, oldcount + count, envir = env)
       } else {
         assign(word, count, envir = env)
       }
     }
     close(con)
     # Emit one "word<TAB>total" record per distinct word.
     for (w in ls(env, all.names = TRUE)) {
       cat(w, "\t", get(w, envir = env), "\n", sep = "")
     }
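The reducer on this slide keeps every word in an environment until the input ends. Because the sort step delivers records grouped by key, a reducer can instead hold a single key at a time. This is a sketch of that alternative, not the speaker's code; `reduce_sorted` is a hypothetical helper, and it takes a character vector rather than reading STDIN so that it can be tried interactively.

```r
# Sketch: reduce key-sorted "word<TAB>count" records while holding one key in memory.
# reduce_sorted is a hypothetical helper, not code from the slides.
reduce_sorted <- function(lines) {
  out <- character(0)
  current <- NULL
  total <- 0L
  for (line in lines) {
    fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
    if (length(fields) != 2) next             # skip malformed records
    count <- suppressWarnings(as.integer(fields[2]))
    if (is.na(count)) next
    if (!is.null(current) && fields[1] != current) {
      out <- c(out, paste(current, total, sep = "\t"))  # key changed: emit it
      total <- 0L
    }
    current <- fields[1]
    total <- total + count
  }
  # Emit the final key once the input is exhausted.
  if (!is.null(current)) out <- c(out, paste(current, total, sep = "\t"))
  out
}

sorted_records <- c("best\t1", "it\t1", "it\t1", "times\t1", "times\t1")
reduced <- reduce_sorted(sorted_records)
cat(reduced, sep = "\n")
```

The trade-off is that this version is only correct when its input really is sorted by key, whereas the environment-based reducer tolerates any order at the cost of memory.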
  14. 14. MapReduce and R: Forecasting data for News Corporation
     • 50k+ products with historical unit sales data of roughly 2.5MM rows
     • Some of the titles require heavy computational processing
       – Titles with insufficient data require augmented or surrogate data in order to make “good” predictions, so identifying good candidate data was also necessary in addition to prediction methods
       – Took lots of time (particularly in R)
         • But R had the analysis tools I needed!
     • Key observation: The predictions were independent of one another, which made the process truly parallel.
     • Thus, Hadoop and Elastic MapReduce were merited
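To make the key observation concrete, here is a rough sketch of a mapper-style function that takes one product ID and produces one forecast record. Everything specific in it is assumed for illustration: the `sales` table, its column names, and the linear-trend `forecast_next()` stand in for the real data and models, which the slides do not show.

```r
# Sketch of a forecasting mapper: one product ID in, one "id<TAB>forecast" record out.
# The sales table, its column names and forecast_next() are made up for illustration.
sales <- data.frame(
  product_id = c("25621", "25621", "25621", "23803", "23803", "23803"),
  period     = c(1, 2, 3, 1, 2, 3),
  units      = c(10, 12, 14, 5, 5, 5),
  stringsAsFactors = FALSE
)

# Stand-in model: fit a linear trend and predict the next period.
forecast_next <- function(history) {
  fit <- lm(units ~ period, data = history)
  next_period <- data.frame(period = max(history$period) + 1)
  unname(predict(fit, newdata = next_period))
}

# Each ID is handled independently of the others, so the IDs can be
# split across any number of mapper processes.
map_product <- function(id) {
  history <- sales[sales$product_id == id, ]
  if (nrow(history) < 2) return(paste(id, NA, sep = "\t"))  # not enough data
  paste(id, round(forecast_next(history), 2), sep = "\t")
}

ids <- c("25621", "23803", "31712")   # in a streaming job these come from STDIN
records <- vapply(ids, map_product, character(1), USE.NAMES = FALSE)
cat(records, sep = "\n")
```

No call to `map_product()` reads or writes anything another call depends on, which is the property that makes the workload a candidate for Hadoop in the first place.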
  15. 15. My Experience Learning and Using Hadoop with AWS
     • Debugging is something of a nightmare.
       – SSH onto nodes to figure out what’s really going on
       – STDERR is your enemy: it will cause your job to fail rather completely
       – STDERR is your best friend. No errors and failed jobs are rather frustrating
     • Most of the work is transactional, dealing with AWS Elastic MapReduce
     • I followed conventional advice, which is “move data to the nodes.”
       – This meant moving data into CSVs in S3 and importing the data into R via standard read methods
       – This also meant that my processes were database agnostic
       – JSON is a great way of structuring input and output between phases of the MapReduce process
         • To that effect, check out rjson: great package.
     • In general, the following rule seems to apply:
       – Data frame bad.
       – Data table good.
         • http://cran.r-project.org/web/packages/data.table/index.html
     • Packages to simplify R make my skin crawl
       – Ever see Jurassic Park?
       – Just a stubborn programmer. Of course the logical extension leads me to contradiction. Never mind that I said that.
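One way to act on the two STDERR bullets is to keep STDOUT strictly for records and route diagnostics to STDERR, while catching per-record errors so that a single bad line does not stop the task. This is a minimal sketch under that assumption; `log_stderr`, `parse_count` and `safe_map` are hypothetical names, and how a given cluster surfaces STDERR output is not something the slides specify.

```r
# Sketch: keep STDOUT for records and send diagnostics to STDERR.
# log_stderr, parse_count and safe_map are hypothetical helper names.
log_stderr <- function(...) cat(..., "\n", file = stderr(), sep = "")

parse_count <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  if (length(fields) != 2) stop("expected 2 tab-separated fields")
  count <- suppressWarnings(as.integer(fields[2]))
  if (is.na(count)) stop("count is not an integer")
  list(key = fields[1], count = count)
}

# Wrap the per-record work so one bad record is logged and skipped
# rather than stopping the whole task.
safe_map <- function(line) {
  tryCatch(
    parse_count(line),
    error = function(e) {
      log_stderr("skipping record [", line, "]: ", conditionMessage(e))
      NULL
    }
  )
}

parsed <- lapply(c("it\t2", "broken line", "was\tx"), safe_map)
good <- Filter(Negate(is.null), parsed)
for (rec in good) cat(rec$key, "\t", rec$count, "\n", sep = "")
```

Logging the offending record itself, not just the error message, tends to save an SSH session onto the node later.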
  16. 16. R Package to Utilize MapReduce
     • Segue
       – Written by J.D. Long
       – http://www.cerebralmastication.com
         • P.S. We all realize that www is a subdomain, right? World Wide Web… is that really necessary?
       – Handles much of the transactional detail and allows the use of Elastic MapReduce through apply() and lapply() wrappers
     • Seems like this is a good tutorial too:
       – http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-amazon-elastic-mapreduce-hadoop/
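The appeal of an lapply()-style wrapper is that code written for a local list needs little change to run on a cluster. The sketch below runs entirely locally with base R; `simulate_mean` is a made-up workload. The Segue calls in the comments are written from memory of the package and are not verified here, so treat them as an assumption and check the package documentation before relying on them.

```r
# Sketch of the lapply() pattern that Segue is described as wrapping.
# simulate_mean is a made-up workload; everything here runs locally.
simulate_mean <- function(seed, n = 1000) {
  set.seed(seed)        # each task carries its own seed, so tasks stay independent
  mean(rnorm(n))
}

seeds <- as.list(1:8)
local_results <- lapply(seeds, simulate_mean)

# With Segue the same list and function would be handed to its cluster wrapper.
# The calls below are recalled from the package and are unverified:
#   cluster <- createCluster(numInstances = 4)
#   remote_results <- emrlapply(cluster, seeds, simulate_mean)
#   stopCluster(cluster)
str(unlist(local_results))
```

Writing the function so that it depends only on its arguments is what lets it move between the local and the remote form.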
  17. 17. Other stuff
     • Distributed Cache
       – Load your data the smart way!
     • Ruby Command Tools
       – Interact with AWS the smart way!
     • Web interface
       – Simple.
       – Helpful when monitoring jobs when you wake up at 3:30AM and wonder “is my script still running?”