Scalable Analytics 
with 
R, Hadoop and RHadoop 
Gwen Shapira, Software Engineer 
@gwenshap 
gshapira@cloudera.com
#include warning.h 
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• Rhadoop 
Get Started with R-Studio 
Basic Data Types 
• String 
• Number 
• Boolean 
• Assignment <- 
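As a quick illustration (my own, not from the original deck), the three basic types and the assignment arrow look like this in an R session:

s <- "hello"   # string (character)
n <- 42.5      # number (numeric)
b <- TRUE      # boolean (logical)
class(s); class(n); class(b)   # "character" "numeric" "logical"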
R can be a nice calculator 
> x <- 1 
> x * 2 
[1] 2 
> y <- x + 3 
> y 
[1] 4 
> log(y) 
[1] 1.386294 
> help(log) 
Complex Data Types 
• Vector 
• c, seq, rep, [] 
• List 
• Data Frame 
• Lists of vectors of same length 
• Not a matrix 
Creating vectors 
> v1 <- c(1,2,3,4) 
[1] 1 2 3 4 
> v1 * 4 
[1] 4 8 12 16 
> v4 <- c(1:5) 
[1] 1 2 3 4 5 
> v2 <- seq(2,12,by=3) 
[1] 2 5 8 11 
> v1 * v2 
[1] 2 10 24 44 
> v3 <- rep(3,4) 
[1] 3 3 3 3
Accessing and filtering vectors 
> v1 <- c(2,4,6,8) 
[1] 2 4 6 8 
> v1[2] 
[1] 4 
> v1[2:4] 
[1] 4 6 8 
> v1[-2] 
[1] 2 6 8 
> v1[v1>3] 
[1] 4 6 8
Lists 
> lst <- list (1,"x",FALSE) 
[[1]] 
[1] 1 
[[2]] 
[1] "x" 
[[3]] 
[1] FALSE 
> lst[1] 
[[1]] 
[1] 1 
> lst[[1]] 
[1] 1
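A small addition to the slide's example: list elements can also be named, in which case $ behaves like [[. The person list below is a made-up illustration, not part of the deck.

person <- list(name = "Gwen", talks = 3, uses_hadoop = TRUE) 
person$name         # [1] "Gwen" 
person[["talks"]]   # [1] 3 
person["talks"]     # still a list of length 1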
Data Frames 
books <- read.csv("~/books.csv") 
books[1,] 
books[,1] 
books[3:4] 
books$price 
books[books$price==6.99,] 
martin_price <- books[books$author_t=="George R.R. Martin",]$price 
mean(martin_price) 
subset(books, select=-c(id,cat,sequence_i))
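Since books.csv is not included with the deck, here is a tiny stand-in data frame (hypothetical columns and values) that makes the same indexing reproducible:

books <- data.frame(author_t = c("George R.R. Martin", "J.R.R. Tolkien", "Terry Pratchett"), 
                    price    = c(6.99, 8.99, 6.99), 
                    stringsAsFactors = FALSE) 
books[books$price == 6.99, ]                                  # filter rows by value 
mean(books[books$author_t == "George R.R. Martin", ]$price)  # same pattern as martin_price above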
Functions 
> sq <- function(x) { x*x } 
> sq(3) 
[1] 9 
Note: 
R is a functional programming language. 
Functions are first-class objects 
and can be passed to other functions.
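To make the "first-class objects" point concrete, here is a short sketch (my own, not from the slides) of passing one function to another:

sq <- function(x) { x * x } 
sapply(1:5, sq)                          # [1]  1  4  9 16 25 
compose <- function(f, g) function(x) f(g(x)) 
log_sq <- compose(log, sq) 
log_sq(3)                                # same as log(9)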
packages 
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• Rhadoop 
“In pioneer days they used oxen for heavy 
pulling, and when one ox couldn’t budge a log, 
we didn’t try to grow a larger ox.” 
— Grace Hopper, early advocate of distributed computing
Hadoop in a Nutshell
Map-Reduce is the interesting bit 
• Map – Apply a function to each input record 
• Shuffle & Sort – Partition the map output and sort each partition 
• Reduce – Apply an aggregation function to all values in each partition 
• Map reads input from disk 
• Reduce writes output to disk
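A toy, single-machine sketch of the three phases in plain R (no Hadoop involved) may help fix the vocabulary; the records here are made up:

records  <- c("a", "b", "a", "c", "b", "a") 
mapped   <- lapply(records, function(r) list(key = r, value = 1))   # map: emit (key, 1) per record 
keys     <- sapply(mapped, `[[`, "key") 
grouped  <- split(sapply(mapped, `[[`, "value"), keys)              # shuffle & sort: group values by key 
sapply(grouped, sum)                                                # reduce: aggregate each group -> a=3 b=2 c=1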
Example – Sessionize clickstream 
Sessionize 
Identify unique “sessions” of interacting with our 
website 
Session – for each user (IP), the set of clicks that happened within 30 minutes of each other
Input – Apache Access Log Records 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326
Output – Add Session ID 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 15 
Overview 
[Diagram: each log line feeds a Map task; the mappers emit (IP, log lines) groups to the Reduce tasks, which output every log line tagged with its session ID]
Map 
parsedRecord = re.search(‘(d+.d+….’,record) 
IP = parsedRecord.group(1) 
timestamp = parsedRecord.group(2) 
print ((IP,timestamp),record)
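The slide's snippet is Python-flavored pseudocode; an equivalent sketch in R (with a simplified regex for the common log format — an assumption, not the deck's actual pattern) could be:

map_line <- function(record) { 
  m  <- regmatches(record, regexec("^(\\S+) \\S+ \\S+ \\[([^]]+)\\]", record))[[1]] 
  ip <- m[2]; timestamp <- m[3] 
  list(key = c(ip, timestamp), value = record)    # emit ((IP, timestamp), record) 
} 
map_line('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326')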
Shuffle & Sort 
Partition by: IP 
Sort by: timestamp 
Now reduce gets: 
(IP,timestamp) [record1,record2,record3….] 
Reduce 
sessionID = 1 
curr_timestamp = getTimestamp(records[0]) 
foreach record in records: 
    if (getTimestamp(record) - curr_timestamp > 30): 
        sessionID += 1 
    curr_timestamp = getTimestamp(record) 
    print(record + " " + sessionID)
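For contrast with the per-record loop above, a vectorized R sketch of the same reduce step (assuming one IP's records arrive sorted, with timestamps already converted to seconds) might look like:

sessionize_group <- function(records, timestamps, gap_minutes = 30) { 
  gap_to_prev <- c(0, diff(timestamps))                        # seconds since the previous click 
  session_id  <- cumsum(gap_to_prev > gap_minutes * 60) + 1    # new session when the gap exceeds 30 min 
  paste(records, session_id) 
} 
sessionize_group(c("click1", "click2", "click3"), c(0, 600, 4000))   # "click1 1" "click2 1" "click3 2"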
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• Rhadoop 
Reshape2 
• Two functions: 
• Melt – wide format to long format 
• Cast – long format to wide format 
• Columns: identifiers or measured variables 
• Molten data: 
• Unique identifiers 
• New column – variable name 
• New column – value 
• Default – all numbers are values
Melt 
> tips 
total_bill tip sex smoker day time size 
16.99 1.01 Female No Sun Dinner 2 
10.34 1.66 Male No Sun Dinner 3 
21.01 3.50 Male No Sun Dinner 3 
> melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2
Cast 
> m_tips <- melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
> dcast(m_tips, sex+time~variable, mean) 
sex time total_bill tip size 
Female Dinner 19.21308 3.002115 2.461538 
Female Lunch 16.33914 2.582857 2.457143 
Male Dinner 21.46145 3.144839 2.701613 
Male Lunch 18.04848 2.882121 2.363636
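To reproduce the melt/cast output above: the tips data frame ships with the reshape2 package, and melt() treats the non-numeric columns as identifiers by default.

library(reshape2) 
data(tips) 
m_tips <- melt(tips)                         # id columns: sex, smoker, day, time 
dcast(m_tips, sex + time ~ variable, mean)   # back to wide, averaging each measured variable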
*Apply 
• apply – apply function on rows or columns of matrix 
• lapply – apply function on each item of list 
• Returns a list 
• sapply – like lapply, but returns a vector 
• tapply – apply function to subsets of a vector or list
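Quick sketches of each of these (the matrix and lists are made up; the last line reuses the tips data from reshape2):

m <- matrix(1:6, nrow = 2) 
apply(m, 1, sum)                   # apply over rows 
apply(m, 2, sum)                   # apply over columns 
lapply(list(1:3, 4:6), mean)       # returns a list 
sapply(list(1:3, 4:6), mean)       # simplified to a vector 
tapply(tips$tip, tips$sex, mean)   # mean tip for each subset of the vector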
plyr 
• Split – apply – combine 
• ddply – data frame to data frame 
ddply(.data, .variables, .fun = NULL, ...)
• Summarize – aggregate data into a new data frame 
• Transform – modify a data frame
DDPLY Example 
> ddply(tips,c("sex","time"),summarize, 
+ mean=mean(tip), 
+ sd=sd(tip), 
+ ratio=mean(tip/total_bill) 
+ ) 
sex time mean sd ratio 
1 Female Dinner 3.002115 1.193483 0.1693216 
2 Female Lunch 2.582857 1.075108 0.1622849 
3 Male Dinner 3.144839 1.529116 0.1554065 
4 Male Lunch 2.882121 1.329017 0.1660826
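For comparison, swapping summarize for transform (mentioned on the previous slide) keeps one row per original record and just adds the computed column; a minimal sketch, assuming library(plyr) and the reshape2 tips data are loaded:

with_ratio <- ddply(tips, c("sex","time"), transform, ratio = tip / total_bill) 
head(with_ratio)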
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• Rhadoop 
Rhadoop Projects 
• RMR 
• RHDFS 
• RHBase 
• (new) PlyRMR 
Most Important: 
RMR does not parallelize algorithms. 
It allows you to implement MapReduce in R. 
Efficiently. That’s it.
What does that mean? 
• Use RMR if you can break your problem down to 
small pieces and apply the algorithm there 
• Use commercial R+Hadoop if you need a parallel version of a well-known algorithm 
• Good fit: fit a piecewise regression model for each county in the US 
• Bad fit: fit a piecewise regression model for the entire US population 
• Bad fit: logistic regression
Use-case examples – Good or Bad? 
1. Model power consumption per household to 
determine if incentive programs work 
2. Aggregate corn yield per 10x10 portion of a field to determine the best seeds to use 
3. Create churn models for service subscribers and determine who is most likely to cancel 
4. Determine the correlation between device restarts and support calls
Second Most Important: 
RMR requires R, RMR and all libraries you’ll 
use to be installed on all nodes and 
accessible by the Hadoop user
RMR is different from Hadoop Streaming. 
RMR mapper input: 
Key, [List of Records] 
This is so we can use vector operations.
How to RMRify a Problem 
In more detail… 
• Mappers get list of values 
• You need to process each one independently 
• But do it for all lines at once 
• Reducers work normally
Demo 6 
> library(rmr2) 
t <- list("hello world","don't worry be happy") 
unlist(sapply(t,function (x) {strsplit(x," ")})) 

wc.map <- function(k,v) { 
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")})) 
  keyval(ret_k,1) 
} 
wc.reduce <- function(k,v) { keyval(k,sum(v)) } 

mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt", 
  output="~/wc.json", input.format="text", output.format="json", 
  map=wc.map, reduce=wc.reduce)
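One way to try the same word count without a cluster is the rmr2 "local" backend (see the final tips slide); this sketch assumes only that rmr2 is installed, and uses to.dfs/from.dfs to stand in for HDFS files:

library(rmr2) 
rmr.options(backend = "local")                 # run map and reduce in the local R session 
wc.map    <- function(k, v) { keyval(unlist(strsplit(v, " ")), 1) } 
wc.reduce <- function(k, v) { keyval(k, sum(v)) } 
out <- mapreduce(input = to.dfs(c("hello world", "don't worry be happy")), 
                 map = wc.map, reduce = wc.reduce) 
from.dfs(out)                                  # key/value pairs with per-word counts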
Cheating in MapReduce: 
Do everything possible to have 
map only jobs 
Avg Tips per Person – Naïve Input 
Gwen 1 
Jeff 2 
Leon 1 
Gwen 2.5 
Leon 3 
Jeff 1 
Gwen 1 
Gwen 2 
Jeff 1.5 
Avg Tips per Person - Naive 
avg.map <- function(k,v){keyval(v$V1,v$V2)} 
avg.reduce <- function(k,v) { keyval(k,mean(v)) } 
mapreduce(input="~/hadoop-recipes/data/tip1.txt", 
  output="~/avg.txt", 
  input.format=make.input.format("csv"), 
  output.format="text", 
  map=avg.map, reduce=avg.reduce)
Avg Tips per Person – Awesome Input 
Gwen 1,2.5,1,2 
Jeff 2,1,1.5 
Leon 1,3 
Avg Tips per Person - Optimized 
avg2.map <- function(k,v) { 
  v1 <- sapply(v$V2, function(x){ strsplit(as.character(x)," ") }) 
  keyval(v$V1, sapply(v1, function(x){ mean(as.numeric(x)) })) 
} 
mapreduce(input="~/hadoop-recipes/data/tip2.txt", 
  output="~/avg2.txt", 
  input.format=make.input.format("csv",sep=","), 
  output.format="text", map=avg2.map)
Few Final RMR Tips 
• Backend = “local” has files as input and output 
• Backend = “hadoop” uses HDFS directories 
• In “hadoop” mode, print(X) inside the mapper will fail the job 
• Use cat("ERROR!", file = stderr()) instead
Recommended Reading 
• http://cran.r-project.org/doc/manuals/R-intro.html 
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html 
• http://had.co.nz/reshape/paper-dsc2005.pdf 
• http://seananderson.ca/2013/12/01/plyr.html 
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md 
• http://cran.r-project.org/web/packages/data.table/index.html