June 2017 Yanbo Liang Apache Spark committer Hortonworks SparkR best practices for R data scientists
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
  2. Outline
     - Introduction to R and SparkR.
     - Typical data science workf...
  4. R for data scientists
     - Pros
       - Open source.
       - Rich ecosystem of packa...
  5. SparkR = Spark + R
     - An R frontend for Apache Spark, a widely deployed cluster computing engine.
     - Wrappers over DataFrames and DataFrame-based APIs (MLlib).
       - A complete DataFrame API that behaves just like an R data.frame.
       - ML APIs mimic the methods implemented in R or R packages, rather than the Scala/Python APIs.
     - The data frame concept is the cornerstone of both Spark and R.
     - Convenient interoperability between R and Spark DataFrames.
  6. SparkR architecture
  8. Data science workflow
     Source: R for Data Science (http://r4ds.had.co.nz/)
  9. Why SparkR + R
     - There are thousands of community packages on CRAN.
       - It is impossible for SparkR to match all existing features.
     - Not every dataset is large.
       - Many people work with small/medium datasets.
  11. Big data, small learning
      Diagram: Table1, Table2, Table3, Table4, Table5 -> join -> select/where/aggregate/sample -> collect -> model/analytics
      Diagram labels: SparkR, R
  12. Data wrangle with SparkR
      Operation/Transformation                                     Function
      Join different data sources or tables                        join
      Pick observations by their value                             filter/where
      Reorder the rows                                             arrange
      Pick variables by their names                                select
      Create new variable with functions of existing variables     mutate/withColumn
      Collapse many values down to a single summary                summary/describe
      Aggregation                                                  groupBy
  13. Data wrangle
      airlines <- read.df(path="/data/2008.csv", source="csv", header="true", inferSchema="true")
      planes <- read.df(path="/data/plane-data.csv", source="csv", header="true", inferSchema="true")
      joined <- join(airlines, planes, airlines$TailNum == planes$tailnum)
      df1 <- select(joined, "aircraft_type", "Distance", "ArrDelay", "DepDelay")
      df2 <- dropna(df1)
  14. SparkR performance
  15. Sampling Algorithms
      - Bernoulli sampling (without replacement)
        - df3 <- sample(df2, FALSE, 0.1)
      - Poisson sampling (with replacement)
        - df3 <- sample(df2, TRUE, 0.1)
      - Stratified sampling
        - df3 <- sampleBy(df2, "aircraft_type", list("Fixed Wing Multi-Engine"=0.1, "Fixed Wing Single-Engine"=0.2, "Rotorcraft"=0.3), 0)
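To make the difference between the first two schemes concrete, here is a minimal base R sketch on a local data frame. It is not how SparkR implements `sample`; `local_df`, `bernoulli_sample`, and `poisson_sample` are hypothetical names introduced only for this illustration.

```r
# Local illustration only: base R stand-ins for the two sampling schemes.
set.seed(42)
local_df <- data.frame(id = 1:1000)

# Bernoulli sampling: each row is kept independently with probability `fraction`,
# so a row can appear at most once (without replacement).
bernoulli_sample <- function(df, fraction) {
  keep <- runif(nrow(df)) < fraction
  df[keep, , drop = FALSE]
}

# Poisson sampling: each row is repeated k times with k ~ Poisson(fraction),
# so a row may appear more than once (with replacement).
poisson_sample <- function(df, fraction) {
  times <- rpois(nrow(df), lambda = fraction)
  df[rep(seq_len(nrow(df)), times = times), , drop = FALSE]
}

s1 <- bernoulli_sample(local_df, 0.1)
s2 <- poisson_sample(local_df, 0.1)
```

In both cases the fraction is an expected rate rather than an exact size, so the number of sampled rows varies from run to run.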
  18. Distributed dataset to local
  19. Partition aggregate
      - User Defined Functions (UDFs).
        - dapply
        - gapply
      - Parallel execution of function.
        - spark.lapply
  20. User Defined Functions (UDFs)
      - dapply
      - gapply
  21. dapply
      > schema <- structType(structField("aircraft_type", "string"),
      +                      structField("Distance", "integer"),
      +                      structField("ArrDelay", "integer"),
      +                      structField("DepDelay", "integer"),
      +                      structField("DepDelayS", "integer"))
      > df4 <- dapply(df3, function(x) { x <- cbind(x, x$DepDelay * 60L) }, schema)
      > head(df4)
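The function given to dapply receives each partition as a local R data.frame and returns a data.frame whose columns line up with the declared schema. A rough way to check such a function before running it on a cluster is to call it on a small hand-made data.frame. In the sketch below, `partition` and `add_scaled_delay` are hypothetical names and the rows are invented.

```r
# A hand-made stand-in for one partition; the values are invented for illustration.
partition <- data.frame(
  aircraft_type = c("Rotorcraft", "Fixed Wing Multi-Engine"),
  Distance = c(120L, 810L),
  ArrDelay = c(5L, -3L),
  DepDelay = c(2L, 10L),
  stringsAsFactors = FALSE
)

# Same body as the function passed to dapply: append DepDelay multiplied by 60.
add_scaled_delay <- function(x) {
  cbind(x, x$DepDelay * 60L)
}

result <- add_scaled_delay(partition)

# Locally the fifth column gets an auto-generated name; with dapply the
# schema supplies the column names, so it is renamed here to match (DepDelayS).
names(result)[5] <- "DepDelayS"
print(result)
```

Using the integer literal `60L` keeps the new column an integer, which is what the schema declares for DepDelayS.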
  22. gapply
      > schema <- structType(structField("Distance", "integer"),
      +                      structField("MaxActualDelay", "integer"))
      > df5 <- gapply(df3, "Distance",
      +               function(key, x) { y <- data.frame(key, max(x$ArrDelay - x$DepDelay)) },
      +               schema)
      > head(df5)
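The grouping behaviour of gapply can be approximated locally with `split` and `lapply`, which helps when debugging the per-group function. This is only an emulation under simplifying assumptions: the key is passed as a plain integer, the rows are invented, and `local_df` and `max_actual_delay` are hypothetical names.

```r
# Invented rows standing in for df3; only the columns the function uses are included.
local_df <- data.frame(
  Distance = c(100L, 100L, 250L, 250L, 250L),
  ArrDelay = c(10L, 4L, 0L, 30L, 7L),
  DepDelay = c(2L, 5L, 1L, 10L, 7L)
)

# Same body as the function passed to gapply: one output row per key.
max_actual_delay <- function(key, x) {
  data.frame(key, max(x$ArrDelay - x$DepDelay))
}

# Local emulation of the grouping step: split by the key column, apply, then bind.
groups <- split(local_df, local_df$Distance)
pieces <- lapply(names(groups), function(k) {
  max_actual_delay(as.integer(k), groups[[k]])
})
result <- do.call(rbind, lapply(pieces, setNames, c("Distance", "MaxActualDelay")))
print(result)
```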
  23. spark.lapply
      - An ideal way to distribute existing R functionality and packages.
  24. spark.lapply
      for (lambda in c(0.5, 1.5)) {
        for (alpha in c(0.1, 0.5, 1.0)) {
          model <- glmnet(A, b, lambda=lambda, alpha=alpha)
          pred <- predict(model, A)
          c(coef(model), auc(pred, b))
        }
      }
  25. spark.lapply
      # A list keeps each (lambda, alpha) pair intact; nested c() calls would flatten into one vector.
      values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
                     c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
      train <- function(value) {
        lambda <- value[1]
        alpha <- value[2]
        model <- glmnet(A, b, lambda=lambda, alpha=alpha)
        pred <- predict(model, A)  # predictions used by auc(), as in the loop version
        c(coef(model), auc(pred, b))
      }
      models <- spark.lapply(values, train)
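One detail worth checking when building the parameter grid: in R, nested `c()` calls flatten into a single vector, whereas a list keeps each (lambda, alpha) pair together as one element. The sketch below shows the difference and builds the grid with `expand.grid`. Base `lapply` is used as a local stand-in for `spark.lapply`, and `describe` is a hypothetical placeholder for the real training function.

```r
# Nested c() calls flatten into a single numeric vector of length 12.
flat <- c(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
          c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
length(flat)

# A list keeps each (lambda, alpha) pair together as one element.
grid <- expand.grid(lambda = c(0.5, 1.5), alpha = c(0.1, 0.5, 1.0))
values <- lapply(seq_len(nrow(grid)), function(i) {
  c(grid$lambda[i], grid$alpha[i])
})
length(values)

# Placeholder for the real training function: it only echoes its inputs.
describe <- function(value) {
  sprintf("lambda=%.1f alpha=%.1f", value[1], value[2])
}

# Base lapply as a local stand-in; spark.lapply(values, describe) takes the same two arguments.
results <- lapply(values, describe)
```

Trying the function with plain `lapply` first is a cheap way to catch mistakes in the grid or the function body before distributing the work.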
  26. spark.lapply
      Diagram: Driver with lambda = c(0.5, 1.5) and alpha = c(0.1, 0.5, 1.0); six executors.
  27. spark.lapply
      Diagram: Driver and six executors, one per (lambda, alpha) pair: (0.5, 0.1), (1.5, 0.1), (0.5, 0.5), (0.5, 1.0), (1.5, 1.0), (1.5, 0.5).
  28. Virtual environment
      Diagram: Driver and six executors, each labelled (glmnet).
  29. Virtual environment
      download.packages("glmnet", packagesDir, repos = "https://cran.r-project.org")
      filename <- list.files(packagesDir, "^glmnet")
      packagesPath <- file.path(packagesDir, filename)
      spark.addFile(packagesPath)
  30. Virtual environment
      # A list keeps each (lambda, alpha) pair intact; nested c() calls would flatten into one vector.
      values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
                     c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
      train <- function(value) {
        # Install glmnet on the executor from the file shipped with spark.addFile.
        path <- spark.getSparkFiles(filename)
        install.packages(path, repos = NULL, type = "source")
        library(glmnet)
        lambda <- value[1]
        alpha <- value[2]
        model <- glmnet(A, b, lambda=lambda, alpha=alpha)
        pred <- predict(model, A)  # predictions used by auc(), as in the loop version
        c(coef(model), auc(pred, b))
      }
      models <- spark.lapply(values, train)
  31. Large scale machine learning
  32. Large scale machine learning
  33. Large scale machine learning
      > model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type,
      +              family = "gaussian", data = df3)
      > summary(model)
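The call above uses the same formula and `family` arguments as base R's `glm`, so the model specification can be tried on a small local data.frame first. The following sketch uses base R only, with invented data in which ArrDelay is an exact linear function of the predictors. `local_df` and `local_model` are hypothetical names, and the categorical `aircraft_type` term is left out to keep the example small.

```r
# Invented local data: ArrDelay is an exact linear function of DepDelay and Distance.
local_df <- data.frame(
  DepDelay = c(0, 5, 10, 15, 20, 25),
  Distance = c(100, 300, 200, 500, 400, 600)
)
local_df$ArrDelay <- 1 + 2 * local_df$DepDelay + 0.01 * local_df$Distance

# Base R glm with the same formula style and family argument as the call above.
local_model <- glm(ArrDelay ~ DepDelay + Distance, family = "gaussian", data = local_df)
coefs <- coef(local_model)
print(round(coefs, 4))
```

Because the data are noise-free, the fitted coefficients should recover the intercept and slopes used to generate ArrDelay.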
  35. Future directions
      - Improve collect/createDataFrame performance in SparkR (SPARK-18924).
      - More scalable machine learning algorithms from MLlib.
      - Better R formula support.
      - Improve UDF performance.
