SparkR Best Practices for R Data Scientists

Spark plays an important role in helping data scientists solve all kinds of problems, especially since the release of SparkR, which provides friendly APIs for traditional data scientists. However, varying data sizes, data formats, and models lead to application patterns that differ from traditional R. In this talk, we illustrate practical experience using SparkR to solve typical data science problems: improving performance when SparkR and native R interoperate, efficiently loading data from HBase (a very common data source), scheduling a large-scale machine learning job built from many single-machine R learning jobs, tuning performance for jobs triggered by many different users, using SparkR in cloud-based environments, and more. Finally, we briefly introduce the community efforts in progress for SparkR in upcoming releases.

Speakers:
Yanbo Liang, Software Engineer, Hortonworks
Casey Stella, Principal Software Engineer/Data Scientist, Hortonworks


  1. SparkR best practices for R data scientists. Yanbo Liang, Apache Spark committer, Hortonworks. June 2017.
  2. Outline
     • Introduction to R and SparkR.
     • Typical data science workflow.
     • SparkR + R for typical data science problems.
       – Big data, small learning.
       – Partition aggregate.
       – Large scale machine learning.
     • Future directions.
  3. Outline (repeated as a section divider).
  4. R for data scientists
     • Pros
       – Open source.
       – Rich ecosystem of packages.
       – Powerful visualization infrastructure.
       – Data frames make data manipulation convenient.
       – Taught by many schools to statistics and computer science students.
     • Cons
       – Single threaded.
       – Everything has to fit in a single machine's memory.
  5. SparkR = Spark + R
     • An R frontend for Apache Spark, a widely deployed cluster computing engine.
     • Wrappers over DataFrames and DataFrame-based APIs (MLlib).
       – A complete DataFrame API that behaves just like an R data.frame.
       – ML APIs mimic the methods implemented in R or R packages, rather than the Scala/Python APIs.
     • The data frame concept is the cornerstone of both Spark and R.
     • Convenient interoperability between R and Spark DataFrames (see the sketch below).
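     To make that interoperability concrete, here is a minimal sketch (not from the slides) of moving data in both directions; it assumes a Spark 2.x SparkR session and uses the faithful dataset that ships with base R:

         library(SparkR)
         sparkR.session()

         # Promote a local R data.frame to a distributed SparkDataFrame.
         df <- as.DataFrame(faithful)
         head(df)

         # Filter on the Spark side, then pull the small result back into
         # a plain R data.frame for use with any CRAN package.
         local_df <- collect(filter(df, df$waiting > 70))
         class(local_df)  # "data.frame"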
  6. SparkR architecture.
  7. Outline (repeated as a section divider).
  8. Data science workflow. [workflow diagram from R for Data Science, http://r4ds.had.co.nz/]
  9. Why SparkR + R
     • There are thousands of community packages on CRAN.
       – It is impossible for SparkR to match all existing features.
     • Not every dataset is large.
       – Many people work with small/medium datasets.
  10. Outline (repeated as a section divider).
  11. Big data, small learning. [pipeline diagram: Table1 ... Table5 → join → select/where/aggregate/sample → collect → model/analytics; the wrangling runs in SparkR, the modeling in R]
  12. Data wrangling with SparkR

         Operation/Transformation                                    Function
         Join different data sources or tables                       join
         Pick observations by their value                            filter / where
         Reorder the rows                                            arrange
         Pick variables by their names                               select
         Create new variables from functions of existing variables   mutate / withColumn
         Collapse many values down to a single summary               summary / describe
         Aggregation                                                 groupBy
  13. Data wrangling

         airlines <- read.df(path = "/data/2008.csv", source = "csv",
                             header = "true", inferSchema = "true")
         planes <- read.df(path = "/data/plane-data.csv", source = "csv",
                           header = "true", inferSchema = "true")
         joined <- join(airlines, planes, airlines$TailNum == planes$tailnum)
         df1 <- select(joined, "aircraft_type", "Distance", "ArrDelay", "DepDelay")
         df2 <- dropna(df1)
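     A quick sanity check of the wrangled result (not on the slide; printSchema and count are standard SparkR helpers):

         printSchema(df2)  # column names and types after the join/select
         count(df2)        # rows remaining once NAs are dropped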
  14. SparkR performance.
  15. Sampling algorithms
     • Bernoulli sampling (without replacement)
         df3 <- sample(df2, FALSE, 0.1)
     • Poisson sampling (with replacement)
         df3 <- sample(df2, TRUE, 0.1)
     • Stratified sampling
         df3 <- sampleBy(df2, "aircraft_type",
                         list("Fixed Wing Multi-Engine" = 0.1,
                              "Fixed Wing Single-Engine" = 0.2,
                              "Rotorcraft" = 0.3), 0)
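     One way to verify the stratified sample (not on the slide) is to look at the realized per-stratum counts:

         # Realized stratum sizes after sampleBy; compare against df2 to
         # confirm the requested fractions were (approximately) honored.
         head(count(groupBy(df3, "aircraft_type")))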
  16. Big data, small learning. [the pipeline diagram from slide 11, repeated as a recap]
  17. Big data, small learning. [the same pipeline diagram, relabeled: the Spark side works on a SparkDataFrame, the R side on a local data.frame]
  18. Distributed dataset to local.
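     The slide's illustration is not in this transcript, but the idea is that after Spark-side wrangling and sampling, collect() materializes the now-small SparkDataFrame as a local R data.frame, so any single-machine R tooling applies. A minimal sketch, assuming df3 from the sampling step fits in driver memory:

         # Bring the down-sampled SparkDataFrame to the driver.
         local3 <- collect(df3)

         # From here on it is plain R, e.g. a local linear model.
         fit <- lm(ArrDelay ~ DepDelay + Distance, data = local3)
         summary(fit)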
  19. Partition aggregate
     • User Defined Functions (UDFs).
       – dapply
       – gapply
     • Parallel execution of a function.
       – spark.lapply
  20. User Defined Functions (UDFs)
     • dapply
     • gapply
  21. dapply

         > schema <- structType(structField("aircraft_type", "string"),
                                structField("Distance", "integer"),
                                structField("ArrDelay", "integer"),
                                structField("DepDelay", "integer"),
                                structField("DepDelayS", "integer"))
         > df4 <- dapply(df3, function(x) {
             # Append the departure delay in seconds as the new DepDelayS column.
             x <- cbind(x, x$DepDelay * 60L)
           }, schema)
         > head(df4)
  22. gapply

         > schema <- structType(structField("Distance", "integer"),
                                structField("MaxActualDelay", "integer"))
         > df5 <- gapply(df3, "Distance", function(key, x) {
             # For each Distance group, compute the worst net delay.
             y <- data.frame(key, max(x$ArrDelay - x$DepDelay))
           }, schema)
         > head(df5)
  23. spark.lapply
     • An ideal way to distribute existing R functionality and packages.
  24. spark.lapply

         # Sequential grid search over glmnet hyperparameters on one machine.
         for (lambda in c(0.5, 1.5)) {
           for (alpha in c(0.1, 0.5, 1.0)) {
             model <- glmnet(A, b, lambda = lambda, alpha = alpha)
             preds <- predict(model, A)
             c(coef(model), auc(preds, b))
           }
         }
  25. spark.lapply

         # The same grid trained in parallel: one (lambda, alpha) pair per task.
         values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
                        c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
         train <- function(value) {
           lambda <- value[1]
           alpha <- value[2]
           model <- glmnet(A, b, lambda = lambda, alpha = alpha)
           preds <- predict(model, A)
           c(coef(model), auc(preds, b))
         }
         models <- spark.lapply(values, train)
  26. spark.lapply. [diagram: the driver holds lambda = c(0.5, 1.5) and alpha = c(0.1, 0.5, 1.0) and fans the combinations out to six executors]
  27. spark.lapply. [diagram: each executor trains one (lambda, alpha) pair: (0.5, 0.1), (0.5, 0.5), (0.5, 1.0), (1.5, 0.1), (1.5, 0.5), (1.5, 1.0)]
  28. Virtual environment. [diagram: each of the six executors needs the glmnet package available locally]
  29. Virtual environment

         # On the driver: download the package source and ship it to executors.
         download.packages("glmnet", packagesDir, repos = "https://cran.r-project.org")
         filename <- list.files(packagesDir, "^glmnet")
         packagesPath <- file.path(packagesDir, filename)
         spark.addFile(packagesPath)
  30. Virtual environment

         values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
                        c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
         train <- function(value) {
           # On each executor: install the shipped package before using it.
           path <- spark.getSparkFiles(filename)
           install.packages(path, repos = NULL, type = "source")
           library(glmnet)
           lambda <- value[1]
           alpha <- value[2]
           model <- glmnet(A, b, lambda = lambda, alpha = alpha)
           preds <- predict(model, A)
           c(coef(model), auc(preds, b))
         }
         models <- spark.lapply(values, train)
  31. Large scale machine learning.
  32. Large scale machine learning (continued).
  33. Large scale machine learning

         > model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type,
                        family = "gaussian", data = df3)
         > summary(model)
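     Not on the slide, but a natural next step: the fit is a distributed MLlib model, so scoring also runs on the cluster, and the model can be persisted with SparkR's write.ml/read.ml (the path below is a placeholder):

         # Score on the cluster; the result is a SparkDataFrame with a
         # "prediction" column.
         preds <- predict(model, df3)
         head(select(preds, "ArrDelay", "prediction"))

         # Save the fitted model and load it back later.
         write.ml(model, "/tmp/glm-arrdelay")
         model2 <- read.ml("/tmp/glm-arrdelay")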
  34. Outline (repeated as a section divider).
  35. Future directions
     • Improve collect/createDataFrame performance in SparkR (SPARK-18924).
     • More scalable machine learning algorithms from MLlib.
     • Better R formula support.
     • Improve UDF performance.
  36. SparkR best practices for R data scientists. Yanbo Liang, Apache Spark committer, Hortonworks. June 2017.
