
Parallelizing Existing R Packages with SparkR

R is the latest language added to Apache Spark, and the SparkR API differs slightly from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data through a family of apply() functions. In this talk, Hossein Falaki gives an overview of this new functionality in SparkR and explains the changes it requires to regular R code, for example when using dapply(). The talk focuses on how to use this API correctly to parallelize existing R packages, with particular attention to performance and correctness when using the apply family of functions in SparkR.

Speaker: Hossein Falaki

This talk was originally presented at Spark Summit East 2017.



  1. Parallelizing Existing R Packages with SparkR. Hossein Falaki (@mhfalaki)
  2. About me • Former Data Scientist at Apple Siri • Software Engineer at Databricks • Using Apache Spark since version 0.6 • Developed the first version of the Apache Spark CSV data source • Worked on SparkR & Databricks R Notebook features
  3. What is SparkR? An R package distributed with Apache Spark: - Provides an R frontend to Spark - Exposes Spark DataFrames (inspired by R and Pandas) - Convenient interoperability between R and Spark DataFrames. Spark contributes distributed/robust processing, data sources, and off-memory data structures; R contributes a dynamic environment, interactivity, packages, and visualization.
  4. SparkR architecture (diagram): the R process on the driver talks to the driver JVM through the RBackend; the driver JVM coordinates the JVM workers, which read from the data sources.
  5. SparkR architecture since 2.0 (diagram): same as before, except each JVM worker now also launches R processes to run user code.
  6. Overview of the SparkR API: IO (read.df / write.df / createDataFrame / collect); Caching (cache / persist / unpersist / cacheTable / uncacheTable); SQL (sql / table / saveAsTable / registerTempTable / tables); MLlib (glm / kmeans / naïve Bayes / survival regression); DataFrame API (select / subset / groupBy / head / avg / column / dim); UDF functionality since 2.0 (spark.lapply / dapply / gapply / dapplyCollect). Docs: http://spark.apache.org/docs/latest/api/R/
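
     As a rough illustration of the DataFrame and MLlib portions of this API, here is a minimal sketch; the file path and column names are hypothetical, not from the talk:

        library(SparkR)
        sparkR.session()   # start the SparkR session (Spark 2.0+)

        # Hypothetical CSV input and columns, for illustration only
        flights <- read.df("/data/flights.csv", source = "csv",
                           header = "true", inferSchema = "true")

        # DataFrame API: filter, group, aggregate
        delayed   <- filter(flights, flights$dep_delay > 15)
        byCarrier <- summarize(groupBy(delayed, delayed$carrier),
                               avg_delay = avg(delayed$dep_delay))
        head(byCarrier)

        # MLlib from R: a generalized linear model
        model <- glm(dep_delay ~ distance, data = flights, family = "gaussian")
        summary(model)
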
  7. SparkR UDF API: spark.lapply() runs a function over a list of elements; dapply() and dapplyCollect() apply a function to each partition of a SparkDataFrame; gapply() and gapplyCollect() apply a function to each group within a SparkDataFrame.
  8. spark.lapply: the simplest SparkR UDF pattern. For each element of a list it (1) sends the function to an R worker, (2) executes the function, and (3) returns the results of all workers as a list to the R driver. Example: spark.lapply(1:100, function(x) { runBootstrap(x) })
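
     A slightly fuller sketch of this pattern; runBootstrap() here is a hypothetical helper standing in for any self-contained R computation:

        library(SparkR)
        sparkR.session()

        # Hypothetical helper: one bootstrap replicate, using only base R
        # and the built-in mtcars data set
        runBootstrap <- function(seed) {
          set.seed(seed)
          mean(sample(mtcars$mpg, replace = TRUE))
        }

        # Run 100 independent replicates on the cluster; the results come
        # back to the R driver as an ordinary list
        results <- spark.lapply(1:100, runBootstrap)
        mean(unlist(results))
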
  9. spark.lapply control flow (diagram): (1) the R driver serializes the R closure and (2) transfers it over a local socket to the driver JVM, which (3) transfers the serialized closure over the network to a worker JVM, which (4) transfers it over a local socket to the R worker; the R worker (5) deserializes the closure, executes it, and (6) serializes the result, which is (7) transferred over the local socket to the worker JVM, (8) transferred over the network to the driver JVM, and (9) transferred over the local socket to the R driver, which (10) deserializes the result.
  10. dapply: for each partition of a Spark DataFrame it (1) collects the partition as an R data.frame, (2) sends the R function to the R worker, and (3) executes the function. dapply(sparkDF, func, schema) combines the results as a DataFrame with the provided schema; dapplyCollect(sparkDF, func) combines the results as an R data.frame.
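
     A minimal sketch of both variants on a SparkDataFrame built from the built-in mtcars data set; dapply() needs the output schema up front, while dapplyCollect() returns a local data.frame without one:

        library(SparkR)
        sparkR.session()

        df <- createDataFrame(mtcars)

        # The UDF receives one partition as an R data.frame and must return
        # a data.frame that matches the declared schema
        schema <- structType(structField("mpg", "double"),
                             structField("kml", "double"))
        converted <- dapply(df, function(pdf) {
          data.frame(mpg = pdf$mpg, kml = pdf$mpg * 0.425)
        }, schema)
        head(converted)

        # dapplyCollect: same function shape, but the results are combined
        # into a local R data.frame on the driver (no schema argument)
        local_result <- dapplyCollect(df, function(pdf) {
          data.frame(mpg = pdf$mpg, kml = pdf$mpg * 0.425)
        })
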
  11. dapply control & data flow (diagram): the serialized closure travels from the R driver to the R workers over a local socket, the cluster network, and the worker's local socket, as before; in addition, each partition's input data is serialized, transferred to the R worker, and deserialized, and the result data is serialized and transferred back into the worker JVM.
  12. dapplyCollect control & data flow (diagram): same as dapply, except each partition's result is transferred over the network back to the driver and deserialized there into a local R data.frame.
  13. gapply: groups a Spark DataFrame on one or more columns, then for each group it (1) collects the group as an R data.frame, (2) sends the R function to the R worker, and (3) executes the function. gapply(sparkDF, cols, func, schema) combines the results as a DataFrame with the provided schema; gapplyCollect(sparkDF, cols, func) combines the results as an R data.frame.
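
     A comparable sketch for gapply(); the user function receives the grouping key and that group's rows as an R data.frame:

        library(SparkR)
        sparkR.session()

        df <- createDataFrame(mtcars)

        # Average mpg per number of cylinders; result columns map to the
        # schema by position
        schema <- structType(structField("cyl", "double"),
                             structField("avg_mpg", "double"))
        avg_by_cyl <- gapply(df, "cyl", function(key, pdf) {
          data.frame(key, mean(pdf$mpg))
        }, schema)
        head(avg_by_cyl)

        # gapplyCollect returns the combined result as a local R data.frame
        local_avg <- gapplyCollect(df, "cyl", function(key, pdf) {
          data.frame(key, mean(pdf$mpg))
        })
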
  14. gapply control & data flow (diagram): like dapply, but the input data is first shuffled across the cluster so that rows with the same key end up together before being serialized and sent to the R workers; results are serialized and transferred back.
  15. dapply vs. gapply: gapply has signature gapply(df, cols, func, schema) or gapply(gdf, func, schema), its user function has signature function(key, data), and data partitioning is controlled by the grouping; dapply has signature dapply(df, func, schema), its user function has signature function(data), and data partitioning is not controlled.
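
     The second gapply() form in this comparison operates on an already-grouped SparkDataFrame; a brief sketch, continuing the mtcars example above:

        # Equivalent to gapply(df, "cyl", func, schema): group first, then apply
        gdf <- groupBy(df, "cyl")
        avg_by_cyl <- gapply(gdf, function(key, pdf) {
          data.frame(key, mean(pdf$mpg))
        }, schema)
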
  16. Parallelizing data • Do not use spark.lapply() to distribute large data sets • Do not pack data in the closure • Watch for skew in data: are partitions evenly sized? • Auxiliary data can be joined with the input DataFrame or distributed to all the workers (see the sketch below)
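
     For the auxiliary-data point, one common pattern is to turn a small lookup table into a SparkDataFrame and join it with the input instead of capturing it in the UDF closure; a sketch with hypothetical table and column names:

        # Small lookup table built on the driver (hypothetical contents)
        carrier_names <- data.frame(carrier      = c("AA", "DL"),
                                    carrier_name = c("American", "Delta"))

        # Distribute it as a SparkDataFrame and join it with the input
        # DataFrame (the flights DataFrame from the earlier sketch) rather
        # than packing the local data.frame into the closure
        lookup <- createDataFrame(carrier_names)
        joined <- merge(flights, lookup, by = "carrier")
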
  17. Packages on workers • SparkR closure capture does not include packages • You need to import packages on each worker inside your function • If not already installed, install packages on the workers out-of-band • spark.lapply() can be used to install packages (see the sketch below)
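
     A sketch of both points: installing a package on the workers out-of-band with spark.lapply(), and loading it inside the function that runs on the workers (the forecast package is only a placeholder):

        # One-time, best-effort install on the workers; the 1:16 list simply
        # creates enough tasks to reach the executors
        invisible(spark.lapply(1:16, function(i) {
          if (!requireNamespace("forecast", quietly = TRUE)) {
            install.packages("forecast", repos = "https://cran.r-project.org")
          }
          TRUE
        }))

        # Load the package inside the UDF body; library() calls made on the
        # driver are not captured by the closure
        result <- dapply(df, function(pdf) {
          library(forecast)
          # ... use forecast functions on this partition's data ...
          pdf
        }, schema(df))
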
  18. Debugging user code: (1) verify your code on the driver; (2) interactively execute the code on the cluster (when an R worker fails, the Spark driver throws an exception containing the R error text); (3) inspect the failure reason of the failed job in the Spark UI; (4) inspect the stdout/stderr of the workers.
  19. Demo: http://bit.ly/2krYMwC and http://bit.ly/2ltLVKs
  20. Thank you!
