Vectorized R Execution in Apache Spark

Apache Spark already applies vectorization in many operations, for instance the internal columnar format, Parquet/ORC vectorized reads, and Pandas UDFs. Vectorization generally improves performance greatly. This talk discusses the performance of SparkR and introduces vectorization in SparkR with technical details. SparkR vectorization lets users keep their existing code as is while running several thousand percent faster when they execute R native functions or convert a Spark DataFrame to or from an R DataFrame.
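
The "several thousand percent" figure can be checked against the timings quoted on the two benchmark slides in the transcript. The sketch below is only arithmetic on those published numbers. It assumes that "N percent faster" means (speedup - 1) × 100, which is one common reading and is not stated in the talk. The function names are hypothetical, and because the flattened slide text does not show which operation label belongs to which pair of timings, each pair is keyed by the speedup the slide quotes.

```python
# Timings in seconds, copied from the two benchmark slides in slide order.
# Keys are the speedups quoted on the slides; operation labels are left out
# because the flattened transcript does not pair them with the timings.
timings = {
    "17x": (20.85112, 1.224419),
    "42x": (240.50508, 5.707062),
    "43x": (699.0714, 16.2713),
    "33x": (202.36236, 6.222105),
}


def speedup(no_arrow_secs, arrow_secs):
    """Ratio of the time without Arrow to the time with Arrow."""
    return no_arrow_secs / arrow_secs


def percent_faster(no_arrow_secs, arrow_secs):
    """Assumed reading of 'percent faster': (speedup - 1) * 100."""
    return (speedup(no_arrow_secs, arrow_secs) - 1.0) * 100.0


for quoted, (no_arrow, arrow) in timings.items():
    print(
        f"quoted {quoted}: {speedup(no_arrow, arrow):.1f}x, "
        f"{percent_faster(no_arrow, arrow):.0f}% faster"
    )
```

Under that reading, the computed ratios round to the quoted 17x, 42x, 43x, and 33x, which corresponds to roughly 1,600% to 4,200% faster.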

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Hyukjin Kwon, Databricks: Vectorized R Execution in Apache Spark #UnifiedDataAnalytics #SparkAISummit
  3. Hyukjin Kwon • Apache Spark PMC and Committer • Koalas committer • PySpark, SparkSQL, SparkR, build, etc. • Active in Spark dev @HyukjinKwon
  4. Agenda • SparkR and R interaction • Native Implementation • Apache Arrow • Vectorized Implementation • Future Work
  5. SparkR and R interaction
  6. Why? Scala API | R API | Cool!
  7. Why? Scala API | R API | 12.5x slower … ?
  8. Why? Scala API | R API | 40x slower … ???
  9. createDataFrame: Create a Spark DataFrame from an R DataFrame and lists.
  10. collect: Collect an R DataFrame from a Spark DataFrame at the driver.
  11. dapply: Apply an R native function to each partition.
  12. gapply: Apply an R native function to each group.
  13. Native Implementation
  14. SparkR Architecture (diagram labels: Spark Driver, JVM, RBackend, R, DataSources)
  15. Driver implementation (diagram labels: R, JVM Backend) 1. RBackend opens a server port and waits for connections. 2. R establishes the socket connections. 3. Each SparkR call sends serialized data over the socket and waits for a response. 4. RBackendHandler handles and processes requests. It sends results back row by row.
  16. createDataFrame and collect (diagram labels: DataFrame, R Data Frame, Rows to Array(Array(…)), list(list(…)), parallelize(…), row, row, ..., Bytes to rows, Bytes to lists, data.frame(…))
  17. Worker implementation (diagram labels: JVM, R) 1. RRunner sends data and a serialized R function through a socket. 2. R receives the serialized function and data. 3. R deserializes the function and the data row by row. 4. R executes the function and sends the results back to RRunner.
  18. dapply and gapply (diagram labels: RRunner, PhysicalOperator, row by row, Invoke R function, serialize row by row, deserialize row by row, row, row, ...)
  19. Apache Arrow
  20. Apache Arrow: A cross-language development platform for in-memory data. Columnar In-Memory. SparkR supports Arrow 0.12.1+(?)
  21. Vectorization https://www.slideshare.net/Hadoop_Summit/the-columnar-roadmap-apache-parquet-and-apache-arrow-102997214 SIMD, Pipelining https://medium.com/wasmer/webassembly-and-simd-13badb9bf1a8
  22. Interchangeable, no copy. Without Arrow: • Each system has its own internal memory format • 70-80% computation wasted on (de)serialization • Similar functionality implemented in multiple projects. With Arrow: • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (e.g., Parquet-to-Arrow reader)
  23. Serialization and deserialization https://wesmckinney.com/blog/arrow-streaming-columnar/ See also Arrow format https://sapbr.com/2016/08/15/dictionary-encoding/
  24. "Portable" Data Frames: Share data and algorithm at ~zero cost https://www.slideshare.net/wesm/apache-arrow-at-dataengconf-barcelona-2018
  25. Vectorized Implementation
  26. createDataFrame and collect • Use Arrow to serialize/deserialize data • Streaming format for interprocess messaging / communication (IPC) • ArrowWriter and ArrowColumnVector • JVM and R worker communicate via socket • createDataFrame in SQLContext.R • readArrowStreamFromFile in SQLUtils.scala • collect in DataFrame.R • collectAsArrowToR in Dataset.scala
  27. createDataFrame and collect (diagram labels: DataFrame, R Data Frame, Rows to Array(Array(…)), list(list(…)), parallelize(…), row, row, ..., Bytes to rows, Bytes to lists, data.frame(…))
  28. createDataFrame and collect (diagram labels: DataFrame, R Data Frame, Arrow batches, Arrow batches to Spark DataFrame, Spark DataFrame to Arrow batches)
  29. Benchmark: collect | No Arrow: 20.85112 secs, Arrow: 1.224419 secs, 17x faster | No Arrow: 240.50508 secs, Arrow: 5.707062 secs, 42x faster | createDataFrame
  30. dapply and gapply • Use Arrow to serialize/deserialize data • Streaming format for interprocess messaging / communication (IPC) • ArrowWriter and ArrowColumnVector • JVM and R worker communicate via socket • ArrowRRunner • Physical operators for each R native function execution • MapPartitionsInRWithArrowExec • FlatMapGroupsInRWithArrowExec
  31. dapply and gapply (diagram labels: RRunner, PhysicalOperator, row by row, Invoke R function, serialize row by row, deserialize row by row, row, row, ...)
  32. dapply and gapply (diagram labels: ArrowRRunner, PhysicalOperator, group of rows, Invoke R function, serialize Arrow batches, deserialize Arrow batches, Arrow batches)
  33. Benchmark: gapply | No Arrow: 699.0714 secs, Arrow: 16.2713 secs, 43x faster | No Arrow: 202.36236 secs, Arrow: 6.222105 secs, 33x faster | dapply
  34. Benchmark: Can’t believe it? Can’t wait to try it out yourself? Try it out here on a live Jupyter notebook: github.com/HyukjinKwon/spark-notebooks
  35. Future Work
  36. Apache Arrow: ARROW-4512. Actually, the vectorized implementation does not fully work in a streaming manner yet.
  37. dapplyCollect
  38. gapplyCollect
  39. dapplyCollect and gapplyCollect (diagram labels: ArrowRRunner, PhysicalOperator, group of rows, Invoke R function, serialize Arrow batches, deserialize Arrow batches, Arrow batches, ???) Schema is unknown before eager execution.
  40. dapplyCollect and gapplyCollect (diagram labels: ArrowRRunner, PhysicalOperator, Invoke R function, serialize Arrow batches, deserialize Arrow batches, DataFrame, R Data Frame, Arrow batches, By-passed Arrow batches) By-pass with DataFrame<binary>
  41. That’s it! Special thanks to Felix Cheung, Bryan Cutler, Liang-Chi Hsieh, Hossein Falaki, Takuya Ueshin, Yanbo Liang and, last but not least, the Apache Arrow community :D
  42. DON’T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
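
The transcript contrasts two data paths between the JVM and R: the native one, which serializes and deserializes row by row, and the vectorized one, which exchanges Arrow batches. The Python sketch below illustrates only that general idea, one small message per row versus one columnar message per batch. It is not Spark's wire format and it does not use Arrow. The two-column sample data, the byte layout, and all function names are assumptions made for illustration.

```python
import struct

# Hypothetical sample partition: (id, value) rows.
rows = [(i, i * 0.5) for i in range(1000)]


def serialize_row_by_row(rows):
    """Row-based path: one small message (int32 + float64) per row."""
    return [struct.pack("<id", row_id, value) for row_id, value in rows]


def deserialize_row_by_row(messages):
    """Decode each per-row message back into an (id, value) tuple."""
    return [struct.unpack("<id", message) for message in messages]


def serialize_batch(rows):
    """Batch path: a single message holding the row count, then each
    column stored contiguously (all ids, then all values)."""
    n = len(rows)
    ids = [row[0] for row in rows]
    values = [row[1] for row in rows]
    return struct.pack(f"<I{n}i{n}d", n, *ids, *values)


def deserialize_batch(message):
    """Read the row count, then decode each column in one call."""
    (n,) = struct.unpack_from("<I", message, 0)
    ids = struct.unpack_from(f"<{n}i", message, 4)
    values = struct.unpack_from(f"<{n}d", message, 4 + 4 * n)
    return list(zip(ids, values))


row_messages = serialize_row_by_row(rows)
batch_message = serialize_batch(rows)
print(len(row_messages), "row messages vs 1 batch message of",
      len(batch_message), "bytes")
```

Both paths carry the same data. The difference is the number of encode and decode calls: one per row in the first case, one per column per batch in the second. That per-row overhead is what the row-by-row steps in the transcript point to.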
