SparkR: Enabling Interactive Data Science at Scale
Shivaram Venkataraman
Zongheng Yang
Spark: Fast! Scalable. Flexible.
R: Statistics! Packages. Plots.
Outline 
SparkR API 
Live Demo 
Design Details
RDD: Parallel Collection
Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …
R + RDD = R2D2
R + RDD = RRDD
Operations: lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, filter, …
Also: broadcast, includePackage, textFile, parallelize
SparkR – R package for Spark
R → RRDD → Spark
Example: word_count.R

library(SparkR)
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
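The same flatMap → lapply → reduceByKey pipeline can be sketched in plain R, with no Spark cluster, to show what each step computes. The data and helper names below are illustrative only, not SparkR API:

```r
# Local, plain-R sketch of the word-count logic (no Spark needed).
lines <- c("to be or", "not to be")

# flatMap step: split every line into words
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# map step: pair each word with a count of 1
pairs <- lapply(words, function(word) list(word, 1L))

# reduceByKey step: sum the counts per word; tapply plays the
# role of reduceByKey on a single machine
counts <- tapply(sapply(pairs, function(p) p[[2]]),
                 sapply(pairs, function(p) p[[1]]),
                 FUN = sum)
counts[["to"]]  # 2
```

In SparkR the same three functions are shipped to executors and applied per partition; locally, `tapply` stands in for the distributed shuffle-and-reduce.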
Demo: Digit Classification (MNIST)
Minimize ||Ax − b||²
Closed-form solution: x = (AᵀA)⁻¹ Aᵀ b
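The closed-form solve above is easy to check in plain R on a small synthetic system; `A`, `b`, and the sizes here are made up for illustration, not the MNIST demo's actual data:

```r
# Verify x = (A^T A)^{-1} A^T b on a synthetic least-squares problem.
set.seed(42)
A <- matrix(rnorm(20), nrow = 10, ncol = 2)  # synthetic design matrix
x_true <- c(3, -1)
b <- A %*% x_true                            # exact right-hand side, so the fit is perfect

# Normal equations: solve (A^T A) x = A^T b
x_hat <- solve(t(A) %*% A, t(A) %*% b)

max(abs(x_hat - x_true))  # ~0, recovered to machine precision
```

In practice `qr.solve(A, b)` is numerically preferable to forming AᵀA explicitly, but the normal-equations form matches the slide.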
How does this work?
Dataflow

Local side: the R Spark Context talks over JNI to a Java Spark Context.
Worker side: each Spark Executor execs an R process to run the user's R functions.
[Image from http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/]
Pipelined RDD

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

Consecutive R transformations are pipelined: each Spark Executor execs a single R process that runs both functions over a partition, rather than one R process per operation.
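What pipelining means can be sketched in plain R: the two user functions are composed into one pass over a partition, so the intermediate `words` collection is never materialized between separate R processes. Names here are illustrative, not SparkR internals:

```r
# The two user functions from the word-count example.
flat_map_fn <- function(line) strsplit(line, " ")[[1]]
map_fn <- function(word) list(word, 1L)

# A "pipelined" function: both steps fused into one pass
# over a partition, as a single R process would run them.
pipelined <- function(part) {
  lapply(unlist(lapply(part, flat_map_fn)), map_fn)
}

out <- pipelined(c("hello world"))
# out is list(list("hello", 1L), list("world", 1L))
```

The design win is that data crosses the JVM-to-R boundary once per partition instead of once per operation.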
Alpha developer release
One-line install (install_github comes from the devtools package):
install_github("amplab-extras/SparkR-pkg", subdir="pkg")
SparkR Implementation
Very similar to PySpark; Spark is easy to extend.
329 lines of Scala code
2079 lines of R code
693 lines of test code in R
EC2 setup scripts 
All Spark examples 
MNIST demo 
YARN, Windows support 
Also on GitHub
Developer Community 
13 contributors 
(10 from outside AMPLab) 
Collaboration with Alteryx
On the Roadmap 
High level DataFrame API 
Integrating Spark's MLlib from R
Merge with Apache Spark
SparkR 
RDD → distributed lists
Run R on clusters 
Re-use existing packages 
Combine scalability & utility
SparkR 
https://github.com/amplab-extras/SparkR-pkg 
Shivaram Venkataraman shivaram@cs.berkeley.edu 
Zongheng Yang zongheng.y@gmail.com
SparkR mailing list sparkr-dev@googlegroups.com