SparkR: Enabling Interactive 
Data Science at Scale 
Shivaram Venkataraman 
Zongheng Yang
Fast!
Scalable
Flexible
Statistics!
Plots
Packages
Outline 
SparkR API 
Live Demo 
Design Details
RDD 
Parallel Collection 
Transformations 
map 
filter 
groupBy 
… 
Actions 
count 
collect 
saveAsTextFile 
…
R + RDD = 
R2D2
R + RDD = 
RRDD 
lapply 
lapplyPartition 
groupByKey 
reduceByKey 
sampleRDD 
collect 
cache 
filter 
… 
broadcast 
includePackage 
textFile 
parallelize
SparkR – R package for Spark

Stack: R → RRDD → Spark
Example: word_count.R

library(SparkR)

lines <- textFile(sc, "hdfs://my_text_file")

words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })

wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })

counts <- reduceByKey(wordCount, "+", 2L)

output <- collect(counts)
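Because the closures passed to flatMap() and lapply() are ordinary R functions, the per-record logic can be sanity-checked locally in plain R before running on a cluster. This is only a sketch: base R's tapply stands in for reduceByKey, and the variable names are illustrative.

```r
# Local sanity check of the word-count closures, using only base R.
lines <- c("to be or", "not to be")

# Same per-line and per-word functions as in word_count.R:
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))
pairs <- lapply(words, function(word) list(word, 1L))

# Base-R stand-in for reduceByKey(wordCount, "+", 2L):
keys   <- vapply(pairs, function(p) p[[1]], character(1))
vals   <- vapply(pairs, function(p) p[[2]], integer(1))
counts <- tapply(vals, keys, function(v) Reduce(`+`, v))

counts[["to"]]  # 2
```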
Demo: 
Digit Classification
MNIST
Given matrix A and vector b:

Minimize || Ax − b ||^2

x = (A^T A)^{-1} A^T b
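The closed form above comes from setting the gradient of the least-squares objective to zero; a one-step derivation (assuming A^T A is invertible):

```latex
\min_x \|Ax - b\|^2
\;\Rightarrow\;
\nabla_x \|Ax - b\|^2 = 2A^\top (Ax - b) = 0
\;\Rightarrow\;
A^\top A\, x = A^\top b
\;\Rightarrow\;
x = (A^\top A)^{-1} A^\top b
```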
How does this work?
Dataflow

(Diagram) Local side: the R Spark Context talks to a Java Spark Context
over JNI. Worker side: each Spark Executor launches an R process via exec
to run the user's R functions.
From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/
Pipelined RDD

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

(Diagram) Consecutive R transformations are pipelined: each Spark
Executor launches a single R process via exec and runs the fused
operations in one pass, instead of one exec per operation.
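As an illustration of what pipelining buys (plain base R, not SparkR's actual implementation; the function names here are invented for the sketch): the two closures can be composed and applied to a whole partition in a single pass by one R process, rather than shipping intermediate results back to the JVM between operations.

```r
# Illustration only: fusing two per-element R closures so a partition is
# processed in one pass by one R process.
flat_map_fn <- function(line) strsplit(line, " ")[[1]]
map_fn      <- function(word) list(word, 1L)

# Fused function that a single R worker could run over its partition:
pipelined <- function(partition) {
  words <- unlist(lapply(partition, flat_map_fn))
  lapply(words, map_fn)
}

pairs <- pipelined(c("hello world", "hello"))
length(pairs)  # 3 word/count pairs
```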
Alpha developer release 
One line install!

install_github("amplab-extras/SparkR-pkg", subdir="pkg")
SparkR Implementation 
Very similar to PySpark 
Spark is easy to extend 
329 lines of Scala code 
2079 lines of R code 
693 lines of test code in R
EC2 setup scripts 
All Spark examples 
MNIST demo 
YARN, Windows support 
Also on GitHub
Developer Community 
13 contributors 
(10 from outside AMPLab) 
Collaboration with Alteryx
On the Roadmap 
High level DataFrame API 
Integrating Spark's MLlib from R
Merge with Apache Spark
SparkR 
RDD → distributed lists
Run R on clusters 
Re-use existing packages 
Combine scalability & utility
SparkR 
https://github.com/amplab-extras/SparkR-pkg 
Shivaram Venkataraman shivaram@cs.berkeley.edu 
Zongheng Yang zongheng.y@gmail.com 
SparkR mailing list sparkr-dev@googlegroups.com
