SparkR: Enabling Interactive Data Science at Scale
- 8. R + RDD = RRDD
  RDD operations: lapply, lapplyPartition, groupByKey, reduceByKey, sampleRDD, collect, cache, filter, …
  Context functions: broadcast, includePackage, textFile, parallelize
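A minimal sketch of this API, assuming the standalone SparkR package described in this talk (where `sparkR.init()` creates the `sc` context used throughout; this will not run without a Spark installation, so treat the exact signatures as illustrative):

```r
library(SparkR)
sc <- sparkR.init(master = "local")       # R SparkContext

rdd   <- parallelize(sc, 1:100, 4L)       # distribute a local vector over 4 slices
sq    <- lapply(rdd, function(x) x * x)   # transformation: runs on the workers
evens <- filter(sq, function(x) x %% 2 == 0)
collect(evens)                            # action: bring the results back to R
```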
- 11. Example: word_count.R
  library(SparkR)
  lines <- textFile(sc, "hdfs://my_text_file")
  words <- flatMap(lines,
                   function(line) {
                     strsplit(line, " ")[[1]]
                   })
  wordCount <- lapply(words,
                      function(word) {
                        list(word, 1L)
                      })
- 12. Example: word_count.R
  library(SparkR)
  lines <- textFile(sc, "hdfs://my_text_file")
  words <- flatMap(lines,
                   function(line) {
                     strsplit(line, " ")[[1]]
                   })
  wordCount <- lapply(words,
                      function(word) {
                        list(word, 1L)
                      })
  counts <- reduceByKey(wordCount, "+", 2L)
  output <- collect(counts)
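Here `reduceByKey(wordCount, "+", 2L)` shuffles the `list(word, 1L)` pairs into 2 partitions and sums the values for each distinct key. The per-key merge it performs is equivalent to the following base-R sketch (plain R, no Spark needed; `reduce_by_key_local` is an illustrative name, not part of SparkR):

```r
# Base-R sketch of the per-key reduction that reduceByKey performs.
# pairs: a list of list(key, value) elements, as produced by the
# lapply(words, ...) step above.
reduce_by_key_local <- function(pairs, combine) {
  acc <- list()
  for (p in pairs) {
    key   <- p[[1]]
    value <- p[[2]]
    if (is.null(acc[[key]])) {
      acc[[key]] <- value                       # first occurrence of this key
    } else {
      acc[[key]] <- combine(acc[[key]], value)  # merge, e.g. with `+`
    }
  }
  acc
}

pairs  <- list(list("a", 1L), list("b", 1L), list("a", 1L))
counts <- reduce_by_key_local(pairs, `+`)
# counts$a == 2L, counts$b == 1L
```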
- 15. Minimize ||Ax − b||²; closed-form solution: x = (AᵀA)⁻¹Aᵀb
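In plain R this closed-form least-squares solution can be sketched by solving the normal equations directly (in practice a QR-based solver such as `lm()` or `qr.solve()` is numerically safer):

```r
set.seed(42)
A      <- matrix(rnorm(20), nrow = 10, ncol = 2)  # 10 observations, 2 features
x_true <- c(3, -1)
b      <- A %*% x_true                            # exact right-hand side

# Normal equations: x = (A^T A)^{-1} A^T b
x_hat <- solve(t(A) %*% A, t(A) %*% b)

# With an exact b, the recovered coefficients match x_true
all.equal(as.vector(x_hat), x_true)  # TRUE
```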
- 20. Dataflow
  [Diagram] Local side: the R SparkContext talks over JNI to a Java SparkContext.
  Worker side: each Spark Executor launches (exec) an R process to run the user's functions.
- 24. Dataflow
  [Same diagram as slide 20: R SparkContext ↔ JNI ↔ Java SparkContext; Spark Executors exec R processes on the workers.]
- 25. Pipelined RDD
  words <- flatMap(lines, …)
  wordCount <- lapply(words, …)
  [Diagram: each Spark Executor execs an R process to evaluate these operations.]
- 26. Pipelined RDD
Spark
Executor exec R Spark
Executor R exec
Spark
Executor exec R R Spark
Executor
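The idea behind pipelining can be sketched in plain R (illustrative names, not SparkR internals): successive per-partition functions are composed into one function, so a single R process evaluates the whole chain for each partition:

```r
# Two operations from the word-count example
split_words <- function(line) strsplit(line, " ")[[1]]
to_pair     <- function(word) list(word, 1L)

# Pipelined form: one function applied once per partition,
# so the executor only has to exec R a single time.
pipelined <- function(partition) {
  words <- unlist(lapply(partition, split_words))
  lapply(words, to_pair)
}

result <- pipelined(list("a b", "a"))
# result is three (word, 1L) pairs: ("a",1L), ("b",1L), ("a",1L)
```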
- 28. SparkR Implementation
Very similar to PySpark
Spark is easy to extend
329 lines of Scala code
2079 lines of R code
693 lines of test code in R
- 29. EC2 setup scripts
All Spark examples
MNIST demo
YARN, Windows support
Also on github
- 31. On the Roadmap
  High-level DataFrame API
  Integrating Spark's MLlib from R
  Merge with Apache Spark
- 32. SparkR
  RDD → distributed lists
  Run R on clusters
  Re-use existing packages
  Combine scalability & utility