SparkR + Zeppelin
Seattle Spark Meetup
Sept 9, 2015
Felix Cheung
Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next
R
• A programming language for statistical computing and graphics
• S – 1975
  • S4 – advanced object-oriented features
• R – 1993
  • S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
Fast!
Scalable
Flexible
Statistical!
Interactive
Packages
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL, sparkR
• or as a standard R package imported into tools like RStudio
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
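From there, local R data can be turned into a SparkR DataFrame; a minimal sketch using R's built-in faithful dataset (any local data.frame works):
df <- createDataFrame(sqlContext, faithful)  # copy a local data.frame into Spark
head(df)                                     # bring the first rows back to R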
History
• Shivaram Venkataraman & Zongheng Yang, AMPLab – UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)
Architecture
(Architecture diagram: native S4 classes & methods in R ↔ socket ↔ RBackend on the JVM)
• A set of native S4 classes and methods that live inside a standard R package
• A backend that passes data structures and method calls to Spark Scala/JVM
• A collection of “helper” methods written in Scala
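Each R method is a thin wrapper that forwards the call over the socket to the JVM; roughly, the pattern looks like the simplified SparkR internals below, where callJMethod is the backend helper and x@sdf holds the reference to the Java-side DataFrame:
setMethod("count", signature(x = "DataFrame"),
          function(x) {
            callJMethod(x@sdf, "count")  # invoke count() on the JVM DataFrame
          })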
Advantages
• R-like syntax extending the DataFrame API
• JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution plan optimization, constant-folding, predicate pushdown, and code generation
https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
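One quick way to see Catalyst at work from R is explain(); a minimal sketch, assuming a DataFrame df with an age column:
explain(filter(df, df$age > 21), extended = TRUE)  # prints the parsed, analyzed, optimized, and physical plans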
SparkR DataFrame
• Spark packages
• Data Source API
• Optimizations
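For example, the Data Source API plus a Spark package lets SparkR read external formats directly; a sketch using the spark-csv package (the package coordinate and file path below are illustrative):
# start SparkR with the package, e.g. sparkR --packages com.databricks:spark-csv_2.10:1.2.0
people <- read.df(sqlContext, "people.csv", source = "com.databricks.spark.csv", header = "true")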
SparkR in Zeppelin
Architecture
(Architecture diagram: R adaptor connecting Zeppelin to R)
Demo
DIY
• https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
• Vagrant + VirtualBox
• Install prerequisites: JDK, R, R packages
• Automatically downloads the Spark 1.5.0 release
• Need to build Zeppelin from https://github.com/felixcheung/incubator-zeppelin/tree/r
• Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json
(extracted from the demo)
Native R
(extracted from the demo)
Native R and dplyr...
Similarly SparkR DataFrame…
(extracted from the demo)
SparkR DataFrame…
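The demo contrasts dplyr-style verbs on local data with the equivalent calls on a SparkR DataFrame; roughly, the SparkR side looks like the sketch below (df and its age/name columns are illustrative):
# filter, group, and aggregate – analogous to dplyr's filter() / group_by() / summarise()
adults <- filter(df, df$age > 20)
head(summarize(groupBy(adults, adults$name), count = n(adults$name)))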
What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs
SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")
• More R-like
df[df$age %in% c(19, 30), 1:2]
transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
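The glm model fitted above can then be summarized and used for scoring; a brief sketch following the Spark 1.5 SparkR ML API:
summary(model)                        # coefficients of the fitted Gaussian GLM
predictions <- predict(model, df)     # returns a DataFrame with a "prediction" column
head(select(predictions, "Sepal_Length", "prediction"))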
Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release
Questions?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
subset
# Columns can be selected using `[[` and `[`
df[[2]] == df[["age"]]
df[,2] == df[,"age"]
df[,c("name", "age")]
# Or to filter rows
df[df$age > 20,]
# A DataFrame can be subset on both rows and columns
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
subset(df, df$age %in% c(19, 30), 1:2)
subset(df, df$age %in% c(19), select = c(1,2))
Transform/mutate
newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)


Editor's Notes

  • #6 In RStudio: .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
  • #14 Use the viewer to check out the notebook without running Zeppelin: https://www.zeppelinhub.com/viewer/
  • #15 Retail employment, in millions (2008-2014) Source: Bureau of Labor Statistics Credit: NPR
  • #23 dplyr-like syntax