2. Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next
3. R
• A programming language for statistical computing and
graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993
• S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
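Two of the traits listed above, lexical scoping and built-in matrix arithmetic, can be illustrated with a few lines of base R. This is a minimal sketch; the names `make_counter`, `counter`, `m`, and `product` are made up for the illustration.

```r
# Lexical scoping: the inner function sees `count` from the environment
# in which it was defined, not from wherever it is later called.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates `count` in the enclosing environment
    count
  }
}

counter <- make_counter()
first  <- counter()   # 1
second <- counter()   # 2

# Matrix arithmetic: %*% is the matrix product, * would be element-wise.
m <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
product <- m %*% m
print(product)
```

Each call to `make_counter()` creates a fresh environment, so two counters built this way do not share state.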
5. SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL, sparkR
• or as a standard R package imported in tools like RStudio
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
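Building on the initialization above, the following sketch shows how the DataFrame API and Spark SQL might be used together. It assumes a local Spark 1.4/1.5 installation with SparkR on the library path; the `local[*]` master, the table name `faithful`, and the variable names are choices made for this example. It uses R's built-in `faithful` dataset.

```r
library(SparkR)

# Assumption: a local Spark 1.4/1.5 install; "local[*]" runs Spark in-process.
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# Turn a local R data.frame into a distributed Spark DataFrame.
df <- createDataFrame(sqlContext, faithful)
printSchema(df)

# Register the DataFrame as a temporary table and query it with Spark SQL.
registerTempTable(df, "faithful")
long_waits <- sql(sqlContext,
                  "SELECT eruptions, waiting FROM faithful WHERE waiting > 70")

# collect() brings the result back to the driver as a local R data.frame.
local_result <- collect(long_waits)
head(local_result)

sparkR.stop()
```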
6. History
• Shivaram Venkataraman & Zongheng Yang,
AMPLab, UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)
7. Architecture
[Figure: architecture diagram with the labels "Native S4 classes & methods", "socket", and "RBackend"]
• A set of native S4 classes and methods that live inside a
standard R package
• A backend that passes data structures and method calls to
Spark Scala/JVM
• A collection of “helper” methods written in Scala
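The pattern described above, an S4 class on the R side whose methods forward calls to a backend, can be sketched as a toy in plain R. This is not SparkR's actual code: `ToyDataFrame`, `rowLimit`, and `fake_backend_call` are hypothetical names, and the "backend" only records the request instead of sending it over a socket to the JVM.

```r
library(methods)

# Hypothetical stand-in for the socket call to the JVM backend.
# It only records what would be sent; no Spark is involved.
fake_backend_call <- function(object_id, method, ...) {
  list(target = object_id, method = method, args = list(...))
}

# An S4 class that holds nothing but a reference to a JVM-side object.
setClass("ToyDataFrame", representation(jvm_id = "character"))

# An S4 generic plus a method: the R-side method simply forwards the call.
setGeneric("rowLimit", function(x, n) standardGeneric("rowLimit"))
setMethod("rowLimit", signature(x = "ToyDataFrame", n = "numeric"),
          function(x, n) fake_backend_call(x@jvm_id, "limit", n))

toy <- new("ToyDataFrame", jvm_id = "obj-1")
request <- rowLimit(toy, 5)
str(request)
```

The point of the design is that the R object stays small: it carries only a handle, while the data and the heavy computation remain on the JVM side.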
8. Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities
and the Catalyst engine, e.g. execution plan optimization,
constant folding, predicate pushdown, and code generation
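One way to see the optimizer at work is to print the query plans for a small pipeline. The sketch below assumes the Spark 1.4/1.5 SparkR API and a local installation. The dataset (R's built-in `faithful`) and the variable names are choices made for this example, and the exact plan text depends on the Spark version.

```r
library(SparkR)

# Assumption: a local Spark 1.4/1.5 install, as in the earlier slides.
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# Transformations are lazy: nothing is computed here, a plan is only built up.
result <- select(filter(df, df$waiting > 70), "eruptions")
result_columns <- columns(result)

# Print the logical plans and the physical plan that Spark derived for `result`.
explain(result, extended = TRUE)

sparkR.stop()
```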
17. What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs
18. SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")
• More R-like
df[df$age %in% c(19, 30), 1:2]
transform(df, newCol = df$col1 / 5,
          newCol2 = df$col1 * 2)
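The GLM snippet above can be turned into an end-to-end sketch. It assumes the Spark 1.5 SparkR API and a local installation, and it uses R's built-in `iris` dataset as `df`. Creating the DataFrame replaces the dots in the column names with underscores (with a warning), which is why the formula refers to `Sepal_Length` rather than `Sepal.Length`.

```r
library(SparkR)

# Assumption: a local Spark 1.5 install.
sc <- sparkR.init(master = "local[*]")
sqlContext <- sparkRSQL.init(sc)

# iris column names such as Sepal.Length become Sepal_Length in Spark.
df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM (linear regression) using an R formula.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = df, family = "gaussian")
summary(model)

# Score the training data; predict() adds a "prediction" column.
predictions <- predict(model, newData = df)
preview <- head(select(predictions, "Sepal_Length", "prediction"))
print(preview)

sparkR.stop()
```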
19. Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release