Spark + R = sparklyr
Jeff Allen, RStudio
@trestleJeff
THE R LANGUAGE
• 5th most popular
programming language in
the world1
• Data analysis, statistical
modeling, visualization
• Great interface to your data
• Historically limited to in-
memory data
1 http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages
WHAT IS SPARK?
• Open-source Apache
computing engine
• Bigger-than-memory data,
low-latency distributed
computing
• Can integrate with the
Hadoop ecosystem
• Built-in machine learning
sparklyr
Introducing…
“SPARRK-lee ARR”
http://spark.rstudio.com
• New open-source R
package from RStudio
• Complete dplyr back-end
for Spark
• Integrated with the RStudio
IDE
• Extensible foundation for
Spark + R
sparklyr
KICKTHETIRES
library(sparklyr)
spark_install()
sc <- spark_connect(“local")
my_tbl <- copy_to(sc, iris)
Driver Program
Cluster Manager
Worker
Spark Context
Task Task
Executor Cache
library(dplyr)
# use standard verbs to filter and aggregate 

select(

filter(my_tbl, Petal_Width < 0.3),

Petal_Length, Petal_Width

)
# use magrittr pipes for a cleaner syntax

my_tbl %>%

filter(Petal_Width < 0.3) %>%

select(Petal_Length, Petal_Width)
USE DPLYRTO WRITE
SPARK SQL
Demo
RUN IN PRODUCTION
Spark Cluster
Master Node
Driver Program
Spark Context
Worker Node
Task Task
Executor Cache
Cluster Manager
Worker Node
Task Task
Executor Cache
Use RStudio Server on the Spark cluster master node
> spark_connect(“spark://spark.company.org:7077”)
> my_tbl <- tbl(sc, “tblname”)
SPARKLYR FUNCTIONALITY
• Full dplyr back-end for Spark DataFrames
• R wrappers for all MLlib functions
• Easily leverage from R Markdown, Shiny, etc.
• IDE integration
• Windows, too!
SPARKLYR & H2O
• rsparkling: R interface to
Sparkling Water Spark
package
• A sparklyr extension
• http://spark.rstudio.com/h2o.html
RELATIONSHIPTO SPARKR
• Working together to establish a common
extension API
• Some differences in approach:
• CRAN distribution
• dplyr compatibility
SPARKR DPLYR
APPLICATIONS
http://spark.rstudio.com/examples.html
QUESTIONS?
http://spark.rstudio.com
@trestleJeff
https://github.com/trestletech/user2016-sparklyr

sparklyr - Jeff Allen