June 2017
Yanbo Liang
Apache Spark committer
Hortonworks
SparkR best practices for R data scientists
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Outline
→ Introduction to R and SparkR.
→ Typical data science workflow.
→ SparkR + R for typical data science problems.
– Big data, small learning.
– Partition aggregate.
– Large scale machine learning.
→ Future directions.
R for data scientists
→ Pros
– Open source.
– Rich ecosystem of packages.
– Powerful visualization infrastructure.
– Data frames make data manipulation convenient.
– Taught by many schools to statistics and computer science students.
→ Cons
– Single-threaded.
– Everything has to fit in the memory of a single machine.
SparkR = Spark + R
→ An R frontend for Apache Spark, a widely deployed cluster computing engine.
→ Wrappers over DataFrames and DataFrame-based APIs (MLlib).
– A complete DataFrame API designed to behave just like R's data.frame.
– ML APIs mimic the methods implemented in R or R packages, rather than the Scala/Python APIs.
→ The data frame concept is the cornerstone of both Spark and R.
→ Convenient interoperability between R and Spark DataFrames.
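The interoperability mentioned above can be sketched as a round trip between a local data.frame and a SparkDataFrame. This is a minimal illustration, not code from the talk: it assumes a local Spark installation, uses a local master, and uses R's built-in faithful dataset as a stand-in for real data.

```r
library(SparkR)

# Assumption: Spark is installed locally; "local[1]" is used only for illustration.
sparkR.session(master = "local[1]", appName = "interop-sketch")

# Local R data.frame -> distributed SparkDataFrame.
sdf <- createDataFrame(faithful)

# SparkDataFrame -> local R data.frame.
local_df <- collect(sdf)

sparkR.session.stop()
```

After the round trip, local_df is an ordinary data.frame again, so any R package can work on it.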
SparkR architecture
Data science workflow
R for Data Science (http://r4ds.had.co.nz/)
Why SparkR + R
→ There are thousands of community packages on CRAN.
– It is impossible for SparkR to match all existing features.
→ Not every dataset is large.
– Many people work with small/medium datasets.
Big data, small learning
[Diagram: Table1 to Table5 → join → select/where/aggregate/sample → collect → model/analytics. The steps up to collect run in SparkR; model/analytics runs in R.]
Data wrangle with SparkR
Operation/Transformation                                        | Function
Join different data sources or tables                           | join
Pick observations by their value                                | filter/where
Reorder the rows                                                | arrange
Pick variables by their names                                   | select
Create new variable with functions of existing variables        | mutate/withColumn
Collapse many values down to a single summary                   | summary/describe
Aggregation                                                     | groupBy
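Several of the functions in the table can be chained as in the following sketch. It is an illustration under assumptions rather than code from the talk: it assumes a local Spark installation, uses the built-in mtcars dataset as a stand-in, and the column names kpl, avg_kpl and n are made up for the example.

```r
library(SparkR)

# Assumption: Spark is installed locally; "local[1]" is used only for illustration.
sparkR.session(master = "local[1]", appName = "wrangle-sketch")

cars <- createDataFrame(mtcars)                        # stand-in dataset

heavy <- filter(cars, cars$wt > 3)                     # pick observations by value
slim <- select(heavy, "mpg", "cyl", "wt")              # pick variables by name
withKpl <- withColumn(slim, "kpl", slim$mpg * 0.425)   # create a new variable

# Aggregate per group: mean of the new column and number of rows.
byCyl <- agg(groupBy(withKpl, "cyl"),
             avg_kpl = avg(withKpl$kpl),
             n = count(withKpl$cyl))

# Reorder the rows, then bring the small result back to the driver.
result <- collect(arrange(byCyl, byCyl$cyl))

sparkR.session.stop()
```

Every step before collect is evaluated by Spark; only the small aggregated result becomes a local data.frame.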
Data wrangle
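# Load the flight records and the plane metadata as SparkDataFrames.
airlines <- read.df(path="/data/2008.csv", source="csv",
                    header="true", inferSchema="true")
planes <- read.df(path="/data/plane-data.csv", source="csv",
                  header="true", inferSchema="true")
# Join flights to planes on the tail number.
joined <- join(airlines, planes, airlines$TailNum == planes$tailnum)
# Keep the columns of interest and drop rows with missing values.
df1 <- select(joined, "aircraft_type", "Distance", "ArrDelay", "DepDelay")
df2 <- dropna(df1)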
SparkR performance
Sampling Algorithms
→ Bernoulli sampling (without replacement)
– df3 <- sample(df2, FALSE, 0.1)
→ Poisson sampling (with replacement)
– df3 <- sample(df2, TRUE, 0.1)
→ Stratified sampling
– df3 <- sampleBy(df2, "aircraft_type", list("Fixed Wing Multi-Engine"=0.1, "Fixed Wing Single-Engine"=0.2, "Rotorcraft"=0.3), 0)
Big data, small learning
[Diagram: Table1 to Table5 → join → select/where/aggregate/sample → collect → model/analytics. The data is a SparkDataFrame up to collect and a data.frame afterwards.]
Distributed dataset to local
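The "big data, small learning" pattern can be sketched as: reduce the data in Spark, collect the reduced result, then model it with plain R. The example below is a minimal sketch under assumptions: a local Spark installation, the built-in faithful dataset standing in for a large table, and lm chosen arbitrarily as the local model.

```r
library(SparkR)

# Assumption: Spark is installed locally; "local[1]" is used only for illustration.
sparkR.session(master = "local[1]", appName = "collect-sketch")

big <- createDataFrame(faithful)          # stand-in for a large distributed table

# Arguments: withReplacement, fraction, seed.
sampled <- sample(big, FALSE, 0.5, 42)

# collect() turns the SparkDataFrame into a local data.frame on the driver.
local_df <- collect(sampled)

# From here on, any local R modelling function can be used.
fit <- stats::lm(eruptions ~ waiting, data = local_df)

sparkR.session.stop()
```

Only data that fits in the driver's memory should be collected, which is why the sampling or aggregation step comes first.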
Partition aggregate
→ User Defined Functions (UDFs).
– dapply
– gapply
→ Parallel execution of function.
– spark.lapply
User Defined Functions (UDFs)
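→ dapply: applies an R function to each partition of a SparkDataFrame.
→ gapply: applies an R function to each group of a SparkDataFrame, grouped by the given columns.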
dapply
> # The schema describes the columns returned by the function.
> schema <- structType(structField("aircraft_type", "string"),
                       structField("Distance", "integer"),
                       structField("ArrDelay", "integer"),
                       structField("DepDelay", "integer"),
                       structField("DepDelayS", "integer"))
> # For each partition, append the departure delay converted to seconds.
> df4 <- dapply(df3, function(x) { cbind(x, DepDelayS = x$DepDelay * 60L) }, schema)
> head(df4)
gapply
> schema <- structType(structField("Distance", "integer"),
                       structField("MaxActualDelay", "integer"))
> # For each Distance group, the largest difference between arrival and departure delay.
> df5 <- gapply(df3, "Distance", function(key, x) {
    data.frame(key, max(x$ArrDelay - x$DepDelay))
  }, schema)
> head(df5)
spark.lapply
→ An ideal way to distribute existing R functionality and packages.
spark.lapply
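# Sequential grid search over lambda and alpha on a single machine.
# auc() is assumed to come from a separate package; the slide does not name it.
for (lambda in c(0.5, 1.5)) {
  for (alpha in c(0.1, 0.5, 1.0)) {
    model <- glmnet(A, b, lambda=lambda, alpha=alpha)
    pred <- predict(model, A)
    c(coef(model), auc(pred, b))
  }
}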
spark.lapply
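# A list keeps each (lambda, alpha) pair together; nested c() calls would
# flatten into a single numeric vector.
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
               c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  lambda <- value[1]
  alpha <- value[2]
  model <- glmnet(A, b, lambda=lambda, alpha=alpha)
  pred <- predict(model, A)
  c(coef(model), auc(pred, b))
}
# Each pair is trained on an executor; the results come back as a list.
models <- spark.lapply(values, train)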
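For larger grids, the list of (lambda, alpha) pairs could be generated rather than typed out. The sketch below uses only base R: expand.grid builds every combination, and the rows are turned into a list because nested c() calls flatten into one numeric vector. The function describe_pair is a hypothetical stand-in for the training function, and lapply runs locally here; spark.lapply(values, describe_pair) would take the same arguments.

```r
# Every combination of the two tuning parameters, one row per combination.
grid <- expand.grid(lambda = c(0.5, 1.5), alpha = c(0.1, 0.5, 1.0))

# One c(lambda, alpha) vector per list element.
values <- lapply(seq_len(nrow(grid)),
                 function(i) c(grid$lambda[i], grid$alpha[i]))

# Hypothetical stand-in for the real training function: it only echoes its inputs.
describe_pair <- function(value) {
  sprintf("lambda=%.1f alpha=%.1f", value[1], value[2])
}

# Local run; with SparkR the call would be spark.lapply(values, describe_pair).
labels <- lapply(values, describe_pair)
```

Developing the function with lapply first and switching to spark.lapply afterwards keeps debugging local and cheap.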
spark.lapply
[Diagram: the driver holds lambda = c(0.5, 1.5) and alpha = c(0.1, 0.5, 1.0) and is connected to six executors.]
spark.lapply
[Diagram: the driver sends one (lambda, alpha) pair to each of six executors: (0.5, 0.1), (0.5, 0.5), (0.5, 1.0), (1.5, 0.1), (1.5, 0.5), (1.5, 1.0).]
Virtual environment
[Diagram: the driver and each of six executors have glmnet available.]
Virtual environment
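# packagesDir is a local directory on the driver; the slide does not define it.
download.packages("glmnet", packagesDir, repos = "https://cran.r-project.org")
filename <- list.files(packagesDir, "^glmnet")
packagesPath <- file.path(packagesDir, filename)
# Ship the downloaded package archive to every executor.
spark.addFile(packagesPath)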
Virtual environment
# A list keeps each (lambda, alpha) pair together.
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
               c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  # Install glmnet on the executor from the file shipped with spark.addFile.
  path <- spark.getSparkFiles(filename)
  install.packages(path, repos = NULL, type = "source")
  library(glmnet)
  lambda <- value[1]
  alpha <- value[2]
  model <- glmnet(A, b, lambda=lambda, alpha=alpha)
  pred <- predict(model, A)
  c(coef(model), auc(pred, b))
}
models <- spark.lapply(values, train)
Large scale machine learning
> model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type,
family = "gaussian", data = df3)
> summary(model)
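The same kind of model can also be fitted through SparkR's MLlib wrapper spark.glm, followed by distributed prediction. The following is a sketch under assumptions, not the talk's code: it assumes a local Spark installation and uses the built-in mtcars dataset with an arbitrary formula in place of the airline data.

```r
library(SparkR)

# Assumption: Spark is installed locally; "local[1]" is used only for illustration.
sparkR.session(master = "local[1]", appName = "glm-sketch")

cars <- createDataFrame(mtcars)            # stand-in for the airline data

# Fit a Gaussian GLM with MLlib through the R formula interface.
model <- spark.glm(cars, mpg ~ wt + hp, family = "gaussian")
model_summary <- summary(model)

# predict() returns a SparkDataFrame with an added "prediction" column.
preds <- collect(select(predict(model, cars), "mpg", "prediction"))

sparkR.session.stop()
```

Training and prediction both run in Spark; only the two selected columns are collected to the driver.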
Future directions
→ Improve collect/createDataFrame performance in SparkR (SPARK-18924).
→ More scalable machine learning algorithms from MLlib.
→ Better R formula support.
→ Improve UDF performance.