SparkR best practices for R data scientists

June 2017
Yanbo Liang
Apache Spark committer
Hortonworks

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Outline
• Introduction to R and SparkR.
• Typical data science workflow.
• SparkR + R for typical data science problems.
  – Big data, small learning.
  – Partition aggregate.
  – Large scale machine learning.
• Future directions.

R for data scientists
• Pros
  – Open source.
  – Rich ecosystem of packages.
  – Powerful visualization infrastructure.
  – Data frames make data manipulation convenient.
  – Taught by many schools to statistics and computer science students.
• Cons
  – Single threaded.
  – Everything has to fit in the memory of a single machine.

SparkR = Spark + R
• An R frontend for Apache Spark, a widely deployed cluster computing engine.
• Wrappers over DataFrames and DataFrame-based APIs (MLlib).
  – A complete DataFrame API that behaves much like R's data.frame.
  – ML APIs mimic the methods implemented in R or R packages, rather than the Scala/Python APIs.
• The data frame concept is the cornerstone of both Spark and R.
• Convenient interoperability between R and Spark DataFrames.

SparkR architecture

Data science workflow
R for Data Science (http://r4ds.had.co.nz/)

Why SparkR + R
• There are thousands of community packages on CRAN.
  – It is impossible for SparkR to match all existing features.
• Not every dataset is large.
  – Many people work with small/medium datasets.

Big data, small learning
[Diagram: Table1 to Table5 are joined, then reduced with select/where/aggregate/sample in SparkR; the result is collected into R for modeling and analytics.]

Data wrangling with SparkR
Operation/Transformation                                     Function
Join different data sources or tables                        join
Pick observations by their value                             filter/where
Reorder the rows                                             arrange
Pick variables by their names                                select
Create new variable with functions of existing variables     mutate/withColumn
Collapse many values down to a single summary                summary/describe
Aggregation                                                  groupBy

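Data wrangling
library(SparkR)
sparkR.session()

# Load the flight records and the plane metadata as SparkDataFrames.
airlines <- read.df(path = "/data/2008.csv", source = "csv",
                    header = "true", inferSchema = "true")
planes <- read.df(path = "/data/plane-data.csv", source = "csv",
                  header = "true", inferSchema = "true")
# Join on the tail number, keep the columns of interest, drop rows with NAs.
joined <- join(airlines, planes, airlines$TailNum == planes$tailnum)
df1 <- select(joined, "aircraft_type", "Distance", "ArrDelay", "DepDelay")
df2 <- dropna(df1)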

SparkR performance

Sampling algorithms
• Bernoulli sampling (without replacement)
  – df3 <- sample(df2, FALSE, 0.1)
• Poisson sampling (with replacement)
  – df3 <- sample(df2, TRUE, 0.1)
• Stratified sampling
  – df3 <- sampleBy(df2, "aircraft_type", list("Fixed Wing Multi-Engine" = 0.1, "Fixed Wing Single-Engine" = 0.2, "Rotorcraft" = 0.3), 0)
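
The difference between the two non-stratified modes can be illustrated locally in base R. This is only a sketch of the sampling idea, not SparkR's implementation; the data frame local_df, the seed, and the variable names are made up for illustration.

```r
# Toy local data frame (hypothetical), standing in for a SparkDataFrame.
set.seed(42)
local_df <- data.frame(id = 1:1000)
fraction <- 0.1

# Bernoulli sampling: each row is kept independently with probability
# `fraction`, so no row can appear more than once.
keep <- runif(nrow(local_df)) < fraction
bernoulli_sample <- local_df[keep, , drop = FALSE]

# Poisson sampling: each row is repeated Poisson(fraction) times,
# so a row can appear zero, one, or several times.
copies <- rpois(nrow(local_df), lambda = fraction)
row_index <- rep(seq_len(nrow(local_df)), times = copies)
poisson_sample <- local_df[row_index, , drop = FALSE]
```

In both cases the sample size is only approximately fraction times the number of rows, so it varies from run to run unless a seed is fixed.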

[Diagram: Table1 and Table2 are reduced with select/where/aggregate/sample on a SparkDataFrame in SparkR; collect turns the result into a local data.frame for modeling and analytics in R.]

Distributed dataset to local

Partition aggregate
• User Defined Functions (UDFs).
  – dapply
  – gapply
• Parallel execution of functions.
  – spark.lapply

User Defined Functions (UDFs)
• dapply
• gapply

dapply
> schema <- structType(structField("aircraft_type", "string"),
                       structField("Distance", "integer"),
                       structField("ArrDelay", "integer"),
                       structField("DepDelay", "integer"),
                       structField("DepDelayS", "integer"))
> # Apply the function to each partition; the result must match the schema.
> df4 <- dapply(df3, function(x) { cbind(x, x$DepDelay * 60L) }, schema)
> head(df4)
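
Because the function passed to dapply receives each partition as an ordinary R data.frame, it can be tried on a small local data.frame first. The sketch below uses base R only; the function name add_delay_seconds and the two rows of data are made up for illustration.

```r
# Per-partition function: append the departure delay in seconds.
add_delay_seconds <- function(x) {
  cbind(x, DepDelayS = x$DepDelay * 60L)
}

# Hypothetical stand-in for one partition of df3.
partition <- data.frame(
  aircraft_type = c("Rotorcraft", "Fixed Wing Multi-Engine"),
  Distance = c(120L, 810L),
  ArrDelay = c(7L, -3L),
  DepDelay = c(5L, 10L),
  stringsAsFactors = FALSE
)

out <- add_delay_seconds(partition)
str(out)  # column order and types should line up with the dapply schema
```

Checking the column order and types locally helps catch mismatches with the declared schema before the function is shipped to the cluster.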

gapply
> schema <- structType(structField("Distance", "integer"),
                       structField("MaxActualDelay", "integer"))
> # Group by Distance and compute the largest ArrDelay - DepDelay per group.
> df5 <- gapply(df3, "Distance", function(key, x) {
    data.frame(key, max(x$ArrDelay - x$DepDelay))
  }, schema)
> head(df5)
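
The per-group logic can likewise be checked locally with a split-apply-combine in base R. This is a rough local analogue of the grouping, not how SparkR executes gapply; the function name max_actual_delay and the toy rows are made up, and the key is passed here as a plain integer rather than the list that gapply supplies.

```r
# Per-group function: largest difference between arrival and departure delay.
max_actual_delay <- function(key, x) {
  data.frame(Distance = key, MaxActualDelay = max(x$ArrDelay - x$DepDelay))
}

# Hypothetical local sample with two distance groups.
local_df <- data.frame(
  Distance = c(100L, 100L, 200L),
  ArrDelay = c(10L, 25L, 5L),
  DepDelay = c(5L, 5L, 10L)
)

# Split by the grouping column, apply the function per group, bind the results.
groups <- split(local_df, local_df$Distance)
pieces <- Map(function(k, g) max_actual_delay(as.integer(k), g),
              names(groups), groups)
result <- do.call(rbind, pieces)
```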

spark.lapply
• An ideal way to distribute existing R functionality and packages.

spark.lapply
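# Sequential grid search; A (feature matrix), b (response) and auc() are
# assumed to be defined elsewhere.
for (lambda in c(0.5, 1.5)) {
  for (alpha in c(0.1, 0.5, 1.0)) {
    model <- glmnet(A, b, lambda = lambda, alpha = alpha)
    pred <- predict(model, A)
    c(coef(model), auc(pred, b))
  }
}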

spark.lapply
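# Each element is one (lambda, alpha) pair; list() keeps the pairs separate,
# whereas nesting c() would flatten them into a single numeric vector.
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
               c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  lambda <- value[1]
  alpha <- value[2]
  # Same steps as the sequential loop: fit, predict, return the results.
  model <- glmnet(A, b, lambda = lambda, alpha = alpha)
  pred <- predict(model, A)
  c(coef(model), auc(pred, b))
}
models <- spark.lapply(values, train)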
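
The list of parameter pairs does not have to be typed by hand. A minimal base R sketch, assuming the same two lambda and three alpha values: expand.grid builds the combinations, and plain lapply serves as a local stand-in for spark.lapply. The train function here is a hypothetical placeholder that only echoes its parameters.

```r
lambdas <- c(0.5, 1.5)
alphas <- c(0.1, 0.5, 1.0)

# expand.grid varies its first argument fastest, so alpha cycles within lambda.
grid <- expand.grid(alpha = alphas, lambda = lambdas)

# One numeric vector c(lambda, alpha) per grid row, collected in a list.
values <- lapply(seq_len(nrow(grid)), function(i) {
  c(grid$lambda[i], grid$alpha[i])
})

# Placeholder training function (hypothetical): returns the parameters it saw.
train <- function(value) {
  list(lambda = value[1], alpha = value[2])
}

# Local dry run; with an active SparkR session this call could be swapped
# for spark.lapply(values, train).
models <- lapply(values, train)
```

Running the function through lapply first is a cheap way to confirm that each element arrives as one (lambda, alpha) pair before distributing the work.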

spark.lapply
[Diagram: the driver holds lambda = c(0.5, 1.5) and alpha = c(0.1, 0.5, 1.0); six executors are available.]

spark.lapply
[Diagram: the driver sends one (lambda, alpha) pair to each of the six executors: (0.5, 0.1), (0.5, 0.5), (0.5, 1.0), (1.5, 0.1), (1.5, 0.5), (1.5, 1.0).]

Virtual environment
[Diagram: the driver and each of the six executors have the glmnet package available.]

Virtual environment
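# packagesDir is assumed to be an existing local directory for the download.
download.packages("glmnet", packagesDir, repos = "https://cran.r-project.org")
filename <- list.files(packagesDir, "^glmnet")
packagesPath <- file.path(packagesDir, filename)
# Ship the package archive to every executor.
spark.addFile(packagesPath)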

Virtual environment
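# list() keeps each (lambda, alpha) pair as a separate element.
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0),
               c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  # Install and load glmnet on the executor from the file added with spark.addFile.
  path <- spark.getSparkFiles(filename)
  install.packages(path, repos = NULL, type = "source")
  library(glmnet)
  lambda <- value[1]
  alpha <- value[2]
  # Same steps as the sequential loop: fit, predict, return the results.
  model <- glmnet(A, b, lambda = lambda, alpha = alpha)
  pred <- predict(model, A)
  c(coef(model), auc(pred, b))
}
models <- spark.lapply(values, train)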

Large scale machine learning

> model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type,
family = "gaussian", data = df3)
> summary(model)
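
The formula interface above matches that of base R's glm, so the same formula can also be fitted on a small local data.frame, for example a collected sample. The sketch below uses made-up toy data in which ArrDelay is an exact linear function of the other columns; it illustrates the shared interface only and says nothing about the real flight data.

```r
# Toy local data (hypothetical values, not taken from the flight dataset).
local_df <- data.frame(
  DepDelay = c(0, 5, 10, 15, 20, 25),
  Distance = c(100, 300, 200, 500, 400, 600),
  aircraft_type = rep(c("Fixed Wing Multi-Engine", "Rotorcraft"), times = 3)
)
# Response constructed from known coefficients so the fit can be checked.
local_df$ArrDelay <- 2 + 1.5 * local_df$DepDelay + 0.01 * local_df$Distance

# Same formula and family as the SparkR call, fitted with stats::glm.
local_model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type,
                   family = gaussian, data = local_df)
coef(local_model)
```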

Future directions
• Improve collect/createDataFrame performance in SparkR (SPARK-18924).
• More scalable machine learning algorithms from MLlib.
• Better R formula support.
• Improve UDF performance.
