First impressions of SparkR: our own machine learning algorithm

Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science Company
SparkR
RBelgium
21/10/2015

Who am I
• Data Scientist at InfoFarm
www.infofarm.be
• PhD in Math
• Author of parallelML
https://cran.r-project.org/web/packages/parallelML
• Daily R user
• Spark enthusiast
Wannes.rosiers@infofarm.be @RosiersWannes

Overview
• Apache Spark
– A brief introduction
– R versus Scala (Java/Python)
• SparkR-1.4.0
– Getting started
– R integration
– Our own machine learning algorithms
• SparkR-1.5…
– What’s new?
– Spark MLlib

Apache Spark

“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”

Fast, Scalable and Fault Tolerant

One ring to rule them all…

Being lazy…
• Transformations (map, filter, union, sort, …) are lazy
• Actions (count, collect, save, …) force the computation of all
pending transformations
… is a good thing!
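
The split between lazy transformations and eager actions can be sketched with the SparkR 1.4 DataFrame API. This is a minimal illustration, not code from the talk: it assumes a local Spark installation, a local master, and R's built-in `faithful` dataset as input.

```r
library(SparkR)

# Assumed setup: a local master with two threads
sc <- sparkR.init(master = "local[2]", appName = "lazy-demo")
sqlContext <- sparkRSQL.init(sc)

# Distribute a local R data.frame as a Spark DataFrame
df <- createDataFrame(sqlContext, faithful)

# Transformations: Spark only records a query plan, nothing is computed yet
long_waits <- filter(df, df$waiting > 70)
eruptions  <- select(long_waits, "eruptions")

# Actions: these trigger the actual computation
n <- count(eruptions)
local_rows <- collect(eruptions)  # comes back as an ordinary R data.frame

sparkR.stop()
```

Because nothing runs until `count` or `collect` is called, Spark can plan the filter and the projection together instead of materialising each intermediate result.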

R versus Scala
• Scala advantages
– Natively written in Scala
– Big Data extension of Scala concepts
• R disadvantages
– Work in progress
– R packages not implemented for parallel
processing
Yet promising as an excellent Big Data analysis tool for R users

SparkR-1.4.0

Initializing SparkR
• Download Spark http://spark.apache.org/downloads
• Install
– Installation (R/install-dev.sh)
– Documentation (R/create-docs.sh)
• Run

Using SparkR
• sparkContext
• sqlContext
• Possibly hiveContext
• parquetFile
• jsonFile
• read.df
(via "com.databricks.spark.csv")
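
The entry points listed above fit together roughly as follows in the SparkR 1.4 API. This is a hedged sketch: the master setting, the application name and every file path are placeholders, and the CSV line assumes the external spark-csv data source has been made available to Spark when the session starts.

```r
library(SparkR)

# Spark context: the connection to the cluster (assumed local master here)
sc <- sparkR.init(master = "local[2]", appName = "sparkr-intro")

# SQL context: the entry point for DataFrames
sqlContext <- sparkRSQL.init(sc)

# Hive context, only when Spark was built with Hive support
# hiveContext <- sparkRHive.init(sc)

# Loading data; all paths are placeholders
json_df    <- jsonFile(sqlContext, "path/to/data.json")
parquet_df <- parquetFile(sqlContext, "path/to/data.parquet")

# CSV goes through read.df with the spark-csv data source
csv_df <- read.df(sqlContext, "path/to/data.csv",
                  source = "com.databricks.spark.csv", header = "true")

printSchema(csv_df)
sparkR.stop()
```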

Integrating native R code
• Magrittr
• Local computations
• Within SparkR functions
collect createDataFrame

Machine learning
• Spark MLlib machine learning algorithms
were not available yet
• R algorithms are not implemented in a
distributed way
We implemented
– Naive Bayes (classification)
– K-means (clustering)
– Association rules (recommendation)
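
Naive Bayes is a natural fit for this kind of reimplementation because the model consists of nothing but grouped counts. The sketch below shows the idea in plain local R on made-up toy data; it is an illustration of the algorithm, not the distributed implementation from the talk, and the data and the helper `predict_nb` are hypothetical.

```r
# Toy training data: one categorical feature and a class label
train <- data.frame(
  weather = c("sun", "sun", "rain", "rain", "sun", "rain"),
  play    = c("yes", "yes", "no",   "yes",  "no",  "no"),
  stringsAsFactors = FALSE
)

# Class priors P(class): relative class frequencies
prior <- prop.table(table(train$play))

# Conditional probabilities P(feature value | class);
# rows are classes, columns are feature values, each row sums to 1
cond <- prop.table(table(train$play, train$weather), margin = 1)

# Score a new observation: prior times likelihood, then normalise
predict_nb <- function(weather) {
  score <- prior * cond[, weather]
  score / sum(score)
}

posterior <- predict_nb("sun")
predicted_class <- names(which.max(posterior))
```

Both tables are the result of a group-and-count, which is the kind of aggregation a DataFrame API can run in a distributed way.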

Performance
• Naive Bayes
  Set                     # observations       Action                   Time taken
  Training + Calibration  5.890.434 + 654.325  Build model + Threshold  9min 6sec
  Test                    725.479              Prediction               3min 40sec
• K-means
  # observations  Total time  Time per iteration
  7.270.238       3min 40sec  25sec (4 iterations)
• Association rules
  Action           # observations  Time taken
  Construct rules  1.048.575       < 30sec
  Predict          1               Instantly

Lessons learned
• Nasty workarounds: e.g.
– Rounding: var - var %% 1
– Adding constant column:
cast(data[[1]]*0, 'integer')
– Calculating which column has
the smallest value:
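
The arithmetic behind the rounding and constant-column workarounds can be checked on a local R vector. This is a sketch in plain R with made-up non-negative values standing in for a DataFrame column; on a Spark DataFrame the same expressions would be applied to column objects.

```r
# Local stand-in for a numeric column (non-negative values only)
var <- c(2.7, 0.2, 5)

# Subtracting the fractional part leaves the integer part,
# so this rounds down rather than to the nearest integer
rounded_down <- var - var %% 1

# Multiplying by zero and casting gives a constant integer column
# with the same number of rows as the input
constant_col <- as.integer(var * 0)
```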

Lessons learned
• No notion of row indexes
Solvable via HiveQL
• Possible loss of row ordering
Solvable by keeping an order on a certain
column
• Not all Spark functions are exported yet (map,
flatMap, lapply, …)
Solvable by altering source code to export them
• Slow computations due to framework
At least numPartitions might help you

Lessons learned
• Caching does not support all types:
lapply(nb[["model"]], function(mod){
  cache(mod)
  count(mod)
})
• Sometimes necessary to collect intermediate
results
local_model <- collect(model)
for (i in 0:n) {
  if (!i %in% local_model$category)
    local_model <- rbind(local_model, c(i, -1))
}
When using R code, this will always be the case
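
Once the model has been collected it is an ordinary local data.frame, so base R applies. The sketch below is a vectorised alternative to the loop above, run on a made-up stand-in for `collect(model)`; the column name `value` is hypothetical, since the slide only names `category`.

```r
# Stand-in for collect(model): categories 1 and 3 are missing
local_model <- data.frame(category = c(0, 2), value = c(0.4, 0.6))
n <- 3

# Find the categories in 0..n that are absent and append them in one step
absent <- setdiff(0:n, local_model$category)
if (length(absent) > 0) {
  filler <- data.frame(category = absent, value = -1)
  local_model <- rbind(local_model, filler)
}

# Restore a deterministic order on the category column
local_model <- local_model[order(local_model$category), ]
```

Appending all missing rows in a single `rbind` avoids growing the data.frame once per iteration.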

SparkR-1.5…

What’s new?
• Time classes (adding/subtracting times)
• More math functions (e.g. atan, rand)
• More text functions (e.g. concat, locate)
• More R functions (e.g. dim, ifelse)
• Create contingency table (crosstab)
• First machine learning algorithm (glm)

Algorithms provided by Spark
● Classification and regression
○ Linear models (SVMs, logistic regression, linear regression)
○ Naive Bayes
○ Decision trees
○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
○ Isotonic regression
● Collaborative filtering
○ Alternating least squares (ALS)
● Frequent pattern mining
○ FP-growth
○ Association rules
○ PrefixSpan
● Feature extraction and transformation
● Clustering
○ K-Means
○ Gaussian mixture
○ Power Iteration clustering
○ Latent Dirichlet allocation
○ Streaming k-means
● Dimensionality reduction
○ Singular value decomposition (SVD)
○ Principal component analysis (PCA)

Questions
