Introduction to R and integration of SparkR and Spark’s MLlib
Dang Trung Kien
About me 
• Statistics undergraduate 
• trungkiendang@hotmail.com
R
What is R? 
• Statistical Programming Language 
• Open source 
• > 6000 available packages 
• widely used in academia and research
Companies that use R 
• Facebook 
• Google 
• Foursquare 
• Ford 
• Bank of America 
• ANZ 
• …
Data types 
• Vector 
• Matrix 
• List 
• Data frame
Vector 
• c(1, 2, 3, 4) 
## [1] 1 2 3 4 
• 1:4 
## [1] 1 2 3 4 
• c("a", "b", "c") 
## [1] "a" "b" "c" 
• c(T, F, T) 
## [1] TRUE FALSE TRUE
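The slide shows only how vectors are built; a short sketch of why they matter in practice — arithmetic is element-wise and logical vectors work as filters:

```r
x <- c(1, 2, 3, 4)

# Arithmetic applies element-wise to the whole vector
x * 2        # 2 4 6 8

# Indexing is 1-based; a logical condition selects elements
x[2]         # 2
x[x > 2]     # 3 4
```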
Matrix 
• matrix(c(1, 2, 3, 4), ncol=2) 
## [,1] [,2] 
## [1,] 1 3 
## [2,] 2 4 
• matrix(c(1, 2, 3, 4), ncol=2, byrow=T) 
## [,1] [,2] 
## [1,] 1 2 
## [2,] 3 4
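To complement the slide, a brief sketch of how matrices behave — note the difference between element-wise `*` and the matrix product `%*%`:

```r
mat <- matrix(c(1, 2, 3, 4), ncol = 2)   # filled column by column

mat * mat      # element-wise: squares each entry
mat %*% mat    # linear-algebra matrix product

# Entries are addressed as mat[row, col]
mat[1, 2]      # 3
dim(mat)       # 2 2
```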
List 
• list(12, "twelve") 
## [[1]] 
## [1] 12 
## 
## [[2]] 
## [1] "twelve"
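A short follow-up sketch on accessing list elements (lists can also carry names; the names here are illustrative):

```r
person <- list(id = 12, name = "twelve")

# `[[` extracts a single element, by position or name; `$` is shorthand
person[[1]]      # 12
person$name      # "twelve"

# Single brackets `[` return a sub-list, not the element itself
person[1]        # a list of length 1 containing 12
```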
Data frame 
name <- c("A", "B", "C") 
age <- c(30, 17, 42) 
male <- c(T, F, F) 
data.frame(name, age, male) 
## name age male 
## 1 A 30 TRUE 
## 2 B 17 FALSE 
## 3 C 42 FALSE
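Building on the slide's example, a common next step is subsetting a data frame with `$` and logical conditions:

```r
name <- c("A", "B", "C")
age  <- c(30, 17, 42)
male <- c(T, F, F)
df <- data.frame(name, age, male)

# Columns are extracted with $; rows are filtered with logical conditions
df$age                         # 30 17 42
adults <- df[df$age >= 18, ]   # keep only rows where age >= 18
</imports>
```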
x <- 1:100 
y <- 1:100 + runif(100, 0, 20) 
m <- lm(y~x) 
plot(y~x) 
abline(m$coefficients)
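Rerunning the snippet above with a fixed seed (`set.seed` is an addition here, for reproducibility): since y is x plus uniform noise averaging 10, the fit should recover a slope near 1 and an intercept near 10.

```r
set.seed(1)                       # make the random noise reproducible
x <- 1:100
y <- 1:100 + runif(100, 0, 20)    # true slope 1, noise with mean 10

m <- lm(y ~ x)                    # simple linear regression
coef(m)                           # intercept near 10, slope near 1
summary(m)$r.squared              # variance explained by the line
```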
But… 
• R is single-threaded 
• Can only process data sets that fit on a single machine
SparkR
SparkR 
• An R package that provides a light-weight front-end 
to use Apache Spark from R 
• exposes the RDD API of Spark as distributed lists 
in R 
• allows users to interactively run jobs from the R 
shell on a cluster
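A minimal sketch of that workflow, assuming the pre-merge SparkR-pkg RDD API (`sparkR.init`, `parallelize`) that these slides describe; it needs a Spark installation to run, so treat it as illustrative:

```r
library(SparkR)                           # AMPLab SparkR-pkg

sc <- sparkR.init(master = "local")       # local context for testing

rdd <- parallelize(sc, 1:100, 4L)         # a distributed list in 4 slices
squares <- lapply(rdd, function(x) x^2)   # lapply runs on the workers
collect(squares)                          # bring results back to the shell
```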
Spark + R 
• RDD operations: count, countByKey, countByValue, flatMap, map (lapply), filter, reduce, reduceByKey, distinct, union, … 
• R-side additions: broadcast, includePackage, …
Data flow 
[Diagram: the local R process holds an R SparkContext, which drives a Java SparkContext through JNI; each worker in the cluster runs a SparkR executor.]
Word count 
lines <- textFile(sc, "/path/to/file") 
words <- flatMap(lines, 
                 function(line) { 
                   strsplit(line, " ")[[1]] 
                 }) 
wordCount <- lapply(words, function(word) { list(word, 1L) }) 
counts <- reduceByKey(wordCount, "+", 2L) 
output <- collect(counts) 
for (wordcount in output) { 
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n") 
}
SparkR and Spark’s 
MLlib
Machine Learning 
• Arthur Samuel (1959): "Field of study that gives computers the ability to learn without being explicitly programmed."
Machine Learning 
• Supervised 
Labels, features 
Mapping of features to labels 
Estimate a concept (model) that is closest to the true mapping 
• Unsupervised 
No labels 
Clustering of data
Machine Learning 
• Supervised 
Naive Bayes, nearest neighbour, decision tree, 
linear regression, support vector machine… 
• Unsupervised 
K-means, DBSCAN, one-class SVM…
Supervised
Supervised 
• Classification 
Cat or dog?
Supervised 
• Classification 
Cat or dog? 
• Regression 
Age?
Unsupervised
Naive Bayes 
• Supervised machine learning 
• Classifies texts based on word frequency
Naive Bayes 

P(class | doc) ∝ P(class) × Π_{word ∈ doc} P(word | class) 

argmax_class P(class | doc) 
  = argmax_class P(class) × Π_{word ∈ doc} P(word | class) 
  = argmax_class [ log P(class) + Σ_{word ∈ doc} log P(word | class) ] 

(taking logs preserves the argmax because log is monotone, and avoids numerical underflow) 

P(c) = (number of class-c documents in the training set) / (total number of documents in the training set) 

P(w | c) = (occurrences of word w in class-c documents + 1) / (total number of words in class-c documents + vocabulary size)
Worked example: two classes over the vocabulary {a, b, c}, with per-class word counts: 

class   "a"   "b"   "c" 
1        1     1     0 
2        0     2     1 

P(1) = P(2) = 1/2 

Class 1 (2 words, vocabulary size 3): 
P(a | 1) = (1+1)/(2+3) = 2/5 
P(b | 1) = (1+1)/(2+3) = 2/5 
P(c | 1) = (0+1)/(2+3) = 1/5 

Class 2 (3 words, vocabulary size 3): 
P(a | 2) = (0+1)/(3+3) = 1/6 
P(b | 2) = (2+1)/(3+3) = 1/2 
P(c | 2) = (1+1)/(3+3) = 1/3 

Classifying the document "a b b": 
P(1 | "a b b") ∝ 1/2 × 2/5 × 2/5 × 2/5 = 0.032 
P(2 | "a b b") ∝ 1/2 × 1/6 × 1/2 × 1/2 ≈ 0.021 

so "a b b" is assigned to class 1.
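The worked example can be recomputed in a few lines of plain R (a standalone sketch of the Laplace-smoothed estimates, not using MLlib):

```r
# Word counts per class over the vocabulary {a, b, c}
counts <- matrix(c(1, 1, 0,    # class 1
                   0, 2, 1),   # class 2
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1", "2"), c("a", "b", "c")))

prior <- c(0.5, 0.5)           # P(1) = P(2) = 1/2
vocab <- ncol(counts)          # vocabulary size 3

# Laplace-smoothed P(word | class), one row per class
p.word <- (counts + 1) / (rowSums(counts) + vocab)

# Unnormalised posterior for the document "a b b"
doc <- c("a", "b", "b")
posterior <- prior * apply(p.word[, doc, drop = FALSE], 1, prod)
posterior   # class 1: 0.032; class 2: 1/48 ≈ 0.021
```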
MLlib 
• Spark’s scalable machine learning library 
consisting of common learning algorithms and 
utilities, including classification, regression, 
clustering, collaborative filtering, dimensionality 
reduction, as well as underlying optimization 
primitives.
MLlib and SparkR 
• Access to MLlib from SparkR is still under development, so until MLlib is officially integrated into SparkR, the following approach can be used to call it from R.
MLlib’s Naive Bayes in R 
R RDD of list(label, features) → Java RDD of serialised R objects → Scala RDD of LabeledPoint (bridged via rJava) 
J("org.apache.spark.mllib.classification.NaiveBayes", "train", 
  labeled.point.rdd, lambda)
Demo
Thank you for coming!
