This document introduces R and its integration with SparkR and Spark's MLlib machine learning library. It provides an overview of R and some of its most common data types like vectors, matrices, lists, and data frames. It then discusses how SparkR allows R to leverage Apache Spark's capabilities for large-scale data processing. SparkR exposes Spark's RDD API as distributed lists in R. The document also gives examples of using SparkR for tasks like word counting. It provides an introduction to machine learning concepts like supervised and unsupervised learning, and gives Naive Bayes classification as an example algorithm. Finally, it discusses how MLlib can currently be accessed from R through rJava until full integration with SparkR is completed.
11. Data frame
name <- c("A", "B", "C")
age <- c(30, 17, 42)
male <- c(TRUE, FALSE, FALSE)
data.frame(name, age, male)
##   name age  male
## 1    A  30  TRUE
## 2    B  17 FALSE
## 3    C  42 FALSE
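As a brief sketch of how such a data frame is typically used, the lines below rebuild it and pick out a column and a subset of rows. The variable names `people`, `mean_age`, and `adults` are illustrative and do not appear in the slides.

```r
# Rebuild the data frame from the slide (the name `people` is illustrative)
people <- data.frame(
  name = c("A", "B", "C"),
  age  = c(30, 17, 42),
  male = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Each column is a vector and can be accessed with $
mean_age <- mean(people$age)

# Rows are selected with a logical condition inside [ , ]
adults <- people[people$age >= 18, ]

print(mean_age)
print(adults$name)
```

Setting `stringsAsFactors = FALSE` keeps `name` as a character column regardless of the R version in use.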
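12.
# Simulate a noisy linear relationship between x and y
x <- 1:100
y <- 1:100 + runif(100, 0, 20)
# Fit a linear model of y on x
m <- lm(y ~ x)
# Scatter plot with the fitted regression line drawn on top
plot(y ~ x)
abline(m)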
13.
14. But…
• R is single-threaded
• Can only process data sets that fit in the memory of a single machine
16. SparkR
• An R package that provides a light-weight front-end
to use Apache Spark from R
• exposes the RDD API of Spark as distributed lists
in R
• allows users to interactively run jobs from the R
shell on a cluster
17. Spark + R
• count, countByKey, countByValue, flatMap, map (lapply), …
• broadcast, includePackage, …
• Filter, reduce, reduceByKey, distinct, union, …
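To make the semantics of operations such as flatMap, map (lapply), and reduceByKey concrete, here is a minimal word-count sketch in plain base R. It runs locally on an ordinary list and does not call the SparkR API; the input lines are made up, and base R functions stand in for the distributed operations named above.

```r
# Local stand-in for an RDD: an ordinary R list of text lines (made-up data)
lines <- list("the quick brown fox", "the lazy dog", "the fox")

# flatMap: split every line into words and flatten the result
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# map (lapply): turn each word into a (key, value) pair
pairs <- lapply(words, function(word) list(word, 1L))

# reduceByKey: sum the values of all pairs that share the same key
keys   <- vapply(pairs, function(p) p[[1]], character(1))
values <- vapply(pairs, function(p) p[[2]], integer(1))
counts <- sapply(split(values, keys), sum)

print(counts)
```

On a cluster the same three steps would be applied to a distributed list, with each worker handling a partition of the lines.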
18. Data flow
[Diagram. Local: R, with the R Spark Context connected to the Java Spark Context via JNI. Workers (three shown): Spark Executor and R.]
21. Machine Learning
• Arthur Samuel (1959): Field of study that gives
computers the ability to learn without being
explicitly programmed.
22. Machine Learning
• Supervised: labels and features are given; learn the mapping of features to labels by estimating a concept (model) that is closest to the true mapping
• Unsupervised: no labels; clustering of data
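The contrast can be sketched in a few lines of base R. This is only an illustration under assumptions: the data are simulated, `lm` stands in for a supervised learner (with a numeric target playing the role of the label), and `kmeans` stands in for an unsupervised one.

```r
set.seed(1)  # fixed seed so the sketch is repeatable

# Made-up data: one feature x drawn from two groups, and a label y that depends on x
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))
y <- 2 * x + rnorm(40, sd = 0.5)

# Supervised: labels y are available, so estimate a mapping from feature to label
model <- lm(y ~ x)
slope <- unname(coef(model)["x"])

# Unsupervised: labels are not used, the feature values are only grouped into clusters
clusters <- kmeans(x, centers = 2, nstart = 5)

print(slope)
print(clusters$size)
```

The fitted slope should land close to the value 2 used to simulate `y`, while `kmeans` recovers the two groups without ever seeing `y`.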
28. Naive Bayes
• Supervised machine learning
• Classifies texts based on word frequency
29. Naive Bayes
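$$P(\text{class} \mid \text{doc}) \propto P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\operatorname*{argmax}_{\text{class}} P(\text{class} \mid \text{doc}) = \operatorname*{argmax}_{\text{class}} P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\operatorname*{argmax}_{\text{class}} \log P(\text{class} \mid \text{doc}) = \operatorname*{argmax}_{\text{class}} \left[ \log P(\text{class}) + \sum_{\text{word} \in \text{doc}} \log P(\text{word} \mid \text{class}) \right]$$

$$P(c) = \frac{\text{number of class } c \text{ documents in training set}}{\text{total number of documents in training set}}$$

$$P(w \mid c) = \frac{\text{no. of occurrences of word } w \text{ in documents of type } c + 1}{\text{total no. of words in documents of type } c + \text{size of vocab}}$$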
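The estimates and the log-space argmax can be sketched in base R as follows. This is a minimal illustration, not the MLlib implementation: the training documents, the class names, and the helper `classify` are made up, and skipping words outside the vocabulary is an assumption of this sketch.

```r
# Toy training set (made-up): each document is a vector of words with a class label
docs <- list(
  c("cheap", "pills", "buy"),
  c("buy", "cheap", "now"),
  c("meeting", "schedule", "now"),
  c("project", "meeting", "notes")
)
labels <- c("spam", "spam", "ham", "ham")

vocab   <- unique(unlist(docs))
classes <- unique(labels)

# log P(c): share of training documents that belong to class c
log_prior <- sapply(classes, function(cl) log(sum(labels == cl) / length(labels)))

# log P(w | c) with add-one smoothing: rows are words, columns are classes
log_likelihood <- sapply(classes, function(cl) {
  words_in_class <- unlist(docs[labels == cl])
  counts <- vapply(vocab, function(w) sum(words_in_class == w), numeric(1))
  log((counts + 1) / (length(words_in_class) + length(vocab)))
})

# Score each class as log P(c) + sum of log P(w | c), then take the argmax
classify <- function(doc) {
  known <- doc[doc %in% vocab]  # assumption: words outside the vocabulary are skipped
  scores <- sapply(classes, function(cl) {
    log_prior[[cl]] + sum(log_likelihood[known, cl])
  })
  classes[which.max(scores)]
}

print(classify(c("cheap", "pills")))      # "spam"
print(classify(c("project", "meeting")))  # "ham"
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long documents, and the add-one term keeps a word that never appeared in a class from forcing its probability to zero.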
31. MLlib
• Spark’s scalable machine learning library
consisting of common learning algorithms and
utilities, including classification, regression,
clustering, collaborative filtering, dimensionality
reduction, as well as underlying optimization
primitives.
32. MLlib and SparkR
• Access to MLlib from SparkR is still in development. Until MLlib is officially integrated into SparkR, it can be called from R through rJava.
33. MLlib’s Naive Bayes in R
R RDD of list(label, features) → Java RDD of serialised R objects → Scala RDD of LabeledPoint
rJava:
J("org.apache.spark.mllib.classification.NaiveBayes", "train",
labeled.point.rdd, lambda)