The document discusses distributed machine learning on the Java Virtual Machine (JVM), aimed at practitioners without advanced degrees. It introduces concepts like big data, machine learning, and distributed systems, then describes how projects like Spark and MLlib perform scalable machine learning on the JVM by distributing tasks across a cluster. Examples include similarity search, clustering, recommendation systems, and model evaluation, demonstrating machine learning algorithms in MLlib.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
OrientDB: Unlock the Value of Document Data Relationships, by Fabrizio Fortino
a) A general introduction of graph databases and OrientDB,
b) Why connected data has more value than just data,
c) How to "have fun" with OrientDB combining documents with graphs via SQL,
d) A use case on how OrientDB has helped to raise standards in Irish Public Office.
On OrientDB: NoSQL document databases provide an elegant way to deal with data in different shapes, enabling developers to create better and faster products quickly. The main goal of these systems is to find the most efficient solution to manage the data itself. With the big data explosion we need to deal with a myriad of highly interconnected information. The challenge now is not only how to store data but how to manage, analyse, traverse and use your data within the context of relationships. Graph databases shine at maintaining highly connected data and are the fastest-growing category in database management systems: 2014 registered an increase of 250% in terms of adoption, and Forrester Research predicts that more than a quarter of enterprises will be using graphs by 2017. OrientDB combines more than one NoSQL model, offering the unique flexibility of modelling data in the form of either documents or graphs, while incorporating object-oriented programming as a way of encapsulating relationships.
Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast [1], will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a baseline FAISS.
Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/
YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s
Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast
The document discusses OrientDB, a document-graph database. It provides an overview of key OrientDB concepts like documents, vertices, edges, classes, clusters, and properties. It also compares the relational and graph data models. The presentation was given by Greg McCarvell and introduces Node.js integration with OrientDB through examples.
Haystack LIVE! - 5 ways to increase result diversity at web-scale, by Dmitry Kan
Promoting diversity among items in a search result has been shown to increase user satisfaction compared to relevancy-only ranking. In this talk, we'll present how we went about implementing search result diversification methods across different vertical search engines: starting from zero with no diversification at all, exploring simple heuristic-based methods, and moving on to more complex ones based on entropy and determinantal point processes. We'll also discuss evaluation methods and useful tooling around that.
Presented by Dmitry Kan, Principal AI Scientist at Silo AI and Daniel Wärnå, AI Engineer, Silo AI.
YouTube recording:
https://www.youtube.com/watch?v=bri0C28mfl8
Code demoed: https://github.com/DmitryKey/bert-solr-search/tree/master/src/diversify
This document discusses working with events and styles in JavaScript. It covers creating event handlers, using the event object, exploring object properties, working with mouse and keyboard events, and controlling event propagation. Specific topics include adding and removing event listeners, changing inline styles, creating object collections with CSS selectors, and changing the cursor style. The overall goal is to teach how to build interactive elements that respond to user input through events.
Chapter 7 - Data Mining: Concepts and Techniques, 2nd Ed slides, Han & Kamber (error007)
The document describes chapter 7 of the book "Data Mining: Concepts and Techniques" which covers cluster analysis. The chapter discusses what cluster analysis is, different types of data that can be analyzed, major clustering methods like partitioning, hierarchical, and density-based methods. It also covers measuring cluster quality, requirements for clustering in data mining, and how to calculate similarity and dissimilarity between data objects.
A Data Ecosystem to Support Machine Learning in Materials Science (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Blaiszik from University of Chicago and Argonne National Laboratory Data Science and Learning Division.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j (GraphAware)
Neo4j - London User Group Meetup - 28th March, 2018
If your data ingestion requirements have grown beyond importing occasional CSV files then this talk is for you. Neo4j-Databridge from GraphAware is a comprehensive ETL tool specifically built for Neo4j. It has been designed for usability, expressive power and high performance to address the most common issues faced when importing data into Neo4j - multiple data sources and types, very large data sets, bespoke data conversions, non-tabular formats, filtering, merging and de-duplication, as well as bulk imports and incremental updates.
In this talk, we'll take a quick tour of some of the main features, loading data from Kafka, Redis, JDBC and various other data sources along the way, to understand how Neo4j-Databridge solves these problems and how it can help you import your data quickly and easily into Neo4j.
Vince Bickers is a Principal Consultant at GraphAware and the main author of Spring Data Neo4j (v4). He has been writing software and leading software development teams for over 30 years at organisations like Vodafone, Deutsche Bank, HSBC, Network Rail, UBS, VMWare, ConocoPhillips, Aviva and British Gas.
1) Entity-centric data management stores and integrates information at the entity level rather than keywords or structured schemas. This allows for more natural integration of heterogeneous data as entities can be interlinked.
2) Techniques presented include ZenCrowd for crowdsourcing entity extraction, hybrid search to combine keyword and graph searches for entities, and Diplodocus for efficiently storing and querying entity data through clustering and co-location.
3) The approaches were shown to improve entity extraction precision by 14%, entity search results by up to 25%, and entity query performance by up to 300x compared to traditional techniques.
A New Year in Data Science: ML Unpaused (Paco Nathan)
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
This document provides an agenda for a training session on AI and data science. The session is divided into two units: data science and data visualization. Key Python libraries that will be covered for data science include NumPy, Pandas, and Matplotlib. NumPy will be used to create and manipulate multi-dimensional arrays. Pandas allows users to work with labeled and relational data. Matplotlib enables data visualization through graphs and plots. The session aims to provide knowledge of core data science libraries and demonstrate data exploration techniques using these packages.
- NASA has a large database of documents and lessons learned from past programs and projects dating back to the 1950s.
- Graph databases can be used to connect related information across different topics, enabling more efficient search and pattern recognition compared to isolated data silos.
- Natural language processing techniques like named entity recognition, parsing, and keyword extraction can be applied to NASA's text data and combined with a graph database to create a knowledge graph for exploring relationships in the data.
Konstantin Vorontsov - BigARTM: Open Source Library for Regularized Multimodal Topic Modeling (AIST)
The document describes BigARTM, an open source library for regularized multimodal topic modeling of large collections. It discusses probabilistic topic modeling and how additive regularization of topic models (ARTM) handles ill-posed inverse problems in topic modeling. ARTM allows various regularizers to be combined. BigARTM provides a parallel implementation for improved time and memory performance. Experiments show how ARTM can combine regularizers and be used for classification and multi-language topic modeling. Multimodal topic modeling binds topics to terms, authors, images, links and other modalities.
Toward Semantic Data Stream - Technologies and Applications (Raja Chiky)
Massive data stream processing is a scientific challenge and an industrial concern, but with the current volumes of data streams, their number and variety, current techniques are not able to meet the requirements of applications. Semantic Web tools, through RDF for example, make it possible to address the problem of heterogeneous data. Thus, data streams are converted to semantic data streams by using RDF triples extended with a timestamp. To be able to query, filter, or reason over semantic data streams, the query language SPARQL must be extended to include concepts such as windowing, based on what has been done in Data Stream Management Systems. In this talk, I will present recent work on semantic data stream management, particularly extensions made to the SPARQL language and associated benchmarks.
This document provides an overview of machine learning for Java Virtual Machine (JVM) developers. It begins with introductions to the speaker and topics to be covered. It then discusses the growth of data and opportunities for machine learning applications. Key machine learning concepts are defined, including observations, features, models, supervised vs. unsupervised learning, and common algorithms like classification, regression, and clustering. Popular JVM machine learning tools are listed, with Spark/MLlib highlighted for its community support and implementation of standard algorithms. Example machine learning demos on price prediction and spam classification are described. The document concludes with recommendations for further learning resources.
This document discusses data visualization for big data. It begins by explaining why visualization is important, as it can help users notice unexpected patterns in data. It then defines data visualization as using interactive visual representations to amplify cognition. The document outlines several steps to create a visualization: identifying relevant tasks; choosing a library; transforming data into a nested JSON format; binding the data; and creating a user-friendly experience with settings. It provides an example of visualizing network threat data to identify suspicious IP addresses and domains.
This document discusses developing analytics applications using machine learning on Azure Databricks and Apache Spark. It begins with an introduction to Richard Garris and the agenda. It then covers the data science lifecycle including data ingestion, understanding, modeling, and integrating models into applications. Finally, it demonstrates end-to-end examples of predicting power output, scoring leads, and predicting ratings from reviews.
Qualitative AI: Hoo-ha or Step-Change? CAQDAS webinar (Christina Silver)
Slides from the CAQDAS Networking Project's webinar on 1st September 2023: Artificial Intelligence in Qualitative Data Analysis - Hoo-ha or Step-Change?
During 2023 there’s been increasing discussion about the use of artificial intelligence (AI) in qualitative research, spurred by widespread access to generative-AI technologies such as ChatGPT developed by OpenAI.
In this webinar Christina first recounts the history of AI in qualitative data analysis, outlining developments that far pre-date the current upsurge; including Qualrus, Discovertext, WordStat and QDA Miner, and Leximancer.
She’ll then outline how generative-AI is being used in qualitative data analysis at the moment, discussing three uses: chat bots alongside other analytic tools; integrations of OpenAI technology into already established Qualitative Software; and the rise of new generative-AI applications designed specifically for qualitative data analysis tasks.
Christina will open discussion about the implications of these developments for the practice of qualitative research. When are these tools appropriate? What do we need to know about them? What are the ethics of using them? What should we be cautious and excited about? How can the qualitative community shape their development?
Whether you’re an advocate of the use of AI in qualitative data analysis or a sceptic, these technologies are here, they have already impacted the field of qualitative research and they will continue to do so. Join Christina to be part of the conversation, find out what’s happening, share your experiences and experimentations, your fears and hopes. Let the developers know how you want to see these technologies harnessed.
Course 3: Types of data and opportunities by Nikolaos Deligiannis (Betacowork)
This document discusses big data and opportunities related to different types of data. It covers challenges of big data including volume, velocity, variety and veracity. It also discusses value that can be extracted from data. The document outlines static versus real-time data and structured versus unstructured data. Examples of applying machine learning techniques like regression, classification, clustering and dimensionality reduction are provided. The introduction to cloud computing discusses public, private and hybrid clouds and features of cloud infrastructure.
Choosing the right software for your research study: an overview of leading CAQDAS packages (Merlien Institute)
Choosing the right software for your research study : an overview of leading CAQDAS packages by Christina Silver. This presentation is part of the proceedings of the International workshop on Computer-Aided Qualitative Research organised by Merlien Institute. This workshop was held on the 4-5 June in Utrecht, The Netherlands
Lightweight Collection and Storage of Software Repository Data with DataRover (Christoph Matthies)
The ease of setting up collaboration infrastructures for software engineering projects creates a challenge for researchers that aim to analyze the resulting data. As teams can choose from various available software-as-a-service solutions and can configure them with a few clicks, researchers have to create and maintain multiple implementations for collecting and aggregating the collaboration data in order to perform their analyses across different setups.
The DataRover system simplifies this task by only requiring custom source code for API authentication and querying. Data transformation and linkage is performed based on mappings, which users can define based on sample responses through a graphical front end. This allows storing the same input data in formats and databases most suitable for the intended analysis without requiring additional coding.
A screencast of DataRover is available at https://youtu.be/mt4ztff4SfU.
DataRover is available at: https://bitbucket.org/tkowark/data-rover
Mind the Gap - Data Science Meets Software Engineering (Bernhard Haslhofer)
This document summarizes a talk on combining data science and software engineering approaches. It discusses how the two fields approach problems differently, with software engineering focusing on implementing features and ensuring quality through testing, while data science focuses on evaluating models and metrics. The document proposes a solution of defining goals, collecting ground truth data, implementing models and functions, testing and evaluating them, analyzing errors, and deploying services based on metrics. This "metrics driven software engineering" approach aims to bridge the gaps between the two fields.
This document discusses tools and services for data intensive research in the cloud. It describes several initiatives by the eXtreme Computing Group at Microsoft Research related to cloud computing, multicore computing, quantum computing, security and cryptography, and engaging with research partners. It notes that the nature of scientific computing is changing to be more data-driven and exploratory. Commercial clouds are important for research as they allow researchers to start work quickly without lengthy installation and setup times. The document discusses how economics has driven improvements in computing technologies and how this will continue to impact research computing infrastructure. It also summarizes several Microsoft technologies for data intensive computing including Dryad, LINQ, and Complex Event Processing.
Real time analytics with Spark Streaming by Padma at Bangalore I & D meetup (https://www.meetup.com/Bengaluru-Insights-and-Data-Meetup/events/238459154)
Multiplatform Spark solution for Graph datasources by Javier Dominguez (Big Data Spain)
This document summarizes a presentation given by Javier Dominguez at Big Data Spain about Stratio's multiplatform solution for graph data sources. It discusses graph use cases, different data stores like Spark, GraphX, GraphFrames and Neo4j. It demonstrates the machine learning life cycle using a massive dataset from Freebase, running queries and algorithms. It shows notebooks and a business example of clustering bank data using Jaccard distance and connected components. The presentation concludes with future directions like a semantic search engine and applying more machine learning algorithms.
Data mining technique for classification and feature evaluation using stream ... (ranjit banshpal)
This document discusses data stream mining techniques for classification and feature evaluation. It introduces data stream mining and its applications, including network traffic analysis and sensor data. It describes decision trees and the VFDT algorithm for data stream classification. VFDT can classify high-dimensional data streams more efficiently than decision trees. The document also covers challenges in data stream mining like concept drift and feature evolution, and concludes by discussing applications and referencing related work.
Automating Machine Learning, Artificial Intelligence and Data Science Processes... (Ali Alkan)
The document summarizes an agenda for a presentation on machine learning and data science. It includes an introduction to CRISP-DM (Cross Industry Standard for Data Mining), guided analytics, and a KNIME demo. It also discusses the differences between machine learning, artificial intelligence, and data science. Machine learning produces predictions, artificial intelligence produces actions, and data science produces insights. It provides an overview of the CRISP-DM process for data mining projects including the business understanding, data understanding, data preparation, modeling, evaluation, and deployment phases. It also discusses guided analytics and interactive systems to assist business analysts in finding insights and predicting outcomes from data.
1) Entity-centric data management stores information at the entity level and integrates information by interlinking entities. This provides advantages over keyword-based and relational database approaches.
2) The XI Pipeline extracts mentions from text and performs named entity recognition, entity linking, and entity typing to associate entities with text.
3) Approaches like ZenCrowd and TRank leverage both algorithms and human computation through crowdsourcing to improve entity linking and fine-grained entity typing.
Metadata Quality Assurance Part II. The implementation begins (Péter Király)
This document outlines a metadata quality assurance framework. It discusses why data quality is important, what the framework can be used for, and its key principles. It then describes how metadata quality will be measured, including examining schema-independent structural features, use case scenarios, and cataloging known metadata problems. Specific discovery scenarios and their metadata requirements are provided as examples. The document concludes by outlining further steps to develop and implement the framework.
CodeOne 2018 - Microservices in action at the Dutch National Police (Bert Jan Schrijver)
The document discusses the use of microservices architecture at the Dutch National Police. It describes how they have transitioned to using microservices and DevOps practices across their frontend and backend systems. Key points include:
- They have 5 teams building data and web applications using microservices in a private cloud.
- Teams use continuous delivery, have short feedback loops, and focus on people over products.
- The architecture uses microservices, event streaming, and multiple data stores. Services are developed independently and deployed continuously.
- Their methodology focuses on usability testing, minimizing meetings, and using tools like Phabricator and Kubernetes in production.
The document discusses data warehousing, data mining, and business intelligence applications. It explains that data warehousing organizes and structures data for analysis, and that data mining involves preprocessing, characterization, comparison, classification, and forecasting of data to discover knowledge. The final stage is presenting discovered knowledge to end users through visualization and business intelligence applications.
How to Create the Google for Earth Data (XLDB 2015, Stanford), by Rainer Sternfeld
Rainer Sternfeld presented on creating a Google-like platform for earth data using Planet OS. He described the challenges NOAA faces in managing tens of terabytes of weather data per day across scattered systems. Planet OS could index NOAA's metadata and downsample remote datasets via APIs. It would store chunked array data in object stores like S3 and provide on-demand computing via cloud services. This would make NOAA's large-scale data easily discoverable and machine-readable while addressing issues like data volume, transport, and real-time dissemination.
The document discusses schema-less databases and how they differ from traditional databases. Schema-less databases like MongoDB, CouchDB, and Cassandra use documents rather than tables and fields. Documents can vary in structure and there are no enforced relationships between data like with schemas. This flexibility allows for easier development of certain types of applications, like a campaign management system, though it comes with some disadvantages compared to SQL databases.
Artificial Intelligence and XPath Extension Functions (Octavian Nadolu)
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long runnings systems adding new cryptographic algorithms, certificate revocation, and hardness against DoS attacks.
Introducing Crescat - Event Management Software for Venues, Festivals and Eve... (Crescat)
Crescat is industry-trusted event management software, built by event professionals for event professionals. Founded in 2017, we have three key products tailored for the live event industry.
Crescat Event for concert promoters and event agencies. Crescat Venue for music venues, conference centers, wedding venues, concert halls and more. And Crescat Festival for festivals, conferences and complex events.
With a wide range of popular features such as event scheduling, shift management, volunteer and crew coordination, artist booking and much more, Crescat is designed for customisation and ease-of-use.
Over 125,000 events have been planned in Crescat and with hundreds of customers of all shapes and sizes, from boutique event agencies through to international concert promoters, Crescat is rigged for success. What's more, we highly value feedback from our users and we are constantly improving our software with updates, new features and improvements.
If you plan events, run a venue or produce festivals and you're looking for ways to make your life easier, then we have a solution for you. Try our software for free or schedule a no-obligation demo with one of our product specialists today at crescat.io
Why Mobile App Regression Testing is Critical for Sustained Success: A Detail... (kalichargn70th171)
A dynamic process unfolds in the intricate realm of software development, dedicated to crafting and sustaining products that effortlessly address user needs. Amidst vital stages like market analysis and requirement assessments, the heart of software development lies in the meticulous creation and upkeep of source code. Code alterations are inherent, challenging code quality, particularly under stringent deadlines.
Do you want Software for your Business? Visit Deuglo
Deuglo has top Software Developers in India. They are experts in software development and help design and create custom Software solutions.
Deuglo follows a seven-step method for delivering its services to customers, called the software development life cycle (SDLC) process.
Requirement — Collecting the requirements is the first phase in the SDLC process.
Feasibility Study — after completing the requirement process they move to the design phase.
Design — in this phase, they start designing the software.
Coding — when designing is completed, the developers start coding for the software.
Testing — in this phase when the coding of the software is done the testing team will start testing.
Installation — after completion of testing, the application opens to the live server and launches!
Maintenance — after completing the software development, customers start using the software.
Transform Your Communication with Cloud-Based IVR SolutionsTheSMSPoint
Discover the power of Cloud-Based IVR Solutions to streamline communication processes. Embrace scalability and cost-efficiency while enhancing customer experiences with features like automated call routing and voice recognition. Accessible from anywhere, these solutions integrate seamlessly with existing systems, providing real-time analytics for continuous improvement. Revolutionize your communication strategy today with Cloud-Based IVR Solutions. Learn more at: https://thesmspoint.com/channel/cloud-telephony
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris (Neo4j)
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Discover the latest innovations from Neo4j, including the latest cloud integrations and product improvements that make Neo4j an essential choice for developers building applications with interconnected data and generative AI.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
GraphSummit Paris - The art of the possible with Graph Technology (Neo4j)
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
OpenMetadata Community Meeting - 5th June 2024 (OpenMetadata)
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions (Peter Muessig)
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
4. “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” — Gartner
Mateusz Dymczyk, Prague, 23rd October 2015
The 3Vs. Estimated data processed per day, circa 2014: NSA and Baidu 10-100 PB, eBay 100 PB, Google 100 PB.
11. Machine learning
“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” — Mitchell, Tom M., “Machine Learning”
ML is extremely broad and involves several domains:
• computer science
• probability and statistics
• optimisation
• linear algebra
12. Basic terminology
• Observation - object which is used for learning or evaluation (eg. a house)
• Features - representation of the observation (eg. square meters, number of rooms, location)
• Labels - a value assigned to an observation (not always used)
• System - set of related objects forming a complex whole (eg. set of observations)
• Model (math) - description of a system using mathematical concepts/language
• Data (see the sketch after this list):
• training gets us our candidate parameters =>
• validation (optional) gets us the optimal parameter set =>
• test checks how good the model is
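A minimal sketch (not from the slides) of how this terminology maps onto MLlib types, assuming the Spark 1.x RDD-based API and a SparkContext sc as created on the later "What is Spark?" slide: each observation becomes a LabeledPoint (label + feature vector), and randomSplit produces the training/validation/test sets. The concrete numbers are purely illustrative.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// One observation: a house described by its features, with its price as the label.
val house = LabeledPoint(
  label = 350000.0,                           // e.g. the sale price
  features = Vectors.dense(120.0, 4.0, 2.0)   // square meters, rooms, floor (illustrative)
)

// A system of observations, split into training / validation / test data.
val data = sc.parallelize(Seq(house /* ... more observations ... */))
val Array(training, validation, test) = data.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)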
13. Examples of ML task types:
• eg. regression, when you want to predict a real number
• eg. clustering, when you want to cluster or have too much data
• eg. classification, when you want to assign to a category
• eg. association analysis, when you want to find relations between data
16. So what’s the problem?
• Lack of distributed/scalable solutions
• Not enough data and/or computing power
• False conviction that we:
• Need to read hard research papers
• Use “weird” programming languages
19. Still not good enough…
• Not designed for big data
• Didn’t fit machine learning computation models
20. ML, JVM and a (iterative) distribution?
21. New (distributed) kids on the block
• MLlib (+Spark)
• TridentML (+Storm)
• Apache FlinkML (+Flink)
• Mahout Samsara
• Mahout R-like DSL
• Mahout on Spark
• H2O
• back-end agnostic (but with native APIs)
• open-source machine learning platform
22. What is Spark?
• Distributed, fast, in-memory computational framework
• Based on RDDs (Resilient Distributed Dataset: abstract, immutable, distributed, easily rebuilt data format)
• Support for Scala, Java, Python and R
• Focuses on well known methods (map(), flatMap(), filter(), reduce() …)
23. What is Spark?
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Classic word count: read a file, split each line into words, count occurrences per word.
val conf = new SparkConf().setAppName("Spark App")
val sc = new SparkContext(conf)
val textFile: RDD[String] = sc.textFile("hdfs://...")
val counts: RDD[(String, Int)] = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(s"Found ${counts.count()}")
counts.saveAsTextFile("hdfs://...")
24. What is Spark?
The Spark stack: SparkSQL, Spark Streaming, MLlib and GraphX sit on top of Apache Spark (core), which runs on Mesos/Yarn/Standalone (cluster management).
25. Mateusz Dymczyk Prague, 23rd October 2015
What is MLlib?
• Machine learning library for Spark (scalable by definition)
• Since September 2013, initially created at AMPLab (UC Berkeley)
• Contains common, well-established machine learning algorithms and utilities
26. Mateusz Dymczyk Prague, 23rd October 2015
Is it for me?
PROS
• extensive community, part of Spark (Databricks support)
• Java, Scala, Python, R APIs
• solid implementation of the most popular algorithms
• easy to use, well documented, multitude of examples
• fast and robust
CONS
• only Spark
• very young, still missing algorithms
• still pretty “low level”
27. Mateusz Dymczyk Prague, 23rd October 2015
Any problems left?
• Young projects that still require a lot of work
• Plenty of ML algorithms do not distribute well by definition
• Simply throwing more machines at the problem won’t always help (e.g. too much data movement, too many operations)
28. Mateusz Dymczyk Prague, 23rd October 2015
What can we do?
1. Go to Spark’s JIRA
2. Add a ticket to MLlib
3. Relax
29. Mateusz Dymczyk Prague, 23rd October 2015
Go smart(er)
• Compromise:
  • Approximate
  • Lambda architecture (data flows in, is processed by batch and speed layers, and lands in a serving layer)
• Compose algorithms:
  • e.g. clustering + an actual similarity check
• Use different algorithms:
  • for instance, use an iterative solution instead of a closed-form one
• Come up with new algorithms :-)
31. Mateusz Dymczyk Prague, 23rd October 2015
What we’ll see
•End to end example: similarity search
•Built-in algorithm/util examples:
•Clustering
•Recommender systems (collaborative filtering)
•Logistic regression
•Model evaluation
32. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search
• Problem: given an object (document, image), find all objects similar to it within a given set.
• Solution: similarity is a well-researched topic in mathematics!
• Why:
  • Find the most popular objects.
  • Aggregate similar objects to declutter a view.
  • Find the k most similar objects.
34. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search - pipeline
[Diagram: Input data → Data preprocessing (e.g. tokenization, text normalization) → Vectorization → Similarity check (similarity algorithm) → Result.
Example: “This’s a Short test” → [“short”, “test”], “This’s a not so long Test” → [“long”, “test”] → vectors [1,1,0], [1,0,1] …]
35. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search - distributed pipeline
[Diagram: the same pipeline on a cluster — Input data is split across nodes for Data preprocessing, then Vectorization, then the Similarity check, and the per-node results are combined into the final Result.]
36. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search I
• Brute-force solution:
  • pre-process the text
  • vectorize (in our case TF-IDF)
  • compute all possible pairs
  • compute the cosine similarity between each pair (a sketch of the cosine helper follows below)
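The cosine helper used by the code on the later slides is not shown in the deck; a minimal sketch for MLlib vectors, assuming non-zero norms, could look like this:
import org.apache.spark.mllib.linalg.Vector

// Hedged sketch: cosine similarity between two MLlib vectors (assumes non-zero norms).
def cosine(v1: Vector, v2: Vector): Double = {
  val a = v1.toArray
  val b = v2.toArray
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}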
37. Mateusz Dymczyk Prague, 23rd October 2015
Vectorization: TF-IDF
• Term Frequency-Inverse Document Frequency (formula below):
  • how important a word is for a document within a collection
  • higher when the word occurs often in the document
  • lower when the word is also common across the whole collection
Example: “This’s a Short test” → [“short”, “test”], “This’s a not so long Test” → [“long”, “test”] → [1/6, 1/3, 0], [1/6, 0, 1/3] …
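For reference (not spelled out on the slide), the usual formulation multiplies a term’s frequency in the document by its inverse document frequency; MLlib’s IDF additionally applies add-one smoothing:
tfidf(t, d) = tf(t, d) * idf(t)
idf(t) = log((N + 1) / (df(t) + 1))   // N = number of documents, df(t) = documents containing t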
38. Mateusz Dymczyk Prague, 23rd October 2015
TF-IDF
// TF-IDF with MLlib (imports added for completeness).
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

val documents: RDD[Seq[String]] = sc.textFile("...")
  .map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()                      // IDF makes two passes over the term frequencies
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
39. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search I
// Brute-force all-pairs similarity. Normalizer, TfIdf and cosine are the speaker's
// helper utilities (normalization, TF-IDF vectorization and cosine similarity).
type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.vectorize(normalized).cache()

  vectorized
    .cartesian(vectorized)                                   // every possible pair
    .filter { case (doc1, doc2) => doc1._1.id < doc2._1.id } // keep each pair only once
    .flatMap { case ((d1, v1), (d2, v2)) =>
      val similarity: Double = cosine(v1, v2)
      Seq(
        (d1.doc, (d2.doc, similarity)),
        (d2.doc, (d1.doc, similarity))
      )
    }
    .combineByKey[Seq[(String, Double)]](                    // group similarities per document
      (x: (String, Double)) => Seq(x),
      (acc: Seq[(String, Double)], y: (String, Double)) => acc :+ y,
      (acc1: Seq[(String, Double)], acc2: Seq[(String, Double)]) => acc1 ++ acc2
    )
}
40. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search I - problems
• Computing all-pairs similarity:
  • O(n^2) comparisons
  • 10^6 documents => ~5*10^11 comparisons
  • ~6 days at 10^3 comparisons/ms
• Data shuffle size: O(nL^2)
• Largest reduce key: O(n)
n — number of docs, L — number of unique words in a doc
41. Mateusz Dymczyk Prague, 23rd October 2015
Why is data shuffle so bad?
[Chart: indicative hardware bandwidths (≈50 GB/s for memory, 100-600 MB/s for SSD, ≈100 MB/s for spinning disk, ≈0.3-1 GB/s for the network) — moving data off a node is orders of magnitude slower than processing it in memory.]
42. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search II
[Diagram: the distributed pipeline with one extra step — after Vectorization the data is grouped by feature(s), and the Similarity check then runs only within each group rather than across the whole collection.]
43. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search II
• Problems:
  • What if there are no features to group by?
  • What if grouping produces clusters that are too big?
• Solution: cluster anyway, but be smart about it!
44. Mateusz Dymczyk Prague, 23rd October 2015
Locality sensitive hashing
• Similar objects land in the same bucket (maximizes the % of collisions)
• A group of algorithms (one per similarity measure):
  • random projection for cosine (sketched below)
  • min-hash for Jaccard
  • …
• Problems:
  • possibility of false positives and false negatives
  • double-check the former, minimize the latter
  • might produce duplicate pairs!
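A minimal, purely illustrative sketch of the random-projection idea (this is not the API of the libraries linked on the following slide): each random hyperplane contributes one sign bit, so vectors pointing in similar directions tend to share a signature and land in the same bucket.
import scala.util.Random
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: bucket signatures for random-projection (cosine) LSH.
def randomHyperplanes(dim: Int, numBits: Int, seed: Long): Array[Array[Double]] = {
  val rnd = new Random(seed)
  Array.fill(numBits)(Array.fill(dim)(rnd.nextGaussian()))
}

def signature(v: Vector, planes: Array[Array[Double]]): String = {
  val arr = v.toArray
  planes.map { plane =>
    val dot = arr.zip(plane).map { case (x, y) => x * y }.sum
    if (dot >= 0) '1' else '0'
  }.mkString                     // documents are then grouped by this signature,
}                                // and exact cosine is computed only inside each bucket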
45. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search III
// LSH-based similarity; the LSH class comes from the third-party spark-hash library linked below.
type DocSimilarity = (String, Seq[(String, Double)])
case class Document(id: Long, doc: String)

def similarity(docs: RDD[Document]): RDD[DocSimilarity] = {
  val normalized: RDD[(Document, Set[String])] = docs.map(Normalizer.normalize(_))
  val vectorized: RDD[(Document, Vector)] = TfIdf.extract(normalized).cache()
  val lsh = new LSH(data = vectorized, p = 65537, m = 1000, numRows = 1000, numBands = 25, minClusterSize = 2)
  val model = lsh.run
  val clusters: RDD[(Long, Iterable[SparseVector])] = model.clusters
  clusters.map { case (id, cluster) => cosines(cluster) }   // exact cosine only inside each bucket
}
• Sample implementations:
• https://github.com/mrsqueeze/spark-hash (min-hash)
• https://github.com/marufaytekin/lsh-spark (Charikar’s LSH for cosine)
46. Mateusz Dymczyk Prague, 23rd October 2015
Similarity search - results
INPUT
“パウダーファンデーションのパフがすぐに汚れてしまう。” (“Powder foundation’s puff gets dirty really fast”)
OUTPUT
0.80 “パウダーをつけるパフがすぐに汚れる。” (“The puff gets dirty really fast after applying the powder.”)
0.53 “パフがすぐに汚くなってしまう。” (“The puff gets dirty really fast.”)
0.30 “パウダリーファンデーションをつけるためのスポンジというかパフ、すぐに汚れて、ファンデをつける時にきれいに伸ばせなくなる。” (“The sponge for applying the powdery foundation gets dirty really fast, when using the foundation it doesn’t spread nicely.”)
48. Mateusz Dymczyk Prague, 23rd October 2015
Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("...")
// Each line holds space-separated doubles; KMeans expects an RDD[Vector]
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val clusters = KMeans.train(parsedData, 2, 20)   // k = 2 clusters, 20 iterations
val prediction = clusters.predict(point)         // point: a Vector to assign to a cluster
• an unsupervised learning problem which tries to group subsets of objects with one another based on some notion of similarity
• supported algorithms: K-means, Gaussian mixture, Power iteration clustering (PIC), Latent Dirichlet allocation (LDA)
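Not on the slide, but a quick way to sanity-check the clustering (and to compare different values of k) is the within-set sum of squared errors exposed by KMeansModel:
// Within-Set Sum of Squared Errors: lower means tighter clusters; compare across values of k.
val wssse = clusters.computeCost(parsedData)
println(s"WSSSE = $wssse")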
49. Mateusz Dymczyk Prague, 23rd October 2015
Recommender systems
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val data = sc.textFile("...")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
// ALS.train(ratings, rank, iterations, lambda)
val model = ALS.train(ratings, 1, 20, 0.01)
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
• Collaborative filtering
• User/product matrix predictions
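As a hedged follow-up (the usual MLlib example pattern, not shown on the slide), the predictions can be joined back to the known ratings to compute a mean squared error:
// predictions is an RDD[Rating]; key both sides by (user, product) and compare the rates.
val ratesAndPreds = ratings.map { case Rating(user, product, rate) => ((user, product), rate) }
  .join(predictions.map { case Rating(user, product, rate) => ((user, product), rate) })
val mse = ratesAndPreds.map { case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }.mean()
println(s"Mean Squared Error = $mse")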
50. Mateusz Dymczyk Prague, 23rd October 2015
(Logistic) Regression
• an iterative algorithm - it greatly benefits from caching
• often used for binary classification (can be generalised to multiple classes)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// <label> <idx1>:<val1> <idx2>:<val2> ...
val data = MLUtils.loadLibSVMFile(sc, "...").cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(data)
model.predict(pointToPredict)   // pointToPredict: a feature Vector
52. Mateusz Dymczyk Prague, 23rd October 2015
Supervised learning workflow
[Diagram: Raw data → cleaned/scaled data → split into a training set and a validating set → model creation on the training set → validation → final model, which is then applied to incoming new data.]
53. Mateusz Dymczyk Prague, 23rd October 2015
Model evaluation
• Certain ML algorithms create models
• How do we know if the model we got is good (enough)?
• Different types of evaluation depending on the ML algorithm type:
  • classification: precision and recall (based on true/false positives/negatives)
  • regression: various error metrics based on the difference between predicted and actual values
54. Mateusz Dymczyk Prague, 23rd October 2015
Model evaluation
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "...")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)
model.clearThreshold()   // output raw scores instead of 0/1 labels
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}
val recall = metrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}
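For a single-number summary (not shown on the slide), the same metrics object also exposes the areas under the ROC and precision-recall curves:
println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR = ${metrics.areaUnderPR()}")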
56. Mateusz Dymczyk Prague, 23rd October 2015
Common pitfalls
1. Try to avoid groupByKey()
   • instead try reduceByKey() (see the word-count sketch below)
2. Don’t collect all the data in the driver:
   • collect() will copy all the elements to the driver node
   • instead persist it (to a file or a DB)
3. Use cache()/persist() where necessary (check Spark’s Web UI)!
4. Code for failure and handle malformed input!
5. Remember that everything shipped to the executors must be Serializable!
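A hedged illustration of pitfall 1, reusing the word-count example from the earlier Spark slide: both variants below compute the same counts, but reduceByKey pre-aggregates on each partition before the shuffle, while groupByKey ships every single (word, 1) pair across the network.
// Assumes textFile: RDD[String] as in the earlier word-count example.
val pairs = textFile.flatMap(_.split(" ")).map(word => (word, 1))
val viaGroup = pairs.groupByKey().mapValues(_.sum)   // shuffles every value, then sums
val viaReduce = pairs.reduceByKey(_ + _)             // combines map-side, shuffles only partial sums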
57. Mateusz Dymczyk Prague, 23rd October 2015
Performance recap
1. Parallelising (not concurrency!) makes us faster
2. Network traffic makes us (really) slow
   1. keep data close to the processing units (stay local)
   2. take note of operation order
   3. don’t iterate more than necessary
3. In-memory computation/caching helps a lot (especially for iterative machine learning!)
58. Mateusz Dymczyk Prague, 23rd October 2015
Where to go from here
• Get ideas: https://www.kaggle.com/wiki/DataScienceUseCases
• Get started with Spark:
• http://spark.apache.org/docs/latest/quick-start.html
• https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x
• Get started with MLlib:
• http://spark.apache.org/docs/latest/mllib-guide.html
• https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x
• Try out other frameworks and courses:
• https://github.com/h2oai/sparkling-water
• https://www.coursera.org/course/mmds
• Learn the basics:
• https://www.coursera.org/learn/machine-learning
• Practical books:
• “Advanced Analytics with Spark” — Sandy Ryza et al. O’Reilly Media
• “Data Algorithms: Recipes for Scaling Up with Hadoop and Spark” — Mahmoud Parsian, O’Reilly Media
61. Mateusz Dymczyk Prague, 23rd October 2015
Can I has stream?
• Linear models (regression) can be trained in a streaming fashion (Spark 1.1+)
• Clustering can be done on streams (with k-means)
• what if the data changes over time? — MLlib supports “forgetfulness” (via a decay factor)
62. Mateusz Dymczyk Prague, 23rd October 2015
Can I has stream?
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val ssc = new StreamingContext(conf, Seconds(10))   // batch interval chosen for illustration
val trainingData = ssc.textFileStream("...").map(Vectors.parse)
val testData = ssc.textFileStream("...").map(LabeledPoint.parse)
val model = new StreamingKMeans()
  .setK(2)
  .setDecayFactor(1.0)        // 1.0 = no forgetting; < 1.0 discounts older data
  .setRandomCenters(3, 0.0)   // 3-dimensional data, zero initial weight
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
ssc.start()
ssc.awaitTermination()
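The previous slide mentions that linear models can be trained on streams as well; a hedged sketch using MLlib’s StreamingLinearRegressionWithSGD (these calls would be registered before ssc.start()):
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val regTrain = ssc.textFileStream("...").map(LabeledPoint.parse)
val regTest = ssc.textFileStream("...").map(LabeledPoint.parse)
val regModel = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // 3 features, purely illustrative
regModel.trainOn(regTrain)               // weights are updated with every mini-batch
regModel.predictOnValues(regTest.map(lp => (lp.label, lp.features))).print()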
64. Mateusz Dymczyk Prague, 23rd October 2015
Seldon.io
• open predictive platform
• provides content recommendation and predictive functionality
65. Mateusz Dymczyk Prague, 23rd October 2015
Prediction.io
• open source ML server for building predictive engines
• event collection, algorithms, evaluation and querying predictive results via REST
• uses Hadoop, HBase, Spark and Elasticsearch