Introduction to ML with Apache Spark MLlib

#javaone

https://ua.linkedin.com/in/tarasmatyashovsky

I am not a data science engineer

“I'm a rolling thunder, a pouring rain
I'm comin' on like a hurricane
My lightning's flashing across the sky
You're only young but you're gonna die
I won't take no prisoners, won't spare no lives
Nobody's putting up a fight
I got my bell, I'm gonna take you to hell
I'm gonna get you, Satan get you”
https://github.com/tmatyashovsky/spark-ml-samples

“I'm a rolling thunder, a pouring rain
I'm gonna get you, Satan get you”

• Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.?
• Count the length of a verse?
• Count unique words in a verse?

Machine learning is the study of computer algorithms that improve automatically through experience

Supervised learning
Unsupervised learning
Reinforcement learning

• Date & time
• Conference name
• Speaker
• Talk name
• Track
• Duration
• Type
• Overall impression
• Overall rating
• Number of slides
• Time spent on live coding
• Number of jokes
• Etc.

Learning algorithms
Hypotheses:
Cost function:
Features:
Target variable:
Training example:
Training set:
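
The terms above can be made concrete with a minimal sketch. It assumes a linear hypothesis with a single feature and a halved mean squared error cost; both are illustrative choices that the slides do not specify, and the toy numbers are made up.

```python
# Illustrative sketch: one feature (e.g. number of jokes), one target (e.g. rating).
# The linear hypothesis and squared-error cost are assumptions for this example.

# Training set: a list of training examples (feature x, target variable y).
training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def hypothesis(theta0, theta1, x):
    """Predict the target for feature x with a straight line."""
    return theta0 + theta1 * x


def cost(theta0, theta1, examples):
    """Halved mean squared error of the hypothesis over the training examples."""
    m = len(examples)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in examples) / (2 * m)


perfect_fit = cost(0.0, 2.0, training_set)  # the line y = 2x passes through every example
poor_fit = cost(0.0, 0.0, training_set)     # predicting 0 everywhere
print(perfect_fit, poor_fit)
```

A learning algorithm searches for the parameters that make the cost as small as possible; here the line through every example has zero cost.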

http://www.slideshare.net/liweiyang5/spark-mllib-training-material

Plot: speaker’s rating vs. number of jokes during a talk

Plot: impression (positive or negative) vs. number of jokes during a talk

Plot: number of jokes during a talk vs. time (min.) spent on live coding
Number of clusters: K = 2 and K = 5

• Initialize cluster centroids
• Assign each example to the closest cluster centroid
• Recalculate centroids as the average (mean) of the examples assigned to each cluster
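
The clustering steps above can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the initial centroids are passed in by hand, the loop runs a fixed number of iterations, and the talk data is made up.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, iterations=10):
    """Alternate the assignment and recalculation steps a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Keep the old centroid if no example was assigned to it.
        centroids = [mean(c) if c else old for c, old in zip(clusters, centroids)]
    return centroids


# Hypothetical talks as (number of jokes, minutes of live coding).
talks = [(1, 2), (2, 1), (1, 1), (9, 30), (10, 28), (11, 32)]
centroids = k_means(talks, centroids=[(0, 0), (10, 10)])
print(centroids)
```

With K = 2 the two centroids settle on the "few jokes, little live coding" group and the "many jokes, lots of live coding" group.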

• Collect a data set of lyrics:
  • ABBA, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
• Create a training set, i.e. label (0|1) + features
• Train logistic regression (or another classification algorithm)
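
As a rough sketch of the last two steps, the code below trains logistic regression with batch gradient descent on a tiny training set of label (0|1) + features. The feature values, learning rate and epoch count are made up for illustration; the talk itself relies on Spark's logistic regression rather than hand-written code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, learning_rate=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(examples[0][1])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for label, features in examples:
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, features))) - label
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        weights = [w - learning_rate * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(examples)
    return weights, bias


def predict(weights, bias, features):
    """Return 1 if the estimated probability of label 1 is at least 0.5, else 0."""
    return 1 if sigmoid(bias + sum(w * x for w, x in zip(weights, features))) >= 0.5 else 0


# Made-up training set: label (0|1) + two numeric features per verse.
training_set = [
    (0, [0.0, 0.1]), (0, [0.1, 0.0]), (0, [0.1, 0.1]),
    (1, [0.9, 1.0]), (1, [1.0, 0.9]), (1, [0.9, 0.9]),
]
weights, bias = train(training_set)
predictions = [predict(weights, bias, features) for _, features in training_set]
print(predictions)
```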

GloVe
Bag of Words
Word2Vec
TF-IDF
http://spark.apache.org/docs/latest/ml-features.html#feature-extractors
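
To give a feel for the simplest of these extractors, here is a minimal bag-of-words sketch in plain Python. The two verses and the alphabetical vocabulary are choices made for this example; the feature extractors in Spark are more elaborate.

```python
from collections import Counter

# Two short verses used purely for illustration.
verses = ["im gonna get you satan get you", "baby one more time"]

# Vocabulary: every distinct word across all verses, in alphabetical order.
vocabulary = sorted({word for verse in verses for word in verse.split()})


def bag_of_words(verse):
    """Count how often each vocabulary word occurs in the verse."""
    counts = Counter(verse.split())
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(verse) for verse in verses]
print(vocabulary)
print(vectors)
```

Each verse becomes a vector with one position per vocabulary word, so word order is lost and the vectors grow with the vocabulary.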

Word2Vec:
• Produces unique fixed-size dense vectors
• Captures semantic and morphological similarity
https://code.google.com/archive/p/word2vec/

Similar scores (cos ~ 1)
Opposite scores (cos ~ -1)
Unrelated scores (cos ~ 0)
http://bionlp-www.utu.fi/wv_demo/
http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png
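
The three cases above follow directly from the definition of cosine similarity, the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up two-dimensional vectors:

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similar = cosine_similarity([1.0, 2.0], [2.0, 4.0])      # same direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # perpendicular
print(similar, opposite, unrelated)
```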

Verse                                  Cosine Distance
baby one more time                     0.482028
crazy for you                          0.437875
show me the meaning of being lonely    0.258147
highway to hell                        -0.1120049
kill them all                          -0.231876

Under-fitting (high bias)
Over-fitting (high variance)
Appropriate fitting
http://mlwiki.org/index.php/Overfitting

Training set (66.6%)
Test set (33.3%)
K = 3
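
With K = 3, the data is cut into three folds and each fold serves as the test set once while the other two form the training set. A minimal sketch, assuming a simple round-robin assignment of examples to folds (real implementations usually shuffle first):

```python
def k_fold_splits(examples, k):
    """Yield (training set, test set) pairs, using each fold as the test set once."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test


examples = list(range(9))  # stand-ins for nine labeled verses
splits = list(k_fold_splits(examples, k=3))
for train, test in splits:
    print(len(train), len(test))
```

Averaging the evaluation metric over the three test folds gives a less noisy estimate than a single train/test split.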


Weka
Encog
Aerosolve
FlinkML
https://github.com/josephmisiti/awesome-machine-learning

Ease of use
Cloud computing
Speed
Generality
Data processing

https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html

Spark MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster

• Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc.
• Allows invoking various algorithms on distributed datasets (RDD/Dataset)
http://spark.apache.org/docs/latest/mllib-guide.html

spark.mllib: built on top of RDDs
spark.ml: built on top of Datasets

• Utilities: linear algebra, statistics, etc.
• Feature extraction, feature transformation, etc.
• Regression
• Classification
• Clustering
• Collaborative filtering, e.g. alternating least squares
• Dimensionality reduction
• And many more

“All” spark.mllib features plus:
• Pipelines
• Persistence
• Model selection and tuning:
  • Train-validation split
  • K-fold cross-validation
http://spark.apache.org/docs/latest/ml-guide.html

Pipeline diagram: raw data flows as a Dataset through a chain of Transformers and Estimators, each with its own [parameters]; a Cross Validator wraps [pipeline, evaluator, parameters]
http://spark.apache.org/docs/latest/ml-pipeline.html
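
The idea behind the diagram can be sketched without Spark: a transformer maps one dataset to another, an estimator is fitted on a dataset and yields a transformer (a model), and a pipeline chains them. The class and function names below are hypothetical and only mimic the pattern; they are not Spark's API.

```python
class Lowercaser:
    """A transformer: maps a dataset to a new dataset without learning anything."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyModel:
    """The transformer produced by fitting VocabularyEstimator."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [[self.vocabulary[word] for word in row.split()] for row in rows]


class VocabularyEstimator:
    """An estimator: learns from a dataset and returns a transformer (a model)."""

    def fit(self, rows):
        words = sorted({word for row in rows for word in row.split()})
        return VocabularyModel({word: index for index, word in enumerate(words)})


def fit_pipeline(stages, rows):
    """Fit the stages in order, returning the fitted transformers and the final dataset."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted, rows


fitted, encoded = fit_pipeline([Lowercaser(), VocabularyEstimator()], ["Hells Bells", "hells"])
print(encoded)
```

Because every stage consumes and produces the same kind of dataset, stages can be reordered or swapped, and the whole chain can be handed to a cross-validation step as a single unit.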

Lyrics

I'm a rolling thunder, a pouring rain
I'm gonna get you, Satan get you

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser


Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator

1  Im a rolling thunder a pouring rain
2  Im comin on like a hurricane
3  My lightnings flashing across the sky
4  Youre only young but youre gonna die
5  I wont take no prisoners wont spare no lives
6  Nobodys putting up a fight
7  I got my bell Im gonna take you to hell
8  Im gonna get you Satan get you

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover

1  im a rolling thunder a pouring rain
2  im comin on like a hurricane
3  my lightnings flashing across the sky
4  youre only young but youre gonna die
5  i wont take no prisoners wont spare no lives
6  nobodys putting up a fight
7  i got my bell im gonna take you to hell
8  im gonna get you satan get you

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter

1  im rolling thunder pouring rain
2  im comin like hurricane
3  lightnings flashing across sky
4  youre young youre gonna die
5  wont take prisoners wont spare lives
6  nobodys putting fight
7  got bell im gonna take hell
8  im gonna get satan get
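
The cleansing, tokenizing and stop-word removal shown so far can be approximated in a few lines of plain Python. This is a sketch under assumptions: the regular expression and the tiny stop-word list are chosen for this example and are not taken from the talk's code.

```python
import re

# A tiny stop-word list chosen for this example; real lists are much longer.
STOP_WORDS = {"a", "on", "my", "the"}


def cleanse(line):
    """Drop every character that is not a letter or whitespace."""
    return re.sub(r"[^A-Za-z\s]", "", line)


def tokenize(line):
    """Lowercase the line and split it on whitespace."""
    return line.lower().split()


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


line = "I'm a rolling thunder, a pouring rain"
tokens = remove_stop_words(tokenize(cleanse(line)))
print(tokens)
```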

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse]

Sentences in verse: 4
1  im roll thunder pour rain
2  im comin like hurrican
3  lightn flash across sky
4  your young your gonna die
5  wont take prison wont spare live
6  nobodi put fight
Rows 1-8 are grouped into verse1 and verse2

Sentences in verse: 8
1  im roll thunder pour rain
2  im comin like hurrican
3  lightn flash across sky
4  your young your gonna die
5  wont take prison wont spare live
6  nobodi put fight
Rows 1-8 are grouped into verse1
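
Grouping sentences into verses amounts to chunking the numbered rows. A minimal sketch, assuming consecutive sentences are simply joined with spaces; the function name is hypothetical and only the six stemmed sentences shown on the slides are used.

```python
def to_verses(sentences, sentences_in_verse):
    """Join consecutive sentences into verses of at most `sentences_in_verse` sentences."""
    return [
        " ".join(sentences[i:i + sentences_in_verse])
        for i in range(0, len(sentences), sentences_in_verse)
    ]


sentences = [
    "im roll thunder pour rain",
    "im comin like hurrican",
    "lightn flash across sky",
    "your young your gonna die",
    "wont take prison wont spare live",
    "nobodi put fight",
]
two_verses = to_verses(sentences, 4)
one_verse = to_verses(sentences, 8)
print(len(two_verses), len(one_verse))
```

The verse size is a tunable parameter: small verses give more training examples, while large verses give each example more context.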

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] → Word2Vec [Vector size]

Sentences in verse: 4
feature1: [0.036463763926011056, -0.013076733228398295, ..., 0.03816963326281462]
feature2: [-0.013962931134021625, 0.049275818325650804, ..., -0.058982484615766086]

Sentences in verse: 8
feature1: [0.036463763926011056, -0.013076733228398295, 0.044362547532774695, 0.03816963326281462, ..., -0.013962931134021625, 0.049275818325650804, -0.058982484615766086]

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] → Word2Vec [Vector size] → Logistic Regression [Max iterations, Reg parameter]

Probability: [0.9212126972383768, 0.07878730276162313]
Prediction: 0.0
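
The prediction follows from the probability vector: the first entry is the estimated probability of label 0, the second of label 1, and the larger one wins. A minimal sketch, assuming the label is simply the index of the largest probability with no custom thresholds:

```python
# Probability vector copied from the slide: one entry per label (0 and 1).
probability = [0.9212126972383768, 0.07878730276162313]


def predict(probability):
    """Return the index of the largest probability as a float label."""
    return float(max(range(len(probability)), key=lambda i: probability[i]))


prediction = predict(probability)
print(prediction)
```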

Pipeline (each stage outputs a Dataset): Lyrics → Cleanser → Numerator → Tokenizer → Stop Words Remover → Exploder → Stemmer → Uniter → Verser [Sentences in verse] → Word2Vec [Vector size] → Logistic Regression [Max iterations, Reg parameter], all wrapped in a Cross Validator that outputs a Model

[0.8454839775240359,
0.9061236588248319,
0.9527128936788524,
0.9522790271664413,
...
0.9526248129757111,
0.9522790271664411]


• ML is not as complex as it seems from an applied perspective
• Existing libraries and frameworks take care of a lot of tedious work
• For instance, Spark MLlib can help build nice ML pipelines

• https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
• Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
• https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
• https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
• https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
• https://www.kaggle.com/c/dogs-vs-cats/
• http://yann.lecun.com/exdb/mnist/
• http://www.bcl.hamilton.ie/~barak/teach/F98/ECE547/hw1/index.html
• http://www.slideshare.net/jeykottalam/pipelines-ampcamp
• https://github.com/master/spark-stemming
• https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-apache-spark.html
• http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
• https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
• http://www.slideshare.net/liweiyang5/spark-mllib-training-material
• https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
• http://www.slideshare.net/databricks/combining-machine-learning-frameworks-with-apache-spark
• https://databricks.com/blog/2015/10/20/audience-modeling-with-apache-spark-ml-pipelines.html
• https://github.com/deeplearning4j/deeplearning4j
• http://deeplearning4j.org/spark
• http://mlwiki.org/index.php/Overfitting
• http://bionlp-www.utu.fi/wv_demo/
• https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/
