Apache Spark Machine Learning

- Praveen Devarao

Agenda

•  What is Machine Learning?
•  The machine learning module in Spark
•  SparkML pipelines
•  Extraction, Selection and Tuning
•  Demo

What is Machine Learning?

•  A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
•  Field of study that gives computers the ability to learn without being explicitly programmed

How is it achieved?

•  Build mathematical models for given tasks
•  Represent the given dataset mathematically
•  Apply statistical methods to this mathematical representation
•  Tune and derive a model that can perform the needed task
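The four steps above can be sketched on a toy problem. This is a minimal illustration in plain Python, not Spark code: the data points are made up, and ordinary least squares stands in for "a statistical method". The names fit_line and predict are hypothetical helpers.

```python
# Toy illustration of the four steps; the data points are invented for the example.

def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Represent the dataset mathematically: one numeric feature, one numeric target.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

# Apply the statistical method to derive the model parameters.
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Use the derived model to perform the task on new input."""
    return slope * x + intercept

print(slope, intercept, predict(5.0))
```

Tuning is trivial here because the line has a closed-form solution; with real models the "tune" step usually means searching over parameters, as the later slides on model tuning describe.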

Categories of ML

•  Supervised learning
    •  The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data
    •  The goal is to learn a general rule that maps inputs to outputs
•  Unsupervised learning
    •  No labels are given to the learning algorithm, leaving it on its own to find structure (patterns and relationships) in its input
    •  Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning)

Categories of ML

[Figure: two panels with axes f1 and f2, titled Supervised and Unsupervised]
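The difference between the two categories can be shown on one small set of points with two features. The sketch below is plain Python with made-up data, not Spark code. A nearest-centroid classifier stands in for supervised learning and a basic k-means loop for unsupervised learning; both are illustrative choices, and the helper names are hypothetical.

```python
import math

# Six points described by two features (f1, f2); the values are invented.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["a", "a", "a", "b", "b", "b"]  # only the supervised learner sees these

def centroid(pts):
    """Mean point of a non-empty list of 2-D points."""
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Supervised: learn one centroid per label, then map new inputs to a label.
classes = sorted(set(labels))
class_centers = [
    centroid([p for p, l in zip(points, labels) if l == c]) for c in classes
]

def classify(point):
    return classes[nearest(point, class_centers)]

# Unsupervised: k-means with k = 2 groups the points without using any labels.
def kmeans(pts, centers, iterations=10):
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in pts:
            groups[nearest(p, centers)].append(p)
        # Keep the previous center if a group ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return [nearest(p, centers) for p in pts]

clusters = kmeans(points, centers=[points[0], points[3]])
print(classify((2.0, 2.0)), clusters)
```

The classifier returns a label name because it was given labels to learn from; the clustering only returns group indices, and it is up to the user to interpret what each group means.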

SparkML – The machine learning module of Spark

•  APIs based on DataFrames
    •  Distributed collection of data organized as columns
•  Contains commonly used ML algorithms
    •  Classification
    •  Regression
    •  Clustering
•  Featurization - feature extraction, transformation, dimensionality reduction, and selection
•  Pipelines - tools for constructing, evaluating, and tuning
•  Persistence of models and pipelines

Machine Learning process

SparkML Pipelines

•  Transformer: Algorithm that transforms one DataFrame into another
•  Estimator: Algorithm that is fit on a DataFrame to produce a Transformer
•  Parameters: Settings that control how Estimators (and Transformers) behave
•  Pipeline: Chain of multiple Transformers and Estimators that forms the ML flow
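The relationship between these four concepts can be sketched without Spark. The classes below are hypothetical stand-ins written in plain Python, with a list of dicts playing the role of a DataFrame; they mirror the fit/transform pattern but are not Spark's API.

```python
class Scale:
    """Transformer: maps one dataset to another; nothing is learned."""
    def __init__(self, factor):
        self.factor = factor  # a parameter of this stage

    def transform(self, rows):
        return [{**r, "x": r["x"] * self.factor} for r in rows]


class MeanCenterModel:
    """Transformer produced by fitting the MeanCenter estimator."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "x": r["x"] - self.mean} for r in rows]


class MeanCenter:
    """Estimator: learns from a dataset and returns a transformer."""
    def fit(self, rows):
        return MeanCenterModel(sum(r["x"] for r in rows) / len(rows))


class PipelineModel:
    """Fitted pipeline: a chain that contains transformers only."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chain of transformers and estimators; fitting it yields a PipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it on the current data
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed transformed data to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = Pipeline([Scale(factor=10.0), MeanCenter()]).fit(train)
print(model.transform([{"x": 4.0}]))
```

The point of the pattern is that everything learned from the training data (here, the mean) is captured in the fitted model, so the same steps can be replayed unchanged on new data.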

Extractors

•  Algorithms to extract features from raw data
    •  TF-IDF (Term Frequency-Inverse Document Frequency)
    •  Word2Vec:
        •  2-layer neural network that converts words to vectors
    •  CountVectorizer:
        •  Converts text documents to vectors of token counts
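To make token counts and TF-IDF concrete, here is a small plain-Python sketch over three made-up documents. It is not Spark code. The smoothed formula idf = log((N + 1) / (df + 1)) is the one given in Spark's MLlib documentation for IDF; the rest (vocabulary handling, variable names) is an assumption made for illustration.

```python
import math
from collections import Counter

# Three tiny tokenized documents; the text is invented for the example.
docs = [
    "spark makes machine learning scalable".split(),
    "machine learning needs data".split(),
    "spark processes data".split(),
]

# Vocabulary: every distinct token, in a fixed (alphabetical) order.
vocab = sorted({token for doc in docs for token in doc})

def count_vector(doc):
    """Token counts for one document, aligned with vocab."""
    counts = Counter(doc)
    return [counts[token] for token in vocab]

# Term frequencies: one count vector per document.
tf = [count_vector(doc) for doc in docs]

# Document frequency: in how many documents each token appears.
n_docs = len(docs)
df = [sum(1 for row in tf if row[i] > 0) for i in range(len(vocab))]

# Smoothed inverse document frequency, then TF-IDF as the product.
idf = [math.log((n_docs + 1) / (d + 1)) for d in df]
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

for token, weight in zip(vocab, idf):
    print(f"{token:10s} idf={weight:.3f}")
```

Tokens that appear in fewer documents ("makes", "needs") get a higher weight than tokens shared by several documents ("spark", "data").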

Transformers and Selectors

•  Transformers:
    •  Algorithms for scaling, modifying or converting features
        •  Tokenizer
        •  StringIndexer
        •  VectorAssembler
        •  PCA
•  Selectors:
    •  Algorithms for selecting a subset of a larger set of features
        •  VectorSlicer
        •  RFormula
        •  ChiSqSelector
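What a few of these stages do can be sketched in plain Python. The functions below are hypothetical helpers that imitate the idea behind StringIndexer (map string labels to indices, most frequent label first), VectorAssembler (combine columns into one feature vector) and VectorSlicer (keep selected positions of a vector). They are not Spark's implementations, and the rows are made up.

```python
from collections import Counter

# A tiny dataset as a list of rows; values are invented for the example.
rows = [
    {"color": "red", "height": 1.0, "width": 2.0},
    {"color": "blue", "height": 3.0, "width": 4.0},
    {"color": "red", "height": 5.0, "width": 6.0},
]

def fit_string_index(rows, column):
    """Map each distinct string to an index, most frequent label first."""
    counts = Counter(r[column] for r in rows)
    # Ties are broken alphabetically here so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def assemble(row, columns):
    """Combine the named columns into a single feature vector."""
    return [row[c] for c in columns]

def slice_vector(vector, indices):
    """Keep only the given positions of a feature vector."""
    return [vector[i] for i in indices]

index = fit_string_index(rows, "color")
indexed = [{**r, "color_idx": index[r["color"]]} for r in rows]
features = [assemble(r, ["color_idx", "height", "width"]) for r in indexed]
selected = [slice_vector(v, [1, 2]) for v in features]

print(index)
print(features)
print(selected)
```

Note that the index mapping has to be learned from the data before it can be applied, which is why a stage like this is an estimator that produces a transformer, while assembling and slicing need no fitting.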

Model Evaluation Techniques

•  Evaluation:
    •  F1 Score
       Calculate precision and recall from the confusion matrix:
       precision = True Positives / Predicted Positives
       recall = True Positives / Actual Positives
       F1 = 2 × precision × recall / (precision + recall)
    •  ROC

Confusion Matrix

                    Predicted Positive    Predicted Negative
Actual Positive     True Positive         False Negative
Actual Negative     False Positive        True Negative
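The precision, recall and F1 calculation can be worked through on a small example. The sketch below is plain Python; the actual and predicted labels are made up, and confusion_counts is a hypothetical helper.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of the confusion matrix for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # false positive
        elif a == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn

# Invented labels: 1 is the positive class, 0 the negative class.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)  # true positives / predicted positives
recall = tp / (tp + fn)     # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(precision, recall, f1)
```

With 3 true positives, 1 false positive and 1 false negative, precision and recall are both 0.75, and so is F1, since F1 is the harmonic mean of the two.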

SparkML Evaluators and Tuning

•  Evaluators:
    •  BinaryClassificationEvaluator
        •  areaUnderROC & areaUnderPR
    •  MulticlassClassificationEvaluator
        •  f1, weightedPrecision, weightedRecall
    •  RegressionEvaluator
        •  MSE, RMSE
•  Model Tuning and Selection:
    •  CrossValidator
        •  k (train, test) dataset pairs are created, one per fold
        •  Trains and evaluates each param setting on every pair
        •  Expensive
    •  TrainValidationSplit
        •  1 (train, test) dataset pair is created
        •  Trains and evaluates each combination of the params only once
        •  Less expensive than cross-validation