Fast Distributed Online Classification
Ram Sriharsha (Product Manager, Apache Spark, Databricks)
Prasad Chalasani (SVP Data Science, MediaMath)
13 April, 2016
Summary
We leveraged recent machine-learning research to develop a
- fast, practical,
- scalable (up to 100s of millions of sparse features),
- online,
- distributed (built on Apache Spark),
- single-pass
ML classifier that has significant advantages
over most similar ML packages.
Key Conceptual Take-aways
- Supervised Machine Learning
- Online vs Batch Learning, and importance of Online
- Challenges in online learning
- Distributed implementation in Spark
Supervised Machine Learning: Overview
Given:
- labeled training data
Goal:
- fit a model to predict labels on (unseen) test data.
Supervised Machine Learning: Overview
Given:
- training data D: n labeled examples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where
  - x_i is a k-dimensional feature-vector
  - y_i is the label (0 or 1) that we want to predict
- an error (or loss) metric L(p, y) from predicting p when the true label is y.
Fix a family of functions f_w(x) ∈ F that are parametrised by a weight-vector w.
Goal: find w that minimizes average loss over D:
L(w) = (1/n) ∑_{i=1}^{n} L_i(w) = (1/n) ∑_{i=1}^{n} L(f_w(x_i), y_i).
Logistic Regression
Logistic model: f_w(x) = 1 / (1 + e^{−w·x})
Probability interpretation
Loss function: L_i(w) = −y_i ln(f_w(x_i)) − (1 − y_i) ln(1 − f_w(x_i))
Overall loss: L(w) = ∑_{i=1}^{n} L_i(w)
L(w) is convex:
- no local minima
- differentiate and follow gradients
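One step the slides leave implicit: for this loss and model, the per-example gradient has a simple closed form. Using the standard identity ∂f_w(x)/∂w = f_w(x)(1 − f_w(x)) x, a short derivation (not on the original slide) gives

\frac{\partial L_i(w)}{\partial w}
  = \Bigl(-\frac{y_i}{f_w(x_i)} + \frac{1 - y_i}{1 - f_w(x_i)}\Bigr)\, f_w(x_i)\bigl(1 - f_w(x_i)\bigr)\, x_i
  = \bigl(f_w(x_i) - y_i\bigr)\, x_i ,

i.e. each example pushes w along its feature-vector x_i, scaled by the prediction error. This is the per-example gradient g_{ti} used by the online updates below.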
Logistic Regression: gradient descent
Gradient Descent
Basic idea:
- start with an initial guess of weight-vector w
- at iteration t, update w to a new weight-vector w′:
  w′ = w − λ g_t
  where
  - g_t is the (vector) gradient of L(w) w.r.t. w at time t,
  - λ is the learning rate.
Gradient Descent
Gradient g_t ⇒ step direction
Learning rate λ ⇒ step size
Gradient Descent
g_t = ∂L(w)/∂w = ∑_i ∂L_i(w)/∂w = ∑_i g_{ti}
This is Batch Gradient Descent (BGD):
- to make one weight-update, need to compute gradient over entire training data-set
- repeat this until convergence.
BGD is not scalable to large data-sets.
Online (Stochastic) Gradient Descent (SGD)
A drastic simplification: instead of computing the gradient over the entire training data-set,
g_t = ∑_i ∂L_i(w)/∂w,
and doing an update w′ = w − λ g_t,
shuffle the data-set (if not naturally shuffled), compute the gradient based on a single example,
g_{ti} = ∂L_i(w)/∂w,
and do an update w′ = w − λ g_{ti}.
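To make the update rule concrete, here is a minimal Scala sketch of SGD for the logistic model above, operating on sparse examples stored as a feature-index → value map. It illustrates the per-example update only; it is not the Slider implementation, and the names (Example, predict, sgdStep) are hypothetical.

import scala.collection.mutable

// One sparse labeled example: hashed feature index -> value, plus a 0/1 label.
case class Example(features: Map[Int, Double], label: Double)

// Logistic prediction f_w(x) = 1 / (1 + e^(-w.x)) over sparse features.
def predict(w: mutable.Map[Int, Double], ex: Example): Double = {
  val margin = ex.features.map { case (i, v) => w.getOrElse(i, 0.0) * v }.sum
  1.0 / (1.0 + math.exp(-margin))
}

// One online update w' = w - lambda * g_ti, with g_ti = (f_w(x_i) - y_i) x_i for the log-loss.
def sgdStep(w: mutable.Map[Int, Double], ex: Example, lambda: Double): Unit = {
  val err = predict(w, ex) - ex.label
  for ((i, v) <- ex.features)
    w(i) = w.getOrElse(i, 0.0) - lambda * err * v
}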
Batch vs Online Gradient Descent
Batch:
- to make one step, compute gradient w.r.t. entire data-set
- extremely slow updates
- correct gradient
Online:
- to make one step, compute gradient w.r.t. one example
- extremely fast updates
- not necessarily correct gradient
Visualize Batch vs Stochastic Gradient Descent
Batch vs Online Learning
Batch Learning:
- process a large training data-set, generate a model
- use model to predict labels of test data-set
Drawbacks:
- infeasible/impractically slow for large data-sets
- need to repeat batch process to update model with new data
Batch vs Online Learning
Online Learning:
- for each “training” example:
  - generate prediction (score),
  - compare with true label,
  - update model (weights w)
- for each “test” example:
  - predict with latest learned model (weights w)
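Training then reduces to “predict, compare, update” over a stream. The sketch below (hypothetical names, reusing Example/predict/sgdStep from the earlier sketch) shows this single-pass style, where every example is scored before the model learns from it, so training and evaluation interleave naturally.

import scala.collection.mutable

// Single pass over a stream of labeled examples: score each example with the
// current weights first, then immediately update the weights with it.
def trainOnline(stream: Iterator[Example], lambda: Double): mutable.Map[Int, Double] = {
  val w = mutable.Map.empty[Int, Double]
  for (ex <- stream) {
    val score = predict(w, ex)  // prediction is made before the label is used
    // (compare score with ex.label here to track a running/progressive loss)
    sgdStep(w, ex, lambda)      // then update the model with this example
  }
  w
}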
Batch vs Online Learning
Online Learning benefits:
- does not pre-process the entire training data-set
- does not explicitly retain previously-seen examples
- extremely light-weight: space- and time-efficient
- no distinct “training” and “testing” phases:
  - incremental, continual learning
  - adapts to changing patterns
  - easily update existing model with new data
- better generalization to unseen observations.
The Online Learning Paradigm
As each labeled example (x_i, y_i) is seen:
- make prediction given only current weight-vector w
- update weight-vector w
Online Learning: Use Scenarios
- extremely large data-sets where
  - batch learning is computationally infeasible/impractical, and
  - only a single pass over the data is possible.
- data arrives in real-time, and
  - decisions/predictions must be made quickly
  - the learned model needs to adapt quickly to recent observations.
Online Learning Examples
Online Learning Example: Advertising (MediaMath)
Listen to 100 billion ad-opportunities daily from Ad Exchanges.
For each opportunity, need to predict whether the exposed user will buy, as a function of several features:
- hour_of_day, browser_type, geo_region, age, ...
Online learning benefits:
- fast update of learned model to reflect latest observations
- light-weight models, extremely quick to compute
Online Learning: IoT
Vast amounts of data; need to adapt, respond quickly.
Nest Thermostats: behavior data ⇒ predict preferred room temp.
Self-driving Cars: (sensor data, other cars) ⇒ predict collision
Clinical: sensors (activity, vitals, ...) ⇒ predict cardiac event
Smart cities: traffic sensors ⇒ predict congestion
Online Learning: Challenge #1
Feature scale differences
Online Learning: Feature Scaling
Example from the wearable-devices domain:
- feature 1 = heart-rate, range 40 to 200
- feature 2 = step-count, range 0 to 500,000
Extreme scale differences ⇒ convergence problems.
Convergence is much faster when features are of the same scale:
- normalize each feature by dividing by its max possible value.
Online Learning: Feature Scaling
But often:
- the range of features is not known in advance, and
- we cannot make a separate pass over the data to find ranges.
⇒ Need single-pass algorithms that adaptively normalize features with each new observation.
[Ross, Mineiro, Langford 2013] proposed such an algorithm, which we implemented in our online ML system.
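We do not reproduce the [Ross, Mineiro, Langford 2013] algorithm here. As a much simpler stand-in that conveys the idea, the sketch below keeps a running maximum absolute value per feature and rescales each incoming value by it, in the same single pass. The paper's algorithm goes further (e.g. it also corrects already-accumulated weights when a feature's observed range grows), so treat this only as an illustration, not as what Slider implements.

import scala.collection.mutable

// Simplified single-pass feature scaling: remember the largest |value| seen so
// far for each feature index and divide new values by it.
class RunningNormalizer {
  private val maxAbs = mutable.Map.empty[Int, Double]

  def normalize(features: Map[Int, Double]): Map[Int, Double] =
    features.map { case (i, v) =>
      val m = math.max(maxAbs.getOrElse(i, 0.0), math.abs(v))
      maxAbs(i) = m
      i -> (if (m == 0.0) 0.0 else v / m)
    }
}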
Online Learning Challenge #2:
(Sparse) Feature frequency differences
Online Learning: Feature Frequency Differences
Some sparse features occur much more frequently than others, e.g.:
- categorical feature country with 200 values,
  - encoded as a vector of length 200 with exactly one entry = 1 and the rest 0
  - country=USA may occur much more often than country=Belgium
- indicator feature visited_site = 1 much more often than purchased = 1.
Online Learning: Feature Frequency Differences
Often, rare features are much more predictive than frequent features.
The same learning rate for all features ⇒ slow convergence.
⇒ Rare features should have larger learning rates:
- bigger steps whenever a rare feature is seen
- much faster convergence
Effectively, the algorithm pays more attention to rare features, enabling it to find rare but predictive features.
ADAGRAD is an algorithm for this [Duchi, Hazan, Singer 2010], and we implemented it in our learning system.
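A minimal sketch of the per-feature (AdaGrad-style) learning rate for the logistic model: each feature accumulates the sum of its squared gradients, and its step size shrinks as that sum grows, so rarely-seen features keep taking large steps. This is the textbook per-coordinate update, shown for illustration rather than taken from the Slider code; it reuses the hypothetical Example type from the earlier sketch.

import scala.collection.mutable

// Per-feature AdaGrad update for logistic loss: the step size for feature i is
// lambda / sqrt(G_i), where G_i is that feature's accumulated squared gradient.
class AdaGradLogistic(lambda: Double, eps: Double = 1e-8) {
  val w = mutable.Map.empty[Int, Double]           // weights
  private val g2 = mutable.Map.empty[Int, Double]  // per-feature sum of squared gradients

  def predict(features: Map[Int, Double]): Double = {
    val margin = features.map { case (i, v) => w.getOrElse(i, 0.0) * v }.sum
    1.0 / (1.0 + math.exp(-margin))
  }

  def update(ex: Example): Unit = {
    val err = predict(ex.features) - ex.label      // f_w(x_i) - y_i
    for ((i, v) <- ex.features) {
      val g = err * v                              // gradient for this feature
      val acc = g2.getOrElse(i, 0.0) + g * g
      g2(i) = acc
      w(i) = w.getOrElse(i, 0.0) - (lambda / math.sqrt(acc + eps)) * g
    }
  }
}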
Online Learning Challenge #3:
Encoding sparse features
Online Learning: Sparse Features
E.g. site_domain has a large (unknown) set of possible values
- google.com, yahoo.com, cnn.com, ...
- need to encode (conceptually) as 1-hot vectors, e.g.
  - google.com = (1, 0, 0, 0, ...)
  - yahoo.com = (0, 1, 0, 0, ...)
  - cnn.com = (0, 0, 1, 0, ...)
  - ...
- all possible values not known in advance
- cannot pre-process data to find all possible values
- don't want to encode explicit (long) vectors
Online Learning: Sparse Features, Hashing Trick
E.g. observation:
- country = "china" (categorical)
- age = 32 (numerical)
- domain = "google.com" (categorical)
Online Learning: Sparse Features, Hashing Trick
Hash the feature-names:
- hash("country_china") = 24378
- hash("age") = 32905
- hash("domain_google.com") = 84395
Online Learning: Sparse Features, Hashing Trick
Represent the observation as a (special) Map:
{24378 → 1.0, 32905 → 32.0, 84395 → 1.0}
Sparse representation (no explicit vectors)
No need for a separate pass on the data (unlike Spark MLlib)
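A minimal sketch of this encoding: hash each feature name (with the value appended for categorical features) into a bounded index space and collect only the non-zero entries in a map. The hash function (MurmurHash3 from the Scala standard library) and the 2^20-bucket space are illustrative choices, not necessarily what Slider uses.

import scala.util.hashing.MurmurHash3

// Hash feature names into a fixed index space and build a sparse feature map.
// Categoricals hash "name_value" and map to 1.0; numericals hash "name" and keep their value.
object FeatureHasher {
  val NumBuckets = 1 << 20  // illustrative size of the hashed feature space (power of 2)

  private def index(s: String): Int =
    MurmurHash3.stringHash(s) & (NumBuckets - 1)   // non-negative bucket index

  def encode(categorical: Map[String, String],
             numerical: Map[String, Double]): Map[Int, Double] = {
    val cat = categorical.map { case (name, value) => index(s"${name}_$value") -> 1.0 }
    val num = numerical.map { case (name, value) => index(name) -> value }
    cat ++ num
  }
}

// e.g. FeatureHasher.encode(Map("country" -> "china", "domain" -> "google.com"),
//                           Map("age" -> 32.0))
// yields a sparse map like {<idx1> -> 1.0, <idx2> -> 1.0, <idx3> -> 32.0}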
Online Learning Challenge #4:
Distributed Implementation of Online Learning
Distributed Online Logistic Regression
Stochastic Gradient Descent (SGD) is inherently sequential:
- how to parallelize?
Our (Scala) implementation in Apache Spark:
- randomly re-partition training data into shards
- use SGD to learn a model for each shard
- average models using TreeReduce (~ "AllReduce")
- leverages Spark/Hadoop fault-tolerance.
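A simplified sketch of this scheme on an RDD of sparse examples: shuffle into shards, run sequential SGD inside each partition, and combine the per-shard models with treeReduce by count-weighted averaging. It illustrates the approach under the hypothetical Example type from the earlier sketches; the actual Slider code is richer (it also carries the normalization and per-feature learning-rate state described above).

import org.apache.spark.rdd.RDD

// A learned weight vector plus the number of examples that produced it.
case class Model(weights: Map[Int, Double], count: Long) {
  // Count-weighted element-wise average of two partial models.
  def merge(other: Model): Model = {
    val total = math.max(count + other.count, 1L).toDouble
    val keys = weights.keySet ++ other.weights.keySet
    val avg = keys.map { k =>
      k -> (weights.getOrElse(k, 0.0) * count + other.weights.getOrElse(k, 0.0) * other.count) / total
    }.toMap
    Model(avg, count + other.count)
  }
}

// Sequential SGD over one shard (one Spark partition) of the shuffled data.
def trainShard(examples: Iterator[Example], lambda: Double): Model = {
  var w = Map.empty[Int, Double]
  var n = 0L
  for (ex <- examples) {
    val margin = ex.features.map { case (i, v) => w.getOrElse(i, 0.0) * v }.sum
    val err = 1.0 / (1.0 + math.exp(-margin)) - ex.label
    w = ex.features.foldLeft(w) { case (acc, (i, v)) =>
      acc.updated(i, acc.getOrElse(i, 0.0) - lambda * err * v)
    }
    n += 1
  }
  Model(w, n)
}

// Re-partition into shards, learn one model per shard, then average with treeReduce.
def trainDistributed(data: RDD[Example], numShards: Int, lambda: Double): Model =
  data.repartition(numShards)
      .mapPartitions(it => Iterator.single(trainShard(it, lambda)))
      .treeReduce(_ merge _)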
Slider:
Fast Distributed Online Learning System
Slider
Fast, distributed, online, single-pass learning system.
- Written in Scala on top of Spark
- Works directly with Spark DataFrames
- Usable as a library within other JVM systems
- Leverages Spark/Hadoop fault-tolerance
- Stochastic Gradient Descent
- Online feature-scaling/normalization
- Adaptive (per-feature) learning-rates
- Single-pass
- Hashing-trick to encode sparse features
Slider, Vowpal Wabbit (VW), Spark ML (SML)
Fast, distributed, online, single-pass learning system.
Parentheses indicate which of the other systems also offer each feature:
- Written in Scala on top of Spark (SML)
- Works directly with Spark DataFrames (SML)
- Usable as a library within other JVM systems (SML)
- Leverages Spark/Hadoop fault-tolerance (SML)
- Stochastic Gradient Descent (SGD) (VW, SML)
- Online feature-scaling/normalization (VW)
- Adaptive (per-feature) learning-rates (VW)
- Single-pass (VW, SML)
- Hashing-trick to encode sparse features (VW)
Slider example
Slider vs Spark ML
Task: predict conversion probability from ad-impression features
- 14M impressions from 1 ad campaign
- 17 categorical features, 2 numerical features
- train on first 80%, test on remaining 20%
Spark ML (using Pipelines):
- makes 17 passes over the data: one for each categorical feature
- trains and scores in 40 minutes
- need to specify iterations, etc.
- AUC = 0.52 on test data
Slider:
- makes just one pass over the data
- trains and scores in 5 minutes
- no tuning
- AUC = 0.68 on test data
Other Work
- Online version of k-means clustering
- FTRL algorithm (regularized alternative to SGD)
Ongoing/Future:
- Online learning with Spark Streaming
- Benchmarking vs other ML systems
Thank you
