The document describes a fast, scalable, online, distributed machine learning classifier built on Apache Spark. It leverages recent research to develop a classifier that can handle large, sparse datasets with up to hundreds of millions of features in a single pass. The system uses online learning techniques like stochastic gradient descent that allow incremental updates to the model as new data is received without requiring multiple passes over the training data. This makes it suitable for applications with streaming data where predictions are needed in real-time. Key challenges addressed include feature scaling, handling different feature frequencies, and efficiently encoding sparse features.
1. Fast Distributed Online Classification
Ram Sriharsha (Product Manager, Apache Spark, Databricks)
Prasad Chalasani (SVP Data Science, MediaMath)
13 April, 2016
2. Summary
We leveraged recent machine-learning research to develop a
- fast, practical,
- scalable (up to 100s of millions of sparse features),
- online,
- distributed (built on Apache Spark),
- single-pass
ML classifier that has significant advantages over most similar ML packages.
3. Key Conceptual Take-aways
- Supervised Machine Learning
- Online vs Batch Learning, and the importance of Online
- Challenges in online learning
- Distributed implementation in Spark
4-7. Supervised Machine Learning: Overview
Given:
- training data D: n labeled examples {(x1, y1), (x2, y2), ..., (xn, yn)}, where
  - xi is a k-dimensional feature-vector,
  - yi is the label (0 or 1) that we want to predict,
- an error (or loss) metric L(p, y) incurred by predicting p when the true label is y.
Fix a family of functions f_w(x) \in F, parametrized by a weight-vector w.
Goal: fit a model that predicts labels on (unseen) test data, i.e. find the w that minimizes the average loss over D:

    L(w) = \frac{1}{n} \sum_{i=1}^{n} L_i(w) = \frac{1}{n} \sum_{i=1}^{n} L(f_w(x_i), y_i)
9-11. Logistic Regression
Logistic model:

    f_w(x) = \frac{1}{1 + e^{-w \cdot x}}

It has a probability interpretation: f_w(x) is the predicted probability that y = 1.
Loss function:

    L_i(w) = -y_i \ln f_w(x_i) - (1 - y_i) \ln(1 - f_w(x_i))

Overall loss:

    L(w) = \sum_{i=1}^{n} L_i(w)

L(w) is convex:
- no local minima
- differentiate and follow gradients
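For concreteness, here is a minimal Scala sketch (not the authors' code) of the logistic model and its per-example gradient, using the standard identity \partial L_i / \partial w = (f_w(x_i) - y_i) x_i:

```scala
// Minimal sketch: logistic prediction and per-example gradient.
// For logistic loss, the gradient has the well-known closed form
//   dL_i/dw = (f_w(x_i) - y_i) * x_i
object Logistic {
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  def predict(w: Array[Double], x: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-dot(w, x)))

  // Gradient of L_i(w) w.r.t. w at example (x, y).
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val err = predict(w, x) - y
    x.map(_ * err)
  }
}
```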
13. Gradient Descent
Basic idea:
- start with an initial guess of the weight-vector w
- at iteration t, update w to a new weight-vector w':

    w' = w - \eta g_t

  where
  - g_t is the (vector) gradient of L(w) w.r.t. w at time t,
  - \eta is the learning rate.
16-17. Gradient Descent

    g_t = \frac{\partial L(w)}{\partial w} = \sum_i \frac{\partial L_i(w)}{\partial w} = \sum_i g_{ti}

This is Batch Gradient Descent (BGD):
- to make one weight-update, we must compute the gradient over the entire training data-set,
- and repeat this until convergence.
BGD is not scalable to large data-sets.
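To make the cost concrete, here is a sketch of a single BGD step on a Spark RDD, reusing the hypothetical Logistic helpers sketched above; note that one weight update requires a full pass over the data:

```scala
import org.apache.spark.rdd.RDD

// One Batch Gradient Descent step: the gradient is summed over the
// ENTIRE data-set before a single weight update is made.
// Reuses the hypothetical Logistic.gradient helper sketched earlier.
def bgdStep(data: RDD[(Array[Double], Double)],
            w: Array[Double],
            eta: Double): Array[Double] = {
  val g = data
    .map { case (x, y) => Logistic.gradient(w, x, y) }
    .reduce((a, b) => a.zip(b).map { case (p, q) => p + q }) // full pass
  w.zip(g).map { case (wi, gi) => wi - eta * gi }
}
```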
18-19. Online (Stochastic) Gradient Descent (SGD)
A drastic simplification: shuffle the data-set (if not naturally shuffled), then, instead of computing the gradient over the entire training data-set,

    g_t = \sum_i \frac{\partial L_i(w)}{\partial w},

and doing an update w' = w - \eta g_t, compute the gradient based on a single example,

    g_{ti} = \frac{\partial L_i(w)}{\partial w},

and do an update w' = w - \eta g_{ti}.
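A minimal single-pass SGD loop, again assuming the hypothetical Logistic helpers above; each example triggers one cheap, in-place weight update:

```scala
// Single-pass SGD sketch: one cheap update per example, no second pass.
// Assumes the hypothetical Logistic helpers sketched earlier.
def sgd(examples: Iterator[(Array[Double], Double)],
        dim: Int,
        eta: Double): Array[Double] = {
  val w = Array.fill(dim)(0.0)
  for ((x, y) <- examples) {
    val g = Logistic.gradient(w, x, y)
    var j = 0
    while (j < dim) { w(j) -= eta * g(j); j += 1 } // in-place update
  }
  w
}
```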
20-21. Batch vs Online Gradient Descent
Batch:
- to make one step, compute the gradient w.r.t. the entire data-set
- extremely slow updates
- correct gradient
Online:
- to make one step, compute the gradient w.r.t. one example
- extremely fast updates
- not necessarily the correct gradient
23-24. Batch vs Online Learning
Batch Learning:
- process a large training data-set, generate a model
- use the model to predict labels of a test data-set
Drawbacks:
- infeasible or impractically slow for large data-sets
- the whole batch process must be repeated to update the model with new data
25-30. Batch vs Online Learning
Online Learning:
- for each "training" example:
  - generate a prediction (score),
  - compare it with the true label,
  - update the model (weights w)
- for each "test" example:
  - predict with the latest learned model (weights w)
31-38. Batch vs Online Learning
Online Learning benefits:
- does not pre-process the entire training data-set
- does not explicitly retain previously-seen examples
- extremely light-weight: space- and time-efficient
- no distinct "training" and "testing" phases:
  - incremental, continual learning
  - adapts to changing patterns
  - an existing model is easily updated with new data
- better generalization to unseen observations
39. The Online Learning Paradigm
As each labeled example (xi, yi) is seen:
- make a prediction given only the current weight-vector w
- then update the weight-vector w
41-45. Online Learning: Use Scenarios
- extremely large data-sets, where
  - batch learning is computationally infeasible or impractical, and
  - only a single pass over the data is possible
- data arriving in real-time, where
  - decisions/predictions must be made quickly, and
  - the learned model needs to adapt quickly to recent observations
47-50. Online Learning Example: Advertising (MediaMath)
Listen to 100 billion ad-opportunities daily from Ad Exchanges.
For each opportunity, predict whether the exposed user will buy, as a function of several features:
- hour_of_day, browser_type, geo_region, age, ...
Online learning benefits:
- fast updates of the learned model to reflect the latest observations
- light-weight models that are extremely quick to compute
57-58. Online Learning: Feature Scaling
Example from the wearable-devices domain:
- feature 1 = heart-rate, range 40 to 200
- feature 2 = step-count, range 0 to 500,000
Extreme scale differences => convergence problems.
Convergence is much faster when features are on the same scale:
- normalize each feature by dividing by its maximum possible value.
59-60. Online Learning: Feature Scaling
But often:
- the ranges of the features are not known in advance, and
- we cannot make a separate pass over the data to find the ranges.
=> We need single-pass algorithms that adaptively normalize features with each new observation.
[Ross, Mineiro, Langford 2013] proposed such an algorithm, which we implemented in our online ML system.
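A deliberately simplified sketch of the core idea (track the largest magnitude seen so far per feature and divide by it); the actual algorithm of [Ross, Mineiro, Langford 2013] also rescales the accumulated weights when a feature's observed range grows, which is omitted here:

```scala
// Simplified sketch of adaptive, single-pass feature scaling: keep a
// running maximum absolute value per feature and normalize by it.
// Ross, Mineiro & Langford (2013) additionally rescale the learned
// weights when a feature's observed range grows; omitted here.
class OnlineScaler(dim: Int) {
  private val maxAbs = Array.fill(dim)(1e-10) // avoid division by zero

  def normalize(x: Array[Double]): Array[Double] = {
    var j = 0
    while (j < dim) {
      maxAbs(j) = math.max(maxAbs(j), math.abs(x(j))) // update running max first
      j += 1
    }
    x.indices.map(j => x(j) / maxAbs(j)).toArray
  }
}
```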
62-65. Online Learning: Feature Frequency Differences
Some sparse features occur much more frequently than others, e.g.:
- a categorical feature country with 200 values,
  - encoded as a vector of length 200 with exactly one entry = 1 and the rest 0;
  - country=USA may occur much more often than country=Belgium
- an indicator feature visited_site = 1 much more often than purchased = 1.
66-69. Online Learning: Feature Frequency Differences
Often, rare features are much more predictive than frequent features, so the same learning rate for all features => slow convergence.
=> Rare features should have larger learning rates:
- bigger steps whenever a rare feature is seen
- much faster convergence
Effectively, the algorithm pays more attention to rare features, enabling it to find rare but predictive features.
ADAGRAD is an algorithm for this [Duchi, Hazan, Singer 2010], and we implemented it in our learning system.
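A minimal sketch of the AdaGrad idea (the textbook per-coordinate update, not Slider's exact code): each coordinate accumulates its squared gradients, and its effective learning rate shrinks as that sum grows, so rarely-seen features keep taking large steps:

```scala
// Minimal AdaGrad sketch: per-feature adaptive learning rates.
// Coordinates updated rarely accumulate little squared gradient,
// so they keep a large effective step size.
class AdaGrad(dim: Int, eta: Double, eps: Double = 1e-8) {
  val w = Array.fill(dim)(0.0)
  private val gSqSum = Array.fill(dim)(0.0)

  def update(g: Array[Double]): Unit = {
    var j = 0
    while (j < dim) {
      gSqSum(j) += g(j) * g(j)
      w(j) -= eta / math.sqrt(gSqSum(j) + eps) * g(j)
      j += 1
    }
  }
}
```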
71-79. Online Learning: Sparse Features
E.g. site_domain has a large (unknown) set of possible values:
- google.com, yahoo.com, cnn.com, ...
These need to be encoded (conceptually) as 1-hot vectors, e.g.
- google.com = (1, 0, 0, 0, ...)
- yahoo.com = (0, 1, 0, 0, ...)
- cnn.com = (0, 0, 1, 0, ...)
- ...
But:
- all possible values are not known in advance,
- we cannot pre-process the data to find all possible values, and
- we don't want to encode explicit (long) vectors.
80. Online Learning: Sparse Features, Hashing Trick
E.g. an observation:
- country="china" (categorical)
- age=32 (numerical)
- domain="google.com" (categorical)
81. Online Learning: Sparse Features, Hashing Trick
Hash the feature-names:
- hash("country_china") = 24378
- hash("age") = 32905
- hash("domain_google.com") = 84395
83-84. Online Learning: Sparse Features, Hashing Trick
Represent the observation as a (special) Map:

    {24378 -> 1.0, 32905 -> 32.0, 84395 -> 1.0}

A sparse representation (no explicit vectors), and no separate pass over the data is needed (unlike Spark MLlib).
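A minimal sketch of this encoding (the hash function, map type, and bucket count are illustrative choices, not necessarily the ones Slider uses): categorical features hash "name_value" to the value 1.0, numerical features hash the name alone and keep their value:

```scala
import scala.util.hashing.MurmurHash3

// Sketch of the hashing trick: feature names are hashed into a fixed
// index space, so no dictionary of possible values is ever built.
object FeatureHasher {
  val NumBuckets = 1 << 18 // 262,144 buckets; an illustrative choice

  private def bucket(s: String): Int =
    MurmurHash3.stringHash(s) & (NumBuckets - 1) // power-of-two mask

  def encode(categorical: Map[String, String],
             numerical: Map[String, Double]): Map[Int, Double] = {
    val cat = categorical.map { case (name, v) => bucket(s"${name}_$v") -> 1.0 }
    val num = numerical.map { case (name, v) => bucket(name) -> v }
    cat ++ num
  }
}

// e.g. FeatureHasher.encode(
//   Map("country" -> "china", "domain" -> "google.com"),
//   Map("age" -> 32.0))
```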
86-87. Distributed Online Logistic Regression
Stochastic Gradient Descent (SGD) is inherently sequential: how do we parallelize it?
Our (Scala) implementation in Apache Spark:
- randomly re-partition the training data into shards
- use SGD to learn a model for each shard
- average the models using TreeReduce (akin to "AllReduce")
- leverages Spark/Hadoop fault-tolerance.
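A minimal sketch of this shard-and-average scheme on a Spark RDD, reusing the hypothetical sgd helper from earlier; the real system works with DataFrames and richer model state:

```scala
import org.apache.spark.rdd.RDD

// Sketch of parallelized SGD: run sequential SGD independently on each
// shard, then average the per-shard weight vectors with treeReduce.
// Reuses the hypothetical `sgd` helper sketched earlier.
def train(data: RDD[(Array[Double], Double)],
          dim: Int,
          eta: Double): Array[Double] = {
  val numShards = data.getNumPartitions
  val summed = data
    .repartition(numShards)                                // random re-shuffle into shards
    .mapPartitions(part => Iterator(sgd(part, dim, eta)))  // one local model per shard
    .treeReduce((a, b) => a.zip(b).map { case (p, q) => p + q })
  summed.map(_ / numShards)                                // model averaging
}
```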
89. Slider
A fast, distributed, online, single-pass learning system:
- written in Scala on top of Spark
- works directly with Spark DataFrames
- usable as a library within other JVM systems
- leverages Spark/Hadoop fault-tolerance
- Stochastic Gradient Descent
- online feature-scaling/normalization
- adaptive (per-feature) learning-rates
- single-pass
- hashing trick to encode sparse features
90. Slider vs Vowpal Wabbit (VW) vs Spark-ML (SML)
The same feature list, annotated with which of VW and SML share each feature:
- written in Scala on top of Spark (SML)
- works directly with Spark DataFrames (SML)
- usable as a library within other JVM systems (SML)
- leverages Spark/Hadoop fault-tolerance (SML)
- Stochastic Gradient Descent (SGD) (VW, SML)
- online feature-scaling/normalization (VW)
- adaptive (per-feature) learning-rates (VW)
- single-pass (VW, SML)
- hashing trick to encode sparse features (VW)
95-97. Slider vs Spark ML
Task: predict conversion probability from ad-impression features.
- 14M impressions from 1 ad campaign
- 17 categorical features, 2 numerical features
- train on the first 80%, test on the remaining 20%
Spark ML (using Pipelines):
- makes 17 passes over the data: one for each categorical feature
- trains and scores in 40 minutes
- iterations etc. must be specified
- AUC = 0.52 on test data
Slider:
- makes just one pass over the data
- trains and scores in 5 minutes
- no tuning
- AUC = 0.68 on test data
98. Other Work
- Online version of k-means clustering
- FTRL algorithm (regularized alternative to SGD)
Ongoing/Future:
- Online learning with Spark Streaming
- Benchmarking vs other ML systems