This document discusses developing machine learning at scale at LinkedIn. It outlines the challenges of fast iteration with large, dynamic data and the need for models to adapt. It describes LinkedIn's machine learning framework using a cold start model fitted via distributed logistic regression and a warm start model for recent data. Experimentation and A/B testing are used to evaluate models online since offline metrics do not always translate. Real-time feedback through analytics systems like Kafka and Voldemort is also important for ranking and recommendation models.
3. Many Practical Challenges, but
Fast iteration is desired
– large-scale machine-learning framework
Models should adapt to large dynamic data
– cold start model + warm start model + explore/exploit
Offline metrics do not always reflect online
– online A/B test
Real-time feedback is important
– near real-time event stream
4. The 30,000-foot Overview
[Architecture diagram] An Online Serving System (feature extraction, feature transformation, user modeling, candidate generation, multi-pass rankers) serves members in real time. A Data Pipeline handles data tracking & logging and feeds a Model Fitting Pipeline on Hadoop: offline model fitting produces the cold-start model (daily/weekly), and nearline model fitting produces the warm-start model (hourly). Models are evaluated via online A/B tests, and real-time feedback flows back into serving.
6. Fast Iteration is Desired
Every machine learning task is different
– still the steps to solving the problem are often quite similar
Fast iteration is desired
– successful systems require lots of tuning & experimentation
– reusable modules and easy-to-configure workflows dramatically improve productivity
– free engineers to concentrate on unique aspects of the project
7. Large-scale Machine-learning Framework at LinkedIn
[Workflow diagram] A scheduled workflow on Azkaban: feature and target sources → feature/target join → partition into training and test data → sampling → model training → scoring → model evaluation (new model vs. previous best model) → model deployment, with snapshots retained along the way.
9. Ads Click Prediction Problem As an Example
A member comes to LinkedIn
Ads platform prepares a list of eligible advertising campaigns
The statistical model predicts CTR for <member, ad, context>
Rank all ads by (predicted CTR × bid), and show the top-k results to the member
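The ranking step above can be sketched as follows; the tuple layout and helper name are illustrative, not LinkedIn's actual API.

```python
def rank_ads(candidates, top_k=3):
    """Rank ads by expected revenue per impression (predicted CTR × bid)
    and return the ids of the top-k.

    candidates: list of (ad_id, predicted_ctr, bid) tuples.
    """
    ranked = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return [ad_id for ad_id, _, _ in ranked[:top_k]]
```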
10. Models Should Adapt to Large Dynamic Data
Very large-scale data
– Billions of records, large feature space (~100k covariates)
Low positive rates (e.g. CTR)
– Sparsity issue
Data is dynamic
– New ads come into the system any time
Models have to be adaptive
– Otherwise bad experience, $ loss
11. Cold-start & Warm-start Model
Cold-start component θcold relatively stable
– Less frequent update
– Large-scale model fitting using large amount of historical data
Warm-start component θwarm more dynamic
– Update as frequently as possible
– Trained using fresh data to explain the residual of the cold-start-only model
(cold start: large-scale logistic regression; warm start: per-item logistic regression)
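A minimal sketch of how the two components could combine at scoring time (function name and the plain-Python feature handling are assumptions; the source only says both components feed one logistic model):

```python
import math

def predict_ctr(x_cold, theta_cold, x_warm, theta_warm):
    """Combine the stable cold-start component with the frequently
    refreshed warm-start component in a single logistic model."""
    logit = sum(x * t for x, t in zip(x_cold, theta_cold))
    logit += sum(x * t for x, t in zip(x_warm, theta_warm))
    return 1.0 / (1.0 + math.exp(-logit))
```

Because the two coefficient blocks simply add in the logit, the warm-start block can be retrained frequently without touching the cold-start block.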
12. Cold-start Model Fitting
Alternating Direction Method of Multipliers (ADMM)
– Stephen Boyd et al. 2011
– Constraint that each partition’s coefficients βi equal a global consensus β
13. Large Scale Logistic Regression via ADMM
[Animation, slides 13–15] The big data set is split into Partitions 1…K. In each iteration, every partition fits a logistic regression locally in parallel, then a consensus computation aggregates the per-partition coefficients; the alternation repeats (Iteration 1, Iteration 2, …) until the local models agree with the global consensus.
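A toy sketch of consensus ADMM for logistic regression. Boyd et al. (2011) use exact or Newton-type local updates; the few gradient steps, hyperparameters, and function names below are illustrative assumptions.

```python
import numpy as np

def local_logistic_step(X, y, z, u, rho, n_iters=50, lr=0.1):
    """Approximately solve the local subproblem:
    logistic loss on this partition + (rho/2)||beta - z + u||^2."""
    beta = z.copy()
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted probabilities
        grad = X.T @ (p - y) / len(y) + rho * (beta - z + u)
        beta -= lr * grad
    return beta

def consensus_admm(partitions, d, rho=1.0, n_rounds=20):
    """partitions: list of (X, y) blocks, one per data partition.
    Alternates parallel local fits with a consensus step."""
    K = len(partitions)
    us = [np.zeros(d) for _ in range(K)]             # scaled dual variables
    z = np.zeros(d)                                  # global consensus
    for _ in range(n_rounds):
        # Each partition solves its local subproblem (map step, parallelizable).
        betas = [local_logistic_step(X, y, z, u, rho)
                 for (X, y), u in zip(partitions, us)]
        # Consensus step: average the adjusted local coefficients.
        z = sum(b + u for b, u in zip(betas, us)) / K
        # Dual update pushes each partition toward the consensus.
        us = [u + b - z for u, b in zip(us, betas)]
    return z
```

On Hadoop, the local fits map onto reducers/partitions and the consensus step is a single aggregation, which is what makes the method attractive for billions of records.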
16. Warm-start Model Fitting
Warm-start θwarm is trained with fresh data to explain the residual of the cold-start-only model
Select small set of item-dependent features for fast training
Train as frequently as possible
17. Explore/Exploit
Model serving and data feedback loop
– Ad A: true CTR 5%, 500 clicks, 10000 views
– Ad B: true CTR 10%, 0 click, 10 views
We end up always serving A!
Multi-armed bandit problem
– ε-greedy: random selection for ε% traffic
– UCB (Upper Confidence Bound)
– Gittins Index
– Thompson sampling
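Of these strategies, ε-greedy is the simplest to sketch (function name and interfaces below are illustrative):

```python
import random

def epsilon_greedy_choice(ads, predicted_ctr, epsilon=0.05, rng=random):
    """With probability epsilon serve a random ad (explore);
    otherwise serve the ad with the highest predicted CTR (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(ads)
    return max(ads, key=lambda ad: predicted_ctr[ad])
```

In the example above, a 5% exploration rate would keep giving Ad B views until its observed CTR catches up with its true 10%.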
18. Thompson Sampling
Instead of using MAP estimation of θ, we sample it from its
posterior distribution
We sample from warm-start model only
Compute CTR using the sampled coefficients
Assumptions
– Cold start coefficients have 0 variance
– Covariance between warm start coefficients is 0
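Under these assumptions a single draw looks roughly like this (a sketch, not the production code; diagonal covariance means each warm-start coefficient is sampled independently):

```python
import numpy as np

def thompson_sample_ctr(x_cold, theta_cold, x_warm, warm_mean, warm_var, rng):
    """Sample warm-start coefficients from a diagonal-covariance Gaussian
    posterior; cold-start coefficients stay fixed (zero variance)."""
    theta_warm = rng.normal(warm_mean, np.sqrt(warm_var))
    logit = float(x_cold @ theta_cold + x_warm @ theta_warm)
    return 1.0 / (1.0 + np.exp(-logit))
```

Each ad is then ranked by its sampled CTR, so under-explored ads with wide posteriors occasionally win the auction.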
19. Explore/Exploit Experiments
[Plot] CTR lift (%) vs. campaign warmness segment (1 = almost no training data … 8 = lots of training data), comparing LASER with E/E against LASER without E/E. Annotations mark the exploration cost and the winner’s curse.
21. Offline Metrics Do Not Always Reflect Online
Offline Metrics
– ROC, AUC
– Test log likelihood
Online Metrics
– eCPC (effective cost per click)
– eCPM (effective cost per thousand impressions)
– downstream or sitewide effects are hard to measure offline…
How to select best model to ramp
– A/B testing in a scientific and controlled manner
22. A/B Testing in One Slide
[Diagram] Traffic is split (e.g. 80% / 20%) between a control page and a treatment page (here, a variant of the “Join now” button); results are collected to determine which one is better.
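A standard way to “determine which one is better” is a two-proportion z-test on, for example, click counts. This is a textbook sketch, not XLNT’s actual methodology:

```python
import math

def two_proportion_ztest(clicks_a, views_a, clicks_b, views_b):
    """Z-test for the difference in conversion rates between
    control (a) and treatment (b). Returns (z, two-sided p-value)."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF (expressed with erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```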
25. Growth of Experiments at LinkedIn
• 200+ experiments
• 800+ sitewide and vertical-specific metrics
• billions of experiment events
all on a daily basis, and counting …
28. Who Moved My Cheese?
[Diagram] Experiment owners ↔ metric owners, linked through the most impactful experiments
• Provide visibility, ensure responsible ramping, and encourage communication
30. Real-time Feedback is Important
Recommendations served every millisecond
How frequently do users keep seeing the same item?
High frequency → user fatigue → low engagement
Impression discounting to penalize relevance scores by impression counts
31. Impression Discounting
Item Id   Relevance Score   Action   Counts   Adjusted Score
item:10   0.9               VIEW     3        0.9 × 0.2231 = 0.20
item:15   0.8               VIEW     0        0.8
item:20   0.7               VIEW     2        0.7 × 0.3679 = 0.26
Impression discounting:
score ← score × exp(−exposure / factor),
where exposure is a function of impression counts.
Member: member:1234567   Time Range: within last 12 hours
Item Id   Action   Counts
item:10   VIEW     3
item:10   CLICK    1
item:15   VIEW     0
item:20   VIEW     2
Real-time impression tracking
Sounds like a simple idea! But …
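The discounting rule itself is a one-liner. Note that `factor=2.0` below is an assumption inferred from the example multipliers (exp(−3/2) ≈ 0.2231, exp(−2/2) ≈ 0.3679), and exposure is taken to be the raw view count:

```python
import math

def discounted_score(score, view_count, factor=2.0):
    """Penalize a relevance score by recent impression count:
    score × exp(-exposure / factor), with exposure = view_count here.
    factor=2.0 is inferred from the slide's example numbers."""
    return score * math.exp(-view_count / factor)
```

The hard part is not this formula but maintaining fresh per-member, per-item impression counts at serving time, which is what the tracking infrastructure on the next slide provides.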
32. How to Achieve (Near) Real-time Tracking
Apache Kafka: a high-throughput distributed messaging system
Voldemort: distributed key-value storage system
33. Conclusions
Fast iteration is desired
Models should adapt to large dynamic data
Offline metrics do not always reflect online
Real-time feedback is important
and there is something else as important
We are hiring!
The whole model is trained during the cold-start phase using a large amount of historical ads-event data. Afterwards, we drop θwarm from cold-start training and keep only θcold. (The actual implementation differs for various reasons: some features dropped from cold start are not in warm start, and vice versa.) During the warm-start phase, θwarm is trained using fresh ads events to explain the residual of the cold-start-only model.
In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is a problem in which a gambler at a row of slot machines (sometimes known as "one-armed bandits") has to decide which machines to play, how many times to play each machine, and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.
The current response prediction algorithm applies Thompson sampling to the warm-start component for explore/exploit. Instead of using the MAP estimate of θ, we sample it from its posterior distribution. The rationale for sampling only warm-start parameters, and not cold-start parameters, is the assumption that the sheer volume of training data in the cold-start phase has driven their posterior variance into a small enough range.
During warm-start training, we use L2 regularization on all warm-start parameters, which is equivalent to placing a multivariate Gaussian prior on θwarm with zero mean and a diagonal covariance matrix whose elements are all equal.
A/B testing is an indispensable driver behind LinkedIn’s data culture. We run more than 200 experiments in parallel daily and that number is growing quickly.
When we launched XLNT a year ago the platform only supported about 50 tests per day. Today, that number has increased to more than 200 tests. The number of metrics supported has also grown from 60 to more than 700. The XLNT platform now crunches billions of experiment events daily to produce more than 30 million summary records.
The XLNT platform automatically generates analysis reports, including both the percentage delta and the statistical significance information on many metrics, on a daily basis for every A/B test run across LinkedIn.
Automated analytics not only saves teams from time-consuming, ad hoc analysis, but also ensures that the methodology behind the reports is solid, consistent and scientifically founded.
To keep A/B testing scalable as LinkedIn grows, XLNT now lets each team own the logic of their metrics while the experimentation team reviews the metrics definitions, manages the metrics onboarding process and operates daily computations.
XLNT also makes it extremely easy to test ways we can improve the member experience for customized groups of professionals. Clearly, a job recommendation that works for a CEO is very different from one that works for students. Even for the same member, the need is different when visiting from a computer versus a smartphone. XLNT keeps many targeting attributes in a Voldemort store that can be queried in real time for any experiment, and it can also process attributes passed along in the request, e.g. browser, device, etc.