In this talk at the Netflix Machine Learning Platform Meetup on 12 Sep 2019, Sam Daulton from Facebook discusses "Practical Solutions to real-world exploration problems".
Facebook Talk at Netflix ML Platform meetup Sep 2019
1. Practical Solutions to Exploration Problems
Sam Daulton
Core Data Science, Facebook
Adaptive Experimentation Practical Solutions to Exploration Problems 1 / 68
2. Overview
1 Adaptive Experimentation
Introduction
2 Direct policy search via Bayesian optimization
Motivating Example
Gaussian Process Regression
Bayesian Optimization
3 Combining online and offline experiments
Value Model Tuning
Multi-Task Bayesian Optimization
4 Open Source Tools
Ax
BoTorch
5 Constrained Bayesian Contextual Bandits
Video Upload Transcoding Optimization
Constrained Thompson Sampling (CTS)
Reward Shaping and Hyperparameter Optimization
3. Adaptive Experimentation Team
• Horizontal R&D team within
Facebook
• Goal: radically change the way
people run experiments and
develop systems:
• Reduce threshold for
experimentation
• Use RL to robustly solve
explore/exploit problems
• Develop tools to improve and
automate decision-making
under multiple and/or
constrained objectives
8. Homogeneous Status Quo Policy
9. Homogeneous Status Quo Policy
Idea: What if we loaded different numbers of stories depending on the
connection type?
10. Potential Contextualized Policy
Idea: What if we loaded more posts for better connection types?
11. Potential Contextualized Policy - Opposite
Idea: What if we loaded fewer posts for better connection types?
12. Potential Contextualized Policies
Suppose that for each connection type c:
• We could fetch any number of posts xc ∈ [2, 24]
• Then there are 22^4 = 234,256 possible configurations to test!
13. Policies as Black-box Functions
The average treatment effect over all individuals can be expected to be
some smooth function of the policy table x = [x1, ..., xk]:
f(x) : R^k → R
14. Black-box Function View of RL
• Turns the "full RL" problem into an infinite-armed bandit problem:
πx∗ = arg max_x g(f(x))
• Advantages:
• Does not require estimating value functions, state transition functions,
or inference about unobserved states
• Involves virtually no logging of actions, states, or intermediate rewards
• Allows for direct maximization of multiple, delayed rewards
Question: How can we make predictions about long-term outcomes from a
limited number of vector-valued policies?
15. Gaussian Process (GP) Priors
16. Gaussian Process (GP) Priors
17. Gaussian Process (GP) Posteriors
18. Gaussian Process (GP) Posteriors
19. Gaussian Process (GP) Posteriors
GP regression gives well-calibrated posterior predictive intervals that are
easy to compute
20. Gaussian Process (GP) Regression
In practice, we find that GP surrogate models fit the data well for many
online experiments.
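As a concrete illustration, GP regression with posterior predictive intervals takes only a few lines. The sketch below uses scikit-learn's GaussianProcessRegressor on an invented noisy outcome (an assumed stand-in; the talk does not specify an implementation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(2, 24, size=(20, 1))                 # policy values, e.g. posts fetched
y = np.sin(X[:, 0] / 4.0) + rng.normal(0, 0.1, 20)   # hypothetical noisy outcome

# Fit a GP with an RBF kernel plus a white-noise term for observation variance
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# Posterior predictive mean and ~95% interval on a grid of candidate policies
X_test = np.linspace(2, 24, 50).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```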
21. Other Examples with Continuous Action Spaces
• Value models governing ranking policies, e.g.
rank(Z) = x1 · P(click|Z) + x2 · Z_num_friends^x3 + f(P(spam|Z)/x4) + ...
• Bit-rate controllers for video and audio streaming
• Data retrieval policies for ML backends
Question: How do we use GP surrogate models to guide the
explore-exploit trade-off?
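A value model of this form is just a parameterized scoring function. A toy sketch (the weights x and the spam transform f are invented for illustration, not production values):

```python
import numpy as np

def rank_score(p_click, num_friends, p_spam, x):
    """Toy value model: x1*P(click|Z) + x2*Z_num_friends^x3 + f(P(spam|Z)/x4)."""
    x1, x2, x3, x4 = x
    spam_term = -np.log1p(p_spam / x4)   # assumed monotone-decreasing f(.)
    return x1 * p_click + x2 * num_friends ** x3 + spam_term

x = (1.0, 0.5, 0.5, 1.0)   # hypothetical tuning parameters
score = rank_score(p_click=0.9, num_friends=16, p_spam=0.01, x=x)
```

Tuning x against long-term outcomes is exactly the black-box problem the GP surrogate is meant to solve.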
28. Bayesian Optimization
Response surface is maximized sequentially
• Models tell us which regions should be considered for further
assessment
29. Bayesian Optimization
Algorithm 1 BayesianOptimization
1: Run N random initial arms
2: for t = 0 to T do
3: Fit GP model to data
4: Use acquisition function to select candidates C
5: Evaluate C on black box function
6: Add new observations to dataset
7: end for
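Algorithm 1 can be sketched end-to-end on a synthetic black-box function. The version below uses a scikit-learn GP with Expected Improvement over random candidates; everything here (the function f, counts, kernel) is illustrative, not the Ax/BoTorch implementation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):  # black-box function standing in for an online experiment
    return -(x - 0.3) ** 2

def expected_improvement(gp, X_cand, y_best):
    mu, std = gp.predict(X_cand, return_std=True)
    std = np.maximum(std, 1e-9)
    z = (mu - y_best) / std
    return (mu - y_best) * norm.cdf(z) + std * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 1))         # 1: run N random initial arms
y = f(X[:, 0])
for t in range(10):                         # 2: for t = 0 to T
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                            # 3: fit GP model to data
    X_cand = rng.uniform(0, 1, size=(256, 1))
    ei = expected_improvement(gp, X_cand, y.max())
    x_next = X_cand[np.argmax(ei)]          # 4: acquisition selects a candidate
    X = np.vstack([X, x_next[None, :]])     # 5-6: evaluate, add observation
    y = np.append(y, f(x_next[0]))

x_best = X[np.argmax(y), 0]                 # should land near the optimum at 0.3
```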
31. Alternatives
Random Search (Cheaper - 25 arms)
• Maxima can be deduced with only a few, smartly chosen arms
32. Competing Objectives
• Product teams are used to running an A/B test and observing the
outcomes.
• Often, there are multiple competing objectives
33. Competing Objectives
If we want full automation, we need to specify more information in
advance: ideally, "the" scalarized objective
35. Competing Objectives
Decision makers don’t like scalarizations: e.g.
objective = −0.8 · cpu + 1.1 · time spent
36. Competing Objectives
Decision makers prefer constraints:
min(cpu) subject to time spent > 0.7
37. Practical Challenges
• Constrained optimization
• Observations often have high variance, leading to potentially large
measurement error
• High noise levels can degrade the performance of many common
acquisition functions including Expected Improvement
38. Solution
For more details, see
• Constrained Bayesian Optimization with Noisy Experiments. Letham,
Karrer, Ottoni, & Bakshy. Bayesian Analysis, 2019.
40. Value Model Tuning
• Ranking teams use value models that combine multiple predictive models
and features, e.g.
rank(Z) = x1 · P(click|Z) + x2 · Z_num_friends^x3 + f(P(spam|Z)/x4) + ...
• Not feasible to run sufficiently powered experiments with 20+ arms,
so the team developed a simulator
43. Debiasing Simulations with Multi-Task Models
44. Debiasing Simulations with Multi-Task Models
45. Multi-Task Bayesian Optimization Loop
Algorithm 2 MultiTaskBayesianOptimization
1: Run N random arms online
2: Run M random arms offline with M > N
3: for t = 0 to T do
4: Fit MT-GP model to all data, with each batch as a separate task
5: Use NEI to generate q candidates C (e.g. q = 30)
6: Run C on the simulator, fit GP model again
7: Use NEI to generate candidates to run online
8: end for
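A heavily simplified sketch of the online/offline idea: pool both data sources in one GP by appending a task-indicator feature, so plentiful (but biased) simulator data informs predictions for the online task. This is only a stand-in for the multi-task GP in Algorithm 2; the outcome functions and bias are invented:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
def online(x):  return np.sin(3 * x)              # "true" online outcome (assumed)
def offline(x): return np.sin(3 * x) + 0.3 * x    # biased simulator (assumed)

X_on = rng.uniform(0, 1, (5, 1));   y_on = online(X_on[:, 0])    # few online arms
X_off = rng.uniform(0, 1, (40, 1)); y_off = offline(X_off[:, 0]) # many offline arms

# Task feature: 0 = online, 1 = offline
X = np.vstack([np.hstack([X_on, np.zeros((5, 1))]),
               np.hstack([X_off, np.ones((40, 1))])])
y = np.concatenate([y_on, y_off])
gp = GaussianProcessRegressor(kernel=RBF([0.2, 1.0]) + WhiteKernel(),
                              normalize_y=True).fit(X, y)

# Predict on the *online* task, informed by the cheap offline runs
X_test = np.hstack([np.linspace(0, 1, 20).reshape(-1, 1), np.zeros((20, 1))])
mean = gp.predict(X_test)
```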
47. Paper
For more details, see
• Bayesian Optimization for Policy Search via Online-Offline
Experimentation. Letham & Bakshy, 2019. arXiv:1904.01049
58. Video Upload Transcoding Optimization
Problem
• System receives requests to upload videos of different source qualities
and file sizes from a variety of network connections and devices.
• To ensure high reliability, a video may be transcoded to be uploaded
at a lower quality
• For each video upload request, we have features about
• the video: file size, duration, source resolution
• the network: country, network type, download speed
• the device
Goal
• Maximize quality preserved without decreasing reliability
59. Video Upload Transcoding - CB Problem
• Context: features about video, network, device
• Actions: 360p, 480p, 720p, 1080p
• Outcomes: reliability y(x, a)
• Rewards: ?? some function R(x, a, y)
60. Approach - Bandit Algorithm
Thompson Sampling
• Works well in batch mode
• Hyper-parameter free exploration
• Always "picks the best" codec: each codec is chosen with probability
proportional to the probability that it is the best
61. Approach - Modeling
Bayesian Linear Model
• Bernoulli likelihood to predict reliability
• Using a neural network feature extractor
• Simple two-layer MLP (50, 4) trained via SGD
• Last layer is a stochastic variational GP with a linear kernel
• Trained via stochastic variational inference using 1,000 inducing points
chosen according to a space-filling design
62. Thompson Sampling
Algorithm 3 ThompsonSampling
Input: discrete set of actions A, distribution over models P0(f)
1: for t = 0 to T do
2: Sample model f̃t ∼ Pt(f|X, y)
3: Select an action at ← arg max_{a∈A} E(rt|xt, a, f̃t)
4: Observe reward rt
5: Update distribution Pt+1(f)
6: end for
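In the conjugate Beta-Bernoulli case, the loop in Algorithm 3 reduces to a few lines. This toy version (reliabilities invented, reward = reliability only, no context) illustrates only the posterior-sampling step, not the neural/GP model from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(0)
true_reliability = np.array([0.95, 0.85, 0.75, 0.60])  # assumed, for simulation

alpha = np.ones(4)                     # Beta(alpha, beta) posterior per action
beta = np.ones(4)
counts = np.zeros(4, dtype=int)
for t in range(2000):
    theta = rng.beta(alpha, beta)      # sample one model from the posterior
    a = int(np.argmax(theta))          # act greedily w.r.t. the sampled model
    success = float(rng.random() < true_reliability[a])
    alpha[a] += success                # conjugate posterior update
    beta[a] += 1.0 - success
    counts[a] += 1
```

Over time the play counts concentrate on the most reliable action, with no exploration hyperparameter to tune.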
63. Issues with Vanilla Thompson Sampling
• Thompson sampling does not account for the constraint
• Change in reliability must be non-negative
• Unclear how to optimally specify reward parameterization
64. Constrained Thompson Sampling
Algorithm 4 ConstrainedThompsonSampling
1: Input: discrete set of actions A, distribution over models P0(f)
2: for t = 0 to T do
3: Receive context xt
4: Sample model f̃t ∼ Pt(f|X, y)
5: for a ∈ A do
6: Estimate outcomes f̃t(xt, a)
7: end for
8: Fetch action under baseline policy b ← πb(xt)
9: Filter feasible actions: A_feas ← {a ∈ A | f̃t(xt, a) ≥ ε · f̃t(xt, b)}
10: Select an action at ← arg max_{a∈A_feas} E(rt|xt, a, f̃t)
11: Observe outcome yt
12: Update distribution Pt+1(f)
13: end for
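The feasibility filter in line 9 is the key difference from vanilla Thompson sampling. A single-step sketch, with invented quality rewards and posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)
quality_reward = np.array([1.00, 1.10, 1.25, 1.45])  # grows with resolution (assumed)
eps = 0.97                                           # safety constraint parameter
baseline = 0                                         # 360p as the baseline policy

# Sample per-action reliabilities from (illustrative) Beta posteriors
alpha = np.array([40.0, 35.0, 30.0, 20.0])
beta = np.array([2.0, 2.0, 3.0, 5.0])
theta = rng.beta(alpha, beta)

# Keep only actions whose sampled reliability is at least eps times the baseline's
feasible = np.where(theta >= eps * theta[baseline])[0]
a = feasible[np.argmax(quality_reward[feasible])]    # best feasible action
```

The baseline action is always feasible (theta[b] ≥ ε·theta[b] for ε ≤ 1), so the filtered set is never empty.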
65. Reward Shaping Setup
Reward Shaping:
• Reward is 0 if the upload is a failure
• Reward is fixed at 1 for a 360p upload success
• Reward is monotonically increasing with quality:
R(y = 1, a) = 1 + Σ_{a′ ≤ a} wa′
where wa′ ∈ (0.0, 0.2]
Safety Constraint: ε ∈ [0.95, 1.0]
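The shaped reward can be written out directly; the per-resolution increments w below are illustrative values from (0.0, 0.2], not the tuned ones:

```python
ACTIONS = ["360p", "480p", "720p", "1080p"]
W = {"480p": 0.10, "720p": 0.15, "1080p": 0.20}  # assumed increments in (0.0, 0.2]

def reward(success: bool, a: str) -> float:
    if not success:
        return 0.0                       # failed upload earns nothing
    r = 1.0                              # 360p success is fixed at 1
    for a2 in ACTIONS[1:ACTIONS.index(a) + 1]:
        r += W[a2]                       # monotone increase with quality
    return r
```

These increments, together with ε, are exactly the hyperparameters tuned by Bayesian optimization in the next slides.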
66. Reward Shaping Optimization
• Teams care about top-line outcomes:
• Reliability: mean reliability per user
• Quality preserved: mean quality (e.g., 1080p preserved, HD) per user
• Other outcomes: watch time, content production
• Difficult to evaluate these outcomes from purely offline data
Solution: Use Bayesian Optimization (via Ax) with online experiments
67. Reward Shaping Optimization
(a) 1080p quality preserved (b) Reliability
Figure: GP-modeled response surface of mean percent change in video quality
and reliability relative to the baseline policy. Each point represents a policy
parameterized by reward function hyperparameters and constraint parameter ε.