Horizon: Deep Reinforcement Learning at Scale
Jason Gauci
Applied RL, Facebook AI
About Me
• Recommender systems @ Google/Apple/Facebook
• TLM on Horizon: A framework for large-scale RL:
https://github.com/facebookresearch/Horizon
• Eternal Terminal: a replacement for ssh/mosh
https://mistertea.github.io/EternalTerminal/
• Programming Throwdown: tech podcast
https://itunes.apple.com/us/podcast/programming-throwdown/id427166321?mt=2
Recommender Systems in 10 Minutes
Recommender Systems
1. Retrieval: Matrix Factorization, Two Tower DNN
2. Event Prediction: DNN, GBDT, Convnets, Seq2seq, etc.
3. Ranking: Black Box Optimization, Bandits, RL
4. Data Science: A/B Tests, Query Engines, User Studies
https://www.mailmunch.com/blog/sales-funnel/
Recommender Systems are Control Systems
1. Retrieval: Control
2. Event Prediction: Signal Processing
3. Ranking: Control
4. Data Science: Causal Analysis
Recommender Systems are Control Systems
Control the user experience
• Explore/exploit
• Freshness
• Slate optimization
Control future models’ data
• Break feedback loops
• De-bias the model
Classification Versus Decision Making
Classification | Decision Making
"What" questions (What will happen?) | "How" questions (How can we do better?)
Trained on ground truth (Hotdog / Not Hotdog) | Trained from another policy (usually a worse one)
Evaluated via accuracy (F1, AUC, NE) | Counterfactual evaluation (IPS, DR, MAGIC)
Assumes data is perfect | Assumes data is flawed (explore/exploit)
Framework For Recommendation
• Action Features: X_a ∈ ℝ^d
• Context Features: X_c ∈ ℝ^d
• Session Features: X_s ∈ ℝ^d
• Event Predictors: E(X_a, X_c, X_s) → ℝ
Greedy Slate Recommendation:
• Value Function: V(X_a, X_c, X_s, E_1, E_2, …, E_n) → ℝ
• Control Function: π(V_0, V_1, …, V_n) → {0, …, n}
• Transition Function: T(X_a, X_c, X_s, E_1, E_2, …, E_n, π) → X_s′
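To make the pipeline concrete, here is a minimal Python sketch of the greedy slate loop these functions describe. The callables E, V, and T and the feature shapes are stand-ins for whatever the production system supplies, not Horizon APIs.

```python
import numpy as np

def greedy_slate(x_a, x_c, x_s, predictors, V, T, slate_size):
    """Greedily fill a slate: score candidates, pick the argmax, update session state."""
    slate, remaining = [], list(range(len(x_a)))
    for _ in range(slate_size):
        scores = []
        for i in remaining:
            e = [E(x_a[i], x_c, x_s) for E in predictors]  # event predictions
            scores.append(V(x_a[i], x_c, x_s, e))          # value function
        best = remaining[int(np.argmax(scores))]           # control function (argmax policy)
        slate.append(best)
        remaining.remove(best)
        x_s = T(x_a[best], x_c, x_s)                       # transition updates session features
    return slate
```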
Discovering The Value Function
• What should we optimize for?
• Ads: Clicks? Conversions? Impressions?
• Feed/Search: Clicks? Time-Spent? Favorable user surveys?
• Answer: All of the above.
• How to combine?
• How to assign credit?
• Differentiable?
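Before RL enters the picture, the usual answer is a hand-tuned linear combination of the event predictors. A toy sketch (the predictor outputs and weights below are made up):

```python
def value(event_predictions, weights):
    """Hand-tuned linear value function over event-predictor outputs."""
    return sum(w * p for w, p in zip(weights, event_predictions))

# e.g. p(click), p(like), p(positive survey) from upstream event predictors
score = value([0.12, 0.05, 0.71], weights=[1.0, 2.0, 5.0])
```

Every question on this slide is a weakness of that baseline: the weights are global, hand-tuned, and blind to long-term and per-user effects.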
Tuning The Value Function
Searching Through Value Functions
Learning Value Functions
• Search is limited
• Curse of dimensionality
• Value models are sequential
• Optimize for long-term value
• Value models should be personalized
• Relationship between event predictors and utility is contextual
• Optimizing metrics is counterfactual
• “If I chose action a’, would metric m increase?”
Learning Value Functions
• Reinforcement Learning is designed around agents who make decisions and improve their actions over time
Hypothesis: We can use RL to learn better value functions
Intro to RL
Reinforcement Learning (RL)
• Agent
• Recommendation System
• Reward
• User Behavior
• State
• Context (inc. historical)
• Action
• Content
https://becominghuman.ai/the-very-basics-of-reinforcement-learning-154f28a79071
RL Terms
• State (S)
• Every piece of data needed to decide a single action
• Example: User/Post/Session features
• Action (A)
• A decision to be made by the system
• Example: Which post to show
• Reward (R(S, A))
• A function of utility based on the current state and action
RL Terms
• Transition (T(S, A) → S′)
• A function that maps state-action pairs to a future state
• Bandit: T(S, A) = T(S)
• Policy (π(S, A_0, A_1, …, A_n) → {0, …, n})
• A function that, given a state, chooses an action
• Episode
• A sequence of state-action pairs for a single run (e.g. a complete game of Go)
Value Optimization
• Value (Q(S, A))
• The cumulative discounted reward given a state and action
• Q(s_t, a_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + γ³·r_{t+3} + ⋯
• A good policy becomes: π(s) = argmax_a Q(s, a)
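A two-line Python check of the definition, assuming γ = 0.9:

```python
def discounted_return(rewards, gamma=0.9):
    """Q-style cumulative discounted reward: r_t + γ·r_{t+1} + γ²·r_{t+2} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71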
Value Regression
• Q(s_t, a_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + γ³·r_{t+3} + ⋯
• Collect historical data
• Solve with linear regression
• Problem: r_{t+1} also depends on a_{t+1}
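A sketch of that regression, assuming episodes of (state, action, reward) tuples and a caller-supplied feature map phi; ordinary least squares stands in for whatever regressor is actually used.

```python
import numpy as np

def fit_value_regression(episodes, phi, gamma=0.9):
    """Fit a linear model from (state, action) features to Monte-Carlo returns."""
    X, y = [], []
    for episode in episodes:
        rewards = [r for (_, _, r) in episode]
        for t, (s, a, _) in enumerate(episode):
            # Discounted return from step t to the end of the episode
            ret = sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))
            X.append(phi(s, a))
            y.append(ret)
    w, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return w
```

The problem in the last bullet shows up here directly: the targets y bake in whatever future actions the logging policy happened to take.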
Credit Assignment Problem
• Current state/action
• X’s turn to move
• What is the value?
• Pretty high
Credit Assignment Problem
• Next State/Action
• Now what is the value?
• Low
• The future actions affect the past value
State Action Reward State Action (SARSA)
• Value Regression
• Q(s_t, a_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + …
• SARSA
• Q(s_t, a_t) = r_t + γ·Q(s_{t+1}, a_{t+1})
• Idea borrowed from Dynamic Programming
• Using the future Q is more robust
• Value still highly influenced by current policy
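A tabular sketch of the SARSA update; the dict-based Q-table and the learning rate α are illustrative assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.1):
    """Move Q(s, a) toward the bootstrapped target r + γ·Q(s', a')."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```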
Q-Learning: Off-Policy SARSA
• SARSA
• Q(s_t, a_t) = r_t + γ·Q(s_{t+1}, a_{t+1})
• Q-Learning
• Q(s_t, a_t) = r_t + γ·max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
• Has better off-policy guarantees
• max_{a_{t+1}} may be difficult to know/compute
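The same sketch with the SARSA bootstrap swapped for a max over next actions, which is the entire difference between the two updates (a discrete action set is assumed):

```python
def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """Target uses max over a' of Q(s', a'), independent of the logging policy."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```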
Policy Gradients
• Q-Learning: Q(s_t, a_t) = r_t + γ·max_{a_{t+1}} [Q(s_{t+1}, a_{t+1})]
• What if we can’t do max_{a_{t+1}} […]?
• Policy Gradient
• Approximate max_{a_{t+1}} [Q(s_{t+1}, a_{t+1})] with an actor A(s_{t+1})
• Q(s_t, a_t) = r_t + γ·Q(s_{t+1}, A(s_{t+1}))
• Learn A(s_{t+1}) assuming Q is perfect:
• Deep Deterministic Policy Gradient
• L(A(s_{t+1})) = min(−Q(s_{t+1}, a_{t+1}))
• Soft Actor-Critic
• L(A(s_{t+1})) = min(log P(A(s_{t+1}) = a_{t+1}) − Q(s_{t+1}, a_{t+1}))
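A minimal PyTorch sketch of the two actor losses above. The actor and critic are assumed to be ordinary nn.Modules; this is a shape-level illustration, not a full DDPG/SAC training loop.

```python
import torch

def ddpg_actor_loss(actor, critic, states):
    """DDPG: train the actor to maximize the critic, i.e. minimize -Q(s, A(s))."""
    actions = actor(states)
    return -critic(states, actions).mean()

def sac_actor_loss(log_prob, q_value):
    """SAC: entropy-regularized objective, minimize log π(a|s) - Q(s, a)."""
    return (log_prob - q_value).mean()
```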
Applying RL at Scale
Prior State of Applied RL
• Small-scale
• Notable Exceptions: ELF OpenGo, OpenAI Five, AlphaGo
• Simulation-Driven
• Simulators are often deterministic and stationary
Can we train personalized, large-scale RL models and bring them to billions of people?
Applying RL at Scale
• Batch Feature normalization & training
• Because the loss target is dynamic, normalization is critical
• Distributed training
• Synchronous SGD (PASGD should be fine)
• Fixed (but stochastic) policies
• E-greedy, Softmax, Thompson Sampling
• Fixed policies allow for massive deployment
• No need for checkpointing, online parameter servers
• Counterfactual Policy Evaluation
• Detect anomalies and gain insights offline
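Sketches of the fixed stochastic policies named above. Because ε and the softmax temperature are frozen at deployment time, serving needs no online parameter updates; the constants here are illustrative.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.05, rng=None):
    """Explore uniformly with probability epsilon, otherwise exploit the argmax."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(z - z.max())   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q_values), p=p))
```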
Horizon: Applied RL Platform
• Robust
• Massively Parallel
• Open Source
• Built on high-performance platforms
• Spark
• PyTorch
• ONNX
• OpenAI Gym & Gridworld integration tests
Safe, Large-Scale Deployment
• Deploy models to 1000s of frontend servers
• Counterfactual Policy Evaluation
• Warm-start for continuous deployment
• Built-in Explore/Exploit policies
Workflow
Preprocessing & Training
• Preprocessing happens as part of training
• Training begins with imitation learning, then pivots to policy maximization
• Time-based or sequence-based discount factor
• Highly optimized with PyTorch 1.0
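One way to read "time-based discount factor": decay by elapsed wall-clock time between events instead of by step count. A sketch under that reading (the half-life constant is an assumption, not a Horizon default):

```python
def time_based_discount(gamma, dt_seconds, half_life_seconds=3600.0):
    """Per-transition discount that depends on elapsed time, not step index."""
    return gamma ** (dt_seconds / half_life_seconds)
```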
Counterfactual Policy Evaluation (CPE)
• One-Step (estimate reward)
• Direct Method (DM): Learn reward function for all states/actions
• Inverse Propensity Score (IPS): Boost reward by ratio of action probabilities
• Doubly-Robust: Use DM to reduce IPS variance
• Value (estimate cumulative reward)
• Direct Method: Learn reward and transition functions (model-based RL)
• Sequential DR: Extrapolate one-step CPE across episode
• MAGIC: Sliding window approach
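Minimal sketches of the one-step IPS and Doubly-Robust estimators above. Each array is indexed by logged sample; the direct-method inputs are assumed to come from a separately learned reward model.

```python
import numpy as np

def ips(rewards, p_logging, p_target):
    """Inverse Propensity Score: reweight logged rewards by the policy ratio."""
    w = np.asarray(p_target) / np.asarray(p_logging)
    return float(np.mean(w * np.asarray(rewards)))

def doubly_robust(rewards, p_logging, p_target, dm_value, dm_logged):
    """DR: direct-method value of the target policy, plus an IPS correction on the residual."""
    w = np.asarray(p_target) / np.asarray(p_logging)
    residual = np.asarray(rewards) - np.asarray(dm_logged)
    return float(np.mean(np.asarray(dm_value) + w * residual))
```

Here dm_logged is the reward model's estimate at the logged action and dm_value is its expectation under the target policy; when the reward model is accurate the residual is small, which is how DR reduces IPS variance.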
CPE: Results on OpenAI CartPole
Mean absolute error (fraction of true value): 3.4%
Production Launches
• Infrastructure
• 360 Video adaptive bitrate
• Marketing/Growth
• Newsfeed Notifications
• Page Notifications
• Ad Coupons
• Recommendations
• M Assistant filtering
• Newsfeed/IG Value Model Optimization
Train your own model!
Questions/Comments?
Jason Gauci jjg@fb.com
