Deep Reinforcement Learning has driven exciting AI breakthroughs like self-driving cars, beating the world’s best Go players, and even winning at StarCraft. How can businesses harness this power for real-world applications?
3. Agenda
Reinforcement Learning (RL) Intro
▪ Intro to RL and personalization
Zynga’s RL Tech Stack
▪ The off-the-shelf technologies Zynga uses to run RL in production for millions of users per day
Designing RL Applications
▪ Creating RL applications is hard! Here’s what we’ve learned
5. Game Design is Hard
Lots of design decisions
▪ How hard should we make each level?
▪ What game mode do we recommend?
How do we personalize games?
We want to choose behaviors for each user that maximize long-term engagement
6. Personalization Problem Formulation
Given a user’s State
▪ User features
What Action do we pick
▪ E.g. Difficulty level
That maximizes a long-term Reward
▪ E.g. Engagement, retention
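A minimal sketch of this State/Action/Reward framing in Python; the feature names, action set, and reward weights below are invented for illustration, not Zynga’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class UserState:
    """User features observed before picking an action (illustrative fields)."""
    days_since_install: int
    recent_win_rate: float
    sessions_last_week: int

# The Action is the personalization decision, e.g. a difficulty level.
ACTIONS = ["easy", "medium", "hard"]

def long_term_reward(retained_next_week: bool, rounds_played: int) -> float:
    """Long-term Reward, e.g. a blend of retention and engagement (weights made up)."""
    return 1.0 * retained_next_week + 0.1 * rounds_played
```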
7. Personalization Method 1: Rules-Based Segments
PMs define segments via rules
Assign a ‘personalized action’ to each segment
A/B test each segment vs. control
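A rules-based segmenter of this kind can be a few lines of code; the thresholds, segment names, and action assignments below are invented for illustration:

```python
def segment(days_since_install: int, recent_win_rate: float) -> str:
    """PM-defined rules mapping user features to a segment (thresholds invented)."""
    if days_since_install < 7:
        return "new_user"
    if recent_win_rate < 0.3:
        return "struggling"
    return "veteran"

# Each segment gets a hand-assigned 'personalized action', then is A/B tested vs. control.
SEGMENT_ACTIONS = {"new_user": "easy", "struggling": "easy", "veteran": "hard"}
```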
8. Challenges
Lots of manual trial-and-error work
Player patterns change
Limited ability to personalize
▪ Small set of outputs
▪ Small set of datapoints to make decisions on
9. Personalization Method 2: Prediction Models
Train a model to predict the long-term reward of each personalized Action (see the sketch below)
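One common way to realize this is one regressor per candidate Action, picking the Action with the highest predicted reward. A sketch, assuming scikit-learn; the model choice and action set are placeholder assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One regressor per personalized Action, each trained on users who received that Action.
models = {action: GradientBoostingRegressor() for action in ["easy", "medium", "hard"]}

def train(action: str, user_features: np.ndarray, long_term_rewards: np.ndarray) -> None:
    """Fit the model for one Action on users randomly assigned to that Action."""
    models[action].fit(user_features, long_term_rewards)

def best_action(user_features: np.ndarray) -> str:
    """Pick the Action whose model predicts the highest long-term reward."""
    return max(models, key=lambda a: models[a].predict(user_features.reshape(1, -1))[0])
```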
10. Challenges
Requires lots of models
Requires lots of labelled data
▪ Need to randomly assign users to each Action, then wait long enough to measure long-term results
Limited to simple outputs, e.g. how do you pick the best continuous value?
11. Personalization Wishlist
Automatically tune details of personalization
Continuously explore and improve over time
Personalize complex outputs
▪ E.g. Continuous values, multiple dimensions
12. Solution: Reinforcement Learning (RL)
AI for making sequences of decisions
Agent picks Action based on current State to maximize Reward
Automatically learn from past experiences
Balance exploration with choosing best known Action
13. If RL can beat the world’s best Go player, can we use it to make our games better?
14. Application: WWF Daily Message Timing
What time should we send each user their daily Words With Friends (WWF) message?
We used an RL Agent to personalize send time based on hourly activity
Results: a significant increase in CTR vs. the hand-tuned system
Delivered to millions of users per day
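As a rough illustration of serving such a policy, here is an epsilon-greedy sketch over a 24-hour action space; `q_values_for` is a hypothetical stand-in for the trained agent’s Q-network, not code from the production system:

```python
import numpy as np

SEND_HOURS = np.arange(24)  # Action space: which hour of the day to send the message

def q_values_for(hourly_activity: np.ndarray) -> np.ndarray:
    """Stand-in for the trained agent's Q-network (hypothetical placeholder)."""
    return hourly_activity  # e.g. favor hours where the user is most active

def choose_send_hour(hourly_activity: np.ndarray, epsilon: float = 0.05) -> int:
    """Epsilon-greedy choice over the 24 send hours, given hourly activity as State."""
    if np.random.rand() < epsilon:
        return int(np.random.choice(SEND_HOURS))  # explore a random hour
    return int(np.argmax(q_values_for(hourly_activity)))  # exploit the best known hour
```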
18. RL Model Training
[Diagram: the Agent sends an Action to the Environment; logged Experiences flow into an Experience Replay Buffer, which feeds training.]
A single Experience/Trajectory contains:
▪ S0: Previous Observation (State)
▪ A0: Action
▪ R: Reward
▪ S1: Current Observation
▪ A1: Next Optimal Action
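In code, one logged experience might be a simple record like this; the field names mirror the slide, but the structure is an illustration, not RL-Bakery’s internal format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Experience:
    """One trajectory step assembled from logs, as on the slide."""
    s0: List[float]  # previous observation (State)
    a0: int          # Action taken
    r: float         # Reward observed
    s1: List[float]  # current observation
    a1: int          # next optimal Action

replay_buffer: List[Experience] = []  # experiences are accumulated here for training
```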
19. Academic RL Applications
RL Agents learn by interacting with an environment
▪ Can’t just train an Agent on a static set of labelled data like supervised learning models
Well-known RL applications are trained offline with a simulator
▪ E.g. training an Agent to play Atari
The Agent is deployed only after lots of offline learning
[Diagram: RL Agent v1 acts and learns in the simulator, producing RL Agent v2, which acts and learns in turn.]
20. Production RL Applications for Personalization
It’s hard to simulate humans, so we learn by interacting with real users
The Agent interacts with humans from v1
We need to learn from batches in parallel
It’s harder to manage data and workflows!
[Diagram: RL Agent v1 takes Actions on live users and learns from many parallel batches, producing RL Agent v2, and so on.]
21. RL Model Training
Training Pipeline Wish List:
Off-the-shelf
Scalable
Cutting-edge algorithms
Reliable & robust
Easily extendable
22. RL Model Training
TF-Agents
An open-source RL library that implements cutting-edge Deep RL algorithms (DQN, PPO, TD3, etc.)
Advantages:
▪ Modular design
▪ Well-written
▪ Accurate implementations
▪ New algorithms added regularly
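For example, wiring up a DQN agent in TF-Agents follows roughly this pattern (adapted from the standard TF-Agents DQN tutorial; the observation shape, action count, and layer sizes are placeholders):

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.utils import common

# Placeholder specs: a 10-feature observation and 3 discrete actions.
observation_spec = tf.TensorSpec(shape=(10,), dtype=tf.float32)
action_spec = tensor_spec.BoundedTensorSpec(shape=(), dtype=tf.int32, minimum=0, maximum=2)
time_step_spec = ts.time_step_spec(observation_spec)

# Q-network mapping observations to per-action value estimates.
q_net = q_network.QNetwork(observation_spec, action_spec, fc_layer_params=(64, 64))

agent = dqn_agent.DqnAgent(
    time_step_spec,
    action_spec,
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
)
agent.initialize()
```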
23. Production RL Challenges
How to:
Convert messy, real-time logged data into RL trajectories?
Persist, restore, and re-use past agents & trajectories?
Create trajectories at production scale?
And… how do we make this repeatable and data-scientist friendly?
24. RL-Bakery
Our open-source library to help build batch RL applications in production, at scale
github.com/zynga/rl-bakery
25. RL-Bakery
A wrapper around RL algorithm libraries that simplifies developing real-world RL apps like personalization
[Stack diagram: RL-Application on top of RL-Bakery on top of an RL Library]
26. RL-Bakery
RL-Application:
▪ Application-specific
▪ Written in a Databricks notebook
▪ Data-scientist friendly
▪ Provides model configuration & hyperparameters
▪ Fetches observations, actions, and rewards as Spark DataFrames (see the sketch below)
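The shape of an RL-Application might look roughly like this; note that this interface is a hypothetical sketch for illustration only, not rl-bakery’s actual API (see github.com/zynga/rl-bakery for that), and the table names are invented:

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

class DailyMessageApp:
    """Hypothetical RL-Application: config plus data-fetching hooks."""

    # Model configuration & hyperparameters
    gamma = 0.9
    learning_rate = 1e-3

    def fetch_observations(self, run_date: str) -> DataFrame:
        # Observations, actions, and rewards come back as Spark DataFrames.
        return spark.sql(f"SELECT * FROM user_features WHERE ds = '{run_date}'")

    def fetch_rewards(self, run_date: str) -> DataFrame:
        return spark.sql(f"SELECT user_id, reward FROM engagement WHERE ds = '{run_date}'")
```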
27. RL-Bakery
RL-Bakery:
▪ Orchestrates the steps of the training pipeline
▪ Restores models and old time steps
▪ Creates new training trajectories
▪ Persists the model and trajectories between runs
▪ Deploys models to the serving system
▪ Adds functionality unavailable in TF-Agents (e.g. a prioritized replay buffer)
28. RL-Bakery
RL Library:
▪ Open-source RL libs implement algos like PPO, DQN, etc.
▪ Currently only TF-Agents is supported
▪ Core RL algorithms are implemented using TensorFlow
30. Model Serving & Model Training Architecture
[Diagram: in serving, real-time features feed real-time serving on AWS SageMaker, which returns the recommended Action; in training, Observations and Experience logs in S3 feed the RL-Bakery application, which trains the RL Agent and deploys it back to serving.]
32. Choose the Right Application
Is the problem best modelled as a sequence of decisions?
▪ Does the Action taken in one timestep affect future Actions?
▪ Otherwise, use simpler solutions like predictive models or contextual Multi-Armed Bandits
Is the Reward learnable?
▪ Does the Action impact the Reward?
▪ Sparse rewards are hard to learn
RL shouldn’t be applied to every situation
33. Choose States
Anecdotally, RL Agents are sensitive to having too many inputs
Choose simple state spaces
Compress the state space with unsupervised learning techniques like autoencoders (see the sketch below)
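For instance, a small Keras autoencoder can compress a wide raw feature vector into a compact state (a sketch; the dimensions and layer sizes are placeholders):

```python
import tensorflow as tf

RAW_DIM, STATE_DIM = 200, 16  # compress 200 raw features into a 16-dim state

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(RAW_DIM,)),
    tf.keras.layers.Dense(STATE_DIM, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(RAW_DIM),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# After fitting autoencoder on raw feature vectors, use encoder(features) as the RL state.
```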
34. Designing Actions
Start simple: a small set of discrete Actions
▪ Allows you to use simpler Deep RL algorithms
Continuous action spaces require algorithms from the Policy Gradient family
A large set of discrete Actions -> classic Recommendation Systems
▪ This goes beyond the traditional RL setup
▪ Some cutting-edge Recommendation Systems use RL
35. Choosing RL Algorithms
An active area of research; new algos are constantly being developed
Algorithms are hard to implement, and subtle details affect results
Off-the-shelf implementations are available from open-source libs
36. Hyperparameter Tuning
Lots of RL application design choices
Plus Deep Learning hyperparameters
▪ Learning rate
▪ Neural network architecture
Slight hyperparameter changes can have big effects
How can we choose the best options before going live?
37. How to Pretrain
Can you do better than random at initial launch?
Train the Agent to mimic some existing behavior (see the sketch below)
▪ Use historic data to reward the Agent for picking previous Actions
▪ The Agent then slowly learns when to deviate
Simulate simple scenarios with hand-made mechanics
▪ Capture relationships between features in the State and the Action
▪ Simple scenarios have a clear “optimal” strategy, so you can measure success
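A minimal sketch of the “reward the Agent for picking previous Actions” idea above; the reward shaping is an assumption for illustration, not the production scheme:

```python
def mimicry_reward(agent_action: int, historic_action: int) -> float:
    """Pretraining reward: +1 when the agent picks the Action the old system chose.

    Training on historic logs with this reward teaches the agent to imitate the
    existing behavior; switching to the real engagement reward afterwards lets it
    slowly learn when to deviate.
    """
    return 1.0 if agent_action == historic_action else 0.0
```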
39. Key Takeaways
RL is the perfect methodology for personalization problems
RL is ready for production with off-the-shelf technology
RL applications are challenging to develop; best practices are still being discovered