Deep Reinforcement Learning has driven exciting AI breakthroughs like self-driving cars, beating the world’s best Go players, and even winning at StarCraft. How can businesses harness this power for real-world applications?
3. Agenda
Reinforcement Learning (RL) Intro
▪ Intro to RL and personalization
Zynga’s RL Tech Stack
▪ The off-the-shelf technologies Zynga uses to run RL in production for millions of users per day
Designing RL Applications
▪ Creating RL applications is hard! Here’s what we’ve learned
5. Game Design is Hard
Lots of design decisions
▪ How hard should we make each level?
▪ What game mode do we recommend?
How do we personalize games?
We want to choose behaviors for each user that maximize long-term engagement
6. Personalization Problem Formulation
Given a user’s State
▪ User features
What Action do we pick
▪ E.g. Difficulty level
That maximizes a long-term Reward
▪ E.g. Engagement, retention
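A minimal sketch of this State/Action/Reward framing in Python; the feature names, action set, and reward weights below are invented for illustration, not Zynga’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class UserState:
    """User features observed before picking an action (illustrative fields)."""
    days_since_install: int
    recent_win_rate: float
    sessions_last_week: int

# The Action is the personalization decision, e.g. a difficulty level.
ACTIONS = ["easy", "medium", "hard"]

def long_term_reward(retained_next_week: bool, rounds_played: int) -> float:
    """Long-term Reward, e.g. a blend of retention and engagement (weights made up)."""
    return 1.0 * retained_next_week + 0.1 * rounds_played
```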
7. Personalization Method 1: Rules-Based Segments
PMs define segments via rules
Assign a ‘personalized action’ to each segment
A/B test each segment vs. control
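A rules-based segmenter of this kind can be a few lines of code; the thresholds, segment names, and action assignments below are invented for illustration:

```python
def segment(days_since_install: int, recent_win_rate: float) -> str:
    """PM-defined rules mapping user features to a segment (thresholds invented)."""
    if days_since_install < 7:
        return "new_user"
    if recent_win_rate < 0.3:
        return "struggling"
    return "veteran"

# Each segment gets a hand-assigned 'personalized action', then is A/B tested vs. control.
SEGMENT_ACTIONS = {"new_user": "easy", "struggling": "easy", "veteran": "hard"}
```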
8. Challenges
Lots of manual trial-and-error work
Player patterns change
Limited ability to personalize
▪ Small set of outputs
▪ Small set of datapoints to make decisions on
9. Personalization Method 2: Prediction Models
Train a model to predict the long-term reward of each personalized Action (see the sketch below)
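One common way to realize this is one regressor per candidate Action, picking the Action with the highest predicted reward. A sketch, assuming scikit-learn; the model choice and action set are placeholder assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# One regressor per personalized Action, each trained on users who received that Action.
models = {action: GradientBoostingRegressor() for action in ["easy", "medium", "hard"]}

def train(action: str, user_features: np.ndarray, long_term_rewards: np.ndarray) -> None:
    """Fit the model for one Action on users randomly assigned to that Action."""
    models[action].fit(user_features, long_term_rewards)

def best_action(user_features: np.ndarray) -> str:
    """Pick the Action whose model predicts the highest long-term reward."""
    return max(models, key=lambda a: models[a].predict(user_features.reshape(1, -1))[0])
```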
10. Challenges
Requires lots of models
Requires lots of labelled data
▪ Need to randomly assign users to each Action, then wait long enough to measure long-term results
Limited to simple outputs, e.g. how do you pick the best continuous value?
11. Personalization Wishlist
Automatically tune details of personalization
Continuously explore and improve over time
Personalize complex outputs
▪ E.g. Continuous values, multiple dimensions
12. Solution: Reinforcement Learning (RL)
AI for making sequences of decisions
Agent picks Action based on current State to maximize Reward
Automatically learn from past experiences
Balance exploration with choosing best known Action
13. If RL can beat the world’s best Go player, can we use it to make our games better?
14. Application: WWF Daily Message Timing
What time should we send each user their daily Words With Friends (WWF) message?
We used an RL Agent to personalize send time based on hourly activity
Results: a significant increase in CTR vs. the hand-tuned system
Delivered to millions of users per day
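As a rough illustration of serving such a policy, here is an epsilon-greedy sketch over a 24-hour action space; `q_values_for` is a hypothetical stand-in for the trained agent’s Q-network, not code from the production system:

```python
import numpy as np

SEND_HOURS = np.arange(24)  # Action space: which hour of the day to send the message

def q_values_for(hourly_activity: np.ndarray) -> np.ndarray:
    """Stand-in for the trained agent's Q-network (hypothetical placeholder)."""
    return hourly_activity  # e.g. favor hours where the user is most active

def choose_send_hour(hourly_activity: np.ndarray, epsilon: float = 0.05) -> int:
    """Epsilon-greedy choice over the 24 send hours, given hourly activity as State."""
    if np.random.rand() < epsilon:
        return int(np.random.choice(SEND_HOURS))  # explore a random hour
    return int(np.argmax(q_values_for(hourly_activity)))  # exploit the best known hour
```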
18. RL Model Training
[Diagram: the Agent sends an Action to the Environment; logged Experiences flow into an Experience Replay Buffer, which feeds training.]
A single Experience/Trajectory contains:
▪ S0: Previous Observation (State)
▪ A0: Action
▪ R: Reward
▪ S1: Current Observation
▪ A1: Next Optimal Action
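In code, one logged experience might be a simple record like this; the field names mirror the slide, but the structure is an illustration, not RL-Bakery’s internal format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Experience:
    """One trajectory step assembled from logs, as on the slide."""
    s0: List[float]  # previous observation (State)
    a0: int          # Action taken
    r: float         # Reward observed
    s1: List[float]  # current observation
    a1: int          # next optimal Action

replay_buffer: List[Experience] = []  # experiences are accumulated here for training
```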
19. Academic RL Applications
RL Agents learn by interacting with an environment
▪ Can’t just train an Agent on a static set of labelled data like supervised learning models
Well-known RL applications are trained offline with a simulator
▪ E.g. training an Agent to play Atari
The Agent is deployed only after lots of offline learning
[Diagram: RL Agent v1 acts and learns in the simulator, producing RL Agent v2, which acts and learns in turn.]
20. Production RL Applications for Personalization
It’s hard to simulate humans, so we learn by interacting with real users
The Agent interacts with humans from v1
We need to learn from batches in parallel
It’s harder to manage data and workflows!
[Diagram: RL Agent v1 takes Actions on live users and learns from many parallel batches, producing RL Agent v2, and so on.]
21. RL Model Training
Training Pipeline Wish List:
Off-the-shelf
Scalable
Cutting-edge algorithms
Reliable & robust
Easily extendable
22. RL Model Training
TF-Agents
An open-source RL library that implements cutting-edge Deep RL algorithms (DQN, PPO, TD3, etc.)
Advantages:
▪ Modular design
▪ Well-written
▪ Accurate implementations
▪ New algorithms added regularly
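For example, wiring up a DQN agent in TF-Agents follows roughly this pattern (adapted from the standard TF-Agents DQN tutorial; the observation shape, action count, and layer sizes are placeholders):

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.utils import common

# Placeholder specs: a 10-feature observation and 3 discrete actions.
observation_spec = tf.TensorSpec(shape=(10,), dtype=tf.float32)
action_spec = tensor_spec.BoundedTensorSpec(shape=(), dtype=tf.int32, minimum=0, maximum=2)
time_step_spec = ts.time_step_spec(observation_spec)

# Q-network mapping observations to per-action value estimates.
q_net = q_network.QNetwork(observation_spec, action_spec, fc_layer_params=(64, 64))

agent = dqn_agent.DqnAgent(
    time_step_spec,
    action_spec,
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
)
agent.initialize()
```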
23. Production RL Challenges
How to:
Convert messy, real-time logged data into RL trajectories?
Persist, restore, and re-use past agents & trajectories?
Create trajectories at production scale?
And… how do we make this repeatable and data-scientist friendly?
24. RL-Bakery
Our open-source library to help build batch RL applications in production, at scale
github.com/zynga/rl-bakery
25. RL-Bakery
A wrapper around RL algorithm libraries that simplifies developing real-world RL apps like personalization
[Stack diagram: RL-Application on top of RL-Bakery on top of an RL Library]
26. RL-Bakery
RL-Application:
▪ Application-specific
▪ Written in a Databricks notebook
▪ Data-scientist friendly
▪ Provides model configuration & hyperparameters
▪ Fetches observations, actions, and rewards as Spark DataFrames (see the sketch below)
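The shape of an RL-Application might look roughly like this; note that this interface is a hypothetical sketch for illustration only, not rl-bakery’s actual API (see github.com/zynga/rl-bakery for that), and the table names are invented:

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

class DailyMessageApp:
    """Hypothetical RL-Application: config plus data-fetching hooks."""

    # Model configuration & hyperparameters
    gamma = 0.9
    learning_rate = 1e-3

    def fetch_observations(self, run_date: str) -> DataFrame:
        # Observations, actions, and rewards come back as Spark DataFrames.
        return spark.sql(f"SELECT * FROM user_features WHERE ds = '{run_date}'")

    def fetch_rewards(self, run_date: str) -> DataFrame:
        return spark.sql(f"SELECT user_id, reward FROM engagement WHERE ds = '{run_date}'")
```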
27. RL-Bakery
RL-Bakery:
▪ Orchestrates the steps of the training pipeline
▪ Restores models and old time steps
▪ Creates new training trajectories
▪ Persists the model and trajectories between runs
▪ Deploys models to the serving system
▪ Adds functionality unavailable in TF-Agents (e.g. a prioritized replay buffer)
28. RL-Bakery
RL Library:
▪ Open-source RL libs implement algos like PPO, DQN, etc.
▪ Currently only TF-Agents is supported
▪ Core RL algorithms are implemented using TensorFlow
30. Model Serving & Model Training Architecture
[Diagram: in serving, real-time features feed real-time serving on AWS SageMaker, which returns the recommended Action; in training, Observations and Experience logs in S3 feed the RL-Bakery application, which trains the RL Agent and deploys it back to serving.]
32. Choose the Right Application
Is the problem best modelled as a sequence of decisions?
▪ Does the Action taken in one timestep affect future Actions?
▪ Otherwise, use simpler solutions like predictive models or contextual Multi-Armed Bandits
Is the Reward learnable?
▪ Does the Action impact the Reward?
▪ Sparse rewards are hard to learn
RL shouldn’t be applied to every situation
33. Choose States
Anecdotally, RL Agents are sensitive to having too many inputs
Choose simple state spaces
Compress the state space with unsupervised learning techniques like autoencoders (see the sketch below)
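For instance, a small Keras autoencoder can compress a wide raw feature vector into a compact state (a sketch; the dimensions and layer sizes are placeholders):

```python
import tensorflow as tf

RAW_DIM, STATE_DIM = 200, 16  # compress 200 raw features into a 16-dim state

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(RAW_DIM,)),
    tf.keras.layers.Dense(STATE_DIM, activation="relu"),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(RAW_DIM),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# After fitting autoencoder on raw feature vectors, use encoder(features) as the RL state.
```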
34. Designing Actions
Start simple: a small set of discrete Actions
▪ Allows you to use simpler Deep RL algorithms
Continuous action spaces require algorithms from the Policy Gradient family
A large set of discrete Actions -> classic Recommendation Systems
▪ This goes beyond the traditional RL setup
▪ Some cutting-edge Recommendation Systems use RL
35. Choosing RL Algorithms
An active area of research; new algos are constantly being developed
Algorithms are hard to implement, and subtle details affect results
Off-the-shelf implementations are available from open-source libs
36. Hyperparameter Tuning
Lots of RL application design choices
Plus Deep Learning hyperparameters
▪ Learning rate
▪ Neural network architecture
Slight hyperparameter changes can have big effects
How can we choose the best options before going live?
37. How to Pretrain
Can you do better than random at initial launch?
Train the Agent to mimic some existing behavior (see the sketch below)
▪ Use historic data to reward the Agent for picking previous Actions
▪ The Agent then slowly learns when to deviate
Simulate simple scenarios with hand-made mechanics
▪ Capture relationships between features in the State and the Action
▪ Simple scenarios have a clear “optimal” strategy, so you can measure success
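A minimal sketch of the “reward the Agent for picking previous Actions” idea above; the reward shaping is an assumption for illustration, not the production scheme:

```python
def mimicry_reward(agent_action: int, historic_action: int) -> float:
    """Pretraining reward: +1 when the agent picks the Action the old system chose.

    Training on historic logs with this reward teaches the agent to imitate the
    existing behavior; switching to the real engagement reward afterwards lets it
    slowly learn when to deviate.
    """
    return 1.0 if agent_action == historic_action else 0.0
```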
39. Key Takeaways
RL is the perfect methodology for personalization problems
RL is ready for production with off-the-shelf technology
RL applications are challenging to develop; best practices are still being discovered