The presentation discusses whether reinforcement learning (RL) is reaching a tipping point for production use. While RL has achieved superhuman performance in research domains like games, its use in production is still rare due to challenges like high data requirements, the limitations of online training, and large state/action spaces. However, the talk notes recent progress in areas like distributed training, offline RL, and embeddings that reduces this complexity. It identifies three patterns seen in successful production RL: parallelizable simulated tasks, low-temporality problems like recommendations, and next-generation optimization. The presentation closes with tips for starting with simpler RL approaches and for handling the validation challenges of deploying RL models.
1. The ML data engineering conference presented by
Is Production RL at a tipping point?
Waleed Kadous, Head of Engineering, Anyscale
2. Overall outline
- Reinforcement Learning (RL) showing huge successes in research
- Almost all the “superhuman” wins in games are RL based (Go, Poker, Chess, Atari, …)
- Production RL is still rare. Why?
- First: an incremental journey in RL from simple to complex
- Is RL at a tipping point? If not, why not?
- Tips for bootstrapping production RL in your company
- Common patterns in successful RL applications
- Things to watch out for
- Conclusion: RL is not yet off-the-shelf for all problems, but there are
subsets where it is becoming the clear winner
3. Our experiences
Anyscale is the company behind RLlib, the most popular open source distributed
RL library.
These are stories from our customers
4. Understanding the RL complexity spectrum
- Helps build a mental model of what’s easy and hard.
- A map from things people already know.
- Provides a roadmap for how to tackle problems.
- Will become relevant as we discuss how to deploy
Production RL in your company.
- Simplest to most complex.
5. Bandits (many people use already)
Unknown part: How often does each machine pay out?
State: None
Action: Pull lever 1 to 4
Reward: $10 if machine pays out, -$1 if machine doesn’t.
Practical example: UI treatments – each different UI is a bandit
6. Challenges with bandit
Key challenge is explore-exploit tradeoff: how do I balance using my existing
policy vs searching for a better policy?
Whole range of policies, e.g. epsilon-greedy:
p = random.uniform(0, 1)
if p < epsilon:
    pull_random_lever()              # explore
else:
    pull_max_reward([l1, ..., l4])   # exploit
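The epsilon-greedy pseudocode above can be expanded into a minimal runnable simulation. The payout probabilities, reward values, and pull count below are invented for illustration; only the explore/exploit structure comes from the slide:

```python
import random

random.seed(0)  # deterministic for the example

# Hypothetical payout probabilities for the four machines (unknown to the agent).
TRUE_PAYOUT_PROBS = [0.10, 0.15, 0.12, 0.40]

def pull(lever):
    """Simulate one pull: $10 if the machine pays out, -$1 otherwise."""
    return 10.0 if random.random() < TRUE_PAYOUT_PROBS[lever] else -1.0

def epsilon_greedy(epsilon=0.1, n_pulls=10_000):
    counts = [0] * 4      # how often each lever was pulled
    values = [0.0] * 4    # running mean reward per lever
    for _ in range(n_pulls):
        if random.random() < epsilon:
            lever = random.randrange(4)          # explore: random lever
        else:
            lever = values.index(max(values))    # exploit: best estimate so far
        reward = pull(lever)
        counts[lever] += 1
        values[lever] += (reward - values[lever]) / counts[lever]  # incremental mean
    return counts, values

counts, values = epsilon_greedy()
```

With enough pulls, the lever with the highest true expected reward ends up pulled most often, while the 10% exploration keeps the other estimates from going stale.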
7. Contextual Bandits
Context: Is it sunny or cloudy?
Unknown part: How often does each machine pay out given the weather?
State: None
Action: Pull lever 1 to 4
Reward: $10 if machine pays out, -$1 if machine doesn’t.
Practical example: Recommender system. Context = user profile
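When the context is a small discrete set, the simplest contextual-bandit policy is one independent epsilon-greedy bandit per context value. A sketch under that assumption; the payout probabilities are invented:

```python
import random

random.seed(1)
CONTEXTS = ["sunny", "cloudy"]
# Hypothetical payout probabilities per (context, lever), unknown to the agent.
TRUE_PROBS = {"sunny":  [0.35, 0.10, 0.15, 0.05],
              "cloudy": [0.05, 0.15, 0.10, 0.35]}

def pull(context, lever):
    return 10.0 if random.random() < TRUE_PROBS[context][lever] else -1.0

# One independent epsilon-greedy bandit per context.
counts = {c: [0] * 4 for c in CONTEXTS}
values = {c: [0.0] * 4 for c in CONTEXTS}

for _ in range(20_000):
    context = random.choice(CONTEXTS)      # the world picks the weather
    if random.random() < 0.1:
        lever = random.randrange(4)        # explore
    else:
        lever = values[context].index(max(values[context]))  # exploit
    reward = pull(context, lever)
    counts[context][lever] += 1
    values[context][lever] += (reward - values[context][lever]) / counts[context][lever]
```

For a rich context like a user profile, the per-context table is replaced by a model that predicts reward from context features, but the policy structure is the same.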
8. Bandits + Sequentiality = RL
Optimal policy: if you pull the levers in the order 3, then 4, then 1, the final pull pays out a $100 bonus
State: previous arm pulls
Action: Next arm to pull
Reward: Payout or -1 if no payout.
Example: Playing Chess – moves early on can impact the end of the game a lot
[Figure: four slot machines, with the winning pull order 3 → 4 → 1 marked]
9. Challenges with Making Bandits Stateful
Temporal credit assignment:
If you get a payout at time t, how do you divide the credit among all the
previous states you have visited? (Which move was the decisive move?)
Need some way of propagating reward backwards through time.
Search space expands (10^80 possible chess games).
What if reward is delayed (e.g. sacrifice in chess)?
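The standard answer to delayed reward is discounting: a payout at time t is propagated back to earlier steps with geometrically decaying weight. A small sketch; the gamma value and the toy episode are illustrative:

```python
def discounted_returns(rewards, gamma=0.9):
    """Back up reward through time: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future
        returns[t] = future
    return returns

# A chess-like episode: no reward until the final winning move.
# Earlier moves receive geometrically smaller credit for the end result.
episode_returns = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9)
```

This is the mechanism behind the "which move was the move?" question: a sacrifice that pays off ten moves later still receives some credit, just less of it.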
10. Large action and state space RL
What if you had 32 machines to bet on, and you also had to decide how
hard to pull each lever?
Example: Trading stocks (S&P 500)
11. Challenges with Large Action and State Spaces
- Dimensionality and the number of possibilities grow.
- State space:
- Grows exponentially
- Action space:
- 4 discrete values → 32 floating point numbers
12. Offline RL
What if you only had the logs of every lever pull and the reward for yesterday and you just
want to make the best policy from it?
Example: Learning from historical stock purchases
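As a toy illustration of the offline setting, a bandit policy can be fit purely from yesterday's logs by estimating each lever's mean payout. The log records below are invented:

```python
# Invented log of yesterday's pulls: (lever, reward) pairs.
logs = [(1, -1.0), (2, 10.0), (2, -1.0), (3, -1.0), (4, 10.0), (4, 10.0), (4, -1.0)]

totals, counts = {}, {}
for lever, reward in logs:
    totals[lever] = totals.get(lever, 0.0) + reward
    counts[lever] = counts.get(lever, 0) + 1

# Greedy policy from the logged estimates. Caveat from the next slide: it only
# covers levers that appear in the logs, and it inherits whatever biases the
# logging policy had, with no way to explore further.
estimates = {lever: totals[lever] / counts[lever] for lever in totals}
best_lever = max(estimates, key=estimates.get)
```

Real offline RL algorithms are much more careful than this mean estimate, precisely because of the incomplete-coverage and distributional-shift problems listed on the next slide.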
13. Challenges with Offline RL
- You’re stuck with whatever experiences were encountered: incomplete
state space.
- What if you trained when there was never a recession?
- Example: if you pull lever 4 twice in a row and win neither pull, you lose
all your winnings so far.
- Very sensitive to distributional shift:
- If the reward changes, RL policies tend to be brittle
14. Multi-agent RL (MARL)
What if there are two people playing?
Can be cooperative or competitive
Example: Stocks with a few big players
15. Challenges with Multiagent RL
- Interactions between agents (e.g. if two agents try to pull same lever)
- Shared representation or separate representations
- What’s the reward function across the set of agents?
16. Key distinctions between SL and RL
Exploitation vs exploration tradeoff
- Do we search for a better strategy, or use what we have already learned?
Maximizing future reward
- Not just the next step but all future decisions in future states
- Related: how to assign “credit” for reward states temporally?
Online, incremental approach
- Regularly updates the model, needs to experiment as part of process
- Offline approaches sensitive to “distributional shift”
State and Action spaces can be large
- Complex and multidimensional; tabular Q-learning needs a table of size |S| x |A|
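The |S| x |A| point can be made concrete with a tabular one-step Q-learning update, which keeps one value per (state, action) pair. The state/action sizes and the single update below are illustrative:

```python
# Tabular Q-learning keeps one value per (state, action) pair: an |S| x |A| table.
# Toy sizes for the stateful-bandit example: state = last lever pulled (0 = none).
N_STATES, N_ACTIONS = 5, 4
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning: move Q[s][a] toward r + gamma * max_a' Q[s_next][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Pulling lever 2 (action index 2) from the initial state paid out $10:
q_update(s=0, a=2, r=10.0, s_next=3)
```

The table works here because |S| x |A| is 20 entries; for chess-sized state spaces it is hopeless, which is why the large-space techniques later in the talk matter.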
20. And yet: 4 things that make production RL Hard
- High training requirements
- Overcoming limitations of online training
- Solving temporal credit assignment problem
- Large action and state spaces
but there’s been recent progress on all of these
21. Huge amount of training required
AlphaGo Zero: played 5 million games against itself to become superhuman
AlphaStar: Each agent trained for 200 years
Implications:
Well beyond the capabilities of a single machine
Recent Progress:
Distributed training (RLlib can help)
Transfer learning (e.g. learn to play one Atari game and apply to new domains)
Bootstrapping from existing state - action logs (human or previous runs)
Reducing difficulty using parameterized state spaces, action masking etc.
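Action masking, mentioned above, shrinks the effective action space by making invalid actions unselectable before the policy chooses. A minimal sketch, independent of any particular library:

```python
# Action masking: invalid actions are excluded before the argmax, so the
# agent can never pick a move that is illegal in the current state.
def masked_argmax(q_values, valid_mask):
    """Pick the highest-valued action among those the mask allows."""
    best_q, best_a = float("-inf"), None
    for a, (q, ok) in enumerate(zip(q_values, valid_mask)):
        if ok and q > best_q:
            best_q, best_a = q, a
    return best_a

# Action index 1 is masked out, so it cannot be chosen despite its high value.
choice = masked_argmax([1.0, 9.0, 0.5, 2.0], [True, False, True, True])
```

In deep RL the same idea is usually applied by adding a large negative number to the logits of masked actions before the softmax.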
22. Default implementation is online
Naive implementation of RL is online
Implications:
Hard to validate models in real-time
Hard to reuse data if model parameters are changed
Progress:
Offline training algorithms, dealing with counterfactuals
(RLlib supports both)
23. Temporal credit assignment
Actions do not immediately lead to rewards in real life
Implications:
- Introduces a host of problems: discounting rewards, Q functions
- Significantly increases training data requirements
Recent Progress:
- Contextual bandits are RL without the temporal credit assignment
- Though limited, are simple to deploy and use and are seeing adoption
24. Large state and action spaces
Large action and state spaces require lots more training. In the worst case, new policies need to be
retested
Implication
Not practical for real problems (e.g. robots)
Recent Progress
High fidelity simulators
Distributed RL libraries and techniques allow running many simulations at once
Deep Learning approaches to learning the state space
Embedding approaches for action space (e.g. first candidate selection, then rank)
Offline learning does not require relearning
26. 3 patterns we see in successful production RL applications
- Parallelizable simulation tasks
- Low temporality with immediate reward signal (aka contextual bandits)
- Optimization: The Next Generation
27. RL in simulated environments
Why?
- RL takes a lot of iterations to converge
- Too slow for the real world
- But if your problem is virtual OR your simulation is faithful …
Enabling techniques
- Running lots of simulations at once using distributed RL (e.g. RLlib)
- Systems for merging results from lots of experiments (batching)
- Getting close with simulator, then fine tuning in the real world
28. Example 1: Games!
- Games are nothing but virtual environments!
- Example:
- Riot Games – the company behind League of Legends
- Legends of Runeterra: card game (like Magic the Gathering)
- State: scores of individual players + remaining cards
- Action: which cards to play
- Reward: +1 for winning
- Create 10 “virtual players” and have them play against each other.
- Identify the virtual players who win disproportionately often.
29. Example 2: Markets
- Simulation does not have to be perfect
- Example:
- JP Morgan is using RL to model forex transactions
- State: holdings of each participant in the market
- Action: Buy or Sell a certain amount of stock
- Reward: profit - unsold stock
- Used to test automated trading before release into production
30. Low sequentiality with immediate reward signal
Reminder: RL = Contextual bandits + sequentiality
Pseudo-contextual bandits?
If R(a | t1…tn) ~= R(a | tn) and R(a | tn+1) ~= R(a | tn+1…tM)
then a lot of things get easier
Recent Progress
Ignoring sequentiality
Unrolling sequentiality into state
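One way to "unroll sequentiality into state" is to fold a fixed window of recent events into the state itself, after which the problem can be treated as a contextual bandit. The event names and window size below are made up:

```python
from collections import deque

def make_history_state(events, k=3):
    """State = tuple of the last k events, padded with None on the left."""
    window = deque([None] * k, maxlen=k)
    for event in events:
        window.append(event)
    return tuple(window)

# If the reward mostly depends on recent activity, the recent window itself
# becomes the "context" for a contextual bandit.
state = make_history_state(["login", "browse", "click", "buy"], k=3)
```

The cost is a state space that grows with the window size, which is why this only pays off when the reward genuinely depends on just the recent past.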
31. Recommender systems
Example: Wildlife game recommendations
State: last played games + user profile (the contextual part)
Action: present a recommendation for next game
Reward: +1 if user clicks on game
Question: what if I have millions of users and hundreds of games?
A key technique here is embedding. Use embedding to reduce dimensionality of
users, and embedding to find the next game.
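A sketch of the embedding idea: users and games live in the same low-dimensional space, and candidate games are ranked by similarity to the user's vector. All vectors and game names here are invented; real systems learn these embeddings from interaction data:

```python
# Users and games share one low-dimensional embedding space; candidates are
# the games whose vectors score highest against the user's vector.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

user_vec = [0.9, 0.1, 0.3]                      # hypothetical user embedding
game_vecs = {                                   # hypothetical game embeddings
    "puzzle_quest": [0.8, 0.0, 0.2],
    "word_battle":  [0.1, 0.9, 0.0],
    "match_mania":  [0.5, 0.2, 0.4],
}

# First-stage candidate selection: rank all games by similarity to the user.
# A second stage (a ranker or bandit) then chooses among the top few.
ranked = sorted(game_vecs, key=lambda g: dot(user_vec, game_vecs[g]), reverse=True)
```

This two-stage structure is what makes millions of users and hundreds of games tractable: the bandit only has to decide among a handful of embedding-selected candidates.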
32. Availability
Microsoft Azure and Google Cloud both offer personalization services based
on RL
You can go right now* and use RL for recommendations as a SaaS
34. Optimization: The Next Generation
One take on RL is that it is data-driven optimization.
Traditional optimization is very much about modeling (e.g. linear
programming), and was developed at a time when computation was scarce.
RL does not require a model; it just runs experiments. This obviously takes
a lot more computation, but it is often “plug and play” with optimization.
35. Example: Dow Chemical using RL for scheduling
Task: Schedule chemical plants’ production schedules to meet evolving
demand
Traditional OR approach: Mixed Integer Linear Programming
RL:
- State: Scheduling parameters
- Action: What to schedule when
- Reward: Total money saved
38. Simplest choice for each axis of complexity
- Stateless vs contextual vs stateful
- On-policy vs off-policy training vs offline training
- Small, discrete state and action spaces vs Large, continuous state and
action spaces
- Single agent vs multi-agent shared-policy vs true multi-agent
Ideally use the simplest possible
39. Be Aware of Special Challenges Deploying RL Models
- Validation
- How do you ensure the RL model doesn’t do something stupid (like ramming itself into a wall)?
- There are some available approaches (e.g. action masking and counterfactual evaluation)
- Updating
- In almost all cases, the deployed policy is “frozen” – no further updates to policy once
deployed. Epsilon turned to 0.
- Monitoring for “distributional shift”
- RL can be brittle, and policies may catastrophically collapse if the distribution changes
- Need to catch very quickly or you will literally lose your shirt
- Retraining
- Need to gather logs of decisions sequentially.
- Because the policy is not updated after every episode, this effectively means we are doing
a type of off-policy learning.
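A deployment-shaped sketch of the points above: the served policy is frozen (epsilon set to 0, purely greedy) and every decision is logged for later off-policy retraining. The value table and state dict stand in for whatever training actually produced:

```python
# The served policy is frozen: epsilon is 0, so it always acts greedily on
# the trained value estimates, and every decision is logged so that
# off-policy retraining is possible later.
frozen_values = {"lever_1": 0.1, "lever_2": 1.7, "lever_3": 0.6, "lever_4": 2.3}

decision_log = []

def act(state):
    action = max(frozen_values, key=frozen_values.get)   # greedy, no exploration
    decision_log.append({"state": state, "action": action})  # log for retraining
    return action

chosen = act({"context": "sunny"})
```

The log is the crucial part: without it there is no data to retrain from, and without retraining a frozen policy has no defense against distributional shift.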
40. Conclusions
- Reaching a tipping point in some areas. Early adopters seeing successes
- 3 Patterns that seem a good fit for production RL
- Parallelizable and/or high-fidelity simulation is possible
- Key enabler: distributed simulation
- Low temporality problems (reward mostly depends on what’s happening right now)
- Key enabler: use of embeddings to reduce State and Action space complexity
- Optimization
- Key enabler: availability of greater data (e.g. from machine sensors, digital twins)
- 2 Practical tips:
- There are simpler versions of RL. Try them first
- The MLOps/deployment around RL is very different. Make sure you understand it
- RLlib can help.
41. More information?
RLlib: Leading open source distributed production RL library
ray.io/rllib
Questions? mwk@anyscale.com
Special Thanks: Richard Liaw, Paige Bailey, Jun Gong, Sven Mika