The ML data engineering conference presented by
Is Production RL at a
tipping point?
Waleed Kadous, Head of Engineering, Anyscale
Overall outline
- Reinforcement Learning (RL) showing huge successes in research
- Almost all the “superhuman” wins in games are RL based (Go, Poker, Chess, Atari, …)
- Production RL is still rare. Why?
- First: an incremental journey in RL from simple to complex
- Is RL at a tipping point? If not, why not?
- Tips for bootstrapping production RL in your company
- Common patterns in successful RL applications
- Things to watch out for
- Conclusion: RL is not yet off-the-shelf for all problems, but there are
subsets where it is becoming the clear winner
Our experiences
Anyscale is the company behind RLlib, the most popular open source distributed
RL library.
These are stories from our customers.
Understanding the RL complexity spectrum
- Helps build a mental model of what’s easy and hard.
- A map from things people already know.
- Provides a roadmap for how to tackle problems.
- Will become relevant as we discuss how to deploy production RL in your company.
- Ordered from simplest to most complex.
Bandits (which many people already use)
Unknown part: How often does each machine pay out?
State: None
Action: Pull lever 1 to 4
Reward: $10 if machine pays out, -$1 if machine doesn’t.
Practical example: UI treatments – each different UI is a bandit
(Figure: four slot machines, labeled 1 to 4)
Challenges with bandit
The key challenge is the explore-exploit tradeoff: how do I balance using my existing
policy against searching for a better one?
There is a whole range of policies, e.g. epsilon-greedy:

import random

p = random.uniform(0, 1)
if p < epsilon:
    pull_random_lever()                 # explore: try a random lever
else:
    pull_max_reward([l1, l2, l3, l4])   # exploit: the lever with the best estimate so far
Contextual Bandits
Context: Is it sunny or cloudy?
Unknown part: How often does each machine pay out given the weather?
State: None
Action: Pull lever 1 to 4
Reward: $10 if machine pays out, -$1 if machine doesn’t.
Practical example: Recommender system. Context = user profile
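To make that concrete, here is a minimal sketch (not from the talk; the helper names are hypothetical) of an epsilon-greedy contextual bandit that keys its reward estimates on the context:

import random
from collections import defaultdict

totals = defaultdict(float)   # summed reward per (context, lever)
counts = defaultdict(int)     # number of pulls per (context, lever)

def choose_lever(context, epsilon=0.1, levers=(1, 2, 3, 4)):
    # Explore with probability epsilon; otherwise exploit the lever
    # with the best estimated reward for this specific context.
    if random.uniform(0, 1) < epsilon:
        return random.choice(levers)
    return max(levers, key=lambda l: totals[(context, l)] / max(counts[(context, l)], 1))

def record_reward(context, lever, reward):
    totals[(context, lever)] += reward
    counts[(context, lever)] += 1

For the weather example, context would be "sunny" or "cloudy", and the policy learns a separate estimate per condition.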
Bandits + Sequentiality = RL
Optimal policy: if you pull the levers in the order 3, then 4, then 1, the final pull pays out a $100 bonus
State: previous arm pulls
Action: Next arm to pull
Reward: Payout or -1 if no payout.
Example: Playing Chess – moves early on can impact the end of the game a lot
(Figure: the four levers annotated with example pull orders, highlighting the winning sequence 3, 4, 1)
Challenges with Making Bandits Stateful
Temporal credit assignment:
If you get a payout at time t, how do you divide the credit among all the
previous states you've been in? (Which move was the move?)
Need some form of backpropagation of credit through earlier states.
Search space explodes (roughly 10^120 possible chess games, by Shannon's estimate).
What if reward is delayed (e.g. a sacrifice in chess)?
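The standard machinery for temporal credit assignment is bootstrapping: a temporal-difference update that propagates credit backwards one step at a time. A minimal tabular Q-learning sketch (textbook method, not from the slides):

from collections import defaultdict

Q = defaultdict(float)        # Q[(state, action)] -> estimated future reward
alpha, gamma = 0.1, 0.99      # learning rate and discount factor

def td_update(state, action, reward, next_state, actions):
    # Credit flows backwards one step: the value of (state, action) is
    # pulled toward the immediate reward plus the best value reachable
    # from the next state. Repeated over many episodes, a reward at the
    # end of a game seeps back to the early moves that enabled it.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])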
Large action and state space RL
What if you had 32 machines to bet on, and you also had to decide how hard to pull each lever?
Example: Trading stocks (S&P 500)
Challenges with Large Action and State Spaces
- Dimensionality, and with it the number of possibilities, grows.
- State space:
- Grows exponentially
- Action space:
- Goes from 4 discrete values → 32 floating point numbers (see the sketch below)
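In Gymnasium terms (the slides don't name a library; this is just one way to write the jump down), it looks like this:

from gymnasium import spaces

# Four slot machines: a single discrete choice.
small = spaces.Discrete(4)

# 32 machines, each with a continuous "pull strength" in [0, 1]:
# the action is now a vector of 32 floating point numbers.
large = spaces.Box(low=0.0, high=1.0, shape=(32,))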
Offline RL
What if you only had yesterday's logs of every lever pull and its reward, and you just
want to learn the best policy from them?
Example: Learning from historical stock purchases
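A toy sketch of the offline setting (the log format is hypothetical): pick the best lever using only yesterday's records, with no new pulls allowed.

from collections import defaultdict

# Hypothetical log format: (lever, reward) pairs from yesterday.
logs = [(1, -1), (3, 10), (3, 10), (2, -1), (4, -1), (3, -1)]

totals, counts = defaultdict(float), defaultdict(int)
for lever, reward in logs:
    totals[lever] += reward
    counts[lever] += 1

# The "best" policy given only the experiences that happen to be logged.
best_lever = max(counts, key=lambda l: totals[l] / counts[l])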
Challenges with Offline RL
- You’re stuck with whatever experiences were encountered: incomplete
state space.
- What if you trained when there was never a recession?
- Example: suppose that if you pull lever 4 twice in a row and don't win either
time, you lose all your winnings so far; you would never learn that rule if those pulls never appear in the logs.
- Very sensitive to distributional shift:
- If the reward changes, RL policies tend to be brittle
Multi-agent RL (MARL)
What if there are two people playing?
Can be cooperative or competitive
Example: Stocks with a few big players
Challenges with Multiagent RL
- Interactions between agents (e.g. if two agents try to pull the same lever)
- Shared representation or separate representations
- What’s the reward function across the set of agents?
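For intuition, here is one step of a hypothetical two-player bandit where agents interact through a shared lever (shaped loosely like RLlib's multi-agent convention of dicts keyed by agent ID, but purely illustrative):

import random

def step(actions):
    # actions: a dict keyed by agent ID, e.g. {"player_1": 3, "player_2": 1}
    rewards = {agent: (10 if random.random() < 0.2 else -1)
               for agent, lever in actions.items()}
    # Agents interact: if both pull the same lever, they split the payout.
    if actions["player_1"] == actions["player_2"]:
        rewards = {agent: r / 2 for agent, r in rewards.items()}
    return rewards

print(step({"player_1": 3, "player_2": 3}))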
Key distinctions between supervised learning (SL) and RL
Exploitation vs exploration tradeoff
- Do we search for a better strategy, or use what we've already learned?
Maximizing future reward
- Not just the next step but all future decisions in future states (formalized below)
- Related: how do we assign "credit" for rewards across time?
Online, incremental approach
- Regularly updates the model; needs to experiment as part of the process
- Offline approaches are sensitive to "distributional shift"
State and action spaces can be large
- Complex and multidimensional; tabular Q-learning needs a table of size |S| × |A|
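For reference, "maximizing future reward" has a standard formalization as the discounted return (textbook notation, not from the slides):

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad 0 \le \gamma < 1

The discount factor \gamma controls how much future rewards count relative to immediate ones.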
Is Production RL at a tipping point?
RL Hugely Successful in Research …
Evidence for …
And yet: 4 things that make production RL Hard
- High training requirements
- Overcoming limitations of online training
- Solving temporal credit assignment problem
- Large action and state spaces
but there’s been recent progress on all of these
Huge amount of training required
AlphaGo Zero: played 5 million games against itself to become superhuman
AlphaStar: Each agent trained for 200 years
Implications:
Well beyond the capabilities of a single machine
Recent Progress:
Distributed training (RLlib can help)
Transfer learning (e.g. learn to play one Atari game and apply it to new domains)
Bootstrapping from existing state–action logs (human or from previous runs)
Reducing difficulty using parameterized state spaces, action masking, etc. (action masking is sketched below)
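Action masking deserves a quick illustration: invalid actions get their logits pushed to minus infinity so they can never be sampled. A minimal NumPy sketch (illustrative, not RLlib's implementation):

import numpy as np

def masked_sample(logits, mask):
    # mask: 1 for legal actions, 0 for illegal ones. Setting illegal
    # logits to -inf gives those actions probability exactly zero.
    logits = np.where(mask.astype(bool), logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)

masked_sample(np.array([1.0, 2.0, 0.5, 0.0]), np.array([1, 0, 1, 1]))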
Default implementation is online
The naive implementation of RL learns online.
Implications:
Hard to validate models in real-time
Hard to reuse data if model parameters are changed
Progress:
Offline training algorithms, and methods for dealing with counterfactuals
(RLlib supports both; a simple counterfactual estimator is sketched below)
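One way to deal with counterfactuals is off-policy evaluation: scoring a new policy on logs collected by an old one. A minimal inverse propensity scoring sketch (a textbook estimator with a hypothetical log format, not RLlib's implementation):

def ips_estimate(logs, new_policy_prob):
    # logs: (state, action, reward, behavior_prob) tuples, where
    # behavior_prob is the probability the logging policy gave the
    # action it actually took.
    total = 0.0
    for state, action, reward, behavior_prob in logs:
        # Reweight each logged reward by how much more (or less)
        # likely the new policy is to take the same action.
        total += (new_policy_prob(state, action) / behavior_prob) * reward
    return total / len(logs)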
Temporal credit assignment
Actions do not immediately lead to rewards in real life
Implications:
- Introduces a host of problems: discounting rewards, Q functions
- Significantly increases training data requirements
Recent Progress:
- Contextual bandits are RL without the temporal credit assignment problem
- Though limited, they are simple to deploy and use, and are seeing adoption
Large state and action spaces
Large action and state spaces require much more training. In the worst case, new policies need to be retested.
Implication:
Not practical for real-world problems (e.g. robots)
Recent Progress
High fidelity simulators
Distributed RL libraries and techniques allow running many simulations at once
Deep Learning approaches to learning the state space
Embedding approaches for action space (e.g. first candidate selection, then rank)
Offline learning avoids having to re-collect experience from scratch
Common patterns in successful production RL applications
3 patterns we see in successful production RL applications
- Parallelizable simulation tasks
- Low temporality with immediate reward signal (aka contextual bandits)
- Optimization: The Next Generation
RL in simulated environments
Why?
- RL takes a lot of iterations to converge
- Too slow for the real world
- But if your problem is virtual OR your simulation is faithful …
Enabling techniques
- Running lots of simulations at once using distributed RL (e.g. RLlib); see the sketch below
- Systems for merging results from lots of experiments (batching)
- Getting close with a simulator, then fine-tuning in the real world
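A toy sketch of the distributed-simulation pattern using plain Ray tasks (ray.remote and ray.get are real Ray API; simulate_episode is a stand-in for a real environment rollout):

import random
import ray

ray.init()

@ray.remote
def simulate_episode(seed):
    # Stand-in for one full simulated episode; returns its total reward.
    random.seed(seed)
    return sum(random.choice([10, -1]) for _ in range(100))

# Launch 1,000 episodes in parallel across whatever cluster Ray sees.
returns = ray.get([simulate_episode.remote(i) for i in range(1000)])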
Example 1: Games!
- Games are nothing but virtual environments!
- Example:
- Riot Games – the company behind League of Legends
- Legends of Runeterra: card game (like Magic the Gathering)
- State: scores of individual players + remaining cards
- Action: which cards to play
- Reward: +1 for winning
- Create 10 “virtual players” and play them against each other.
- Identify virtual players who win disproportionately.
Example 2: Markets
- Simulation does not have to be perfect
- Example:
- JP Morgan is using RL to model forex transactions
- State: holdings of each participant in the market
- Action: Buy or Sell a certain amount of stock
- Reward: profit - unsold stock
- Used to test automated trading before release into production
Low sequentiality with immediate reward signal
Reminder: RL = Contextual bandits + sequentiality
Pseudo-contextual bandits?
If R(a | t_1 … t_n) ≈ R(a | t_n) and R(a | t_{n+1}) ≈ R(a | t_{n+1} … t_M),
then a lot of things get easier
Recent Progress
Ignoring sequentiality
Unrolling sequentiality into state
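"Unrolling sequentiality into state" means folding recent history into the observation itself, so a contextual bandit can act on it. A tiny sketch (names are illustrative):

from collections import deque

history = deque(maxlen=3)  # keep only the last 3 observations

def unrolled_state(new_obs):
    # The "state" handed to the bandit is the recent window itself,
    # so no long-range credit assignment is needed.
    history.append(new_obs)
    return tuple(history)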
Recommender systems
Example: game recommendations at Wildlife (a mobile games studio)
State: last played games + user profile (the contextual part)
Action: present a recommendation for next game
Reward: +1 if user clicks on game
Question: what if I have millions of users and hundreds of games?
A key technique here is embeddings: use one embedding to reduce the dimensionality of
users, and another to find the next game (sketched below).
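A sketch of that two-stage embedding trick (NumPy, with hypothetical shapes): retrieve a few candidate games cheaply by dot product, then rank only those.

import numpy as np

rng = np.random.default_rng(0)
user_emb = rng.normal(size=64)           # one user, 64-dim embedding
game_embs = rng.normal(size=(500, 64))   # hundreds of games

# Stage 1: cheap candidate selection by embedding similarity.
scores = game_embs @ user_emb
candidates = np.argsort(scores)[-10:]

# Stage 2: rank only the 10 candidates (the same score stands in here
# for a heavier ranking model).
best_game = max(candidates, key=lambda g: scores[g])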
Availability
Microsoft Azure and Google Cloud both offer personalization services based
on RL
You can go right now and use RL for recommendations as a SaaS
Optimization: The Next Generation
One take on RL is that it is data-driven optimization.
Traditional optimization is very much about modeling (e.g. linear
programming), and was developed at a time when computation was scarce.
RL does not require a model; it just runs experiments. Obviously this takes a
lot more computation, but it is often “plug and play” with traditional optimization.
Example: Dow Chemical using RL for scheduling
Task: schedule chemical plants’ production to meet evolving demand
The OR (operations research) approach: Mixed Integer Linear Programming
RL:
- State: Scheduling parameters
- Action: What to schedule when
- Reward: Total money saved
2 tips for production RL
Simplest choice for each axis of complexity
- Stateless vs contextual vs stateful
- On-policy vs off-policy training vs offline training
- Small, discrete state and action spaces vs Large, continuous state and
action spaces
- Single agent vs multi-agent shared-policy vs true multi-agent
Ideally, use the simplest possible option on each axis
Be Aware of Special Challenges Deploying RL Models
- Validation
- How do you ensure the RL model doesn’t do something stupid (like ram itself into a wall)?
- Some approaches are available (e.g. action masking and counterfactual evaluation)
- Updating
- In almost all cases, the deployed policy is “frozen”: no further updates to the policy once
deployed, and epsilon is set to 0 (sketched after this list)
- Monitoring for “distributional shift”
- RL can be brittle, and policies may catastrophically collapse if the distribution changes
- Need to catch it very quickly or you will literally lose your shirt
- Retraining
- Need to gather logs of decisions sequentially.
- Because the policy is not updated after every episode, this effectively means we are doing
a type of off-policy learning.
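To make the “frozen, epsilon at 0” point concrete, a minimal serving sketch (names are hypothetical; the logging hook is what feeds later off-policy retraining):

import json

def log_decision(state, action):
    # Record every served decision; these logs are what retraining
    # (a form of off-policy learning) consumes later.
    print(json.dumps({"state": state, "action": action}))

def serve(policy_values, state):
    # The deployed policy is frozen: greedy action (epsilon = 0) and no
    # online updates to policy_values.
    action = max(policy_values[state], key=policy_values[state].get)
    log_decision(state, action)
    return action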
Conclusions
- RL is reaching a tipping point in some areas; early adopters are seeing successes
- 3 Patterns that seem a good fit for production RL
- Parallelizable and/or high-fidelity simulation is possible
- Key enabler: distributed simulation
- Low temporality problems (reward mostly depends on what’s happening right now)
- Key enabler: use of embeddings to reduce State and Action space complexity
- Optimization
- Key enabler: greater availability of data (e.g. from machine sensors, digital twins)
- 2 Practical tips:
- There are simpler versions of RL; try them first
- The MLOps/deployment story around RL is very different; make sure you understand it
- RLlib can help.
More information?
RLlib: the leading open source distributed production RL library
ray.io/rllib
Questions? mwk@anyscale.com
Special Thanks: Richard Liaw, Paige Bailey, Jun Gong, Sven Mika