Making smart decisions in real-time
with Reinforcement Learning
Ruth Yakubu
Sr. Cloud Advocate
@RuthieYakubu
Agenda
• Reinforcement Learning (RL) concepts
• RL approaches, challenges and algorithms
• Q-Learning methods
• Introduction to Azure Personalizer
• Demo
• Reinforcement Learning on Azure ML
• Quick Ray/RLlib framework
• Training built-in RL agents using the RLlib framework
• Demo
Basic Reinforcement Learning
• Learning by experience.
• Goal: choose actions that maximize rewards.
• Agent: Dog
• State: Sit, Walk
• Reward: Get a treat, no treat
• Environment: Room or anywhere
• We have the Environment, on which an Agent operates by responding to commands and receiving Rewards and some State information.
• Involves trial and error.
• Remembers patterns that lead to success or failure.
Reinforcement learning structure
• State: where in the maze
• Action: up, down, left, right
• Reward: +1 for each cheese
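The state, action and reward of the maze example can be sketched in a few lines of Python. This is only an illustrative sketch: the 3x3 layout, the cheese positions, the start cell and the names `MAZE`, `step` and `run_episode` are assumptions, not taken from the slides. The agent here moves at random and does not learn yet.

```python
import random

# Hypothetical 3x3 maze: "C" marks a cheese, "." an empty cell.
MAZE = [
    [".", ".", "C"],
    [".", ".", "."],
    ["C", ".", "."],
]
# Action: up, down, left, right, expressed as (row, column) offsets.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action, cheeses):
    """Apply one action and return (next_state, reward)."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    # Moves that would leave the maze keep the agent at the border.
    new_row = min(max(row + d_row, 0), len(MAZE) - 1)
    new_col = min(max(col + d_col, 0), len(MAZE[0]) - 1)
    next_state = (new_row, new_col)
    reward = 0
    if next_state in cheeses:
        cheeses.remove(next_state)  # each cheese can be eaten only once
        reward = 1                  # Reward: +1 for each cheese
    return next_state, reward


def run_episode(num_steps=50, seed=0):
    """Let a random (non-learning) agent wander and sum its rewards."""
    rng = random.Random(seed)
    cheeses = {(r, c) for r, row in enumerate(MAZE)
               for c, cell in enumerate(row) if cell == "C"}
    state, total_reward = (1, 1), 0  # State: where in the maze
    for _ in range(num_steps):
        action = rng.choice(list(ACTIONS))
        state, reward = step(state, action, cheeses)
        total_reward += reward
    return total_reward
```

A learning agent would replace the random choice with a policy that improves from the rewards it has seen.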
Q-Learning Algorithm
• Start with $Q^*(s, a) = 0$ for all $s, a$
• Get initial state $s$
• Repeat until convergence of $Q^*$:
  • Select action $a$ and get immediate reward $r$ and next state $s'$
  • Update Q-value and current state:
    $Q^*(s, a) \leftarrow R(s, a) + \gamma \max_{a'} Q^*(s', a')$, where $R(s, a)$ is the immediate reward $r$; then set $s \leftarrow s'$
• Note: $\gamma$ (Gamma) is a discount value that ranges between 0 and 1
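The update rule on this slide can be tried out on a very small problem. The sketch below assumes a hypothetical five-cell corridor with a reward in the last cell; the environment, the names `step` and `q_learning`, and the use of purely random action selection are illustrative assumptions. As on the slide, the update has no learning rate, which is reasonable only because this toy environment is deterministic.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 pays the reward
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount value between 0 and 1


def step(state, action):
    """Deterministic corridor: return (next_state, reward, done)."""
    if action == 0:
        next_state = max(state - 1, 0)
    else:
        next_state = min(state + 1, N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    # Start with Q(s, a) = 0 for all s, a.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False          # get initial state
        while not done:
            action = rng.choice(ACTIONS)  # purely random exploration
            next_state, reward, done = step(state, action)
            # A terminal state has no future value.
            future = 0.0 if done else max(q[next_state])
            # Q(s, a) <- R(s, a) + gamma * max over a' of Q(s', a')
            q[state][action] = reward + GAMMA * future
            state = next_state
    return q
```

After training, moving right should score higher than moving left in every non-terminal cell, and the values shrink by a factor of `GAMMA` for each extra step away from the reward.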
Exploration & Exploitation
• Exploration: the process of exploring and learning more information about the environment
• Exploitation: uses known information about the environment to gain rewards quicker
How to select actions?
• Common strategies:
• Epsilon-Greedy exploration: with probability $\varepsilon$ execute a random action, otherwise execute the best action $a^* = \arg\max_{a} Q(s, a)$
• In practice we need a decreasing schedule for $\varepsilon$ during training, so that the agent explores enough at the beginning and exploits enough as it converges.
• Boltzmann exploration: similar to a softmax distribution $P(a) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$, but with a parameter $T$ that controls the spread of the distribution, such that a high value gives a more uniform distribution than a low value.
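Both strategies fit in a few lines of plain Python. This is a minimal sketch, assuming the Q-values for the current state are given as a list indexed by action; the function names are illustrative and not part of any library.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, temperature):
    """Softmax over Q(s, a) / T; subtracting the max keeps exp() stable."""
    scaled = [q / temperature for q in q_values]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a large temperature the probabilities returned by `boltzmann_probs` are close to uniform; with a small one, nearly all the mass sits on the best action, which approaches greedy selection.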
• Model Based: learns transition and reward models of the environment to compute the optimal policy
• Model Free: learns an optimal policy by interacting with the environment
• Value Based: learns a value function explicitly and computes the policy from that
• Policy Based: learns a policy directly without computing a value function
• Actor Critic: learns both a policy (the actor) and a value function (the critic), which measures how good a policy is
Deep Q-Learning (DQN)
• Q-Learning: $Q(s, a)$, with greedy policy $\pi = \arg\max_{a'} Q(s, a')$
• Deep Q-learning: $Q(s, a; \theta)$, with greedy policy $\pi = \arg\max_{a'} Q(s, a'; \theta)$
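The step from a table $Q(s, a)$ to a parameterized function $Q(s, a; \theta)$ can be illustrated without a deep learning framework. The sketch below is an assumption-laden stand-in: it uses a linear function of hand-made state features instead of a neural network, and it leaves out pieces that practical DQN implementations usually add, such as experience replay and a target network. All names are illustrative.

```python
def q_value(features, theta, action):
    """Q(s, a; theta) for a linear model: one weight row per action."""
    return sum(w * x for w, x in zip(theta[action], features))


def greedy_action(features, theta):
    """pi(s) = argmax over a' of Q(s, a'; theta)."""
    return max(range(len(theta)), key=lambda a: q_value(features, theta, a))


def td_update(theta, s, a, r, s_next, done, gamma=0.99, lr=0.1):
    """One gradient-style step toward r + gamma * max_a' Q(s', a'; theta)."""
    if done:
        best_next = 0.0
    else:
        best_next = max(q_value(s_next, theta, b) for b in range(len(theta)))
    td_error = r + gamma * best_next - q_value(s, theta, a)
    new_theta = [row[:] for row in theta]  # leave the input weights untouched
    # For a linear Q, the gradient with respect to theta[a] is the feature vector.
    new_theta[a] = [w + lr * td_error * x for w, x in zip(theta[a], s)]
    return new_theta
```

The greedy policy reads the same as in the tabular case; only the lookup in a table is replaced by an evaluation of the parameterized function.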
Reinforcement Learning challenges
• The environment might be stochastic
• The model of the environment is usually hidden or incomplete
• Actions are interdependent
• There is no supervision
• The feedback received might be partial and/or delayed
• Partial Observability
• Actions and/or states might be continuous
Some Use Cases for RL
• Game Playing (some famous examples: Backgammon, Atari, Go)
• Operations Research (examples: Pricing, Vehicle Routing)
• Robotic Control
• Dialog Systems
• Energy Optimization
• Resource Allocation (examples: Computation, Networking)
• Autonomous Vehicles
• Computational Finance
What does Personalizer do?
• Present the best action: from a given set of input actions
• Uses Reinforcement Learning: exploit the existing model in most cases; occasionally, explore new possibilities
• Continuous model updates: update the scoring model with the training model
[Diagram labels: Your App, User & Context Info, Action 1 Info, Action 2 Info, Action 3 Info, Reward Score]
How it works?
• Rank API: Explore, Exploit
• Reward API: Reward action
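The rank-then-reward loop can be imitated locally to see how exploring and exploiting interact. The class below is a hypothetical toy, not the Personalizer SDK or its REST API: `ToyRanker`, its methods and its parameters are invented for illustration, and it ignores user and context information entirely.

```python
import random


class ToyRanker:
    """Hypothetical stand-in for a rank/reward service (not the Personalizer SDK)."""

    def __init__(self, action_ids, explore_rate=0.2, seed=0):
        self.rng = random.Random(seed)
        self.explore_rate = explore_rate
        self.totals = {a: 0.0 for a in action_ids}  # summed rewards per action
        self.counts = {a: 0 for a in action_ids}    # rewards received per action
        self.pending = {}                           # event_id -> chosen action
        self.next_event = 0

    def _mean(self, action_id):
        count = self.counts[action_id]
        return self.totals[action_id] / count if count else 0.0

    def rank(self):
        """Return (event_id, action): explore sometimes, otherwise exploit."""
        actions = list(self.totals)
        if self.rng.random() < self.explore_rate:
            choice = self.rng.choice(actions)       # explore
        else:
            choice = max(actions, key=self._mean)   # exploit
        event_id = self.next_event
        self.next_event += 1
        self.pending[event_id] = choice
        return event_id, choice

    def reward(self, event_id, value):
        """Report a reward score for an earlier rank call."""
        action = self.pending.pop(event_id)
        self.totals[action] += value
        self.counts[action] += 1
```

An application would call `rank()` when it needs to show something, display the returned action, and later call `reward()` with a score that reflects what the user did.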
Personalizer in Action
• Xbox Home
  • Personalized: Type of content in hero position, item in secondary river.
  • Reward: Click and engagement
  • Results: +40% lift in engagement for items
• Bing Ads
  • Personalized: Layout and location of ads
  • Reward: Ad click through
  • Results: +6% in ad clickthrough
• MSN News
  • Personalized: News content on top of page in MSN.com or Edge DHP/NTP
  • Reward: Click on content on the first slot
  • Results: +25% improvement in News clickthrough
Demo
https://aka.ms/PersonalizerCodeDemo
RL on Azure ML – What is It?
• Fully managed RL service for large-scale distributed simulation and training, using the Ray/RLlib framework.
• Customers create compute clusters and submit simulation/training jobs using the standard Azure ML pattern (Estimator) with the SDK & CLI.
• RL algorithms are in RLlib. Deep training uses TensorFlow by default; PyTorch is possible.
• Available in azureml-sdk 1.0.76
RL Jobs Requirements
• Hundreds of parallel simulations.
• Training can take multiple days.
• Support for multiple Ray jobs.
• Resilient to simulator / worker failures.
• MLOps pipeline integration.
Simulators Support
• OpenAI Gym.
• Custom simulators with the OpenAI Gym Environment interface (worker local or remote in simulator).
• Windows support.
• Investigating additional simulator support.
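A custom simulator mostly needs to expose the two methods that the Gym Environment interface is built around. The sketch below follows the classic Gym convention, where `reset()` returns an observation and `step(action)` returns `(observation, reward, done, info)`. It deliberately does not import `gym`, and the `CoinFlipEnv` class is a hypothetical example. Newer Gym and Gymnasium releases changed these return values, so check the version in use.

```python
import random


class CoinFlipEnv:
    """Hypothetical simulator using the classic Gym method names."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        """Reward 1.0 when the action matches the coin that was observed."""
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        self.coin = self.rng.randint(0, 1)  # next observation
        return self.coin, reward, done, {}
```

An agent interacts with it by calling `reset()` once and then `step()` in a loop until `done` is true.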
What is Ray?
• High-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications.
• Uses a lightweight API based on dynamic task graphs and actors to express a wide range of applications in a flexible manner.
What is RLlib?
• Library for Reinforcement Learning built on top of the Ray framework.
• High scalability and unified API.
• Provides abstractions for common RL components: Policy Model, Policy Evaluator, Policy Optimizer.
• Hierarchical and logically centralized control to compose common RL components.
RLlib Architecture
Source: RLlib: Scalable Reinforcement Learning
RL on Azure ML – How it Works?
[Diagram: a Data Scientist submits an Experiment to Azure Machine Learning, which runs a Ray Cluster (a Head Node for training plus Worker Nodes, each hosting several Workers) and a Simulator Cluster (Simulator Nodes, each hosting several Sims), and produces Training Results.]
DEMO
https://aka.ms/AzureMLRayRLDemo

Editor's Notes

  1. Provide a basic understanding of common Reinforcement Learning concepts, approaches, and its mathematical foundations and algorithms. Understand common challenges in Reinforcement Learning and techniques to address them. Show how to code a Deep Reinforcement Learning agent from scratch, using as an example a Deep Q-Learning agent. Show a preview of the upcoming RL infrastructure on Azure ML and how to use it to train agents at scale.
  2. We have the Environment, on which an Agent operates by acting on commands and receiving Rewards and some State information. The goal here is to train an agent that learns to choose actions that maximize the Rewards received.
  3. At its highest level, an RL system has the structure depicted in this diagram: an agent and an environment that exchange state, action, and reward. We usually model this problem as a Markov Decision Process (MDP).
  4. Computing the Q-function this way is known as tabular Q-Learning. It is tabular because we explicitly enumerate the Q-values for all state-action pairs in a table and solve the optimization problem through dynamic programming.
  5. A central aspect for Q-Learning to work is a good strategy to choose actions in the environment. The idea here is that an agent needs to execute actions to explore the environment enough, in order to learn from good experiences. On the other hand, the agent also needs a good policy in order to obtain good experiences from the environment. This is known as the Exploration vs Exploitation tradeoff.
  6. A common strategy to balance exploration and exploitation is known as Epsilon-Greedy exploration, where we introduce some uncertainty when choosing the best action. This is what we are going to use in our lab. In practice, we implement this with an annealing scheme that decreases the probability of picking random actions as the model converges. There are other strategies, such as Boltzmann exploration, which is like a softmax function with an additional parameter that controls the spread of the distribution. By varying this parameter we can also control the uncertainty in picking random actions.
  7. With those definitions, we can categorize RL algorithms in the following classes.
  8. Here we will focus on Model-free approaches, getting into the details of Value-based algorithms.
  9. Solving Q-Learning with a neural network.
  10. Here are some examples of use cases that can be solved by RL.
  11. What is Personalizer? Personalizer implements an AI technique called Reinforcement Learning. Here's how it works. Suppose we want to display a "hero" action to the user. The user might not be sure what to do next, but we could display one of several suggestions. For a gaming app, that might be: "play a game", "watch a movie", or "join a clan". Based on that user's history and other contextual information -- say, their location, the time of day, and the day of the week -- the Personalizer service will rank the possible actions and suggest the best one to promote. Hopefully, the user will be happy, but how can we be sure? That depends on what the user does next, and whether that was something we wanted them to do. According to our business logic, we'll assign a "reward score" between 0 and 1 to what happens next. For example, spending more time playing a game or reading an article, or spending more money in the store, might lead to higher reward scores. Personalizer feeds that info back into the ranking system for the next time we need to feature an activity.
  12. You only need the Rank API and Reward API to integrate with your application
  13. Here’s how, in the background, the Personalizer API is built on Reinforcement Learning.
  14. Personalizer has been in development at Microsoft for many years. It's used on Xbox devices, to determine what activities are featured on the home page, like playing an installed game, or purchasing a new game from the store, or watching others play on Mixer. Since the introduction of Personalizer, the Xbox team has seen a significant lift in key engagement metrics. Personalizer is also used to optimize the placement of ads in Bing search, and the articles featured in MSN News, again with great results in improving engagement from users. Now you can use Personalizer in your own apps, as well.
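
The rank-and-reward loop described in notes 11 to 13, combined with the Epsilon-Greedy exploration from note 6, can be sketched in a few lines. This is an illustrative toy and not the Azure Personalizer SDK: the class `ToyPersonalizer`, its `rank` and `reward` methods, the `"evening"` context, and the simulated click probabilities are all assumptions made for the example.

```python
import random


class ToyPersonalizer:
    """Epsilon-greedy ranker keyed by (context, action); not the Azure SDK."""

    def __init__(self, actions, epsilon=0.2, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.totals = {}  # (context, action) -> summed reward scores
        self.counts = {}  # (context, action) -> number of rewards received

    def estimate(self, context, action):
        """Average reward seen so far for this context and action."""
        key = (context, action)
        n = self.counts.get(key, 0)
        return self.totals.get(key, 0.0) / n if n else 0.0

    def rank(self, context):
        """Pick the action to feature: explore with probability epsilon."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.estimate(context, a))

    def reward(self, context, action, score):
        """Feed back a reward score between 0 and 1 for a featured action."""
        if not 0.0 <= score <= 1.0:
            raise ValueError("reward score must be between 0 and 1")
        key = (context, action)
        self.totals[key] = self.totals.get(key, 0.0) + score
        self.counts[key] = self.counts.get(key, 0) + 1


actions = ["play a game", "watch a movie", "join a clan"]
service = ToyPersonalizer(actions)
user_rng = random.Random(1)  # simulated users, purely hypothetical

for _ in range(500):
    context = "evening"
    shown = service.rank(context)
    # Assumed behavior: evening users click movies far more often.
    click_prob = 0.8 if shown == "watch a movie" else 0.2
    score = 1.0 if user_rng.random() < click_prob else 0.0
    service.reward(context, shown, score)

best = max(actions, key=lambda a: service.estimate("evening", a))
print("Featured most confidently in the evening:", best)
```

The application only ever calls `rank` and `reward`, which mirrors the two-call integration mentioned in note 12; the business logic that turns user behavior into a score between 0 and 1 stays on the application side.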