Making smart decisions in real-time
with Reinforcement Learning
Ruth Yakubu
Sr. Cloud Advocate
@RuthieYakubu
Agenda
 Reinforcement Learning (RL) concepts
 RL approaches, challenges and
algorithms
 Q-Learning methods
 Introduction to Azure Personalizer
 Demo
 Reinforcement Learning on Azure ML
 Quick Ray/RLlib framework
 Training built-in RL agents using the
RLlib framework
 Demo
Basic Reinforcement
Learning
• Learning by experience.
• Goal: choose actions that maximize rewards
• Agent: Dog
• State: Sit / Walk
• Reward: Get a treat / No treat
• Environment: Room or Anywhere
• We have the Environment, on which an Agent operates by responding to
commands and receiving Rewards and some State information.
• Involves trial and error
• Remembers patterns that lead to success or failure.
Reinforcement learning structure
 State: where in the maze
 Action: up, down, left, right
 Reward: +1 for each cheese
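To make the state/action/reward loop concrete, here is a minimal, self-contained sketch of the agent/environment interaction for the maze example. The `MazeEnvironment` class and the random agent are hypothetical illustrations, not part of the talk's demo code.

```python
# A minimal sketch of the agent/environment loop for the maze example.
# MazeEnvironment and the random agent are hypothetical placeholders.
import random

class MazeEnvironment:
    """Toy 1-D maze: the agent starts at cell 0 and gets +1 when it reaches the cheese at cell 3."""
    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                      # state: where in the maze

    def step(self, action):
        # actions: +1 = move right, -1 = move left (a 1-D stand-in for up/down/left/right)
        self.position = max(0, self.position + action)
        reward = 1 if self.position == 3 else 0   # +1 for reaching the cheese
        done = self.position == 3
        return self.position, reward, done

env = MazeEnvironment()
state, done, total_reward = env.reset(), False, 0
for _ in range(100):                              # cap the episode length
    action = random.choice([-1, 1])               # an untrained agent just acts randomly
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode reward:", total_reward)
```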
Q-Learning Algorithm
 Start with $Q^*(s, a) = 0$ for all $s, a$
 Get initial state 𝑠
 Repeat until convergence of 𝑄∗:
 Select action 𝑎 and get immediate reward 𝑟 and next state 𝑠′
 Update Q-value and current state:
 $Q^*(s, a) \leftarrow R(s, a) + \gamma \cdot \max_{a'} Q(s', a')$
 Note: $\gamma$ (gamma) is a discount factor that ranges between 0 and 1
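A minimal sketch of the tabular algorithm above. It assumes an environment with integer states and actions whose `reset()` returns the initial state and whose `step()` returns `(next_state, reward, done)`, and it adds a learning rate α to the slide's update rule, as in the standard formulation.

```python
# Hedged sketch of tabular Q-Learning following the update rule on this slide,
# with a learning rate alpha added (standard formulation).
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = np.zeros((num_states, num_actions))            # start with Q(s, a) = 0 for all s, a
    for _ in range(episodes):
        state = env.reset()                            # get initial state s
        done = False
        while not done:
            # epsilon-greedy action selection (see the next slide)
            if np.random.rand() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)    # immediate reward r and next state s'
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
            target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state                         # update the current state
    return Q
```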
Exploration &
Exploitation
• Exploration: the process of exploring and learning more information about the environment
• Exploitation: uses known information about the environment to gain rewards more quickly
How to select actions?
• Common strategies:
• Epsilon-Greedy exploration: with probability $\varepsilon$ execute a random action, otherwise execute the best action $a^* = \arg\max_a Q(s, a)$
• In practice we need a decreasing schedule for 𝜀 during training, so that the agent explores enough at the
beginning and exploits enough as it converges.
• Boltzmann exploration: similar to a softmax distribution $P(a) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$, but with a temperature parameter $T$ that controls the spread of the distribution, such that a high value gives a more uniform distribution than a low value.
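The two strategies above as code: a decaying epsilon-greedy rule and Boltzmann (softmax) sampling over a Q-table. The table layout and the linear decay schedule are illustrative assumptions.

```python
# Hedged sketch of the action-selection strategies described above.
# Q is assumed to be a (num_states x num_actions) table with integer state/action indices.
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon so the agent explores early and exploits as it converges."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def boltzmann(Q, state, temperature=1.0):
    """Sample an action from softmax(Q(s, .) / T); high T -> nearly uniform, low T -> nearly greedy."""
    prefs = Q[state] / temperature
    prefs -= prefs.max()                              # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(probs), p=probs))
```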
• Model Based: learns transition and reward models of the environment to compute the optimal policy
• Model Free: learns an optimal policy by interacting with the environment, without learning a model of it
• Value Based: learns a value function explicitly and computes the policy from that
• Policy Based: learns a policy directly, without computing a value function
• Actor Critic: learns both a policy (the actor) and a value function (the critic), which measures how good the policy is
Deep Q-Learning (DQN)
• Q-Learning: tabular value function $Q(s, a)$; greedy policy $\pi = \arg\max_{a'} Q(s, a')$
• Deep Q-Learning: value function $Q(s, a; \theta)$ approximated by a neural network with parameters $\theta$; greedy policy $\pi = \arg\max_{a'} Q(s, a'; \theta)$
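A minimal PyTorch sketch of the idea: the Q-table becomes a network $Q(s, a; \theta)$, the greedy policy takes the argmax of its outputs, and the network is trained on the squared TD error. The network shape and hyperparameters are illustrative assumptions, not the talk's demo code.

```python
# Hedged DQN sketch: a small fully connected Q-network, the greedy policy,
# and the TD-error loss against a target network.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),            # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

def greedy_action(q_net, state):
    """pi = argmax_a' Q(s, a'; theta)"""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2.
    batch: float states/rewards/dones tensors and a long actions tensor (assumption)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
    return nn.functional.mse_loss(q_sa, target)
```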
Reinforcement Learning challenges
• The environment might be stochastic
• The model of the environment is usually
hidden or incomplete
• Actions are interdependent
• There is no supervision
• The feedback received might be partial
and/or delayed
• Partial Observability
• Actions and/or states might be continuous
Some Use Cases for RL
• Game Playing (some famous examples: Backgammon, Atari,
Go)
• Operations Research (examples: Pricing, Vehicle Routing)
• Robotic Control
• Dialog Systems
• Energy Optimization
• Resource Allocation (examples: Computation, Networking)
• Autonomous Vehicles
• Computational Finance
What does Personalizer do?
• Present the best action from a given set of input actions
• Uses Reinforcement Learning: exploit the existing model in most cases; occasionally, explore new possibilities
• Continuous model updates: update the scoring model with the training model
[Diagram: your app sends User & Context info and the candidate actions (Action 1, 2, 3 info) to Personalizer, and later reports a Reward Score.]
How it works
• Rank API: Explore / Exploit
• Reward API: Reward action
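A rough sketch of the Rank/Reward loop using the Personalizer REST API with `requests`. The endpoint, key, URL paths, and payload shapes are assumptions based on the documented v1.0 API; verify them against the official reference before use.

```python
# Hedged sketch of calling the Personalizer Rank and Reward endpoints.
import uuid
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-key>"}              # placeholder

def rank(context_features, actions):
    """Ask Personalizer which action to present for this user/context."""
    event_id = str(uuid.uuid4())
    body = {
        "eventId": event_id,
        "contextFeatures": context_features,     # e.g. [{"timeOfDay": "morning"}]
        "actions": actions,                      # e.g. [{"id": "article-1", "features": [...]}]
    }
    resp = requests.post(f"{ENDPOINT}/personalizer/v1.0/rank", headers=HEADERS, json=body)
    resp.raise_for_status()
    return event_id, resp.json()["rewardActionId"]

def reward(event_id, score):
    """Report how well the shown action performed (score between 0 and 1)."""
    resp = requests.post(
        f"{ENDPOINT}/personalizer/v1.0/events/{event_id}/reward",
        headers=HEADERS,
        json={"value": score},
    )
    resp.raise_for_status()
```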
Personalizer in Action
Xbox Home
• Personalized: type of content in the hero position, item in the secondary river
• Reward: click and engagement
• Results: +40% lift in engagement for items

MSN News
• Personalized: news content on top of the page on MSN.com or Edge DHP/NTP
• Reward: click on content in the first slot
• Results: +25% improvement in news clickthrough

Bing Ads
• Personalized: layout and location of ads
• Reward: ad clickthrough
• Results: +6% in ad clickthrough
Demo
https://aka.ms/PersonalizerCodeDemo
RL on Azure ML – What is It?
Fully managed RL service for large-scale distributed simulation and training, using the Ray/RLlib framework.
Customers create compute clusters and submit simulation/training jobs using the standard Azure ML pattern (Estimator) with the SDK & CLI.
RL algorithms come from RLlib – deep learning training uses TensorFlow by default; PyTorch is also possible.
Available in azureml-sdk 1.0.76
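A heavily hedged sketch of what job submission looks like with the preview SDK. `ReinforcementLearningEstimator`, `Ray`, and `WorkerConfiguration` are recalled from the azureml-contrib-reinforcementlearning preview package and should be treated as assumptions to check against the SDK docs and the demo linked later in the deck.

```python
# Hedged sketch: submitting an RL training job via the standard Azure ML Estimator pattern.
# Class and parameter names are assumptions from the preview RL SDK.
from azureml.core import Workspace, Experiment
from azureml.contrib.train.rl import ReinforcementLearningEstimator, Ray, WorkerConfiguration

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="rllib-cartpole")

estimator = ReinforcementLearningEstimator(
    source_directory="files",
    entry_script="cartpole_training.py",           # your RLlib training script
    compute_target="head-compute",                 # head node cluster (training)
    rl_framework=Ray(),                            # use the Ray/RLlib framework
    worker_configuration=WorkerConfiguration(      # worker cluster for rollouts/simulation
        compute_target="worker-compute",
        node_count=2,
    ),
    max_run_duration_seconds=3600,
)

run = experiment.submit(estimator)                 # standard Azure ML submit pattern
run.wait_for_completion(show_output=True)
```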
RL Jobs Requirements
Hundreds of parallel simulations.
Training can take multiple days.
Support for multiple Ray jobs.
Resilient to simulator / worker failures.
MLOps pipeline integration.
Simulators Support
OpenAI Gym.
Custom simulators with the OpenAI Gym Environment interface – local to the worker or remote in a simulator.
Windows support.
Investigating additional simulator support.
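A minimal sketch of a custom simulator wrapped in the OpenAI Gym Environment interface (the older, pre-0.26 Gym API with a 4-tuple `step` return), so RLlib can drive it like any registered env. The spaces and reward are toy placeholders.

```python
# Hedged sketch of a custom simulator behind Gym's reset/step contract.
import gym
from gym import spaces
import numpy as np

class MySimulatorEnv(gym.Env):
    """Wraps a (hypothetical) simulator as an OpenAI Gym environment."""

    def __init__(self, config=None):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        obs = np.random.uniform(-1.0, 1.0, size=4).astype(np.float32)
        reward = float(action)                    # toy reward for illustration
        done = self.steps >= 200
        return obs, reward, done, {}
```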
What is Ray?
• High-performance distributed execution
framework targeted at large-scale machine
learning and reinforcement learning applications.
• Uses a lightweight API based on dynamic task
graphs and actors to express a wide range of
applications in a flexible manner.
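A tiny example of Ray's task and actor API: decorated functions become remote tasks, decorated classes become actors, and futures are resolved with `ray.get()`.

```python
# Minimal Ray example: remote tasks, an actor, and resolving futures.
import ray

ray.init()

@ray.remote
def square(x):
    return x * x

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))                        # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))     # 1
```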
What is RLlib?
Library for Reinforcement Learning built on
top of the Ray framework.
High scalability and unified API.
Provides abstractions for common RL
components: Policy Model, Policy Evaluator,
Policy Optimizer.
Hierarchical and logically centralized control
to compose common RL components.
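A short sketch of training one of RLlib's built-in agents through Ray Tune. The config keys shown match the RLlib releases current at the time of this deck and may differ in newer versions.

```python
# Hedged sketch: train RLlib's built-in PPO agent on CartPole via Ray Tune.
import ray
from ray import tune

ray.init()

tune.run(
    "PPO",                                     # built-in RLlib agent
    config={
        "env": "CartPole-v0",                  # any Gym-registered environment
        "num_workers": 2,                      # parallel rollout workers
    },
    stop={"episode_reward_mean": 195},         # stop once CartPole is solved
)
```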
RLlib Architecture
Source: RLlib: Scalable Reinforcement Learning
RL on Azure ML – How it Works?
A data scientist submits an experiment to Azure Machine Learning, which runs a Ray cluster (a head node for training plus worker nodes hosting the workers) and a simulator cluster (simulator nodes hosting the sims), and returns the training results.
DEMO
https://aka.ms/AzureMLRayRLDemo

Editor's Notes

  • #5 Provide a basic understanding of common Reinforcement Learning concepts and approaches, and their mathematical foundations and algorithms. Understand common challenges in Reinforcement Learning and techniques to address them. Show how to code a Deep Reinforcement Learning agent from scratch, using as an example a Deep Q-Learning agent. Show a preview of the upcoming RL infrastructure on Azure ML and how to use it to train agents at scale.
  • #6 We have the Environment, on which an Agent operates by acting on commands and receiving Rewards and some State information. The goal here is to train an agent that learns to choose actions that maximize the Rewards received.
  • #7 At its highest level, an RL system has the structure depicted in this diagram: an agent and an environment exchanging states, actions, and rewards. We usually model this problem as a Markov Decision Process (MDP).
  • #8 This form of the Q-function update is known as tabular Q-Learning: tabular because we explicitly enumerate the Q-values for all state-action pairs in a table and solve the optimization problem through dynamic programming.
  • #9 Markov Decision Process (MDP) A central aspect for Q-Learning to work is a good strategy to choose actions in the environment. The idea here is that an agent needs to execute actions to explore the environment enough, in order to learn from good experiences. On the other hand, the agent also needs a good policy in order to obtain good experiences from the environment. This is known as the Exploration vs Exploitation tradeoff.
  • #10 A common strategy to balance exploration and exploitation is known as the Epsilon-Greedy exploration, where we introduce an uncertainty when choosing the best action. This is what we are going to use in our lab. In practice, we implement this with an annealing scheme for decreasing the probability to pick random actions as the model converges. There are other strategies, such as the Boltzmann exploration, which is like a softmax function with an additional parameter that controls the spread of the distribution. By varying this parameter we can also control the uncertainty in picking random actions.
  • #11 With those definitions, we can categorize RL algorithms in the following classes.
  • #12 Here we will focus on Model-free approaches, getting into the details of Value-based algorithms.
  • #13 Solving Q-Learning with neural network.
  • #15 Here are some examples of use cases that can be solved by RL.
  • #17 What is Personalizer? Personalizer implements an AI technique called Reinforcement Learning. Here's how it works. Suppose we want to display a "hero" action to the user. The user might not be sure what to do next, but we could display one of several suggestions. For a gaming app, that might be: "play a game", "watch a movie", or "join a clan". Based on that user's history and other contextual information -- say, their location, the time of day, and the day of the week -- the Personalizer service will rank the possible actions and suggest the best one to promote Hopefully, the user will be happy, but how can we be sure? That depends on what the user does next, and whether that was something we wanted them to do. According to our business logic we'll assign a "reward score" between 0 and 1 to what happens next. For example, spending more time playing a game or reading an article, or spending more money in the store, might lead to higher reward scores. Personalizer feeds that info back into the ranking system for the next time we need to feature an activity.
  • #18 You only need the Rank API and Reward API to integrate with your application
  • #19 Here’s how, in the background, the Personalizer API is built on Reinforcement Learning.
  • #20 Personalizer has been in development at Microsoft for many years. It's used on Xbox devices, to determine what activities are featured on the home page, like playing an installed game, or purchasing a new game from the store, or watching others play on Mixer. Since the introduction of Personalizer, the Xbox team has seen a significant lift in key engagement metrics. Personalizer is also used to optimize the placement of ads in Bing search, and the articles featured in MSN News, again with great results in improving engagement from users. Now you can use Personalizer in your own apps, as well.