Making smart decisions in real-time
with Reinforcement Learning
Ruth Yakubu
Sr. Cloud Advocate
@RuthieYakubu
Agenda
 Reinforcement Learning (RL) concepts
 RL approaches, challenges and
algorithms
 Q-Learning methods
 Introduction to Azure Personalizer
 Demo
 Reinforcement Learning on Azure ML
 Quick Ray/RLlib framework
 Training built-in RL agents using the
RLlib framework
 Demo
Basic Reinforcement
Learning
• Learning by experience.
• Goal: choose actions that maximize rewards
• Agent: Dog
• State: Sit / Walk
• Reward: Get a treat / No treat
• Environment: Room or Anywhere
• We have the Environment, on which an Agent operates by responding to
commands and receiving Rewards and some State information.
• Involves trial and error
• Remembers patterns that lead to success or failure.
Reinforcement learning structure
 State: where in the maze
 Action: up, down, left, right
 Reward: +1 for each cheese
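To make the state/action/reward loop concrete, here is a minimal, self-contained sketch of the agent/environment interaction for the maze example. The `MazeEnvironment` class and the random agent are hypothetical illustrations, not part of the talk's demo code.

```python
# A minimal sketch of the agent/environment loop for the maze example.
# MazeEnvironment and the random agent are hypothetical placeholders.
import random

class MazeEnvironment:
    """Toy 1-D maze: the agent starts at cell 0 and gets +1 when it reaches the cheese at cell 3."""
    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position                      # state: where in the maze

    def step(self, action):
        # actions: +1 = move right, -1 = move left (a 1-D stand-in for up/down/left/right)
        self.position = max(0, self.position + action)
        reward = 1 if self.position == 3 else 0   # +1 for reaching the cheese
        done = self.position == 3
        return self.position, reward, done

env = MazeEnvironment()
state, done, total_reward = env.reset(), False, 0
for _ in range(100):                              # cap the episode length
    action = random.choice([-1, 1])               # an untrained agent just acts randomly
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode reward:", total_reward)
```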
Q-Learning Algorithm
 Start with $Q^*(s, a) = 0$ for all $s, a$
 Get initial state 𝑠
 Repeat until convergence of 𝑄∗:
 Select action 𝑎 and get immediate reward 𝑟 and next state 𝑠′
 Update Q-value and current state:
 $Q^*(s, a) \leftarrow R(s, a) + \gamma \cdot \max_{a'} Q(s', a')$
 Note: $\gamma$ (gamma) is a discount factor that ranges between 0 and 1
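A minimal sketch of the tabular algorithm above. It assumes an environment with integer states and actions whose `reset()` returns the initial state and whose `step()` returns `(next_state, reward, done)`, and it adds a learning rate α to the slide's update rule, as in the standard formulation.

```python
# Hedged sketch of tabular Q-Learning following the update rule on this slide,
# with a learning rate alpha added (standard formulation).
import numpy as np

def q_learning(env, num_states, num_actions, episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = np.zeros((num_states, num_actions))            # start with Q(s, a) = 0 for all s, a
    for _ in range(episodes):
        state = env.reset()                            # get initial state s
        done = False
        while not done:
            # epsilon-greedy action selection (see the next slide)
            if np.random.rand() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)    # immediate reward r and next state s'
            # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
            target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state                         # update the current state
    return Q
```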
Exploration &
Exploitation
• Exploration: the process of exploring and learning more information about the environment
• Exploitation: uses known information about the environment to gain rewards more quickly
How to select actions?
• Common strategies:
• Epsilon-Greedy exploration: with probability $\varepsilon$ execute a random action, otherwise execute the best action $a^* = \arg\max_a Q(s, a)$
• In practice we need a decreasing schedule for 𝜀 during training, so that the agent explores enough at the
beginning and exploits enough as it converges.
• Boltzmann exploration: similar to a softmax distribution $P(a) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$, but with a temperature parameter $T$ that controls the spread of the distribution, such that a high value gives a more uniform distribution than a low value.
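The two strategies above as code: a decaying epsilon-greedy rule and Boltzmann (softmax) sampling over a Q-table. The table layout and the linear decay schedule are illustrative assumptions.

```python
# Hedged sketch of the action-selection strategies described above.
# Q is assumed to be a (num_states x num_actions) table with integer state/action indices.
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon so the agent explores early and exploits as it converges."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def boltzmann(Q, state, temperature=1.0):
    """Sample an action from softmax(Q(s, .) / T); high T -> nearly uniform, low T -> nearly greedy."""
    prefs = Q[state] / temperature
    prefs -= prefs.max()                              # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(probs), p=probs))
```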
• Model Based: learns transition and reward models of the environment to compute the optimal policy
• Model Free: learns an optimal policy by interacting with the environment, without learning a model of it
• Value Based: learns a value function explicitly and computes the policy from that
• Policy Based: learns a policy directly, without computing a value function
• Actor Critic: learns both a policy (the actor) and a value function (the critic), which measures how good the policy is
Deep Q-Learning (DQN)
• Q-Learning: tabular value function $Q(s, a)$; greedy policy $\pi = \arg\max_{a'} Q(s, a')$
• Deep Q-Learning: value function $Q(s, a; \theta)$ approximated by a neural network with parameters $\theta$; greedy policy $\pi = \arg\max_{a'} Q(s, a'; \theta)$
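A minimal PyTorch sketch of the idea: the Q-table becomes a network $Q(s, a; \theta)$, the greedy policy takes the argmax of its outputs, and the network is trained on the squared TD error. The network shape and hyperparameters are illustrative assumptions, not the talk's demo code.

```python
# Hedged DQN sketch: a small fully connected Q-network, the greedy policy,
# and the TD-error loss against a target network.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),            # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)

def greedy_action(q_net, state):
    """pi = argmax_a' Q(s, a'; theta)"""
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2.
    batch: float states/rewards/dones tensors and a long actions tensor (assumption)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1 - dones)
    return nn.functional.mse_loss(q_sa, target)
```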
Reinforcement Learning challenges
• The environment might be stochastic
• The model of the environment is usually
hidden or incomplete
• Actions are interdependent
• There is no supervision
• The feedback received might be partial
and/or delayed
• Partial Observability
• Actions and/or states might be continuous
Some Use Cases for RL
• Game Playing (some famous examples: Backgammon, Atari,
Go)
• Operations Research (examples: Pricing, Vehicle Routing)
• Robotic Control
• Dialog Systems
• Energy Optimization
• Resource Allocation (examples: Computation, Networking)
• Autonomous Vehicles
• Computational Finance
What does Personalizer do?
• Present the best action from a given set of input actions
• Uses Reinforcement Learning: exploit the existing model in most cases; occasionally, explore new possibilities
• Continuous model updates: update the scoring model with the training model
[Diagram: your app sends User & Context info and the candidate actions (Action 1, 2, 3 info) to Personalizer, and later reports a Reward Score.]
How it works
• Rank API: Explore / Exploit
• Reward API: Reward action
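A rough sketch of the Rank/Reward loop using the Personalizer REST API with `requests`. The endpoint, key, URL paths, and payload shapes are assumptions based on the documented v1.0 API; verify them against the official reference before use.

```python
# Hedged sketch of calling the Personalizer Rank and Reward endpoints.
import uuid
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-key>"}              # placeholder

def rank(context_features, actions):
    """Ask Personalizer which action to present for this user/context."""
    event_id = str(uuid.uuid4())
    body = {
        "eventId": event_id,
        "contextFeatures": context_features,     # e.g. [{"timeOfDay": "morning"}]
        "actions": actions,                      # e.g. [{"id": "article-1", "features": [...]}]
    }
    resp = requests.post(f"{ENDPOINT}/personalizer/v1.0/rank", headers=HEADERS, json=body)
    resp.raise_for_status()
    return event_id, resp.json()["rewardActionId"]

def reward(event_id, score):
    """Report how well the shown action performed (score between 0 and 1)."""
    resp = requests.post(
        f"{ENDPOINT}/personalizer/v1.0/events/{event_id}/reward",
        headers=HEADERS,
        json={"value": score},
    )
    resp.raise_for_status()
```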
Personalizer in Action
Xbox Home
• Personalized: type of content in the hero position, item in the secondary river
• Reward: click and engagement
• Results: +40% lift in engagement for items

MSN News
• Personalized: news content on top of the page on MSN.com or Edge DHP/NTP
• Reward: click on content in the first slot
• Results: +25% improvement in news clickthrough

Bing Ads
• Personalized: layout and location of ads
• Reward: ad clickthrough
• Results: +6% in ad clickthrough
Demo
https://aka.ms/PersonalizerCodeDemo
RL on Azure ML – What is It?
Fully managed RL service for large-scale distributed simulation and training, using the Ray/RLlib framework.
Customers create compute clusters and submit simulation/training jobs using the standard Azure ML pattern (Estimator) with the SDK & CLI.
RL algorithms come from RLlib – deep learning training uses TensorFlow by default; PyTorch is also possible.
Available in azureml-sdk 1.0.76
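A heavily hedged sketch of what job submission looks like with the preview SDK. `ReinforcementLearningEstimator`, `Ray`, and `WorkerConfiguration` are recalled from the azureml-contrib-reinforcementlearning preview package and should be treated as assumptions to check against the SDK docs and the demo linked later in the deck.

```python
# Hedged sketch: submitting an RL training job via the standard Azure ML Estimator pattern.
# Class and parameter names are assumptions from the preview RL SDK.
from azureml.core import Workspace, Experiment
from azureml.contrib.train.rl import ReinforcementLearningEstimator, Ray, WorkerConfiguration

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="rllib-cartpole")

estimator = ReinforcementLearningEstimator(
    source_directory="files",
    entry_script="cartpole_training.py",           # your RLlib training script
    compute_target="head-compute",                 # head node cluster (training)
    rl_framework=Ray(),                            # use the Ray/RLlib framework
    worker_configuration=WorkerConfiguration(      # worker cluster for rollouts/simulation
        compute_target="worker-compute",
        node_count=2,
    ),
    max_run_duration_seconds=3600,
)

run = experiment.submit(estimator)                 # standard Azure ML submit pattern
run.wait_for_completion(show_output=True)
```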
RL Jobs Requirements
Hundreds of parallel simulations.
Training can take multiple days.
Support for multiple Ray jobs.
Resilient to simulator / worker failures.
MLOps pipeline integration.
Simulators Support
OpenAI Gym.
Custom simulators with the OpenAI Gym Environment interface – local to the worker or remote in a simulator.
Windows support.
Investigating additional simulator support.
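A minimal sketch of a custom simulator wrapped in the OpenAI Gym Environment interface (the older, pre-0.26 Gym API with a 4-tuple `step` return), so RLlib can drive it like any registered env. The spaces and reward are toy placeholders.

```python
# Hedged sketch of a custom simulator behind Gym's reset/step contract.
import gym
from gym import spaces
import numpy as np

class MySimulatorEnv(gym.Env):
    """Wraps a (hypothetical) simulator as an OpenAI Gym environment."""

    def __init__(self, config=None):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        obs = np.random.uniform(-1.0, 1.0, size=4).astype(np.float32)
        reward = float(action)                    # toy reward for illustration
        done = self.steps >= 200
        return obs, reward, done, {}
```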
What is Ray?
• High-performance distributed execution
framework targeted at large-scale machine
learning and reinforcement learning applications.
• Uses a lightweight API based on dynamic task
graphs and actors to express a wide range of
applications in a flexible manner.
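A tiny example of Ray's task and actor API: decorated functions become remote tasks, decorated classes become actors, and futures are resolved with `ray.get()`.

```python
# Minimal Ray example: remote tasks, an actor, and resolving futures.
import ray

ray.init()

@ray.remote
def square(x):
    return x * x

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))                        # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))     # 1
```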
What is RLlib?
Library for Reinforcement Learning built on
top of the Ray framework.
High scalability and unified API.
Provides abstractions for common RL
components: Policy Model, Policy Evaluator,
Policy Optimizer.
Hierarchical and logically centralized control
to compose common RL components.
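A short sketch of training one of RLlib's built-in agents through Ray Tune. The config keys shown match the RLlib releases current at the time of this deck and may differ in newer versions.

```python
# Hedged sketch: train RLlib's built-in PPO agent on CartPole via Ray Tune.
import ray
from ray import tune

ray.init()

tune.run(
    "PPO",                                     # built-in RLlib agent
    config={
        "env": "CartPole-v0",                  # any Gym-registered environment
        "num_workers": 2,                      # parallel rollout workers
    },
    stop={"episode_reward_mean": 195},         # stop once CartPole is solved
)
```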
RLlib Architecture
Source: RLlib: Scalable Reinforcement Learning
RL on Azure ML – How it Works?
A data scientist submits an experiment to Azure Machine Learning, which runs a Ray cluster (a head node for training plus worker nodes hosting the workers) and a simulator cluster (simulator nodes hosting the sims), and returns the training results.
DEMO
https://aka.ms/AzureMLRayRLDemo

Editor's Notes

  • #5 Provide a basic understanding of common Reinforcement Learning concepts and approaches, and their mathematical foundations and algorithms. Understand common challenges in Reinforcement Learning and techniques to address them. Show how to code a Deep Reinforcement Learning agent from scratch, using as an example a Deep Q-Learning agent. Show a preview of the upcoming RL infrastructure on Azure ML and how to use it to train agents at scale.
  • #6 We have the Environment, on which an Agent operates by acting on commands and receiving Rewards and some State information. The goal here is to train an agent that learns to choose actions that maximize the Rewards received.
  • #7 At its highest level, an RL system has the structure depicted in this diagram: an agent and an environment exchanging states, actions, and rewards. We usually model this problem as a Markov Decision Process (MDP).
  • #8 This form of the Q-function update is known as tabular Q-Learning: tabular because we explicitly enumerate the Q-values for all state-action pairs in a table and solve the optimization problem through dynamic programming.
  • #9 Markov Decision Process (MDP) A central aspect for Q-Learning to work is a good strategy to choose actions in the environment. The idea here is that an agent needs to execute actions to explore the environment enough, in order to learn from good experiences. On the other hand, the agent also needs a good policy in order to obtain good experiences from the environment. This is known as the Exploration vs Exploitation tradeoff.
  • #10 A common strategy to balance exploration and exploitation is known as the Epsilon-Greedy exploration, where we introduce an uncertainty when choosing the best action. This is what we are going to use in our lab. In practice, we implement this with an annealing scheme for decreasing the probability to pick random actions as the model converges. There are other strategies, such as the Boltzmann exploration, which is like a softmax function with an additional parameter that controls the spread of the distribution. By varying this parameter we can also control the uncertainty in picking random actions.
  • #11 With those definitions, we can categorize RL algorithms in the following classes.
  • #12 Here we will focus on Model-free approaches, getting into the details of Value-based algorithms.
  • #13 Solving Q-Learning with neural network.
  • #15 Here are some examples of use cases that can be solved by RL.
  • #17 What is Personalizer? Personalizer implements an AI technique called Reinforcement Learning. Here's how it works. Suppose we want to display a "hero" action to the user. The user might not be sure what to do next, but we could display one of several suggestions. For a gaming app, that might be: "play a game", "watch a movie", or "join a clan". Based on that user's history and other contextual information -- say, their location, the time of day, and the day of the week -- the Personalizer service will rank the possible actions and suggest the best one to promote Hopefully, the user will be happy, but how can we be sure? That depends on what the user does next, and whether that was something we wanted them to do. According to our business logic we'll assign a "reward score" between 0 and 1 to what happens next. For example, spending more time playing a game or reading an article, or spending more money in the store, might lead to higher reward scores. Personalizer feeds that info back into the ranking system for the next time we need to feature an activity.
  • #18 You only need the Rank API and Reward API to integrate with your application
  • #19 Here’s how, in the background, the Personalizer API is built on Reinforcement Learning.
  • #20 Personalizer has been in development at Microsoft for many years. It's used on Xbox devices, to determine what activities are featured on the home page, like playing an installed game, or purchasing a new game from the store, or watching others play on Mixer. Since the introduction of Personalizer, the Xbox team has seen a significant lift in key engagement metrics. Personalizer is also used to optimize the placement of ads in Bing search, and the articles featured in MSN News, again with great results in improving engagement from users. Now you can use Personalizer in your own apps, as well.