Thom Lane
14th December 2018
Reinforcement Learning
DQN and PPO on SageMaker RL
Agenda
Reinforcement Learning Overview
SageMaker RL Overview
CartPole with DQN and Coach
Q-learning and Deep Q Network
Roboschool Hopper with PPO and Ray
Policy Gradients and Proximal Policy Optimization
Mountain Car with PPO and Coach
What is Reinforcement Learning?
Agent interacts with the environment
[Diagram: loop between Agent and Environment. The Agent sends Actions to the Environment; the Environment returns Observations and Rewards to the Agent.]
AWS DeepRacer interacts with the racetrack
Challenges
Unstable training
Training data isn't static: the current training data comes from the current policy!
Sparse rewards
High sensitivity to hyperparameters
Broken RL code almost always fails silently
Exploration vs exploitation
Environments
Closer look at the environment
Many kinds of observation spaces
Continuous or discrete
Could be vector, audio, image, video, etc.
Observation != State
Observation: e.g. image from camera
State: e.g. momentum (physical dynamics)
Many kinds of action spaces
Continuous or discrete
!" !# !$
%" %# %$
&" &# &$
Markov Decision Process (MDP)
[Diagram: example MDP as a tree of Stay/Go decisions. Transitions can be stochastic (e.g. 80% / 20%), and rewards are attached to transitions (e.g. +10, +5, +1, -10, -20, -100).]
Modeling the MDP
Usually we don’t have the MDP in reality!
“Model-based” RL is when the following is given or learnt:
1) Transition function: e.g. P(s_{t+1} | s_t, a_t)
2) Reward function: e.g. P(r_{t+1} | s_t, a_t)
And then use planning methods to decide on actions.
“Model-free” RL doesn’t have these requirements.
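As a small illustration of the transition and reward functions above, a tabular MDP can be written down explicitly; this two-state example is hypothetical, not from the deck. Model-based RL is given (or learns) these tables; model-free RL can only sample from them:

```python
import random

# P[(state, action)] -> list of (next_state, probability)
P = {
    ("top", "stay"): [("top", 1.0)],
    ("top", "go"):   [("bottom", 0.8), ("top", 0.2)],  # stochastic transition
}

# R[(state, action, next_state)] -> reward
R = {
    ("top", "stay", "top"):  +1,
    ("top", "go", "bottom"): +10,
    ("top", "go", "top"):    -10,
}

def step(state, action):
    """Sample one transition from the MDP, as an environment would."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[(state, action, next_state)]

print(step("top", "go"))
```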
Collecting trajectories
!"#" $" !%#% $% !&#& $& '
!"#" $" !(#( $( '
!"#" $" !%#% $% '
Episode
Transition or Step
Done
Example Environment: OpenAI Gym
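A minimal interaction loop with a Gym environment looks like this; a sketch using the 2018-era Gym API with CartPole-v0 as an illustrative choice, not code from the slides:

```python
import gym

# Create the environment and run one episode with random actions.
env = gym.make("CartPole-v0")
observation = env.reset()  # initial observation

done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()                   # random policy for illustration
    observation, reward, done, info = env.step(action)   # one transition/step
    episode_return += reward

print("Episode return:", episode_return)
```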
Agents
Rewards
Get a reward r_t from the environment after action a_t.
Can be positive, negative or equal to 0.
Return G_t is the cumulative discounted future reward:
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ⋯ = Σ_{k=0}^{∞} γᵏ·r_{t+k}
e.g. with γ = 0.9: G_t = 1·(−1) + 0.9·(−1) + 0.81·(5) = 2.15
The objective of the RL agent is to maximize the expected return.
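The same calculation in a few lines of Python; the rewards and γ = 0.9 match the example above:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The example from the slide: rewards of -1, -1, then 5, discounted with gamma = 0.9.
print(discounted_return([-1, -1, 5]))  # 1*(-1) + 0.9*(-1) + 0.81*5 = 2.15
```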
Agent/Algorithm Taxonomy
Source: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
What are we trying to learn?
With Q-Learning…
We learn the value of taking an action from a given state.
‘Q-Value’ is the expected return after taking the action.
We’ll use Deep Q Network as an example.
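For context, the tabular Q-learning update that DQN generalizes with a neural network, as a minimal sketch (the learning rate and discount values are illustrative, not from the deck):

```python
alpha, gamma = 0.1, 0.9  # illustrative learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    """Move Q[s][a] toward the bootstrapped target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Q-table: one list of action values per state (DQN replaces this table with a network).
Q = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
q_update(Q, "s0", 1, 1.0, "s1")
print(Q)
```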
With Policy Optimization…
We learn the action to take from a given state (i.e. observations).
We call the model for this the policy, denoted π(a|s).
We’ll use Proximal Policy Optimization as an example.
We often learn the value of being in a given state too (i.e. Actor Critic)
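For reference, the clipped surrogate objective at the heart of PPO, sketched for a single sample (ε = 0.2 is the value used in the PPO paper):

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = pi_new(a|s) / pi_old(a|s) and A is the advantage estimate."""
    clipped_ratio = max(1 - clip_eps, min(ratio, 1 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_objective(ratio=1.5, advantage=2.0))  # clipping caps the update at 2.4
```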
What are we trying to learn?
[Diagram: three network architectures. Q-Learning: Observation → Q-values. Policy Optimization: Observation → Action probabilities. Actor Critic: Observation → Action probabilities and State value.]
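A rough sketch of those three heads in code; this is illustrative PyTorch, not the networks that Coach or Ray actually build:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with two heads: action probabilities (actor) and state value (critic).
    Dropping the value head gives a pure policy network; swapping the softmax head for
    one output per action gives a Q-network."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # -> action logits
        self.value_head = nn.Linear(hidden, 1)           # -> state value

    def forward(self, obs):
        h = self.body(obs)
        action_probs = torch.softmax(self.policy_head(h), dim=-1)
        state_value = self.value_head(h)
        return action_probs, state_value

probs, value = ActorCritic()(torch.zeros(1, 4))
```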
Exploration vs Exploitation
Greedy: always take the greedy action.
ε-greedy: with probability ε take a random action; otherwise take the greedy action.
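A minimal ε-greedy action selection sketch (the q_values list is a hypothetical input):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit: argmax

action = epsilon_greedy([0.2, 0.8, 0.5])
```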
Exploration Policies
Action Space Noise
Parameter Space Noise
Ornstein-Uhlenbeck Process
Entropy Loss
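For example, the Ornstein-Uhlenbeck process produces mean-reverting, temporally correlated noise for continuous actions; a minimal sketch (the θ and σ values are illustrative, not from the deck):

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise, often added to
    continuous actions for exploration (e.g. in DDPG)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * (self.dt ** 0.5) * random.gauss(0, 1)
        return self.x

noise = OUNoise()
noisy_action = 0.5 + noise.sample()
```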
Exploration Policies: Curiosity
Intrinsic Curiosity Module
[Video stills: agent without curiosity (i.e. random actions) vs. agent with curiosity]
Source: https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/
Agent interacts with the environment
[Diagram: the same Agent-Environment loop (Actions, Observations, Rewards), with the Agent broken down into components: Neural Networks, Exploration Policies, Algorithm, and Reward Aggregation.]
SageMaker RL
Applicable in many domains (not just gaming)
Robotics, Industrial control, HVAC, Autonomous vehicles, NLP, Operations, Finance, Resource allocation, Advertising, Online content delivery
But RL models can be tricky to train
Difficult to get started: RL agent algorithms are complex to implement
Hard to integrate environments for training
Training is computationally expensive and time consuming
Requires trial and error and frequent tuning of hyperparameters
SageMaker RL: Environments
OpenAI Gym (CartPole, Mountain Car, Atari)
OpenAI Roboschool
EnergyPlus
AWS RoboMaker
Simulink
Amazon Sumerian
Custom Environments
Travelling Salesman
Portfolio Management
Auto-scaling
Model Compression
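Custom environments just need to implement the Gym interface; a minimal skeleton (a hypothetical example, not one of the environments listed above):

```python
import gym
import numpy as np
from gym import spaces

class MyCustomEnv(gym.Env):
    """Skeleton of a custom Gym environment usable with SageMaker RL."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)

    def reset(self):
        self.steps = 0
        return np.zeros(4, dtype=np.float32)  # initial observation

    def step(self, action):
        self.steps += 1
        observation = np.random.uniform(-1, 1, size=4).astype(np.float32)
        reward = 1.0 if action == 1 else 0.0  # placeholder reward logic
        done = self.steps >= 100
        return observation, reward, done, {}
```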
SageMaker RL: Toolkits & Agents
Intel Coach, Ray RLlib, OpenAI Baselines, Stable Baselines
SageMaker RL: Code
notebook.ipynb
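Roughly what the notebook does: launch a managed training job with an RLEstimator. This is a sketch based on the public SageMaker RL samples; the toolkit version, instance type, and IAM role are assumptions:

```python
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

# Launch a Coach training job on a managed instance (illustrative values).
estimator = RLEstimator(
    entry_point="train-coach.py",              # the launcher script shown next
    toolkit=RLToolkit.COACH,
    toolkit_version="0.11.0",                  # assumed version
    framework=RLFramework.MXNET,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    hyperparameters={"RLCOACH_PRESET": "preset-cartpole-clippedppo"},
)
estimator.fit()
```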
SageMaker RL: Code
train-coach.py
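train-coach.py is typically a thin launcher around Coach's preset mechanism. A sketch based on the AWS sample repositories; treat the class and method names as assumptions:

```python
from sagemaker_rl.coach_launcher import SageMakerCoachPresetLauncher

class MyLauncher(SageMakerCoachPresetLauncher):
    def default_preset_name(self):
        # Preset file (without .py) defining the agent, environment, and schedule.
        return "preset-cartpole-clippedppo"

if __name__ == "__main__":
    MyLauncher.train_main()
```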
SageMaker RL: Code
preset-cartpole-clippedppo.py
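The preset file wires together the ClippedPPO agent, the Gym environment, and a training schedule. A sketch using Coach 0.11-era imports; the schedule values are illustrative:

```python
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.core_types import EnvironmentSteps, EnvironmentEpisodes, TrainingSteps

# Agent: Clipped PPO with default parameters.
agent_params = ClippedPPOAgentParameters()

# Environment: classic CartPole from Gym.
env_params = GymVectorEnvironment(level="CartPole-v0")

# Schedule: how long to heat up, train, and evaluate (illustrative values).
schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(1000)
schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)
schedule_params.evaluation_steps = EnvironmentEpisodes(1)
schedule_params.heatup_steps = EnvironmentSteps(0)

graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)
```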
Customers are using Amazon SageMaker RL
Thanks!

Reinforcement Learning with Amazon SageMaker RL