Thom Lane
14th December 2018
Reinforcement Learning
DQN and PPO on SageMaker RL
Agenda
Reinforcement Learning Overview
SageMaker RL Overview
CartPole with DQN and Coach
Q-learning and Deep Q Network
Roboschool Hopper with PPO and Ray
Policy Gradients and Proximal Policy Optimization
Mountain Car with PPO and Coach
What is Reinforcement Learning?
Agent interacts with the environment
[Diagram: loop between Agent and Environment. The Agent sends Actions to the Environment; the Environment returns Observations and Rewards to the Agent.]
AWS DeepRacer interacts with the racetrack
Challenges
Unstable training
Training data isn't static: the current training data comes from the current policy!
Sparse rewards
High sensitivity to hyperparameters
Broken RL code almost always fails silently
Exploration vs exploitation
Environments
Closer look at the environment
Many kinds of observation spaces
Continuous or discrete
Could be vector, audio, image, video, etc.
Observation != State
Observation: e.g. image from camera
State: e.g. momentum (physical dynamics)
Many kinds of action spaces
Continuous or discrete
!" !# !$
%" %# %$
&" &# &$
Markov Decision Process (MDP)
[Diagram: example MDP as a tree of Stay/Go decisions. Transitions can be stochastic (e.g. 80% / 20%), and rewards are attached to transitions (e.g. +10, +5, +1, -10, -20, -100).]
Modeling the MDP
Usually we don’t have the MDP in reality!
“Model-based” RL is when the following is given or learnt:
1) Transition function: e.g. P(s_{t+1} | s_t, a_t)
2) Reward function: e.g. P(r_{t+1} | s_t, a_t)
And then use planning methods to decide on actions.
“Model-free” RL doesn’t have these requirements.
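As a small illustration of the transition and reward functions above, a tabular MDP can be written down explicitly; this two-state example is hypothetical, not from the deck. Model-based RL is given (or learns) these tables; model-free RL can only sample from them:

```python
import random

# P[(state, action)] -> list of (next_state, probability)
P = {
    ("top", "stay"): [("top", 1.0)],
    ("top", "go"):   [("bottom", 0.8), ("top", 0.2)],  # stochastic transition
}

# R[(state, action, next_state)] -> reward
R = {
    ("top", "stay", "top"):  +1,
    ("top", "go", "bottom"): +10,
    ("top", "go", "top"):    -10,
}

def step(state, action):
    """Sample one transition from the MDP, as an environment would."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[(state, action, next_state)]

print(step("top", "go"))
```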
Collecting trajectories
!"#" $" !%#% $% !&#& $& '
!"#" $" !(#( $( '
!"#" $" !%#% $% '
Episode
Transition or Step
Done
Example Environment: OpenAI Gym
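A minimal interaction loop with a Gym environment looks like this; a sketch using the 2018-era Gym API with CartPole-v0 as an illustrative choice, not code from the slides:

```python
import gym

# Create the environment and run one episode with random actions.
env = gym.make("CartPole-v0")
observation = env.reset()  # initial observation

done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()                   # random policy for illustration
    observation, reward, done, info = env.step(action)   # one transition/step
    episode_return += reward

print("Episode return:", episode_return)
```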
Agents
Rewards
Get a reward r_t from the environment after action a_t.
Can be positive, negative or equal to 0.
Return G_t is the cumulative discounted future reward:
G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ⋯ = Σ_{k=0}^{∞} γᵏ·r_{t+k}
e.g. with γ = 0.9: G_t = 1·(−1) + 0.9·(−1) + 0.81·(5) = 2.15
The objective of the RL agent is to maximize the expected return.
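The same calculation in a few lines of Python; the rewards and γ = 0.9 match the example above:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The example from the slide: rewards of -1, -1, then 5, discounted with gamma = 0.9.
print(discounted_return([-1, -1, 5]))  # 1*(-1) + 0.9*(-1) + 0.81*5 = 2.15
```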
Agent/Algorithm Taxonomy
Source: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
What are we trying to learn?
With Q-Learning…
We learn the value of taking an action from a given state.
‘Q-Value’ is the expected return after taking the action.
We’ll use Deep Q Network as an example.
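For context, the tabular Q-learning update that DQN generalizes with a neural network, as a minimal sketch (the learning rate and discount values are illustrative, not from the deck):

```python
alpha, gamma = 0.1, 0.9  # illustrative learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    """Move Q[s][a] toward the bootstrapped target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Q-table: one list of action values per state (DQN replaces this table with a network).
Q = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
q_update(Q, "s0", 1, 1.0, "s1")
print(Q)
```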
With Policy Optimization…
We learn the action to take from a given state (i.e. observations).
We call the model for this the policy, denoted π(a|s).
We’ll use Proximal Policy Optimization as an example.
We often learn the value of being in a given state too (i.e. Actor Critic)
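For reference, the clipped surrogate objective at the heart of PPO, sketched for a single sample (ε = 0.2 is the value used in the PPO paper):

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r = pi_new(a|s) / pi_old(a|s) and A is the advantage estimate."""
    clipped_ratio = max(1 - clip_eps, min(ratio, 1 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_objective(ratio=1.5, advantage=2.0))  # clipping caps the update at 2.4
```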
What are we trying to learn?
[Diagram: three network architectures. Q-Learning: Observation → Q-values. Policy Optimization: Observation → Action probabilities. Actor Critic: Observation → Action probabilities and State value.]
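A rough sketch of those three heads in code; this is illustrative PyTorch, not the networks that Coach or Ray actually build:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with two heads: action probabilities (actor) and state value (critic).
    Dropping the value head gives a pure policy network; swapping the softmax head for
    one output per action gives a Q-network."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # -> action logits
        self.value_head = nn.Linear(hidden, 1)           # -> state value

    def forward(self, obs):
        h = self.body(obs)
        action_probs = torch.softmax(self.policy_head(h), dim=-1)
        state_value = self.value_head(h)
        return action_probs, state_value

probs, value = ActorCritic()(torch.zeros(1, 4))
```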
Exploration vs Exploitation
Greedy: always take the greedy action.
ε-greedy: with probability ε take a random action; otherwise take the greedy action.
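A minimal ε-greedy action selection sketch (the q_values list is a hypothetical input):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit: argmax

action = epsilon_greedy([0.2, 0.8, 0.5])
```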
Exploration Policies
Action Space Noise
Parameter Space Noise
Ornstein-Uhlenbeck Process
Entropy Loss
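For example, the Ornstein-Uhlenbeck process produces mean-reverting, temporally correlated noise for continuous actions; a minimal sketch (the θ and σ values are illustrative, not from the deck):

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise, often added to
    continuous actions for exploration (e.g. in DDPG)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * (self.dt ** 0.5) * random.gauss(0, 1)
        return self.x

noise = OUNoise()
noisy_action = 0.5 + noise.sample()
```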
Exploration Policies: Curiosity
Intrinsic Curiosity Module
[Video stills: agent without curiosity (i.e. random actions) vs. agent with curiosity]
Source: https://blogs.unity3d.com/2018/06/26/solving-sparse-reward-tasks-with-curiosity/
Agent interacts with the environment
[Diagram: the same Agent-Environment loop (Actions, Observations, Rewards), with the Agent broken down into components: Neural Networks, Exploration Policies, Algorithm, and Reward Aggregation.]
SageMaker RL
Applicable in many domains (not just gaming)
Robotics, Industrial control, HVAC, Autonomous vehicles, NLP, Operations, Finance, Resource allocation, Advertising, Online content delivery
But RL models can be tricky to train
Difficult to get started: RL agent algorithms are complex to implement
Hard to integrate environments for training
Training is computationally expensive and time consuming
Requires trial and error and frequent tuning of hyperparameters
SageMaker RL: Environments
OpenAI Gym (CartPole, Mountain Car, Atari)
OpenAI Roboschool
EnergyPlus
AWS RoboMaker
Simulink
Amazon Sumerian
Custom Environments
Travelling Salesman
Portfolio Management
Auto-scaling
Model Compression
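Custom environments just need to implement the Gym interface; a minimal skeleton (a hypothetical example, not one of the environments listed above):

```python
import gym
import numpy as np
from gym import spaces

class MyCustomEnv(gym.Env):
    """Skeleton of a custom Gym environment usable with SageMaker RL."""

    def __init__(self):
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)

    def reset(self):
        self.steps = 0
        return np.zeros(4, dtype=np.float32)  # initial observation

    def step(self, action):
        self.steps += 1
        observation = np.random.uniform(-1, 1, size=4).astype(np.float32)
        reward = 1.0 if action == 1 else 0.0  # placeholder reward logic
        done = self.steps >= 100
        return observation, reward, done, {}
```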
SageMaker RL: Toolkits & Agents
Intel Coach, Ray RLlib, OpenAI Baselines, Stable Baselines
SageMaker RL: Code
notebook.ipynb
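Roughly what the notebook does: launch a managed training job with an RLEstimator. This is a sketch based on the public SageMaker RL samples; the toolkit version, instance type, and IAM role are assumptions:

```python
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

# Launch a Coach training job on a managed instance (illustrative values).
estimator = RLEstimator(
    entry_point="train-coach.py",              # the launcher script shown next
    toolkit=RLToolkit.COACH,
    toolkit_version="0.11.0",                  # assumed version
    framework=RLFramework.MXNET,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    hyperparameters={"RLCOACH_PRESET": "preset-cartpole-clippedppo"},
)
estimator.fit()
```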
SageMaker RL: Code
train-coach.py
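train-coach.py is typically a thin launcher around Coach's preset mechanism. A sketch based on the AWS sample repositories; treat the class and method names as assumptions:

```python
from sagemaker_rl.coach_launcher import SageMakerCoachPresetLauncher

class MyLauncher(SageMakerCoachPresetLauncher):
    def default_preset_name(self):
        # Preset file (without .py) defining the agent, environment, and schedule.
        return "preset-cartpole-clippedppo"

if __name__ == "__main__":
    MyLauncher.train_main()
```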
SageMaker RL: Code
preset-cartpole-clippedppo.py
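The preset file wires together the ClippedPPO agent, the Gym environment, and a training schedule. A sketch using Coach 0.11-era imports; the schedule values are illustrative:

```python
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import ScheduleParameters
from rl_coach.core_types import EnvironmentSteps, EnvironmentEpisodes, TrainingSteps

# Agent: Clipped PPO with default parameters.
agent_params = ClippedPPOAgentParameters()

# Environment: classic CartPole from Gym.
env_params = GymVectorEnvironment(level="CartPole-v0")

# Schedule: how long to heat up, train, and evaluate (illustrative values).
schedule_params = ScheduleParameters()
schedule_params.improve_steps = TrainingSteps(1000)
schedule_params.steps_between_evaluation_periods = EnvironmentEpisodes(10)
schedule_params.evaluation_steps = EnvironmentEpisodes(1)
schedule_params.heatup_steps = EnvironmentSteps(0)

graph_manager = BasicRLGraphManager(agent_params=agent_params,
                                    env_params=env_params,
                                    schedule_params=schedule_params)
```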
Customers are using Amazon SageMaker RL
Thanks!

Reinforcement Learning with Amazon SageMaker RL