2. Agenda
Reinforcement Learning Overview
SageMaker RL Overview
CartPole with DQN and Coach
Q-learning and Deep Q Network
Roboschool Hopper with PPO and Ray
Policy Gradients and Proximal Policy Optimization
Mountain Car with PPO and Coach
6. Challenges
Unstable training
Training data isn’t static
Current training data comes from the current policy!
Sparse rewards
High sensitivity to hyperparameters
Broken RL code almost always fails silently
Exploration vs. exploitation
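The exploration-vs-exploitation trade-off is often handled with an epsilon-greedy rule: act randomly with probability epsilon, otherwise pick the best-known action. A minimal sketch (the decay schedule and floor value are illustrative, not from the slides):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Pick a random action with probability epsilon (explore),
    otherwise the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Typical schedule: explore a lot early, exploit more later.
epsilon = 1.0
for step in range(1000):
    epsilon = max(0.05, epsilon * 0.995)  # decay toward a small floor
```

With epsilon at its floor, the agent still explores occasionally, which helps with the sparse-reward and non-static-data issues above.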
8. Closer look at the environment
Many kinds of observation spaces
Continuous or discrete
Could be vector, audio, image, video, etc.
Observation != State
Observation: e.g. image from camera
State: e.g. momentum (physical dynamics)
Many kinds of action spaces
Continuous or discrete
!" !# !$
%" %# %$
&" &# &$
9. Markov Decision Process (MDP)
[Diagram: MDP decision tree with actions Stay/Go at each state, transition probabilities 80%/20%, and rewards +10, +5, +1, -10, -20, -100.]
10. Modeling the MDP
Usually we don’t have the MDP in reality!
“Model-based” RL is when the following is given or learnt:
1) Transition function: e.g. P(s_{t+1} | s_t, a_t)
2) Reward function: e.g. P(r_{t+1} | s_t, a_t)
And then use planning methods to decide on actions.
“Model-free” RL doesn’t have these requirements.
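When the transition and reward functions are known, planning methods like value iteration can compute a policy directly. A tiny model-based sketch (the two-state MDP, its probabilities, and rewards are invented for illustration):

```python
# P[s][a] = list of (next_state, probability); R[s][a] = expected reward.
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 5.0, "go": -1.0},
}
gamma = 0.9  # discount factor

# Value iteration: repeatedly back up the best one-step return.
V = {0: 0.0, 1: 0.0}
for _ in range(200):
    V = {
        s: max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
            for a in P[s]
        )
        for s in P
    }

# Greedy policy with respect to the converged values.
policy = {
    s: max(
        P[s],
        key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]),
    )
    for s in P
}
```

Model-free methods (Q-learning, policy gradients) skip this step and learn from sampled transitions instead.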
16. What are we trying to learn?
With Q-Learning…
We learn the value of taking an action from a given state.
‘Q-Value’ is the expected return after taking the action.
We’ll use Deep Q Network as an example.
With Policy Optimization…
We learn the action to take from a given state (i.e. observations).
We call the model for this the policy, denoted π_θ(a|s).
We’ll use Proximal Policy Optimization as an example.
We often learn the value of being in a given state too (i.e. Actor Critic)
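The Q-learning idea can be sketched in its tabular form; DQN replaces the table with a neural network but uses the same bootstrapped target. The state/action numbering below is illustrative:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q[s][a] toward the target
    r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

n_actions = 2
Q = defaultdict(lambda: [0.0] * n_actions)  # Q[s] = value per action

# One sampled transition (s=0, a=1, r=1.0, s'=1):
q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Acting greedily with respect to the learned Q-values recovers a policy, which is why Q-learning learns values rather than actions directly.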
17. What are we trying to learn?
Q-Learning: Observation → Q-values
Policy Optimization: Observation → Action probabilities
Actor Critic: Observation → Action probabilities + State value
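The three architectures differ only in their output heads on top of a shared feature extractor. A NumPy sketch of the output shapes (layer sizes and weights are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, hidden = 4, 2, 16

def mlp(obs, W1, W_head):
    """One ReLU hidden layer followed by a linear output head."""
    return np.maximum(obs @ W1, 0.0) @ W_head

W1 = rng.normal(size=(obs_dim, hidden))        # shared body
q_head = rng.normal(size=(hidden, n_actions))  # Q-learning head
pi_head = rng.normal(size=(hidden, n_actions)) # policy head (logits)
v_head = rng.normal(size=(hidden, 1))          # critic head

obs = rng.normal(size=(obs_dim,))
q_values = mlp(obs, W1, q_head)                # one Q-value per action
logits = mlp(obs, W1, pi_head)
probs = np.exp(logits) / np.exp(logits).sum()  # action probabilities
state_value = mlp(obs, W1, v_head)             # single scalar value
```

An actor-critic model simply carries both the policy head and the value head on the same body.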
23. Applicable in many domains (not just gaming)
Robotics
Industrial control
HVAC
Autonomous vehicles
NLP
Operations
Finance
Resource allocation
Advertising
Online content delivery
24. But RL models can be tricky to train
Difficult to get started
RL agent algorithms are complex to implement
Hard to integrate environments for training
Training is computationally expensive and time consuming
Requires trial and error and frequent tuning of hyperparameters
25. SageMaker RL: Environments
OpenAI Gym (CartPole, Mountain Car, Atari)
OpenAI Roboschool
EnergyPlus
AWS RoboMaker
Simulink
Amazon Sumerian
Custom Environments
Travelling Salesman
Portfolio Management
Auto-scaling
Model Compression
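Custom environments like the ones above just need to follow the Gym-style `reset`/`step` interface. A toy auto-scaling environment as a sketch (the dynamics and reward here are invented; a real one would model actual load traces):

```python
import random

class AutoScalingEnv:
    """Toy custom environment following the OpenAI Gym interface:
    reset() -> observation; step(action) -> (obs, reward, done, info)."""

    def __init__(self, max_steps=50):
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.servers = 1
        self.load = random.randint(1, 5)
        return (self.servers, self.load)

    def step(self, action):
        # action: -1 remove a server, 0 hold, +1 add a server.
        self.servers = max(1, self.servers + action)
        self.load = max(0, self.load + random.randint(-1, 1))
        # Penalize both unmet load and idle capacity.
        reward = -abs(self.servers - self.load)
        self.t += 1
        done = self.t >= self.max_steps
        return (self.servers, self.load), reward, done, {}
```

Because the interface matches Gym, the same agent code (DQN, PPO) can train against it unchanged.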