Player 1 vs. Player 2
• Model: Deep Q Network; Approximator: Convolutional Neural Network; Policy: Epsilon-Greedy
• Model: Original Deterministic Pong; Approximator: Hard-Coded; Policy: Greedy
Intro to Reinforcement Learning
Reinforcement Learning vs. Other Types of ML
• Classification / Regression
• Clustering / Dimensionality Reduction
• Optimal Decision Sequences
Specific Traits of RL
• Trial and Error
• Maximizes Rewards
• Optimizes Policies
• Maps States to Actions
• Optimal Decision Sequences
Markov Decision Process
[State-transition diagram: states linked by actions with transition probabilities such as P = 0.3, 0.4, 0.7, and 1.0]
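An MDP like the one in the diagram can be written down directly as a table of transition probabilities. A minimal Python sketch (the states, actions, and rewards below are invented for illustration; only the probability values echo the diagram). Gym's toy-text environments expose a similar structure through their env.P attribute.

# Illustrative MDP: transition probabilities P(s' | s, a) as a nested dict.
# States, actions, and rewards here are invented for illustration.
mdp = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],                    # (probability, next state, reward)
        "right": [(0.7, "s1", 0.0), (0.3, "s0", 0.0)],
    },
    "s1": {
        "left":  [(0.4, "s0", 0.0), (0.6, "s1", 0.0)],
        "right": [(1.0, "s2", 1.0)],                    # reaching s2 pays a reward of 1
    },
}

# Sanity check: for every (state, action), outgoing probabilities sum to 1.
for state, actions in mdp.items():
    for action, transitions in actions.items():
        assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9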
How RL Works
• The environment provides states
• The environment provides feedback (rewards)
• The agent learns to maximize rewards
Exploitation vs. Exploration
Exploitation:
• Best Known Strategy ("Greedy")
• Highest Known Value
• Can't Learn New Skills
Exploration:
• Random Decisions (Not "Greedy")
• Most Information Gained
• Can't Use Learned Skills
Exploitation vs. Exploration
Primary Technique — Exploration Policy:
• How Random, How Long
• Examples: ε-Greedy
Primary Measures:
• Regret: Reward vs. Max Reward
• Time/Cost: Training Costs
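As a concrete sketch of the ε-greedy technique named above (the starting epsilon, decay rate, and Q-values are assumed, illustrative numbers):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon take a random action (explore),
    # otherwise take the action with the highest known value (exploit).
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

epsilon = 1.0            # start fully exploratory
decay = 0.995            # shrink epsilon after every episode ("how random, how long")
q_values = np.zeros(3)   # e.g. three actions, nothing learned yet

action = epsilon_greedy(q_values, epsilon)
epsilon = max(0.01, epsilon * decay)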
Why Start with Gym?

#1: It Takes Care of Most Necessities
• Writing Simulations and Games
• Defining Reward Systems
• We Can Focus on the Learner: Creating Learning Agents
  (e.g. mapping a state to action values: Left = Q(s,a1), Right = Q(s,a2), Jump = Q(s,a3))

#2: Gym Environments
• Classical Control
• Robotics
• MuJoCo
• Box2D
• Atari
• Toy Text

#3: Python Ecosystem
• RL in Python
• GPU Access
• Easy Start with Keras-RL
Getting Started

A Simple Example
• gym.make() — creates the environment
• env.reset() — returns the initial state; resets an episode
• env.render() — video display or text map
• env.action_space.sample() — returns a random sample of the actions
• env.step() — returns state, reward, done, and info
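Putting those calls together, a random-agent loop over one episode might look like the sketch below. This assumes the classic Gym API used in this talk, where env.step() returns four values; newer Gym/Gymnasium releases differ.

import gym

env = gym.make("CartPole-v0")           # gym.make(): create the environment
state = env.reset()                     # env.reset(): initial state / new episode

done = False
while not done:
    env.render()                        # env.render(): video display or text map
    action = env.action_space.sample()  # a random sample of the action space
    state, reward, done, info = env.step(action)  # env.step(): state, reward, done, info

env.close()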
Challenge 1

MountainCar-v0
Premise: An under-powered car must go up a steep hill.
• Observation Space: Box(2) — Position, Velocity
• Action Space: Discrete(3) — Push Left, Push Right, No Push
• Rewards: -1 per step
• Game Ends: after 200 steps, or when the car reaches the goal
• Goal: Get up the hill with minimum regret
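These spaces can be inspected straight from the environment with the standard Gym space attributes, for example:

import gym

env = gym.make("MountainCar-v0")
print(env.observation_space)       # Box(2): position, velocity
print(env.observation_space.low)   # [-1.2  -0.07]
print(env.observation_space.high)  # [ 0.6   0.07]
print(env.action_space)            # Discrete(3): push left, no push, push right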
Import Packages
Make Environment
Set Parameters
Set Epsilon Decay Schedule
Discretization and Q-Table
Q-Table (one row per state, one column per action; each cell holds the Q-value of taking that action in the current state)

            Action 1        …    Action n
State 1     Value(s1, a1)   …    Value(s1, an)
State 2     …               …    …
…           …               …    …
State n     Value(sn, a1)   …    Value(sn, an)
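The steps above (importing packages, making the environment, setting parameters, the epsilon decay schedule, discretization, and building the Q-table) might look roughly like this minimal sketch; the bin counts, learning rate, discount, and epsilon schedule are assumed values, not the presenter's originals.

import gym
import numpy as np

# Make environment
env = gym.make("MountainCar-v0")

# Parameters (assumed values)
learning_rate = 0.1
discount = 0.95
episodes = 2000

# Epsilon decay schedule: explore heavily early, hardly at all later
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = (epsilon - epsilon_min) / (episodes // 2)

# Discretization: chop the continuous Box(2) observation (position, velocity)
# into a grid of bins so it can index a table
n_bins = (20, 20)
obs_low = env.observation_space.low
obs_bin_size = (env.observation_space.high - obs_low) / n_bins

def discretize(obs):
    idx = ((obs - obs_low) / obs_bin_size).astype(int)
    return tuple(np.clip(idx, 0, np.array(n_bins) - 1))

# Q-table: one entry per (discretized state, action)
q_table = np.zeros(n_bins + (env.action_space.n,))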
Decision and Action
Update Q-Table
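The Decision and Action step and the Update Q-Table step come together in the training loop. Continuing the sketch above, with the standard Q-learning update (again an illustrative reconstruction, not the original code):

for episode in range(episodes):
    state = discretize(env.reset())
    done = False
    while not done:
        # Decision and Action: epsilon-greedy over the current row of the Q-table
        if np.random.random() < epsilon:
            action = env.action_space.sample()       # explore
        else:
            action = int(np.argmax(q_table[state]))  # exploit

        next_obs, reward, done, info = env.step(action)
        next_state = discretize(next_obs)

        # Update Q-Table with the standard Q-learning rule:
        # Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = reward + discount * np.max(q_table[next_state])
        q_table[state + (action,)] += learning_rate * (td_target - q_table[state + (action,)])

        state = next_state

    # Decay epsilon: shift gradually from exploration to exploitation
    epsilon = max(epsilon_min, epsilon - epsilon_decay)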
[Snapshots of the learned behavior at Episodes 0, 400, 600, 1200, and 1800]
Results
[Reward-per-episode plot, amended from the original version, showing the shift from exploration to exploitation]
Challenge 2

CartPole-v0
Premise: Overcome complex dynamics to hold the pole vertical.
• Observation Space: Box(4) — Cart Position, Cart Velocity, Pole Angle, Pole Velocity
• Action Space: Discrete(2) — Push Left, Push Right
• Rewards: 1 point per step
• Game Ends: after 200 steps, when the pole angle exceeds ±12°, or when the cart position exceeds ±2.4
• Solved When: average reward >= 195 over 100 consecutive episodes
Import Packages
Create CartPole Environment
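A minimal sketch of the import and environment-creation steps for the Keras-RL approach (assuming the keras-rl package with a TensorFlow/Keras backend; package layout and names differ slightly across keras-rl versions):

import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from rl.agents import SARSAAgent          # keras-rl
from rl.policy import EpsGreedyQPolicy    # keras-rl

# Create CartPole environment
env = gym.make("CartPole-v0")
nb_actions = env.action_space.n            # 2: push left, push right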
Function Approximator
• Input Layer: the state variables
• Hidden Layer
• Output Layer: Value of Action 1, Value of Action 2
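The function approximator above (states in, one value per action out) can be a small dense network; the hidden-layer size here is an assumed choice:

# Function approximator:
#   input layer  -> the 4 state variables (Box(4))
#   hidden layer -> a small dense layer (size is an assumption)
#   output layer -> one value per action (Discrete(2))
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))  # keras-rl passes (window, observation)
model.add(Dense(24, activation="relu"))
model.add(Dense(nb_actions, activation="linear"))
model.summary()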
Policy, Agent, and Fitting
• Agents may learn On-Policy (e.g. SARSA) or Off-Policy (e.g. Q-Learning)
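Policy, agent, and fitting with keras-rl's on-policy SARSA agent might look like the sketch below; the optimizer, epsilon, step count, and test-episode count are assumptions, not the presenter's settings.

# Policy: epsilon-greedy exploration
policy = EpsGreedyQPolicy(eps=0.1)

# Agent: SARSA (on-policy), with the network above as its function approximator
sarsa = SARSAAgent(model=model, nb_actions=nb_actions, policy=policy)
sarsa.compile("adam", metrics=["mae"])

# Fitting: interact with the environment and learn
sarsa.fit(env, nb_steps=50000, visualize=False, verbose=1)

# Test over 100 episodes, as in the results slide
scores = sarsa.test(env, nb_episodes=100, visualize=False)
print("Average reward:", np.mean(scores.history["episode_reward"]))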
CartPole-v0 Test
Learner:
• Agent: SARSA
• Function Approximator: Deep Neural Net
• Policy: Epsilon-Greedy
Test:
• Episodes: 100
• Average Reward: 200
• Solved: Yes (avg reward >= 195)
Questions

Getting Started with Reinforcement Learning in OpenAI Gym