Player 1 vs. Player 2
• Model: Deep Q Network; Approximator: Convolutional Neural Network; Policy: Epsilon-Greedy
• Model: Original Deterministic Pong; Approximator: Hard-Coded; Policy: Greedy
Intro to Reinforcement Learning
Reinforcement Learning vs. Other Types of ML
• Classification / Regression
• Clustering / Dimensionality Reduction
• Optimal Decision Sequences
Specific Traits of RL
• Trial and Error
• Maximizes Rewards
• Optimizes Policies
• Maps States to Actions
• Optimal Decision Sequences
Markov Decision Process
[State-transition diagram: states linked by actions with transition probabilities such as P = 0.3, 0.4, 0.7, and 1.0]
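An MDP like the one in the diagram can be written down directly as a table of transition probabilities. A minimal Python sketch (the states, actions, and rewards below are invented for illustration; only the probability values echo the diagram). Gym's toy-text environments expose a similar structure through their env.P attribute.

# Illustrative MDP: transition probabilities P(s' | s, a) as a nested dict.
# States, actions, and rewards here are invented for illustration.
mdp = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],                    # (probability, next state, reward)
        "right": [(0.7, "s1", 0.0), (0.3, "s0", 0.0)],
    },
    "s1": {
        "left":  [(0.4, "s0", 0.0), (0.6, "s1", 0.0)],
        "right": [(1.0, "s2", 1.0)],                    # reaching s2 pays a reward of 1
    },
}

# Sanity check: for every (state, action), outgoing probabilities sum to 1.
for state, actions in mdp.items():
    for action, transitions in actions.items():
        assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9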
How RL Works
• The environment provides states
• The environment provides feedback (rewards)
• The agent learns to maximize rewards
Exploitation vs. Exploration
Exploitation:
• Best Known Strategy ("Greedy")
• Highest Known Value
• Can't Learn New Skills
Exploration:
• Random Decisions (Not "Greedy")
• Most Information Gained
• Can't Use Learned Skills
Exploitation vs. Exploration
Primary Technique — Exploration Policy:
• How Random, How Long
• Examples: ε-Greedy
Primary Measures:
• Regret: Reward vs. Max Reward
• Time/Cost: Training Costs
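As a concrete sketch of the ε-greedy technique named above (the starting epsilon, decay rate, and Q-values are assumed, illustrative numbers):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon take a random action (explore),
    # otherwise take the action with the highest known value (exploit).
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

epsilon = 1.0            # start fully exploratory
decay = 0.995            # shrink epsilon after every episode ("how random, how long")
q_values = np.zeros(3)   # e.g. three actions, nothing learned yet

action = epsilon_greedy(q_values, epsilon)
epsilon = max(0.01, epsilon * decay)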
Why Start with Gym?

#1: It Takes Care of Most Necessities
• Writing Simulations and Games
• Defining Reward Systems
• We Can Focus on the Learner: Creating Learning Agents
  (e.g. mapping a state to action values: Left = Q(s,a1), Right = Q(s,a2), Jump = Q(s,a3))

#2: Gym Environments
• Classical Control
• Robotics
• MuJoCo
• Box2D
• Atari
• Toy Text

#3: Python Ecosystem
• RL in Python
• GPU Access
• Easy Start with Keras-RL
Getting Started

A Simple Example
• gym.make() — creates the environment
• env.reset() — returns the initial state; resets an episode
• env.render() — video display or text map
• env.action_space.sample() — returns a random sample of the actions
• env.step() — returns state, reward, done, and info
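Putting those calls together, a random-agent loop over one episode might look like the sketch below. This assumes the classic Gym API used in this talk, where env.step() returns four values; newer Gym/Gymnasium releases differ.

import gym

env = gym.make("CartPole-v0")           # gym.make(): create the environment
state = env.reset()                     # env.reset(): initial state / new episode

done = False
while not done:
    env.render()                        # env.render(): video display or text map
    action = env.action_space.sample()  # a random sample of the action space
    state, reward, done, info = env.step(action)  # env.step(): state, reward, done, info

env.close()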
Challenge 1

MountainCar-v0
Premise: An under-powered car must go up a steep hill.
• Observation Space: Box(2) — Position, Velocity
• Action Space: Discrete(3) — Push Left, Push Right, No Push
• Rewards: -1 per step
• Game Ends: after 200 steps, or when the car reaches the goal
• Goal: Get up the hill with minimum regret
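These spaces can be inspected straight from the environment with the standard Gym space attributes, for example:

import gym

env = gym.make("MountainCar-v0")
print(env.observation_space)       # Box(2): position, velocity
print(env.observation_space.low)   # [-1.2  -0.07]
print(env.observation_space.high)  # [ 0.6   0.07]
print(env.action_space)            # Discrete(3): push left, no push, push right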
Import Packages
Make Environment
Set Parameters
Set Epsilon Decay Schedule
Discretization and Q-Table
Q-Table (one row per state, one column per action; each cell holds the Q-value of taking that action in the current state)

            Action 1        …    Action n
State 1     Value(s1, a1)   …    Value(s1, an)
State 2     …               …    …
…           …               …    …
State n     Value(sn, a1)   …    Value(sn, an)
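The steps above (importing packages, making the environment, setting parameters, the epsilon decay schedule, discretization, and building the Q-table) might look roughly like this minimal sketch; the bin counts, learning rate, discount, and epsilon schedule are assumed values, not the presenter's originals.

import gym
import numpy as np

# Make environment
env = gym.make("MountainCar-v0")

# Parameters (assumed values)
learning_rate = 0.1
discount = 0.95
episodes = 2000

# Epsilon decay schedule: explore heavily early, hardly at all later
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = (epsilon - epsilon_min) / (episodes // 2)

# Discretization: chop the continuous Box(2) observation (position, velocity)
# into a grid of bins so it can index a table
n_bins = (20, 20)
obs_low = env.observation_space.low
obs_bin_size = (env.observation_space.high - obs_low) / n_bins

def discretize(obs):
    idx = ((obs - obs_low) / obs_bin_size).astype(int)
    return tuple(np.clip(idx, 0, np.array(n_bins) - 1))

# Q-table: one entry per (discretized state, action)
q_table = np.zeros(n_bins + (env.action_space.n,))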
Decision and Action
Update Q-Table
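The Decision and Action step and the Update Q-Table step come together in the training loop. Continuing the sketch above, with the standard Q-learning update (again an illustrative reconstruction, not the original code):

for episode in range(episodes):
    state = discretize(env.reset())
    done = False
    while not done:
        # Decision and Action: epsilon-greedy over the current row of the Q-table
        if np.random.random() < epsilon:
            action = env.action_space.sample()       # explore
        else:
            action = int(np.argmax(q_table[state]))  # exploit

        next_obs, reward, done, info = env.step(action)
        next_state = discretize(next_obs)

        # Update Q-Table with the standard Q-learning rule:
        # Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = reward + discount * np.max(q_table[next_state])
        q_table[state + (action,)] += learning_rate * (td_target - q_table[state + (action,)])

        state = next_state

    # Decay epsilon: shift gradually from exploration to exploitation
    epsilon = max(epsilon_min, epsilon - epsilon_decay)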
[Snapshots of the learned behavior at Episodes 0, 400, 600, 1200, and 1800]
Results
[Reward-per-episode plot, amended from the original version, showing the shift from exploration to exploitation]
Challenge 2

CartPole-v0
Premise: Overcome complex dynamics to hold the pole vertical.
• Observation Space: Box(4) — Cart Position, Cart Velocity, Pole Angle, Pole Velocity
• Action Space: Discrete(2) — Push Left, Push Right
• Rewards: 1 point per step
• Game Ends: after 200 steps, when the pole angle exceeds ±12°, or when the cart position exceeds ±2.4
• Solved When: average reward >= 195 over 100 consecutive episodes
Import Packages
Create CartPole Environment
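A minimal sketch of the import and environment-creation steps for the Keras-RL approach (assuming the keras-rl package with a TensorFlow/Keras backend; package layout and names differ slightly across keras-rl versions):

import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from rl.agents import SARSAAgent          # keras-rl
from rl.policy import EpsGreedyQPolicy    # keras-rl

# Create CartPole environment
env = gym.make("CartPole-v0")
nb_actions = env.action_space.n            # 2: push left, push right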
Function Approximator
• Input Layer: the state variables
• Hidden Layer
• Output Layer: Value of Action 1, Value of Action 2
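The function approximator above (states in, one value per action out) can be a small dense network; the hidden-layer size here is an assumed choice:

# Function approximator:
#   input layer  -> the 4 state variables (Box(4))
#   hidden layer -> a small dense layer (size is an assumption)
#   output layer -> one value per action (Discrete(2))
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))  # keras-rl passes (window, observation)
model.add(Dense(24, activation="relu"))
model.add(Dense(nb_actions, activation="linear"))
model.summary()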
Policy, Agent, and Fitting
• Agents may learn On-Policy (e.g. SARSA) or Off-Policy (e.g. Q-Learning)
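Policy, agent, and fitting with keras-rl's on-policy SARSA agent might look like the sketch below; the optimizer, epsilon, step count, and test-episode count are assumptions, not the presenter's settings.

# Policy: epsilon-greedy exploration
policy = EpsGreedyQPolicy(eps=0.1)

# Agent: SARSA (on-policy), with the network above as its function approximator
sarsa = SARSAAgent(model=model, nb_actions=nb_actions, policy=policy)
sarsa.compile("adam", metrics=["mae"])

# Fitting: interact with the environment and learn
sarsa.fit(env, nb_steps=50000, visualize=False, verbose=1)

# Test over 100 episodes, as in the results slide
scores = sarsa.test(env, nb_episodes=100, visualize=False)
print("Average reward:", np.mean(scores.history["episode_reward"]))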
CartPole-v0 Test
Learner:
• Agent: SARSA
• Function Approximator: Deep Neural Net
• Policy: Epsilon-Greedy
Test:
• Episodes: 100
• Average Reward: 200
• Solved: Yes (avg reward >= 195)
Questions

Getting Started with Reinforcement Learning in OpenAI Gym