This presentation introduces reinforcement learning with the OpenAI Gym library. It starts with RL fundamentals, then covers core Gym functionality. Next, it shows how to solve the MountainCar-v0 and CartPole-v0 environments with Q-learning and SARSA agents. Finally, it presents the results of the learning process and the skill of the trained agents, both qualitatively and quantitatively.
10. Specific Traits of RL
• Trial and Error
• Maximizes Rewards
• Optimizes Policies
• Maps States to Actions
• Optimal Decision Sequences
11. Markov Decision Process
[Diagram: state-transition graph, edges labeled with transition probabilities such as P = 0.3, P = 0.4, P = 0.7, and P = 1.0]
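The transition diagram itself did not survive extraction, but the idea it illustrates can be sketched directly: an MDP attaches probabilities and rewards to (state, action) transitions. The state names, action names, edge structure, and rewards below are hypothetical placeholders; only the probability values echo the original figure:

```python
import random

# Hypothetical MDP transition model: (state, action) -> list of
# (probability, next_state, reward); probabilities per entry sum to 1.0.
# Structure and rewards are placeholders, not the lost diagram's.
transitions = {
    ("s0", "go"):   [(1.0, "s1", 0.0)],
    ("s1", "go"):   [(0.3, "s1", 0.0), (0.7, "s2", 1.0)],
    ("s1", "stay"): [(0.4, "s0", 0.0), (0.3, "s1", 0.0), (0.3, "s2", 1.0)],
    ("s2", "go"):   [(1.0, "s0", 0.0)],
}

def sample_step(state, action):
    """Sample (next_state, reward) according to the transition model."""
    outcomes = transitions[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights)[0]
    return next_state, reward
```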
13. Exploitation vs. Exploration
• Exploitation ("greedy"): follow the best known strategy, picking the action with the highest known value; downside: the agent can't learn new skills.
• Exploration (not "greedy"): make random decisions, gaining the most information; downside: the agent can't use its learned skills.
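A standard way to balance the two is an epsilon-greedy policy: with probability epsilon the agent explores with a random action, otherwise it exploits the highest-valued known action. A minimal sketch, assuming a Q-table row of per-action values for the current state (the function name and signature are illustrative, not from the slides):

```python
import random

import numpy as np

def epsilon_greedy(q_row, epsilon, n_actions):
    """Balance exploration and exploitation for one decision.

    q_row: Q-values of the current state, one per action.
    """
    if random.random() < epsilon:
        return random.randrange(n_actions)  # explore: random decision
    return int(np.argmax(q_row))            # exploit: highest known value
```

Decaying epsilon over training shifts the agent from exploring early on toward exploiting what it has learned.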
18. Creating Learning Agents
With Gym handling the writing of simulations and games and the definition of reward systems, we can focus on the learner.
[Diagram: in some state s, the agent scores each available action by its Q-value: Left = Q(s, a1), Right = Q(s, a2), Jump = Q(s, a3)]
27. A Simple Example
gym.make()
• Creates the Environment
env.render()
• Video Display or Text Map
env.action_space.sample()
• Samples a Random Action
env.step()
• State • Reward • Done • Info
env.reset()
• Initial State • Resets Episodes
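Put together, these five calls are enough for a complete (if skill-free) random agent. A minimal sketch assuming the classic Gym API used throughout the presentation, where env.step() returns a four-tuple; newer Gymnasium releases return five values and split done into terminated/truncated:

```python
import gym

env = gym.make("MountainCar-v0")        # create the environment
state = env.reset()                     # initial state; resets the episode

done = False
while not done:
    env.render()                        # video display (or text map)
    action = env.action_space.sample()  # sample a random action
    state, reward, done, info = env.step(action)  # advance the simulation

env.close()
```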
34. MountainCar-v0
Observation Space: Box(2)
• Position • Velocity
Action Space: Discrete(3)
• Push Left • Push Right • No Push
Rewards
• -1 Per Step
Game Ends
• 200 Steps • Reach Goal
Goal: Get up the Hill with Minimum Regret
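Because the Box(2) observation space is continuous, a tabular learner first has to discretize it so that (position, velocity) pairs can index a Q-table. A minimal sketch; the choice of 20 bins per dimension is an illustrative assumption, not a value from the slides:

```python
import numpy as np

def make_discretizer(env, bins_per_dim=20):
    """Build a function mapping a continuous observation to a tuple of
    bin indices usable as a Q-table key. Assumes finite bounds, which
    holds for MountainCar-v0's position and velocity."""
    low = env.observation_space.low
    high = env.observation_space.high
    edges = [np.linspace(lo, hi, bins_per_dim + 1)[1:-1]
             for lo, hi in zip(low, high)]

    def discretize(obs):
        return tuple(int(np.digitize(x, e)) for x, e in zip(obs, edges))

    return discretize
```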
38. Q-Table

             Action 1         …   Action n
State 1      Value(s1, a1)    …   Value(s1, an)
State 2      …                …   …
…            …                …   …
State n      Value(sn, a1)    …   Value(sn, an)

The current state selects a row, the chosen action selects a column, and the cell holds the corresponding Q-value.
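The table's values are learned with the Q-learning update rule, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). A minimal sketch using a defaultdict as the table; the action count and the alpha/gamma values are illustrative assumptions:

```python
from collections import defaultdict

import numpy as np

N_ACTIONS = 3                                 # e.g. MountainCar-v0's Discrete(3)
Q = defaultdict(lambda: np.zeros(N_ACTIONS))  # rows created on first visit

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the bootstrapped target
    r + gamma * max_a' Q(s_next, a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```

Because the target bootstraps on the best next action regardless of what the policy actually does next, Q-learning is off-policy; SARSA, the other agent in this presentation, instead uses the action actually taken.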
54. CartPole-v0
Observation Space: Box(4)
• Cart Position • Cart Velocity • Pole Angle • Pole Velocity
Action Space: Discrete(2)
• Push Left • Push Right
Rewards
• 1 Point Per Step
Game Ends
• 200 Steps • Pole Angle ±12° • Cart Position ±2.4
Solved When: Avg Reward >= 195 (over 100 consecutive episodes)
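SARSA differs from Q-learning in being on-policy: the update bootstraps on the action the agent actually takes next rather than on the max. A minimal sketch of one CartPole-v0 training episode, reusing the epsilon_greedy and discretizer sketches above (CartPole's velocity bounds are infinite, so a real discretizer would clip them first); all hyperparameter values are illustrative assumptions:

```python
from collections import defaultdict

import gym
import numpy as np

env = gym.make("CartPole-v0")
Q = defaultdict(lambda: np.zeros(env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters

state = discretize(env.reset())         # discretize() as sketched earlier
action = epsilon_greedy(Q[state], epsilon, env.action_space.n)
done = False
while not done:
    obs, reward, done, info = env.step(action)  # classic Gym API
    next_state = discretize(obs)
    next_action = epsilon_greedy(Q[next_state], epsilon, env.action_space.n)
    # SARSA target: bootstrap on the action actually chosen next (on-policy)
    target = reward if done else reward + gamma * Q[next_state][next_action]
    Q[state][action] += alpha * (target - Q[state][action])
    state, action = next_state, next_action
```

Tracking the average reward over the last 100 episodes against the 195-point threshold gives the quantitative measure of skill reported at the end of the presentation.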