A Journey to Reinforcement Learning

fangkuoyu@gmail.com
12/30/2022

2
The Treasure Map
MuZero
Alpha Zero
Gym
Gym

3
Atari Games
Pong Breakout Phoenix
https://www.gymlibrary.dev/
https://gymnasium.farama.org/

4
Reinforcement Learning Framework
ENVIRONMENT
AGENT
State Action Reward
(s1 → a1 → r1)→ (s2 → a2 → r2)→ (s3 → a3 → r3)→ …
Learning to Make Decisions for Maximizing Long-Term Rewards
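"Long-term rewards" is commonly formalized as a discounted return, G = r1 + gamma*r2 + gamma^2*r3 + ... The sketch below computes it for one episode. The discount factor gamma and the example rewards are assumptions for illustration; the slide does not specify them.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per elapsed step."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with rewards r1, r2, r3.
example_rewards = [1.0, 0.0, 2.0]
example_return = discounted_return(example_rewards, gamma=0.5)
print(example_return)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

A gamma close to 1 makes the agent value distant rewards almost as much as immediate ones; a small gamma makes it short-sighted.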

5
Atari Breakout in OpenAI Gym
import gym

env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
    action = env.action_space.sample()  # action by random or policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()
env.close()

6
State/Action/Reward in Atari Breakout
State:
● (210, 160, 3) - image
Action:
● 0 - NO OP
● 1 - FIRE
● 2 - RIGHT
● 3 - LEFT
Reward:
● Red - 7 points
● Orange - 7 points
● Yellow - 4 points
● Green - 4 points
● Aqua - 1 point
● Blue - 1 point

7
How Well Can Reinforcement Learning Do?
Artificial Intelligence and the Future - Demis Hassabis/DeepMind
https://youtu.be/zYII3AOSgo8?t=2236

8
From One Game to All The Games in Atari
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark

9
One Journey to General Artificial Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020

10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
● Number of Variables : 1
● Range of Variable : [0, 499]
● 25 taxi positions x 5 passenger positions x 4 destination locations
Action:
● 0 : move south
● 1 : move north
● 2 : move east
● 3 : move west
● 4 : pick up passenger
● 5 : drop off passenger
Reward:
● -1 : per step unless another reward is triggered
● +20 : delivering the passenger
● -10 : illegal pickup/drop-off
https://www.gymlibrary.dev/environments/toy_text/taxi/
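The single state variable packs four quantities (taxi row, taxi column, passenger location, destination) into one integer. The sketch below shows one way to encode and decode it; the ordering follows the layout described in the Gym Taxi documentation as far as can be told, and the function names here are illustrative rather than a call into Gym itself.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four Taxi variables into one integer in 0..499."""
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(state):
    """Inverse of encode: (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(encode(3, 1, 2, 0))  # 328
print(decode(328))         # (3, 1, 2, 0)
```

Because every combination maps to a distinct integer, the state can be used directly as a row index into a lookup table.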

11
OpenAI Gym Taxi-v3 : Q Table
(500 x 6)
https://www.gocoder.one/blog/rl-tutorial-with-openai-gym

12
Q Learning (with epsilon greedy policy)
1. initialize Q table
2. state
3. exploitation
4. exploration
5. action
6. next state
7. reward
8. update Q table
https://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/
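The eight steps can be sketched as tabular Q-learning on a tiny environment. To stay self-contained, the example uses a hypothetical 5-state chain instead of Taxi-v3; the environment, the hyperparameters (alpha, gamma, epsilon), and all function names are assumptions for illustration.

```python
import random

N_STATES, N_ACTIONS = 5, 2  # hypothetical chain: action 0 = left, 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Toy environment: move along the chain; reaching GOAL pays +1 and ends."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # 1. initialize Q table
    for _ in range(episodes):
        state, done = 0, False                        # 2. state
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)     # 4. exploration
            else:
                # 3. exploitation, breaking ties between equal values at random
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(N_ACTIONS) if q[state][a] == best]
                )
            # 5. action, 6. next state, 7. reward
            next_state, reward, done = step(state, action)
            # 8. update Q table toward the one-step bootstrapped target
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
greedy_policy = [row.index(max(row)) for row in q_table[:GOAL]]
print(greedy_policy)  # [1, 1, 1, 1]: always move right, toward the goal
```

For Taxi-v3 the same loop would apply with a 500 x 6 table and the Gym environment in place of the toy `step` function.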

13
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning

14
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning

15
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
2. initialize main network
3. initialize target network
4. epsilon greedy policy from main network
5. store transition in replay memory
6. get batch from replay memory
7. calculate error between two networks
8. synchronize two networks
Ref : Human-level control through deep reinforcement learning
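The interplay of replay memory, main network, and target network can be sketched without a deep learning library. In the example below the two "networks" are plain lookup tables standing in for neural networks, step 4 (epsilon greedy action selection) is skipped by storing two hand-made transitions, and all names and constants are assumptions for illustration.

```python
import copy
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), batch_size)


def td_targets(batch, target_q, gamma):
    """y = r + gamma * max_a' Q_target(s', a'), or y = r at episode end."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q[next_state])
        targets.append(reward + bootstrap)
    return targets


rng = random.Random(0)
memory = ReplayMemory(capacity=100)                 # 1. replay memory
main_q = {s: [0.0, 0.0] for s in range(3)}          # 2. main "network" (a table)
target_q = copy.deepcopy(main_q)                    # 3. target "network"

memory.store((0, 1, 0.0, 1, False))                 # 5. store transitions
memory.store((1, 1, 1.0, 2, True))

SYNC_EVERY, LEARNING_RATE = 10, 0.5
for step_count in range(1, 41):
    batch = memory.sample(2, rng)                   # 6. get batch
    targets = td_targets(batch, target_q, gamma=0.9)
    for (state, action, *_), y in zip(batch, targets):
        error = y - main_q[state][action]           # 7. error between networks
        main_q[state][action] += LEARNING_RATE * error
    if step_count % SYNC_EVERY == 0:
        target_q = copy.deepcopy(main_q)            # 8. synchronize networks

print(round(main_q[1][1], 3), round(main_q[0][1], 3))
```

Note that the value for state 0 only starts to grow after the first synchronization: until then the frozen target table still reports zero for state 1, which is the stabilizing delay the target network is meant to provide.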

16
Four Tough Games in Atari
Pitfall Solaris Skiing Montezuma’s Revenge
Problems : long-term credit assignment and exploitation/exploration tradeoff
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.

17
Distributed Reinforcement Learning
Agent57
Gorila
https://arxiv.org/abs/2003.13350
https://arxiv.org/abs/1507.04296

18
How Well Can Agent57 Do?

19
Reinforcement Learning at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/

20
Mastering Go at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/

21
Another Journey to General Artificial Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U

22
AlphaGo Fan/Lee/Master
● European Go Champion Fan Hui - 5:0
● South Korean professional Go player Lee Sedol - 4:1
● 60 players from China, Korea, Japan - 60:0
● Chinese professional Go player Ke Jie - 3:0
https://www.youtube.com/watch?v=HT-UZkiOLv8

23
AlphaGo Zero Training Process
Self-Play
Train Value Network
Train Policy Network
https://www.youtube.com/watch?v=mWHK27pXjqo

24
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/

25
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
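How the three functions fit together can be sketched with toy stand-ins: h maps an observation to a hidden state, g advances the hidden state given an action and predicts a reward, and f predicts a policy and a value from a hidden state. The arithmetic inside each function below is made up for illustration and is not a trained network; the tree search that MuZero runs on top of these functions is omitted.

```python
def h(observation):
    """Representation: observation -> initial hidden state (toy: the mean)."""
    return sum(observation) / len(observation)


def g(hidden_state, action):
    """Dynamics: (hidden state, action) -> (predicted reward, next hidden state)."""
    next_state = 0.5 * hidden_state + action
    reward = float(action == 1)
    return reward, next_state


def f(hidden_state):
    """Prediction: hidden state -> (policy over 2 actions, value estimate)."""
    p_action_1 = min(max(hidden_state / 4.0, 0.0), 1.0)
    return [1.0 - p_action_1, p_action_1], hidden_state


def unroll(observation, actions):
    """Apply h once, then g and f per action, collecting the predictions."""
    state = h(observation)
    policy, value = f(state)
    trajectory = [(None, policy, value)]  # no reward is predicted at the root
    for action in actions:
        reward, state = g(state, action)
        policy, value = f(state)
        trajectory.append((reward, policy, value))
    return trajectory


for reward, policy, value in unroll([1.0, 3.0], actions=[1, 0]):
    print(reward, policy, value)
```

The key point is that after the first call to h, the unroll never touches the real environment again: every later step is imagined inside the learned model.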

26
MuZero Performance Benchmark
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model

27
Exploring The Treasure Map ...
MuZero
Alpha Zero
Gym
Gym

28
Beyond the Treasure Map ...
MuZero
Alpha Zero
Gym
Gym
AlphaStar
AlphaFold
AlphaTensor
Other Domains, e.g., Mobile/Wireless Communication

A Journey to Reinforcement Learning
Q & A
