5. 5
Atari Breakout in OpenAI Gym
import gym
env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
action = env.action_space.sample() # action by random or policy
state, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
state, info = env.reset()
env.close()
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
6. 6
State/Action/Reward in Atari Breakout
State:
●
(210, 160, 3) - image
Action:
●
0 - NO OP
●
1 - FIRE
●
2 - RIGHT
●
3 - LEFT
Reward:
●
Red - 7 points
●
Orange - 7 points
●
Yellow - 4 points
●
Green - 4 points
●
Aqua - 1 point
●
Blue - 1 point
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
7. 7
How Well Can Reinforcement Learning Do?
Artificial Intelligence and the Future - Demis Hassabis/DeepMind
https://youtu.be/zYII3AOSgo8?t=2236
8. 8
From One Game to All The Games in Atari
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
9. 9
One Journey to General Artificial Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020
10. 10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
●
Number of Variable : 1
●
Range of Variable : [1, 500]
●
25 taxi positions x 5 passenger positions x 4 destination locations
Action:
●
0 : move south
●
1 : move north
●
2 : move east
●
3 : move west
●
4 : pickup passenger
●
5 : drop off passenger
Reward:
●
-1 : per step unless other rewards is triggered
●
+20 : delivering passenger
●
-10 : pickup/dropoff illegally
https://www.gymlibrary.dev/environments/toy_text/taxi/
12. 12
Q Learning (with epsilon greedy policy)
3. exploitation
1. initialize Q table
4. exploration
5. action
2. state
8. update Q table
6. next state
7. reward
https://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/
13. 13
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning
14. 14
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning
15. 15
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
5. store transition in replay memory
6. get batch from replay memory
2. initialize main network
3. initialize target network
4. epsilon greedy policy from main network
7. calculate error between two networks
8. synchronize two networks
Ref : Human-level control through deep reinforcement learning
16. 16
Four Tough Games in Atari
Pitfall Solaris Skiing Montezuma’s Revenge
Problems : long-term credit assignment and exploitation/exploration tradeoff
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
18. 18
How Well Can Agent57 Do?
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
19. 19
Reinforcement Learning at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/
20. 20
Mastering Go at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/
21. 21
Another Journey to General Artificial Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U
22. 22
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
60 players from China, Korea, Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=HT-UZkiOLv8
23. 23
AlphaGo Zero Training Process
Self-Play
Train
Value
Network
Train
Policy
Network
https://www.youtube.com/watch?v=mWHK27pXjqo
24. 24
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/
25. 25
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model