Presentation slides for the PyCon Korea 2018 tutorial session "RL Adventure: From DQN to Rainbow DQN".
This tutorial is intended to help you understand Rainbow, the value-based reinforcement learning model published by DeepMind in 2017. It walks from DQN to Rainbow step by step, summarizing only the key points along the way.
Part 1: DQN, Double & Dueling DQN - 성태경
Part 2: PER and NoisyNet - 양홍선
Part 3: Distributional RL - 이의령
Part 4: RAINBOW - 김예찬
The accompanying code and implementations are available at
https://github.com/hongdam/pycon2018-RL_Adventure
11. DQN
NEURAL NETWORKS IN ONE SLIDE
Convolutional neural network
Max-pooling
Softmax
Weight operations
Backpropagation
Non-linear function
12. DQN
Q-LEARNING
‣ Goal: figure out which action is best to take in the current state
V. Mnih, et al. Playing Atari with Deep Reinforcement Learning. NIPS, 2013
C. J. C. H. Watkins, P. Dayan. Q-learning. 1992.
Q_new(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_a Q(s_{t+1}, a))
[Value iteration update]
(r_t is the current reward; γ max_a Q(s_{t+1}, a) is the discounted reward of the next state)
Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t R(x_t, a_t) ],  γ ∈ (0, 1)
[Expected rewards]
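As a concrete illustration of the update rule above, here is a minimal tabular Q-learning step in Python. This is only a sketch, not code from the tutorial repository; the environment interface, the ε-greedy helper, and the hyperparameter values are assumptions.

```python
import random
from collections import defaultdict

# Q-table: maps (state, action) -> estimated return; unseen pairs default to 0.0
Q = defaultdict(float)

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.99  # discount factor (assumed value)

def q_learning_update(state, action, reward, next_state, actions):
    """Q(s_t, a_t) <- (1 - alpha) * Q(s_t, a_t) + alpha * (r_t + gamma * max_a Q(s_{t+1}, a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * target

def epsilon_greedy(state, actions, eps=0.1):
    """Explore with probability eps, otherwise act greedily w.r.t. the current Q-table."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```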
13. DQN
MOTIVATION
V. Mnih, et al. Playing Atari with Deep Reinforcement Learning. NIPS, 2013
Q(s, a) → [neural network] → Q(s, a; θ)
L_i(θ_i) = E_{s,a,r,s′}[ (r + γ max_{a′} Q(s′, a′; θ_i) − Q(s, a; θ_i))^2 ]
(target value − predicted value = TD error)
Approximate the Q-function with a neural network.
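A rough PyTorch sketch of this objective, assuming a small fully connected network and a simple batch format (the layer sizes, batch layout, and function names are my own assumptions, not the tutorial's implementation):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one Q-value per action: Q(s, ·; θ)."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state):
        return self.net(state)

def dqn_loss(q_net, batch, gamma=0.99):
    """L(θ) = E[(r + γ max_a' Q(s', a'; θ) − Q(s, a; θ))^2]."""
    states, actions, rewards, next_states, dones = batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # predicted Q(s, a)
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values                  # max_a' Q(s', a')
        target = rewards + gamma * q_next * (1.0 - dones)              # TD target
    return nn.functional.mse_loss(q_pred, target)
```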
14. DQN
PROBLEM
‣ Unstable updates
‣ High correlations between consecutive input samples
V. Mnih, et al. Playing Atari with Deep Reinforcement Learning. NIPS, 2013
https://curt-park.github.io/2018-05-17/dqn/
‣ Non-stationary targets (the same network parameters are used for both the prediction and the target)
[Objective function]
L_i(θ_i) = E_{s,a,r,s′}[ (r + γ max_{a′} Q(s′, a′; θ_i) − Q(s, a; θ_i))^2 ]
15. DQN
SOLUTION
‣ Experience replay
Matiisen, Tambet. Demystifying Deep Reinforcement Learning. Computational Neuroscience Lab, 2015.
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature, 2015.
[Diagram] An episode {s_1, a_1, r_1, s_2, …, s_{T−1}, a_{T−1}, r_{T−1}, s_T} is broken into individual experiences, stored in a replay buffer, and mini-batches are sampled at random from the buffer for training.
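A minimal replay buffer along those lines; this is a generic sketch (names and capacity are assumptions), not the repository's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s', done) and samples random mini-batches,
    breaking the correlation between consecutive experiences."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```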
16. DQN
SOLUTION
‣ Experience replay
Matiisen, Tambet. Demystifying Deep Reinforcement Learning. Computational Neuroscience Lab, 2015.
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature, 2015.
[Diagram] Same replay-buffer diagram as the previous slide: episodes are broken into experiences, stored in a buffer, and sampled for training.
‣ Fixed Q-targets
[Objective function]
L_i(θ_i) = E_{s,a,r,s′}[ (r + γ max_{a′} Q̂(s′, a′; θ_i^−) − Q(s, a; θ_i))^2 ]   (target computed with a separate, periodically updated target network Q̂)
L_i(θ_i) = E_{s,a,r,s′}[ (r + γ max_{a′} Q(s′, a′; θ_i) − Q(s, a; θ_i))^2 ]   (original: the same network produces both terms)
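To make the fixed-target idea concrete, a hedged PyTorch-style sketch (variable names and the sync schedule are my own assumptions):

```python
import copy
import torch
import torch.nn as nn

def make_target_network(q_net):
    """Create a frozen copy of the online network. Its parameters θ⁻ are only
    refreshed periodically, so the TD target stays stationary between syncs."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def dqn_loss_with_target(q_net, target_net, batch, gamma=0.99):
    """L(θ) = E[(r + γ max_a' Q̂(s', a'; θ⁻) − Q(s, a; θ))^2]."""
    states, actions, rewards, next_states, dones = batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values  # uses θ⁻, not θ
        target = rewards + gamma * q_next * (1.0 - dones)
    return nn.functional.mse_loss(q_pred, target)

# Every N training steps, copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```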
31. DOUBLE Q-LEARNING
MOTIVATION
‣ The problem with DQN:
van Hasselt H., Guez A. and Silver D. Deep reinforcement learning with double Q-learning, AAAI, 2015
van Hasselt H., Double Q-learning, NIPS, 2011
Q(s, a) = r(s, a) + γ max_a Q(s′, a)
(Q-target = accumulated reward + maximum Q-value of the next state)
Overestimating the action values. What if the environment is noisy?
32. DOUBLE Q-LEARNING
MOTIVATION
‣ The problem with DQN:
van Hasselt H., Guez A. and Silver D. Deep reinforcement learning with double Q-learning, AAAI, 2015
van Hasselt H., Double Q-learning, NIPS, 2011
Q(s, a) = r(s, a) + γ max_a Q(s′, a)
(Q-target = accumulated reward + maximum Q-value of the next state)
Overestimating the action values.
‣ The fix:
Q(s, a) = r(s, a) + γ Q(s′, argmax_a Q(s′, a))
(the online DQN network chooses the action for the next state)
What if the environment is noisy?
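In the Double DQN paper the online network picks the argmax action while the target network evaluates it; a short sketch of that target computation (assumed PyTorch interface and names):

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: a* = argmax_a Q(s', a; θ) is chosen by the online network,
    and Q̂(s', a*; θ⁻) is evaluated by the target network. Decoupling selection from
    evaluation reduces the overestimation caused by noisy action values."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # a* from θ
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # value from θ⁻
        return rewards + gamma * q_eval * (1.0 - dones)
```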
37. DUELING DQN
MOTIVATION
Q(s, a) = V(s) + A(s, a)
[Q-value decomposition: state value V(s) plus advantage value A(s, a)]
‣ Adds information to the value of the current state in the form of a comparative value
‣ The difference in values (the advantage) → faster learning
With a single Q-value, an update reflects only the value of the chosen action; the other actions stay as they are.
The advantage expresses how much better each action is compared with the chosen one.
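A minimal dueling head in PyTorch (layer sizes and names are assumptions). It also applies the mean-advantage subtraction from the Dueling DQN paper, which keeps V and A identifiable:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) − mean_a A(s, a)."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.value_head = nn.Linear(128, 1)                 # V(s)
        self.advantage_head = nn.Linear(128, num_actions)   # A(s, a)

    def forward(self, state):
        h = self.feature(state)
        value = self.value_head(h)                          # shape (B, 1)
        advantage = self.advantage_head(h)                  # shape (B, num_actions)
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```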
51. A Motivating Example
Two actions: 'right (→→)' and 'wrong (→)'
The environment requires an exponential number of random steps until the first non-zero reward.
The most relevant transitions are hidden in a mass of highly redundant failure cases.
54. Weakness
A transition with a low TD-error on its first visit may not be replayed for a long time
PER with TD-error priorities is sensitive to noise spikes
Greedy prioritization focuses on a small subset of the experience
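The PER paper addresses these weaknesses with stochastic prioritization, P(i) ∝ (|TD-error_i| + ε)^α, instead of greedy replay. A simplified proportional-sampling sketch in plain Python/NumPy (no sum-tree and no importance-sampling weights, which the full method uses):

```python
import numpy as np

class ProportionalReplay:
    """Simplified prioritized replay: sample transition i with P(i) ∝ (|δ_i| + eps)^alpha.
    Stochastic sampling keeps low-priority transitions reachable, unlike greedy replay."""
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.transitions, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current maximum priority so they are replayed at least once.
        max_prio = max(self.priorities, default=1.0)
        if len(self.transitions) >= self.capacity:
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(max_prio)

    def sample(self, batch_size):
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.transitions), batch_size, p=probs)
        return idx, [self.transitions[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```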
83. NoisyNet
• A noisy linear layer with p inputs and q outputs
• Independent Gaussian noise
• Uses an independent Gaussian noise entry per weight: pq + q noise variables
• Factorised Gaussian noise
• Uses an independent noise variable per input and per output: p + q noise variables
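A sketch of how the factorised noise is generated for one noisy linear layer, following the f(x) = sgn(x)·√|x| scaling from the NoisyNet paper (function and tensor names are my own):

```python
import torch

def scale_noise(size):
    """f(x) = sgn(x) * sqrt(|x|), the scaling used for factorised noise in NoisyNet."""
    x = torch.randn(size)
    return x.sign() * x.abs().sqrt()

def factorised_noise(p, q):
    """Draw only p + q noise variables for a p-input, q-output layer and combine them
    with an outer product, instead of the pq + q variables needed for independent noise."""
    eps_in = scale_noise(p)                    # p noise variables
    eps_out = scale_noise(q)                   # q noise variables
    weight_eps = torch.outer(eps_out, eps_in)  # (q, p) noise for the weight matrix
    bias_eps = eps_out                         # (q,) noise for the bias
    return weight_eps, bias_eps
```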
105. Expected RL
If we look at the reward as a random variable…
§ The value function returns the expected value of the discounted future reward.
§ An expected value is a scalar (o), not a distribution (x).
§ Future reward values are complex and multimodal.
§ The expected value cannot capture the intrinsic characteristics of the individual rewards.
E[R(x)] = (35/36) × 200 − (1/36) × 1,800 ≈ 144
111. Distributional RL
§ Expected RL → Distributional RL
§ Build a value distribution over the return.
§ C51 = a categorical (discrete) distribution
§ The distribution is built from 51 bins (atoms).
A Distributional Perspective on Reinforcement Learning (C51)
https://arxiv.org/abs/1707.06887
112. Distributional RL
§ Distributional Bellman equation
§ Cf.) Bellman equation: Q(x, a) = R(x, a) + γ Q(x′, a′)
§ Z(s, a) denotes a distribution, and it is used to construct the return distribution.
Q(s, a) = E[Z(s, a)] = Σ_{i=0}^{N} p_i x_i
A Distributional Perspective on Reinforcement Learning (C51)
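A tiny sketch of how the scalar Q-value is recovered from the categorical return distribution (the support bounds and shapes below are illustrative assumptions):

```python
import torch

def expected_q(probabilities, support):
    """Q(s, a) = E[Z(s, a)] = Σ_i p_i x_i, where `support` holds the fixed atom values x_i
    and `probabilities` has shape (batch, num_actions, num_atoms)."""
    return (probabilities * support).sum(dim=-1)          # -> (batch, num_actions)

# Illustrative usage: 51 atoms spread between assumed bounds V_min = -10 and V_max = 10.
support = torch.linspace(-10.0, 10.0, 51)
probs = torch.softmax(torch.randn(4, 6, 51), dim=-1)      # dummy batch: 4 states, 6 actions
q_values = expected_q(probs, support)
```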
118. Distributional RL
Distributional DQN
1. Build a value distribution over the return (51 bins).
2. At every step, compute the distance between the value distributions.
→ The paper defines this distance theoretically with the Wasserstein distance, but in the experiments it is computed with the KL divergence.
3. Compute the loss between the distributions with cross-entropy.
A Distributional Perspective on Reinforcement Learning (C51)
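A heavily simplified sketch of that cross-entropy step, assuming the target distribution has already been projected onto the fixed 51-atom support (the projection itself is omitted, and none of the names here come from the tutorial code):

```python
import torch

def c51_cross_entropy_loss(pred_logits, target_probs):
    """Cross-entropy between the predicted return distribution and an already projected
    target distribution.
    pred_logits:  (batch, num_atoms) logits for the chosen actions
    target_probs: (batch, num_atoms) projected target probabilities (each row sums to 1)"""
    log_p = torch.log_softmax(pred_logits, dim=-1)
    return -(target_probs * log_p).sum(dim=-1).mean()
```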