A DRL (Deep Reinforcement Learning) challenge on Montezuma's Revenge is presented. The score and the rooms reached with A3C exceed those of DeepMind. This is an English translation of my Japanese slides plus some updates. (updated 2017/7/22)
I changed the HTTP server. See the following for the results of the experiment: http://35.197.57.214/
(I'd like to update the slides, but the re-upload function has already been removed from SlideShare)
1. 20170722
Training long history on real reward and diverse hyper
parameters in threads combined with DeepMind’s A3C+
Takayoshi Iitsuka
The Whole Brain Architecture Initiative
a specified non-profit organization, Japan
1983-2003: Compiler researcher for Hitachi's computers (mainly supercomputers)
2003-2015: Strategy and Planning Departments of several divisions (Cloud Service, etc.)
2015/9 : Took early retirement from Hitachi with additional payment
2016/2-12 : Caught up with the latest IT including Deep Learning
2016/10 : Reached the top position in OpenAI Gym (Montezuma's Revenge) and kept it until 2017/3
2016/10 : Returned to Hitachi as a contract employee (my work is not related to AI)
1
2. 20170722
Table of Contents
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
2
4. 20170722
Deep Reinforcement Learning (DRL)
(Diagram) Agent (predicts the best Action by Deep Learning) <-> Environment (game emulator etc.)
Agent -> Environment: Action
Environment -> Agent: State (e.g. the screen image after the Action) and Reward (e.g. the score obtained by the Action)
The Agent predicts the best Action from the State by Deep Learning
The Environment returns the State and Reward as the result of the Action
The Agent updates its internal Neural Network based on the Reward
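As an illustration of this loop, here is a minimal Python sketch assuming a gym-style environment API (reset/step); the agent below is only a random placeholder for a real neural-network agent.

import gym

class RandomAgent:
    # Stands in for the DRL Agent; a real Agent would predict the Action with a neural network
    def __init__(self, action_space):
        self.action_space = action_space
    def predict_best_action(self, state):
        return self.action_space.sample()          # placeholder for the network's prediction
    def update_network(self, state, action, reward, next_state):
        pass                                       # placeholder for the update based on Reward

env = gym.make("MontezumaRevenge-v0")              # Environment (game emulator)
agent = RandomAgent(env.action_space)
state = env.reset()
done = False
while not done:
    action = agent.predict_best_action(state)                 # Agent: State -> Action
    next_state, reward, done, info = env.step(action)         # Environment: Action -> State, Reward
    agent.update_network(state, action, reward, next_state)   # learn from the Reward
    state = next_state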
4
5. 20170722
Score of DRL in Atari 2600 games
DRL reached human-level scores in
more than half of the Atari 2600 games
(Deep Q-Network, DeepMind 2015)
But games with poor scores still remained
One of the hardest games for DRL
was "Montezuma's Revenge"
(until DeepMind submitted a very effective paper to arXiv
in June 2016; I did not notice the paper until late August)
I started my challenge on DRL of
"Montezuma's Revenge" at the
beginning of August as a hobby
[DRL] https://deepmind.com/blog/deep-reinforcement-learning/
[My blog (in Japanese)] http://itsukara.hateblo.jp/
[My github] https://github.com/Itsukara/async_deep_reinforce
(Chart: per-game DQN scores with the "Human Level or Above" line; Montezuma's Revenge, with the player character Joe, is near the bottom)
5
6. 20170722
Why so hard?
So many kill-points => hard to go forward
Little chance to get a reward => little chance to learn
Reward chance by random actions (first 1M steps):
Name of game          | # of game overs | # with non-zero score | Reward chance
Breakout              | 5440            | 4202                  | 77.3%
Montezuma's Revenge   | 2323            | 1                     | 0.043%
6
7. 20170722
Simple countermeasures and their results
So many kill-points
[measure] Give a negative reward when Joe is killed, to avoid kill-points
[result] Joe does not approach kill-points and cannot get past them
Little chance to get a reward
[measure] Give a basic-income reward to promote learning
(provide a constant reward at every step or periodically)
[result] Joe stays in one place forever
Additionally, there is no motivation to get past kill-points
[measure] Combination of both (basic income after a kill-point may be attractive)
[result] Joe stays in one place and cannot go forward
=>
Reward is important for training. But, at the same time, some kind
of motivation to move and get past kill-points is necessary. For that
purpose, the reward should be decreased when visiting the same place
many times or making the same action many times.
7
9. 20170722
DeepMind's paper
I had been publishing information about my DRL experiments with
Montezuma's Revenge on my blog and Twitter
The author of the A3C reproduction code which I was using read
my blog and told me about DeepMind's new
paper "Unifying Count-Based Exploration and Intrinsic
Motivation" (Bellemare et al., June 2016) via a Twitter message
Reading the abstract of the paper, I realized that what I
wanted in rewarding was described in the paper under the name
"pseudo-reward based on pseudo-count"
They applied pseudo-counts to Montezuma's Revenge
and got good results (the average score after 100M steps of
training with Double DQN is 3439; that with A3C is 273)
9
10. 20170722
Key idea
Although there is a simple method to count the number of occurrences of a game
state, i.e. binary comparison of game states, it is not effective when the
probability of a game state is very small or zero. E.g., what is the probability of
(SUN, LATE, BUSY) after the following observations? Just zero?
Key idea:
ρ = 1/10 * 1/10 * 9/10 (= 0.009) looks natural as the probability of (SUN, LATE, BUSY)
After an observation of (SUN, LATE, BUSY),
it will become ρ' = 2/11 * 2/11 * 10/11 (= 0.03)
(The paper names ρ' the "recoding probability")
day# | Weather | Time-of-day | Crowdedness
1    | SUN     | LATE        | QUIET
2    | RAIN    | EARLY       | BUSY
3    | RAIN    | EARLY       | BUSY
4    | RAIN    | EARLY       | BUSY
5    | RAIN    | EARLY       | BUSY
6    | RAIN    | EARLY       | BUSY
7    | RAIN    | EARLY       | BUSY
8    | RAIN    | EARLY       | BUSY
9    | RAIN    | EARLY       | BUSY
10   | RAIN    | EARLY       | BUSY
10
11. 20170722
Pseudo-count
When the data space S is the direct product of multiple sub-data-spaces S1, S2, ...,
SM (in the previous slide: Weather, Time-of-day, Crowdedness), the probability of a
sample D = (d1, ..., dM) in S is the product of the probabilities of d1, ..., dM in each of
S1, S2, ..., SM (assumption: the spaces are independent)
For each Si, when the number of occurrences of a sample di is N and the
number of observations is n, ρ and ρ' can be calculated by definition:
ρ = N/n
ρ' = (N + 1)/(n + 1)
From the above equations, N can be calculated as follows from ρ and ρ':
N = ρ(1 − ρ')/(ρ' − ρ) ≒ ρ/(ρ' − ρ) (when ρ' << 1)
ρ (and ρ') of D can be calculated as the product of ρ (and ρ') over S1, S2, ..., SM
So N (the number of occurrences) of D can be calculated from ρ and ρ' of D
The paper names N the "pseudo-count".
In the previous slide, ρ = 1/10 * 1/10 * 9/10 = 0.009 and ρ' = 2/11 * 2/11 * 10/11 = 0.03.
So the pseudo-count is N = 0.009/(0.03 − 0.009) = 0.42 (not 0 and < 1: looks reasonable)
Notice: the above explanation is much simplified. See the DeepMind paper for details.
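The numbers above can be reproduced with a few lines of Python; this is just a worked check of the formulas, not the paper's implementation.

n = 10                                      # number of observed days
counts = {"SUN": 1, "LATE": 1, "BUSY": 9}   # per-factor occurrence counts N (see the table)
rho, rho2 = 1.0, 1.0
for N in counts.values():
    rho *= N / n                            # rho  = 1/10 * 1/10 * 9/10  = 0.009
    rho2 *= (N + 1) / (n + 1)               # rho' = 2/11 * 2/11 * 10/11 ≈ 0.030
pseudo_count = rho * (1 - rho2) / (rho2 - rho)   # ≈ rho / (rho' - rho) ≈ 0.42
print(rho, rho2, pseudo_count)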
11
12. 20170722
Utilization in DRL: Pseudo-Reward
For every pixel of a game screen x, calculate ρ and ρ'
Calculate the product over all pixels of ρ and of ρ' => these are ρ and ρ' of x
Calculate N(x) (the pseudo-count of x) from ρ and ρ' of x
Calculate R(x) (the Pseudo-Reward of the screen x) as follows:
R(x) = β / (N(x) + 0.01)^(1/P)
The bigger N(x) is, the smaller R(x) is => smaller for high-occurrence screens
0.01 has no special meaning (it is just to avoid division by zero)
P was selected by experiment (P = 2 and P = 1 were tried)
P = 2 both in Double DQN and A3C
β was selected from a short parameter sweep
β = 0.05 in Double DQN, β = 0.01 in A3C
=> R(x) ≒ β/√N(x)
"Real-Reward + R(x)" is used as the Reward for training
(it is not used as the score of the game)
This gives a motivation to extend the exploration of states in DRL
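A direct transcription of this formula as a small sketch (the constants follow this slide; the function name is mine):

def pseudo_reward(pseudo_count, beta=0.01, p=2.0):
    # R(x) = beta / (N(x) + 0.01)^(1/P); beta = 0.01 and P = 2 were used with A3C
    return beta / (pseudo_count + 0.01) ** (1.0 / p)

print(pseudo_reward(0.42))    # larger reward for a rarely seen screen
print(pseudo_reward(1000.0))  # much smaller reward for a frequently seen screen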
12
13. 20170722
Result: Double DQN + Pseudo-Reward
Evaluated in 5 games; it was effective in the following games
In Montezuma's Revenge, the set of reached rooms was extended
(Figure: map of reached rooms) One room was the most important in
DeepMind's evaluation (confirmed with Bellemare),
because Joe can get 3,000 points only in that room
13
14. 20170722
Result: A3C + Pseudo-Reward (A3C+)
Evaluated in 60 games. The number of low-score
games was reduced (low-score: score less than 150%
of random actions; pink cells in the following table)
Not so good (273.7) in Montezuma's Revenge
Columns: game; "X" = score < 150% of random, for A3C / A3C+ / DQN;
raw scores (Stochastic-ALE: A3C, A3C+; Deterministic-ALE: A3C, A3C+; Random; Human);
scores normalized between Random (0%) and Human (100%) (Stochastic-ALE: A3C, A3C+, DQN; Deterministic-ALE: A3C, A3C+, DQN)
1 ASTEROIDS X 2680.7 2257.9 3946.2 2406.6 719.1 47388.7 4% 3% 0% 7% 4% 0%
2 BATTLE-ZONE X 3143.0 7429.0 3393.8 7969.1 2360.0 37187.5 2% 15% 41% 3% 16% 45%
3 BOWLING X 32.9 68.7 35.0 76.0 23.1 160.7 7% 33% 4% 9% 38% 5%
4 DOUBLE-DUNK X X 0.5 -8.9 0.2 -7.8 -18.6 -16.4 870% 442% 320% 854% 489% 210%
5 ENDURO X 0.0 749.1 0.0 694.8 0.0 860.5 0% 87% 40% 0% 81% 51%
6 FREEWAY X 0.0 27.3 0.0 30.5 0.0 29.6 0% 92% 103% 0% 103% 102%
7 GRAVITAR X X X 204.7 246.0 201.3 238.7 173.0 3351.4 1% 2% -4% 1% 2% 1%
8 ICE-HOCKEY X X -5.2 -7.1 -5.1 -6.5 -11.2 0.9 49% 34% 12% 50% 39% 7%
9 KANGAROO X 47.2 5475.7 46.6 4883.5 52.0 3035.0 0% 182% 138% 0% 162% 198%
10 MONTEZUMA'S-REVENGE X 0.1 142.5 0.2 273.7 0.0 4753.3 0% 3% 0% 0% 6% 0%
11 PITFALL X X X -8.8 -156.0 -7.0 -259.1 -229.4 6463.7 3% 1% 2% 3% 0% 2%
12 ROBOTANK X 2.1 6.7 2.2 7.7 2.2 11.9 -1% 46% 501% 0% 56% 395%
13 SKIING X X X -23670.0 -20066.7 -20959.0 -22177.5 -17098.1 -4336.9 -51% -23% -73% -30% -40% -85%
14 SOLARIS X X 2157.0 2175.7 2102.1 2270.2 1236.3 12326.7 8% 8% -4% 8% 9% 5%
15 SURROUND X X X -7.8 -7.0 -7.1 -7.2 -10.0 6.5 13% 18% 7% 18% 17% 11%
16 TENNIS X X X -12.4 -20.5 -16.2 -23.1 -23.8 -8.9 76% 22% 73% 51% 5% 106%
17 TIME-PILOT X X X 7417.1 3816.4 9000.9 4103.0 3568.0 5925.0 163% 11% -32% 231% 23% 21%
18 VENTURE X X 0.0 0.0 0.0 0.0 0.0 1188.0 0% 0% 5% 0% 0% 0%
14X 10X 10X 15X 14X 14X 16X 14X 13X
Notice: the above table was created from the data in the paper
14
16. 20170722
I tried A3C+ => Why?
I already had an A3C environment and had been
trying Montezuma's Revenge in this environment
The training speed (steps per second) of A3C is very fast
So I thought that I could verify the effect of pseudo-reward
based on pseudo-count very quickly
The paper provides results for only a few games with
Double DQN. I felt that the reason might be that the time for
tuning or evaluation with Double DQN is too long; it might
consume much time to get good results with Double DQN.
16
17. 20170722
First trial: better than A3C+
By incorporating the pseudo-reward into my code,
I got a very good result on the first trial
It was better than the result of A3C+ (273.7)
(Chart: training score curve; the horizontal line marks DeepMind's A3C+ score)
17
18. 20170722
Effect of my original code
To evaluate precisely, I turned OFF my original code which I
had incorporated in past trials => bad score (around 100 points)
By turning my original code back ON, the score went up
(Chart: score curve; the point where my original code was switched from OFF to ON is marked)
18
19. 20170722
My original code
My original code contained several functions:
Training Long History on Real Reward (TLHoRR)
Inspired by the reinforcement of learning with dopamine in the human
brain. In this analogy, a Real-Reward is a very valuable event in the brain,
and TLHoRR strongly trains the neural network like dopamine does
Give a negative reward when Joe is killed
Increase the randomness of actions when the no-reward time is long
Only TLHoRR was effective
My code now contains many hyper parameters. I feel it is very difficult
to find the best parameters because there are so many of them:
The length of history to train (various values tried)
β and P in the calculation of pseudo-reward (various values tried)
Learning algorithm (A3C-ff and A3C-lstm tried)
The number of skipped frames (4 looks best for ALE; 2 looks best for OpenAI Gym)
Color conversion scheme (average/max/last of the skipped frames; max looks best)
"save thread0's pseudo-count and all threads use it when restored" or all save and all restore
Bits per pixel value (DeepMind used 3; 7 looks best for my code)
Have pseudo-count data for each room, or have one set of data for all rooms
...
19
20. 20170722
Structure of Neural Network (NN) for DRL
(Diagram of the network)
Input: last 4 screen images, scaled to 84x84
-> Convolution 8x8x16, stride 4
-> Convolution 4x4x32, stride 2
-> Fully connected -> 256
-> Fully connected -> 18 (Action) and -> 1 (Value)
Predict the best Action and the Value from the last 4 screen images
(Value: the predicted sum of Rewards obtained until game over)
The Reward is used to correct the prediction of the best Action and the Value
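A minimal sketch of this network, written with PyTorch purely for illustration (the layer sizes follow the diagram above; the actual repository may use a different framework):

import torch
import torch.nn as nn
import torch.nn.functional as F

class A3CNet(nn.Module):
    def __init__(self, num_actions=18):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # 4x84x84 -> 16x20x20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # 16x20x20 -> 32x9x9
        self.fc = nn.Linear(32 * 9 * 9, 256)                     # fully connected -> 256
        self.policy = nn.Linear(256, num_actions)                # -> 18 (Action)
        self.value = nn.Linear(256, 1)                           # -> 1 (Value)

    def forward(self, x):                  # x: (batch, 4, 84, 84), the last 4 screen images
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.fc(h.flatten(start_dim=1)))
        return F.softmax(self.policy(h), dim=1), self.value(h)

pi, v = A3CNet()(torch.zeros(1, 4, 84, 84))   # action probabilities and predicted Value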
20
21. 20170722
A3C: Asynchronous Advantage Actor-Critic
Gradients (dθ) are calculated as shown on the next slides
Gradients are asynchronously accumulated into the Global Network (θ)
The Global Network (θ) is periodically written back to the Local Networks
(Diagram) Local networks (thread0, thread1, ..., threadN) each calculate gradients dθ;
the Global network accumulates them (θ is updated by dθ)
and is periodically written back to the Local networks
21
22. 20170722
Calculation of Gradients (dθ) in A3C+
# Play 5 steps
for i = 0 to 4:
    predict the best Action A_t and perform it
    get Reward r_t and new State s_{t+1}
    t += 1
R = V_t if not game over else 0
# Calculate dθ from the history of the last 5 steps (backward propagation)
for i = 0 to 4:
    R = r_{t-i} + d * R        (d is the discount factor)
    dθ += gradient for (s_{t-i}, A_{t-i}) with target R
22
23. 20170722
Calculation of Gradients (dθ) in TLHoRR
# Play 5 steps
for i = 0 to 4:
    predict the best Action A_t and perform it
    get Reward r_t and new State s_{t+1}
    t += 1
R = V_t if not game over else 0
# Training Long History on Real Reward (TLHoRR)
T = 180 if a Real Reward is included in the last 5 steps else 4
# Pseudo Reward only => last 5 steps : learn from the last ~0.3 seconds of the game
# Real Reward included => last ~180 steps : learn from the last ~12 seconds of the game
# Calculate dθ from the history of the last T steps (backward propagation)
for i = 0 to T:
    R = r_{t-i} + d * R        (d is the discount factor)
    dθ += gradient for (s_{t-i}, A_{t-i}) with target R
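A rough Python rendering of this horizon selection, assuming a hypothetical per-thread history of (state, action, reward, is_real_reward) steps:

from collections import namedtuple

Step = namedtuple("Step", "state action reward is_real_reward")

def backprop_horizon(history, long_T=180, short_T=4):
    # Use the long horizon only when a real game reward appeared in the last 5 steps
    return long_T if any(s.is_real_reward for s in history[-5:]) else short_T

def n_step_returns(history, bootstrap_value, discount=0.99):
    # Accumulate discounted returns backwards over the chosen horizon (TLHoRR)
    T = backprop_horizon(history)
    R = bootstrap_value                       # V_t, or 0 if the game is over
    returns = []
    for step in reversed(history[-(T + 1):]):
        R = step.reward + discount * R        # step.reward includes the pseudo-reward
        returns.append((step, R))             # each pair later feeds one gradient term
    return returns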
23
24. 20170722
Effect of TLHoRR with A3C+ in ALE
Average score approached 2000 (2016/10/6)
Joe could not get over the laser barriers, so he could not get the additional 3,000 points.
(Screenshot: the laser barriers)
24
25. 20170722
Strange behavior of JOE
JOE looked as if he was captured by the ghost of a successful experience
(see the sequence a–d below).
This happens because the Value at step d is kept very high:
At step b (#2), the reward is provided after the SWORD disappears,
so the screen image at step b is the same as that at step d (*1).
So at step d, JOE thinks there is a reward.
Additionally, the Value at step d is not decreased by learning, because the reward
at #2 (step b) is backward propagated to itself through the loop of states
(#1 -> #1 -> #2). => The Values of the states in this loop are kept very high.
(*1) Actually, the number of monsters (#M) changes (2->1 or 1->0).
But the state at step b when #M=1 is the same as that at step d when #M=1.
That means this game does not obey a Markov process.
25
(Screenshot: rooms #1 and #2)
a. Come from the left of #1 and go down the stairs
b. Arrive at #2 and get a reward by taking the SWORD
c. Return to #1 and get a reward by killing a monster with the SWORD (1:00)
d. Return to #2 and stay there forever (1:00 – 5:00)
(looks like waiting for the ghost of the SWORD)
26. 20170722
Effect of TLHoRR with A3C+ in OpenAI Gym
Average score exceeded 1600
Reached 6 rooms which DeepMind did not reach
Movie reaching rooms 3, 8, 9: https://youtu.be/qOyFLCK8Umw
Movie reaching rooms 18, 19: https://youtu.be/jMDhb-Toii8
Movie reaching rooms 19, 20: https://youtu.be/vwkIg1Un7JA
26
27. 20170722
Diverse Hyper Parameters in Thread (DHPT)
In DHPT, the length of history in TLHoRR and β and P (in the
calculation of pseudo-reward) are changed in each thread.
(Charts)
Same Hyper Parameters in every thread: the score went down to 0
and did not recover from it
(the best action in the start room was lost and could not be learned
again, because the Pseudo-Reward in the start room is almost 0)
Diverse Hyper Parameters in Thread (DHPT): the score went down to 0,
but recovered from it
Details at http://52.199.15.161/OpenAIGym/montezuma-x1/00index.html
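A minimal sketch of how per-thread hyper parameters could be diversified for DHPT (the value lists below are illustrative, not the ones actually used):

HISTORY_LENGTHS = [60, 120, 180, 240]   # TLHoRR history length per thread
BETAS = [0.005, 0.01, 0.02]             # beta of the pseudo-reward per thread
PS = [1.0, 2.0]                         # P of the pseudo-reward per thread

def hyper_params_for_thread(thread_index):
    return {
        "tlhorr_history": HISTORY_LENGTHS[thread_index % len(HISTORY_LENGTHS)],
        "psc_beta": BETAS[thread_index % len(BETAS)],
        "psc_pow": PS[thread_index % len(PS)],
    }

for i in range(8):                      # e.g. 8 parallel A3C threads
    print(i, hyper_params_for_thread(i))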
28. 20170722
Frame skip in OpenAI Gym
In the ALE environment, the screen image after the same Action is
repeated 4 times (frame skip = 4) is used for learning
But in OpenAI Gym, the number of skipped frames is determined by
OpenAI Gym itself, by a uniform random number between 2 and 4
This randomness prevented learning in OpenAI Gym
I resolved this issue by calling the OpenAI Gym environment
twice with the same Action (result: the frame skip becomes a roughly
Gaussian distribution with an average of 7)
I believe this proper amount of randomness helped to break through the
laser barrier
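A minimal sketch of the double-step idea, assuming the classic gym step API (the helper name is mine, not the repository's):

import gym

def double_step(env, action):
    # Call the environment twice with the same Action and sum the Rewards,
    # so the effective frame skip is the sum of two random skips
    obs, r1, done, info = env.step(action)
    if done:
        return obs, r1, done, info
    obs, r2, done, info = env.step(action)
    return obs, r1 + r2, done, info

env = gym.make("MontezumaRevenge-v0")
obs = env.reset()
obs, reward, done, info = double_step(env, env.action_space.sample())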
28
29. 20170722
Effect of TLHoRR with A3C+ in ALE (again)
I retried TLHoRR + DHPT with A3C+ in ALE, setting frame
skip = 7, because 7 is relatively prime to 60 (the frame rate in ALE)
and looks to contribute to extending the exploration of game states
This enabled breaking through the laser barrier
29
31. 20170722
Conclusion
Pseudo-count is effective for games with little chance to get a reward
TLHoRR is useful to get a good score with A3C+
DHPT is effective for the stability of training with A3C+
6 rooms were newly visited by TLHoRR + DHPT with A3C+
Related Information
Blog (in Japanese) : http://itsukara.hateblo.jp/
Code : https://github.com/Itsukara/async_deep_reinforce
OpenAI Gym Result : https://gym.openai.com/evaluations/eval_e6uQIveRRVZHz5C2RSlPg
Top position in Montezuma's Revenge from 2016/10 to 2017/3
Acknowledgment
I would like to thank Mr. Miyoshi for providing very fast A3C code
31
33. 20170722
Future Directions
Random search of best Hyper Parameters using a large
amount of IT resources
Combination of TLHoRR and DHPT with other methods
(Replay Memory, UNREAL, EWC, DNC, ...: all from DeepMind)
Building and utilizing a maze map (like humans do)
Learning with color screen images (like humans do)
33
35. 20170722
Appendix 1 Details of My pseudo-reward (data structure)
Data structure (with initial values)
When having a pseudo-count per room, each thread has the following data:
psc_vcount = np.zeros((24, maxval + 1, frsize * frsize), dtype=np.float64)
24 is the number of rooms in Montezuma's Revenge
Currently it is a constant.
In the future, the currently played room and the connection structure of rooms
should be detected automatically.
This will be useful to evaluate the value of exploration.
The value of exploration can be used as an additional reward.
maxval is the max pixel value used in the pseudo-count
Can be changed by option. Default: 128
The real pixel value is scaled to fit this maxval
frsize is the size of the image used in the pseudo-count
Can be changed by option. Default: 42
The game screen is scaled to fit the image size (frsize * frsize)
When having one pseudo-count for all rooms, each thread has the following data:
psc_vcount = np.zeros((maxval + 1, frsize * frsize), dtype=np.float64)
The two cases above can be selected by option
The order of the dimensions is important for good memory locality:
if the dimension for the pixel value comes last, training performance decreases by
roughly 20%, because pixel values are sparse and cause many cache misses.
35
36. 20170722
Appendix 1 Details of My pseudo-reward (algorithm)
Algorithm (the variant with one pseudo-count for all rooms is omitted here)
vcount = psc_vcount[room_no, psc_image, range_k]
This is not a scalar; the fancy indexing returns a temporary array (a copy)
room_no is the index of the room currently being played
psc_image is the screen image scaled to size (frsize * frsize) with pixel values up to maxval
range_k = np.array([i for i in range(frsize * frsize)]) (calculated at initialization)
psc_vcount[room_no, psc_image, range_k] += 1.0
The count of each occurring pixel value is incremented
r_over_rp = np.prod(nr * vcount / (1.0 + vcount))
ρ / ρ' for each pixel is calculated, and then ρ / ρ' for the whole screen image
ρ / ρ' = {N/n} / {(N+1)/(n+1)} = nr * N / (1.0 + N) = nr * vcount / (1.0 + vcount)
nr = (n + 1.0) / n, where n is the number of observations, counted from initialization
psc_count = r_over_rp / (1.0 - r_over_rp)
This is the pseudo-count. As is easily confirmed, r_over_rp / (1.0 - r_over_rp) = ρ/(ρ' − ρ)
ρ/(ρ' − ρ) is not calculated directly,
because both ρ' and ρ are very small and the calculation error in ρ' − ρ would become big
psc_reward = psc_beta / math.pow(psc_count + psc_alpha, psc_rev_pow)
This is the pseudo-reward calculated from the pseudo-count
psc_beta = β, and it can be changed by option in each thread
psc_rev_pow = 1/P; P is a float value and can be changed by option in each thread
psc_alpha = math.pow(0.1, P); so
math.pow(psc_count + psc_alpha, psc_rev_pow) = 0.1 for any P when psc_count is almost 0
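Assembling the steps above into one function, as a sketch only (the variable names follow this slide; scaling of the raw screen into psc_image is assumed to happen elsewhere):

import math
import numpy as np

def update_and_get_pseudo_reward(psc_vcount, room_no, psc_image, range_k, n,
                                 psc_beta=0.01, P=2.0):
    # psc_image: scaled screen, shape (frsize*frsize,), integer pixel values up to maxval
    vcount = psc_vcount[room_no, psc_image, range_k]       # counts of the current pixel values
    psc_vcount[room_no, psc_image, range_k] += 1.0          # increment the occurrence counts
    nr = (n + 1.0) / n                                      # n = number of observations so far
    r_over_rp = np.prod(nr * vcount / (1.0 + vcount))       # rho / rho' of the whole screen
    psc_count = r_over_rp / (1.0 - r_over_rp)               # pseudo-count = rho / (rho' - rho)
    psc_alpha = math.pow(0.1, P)
    psc_rev_pow = 1.0 / P
    return psc_beta / math.pow(psc_count + psc_alpha, psc_rev_pow)   # pseudo-reward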
36
37. 20170722
Appendix 2 Visualization of Pseudo-Count
37
(Figures: maps of the most frequent, 2nd most frequent and 3rd most frequent pixel values,
at 3M steps and at 45M steps)
The picture of the most frequent pixels looks like the image of the first room.
The pictures of the 2nd and 3rd most frequent pixels look like traces of JOE's motion.
Pictures of several rooms are intermixed in the pictures of the 2nd and 3rd most frequent pixels.
=> It might be better to have a pseudo-count for each room independently.
I tried this and it looks promising.
38. 20170722
Appendix 3 Real-time visualization of training
*.r: Real reward (all scores and moving average)
*.R: Frequency of visits to each room
*.RO: Frequency of TLHoRR in each room
*.lives: Number of LIVES when TLHoRR occurs
*.k: Frequency of KILLs in each room
*.tes: Length of the TLHoRR history for each score
*.s: The number of steps until getting a real reward
*.prR: Pseudo-Reward in each room (all PR and moving average)
*.vR: Values in each room (all Values and moving average)