20170722
Training long history on real reward and diverse hyper
parameters in threads combined with DeepMind’s A3C+
Takayoshi Iitsuka
The Whole Brain Architecture Initiative
a specified non-profit organization, Japan
1983-2003: Compiler researcher for Hitachi's computers (mainly supercomputers)
2003-2015: Strategy and Planning Department of several divisions (Cloud Service, etc.)
2015/9 : Took early retirement from Hitachi with additional payment
2016/2-12 : Caught up with the latest IT, including Deep Learning
2016/10 : Got the top position in OpenAI Gym (Montezuma's Revenge) and kept it until 2017/3
2016/10 : Returned to Hitachi as a contract employee (my work is not related to AI)
1
20170722
Table of Contents
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
2
20170722
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
3
20170722
Deep Reinforcement Learning (DRL)
[Diagram: Agent (predicts the best Action by Deep Learning) <-> Environment (game emulator etc.);
the Agent sends an Action, the Environment returns the State (screen image after the Action, etc.)
and the Reward (score obtained by the Action, etc.)]
 Agent predicts the best Action from the State by Deep Learning
 Environment returns the State and Reward as the result of the Action
 Agent updates its internal Neural Network based on the Reward
4
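The loop described above can be written down compactly. The following is a minimal sketch, assuming the classic OpenAI Gym API (as used in 2016-2017) and a placeholder random agent; it is not the code used in this work.

```python
import gym
import random

class RandomAgent:
    """Placeholder agent: picks random actions; a real DRL agent would use a neural network."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def predict(self, state):
        return random.randrange(self.n_actions)
    def update(self, state, action, reward, next_state, done):
        pass  # a real agent would update its neural network from the Reward here

env = gym.make("MontezumaRevenge-v0")        # Environment: game emulator
agent = RandomAgent(env.action_space.n)
state = env.reset()
done = False
while not done:
    action = agent.predict(state)                          # Agent: predict best Action from State
    next_state, reward, done, info = env.step(action)      # Environment: return State and Reward
    agent.update(state, action, reward, next_state, done)  # Agent: learn from Reward
    state = next_state
env.close()
```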
20170722
Score of DRL in Atari 2600 games
 DRL reached human-level scores in more than half of the Atari 2600 games
(Deep Q-Network, DeepMind 2015)
 But games with poor scores still remained
 One of the hardest games for DRL was "Montezuma's Revenge"
(until DeepMind submitted a very effective paper to arXiv in June 2016;
I did not notice the paper until late August)
 I started the challenge of DRL on "Montezuma's Revenge" at the
beginning of August as my hobby
[DRL] https://deepmind.com/blog/deep-reinforcement-learning/
[My blog (in Japanese)] http://itsukara.hateblo.jp/
[My github] https://github.com/Itsukara/async_deep_reinforce
[Figure: per-game DQN scores relative to "Human Level or Above"; screenshot of Montezuma's Revenge with the player character Joe]
5
20170722
Why so hard?
 So many kill-points => hard to go forward
 Little chance to get reward => little chance to learn
Reward chance by random actions (first 1M steps)
Name of game | # of game overs | Non-zero scores | Reward chance
Breakout | 5440 | 4202 | 77.3%
Montezuma's Revenge | 2323 | 1 | 0.043%
6
20170722
Simple countermeasures and their results
 So many kill-points
[measure] Give a negative reward when Joe is killed, to avoid kill-points
[result] Joe does not approach kill-points and can't get past them
 Little chance to get reward
[measure] Give a basic-income reward to promote learning
(provide a constant reward every step or periodically)
[result] Joe stays in one place forever
 Additionally, no motivation to get past kill-points
[measure] Combination (basic income after a kill-point may be attractive)
[result] Joe stays in one place and can't go forward
=>
 Reward is important for training. But, at the same time, some kind
of motivation to move and get past kill-points is necessary. For that
purpose, the reward should be decreased when visiting the same place
many times or taking the same action many times.
7
20170722
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
8
20170722
DeepMind's paper
 I had been posting information about my DRL experiments with
Montezuma's Revenge on my blog and Twitter
 The author of the A3C reproduction code which I was using read
my blog and told me about DeepMind's new paper
"Unifying Count-Based Exploration and Intrinsic Motivation"
(Bellemare et al., June 2016) via Twitter message
 Reading the abstract of the paper, I realized that what I
wanted in rewarding was written in the paper under the name
"pseudo-reward based on pseudo-count"
 They applied pseudo-counts to Montezuma's Revenge
and got good results (the average score after 100M steps of
training with Double DQN is 3439; with A3C it is 273)
9
20170722
Key idea
 There is a simple method to count the number of occurrences of a game
state, i.e. binary comparison of game states, but it is not effective when the
probability of a game state is very small or zero. E.g., what is the probability of
(SUN, LATE, BUSY) after the following observations? Just zero?
 Key idea:
 ρ = 1/10 * 1/10 * 9/10 (= 0.009) looks natural as the probability of (SUN, LATE, BUSY)
 After an observation of (SUN, LATE, BUSY),
it becomes ρ' = 2/11 * 2/11 * 10/11 (= 0.03)
(The paper calls ρ' the "recording probability")
day# | Weather | Time-of-day | Crowdedness
1 | SUN | LATE | QUIET
2 | RAIN | EARLY | BUSY
3 | RAIN | EARLY | BUSY
4 | RAIN | EARLY | BUSY
5 | RAIN | EARLY | BUSY
6 | RAIN | EARLY | BUSY
7 | RAIN | EARLY | BUSY
8 | RAIN | EARLY | BUSY
9 | RAIN | EARLY | BUSY
10 | RAIN | EARLY | BUSY
10
20170722
Pseudo-count
 When the data space S is the direct product of multiple sub-data-spaces S1, S2, ...,
SM (in the previous slide: Weather, Time-of-day, Crowdedness), the probability of a
sample D = (d1, ..., dM) in S is the product of the probabilities of d1, ..., dM in each of
S1, S2, ..., SM (assumption: the spaces are independent)
 For each Si, when the number of occurrences of a sample di is N and the
number of observations is n, ρ and ρ' can be calculated by definition:
 ρ = N/n
 ρ' = (N + 1)/(n + 1)
 From the above equations, N can be calculated from ρ and ρ' as follows:
 N = ρ(1 – ρ')/(ρ' – ρ) ≈ ρ/(ρ' – ρ) (when ρ' << 1)
 ρ (and ρ') of D can be calculated as the product of ρ (and ρ') over S1, S2, ..., SM
 So, N (the number of occurrences) of D can be calculated from the ρ and ρ' of D
 The paper calls this N the "pseudo-count".
 In the previous slide, ρ = 1/10 * 1/10 * 9/10 = 0.009 and ρ' = 2/11 * 2/11 * 10/11 = 0.03.
So, the pseudo-count is N = 0.009/(0.03 – 0.009) = 0.42 (not 0 and < 1: looks reasonable)
Notice: The above explanation is much simplified. See the DeepMind paper for details.
11
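As a quick numerical check of the formulas above, the following short Python snippet reproduces the weather example from the previous slide. The variable names are mine (illustrative only), not DeepMind's.

```python
# Numerical check of the pseudo-count formula, using the weather-table example.

def recording_probs(count, n):
    """Return (rho, rho') for a value seen `count` times in `n` observations."""
    rho = count / n
    rho_prime = (count + 1) / (n + 1)
    return rho, rho_prime

# Counts after 10 observations: SUN seen 1x, LATE seen 1x, BUSY seen 9x.
factors = [recording_probs(c, 10) for c in (1, 1, 9)]

# Probabilities of the joint state (SUN, LATE, BUSY) are products of the factors.
rho, rho_prime = 1.0, 1.0
for r, rp in factors:
    rho *= r
    rho_prime *= rp

# Pseudo-count N = rho * (1 - rho') / (rho' - rho)  ~=  rho / (rho' - rho)
N = rho * (1.0 - rho_prime) / (rho_prime - rho)
print(f"rho = {rho:.4f}, rho' = {rho_prime:.4f}, pseudo-count N = {N:.2f}")
# Prints roughly: rho = 0.0090, rho' = 0.0301, pseudo-count N ~ 0.4
```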
20170722
Utilization in DRL: Pseudo-Reward
 For every pixel of a game screen x, calculate ρ and ρ'
 Calculate the product of all the ρ and ρ' => these are the ρ and ρ' of x
 Calculate N(x) (the pseudo-count of x) from the ρ and ρ' of x
 Calculate R(x) (the Pseudo-Reward of the screen x) as follows:
 R(x) = β / (N(x) + 0.01)^(1/P)
 The bigger N(x) is, the smaller R(x) is => smaller for frequently occurring screens
 The 0.01 has no special meaning (it just avoids division by zero)
 P was selected by experiment (P = 2 and P = 1 were tried)
 P = 2 both in Double DQN and A3C
 β was selected from a short parameter sweep
 β = 0.05 in Double DQN, β = 0.01 in A3C
=> R(x) ≈ β / √N(x)
 "Real-Reward + R(x)" is used as the Reward for training
(it is not used as the score of the game)
 This gives motivation to extend the exploration of states in DRL
12
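The pseudo-reward formula above is just a couple of lines of Python. The following is a minimal sketch; the default constants are the paper's A3C setting (β = 0.01, P = 2), not tuned values of my own.

```python
import math

# Pseudo-reward as defined on this slide: R(x) = beta / (N(x) + 0.01)^(1/P).
def pseudo_reward(pseudo_count, beta=0.01, P=2.0):
    return beta / math.pow(pseudo_count + 0.01, 1.0 / P)

# For large N the 0.01 is negligible, so R(x) ~ beta / sqrt(N(x)) when P = 2.
print(pseudo_reward(0.42))    # rarely seen screen   -> relatively large exploration bonus
print(pseudo_reward(1000.0))  # frequently seen screen -> bonus close to 0
```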
20170722
Result: Double DQN + Pseudo-Reward
 Evaluated in 5 games. Effective in the following games
 In Montezuma's Revenge, the set of reached rooms was extended
[Figure: map of reached rooms] This room was the most important in DeepMind's
evaluation (confirmed by Bellemare), because Joe can get 3,000 points only in this room.
13
20170722
Result: A3C + Pseudo-Reward (A3C+)
 Evaluated in 60 games. The number of low-score games was reduced
(low-score: score is less than 150% of random actions; pink cells in the following table)
 Not so good (273.7) in Montezuma's Revenge
Columns: # | Game | low-score markers (X) under "Score < 150% of Random" for A3C, A3C+, DQN |
scores: Stochastic-ALE (A3C, A3C+), Deterministic-ALE (A3C, A3C+), Random, Human |
normalized %: Stochastic-ALE (A3C, A3C+, DQN), Deterministic-ALE (A3C, A3C+, DQN)
1 ASTEROIDS X 2680.7 2257.9 3946.2 2406.6 719.1 47388.7 4% 3% 0% 7% 4% 0%
2 BATTLE-ZONE X 3143.0 7429.0 3393.8 7969.1 2360.0 37187.5 2% 15% 41% 3% 16% 45%
3 BOWLING X 32.9 68.7 35.0 76.0 23.1 160.7 7% 33% 4% 9% 38% 5%
4 DOUBLE-DUNK X X 0.5 -8.9 0.2 -7.8 -18.6 -16.4 870% 442% 320% 854% 489% 210%
5 ENDURO X 0.0 749.1 0.0 694.8 0.0 860.5 0% 87% 40% 0% 81% 51%
6 FREEWAY X 0.0 27.3 0.0 30.5 0.0 29.6 0% 92% 103% 0% 103% 102%
7 GRAVITAR X X X 204.7 246.0 201.3 238.7 173.0 3351.4 1% 2% -4% 1% 2% 1%
8 ICE-HOCKEY X X -5.2 -7.1 -5.1 -6.5 -11.2 0.9 49% 34% 12% 50% 39% 7%
9 KANGAROO X 47.2 5475.7 46.6 4883.5 52.0 3035.0 0% 182% 138% 0% 162% 198%
10 MONTEZUMA'S-REVENGE X 0.1 142.5 0.2 273.7 0.0 4753.3 0% 3% 0% 0% 6% 0%
11 PITFALL X X X -8.8 -156.0 -7.0 -259.1 -229.4 6463.7 3% 1% 2% 3% 0% 2%
12 ROBOTANK X 2.1 6.7 2.2 7.7 2.2 11.9 -1% 46% 501% 0% 56% 395%
13 SKIING X X X -23670.0 -20066.7 -20959.0 -22177.5 -17098.1 -4336.9 -51% -23% -73% -30% -40% -85%
14 SOLARIS X X 2157.0 2175.7 2102.1 2270.2 1236.3 12326.7 8% 8% -4% 8% 9% 5%
15 SURROUND X X X -7.8 -7.0 -7.1 -7.2 -10.0 6.5 13% 18% 7% 18% 17% 11%
16 TENNIS X X X -12.4 -20.5 -16.2 -23.1 -23.8 -8.9 76% 22% 73% 51% 5% 106%
17 TIME-PILOT X X X 7417.1 3816.4 9000.9 4103.0 3568.0 5925.0 163% 11% -32% 231% 23% 21%
18 VENTURE X X 0.0 0.0 0.0 0.0 0.0 1188.0 0% 0% 5% 0% 0% 0%
14X 10X 10X 15X 14X 14X 16X 14X 13X
Notice: The above table was created from the paper
14
20170722
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
15
20170722
I tried A3C+ => Why?
 I already had an A3C environment and had been
trying Montezuma's Revenge in it
 The training speed (steps per sec.) of A3C is very fast
 So I thought I could verify the effect of pseudo-reward
based on pseudo-count very quickly
 The paper provides results for only a few games with
Double DQN. I felt the reason might be that tuning or
evaluation with D-DQN takes too long; it might consume
much time to get good results with D-DQN.
16
20170722
First trial: better than A3C+
 By incorporating pseudo-reward into my code,
I got a very good result on the first trial
 It was better than the result of A3C+ (273.7)
[Score chart: training curve rising above DeepMind's A3C+ score]
17
20170722
Effect of my original code
 To evaluate precisely, I turned OFF my original code which I had
incorporated in past trials => bad score (around 100 points)
 By turning my original code back ON, the score went up
[Score chart: score rises after switching my original code from OFF to ON]
18
20170722
My original code
 My original code contained several functions
 Training Long History on Real Reward (TLHoRR)
 Inspired by the reinforcement of learning with dopamine in the human
brain. In this view, a Real-Reward is a very valuable event in the brain,
and TLHoRR strongly trains the neural network the way dopamine does
 Give a negative reward when Joe is killed
 Increase the randomness of actions when the no-reward time is long
 Only TLHoRR was effective
 My code now contains many hyper parameters. I feel it is very difficult
to find the best parameters because there are so many of them
 The length of history to train (various values tried)
 β and P in the calculation of pseudo-reward (various values tried)
 Learning algorithm (A3C-ff and A3C-lstm tried)
 The number of skipped frames (4 looks best for ALE; 2 looks best for OpenAI Gym)
 Color conversion scheme (average/max/last of skipped frames; max looks best)
 "Save thread0's pseudo-count and all threads use it when restored" vs. all save and all restore
 Bits per pixel value (DeepMind used 3; 7 looks best for my code)
 Keep pseudo-count data for each room, or keep one set of data for all rooms
 ...
19
20170722
Structure of Neural Network (NN) for DRL
[Architecture: last 4 screen images scaled to 84x84
-> Convolution 8x8x16, stride 4
-> Convolution 4x4x32, stride 2
-> Fully Connected -> 256
-> Fully Connected -> 18 (Action) and Fully Connected -> 1 (Value)]
 Predict the best Action and the Value from the last 4 screen images
(Value: predicted sum of the Reward obtained until game over)
 The Reward is used to correct the prediction of the best Action and Value
20
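The actual project code is TensorFlow-based; the following is only a minimal PyTorch sketch of the same layer sizes, to make the architecture above concrete. The 18-way policy head matches the full Atari action set.

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        self.policy = nn.Linear(256, n_actions)  # Action head (logits)
        self.value = nn.Linear(256, 1)           # Value head

    def forward(self, x):
        h = self.fc(self.conv(x).flatten(start_dim=1))
        return self.policy(h), self.value(h)

# Usage: one state = the last 4 grayscale frames scaled to 84x84.
net = A3CNet()
logits, value = net(torch.zeros(1, 4, 84, 84))
print(logits.shape, value.shape)  # torch.Size([1, 18]) torch.Size([1, 1])
```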
20170722
A3C: Asynchronous Advantage Actor-Critic
 Gradients (dθ) are calculated as shown on the next slides
 Gradients are asynchronously accumulated into the Global Network (θ)
 The Global Network (θ) is periodically written back to the Local Networks
[Diagram: Local networks (thread0, thread1, ..., threadN) each calculate dθ;
the Global Network accumulates them (update by dθ) and is periodically written back to the locals]
21
20170722
Calculation of Gradients (dθ) in A3C+
# Play 5 steps
For i = 0 to 4
    Predict the best Action At and perform it
    Get Reward rt and the new State st+1
    t += 1
R = Vt if not game over else 0
# Calculate dθ from the history of the last 5 steps (backward propagation)
For i = 0 to 4
    R = rt-i + d * R   (d is the discount ratio)
    dθ += gradient of the A3C policy/value loss at step t-i, using return R
22
20170722
Calculation of Gradients (dθ) in TLHoRR
# Play 5 steps
For i = 0 to 4
    Predict the best Action At and perform it
    Get Reward rt and the new State st+1
    t += 1
R = Vt if not game over else 0
# Training Long History on Real Reward (TLHoRR)
T = 180 if a Real Reward is included in the last 5 steps else 4
# Pseudo Reward only => T=5 : learn from the last 0.3 seconds of the game
# Real Reward => T=180 : learn from the last 12 seconds of the game
# Calculate dθ from the history of the last T steps (backward propagation)
For i = 0 to T
    R = rt-i + d * R   (d is the discount ratio)
    dθ += gradient of the A3C policy/value loss at step t-i, using return R
23
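The following Python sketch shows the return calculation from the two previous slides: the baseline uses the last 5 steps, while TLHoRR extends the training window to 180 steps whenever a real (game) reward appeared in the last 5 steps. Names and defaults are illustrative, not taken verbatim from the repository.

```python
GAMMA = 0.99  # discount ratio "d" on the slides

def n_step_returns(rewards, bootstrap_value, real_reward_in_last_5, long_len=180):
    """rewards: list of per-step rewards (real + pseudo), newest last.
    Returns the discounted returns R for the last T steps (newest first)."""
    T = long_len if real_reward_in_last_5 else 5   # TLHoRR: train on a long history
    window = rewards[-T:]                          # last T steps (or fewer, if not available)
    R = bootstrap_value                            # V(s_t), or 0 if the game is over
    returns = []
    for r in reversed(window):                     # backward propagation of R
        R = r + GAMMA * R
        returns.append(R)
    return returns

# Each thread then accumulates d(theta) from these returns, e.g. the sum over
# steps of grad[ log pi(a|s) * (R - V(s)) + 0.5 * (R - V(s))**2 ].
```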
20170722
Effect of TLHoRR with A3C+ in ALE
 The average score approached 2000 (2016/10/6)
 Could not get over the laser barriers, so could not get the additional 3,000 points
[Score chart and game screenshots showing the laser barriers]
24
20170722
Strange behavior of JOE
 JOE looked as if captured by the ghost of a successful experience
 This happens because the Value at step d is kept very high
 At step b (#2), the reward is provided after the SWORD disappears
 So, the screen image at step b is the same as that at step d (*1)
 So, at step d, JOE thinks there is a reward
 Additionally, the Value at step d is not decreased by learning, because the reward
at #2 (step b) is propagated back to itself through the loop of states
(#1 -> #1 -> #2). => The Values of the states in this loop are kept very high.
(*1) Actually, the number of monsters (#M) changes (2->1 or 1->0).
But the state at step b when #M=1 is the same as that at step d when #M=1.
That means this game does not obey the Markov property.
25
[Screenshot: rooms #1 and #2]
a. Come from the left of #1 and go down the stairs
b. Arrive at #2 and get a reward by picking up the SWORD
c. Return to #1 and get a reward by killing a monster with the SWORD (1:00)
d. Return to #2 and stay there forever (1:00 – 5:00)
(looks like waiting for the ghost of the SWORD)
20170722
Effect of TLHoRR with A3C+ in OpenAI Gym
 The average score exceeded 1600
 Reached 6 rooms which DeepMind did not reach
 Movie (reaches rooms 3, 8, 9): https://youtu.be/qOyFLCK8Umw
 Movie (reaches rooms 18, 19): https://youtu.be/jMDhb-Toii8
 Movie (reaches rooms 19, 20): https://youtu.be/vwkIg1Un7JA
26
20170722
Diverse Hyper Parameters in Thread (DHPT)
 Same Hyper Parameters in every thread:
the score went down to 0 and did not recover from it
(the best action in the start room was lost and could not be learned again,
because the Pseudo-Reward in the start room is almost 0)
 Diverse Hyper Parameters in Thread (DHPT):
the score went down to 0 but recovered from it
 In DHPT, the length of the history in TLHoRR, and β and P (in the
calculation of pseudo-reward) were changed in each thread
(a per-thread sketch is shown below)
Details at http://52.199.15.161/OpenAIGym/montezuma-x1/00index.html
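A minimal sketch of how per-thread hyper parameters can be assigned. The particular value grids below are illustrative, not the values used in my experiments.

```python
import itertools

# DHPT: give each A3C thread its own hyper parameters instead of one shared setting.
TLHORR_LENGTHS = [60, 120, 180, 240]   # length of history trained on a real reward
PSC_BETAS      = [0.005, 0.01, 0.02]   # beta in the pseudo-reward
PSC_POWERS     = [1.0, 2.0]            # P in the pseudo-reward

def thread_hyper_params(thread_index):
    """Deterministically assign a (history length, beta, P) triple per thread."""
    combos = list(itertools.product(TLHORR_LENGTHS, PSC_BETAS, PSC_POWERS))
    tlhorr_len, beta, P = combos[thread_index % len(combos)]
    return {"tlhorr_len": tlhorr_len, "psc_beta": beta, "psc_pow": P}

# Example: 8 parallel threads, each exploring with different settings.
for i in range(8):
    print(i, thread_hyper_params(i))
```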
20170722
Frame skip in OpenAI Gym
 In the ALE environment, the screen image after the same Action is
repeated 4 times (frame skip = 4) is used for learning
 But in OpenAI Gym, the number of skipped frames is chosen by
OpenAI Gym as a uniform random number between 2 and 4
 This randomness prevented learning in OpenAI Gym
 I resolved this issue by calling the OpenAI Gym environment
twice with the same Action (result: the frame skip follows a roughly
Gaussian distribution with an average of 7); see the sketch below
 I believe this proper amount of randomness helped to break through
the laser barrier
28
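The workaround above amounts to a thin wrapper that steps the environment twice with the same action and combines the results. A minimal sketch, assuming the classic (pre-0.26) Gym step API and the "MontezumaRevenge-v0" environment id:

```python
import gym

class DoubleStepEnv:
    """Step the wrapped environment twice with the same action, so the
    effective frame skip is the sum of two random skips."""
    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for _ in range(2):                      # call the environment twice
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info    # only the last observation is used

env = DoubleStepEnv(gym.make("MontezumaRevenge-v0"))
```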
20170722
Effect of TLHoRR with A3C+ in ALE (again)
 Retried TLHoRR + DHPT with A3C+ in ALE, setting frame
skip = 7, because 7 is relatively prime to 60 (the frame rate in ALE)
and seems to contribute to extending the exploration of game states
 It enabled breaking through the laser barrier
29
20170722
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
30
20170722
Conclusion
 Pseudo-count is effective for games with little chance to get a reward
 TLHoRR is useful to get a good score with A3C+
 DHPT is effective for the stability of training with A3C+
 6 rooms were newly visited by TLHoRR + DHPT with A3C+
 Related Information
 Blog (in Japanese) : http://itsukara.hateblo.jp/
 Code : https://github.com/Itsukara/async_deep_reinforce
 OpenAI Gym Result : https://gym.openai.com/evaluations/eval_e6uQIveRRVZHz5C2RSlPg
(Top position in Montezuma's Revenge from 2016/10 to 2017/3)
 Acknowledgment
 I would like to thank Mr. Miyoshi for providing very fast A3C code
31
20170722
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
32
20170722
Future Directions
 Random search for the best Hyper Parameters using a large
amount of IT resources
 Combination of TLHoRR and DHPT with other methods
(Replay Memory, UNREAL, EWC, DNC, ...: all from DeepMind)
 Building and utilizing a maze map (like humans do)
 Learning with color screen images (like humans do)
33
20170722
Thank you for listening
34
20170722
Appendix 1 Details of My pseudo-reward (data structure)
Data structure (with initial values)
 When keeping a pseudo-count per room, each thread has the following data:
 psc_vcount = np.zeros((24, maxval + 1, frsize * frsize), dtype=np.float64)
 24 is the number of rooms in Montezuma's Revenge
 Currently it is a constant.
 In the future, the currently played room and the connection structure of rooms
should be detected automatically.
 This will be useful for evaluating the value of exploration.
 The value of exploration can be used as an additional reward.
 maxval is the maximum pixel value used in the pseudo-count
 Can be changed by an option. Default: 128
 The real pixel value is scaled to fit this maxval
 frsize is the size of the image used in the pseudo-count
 Can be changed by an option. Default: 42
 The game screen is scaled to fit the image size (frsize * frsize)
 When keeping one pseudo-count for all rooms, each thread has the following data:
 psc_vcount = np.zeros((maxval + 1, frsize * frsize), dtype=np.float64)
 The two cases above can be selected by an option
 The order of dimensions is important for good memory locality
 If the dimension for the pixel value comes last, training performance decreases by
roughly 20%, because pixel values are sparse and cause many cache misses.
35
20170722
Appendix 1 Details of My pseudo-reward (algorithm)
Algorithm (the algorithm when keeping one pseudo-count is omitted here)
 vcount = psc_vcount[room_no, psc_image, range_k]
 This is not a scalar but a temporary array (the fancy indexing returns a copy, not a view)
 room_no is the index of the room currently being played
 psc_image is the screen image scaled to size (frsize * frsize) and pixel-value range maxval
 range_k = np.array([i for i in range(frsize * frsize)]) (calculated at initialization)
 psc_vcount[room_no, psc_image, range_k] += 1.0
 The count of the observed pixel value is incremented at each pixel position
 r_over_rp = np.prod(nr * vcount / (1.0 + vcount))
 ρ / ρ' is calculated for each pixel, and their product gives ρ / ρ' for the screen image
 ρ / ρ' = {N/n} / {(N+1)/(n+1)} = nr * N / (1.0 + N) = nr * vcount / (1.0 + vcount)
 nr = (n + 1.0) / n, where n is the number of observations (counted from initialization)
 psc_count = r_over_rp / (1.0 – r_over_rp)
 This is the pseudo-count. As is easily confirmed, r_over_rp / (1.0 – r_over_rp) = ρ/(ρ' – ρ)
 ρ/(ρ' – ρ) is not calculated directly, because both ρ' and ρ are very small
and the calculation error in ρ' – ρ would become big.
 psc_reward = psc_beta / math.pow(psc_count + psc_alpha, psc_rev_pow)
 This is the pseudo-reward calculated from the pseudo-count
 psc_beta = β and can be changed by an option in each thread
 psc_rev_pow = 1/P; P is a float value and can be changed by an option in each thread
 psc_alpha = math.pow(0.1, P); so
 math.pow(psc_count + psc_alpha, psc_rev_pow) = 0.1 for any P when psc_count is almost 0
36
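For readability, the two Appendix 1 slides can be consolidated into one runnable sketch. Variable names follow the slides where possible; the surrounding class and defaults are mine and are not taken verbatim from the repository.

```python
import math
import numpy as np

class PseudoCount:
    """Per-room pseudo-count table and pseudo-reward update (sketch)."""
    def __init__(self, n_rooms=24, maxval=128, frsize=42, psc_beta=0.01, P=2.0):
        self.psc_vcount = np.zeros((n_rooms, maxval + 1, frsize * frsize), dtype=np.float64)
        self.range_k = np.arange(frsize * frsize)
        self.n = 0                      # number of observations so far
        self.psc_beta = psc_beta
        self.psc_rev_pow = 1.0 / P
        self.psc_alpha = math.pow(0.1, P)

    def update(self, room_no, psc_image):
        """psc_image: 1-D int array of length frsize*frsize with values in [0, maxval]."""
        self.n += 1
        nr = (self.n + 1.0) / self.n
        vcount = self.psc_vcount[room_no, psc_image, self.range_k]   # counts before the update
        self.psc_vcount[room_no, psc_image, self.range_k] += 1.0     # increment observed values
        r_over_rp = np.prod(nr * vcount / (1.0 + vcount))            # rho / rho' of the screen
        psc_count = r_over_rp / (1.0 - r_over_rp)                    # pseudo-count = rho/(rho'-rho)
        return self.psc_beta / math.pow(psc_count + self.psc_alpha, self.psc_rev_pow)

# Example: feed a scaled 42x42 screen (flattened) with pixel values in [0, 128].
pc = PseudoCount()
screen = np.random.randint(0, 129, size=42 * 42)
print(pc.update(room_no=0, psc_image=screen))   # first visit -> pseudo-reward of about 0.1
```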
20170722
Appendix 2 Visualization of Pseudo-Count
37
 3M steps: [images of the most frequent, 2nd most frequent, and 3rd most frequent pixel values]
 The picture of the most frequent pixels looks like an image of the first room.
 The pictures of the 2nd and 3rd most frequent pixels look like traces of JOE's motion.
 45M steps: [images of the most frequent, 2nd most frequent, and 3rd most frequent pixel values]
 Pictures of several rooms are intermixed in the pictures of the 2nd and 3rd
most frequent pixels. => It might be better to keep the pseudo-count for each room
independently. I tried this and it looks promising.
20170722
Appendix 3 Real-time visualization of training
*.r: Real reward (all scores and moving average)
*.R: Frequency of visit in each room
*.RO: Frequency of TLHoRR in each room
*.lives: Number of LIVES when TLHoRR is applied
*.k: Frequency of KILL in each room
*.tes: Length of history of TLHoRR for each score
*.s: The number of steps until getting a real reward
*.prR: Pseudo-Reward in each room (all PR and moving average)
*.vR: Values in each room (all Values and moving average)