DRL challenge on Montezuma's Revenge

DRL (Deep Reinforcement Learning) challenge on Montezuma's Revenge is presented. The score and the rooms reached with A3C exceed those of DeepMind. This is an English translation of my Japanese slides, plus some updates. (updated 2017/7/22)
I changed the http server. See the following for the results of the experiment: http://35.197.57.214/
(I'd like to update the slides, but the re-upload function has been removed from SlideShare)

1. Training long history on real reward and diverse hyper parameters in threads, combined with DeepMind's A3C+
Takayoshi Iitsuka
The Whole Brain Architecture Initiative (a specified non-profit organization, Japan)
iitsuka@wba-initiative.org
1983-2003: Researcher on compilers for Hitachi's computers (mainly supercomputers)
2003-2015: Strategy and Planning Department of several divisions (Cloud Service, etc.)
2015/9: Took early retirement from Hitachi with additional payment
2016/2-12: Caught up with the latest IT, including Deep Learning
2016/10: Reached the top position in OpenAI Gym (Montezuma's Revenge) and kept it until 2017/3
2016/10: Returned to Hitachi as a contract employee (my work is not related to AI)
2. Table of Contents
1. Background
2. DeepMind's paper and A3C+
3. Experience with A3C+ and My Proposals
4. Conclusion
5. Future Directions
3. Table of Contents: 1. Background / 2. DeepMind's paper and A3C+ / 3. Experience with A3C+ and My Proposals / 4. Conclusion / 5. Future Directions
4. Deep Reinforcement Learning (DRL)
[Diagram: Agent (predicts the best Action by Deep Learning) <-> Environment (game emulator etc.); the Agent sends an Action, the Environment returns the State (screen image after the Action, etc.) and the Reward (score obtained by the Action, etc.)]
- Agent predicts the best Action from the State by Deep Learning
- Environment returns State and Reward as the result of the Action
- Agent updates its internal Neural Network based on the Reward
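The interaction loop above fits in a few lines; here is a minimal sketch using the old OpenAI Gym API (a random agent stands in for the Deep Learning policy, and the environment id is my assumption, not taken from the slides):

import gym

env = gym.make("MontezumaRevenge-v0")   # assumed environment id
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()             # Agent: predict the best Action (random stand-in)
    state, reward, done, info = env.step(action)   # Environment: return State and Reward
    # Agent: update the internal Neural Network based on Reward (omitted in this sketch)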
5. Score of DRL in Atari 2600 games
- DRL reached human-level scores in more than half of the Atari 2600 games (Deep Q-Network, DeepMind 2015)
- But games with poor scores still remained
- One of the hardest games for DRL was "Montezuma's Revenge" (until DeepMind submitted a very effective paper to arXiv in June 2016; I did not notice the paper until late August)
- I started my DRL challenge on "Montezuma's Revenge" at the beginning of August as a hobby
[DRL] https://deepmind.com/blog/deep-reinforcement-learning/
[My blog (in Japanese)] http://itsukara.hateblo.jp/
[My github] https://github.com/Itsukara/async_deep_reinforce
[Figure: screenshot of Joe in Montezuma's Revenge; chart of games at human level or above]
6. Why so hard?
- So many kill-points => hard to go forward
- Little chance to get a reward => little chance to learn
Reward chance by random actions (first 1M steps):
Name of game | # of game overs | Non-zero scores | Reward chance
Breakout | 5440 | 4202 | 77.3%
Montezuma's Revenge | 2323 | 1 | 0.043%
7. Simple countermeasures and their results
- So many kill-points
  [measure] Give a negative reward when Joe is killed, to avoid kill-points
  [result] Joe does not approach kill-points and can't get past them
- Little chance to get a reward
  [measure] Give a basic-income reward to promote learning (provide a constant reward every step or periodically)
  [result] Joe stays in one place forever
- Additionally, no motivation to get past a kill-point
  [measure] Combination of the two (a basic income after the kill-point may be attractive)
  [result] Joe stays in one place and can't go forward
=> Reward is important for training. But, at the same time, some kind of motivation to move and get past kill-points is necessary. For that purpose, the reward should be decreased when visiting the same place or taking the same action many times.
8. Table of Contents: 1. Background / 2. DeepMind's paper and A3C+ / 3. Experience with A3C+ and My Proposals / 4. Conclusion / 5. Future Directions
9. DeepMind's paper
- I had been publishing information about my DRL experiments with Montezuma's Revenge on my blog and Twitter
- The author of the A3C reproduction code I was using read my blog and pointed me, via Twitter message, to DeepMind's new paper "Unifying Count-Based Exploration and Intrinsic Motivation" (Bellemare et al., June 2016)
- Reading the abstract, I realized that what I wanted from the reward was described in the paper under the name "pseudo-reward based on pseudo-count"
- They applied pseudo-counts to Montezuma's Revenge and got good results (average score after 100M training steps: 3439 with Double DQN, 273 with A3C)
10. Key idea
- There is a simple way to count the number of occurrences of a game state, i.e. binary comparison of game states, but it is not effective when the probability of a game state is very small or zero. E.g., what is the probability of (SUN, LATE, BUSY) after the following observations? Just zero?
- Key idea:
  - ρ = 1/10 * 1/10 * 9/10 (= 0.009) looks natural as the probability of (SUN, LATE, BUSY)
  - After observing (SUN, LATE, BUSY), it becomes ρ' = 2/11 * 2/11 * 10/11 (= 0.03) (the paper calls ρ' the "recoding probability")
day# | Weather | Time-of-day | Crowdedness
1 | SUN | LATE | QUIET
2 | RAIN | EARLY | BUSY
3 | RAIN | EARLY | BUSY
4 | RAIN | EARLY | BUSY
5 | RAIN | EARLY | BUSY
6 | RAIN | EARLY | BUSY
7 | RAIN | EARLY | BUSY
8 | RAIN | EARLY | BUSY
9 | RAIN | EARLY | BUSY
10 | RAIN | EARLY | BUSY
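As a quick check of the idea on this toy example, here is a minimal Python sketch (my own illustration, not from the slides) that computes ρ and ρ' per attribute and multiplies them:

# Toy observation log from the table above: (Weather, Time-of-day, Crowdedness)
observations = [("SUN", "LATE", "QUIET")] + [("RAIN", "EARLY", "BUSY")] * 9

def rho_and_rho_prime(observations, query):
    # Treat each attribute independently and multiply the per-attribute probabilities.
    n = len(observations)
    rho, rho_prime = 1.0, 1.0
    for i, value in enumerate(query):
        count = sum(1 for obs in observations if obs[i] == value)
        rho *= count / n                    # probability before recording the query
        rho_prime *= (count + 1) / (n + 1)  # probability after recording it once more
    return rho, rho_prime

print(rho_and_rho_prime(observations, ("SUN", "LATE", "BUSY")))  # ~0.009 and ~0.03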
11. Pseudo-count
- When the data space S is the direct product of multiple sub-data-spaces S1, S2, ..., SM (in the previous slide: Weather, Time-of-day, Crowdedness), the probability of a sample D = (d1, ..., dM) in S is the product of the probabilities of d1, ..., dM in each of S1, S2, ..., SM (assumption: the spaces are independent)
- For each Si, when the number of occurrences of a sample di is N and the number of observations is n, ρ and ρ' can be calculated by definition:
  ρ = N/n
  ρ' = (N + 1)/(n + 1)
- From the above equations, N can be calculated from ρ and ρ' as follows:
  N = ρ(1 - ρ')/(ρ' - ρ) ≒ ρ/(ρ' - ρ) (when ρ' << 1)
- ρ (and ρ') of D can be calculated as the product of ρ (and ρ') over S1, S2, ..., SM
- So N (the number of occurrences) of D can be calculated from ρ and ρ' of D
- The paper calls this N the "pseudo-count"
- In the previous slide, ρ = 1/10 * 1/10 * 9/10 = 0.009 and ρ' = 2/11 * 2/11 * 10/11 = 0.03, so the pseudo-count is N = 0.009/(0.03 - 0.009) = 0.42 (not 0 and < 1: looks reasonable)
Notice: The explanation above is much simplified. See the DeepMind paper for details.
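The derivation on this slide fits in a couple of lines; a minimal sketch (my own, using only the definitions above):

def pseudo_count(rho, rho_prime):
    # Solving rho = N/n and rho' = (N+1)/(n+1) for N gives
    # N = rho * (1 - rho') / (rho' - rho)  (~ rho / (rho' - rho) when rho' << 1).
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# Toy example from the slides: rho = 0.009, rho' = 40/1331 ~ 0.03
print(pseudo_count(0.009, 40.0 / 1331.0))  # ~0.41; the slide's 0.42 comes from the rounded rho'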
12. Utilization in DRL: Pseudo-Reward
- For every pixel of a game screen x, calculate ρ and ρ'
- Calculate the product of all ρ and of all ρ' => these are ρ and ρ' of x
- Calculate N(x) (the pseudo-count of x) from ρ and ρ' of x
- Calculate R(x) (the pseudo-reward of the screen x) as follows:
  R(x) = β / (N(x) + 0.01)^(1/P)
  - The bigger N(x) is, the smaller R(x) is => smaller for high-occurrence screens
  - 0.01 has no special meaning (it just avoids division by zero)
  - P was selected by experiment (P = 2 and P = 1 were tried); P = 2 both in Double DQN and A3C
  - β was selected from a short parameter sweep; β = 0.05 in Double DQN, β = 0.01 in A3C => R(x) ≒ β/√N(x)
- "Real reward + R(x)" is used as the reward for training (it is not used as the score of the game)
- This gives motivation to extend the exploration of states in DRL
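The reward shaping above is just a formula; a minimal sketch with the A3C settings quoted on this slide (β = 0.01, P = 2):

def pseudo_reward(pseudo_count, beta=0.01, P=2.0):
    # R(x) = beta / (N(x) + 0.01)^(1/P); the 0.01 only prevents division by zero.
    return beta / (pseudo_count + 0.01) ** (1.0 / P)

print(pseudo_reward(0.0))    # ~0.1   (unseen screen => large exploration bonus)
print(pseudo_reward(100.0))  # ~0.001 (frequently seen screen => tiny bonus)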
13. Result: Double DQN + Pseudo-Reward
- Evaluated in 5 games; effective in the games shown in the figure
- In Montezuma's Revenge, it extended the set of reached rooms
[Figure: map of reached rooms. The highlighted room was the most important one in DeepMind's evaluation (confirmed by Bellemare), because Joe can get 3,000 points only in this room.]
14. Result: A3C + Pseudo-Reward (A3C+)
- Evaluated in 60 games. The number of low-score games was reduced (low-score: score less than 150% of random actions; marked with X in the table below)
- Not so good (273.7) in Montezuma's Revenge
Columns: game | X = score < 150% of random | scores: Stochastic-ALE (A3C, A3C+), Deterministic-ALE (A3C, A3C+), Random, Human | human-normalized % = (score - Random)/(Human - Random): Stochastic-ALE (A3C, A3C+, DQN), Deterministic-ALE (A3C, A3C+, DQN)
1 ASTEROIDS | X | 2680.7 2257.9 3946.2 2406.6 719.1 47388.7 | 4% 3% 0% 7% 4% 0%
2 BATTLE-ZONE | X | 3143.0 7429.0 3393.8 7969.1 2360.0 37187.5 | 2% 15% 41% 3% 16% 45%
3 BOWLING | X | 32.9 68.7 35.0 76.0 23.1 160.7 | 7% 33% 4% 9% 38% 5%
4 DOUBLE-DUNK | X X | 0.5 -8.9 0.2 -7.8 -18.6 -16.4 | 870% 442% 320% 854% 489% 210%
5 ENDURO | X | 0.0 749.1 0.0 694.8 0.0 860.5 | 0% 87% 40% 0% 81% 51%
6 FREEWAY | X | 0.0 27.3 0.0 30.5 0.0 29.6 | 0% 92% 103% 0% 103% 102%
7 GRAVITAR | X X X | 204.7 246.0 201.3 238.7 173.0 3351.4 | 1% 2% -4% 1% 2% 1%
8 ICE-HOCKEY | X X | -5.2 -7.1 -5.1 -6.5 -11.2 0.9 | 49% 34% 12% 50% 39% 7%
9 KANGAROO | X | 47.2 5475.7 46.6 4883.5 52.0 3035.0 | 0% 182% 138% 0% 162% 198%
10 MONTEZUMA'S-REVENGE | X | 0.1 142.5 0.2 273.7 0.0 4753.3 | 0% 3% 0% 0% 6% 0%
11 PITFALL | X X X | -8.8 -156.0 -7.0 -259.1 -229.4 6463.7 | 3% 1% 2% 3% 0% 2%
12 ROBOTANK | X | 2.1 6.7 2.2 7.7 2.2 11.9 | -1% 46% 501% 0% 56% 395%
13 SKIING | X X X | -23670.0 -20066.7 -20959.0 -22177.5 -17098.1 -4336.9 | -51% -23% -73% -30% -40% -85%
14 SOLARIS | X X | 2157.0 2175.7 2102.1 2270.2 1236.3 12326.7 | 8% 8% -4% 8% 9% 5%
15 SURROUND | X X X | -7.8 -7.0 -7.1 -7.2 -10.0 6.5 | 13% 18% 7% 18% 17% 11%
16 TENNIS | X X X | -12.4 -20.5 -16.2 -23.1 -23.8 -8.9 | 76% 22% 73% 51% 5% 106%
17 TIME-PILOT | X X X | 7417.1 3816.4 9000.9 4103.0 3568.0 5925.0 | 163% 11% -32% 231% 23% 21%
18 VENTURE | X X | 0.0 0.0 0.0 0.0 0.0 1188.0 | 0% 0% 5% 0% 0% 0%
14X 10X 10X 15X 14X 14X 16X 14X 13X
Notice: The table above was created from the paper.
15. Table of Contents: 1. Background / 2. DeepMind's paper and A3C+ / 3. Experience with A3C+ and My Proposals / 4. Conclusion / 5. Future Directions
16. I tried A3C+ => Why?
- I already had an A3C environment and had been trying Montezuma's Revenge in it
- The training speed (steps per second) of A3C is very fast
- So I thought I could verify the effect of pseudo-reward based on pseudo-count very quickly
- The paper provides results for only a few games with Double DQN. I felt the reason might be that tuning and evaluation with Double DQN take too long; it might consume too much time to get good results with Double DQN.
17. First trial: better than A3C+
- By incorporating pseudo-reward into my code, I got a very good result on the first trial
- It was better than the result of A3C+ (273.7)
[Figure: score curve; DeepMind's A3C+ score shown for reference]
18. Effect of my original code
- To evaluate precisely, I turned OFF my original code which I had incorporated in past trials => bad score (around 100 points)
- By turning my original code back ON, the score went up
[Figure: score curve; my original code OFF -> ON]
19. My original code
- My original code contained several functions:
  - Training Long History on Real Reward (TLHoRR)
    - Inspired by the reinforcement of learning with dopamine in the human brain. In this analogy, a real reward is a very valuable event in the brain, and TLHoRR strongly trains the neural network, as dopamine does
  - Give a negative reward when Joe is killed
  - Increase the randomness of actions when the no-reward time is long
- Only TLHoRR was effective
- My code now contains so many hyper parameters that I feel it is very difficult to find the best ones:
  - The length of the history to train on (various values tried)
  - β and P in the calculation of pseudo-reward (various values tried)
  - Learning algorithm (A3C-ff and A3C-lstm tried)
  - The number of skipped frames (4 looks best for ALE, 2 looks best for OpenAI Gym)
  - Color conversion scheme (average/max/last of the skipped frames; max looks best)
  - "Save thread0's pseudo-count and let all threads use it when restored" vs. all save and all restore
  - Bits per pixel value (DeepMind used 3; 7 looks best for my code)
  - Pseudo-count data per room vs. one set of data for all rooms
  - ...
20. Structure of the Neural Network (NN) for DRL
[Diagram: last 4 screen images scaled to 84x84 -> Convolution 8x8x16, stride 4 -> Convolution 4x4x32, stride 2 -> Fully connected -> 256 -> Fully connected -> 18 (Action) and -> 1 (Value)]
- Predict the best Action and the Value from the last 4 screen images (Value: predicted sum of Rewards obtained until game over)
- Reward is used to correct the prediction of the best Action and the Value
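For concreteness, here is a minimal PyTorch sketch of the architecture in the diagram (my own illustration; the author's actual implementation is in the repository linked earlier):

import torch
import torch.nn as nn

class A3CNet(nn.Module):
    def __init__(self, n_actions=18):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # last 4 frames, 84x84
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, 256)
        self.policy = nn.Linear(256, n_actions)  # Action logits (18 actions)
        self.value = nn.Linear(256, 1)           # Value

    def forward(self, x):                        # x: (batch, 4, 84, 84)
        h = torch.relu(self.conv1(x))
        h = torch.relu(self.conv2(h))
        h = torch.relu(self.fc(h.flatten(1)))
        return self.policy(h), self.value(h)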
21. A3C: Asynchronous Advantage Actor-Critic
- Gradients (dθ) are calculated as shown on the next two slides
- Gradients are asynchronously accumulated into the Global Network (θ)
- The Global Network (θ) is periodically written back to the Local Networks (θ')
[Diagram: Local networks (thread0, thread1, ..., threadN) each calculate dθ; the Global Network accumulates them (is updated by dθ) and is periodically written back to the Local Networks]
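A toy sketch of this accumulation scheme (illustrative only; the real update applies neural-network gradients through a shared optimizer, and the numbers below are made up):

import threading, random

global_theta = {"w": 0.0}      # Global Network parameters (a single scalar here)
lock = threading.Lock()

def fake_gradient(theta):
    # Stand-in for the policy/value gradient d(theta) of one 5-step rollout.
    return theta["w"] - random.gauss(1.0, 0.1)

def worker(steps=1000, sync_every=5, lr=0.01):
    local = dict(global_theta)                 # thread-local copy (theta')
    for t in range(steps):
        d = fake_gradient(local)               # calculate d(theta) on the local copy
        with lock:
            global_theta["w"] -= lr * d        # asynchronously accumulate into the global network
        if t % sync_every == 0:
            local = dict(global_theta)         # periodic write-back from global to local

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(global_theta["w"])                       # ends up near 1.0 in this toy setting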
22. Calculation of Gradients (dθ) in A3C+
# Play 5 steps
For i = 0 to 4
    Predict the best Action At and perform it
    Get Reward rt and the new State st+1
    t += 1
R = Vt if not game over, else 0
# Calculate dθ from the history of the last 5 steps (backward propagation)
For i = 0 to 4
    R = rt-i + d * R   (d is the discount ratio)
    dθ += gradient of the policy/value loss at step t-i, using R as the target
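The return calculation above can be written directly; a minimal sketch (my own) of the 5-step discounted return used as the training target:

def n_step_returns(rewards, bootstrap_value, discount=0.99):
    # rewards: r_t of the last N steps (oldest first); bootstrap_value: V of the
    # last state (0 if game over). Returns the target R for each step, oldest first.
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):    # backward propagation through the rollout
        R = r + discount * R
        returns.append(R)
    return list(reversed(returns))

print(n_step_returns([0, 0, 1, 0, 0], bootstrap_value=0.5))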
23. Calculation of Gradients (dθ) in TLHoRR
# Play 5 steps
For i = 0 to 4
    Predict the best Action At and perform it
    Get Reward rt and the new State st+1
    t += 1
R = Vt if not game over, else 0
# Training Long History on Real Reward (TLHoRR)
T = 180 if a Real Reward is included in the last 5 steps, else 4
# Pseudo Reward => T=5  : learn from the last 0.3 seconds of the game
# Real Reward   => T=180: learn from the last 12 seconds of the game
# Calculate dθ from the history of the last T steps (backward propagation)
For i = 0 to T
    R = rt-i + d * R   (d is the discount ratio)
    dθ += gradient of the policy/value loss at step t-i, using R as the target
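Putting the rule into code, a self-contained sketch of TLHoRR (my own illustration; the history lengths 5 and 180 are the values quoted on the slide):

def tlhorr_returns(real_rewards, train_rewards, bootstrap_value, discount=0.99):
    # real_rewards: game rewards of the rollout history (oldest first)
    # train_rewards: real + pseudo rewards actually used for training (same length)
    # Train on a long history (~12 s of play, T=180) only when a real reward
    # occurred in the last 5 steps; otherwise use the usual 5 steps (~0.3 s).
    T = 180 if any(r != 0 for r in real_rewards[-5:]) else 5
    R, returns = bootstrap_value, []
    for r in reversed(train_rewards[-T:]):
        R = r + discount * R
        returns.append(R)
    return list(reversed(returns))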
24. Effect of TLHoRR with A3C+ in ALE
- The average score approached 2000 (2016/10/6)
- Could not get past the laser barriers, so could not get the additional 3,000 points
[Figure: score curve; screenshots of the laser barriers]
25. Strange behavior of JOE
- JOE looked as if he was captured by the ghost of a successful experience
- This happens because the Value at step d is kept very high
  - At step b (in room #2), the reward is provided after the SWORD disappears
  - So the screen image at step b is the same as that at step d (*1)
  - So, at step d, JOE thinks there is a reward
  - Additionally, the Value at step d is not decreased by learning, because the reward at step b (room #2) is backward-propagated to itself through the loop of states (#1 -> #1 -> #2) => the Values of the states in this loop are kept very high
(*1) Actually, the number of monsters (#M) changes (2->1 or 1->0). But the state at step b when #M=1 is the same as that at step d when #M=1. That means this game does not obey a Markov process.
[Screenshots of rooms #1 and #2]
a. Come from the left of room #1 and go down the stairs
b. Arrive at room #2 and get a reward by picking up the SWORD
c. Return to room #1 and get a reward by killing a monster with the SWORD (1:00)
d. Return to room #2 and stay there forever (1:00 - 5:00) (looks like waiting for the ghost of the SWORD)
26. Effect of TLHoRR with A3C+ in OpenAI Gym
- The average score exceeded 1600
- Reached 6 rooms which DeepMind didn't reach
  - Movie reaching rooms 3, 8, 9: https://youtu.be/qOyFLCK8Umw
  - Movie reaching rooms 18, 19: https://youtu.be/jMDhb-Toii8
  - Movie reaching rooms 19, 20: https://youtu.be/vwkIg1Un7JA
27. Diverse Hyper Parameters in Threads (DHPT)
- Same hyper parameters in every thread: the score went down to 0 and did not recover. The best action in the start room was lost and could not be learned again, because the pseudo-reward in the start room is almost 0.
- Diverse Hyper Parameters in Threads (DHPT): the score went down to 0, but recovered. The length of the TLHoRR history, and β and P (in the calculation of pseudo-reward), were varied per thread.
- Details at http://52.199.15.161/OpenAIGym/montezuma-x1/00index.html
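A minimal sketch of how DHPT can be set up (my own illustration; the parameter values below are made up, not the author's settings):

import itertools

history_lengths = [60, 120, 180, 240]   # TLHoRR history length (assumed values)
betas = [0.005, 0.01, 0.02]             # beta in the pseudo-reward (assumed values)
powers = [1.0, 2.0]                     # P in the pseudo-reward (assumed values)

def thread_hyper_params(n_threads):
    # Cycle through the combinations so every thread gets its own setting.
    combos = itertools.cycle(itertools.product(history_lengths, betas, powers))
    return [dict(zip(("tlhorr_len", "beta", "P"), c)) for c, _ in zip(combos, range(n_threads))]

for i, hp in enumerate(thread_hyper_params(8)):
    print("thread", i, hp)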
28. Frame skip in OpenAI Gym
- In the ALE environment, the screen image after the same Action has been repeated 4 times (frame skip = 4) is used for learning
- But in OpenAI Gym, the number of skipped frames is chosen by OpenAI Gym as a uniform random number between 2 and 4
- This randomness prevented learning in OpenAI Gym
- I resolved this issue by calling the OpenAI Gym environment twice with the same Action (result: the frame skip becomes a roughly Gaussian distribution with an average of 7)
- I believe this proper amount of randomness helped to break through the laser barrier
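A minimal sketch of this workaround with the old Gym API (illustrative; the wrapper name and environment id are my own, and the reward handling across the two calls is an assumption):

import gym

class DoubleStep(gym.Wrapper):
    # Call the underlying environment twice with the same action, so the
    # effective frame skip is the sum of two random skips.
    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(2):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info

env = DoubleStep(gym.make("MontezumaRevenge-v0"))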
29. Effect of TLHoRR with A3C+ in ALE (again)
- Retried TLHoRR + DHPT with A3C+ in ALE, setting frame skip = 7, because 7 is relatively prime to 60 (the frame rate in ALE) and seems to contribute to extending the exploration of game states
- It enabled breaking through the laser barrier
30. Table of Contents: 1. Background / 2. DeepMind's paper and A3C+ / 3. Experience with A3C+ and My Proposals / 4. Conclusion / 5. Future Directions
31. Conclusion
- Pseudo-count is effective for games with little chance to get a reward
- TLHoRR is useful for getting a good score with A3C+
- DHPT is effective for the stability of training with A3C+
- 6 rooms were newly visited by TLHoRR + DHPT with A3C+
- Related information
  - Blog (in Japanese): http://itsukara.hateblo.jp/
  - Code: https://github.com/Itsukara/async_deep_reinforce
  - OpenAI Gym result: https://gym.openai.com/evaluations/eval_e6uQIveRRVZHz5C2RSlPg (top position in Montezuma's Revenge from 2016/10 to 2017/3)
- Acknowledgment
  - I would like to thank Mr. Miyoshi for providing very fast A3C code
32. Table of Contents: 1. Background / 2. DeepMind's paper and A3C+ / 3. Experience with A3C+ and My Proposals / 4. Conclusion / 5. Future Directions
33. Future Directions
- Random search for the best hyper parameters using a large amount of IT resources
- Combination of TLHoRR and DHPT with other methods (Replay Memory, UNREAL, EWC, DNC, ...: all from DeepMind)
- Building and using a maze map (as humans do)
- Learning with color screen images (as humans do)
34. Thank you for listening
35. Appendix 1: Details of my pseudo-reward (data structure)
Data structure (with initial values)
- Case with a pseudo-count per room: each thread has the following data
  - psc_vcount = np.zeros((24, maxval + 1, frsize * frsize), dtype=np.float64)
    - 24 is the number of rooms in Montezuma's Revenge
      - Currently it is a constant
      - In the future, the room currently being played and the connection structure of the rooms should be detected automatically
      - This will be useful for evaluating the value of exploration
      - The value of exploration can be used as an additional reward
    - maxval is the maximum pixel value in the pseudo-count
      - Can be changed by option. Default: 128
      - The real pixel value is scaled to fit maxval
    - frsize is the image size in the pseudo-count
      - Can be changed by option. Default: 42
      - The game screen is scaled to fit the image size (frsize * frsize)
- Case with one pseudo-count: each thread has the following data
  - psc_vcount = np.zeros((maxval + 1, frsize * frsize), dtype=np.float64)
- The two cases above can be selected by option
- The order of the dimensions is important for good memory locality
  - If the dimension for the pixel value comes last, training performance drops by roughly 20%, because pixel values are sparse and cause many cache misses
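For reference, the initialization described above in runnable form (a sketch; the flag name per_room is my own):

import numpy as np

maxval, frsize, n_rooms = 128, 42, 24   # defaults quoted on the slide
per_room = True                          # assumed option flag
if per_room:
    psc_vcount = np.zeros((n_rooms, maxval + 1, frsize * frsize), dtype=np.float64)
else:
    psc_vcount = np.zeros((maxval + 1, frsize * frsize), dtype=np.float64)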
36. Appendix 1: Details of my pseudo-reward (algorithm)
Algorithm (the algorithm for the single pseudo-count case is omitted here)
- vcount = psc_vcount[room_no, psc_image, range_k]
  - This is not a scalar but a temporary array (fancy indexing returns a copy, not a view)
  - room_no is the index of the room currently being played
  - psc_image is the screen image scaled to size (frsize * frsize) and pixel-value range maxval
  - range_k = np.array([i for i in range(frsize * frsize)]) (calculated at initialization)
- psc_vcount[room_no, psc_image, range_k] += 1.0
  - The count of each observed pixel value is incremented
- r_over_rp = np.prod(nr * vcount / (1.0 + vcount))
  - ρ / ρ' is calculated for each pixel, and ρ / ρ' for the screen image is their product
  - ρ / ρ' = (N/n) / ((N+1)/(n+1)) = nr * N / (1.0 + N) = nr * vcount / (1.0 + vcount)
  - nr = (n + 1.0) / n, where n is the number of observations, counted from initialization
- psc_count = r_over_rp / (1.0 - r_over_rp)
  - This is the pseudo-count. As is easily confirmed, r_over_rp / (1.0 - r_over_rp) = ρ/(ρ' - ρ)
  - ρ/(ρ' - ρ) is not calculated directly because both ρ' and ρ are very small, so the calculation error in ρ' - ρ would be large
- psc_reward = psc_beta / math.pow(psc_count + psc_alpha, psc_rev_pow)
  - This is the pseudo-reward calculated from the pseudo-count
  - psc_beta = β and can be changed by option in each thread
  - psc_rev_pow = 1/P; P is a float and can be changed by option in each thread
  - psc_alpha = math.pow(0.1, P), so math.pow(psc_count + psc_alpha, psc_rev_pow) = 0.1 for any P when psc_count is almost 0
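The steps above, consolidated into one function (a sketch; it reuses psc_vcount, maxval and frsize from the previous sketch and assumes the caller tracks the observation count n per room):

import math
import numpy as np

range_k = np.arange(frsize * frsize)   # flat pixel indices, precomputed once

def psc_update_and_reward(room_no, psc_image, n, psc_beta=0.01, P=2.0):
    # psc_image: int array of shape (frsize*frsize,) with values in [0, maxval]
    vcount = psc_vcount[room_no, psc_image, range_k]     # per-pixel counts (a copy)
    psc_vcount[room_no, psc_image, range_k] += 1.0        # record this observation
    nr = (n + 1.0) / n
    r_over_rp = np.prod(nr * vcount / (1.0 + vcount))     # rho / rho' of the screen
    r_over_rp = min(r_over_rp, 1.0 - 1e-12)               # guard the degenerate case r_over_rp == 1
    psc_count = r_over_rp / (1.0 - r_over_rp)             # = rho / (rho' - rho)
    psc_alpha = math.pow(0.1, P)
    return psc_beta / math.pow(psc_count + psc_alpha, 1.0 / P)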
37. Appendix 2: Visualization of Pseudo-Count
[Figures at 3M steps and 45M steps: most frequent, 2nd most frequent, and 3rd most frequent pixel values]
- The picture of the most frequent pixels looks like the image of the first room
- The pictures of the 2nd and 3rd most frequent pixels look like traces of JOE's motion
- Pictures of several rooms are intermixed in the pictures of the 2nd and 3rd most frequent pixels => it might be better to keep a pseudo-count for each room independently. I tried this and it looks promising.
38. Appendix 3: Real-time visualization of training
*.r: Real reward (all scores and moving average)
*.R: Frequency of visits to each room
*.RO: Frequency of TLHoRR in each room
*.lives: Number of LIVES at TLHoRR
*.k: Frequency of KILL in each room
*.tes: Length of the TLHoRR history for each score
*.s: The number of steps until getting a real reward
*.prR: Pseudo-reward in each room (all pseudo-rewards and moving average)
*.vR: Values in each room (all Values and moving average)
