
RL_Tutorial

Tutorial of Deep Reinforcement Learning


  10. Learn actions that maximize the reward that can be obtained.
  12. The discount factor γ controls how much future rewards count (e.g., γ=0 vs. γ=0.9); see the discounted-return sketch after the slide list.
  13. Learn actions that maximize the reward obtained over the whole future, i.e., the return.
  15. Reinforcement learning algorithm map: Monte Carlo, SARSA, Q-Learning, Actor-Critic, Policy Gradient, Deep Learning, Deep Q-Network, Double Q-Learning, Double DQN, GORILA (parallelization), Dueling DQN, Prioritized Experience Replay, Advantage Q-Learning, A3C, Generalized Advantage Estimator, TRPO, PPO, UNREAL.
  19. Q-Table: a table of action values Q(s, a) indexed by state and action, e.g. state s=1: ↑7, ↓9, ←3, →0; state s=2: ↑4, ↓7, ←0, →0. A tabular Q-learning sketch is given after the slide list.
  25. Experience replay: store transitions in a replay buffer and learn the parameters θ from shuffled samples; a minimal replay-buffer sketch is given after the slide list.
  28. 強化学習アルゴリズムマップ (reinforcement learning algorithm map; same map as slide 15).
  39. UNREAL: reinforcement learning with unsupervised auxiliary tasks. https://deepmind.com/blog/reinforcement-learning-unsupervised-auxiliary-tasks/
  41. Tools used: ChainerRL, OpenAI Gym, MuJoCo.
  43. OpenAI Gym environments: Classic control (CartPole), Atari (Pong, Breakout, Boxing), MuJoCo (Humanoid), Robotics (picking, hand manipulation).
  45. Creating and displaying an environment: import gym, then env = gym.make('CartPole-v0'), env.reset() to initialize, and env.render() to display. The rendered CartPole window is shown as output.
  46. Stepping the environment with env.step(action), which returns the next observation, reward, done flag, and info; the printed values are shown as output. A runnable sketch of this loop is given after the slide list.
  50. Keywords: 分散学習 (distributed learning), 画像認識 (image recognition), 強化学習 (reinforcement learning).
  53. Defining the Q-function network as a chainer.Chain with fully-connected (Linear) layers.
  54. Building the ChainerRL agent: optimizer, gamma, an epsilon-greedy explorer, a replay buffer, and the DQN agent constructed from q_func, optimizer, replay_buffer, gamma, explorer, replay_start_size, update_interval, and target_update_interval.
  55. Training code: the agent interacts with the environment and learns over episodes.
  56. Result: behavior before training vs. after 200 episodes of training. A hedged ChainerRL sketch covering slides 53-55 is given after the slide list.
  58. Network for image input: an 84x84 input image is processed by convolution + pooling layers followed by fully connected layers; a hedged convolutional Q-function sketch is given after the slide list.
  61. Agent-environment loop: the agent (e.g., a robot) observes the state s of the environment (e.g., a maze) through a state-observation module, selects an action, and the environment transitions to the next state s'.
  63. Evaluation plots: average number of action steps and average return.
  83. In the simulator, learning is possible with any input image.
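
Slide 12 contrasts γ=0 with γ=0.9. Here is a minimal sketch of how the discount factor shapes the return; the reward sequence and function name are illustrative assumptions, not taken from the slides.

```python
def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 0, 1]  # illustrative episode: reward only at the very end
print(discounted_return(rewards, gamma=0.0))  # 0.0   -> only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))  # 0.729 -> the delayed reward still matters
```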
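
Slide 19 shows a Q-table of values Q(s, a). A minimal sketch of the tabular Q-learning update that fills such a table; the action names, learning rate, and epsilon are assumptions for illustration.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
actions = ["up", "down", "left", "right"]
Q = defaultdict(lambda: {a: 0.0 for a in actions})  # Q[state][action]

def choose_action(state):
    # epsilon-greedy over the current Q-table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(Q[state], key=Q[state].get)

def q_update(s, a, reward, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```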
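
Slide 25 describes experience replay: transitions go into a replay buffer and are sampled in shuffled order when learning θ. A minimal sketch, with arbitrary capacity and batch size.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10 ** 5):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # random sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)
```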
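
Slides 45-46 create a CartPole-v0 environment and call reset, render, and step. A runnable version of that loop with a random policy, assuming the classic Gym API the slides appear to use (newer Gym/Gymnasium releases changed the reset and step signatures).

```python
import gym

env = gym.make('CartPole-v0')
observation = env.reset()               # initialize; returns the first observation
for t in range(200):
    env.render()                        # display the simulator window
    action = env.action_space.sample()  # random action (0 or 1 for CartPole)
    observation, reward, done, info = env.step(action)
    if done:                            # pole fell over or the step limit was reached
        print('Episode finished after {} steps'.format(t + 1))
        break
env.close()
```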
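
Slides 53-55 define a Q-function, construct a ChainerRL agent, and train it. The slide text did not survive extraction, so this is only a sketch in the style of the ChainerRL quickstart; the layer sizes and hyperparameter values are assumptions, and only the constructor arguments named on slide 54 (q_func, optimizer, replay_buffer, gamma, explorer, replay_start_size, update_interval, target_update_interval) come from the deck.

```python
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

class QFunction(chainer.Chain):
    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super().__init__()
        with self.init_scope():
            self.l0 = L.Linear(obs_size, n_hidden_channels)
            self.l1 = L.Linear(n_hidden_channels, n_hidden_channels)
            self.l2 = L.Linear(n_hidden_channels, n_actions)

    def __call__(self, x):
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

env = gym.make('CartPole-v0')
q_func = QFunction(env.observation_space.shape[0], env.action_space.n)

optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

gamma = 0.95  # assumed value
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1, target_update_interval=100,
    phi=lambda x: x.astype(np.float32, copy=False))

# Training loop: act_and_train stores transitions and updates the network.
for episode in range(200):
    obs, reward, done = env.reset(), 0.0, False
    while not done:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
    agent.stop_episode_and_train(obs, reward, done)
```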
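
Slide 58 describes the network for image input: an 84x84 image, convolution + pooling, then fully connected layers. A hedged Chainer sketch of such a Q-function; the filter sizes follow the common DQN convention (strided convolutions rather than explicit pooling) and are not taken from the slide.

```python
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl

class CNNQFunction(chainer.Chain):
    """Maps a stack of 84x84 frames to one Q-value per action."""
    def __init__(self, n_input_channels, n_actions):
        super().__init__()
        with self.init_scope():
            self.conv1 = L.Convolution2D(n_input_channels, 32, ksize=8, stride=4)
            self.conv2 = L.Convolution2D(32, 64, ksize=4, stride=2)
            self.conv3 = L.Convolution2D(64, 64, ksize=3, stride=1)
            self.fc1 = L.Linear(None, 512)  # None lets Chainer infer the flattened size
            self.fc2 = L.Linear(512, n_actions)

    def __call__(self, x):  # x: (batch, channels, 84, 84)
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = F.relu(self.conv3(h))
        h = F.relu(self.fc1(h))
        return chainerrl.action_value.DiscreteActionValue(self.fc2(h))
```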
