Deep Q-Network Paper Reading Group

We read the Nature paper on Deep Q-Network, "Human-level control through deep reinforcement learning", at an in-house paper reading group.

Published in: Technology

  1. 2016/02/19 Paper reading group
     Human-level control through deep reinforcement learning
     Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
  2. Deep Q-Network
     https://www.youtube.com/watch?v=iqXKQf2BOSE
     - Is not told the rules of the game; learns only from the screen pixels and the score (reward)
     - On 29 of the 49 Atari 2600 games, scores on par with or above a human expert
  3. Google DeepMind
     - 2010: founded as "DeepMind Technologies"
     - 2014: acquired by Google
     - Developed Deep Q-learning
     Figure labels: CNN for DQN, Atari 2600, Deep Dream
  4. Deep Q-Network
     Q-learning + Deep Neural Network
     - What is Q-learning to begin with?
     - And what is Q to begin with?
  6. Reinforcement learning: when the agent takes an action a in a state s, it moves to another state s' and receives a reward r.
     Backup diagrams: (a) a state s with value V(s); (b) a state-action pair (s, a) with value Q(s, a). π denotes the policy.
  7. Three quantities used in reinforcement learning
     - Q(s, a): action-value function
     - V(s): state-value function
     - π(s, a): policy
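  8. Definitions of the value functions
     Action-value function:
     $$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$
     γ is the discount rate, which weights rewards received further in the future less.
     State-value function:
     $$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
     Both are expectations because the outcome at each time step is not uniquely determined; actions are assumed to be selected according to the policy π(s, a).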
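As a small illustration of the definitions above, the following sketch computes the discounted return of a finite reward sequence and averages sampled returns as a crude estimate of the expectation. The function names and the list-of-rewards representation are assumptions made for this example and do not come from the slides; `rewards[k]` plays the role of r_{t+k+1}.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Work backwards through the sequence: G_k = r_k + gamma * G_{k+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def average_return(episodes, gamma):
    """Average the discounted returns of several sampled reward sequences.

    If every sequence starts from the same state s and follows the same
    policy, this average is a sample-based estimate of V(s).
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)
```

With `gamma = 0.5`, the reward sequence `[1, 1, 1]` has return 1 + 0.5 + 0.25 = 1.75.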
  9. How to obtain Q and V
     If Q and V are known, we know which action is best. However, Q and V are not given in advance, so how do we obtain them?
     Three representative ways to obtain Q and V:
     - Monte Carlo methods
     - Dynamic programming (DP)
     - TD (Temporal Difference) learning: Monte Carlo methods + DP
     "If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning." (Richard S. Sutton)
  10. What is TD learning?
     The value function can be written recursively:
     $$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big\} = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$$
     The idea of TD learning: in the end we want V(s) = E[r + γV(s')] to hold. The expectation E[·] is not known, so each observed transition is used as a sample and V(s) is moved toward r + γV(s'):
     $$V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,]$$
     α is the learning rate. Learning proceeds so that the difference between the target r_{t+1} + γV(s_{t+1}) and the current estimate V(s_t) becomes small.
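The TD update above can be sketched in a few lines. This is a minimal illustration, assuming the value function is stored in a plain dictionary keyed by state; the `terminal` flag and the function name are additions for this example and are not part of the slides.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to the value table V in place.

    Implements V(s) <- V(s) + alpha * (r + gamma * V(s_next) - V(s)).
    Returns the TD error so the caller can monitor learning.
    """
    # A terminal successor has no future value, so the target is just r.
    target = r if terminal else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


# Usage: a single observed transition A -> B with reward 1.
V = {"A": 0.0, "B": 0.0}
error = td0_update(V, "A", 1.0, "B", alpha=0.5, gamma=0.9)
```

After this single update, `V["A"]` has moved halfway from 0 toward the target 1.0.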
  12. Q-learning
     The Q-learning algorithm:
       Initialize Q(s, a) arbitrarily
       Repeat (for each episode):
         Initialize s
         Repeat (for each step of episode):
           Choose a from s using policy derived from Q (e.g., ε-greedy)
           Take action a, observe r, s'
           Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
           s ← s'
         until s is terminal
     - The action is normally chosen according to Q, but occasionally a different action is tried (exploration).
     - Q is updated so that the difference between the target r + γ max_{a'} Q(s', a') and the current estimate Q(s, a) becomes small (the optimal Q satisfies the Bellman equation).
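The algorithm above can be turned into a short runnable sketch. The environment here is a hypothetical four-state chain invented for this example (move left or right, reward 1 on reaching the rightmost state); the hyperparameters and helper names are likewise assumptions, not values from the slides.

```python
import random

N_STATES = 4          # states 0..3; state 3 is terminal (hypothetical toy chain)
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1


def step(s, a):
    """Deterministic chain dynamics: reward 1.0 only on reaching GOAL."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done


def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, else greedy with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    best = max(Q[s])
    return rng.choice([a for a, q in enumerate(Q[s]) if q == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[s][a], initialized to zero
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s, epsilon, rng)
            s_next, r, done = step(s, a)
            # Terminal states contribute no future value to the target.
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```

Calling `train()` returns the learned table; in this toy chain the value of moving right should end up higher than the value of moving left in every non-terminal state.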
  13. Wait a moment!
     Q-learning seems to work, but a table over states and actions becomes impractical when the state and action spaces are very large.
     Q-learning with function approximation:
     $$Q_{\theta}(s, a) = \sum_{i=1}^{n} \phi_i(s, a)\,\theta_i = \phi(s, a)^{T}\theta$$
     The Q function is expressed as a linear combination of features φ_i(s, a) weighted by parameters θ_i.
  14. Q-learning with a function approximator
     Loss function:
     $$L_i(\theta_i) = E_{s,a \sim \rho(\cdot)}\big[(y_i - Q(s, a; \theta_i))^2\big]$$
     where
     $$y_i = E_{s' \sim \mathcal{E}}\big[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\big|\, s, a\big]$$
     Gradient:
     $$\nabla_{\theta_i} L_i(\theta_i) = E_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\Big]$$
     The learning loop is the same as in Q-learning, except that the table update is replaced by updating θ with ∇_θ L(θ).
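For a linear approximator the gradient of Q with respect to θ is simply the feature vector, so one stochastic gradient step on the loss above fits in a few lines. This is a sketch under that linear assumption; the feature vectors, function names, and the choice to absorb the constant factor from the squared loss into the step size α are all assumptions for illustration.

```python
def q_value(theta, phi):
    """Linear action value: Q(s, a; theta) = phi(s, a) . theta."""
    return sum(t * f for t, f in zip(theta, phi))


def semi_gradient_step(theta, phi_sa, r, phi_next_all, alpha, gamma, terminal=False):
    """Return new parameters after one sampled gradient step.

    theta        : parameters from the previous iteration (used for the target)
    phi_sa       : feature vector of the visited pair (s, a)
    phi_next_all : feature vectors phi(s', a') for every action a' in s'
    The target y is treated as a constant when differentiating.
    """
    if terminal:
        y = r
    else:
        y = r + gamma * max(q_value(theta, phi) for phi in phi_next_all)
    td_error = y - q_value(theta, phi_sa)
    # For a linear model, grad_theta Q(s, a; theta) = phi(s, a).
    return [t + alpha * td_error * f for t, f in zip(theta, phi_sa)]
```

For example, starting from `theta = [0.0, 0.0]` with `phi_sa = [1.0, 0.0]`, reward 1 and `alpha = 0.5`, only the first parameter moves, because only the first feature is active for the visited pair.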
  15. What is distinctive about DQN
     - The features used to approximate the Q function previously had to be designed by hand for each game.
     - DQN feeds the game screen into a CNN, so that useful features are extracted automatically from the pixels.
     (Work by DeepMind.)
  16. The DQN algorithm
     Algorithm 1: Deep Q-learning with Experience Replay
       Initialize replay memory D to capacity N
       Initialize action-value function Q with random weights
       for episode = 1, M do
         Initialise sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
         for t = 1, T do
           With probability ε select a random action a_t,
             otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
           Execute action a_t in emulator and observe reward r_t and image x_{t+1}
           Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
           Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
           Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
           Set y_j = r_j                                  for terminal φ_{j+1}
                   = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ)   for non-terminal φ_{j+1}
           Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2
         end for
       end for
     Points to note about learning in DQN:
     1. Past experience is stored and reused for learning (experience replay).
     2. The minibatch is sampled at random from the memory, in order to reduce the correlation between samples.
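The bookkeeping parts of the algorithm above (replay memory, ε-greedy selection, and the targets y_j) can be sketched without any deep learning library. The class and function names are assumptions for this example, and `q_fn` stands in for the network: any callable that maps a preprocessed state to a list of action values. The gradient step on the network itself is deliberately left out.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions (phi, a, r, phi_next, terminal)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, phi, a, r, phi_next, terminal):
        self.buffer.append((phi, a, r, phi_next, terminal))

    def sample(self, batch_size, rng=random):
        """Uniformly sample a minibatch without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the argmax action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_targets(batch, q_fn, gamma):
    """Compute y_j for each sampled transition, as in the algorithm above."""
    targets = []
    for phi, a, r, phi_next, terminal in batch:
        if terminal:
            targets.append(r)
        else:
            targets.append(r + gamma * max(q_fn(phi_next)))
    return targets
```

In a full training loop, each step would store the new transition, sample a minibatch, compute `td_targets`, and then update the network parameters toward those targets.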
  17. Overall picture of DQN (figure): convolution, convolution, fully connected, fully connected; experience replay.
  18. DQN results (figure: bar chart of per-game scores, with games such as Video Pinball, Boxing, Breakout, Star Gunner, Robotank, Atlantis, Crazy Climber, Gopher, Demon Attack, Name This Game and Krull). Scores are normalized so that the human expert score is 100%; a second reference level is marked at 75% of the human score. An arrow annotation points at Pacman, where DQN scores poorly.
  19. Visualization of the trained network
  20. Reference: Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning" (Japanese edition).
