5. Inverse Reinforcement Learning
• Objective:
\[
\max_{\theta} \; \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \log p_{\theta}(\tau) \right],
\qquad
p_{\theta}(\tau) = \frac{1}{Z} \exp\left( r_{\theta}(\tau) \right)
\]
• $r_{\theta}(\tau)$: reward function
• $Z$: partition function
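As a rough illustration of this objective, the sketch below evaluates the average log-likelihood of demonstrations when the set of trajectories is small enough to enumerate, so that the partition function can be summed exactly. The function name `maxent_log_likelihood`, the toy reward values, and the enumerable trajectory set are assumptions made for this example and do not come from the slides.

```python
import numpy as np

def maxent_log_likelihood(rewards_all, demo_idx):
    """Mean log p(tau) over demonstrations, with p(tau) = exp(r(tau)) / Z.

    rewards_all: r(tau) for every trajectory in a small enumerable set (assumption).
    demo_idx:    indices of the demonstrated trajectories within that set.
    """
    rewards_all = np.asarray(rewards_all, dtype=float)
    # log Z = log sum_tau exp(r(tau)), shifted by the max for numerical stability.
    m = rewards_all.max()
    log_z = m + np.log(np.exp(rewards_all - m).sum())
    return float(np.mean(rewards_all[demo_idx] - log_z))

# Toy data (hypothetical): three possible trajectories, three demonstrations.
rewards = np.array([1.0, 0.0, -1.0])
demos = np.array([0, 0, 1])
ll = maxent_log_likelihood(rewards, demos)
ll_shifted = maxent_log_likelihood(rewards + 5.0, demos)
```

Because the partition function normalizes the distribution, `ll` and `ll_shifted` agree: adding a constant to every trajectory reward does not change the likelihood.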
6. [] Inverse Reinforcement Learning
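\[
\mathcal{L}_{\text{reward}}(\theta)
= \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \log p_{\theta}(\tau) \right]
= \mathbb{E}_{\tau \sim \mathcal{D}} \left[ r_{\theta}(\tau) \right] - \log Z
\]
\[
= \mathbb{E}_{\tau \sim \mathcal{D}} \left[ r_{\theta}(\tau) \right]
- \log \mathbb{E}_{\tau \sim q} \left[ \frac{\exp\left( r_{\theta}(\tau) \right)}{q(\tau)} \right]
\]
\[
\mathcal{L}_{\text{sampler}}(q)
= \mathbb{E}_{\tau \sim q} \left[ r_{\theta}(\tau) \right]
- \mathbb{E}_{\tau \sim q} \left[ \log q(\tau) \right]
\]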
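The two objectives on this slide can be estimated from samples. The following is a minimal sketch, assuming that per-trajectory rewards and sampler log-densities are already available as arrays; the function names and the toy numbers are hypothetical. The partition function is replaced by an importance-weighted average over trajectories drawn from the sampler q.

```python
import numpy as np

def log_mean_exp(x):
    """Numerically stable log(mean(exp(x)))."""
    m = x.max()
    return m + np.log(np.mean(np.exp(x - m)))

def reward_objective(r_demo, r_samp, log_q_samp):
    """E_{tau~D}[r(tau)] - log E_{tau~q}[exp(r(tau)) / q(tau)]."""
    return float(np.mean(r_demo) - log_mean_exp(r_samp - log_q_samp))

def sampler_objective(r_samp, log_q_samp):
    """E_{tau~q}[r(tau)] - E_{tau~q}[log q(tau)]."""
    return float(np.mean(r_samp) - np.mean(log_q_samp))

# Toy check (hypothetical): q is uniform over three trajectories and each
# trajectory appears once, so the importance-weighted estimate of Z is exact.
r_samp = np.array([1.0, 0.0, -1.0])
log_q_samp = np.full(3, np.log(1.0 / 3.0))
r_demo = np.array([1.0, 1.0, 0.0])

l_reward = reward_objective(r_demo, r_samp, log_q_samp)
l_sampler = sampler_objective(r_samp, log_q_samp)
```

With real data the estimate of log Z is only as good as the overlap between q and the distribution induced by the reward, which is why the sampler is trained alongside the reward.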
7. Generative Adversarial Nets [Goodfellow+, 14]
• Objective:
\[
\min_{G} \max_{D} \; V(D, G)
= \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
+ \mathbb{E}_{z \sim p_{z}(z)} \left[ \log\left( 1 - D(G(z)) \right) \right]
\]
  First term: discriminator true labels for the dataset
  Second term: discriminator false labels for generated data
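To make the two terms of the value function concrete, here is a small sketch that evaluates V(D, G) from discriminator outputs. It assumes the discriminator's probabilities on real and generated batches are already computed; the function name `gan_value` and the sample values are illustrative only.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G).

    d_real: D(x) on samples from the dataset.
    d_fake: D(G(z)) on generated samples.
    eps:    clipping constant to keep the logarithms finite (assumption).
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    # First term rewards "true" labels on data, second term "false" labels on fakes.
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

v_chance = gan_value([0.5, 0.5], [0.5, 0.5])     # discriminator cannot tell them apart
v_confident = gan_value([0.9, 0.9], [0.1, 0.1])  # discriminator separates them well
```

A discriminator at chance gives `v_chance` equal to -log 4, and a discriminator that separates the two batches pushes the value up toward 0, which is the direction the inner maximization moves.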
13. eB o lg
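• $r(s)$, $\hat{r}(s)$: reward functions
• $\Phi(s) : \mathcal{S} \to \mathbb{R}$
• $Q^{*}_{r,T}$: optimal Q-function for $r$
• $Q^{*}_{\hat{r},T}$: optimal Q-function for $\hat{r}$
• Relation:
\[
Q^{*}_{\hat{r},T}(s, a) = Q^{*}_{r,T}(s, a) - \Phi(s)
\]
\[
Q^{*}_{r}(s, a) = r(s) + \gamma \, \mathbb{E}_{s'} \left[ \operatorname{softmax}_{a'} Q^{*}_{r}(s', a') \right]
\]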
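The recursion for the optimal Q-function can be solved by fixed-point iteration in a small tabular MDP. The sketch below assumes that "softmax" here denotes the soft maximum, log-sum-exp over next actions, as is usual in maximum entropy RL; the random MDP, the function name `soft_q_iteration`, and the iteration count are hypothetical choices for illustration.

```python
import numpy as np

def soft_q_iteration(r, P, gamma, n_iters=500):
    """Iterate Q(s,a) = r(s) + gamma * E_{s'}[log sum_{a'} exp Q(s',a')].

    r: (S,) state rewards.
    P: (S, A, S) transition probabilities, P[s, a, s'].
    """
    n_states, n_actions, _ = P.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Soft maximum over next actions, shifted by the max for stability.
        m = q.max(axis=1)
        v = m + np.log(np.exp(q - m[:, None]).sum(axis=1))
        # P @ v has shape (S, A): expected next-state soft value for each (s, a).
        q = r[:, None] + gamma * (P @ v)
    return q

# Small random MDP (hypothetical) with 4 states and 2 actions.
rng = np.random.default_rng(0)
P = rng.random((4, 2, 4))
P /= P.sum(axis=2, keepdims=True)
r = rng.normal(size=4)
q_star = soft_q_iteration(r, P, gamma=0.9)
```

With a discount below 1 the update is a contraction, so the iterate converges to the unique fixed point regardless of the initial Q.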
14. a B c
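• $\hat{r}(s) = r(s) + f(s)$
• $\hat{r}(s) = r(s) + \text{const}$
\[
\hat{r}(s) = r(s) + f(s) = r(s) + \gamma \, \mathbb{E}_{s'} \left[ \Phi(s') \right] - \Phi(s)
\]
\[
Q^{*}_{\hat{r}}(s, a) = r(s) + \gamma \, \mathbb{E}_{s'} \left[ \Phi(s') \right] - \Phi(s)
+ \gamma \, \mathbb{E}_{s'} \left[ \operatorname{softmax}_{a'} Q^{*}_{\hat{r}}(s', a') \right]
\]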
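The effect of adding a shaping term to the reward can be checked numerically. The sketch below is an illustration under assumptions: a random tabular MDP, "softmax" read as log-sum-exp over next actions, and the expectation over s' taken under the transition row for each (s, a), so the shaped reward is stored per state-action pair. All names are hypothetical. It compares the soft Q-function of the shaped reward with the original soft Q-function minus Φ(s).

```python
import numpy as np

def soft_q(r_sa, P, gamma, n_iters=500):
    """Soft Q fixed point for a reward array r_sa of shape (S, A)."""
    q = np.zeros_like(r_sa)
    for _ in range(n_iters):
        m = q.max(axis=1)
        v = m + np.log(np.exp(q - m[:, None]).sum(axis=1))  # soft max over a'
        q = r_sa + gamma * (P @ v)
    return q

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.normal(size=n_states)
phi = rng.normal(size=n_states)  # arbitrary state function Phi (assumption)

r_sa = np.repeat(r[:, None], n_actions, axis=1)
# Shaped reward: r(s) + gamma * E_{s'}[Phi(s')] - Phi(s).
r_hat = r_sa + gamma * (P @ phi) - phi[:, None]

q = soft_q(r_sa, P, gamma)
q_hat = soft_q(r_hat, P, gamma)
gap = float(np.max(np.abs(q_hat - (q - phi[:, None]))))
```

The value of `gap` is at the level of floating-point error, so the two Q-functions differ only by a state-dependent offset. Such an offset cancels when actions are compared within a state, so both rewards induce the same policy.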