GAN生成画像の原理と応用

Generative Adversarial Network
（と強化学習との関係を少し）
鈴⽊雅⼤
2017/12/12
強化学習アーキテクチャ勉強会資料

⾃⼰紹介
鈴⽊雅⼤（博⼠課程3年）
¤ 経歴
¤ 学部〜修⼠：北海道⼤学情報科学研究科
¤ 博⼠〜：東京⼤学⼯学系研究科松尾研究室
¤ 専⾨分野：
¤ 機械学習，転移学習
¤ 深層⽣成モデル（VAE）
¤ Deep Learning基礎講座，先端⼈⼯知能論などの講義・演習担当
¤ Goodfellow著「Deep Learning」の監修・翻訳
Male : 0.95
Eyeglasses : -0.99
Young : 0.30
Smiling : -0.97
Male : 0.22
Eyeglasses : -0.99
Young : 0.87
Smiling : -1.00
Input
Generated
attributes Average face Reconstruction Not Male Eyeglasses Not Young Smiling
Mouth
slightly open

今⽇の内容
¤ GANの説明と強化学習との関係について
¤ 前半：GANの説明（※先端⼈⼯知能論 II／ Deep Learning 応⽤講座で話した内容とほぼ同じです）
¤ 後半：GANと強化学習の関係
[Choi+ 2017] [Zhu+ 2017]

⽬次
¤ 前半：
¤ ⽣成モデルとニューラルネットワーク
¤ Generative Adversarial Network
¤ GANの学習の難しさ
¤ GANの種類
¤ GANの評価
¤ GANの応⽤
¤ 後半：
¤ Actor-criticとGAN
¤ 逆強化学習とGAN
¤ 深層⽣成モデルと強化学習

⽣成モデルとニューラルネットワーク

⽣成モデル
¤ データの⽣成過程を数理的にモデル化したもの．
¤ 数理モデルは確率分布によって表される
¤ 𝑥"が確率𝑝 𝑥; 𝜃 から⽣成されるときは𝑥" ~𝑝 𝑥; 𝜃 と表記する．
¤ 𝑝 𝑥; 𝜃 を𝑝( 𝑥 とも書く
観測データ
𝑝 𝑥; 𝜃
⽣成モデル⽣成
学習
モデルパラメータ
観測データはある数理モデルから
⽣成されたとする

⽣成モデルだとできること
¤ サンプリング：
¤ 確率モデルがあるので，未知のデータを⽣成できる
¤ 「⽣成」モデルと呼ばれるのはここから
¤ 密度推定：
¤ データ𝑥を⼊⼒すると，密度𝑝(𝑥)が得られる．
¤ 外れ値検出や異常検知に⽤いられる．
¤ ⽋損値補完，ノイズ除去：
¤ ⽋損やノイズのある𝑥+を⼊⼒すると，真の𝑥の推定値が得られる．
𝑝 𝑥; 𝜃
http://jblomo.github.io/datamining290/slides/2013-04-26-
Outliers.html

ニューラルネットワークによる⽣成モデルの表現
¤ ⽣成モデル 𝑝 𝑥; 𝜃 を深層ニューラルネットワークによって表現する（深層⽣成モデル）．
¤ 従来の確率分布だと，複雑な⼊⼒を直接扱えない．
¤ ニューラルネットワークで表現することで，より複雑な⼊⼒xを扱えるようになる．
¤ ⼀般物体画像のような複雑な画像も⽣成できる可能性がある！
¤ ⼤きく分けて2つのアプローチがある（有向モデルの場合）．
¤ 𝑝(𝑥|𝑧) （⽣成器，generator）をモデル化する．
¤ 𝑧から𝑥へのニューラルネットワーク𝑥 = 𝑓(𝑧)によって表現．
¤ VAE，GANなど．
¤ 𝑝(𝑥)を直接モデル化する（⾃⼰回帰モデル）．
¤ 𝑝 𝑥 = ∏ 𝑝(𝑥"|𝑥1, … , 𝑥"41)" として，各条件付き分布をモデル化．
¤ NADE[Larochelle+ 11]，PixelRNN（CNN）[Oord+ 16]， WaveNet[Oord+ 16]など．
⽣成器 𝑥𝑧
https://deepmind.com/blog/wavenet-generative-model-raw-audio/

深層⽣成モデルの分類
¤ それぞれ利点と⽋点がある
¤ VAEとGANはサンプリングが早いが，尤度を直接求められない．
¤ ⾃⼰回帰モデルは尤度を正確に求められるが，モデルが多くサンプリングや学習の速度が遅い．
学習⽅法尤度の計算サンプリング潜在変数への推論学習するモデル
（NNでモデル化）
VAE 変分下界の
最⼤化
厳密にはできない
（変分下界を計算）
低コスト近似分布によって可能⽣成モデル 𝑝 𝑥 𝑧
推論モデル 𝑞 𝑧 𝑥
GAN 敵対的学習できない
（尤度⽐のみ）
低コスト推論はモデル化されてい
ない（モデルによる）
⽣成器 𝐺(𝑧)
識別器 𝐷(𝑥)
⾃⼰回帰
モデル
対数尤度の
最⼤化
できる⾼コスト潜在変数⾃体がない複数の条件付き分布
8 𝑝(𝑥"|𝑥1, … , 𝑥"41)
"

Variational Autoencoder
¤ Variational autoencoder（VAE） [Kingma+ 13; ICLR 2014][Rezende+ 14; ICML 2014]
𝑥
𝑧
𝑞9(𝑧|𝑥)
𝑥 ~ 𝑝((𝑥|𝑧)
𝑧 ~ 𝑝(𝑧)𝑞9 𝑧 𝑥 = 𝒩(𝑧|𝜇 𝑥 , 𝜎=
(𝑥))
推論モデル（ガウス分布）
エンコーダーと考える
⽣成モデル（ベルヌーイ分布）
𝑝( 𝑥 𝑧 = ℬ(𝑥|𝜇 𝑥 )
デコーダーと考える

VAEからの画像⽣成
¤ 𝑝(𝑥|𝑧)だけでなく𝑝(𝑧) （ = ∫ 𝑞(𝑧|𝑥)𝑝(𝑥)𝑑𝑥）も学習している
¤ データの多様体が𝑧で獲得されている．
(a) Learned Frey Face manifold (b) Learned MNIST manifold
Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent
space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coor-
dinates on the unit square were transformed through the inverse CDF of the Gaussian to produce
values of the latent variables z. For each of these values z, we plotted the corresponding generative
p✓(x|z) with the learned parameters ✓.
[Kingma+ 13]より

⽣成画像
¤ ランダムな𝑧から画像をサンプリング
¤ 輪郭等がぼやける傾向がある．
anifold (b) Learned MNIST manifold
arned data manifold for generative models with two-dimensional latent
Since the prior of the latent space is Gaussian, linearly spaced coor-
were transformed through the inverse CDF of the Gaussian to produce
z. For each of these values z, we plotted the corresponding generative
ameters ✓.
(b) 5-D latent space (c) 10-D latent space (d) 20-D latent space
rom learned generative models of MNIST for different dimensionalities
[Kingma+ 13]より @AlecRad

Generative Adversarial Network

なぜ⽣成画像がぼやけてしまうのか？
¤ 理由の⼀つは，確率分布を明⽰的にモデル化して最尤学習を⾏っているため．
¤ VAEでは𝑝(𝑥|𝑧)をガウス分布としてモデル化しているため，再構成誤差が（分散を考慮しなけれ
ば）⼆乗誤差になる．
¤ 画素をはっきりさせるより曖昧にした⽅が，誤差は⼩さくなる．
¤ ガウス分布の分散を⼩さくするとはっきり再構成できるようになるが，逆に未知のデータをうま
く⽣成できなくなる．
¤ また最尤学習では，訓練データにない部分についても⾼い確率を置くように訓練しがち．
¤ 最尤学習（尤度最⼤化）＝KLダイバージェンスの最⼩化
¤ 訓練データにない部分がぼやけた画像になる．
-> 確率分布を暗黙的にモデル化し，別の学習基準を⽤いる必要がある！
𝑝ABCB(𝑥) 𝑝D(𝑥)
𝐾𝐿[𝑝ABCB||𝑝D]
= I 𝑝ABCB log
𝑝ABCB
𝑝D
𝑑𝑥
= 𝐸OPQRQ
log 𝑝ABCB − 𝐸OPQRQ
log 𝑝D
対数尤度

暗黙的な⽣成モデルの学習
¤ 確率分布 𝑝D(𝑥)を定義しないで，暗黙的な⽣成モデルとして考える．
¤ 尤度を測ることができない！
¤ ⽬標：真の分布 𝑝ABCB(𝑥)と近いモデル分布𝑝D(𝑥)を求める．
¤ 直接尤度を評価できないので，モデル分布と真の分布の密度⽐を考える[Uehara+ 17,
Mohamed+ 16]．
𝑝D(𝑥)
𝑝ABCB(𝑥)
𝑟 𝑥 =
𝑝ABCB(𝑥)
𝑝D(𝑥)

密度⽐の解釈
¤ データ集合𝒳 = {𝑥1, … , 𝑥W}のうち，半分のデータがモデル分布から⽣成され，もう半分が真の
分布から⽣成されるとする．
¤ 真の分布からの⽅を𝑦 = 1，モデル分布からの⽅を𝑦 = 0とラベル付けする．
¤ するとデータ集合は次のようになる．
¤ このことから，モデル分布と真の分布は，ラベルが与えられた下での条件付き分布となる
[Sugiyama+ 2012]．
(𝑥1, 1 … , 𝑥W
=
, 1 , 𝑥W
=1
, 0 , … , (𝑥W, 0)}
𝑝ABCB 𝑥 = 𝑝(𝑥|𝑦 = 1)
𝑝D 𝑥 = 𝑝(𝑥|𝑦 = 0)

密度⽐の解釈
¤ したがって，密度⽐は次のようになる．
->すなわち，𝑝 𝑦 = 1 𝑥 を推定できればよい．
¤ 𝑝 𝑦 = 1 𝑥 を推定する分布を𝑞9(𝑦 = 1|𝑥)とする．
¤ この分布をニューラルネットワークでパラメータ化する．
¤ 𝐷を識別器(discriminator)と呼ぶ．
¤ 識別器によって，密度推定をNNが得意な分類問題に置き換えることができる．
𝑝ABCB(𝑥)
𝑝D(𝑥)
=
𝑝(𝑥|𝑦 = 1)
𝑝(𝑥|𝑦 = 0)
=
𝑝 𝑦 = 1 𝑥 𝑝(𝑥)
𝑝(𝑦 = 1)
𝑝 𝑦 = 0 𝑥 𝑝(𝑥)
𝑝(𝑦 = 0)
=
𝑝 𝑦 = 1 𝑥
𝑝 𝑦 = 0 𝑥
・
1 − 𝜋
𝜋
（ただし，𝜋 = 𝑝(𝑦 = 1)）
𝑞9 𝑦 = 1 𝑥 = 𝐷(𝑥; 𝜑) 𝐷(𝑥; 𝜑) 𝑦𝑥

識別器の⽬的関数
¤ データ集合から識別器を学習するため，次の負の交差エントロピー損失を最⼤化する．
¤ 式変形して，
¤ 𝜋 =
1
=
とすると，⽬的関数𝑉(𝐷)は，
𝐸O(`,a)[𝑦 log 𝐷 𝑥; 𝜑 + 1 − 𝑦 log(1 − 𝐷(𝑥; 𝜑))]
𝐸O 𝑥 𝑦 O(a) 𝑦 log 𝐷 𝑥; 𝜑 + 1 − 𝑦 log 1 − 𝐷 𝑥; 𝜑
= 𝐸O 𝑥 𝑦 = 1 O(ac1) log 𝐷 𝑥; 𝜑 + 𝐸O 𝑥 𝑦 = 0 O(acd) log 1 − 𝐷 𝑥; 𝜑
= 𝜋𝐸OPQRQ(`) log 𝐷 𝑥; 𝜑 + (1 − 𝜋)𝐸Oe(`) log 1 − 𝐷 𝑥; 𝜑
𝑉 𝐷 = 𝐸OPQRQ(`) log 𝐷 𝑥; 𝜑 + 𝐸Oe(`) log 1 − 𝐷 𝑥; 𝜑

⽬的関数の意味
¤ 本来の⽬的は， 𝑝D(𝑥)を 𝑝ABCB(𝑥) に近づけること！
¤ 学習した𝐷を使って 𝑝D 𝑥 と𝑝ABCB(𝑥)の距離を求めたい．
¤ ⽬的関数𝑉(𝐷)にはどのような意味がある？
¤ もし識別関数が適切に推定できたら（すなわち𝐷∗ 𝑥 = 𝑝(𝑦 = 1|𝑥) ），
¤ このとき， 𝑉 𝐷∗ = 2・𝐽𝑆𝐷(𝑝ABCB||𝑝D) − 2 log 2となる．
¤ 𝐽𝑆𝐷はJensen-Shannonダイバージェンス．
¤ 𝐽𝑆𝐷(𝑝ABCB||𝑝D) =
1
=
𝐾𝐿(𝑝ABCB||
OPQRQOe
=
) +
1
=
𝐾𝐿(𝑝D||
OPQRQOe
=
) （KLと違って対称的）
->つまり， 𝑉(𝐷∗)は𝑝ABCBと𝑝DのJensen-Shannonダイバージェンスに対応している！
𝐷∗ 𝑥 =
i
i1
=
OPQRQ(`)
OPQRQ ` Oe(`)
に収束する．

⽣成モデルの学習
¤ 𝑝D 𝑥 = ∫ 𝑝 𝑥 𝑧 𝑝 𝑧 𝑑𝑧 と考えて， 𝑝 𝑥 𝑧 を推定する分布を𝑞((𝑥|𝑧)とする．
¤ この分布をニューラルネットワークでパラメータ化する．
¤ 𝐺を⽣成器(generator)と呼ぶ．
¤ すると，Gを学習するための⽬的関数は，
¤ Gは，この⽬的関数を最⼩化するように学習する．
¤ 𝑝ABCBと𝑝D のJS距離を近づけるため．
𝑞( 𝑥 𝑧 = 𝐺(𝑧; 𝜃) 𝐺(𝑧; 𝜃) 𝑥𝑧
𝑉(𝐷∗, 𝐺) = 𝐸OPQRQ(`) log 𝐷∗ (𝑥) + 𝐸O(m) log 1 − 𝐷∗(𝐺(𝑧; 𝜃))

⽣成器と識別器の学習
¤ 実際には，適切な𝐺が得られないと，最適な𝐷は学習できない．
¤ 逆も同じ．
¤ したがって，⽬的関数を交互に最適化することを考える．
¤ このような枠組みで学習する⽣成モデルを，
generative adversarial networks（GANs）[Goodfellow+ 14]
と呼ぶ．
min
(
𝐸O(m) log 1 − 𝐷 𝐺 𝑧; 𝜃 ; 𝜑
max
9
𝐸OPQRQ(`) log 𝐷 (𝑥; 𝜑) + 𝐸O(m) log 1 − 𝐷 𝐺 𝑧; 𝜃 ; 𝜑
・⽣成器の学習（𝐷は固定）
・識別器の学習（𝐺は固定）

Generative adversarial nets
¤ 全体の構造
¤ 直感的には，𝐺と𝐷で次のゲーム（ミニマックスゲーム）をする（敵対的学習）．
¤ 𝐺はなるべく𝐷を騙すように𝑥を⽣成する．
¤ 𝐷はなるべく𝐺に騙されないように識別する．
->最終的には，𝐷が本物と区別できないような𝑥が𝐺から⽣成される．
min
s
max
t
𝑉(𝐷, 𝐺)

アルゴリズム
¤ 𝐷を𝑘ステップ更新してから，𝐺を更新する．
¤ ただし，𝑘 = 1とする場合が多い．

VAEとGANの違い
¤ VAEは⽣成モデルの形を仮定しているが，GANは分布を仮定しない
¤ GANは尤度ではなく，識別器によって，⽣成器のよさを評価する．
¤ 測る距離が異なる．
¤ VAEは対数尤度（=KLダイバージェンス），GANはJSダイバージェンス．
¤ VAEは推論分布（エンコーダー）を考えるが，GANは考えない．
¤ ただし，モデルによる．
¤ どちらもDNNの得意な識別問題に持っていっているところがポイント．
⽣成器
識別器
𝑥v
𝑥
𝑧推論モデル⽣成モデル
𝑥 𝑥v
𝑧
VAE GAN

KLダイバージェンスとJSダイバージェンス
¤ 尤度最⼤化（KLダイバージェンス最⼩化）はデータのないところも覆ってしまう（B）．
¤ KLを逆にすると，⼀つの峰だけにfitするようになる（C）．
¤ JSダイバージェンスはちょうど中間あたりで学習（D）．
[Huszar´+ 15]
Under review as a conference paper at ICLR 2016
A: P B: arg minQ JS0.1[PkQ] C: arg minQ JS0.5[PkQ] D: arg minQ JS0.99[PkQ]
Figure 1: Illustrating the behaviour of the generalised JS divergence under model underspeciﬁcation
for a range of values of ⇡. Data is drawn from a multivariate Gaussian distribution P (A) and we aim
approximate it by a single isotropic Gaussian (B-D). Contours show level sets the approximating
distribution, overlaid on top of the 2D histogram of observed data. For ⇡ = 0.1, JS divergence
minimisation behaves like maximum likelihood (B), resulting in the characteristic moment matching
behaviour. For ⇡ = 0.99 (D), the behaviour becomes more akin to the mode-seeking behaviour
of minimising KL[QkP]. For the intermediate value of ⇡ = 0.5 (C) we recover the standard JS
divergence approximated by adversarial training. To produce this illustraiton we used software made
available by Theis et al. (2015).
It is easy to show that JS⇡ divergence converges to 0 in the limit of both ⇡ ! 0 and ⇡ ! 1. Cru-
cially, it can be shown that the gradients with respect to ⇡ at these two extremes recover KL[QkP]
and KL[PkQ], respectively. A proof of this property can be obtained by considering the Taylor-
expansion KL[QkQ + a] ⇡ aT
Ha, where H is the positive deﬁnite Hessian and substituting
a = ⇡(P Q) as follows:
lim
⇡!0
JSD[PkQ; ⇡]
⇡
= lim
⇡!0
⇢
KL[Pk⇡P + (1 ⇡)Q] +
1 ⇡)
⇡
KL[Qk⇡P + (1 ⇡)Q] (14)

GANの⽣成画像
¤ ランダムな𝑧から画像をサンプリング[Goodfellow+ 14]
Adversarial nets 225 ± 2 2057 ± 26
Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-
likelihood of samples on test set, with the standard error of the mean computed across examples. On TFD, we
computed the standard error across folds of the dataset, with a different chosen using the validation set of
each fold. On TFD, was cross validated on each fold and mean log-likelihood on each fold were computed.
For MNIST we compare against other models of the real-valued (rather than binary) version of dataset.
of the Gaussians was obtained by cross validation on the validation set. This procedure was intro-
duced in Breuleux et al. [8] and used for various generative models for which the exact likelihood
is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood
has somewhat high variance and does not perform well in high dimensional spaces but it is the best
method available to our knowledge. Advances in generative models that can sample but not estimate
likelihood directly motivate further research into how to evaluate such models.
In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no
claim that these samples are better than samples generated by existing methods, we believe that these
samples are at least competitive with the better generative models in the literature and highlight the
potential of the adversarial framework.
a) b)
c) d)
Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of
the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples
are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these
images show actual samples from the model distributions, not conditional means given samples of hidden units.
Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain
mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator
and “deconvolutional” generator)

GANの問題点
¤ GANにはいくつかの困難な点がある．
¤ 収束性
¤ Mode collapse問題
¤ 勾配消失
¤ これらの困難は互いに関係している．

GANの収束性について
¤ 𝑉(𝐷∗, 𝐺)が𝜃について凸関数ならば，Gは収束することが保証されている．
¤ しかし，⾮凸の場合は保証されない（NNは当然⾮凸）．
¤ 実際の学習では，SGDで交互に最適化している．
¤ ミニマックスゲームの平衡点は，両⽅のプレイヤーが同時に最⼩になる点（ナッシュ均衡）．
¤ ゼロサムゲームなので，お互いが𝑉を最適化しても，平衡点に⾏かずに振動する可能性がある．
¤ 例：⽬的関数を𝑣(𝑎, 𝑏) = 𝑎𝑏とし，𝐺と𝐷で交互に最⼩化と最⼤化を繰り返すと，次のようになる．
※最近では，最適化によって（局所）ナッシュ均衡に到達するよう保証する研究が⾏われている（[Nagarajan+ 17][Heusel+ 17][Kodali+ 17]など）．
https://lilianweng.github.io/lil-
log/2017/08/20/from-GAN-to-WGAN.html

GANにおけるGとDの学習の様⼦
¤ 𝐷は，𝐺が最適化する⽅向を⽰している．
¤ 敵対的というよりは，協⼒して学習しているイメージ．
http://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/

DCGAN
¤ Deep Convolutional Generative Adversarial Network[Alec+ 15]
¤ 畳込みNNでモデルを設計
¤ 学習の安定性のために，いくつかのトリックを利⽤．
¤ Batch Normalizationを使う（学習の安定のために重要）．
¤ 𝐺にはReLUを使い，出⼒はtanh（必然的にデータは事前に標準化しておく）．
¤ 𝐷にはleaky ReLUを使う．

DCGANの⽣成画像
¤ 従来と⽐較して，はるかに綺麗な画像が⽣成できるようになった．
¤ DCGANはGANのブレイクスルー的研究
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model
could learn to memorize training examples, but this is experimentally unlikely as we train with a
small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating
memorization with SGD and a small learning rate.
Figure 3: Generated bedrooms after ﬁve epochs of training. There appears to be evidence of visual
under-ﬁtting via repeated noise textures across multiple samples such as the base boards of some of
the beds.

潜在空間上でのベクトル演算
¤ ただし，GANにしかできない訳ではないことに注意．
¤ VAEでも可能（ただしGANの⽅が綺麗な画像が⽣成できる）．

Mode collapse問題
¤ 元々の定式化は，𝐺を固定して𝐷を最適化するというものだった．
¤ 𝐺は最適化した𝐷を⽤いて学習する(minmax)．
¤ （学習が不⼗分な）𝐷を固定して，𝐺を最適化した場合はどうなるか？
->𝐺の⽣成データが全て，ある峰（peaks）に対応するように学習してしまう（mode collapse）．
Published as a conference paper at ICLR 2017
Figure 2: Unrolling the discriminator stabilizes GAN training on a toy 2D mixture of Gaussians
dataset. Columns show a heatmap of the generator distribution after increasing numbers of training
steps. The final column shows the data distribution. The top row shows training for a GAN with
10 unrolling steps. Its generator quickly spreads out and converges to the target distribution. The
bottom row shows standard GAN training. The generator rotates through the modes of the data
distribution. It never converges to a fixed distribution, and only ever assigns significant probability
mass to a single data mode at once.
Figure 3: Unrolled GAN training increases stability for an RNN generator and convolutional dis-
criminator trained on MNIST. The top row was run with 20 unrolling steps. The bottom row is a
standard GAN, with 0 unrolling steps. Images are samples from the generator after the indicated
number of training steps.
generator, but without backpropagating through the generator. In both cases we find that the unrolled
objective performs better.
3.2 PATHOLOGICAL MODEL WITH MISMATCHED GENERATOR AND DISCRIMINATOR
To evaluate the ability of this approach to improve trainability, we look to a traditionally challenging
family of models to train – recurrent neural networks (RNNs). In this experiment we try to generate
MNIST samples using an LSTM (Hochreiter & Schmidhuber, 1997). MNIST digits are 28x28 pixel
images. At each timestep of the generator LSTM, it outputs one column of this image, so that
after 28 timesteps it has output the entire sample. We use a convolutional neural network as the
Published as a conference paper at ICLR 2017
Figure 2: Unrolling the discriminator stabilizes GAN training on a toy 2D mixture of Gaussians
dataset. Columns show a heatmap of the generator distribution after increasing numbers of training
steps. The final column shows the data distribution. The top row shows training for a GAN with
10 unrolling steps. Its generator quickly spreads out and converges to the target distribution. The
bottom row shows standard GAN training. The generator rotates through the modes of the data
distribution. It never converges to a fixed distribution, and only ever assigns significant probability
mass to a single data mode at once.
[Metz+ 17]

解決⽅法
¤ Minibatch discrimination[Salimans+ 17]
¤ Minibatch内の多様性（サンプル間のノルムの距離）を考慮した
discriminatorを設計．多様性が⼤きくなる⽅向にDを導くように
する．
¤ Unrolled GANs[Metz+ 17]
¤ 𝐷のパラメータをK回更新し（unrolling），その時点でのパラメー
タで𝐺を更新．𝐷は1回更新したパラメータに戻す．
¤ AdaGAN[Tolstikhin+ 17]
¤ GANにブースティングの⼿法を⽤いる．
¤ Wasserstein GANs[Arjovsky+ 17]
¤ 後述．

GANの勾配消失問題
¤ では，𝐷を完全に最適化させればいい？
¤ 𝑝ABCB(𝑥) について1， 𝑝D(𝑥) について0を出⼒する．
-> 勾配が消えてしまうため，𝐺の⽅向がわからなくなる！！
¤ GANのジレンマ：
¤ 𝐷が⼗分学習できないうちに𝐺を最適化すると，𝐺が同じようなサンプルしか⽣成しなくなる（mode
collapse）．
¤ ある𝐺のときに𝐷を完全に学習すると，勾配が消失してしまう（勾配消失問題）．

勾配消失を防ぐトリック
¤ 勾配が消失しないようにするため，𝐺の学習を次のように変更する場合が多い．
¤ 𝐷がほぼ偽物（𝐷 = 0）と判断したときでも，勾配が消失することを防ぐ．
¤ ただし，もはやJSダイバージェンスの最⼩化ではない！
¤ 逆⽅向のKLと負のJSを最⼩化することに対応[Arjovsky+ 17]
¤ （ただし[Zhou+ 17]はこの証明は誤りと指摘している）
min
(
𝐸O(m) −log 𝐷 𝐺 𝑧; 𝜃 ; 𝜑
−𝛻( log 𝐷(𝐺 𝑧; 𝜃 ) = −
𝛻( 𝐷(𝐺 𝑧; 𝜃 )
𝐷(𝐺 𝑧; 𝜃 )
𝛻( log 1 − 𝐷(𝐺 𝑧; 𝜃 ) = −
𝛻( 𝐷(𝐺 𝑧; 𝜃 )
1 − 𝐷(𝐺 𝑧; 𝜃 )
元の⽬的関数の勾配
変更した⽬的関数の勾配

その他のトリック
¤ https://github.com/soumith/ganhacks に⾊々経験則が書いてある．
¤ Normalize the inputs
¤ Use a spherical Z
¤ Use Soft and Noisy Labels
などなど．
¤ 基本的には，実装例があればそれに従うこと．

GAN Zoo
¤ GANの論⽂⼀覧
¤ https://github.com/hindupuravinash/the-gan-zoo
¤ 指数関数的に増加．
¤ 理論研究など，ここには⼊っていない研究も結構ある．
¤ 12/11現在234．12⽉分はまだ．

代表的なGANs
https://github.com/hwalsuklee/tensorflow-generative-model-collections

Conditional GAN
¤ Conditional Generative Adversarial Nets[Mirza+ 14]
¤ 𝑐で条件づけたGAN．
¤ GとDの両⽅に𝑐を加える（下の図では𝑦）．
𝑉t 𝐷, 𝐺 = 𝐸OPQRQ ` log 𝐷 𝑥, 𝑐; 𝜑 + 𝐸O m log 1 − 𝐷 𝐺 𝑧; 𝜃 , 𝑐; 𝜑
𝑉s 𝐷, 𝐺 = −𝐸O m log 𝐷 𝐺 𝑧; 𝜃 , 𝑐; 𝜑

cGANで⽣成した画像
¤ ⽂章情報で条件づけた例[Reed+ 16]．
¤ 画像で条件づけた例[Isola+ 16]（pix2pix，後ほど改めて紹介）．
nerative Adversarial Text to Image Synthesis
Xinchen Yan, Lajanugen Logeswaran REEDSCOT1
, AKATA2
, XCYAN1
, LLAJAN1
SCHIELE2
,HONGLAK1
nn Arbor, MI, USA (UMICH.EDU)
formatics, Saarbrücken, Germany (MPI-INF.MPG.DE)
tract
realistic images from text
nd useful, but current AI
m this goal. However, in
d powerful recurrent neu-
es have been developed
text feature representa-
convolutional generative
ANs) have begun to gen-
g images of specific cat-
album covers, and room
we develop a novel deep
ormulation to effectively
n text and image model-
concepts from characters
ate the capability of our
sible images of birds and
xt descriptions.
d in translating text in the form
itten descriptions directly into
“this small bird has a short,
e belly” or ”the petals of this
r are yellow”. The problem of
al descriptions gained interest
this small bird has a pink
breast and crown, and black
primaries and secondaries.
the flower has petals that
are bright pinkish purple
with white stigma
this magnificent fellow is
almost all black with a red
crest, and white cheek patch.
this white and yellow flower
have thin white petals and a
round yellow stamen
Figure 1. Examples of generated images from text descriptions.
Left: captions are from zero-shot (held out) categories, unseen
text. Right: captions are from the training set.
properties of attribute representations are attractive, at-
tributes are also cumbersome to obtain as they may require
domain-specific knowledge. In comparison, natural lan-
guage offers a general and flexible interface for describing
objects in any space of visual categories. Ideally, we could
have the generality of text descriptions with the discrimi-

infoGAN
¤ Information Maximizing Generative Adversarial Networks[Chen+ 16]
¤ cGANでは明⽰的に𝑐と𝑥のペアを与えていたが，暗黙的に𝑥と対応するcを獲得したい．
¤ そのため，相互情報量𝐼(𝑐; 𝑥 = 𝐺(𝑐, 𝑧))が⾼くなるように学習する．
¤ 実際はその下界を正則化項として加える．
¤ 𝑄(𝑐|𝑥)はニューラルネットワークとする（𝐷と重みを共有）．
𝑧
𝑥
𝑐 𝑧
𝑥
𝑐
infoGANcGAN
𝑉t.• 𝐷, 𝐺, 𝑄 = 𝐸OPQRQ ` log 𝐷 𝑥 + 𝐸O m log 1 − 𝐷 𝐺 𝑧 − 𝜆𝐿•(𝐺, 𝑄)
𝑉s 𝐷, 𝐺, 𝑄 = −𝐸O m log 𝐷 𝐺 𝑧 + 𝜆𝐿•(𝐺, 𝑄)
𝐿• 𝐺, 𝑄 = 𝐸‚~O ‚ ,`~s m,‚ [log 𝑄 𝑐 𝑥 ] ≤ 𝐼 𝑐; 𝐺 𝑐, 𝑧

ACGAN
¤ Conditional Image Synthesis With Auxiliary Classifier GANs[Odena+ 16]
¤ 綺麗な画像𝑥は，クラス𝑐の識別性も⾼いはず．
¤ 多様性と識別性を⾼めたい．
¤ 𝐺を𝑐で条件づけて，𝐷で𝑐を識別する補助分類器𝑄(𝐶 = 𝑐|𝑥)を追加．
¤ cGANの𝐺とinfoGANの𝐷．
¤ 𝐺を𝑐で条件づけることで，学習する分布の峰を少なくできる．
¤ 128×128の画像を⽣成．
𝑉t,• 𝐷, 𝐺, 𝑄 = 𝐸OPQRQ ` log 𝐷 𝑥 + 𝐸O m log 1 − 𝐷 𝐺 𝑧 + 𝐸 𝑄 𝐶 = 𝑐 𝑥 + 𝐸[𝑄(𝐶 = 𝑐|𝐺 𝑧 )]
𝑉s 𝐷, 𝐺, 𝑄 = −𝐸O m log 𝐷 𝐺 𝑧 − 𝐸[𝑄(𝐶 = 𝑐|𝐺 𝑧 )]

GANのダイバージェンスの変更
¤ GANはJSダイバージェンスを最⼩化している．
¤ 通常の⽣成モデルの最尤推定（KL最⼩化）から解放されている．
->別のダイバージェンスで最適化できないか？
¤ これまでに様々なダイバージェンス（距離）のGANが提案されている．
¤ ピアソンの𝜒2ダイバージェンス：LSGAN
¤ 𝑓ダイバージェンス：f-GAN
¤ Bregmanダイバージェンス（真の密度⽐との距離）：b-GAN[Uehara+ 17]
¤ Wasserstein距離：WGAN[Arjovsky+ 17]，WGAN-GP[Gulrajani+ 17]，BEGAN[Berthelot+ 17]
¤ MMD：MMD GAN [Li+ 17]
¤ Cramér距離： Cramér GAN[Bellemare+ 17]
¤ その他のIntegral Probability Metrics ：McGAN[Mroueh+ 17]， Fisher GAN[Mroueh+ 17]
などなど．

JSダイバージェンスの限界
¤ データは次元のあらゆるところに存在する訳ではない．
¤ 実際は低次元な多様体として存在する．
¤ ⾼次元空間の場合，2つの分布の低次元多様体が交わらない可能性が⾼い．
¤ 分布の台が交わらない場合，𝐷は完全に 𝑝ABCB(𝑥)と 𝑝D 𝑥 が分離できてしまう．
¤ JSダイバージェンスには交わらない分布間の距離の違いがわからない（定数になる）．
¤ Gの⽅向が決められない！！
𝑝D(𝑥)𝑝ABCB(𝑥)
𝑝D(𝑥)𝑝ABCB(𝑥)

Wasserstein距離
¤ Earth Mover距離（Wasserstein-1距離）
in𝑓
†∈ OPQRQ,Oe
𝐸 `,a ~†[||𝑥 − 𝑦||]
¤ 直感的には，𝑝ABCBから𝑝Dに確率密度を移すときの最⼩コスト．
¤ 𝛾が輸送量， | 𝑥 − 𝑦 |が移動距離に対応．
¤ JSダイバージェンスとの⽐較
¤ 直線の確率密度分布を移動した時の距離（右図）
Wasserstein距離 JSダイバージェンス
0
𝜃

Wasserstein GAN
¤ Wasserstein距離は双対表現として，次のように書ける．
¤ Wasserstein GAN（WGAN）では，これを𝐺の⽬的関数とする．
¤ 通常のGANとほぼ同じ！ただし・・・
¤ 𝐷の出⼒はそのまま（⾮線形関数を加えない）．
¤ 𝐷がリプシッツ性を満たすように学習する．
¤ 毎回𝜑を複数回更新して， Wasserstein距離を正確に近似するようにする．
𝑊 𝑝D, 𝑝ABCB = max
9
𝐸OPQRQ
𝐷 𝑥; 𝜑 − 𝐸O(m)[𝐷(𝐺 𝑧 ; 𝜑)]
ただし| 𝐷 𝑥 − 𝐷 𝑦 | ≤ | 𝑥 − 𝑦 |
1-リプシッツ連続
𝑉t 𝐷, 𝐺 = 𝐸OPQRQ ` 𝐷 𝑥; 𝜑 − 𝐸O m 𝐷 𝐺 𝑧; 𝜃 ; 𝜑
𝑉s 𝐷, 𝐺 = −𝐸O m 𝐷 𝐺 𝑧; 𝜃 ; 𝜑
s.t. | 𝐷 𝑥 − 𝐷 𝑦 | ≤ | 𝑥 − 𝑦 |

WGANの利点
¤ 従来のダイバージェンスに起因するmode collapseなどの問題が発⽣しない．
¤ 台が交わらない分布でも，勾配が得られる．
¤ ⽬的関数が意味を持つようになる．
¤ 従来のGANは，どれくらい画像が学習できているか，客観的に計測できなかった．

課題：リプシッツ性の担保
¤ Dのリプシッツ性を保つために，パラメータを制約する必要がある．
¤ ⽅法1：パラメータをclippingする．
¤ 𝜑を[−0.01,0.01]の範囲に収める．
¤ ただし，勾配爆発や消失の問題が⽣じる可能性がある．
¤ ⽅法2：gradient penaltyの追加[Gulrajani+ 17]
¤ Dの勾配ノルムが1になるような制約項を追加して学習．
¤ ただし，制約が厳しすぎるという指摘もあり[Kodali+ 17]．
¤ その後も，様々なintegral probability metricsによるGANが提案されている．
¤ Dの制約を変更することで，⽬的関数が様々な距離に対応するようになる．
¤ MMD[Li+ 17]， Cramér距離[Bellemare+ 17]などなど．

⽣成モデルの評価とGAN
¤ ⽣成モデルの場合，学習したモデルを評価するのが⾮常に困難．
¤ 教師あり学習では，テストデータ(𝑥, 𝑦)でうまく予測できるかを評価すればいい．
¤ 教師なし学習では，𝑥 → 𝑧などは⾃明ではない．
¤ よくある⽅法は，テストデータに対する（対数）尤度の値を調べる．
¤ VAEでは下界（正確にはimportance weighted AE[Burda+ 15]の下界）を計算する．
¤ GANは明⽰的に分布を持っていないので，尤度の評価が困難．
¤ 初期はパルツェン窓密度推定などが使われていた．
¤ しかし，評価としては役に⽴たない場合が多いことが⽰された[Theis+ 15]
¤ しかし，そもそも対数尤度が⾼ければ良い⽣成モデルといえるのか？？
¤ 最初に述べたように，KLはぼやけた画像を⾼く評価する．
¤ GANは，尤度の代わりにモデルに応じて独⾃の評価指標（識別器）で学習している．
-> 客観的にGANの評価は可能なのか？？

Inception Score
¤ 良い⽣成画像とは何か？[Salimans+ 16]
1. ⾼い信念でクラス分類ができる->エントロピー𝑝(𝑦|𝑥)が⼩さい．
2. サンプルのバリエーションが⼤きい->エントロピー 𝑝(𝑦)が⼤きい．
¤ よって次のスコアを調べればいい（inception score）．
¤ 𝐺からのサンプル𝑥を使って評価
¤ スコアが⼩さければ良いモデル．
¤ 𝑝(𝑦|𝑥)は事前学習済みのinceptionモデル．
¤ 経験的に，⼈間の評価と相関することがわかった．
¤ 現在GANで最もよく使われている評価指標．
𝐼𝑆 𝐺 = exp(𝐸`~s[𝑝(𝑦|𝑥)||𝑝(𝑦)])

Fréchet Inception Distance
¤ ISとは異なる評価指標[Heusel+ 17]．
¤ Inceptionモデルの任意の層（pool 3層）に， 𝑝ABCB 𝑥 と 𝑝D(𝑥)からのサンプルを写像する．
¤ 埋め込んだ層を連続多変量ガウス分布と考えて，平均と共分散を計算する．
¤ それらを⽤いてFréchet距離（Wasserstein-2距離）を計算する（Fréchet inception distance, FID）．
¤ IS（ここではinception distance）と⽐べて，適切な評価指標になっている．
𝐹𝐼𝐷 𝑑𝑎𝑡𝑎, 𝑔 = ||𝜇ABCB − 𝜇D||=
=
+ 𝑇𝑟(ΣABCB + ΣD − 2(ΣABCBΣD)
1
=)
Figure A8: Left: FID and right: Inception Score are evaluated for first row: Gaussian noise, second
row: Gaussian blur, third row: implanted black rectangles, fourth row: swirled images, fifth row.
salt and pepper noise, and sixth row: the CelebA dataset contaminated by ImageNet images. Left is
the smallest disturbance level of zero, which increases to the highest level at right. The FID captures
the disturbance level very well by monotonically increasing whereas the Inception Score fluctuates,
stays flat or even, in the worst case, decreases.
13

結局どのGANがいい？
¤ 安定性や収束性の問題はいまだに解決されていない．
¤ GANはVAEに⽐べて不安定で，ハイパーパラメータ等に対する分散が⾮常に⼤きい．
¤ 計算コストをかけて適切にパラメータチューニングすれば，GANの種類であまり差はない[Lucic+ 17]．
¤ ⾃分で動かして確認するのが重要．
¤ ほとんどの場合，⽬的関数の変更と多少NNアーキテクチャを変更すればよい．
¤ 最近はまとまった実装やライブラリが公開されている．
¤ https://github.com/hwalsuklee/tensorflow-generative-model-collections
¤ https://github.com/pfnet-research/chainer-gan-lib
¤ https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gan
¤ ⾃分の⽬的にあったGANを選ぶ．
¤ 具体的な⽬的ベースで提案されたGANも多い（GAN Zooやこの後のスライドを参照）．

⾼解像度の画像⽣成
¤ 段階的に解像度を上げるように学習することで，⾼解像度の画像⽣成が可能になっている．
¤ LAPGAN[Denton+ 15]，StackGAN[Zhang+ 16]， StackGAN++[Zhang+ 17]
¤ Progressive GAN[Karras+ 17]
¤ 低解像度の⽣成から学習していく（最初はモードが少ないので楽）．
¤ 安定して⾼解像度（1024x1024！）の画像を⽣成できる．

Image-to-image translation
¤ 画像から対応する画像を⽣成する．
¤ Pix2pix[Isola+ 16]
¤ Conditional GANの条件付けを変換前の画像とする．
¤ Gをautoencoderにする．
¤ BicycleGAN[Zhu+ 17]
¤ 決定論的な対応を学習するのではなく，潜在変数の分散を考慮したモデル．
Real or fake pair?
Positive examples Negative examples
Real or fake pair?
DD
G
G tries to synthesize fake
images that fool D
D tries to identify the fakes
Figure 2: Training a conditional GAN to predict aerial photos from
maps. The discriminator, D, learns to classify between real and
synthesized pairs. The generator learns to fool the discriminator.
Unlike an unconditional GAN, both the generator and discrimina-
tor observe an input image.
where G tries to minimize this objective against an ad-
versarial D that tries to maximize it, i.e. G⇤
=
arg minG maxD LcGAN (G, D).
To test the importance of conditioning the discrimintor,
we also compare to an unconditional variant in which the
discriminator does not observe x:
LGAN (G, D) =Ey⇠pdata(y)[log D(y)]+
Ex⇠pdata(x),z⇠pz(z)[log(1 D(G(x, z))].
(2)
Previous approaches to conditional GANs have found it
beneficial to mix the GAN objective with a more traditional
loss, such as L2 distance [29]. The discriminator’s job re-
mains unchanged, but the generator is tasked to not only
fool the discriminator but also to be near the ground truth
output in an L2 sense. We also explore this option, using
L1 distance rather than L2 as L1 encourages less blurring:
Figure 3
“U-Net”
tween m
this stra
nore the
Instead,
form of
at both t
observe
Designi
put, and
distribu
by the p
2.2. Ne
We a
from th
module
Details
with key
2.2.1
A defin
is that th
tion out
the inpu
are rend
structur
Image-to-Image Translation with Conditional Adversarial Networks
Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory
University of California, Berkeley
{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
Labels to Facade BW to Color
Aerial to Map
Labels to Street Scene
Edges to Photo
input output input
inputinput
input output
output
outputoutput
input output
Day to Night
Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
results of the method on several. In each case we use the same architecture and objective, and simply train on different data.
Abstract
We investigate conditional adversarial networks as a
general-purpose solution to image-to-image translation
problems. These networks not only learn the mapping from
input image to output image, but also learn a loss func-
tion to train this mapping. This makes it possible to apply
the same generic approach to problems that traditionally
would require very different loss formulations. We demon-
strate that this approach is effective at synthesizing photos
from label maps, reconstructing objects from edge maps,
and colorizing images, among other tasks. As a commu-
nity, we no longer hand-engineer our mapping functions,
and this work suggests we can achieve reasonable results
without hand-engineering our loss functions either.
Many problems in image processing, computer graphics,
and computer vision can be posed as “translating” an input
image into a corresponding output image. Just as a concept
may be expressed in either English or French, a scene may
be rendered as an RGB image, a gradient field, an edge map,
a semantic label map, etc. In analogy to automatic language
translation, we define automatic image-to-image translation
as the problem of translating one possible representation of
a scene into another, given sufficient training data (see Fig-
ure 1). One reason language translation is difficult is be-
cause the mapping between languages is rarely one-to-one
– any given concept is easier to express in one language
than another. Similarly, most image-to-image translation
problems are either many-to-one (computer vision) – map-
ping photographs to edges, segments, or semantic labels,
or one-to-many (computer graphics) – mapping labels or
sparse user inputs to realistic images. Traditionally, each of
these tasks has been tackled with separate, special-purpose
machinery (e.g., [7, 15, 11, 1, 3, 37, 21, 26, 9, 42, 46]),
despite the fact that the setting is always the same: predict
pixels from pixels. Our goal in this paper is to develop a
common framework for all these problems.
1
arXiv:1611.07004v1[cs.CV]21Nov2016

Image-to-image translation
¤ ペアになっていないデータで双⽅向に変換
¤ CycleGAN[Zhu+ 17]
¤ StarGAN[Choi+ 17]
¤ 2つのドメインではなく，複数のドメイン間で変換．
¤ GとDは1つだけ．represent one domain while those of men represent another.
Several image datasets come with a number of labeled
attributes. For instance, the CelebA[17] dataset contains 40
labels related to facial attributes such as hair color, gender,
and age, and the RaFD [11] dataset has 8 labels for facial
expressions such as ‘happy’, ‘angry’ and ‘sad’. These set-
tings enable us to perform more interesting tasks, namely
multi-domain image-to-image translation, where we change
images according to attributes from multiple domains. The
first five columns in Fig. 1 show how a CelebA image can
be translated according to any of the four domains, ‘blond
hair’, ‘gender’, ‘aged’, and ‘pale skin’. We can further ex-
tend to training multiple domains from different datasets,
such as jointly training CelebA and RaFD images to change
a CelebA image’s facial expression using features learned
by training on RaFD, as in the rightmost columns of Fig. 1.
However, existing models are both inefficient and inef-
fective in such multi-domain image translation tasks. Their
(a) Cross-domain models
21
4 3
G21 G12
G41
G14
G32
G23
G34 G43
2
1
5
4 3
(b) StarGAN
Figure 2. Comparison between cross-domain models and our pro-
posed model, StarGAN. (a) To handle multiple domains, cross-
domain models should be built for every pair of image domains.
(b) StarGAN is capable of learning mappings among multiple do-
mains using a single generator. The figure represents a star topol-
ogy connecting multi-domains.

Actor-critic
¤ ⽅策（actor）と価値関数（critic）を同時に学習する⼿法．
¤ MDPの下で（記号の説明を省略）．．．
¤ ⾏動価値関数は，
¤ ⽅策の更新：
¤ ⾏動価値関数の更新：
¤ よって，2段階の最適化アルゴリズムになる．
¤ なんかGANと似ている？？
most reinforcement learning algorithms either focus on learning a value function, like value iteration
and TD-learning, or learning a policy directly, as in policy gradient methods, AC methods learn
both simultaneously - the actor being the policy and the critic being the value function. In some AC
methods, the critic provides a lower-variance baseline for policy gradient methods than estimating
the value from returns. In this case even a bad estimate of the value function can be useful, as the
policy gradient will be unbiased no matter what baseline is used. In other AC methods, the policy is
updated with respect to the approximate value function, in which case pathologies similar to those in
GANs can result. If the policy is optimized with respect to an incorrect value function, it may lead
to a bad policy which never fully explores the space, preventing a good value function from being
found and leading to degenerate solutions. A number of techniques exist to remedy this problem.
Formally, consider the typical MDP setting for RL, where we have a set of states S, actions A, a
distribution over initial states p0(s), transition function P(st+1|st, at), reward distribution R(st)
and discount factor γ ∈ [0, 1]. The aim of actor-critic methods is to simultaneously learn an action-
value function Qπ
(s, a) that predicts the expected discounted reward:
Qπ
(s, a) = Est+k∼P,rt+k∼R,at+k∼π
∞
k=1
γk
rt+k st = s, at = a (6)
and learn a policy that is optimal for that value function:
π∗
= arg max
π
Es0∼p0,a0∼π[Qπ
(s0, a0)] (7)
We can express Qπ
as the solution to a minimization problem:
Qπ
= arg min
Q
Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (8)
Where D(·||·) is any divergence that is positive except when the two are equal. Now the actor-critic
problem can be expressed as a bilevel optimization problem as well:
F(Q, π) = Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (9)
f(Q, π) = −Es0∼p0,a0∼π[Qπ
(s0, a0)] (10)
There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the
policy through policy gradients and scale the policy gradient by the TD error, while the action-value
function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG)
[7, 10] and its extension to stochastic policies, SVG(0) [8], as well as neurally-ﬁtted Q-learning with
continuous actions (NFQCA) [9]. These algorithms are all intended for the case where actions and
value function Qπ
Qπ
∞
k=1
γk
rt+k st = s, at = a (6)
π∗
= arg max
π
(s0, a0)] (7)
We can express Qπ
Qπ
= arg min
Q
(s0, a0)] (10)
observations are continuous, and use neural networks for function approximation for both the action-
value function Qπ
Qπ
∞
k=1
γk
rt+k st = s, at = a (6)
π∗
= arg max
π
(s0, a0)] (7)
We can express Qπ
Qπ
= arg min
Q
(s0, a0)] (10)
value function and policy. This is an established approach in RL with continuous actions [11], and
all methods update the policy by passing back gradients of the estimated value with respect to the
actions rather than passing the TD error directly. The distinction between the methods lies mainly
in the way training proceeds. In NFQCA, the actor and critic are trained in batch mode after every
value function Qπ
Qπ
∞
k=1
γk
rt+k st = s, at = a (6)
π∗
= arg max
π
(s0, a0)] (7)
We can express Qπ
Qπ
= arg min
Q
(s0, a0)] (10)

Actor-criticとGAN
¤ Actor-criticとGANを同じ図式で並べてみる[Pfau+ 17]．
¤ 𝐺=⽅策，𝐷=価値関数とみなせる．
¤ ただし，GANでは𝐺が𝑥（現在の状態）を受け取らない．
¤ 𝑧からのランダム．
z G D
x y
(a) Generative Adversarial Networks [4]
π
st
Q
rt
(b) Deterministic Policy Gradient [7] /
SVG(0) [8] / Neurally-Fitted Q-learning
with Continuous Actions [9]
Figure 1: Information structure of GANs and AC methods. Empty circles represent models with a
distinct loss function. Filled circles represent information from the environment. Diamonds repre-
sent fixed functions, both deterministic and stochastic. Solid lines represent the flow of information,
while dotted lines represent the flow of gradients used by another model. Paths which are analo-
gous between the two models are highlighted in red. The dependence of Q on future states and the
dependence of future states on π are omitted for clarity.
2 Algorithms
Both GANs and AC can be seen as bilevel or two-time-scale optimization problems, where one
model is optimized with respect to the optimum of another model:
∗ ∗

双⽅のテクニックを利⽤できないか？
¤ 著者のPfau⽒はUnrolled GAN[Metz+ 17]を推している．
¤ というかUnrolled GAN提案者の⼀⼈．
Method GANs AC
Freezing learning yes yes
Label smoothing yes no
Historical averaging yes no
Minibatch discrimination yes no
Batch normalization yes yes
Target networks n/a yes
Replay buffers no yes
Entropy regularization no yes
Compatibility no yes
Table 1: Summary of different approaches used to stabilize and improve training for GANs and AC
methods. Those approaches that have been shown to improve performance are in green, those that
have not yet been demonstrated to improve training are in yellow, and those that are not applicable
to the particular method are in red.
Let the reward from the environment be 1 if the environment chose the real image and 0 if not. This
MDP is stateless as the image generated by the actor does not affect future data.
An actor-critic architecture learning in this environment clearly closely resembles the GAN game.
A few adjustments have to be made to make it identical. If the actor had access to the state it

SeqGAN
¤ SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient[Yu+ 17]
¤ 系列情報を⽣成するGAN
¤ 系列情報を学習するのは困難
¤ 系列情報は離散．
¤ 識別器は完全な系列しか評価できないため，部分的に⽣成された系列の評価が困難（最終的なスコア
が最⼤になるように𝐺を誘導したい）．
¤ 解決策：強化学習の利⽤
¤ 𝐺を⽅策，𝐷を報酬とする．
¤ 𝐷から価値関数を求める．
Reward
Next
action
MC
search D
Reward
Reward
Reward
Policy Gradient
Left: D is trained over
G. Right: G is trained
ard signal is provided
diate action value via
generative model Gθ.
Gθ is updated by em-
ch on the basis of the
discriminative model
elihood that it would
where Y n
1:t = (y1, . . . , yt) and Y n
t+1:T is sampled based on
the roll-out policy Gβ and the current state. In our experi-
ment, Gβ is set the same as the generator, but one can use
a simplified version if the speed is the priority (Silver et al.
2016). To reduce the variance and get more accurate assess-
ment of the action value, we run the roll-out policy starting
from current state till the end of the sequence for N times to
get a batch of output samples. Thus, we have:
Q
Gθ
Dφ
(s = Y1:t−1, a = yt) = (4)
1
N
N
n=1 Dφ(Y n
1:T ), Y n
1:T ∈ MCGβ (Y1:t; N) for t < T
Dφ(Y1:t) for t = T,
where, we see that when no intermediate reward, the function
is iteratively defined as the next-state value starting from state
s′
= Y1:t and rolling out to the end.
A benefit of using the discriminator Dφ as a reward func-
tion is that it can be dynamically updated to further improve
the generative model iteratively. Once we have a set of more
realistic generated sequences, we shall re-train the discrimi-
Reward
Next
action
State
MC
searchG D
Generate
True data
Train
G
Real World
D
Reward
Reward
Reward
Policy Gradient
Figure 1: The illustration of SeqGAN. Left: D is trained over
the real data and the generated data by G. Right: G is trained
by policy gradient where the final reward signal is provided
by D and is passed back to the intermediate action value via
Monte Carlo search.
synthetic sequences generated from the generative model Gθ.
w
th
m
a s
20
m
fro
ge
Q
w
is
s′

SeqGAN：擬似コード
Algorithm 1 Sequence Generative Adversarial Nets
Require: generator policy Gθ; roll-out policy Gβ; discriminator
Dφ; a sequence dataset S = {X1:T }
1: Initialize Gθ, Dφ with random weights θ, φ.
2: Pre-train Gθ using MLE on S
3: β ← θ
4: Generate negative samples using Gθ for training Dφ
5: Pre-train Dφ via minimizing the cross entropy
6: repeat
7: for g-steps do
8: Generate a sequence Y1:T = (y1, . . . , yT ) ∼Gθ
9: for t in 1 :T do
10: Compute Q(a = yt;s = Y1:t−1) by Eq. (4)
11: end for
12: Update generator parameters via policy gradient Eq. (8)
13: end for
14: for d-steps do
15: Use current Gθ to generate negative examples and com-
bine with given positive examples S
16: Train discriminator Dφ for k epochs by Eq. (5)
17: end for
18: β ← θ
19: until SeqGAN converges
In summary, Algorithm 1 shows full details of the pro-
posed SeqGAN. At the beginning of the training, we use
the maximum likelihood estimation (MLE) to pre-train Gθ
on training set S. We found the supervised signal from the
Short-Term Memory (LSTM) cells (Hochrei
huber 1997) to implement the update functio
is worth noticing that most of the RNN varia
gated recurrent unit (GRU) (Cho et al. 2014
tion mechanism (Bahdanau, Cho, and Bengi
used as a generator in SeqGAN.
The Discriminative Model for Sequenc
Deep discriminative models such as deep
(DNN) (Vesel`y et al. 2013), convolutional
(CNN) (Kim 2014) and recurrent convoluti
work (RCNN) (Lai et al. 2015) have shown
mance in complicated sequence classificatio
paper, we choose the CNN as our discrim
has recently been shown of great effectiven
ken sequence) classification (Zhang and LeC
discriminative models can only perform cla
for an entire sequence rather than the unfinis
paper, we also focus on the situation where th
predicts the probability that a finished seque
We first represent an input sequence x1, .
E1:T = x1 ⊕x2⊕. . . ⊕xT ,
where xt ∈ Rk
is the k-dimensional token
⊕ is the concatenation operator to build the
RT ×k
. Then a kernel w ∈ Rl×k
applies a
operation to a window size of l words to

逆強化学習
¤ Inverse reinforcement learning（IRL）
¤ ⽬標の⾏動（最適戦略）から報酬関数を推定する⼿法．
¤ Maximum entropy IRL（MaxEnt IRL）[Ng+ 00]
¤ 軌道がボルツマン分布に従うと考える．
¤ 最適な軌道で尤度が最⼤になると考える -> 尤度最⼤化
¤ しかし，分配関数𝑍をどうする・・・？
um or integral for most high-dimensional problems. A common approach to estimating Z requires
ampling from the Boltzmann distribution pθ (x) within the inner loop of learning.
Sampling from pθ (x) can be approximated by using Markov chain Monte Carlo (MCMC) methods;
however, these methods face issues when there are several distinct modes of the distribution and, as
a result, can take arbitrarily large amounts of time to produce a diverse set of samples. Approximate
nference methods can also be used during training, though the energy function may incorrectly
assign low energy to some modes if the approximate inference method cannot find them [14].
2.3 Inverse Reinforcement Learning
The goal of inverse reinforcement learning is to infer the cost function underlying demonstrated
behavior [15]. It is typically assumed that the demonstrations come from an expert who is behaving
near-optimally under some unknown cost. In this section, we discuss MaxEnt IRL and guided cost
earning, an algorithm for MaxEnt IRL.
2.3.1 Maximum entropy inverse reinforcement learning
Maximum entropy inverse reinforcement learning models the demonstrations using a Boltzmann
distribution, where the energy is given by the cost function cθ :
pθ (τ) =
1
Z
exp(−cθ (τ)),
Here, τ = {x1,u1,...,xT ,uT } is a trajectory; cθ (τ) = ∑t cθ (xt,ut) is a learned cost function
parametrized by θ; xt and ut are the state and action at time step t; and the partition function Z
s the integral of exp(−cθ (τ)) over all trajectories that are consistent with the environment dynam-
cs.2
Under this model, the optimal trajectories have the highest likelihood, and the expert can generate
uboptimal trajectories with a probability that decreases exponentially as the trajectories become
more costly. As in other energy-based models, the parameters θ are optimized to maximize the like-
main challenge in this optimization is evaluating the partition function Z, which is an intractable
sum or integral for most high-dimensional problems. A common approach to estimating Z requires
sampling from the Boltzmann distribution pθ (x) within the inner loop of learning.
inference methods can also be used during training, though the energy function may incorrectly
learning, an algorithm for MaxEnt IRL.
pθ (τ) =
1
Z
exp(−cθ (τ)),
is the integral of exp(−cθ (τ)) over all trajectories that are consistent with the environment dynam-
ics.2
suboptimal trajectories with a probability that decreases exponentially as the trajectories become
lihood of the demonstrations. Estimating the partition function Z is difficult for large or continuous
domains, and presents the main computational challenge. The first applications of this model com-
puted Z exactly with dynamic programming [27]. However, this is only practical in small, discrete
domains, and is impossible in domains where the system dynamics p(xt+1|xt,ut) are unknown.
2.3.2 Guided cost learning
Guided cost learning introduces an iterative sample-based method for estimating Z in the Max-
Boltzmann distribution:
pθ (x) =
1
Z
exp(−Eθ (x)) (1)
The energy function parameters θ are often chosen to maximize the likelihood of the data; the
main challenge in this optimization is evaluating the partition function Z, which is an intractable
sum or integral for most high-dimensional problems. A common approach to estimating Z requires
sampling from the Boltzmann distribution pθ (x) within the inner loop of learning.
inference methods can also be used during training, though the energy function may incorrectly
learning, an algorithm for MaxEnt IRL.
pθ (τ) =
1
Z
exp(−cθ (τ)),
is the integral of exp(−cθ (τ)) over all trajectories that are consistent with the environment dynam-
ics.2
コスト関数（報酬関数）
https://jangirrishabh.github.io/2016/07/09/virtual-car-IRL/

逆強化学習
¤ Guided cost learning (GCL) [Finn+ 16]
¤ 分配関数を推定するために，提案分布𝑞(𝜏)を考えて重点サンプリングによって求める．
¤ 提案分布が⾼い尤度で軌道をカバーできないと重点サンプリングが⾼バリアンスになる可能性があるので，デ
モンストレーション𝑝+も利⽤する．
¤ 𝑞は次のコストを最⼩化して求める（詳細は省略）．
-> 𝜃と𝑞について交互に最適化すればいい．
2.3.2 Guided cost learning
Guided cost learning introduces an iterative sample-based method for estimating Z in the Max-
Ent IRL formulation, and can scale to high-dimensional state and action spaces and nonlinear cost
functions [7]. The algorithm estimates Z by training a new sampling distribution q(τ) and using
importance sampling:
Lcost(θ) = Eτ∼p[−log pθ (τ)] = Eτ∼p[cθ (τ)]+ logZ
= Eτ∼p[cθ (τ)]+ log Eτ∼q
exp(−cθ (τ))
q(τ)
.
Guided cost learning alternates between optimizing cθ using this estimate, and optimizing q(τ) to
minimize the variance of the importance sampling estimate.
2This formula assumes that xt+1 is a deterministic function of the previous history. A more general form
of this equation can be derived for stochastic dynamics [26]. However, the analysis largely remains the same:
the probability of a trajectory can be written as the product of conditional probabilities, but the conditional
probabilities of the states xt are not affected by θ and so factor out of all likelihood ratios.
3
The optimal importance sampling distribution for estimating the partition function exp(−cθ (τ))dτ
is q(τ) ∝ |exp(−cθ (τ))| = exp(−cθ (τ)). During guided cost learning, the sampling policy
q(τ) is updated to match this distribution by minimizing the KL divergence between q(τ) and
1
Z exp(−cθ (τ)), or equivalently minimizing the learned cost and maximizing entropy.
Lsampler(q) = Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] (2)
Conveniently, this optimal sampling distribution is the demonstration distribution for the true cost
function. Thus, this training procedure results in both a learned cost function, characterizing the
demonstration distribution, and a learned policy q(τ), capable of generating samples from the
demonstration distribution.
This importance sampling estimate can have very high variance if the sampling distribution q fails
to cover some trajectories τ with high values of exp(−cθ (τ)). Since the demonstrations will have
low cost (as a result of the IRL objective), we can address this coverage problem by mixing the
demonstration data samples with the generated samples. Let µ = 1
2 p+ 1
2 q be the mixture distribution
over trajectory roll-outs. Let p(τ) be a rough estimate for the density of the demonstrations; for
example we could use the current model pθ , or we could use a simpler density model trained using
another method. Guided cost learning uses µ for importance sampling3, with 1
2 p(τ) + 1
2 q(τ) as the
importance weights:
Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼µ
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
,
2.4 Direct Maximum Likelihood and Behavioral Cloning
A simple approach to imitation learning and generative modeling is to train a generator or policy
to output a distribution over the data, without learning a discriminator or energy function. For
tractability, the data distribution is typically factorized using a directed graphical model or Bayesian
network. In the field of generative modeling, this approach has most commonly been applied to
speech and language generation tasks [23, 18], but has also been applied to image generation [22].
Like most EBMs, these models are trained by maximizing the likelihood of the observed data points.
When a generative model does not have the capacity to represent the entire data distribution, max-
imizing likelihood directly will lead to a moment-matching distribution that tries to “cover” all of
the modes, leading to a solution that puts much of its mass in parts of the space that have negligible
probability under the true distribution. In many scenarios, it is preferable to instead produce only re-
alistic, highly probable samples, by “filling in” as many modes as possible, at the trade-off of lower
1
demonstration data samples with the generated samples. Let µ = 1
2 p+ 1
2 q be the mixture distribution
over trajectory roll-outs. Let p(τ) be a rough estimate for the density of the demonstrations; for
example we could use the current model pθ , or we could use a simpler density model trained using
another method. Guided cost learning uses µ for importance sampling3, with 1
2 p(τ) + 1
2 q(τ) as the
importance weights:
Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼µ
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
,
2.4 Direct Maximum Likelihood and Behavioral Cloning
A simple approach to imitation learning and generative modeling is to train a generator or policy
to output a distribution over the data, without learning a discriminator or energy function. For
tractability, the data distribution is typically factorized using a directed graphical model or Bayesian
network. In the field of generative modeling, this approach has most commonly been applied to
speech and language generation tasks [23, 18], but has also been applied to image generation [22].
Like most EBMs, these models are trained by maximizing the likelihood of the observed data points.
When a generative model does not have the capacity to represent the entire data distribution, max-
1
𝜇=

GANの再解釈
¤ GANの最適な識別器は
¤ ここでモデル分布𝑞の確率密度は推定できるとする（GANの前提を変える）．
¤ さらに真の分布𝑝をコスト関数でパラメータ化したものに置き換える．
¤ するとGANの識別器の損失関数は
¤ Dのパラメータ=コスト関数のパラメータ
¤ のとき最適．
¤ [再掲]MaxEnt IRLのコストは（とすると）
verse reinforcement learning, where the data-to-be-modeled is a set of expert demonstrations. The
derivation requires a particular form of discriminator, which we discuss first in Section 3.1. After
making this modification to the discriminator, we obtain an algorithm for IRL, as we show in Sec-
tion 3.2, where the discriminator involves the learned cost and the generator represents the policy.
3.1 A special form of discriminator
For a fixed generator with a [typically unknown] density q(τ), the optimal discriminator is the fol-
lowing [8]:
D∗
(τ) =
p(τ)
p(τ)+q(τ)
, (3)
where p(τ) is the actual distribution of the data.
In the traditional GAN algorithm, the discriminator is trained to directly output this value. When
the generator density q(τ) can be evaluated, the traditional GAN discriminator can be modified
to incorporate this density information. Instead of having the discriminator estimate the value of
Equation 3 directly, it can be used to estimate p(τ), filling in the value of q(τ) with its known value.
In this case, the new form of the discriminator Dθ with parameters θ is
Dθ (τ) =
˜pθ (τ)
˜pθ (τ)+q(τ)
.
In order to make the connection to MaxEnt IRL, we also replace the estimated data density with
the Boltzmann distribution. As in MaxEnt IRL, we write the energy function as cθ to designate the
learned cost. Now the discriminator’s output is:
Dθ (τ) =
1
Z exp(−cθ (τ))
1
Z exp(−cθ (τ))+q(τ)
.
The resulting architecture for the discriminator is very similar to a typical model for binary classi-
fication, with a sigmoid as the final layer and logZ as the bias of the sigmoid. We have adjusted
the architecture only by subtracting logq(τ) from the input to the sigmoid. This modest change
allows the optimal discriminator to be completely independent of the generator: the discriminator is
optimal when 1
Z exp(−cθ (τ)) = p(τ). Independence between the generator and the optimal discrim-
inator may significantly improve the stability of training.
This change is very simple to implement and is applicable in any setting where the density q(τ)
can be cheaply evaluated. Of course this is precisely the case where we could directly maximize
likelihood, and we might wonder whether it is worth the additional complexity of GAN training.
But the experience of researchers in IRL has shown that maximizing log likelihood directly is not
always the most effective way to learn complex behaviors, even when it is possible to implement. As
we will show, there is a precise equivalence between MaxEnt IRL and this type of GAN, suggesting
alleviate the issue, they do not solve it completely.
3 GANs and IRL
We now show how generative adversarial modeling has implicitly been applied to the setting of in-
verse reinforcement learning, where the data-to-be-modeled is a set of expert demonstrations. The
derivation requires a particular form of discriminator, which we discuss first in Section 3.1. After
making this modification to the discriminator, we obtain an algorithm for IRL, as we show in Sec-
tion 3.2, where the discriminator involves the learned cost and the generator represents the policy.
lowing [8]:
D∗
(τ) =
p(τ)
p(τ)+q(τ)
, (3)
Dθ (τ) =
˜pθ (τ)
˜pθ (τ)+q(τ)
.
Dθ (τ) =
1
Z exp(−cθ (τ))
1
.
optimal when 1
that the same phenomenon may occur in other domains: GAN training may provide advantages even
when it would be possible to maximize likelihood directly.
lowing [8]:
D∗
(τ) =
p(τ)
p(τ)+q(τ)
, (3)
Dθ (τ) =
˜pθ (τ)
˜pθ (τ)+q(τ)
.
Dθ (τ) =
1
Z exp(−cθ (τ))
1
.
optimal when 1
3.2 Equivalence between generative adversarial networks and guided cost learning
In this section, we show that GANs, when applied to IRL problems, optimize the same objective as
MaxEnt IRL, and in fact the variant of GANs described in the previous section is precisely equivalent
to guided cost learning.
Recall that the discriminator’s loss is equal to
Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+ Eτ∼q[−log(1 − Dθ(τ))]
= Eτ∼p −log
1
Z exp(−cθ (τ))
1
Z exp(−cθ (τ))+ q(τ)
+ Eτ∼q −log
q(τ)
1
In maximum entropy IRL, the log-likelihood objective is:
Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼ 1
2 p+ 1
2 q
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
(4)
= Eτ∼p[cθ (τ)]+ log Eτ∼µ
exp(−cθ (τ))
1
2Z exp(−cθ (τ))+ 1
2 q(τ)
, (5)
where we have substituted p(τ) = pθ (τ) = 1
Z exp(−cθ (τ)), i.e. we are using the current model to
estimate the importance weights.
We will establish the following facts, which together imply that GANs optimize precisely the Max-
Ent IRL problem:
1. The value of Z which minimizes the discriminator’s loss is an importance-sampling estima-
Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+Eτ∼q[−log(1−Dθ(τ))]
= Eτ∼p −log
1
Z exp(−cθ (τ))
1
+Eτ∼q −log
q(τ)
1
Lcost(θ) = Eτ∼p[cθ (τ)]+log Eτ∼ 1
2 p+ 1
2 q
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
(4)
= Eτ∼p[cθ (τ)]+log Eτ∼µ
exp(−cθ (τ))
1
2Z exp(−cθ (τ))+ 1
2 q(τ)
, (5)
Ent IRL problem:
= Eτ∼p −log
1
Z exp(−cθ (τ))
1
+ Eτ∼q −log
q(τ)
1
2 p+ 1
2 q
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
(4)
exp(−cθ (τ))
1
2Z exp(−cθ (τ))+ 1
2 q(τ)
, (5)
Ent IRL problem:
Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+Eτ∼q[−log(1−Dθ(τ))]
= Eτ∼p −log
1
Z exp(−cθ (τ))
1
+Eτ∼q −log
q(τ)
1
Lcost(θ) = Eτ∼p[cθ (τ)]+log Eτ∼ 1
2 p+ 1
2 q
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
(4)
= Eτ∼p[cθ (τ)]+log Eτ∼µ
exp(−cθ (τ))
1
2Z exp(−cθ (τ))+ 1
2q(τ)
, (5)
Ent IRL problem:
tor for the partition function, as described in Section 2.3.2.
= Eτ∼p −log
1
Z exp(−cθ (τ))
1
+ Eτ∼q −log
q(τ)
1
2 p+ 1
2 q
exp(−cθ (τ))
1
2 p(τ)+ 1
2 q(τ)
(4)
exp(−cθ (τ))
1
2Z exp(−cθ (τ))+ 1
2 q(τ)
, (5)
Ent IRL problem:
2. For this value of Z, the derivative of the discriminator’s loss with respect to θ is equal to
the derivative of the MaxEnt IRL objective.
3. The generator’s loss is exactly equal to the cost cθ minus the entropy of q(τ), i.e. the
lowing [8]:
D∗
(τ) =
p(τ)
p(τ)+q(τ)
, (3)
Dθ (τ) =
˜pθ (τ)
˜pθ (τ)+q(τ)
.
Dθ (τ) =
1
Z exp(−cθ (τ))
1
.
optimal when 1

GANとIRLの関係
¤ 次のことが証明されている[Finn+ 17]．
¤ 識別器の⽬的関数を最⼩化する𝑍は，分配関数の重点サンプリングに対応する．
¤ この𝑍で，識別器の⽬的関数の勾配＝MaxEnt IRLの⽬的関数の勾配となる．
¤ ⽣成器の⽬的関数＝MaxEnt IRLの提案分布の⽬的関数
MaxEnt policy loss defined in Equation 2 in Section 2.3.2.
Recall that µ is the mixture distribution between p and q. Write µ(τ) = 1
2Z exp(−cθ (τ)) + 1
2 q(τ).
Note that when θ and Z are optimized, 1
Z exp(−cθ (τ)) is an estimate for the density of p(τ), and
hence µ(τ) is an estimate for the density of µ.
3.2.1 Z estimates the partition function
We can compute the discriminator’s loss:
Ldiscriminator(Dθ ) =Eτ∼p −log
1
Z exp(−cθ (τ))
µ(τ)
+ Eτ∼q −log
q(τ)
µ(τ)
(6)
=logZ + Eτ∼p[cθ (τ)]+ Eτ∼p[logµ(τ)]− Eτ∼q[logq(τ)]+ Eτ∼q[logµ(τ)] (7)
=logZ + Eτ∼p[cθ (τ)]− Eτ∼q[logq(τ)]+ 2Eτ∼µ[logµ(τ)]. (8)
Only the first and last terms depend on Z. At the minimizing value of Z, the derivative of these term
with respect to Z will be zero:
∂ZLdiscriminator(Dθ ) = 0
1
Z
= Eτ∼µ
1
Z2 exp(−cθ (τ))
µ(τ)
Z = Eτ∼µ
exp(−cθ (τ))
µ(τ)
.
Thus the minimizing Z is precisely the importance sampling estimate of the partition function in
Equation 4.
3.2.2 cθ optimizes the IRL objective
We return to the discriminator’s loss as computed in Equation 8, and consider the derivative with
respect to the parameters θ. We will show that this is exactly the same as the derivative of the IRL
objective.
6
Ent IRL problem:
2. For this value of Z, the derivative of the discriminator’s loss with respect to θ is equal to
the derivative of the MaxEnt IRL objective.
3. The generator’s loss is exactly equal to the cost cθ minus the entropy of q(τ), i.e. the
MaxEnt policy loss defined in Equation 2 in Section 2.3.2.
Recall that µ is the mixture distribution between p and q. Write µ(τ) = 1
2Z exp(−cθ (τ)) + 1
2 q(τ).
Note that when θ and Z are optimized, 1
Z exp(−cθ (τ)) is an estimate for the density of p(τ), and
hence µ(τ) is an estimate for the density of µ.
3.2.1 Z estimates the partition function
We can compute the discriminator’s loss:
Ldiscriminator(Dθ ) =Eτ∼p −log
1
Z exp(−cθ (τ))
µ(τ)
+ Eτ∼q −log
q(τ)
µ(τ)
(6)
=logZ + Eτ∼p[cθ (τ)]+ Eτ∼p[logµ(τ)]− Eτ∼q[logq(τ)]+ Eτ∼q[logµ(τ)] (7)
=logZ + Eτ∼p[cθ (τ)]− Eτ∼q[logq(τ)]+ 2Eτ∼µ[logµ(τ)]. (8)
Only the first and last terms depend on Z. At the minimizing value of Z, the derivative of these term
with respect to Z will be zero:
∂ZLdiscriminator(Dθ ) = 0
1
Z
= Eτ∼µ
1
Z2 exp(−cθ (τ))
µ(τ)
Z = Eτ∼µ
exp(−cθ (τ))
µ(τ)
.
Thus the minimizing Z is precisely the importance sampling estimate of the partition function in
Equation 4.
3.2.2 cθ optimizes the IRL objective
We return to the discriminator’s loss as computed in Equation 8, and consider the derivative with
respect to the parameters θ. We will show that this is exactly the same as the derivative of the IRL
objective.
6
Only the second and fourth terms in the sum depend on θ. When we differentiate those terms we
obtain:
∂θ Ldiscriminator(Dθ ) = Eτ∼p[∂θ cθ (τ)]− Eτ∼µ
1
Z exp(−cθ (τ))∂θ cθ (τ)
µ(τ)
On the other hand, when we differentiate the MaxEnt IRL objective, we obtain:
∂θ Lcost(θ) = Eτ∼p[∂θ cθ (τ)]+ ∂θ log Eτ∼µ
exp(−cθ (τ))
µ(τ)
= Eτ∼p[∂θ cθ (τ)]+ Eτ∼µ
−exp(−cθ (τ))∂θ cθ (τ)
µ(τ)
Eτ∼µ
exp(−cθ (τ))
µ(τ)
= Eτ∼p[∂θ cθ (τ)]− Eτ∼µ
1
µ(τ)
= ∂θ Ldiscriminator(Dθ ).
In the third equality, we used the definition of Z as an importance sampling estimate. Note that in the
second equality, we have treated µ(τ) as a constant rather than as a quantity that depends on θ. This
is because the IRL optimization is minimizing logZ = log∑τ exp(−cθ (τ)) and using µ(τ) as the
weights for an importance sampling estimator of Z. For this purpose we do not want to differentiate
through the importance weights.
3.3 The generator optimizes the MaxEnt IRL objective
Finally, we compute the generator’s loss:
Lgenerator(q) = Eτ∼q[log(1 − D(τ))− log(D(τ))]
= Eτ∼q log
q(τ)
µ(τ)
− log
1
Z exp(−cθ (τ))
µ(τ)
= Eτ∼q[logq(τ)+ logZ + cθ (τ)]
= logZ + Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] = logZ + Lsampler(q).
The term logZ is a parameter of the discriminator that is held fixed while optimizing the generator,
this loss is exactly equivalent the sampler loss from MaxEnt IRL, defined in Equation 2.
Only the second and fourth terms in the sum depend on θ. When we differentiate those terms we
obtain:
∂θ Ldiscriminator(Dθ ) = Eτ∼p[∂θ cθ (τ)]− Eτ∼µ
1
µ(τ)
On the other hand, when we differentiate the MaxEnt IRL objective, we obtain:
∂θ Lcost(θ) = Eτ∼p[∂θ cθ (τ)]+ ∂θ log Eτ∼µ
exp(−cθ (τ))
µ(τ)
= Eτ∼p[∂θ cθ (τ)]+ Eτ∼µ
−exp(−cθ (τ))∂θ cθ (τ)
µ(τ)
Eτ∼µ
exp(−cθ (τ))
µ(τ)
= Eτ∼p[∂θ cθ (τ)]− Eτ∼µ
1
µ(τ)
= ∂θ Ldiscriminator(Dθ ).
In the third equality, we used the definition of Z as an importance sampling estimate. Note that in the
second equality, we have treated µ(τ) as a constant rather than as a quantity that depends on θ. This
is because the IRL optimization is minimizing logZ = log∑τ exp(−cθ (τ)) and using µ(τ) as the
weights for an importance sampling estimator of Z. For this purpose we do not want to differentiate
through the importance weights.
3.3 The generator optimizes the MaxEnt IRL objective
Finally, we compute the generator’s loss:
Lgenerator(q) = Eτ∼q[log(1 − D(τ))− log(D(τ))]
= Eτ∼q log
q(τ)
µ(τ)
− log
1
Z exp(−cθ (τ))
µ(τ)
= Eτ∼q[logq(τ)+ logZ + cθ (τ)]
= logZ + Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] = logZ + Lsampler(q).
The term logZ is a parameter of the discriminator that is held fixed while optimizing the generator,
this loss is exactly equivalent the sampler loss from MaxEnt IRL, defined in Equation 2.
3.4 Discussion
There are many apparent differences between MaxEnt IRL and the GAN optimization problem. But,
we have shown that after making a single key change—using a generator q(τ) for which densities

GANとIRLの関係
¤ つまり，
¤ コスト関数（報酬関数）の最⼩化は𝐷の最適化とみなせる．
¤ 提案分布𝑞は𝐺とみなせる．
¤ GAN = IRLを⽰すと何がうれしい？
¤ 𝑞は密度推定できるので，直接最⼤化できるのでは？
¤ 𝑞を最尤推定（=KL最⼩化）で求めたくない（peakがないところにも分布を置くので）．
¤ GANとみなすと，様々なダイバージェンスで考えることができそう．
¤ その他の研究
¤ 模倣学習とGAN（GAIL[Ho+ 17]）
¤ OptionGAN[Henderson+ 17]
¤ InfoGAIL[Li+ 17]

深層⽣成モデルと強化学習

LeCun先⽣⽈く
Y LeCun
How Much Information Does the Machine Need to Predict?
“Pure” Reinforcement Learning (cherry)
The machine predicts a scalar
reward given once in a while.
A few bits for some samples
Supervised Learning (icing)
The machine predicts a category
or a few numbers for each input
Predicting human-supplied data
10 10,000 bits per sample→
Unsupervised/Predictive Learning (cake)
The machine predicts any part of
its input for any observed part.
Predicts future frames in videos
Millions of bits per sample
(Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up)

GAN生成画像の原理と応用

GAN生成画像の原理と応用

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to GAN生成画像の原理と応用

Similar to GAN生成画像の原理と応用 (20)

More from Masahiro Suzuki

More from Masahiro Suzuki (17)

Recently uploaded

Recently uploaded (20)

GAN生成画像の原理と応用