Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

GAN(と強化学習との関係)

20,183 views

Published on

2017/12/12
強化学習アーキテクチャ勉強会の資料

Published in: Engineering
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

GAN(と強化学習との関係)

  1. 1. Generative Adversarial Network (と強化学習との関係を少し) 鈴⽊ 雅⼤ 2017/12/12 強化学習アーキテクチャ勉強会資料
  2. 2. ⾃⼰紹介 鈴⽊雅⼤(博⼠課程3年) ¤ 経歴 ¤ 学部〜修⼠:北海道⼤学情報科学研究科 ¤ 博⼠〜:東京⼤学⼯学系研究科 松尾研究室 ¤ 専⾨分野: ¤ 機械学習,転移学習 ¤ 深層⽣成モデル(VAE) ¤ Deep Learning基礎講座,先端⼈⼯知能論などの講義・演習担当 ¤ Goodfellow著「Deep Learning」の監修・翻訳 Male : 0.95 Eyeglasses : -0.99 Young : 0.30 Smiling : -0.97 Male : 0.22 Eyeglasses : -0.99 Young : 0.87 Smiling : -1.00 Input Generated attributes Average face Reconstruction Not Male Eyeglasses Not Young Smiling Mouth slightly open
  3. 3. 今⽇の内容 ¤ GANの説明と強化学習との関係について ¤ 前半:GANの説明(※先端⼈⼯知能論 II/ Deep Learning 応⽤講座で話した内容とほぼ同じです) ¤ 後半:GANと強化学習の関係 [Choi+ 2017] [Zhu+ 2017]
  4. 4. ⽬次 ¤ 前半: ¤ ⽣成モデルとニューラルネットワーク ¤ Generative Adversarial Network ¤ GANの学習の難しさ ¤ GANの種類 ¤ GANの評価 ¤ GANの応⽤ ¤ 後半: ¤ Actor-criticとGAN ¤ 逆強化学習とGAN ¤ 深層⽣成モデルと強化学習
  5. 5. ⽣成モデルとニューラルネットワーク
  6. 6. ⽣成モデル ¤ データの⽣成過程を数理的にモデル化したもの. ¤ 数理モデルは確率分布によって表される ¤ 𝑥"が確率𝑝 𝑥; 𝜃 から⽣成されるときは𝑥" ~𝑝 𝑥; 𝜃 と表記する. ¤ 𝑝 𝑥; 𝜃 を𝑝( 𝑥 とも書く 観測データ 𝑝 𝑥; 𝜃 ⽣成モデル⽣成 学習 モデルパラメータ 観測データはある数理モデルから ⽣成されたとする
  7. 7. ⽣成モデルだとできること ¤ サンプリング: ¤ 確率モデルがあるので,未知のデータを⽣成できる ¤ 「⽣成」モデルと呼ばれるのはここから ¤ 密度推定: ¤ データ𝑥を⼊⼒すると,密度𝑝(𝑥)が得られる. ¤ 外れ値検出や異常検知に⽤いられる. ¤ ⽋損値補完,ノイズ除去: ¤ ⽋損やノイズのある𝑥+を⼊⼒すると,真の𝑥の推定値が得られる. 𝑝 𝑥; 𝜃 http://jblomo.github.io/datamining290/slides/2013-04-26- Outliers.html
  8. 8. ニューラルネットワークによる⽣成モデルの表現 ¤ ⽣成モデル 𝑝 𝑥; 𝜃 を深層ニューラルネットワークによって表現する(深層⽣成モデル). ¤ 従来の確率分布だと,複雑な⼊⼒を直接扱えない. ¤ ニューラルネットワークで表現することで,より複雑な⼊⼒xを扱えるようになる. ¤ ⼀般物体画像のような複雑な画像も⽣成できる可能性がある! ¤ ⼤きく分けて2つのアプローチがある(有向モデルの場合). ¤ 𝑝(𝑥|𝑧) (⽣成器,generator)をモデル化する. ¤ 𝑧から𝑥へのニューラルネットワーク𝑥 = 𝑓(𝑧)によって表現. ¤ VAE,GANなど. ¤ 𝑝(𝑥)を直接モデル化する(⾃⼰回帰モデル). ¤ 𝑝 𝑥 = ∏ 𝑝(𝑥"|𝑥1, … , 𝑥"41)" として,各条件付き分布をモデル化. ¤ NADE[Larochelle+ 11],PixelRNN(CNN)[Oord+ 16], WaveNet[Oord+ 16]など. ⽣成器 𝑥𝑧 https://deepmind.com/blog/wavenet-generative-model-raw-audio/
  9. 9. 深層⽣成モデルの分類 ¤ それぞれ利点と⽋点がある ¤ VAEとGANはサンプリングが早いが,尤度を直接求められない. ¤ ⾃⼰回帰モデルは尤度を正確に求められるが,モデルが多くサンプリングや学習の速度が遅い. 学習⽅法 尤度の計算 サンプリング 潜在変数への推論 学習するモデル (NNでモデル化) VAE 変分下界の 最⼤化 厳密にはできない (変分下界を計算) 低コスト 近似分布によって可能 ⽣成モデル 𝑝 𝑥 𝑧 推論モデル 𝑞 𝑧 𝑥 GAN 敵対的学習 できない (尤度⽐のみ) 低コスト 推論はモデル化されてい ない(モデルによる) ⽣成器 𝐺(𝑧) 識別器 𝐷(𝑥) ⾃⼰回帰 モデル 対数尤度の 最⼤化 できる ⾼コスト 潜在変数⾃体がない 複数の条件付き分布 8 𝑝(𝑥"|𝑥1, … , 𝑥"41) "
  10. 10. Variational Autoencoder ¤ Variational autoencoder(VAE) [Kingma+ 13; ICLR 2014][Rezende+ 14; ICML 2014] 𝑥 𝑧 𝑞9(𝑧|𝑥) 𝑥 ~ 𝑝((𝑥|𝑧) 𝑧 ~ 𝑝(𝑧)𝑞9 𝑧 𝑥 = 𝒩(𝑧|𝜇 𝑥 , 𝜎= (𝑥)) 推論モデル(ガウス分布) エンコーダーと考える ⽣成モデル(ベルヌーイ分布) 𝑝( 𝑥 𝑧 = ℬ(𝑥|𝜇 𝑥 ) デコーダーと考える
  11. 11. VAEからの画像⽣成 ¤ 𝑝(𝑥|𝑧)だけでなく𝑝(𝑧) ( = ∫ 𝑞(𝑧|𝑥)𝑝(𝑥)𝑑𝑥)も学習している ¤ データの多様体が𝑧で獲得されている. (a) Learned Frey Face manifold (b) Learned MNIST manifold Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coor- dinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z. For each of these values z, we plotted the corresponding generative p✓(x|z) with the learned parameters ✓. [Kingma+ 13]より
  12. 12. ⽣成画像 ¤ ランダムな𝑧から画像をサンプリング ¤ 輪郭等がぼやける傾向がある. anifold (b) Learned MNIST manifold arned data manifold for generative models with two-dimensional latent Since the prior of the latent space is Gaussian, linearly spaced coor- were transformed through the inverse CDF of the Gaussian to produce z. For each of these values z, we plotted the corresponding generative ameters ✓. (b) 5-D latent space (c) 10-D latent space (d) 20-D latent space rom learned generative models of MNIST for different dimensionalities [Kingma+ 13]より @AlecRad
  13. 13. Generative Adversarial Network
  14. 14. なぜ⽣成画像がぼやけてしまうのか? ¤ 理由の⼀つは,確率分布を明⽰的にモデル化して最尤学習を⾏っているため. ¤ VAEでは𝑝(𝑥|𝑧)をガウス分布としてモデル化しているため,再構成誤差が(分散を考慮しなけれ ば)⼆乗誤差になる. ¤ 画素をはっきりさせるより曖昧にした⽅が,誤差は⼩さくなる. ¤ ガウス分布の分散を⼩さくするとはっきり再構成できるようになるが,逆に未知のデータをうま く⽣成できなくなる. ¤ また最尤学習では,訓練データにない部分についても⾼い確率を置くように訓練しがち. ¤ 最尤学習(尤度最⼤化)=KLダイバージェンスの最⼩化 ¤ 訓練データにない部分がぼやけた画像になる. -> 確率分布を暗黙的にモデル化し,別の学習基準を⽤いる必要がある! 𝑝ABCB(𝑥) 𝑝D(𝑥) 𝐾𝐿[𝑝ABCB||𝑝D] = I 𝑝ABCB log 𝑝ABCB 𝑝D 𝑑𝑥 = 𝐸OPQRQ log 𝑝ABCB − 𝐸OPQRQ log 𝑝D 対数尤度
  15. 15. 暗黙的な⽣成モデルの学習 ¤ 確率分布 𝑝D(𝑥)を定義しないで,暗黙的な⽣成モデルとして考える. ¤ 尤度を測ることができない! ¤ ⽬標:真の分布 𝑝ABCB(𝑥)と近いモデル分布𝑝D(𝑥)を求める. ¤ 直接尤度を評価できないので,モデル分布と真の分布の密度⽐を考える[Uehara+ 17, Mohamed+ 16]. 𝑝D(𝑥) 𝑝ABCB(𝑥) 𝑟 𝑥 = 𝑝ABCB(𝑥) 𝑝D(𝑥)
  16. 16. 密度⽐の解釈 ¤ データ集合𝒳 = {𝑥1, … , 𝑥W}のうち,半分のデータがモデル分布から⽣成され,もう半分が真の 分布から⽣成されるとする. ¤ 真の分布からの⽅を𝑦 = 1,モデル分布からの⽅を𝑦 = 0とラベル付けする. ¤ するとデータ集合は次のようになる. ¤ このことから,モデル分布と真の分布は,ラベルが与えられた下での条件付き分布となる [Sugiyama+ 2012]. (𝑥1, 1 … , 𝑥W = , 1 , 𝑥W =1 , 0 , … , (𝑥W, 0)} 𝑝ABCB 𝑥 = 𝑝(𝑥|𝑦 = 1) 𝑝D 𝑥 = 𝑝(𝑥|𝑦 = 0)
  17. 17. 密度⽐の解釈 ¤ したがって,密度⽐は次のようになる. ->すなわち,𝑝 𝑦 = 1 𝑥 を推定できればよい. ¤ 𝑝 𝑦 = 1 𝑥 を推定する分布を𝑞9(𝑦 = 1|𝑥)とする. ¤ この分布をニューラルネットワークでパラメータ化する. ¤ 𝐷を識別器(discriminator)と呼ぶ. ¤ 識別器によって,密度推定をNNが得意な分類問題に置き換えることができる. 𝑝ABCB(𝑥) 𝑝D(𝑥) = 𝑝(𝑥|𝑦 = 1) 𝑝(𝑥|𝑦 = 0) = 𝑝 𝑦 = 1 𝑥 𝑝(𝑥) 𝑝(𝑦 = 1) 𝑝 𝑦 = 0 𝑥 𝑝(𝑥) 𝑝(𝑦 = 0) = 𝑝 𝑦 = 1 𝑥 𝑝 𝑦 = 0 𝑥 ・ 1 − 𝜋 𝜋 (ただし,𝜋 = 𝑝(𝑦 = 1)) 𝑞9 𝑦 = 1 𝑥 = 𝐷(𝑥; 𝜑) 𝐷(𝑥; 𝜑) 𝑦𝑥
  18. 18. 識別器の⽬的関数 ¤ データ集合から識別器を学習するため,次の負の交差エントロピー損失を最⼤化する. ¤ 式変形して, ¤ 𝜋 = 1 = とすると,⽬的関数𝑉(𝐷)は, 𝐸O(`,a)[𝑦 log 𝐷 𝑥; 𝜑 + 1 − 𝑦 log(1 − 𝐷(𝑥; 𝜑))] 𝐸O 𝑥 𝑦 O(a) 𝑦 log 𝐷 𝑥; 𝜑 + 1 − 𝑦 log 1 − 𝐷 𝑥; 𝜑 = 𝐸O 𝑥 𝑦 = 1 O(ac1) log 𝐷 𝑥; 𝜑 + 𝐸O 𝑥 𝑦 = 0 O(acd) log 1 − 𝐷 𝑥; 𝜑 = 𝜋𝐸OPQRQ(`) log 𝐷 𝑥; 𝜑 + (1 − 𝜋)𝐸Oe(`) log 1 − 𝐷 𝑥; 𝜑 𝑉 𝐷 = 𝐸OPQRQ(`) log 𝐷 𝑥; 𝜑 + 𝐸Oe(`) log 1 − 𝐷 𝑥; 𝜑
  19. 19. ⽬的関数の意味 ¤ 本来の⽬的は, 𝑝D(𝑥)を 𝑝ABCB(𝑥) に近づけること! ¤ 学習した𝐷を使って 𝑝D 𝑥 と𝑝ABCB(𝑥)の距離を求めたい. ¤ ⽬的関数𝑉(𝐷)にはどのような意味がある? ¤ もし識別関数が適切に推定できたら(すなわち𝐷∗ 𝑥 = 𝑝(𝑦 = 1|𝑥) ), ¤ このとき, 𝑉 𝐷∗ = 2・𝐽𝑆𝐷(𝑝ABCB||𝑝D) − 2 log 2となる. ¤ 𝐽𝑆𝐷はJensen-Shannonダイバージェンス. ¤ 𝐽𝑆𝐷(𝑝ABCB||𝑝D) = 1 = 𝐾𝐿(𝑝ABCB|| OPQRQOe = ) + 1 = 𝐾𝐿(𝑝D|| OPQRQOe = ) (KLと違って対称的) ->つまり, 𝑉(𝐷∗)は𝑝ABCBと𝑝DのJensen-Shannonダイバージェンスに対応している! 𝐷∗ 𝑥 = i i1 = OPQRQ(`) OPQRQ ` Oe(`) に収束する.
  20. 20. ⽣成モデルの学習 ¤ 𝑝D 𝑥 = ∫ 𝑝 𝑥 𝑧 𝑝 𝑧 𝑑𝑧 と考えて, 𝑝 𝑥 𝑧 を推定する分布を𝑞((𝑥|𝑧)とする. ¤ この分布をニューラルネットワークでパラメータ化する. ¤ 𝐺を⽣成器(generator)と呼ぶ. ¤ すると,Gを学習するための⽬的関数は, ¤ Gは,この⽬的関数を最⼩化するように学習する. ¤ 𝑝ABCBと𝑝D のJS距離を近づけるため. 𝑞( 𝑥 𝑧 = 𝐺(𝑧; 𝜃) 𝐺(𝑧; 𝜃) 𝑥𝑧 𝑉(𝐷∗, 𝐺) = 𝐸OPQRQ(`) log 𝐷∗ (𝑥) + 𝐸O(m) log 1 − 𝐷∗(𝐺(𝑧; 𝜃))
  21. 21. ⽣成器と識別器の学習 ¤ 実際には,適切な𝐺が得られないと,最適な𝐷は学習できない. ¤ 逆も同じ. ¤ したがって,⽬的関数を交互に最適化することを考える. ¤ このような枠組みで学習する⽣成モデルを, generative adversarial networks(GANs)[Goodfellow+ 14] と呼ぶ. min ( 𝐸O(m) log 1 − 𝐷 𝐺 𝑧; 𝜃 ; 𝜑 max 9 𝐸OPQRQ(`) log 𝐷 (𝑥; 𝜑) + 𝐸O(m) log 1 − 𝐷 𝐺 𝑧; 𝜃 ; 𝜑 ・⽣成器の学習(𝐷は固定) ・識別器の学習(𝐺は固定)
  22. 22. Generative adversarial nets ¤ 全体の構造 ¤ 直感的には,𝐺と𝐷で次のゲーム(ミニマックスゲーム)をする(敵対的学習). ¤ 𝐺はなるべく𝐷を騙すように𝑥を⽣成する. ¤ 𝐷はなるべく𝐺に騙されないように識別する. ->最終的には,𝐷が本物と区別できないような𝑥が𝐺から⽣成される. min s max t 𝑉(𝐷, 𝐺)
  23. 23. アルゴリズム ¤ 𝐷を𝑘ステップ更新してから,𝐺を更新する. ¤ ただし,𝑘 = 1とする場合が多い.
  24. 24. VAEとGANの違い ¤ VAEは⽣成モデルの形を仮定しているが,GANは分布を仮定しない ¤ GANは尤度ではなく,識別器によって,⽣成器のよさを評価する. ¤ 測る距離が異なる. ¤ VAEは対数尤度(=KLダイバージェンス),GANはJSダイバージェンス. ¤ VAEは推論分布(エンコーダー)を考えるが,GANは考えない. ¤ ただし,モデルによる. ¤ どちらもDNNの得意な識別問題に持っていっているところがポイント. ⽣成器 識別器 𝑥v 𝑥 𝑧推論モデル ⽣成モデル 𝑥 𝑥v 𝑧 VAE GAN
  25. 25. KLダイバージェンスとJSダイバージェンス ¤ 尤度最⼤化(KLダイバージェンス最⼩化)はデータのないところも覆ってしまう(B). ¤ KLを逆にすると,⼀つの峰だけにfitするようになる(C). ¤ JSダイバージェンスはちょうど中間あたりで学習(D). [Huszar´+ 15] Under review as a conference paper at ICLR 2016 A: P B: arg minQ JS0.1[PkQ] C: arg minQ JS0.5[PkQ] D: arg minQ JS0.99[PkQ] Figure 1: Illustrating the behaviour of the generalised JS divergence under model underspecification for a range of values of ⇡. Data is drawn from a multivariate Gaussian distribution P (A) and we aim approximate it by a single isotropic Gaussian (B-D). Contours show level sets the approximating distribution, overlaid on top of the 2D histogram of observed data. For ⇡ = 0.1, JS divergence minimisation behaves like maximum likelihood (B), resulting in the characteristic moment matching behaviour. For ⇡ = 0.99 (D), the behaviour becomes more akin to the mode-seeking behaviour of minimising KL[QkP]. For the intermediate value of ⇡ = 0.5 (C) we recover the standard JS divergence approximated by adversarial training. To produce this illustraiton we used software made available by Theis et al. (2015). It is easy to show that JS⇡ divergence converges to 0 in the limit of both ⇡ ! 0 and ⇡ ! 1. Cru- cially, it can be shown that the gradients with respect to ⇡ at these two extremes recover KL[QkP] and KL[PkQ], respectively. A proof of this property can be obtained by considering the Taylor- expansion KL[QkQ + a] ⇡ aT Ha, where H is the positive definite Hessian and substituting a = ⇡(P Q) as follows: lim ⇡!0 JSD[PkQ; ⇡] ⇡ = lim ⇡!0 ⇢ KL[Pk⇡P + (1 ⇡)Q] + 1 ⇡) ⇡ KL[Qk⇡P + (1 ⇡)Q] (14)
  26. 26. GANの⽣成画像 ¤ ランダムな𝑧から画像をサンプリング[Goodfellow+ 14] Adversarial nets 225 ± 2 2057 ± 26 Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log- likelihood of samples on test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different chosen using the validation set of each fold. On TFD, was cross validated on each fold and mean log-likelihood on each fold were computed. For MNIST we compare against other models of the real-valued (rather than binary) version of dataset. of the Gaussians was obtained by cross validation on the validation set. This procedure was intro- duced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models. In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework. a) b) c) d) Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator)
  27. 27. GANの学習の難しさ
  28. 28. GANの問題点 ¤ GANにはいくつかの困難な点がある. ¤ 収束性 ¤ Mode collapse問題 ¤ 勾配消失 ¤ これらの困難は互いに関係している.
  29. 29. GANの収束性について ¤ 𝑉(𝐷∗, 𝐺)が𝜃について凸関数ならば,Gは収束することが保証されている. ¤ しかし,⾮凸の場合は保証されない(NNは当然⾮凸). ¤ 実際の学習では,SGDで交互に最適化している. ¤ ミニマックスゲームの平衡点は,両⽅のプレイヤーが同時に最⼩になる点(ナッシュ均衡). ¤ ゼロサムゲームなので,お互いが𝑉を最適化しても,平衡点に⾏かずに振動する可能性がある. ¤ 例:⽬的関数を𝑣(𝑎, 𝑏) = 𝑎𝑏とし,𝐺と𝐷で交互に最⼩化と最⼤化を繰り返すと,次のようになる. ※最近では,最適化によって(局所)ナッシュ均衡に到達するよう保証する研究が⾏われている([Nagarajan+ 17][Heusel+ 17][Kodali+ 17]など). https://lilianweng.github.io/lil- log/2017/08/20/from-GAN-to-WGAN.html
  30. 30. GANにおけるGとDの学習の様⼦ ¤ 𝐷は,𝐺が最適化する⽅向を⽰している. ¤ 敵対的というよりは,協⼒して学習しているイメージ. http://www.inference.vc/an-alternative-update-rule-for-generative-adversarial-networks/
  31. 31. DCGAN ¤ Deep Convolutional Generative Adversarial Network[Alec+ 15] ¤ 畳込みNNでモデルを設計 ¤ 学習の安定性のために,いくつかのトリックを利⽤. ¤ Batch Normalizationを使う(学習の安定のために重要). ¤ 𝐺にはReLUを使い,出⼒はtanh(必然的にデータは事前に標準化しておく). ¤ 𝐷にはleaky ReLUを使う.
  32. 32. DCGANの⽣成画像 ¤ 従来と⽐較して,はるかに綺麗な画像が⽣成できるようになった. ¤ DCGANはGANのブレイクスルー的研究 Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate. Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples such as the base boards of some of the beds.
  33. 33. 潜在空間上でのベクトル演算 ¤ ただし,GANにしかできない訳ではないことに注意. ¤ VAEでも可能(ただしGANの⽅が綺麗な画像が⽣成できる).
  34. 34. Mode collapse問題 ¤ 元々の定式化は,𝐺を固定して𝐷を最適化するというものだった. ¤ 𝐺は最適化した𝐷を⽤いて学習する(minmax). ¤ (学習が不⼗分な)𝐷を固定して,𝐺を最適化した場合はどうなるか? ->𝐺の⽣成データが全て,ある峰(peaks)に対応するように学習してしまう(mode collapse). Published as a conference paper at ICLR 2017 Figure 2: Unrolling the discriminator stabilizes GAN training on a toy 2D mixture of Gaussians dataset. Columns show a heatmap of the generator distribution after increasing numbers of training steps. The final column shows the data distribution. The top row shows training for a GAN with 10 unrolling steps. Its generator quickly spreads out and converges to the target distribution. The bottom row shows standard GAN training. The generator rotates through the modes of the data distribution. It never converges to a fixed distribution, and only ever assigns significant probability mass to a single data mode at once. Figure 3: Unrolled GAN training increases stability for an RNN generator and convolutional dis- criminator trained on MNIST. The top row was run with 20 unrolling steps. The bottom row is a standard GAN, with 0 unrolling steps. Images are samples from the generator after the indicated number of training steps. generator, but without backpropagating through the generator. In both cases we find that the unrolled objective performs better. 3.2 PATHOLOGICAL MODEL WITH MISMATCHED GENERATOR AND DISCRIMINATOR To evaluate the ability of this approach to improve trainability, we look to a traditionally challenging family of models to train – recurrent neural networks (RNNs). In this experiment we try to generate MNIST samples using an LSTM (Hochreiter & Schmidhuber, 1997). MNIST digits are 28x28 pixel images. At each timestep of the generator LSTM, it outputs one column of this image, so that after 28 timesteps it has output the entire sample. We use a convolutional neural network as the Published as a conference paper at ICLR 2017 Figure 2: Unrolling the discriminator stabilizes GAN training on a toy 2D mixture of Gaussians dataset. Columns show a heatmap of the generator distribution after increasing numbers of training steps. The final column shows the data distribution. The top row shows training for a GAN with 10 unrolling steps. Its generator quickly spreads out and converges to the target distribution. The bottom row shows standard GAN training. The generator rotates through the modes of the data distribution. It never converges to a fixed distribution, and only ever assigns significant probability mass to a single data mode at once. [Metz+ 17]
  35. 35. 解決⽅法 ¤ Minibatch discrimination[Salimans+ 17] ¤ Minibatch内の多様性(サンプル間のノルムの距離)を考慮した discriminatorを設計.多様性が⼤きくなる⽅向にDを導くように する. ¤ Unrolled GANs[Metz+ 17] ¤ 𝐷のパラメータをK回更新し(unrolling),その時点でのパラメー タで𝐺を更新.𝐷は1回更新したパラメータに戻す. ¤ AdaGAN[Tolstikhin+ 17] ¤ GANにブースティングの⼿法を⽤いる. ¤ Wasserstein GANs[Arjovsky+ 17] ¤ 後述.
  36. 36. GANの勾配消失問題 ¤ では,𝐷を完全に最適化させればいい? ¤ 𝑝ABCB(𝑥) について1, 𝑝D(𝑥) について0を出⼒する. -> 勾配が消えてしまうため,𝐺の⽅向がわからなくなる!! ¤ GANのジレンマ: ¤ 𝐷が⼗分学習できないうちに𝐺を最適化すると,𝐺が同じようなサンプルしか⽣成しなくなる(mode collapse). ¤ ある𝐺のときに𝐷を完全に学習すると,勾配が消失してしまう(勾配消失問題).
  37. 37. 勾配消失を防ぐトリック ¤ 勾配が消失しないようにするため,𝐺の学習を次のように変更する場合が多い. ¤ 𝐷がほぼ偽物(𝐷 = 0)と判断したときでも,勾配が消失することを防ぐ. ¤ ただし,もはやJSダイバージェンスの最⼩化ではない! ¤ 逆⽅向のKLと負のJSを最⼩化することに対応[Arjovsky+ 17] ¤ (ただし[Zhou+ 17]はこの証明は誤りと指摘している) min ( 𝐸O(m) −log 𝐷 𝐺 𝑧; 𝜃 ; 𝜑 −𝛻( log 𝐷(𝐺 𝑧; 𝜃 ) = − 𝛻( 𝐷(𝐺 𝑧; 𝜃 ) 𝐷(𝐺 𝑧; 𝜃 ) 𝛻( log 1 − 𝐷(𝐺 𝑧; 𝜃 ) = − 𝛻( 𝐷(𝐺 𝑧; 𝜃 ) 1 − 𝐷(𝐺 𝑧; 𝜃 ) 元の⽬的関数の勾配 変更した⽬的関数の勾配
  38. 38. その他のトリック ¤ https://github.com/soumith/ganhacks に⾊々経験則が書いてある. ¤ Normalize the inputs ¤ Use a spherical Z ¤ Use Soft and Noisy Labels などなど. ¤ 基本的には,実装例があればそれに従うこと.
  39. 39. GANの種類
  40. 40. GAN Zoo ¤ GANの論⽂⼀覧 ¤ https://github.com/hindupuravinash/the-gan-zoo ¤ 指数関数的に増加. ¤ 理論研究など,ここには⼊っていない研究も結構ある. ¤ 12/11現在234.12⽉分はまだ.
  41. 41. 代表的なGANs https://github.com/hwalsuklee/tensorflow-generative-model-collections
  42. 42. Conditional GAN ¤ Conditional Generative Adversarial Nets[Mirza+ 14] ¤ 𝑐で条件づけたGAN. ¤ GとDの両⽅に𝑐を加える(下の図では𝑦). 𝑉t 𝐷, 𝐺 = 𝐸OPQRQ ` log 𝐷 𝑥, 𝑐; 𝜑 + 𝐸O m log 1 − 𝐷 𝐺 𝑧; 𝜃 , 𝑐; 𝜑 𝑉s 𝐷, 𝐺 = −𝐸O m log 𝐷 𝐺 𝑧; 𝜃 , 𝑐; 𝜑
  43. 43. cGANで⽣成した画像 ¤ ⽂章情報で条件づけた例[Reed+ 16]. ¤ 画像で条件づけた例[Isola+ 16](pix2pix,後ほど改めて紹介). nerative Adversarial Text to Image Synthesis Xinchen Yan, Lajanugen Logeswaran REEDSCOT1 , AKATA2 , XCYAN1 , LLAJAN1 SCHIELE2 ,HONGLAK1 nn Arbor, MI, USA (UMICH.EDU) formatics, Saarbr¨ucken, Germany (MPI-INF.MPG.DE) tract realistic images from text nd useful, but current AI m this goal. However, in d powerful recurrent neu- es have been developed text feature representa- convolutional generative ANs) have begun to gen- g images of specific cat- album covers, and room we develop a novel deep ormulation to effectively n text and image model- concepts from characters ate the capability of our sible images of birds and xt descriptions. d in translating text in the form itten descriptions directly into “this small bird has a short, e belly” or ”the petals of this r are yellow”. The problem of al descriptions gained interest this small bird has a pink breast and crown, and black primaries and secondaries. the flower has petals that are bright pinkish purple with white stigma this magnificent fellow is almost all black with a red crest, and white cheek patch. this white and yellow flower have thin white petals and a round yellow stamen Figure 1. Examples of generated images from text descriptions. Left: captions are from zero-shot (held out) categories, unseen text. Right: captions are from the training set. properties of attribute representations are attractive, at- tributes are also cumbersome to obtain as they may require domain-specific knowledge. In comparison, natural lan- guage offers a general and flexible interface for describing objects in any space of visual categories. Ideally, we could have the generality of text descriptions with the discrimi-
  44. 44. infoGAN ¤ Information Maximizing Generative Adversarial Networks[Chen+ 16] ¤ cGANでは明⽰的に𝑐と𝑥のペアを与えていたが,暗黙的に𝑥と対応するcを獲得したい. ¤ そのため,相互情報量𝐼(𝑐; 𝑥 = 𝐺(𝑐, 𝑧))が⾼くなるように学習する. ¤ 実際はその下界を正則化項として加える. ¤ 𝑄(𝑐|𝑥)はニューラルネットワークとする(𝐷と重みを共有). 𝑧 𝑥 𝑐 𝑧 𝑥 𝑐 infoGANcGAN 𝑉t.• 𝐷, 𝐺, 𝑄 = 𝐸OPQRQ ` log 𝐷 𝑥 + 𝐸O m log 1 − 𝐷 𝐺 𝑧 − 𝜆𝐿•(𝐺, 𝑄) 𝑉s 𝐷, 𝐺, 𝑄 = −𝐸O m log 𝐷 𝐺 𝑧 + 𝜆𝐿•(𝐺, 𝑄) 𝐿• 𝐺, 𝑄 = 𝐸‚~O ‚ ,`~s m,‚ [log 𝑄 𝑐 𝑥 ] ≤ 𝐼 𝑐; 𝐺 𝑐, 𝑧
  45. 45. ACGAN ¤ Conditional Image Synthesis With Auxiliary Classifier GANs[Odena+ 16] ¤ 綺麗な画像𝑥は,クラス𝑐の識別性も⾼いはず. ¤ 多様性と識別性を⾼めたい. ¤ 𝐺を𝑐で条件づけて,𝐷で𝑐を識別する補助分類器𝑄(𝐶 = 𝑐|𝑥)を追加. ¤ cGANの𝐺とinfoGANの𝐷. ¤ 𝐺を𝑐で条件づけることで,学習する分布の峰を少なくできる. ¤ 128×128の画像を⽣成. 𝑉t,• 𝐷, 𝐺, 𝑄 = 𝐸OPQRQ ` log 𝐷 𝑥 + 𝐸O m log 1 − 𝐷 𝐺 𝑧 + 𝐸 𝑄 𝐶 = 𝑐 𝑥 + 𝐸[𝑄(𝐶 = 𝑐|𝐺 𝑧 )] 𝑉s 𝐷, 𝐺, 𝑄 = −𝐸O m log 𝐷 𝐺 𝑧 − 𝐸[𝑄(𝐶 = 𝑐|𝐺 𝑧 )]
  46. 46. GANのダイバージェンスの変更 ¤ GANはJSダイバージェンスを最⼩化している. ¤ 通常の⽣成モデルの最尤推定(KL最⼩化)から解放されている. ->別のダイバージェンスで最適化できないか? ¤ これまでに様々なダイバージェンス(距離)のGANが提案されている. ¤ ピアソンの𝜒2ダイバージェンス:LSGAN ¤ 𝑓ダイバージェンス:f-GAN ¤ Bregmanダイバージェンス(真の密度⽐との距離):b-GAN[Uehara+ 17] ¤ Wasserstein距離:WGAN[Arjovsky+ 17],WGAN-GP[Gulrajani+ 17],BEGAN[Berthelot+ 17] ¤ MMD:MMD GAN [Li+ 17] ¤ Cramér距離: Cramér GAN[Bellemare+ 17] ¤ その他のIntegral Probability Metrics :McGAN[Mroueh+ 17], Fisher GAN[Mroueh+ 17] などなど.
  47. 47. JSダイバージェンスの限界 ¤ データは次元のあらゆるところに存在する訳ではない. ¤ 実際は低次元な多様体として存在する. ¤ ⾼次元空間の場合,2つの分布の低次元多様体が交わらない可能性が⾼い. ¤ 分布の台が交わらない場合,𝐷は完全に 𝑝ABCB(𝑥)と 𝑝D 𝑥 が分離できてしまう. ¤ JSダイバージェンスには交わらない分布間の距離の違いがわからない(定数になる). ¤ Gの⽅向が決められない!! 𝑝D(𝑥)𝑝ABCB(𝑥) 𝑝D(𝑥)𝑝ABCB(𝑥)
  48. 48. Wasserstein距離 ¤ Earth Mover距離(Wasserstein-1距離) in𝑓 †∈ OPQRQ,Oe 𝐸 `,a ~†[||𝑥 − 𝑦||] ¤ 直感的には,𝑝ABCBから𝑝Dに確率密度を移すときの最⼩コスト. ¤ 𝛾が輸送量, | 𝑥 − 𝑦 |が移動距離に対応. ¤ JSダイバージェンスとの⽐較 ¤ 直線の確率密度分布を移動した時の距離(右図) Wasserstein距離 JSダイバージェンス 0 𝜃
  49. 49. Wasserstein GAN ¤ Wasserstein距離は双対表現として,次のように書ける. ¤ Wasserstein GAN(WGAN)では,これを𝐺の⽬的関数とする. ¤ 通常のGANとほぼ同じ!ただし・・・ ¤ 𝐷の出⼒はそのまま(⾮線形関数を加えない). ¤ 𝐷がリプシッツ性を満たすように学習する. ¤ 毎回𝜑を複数回更新して, Wasserstein距離を正確に近似するようにする. 𝑊 𝑝D, 𝑝ABCB = max 9 𝐸OPQRQ 𝐷 𝑥; 𝜑 − 𝐸O(m)[𝐷(𝐺 𝑧 ; 𝜑)] ただし| 𝐷 𝑥 − 𝐷 𝑦 | ≤ | 𝑥 − 𝑦 | 1-リプシッツ連続 𝑉t 𝐷, 𝐺 = 𝐸OPQRQ ` 𝐷 𝑥; 𝜑 − 𝐸O m 𝐷 𝐺 𝑧; 𝜃 ; 𝜑 𝑉s 𝐷, 𝐺 = −𝐸O m 𝐷 𝐺 𝑧; 𝜃 ; 𝜑 s.t. | 𝐷 𝑥 − 𝐷 𝑦 | ≤ | 𝑥 − 𝑦 |
  50. 50. WGANの利点 ¤ 従来のダイバージェンスに起因するmode collapseなどの問題が発⽣しない. ¤ 台が交わらない分布でも,勾配が得られる. ¤ ⽬的関数が意味を持つようになる. ¤ 従来のGANは,どれくらい画像が学習できているか,客観的に計測できなかった.
  51. 51. 課題:リプシッツ性の担保 ¤ Dのリプシッツ性を保つために,パラメータを制約する必要がある. ¤ ⽅法1:パラメータをclippingする. ¤ 𝜑を[−0.01,0.01]の範囲に収める. ¤ ただし,勾配爆発や消失の問題が⽣じる可能性がある. ¤ ⽅法2:gradient penaltyの追加[Gulrajani+ 17] ¤ Dの勾配ノルムが1になるような制約項を追加して学習. ¤ ただし,制約が厳しすぎるという指摘もあり[Kodali+ 17]. ¤ その後も,様々なintegral probability metricsによるGANが提案されている. ¤ Dの制約を変更することで,⽬的関数が様々な距離に対応するようになる. ¤ MMD[Li+ 17], Cramér距離[Bellemare+ 17]などなど.
  52. 52. GANの評価
  53. 53. ⽣成モデルの評価とGAN ¤ ⽣成モデルの場合,学習したモデルを評価するのが⾮常に困難. ¤ 教師あり学習では,テストデータ(𝑥, 𝑦)でうまく予測できるかを評価すればいい. ¤ 教師なし学習では,𝑥 → 𝑧などは⾃明ではない. ¤ よくある⽅法は,テストデータに対する(対数)尤度の値を調べる. ¤ VAEでは下界(正確にはimportance weighted AE[Burda+ 15]の下界)を計算する. ¤ GANは明⽰的に分布を持っていないので,尤度の評価が困難. ¤ 初期はパルツェン窓密度推定などが使われていた. ¤ しかし,評価としては役に⽴たない場合が多いことが⽰された[Theis+ 15] ¤ しかし,そもそも対数尤度が⾼ければ良い⽣成モデルといえるのか?? ¤ 最初に述べたように,KLはぼやけた画像を⾼く評価する. ¤ GANは,尤度の代わりにモデルに応じて独⾃の評価指標(識別器)で学習している. -> 客観的にGANの評価は可能なのか??
  54. 54. Inception Score ¤ 良い⽣成画像とは何か?[Salimans+ 16] 1. ⾼い信念でクラス分類ができる->エントロピー𝑝(𝑦|𝑥)が⼩さい. 2. サンプルのバリエーションが⼤きい->エントロピー 𝑝(𝑦)が⼤きい. ¤ よって次のスコアを調べればいい(inception score). ¤ 𝐺からのサンプル𝑥を使って評価 ¤ スコアが⼩さければ良いモデル. ¤ 𝑝(𝑦|𝑥)は事前学習済みのinceptionモデル. ¤ 経験的に,⼈間の評価と相関することがわかった. ¤ 現在GANで最もよく使われている評価指標. 𝐼𝑆 𝐺 = exp(𝐸`~s[𝑝(𝑦|𝑥)||𝑝(𝑦)])
  55. 55. Fréchet Inception Distance ¤ ISとは異なる評価指標[Heusel+ 17]. ¤ Inceptionモデルの任意の層(pool 3層)に, 𝑝ABCB 𝑥 と 𝑝D(𝑥)からのサンプルを写像する. ¤ 埋め込んだ層を連続多変量ガウス分布と考えて,平均と共分散を計算する. ¤ それらを⽤いてFréchet距離(Wasserstein-2距離)を計算する(Fréchet inception distance, FID). ¤ IS(ここではinception distance)と⽐べて,適切な評価指標になっている. 𝐹𝐼𝐷 𝑑𝑎𝑡𝑎, 𝑔 = ||𝜇ABCB − 𝜇D||= = + 𝑇𝑟(ΣABCB + ΣD − 2(ΣABCBΣD) 1 =) Figure A8: Left: FID and right: Inception Score are evaluated for first row: Gaussian noise, second row: Gaussian blur, third row: implanted black rectangles, fourth row: swirled images, fifth row. salt and pepper noise, and sixth row: the CelebA dataset contaminated by ImageNet images. Left is the smallest disturbance level of zero, which increases to the highest level at right. The FID captures the disturbance level very well by monotonically increasing whereas the Inception Score fluctuates, stays flat or even, in the worst case, decreases. 13
  56. 56. 結局どのGANがいい? ¤ 安定性や収束性の問題はいまだに解決されていない. ¤ GANはVAEに⽐べて不安定で,ハイパーパラメータ等に対する分散が⾮常に⼤きい. ¤ 計算コストをかけて適切にパラメータチューニングすれば,GANの種類であまり差はない[Lucic+ 17]. ¤ ⾃分で動かして確認するのが重要. ¤ ほとんどの場合,⽬的関数の変更と多少NNアーキテクチャを変更すればよい. ¤ 最近はまとまった実装やライブラリが公開されている. ¤ https://github.com/hwalsuklee/tensorflow-generative-model-collections ¤ https://github.com/pfnet-research/chainer-gan-lib ¤ https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gan ¤ ⾃分の⽬的にあったGANを選ぶ. ¤ 具体的な⽬的ベースで提案されたGANも多い(GAN Zooやこの後のスライドを参照).
  57. 57. GANの応⽤
  58. 58. ⾼解像度の画像⽣成 ¤ 段階的に解像度を上げるように学習することで,⾼解像度の画像⽣成が可能になっている. ¤ LAPGAN[Denton+ 15],StackGAN[Zhang+ 16], StackGAN++[Zhang+ 17] ¤ Progressive GAN[Karras+ 17] ¤ 低解像度の⽣成から学習していく(最初はモードが少ないので楽). ¤ 安定して⾼解像度(1024x1024!)の画像を⽣成できる.
  59. 59. Image-to-image translation ¤ 画像から対応する画像を⽣成する. ¤ Pix2pix[Isola+ 16] ¤ Conditional GANの条件付けを変換前の画像とする. ¤ Gをautoencoderにする. ¤ BicycleGAN[Zhu+ 17] ¤ 決定論的な対応を学習するのではなく,潜在変数の分散を考慮したモデル. Real or fake pair? Positive examples Negative examples Real or fake pair? DD G G tries to synthesize fake images that fool D D tries to identify the fakes Figure 2: Training a conditional GAN to predict aerial photos from maps. The discriminator, D, learns to classify between real and synthesized pairs. The generator learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discrimina- tor observe an input image. where G tries to minimize this objective against an ad- versarial D that tries to maximize it, i.e. G⇤ = arg minG maxD LcGAN (G, D). To test the importance of conditioning the discrimintor, we also compare to an unconditional variant in which the discriminator does not observe x: LGAN (G, D) =Ey⇠pdata(y)[log D(y)]+ Ex⇠pdata(x),z⇠pz(z)[log(1 D(G(x, z))]. (2) Previous approaches to conditional GANs have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [29]. The discriminator’s job re- mains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. We also explore this option, using L1 distance rather than L2 as L1 encourages less blurring: Figure 3 “U-Net” tween m this stra nore the Instead, form of at both t observe Designi put, and distribu by the p 2.2. Ne We a from th module Details with key 2.2.1 A defin is that th tion out the inpu are rend structur Image-to-Image Translation with Conditional Adversarial Networks Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros Berkeley AI Research (BAIR) Laboratory University of California, Berkeley {isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu Labels to Facade BW to Color Aerial to Map Labels to Street Scene Edges to Photo input output input inputinput input output output outputoutput input output Day to Night Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image. These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels. Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show results of the method on several. In each case we use the same architecture and objective, and simply train on different data. Abstract We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss func- tion to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demon- strate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a commu- nity, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either. Many problems in image processing, computer graphics, and computer vision can be posed as “translating” an input image into a corresponding output image. Just as a concept may be expressed in either English or French, a scene may be rendered as an RGB image, a gradient field, an edge map, a semantic label map, etc. In analogy to automatic language translation, we define automatic image-to-image translation as the problem of translating one possible representation of a scene into another, given sufficient training data (see Fig- ure 1). One reason language translation is difficult is be- cause the mapping between languages is rarely one-to-one – any given concept is easier to express in one language than another. Similarly, most image-to-image translation problems are either many-to-one (computer vision) – map- ping photographs to edges, segments, or semantic labels, or one-to-many (computer graphics) – mapping labels or sparse user inputs to realistic images. Traditionally, each of these tasks has been tackled with separate, special-purpose machinery (e.g., [7, 15, 11, 1, 3, 37, 21, 26, 9, 42, 46]), despite the fact that the setting is always the same: predict pixels from pixels. Our goal in this paper is to develop a common framework for all these problems. 1 arXiv:1611.07004v1[cs.CV]21Nov2016
  60. 60. Image-to-image translation ¤ ペアになっていないデータで双⽅向に変換 ¤ CycleGAN[Zhu+ 17] ¤ StarGAN[Choi+ 17] ¤ 2つのドメインではなく,複数のドメイン間で変換. ¤ GとDは1つだけ.represent one domain while those of men represent another. Several image datasets come with a number of labeled attributes. For instance, the CelebA[17] dataset contains 40 labels related to facial attributes such as hair color, gender, and age, and the RaFD [11] dataset has 8 labels for facial expressions such as ‘happy’, ‘angry’ and ‘sad’. These set- tings enable us to perform more interesting tasks, namely multi-domain image-to-image translation, where we change images according to attributes from multiple domains. The first five columns in Fig. 1 show how a CelebA image can be translated according to any of the four domains, ‘blond hair’, ‘gender’, ‘aged’, and ‘pale skin’. We can further ex- tend to training multiple domains from different datasets, such as jointly training CelebA and RaFD images to change a CelebA image’s facial expression using features learned by training on RaFD, as in the rightmost columns of Fig. 1. However, existing models are both inefficient and inef- fective in such multi-domain image translation tasks. Their (a) Cross-domain models 21 4 3 G21 G12 G41 G14 G32 G23 G34 G43 2 1 5 4 3 (b) StarGAN Figure 2. Comparison between cross-domain models and our pro- posed model, StarGAN. (a) To handle multiple domains, cross- domain models should be built for every pair of image domains. (b) StarGAN is capable of learning mappings among multiple do- mains using a single generator. The figure represents a star topol- ogy connecting multi-domains.
  61. 61. Actor–criticとGAN
  62. 62. Actor-critic ¤ ⽅策(actor)と価値関数(critic)を同時に学習する⼿法. ¤ MDPの下で(記号の説明を省略)... ¤ ⾏動価値関数は, ¤ ⽅策の更新: ¤ ⾏動価値関数の更新: ¤ よって,2段階の最適化アルゴリズムになる. ¤ なんかGANと似ている?? most reinforcement learning algorithms either focus on learning a value function, like value iteration and TD-learning, or learning a policy directly, as in policy gradient methods, AC methods learn both simultaneously - the actor being the policy and the critic being the value function. In some AC methods, the critic provides a lower-variance baseline for policy gradient methods than estimating the value from returns. In this case even a bad estimate of the value function can be useful, as the policy gradient will be unbiased no matter what baseline is used. In other AC methods, the policy is updated with respect to the approximate value function, in which case pathologies similar to those in GANs can result. If the policy is optimized with respect to an incorrect value function, it may lead to a bad policy which never fully explores the space, preventing a good value function from being found and leading to degenerate solutions. A number of techniques exist to remedy this problem. Formally, consider the typical MDP setting for RL, where we have a set of states S, actions A, a distribution over initial states p0(s), transition function P(st+1|st, at), reward distribution R(st) and discount factor γ ∈ [0, 1]. The aim of actor-critic methods is to simultaneously learn an action- value function Qπ (s, a) that predicts the expected discounted reward: Qπ (s, a) = Est+k∼P,rt+k∼R,at+k∼π ∞ k=1 γk rt+k st = s, at = a (6) and learn a policy that is optimal for that value function: π∗ = arg max π Es0∼p0,a0∼π[Qπ (s0, a0)] (7) We can express Qπ as the solution to a minimization problem: Qπ = arg min Q Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (8) Where D(·||·) is any divergence that is positive except when the two are equal. Now the actor-critic problem can be expressed as a bilevel optimization problem as well: F(Q, π) = Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (9) f(Q, π) = −Es0∼p0,a0∼π[Qπ (s0, a0)] (10) There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG) [7, 10] and its extension to stochastic policies, SVG(0) [8], as well as neurally-fitted Q-learning with continuous actions (NFQCA) [9]. These algorithms are all intended for the case where actions and both simultaneously - the actor being the policy and the critic being the value function. In some AC methods, the critic provides a lower-variance baseline for policy gradient methods than estimating the value from returns. In this case even a bad estimate of the value function can be useful, as the policy gradient will be unbiased no matter what baseline is used. In other AC methods, the policy is updated with respect to the approximate value function, in which case pathologies similar to those in GANs can result. If the policy is optimized with respect to an incorrect value function, it may lead to a bad policy which never fully explores the space, preventing a good value function from being found and leading to degenerate solutions. A number of techniques exist to remedy this problem. Formally, consider the typical MDP setting for RL, where we have a set of states S, actions A, a distribution over initial states p0(s), transition function P(st+1|st, at), reward distribution R(st) and discount factor γ ∈ [0, 1]. The aim of actor-critic methods is to simultaneously learn an action- value function Qπ (s, a) that predicts the expected discounted reward: Qπ (s, a) = Est+k∼P,rt+k∼R,at+k∼π ∞ k=1 γk rt+k st = s, at = a (6) and learn a policy that is optimal for that value function: π∗ = arg max π Es0∼p0,a0∼π[Qπ (s0, a0)] (7) We can express Qπ as the solution to a minimization problem: Qπ = arg min Q Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (8) Where D(·||·) is any divergence that is positive except when the two are equal. Now the actor-critic problem can be expressed as a bilevel optimization problem as well: F(Q, π) = Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (9) f(Q, π) = −Es0∼p0,a0∼π[Qπ (s0, a0)] (10) There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG) [7, 10] and its extension to stochastic policies, SVG(0) [8], as well as neurally-fitted Q-learning with continuous actions (NFQCA) [9]. These algorithms are all intended for the case where actions and observations are continuous, and use neural networks for function approximation for both the action- both simultaneously - the actor being the policy and the critic being the value function. In some AC methods, the critic provides a lower-variance baseline for policy gradient methods than estimating the value from returns. In this case even a bad estimate of the value function can be useful, as the policy gradient will be unbiased no matter what baseline is used. In other AC methods, the policy is updated with respect to the approximate value function, in which case pathologies similar to those in GANs can result. If the policy is optimized with respect to an incorrect value function, it may lead to a bad policy which never fully explores the space, preventing a good value function from being found and leading to degenerate solutions. A number of techniques exist to remedy this problem. Formally, consider the typical MDP setting for RL, where we have a set of states S, actions A, a distribution over initial states p0(s), transition function P(st+1|st, at), reward distribution R(st) and discount factor γ ∈ [0, 1]. The aim of actor-critic methods is to simultaneously learn an action- value function Qπ (s, a) that predicts the expected discounted reward: Qπ (s, a) = Est+k∼P,rt+k∼R,at+k∼π ∞ k=1 γk rt+k st = s, at = a (6) and learn a policy that is optimal for that value function: π∗ = arg max π Es0∼p0,a0∼π[Qπ (s0, a0)] (7) We can express Qπ as the solution to a minimization problem: Qπ = arg min Q Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (8) Where D(·||·) is any divergence that is positive except when the two are equal. Now the actor-critic problem can be expressed as a bilevel optimization problem as well: F(Q, π) = Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (9) f(Q, π) = −Es0∼p0,a0∼π[Qπ (s0, a0)] (10) There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG) [7, 10] and its extension to stochastic policies, SVG(0) [8], as well as neurally-fitted Q-learning with continuous actions (NFQCA) [9]. These algorithms are all intended for the case where actions and observations are continuous, and use neural networks for function approximation for both the action- value function and policy. This is an established approach in RL with continuous actions [11], and all methods update the policy by passing back gradients of the estimated value with respect to the actions rather than passing the TD error directly. The distinction between the methods lies mainly in the way training proceeds. In NFQCA, the actor and critic are trained in batch mode after every distribution over initial states p0(s), transition function P(st+1|st, at), reward distribution R(st) and discount factor γ ∈ [0, 1]. The aim of actor-critic methods is to simultaneously learn an action- value function Qπ (s, a) that predicts the expected discounted reward: Qπ (s, a) = Est+k∼P,rt+k∼R,at+k∼π ∞ k=1 γk rt+k st = s, at = a (6) and learn a policy that is optimal for that value function: π∗ = arg max π Es0∼p0,a0∼π[Qπ (s0, a0)] (7) We can express Qπ as the solution to a minimization problem: Qπ = arg min Q Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (8) Where D(·||·) is any divergence that is positive except when the two are equal. Now the actor-critic problem can be expressed as a bilevel optimization problem as well: F(Q, π) = Est,at∼π[D(Est+1,rt,at+1 [rt + γQ(st+1, at+1)]||Q(st, at))] (9) f(Q, π) = −Es0∼p0,a0∼π[Qπ (s0, a0)] (10) There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG) [7, 10] and its extension to stochastic policies, SVG(0) [8], as well as neurally-fitted Q-learning with continuous actions (NFQCA) [9]. These algorithms are all intended for the case where actions and observations are continuous, and use neural networks for function approximation for both the action-
  63. 63. Actor-criticとGAN ¤ Actor-criticとGANを同じ図式で並べてみる[Pfau+ 17]. ¤ 𝐺=⽅策,𝐷=価値関数とみなせる. ¤ ただし,GANでは𝐺が𝑥(現在の状態)を受け取らない. ¤ 𝑧からのランダム. z G D x y (a) Generative Adversarial Networks [4] π st Q rt (b) Deterministic Policy Gradient [7] / SVG(0) [8] / Neurally-Fitted Q-learning with Continuous Actions [9] Figure 1: Information structure of GANs and AC methods. Empty circles represent models with a distinct loss function. Filled circles represent information from the environment. Diamonds repre- sent fixed functions, both deterministic and stochastic. Solid lines represent the flow of information, while dotted lines represent the flow of gradients used by another model. Paths which are analo- gous between the two models are highlighted in red. The dependence of Q on future states and the dependence of future states on π are omitted for clarity. 2 Algorithms Both GANs and AC can be seen as bilevel or two-time-scale optimization problems, where one model is optimized with respect to the optimum of another model: ∗ ∗
  64. 64. 双⽅のテクニックを利⽤できないか? ¤ 著者のPfau⽒はUnrolled GAN[Metz+ 17]を推している. ¤ というかUnrolled GAN提案者の⼀⼈. Method GANs AC Freezing learning yes yes Label smoothing yes no Historical averaging yes no Minibatch discrimination yes no Batch normalization yes yes Target networks n/a yes Replay buffers no yes Entropy regularization no yes Compatibility no yes Table 1: Summary of different approaches used to stabilize and improve training for GANs and AC methods. Those approaches that have been shown to improve performance are in green, those that have not yet been demonstrated to improve training are in yellow, and those that are not applicable to the particular method are in red. Let the reward from the environment be 1 if the environment chose the real image and 0 if not. This MDP is stateless as the image generated by the actor does not affect future data. An actor-critic architecture learning in this environment clearly closely resembles the GAN game. A few adjustments have to be made to make it identical. If the actor had access to the state it
  65. 65. SeqGAN ¤ SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient[Yu+ 17] ¤ 系列情報を⽣成するGAN ¤ 系列情報を学習するのは困難 ¤ 系列情報は離散. ¤ 識別器は完全な系列しか評価できないため,部分的に⽣成された系列の評価が困難(最終的なスコア が最⼤になるように𝐺を誘導したい). ¤ 解決策:強化学習の利⽤ ¤ 𝐺を⽅策,𝐷を報酬とする. ¤ 𝐷から価値関数を求める. Reward Next action MC search D Reward Reward Reward Policy Gradient Left: D is trained over G. Right: G is trained ard signal is provided diate action value via generative model Gθ. Gθ is updated by em- ch on the basis of the discriminative model elihood that it would where Y n 1:t = (y1, . . . , yt) and Y n t+1:T is sampled based on the roll-out policy Gβ and the current state. In our experi- ment, Gβ is set the same as the generator, but one can use a simplified version if the speed is the priority (Silver et al. 2016). To reduce the variance and get more accurate assess- ment of the action value, we run the roll-out policy starting from current state till the end of the sequence for N times to get a batch of output samples. Thus, we have: Q Gθ Dφ (s = Y1:t−1, a = yt) = (4) 1 N N n=1 Dφ(Y n 1:T ), Y n 1:T ∈ MCGβ (Y1:t; N) for t < T Dφ(Y1:t) for t = T, where, we see that when no intermediate reward, the function is iteratively defined as the next-state value starting from state s′ = Y1:t and rolling out to the end. A benefit of using the discriminator Dφ as a reward func- tion is that it can be dynamically updated to further improve the generative model iteratively. Once we have a set of more realistic generated sequences, we shall re-train the discrimi- Reward Next action State MC searchG D Generate True data Train G Real World D Reward Reward Reward Policy Gradient Figure 1: The illustration of SeqGAN. Left: D is trained over the real data and the generated data by G. Right: G is trained by policy gradient where the final reward signal is provided by D and is passed back to the intermediate action value via Monte Carlo search. synthetic sequences generated from the generative model Gθ. w th m a s 20 m fro ge Q w is s′
  66. 66. SeqGAN:擬似コード Algorithm 1 Sequence Generative Adversarial Nets Require: generator policy Gθ; roll-out policy Gβ; discriminator Dφ; a sequence dataset S = {X1:T } 1: Initialize Gθ, Dφ with random weights θ, φ. 2: Pre-train Gθ using MLE on S 3: β ← θ 4: Generate negative samples using Gθ for training Dφ 5: Pre-train Dφ via minimizing the cross entropy 6: repeat 7: for g-steps do 8: Generate a sequence Y1:T = (y1, . . . , yT ) ∼Gθ 9: for t in 1 :T do 10: Compute Q(a = yt;s = Y1:t−1) by Eq. (4) 11: end for 12: Update generator parameters via policy gradient Eq. (8) 13: end for 14: for d-steps do 15: Use current Gθ to generate negative examples and com- bine with given positive examples S 16: Train discriminator Dφ for k epochs by Eq. (5) 17: end for 18: β ← θ 19: until SeqGAN converges In summary, Algorithm 1 shows full details of the pro- posed SeqGAN. At the beginning of the training, we use the maximum likelihood estimation (MLE) to pre-train Gθ on training set S. We found the supervised signal from the Short-Term Memory (LSTM) cells (Hochrei huber 1997) to implement the update functio is worth noticing that most of the RNN varia gated recurrent unit (GRU) (Cho et al. 2014 tion mechanism (Bahdanau, Cho, and Bengi used as a generator in SeqGAN. The Discriminative Model for Sequenc Deep discriminative models such as deep (DNN) (Vesel`y et al. 2013), convolutional (CNN) (Kim 2014) and recurrent convoluti work (RCNN) (Lai et al. 2015) have shown mance in complicated sequence classificatio paper, we choose the CNN as our discrim has recently been shown of great effectiven ken sequence) classification (Zhang and LeC discriminative models can only perform cla for an entire sequence rather than the unfinis paper, we also focus on the situation where th predicts the probability that a finished seque We first represent an input sequence x1, . E1:T = x1 ⊕x2⊕. . . ⊕xT , where xt ∈ Rk is the k-dimensional token ⊕ is the concatenation operator to build the RT ×k . Then a kernel w ∈ Rl×k applies a operation to a window size of l words to
  67. 67. IRLとGAN
  68. 68. 逆強化学習 ¤ Inverse reinforcement learning(IRL) ¤ ⽬標の⾏動(最適戦略)から報酬関数を推定する⼿法. ¤ Maximum entropy IRL(MaxEnt IRL)[Ng+ 00] ¤ 軌道 がボルツマン分布に従うと考える. ¤ 最適な軌道で尤度が最⼤になると考える -> 尤度最⼤化 ¤ しかし,分配関数𝑍をどうする・・・? um or integral for most high-dimensional problems. A common approach to estimating Z requires ampling from the Boltzmann distribution pθ (x) within the inner loop of learning. Sampling from pθ (x) can be approximated by using Markov chain Monte Carlo (MCMC) methods; however, these methods face issues when there are several distinct modes of the distribution and, as a result, can take arbitrarily large amounts of time to produce a diverse set of samples. Approximate nference methods can also be used during training, though the energy function may incorrectly assign low energy to some modes if the approximate inference method cannot find them [14]. 2.3 Inverse Reinforcement Learning The goal of inverse reinforcement learning is to infer the cost function underlying demonstrated behavior [15]. It is typically assumed that the demonstrations come from an expert who is behaving near-optimally under some unknown cost. In this section, we discuss MaxEnt IRL and guided cost earning, an algorithm for MaxEnt IRL. 2.3.1 Maximum entropy inverse reinforcement learning Maximum entropy inverse reinforcement learning models the demonstrations using a Boltzmann distribution, where the energy is given by the cost function cθ : pθ (τ) = 1 Z exp(−cθ (τ)), Here, τ = {x1,u1,...,xT ,uT } is a trajectory; cθ (τ) = ∑t cθ (xt,ut) is a learned cost function parametrized by θ; xt and ut are the state and action at time step t; and the partition function Z s the integral of exp(−cθ (τ)) over all trajectories that are consistent with the environment dynam- cs.2 Under this model, the optimal trajectories have the highest likelihood, and the expert can generate uboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. As in other energy-based models, the parameters θ are optimized to maximize the like- main challenge in this optimization is evaluating the partition function Z, which is an intractable sum or integral for most high-dimensional problems. A common approach to estimating Z requires sampling from the Boltzmann distribution pθ (x) within the inner loop of learning. Sampling from pθ (x) can be approximated by using Markov chain Monte Carlo (MCMC) methods; however, these methods face issues when there are several distinct modes of the distribution and, as a result, can take arbitrarily large amounts of time to produce a diverse set of samples. Approximate inference methods can also be used during training, though the energy function may incorrectly assign low energy to some modes if the approximate inference method cannot find them [14]. 2.3 Inverse Reinforcement Learning The goal of inverse reinforcement learning is to infer the cost function underlying demonstrated behavior [15]. It is typically assumed that the demonstrations come from an expert who is behaving near-optimally under some unknown cost. In this section, we discuss MaxEnt IRL and guided cost learning, an algorithm for MaxEnt IRL. 2.3.1 Maximum entropy inverse reinforcement learning Maximum entropy inverse reinforcement learning models the demonstrations using a Boltzmann distribution, where the energy is given by the cost function cθ : pθ (τ) = 1 Z exp(−cθ (τ)), Here, τ = {x1,u1,...,xT ,uT } is a trajectory; cθ (τ) = ∑t cθ (xt,ut) is a learned cost function parametrized by θ; xt and ut are the state and action at time step t; and the partition function Z is the integral of exp(−cθ (τ)) over all trajectories that are consistent with the environment dynam- ics.2 Under this model, the optimal trajectories have the highest likelihood, and the expert can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. As in other energy-based models, the parameters θ are optimized to maximize the like- lihood of the demonstrations. Estimating the partition function Z is difficult for large or continuous domains, and presents the main computational challenge. The first applications of this model com- puted Z exactly with dynamic programming [27]. However, this is only practical in small, discrete domains, and is impossible in domains where the system dynamics p(xt+1|xt,ut) are unknown. 2.3.2 Guided cost learning Guided cost learning introduces an iterative sample-based method for estimating Z in the Max- Boltzmann distribution: pθ (x) = 1 Z exp(−Eθ (x)) (1) The energy function parameters θ are often chosen to maximize the likelihood of the data; the main challenge in this optimization is evaluating the partition function Z, which is an intractable sum or integral for most high-dimensional problems. A common approach to estimating Z requires sampling from the Boltzmann distribution pθ (x) within the inner loop of learning. Sampling from pθ (x) can be approximated by using Markov chain Monte Carlo (MCMC) methods; however, these methods face issues when there are several distinct modes of the distribution and, as a result, can take arbitrarily large amounts of time to produce a diverse set of samples. Approximate inference methods can also be used during training, though the energy function may incorrectly assign low energy to some modes if the approximate inference method cannot find them [14]. 2.3 Inverse Reinforcement Learning The goal of inverse reinforcement learning is to infer the cost function underlying demonstrated behavior [15]. It is typically assumed that the demonstrations come from an expert who is behaving near-optimally under some unknown cost. In this section, we discuss MaxEnt IRL and guided cost learning, an algorithm for MaxEnt IRL. 2.3.1 Maximum entropy inverse reinforcement learning Maximum entropy inverse reinforcement learning models the demonstrations using a Boltzmann distribution, where the energy is given by the cost function cθ : pθ (τ) = 1 Z exp(−cθ (τ)), Here, τ = {x1,u1,...,xT ,uT } is a trajectory; cθ (τ) = ∑t cθ (xt,ut) is a learned cost function parametrized by θ; xt and ut are the state and action at time step t; and the partition function Z is the integral of exp(−cθ (τ)) over all trajectories that are consistent with the environment dynam- ics.2 Under this model, the optimal trajectories have the highest likelihood, and the expert can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. As in other energy-based models, the parameters θ are optimized to maximize the like- lihood of the demonstrations. Estimating the partition function Z is difficult for large or continuous domains, and presents the main computational challenge. The first applications of this model com- puted Z exactly with dynamic programming [27]. However, this is only practical in small, discrete domains, and is impossible in domains where the system dynamics p(xt+1|xt,ut) are unknown. コスト関数(報酬関数) https://jangirrishabh.github.io/2016/07/09/virtual-car-IRL/
  69. 69. 逆強化学習 ¤ Guided cost learning (GCL) [Finn+ 16] ¤ 分配関数を推定するために,提案分布𝑞(𝜏)を考えて重点サンプリングによって求める. ¤ 提案分布が⾼い尤度で軌道をカバーできないと重点サンプリングが⾼バリアンスになる可能性があるので,デ モンストレーション𝑝+も利⽤する. ¤ 𝑞は次のコストを最⼩化して求める(詳細は省略). -> 𝜃と𝑞について交互に最適化すればいい. Under this model, the optimal trajectories have the highest likelihood, and the expert can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. As in other energy-based models, the parameters θ are optimized to maximize the like- lihood of the demonstrations. Estimating the partition function Z is difficult for large or continuous domains, and presents the main computational challenge. The first applications of this model com- puted Z exactly with dynamic programming [27]. However, this is only practical in small, discrete domains, and is impossible in domains where the system dynamics p(xt+1|xt,ut) are unknown. 2.3.2 Guided cost learning Guided cost learning introduces an iterative sample-based method for estimating Z in the Max- Ent IRL formulation, and can scale to high-dimensional state and action spaces and nonlinear cost functions [7]. The algorithm estimates Z by training a new sampling distribution q(τ) and using importance sampling: Lcost(θ) = Eτ∼p[−log pθ (τ)] = Eτ∼p[cθ (τ)]+ logZ = Eτ∼p[cθ (τ)]+ log Eτ∼q exp(−cθ (τ)) q(τ) . Guided cost learning alternates between optimizing cθ using this estimate, and optimizing q(τ) to minimize the variance of the importance sampling estimate. 2This formula assumes that xt+1 is a deterministic function of the previous history. A more general form of this equation can be derived for stochastic dynamics [26]. However, the analysis largely remains the same: the probability of a trajectory can be written as the product of conditional probabilities, but the conditional probabilities of the states xt are not affected by θ and so factor out of all likelihood ratios. 3 The optimal importance sampling distribution for estimating the partition function exp(−cθ (τ))dτ is q(τ) ∝ |exp(−cθ (τ))| = exp(−cθ (τ)). During guided cost learning, the sampling policy q(τ) is updated to match this distribution by minimizing the KL divergence between q(τ) and 1 Z exp(−cθ (τ)), or equivalently minimizing the learned cost and maximizing entropy. Lsampler(q) = Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] (2) Conveniently, this optimal sampling distribution is the demonstration distribution for the true cost function. Thus, this training procedure results in both a learned cost function, characterizing the demonstration distribution, and a learned policy q(τ), capable of generating samples from the demonstration distribution. This importance sampling estimate can have very high variance if the sampling distribution q fails to cover some trajectories τ with high values of exp(−cθ (τ)). Since the demonstrations will have low cost (as a result of the IRL objective), we can address this coverage problem by mixing the demonstration data samples with the generated samples. Let µ = 1 2 p+ 1 2 q be the mixture distribution over trajectory roll-outs. Let p(τ) be a rough estimate for the density of the demonstrations; for example we could use the current model pθ , or we could use a simpler density model trained using another method. Guided cost learning uses µ for importance sampling3, with 1 2 p(τ) + 1 2 q(τ) as the importance weights: Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼µ exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) , 2.4 Direct Maximum Likelihood and Behavioral Cloning A simple approach to imitation learning and generative modeling is to train a generator or policy to output a distribution over the data, without learning a discriminator or energy function. For tractability, the data distribution is typically factorized using a directed graphical model or Bayesian network. In the field of generative modeling, this approach has most commonly been applied to speech and language generation tasks [23, 18], but has also been applied to image generation [22]. Like most EBMs, these models are trained by maximizing the likelihood of the observed data points. When a generative model does not have the capacity to represent the entire data distribution, max- imizing likelihood directly will lead to a moment-matching distribution that tries to “cover” all of the modes, leading to a solution that puts much of its mass in parts of the space that have negligible probability under the true distribution. In many scenarios, it is preferable to instead produce only re- alistic, highly probable samples, by “filling in” as many modes as possible, at the trade-off of lower The optimal importance sampling distribution for estimating the partition function exp(−cθ (τ))dτ is q(τ) ∝ |exp(−cθ (τ))| = exp(−cθ (τ)). During guided cost learning, the sampling policy q(τ) is updated to match this distribution by minimizing the KL divergence between q(τ) and 1 Z exp(−cθ (τ)), or equivalently minimizing the learned cost and maximizing entropy. Lsampler(q) = Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] (2) Conveniently, this optimal sampling distribution is the demonstration distribution for the true cost function. Thus, this training procedure results in both a learned cost function, characterizing the demonstration distribution, and a learned policy q(τ), capable of generating samples from the demonstration distribution. This importance sampling estimate can have very high variance if the sampling distribution q fails to cover some trajectories τ with high values of exp(−cθ (τ)). Since the demonstrations will have low cost (as a result of the IRL objective), we can address this coverage problem by mixing the demonstration data samples with the generated samples. Let µ = 1 2 p+ 1 2 q be the mixture distribution over trajectory roll-outs. Let p(τ) be a rough estimate for the density of the demonstrations; for example we could use the current model pθ , or we could use a simpler density model trained using another method. Guided cost learning uses µ for importance sampling3, with 1 2 p(τ) + 1 2 q(τ) as the importance weights: Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼µ exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) , 2.4 Direct Maximum Likelihood and Behavioral Cloning A simple approach to imitation learning and generative modeling is to train a generator or policy to output a distribution over the data, without learning a discriminator or energy function. For tractability, the data distribution is typically factorized using a directed graphical model or Bayesian network. In the field of generative modeling, this approach has most commonly been applied to speech and language generation tasks [23, 18], but has also been applied to image generation [22]. Like most EBMs, these models are trained by maximizing the likelihood of the observed data points. When a generative model does not have the capacity to represent the entire data distribution, max- The optimal importance sampling distribution for estimating the partition function exp(−cθ (τ))dτ is q(τ) ∝ |exp(−cθ (τ))| = exp(−cθ (τ)). During guided cost learning, the sampling policy q(τ) is updated to match this distribution by minimizing the KL divergence between q(τ) and 1 Z exp(−cθ (τ)), or equivalently minimizing the learned cost and maximizing entropy. Lsampler(q) = Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] (2) Conveniently, this optimal sampling distribution is the demonstration distribution for the true cost function. Thus, this training procedure results in both a learned cost function, characterizing the demonstration distribution, and a learned policy q(τ), capable of generating samples from the demonstration distribution. This importance sampling estimate can have very high variance if the sampling distribution q fails to cover some trajectories τ with high values of exp(−cθ (τ)). Since the demonstrations will have low cost (as a result of the IRL objective), we can address this coverage problem by mixing the 𝜇=
  70. 70. GANの再解釈 ¤ GANの最適な識別器は ¤ ここでモデル分布𝑞の確率密度は推定できるとする(GANの前提を変える). ¤ さらに真の分布𝑝をコスト関数でパラメータ化したものに置き換える. ¤ するとGANの識別器の損失関数は ¤ Dのパラメータ=コスト関数のパラメータ ¤ のとき最適. ¤ [再掲]MaxEnt IRLのコストは( とすると) verse reinforcement learning, where the data-to-be-modeled is a set of expert demonstrations. The derivation requires a particular form of discriminator, which we discuss first in Section 3.1. After making this modification to the discriminator, we obtain an algorithm for IRL, as we show in Sec- tion 3.2, where the discriminator involves the learned cost and the generator represents the policy. 3.1 A special form of discriminator For a fixed generator with a [typically unknown] density q(τ), the optimal discriminator is the fol- lowing [8]: D∗ (τ) = p(τ) p(τ)+q(τ) , (3) where p(τ) is the actual distribution of the data. In the traditional GAN algorithm, the discriminator is trained to directly output this value. When the generator density q(τ) can be evaluated, the traditional GAN discriminator can be modified to incorporate this density information. Instead of having the discriminator estimate the value of Equation 3 directly, it can be used to estimate p(τ), filling in the value of q(τ) with its known value. In this case, the new form of the discriminator Dθ with parameters θ is Dθ (τ) = ˜pθ (τ) ˜pθ (τ)+q(τ) . In order to make the connection to MaxEnt IRL, we also replace the estimated data density with the Boltzmann distribution. As in MaxEnt IRL, we write the energy function as cθ to designate the learned cost. Now the discriminator’s output is: Dθ (τ) = 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+q(τ) . The resulting architecture for the discriminator is very similar to a typical model for binary classi- fication, with a sigmoid as the final layer and logZ as the bias of the sigmoid. We have adjusted the architecture only by subtracting logq(τ) from the input to the sigmoid. This modest change allows the optimal discriminator to be completely independent of the generator: the discriminator is optimal when 1 Z exp(−cθ (τ)) = p(τ). Independence between the generator and the optimal discrim- inator may significantly improve the stability of training. This change is very simple to implement and is applicable in any setting where the density q(τ) can be cheaply evaluated. Of course this is precisely the case where we could directly maximize likelihood, and we might wonder whether it is worth the additional complexity of GAN training. But the experience of researchers in IRL has shown that maximizing log likelihood directly is not always the most effective way to learn complex behaviors, even when it is possible to implement. As we will show, there is a precise equivalence between MaxEnt IRL and this type of GAN, suggesting alleviate the issue, they do not solve it completely. 3 GANs and IRL We now show how generative adversarial modeling has implicitly been applied to the setting of in- verse reinforcement learning, where the data-to-be-modeled is a set of expert demonstrations. The derivation requires a particular form of discriminator, which we discuss first in Section 3.1. After making this modification to the discriminator, we obtain an algorithm for IRL, as we show in Sec- tion 3.2, where the discriminator involves the learned cost and the generator represents the policy. 3.1 A special form of discriminator For a fixed generator with a [typically unknown] density q(τ), the optimal discriminator is the fol- lowing [8]: D∗ (τ) = p(τ) p(τ)+q(τ) , (3) where p(τ) is the actual distribution of the data. In the traditional GAN algorithm, the discriminator is trained to directly output this value. When the generator density q(τ) can be evaluated, the traditional GAN discriminator can be modified to incorporate this density information. Instead of having the discriminator estimate the value of Equation 3 directly, it can be used to estimate p(τ), filling in the value of q(τ) with its known value. In this case, the new form of the discriminator Dθ with parameters θ is Dθ (τ) = ˜pθ (τ) ˜pθ (τ)+q(τ) . In order to make the connection to MaxEnt IRL, we also replace the estimated data density with the Boltzmann distribution. As in MaxEnt IRL, we write the energy function as cθ to designate the learned cost. Now the discriminator’s output is: Dθ (τ) = 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+q(τ) . The resulting architecture for the discriminator is very similar to a typical model for binary classi- fication, with a sigmoid as the final layer and logZ as the bias of the sigmoid. We have adjusted the architecture only by subtracting logq(τ) from the input to the sigmoid. This modest change allows the optimal discriminator to be completely independent of the generator: the discriminator is optimal when 1 Z exp(−cθ (τ)) = p(τ). Independence between the generator and the optimal discrim- inator may significantly improve the stability of training. This change is very simple to implement and is applicable in any setting where the density q(τ) can be cheaply evaluated. Of course this is precisely the case where we could directly maximize likelihood, and we might wonder whether it is worth the additional complexity of GAN training. But the experience of researchers in IRL has shown that maximizing log likelihood directly is not always the most effective way to learn complex behaviors, even when it is possible to implement. As we will show, there is a precise equivalence between MaxEnt IRL and this type of GAN, suggesting that the same phenomenon may occur in other domains: GAN training may provide advantages even when it would be possible to maximize likelihood directly. 3.1 A special form of discriminator For a fixed generator with a [typically unknown] density q(τ), the optimal discriminator is the fol- lowing [8]: D∗ (τ) = p(τ) p(τ)+q(τ) , (3) where p(τ) is the actual distribution of the data. In the traditional GAN algorithm, the discriminator is trained to directly output this value. When the generator density q(τ) can be evaluated, the traditional GAN discriminator can be modified to incorporate this density information. Instead of having the discriminator estimate the value of Equation 3 directly, it can be used to estimate p(τ), filling in the value of q(τ) with its known value. In this case, the new form of the discriminator Dθ with parameters θ is Dθ (τ) = ˜pθ (τ) ˜pθ (τ)+q(τ) . In order to make the connection to MaxEnt IRL, we also replace the estimated data density with the Boltzmann distribution. As in MaxEnt IRL, we write the energy function as cθ to designate the learned cost. Now the discriminator’s output is: Dθ (τ) = 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+q(τ) . The resulting architecture for the discriminator is very similar to a typical model for binary classi- fication, with a sigmoid as the final layer and logZ as the bias of the sigmoid. We have adjusted the architecture only by subtracting logq(τ) from the input to the sigmoid. This modest change allows the optimal discriminator to be completely independent of the generator: the discriminator is optimal when 1 Z exp(−cθ (τ)) = p(τ). Independence between the generator and the optimal discrim- inator may significantly improve the stability of training. This change is very simple to implement and is applicable in any setting where the density q(τ) can be cheaply evaluated. Of course this is precisely the case where we could directly maximize likelihood, and we might wonder whether it is worth the additional complexity of GAN training. But the experience of researchers in IRL has shown that maximizing log likelihood directly is not always the most effective way to learn complex behaviors, even when it is possible to implement. As we will show, there is a precise equivalence between MaxEnt IRL and this type of GAN, suggesting that the same phenomenon may occur in other domains: GAN training may provide advantages even when it would be possible to maximize likelihood directly. 3.2 Equivalence between generative adversarial networks and guided cost learning In this section, we show that GANs, when applied to IRL problems, optimize the same objective as MaxEnt IRL, and in fact the variant of GANs described in the previous section is precisely equivalent to guided cost learning. Recall that the discriminator’s loss is equal to Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+ Eτ∼q[−log(1 − Dθ(τ))] = Eτ∼p −log 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+ q(τ) + Eτ∼q −log q(τ) 1 Z exp(−cθ (τ))+ q(τ) In maximum entropy IRL, the log-likelihood objective is: Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼ 1 2 p+ 1 2 q exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) (4) = Eτ∼p[cθ (τ)]+ log Eτ∼µ exp(−cθ (τ)) 1 2Z exp(−cθ (τ))+ 1 2 q(τ) , (5) where we have substituted p(τ) = pθ (τ) = 1 Z exp(−cθ (τ)), i.e. we are using the current model to estimate the importance weights. We will establish the following facts, which together imply that GANs optimize precisely the Max- Ent IRL problem: 1. The value of Z which minimizes the discriminator’s loss is an importance-sampling estima- Recall that the discriminator’s loss is equal to Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+Eτ∼q[−log(1−Dθ(τ))] = Eτ∼p −log 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+q(τ) +Eτ∼q −log q(τ) 1 Z exp(−cθ (τ))+q(τ) In maximum entropy IRL, the log-likelihood objective is: Lcost(θ) = Eτ∼p[cθ (τ)]+log Eτ∼ 1 2 p+ 1 2 q exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) (4) = Eτ∼p[cθ (τ)]+log Eτ∼µ exp(−cθ (τ)) 1 2Z exp(−cθ (τ))+ 1 2 q(τ) , (5) where we have substituted p(τ) = pθ (τ) = 1 Z exp(−cθ (τ)), i.e. we are using the current model to estimate the importance weights. We will establish the following facts, which together imply that GANs optimize precisely the Max- Ent IRL problem: Recall that the discriminator’s loss is equal to Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+ Eτ∼q[−log(1 − Dθ(τ))] = Eτ∼p −log 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+ q(τ) + Eτ∼q −log q(τ) 1 Z exp(−cθ (τ))+ q(τ) In maximum entropy IRL, the log-likelihood objective is: Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼ 1 2 p+ 1 2 q exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) (4) = Eτ∼p[cθ (τ)]+ log Eτ∼µ exp(−cθ (τ)) 1 2Z exp(−cθ (τ))+ 1 2 q(τ) , (5) where we have substituted p(τ) = pθ (τ) = 1 Z exp(−cθ (τ)), i.e. we are using the current model to estimate the importance weights. We will establish the following facts, which together imply that GANs optimize precisely the Max- Ent IRL problem: Recall that the discriminator’s loss is equal to Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+Eτ∼q[−log(1−Dθ(τ))] = Eτ∼p −log 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+q(τ) +Eτ∼q −log q(τ) 1 Z exp(−cθ (τ))+q(τ) In maximum entropy IRL, the log-likelihood objective is: Lcost(θ) = Eτ∼p[cθ (τ)]+log Eτ∼ 1 2 p+ 1 2 q exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) (4) = Eτ∼p[cθ (τ)]+log Eτ∼µ exp(−cθ (τ)) 1 2Z exp(−cθ (τ))+ 1 2q(τ) , (5) where we have substituted p(τ) = pθ (τ) = 1 Z exp(−cθ (τ)), i.e. we are using the current model to estimate the importance weights. We will establish the following facts, which together imply that GANs optimize precisely the Max- Ent IRL problem: 1. The value of Z which minimizes the discriminator’s loss is an importance-sampling estima- tor for the partition function, as described in Section 2.3.2. Recall that the discriminator’s loss is equal to Ldiscriminator(Dθ ) = Eτ∼p[−logDθ (τ)]+ Eτ∼q[−log(1 − Dθ(τ))] = Eτ∼p −log 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+ q(τ) + Eτ∼q −log q(τ) 1 Z exp(−cθ (τ))+ q(τ) In maximum entropy IRL, the log-likelihood objective is: Lcost(θ) = Eτ∼p[cθ (τ)]+ log Eτ∼ 1 2 p+ 1 2 q exp(−cθ (τ)) 1 2 p(τ)+ 1 2 q(τ) (4) = Eτ∼p[cθ (τ)]+ log Eτ∼µ exp(−cθ (τ)) 1 2Z exp(−cθ (τ))+ 1 2 q(τ) , (5) where we have substituted p(τ) = pθ (τ) = 1 Z exp(−cθ (τ)), i.e. we are using the current model to estimate the importance weights. We will establish the following facts, which together imply that GANs optimize precisely the Max- Ent IRL problem: 1. The value of Z which minimizes the discriminator’s loss is an importance-sampling estima- tor for the partition function, as described in Section 2.3.2. 2. For this value of Z, the derivative of the discriminator’s loss with respect to θ is equal to the derivative of the MaxEnt IRL objective. 3. The generator’s loss is exactly equal to the cost cθ minus the entropy of q(τ), i.e. the 3.1 A special form of discriminator For a fixed generator with a [typically unknown] density q(τ), the optimal discriminator is the fol- lowing [8]: D∗ (τ) = p(τ) p(τ)+q(τ) , (3) where p(τ) is the actual distribution of the data. In the traditional GAN algorithm, the discriminator is trained to directly output this value. When the generator density q(τ) can be evaluated, the traditional GAN discriminator can be modified to incorporate this density information. Instead of having the discriminator estimate the value of Equation 3 directly, it can be used to estimate p(τ), filling in the value of q(τ) with its known value. In this case, the new form of the discriminator Dθ with parameters θ is Dθ (τ) = ˜pθ (τ) ˜pθ (τ)+q(τ) . In order to make the connection to MaxEnt IRL, we also replace the estimated data density with the Boltzmann distribution. As in MaxEnt IRL, we write the energy function as cθ to designate the learned cost. Now the discriminator’s output is: Dθ (τ) = 1 Z exp(−cθ (τ)) 1 Z exp(−cθ (τ))+q(τ) . The resulting architecture for the discriminator is very similar to a typical model for binary classi- fication, with a sigmoid as the final layer and logZ as the bias of the sigmoid. We have adjusted the architecture only by subtracting logq(τ) from the input to the sigmoid. This modest change allows the optimal discriminator to be completely independent of the generator: the discriminator is optimal when 1 Z exp(−cθ (τ)) = p(τ). Independence between the generator and the optimal discrim- inator may significantly improve the stability of training. This change is very simple to implement and is applicable in any setting where the density q(τ) can be cheaply evaluated. Of course this is precisely the case where we could directly maximize likelihood, and we might wonder whether it is worth the additional complexity of GAN training. But the experience of researchers in IRL has shown that maximizing log likelihood directly is not always the most effective way to learn complex behaviors, even when it is possible to implement. As we will show, there is a precise equivalence between MaxEnt IRL and this type of GAN, suggesting that the same phenomenon may occur in other domains: GAN training may provide advantages even when it would be possible to maximize likelihood directly.
  71. 71. GANとIRLの関係 ¤ 次のことが証明されている[Finn+ 17]. ¤ 識別器の⽬的関数を最⼩化する𝑍は,分配関数の重点サンプリングに対応する. ¤ この𝑍で,識別器の⽬的関数の勾配=MaxEnt IRLの⽬的関数の勾配となる. ¤ ⽣成器の⽬的関数=MaxEnt IRLの提案分布の⽬的関数 MaxEnt policy loss defined in Equation 2 in Section 2.3.2. Recall that µ is the mixture distribution between p and q. Write µ(τ) = 1 2Z exp(−cθ (τ)) + 1 2 q(τ). Note that when θ and Z are optimized, 1 Z exp(−cθ (τ)) is an estimate for the density of p(τ), and hence µ(τ) is an estimate for the density of µ. 3.2.1 Z estimates the partition function We can compute the discriminator’s loss: Ldiscriminator(Dθ ) =Eτ∼p −log 1 Z exp(−cθ (τ)) µ(τ) + Eτ∼q −log q(τ) µ(τ) (6) =logZ + Eτ∼p[cθ (τ)]+ Eτ∼p[logµ(τ)]− Eτ∼q[logq(τ)]+ Eτ∼q[logµ(τ)] (7) =logZ + Eτ∼p[cθ (τ)]− Eτ∼q[logq(τ)]+ 2Eτ∼µ[logµ(τ)]. (8) Only the first and last terms depend on Z. At the minimizing value of Z, the derivative of these term with respect to Z will be zero: ∂ZLdiscriminator(Dθ ) = 0 1 Z = Eτ∼µ 1 Z2 exp(−cθ (τ)) µ(τ) Z = Eτ∼µ exp(−cθ (τ)) µ(τ) . Thus the minimizing Z is precisely the importance sampling estimate of the partition function in Equation 4. 3.2.2 cθ optimizes the IRL objective We return to the discriminator’s loss as computed in Equation 8, and consider the derivative with respect to the parameters θ. We will show that this is exactly the same as the derivative of the IRL objective. 6 estimate the importance weights. We will establish the following facts, which together imply that GANs optimize precisely the Max- Ent IRL problem: 1. The value of Z which minimizes the discriminator’s loss is an importance-sampling estima- tor for the partition function, as described in Section 2.3.2. 2. For this value of Z, the derivative of the discriminator’s loss with respect to θ is equal to the derivative of the MaxEnt IRL objective. 3. The generator’s loss is exactly equal to the cost cθ minus the entropy of q(τ), i.e. the MaxEnt policy loss defined in Equation 2 in Section 2.3.2. Recall that µ is the mixture distribution between p and q. Write µ(τ) = 1 2Z exp(−cθ (τ)) + 1 2 q(τ). Note that when θ and Z are optimized, 1 Z exp(−cθ (τ)) is an estimate for the density of p(τ), and hence µ(τ) is an estimate for the density of µ. 3.2.1 Z estimates the partition function We can compute the discriminator’s loss: Ldiscriminator(Dθ ) =Eτ∼p −log 1 Z exp(−cθ (τ)) µ(τ) + Eτ∼q −log q(τ) µ(τ) (6) =logZ + Eτ∼p[cθ (τ)]+ Eτ∼p[logµ(τ)]− Eτ∼q[logq(τ)]+ Eτ∼q[logµ(τ)] (7) =logZ + Eτ∼p[cθ (τ)]− Eτ∼q[logq(τ)]+ 2Eτ∼µ[logµ(τ)]. (8) Only the first and last terms depend on Z. At the minimizing value of Z, the derivative of these term with respect to Z will be zero: ∂ZLdiscriminator(Dθ ) = 0 1 Z = Eτ∼µ 1 Z2 exp(−cθ (τ)) µ(τ) Z = Eτ∼µ exp(−cθ (τ)) µ(τ) . Thus the minimizing Z is precisely the importance sampling estimate of the partition function in Equation 4. 3.2.2 cθ optimizes the IRL objective We return to the discriminator’s loss as computed in Equation 8, and consider the derivative with respect to the parameters θ. We will show that this is exactly the same as the derivative of the IRL objective. 6 Only the second and fourth terms in the sum depend on θ. When we differentiate those terms we obtain: ∂θ Ldiscriminator(Dθ ) = Eτ∼p[∂θ cθ (τ)]− Eτ∼µ 1 Z exp(−cθ (τ))∂θ cθ (τ) µ(τ) On the other hand, when we differentiate the MaxEnt IRL objective, we obtain: ∂θ Lcost(θ) = Eτ∼p[∂θ cθ (τ)]+ ∂θ log Eτ∼µ exp(−cθ (τ)) µ(τ) = Eτ∼p[∂θ cθ (τ)]+ Eτ∼µ −exp(−cθ (τ))∂θ cθ (τ) µ(τ) Eτ∼µ exp(−cθ (τ)) µ(τ) = Eτ∼p[∂θ cθ (τ)]− Eτ∼µ 1 Z exp(−cθ (τ))∂θ cθ (τ) µ(τ) = ∂θ Ldiscriminator(Dθ ). In the third equality, we used the definition of Z as an importance sampling estimate. Note that in the second equality, we have treated µ(τ) as a constant rather than as a quantity that depends on θ. This is because the IRL optimization is minimizing logZ = log∑τ exp(−cθ (τ)) and using µ(τ) as the weights for an importance sampling estimator of Z. For this purpose we do not want to differentiate through the importance weights. 3.3 The generator optimizes the MaxEnt IRL objective Finally, we compute the generator’s loss: Lgenerator(q) = Eτ∼q[log(1 − D(τ))− log(D(τ))] = Eτ∼q log q(τ) µ(τ) − log 1 Z exp(−cθ (τ)) µ(τ) = Eτ∼q[logq(τ)+ logZ + cθ (τ)] = logZ + Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] = logZ + Lsampler(q). The term logZ is a parameter of the discriminator that is held fixed while optimizing the generator, this loss is exactly equivalent the sampler loss from MaxEnt IRL, defined in Equation 2. Only the second and fourth terms in the sum depend on θ. When we differentiate those terms we obtain: ∂θ Ldiscriminator(Dθ ) = Eτ∼p[∂θ cθ (τ)]− Eτ∼µ 1 Z exp(−cθ (τ))∂θ cθ (τ) µ(τ) On the other hand, when we differentiate the MaxEnt IRL objective, we obtain: ∂θ Lcost(θ) = Eτ∼p[∂θ cθ (τ)]+ ∂θ log Eτ∼µ exp(−cθ (τ)) µ(τ) = Eτ∼p[∂θ cθ (τ)]+ Eτ∼µ −exp(−cθ (τ))∂θ cθ (τ) µ(τ) Eτ∼µ exp(−cθ (τ)) µ(τ) = Eτ∼p[∂θ cθ (τ)]− Eτ∼µ 1 Z exp(−cθ (τ))∂θ cθ (τ) µ(τ) = ∂θ Ldiscriminator(Dθ ). In the third equality, we used the definition of Z as an importance sampling estimate. Note that in the second equality, we have treated µ(τ) as a constant rather than as a quantity that depends on θ. This is because the IRL optimization is minimizing logZ = log∑τ exp(−cθ (τ)) and using µ(τ) as the weights for an importance sampling estimator of Z. For this purpose we do not want to differentiate through the importance weights. 3.3 The generator optimizes the MaxEnt IRL objective Finally, we compute the generator’s loss: Lgenerator(q) = Eτ∼q[log(1 − D(τ))− log(D(τ))] = Eτ∼q log q(τ) µ(τ) − log 1 Z exp(−cθ (τ)) µ(τ) = Eτ∼q[logq(τ)+ logZ + cθ (τ)] = logZ + Eτ∼q[cθ (τ)]+ Eτ∼q[logq(τ)] = logZ + Lsampler(q). The term logZ is a parameter of the discriminator that is held fixed while optimizing the generator, this loss is exactly equivalent the sampler loss from MaxEnt IRL, defined in Equation 2. 3.4 Discussion There are many apparent differences between MaxEnt IRL and the GAN optimization problem. But, we have shown that after making a single key change—using a generator q(τ) for which densities
  72. 72. GANとIRLの関係 ¤ つまり, ¤ コスト関数(報酬関数)の最⼩化は𝐷の最適化とみなせる. ¤ 提案分布𝑞は𝐺とみなせる. ¤ GAN = IRLを⽰すと何がうれしい? ¤ 𝑞は密度推定できるので,直接最⼤化できるのでは? ¤ 𝑞を最尤推定(=KL最⼩化)で求めたくない(peakがないところにも分布を置くので). ¤ GANとみなすと,様々なダイバージェンスで考えることができそう. ¤ その他の研究 ¤ 模倣学習とGAN(GAIL[Ho+ 17]) ¤ OptionGAN[Henderson+ 17] ¤ InfoGAIL[Li+ 17]
  73. 73. 深層⽣成モデルと強化学習
  74. 74. LeCun先⽣⽈く Y LeCun How Much Information Does the Machine Need to Predict? “Pure” Reinforcement Learning (cherry) The machine predicts a scalar reward given once in a while. A few bits for some samples Supervised Learning (icing) The machine predicts a category or a few numbers for each input Predicting human-supplied data 10 10,000 bits per sample→ Unsupervised/Predictive Learning (cake) The machine predicts any part of its input for any observed part. Predicts future frames in videos Millions of bits per sample (Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up)

×