20180622 munit multimodal unsupervised image-to-image translation

Paper review
MUNIT | Multimodal Unsupervised
Image-to-Image Translation
author: Xun Huang (Cornell University, NVIDIA)

abstract
- Proposes an unsupervised, multi-modal method for image translation.
- Its accuracy also came close to that of a supervised method.

Related Works - GANs: Generative Adversarial Networks
- For loss functions that are hard to design by hand, the idea is to have a neural network learn the loss function itself.
- Many applications, such as image generation and text generation
- generative model ≒ unsupervised
- Models P(X) (X: images, etc.)
[Figure: Gaussian noise → Generator → generated image; generated image OR real image → Discriminator → True (1) / False (0). c.f. Progressive GAN]
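
As a minimal sketch of the labeling in the diagram above (real = 1, generated = 0), the standard binary cross-entropy GAN losses can be written as follows. The function names and the discriminator outputs are illustrative assumptions, not taken from the slides.

```python
import math

def bce(p, label):
    """Binary cross-entropy for one discriminator output p in (0, 1)."""
    eps = 1e-12  # avoid log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real -> 1 and generated -> 0.
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    # The generator wants its output to be classified as real (1).
    return bce(d_fake, 1)

# Hypothetical discriminator outputs for one real and one generated image.
d_real, d_fake = 0.9, 0.2
print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake))
```

The generator loss falls as the discriminator's score for a generated image moves toward 1, which is the sense in which the discriminator acts as a learned loss.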

Related Works - Image Translation
- input: an image in the source domain
- output: an image in the target domain
- using GAN

- unsupervised approach
- No paired images are needed.
- e.g. CycleGAN

- cycle consistency loss
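- Trained so that an image that is translated and then translated back ends up close to the original input image.
- Similar techniques are also used in unsupervised machine translation and elsewhere.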
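
The cycle consistency idea can be sketched with toy translators. Here `g_ab` and `g_ba` are hypothetical stand-ins for the two learned networks, and an "image" is just a flat list of numbers.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized 'images' (flat lists)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Toy stand-ins for the two translators (hypothetical; real ones are CNNs).
def g_ab(img):  # domain A -> domain B
    return [2.0 * v + 1.0 for v in img]

def g_ba(img):  # domain B -> domain A
    return [(v - 1.0) / 2.0 for v in img]

def cycle_consistency_loss(img_a, img_b):
    # A -> B -> A and B -> A -> B should both return to the starting image.
    loss_a = l1(g_ba(g_ab(img_a)), img_a)
    loss_b = l1(g_ab(g_ba(img_b)), img_b)
    return loss_a + loss_b

print(cycle_consistency_loss([0.0, 0.5, 1.0], [1.0, 2.0, 3.0]))
```

Because the toy translators are exact inverses of each other, the loss here is zero; in training, this term is what pushes the two networks toward being inverses.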

Related Works - Image Translation の問題
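- It is not a multi-modal mapping.
- It works when there is no multi-modality within the domain, e.g. horse → zebra,
- but for cat → dog, for example, the dog image could equally be a Pomeranian or a Shiba Inu.
- When the distribution within a domain is multi-modal like this, generation does not work well.
- BicycleGAN is a multi-modal mapping, but it requires supervision.
- This work proposes image translation that is both multi-modal and unsupervised.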

Related Works - Auto-Encoder
- A method that compresses data down to its essential information (feature extraction, dimensionality reduction).

Related Works - VAE Variational Auto-Encoder
- Treats the latent representation as a random variable, which makes a continuous representation possible.

Related Works - Disentangled Representation
- A way of splitting the latent representation, by some means, according to the meaning of the information it carries.
- content (shape, pose, location, …)
- style (pattern, color, appearance, ...)

Method - MUNIT
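- The image to be translated is embedded by an Auto-Encoder, split into the following two latent representations.
- content: information to preserve after translation
- e.g. the orientation and position of the source tiger's face
- style: information that should not carry over after translation, and that should instead be controlled multi-modally using information from the target domain
- e.g. the fur color and appearance of the target cat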

Method - MUNIT
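- Three losses are used.
- First, the image is embedded by an Auto-Encoder, split into content and style (this alone is not enough; the GAN step that follows is needed).
- ① Image reconstruction loss
- Reconstruct the image from style and content. Trained so that s and c preserve the essential information of the original image.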
[Figure: model diagram with loss ① marked]
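
A minimal sketch of the image reconstruction loss, under strong simplifying assumptions: the encoder and decoder below are toy functions (content = normalized values, style = mean and standard deviation), not the networks from the paper, and an "image" is a flat list of numbers.

```python
import statistics

def encode(img):
    """Toy encoder (assumption): content = normalized values, style = (mean, std)."""
    mean = statistics.fmean(img)
    std = statistics.pstdev(img) or 1.0  # guard against a constant image
    content = [(v - mean) / std for v in img]
    return content, (mean, std)

def decode(content, style):
    """Toy decoder (assumption): re-apply the style statistics to the content."""
    mean, std = style
    return [c * std + mean for c in content]

def image_reconstruction_loss(img):
    # Encode into (content, style), decode, and compare with the input (L1).
    content, style = encode(img)
    recon = decode(content, style)
    return sum(abs(r - v) for r, v in zip(recon, img)) / len(img)

print(image_reconstruction_loss([0.1, 0.4, 0.9, 0.3]))
```

The toy pair is invertible, so the loss is zero up to floating-point error. With learned networks the same quantity is minimized so that c and s together keep the information needed to rebuild the image.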

Method - MUNIT
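- ② Adversarial loss
- A discriminator judges whether the translated image is an image from the target domain or not.
- content: taken from the image before translation
- style: Gaussian noise
- With this alone, the Discriminator is fooled even by a target-like generated image that has nothing to do with c and s (e.g. producing any cat-like image would count as OK, whereas we want to keep the tiger's orientation).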
[Figure: model diagram with loss ② marked]

Method - MUNIT
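- ③ Latent reconstruction loss
- Make the content and style used for the translation recoverable from the translated image.
- The translated image therefore has to retain the content and style information used for the translation.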
[Figure: model diagram with loss ③ marked]
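
The latent reconstruction loss can be sketched in the same toy setting. The encoder, decoder, and the way the style is sampled are all illustrative assumptions; the point is only the order of operations: translate with (source content, sampled style), re-encode the result, and compare the recovered codes with the ones that were used.

```python
import random
import statistics

def encode(img):
    """Toy encoder (assumption): content = normalized values, style = (mean, std)."""
    mean = statistics.fmean(img)
    std = statistics.pstdev(img) or 1.0
    return [(v - mean) / std for v in img], (mean, std)

def decode(content, style):
    """Toy decoder (assumption): re-apply the style statistics to the content."""
    mean, std = style
    return [c * std + mean for c in content]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def latent_reconstruction_loss(src_img, rng):
    content, _ = encode(src_img)  # keep the source content
    # Sample a style from noise; the std component is kept positive.
    style = (rng.gauss(0.0, 1.0), abs(rng.gauss(0.0, 1.0)) + 0.1)
    translated = decode(content, style)
    content_hat, style_hat = encode(translated)  # recover both codes
    return l1(content_hat, content) + l1(style_hat, style)

print(latent_reconstruction_loss([0.1, 0.4, 0.9, 0.3], random.Random(0)))
```

A generator that ignored the content or the sampled style would make the recovered codes differ from the originals, so this term penalizes exactly the failure described for the adversarial loss alone.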

Method - MUNIT
[Figure: full model diagram with losses ①, ②, ③ marked]

Method - Auto-Encoder
- Downsampling: CNN
- AdaIN (Adaptive Instance Normalization): the parameters of the normalization layers represent the style
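
As a sketch of how normalization parameters can carry style, AdaIN normalizes each feature channel to zero mean and unit variance, then applies a scale `gamma` and shift `beta`. Here `gamma` and `beta` are passed in directly as an assumption; in a real model they would be produced from the style code.

```python
import statistics

def adain(channel, gamma, beta, eps=1e-5):
    """Adaptive instance normalization for one feature channel (flat list)."""
    mu = statistics.fmean(channel)
    sigma = (statistics.pvariance(channel) + eps) ** 0.5
    # Normalize the channel, then apply the style-dependent scale and shift.
    return [gamma * (v - mu) / sigma + beta for v in channel]

out = adain([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(out)
```

After the call, the channel's mean is `beta` and its standard deviation is approximately `gamma`, regardless of the statistics of the input features.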

Method - Auto-Encoder
- Discriminator
- LSGAN objective
- multi-scale discriminators
- to learn realistic details
- to learn correct global structure
- Domain-invariant perceptual loss
- Extends perceptual loss, which normally works only in the supervised setting, to the unsupervised setting
- a distance in the VGG feature space between the output and the reference image
- Helps training at high resolution.
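
The LSGAN objective with multi-scale discriminators can be sketched as below. The least-squares targets (1 for real, 0 for generated) follow the usual LSGAN formulation; the three scores per image and the choice of three scales are hypothetical values for illustration.

```python
def lsgan_d_loss(d_real, d_fake):
    # Least-squares targets: 1 for real images, 0 for generated images.
    return (d_real - 1.0) ** 2 + d_fake ** 2

def lsgan_g_loss(d_fake):
    # The generator wants the discriminator to output 1 for its images.
    return (d_fake - 1.0) ** 2

def multiscale_d_loss(real_scores, fake_scores):
    # One (real, fake) score pair per discriminator scale; losses are summed.
    return sum(lsgan_d_loss(r, f) for r, f in zip(real_scores, fake_scores))

# Hypothetical scores from three discriminators at different resolutions.
print(multiscale_d_loss([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))
print(lsgan_g_loss(0.1))
```

Discriminators at finer scales see local detail and those at coarser scales see overall layout, which corresponds to the two goals listed above.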

Evaluation - Task
- supervised image translation
- unsupervised image translation

Evaluation - supervised
- dataset: Edges <-> shoes/handbags
- colored image
- corresponding edge images
- eval. metric
- quality: human preference
- diversity: LPIPS distance
- baselines
- UNIT
- CycleGAN
- CycleGAN with noise
- BicycleGAN

Evaluation - Baselines
- UNIT: the latent representation is a single code, not disentangled.
- CycleGAN
- CycleGAN with noise: Gaussian noise is added to the input image.
- BicycleGAN: capable of continuous multi-modal mapping, but requires supervision.

Results - supervised - qualitative

Evaluation - Human Preference
- to evaluate the quality
- Amazon Mechanical Turk
- 500 questions/worker
- 1 source image
- 2 translated images from different methods

Evaluation - LPIPS Distances
- to evaluate diversity
- a weighted L2 distance between pairs of deep features of
randomly-sampled translated images from the same input
- deep feature extractor: ImageNet-pretrained AlexNet
- correlate well with human perceptual similarity
- 1900 pairs
- 100 input images
- x 19 output pairs/input
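
The pair counting in the diversity protocol (100 inputs × 19 pairs = 1900) can be sketched as follows. The `distance` function is a plain mean squared difference standing in for LPIPS, and `translate` is a hypothetical multi-modal translator; neither is the real component.

```python
import random

def distance(a, b):
    """Stand-in for LPIPS (assumption): plain mean squared difference."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def translate(img, rng):
    """Hypothetical multi-modal translator: output depends on a random style."""
    style = rng.gauss(0.0, 1.0)
    return [v + style for v in img]

def diversity(inputs, pairs_per_input, rng):
    dists = []
    for img in inputs:
        for _ in range(pairs_per_input):
            # Two independently sampled translations of the same input.
            dists.append(distance(translate(img, rng), translate(img, rng)))
    return sum(dists) / len(dists), len(dists)

rng = random.Random(0)
inputs = [[rng.random() for _ in range(4)] for _ in range(100)]
score, n_pairs = diversity(inputs, 19, rng)
print(n_pairs, score)
```

A deterministic translator would return the same output for both samples in every pair, giving a diversity of zero.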

Results - supervised - quantitative
- MUNIT wins every comparison except Quality against BicycleGAN.
- Quality drops sharply when any one of the three losses is removed, so all of the losses can be judged effective.

Evaluation - unsupervised
- dataset: Animal image translation
- Animal images grouped by category.
- No pairs.
- eval. metric
- IS = Inception Score
- CIS = conditional Inception Score
- baselines
- UNIT
- CycleGAN
- CycleGAN with noise
- (BicycleGAN is omitted because it only supports the supervised setting)
[Figure: example images from the big cats, house cats, and dogs categories]

Results - unsupervised - qualitative
[Figure: qualitative results, including CycleGAN]

Evaluation - (C)IS=(Conditional) Inception Score
- popular for image generation
- to evaluate quality and diversity
- IS: diversity of all output images
- The easier an image is for Inception-v3 to classify, the higher the score.
- CIS: diversity of outputs conditioned on a single input image
- more suited for evaluating multi-modal mapping
- e.g. If a cat image is translated into a nearly perfect dog image, IS is high. But if each input image is always translated into the same dog image (i.e. not a multi-modal mapping), IS stays high while CIS is low.

Evaluation - (C)IS=(Conditional) Inception Score
- x1: source image
- x2: target image
- x1->2: translated image from 1 to 2
- y: class = mode (e.g. Pomeranian, Shiba Inu, Siberian Husky if X2 is a set of dogs)
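
With the notation above, both scores can be sketched as average KL divergences between the class distribution p(y | x1->2) of each translated image and a marginal: the marginal over all outputs for IS, and the marginal over the outputs of a single input for CIS. The formula itself is not in the extracted slides, so treat this form as an assumption. The two-class probability tables are made-up examples.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_dist(dists):
    n = len(dists)
    return [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]

def cis(probs):
    """probs[i][j]: class distribution p(y | x1->2) for input i, sampled output j."""
    total = 0.0
    for per_input in probs:
        marginal = mean_dist(per_input)  # p(y | x1)
        total += sum(kl(p, marginal) for p in per_input) / len(per_input)
    return total / len(probs)

def inception_score(probs):
    flat = [p for per_input in probs for p in per_input]
    marginal = mean_dist(flat)  # p(y)
    return sum(kl(p, marginal) for p in flat) / len(flat)

# Each input always maps to one mode, but different inputs map to different modes.
collapsed = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
# Each input can map to either mode.
diverse = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
print(inception_score(collapsed), cis(collapsed))
print(inception_score(diverse), cis(diverse))
```

In the `collapsed` case IS is as high as in the `diverse` case while CIS drops to zero, which is why CIS is the better fit for judging multi-modal mappings.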

Results - unsupervised - quantitative
- Beats the existing unsupervised approaches by a wide margin. (the higher the better)

Results - Example-guided image translation
- The style can be specified not with noise but with a single image from the target domain, so an arbitrary style can be chosen.
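
Example-guided translation can be sketched by taking the content code from the source image and the style code from a reference image in the target domain. The encoder and decoder are toy assumptions (content = normalized values, style = mean and standard deviation), not the paper's networks.

```python
import statistics

def encode(img):
    """Toy encoder (assumption): content = normalized values, style = (mean, std)."""
    mean = statistics.fmean(img)
    std = statistics.pstdev(img) or 1.0
    return [(v - mean) / std for v in img], (mean, std)

def decode(content, style):
    """Toy decoder (assumption): re-apply the style statistics to the content."""
    mean, std = style
    return [c * std + mean for c in content]

def translate_with_reference(src_img, ref_img):
    content, _ = encode(src_img)  # structure from the source image
    _, style = encode(ref_img)    # style from the target-domain reference
    return decode(content, style)

out = translate_with_reference([0.0, 1.0, 2.0], [10.0, 10.5, 11.0])
print(out)
```

The output keeps the ordering and relative spacing of the source values while taking on the mean and spread of the reference.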

Conclusion
- Proposed a method for unsupervised multi-modal image translation.
- Solved it by making the intermediate layer of the Auto-Encoder disentangled.
- On supervised image translation, it scored close to BicycleGAN, a supervised multi-modal method.
- On unsupervised image translation, it beat the other methods by a wide margin.
