Slide 3
Interactive visual manipulation (object removal/addition, changing the object category)
Generate diverse results from the same input, allowing users to edit the object appearance interactively
Goal
<Interactive editing results> <Editing interface>
https://github.com/NVIDIA/pix2pixHD
Slide 4
Related Work – Pix2Pix [21]
Image-to-Image Translation with Conditional Adversarial Networks (CVPR 2017)
cGAN: {x , z} → y
x: observed image (condition)
z: random noise vector
y: generated output
Slide 5
Related Work – Cascade Refinement Networks [5]
Photographic Image Synthesis with Cascaded Refinement Networks (ICCV 2017)
• GANs: training instability and optimization issues
• First model able to synthesize HD images
• Proposes a cascade of refinement modules
• Direct regression objective with a perceptual loss
• Weakness: lacks fine details and realistic textures
pix2pixHD
Slide 6
From semantic label map to neural photo
Pix2Pix [21]: training is unstable and the generated quality is unsatisfactory
Conditional GAN Framework
Slide 11
Improving Photorealism and Resolution
<Coarse-to-fine Generator>
Perceptual Losses for Real-Time Style Transfer and Super-Resolution (ECCV 2016) [22]
G1 : Global Generator
G2 : Local Enhancer Generator
G1 Input : 1024x512 → G1 Output : 1024x512
G2 Input : 2048x1024 → G2 Output : 2048x1024
Element-wise sum of the two feature maps (G1's last feature map + G2's front-end features)
Training :
1. Train the global generator
2. Train the local enhancer
3. Jointly fine-tune all the networks together
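The coarse-to-fine fusion above can be sketched with toy numpy tensors. `global_generator` and `local_enhancer` below are hypothetical scalar stand-ins for G1 and G2 (the real networks are convolutional); the sketch only shows how G2's front-end features and G1's output features are combined by element-wise sum before G2's back end restores full resolution:

```python
import numpy as np

def global_generator(half_res):
    # Hypothetical stand-in for G1: in pix2pixHD this is a conv net that maps
    # the 2x-downsampled input to a feature map at the same (half) resolution.
    return half_res * 0.5

def local_enhancer(full_res, g1_features):
    # Hypothetical stand-in for G2's front end: features from the
    # full-resolution input, brought down to G1's resolution.
    front = full_res[::2, ::2] * 0.25
    fused = front + g1_features          # element-wise sum of the two feature maps
    # G2's back end brings the fused features back to full resolution
    # (nearest-neighbour repeat here; learned upsampling in the real model).
    return np.repeat(np.repeat(fused, 2, axis=0), 2, axis=1)

full_res = np.ones((8, 8))                      # toy stand-in for a 2048x1024 input
g1_out = global_generator(full_res[::2, ::2])   # G1 operates on the half-res input
out = local_enhancer(full_res, g1_out)          # G2 output matches the input size
```

This mirrors why G1 is trained first: G2 only refines residual detail on top of G1's already-plausible features.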
Slide 12
Semantic label map vs Instance Map
<Input Image> <Semantic Label Map> <Instance Label Map>
A semantic label map cannot distinguish objects of the same class.
An instance label map assigns a unique ID to each individual object.
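pix2pixHD turns the instance IDs into an instance boundary map (a pixel is 1 when its ID differs from any of its 4-connected neighbours) and feeds that map to the network. A minimal numpy sketch, assuming a 2-D integer ID array:

```python
import numpy as np

def instance_boundary_map(inst):
    # Mark a pixel when its instance ID differs from any 4-connected neighbour.
    b = np.zeros_like(inst, dtype=bool)
    b[:, 1:]  |= inst[:, 1:] != inst[:, :-1]   # differs from left neighbour
    b[:, :-1] |= inst[:, 1:] != inst[:, :-1]   # differs from right neighbour
    b[1:, :]  |= inst[1:, :] != inst[:-1, :]   # differs from top neighbour
    b[:-1, :] |= inst[1:, :] != inst[:-1, :]   # differs from bottom neighbour
    return b.astype(np.uint8)

# Toy instance map: two cars (IDs 1, 2) over a road (ID 3).
inst = np.array([[1, 1, 2],
                 [1, 1, 2],
                 [3, 3, 3]])
edges = instance_boundary_map(inst)
```

The boundary map is what lets the generator separate adjacent objects of the same class, which a semantic map alone cannot express.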
Slide 14
Improving Photorealism and Resolution
<Multi-scale Discriminator>
To differentiate high-resolution real and synthesized images,
the discriminator needs a large receptive field. Two naive options:
1. A deeper network
2. Larger convolutional kernels
→ both increase network capacity and risk overfitting
Multi-scale discriminators :
3 discriminators with an identical network structure, operating at different image scales
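A minimal sketch of the multi-scale setup: the image is downsampled by factors of 2 and 4 with average pooling, and one discriminator of identical architecture scores each scale. `toy_discriminator` below is a hypothetical scalar stand-in for a PatchGAN discriminator:

```python
import numpy as np

def avg_pool2(x):
    # 2x average-pool downsampling, used to build the image pyramid for D2 and D3.
    return 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])

def toy_discriminator(x):
    # Hypothetical stand-in for one discriminator: returns a scalar score.
    return float(x.mean())

img = np.random.rand(16, 16)   # toy stand-in for a high-resolution image
scores = []
for _ in range(3):             # D1 on full resolution, D2 on 1/2, D3 on 1/4
    scores.append(toy_discriminator(img))
    img = avg_pool2(img)
```

The coarsest-scale discriminator effectively sees a large receptive field and guides global consistency, while the finest scale enforces detail, without deepening any single network.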
Slide 15
Improving Photorealism and Resolution
<Improved Adversarial Loss> Improve the GAN loss by incorporating a feature matching loss based on the discriminator.
(Dk(i) denotes the i-th-layer feature extractor of discriminator Dk)
Adding the VGG perceptual loss yields a slight further performance improvement.
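The feature matching loss compares real and synthesized images through the discriminator's intermediate layers with an L1 distance summed over layers. A simplified numpy sketch (the paper normalizes each layer by its element count, which the per-layer mean below approximates):

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    # L1 distance between real and fake discriminator features,
    # averaged within each layer, then averaged over the T layers.
    T = len(feats_real)
    return sum(np.abs(fr - ff).mean()
               for fr, ff in zip(feats_real, feats_fake)) / T

# Toy features from two discriminator layers (different spatial sizes).
real_feats = [np.ones((4, 4)), np.full((2, 2), 2.0)]
fake_feats = [np.zeros((4, 4)), np.full((2, 2), 2.0)]
loss = feature_matching_loss(real_feats, fake_feats)   # (1.0 + 0.0) / 2
```

Matching intermediate features gives the generator a denser training signal than the scalar real/fake score alone, which stabilizes high-resolution training.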
Slide 16
Learning an Instance-level Feature Embedding
To generate diverse images and allow instance-level control:
add low-dimensional feature channels, produced by a feature encoder, to the generator input.
Training time :
1. Train the feature encoder jointly with the generator and discriminator
2. Record the encoded features for every instance in the training data
3. Run k-means clustering on the features within each semantic category
Inference time :
1. For each object instance, randomly select one of the cluster centers and use it as the encoded feature
2. When editing, the user can choose one of the K modes to select a different style
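Training steps 2-3 and inference step 1 above can be sketched in numpy; the feature data and the small k-means routine here are illustrative stand-ins, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-D encoded features recorded for every instance of one
# semantic category (e.g. "car") across the training set.
car_features = rng.normal(size=(200, 3))

def kmeans(x, k, iters=20):
    # Plain Lloyd's algorithm: assign points to the nearest center, recompute means.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    return centers

centers = kmeans(car_features, k=10)   # K = 10 modes per category
# Inference: pick one cluster center at random as the instance's encoded
# feature; for editing, the user picks one of the K modes instead.
chosen = centers[rng.integers(10)]
```

Because each cluster center corresponds to a typical appearance mode (e.g. a car color), swapping centers changes an object's style without touching the rest of the image.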
Slide 18
Implementation details
• LSGAN
• 𝜆 = 10 (weight on the feature matching and VGG losses)
• K = 10 for k-means
• 3-dimensional vectors to encode features
• Ours : GAN loss + Feature Matching Loss + VGG Perceptual Loss
• Ours(w/o VGG loss) : GAN loss + Feature Matching Loss
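Under these settings the generator objective combines the three terms, with λ = 10 weighting the auxiliary losses. A simplified scalar sketch of this assumed form, covering both the "Ours" and "Ours (w/o VGG loss)" variants:

```python
def total_generator_loss(gan_loss, fm_loss, vgg_loss, lam=10.0, use_vgg=True):
    # Combined objective: GAN loss plus lambda-weighted feature matching
    # loss, optionally plus the lambda-weighted VGG perceptual loss.
    loss = gan_loss + lam * fm_loss
    if use_vgg:
        loss += lam * vgg_loss
    return loss

full = total_generator_loss(1.0, 0.1, 0.2)                  # "Ours"
ablated = total_generator_loss(1.0, 0.1, 0.2, use_vgg=False)  # "Ours (w/o VGG loss)"
```

The ablation in the experiments toggles exactly this `use_vgg` term, isolating the contribution of the perceptual loss.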
Datasets
• Cityscapes, NYU Indoor RGBD, ADE20K, Helen Face
Baseline
• pix2pix, CRN
Experimental Results
Slide 19
Quantitative Comparisons
• Semantic segmentation score: run PSPNet on the generated images and compare the predicted labels against the ground truth
Experimental Results
<Different Methods>
<Different Generators>
<Different Discriminators>
Slide 20
Human Perceptual Study
• A/B tests deployed on Amazon Mechanical Turk
• Unlimited-time setting
• Limited-time setting : from 1/8 second to 8 seconds
Experimental Results
<Preference Rates>