A Style-Based Generator Architecture for Generative Adversarial Networks
1. 2022. 06. 03.
A Style-Based Generator Architecture for Generative
Adversarial Networks
Tero Karras, Samuli Laine, Timo Aila
CVPR 2019
Hyunwook Lee
2. Contents
• Overview
• Preliminaries
• Disentangled Representation
• Various Normalization in Deep Learning
Domain
• StyleGAN
• Disentanglement of StyleGAN
• AdaIN in the StyleGAN
• Applications of StyleGAN
• Case: Music Generation
• Conclusion
3. 3
Overview: What is the StyleGAN?
(a) Traditional generator and (b) StyleGAN generator
One of the most famous GANs for image synthesis
Automatic, unsupervised separation of high-level attributes
Control image synthesis inspired by style transfer
Scale-specific mixing and interpolation
Learnable
Operation
AdaIN
4. 4
Overview: Examples with StyleGAN
• StyleGAN enables scale-specific styling
• Different styles in each layer only affect
the corresponding scale of the output
• Coarse – pose, general hair style, face, eyeglasses
• Middle – small facial features, hair style, eyes
• Fine – color scheme and microstructure
• How do they achieve this styling?
AdaIN from Style Transfer & Disentangled Representation!
5. 5
Preliminaries: Disentangled Representation
Entangled and disentangled representation
One of the unsupervised representation learning approaches in generative modeling
Control image synthesis inspired by style transfer
Automatic, unsupervised separation of high-level attributes
Scale-specific mixing and interpolation
6. 6
Preliminaries: Disentangled Representation
Training GANs in the traditional way
The latent z becomes a kind of feature vector
(i.e., a representation)
Train & Test
• A change of the latent z in an arbitrary dimension causes changes of two or
more features → these features are entangled!
• Degrades interpretability and controllability of the generation process
→ each latent dimension should correspond to one “independent feature”
7. 7
Preliminaries: Disentangled Representation
• Based on manifold hypothesis
• real-world high-dimensional data lie on low-
dimensional manifolds embedded within the high-
dimensional space
• A unit Gaussian is not enough to represent image
manifolds
• Images can be badly reconstructed
• A latent space that consists of linear subspaces,
each of which controls one factor of variation
• Reading materials:
• InfoGAN, β-VAE, Spatial CBN, LAPGAN,…
8. 8
Various Normalization in Deep Learning
• Commonly utilized in most deep learning models
• Main idea: normalize layer inputs → guarantees all the layers have the same /
similar input distribution
(Figure: Batch Norm – mean/std across the minibatch; Layer Norm – mean/std across channels; Instance Norm – mean/std per sample and channel; Group Norm – mean/std per channel group)
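The four normalization schemes differ only in which axes the mean/std are computed over. A minimal NumPy sketch, assuming a standard `(N, C, H, W)` feature tensor:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Normalize x to zero mean / unit variance over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 16, 32, 32)            # (N, C, H, W)

batch_norm    = normalize(x, axes=(0, 2, 3))  # per channel, across the batch
layer_norm    = normalize(x, axes=(1, 2, 3))  # per sample, across channels
instance_norm = normalize(x, axes=(2, 3))     # per sample AND per channel
# Group Norm: split channels into G groups, normalize within each group
G = 4
xg = x.reshape(8, G, 16 // G, 32, 32)
group_norm = normalize(xg, axes=(2, 3, 4)).reshape(x.shape)
```

Learnable scale/shift parameters (gamma, beta) are omitted here for brevity; real layers apply them after the normalization step.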
9. 9
Instance Normalization in Style Transfer
• Convolutional feature statistics of a DNN can capture the style of images
• Recent work reveals that channel-wise mean/variance are effective for
style transfer
→ Instance Normalization can be seen as a form of style normalization!
10. 10
Adaptive Instance Normalization
• Given a content image x and a style image y, the style can be transferred by:
• Normalize x (remove the style of the content) → denormalize x with the style of y
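The normalize-then-denormalize step above is AdaIN(x, y) = σ(y)·(x − μ(x))/σ(x) + μ(y), with statistics taken per channel. A minimal NumPy sketch for single `(C, H, W)` feature maps:

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """Adaptive Instance Normalization: align the per-channel mean/std of
    content features x to those of style features y. x, y: (C, H, W)."""
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sigma_x = x.std(axis=(1, 2), keepdims=True)
    mu_y = y.mean(axis=(1, 2), keepdims=True)
    sigma_y = y.std(axis=(1, 2), keepdims=True)
    # remove the content's style, then apply the style image's statistics
    return sigma_y * (x - mu_x) / (sigma_x + eps) + mu_y

content = np.random.randn(3, 8, 8)
style = 2.0 * np.random.randn(3, 8, 8) + 5.0
out = adain(content, style)
```

After the transfer, `out` carries the style statistics: its per-channel mean/std match μ(y) and σ(y) up to the epsilon.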
11. 11
Style-based Generator (StyleGAN)
How can they achieve Disentangled Representation?
How can they design a generator as a style transfer?
Why do they need noise input for each layer?
(a) Traditional generator and (b) StyleGAN generator
Learnable
Operation
AdaIN
12. 12
StyleGAN: Disentangled Representation
• Latent-space disentanglement is a crucial part of both style
transfer and generative models
• Hard to achieve by direct mapping (b in the lower figure)
• StyleGAN generates a disentangled intermediate latent space
𝒲
• Not a fixed distribution, but a learned mapping
• Spatially invariant, modified by the affine transformation A
• Generating images from a disentangled representation is much
easier than from an entangled one
→ the mapping network is naturally pressured to produce disentangled
representations
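The mapping network z → w can be sketched in plain NumPy as an 8-layer fully connected network with leaky-ReLU activations (matching the paper's depth); the random weight matrices here are illustrative stand-ins for learned parameters, and the input scaling is a simplified pixel-norm-style step:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512
# 8 fully connected layers, He-style initialization as a stand-in
layers = [rng.standard_normal((DIM, DIM)) * (2 / DIM) ** 0.5 for _ in range(8)]

def mapping(z):
    """Map z ~ N(0, I) to the intermediate latent w (not a fixed
    distribution -- its shape is whatever the learned layers produce)."""
    h = z / np.linalg.norm(z) * np.sqrt(len(z))  # normalize the input latent
    for W in layers:
        a = h @ W
        h = np.where(a > 0, a, 0.2 * a)          # leaky ReLU, slope 0.2
    return h

w = mapping(rng.standard_normal(DIM))
```

The same w (after per-layer affine transforms A) then drives every AdaIN operation in the synthesis network.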
13. 13
StyleGAN: AdaIN as Styling Methods
• By the affine transformation A, the vector w becomes the
style y = (ys, yb)
• ys is the style deviation and yb is the style mean
• Step-by-step
• Input x is normalized as in Instance Normalization
• Effectively localizes the styles
• Denormalize with ys and yb
• To guarantee ys acts as a standard deviation (i.e., a positive value),
the actual multiplier is ys + 1
• Forward to the next layer
• Note: scale-specific styling is only possible when
we can separate each network output gradually
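The steps above can be sketched as one function: a hypothetical affine map A (here two weight matrices, `A_scale` and `A_bias`, standing in for the learned transform) turns w into (ys, yb); x is instance-normalized and then denormalized, with ys + 1 as the actual multiplier:

```python
import numpy as np

def style_modulate(x, w, A_scale, A_bias, eps=1e-5):
    """One StyleGAN AdaIN step (sketch). x: (C, H, W) feature map,
    w: intermediate latent; A_scale/A_bias are stand-ins for the
    learned affine transformation A producing y = (ys, yb)."""
    ys = w @ A_scale                       # per-channel style scale, (C,)
    yb = w @ A_bias                        # per-channel style bias,  (C,)
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)      # instance normalization
    # ys + 1 keeps the effective scale centered at 1 (a positive std)
    return (ys + 1)[:, None, None] * x_norm + yb[:, None, None]

rng = np.random.default_rng(0)
w = rng.standard_normal(512)
x = rng.standard_normal((64, 16, 16))
A_scale = rng.standard_normal((512, 64)) * 0.01
A_bias = rng.standard_normal((512, 64)) * 0.01
out = style_modulate(x, w, A_scale, A_bias)
```

Because the normalization discards the incoming statistics before the new style is applied, each AdaIN controls only its own layer — the basis of the localization discussed above.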
14. 14
StyleGAN: Style Mixing
• Encouraging the styles to localize via
Style Mixing during training
• Simply:
• run two different latent codes z1, z2
• mix the corresponding intermediate latents w1, w2
at a randomly selected point in 𝑔
• Prevents the network from assuming
that adjacent styles are correlated
→ more localized, scale-specific modification!
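The mixing procedure is a few lines: sample two latents, map both to 𝒲, pick a random crossover layer, and feed w1 before it and w2 after it. Here `f` is a placeholder for the mapping network and `n_layers = 18` matches StyleGAN's 18 style inputs at 1024×1024:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers = 18
f = np.tanh                                  # placeholder mapping network

z1, z2 = rng.standard_normal(512), rng.standard_normal(512)
w1, w2 = f(z1), f(z2)

crossover = rng.integers(1, n_layers)        # random mixing point in g
# style fed to each synthesis layer: w1 before the crossover, w2 after
per_layer_w = [w1 if i < crossover else w2 for i in range(n_layers)]
```

At test time the same mechanism yields the mixing grids on the following slides: coarse layers from one source, middle/fine layers from another.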
17. 17
Style Mixing in Middle Level (16² – 32²)
• Brings smaller-scale face features, hair style, eyes
open / closed, …
18. 18
Style Mixing in Fine Level (64² – 1024²)
• Mainly brings color scheme and microstructures
• Doesn’t change coarse / middle styles
19. 19
StyleGAN: Stochastic Variation
• Traditional GANs achieve stochastic
variation by…
• generating spatially-varying pseudorandom
numbers
• Consumes network capacity
• Not always successful
• In StyleGAN…
• Introduce random noise at the layer level
• Hypothesis: there is pressure to introduce new
content as soon as possible at any point
to fool the discriminator
• The easiest way: introducing new random noise for
each layer → variation with random noise
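Per-layer noise injection is a one-liner per layer: a fresh single-channel Gaussian noise image is broadcast over all feature channels with per-channel scaling factors (learned in the real model; a fixed stand-in value here):

```python
import numpy as np

def add_noise(x, channel_scales, rng):
    """Per-layer stochastic variation (sketch): add a fresh single-channel
    Gaussian noise image, broadcast across channels with per-channel
    scaling factors. x: (C, H, W)."""
    noise = rng.standard_normal((1, *x.shape[1:]))   # one noise image per layer
    return x + channel_scales[:, None, None] * noise

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16, 16))
scales = np.full(64, 0.1)          # stand-in for learned scaling factors
out = add_noise(x, scales, rng)
```

Because each layer draws its own noise, resampling it perturbs only layer-local detail (hair strands, stubble) while the styles — and thus global aspects — stay fixed.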
20. 20
StyleGAN: Stochastic Variation
• The main areas of stochastic
variation are
• the hair
• silhouettes
• parts of the background
→ The noise doesn’t affect
global aspects!
23. 23
Advances of StyleGAN: StyleGAN2
(Figures: phase artifacts in StyleGAN; examples of unnatural images w/ StyleGAN)
• StyleGAN (left) has a texture-sticking problem due to
progressive growing
• Each image at each scale is generated by the
corresponding generator, independently
• StyleGAN2 adopts a ResNet-style architecture to solve the problem
• Note: not perfectly solved – it’ll be discussed in StyleGAN3
Examples of natural images w/ StyleGAN2
24. 24
Advances of StyleGAN: StyleGAN3
• StyleGAN2 (left) has not perfectly solved the texture-sticking problem
• (left) Averaged images over small changes of the latent should blur the central image
• (left) But in StyleGAN2, details stick to the same pixel coordinates
26. 26
Applications of StyleGAN: Image Domain
• InterFaceGAN
• Extract linear editing directions through attribute-level supervision
• StyleFlow
• First to present editing that remains stable under composition
• Normalizing flows and attribute-level supervision
• DyStyle
• Addresses compositional editing directly
• Accurate, elaborate, and diverse editing
• StyleCLIP
• Free textual editing w/ visual-linguistic pretrained model
• Pose with Style
• Human pose supervision to edit body poses and clothing
• StyleMapGAN
• Localized editing by augmenting StyleGAN’s architecture
w/ spatially adaptive modulation
28. 28
Applications of StyleGAN: StyleMapGAN
• Localized editing by augmenting StyleGAN’s
architecture w/ spatially adaptive modulation
• Localized editing is conducted with spatially adaptive modulation
29. 29
Music Generation: Recent Works w/wo GANs
• Style-Conditioned Music Generation
• Style transfer-like methods in music generation w/ LSTM-based GANs
• Builds a style codebook that decides the overall style of the music
• Symbolic Music Generation with Transformer-GANs
• Compound Word Transformer: Learning to Compose Full-Song Music over
Dynamic Directed Hypergraphs
• Transformer-based music generation model
30. 30
Music Generation: Why is StyleGAN hard to utilize?
• The main “scale-specific controllability” of StyleGAN comes from the
stacked CNNs at various sizes
→ to utilize StyleGAN, the output should be separable by scale
• Music composition can be extended indefinitely
→ cannot utilize CNNs in the temporal dimension
• Separation of the musical components (e.g., Motive – Phrase – Period)
→ hard to model with CNNs (intuitive separation of the components is hard)
→ each of them shares the overall flow → separation causes incoherent music
• Separation of the MIDI components (e.g., bar – beat – …)
→ using CNNs to combine them can cause information loss
(e.g., structured tokens)
• Too many additional features to consider
• StyleGAN and image processing → no other input or consideration except the image
• Music has a bunch of extra features, like instrument
31. 31
Conclusion
• The “scale-specific controllability” of StyleGAN comes from the stacked
CNNs at various sizes
→ to utilize StyleGAN, the target-domain output should be separable
(e.g., 4x4 → 8x8 → 16x16 → 32x32 → … → 1024x1024)
• It may be utilized in GUI design, but that would be more like “conditioned image
synthesis regardless of the structure”
• If we utilize StyleGAN in GUI design, we should be able to defend…
• Why do we ignore the structures?
• Isn’t this design a combination of existing designs?
Batch Norm: the most basic normalization technique, run under the assumption that the batch mean / variance represent the whole dataset; behaves differently at training and inference time
Layer Norm: robust to the input scale, robust to scaling / shifting of the weights
Instance Norm: mean / std normalization per channel and per batch element; usable identically at inference; can normalize contrast, etc.
Group Norm: published by Kaiming He in 2018; a compromise between Layer Norm and Instance Norm
Bring high-level aspects (i.e., pose, general hair style, face shape, and eyeglasses)
It is a non-smooth mapping caused by the normalization
Appears from 64×64 images onward and occurs in all feature maps → because AdaIN ultimately destroys the correlations between channels
Also, since the normalization depends on the input, a very large spike in the input makes fine adjustment easier elsewhere
If bias and noise are applied before normalization, their relative influence becomes inversely proportional to the style magnitude → the noise and bias become correlated with the style → therefore, noise and bias are applied after normalization
Remove the mean subtraction + remove data-dependent normalization
Change AdaIN to norm / denorm of the weights (simple if you consider the properties of a bias-free convolution)
Add an EMA to handle the average value; filtering is done via upsampling
Upsampling uses the filter in (a) → assumes a continuous, infinite spatial domain