2022. 06. 03.
A Style-Based Generator Architecture for Generative
Adversarial Networks
Tero Karras, Samuli Laine, Timo Aila
CVPR 2019
Hyunwook Lee
Contents
• Overview
• Preliminaries
• Disentangled Representation
• Various Normalization in Deep Learning
Domain
• StyleGAN
• Disentanglement of StyleGAN
• AdaIN in the StyleGAN
• Applications of StyleGAN
• Case: Music Generation
• Conclusion
3
Overview: What is StyleGAN?
[Figure: (a) traditional generator and (b) StyleGAN generator; legend: learnable, operation, AdaIN]
• One of the most famous GANs for image synthesis
• Automatic, unsupervised separation of high-level attributes
• Control over image synthesis, inspired by style transfer
• Scale-specific mixing and interpolation
4
Overview: Examples with StyleGAN
• StyleGAN enables scale-specific styling
• The style fed to each layer affects only the corresponding scale of the image
• Coarse – pose, general hair style, face, eyeglasses
• Middle – small facial features, hair style, eyes
• Fine – color scheme and microstructure
• How do they achieve this styling?
→ AdaIN from style transfer & disentangled representation!
5
Preliminaries: Disentangled Representation
[Figure: entangled and disentangled representation]
A form of unsupervised representation learning in generative modeling
Control image synthesis inspired by style transfer
Automatic, unsupervised separation of high-level attributes
Scale-specific mixing and interpolation
6
Preliminaries: Disentangled Representation
Training GANs in the traditional way
→ The latent z acts as a kind of feature vector
(i.e., a representation)
Train & Test
• Changing the latent z along an arbitrary dimension changes two or
more features → these features are entangled!
• This degrades the interpretability and controllability of the generation process
→ each latent dimension should correspond to one “independent feature”
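The difference between the two cases can be illustrated with a toy example. The sketch below is not a GAN: it assumes two hypothetical linear "generators" (the matrices and the helper `changed_features` are made up for illustration) and counts how many output features move when a single latent dimension is nudged.

```python
import numpy as np

def changed_features(generator, z, dim, delta=1.0, tol=1e-9):
    """Indices of output features that move when latent dimension `dim` is nudged."""
    z_shifted = z.copy()
    z_shifted[dim] += delta
    diff = np.abs(generator(z_shifted) - generator(z))
    return np.flatnonzero(diff > tol).tolist()

# Hypothetical linear "generators" mapping a 3-d latent to 3 features.
entangled_map = np.array([[1.0, 0.5, 0.0],
                          [0.7, 1.0, 0.3],
                          [0.0, 0.4, 1.0]])
disentangled_map = np.eye(3)

def entangled(z):
    return entangled_map @ z

def disentangled(z):
    return disentangled_map @ z

z = np.zeros(3)
print(changed_features(entangled, z, dim=0))     # features 0 and 1 move together
print(changed_features(disentangled, z, dim=0))  # only feature 0 moves
```

In the entangled case one latent dimension drives two features at once, which is exactly what makes the generation process hard to interpret and control.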
7
Preliminaries: Disentangled Representation
• Based on the manifold hypothesis
• Real-world high-dimensional data lie on low-dimensional manifolds embedded within the high-dimensional space
• A unit Gaussian is not enough to represent image manifolds
• Images can be badly reconstructed
• A latent space that consists of linear subspaces,
each of which controls one factor of variation
• Reading materials:
• InfoGAN, β-VAE, Spatial CBN, LAPGAN,…
8
Various Normalization in Deep Learning
• Used in most deep learning models
• Main idea: normalize each layer's input → all layers see the same /
similar input distribution
[Figure panel labels: mean/std among minibatch; mean/std among channels; mean/std in minibatch; mean/std in minibatch]
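The normalization variants differ mainly in which axes the mean and standard deviation are computed over. The sketch below is a minimal NumPy illustration that assumes an (N, C, H, W) tensor layout and leaves out learnable scale/shift parameters and the running statistics that Batch Norm uses at inference time.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    std = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return (x - mean) / std

def batch_norm(x):
    # One statistic per channel, shared across the batch and all positions.
    return normalize(x, axes=(0, 2, 3))

def layer_norm(x):
    # One statistic per sample, shared across channels and positions.
    return normalize(x, axes=(1, 2, 3))

def instance_norm(x):
    # One statistic per sample and per channel, over positions only.
    return normalize(x, axes=(2, 3))

def group_norm(x, groups):
    # One statistic per sample and per group of channels.
    n, c, h, w = x.shape
    grouped = x.reshape(n, groups, c // groups, h, w)
    return normalize(grouped, axes=(2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))  # (N, C, H, W)

# Group Norm sits between the two extremes:
print(np.allclose(group_norm(x, groups=4), instance_norm(x)))  # one channel per group
print(np.allclose(group_norm(x, groups=1), layer_norm(x)))     # all channels in one group
```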
9
Instance Normalization in Style Transfer
• Convolutional feature statistics of a DNN can capture the style of an image
• Recent work shows that the channel-wise mean / variance are effective for
style transfer
→ Instance Normalization can be seen as a form of style normalization!
10
Adaptive Instance Normalization
• Given a content image x and a style image y, AdaIN transfers the style of y onto x:
• Normalize x (remove the style of the content image) → denormalize x with the style statistics of y
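The two steps above correspond to the usual AdaIN formula, AdaIN(x, y) = σ(y) · (x − μ(x)) / σ(x) + μ(y), with the mean μ and standard deviation σ computed per channel. The following is a minimal NumPy sketch; it assumes x and y are feature maps in (N, C, H, W) layout (in style transfer these would be encoder features rather than raw pixels), and the helper `channel_stats` is an illustrative name.

```python
import numpy as np

def channel_stats(feat, eps=1e-5):
    """Per-sample, per-channel mean and std over the spatial axes."""
    mean = feat.mean(axis=(2, 3), keepdims=True)
    std = np.sqrt(feat.var(axis=(2, 3), keepdims=True) + eps)
    return mean, std

def adain(content, style, eps=1e-5):
    c_mean, c_std = channel_stats(content, eps)
    s_mean, s_std = channel_stats(style, eps)
    normalized = (content - c_mean) / c_std   # remove the content's own style
    return s_std * normalized + s_mean        # re-apply the style statistics of y

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(2, 3, 8, 8))
style = rng.normal(5.0, 3.0, size=(2, 3, 8, 8))
out = adain(content, style)

# The output keeps the spatial structure of `content`
# but carries the channel-wise statistics of `style`.
print(np.allclose(out.mean(axis=(2, 3)), style.mean(axis=(2, 3))))
```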
11
Style-based Generator (StyleGAN)
How can they achieve Disentangled Representation?
How can they design a generator as a style transfer?
Why do they need noise input for each layer?
[Figure: (a) traditional generator and (b) StyleGAN generator; legend: learnable, operation, AdaIN]
12
StyleGAN: Disentangled Representation
• Latent space disentanglement is a crucial part of both style
transfer and generative models
• Hard to achieve by direct mapping ((b) in the lower figure)
• StyleGAN generates a disentangled intermediate latent space
𝒲
• Not a fixed distribution, but a learned mapping
• Spatially invariant; turned into styles by the learned affine transformation A
• Generating images from a disentangled representation is much
easier than from an entangled one
→ the mapping network is therefore expected to learn a disentangled
representation
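The mapping from z to the intermediate latent w can be sketched as a small fully connected network. The version below is only an untrained illustration: the 8 layers and 512 dimensions follow the configuration reported in the StyleGAN paper, while the initialization, the leaky-ReLU slope, the normalization of z, and the function names are assumptions made for this sketch.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def init_mapping_network(rng, latent_dim=512, num_layers=8):
    """Random (untrained) weights for a fully connected z -> w mapping."""
    scale = np.sqrt(2.0 / latent_dim)
    return [(rng.normal(size=(latent_dim, latent_dim)) * scale,
             np.zeros(latent_dim))
            for _ in range(num_layers)]

def mapping_network(z, layers, eps=1e-8):
    # Normalize each latent code to unit average magnitude (assumed here),
    # then pass it through the stack of fully connected layers.
    x = z / np.sqrt(np.mean(z ** 2, axis=1, keepdims=True) + eps)
    for weight, bias in layers:
        x = leaky_relu(x @ weight + bias)
    return x

rng = np.random.default_rng(0)
layers = init_mapping_network(rng)
z = rng.normal(size=(4, 512))   # z is drawn from a fixed Gaussian
w = mapping_network(z, layers)  # w follows whatever distribution the network learns
print(w.shape)                  # (4, 512)
```

Because w is the output of a learned function, its distribution is not tied to the fixed Gaussian that z is sampled from.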
13
StyleGAN: AdaIN as Styling Methods
• Through the learned affine transformation A, the vector w becomes the
style y = (ys, yb)
• ys is the style scale (standard deviation) and yb is the style bias (mean)
• Step-by-step
• Input x is normalized with Instance Normalization
• This effectively localizes the styles
• Denormalized with ys and yb
• The actual multiplier is ys + 1, which centers the scale at 1 rather than 0
(this keeps it close to a positive standard deviation, but does not strictly guarantee positivity)
• Forward to the next layer
• Note: scale-specific styling is only possible when
the network output can be built up gradually, scale by scale
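The step-by-step procedure on this slide can be sketched as a single function. This is a hedged illustration rather than the reference implementation: the names `style_adain`, `affine_weight`, and `affine_bias` are hypothetical, the layout is assumed to be (N, C, H, W), and the first C outputs of A are taken as ys and the last C as yb.

```python
import numpy as np

def style_adain(x, w, affine_weight, affine_bias, eps=1e-8):
    """x: (N, C, H, W) features, w: (N, W_DIM) latents, affine_weight: (W_DIM, 2C)."""
    n, c = x.shape[:2]
    style = w @ affine_weight + affine_bias      # affine transformation A
    ys = style[:, :c].reshape(n, c, 1, 1)        # style scale
    yb = style[:, c:].reshape(n, c, 1, 1)        # style bias
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    normalized = (x - mean) / std                # instance normalization
    return (ys + 1.0) * normalized + yb          # multiplier is ys + 1

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=(2, 4, 6, 6))
w = rng.normal(size=(2, 16))

# With A producing all zeros, the layer reduces to plain instance normalization,
# which is why the scale is described as centered at 1.
plain = style_adain(x, w, np.zeros((16, 8)), np.zeros(8))
print(np.allclose(plain.std(axis=(2, 3)), 1.0, atol=1e-4))
```

Since the features are re-normalized at every layer, each style only controls the statistics until the next AdaIN operation, which is what localizes its effect to one scale.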
14
StyleGAN: Style Mixing
• Styles are encouraged to localize through
style mixing during training
• Simply,
• run two different latent codes z1, z2 through the mapping network
• switch between the corresponding intermediate latents w1, w2
at a randomly selected point in the synthesis network 𝑔
• this prevents the network from assuming
that adjacent styles are correlated
→ more localized, scale-specific modification!
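The mixing step itself is simple once each synthesis layer receives its own copy of w. The sketch below assumes 18 style inputs and a 512-dimensional w (the paper's 1024×1024 configuration); `mix_styles` is an illustrative helper, and the random vectors stand in for outputs of the mapping network.

```python
import numpy as np

def mix_styles(w1, w2, crossover):
    """Per-layer styles: w1 before `crossover`, w2 from `crossover` onward."""
    num_layers = w1.shape[0]
    use_second = np.arange(num_layers) >= crossover
    return np.where(use_second[:, None], w2, w1)

NUM_LAYERS, W_DIM = 18, 512
rng = np.random.default_rng(0)

# Stand-ins for w1 = f(z1) and w2 = f(z2), broadcast to every synthesis layer.
w1 = np.broadcast_to(rng.normal(size=W_DIM), (NUM_LAYERS, W_DIM))
w2 = np.broadcast_to(rng.normal(size=W_DIM), (NUM_LAYERS, W_DIM))

# During training the switch point is chosen at random.
crossover = int(rng.integers(1, NUM_LAYERS))
mixed = mix_styles(w1, w2, crossover)
print(crossover, mixed.shape)
```

At test time the same helper can be used with a fixed switch point, for example `mix_styles(w1, w2, 4)` to keep the first four (coarse) styles from one source and take the rest from the other.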
15
StyleGAN: Style Mixing
16
Style Mixing in Coarse Level (4² – 8²)
17
Style Mixing in Middle Level (16² – 32²)
• Brings smaller-scale facial features, hair style, eyes
open / closed, …
18
Style Mixing in Fine Level (64² – 1024²)
• Mainly brings color scheme and microstructure
• Doesn’t change coarse / middle styles
19
StyleGAN: Stochastic Variation
• Traditional GANs achieve stochastic
variation by…
• generating spatially-varying pseudorandom
numbers inside the network
• This consumes network capacity
• Not always successful
• In StyleGAN…
• Random noise is introduced at the layer level
• Hypothesis: at any point in the generator there is pressure to introduce new
content as soon as possible
• (to fool the discriminator)
• The easiest way: feed fresh random noise to
each layer → variation comes from the random noise
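Per-layer noise injection can be sketched in a few lines. The version below assumes an (N, C, H, W) layout, a single-channel Gaussian noise image shared by all feature maps, and a per-channel scaling factor that would be learned during training; `add_noise` and `noise_strength` are illustrative names.

```python
import numpy as np

def add_noise(x, noise_strength, rng):
    """Add one fresh noise image per sample, scaled separately for each channel."""
    n, c, h, w = x.shape
    noise = rng.normal(size=(n, 1, h, w))   # single channel, broadcast to all feature maps
    return x + noise_strength.reshape(1, c, 1, 1) * noise

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))

# A channel whose strength is zero ignores the noise entirely.
out = add_noise(x, np.array([0.0, 1.0, 2.0]), rng)
print(np.array_equal(out[:, 0], x[:, 0]))
```

Because the generator receives fresh noise at every layer, it has no need to synthesize pseudorandom values from earlier activations.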
20
StyleGAN: Stochastic Variation
• The main areas of stochastic
variation are
• the hair
• silhouettes
• parts of the background
→ The noise doesn’t affect
global aspects!
21
StyleGAN: Water Droplet-like Artifacts
22
Advances of StyleGAN: StyleGAN2
23
Advances of StyleGAN: StyleGAN2
[Figures: phase artifacts in StyleGAN; examples of unnatural images w/ StyleGAN]
• StyleGAN (left) has a texture sticking problem due to
progressive growing
• The image at each scale is generated by the
corresponding generator stage, independently
• StyleGAN2 adopts a ResNet-style architecture to solve the problem
• Note: not perfectly solved – this is discussed in StyleGAN3
[Figure: examples of natural images w/ StyleGAN2]
24
Advances of StyleGAN: StyleGAN3
• StyleGAN2 (left) has not perfectly solved the texture sticking problem
• (left) Averaging images generated w/ small changes of the latent should blur the central image
• (left) But in StyleGAN2 the details stick to the same pixel coordinates
25
Advances of StyleGAN: StyleGAN3
26
Applications of StyleGAN: Image Domain
• InterFaceGAN
• Extract linear editing directions through attribute-level supervision
• StyleFlow
• First to present edits that remain stable when composed
• Normalizing flows and attribute-level supervision
• DyStyle
• Addresses compositional editing directly
• Accurate, elaborate, and diverse editing
• StyleCLIP
• Free-form text-driven editing w/ a pretrained vision-language model
• Pose with Style
• Human pose supervision to edit body poses and clothing
• StyleMapGAN
• Localized editing by augmenting StyleGAN’s architecture
w/ spatially adaptive modulation
27
Applications of StyleGAN: StyleMapGAN
28
Applications of StyleGAN: StyleMapGAN
• Localized editing by augmenting StyleGAN’s
architecture w/ spatially adaptive modulation
• Localized editing conducted with
29
Music Generation: Recent Works w/ and w/o GANs
• Style-Conditioned Music Generation
• Style transfer-like methods in music generation w/ LSTM-based GANs
• Builds a style codebook that determines the overall style of the music
• Symbolic Music Generation with Transformer-GANs
• Compound Word Transformer: Learning to Compose Full-Song Music over
Dynamic Directed Hypergraphs
• Transformer-based music generation model
30
Music Generation: Why is StyleGAN hard to utilize?
• The main “scale-specific controllability” of StyleGAN comes from the
stacked CNN layers of various sizes
→ To utilize StyleGAN, the output should be separable by scale
• A music composition should be extendable indefinitely
→ cannot utilize CNNs in the temporal dimension
• Separation of the musical components (e.g., Motive – Phrase – Period)
→ Hard to model like CNN scales (an intuitive separation of the components is hard)
→ Each of them shares the overall flow → separation causes incoherent music
• Separation of the MIDI components (e.g., bar – beat - …)
→ Using CNNs to combine them can cause information loss
(e.g., structured tokens)
• Too many additional features to consider
• StyleGAN and image processing → no other input or consideration except the image
• Music has a bunch of extra features, like the instrument
31
Conclusion
• The “scale-specific controllability” of StyleGAN comes from the stacked
CNN layers of various sizes
→ To utilize StyleGAN, the target domain output should be separable
(e.g., 4x4 → 8x8 → 16x16 → 32x32 → … → 1024x1024)
• It may be usable for GUI design, but it will be more like “conditioned image
synthesis regardless of the structure”
• If we utilize StyleGAN in GUI design, we should be ready to defend…
• why do we ignore the structures?
• Isn’t this design a combination of existing designs?
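The resolution chain in the first bullet can be enumerated to show what "separable" means in practice. The sketch below assumes two style inputs per resolution (as in the paper's 1024×1024 generator) and uses the coarse / middle / fine boundaries from the style mixing slides; `scale_group` is an illustrative helper.

```python
def scale_group(resolution):
    """Coarse: 4-8, middle: 16-32, fine: 64-1024 (boundaries from the style mixing slides)."""
    if resolution <= 8:
        return "coarse"
    if resolution <= 32:
        return "middle"
    return "fine"

# 4x4 -> 8x8 -> ... -> 1024x1024: each stage doubles the resolution.
resolutions = [4 * 2 ** i for i in range(9)]

# Two style inputs per resolution (assumed), each tagged with its scale group.
style_layers = [(res, scale_group(res)) for res in resolutions for _ in range(2)]

print(resolutions)
print(len(style_layers))
for group in ("coarse", "middle", "fine"):
    print(group, sum(1 for _, g in style_layers if g == group))
```

A target domain without such a natural coarse-to-fine hierarchy has no obvious counterpart to these groups, which is the obstacle raised for music and for GUI design.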
Thank you


Editor's Notes

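  • #9 Batch Norm → the most basic normalization technique; it assumes that the batch mean / variance represent the whole dataset, and it behaves differently at training and inference time. Layer Norm → robust to the input scale and to scaling / shifting of the weights. Instance Norm → mean / std normalization per channel and per sample; it can be used identically at inference time and can normalize properties such as contrast. Group Norm → presented in 2018 by Yuxin Wu and Kaiming He; a compromise between Layer Norm and Instance Norm.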
  • #17 Brings high-level aspects (i.e., pose, general hair style, face shape, and eyeglasses)
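  • #22 A non-smooth mapping caused by normalization. It appears from 64×64 images onward and occurs in all feature maps → because AdaIN ends up destroying the relationships between channels. Another reason: the normalization depends on the input, so if the input contains a very large spike, fine adjustments elsewhere become easier.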
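  • #23 If bias and noise are applied before normalization, their relative influence becomes inversely proportional to the style magnitude → noise and bias become correlated with the style → therefore noise and bias are applied after normalization. The mean subtraction is removed, and so is the data-dependent normalization. AdaIN is replaced by norm / denorm of the weights (this is straightforward given the properties of a convolution without bias).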
  • #26 EMA is added to work with average values. Filtering is performed through upsampling. The filter in (a) is used for upsampling → assumes a continuous, infinite spatial domain.
  • #31 Note: scale-specific styling is only possible when we can separate each network output gradually
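Note #23 describes replacing AdaIN with a normalization applied to the convolution weights. A minimal NumPy sketch of that idea follows, assuming weights of shape (C_out, C_in, K, K) and one style scale per input channel; the function name `modulated_conv_weights` is illustrative, and the convolution itself is omitted.

```python
import numpy as np

def modulated_conv_weights(weight, style, eps=1e-8):
    """weight: (C_out, C_in, K, K), style: (N, C_in) -> per-sample weights (N, C_out, C_in, K, K)."""
    # Modulate: scale each input channel of the filter by its style value.
    w = weight[None] * style[:, None, :, None, None]
    # Demodulate: rescale each output filter to unit L2 norm, which stands in
    # for normalizing the output feature map without looking at the data.
    norm = np.sqrt((w ** 2).sum(axis=(2, 3, 4), keepdims=True) + eps)
    return w / norm

rng = np.random.default_rng(0)
weight = rng.normal(size=(5, 3, 3, 3))
style = rng.normal(size=(2, 3))
w = modulated_conv_weights(weight, style)

filter_norms = np.sqrt((w ** 2).sum(axis=(2, 3, 4)))
print(w.shape, np.allclose(filter_norms, 1.0))
```

The statistics here come from the weights and the style alone, not from the actual feature maps, which matches the note's point that the data-dependent normalization is removed.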