Few-Shot Adversarial Learning of
Realistic Neural Talking Head Models
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky
Samsung AI Center, Moscow; Skolkovo Institute of Science and Technology
Presented by Taesu Kim
May 26, 2019
Learning talking heads from few examples
Related work in image synthesis
PR12-074: ObamaNet
https://youtu.be/A1o6SUsWd98
Synthesizing Obama: Learning Lip Sync from Audio, SIGGRAPH 2017
Face2Face: Real-time Face Capture and Reenactment of RGB Videos, CVPR 2016
PR12-104: Video-to-video synthesis
https://youtu.be/WxeeqxqnRyE
Image-to-Image Translation with Conditional Adversarial Networks, CVPR 2017
Related work in meta learning
› Meta learning: learning to learn
PR12-094 Model-Agnostic Meta-Learning
https://youtu.be/fxJXXKZb-ik
Related work in speech synthesis
S. O. Arik et al., "Neural Voice Cloning with a Few Samples," arXiv, Feb 14, 2018
Y. Lee, "Voice imitation based on speaker adaptive multi-speaker speech synthesis model," MS Thesis, KAIST, Dec 13, 2017
Y. Jia et al., "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis," arXiv, Jun 12, 2018
Y. Lee, T. Kim, S.Y. Lee, "Voice Imitating Text-to-Speech Neural Networks," arXiv, Jun 4, 2018
[Diagram labels: Speech Generator, Speaker Embedder; Text, Speech, Speech; Image, Image, Landmark Image, Image]
Architecture and notation
Meta learning stage
› Let $x_i$ denote the i-th video sequence and $x_i(t)$ its t-th frame
› For K-shot learning
– Draw K random frames from the i-th video sequence
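› Generator loss
– $\mathcal{L}_{CNT}$ measures the distance between the ground truth image and the reconstruction using
perceptual similarity measures from VGG19 and VGGFace
– $\mathcal{L}_{ADV}$ corresponds to the realism score computed by the discriminator
– $\mathcal{L}_{FM}$ (feature matching) measures perceptual similarity of the discriminator outputs
– $\mathcal{L}_{MCH}$ is the embedding matching term: the L1 distance between the embedder output and the discriminator's embedding for the same video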
› Discriminator loss
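The sampling step and the loss terms above can be illustrated with a minimal sketch. It assumes the hinge form of the discriminator loss and the L1 form of the embedding matching term described in the original paper; the slide's own formulas are not reproduced here. All function names and toy numbers are hypothetical, and drawing the K frames distinct from the target frame is an assumption of this sketch.

```python
import random


def sample_episode(num_frames, k, rng):
    """Pick one target frame index and K other frame indices from one video.

    Assumption: the K frames are distinct from each other and from the target.
    """
    indices = rng.sample(range(num_frames), k + 1)
    return indices[0], indices[1:]


def average_embedding(embeddings):
    """Average K per-frame embedding vectors into one video-level embedding."""
    k = len(embeddings)
    return [sum(column) / k for column in zip(*embeddings)]


def discriminator_hinge_loss(score_real, score_fake):
    """Hinge loss: push real scores above +1 and fake scores below -1."""
    return max(0.0, 1.0 - score_real) + max(0.0, 1.0 + score_fake)


def generator_adversarial_term(score_fake):
    """The generator is rewarded for a high realism score on its output."""
    return -score_fake


def embedding_match_loss(e_hat, w_i):
    """L1 distance between the embedder output and the discriminator's
    embedding for the same video."""
    return sum(abs(a - b) for a, b in zip(e_hat, w_i))


# Toy usage with made-up numbers.
rng = random.Random(0)
target, support = sample_episode(num_frames=100, k=8, rng=rng)
e_hat = average_embedding([[1.0, 2.0], [3.0, 4.0]])
d_loss = discriminator_hinge_loss(score_real=0.0, score_fake=0.0)
```

The scores passed to the loss functions stand in for discriminator outputs; in a real training loop they would come from the networks rather than from constants.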
Few-shot learning by fine-tuning
Estimate the embedding for the new talking head sequence
[Diagram: Generator and Discriminator, each with an "Initialize" step and its own loss; parameters are split into person-specific parameters and person-generic parameters]
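A minimal sketch of the initialization step, assuming the scheme from the original paper: the new embedding is the average of the embedder outputs on the few available frames, the person-specific generator parameters are obtained by projecting that embedding with a meta-learned matrix, and the discriminator's person-specific vector starts as the sum of its generic vector and the new embedding. The names `P`, `w0`, and all helper functions are illustrative, not taken from the slides.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_fine_tuning(frame_embeddings, P, w0):
    """Return starting values for fine-tuning on a new person.

    frame_embeddings: embedder outputs for the few frames of the new person
    P:  meta-learned projection matrix (person-generic, assumed given)
    w0: meta-learned discriminator vector (person-generic, assumed given)
    """
    e_new = mean_vector(frame_embeddings)
    psi_init = matvec(P, e_new)                   # person-specific generator parameters
    w_init = [a + b for a, b in zip(w0, e_new)]   # person-specific discriminator vector
    return e_new, psi_init, w_init


# Toy usage with made-up numbers.
e_new, psi_init, w_init = init_fine_tuning(
    frame_embeddings=[[1.0, 0.0], [3.0, 2.0]],
    P=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    w0=[0.5, -1.0],
)
```

After this initialization, both the person-specific and the person-generic parameters would be updated with the generator and discriminator losses on the few available frames.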
Implementation details
› Base the generator on the architecture proposed in
– Perceptual Losses for Real-Time Style Transfer and Super-Resolution:
https://arxiv.org/abs/1603.08155
› Replace downsampling and upsampling layers with residual blocks similar to those in
– Large Scale GAN Training for High Fidelity Natural Image Synthesis: https://arxiv.org/abs/1809.11096
› Replace batch normalization with instance normalization
– Instance Normalization: The Missing Ingredient for Fast Stylization: https://arxiv.org/abs/1607.08022
› Use similar networks for both the embedder and the convolutional part of the discriminator
– Same as the generator but without normalization layers
– The discriminator has an additional residual block at the end, which operates at 4×4 spatial resolution
– To obtain the vectorized outputs in both networks, perform sum pooling over the spatial dimensions
followed by ReLU
› Use spectral normalization for all convolution and fully connected layers
– Spectral Normalization for Generative Adversarial Networks: https://arxiv.org/abs/1802.05957
› Use self-attention blocks
– Self-attention generative adversarial networks: https://arxiv.org/abs/1805.08318
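Two of the building blocks listed above, instance normalization and sum pooling followed by ReLU, can be sketched in plain Python. This is only an illustration of the operations on a single sample stored as nested lists (channels × height × width); the function names and the `eps` value are assumptions, and a real implementation would use a deep learning framework and learned affine parameters.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize each channel of one sample to zero mean and unit variance."""
    normalized = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * scale for v in row] for row in channel])
    return normalized


def sum_pool_relu(feature_map):
    """Sum each channel over its spatial dimensions, then apply ReLU.

    Returns one non-negative value per channel, i.e. a vector.
    """
    return [max(0.0, sum(sum(row) for row in channel)) for channel in feature_map]


# Toy usage: two channels of size 2x2.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, -2.0], [0.5, 0.5]],
]
vector = sum_pool_relu(features)
normalized = instance_norm(features)
```

Sum pooling collapses the spatial grid regardless of its size, which is how a convolutional feature map becomes a fixed-length vector.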
Experiments
› Use VoxCeleb1 to compare results with other methods
› VoxCeleb2 has about 10 times more videos than VoxCeleb1
› Metrics
– FID (Fréchet inception distance): perceptual similarity
– SSIM (structural similarity): low-level similarity
– CSIM (cosine similarity of face recognition embeddings): person identity mismatch
– USER: user survey to find the fake among triplets
› Method
– FF: no fine-tuning on the embedding
– FT: fine-tune all
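As an illustration of the CSIM metric, the sketch below computes the cosine similarity between two embedding vectors. The vectors here are made-up numbers; in the actual evaluation they would come from a face recognition network, which is not shown. The function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy usage: stand-ins for the embeddings of a real and a generated face.
real_embedding = [1.0, 2.0, 0.5]
generated_embedding = [0.9, 2.1, 0.4]
csim = cosine_similarity(real_embedding, generated_embedding)
```

A value close to 1 means the two embeddings point in nearly the same direction, which is read as the identity being preserved.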
Example result: VoxCeleb1
Example results: VoxCeleb2
More examples
Conclusion
› Presents a framework for meta-learning of adversarial generative models
› Only a handful of photographs are needed to create a new model
– As few as one
– With 32 images, the model achieves perfect realism and personalization scores
› Limitations
– Limited mimics representation: the current set of landmarks does not represent gaze
– Lack of landmark adaptation: using landmarks from a different person leads to a noticeable personality mismatch
Contact us: contact@neosapience.com
For more information: http://www.neosapience.com

PR12-165 Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
