
Speech Conditioned Face Generation with Deep Adversarial Networks

Master's thesis by Francisco Roldán, Master in Computer Vision Barcelona, 2018

Image synthesis has been a trending task in the AI community in recent years. Many works have shown the potential of Generative Adversarial Networks (GANs) for tasks such as text-to-image or audio-to-image synthesis. In particular, recent advances in deep learning with audio have inspired many works involving both visual and auditory information. In this work we propose a face synthesis method that uses audio and/or language representations as input. Furthermore, we have built a dataset that relates speech utterances to a face and an identity, suitable not only for face synthesis but also for tasks such as speaker recognition or voice conversion.

https://github.com/imatge-upc/speech2face

Published in: Data & Analytics


  1. 1. Speech-Conditioned Face Generation with Deep Adversarial Networks. Slides by Francisco Roldán. MSc Thesis, UPC, 13th July 2018. Author: Francisco Roldán Sánchez. Advisors: Xavier Giró-i-Nieto, Kevin McGuinness, Santiago Pascual de la Puente, Amaia Salvador Aguilera.
  2. 2. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 2
  4. 4. Problem Definition. Diagram: a speech signal is mapped by a deep learning model to a face.
  5. 5. Why Multimodal Learning? 5
  6. 6. Why Speech-Conditioned Face Synthesis? ● Cross-modal task. ● Models need to tackle several sub-tasks at once: speech processing, knowledge representation, and computer vision.
  7. 7. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 7
  9. 9. Generative Models. Slide credit: Santiago Pascual.
  10. 10. Generative Models. Slide credit: Santiago Pascual.
  11. 11. Generative Adversarial Networks (GAN). Goodfellow, Ian, et al. "Generative adversarial nets." NIPS 2014. Other approaches that modify the loss function have been proposed in recent years: Wasserstein GAN, Least-Squares GAN, ...
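The loss variants named on this slide change only the criterion the discriminator optimizes. A small NumPy comparison of the original cross-entropy discriminator loss with the Least-Squares GAN loss (the scores below are made-up probabilities, purely illustrative):

```python
import numpy as np

# Hypothetical discriminator outputs for a batch of real and generated samples.
d_real = np.array([0.9, 0.8, 0.95])   # D(x): probabilities assigned to real images
d_fake = np.array([0.1, 0.3, 0.2])    # D(G(z)): probabilities assigned to fakes

# Original GAN discriminator loss: binary cross-entropy, pushing D(x) -> 1
# and D(G(z)) -> 0.
bce_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

# Least-Squares GAN replaces the log terms with quadratic penalties,
# which give non-saturating gradients for badly fooled samples.
ls_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
```

The quadratic loss penalizes samples in proportion to how far their score sits from the target, which is one reason LS-GAN tends to train more stably than the saturating log loss.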
  12. 12. Conditioned GANs. Mirza, Mehdi, and Simon Osindero. "Conditional generative adversarial nets." arXiv preprint (2014).
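The conditioning mechanism amounts to feeding the label into the generator alongside the noise vector. A minimal NumPy sketch of concatenation-based conditioning (the dimensions and the identity-embedding table are illustrative choices, not the thesis settings):

```python
import numpy as np

rng = np.random.default_rng(0)

num_ids, noise_dim, embed_dim = 10, 100, 16
# One learned vector per identity; here random stand-ins for trained embeddings.
embedding = rng.normal(size=(num_ids, embed_dim))

def conditioned_input(identity, batch=4):
    """Build the generator input by concatenating the noise vector z with
    the identity embedding y, as in Mirza & Osindero's conditional GAN."""
    z = rng.normal(size=(batch, noise_dim))
    y = np.repeat(embedding[identity][None, :], batch, axis=0)
    return np.concatenate([z, y], axis=1)   # shape: (batch, noise_dim + embed_dim)
```

The discriminator receives the same condition, so it can reject images that are realistic but belong to the wrong identity.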
  13. 13. Text-to-Image Synthesis. Reed, Scott, et al. "Generative adversarial text to image synthesis." ICML 2016.
  14. 14. Speech-to-Frame Synthesis. Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?" BMVC 2017.
  15. 15. How to evaluate results? 1. Qualitative visual inspection. 2. Quantitative metric: the Inception Score (IS), still an open research line. Salimans, T., et al. "Improved Techniques for Training GANs." NIPS 2016.
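The Inception Score rewards samples that a classifier finds confidently classifiable while the generated set as a whole covers many classes. A NumPy sketch of the formula (in the real metric, `probs` comes from a pretrained Inception network run over generated images):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ).

    probs: (N, C) array of class probabilities, one row per generated image.
    High when each row is peaked (confident) and the column-wise marginal
    p(y) is spread out (diverse)."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

As a sanity check: if every sample gets the uniform distribution the score is 1 (its minimum), and with C classes predicted confidently and evenly it approaches C (its maximum), which is why the scores around 2-3 reported later in these slides are meaningful on a dataset of a few dozen identities.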
  16. 16. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 16
  18. 18. YouTubers Dataset ● Clean Speech ● Wide range of emotions 18
  19. 19. Dataset Summary 19
  20. 20. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 20
  22. 22. Id2Face 22
  23. 23. Wasserstein GAN outputs 23 Unrealistic Images!
  24. 24. Least Squares GAN outputs 24 IS: 2.91 IS: 2.05
  25. 25. Least Squares GAN outputs. Figure: projection vs. concatenation conditioning across repeated runs (1st, 2nd, 3rd, ..., Nth), showing mode collapse.
  26. 26. Improving our results ● Use of spectral normalization instead of batch normalization. ● Modification of the learning rate values. ● Addition of noise to the conditioning embedded vector. ● Addition of controlled dropout in G. 26
  27. 27. Improving Image Quality Removing batch normalization and adding spectral normalization in G: 27
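Spectral normalization divides each weight matrix by its largest singular value, constraining the layer's Lipschitz constant to roughly 1. A toy dependency-free sketch using power iteration (illustrative only; a deep learning framework's built-in spectral-norm layer would be used in practice):

```python
import numpy as np

def spectral_normalize(w, n_iter=20):
    """Return w divided by an estimate of its largest singular value,
    obtained by power iteration on w @ w.T (the standard trick used by
    spectral-norm layers, shown here on a plain matrix)."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v          # estimated top singular value
    return w / sigma
```

After normalization the matrix's top singular value is ~1, which bounds how much any layer can amplify its input and stabilizes the adversarial game.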
  28. 28. Improving Image Quality Two Time-scale Update Rule (TTUR): Setting D learning rate to 0.0004 and G to 0.0001 28 Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NIPS (2017)
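TTUR is simply two optimizers with different step sizes, using the exact 0.0004 / 0.0001 split from the slide. A dependency-free sketch of the update structure (plain SGD stands in for the Adam optimizer GAN setups typically use; parameters and gradients are dummy values):

```python
import numpy as np

# Toy parameter vectors standing in for the weights of D and G.
d_params = np.zeros(3)
g_params = np.zeros(3)

# TTUR (Heusel et al.): the discriminator steps 4x faster than the generator.
LR_D, LR_G = 4e-4, 1e-4

def sgd_step(params, grad, lr):
    """One plain gradient-descent step; Adam would be used in practice,
    SGD keeps the sketch dependency-free."""
    return params - lr * grad

# One dummy update each, with made-up unit gradients.
d_params = sgd_step(d_params, np.ones(3), LR_D)
g_params = sgd_step(g_params, np.ones(3), LR_G)
```

Letting D move faster than G lets the discriminator stay near its optimum between generator updates, which is the condition under which Heusel et al. prove convergence to a local Nash equilibrium.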
  29. 29. Addressing mode collapse: adding noise to the one-hot identity vector. Figure: outputs without noise.
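Perturbing the one-hot condition means the generator never sees exactly the same conditioning vector twice, which discourages it from collapsing to one output per identity. A sketch in NumPy (the identity count and noise scale below are illustrative guesses, not the thesis settings):

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_one_hot(identity, num_ids=62, sigma=0.1):
    """One-hot encode the identity, then add small Gaussian noise to every
    component. num_ids=62 and sigma=0.1 are hypothetical values chosen
    only for this sketch."""
    y = np.zeros(num_ids)
    y[identity] = 1.0
    return y + rng.normal(scale=sigma, size=num_ids)
```

With a small sigma the argmax still recovers the identity, so the condition stays informative while gaining the variability that fights mode collapse.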
  30. 30. Addressing mode collapse: adding controlled dropout in the generator. Figure: runs 1 to 4 without dropout vs. with dropout (IS: 3.00, IS: 2.63). Solved!
  31. 31. Id2Face Auxiliary Classifier GAN: 31
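An AC-GAN discriminator adds a classification head next to the real/fake head, and its loss adds an auxiliary cross-entropy over the identity labels (Odena et al.). A NumPy sketch of the discriminator-side loss (the LS-GAN real/fake term and all shapes are illustrative choices, not the exact thesis losses):

```python
import numpy as np

def ac_gan_d_loss(d_real, d_fake, class_logits, labels):
    """Adversarial term (LS-GAN style, as used elsewhere in this work)
    plus an auxiliary softmax cross-entropy on the class head."""
    # Real/fake term: pull real scores toward 1, fake scores toward 0.
    adv = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    # Numerically stable softmax cross-entropy over identity classes.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cls = -np.mean(log_p[np.arange(len(labels)), labels])
    return adv + cls
```

The generator gets the mirror-image objective, so both networks are rewarded for images that carry recognizable identity information, not just realism.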
  32. 32. AC-GAN vs LS-GAN. Results without the mode-collapse fixes described above: IS: 2.12, IS: 2.05. 32
  33. 33. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 33
  35. 35. Speech2Face 35
  36. 36. Speech2Face outputs 36
  37. 37. Speech2Face AC-GAN. Not included in the thesis; new results from this week. 37
  38. 38. Speech2Face AC-GAN outputs. AC-GAN trained with just two male speakers. Figure: train and test outputs for Male 1 and Male 2.
  39. 39. Speech2Face AC-GAN outputs. AC-GAN trained with just two speakers, a female and a male. Figure: train and test outputs for the female and male speakers. 39
  40. 40. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 40
  42. 42. YouTubers Dataset ● Clean Speech ● Wide range of emotions 42
  43. 43. End-to-end Speech2Face 43
  44. 44. Roadmap 1. Introduction 2. Related Work 3. Dataset 4. Id2Face 5. Speech2Face 6. Conclusion 7. Future Work 44
  46. 46. Future Work ● Speech2Faces → Generating video of a talking face just from raw speech utterances. ● Face-Conditioned Voice Conversion ● Latent space disentanglement 46
  47. 47. https://github.com/imatge-upc/speech2face
