
Audio and Vision (D4L6 2017 UPC Deep Learning for Computer Vision)


https://telecombcn-dl.github.io/2017-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.



  1. [course site] Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Universitat Politècnica de Catalunya (Technical University of Catalonia) Audio and Vision Day 4 Lecture 6 #DLUPC
  2. Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  3. Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  4. Transfer Learning Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016. Audio features help learn a semantically meaningful visual representation.
  5. Transfer Learning Images are grouped into audio clusters, described by the mean & standard deviation of the frequency channels. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
  6. Transfer Learning Task: predict the audio cluster for a given image (classification); see the clustering sketch after the slide list. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
  7. Transfer Learning Although the model was not trained with class labels, units with semantic meaning emerge. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
  8. Transfer Learning: SoundNet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016. Pretrained visual ConvNets supervise the training of a model for sound representation; see the distillation sketch after the slide list.
  9. Transfer Learning: SoundNet Videos used for training are unlabeled; the method relies on ConvNets trained on labeled images. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  10. Transfer Learning: SoundNet Recognizing objects and scenes from sound. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  11. Transfer Learning: SoundNet Recognizing objects and scenes from sound. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  12. Transfer Learning: SoundNet Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  13. Transfer Learning: SoundNet Visualization of the 1D filters over raw audio in conv1. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  14. Transfer Learning: SoundNet Visualization of the 1D filters over raw audio in conv1. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  15. Transfer Learning: SoundNet Visualizing the samples that most activate a neuron in a late layer (conv7). Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  16. Transfer Learning: SoundNet Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7). Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  17. Transfer Learning: SoundNet Hearing the sounds that most activate a neuron in the sound network (conv7). Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  18. Transfer Learning: SoundNet Hearing the sounds that most activate a neuron in the sound network (conv5). Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "SoundNet: Learning sound representations from unlabeled video." NIPS 2016.
  19. Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  20. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016. Learn to synthesize sounds from videos of people hitting objects with a drumstick. Are the features learned when training for sound prediction useful to identify the object's material?
  21. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016. The Greatest Hits Dataset.
  22. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016. The Greatest Hits Dataset.
  23. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016. ● Map video sequences to sequences of sound features ● Sound features in the output are converted into a waveform by example-based synthesis (i.e. nearest neighbor retrieval); see the retrieval sketch after the slide list.
  24. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016.
  25. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016.
  26. Sonorization Owens et al. "Visually indicated sounds." CVPR 2016.
  27. Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  28. Speech Generation from Video Ephrat et al. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017. Pipeline: frame from a silent video → CNN (VGG) → audio feature → post-hoc synthesis.
  29. Speech Generation from Video Ephrat et al. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017. Waveform processing: downsample to 8 kHz → split into 40 ms chunks (50% overlap) → LPC analysis → 9-dim LPC features; see the LPC sketch after the slide list.
  30. Speech Generation from Video Ephrat et al. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017. Waveform processing (2 LSP features / frame).
  31. Speech Generation from Video The GRID Dataset. Cooke et al. "An audio-visual corpus for speech perception and automatic speech recognition." Journal of the Acoustical Society of America, 2006.
  32. Speech Generation from Video Ephrat et al. "Vid2speech: Speech Reconstruction from Silent Video." ICASSP 2017.
  33. Audio & Vision ● Transfer Learning ● Sonorization ● Speech generation ● Lip Reading
  34. Lip Reading: LipNet Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." 2016.
  35. Lip Reading: LipNet Input (frames) and output (sentence) sequences are not aligned. Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." 2016.
  36. Lip Reading: LipNet CTC loss (connectionist temporal classification); see the CTC sketch after the slide list. Graves et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. ● Avoids the need for alignment between the input and output sequences by predicting an additional blank token "_" ● Before computing the loss, repeated tokens are merged and blank tokens are removed ● "a _ a b _" == "a a _ a b b" == "a a b"
  37. Lip Reading: LipNet Assael et al. "LipNet: Sentence-level Lipreading." arXiv, Nov 2016.
  38. Lip Reading: Watch, Listen, Attend & Spell Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  39. Lip Reading: Watch, Listen, Attend & Spell Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  40. Lip Reading: Watch, Listen, Attend & Spell Audio features / Image features. Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  41. Lip Reading: Watch, Listen, Attend & Spell Attention over the output states from audio and video is computed at each timestep; see the attention sketch after the slide list. Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  42. Lip Reading: Watch, Listen, Attend & Spell Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  43. Lip Reading: Watch, Listen, Attend & Spell Comparison with LipNet. Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  44. Lip Reading: Watch, Listen, Attend & Spell Chung et al. "Lip reading sentences in the wild." arXiv, Nov 2016.
  45. Summary ● Transfer learning ● Sonorization ● Speech generation ● Lip Reading
  46. Questions?
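
The clustering sketch referenced at slide 6: a minimal Python illustration of the self-supervision recipe in Owens et al. (ECCV 2016), where each training video's audio is summarized by the mean and standard deviation of its spectrogram frequency channels, those statistics are clustered with k-means, and the cluster index becomes a pseudo-label for an ordinary image classifier. The spectrogram settings and the number of clusters are illustrative assumptions, not the paper's exact values.

import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import KMeans

def audio_statistics(waveform, sr):
    # Mean and standard deviation of each frequency channel (slide 5).
    _, _, sxx = spectrogram(waveform, fs=sr, nperseg=512)  # (freq_bins, time)
    return np.concatenate([sxx.mean(axis=1), sxx.std(axis=1)])

def make_pseudo_labels(waveforms, sr, n_clusters=30):
    # Cluster the audio statistics; each video's cluster index is the target
    # class for the image ConvNet (slide 6). n_clusters=30 is an assumption.
    feats = np.stack([audio_statistics(w, sr) for w in waveforms])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    return labels  # train the visual CNN with cross-entropy against these labels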
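
The distillation sketch referenced at slide 8: a toy PyTorch version of the SoundNet idea (Aytar et al., NIPS 2016), where a 1D ConvNet over raw waveforms is trained to match the class posteriors produced by pretrained visual ConvNets on the paired video frames. The real SoundNet has eight convolutional layers and two visual teachers (objects and scenes); the layer sizes below are assumptions kept small for brevity.

import torch.nn as nn
import torch.nn.functional as F

class TinySoundNet(nn.Module):
    # Toy 1D ConvNet over raw audio; the real SoundNet is much deeper.
    def __init__(self, n_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2), nn.ReLU(), nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, wave):                   # wave: (batch, 1, samples) raw audio
        h = self.features(wave).squeeze(-1)    # (batch, 32)
        return self.head(h)                    # logits over the teacher's classes

def distillation_loss(student_logits, teacher_probs):
    # KL divergence between the sound network's prediction and the posterior of
    # the pretrained visual ConvNet on the paired video frames (slide 9).
    return F.kl_div(F.log_softmax(student_logits, dim=1), teacher_probs,
                    reduction="batchmean")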
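
The retrieval sketch referenced at slide 23: example-based synthesis in Owens et al. (CVPR 2016) turns the predicted sound features back into a waveform by copying the waveform of the closest training example. The L2 metric and the naive concatenation below are assumptions for illustration only.

import numpy as np

def synthesize_by_retrieval(predicted_feats, train_feats, train_waves):
    # predicted_feats: (T, D) features predicted by the network;
    # train_feats: (N, D); train_waves: list of N waveform snippets.
    out = []
    for f in predicted_feats:
        dists = np.linalg.norm(train_feats - f, axis=1)    # L2 distance (assumption)
        out.append(train_waves[int(np.argmin(dists))])     # nearest neighbor's sound
    return np.concatenate(out)                             # naive concatenation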
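
The LPC sketch referenced at slide 29: the Vid2speech regression targets come from downsampling the audio to 8 kHz, splitting it into 40 ms windows with 50% overlap and running LPC analysis to obtain a 9-dimensional vector per window. The autocorrelation-method LPC, the Hann window and the other numerical details below are assumptions rather than the paper's exact processing.

import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import resample_poly

def lpc_coeffs(frame, order=9):
    # Autocorrelation method: solve the normal equations R a = r.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = toeplitz(r[:order])
    return np.linalg.solve(R, r[1:order + 1])

def lpc_features(wave, sr, target_sr=8000, win_ms=40, hop_ms=20, order=9):
    wave = resample_poly(wave, target_sr, sr)      # downsample to 8 kHz
    win = int(target_sr * win_ms / 1000)           # 320 samples per 40 ms chunk
    hop = int(target_sr * hop_ms / 1000)           # 50% overlap
    feats = []
    for start in range(0, len(wave) - win + 1, hop):
        frame = wave[start:start + win] * np.hanning(win)
        feats.append(lpc_coeffs(frame, order))
    return np.stack(feats)                         # (num_frames, 9) CNN targets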
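
The CTC sketch referenced at slide 36: the many-to-one collapse rule first merges consecutive repeats and then removes the blank token, so several frame-level label sequences map to the same sentence. For training, deep learning frameworks expose the corresponding loss directly (e.g. torch.nn.CTCLoss).

import itertools

BLANK = "_"

def ctc_collapse(tokens):
    # Merge consecutive repeats, then drop blanks: the mapping CTC sums over.
    merged = [t for t, _ in itertools.groupby(tokens)]
    return [t for t in merged if t != BLANK]

# Both frame-level sequences from slide 36 collapse to the same output "a a b".
assert ctc_collapse(list("a_ab_")) == list("aab")
assert ctc_collapse(list("aa_abb")) == list("aab")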
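
The attention sketch referenced at slide 41: in Watch, Listen, Attend & Spell the decoder computes a separate attention distribution over the audio encoder states and over the video encoder states at every output step, and both context vectors feed the character prediction. The dot-product scoring below is a simplification of the paper's learned attention mechanism; the tensor shapes and the way the contexts are combined are assumptions.

import torch
import torch.nn.functional as F

def dual_attention_step(dec_state, audio_mem, video_mem):
    # dec_state: (B, D) decoder state; audio_mem: (B, Ta, D); video_mem: (B, Tv, D).
    def attend(mem):
        scores = torch.bmm(mem, dec_state.unsqueeze(2)).squeeze(2)   # (B, T)
        weights = F.softmax(scores, dim=1)                           # attention weights
        return torch.bmm(weights.unsqueeze(1), mem).squeeze(1)       # (B, D) context
    ctx_audio = attend(audio_mem)
    ctx_video = attend(video_mem)
    # Both modality contexts are concatenated with the decoder state to predict
    # the next character.
    return torch.cat([dec_state, ctx_audio, ctx_video], dim=1)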
