
Audio and Vision (D2L9 Insight@DCU Machine Learning Workshop 2017)


https://telecombcn-dl.github.io/dlmm-2017-dcu/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which had been addressed until now with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.


  1. Audio & Vision (D2 L9). Amaia Salvador (amaia.salvador@upc.edu), PhD Candidate, Universitat Politècnica de Catalunya. Deep Learning Workshop, Dublin City University, 27-28 April 2017.
  2. Multimedia: Text, Audio, Vision.
  3. Audio & Vision: ● Transfer Learning ● Sonorization ● Speech Generation ● Lip Reading
  4. Audio & Vision: ● Transfer Learning ● Sonorization ● Speech Generation ● Lip Reading
  5. Transfer Learning. Owens et al., Ambient Sound Provides Supervision for Visual Learning, ECCV 2016. Audio features help learn a semantically meaningful visual representation.
  6. Transfer Learning. Owens et al., ECCV 2016. Images are grouped into audio clusters; the task is to predict the audio cluster for a given image (classification), as sketched below.
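
A minimal sketch of this pseudo-labelling idea (not the authors' exact pipeline: the choice of audio representation, the number of clusters, and the image CNN are all assumptions here): cluster per-clip audio statistics with k-means and train an ordinary image classifier on the cluster indices, so the audio acts as free supervision for the visual network.

```python
# Minimal sketch: k-means over audio statistics produces pseudo-labels,
# which an image CNN is then trained to predict with cross-entropy.
import numpy as np
from sklearn.cluster import KMeans

def make_audio_pseudo_labels(audio_features, n_clusters=30):
    """audio_features: (n_clips, d) array of per-clip audio statistics."""
    km = KMeans(n_clusters=n_clusters, random_state=0).fit(audio_features)
    return km.labels_  # one pseudo-label (cluster index) per clip

# The image network is then trained as an ordinary classifier: for each
# (frame, pseudo_label) pair, minimize cross-entropy between the CNN's
# softmax over n_clusters and the k-means cluster index.
```
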
  7. Transfer Learning. Owens et al., ECCV 2016. Although the model is not trained with class labels, units with semantic meaning emerge.
  8. Transfer Learning: SoundNet. Aytar, Vondrick and Torralba, SoundNet: Learning Sound Representations from Unlabeled Video, NIPS 2016. Pretrained visual ConvNets supervise the training of a model for sound representation.
  9. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. The videos used for training are unlabeled; the approach relies on ConvNets trained on labeled images.
  10. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Training objective: the probability distribution predicted by the vision network (teacher) supervises the probability distribution predicted by the sound network (student) via the Kullback-Leibler divergence D_KL(p ‖ q) = Σ_x p(x) log(p(x) / q(x)); since the teacher distribution is fixed, minimizing it with respect to the student reduces to minimizing the cross-entropy −Σ_x p(x) log q(x). A sketch of this loss follows.
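
A sketch of that teacher-student loss, assuming PyTorch and a single categorical output (SoundNet actually distills two vision teachers, object and scene classifiers, but each term has this form):

```python
import torch
import torch.nn.functional as F

def distillation_kl(sound_logits, vision_probs, eps=1e-8):
    """KL(vision || sound): vision_probs is the teacher's softmax output,
    sound_logits are the raw outputs of the audio (student) network."""
    log_q = F.log_softmax(sound_logits, dim=1)   # student log-probabilities
    p = vision_probs.clamp(min=eps)              # teacher probabilities
    # KL(p||q) = sum p*log p - sum p*log q; the first term is constant
    # w.r.t. the student, so only the cross-entropy term matters for training.
    return (p * (p.log() - log_q)).sum(dim=1).mean()
```
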
  11. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Recognizing objects and scenes from sound.
  12. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Recognizing objects and scenes from sound (cont.).
  13. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Hidden-layer activations of SoundNet are used to train a standard SVM classifier that outperforms the state of the art (sketched below).
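
A sketch of that evaluation protocol, assuming scikit-learn; extract_soundnet_features is a hypothetical helper standing in for a frozen forward pass that returns, e.g., conv7 activations for a waveform:

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate_sound_features(train_wavs, train_labels, test_wavs, test_labels,
                            extract_soundnet_features):
    # Frozen SoundNet activations as fixed feature vectors
    X_train = np.stack([extract_soundnet_features(w) for w in train_wavs])
    X_test = np.stack([extract_soundnet_features(w) for w in test_wavs])
    clf = LinearSVC(C=1.0).fit(X_train, train_labels)
    return clf.score(X_test, test_labels)  # classification accuracy
```
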
  14. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Visualization of the 1D filters over raw audio in conv1.
  15. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Visualization of the 1D filters over raw audio in conv1 (cont.).
  16. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Visualizing the samples that most activate a neuron in a late layer (conv7).
  17. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Visualization of the video frames associated with the sounds that activate some of the last hidden units (conv7).
  18. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Hearing the sounds that most activate a neuron in the sound network (conv7).
  19. Transfer Learning: SoundNet. Aytar et al., NIPS 2016. Hearing the sounds that most activate a neuron in the sound network (conv5).
  20. Audio & Vision: ● Transfer Learning ● Sonorization ● Speech Generation ● Lip Reading
  21. Sonorization. Owens et al., Visually Indicated Sounds, CVPR 2016. Learn to synthesize sounds from videos of people hitting objects with a drumstick. Are the features learned when training for sound prediction useful for identifying the object's material?
  22. Sonorization. Owens et al., CVPR 2016. The Greatest Hits dataset.
  23. Sonorization. Owens et al., CVPR 2016. The Greatest Hits dataset (cont.).
  24. Sonorization. Owens et al., CVPR 2016. ● Map video sequences to sequences of sound features ● Sound features in the output are converted into a waveform by example-based synthesis, i.e. nearest-neighbour retrieval (see the sketch below).
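
A minimal sketch of the example-based synthesis step, assuming the predicted and training sound features live in the same vector space (the feature choice and Euclidean distance are assumptions here):

```python
import numpy as np

def retrieve_waveform(predicted_features, train_features, train_waveforms):
    """predicted_features: (d,), train_features: (n, d),
    train_waveforms: list of n impact sounds from the training videos."""
    dists = np.linalg.norm(train_features - predicted_features, axis=1)
    return train_waveforms[int(np.argmin(dists))]  # nearest-neighbour sound
```
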
  25. Sonorization. Owens et al., CVPR 2016.
  26. Sonorization. Owens et al., CVPR 2016.
  27. Sonorization. Owens et al., CVPR 2016.
  28. Audio & Vision: ● Transfer Learning ● Sonorization ● Speech Generation ● Lip Reading
  29. Speech Generation from Video. Ephrat et al., Vid2speech: Speech Reconstruction from Silent Video, ICASSP 2017. Pipeline: frame from a silent video → CNN (VGG) → audio feature → post-hoc synthesis.
  30. Speech Generation from Video. Ephrat et al., ICASSP 2017. Waveform processing: downsample to 8 kHz → split into 40 ms chunks (50% overlap) → LPC analysis → 9-dim LPC features (see the sketch below).
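
A rough sketch of that pre-processing, assuming librosa for loading and LPC; the frame length and overlap come from the slide, while using an order-9 analysis and dropping the leading coefficient is just one way to obtain 9-dimensional features (an assumption, not the paper's exact recipe):

```python
import numpy as np
import librosa

def lpc_features(wav_path, sr=8000, frame_ms=40, lpc_order=9):
    y, _ = librosa.load(wav_path, sr=sr)       # downsample to 8 kHz
    frame_len = int(sr * frame_ms / 1000)      # 40 ms -> 320 samples
    hop = frame_len // 2                       # 50% overlap
    feats = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len]
        a = librosa.lpc(frame, order=lpc_order)  # [1, a_1, ..., a_order]
        feats.append(a[1:])                      # drop the leading 1 -> 9-dim
    return np.array(feats)
```
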
  31. Speech Generation from Video. Ephrat et al., ICASSP 2017. Waveform processing (2 LSP features per frame).
  32. Speech Generation from Video. GRID dataset: Cooke et al., An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition, Journal of the Acoustical Society of America, 2006.
  33. Speech Generation from Video. Ephrat et al., ICASSP 2017.
  34. Audio & Vision: ● Transfer Learning ● Sonorization ● Speech Generation ● Lip Reading
  35. Lip Reading: LipNet. Assael et al., LipNet: Sentence-level Lipreading, arXiv Nov 2016.
  36. Lip Reading: LipNet. Assael et al., arXiv Nov 2016. Input (frames) and output (sentence) sequences are not aligned.
  37. Lip Reading: LipNet. CTC loss (connectionist temporal classification): Graves et al., Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML 2006. ● Avoids the need for alignment between the input and output sequences by letting the network predict an additional blank token "_" ● Before computing the loss, repeated tokens are merged and blank tokens are removed ● e.g. "a _ a b _" and "a a _ a b b" both collapse to "a a b" (see the collapse sketch below).
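
A minimal sketch of the collapse rule (decoding-side view; the actual CTC loss sums over all frame-level paths that collapse to the target sequence):

```python
def ctc_collapse(tokens, blank="_"):
    """Merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

assert ctc_collapse("a _ a b _".split()) == ["a", "a", "b"]
assert ctc_collapse("a a _ a b b".split()) == ["a", "a", "b"]
```
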
  38. Lip Reading: LipNet. Assael et al., arXiv Nov 2016.
  39. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., Lip Reading Sentences in the Wild, arXiv Nov 2016.
  40. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., arXiv Nov 2016.
  41. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., arXiv Nov 2016. Audio features and image features.
  42. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., arXiv Nov 2016. Attention over the output states of the audio and video encoders is computed at each decoding timestep (see the sketch below).
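
An illustrative dot-product attention sketch for this dual-attention idea (the paper's exact scoring function and decoder are not reproduced here; the shapes and the concatenation of the two context vectors are assumptions):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (B, d), encoder_states: (B, T, d) -> context (B, d)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (B, T, 1)
    weights = F.softmax(scores, dim=1)            # attention over timesteps
    return (weights * encoder_states).sum(dim=1)  # weighted sum of states

def dual_context(decoder_state, video_states, audio_states):
    # One attention per modality at each decoding step, contexts concatenated
    return torch.cat([attend(decoder_state, video_states),
                      attend(decoder_state, audio_states)], dim=1)
```
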
  43. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., arXiv Nov 2016.
  44. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., arXiv Nov 2016 (comparison with LipNet).
  45. Lip Reading: Watch, Listen, Attend & Spell. Chung et al., arXiv Nov 2016.
  46. Summary: ● Transfer Learning ● Sonorization ● Speech Generation ● Lip Reading
  47. Questions?
  48. Multimodal Learning. Ngiam et al., Multimodal Deep Learning, ICML 2011.
  49. Multimodal Learning. Ngiam et al., ICML 2011.
  50. Cross-Modality. Ngiam et al., ICML 2011. A denoising autoencoder is required to reconstruct both modalities when given only one of them, or both, as input (see the sketch below).
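
A loose sketch of that idea in modern terms (Ngiam et al. built theirs from stacked RBMs / deep autoencoders; the layer sizes and this PyTorch formulation are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BimodalAE(nn.Module):
    def __init__(self, d_audio=100, d_video=300, d_shared=128):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, d_shared), nn.ReLU())
        self.enc_v = nn.Sequential(nn.Linear(d_video, d_shared), nn.ReLU())
        self.shared = nn.Linear(2 * d_shared, d_shared)
        self.dec_a = nn.Linear(d_shared, d_audio)
        self.dec_v = nn.Linear(d_shared, d_video)

    def forward(self, audio, video):
        # During training either input may be zeroed out (denoising),
        # yet both modalities must still be reconstructed from the shared code.
        h = torch.relu(self.shared(torch.cat([self.enc_a(audio),
                                              self.enc_v(video)], dim=1)))
        return self.dec_a(h), self.dec_v(h)
```
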
  51. Cross-Modality. Ngiam et al., ICML 2011. Video classification: a linear SVM is added on top of the learned features. ● Combining audio & video as input gives a relative improvement (on AVLetters) ● Using video as input to predict both audio & video gives state-of-the-art results on both.
  52. Cross-Modality. Ngiam et al., ICML 2011. Audio classification: a linear SVM is added on top of the learned features. ● The performance of cross-modal features is below the state of the art ● Still, performance is remarkable when using video as input to classify audio.
  53. Multimodal Learning. Ngiam et al., ICML 2011.
  54. Shared Learning. Ngiam et al., ICML 2011. First, learn a shared representation using the two modalities as input.
  55. Shared Learning. Ngiam et al., ICML 2011. Then, train linear classifiers on top of the shared representations (previously trained on audio/video data) using only one of the two modalities as input; testing is performed with the other modality.
  56. Shared Learning. Ngiam et al., ICML 2011. Linear classifiers are trained on top of the shared representations using only one modality as input, and testing is performed with the other modality (cont.).
