
Deep Audio and Vision - Eva Mohedano - UPC Barcelona 2018



https://telecombcn-dl.github.io/2018-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.



  1. Eva Mohedano (eva.mohedano@insight-centre.org), Postdoctoral Researcher, Insight Centre for Data Analytics, Dublin City University. Audio and Vision, Day 4 Lecture 5. #DLUPC http://bit.ly/dlcv2018
  2. Contents ● Feature learning ● Cross-modal retrieval ● Sound source localization ● Sonorization
  3. Contents ● Feature learning ● Cross-modal retrieval ● Sound source localization ● Sonorization
  4. Self-supervised learning. Self-supervised methods learn features by training a model to solve a task derived from the input data itself, without human labeling. (Audio / Video)
  5. Visual Feature Learning (Vision / Audio / Video)
  6. Visual Feature Learning. Based on the assumption that ambient sound in video is related to the visual semantics. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
  7. Visual Feature Learning. Use videos to train a CNN that predicts the audio statistics of a frame: a self-supervised classification problem. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
  8. Visual Feature Learning. Task: use the predicted audio stats to cluster images. Audio clusters are built with K-means over the training set. Average stats; cluster assignments at test time (one row = one cluster). Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
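A minimal sketch of this pseudo-labeling recipe, assuming K-means pseudo-labels computed from audio statistics and a small torchvision CNN as the visual model; the statistics, shapes and network are illustrative placeholders, not the paper's exact setup:

```python
# Sketch: cluster audio statistics into pseudo-labels, then train a visual CNN to
# predict the cluster of each frame (self-supervised classification).
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from torchvision.models import resnet18

N, K = 1000, 30                                             # clips, audio clusters
audio_stats = np.random.randn(N, 64).astype(np.float32)     # e.g. band energies per clip
frames = torch.randn(N, 3, 224, 224)                        # one frame per clip

# 1) K-means over audio statistics -> pseudo-labels
pseudo_labels = torch.as_tensor(KMeans(n_clusters=K).fit_predict(audio_stats),
                                dtype=torch.long)

# 2) Train a visual CNN to predict the audio cluster from the frame alone
cnn = resnet18(num_classes=K)
opt = torch.optim.SGD(cnn.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for i in range(0, N, 32):                                    # one pass over mini-batches
    x, y = frames[i:i+32], pseudo_labels[i:i+32]
    loss = loss_fn(cnn(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
```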
  9. Visual Feature Learning. Although the CNN was not trained with class labels, local units with semantic meaning emerge. Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound provides supervision for visual learning." ECCV 2016.
  10. Audio Feature Learning (Vision / Audio / Video)
  11. Audio Feature Learning. Pretrained visual ConvNets supervise the training of a model for sound representation. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
  12. Audio Feature Learning. Videos used for training are unlabeled; the method relies on ConvNets trained on labeled images. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
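A minimal sketch of the cross-modal distillation idea on this slide, under stated assumptions: a frozen ImageNet-pretrained ResNet stands in for the visual teacher, and a toy 1-D convolutional network stands in for the raw-waveform student; only the teacher-supervises-student setup with a KL objective is taken from the SoundNet description.

```python
# Sketch: a frozen visual "teacher" labels video frames with a class distribution,
# and an audio "student" operating on the raw waveform is trained to match it.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

teacher = resnet18(weights=ResNet18_Weights.DEFAULT).eval()   # ImageNet teacher
for p in teacher.parameters():
    p.requires_grad = False

student = nn.Sequential(                  # toy raw-waveform encoder (not SoundNet)
    nn.Conv1d(1, 16, 64, stride=8), nn.ReLU(),
    nn.Conv1d(16, 32, 32, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, 1000),                  # predict the teacher's 1000 classes
)

frames = torch.randn(8, 3, 224, 224)      # frames sampled from the videos
waveforms = torch.randn(8, 1, 22050)      # 1 s of audio at 22.05 kHz

with torch.no_grad():
    target = F.softmax(teacher(frames), dim=1)           # teacher distribution
log_pred = F.log_softmax(student(waveforms), dim=1)      # student distribution
loss = F.kl_div(log_pred, target, reduction="batchmean")
loss.backward()
```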
  13. Audio Feature Learning. Recognizing objects and scenes from sound. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
  14. Audio Feature Learning. Hidden-layer activations of SoundNet are used to train a standard SVM classifier that outperforms the state of the art. Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016.
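A minimal sketch of that evaluation protocol, with random placeholder features standing in for pooled hidden-layer activations and scikit-learn's LinearSVC as the classifier:

```python
# Sketch: treat a hidden-layer activation of a pretrained sound network as a fixed
# feature vector and train a linear SVM on top.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

features = np.random.randn(2000, 1024)           # pooled hidden-layer activations
labels = np.random.randint(0, 50, size=2000)     # acoustic scene / object classes

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2)
clf = LinearSVC(C=1.0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```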
  15. Audio Feature Learning. Hearing sounds that most activate a neuron in the sound network (conv7).
  16. Audio Feature Learning. Hearing sounds that most activate a neuron in the sound network (conv5).
  17. Audio & Visual Feature Learning (Vision / Audio / Video)
  18. Audio & Visual Feature Learning. Audio and visual features learned by assessing correspondence. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  19. Audio & Visual Feature Learning. Audio and visual features learned by assessing correspondence. ● Large collection of unlabelled videos ● Flickr-SoundNet: subset of 500k videos ● Kinetics-Sound: subset of 19k videos from the Kinetics dataset ● 10-second videos ● 1 second of audio as input ● a visual frame occurring within that second counts as a match. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
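A minimal sketch of the audio-visual correspondence objective described above, with toy encoders standing in for the paper's vision and audio subnetworks and negatives formed by shuffling audio within the batch; all shapes are illustrative:

```python
# Sketch: embed a frame and a 1 s audio spectrogram separately, fuse the embeddings,
# and classify match vs. mismatch (audio-visual correspondence).
import torch
import torch.nn as nn

class AVCNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.vision = nn.Sequential(nn.Conv2d(3, 16, 3, 2), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(16, dim))
        self.audio = nn.Sequential(nn.Conv2d(1, 16, 3, 2), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(16, dim))
        self.fusion = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                    nn.Linear(128, 2))      # match / no match

    def forward(self, frame, spec):
        v, a = self.vision(frame), self.audio(spec)
        return self.fusion(torch.cat([v, a], dim=1))

net = AVCNet()
frame = torch.randn(4, 3, 224, 224)          # frames from 4 different videos
spec = torch.randn(4, 1, 257, 200)           # 1 s log-spectrograms
pos = net(frame, spec)                       # matching pairs   -> label 1
neg = net(frame, spec.roll(1, dims=0))       # shuffled audio   -> label 0
logits = torch.cat([pos, neg])
labels = torch.cat([torch.ones(4), torch.zeros(4)]).long()
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
```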
  20. Audio & Visual Feature Learning. Most activated unit in the pool4 layer of the visual network. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  21. Audio & Visual Feature Learning. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  22. Audio & Visual Feature Learning. Most activated unit in the pool4 layer of the audio network. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  23. Audio & Visual Feature Learning. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
  24. Contents ● Feature learning ● Cross-modal retrieval ● Sound source localization ● Sonorization
  25. The problem: query by example. Given: ● an example query image that illustrates the user's information need ● a very large dataset of images. Task: ● rank all images in the dataset according to how likely they are to fulfil the user's information need. How do we make audio and visual features comparable?
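Once both modalities are mapped into a shared embedding space, the retrieval step itself is straightforward. A minimal sketch with random placeholder embeddings standing in for the learned audio/visual features:

```python
# Sketch: query-by-example ranking via cosine similarity in a shared embedding space.
import numpy as np

def rank(query_emb, db_embs):
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = db @ q                           # cosine similarity to the query
    return np.argsort(-scores)                # indices, best match first

query = np.random.randn(256)                  # e.g. audio embedding of the query
database = np.random.randn(10000, 256)        # e.g. visual embeddings of the dataset
print(rank(query, database)[:5])              # top-5 retrieved items
```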
  26. Cross-modal retrieval. Video-level features. Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
  27. Cross-modal retrieval. Similarity loss + classification (regularization). Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
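A minimal sketch of a two-branch embedding trained with a pairwise similarity loss plus a shared classification head acting as regularization, in the spirit of this slide; the dimensions, class head and loss weighting are illustrative assumptions:

```python
# Sketch: map video and audio features into a joint space, pull matching pairs
# together with a similarity loss, and add a classification loss as regularization.
import torch
import torch.nn as nn
import torch.nn.functional as F

visual_branch = nn.Linear(1024, 256)      # video-level feature -> joint space
audio_branch = nn.Linear(128, 256)        # audio-level feature -> joint space
classifier = nn.Linear(256, 25)           # shared class head (regularization)

video_feat = torch.randn(8, 1024)
audio_feat = torch.randn(8, 128)
labels = torch.randint(0, 25, (8,))       # e.g. video category labels

v = F.normalize(visual_branch(video_feat), dim=1)
a = F.normalize(audio_branch(audio_feat), dim=1)

sim_loss = (1 - F.cosine_similarity(v, a)).mean()        # pull matching pairs together
cls_loss = F.cross_entropy(classifier(v), labels) + \
           F.cross_entropy(classifier(a), labels)        # classification regularization
loss = sim_loss + 0.1 * cls_loss
loss.backward()
```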
  28. Cross-modal retrieval (best match; visual feature, audio feature). Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
  29. Cross-modal retrieval (best match; visual feature, audio feature). Surís, Didac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." arXiv preprint arXiv:1801.02200 (2018).
  30. Cross-modal retrieval. Hong-Im Music-Video: a large-scale video-music pair dataset of 200k video-music pairs, based on all videos tagged with "music video" in YouTube-8M. Video-level features. Sungeun Hong, Woobin Im and Hyun Seung Yang. "CBVMR: Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint." ICMR 2018 (code available).
  31. Cross-modal retrieval. Sungeun Hong, Woobin Im and Hyun Seung Yang. "CBVMR: Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint." ICMR 2018 (code available).
  32. Cross-modal retrieval. Inter-modal ranking constraint; soft intra-modal structure constraint. Sungeun Hong, Woobin Im and Hyun Seung Yang. "CBVMR: Content-Based Video-Music Retrieval Using Soft Intra-Modal Structure Constraint." ICMR 2018 (code available).
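A minimal sketch in the spirit of these two constraints, assuming a triplet margin loss for the inter-modal ranking term and a simplified similarity-structure-matching term as a loose stand-in for the paper's soft intra-modal structure constraint (not the exact formulation):

```python
# Sketch: inter-modal triplet ranking pulls a video toward its paired music and away
# from mismatched music; a simplified intra-modal term keeps pairwise similarities
# within each modality close to what they were in the input feature space.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_net = nn.Linear(512, 128)           # video-level features -> shared space
music_net = nn.Linear(512, 128)           # music-level features -> shared space

video_in = torch.randn(16, 512)
music_in = torch.randn(16, 512)           # music_in[i] is the pair of video_in[i]

v = F.normalize(video_net(video_in), dim=1)
m = F.normalize(music_net(music_in), dim=1)

# Inter-modal ranking: matched pair closer than a shuffled (negative) pair by a margin
neg = m.roll(1, dims=0)
ranking = F.triplet_margin_loss(v, m, neg, margin=0.2)

# Simplified intra-modal structure term: preserve the input similarity structure
def structure(x_in, x_emb):
    s_in = F.normalize(x_in, dim=1) @ F.normalize(x_in, dim=1).t()
    s_emb = x_emb @ x_emb.t()
    return F.mse_loss(s_emb, s_in)

loss = ranking + 0.1 * (structure(video_in, v) + structure(music_in, m))
loss.backward()
```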
  33. Cross-modal retrieval. Unsupervised feature learning. Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
  34. Cross-modal retrieval. Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
  35. Contents ● Feature learning ● Cross-modal retrieval ● Sound source localization ● Sonorization
  36. Sound source localization. Audio-visual object localization network; unsupervised feature learning. Arandjelović, Relja, and Andrew Zisserman. "Objects that Sound." arXiv preprint arXiv:1712.06651 (2017).
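A minimal sketch of a localization head in this spirit: score the audio embedding against every spatial position of the visual feature map, so the resulting map localizes the sound source and its maximum can serve as a correspondence score. The feature tensors are placeholders for the outputs of the two subnetworks:

```python
# Sketch: dot product between the audio embedding and each spatial location of the
# visual feature map yields a localization heatmap.
import torch
import torch.nn.functional as F

vis_map = torch.randn(2, 128, 14, 14)      # visual features: B x C x H x W
audio_emb = torch.randn(2, 128)            # audio embedding:  B x C

vis = F.normalize(vis_map, dim=1)
aud = F.normalize(audio_emb, dim=1)

# similarity at every spatial location -> B x H x W localization map
loc_map = torch.einsum("bchw,bc->bhw", vis, aud)
score = loc_map.flatten(1).max(dim=1).values    # per-pair correspondence score
print(loc_map.shape, score.shape)               # torch.Size([2, 14, 14]) torch.Size([2])
```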
  37. Sound source localization. Semi-supervised feature learning. Senocak, A., Oh, T.H., Kim, J., Yang, M.H. and Kweon, I.S. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  38. Sound source localization. Semi-supervised feature learning. ● Flickr-SoundNet subset of 144k frame pairs for training ● New dataset from Flickr with manual localization annotations (2.5k frames) ○ Frames whose audio does not match (edited audio, or the sounding object not present) are manually flagged. Senocak, A., Oh, T.H., Kim, J., Yang, M.H. and Kweon, I.S. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  39. Sound source localization. Senocak, A., Oh, T.H., Kim, J., Yang, M.H. and Kweon, I.S. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  40. Senocak, A., Oh, T.H., Kim, J., Yang, M.H. and Kweon, I.S. "Learning to Localize Sound Source in Visual Scenes." CVPR 2018.
  41. Contents ● Feature learning ● Cross-modal retrieval ● Sound source localization ● Sonorization
  42. Sonorization. Learn to synthesize sounds from videos of people hitting objects with a drumstick. Are the features learned when training for sound prediction useful for identifying the object's material? Owens et al. "Visually indicated sounds." CVPR 2016.
  43. Sonorization. The Greatest Hits Dataset. Owens et al. "Visually indicated sounds." CVPR 2016.
  44. Sonorization. The Greatest Hits Dataset. Owens et al. "Visually indicated sounds." CVPR 2016.
  45. Sonorization. Sound generation: a 2-layer LSTM for sound feature regression, with AlexNet fc7 for image feature extraction, followed by sound retrieval. Two-stream CNN: at each time step, the RGB frame (one stream) and a spatiotemporal image over three frames (current, previous and following). Owens et al. "Visually indicated sounds." CVPR 2016.
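A minimal sketch of the regression stage described on this slide, with random placeholders standing in for the per-frame AlexNet fc7 features and for the per-step sound features:

```python
# Sketch: per-frame CNN features feed a 2-layer LSTM that regresses a sound feature
# at every time step; the predicted features can then be used to retrieve a waveform.
import torch
import torch.nn as nn

B, T = 4, 30                                # batch of 4 clips, 30 time steps
frame_feats = torch.randn(B, T, 4096)       # stand-in for per-frame AlexNet fc7
target_sound = torch.randn(B, T, 42)        # stand-in for per-step sound features

lstm = nn.LSTM(input_size=4096, hidden_size=256, num_layers=2, batch_first=True)
head = nn.Linear(256, 42)                   # regress a sound feature per time step

hidden, _ = lstm(frame_feats)
pred_sound = head(hidden)
loss = nn.MSELoss()(pred_sound, target_sound)
loss.backward()
```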
  46. Sonorization. Owens et al. "Visually indicated sounds." CVPR 2016.
  47. Sonorization. Owens et al. "Visually indicated sounds." CVPR 2016.
  48. Visual to Sound. Dataset: Visually Engaged and Grounded AudioSet (VEGAS). The model is trained to directly predict the raw signal. Inputs: RGB + optical flow, encoded with an LSTM; SampleRNN for sound generation. Zhou, Y., Wang, Z., Fang, C., Bui, T. and Berg, T.L. "Visual to Sound: Generating Natural Sound for Videos in the Wild." arXiv 2017. Mehri, S., Kumar, K., Gulrajani, I., Kumar, R., Jain, S., Sotelo, J., Courville, A. and Bengio, Y. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
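A minimal sketch of frame-conditioned audio generation in this spirit, with a plain LSTM decoder over quantized samples as a toy stand-in for SampleRNN; all shapes, rates and the conditioning scheme are illustrative assumptions:

```python
# Sketch: per-frame RGB + optical-flow features are encoded by an LSTM, and the
# resulting conditioning vectors drive an autoregressive predictor over quantized
# audio samples (teacher forcing during training).
import torch
import torch.nn as nn

B, T_vid, T_aud, Q = 2, 30, 240, 256            # frames, audio steps, quantization bins
rgb_flow_feats = torch.randn(B, T_vid, 1024)    # concatenated RGB + flow features
audio_tokens = torch.randint(0, Q, (B, T_aud))  # mu-law quantized waveform

encoder = nn.LSTM(1024, 256, batch_first=True)
embed = nn.Embedding(Q, 64)
decoder = nn.LSTM(64 + 256, 256, batch_first=True)
out = nn.Linear(256, Q)

cond, _ = encoder(rgb_flow_feats)                        # B x T_vid x 256
cond = cond.repeat_interleave(T_aud // T_vid, dim=1)     # align to the audio rate
prev = embed(torch.cat([torch.zeros(B, 1, dtype=torch.long),
                        audio_tokens[:, :-1]], dim=1))   # previous samples (shifted)
hidden, _ = decoder(torch.cat([prev, cond], dim=2))
loss = nn.CrossEntropyLoss()(out(hidden).reshape(-1, Q), audio_tokens.reshape(-1))
loss.backward()
```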
  49. Contents ● Feature learning ● Cross-modal retrieval ● Sound source localization ● Sonorization
  50. Questions?
