
Cross Modal Embeddings for Video and Audio Retrieval #WiCV18


The growing number of online videos brings several opportunities for training self-supervised neural networks. Large-scale video datasets such as YouTube-8M allow us to deal with this amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we create links between audio and visual documents by projecting them into a common region of the feature space, obtaining joint audio-visual embeddings. These links are used to retrieve audio samples that fit a given silent video well, and also to retrieve images that match a given audio query. The results in terms of Recall@K obtained over a subset of YouTube-8M videos show the potential of this unsupervised approach for cross-modal feature learning. We train embeddings for both scales and assess their quality in a retrieval problem, formulated as using the feature extracted from one modality to retrieve the most similar videos based on the features computed in the other modality.
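The projection into a common space and the retrieval step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the feature sizes (1024-d visual, 128-d audio) are assumed from the YouTube-8M video-level features, the embedding size of 64 is arbitrary, and the random matrices `W_visual` and `W_audio` are hypothetical stand-ins for the two trained network branches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: YouTube-8M video-level features (1024-d visual, 128-d audio).
# EMBED_DIM is an arbitrary choice for this sketch.
VISUAL_DIM, AUDIO_DIM, EMBED_DIM = 1024, 128, 64

# Hypothetical, untrained projections standing in for the two learned branches.
W_visual = rng.normal(scale=0.01, size=(VISUAL_DIM, EMBED_DIM))
W_audio = rng.normal(scale=0.01, size=(AUDIO_DIM, EMBED_DIM))


def embed(features, weights):
    """Project features into the joint space and L2-normalize each row."""
    z = features @ weights
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def retrieve(query_emb, gallery_emb, k=5):
    """Return the indices of the k gallery items closest to each query.

    Rows are unit length, so the dot product equals cosine similarity.
    """
    sims = query_emb @ gallery_emb.T
    return np.argsort(-sims, axis=1)[:, :k]


# Toy data: 10 videos, each with one visual and one audio feature vector.
visual = rng.normal(size=(10, VISUAL_DIM))
audio = rng.normal(size=(10, AUDIO_DIM))

visual_emb = embed(visual, W_visual)
audio_emb = embed(audio, W_audio)

# Visual query -> audio candidates (audio for a silent video).
top_audio_for_video = retrieve(visual_emb, audio_emb, k=3)
# Audio query -> visual candidates (images for an audio clip).
top_video_for_audio = retrieve(audio_emb, visual_emb, k=3)
```

With trained branches in place of the random matrices, the first column of each result would hold the best match for the corresponding query.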

Published in: Data & Analytics


  1. Cross-modal Embeddings for Video and Audio Retrieval. Xavier Giró-i-Nieto, Amaia Salvador, Amanda Duarte, Dídac Surís, Jordi Torres. Reference: Surís, Dídac, Amanda Duarte, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for Video and Audio Retrieval." ECCV 2018 Women in Computer Vision workshop.
  2. Motivation
  3. Self-supervision
  4. Architecture: Joint embedding space for audio and video.
  5. Dataset: Large scale; 4716 classes; pre-extracted features: video-level, frame-level (not used). Abu-El-Haija, Sami et al. "YouTube-8M: A large-scale video classification benchmark." arXiv:1609.08675, 2016.
  6. Qualitative Results: Video sonorization
  7. Qualitative Results: Video sonorization (figure labels: Visual feature, Audio feature, Best match)
  8. Qualitative Results: Audio Colorization
  9. Qualitative Results: Audio Colorization (figure labels: Visual feature, Audio feature, Best match)
  10. Quantitative Results: Video sonorization, Audio Colorization
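
The quantitative results are reported as Recall@K. As a rough sketch of how such a metric is commonly computed, assuming that query `i` and gallery item `i` come from the same video and that all embeddings are L2-normalized (the function name `recall_at_k` and this pairing convention are illustrative assumptions, not taken from the slides):

```python
import numpy as np


def recall_at_k(query_emb, gallery_emb, k):
    """Fraction of queries whose true match is ranked within the top k.

    Assumes row i of query_emb and row i of gallery_emb belong to the
    same video, and that all rows are L2-normalized.
    """
    sims = query_emb @ gallery_emb.T
    true_scores = np.diag(sims)[:, None]
    # Rank of the true match = number of gallery items scoring strictly higher.
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())


# Toy check: identical embeddings give perfect retrieval.
perfect = recall_at_k(np.eye(4), np.eye(4), k=1)

# Shifted gallery: each true match is outranked by exactly one other item.
shifted_gallery = np.roll(np.eye(4), 1, axis=0)
shifted_r1 = recall_at_k(np.eye(4), shifted_gallery, k=1)
shifted_r2 = recall_at_k(np.eye(4), shifted_gallery, k=2)
```

Calling the function with visual embeddings as queries and audio embeddings as the gallery would score video sonorization; swapping the arguments would score audio colorization.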