
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos


Benet Oriol, Jordi Luque, Ferran Diego, Xavier Giro-i-Nieto
Telefonica Research / Universitat Politecnica de Catalunya (UPC)
CVPR 2020 Workshop on Egocentric Perception, Interaction and Computing

In this work, we propose an effective approach for training unique embedding representations by combining three simultaneous modalities: images, spoken narratives, and textual narratives. The proposed methodology departs from a baseline system that spawns an embedding space trained with only spoken narratives and image cues. Our experiments on the EPIC-Kitchens and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training procedure, yielding better embedding representations. The triad of speech, image, and words allows for a better estimate of the point embedding and improves performance on tasks such as image and speech retrieval, even when the third modality, text, is not present at test time.
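A minimal sketch of how such a tri-modal training objective can look, assuming a margin-based ranking loss summed over every modality pair; the encoders, feature sizes and loss below are illustrative placeholders (the paper names VGG-16, Davenet and BERT as its encoders), not the authors' exact implementation:

    # Sketch of a tri-modal joint-embedding objective (illustrative, not the paper's code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    EMB = 512  # assumed shared embedding size

    # Placeholder encoders: each projects pooled modality features into the joint space.
    image_enc = nn.Linear(4096, EMB)   # stands in for pooled VGG-16 features
    speech_enc = nn.Linear(1024, EMB)  # stands in for pooled Davenet features
    text_enc = nn.Linear(768, EMB)     # stands in for pooled BERT features

    def ranking_loss(a, b, margin=1.0):
        # Matched pairs (a_i, b_i) should score higher than mismatched (a_i, b_j) by `margin`.
        sim = a @ b.t()                   # (B, B) pairwise similarities
        pos = sim.diag().unsqueeze(1)     # matched-pair scores
        cost = (margin + sim - pos).clamp(min=0)
        cost.fill_diagonal_(0)            # ignore the matched pairs themselves
        return cost.mean()

    # Dummy batch of 8 pooled feature vectors per modality.
    img = F.normalize(image_enc(torch.randn(8, 4096)), dim=-1)
    spc = F.normalize(speech_enc(torch.randn(8, 1024)), dim=-1)
    txt = F.normalize(text_enc(torch.randn(8, 768)), dim=-1)

    # Sum the ranking loss over all ordered modality pairs, so every modality
    # is pulled toward its matching counterparts in the shared space.
    pairs = [(img, spc), (spc, img), (img, txt), (txt, img), (spc, txt), (txt, spc)]
    loss = sum(ranking_loss(x, y) for x, y in pairs)
    print(loss.item())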


Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos

  1. Title slide — Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos. Benet Oriol (1,2), Jordi Luque (1), Ferran Diego (1), Xavier Giró-i-Nieto (2,3). (1) Telefonica Research, (2) Universitat Politècnica de Catalunya, (3) Barcelona Supercomputing Center.
  2. Joint Embeddings — figure: an image model and a speech model embed paired examples ("Dog", "Cat") into a shared space. [1] Harwath, David, et al. "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.
  3. Joint Embeddings — figure: the same setup extended with a text model, so image, speech and text representations of "Dog" and "Cat" share one embedding space.
  4. Joint Embeddings — figure: the encoders are VGG-16 for images, Davenet for speech and BERT for text.
  5. How to compute similarity? — figure: speech features of shape N_audio × embedding size, image features of shape 14 × 14 × embedding size, and text features of shape N_tokens × embedding size.
  6. Text & speech matchmap — figure: for the example utterance "open fridge", audio-frame and text-token embeddings (size emb) form a matchmap of shape N_audio × N_tokens.
  7. From matchmap to similarity — figure: the N_audio × N_tokens matchmap is reduced with a max and an average into a scalar similarity score (a sketch of this step follows the slide list).
  8. Dataset: EPIC-Kitchens (I) — images of kitchen procedures, spoken narrations, and clean transcriptions of the narrations. Example narration: "Pick up spoon". Damen, Dima, et al. "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset." ECCV 2018.
  9. Dataset: EPIC-Kitchens (II) — images of kitchen procedures, spoken narrations, and clean transcriptions of the narrations. Example narration: "Continue breaking up rice".
  10. Image retrieval task — figure: a spoken query (e.g. "Dog") is matched against a pool of candidate images, which are ranked by embedding similarity (see the retrieval sketch after the slide list).
  11. Image retrieval task (continued).
  12. Audio retrieval task — figure: a query image is matched against candidate spoken captions (e.g. "Dog", "Cat"), ranked by embedding similarity.
  13. Audio retrieval task (continued).
  14. Places Audio Caption dataset — example spoken caption: "A small jet taking off from an airport. There are three planes and buildings in the background." Harwath, David, Antonio Torralba, and James Glass. "Unsupervised Learning of Spoken Language with Visual Context." NeurIPS 2016, pages 1858–1866.
  15. Places Dataset (continued). *Harwath, David, et al. "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input." ECCV 2018.
  16. Conclusions — text helps: adding transcriptions yields better speech & image embeddings.
  17. Thanks!
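
Slides 6-7 describe how a text & speech matchmap is turned into a scalar similarity. A minimal sketch of that step, assuming the reduction takes a max over text tokens followed by an average over audio frames (the slides state only that a max and an average are used; the order here is an assumption):

    # Matchmap-to-similarity sketch (illustrative; the reduction order is assumed).
    import torch

    def matchmap_similarity(speech_feats, text_feats):
        # speech_feats: (N_audio, emb); text_feats: (N_tokens, emb)
        # Matchmap: similarity of every audio frame with every text token.
        matchmap = speech_feats @ text_feats.t()       # (N_audio, N_tokens)
        # Keep each frame's best-matching token, then average over frames.
        return matchmap.max(dim=1).values.mean()

    speech = torch.randn(128, 512)  # 128 audio frames, 512-dim embeddings
    text = torch.randn(12, 512)     # 12 text tokens, 512-dim embeddings
    print(matchmap_similarity(speech, text).item())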
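
Slides 10-13 evaluate image and audio retrieval. A sketch of the usual evaluation protocol for such tasks, Recall@K over a query-item similarity matrix (the metric and variable names below are assumptions; the slides do not spell out the metric):

    # Recall@K retrieval evaluation sketch (illustrative).
    import torch

    def recall_at_k(sim, k=10):
        # sim: (N, N) similarities where item i is the ground truth for query i.
        topk = sim.topk(k, dim=1).indices                 # (N, k) best-ranked items
        targets = torch.arange(sim.size(0)).unsqueeze(1)  # (N, 1) ground-truth ids
        return (topk == targets).any(dim=1).float().mean().item()

    # Example: 100 spoken queries scored against 100 candidate images.
    sim = torch.randn(100, 100)
    print(recall_at_k(sim, k=10))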
