
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval


  1. Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. Ruijie Quan, 2019/07/13
  2. MOTIVATION: Most current methods learn injective embedding functions that map an instance to a single point in the shared space. Drawback: they cannot effectively handle polysemous instances, even though individual instances and their cross-modal associations are often ambiguous in real-world scenarios.
  3. CONTRIBUTIONS: 1. Introduce Polysemous Instance Embedding Networks (PIE-Nets), which compute multiple, diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning. 2. Tackle the more challenging case of video-text retrieval. 3. Release a new dataset of 50K video-sentence pairs collected from social media, dubbed MRW (my reaction when).
  4. INTRODUCTION: Injective embedding can be problematic. 1. Injective embedding can suffer when there is ambiguity in individual instances, e.g., polysemous words and images containing multiple objects. 2. Partial cross-domain association: e.g., a text sentence may describe only certain regions of an image, and a video may contain extra frames not described by its associated sentence.
  5. INTRODUCTION: Address the above issues by: 1. formulating instance embedding as a one-to-many mapping task, and 2. optimizing the mapping functions to be robust to ambiguous instances and partial cross-modal associations.
  6. APPROACH (figure slide)
  7. APPROACH: To address the issues with ambiguous instances, propose a novel one-to-many instance embedding model, the Polysemous Instance Embedding Network (PIE-Net), which:
     - extracts K embeddings of each instance by combining global and local information of its input;
     - obtains K locally-guided representations by attending to different parts of an input instance (e.g., regions, frames, words) with a multi-head self-attention module;
     - combines each such local representation with the global representation via residual learning, to avoid learning redundant information;
     - regularizes the K locally-guided representations to be diverse.
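
A minimal PyTorch sketch of this one-to-many embedding, assuming local features of shape (B, L, D_local) and a global feature of shape (B, D_global) coming from a modality-specific encoder. The class name `PIENet`, the two-layer attention MLP, and all dimensions are illustrative choices, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PIENet(nn.Module):
    """Sketch of a Polysemous Instance Embedding Network (PIE-Net).

    Produces K embeddings of one instance by attending to its local features
    with K self-attention heads and fusing each locally-guided feature with
    the global feature through a residual connection.
    """

    def __init__(self, local_dim, global_dim, embed_dim, num_embeds=4, attn_hidden=512):
        super().__init__()
        self.num_embeds = num_embeds
        # Two-layer MLP that scores every location for each of the K heads
        self.attn = nn.Sequential(
            nn.Linear(local_dim, attn_hidden),
            nn.Tanh(),
            nn.Linear(attn_hidden, num_embeds),
        )
        self.fc_local = nn.Linear(local_dim, embed_dim)    # projects locally-guided features
        self.fc_global = nn.Linear(global_dim, embed_dim)  # projects the global feature

    def forward(self, local_feats, global_feat):
        # local_feats: (B, L, D_local) region / frame / word features
        # global_feat: (B, D_global)
        attn = F.softmax(self.attn(local_feats), dim=1)               # (B, L, K), softmax over locations
        local_guided = torch.bmm(attn.transpose(1, 2), local_feats)   # (B, K, D_local)
        glob = self.fc_global(global_feat).unsqueeze(1)               # (B, 1, E)
        # Residual fusion: each embedding adds attended local detail on top of global context
        embeds = F.normalize(glob + self.fc_local(local_guided), dim=-1)  # (B, K, E)
        return embeds, attn  # attn is reused by the diversity regularizer
```

With num_embeds = 1 this reduces to a single locally-guided embedding, which matches the K = 1 setting discussed in the experiments.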
  8. APPROACH (figure slide)
  9. APPROACH: To address the partial association issue, Polysemous Visual-Semantic Embedding (PVSE) ties up two PIE-Nets and trains the model in the multiple-instance learning (MIL) framework.
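
A sketch of how the two PIE-Net outputs could be compared under MIL, assuming each modality produces K unit-normalized candidate embeddings. Scoring a pair by its best-matching candidate pair is one common MIL relaxation and the reading suggested by the slide, not necessarily the paper's exact choice:

```python
import torch


def mil_similarity(img_embeds, txt_embeds):
    """MIL-style similarity between two sets of candidate embeddings.

    img_embeds: (B, K, E) unit-norm image embeddings from the visual PIE-Net
    txt_embeds: (B, K, E) unit-norm text embeddings from the textual PIE-Net
    Returns a (B, B) matrix whose (i, j) entry is the best cosine similarity
    over all K x K candidate pairs of image i and sentence j, so a pair is
    scored by its most compatible pair of "views".
    """
    sims = torch.einsum('ike,jle->ijkl', img_embeds, txt_embeds)  # (B, B, K, K)
    return sims.flatten(2).max(dim=-1).values                     # (B, B)
```

The resulting (B, B) matrix can then be plugged into a standard triplet ranking loss over a minibatch (see the loss terms on slide 12).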
  10. APPROACH: 1. Modality-Specific Feature Encoders
     - Image encoder (ResNet-152): the feature map before the final average pooling layer serves as local features; applying average pooling and feeding the output to one fully-connected layer gives the global feature.
     - Video encoder (ResNet-152): the 2048-dim outputs of the final average pooling layer serve as local features; feeding Ψ(x) into a bidirectional GRU (bi-GRU) with H hidden units and taking the final hidden states gives the global feature.
     - Sentence encoder (GloVe pretrained on the CommonCrawl dataset): the L 300-dim word vectors serve as local features; feeding them into a bi-GRU with H hidden units and taking the final hidden states gives the global feature.
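
As a concrete example of one branch, here is a sketch of the sentence encoder under the assumptions stated above (300-dim GloVe word vectors as local features, a bi-GRU whose final hidden states form the global feature). The class name, the hidden size of 1024, and the pre-loaded `glove_weights` tensor are assumptions for illustration:

```python
import torch
import torch.nn as nn


class SentenceEncoder(nn.Module):
    """Sentence branch: GloVe word vectors as local features,
    bi-GRU final hidden states as the global feature."""

    def __init__(self, glove_weights, hidden_units=1024):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pretrained GloVe vectors, assumed pre-loaded
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.gru = nn.GRU(input_size=300, hidden_size=hidden_units,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (B, L) word indices
        local_feats = self.embed(tokens)                  # (B, L, 300) local features
        _, h_n = self.gru(local_feats)                    # h_n: (2, B, H) final states, both directions
        global_feat = torch.cat([h_n[0], h_n[1]], dim=1)  # (B, 2H) global feature
        return local_feats, global_feat
```

Both outputs would then be passed to the text-side PIE-Net sketched above.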
  11. APPROACH: 2. Local Feature Transformer: multi-head self-attention over the local features. 3. Feature Fusion with Residual Learning: each locally-guided feature is combined with the global feature through a residual connection.
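
The equations for these two steps are only in the slide image. A plausible formulation consistent with the terms used (K attention heads from a two-layer MLP, then residual fusion with the global feature), where $X \in \mathbb{R}^{L \times D}$ are the local features, $g$ is the global feature, and $W_1$, $W_2$, $\mathrm{FC}$, $\operatorname{Norm}$ are illustrative symbols:

$$A = \operatorname{softmax}\big(W_2 \tanh(W_1 X^\top)\big) \in \mathbb{R}^{K \times L}, \qquad Z = A X \in \mathbb{R}^{K \times D}$$

$$v_k = \operatorname{Norm}\big(\mathrm{FC}_g(g) + \mathrm{FC}_\ell(z_k)\big), \quad k = 1, \dots, K$$

Here the softmax is taken row-wise over the $L$ locations and $z_k$ is the $k$-th row of $Z$, so each of the K embeddings $v_k$ mixes global context with a different attended part of the input.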
  12. APPROACH: 4. Optimization and Inference. Training uses three loss terms: an MIL loss, a diversity loss, and a domain discrepancy loss.
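
The loss formulas themselves are not preserved in the transcript. A hedged reconstruction, using the standard forms these names usually denote together with the MIL similarity sketched above (the paper's exact definitions may differ):

$$s(x, y) = \max_{k, k'} \cos\!\big(u_k, v_{k'}\big)$$

$$\mathcal{L}_{\mathrm{mil}} = \sum_{(x, y)} \Big(\big[\alpha - s(x, y) + s(x, y^-)\big]_+ + \big[\alpha - s(x, y) + s(x^-, y)\big]_+\Big)$$

$$\mathcal{L}_{\mathrm{div}} = \big\|A_x A_x^\top - I\big\|_F^2 + \big\|A_y A_y^\top - I\big\|_F^2, \qquad \mathcal{L}_{\mathrm{dom}} = \mathrm{MMD}\big(\{u_k\}, \{v_{k'}\}\big)$$

$$\mathcal{L} = \mathcal{L}_{\mathrm{mil}} + \lambda_{\mathrm{div}}\,\mathcal{L}_{\mathrm{div}} + \lambda_{\mathrm{dom}}\,\mathcal{L}_{\mathrm{dom}}$$

Here $u_k$ and $v_{k'}$ are the K visual and textual embeddings, $A_x$ and $A_y$ are the attention maps of the two PIE-Nets (penalizing their correlations keeps the K heads diverse), $y^-$ and $x^-$ are negative samples with margin $\alpha$, and $\lambda_{\mathrm{div}}$, $\lambda_{\mathrm{dom}}$ are the two relative weights examined in the sensitivity analysis on slide 16.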
  13. MRW DATASET
  14. EXPERIMENTS
  15. EXPERIMENTS: The number of embeddings K: there is a significant improvement from K = 0 to K = 1, which shows the effectiveness of the Local Feature Transformer. Global vs. locally-guided features: simply concatenating the two features (no residual learning) hurts performance, which shows the importance of balancing global and local information in the final embedding.
  16. EXPERIMENTS: Sensitivity analysis on different loss weights: the results show that both loss terms are important to the model, and overall the model is not very sensitive to the two relative weight terms.
  17. EXPERIMENTS
  18. Thank you for your attention.
