Hard-Negatives Selection Strategy for Cross-Modal Retrieval

Cross-modal learning has gained a lot of interest recently, and many applications of it, such as image-text retrieval, cross-modal video search, or video captioning have been proposed. In this work, we deal with the cross-modal video retrieval problem. The state-of-the-art approaches are based on deep network architectures, and rely on mining hard-negative samples during training to optimize the selection of the network’s parameters. Starting from a state-of-the-art cross-modal architecture that uses the improved marginal ranking loss function, we propose a simple strategy for hard-negative mining to identify which training samples are hard-negatives and which, although presently treated as hard-negatives, are likely not negative samples at all and shouldn’t be treated as such. Additionally, to take full advantage of network models trained using different design choices for hard-negative mining, we examine model combination strategies, and we design a hybrid one effectively combining large numbers of trained models.

  1. Hard-Negatives or Non-Negatives? A Hard-Negative Selection Strategy for Cross-Modal Retrieval Using the Improved Marginal Ranking Loss
     Damianos Galanopoulos, Vasileios Mezaris
     2nd International Workshop on Video Retrieval Methods and Their Limits @ ICCV 2021 conference, 16 Oct. 2021
     The MIRROR project has received funding from the European Union’s Horizon 2020 research and innovation action program under grant agreement № 832921.
  2. Introduction
     ● Cross-modal learning has gained a lot of interest
     ● The improved marginal ranking loss is extensively used
     ● State-of-the-art approaches rely on hard-negative samples during training
     ● We aim to extract actual hard-negative samples
        ○ We focus on samples that are semantically close to the anchor and should not be considered as negatives
     ● We examine different strategies for efficient combination of multiple trained models
  3. Problem statement
     ● Sample A is the anchor video-caption sample
     ● Which one of B or C should be considered as a hard-negative sample?
     ● Typical approaches will select B (the nearest-to-anchor sample), but this is a positive one!
     ● C is clearly a negative sample and should be used as the hard-negative
  4. Baseline
     ● We utilized the attention-based dual encoding network of [1]
     ● The improved marginal ranking loss is used to train the network
     [1] D. Galanopoulos, V. Mezaris, "Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search with Dual Encoding Networks", Proc. ACM Int. Conf. on Multimedia Retrieval (ICMR 2020), Dublin, Ireland, October 2020.
  5. Baseline
     ● The combination of multiple models boosts performance
     ● As in [1], we combine 24 different models, obtained by varying the following parameters:
        ○ Attention mechanism in the textual or visual stream
        ○ Two textual encodings (BERT and W2V+BERT)
        ○ Two optimizers (Adam and RMSprop)
        ○ Three learning rates
  6. Hard-negative mining
     ● We designed an offline-online strategy to exclude potentially positive samples
     ● At the offline stage, we estimate a threshold p for the similarity of samples, so that only samples with similarity < p will be treated as hard-negative candidates
        ○ Randomly split the training dataset into batches (as done for training)
        ○ In each batch, calculate the cosine similarity between all possible caption pairs
        ○ Over the entire set of similarities calculated for all batches, we assume that x% (e.g. 1%) of them indicate very similar samples (thus, one could not be treated as a hard-negative for the other)
        ○ Identify the similarity threshold p for which x% of the similarities are higher than p
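The offline stage above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it assumes each caption is already represented as a fixed-length numeric vector, and the names `cosine` and `estimate_threshold` as well as the `seed` parameter are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def estimate_threshold(caption_vecs, batch_size, x_percent, seed=0):
    """Return p such that roughly x_percent of in-batch caption similarities exceed p.

    caption_vecs: list of caption vectors (assumed representation).
    """
    # Randomly split the dataset into batches, as would be done for training.
    indices = list(range(len(caption_vecs)))
    random.Random(seed).shuffle(indices)

    # Collect the similarity of every caption pair inside each batch.
    sims = []
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        for a in range(len(batch)):
            for b in range(a + 1, len(batch)):
                sims.append(cosine(caption_vecs[batch[a]], caption_vecs[batch[b]]))
    if not sims:
        raise ValueError("need at least one caption pair within a batch")

    # Sort descending and skip the top x% of similarities; the next value is p.
    sims.sort(reverse=True)
    k = min(int(len(sims) * x_percent / 100.0), len(sims) - 1)
    return sims[k]
```

With tied similarity values the fraction strictly above p can be slightly smaller than x%; a real pipeline would probably compute the percentile with a numerical library instead of a full sort.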
  7. Hard-negative mining
     ● At the online stage (during training), we enforce the threshold value p
        ○ In every batch, for an anchor (vi,ci), every sample (vj,cj) within the batch whose similarity to the anchor is > p is not considered as a negative at all
        ○ Every other sample is labeled as negative
        ○ Out of this subset of samples, the negative one with the highest similarity to the anchor is selected as the hard-negative sample
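A rough sketch of the online selection step is given below. The slide does not make explicit which similarity is used for the exclusion test and which for ranking the remaining negatives, so this sketch takes two matrices as an assumption: `caption_sim` for the threshold test and `ranking_sim` for picking the hardest negative. The same matrix can be passed for both. The function name `select_hard_negative` is hypothetical.

```python
def select_hard_negative(anchor, caption_sim, ranking_sim, p):
    """Pick the hard-negative index for `anchor` within a batch, or None.

    caption_sim[i][j]: similarity used for the exclusion test (assumed input).
    ranking_sim[i][j]: similarity used to rank the remaining negatives (assumed input).
    p: threshold estimated at the offline stage.
    """
    best_index, best_score = None, float("-inf")
    for j in range(len(caption_sim)):
        if j == anchor:
            continue  # the anchor is never its own negative
        if caption_sim[anchor][j] > p:
            continue  # too similar: potentially positive, not a negative at all
        if ranking_sim[anchor][j] > best_score:
            best_index, best_score = j, ranking_sim[anchor][j]
    return best_index


# Toy batch mirroring the problem statement: sample 0 is the anchor A,
# sample 1 (B) is very close to it, sample 2 (C) is clearly different.
caption_sim = [
    [1.00, 0.95, 0.40],
    [0.95, 1.00, 0.35],
    [0.40, 0.35, 1.00],
]
ranking_sim = [
    [1.00, 0.90, 0.30],
    [0.90, 1.00, 0.25],
    [0.30, 0.25, 1.00],
]
chosen = select_hard_negative(0, caption_sim, ranking_sim, p=0.8)
```

With `p=0.8`, sample 1 is excluded as potentially positive and sample 2 is chosen. With a threshold above 0.95, nothing is excluded and the nearest sample (1) would be picked, which is the behaviour of the typical approach.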
  8. Fusion strategies
     ● By applying the proposed hard-negative mining strategy under different plausible assumptions about x% (e.g. 1%, 2%), and thus different p values, the number of available models can be quickly increased
     ● We study the combination of multiple trained models (late fusion)
     ● For a given query, every trained model produces a ranking list of the most relevant videos
     ● Three different strategies for combining them are examined:
        ○ AVG
        ○ MAX
        ○ Hybrid
  9. Fusion strategies
     AVG
     ● We assume that every model is a well-performing one, and we treat them equally
     ● We average the rankings for a given video
     MAX
     ● We assume that not all our models have very good recall
     ● But we assume that at least the samples they place at the very top of their ranking lists are true positives
     ● Thus, if a video appears very high in the ranking list generated by at least one model, we trust this video to be a good answer to the query
  10. Fusion strategies
     Hybrid
     ● Neither of the previous two assumptions seems perfectly plausible
     ● In our Hybrid strategy, for a retrieved video, we select the Q’ ranking lists in which the video is ranked the highest, out of the Q ranking lists in total
     ● These top-Q’ rankings are averaged to calculate the final ranking for this video
     ● All retrieved videos are re-ordered according to their final ranking
     ● So, if at least Q’ models bring a video high in their ranking lists, we trust this to be a good answer to the query
     Special cases:
     ● If Q’=Q, the Hybrid approach is the same as AVG
     ● If Q’=1, the Hybrid approach is the same as MAX
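The three fusion strategies can be illustrated with one small function, since AVG and MAX are the special cases Q’=Q and Q’=1 of Hybrid. This is only a sketch under assumptions not stated on the slides: every model ranks the same set of videos, a rank of 1 is the best position, and ties in the fused score are broken by video id. The name `hybrid_fusion` is hypothetical.

```python
def hybrid_fusion(rank_lists, q_prime):
    """Fuse per-model rankings by averaging each video's q_prime best ranks.

    rank_lists: one dict per model, mapping video id -> rank (1 = best).
                All dicts are assumed to contain the same video ids.
    q_prime:    number of best rankings to average per video (1 <= q_prime <= Q).
    Returns the video ids ordered from best to worst fused rank.
    """
    q = len(rank_lists)
    if not 1 <= q_prime <= q:
        raise ValueError("q_prime must be between 1 and the number of models")

    fused = {}
    for video in rank_lists[0]:
        # Ranks of this video across all Q models, best (smallest) first.
        ranks = sorted(model_ranks[video] for model_ranks in rank_lists)
        fused[video] = sum(ranks[:q_prime]) / q_prime

    # Re-order all videos by fused rank; ties are broken by video id (assumption).
    return sorted(fused, key=lambda video: (fused[video], video))


models = [
    {"a": 1, "b": 2, "c": 3, "d": 4},
    {"a": 4, "b": 2, "c": 3, "d": 1},
    {"a": 4, "b": 2, "c": 1, "d": 3},
]
avg_order = hybrid_fusion(models, q_prime=3)     # same as AVG
max_order = hybrid_fusion(models, q_prime=1)     # same as MAX
hybrid_order = hybrid_fusion(models, q_prime=2)  # in between
```

In this toy example, video "a" is ranked first by a single model only: MAX places it at the top, while AVG and the Hybrid setting with Q’=2 place it last because the other models rank it poorly.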
  11. Experimental results
     ● Training datasets:
        ○ MSR-VTT, TGIF, ActivityNet Captions and Vatex
     ● Evaluation datasets:
        ○ V3C1 evaluated on TRECVID AVS 2019 and 2020 queries
     ● Evaluation metric:
        ○ Mean extended inferred average precision (MXinfAP)
     ● Keyframe representation:
        ○ ResNet-152 trained on Imagenet 11K
  12. Experimental results
     ● Results in MXinfAP of the combination of multiple models and different setups
     ● Comparison between the baseline hard-negative mining strategy and the proposed one with x=1% and x=2%
     ● The last row shows the results when all models from every mining strategy are combined
  13. Experimental results
     ● Results on the AVS19 and AVS20 datasets for the Hybrid fusion strategy and different values of Q’
  14. Conclusion
     ● New strategy for hard-negative mining to improve the performance of a cross-modal video retrieval network
     ● We focused on excluding positive samples from being wrongfully utilized as hard-negatives
     ● We proposed a hybrid strategy for model combination to take advantage of the high number of trained models
     ● The new hard-negative mining strategy gives small improvements
     ● In combination with the Hybrid fusion strategy, the performance is further boosted
  15. Contact details
     Damianos Galanopoulos, Vasileios Mezaris
     Information Technologies Institute-CERTH
     dgalanop@iti.gr, bmezaris@iti.gr
     www.iti.gr/~bmezaris