MCSE: Multimodal Contrastive Learning of
Sentence Embeddings
Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, and Dietrich Klakow,
2022
01 Introduction
02 Related Work
03 Experiments
04 Conclusions
Introduction
• MCSE: Multimodal Contrastive Learning of Sentence Embeddings
 Background: Unsupervised SimCSE (Gao et al., 2021)
 Extends SimCSE with a multimodal contrastive objective
 Evaluated on standard Semantic Textual Similarity (STS) tasks
Introduction
• Architecture of MCSE
 f_v(·) is a pre-trained image encoder such as ResNet
Related Work
• Contrastive learning background: Unsupervised SimCSE
 Data augmentation strategy: dropout noise (the same sentence is encoded twice with different dropout masks to form a positive pair)
 Pulls positive sentences closer and pushes apart negatives, using cosine similarity as the similarity measure; see the sketch below
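A minimal sketch of the unsupervised SimCSE objective described above, assuming a Hugging Face bert-base-uncased encoder; the temperature value and variable names are illustrative, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Unsupervised SimCSE sketch: the same batch of sentences is encoded twice;
# the two different dropout masks act as minimal data augmentation,
# so each sentence's second view is its positive.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active so the two passes differ

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation

sentences = ["A dog runs on the beach.", "Children are playing soccer."]
z1, z2 = embed(sentences), embed(sentences)          # two dropout views

tau = 0.05                                           # temperature (illustrative)
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
labels = torch.arange(len(sentences))                # i-th sentence matches its own view
loss = F.cross_entropy(sim, labels)                  # other sentences in the batch are negatives
loss.backward()
```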
Related Work
• Multimodal Contrastive Learning
 Sentence-image pairs: sentence x_i and image y_i
 f_v(·): pre-trained image encoder such as ResNet
 f_θ(·): pre-trained language encoder such as BERT
 Pulls semantically close sentence-image pairs together and pushes away non-related pairs (a sketch of this objective follows below)
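A rough sketch of such a multimodal contrastive objective, assuming precomputed ResNet-50 image features (2048-d) and BERT sentence features (768-d); the projection dimension and temperature are illustrative placeholders, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Project both modalities into a shared space with single-layer MLPs, then
# pull paired sentence/image embeddings together and push apart the rest
# (in-batch negatives), symmetric over both retrieval directions.
text_proj = nn.Sequential(nn.Linear(768, 256), nn.Tanh())    # on top of f_theta (BERT)
image_proj = nn.Sequential(nn.Linear(2048, 256), nn.Tanh())  # on top of f_v (ResNet-50)

def multimodal_nce(text_feats, image_feats, tau=0.05):
    s = F.normalize(text_proj(text_feats), dim=-1)
    v = F.normalize(image_proj(image_feats), dim=-1)
    logits = s @ v.t() / tau                       # (batch, batch) cosine similarities
    labels = torch.arange(s.size(0))
    # sentence-to-image and image-to-sentence directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Toy usage with random features standing in for BERT/ResNet outputs.
loss = multimodal_nce(torch.randn(8, 768), torch.randn(8, 2048))
loss.backward()
```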
Experiments
• Datasets
 Multimodal datasets: Flickr30k (29,783 images) and MS-COCO (82,783 images)
 Text-only corpus: Wiki1M (10^6 sentences from English Wikipedia)
• Encoders
 Language encoders: BERT and RoBERTa
 Image encoder: ResNet-50
 Projection heads: single-layer MLPs
• Evaluation
 7 Semantic Textual Similarity (STS) tasks: STS 2012-2016, STS Benchmark, SICK-Relatedness
 Metric: Spearman's correlation (a sketch of the evaluation protocol follows below)
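A minimal sketch of the standard STS evaluation protocol: score each sentence pair by the cosine similarity of its embeddings and report Spearman's correlation against the gold scores. The embed argument is a placeholder for any trained encoder, and the toy data below is invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(embed, sentence_pairs, gold_scores):
    """embed: callable mapping a list of sentences to an (n, dim) array."""
    a = embed([s1 for s1, _ in sentence_pairs])
    b = embed([s2 for _, s2 in sentence_pairs])
    # cosine similarity per pair
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return spearmanr(cos, gold_scores).correlation

# Toy usage with a random embedder standing in for SimCSE/MCSE.
rng = np.random.default_rng(0)
fake_embed = lambda sents: rng.normal(size=(len(sents), 768))
pairs = [("a dog runs", "a dog is running"),
         ("a man plays guitar", "someone is playing an instrument"),
         ("a cat sleeps", "stock markets fell sharply")]
print(sts_spearman(fake_embed, pairs, [4.8, 3.5, 0.2]))
```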
Results
• MCSE trained on Wiki1M + Flickr30k improves STS performance over SimCSE:
 BERT: 76.3 → 77.3
 RoBERTa: 76.6 → 78.3
• On STS16, MCSE-BERT is weaker
 -> attributed to the domain discrepancy
Performance comparison on STS tasks
Results
• Without the large text-only corpus, performance decreases overall (0.8 – 5.0 points lower Spearman's correlation)
• MCSE models still outperform their SimCSE counterparts by 0.9 – 3.8 points
 -> validating the efficacy of visual semantics
Average Spearman’s correlation on 7 STS tasks
Results
• Alignment-Uniformity
 Alignment: distance between paired instances (smaller is better)
  Similar samples have similar features
 Uniformity: how uniformly the embeddings are distributed (more uniform is better)
  Preserves maximal information
 * Reference: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere (ICML 2020); the formal definitions are reproduced below
• It is important that the embedding space is spread out broadly and evenly, so that each word preserves its own distinct meaning.
• Contrastive learning forces negative pairs away from positive pairs, which drives the embedding space toward a uniform distribution.
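For reference, these are the two losses as defined in the cited Wang & Isola (ICML 2020) paper; lower values are better for both, and typical choices are α = 2 and t = 2 (p_pos and p_data are discussed on the next slide).

```latex
% Alignment and uniformity losses (Wang & Isola, ICML 2020)
\ell_{\mathrm{align}}(f;\alpha) \;\triangleq\; \mathbb{E}_{(x,y)\sim p_{\mathrm{pos}}}\!\left[\lVert f(x)-f(y)\rVert_2^{\alpha}\right]
\qquad
\ell_{\mathrm{uniform}}(f;t) \;\triangleq\; \log \mathbb{E}_{x,y \,\overset{\text{i.i.d.}}{\sim}\, p_{\mathrm{data}}}\!\left[e^{-t\lVert f(x)-f(y)\rVert_2^2}\right]
```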
Results
• Alignment-Uniformity
 p_pos: distribution of positive pairs
 p_data: data distribution
 Visual grounding in the MCSE models enhances the embeddings mainly by improving the alignment property (a measurement sketch follows below)
The alignment-uniformity plot of models (BERT)
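A small sketch of how the two quantities behind such a plot can be measured on L2-normalized embeddings, using the common settings α = 2 and t = 2; the random vectors below are placeholders for actual model embeddings.

```python
import torch
import torch.nn.functional as F

def alignment(x, y, alpha=2):
    """x, y: normalized embeddings of positive pairs, shape (n, dim); lower is better."""
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """x: normalized embeddings sampled from the data distribution; lower is better."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# Toy usage with random unit vectors standing in for sentence embeddings.
z = F.normalize(torch.randn(128, 768), dim=1)
z_pos = F.normalize(z + 0.1 * torch.randn_like(z), dim=1)  # noisy "positive" views
print(alignment(z, z_pos).item(), uniformity(z).item())
```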
Results
• Improvements on Different Subsets
 Different subsets benefit from visual grounding to different degrees, because of domain discrepancy
Results
• SimCSE tends to retrieve sentences that are syntactically similar, whereas MCSE retrieves sentences that are syntactically diverse but share the same semantics
Results
• Cross-Modal Retrieval, evaluated with Recall@K
 Recall@K: recall computed over the top-k retrieved results (a sketch follows below)
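A minimal sketch of Recall@K for sentence-to-image retrieval, assuming sentence i is paired with image i; the similarity matrix would come from the projected embeddings sketched earlier, and the random matrix here is only a placeholder.

```python
import torch

def recall_at_k(sim, k=5):
    """sim: (num_sentences, num_images) similarity matrix; pair i matches image i."""
    topk = sim.topk(k, dim=1).indices                 # top-k image indices per sentence
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth image index per sentence
    return (topk == targets).any(dim=1).float().mean().item()

# Toy usage with a random similarity matrix.
print(recall_at_k(torch.randn(100, 100), k=5))
```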
Conclusion
• MCSE is proposed for sentence embedding learning
• MCSE consistently improves the performance on STS tasks
• The superiority of the method is shown by analyzing the alignment and uniformity properties of the embedding space
• SimCSE performs better than MCSE when only limited samples are available, while MCSE surpasses SimCSE on larger data
 -> related to training the multimodal weights