This paper presents MCSE, a new approach that learns meaningful sentence embeddings by combining visual and textual information. It shows performance gains across diverse datasets and pre-trained encoders, and aligns semantically similar sentences well. The authors argue that using vision as an additional source of semantic information can further promote sentence representation learning. The method is compared against existing sentence embedding approaches and performs strongly both in analysis and in practice.
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
1. MCSE: Multimodal Contrastive Learning of Sentence Embeddings
Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, and Dietrich Klakow, 2022
3. Introduction
• MCSE: Multimodal Contrastive Learning of Sentence Embeddings
• Background: unsupervised SimCSE (Gao et al., 2021)
• Extends SimCSE with a multimodal contrastive objective
• Experiments on standard Semantic Textual Similarity (STS) tasks
4. Introduction
• Architecture of MCSE
• f_v(·): a pre-trained image encoder such as ResNet
5. Related Work
• Contrastive learning background: unsupervised SimCSE
• Data augmentation strategy: dropout noise
• Pulls positive sentence pairs closer and pushes negatives apart
• Similarity measure: cosine similarity
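The dropout-based SimCSE objective can be sketched in NumPy (a minimal illustration, not the authors' code; the temperature `tau=0.05` follows SimCSE's common setting and is an assumption here):

```python
import numpy as np

def simcse_loss(z1, z2, tau=0.05):
    """InfoNCE loss over two dropout views z1, z2 of the same batch.

    z1, z2: (N, d) embedding arrays; row i of z2 is the positive for
    row i of z1, and all other rows act as in-batch negatives.
    """
    # cosine similarity matrix between all pairs across the two views
    z1n = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2n = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1n @ z2n.T) / tau                    # (N, N)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    # softmax cross-entropy with the diagonal (positives) as the target
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

With two identical, well-separated views the diagonal dominates and the loss approaches zero; shuffling one view breaks the positive pairing and the loss becomes large.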
6. Related Work
• Multimodal contrastive learning
• Sentence-image pairs: sentence x_i and image y_i
• f_v(·): pre-trained image encoder such as ResNet
• f_θ(·): pre-trained language encoder such as BERT
• Pulls semantically close sentence-image pairs together and pushes unrelated pairs apart
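Combining the text-only SimCSE term with the sentence-image term can be sketched as follows (a simplified NumPy sketch, not the paper's implementation: the weighting `lam` is a hypothetical hyperparameter, and the symmetric two-direction multimodal term is an assumption):

```python
import numpy as np

def info_nce(a, b, tau=0.05):
    """InfoNCE with in-batch negatives: row i of b is the positive for row i of a."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def mcse_loss(sent_v1, sent_v2, sent_proj, img_proj, lam=1.0):
    """Text-only SimCSE term plus a sentence-image term.

    sent_v1 / sent_v2: two dropout views of the sentences (from f_theta).
    sent_proj / img_proj: sentence and image features projected into a
    shared space (via MLP heads). `lam` and the symmetric form are
    assumptions, not taken from the paper.
    """
    l_text = info_nce(sent_v1, sent_v2)
    l_mm = 0.5 * (info_nce(sent_proj, img_proj) + info_nce(img_proj, sent_proj))
    return l_text + lam * l_mm
```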
7. Experiments
• Dataset
  Multimodal datasets: Flickr30k (29,783 images) and MS-COCO (82,783 images)
  Text corpus: Wiki1M (English Wikipedia, 10^6 sentences)
• Encoder
  Language encoders: BERT and RoBERTa
  Image encoder: ResNet-50
  Single-layer MLP projection heads
• Evaluation
  7 Semantic Textual Similarity (STS) tasks: STS 2012-2016, STS Benchmark, SICK-Relatedness
  Metric: Spearman's correlation
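STS evaluation scores each sentence pair by the cosine similarity of its embeddings and reports Spearman's rank correlation against the gold scores. A minimal sketch (no tie handling; real evaluations typically use `scipy.stats.spearmanr`):

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Minimal version without tie handling (ties would need averaged ranks).
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def sts_score(emb_a, emb_b, gold):
    """Cosine similarity per sentence pair, correlated with gold STS scores."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    return spearman(cos, gold)
```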
8. Results
• MCSE (Wiki1M + Flickr30k) improves average STS performance:
  BERT (76.3 → 77.3), RoBERTa (76.6 → 78.3)
• On STS16, MCSE-BERT gains less, attributed to the domain discrepancy
Performance comparison on STS tasks
9. Results
• Without the large text-only corpus (multimodal data only), absolute performance decreases
• MCSE models still outperform SimCSE by 0.9-3.8 points
• Dropping the text corpus costs 0.8-5.0 points of Spearman's correlation
→ validating the efficacy of visual semantics
Average Spearman's correlation on 7 STS tasks
10. Results
• Alignment-Uniformity
  Alignment: distance between paired instances (smaller is better);
  similar samples should have similar features
  Uniformity: how uniformly the embeddings are distributed (more uniform is better);
  a uniform distribution preserves maximal information
  * Reference: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere (ICML 2020)
• It matters that the embedding space is broadly and evenly covered, so that each item preserves its own distinct meaning.
• Contrastive learning forces negative pairs away from positive pairs, which spreads the embeddings uniformly over the space.
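The two properties can be measured as in Wang & Isola (ICML 2020): alignment as the mean (squared) distance between positive pairs, uniformity as the log mean Gaussian potential over all pairs, both on the unit hypersphere. A minimal NumPy sketch; `alpha=2` and `t=2` follow that paper's common defaults:

```python
import numpy as np

def alignment(z1, z2, alpha=2):
    """Mean distance^alpha between positive pairs on the unit sphere (lower = better)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    return float(np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha))

def uniformity(z, t=2):
    """Log of the mean Gaussian potential over all distinct pairs (lower = more uniform)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # pairwise squared Euclidean distances via broadcasting
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(z.shape[0], k=1)     # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * d2[iu]))))
```

Perfectly matched positive pairs give alignment 0; a collapsed embedding (all points identical) gives uniformity 0, while spread-out points drive uniformity negative.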
11. Results
• Alignment-Uniformity
  p_pos: distribution of positive pairs
  p_data: data distribution
• MCSE models: visual grounding enhances performance by improving the alignment property
The alignment-uniformity plot of models (BERT)
12. Results
• Improvements on Different Subsets
• Subsets benefit to different degrees from visual grounding, because of domain discrepancy
16. Conclusion
• Proposes MCSE for sentence embedding learning
• MCSE consistently improves performance on STS tasks
• The superiority of the method is demonstrated by analyzing the alignment and uniformity properties of the embedding space
• With limited samples, SimCSE outperforms MCSE; on larger data, MCSE surpasses SimCSE
→ related to training the multimodal weights