Paper: http://ceur-ws.org/Vol-2882/paper73.pdf
YouTube: https://youtu.be/TadJ6y7xZeA
Thuc Nguyen-Quang, Tuan-Duy Nguyen, Thang-Long Nguyen-Ho, Anh-Kiet Duong, Xuan-Nhat Hoang, Vinh-Thuyen Nguyen-Truong, Hai-Dang Nguyen and Minh-Triet Tran: HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching. Proc. of MediaEval 2020, 14-15 December 2020, Online.
Matching text and images based on their semantics plays an important role in cross-media retrieval. However, the connection between the text and images of an article is complex. In the context of the MediaEval 2020 Challenge, we propose three multi-modal methods that map the text and images of news articles into a shared space in order to perform efficient cross-retrieval. Our methods show systematic improvement and validate our hypotheses, with the best-performing method reaching a recall@100 score of 0.2064.
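The recall@100 metric above can be made concrete with a minimal sketch. It assumes a simplified evaluation setup (one ranked list of candidate ids per text query, a single relevant id per query), which is an assumption for illustration rather than the task's exact protocol:

```python
def recall_at_k(rankings, truths, k=100):
    """rankings: one ranked list of candidate ids per query;
    truths: the single relevant id for each query.
    Returns the fraction of queries whose ground truth is in the top k."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)
```

For example, with two queries where only the first has its ground truth in the top 2, `recall_at_k([[1, 2, 3], [4, 5, 6]], [2, 9], k=2)` gives 0.5.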
Presented by: Thuc Nguyen-Quang
HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ ... (multimediaeval)
Paper: http://ceur-ws.org/Vol-2882/paper47.pdf
YouTube: https://youtu.be/vMsM4zg2-JY
Tien-Phat Nguyen, Tan-Cong Nguyen, Gia-Han Diep, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen and Minh-Triet Tran : HCMUS at Medico Automatic Polyp Segmentation Task 2020: PraNet and ResUnet++ for Polyps Segmentation. Proc. of MediaEval 2020, 14-15 December 2020, Online.
The Medico task at MediaEval 2020 explores the challenge of building accurate and high-performance algorithms to detect all types of polyps in endoscopic images. We propose different approaches that leverage the advantages of either the ResUnet++ or the PraNet model to efficiently segment polyps in colonoscopy images, with modifications to the network structure, parameters, and training strategies to tackle various observed characteristics of the given dataset. Our methods outperform those of the other teams in both accuracy and efficiency: after the evaluation, we rank second in task 1 (with a Jaccard index of 0.777 and the best Precision and Accuracy scores) and first in task 2 (with 67.52 FPS and a Jaccard index of 0.658).
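The Jaccard index reported above is the standard intersection-over-union between predicted and ground-truth masks; a minimal sketch on flat binary masks:

```python
def jaccard(pred, target):
    """Jaccard index (IoU) between two flat binary masks (iterables of 0/1).
    Two empty masks are treated as a perfect match."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0
```

For instance, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` overlap in one pixel out of three covered, giving 1/3.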
Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Int... (multimediaeval)
Paper: http://ceur-ws.org/Vol-2882/paper31.pdf
Syed Muhammad Faraz Ali, Muhammad Taha Khan, Syed Unaiz Haider, Talha Ahmed, Zeshan Khan and Muhammad Atif Tahir : Depth-wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract. Proc. of MediaEval 2020, 14-15 December 2020, Online.
Identification of polyps in endoscopic images is critical for the diagnosis of colon cancer. Finding the exact shape and size of polyps requires segmentation of the endoscopic images. This research explores the advantage of using depth-wise separable convolution in the atrous convolutions of the ResUNet++ architecture. Deep atrous spatial pyramid pooling was also implemented on top of ResUNet++. The results show that the architecture with separable convolution is smaller in size and requires fewer GFLOPs without degrading performance too much.
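The size saving from depth-wise separable convolution follows directly from its parameter count: a standard k x k convolution couples every input channel to every output channel, while the separable version splits this into a per-channel spatial filter plus a 1 x 1 pointwise mix. A back-of-the-envelope sketch (weights only, biases ignored):

```python
def conv_params(c_in, c_out, k):
    # Standard convolution: one k x k filter per (input, output) channel pair.
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    # Depthwise: one k x k filter per input channel;
    # pointwise: a 1 x 1 convolution mixing channels.
    return c_in * k * k + c_in * c_out
```

For c_in = 64, c_out = 128, k = 3 this gives 73,728 versus 8,768 weights, roughly an 8.4x reduction, which is the effect the abstract describes.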
HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table... (multimediaeval)
Paper: http://ceur-ws.org/Vol-2882/paper50.pdf
Hai Nguyen-Truong, San Cao, N. A. Khoa Nguyen, Bang-Dang Pham, Hieu Dao, Minh-Quan Le, Hoang-Phuc Nguyen-Dinh, Hai-Dang Nguyen and Minh-Triet Tran : HCMUS at MediaEval 2020: Ensembles of Temporal Deep Neural Networks for Table Tennis Strokes Classification Task. Proc. of MediaEval 2020, 14-15 December 2020, Online.
The Sports Video Classification Task in the Multimedia Evaluation 2020 Challenge focuses on classifying different types of table tennis strokes in video segments. In this task, we, the HCMUS Team, perform multiple experiments with a combination of models, including SlowFast, Optical Flow, DensePose, R(2+1)D, and Channel-Separated Convolutional Networks, to classify 21 types of table tennis strokes from video segments. In total, we submit eight runs corresponding to five different models, each with its own set of hyper-parameters. In addition, we apply several pre-processing techniques to the dataset so that our models learn and classify more accurately. According to the evaluation results, one of our methods outperforms those of the other teams. In particular, our best run achieves 31.35% global accuracy, and all of our methods show promising results in terms of local and global accuracy for action recognition tasks.
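The abstract does not specify how the ensemble combines the models' outputs; one common late-fusion choice, shown here purely as an illustrative sketch, is to average the per-model class probabilities and take the arg-max:

```python
def ensemble_predict(model_probs):
    """model_probs: one class-probability vector per model, all of equal length.
    Averages the probabilities across models and returns the arg-max class."""
    n_classes = len(model_probs[0])
    avg = [sum(p[c] for p in model_probs) / len(model_probs)
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

With three models voting `[0.6, 0.4]`, `[0.2, 0.8]`, `[0.3, 0.7]` over two stroke classes, the averaged probabilities favor class 1.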
Development of stereo matching algorithm based on sum of absolute RGB color d... (IJECEIAES)
This article presents a local stereo matching algorithm built around block matching and two edge-preserving filters. Fundamentally, the matching process consists of several stages that produce the disparity, or depth, map. The central challenge of the matching process is finding accurate corresponding points between the two images. Hence, this article proposes a stereo matching algorithm using an improved Sum of Absolute RGB Differences (SAD), gradient matching, and an edge-preserving filter, the Bilateral Filter (BF), to raise accuracy. SAD and gradient matching are applied in the first stage to obtain a preliminary correspondence result; a BF then acts as an edge-preserving filter to remove noise from that stage. A second BF is used at the last stage to improve the final disparity map and sharpen object boundaries. Experimental analysis and validation use the Middlebury standard benchmarking evaluation system. Based on the results, the proposed work increases accuracy while preserving object edges. To position the proposed work against currently available methods, quantitative measurements were made against other existing methods, and they show that the work in this article performs much better.
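The first-stage SAD block matching described above can be sketched in a few lines for grayscale images (fixed window, brute-force search, no gradient term or bilateral filtering; all names are illustrative):

```python
import numpy as np

def sad_disparity(left, right, block=3, max_disp=5):
    """Brute-force SAD block matching on grayscale images.
    Returns an integer disparity map (borders left at zero)."""
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int64)
            best_cost, best_d = None, 0
            for d in range(min(max_disp, x - half) + 1):
                # Candidate window in the right image, shifted left by d.
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1].astype(np.int64)
                cost = np.abs(patch - cand).sum()
                if best_cost is None or cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

On a synthetic pair where the right image is the left shifted by two pixels, interior disparities come out as 2, which is the expected correspondence.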
Detection of leaf diseases and classification using digital image processing (Naeem Shehzad)
This presentation shows how to detect leaf diseases using the k-means algorithm, the gray-level co-occurrence matrix (GLCM), and a support vector machine (SVM), with complete results.
The material is presented in a convenient way; I hope you find it easy to follow.
Thank you
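As a rough illustration of the k-means step in such a pipeline, here is a minimal 1-D sketch that clusters pixel intensities into k groups. A real pipeline would cluster in a color space and then extract GLCM texture features from the diseased cluster; the function name and defaults here are illustrative:

```python
import numpy as np

def kmeans_1d(pixels, k=2, iters=20, seed=0):
    """Cluster 1-D pixel intensities into k groups with plain k-means."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct pixel values.
    centers = rng.choice(pixels, size=k, replace=False).astype(float)
    for _ in range(iters):
        # Assign each pixel to its nearest center, then recompute the centers.
        labels = np.argmin(np.abs(pixels[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean()
    return labels, centers
```

On intensities with two well-separated groups (around 11 and around 200), the labels split exactly along those groups.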
Image Contrast Enhancement for Brightness Preservation Based on Dynamic Stret... (CSCJournals)
Histogram equalization is an efficient process often employed in consumer electronic systems for image contrast enhancement. In addition to an increase in contrast, it is also required to preserve the mean brightness of an image in order to convey the true scene information to the viewer. A conventional approach is to separate the image into sub-images and then process independently by histogram equalization towards a modified profile. However, due to the variations in image contents, the histogram separation threshold greatly influences the level of shift in mean brightness with respect to the uniform histogram in the equalization process. Therefore, the choice of a proper threshold, to separate the input image into sub-images, is very critical in order to preserve the mean brightness of the output image. In this research work, a dynamic range stretching approach is adopted to reduce the shift in output image mean brightness. Moreover, the computationally efficient golden section search algorithm is applied to obtain a proper separation into sub-images to preserve the mean brightness. Experiments were carried out on a large number of color images of natural scenes. Results, as compared to current available approaches, showed that the proposed method performed satisfactorily in terms of mean brightness preservation and enhancement in image contrast.
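The mean-split idea behind such methods can be sketched in a few lines: equalize the sub-histogram on each side of the image mean, mapping each into its own half of the output range. This is a simplified BBHE-style sketch, not this paper's dynamic stretching or golden-section search:

```python
import numpy as np

def mean_split_equalize(img):
    """Split a uint8 image at its mean and equalize each sub-histogram
    within its own range, so each side stays on its side of the mean."""
    m = int(img.mean())
    out = np.empty_like(img)
    for lo, hi, mask in [(0, m, img <= m), (m + 1, 255, img > m)]:
        vals = img[mask]
        if vals.size == 0:
            continue
        hist, _ = np.histogram(vals, bins=256, range=(0, 256))
        cdf = hist.cumsum() / vals.size
        # Lookup table stretching this sub-histogram over [lo, hi].
        lut = (lo + cdf * (hi - lo)).astype(img.dtype)
        out[mask] = lut[vals]
    return out
```

For the toy image `[[10, 20], [200, 250]]` (mean 120), the dark pixels are stretched over [0, 120] and the bright ones over [121, 255], yielding `[[60, 120], [188, 255]]`.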
Image Fusion and Image Quality Assessment of Fused Images (CSCJournals)
Accurate diagnosis of tumor extent is important in radiotherapy. This paper presents the fusion of PET and MRI images. Multi-sensor image fusion is the process of combining information from two or more images into a single image; the resulting image contains more information than any of the individual inputs. PET delivers high-resolution molecular imaging, with resolution down to 2.5 mm full width at half maximum (FWHM), which allows us to observe the brain's molecular changes using specific reporter genes and probes. On the other hand, 7.0 T MRI, with sub-millimeter resolution of the cortical areas down to 250 µm, allows us to visualize the fine details of the brainstem as well as many cortical and sub-cortical areas. A PET-MRI fusion imaging system therefore provides complete information for neurological diseases as well as cognitive neuroscience. The paper presents PCA-based image fusion and also a fusion algorithm based on the wavelet transform to improve resolution: the two images to be fused are first decomposed into sub-images of different frequencies, the information is fused, and the sub-images are finally reconstructed into a result image with richer information. We also propose image fusion in Radon space. The paper assesses image fusion by measuring the quantity of enhanced information in fused images, using entropy, mean, standard deviation, Fusion Mutual Information, cross-correlation, Mutual Information, Root Mean Square Error, Universal Image Quality Index, and Relative Shift in Mean to compare fused image quality. Comparative evaluation of fused images is a critical step in judging the relative performance of different image fusion algorithms. In this paper, we also propose an image quality metric based on the human visual system (HVS).
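In its simplest form, the PCA fusion mentioned above weights the two source images by the components of the principal eigenvector of their joint covariance. A minimal sketch (registration, multiscale handling, and the Radon-space variant are omitted):

```python
import numpy as np

def pca_fuse(img_a, img_b):
    """Fuse two same-shape images with weights taken from the principal
    eigenvector of the 2x2 covariance of their pixel vectors."""
    data = np.stack([img_a.ravel(), img_b.ravel()]).astype(float)
    cov = np.cov(data)
    vals, vecs = np.linalg.eigh(cov)
    v = np.abs(vecs[:, np.argmax(vals)])
    w = v / v.sum()                      # non-negative weights summing to 1
    return w[0] * img_a + w[1] * img_b
```

Because the weights are non-negative and sum to one, the fused image lies pixelwise between the two inputs, which is a quick sanity check on the output.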
Development and Comparison of Image Fusion Techniques for CT&MRI Images (IJERA Editor)
Image processing techniques primarily focus on enhancing the quality of an image or a set of images to derive the maximum information from them. Image fusion is a technique for producing a superior-quality image from a set of available images: the process of combining relevant information from two or more images into a single image, where the resulting image is more informative and complete than any of the inputs. A lot of research is being done in this field, encompassing computer vision, automatic object detection, image processing, parallel and distributed processing, robotics, and remote sensing. This project explains the theoretical and implementation issues of seven image fusion algorithms and presents experimental results for each. The fusion algorithms are assessed based on the study and development of several image quality metrics.
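Two of the simplest quality metrics used in such assessments, entropy (the information content of a fused image) and root-mean-square error against a reference, can be sketched as:

```python
import numpy as np

def entropy(img):
    """Shannon entropy (bits) of a uint8 image's intensity histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins before taking logs
    return float(-(p * np.log2(p)).sum())

def rmse(a, b):
    """Root-mean-square error between two same-shape images."""
    return float(np.sqrt(np.mean((a.astype(float) - b.astype(float)) ** 2)))
```

A constant image has zero entropy, and an image compared with itself has zero RMSE, which makes both metrics easy to sanity-check.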
Most existing high-performance co-segmentation algorithms are complex, both because of the way they co-label a set of images and because they commonly need a few parameters fine-tuned for effective co-segmentation. In this paper, instead of following the conventional approach of co-labelling multiple images, we propose to first exploit inter-image information through co-saliency, and then perform single-image segmentation on each individual image. To make the system robust and to avoid heavy dependence on any single saliency extraction method, we propose to apply multiple existing saliency extraction methods to each image to obtain diverse saliency maps. Our major contribution lies in the proposed method that fuses the obtained diverse saliency maps by exploiting the inter-image information, which we call saliency co-fusion. Experiments on five benchmark datasets with eight saliency extraction methods show that our saliency co-fusion based approach achieves competitive performance even without parameter fine-tuning when compared with state-of-the-art methods.
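The shape of map-level fusion can be caricatured with a simple self-consistency weighting: weight each saliency map by its correlation with the pixelwise mean map. This is not the paper's actual co-fusion, which exploits inter-image cues, but only an illustrative single-image stand-in:

```python
import numpy as np

def fuse_saliency_maps(maps):
    """Fuse a list of same-shape saliency maps in [0, 1] by weighting each
    map by its (clamped) correlation with the pixelwise mean map."""
    stack = np.stack([m.ravel() for m in maps]).astype(float)
    mean = stack.mean(axis=0)
    w = np.array([max(np.corrcoef(row, mean)[0, 1], 0.0) for row in stack])
    w = w / w.sum() if w.sum() > 0 else np.full(len(maps), 1 / len(maps))
    fused = (w[:, None] * stack).sum(axis=0)
    return fused.reshape(maps[0].shape)
```

Since the weights are non-negative and sum to one, the fused map stays in [0, 1] whenever the inputs do.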
A survey on methods and applications of meta-learning with GNNs (Shreya Goyal)
This survey provides a comprehensive review of works that combine graph neural networks (GNNs) and meta-learning, along with a summary of the methods and applications in each category. The application of meta-learning to GNNs is a growing and exciting field, and many graph problems stand to benefit immensely from the combination of the two approaches.
This article presents a new algorithm for long-term tracking of moving objects. We address some of the main difficulties first through a comparative study of methods for measuring the difference and the similarity between the template and the source image. In the second part, an improvement of the best method allows us to follow the target robustly. The method also effectively handles geometric deformation, partial occlusion, and recovery after the target leaves the field of view. The originality of our algorithm lies in a new model that neither depends on a probabilistic process nor requires prior data-based detection. Experimental results on several difficult video sequences demonstrate performance advantages over many recent trackers. The algorithm can be employed in applications such as video surveillance, active vision, and industrial visual servoing.
Adaptive threshold for moving objects detection using gaussian mixture model (TELKOMNIKA JOURNAL)
Moving object detection is an important task in video surveillance systems. Automatically defining the threshold that separates a moving object from the background within a video is challenging. This study proposes a Gaussian mixture model (GMM) as the thresholding strategy for moving object detection. The performance of the proposed method is compared to the Otsu algorithm and a gray-level threshold as baseline methods, using mean square error (MSE) and peak signal-to-noise ratio (PSNR), evaluated on a human video dataset. The average MSE is 257.18 for the GMM, 595.36 for Otsu, and 645.39 for the gray threshold, so the GMM's MSE is the lowest; the average PSNR is 24.71 for the GMM, 20.66 for Otsu, and 19.35 for the gray threshold, so the GMM's PSNR is the highest. The proposed method thus outperforms the baselines in terms of detection error.
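The two evaluation metrics reported above are standard and easy to reproduce; note that PSNR is just MSE re-expressed on a log scale against the peak intensity:

```python
import numpy as np

def mse(a, b):
    """Mean square error between two same-shape images."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; identical images give infinity."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * np.log10(peak ** 2 / m)
```

The maximally wrong case (all-zero versus all-255 images) has MSE equal to peak squared, so its PSNR is exactly 0 dB.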
A novel method is proposed for image segmentation based on probabilistic field theory. The model assumes that all the pixels of an image, together with some unknown parameters, form a field; according to this model, the pixel labels are generated by a compound function of the field. The main novelty of the model is that it considers both the features of the pixels and the interdependence among them. The parameters are generated by a novel spatially variant mixture model and estimated by an expectation-maximization (EM)-based algorithm; thus, we simultaneously impose spatial smoothness as prior knowledge. Numerical experiments are presented in which the proposed method and other mixture-model-based methods were tested on synthetic and real-world images. The experimental results demonstrate that our algorithm achieves competitive performance compared to other methods.
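The EM estimation step can be illustrated on the simplest case, a two-component 1-D Gaussian mixture. This toy version has no spatial prior, so it omits the spatially variant part of the model above:

```python
import math

def em_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture to xs by EM.
    Returns per-component means, variances, and mixing weights."""
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            d = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(d)
            resp.append([di / s for di in d])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, pi
```

On data drawn around 0 and around 5 in equal proportion, the fitted means land near those centers and the mixing weights near one half each.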
Cognitive radio networks enable a more efficient use of the radioelectric spectrum through dynamic access. Decentralized cognitive radio networks have gained popularity due to their advantages over centralized networks. The purpose of this article is to propose the collaboration between secondary users for cognitive Wi-Fi networks, in the form of two multi-criteria decision-making algorithms known as TOPSIS and VIKOR and assess their performance in terms of the number of failed handoffs. The comparative analysis is established under four different scenarios, according to the service class and the traffic level, within the Wi-Fi frequency band. The results show the performance evaluation obtained through simulations and experimental measurements, where the VIKOR algorithm has a better performance in terms of failed handoffs under different scenarios and collaboration levels.
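A minimal TOPSIS sketch shows the mechanics of the first algorithm named above: vector-normalize the decision matrix, weight it, and score each alternative by its relative distance to the ideal and anti-ideal solutions (VIKOR differs in its aggregation but takes the same inputs). The criteria and weights here are illustrative, not the article's handoff scenarios:

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """matrix: alternatives x criteria; weights: per-criterion importance;
    benefit[j]: True if larger values of criterion j are better.
    Returns closeness scores in [0, 1]; higher is better."""
    m = np.asarray(matrix, dtype=float)
    norm = m / np.sqrt((m ** 2).sum(axis=0))         # vector normalization
    v = norm * np.asarray(weights, dtype=float)
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.sqrt(((v - ideal) ** 2).sum(axis=1))  # distance to ideal
    d_neg = np.sqrt(((v - anti) ** 2).sum(axis=1))   # distance to anti-ideal
    return d_neg / (d_pos + d_neg)
```

An alternative that dominates on every benefit criterion coincides with the ideal solution and scores exactly 1.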
A Locality Sensitive Low-Rank Model for Image Tag Completion (Nexgen Technology)
Quantitative Comparison of Artificial Honey Bee Colony Clustering and Enhance... (idescitation)
This paper presents a comparison of two popular clustering algorithms for breast DCE-MRI segmentation. Magnetic resonance imaging (MRI) is an advanced medical imaging technique providing rich information about human soft-tissue anatomy. The goal of breast magnetic resonance image segmentation is to accurately identify the principal mass or lesion structures in these image volumes. Many methods exist to segment breast DCE-MR images. One of these, the K-means clustering procedure, provides effective solutions in many science and engineering fields; it is especially popular in pattern classification and signal processing, and can segment breast DCE-MRI with high precision. The artificial bee colony (ABC) algorithm is a new, very simple, and robust population-based optimization algorithm inspired by the intelligent behavior of honey bee swarms. This paper compares the performance of the two segmentation techniques on breast DCE-MR images; the experiments use real dynamic contrast-enhanced magnetic resonance images (DCE-MRI). Results show that the artificial bee colony algorithm performs better in terms of segmentation accuracy, robustness, and speed of computation.
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details in different sub-regions of the image by paying attention to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. The proposed AttnGAN significantly outperforms the previous state of the art, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. A detailed analysis is also performed by visualizing the attention layers of the AttnGAN. It shows, for the first time, that the layered attentional GAN is able to automatically select the conditioning at the word level for generating different parts of the image.
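The word-level attention at the heart of AttnGAN can be sketched as a softmax over region-word similarities; this is plain dot-product attention in NumPy, whereas AttnGAN itself applies learned projections before this step:

```python
import numpy as np

def word_attention(regions, words):
    """regions: (R, d) image sub-region features; words: (T, d) word features.
    Each region attends over the words of the description; returns the
    per-region word-context vectors and the attention weights."""
    scores = regions @ words.T                          # (R, T) similarities
    scores = scores - scores.max(axis=1, keepdims=True) # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)             # normalize over words
    context = attn @ words                              # (R, d) contexts
    return context, attn
```

When region and word features are near-orthogonal and each region aligns with one word, the attention matrix approaches the identity and each context vector approaches its matching word feature.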
Adversarial Variational Autoencoders to extend and improve generative model -... (Loc Nguyen)
Generative artificial intelligence (GenAI) has been developing rapidly, with remarkable achievements such as ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI that excels at generating raster data such as images and sound, thanks to the strengths of deep neural networks (DNNs) in inference and recognition. The built-in inference mechanism of a DNN, which simulates the synaptic plasticity of the human neural network, fosters the generation ability of a DGM, producing surprising results with the support of statistical flexibility. Two popular approaches to DGMs are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). VAE and GAN each have their own strong points, although both rest on the same underlying statistical theory and on the considerable complexity of DNN hidden layers, which act as effective encoding/decoding functions without concrete specifications. In this research, I try to unify VAE and GAN into a consistent, consolidated model called the Adversarial Variational Autoencoder (AVA), in which VAE and GAN complement each other: the VAE is a good data generator, encoding data via the elegant machinery of Kullback-Leibler divergence, while the GAN provides an important mechanism for assessing whether data is realistic or fake. In other words, AVA aims to improve the accuracy of generative models, and it also extends the functionality of simple generative models. Methodologically, this research combines applied mathematical concepts with careful computer programming techniques in order to implement and solve complicated problems as simply as possible.
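One concrete ingredient of the VAE half of such a model is the closed-form Kullback-Leibler term between the encoder's diagonal Gaussian and the standard-normal prior, which can be written in a few lines (a standard VAE identity, not AVA-specific):

```python
import math

def vae_kl(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian,
    given per-dimension means and log-variances."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

The term vanishes exactly when the encoder matches the prior (zero mean, unit variance) and grows as the posterior drifts away, which is what makes it a useful regularizer in the VAE objective.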
Detection of leaf diseases and classification using digital image processingNaeem Shehzad
In this presentation you can learn how to find leaf disease using k mean algorithm and gray level co-occurrence matrix and support vector machine with complete results.
In this presentation , I mention all the data in very convenient way . I hope you can take it easy.
Thank you
Image Contrast Enhancement for Brightness Preservation Based on Dynamic Stret...CSCJournals
Histogram equalization is an efficient process often employed in consumer electronic systems for image contrast enhancement. In addition to an increase in contrast, it is also required to preserve the mean brightness of an image in order to convey the true scene information to the viewer. A conventional approach is to separate the image into sub-images and then process independently by histogram equalization towards a modified profile. However, due to the variations in image contents, the histogram separation threshold greatly influences the level of shift in mean brightness with respect to the uniform histogram in the equalization process. Therefore, the choice of a proper threshold, to separate the input image into sub-images, is very critical in order to preserve the mean brightness of the output image. In this research work, a dynamic range stretching approach is adopted to reduce the shift in output image mean brightness. Moreover, the computationally efficient golden section search algorithm is applied to obtain a proper separation into sub-images to preserve the mean brightness. Experiments were carried out on a large number of color images of natural scenes. Results, as compared to current available approaches, showed that the proposed method performed satisfactorily in terms of mean brightness preservation and enhancement in image contrast.
Image Fusion and Image Quality Assessment of Fused ImagesCSCJournals
Accurate diagnosis of tumor extent is important in radiotherapy. This paper presents the use of image fusion of PET and MRI image. Multi-sensor image fusion is the process of combining information from two or more images into a single image. The resulting image contains more information as compared to individual images. PET delivers high-resolution molecular imaging with a resolution down to 2.5 mm full width at half maximum (FWHM), which allows us to observe the brain\'s molecular changes using the specific reporter genes and probes. On the other hand, the 7.0 T-MRI, with sub-millimeter resolution images of the cortical areas down to 250 m, allows us to visualize the fine details of the brainstem areas as well as the many cortical and sub-cortical areas. The PET-MRI fusion imaging system provides complete information on neurological diseases as well as cognitive neurosciences. The paper presents PCA based image fusion and also focuses on image fusion algorithm based on wavelet transform to improve resolution of the images in which two images to be fused are firstly decomposed into sub-images with different frequency and then the information fusion is performed and finally these sub-images are reconstructed into result image with plentiful information. . We also propose image fusion in Radon space. This paper presents assessment of image fusion by measuring the quantity of enhanced information in fused images. We use entropy, mean, standard deviation and Fusion Mutual Information, cross correlation , Mutual Information Root Mean Square Error, Universal Image Quality Index and Relative shift in mean to compare fused image quality. Comparative evaluation of fused images is a critical step to evaluate the relative performance of different image fusion algorithms. In this paper, we also propose image quality metric based on the human vision system (HVS).
Development and Comparison of Image Fusion Techniques for CT&MRI ImagesIJERA Editor
Image processing techniques primarily focus upon enhancing the quality of an image or a set ofimages to derive
the maximum information from them. Image Fusion is a technique of producing a superior quality image from a
set of available images. It is the process of combining relevant information from two or more images into a
single image wherein the resulting image will be more informative and complete than any of the input images. A
lot of research is being done in this field encompassing areas of Computer Vision, Automatic object detection,
Image processing, parallel and distributed processing, Robotics and remote sensing. This project paves way to
explain the theoretical and implementation issues of seven image fusion algorithms and the experimental results
of the same. The fusion algorithms would be assessed based on the study and development of some image
quality metrics
Most existing high-performance co-segmentation algorithms are complex, both because of the way they co-label a set of images and because they commonly need a few parameters fine-tuned for effective co-segmentation. In this paper, instead of following the conventional approach of co-labelling multiple images, we propose to first exploit inter-image information through co-saliency, and then perform single-image segmentation on each individual image. To make the system robust and to avoid heavy dependence on any single saliency extraction method, we apply multiple existing saliency extraction methods to each image to obtain diverse saliency maps. Our major contribution is a method that fuses these diverse saliency maps by exploiting inter-image information, which we call saliency co-fusion. Experiments on five benchmark datasets with eight saliency extraction methods show that our saliency co-fusion approach achieves competitive performance compared with state-of-the-art methods, even without parameter fine-tuning.
A survey on methods and applications of meta-learning with GNNs - Shreya Goyal
This survey provides a comprehensive review of works that combine graph neural networks (GNNs) and meta-learning, along with a thorough summary of methods and applications in each category. The application of meta-learning to GNNs is a growing and exciting field; many graph problems stand to benefit immensely from combining the two approaches.
This article presents a new algorithm for long-term tracking of moving objects. We first address some potential difficulties through a comparative study of methods for measuring the difference and the similarity between the template and the source image. In the second part, an improvement of the best method allows us to follow the target robustly. The method also effectively overcomes the problems of geometric deformation, partial occlusion, and recovery after the target leaves the field of view. The originality of our algorithm lies in a new model that does not depend on a probabilistic process and does not require prior data-based detection. Experimental results on several difficult video sequences demonstrate performance advantages over many recent trackers. The algorithm can be employed in applications such as video surveillance, active vision, and industrial visual servoing.
Adaptive threshold for moving objects detection using gaussian mixture model - TELKOMNIKA JOURNAL
Moving object detection is an important task in video surveillance systems. Automatically defining the threshold that separates a moving object from the background within a video is challenging. This study proposes a Gaussian mixture model (GMM) as a thresholding strategy for moving object detection. The performance of the proposed method is compared against the Otsu algorithm and a gray threshold as baseline methods, using mean square error (MSE) and peak signal-to-noise ratio (PSNR), evaluated on a human video dataset. The average MSE is 257.18 for GMM, 595.36 for Otsu and 645.39 for the gray threshold, so the GMM's MSE is lower than both baselines. The average PSNR is 24.71 for GMM, 20.66 for Otsu and 19.35 for the gray threshold, so the GMM's PSNR is higher than both. The proposed method thus outperforms the baseline methods in terms of detection error.
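The MSE and PSNR comparison described above can be sketched as follows (a minimal pure-Python version over nested-list grayscale images; the GMM thresholding itself is out of scope here):

```python
import math

def mse(ref, est):
    """Mean square error between a reference image and an estimate."""
    n = sum(len(row) for row in ref)
    return sum((a - b) ** 2
               for ra, rb in zip(ref, est)
               for a, b in zip(ra, rb)) / n

def psnr(ref, est, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(ref, est)
    return float("inf") if m == 0 else 10 * math.log10(max_val ** 2 / m)
```

Lower MSE and higher PSNR both indicate that the detected foreground mask is closer to the reference, which is exactly the comparison reported in the abstract.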
A novel method is proposed for image segmentation based on probabilistic field theory. The model assumes that all the pixels of an image, together with some unknown parameters, form a field. According to this model, the pixel labels are generated by a compound function of the field. The main novelty of the model is that it considers both the features of individual pixels and the interdependence among pixels. The parameters are generated by a novel spatially variant mixture model and estimated by an expectation-maximization (EM)-based algorithm; we thereby simultaneously impose spatial smoothness as prior knowledge. Numerical experiments are presented in which the proposed method and other mixture-model-based methods were tested on synthetic and real-world images. The results demonstrate that our algorithm achieves competitive performance compared to other methods.
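As a minimal illustration of EM for mixture models, the sketch below fits a two-component 1-D Gaussian mixture. The paper's model is spatially variant and operates on pixel features, so this is only the core EM loop, not the proposed algorithm:

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    mu = [min(data), max(data)]          # crude initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate means, variances, and mixing weights
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
            pi[k] = nk / len(data)
    return mu, var, pi
```

A spatially variant model replaces the global weights `pi` with per-pixel weights coupled to their neighbors, which is where the smoothness prior enters.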
Cognitive radio networks enable more efficient use of the radioelectric spectrum through dynamic access. Decentralized cognitive radio networks have gained popularity due to their advantages over centralized networks. This article proposes collaboration between secondary users in cognitive Wi-Fi networks, implemented through two multi-criteria decision-making algorithms, TOPSIS and VIKOR, and assesses their performance in terms of the number of failed handoffs. The comparative analysis covers four different scenarios, according to service class and traffic level, within the Wi-Fi frequency band. The results, obtained through simulations and experimental measurements, show that the VIKOR algorithm performs better in terms of failed handoffs across the different scenarios and collaboration levels.
A LOCALITY SENSITIVE LOW-RANK MODEL FOR IMAGE TAG COMPLETION - Nexgen Technology
International Journal of Engineering and Science Invention (IJESI) - inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
Quantitative Comparison of Artificial Honey Bee Colony Clustering and Enhance... - idescitation
This paper introduces a comparison of two popular clustering algorithms for breast DCE-MRI segmentation. Magnetic resonance imaging (MRI) is an advanced medical imaging technique providing rich information about human soft-tissue anatomy. The goal of breast magnetic resonance image segmentation is to accurately identify the principal mass or lesion structures in these image volumes. Many methods exist to segment breast DCE-MR images. One of these, the K-means clustering procedure, provides effective solutions in many science and engineering fields; it is especially popular in pattern classification and signal processing, and can segment breast DCE-MRI with high precision. The artificial bee colony (ABC) algorithm is a new, very simple, and robust population-based optimization algorithm inspired by the intelligent behavior of honey bee swarms. This paper compares the performance of the two segmentation techniques on breast DCE-MR images; the experiments use real dynamic contrast-enhanced magnetic resonance images (DCE-MRI). Results show that the artificial bee colony algorithm performs better in terms of segmentation accuracy, robustness, and speed of computation.
In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details in different sub-regions of the image by paying attention to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. The proposed AttnGAN significantly outperforms the previous state of the art, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. A detailed analysis is also performed by visualizing the attention layers of the AttnGAN. For the first time, it shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.
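The word-level attention underlying such models can be illustrated with a bare dot-product attention step: score each word vector against a region query, softmax the scores, and form a weighted context vector. The vectors here are toy values, not AttnGAN's actual sub-region features:

```python
import math

def attend(query, words):
    """Dot-product attention of one region query over a list of word vectors.

    Returns the softmax weights and the weighted context vector."""
    scores = [sum(q * w for q, w in zip(query, vec)) for vec in words]
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(wt * vec[i] for wt, vec in zip(weights, words))
               for i in range(len(query))]
    return weights, context
```

In AttnGAN this happens per image sub-region, so each region attends to the words most relevant to the detail it must synthesize.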
Adversarial Variational Autoencoders to extend and improve generative model... - Loc Nguyen
Generative artificial intelligence (GenAI) has been developing with many incredible achievements like ChatGPT and Bard. The deep generative model (DGM) is a branch of GenAI that is preeminent in generating raster data such as images and sound, owing to the strengths of deep neural networks (DNNs) in inference and recognition. The built-in inference mechanism of a DNN, which simulates the synaptic plasticity of the human neural network, fosters the generation ability of a DGM, producing surprising results with the support of statistical flexibility. Two popular approaches to DGMs are Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). Both VAE and GAN have their own strong points, although they share an underlying statistical theory as well as considerable complexity in the hidden layers of the DNN, which becomes an effective encoding/decoding function without a concrete specification. In this research, I try to unify VAE and GAN into a consistent and consolidated model called Adversarial Variational Autoencoders (AVA), in which VAE and GAN complement each other: the VAE is a good data generator, encoding data via the excellent ideology of Kullback-Leibler divergence, while the GAN is a significantly important method for assessing whether data is realistic or fake. In other words, AVA aims to improve the accuracy of generative models, and it also extends the functionality of simple generative models. Methodologically, this research focuses on combining applied mathematical concepts with skillful computer programming techniques to implement and solve complicated problems as simply as possible.
International Journal of Engineering Research and Development - IJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Improved wolf algorithm on document images detection using optimum mean techn... - journalBEEI
Detecting text in handwritten historical documents provides high-level features for the challenging problem of handwriting recognition. Such handwriting often contains noise, faint or incomplete strokes, strokes with gaps, and competing lines when embedded in a table or form, making it unsuitable for local line-following algorithms or the associated binarization schemes. In this paper, a method based on an optimum threshold value, named the Optimum Mean method, is presented. The Wolf method fails to detect thin text in non-uniform input images; the proposed method overcomes this problem by deriving a maximum threshold value using the optimum mean. Based on the evaluation, the proposed method obtained a higher F-measure (74.53) and PSNR (14.77) and the lowest NRM (0.11) compared to the Wolf method. In conclusion, the proposed method successfully and effectively solves the Wolf method's problem by producing a high-quality output image.
An effective RGB color selection for complex 3D object structure in scene gra... - IJECEIAES
The goal of our project is to develop a complete, fully detailed 3D interactive model of the human body and its systems, allowing the user to interact in 3D with all the elements of each system, in order to teach students human anatomy. Some organs that carry a great deal of anatomical detail, such as the brain, lungs, liver, and heart, need to be described accurately and in minute detail. These organs need to carry all the detailed medical information required to learn how to operate on them, and should allow the user to add careful and precise markings indicating the operative landmarks at the surgery location. Adding so many different items of information is challenging when the area to which the information must be attached is very detailed and overlaps with other medical information for the same region. Existing tagging methods did not give us enough locations to attach the information to. Our solution combines a variety of tagging methods, marking regions by selecting RGB color areas drawn in the texture on the complex 3D object structure. It then relies on those RGB color codes to tag IDs and create relational tables that store the information related to specific areas of the anatomy. With this marking method, the entire set of color values (R, G, B) can be used to identify a set of anatomic regions, which also makes it possible to define multiple overlapping regions.
IMAGE GENERATION WITH GANS-BASED TECHNIQUES: A SURVEY - ijcsit
In recent years, frameworks that employ Generative Adversarial Networks (GANs) have achieved immense results in many fields, especially image generation, both because of their ability to create highly realistic, sharp images and because they can train on huge data sets. However, successfully training GANs is a notoriously difficult task when high-resolution images are required. In this article, we discuss five applicable and fascinating areas of image synthesis based on state-of-the-art GAN techniques: Text-to-Image Synthesis, Image-to-Image Translation, Face Manipulation, 3D Image Synthesis, and DeepMasterPrints. We provide a detailed review of current GAN-based image generation models with their advantages and disadvantages. The publications in each section show that GAN-based algorithms are growing fast, and their constant improvement, whether in the same field or in others, will solve complicated image generation tasks in the future.
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S... - ijsc
Assigning a submitted text to one of several predetermined categories is required when dealing with application-oriented texts. There are many approaches to this problem, including neural network algorithms. This article explores the use of neural networks to sort news articles by category. Two word-vectorization algorithms are used: the Bag of Words (BOW) and the word2vec distributive semantic model. In this work, the BOW model was applied to an FNN, whereas the word2vec model was applied to a CNN. We measured classification accuracy when applying these methods to ad-text datasets. The experimental results show that the two models achieve quite comparable accuracy. However, the word2vec encoding used with the CNN produced more relevant results with regard to the texts' semantics. Moreover, the trained CNN based on the word2vec architecture produces a compact feature map in its last convolutional layer, which can then be reused for text representation, i.e., using the CNN as a text encoder and for transfer learning.
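A minimal Bag-of-Words vectorizer like the one feeding the FNN can be sketched as follows (tokenization is assumed to have been done already; real pipelines add lowercasing, stop-word removal, and so on):

```python
from collections import Counter

def bag_of_words(docs):
    """Count-vectorize tokenized documents over a shared sorted vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            v[index[tok]] = count
        vectors.append(v)
    return vocab, vectors
```

Unlike word2vec, these vectors carry no semantic similarity between words, which is the contrast the experiments above explore.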
Ensemble based method for the classification of flooding event using social m... - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper37.pdf
YouTube: https://youtu.be/4ROoOzdQzEI
Muhammad Hanif, Huzaifa Joozer, Muhammad Atif Tahir and Muhammad Rafi : Ensemble based method for the classification of flooding event using social media data. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper presents the method proposed and implemented by team FAST-NU-DS in "The Flood-related Multimedia Task at MediaEval 2020". The task provides tweets in the Italian language, extracted during floods between 2017 and 2019. The proposed method uses the text of each tweet and its associated image for binary classification, identifying whether or not a particular tweet is about a flood incident. An ensemble is designed for classifying tweets on the basis of textual data, visual data, and their combination. For visual data, the method uses data augmentation to oversample the minority class and applies stratified random sampling for input selection; a Visual Geometry Group (VGG16) convolutional neural network, pretrained on ImageNet and Places365, is used. For textual data, Term Frequency-Inverse Document Frequency (TF-IDF) is used for feature representation and a Multinomial Naive Bayes classifier for class prediction. The image and text predictions are combined for the final prediction of each instance. The evaluation revealed F1-scores of 36.31%, 20.76% and 27.86% for text, image, and the combination of both, respectively.
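The TF-IDF feature representation mentioned above can be sketched as follows (a smoothed variant; the exact weighting and tokenization used by the team are not specified in the abstract):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Smoothed TF-IDF weights per tokenized document.

    tf = term count / doc length; idf = log((1+N) / (1+df))."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({tok: (c / len(doc)) * math.log((1 + n) / (1 + df[tok]))
                        for tok, c in tf.items()})
    return weights
```

Terms appearing in every document get weight zero under this smoothing, so rare, discriminative words (e.g. flood-related vocabulary) dominate the representation fed to the Naive Bayes classifier.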
Presented by: Muhammad Hanif
A method for semantic-based image retrieval using hierarchical clustering tre... - TELKOMNIKA JOURNAL
Semantic extraction for images is an urgent problem applied in many different semantic retrieval systems. In this paper, a semantic-based image retrieval (SBIR) system is proposed based on combining a growth partitioning tree (GP-Tree), built in the authors' previous work, with a self-organizing map (SOM) network and a neighbor graph (together called the SgGP-Tree) to improve accuracy. For each query image, a set of similar images is retrieved from the SgGP-Tree, and a set of visual words is extracted from the classes obtained by mask region-based convolutional neural networks (R-CNN), as the basis for querying the semantics of input images over an ontology using the simple protocol and resource description framework query language (SPARQL). The experiments were performed on the ImageCLEF and MS-COCO image datasets, with precision values of 0.898453 and 0.875467, respectively. Compared with related works on the same datasets, these results show the effectiveness of the proposed methods.
Channel and spatial attention mechanism for fashion image captioning - IJECEIAES
Image captioning aims to automatically generate one or more descriptive sentences for a given input image. Most existing captioning methods use an encoder-decoder model that mainly focuses on recognizing and capturing the relationships between objects in the input image. However, when generating captions for fashion images, it is important not only to describe the items and their relationships, but also to mention the attribute features of clothes (shape, texture, style, fabric, and more). In this study, a novel model is proposed for the fashion image captioning task that captures not only the items and their relationships, but also their attribute features. Two different attention mechanisms (spatial attention and channel-wise attention) are incorporated into the traditional encoder-decoder model, which dynamically interprets the caption sentence over the multi-layer feature map and along the depth dimension of the feature map. We evaluate the proposed architecture on Fashion-Gen using three different metrics (CIDEr, ROUGE-L, and BLEU-1), achieving scores of 89.7, 50.6 and 45.6, respectively. The experiments show that the proposed method significantly improves fashion-image captioning and outperforms other state-of-the-art image captioning methods.
Knowledge maps for e-learning - Jae Hwa Lee, Aviv Segev
Maps such as concept maps and knowledge maps are often used as learning materials. These maps have nodes and links: nodes as key concepts and links as relationships between key concepts. From a map, the user can recognize the important concepts and the relationships between them. Building concept or knowledge maps requires domain experts; since these experts are hard to obtain, the cost of map creation is high. In this study, an attempt was made to automatically build a domain knowledge map for e-learning using text mining techniques. From a set of documents about a specific topic, keywords are extracted using the TF/IDF algorithm. A domain knowledge map (K-map) is based on ranking pairs of keywords according to the number of appearances in a sentence and the number of words in a sentence. The experiments analyzed the number of relations required to identify the important ideas in the text. In addition, the experiments compared K-map learning to document learning and found that the K-map identifies the more important ideas.
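The keyword-pair ranking step can be sketched as follows (counting only sentence-level co-occurrence; the paper additionally weights by sentence length, which is omitted here):

```python
from collections import Counter
from itertools import combinations

def rank_keyword_pairs(sentences, keywords):
    """Rank keyword pairs by how often they co-occur in a sentence.

    `sentences` is a list of tokenized sentences; `keywords` the extracted terms."""
    kw = set(keywords)
    counts = Counter()
    for sent in sentences:
        present = sorted(kw & set(sent))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts.most_common()
```

The top-ranked pairs become the links of the K-map, with the keywords themselves as nodes.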
Template matching is a basic method in image analysis for extracting useful information from images. In this paper, we suggest a new method for pattern matching. Our method transforms the template image from a two-dimensional image into a one-dimensional vector; likewise, all sub-windows (of the same size as the template) in the reference image are transformed into one-dimensional vectors. Three similarity measures, SAD, SSD, and Euclidean distance, are used to compute the likeness between the template and all sub-windows in the reference image to find the best match. The experimental results show the superior performance of the proposed method over conventional methods on various templates of different sizes.
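The sliding-window matching can be sketched on flattened (one-dimensional) vectors, following the paper's 2D-to-1D transformation; SAD is shown here, and SSD or Euclidean distance differ only in the per-element term:

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_match(template, reference):
    """Slide the flattened template over the reference vector.

    Returns the offset with the minimal SAD, i.e. the best match position."""
    best, best_off = float("inf"), -1
    for off in range(len(reference) - len(template) + 1):
        d = sad(template, reference[off:off + len(template)])
        if d < best:
            best, best_off = d, off
    return best_off
```

Replacing `abs(x - y)` with `(x - y) ** 2` gives SSD, and taking its square root gives the Euclidean measure; the argmin is unchanged between SSD and Euclidean.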
Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal... - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper62.pdf
YouTube: https://youtu.be/gV-rvV3iFDA
Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri and Julien Morlier : Classification of Strokes in Table Tennis with a Three Stream Spatio-Temporal CNN for MediaEval 2020. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This work presents a method for classifying table tennis strokes using spatio-temporal convolutional neural networks. The fine-grained classification is performed on trimmed video segments recorded at 120 fps with different players performing in natural conditions. From those segments, the frames are extracted, their optical flow is computed, and the pose of the player is estimated. From the optical flow amplitude, a region of interest is inferred. A three-stream spatio-temporal convolutional neural network using a combination of those modalities and 3D attention mechanisms is presented to perform the classification.
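Inferring a region of interest from the optical-flow amplitude can be sketched as a simple bounding box over above-threshold pixels; the threshold value and any padding are assumptions, as the abstract does not specify them:

```python
def flow_roi(amplitude, thresh):
    """Bounding box (r0, c0, r1, c1) of pixels whose flow amplitude exceeds thresh.

    `amplitude` is a 2-D nested list; returns None if nothing moves."""
    coords = [(i, j) for i, row in enumerate(amplitude)
              for j, a in enumerate(row) if a > thresh]
    if not coords:
        return None
    rows = [i for i, _ in coords]
    cols = [j for _, j in coords]
    return (min(rows), min(cols), max(rows), max(cols))
```

Cropping the frames to this box focuses the network on the moving player rather than the static background.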
Presented by: Pierre-Etienne Martin
Sports Video Classification: Classification of Strokes in Table Tennis for Me... - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper2.pdf
YouTube: https://youtu.be/-bRL868b8ys
Pierre-Etienne Martin, Jenny Benois-Pineau, Boris Mansencal, Renaud Péteri, Laurent Mascarilla, Jordan Calandre and Julien Morlier : Sports Video Classification: Classification of Strokes in Table Tennis for MediaEval 2020. Proc. of MediaEval 2020, 14-15 December 2020, Online.
Fine-grained action classification raises new challenges compared with classical action classification problems. Sports video analysis is a very popular research topic, due to the variety of application areas, ranging from multimedia intelligent devices with user-tailored digests up to the analysis of athletes' performances. Running since 2019 as a part of MediaEval, we offer a task that consists of classifying table tennis strokes from videos recorded in natural conditions at the University of Bordeaux. The aim is to build tools for teachers, coaches, and players to analyse table tennis games. Such tools could lead to automatic profiling of a player and adaptation of their training to improve their sports skills more efficiently.
Presented by: Pierre-Etienne Martin
Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention... - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper61.pdf
YouTube: https://youtu.be/brmI4g3jLS4
Ricardo Kleinlein, Cristina Luna-Jiménez, Fernando Fernández-Martínez and Zoraida Callejas : Predicting Media Memorability from a Multimodal Late Fusion of Self-Attention and LSTM Models. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper reports on the GTH-UPM team's experience in the Predicting Media Memorability task at MediaEval 2020. Teams were asked to predict both short-term and long-term memorability scores, where the score measures whether or not a video endures in a viewer's memory. Our proposed system relies on a late fusion of the scores predicted by three sequential models, each trained on a different modality: video captions, aural embeddings, and visual optical-flow-based vectors. Whereas the single-modality models show a low or zero Spearman correlation coefficient, their combination considerably boosts performance on development data, up to 0.2 in the short-term memorability prediction subtask and 0.19 in the long-term subtask. However, performance on test data drops to 0.016 and -0.041, respectively.
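The Spearman correlation used to score these subtasks can be sketched as follows (no tie handling, which a full evaluation would need; this is the rank-based analogue of Pearson correlation):

```python
def spearman(x, y):
    """Spearman rank correlation between two score lists (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value near 0, as the single-modality models show, means the predicted ordering of videos is essentially unrelated to the true memorability ordering.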
Presented by: Ricardo Kleinlein
Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper52.pdf
Janadhip Jacutprakart, Rukiye Savran Kiziltepe, John Q. Gan, Giorgos Papanastasiou and Alba G. Seco de Herrera : Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task. Proc. of MediaEval 2020, 14-15 December 2020, Online.
In this paper, we present our methods and main results from the Essex NLIP Team's participation in the MediaEval 2020 Predicting Media Memorability task. The task requires participants to build systems that can predict short-term and long-term memorability scores on the real-world video samples provided. Our approach focuses on colour-based visual features as well as the video annotation metadata; in addition, hyper-parameter tuning was explored. Despite the simplicity of the methodology, our approach achieves competitive results. We investigated the use of different visual features and assessed the performance of memorability-score prediction through various regression models, with Random Forest regression as our final model for predicting the memorability of videos.
Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a V... - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper6.pdf
YouTube: https://youtu.be/ySGGu_4vaxs
Alba García Seco De Herrera, Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor, Bogdan Ionescu and Alan F. Smeaton : Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable? Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper describes the MediaEval 2020 Predicting Media Memorability task. First proposed at MediaEval 2018, the task is in its 3rd edition this year, as the prediction of short-term and long-term video memorability (VM) remains challenging. In 2020, the format remained the same as in previous editions. This year the videos are a subset of the TRECVid 2019 Video to Text dataset, containing more action-rich video content compared with the 2019 task. This paper describes the main aspects of the task, including its main characteristics, the collection, the ground truth dataset, the evaluation metrics, and the requirements for run submission.
Presented by: Rukiye Savran Kiziltepe
Fooling an Automatic Image Quality Estimator - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper45.pdf
Benoit Bonnet, Teddy Furon and Patrick Bas : Fooling an Automatic Image Quality Estimator. Proc. of MediaEval 2020, 14-15 December 2020, Online.
In this paper we present our work on the 2020 MediaEval task "Pixel Privacy: Quality Camouflage for Social Images". Blind Image Quality Assessment (BIQA) is a classifier that returns a quality score for any given image. Our task is to modify an image to decrease its BIQA score while maintaining a good perceived quality. Since BIQA is a deep neural network, we took an adversarial attack approach to the problem.
Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable C... - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper16.pdf
YouTube: https://youtu.be/ix_b9K7j72w
Zhengyu Zhao : Fooling Blind Image Quality Assessment by Optimizing a Human-Understandable Color Filter. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper presents the submission of our RU-DS team to the Pixel Privacy Task 2020. We propose to fool the blind image quality assessment model by transforming images based on optimizing a human-understandable color filter. In contrast to the common work that relies on small, $L_p$-bounded additive pixel perturbations, our approach yields large yet smooth perturbations. Experimental results demonstrate that in the specific context of this task, our approach is able to achieve strong adversarial effects, but has to sacrifice the image appeal.
Presented by: Zhengyu Zhao
Pixel Privacy: Quality Camouflage for Social Images - multimediaeval
Paper: http://ceur-ws.org/Vol-2882/paper77.pdf
YouTube: https://youtu.be/8Rr4KknGSac
Zhuoran Liu, Zhengyu Zhao, Martha Larson and Laurent Amsaleg : Pixel Privacy: Quality Camouflage for Social Images. Proc. of MediaEval 2020, 14-15 December 2020, Online.
High-quality social images shared online can be misappropriated for unauthorized purposes, where the quality filtering step is commonly carried out by automatic Blind Image Quality Assessment (BIQA) algorithms. Pixel Privacy benchmarks privacy-protective approaches that protect privacy-sensitive images against unethical computer vision algorithms. In the 2020 task, participants are encouraged to develop camouflage methods that effectively decrease the BIQA quality score of high-quality images while maintaining image appeal. The camouflage needs to be either imperceptible to the human eye or a visible enhancement.
Presented by: Zhuoran Liu
Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attentio...
Paper: http://ceur-ws.org/Vol-2882/paper72.pdf
Sabarinathan D and Suganya Ramamoorthy : Efficient Supervision Net: Polyp Segmentation using EfficientNet and Attention Unit. Proc. of MediaEval 2020, 14-15 December 2020, Online.
Colorectal cancer is the third most common cancer worldwide, and identifying it in its early stages remains a challenging problem; the risk can be reduced by early diagnosis of polyps during colonoscopy. The symptoms vary widely and fall into different categories, and a small variation in symptoms may indicate a much higher risk, so doctors and medical analysts constantly need to update their knowledge. Motivated by these issues, the main objective of this paper is to develop a multi-supervision net algorithm for segmenting polyps on a comprehensive dataset. We use the Medico polyp challenge dataset, which consists of 1000 segmented polyp images from the gastrointestinal tract. We propose EfficientNet-B4 as the pre-trained backbone in the multi-supervision net, and the model is trained with multiple output layers. We present quantitative results on the colorectal dataset and achieve good results on all performance metrics. The experimental results show that the proposed model is robust and segments polyps accurately on a comprehensive dataset across metrics such as Dice coefficient, Recall, Precision, and F2.
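The segmentation metrics named above (Dice coefficient, Jaccard index) can be computed directly from binary masks; a minimal NumPy sketch, not the task's official evaluation code:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks (1 = polyp, 0 = background)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

def jaccard_index(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Jaccard index (IoU); related to Dice by J = D / (2 - D)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return (intersection + eps) / (union + eps)
```

The small epsilon keeps both metrics defined when a frame contains no polyp pixels at all.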
Deep Conditional Adversarial learning for polyp Segmentation
Paper: http://ceur-ws.org/Vol-2882/paper22.pdf
Debapriya Banik and Debotosh Bhattacharjee : Deep Conditional Adversarial learning for polyp Segmentation. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This approach addresses the Medico automatic polyp segmentation challenge, which is part of MediaEval 2020. We have proposed a deep conditional adversarial learning based network for the automatic polyp segmentation task. The network comprises two interdependent models, a generator and a discriminator. The generator network is an FCN employed to predict the polyp mask, while the discriminator enforces the segmentation to be as similar as possible to the real segmented mask (ground truth). Our proposed model achieved competitive results on the test dataset provided by the challenge organizers.
A Temporal-Spatial Attention Model for Medical Image Detection
Paper: http://ceur-ws.org/Vol-2882/paper21.pdf
Hwang Maxwell, Wu Cai, Hwang Kao-Shing, Xu Yong Si and Wu Chien-Hsing : A Temporal-Spatial Attention Model for Medical Image Detection. Proc. of MediaEval 2020, 14-15 December 2020, Online.
A local region model with attentive temporal-spatial pathways is proposed for automatically learning various target structures. The attentive spatial pathway highlights the salient region to generate bounding boxes and ignores irrelevant regions in an input image. The proposed attention mechanism allows efficient object localization, and the overall predictive performance is increased because there are fewer false positives for the object detection task on medical images with manual annotations. The experimental results show that the proposed models consistently increase the base architectures' predictive performance across datasets and training sizes without undue computational cost.
HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Netw...
Paper: http://ceur-ws.org/Vol-2882/paper20.pdf
YouTube: https://youtu.be/CVelQl5Luf0
Quoc-Huy Trinh, Minh-Van Nguyen, Thiet-Gia Huynh and Minh-Triet Tran : HCMUS-Juniors 2020 at Medico Task in MediaEval 2020: Refined Deep Neural Network and UNet for Polyps Segmentation. Proc. of MediaEval 2020, 14-15 December 2020, Online.
The Medico: Multimedia Task focuses on developing an efficient and accurate framework for computer-aided diagnosis systems for automatic polyp segmentation, detecting all types of polyps in endoscopic images of the gastrointestinal (GI) tract. Our HCMUS team approaches a solution that combines a Residual module, an Inception module, and an adaptive convolutional neural network with the UNet model and PraNet to semantically segment all types of polyps in endoscopic images. We submit multiple runs with different architectures and parameters in our model. Our methods show promising accuracy and efficiency across multiple experiments.
Fine-tuning for Polyp Segmentation with Attention
Paper: http://ceur-ws.org/Vol-2882/paper15.pdf
Rabindra Khadka : Transfer of Knowledge: Fine-tuning for Polyp Segmentation with Attention. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper describes how the transfer of prior knowledge can effectively take on segmentation tasks with the help of attention mechanisms. The UNet model pretrained on a brain MRI dataset was fine-tuned on the polyp dataset. An attention mechanism was integrated to focus on relevant regions in the input images. The implemented architecture is evaluated on 200 validation images based on intersection over union and Dice score between the ground truth and the predicted region. The model demonstrates a promising result with computational efficiency.
Bigger Networks are not Always Better: Deep Convolutional Neural Networks for...
Paper: http://ceur-ws.org/Vol-2882/paper12.pdf
Adrian Krenzer and Frank Puppe : Bigger Networks are not Always Better: Deep Convolutional Neural Networks for Automated Polyp Segmentation. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper presents our team's (AI-JMU) approach to the Medico automated polyp segmentation challenge. We consider deep convolutional neural networks to be well suited for this task. To determine the best architecture, we test and compare state-of-the-art backbones and two different heads. Finally, we achieve a Jaccard index of 73.74% on the challenge test set. We further demonstrate that bigger networks do not always perform better; however, growing network size always increases the computational complexity.
Insights for wellbeing: Predicting Personal Air Quality Index using Regressio...
Paper: http://ceur-ws.org/Vol-2882/paper51.pdf
Amel Ksibi, Amina Salhi, Ala Alluhaidan and Sahar A. El-Rahman : Insights for wellbeing: Predicting Personal Air Quality Index using Regression Approach. Proc. of MediaEval 2020, 14-15 December 2020, Online.
Providing air pollution information to individuals enables them to understand the air quality of their living environments. Thus, the association between people’s wellbeing and the properties of the surrounding environment is an essential area of investigation. This paper proposes air quality prediction through harvesting public/open data and leveraging them to derive the Personal Air Quality Index. These data are usually incomplete; to cope with the problem of missing data, we applied the KNN imputation method. To predict the Personal Air Quality Index, we apply a voting regression approach based on three base regressors: a Gradient Boosting regressor, a Random Forest regressor, and a linear regressor. Evaluating the experimental results using the RMSE metric, we obtained an average score of 35.39 for Walker and 51.16 for Car.
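The described pipeline (KNN imputation followed by a voting regressor over Gradient Boosting, Random Forest, and linear regression) can be sketched with scikit-learn; the synthetic matrix below is a stand-in for the real pollutant/weather features, which are an assumption here:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 4))                        # toy sensor features
X[rng.random(X.shape) < 0.1] = np.nan           # knock out ~10% of readings
y = rng.random(100) * 100                       # synthetic personal AQI targets

# Fill gaps using the K nearest complete rows
X_filled = KNNImputer(n_neighbors=5).fit_transform(X)

# Voting regressor averaging the three base regressors' predictions
voter = VotingRegressor([
    ("gb", GradientBoostingRegressor()),
    ("rf", RandomForestRegressor(n_estimators=50)),
    ("lr", LinearRegression()),
])
voter.fit(X_filled, y)
pred = voter.predict(X_filled)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))  # the paper's evaluation metric
```

On real data the evaluation would of course use held-out samples rather than the training matrix.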
Use Visual Features From Surrounding Scenes to Improve Personal Air Quality ...
Paper: http://ceur-ws.org/Vol-2882/paper40.pdf
YouTube: https://youtu.be/SL5Hvu1mARY
Trung-Quan Nguyen, Dang-Hieu Nguyen and Loc Tai Tan Nguyen : Use Visual Features From Surrounding Scenes to Improve Personal Air Quality Data Prediction Performance. Proc. of MediaEval 2020, 14-15 December 2020, Online.
In this paper, we propose a method to predict the personal air quality index in an area by using the combination of the levels of the following pollutants: PM2.5, NO2, and O3, measured from the nearby weather stations of that area, and the photos of surrounding scenes taken in that area. Our approach uses the Inverse Distance Weighted (IDW) technique to estimate the missing air pollutant levels and then uses regression to integrate visual features from the taken photos to optimize the predicted values. After that, we can use those values to calculate the Air Quality Index (AQI). The results show that the proposed method may not improve the performance of the prediction in some cases.
Personal Air Quality Index Prediction Using Inverse Distance Weighting Method
Paper: http://ceur-ws.org/Vol-2882/paper39.pdf
YouTube: https://youtu.be/3r_oSguFPVM
Trung-Quan Nguyen, Dang-Hieu Nguyen and Loc Tai Tan Nguyen : Personal Air Quality Index Prediction Using Inverse Distance Weighting Method. Proc. of MediaEval 2020, 14-15 December 2020, Online.
In this paper, we propose a method to predict the personal air quality index in an area by only using the levels of the following pollutants: PM2.5, NO2, O3. All of them are measured from the nearby weather stations of that area. Our approach uses one of the most well-known interpolation methods in spatial analysis, the Inverse Distance Weighted (IDW) technique, to estimate the missing air pollutant levels. After that, we can use those levels to calculate the Air Quality Index (AQI). The results show that the proposed method is suitable for the prediction of those air pollutant levels.
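The IDW interpolation step can be sketched as follows; the function name, the power parameter, and the planar distance are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def idw_estimate(stations: np.ndarray, values: np.ndarray,
                 point: np.ndarray, power: float = 2.0) -> float:
    """Inverse Distance Weighted estimate of a pollutant level at `point`.

    stations: (n, 2) coordinates of stations with known readings
    values:   (n,)   pollutant levels (e.g. PM2.5) at those stations
    """
    dists = np.linalg.norm(stations - point, axis=1)
    if np.any(dists == 0):                       # query point sits on a station
        return float(values[dists == 0][0])
    weights = 1.0 / dists ** power               # nearer stations weigh more
    return float(np.sum(weights * values) / np.sum(weights))
```

The interpolated pollutant levels would then feed the standard AQI formula for the final index.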
Overview of MediaEval 2020 Insights for Wellbeing: Multimodal Personal Health...
Paper: http://ceur-ws.org/Vol-2882/paper11.pdf
YouTube: https://youtu.be/fBPuacAZkxs
Minh-Son Dao, Peijiang Zhao, Thanh Nguyen, Thanh Binh Nguyen, Duc Tien Dang Nguyen and Cathal Gurrin : Overview of MediaEval 2020 Insights for Wellbeing: Multimodal Personal Health Lifelog Data Analysis. Proc. of MediaEval 2020, 14-15 December 2020, Online.
This paper provides a description of the MediaEval 2020 “Multimodal personal health lifelog data analysis" task. The purpose of this task is to develop approaches that process environment data to obtain insights about personal wellbeing. Establishing the association between people’s wellbeing and the properties of the surrounding environment is vital for numerous research areas. Our task focuses on the internal associations of heterogeneous data. Participants create systems that derive insights from multimodal lifelog data that are important for health and wellbeing to tackle two challenging subtasks. The first task is to investigate whether we can use public/open data to predict personal air pollution data. The second task is to develop approaches to predict the personal air quality index (AQI) using images captured by people (plus GAQD). This task targets (but is not limited to) researchers in the areas of multimedia information retrieval, machine learning, AI, data science, event-based processing and analysis, multimodal multimedia content analysis, lifelog data analysis, urban computing, environmental science, and atmospheric science.
Presented by: Peijiang Zhao
Flood Detection via Twitter Streams using Textual and Visual Features
Paper: http://ceur-ws.org/Vol-2882/paper35.pdf
Firoj Alam, Zohaib Hassan, Kashif Ahmad, Asma Gul, Michael Reiglar, Nicola Conci and Ala Al-Fuqaha : Flood Detection via Twitter Streams using Textual and Visual Features. Proc. of MediaEval 2020, 14-15 December 2020, Online.
The paper presents our proposed solutions for the MediaEval 2020 Flood-Related Multimedia Task, which aims to analyze and detect flooding events in multimedia content shared over Twitter. In total, we proposed four different solutions: a multi-modal solution combining textual and visual information for the mandatory run, and three single-modal image- and text-based solutions as optional runs. In the multi-modal method, we rely on a supervised multimodal bitransformer model that combines textual and visual features in an early fusion, achieving a micro F1-score of .859 on the development data set. For the text-based flood events detection, we use a transformer network (i.e., a pretrained Italian BERT model) achieving an F1-score of .853. For the image-based solutions, we employed multiple deep models, pre-trained on both the ImageNet and Places data sets, individually and combined in an early fusion, achieving F1-scores of .816 and .805 on the development set, respectively.
Floods Detection in Twitter Text and Images
Paper: http://ceur-ws.org/Vol-2882/paper34.pdf
YouTube: https://youtu.be/3f_Q1WeulbI
Naina Said, Kashif Ahmad, Asma Gul, Nasir Ahmad and Ala Al-Fuqaha : Floods Detection in Twitter Text and Images. Proc. of MediaEval 2020, 14-15 December 2020, Online.
In this paper, we present our methods for the MediaEval 2020 Flood Related Multimedia task, which aims to analyze and combine textual and visual content from social media for the detection of real-world flooding events. The task mainly focuses on identifying flood-related tweets relevant to a specific area. We propose several schemes to address the challenge. For text-based flood events detection, we use three different methods, relying on Bag of Words (BoW) and an Italian version of BERT, individually and in combination, achieving F1-scores of 0.77, 0.68, and 0.70 on the development set, respectively. For the visual analysis, we rely on features extracted via multiple state-of-the-art deep models pre-trained on ImageNet. The extracted features are then used to train multiple individual classifiers whose scores are combined in a late fusion manner, achieving an F1-score of 0.75. For our mandatory multi-modal run, we combine the classification scores obtained with the best textual and visual schemes in a late fusion manner. Overall, better results are obtained with the multimodal scheme, achieving an F1-score of 0.80 on the development set.
Presented by: Naina Said
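The late-fusion step described in the abstract (combining per-modality classification scores) can be illustrated with a small weighted-average sketch; the equal weights and toy probabilities are assumptions, not the paper's tuned values:

```python
import numpy as np

def late_fusion(text_scores: np.ndarray, image_scores: np.ndarray,
                w_text: float = 0.5, w_image: float = 0.5) -> np.ndarray:
    """Weighted average of per-tweet flood probabilities from two modalities."""
    return w_text * text_scores + w_image * image_scores

text_p = np.array([0.9, 0.2, 0.6])    # toy text-classifier probabilities
image_p = np.array([0.7, 0.4, 0.1])   # toy image-classifier probabilities
fused = late_fusion(text_p, image_p)
labels = (fused >= 0.5).astype(int)   # final flood / no-flood decision
```

Late fusion keeps each modality's classifier independent, so a missing image or empty text simply drops one term from the average.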
HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching
1. HCMUS at MediaEval 2020: Image-Text Fusion for Automatic
News-Images Re-Matching
Thuc Nguyen-Quang 1,3, Tuan-Duy H. Nguyen1,3, Thang-Long Nguyen-Ho1,3,
Anh-Kiet Duong1,3, Xuan-Nhat Hoang1,3, Vinh-Thuyen Nguyen-Truong1,3,
Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1University of Science, VNU-HCM,
2John von Neumann Institute, VNU-HCM,
3Vietnam National University, Ho Chi Minh city, Vietnam
December 14-15, 2020
T. Nguyen-Quang et al. HCMUS at MediaEval 2020
2. Outline
1 Introduction
2 Methods
Metric Learning
Image-Text Matching via Categorization
Image-Text Fusion with Image Captioning and
Contextual Embeddings
Image-Text Fusion with Knowledge Graph-based
Contextual Embeddings
Graph-based Face-Name Matching
Ensemble
3 Results
4 Conclusion and future works
5 Bibliography
4. Introduction
Introduction
We mainly focus on fusing cross-modal embedded information, extracted as:
Simple set intersection
Deep neural features
Knowledge-graph-enhanced neural features
5. Introduction
Introduction
M1 Metric Learning
M2 Image-Text Matching via Categorization
M3 Image-Text Fusion with Image Captioning and Contextual Embeddings
M4 Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
M5 Graph-based Face-Name Matching
7. Methods • § Metric Learning
Metric Learning
Using a Triplet Loss model to project embeddings of image-text pairs to bases of
significant similarity.
Title texts are embedded with BERT.
Image embeddings:
Global context embedding: EfficientNet
Local context embedding: Top-k bottom-up-attention objects passed to a
self-attention sequential model.
8. Methods • § Image-Text Matching via Categorization
Image-Text Matching via Categorization
Categorizing images and texts with two gradient boosting decision trees.
Target categories extracted from URLs:
nrw
kultur
region
panorama
sport
wirtschaft
koeln
ratgeber
politik
unknown
9. Methods • § Image-Text Matching via Categorization
Image-Text Matching via Categorization
Augment and extract image features with VGG16, InceptionResNetV2, MobileNetV2,
EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge, DenseNet201.
Texts are mapped to BERT and ELECTRA contextual embeddings.
An iterative ranking method that takes into account the order of matched categories:
At the k-th iteration, find the top-k categories for each image and the top-k categories
for each article.
For each article: candidate images are those whose top-k categories intersect those
of the article.
Sequentially concatenate the k candidate lists, then append the remaining images at
the tail to make the final ranked list.
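The iterative category-intersection ranking can be sketched like this; the function name and the toy category distributions are illustrative assumptions:

```python
import numpy as np

def rank_by_category(article_probs: np.ndarray, image_probs: np.ndarray) -> list:
    """Iterative ranking by category intersection for one article.

    article_probs: (C,) predicted category distribution for the article
    image_probs:   (N, C) predicted category distributions for all images
    Returns image indices ordered from best to worst match.
    """
    n_images, n_cats = image_probs.shape
    ranked, seen = [], set()
    for k in range(1, n_cats + 1):
        art_topk = set(np.argsort(-article_probs)[:k])
        for i in range(n_images):
            if i in seen:
                continue
            img_topk = set(np.argsort(-image_probs[i])[:k])
            if art_topk & img_topk:        # shared top-k category: add to list
                ranked.append(i)
                seen.add(i)
    # safety net: append any images never matched
    ranked += [i for i in range(n_images) if i not in seen]
    return ranked
```

Images matched at a small k (strong category agreement) land near the head of the list, which realizes the "order of matched categories" idea above.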
10. Methods • § Image-Text Fusion with Image Captioning and Contextual Embeddings
Image-Text Fusion with Image Captioning and Contextual
Embeddings
We hypothesize that the description of the image is semantically similar to the title.
The captioning model consists of three parts:
Image feature extractor: we use EfficientNet (Tan and Le, EfficientNet:
Rethinking Model Scaling for Convolutional Neural Networks) for feature
extraction; the feature has shape (8, 8, 2048).
Feature encoder: the features pass through a fully connected layer, yielding a
256-dimensional vector.
Decoder: to generate the caption, we use Bahdanau attention (Bahdanau, Cho,
and Bengio, Neural Machine Translation by Jointly Learning to Align and
Translate) and a GRU to predict the next word.
11. Methods • § Image-Text Fusion with Image Captioning and Contextual Embeddings
Image-Text Fusion with Image Captioning and Contextual
Embeddings
To represent the caption and the title as vectors, we use RoBERTa and doc2vec. Then
we compute their similarity via:
S_total = S_wiki + S_apnews + S_RoBERTa + (1 − D_fuzzy) + (1 − D_partial)
where S_wiki, S_apnews, S_RoBERTa are the cosine similarities of the two vectors
generated by enwiki_dbow, apnews_dbow, and RoBERTa, respectively, and
D_fuzzy, D_partial are the fuzzywuzzy full and partial ratio distances, respectively.
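A sketch of the combined score, using the standard library's difflib as a stand-in for the fuzzywuzzy full and partial ratios and placeholder embedding dictionaries (the model names and helper functions are assumptions):

```python
import difflib
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def partial_ratio(a: str, b: str) -> float:
    """Crude stand-in for fuzzywuzzy's partial ratio: best window match."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    return max(
        difflib.SequenceMatcher(None, short, long_[i:i + len(short)]).ratio()
        for i in range(len(long_) - len(short) + 1)
    )

def total_similarity(caption: str, title: str,
                     cap_vecs: dict, title_vecs: dict) -> float:
    """S_total = S_wiki + S_apnews + S_RoBERTa + (1 - D_fuzzy) + (1 - D_partial)."""
    s_embed = sum(cosine(cap_vecs[m], title_vecs[m])
                  for m in ("wiki", "apnews", "roberta"))
    s_fuzzy = difflib.SequenceMatcher(None, caption, title).ratio()  # = 1 - D_fuzzy
    return s_embed + s_fuzzy + partial_ratio(caption, title)         # + (1 - D_partial)
```

With identical caption and title and identical embeddings, the score reaches its maximum of 5.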
12. Methods • § Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
Image-Text Fusion with Knowledge Graph-based Contextual
Embeddings
To account for high-level semantics, we exploit the BabelNet knowledge graph.
For articles:
Link textual entities in the text to their synsets in the WordNet subset of
BabelNet using the EWISER word sense disambiguator.
Use the mean of the SensEmBERT+LMMS embeddings corresponding to these
extracted synsets to represent the text.
For images:
Use TResNet-L with Asymmetric Loss (ASL), pre-trained on OpenImagesV6, to
extract multi-label predictions from images.
Map the concatenated labels to SensEmBERT+LMMS synset embeddings, as for
the texts.
13. Methods • § Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
Image-Text Fusion with Knowledge Graph-based Contextual
Embeddings
Train a canonical correlation analysis (CCA) on the training set to project cross-modal
embeddings to bases of significant similarity.
Finally, rank all images in the test set by the L2 distance between the transformed
embeddings.
14. Methods • § Graph-based Face-Name Matching
Graph-based Face-Name Matching
In many instances, the publisher uses a portrait of somebody mentioned in the text.
Person name extraction: we use entity-fishing to automatically extract people’s
names from the text.
Face encoding: we use the open-source face_recognition library to detect faces and
represent each as a 128-dimensional vector.
We connect each person mentioned in the articles with the features extracted from
accompanying faces in the training set.
During testing, we encode the faces in the image and aggregate the number of
matched faces connected to the people mentioned in the text.
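The face-name aggregation step can be sketched as follows; the `known_faces` graph, the names, and the random vectors standing in for real 128-d encodings are assumptions, while the 0.6 tolerance mirrors the face_recognition library's default distance threshold:

```python
import numpy as np

# name -> list of 128-d face encodings collected from the training set
# (real encodings would come from the face_recognition library)
rng = np.random.default_rng(0)
known_faces = {
    "Alice": [rng.standard_normal(128)],
    "Bob": [rng.standard_normal(128)],
}

def match_score(face_encoding: np.ndarray, names_in_text: list,
                tolerance: float = 0.6) -> int:
    """Count known faces of the mentioned people that match the photo's face."""
    matches = 0
    for name in names_in_text:
        for known in known_faces.get(name, []):
            # Euclidean distance, the same metric face_recognition compares with
            if np.linalg.norm(known - face_encoding) < tolerance:
                matches += 1
    return matches
```

Candidate images would then be ranked by this aggregated match count against the article's extracted names.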
15. Methods • § Ensemble
Ensemble
The Ensemble submission combines all described methods, weighting each model
based on its performance. The final ranking of a candidate image is:
R_Ensemble = w1·R_Caption + w2·R_Triplet + w3·R_Face + w4·R_KG-Fusion
where R_Caption, R_Triplet, R_Face, R_KG-Fusion are the ranks of the image produced by
the respective methods.
The weighting factors are empirically chosen as w1 = w4 = 1, w2 = 0.02, and w3 = 0.25.
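The weighted rank combination can be sketched directly; the toy per-method ranks below are assumptions, while the weights are the slide's empirical values:

```python
import numpy as np

def ensemble_rank(ranks: dict, weights: dict) -> np.ndarray:
    """Weighted sum of per-method ranks; lower combined score = better image."""
    combined = sum(weights[m] * ranks[m] for m in weights)
    return np.argsort(combined)            # image indices, best first

# Toy ranks (position of each image in each method's list)
ranks = {
    "caption": np.array([0, 1, 2]),
    "triplet": np.array([2, 0, 1]),
    "face":    np.array([1, 2, 0]),
    "kg":      np.array([0, 2, 1]),
}
weights = {"caption": 1.0, "triplet": 0.02, "face": 0.25, "kg": 1.0}
final_order = ensemble_rank(ranks, weights)
```

Because ranks (not scores) are combined, each method contributes on a comparable scale regardless of how its raw similarity values are distributed.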
18. Conclusion and future works
Conclusion
Our methods systematically improve performance on the recall@100 metric.
Consistent results, i.e., high-ranking images are of relevance to queried articles.
19. Conclusion and future works
Conclusion
Incorporating high-level semantics increases performance.
System builders should use multiple methods to handle different aspects of the
complex image-text multimodal relation.
20. Conclusion and future works
Future Works
Investigate better fusion methods.
Thorough ablation study for proposed methods.
Enhance the dataset for thorough evaluation with information retrieval metrics
like NDCG.
21. Bibliography
References I
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2016. arXiv:
1409.0473 [cs.CL].
Ben-Baruch, Emanuel et al. “Asymmetric Loss For Multi-Label Classification”. In: arXiv preprint arXiv:2009.14119 (2020).
Bevilacqua, Michele and Roberto Navigli. “Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by
incorporating knowledge graph information”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020,
pp. 2854–2864.
Bollacker, Kurt et al. “Freebase: a collaboratively created graph database for structuring human knowledge”. In: Proceedings of the 2008 ACM
SIGMOD international conference on Management of data. 2008, pp. 1247–1250.
Chan, Branden, Timo Möller, Malte Pietsch, and Tanay Soni. “Model from https://huggingface.co/bert-base-german-cased”. In: (2020).
Chan, Branden, Stefan Schweter, and Timo Möller. German’s Next Language Model. 2020. arXiv: 2010.10906 [cs.CL].
Chollet, François. “Xception: Deep learning with depthwise separable convolutions”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2017, pp. 1251–1258.
dbmdz. “Model from https://huggingface.co/dbmdz/bert-base-german-uncased”. In: (2020).
Devlin, Jacob et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
Geitgey, Adam. Face Recognition. 2018. url: https://github.com/ageitgey/face_recognition.
He, Kaiming et al. “Identity mappings in deep residual networks”. In: European conference on computer vision. Springer. 2016, pp. 630–645.
Hoffer, Elad and Nir Ailon. Deep metric learning using Triplet network. 2018. arXiv: 1412.6622 [cs.LG].
Hossain, MD Zakir et al. “A comprehensive survey of deep learning for image captioning”. In: ACM Computing Surveys (CSUR) 51.6 (2019),
pp. 1–36.
Huang, Gao et al. “Densely connected convolutional networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2017, pp. 4700–4708.
22. Bibliography
References II
Ke, Guolin et al. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”. In: Advances in Neural Information Processing Systems. Ed. by
I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017, pp. 3146–3154. url:
https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
Kille, Benjamin, Andreas Lommatzsch, and Özlem Özgöbek. “News Images in MediaEval 2020”. In: Proc. of the MediaEval 2020 Workshop. Online.
2020.
King, Davis E. dlib-models. 2018. url: https://github.com/davisking/dlib-models.
Kuznetsova, Alina et al. “The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale”. In:
IJCV (2020).
Lau, Jey Han and Timothy Baldwin. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. 2016. arXiv:
1607.05368 [cs.CL].
Lopez, Patrice. Entity Fishing. 2020. url: https://github.com/kermitt2/entity-fishing.
“Model from https://huggingface.co/german-nlp-group/electra-base-german-uncased”. In: (2020).
“Model from https://huggingface.co/T-Systems-onsite/bert-german-dbmdz-uncased-sentence-stsb”. In: (2020).
Navigli, Roberto and Simone Paolo Ponzetto. “BabelNet: Building a very large multilingual semantic network”. In: Proceedings of the 48th annual
meeting of the association for computational linguistics. 2010, pp. 216–225.
Oostdijk, NHJ et al. “The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis”. In: (2020).
Ridnik, Tal et al. “TResNet: High Performance GPU-Dedicated Architecture”. In: arXiv preprint arXiv:2003.13630 (2020).
Sandler, Mark et al. “Mobilenetv2: Inverted residuals and linear bottlenecks”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2018, pp. 4510–4520.
Simonyan, Karen and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556
(2014).
23. Bibliography
References III
Szegedy, Christian et al. “Inception-v4, inception-resnet and the impact of residual connections on learning”. In: arXiv preprint arXiv:1602.07261
(2016).
Tan, Mingxing and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2020. arXiv: 1905.11946 [cs.LG].
Xu, Kelvin et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. 2016. arXiv: 1502.03044 [cs.LG].
Zoph, Barret et al. “Learning transferable architectures for scalable image recognition”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2018, pp. 8697–8710.