HCMUS at MediaEval 2020: Image-Text Fusion for Automatic
News-Images Re-Matching
Thuc Nguyen-Quang 1,3, Tuan-Duy H. Nguyen1,3, Thang-Long Nguyen-Ho1,3,
Anh-Kiet Duong1,3, Xuan-Nhat Hoang1,3, Vinh-Thuyen Nguyen-Truong1,3,
Hai-Dang Nguyen1,3, Minh-Triet Tran1,2,3
1University of Science, VNU-HCM,
2John von Neumann Institute, VNU-HCM,
3Vietnam National University, Ho Chi Minh City, Vietnam
December 14-15, 2020
1 Introduction
2 Methods
Metric Learning
Image-Text Matching via Categorization
Image-Text Fusion with Image Captioning and
Contextual Embeddings
Image-Text Fusion with Knowledge Graph-based
Contextual Embeddings
Graph-based Face-Name Matching
3 Results
4 Conclusion and future works
5 Bibliography
Our methods mainly concern fusing cross-modal embedded information, extracted as:
Simple set intersection
Deep neural features
Knowledge-graph-enhanced neural features
M1 Metric Learning
M2 Image-Text Matching via Categorization
M3 Image-Text Fusion with Image Captioning and Contextual Embeddings
M4 Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
M5 Graph-based Face-Name Matching
Metric Learning
We use a triplet-loss model to project the embeddings of image-text pairs to bases of
significant similarity.
Title texts are embedded with BERT.
Image embeddings:
Global context embedding: EfficientNet
Local context embedding: Top-k bottom-up-attention objects passed to a
self-attention sequential model.
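As a rough illustration of the triplet objective, here is a minimal sketch in plain Python. The Euclidean distance, the margin value of 0.2, and the function names are assumptions made for the example; the slides do not state which distance or margin was used.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss.

    anchor:   embedding of a title (hypothetical input)
    positive: embedding of the image published with that title
    negative: embedding of an unrelated image
    margin:   assumed value, not taken from the slides
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    # The loss is zero once the negative is at least `margin` farther away
    # from the anchor than the positive.
    return max(0.0, d_pos - d_neg + margin)


title = [0.0, 0.0]
matching_image = [0.0, 1.0]
unrelated_image = [3.0, 4.0]
well_separated = triplet_loss(title, matching_image, unrelated_image)
violating = triplet_loss(title, unrelated_image, matching_image)
```

In this toy case the first triplet is already separated by more than the margin, so it contributes no loss, while the swapped triplet is penalized.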
Image-Text Matching via Categorization
We categorize images and texts with two gradient boosting decision trees.
Target categories are extracted from URLs.
Augment and extract image features with VGG16, InceptionResNetV2, MobileNetV2,
EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge, DenseNet201.
Texts are mapped to BERT and ELECTRA contextual embeddings.
We use an iterative ranking method that takes into account the order of matched categories:
At the k-th iteration, find the top-k categories for each image and the top-k categories
for each article.
For each article, the candidate images are those whose top-k categories intersect those
of the article.
Sequentially concatenate the k candidate lists, then append the remaining images to
the tail to make the final ranked list.
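The iterative ranking steps above can be sketched as follows. This is only one possible reading: the de-duplication of images across iterations, the ordering inside each candidate list, and all names are assumptions made for the example.

```python
def rank_images(article_categories, image_categories, max_k):
    """Rank images for one article by iteratively widening the top-k match.

    article_categories: categories predicted for the article, best first
    image_categories:   dict image_id -> categories predicted for it, best first
    max_k:              number of iterations (hypothetical parameter)
    """
    ranked = []
    seen = set()
    for k in range(1, max_k + 1):
        article_top_k = set(article_categories[:k])
        for image_id, categories in image_categories.items():
            # Assumption: an image matched at an earlier iteration keeps its
            # earlier, better position and is not listed again.
            if image_id not in seen and article_top_k & set(categories[:k]):
                ranked.append(image_id)
                seen.add(image_id)
    # Images that never matched go to the tail of the list.
    ranked.extend(i for i in image_categories if i not in seen)
    return ranked


article = ["sports", "politics", "economy"]
images = {
    "a": ["politics", "sports", "travel"],
    "b": ["sports", "travel", "economy"],
    "c": ["travel", "food", "culture"],
    "d": ["economy", "politics", "sports"],
}
ranking = rank_images(article, images, max_k=2)  # ['b', 'a', 'd', 'c']
```

Image "b" matches at the first iteration, "a" and "d" at the second, and "c" never matches, so it is appended at the end.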
Image-Text Fusion with Image Captioning and Contextual Embeddings
We hypothesize that the description of the image is semantically similar to the title.
The captioning model consists of three parts:
Image feature extractor: We use EfficientNet (Tan and Le, 2020) for feature
extraction. The extracted feature map has shape (8, 8, 2048).
Feature encoder: The features pass through a fully connected layer, yielding a feature vector.
Decoder: To generate the caption, we use Bahdanau attention (Bahdanau, Cho,
and Bengio, 2016) and a GRU to predict the next word.
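To make the decoder step more concrete, here is a minimal sketch of additive (Bahdanau-style) attention in plain Python. It assumes the feature map is flattened into a list of region vectors (64 regions for an 8 x 8 map) and uses tiny hand-set weight matrices; the names W_feat, W_hid and v, and all values, are hypothetical and stand in for learned parameters.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(regions, hidden, W_feat, W_hid, v):
    """Return attention weights over regions and the resulting context vector.

    score_i = v . tanh(W_feat @ region_i + W_hid @ hidden)
    """
    projected_hidden = matvec(W_hid, hidden)
    scores = []
    for region in regions:
        projected_region = matvec(W_feat, region)
        energy = [math.tanh(a + b) for a, b in zip(projected_region, projected_hidden)]
        scores.append(sum(vi * e for vi, e in zip(v, energy)))
    # Softmax over regions, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region features.
    dim = len(regions[0])
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0]]   # two toy regions instead of 64
hidden = [0.0]                        # toy decoder (GRU) state
W_feat = [[1.0, 0.0], [0.0, 1.0]]
W_hid = [[0.0], [0.0]]
v = [1.0, 0.0]
weights, context = additive_attention(regions, hidden, W_feat, W_hid, v)
```

In a full decoder the context vector would be combined with the previous word embedding and fed to the GRU, which then predicts the next word.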
To represent the caption and the title as vectors, we use RoBERTa and doc2vec. Then
we compute their similarity via:
$S_{\text{total}} = S_{\text{wiki}} + S_{\text{apnews}} + S_{\text{RoBERTa}} + (1 - D_{\text{fuzzy}}) + (1 - D_{\text{partial}})$
$S_{\text{wiki}}$, $S_{\text{apnews}}$, and $S_{\text{RoBERTa}}$ are the cosine similarities of the two vectors generated by enwiki dbow,
apnews dbow, and RoBERTa, respectively.
$D_{\text{fuzzy}}$ and $D_{\text{partial}}$ are the fuzzywuzzy ratio and partial ratio, respectively.
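The total similarity score can be sketched as below. The sketch assumes that the three embedding pairs are already computed and that the two string-matching terms have been scaled to [0, 1] beforehand; the dictionary keys and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def total_similarity(caption_vecs, title_vecs, d_fuzzy, d_partial):
    """Sum the three cosine similarities and the two string-matching terms.

    caption_vecs, title_vecs: dicts with keys "wiki", "apnews", "roberta"
    d_fuzzy, d_partial:       string-matching terms, assumed to lie in [0, 1]
    """
    cosine_part = sum(
        cosine_similarity(caption_vecs[name], title_vecs[name])
        for name in ("wiki", "apnews", "roberta")
    )
    return cosine_part + (1 - d_fuzzy) + (1 - d_partial)


caption = {"wiki": [1.0, 2.0], "apnews": [0.5, 0.5], "roberta": [3.0, 1.0]}
title = {"wiki": [1.0, 2.0], "apnews": [0.5, 0.5], "roberta": [3.0, 1.0]}
score = total_similarity(caption, title, d_fuzzy=0.25, d_partial=0.5)
```

With identical caption and title vectors each cosine term is 1, so the toy score is 3 plus the two string-matching terms.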
Image-Text Fusion with Knowledge Graph-based Contextual Embeddings
To account for high-level semantics, we exploit the BabelNet knowledge graph.
For articles:
Link textual entities from texts to their synsets in the WordNet subset of
BabelNet using the EWISER word sense disambiguator.
Represent each text by the mean of the SensEmBERT+LMMS embeddings
corresponding to the extracted synsets.
For images:
Use TResNet-L with Asymmetric Loss (ASL), pre-trained on OpenImagesV6, to
extract multiple labels from each image.
Map the concatenated labels to SensEmBERT+LMMS synset embeddings, as done for
the texts.
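The averaging step used for the texts can be sketched as follows. The embedding table, the synset keys, and the choice to skip synsets without an embedding are assumptions made for the example.

```python
def mean_embedding(synset_ids, embedding_table):
    """Average the embeddings of the given synsets.

    synset_ids:      synsets linked from a text (or mapped from image labels)
    embedding_table: dict synset_id -> vector (hypothetical lookup table)
    Returns None when no synset has an embedding.
    """
    vectors = [embedding_table[s] for s in synset_ids if s in embedding_table]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


table = {"synset_a": [1.0, 3.0], "synset_b": [3.0, 5.0]}
text_vector = mean_embedding(["synset_a", "synset_b", "synset_missing"], table)
```

Here the unknown synset is ignored and the text is represented by the mean of the two known vectors.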
Train a canonical correlation analysis (CCA) model on the training set to project cross-modal
embeddings to bases of significant similarity.
Finally, rank all images in the test set using the L2 distance between the transformed
embeddings.
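The final ranking step can be sketched as below, assuming the CCA model has already been fitted and is represented by a mean vector and a linear projection per modality. The helper names and toy values are hypothetical; fitting the CCA itself is not shown.

```python
import math


def project(vector, mean, weights):
    """Center a vector and apply a linear projection (one row per output dim)."""
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(w * c for w, c in zip(row, centered)) for row in weights]


def rank_by_l2(article_vector, image_vectors):
    """Return image ids sorted by L2 distance to the article, closest first.

    article_vector: projected article embedding
    image_vectors:  dict image_id -> projected image embedding
    """
    def distance(vector):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(article_vector, vector)))

    return sorted(image_vectors, key=lambda image_id: distance(image_vectors[image_id]))


article = project([1.0, 1.0], mean=[1.0, 1.0], weights=[[1.0, 0.0], [0.0, 1.0]])
images = {"x": [3.0, 4.0], "y": [1.0, 0.0], "z": [0.0, 2.0]}
ranking = rank_by_l2(article, images)  # ['y', 'z', 'x']
```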
Graph-based Face-Name Matching
In many instances, the publisher uses a portrait of someone mentioned in the text.
Person name extraction: We use entity-fishing to automatically extract people's
names from the text.
Face encoding: We use the open-source face_recognition library to detect faces and
represent each face as a 128-dimensional vector.
We connect each person mentioned in the articles with features extracted from
accompanying faces on the train set.
During testing, we encode the face from the image and aggregate the number of
matched faces connected to the people mentioned in the text.
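The name-to-face graph and the match count can be sketched as follows. The sketch assumes that names and face encodings are already extracted, and uses a Euclidean distance threshold of 0.6 as the matching rule; that threshold, the data layout, and the function names are assumptions, and toy 2-dimensional vectors stand in for the 128-dimensional encodings.

```python
import math
from collections import defaultdict


def build_name_face_graph(train_samples):
    """Connect each mentioned person to the faces seen alongside the mention.

    train_samples: iterable of (names_in_article, face_encodings_in_image)
    """
    graph = defaultdict(list)
    for names, faces in train_samples:
        for name in names:
            graph[name].extend(faces)
    return graph


def count_matches(graph, names, test_faces, tolerance=0.6):
    """Count known faces of the mentioned people that match a test-image face."""
    matches = 0
    for name in names:
        for known_face in graph.get(name, []):
            for face in test_faces:
                # Assumed matching rule: two encodings match when they are
                # closer than `tolerance` in Euclidean distance.
                if math.dist(known_face, face) <= tolerance:
                    matches += 1
    return matches


train = [
    (["Alice"], [[0.0, 0.0]]),
    (["Alice", "Bob"], [[1.0, 1.0]]),
]
graph = build_name_face_graph(train)
alice_score = count_matches(graph, ["Alice"], [[0.1, 0.0]])
bob_score = count_matches(graph, ["Bob"], [[0.1, 0.0]])
```

Candidate images would then be ordered by this count for each article; how ties are broken is left open here.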
The Ensemble submission combines all described methods, weighting each model
based on its efficiency. As such, the final rank of a candidate image is:
$R_{\text{Ensemble}} = w_1 R_{\text{Caption}} + w_2 R_{\text{Triplet}} + w_3 R_{\text{Face}} + w_4 R_{\text{KG-Fusion}}$
$R_{\text{Ensemble}}$, $R_{\text{Caption}}$, $R_{\text{Triplet}}$, $R_{\text{Face}}$, and $R_{\text{KG-Fusion}}$ are the ranks of the image produced by the
respective methods.
Weighting factors are empirically chosen to be $w_1 = w_4 = 1$, $w_2 = 0.02$, and $w_3 = 0.25$.
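The weighted rank combination can be sketched as below, using the weights stated above. The sketch assumes that rank 1 is the best rank in every method, so a lower combined value is better; the method keys and the toy ranks are hypothetical.

```python
# Weights as stated on the slide: w1 (caption), w2 (triplet), w3 (face), w4 (KG fusion).
WEIGHTS = {"caption": 1.0, "triplet": 0.02, "face": 0.25, "kg_fusion": 1.0}


def ensemble_scores(ranks_by_method, weights=WEIGHTS):
    """Combine per-method ranks into one weighted value per image.

    ranks_by_method: dict method -> {image_id: rank}, with rank 1 as the best
    """
    image_ids = next(iter(ranks_by_method.values())).keys()
    return {
        image_id: sum(weights[m] * ranks_by_method[m][image_id] for m in weights)
        for image_id in image_ids
    }


def ensemble_ranking(ranks_by_method, weights=WEIGHTS):
    """Order images by combined value, lowest (best) first."""
    scores = ensemble_scores(ranks_by_method, weights)
    return sorted(scores, key=scores.get)


ranks = {
    "caption": {"a": 1, "b": 2, "c": 3},
    "triplet": {"a": 3, "b": 1, "c": 2},
    "face": {"a": 2, "b": 3, "c": 1},
    "kg_fusion": {"a": 2, "b": 1, "c": 3},
}
scores = ensemble_scores(ranks)
final_order = ensemble_ranking(ranks)  # ['a', 'b', 'c']
```

Because the triplet weight is small, image "a" stays first even though the triplet model ranks it last among the three.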
Figure: Submission result
Figure: Visualized result
Conclusion and future works
Our methods systematically increase the performance on the recall@100 metric.
Consistent results, i.e., high-ranking images are of relevance to queried articles.
Incorporating high-level semantics increases performance.
System builders should use multiple methods to handle different aspects of the
complex image-text multimodal relation.
Future Works
Investigate better fusion methods.
Conduct a thorough ablation study of the proposed methods.
Enhance the dataset for thorough evaluation with information retrieval metrics
such as NDCG.
References I
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2016. arXiv:
1409.0473 [cs.CL].
Ben-Baruch, Emanuel et al. “Asymmetric Loss For Multi-Label Classification”. In: arXiv preprint arXiv:2009.14119 (2020).
Bevilacqua, Michele and Roberto Navigli. “Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by
incorporating knowledge graph information”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020,
pp. 2854–2864.
Bollacker, Kurt et al. “Freebase: a collaboratively created graph database for structuring human knowledge”. In: Proceedings of the 2008 ACM
SIGMOD international conference on Management of data. 2008, pp. 1247–1250.
Branden Chan Timo Möller, Malte Pietsch Tanay Soni. “Model from”. In: (2020).
Chan, Branden, Stefan Schweter, and Timo Möller. German’s Next Language Model. 2020. arXiv: 2010.10906 [cs.CL].
Chollet, François. “Xception: Deep learning with depthwise separable convolutions”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2017, pp. 1251–1258.
dbmdz. “Model from”. In: (2020).
Devlin, Jacob et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
Geitgey, Adam. Face Recognition. 2018.
He, Kaiming et al. “Identity mappings in deep residual networks”. In: European conference on computer vision. Springer. 2016, pp. 630–645.
Hoffer, Elad and Nir Ailon. Deep metric learning using Triplet network. 2018. arXiv: 1412.6622 [cs.LG].
Hossain, MD Zakir et al. “A comprehensive survey of deep learning for image captioning”. In: ACM Computing Surveys (CSUR) 51.6 (2019),
pp. 1–36.
Huang, Gao et al. “Densely connected convolutional networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2017, pp. 4700–4708.
Ke, Guolin et al. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”. In: Advances in Neural Information Processing Systems. Ed. by
I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017, pp. 3146–3154.
Kille, Benjamin, Andreas Lommatzsch, and Özlem Özgöbek. “News Images in MediaEval 2020”. In: Proc. of the MediaEval 2020 Workshop. Online.
King, Davis E. dlib-models. 2018.
Kuznetsova, Alina et al. “The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale”. In:
IJCV (2020).
Lau, Jey Han and Timothy Baldwin. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. 2016. arXiv:
1607.05368 [cs.CL].
Lopez, Patrice. Entity Fishing. 2020.
Navigli, Roberto and Simone Paolo Ponzetto. “BabelNet: Building a very large multilingual semantic network”. In: Proceedings of the 48th annual
meeting of the association for computational linguistics. 2010, pp. 216–225.
Oostdijk, NHJ et al. “The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis”. In: (2020).
Ridnik, Tal et al. “TResNet: High Performance GPU-Dedicated Architecture”. In: arXiv preprint arXiv:2003.13630 (2020).
Sandler, Mark et al. “Mobilenetv2: Inverted residuals and linear bottlenecks”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2018, pp. 4510–4520.
Simonyan, Karen and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556
Szegedy, Christian et al. “Inception-v4, inception-resnet and the impact of residual connections on learning”. In: arXiv preprint arXiv:1602.07261
Tan, Mingxing and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2020. arXiv: 1905.11946 [cs.LG].
Xu, Kelvin et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. 2016. arXiv: 1502.03044 [cs.LG].
Zoph, Barret et al. “Learning transferable architectures for scalable image recognition”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2018, pp. 8697–8710.
