Ukrainian Catholic University
Faculty of Applied Sciences
Data Science Master Program
January 22nd
Abstract. Generative adversarial networks (GANs) are among the most popular models capable of producing high-quality images. However, most works generate images from a vector of random values, without explicit control over the desired output properties. We study ways of introducing such control for a user-selected region of interest (RoI). First, we overview and analyze existing work in the areas of image completion (inpainting) and controllable generation. Second, we propose a GAN-based model, which unites approaches from the two mentioned areas, for controllable local content generation. Third, we evaluate the controllability of our model on three accessible datasets – CelebA, Cats, and Cars – and present numerical and visual results of our method.
7. Related Work
1. Rasiwasia et al. (2010) - cross-modal retrieval for Wikipedia articles. The dataset contains featured articles from the 10 most popular categories. Their approach exploits the correlation between text and image features, obtained via latent Dirichlet allocation and SIFT models respectively.
2. Hessel et al. (2018) - visual concreteness of particular topics in Wikipedia articles. The dataset contains the 192K most popular articles, specifically their included images and topics.
3. Dong et al. (2018) - cross-modal retrieval on the Flickr dataset, leveraging deep neural networks.
11. Collection
1. article
a. text content
b. title
2. images
a. raw images
b. metadata: description, title
c. only publicly available (a possible record layout is sketched below)
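Purely for illustration, the collected fields above could be represented roughly as follows; the class and field names are our own assumptions, not the actual schema of the released dataset.

```python
# Hypothetical record layout mirroring the bullet points above;
# class and field names are illustrative, not the released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class WikiImage:
    raw_path: str       # path/URL of the raw image file
    title: str          # metadata: image title
    description: str    # metadata: image description

@dataclass
class WikiArticle:
    title: str
    text: str                                               # article text content
    images: List[WikiImage] = field(default_factory=list)   # publicly available images only
```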
12. Preprocessing
1. text:
a. wiki-markup removal
2. image:
a. converting everything to 600px-width JPEG
b. icon removal
c. title words parsing
d. storing image features (computed with ResNet152; see the sketch below)
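A minimal sketch of how the ResNet152 features in step d could be precomputed, assuming PyTorch/torchvision are used; the helper name and preprocessing constants follow standard ImageNet practice and are not taken from our pipeline.

```python
# Sketch: precompute a 2048-d ResNet152 feature vector per image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Standard ImageNet preprocessing for ResNet-family models.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Load ResNet152 and drop the final classification layer to keep pooled features.
resnet = models.resnet152(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

def image_features(path):
    """Return a 2048-dimensional feature vector for a single image file."""
    img = Image.open(path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)       # shape: (1, 3, 224, 224)
    with torch.no_grad():
        feats = feature_extractor(batch)       # shape: (1, 2048, 1, 1)
    return feats.flatten().numpy()             # shape: (2048,)
```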
16. Evaluation Setting
1. image-level split
a. images from the same article might appear in both test and train subsets
b. theoretical model precision with a comprehensive fine-grained dataset
2. article-level split (both split strategies are sketched below)
a. images from the same article are always either in the test or in the train subset
b. real-world performance of this particular model
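Both split strategies can be sketched with scikit-learn, assuming a pandas DataFrame with one row per image and an article_id column; the file and column names are assumptions, not the actual pipeline.

```python
# Sketch of the two evaluation splits described above.
import pandas as pd
from sklearn.model_selection import train_test_split, GroupShuffleSplit

df = pd.read_csv("wiki_images.csv")  # hypothetical file with one row per image

# Image-level split: rows are shuffled independently, so images from the same
# article can land in both train and test.
img_train, img_test = train_test_split(df, test_size=0.2, random_state=42)

# Article-level split: all images of an article stay on the same side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["article_id"]))
art_train, art_test = df.iloc[train_idx], df.iloc[test_idx]
```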
17. Baseline
An alternative to the multimodal approach is classical text-based
techniques. We will experiment with the following models and choose
the best one as our baseline (a similarity-ranking sketch follows the list):
● word2vec
● wikipedia2vec
● inferText
● co-occurrence
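As an illustration of how a text-only baseline of this kind might rank candidate images, the sketch below averages pretrained word2vec vectors and scores images by cosine similarity between the article text and each image's metadata; the model file and function names are assumptions, not our exact implementation.

```python
# Sketch of a text-similarity baseline using averaged word2vec vectors.
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")  # pretrained word2vec vectors

def embed(text):
    """Average the word2vec vectors of in-vocabulary tokens."""
    vecs = [w2v[tok] for tok in text.lower().split() if tok in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def rank_images(article_text, image_captions):
    """Return image indices sorted by cosine similarity to the article text."""
    a = embed(article_text)
    sims = []
    for cap in image_captions:
        c = embed(cap)
        denom = np.linalg.norm(a) * np.linalg.norm(c) + 1e-9
        sims.append(float(a @ c) / denom)
    return np.argsort(sims)[::-1]
```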
26. Contribution (Conclusions)
1. Dataset collection
a. 36.4K articles
b. 216K images
2. Identification of the best-performing text-similarity baseline
3. Adjustment of the Word2VisualVec model to our real-world data
a. image-level model outperformed baseline by 145%*
b. article-level model outperformed baseline by 37%*
* performance compared by averaging the R@1, R@3, and R@10 scores (a small recall sketch is shown below)
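A small sketch of how the averaged recall score in the footnote could be computed; the function names are ours and the snippet is illustrative rather than the exact evaluation code.

```python
# Sketch: Recall@K per query article, then the mean of R@1, R@3 and R@10.
import numpy as np

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant image appears in the top-k ranked results, else 0.0."""
    return float(relevant_id in ranked_ids[:k])

def mean_recall_score(all_rankings, all_relevant, ks=(1, 3, 10)):
    """Average of R@1, R@3, R@10 over all query articles."""
    per_k = []
    for k in ks:
        scores = [recall_at_k(r, rel, k) for r, rel in zip(all_rankings, all_relevant)]
        per_k.append(np.mean(scores))
    return float(np.mean(per_k))
```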
27. Future Work
● create an API for our model to be accessible in real time
● adjust the evaluation metric to recognise all photos of the same entity as correct matches, not just the one mentioned in the article
● properly experiment with a compound Word2VisualVec + text-similarity model
● try a more complex model that learns the best feature representation rather than assuming one
● use more metadata, such as article topics
● retrain the model on a bigger “good articles” dataset
28. Review Comments
1. There are no implementation details described for the text encoding methods (see Section 4.3.2), even though they are crucial for proper performance
a. Rather disagree. All details are described in the model authors' original paper. We concentrated on covering our own contribution in the thesis, but we can see the benefit of replicating this information to make the thesis more self-contained
2. There are no dataset statistics, train/val split descriptions, and so on in the thesis or on the relevant Kaggle dataset page
a. Disagree. Statistics on article/image counts are available, and dataset selection, collection, cleaning, and formatting are described in detail. But we agree that additional EDA would be beneficial.
3. The problems with the presentation are small but numerous
a. Agree. The experimental section could be presented better.
31. Conclusions
1. Developed a simple cross-modal retrieval model, which
significantly outperforms our baseline
2. Showed that performance might be significantly better with a huge fine-grained dataset
3. Developed a simple text-similarity model to show that it contains supplementary predictive power
4. Created a real-world multimodal dataset, which is publicly
available