Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features
Placing Images with Refined Language Models and
Similarity Search with PCA-reduced VGG Features
Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1 and
Yiannis Kompatsiaris1
1 Information Technologies Institute (ITI), CERTH, Greece
2 CEA LIST, 91190 Gif-sur-Yvette, France
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
Summary
Tag-based location estimation (1 runs)
• Built upon the scheme of our 2015 participation (Kordopatis-Zilos et al.,
MediaEval 2015)
• Based on a refined probabilistic Language Model
Visual-based location estimation (1 run)
• Extract PCA-reduced VGG features to compute image similarities
• Geospatial clustering scheme of the most visually similar images
Hybrid location estimation (3 run)
• Combination of the textual and visual approaches using a set of rules
Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in test set (≈40M geotagged items)
• External data derived from gazetteers, i.e. Geonames and OpenStreetMap
G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. Socialsensor at mediaeval placing task
2015. In MediaEval 2015 Placing Task, 2015
Tag-based location estimation
• Processing steps of the approach
– Offline: language model construction
– Online: location estimation
OpenStreetMap
Pre-processing
• Tags and titles of the training set items are processed
• Apply
– URL decoding
– lowercase transformation
– tokenization
• Remove
– accents
– symbols
– punctuations
• The multi-word tags are split into their individual terms,
which are also included in the item's term set
• Discard numerics or less than three characters terms
Language Model (LM)
• LM-based estimation
– Most Likely Cell (mlc) considered the cell with the highest probability and
used to produce the estimation
𝑚𝑙𝑐𝑗 = arg max 𝑖
𝑘=1
𝑇 𝑗
𝑝(𝑡 𝑘|𝑐𝑖) ∗ 𝑤(𝑡 𝑘)
Inspired from (Popescu, MediaEval 2013)
• LM generation scheme
– divide earth surface in rectangular
cells with a side length of 0.01°
– calculate term-cell probabilities
𝑝(𝑡|𝑐) = 𝑁 𝑢/𝑁𝑡
A. Popescu. CEA LIST's participation at mediaeval 2013 placing task. In MediaEval 2013 Placing Task, 2013
Feature Selection and Weighting
Feature Weighting
• Locality weight function, a function based on term relative position in T
• Spatial Entropy weight function, a Gaussian function based on the term’s
spatial entropy
• Linear combination of the two weights
Feature Selection
• Calculate terms locality using a grid of 0.01°×0.01°
• When a user uses a given term, he/she is assigned to the
entire cell neighborhood instead of a unique cell:
𝑙 𝑡 = 𝑁𝑡 ∗
𝑐∈𝐶 𝑢∈𝑈𝑡,𝑐
|{𝑢′|𝑢′ ∈ 𝑈𝑡,𝑐, 𝑢′ ≠ 𝑢}|
𝑁𝑡
2
• Terms with non-zero locality score form the term set 𝑇
Refinements
• Multiple Grids
– Built an additional LM using a finer
grid (cell side length of 0.001°)
– combine the MLC of the individual
language models
• Similarity search (Van Laere et al., ICMR 2011)
– determine 𝑘 𝑡 most similar training images in the MLC
– their center-of-gravity is the final location estimation
From: (Kordopatis-Zilos et al., PAISI 2015)
G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a
refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015
Visual-based location estimation
Main Objectives
• Ensure that the visual features are generic and transferable
• Provide a compact representation of the features
Model building
• CNN features extracted by fine-tuning the VGG model
• Training: ~5K Points Of Interest (POIs), over 7M Flickr images using
queries with:
– the POI name and a radius of 5km around its coordinates
– the POI name and the associated city name
• Compressed outputs of fc7 layer (4096d) to 128d using PCA,
learned on a subset of 250,000 train images
• Similarity Search based on the PCA-reduced CNN features
O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity
search. ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM
Visual-based location estimation
Location Estimation
• Geospatial clustering of 𝑘 𝑣 = 20 visually most similar images
• The largest cluster (or the first in case of equal size) is selected and
its centroid is used as the location estimate
Visual Confidence
• Confidence metric for the visual estimation is based on the size of
the largest cluster
𝑐𝑜𝑛𝑓𝑣 𝑖 = max(
𝑛 𝑖 − 𝑛 𝑡
𝑘 𝑣 − 𝑛 𝑡
, 0)
𝑛 𝑖 : number of neighbors in the largest cluster of image i
𝑛 𝑡: configuration parameter of the confidence score ‘’strictness’’
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In
International Conference on Learning Representations, 2015
Hybrid-based location estimation
• A set of rules to determine the
source of estimation between the
text and visual approaches
• The visual estimation is chosen in
cases:
→ No estimation could be produced by
the text approach
→ Visual estimation fell inside the
borders of the mlc
→ By comparing the confidence scores
𝑐𝑜𝑛𝑓𝑣 and 𝑐𝑜𝑛𝑓𝑡
• Otherwise the text estimation is
selected
Runs and Results
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
RUN-E: Visual-based location estimation + entire YFCC dataset
Images
Runs and Results
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Hybrid location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC + External data
Videos
References
G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris.
Socialsensor at mediaeval placing task 2015. In MediaEval 2015 Placing
Task, 2015
G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social
media content with a refined language modelling approach. In
Intelligence and Security Informatics, pages 21–40, 2015
A. Popescu. CEA LIST's participation at mediaeval 2013 placing task. In
MediaEval 2013 Placing Task, 2013
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-
scale image recognition. In International Conference on Learning
Representations, 2015
O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr
resources using language models and similarity search. ICMR ’11, pages
48:1–48:8, New York, NY, USA, 2011. ACM
Different kinds of user classification:
topic-oriented (e.g., interest/expertise)
role-based/behavioral (e.g., bot/spammer)
geographical location
Useful for advertising,
user recommendation,
expert search, etc.
For personal accounts,
user classification raises
privacy concerns
Challenges
multi-linguality
Brevity
informal language