Presentation of the joint participation of CERTH and CEA LIST in the 2015 edition of the MediaEval Placing Task (MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015).
CERTH/CEA LIST at MediaEval Placing Task 2015
Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1 and
Yiannis Kompatsiaris1
1 Information Technologies Institute (ITI), CERTH, Greece
2 CEA LIST, 91190 Gif-sur-Yvette, France
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany
Summary
Tag-based location estimation (2 runs)
• Based on a geographic Language Model
• Built upon the scheme of our 2014 participation [2] (Kordopatis-Zilos et
al., MediaEval 2014)
• Extensions from [3]: improved feature selection and weighting
(Kordopatis-Zilos et al., PAISI 2015)
Visual-based location estimation (1 run)
• Geospatial clustering scheme of the most visually similar images
Hybrid location estimation (2 runs)
• Combination of the textual and visual approaches
Training sets
• Training set released by the organisers (≈4.7M geotagged items)
• YFCC dataset, excl. images from users in test set (≈40M geotagged items)
Language Model (LM)
• LM generation scheme
– divide the earth's surface into rectangular cells with a side length of 0.01°
– calculate tag-cell probabilities based on the users that used the tag inside the cell
• LM-based estimation
– the probability of each cell is calculated as the sum of the respective tag-cell probabilities
– the Most Likely Cell (MLC) is the cell with the highest probability and is used to produce the estimation (see the sketch below)
Inspired by [4] (Popescu, MediaEval 2013)
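A minimal sketch of the LM generation and MLC-based estimation described above, assuming training items are available as (user, latitude, longitude, tags) tuples; all function names and the data layout are illustrative, not taken from the authors' implementation:

```python
from collections import defaultdict

CELL_SIDE = 0.01  # cell side length in degrees, as on the slide

def cell_of(lat, lon, side=CELL_SIDE):
    """Map a latitude/longitude pair to a rectangular grid cell id."""
    return (int(lat // side), int(lon // side))

def build_language_model(training_items, side=CELL_SIDE):
    """training_items: iterable of (user_id, lat, lon, tags).
    Returns model[tag][cell] = p(cell | tag), estimated from the number of
    distinct users that used the tag inside each cell."""
    users = defaultdict(lambda: defaultdict(set))  # tag -> cell -> set of users
    for user, lat, lon, tags in training_items:
        c = cell_of(lat, lon, side)
        for t in tags:
            users[t][c].add(user)
    model = {}
    for t, cells in users.items():
        total = sum(len(u) for u in cells.values())
        model[t] = {c: len(u) / total for c, u in cells.items()}
    return model

def most_likely_cell(tags, model, weights=None):
    """Sum the (optionally weighted) tag-cell probabilities of an item's tags
    and return the Most Likely Cell (MLC), or None if no tag is in the model."""
    scores = defaultdict(float)
    for t in set(tags):
        w = 1.0 if weights is None else weights.get(t, 0.0)
        for c, p in model.get(t, {}).items():
            scores[c] += w * p
    return max(scores, key=scores.get) if scores else None
```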
Feature Selection and Weighting
Feature Selection
• The final tag set T is the intersection of the two tag sets: T = T_a ∩ T_l
Feature Weighting
• Locality weight: sort the tags in T by their locality score; the tag ranked j-th gets
  w_l = (|T| − (j − 1)) / |T|
• Spatial Entropy (SE) weight: normalise the Gaussian weights of the tag SE values,
  w_se = N(e(t), μ, σ) / max_{t′∈T} N(e(t′), μ, σ)
• Combine the two weighting functions (see the sketch below):
  w = ω · w_se + (1 − ω) · w_l
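A minimal sketch of the weighting scheme above, assuming the locality scores and spatial-entropy values e(t) have already been computed for the selected tag set T; the parameters μ, σ and the default ω are placeholders, since their values are not given on the slides:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density N(x; mu, sigma) used to weight the spatial entropy values."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def tag_weights(locality, entropy, mu, sigma, omega=0.5):
    """locality: tag -> locality score, entropy: tag -> spatial entropy e(t),
    both restricted to the selected tag set T.
    Returns the combined weight w = omega * w_se + (1 - omega) * w_l per tag."""
    T = sorted(locality, key=locality.get, reverse=True)  # rank tags by locality score
    n = len(T)
    # w_l = (|T| - (j - 1)) / |T| for the tag ranked j-th (1-based j)
    w_l = {t: (n - j) / n for j, t in enumerate(T)}
    # w_se: Gaussian weight of the tag's SE value, normalised by the maximum over T
    se = {t: gaussian(entropy[t], mu, sigma) for t in T}
    se_max = max(se.values())
    w_se = {t: se[t] / se_max for t in T}
    return {t: omega * w_se[t] + (1 - omega) * w_l[t] for t in T}
```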
Accuracy
• Partition the training set into p folds (p = 10)
• Withhold one partition at a time and build the LM with the remaining p − 1
• Estimate the location of every item in the withheld partition (see the sketch below)
• Accuracy score of every tag:
  tgeo(t) = N_r / N_t
  N_r: correctly geotagged items
  N_t: total items tagged with t
• Tags with non-zero accuracy score form the tag set T_a
From [3]: Kordopatis-Zilos et al., PAISI 2015
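A minimal sketch of this cross-validated tag accuracy, reusing build_language_model and most_likely_cell from the LM sketch; the fold split, the use of the MLC centre as the estimate, and the 1 km correctness radius are assumptions, since the slide does not specify them:

```python
import math
from collections import defaultdict

def cell_center(cell, side=0.01):
    """Latitude/longitude of the centre of a grid cell (see the LM sketch)."""
    return ((cell[0] + 0.5) * side, (cell[1] + 0.5) * side)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def tag_accuracy(training_items, p=10, within_km=1.0):
    """tgeo(t) = N_r / N_t via p-fold cross-validation: build the LM on p-1 folds,
    geotag the withheld fold with the MLC centre, and count per tag how many of its
    items fall within `within_km` of the ground truth (the radius is an assumption)."""
    correct, total = defaultdict(int), defaultdict(int)
    folds = [training_items[i::p] for i in range(p)]
    for i in range(p):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        model = build_language_model(train)
        for user, lat, lon, tags in folds[i]:
            mlc = most_likely_cell(tags, model)
            ok = mlc is not None and haversine_km(*cell_center(mlc), lat, lon) <= within_km
            for t in tags:
                total[t] += 1
                correct[t] += int(ok)
    return {t: correct[t] / total[t] for t in total}

# T_a: tags with non-zero cross-validated accuracy
# T_a = {t for t, acc in tag_accuracy(items).items() if acc > 0}
```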
[Figure: estimated locations]
Locality
• Captures the spatial awareness of tags
• When a user uses a tag, he/she is assigned to the respective location cell
• Each cell has a set of users assigned to it
• All users assigned to the same cell are considered neighbours
• Locality score of every tag:
  loc(t) = N_t · ( Σ_{c∈C} Σ_{u∈U_{t,c}} |{u′ ∈ U_{t,c} : u′ ≠ u}| ) / N_t²
  N_t: total occurrences of t
  C: set of all cells
  U_{t,c}: set of users that used tag t inside cell c
• Tags with non-zero locality score form the tag set 𝑇𝑙
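A minimal sketch of the locality score above, reusing cell_of from the LM sketch; counting every (item, tag) pair as one occurrence of t is an assumption:

```python
from collections import defaultdict

def locality_scores(training_items, side=0.01):
    """loc(t) = N_t * (sum over cells of the neighbour count inside the cell) / N_t^2,
    where the neighbour count in cell c is sum over users u in U_{t,c} of
    |{u' in U_{t,c}, u' != u}| = |U_{t,c}| * (|U_{t,c}| - 1),
    and N_t is the number of occurrences of tag t."""
    users_per_cell = defaultdict(lambda: defaultdict(set))  # tag -> cell -> users
    occurrences = defaultdict(int)                          # tag -> N_t
    for user, lat, lon, tags in training_items:
        c = cell_of(lat, lon, side)
        for t in tags:
            users_per_cell[t][c].add(user)
            occurrences[t] += 1
    loc = {}
    for t, cells in users_per_cell.items():
        neighbour_pairs = sum(len(u) * (len(u) - 1) for u in cells.values())
        loc[t] = occurrences[t] * neighbour_pairs / occurrences[t] ** 2
    return loc

# T_l: tags with non-zero locality score
# T_l = {t for t, s in locality_scores(items).items() if s > 0}
```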
Locality – value distribution
• e.g. high locality values: london (6975), paris (5452), nyc (3917)
• e.g. low locality values: luminancehdr (0.0035), dsc6362 (0.003), air photo (0.002)
Extensions
• Spatial Entropy (SE) function
– calculate entropy values by applying the Shannon entropy formula to the tag-cell probabilities
– build a Gaussian weight function based on the tags' SE values
• Internal Grid
– build an additional LM using a finer grid (cell side length of 0.001°)
– combine the MLCs of the two language models
• Similarity search [6] (Van Laere et al., ICMR 2011)
– determine 𝑘 most similar training images in the MLC
– their center-of-gravity is the final location estimation
From [2]: (Kordopatis-Zilos et al., MediaEval 2014)
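A minimal sketch of the similarity-search refinement above: take the k most similar training images inside the MLC and return the centre of gravity of their locations. Tag-overlap (Jaccard) similarity stands in for the similarity measure of [6], the fallback to the cell centre is an assumption, and items_per_cell is an assumed precomputed index from cell to training items:

```python
def similarity_search_estimate(query_tags, model, items_per_cell, k=5):
    """Select the MLC with the language model, rank the training items inside it
    by tag overlap with the query, and return the centre of gravity of the top k.
    items_per_cell: cell -> list of (lat, lon, tags) training items."""
    mlc = most_likely_cell(query_tags, model)      # from the LM sketch
    if mlc is None:
        return None
    candidates = items_per_cell.get(mlc, [])
    q = set(query_tags)

    def jaccard(item):
        tags = set(item[2])
        union = q | tags
        return len(q & tags) / len(union) if union else 0.0

    top = sorted(candidates, key=jaccard, reverse=True)[:k]
    if not top:                                    # no training image in the MLC
        return cell_center(mlc)                    # fall back to the cell centre (assumption)
    lat = sum(item[0] for item in top) / len(top)
    lon = sum(item[1] for item in top) / len(top)
    return (lat, lon)
```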
Visual-based location estimation
Model building
• CNN features adapted by fine-tuning the VGG model [5] (Simonyan & Zisserman,
ICLR 2015)
• Training: ~1K Points Of Interest (POIs), ~1200 images/POI
• The CNN features are computed with Caffe [1] (Jia et al., arXiv 2014)
• Outputs of the fc7 layer (4096-d) compressed to 128-d using PCA
• The compressed CNN features are used to compute the visual image similarities s_vis,ij
Location Estimation
• Geospatial clustering of 𝑘 = 20 visually most similar images
• If the j-th image lies within 1 km of the closest of the previous j − 1 images, it is assigned to that image's cluster; otherwise it forms its own cluster
• The largest cluster (or the first one, in case of equal size) is selected and its centroid is used as the location estimate (see the sketch below)
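A minimal sketch of this greedy geospatial clustering, assuming the k = 20 visual neighbours are already retrieved and passed in as (lat, lon) pairs ordered by decreasing visual similarity; haversine_km comes from the accuracy sketch:

```python
def visual_estimate(neighbours, threshold_km=1.0):
    """neighbours: (lat, lon) of the visually most similar training images,
    ordered by decreasing similarity. Returns the centroid of the largest
    cluster (the first one on ties, since max() keeps the earliest maximum)."""
    clusters = []  # each cluster is a list of (lat, lon) points
    for lat, lon in neighbours:
        closest, closest_d = None, None
        for cluster in clusters:
            d = min(haversine_km(lat, lon, la, lo) for la, lo in cluster)
            if closest_d is None or d < closest_d:
                closest, closest_d = cluster, d
        if closest is not None and closest_d <= threshold_km:
            closest.append((lat, lon))     # join the cluster of the closest previous image
        else:
            clusters.append([(lat, lon)])  # start a new cluster
    if not clusters:
        return None
    largest = max(clusters, key=len)
    lat = sum(p[0] for p in largest) / len(largest)
    lon = sum(p[1] for p in largest) / len(largest)
    return (lat, lon)
```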
Hybrid-based location estimation
Model building
• Combination of the textual and visual approaches
• Build the LM using the tag-based approach above and use it for MLC selection
Similarity Calculation
• Combination of the visual and textual similarities.
• Normalize the visual similarities to the range [0, 1]
• Similarity between two images: s_ij = (s_tex,ij + s_vis,ij) / 2
• The final estimation is the center-of-gravity of the 𝑘 = 5 most similar images
Low Confidence Estimations
• For test images with no estimate or with confidence lower than 0.02 (≈10% of the test set), the visual approach is used to produce the estimated location (see the sketch below)
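A minimal sketch of the hybrid combination and fallback above, assuming per-candidate textual and visual similarity dictionaries keyed by training-image id; the min-max normalisation of the visual scores and the function names are assumptions:

```python
def hybrid_similarities(text_sims, visual_sims):
    """s_ij = (s_tex,ij + s_vis,ij) / 2, after min-max normalising the visual
    similarities to [0, 1] (the exact normalisation is not given on the slides).
    Both inputs are dicts keyed by training-image id."""
    if visual_sims:
        lo, hi = min(visual_sims.values()), max(visual_sims.values())
        span = (hi - lo) or 1.0
        norm_vis = {i: (v - lo) / span for i, v in visual_sims.items()}
    else:
        norm_vis = {}
    return {i: (text_sims.get(i, 0.0) + norm_vis.get(i, 0.0)) / 2
            for i in set(text_sims) | set(norm_vis)}

def hybrid_estimate(text_sims, visual_sims, locations, confidence,
                    visual_fallback, k=5, conf_threshold=0.02):
    """Centre of gravity of the k = 5 most similar candidates; falls back to the
    purely visual estimate when there is no tag-based estimate or the LM
    confidence is below the threshold. locations: image id -> (lat, lon)."""
    if not text_sims or confidence < conf_threshold:
        return visual_fallback()                 # e.g. visual_estimate(...)
    sims = hybrid_similarities(text_sims, visual_sims)
    top = sorted(sims, key=sims.get, reverse=True)[:k]
    lat = sum(locations[i][0] for i in top) / len(top)
    lon = sum(locations[i][1] for i in top) / len(top)
    return (lat, lon)
```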
Confidence
• Evaluate the confidence of the LM estimation for each query image
• Measures how localized the language model's cell estimations are, based on the cell probabilities
• Confidence measure:
  conf(i) = Σ_{c∈C : dist(c, mlc) < l} p(c|i) / Σ_{c∈C} p(c|i)
  p(c|i): cell probability of cell c for image i
  dist(c1, c2): distance between cells c1 and c2
  mlc: Most Likely Cell
  l: distance threshold
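A minimal sketch of this confidence measure, reusing cell_center and haversine_km from the earlier sketches; the value of the threshold l is not specified on the slides, so the 50 km default below is only a placeholder:

```python
def lm_confidence(cell_probs, mlc, l_km=50.0):
    """conf(i): fraction of the total cell probability mass p(c|i) that lies
    within distance l of the MLC. cell_probs: cell -> p(c|i)."""
    total = sum(cell_probs.values())
    if total == 0:
        return 0.0
    mlat, mlon = cell_center(mlc)
    near = sum(p for c, p in cell_probs.items()
               if haversine_km(*cell_center(c), mlat, mlon) < l_km)
    return near / total
```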
Runs and Results
measure             RUN-1   RUN-2   RUN-3   RUN-4   RUN-5
acc(1m) [%]          0.15    0.01    0.15    0.16    0.16
acc(10m) [%]         0.61    0.08    0.62    0.75    0.76
acc(100m) [%]        6.40    1.76    6.52    7.73    7.83
acc(1km) [%]        24.33    5.19   24.61   27.30   27.54
acc(10km) [%]       43.07    7.43   43.41   46.48   46.77
median error [km]      69    5663      61      24      22
RUN-1: Tag-based location estimation + released training set
RUN-2: Visual-based location estimation + released training set
RUN-3: Hybrid location estimation + released training set
RUN-4: Tag-based location estimation + YFCC dataset
RUN-5: Hybrid location estimation + YFCC dataset
References
[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell. Caffe: Convolutional architecture for fast feature embedding.
arXiv preprint arXiv:1408.5093, 2014.
[2] G. Kordopatis-Zilos, G. Orfanidis, S. Papadopoulos, and Y. Kompatsiaris.
SocialSensor at MediaEval Placing Task 2014. In MediaEval 2014 Placing Task,
2014.
[3] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social
media content with a refined language modelling approach. In Intelligence and
Security Informatics, pages 21–40, 2015.
[4] A. Popescu. CEA LIST's participation at MediaEval 2013 Placing Task. In
MediaEval 2013 Placing Task, 2013.
[5] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-
scale image recognition. In International Conference on Learning
Representations, 2015.
[6] O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources
using language models and similarity search. ICMR ’11, pages 48:1–48:8, New
York, NY, USA, 2011. ACM.