Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

216 views

Published on

Participation in MediaEval Placing Task 2016. Presented by Giorgos Kordopatis-Zilos in Amsterdam, 2016.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features

  1. 1. Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features Giorgos Kordopatis-Zilos1, Adrian Popescu2, Symeon Papadopoulos1 and Yiannis Kompatsiaris1 1 Information Technologies Institute (ITI), CERTH, Greece 2 CEA LIST, 91190 Gif-sur-Yvette, France MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
  2. 2. Summary Tag-based location estimation (1 runs) • Built upon the scheme of our 2015 participation (Kordopatis-Zilos et al., MediaEval 2015) • Based on a refined probabilistic Language Model Visual-based location estimation (1 run) • Extract PCA-reduced VGG features to compute image similarities • Geospatial clustering scheme of the most visually similar images Hybrid location estimation (3 run) • Combination of the textual and visual approaches using a set of rules Training sets • Training set released by the organisers (≈4.7M geotagged items) • YFCC dataset, excl. images from users in test set (≈40M geotagged items) • External data derived from gazetteers, i.e. Geonames and OpenStreetMap G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. Socialsensor at mediaeval placing task 2015. In MediaEval 2015 Placing Task, 2015
  3. 3. Tag-based location estimation • Processing steps of the approach – Offline: language model construction – Online: location estimation OpenStreetMap
  4. 4. Pre-processing • Tags and titles of the training set items are processed • Apply – URL decoding – lowercase transformation – tokenization • Remove – accents – symbols – punctuations • The multi-word tags are split into their individual terms, which are also included in the item's term set • Discard numerics or less than three characters terms
  5. 5. Language Model (LM) • LM-based estimation – Most Likely Cell (mlc) considered the cell with the highest probability and used to produce the estimation 𝑚𝑙𝑐𝑗 = arg max 𝑖 𝑘=1 𝑇 𝑗 𝑝(𝑡 𝑘|𝑐𝑖) ∗ 𝑤(𝑡 𝑘) Inspired from (Popescu, MediaEval 2013) • LM generation scheme – divide earth surface in rectangular cells with a side length of 0.01° – calculate term-cell probabilities 𝑝(𝑡|𝑐) = 𝑁 𝑢/𝑁𝑡 A. Popescu. CEA LIST's participation at mediaeval 2013 placing task. In MediaEval 2013 Placing Task, 2013
  6. 6. Feature Selection and Weighting Feature Weighting • Locality weight function, a function based on term relative position in T • Spatial Entropy weight function, a Gaussian function based on the term’s spatial entropy • Linear combination of the two weights Feature Selection • Calculate terms locality using a grid of 0.01°×0.01° • When a user uses a given term, he/she is assigned to the entire cell neighborhood instead of a unique cell: 𝑙 𝑡 = 𝑁𝑡 ∗ 𝑐∈𝐶 𝑢∈𝑈𝑡,𝑐 |{𝑢′|𝑢′ ∈ 𝑈𝑡,𝑐, 𝑢′ ≠ 𝑢}| 𝑁𝑡 2 • Terms with non-zero locality score form the term set 𝑇
  7. 7. Refinements • Multiple Grids – Built an additional LM using a finer grid (cell side length of 0.001°) – combine the MLC of the individual language models • Similarity search (Van Laere et al., ICMR 2011) – determine 𝑘 𝑡 most similar training images in the MLC – their center-of-gravity is the final location estimation From: (Kordopatis-Zilos et al., PAISI 2015) G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015
  8. 8. Visual-based location estimation Main Objectives • Ensure that the visual features are generic and transferable • Provide a compact representation of the features Model building • CNN features extracted by fine-tuning the VGG model • Training: ~5K Points Of Interest (POIs), over 7M Flickr images using queries with: – the POI name and a radius of 5km around its coordinates – the POI name and the associated city name • Compressed outputs of fc7 layer (4096d) to 128d using PCA, learned on a subset of 250,000 train images • Similarity Search based on the PCA-reduced CNN features O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM
  9. 9. Visual-based location estimation Location Estimation • Geospatial clustering of 𝑘 𝑣 = 20 visually most similar images • The largest cluster (or the first in case of equal size) is selected and its centroid is used as the location estimate Visual Confidence • Confidence metric for the visual estimation is based on the size of the largest cluster 𝑐𝑜𝑛𝑓𝑣 𝑖 = max( 𝑛 𝑖 − 𝑛 𝑡 𝑘 𝑣 − 𝑛 𝑡 , 0) 𝑛 𝑖 : number of neighbors in the largest cluster of image i 𝑛 𝑡: configuration parameter of the confidence score ‘’strictness’’ K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015
  10. 10. Hybrid-based location estimation • A set of rules to determine the source of estimation between the text and visual approaches • The visual estimation is chosen in cases: → No estimation could be produced by the text approach → Visual estimation fell inside the borders of the mlc → By comparing the confidence scores 𝑐𝑜𝑛𝑓𝑣 and 𝑐𝑜𝑛𝑓𝑡 • Otherwise the text estimation is selected
  11. 11. Runs and Results RUN-1: Tag-based location estimation + released training set RUN-2: Visual-based location estimation + released training set RUN-3: Hybrid location estimation + released training set RUN-4: Hybrid location estimation + YFCC dataset RUN-5: Hybrid location estimation + YFCC + External data RUN-E: Visual-based location estimation + entire YFCC dataset Images
  12. 12. Runs and Results RUN-1: Tag-based location estimation + released training set RUN-2: Visual-based location estimation + released training set RUN-3: Hybrid location estimation + released training set RUN-4: Hybrid location estimation + YFCC dataset RUN-5: Hybrid location estimation + YFCC + External data Videos
  13. 13. References G. Kordopatis-Zilos, A. Popescu, S. Papadopoulos, and Y. Kompatsiaris. Socialsensor at mediaeval placing task 2015. In MediaEval 2015 Placing Task, 2015 G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Intelligence and Security Informatics, pages 21–40, 2015 A. Popescu. CEA LIST's participation at mediaeval 2013 placing task. In MediaEval 2013 Placing Task, 2013 K. Simonyan and A. Zisserman. Very deep convolutional networks for large- scale image recognition. In International Conference on Learning Representations, 2015 O. Van Laere, S. Schockaert, and B. Dhoedt. Finding locations of Flickr resources using language models and similarity search. ICMR ’11, pages 48:1–48:8, New York, NY, USA, 2011. ACM
  14. 14. Thank you! Data/Code: – https://github.com/MKLab-ITI/multimedia-geotagging/ Get in touch: – Giorgos Kordopatis-Zilos: georgekordopatis@iti.gr – Symeon Papadopoulos: papadop@iti.gr / @sympap With the support of:

×