TALP-UPC at MediaEval 2014 Placing Task: Combining Geographical Knowledge Bases and Language Models for Large-Scale Textual Georeferencing

TALP-UPC at MediaEval 2014 Placing Task
TALP-UPC at MediaEval 2014 Placing Task:
Combining Geographical Knowledge Bases and
Language Models for Large-Scale Textual
Georeferencing
Daniel Ferr´es and Horacio Rodr´ıguez
TALP Research Center
Computer Science Department
Universitat Polit`ecnica de Catalunya
Barcelona, Spain
17th October 2014

Introduction
Textual Approaches: we use only textual metadata to
predict
Title
Description
User Tags (keywords)
Approaches.
1) GeoKB. Geographical Knowledge Heuristics
2) Hiemstra Language Model with Re-Ranking (HLM-RR)
3) GeoFusion: combines results from approaches GeoKB and
HLM-RR
4) GeoFusion+GeoWiki: combines results from GeoKB,
HLM-RR, and a HLM model of English Wikipedia
Georeferenced pages.

Geographical Knowledge Approach
System Architecture
Place Name Detection with Geonames Gazetteer.
Filtering Geographically ambiguous words with Multilingual
Stopwords Lists and the English Dictionary.
Geographical Focus Detection with Toponym Disambiguation
Heuristics.

Geographical Focus Detection Heuristics
Geographical Knowledge Heuristics (high priority):
GH1) select the most populated place that is not a state,
country or continent and has its state apearing in the text,
GH2) otherwise select the most populated place that is not a
state, country or continent and has its country apearing in the
text,
GH3) otherwise select the most populated state that has its
country apearing in the text
otherwise try Population rules
Population Heuristics (low priority):
PH1) if a place exists select the most populated place that is
not a country, state or a continent,
PH2) otherwise if exists a state, select the most populated one,
PH3) otherwise select the most populated country,
PH4) otherwise select the most populated continent.

Information Retrieval with Re-Ranking Approach
Terrier1 IR software (version 3.0) with the Hiemstra Language
Modelling (HLM) weighting model.
Indexing: coordinates pairs (document identiﬁer) and tagsets
(document text).
Retrieval: topic composed by Metadata ﬁelds (title,
description and user tags).
Re-Ranking the IR results (1000 top documents).
Get the subset of top-weighted coordinate pairs.
Recalculate the weights of the coordinates of the subset by
adding the weights of neighbours up to a X distance in Kms.
Get the top-weighted coordinate pair.
1
Terrier. http://terrier.org

GeoFusion Approach
Combines the results of the GeoKB approach and the IR
approach with Re-Ranking
Geographical Knowledge Rules (GH1, GH2 and GH3) if
activated are selected in priority order to get the prediction.
Otherwise, the prediction comes from the IR Re-Ranking
approach.

GeoFusion+GeoWiki Approach
Combines the results of the GeoKB approach, the HLM with
Re-Ranking approach, and a HLM model of the English
Wikipedia georeferenced pages (857,574 pages).
Geographical Knowledge Rules (GH1, GH2 and GH3) if
activated are selected in priority order to get the prediction.
Otherwise, the prediction comes from the HLM Re-Ranking
approach.
The HLM model of the georeferenced Wikipedia is used if the
score of the HLM Re-Ranking approach does not achieves a
threshold.

Training Corpora for IR
Corpora used
Placing Task 2014 Training dataset: (5,025,000
photos/videos). (after a ﬁltering process we get a set of
3,057,718 coordinates pairs with related metadata)
English Wikipedia georeferenced pages2 ( a set of 857,574
pages with associated coordinates)
2
http://de.wikipedia.org/wiki/Wikipedia:
WikiProjekt_Georeferenzierung/Hauptseite/Wikipedia-World/en

Experiments and Results
run Approach Training Corpus
run1 HLM Re-Rank (100km) Placing Task 2014
run3 GeoKB -
run4 GeoFusion Placing Task 2014
run5 GeoFusion+GeoWiki Placing Task 2014+ Wikipedia Georeferenced pages
0 %
10 %
20 %
30 %
40 %
50 %
60 %
70 %
80 %
90 %
100 %
0.01 0.1 1 10 100 1000 5000
Accuracy
Kms
experiments
run1
run3
run4
run5

Results II
Table : Percentage of correctly georeferenced photos/videos within
certain amount of kilometers and median error for each run.
Margin run1 run3 run4 run5
10m 0.29 0.08 0.23 0.23
100m 4.12 0.80 3.00 3.00
1km 16.54 10.71 15.90 15.90
10km 34.34 33.89 38.52 38.53
100km 51.06 42.35 52.47 52.47
1000km 64.67 52.54 65.87 65.86
5000km 78.63 69.84 79.29 79.28
Median Error (kms) 83.98 602.21 64.36 64.41

Conclusions
The HLM approach with Re-Ranking showed the best
performance within 10m to 1km distances.
GeoFusion and GeoFusion+GeoWiki outperform the other
approaches within the margin of errors from 10km to 5000km..
GeoKB rules achieved 81.17% of accuracy predicting up to
100km.
Diﬃcult cases: few textual information.

TALP-UPC at MediaEval 2014 Placing Task: Combining Geographical Knowledge Bases and Language Models for Large-Scale Textual Georeferencing

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (16)

Similar to TALP-UPC at MediaEval 2014 Placing Task: Combining Geographical Knowledge Bases and Language Models for Large-Scale Textual Georeferencing

Similar to TALP-UPC at MediaEval 2014 Placing Task: Combining Geographical Knowledge Bases and Language Models for Large-Scale Textual Georeferencing (20)

More from multimediaeval

More from multimediaeval (20)

Recently uploaded

Recently uploaded (20)

TALP-UPC at MediaEval 2014 Placing Task: Combining Geographical Knowledge Bases and Language Models for Large-Scale Textual Georeferencing