Inferring the Geolocation of Tweets at a Fine-Grained Level - PhD Thesis

Inferring the Geolocation of Tweets at a
Fine-Grained Level
Jorge David Gonzalez Paule
PhD Thesis Defence – 7th November 2018
Supervisors: Iadh Ounis, Craig Macdonald, Yashar Moshfeghi

Motivations
• Only 1% of the Twitter stream contains geographical information [Graham et al.,
2014].
• This sample of tweets is insufficient for several applications such as Disaster
and Emergency Management or Traffic Incident Detection (Chapter 6)
• Inferring the geolocation of non-geotagged tweets can increase the sample of
actionable geotagged data.
• Previous tweet geolocalisation approaches work at a coarse-grained level
(country or city level) [Eisenstein et al., 2010a; Han and Cook, 2013; Kinsella et al., 2011; Schulz et al.,
2013a]
• We aim to infer the geolocation of tweets at a fine-grained level (street or
neighbourhood level).
In this thesis, we aim to achieve an error distance of at most 1 km!

Thesis Statement
“The geolocalisation of non-geotagged tweets at a fine-grained
level can be achieved by exploiting the characteristics of already
available individual finely-grained geotagged tweets.”
H1 (Chapter 3): By considering geotagged tweets individually […] we can improve the performance of
fine-grained geolocalisation.
H2 (Chapter 4): The predictability of the geolocation of tweets at a fine-grained level is given by the
correlation between their content similarity and geographical distance to finely-grained geotagged tweets.
H3 (Chapter 5): By improving the ranking of geotagged tweets with respect to a given non-geotagged
tweet, […] we can obtain a higher number of other fine-grained predictions.
H4 (Chapter 6): By geolocalising non-geotagged tweets we can obtain a more representative sample of
geotagged data and, therefore, improve the effectiveness of the traffic incident detection task.

Thesis Structure
Fine-Grained Geolocalisation of Tweets (evolution of techniques, from simple but efficient to
sophisticated but more accurate (1 km))
• Chapter 3: Ranking Approach
• Chapter 4: Ranking Approach + Majority Voting
• Chapter 5: Learning to Rank + Majority Voting
Application: Traffic Incident Detection
• Chapter 6:
−Evaluate effectiveness
−Assess the generalisation of our approaches on a new dataset

Data and Metrics
Datasets
• 3 different datasets of geotagged tweets
−Chicago (March 2016)
−New York (March 2016)
−Chicago (July 2016)
Metrics
• Average Error Distance (AED)
−Distance on Earth between the predicted and the real location
• Accuracy@1km
−The fraction of predicted locations that lie within a radius of 1 km from the real
location
• Coverage
−The fraction of tweets for which our approach finds a geolocation
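The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the thesis: the haversine formula, the mean Earth radius of 6371 km, the function names, and the choice to compute AED and Accuracy@1km only over tweets that received a prediction are all assumptions of this sketch.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(predicted: Sequence[Optional[Point]], actual: Sequence[Point],
             radius_km: float = 1.0) -> dict:
    """AED and Accuracy@radius over answered tweets; Coverage over all tweets.

    A prediction of None means the approach abstained for that tweet.
    """
    errors = [haversine_km(p, t) for p, t in zip(predicted, actual) if p is not None]
    answered = len(errors)
    nan = float("nan")
    return {
        "aed_km": sum(errors) / answered if answered else nan,
        "accuracy_at_radius": (sum(e <= radius_km for e in errors) / answered
                               if answered else nan),
        "coverage": answered / len(actual) if actual else 0.0,
    }


# Hypothetical example: three tweets, one of which gets no prediction.
predicted = [(41.8781, -87.6298), None, (41.8781, -87.6298)]
actual = [(41.8781, -87.6298), (40.0, -87.0), (41.9781, -87.6298)]
metrics = evaluate(predicted, actual)
```

Separating Coverage from the two error metrics makes the trade-off visible: an approach that abstains on hard tweets can lower its AED while answering fewer tweets.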
Chapters 3, 4 and 5
Chapter 6

Motivation
• Existing approaches in the literature have limitations when they are adapted
to work at a fine-grained level [Kinsella et al., 2011; Paraskevopoulos et al., 2015].
−To represent a location, these approaches aggregate the text of the tweets in this
location into a virtual document.
−We postulate that this aggregation approach can lead to a loss of important
evidence.
Chapter 3
H1:
By considering geotagged tweets individually we can preserve the
evidence lost when adapting previous approaches at a fine-
grained level, and thus we can improve the performance of fine-
grained geolocalisation.
Enabling Fine-Grained
Geolocalisation

Chapter 3 Enabling Fine-Grained Geolocalisation
[Figure: the geographical area is a grid of squares of a pre-defined size (1 km); the ranking of
documents is shown for the Aggregated and Individual approaches]
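The difference between the two document representations can be sketched as follows. This is an illustration only: the equirectangular approximation used for the grid, the constant of 111.32 km per degree of latitude, the whitespace tokeniser, and the names `grid_cell` and `build_documents` are assumptions of this sketch and are not taken from the thesis.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

KM_PER_DEG_LAT = 111.32  # approximate value (assumption of this sketch)

Cell = Tuple[int, int]
Tweet = Tuple[str, float, float]  # (text, latitude, longitude)


def grid_cell(lat: float, lon: float, origin: Tuple[float, float],
              cell_km: float = 1.0) -> Cell:
    """Map a coordinate to the (row, col) of a square grid anchored at origin."""
    lat0, lon0 = origin
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat0))
    row = math.floor((lat - lat0) * KM_PER_DEG_LAT / cell_km)
    col = math.floor((lon - lon0) * km_per_deg_lon / cell_km)
    return row, col


def build_documents(tweets: Sequence[Tweet], origin: Tuple[float, float],
                    cell_km: float = 1.0):
    """Return (aggregated, individual) document collections."""
    aggregated: Dict[Cell, List[str]] = defaultdict(list)
    individual: List[Tuple[Cell, List[str]]] = []
    for text, lat, lon in tweets:
        cell = grid_cell(lat, lon, origin, cell_km)
        tokens = text.lower().split()
        aggregated[cell].extend(tokens)    # Aggregated: one virtual document per cell
        individual.append((cell, tokens))  # Individual: one document per tweet
    return dict(aggregated), individual


# Hypothetical tweets; the first two fall in the same 1 km cell.
origin = (41.6, -87.95)
tweets = [
    ("crash on I-90", 41.605, -87.94),
    ("crash near I-90 ramp", 41.6051, -87.9399),
    ("pizza time", 41.70, -87.80),
]
aggregated, individual = build_documents(tweets, origin)
```

In this toy collection the term "crash" occurs in two individual documents, but in only one aggregated document, where it appears twice. The per-tweet document frequency has become a within-document term frequency.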

Chapter 3 Enabling Fine-Grained
Geolocalisation
Conclusions
• Individual outperforms Aggregated (accuracy@1km):
−from 50.67% to 55.20% in Chicago
−from 45.40% to 48.46% in New York
• Key Findings:
−IDF is the best performing retrieval model.
−Document Frequency (DF) is the most important feature when using
individual tweets.
−The Aggregated approach loses accuracy because aggregation transforms document
frequency (DF) evidence into term frequency.
Contributions
• A novel ranking approach for fine-grained geolocalisation (Individual),
compared against the baseline (Aggregated).
• An investigation into the performance issues of the existing SOTA approaches
(Aggregated).
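A minimal sketch of ranking individual geotagged tweets with an IDF-only score is given below. It assumes the common `log(N / df)` form of IDF and a score that sums the IDF of the distinct query terms a tweet contains; the exact retrieval model configuration used in the thesis is not stated on the slides, and the names `idf_table` and `rank` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Location = Tuple[float, float]
Doc = Tuple[Location, List[str]]  # (location of the geotagged tweet, its tokens)


def idf_table(docs: Sequence[Doc]) -> Dict[str, float]:
    """IDF per term, log(N / df), computed over the geotagged tweets."""
    df = Counter(term for _, tokens in docs for term in set(tokens))
    n = len(docs)
    return {term: math.log(n / count) for term, count in df.items()}


def rank(query_tokens: Sequence[str], docs: Sequence[Doc],
         idf: Dict[str, float]) -> List[Tuple[float, Location]]:
    """Score each tweet by the summed IDF of the distinct query terms it contains."""
    query_terms = set(query_tokens)
    scored = []
    for location, tokens in docs:
        score = sum(idf[t] for t in query_terms & set(tokens))
        if score > 0:  # tweets sharing no informative term are not retrieved
            scored.append((score, location))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical geotagged tweets and a non-geotagged query tweet.
docs = [
    ((41.88, -87.63), ["traffic", "jam", "michigan", "ave"]),
    ((41.95, -87.65), ["wrigley", "field", "game"]),
    ((41.79, -87.60), ["traffic", "hyde", "park"]),
]
ranked = rank(["accident", "michigan", "ave", "traffic"], docs, idf_table(docs))
predicted_location = ranked[0][1] if ranked else None  # Top-1 prediction
```

Taking the location of the Top-1 tweet as the prediction mirrors the ranking approach described above; an empty ranking means no location is returned for that tweet.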

Majority Voting For Fine-
Grained Geolocalisation
Chapter 4
Motivation
• The approach of Chapter 3 obtains an AED of 4.693 km (Chicago)
−Not enough. We aim for an AED of 1 km.
• This approach considers only the similarity evidence for prediction (always
returns the Top-1 geotagged tweet)
• We postulate that the similarity of the tweets does not always correlate with
geographical distance, and that in such cases we cannot reliably predict a
location.
• We aim to exploit the geographical evidence encoded within the Top-N
geotagged tweets.
H2:
The predictability of the geolocation of tweets at a fine-grained level is
given by the correlation between their content similarity and geographical
distance to finely-grained geotagged tweets.

Contributions
• A novel approach that uses a weighted majority voting to exploit the
geographical evidence encoded within the Top-N geotagged tweets in the
ranking.
Chapter 4 Majority Voting For Fine-
Grained Geolocalisation
Conclusions
• AED is markedly reduced:
−4.694 km to 1.602 km in Chicago
−4.972 km to 1.448 km in New York
• Key Findings:
−Trade-off between AED and Coverage (weighted Majority Voting)
−As the value N of the Top-N increases we observe:
• Lower AED
• But also Lower Coverage
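The voting step can be sketched as below. The slides do not give the weighting scheme or the decision rule, so this sketch assumes that each Top-N tweet votes for its location with a weight equal to its retrieval score, and that a location is returned only when the winner holds more than a hypothetical `min_share` of the total weight. Otherwise the function abstains, which is what reduces Coverage.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Sequence, Tuple


def weighted_majority_vote(ranked: Sequence[Tuple[float, Hashable]],
                           top_n: int,
                           min_share: float = 0.5) -> Optional[Hashable]:
    """Return the location winning the weighted vote among the Top-N, or None.

    ranked    -- (score, location) pairs, best first
    top_n     -- number of top-ranked tweets allowed to vote
    min_share -- hypothetical threshold: the winner must hold strictly more
                 than this share of the total vote weight
    """
    votes: Dict[Hashable, float] = defaultdict(float)
    for score, location in ranked[:top_n]:
        votes[location] += score
    total = sum(votes.values())
    if total <= 0:
        return None  # nothing retrieved, so no prediction
    winner, weight = max(votes.items(), key=lambda item: item[1])
    return winner if weight / total > min_share else None


# Hypothetical ranking; locations are grid-cell identifiers.
ranked = [(3.0, "A"), (2.0, "B"), (2.0, "B"), (1.0, "C")]
top1 = weighted_majority_vote(ranked, top_n=1)
top3 = weighted_majority_vote(ranked, top_n=3)
top4 = weighted_majority_vote(ranked, top_n=4)
```

In this toy ranking the Top-1 answer is overruled once three tweets vote, and with four voters no location holds a strict majority, so the tweet is left unlocated. This only illustrates the abstention mechanism; it does not reproduce the reported numbers.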
[Figure: the ranking of geotagged tweets (Chapter 3) supplies the Top-N tweets to Majority Voting]

Chapter 5 Learning to Geolocalise
H3:
By improving the ranking of geotagged tweets with respect to a given
non-geotagged tweet, […] we can obtain a higher number of other fine-
grained predictions.
Motivation
• The approaches of Chapters 3 & 4 use traditional retrieval models
−Similarity is based only on document frequency information (IDF weighting).
• Considering only IDF can limit the quality of the Top-N geotagged tweets.

Contributions
• A novel learning to rank-based approach that re-ranks geotagged tweets
−Targeting their geographical proximity to a given non-geotagged tweet.
• We propose a set of 28 features for learning.
Chapter 5 Learning to Geolocalise
Experiments
• Experiment with several learning to rank algorithms
−MART, Random Forest, RankNet, AdaRank, ListNet and LambdaMART
• Explore the best combination of features
−Document features, query features and query-dependent features.
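One way to prepare training data for a learning to rank algorithm that targets geographical proximity is sketched below. Everything specific here is an assumption for illustration: the binary relevance label (1 if the candidate lies within 1 km of the query tweet's true location), the two placeholder feature values (not the 28 features of the thesis), and the function name `to_letor_lines`. The output uses the LETOR/SVMlight-style text layout (`label qid:<id> <index>:<value> ...`) that many learning to rank toolkits accept.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in kilometres (mean Earth radius 6371 km assumed)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def to_letor_lines(qid: int, query_location: Point,
                   candidates: Sequence[Tuple[Point, Sequence[float]]],
                   radius_km: float = 1.0) -> List[str]:
    """Emit one training line per candidate geotagged tweet.

    The label is 1 when the candidate is within radius_km of the query
    tweet's known location, else 0 (an assumed labelling scheme).
    """
    lines = []
    for location, features in candidates:
        label = 1 if haversine_km(query_location, location) <= radius_km else 0
        feats = " ".join(f"{i}:{v:.4f}" for i, v in enumerate(features, start=1))
        lines.append(f"{label} qid:{qid} {feats}")
    return lines


# Hypothetical query tweet (with known location, for training) and two candidates.
lines = to_letor_lines(
    qid=7,
    query_location=(41.8800, -87.6300),
    candidates=[
        ((41.8805, -87.6305), [2.6, 0.75]),  # nearby candidate
        ((41.9800, -87.6300), [0.4, 0.10]),  # distant candidate
    ],
)
```

Labelling by distance rather than by textual relevance is what makes the learned ranker favour geotagged tweets that are geographically close to the query tweet.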

Chapter 5 Learning to Geolocalise
Conclusions
• LambdaMART is the best algorithm for fine-grained geolocalisation.
• A better ranking result does improve geolocalisation
−Increased accuracy while increasing coverage (best configuration @Top-13):
−AED: from 1.490 km to 1.441 km
−Coverage: from 31.88% to 46.01%
• Best combination of features:
−Extracted from the query tweet.
−Query-dependent: relation between query and doc tweets.

Application: Traffic Incident
Detection
Chapter 6
H4:
By geolocalising non-geotagged tweets we can obtain a more
representative sample of geotagged data and, therefore, improve the
effectiveness of the traffic incident detection task.
Motivation
• We aim to assess the generalisation of our approaches.
• It is important to:
−Identify traffic incident-related content.
−Know the precise location of the tweets to locate incidents.
• Detection rates suffer from the small sample size, since only the 1% of tweets
that are geotagged can be used [Gu et al., 2016; Mai and Hranac, 2013].

Chapter 6 Application: Traffic Incident Detection
[Figure: Detection Rate, comparing Geotagged Alone with Geolocalised + Geotagged]

Chapter 6 Application: Traffic Incident Detection
[Figure: Accuracy@1km, comparing Geotagged Alone, Geolocalised + Geotagged (High Acc. Config)
and Geolocalised + Geotagged (High Coverage Config)]

Contributions
• Demonstrated the usefulness of our approaches in a traffic incident detection
task.
• We showed improvements in the performance of the traffic incident
detection task.
Chapter 6 Application: Traffic Incident Detection
Conclusions
• Expanding the sample with new geolocalised tweets increases the
performance of the traffic incident detection task.
• The geolocalised tweets are located close to the location of the real
incidents.
• Our fine-grained geolocalisation approaches do seem to generalise
− Consistency with behaviour observed in the previous chapters on other datasets

Concluding Remarks
Across all the chapters in this thesis we have:
• Addressed the fine-grained tweet geolocalisation problem.
• Alleviated the problem of the existing state-of-the-art
approaches when working at a fine-grained level
• Proposed a suite of techniques that
−Effectively infer the geolocation of tweets at a fine-grained
level
−Provide different trade-offs between accuracy and coverage
• Showed the effectiveness and generalisation of our
approaches on a traffic incident detection task

Publication Breakdown
• “On fine-grained geolocalisation of tweets”. ICTIR 2017.
• “On fine-grained geolocalisation of tweets and real-time traffic incident
detection”. Information Processing & Management. 2018.
• “Beyond geotagged tweets: exploring the geolocalisation of tweets for
transportation applications”. In Transportation Analytics in the Era of Big
Data, Springer. 2018.
• “Learning to Geolocalise Tweets at a Fine-Grained Level”. In CIKM 2018.
Also contributed to Geographical and Social Sciences
• “Sensing spatiotemporal patterns in urban areas: analytics and visualizations
using the integrated multimedia city data platform”. Built Environment 42.3.
2016.
• “Spatial analysis of user-generated ratings of Yelp venues”. In Open
Geospatial Data, Software and Standards. 2017.
5 research publications
Ch. 6
Ch. 5
Ch. 3
Ch. 4
Ch. 4

Thanks for listening!
Glad to answer your
questions
