Inferring the Geolocation of Tweets at a Fine-Grained Level - PhD Thesis

David Paule's PhD Thesis Defence

Presentation used for my PhD defence in November 2018. My thesis is titled "Inferring the Geolocation of Tweets at a Fine-Grained Level"

Supervisors: Professor Iadh Ounis, Dr Craig Macdonald and Dr Yashar Moshfeghi

Examiners: Dr Richard McCreadie and Professor Mohand Boughanem.

Want to know more? Visit my website davidpaule.es

Published in: Science

  1. Inferring the Geolocation of Tweets at a Fine-Grained Level
     Jorge David Gonzalez Paule
     PhD Thesis Defence – 7th November 2018
     Supervisors: Iadh Ounis, Craig Macdonald, Yashar Moshfeghi
  2. Motivations
     • Only 1% of the Twitter stream contains geographical information [Graham et al., 2014].
     • This sample of tweets is insufficient for several applications, such as Disaster and Emergency Management or Traffic Incident Detection (Chapter 6).
     • Inferring the geolocation of non-geotagged tweets can increase the sample of actionable geotagged data.
     • Previous tweet geolocalisation approaches work at a coarse-grained level (country or city level) [Eisenstein et al., 2010a; Han and Cook, 2013; Kinsella et al., 2011; Schulz et al., 2013a].
     • We aim to infer the geolocation of tweets at a fine-grained level (street or neighbourhood level). In this thesis, we aim to achieve an error distance of at most 1 km.
  3. Thesis Statement
     “The geolocalisation of non-geotagged tweets at a fine-grained level can be achieved by exploiting the characteristics of already available individual finely-grained geotagged tweets.”
     • H1 (Chapter 3): By considering geotagged tweets individually […] we can improve the performance of fine-grained geolocalisation.
     • H2 (Chapter 4): The predictability of the geolocation of tweets at a fine-grained level is given by the correlation between their content similarity and geographical distance to finely-grained geotagged tweets.
     • H3 (Chapter 5): By improving the ranking of geotagged tweets with respect to a given non-geotagged tweet, […] we can obtain a higher number of other fine-grained predictions.
     • H4 (Chapter 6): By geolocalising non-geotagged tweets we can obtain a more representative sample of geotagged data and, therefore, improve the effectiveness of the traffic incident detection task.
  4. Thesis Structure
     • Chapters 3, 4 and 5: Fine-Grained Geolocalisation of Tweets (Evolution of Techniques)
       − Chapter 3: Ranking Approach
       − Chapter 4: Ranking Approach + Majority Voting
       − Chapter 5: Learning to Rank + Majority Voting
       − The techniques evolve from simple but efficient to sophisticated but more accurate (1 km)
     • Chapter 6: Application: Traffic Incident Detection
       − Evaluate effectiveness
       − Assess the generalisation of our approaches on a new dataset
  5. Data and Metrics
     Datasets
     • 3 different datasets of geotagged tweets
       − Chicago (March 2016), used in Chapters 3, 4 and 5
       − New York (March 2016), used in Chapters 3, 4 and 5
       − Chicago (July 2016), used in Chapter 6
     Metrics
     • Average Error Distance (AED)
       − Distance on Earth between the predicted and the real location
     • Accuracy@1km
       − The fraction of predicted locations that lie within a radius of 1 km from the real location
     • Coverage
       − The fraction of tweets for which our approach finds a geolocation
  6. Chapter 3: Enabling Fine-Grained Geolocalisation
     Motivation
     • Existing approaches in the literature have limitations when they are adapted to work at a fine-grained level [Kinsella et al., 2011; Paraskevopoulos et al., 2015].
       − To represent a location, these approaches aggregate the text of the tweets in this location into a virtual document.
       − We postulate that this aggregation approach can lead to a loss of important evidence.
     H1: By considering geotagged tweets individually we can preserve the evidence lost when adapting previous approaches at a fine-grained level, and thus we can improve the performance of fine-grained geolocalisation.
  7. Chapter 3: Enabling Fine-Grained Geolocalisation
     [Figure: Geographical Area: grid of squares of a pre-defined size (1 km). Ranking of Documents: Aggregated versus Individual.]
  8. Chapter 3: Enabling Fine-Grained Geolocalisation
     Conclusions
     • Individual outperforms Aggregated (accuracy@1km):
       − from 50.67% to 55.20% in Chicago
       − from 45.40% to 48.46% in New York
     • Key Findings:
       − IDF is the best performing retrieval model.
       − Document Frequency (DF) is the most important feature when using individual tweets.
       − The Aggregated approach loses accuracy when transforming DF to term frequency.
     Contributions
     • A novel ranking approach for fine-grained geolocalisation (Individual), compared against the baseline (Aggregated).
     • An investigation into the performance issues of the existing SOTA approaches (Aggregated).
  9. Chapter 4: Majority Voting For Fine-Grained Geolocalisation
     Motivation
     • The approach of Chapter 3 obtains an AED of 4.693 km (Chicago)
       − Not enough. We aim for an AED of 1 km.
     • This approach considers only the similarity evidence for prediction (it always returns the Top-1 geotagged tweet).
     • We postulate that in some cases the similarity of the tweets does not correlate with geographical distance, and thus we cannot predict a location in such cases.
     • We aim to exploit the geographical evidence encoded within the Top-N geotagged tweets.
     H2: The predictability of the geolocation of tweets at a fine-grained level is given by the correlation between their content similarity and geographical distance to finely-grained geotagged tweets.
  10. Chapter 4: Majority Voting For Fine-Grained Geolocalisation
      Contributions
      • A novel approach that uses a weighted majority voting to exploit the geographical evidence encoded within the Top-N geotagged tweets in the ranking.
      Conclusions
      • AED is markedly reduced:
        − 4.694 km to 1.602 km in Chicago
        − 4.972 km to 1.448 km in New York
      • Key Findings:
        − Trade-off between AED and Coverage (weighted Majority Voting)
        − As the value N of the Top-N increases we observe lower AED, but also lower Coverage
      [Figure: Ranking of Geotagged Tweets (Chapter 3), Top-N, Majority Voting]
  11. Chapter 5: Learning to Geolocalise
      H3: By improving the ranking of geotagged tweets with respect to a given non-geotagged tweet, […] we can obtain a higher number of other fine-grained predictions.
      Motivation
      • The approaches of Chapters 3 & 4 use traditional retrieval models
        − Similarity is based only on document frequency information (IDF weighting).
      • Considering only IDF can limit the quality of the Top-N geotagged tweets.
  12. Chapter 5: Learning to Geolocalise
      Contributions
      • A novel learning to rank-based approach that re-ranks geotagged tweets
        − Targeting their geographical proximity to a given non-geotagged tweet.
      • We propose a set of 28 features for learning.
      Experiments
      • Experiment with several learning to rank algorithms
        − MART, Random Forest, RankNet, AdaRank, ListNet and LambdaMART
      • Explore the best combination of features
        − Document features, query features and query-dependent features.
  13. Chapter 5: Learning to Geolocalise
      Conclusions
      • LambdaMART is the best algorithm for fine-grained geolocalisation.
      • A better ranking result does improve geolocalisation
        − Increased accuracy while increasing coverage (best configuration @Top-13):
        − AED: from 1.490 km to 1.441 km
        − Coverage: from 31.88% to 46.01%
      • Best combination of features:
        − Extracted from the query tweet.
        − Query-dependent: relation between query and doc tweets.
  14. Chapter 6: Application: Traffic Incident Detection
      H4: By geolocalising non-geotagged tweets we can obtain a more representative sample of geotagged data and, therefore, improve the effectiveness of the traffic incident detection task.
      Motivation
      • We aim to assess the generalisation of our approaches.
      • It is important to:
        − Identify traffic incident-related content.
        − Know the precise location of the tweets to locate incidents.
      • Detection rates suffer from the small sample size, since only the 1% of tweets that are geotagged can be used (Gu et al., 2016; Mai and Hranac, 2013).
  15. Chapter 6: Application: Traffic Incident Detection
      [Figure: Detection Rate for Geotagged Alone and for Geolocalised + Geotagged]
  16. Chapter 6: Application: Traffic Incident Detection
      [Figure: Accuracy@1km for Geotagged Alone, Geolocalised + Geotagged (High Acc. Config) and Geolocalised + Geotagged (High Coverage Config)]
  17. Chapter 6: Application: Traffic Incident Detection
      Contributions
      • Demonstrated the usefulness of our approaches in a traffic incident detection task.
      • We showed improvements in the performance of the traffic incident detection task.
      Conclusions
      • Expanding the sample with new geolocalised tweets increases the performance of the traffic incident detection task.
      • The geolocalised tweets are located close to the location of the real incidents.
      • Our fine-grained geolocalisation approaches do seem to generalise
        − Consistency with behaviour observed in the previous chapters on other datasets
  18. Concluding Remarks
      Across all the chapters in this thesis we have:
      • Addressed the fine-grained tweet geolocalisation problem.
      • Alleviated the problem of the existing state-of-the-art approaches when working at a fine-grained level.
      • Proposed a suite of techniques that
        − Effectively infer the geolocation of tweets at a fine-grained level
        − Provide different trade-offs between accuracy and coverage
      • Showed the effectiveness and generalisation of our approaches on a traffic incident detection task.
  19. Publication Breakdown
      • “On fine-grained geolocalisation of tweets”. ICTIR 2017.
      • “On fine-grained geolocalisation of tweets and real-time traffic incident detection”. Information Processing & Management. 2018.
      • “Beyond geotagged tweets: exploring the geolocalisation of tweets for transportation applications”. In Transportation Analytics in the Era of Big Data, Springer. 2018.
      • “Learning to Geolocalise Tweets at a Fine-Grained Level”. In CIKM 2018.
      Also contributed to Geographical and Social Sciences
      • “Sensing spatiotemporal patterns in urban areas: analytics and visualizations using the integrated multimedia city data platform”. Built Environment 42.3. 2016.
      • “Spatial analysis of users-generated ratings of yelp venues”. In Open Geospatial Data, Software and Standards. 2017.
      5 research publications. The slide labels the thesis publications by chapter (Ch. 3, Ch. 4, Ch. 5, Ch. 6).
  20. Thanks for listening! Glad to answer your questions
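
The three metrics defined on the Data and Metrics slide (AED, Accuracy@1km and Coverage) can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis code. The function names, the haversine distance, the Earth radius constant and the choice to average AED over predicted tweets only are all assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(predictions, truths, radius_km=1.0):
    """Compute AED, Accuracy@radius and Coverage.

    predictions: list of (lat, lon) tuples, or None where no location was predicted.
    truths: list of (lat, lon) tuples, aligned with predictions.
    AED and accuracy are computed over predicted tweets only (assumption).
    """
    errors = [haversine_km(p, t)
              for p, t in zip(predictions, truths) if p is not None]
    coverage = len(errors) / len(truths) if truths else 0.0
    if not errors:
        return {"aed_km": None, "accuracy": None, "coverage": coverage}
    return {
        "aed_km": sum(errors) / len(errors),
        "accuracy": sum(e <= radius_km for e in errors) / len(errors),
        "coverage": coverage,
    }
```

Returning `None` for a tweet without a prediction keeps Coverage separate from the two distance-based metrics, which is what makes a trade-off between AED and Coverage visible.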

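
The weighted majority voting over the Top-N geotagged tweets described for Chapter 4 can be illustrated with a small sketch. The slides do not give the exact voting scheme, so everything here is assumed: votes are weighted by retrieval score, each tweet votes for its grid cell, and the function abstains when no cell gathers more than a `threshold` share of the vote weight. The names `weighted_majority_vote`, `ranked` and `threshold` are hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(ranked, n=3, threshold=0.5):
    """Pick a grid cell from the Top-N ranked geotagged tweets.

    ranked: list of (cell_id, score) pairs sorted by decreasing score, where
            cell_id identifies the grid square containing the geotagged tweet.
    Returns the winning cell_id, or None (no prediction) when no cell holds
    more than `threshold` of the total vote weight. Abstaining is an
    assumption of this sketch, not the rule used in the thesis.
    """
    votes = defaultdict(float)
    for cell_id, score in ranked[:n]:
        votes[cell_id] += score  # each tweet votes with its retrieval score
    total = sum(votes.values())
    if total <= 0:
        return None
    best_cell, best_weight = max(votes.items(), key=lambda kv: kv[1])
    return best_cell if best_weight / total > threshold else None
```

With `n=1` the function reduces to returning the cell of the Top-1 tweet, which matches the behaviour the slides attribute to the Chapter 3 approach. Abstaining when the Top-N disagree is one way a lower Coverage could be traded for a lower AED.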