Inferring the Geolocation of Tweets at a
Fine-Grained Level
Jorge David Gonzalez Paule
PhD Thesis Defence – 7th November 2018
Supervisors: Iadh Ounis, Craig Macdonald, Yashar Moshfeghi
Motivations
• Only 1% of the Twitter stream contains geographical information [Graham et al.,
2014].
• This sample of tweets is insufficient for several applications such as Disaster
and Emergency Management or Traffic Incident Detection (Chapter 6)
• Inferring the geolocation of non-geotagged tweets can increase the sample of
actionable geotagged data.
• Previous tweet geolocalisation approaches work at a coarse-grained level
(country or city level) [Eisenstein et al., 2010a; Han and Cook, 2013; Kinsella et al., 2011; Schulz et al.,
2013a]
• We aim to infer the geolocation of tweets at a fine-grained level (street or
neighbourhood level).
In this thesis, we aim to achieve an error distance of at most 1 km!
Thesis Statement
“The geolocalisation of non-geotagged tweets at a fine-grained
level can be achieved by exploiting the characteristics of already
available individual finely-grained geotagged tweets.”
H1 (Chapter 3): By considering geotagged tweets individually […] we can improve the performance of
fine-grained geolocalisation.
H2 (Chapter 4): The predictability of the geolocation of tweets at a fine-grained level is given by the
correlation between their content similarity and geographical distance to finely-grained geotagged tweets.
H3 (Chapter 5): By improving the ranking of geotagged tweets with respect to a given non-geotagged tweet, […]
we can obtain a higher number of other fine-grained predictions.
H4 (Chapter 6): By geolocalising non-geotagged tweets we can obtain a more representative sample of
geotagged data and, therefore, improve the effectiveness of the traffic incident detection task.
Thesis Structure
Fine-Grained Geolocalisation of Tweets
(Evolution of Techniques: from simple but efficient to sophisticated but more accurate (1 km))
• Chapter 3: Ranking Approach
• Chapter 4: Ranking Approach + Majority Voting
• Chapter 5: Learning to Rank + Majority Voting
Application: Traffic Incident Detection
• Chapter 6:
−Evaluate effectiveness
−Assess the generalisation of our approaches on a new dataset
Data and Metrics
Datasets
• 3 different datasets of geotagged tweets
−Chicago (March 2016)
−New York (March 2016)
−Chicago (July 2016)
Metrics
• Average Error Distance (AED)
−Distance on Earth between the predicted and the real location
• Accuracy@1km
−The fraction of predicted locations that lie within a radius of 1 km from the real
location
• Coverage
−The fraction of tweets for which our approach finds a geolocation
Chapters 3,4 and 5
Chapter 6
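The three metrics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code of the thesis: the haversine formula and the Earth radius constant are our choice for "distance on Earth", AED and accuracy are computed over the tweets that received a prediction, and the names `haversine_km` and `evaluate` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(predictions, truths, radius_km=1.0):
    """predictions[i] is a (lat, lon) pair, or None when no location was found."""
    errors = [haversine_km(p, t)
              for p, t in zip(predictions, truths) if p is not None]
    # Coverage: fraction of tweets for which a geolocation was produced.
    coverage = len(errors) / len(truths) if truths else 0.0
    if not errors:
        return {"aed_km": None, "acc_at_radius": None, "coverage": coverage}
    return {
        # AED: mean error distance over the predicted tweets (assumption).
        "aed_km": sum(errors) / len(errors),
        # Accuracy@radius: fraction of predictions within radius_km of the truth.
        "acc_at_radius": sum(e <= radius_km for e in errors) / len(errors),
        "coverage": coverage,
    }
```

Passing `None` for a tweet models the case where the approach abstains, which is what makes coverage drop below 1.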
Motivation
• Existing approaches in the literature have limitations when they are adapted
to work at a fine-grained level [Kinsella et al. (2011), Paraskevopoulos et al. (2015)].
−To represent a location, these approaches aggregate the text of the tweets in this
location into a virtual document.
−We postulate that this aggregation approach can lead to a loss of important
evidence.
Chapter 3
H1:
By considering geotagged tweets individually we can preserve the
evidence lost when adapting previous approaches at a fine-
grained level, and thus we can improve the performance of fine-
grained geolocalisation.
Enabling Fine-Grained
Geolocalisation
Chapter 3 Enabling Fine-Grained Geolocalisation
Geographical Area: Grid of squares of a pre-defined size (1 km)
[Figure: Ranking of Documents, Aggregated vs. Individual]
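The difference between the two document representations can be sketched as follows. This is a rough illustration, not the thesis implementation: the flat-Earth conversion constant, the grid origin (a hypothetical point south-west of Chicago), whitespace tokenisation, and the function names are all assumptions.

```python
import math
from collections import defaultdict

CELL_KM = 1.0            # side of each grid square, as on the slide
KM_PER_DEG_LAT = 111.32  # rough conversion constant; an assumption


def grid_cell(lat, lon, origin=(41.6, -87.95), cell_km=CELL_KM):
    """Map a coordinate to the (row, col) of a roughly cell_km-sized square.

    `origin` is a hypothetical south-west corner chosen for illustration.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(origin[0]))
    row = int(math.floor((lat - origin[0]) * KM_PER_DEG_LAT / cell_km))
    col = int(math.floor((lon - origin[1]) * km_per_deg_lon / cell_km))
    return row, col


def build_individual(tweets):
    """Individual: one document per geotagged tweet, labelled with its cell."""
    return [(grid_cell(lat, lon), text.lower().split())
            for text, lat, lon in tweets]


def build_aggregated(tweets):
    """Aggregated: one virtual document per cell, concatenating its tweets."""
    docs = defaultdict(list)
    for cell, tokens in build_individual(tweets):
        docs[cell].extend(tokens)
    return dict(docs)
```

In the aggregated form, a term used by several tweets in the same square shows up as repeated occurrences inside one document, so information about how many distinct tweets mention it is no longer visible at the document level.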
Chapter 3 Enabling Fine-Grained
Geolocalisation
Conclusions
• Individual outperforms Aggregated (accuracy@1km):
−from 50.67% to 55.20% in Chicago
−from 45.40% to 48.46% in New York
• Key Findings:
−IDF is the best performing retrieval model.
−Document Frequency (DF) is the most important feature when using
individual tweets.
−The Aggregated approach loses in accuracy when transforming DF to term
frequency.
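As a minimal sketch of what ranking individual geotagged tweets with an IDF-only model could look like: each geotagged tweet is a document, the non-geotagged tweet is the query, and the location of the top-ranked tweet is returned. The exact IDF formula used in the thesis is not given on the slides; the plain `log(N / df)` below, the tie-breaking rule, and the function names are assumptions.

```python
import math
from collections import Counter


def idf_table(docs):
    """docs: list of token lists. Returns log(N / df) per term (assumed form)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def rank(query_tokens, docs, idf):
    """Rank documents by the summed IDF of the distinct query terms they share.

    Documents with a zero score are dropped; ties are broken by index.
    """
    q = set(query_tokens)
    scored = [(sum(idf[t] for t in q & set(doc)), i)
              for i, doc in enumerate(docs)]
    scored = [(s, i) for s, i in scored if s > 0]
    return [i for s, i in sorted(scored, key=lambda x: (-x[0], x[1]))]


def predict_top1(query_tokens, docs, locations, idf):
    """Return the location of the Top-1 geotagged tweet, or None if no match."""
    ranking = rank(query_tokens, docs, idf)
    return locations[ranking[0]] if ranking else None
```

Because every document is a single tweet, a term's weight depends only on how many tweets contain it, which is the document frequency signal the key findings refer to.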
Contributions
• A novel ranking approach for fine-grained geolocalisation (Individual),
compared against the baseline (Aggregated).
• An investigation into the performance issues of the existing SOTA approaches
(Aggregated).
Majority Voting For Fine-
Grained Geolocalisation
Chapter 4
Motivation
• The approach of Chapter 3 obtains an AED of 4.693 km (Chicago)
−Not enough. We aim for an AED of 1 km.
• This approach considers only the similarity evidence for prediction (always
returns the Top-1 geotagged tweet)
• We postulate that the similarity of the tweets does not always correlate
with geographical distance; in such cases we cannot predict a location.
• We aim to exploit the geographical evidence encoded within the Top-N
geotagged tweets.
H2:
The predictability of the geolocation of tweets at a fine-grained level is
given by the correlation between their content similarity and geographical
distance to finely-grained geotagged tweets.
Contributions
• A novel approach that uses a weighted majority voting to exploit the
geographical evidence encoded within the Top-N geotagged tweets in the
ranking.
Chapter 4 Majority Voting For Fine-
Grained Geolocalisation
Conclusions
• AED is markedly reduced:
−4.694 km to 1.602 km in Chicago
−4.972 km to 1.448 km in New York
• Key Findings:
−Trade-off between AED and Coverage (weighted Majority Voting)
−As the value N of the Top-N increases we observe:
• Lower AED
• But also Lower Coverage
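One way to picture a weighted majority vote over the Top-N is sketched below. The slides do not specify the weighting or the decision rule, so both are assumptions here: each retrieved tweet votes for its grid cell with a weight (for instance its retrieval score), and a location is returned only when one cell holds more than `min_share` of the total weight. The function name and the threshold are hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(top_n, min_share=0.5):
    """top_n: list of (cell, weight) pairs for the Top-N geotagged tweets.

    Returns the winning cell, or None when no cell holds more than
    `min_share` of the total weight (the tweet is left without a prediction).
    """
    totals = defaultdict(float)
    for cell, weight in top_n:
        totals[cell] += weight
    total = sum(totals.values())
    if total <= 0:
        return None
    cell, votes = max(totals.items(), key=lambda kv: kv[1])
    return cell if votes / total > min_share else None
```

Under a rule of this kind, abstaining when the Top-N disagree is what trades coverage for a lower error distance.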
[Figure: the Top-N of the Ranking of Geotagged Tweets (Chapter 3) is passed to Majority Voting]
Chapter 5 Learning to Geolocalise
H3:
By improving the ranking of geotagged tweets with respect to a given
non-geotagged tweet, […] we can obtain a higher number of other fine-
grained predictions.
Motivation
• The approaches of Chapters 3 & 4 use traditional retrieval models
−Similarity is based only on document frequency information (IDF weighting).
• Considering only IDF can limit the quality of the Top-N geotagged tweets.
Contributions
• A novel learning to rank-based approach that re-ranks geotagged tweets
−Targeting their geographical proximity to a given non-geotagged tweet.
• We propose a set of 28 features for learning.
Chapter 5 Learning to Geolocalise
Experiments
• Experiment with several learning to rank algorithms
−MART, Random Forest, RankNet, AdaRank, ListNet and LambdaMART
• Explore the best combination of features
−Document features, query features and query-dependent features.
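To make the three feature groups concrete, the sketch below builds training rows for a learning to rank algorithm. It is illustrative only: the four features shown are hypothetical stand-ins and are not the 28 features of the thesis, and labelling a geotagged tweet as relevant when it lies within 1 km of the query tweet's true location is our assumption about how "geographical proximity" could be turned into a training target.

```python
import math


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def relevance_label(query_loc, doc_loc, radius_km=1.0):
    """1 if the geotagged tweet is within radius_km of the query's true location."""
    return 1 if haversine_km(query_loc, doc_loc) <= radius_km else 0


def features(query, doc, idf):
    """Hypothetical feature vector with one example per feature group."""
    shared = set(query["tokens"]) & set(doc["tokens"])
    return [
        len(doc["tokens"]),                     # document feature
        len(query["tokens"]),                   # query feature
        len(shared),                            # query-dependent: term overlap
        sum(idf.get(t, 0.0) for t in shared),   # query-dependent: IDF score
    ]


def training_rows(query, candidates, idf):
    """One (label, feature vector) row per candidate geotagged tweet."""
    return [(relevance_label(query["loc"], d["loc"]), features(query, d, idf))
            for d in candidates]
```

Rows of this shape, grouped per query tweet, are the usual input to the learning to rank algorithms listed above.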
Chapter 5 Learning to Geolocalise
Conclusions
• LambdaMART is the best algorithm for fine-grained geolocalisation.
• A better ranking result does improve geolocalisation
−Increased accuracy while increasing coverage (best configuration @Top-13):
−AED: from 1.490 km to 1.441 km
−Coverage: from 31.88% to 46.01%
• Best combination of features:
−Extracted from the query tweet.
−Query-dependent: relation between query and doc tweets.
Application: Traffic Incident
Detection
Chapter 6
H4:
By geolocalising non-geotagged tweets we can obtain a more
representative sample of geotagged data and, therefore, improve the
effectiveness of the traffic incident detection task.
Motivation
• We aim to assess the generalisation of our approaches.
• It is important to:
−Identify traffic incident-related content.
−Know the precise location of the tweets to locate incidents.
• Detection rates suffer from the small sample size (only the 1% of tweets that
are geotagged is used) (Gu et al., 2016; Mai and Hranac, 2013)
Chapter 6 Application: Traffic Incident Detection
[Figure: Detection Rate, Geotagged Alone vs. Geolocalised + Geotagged]
[Figure: Accuracy@1km for Geotagged Alone, Geolocalised + Geotagged (High Acc. Config) and Geolocalised + Geotagged (High Coverage Config)]
Contributions
• Demonstrated the usefulness of our approaches in a traffic incident detection
task.
• We showed improvements in the performance of the traffic incident
detection task.
Chapter 6
Application: Traffic Incident
Detection
Conclusions
• Expanding the sample with new geolocalised tweets increases the
performance of the traffic incident detection task.
• The geolocalised tweets are located close to the location of the real
incidents.
• Our fine-grained geolocalisation approaches do seem to generalise
− Consistency with behaviour observed in the previous chapters on other datasets
Concluding Remarks
Across all the chapters in this thesis we have:
• Addressed the fine-grained tweet geolocalisation problem.
• Alleviated the problem of the existing state-of-the-art
approaches when working at a fine-grained level
• Proposed a suite of techniques that
−Effectively infer the geolocation of tweets at a fine-grained
level
−Provide different trade-offs between accuracy and coverage
• Showed the effectiveness and generalisation of our
approaches on a traffic incident detection task
Publication Breakdown
• “On fine-grained geolocalisation of tweets”. ICTIR 2017.
• “On fine-grained geolocalisation of tweets and real-time traffic incident
detection”. Information Processing & Management. 2018.
• “Beyond geotagged tweets: exploring the geolocalisation of tweets for
transportation applications”. In Transportation Analytics in the Era of Big
Data, Springer. 2018.
• “Learning to Geolocalise Tweets at a Fine-Grained Level”. In CIKM 2018.
Also contributed to Geographical and Social Sciences
• “Sensing spatiotemporal patterns in urban areas: analytics and visualizations
using the integrated multimedia city data platform”. Built Environment 42.3.
2016.
• “Spatial analysis of users-generated ratings of yelp venues”. In Open
Geospatial Data, Software and Standards. 2017.
5 research publications
Ch. 6
Ch. 5
Ch. 3
Ch. 4
Ch. 4
Thanks for listening!
Glad to answer your
questions
Inferring the Geolocation of Tweets at a Fine-Grained Level - PhD Thesis

  • 1. Inferring the Geolocation of Tweets at a Fine-Grained Level Jorge David Gonzalez Paule PhD Thesis Defence – 7th November 2018 Supervisors: Iadh Ounis, Craig Macdonald, Yashar Moshfeghi
  • 2. Motivations • Only 1% of the Twitter stream contains geographical information [Graham et al., 2014]. • This sample of tweets is insufficient for several applications such as Disaster and Emergency Management or Traffic Incident Detection (Chapter 6) • Inferring the geolocation of non-geotagged tweets can increase the sample of actionable geotagged data. • Previous tweet geolocalisation approaches work at a coarse-grained level (country or city level) [Eisenstein et al., 2010a; Han and Cook, 2013; Kinsella et al., 2011; Schulz et al., 2013a] • We aim to infer the geolocation of tweets at a fine-grained level (street or neighbour level). In this thesis, we aim to achieve an error distance of at most 1 km! 2
  • 3. Thesis Statement “The geolocalisation of non-geotagged tweets at a fine-grained level can be achieved by exploiting the characteristics of already available individual finely-grained geotagged tweets.” Chapter 3 Chapter 4 Chapter 5 Chapter 6 H4: By geolocalising non-geotagged tweets we can obtain a more representative sample of geotagged data and, therefore, improve the effectiveness of the traffic incident detection task. H3: By improving the ranking of geotagged tweets with respect to a given non-geotagged tweet, […] we can obtain a higher number of other fine-grained predictions. H2: The predictability of the geolocation of tweets at a fine-grained level is given by the correlation between their content similarity and geographical distance to finely-grained geotagged tweets. H1: By considering geotagged tweets individually […] we can improve the performance of fine- grained geolocalisation. 3
  • 4. Thesis Structure Chapter 3 Chapter 4 Chapter 5 Chapter 6 Fine-Grained Geolocalisation of Tweets. Application: Traffic Incident Detection Ranking Approach Ranking Approach + Majority Voting Learning to Rank + Majority Voting Sophisticated but more accurate(1km) Simple but efficient • Evaluate effectiveness • Assess the generalisation of our approaches on a new dataset Evolution of Techniques 4
  • 5. Data and Metrics Datasets • 3 different datasets of geotagged tweets −Chicago (March 2016) −New York (March 2016) −Chicago (July 2016) Metrics • Average Error Distance (AED) −Distance on Earth between the predicted and the real location • Accuracy@1km −The fraction of predicted locations that lie within a radius of 1 km from the real location • Coverage −The fraction of tweets for which our approach finds a geolocation Chapters 3,4 and 5 Chapter 6 5
  • 6. Motivation • Existing approaches in the literature have limitations when they are adapted to work at a fine-grained level [Kinsella et al. (2011), Paraskevopoulos et al. (2015)]. −To represent a location, these approaches aggregate the text of the tweets in this location into a virtual document. −We postulate that this aggregation approach can lead to a loss of important evidence. Chapter 3 H1: By considering geotagged tweets individually we can preserve the evidence lost when adapting previous approaches at a fine- grained level, and thus we can improve the performance of fine- grained geolocalisation. Enabling Fine-Grained Geolocalisation 6
  • 7. Chapter 3 Geographical Area: Grid of squares of a pre-defined size (1km) Ranking of Documents Enabling Fine-Grained Geolocalisation Aggregated Individual 7
  • 8. Chapter 3 Enabling Fine-Grained Geolocalisation Conclusions • Individual outperforms Aggregated (accuracy@1km): −from 50.67% to 55.20% in Chicago −from 45.40% to 48.46% in New York • Key Findings: −IDF is the best performing retrieval model. −Document Frequency (DF) is the most important feature when using individual tweets. −The Aggregated approach loses in accuracy when transforming DF to term frequency. Contributions • A novel ranking approach for fine-grained geolocalisation (Individual), compared against the baseline (Aggregated). • An investigation into the performance issues of the existing SOTA approaches (Aggregated). 8
  • 9. Majority Voting For Fine- Grained Geolocalisation Chapter 4 Motivation • The approach of Chapter 3 obtains an AED of 4.693 km (Chicago) −Not enough. We aim for an AED of 1 km. • This approach considers only the similarity evidence for prediction (always returns the Top-1 geotagged tweet) • We postulate that in some cases the similarity of the tweets does not always correlate with geographical distance, thus we cannot predict a location in such cases. • We aim to exploit the geographical evidence encoded within the Top-N geotagged tweets. 9 H2: The predictability of the geolocation of tweets at a fine-grained level is given by the correlation between their content similarity and geographical distance to finely-grained geotagged tweets.
  • 10. Contributions • A novel approach that uses weighted majority voting to exploit the geographical evidence encoded within the Top-N geotagged tweets in the ranking. Chapter 4 Majority Voting For Fine-Grained Geolocalisation Conclusions • AED is markedly reduced: −4.694 km to 1.602 km in Chicago −4.972 km to 1.448 km in New York • Key Findings: −Trade-off between AED and Coverage (weighted Majority Voting) −As the value N of the Top-N increases we observe: • Lower AED • But also Lower Coverage Ranking of Geotagged Tweets (Chapter 3) Majority Voting Top-N 10
  • 11. Learning to Geolocalise Chapter 5 H3: By improving the ranking of geotagged tweets with respect to a given non-geotagged tweet, […] we can obtain a higher number of other fine-grained predictions. Motivation • The approaches of Chapters 3 & 4 use traditional retrieval models −Similarity is based only on document frequency information (IDF weighting). • Considering only IDF can limit the quality of the Top-N geotagged tweets. 11
  • 12. Contributions • A novel learning to rank-based approach that re-ranks geotagged tweets −Targeting their geographical proximity to a given non-geotagged tweet. • We propose a set of 28 features for learning. Chapter 5 Learning to Geolocalise Experiments • Experiment with several learning to rank algorithms −MART, Random Forest, RankNet, AdaRank, ListNet and LambdaMART • Explore the best combination of features −Document features, query features and query-dependent features. 12
  • 13. Chapter 5 Learning to Geolocalise Conclusions • LambdaMART is the best algorithm for fine-grained geolocalisation. • A better ranking result does improve geolocalisation −Increased accuracy while increasing coverage (best configuration @Top-13): −AED: from 1.490 km to 1.441 km −Coverage: from 31.88% to 46.01% • Best combination of features: −Extracted from the query tweet. −Query-dependent: relation between query and doc tweets. 13
  • 14. Application: Traffic Incident Detection Chapter 6 H4: By geolocalising non-geotagged tweets we can obtain a more representative sample of geotagged data and, therefore, improve the effectiveness of the traffic incident detection task. Motivation • We aim to assess the generalisation of our approaches. • It is important to: −Identify traffic incident-related content. −Know the precise location of the tweets to locate incidents. • There are detection rate problems due to the small sample size (only the 1% of tweets that are geotagged are used) (Gu et al., 2016; Mai and Hranac, 2013) 14
  • 15. Application: Traffic Incident Detection Chapter 6 Geolocalised + Geotagged Geotagged Alone Detection Rate 15
  • 16. Application: Traffic Incident Detection Chapter 6 Geotagged Alone Geolocalised + Geotagged (High Acc. Config) Geolocalised + Geotagged (High Coverage Config) Accuracy@1km 16
  • 17. Contributions • Demonstrated the usefulness of our approaches in a traffic incident detection task. • We showed improvements in the performance of the traffic incident detection task. Chapter 6 Application: Traffic Incident Detection Conclusions • Expanding the sample with new geolocalised tweets increases the performance of the traffic incident detection task. • The geolocalised tweets are located close to the location of the real incidents. • Our fine-grained geolocalisation approaches do seem to generalise − Consistency with behaviour observed in the previous chapters on other datasets 17
  • 18. Concluding Remarks Across all the chapters in this thesis we have… • Addressed the fine-grained tweet geolocalisation problem. • Alleviated the problem of the existing state-of-the-art approaches when working at a fine-grained level • Proposed a suite of techniques that −Effectively infer the geolocation of tweets at a fine-grained level −Provide different trade-offs between accuracy and coverage • Showed the effectiveness and generalisation of our approaches on a traffic incident detection task 18
  • 19. Publication Breakdown • “On fine-grained geolocalisation of tweets”. ICTIR 2017. • “On fine-grained geolocalisation of tweets and real-time traffic incident detection”. Information Processing & Management. 2018. • “Beyond geotagged tweets: exploring the geolocalisation of tweets for transportation applications”. In Transportation Analytics in the Era of Big Data, Springer. 2018. • “Learning to Geolocalise Tweets at a Fine-Grained Level”. In CIKM 2018. Also contributed to Geographical and Social Sciences • “Sensing spatiotemporal patterns in urban areas: analytics and visualizations using the integrated multimedia city data platform”. Built Environment 42.3. 2016. • “Spatial analysis of users-generated ratings of Yelp venues”. In Open Geospatial Data, Software and Standards. 2017. 5 research publications Ch. 6 Ch. 5 Ch. 3 Ch. 4 Ch. 4 19
  • 20. Thanks for listening! Glad to answer your questions 20
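
The weighted majority voting step of Chapter 4 (slides 9 and 10) can be illustrated with a minimal sketch. The slides do not give the weighting scheme or the decision rule, so everything here is an assumption: each Top-N geotagged tweet votes for the 1 km grid cell it falls in, the vote is weighted by a hypothetical retrieval score, and a location is predicted only when one cell holds more than `threshold` of the total weight. The function name `weighted_majority_vote` and its parameters are illustrative, not taken from the thesis.

```python
from collections import defaultdict
from typing import Hashable, Optional, Sequence, Tuple


def weighted_majority_vote(
    top_n: Sequence[Tuple[Hashable, float]],
    threshold: float = 0.5,
) -> Optional[Hashable]:
    """Return the grid cell backed by a weighted majority of the Top-N tweets.

    top_n holds one (cell_id, weight) pair per retrieved geotagged tweet.
    The weight is assumed to be a retrieval score; the thesis may weight
    votes differently. None means "no prediction" for this tweet.
    """
    votes = defaultdict(float)
    for cell_id, weight in top_n:
        votes[cell_id] += weight  # accumulate the weight per grid cell

    total = sum(votes.values())
    if total <= 0:
        return None  # empty ranking or no usable weights

    best_cell, best_weight = max(votes.items(), key=lambda kv: kv[1])
    # Predict only when one cell clearly dominates (assumed rule).
    if best_weight / total > threshold:
        return best_cell
    return None
```

Under this assumed rule, tweets whose Top-N neighbours are spread over many cells receive no prediction, which is one way the trade-off between AED and Coverage reported on slide 10 could arise.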

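The three measures quoted throughout the slides (AED, accuracy@1km and Coverage) can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code of the thesis: error distance is taken to be the great-circle (haversine) distance on a sphere of radius 6371 km, AED and accuracy@1km are computed only over tweets that received a prediction, and Coverage is the fraction of tweets with a prediction. The names `haversine_km` and `evaluate` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in km between two points (spherical Earth)."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def evaluate(
    predictions: List[Optional[LatLon]],
    truths: List[LatLon],
) -> Dict[str, Optional[float]]:
    """Compute AED, accuracy@1km and coverage; None marks "no prediction"."""
    errors = [haversine_km(p, t)
              for p, t in zip(predictions, truths) if p is not None]
    coverage = len(errors) / len(truths) if truths else 0.0
    if not errors:
        return {"aed_km": None, "accuracy_at_1km": None, "coverage": coverage}
    return {
        "aed_km": sum(errors) / len(errors),  # average error distance
        "accuracy_at_1km": sum(e <= 1.0 for e in errors) / len(errors),
        "coverage": coverage,
    }
```
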
Editor's Notes

  1. Talk about the sparsity problem and give an example: traffic incident detection. So dealing with such a small sample is challenging and limits the task. Could we increase the sample by inferring the geolocation of non-geotagged tweets?
  2. How can we infer that? Well, our statement says that….. In order to validate our statement, we have identified 4 main hypotheses that we will tackle in each of the chapters. We will see those hypotheses in detail later.
  3. The rest of the thesis is structured as follows. Chapters 3 to 5 explore the problem of fine-grained geolocalisation and its challenges. Remember that our challenge is to reduce the error to 1 km distance, and during this journey we describe an evolution of techniques (from less to more sophisticated) in order to achieve that. Finally, in Chapter 6 we will show how these approaches generalise when applied to improve the traffic incident detection task.
  4. So, in order to evaluate our approaches, throughout the thesis we will use 3 different datasets. Describe data…. Then, in order to measure how successful our approaches are, we have selected the following metrics. Special attention to Accuracy@1km, which is our metric of success.
  5. How did we start this journey? First, we identified a possible drawback in the state-of-the-art approaches, which we explore and propose a solution for in this chapter. The main problem is that they aggregate….. We hypothesise that such aggregation …… read hypothesis
  6. This chart provides a visual overview of the approaches….. Introduce the names of the approaches: AGGREGATED and INDIVIDUAL
  7. Finally, we observed that the individual approach outperforms aggregated, increasing accuracy@1km. This validates our hypothesis. In addition, we observed that IDF is the best feature for the task (as IDF and TF-IDF outperform the others). Finally, we observed that when aggregating the tweets, models that rely on IDF perform badly, but models that rely on term frequency perform better. This suggests that the evidence encoded in terms of IDF is transformed into term frequency when documents are aggregated.
  8. In Chapter 3, we improved the accuracy of the state-of-the-art by proposing a ranking approach of individual tweets. However, we are still far away from our objective of 1 km (4.69 km). The approach in Chapter 3 always returns the Top-1 most similar geotagged tweet as the predicted location. However, we postulate that in some cases the similarity between two tweets does not correlate with their geographical proximity.
  9. To solve that issue, we aim to explore the geographical evidence encoded within the Top-N geotagged tweets. Using a majority voting algorithm we can obtain a location in which the majority of the Top-N tweets fall, and that will give us enough evidence for predicting a location (correlation exists). By doing that, in this chapter we reduced the AED. However, due to such correlation hypothesis, we observed a trade-off between AED and Coverage (a measure that allows us to see for how many tweets we find a prediction).
  10. So, at this point we were very close to our objective of 1 km. However, we think we can improve accuracy further. Previous approaches in the thesis use an IDF model to return the Top-N most likely locations. Considering only IDF to rank the tweets can limit the quality of the Top-N. We hypothesise that by improving the ranking, we can improve geolocalisation.
  11. What do we propose to improve the ranking? Use machine learning (learning to rank) in order to learn from more characteristics of the tweets. We propose an L2R approach that re-ranks geotagged tweets based on their geographical proximity (label). And we propose a set of features for doing that. Explain experiments……
  12. As a result, first we observed that LambdaMART is the best ranking algorithm for geolocalisation. Second, we observed that by improving the ranking we also improved accuracy and reduced AED, while maintaining (even increasing) coverage. This validated our hypothesis. Finally, we observed that features that model the relationship between query and document tweets, combined with query features, were the most effective.