Pairing Tweets with the Right Location

By
Esha and Osmar Zaïane
Alberta Machine Intelligence Institute,
University of Alberta, Edmonton, AB, Canada

• Introduction
• Related Work
• DigiCities
• Methodology
• Key Findings
• Limitations
• Conclusion
• Future Work
Presentation Overview

• Twitter has become ubiquitous and is used for a variety of
purposes, e.g.:
▪ Promotion of products and services, information dissemination, event updates
• Tweets are becoming digital footprints of users’ expressions in the
real world
• Information in tweets posted by users may have local relevance, e.g.:
▪ It can be used to understand trends and emerging emotions and sentiments
in a geographical location
• Identifying location-relevant tweets is critical for using the
information posted by users effectively in the context of a geographical location
Introduction

• Geolocation detection is a challenging problem in the context of
Twitter (Cheng et al., 2010; Lee et al., 2014)
▪ Only a limited number of tweets are geotagged or have correct
geolocation information in their metadata records (e.g., Graham et
al., 2014; Lee et al., 2014; Watanabe et al., 2011)
▪ Data sparsity, i.e., only a limited number of tweets contain a specific
city name (e.g., Chang et al., 2012; Inkpen et al., 2015)
▪ Users may include location information at varying levels of granularity
when referring to a specific location (e.g., Huang et al., 2019)
Introduction

• John (a hypothetical Twitter user) resides in St.
Paul, Minnesota, USA
• His profile states Minneapolis as the location (St.
Paul and Minneapolis are the twin cities)
• Currently, John is traveling to Toronto in Canada
• He is sitting in a restaurant and watching a hockey
game on TV played in Calgary, Canada
• He tweets about it – Just watched another win by
#CalgaryFlames an amazing game played
@TheSaddledome #YYC
• Calgary is the event-related location
• Two geolocations captured in the metadata record are:
▪ Minneapolis – Twitter profile
▪ Toronto – location from which the tweet was posted
• But neither is relevant to the posted tweet
Scenario

• The scenario reiterates the argument that the location information
in metadata records may not be relevant to a tweet's content
• However, in a large number of tweets, the content itself carries
relevant contextual location information
▪ Such information can be exploited to identify the location referenced by
users in their tweets
• We propose a novel approach, called DigiCities
▪ It adds geographical context to tweets by harnessing information
included in the content of tweets
Introduction

• Researchers have used different features and techniques to identify
or improve location detection for tweets, e.g.:
▪ Exploiting variations in the language and terms used in tweets to
identify locations (e.g., Cheng et al., 2010; Hong et al., 2012)
▪ Utilizing the location mentioned in tweet content, the location in the
user's profile, and the location captured when the tweet was posted to
identify the appropriate city-level location (Shen et al., 2018)
Related Work

▪ Exploiting a tweet's content together with a number of metadata elements
(e.g., user-description & user-location) in a neural network-based
framework to predict locations for tweets (Thomas & Hennig, 2017)
▪ Using location-relevant terms from the content of tweets to identify
locations with a Convolutional Neural Network (CNN) based
framework (Kumar & Singh, 2019)
▪ Using an unsupervised method that draws on users' past tweets and Google
Trends to estimate users' locations (Zola et al., 2020)
Related Work

▪ Harnessing information in the metadata record of a user’s Twitter
profile, including the profile location, time zone, and language, to
detect country-level location (Almadany et al., 2020)
▪ Utilizing recorded coordinates and tweet content, and then
applying geographical knowledge through a specific set of rules, to detect
event location (Ying et al., 2018)
▪ Researchers like Paradesi (2011) and Inkpen et al. (2015) aimed to
address geo/geo ambiguity and geo/non-geo ambiguity, e.g.:
o Memphis - location in Egypt and the US (geo/geo)
o Berlin as the name of a person and also a location name in Germany (geo/non-geo)
Related Work

• The information on the Internet reflects our physical world
(Kindberg et al., 2002)
• “virtual worlds... serve as digital equivalents to...physical world”
(Warf and Sui, 2010, p. 202)
• Drawing on their viewpoints, a real-world geographical location
can be represented by multiple facets in the virtual/digital world
• Our proposed novel approach, DigiCities, is based on a
linkage between the digital world and the physical world
What is DigiCities

DigiCities and the POP Framework
• DigiCities is the digital avatar of real world cities
▪ A city can be represented in a digital world by multiple facets including:
o People (e.g., City Mayor)
o Organizations (e.g., Local Museum, library)
o Places (e.g., local airport)
We call it the POP Framework!
• This framework helps in creating the digital profile of a location
Diagram: The POP Framework (a city in digital space, represented by People, Organizations, and Places)

DigiCities: Mapping Real and Digital World
• Facets in the POP Framework are digitally reflected in tweets
by:
▪ Handles (or user-ids) (starting with @)
▪ Hashtags (starting with #)
• Both handles and hashtags semantically represent an
entity, such as a geographical location
• Tweet Example
▪ I was on #5thavenue and guess who I saw? @Billdeblasio coming
out of the #nypl
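
As a rough illustration of how handles and hashtags could be pulled out of a tweet, the sketch below uses a regular expression. The pattern (an '@' or '#' followed by word characters) and the function name `extract_markers` are assumptions made for this example; the slides do not say how markers were extracted.

```python
import re

# Assumption: a handle or hashtag is '@' or '#' followed by letters, digits, or '_'.
MARKER_RE = re.compile(r"[@#]\w+")

def extract_markers(tweet):
    """Return handles and hashtags in order of appearance, lowercased."""
    return [m.lower() for m in MARKER_RE.findall(tweet)]

tweet = ("I was on #5thavenue and guess who I saw? "
         "@Billdeblasio coming out of the #nypl")
print(extract_markers(tweet))
# ['#5thavenue', '@billdeblasio', '#nypl']
```

Lowercasing is one simple way to treat variants such as `@Billdeblasio` and `@billdeblasio` as the same marker.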

DigiCities and the POP Framework
• I was on #5thavenue and guess who I saw? @Billdeblasio
coming out of the #nypl
▪ #5thavenue: Place (iconic New York avenue)
▪ @Billdeblasio: People (Mayor of New York)
▪ #nypl: Organization (New York Public Library)
• Such a representation can help in feature convergence and feature strengthening
(Saif et al., 2012), i.e.:
▪ Handles and hashtags refer to different facets (POP) associated with a
geographical location
▪ They thereby converge to one semantic concept, i.e., a location (a city → New York
in the above example)
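
The convergence idea can be sketched as a lookup: each profiled marker maps to a facet and a city, and the markers found in one tweet are tallied per city. The `PROFILE` table holds only the three markers from the example tweet. The tallying step is an illustrative assumption, not a procedure described in the slides.

```python
import re
from collections import Counter

# Marker -> (POP facet, city token), taken from the example tweet above.
PROFILE = {
    "#5thavenue": ("Place", "newyork"),
    "@billdeblasio": ("People", "newyork"),
    "#nypl": ("Organization", "newyork"),
}

def city_votes(tweet):
    """Count how many profiled markers in the tweet point to each city."""
    markers = re.findall(r"[@#]\w+", tweet.lower())
    return Counter(PROFILE[m][1] for m in markers if m in PROFILE)

tweet = ("I was on #5thavenue and guess who I saw? "
         "@Billdeblasio coming out of the #nypl")
print(city_votes(tweet).most_common(1))
# [('newyork', 3)]
```

Three different surface forms, one per facet, end up supporting the same city.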

Map of Alberta with the shortlisted cities marked: Edmonton, Calgary, Red Deer, Medicine Hat, Fort McMurray, Lethbridge, Banff, St. Albert
• Eight (8) cities from the Province of Alberta,
a mix of different-sized urban population
centers:
▪ The provincial capital (Edmonton)
▪ The largest city in Alberta (Calgary)
▪ A popular tourist destination (Banff) with a
transient population
▪ The twin city of a larger population center (St.
Albert)
▪ An industrial center (Fort McMurray)
▪ Other key but relatively small cities (Red Deer,
Lethbridge, Medicine Hat)
DigiCities: Shortlisted Cities
Image Source: https://www.yellowmaps.com/map/alberta-printable-map-618.htm

Create Digital Profile of Cities
Diagram: Process of Developing Digital Profile of Cities
• The first step in implementing our approach is to create digital
profiles of cities, as they are represented on Twitter by the elements
of the POP framework through handles (‘@’) and hashtags (‘#’)
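
One possible way to store a digital profile is a nested mapping from city to POP facet to markers, inverted into a marker-to-city index for lookup. This is a sketch under assumptions. The Calgary markers come from the scenario tweet, but their facet assignment, the Edmonton entry, and all function names are hypothetical.

```python
from typing import Dict, List

# Illustrative entries only; real profiles hold many more markers and variants.
DIGITAL_PROFILES: Dict[str, Dict[str, List[str]]] = {
    "calgary": {
        "people": [],
        "organizations": ["#calgaryflames"],
        "places": ["@thesaddledome", "#yyc"],
    },
    "edmonton": {
        "people": [],
        "organizations": [],
        "places": ["#yeg"],  # hypothetical entry
    },
}

def build_index(profiles):
    """Invert city -> facet -> markers into marker -> city."""
    index = {}
    for city, facets in profiles.items():
        for markers in facets.values():
            for marker in markers:
                index[marker.lower()] = city
    return index

def profile_sizes(profiles):
    """Number of markers recorded for each city."""
    return {city: sum(len(markers) for markers in facets.values())
            for city, facets in profiles.items()}

index = build_index(DIGITAL_PROFILES)
print(index["#yyc"])                    # calgary
print(profile_sizes(DIGITAL_PROFILES))  # {'calgary': 3, 'edmonton': 1}
```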

Digital Profile of Cities
• Total number of handles, hashtags, and their variants in each
city’s digital profile:
▪ Lethbridge: 98
▪ Medicine Hat: 46
▪ Red Deer: 112
▪ St. Albert: 72
▪ Banff: 114
▪ Calgary: 214
▪ Edmonton: 198
▪ Fort McMurray: 100

DigiCities Implementation – An Overview

• DigiCities was implemented using two approaches
▪ Append Strategy and Replace Strategy
Original Tweet:
Just landed at #LGA and went straight to @Broadwaycom
so see #Aladdin. This is why I love the #biggapple.
Append Strategy:
Just landed at #LGA newyork and went straight to @Broadwaycom
newyork so see #Aladdin. This is why I love the #biggapple newyork
Replace Strategy:
Just landed at newyork and went straight to newyork so see
#Aladdin. This is why I love the newyork.
Implementation – Append and Replace
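
The two strategies can be sketched with a regular-expression substitution. The `MARKER_TO_CITY` table is a hypothetical slice of a New York profile built from the example tweet, and the function names are chosen for this illustration. Markers not in the profile (here `#Aladdin`) are left untouched, as in the slide.

```python
import re

# Hypothetical profile slice: marker (lowercased) -> city token.
MARKER_TO_CITY = {
    "#lga": "newyork",
    "@broadwaycom": "newyork",
    "#biggapple": "newyork",
}
MARKER_RE = re.compile(r"[@#]\w+")

def append_strategy(tweet):
    """Keep each profiled marker and add its city token right after it."""
    def repl(match):
        marker = match.group(0)
        city = MARKER_TO_CITY.get(marker.lower())
        return f"{marker} {city}" if city else marker
    return MARKER_RE.sub(repl, tweet)

def replace_strategy(tweet):
    """Swap each profiled marker for its city token."""
    def repl(match):
        marker = match.group(0)
        return MARKER_TO_CITY.get(marker.lower(), marker)
    return MARKER_RE.sub(repl, tweet)

tweet = ("Just landed at #LGA and went straight to @Broadwaycom "
         "so see #Aladdin. This is why I love the #biggapple.")
print(append_strategy(tweet))
print(replace_strategy(tweet))
```

Append preserves the original marker as a feature alongside the city token, while replace collapses all profiled markers for a city into a single token.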

• A total of 4,500 tweets were manually selected
▪ 500 tweets per city (8 cities × 500 = 4,000 tweets)
▪ An additional 500 tweets for the ‘others’ category
• Basic preprocessing involved preliminary cleaning, such as removal of
HTML tags and special characters (this dataset was labelled as the
Baseline Data)
• Stopwords were not removed and stemming was not applied to the
dataset
• Three algorithms were used: k-Nearest Neighbour (kNN), Naïve Bayes
(NB) and Sequential Minimal Optimization (SMO)
Dataset, Data Preparation and Algorithms

• A total of 27 classification experiments were performed
Classification Experimentation Details
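
The count of 27 is consistent with a full grid of three algorithms, three data variants, and three preprocessing settings, which is how the results slides are organized. The slides do not list the runs explicitly, so the breakdown below is an inference and the labels are chosen for this sketch.

```python
from itertools import product

ALGORITHMS = ["kNN", "NB", "SMO"]
DATA_VARIANTS = ["baseline", "replace", "append"]
PREPROCESSING = ["none", "stopwords_removed", "stemming_applied"]

# One experiment per (algorithm, preprocessing, data variant) combination.
experiments = list(product(ALGORITHMS, PREPROCESSING, DATA_VARIANTS))
print(len(experiments))  # 27
```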

• Classification accuracy scores improved
significantly for all three algorithms over
the baseline data using both the append and
replace strategies
• Compared with each algorithm's baseline
accuracy, for example, with the append strategy:
▪ kNN had the highest improvement (~22 percentage points)
▪ NB had the next best improvement (~15 percentage points)
▪ SMO also improved (~6 percentage points), but by
less than kNN and NB
Chart: accuracy by algorithm and data variant
Algorithm   Baseline (B)   Replace (R)   Append (A)
kNN         47.6%          56.1%         69.6%
NB          69.9%          81.0%         85.1%
SMO         87.8%          93.8%         93.9%
Impact of DigiCities (No Preprocessing)
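
The improvements quoted on this slide can be checked directly from the chart values. In the sketch below, the assignment of each value to an algorithm and data variant is inferred from the bullets, and the gains are differences in percentage points.

```python
# Accuracy (%) without preprocessing, as read from the chart on this slide.
ACCURACY = {
    "kNN": {"baseline": 47.6, "replace": 56.1, "append": 69.6},
    "NB":  {"baseline": 69.9, "replace": 81.0, "append": 85.1},
    "SMO": {"baseline": 87.8, "replace": 93.8, "append": 93.9},
}

def gain(algorithm, strategy):
    """Accuracy gain over the baseline, in percentage points."""
    scores = ACCURACY[algorithm]
    return round(scores[strategy] - scores["baseline"], 1)

for algorithm in ACCURACY:
    print(algorithm, gain(algorithm, "append"), gain(algorithm, "replace"))
# kNN 22.0 8.5
# NB 15.2 11.1
# SMO 6.1 6.0
```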

WS: Without Stopwords (i.e., Stopwords Removed)
• kNN and NB Algorithms
▪ Removal of stopwords alone, as well as the
implementation of DigiCities, improved the
accuracy scores significantly
▪ The append strategy worked relatively better
than the replace strategy
• SMO Algorithm
▪ Removal of stopwords alone did not play a
critical role in improving classification
accuracy scores
▪ Use of DigiCities had an impact on the
accuracy scores with or without stopwords
Chart: accuracy with stopwords removed
Algorithm   Baseline (B_WS)   Replace (R_WS)   Append (A_WS)
kNN         58.8%             74.6%            83.0%
NB          77.3%             88.4%            89.9%
SMO         89.1%             94.1%            94.2%
Chart: accuracy without preprocessing, repeated from the "Impact of DigiCities (No Preprocessing)" slide for comparison
DigiCities and Stopwords
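
A minimal sketch of stopword removal applied to a tweet already processed with the append strategy. The stopword list here is hypothetical and deliberately tiny, since the slides do not say which list was used. Keeping handles and hashtags unconditionally is also an assumption of this example.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"i", "a", "an", "and", "at", "is", "of", "on", "the", "to", "was"}

def remove_stopwords(tweet):
    """Drop stopwords; always keep handles (@) and hashtags (#)."""
    kept = []
    for token in tweet.split():
        is_marker = token[0] in "@#"
        if is_marker or token.lower().strip(".,!?") not in STOPWORDS:
            kept.append(token)
    return " ".join(kept)

appended = "Just landed at #LGA newyork and went straight to @Broadwaycom newyork"
print(remove_stopwords(appended))
# Just landed #LGA newyork went straight @Broadwaycom newyork
```

The city tokens added by DigiCities survive the filter, so they make up a larger share of the remaining features.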

SA: Stemming Applied
• NB, kNN & SMO
▪ Stemming alone had only a marginal impact on
the accuracy scores
▪ With DigiCities implemented, accuracy scores
on the stemmed data improved
o The improvement can only be attributed to
our approach
Chart: accuracy with stemming applied
Algorithm   Baseline (B_SA)   Replace (R_SA)   Append (A_SA)
kNN         48.3%             57.8%            70.0%
NB          69.6%             80.3%            85.2%
SMO         87.4%             94.0%            93.9%
Chart: accuracy without preprocessing, repeated from the "Impact of DigiCities (No Preprocessing)" slide for comparison
DigiCities and Stemming

• Both append and replace strategies helped in
improving the accuracy scores for all three
algorithms
• NB and kNN Algorithms
▪ The append strategy gave relatively better results
than the replace strategy
▪ The difference between the accuracy scores achieved
with the append and replace strategies was
statistically significant
• SMO Algorithm
▪ There was no statistically significant difference in
the accuracy scores achieved with the append and replace
strategies
Charts: accuracy under no preprocessing, stopwords removed (WS), and stemming applied (SA), repeated from the preceding results slides for comparison
DigiCities – Append vs Replace

Impact of DigiCities – P/R Scores
                      No Preprocessing              Stopwords Removed             Stemming Applied
Algorithm  Measure    Baseline  Append    Replace   Baseline  Append    Replace   Baseline  Append    Replace
kNN        Precision  0.66      0.75      0.72      0.68      0.86      0.83      0.65      0.75      0.71
kNN        Recall     0.48      0.70      0.56      0.59      0.83      0.75      0.48      0.70      0.58
NB         Precision  0.75      0.88      0.84      0.82      0.89      0.92      0.75      0.88      0.83
NB         Recall     0.70      0.85      0.81      0.77      0.90      0.88      0.70      0.80      0.85
SMO        Precision  0.91      0.95      0.95      0.93      0.96      0.95      0.90      0.95      0.95
SMO        Recall     0.88      0.94      0.94      0.89      0.94      0.94      0.87      0.94      0.94

• Our approach, DigiCities, helped in improving the classification
accuracy scores
▪ For all three algorithms: kNN, NB, and SMO
▪ With either the append or the replace strategy
• For the SMO algorithm, neither removal of stopwords nor stemming
played a critical role when used with the DigiCities approach
• Removal of stopwords combined with DigiCities positively impacted
classification accuracy for both the kNN and NB algorithms
Summary – Key Findings

• Stemming of tweet data may not play a critical role, particularly
when used with DigiCities
• Both append and replace strategies helped in improving the
classification accuracy of all three algorithms
• The append strategy works better than the replace strategy
for implementing DigiCities with the kNN and NB algorithms; with
SMO, either strategy works
Summary – Key Findings

Limitations, Conclusion
& Future Work

• Geographical bias: only eight cities from one province
• No prior research on modeling digital profiles of cities
• Lack of city-level knowledge may impact the development of digital
profiles of cities
• Tweet dataset used in this research
▪ Small dataset
▪ Tweet selection bias
▪ No inter-coder reliability
Limitations

• We proposed a novel approach, DigiCities, which uses the POP
Framework to map real-world locations in the digital world
• The facets of the POP Framework include:
▪ People, Organizations, and Places
• DigiCities helped in pairing tweets with the right location
by harnessing city-relevant information from the content of
tweets, particularly hashtags and handles
Conclusion

• Areas of future work include:
▪ Automating the process of building digital profiles of cities
▪ Increasing the diversity of cities and the dataset size
▪ Enhancing the POP Framework further by adding new facets
such as local language and seasonal terms
▪ Testing our approach by varying hyperparameters and using other
classification algorithms
▪ Implement this approach in combination with other approaches (e.g.,
Inkpen et al., 2015) to make improvements in location detection and
disambiguation
Future Work

References
• Acampora, G., Anastasio, P., Risi, M., Tortora, G., Vitiello, A.: Automatic event geo-location in twitter. IEEE
Access 8, 128213-128223 (2020)
• Almadany, Y., Saer, K.M., Jameil, A.K., Albawi, S.: A novel algorithm for estimation of twitter users location
using public available information. International Journal on Smart Sensing & Intelligent Systems 13(1) (2020)
• Chang, H.w., Lee, D., Eltaher, M., Lee, J.: @ phillies tweeting from philly? Predicting twitter user locations
with spatial word usage. In: IEEE International Conference on Advances in Social Networks Analysis and
Mining. pp. 111-118 (2012)
• Cheng, Z., Caverlee, J., Lee, K.: You are where you tweet: a content-based approach to geo-locating twitter
users. In: ACM international conference on Information and knowledge management. pp. 759-768 (2010)
• Graham, M., Hale, S.A., Gaffney, D.: Where in the world are you? geolocation and language identification in
twitter. The Professional Geographer 66(4), 568-578 (2014)
• Hong, L., Ahmed, A., Gurumurthy, S., Smola, A.J., Tsioutsiouliklis, K.: Discovering geographical topics in the
twitter stream. In: International conference on World Wide Web. pp. 769-778. ACM (2012)
• Huang, C.Y., Tong, H., He, J., Maciejewski, R.: Location prediction for tweets. Frontiers in Big Data 2, 5
(2019)

• Inkpen, D., Liu, J., Farzindar, A., Kazemi, F., Ghazi, D.: Detecting and disambiguating locations mentioned in
twitter messages. In: International Conference on Intelligent Text Processing and Computational Linguistics.
pp. 321-332. Springer (2015)
• Kindberg, T., Barton, J., Morgan, J., Becker, G., Caswell, D., Debaty, P., Gopal, G., Frid, M., Krishnan, V.,
Morris, H., et al.: People, places, things: Web presence for the real world. Mobile Networks and Applications
7(5), 365-376 (2002)
• Kumar, A., Singh, J.P.: Location reference identification from tweets during emergencies: A deep learning
approach. International journal of disaster risk reduction 33, 365-375 (2019)
• Lee, K., Ganti, R.K., Srivatsa, M., Liu, L.: When twitter meets foursquare: tweet location prediction using
foursquare. In: International Conference on Mobile and Ubiquitous Systems: Computing, Networking and
Services. pp. 198-207 (2014)
• Paradesi, S.M.: Geotagging tweets using their content. In: Twenty-Fourth International FLAIRS Conference
(2011)
• Saif, H., He, Y., Alani, H.: Semantic sentiment analysis of twitter. In: International semantic web conference.
pp. 508-524. Springer (2012)

• Shen, W., Liu, Y., Wang, J.: Predicting named entity location using twitter. In: IEEE International Conference
on Data Engineering (ICDE). pp. 161-172 (2018)
• Thomas, P., Hennig, L.: Twitter geolocation prediction using neural networks. In: International Conference of
the German Society for Computational Linguistics and Language Technology. pp. 248-255. Springer (2017)
• Warf, B., Sui, D.: From gis to neogeography: ontological implications and theories of truth. Annals of GIS
16(4), 197-209 (2010)
• Watanabe, K., Ochi, M., Okabe, M., Onai, R.: Jasmine: a real-time local-event detection system based on
geolocation information propagated to microblogs. In: International conference on Information and knowledge
management. pp. 2541-2544. ACM (2011)
• Ying, Y., Peng, C., Dong, C., Li, Y., Feng, Y.: Inferring event geolocation based on twitter. In: Proceedings of
the 10th International Conference on Internet Multimedia Computing and Service. pp. 1-5 (2018)
• Zola, P., Ragno, C., Cortez, P.: A google trends spatial clustering approach for a worldwide twitter user
geolocation. Information Processing & Management 57(6), 102312 (2020)
