As the popularity of online social networking sites such as Twitter and Facebook continues to rise, the volume of textual content generated on the web is increasing rapidly. The mining of user generated content in social media has proven effective in domains ranging from personalization and recommendation systems to crisis management. These applications stand to be further enhanced by incorporating information about the geo-position of social media users in their analysis.
Due to privacy concerns, users are largely reluctant to share their location information. Consequently, researchers have focused on automatically inferring location information from the contents of a user's tweets. Existing approaches are purely data-driven and require large training datasets of geo-tagged tweets. Furthermore, these approaches rely solely on social media features or probabilistic language models and fail to capture the underlying semantics of the tweets.
In this thesis, we propose a novel knowledge-based approach that does not require any training data. Our approach uses Wikipedia, a crowd-sourced knowledge base, to extract entities that are relevant to a location. We refer to these entities as local entities. Additionally, we score the relevance of each local entity with respect to the city, using the Wikipedia Hyperlink Graph. We predict the most likely location of the user by matching the scored entities of a city against the entities mentioned by users in their tweets. We evaluate our approach on a publicly available dataset consisting of 5119 Twitter users across the continental United States and show comparable accuracy to the state-of-the-art approaches. Our results demonstrate the ability to pinpoint the location of a Twitter user to a state and a city using Wikipedia, without needing to train a probabilistic model.
5. News Recommender Systems
Beavercreek preschool to open in 2015
By Sharon D. Boykin
A $5.1 million preschool in Beavercreek City Schools district will help accommodate a growing student population and reduce overcrowding, according to school officials.
Ohio’s health exchange to include
more competition
By Randy Tucker
It was just a year ago that the insurance industry fretted over potential losses from the new insurance market created by the Affordable Care Act.
Recommended for you
WHY IS LOCATION IMPORTANT?
• Targeted advertising
• Opinion Analysis
• Disaster Response
• Location Based Services
Other applications
7. LOCATION PUBLISHED BY USER
Geo-tagged Tweets
• Less than 4% of tweets contain geo-spatial tags
Profile Information
• Location field in profile is either empty or contains invalid information such as “Justin Bieber’s heart”
8. INFERRING LOCATION OF A TWITTER USER
Network based: Friends, Followees, Followers
Content based:
Just drove around Golden Gate Park two times trying to get in
Cleveland Browns confuse me. When I give up on them, they actually show up to play.
10. CONTENT BASED APPROACHES
Just drove around Golden Gate Park two times
trying to get in
Cleveland Browns confuse me. When I give up
on them, they actually show up to play.
• Supervised Approaches
• Probabilistic Models – (Cheng, Caverlee, and Lee, 2010)
• Cascading Topic Models – (Eisenstein, Connor, Smith, and Xing, 2010)
• Gaussian Mixture Model – (Chang, Lee, Eltaher, and Lee, 2012)
• Language Models – (Doran, Gokhale, and Dagnino, 2014)
• Ensemble of Statistical and Heuristic Classifiers – (Mahmud, Nichols, and Drews, 2014)
Geographic location of a user influences the contents of their tweets
13. PROBLEM STATEMENT
Predict the location of a Twitter user based on their tweets, by exploiting Wikipedia to create a location specific knowledgebase
14. CONTRIBUTIONS
• Knowledge-enabled approach to predict the location of Twitter users based on the contents of their tweets without using any training dataset of geo-tagged tweets
• Creation of location specific knowledgebase extracted from Wikipedia by introducing the concept of Local Entities
• Evaluation of the approach on a publicly available dataset with 55% accuracy and 429 miles of Average Error Distance
15. KNOWLEDGE-BASE ENABLED APPROACH
San Francisco: Golden Gate Bridge, San Francisco 49ers, San Francisco Chronicle …
Entity                    Count
Golden Gate Bridge        4
San Francisco 49ers       2
San Francisco Chronicle   1
Top-k predictions: San Francisco, Oakland, Palo Alto
18. WIKIPEDIA
• Collaborative encyclopedia
• As of 2014, English Wikipedia has 4.6 million articles, 18 billion page views and 500 million unique visitors per month.
• Category Structure
• Used for document clustering, tweet classification, personalization systems etc.
• At Kno.e.sis, used in applications such as
• Doozer (Thomas, Mehra, Brooks, and Sheth, 2008)
• BLOOMS (Jain, Hitzler, Sheth, Verma, and Yeh, 2010)
• Hierarchical Interest Graph (Kapanipathi, Jain, Venkataramani, and Sheth, 2014)
• Link Structure
• Used for word sense disambiguation, semantic relatedness between terms etc.
21. LINK STRUCTURE OF WIKIPEDIA
“In general, links should be created to relevant connections to the subject of another article that will help readers understand the article more fully. This can include people, events, and topics that already have an article or that clearly deserve one, so long as the link is relevant to the article in question.”
Source: http://en.wikipedia.org/wiki/Help:Link#Wikilinks
22. LOCAL ENTITIES
• We consider the internal links of location pages as Local Entities of the city
• While a city does not contain a link to itself, we use the city as a local entity
Local Entities of San Francisco
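As a rough illustration of the Local Entities idea, the sketch below pulls internal link targets out of a page's wikitext and adds the city itself. The regular expression, the function name `local_entities`, the namespace filter, and the sample text are assumptions made for this example; the slides do not say how the links were actually extracted.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures only the target title.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)")

def local_entities(city, wikitext):
    """Return the titles linked from a city's wikitext, plus the city itself."""
    entities = set()
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        # Simplification of this sketch: treat any title containing ":" as a
        # non-article namespace (File:, Category:, ...) and skip it.
        if title and ":" not in title:
            entities.add(title)
    # A city page does not link to itself, so the city is added explicitly.
    entities.add(city)
    return entities

# Hypothetical wikitext fragment, not taken from the real article.
sample = (
    "The city is covered by the [[San Francisco Chronicle]] and is home to "
    "[[Golden Gate Park]] and the [[San Francisco 49ers|49ers]]. "
    "[[File:SF.jpg|thumb]]"
)
entities = local_entities("San Francisco", sample)
```

The namespace filter is deliberately crude: it would also drop legitimate article titles that contain a colon, so a real pipeline would need a proper namespace list.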
25. ARE ALL ENTITIES EQUALLY LOCAL?
San Francisco Chronicle, San Francisco Examiner, SF Weekly
MSNBC, CNN, BBC, Al Jazeera America
26. LOCALNESS MEASURE OF ENTITIES
Association-based Measure
• Pointwise Mutual Information – standard measure of association between two variables
• Assumption: the higher the localness of an entity with respect to the city, the higher the statistical dependence between them
• Computed as:
$PMI(c, e) = \log_2 \frac{P(c, e)}{P(c) \cdot P(e)}$
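A minimal sketch of the PMI computation, assuming the probabilities are estimated from occurrence counts in a corpus. The function name `pmi`, its count-based parameters, and the numbers in the example are illustrative assumptions, not the estimates used in the thesis.

```python
import math

def pmi(n_both, n_city, n_entity, n_total):
    """PMI(c, e) = log2(P(c, e) / (P(c) * P(e))), with probabilities taken as counts / n_total."""
    if n_both == 0:
        return float("-inf")  # the city and the entity never co-occur
    # (n_both / N) / ((n_city / N) * (n_entity / N)) simplifies to the ratio below.
    return math.log2((n_both * n_total) / (n_city * n_entity))

# Hypothetical counts: a well-attested pair versus a pair seen only twice.
frequent_pair = pmi(n_both=40, n_city=100, n_entity=50, n_total=1000)
rare_pair = pmi(n_both=2, n_city=2, n_entity=2, n_total=1000)
```

With these made-up counts the rare pair scores higher than the frequent one, which shows how an unnormalized PMI can favor entities with very few occurrences.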
27. LOCALNESS MEASURE OF ENTITIES
Graph-based Measure
The Boston Red Sox, a founding member of the American League of Major League Baseball in 1901 ...
Boston Red Sox
The Boston Red Sox are an American
professional baseball team based in
Boston, Massachusetts ...
They are members of American League (AL).
Boston
American League
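29. LOCALNESS MEASURE OF ENTITIES
Graph-based Measure
• Betweenness Centrality (BC) – Measures the importance of a node relative to the rest of the nodes in the graph
• A high BC score of a vertex in a graph indicates that it lies on a considerable fraction of the shortest paths connecting other vertices
• Computed as:
$C_B(c, e) = \sum_{e_i \neq e \neq e_j} \frac{\sigma_{e_i e_j}(e)}{\sigma_{e_i e_j}}$
where $\sigma_{e_i e_j}$ is the number of shortest paths from $e_i$ to $e_j$ and $\sigma_{e_i e_j}(e)$ is the number of those paths that pass through $e$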
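The betweenness formula can be sketched with Brandes' algorithm on a small directed graph. This is an unnormalized, standard-library-only illustration; the function name `betweenness` and the toy edges are assumptions, and the slides do not state which implementation or normalization produced the reported scores.

```python
from collections import deque

def betweenness(graph):
    """Unnormalized betweenness centrality of an unweighted directed graph.

    graph maps each node to a list of its successors.
    """
    nodes = set(graph) | {w for succ in graph.values() for w in succ}
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s: count shortest paths and record predecessors.
        sigma = dict.fromkeys(nodes, 0)
        dist = dict.fromkeys(nodes, -1)
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, visiting nodes farthest from s first.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical link graph; the edges are invented for illustration.
toy_graph = {
    "Boston": ["Boston Red Sox", "Fenway Park"],
    "Fenway Park": ["Boston Red Sox"],
    "Boston Red Sox": ["American League"],
    "American League": [],
}
bc = betweenness(toy_graph)
```

In this toy graph only "Boston Red Sox" lies on shortest paths between other nodes (from "Boston" and from "Fenway Park" to "American League"), so it is the only node with a non-zero score.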
30. LOCALNESS MEASURE OF ENTITIES
Directed Graph of Local Entities of Boston
Boston Red Sox: 0.004540
American League: 0.000046
31. LOCALNESS MEASURE OF ENTITIES
Semantic Overlap Measure
Alcatraz Island, Treasure Island, Alameda Island, Financial District, Market Street, Fisherman’s Wharf, San Francisco 49ers, Cow Hollow, Silicon Valley, South Beach, …
Suspension Bridge, Hyde Street Pier, Irving Morrow, Angelo Rossi, Art Deco, Charles Alton Ellis, Bethlehem Steel, Half Way to Hell Club, International Orange, …
San Francisco Bay, Golden Gate, San Francisco Chronicle, U.S. Route 101, Marin County, Sausalito, Bay Area, …
32. LOCALNESS MEASURE OF ENTITIES
Semantic Overlap Measure
• Measures the relatedness between concepts with the intuition that related concepts are connected to similar entities
• Jaccard Index: Overlap between two sets
$jaccard(c, e) = \frac{|O(c) \cap O(e)|}{|O(c) \cup O(e)|}$
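A small sketch of the Jaccard Index, assuming O(c) and O(e) are the sets of entities linked from the city page and the entity page. The function name `jaccard` and the placeholder link sets are hypothetical.

```python
def jaccard(links_city, links_entity):
    """|O(c) ∩ O(e)| / |O(c) ∪ O(e)| for two sets of linked entities."""
    union = links_city | links_entity
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(links_city & links_entity) / len(union)

# Placeholder link sets: two shared entities out of five distinct ones.
city_links = {"A", "B", "C", "D"}
entity_links = {"C", "D", "E"}
overlap = jaccard(city_links, entity_links)
```

Because the union sits in the denominator, the measure is symmetric: everything the city page links to that the entity page does not also lowers the score.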
33. LOCALNESS MEASURE OF ENTITIES
Semantic Overlap Measure
• Tversky Index: Asymmetric similarity measure between two sets
$ti(c, e) = \frac{|O(c) \cap O(e)|}{|O(c) \cap O(e)| + \alpha |O(c) - O(e)| + \beta |O(e) - O(c)|}$
• We choose α = 0 and β = 1
• For every entity in the page of a local entity not found in the page of the city, penalize the local entity
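The effect of choosing α = 0 and β = 1 can be seen in a short sketch. The function name `tversky` and all three link sets below are invented for illustration; they are not the real Wikipedia link sets.

```python
def tversky(links_city, links_entity, alpha=0.0, beta=1.0):
    """Tversky index between the links of a city page and of an entity page."""
    common = len(links_city & links_entity)
    only_city = len(links_city - links_entity)
    only_entity = len(links_entity - links_city)
    denominator = common + alpha * only_city + beta * only_entity
    return common / denominator if denominator else 0.0

# Hypothetical link sets.
city = {"Castro District", "Market Street", "Mission District",
        "Golden Gate Park", "Bay Area", "California"}
neighborhood = {"Castro District", "Market Street", "Harvey Milk"}
broad_entity = {"Bay Area", "California", "Los Angeles", "Sacramento",
                "San Diego", "Sierra Nevada", "Hollywood", "Central Valley"}

score_neighborhood = tversky(city, neighborhood)  # 2 shared, 1 link outside the city page
score_broad = tversky(city, broad_entity)         # 2 shared, 6 links outside the city page
```

With α = 0 the links that appear only on the city page are ignored, so a small page whose links mostly fall inside the city page scores high, while a broad page is penalized for each link outside it. Setting α = β = 1 gives back the Jaccard Index.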
34. KNOWLEDGE-BASE OF LOCAL ENTITIES
Local Entities of San Francisco (Localness measure: Tversky Index)
36. CREATION OF USER PROFILE
Step 1: Entity Linking
Just drove around Golden Gate Park trying to get in.
We use Zemanta for Entity Linking
37. CREATION OF USER PROFILE
Step 1: Entity Linking
Just drove around Golden Gate Park trying to get in.
We use Zemanta for Entity Linking
Step 2: Entity Scoring
Entity                    Count
Golden Gate Bridge        4
San Francisco 49ers       2
San Francisco Chronicle   1
User Profile for user $u$ defined as:
$P_u = \{(e, s) \mid e \in W, s \in R\}$
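The entity scoring step can be sketched as simple counting, assuming the score of an entity is the number of times it is mentioned, as the Entity/Count table suggests. Entity linking itself is not reproduced here: the input is hypothetical, already-linked output, and `build_profile` is a name invented for this example.

```python
from collections import Counter

def build_profile(linked_tweets):
    """Build a user profile {entity: mention count} from per-tweet entity lists."""
    profile = Counter()
    for entities in linked_tweets:
        profile.update(entities)
    return profile

# Hypothetical output of an entity linker, one list of Wikipedia titles per tweet.
linked_tweets = [
    ["Golden Gate Park"],
    ["San Francisco 49ers", "Golden Gate Park"],
    [],  # a tweet in which no entity was recognized
    ["San Francisco Chronicle"],
]
profile = build_profile(linked_tweets)
```

A `Counter` keeps the profile sparse and returns 0 for entities the user never mentioned, which is convenient when matching against each city's local entities.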
39. LOCATION PREDICTION
• Compute an aggregate score for each city whose local entities are found in a user’s tweets
$\mathrm{locScore}(c, u) = \sum_{j=1}^{|I_{cu}|} \mathrm{locl}(c, e_j) \times s_{e_j}$
where $I_{cu}$ is the set of local entities of city $c$ found in the tweets of user $u$, $e_j \in I_{cu}$, $\mathrm{locl}(c, e_j)$ is the localness score of entity $e_j$ with respect to city $c$, and $s_{e_j}$ is the score of $e_j$ in the user profile
• Rank the cities by $\mathrm{locScore}(c, u)$ in descending order to predict the top-k locations of a user
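A minimal sketch of the aggregate score and the top-k ranking. The function names `loc_scores` and `top_k` and the dictionary layout are assumptions. The localness values and counts are a truncated subset of the numbers in the San Francisco example slides, so the totals here differ from the slide totals.

```python
def loc_scores(profile, knowledge_base):
    """Sum localness * profile score over each city's local entities found in the profile.

    profile: {entity: score}; knowledge_base: {city: {entity: localness}}.
    """
    scores = {}
    for city, localness in knowledge_base.items():
        matched = profile.keys() & localness.keys()  # local entities found in the user's tweets
        if matched:
            scores[city] = sum(localness[e] * profile[e] for e in matched)
    return scores

def top_k(scores, k):
    """Return the k highest-scoring cities, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

profile = {"Nob Hill": 3, "SF Weekly": 1, "Golden Gate Park": 1,
           "Detroit Red Wings": 4, "General Motors": 1}
knowledge_base = {
    "San Francisco": {"Nob Hill": 0.48214, "SF Weekly": 0.1875, "Golden Gate Park": 0.16783},
    "Detroit, MI": {"Detroit Red Wings": 0.0232, "General Motors": 0.05538},
    "Palo Alto, CA": {"Google": 0.04678, "Facebook": 0.05810},
}
scores = loc_scores(profile, knowledge_base)
ranking = top_k(scores, k=2)
```

Cities with no matching local entity get no score at all rather than a zero, so they never appear in the ranking. Setting every localness value to 1 reduces the score to a plain frequency count.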
40. LOCATION PREDICTION
User Profile:
San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1), Fox Oakland Theatre (2), Berkeley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1), The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1), Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1), …
Knowledgebase:
San Francisco: Nob Hill 0.48214, SF Weekly 0.1875, Golden Gate Park 0.16783, San Francisco International Airport 0.06818, …
Oakland, CA: Fox Oakland Theatre 0.09375, SF Bay Area 0.12972, Green Day 0.02066, …
Detroit, MI: Detroit Historical Museum 0.4838, General Motors 0.05538, Detroit Red Wings 0.0232, …
Palo Alto, CA: PARC (company) 0.03726, Google 0.04678, Facebook 0.05810
41. LOCATION PREDICTION
User Profile (entities grouped by matching city, with aggregate score):
San Francisco (14.5531): San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1)
Oakland, CA (10.7584): Fox Oakland Theatre (2), Berkeley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1)
Detroit, MI (8.0600): The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1)
Palo Alto, CA (6.9175): Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1)
Knowledgebase:
San Francisco: Nob Hill 0.48214, SF Weekly 0.1875, Golden Gate Park 0.16783, San Francisco International Airport 0.06818, …
Oakland, CA: Fox Oakland Theatre 0.09375, SF Bay Area 0.12972, Green Day 0.02066, …
Detroit, MI: Detroit Historical Museum 0.4838, General Motors 0.05538, Detroit Red Wings 0.0232, …
Palo Alto, CA: PARC (company) 0.03726, Google 0.04678, Facebook 0.05810
Location Prediction (ranked): San Francisco; Oakland, CA; Detroit, MI; Palo Alto, CA
42. IMPLEMENTATION
Knowledge base
• All cities of the United States with population > 5000 as published in census estimates of 2012
• 4,661 cities and 500,714 local entities
Baseline
• Considers all local entities to be equally local to the city
• Location prediction based only on frequency of entities
43. EVALUATION
Test Dataset
• Published by Cheng, Caverlee, and Lee, 2010.
• Contains 5119 active users from the continental United States with approximately 1000 tweets per user.
• User’s location listed in the form of latitude and longitude.
44. EVALUATION
Evaluation Metrics
• Error Distance
$\mathrm{ErrorDist}(u) = \mathrm{distance}(loc_{act}(u), loc_{est}(u))$
Distance between actual location of the user and the estimated location
• Average Error Distance
$AED(U) = \frac{\sum_{u \in U} \mathrm{ErrorDist}(u)}{|U|}$
Average of error distance of all users in the test dataset
• Accuracy
$ACC(U) = \frac{|\{u \mid u \in U \wedge \mathrm{ErrorDist}(u) \leq 100\}|}{|U|}$
Percentage of users predicted within 100 miles of their actual location
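The three metrics can be sketched as follows. The slides do not say which distance function was used, so the haversine great-circle distance here is an assumption, as are the function names, the Earth-radius constant, and the two hypothetical users with approximate city coordinates.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an assumption of this sketch

def haversine_miles(a, b):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def evaluate(actual, predicted, threshold=100.0):
    """Return (AED in miles, ACC in percent) over the users in `actual`."""
    errors = [haversine_miles(actual[u], predicted[u]) for u in actual]
    aed = sum(errors) / len(errors)
    acc = 100.0 * sum(e <= threshold for e in errors) / len(errors)
    return aed, acc

# Hypothetical users with approximate city coordinates.
SAN_FRANCISCO = (37.7749, -122.4194)
OAKLAND = (37.8044, -122.2712)
DETROIT = (42.3314, -83.0458)

actual = {"user_1": SAN_FRANCISCO, "user_2": DETROIT}
predicted = {"user_1": OAKLAND, "user_2": SAN_FRANCISCO}
aed, acc = evaluate(actual, predicted)
```

Here one user is predicted a few miles away and the other across the country, so accuracy is 50% while the average error distance is dominated by the single large miss. This is why the slides report both numbers.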
46. EVALUATION
Localness Measure   ACC (%)   AED (miles)   ACC@2   ACC@3   ACC@5
Baseline            25.21     632.56        38.01   42.78   47.95
PMI                 38.48     599.40        49.85   56.06   64.15
BC                  47.91     478.14        57.39   62.18   66.98
Jaccard Index       53.21     433.62        67.41   73.56   78.84
Tversky Index       54.48     429.00        68.72   74.68   79.99
• PMI is not normalized and is hence sensitive to the number of occurrences of local entities in the Wikipedia corpus
• E.g. the PMI of local entities of Glen Rock, New Jersey is higher than that of local entities of San Francisco
47. EVALUATION
• Does a good job of assigning low scores to common entities.
• E.g. community college, National Weather Service, start up company etc.
• Fails for entities with some relevance to the city but no distinguishing factor
• E.g. IBM with respect to Endicott, New York
49. EVALUATION
Jaccard Index: ACC 53.21%, AED 433.62 miles, ACC@2 67.41, ACC@3 73.56, ACC@5 78.84
• Underperforms for local entities with fewer entities than the city
• E.g. Eureka Valley and California with respect to San Francisco.
51. EVALUATION
Tversky Index: ACC 54.48%, AED 429.00 miles, ACC@2 68.72, ACC@3 74.68, ACC@5 79.99
• Best performing localness measure
• Overcomes the disadvantage of Jaccard Index.
• For example: We are able to assign higher localness to Eureka Valley (0.7096) than California (0.1270) with respect to San Francisco
57. EVALUATION
Top 100 Cities
• 2172 users from the dataset are from the top-100 most populated cities of the United States
• 60% of users predicted within 100 miles of their actual location
• 54% of users predicted exactly at the city level
58. CONCLUSION
• Presented a crowd-sourced, knowledge-based approach that does not require geo-tagged tweets as a training dataset to predict the location of a user
• Introduced the concept of Local Entities and preprocessed the Wikipedia Hyperlink Graph to extract local entities for each city
• Investigated relatedness measures to establish the degree of association between a local entity and a city
• Evaluated the proposed approach against a benchmark dataset published by Cheng et al. For 5119 users, we are able to predict the location of 55% of users within 100 miles with an average error distance of 429 miles
59. FUTURE WORK
• Compute the confidence score of the prediction based on top-k cities and count of local entities in tweets
• Investigate other localness measures for scoring local entities
• Consider semantic types and categories of local entities, and weight their contribution based on type
• Explore other knowledge bases such as Wikitravel and GeoNames
Users can publish geographic information through cellphone or profile
Cheng et al. found that 21% of users listed a location as granular as city, state in their profile
There was a need to automatically infer the location of a user
Network based approaches are based on the network of a Twitter user
The general idea behind these approaches is to determine the probabilistic distribution of words across a region
Cheng et al. proposed a probabilistic framework to model the spatial distribution of words. They proposed the concept of local words which are words that have a compact geographic scope.
Weakness: Large training dataset required. Collection process of geo-tagged tweets is time intensive.
Does not exploit underlying semantics in tweets
My thesis addresses weaknesses of existing approaches by using a knowledge-based approach to extract location specific concepts from Wikipedia
Brief description of the approach
Local Entities: Entities that have a high relatedness to a city and can discriminate between geographic locations
Our work is based on the link structure of Wikipedia
Links only to topically relevant entities
Previous research has concluded that location names provide important clues to the location of a user
Number of local entities for each city varies.
Association based measure: Compute relatedness based on their occurrences in a large corpus
Construct a directed graph of local entities for each city
Construct a graph of local entities for each city
Number of Nodes: 474
Number of Edges: 5921
Based on the idea that the higher the overlap between concepts found in the Wikipedia pages of a city and an entity, the higher the degree of localness of the entity
Local Entities of San Francisco represented in a tag cloud weighted based on Tversky Index
The next module is the user profile generator. We create a semantic profile of each user consisting of Wikipedia entities found in their tweets
Brief description of the approach
For predicting the location of a user we compute an aggregate score for each city whose local entities are found in a user’s tweets.
Cheng, Caverlee, and Lee, 2010
Introduction to Twitter and location of a Twitter user