Knowledge Enabled Location Prediction of
Twitter Users
Presented at ESWC 2015, Slovenia, June 3, 2015
Revathy Krishnamurthy (revathy@knoesis.org)
Pavan Kapanipathi (pavan@knoesis.org)
Amit Sheth (amit@knoesis.org)
Krishnaprasad Thirunarayan (tkprasad@knoesis.org)
Kno.e.sis: Ohio Center of Excellence in Knowledge Enabled Computing
Wright State University, Dayton, OH, USA
Background Knowledge can improve a
machine’s ability to interpret text
City of Lights
BACKGROUND KNOWLEDGE
Geographic footprint of a Twitter user
News Recommender Systems
Beavercreek preschool to open in 2015
By Sharon D. Boykin
A $5.1 million preschool in the Beavercreek City
Schools district will help accommodate a
growing student population and reduce
overcrowding, according to school officials.
Ohio’s health exchange to include
more competition
By Randy Tucker
It was just a year ago that the insurance industry
fretted over potential losses from the new
insurance market created by the Affordable Care Act.
Recommended for you
WHY IS LOCATION IMPORTANT?
• Targeted advertising
• Opinion Analysis
• Disaster Response
• Location-Based Services
Other applications
Geo-tagged Tweets Profile Information
LOCATION PUBLISHED BY USER
• Less than 4% of tweets contain geo-spatial tags
• In roughly 4 out of 5 cases, the location field in the profile is either
empty or contains invalid information such as “Justin Bieber’s heart”;
even when a valid location is present, it may only be at the
state or nation level
LOCATION INFERENCE
Network based: Friends, Followees, Followers
Content based:
Just drove around Golden Gate Park two times
trying to get in
Cleveland Browns confuse me. When I give up
on them, they actually show up to play.
CONTENT BASED APPROACHES
Just drove around Golden Gate Park two times
trying to get in
Cleveland Browns confuse me. When I give up
on them, they actually show up to play.
• Supervised Approaches
• Probabilistic Models – (Cheng, Caverlee, and Lee, 2010)
• Cascading Topic Models – (Eisenstein, Connor, Smith, and Xing, 2010)
• Gaussian Mixture Model – (Chang, Lee, Eltaher, and Lee, 2012)
• Language Models – (Doran, Gokhale, and Dagnino, 2014)
• Ensemble of Statistical and Heuristic Classifiers – (Mahmud, Nichols,
and Drews, 2014)
Geographic location of a user
influences the contents of their
tweets
PROBLEM STATEMENT
Predict the location of a Twitter user based on their
tweets, by exploiting Wikipedia to create a
location-specific knowledge base
KNOWLEDGE-BASE ENABLED APPROACH
San Francisco:
Golden Gate Bridge,
San Francisco 49ers,
San Francisco Chronicle …
Entity                     Count
Golden Gate Bridge         4
San Francisco 49ers        2
San Francisco Chronicle    1
Top-k predictions:
San Francisco
Oakland
Palo Alto
KNOWLEDGE-BASE ENABLED APPROACH
[Figure: pipeline with three stages]
KNOWLEDGE BASE GENERATOR: Internal Links Extraction; Weighted Local Entities (LocalEntity-1, LocalEntity-2, ..., LocalEntity-n) for city-1, city-2, ..., city-k
USER PROFILE GENERATOR: Entity Recognition and Scoring; Annotated Tweets
LOCATION PREDICTION: Location Predictor; ranked cities for user
LOCAL ENTITIES
[Figure: panels labeled San Francisco, New York City, and Houston]
• Collaborative encyclopedia
• As of 2014, English Wikipedia has 4.6 million articles, 18 billion
page views, and 500 million unique visitors per month.
• Category Structure
• Used for document clustering, tweet classification,
personalization systems, etc.
• Link Structure
• Used for word sense disambiguation, semantic relatedness
between terms, etc.
WIKIPEDIA
• We consider the internal links of a city's Wikipedia page to be the Local
Entities of that city
[Figure: Local Entities of San Francisco]
LOCAL ENTITIES
• Although a city's page does not link to itself, we also treat the city as
one of its own local entities
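The extraction step described above can be sketched as follows. This is a minimal illustration, assuming the page is available as raw wikitext and that links use the usual `[[Target|label]]` form; the function name `local_entities`, the namespace filter, and the sample text are hypothetical and not taken from the slides.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# only the target page title is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Assumed filter: links into these namespaces are not treated as entities.
NON_ARTICLE_PREFIXES = ("File:", "Image:", "Category:", "Template:", "Help:")


def local_entities(city, wikitext):
    """Return the set of local entities for a city page.

    The city itself is added explicitly, because a page does not link to itself.
    """
    entities = {city}
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if title and not title.startswith(NON_ARTICLE_PREFIXES):
            entities.add(title)
    return entities


# Hypothetical wikitext snippet, for illustration only.
sample = (
    "The [[Golden Gate Bridge]] and the [[San Francisco 49ers|49ers]] "
    "are covered by the [[San Francisco Chronicle]]. [[File:SF.jpg|thumb]]"
)
print(sorted(local_entities("San Francisco", sample)))
```

A set is used so that an entity linked several times from the same page is only recorded once.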
ARE ALL ENTITIES EQUALLY LOCAL?
San Francisco Chronicle, San Francisco Examiner, SF Weekly
CNN, BBC, Al Jazeera America
• Pointwise Mutual Information (PMI) – a standard measure of
association between two variables
• Assumption: the higher the localness of an entity with
respect to a city, the higher the statistical dependence
between them
• Computed as:
$$\mathrm{PMI}(le, c) = \log \frac{P(le, c)}{P(le)\,P(c)}$$
where le is the local entity, c is the city, P(le, c) is the joint probability of occurrence of
the city and the local entity in the Wikipedia dump, and P(le) and P(c) are the individual
probabilities of occurrence of the local entity and the city, respectively.
Association-based Measure
LOCALNESS MEASURE OF ENTITIES
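As a rough sketch of the association-based measure, PMI can be computed from raw occurrence counts. The slides do not say how the probabilities are estimated or which logarithm base is used, so the count-based estimates, the natural logarithm, and all numbers below are assumptions for illustration.

```python
import math


def pmi(joint_count, entity_count, city_count, total):
    """PMI(le, c) = log(P(le, c) / (P(le) * P(c))), estimated from counts.

    joint_count:  units (e.g. pages) mentioning both the entity and the city
    entity_count: units mentioning the entity
    city_count:   units mentioning the city
    total:        total number of units in the corpus
    """
    if joint_count == 0:
        return float("-inf")  # the pair never co-occurs
    p_joint = joint_count / total
    p_entity = entity_count / total
    p_city = city_count / total
    return math.log(p_joint / (p_entity * p_city))


# Hypothetical counts: a rare entity that almost always co-occurs with the city
# versus a common entity that co-occurs with it only at chance level.
print(pmi(joint_count=8, entity_count=10, city_count=20, total=1000))
print(pmi(joint_count=10, entity_count=500, city_count=20, total=1000))
```

The first call yields a clearly positive value and the second a value of about zero, which matches the intuition that a more local entity is more statistically dependent on its city.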
Graph-based Measure
LOCALNESS MEASURE OF ENTITIES
The Boston Red Sox, a founding member of the American League of Major League Baseball in 1901 ...
Boston Red Sox
The Boston Red Sox are an American professional baseball team based in Boston, Massachusetts ...
They are members of American League (AL).
Boston
American League
• Betweenness Centrality (BC) – measures the importance of a
node relative to the rest of the nodes in the graph
• A high BC score for a vertex indicates that it lies on a
considerable fraction of the shortest paths connecting other vertices
• Computed as:
$$BC(le) = \sum_{le_i \neq le \neq le_j} \frac{\sigma_{le_i le_j}(le)}{\sigma_{le_i le_j}}$$
where le_i, le_j, and le are local entities of c, σ_{le_i le_j} is the total
number of shortest paths from le_i to le_j, and σ_{le_i le_j}(le) is the number of those paths that pass through le
Graph-based Measure
LOCALNESS MEASURE OF ENTITIES
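The betweenness score can be sketched with Brandes' algorithm, a standard way to compute it; the slides do not state which algorithm or graph library was used. The sketch assumes an unweighted, directed link graph stored as an adjacency dictionary, and the three-node example graph is a guess at the Boston illustration rather than the authors' data.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    graph: dict mapping each node to an iterable of nodes it links to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    bc = dict.fromkeys(nodes, 0.0)

    for source in nodes:
        # Breadth-first search counting shortest paths from the source.
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)   # number of shortest paths
        sigma[source] = 1
        dist = dict.fromkeys(nodes, -1)
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate dependencies in order of decreasing distance.
        delta = dict.fromkeys(nodes, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    return bc


# Assumed link structure for the three pages named on the slide.
links = {
    "Boston": ["Boston Red Sox"],
    "Boston Red Sox": ["Boston", "American League"],
    "American League": [],
}
print(betweenness_centrality(links))
```

In this toy graph only Boston Red Sox lies on a shortest path between two other nodes, so it is the only node with a non-zero score.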
[Figure: three groups of Wikipedia internal links]
Alcatraz Island, Treasure Island, Alameda Island, Financial District, Market Street, Fisherman’s Wharf, San Francisco 49ers, Cow Hollow, Silicon Valley, South Beach, …
Suspension Bridge, Hyde Street Pier, Irving Morrow, Angelo Rossi, Art Deco, Charles Alton Ellis, Bethlehem Steel, Half Way to Hell Club, International Orange, …
San Francisco Bay, Golden Gate, San Francisco Chronicle, U.S. Route 101, Marin County, Sausalito, Bay Area, …
Semantic Overlap Measure
LOCALNESS MEASURE OF ENTITIES
• Measures the relatedness between concepts with the intuition
that related concepts are connected to similar entities
• Jaccard Index: overlap between two sets
$$\mathrm{jaccard}(le, c) = \frac{|IL(c) \cap IL(le)|}{|IL(c) \cup IL(le)|}$$
where IL(c) and IL(le) are the sets of internal links found in the Wikipedia pages
of the city c and the local entity le.
Semantic Overlap Measure
LOCALNESS MEASURE OF ENTITIES
• Tversky Index: Asymmetric similarity measure between two
sets
$$ti(le, c) = \frac{|IL(c) \cap IL(le)|}{|IL(c) \cap IL(le)| + \alpha\,|IL(c) - IL(le)| + \beta\,|IL(le) - IL(c)|}$$
where IL(c) and IL(le) are the sets of internal links found in the Wikipedia pages
of the city c and the local entity le
• We choose α = 0 and β = 1
• With these values, every entity on the page of a local entity that is not
found on the page of the city penalizes the local entity, while entities
found only on the page of the city carry no penalty
Semantic Overlap Measure
LOCALNESS MEASURE OF ENTITIES
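Both semantic overlap measures reduce to a few set operations. The sketch below assumes the internal links of each page are already available as Python sets; the link sets are made-up toy data, chosen only to show how α = 0 and β = 1 treat a small page and a large page differently.

```python
def jaccard_index(city_links, entity_links):
    """|IL(c) ∩ IL(le)| / |IL(c) ∪ IL(le)|"""
    union = city_links | entity_links
    return len(city_links & entity_links) / len(union) if union else 0.0


def tversky_index(city_links, entity_links, alpha=0.0, beta=1.0):
    """Tversky index with the slide's defaults alpha = 0, beta = 1."""
    common = len(city_links & entity_links)
    only_city = len(city_links - entity_links)      # weighted by alpha
    only_entity = len(entity_links - city_links)    # weighted by beta
    denominator = common + alpha * only_city + beta * only_entity
    return common / denominator if denominator else 0.0


# Toy data (hypothetical): a city page with 8 links, a small neighbourhood
# page sharing 2 of its 3 links, and a large page sharing 4 of its 24 links.
city = set("abcdefgh")
small_page = {"a", "b", "x"}
large_page = set("abcd") | {f"other{i}" for i in range(20)}

for name, page in [("small", small_page), ("large", large_page)]:
    print(name, jaccard_index(city, page), tversky_index(city, page))
```

With α = 0 the many city links missing from the small page no longer count against it, so the small page scores much higher under Tversky than under Jaccard, while the large page stays low under both.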
CREATION OF USER PROFILE
Step 1: Entity Linking
Just drove around Golden Gate Park trying to get in.
We use Zemanta for Entity Linking
Step 2: Entity Scoring
Entity                     Count
Golden Gate Bridge         4
San Francisco 49ers        2
San Francisco Chronicle    1
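The two profile-creation steps can be sketched as a simple counting pass. The sketch assumes entity linking has already been done by an external service and that each tweet arrives as a list of linked entity titles; it makes no calls to Zemanta, and the function name `build_user_profile` and the sample annotations are hypothetical.

```python
from collections import Counter


def build_user_profile(annotated_tweets):
    """Count how often each linked entity appears across a user's tweets.

    annotated_tweets: iterable of lists, one list of entity titles per tweet.
    """
    profile = Counter()
    for entities in annotated_tweets:
        profile.update(entities)
    return profile


# Hypothetical output of the entity-linking step for four tweets.
annotated = [
    ["Golden Gate Bridge"],
    ["Golden Gate Bridge", "San Francisco 49ers"],
    ["Golden Gate Bridge", "San Francisco Chronicle"],
    ["San Francisco 49ers", "Golden Gate Bridge"],
]
profile = build_user_profile(annotated)
for entity, count in profile.most_common():
    print(entity, count)
```

Using the raw mention count as the entity score follows the Entity/Count table on the slide; other scoring schemes would slot into the same structure.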
LOCATION PREDICTION
• Compute an aggregate score for each city whose local entities are found
in a user’s tweets
$$locScore(u, c) = \sum_{e \in LE_{cu}} locl(e, c) \times s_e$$
where LE_{cu} is the set of local entities of c found in the profile of
user u, locl(e, c) is the localness measure of the entity e with
respect to city c, and s_e is the score of e in the profile of user u
(from the entity scoring step)
• Rank the cities by locScore(u, c) in descending order to predict the top-k
locations of a user
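The aggregation and ranking step can be sketched as below, assuming the knowledge base is a dictionary from city to per-entity localness and the profile maps entities to their scores. The localness values are the excerpts shown on the example slide, but only a few entities per city are included, so the resulting scores are illustrative and do not reproduce the slide's totals.

```python
def rank_cities(profile, knowledge_base, k=3):
    """Return the top-k (city, locScore) pairs for one user.

    profile:        dict entity -> score s_e (here, mention count)
    knowledge_base: dict city -> dict entity -> localness locl(e, c)
    """
    scores = {}
    for city, localness in knowledge_base.items():
        matched = [e for e in profile if e in localness]
        if matched:  # only cities with at least one local entity in the profile
            scores[city] = sum(localness[e] * profile[e] for e in matched)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Partial knowledge base (excerpt of the slide's localness values).
knowledge_base = {
    "San Francisco": {
        "Nob Hill": 0.48214,
        "SF Weekly": 0.1875,
        "Golden Gate Park": 0.16783,
        "San Francisco International Airport": 0.06818,
    },
    "Detroit, MI": {
        "Detroit Historical Museum": 0.4838,
        "General Motors": 0.05538,
        "Detroit Red Wings": 0.0232,
    },
    "Palo Alto, CA": {
        "PARC (company)": 0.03726,
        "Google": 0.04678,
        "Facebook": 0.05810,
    },
}
user_profile = {
    "San Francisco International Airport": 6, "Nob Hill": 3,
    "Golden Gate Park": 1, "SF Weekly": 1,
    "Detroit Historical Museum": 1, "Detroit Red Wings": 4, "General Motors": 1,
    "PARC (company)": 2, "Google": 1, "Facebook": 3,
}
ranking = rank_cities(user_profile, knowledge_base)
for city, score in ranking:
    print(city, round(score, 5))
```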
LOCATION PREDICTION
User Profile (counts in parentheses):
San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1), Fox Oakland Theatre (2), Berkeley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1), The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1), Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1), …

Location Prediction (profile entities, Knowledgebase localness excerpt, and score for each city):
• San Francisco: 14.5531
Profile entities: San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1)
Knowledgebase: Nob Hill 0.48214, SF Weekly 0.1875, Golden Gate Park 0.16783, San Francisco International Airport 0.06818, …
• Oakland, CA: 10.7584
Profile entities: Fox Oakland Theatre (2), Berkeley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1)
Knowledgebase: Fox Oakland Theatre 0.09375, SF Bay Area 0.12972, Green Day 0.02066, …
• Detroit, MI: 8.0600
Profile entities: The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1)
Knowledgebase: Detroit Historical Museum 0.4838, General Motors 0.05538, Detroit Red Wings 0.0232, …
• Palo Alto, CA: 6.9175
Profile entities: Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1)
Knowledgebase: PARC (company) 0.03726, Google 0.04678, Facebook 0.05810
IMPLEMENTATION
Knowledge base
• All cities of the United States with population > 5000 as published in the census
estimates of 2012
• 4,661 cities and 500,714 local entities
Baseline
• Considers all local entities to be equally local to the city
• Location prediction based only on frequency of entities
EVALUATION
Test Dataset
• Published by Cheng et al.
• Collected from September 2009 to January 2010
• Contains 5119 active users from the continental United States, with
approximately 1000 tweets per user
• Each user's location is listed as a latitude and longitude
• Error Distance: distance between the actual location of the user and the
estimated location
• Average Error Distance (AED): average of the error distances of all users
in the test dataset
• Accuracy (ACC): percentage of users predicted within 100 miles of their
actual location
Evaluation Metrics
EVALUATION
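The three metrics can be sketched as follows. The slides do not say how the distance between two coordinates is computed, so the haversine great-circle formula and the Earth radius constant are assumptions, as are the sample coordinates.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, assumed for this sketch


def error_distance(actual, predicted):
    """Great-circle (haversine) distance in miles between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, actual)
    lat2, lon2 = map(math.radians, predicted)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def evaluate(pairs, threshold_miles=100.0):
    """Return (ACC in percent, AED in miles) over (actual, predicted) pairs."""
    distances = [error_distance(actual, predicted) for actual, predicted in pairs]
    aed = sum(distances) / len(distances)
    within = sum(d <= threshold_miles for d in distances)
    return 100.0 * within / len(distances), aed


# Hypothetical users: one predicted in a neighbouring city, one far away.
san_francisco = (37.7749, -122.4194)
oakland = (37.8044, -122.2712)
detroit = (42.3314, -83.0458)
acc, aed = evaluate([(san_francisco, oakland), (san_francisco, detroit)])
print(acc, aed)
```

The sample also shows why ACC and AED are reported together: a single distant miss dominates the average error distance even when half of the users fall within the 100-mile threshold.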
Location Prediction Results
EVALUATION
Localness Measure   ACC (%)   AED (miles)   ACC@2   ACC@3   ACC@5
Baseline            25.21     632.56        38.01   42.78   47.95
PMI                 38.48     599.40        49.85   56.06   64.15
BC                  47.91     478.14        57.39   62.18   66.98
Jaccard Index       53.21     433.62        67.41   73.56   78.84
Tversky Index       54.48     429.00        68.72   74.68   79.99
EVALUATION
• PMI is not normalized and is hence sensitive to the number of occurrences of local
entities in the Wikipedia corpus
• E.g. the PMI of the local entities of Glen Rock, New Jersey is higher than that of
the local entities of San Francisco
EVALUATION
• Betweenness centrality does a good job of assigning low scores to common entities
• E.g. community college, National Weather Service, startup company,
etc.
• It fails for entities with some relevance to the city but no distinguishing factor
• E.g. IBM with respect to Endicott, New York
EVALUATION
• The Jaccard Index underperforms for local entities whose pages contain fewer
entities than the city's page
• E.g. Eureka Valley and California with respect to San Francisco
EVALUATION
[Figure: overlap of California and of Eureka Valley with San Francisco; overlap values shown: 0.03005 and 0.07092]
EVALUATION
• The Tversky Index is the best-performing localness measure
• It overcomes the disadvantage of the Jaccard Index
• For example, we are able to assign higher localness to Eureka Valley
(0.7096) than to California (0.1270) with respect to San Francisco
Comparison with Existing Approaches
EVALUATION
Method                               ACC (%)   AED (miles)
Cheng, Caverlee, and Lee, 2010       51.00     535.56
Chang, Lee, Eltaher, and Lee, 2012   49.9      509.3
Wikipedia based Approach             54.48     429.00
CONCLUSION
• Presented a crowd-sourced, knowledge-based approach to predicting the
location of a user that does not require geo-tagged tweets as a
training dataset
• Introduced the concept of Local Entities and preprocessed the Wikipedia
hyperlink graph to extract local entities for each city
• Investigated relatedness measures to establish the degree of
association between a local entity and a city
• Evaluated the proposed approach against a benchmark dataset
published by Cheng et al. For 5119 users, we are able to predict the
location of 55% of users within 100 miles, with an average error
distance of 429 miles
FUTURE WORK
• Compute the confidence score of the prediction based on top-k cities
and count of local entities in tweets
• Investigate other localness measures for scoring local entities
• Consider semantic types, categories of local entities and weight the
contribution based on types
• Explore other knowledge bases such as Wikitravel and GeoNames
Thank you!
Paper at: http://www.knoesis.org/library/resource.php?id=2039
Contact:
pavan@knoesis.org
@pavankaps
EVALUATION
[Figure: Top-k Accuracy]
[Figure: Top-k Average Error Distance]
[Figure: Distribution of users. Panels: distribution of all users in the dataset; distribution of accurately predicted users]
[Figure: Impact of Local Entities]
Top 100 Cities
EVALUATION
• 2172 users in the dataset are from the 100 most populous
cities of the United States
• 60% of users were predicted within 100 miles of their actual location
• 54% of users were predicted exactly at the city level

 
Dominate Reddit Discussions.............
Dominate Reddit Discussions.............Dominate Reddit Discussions.............
Dominate Reddit Discussions.............
 
Maximize Your Twitch Potential!..........
Maximize Your Twitch Potential!..........Maximize Your Twitch Potential!..........
Maximize Your Twitch Potential!..........
 

Knowledge Enabled Location Prediction of Twitter Users

  • 1. Knowledge Enabled Location Prediction of Twitter Users. Presented at ESWC 2015, Slovenia, June 3, 2015. Revathy Krishnamurthy (revathy@knoesis.org), Pavan Kapanipathi (pavan@knoesis.org), Amit Sheth (amit@knoesis.org), Krishnaprasad Thirunarayan (tkprasad@knoesis.org). Kno.e.sis: Ohio Center of Excellence in Knowledge Enabled Computing, Wright State University, Dayton, OH, USA
  • 2. Background knowledge can improve a machine’s ability to interpret text. Example: “City of Lights”
  • 4. Geographic footprint of a Twitter user
  • 5. WHY IS LOCATION IMPORTANT? News recommender systems, with example “Recommended for you” items: “Beavercreek preschool to open in 2015” by Sharon D. Boykin (A $5.1 million preschool in the Beavercreek City Schools district will help accommodate a growing student population and reduce overcrowding, according to school officials.); “Ohio’s health exchange to include more competition” by Randy Tucker (It was just a year ago that the insurance industry fretted over potential losses from the new insurance market created by the Affordable Care Act.). Other applications: • Targeted advertising • Opinion analysis • Disaster response • Location-based services
  • 6. LOCATION PUBLISHED BY USER: geo-tagged tweets; profile information
  • 7. LOCATION PUBLISHED BY USER: geo-tagged tweets; profile information • Less than 4% of tweets contain geo-spatial tags • In about 4 out of 5 cases, the location field in the profile is either empty or contains invalid information such as “Justin Bieber’s heart”; even when a valid location is present, it may be only at the state or nation level
  • 8. LOCATION INFERENCE. Network based: friends, followees, followers. Content based: “Just drove around Golden Gate Park two times trying to get in”; “Cleveland Browns confuse me. When I give up on them, they actually show up to play.”
  • 9. CONTENT BASED APPROACHES. Geographic location of a user influences the contents of their tweets, e.g. “Just drove around Golden Gate Park two times trying to get in”; “Cleveland Browns confuse me. When I give up on them, they actually show up to play.” Supervised approaches: • Probabilistic Models (Cheng, Caverlee, and Lee, 2010) • Cascading Topic Models (Eisenstein, O’Connor, Smith, and Xing, 2010) • Gaussian Mixture Model (Chang, Lee, Eltaher, and Lee, 2012) • Language Models (Doran, Gokhale, and Dagnino, 2014) • Ensemble of Statistical and Heuristic Classifiers (Mahmud, Nichols, and Drews, 2014)
  • 10. PROBLEM STATEMENT: Predict the location of a Twitter user based on their tweets, by exploiting Wikipedia to create a location-specific knowledge base
  • 11. KNOWLEDGE-BASE ENABLED APPROACH. Knowledge base entry for San Francisco: Golden Gate Bridge, San Francisco 49ers, San Francisco Chronicle, … User profile (entity: count): Golden Gate Bridge: 4; San Francisco 49ers: 2; San Francisco Chronicle: 1. Top-k predictions: San Francisco, Oakland, Palo Alto
  • 12. KNOWLEDGE-BASE ENABLED APPROACH (pipeline diagram). Knowledge Base Generator: internal links extraction for city-1 … city-k, producing weighted local entities (LocalEntity-1 … LocalEntity-n). User Profile Generator: entity recognition and scoring over annotated tweets. Location Prediction: location predictor producing ranked cities for the user
  • 13. LOCAL ENTITIES (figure: San Francisco, New York City, Houston)
  • 14. WIKIPEDIA • Collaborative encyclopedia • As of 2014, English Wikipedia has 4.6 million articles, 18 billion page views and 500 million unique visitors per month • Category structure: used for document clustering, tweet classification, personalization systems, etc. • Link structure: used for word sense disambiguation, semantic relatedness between terms, etc.
  • 15. LOCAL ENTITIES • We consider the internal links of location pages as local entities of the city (figure: local entities of San Francisco) • While a city’s page does not contain a link to itself, we also use the city itself as a local entity
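The local-entity definition above (internal links of the city's Wikipedia page, plus the city itself) can be sketched in a few lines. This is only an illustration under stated assumptions: it takes raw wikitext as input, recognises links of the form `[[Target]]`, `[[Target|label]]` and `[[Target#Section|label]]` with a regular expression, and skips any target containing a colon as a crude namespace filter. The function name `local_entities` and the sample text are hypothetical; the slides do not say how the authors parsed the Wikipedia dump.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 captures the link target (the linked article title).
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def local_entities(city, wikitext):
    """Return the set of local entities of `city`: every internal link
    target found in its page text, plus the city itself."""
    entities = {city}  # the city is treated as its own local entity
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Crude simplification: drop namespaced links such as File: or Category:
        if target and ":" not in target:
            entities.add(target)
    return entities


# Hypothetical snippet of wikitext for illustration only.
sample = (
    "The [[San Francisco Chronicle]] is a daily newspaper. "
    "See also the [[Golden Gate Bridge|bridge]] and "
    "[[Alcatraz Island#History|Alcatraz]]. [[File:sf.jpg|thumb]]"
)
print(sorted(local_entities("San Francisco", sample)))
```

The colon filter would also drop legitimate article titles that contain a colon, so a real pipeline would need a proper namespace check.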
  • 16. ARE ALL ENTITIES EQUALLY LOCAL?
  • 17. ARE ALL ENTITIES EQUALLY LOCAL? San Francisco Chronicle, San Francisco Examiner, SF Weekly versus CNN, BBC, Al Jazeera America
  • 18. LOCALNESS MEASURE OF ENTITIES: Association-based Measure • Pointwise Mutual Information (PMI): standard measure of association between two variables • Assumption: the higher the localness of an entity with respect to the city, the higher the statistical dependence between them • Computed as (standard PMI definition) $\mathrm{PMI}(le, c) = \log \frac{P(le, c)}{P(le)\,P(c)}$, where $le$ is the local entity, $c$ is the city, $P(le, c)$ is the joint probability of occurrence of the city and the local entity in the Wikipedia dump, and $P(le)$ and $P(c)$ are the individual probabilities of occurrence of the local entity and the city respectively
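A small sketch of the PMI localness score described on this slide, estimating the probabilities from occurrence counts. The count-based estimation, the base-2 logarithm, the function name `pmi` and all the numbers are assumptions made for illustration; the slides do not state how occurrences were counted in the Wikipedia dump.

```python
import math


def pmi(joint_count, entity_count, city_count, total):
    """PMI(le, c) = log2( P(le, c) / (P(le) * P(c)) ), with each
    probability estimated as count / total (an assumed estimator)."""
    p_joint = joint_count / total
    p_entity = entity_count / total
    p_city = city_count / total
    return math.log2(p_joint / (p_entity * p_city))


# Hypothetical counts: a frequently mentioned entity and city ...
common = pmi(joint_count=10, entity_count=100, city_count=100, total=10_000)
# ... versus a rare entity that only ever occurs with a rare city.
rare = pmi(joint_count=1, entity_count=1, city_count=1, total=10_000)
print(round(common, 3), round(rare, 3))
```

Because the score is not normalized, the rare pair receives the higher value even though it is backed by a single co-occurrence, which is the sensitivity to occurrence counts noted later in the evaluation.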
  • 19. LOCALNESS MEASURE OF ENTITIES: Graph-based Measure. Wikipedia snippets: “The Boston Red Sox, a founding member of the American League of Major League Baseball in 1901...”; “The Boston Red Sox are an American professional baseball team based in Boston, Massachusetts ... They are members of American League (AL).” Graph nodes: Boston, Boston Red Sox, American League
  • 20. LOCALNESS MEASURE OF ENTITIES: Graph-based Measure • Betweenness Centrality (BC): measures the importance of a node relative to the rest of the nodes in the graph • A high BC score of a vertex indicates that it lies on a considerable fraction of the shortest paths connecting other vertices • Computed as (standard, unnormalized form) $BC(le) = \sum_{le_i \neq le \neq le_j} \frac{\sigma_{le_i le_j}(le)}{\sigma_{le_i le_j}}$, where $le_i$, $le_j$, $le$ are local entities of $c$, $\sigma_{le_i le_j}$ is the total number of shortest paths from $le_i$ to $le_j$, and $\sigma_{le_i le_j}(le)$ is the number of those paths that pass through $le$
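Betweenness centrality on a directed, unweighted graph of local entities can be computed with Brandes' algorithm, sketched below in plain Python. The speaker notes mention Ulrik Brandes, but whether the authors used this exact algorithm, and whether they normalized the scores, is an assumption. The adjacency-dict input format, the function name and the toy graph are hypothetical.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness centrality for a directed, unweighted graph.
    `graph` maps each node to a set of successor nodes."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    bc = dict.fromkeys(nodes, 0.0)

    for source in nodes:
        # Breadth-first search that counts shortest paths from `source`.
        order = []                              # nodes in order of discovery
        preds = {v: [] for v in nodes}          # predecessors on shortest paths
        sigma = dict.fromkeys(nodes, 0)         # number of shortest paths
        dist = dict.fromkeys(nodes, -1)
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:                 # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:      # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate pair dependencies from the farthest nodes back to the source.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                bc[w] += delta[w]
    return bc


# Hypothetical toy graph: the only path from "a" to "c" runs through "b".
toy = {"a": {"b"}, "b": {"c"}, "c": set()}
print(betweenness_centrality(toy))
```

In the toy graph only `b` lies between two other nodes, so it is the only node with a non-zero score.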
  • 21. LOCALNESS MEASURE OF ENTITIES: Semantic Overlap Measure. Figure listing three groups of entities: (1) Alcatraz Island, Treasure Island, Alameda Island, Financial District, Market Street, Fisherman’s Wharf, San Francisco 49ers, Cow Hollow, Silicon Valley, South Beach, …; (2) Suspension Bridge, Hyde Street Pier, Irving Morrow, Angelo Rossi, Art Deco, Charles Alton Ellis, Bethlehem Steel, Half Way to Hell Club, International Orange, …; (3) San Francisco Bay, Golden Gate, San Francisco Chronicle, U.S. Route 101, Marin County, Sausalito, Bay Area, …
  • 22. LOCALNESS MEASURE OF ENTITIES: Semantic Overlap Measure • Measures the relatedness between concepts, with the intuition that related concepts are connected to similar entities • Jaccard Index: overlap between two sets, $jaccard(le, c) = \frac{|IL(c) \cap IL(le)|}{|IL(c) \cup IL(le)|}$, where $IL(c)$ and $IL(le)$ are the internal links found in the Wikipedia pages of the city $c$ and the local entity $le$
  • 23. LOCALNESS MEASURE OF ENTITIES: Semantic Overlap Measure • Tversky Index: asymmetric similarity measure between two sets, $ti(le, c) = \frac{|IL(c) \cap IL(le)|}{|IL(c) \cap IL(le)| + \alpha\,|IL(c) - IL(le)| + \beta\,|IL(le) - IL(c)|}$, where $IL(c)$ and $IL(le)$ are the internal links found in the Wikipedia pages of the city $c$ and the local entity $le$ • We choose $\alpha = 0$ and $\beta = 1$ • For every entity in the page of a local entity not found in the page of the city, penalize the local entity
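The two semantic overlap measures can be sketched directly on Python sets of internal links. The Tversky form and the choice α = 0, β = 1 follow the slide; the Jaccard form is the standard intersection-over-union definition. The function names and the toy link sets are hypothetical.

```python
def jaccard(il_city, il_entity):
    """Symmetric overlap: |IL(c) ∩ IL(le)| / |IL(c) ∪ IL(le)|."""
    union = il_city | il_entity
    return len(il_city & il_entity) / len(union) if union else 0.0


def tversky(il_city, il_entity, alpha=0.0, beta=1.0):
    """Asymmetric overlap. With alpha = 0 and beta = 1 only links that are on
    the entity's page but missing from the city's page are penalized."""
    common = len(il_city & il_entity)
    denominator = (
        common
        + alpha * len(il_city - il_entity)
        + beta * len(il_entity - il_city)
    )
    return common / denominator if denominator else 0.0


# Hypothetical link sets: a large city page and a small neighbourhood page
# whose links are all shared with the city.
city_links = {"a", "b", "c", "d", "e", "f", "g", "h"}
neighbourhood_links = {"a", "b"}
print(jaccard(city_links, neighbourhood_links))   # 0.25
print(tversky(city_links, neighbourhood_links))   # 1.0
```

The toy case illustrates why the asymmetric measure helps: the small page is penalized by Jaccard simply for being small, while Tversky with α = 0 ignores the city links it does not mention.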
  • 24. KNOWLEDGE-BASE ENABLED APPROACH (pipeline diagram repeated from slide 12)
  • 25. CREATION OF USER PROFILE. Step 1: Entity Linking, e.g. “Just drove around Golden Gate Park trying to get in.” We use Zemanta for entity linking
  • 26. CREATION OF USER PROFILE. Step 1: Entity Linking, e.g. “Just drove around Golden Gate Park trying to get in.” We use Zemanta for entity linking. Step 2: Entity Scoring (entity: count): Golden Gate Bridge: 4; San Francisco 49ers: 2; San Francisco Chronicle: 1
  • 27. KNOWLEDGE-BASE ENABLED APPROACH (pipeline diagram repeated from slide 12)
  • 28. LOCATION PREDICTION • Compute an aggregate score for each city whose local entities are found in a user’s tweets: $locScore(u, c) = \sum_{e \in LE_{cu}} locl(e, c) \times s_e$, where $LE_{cu}$ is the set of local entities of $c$ found in the profile of user $u$, $locl(e, c)$ is the localness measure of the entity $e$ with respect to city $c$, and $s_e$ is the score (frequency) of $e$ in the user’s profile • Rank cities by $locScore(u, c)$ in descending order to predict the top-k locations of a user
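The aggregate score and top-k ranking can be sketched as follows, assuming the entity score in the user profile is the raw mention count and the knowledge base is a nested dictionary of localness weights. The function name `rank_cities` and the data layout are hypothetical. The weights and counts are a subset of those shown on the worked-example slide, so the totals here will not match the slide's aggregate scores.

```python
def rank_cities(profile, knowledge_base, k=3):
    """Score each city as the sum over shared entities of
    localness(entity, city) * count(entity), then return the top-k cities."""
    scores = {}
    for city, localness in knowledge_base.items():
        score = sum(
            localness[entity] * count
            for entity, count in profile.items()
            if entity in localness
        )
        if score > 0:  # keep only cities with at least one matching entity
            scores[city] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# User profile: entity -> number of mentions in the user's tweets.
profile = {
    "San Francisco International Airport": 6,
    "Nob Hill": 3,
    "SF Weekly": 1,
    "Golden Gate Park": 1,
    "Detroit Historical Museum": 1,
    "General Motors": 1,
    "Detroit Red Wings": 4,
}

# Knowledge base: city -> {local entity -> localness weight}.
knowledge_base = {
    "San Francisco": {
        "Nob Hill": 0.48214,
        "SF Weekly": 0.1875,
        "Golden Gate Park": 0.16783,
        "San Francisco International Airport": 0.06818,
    },
    "Detroit, MI": {
        "Detroit Historical Museum": 0.4838,
        "General Motors": 0.05538,
        "Detroit Red Wings": 0.0232,
    },
}

for city, score in rank_cities(profile, knowledge_base, k=2):
    print(f"{city}: {score:.5f}")
```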
  • 29. LOCATION PREDICTION (worked example)
      User profile (entity (count)): San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1), Fox Oakland Theatre (2), Berkley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1), The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1), Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1), …
      Knowledge base (localness scores):
        San Francisco: Nob Hill 0.48214, SF Weekly 0.1875, Golden Gate Park 0.16783, San Francisco International Airport 0.06818, …
        Oakland, CA: Fox Oakland Theatre 0.09375, SF Bay Area 0.12972, Green Day 0.02066, …
        Detroit, MI: Detroit Historical Museum 0.4838, General Motors 0.05538, Detroit Red Wings 0.0232, …
        Palo Alto, CA: PARC (company) 0.03726, Google 0.04678, Facebook 0.05810
      Location prediction (matched entities, aggregate score):
        San Francisco: San Francisco International Airport (6), San Francisco (4), Nob Hill (3), San Francisco Museum of Modern Art (1), Beach Blanket Babylon (2), San Francisco Municipal Railway (4), Golden Gate Park (1), San Francisco Bay Area (1), SF Weekly (1): 14.5531
        Oakland, CA: Fox Oakland Theatre (2), Berkley (1), Green Day (1), Oakland (9), San Francisco Bay Area (1): 10.7584
        Detroit, MI: The White Stripes (1), Detroit Metropolitan Wayne County Airport (1), Detroit Historical Museum (1), Detroit Red Wings (4), General Motors (1): 8.0600
        Palo Alto, CA: Palo Alto (6), SAP AG (8), Facebook (3), PARC (company) (2), Dell (1), Google (1): 6.9175
  • 30. IMPLEMENTATION. Knowledge base • All cities of the United States with population > 5000 as published in the census estimates of 2012 • 4,661 cities and 500,714 local entities. Baseline • Considers all local entities to be equally local to the city • Location prediction based only on the frequency of entities
  • 31. EVALUATION: Test Dataset • Published by Cheng et al. • Collected from September 2009 to January 2010 • Contains 5119 active users from the continental United States with approximately 1000 tweets per user • User’s location listed in the form of latitude and longitude
  • 32. EVALUATION: Evaluation Metrics • Error Distance: distance between the actual location of the user and the estimated location • Average Error Distance (AED): average of the error distances of all users in the test dataset • Accuracy (ACC): percentage of users predicted within 100 miles of their actual location
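The three metrics can be sketched with the haversine formula, which the speaker notes name as the way straight-line distance between two latitude/longitude pairs is computed. The Earth radius constant, the function names, the inclusive 100-mile threshold and the sample coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def evaluate(pairs, threshold_miles=100.0):
    """`pairs` is a list of ((actual_lat, actual_lon), (pred_lat, pred_lon)).
    Returns (accuracy in percent, average error distance in miles)."""
    errors = [haversine_miles(*actual, *predicted) for actual, predicted in pairs]
    accuracy = 100.0 * sum(e <= threshold_miles for e in errors) / len(errors)
    average_error = sum(errors) / len(errors)
    return accuracy, average_error


# Approximate city-centre coordinates, used only as sample data.
san_francisco = (37.7749, -122.4194)
oakland = (37.8044, -122.2712)
detroit = (42.3314, -83.0458)

# One user predicted in a neighbouring city, one predicted across the country.
accuracy, aed = evaluate([(san_francisco, oakland), (san_francisco, detroit)])
print(accuracy, round(aed, 1))
```

A single far-off prediction dominates the average error distance while changing accuracy by only one user, which is why the slides report both metrics.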
  • 33. EVALUATION: Location Prediction Results
      Localness Measure   ACC (%)   AED (miles)   ACC@2   ACC@3   ACC@5
      Baseline            25.21     632.56        38.01   42.78   47.95
      PMI                 38.48     599.40        49.85   56.06   64.15
      BC                  47.91     478.14        57.39   62.18   66.98
      Jaccard Index       53.21     433.62        67.41   73.56   78.84
      Tversky Index       54.48     429.00        68.72   74.68   79.99
  • 34. EVALUATION (results table repeated from slide 33): PMI • PMI is not normalized, hence sensitive to the count of occurrences of local entities in the Wikipedia corpus • E.g. the PMI of local entities of Glen Rock, New Jersey is higher than that of local entities of San Francisco
  • 35. EVALUATION (results table repeated from slide 33): Betweenness Centrality • Does a good job of assigning low scores to common entities • E.g. community college, National Weather Service, start-up company, etc. • Fails for entities with some relevance to the city but no distinguishing factor • E.g. IBM with respect to Endicott, New York
  • 36. LOCALNESS MEASURE OF ENTITIES
  • 37. EVALUATION (results table repeated from slide 33): Jaccard Index • Underperforms for local entities whose pages contain fewer entities than the city’s page • E.g. Eureka Valley and California with respect to San Francisco
  • 39. EVALUATION (results table repeated from slide 33): Tversky Index • Best performing localness measure • Overcomes the disadvantage of the Jaccard Index • For example, we are able to assign higher localness to Eureka Valley (0.7096) than to California (0.1270) with respect to San Francisco
  • 40. EVALUATION: Comparison with Existing Approaches
      Method                                ACC (%)   AED (miles)
      Cheng, Caverlee, and Lee, 2010        51.00     535.56
      Chang, Lee, Eltaher, and Lee, 2012    49.9      509.3
      Wikipedia based Approach              54.48     429.00
  • 41. CONCLUSION • Presented a crowd-sourced, knowledge-based approach to predict the location of a user that does not require geo-tagged tweets as a training dataset • Introduced the concept of local entities and preprocessed the Wikipedia hyperlink graph to extract local entities for each city • Investigated relatedness measures to establish the degree of association between a local entity and a city • Evaluated the proposed approach against a benchmark dataset published by Cheng et al. For 5119 users, we are able to predict the location of 55% of users within 100 miles, with an average error distance of 429 miles
  • 42. FUTURE WORK • Compute the confidence score of the prediction based on top-k cities and the count of local entities in tweets • Investigate other localness measures for scoring local entities • Consider semantic types and categories of local entities, and weight their contribution based on type • Explore other knowledge bases such as Wikitravel and GeoNames
  • 43. Thank you! Paper at: http://www.knoesis.org/library/resource.php?id=2039 Contact: pavan@knoesis.org, @pavankaps
  • 45. EVALUATION: Top-k Average Error Distance
  • 46. Distribution of users: distribution of all users in the dataset; distribution of accurately predicted users
  • 47. EVALUATION: Impact of Local Entities
  • 48. EVALUATION: Top 100 Cities • 2172 users from the dataset are from the top-100 most populated cities of the United States • 60% of users predicted within 100 miles of their actual location • 54% of users predicted exactly at the city level

Editor's Notes

  1. “City of Lights” is a nickname of Paris. When we see a text like this, we use our background knowledge to interpret it. Similarly, background knowledge can be used to improve a machine’s ability to understand and interpret text.
  2. Background knowledge is central to the idea of the Semantic Web. Domain-specific knowledge bases such as MusicBrainz, UMLS and DBpedia have been used to solve problems such as named entity recognition and disambiguation, relevant document retrieval and data analysis. Today I am going to talk about applying background knowledge to predict the geographic location of a user. In this context, the location of a user is their home location at the city level.
  3. Social media has grown rapidly, and many applications have been developed based on Twitter. Associating geographic information with Twitter users can add value to many of these applications.
  4. Location provides “context”.
  5. Users can volunteer geographic information through their cellphone or their profile.
  6. Cheng et al. found that 21% of users listed a location as granular as city and state in their profile. There is therefore a need to automatically infer the location of a user.
  7. The network-based approach uses training data to determine hidden patterns in the communications between a user and their friends. These observations are used to predict the location of the user.
  8. Geographic location of a user influences the contents of their tweets. The general idea behind these approaches is to determine the probabilistic distribution of words across a region. Different approaches such as language models, topic models, Gaussian mixture models and ensemble-based classifiers have been used for the task of location prediction.
  9. We address these weaknesses by using a knowledge-based approach to extract location-specific concepts from Wikipedia. The task is content-based prediction of the location of a Twitter user, exploiting Wikipedia to create a location-specific knowledge base.
  10. This approach consists of three modules: a knowledge base generator that extracts location-specific concepts from a crowd-sourced knowledge base; a user profile generator that creates a semantic profile of the user whose location is to be determined; and a location prediction module that uses the semantic profile and the knowledge base to predict the top-k locations of the user.
  11. Knowledgebase generator extracts location specific concepts from Wikipedia.
  12. Local Entities: Entities that have a high relatedness to a city and can discriminate between geographic locations
  13. Wikipedia is publicly available. Anyone can edit a Wikipedia article, correct errors and compensate for any biased views. Wikipedia has been used as background knowledge to solve many problems. At Kno.e.sis, Wikipedia has been used in many applications: Doozer used Wikipedia to create domain-specific ontologies, BLOOMS used Wikipedia for ontology alignment, and the hierarchical interest graph for a personalization system was also based on the category structure of Wikipedia. Rephrase this slide.
  14. So we hypothesize that for each location, the internal links in its Wikipedia page represent entities that are relevant to it and we consider them as Local Entities of the city. In other words, the local entities of a city are all the outgoing links from its Wikipedia page.
  15. For example, consider this snippet from the Wikipedia page of San Francisco. The San Francisco Chronicle is a major daily newspaper in San Francisco. Clearly it is more local to San Francisco than CNN or MSNBC, which are national media outlets.
  16. So now our goal is to score each entity with respect to the city such that the score reflects the localness of the entity with respect to the city. We experiment with measures from three different classes: an association-based measure, a graph-based measure and semantic-overlap-based measures.
  17. Association-based measure: compute relatedness between two terms based on their occurrences in a large corpus. PMI is a measure of how much the actual probability of their co-occurrence differs from what is expected based on the probabilities of their individual occurrences. With respect to San Francisco: CNN: 7.94; San Francisco Chronicle: 10.41.
  18. Construct a directed graph of local entities for each city (cf. Ulrik Brandes).
  19. BC is used to compute the importance of an actor in a social network. The importance is measured in terms of shortest paths. For each node, the BC is the sum of the fractions of shortest paths that pass through that node. So BC is helping us determine which nodes are important in the network of local entities of a city.
  20. Based on the idea that the higher the overlap between the concepts found in the Wikipedia pages of a city and an entity, the higher the degree of localness of the entity.
  21. Jaccard Index is a symmetric measure to compute the semantic overlap between a city and an entity. But we find that a local entity generally represents a part of the city. For example, Golden Gate Bridge will not completely overlap with all the concepts of San Francisco, which contains entities from different categories like Climate, History, Geography etc. With that in mind, we use Tversky Index, which is an asymmetric similarity measure.
  22. Tversky Index is a unidirectional measure of similarity of the local entity with respect to the city. We penalize the local entity for every concept in its Wikipedia page that is not present in the city’s page.
  23. The next module is the user profile generator. We create a semantic profile of each user consisting of Wikipedia entities found in their tweets.
  24. We create a semantic profile of each user consisting of Wikipedia entities found in their tweets. The first step in the creation of the user profile is entity linking: identification of entities in tweets and linking them to their Wikipedia articles. We use Zemanta for entity linking.
  25. The next step is entity scoring, i.e. scoring each local entity in a user’s tweets using the frequency of its occurrence. We are predicting the home location of a user based on a set of their historic tweets. Using the frequency helps to determine how relevant the entity is with respect to the user. For example, a football fan may tweet about many football teams, but chances are that he will tweet more frequently about the football team of his city.
  26. Brief description of the approach
  27. For predicting the location of a user we compute an aggregate score for each city whose local entities are found in a user’s tweets.
  28. We first created a knowledgebase of locations that contains location specific entities along with a score that represents the localness of the entity with respect to the city. Next, for a user whose location is to be predicted we create a profile that consists of entities mentioned in their tweets and their frequency. Finally, we predict the top-k locations of a user.
  29. We evaluated our approach on a test dataset published by Cheng et al. Their test dataset consists of 5119 active users from the continental United States with approximately 1000 tweets per user. These users are spread across 569 cities in the US. Spammers and bots have been removed from the dataset.
  30. We compute the distance as a straight-line distance between a pair of lat-longs using the haversine formula.
  31. PMI – not normalized, sensitive to count of occurrences
  32. We find that betweenness centrality does a good job of assigning low scores to common entities such as the National Weather Service but it fails for entities which have some relevance to the city but no distinguishing factor.
  33. Endicott, New York is the birthplace of IBM. So the Wikipedia page of Endicott has an entire section dedicated to IBM, and this section contains entities very specific to IBM, e.g. punched cards and T. J. Watson. Because the shortest path between these IBM-specific nodes and other nodes of the city lies through IBM, the betweenness centrality of IBM is very high. In such cases, local entities scored using betweenness centrality lead to incorrect predictions.
  34. Jaccard Index underperforms for local entities whose pages contain fewer entities than the city’s page.
  35. Example: Eureka Valley is a residential neighbourhood of San Francisco, so we would expect it to be more local to San Francisco than California.
  36. With Tversky, the localness of a local entity is reduced only by entities on its page that are not present on the city’s page.
  37. Introduction to Twitter and location of a Twitter user
  38. We find that the more local entities we find in a user’s tweets, the better we are able to predict their location accurately. This is quite intuitive. If you tweet more and give more clues about your location, then there is a higher possibility of being able to locate you. On the other hand, if your tweets are specific to a certain topic such as technology, or national or world politics, then it is difficult to find evidence that can be used to locate you accurately. As shown in this graph, the predictions made on the basis of 10 or more entities were able to locate 66% of the users within 100 miles.
  39. From this dataset we selected users from the top-100 most populated cities of the US and found that our algorithm was able to locate 60% of the users within 100 miles of their actual location and 54% of the users at exactly the city level.