“Exploiting Wikipedia for Entity
Name Disambiguation in Tweets”
Muhammad Atif Qureshi
Colm O'Riordan
Gabriella Pasi
06/16/14 2
Contents
● Introduction
● Related Work
● Methodology
● Evaluation
● Conclusion
Motivation
● Social media users voice their opinions about various
entities/brands (e.g., musicians, movies, companies)
● These opinions amount to implicit feedback about an entity/brand
● This has recently given birth to a new area within the
marketing domain known as “online reputation
management”
Problem Statement
● Given a set of tweets collected by issuing an
entity (brand) name as a query, the task is to
determine which of the tweets are related to
the entity and which are not
● Decide whether a tweet is related to Apple Inc.
– “Apple tastes better than blackberry”
– “Apple phones are better than blackberry”
Wikipedia Graph Structure
[Figure: Wikipedia graph of category nodes (C1-C7, C9, C10) and article nodes (A1-A4). Legend: category edge, article belonging to category, article link.]
Related Work
● Entity linking: linking an entity mention to its correct sense
– Ferragina and Scaiella 2010 and Meij et al. 2012 have
proposed strategies for tweets
● They use the hyperlink structure of Wikipedia and the anchor
texts of the links to those Wikipedia pages
● Disambiguation is performed by applying a voting function
among all senses associated with the detected anchors
– Meij et al. 2012 employ supervised machine learning
techniques for further improvement
Methodology
● Chunking Strategy
● Entity Phrases & Categories
● Features Based on Wikipedia Articles'
Hyperlinks
● Relatedness Score Based on Wikipedia
Category-Article Structure
Chunking Strategy
● Original tweet: “I prefer Samsung over HTC, Apple, Nokia because it is economical and good”
● Normalized tweet: “i prefer samsung over htc apple nokia because it is economical and good”
● Phrase chunks with boundaries: prefer | samsung | htc | apple | nokia | economical
● Stopwords are removed, and the longest phrase that matches a Wikipedia article is taken as a chunk
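The chunking step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stopword list, the set of article titles, the `max_len` window, and the function name `chunk_tweet` are all hypothetical stand-ins, and punctuation is assumed to act as a chunk boundary.

```python
import re

# Hypothetical stand-ins for a real stopword list and the set of
# lowercased Wikipedia article titles.
STOPWORDS = {"i", "over", "because", "it", "is", "and"}
ARTICLE_TITLES = {
    "prefer", "samsung", "htc", "apple", "nokia", "economical",
    "new york", "new york city",
}


def chunk_tweet(text, titles=ARTICLE_TITLES, stopwords=STOPWORDS, max_len=4):
    """Greedy longest-match chunking of a tweet against article titles."""
    chunks = []
    # Assumption: punctuation marks a boundary that a phrase may not cross.
    for segment in re.split(r"[,.;:!?]+", text.lower()):
        tokens = re.findall(r"[a-z0-9']+", segment)
        i = 0
        while i < len(tokens):
            # Try the longest window first, then shrink it.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                phrase = " ".join(tokens[i:i + n])
                if phrase in titles and phrase not in stopwords:
                    chunks.append(phrase)
                    i += n
                    break
            else:
                # No title starts at this token (or it is a stopword): skip it.
                i += 1
    return chunks


tweet = "I prefer Samsung over HTC, Apple, Nokia because it is economical and good"
print(chunk_tweet(tweet))
```

With the toy title set above, "new york city" is preferred over "new york" because the longest window is tried first.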
Entity Phrases & Categories
● Entity E1 has a Wikipedia article AE1
● AE1 has a list of Wikipedia categories CL_E1
● Sub-categories SCL_E1 of CL_E1 are followed up to a depth of 2
● Entity categories (RC): CL_E1 and SCL_E1
● Entity phrases of E1 (ArticlesRC): Wikipedia articles in CL_E1 and Wikipedia articles in SCL_E1
● ArticlesRC have categories WC (i.e., RC ⊂ WC)
[Diagram edge labels: “Has a”, “Mentions inside”, “Has”]
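Collecting sub-categories up to a fixed depth can be sketched as a breadth-first traversal. The snippet below is only an illustration: the category graph is a toy dictionary, the function name `related_categories` is hypothetical, and treating the entity article's own categories as depth 0 is an assumption rather than something stated on the slide.

```python
from collections import deque


def related_categories(entity_categories, subcats, max_depth=2):
    """Return {category: depth} for CL_E1 plus sub-categories up to max_depth.

    entity_categories: categories of the entity's article (assumed depth 0).
    subcats: mapping from a category to its direct sub-categories.
    """
    depth = {cat: 0 for cat in entity_categories}
    queue = deque(entity_categories)
    while queue:
        cat = queue.popleft()
        if depth[cat] == max_depth:
            continue  # do not expand beyond the depth limit
        for child in subcats.get(cat, ()):
            if child not in depth:  # keep the shallowest depth seen
                depth[child] = depth[cat] + 1
                queue.append(child)
    return depth


# Toy category graph (hypothetical names).
toy_subcats = {
    "Technology companies": ["Computer companies"],
    "Computer companies": ["Computer hardware companies"],
    "Computer hardware companies": ["Printer companies"],
}
print(related_categories(["Technology companies"], toy_subcats))
```

Recording the depth alongside each category is convenient because a depth-based weighting can then reuse it directly.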
Features Based on Wikipedia
Articles' Hyperlinks
For entity Apple Inc.
Chunked tweet: entity phrase “apple”; context phrases “doctor”, “fruit”

Entity Phrase    Context Phrase Senses                          Avg. Max.
Senses           doctor                  fruit                  Sense Score
                 phd    band   medical   album   plant
apple (fruit)    80     45     230       6       532            381
apple (film)     10     50     0         9       0              29.5
apple (inc.)     83     20     10        5       0              44

● Feature values are generated using inlinks, outlinks, and inlinks+outlinks
● Sense apple (inc.) is related to the entity, while the others are not
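The "Avg. Max. Sense Score" values in the table are consistent with taking, for each context phrase, the maximum value over its senses and then averaging those maxima. The sketch below reproduces the three table rows under that reading; the function name `avg_max_sense_score` and the dictionary layout are illustrative assumptions.

```python
def avg_max_sense_score(values_by_context):
    """Average, over context phrases, of the maximum value among their senses.

    values_by_context: {context_phrase: [value for each sense of that phrase]}
    """
    maxima = [max(values) for values in values_by_context.values()]
    return sum(maxima) / len(maxima)


# Values copied from the slide's table; senses of "doctor" are
# (phd, band, medical) and senses of "fruit" are (album, plant).
table = {
    "apple (fruit)": {"doctor": [80, 45, 230], "fruit": [6, 532]},
    "apple (film)": {"doctor": [10, 50, 0], "fruit": [9, 0]},
    "apple (inc.)": {"doctor": [83, 20, 10], "fruit": [5, 0]},
}

scores = {sense: avg_max_sense_score(ctx) for sense, ctx in table.items()}
print(scores)
# {'apple (fruit)': 381.0, 'apple (film)': 29.5, 'apple (inc.)': 44.0}
```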
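Relatedness Score Based on
Wikipedia Category-Article Structure
\[ \mathrm{DepthSignificance}(p) = \sum_{cat \,\in\, RC \cap p_{cat}} \frac{1}{depth_{cat} + 1} \]
\[ \mathrm{CatSignificance}(p) = \frac{|RC \cap p_{cat}|}{|WC \cap p_{cat}|} \times \log\left(|RC \cap p_{cat}| + 1\right) \]
\[ \mathrm{PhraseSignificance}(p) = \log\left(\mathrm{wordlen}(p) + 1\right) \times p_{frequency} \]
\[ \mathrm{RelatednessScore} = \sum_{p \,\in\, MatchedPhrases} \mathrm{DepthSignificance}(p) \times \mathrm{CatSignificance}(p) \times \mathrm{PhraseSignificance}(p) \]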
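The relatedness score can be sketched in Python as follows. Several details are assumptions made for illustration: the natural logarithm is used because the slide does not give a base, wordlen(p) is taken to be the number of words in p, the frequency of p is passed in by the caller, and a phrase with no categories in WC is given a category significance of 0. The function names and data layout are hypothetical.

```python
import math


def depth_significance(p_cats, rc_depth):
    """Sum of 1 / (depth + 1) over the phrase's categories that lie in RC."""
    return sum(1.0 / (rc_depth[c] + 1) for c in p_cats if c in rc_depth)


def cat_significance(p_cats, rc, wc):
    """|RC ∩ p_cat| / |WC ∩ p_cat| * log(|RC ∩ p_cat| + 1)."""
    n_rc = len(p_cats & rc)
    n_wc = len(p_cats & wc)
    if n_wc == 0:  # assumption: no overlap with WC contributes nothing
        return 0.0
    return (n_rc / n_wc) * math.log(n_rc + 1)


def phrase_significance(phrase, frequency):
    """log(wordlen(p) + 1) * frequency, with wordlen as the word count."""
    return math.log(len(phrase.split()) + 1) * frequency


def relatedness_score(matched, rc_depth, wc):
    """matched: {phrase: (set of the phrase's categories, frequency)}."""
    rc = set(rc_depth)
    return sum(
        depth_significance(cats, rc_depth)
        * cat_significance(cats, rc, wc)
        * phrase_significance(phrase, freq)
        for phrase, (cats, freq) in matched.items()
    )


# Toy example with hypothetical category identifiers.
rc_depth = {"c1": 0, "c2": 1}   # RC with the depth of each category
wc = {"c1", "c2", "c3"}         # WC, a superset of RC
matched = {"apple inc": ({"c1", "c2", "c3"}, 1)}
print(relatedness_score(matched, rc_depth, wc))
```

In this toy case the three factors are 1.5, (2/3)·ln 3 and ln 3, so the score equals (ln 3)². A phrase whose categories all fall outside RC scores 0, since both the depth and category terms vanish.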
Dataset
● Multilingual tweets of 61 entities (25%
Spanish, 75% English)
– Training ~749 tweets for each entity
– Testing ~1481 tweets for each entity
Domain       No. of     Training                          Testing
             Entities   Non     Rev     Orig    Trans     Non     Rev     Orig    Trans
Music        20         1461    14353   12518   3296      1998    28137   23442   6693
University   10         3548    3412    6569    391       6760    7387    13060   1087
Banking      11         2021    5753    5327    2447      4335    11635   10918   5052
Automotive   20         3767    11356   12585   2538      6851    23253   24690   5414
Total        61         10797   34874   36999   8672      19944   70412   72110   18246
Measure
● Reliability is the product of the precision of both
classes (i.e., positive and negative)
● Sensitivity is the product of the recall of both
classes
\[ \mathrm{Reliability} = \frac{TP}{TP + FP} \times \frac{TN}{TN + FN} \]
\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \times \frac{TN}{TN + FP} \]
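Both measures follow directly from the four confusion-matrix counts, as in this small sketch. Returning 0.0 when a denominator is zero is an assumption made here to keep the functions total; the slide does not say how that case is handled.

```python
def _ratio(numerator, denominator):
    # Assumption: an undefined ratio (zero denominator) is treated as 0.0.
    return numerator / denominator if denominator else 0.0


def reliability(tp, fp, tn, fn):
    """Product of the precision of the positive and negative classes."""
    return _ratio(tp, tp + fp) * _ratio(tn, tn + fn)


def sensitivity(tp, fp, tn, fn):
    """Product of the recall of the positive and negative classes."""
    return _ratio(tp, tp + fn) * _ratio(tn, tn + fp)


# Made-up counts for illustration only.
counts = dict(tp=40, fp=10, tn=30, fn=20)
print(reliability(**counts), sensitivity(**counts))
```

For these made-up counts, reliability is 0.8 × 0.6 and sensitivity is (2/3) × 0.75. Because each measure is a product, a classifier that assigns every tweet to one class scores 0 on both.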
Settings
● Classifier: Random Forest
Setting     Features Based on      Relatedness Score Based    Domain-Level   Entity-Level
            Wikipedia Articles'    on Wikipedia Category-     Training       Training
            Hyperlinks             Article Structure
hrdomain    x                      x                          x
hrentity    x                      x                                         x
rdomain                            x                          x
rentity                            x                                         x
Results
Performance Comparison with Other Systems
Team            Reliability   Sensitivity   F(R,S)
POPSTAR         0.73          0.45          0.49
OUR APPROACH    0.67          0.42          0.45
SZTE NLP        0.60          0.44          0.44
LIA             0.66          0.36          0.38
BASELINE        0.49          0.32          0.33
UvA UNED        0.68          0.22          0.21

Evaluation Results on Test Set by Domain
Domain       Setting     Reliability   Sensitivity   F(R,S)
Automotive   hrdomain    0.54          0.47          0.47
Banking      hrentity    0.75          0.58          0.49
University   hrdomain    0.71          0.44          0.49
Music        rentity     0.83          0.34          0.39
Conclusion
● The experimental evaluations establish Wikipedia’s
strength as a significant encyclopaedic resource for
the task of entity name disambiguation in tweets.
● The relatedness score defined using the Wikipedia
category-article structure introduces a powerful
semantic notion of linking n-grams in a tweet with
the information relevant to an entity.
● As future work, we aim to combine our Wikipedia-based
features with text-based techniques to further
improve performance.
References
● E. Amigo, J. Carrillo de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martin, E. Meij, M. de
Rijke, and D. Spina. Overview of RepLab 2013: Evaluating on-line reputation monitoring
systems. In CLEF 2013 Labs and Workshop Notebook Papers, Springer LNCS, 2013.
● P. Ferragina and U. Scaiella. TAGME: on-the-fly annotation of short text fragments (by
Wikipedia entities). CIKM ’10, pages 1625–1628, New York, NY, USA, 2010. ACM.
● E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. WSDM ’12,
pages 563–572, New York, NY, USA, 2012. ACM.
● M.-H. Peetz, D. Spina, J. Gonzalo, and M. de Rijke. Towards an active learning system for
company name disambiguation in microblog streams. In CLEF (Online Working
Notes/Labs/Workshop), 2013.
Questions
???
