2. Introduction
Problem Statement: For a given tweet, find the entities and then, using
contextual information about these entities, link them to the corresponding
information resources
● Microblogs capture an unprecedented amount of information
● Information extraction from microblog posts
● In our case, Twitter Entity Linking.
3. Related Works
● To Link or Not to Link? A Study on End-to-End Tweet
Entity Linking - Stephen Guo, Ming-Wei Chang, Emre
Kıcıman
● TAGME: On-the-fly Annotation of Short Text Fragments
(with Wikipedia Entities) - Paolo Ferragina, Ugo Scaiella
4. System Components
The project has been broken down into 3 major parts (see the pipeline sketch below)
● Mention Detection
○ the task of extracting surface-form candidates that can link to an entity
in the domain of interest
● Link Generation
○ the task of finding the relevant Wikipedia pages for each mention obtained
from the tweet
● Entity Disambiguation
○ the task of linking an extracted mention to a specific definition or instance
of an entity
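A minimal sketch of how these three stages could be chained is given below; the function names and the placeholder logic are illustrative assumptions, not the project's actual code.

# Minimal sketch of the three-stage pipeline; names and logic are placeholders.
from typing import Dict, List

def detect_mentions(tweet: str) -> List[str]:
    """Stage 1: extract surface-form candidates (placeholder heuristic)."""
    return [tok for tok in tweet.split() if tok.istitle()]

def generate_links(mention: str) -> List[str]:
    """Stage 2: propose candidate Wikipedia page titles (placeholder)."""
    return [mention, mention + " (disambiguation)"]

def disambiguate(mention: str, candidates: List[str], tweet: str) -> str:
    """Stage 3: pick the candidate that best fits the tweet context (placeholder)."""
    return candidates[0] if candidates else ""

def link_tweet(tweet: str) -> Dict[str, str]:
    """Run mention detection -> link generation -> disambiguation end to end."""
    return {m: disambiguate(m, generate_links(m), tweet)
            for m in detect_mentions(tweet)}

print(link_tweet("Amazon opens a new office in Seattle"))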
5. Approach
Mention Detection
● Classification and segmentation of named entities are treated as
separate tasks
● Most words found in tweets are not part of an entity
● An annotated dataset is required to effectively learn a model of
named entities
6. Approach
Mention Detection
● Segmentation
○ @usernames are not considered entities
- they are unambiguous
- trivial to identify with 100% accuracy
- they would only serve to inflate performance statistics
○ Brown clusters, together with the tagging, chunking and
capitalization systems, have been used to generate features (see the
feature sketch below).
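Below is a rough sketch of the kind of per-token features such a segmenter could use; the Brown-cluster strings, tag names and feature keys are illustrative assumptions rather than the project's exact feature set.

# Illustrative per-token segmentation features; cluster paths and keys are assumed.
from typing import Dict

# Hypothetical Brown-cluster bit strings learned from a tweet corpus.
BROWN_CLUSTERS = {"yess": "0110", "lol": "0111", "seattle": "10010"}

def token_features(tokens, pos_tags, chunk_tags, i) -> Dict[str, str]:
    """Features for token i: word shape, capitalization, POS, chunk, Brown prefixes."""
    word = tokens[i]
    cluster = BROWN_CLUSTERS.get(word.lower(), "")
    feats = {
        "word.lower": word.lower(),
        "word.istitle": str(word.istitle()),
        "word.isupper": str(word.isupper()),
        "pos": pos_tags[i],
        "chunk": chunk_tags[i],
        # Brown-cluster prefixes of several lengths act as coarse word classes.
        "brown.p2": cluster[:2],
        "brown.p4": cluster[:4],
    }
    if i > 0:
        feats["prev.pos"] = pos_tags[i - 1]
    return feats

print(token_features(["Visiting", "Seattle", "today"],
                     ["VBG", "NNP", "NN"], ["B-VP", "B-NP", "B-NP"], 1))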
7. Approach
Mention Detection
● Classification
○ Tweets do not contain enough context
○ A large list of entities and their types is used
○ Use of LabeledLDA
- Models each entity string as a mixture of types
- Information about an entity's distribution over types can be shared,
thus handling ambiguous entity strings
- For example, Amazon could correspond to a distribution over two
types: COMPANY and LOCATION, whereas Apple might represent a
distribution over COMPANY and FOOD (see the toy sketch below).
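The toy sketch below illustrates the intuition of sharing a type distribution per entity string; the probabilities and context words are made up, and this is not an implementation of LabeledLDA itself.

# Toy illustration of a per-string type distribution; all numbers are invented.
# Prior distribution over types for an ambiguous entity string.
TYPE_PRIOR = {"Amazon": {"COMPANY": 0.7, "LOCATION": 0.3},
              "Apple":  {"COMPANY": 0.8, "FOOD": 0.2}}

# Likelihood of seeing a context word given each type (toy values).
CONTEXT_LIKELIHOOD = {
    "COMPANY":  {"stock": 0.4, "ceo": 0.3, "rainforest": 0.01, "pie": 0.01},
    "LOCATION": {"stock": 0.01, "ceo": 0.01, "rainforest": 0.5, "pie": 0.01},
    "FOOD":     {"stock": 0.01, "ceo": 0.01, "rainforest": 0.01, "pie": 0.5},
}

def most_likely_type(entity: str, context_words) -> str:
    """Pick the type maximizing prior(type) * prod(P(word | type))."""
    scores = {}
    for t, prior in TYPE_PRIOR[entity].items():
        score = prior
        for w in context_words:
            score *= CONTEXT_LIKELIHOOD[t].get(w, 0.05)  # small smoothing value
        scores[t] = score
    return max(scores, key=scores.get)

print(most_likely_type("Amazon", ["rainforest"]))    # -> LOCATION
print(most_likely_type("Amazon", ["stock", "ceo"]))  # -> COMPANY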
8. Approach
Link Generation
● For each entity obtained in the previous step, find the
relevant Wikipedia pages
● Previously done using the Wikipedia library
● Results were given an inverted rank based on a
combination of the Jaccard similarity and the commonness of
the entity (see the scoring sketch below).
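A possible scoring sketch is given below; the equal weighting of the two signals and the toy commonness table are assumptions, not the values used in the project.

# Sketch of candidate scoring with Jaccard similarity and commonness (toy values).
from typing import List

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Hypothetical commonness values: P(page | anchor text) estimated from Wikipedia links.
COMMONNESS = {("amazon", "Amazon (company)"): 0.85,
              ("amazon", "Amazon River"): 0.10}

def score_candidates(mention: str, pages: List[str]) -> List[tuple]:
    """Score each candidate page by Jaccard similarity plus commonness."""
    scored = []
    for page in pages:
        score = jaccard(mention, page) + COMMONNESS.get((mention.lower(), page), 0.0)
        scored.append((page, score))
    return sorted(scored, key=lambda p: p[1], reverse=True)

print(score_candidates("Amazon", ["Amazon (company)", "Amazon River"]))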
9. Approach
Entity Disambiguation
● List of Wikipedia pages obtained in previous step
● Rank them according to the context of the tweet
● Pick out the most relevant ones
10. Approach
Entity Disambiguation
● A semantic similarity measure known as relatedness is used for
disambiguation
● Relatedness quantifies the relation between two Wikipedia entities
based upon their inlinks and outlinks.
- For a tweet with one entity, the candidate with the highest rank in the
previous step is selected as the answer.
- For two entities, the candidate pair giving the highest relatedness
measure is selected.
- For more than two entities, pair-wise relatedness is used (see the
sketch below).
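The sketch below uses a common link-based formulation of relatedness (in the style of Milne and Witten) computed from inlink sets; the toy inlink data and the restriction to inlinks only are assumptions, not necessarily the exact measure used here.

# Sketch of link-based relatedness from shared inlinks, plus pairwise selection.
import math
from itertools import product

W = 6_000_000  # approximate number of Wikipedia articles

# Hypothetical inlink sets (article ids) for a few candidate pages.
INLINKS = {
    "Apple Inc.": {1, 2, 3, 4, 5},
    "Steve Jobs": {2, 3, 4, 6},
    "Apple (fruit)": {7, 8, 9},
}

def relatedness(a: str, b: str) -> float:
    """Milne & Witten-style relatedness between two pages from shared inlinks."""
    A, B = INLINKS[a], INLINKS[b]
    common = len(A & B)
    if common == 0:
        return 0.0
    num = math.log(max(len(A), len(B))) - math.log(common)
    den = math.log(W) - math.log(min(len(A), len(B)))
    return max(0.0, 1 - num / den)

def best_pair(cands_a, cands_b):
    """For two mentions, pick the candidate pair with the highest relatedness."""
    return max(product(cands_a, cands_b), key=lambda p: relatedness(*p))

print(best_pair(["Apple Inc.", "Apple (fruit)"], ["Steve Jobs"]))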
11. Accuracy
● To calculate accuracy, we manually annotated around 100
tweets: for each tweet we identified the entities it contains and
linked each one to its relevant Wikipedia page.
● Hits and misses against these annotations were counted to
calculate the accuracy (see the sketch below).
● The test set of 100 tweets was very diverse and contained
tweets which had multiple entities as well.
● The overall accuracy of the system was around 52%
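A minimal sketch of how such an accuracy figure can be computed from the annotations is shown below; the data structures are illustrative, not the project's actual evaluation format.

# Sketch of accuracy over manually annotated tweets; data layout is assumed.
from typing import Dict

def accuracy(gold: Dict[str, Dict[str, str]], pred: Dict[str, Dict[str, str]]) -> float:
    """Fraction of gold (tweet, mention) -> page links the system reproduced."""
    hits = misses = 0
    for tweet_id, mentions in gold.items():
        for mention, page in mentions.items():
            if pred.get(tweet_id, {}).get(mention) == page:
                hits += 1
            else:
                misses += 1
    return hits / (hits + misses) if hits + misses else 0.0

gold = {"t1": {"Amazon": "Amazon (company)", "Seattle": "Seattle"}}
pred = {"t1": {"Amazon": "Amazon (company)", "Seattle": "Seattle Seahawks"}}
print(accuracy(gold, pred))  # -> 0.5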
Results