Combining Textual and Graph-based Features for Entity Disambiguation
Sherzod Hakimov
Semantic Computing Group
CITEC, Bielefeld University
DBpedia Community Meeting 2016
Leipzig
Problem Definition - Named Entity Disambiguation
Istanbul is the capital of Turkey and the largest city in the EU
Outline
● Candidate Retrieval
● Undirected Factor Graphs
● Inference Strategy
● Combining different features
● Comparison with state-of-the-art
● NERFGUN - Named Entity disambiguation by Ranking with Factor Graphs over Undirected edges
● Published at EKAW 2016, Bologna, Italy
Candidate Retrieval
● Lucene index of DBpedia & Wikipedia data with frequency values
○ DBpedia label properties (rdfs:label, dbo:firstName, etc.)
○ Wikipedia anchors
Ambiguity
Istanbul is the capital of Turkey and the largest city in the EU
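The retrieval step above can be sketched as a lookup from surface forms to candidate URIs ranked by frequency. The following is a minimal in-memory stand-in for the Lucene index; the entries and counts are purely illustrative, not taken from the actual index:

```python
from collections import defaultdict

# Toy surface-form -> candidate-URI index with frequency counts,
# standing in for the Lucene index built from DBpedia label
# properties and Wikipedia anchor texts. Entries are illustrative.
index = defaultdict(dict)

def add_anchor(surface_form, uri):
    """Count how often an anchor text links to a given URI."""
    key = surface_form.lower()
    index[key][uri] = index[key].get(uri, 0) + 1

# Simulate 90 anchors to the city and 10 to the province.
for uri in ["dbr:Istanbul"] * 90 + ["dbr:Istanbul_Province"] * 10:
    add_anchor("Istanbul", uri)

def candidates(surface_form, top_k=10):
    """Return candidate URIs for a mention, ranked by anchor frequency."""
    counts = index.get(surface_form.lower(), {})
    return sorted(counts, key=counts.get, reverse=True)[:top_k]
```

The frequency counts double as the Term Frequency feature used later in the model.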
Undirected Factor Graphs
Given a document with annotation spans
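A document with annotation spans induces an undirected factor graph: one variable per span (the candidate URI assigned to it), unary factors scoring individual assignments, and pairwise factors connecting co-occurring annotations. A minimal sketch under those assumptions (the class and factor functions are illustrative, not the paper's implementation):

```python
# Minimal undirected factor graph over annotation spans.
# Each variable holds the candidate URI assigned to one span; factors
# score single assignments (unary) or pairs of assignments (pairwise).
class FactorGraph:
    def __init__(self, spans):
        self.assignment = {span: None for span in spans}
        self.unary = []     # functions f(span, uri) -> float
        self.pairwise = []  # functions g(uri_a, uri_b) -> float

    def score(self):
        """Sum of factor scores for the current full assignment."""
        uris = self.assignment
        s = sum(f(span, uri) for f in self.unary
                for span, uri in uris.items())
        spans = list(uris)
        s += sum(g(uris[a], uris[b]) for g in self.pairwise
                 for i, a in enumerate(spans) for b in spans[i + 1:])
        return s

g = FactorGraph(["Istanbul", "Turkey", "EU"])
g.assignment = {"Istanbul": "dbr:Istanbul",
                "Turkey": "dbr:Turkey",
                "EU": "dbr:European_Union"}
# Toy unary factor: reward URIs whose local name contains the mention.
g.unary.append(lambda span, uri: 1.0 if span.lower() in uri.lower() else 0.0)
```

Because the graph is undirected, the score of an assignment is just the sum over all factors, which is what the inference procedure on the next slides maximizes.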
Inference - Markov Chain Monte Carlo
● Initial document: randomly initialized
● Creates samples of partial solutions (aka sampling)
● Repeated over successive sampling steps (Step 1, Step 2, ...)
(Figure: candidate entities sampled for the mention "Istanbul")
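The sampling loop above can be sketched as follows: start from a random assignment, repeatedly propose changing one span's candidate, and keep the proposal if it scores at least as well. This is a simplified greedy variant of the MCMC procedure described; the candidate lists and scoring function are placeholders:

```python
import random

def mcmc_inference(spans, candidates, score, steps=100, seed=0):
    """Greedy MCMC-style inference: repeatedly resample one span's
    candidate and keep the change if the global score does not drop."""
    rng = random.Random(seed)
    # Randomly initialized assignment, as on the slides.
    state = {s: rng.choice(candidates[s]) for s in spans}
    for _ in range(steps):
        span = rng.choice(spans)
        proposal = dict(state, **{span: rng.choice(candidates[span])})
        if score(proposal) >= score(state):
            state = proposal
    return state

# Toy candidate sets and a toy score preferring the intended entities.
cands = {"Istanbul": ["dbr:Istanbul", "dbr:Istanbul_Province"],
         "Turkey": ["dbr:Turkey", "dbr:Turkey_(bird)"]}
prefer = {"Istanbul": "dbr:Istanbul", "Turkey": "dbr:Turkey"}
best = mcmc_inference(list(cands), cands,
                      lambda st: sum(st[s] == prefer[s] for s in st))
```

In the actual model the score would come from the weighted factor graph rather than a hand-written preference table.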
Features
● PageRank - computed for all DBpedia resources using random walks (as in Babelfy)
● Term Frequency - frequency values between surface form and URI
● Edit distance - Levenshtein distance between URI and surface form
● Document Similarity - text similarity between the given document and the DBpedia abstract of each annotation
● Topic Specific PageRank - computed for all DBpedia resources using random walks (as in Babelfy)
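The edit-distance feature is straightforward to sketch: Levenshtein distance between the surface form and the URI's local name, via the standard dynamic program. The normalization into a similarity score is one plausible choice, not necessarily the paper's exact formulation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_distance_feature(surface_form, uri):
    """Normalized similarity between a mention and a URI's local name."""
    name = uri.rsplit(":", 1)[-1].replace("_", " ")
    d = levenshtein(surface_form.lower(), name.lower())
    return 1 - d / max(len(surface_form), len(name), 1)
```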
Features - example for the mention "Istanbul"
● Edit distance, Term Frequency, Document Similarity, PageRank, Topic Specific PageRank
● Istanbul dbo:abstract: "Istanbul, once known as Constantinople, is the most populous city in Turkey, ..."
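The document-similarity feature compares the input document against each candidate's dbo:abstract, as in the Istanbul example above. One simple realization is cosine similarity over bag-of-words term-frequency vectors; the paper's exact similarity measure may differ, so treat this as a sketch:

```python
import math
from collections import Counter

def cosine_bow(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

doc = "Istanbul is the largest city in Turkey"
abstract = ("Istanbul, once known as Constantinople, "
            "is the most populous city in Turkey")
sim = cosine_bow(doc, abstract)
```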
Model Training
● SampleRank - learns feature weights
● Datasets: AIDA/CoNLL training set & MicroPost 2014 training set
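SampleRank learns weights from pairs of consecutive samples drawn during inference: when the model ranks two samples differently than the training objective (e.g. accuracy against gold annotations), it applies a perceptron-style update toward the objectively better sample. A minimal sketch with illustrative feature names; the real feature extraction and objective are those of the model:

```python
def samplerank_update(weights, feats_a, feats_b, obj_a, obj_b, lr=0.1):
    """One SampleRank update on a sample pair: if the model's ranking
    of a vs. b disagrees with the objective's ranking, move the weights
    toward the features of the objectively better sample."""
    model_a = sum(weights.get(f, 0.0) * v for f, v in feats_a.items())
    model_b = sum(weights.get(f, 0.0) * v for f, v in feats_b.items())
    if obj_a > obj_b and model_a <= model_b:
        better, worse = feats_a, feats_b
    elif obj_b > obj_a and model_b <= model_a:
        better, worse = feats_b, feats_a
    else:
        return weights  # model and objective already agree
    for f in set(better) | set(worse):
        weights[f] = (weights.get(f, 0.0)
                      + lr * (better.get(f, 0.0) - worse.get(f, 0.0)))
    return weights

# Sample a has high edit-distance similarity and a better objective
# score, so its features should be rewarded.
w = {"edit_distance": 0.0, "pagerank": 0.0}
w = samplerank_update(w, {"edit_distance": 1.0}, {"pagerank": 1.0},
                      obj_a=0.9, obj_b=0.2)
```

Running this update over many sample pairs during MCMC inference yields the learned feature weights used at test time.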
Model Training - Local Evaluation
Comparison
● GERBIL - framework for benchmarking named entity recognition, named entity disambiguation, and question answering
● State-of-the-art systems: AGDISTIS, AIDA, DBpedia Spotlight, TagMe, Babelfy, etc.
Conclusion
● Collective disambiguation of named entities
● Model based on factor graphs to capture dependencies between annotations
● Impact of combining different features
● Comparable results to state-of-the-art
Thanks
