Combining Textual and Graph-based Features for Entity Disambiguation
Sherzod Hakimov
Semantic Computing Group
CITEC, Bielefeld University
DBpedia Community Meeting 2016
Leipzig
Problem Definition - Named Entity Disambiguation
Istanbul is the capital of Turkey and the largest city in the EU
Outline
● Candidate Retrieval
● Undirected Factor Graphs
● Inference Strategy
● Combining different features
● Comparison with state-of-the-art
● NERFGUN - Named Entity disambiguation by Ranking with Factor Graphs over Undirected edges
● Published at EKAW 2016, Bologna, Italy
Candidate Retrieval
● Lucene index of DBpedia & Wikipedia data with frequency values
○ DBpedia label properties (rdfs:label, dbo:firstName, etc.)
○ Wikipedia anchors
Ambiguity
Istanbul is the capital of Turkey and the largest city in the EU
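The retrieval step above can be sketched as a lookup from surface forms to candidate URIs ranked by frequency. The following is a minimal in-memory stand-in for the Lucene index; the entries and counts are purely illustrative, not taken from the actual index:

```python
from collections import defaultdict

# Toy surface-form -> candidate-URI index with frequency counts,
# standing in for the Lucene index built from DBpedia label
# properties and Wikipedia anchor texts. Entries are illustrative.
index = defaultdict(dict)

def add_anchor(surface_form, uri):
    """Count how often an anchor text links to a given URI."""
    key = surface_form.lower()
    index[key][uri] = index[key].get(uri, 0) + 1

# Simulate 90 anchors to the city and 10 to the province.
for uri in ["dbr:Istanbul"] * 90 + ["dbr:Istanbul_Province"] * 10:
    add_anchor("Istanbul", uri)

def candidates(surface_form, top_k=10):
    """Return candidate URIs for a mention, ranked by anchor frequency."""
    counts = index.get(surface_form.lower(), {})
    return sorted(counts, key=counts.get, reverse=True)[:top_k]
```

The frequency counts double as the Term Frequency feature used later in the model.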
Undirected Factor Graphs
Given a document with annotation spans
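A document with annotation spans induces an undirected factor graph: one variable per span (the candidate URI assigned to it), unary factors scoring individual assignments, and pairwise factors connecting co-occurring annotations. A minimal sketch under those assumptions (the class and factor functions are illustrative, not the paper's implementation):

```python
# Minimal undirected factor graph over annotation spans.
# Each variable holds the candidate URI assigned to one span; factors
# score single assignments (unary) or pairs of assignments (pairwise).
class FactorGraph:
    def __init__(self, spans):
        self.assignment = {span: None for span in spans}
        self.unary = []     # functions f(span, uri) -> float
        self.pairwise = []  # functions g(uri_a, uri_b) -> float

    def score(self):
        """Sum of factor scores for the current full assignment."""
        uris = self.assignment
        s = sum(f(span, uri) for f in self.unary
                for span, uri in uris.items())
        spans = list(uris)
        s += sum(g(uris[a], uris[b]) for g in self.pairwise
                 for i, a in enumerate(spans) for b in spans[i + 1:])
        return s

g = FactorGraph(["Istanbul", "Turkey", "EU"])
g.assignment = {"Istanbul": "dbr:Istanbul",
                "Turkey": "dbr:Turkey",
                "EU": "dbr:European_Union"}
# Toy unary factor: reward URIs whose local name contains the mention.
g.unary.append(lambda span, uri: 1.0 if span.lower() in uri.lower() else 0.0)
```

Because the graph is undirected, the score of an assignment is just the sum over all factors, which is what the inference procedure on the next slides maximizes.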
Inference - Markov Chain Monte Carlo
● Initial document: randomly initialized
● Creates samples of partial solutions (aka sampling)
● Repeated over successive sampling steps (Step 1, Step 2, ...)
(Figure: candidate entities sampled for the mention "Istanbul")
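The sampling loop above can be sketched as follows: start from a random assignment, repeatedly propose changing one span's candidate, and keep the proposal if it scores at least as well. This is a simplified greedy variant of the MCMC procedure described; the candidate lists and scoring function are placeholders:

```python
import random

def mcmc_inference(spans, candidates, score, steps=100, seed=0):
    """Greedy MCMC-style inference: repeatedly resample one span's
    candidate and keep the change if the global score does not drop."""
    rng = random.Random(seed)
    # Randomly initialized assignment, as on the slides.
    state = {s: rng.choice(candidates[s]) for s in spans}
    for _ in range(steps):
        span = rng.choice(spans)
        proposal = dict(state, **{span: rng.choice(candidates[span])})
        if score(proposal) >= score(state):
            state = proposal
    return state

# Toy candidate sets and a toy score preferring the intended entities.
cands = {"Istanbul": ["dbr:Istanbul", "dbr:Istanbul_Province"],
         "Turkey": ["dbr:Turkey", "dbr:Turkey_(bird)"]}
prefer = {"Istanbul": "dbr:Istanbul", "Turkey": "dbr:Turkey"}
best = mcmc_inference(list(cands), cands,
                      lambda st: sum(st[s] == prefer[s] for s in st))
```

In the actual model the score would come from the weighted factor graph rather than a hand-written preference table.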
Features
● PageRank - computed for all DBpedia resources using random walks (as in Babelfy)
● Term Frequency - frequency values between surface form and URI
● Edit distance - Levenshtein distance between URI and surface form
● Document Similarity - text similarity between the given document and the DBpedia abstract of each annotation
● Topic Specific PageRank - computed for all DBpedia resources using random walks (as in Babelfy)
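The edit-distance feature is straightforward to sketch: Levenshtein distance between the surface form and the URI's local name, via the standard dynamic program. The normalization into a similarity score is one plausible choice, not necessarily the paper's exact formulation:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_distance_feature(surface_form, uri):
    """Normalized similarity between a mention and a URI's local name."""
    name = uri.rsplit(":", 1)[-1].replace("_", " ")
    d = levenshtein(surface_form.lower(), name.lower())
    return 1 - d / max(len(surface_form), len(name), 1)
```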
Features - example for the mention "Istanbul"
● Edit distance, Term Frequency, Document Similarity, PageRank, Topic Specific PageRank
● Istanbul dbo:abstract: "Istanbul, once known as Constantinople, is the most populous city in Turkey, ..."
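The document-similarity feature compares the input document against each candidate's dbo:abstract, as in the Istanbul example above. One simple realization is cosine similarity over bag-of-words term-frequency vectors; the paper's exact similarity measure may differ, so treat this as a sketch:

```python
import math
from collections import Counter

def cosine_bow(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

doc = "Istanbul is the largest city in Turkey"
abstract = ("Istanbul, once known as Constantinople, "
            "is the most populous city in Turkey")
sim = cosine_bow(doc, abstract)
```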
Model Training
● SampleRank - learns feature weights
● Datasets: AIDA/CoNLL training set & MicroPost 2014 training set
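SampleRank learns weights from pairs of consecutive samples drawn during inference: when the model ranks two samples differently than the training objective (e.g. accuracy against gold annotations), it applies a perceptron-style update toward the objectively better sample. A minimal sketch with illustrative feature names; the real feature extraction and objective are those of the model:

```python
def samplerank_update(weights, feats_a, feats_b, obj_a, obj_b, lr=0.1):
    """One SampleRank update on a sample pair: if the model's ranking
    of a vs. b disagrees with the objective's ranking, move the weights
    toward the features of the objectively better sample."""
    model_a = sum(weights.get(f, 0.0) * v for f, v in feats_a.items())
    model_b = sum(weights.get(f, 0.0) * v for f, v in feats_b.items())
    if obj_a > obj_b and model_a <= model_b:
        better, worse = feats_a, feats_b
    elif obj_b > obj_a and model_b <= model_a:
        better, worse = feats_b, feats_a
    else:
        return weights  # model and objective already agree
    for f in set(better) | set(worse):
        weights[f] = (weights.get(f, 0.0)
                      + lr * (better.get(f, 0.0) - worse.get(f, 0.0)))
    return weights

# Sample a has high edit-distance similarity and a better objective
# score, so its features should be rewarded.
w = {"edit_distance": 0.0, "pagerank": 0.0}
w = samplerank_update(w, {"edit_distance": 1.0}, {"pagerank": 1.0},
                      obj_a=0.9, obj_b=0.2)
```

Running this update over many sample pairs during MCMC inference yields the learned feature weights used at test time.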
Model Training - Local Evaluation
Comparison
● GERBIL - framework for benchmarking named entity recognition, named entity disambiguation, and question answering
● State-of-the-art systems: AGDISTIS, AIDA, DBpedia Spotlight, TagMe, Babelfy, etc.
Conclusion
● Collective disambiguation of named entities
● Model based on factor graphs to capture dependencies between annotations
● Impact of combining different features
● Comparable results to state-of-the-art
Thanks
