Named Entity Disambiguation via Large-Scale Graph Analytics
Alberto Parravicini
2018-05-05
NECSTlab
Understanding trending topics
● Finance: news has a direct impact on the market.
● Advertising: targeted advertising for each user.
● Recommender Systems: targeted recommendations for each user.
Extracting Topics from Text
● The identification of topics requires 2 main steps:
1. Named Entity Recognition: spot the names of persons, companies, etc.
○ High accuracy in the state of the art [1]

[1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging." arXiv:1508.01991 (2015).
Extracting Topics from Text
● The identification of topics requires 2 main steps:
2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. a Wikipedia page)
○ en.wikipedia.org/wiki/Donald_Trump
○ en.wikipedia.org/wiki/North_American_Free_Trade_Agreement
● Example: candidate identities for the ambiguous mention “wall”:
1. en.wikipedia.org/wiki/Defensive_wall
2. .../wiki/Berlin_Wall
3. .../wiki/The_Wall_(album)
4. .../wiki/Mexico-United_States_barrier
Current Approaches
Historically, most Named Entity Disambiguation techniques rely on Rule-Based Natural Language Processing (NLP).
● Pros:
○ Usually not computationally intensive
● Cons:
○ Cannot deal with ambiguity
○ Dependent on grammar and language
Our Goal: an approach that is language-independent and can deal with ambiguity.
Proposed Approach
● We exploit the structure of Wikipedia to obtain a large graph (~100M edges)
○ DBpedia contains all the information in Wikipedia, stored in a structured way, as Subject-Relation-Object triples:
“Donald_Trump” (Subject) — “birthPlace” (Relation) — “Queens” (Object)
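The triple structure above maps directly onto a labeled directed graph. A minimal sketch (not the authors' code) of turning DBpedia-style triples into an adjacency structure; the extra triples beyond the slide's birthPlace example are invented for illustration:

```python
# Each DBpedia triple (subject, relation, object) becomes a labeled
# directed edge: subject --relation--> object.
from collections import defaultdict

triples = [
    ("Donald_Trump", "birthPlace", "Queens"),       # example from the slide
    ("Donald_Trump", "party", "Republican_Party"),  # illustrative only
    ("Queens", "isPartOf", "New_York_City"),        # illustrative only
]

# adjacency map: subject -> list of (relation, object) pairs
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

print(graph["Donald_Trump"])
```

Keeping the relation label on each edge matters later: the entropy measure is computed per relation, not per plain edge.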
Proposed Approach
● Our work extends the state-of-the-art method of Quantified Collective Validation (QCV) [2]
● High-level pipeline:
○ Preprocessing: Graph Building → Preprocessing & PageRank
○ In-Production Execution: New Text → Candidate Selection → Collective Optimization → Entity Disambiguation

[2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation." EMNLP 2015.
Graph Building
From DBpedia, we build 2 graphs and join them together:
● Relation Graph: contains “standard” relations
● Redirects Graph: contains “redirection” relations, used to solve ambiguity
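A sketch of the join, assuming both graphs are stored as labeled edge lists (the talk does not give the storage format, and the redirect examples below are illustrative):

```python
# Standard relations from the Relation Graph
relation_edges = [
    ("Donald_Trump", "birthPlace", "Queens"),
]
# Redirection relations from the Redirects Graph: alternative surface
# forms pointing at the canonical vertex.
redirect_edges = [
    ("Donald_J._Trump", "redirectsTo", "Donald_Trump"),
    ("NAFTA", "redirectsTo", "North_American_Free_Trade_Agreement"),
]

# The joined graph simply contains both edge sets; keeping the relation
# label lets the two kinds of edges stay distinguishable.
joined = relation_edges + redirect_edges
redirect_targets = {s: o for s, r, o in joined if r == "redirectsTo"}

def resolve(name):
    """Follow a redirect edge to the canonical vertex, if one exists."""
    return redirect_targets.get(name, name)

print(resolve("Donald_J._Trump"))  # Donald_Trump
```

Redirect edges are what let ambiguous or variant names ("NAFTA", "Donald J. Trump") reach the same canonical vertex as the full page title.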
Preprocessing
● We precompute 2 measures, one for edges and one for vertices:
○ Entropy: how much “information” an edge has
○ Salience: how “important” each vertex is, similar to PageRank
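The deck only says salience is "similar to PageRank", so as an illustration here is plain power-iteration PageRank on a toy edge list; the damping factor and iteration count are conventional defaults, not values from the talk:

```python
def pagerank(edges, num_iters=50, d=0.85):
    """edges: list of (src, dst) pairs of a directed graph."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src in nodes:
            if out[src]:
                share = d * rank[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += share
            else:  # dangling vertex: spread its rank uniformly
                for n in nodes:
                    new[n] += d * rank[src] / len(nodes)
        rank = new
    return rank

# Toy graph: the most-linked-to vertex gets the highest salience.
edges = [("Donald_Trump", "Queens"), ("Donald_Trump", "New_York_City"),
         ("Queens", "New_York_City")]
rank = pagerank(edges)
```

On a ~15M-vertex graph this precomputation is the expensive part, which is why it sits in the offline stage of the pipeline rather than the in-production one.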
Candidate Selection
● Idea: for each named entity, pick a small number of candidate vertices through string similarity.
● Advantage 1: problem size reduction
● Advantage 2: dealing with ambiguity
Candidates for “wall”:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin_Wall
3. http://.../The_Wall_(album)
4. http://.../Mexico-United_States_barrier
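The talk does not say which string-similarity metric is used; as one plausible choice, a sketch using `difflib.SequenceMatcher` to keep the k most similar vertex names for a mention:

```python
from difflib import SequenceMatcher

def similarity(mention, vertex_name):
    """String similarity in [0, 1]; underscores are treated as spaces."""
    a = mention.lower()
    b = vertex_name.replace("_", " ").lower()
    return SequenceMatcher(None, a, b).ratio()

def top_candidates(mention, vertices, k=4):
    """Keep only the k vertex names most similar to the mention."""
    return sorted(vertices, key=lambda v: similarity(mention, v),
                  reverse=True)[:k]

vertices = ["Defensive_wall", "Berlin_Wall", "The_Wall_(album)",
            "Mexico-United_States_barrier", "Queens", "NAFTA"]
cands = top_candidates("wall", vertices, k=3)
print(cands)
```

Shrinking millions of vertices down to a handful of candidates per mention is what makes the later collective step tractable.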
Collective Linking
● Starting from the candidates of each named entity, we extract candidate graphs from the input graph.
● Candidate graphs are ranked by combining Salience and Entropy: the top-ranked candidates are the best match!
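An illustrative sketch of collective ranking (a simplification, not the QCV algorithm itself): score every combination of one candidate per mention by how many graph edges connect the chosen candidates, and break ties with a precomputed salience score. All the data below is made up:

```python
from itertools import product

# Toy knowledge-graph edges and salience scores (invented values).
edges = {("Donald_Trump", "Mexico-United_States_barrier"),
         ("Donald_Trump", "NAFTA")}
salience = {"Donald_Trump": 0.5, "Donald_Trump_(song)": 0.01,
            "Defensive_wall": 0.3, "Mexico-United_States_barrier": 0.1}

candidates = {
    "Trump": ["Donald_Trump", "Donald_Trump_(song)"],
    "wall": ["Defensive_wall", "Mexico-United_States_barrier"],
}

def score(combo):
    # Coherence: count edges between the chosen candidates...
    links = sum((a, b) in edges or (b, a) in edges
                for i, a in enumerate(combo) for b in combo[i + 1:])
    # ...plus a small salience bonus as a tie-breaker.
    return links + 0.1 * sum(salience[c] for c in combo)

best = max(product(*candidates.values()), key=score)
linked = dict(zip(candidates, best))
print(linked)
```

Note how the collective step overrides pure popularity: "Defensive_wall" has the higher salience, but the barrier candidate wins because it is connected to "Donald_Trump" in the graph.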
Experimental Setup
Problem:
● We have huge graphs (~15M vertices, ~100M edges)
● We need fast execution times (a few seconds at most)
Solution:
● Oracle PGX, a state-of-the-art toolkit for graph analytics:
○ Graph queries
○ Custom algorithms
○ Graph modifications
Preliminary Results
● We are still working on the 4th stage of the pipeline (Collective Optimization)
● According to the QCV paper [2], > 75% disambiguation accuracy
● With our extensions, we can already obtain almost 80% accuracy on tweets
○ Similar to in-production data
Thank you!
Named Entity Disambiguation via Large-Scale Graph Analytics
Alberto Parravicini
alberto.parravicini@mail.polimi.it
Entropy and Salience
● Entropy: computed on each relation/edge; measures how random the destinations of a relation are.
● Salience: computed on each vertex, similar to PageRank.
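Reading "how random the destinations of a relation are" as Shannon entropy over a relation's destination vertices gives the sketch below; the talk's exact definition may differ:

```python
from collections import Counter
from math import log2

def relation_entropy(destinations):
    """Shannon entropy (bits) of a relation's destination distribution."""
    counts = Counter(destinations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Four equally likely destinations -> log2(4) = 2 bits: the relation
# carries a lot of discriminating information.
e_high = relation_entropy(["Queens", "Hawaii", "Hope", "Scranton"])
print(e_high)  # 2.0
# A relation that always points to the same vertex carries none.
e_low = relation_entropy(["USA", "USA", "USA"])
```

Under this reading, a high-entropy edge like birthPlace is informative for disambiguation, while a near-constant relation contributes little.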
Graph Similarity
● First, compute a measure of topological similarity: the percentage of vertices in common.
● Then, combine it with the salience and the entropy of the candidate.
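The slide names the three signals but not the exact formula, so the weighted sum below is only an illustration; the Jaccard index is used here as one concrete reading of "percentage of vertices in common", and the weights are invented:

```python
def topological_similarity(vertices_a, vertices_b):
    """Share of vertices in common (Jaccard index over vertex sets)."""
    if not vertices_a and not vertices_b:
        return 0.0
    return len(vertices_a & vertices_b) / len(vertices_a | vertices_b)

def candidate_score(text_vertices, cand_vertices, salience, entropy,
                    w_sim=1.0, w_sal=0.5, w_ent=0.5):
    """Combine topological similarity with candidate salience/entropy."""
    sim = topological_similarity(text_vertices, cand_vertices)
    return w_sim * sim + w_sal * salience + w_ent * entropy

# Toy context: vertices mentioned in the text vs. a candidate's
# neighborhood (invented values).
text_ctx = {"Donald_Trump", "Mexico", "NAFTA"}
cand_ctx = {"Donald_Trump", "Mexico", "Border"}
score = candidate_score(text_ctx, cand_ctx, salience=0.2, entropy=0.8)
```

A candidate sharing many vertices with the text's context scores high even if it is globally obscure, which is exactly how the method handles ambiguous mentions.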
Oracle PGX
● Architecture: PGX Shell / Java-Python API → PGX API → PGX Engine
● Java interface
● PGQL (queries)
● Green-Marl (algorithm DSL)
Leveraging Graphs
● Wikipedia pages are used to build a graph.
● We match the text to the Knowledge Base through its topological relations.
[Figure: a small graph connecting U.S., Trump, Mexico and NAFTA to the ambiguous mention “Wall” and its candidates:]
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin_Wall
3. http://.../The_Wall_(album)
4. http://.../Mexico-United_States_barrier
