Named Entity Disambiguation via Large-Scale Graphs Analytics
Alberto Parravicini
2018-05-05
NECSTlab
Understanding trending topics
● Finance: news has a direct impact on the market.
● Advertising: targeted advertising for each user.
● Recommender Systems: targeted recommendations for each user.
Extracting Topics from Text
● The identification of topics requires 2 main steps:
1. Named Entity Recognition: spot names of persons, companies, etc.
○ High accuracy in the state of the art [1]
[1] Huang, Zhiheng, Wei Xu, and Kai Yu. "Bidirectional LSTM-CRF models for sequence tagging."
2. Named Entity Disambiguation: connecting named entities to a unique identity (e.g. Wikipedia page)
○ en.wikipedia.org/wiki/Donald_Trump
○ en.wikipedia.org/wiki/North_American_Free_Trade_Agreement
Candidates for “Wall”:
1. en.wikipedia.org/wiki/Defensive_wall
2. .../wiki/Berlin wall
3. .../wiki/The Wall (album)
4. .../wiki/Mexico-United_States_barrier
Current Approaches
Historically, most Named Entity Disambiguation techniques have relied on Rule-Based Natural Language Processing (NLP)
● Pro:
○ Usually not computationally intensive
● Cons:
○ Can’t deal with ambiguity
○ Dependency on grammar and language
Our Goal: an approach which is language independent and can deal with ambiguity
Proposed Approach
● We exploit the structure of Wikipedia to obtain a large graph (~100M edges)
○ DBpedia stores the information in Wikipedia in a structured way, as subject-relation-object triples:
Subject: “Donald_Trump”
Relation: “birthPlace”
Object: “Queens”
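As a minimal sketch of how such triples become graph edges, the snippet below turns (subject, relation, object) tuples into an adjacency map. Only the first triple comes from the slides; the other two, and the function name `build_graph`, are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

# (subject, relation, object) triples. Only the first one appears in the slides;
# the others are illustrative assumptions.
triples = [
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Donald_Trump", "party", "Republican_Party"),
    ("Queens", "country", "United_States"),
]

def build_graph(triples):
    """Return an adjacency map: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        # Each triple is one directed, labeled edge from subject to object.
        graph[subject].append((relation, obj))
    return dict(graph)

graph = build_graph(triples)
```

Each subject becomes a vertex with labeled outgoing edges, which is the shape the later steps (entropy per relation, salience per vertex) operate on.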
Proposed Approach
● Our work extends the state-of-the-art method of Quantified Collective Validation (QCV) [2]
● High Level Pipeline:
○ Preprocessing: Graph Building → Preprocessing & PageRank
○ In-Production Execution: New Text → Candidate selection → Collective Optimization → Entity Disambiguation
[2] Wang, Han, et al. "Language and domain independent entity linking with quantified collective validation."
Graph Building
From DBpedia, we build 2 graphs...
● Relation Graph: it contains “standard” relations
● Redirects Graph: it contains “redirection” relations, used to solve ambiguity
... and join them together!
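The slides do not say how the two graphs are joined. One simple possibility, shown here purely as an assumption, is a union of labeled edges in which every redirect becomes an edge with a hypothetical "redirect" label. The sample edges and the helper names `join_graphs` and `resolve` are illustrative.

```python
def join_graphs(relation_edges, redirect_edges):
    """Union of both graphs as a set of (subject, relation, object) edges.

    Redirect pairs (alias, target) are stored under the label "redirect",
    which is a hypothetical name chosen for this sketch.
    """
    joined = set(relation_edges)
    joined.update((alias, "redirect", target) for alias, target in redirect_edges)
    return joined

def resolve(joined, name):
    """Follow a redirect edge, if one exists, from an alias to its target vertex."""
    for subject, relation, obj in joined:
        if subject == name and relation == "redirect":
            return obj
    return name

# Illustrative sample data (assumption).
relation_edges = [("Donald_Trump", "birthPlace", "Queens")]
redirect_edges = [("NAFTA", "North_American_Free_Trade_Agreement")]
joined = join_graphs(relation_edges, redirect_edges)
```

With the redirects in the same graph, an alias found in the text can be mapped to its target vertex before any further processing.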
Preprocessing
● We precompute 2 measures for edges and vertices
○ Entropy: how much “information” an edge has
○ Salience: how “important” each vertex is, similar to PageRank
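The slides describe salience only as "similar to PageRank". As a rough stand-in, here is plain PageRank by power iteration. It is not the salience formula of the QCV paper, and the damping factor and iteration count are conventional defaults assumed for the sketch.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    out_links maps each vertex to the list of vertices it points to.
    """
    vertices = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in vertices if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for source, targets in out_links.items():
            if not targets:
                continue
            # Each vertex splits its damped rank evenly among its successors.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Tiny illustrative graph: A and B both point to C.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

In this toy graph C receives rank from both A and B, so it ends up as the most "important" vertex, which is the intuition behind a per-vertex salience score.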
Candidate Selection
● Idea: for each named entity, pick a small number of candidate vertices, through string similarity.
○ Advantage 1: Problem size reduction
○ Advantage 2: Dealing with ambiguity
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier
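The slides do not name the string-similarity measure. The sketch below assumes the ratio from Python's standard `difflib.SequenceMatcher` and keeps the top `k` vertex names. The vertex list and the function name `select_candidates` are illustrative.

```python
from difflib import SequenceMatcher

def select_candidates(mention, vertex_names, k=3):
    """Return the k vertex names most similar to the mention (case-insensitive)."""
    def score(name):
        # Underscores in vertex names are treated as spaces before comparing.
        normalized = name.replace("_", " ").lower()
        return SequenceMatcher(None, mention.lower(), normalized).ratio()
    return sorted(vertex_names, key=score, reverse=True)[:k]

# Illustrative vertex names (assumption).
vertex_names = [
    "Defensive_wall",
    "Berlin_Wall",
    "The_Wall_(album)",
    "Queens",
    "Donald_Trump",
]
candidates = select_candidates("Wall", vertex_names)
```

A purely character-based score cannot surface a name such as Mexico-United_States_barrier, which shares almost no characters with "Wall", so a real candidate selector needs more than this ratio alone.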
Collective Linking
[Figure: the Input Graph is compared with the Candidate Graphs of the Candidates, using Salience and Entropy, to select the Best Match.]
Experimental Setup
Problem:
● We have huge graphs (~15M vertices, ~100M edges)
● We need fast execution time (a few seconds at most)
Solution:
● Oracle PGX, a state-of-the-art toolkit for graph analytics.
○ Graph queries
○ Custom algorithms
○ Graph modifications
Preliminary Results
● We are still working on the 4th stage of the pipeline
● According to the paper, > 75% disambiguation accuracy
● With our extensions, we can already obtain almost 80% accuracy on tweets
○ Similar to in-production data
Thank you!
Named Entity Disambiguation via Large-Scale Graphs Analytics
Alberto Parravicini
alberto.parravicini@mail.polimi.it
Entropy and Salience
● Entropy: computed on each relation/edge. It measures how random the destinations of a relation are.
● Salience: computed on each vertex, similar to PageRank.
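One way to read "how random the destinations of a relation are" is the Shannon entropy of the distribution of objects reached through that relation. The sketch below uses that reading as an assumption, since the slides do not give the exact formula. The sample triples are illustrative.

```python
import math
from collections import Counter

def relation_entropy(triples, relation):
    """Shannon entropy (in bits) of the destinations reached via one relation."""
    destinations = Counter(obj for _, rel, obj in triples if rel == relation)
    total = sum(destinations.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1/p) over the observed destination frequencies.
    return sum((c / total) * math.log2(total / c) for c in destinations.values())

# Illustrative triples (assumption): "country" always points to the same vertex,
# while "birthPlace" points to two different vertices equally often.
triples = [
    ("Queens", "country", "United_States"),
    ("Honolulu", "country", "United_States"),
    ("Donald_Trump", "birthPlace", "Queens"),
    ("Barack_Obama", "birthPlace", "Honolulu"),
]
country_entropy = relation_entropy(triples, "country")
birthplace_entropy = relation_entropy(triples, "birthPlace")
```

A relation whose edges all end in the same vertex has zero entropy, while a relation with many equally likely destinations has high entropy and is more informative for telling candidates apart.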
Graph Similarity
● First, compute a measure of topological similarity: the percentage of vertices in common.
● Then, combine it with the salience of the candidate and the entropy of the candidate.
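The slides do not give the formula that combines the three terms. The sketch below computes the fraction of input-graph vertices shared with a candidate graph and, purely as a placeholder assumption, multiplies it by the candidate's salience and entropy. All names and numeric values are hypothetical.

```python
def vertex_overlap(input_vertices, candidate_vertices):
    """Fraction of the input graph's vertices that also appear in the candidate graph."""
    input_set, candidate_set = set(input_vertices), set(candidate_vertices)
    if not input_set:
        return 0.0
    return len(input_set & candidate_set) / len(input_set)

def candidate_score(input_vertices, candidate_vertices, salience, entropy):
    # Placeholder combination (assumption): a plain product of the three terms.
    return vertex_overlap(input_vertices, candidate_vertices) * salience * entropy

def best_match(input_vertices, candidates):
    """candidates maps a name to (candidate graph vertices, salience, entropy)."""
    return max(
        candidates,
        key=lambda name: candidate_score(input_vertices, *candidates[name]),
    )

# Hypothetical input graph and candidate graphs with made-up scores.
input_vertices = {"United_States", "Mexico", "Donald_Trump", "NAFTA"}
candidates = {
    "Berlin_Wall": ({"Germany", "Berlin", "Cold_War"}, 0.9, 0.8),
    "Mexico-United_States_barrier": ({"United_States", "Mexico", "Donald_Trump"}, 0.4, 0.8),
}
winner = best_match(input_vertices, candidates)
```

Even though the hypothetical salience of Berlin_Wall is higher, it shares no vertices with the input graph, so the candidate that is topologically close to the other entities in the text wins.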
Oracle PGX
[Figure: architecture stack with the PGX Shell and the Java/Python API on top of the PGX API, which sits on top of the PGX Engine]
● Java Interface
● PGQL (queries)
● Green-Marl (Algorithm DSL)
Leveraging Graphs
● Wikipedia pages are used to build a graph.
● We match the text to the Knowledge Base through its topological relations.
[Figure: graph linking the entities U.S., Trump, Mexico and NAFTA with the ambiguous mention “Wall”]
Candidates:
1. http://dbpedia.org/page/Defensive_wall
2. http://.../Berlin wall
3. http://.../The Wall (album)
4. http://.../Mexico-United_States_barrier

Exploiting large-scale graph analytics for unsupervised Entity Linking
