Orange restricted
DAGOBAH
An End-to-End Context-Free Tabular Data Semantic Annotation System
Yoan Chabot (Orange, @yoan_chabot)
Thomas Labbé (Orange, @tau_labbe)
Jixiong Liu (Orange, @yansera1)
Raphaël Troncy (EURECOM, @rtroncy)
DAGOBAH-IC 2020
Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
User question: In which city was "Our Happy Lives" filmed?
Engine answer: I don’t know
[Figure: knowledge-graph excerpt with the entities Rennes (Q647), Belfort (Q171545), commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424) and France (Q142), linked by the properties instance of (P31), country of origin (P495), country (P17) and narrative location (P840).]
Context & Goals
▪ Design a semantic engine able to query (semi-)structured data
User question: In which city was "Our Happy Lives" filmed?
Engine answer: In Belfort!
Table given to DAGOBAH:
Movie               Location
Our Happy Lives     Belfort
The French Kissers  Rennes
[Figure: the same knowledge-graph excerpt (Q647, Q171545, Q484170, Q745690, Q3344332, Q11424, Q142; P31, P495, P17), now showing two narrative location (P840) links.]
Tabular Data to Knowledge Graph Matching
Movie               Location
Our Happy Lives     Belfort
The French Kissers  Rennes
▪ CTA: Column-Type Annotation
▪ CEA: Cell-Entity Annotation
▪ CPA: Columns-Property Annotation
[Figure: the table aligned with the knowledge-graph excerpt: Rennes (Q647), Belfort (Q171545), commune of France (Q484170), The French Kissers (Q745690), Our Happy Lives (Q3344332), film (Q11424), France (Q142); properties instance of (P31), country of origin (P495), country (P17) and narrative location (P840), the latter labelled CPA.]
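The three annotation tasks can be pictured as three small mappings over the example table. The sketch below is only an illustration: the `TableAnnotations` container and the `to_triples` helper are hypothetical names, and turning a CTA class into `P31` statements for every annotated cell of the column is an assumption made for the example. The identifiers themselves are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TableAnnotations:
    """Hypothetical container for the three annotation kinds."""
    cta: Dict[int, str] = field(default_factory=dict)              # column -> class
    cea: Dict[Tuple[int, int], str] = field(default_factory=dict)  # (row, column) -> entity
    cpa: Dict[Tuple[int, int], str] = field(default_factory=dict)  # (subject col, object col) -> property


def to_triples(ann: TableAnnotations) -> List[Tuple[str, str, str]]:
    """Derive (subject, property, object) triples from the annotations."""
    triples = []
    # Assumption: each annotated cell is an instance of (P31) its column class.
    for (row, col), entity in sorted(ann.cea.items()):
        if col in ann.cta:
            triples.append((entity, "P31", ann.cta[col]))
    # A column-pair property links the entities found on the same row.
    for (subj_col, obj_col), prop in sorted(ann.cpa.items()):
        rows = sorted({row for (row, col) in ann.cea if col == subj_col})
        for row in rows:
            if (row, obj_col) in ann.cea:
                triples.append((ann.cea[(row, subj_col)], prop, ann.cea[(row, obj_col)]))
    return triples


# Rows: 0 = Our Happy Lives / Belfort, 1 = The French Kissers / Rennes.
annotations = TableAnnotations(
    cta={0: "Q11424", 1: "Q484170"},
    cea={(0, 0): "Q3344332", (0, 1): "Q171545",
         (1, 0): "Q745690", (1, 1): "Q647"},
    cpa={(0, 1): "P840"},
)
triples = to_triples(annotations)
```

With these annotations, `triples` contains `("Q3344332", "P840", "Q171545")`, which is the statement needed to answer the question about "Our Happy Lives".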
State of the Art
▪ Disambiguate cell values (CEA)
  ▪ 2 strategies:
    ▪ For each cell, look up the most probable entity [1][2]
    ▪ Jointly disambiguate each cell by considering the entire row [3]
  ▪ Entity matches can be made using:
    ▪ Syntactic comparisons [1][2]
    ▪ Alignment of ontologies [1][3]
    ▪ Word embeddings [2][3]
▪ Extract column type (CTA)
  ▪ Majority voting based on CEA outputs [4]
▪ Extract relationships between columns (CPA)
  ▪ Majority voting based on previously determined types and entities [5]
[1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347.
[2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics: Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000.
[3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities: From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277.
[4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1st International Workshop on Consuming Linked Data (COLD).
[5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358.
The DAGOBAH Approach
▪ 1st step: pre-processing to identify table characteristics (orientation, key column…)
▪ 2nd step: annotation workflows
  ▪ Method 1: Baseline lookups
  ▪ Method 2: Embedding approach
[Figure: Preprocessing feeds the annotation workflows (Baseline, Embedding).]
Challenges Requiring Pre-processing
Challenges:
• Table nature
• Table orientation
• Column header presence
• Key column identification
• Column type detection

Example table:
Lake                Area       Depth  County
Windermere          14,73 km²  66 m   Cumbria
Kielder Reservoir   10,86 km²  52 m   Northumberland
Ullswater           8,9 km²    63 m   Lake district
Bassenthwaite Lake  5,1 km²    21 m   Cumbria
Derwent Water       5,1 km²    22 m   Lake District

Pre-processing output:
• Relational table
• Horizontal
• Header: True, index = 0
• Key column: 0
• Primitive typing: [Object, Unit, Unit, Object]
Contribution: New Homogeneity Factor
$$
Hom(x) = \left[\frac{1}{len(x)} \sum_{t_i \in x} \left(1 - \left(1 - 2 * \frac{count(t_i)}{len(x)}\right)^{2}\right)\right]^{2}
$$

item 1         item 2         item 3         Hom
String_number  String_number  String_number  0
String_number  String_number  String_normal  0.89
String_number  String_normal  String_normal  0.89
String_normal  String_normal  String_normal  0

▪ We have 6 cell types:
• String_Normal (France)
• String_Datetime (2020-06-30)
• String_Uppercase (IC)
• String_Number (150 km)
• Number (454)
• Boolean (Yes)
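A minimal sketch of the two ingredients of this slide: assigning one of the six cell types, then computing the homogeneity factor over a list of types. The regular expressions in `classify` are assumptions chosen only to reproduce the slide's six examples; the slide does not give the actual rules. The sum is assumed to run over every cell of the list. The extracted formula carries an outer square, yet the mixed rows of the example table (0.89) match the value obtained without it (8/9), while the outer square yields about 0.79; the `outer_square` flag is therefore exposed rather than guessed.

```python
import re
from collections import Counter


def classify(cell: str) -> str:
    """Assign one of the six cell types (illustrative rules, not the system's)."""
    text = cell.strip()
    if text.lower() in {"yes", "no", "true", "false"}:
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+([.,]\d+)?", text):
        return "Number"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "String_Datetime"
    if re.search(r"\d", text):
        return "String_Number"
    if text.isupper():
        return "String_Uppercase"
    return "String_Normal"


def homogeneity(cell_types, outer_square=True):
    """Homogeneity factor of a list of cell types; 0 when all types agree."""
    n = len(cell_types)
    if n == 0:
        return 0.0
    counts = Counter(cell_types)
    # Each cell contributes 1 - (1 - 2 * count(type) / n)^2.
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in cell_types) / n
    return inner ** 2 if outer_square else inner


row_types = [classify(c) for c in ["14,73 km²", "66 m", "Cumbria"]]
```

Here `row_types` is `["String_Number", "String_Number", "String_Normal"]`, and a list whose items all share one type scores exactly 0 under either variant.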
Example: New homogeneity Factor
[Figure: pre-processing pipeline. Datatable corpus (CSV, TSV, HTML, …) → Converter → Table in WTC format → Table orientation, Header detection, Primitive typing (Object, Unit, Number, Date, Unknown), Key column detection → Pre-processed tables. Two algorithm labels appear in the figure: DWTC algorithm [1] and Content-based algorithm (homogeneity factor).]

Lake                Area           Depth          County          Hom. RH
Windermere          String_number  String_number  String unknown  0.89
Kielder Reservoir   String_number  String_number  String unknown  0.89
Ullswater           String_number  String_number  String unknown  0.89
Bassenthwaite Lake  String_number  String_number  String unknown  0.89
Derwent Water       String_number  String_number  String unknown  0.89
Hom. CH             0              0              0

$$
Hom(x) = \left[\frac{1}{len(x)} \sum_{t_i \in x} \left(1 - \left(1 - 2 * \frac{count(t_i)}{len(x)}\right)^{2}\right)\right]^{2}
$$

$$
\exists\, col \text{ where } Hom(col[0:3]) \neq 0 \;\rightarrow\; \textbf{Header = true}
$$

$$
Mean(CH) < Mean(RH) \;\rightarrow\; \textbf{Horizontal}
$$

[1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
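The two decision rules can be sketched on a grid of cell types as follows. This is an illustration under stated assumptions: `hom` follows the formula as extracted (including the outer square), `has_header` and `is_horizontal` are hypothetical names, and which cells enter the row and column means (for instance whether the header row or the key column is excluded) is not specified on the slide, so the example simply passes the table body without its header.

```python
from collections import Counter
from statistics import mean


def hom(types):
    """Homogeneity factor of a sequence of cell types (0 when all agree)."""
    n = len(types)
    if n == 0:
        return 0.0
    counts = Counter(types)
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return inner ** 2


def has_header(grid):
    """Header rule: some column is heterogeneous over its first three cells."""
    return any(hom(col[:3]) > 0 for col in zip(*grid))


def is_horizontal(body):
    """Orientation rule: columns are on average more homogeneous than rows."""
    col_mean = mean(hom(col) for col in zip(*body))
    row_mean = mean(hom(row) for row in body)
    return col_mean < row_mean


# Cell types of the lake table: one header row, then five data rows.
header = ["String_Normal"] * 4
data_row = ["String_Normal", "String_Number", "String_Number", "String_Normal"]
grid = [header] + [list(data_row) for _ in range(5)]

header_found = has_header(grid)
horizontal = is_horizontal(grid[1:])
```

On this grid the header row breaks the homogeneity of the Area and Depth columns, so `header_found` is true; once the header is removed every column is homogeneous while every row mixes two types, so `horizontal` is true as well.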
Evaluation: New homogeneity Factor
▪ Evaluation on SemTab 2019 Round 1 (64 tables)

$$
\text{Precision} = \frac{\text{correct predictions}}{\text{all predictions}}
$$

[Figure: precision of pre-processing tasks.]
SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
The DAGOBAH Approach
▪ 1st step: pre-processing to identify table characteristics (orientation, key column…)
▪ 2nd step: annotation workflows
  ▪ Method 1: Baseline lookups
  ▪ Method 2: Embedding approach
[Figure: Preprocessing feeds the annotation workflows (Baseline, Embedding).]
Baseline Lookups
[Figure: baseline workflow with the components Pre-processed tables, Ingestion (API server), Entities lookups (APIs, CirrusSearch, SPARQL), DBpedia entity URI & types, Types scoring, Type(s) selection, Entities disambiguation, CTA output and CEA output.]

Example input:
Lake               Area
Windermere         14,73 km²
Kielder Reservoir  10,86 km²

Example lookup results:
{title: "Q119936", label: "Windermere"},
{title: "Q390370", label: "Windermere"}
…

Example type information:
{"mainType": "populated place", "types": "settlement", "subTypes": ""}

Types scoring:

$$
S_{spec}(t) = \frac{count(t)}{sum(t_i)} * \log\frac{count(rows)}{count(t_i)}
$$

‒ Lookups for all table cells (4 external sources + 1 internal Wikidata ES)
‒ Wikidata as pivot metadata
‒ DBpedia translation (URI & types)
‒ TF-IDF-like types scoring
‒ Entities disambiguation with target type(s)
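The last two steps of the baseline, choosing a column type and then disambiguating each cell against it, can be sketched as below. Note the substitution: this sketch uses a plain majority vote (one vote per type per cell) rather than the TF-IDF-like score of the slide, whose exact terms are not recoverable from the extracted formula. The function names, the candidate identifiers (`E1` … `E5`) and their types are all hypothetical.

```python
from collections import Counter


def select_column_type(candidates_per_cell):
    """Majority vote: each cell votes once for every type among its candidates."""
    votes = Counter()
    for candidates in candidates_per_cell:
        cell_types = set()
        for candidate in candidates:
            cell_types.update(candidate["types"])
        votes.update(cell_types)
    if not votes:
        return None
    best = max(votes.values())
    # Alphabetical tie-break keeps the result deterministic.
    return sorted(t for t, v in votes.items() if v == best)[0]


def disambiguate(candidates_per_cell, target_type):
    """Keep, per cell, the first candidate carrying the target type."""
    chosen = []
    for candidates in candidates_per_cell:
        matching = [c for c in candidates if target_type in c["types"]]
        pick = matching[0] if matching else (candidates[0] if candidates else None)
        chosen.append(pick["id"] if pick else None)
    return chosen


# Hypothetical lookup results for three cells of a "Lake" column.
column = [
    [{"id": "E1", "types": {"lake"}}, {"id": "E2", "types": {"settlement"}}],
    [{"id": "E3", "types": {"lake", "reservoir"}}],
    [{"id": "E4", "types": {"settlement"}}, {"id": "E5", "types": {"lake"}}],
]
column_type = select_column_type(column)      # CTA-style output
entities = disambiguate(column, column_type)  # CEA-style output
```

In this toy column "lake" collects three votes against two for "settlement", so the ambiguous first and third cells resolve to `E1` and `E5`.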
Embedding Approach
[Figure: embedding workflow with the components Entities lookup, Lookup candidates, Embedding enrichment, Candidates clustering (lookup + table based hyperparameters), Clusters scoring, Candidates' types scoring (CTA output) and Candidates' entities scoring (CEA output). Embeddings: OpenKE [1].]

Example input:
Title       Director
Rushmore    Anderson
Fight Club  Fincher

Example enriched record for Q223687:
Id: ["Q223687"],
label: ["Wes Anderson"],
aliases: ["Wesley Wales Anderson"],
types: ["Q5", "dbPedia.Person"],
subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]

‒ Embedding enrichment through Wikidata ES server
‒ Regex + Levenshtein lookup
‒ K-means clustering over candidates' space
‒ Scoring algorithm to extract best cluster and deduce target type
‒ Candidates disambiguation from clusters, types and entities scores

[1] OpenKE TransE Wikidata Embeddings: http://139.129.163.161/index/toolkits#pretrained-wikidata
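The "Regex + Levenshtein lookup" step might look like the following sketch: a regular expression keeps the labels that contain the cell text as whole words, and the Levenshtein distance ranks the survivors. The normalisation rule, the whole-word filter and the function names are assumptions for illustration; the slide does not detail the actual lookup. The labels are the ones used in the deck's Wes Anderson example.

```python
import re


def normalize(label: str) -> str:
    """Lower-case and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def lookup(cell: str, labels):
    """Return labels containing the cell text as whole words, closest first."""
    query = normalize(cell)
    if not query:
        return []
    pattern = re.compile(r"\b" + re.escape(query) + r"\b")
    hits = [label for label in labels if pattern.search(normalize(label))]
    return sorted(hits, key=lambda label: levenshtein(query, normalize(label)))


labels = ["Wes Anderson", "Paul Thomas Anderson",
          "Paul W. S. Anderson", "David Fincher"]
candidates = lookup("Anderson", labels)
```

The lookup alone cannot decide between the three Andersons; it only orders them by string distance, which is why the later clustering and scoring steps are needed.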
Embedding Approach Example
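[Figure: worked example of the embedding approach, with the stages Clusters scoring ($S_k(\text{cluster\#2})$), Candidates scoring (CTA) ($S_c(\text{Q941209})$), Entities scoring (CEA) and Entities disambiguation.]
Entities disambiguation: $S_e(\text{Wes Anderson}) > S_e(\text{Paul Thomas Anderson}),\ S_e(\text{Paul W. S. Anderson})$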
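To make the clustering idea concrete, here is a toy sketch: candidates from all rows are clustered with k-means, and the cluster that covers the most table rows is kept. Everything specific here is an assumption: the 2-D vectors stand in for the TransE embeddings, the centroid initialisation (first k points) is chosen only for reproducibility, and "row coverage" is a hypothetical stand-in for the system's cluster score, which the slides do not spell out.

```python
import math
from collections import defaultdict


def kmeans(points, k, iterations=20):
    """Tiny k-means; centroids start at the first k points (assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return assignment


def best_cluster(candidates, k=2):
    """Return the candidate ids of the cluster covering the most rows."""
    points = [tuple(vector) for _, _, vector in candidates]
    assignment = kmeans(points, k)
    rows_per_cluster = defaultdict(set)
    for (row, _, _), cluster in zip(candidates, assignment):
        rows_per_cluster[cluster].add(row)
    winner = max(sorted(rows_per_cluster),
                 key=lambda c: len(rows_per_cluster[c]))
    return [cid for (_, cid, _), cluster in zip(candidates, assignment)
            if cluster == winner]


# (row index, hypothetical candidate id, toy 2-D embedding)
candidates = [
    (0, "A_director", (1.0, 1.0)),
    (0, "A_athlete", (5.0, 5.0)),
    (1, "B_director", (1.2, 0.9)),
    (1, "B_city", (9.0, 0.0)),
    (2, "C_director", (0.9, 1.1)),
]
selected = best_cluster(candidates)
```

The candidates that sit close together across rows (the three "directors") end up in the winning cluster, which mirrors how a shared column type can emerge without any prior type information.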
Evaluation Dataset - SemTab 2019
[Table: statistics of the datasets in each SemTab round.]
Table from: Jiménez-Ruiz et al. (2020). SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems
SemTab2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
Assessment Criteria
▪ T: all the columns targeted for annotation.
▪ P (perfect annotations): the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
▪ O (okay annotations): the super-classes of perfect classes (excluding very generic top classes like owl:Thing).
▪ W (wrong annotations): other annotations not in the ground truth.
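The CTA criteria reported below (AH and AP) combine the four quantities T, P, O and W. The formulas did not survive in the slide text; the sketch below uses them as recalled from the SemTab 2019 task description, AH = (|P| + 0.5·|O| − |W|) / |T| and AP = |P| / (|P| + |O| + |W|), and should be checked against the challenge material before being relied on. The function name and the example counts are made up.

```python
def cta_scores(perfect: int, okay: int, wrong: int, target_columns: int):
    """Return (AH, AP) for a CTA submission, assuming the SemTab 2019 definitions."""
    annotated = perfect + okay + wrong
    # AH rewards perfect classes, half-rewards okay ones, penalises wrong ones.
    ah = (perfect + 0.5 * okay - wrong) / target_columns if target_columns else 0.0
    # AP is the share of perfect classes among everything submitted.
    ap = perfect / annotated if annotated else 0.0
    return ah, ap


# Made-up counts: 8 perfect, 4 okay and 2 wrong annotations over 10 target columns.
ah, ap = cta_scores(perfect=8, okay=4, wrong=2, target_columns=10)
```

Because several classes may be submitted per column, AH can exceed 1 under these definitions, which is consistent with values such as 1.212 in the results.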
Results
▪ DAGOBAH results for Round 1:

Approach   CTA AH  CTA AP  CEA F1  CEA Precision  CPA F1  CPA Precision
Baseline   0.479   0.242   0.883   0.892          0.415   0.347
Embedding  1.212   0.336   0.841   0.853          -       -

▪ Baseline compared with MTab, Rounds 2 to 4:

Round    System    CTA AH  CTA AP  CEA F1  CEA Precision  CPA F1  CPA Precision
Round 2  Baseline  0.641   0.247   0.713   0.816          0.533   0.919
Round 2  MTab      1.414   0.276   0.911   0.911          0.881   0.929
Round 3  Baseline  0.745   0.161   0.725   0.745          0.519   0.826
Round 3  MTab      1.956   0.261   0.970   0.970          0.844   0.845
Round 4  Baseline  0.684   0.206   0.578   0.599          0.398   0.874
Round 4  MTab      2.012   0.300   0.983   0.983          0.832   0.832

✓ MTab is the winner of this challenge
✓ DAGOBAH is relatively behind MTab due to missing Wikidata–DBpedia type mappings
Conclusions
Approach: Baseline
▪ Pros:
  ▪ High coverage (multiple sources)
  ▪ Computational efficiency
▪ Cons:
  ▪ Lookup-services dependency (reliability)
  ▪ Blackbox (indexing, scoring…)
  ▪ Queries volume
Approach: Embedding
▪ Pros:
  ▪ Lookup strategy independence
  ▪ Relevant clustering even with little data
  ▪ Generalization (no tailored cleaning + fewer heuristics in lookups and scoring)
▪ Cons:
  ▪ Computational performance
  ▪ K optimization
  ▪ Embedding dependency

▪ New homogeneity factor that improves the pre-processing
▪ 2 approaches:
  ▪ Baseline composed of lookups and majority voting
  ▪ Clustering of embeddings
▪ Performance bottlenecks (due to the challenge context):
  ✓ Light data cleaning … on purpose
  ✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)
Future Work
✓ Test other Wikidata embedding methods (currently TransE)
✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
✓ Experiment with more clustering algorithms and parameters on different datasets
✓ Learn data table embeddings and find vector transformation(s) to the KG embedding space
✓ …
DAGOBAH
Datatable-powered Accurate-knowledge Graph
for Outstanding and Beautiful Answers to Humans
Twitter: @yansera1
Jixiong.liu@orange.com
Slides are available: https://www.slideshare.net/JixiongLIU/dagobahic2020orange

More Related Content

Similar to Dagobahic2020orange

Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
Ian Foster
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
Toshiyuki Shimono
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdf
ssuser034ce1
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
Olav Sandstå
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
Konstantinos Zagoris
 
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
Tobias Gärtner
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning Programming
PaulSombat
 
Piano rubyslava final
Piano rubyslava finalPiano rubyslava final
Piano rubyslava final
Roman Gavuliak
 
Grid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objectsGrid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objects
Yunsu Lee
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
Anastasija Nikiforova
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
Wim Godden
 
Armando Benitez -- Data x Desing
Armando Benitez -- Data x DesingArmando Benitez -- Data x Desing
Armando Benitez -- Data x Desing
Jorge Armando Benitez
 
Machine Learning Applications
Machine Learning ApplicationsMachine Learning Applications
Machine Learning Applications
SimplyInsight
 
L'ingénierie dans les nuages
L'ingénierie dans les nuagesL'ingénierie dans les nuages
L'ingénierie dans les nuages
Andrew Forward
 
[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊
台灣資料科學年會
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0
MapR Technologies
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
IRJET Journal
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
AmanBhalla14
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
RohanBorgalli
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
Chawanat Nakasan
 

Similar to Dagobahic2020orange (20)

Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
CS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdfCS-102 DS-class_01_02 Lectures Data .pdf
CS-102 DS-class_01_02 Lectures Data .pdf
 
MySQL Optimizer Overview
MySQL Optimizer OverviewMySQL Optimizer Overview
MySQL Optimizer Overview
 
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014)
 
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
(Semi-) Big Data Corpora: New Challanges and New Solutions for Corpus Linguists
 
AI Deeplearning Programming
AI Deeplearning ProgrammingAI Deeplearning Programming
AI Deeplearning Programming
 
Piano rubyslava final
Piano rubyslava finalPiano rubyslava final
Piano rubyslava final
 
Grid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objectsGrid based distributed in memory indexing for moving objects
Grid based distributed in memory indexing for moving objects
 
Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...Comparative analysis of national open data portals or whether your portal is ...
Comparative analysis of national open data portals or whether your portal is ...
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Armando Benitez -- Data x Desing
Armando Benitez -- Data x DesingArmando Benitez -- Data x Desing
Armando Benitez -- Data x Desing
 
Machine Learning Applications
Machine Learning ApplicationsMachine Learning Applications
Machine Learning Applications
 
L'ingénierie dans les nuages
L'ingénierie dans les nuagesL'ingénierie dans les nuages
L'ingénierie dans les nuages
 
[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 
Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
 

Recently uploaded

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

Dagobahic2020orange

  • 1. Orange restricted DAGOBAH An End-to-End Context-Free Tabular Data Semantic Annotation System Yoan Chabot Thomas Labbé Jixiong Liu Raphaël Troncy Orange Orange Orange EURECOM @yoan_chabot @rtroncy@tau_labbe @yansera1 DAGOBAH-IC 202001
  • 2. Context & Goals ▪ Design a semantic engine able to query (semi-)structured data DAGOBAH-IC 202002 I don’t know Q647 Rennes Q171545 Belfort Q484170 Commune of France Q745690 The French Kissers Q3344332 Our Happy Lives Q11424 film P31 instance of P31 instance of Q142 France P495 country of origin P17 country In which city was "Our Happy Lives" filmed? P840 narrative location
  • 3. Context & Goals ▪ Design a semantic engine able to query (semi-)structured data DAGOBAH-IC 202002 In which city was "Our happy lives" filmed? In Belfort! Q647 Rennes Q171545 Belfort Q484170 Commune of France Q745690 The French Kissers Q3344332 Our Happy Lives Q11424 film P31 instance of P31 instance of Q142 France P495 country of origin P17 country Movie Location Our Happy Lives Belfort The French Kissers Rennes P840 narrative location P840 narrative location DAGOBAH
  • 4. Movie Location Our Happy Lives Belfort The French Kissers Rennes Tabular Data to Knowledge Graph Matching DAGOBAH-IC 202003 CTA Column-Type Annotation CEA Cell-Entity Annotation CPA Columns-Property Annotation Q647 Rennes Q171545 Belfort Q484170 Commune of France Q745690 The French Kissers Q3344332 Our Happy Lives Q11424 film P31 instance of P31 instance of Q142 France P495 country of origin P17 country P840 narrative location P840 narrative location CPA
  • 5. State of the Art ▪ Disambiguate cell values (CEA) ▪ 2 Strategies ▪ For each cell, lookup for the most probable entity. [1] [2] ▪ Joint disambiguation of each cell considering the entire row. [3] ▪ Matches for entities can be made using: ▪ Syntactic comparisons [1][2] ▪ Alignment of ontologies [1][3] ▪ Word embeddings [2][3] ▪ Extract column type (CTA) ▪ Majority voting based on CEA outputs [4] ▪ Extract relationships between columns (CPA) ▪ Majority voting based on previously determined types and entities [5] [1] LIMAYE G., SARAWAGI S. & CHAKRABARTI S. (2010). Annotating and searching web tables using entities, types and relationships. In 36th International Conference on Very Large Data Bases (VLDB), p. 1338–1347. [2] FERNANDEZ R. C., MANSOUR E., QAHTAN A. A., ELMAGARMID A., ILYAS I., MADDEN S., OUZZANI M., STONEBRAKER M. & TANG N. (2018). Seeping semantics : Linking datasets using word embeddings for data discovery. In 34th International Conference on Data Engineering (ICDE), p. 989–1000. [3] EFTHYMIOU V., HASSANZADEH O., RODRIGUEZ-MURO M. & CHRISTOPHIDES V. (2017). Matching web tables with knowledge base entities : From entity lookups to entity embeddings. In 16th International Semantic Web Conference (ISWC), p. 260–277. [4] MULWAD V., FININ T., SYED Z. & JOSHI A. (2010). Using linked data to interpret tables. In 1 st International Workshop on Consuming Linked Data (COLD). [5] RAN C., SHEN W., WANG J. & ZHU X. (2016). Domain-specific knowledge base enrichment using wikipedia tables. In IEEE International Conference on Data Mining (ICDM), p. 349–358. DAGOBAH-IC 202004
  • 6. The DAGOBAH Approach ▪ 1st step: pre-processing to identify tables characteristics (orientation, key-column…) ▪ 2nd step: annotations workflows ▪ Method 1: Baseline lookups ▪ Method 2: Embedding approach Preprocessing Embedding Baseline Annotations workflows DAGOBAH-IC 202005
  • 7. Challenges Requiring Pre-processing Pre-processing • Relational table • Horizontal • Header: True, index = 0 • Key column: 0 • Primitive Typing: [Object, Unit, Unit, Object] Lake Area Depth County Windermere 14,73 km² 66 m Cumbria Kielder Reservoir 10,86 km² 52 m Northumberland Ullswater 8,9 km² 63 m Lake district Bassenthwaite Lake 5,1 km² 21 m Cumbria Derwent Water 5,1 km² 22 m Lake District DAGOBAH-IC 202006 Challenges: • Table nature • Table orientation • Column header presence • Key column identification • Column type detection
  • 8. Contribution: New Homogeneity Factor 𝐻𝑜𝑚 𝑥 = [ 1 𝑙𝑒𝑛(𝑥) ෍ 𝑡 𝑖∈ 𝑥 (1 − 1 − 2 ∗ 𝑐𝑜𝑢𝑛𝑡 𝑡𝑖 𝑙𝑒𝑛 𝑥 2 )] 2 item 1 item 2 item 3 Hom String_number String_number String_number 0 String_number String_number String_normal 0.89 String_number String_normal String_normal 0.89 String_normal String_normal String_normal 0 DAGOBAH-IC 202007 • String_Normal (France) • String_Datetime (2020-06-30) • String_Uppercase (IC) • String_Number (150 km) • Number (454) • Boolean (Yes) ▪ We have 6 cell types:
  • 9. Example: New Homogeneity Factor
    [Diagram: Datatable corpus (CSV, TSV, HTML, …); Converter; Table in WTC format; Table orientation; Header detection; Primitive typing (Object, Unit, Number, Date, Unknown); Key column detection; Pre-processed tables. Method labels: DWTC algorithm [1]; Content-based algorithm (homogeneity factor)]
    Cell types and homogeneity per row (Hom. RH) and per column (Hom. CH):
      Lake | Area | Depth | Country | Hom. RH
      Windermere | String_number | String_number | String unknown | 0.89
      Kielder Reservoir | String_number | String_number | String unknown | 0.89
      Ullswater | String_number | String_number | String unknown | 0.89
      Bassenthwaite Lake | String_number | String_number | String unknown | 0.89
      Derwent Water | String_number | String_number | String unknown | 0.89
      Hom. CH | 0 | 0 | 0 |
    $\mathrm{Hom}(x) = \left[ \frac{1}{\mathrm{len}(x)} \sum_{t_i \in x} \left( 1 - \left( 1 - 2 \cdot \frac{\mathrm{count}(t_i)}{\mathrm{len}(x)} \right)^{2} \right) \right]^{2}$
    $\exists\, col \text{ where } \mathrm{Hom}(col[0{:}3]) \neq 0 \;\rightarrow\; \textbf{Header} = \textbf{true}$
    $\mathrm{Mean}(CH) < \mathrm{Mean}(RH) \;\rightarrow\; \textbf{Horizontal}$
    [1] https://subversion.assembla.com/svn/commondata/WDCFramework/tags/1/0/3/
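    The two decision rules on this slide (a header is present if some column has a non-zero homogeneity factor over its first three cells; the table is horizontal if the mean column homogeneity is below the mean row homogeneity) can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, `col[0:3]` is read as a Python slice over the first three cells, and the header row's cell types are invented for the example.

```python
from collections import Counter
from statistics import mean


def hom(types):
    """Homogeneity factor as written on the slide (see the formula above)."""
    n = len(types)
    counts = Counter(types)
    inner = sum(1 - (1 - 2 * counts[t] / n) ** 2 for t in types) / n
    return inner ** 2


def has_header(rows):
    """True if at least one column mixes types within its first three cells."""
    columns = list(zip(*rows))
    return any(hom(col[:3]) != 0 for col in columns)


def is_horizontal(rows):
    """True if columns are, on average, more homogeneous than rows."""
    columns = list(zip(*rows))
    mean_ch = mean(hom(col) for col in columns)
    mean_rh = mean(hom(row) for row in rows)
    return mean_ch < mean_rh


# Typed body of the lake table (Area, Depth, Country), as shown on the slide
body = [["String_number", "String_number", "String unknown"] for _ in range(5)]
# Assumed header types: plain strings in every column
header = ["String_normal", "String_normal", "String_normal"]

print(has_header([header] + body))  # header cells differ from the cells below
print(is_horizontal(body))          # every column is uniform, every row is mixed
```

    Both rules only compare the factor against zero or against another mean, so they do not depend on the exact scale of Hom.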
  • 10. Evaluation: New Homogeneity Factor
    ▪ Evaluation on SemTab 2019 Round 1 (64 tables)
    ▪ $\text{Precision} = \frac{\text{correct predictions}}{\text{all predictions}}$
    [Chart: Precision of pre-processing tasks]
    SemTab 2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
  • 11. The DAGOBAH Approach
    ▪ 1st step: pre-processing to identify table characteristics (orientation, key column…)
    ▪ 2nd step: annotation workflows
      ▪ Method 1: Baseline lookups
      ▪ Method 2: Embedding approach
    [Diagram: Preprocessing, followed by the annotation workflows (Baseline, Embedding)]
  • 12. Baseline Lookups
    [Diagram labels: Pre-processed tables; API server; Ingestion; APIs; CirrusSearch API; Entities lookups; SPARQL; DBpedia entity URI & types; Types scoring; Type(s) selection; Entities disambiguation; CTA output; CEA output]
    Example input:
      Lake | Area
      Windermere | 14,73 km²
      Kielder Reservoir | 10,86 km²
    Example lookup results: {title: "Q119936", label: "Windermere"}, {title: "Q390370", label: "Windermere"} …
    Example types: {"mainType": "populated place", "types": "settlement", "subTypes": ""}
    Type scoring:
      $S_{spec}(t) = \frac{\mathrm{count}(t)}{\mathrm{sum}(t_i)} \cdot \log \frac{\mathrm{count}(rows)}{\mathrm{count}(t_i)}$
    Steps:
      ‒ Lookups from all table cells (4 external sources + 1 internal Wikidata ES)
      ‒ Wikidata as pivot metadata
      ‒ DBpedia translation (URI & types)
      ‒ TF-IDF-like types scoring
      ‒ Entities disambiguation with target type(s)
  • 13. Embedding Approach
    [Diagram labels: Entities lookup; Lookup candidates; Embedding enrichment; Embedding (OpenKE [1]); Candidates clustering; Lookup + table-based hyperparameters; Clusters scoring; Candidates' types scoring; CTA output; Candidates' entities scoring; CEA output]
    Example input:
      Title | Director
      Rushmore | Anderson
      Fight Club | Fincher
    Example enriched entity (Q223687): Id: ["Q223687"], label: ["Wes Anderson"], aliases: ["Wesley Wales Anderson"], types: ["Q5", "dbPedia.Person"], subTypes: ["dbPedia.Director", "Q2526255", "Q36180"]
    Steps:
      ‒ Embedding enrichment through Wikidata ES server
      ‒ Regex + Levenshtein lookup
      ‒ K-means clustering over candidates' space
      ‒ Scoring algorithm to extract best cluster and deduce target type
      ‒ Candidates disambiguation from clusters, types and entities scores
    [1] OpenKE TransE Wikidata Embeddings: http://139.129.163.161/index/toolkits#pretrained-wikidata
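    The Levenshtein part of the lookup step can be illustrated with a small sketch. It is not the DAGOBAH lookup itself: the function names are hypothetical, the regex filtering is left out, and the candidate labels are the ones used in the example slide of this deck.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(cell, labels):
    """Sort candidate labels by case-insensitive edit distance to the cell text."""
    return sorted(labels, key=lambda label: levenshtein(cell.lower(), label.lower()))


labels = ["Paul Thomas Anderson", "Paul W. S. Anderson", "Wes Anderson"]
print(rank_candidates("Anderson", labels))
```

    String distance alone only orders the candidates; for a cell such as "Anderson" several directors remain plausible, which is why the slide follows the lookup with clustering and scoring.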
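  • 14. Embedding Approach Example
    ▪ Clusters scoring: $S_k(\text{cluster\#2})$
    ▪ Candidates scoring (CTA): $S_c(\text{Q941209})$
    ▪ Entities scoring (CEA): $S_e(\text{Wes Anderson})$, $S_e(\text{Paul Thomas Anderson})$, $S_e(\text{Paul W. S. Anderson})$
    ▪ Entities disambiguation: $S_e(\text{Wes Anderson}) > S_e(\text{Paul Thomas Anderson}),\ S_e(\text{Paul W. S. Anderson})$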
  • 15. Evaluation Dataset: SemTab 2019
    [Table: Statistics of the datasets in each SemTab round]
    Table from: Jiménez-Ruiz et al. (2020). SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems.
    SemTab 2019: https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
  • 16. Assessment Criteria
    ▪ T: all the columns to annotate.
    ▪ P: perfect annotations, i.e. the most fine-grained classes in the (ontology) hierarchy that also appear in the ground truth.
    ▪ O: okay annotations, i.e. super-classes of the perfect classes (excluding very generic top classes such as owl:Thing).
    ▪ W: wrong annotations, i.e. other annotations not in the ground truth.
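    The slide lists the sets T, P, O and W but not how they are combined into the AH and AP scores reported in the results. The sketch below assumes the SemTab 2019 definitions, AH = (|P| + 0.5·|O| − |W|) / |T| and AP = (|P| + 0.5·|O|) / (|P| + |O| + |W|); these formulas are not on the slide and should be checked against the challenge description. The function names and the example counts are hypothetical.

```python
def ah_score(perfect, okay, wrong, targets):
    """Average Hierarchical score: rewards P, half-rewards O, penalises W."""
    return (perfect + 0.5 * okay - wrong) / targets


def ap_score(perfect, okay, wrong):
    """Average Perfect score: weighted share of useful annotations among all submitted."""
    submitted = perfect + okay + wrong
    if submitted == 0:
        return 0.0
    return (perfect + 0.5 * okay) / submitted


# Hypothetical counts: 4 target columns, 3 perfect, 2 okay and 1 wrong annotation
print(ah_score(perfect=3, okay=2, wrong=1, targets=4))
print(round(ap_score(perfect=3, okay=2, wrong=1), 3))
```

    Under these definitions AH is normalised by the number of target columns rather than by the number of submitted annotations, so it is not bounded by 1 when several annotations are returned per column.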
  • 17. Results
    ▪ DAGOBAH result for Round 1:
      Approach | CTA AH | CTA AP | CEA F1 | CEA Precision | CPA F1 | CPA Precision
      Baseline | 0.479 | 0.242 | 0.883 | 0.892 | 0.415 | 0.347
      Embedding | 1.212 | 0.336 | 0.841 | 0.853 | - | -
    ▪ Rounds 2 to 4 (Baseline vs. MTab):
      Round | System | CTA AH | CTA AP | CEA F1 | CEA Precision | CPA F1 | CPA Precision
      Round 2 | Baseline | 0.641 | 0.247 | 0.713 | 0.816 | 0.533 | 0.919
      Round 2 | MTab | 1.414 | 0.276 | 0.911 | 0.911 | 0.881 | 0.929
      Round 3 | Baseline | 0.745 | 0.161 | 0.725 | 0.745 | 0.519 | 0.826
      Round 3 | MTab | 1.956 | 0.261 | 0.970 | 0.970 | 0.844 | 0.845
      Round 4 | Baseline | 0.684 | 0.206 | 0.578 | 0.599 | 0.398 | 0.874
      Round 4 | MTab | 2.012 | 0.300 | 0.983 | 0.983 | 0.832 | 0.832
    ✓ MTab is the winner of this challenge
    ✓ DAGOBAH is relatively behind MTab due to missing Wikidata-DBpedia type mappings
  • 18. Conclusions
    ▪ New homogeneity factor that improves the pre-processing
    ▪ 2 approaches:
      ▪ Baseline composed of lookups and majority voting
      ▪ Clustering of embeddings
    ▪ Performance bottlenecks (due to the challenge context):
      ✓ Light data cleaning … on purpose
      ✓ Basic lookup strategies … on purpose (e.g. no use of dictionary)
    Baseline
      Pros:
        ▪ High coverage (multiple sources)
        ▪ Computational efficiency
      Cons:
        ▪ Lookup-services dependency (reliability)
        ▪ Blackbox (indexing, scoring…)
        ▪ Queries volume
    Embedding
      Pros:
        ▪ Lookup strategy independence
        ▪ Relevant clustering even with few data
        ▪ Generalization (no tailored cleaning + fewer heuristics in lookups and scoring)
      Cons:
        ▪ Computational performance
        ▪ K optimization
        ▪ Embedding dependency
  • 19. Future Work
    ✓ Test other Wikidata embedding methods (currently TransE)
    ✓ Compute joint embeddings with Wikipedia/DBpedia to enhance coverage
    ✓ Experiment with more clustering algorithms and parameters on different datasets
    ✓ Learn data table embeddings and find vector transformation(s) to the KG embedding space
    ✓ …
  • 20. DAGOBAH
    Datatable-powered Accurate-knowledge Graph for Outstanding and Beautiful Answers to Humans
    Twitter: @yansera1
    Jixiong.liu@orange.com
    Slides are available: https://www.slideshare.net/JixiongLIU/dagobahic2020orange