Semantic Data Enrichment: a Human-in-the-Loop Perspective

Matteo Palmonari matteo.palmonari@unimib.it
INSID&S Lab, Department of Informatics, Systems and Communication
Università degli Studi di Milano-Bicocca
Seminar at INRIA - Sophia Antipolis, July 20th, 2023

DATA SEMANTICS @ DATA SCIENCE - UNIMIB
About me/this seminar…
● Associate Prof. at University of Milano-Bicocca
  ○ INSID&S Lab: 4 faculty / 2 assistant prof. / 4 PhD students (now)
● Covered quite a broad spectrum of topics
  ○ AI / Data Integration >> Knowledge Graphs (KGs)
  ○ Representation Learning & NLP to track the evolution and to compare distributional representations >> Computational Social Science (CSS)
● Which topic for this talk?
  ○ Human-in-the-loop (HITL) semantic data enrichment >> broad topic driving specific work; should match WIMMICS (NLP and KG)
  ○ More in-depth presentation of recent work and CSS-related work >> Manuel Vimercati

Overview
● Semantic Data Integration, Annotations and Data Enrichment
● Semantic Enrichment of Tabular Data
● HITL Tabular Data Enrichment
● Towards HITL Textual Data Enrichment
● Conclusions
** Slides contain excerpts of content created by former/current PhD students Vincenzo Cutrona and Riccardo Pozzi

ARTIFICIAL
INTELLIGENCE
@UNIMIB
1) SEMANTIC DATA INTEGRATION, ANNOTATIONS, AND DATA ENRICHMENT

Semantic Data Integration
Company data from .data.gouv.fr
https://annuaire-entreprises.data.gouv.fr/entreprise/sienna-real-estate-holding-france-492220553
Person
Country
Org.
Foundations firms 'offshore' customers through banks in Wikipedia
Background
KG
Entities in OffshoreLeaks linked to France
https://offshoreleaks.icij.org/search?c=FRA&cat=0
Inspiration for this example: [Knoblock&Szekely 2015], ICIJ + Neo4J work for Panama Papers

Company data from
.data.gouv.fr
https://annuaire-
Person
Country
Org.
Named Entity Recognition
Annotations: named entities
Background
KG

Company data from
.data.gouv.fr
https://annuaire-
Person
Country
Org.
Named Entity Recognition (NER)
Annotations: named entities
Background
KG

Company data from
.data.gouv.fr
https://annuaire-
Person
Country
Org.
Annotations: data linking
Named Entity Recognition Named Entity Linking (NEL)
Background
KG

Company data from
.data.gouv.fr
https://annuaire-
Person
Country
Org.
Annotations: data linking
Named Entity Recognition Named Entity Linking
… Sienna …
Clustering
NIL Prediction
=
=
Background
KG

Techniques from Established Research Fields
● Texts
  ○ Annotation / Information Extraction
    ● NER and NEL: huge body of work
      ○ Recent work on NEL: BLINK [Wu&al.EMNLP20] and GENRE [DeCao&al.TACL22] …
    ● NIL prediction and Clustering: ~less investigated
      ○ Increased interest in the last 2 years
      ○ [Agarwal&al.NAACL22], [Kassner&al.ACL22], [Heist&Paulheim ESWC23]
● Tabular data
  ○ Annotation / Semantic Table Interpretation
    ● More details in this presentation
    ● Survey: [Liu&al.JWA22], R. Troncy and P. Monnin are co-authors :)

Semantic Data Enrichment and KG Construction
A shift in perspective:
• Users are interested in their content
• Background KGs useful to
  • support integration
  • extend their content with additional data
• The construction of a KG can be a byproduct

Downstream Applications of Data Enrichment
Query Answering | Semantic Search & Data Exploration | “Traditional” ML & Data Analytics | Analyses with Representation Learning
• Criminal investigations [SDSM20]
• Exploring data-contexts to contextualize news articles [ISWCdemo15, ESWC17]
• Enrichment and analysis of social media [EACLdemo17]
• Weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]
• Text-based entity embeddings and time-aware entity similarity [ISWC18]
• Entity evolution (+ with CADE alignment [AAAI19])
Main data: Documents; Tabular data; Documents
Main projects:
• Enabling Data Enrichment Pipelines for AI-driven Business Products and Services (HORIZON-CL4-2021-DATA-01-03), D4.1: Business Cases Requirements Analysis & Specifications, Work Package 4
• Elicitation of the information needs of magistrates within the semantic search and seriality system
Contributions: applications and novel analytical methods …
Applications and analytical methods
This talk

Several Examples from Past/Ongoing Projects
Domain | Value | Enrichment Data Sources | Data
eCommerce | Predict impact of events on customer searches | Events, weather | Tabular
Retail | Workforce/budget optimization | Events, weather | Tabular
CRM | Workforce optimization | Events, weather | Tabular
IOT | Customer flow analysis | Events, weather | Tabular
Digital Marketing | Ad impression prediction for campaign optimization | Weather | Tabular
Digital Marketing | Ad impression prediction for campaign optimization | Events | Tabular
Manufacturing | AI-based analytics on welding robot data (tables and user manuals) | Proprietary ~KG | Tabular, Texts
Manufacturing | Troubleshooting and repair based on service manuals, records, log data | Proprietary ~KG | Tabular, Texts
Open data | Construction and maintenance of a European dataset of organizations in procurement from tenders | Proprietary ~KG, Wikidata, Crunch Base | Tabular, Texts
Observatory on AI | Construction and maintenance of a KG to track AI-related innovations from different data sources | Crunch Base, WikiData | Tabular, Texts
Business analysis | Cost-effective enrichment of client datasets with proprietary company KG | Proprietary KG | Tabular
THIS PROJECT HAS RECEIVED FUNDING FROM THE EUROPEAN UNION'S HORIZON EUROPE RESEARCH AND INNOVATION PROGRAMME UNDER GRANT AGREEMENT NO 101070284.
Enabling Data Enrichment Pipeline… AI-driven Business Products and Se…
D4.1: Business Cases Requirements Analysis & Sp…
Work Package 4
Type of document: Report
Dissemination level: SEN - Sensitive
Lead beneficiary: JOT
Authors: Fernando Perales and Cynthia Parrondo (JOT), Cuong Xuan Chu and Evgeny Kharlamov (BOS), Qi Gao (PHI), Alex Young and Ian Makgill (SN), Luis Rei and Besher Massri (JSI), Tao Song (BGRIMM)
Version: 1.0
Due Date of document: 30/06/2023
Delivery Date of document: 30/06/2023

2) SEMANTIC ENRICHMENT OF TABULAR DATA
“Traditional” ML & Data Analytics (tabular data)
• Weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]

Semantically-Enabled Optimization of Digital Marketing Campaigns
Vincenzo Cutrona1, Flavio De Paoli1, Aljaž Košmerlj2, Nikolay Nikolov3, Matteo Palmonari1, Fernando Perales4, and Dumitru Roman3
1 University of Milano - Bicocca, 2 Josef Stefan Institute, 3 SINTEF DIGITAL, 4 JOT Internet Media

Vincenzo Cutrona - Ph.D. Presentation - 25/05/2021
Weather-based Campaign Scheduler
New services for campaign optimization:
● Main service: weather-based campaign scheduler
  ○ Predict the best dates to launch the campaign with weather-sensitive keywords
  ○ in the upcoming week
  ○ for each region
● + additional services
● Why do we focus on data enrichment?
  ○ 80% of the time in data analysis projects is spent cleaning and enriching the data*
Input data (KEYWORD, #im, REGION, Date) + Additional data (geoId, C°/+0, C°/+1) = Target data:
KEYWORD | #im | REGION | Date | geoId | C°/+0 | C°/+1
194906 | 64 | Thuringia | 2017-03-11 | gn:2822542 | 18 | 20
517827 | 50 | Bavaria | 2017-03-12 | gn:2951839 | 17 | 19
459143 | 42 | Berlin | 2017-03-12 | gn:2950157 | 17 | 20
Target data >> ML model >> Business service
* Worldwide Semiannual Big Data and Analytics Spending Guide from International Data Corporation (IDC)

Data Enrichment: Digital Marketing Example
194906 | 64 | Thuringia | 11/03/2017
517827 | 50 | Bavaria | 12/03/2017
459143 | 42 | Berlin | 12/03/2017
Weather Service, keyed by regionID (GeoNames) and date (ISO 8601):
city: 2950157
- date: 2017-03-12
  2t: 17
- date: 2017-03-13
  2t: 20
DIFFERENT systems of identifiers

STEP 1: VALUE MANIPULATION
Before:
194906 | 64 | Thuringia | 11/03/2017
517827 | 50 | Bavaria | 12/03/2017
459143 | 42 | Berlin | 12/03/2017
After:
194906 | 64 | Thuringia | 2017-03-11
517827 | 50 | Bavaria | 2017-03-12
459143 | 42 | Berlin | 2017-03-12

STEP 2: LINKING
The region, not the city
Before:
194906 | 64 | Thuringia | 2017-03-11
517827 | 50 | Bavaria | 2017-03-12
459143 | 42 | Berlin | 2017-03-12
After (new column geoId):
194906 | 64 | Thuringia | 2017-03-11 | gn:2822542
517827 | 50 | Bavaria | 2017-03-12 | gn:2951839
459143 | 42 | Berlin | 2017-03-12 | gn:2950157

STEP 3: EXTENSION
Before (with geoId):
194906 | 64 | Thuringia | 2017-03-11 | gn:2822542
517827 | 50 | Bavaria | 2017-03-12 | gn:2951839
459143 | 42 | Berlin | 2017-03-12 | gn:2950157
Weather Service, keyed by cityID (GeoNames) and date (ISO 8601):
city: 2950157
- date: 2017-03-12
  2t: 17
- date: 2017-03-13
  2t: 20
EQUAL systems of identifiers
After (new columns C°/+0 and C°/+1):
194906 | 64 | Thuringia | 2017-03-11 | gn:2822542 | 18 | 20
517827 | 50 | Bavaria | 2017-03-12 | gn:2951839 | 17 | 19
459143 | 42 | Berlin | 2017-03-12 | gn:2950157 | 17 | 20
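The three steps can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries `GEONAMES_REGIONS` and `WEATHER` are hypothetical in-memory stand-ins for a linking service and a weather service, filled with the values shown on the slides, and the function and column names are invented for the example.

```python
from datetime import datetime

# Source rows as on the slides: keyword id, impressions, region name, date (DD/MM/YYYY).
rows = [
    {"keyword": "194906", "impressions": 64, "region": "Thuringia", "date": "11/03/2017"},
    {"keyword": "517827", "impressions": 50, "region": "Bavaria", "date": "12/03/2017"},
    {"keyword": "459143", "impressions": 42, "region": "Berlin", "date": "12/03/2017"},
]

# Hypothetical stand-in for a linking service: region label -> GeoNames identifier.
GEONAMES_REGIONS = {
    "Thuringia": "gn:2822542",
    "Bavaria": "gn:2951839",
    "Berlin": "gn:2950157",
}

# Hypothetical stand-in for the weather service, keyed by (GeoNames id, ISO date).
# Values are the C°/+0 and C°/+1 columns from the slides.
WEATHER = {
    ("gn:2822542", "2017-03-11"): (18, 20),
    ("gn:2951839", "2017-03-12"): (17, 19),
    ("gn:2950157", "2017-03-12"): (17, 20),
}


def manipulate(row):
    """Step 1, value manipulation: rewrite the date from DD/MM/YYYY to ISO 8601."""
    iso = datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat()
    return {**row, "date": iso}


def link(row):
    """Step 2, linking: add the identifier of the region (None if no link is found)."""
    return {**row, "geoId": GEONAMES_REGIONS.get(row["region"])}


def extend(row):
    """Step 3, extension: join with the weather data on the now shared identifiers."""
    t0, t1 = WEATHER.get((row["geoId"], row["date"]), (None, None))
    return {**row, "temp_day0": t0, "temp_day1": t1}


def enrich(rows):
    """Apply the three steps in order to every row."""
    return [extend(link(manipulate(r))) for r in rows]


enriched = enrich(rows)
```

The extension step only works because steps 1 and 2 have already aligned the two systems of identifiers (ISO dates and GeoNames ids); run on the original rows, the lookup in `WEATHER` would find nothing.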

Semantic Data Enrichment: Problem Statement
● Inputs:
  ○ a source dataset
  ○ a pool of reference data sources
● Output:
  ○ the source dataset extended with modified/additional columns
Data Enrichment: a path on the data transformations graph GT
Semantic Data Enrichment: at least one node is linking
[Figure: source >> Value manipulation, Linking, Extension >> output; inputs from external data sources and reference KGs]
• Large data volumes
• Unknown or little-known, large and complex data sources
• Intrinsic uncertainty

Annotations from algorithms
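The problem statement can be made concrete with a small sketch: an enrichment is a sequence (path) of transformations, and it counts as semantic when at least one of them is a linking step. The `Transformation` class, its `kind` labels and the two example steps are assumptions made for illustration, not part of the framework presented in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Transformation:
    name: str
    kind: str  # one of "value_manipulation", "linking", "extension"
    apply: Callable[[Row], Row]


def is_semantic(path: List[Transformation]) -> bool:
    """A data enrichment path is semantic if at least one node is a linking step."""
    return any(t.kind == "linking" for t in path)


def run(path: List[Transformation], rows: List[Row]) -> List[Row]:
    """Execute the path: each transformation is applied to every row, in order."""
    for t in path:
        rows = [t.apply(r) for r in rows]
    return rows


# Hypothetical two-step path: normalise the label, then link it to GeoNames.
path = [
    Transformation(
        "normalise region case",
        "value_manipulation",
        lambda r: {**r, "region": str(r["region"]).title()},
    ),
    Transformation(
        "link region",
        "linking",
        lambda r: {**r, "geoId": {"Berlin": "gn:2950157"}.get(r["region"])},
    ),
]

result = run(path, [{"region": "berlin"}])
```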

2.A) TABULAR DATA ANNOTATION ALGORITHMS: SEMANTIC TABLE INTERPRETATION

Semantic Table Interpretation
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column of the table is associated with one or more types (CTA)
● each cell in the table is annotated with the entity in the catalog (CEA)
● each pair of columns is annotated with a binary relation in the catalog (CPA)
Name | Coordinates | Height | Range
Le Mont Blanc | 45°49′57″N 06°51′52″E | 4808 | M. Blanc massif
Hohtälli | 45°98’96″N 07°80’25″E | 3275 | Pennine Alps
Monte Cervino | 45°58′35″N 07°39′31″E | 4478 | Pennine Alps
[Figure: KNOWLEDGE GRAPH. Schema level: Natural Place, Mountain, Mountain Range, xsd:integer, xsd:string, georss:point, dbo:elevation, dbo:mountainRange, … Entity level: Mont_Blanc, MontBlanc Massif, 4808 (dbo:elevation)]
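One possible way to hold the three kinds of annotation in code is sketched below. The class `TableAnnotation`, the helper `unlinked_cells` and the entity identifiers (for example `Mont_Blanc_Massif`) are illustrative assumptions, loosely following the labels in the figure; they are not the data model of any specific STI tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TableAnnotation:
    # CTA: column index -> one or more KG types
    cta: Dict[int, List[str]] = field(default_factory=dict)
    # CEA: (row index, column index) -> KG entity, or None when not linked
    cea: Dict[Tuple[int, int], Optional[str]] = field(default_factory=dict)
    # CPA: (subject column, object column) -> KG predicate
    cpa: Dict[Tuple[int, int], str] = field(default_factory=dict)


def unlinked_cells(ann: TableAnnotation, n_rows: int, entity_cols: List[int]):
    """Cells of the entity columns that still have no entity annotation."""
    return [
        (i, j)
        for i in range(n_rows)
        for j in entity_cols
        if ann.cea.get((i, j)) is None
    ]


rows = [
    ["Le Mont Blanc", "45°49′57″N 06°51′52″E", "4808", "M. Blanc massif"],
    ["Hohtälli", "45°98’96″N 07°80’25″E", "3275", "Pennine Alps"],
]

ann = TableAnnotation()
ann.cta.update({0: ["Mountain"], 2: ["xsd:integer"], 3: ["Mountain Range"]})
ann.cea.update({(0, 0): "Mont_Blanc", (0, 3): "Mont_Blanc_Massif", (1, 3): "Pennine_Alps"})
ann.cpa.update({(0, 1): "georss:point", (0, 2): "dbo:elevation", (0, 3): "dbo:mountainRange"})

# The Hohtälli cell has no entity yet, so it is reported as pending.
pending = unlinked_cells(ann, len(rows), entity_cols=[0, 3])
```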

● each column is associated with one or more KG-types (CTA)
● each cell in the table is annotated with the entity in the catalog (CEA)
Le Mont Blanc | 45°49′57″N 06°51′52″E | 4808 | M. Blanc massif
Column types: Mountain | xsd:string | xsd:integer | Mountain Range
[Figure: the same knowledge graph, schema level and entity level]

● each cell in “entity columns” is annotated with a KG-entity (CEA)
Mont Blanc | 45°49′57″N 06°51′52″E | 4808 | Mont Blanc massif
Cell annotations: Mont_Blanc; MontBlanc Massif
[Figure: the same knowledge graph, schema level and entity level]

Column roles: Subject column; Named-Entity column; Literal column
Also referred to as “entity linking” (for tables)
[Figure: the same knowledge graph, schema level and entity level, with the cell annotations Mont_Blanc and MontBlanc Massif]

● some pairs of columns are annotated with a binary KG-predicate (CPA)
Column-pair annotations: dbo:mountainRange; dbo:elevation; georss:point
[Figure: the same knowledge graph, schema level and entity level, with the cell annotations Mont_Blanc and MontBlanc Massif]

… for KG completion
[Figure: the annotated row adds statements to the knowledge graph for Mont_Blanc via dbo:mountainRange, dbo:elevation and georss:point (45°49′57″N 06°51′52″E)]

● each cell in “entity columns” is annotated with a KG-entity or with NIL (if not in the KG)
… with novel entities
[Figure: the knowledge graph extended with the nodes Pennine Alps, Monte Cervino and [NIL: Hohtälli]]

INSID&S Contributions
● Entity linking in tables
  ○ Soft filters to filter candidate entities based on type embedding similarity [SEMANTICS’21]
  ○ LamAPI: supporting indexing and matching [OM@ISWC’22]
● End-to-end STI
  ○ s-elBat: dealing with messy tables [SemTab@ISWC’22]*
  ○ MantisTable [Fut.Gen.Internet’20, SemTab@ISWC’19-21]*
● Evaluation & datasets
  ○ Tough Tables: misspelling and noisy labels [ISWC’20]
  ○ MammoTab: large dataset of annotated tables, to learn neural linking algorithms and evaluate them [SemTab@ISWC’22]
● Participation in STI Challenges
  ○ 2019-2022
http://www.cs.ox.ac.uk/isg/challenges/sem-tab/

Recap: Annotations, Enrichment and KG Construction
Enrichment
• Table annotation
  • Schema mapping
  • Entity linking
• Table augmentation
  • With links and data extension services
Exploitation
• Export: graph (table to graph transformations) >> KG generation, KG completion
• Export: tabular data >> Downstream analysis
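The "table to graph" export can be sketched as follows: once cells and column pairs are annotated, each row yields triples whose subject is the entity linked in the subject column. The function name and the entity identifiers are assumptions for illustration; a real export would also handle datatypes and IRIs.

```python
def table_to_triples(rows, subject_col, cea, cpa):
    """Return (subject, predicate, object) triples for one annotated table.

    cea maps (row, col) to a linked entity; cpa maps (subject col, object col)
    to a predicate. Rows whose subject cell is not linked produce no triples.
    """
    triples = []
    for i, row in enumerate(rows):
        subject = cea.get((i, subject_col))
        if subject is None:
            continue  # unlinked subject: nothing to attach statements to
        for (s_col, o_col), predicate in sorted(cpa.items()):
            if s_col != subject_col:
                continue
            # Use the linked entity when the object cell has one, else the literal value.
            obj = cea.get((i, o_col)) or row[o_col]
            triples.append((subject, predicate, obj))
    return triples


rows = [["Le Mont Blanc", "4808", "M. Blanc massif"]]
cea = {(0, 0): "Mont_Blanc", (0, 2): "Mont_Blanc_Massif"}  # illustrative identifiers
cpa = {(0, 1): "dbo:elevation", (0, 2): "dbo:mountainRange"}

triples = table_to_triples(rows, 0, cea, cpa)
```

Triples whose subject already exists in the background KG contribute to KG completion; triples about new subjects contribute to KG generation.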

2.B) HITL TABULAR DATA ENRICHMENT
1 - User interfaces for interactive data annotation and enrichment

ASIA: Assisted Semantic Interpretation and Annotation of tabular data
• Interactive annotation
  • Execute linking services
  • Exploit vocabulary suggestions from ABSTAT […, VLDBJ21]
  • Edit / revise annotations
• Interactive extension
  • Execute data extension services specifying parameters from the interface
[Screenshot labels: Table; Vocabulary suggestions and search]
Cutrona, V., Ciavotta, M., De Paoli, F., & Palmonari, M. (2019). ASIA: A tool for assisted semantic interpretation and annotation of tabular data. In Proceedings of ISWC Demo Papers [ISWCdemo19]

SemTUI: Interactive Semantic Enrichment of Tabular Data
Support to Linking - Revision - Extension of tabular data
● UI accessing external services
  ○ STI (full)
    ● s-elBat
  ○ Reconciliation/linking services (OpenRefine interface)
    ● Geonames
    ● WikiData
    ● DBpedia
    ● Atoka-linking (SpazioDati)
  ○ Extension services
    ● WikiData / DBpedia (SPARQL)
    ● Weather extension (ECMWF)
    ● HERE (georeferencing)
    ● Shortest-route
    ● Atoka-extension (SpazioDati)
    ● …
● Graphical view & revision of annotations
  ○ Global and specific annotation rendering
  ○ Single cell editing / annotation revision
  ○ Column annotation revision
Ripamonti, M., De Paoli, F., & Palmonari, M. (2022). SemTUI: a Framework for the Interactive Semantic Enrichment of Tabular Data. arXiv preprint arXiv:2203.09521.
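As a rough illustration of what a UI sends to a reconciliation service with an OpenRefine-style interface, the sketch below builds a batch of queries and reads back the top candidate per cell. It assumes the commonly documented shape of that protocol (a `queries` form field holding a JSON object of keyed queries, and a per-key `result` list with `id`, `name` and `score`). The type identifier `A.ADM1` and the sample response are hypothetical and hand-written; no real service is called, and this is not SemTUI's own code.

```python
import json


def build_queries(cells, type_id=None, limit=3):
    """Build the query batch: one keyed query object per cell value."""
    batch = {}
    for i, value in enumerate(cells):
        query = {"query": value, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        batch[f"q{i}"] = query
    return batch


def best_matches(response, min_score=0.0):
    """Map each query key to the id of its top-scored candidate, or None."""
    out = {}
    for key, body in response.items():
        candidates = sorted(body.get("result", []), key=lambda c: c["score"], reverse=True)
        top = candidates[0] if candidates else None
        out[key] = top["id"] if top and top["score"] >= min_score else None
    return out


queries = build_queries(["Berlin", "Bavaria"], type_id="A.ADM1")
# Form field that would be sent via HTTP POST to the service URL.
payload = {"queries": json.dumps(queries)}

# Hand-written response in the assumed shape (not real service output).
sample_response = {
    "q0": {"result": [{"id": "2950157", "name": "Land Berlin", "score": 92.0, "match": True}]},
    "q1": {"result": []},
}
links = best_matches(sample_response)
```

A cell with an empty candidate list, like `q1` here, is exactly the kind of case the interface would surface for manual revision.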

2.B) HITL TABULAR DATA ENRICHMENT
2 - Make data enrichment pipelines scalable

Annotation for Tabular Data Enrichment at Scale
● Remember: enrichment ~ sequence of transformations that can be executed (batch mode)
● A two-step paradigm [ISWC19, ISWC19demo, Tech.andAppl.for BDV22]
  ● Small-scale design
    ● Algorithms + UI to specify annotations and data extensions on a data sample
  ● Large-scale execution
    ● Big data technologies to speed up large-scale execution of transformations on large data
    ● Docker
    ● Parallelization
    ● …
[Figure labels: SAMPLE; QUALITY INSIGHTS; ENRICHMENT DESIGN; QUALITY ASSESSMENT; STACK CONFIGURATION; ENRICHED SAMPLE; DATASET; ENRICHED DATASET; SMALL-SIZE PROCESSING; TRANSFORMATION MODEL; BATCH PROCESSING]
Ciavotta, M., Cutrona, V., De Paoli, F., Nikolov, N., Palmonari, M., & Roman, D. (2022). Supporting semantic data enrichment at scale. In Technologies and Applications for Big Data Value (pp. 19-39). Cham: Springer International Publishing. [Tech.andAppl.for BDV22]
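The two-step paradigm can be mimicked on a toy scale: a transformation model is designed on a small sample, then applied unchanged to the full dataset in parallel chunks. This is a minimal sketch under loud assumptions: the "model" is just a dictionary of user-confirmed links, and a thread pool stands in for the big data technologies mentioned on the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def design_on_sample(sample):
    """Small-scale design: derive a reusable transformation model from a sample.

    Here the model is a label -> identifier mapping, as if a user had
    validated these links in the UI (hypothetical values).
    """
    confirmed_links = {"Berlin": "gn:2950157", "Bavaria": "gn:2951839"}
    labels = {row["region"] for row in sample}
    return {label: confirmed_links[label] for label in labels if label in confirmed_links}


def apply_model(model, chunk):
    """Apply the designed transformation to one chunk of rows."""
    return [{**row, "geoId": model.get(row["region"])} for row in chunk]


def chunks(rows, size):
    """Split the dataset into consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def execute_at_scale(model, rows, chunk_size=2, workers=4):
    """Large-scale execution: run the same model on all chunks in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the input chunks.
        parts = pool.map(lambda c: apply_model(model, c), chunks(rows, chunk_size))
        return [row for part in parts for row in part]


dataset = [{"region": r} for r in ["Berlin", "Bavaria", "Berlin", "Thuringia", "Bavaria"]]
model = design_on_sample(dataset[:2])
enriched = execute_at_scale(model, dataset)
```

Values that never appeared in the sample ("Thuringia" here) stay unlinked in the batch run, which is why quality assessment on the sample matters before scaling up.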

2.B) HITL TABULAR DATA ENRICHMENT
3 - Deeper integration of UI and algorithms (ongoing)

Challenges: Entity Disambiguation and Ranking in Tables
title | director | release year | domestic distributor | length in min | worldwide gross
jurassic world | colin trevorrow | 2015 | universal pictures | 124 | 1670400637
[Figure: three Wikidata candidates with the label “Jurassic World” for the cell “jurassic world”, marked ✔ 🚫 🚫]
● Q3512046 (Jurassic World): P577 (publication date) 12 June 2015; P2047 (duration) 124; P2142 (box office) 1670400637
● Q20647533 (Jurassic World): P577 (publication date) 2015
● Q21877685 (Jurassic World): P577 (publication date) 22 June 2018; P2047 (duration) 128; P2142 (box office) 1309500000
Other entities in the figure: Q5145625 (Colin Trevorrow), Q13377 (Universal Pictures), Q937857 (Michael Giacchino), Q932019 (J. A. Bayona), Q17862144 (Jurassic Park), ...
Other predicates in the figure: P57 (director), P58 (screenwriter), P175 (performer), P179 (part of the series), P272 (production company), P750 (distributed by)
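The disambiguation problem in the figure can be illustrated with a deliberately naive scorer: a candidate is ranked higher the more values from the rest of the row appear among its property values. The literal values are the ones visible in the figure; the scoring function and the substring matching are assumptions made for the example, not the algorithm used by the tools in the talk.

```python
# Context cells of the row (the title is the mention being linked).
row = {
    "title": "jurassic world",
    "release year": "2015",
    "length in min": "124",
    "worldwide gross": "1670400637",
}

# Literal property values of the three candidates, as shown in the figure.
candidates = {
    "Q3512046": {"P577": "12 June 2015", "P2047": "124", "P2142": "1670400637"},
    "Q20647533": {"P577": "2015"},
    "Q21877685": {"P577": "22 June 2018", "P2047": "128", "P2142": "1309500000"},
}


def context_score(row, properties, skip=("title",)):
    """Fraction of context cells whose value occurs in some property value."""
    context = [value for key, value in row.items() if key not in skip]
    hits = sum(any(cell in value for value in properties.values()) for cell in context)
    return hits / len(context)


# Candidates ordered from best to worst supported by the row context.
ranking = sorted(candidates, key=lambda q: context_score(row, candidates[q]), reverse=True)
```

All three candidates share the same label, so label similarity alone cannot separate them; only the row context does. Substring matching is fragile (a short number can match inside a longer one), which is one reason real systems use typed comparisons and learned scores.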

Challenges: Novel Entities
● Linking with NIL prediction
  ○ Detection of novel entities
  ○ Underrepresented task in benchmark data
    ● Greedy algorithms often rewarded
  ○ Important problem in real-world data enrichment settings
    ● E.g., only a fraction of the organizations in tables not extracted/constructed from WikiData have links to WikiData

HITL in Linking Tasks
● Personal background on HITL approaches
  ○ Ontology matching with multi-user feedback [SWJ’16, KEOD’17]
  ○ Active learning to rank for semantic association relevance [ESWC’17]
● Objective
  ○ Maximize quality while minimizing user effort
● Two levels
  ○ Fast revision
    ● Revise first links that are more likely to be incorrect
  ○ Learning from the user feedback
    ● Feedback propagation, learn from limited data

s-elBat ‘22 >> ’23
● [SemTab22]:
  ○ Ad-hoc transformation of features into an unbounded ranking score
● New:
  ○ NN-based transformation into a bounded confidence score ω ∈ [0,1]
  ○ NIL prediction with threshold
[Figure labels: Mention vs labels; Row vs properties; Row vs description; Predicates and types hits]
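The move from an unbounded ranking score to a bounded confidence with a NIL threshold can be illustrated as below. Note the substitution: the slide describes a neural network producing ω, while this sketch uses a plain logistic function as a stand-in, and the threshold value 0.5 is an arbitrary assumption.

```python
import math


def confidence(raw_score, scale=1.0):
    """Squash an unbounded ranking score into (0, 1) with a logistic function.

    Stand-in for the NN-based transformation mentioned on the slide.
    """
    return 1.0 / (1.0 + math.exp(-scale * raw_score))


def decide(candidates, threshold=0.5):
    """Return (entity, omega) for a mention.

    candidates is a list of (entity id, raw score). The entity is None (NIL)
    when the best candidate's confidence falls below the threshold.
    """
    if not candidates:
        return None, 0.0
    entity, raw = max(candidates, key=lambda c: c[1])
    omega = confidence(raw)
    return (entity if omega >= threshold else None), omega


linked = decide([("Q1", 2.0), ("Q2", -1.0)])
nil = decide([("Q3", -2.0)])
```

A bounded score is what makes a single threshold meaningful across tables; with unbounded scores, a cut-off tuned on one table would not transfer to another.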

Entity Linking with NIL Prediction
● Confidence-based revision:
  ○ Use the confidence score to order links to revise
    ● E.g., mentions with lower confidence first, i.e., order all mentions m by increasing ω_m
    ● E.g., mentions that are more uncertain first, i.e., order all mentions m by distance of ω_m from the threshold
  ○ Optimal k for ranking is learned on the train set (maximize F-1/minimize revisions)
[Figure: linking pipeline for the i-th row value in the j-th column]
● Entity Retriever (ER): top-k candidates c_{i,j,1} … c_{i,j,k} with scores s_{i,j,1} … s_{i,j,k}
● Feature Generator: features F_{i,j,h} for each candidate
● PN-Θ: normalized scores from PN (p_{i,j,h} ∈ [0,1])
● Feature Refiner: column-wise type-consistency features T_{i,j,h}, added from other rows (candidates for the values in the other cells in the j-th column)
● RN-Θ: refined matching scores (ρ_{i,j,h} ∈ [0,1])
● Decision ω(δ, ρ, k), σ: confidence score ω_{i,j} and Link (L) | Not Link (NL) decision
● Smart revision; learning from human feedback (Θ, <δ, ρ, σ>)
[Equation garbled in extraction: it defines the confidence score ω using the parameter k and holds only if a threshold condition is met]
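The two revision orders named on this slide are easy to state in code. The sketch assumes each mention already has a confidence ω in [0, 1]; the function name, the mention ids and the threshold value are invented for illustration.

```python
def revision_order(omegas, threshold=None):
    """Order mention ids for human revision.

    threshold=None: lowest confidence first (increasing omega).
    threshold=t:    most uncertain first (increasing distance of omega from t).
    """
    if threshold is None:
        key = lambda m: omegas[m]
    else:
        key = lambda m: abs(omegas[m] - threshold)
    return sorted(omegas, key=key)


omegas = {"m1": 0.95, "m2": 0.10, "m3": 0.55, "m4": 0.48}

by_confidence = revision_order(omegas)
by_uncertainty = revision_order(omegas, threshold=0.5)
```

The two orders differ in what they send to the user first: the first prioritises links the system believes are wrong, the second prioritises decisions that sit close to the link / not-link boundary.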

Experimental Settings
● Evaluate
  ○ Quality of the links with NIL prediction in ~ out-of-domain training settings
    ● Main: F-1 compared with top SemTab scorers (greedy algorithms)
      ○ k-fold validation with out-of-domain testing (5 datasets for train, 1 for test)
    ● Ablation: impact of different components (ranking + PN + RN)
    ● Ablation: impact of parameter k (final matching score vs distance between top candidates)
  ○ Effectiveness of the uncertainty measure to support smart revision
    ● Main: increase in link quality at incremental revision iterations
      ○ User revision simulated with an Oracle
      ○ Area Under the Curve of F-1 at increasing number of revised mentions
      ○ Fair experimental simplification: global ranking (all tables) vs. local ranking (one table)
    ● Ablation: impact of parameter k for ordering mentions to be revised
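The smart-revision evaluation can be simulated in a few lines: an oracle fixes mentions in a given order, F1 is measured after each step, and the area under that curve summarises how quickly quality grows. This is a sketch under assumptions: the F1 definition (links only, `None` meaning NIL), the toy data and the equal-width trapezoidal AUC are illustrative choices, not necessarily those of the paper.

```python
def f1(pred, gold):
    """F1 over links; None means NIL (no link)."""
    tp = sum(1 for m, e in pred.items() if e is not None and gold.get(m) == e)
    n_pred = sum(1 for e in pred.values() if e is not None)
    n_gold = sum(1 for e in gold.values() if e is not None)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def revision_curve(pred, gold, order, steps=4):
    """F1 after the oracle has revised the first 0/steps ... steps/steps of `order`."""
    curve = []
    for s in range(steps + 1):
        revised = dict(pred)
        for m in order[: round(len(order) * s / steps)]:
            revised[m] = gold[m]  # the oracle always supplies the gold annotation
        curve.append(f1(revised, gold))
    return curve


def auc(curve):
    """Trapezoidal area under the curve, with the x-axis normalised to [0, 1]."""
    width = 1.0 / (len(curve) - 1)
    return sum((a + b) / 2 * width for a, b in zip(curve, curve[1:]))


gold = {"m1": "A", "m2": "B", "m3": None, "m4": "D"}
pred = {"m1": "A", "m2": "X", "m3": "Y", "m4": "D"}
omegas = {"m1": 0.9, "m2": 0.3, "m3": 0.2, "m4": 0.8}

smart_order = sorted(pred, key=omegas.get)      # lowest confidence first
other_order = ["m1", "m4", "m2", "m3"]          # an unhelpful order, for comparison

curve_smart = revision_curve(pred, gold, smart_order)
curve_other = revision_curve(pred, gold, other_order)
```

In this toy case both orders reach F1 = 1 once everything is revised, but the confidence-based order gets there after two revisions, which is what the larger AUC reflects.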
TABLE I: STATISTICS OF THE DATASETS USED IN THE EXPERIMENTS
dataset | # tables | # columns | # rows | # entities (CEA) | # classes (CTA) | # predicates (CPA)
Round1 T2D | 64 | 323 | 9089 | 8078 | 119 | 115
Round3 | 2161 | 9736 | 152753 | 390456 | 5761 | 7574
Round4 | 22207 | 78750 | 475897 | 994920 | 31921 | 56475
2T-2020 | 180 | 802 | 194438 | 667243 | 539 | 0
HardTableR2 | 1750 | 5589 | 29280 | 47439 | 2190 | 3835
HardTableR3 | 7207 | 17902 | 58949 | 58948 | 7206 | 10694
[Excerpt from the paper]
… of the problem, including textual, semantic, and contextual information, to enhance the entity resolution process.
Table II provides details about the architecture of the neural network employed in this study. It is a plain feed-forward neural network, whose hyper-parameters were determined through preliminary experiments. In recent times, deep networks have demonstrated remarkable potential in handling increasingly difficult and complex tasks, often rivaling or even surpassing human capabilities. These networks are typically built using highly intricate architectures. However, for this particular work, we opted for an approach that prioritizes simplicity and speed, while still maintaining excellent learning capability and generalization. Although we acknowledge that the network’s classification capability could be enhanced, devising an architecture optimized for the candidate ranking task is beyond the scope and objectives of this paper.
TABLE II: MODEL ARCHITECTURE [table body not included in the extraction]
[Cropped excerpt, right-hand side of each line cut off: “Sorting Ω produces a global ranking of candidates …” and section “D. Human Revision”. Recoverable gist: candidates with the highest ω can be considered confident and the others of uncertain classification; the user reviews the automatically produced linked/unlinked annotations, and the ordered set Ω is used to prioritize the most uncertain cases, since manual link review is time-consuming.]
● Benchmark data
  ○ Links to DBpedia | WikiData
  ○ Tables may introduce specific/different challenges

Experimental Results
● Entity linking (main)
  ○ All components are relevant
  ○ Competitive results despite NIL prediction (benchmark data reward greedy decisions)
  ○ Gaps on test sets with specific data distributions (also due to retrieval module)
● Smart revision (main)
  ○ Confidence-based revision >> faster than >> random revision
TABLE III: F1 FOR EACH STEP IN THE LINKING WORKFLOW
Test Dataset | Retrieval with indexing (F1) | PN ranking (F1) | PN + RN ranking with types (F1) | SemTab Top Scorer (F1)
Round T2D | 0.82 | 0.83 | 0.86 | 0.90
Round3 | 0.72 | 0.73 | 0.76 | 0.97
Round4 | 0.83 | 0.90 | 0.91 | 0.99
2T-2020 | 0.62 | 0.86 | 0.89 | 0.90
HardTableR2 | 0.90 | 0.91 | 0.93 | 0.98
HardTableR3 | 0.52 | 0.54 | 0.62 | 0.97
TABLE IV: F1 WITH HITL INCREMENTAL PERCENTAGE OF REVIEWS
Test Dataset | k | F1 at 10% | F1 at 20% | F1 at 30% | F1 at 40% | F1 at 50%
Round T2D | 0.4 | 0.91 | 0.95 | 0.97 | 0.98 | 0.98
Round3 | 0.5 | 0.82 | 0.87 | 0.94 | 0.97 | 0.98
Round4 | 0.1 | 0.95 | 0.97 | 0.98 | 0.99 | 0.99
2T-2020 | 0.9 | 0.93 | 0.94 | 0.95 | 0.96 | 0.98
HardTableR2 | 0.4 | 0.98 | 0.99 | 1.0 | 1.0 | 1.0
HardTableR3 | 0.4 | 0.68 | 0.75 | 0.81 | 0.86 | 0.90
[Excerpt from the paper]
… the model’s predictive quality, irrespective of the chosen classification threshold. Fig. 2 shows the values of F1 calculated for different percentages of links to be reviewed and different values of k. The embedded table reports the performance measures AUC. The figure refers to the experiment with the fold that excludes the HardTable-R2 dataset.
The evidence is that we need to review at most 30% of mentions of the training set to reach 0.98 for F1 and that almost any value of k produces similar results. The best value …
[Figure: AUC on HardTable-R2]
Fig. 3. F1 and AUC computed for the test dataset.
The results provide evidence that the learned value of k demonstrates even better performance on the test dataset, achieving a remarkable AUC value of 0.9929 and an F1 score above 0.98 after examining only 10% of the mentions.
Table IV presents the results obtained from the experiments conducted on all datasets. The outcomes are consistent with the aforementioned discussion. Specifically, it is evident that in the case of outlier datasets, such as Round3, even with less than 30% of reviews, the F1 score surpasses 0.90, whereas the performance of the highest-scoring participant in the Challenge (refer to Table III) is achieved with 40% of … Moreover, for datasets with fewer typos, the thre… F1 > 0.90 is attained much earlier. As an illustra… maximum F1 score of 0.98 is accomplished after re…
Also: more interpretable
scores for human
interaction

3) HITL FOR TEXTUAL DATA ENRICHMENT (LEGAL DOMAIN)

Entity Extraction from Legal Documents
[Figure: the mentions “[...] A. Donati [...]” and “[...] Dott.sa Donati [...]” refer to the same KB entity, Anna Donati]
[Figure labels: NER; Enriched text; Enrichment + KG construction]
End-to-end entity extraction with background KG (beyond NER)
[Cropped excerpt from “Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking”, IJCKG’22: Table 3, dataset statistics for the train/dev/test splits, truncated in the extraction]

Target Applications
● Court decisions (texts)
  ○ Semantic search
    ● E.g., find all decisions in controversies with [Money Bank] in [2008]
  ○ Anonymization
    ● E.g., replace all occurrences of persons with *****
  ○ Advanced statistics
    ● E.g., count all decisions in controversies with banks in [2008]-[2018]
● Criminal investigations (texts + tabular data + ..)
  ○ Search on investigation files and report writing
    ● E.g., all paragraphs mentioning [J.Smith]
  ○ Analyze files that are hard to analyze in a timely manner today (chats, audio, files, …)
    ● E.g., all messages/chats where [J.Smith] wrote to [A.Black] about [L.Red]
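The anonymization example ("replace all occurrences of persons with *****") shows how entity annotations directly enable a downstream service. The sketch below assumes annotations are given as non-overlapping character spans with a label; the span format, the label names and the sample sentence are assumptions made for illustration.

```python
def anonymize(text, annotations, label="PERSON", mask="*****"):
    """Replace every span carrying the given label with a fixed mask.

    annotations: (start, end, label) character offsets, end exclusive,
    assumed not to overlap.
    """
    out, cursor = [], 0
    for start, end, lab in sorted(annotations):
        if lab != label:
            continue
        out.append(text[cursor:start])  # keep the text before the span
        out.append(mask)                # mask the span itself
        cursor = end
    out.append(text[cursor:])           # keep the remaining text
    return "".join(out)


text = "J.Smith wrote to A.Black about Money Bank."
annotations = [(0, 7, "PERSON"), (17, 24, "PERSON"), (31, 41, "ORG")]

masked = anonymize(text, annotations)
```

The quality of the result depends entirely on the annotations: a missed person mention stays in clear text, which is one reason human validation matters in this domain.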

Incremental Entity Extraction and Linking: Evaluation
[Figure labels: Batches of documents acquired at different time points; Background KB (e.g., Wikipedia); KB with NEW entities]
Figure 1: Documents are processed in batches through time; at each iteration, novel entities are added into the NEW-KB and can be linked in following steps. Between each step, a human validator can correct pipeline mistakes, splitting/merging clusters and fixing links.
*Use Case*: build a KB from criminal investigation documents/data
Dataset
• Split of WikilinksNED Unseen-Mentions in 10 batches [Onoe&DurrettAAAI20]
• Injection/transplant of NIL entities (~same overall %)
Main challenges
• Error propagation
• NIL Prediction
• Clustering
Similar conclusions as in [Kassner&al.ACL22]
Pozzi, R., Moiraghi, F., Lodi, F., & Palmonari, M. (2022, October). Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking. In Proceedings of the 11th International Joint Conference on Knowledge Graphs (pp. 30-38). [IJCKG22]

Certain application domains require HITL
end-to-end entity extraction to achieve
production-level quality
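The incremental setting can be sketched as a loop over batches in which unlinked mentions create entries in a NEW-KB that later batches can link to. Exact string matching stands in here for the real NEL, NIL prediction and clustering components, and all names and identifiers (such as `KB:Milan` and `NEW:1`) are hypothetical.

```python
def normalise(mention):
    """Toy normalisation used as the matching key."""
    return mention.lower().strip()


def process_batches(batches, background_kb):
    """Link mentions batch by batch; unseen mentions become new entities
    that can be linked in the following batches."""
    new_kb = {}   # normalised mention -> new entity id
    links = []    # (mention, entity id) in processing order
    for batch in batches:
        for mention in batch:
            key = normalise(mention)
            if key in background_kb:
                entity = background_kb[key]
            elif key in new_kb:
                entity = new_kb[key]
            else:
                # NIL with respect to both KBs: create an entity in the NEW-KB.
                entity = f"NEW:{len(new_kb) + 1}"
                new_kb[key] = entity
            links.append((mention, entity))
        # A human validator could inspect new_kb here (split/merge clusters,
        # fix links) before the next batch is processed.
    return links, new_kb


background_kb = {"milano": "KB:Milan"}
batches = [["Milano", "Anna Donati"], ["anna donati", "L. Red"]]

links, new_kb = process_batches(batches, background_kb)
```

The sketch also makes the error-propagation challenge visible: a wrong entry created in the NEW-KB in one batch is silently reused by every later batch unless someone corrects it in between.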

Dave: Semantic Search + HITL Annotation
All names visible in this text are made up, as is all other personal information (PI). None of the facts mentioned in this decision refer to the persons named therein.

4) CONCLUSIONS AND FUTURE WORK

Conclusions & Future Work
● Conclusions
  ○ Data linking + data extension: core semantic data enrichment tasks
  ○ Tabular data and textual data
    ● Similar tasks: annotations >> KG construction | enriched data
● Still several challenges
  ○ NIL prediction and entity clustering
  ○ Incremental KB construction from tables and text
  ○ HITL approach
    ● Interactive data enrichment to overcome intrinsic limitations
    ● Enrichment at scale while controlling the quality
● Future work
  ○ Full-fledged HITL: learning from the user feedback
  ○ Combining Generative AI and data enrichment algorithms for dialogical data enrichment

THANKS! QUESTIONS?
Funding acknowledgements
The work presented in this presentation has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreements No 732590 (EW-Shopp) and No 732003 (euBusinessGraph), and from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070284 (enRichMyData).
