Semantic Data Enrichment: a
Human-in-the-Loop Perspective
Matteo Palmonari matteo.palmonari@unimib.it
INSID&S Lab
Department of Informatics,
Systems and Communication
Università degli Studi di
Milano-Bicocca
Seminar at INRIA –
Sophia Antipolis, July
20th, 2023
DATA
SEMANTICS
@
DATA
SCIENCE
-
UNIMIB
About me/this seminar…
● Associate Prof. at University of Milano-Bicocca
  ○ INSID&S Lab: 4 faculty / 2 assistant prof. / 4 PhD students (now)
● Covered quite a broad spectrum of topics
  ○ AI / Data Integration >> Knowledge Graphs (KGs)
  ○ Representation Learning & NLP to track the evolution and to compare distributional representations >> Computational Social Science (CSS)
● Which topic for this talk?
  ○ Human-in-the-loop (HITL) semantic data enrichment >> broad topic driving specific work; should match WIMMICS (NLP and KG)
  ○ More in-depth presentation of recent work and CSS-related work >> Manuel Vimercati
Overview
n Semantic Data Integration, Annotations and Data
Enrichment
n Semantic Enrichment of Tabular Data
n HITL Tabular Data Enrichment
n Towards HITL Textual Data Enrichment
n Conclusions
3
**Slides contain excerpts of content created by former/current PhD students Vincenzo Cutrona and Riccardo Pozzi
ARTIFICIAL
INTELLIGENCE
@UNIMIB
1)
SEMANTIC DATA
INTEGRATION,
ANNOTATIONS, AND DATA
ENRICHMENT
Semantic Data Integration
6
Company data from
.data.gouv.fr
https://annuaire-entreprises.data.gouv.fr/entreprise/sienna-real-estate-holding-france-492220553
Person
Country
Org.
Foundations firms 'offshore' customers through banks in Wikipedia
Background
KG
Entities in OffshoreLeaks linked to France
https://offshoreleaks.icij.org/search?c=FRA&cat=0
Inspiration for this example: [Knoblock&Szekely 2015], ICIJ + Neo4J work for Panama Papers
Semantic Data Integration
8
Company data from
.data.gouv.fr
https://annuaire-entreprises.data.gouv.fr/entreprise/sienna-real-estate-holding-france-492220553
Person
Country
Org.
Foundations firms 'offshore' customers through banks in Wikipedia
Named Entity Recognition (NER)
Annotations: named entities
Entities in OffshoreLeaks linked to France
https://offshoreleaks.icij.org/search?c=FRA&cat=0
Background
KG
Semantic Data Integration
9
Company data from
.data.gouv.fr
https://annuaire-entreprises.data.gouv.fr/entreprise/sienna-real-estate-holding-france-492220553
Person
Country
Org.
Foundations firms 'offshore' customers through banks in Wikipedia
Annotations: data linking
Named Entity Recognition Named Entity Linking (NEL)
Entities in OffshoreLeaks linked to France
https://offshoreleaks.icij.org/search?c=FRA&cat=0
Background
KG
Semantic Data Integration
10
Company data from
.data.gouv.fr
https://annuaire-entreprises.data.gouv.fr/entreprise/sienna-real-estate-holding-france-492220553
Person
Country
Org.
Foundations firms 'offshore' customers through banks in Wikipedia
Annotations: data linking
Named Entity Recognition Named Entity Linking
Entities in OffshoreLeaks linked to France
https://offshoreleaks.icij.org/search?c=FRA&cat=0
… Sienna …
Clustering
NIL Prediction
=
=
Background
KG
Techniques from Established Research Fields
● Texts
  ○ Annotation / Information Extraction
    ● NER and NEL: huge body of work
      ○ Recent work on NEL: BLINK [Wu&al.EMNLP20] and GENRE [DeCao&al.TACL22] …
    ● NIL prediction and Clustering: ~less investigated
      ○ Increased interest in the last 2 years
      ○ [Agarwal&al.NAACL22], [Kassner&al.ACL22], [Heist&Paulheim ESWC23]
● Tabular data
  ○ Annotation / Semantic Table Interpretation
    ● More details in this presentation
    ● Survey: [Liu&al.JWA22] (R. Troncy and P. Monnin are co-authors :) )
Semantic Data Enrichment and KG Construction
12
A shift in perspective:
• Users are interested in their content
• Background KGs useful to
• support integration
• extend their content with additional data
• The construction of a KG can be a byproduct
Downstream Applications of Data Enrichment
14
Query
Answering
Semantic Search
&
Data Exploration
“Traditional” ML
&
Data Analytics
Analyses with
Representation
Learning
• Criminal
investigations
[SDSM20]
• Exploring data contexts to contextualize news articles
[ISWCdemo15,
ESWC17]
• Enrichment and
analysis of social
media [EACLdemo17]
• Weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]
• Text-based
entity
embeddings and
time-aware
entity similarity
[ISWC18]
• Entity evolution
(+ with CADE
alignment
[AAAI19])
Documents
Tabular
data
Documents
Enabling Data Enrichment Pipelines for
AI-driven Business Products and Services
HORIZON-CL4-2021-DATA-01-03
D4.1: Business Cases Requirements Analysis & Specifications
Work Package 4
Type of document: Report
Dissemination level: SEN - Sensitive
Lead beneficiary: JOT
Authors: Fernando Perales and Cynthia Parrondo (JOT)
Cuong Xuan Chu and Evgeny Kharlamov (BOS)
Contributions:
applications
and novel
analytical
methods
…
Main projects
Main data
Elicitation of the information needs of magistrates within the semantic search and seriality system
Applications
and analytical
methods
This talk
Several Examples from Past/Ongoing Projects
15
Domain | Value | Enrichment Data Sources | Data
eCommerce | Predict impact of events on customer searches | Events, weather | Tabular
Retail | Workforce/budget optimization | Events, weather | Tabular
CRM | Workforce optimization | Events, weather | Tabular
IOT | Customer flow analysis | Events, weather | Tabular
Digital Marketing | Ad impression prediction for campaign optimization | Weather | Tabular
Digital Marketing | Ad impression prediction for campaign optimization | Events | Tabular
Manufacturing | AI-based analytics on welding robot data (tables and user manuals) | Proprietary ~KG | Tabular, Texts
Manufacturing | Troubleshooting and repair based on service manuals, records, log data | Proprietary ~KG | Tabular, Texts
Open data | Construction and maintenance of a European dataset of organizations in procurement from tenders | Proprietary ~KG, Wikidata, Crunchbase | Tabular, Texts
Observatory on AI | Construction and maintenance of a KG to track AI-related innovations from different data sources | Crunchbase, WikiData | Tabular, Texts
Business analysis | Cost-effective enrichment of client datasets with proprietary company KG | Proprietary KG | Tabular
This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070284.
Enabling Data Enrichment Pipelines for AI-driven Business Products and Services, HORIZON-CL4-2021-DATA-01-03. D4.1: Business Cases Requirements Analysis & Specifications. Work Package 4. Type of document: Report. Dissemination level: SEN - Sensitive. Lead beneficiary: JOT. Authors: Fernando Perales and Cynthia Parrondo (JOT), Cuong Xuan Chu and Evgeny Kharlamov (BOS), Qi Gao (PHI), Alex Young and Ian Makgill (SN), Luis Rei and Besher Massri (JSI), Tao Song (BGRIMM). Version: 1.0. Due date of document: 30/06/2023. Delivery date of document: 30/06/2023.
2)
SEMANTIC
ENRICHMENT OF
TABULAR DATA
16
“Traditional” ML
&
Data Analytics
• Weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]
Tabular
data
Semantically-Enabled Optimization of
Digital Marketing Campaigns
Vincenzo Cutrona1, Flavio De Paoli1, Aljaž Košmerlj2, Nikolay Nikolov3,
Matteo Palmonari1, Fernando Perales4, and Dumitru Roman3
1 University of Milano-Bicocca, 2 Jožef Stefan Institute, 3 SINTEF Digital, 4 JOT Internet Media
Vincenzo Cutrona - Ph.D. Presentation - 25/05/2021
Weather-based Campaign Scheduler
18
New services for campaign optimization:
● Main service: weather-based campaign
scheduler
○ Predict the best dates to launch the
campaign with weather-sensitive keywords
○ in the upcoming week
○ for each region
● + additional services
● Why do we focus on data enrichment?
○ 80% of the time in a data analysis project is spent cleaning and enriching the data*
KEYWORD | #im | REGION | Date | geoId | C°/+0 | C°/+1
194906 | 64 | Thuringia | 2017-03-11 | gn:2822542 | 18 | 20
517827 | 50 | Bavaria | 2017-03-12 | gn:2951839 | 17 | 19
459143 | 42 | Berlin | 2017-03-12 | gn:2950157 | 17 | 20
Figure labels: Input data, Additional data, Target data, ML model, Business service
*Worldwide Semiannual Big Data and Analytics
Spending Guide from International Data Corporation
(IDC)
Weather Service
city: 2950157
- date: 2017-03-12
2t: 17
- date: 2017-03-13
2t: 20
regionID (GeoNames) date (ISO 8601)
Data Enrichment: Digital Marketing Example
19
KEYWORD #im REGION Date
194906 64 Thuringia 11/03/2017
517827 50 Bavaria 12/03/2017
459143 42 Berlin 12/03/2017
DIFFERENT systems of identifiers
KEYWORD #im REGION Date
194906 64 Thuringia 2017-03-11
517827 50 Bavaria 2017-03-12
459143 42 Berlin 2017-03-12
Data Enrichment: Digital Marketing Example
20
KEYWORD #im REGION Date
194906 64 Thuringia 11/03/2017
517827 50 Bavaria 12/03/2017
459143 42 Berlin 12/03/2017
STEP 1
VALUE MANIPULATION
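A minimal sketch of the value manipulation step above, assuming the source dates use the day/month/year layout shown in the slide. The function name `to_iso_date` and the row dictionaries are illustrative and not part of the original tooling.

```python
from datetime import datetime


def to_iso_date(value: str, source_format: str = "%d/%m/%Y") -> str:
    """Convert a date string from the source layout to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, source_format).date().isoformat()


# Rows copied from the slide; the dictionary layout is an assumption.
rows = [
    {"KEYWORD": "194906", "#im": 64, "REGION": "Thuringia", "Date": "11/03/2017"},
    {"KEYWORD": "517827", "#im": 50, "REGION": "Bavaria", "Date": "12/03/2017"},
    {"KEYWORD": "459143", "#im": 42, "REGION": "Berlin", "Date": "12/03/2017"},
]

# Rewrite only the Date column, leaving every other value untouched.
normalized = [{**row, "Date": to_iso_date(row["Date"])} for row in rows]
```

After this step the Date column uses the same identifier system (ISO 8601) that the weather service expects.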
KEYWORD #im REGION Date
194906 64 Thuringia 2017-03-11
517827 50 Bavaria 2017-03-12
459143 42 Berlin 2017-03-12
Data Enrichment: Digital Marketing Example
21
gn:2822542
gn:2951839
gn:2950157
KEYWORD #im REGION Date
194906 64 Thuringia 2017-03-11
517827 50 Bavaria 2017-03-12
459143 42 Berlin 2017-03-12
geoId.
gn:2822542
gn:2951839
gn:2950157
STEP 2
LINKING
The region, not the city
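The linking step can be sketched as a lookup that also checks the expected kind of place, which is what separates "the region" from "the city". The gazetteer below is a hypothetical in-memory stand-in for a GeoNames index: the three region identifiers come from the slide, while the city entry uses a made-up placeholder identifier.

```python
# Hypothetical in-memory gazetteer; a real pipeline would query a GeoNames index.
GAZETTEER = [
    {"id": "gn:2822542", "label": "Thuringia", "kind": "region"},
    {"id": "gn:2951839", "label": "Bavaria", "kind": "region"},
    {"id": "gn:2950157", "label": "Berlin", "kind": "region"},
    # Placeholder identifier, only here to show the ambiguity between city and region.
    {"id": "gn:HYPOTHETICAL-CITY", "label": "Berlin", "kind": "city"},
]


def link_region(mention, expected_kind="region"):
    """Return the id of the single candidate matching label and kind, else None."""
    wanted = mention.strip().lower()
    candidates = [
        entry for entry in GAZETTEER
        if entry["label"].lower() == wanted and entry["kind"] == expected_kind
    ]
    # Refuse to link when there is no candidate or the choice is still ambiguous.
    return candidates[0]["id"] if len(candidates) == 1 else None
```

Using the column type as a filter is one simple way to disambiguate; the algorithms presented later in the talk use richer evidence.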
KEYWORD #im REGION Date
194906 64 Thuringia 2017-03-11
517827 50 Bavaria 2017-03-12
459143 42 Berlin 2017-03-12
geoId.
gn:2822542
gn:2951839
gn:2950157
Data Enrichment: Digital Marketing Example
22
EQUAL systems of identifiers
C°/+0 C°/+1
18 20
17 19
17 20
KEYWORD #im REGION Date
194906 64 Thuringia 2017-03-11
517827 50 Bavaria 2017-03-12
459143 42 Berlin 2017-03-12
geoId.
gn:2822542
gn:2951839
gn:2950157
Weather Service
city: 2950157
- date: 2017-03-12
2t: 17
- date: 2017-03-13
2t: 20
cityID (GeoNames) date (ISO 8601)
STEP 3
EXTENSION
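Once the table and the weather service share identifiers, the extension step reduces to a keyed lookup. The sketch below assumes a hypothetical local dictionary `WEATHER` in place of the real service; its values are the temperatures shown in the slide, and the column names `C°/+0` and `C°/+1` follow the slide.

```python
from datetime import date, timedelta

# Stand-in for the weather service: (geoId, ISO date) -> temperature in °C.
WEATHER = {
    ("gn:2822542", "2017-03-11"): 18,
    ("gn:2822542", "2017-03-12"): 20,
    ("gn:2951839", "2017-03-12"): 17,
    ("gn:2951839", "2017-03-13"): 19,
    ("gn:2950157", "2017-03-12"): 17,
    ("gn:2950157", "2017-03-13"): 20,
}


def extend_with_forecast(row, offsets=(0, 1)):
    """Add one temperature column per day offset; missing values become None."""
    base = date.fromisoformat(row["Date"])
    extended = dict(row)
    for offset in offsets:
        day = (base + timedelta(days=offset)).isoformat()
        extended[f"C°/+{offset}"] = WEATHER.get((row["geoId"], day))
    return extended


berlin = extend_with_forecast(
    {"KEYWORD": "459143", "REGION": "Berlin", "Date": "2017-03-12", "geoId": "gn:2950157"}
)
```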
Semantic Data Enrichment: Problem Statement
● Inputs:
○ a source dataset
○ a pool of reference data sources
Data Enrichment: a path on the data transformations graph GT
Semantic Data Enrichment: at least one node is linking
23
● Output:
○ the source dataset extended with
modified/additional columns
Linking Extension
Value
manipulation
source output
external
data sources
reference
KGs
Large data volumes
Unknown or little-known, large
and complex data sources
Intrinsic uncertainty
Annotations from algorithms
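The problem statement can be read as code: an enrichment is a sequence (path) of transformations, and it counts as semantic when at least one step is a linking step. The following sketch uses illustrative names (`Step`, `run`, `is_semantic_enrichment`) and toy lookup tables; none of them come from the original system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Step:
    kind: str  # "value_manipulation", "linking" or "extension"
    apply: Callable[[Row], Row]


def is_semantic_enrichment(path: List[Step]) -> bool:
    """An enrichment path is semantic if at least one step is a linking step."""
    return any(step.kind == "linking" for step in path)


def run(path: List[Step], rows: List[Row]) -> List[Row]:
    """Apply every step of the path, in order, to every row."""
    enriched = []
    for row in rows:
        for step in path:
            row = step.apply(row)
        enriched.append(row)
    return enriched


# Toy reference data (assumed for illustration).
REGION_IDS = {"Berlin": "gn:2950157"}
TEMPERATURES = {("gn:2950157", "2017-03-12"): 17}

path = [
    Step("value_manipulation", lambda r: {**r, "REGION": str(r["REGION"]).strip()}),
    Step("linking", lambda r: {**r, "geoId": REGION_IDS.get(r["REGION"])}),
    Step("extension", lambda r: {**r, "C°/+0": TEMPERATURES.get((r["geoId"], r["Date"]))}),
]

output = run(path, [{"REGION": " Berlin ", "Date": "2017-03-12"}])
```

Keeping the path as data is what later allows the same design to be replayed in batch mode on a larger dataset.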
2.A)
TABULAR DATA ANNOTATION
ALGORITHMS:
SEMANTIC TABLE INTERPRETATION
25
Semantic Table Interpretation
26
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column of the table is associated with one or more types (CTA)
● each cell in the table is annotated with the entity in the catalog (CEA)
● each pair of columns is annotated with a binary relation in the catalog (CPA)
Name Coordinates Height Range
Le Mont Blanc 45°49′57″N 06°51′52″E 4808 M. Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mont_Blanc
MontBlanc
Massif
4808
dbo:elevation
Schema level
Entity level
27
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG-types (CTA)
● each cell in the table is annotated with the entity in the catalog (CEA)
● each pair of columns is annotated with a binary relation in the catalog (CPA)
Name Coordinates Height Range
Le Mont Blanc 45°49′57″N 06°51′52″E 4808 M. Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mountain xsd:string xsd:integer Mountain
Range
Mont_Blanc
MontBlanc
Massif
4808
dbo:elevation
Schema level
Entity level
Semantic Table Interpretation
Name Coordinates Height Range
Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
28
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG-types (CTA)
● each cell in “entity columns” is annotated with a KG-entity (CEA)
● each pair of columns is annotated with a binary relation in the catalog (CPA)
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mountain xsd:string xsd:integer Mountain
Range
Mont_Blanc
MontBlanc
Massif
4808
dbo:elevation
Schema level
Entity level
Mont_Blanc
MontBlanc
Massif
Semantic Table Interpretation
Name Coordinates Height Range
Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
29
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG-types (CTA)
● each cell in “entity columns” is annotated with a KG-entity (CEA)
● each pair of columns is annotated with a binary relation in the catalog (CPA)
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mountain xsd:string xsd:integer Mountain
Range
Mont_Blanc
MontBlanc
Massif
4808
dbo:elevation
Schema level
Entity level
Mont_Blanc
MontBlanc
Massif
Subject column
Named-Entity column
Literal column
Also referred to as “entity
linking” (for tables)
Semantic Table Interpretation
Name Coordinates Height Range
Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
30
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG-types (CTA)
● each cell in “entity columns” is annotated with a KG-entity (CEA)
● some pair of columns is annotated with a binary KG-predicate (CPA)
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mountain xsd:string xsd:integer Mountain
Range
Mont_Blanc
MontBlanc
Massif
4808
dbo:elevation
Schema level
Entity level
Mont_Blanc
MontBlanc
Massif
dbo:mountainRange
dbo:elevation
georss:point
Semantic Table Interpretation
Name Coordinates Height Range
Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
31
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG-types (CTA)
● each cell in “entity columns” is annotated with a KG-entity (CEA)
● some pair of columns is annotated with a binary KG-predicate (CPA)
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mountain xsd:string xsd:integer Mountain
Range
Mont_Blanc
MontBlanc
Massif
4808
Schema level
Entity level
Mont_Blanc
MontBlanc
Massif
dbo:mountainRange
dbo:elevation
georss:point
dbo:mountainRange
dbo:elevation
georss:point
45°49′57″N
06°51′52″E
Semantic Table Interpretation
… for KG completion
Name Coordinates Height Range
Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif
Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps
Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps
32
Given
● a relational table T
● a Knowledge Graph (entities + statements) and an ontology (types + predicates)
T is annotated when:
● each column is associated with one or more KG-types (CTA)
● each cell in “entity columns” is annotated with a KG-entity or with NIL (if not in the KG)
● some pair of columns is annotated with a binary KG-predicate (CPA)
KNOWLEDGE GRAPH
Mountain
Range
Mountain xsd:integer
xsd:string
Natural Place
georss:point
dbo:elevation
dbo:mountainRange
…
…
Mountain xsd:string xsd:integer Mountain
Range
Mont_Blanc
MontBlanc
Massif
4808
Schema level
Entity level
Mont_Blanc
MontBlanc
Massif
dbo:mountainRange
dbo:elevation
georss:point
dbo:mountainRange
dbo:elevation
georss:point
45°49′57″N
06°51′52″E
Semantic Table Interpretation
… with novel entities
Pennine
Alps
Monte
Cervino
[NIL: Hohtälli]
Pennine
Alps
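The definition above can be made concrete with a small data structure. This is only a sketch: the entity identifiers other than `Mont_Blanc`, the dictionary layout, and the helper `is_annotated` are assumptions for illustration, and `None` stands in for NIL.

```python
NIL = None  # marker for a mention whose entity is not in the KG

header = ["Name", "Coordinates", "Height", "Range"]
rows = [
    ["Mont Blanc", "45°49′57″N 06°51′52″E", 4808, "Mont Blanc massif"],
    ["Hohtälli", "45°98’96″N 07°80’25″E", 3275, "Pennine Alps"],
    ["Monte Cervino", "45°58′35″N 07°39′31″E", 4478, "Pennine Alps"],
]

# CTA: each column is associated with one or more types.
cta = {
    "Name": ["Mountain"],
    "Coordinates": ["xsd:string"],
    "Height": ["xsd:integer"],
    "Range": ["Mountain Range"],
}

# CEA: each cell in an entity column gets a KG entity or NIL.
entity_columns = ["Name", "Range"]
cea = {
    (0, "Name"): "Mont_Blanc", (0, "Range"): "Mont_Blanc_massif",
    (1, "Name"): NIL, (1, "Range"): "Pennine_Alps",
    (2, "Name"): "Monte_Cervino", (2, "Range"): "Pennine_Alps",
}

# CPA: some pairs of columns get a KG predicate.
cpa = {
    ("Name", "Coordinates"): "georss:point",
    ("Name", "Height"): "dbo:elevation",
    ("Name", "Range"): "dbo:mountainRange",
}


def is_annotated(header, rows, cta, cea, cpa, entity_columns):
    """Check the three conditions of the definition (CTA, CEA with NIL, CPA)."""
    columns_typed = all(cta.get(column) for column in header)
    cells_decided = all(
        (i, column) in cea for i in range(len(rows)) for column in entity_columns
    )
    return columns_typed and cells_decided and len(cpa) > 0


# Mentions annotated with NIL are candidate novel entities.
novel = [rows[i][header.index(column)] for (i, column), entity in cea.items() if entity is NIL]
```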
INSID&S Contributions
● Entity linking in tables
  ○ Soft filters to filter candidate entities based on type embedding similarity [SEMANTICS’21]
  ○ LamAPI: supporting indexing and matching [OM@ISWC’22]
● End-to-end STI
  ○ s-elBat: dealing with messy tables [SemTab@ISWC’22]*
  ○ MantisTable [Fut.Gen.Internet’20,SemTab@ISWC’19-21]*
● Evaluation & datasets
  ○ Tough Tables: misspelling and noisy labels [ISWC’20]
  ○ MammoTab: large dataset of annotated tables, to learn neural linking algorithms and evaluate them [SemTab@ISWC’22]
● Participation in STI Challenges
  ○ 2019-2022
33
http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
Recap: Annotations, Enrichment and KG Construction
34
Table annotation
• Schema mapping
• Entity linking
Table augmentation
• With links and data extension services
Export: graph
• Table to graph
transformations
KG generation
KG completion
Export: tabular data
Downstream
analysis
Enrichment Exploitation
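The "Export: graph" branch of the recap can be illustrated by turning one annotated row into triples. This is a sketch under assumptions: the function `row_to_triples` and the identifier `Mont_Blanc_massif` are illustrative, while the predicates are the ones used in the mountain example.

```python
def row_to_triples(subject, values, predicates):
    """Emit one (subject, predicate, object) triple per annotated, non-empty column."""
    return [
        (subject, predicates[column], value)
        for column, value in values.items()
        if column in predicates and value is not None
    ]


# Column-to-predicate mapping taken from the CPA annotations of the example table.
predicates = {"Height": "dbo:elevation", "Range": "dbo:mountainRange"}

triples = row_to_triples(
    "Mont_Blanc",
    {"Height": 4808, "Range": "Mont_Blanc_massif", "Comment": "not annotated"},
    predicates,
)
```

Columns without a predicate annotation are skipped, so the export only contains statements backed by the table annotation.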
2.B)
HITL TABULAR DATA
ENRICHMENT
1 – User interfaces for interactive data annotation and enrichment
ASIA: Assisted Semantic Interpretation and Annotation of tabular data
• Interactive annotation
• Execute linking services
• Exploit vocabulary suggestions from
ABSTAT […, VLDBJ21]
• Edit / revise annotations
Table
Vocabulary suggestions and search
Cutrona, V., Ciavotta, M., De Paoli, F., & Palmonari, M. (2019). ASIA: A
tool for assisted semantic interpretation and annotation of tabular data.
In Proceedings of ISWC Demo Papers [ISWCdemo19]
• Interactive extension
• Execute data extension services specifying
parameters from the interface
SemTUI – Interactive Semantic
Enrichment of Tabular Data
● UI accessing external services
  ○ STI (full)
    ● s-elBat
  ○ Reconciliation/linking services (OpenRefine interface)
    ● Geonames
    ● WikiData
    ● DBpedia
    ● Atoka-linking (SpazioDati)
  ○ Extension services
    ● WikiData / DBpedia (SPARQL)
    ● Weather extension (ECMWF)
    ● HERE (georeferencing)
    ● Shortest-route
    ● Atoka-extension (SpazioDati)
    ● …
38
Support to Linking – Revision – Extension of tabular data
● Graphical view & revision of annotations
  ○ Global and specific annotation rendering
  ○ Single cell editing / annotation revision
  ○ Column annotation revision
Ripamonti, M., De Paoli, F., & Palmonari, M. (2022). SemTUI: a
Framework for the Interactive Semantic Enrichment of Tabular
Data. arXiv preprint arXiv:2203.09521.
2 – Make data enrichment pipelines scalable
● Remember: enrichment ~ sequence of transformations that
can be executed (batch mode)
● A two-step paradigm
[ISWC19,ISWC19demo,Tech.andAppl.for BDV22]
● Small-scale design
● Algorithms + UI to specify annotations and data
extensions on a data sample
● Large-scale execution
● Big data technologies to speed up large-scale
execution of transformations on large data
● Docker
● Parallelization
● …
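The two-step paradigm can be sketched as: design a transformation on a small, user-revised sample, then replay it on the full dataset in parallel chunks. The code below is a simplified illustration with assumed names (`design_transformation`, `execute_at_scale`); it uses a thread pool from the Python standard library instead of the Docker-based big data stack mentioned in the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def design_transformation(sample):
    """Small-scale design: derive a reusable row transformation from a revised sample.

    Here the 'design' is just the region -> geoId mapping confirmed on the sample.
    """
    mapping = {row["REGION"]: row["geoId"] for row in sample}

    def transform(row):
        return {**row, "geoId": mapping.get(row["REGION"])}

    return transform


def chunks(rows, size):
    """Split the dataset into consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def execute_at_scale(transform, rows, chunk_size=2, workers=4):
    """Large-scale execution: apply the designed transformation chunk by chunk."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves the order of the input chunks.
        results = list(pool.map(lambda chunk: [transform(r) for r in chunk],
                                chunks(rows, chunk_size)))
    return [row for chunk in results for row in chunk]


sample = [
    {"REGION": "Berlin", "geoId": "gn:2950157"},
    {"REGION": "Bavaria", "geoId": "gn:2951839"},
]
dataset = [{"REGION": name} for name in ["Bavaria", "Berlin", "Saxony", "Berlin", "Bavaria"]]
enriched = execute_at_scale(design_transformation(sample), dataset)
```

Rows whose values were not covered by the sample (here "Saxony") come back with an empty link, which is one reason the figure includes a quality assessment step on the sample before batch processing.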
Annotation for Tabular Data Enrichment at Scale
40
SAMPLE
QUALITY
INSIGHTS
ENRICHMENT
DESIGN
QUALITY ASSESSMENT
STACK
CONFIGURATION
ENRICHED
SAMPLE
DATASET
ENRICHED
DATASET
SMALL-SIZE
PROCESSING
TRANSFORMATION
MODEL
BATCH PROCESSING
Ciavotta, M., Cutrona, V., De Paoli, F., Nikolov, N., Palmonari, M., & Roman, D. (2022). Supporting semantic data enrichment at
scale. In Technologies and Applications for Big Data Value (pp. 19-39). Cham: Springer International Publishing.
[Tech.andAppl.for BDV22]
3 – Deeper integration of UI and algorithms (ongoing)
Challenges: Entity Disambiguation and Ranking in Tables
42
title | director | release year | domestic distributor | length in min | worldwide gross
jurassic world | colin trevorrow | 2015 | universal pictures | 124 | 1670400637
[Figure: candidate Wikidata entities for the mention “jurassic world” with their neighborhoods, marked ✔ 🚫 🚫.
Candidates: Q3512046 (Jurassic World), Q20647533 (Jurassic World), Q21877685 (Jurassic World).
Related entities: Q13377 (Universal Pictures), Q5145625 (Colin Trevorrow), Q937857 (Michael Giacchino), Q932019 (J. A. Bayona), Q17862144 (Jurassic Park).
Literal values: 12 June 2015, 124, 1670400637; 2015; 22 June 2018, 128, 1309500000.
Properties: P57 (director), P58 (screenwriter), P175 (performer), P179 (part of the series), P272 (production company), P577 (publication date), P750 (distributed by), P2047 (duration), P2142 (box office).]
Challenges: Novel Entities
● Linking with NIL prediction
  ○ Detection of novel entities
  ○ Underrepresented task in benchmark data
    ● Greedy algorithms often rewarded
  ○ Important problem in real-world data enrichment settings
    ● E.g., in tables not extracted/constructed from WikiData, only a fraction of the organizations have links to WikiData
43
HITL in Linking Tasks
● Personal background on HITL approaches
  ○ Ontology matching with multi-user feedback [SWJ’16,KEOD’17]
  ○ Active learning to rank for semantic association relevance [ESWC’17]
● Objective
  ○ Maximize quality while minimizing user effort
● Two levels
  ○ Fast revision
    ● Revise first links that are more likely to be incorrect
  ○ Learning from the user feedback
    ● Feedback propagation, learn from limited data
s-elBat ‘22 >> ’23
● [SemTab22]:
  ○ Ad-hoc transformation of features into an unbounded ranking score
● New:
  ○ NN-based transformation into a bounded confidence score ω ∈ [0,1]
  ○ NIL prediction with threshold
45
Mention vs labels
Row vs properties
Row vs description
Predicates and types hits
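With a bounded confidence score, NIL prediction becomes a threshold test on the best candidate. The sketch below is not the s-elBat implementation: the function `decide`, the default threshold of 0.5, and the scores are made up for illustration, while the identifiers are the Wikidata candidates from the Jurassic World example.

```python
def decide(candidates, threshold=0.5):
    """Link to the best candidate if its confidence reaches the threshold, else NIL.

    `candidates` is a list of (entity_id, confidence) pairs with confidence in [0, 1].
    Returns (entity_id or None, confidence of the best candidate).
    """
    if not candidates:
        return (None, 0.0)
    best_id, best_confidence = max(candidates, key=lambda pair: pair[1])
    if best_confidence >= threshold:
        return (best_id, best_confidence)
    return (None, best_confidence)


# Illustrative scores only; they are not results from the talk.
linked = decide([("Q3512046", 0.92), ("Q21877685", 0.31), ("Q20647533", 0.12)])
nil = decide([("Q20647533", 0.20)])
```

A greedy linker corresponds to a threshold of 0, which always returns the top candidate and can never predict NIL.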
Entity Linking with NIL Prediction
46
● Confidence-based revision:
  ○ Use the confidence score to order links to revise
    ● E.g., mentions with lower confidence first, i.e., order all mentions m by increasing ω_m
    ● E.g., mentions that are more uncertain first, i.e., order all mentions m by distance of ω_m from the threshold
  ○ Optimal k for ranking is learned on the train set (maximize F-1/minimize revisions)
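The two ordering criteria above can be sketched as a single sort with two different keys. The function name `revision_order` and the confidence values are assumptions for illustration.

```python
def revision_order(confidences, threshold=None):
    """Return mentions in the order in which they should be shown to the reviewer.

    Without a threshold: lowest confidence first.
    With a threshold: most uncertain first, i.e. smallest distance from the threshold.
    """
    if threshold is None:
        return sorted(confidences, key=lambda mention: confidences[mention])
    return sorted(confidences, key=lambda mention: abs(confidences[mention] - threshold))


# Made-up confidence scores for four mentions.
omega = {"m1": 0.95, "m2": 0.10, "m3": 0.55, "m4": 0.48}

by_confidence = revision_order(omega)
by_uncertainty = revision_order(omega, threshold=0.5)
```

The two criteria disagree on mentions such as m2: a very low score ranks first under the confidence ordering, but it is a fairly certain NIL decision and so ranks late under the uncertainty ordering.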
[Figure: linking workflow with NIL prediction. Components: Entity Retriever (ER), Feature Generator, PN, Feature Refiner, RN, Decision, Smart revision, Learning from human feedback.
Captions: Candidates for the i-th row values in the j-th column; Candidates for the values in the other cells in the j-th column; Top-k candidates from ER with features; Normalized scores from PN (p_{i,j,h} ∈ [0,1]); Column-wise type-consistency features added from other rows; Refined matching scores (ρ_{i,j,h} ∈ [0,1]); Confidence score ω_{i,j}; Link (L) | Not Link (NL) decision.
The formula for the confidence score ω(δ, ρ, k) is garbled in the extraction and is not reproduced here.]
Experimental Settings
● Evaluate
  ○ Quality of the links with NIL prediction in ~out-of-domain training settings
    ● Main: F-1 compared with top SemTab scorers (greedy algorithms)
      ○ k-fold validation with out-of-domain testing (5 datasets for train, 1 for test)
    ● Ablation: impact of different components (ranking + PN + RN)
    ● Ablation: impact of parameter k (final matching score vs distance between top candidates)
  ○ Effectiveness of the uncertainty measure to support smart revision
    ● Main: increase in link quality at incremental revision iterations
      ○ User revision simulated with an Oracle
      ○ Area Under the Curve of F-1 at increasing number of revised mentions
      ○ Fair experimental simplification: global ranking (all tables) vs. local ranking (one table)
    ● Ablation: impact of parameter k for ordering mentions to be revised
47
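The smart-revision evaluation can be sketched as follows: an oracle replaces the first n predictions (in revision order) with the gold links, F1 is recomputed at each revision budget, and the area under the resulting curve is measured. The F1 definition below (correct links over predicted links and over gold links, with `None` as NIL) and the toy data are assumptions for illustration, not the evaluation code used in the experiments.

```python
def f1(predicted, gold):
    """F1 over mentions; None means NIL (no link predicted / no entity in the KG)."""
    true_positives = sum(1 for m, e in predicted.items() if e is not None and gold.get(m) == e)
    n_predicted = sum(1 for e in predicted.values() if e is not None)
    n_gold = sum(1 for e in gold.values() if e is not None)
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_predicted
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


def revision_curve(predicted, gold, order, fractions):
    """F1 after an oracle has revised the first fraction of mentions in `order`."""
    curve = []
    for fraction in fractions:
        budget = round(fraction * len(order))
        revised = dict(predicted)
        for mention in order[:budget]:
            revised[mention] = gold[mention]  # the oracle always gives the gold answer
        curve.append((fraction, f1(revised, gold)))
    return curve


def area_under_curve(curve):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Toy data: two wrong predictions (b, c) that the revision order surfaces first.
gold = {"a": "E1", "b": "E2", "c": None, "d": "E4"}
predicted = {"a": "E1", "b": "E7", "c": "E9", "d": "E4"}
order = ["b", "c", "a", "d"]

curve = revision_curve(predicted, gold, order, fractions=[0.0, 0.5, 1.0])
auc = area_under_curve(curve)
```

A good revision order pushes the curve up early, which is what a higher area under the curve captures.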
TABLE I
STATISTICS OF THE DATASETS USED IN THE EXPERIMENTS
dataset | # tables | # columns | # rows | # entities (CEA) | # classes (CTA) | # predicates (CPA)
Round1 T2D | 64 | 323 | 9089 | 8078 | 119 | 115
Round3 | 2161 | 9736 | 152753 | 390456 | 5761 | 7574
Round4 | 22207 | 78750 | 475897 | 994920 | 31921 | 56475
2T-2020 | 180 | 802 | 194438 | 667243 | 539 | 0
HardTableR2 | 1750 | 5589 | 29280 | 47439 | 2190 | 3835
HardTableR3 | 7207 | 17902 | 58949 | 58948 | 7206 | 10694
of the problem, including textual, semantic, and contextual
information, to enhance the entity resolution process.
Table II provides details about the architecture of the
neural network employed in this study. It is a plain feed-
forward neural network, whose hyper-parameters were deter-
mined through preliminary experiments. In recent times, deep
networks have demonstrated remarkable potential in handling
increasingly difficult and complex tasks, often rivaling or even
surpassing human capabilities. These networks are typically
built using highly intricate architectures. However, for this
particular work, we opted for an approach that prioritizes
simplicity and speed, while still maintaining excellent learning
capability and generalization. Although we acknowledge that
the network’s classification capability could be enhanced,
devising an architecture optimized for the candidate ranking
task is beyond the scope and objectives of this paper.
TABLE II
MODEL ARCHITECTURE
Sorting Ω produces a global ranking of candidates associated with mentions that can be used to split the […] mentions into linked and unlinked subsets. The intuit[…] that candidates with the highest ω can be considered c[…] and the others of uncertain classification. Human revie[…] help disambiguate uncertain cases.
D. Human Revision
The process described so far automatically classifie[…] mention as either linked or unlinked. Subsequently, the annotations are presented to the user, who reviews the resul[…] verifies or corrects the annotations generated by the ma[…] algorithm.
The ordered set Ω facilitates the assessment of the […] of uncertainty associated with each link. This emp[…] the user to determine which cases should be prioritiz[…] review based on the estimated degree of uncertainty inv[…] Considering that manual link review is a time-cons[…] task, a straightforward criterion is to commence fro[…]
● Benchmark data
  ○ Links to DBpedia | WikiData
  ○ Tables may introduce specific/different challenges
Experimental Results
● Entity linking (main)
  ○ All components are relevant
  ○ Competitive results despite NIL prediction (benchmark data reward greedy decisions)
  ○ Gaps on test sets with specific data distributions (also due to retrieval module)
● Smart revision (main)
  ○ Confidence-based revision >> faster than >> random revision
TABLE III
F1 FOR EACH STEP IN THE LINKING WORKFLOW
Test Dataset | Retrieval with indexing (F1) | PN ranking (F1) | PN + RN ranking with types (F1) | SemTab Top Scorer (F1)
Round T2D | 0.82 | 0.83 | 0.86 | 0.90
Round3 | 0.72 | 0.73 | 0.76 | 0.97
Round4 | 0.83 | 0.90 | 0.91 | 0.99
2T-2020 | 0.62 | 0.86 | 0.89 | 0.90
HardTableR2 | 0.90 | 0.91 | 0.93 | 0.98
HardTableR3 | 0.52 | 0.54 | 0.62 | 0.97
TABLE IV
F1 WITH HITL INCREMENTAL PERCENTAGE OF REVIEWS
Test Dataset | k | F1 at 10% | F1 at 20% | F1 at 30% | F1 at 40% | F1 at 50%
Round T2D | 0.4 | 0.91 | 0.95 | 0.97 | 0.98 | 0.98
Round3 | 0.5 | 0.82 | 0.87 | 0.94 | 0.97 | 0.98
Round4 | 0.1 | 0.95 | 0.97 | 0.98 | 0.99 | 0.99
2T-2020 | 0.9 | 0.93 | 0.94 | 0.95 | 0.96 | 0.98
HardTableR2 | 0.4 | 0.98 | 0.99 | 1.0 | 1.0 | 1.0
HardTableR3 | 0.4 | 0.68 | 0.75 | 0.81 | 0.86 | 0.90
Fig. 3. F1 and AUC computed for the test dataset.
The results provide evidence that the learned value of k
demonstrates even better performance on the test dataset,
achieving a remarkable AUC value of 0.9929 and an F1 score
above 0.98 after examining only 10% of the mentions.
Table IV presents the results obtained from the experiments
conducted on all datasets. The outcomes are consistent with
the aforementioned discussion. Specifically, it is evident that
in the case of outlier datasets, such as Round3, even with less
than 30% of reviews, the F1 score surpasses 0.90, whereas
the performance of the highest-scoring participant in the
AUC on
HardTable-R2
the model’s predictive quality, irrespective of the chosen clas-
sification threshold. Fig. 2 shows the values of F1 calculated
for different percentages of links to be reviewed and different
values of k. The embedded table reports the performance
measures AUC. The figure refers to the experiment with the
fold that excludes the HartTable-R2 dataset.
The evidence is that we need to review at most 30% of
mentions of the training set to reach 0.98 for F1 and that
almost any value of k produces similar results. The best value
Fig. 3.
The results p
demonstrates ev
achieving a rem
above 0.98 afte
Table IV pres
conducted on a
the aforementio
in the case of o
than 30% of re
the performanc
Challenge (refer
Moreover, for
F1 > 0.90 is
maximum F1 s
only 10% of the
The lessons l
ing sample sets
creases linearly
to reach high v
TABLE III
F1 FOR EACH STEP IN THE LINKING WORKFLOW
Test Dataset
Retrieval
with
indexing
PN
ranking
PN + RN
ranking
with types
SemTab
Top
Scorer
F1 F1 F1 F1
Round T2D 0.82 0.83 0.86 0.90
Round3 0.72 0.73 0.76 0.97
Round4 0.83 0.90 0.91 0.99
2T-2020 0.62 0.86 0.89 0.90
HardTableR2 0.90 0.91 0.93 0.98
HardTableR3 0.52 0.54 0.62 0.97
TABLE IV
F1 WITH HITL INCREMENTAL PERCENTAGE OF REVIEWS
Test Dataset k
10% 20% 30% 40% 50%
F1 F1 F1 F1 F1
Round T2D 0.4 0.91 0.95 0.97 0.98 0.98
Round3 0.5 0.82 0.87 0.94 0.97 0.98
Round4 0.1 0.95 0.97 0.98 0.99 0.99
2T-2020 0.9 0.93 0.94 0.95 0.96 0.98
HardTableR2 0.4 0.98 0.99 1.0 1.0 1.0
HardTableR3 0.4 0.68 0.75 0.81 0.86 0.90
the model’s predictive quality, irrespective of the chosen clas-
sification threshold. Fig. 2 shows the values of F1 calculated
for different percentages of links to be reviewed and different
values of k. The embedded table reports the performance
Fig. 3. F1 and AUC computed for the test dataset.
The results provide evidence that the learned va
demonstrates even better performance on the test
achieving a remarkable AUC value of 0.9929 and an
above 0.98 after examining only 10% of the mention
Table IV presents the results obtained from the exp
conducted on all datasets. The outcomes are consist
the aforementioned discussion. Specifically, it is evid
in the case of outlier datasets, such as Round3, even w
than 30% of reviews, the F1 score surpasses 0.90,
the performance of the highest-scoring participan
Challenge (refer to Table III) is achieved with 40% of
Moreover, for datasets with fewer typos, the thre
F1 > 0.90 is attained much earlier. As an illustra
maximum F1 score of 0.98 is accomplished after re
Also: more interpretable
scores for human
interaction
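The confidence-based revision idea on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the data is a made-up toy set, accuracy is used as a simple stand-in for F1, and the function names are hypothetical. The only point it shows is why reviewing the least confident links first tends to fix errors faster than reviewing links in random order.

```python
import random

# Toy predictions: (predicted entity, gold entity, model confidence).
# All identifiers and scores are invented for illustration only.
links = [
    ("Q1", "Q1", 0.95), ("Q2", "Q2", 0.90), ("Q3", "Q9", 0.20),
    ("Q4", "Q4", 0.85), ("Q5", "Q8", 0.30), ("Q6", "Q6", 0.80),
    ("Q7", "Q7", 0.75), ("Q8", "Q8", 0.70), ("Q9", "Q1", 0.60),
    ("Q10", "Q10", 0.99),
]

def confidence_order(links):
    """Indices sorted so that the least confident links are reviewed first."""
    return sorted(range(len(links)), key=lambda i: links[i][2])

def random_order(links, seed=0):
    """Indices in a reproducible random order (baseline)."""
    idx = list(range(len(links)))
    random.Random(seed).shuffle(idx)
    return idx

def accuracy_after_review(links, order, fraction):
    """Accuracy if a human reviews (and thereby corrects) the first
    `fraction` of links in `order`; unreviewed links keep the prediction."""
    n_review = round(len(links) * fraction)
    reviewed = set(order[:n_review])
    correct = sum(
        1 for i, (pred, gold, _) in enumerate(links)
        if i in reviewed or pred == gold
    )
    return correct / len(links)

for fraction in (0.0, 0.1, 0.2, 0.3):
    by_conf = accuracy_after_review(links, confidence_order(links), fraction)
    by_rand = accuracy_after_review(links, random_order(links), fraction)
    print(f"{fraction:.0%} reviewed: confidence={by_conf:.2f} random={by_rand:.2f}")
```

In this toy set the three wrong links also have the three lowest confidence scores, so the sketch is an idealized case; with a poorly calibrated scorer the advantage over random review would shrink.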
3)
HITL FOR TEXTUAL DATA
ENRICHMENT
(LEGAL DOMAIN)
Entity Extraction from Legal Documents
KB
[...] A. Donati [...]
[...] Dott.sa Donati [...]
Anna
Donati
[Excerpt cropped from the IJCKG’22 paper "Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking": Table 3, dataset statistics. Mentions: train 2.2M, dev 10k, test 10k (first group of rows); train 2.008M, dev 100k, test 100k (second group of rows). The remaining columns are truncated.]
NER
Enriched text
Enrichment + KG construction
End-to-end entity extraction with background KG (beyond NER)
Target Applications
n Court decisions (texts)
¨ Semantic search
n E.g., find all decisions in
controversies with [Money Bank]
in [2008]
¨ Anonymization
n E.g., replace all occurrences of
persons with *****
¨ Advanced statistics
n E.g., Count all decisions in
controversies with banks in
[2008]-[2018]
n Criminal investigations (texts +
tabular data + ..)
¨ Search on investigation files
and report writing
n E.g., all paragraphs mentioning
[J.Smith]
¨ Analyze files that are hard to analyze
in a timely manner today (chats, audio,
files, …)
n E.g., all messages/chats where
[J.Smith] wrote to [A.Black] about
[L.Red]
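The applications listed above all operate on entity annotations rather than on raw text. As a rough sketch, assuming annotations are available as character spans with a type and a linked identifier (the `Annotation` class, the type labels, and the identifiers below are hypothetical, not taken from any tool mentioned in the talk), anonymization and entity-based search could look like this:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int       # character offset where the mention starts
    end: int         # character offset just past the mention
    etype: str       # hypothetical type label, e.g. "PER", "ORG", "DATE"
    entity_id: str   # hypothetical identifier of the linked entity

def annotate(text, surface, etype, entity_id):
    """Build an annotation for the first occurrence of `surface` in `text`."""
    start = text.index(surface)
    return Annotation(start, start + len(surface), etype, entity_id)

def anonymize(text, annotations, types=("PER",), mask="*****"):
    """Replace every mention whose type is in `types` with `mask`."""
    out = text
    # Work right to left so that earlier offsets remain valid after edits.
    for a in sorted(annotations, key=lambda a: a.start, reverse=True):
        if a.etype in types:
            out = out[:a.start] + mask + out[a.end:]
    return out

def mentions_all(annotations, entity_ids):
    """True if the document mentions every entity in `entity_ids`."""
    return set(entity_ids) <= {a.entity_id for a in annotations}

text = "J.Smith sued Money Bank in 2008."
annotations = [
    annotate(text, "J.Smith", "PER", "person:j_smith"),
    annotate(text, "Money Bank", "ORG", "org:money_bank"),
    annotate(text, "2008", "DATE", "year:2008"),
]

print(anonymize(text, annotations))
print(mentions_all(annotations, {"org:money_bank", "year:2008"}))
```

Searching by entity identifier instead of by string is what lets a query such as [Money Bank] in [2008] match documents that use a different surface form for the same entity, provided the linking step resolved it.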
Incremental Entity Extraction and Linking: Evaluation
Riccardo Pozzi, Federico Moiraghi, Fausto Lodi, and Matteo Palmonari
Figure 1: Documents are processed in batches through time; at each iteration, novel entities are added into the NEW-KB and can be linked in following steps. Between each step, a human validator can correct pipeline mistakes, splitting/merging clusters and fixing links.
Pozzi, R., Moiraghi, F., Lodi, F., & Palmonari, M. (2022, October). Evaluation of Incremental Entity Extraction with Background Knowledge and
Entity Linking. In Proceedings of the 11th International Joint Conference on Knowledge Graphs (pp. 30-38). [IJCKG22]
Batches of documents acquired at different time points
Background KB (e.g., Wikipedia) KB with NEW entities
*Use Case*
build a KB from criminal investigation documents/data
Dataset
• Split of WikilinksNED Unseen-Mentions in 10 batches [Onoe&DurrettAAAI20]
• Injection/transplant of NIL entities (~same overall %)
Main challenges
• Error propagation
• NIL Prediction
• Clustering
Similar conclusions as in [Kassner&al.ACL22]
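The incremental workflow described in Figure 1 can be sketched in a few lines. This is a deliberately simplified illustration, not the pipeline evaluated in [IJCKG22]: string similarity from Python's `difflib` stands in for the entity linker, a fixed threshold stands in for NIL prediction, each NIL mention simply becomes a new entity instead of being clustered, and the `review` hook, identifiers, and example names are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a linker score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(mention, kb, threshold):
    """Return the id of the best-matching entity, or None (NIL) if the
    best score is below `threshold`."""
    best_id, best_score = None, 0.0
    for entity_id, name in kb.items():
        score = similarity(mention, name)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None

def process_batches(batches, background_kb, threshold=0.8, review=None):
    """Process batches in order; NIL mentions become new entities in NEW-KB
    so that later batches can link to them."""
    new_kb = {}
    results = []
    for batch in batches:
        batch_links = []
        for mention in batch:
            entity_id = link(mention, {**background_kb, **new_kb}, threshold)
            if entity_id is None:
                # NIL prediction: no known entity is close enough.
                entity_id = f"NEW{len(new_kb) + 1}"
                new_kb[entity_id] = mention
            batch_links.append((mention, entity_id))
        if review is not None:
            # Human-in-the-loop step: a validator may fix links or
            # merge/split new entities before the next batch is processed.
            batch_links = review(batch_links, new_kb)
        results.append(batch_links)
    return results, new_kb

background_kb = {"Q1": "Money Bank"}
batches = [["Money Bank", "Anna Donati"], ["Anna Donati", "Money Bank"]]
results, new_kb = process_batches(batches, background_kb)
print(results)
print(new_kb)
```

The sketch also makes the error-propagation challenge visible: a wrong NIL decision in an early batch creates a spurious entity in NEW-KB that later batches can link to, which is where a validation step between batches helps.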
Certain application domains require HITL
end-to-end entity extraction to achieve
production-level quality
Dave: Semantic Search + HITL Annotation
All names visible in this text are made up, as is the other PI information. None of
the facts mentioned in this decision refer to the names appearing therein.
4)
CONCLUSIONS
AND FUTURE WORK
Conclusions & Future Work
n Conclusions
¨ Data linking + data extension: core semantic data enrichment tasks
¨ Tabular data and textual data
n Similar tasks: annotations >> KG construction | enriched data
n Still several challenges
¨ NIL prediction and entity clustering
¨ Incremental KB construction from tables and text
¨ HITL approach
n Interactive data enrichment to overcome intrinsic limitations
n Enrichment at scale while controlling the quality
n Future work
¨ Full-fledged HITL: learning from the user feedback
¨ Combining Generative AI and data enrichment algorithms for dialogical
data enrichment
THANKS! QUESTIONS?
Funding acknowledgements
The work presented in this presentation has received funding from the European
Union’s Horizon 2020 research and innovation program under grant agreements No
732590 (EW-Shopp) and No 732003 (euBusinessGraph), and from the European
Union’s Horizon Europe research and innovation program under grant agreement No
101070284 (enRichMyData).

Semantic Data Enrichment: a Human-in-the-Loop Perspective

  • 1. Semantic Data Enrichment: a Human-in-the-Loop Perspective Matteo Palmonari matteo.palmonari@unimib.it INSID&S Lab Department of Informatics, Systems and Communication Università degli Studi di Milano-Bicocca Seminar at INRIA – Sophie Antipolis, July 20th, 2023
  • 2. DATA SEMANTICS @ DATA SCIENCE - UNIMIB About me/this seminar… n Associate Prof. at University of Milano-Bicocca ¨ INSID&S Lab: 4 faculty / 2 assistant prof. / 4 PhD students (now) n Covered quite a broad spectrum of topics ¨ AI / Data Integration >> Knowledge Graphs (KGs) ¨ Representation Learning & NLP to track the evolution and to compare distributional representations >> Computational Social Science (CSS) n Which topic for this talk ? ¨ Human-in-the-loop (HITL) semantic data enrichment >> broad topic driving specific work; should match WIMMICS (NLP and KG) ¨ More in-depth presentation of recent work and CSS-related work >> Manuel Vimercati 2
  • 3. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Overview n Semantic Data Integration, Annotations and Data Enrichment n Semantic Enrichment of Tabular Data n HITL Tabular Data Enrichment n Towards HITL Textual Data Enrichment n Conclusions 3 **Slides contain excerpts of content created by former/currrent PhD students Vincenzo Cutrona and Riccardo Pozzi
  • 5. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Semantic Data Integration 6 Company data from .data.gouv.fr https://annuaire- entreprises.data.gouv.fr/entrep rise/sienna-real-estate- holding-france-492220553 Person Country Org. Foundations firms 'offshore' customers through banks in Wikipedia Background KG Entities in OffshoreLeaks linked to France https://offshoreleaks.icij.org/search?c=FRA&cat=0 Inspiration for this example: [Knoblock&Szekely 2015], ICIJ + Neo4J work for Panama Papers
  • 6. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Semantic Data Integration 7 Company data from .data.gouv.fr https://annuaire- entreprises.data.gouv.fr/entrep rise/sienna-real-estate- holding-france-492220553 Person Country Org. Foundations firms 'offshore' customers through banks in Wikipedia Named Entity Recognition Annotations: named entities Entities in OffshoreLeaks linked to France https://offshoreleaks.icij.org/search?c=FRA&cat=0 Background KG
  • 7. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Semantic Data Integration 8 Company data from .data.gouv.fr https://annuaire- entreprises.data.gouv.fr/entrep rise/sienna-real-estate- holding-france-492220553 Person Country Org. Foundations firms 'offshore' customers through banks in Wikipedia Named Entity Recognition (NER) Annotations: named entities Entities in OffshoreLeaks linked to France https://offshoreleaks.icij.org/search?c=FRA&cat=0 Background KG
  • 8. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Semantic Data Integration 9 Company data from .data.gouv.fr https://annuaire- entreprises.data.gouv.fr/entrep rise/sienna-real-estate- holding-france-492220553 Person Country Org. Foundations firms 'offshore' customers through banks in Wikipedia Annotations: data linking Named Entity Recognition Named Entity Linking (NEL) Entities in OffshoreLeaks linked to France https://offshoreleaks.icij.org/search?c=FRA&cat=0 Background KG
  • 9. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Semantic Data Integration 10 Company data from .data.gouv.fr https://annuaire- entreprises.data.gouv.fr/entrep rise/sienna-real-estate- holding-france-492220553 Person Country Org. Foundations firms 'offshore' customers through banks in Wikipedia Annotations: data linking Named Entity Recognition Named Entity Linking Entities in OffshoreLeaks linked to France https://offshoreleaks.icij.org/search?c=FRA&cat=0 … Sienna … Clustering NIL Prediction = = Background KG
  • 10. Techniques from Established Research Fields
      ● Texts
          ○ Annotation / Information Extraction
              ● NER and NEL: huge body of work
                  ○ Recent work on NEL: BLINK [Wu&al.EMNLP20] and GENRE [DeCao&al.TACL22] …
              ● NIL prediction and Clustering: ~less investigated
                  ○ Increased interest in the last 2 years
                  ○ [Agarwal&al.NAACL22], [Kassner&al.ACL22], [Heist&Paulheim ESWC23]
      ● Tabular data
          ○ Annotation / Semantic Table Interpretation
              ● More details in this presentation
              ● Survey: [Liu&al.JWA22] (R. Troncy and P. Monnin are co-authors :) )
  • 11. Semantic Data Enrichment and KG Construction. A shift in perspective:
      ● Users are interested in their content
      ● Background KGs useful to
          ○ support integration
          ○ extend their content with additional data
      ● The construction of a KG can be a byproduct
  • 13. Downstream Applications of Data Enrichment
      Application areas: Query Answering; Semantic Search & Data Exploration; “Traditional” ML & Data Analytics; Analyses with Representation Learning
      Contributions (applications and novel analytical methods):
      ● Criminal investigations [SDSM20]
      ● Exploring data contexts to contextualize news articles [ISWCdemo15, ESWC17]
      ● Enrichment and analysis of social media [EACLdemo17]
      ● Weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]
      ● Text-based entity embeddings and time-aware entity similarity [ISWC18]
      ● Entity evolution (+ with CADE alignment [AAAI19])
      Main data: documents; tabular data; documents
      Main projects:
      ● [Document cover] Enabling Data Enrichment Pipelines for AI-driven Business Products and Services, HORIZON-CL4-2021-DATA-01-03, D4.1: Business Cases Requirements Analysis & Specifications, Work Package 4. Type of document: Report. Dissemination level: SEN - Sensitive. Lead beneficiary: JOT. Authors: Fernando Perales and Cynthia Parrondo (JOT), Cuong Xuan Chu and Evgeny Kharlamov (BOS) …
      ● [Document cover] Elicitation of the information needs of magistrates within the semantic search and seriality system (original title: Elicitazione dei bisogni informativi dei magistrati nell’ambito del sistema di ricerca semantica e serialità)
      Legend: Applications and analytical methods; This talk
  • 14. Several Examples from Past/Ongoing Projects
      Domain | Value | Enrichment Data Sources | Data
      eCommerce | Predict impact of events on customer searches | Events, weather | Tabular
      Retail | Workforce/budget optimization | Events, weather | Tabular
      CRM | Workforce optimization | Events, weather | Tabular
      IOT | Customer flow analysis | Events, weather | Tabular
      Digital Marketing | Ad impression prediction for campaign optimization | Weather | Tabular
      Digital Marketing | Ad impression prediction for campaign optimization | Events | Tabular
      Manufacturing | AI-based analytics on welding robot data (tables and user manuals) | Proprietary ~KG | Tabular, Texts
      Manufacturing | Troubleshooting and repair based on service manuals, records, log data | Proprietary ~KG | Tabular, Texts
      Open data | Construction and maintenance of a European dataset of organizations in procurement from tenders | Proprietary ~KG, Wikidata, Crunch Base | Tabular, Texts
      Observatory on AI | Construction and maintenance of a KG to track AI-related innovations from different data sources | Crunch Base, WikiData | Tabular, Texts
      Business analysis | Cost-effective enrichment of client datasets with proprietary company KG | Proprietary KG | Tabular
      THIS PROJECT HAS RECEIVED FUNDING FROM THE EUROPEAN UNION'S HORIZON EUROPE RESEARCH AND INNOVATION PROGRAMME UNDER GRANT AGREEMENT NO 101070284.
      [Document cover] Enabling Data Enrichment Pipelines for AI-driven Business Products and Services, HORIZON-CL4-2021-DATA-01-03, D4.1: Business Cases Requirements Analysis & Specifications, Work Package 4. Type of document: Report. Dissemination level: SEN - Sensitive. Lead beneficiary: JOT. Authors: Fernando Perales and Cynthia Parrondo (JOT), Cuong Xuan Chu and Evgeny Kharlamov (BOS), Qi Gao (PHI), Alex Young and Ian Makgill (SN), Luis Rei and Besher Massri (JSI), Tao Song (BGRIMM). Version: 1.0. Due date of document: 30/06/2023. Delivery date of document: 30/06/2023.
  • 15. 2) SEMANTIC ENRICHMENT OF TABULAR DATA. “Traditional” ML & Data Analytics on tabular data: weather-based optimization in digital marketing [ISWC19, Tech. and Appl. for BDV22]
  • 16. Semantically-Enabled Optimization of Digital Marketing Campaigns Vincenzo Cutrona1, Flavio De Paoli1, Aljaž Košmerlj2, Nikolay Nikolov3, Matteo Palmonari1, Fernando Perales4, and Dumitru Roman3 1 University of Milano - Bicocca 2 Josef Stefan Institute 3 SINTEF DIGITAL 4 JOT Internet Media
  • 17. Weather-based Campaign Scheduler (Vincenzo Cutrona - Ph.D. Presentation - 25/05/2021)
      New services for campaign optimization:
      ● Main service: weather-based campaign scheduler
          ○ Predict the best dates to launch the campaign with weather-sensitive keywords
          ○ in the upcoming week
          ○ for each region
      ● + additional services
      ● Why do we focus on data enrichment?
          ○ 80% of the time in a data analysis project is spent on cleaning and enriching the data*
      [Figure: input data + additional data -> target data -> ML model -> business service]
      KEYWORD | #im | REGION | Date | geoId | C°/+0 | C°/+1
      194906 | 64 | Thuringia | 2017-03-11 | gn:2822542 | 18 | 20
      517827 | 50 | Bavaria | 2017-03-12 | gn:2951839 | 17 | 19
      459143 | 42 | Berlin | 2017-03-12 | gn:2950157 | 17 | 20
      *Worldwide Semiannual Big Data and Analytics Spending Guide from International Data Corporation (IDC)
  • 18. Data Enrichment: Digital Marketing Example. The input table and the weather service use DIFFERENT systems of identifiers
      Input table:
      KEYWORD | #im | REGION | Date
      194906 | 64 | Thuringia | 11/03/2017
      517827 | 50 | Bavaria | 12/03/2017
      459143 | 42 | Berlin | 12/03/2017
      Weather Service, keyed by regionID (GeoNames) and date (ISO 8601):
      city: 2950157 - date: 2017-03-12, 2t: 17 - date: 2017-03-13, 2t: 20
  • 19. Data Enrichment: Digital Marketing Example. STEP 1, VALUE MANIPULATION: the Date column is rewritten from DD/MM/YYYY to ISO 8601
      KEYWORD | #im | REGION | Date (before) | Date (after)
      194906 | 64 | Thuringia | 11/03/2017 | 2017-03-11
      517827 | 50 | Bavaria | 12/03/2017 | 2017-03-12
      459143 | 42 | Berlin | 12/03/2017 | 2017-03-12
  • 20. Data Enrichment: Digital Marketing Example. STEP 2, LINKING: each REGION value is linked to a GeoNames identifier (the region, not the city)
      KEYWORD | #im | REGION | Date | geoId
      194906 | 64 | Thuringia | 2017-03-11 | gn:2822542
      517827 | 50 | Bavaria | 2017-03-12 | gn:2951839
      459143 | 42 | Berlin | 2017-03-12 | gn:2950157
  • 21. Data Enrichment: Digital Marketing Example. STEP 3, EXTENSION: with EQUAL systems of identifiers, the table is extended with data from the weather service
      Weather Service, keyed by cityID (GeoNames) and date (ISO 8601):
      city: 2950157 - date: 2017-03-12, 2t: 17 - date: 2017-03-13, 2t: 20
      KEYWORD | #im | REGION | Date | geoId | C°/+0 | C°/+1
      194906 | 64 | Thuringia | 2017-03-11 | gn:2822542 | 18 | 20
      517827 | 50 | Bavaria | 2017-03-12 | gn:2951839 | 17 | 19
      459143 | 42 | Berlin | 2017-03-12 | gn:2950157 | 17 | 20
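The three steps of the digital marketing example (value manipulation, linking, extension) can be sketched in Python as below. This is only an illustration under stated assumptions, not the pipeline used in the project: `REGION_TO_GEONAMES` and `FORECASTS` are hypothetical in-memory stand-ins for a reconciliation service and for the weather service, the function names are invented for this sketch, and the pairing of regions, GeoNames IDs and temperatures assumes that the values appear on the slides in row order.

```python
from datetime import date, datetime, timedelta

# Hypothetical stand-in for a reconciliation service: region name -> GeoNames ID.
REGION_TO_GEONAMES = {
    "Thuringia": "gn:2822542",
    "Bavaria": "gn:2951839",
    "Berlin": "gn:2950157",
}

# Hypothetical stand-in for the weather service:
# (GeoNames ID, ISO date) -> forecast temperature in degrees Celsius.
FORECASTS = {
    ("gn:2822542", "2017-03-11"): 18, ("gn:2822542", "2017-03-12"): 20,
    ("gn:2951839", "2017-03-12"): 17, ("gn:2951839", "2017-03-13"): 19,
    ("gn:2950157", "2017-03-12"): 17, ("gn:2950157", "2017-03-13"): 20,
}


def normalize_date(value: str) -> str:
    """Step 1 (value manipulation): DD/MM/YYYY -> ISO 8601."""
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()


def link_region(name: str):
    """Step 2 (linking): return the GeoNames ID, or None when nothing matches."""
    return REGION_TO_GEONAMES.get(name)


def extend_with_weather(geo_id, iso_date: str, offsets=(0, 1)) -> dict:
    """Step 3 (extension): look up the forecast for the date and the day after."""
    start = date.fromisoformat(iso_date)
    return {
        f"temp_c_plus_{d}": FORECASTS.get(
            (geo_id, (start + timedelta(days=d)).isoformat())
        )
        for d in offsets
    }


def enrich(row: dict) -> dict:
    """Apply the three steps to one row, returning a new extended row."""
    out = dict(row)
    out["Date"] = normalize_date(row["Date"])
    out["geoId"] = link_region(row["REGION"])
    out.update(extend_with_weather(out["geoId"], out["Date"]))
    return out


rows = [
    {"KEYWORD": "194906", "#im": 64, "REGION": "Thuringia", "Date": "11/03/2017"},
    {"KEYWORD": "517827", "#im": 50, "REGION": "Bavaria", "Date": "12/03/2017"},
    {"KEYWORD": "459143", "#im": 42, "REGION": "Berlin", "Date": "12/03/2017"},
]
enriched = [enrich(r) for r in rows]
```

The extension step only works because steps 1 and 2 first bring the table onto the same identifier systems as the service (ISO 8601 dates and GeoNames IDs).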
  • 22. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Vincenzo Cutrona - Ph.D. Presentation - 25/05/2021 Semantic Data Enrichment: Problem Statement ● Inputs: ○ a source dataset ○ a pool of reference data sources Data Enrichment: a path on the data transformations graph GT Semantic Data Enrichment: at least one node is linking 23 ● Output: ○ the source dataset extended with modified/additional columns Linking Extension Value manipulation source output external data sources reference KGs Large data volumes Unknown or little-known, large and complex data sources Intrinsic uncertainty
  • 23. DATA SEMANTICS @ DATA SCIENCE - UNIMIB Vincenzo Cutrona - Ph.D. Presentation - 25/05/2021 Semantic Data Enrichment: Problem Statement ● Inputs: ○ a source dataset ○ a pool of reference data sources Data Enrichment: a path on the data transformations graph GT Semantic Data Enrichment: at least one node is linking 24 ● Output: ○ the source dataset extended with modified/additional columns Linking Extension Value manipulation source output external data sources reference KGs Large data volumes Unknown or little-known, large and complex data sources Intrinsic uncertainty Annotations from algorithms
  • 25. ARTIFICIAL INTELLIGENCE @UNIMIB Semantic Table Interpretation 26 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column of the table is associated with one or more types (CTA) ● each cell in the table is annotated with the entity in the catalog (CEA) ● each pair of columns is annotated with a binary relation in the catalog (CPA) Name Coordinates Height Range Le Mont Blanc 45°49′57″N 06°51′52″E 4808 M. Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mont_Blanc MontBlanc Massif 4808 dbo:elevation Schema level Entity level
  • 26. ARTIFICIAL INTELLIGENCE @UNIMIB 27 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column is associated with one or more KG-types (CTA) ● each cell in the table is annotated with the entity in the catalog (CEA) ● each pair of columns is annotated with a binary relation in the catalog (CPA) Name Coordinates Height Range Le Mont Blanc 45°49′57″N 06°51′52″E 4808 M. Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mountain xsd:string xsd:integer Mountain Range Mont_Blanc MontBlanc Massif 4808 dbo:elevation Schema level Entity level Semantic Table Interpretation
  • 27. ARTIFICIAL INTELLIGENCE @UNIMIB Name Coordinates Height Range Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps 28 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column is associated with one or more KG-types (CTA) ● each cell in “entity columns” is annotated with a KG-entity (CEA) ● each pair of columns is annotated with a binary relation in the catalog (CPA) KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mountain xsd:string xsd:integer Mountain Range Mont_Blanc MontBlanc Massif 4808 dbo:elevation Schema level Entity level Mont_Blanc MontBlanc Massif Semantic Table Interpretation
  • 28. ARTIFICIAL INTELLIGENCE @UNIMIB Name Coordinates Height Range Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps 29 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column is associated with one or more KG-types (CTA) ● each cell in “entity columns” is annotated with a KG-entity (CEA) ● each pair of columns is annotated with a binary relation in the catalog (CPA) KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mountain xsd:string xsd:integer Mountain Range Mont_Blanc MontBlanc Massif 4808 dbo:elevation Schema level Entity level Mont_Blanc MontBlanc Massif Subject column Named-Entity column Literal column Also referred to as “entity linking” (for tables) Semantic Table Interpretation
  • 29. ARTIFICIAL INTELLIGENCE @UNIMIB Name Coordinates Height Range Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps 30 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column is associated with one or more KG-types (CTA) ● each cell in “entity columns” is annotated with a KG-entity (CEA) ● some pair of columns is annotated with a binary KG-predicate (CPA) KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mountain xsd:string xsd:integer Mountain Range Mont_Blanc MontBlanc Massif 4808 dbo:elevation Schema level Entity level Mont_Blanc MontBlanc Massif dbo:mountainRange dbo:elevation georss:point Semantic Table Interpretation
  • 30. ARTIFICIAL INTELLIGENCE @UNIMIB Name Coordinates Height Range Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps 31 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column is associated with one or more KG-types (CTA) ● each cell in “entity columns” is annotated with a KG-entity (CEA) ● some pair of columns is annotated with a binary KG-predicate (CPA) KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mountain xsd:string xsd:integer Mountain Range Mont_Blanc MontBlanc Massif 4808 Schema level Entity level Mont_Blanc MontBlanc Massif dbo:mountainRange dbo:elevation georss:point dbo:mountainRange dbo:elevation georss:point 45°49′57″N 06°51′52″E Semantic Table Interpretation … for KG completion
  • 31. ARTIFICIAL INTELLIGENCE @UNIMIB Name Coordinates Height Range Mont Blanc 45°49′57″N 06°51′52″E 4808 Mont Blanc massif Hohtälli 45°98’96″N 07°80’25″E 3275 Pennine Alps Monte Cervino 45°58′35″N 07°39′31″E 4478 Pennine Alps 32 Given ● a relational table T ● a Knowledge Graph (entities + statements) and an ontology (types + predicates) T is annotated when: ● each column is associated with one or more KG-types (CTA) ● each cell in “entity columns” is annotated with a KG-entity or with NIL (if not in the KG) ● some pair of columns is annotated with a binary KG-predicate (CPA) KNOWLEDGE GRAPH Mountain Range Mountain xsd:integer xsd:string Natural Place georss:point dbo:elevation dbo:mountainRange … … Mountain xsd:string xsd:integer Mountain Range Mont_Blanc MontBlanc Massif 4808 Schema level Entity level Mont_Blanc MontBlanc Massif dbo:mountainRange dbo:elevation georss:point dbo:mountainRange dbo:elevation georss:point 45°49′57″N 06°51′52″E Semantic Table Interpretation … with novel entities Pennine Alps Monte Cervino [NIL: Hohtälli] Pennine Alps
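The three annotation types (CTA, CEA with NIL, CPA) can be held in a small data structure, sketched here for the mountains table. The class `TableAnnotation`, its field layout and the `novel_mentions` helper are assumptions made for illustration; the entity and type names are the labels shown on the slides, not full KG IRIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

NIL = None  # marks a mention that has no matching entity in the KG


@dataclass
class TableAnnotation:
    # CTA: column index -> one or more KG types (or datatypes for literal columns)
    cta: Dict[int, List[str]] = field(default_factory=dict)
    # CEA: (row, column) -> KG entity, or NIL for a novel entity
    cea: Dict[Tuple[int, int], Optional[str]] = field(default_factory=dict)
    # CPA: (subject column, object column) -> KG predicate
    cpa: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def novel_mentions(self, table) -> List[str]:
        """Return the cell values annotated as NIL, in row/column order."""
        return [
            table[r][c]
            for (r, c), entity in sorted(self.cea.items())
            if entity is NIL
        ]


table = [
    ["Mont Blanc", "45°49′57″N 06°51′52″E", 4808, "Mont Blanc massif"],
    ["Hohtälli", "45°98’96″N 07°80’25″E", 3275, "Pennine Alps"],
    ["Monte Cervino", "45°58′35″N 07°39′31″E", 4478, "Pennine Alps"],
]

annotation = TableAnnotation(
    cta={0: ["Mountain"], 1: ["xsd:string"], 2: ["xsd:integer"], 3: ["Mountain Range"]},
    cea={
        (0, 0): "Mont_Blanc", (0, 3): "MontBlanc Massif",
        (1, 0): NIL, (1, 3): "Pennine Alps",
        (2, 0): "Monte Cervino", (2, 3): "Pennine Alps",
    },
    cpa={(0, 1): "georss:point", (0, 2): "dbo:elevation", (0, 3): "dbo:mountainRange"},
)

print(annotation.novel_mentions(table))
```

Only cells in entity columns receive a CEA entry; literal columns such as Height carry a datatype in CTA and a predicate in CPA.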
  • 32. INSID&S Contributions
      ● Entity linking in tables
          ○ Soft filters to filter candidate entities based on type embedding similarity [SEMANTICS’21]
          ○ LamAPI: supporting indexing and matching [OM@ISWC’22]
      ● End-to-end STI
          ○ s-elBat: dealing with messy tables [SemTab@ISWC’22]*
          ○ MantisTable [Fut.Gen.Internet’20,SemTab@ISWC’19-21]*
      ● Evaluation & datasets
          ○ Tough Tables: misspelling and noisy labels [ISWC’20]
          ○ MammoTab: large dataset of annotated tables, to learn neural linking algorithms and evaluate them [SemTab@ISWC’22]
      ● Participation in STI Challenges
          ○ 2019-2022
      http://www.cs.ox.ac.uk/isg/challenges/sem-tab/
  • 33. Recap: Annotations, Enrichment and KG Construction
      ● Enrichment
          ○ Table annotation: schema mapping, entity linking
          ○ Table augmentation: with links and data extension services
      ● Exploitation
          ○ Export: graph (table to graph transformations) -> KG generation, KG completion
          ○ Export: tabular data -> downstream analysis
  • 35. ARTIFICIAL INTELLIGENCE @UNIMIB 36 1 – User interfaces for interactive data annotation and enrichment 2.B) HITL TABULAR DATA ENRICHMENT
  • 36. ARTIFICIAL INTELLIGENCE @UNIMIB ASIA: Assisted Semantic Interpretation and Annotation of tabular data • Interactive annotation • Execute linking services • Exploit vocabulary suggestions from ABSTAT […, VLDBJ21] • Edit / revise annotations Table Vocabulary suggestions and search Cutrona, V., Ciavotta, M., De Paoli, F., & Palmonari, M. (2019). ASIA: A tool for assisted semantic interpretation and annotation of tabular data. In Proceedings of ISWC Demo Papers [ISWCdemo19] • Interactive extension • Execute data extension services specifying parameters from the interface
  • 37. SemTUI – Interactive Semantic Enrichment of Tabular Data. Support to Linking – Revision – Extension of tabular data
      ● UI accessing external services
          ○ STI (full)
              ● s-elBat
          ○ Reconciliation/linking services (OpenRefine interface)
              ● Geonames
              ● WikiData
              ● DBpedia
              ● Atoka-linking (SpazioDati)
          ○ Extension services
              ● WikiData / DBpedia (SPARQL)
              ● Weather extension (ECMWF)
              ● HERE (georeferencing)
              ● Shortest-route
              ● Atoka-extension (SpazioDati)
              ● …
      ● Graphical view & revision of annotations
          ○ Global and specific annotation rendering
          ○ Single cell editing / annotation revision
          ○ Column annotation revision
      Ripamonti, M., De Paoli, F., & Palmonari, M. (2022). SemTUI: a Framework for the Interactive Semantic Enrichment of Tabular Data. arXiv preprint arXiv:2203.09521.
  • 38. ARTIFICIAL INTELLIGENCE @UNIMIB 39 2.B) HITL TABULAR DATA ENRICHMENT 2 – Make data enrichment pipelines scalable
  • 39. ARTIFICIAL INTELLIGENCE @UNIMIB ● Remember: enrichment ~ sequence of transformations that can be executed (batch mode) ● A two-step paradigm [ISWC19,ISWC19demo,Tech.andAppl.for BDV22] ● Small-scale design ● Algorithms + UI to specify annotations and data extensions on a data sample ● Large-scale execution ● Big data technologies to speed up large-scale execution of transformations on large data ● Docker ● Parallelization ● … Annotation for Tabular Data Enrichment at Scale 40 SAMPLE QUALITY INSIGHTS ENRICHMENT DESIGN QUALITY ASSESSMENT STACK CONFIGURATION ENRICHED SAMPLE DATASET ENRICHED DATASET SMALL-SIZE PROCESSING TRANSFORMATION MODEL BATCH PROCESSING Ciavotta, M., Cutrona, V., De Paoli, F., Nikolov, N., Palmonari, M., & Roman, D. (2022). Supporting semantic data enrichment at scale. In Technologies and Applications for Big Data Value (pp. 19-39). Cham: Springer International Publishing. [Tech.andAppl.for BDV22]
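The two-step paradigm (design on a sample, then batch execution) can be illustrated with a transformation pipeline that is defined once and reused unchanged. This is a minimal sketch under stated assumptions: the step functions and `run_in_batches` are hypothetical names, and plain sequential batching stands in for the big data technologies (Docker, parallelization) mentioned on the slide.

```python
from datetime import datetime
from itertools import islice


def trim_region(row: dict) -> dict:
    """Value manipulation: strip stray whitespace from the REGION value."""
    row = dict(row)
    row["REGION"] = row["REGION"].strip()
    return row


def to_iso_date(row: dict) -> dict:
    """Value manipulation: DD/MM/YYYY -> ISO 8601."""
    row = dict(row)
    row["Date"] = datetime.strptime(row["Date"], "%d/%m/%Y").date().isoformat()
    return row


def apply_pipeline(row: dict, pipeline) -> dict:
    """Run one row through the ordered list of transformations."""
    for step in pipeline:
        row = step(row)
    return row


def run_in_batches(rows, pipeline, batch_size=1000):
    """Large-scale execution: consume any iterable lazily, one batch at a time."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield [apply_pipeline(r, pipeline) for r in batch]


# Small-scale design: the pipeline is specified and checked on a data sample.
pipeline = [trim_region, to_iso_date]
sample = [{"REGION": " Berlin ", "Date": "12/03/2017"}]
enriched_sample = [apply_pipeline(r, pipeline) for r in sample]

# Large-scale execution: the same pipeline object runs over the full dataset.
dataset = ({"REGION": " Berlin ", "Date": "12/03/2017"} for _ in range(2500))
batches = list(run_in_batches(dataset, pipeline, batch_size=1000))
```

Because each batch is processed independently, batches could be dispatched to parallel workers without changing the pipeline definition.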
  • 40. ARTIFICIAL INTELLIGENCE @UNIMIB 41 2.B) HITL TABULAR DATA ENRICHMENT 3 – Deeper integration of UI and algorithms (ongoing)
  • 41. Challenges: Entity Disambiguation and Ranking in Tables
      title | director | release year | domestic distributor | length in min | worldwide gross
      jurassic world | colin trevorrow | 2015 | universal pictures | 124 | 1670400637
      [Figure: three candidate Wikidata entities for the mention “jurassic world”, with their graph neighbourhoods, marked ✔ 🚫 🚫 in the figure:
      Q3512046 (Jurassic World): 12 June 2015, 124, 1670400637;
      Q20647533 (Jurassic World): 2015;
      Q21877685 (Jurassic World): 22 June 2018, 128, 1309500000.
      Neighbouring nodes: Q13377 (Universal Pictures), Q5145625 (Colin Trevorrow), Q937857 (Michael Giacchino), Q937857 (Colin Trevorrow), Q17862144 (Jurassic Park), Q932019 (J. A. Bayona), ...
      Edge labels: P577 (publication date), P2047 (duration), P2142 (box office), P272 (production company), P750 (distributed by), P57 (director), P58 (screenwriter), P175 (performer), P179 (part of the series)]
  • 42. Challenges: Novel Entities
      ● Linking with NIL prediction
          ○ Detection of novel entities
          ○ Underrepresented task in benchmark data
              ● Greedy algorithms often rewarded
          ○ Important problem in real-world data enrichment settings
              ● E.g., only a fraction of the organizations in tables not extracted/constructed from WikiData have links to WikiData
  • 43. HITL in Linking Tasks
      ● Personal background on HITL approaches
          ○ Ontology matching with multi-user feedback [SWJ’16,KEOD’17]
          ○ Active learning to rank for semantic association relevance [ESWC’17]
      ● Objective
          ○ Maximize quality while minimizing user effort
      ● Two levels
          ○ Fast revision
              ● Revise first the links that are more likely to be incorrect
          ○ Learning from the user feedback
              ● Feedback propagation, learn from limited data
  • 44. s-elBat ’22 >> ’23
      ● [SemTab22]:
          ○ Ad-hoc transformation of features into an unbounded ranking score
      ● New:
          ○ NN-based transformation into a bounded confidence score ω ∈ [0,1]
          ○ NIL prediction with threshold
      [Figure labels: mention vs labels; row vs properties; row vs description; predicates and types hits]
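Once every candidate carries a bounded confidence score, NIL prediction with a threshold reduces to a simple decision rule, sketched below. The function `decide_link`, the default threshold of 0.5 and the confidence values are hypothetical; the candidate identifiers are those of the "jurassic world" example, and the neural network that would produce the scores is not modelled here.

```python
from typing import List, Optional, Tuple


def decide_link(
    candidates: List[Tuple[str, float]], threshold: float = 0.5
) -> Tuple[Optional[str], float]:
    """Return (entity, confidence) for the best candidate, or (None, confidence)
    when the best confidence falls below the threshold, i.e. a NIL prediction."""
    if not candidates:
        return None, 0.0
    entity, confidence = max(candidates, key=lambda pair: pair[1])
    if confidence >= threshold:
        return entity, confidence
    return None, confidence


# Hypothetical confidence scores in [0, 1] for the mention "jurassic world".
candidates = [("Q3512046", 0.92), ("Q20647533", 0.31), ("Q21877685", 0.27)]

linked = decide_link(candidates)                   # best candidate is accepted
rejected = decide_link(candidates, threshold=0.95) # same candidates, stricter threshold
```

A bounded score makes the threshold comparable across tables, which an unbounded ranking score does not guarantee.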
  • 45. Entity Linking with NIL Prediction
      ● Confidence-based revision:
          ○ Use the confidence score to order links to revise
              ● E.g., mentions with lower confidence first, i.e., order all mentions m by increasing ω_m
              ● E.g., mentions that are more uncertain first, i.e., order all mentions m by distance of ω_m from the threshold
          ○ Optimal k for ranking is learned on the train set (maximize F-1/minimize revisions)
      [Figure: linking pipeline with the components Entity Retriever (ER), Feature Generator, PN-Θ, Feature Refiner, RN-Θ, Decision ω(δ, ρ, k), σ, plus learning from human feedback (Θ <δ,ρ,σ>) and smart revision. Legend: candidates for the i-th row values in the j-th column; top-k candidates from ER with features; normalized scores from PN (p_{i,j,h} ∈ [0,1]); column-wise type-consistency features added from other rows; candidates for the values in the other cells in the j-th column; refined matching scores (ρ_{i,j,h} ∈ [0,1]); confidence score; Link | Not Link decision]
      [Formula: smart-revision score, garbled in the source transcript]
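The two revision orders named on the slide can be written down directly. This is a minimal sketch: the function names, the mention identifiers and the confidence values are hypothetical, and the role of the learned parameter k is not modelled.

```python
from typing import Dict, List


def order_by_confidence(confidence: Dict[str, float]) -> List[str]:
    """Lowest confidence first: order mentions by increasing confidence."""
    return sorted(confidence, key=confidence.get)


def order_by_uncertainty(confidence: Dict[str, float], threshold: float) -> List[str]:
    """Most uncertain first: order mentions by distance of confidence from the threshold."""
    return sorted(confidence, key=lambda m: abs(confidence[m] - threshold))


# Hypothetical confidence scores for four mentions.
confidence = {"m1": 0.95, "m2": 0.10, "m3": 0.55, "m4": 0.40}

by_confidence = order_by_confidence(confidence)
by_uncertainty = order_by_uncertainty(confidence, threshold=0.5)
```

The two orders differ on confident NIL predictions: a mention scored 0.10 is revised first under the first criterion, but late under the second, because it lies far from the threshold.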
  • 46. Experimental Settings
      ● Evaluate
          ○ Quality of the links with NIL prediction in ~ out-of-domain training settings
              ● Main: F-1 compared with top SemTab scorers (greedy algorithms)
                  ○ k-fold validation with out-of-domain testing (5 datasets for training, 1 for testing)
              ● Ablation: impact of different components (ranking + PN + RN)
              ● Ablation: impact of parameter k (final matching score vs distance between top candidates)
          ○ Effectiveness of the uncertainty measure to support smart revision
              ● Main: increase in link quality at incremental revision iterations
                  ○ User revision simulated with an Oracle
                  ○ Area Under the Curve of F-1 at increasing number of revised mentions
                  ○ Fair experimental simplification: global ranking (all tables) vs. local ranking (one table)
              ● Ablation: impact of parameter k for ordering mentions to be revised
      ● Benchmark data
          ○ Links to DBpedia | WikiData
          ○ Tables may introduce specific/different challenges
      TABLE I. STATISTICS OF THE DATASETS USED IN THE EXPERIMENTS
      dataset | # tables | # columns | # rows | # entities (CEA) | # classes (CTA) | # predicates (CPA)
      Round1 T2D | 64 | 323 | 9089 | 8078 | 119 | 115
      Round3 | 2161 | 9736 | 152753 | 390456 | 5761 | 7574
      Round4 | 22207 | 78750 | 475897 | 994920 | 31921 | 56475
      2T-2020 | 180 | 802 | 194438 | 667243 | 539 | 0
      HardTableR2 | 1750 | 5589 | 29280 | 47439 | 2190 | 3835
      HardTableR3 | 7207 | 17902 | 58949 | 58948 | 7206 | 10694
      [Excerpts from the accompanying paper shown on the slide; the right-hand column is cut off in the source]
      … of the problem, including textual, semantic, and contextual information, to enhance the entity resolution process. Table II provides details about the architecture of the neural network employed in this study. It is a plain feed-forward neural network, whose hyper-parameters were determined through preliminary experiments. In recent times, deep networks have demonstrated remarkable potential in handling increasingly difficult and complex tasks, often rivaling or even surpassing human capabilities. These networks are typically built using highly intricate architectures. However, for this particular work, we opted for an approach that prioritizes simplicity and speed, while still maintaining excellent learning capability and generalization. Although we acknowledge that the network’s classification capability could be enhanced, devising an architecture optimized for the candidate ranking task is beyond the scope and objectives of this paper.
      TABLE II. MODEL ARCHITECTURE [table body not included in the transcript]
      Sorting Ω produces a global ranking of candidates associated with mentions that can be used to split the mentions into linked and unlinked subsets. The intuition is that candidates with the highest ω can be considered correct and the others of uncertain classification. Human revie[…] help disambiguate uncertain cases.
      D. Human Revision. The process described so far automatically classifies […] mention as either linked or unlinked. Subsequently, the annotations are presented to the user, who reviews the results and verifies or corrects the annotations generated by the machine algorithm. The ordered set Ω facilitates the assessment of the […] of uncertainty associated with each link. This empowers the user to determine which cases should be prioritized for review based on the estimated degree of uncertainty involved. Considering that manual link review is a time-consuming task, a straightforward criterion is to commence from …
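The evaluation protocol for smart revision (an oracle that fixes the first n mentions of a revision order, F1 measured at increasing revised fractions, and the area under that curve) can be sketched as follows. This is an illustrative assumption-laden toy, not the experimental code: the F1 definition used here (precision and recall over non-NIL links), the trapezoidal area, the function names and the four-mention example are all choices made for this sketch.

```python
from typing import Dict, List, Optional, Sequence


def f1_score(pred: Dict[str, Optional[str]], gold: Dict[str, Optional[str]]) -> float:
    """F1 over non-NIL links; None stands for NIL."""
    tp = sum(1 for m, e in pred.items() if e is not None and gold[m] == e)
    n_pred = sum(1 for e in pred.values() if e is not None)
    n_gold = sum(1 for e in gold.values() if e is not None)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def revision_curve(pred, gold, order: List[str], fractions: Sequence[float]) -> List[float]:
    """Oracle revision: the first n mentions in `order` are replaced by the gold link."""
    curve = []
    for fraction in fractions:
        n = round(fraction * len(order))
        revised = dict(pred)
        for mention in order[:n]:
            revised[mention] = gold[mention]
        curve.append(f1_score(revised, gold))
    return curve


def area(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Trapezoidal area under the curve defined by the points (xs[i], ys[i])."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(len(xs) - 1)
    )


# Toy example: m2 is linked to the wrong entity, m3 should have been NIL.
gold = {"m1": "A", "m2": "B", "m3": None, "m4": "D"}
pred = {"m1": "A", "m2": "X", "m3": "C", "m4": "D"}
fractions = [0.0, 0.5, 1.0]

smart_curve = revision_curve(pred, gold, ["m2", "m3", "m1", "m4"], fractions)
late_curve = revision_curve(pred, gold, ["m1", "m4", "m2", "m3"], fractions)
```

An order that surfaces the wrong links early reaches the maximum F1 after half of the revisions, so its area under the curve is larger than for an order that reaches them last.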
  • 47. ARTIFICIAL INTELLIGENCE @UNIMIB Experimental Results n Entity linking (main) ¨ All components are relevant ¨ Competitive results despite NIL prediction (benchmark data reward greedy decisions) ¨ Gaps on test sets with specific data distributions (also due to retrieval module) 48 n Smart revision (main) ¨ Confidence-based revision >> faster than >> random revision TABLE III F1 FOR EACH STEP IN THE LINKING WORKFLOW Test Dataset Retrieval with indexing PN ranking PN + RN ranking with types SemTab Top Scorer F1 F1 F1 F1 Round T2D 0.82 0.83 0.86 0.90 Round3 0.72 0.73 0.76 0.97 Round4 0.83 0.90 0.91 0.99 2T-2020 0.62 0.86 0.89 0.90 HardTableR2 0.90 0.91 0.93 0.98 HardTableR3 0.52 0.54 0.62 0.97 TABLE IV F1 WITH HITL INCREMENTAL PERCENTAGE OF REVIEWS Test Dataset k 10% 20% 30% 40% 50% F1 F1 F1 F1 F1 Round T2D 0.4 0.91 0.95 0.97 0.98 0.98 Round3 0.5 0.82 0.87 0.94 0.97 0.98 Round4 0.1 0.95 0.97 0.98 0.99 0.99 2T-2020 0.9 0.93 0.94 0.95 0.96 0.98 HardTableR2 0.4 0.98 0.99 1.0 1.0 1.0 HardTableR3 0.4 0.68 0.75 0.81 0.86 0.90 Fig. 3. F1 and AUC computed for the test dataset. The results provide evidence that the learned value of k demonstrates even better performance on the test dataset, achieving a remarkable AUC value of 0.9929 and an F1 score above 0.98 after examining only 10% of the mentions. Table IV presents the results obtained from the experiments conducted on all datasets. The outcomes are consistent with the aforementioned discussion. 
Specifically, it is evident that in the case of outlier datasets, such as Round3, even with less than 30% of reviews, the F1 score surpasses 0.90, whereas the performance of the highest-scoring participant in the [...] Challenge (refer to Table III) is achieved with 40% of [...]. Moreover, for datasets with fewer typos, the threshold F1 > 0.90 is attained much earlier. As an illustration, [...] maximum F1 score of 0.98 is accomplished after reviewing only 10% of the [...]

[...] the model's predictive quality, irrespective of the chosen classification threshold. Fig. 2 shows the values of F1 calculated for different percentages of links to be reviewed and different values of k. The embedded table reports the performance measures AUC. The figure refers to the experiment with the fold that excludes the HardTable-R2 dataset. The evidence is that we need to review at most 30% of mentions of the training set to reach 0.98 for F1 and that almost any value of k produces similar results. The best value [...]

Fig. 3. F1 and AUC computed for the test dataset.

The results provide evidence that the learned va[...] demonstrates even better performance on the test [...], achieving a remarkable AUC value of 0.9929 and an [...] above 0.98 after examining only 10% of the mentions. Table IV presents the results obtained from the experiments conducted on all datasets. The outcomes are consistent with the aforementioned discussion. The lessons l[...]

TABLE III: F1 FOR EACH STEP IN THE LINKING WORKFLOW
Test Dataset | Retrieval with indexing | PN ranking | PN + RN ranking with types | SemTab Top Scorer
Round T2D    | 0.82 | 0.83 | 0.86 | 0.90
Round3       | 0.72 | 0.73 | 0.76 | 0.97
Round4       | 0.83 | 0.90 | 0.91 | 0.99
2T-2020      | 0.62 | 0.86 | 0.89 | 0.90
HardTableR2  | 0.90 | 0.91 | 0.93 | 0.98
HardTableR3  | 0.52 | 0.54 | 0.62 | 0.97

TABLE IV: F1 WITH HITL INCREMENTAL PERCENTAGE OF REVIEWS
Test Dataset | k   | F1 at 10% | F1 at 20% | F1 at 30% | F1 at 40% | F1 at 50%
Round T2D    | 0.4 | 0.91 | 0.95 | 0.97 | 0.98 | 0.98
Round3       | 0.5 | 0.82 | 0.87 | 0.94 | 0.97 | 0.98
Round4       | 0.1 | 0.95 | 0.97 | 0.98 | 0.99 | 0.99
2T-2020      | 0.9 | 0.93 | 0.94 | 0.95 | 0.96 | 0.98
HardTableR2  | 0.4 | 0.98 | 0.99 | 1.0  | 1.0  | 1.0
HardTableR3  | 0.4 | 0.68 | 0.75 | 0.81 | 0.86 | 0.90

Also: more interpretable scores for human interaction
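The review budgets in Table IV can be read as a simple loop: rank the links by model confidence, send the least confident fraction to a human reviewer, and recompute F1 after the corrections. The Python sketch below illustrates that idea only. The `(predicted, gold, confidence)` triples, the function names, and the assumption that the reviewer always supplies the gold link are hypothetical, and the role of the parameter k in the source is not modeled.

```python
from typing import List, Optional, Tuple

# Hypothetical record: (predicted entity id or None, gold entity id or None,
# model confidence in [0, 1]). None means "no link".
Link = Tuple[Optional[str], Optional[str], float]


def f1_score(links: List[Link]) -> float:
    """F1 over links, counting a prediction as correct when it equals the gold id."""
    predicted = sum(1 for pred, _, _ in links if pred is not None)
    expected = sum(1 for _, gold, _ in links if gold is not None)
    correct = sum(1 for pred, gold, _ in links if pred is not None and pred == gold)
    if predicted == 0 or expected == 0 or correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)


def review_lowest_confidence(links: List[Link], fraction: float) -> List[Link]:
    """Simulate a reviewer who fixes the `fraction` of links with lowest confidence.

    Assumption: the reviewer is an oracle and replaces each reviewed
    prediction with the gold link.
    """
    ranked = sorted(links, key=lambda link: link[2])
    n_review = max(0, min(len(ranked), int(round(fraction * len(ranked)))))
    reviewed = [(gold, gold, 1.0) for _, gold, _ in ranked[:n_review]]
    return reviewed + ranked[n_review:]


if __name__ == "__main__":
    # Toy data, made up for illustration; not taken from the SemTab datasets.
    example = [
        ("Q1", "Q1", 0.95),
        ("Q9", "Q3", 0.20),
        ("Q8", "Q5", 0.30),
        ("Q4", "Q4", 0.80),
    ]
    for share in (0.0, 0.25, 0.5):
        print(share, f1_score(review_lowest_confidence(example, share)))
```

The sketch only pays off when the confidence score actually ranks wrong links first, which is what the AUC values reported in the excerpt are meant to measure.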
  • 48. 3) HITL FOR TEXTUAL DATA ENRICHMENT (LEGAL DOMAIN)
  • 49. Entity Extraction from Legal Documents
      - Mentions: "[...] A. Donati [...]", "[...] Dott.sa Donati [...]"; KB: Anna Donati
      - Diagram labels: NER; Enriched text; Enrichment + KG construction
      - End-to-end entity extraction with background KG (beyond NER)
      - Paper excerpt (truncated): Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking, IJCKG'22. Table 3: Statistics about the d[...]: train 2.2M, dev 10k, test 10k; train 2.008M, dev 100k, test 100k [...]. "[...] ground truth on NILs) and uses as most state-of-the-art NEL a reference KB [3, 6, 35]. Observe [...]"
  • 50. Target Applications
      - Court decisions (texts)
          - Semantic search
              - E.g., find all decisions in controversies with [Money Bank] in [2008]
          - Anonymization
              - E.g., replace all occurrences of persons with *****
          - Advanced statistics
              - E.g., count all decisions in controversies with banks in [2008]-[2018]
      - Criminal investigations (texts + tabular data + ..)
          - Search on investigation files and report writing
              - E.g., all paragraphs mentioning [J.Smith]
          - Analyze files that are hard to analyze in a timely manner today (chats, audio, files, …)
              - E.g., all messages/chats where [J.Smith] wrote to [A.Black] about [L.Red]
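The anonymization use case can be illustrated with a few lines of Python, assuming the entity extraction step has already produced character-offset annotations. The `(start, end, type)` span format, the `PERSON` label, and the function name are assumptions made for this sketch and are not taken from the slides.

```python
from typing import List, Sequence, Tuple

# Hypothetical annotation format: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def anonymize(
    text: str,
    spans: List[Span],
    types: Sequence[str] = ("PERSON",),
    mask: str = "*****",
) -> str:
    """Replace every annotated mention whose type is in `types` with `mask`."""
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        # Skip types we do not mask and spans overlapping an already masked one.
        if entity_type not in types or start < cursor:
            continue
        pieces.append(text[cursor:start])
        pieces.append(mask)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    # Made-up sentence reusing the placeholder names from the slide.
    sentence = "J. Smith sued Money Bank in 2008."
    annotations = [(0, 8, "PERSON"), (14, 24, "ORG")]
    print(anonymize(sentence, annotations))
```

Masking by annotation rather than by string matching means that a missed mention leaks the name, which is one reason a human review step matters in this domain.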
  • 51. Incremental Entity Extraction and Linking: Evaluation
      - Figure 1 [IJCKG22]: Documents are processed in batches through time; at each iteration, novel entities are added into the NEW-KB and can be linked in following steps. Between each step, a human validator can correct pipeline mistakes, splitting/merging clusters and fixing links.
      - Figure labels: Batches of documents acquired at different time points; Background KB (e.g., Wikipedia); KB with NEW entities
      - Use case: build a KB from criminal investigation documents/data
      - Dataset
          - Split of WikilinksNED Unseen-Mentions in 10 batches [Onoe&DurrettAAAI20]
          - Injection/transplant of NIL entities (~same overall %)
      - Main challenges
          - Error propagation
          - NIL Prediction
          - Clustering
      - Similar conclusions as in [Kassner&al. ACL22]
      - [IJCKG22] Pozzi, R., Moiraghi, F., Lodi, F., & Palmonari, M. (2022, October). Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking. In Proceedings of the 11th International Joint Conference on Knowledge Graphs (pp. 30-38).
  • 52. Incremental Entity Extraction and Linking: Evaluation
      - Certain application domains require HITL end-to-end entity extraction to achieve production-level quality
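The incremental setting described for Figure 1 can be sketched as a loop over batches: each mention is linked either to the background KB or to the NEW-KB, and a mention with no sufficiently similar entity is treated as NIL and added to the NEW-KB, so that later batches can link to it. The Python sketch below is a toy illustration under stated assumptions. It uses string similarity from the standard library's `difflib` in place of a trained linker, a hypothetical threshold of 0.8, and made-up identifiers such as `NEW1`. The human validation step and the clustering of NIL mentions are not reproduced.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def best_match(mention: str, kb: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return the id and similarity of the KB entity whose name is closest to the mention."""
    best_id, best_score = None, 0.0
    for entity_id, name in kb.items():
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id, best_score


def process_batches(
    batches: List[List[str]],
    background_kb: Dict[str, str],
    threshold: float = 0.8,  # hypothetical NIL threshold
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Link mentions batch by batch, growing a NEW-KB with NIL mentions."""
    new_kb: Dict[str, str] = {}
    links: List[Tuple[str, str]] = []
    for batch in batches:
        for mention in batch:
            # Candidates come from both the background KB and the NEW-KB so far.
            candidates = {**background_kb, **new_kb}
            entity_id, score = best_match(mention, candidates)
            if entity_id is None or score < threshold:
                # NIL prediction: no known entity is close enough, create a new one.
                entity_id = f"NEW{len(new_kb) + 1}"
                new_kb[entity_id] = mention
            links.append((mention, entity_id))
    return links, new_kb


if __name__ == "__main__":
    background = {"Q1": "Money Bank"}  # made-up background KB
    batches = [["Money Bank", "Anna Donati"], ["Anna Donati", "money bank"]]
    print(process_batches(batches, background))
```

A wrong NIL decision in an early batch persists in the NEW-KB and affects every later batch, which is the error propagation listed among the main challenges.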
  • 53. Dave: Semantic Search + HITL Annotation
      - All visible names in this text are made up, as is other PI information. None of the facts mentioned in this decision refer to the names appearing therein.
  • 55. Conclusions & Future Work
      - Conclusions
          - Data linking + data extension: core semantic data enrichment tasks
          - Tabular data and textual data
              - Similar tasks: annotations >> KG construction | enriched data
              - Still several challenges
          - NIL prediction and entity clustering
          - Incremental KB construction from tables and text
          - HITL approach
              - Interactive data enrichment to overcome intrinsic limitations
              - Enrichment at scale while controlling the quality
      - Future work
          - Full-fledged HITL: learning from the user feedback
          - Combining Generative AI and data enrichment algorithms for dialogical data enrichment
  • 56. THANKS! QUESTIONS?
      - Funding acknowledgements: The work presented in this presentation has received funding from the European Union's Horizon 2020 research and innovation program under grant agreements No 732590 (EW-Shopp) and No 732003 (euBusinessGraph), and from the European Union's Horizon Europe research and innovation program under grant agreement No 101070284 (enRichMyData).