In various scientific and industrial domains, new findings and decisions are based on the analysis and interpretation of very large data sets. Often, the benefit of data analytics is increased by combining data from a variety of different sources. This requires high-quality data integration, including the expensive and tedious matching of data and metadata objects from two or more sources. Due to rapid developments in most domains, many (already linked) data sources are continuously updated, i.e. they undergo a steady evolution. Moreover, data is often gathered on a regular basis, e.g. to analyze developments over time.
To improve data quality and avoid redundant effort, data integration workflows should make use of already existing, and especially of validated, links between two or more sources, and prefer link reuse over re-determination. Outdated links should thus be migrated to the currently valid versions of the data sources. In addition, temporal linkage of data objects allows for the analysis of changes in a domain of interest.
This talk sketches my research regarding link reuse in data integration tasks. In particular, I will present approaches in the context of evolution and temporal linkage for ontologies as well as data records using examples from the medical and social domains.
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE... | Journal For Research
Analysis of social networking sites such as Facebook, Flickr and Twitter has been a trending topic for data analysts, researchers and enthusiasts in recent years, aiming to maximize the value of the knowledge acquired from processing and analyzing the data. Apache Spark is an open-source data-parallel computation engine that offers faster solutions than traditional MapReduce engines such as Apache Hadoop. This paper discusses the performance evaluation of Apache Spark for analyzing social network data. Performance varies significantly with the algorithm being implemented, which is what makes this evaluation worthwhile given the versatility and diverse nature of the dynamic field of Social Network Analysis. We evaluate the performance of Apache Spark on various algorithms (PageRank, Connected Components, Triangle Counting, K-Means and Cosine Similarity), making efficient use of the Spark cluster.
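As an illustration of the kind of workload evaluated (not code from the paper): a minimal PySpark sketch of one of the listed algorithms, PageRank, over a toy edge list. The graph, iteration count and damping factor are all invented for the example.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PageRankSketch")

# Toy follower graph as (source, target) edges; a real run would load edges from storage.
edges = sc.parallelize([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])

links = edges.groupByKey().cache()      # adjacency list per vertex
ranks = links.mapValues(lambda _: 1.0)  # uniform initial rank

for _ in range(10):  # fixed number of power iterations
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)  # damping factor 0.85

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))
sc.stop()
```

The same skeleton (build an RDD, iterate, aggregate) underlies the other evaluated algorithms; only the per-iteration logic changes.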
This introduction shows how OpenRefine can help any data project, from analytics to migration and reconciliation. OpenRefine's powerful interface helps domain experts explore, transform and enrich their data.
Knowledge graphs: they're what all businesses are now on the lookout for. But what exactly is a knowledge graph and, more importantly, how do you get one? Do you get it as an out-of-the-box solution, or do you have to build it (or have someone else build it for you)? With the help of our knowledge graph technology experts, we have created a step-by-step list of how to build a knowledge graph. Built this way, a knowledge graph properly exposes and enforces the semantics of the semantic data model via inference, consistency checking and validation, and thus offers organizations many more opportunities to transform and interlink data into coherent knowledge.
Large corporations have to master vast amounts of heterogeneous data in order to stay competitive. While existing approaches have attempted to consolidate and manage the data by forcing it into a single shared data model, data lakes recently emerged that instead provide a central storage point for holding all data sets in their original form.
In this talk, we present eccenca CorporateMemory, which extends the data lake paradigm with a semantic integration layer for managing diverse, but semantically enriched data. eccenca CorporateMemory builds an extensible knowledge graph that employs RDF vocabularies for transforming and linking multiple datasets in order to generate an integrated semantic understanding of the data.
Robert Isele | Head of Data Integration Unit at eccenca GmbH
Presentation at Semantics 2016 in Leipzig in the context of the results of the LEDS project
As technology and needs evolve and the demand for scalable, highly available solutions increases, there is a need to evaluate new databases. The lack of clarity in the market makes it difficult for IT stakeholders to understand the differences between the available solutions and which choice to make. The key areas to consider when evaluating NoSQL databases are the data model, query model, consistency model, APIs, and support and community strength.
Video: https://www.youtube.com/watch?v=Rt2oHibJT4k
Technologies such as Hadoop have addressed the "Volume" problem of Big Data, and technologies such as Spark have recently addressed the "Velocity" problem – but the "Variety" problem is largely unaddressed: there is a lot of manual "data wrangling" to manage data models.
These manual processes do not scale well. Not only is the variety of data increasing, the rate of change in the data definitions is increasing as well. We can't keep up. NoSQL data repositories can handle storage, but we need effective models of the data to fully utilize it.
This talk will present tools and a methodology to manage Big Data Models in a rapidly changing world. It covers:
Creating Semantic Metadata Models of Big Data Resources (see the sketch after this list)
Graphical UI Tools for Big Data Models
Tools to synchronize Big Data Models and Application Code
Using NoSQL Databases, such as Amazon DynamoDB, with Big Data Models
Using Big Data Models with Hadoop, Storm, Spark, Giraph, and Inference
Using Big Data Models with Machine Learning to generate Predictive Models
Developer Collaborative/Coordination processes using Big Data Models and Git
Managing change – Big Data Models with rapidly changing Data Resources
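To make the first item in this list concrete: a minimal sketch of a semantic metadata model for one data resource, assuming the rdflib library and the W3C DCAT vocabulary. The example.org namespace and all dataset details are invented.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import DCTERMS, XSD

# DCAT is a W3C vocabulary for describing data catalogs; the EX namespace
# and the resource described below are purely illustrative.
DCAT = Namespace("http://www.w3.org/ns/dcat#")
EX = Namespace("http://example.org/bigdata/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

clickstream = EX["clickstream-events"]
g.add((clickstream, RDF.type, DCAT.Dataset))
g.add((clickstream, DCTERMS.title, Literal("Raw clickstream events")))
g.add((clickstream, DCTERMS.modified, Literal("2016-01-01", datatype=XSD.date)))
g.add((clickstream, DCAT.keyword, Literal("hadoop")))

print(g.serialize(format="turtle"))
```

Descriptions like this can be versioned alongside application code, which is what makes the synchronization and change-management items above tractable.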
A podium abstract presented at AMIA 2016 Joint Summits on Translational Science. This discusses Data Café — A Platform For Creating Biomedical Data Lakes.
Modeling employee relationships with Apache Spark | Wassim TRIFI
How to build a GraphX graph in order to analyze relationships based on email exchanges between the employees of an organization, and then apply different algorithms to gain more insight into the business model and the influential people among the employees.
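The talk builds the graph with Spark GraphX (which is Scala-only); purely as an illustration of the same idea, here is a sketch in Python using networkx, with an invented email log:

```python
import networkx as nx

# Invented email log: (sender, recipient) pairs extracted from message headers.
emails = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "alice"),
    ("carol", "alice"), ("dave", "alice"), ("dave", "bob"),
]

g = nx.DiGraph()
for sender, recipient in emails:
    # Weight each edge by the number of messages sent in that direction.
    w = g.get_edge_data(sender, recipient, {"weight": 0})["weight"]
    g.add_edge(sender, recipient, weight=w + 1)

# PageRank as a rough proxy for influence within the organization.
influence = nx.pagerank(g, weight="weight")
for person, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.3f}")
```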
Democratizing Data within your organization - Data Discovery | Mark Grover
In this talk, we discuss the challenges of scale in an organization like Lyft. We delve into data discovery as a challenge on the way to democratizing data within your organization, and go into detail about our solution to the data discovery challenge.
Coherent and consistent tracking of provenance data and in particular update history information is a crucial building block for any serious information system architecture.
Marvin Frommhold | AKSW, Universität Leipzig
Presentation at Semantics 2016 in Leipzig in the context of the results of the LEDS project
Slides from webinar: Provenance and social science data. Presented on 15 March 2017. Presenter was Prof George Alter, Research Professor, ICPSR, and visiting Professor, ANU
FULL webinar recording: https://youtu.be/elPcKqWoOPg
Prof George Alter (Research Professor, ICPSR & Visiting Prof, ANU)
The C2Metadata Project is producing new tools that will work with common statistical packages (e.g. R and SPSS) to automate the capture of metadata describing variable transformations. Software-independent data transformation descriptions will be added to metadata in two internationally accepted standards: DDI and the Ecological Metadata Language (EML). These tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. This is of special interest to the social sciences, with their strong metadata standards and heavy reliance on statistical analysis software.
Royal society of chemistry activities to develop a data repository for chemis... | Ken Karapetyan
The Royal Society of Chemistry publishes many thousands of articles per year, the majority of them containing rich chemistry data that, in general, is limited in its value when isolated in the HTML or PDF form of the articles commonly consumed by readers. The RSC also has an archive of over 300,000 articles containing rich chemistry data, especially in the form of chemicals, reactions, property data and analytical spectra. The RSC is developing a platform integrating these various forms of chemistry data. The data will be aggregated both during the manuscript deposition process and as the result of text-mining and extraction of data from across the RSC archive. This presentation will report on the development of the platform, including our success in extracting compounds, reactions and spectral data from articles. We will also discuss our developing process for handling data at manuscript deposition and the integration and support of eLab Notebooks (ELNs) for facilitating data deposition and sourcing data. Each of these processes is intended to ensure long-term access to research data with the intention of facilitating improved discovery.
by Yannis Stavrakas (“Athena” Research Center), presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
What is a distributed data science pipeline? How, with Apache Spark and friends. | Andy Petrella
What was a data product before the world changed and got so complex?
Why distributed computing / data science is the solution.
What problems does that add?
How to solve most of them using the right technologies, like the Spark Notebook, Spark, Scala, Mesos and so on, in an accompanying framework.
OpenAIRE Content Providers Community Call, July 1st, 2020
This call focused on data repositories, namely the OpenAIRE Research Graph and data repositories, the OpenAIRE Content Acquisition Policy, and the Guidelines for Data Archive Managers.
It was also an opportunity to share the most recent updates and novelties in the OpenAIRE Content Provider Dashboard, and to get feedback from the community.
Follow the Community activities at https://www.openaire.eu/provide-community-calls
A (vintage) presentation about a database system for the study of gene expression data, including distributed metadata annotation and some interactive analytics. Some of the ideas are still relevant today.
Tutorial given at the European Conference on Machine Learning (ECMLPKDD 2015). It covers OpenML, how to use it in your research, its interfaces in Java, R and Python, and its use through machine learning tools such as WEKA and MOA. It also covers topics in open science and reproducible research.
Towards a rebirth of data science (by Data Fellas) | Andy Petrella
Nowadays, Data Science is buzzing all over the place.
But what is a, so-called, Data Scientist?
Some will argue that a Data Scientist is a person able to report and present insights in a data set. Others will say that a Data Scientist can handle a high throughput of values and expose them in services. Yet another definition includes the capacity to create meaningful visualizations on the data.
However, we are entering an age where velocity is key. Not only is the velocity of your data high, the time to market is shortened as well. Hence, the time separating the moment you receive a set of data from the moment you are able to deliver added value is crucial.
In this talk, we'll review the legacy Data Science methodologies and what they meant in terms of delivered work and results.
Afterwards, we'll move towards the different concepts, techniques and tools that Data Scientists will have to learn and appropriate in order to accomplish their tasks in the age of Big Data.
The talk closes by presenting the Data Fellas view on a solution to these challenges, in particular the Spark Notebook and the Shar3 product we develop.
Engaging Information Professionals in the Process of Authoritative Interlinki... | Lucy McKenna
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
DataGraft: Data-as-a-Service for Open Data | dapaasproject
Tutorial at "The Data Matters Series – Transforming Service Industry via Big Data Analytics", May 4, 2016, Cyberjaya, Malaysia
http://www.eventbrite.com/e/the-data-matters-series-transforming-service-industry-via-big-data-analytics-tickets-24617911837
Antelope: A Web service for publishing Life Cycle Assessment models and resul... | Brandon Kuczenski
We describe a data format, interface, and prototype implementation of a web service that can be used to publish Life Cycle Assessment (LCA) studies and their results. The service uses the fragment data model to describe product systems in a structured way. The API provides a provenance framework for LCA results and enables documentation and external validation of LCA studies.
Watch the presentation:
https://www.youtube.com/watch?v=2P4vYvdc1uI&t=208m15s
Graph enhancements to Artificial Intelligence and Machine Learning are changing the landscape of intelligent applications. Beyond improving accuracy and modeling speed, graph technologies make building AI solutions more accessible. Join us to hear about 4 areas at the forefront of graph-enhanced AI and ML, and find out which techniques are commonly used today and which hold the potential for disrupting industries. We'll provide examples and specifically look at how:
- Graphs provide better accuracy through connected feature extraction
- Graphs provide better performance through contextual model optimization
- Graphs provide context through knowledge graphs
- Graphs add explainability to neural networks
Speakers: Jake Graham, Alicia Frame
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... | Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running on a separate machine/instance. With "Spark Cluster with Elasticsearch Inside" it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on a Spark cluster with Elasticsearch inside. The motivation is that once Elasticsearch is running on Spark, it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. This in turn enables indexing of datasets that are processed as part of data pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their data lake and make it searchable.
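Not from the talk: a rough sketch of how records can be indexed from Spark through the ES-Hadoop connector's Spark SQL data source, assuming the connector jar is on the classpath. The index name, node address and catalog records are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DatasetIndexing").getOrCreate()

# Invented dataset-catalog records to be made searchable.
datasets = spark.createDataFrame(
    [("clickstream", "raw web events", "s3://lake/clickstream"),
     ("orders", "curated order facts", "s3://lake/orders")],
    ["name", "description", "location"])

# Write through the ES-Hadoop Spark SQL data source; "datasets/doc" is an
# illustrative index/type, and es.nodes would point at the embedded instance.
(datasets.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .mode("append")
    .save("datasets/doc"))
```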
An Overview of the iMicrobe Project and available tools in the iPlant Cyberinfrastructure. This talk was given at a workshop at ASLO in Granada, Spain focused on applications in Oceanography and Limnology.
Linked Data Experiences at Springer Nature | Michele Pasin
An overview of how we're using semantic technologies at Springer Nature, and an introduction to our latest product: www.scigraph.com
(Keynote given at http://2016.semantics.cc/, Leipzig, Sept 2016)
Similar to Link Reuse and Evolution for Data Integration (LSWT 2020)
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx | Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first ever open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Link Reuse and Evolution for Data Integration (LSWT 2020)
1. Link Reuse and Evolution for Data Integration
Anika Groß | 8. Leipziger Semantic Web Tag, 17.06.2020
"It's all about the data"
6.–8. Matching / Linking
Aim: (semi-)automatically interconnect different data sources via explicit links
• Schema level: schema and ontology matching, schema merging
• Instance level: entity resolution / link discovery, object fusion
• Semantic annotation: linking instances with ontology concepts, entity linking
[Slide figure: a German and an English disease ontology (Krankheiten, Hämatologische Krankheit, Blutarmut, Leukopenie vs. Disease, Hematological Disease, Cytopenia, Anemia, Leukopenia, Thrombocytopenia) matched at the schema level, with clinical-trial eligibility texts ("severe anemia (hemoglobin < 8 g/dL), leukopenia (WBC < 2500/mm3), thrombocytopenia (platelet count < 80,000/mm3)"; "patients with significantly impaired bone marrow function or significant anemia, leukopenia, or thrombocytopenia") annotated with the ontology concepts]
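Not from the slides: a miniature of the instance-level idea, matching concept labels from the bilingual example with a plain string similarity. The threshold is invented; real matchers combine many more features and expert verification.

```python
from difflib import SequenceMatcher

# Toy concept labels from two sources (cf. the bilingual ontology example above).
source1 = ["Anemia", "Leukopenia", "Thrombocytopenia"]
source2 = ["Anaemia", "Leukopenia", "Cytopenia"]

def sim(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # invented; in practice tuned and verified by domain experts
links = [(x, y, round(sim(x, y), 2))
         for x in source1 for y in source2 if sim(x, y) >= THRESHOLD]
print(links)  # [('Anemia', 'Anaemia', 0.92), ('Leukopenia', 'Leukopenia', 1.0)]
```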
9.–10. Data is not static
[Slide figure: analysis pipeline from ≥ 2 input sources via integration & enrichment (linking, fusion, …) to analysis (e.g. graph-based) and result interpretation; the sources carry intra-source links and inter-source links, and, under evolution and dynamics, links between different versions and temporal links]
11. Agenda
✓ Introduction (data science workflow, matching / linking, evolution)
• Link Reuse
• Link Evolution and Temporal Linking
• Future Research Directions
12.–16. Link Reuse
Again and again: implementation of matching tools/algorithms, configuration of matching workflows, verification of links (many, many test runs yielding real, tiny or no improvement, and depending on the cooperativeness of domain experts)
Existing links between (meta)data sources: Linked Open Data Cloud; repositories/platforms such as BioPortal, local/own projects, sameas.org, …
• ✗ No solution → manual or (semi-)automatic matching
• ⊂ Partial or ✓ complete solution → link reuse instead of full (manual or automatic) re-determination
Aims: improved match result quality, less effort, link update (evolution)
17.–20. Link Reuse - Methods
• Composition: combine mappings via intermediate sources (indirect paths S1–I1–S2 or S1–I2–S2 instead of a direct S1–S2 mapping)
• Clustering: create groups of (connected) entities
• Supervised learning: train an ML model
• Evolution: connect and update over time
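Not from the slides: a minimal sketch of the composition idea, chaining a mapping from S1 to an intermediate source I with a mapping from I to S2 into an indirect S1-S2 mapping. Identifiers and confidence scores are invented.

```python
# Invented correspondences: (entity, entity, confidence score).
m_s1_i = [("s1:anemia", "i:D64.9", 0.9), ("s1:leukopenia", "i:D72.819", 0.8)]
m_i_s2 = [("i:D64.9", "s2:blutarmut", 0.9)]

def compose(m_ab, m_bc):
    """Chain an A->B mapping with a B->C mapping via the shared source B."""
    by_b = {}
    for b, c, score in m_bc:
        by_b.setdefault(b, []).append((c, score))
    # Multiplying scores is one simple way to propagate confidence.
    return [(a, c, round(s_ab * s_bc, 2))
            for a, b, s_ab in m_ab
            for c, s_bc in by_b.get(b, [])]

print(compose(m_s1_i, m_i_s2))  # [('s1:anemia', 's2:blutarmut', 0.81)]
```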
21. Link Reuse – in my research
• Composition: indirect ontology matching (schema level)
• Clustering: holistic entity clustering for linked data (instance level); semantic annotation of medical documents
• Supervised learning: combination of results from different semantic annotation tools
• Evolution / temporal linking: ontology mapping evolution and update (schema level); temporal group linkage for census data (instance level)
22.–28. Link Evolution and Temporal Linking
• Find links between different source versions or temporal datasets: from the change record diff(S1, S1′) derive a version mapping M(S1, S1′), and continue across further versions S1′′, …
• Update sets of outdated links between older versions: given an inter-source mapping M(S1, S2) and the diffs diff(S1, S1′) and diff(S2, S2′) with their version mappings M(S1, S1′) and M(S2, S2′), derive the updated mapping M(S1′, S2′)
• Reuse existing intra- or inter-source links
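Not from the slides: a toy version of the update step, migrating an outdated inter-source mapping along invented diffs that record renamed and deleted entities, so that still-valid links are reused instead of re-matched.

```python
# Invented diffs for two evolving sources: renames (old id -> new id)
# plus entities deleted in the new version.
diff_s1 = {"renamed": {"s1:42": "s1:99"}, "deleted": {"s1:7"}}
diff_s2 = {"renamed": {}, "deleted": {"s2:3"}}

# Outdated inter-source links between the old versions S1 and S2.
m_old = [("s1:42", "s2:11"), ("s1:7", "s2:11"), ("s1:13", "s2:3")]

def migrate(mapping, d1, d2):
    """Reuse still-valid links: follow renames, drop links to deleted ids."""
    migrated = []
    for a, b in mapping:
        if a in d1["deleted"] or b in d2["deleted"]:
            continue  # link cannot be migrated; a re-match may be needed
        migrated.append((d1["renamed"].get(a, a), d2["renamed"].get(b, b)))
    return migrated

print(migrate(m_old, diff_s1, diff_s2))  # [('s1:99', 's2:11')]
```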
29. Link Reuse - Methods (recap of the overview from slide 21, now turning to evolution and temporal linking: ontology mapping evolution and update at the schema level, temporal group linkage for census data at the instance level)
30.–34. Temporal Group Linkage for Census Data
• 6 censuses (1851–1901) in Rawtenstall, Lancashire, U.K.
• Household graphs (known family connections) but unknown temporal links
[Slide figure: 1871 and 1881 household graphs for the Ashworth and Smith families, with head/wife/son/daughter/father-in-law relations, whose composition changes between the two censuses]
Problems: attribute values change over time (surname, occupation); difficult disambiguation (same first name and surname); poor data quality (misspellings etc.); …
Temporal entity and group linkage: method → paper; ≈ 96% F-measure for record and group mappings (a 2–9% improvement over the compared approaches)
Christen, Groß, Fisher et al.: Temporal group linkage and evolution analysis for census data. Intl. Conf. on Extending Database Technology (EDBT), 2017.
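Not from the slides or the paper: a toy group-similarity measure in the spirit of the household example, scoring two households by the average best name match between their members. Names and the measure itself are invented simplifications.

```python
from difflib import SequenceMatcher

def name_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_sim(h1, h2):
    """Average best-match similarity of h1's members against h2."""
    return sum(max(name_sim(p, q) for q in h2) for p in h1) / len(h1)

ashworth_1871 = ["John Ashworth", "Elizabeth Ashworth", "William Ashworth"]
ashworth_1881 = ["John Ashworth", "Elizabeth Ashworth", "William Ashworth",
                 "Alice Ashworth"]
smith_1881 = ["John Smith", "Elizabeth Smith", "Steve Smith"]

print(round(group_sim(ashworth_1871, ashworth_1881), 2))  # 1.0 (same household)
print(round(group_sim(ashworth_1871, smith_1881), 2))     # clearly lower
```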
35.–36. Evolution Patterns and Evolution Graph
• Evolution patterns on the individual level (preserve, add, remove) and on the group level (split, merge, move, …)
• Evolution graph over longer time periods
Christen, Groß, Fisher et al.: Temporal group linkage and evolution analysis for census data. Intl. Conf. on Extending Database Technology (EDBT), 2017.
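Not from the slides: an invented miniature of group-level pattern detection, deriving split/merge/preserve labels from person-level matches between two censuses.

```python
def group_patterns(links):
    """Classify group evolution from (old_group, new_group) links of matched persons."""
    old_to_new, new_to_old = {}, {}
    for og, ng in links:
        old_to_new.setdefault(og, set()).add(ng)
        new_to_old.setdefault(ng, set()).add(og)
    patterns = {}
    for og, targets in old_to_new.items():
        if len(targets) > 1:
            patterns[og] = "split"    # members dispersed into several groups
        elif len(new_to_old[next(iter(targets))]) > 1:
            patterns[og] = "merge"    # several old groups feed one new group
        else:
            patterns[og] = "preserve"
    return patterns

# Invented matches: household H1 splits; H2 and H3 merge into G3.
links = [("H1", "G1"), ("H1", "G2"), ("H2", "G3"), ("H3", "G3")]
print(group_patterns(links))  # {'H1': 'split', 'H2': 'merge', 'H3': 'merge'}
```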
37. "The Reuse Application": Knowledge Graphs
• Continuous reuse and integration of instances, ontology concepts and links from various sources (& many more)
• Methods: matching / link discovery, NLP, entity linking, clustering, fusion/merging, … plus expert knowledge and verification
• Evolution and update: a knowledge graph can be highly dynamic, with direct changes in the graph, extension and update based on usage (user queries), and updates when source versions evolve
• Integrate additions, deletions, structural changes, …; the complex part: keep meanwhile-verified changes
38. Conclusion
• Data sources evolve over time … and so do the links between them
• Reuse existing verified links to create new links for new versions → improved link quality, less effort / more efficiency, up-to-date links
• Create new temporal links between objects and object groups
• Problems to overcome: poor trust, missing context, no knowledge of existing links, …; lineage, provenance, data profiling and accessibility help
39.–40. Future Research Directions
• Evolution of knowledge graphs: evolution of integrated sources; evolution-aware ontology merge and knowledge graph update; scalable iterative integration; temporal patterns on graph data; …
• Semantic interoperability: semantic annotation of heterogeneous, un-/semi-structured data; multilingual matching; semantic mappings ("beyond sameAs"); …
• End-to-end analytics workflows: close-to-seamless data integration for complex analytics workflows; management and reproducibility of scientific workflows; …
41. References
Reuse - Annotation
• Christen, Lin, Groß, Domingos Cardoso, Pruski, Da Silveira, Rahm: A Learning-Based Approach to Combine Medical Annotation Results (short paper). 13th Intl. Conference on Data Integration in the Life Sciences (DILS), 2018.
• Christen, Groß, Rahm: A Reuse-based Annotation Approach for Medical Documents. The Semantic Web - ISWC 2016: 15th Intl. Semantic Web Conference, 2016.
Reuse - Entity Links
• Nentwig, Groß, Möller, Rahm: Distributed Holistic Clustering on Linked Data. Proc. OTM 2017 Conferences - Confederated International Conferences: CoopIS, C&TC, and ODBASE, 2017.
• Nentwig, Groß, Rahm: Holistic Entity Clustering for Linked Data. IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016.
Temporal Linking / Entity Evolution
• Christen, Groß, Fisher, Wang, Christen, Rahm: Temporal group linkage and evolution analysis for census data. 19th Intl. Conference on Extending Database Technology (EDBT), 2017.
Mapping/Link Composition
• Groß, Hartung, Kirsten, Rahm: Mapping Composition for Matching Large Life Science Ontologies. 2nd Intl. Conference on Biomedical Ontology (ICBO), 2011.
• Hartung, Groß, Rahm: Composition Methods for Link Discovery. Proc. of 15. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2013.
Mapping/Link Evolution
• Groß, Pruski, Rahm: Evolution of Biomedical Ontologies and Mappings: Overview of Recent Approaches. Computational and Structural Biotechnology Journal 14, 2016.
• Groß, dos Reis, Hartung, Pruski, Rahm: Semi-automatic Adaptation of Mappings between Life Science Ontologies. 9th Intl. Conference on Data Integration in the Life Sciences (DILS), 2013.
• Groß, Hartung, Prüfer, Kelso, Rahm: Impact of ontology evolution on functional analyses. Bioinformatics 28(20), 2012.
• Groß, Hartung, Kirsten, Rahm: Estimating the Quality of Ontology-Based Annotations by Considering Evolutionary Changes. 6th Intl. Workshop on Data Integration in the Life Sciences (DILS), 2009.
Ontology Evolution
• Christen, Groß, Hartung: REX - A Tool for Discovering Evolution Trends in Ontology Regions. 10th Intl. Conference on Data Integration in the Life Sciences (DILS), 2014.
• Hartung, Groß, Rahm: COnto-Diff: Generation of Complex Evolution Mappings for Life Science Ontologies. Journal of Biomedical Informatics 46(1), 2013.
• Hartung, Groß, Rahm: CODEX: exploration of semantic changes between ontology versions. Bioinformatics 28(6), 2012.