Scalable Link Discovery for Modern Data-Driven
Applications
Kleanthi Georgala
University of Leipzig, Institute for Applied Informatics
Motivation
Growth of Linked Data Web (volume, velocity)
Time and space constraints for Link Discovery
Scalability of execution of Link Specifications (LSs)
Relevancy
Implement the fourth Link Discovery principle
Linking and integration of constantly increasing knowledge bases
Structured Machine Learning
Complex Event Processing
Solution
Scale up Link Discovery through partial-recall linking and better
planning.
Preliminaries
Filters are pairs (f, τ), where (1) f is either empty (denoted ε) or a
combination of similarity measures and (2) τ ∈ [0, 1] is a threshold.
Atomic LS: L = (f, τ, X) with [[X]] = S × T, where S and T are sets
of resources OR L = (m, θ), where m is a similarity measure and θ is
a similarity threshold.
Complex LS: L = (f, τ, ω(L1, L2)) where ω is a LS operator, (f, τ) is a
filter, L1 and L2 are the left and the right child of L resp.
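The definitions above can be sketched as a small data structure. This is a minimal illustration, not the implementation behind the poster: the class names, the operator names ("union", "intersection", "difference"), the way scores are combined (max for union, min for intersection) and the toy measure char_jaccard (a stand-in for a real measure such as qgrams) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple, Union

Mapping = Dict[Tuple[str, str], float]  # link (s, t) -> similarity score


@dataclass(frozen=True)
class AtomicLS:
    measure: Callable[[str, str], float]  # similarity measure m
    theta: float                          # similarity threshold


@dataclass(frozen=True)
class ComplexLS:
    operator: str      # assumed names: "union", "intersection", "difference"
    left: "LS"         # left child L1
    right: "LS"        # right child L2
    tau: float = 0.0   # filter threshold; the filter function f is left empty


LS = Union[AtomicLS, ComplexLS]


def evaluate(ls: LS, sources, targets) -> Mapping:
    """Compute the links of an LS over S x T by brute force."""
    if isinstance(ls, AtomicLS):
        result = {}
        for s in sources:
            for t in targets:
                sim = ls.measure(s, t)
                if sim >= ls.theta:
                    result[(s, t)] = sim
        return result
    a = evaluate(ls.left, sources, targets)
    b = evaluate(ls.right, sources, targets)
    if ls.operator == "union":
        merged = {k: max(a.get(k, 0.0), b.get(k, 0.0)) for k in a.keys() | b.keys()}
    elif ls.operator == "intersection":
        merged = {k: min(a[k], b[k]) for k in a.keys() & b.keys()}
    elif ls.operator == "difference":
        merged = {k: v for k, v in a.items() if k not in b}
    else:
        raise ValueError(f"unknown operator: {ls.operator}")
    # Apply the filter (empty f, threshold tau) to the combined result.
    return {k: v for k, v in merged.items() if v >= ls.tau}


def char_jaccard(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of the character sets of two strings."""
    x, y = set(a.lower()), set(b.lower())
    return len(x & y) / len(x | y) if x | y else 1.0


S, T = ["abc"], ["abc", "abd", "xyz"]
ls1 = AtomicLS(char_jaccard, 0.4)
ls2 = AtomicLS(char_jaccard, 0.7)
links = evaluate(ComplexLS("intersection", ls1, ls2, tau=0.5), S, T)
```

The brute-force loop over S × T is only there to make the semantics visible; avoiding exactly that cost is the point of the planning and refinement steps described on the poster.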
H1: Link Discovery with Partial Recall
Hypothesis
Given a LS L, k ∈ [0, 1] and maxOpt ∈ [0, +∞), a LS L′ ⊑ L exists that achieves a lower
runtime than L, while generating at least |[[L]]| × k links.
Approach
Use of downward refinement operator ρ: finite, proper, incomplete and redundant
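\[
\rho(L) =
\begin{cases}
\emptyset & \text{if } L = L_\emptyset,\\
L_\emptyset & \text{if } L = (m(p_s, p_t), 1),\\
(m(p_s, p_t), \mathit{next}(\theta)) & \text{if } L = (m(p_s, p_t), \theta) \wedge \theta < 1,\\
(\rho(L_1) \sqcup L_2) \cup (L_1 \sqcup \rho(L_2)) & \text{if } L = L_1 \sqcup L_2,\\
(\rho(L_1) \sqcap L_2) \cup (L_1 \sqcap \rho(L_2)) & \text{if } L = L_1 \sqcap L_2,\\
(\rho(L_1) \setminus L_2) & \text{if } L = L_1 \setminus L_2.
\end{cases}
\]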
Explore refinement tree.
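The case analysis of ρ can be sketched in a few lines of Python. The sketch assumes that the three binary operators are union, intersection and difference, that only the left operand of a difference is refined (as in the last case of ρ), and that next(θ) moves to the next threshold on a fixed grid of step 0.1. The class names and the grid step are hypothetical and do not come from the poster.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class Empty:
    """The empty specification L_empty."""


@dataclass(frozen=True)
class Atomic:
    measure: str   # e.g. "qgrams(:name, :label)"
    theta: float   # similarity threshold


@dataclass(frozen=True)
class Complex:
    op: str        # assumed names: "union", "intersection", "difference"
    left: "Spec"
    right: "Spec"


Spec = Union[Empty, Atomic, Complex]
EMPTY = Empty()


def next_theta(theta: float, step: float = 0.1) -> float:
    """Next higher threshold on an assumed grid; rounding avoids float drift."""
    return min(1.0, round(theta + step, 10))


def rho(ls: Spec) -> Set[Spec]:
    """One refinement step: every returned LS yields a subset of the links of ls."""
    if isinstance(ls, Empty):
        return set()
    if isinstance(ls, Atomic):
        if ls.theta >= 1.0:
            return {EMPTY}
        return {Atomic(ls.measure, next_theta(ls.theta))}
    left_refined = {Complex(ls.op, l, ls.right) for l in rho(ls.left)}
    if ls.op == "difference":
        # Refining the right operand of a difference could add links, so it is kept fixed.
        return left_refined
    right_refined = {Complex(ls.op, ls.left, r) for r in rho(ls.right)}
    return left_refined | right_refined


a = Atomic("qgrams(:name, :label)", 0.3)
b = Atomic("qgrams(:name, :label)", 0.7)
children = rho(Complex("intersection", a, b))
```

For the intersection of the two atomic specifications, one refinement step raises either the left threshold to 0.4 or the right threshold to 0.8, so the node has two children in the refinement tree.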
Example: maxOpt = 30s and k = 0.7
Current time = 0s, RT = 60s
  (qgrams(:name, :label), 0.3)
  (qgrams(:name, :label), 0.7)
  ( , 0.5)
Current time = 15s, RT = 32s, Recall = 0.8
  (qgrams(:name, :label), 0.4)
  (qgrams(:name, :label), 0.7)
  ( , 0.5)
Current time = 30s, RT = 20s, Recall = 0.65
  (qgrams(:name, :label), 0.5)
  (qgrams(:name, :label), 0.7)
  ( , 0.5)
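One plausible reading of this example is a time-bounded search over the refinement tree: refinements are generated until maxOpt is used up, candidates whose expected recall falls below k are discarded, and the cheapest remaining LS is returned. The sketch below follows that reading. The function explore, its parameters and the best-first strategy are assumptions for illustration, and the runtime and recall estimates are simply the numbers from the example.

```python
import heapq
import time
from itertools import count


def explore(root, refine, est_runtime, est_recall, k, max_opt):
    """Return the cheapest LS found within max_opt seconds whose expected recall is >= k."""
    deadline = time.monotonic() + max_opt
    best = root
    tie = count()  # tie-breaker so the heap never compares specifications directly
    frontier = [(est_runtime(root), next(tie), root)]
    seen = {root}
    while frontier and time.monotonic() < deadline:
        _, _, spec = heapq.heappop(frontier)
        for child in refine(spec):
            if child in seen:
                continue  # a redundant operator can reach the same LS via several paths
            seen.add(child)
            if est_recall(child) < k:
                continue  # further refinements return even fewer links, so prune here
            if est_runtime(child) < est_runtime(best):
                best = child
            heapq.heappush(frontier, (est_runtime(child), next(tie), child))
    return best


# Toy tree: a specification is identified by the threshold of its left child only.
runtime = {0.3: 60, 0.4: 32, 0.5: 20}     # RT values from the example, in seconds
recall = {0.3: 1.0, 0.4: 0.8, 0.5: 0.65}  # recall 1.0 for the original LS is assumed


def refine(theta):
    return [round(theta + 0.1, 1)] if theta < 0.5 else []


chosen = explore(0.3, refine, runtime.get, recall.get, k=0.7, max_opt=5.0)
# chosen == 0.4: threshold 0.5 is faster but its recall of 0.65 is below k = 0.7
```

Pruning below k is only sound if the recall estimate never increases under refinement, which holds for exact recall because each refinement returns a subset of its parent's links.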
H2: Dynamic Planning for Link Discovery
Hypothesis
A dynamic planner can generate more
time-efficient LS than a static planner by using
information from the execution engine.
Approach
Use of runtime approximation function for a plan
Overwrite cost estimation when a step is executed
Duplicated steps: Re-use previous results
Time complexity of O(|L|)
1st iteration
Canonical, RT(Plan1) = 32s
  (qgrams(:name, :label), 0.4): RT(Run(LS1)) = 12s
  (qgrams(:name, :label), 0.7): RT(Run(LS2)) = 10s
  RT() = 5s
  ( , 0.5): RT( ) = 5s
Filter-right, RT(Plan2) = 25s
  (qgrams(:name, :label), 0.4): RT(Run(LS1)) = 12s
  (qgrams(:name, :label), 0.7): RT(ϕ(LS2)) = 8s
  ( , 0.5): RT( ) = 5s
2nd iteration
Run (qgrams(:name, :label), 0.4)
Replan
Canonical, RT(Plan1) = 11s
  (qgrams(:name, :label), 0.4): RT(Run(LS1)) = 0s
  (qgrams(:name, :label), 0.7): RT(Run(LS2)) = 1s (dependency with LS1)
  RT() = 5s
  ( , 0.5): RT( ) = 5s
Filter-right, RT(Plan2) = 13s
  (qgrams(:name, :label), 0.4): RT(Run(LS1)) = 0s
  (qgrams(:name, :label), 0.7): RT(ϕ(LS2)) = 8s
  ( , 0.5): RT( ) = 5s
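The two iterations can be replayed with a small sketch of the replanning loop. It assumes that a plan's runtime estimate is the sum of its step estimates, which matches the totals above (12 + 10 + 5 + 5 = 32 and 12 + 8 + 5 = 25). The class DynamicPlanner and the step labels are hypothetical and are not taken from CONDOR itself.

```python
from typing import Dict, List, Optional


class DynamicPlanner:
    """Keeps per-step runtime estimates and re-ranks plans after each executed step."""

    def __init__(self, plans: Dict[str, List[str]], estimates: Dict[str, float]):
        self.plans = plans
        self.estimates = dict(estimates)
        self.cache: Dict[str, object] = {}  # results of executed steps, kept for re-use

    def cost(self, name: str) -> float:
        return sum(self.estimates[step] for step in self.plans[name])

    def cheapest(self) -> str:
        return min(self.plans, key=self.cost)

    def record_execution(self, step: str, result: object,
                         updates: Optional[Dict[str, float]] = None) -> None:
        self.cache[step] = result   # duplicated steps re-use this result
        self.estimates[step] = 0.0  # an executed step costs nothing from now on
        # Overwrite the estimates of steps that depend on the new result.
        self.estimates.update(updates or {})


plans = {
    "canonical": ["run(LS1)", "run(LS2)", "operator", "filter"],
    "filter-right": ["run(LS1)", "filter(LS2)", "filter"],
}
estimates = {"run(LS1)": 12, "run(LS2)": 10, "filter(LS2)": 8, "operator": 5, "filter": 5}

planner = DynamicPlanner(plans, estimates)
first = planner.cheapest()   # "filter-right": 25s against 32s

# LS1 has been executed; LS2 can now be derived from its result, so running LS2 drops to 1s.
planner.record_execution("run(LS1)", result=set(), updates={"run(LS2)": 1})
second = planner.cheapest()  # "canonical": 11s against 13s
```

A static planner would commit to the filter-right plan after the first iteration. Replanning after LS1 has run switches to the canonical plan because the dependency between LS1 and LS2 makes Run(LS2) cheaper than the filter step.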
Evaluation and Primary Results
Datasets: Abt-Buy, Amazon-Google Products, DBLP-ACM and DBLP-Scholar
OAEI datasets, LANCE, HOBBIT-generated datasets, MOVIES, TOWNS and VILLAGES
(H1): C-RO
Use of different values for k and maxOpt
Comparison with baseline
(H2): CONDOR
Comparison with the state-of-the-art
Figure: Partial Recall for DBLP-ACM (x-axis: number of LSs; y-axis: cumulative execution time in seconds; series: Baseline, C-RO)
Figure: Dynamic Planning for DBLP-ACM (x-axis: number of LSs; y-axis: cumulative execution time in seconds; series: CANONICAL, HELIOS, CONDOR)
This work was supported by research grants from the German Ministry for Finances and Energy under the SAKE project (Grant No. 01MD15006E), from the EU H2020 Framework Programme provided for the project
HOBBIT (GA no. 688227) and from Semantic Web Science Association (SWSA).
http://aksw.org/KleanthiGeorgala.html georgala@informatik.uni-leipzig.de

Scalable Link Discovery for Modern Data-Driven Applications (poster)
