Scalable Link Discovery for Modern Data-Driven Applications (poster)

Kleanthi Georgala
University of Leipzig, Institute for Applied Informatics
Motivation
Growth of Linked Data Web (volume, velocity)
Time and space constraints for Link Discovery
Scalability of execution of Link Specifications (LSs)
Relevancy
Implement the fourth Link Discovery principle
Linking and integration of constantly increasing knowledge bases
Structured Machine Learning
Complex Event Processing
Solution
Scale up Link Discovery through partial-recall linking and better
planning.
Preliminaries
Filters are pairs (f, τ), where (1) f is either empty (denoted ε) or a
combination of similarity measures and (2) τ ∈ [0, 1] is a threshold.
Atomic LS: L = (f, τ, X) with [[X]] = S × T, where S and T are sets
of resources, or L = (m, θ), where m is a similarity measure and θ is
a similarity threshold.
Complex LS: L = (f, τ, ω(L1, L2)), where ω is an LS operator, (f, τ) is a
filter, and L1 and L2 are the left and the right child of L, respectively.
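The definitions above describe a tree: atomic LSs are leaves and complex LSs are inner nodes that carry an operator and a filter. A minimal sketch of that structure follows. The class names, the operator labels ("and", "or", "minus"), the second similarity measure and the `size` helper are assumptions made for illustration; they are not taken from the poster or from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class AtomicLS:
    """Leaf of the tree: a similarity measure m with a threshold theta."""
    measure: str      # e.g. "qgrams(:name, :label)"
    threshold: float  # theta in [0, 1]


@dataclass(frozen=True)
class ComplexLS:
    """Inner node: an operator over two children plus a filter (f, tau)."""
    operator: str                          # hypothetical labels: "and", "or", "minus"
    left: "LS"
    right: "LS"
    filter_measure: Optional[str] = None   # None models the empty filter
    filter_threshold: float = 0.0          # tau in [0, 1]


LS = Union[AtomicLS, ComplexLS]


def size(ls: LS) -> int:
    """Count the nodes of an LS tree (an illustrative notion of |L|)."""
    if isinstance(ls, AtomicLS):
        return 1
    return 1 + size(ls.left) + size(ls.right)


# An LS shaped like the poster's examples; the second measure is made up.
example = ComplexLS(
    operator="and",
    left=AtomicLS("qgrams(:name, :label)", 0.3),
    right=AtomicLS("cosine(:name, :label)", 0.5),  # assumed measure
    filter_threshold=0.5,
)
```

Frozen dataclasses keep the specifications immutable and hashable, which is convenient when refinements of an LS are collected in sets.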
H1: Link Discovery with Partial Recall
Hypothesis
Given an LS L, k ∈ [0, 1] and maxOpt ∈ [0, +∞), an LS L′ ⊑ L exists that achieves a lower
runtime than L, while generating at least |[[L]]| × k links.
Approach
Use of downward refinement operator ρ: finite, proper, incomplete and redundant
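\[
\rho(L) =
\begin{cases}
\emptyset & \text{if } L = L_\emptyset,\\
L_\emptyset & \text{if } L = (m(p_s, p_t), 1),\\
(m(p_s, p_t), next(\theta)) & \text{if } L = (m(p_s, p_t), \theta) \wedge \theta < 1,\\
(\rho(L_1) \sqcup L_2) \cup (L_1 \sqcup \rho(L_2)) & \text{if } L = L_1 \sqcup L_2,\\
(\rho(L_1) \sqcap L_2) \cup (L_1 \sqcap \rho(L_2)) & \text{if } L = L_1 \sqcap L_2,\\
(\rho(L_1) \setminus L_2) & \text{if } L = L_1 \setminus L_2.
\end{cases}
\]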
Explore refinement tree.
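A minimal sketch of a downward refinement operator with the case structure given above. It rests on several assumptions: LSs are modeled as plain tuples, the three binary operators carry the hypothetical labels "or", "and" and "minus" (the last one refining only its left child), and `next_threshold` simply moves to the next value on a fixed 0.1 grid, which may differ from the actual next(θ).

```python
EMPTY = ("empty",)  # stands for the empty specification


def next_threshold(theta, step=0.1):
    """Assumed next(theta): the next higher value on a fixed grid, capped at 1."""
    return min(1.0, round(theta + step, 10))


def rho(ls):
    """Return the set of one-step refinements of an LS given as a tuple."""
    if ls == EMPTY:
        return set()
    if ls[0] == "atom":
        _, measure, theta = ls
        if theta >= 1.0:
            return {EMPTY}
        return {("atom", measure, next_threshold(theta))}
    op, left, right = ls
    # Every binary operator refines its left child.
    refined = {(op, new_left, right) for new_left in rho(left)}
    # Only the two symmetric operators also refine the right child.
    if op in ("or", "and"):
        refined |= {(op, left, new_right) for new_right in rho(right)}
    return refined


def explore(root, depth):
    """Breadth-first walk over the refinement tree down to a given depth."""
    seen, frontier = {root}, {root}
    for _ in range(depth):
        frontier = {child for node in frontier for child in rho(node)} - seen
        seen |= frontier
    return seen


a = ("atom", "qgrams(:name, :label)", 0.3)
b = ("atom", "cosine(:name, :label)", 0.5)  # assumed second measure
```

Calling `rho(("and", a, b))` yields two refinements, one per child, while `rho(("minus", a, b))` yields only the one that tightens the left child.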
Example: maxOpt = 30s and k = 0.7
Current time = 0s: (qgrams(:name, :label), 0.3), ( , 0.5), RT = 60s
Current time = 15s: ( , 0.5), RT = 32s, Recall = 0.8
Current time = 30s: ( , 0.5), RT = 20s, Recall = 0.65
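One way to read the example: among the specifications found within the optimization budget maxOpt, keep the one with the lowest estimated runtime whose recall is still at least k. The sketch below encodes that reading with the numbers from the example. The function name, the tuple layout and the recall of 1.0 for the initial LS (recall measured against its own links) are assumptions, not the poster's algorithm.

```python
def pick_ls(candidates, k, max_opt):
    """Pick the fastest candidate that meets the recall constraint.

    candidates: (found_at_seconds, estimated_runtime_seconds, recall) tuples.
    Candidates found after the budget max_opt are ignored.
    """
    best = None
    for found_at, runtime, recall in candidates:
        if found_at > max_opt:
            continue  # discovered too late to count
        if recall >= k and (best is None or runtime < best[1]):
            best = (found_at, runtime, recall)
    return best


# Values from the example; the initial LS is assumed to have recall 1.0.
example = [
    (0, 60, 1.0),
    (15, 32, 0.8),
    (30, 20, 0.65),
]

chosen = pick_ls(example, k=0.7, max_opt=30)
```

Under this reading the LS found at 15s is returned: the one found at 30s is faster but falls below k = 0.7.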
H2: Dynamic Planning for Link Discovery
Hypothesis
A dynamic planner can generate more
time-efficient LS than a static planner by using
information from the execution engine.
Approach
Use of runtime approximation function for a plan
Overwrite cost estimation when a step is executed
Duplicated steps: Re-use previous results
Time complexity of O(|L|)
1st iteration
Canonical, RT(Plan1) = 32s: RT(Run(LS1)) = 12s, RT(Run(LS2)) = 10s, RT( ) = 5s, ( , 0.5) RT( ) = 5s
Filter-right, RT(Plan2) = 25s: RT(Run(LS1)) = 12s, RT(ϕ(LS2)) = 8s, ( , 0.5) RT( ) = 5s
2nd iteration
Run (qgrams(:name, :label), 0.4)
Replan
Canonical, RT(Plan1) = 11s: RT(Run(LS1)) = 0s, RT(Run(LS2)) = 1s (dependency with LS1), RT( ) = 5s, ( , 0.5) RT( ) = 5s
Filter-right, RT(Plan2) = 13s: RT(Run(LS1)) = 0s, RT(ϕ(LS2)) = 8s, ( , 0.5) RT( ) = 5s
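The two iterations can be replayed as simple arithmetic: a plan's estimated runtime is the sum of its step estimates, and once a step has run, its estimate is overwritten before replanning. The sketch below uses the runtimes from the example. The step names, the dictionary layout and the `cheapest` helper are assumptions for illustration, not CONDOR's actual interface.

```python
def plan_cost(steps, estimates):
    """Estimated runtime of a plan: the sum of its step estimates."""
    return sum(estimates[step] for step in steps)


def cheapest(plans, estimates):
    """Name of the plan with the lowest estimated runtime."""
    return min(plans, key=lambda name: plan_cost(plans[name], estimates))


# Step estimates in seconds, taken from the 1st iteration of the example.
estimates = {
    "run_LS1": 12,
    "run_LS2": 10,
    "combine": 5,        # hypothetical name for the operator step
    "filter": 5,         # hypothetical name for the final filter step
    "filter_by_LS2": 8,  # hypothetical name for the filter-right step
}

plans = {
    "canonical": ["run_LS1", "run_LS2", "combine", "filter"],
    "filter_right": ["run_LS1", "filter_by_LS2", "filter"],
}

first_choice = cheapest(plans, estimates)

# 2nd iteration: LS1 has been executed, so its cost is overwritten with 0s.
# Running LS2 drops to 1s because it depends on LS1's cached result.
after_ls1 = dict(estimates, run_LS1=0, run_LS2=1)
second_choice = cheapest(plans, after_ls1)
```

With these numbers the filter-right plan wins the first round (25s against 32s), and the canonical plan wins after replanning (11s against 13s), which shows how the preferred plan can change once execution feedback is available.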
Evaluation and Primary Results
Datasets: Abt-Buy, Amazon-Google Products, DBLP-ACM and DBLP-Scholar,
OAEI datasets, LANCE, HOBBIT-generated datasets, MOVIES, TOWNS and VILLAGES
(H1): C-RO
Use of different values for k and maxOpt
Comparison with baseline
(H2): CONDOR
Comparison with the state-of-the-art
Figure: Partial Recall for DBLP-ACM (x-axis: number of LSs, 10 to 100; y-axis: cumulative execution time in seconds, 0 to 120; series: Baseline, C-RO)
Figure: Dynamic Planning for DBLP-ACM (x-axis: number of LSs, 5 to 100; y-axis: cumulative execution time in seconds, 0 to 160; series: CANONICAL, HELIOS, CONDOR)
This work was supported by research grants from the German Ministry for Finances and Energy under the SAKE project (Grant No. 01MD15006E), from the EU H2020 Framework Programme provided for the project
HOBBIT (GA no. 688227) and from Semantic Web Science Association (SWSA).
http://aksw.org/KleanthiGeorgala.html georgala@informatik.uni-leipzig.de
