An Evaluation of Models for Runtime Approximation
in Link Discovery
Kleanthi Georgala and Michael Hoffmann and Axel-Cyrille Ngonga Ngomo
University of Leipzig
Institute for Applied Informatics
August 25th, 2017
Leipzig, Germany
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 1 / 23
Overview
1 Motivation
2 Approach
3 Evaluation
4 Conclusions and Future Work
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 2 / 23
Why Link Discovery?
:E1 rdfs:label "Engine failure"@en
:E1 rdf:type :Error
:E1 :beginDate :"2015-04-22T11:39:35"
:E1 :endDate :"2015-04-22T11:39:37"
:E2 rdfs:label "Car accident"@en
:E2 rdf:type :Accident
:E2 :beginDate :"2015-06-28T11:45:22"
:E2 :endDate :"2015-06-28T11:45:24"
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 3 / 23
What is Link Discovery
Linked Data 4th principle: Include links to other URIs so that they can
discover more things.
Definition (Link Discovery)
Given sets S and T of resources and relation R
Find M = {(s, t) ∈ S × T : R(s, t)}
Example: R = :failureType
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 4 / 23
Declarative LD
M is difficult to compute directly
Declarative LD frameworks use Link Specifications (LSs):
describe conditions for which R(s, t) holds
Similarity measure m: compare property values of resources
Specification operators op: combine two LS L1 and L2 to a more complex LS
L = op(L1, L2)
(θ, 0.73) levSim(:label, :label), 0.46
trigrams(:type, :type), 0.87
Figure: Graphical representation of an example LS for R = :failureType
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 5 / 23
Challenges in Link Discovery
Accuracy: correct links
Genetic programming
Probabilistic models
Time efficiency: fast and scalable linking
Planning algorithms (e.g. HELIOS [2]):
Use of cost functions to approximate runtime of LS
Cost functions are ONLY linear in the parameters of the planning:
threshold of LS, θ
size of datasets, |S| and |T|
...
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 6 / 23
HELIOS planner example
Canonical (1-1 correspondence between LS and the plan)
RT(Plan1) = 32s
(trigrams(:type, :type), 0.87) RT(Run(Left-subLS)) = 12s
(levSim(:label, :label), 0.46) RT(Run(Right-subLS)) = 10sRT( ) = 5s
(θ, 0.73) RT((θ, 0.73)) = 5s
Filter-right (optimization)
RT(Plan2) = 25s
(trigrams(:type, :type), 0.87) RT(Run(Left-subLS)) = 12s
(levSim(:label, :label), 0.46) RT(f (Right-subLS)) = 8s
(θ, 0.73) RT((θ, 0.73)) = 5s
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 7 / 23
Our Contribution
Three different models for runtime approximation in planning for LD
linear
exponential
mixed
Integration into HELIOS
Comparison of these models on 6 different datasets
Analysis on their sufficiency to approximate runtime
Study their generalization ability across datasets
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 8 / 23
Runtime Estimation
Sampling-based approach
similarity measure m (e.g. Levenshtein) and an implementation of the m
(e.g., Ed-Join [3])
execution of m with varying values of |S|, |T| and θ
collection of runtimes
What is the shape of the runtime evaluation function?
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 9 / 23
Runtime Estimation cont.
Define an evaluation function as a mapping φ : N × N × (0, 1] → R, whose
value at (|S|, |T|, θ) is an approximation of the runtime for the LS with
these parameters
Define R = (R1, . . . , Rn) as the measured runtimes for the parameters
S = (|S1|, . . . , |Sn|), T = (|T1|, . . . , |Tn|) and θ = (θ1, . . . , θn)
Constrain the mapping φ to be a local minimum of the L2-Loss:
E(S, T, θ, r) := R − φ(S, T, θ) 2
,
writing φ(S, T, θ) = (φ(|S1|, |T1|, θ1), . . . , φ(|Sn|, |Tn|, θn)).
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 10 / 23
Runtime Estimation cont.
Three models:
φ1(S, T, θ) = a + b|S| + c|T| + dθ (1)
φ2(S, T, θ) = exp (a + b|S| + c|T| + dθ + eθ2
) (2)
φ3(S, T, θ) = a + (b + c|S| + d|T| + e|S||T|) exp (f θ + gθ2
) (3)
where
a∗
, b∗
, · · · = arg min E(S, T, θ, R)(a, b, . . . )
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 11 / 23
Experiment set-up
Datasets:
3 benchmark datasets: Amazon-GP, DBLP-ACM and DBLP-Scholar
scalability: MOVIES and VILLAGES
all English labels from DBpedia 2014
Similarity measures and atomic LS
Ed-Join [3]: Levenshtein string distance
PPJoin+ [4]: Jaccard, Overlap, Cosine and Trigrams string similarity measures
θ ∈ [0.5, 1]
Evaluation metric: root mean squared error (RMSE)
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 12 / 23
Phases of an experiment
Training:
each model trained independently
computation of the set of coefficients for each model with minimum RMSE
Testing (Evaluation):
accuracy of the runtime estimation of each model
performance of the currently best LD planner, HELIOS
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 13 / 23
Experiment 1
Q1 : How do our models fit each class separately?
Split S and T into non-overlapping parts of equal size
Training with the first half:
selection of 15 source and 15 target random samples of random sizes
comparison each source sample with each target sample 3 times
Testing with the second half:
execution of Ed-Join and PPJoin+ with random θ ∈ [0.5, 1]
store real execution runtime
100 experiments to test each model on each dataset
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 14 / 23
Experiment 1 cont.
PPJoin+ Ed-Join
exp DBLP-ACM x
linear MOVIES, VILLAGES Amazon-GP, DBLP-ACM
mixed Amazon-GP, DBLP-Scholar DBLP-Scholar, MOVIES, VILLAGES
PPJoin+ Ed-Join
0
1
2
3
4
5
6
7
8
9
10
11
12
AverageRMSEamongdatasets
exp
linear
mixed
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 15 / 23
Experiment 2
Q2 : How do our models generalize across classes?
Split S and T into non-overlapping parts
Train on one dataset,
Test on the remaining four
Training with the first half:
same as in Q1
Testing with the second half:
same as in Q1
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 16 / 23
Experiment 2 cont.
PPJoin+ Ed-Join
exp VILLAGES VILLAGES
linear AMAZON-GP, DBLP-ACM,
DBLP-Scholar
AMAZON-GP, DBLP-ACM,
DBLP-Scholar, MOVIES
mixed MOVIES x
PPJoin+ Ed-Join
1.0E1
1.0E2
1.0E3
1.0E4
1.0E5
1.0E6
1.0E7
AverageRMSE(log) exp
linear
mixed
Figure: Training on MOVIES
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 17 / 23
Experiment 2 cont.
PPJoin+ Ed-Join
1.0E6
1.0E12
1.0E18
1.0E24
1.0E30
1.0E36
1.0E42
AverageRMSE(log)
exp
linear
mixed
Table: Training on AMAZON-GP
PPJoin+ Ed-Join
1.0E6
1.0E12
1.0E18
1.0E24
1.0E30
1.0E36
1.0E42
AverageRMSE(log)
exp
linear
mixed
Table: Training on DBLP-ACM
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 18 / 23
Experiment 3
Q3 : How do our models perform when trained on a large dataset?
Train on English labels of DBpedia,
Test on Amazon-GP, DBLP-ACM, MOVIES and VILLAGES
Training:
same as in Q1, but
15 source and 15 target random samples of various sizes between 10, 000 and
100, 000
the samples were taken from the English labels of DBpedia
Testing:
use of the EAGLE [1] to learn 100 LSs for each testing dataset
use of evaluation models obtained by training
execution of 100 LSs by HELIOS against the four datasets
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 19 / 23
Experiment 3 cont.
1E02
1E03
1E04
1E05
1E06
AverageRMSE(log) exp
linear
mixed
Figure: Training on DBpedia English labels
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 20 / 23
Conclusions
Runtime Approximation in Link Discovery:
Detailed study of 3 models: linear, exponential, mixed
Integrated into HELIOS
Experiments on 6 datasets: variety in size and classes
On average, linear models outperform the others
Mixed and exponential model are inadequate
Future Work:
Study other models for runtime estimation
Consider more features for runtime approximation
Different combinations of features
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 21 / 23
Thank you!
Visit http://aksw.org/Projects/LIMES.html
https://twitter.com/diceupb
Questions?
Kleanthi Georgala
georgala@informatik.uni-leipzig.de
AKSW Research Group at Leipzig University
and
DICE Group at Paderborn University
http://aksw.org/KleanthiGeorgala.html
This work has been supported by the H2020 project HOBBIT (GA no. 688227), the EuroStars project QAMEL (project no. 01QE1549C) and the BMWi
project SAKE (project no. 01MD15006E).
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 22 / 23
References
A.-C. N. Ngomo and K. Lyko.
Eagle: Efficient active learning of link specifications using genetic programming.
In The Semantic Web: Research and Applications, pages 149–163. Springer, 2012.
A.-C. Ngonga Ngomo.
HELIOS - Execution Optimization for Link Discovery.
In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference,
Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, pages 17–32.
Springer, 2014.
C. Xiao, W. Wang, and X. Lin.
Ed-join: an efficient algorithm for similarity joins with edit distance constraints.
Proceedings of the VLDB Endowment, 1(1):933–944, 2008.
C. Xiao, W. Wang, X. Lin, and J. X. Yu.
Efficient similarity joins for near duplicate detection.
In Proceedings of the 17th International Conference on World Wide Web, WWW
’08, pages 131–140, New York, NY, USA, 2008. ACM.
Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 23 / 23

An Evaluation of Models for Runtime Approximation in Link Discovery

  • 1.
    An Evaluation ofModels for Runtime Approximation in Link Discovery Kleanthi Georgala and Michael Hoffmann and Axel-Cyrille Ngonga Ngomo University of Leipzig Institute for Applied Informatics August 25th, 2017 Leipzig, Germany Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 1 / 23
  • 2.
    Overview 1 Motivation 2 Approach 3Evaluation 4 Conclusions and Future Work Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 2 / 23
  • 3.
    Why Link Discovery? :E1rdfs:label "Engine failure"@en :E1 rdf:type :Error :E1 :beginDate :"2015-04-22T11:39:35" :E1 :endDate :"2015-04-22T11:39:37" :E2 rdfs:label "Car accident"@en :E2 rdf:type :Accident :E2 :beginDate :"2015-06-28T11:45:22" :E2 :endDate :"2015-06-28T11:45:24" Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 3 / 23
  • 4.
    What is LinkDiscovery Linked Data 4th principle: Include links to other URIs so that they can discover more things. Definition (Link Discovery) Given sets S and T of resources and relation R Find M = {(s, t) ∈ S × T : R(s, t)} Example: R = :failureType Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 4 / 23
  • 5.
    Declarative LD M isdifficult to compute directly Declarative LD frameworks use Link Specifications (LSs): describe conditions for which R(s, t) holds Similarity measure m: compare property values of resources Specification operators op: combine two LS L1 and L2 to a more complex LS L = op(L1, L2) (θ, 0.73) levSim(:label, :label), 0.46 trigrams(:type, :type), 0.87 Figure: Graphical representation of an example LS for R = :failureType Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 5 / 23
  • 6.
    Challenges in LinkDiscovery Accuracy: correct links Genetic programming Probabilistic models Time efficiency: fast and scalable linking Planning algorithms (e.g. HELIOS [2]): Use of cost functions to approximate runtime of LS Cost functions are ONLY linear in the parameters of the planning: threshold of LS, θ size of datasets, |S| and |T| ... Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 6 / 23
  • 7.
    HELIOS planner example Canonical(1-1 correspondence between LS and the plan) RT(Plan1) = 32s (trigrams(:type, :type), 0.87) RT(Run(Left-subLS)) = 12s (levSim(:label, :label), 0.46) RT(Run(Right-subLS)) = 10sRT( ) = 5s (θ, 0.73) RT((θ, 0.73)) = 5s Filter-right (optimization) RT(Plan2) = 25s (trigrams(:type, :type), 0.87) RT(Run(Left-subLS)) = 12s (levSim(:label, :label), 0.46) RT(f (Right-subLS)) = 8s (θ, 0.73) RT((θ, 0.73)) = 5s Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 7 / 23
  • 8.
    Our Contribution Three differentmodels for runtime approximation in planning for LD linear exponential mixed Integration into HELIOS Comparison of these models on 6 different datasets Analysis on their sufficiency to approximate runtime Study their generalization ability across datasets Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 8 / 23
  • 9.
    Runtime Estimation Sampling-based approach similaritymeasure m (e.g. Levenshtein) and an implementation of the m (e.g., Ed-Join [3]) execution of m with varying values of |S|, |T| and θ collection of runtimes What is the shape of the runtime evaluation function? Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 9 / 23
  • 10.
    Runtime Estimation cont. Definean evaluation function as a mapping φ : N × N × (0, 1] → R, whose value at (|S|, |T|, θ) is an approximation of the runtime for the LS with these parameters Define R = (R1, . . . , Rn) as the measured runtimes for the parameters S = (|S1|, . . . , |Sn|), T = (|T1|, . . . , |Tn|) and θ = (θ1, . . . , θn) Constrain the mapping φ to be a local minimum of the L2-Loss: E(S, T, θ, r) := R − φ(S, T, θ) 2 , writing φ(S, T, θ) = (φ(|S1|, |T1|, θ1), . . . , φ(|Sn|, |Tn|, θn)). Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 10 / 23
  • 11.
    Runtime Estimation cont. Threemodels: φ1(S, T, θ) = a + b|S| + c|T| + dθ (1) φ2(S, T, θ) = exp (a + b|S| + c|T| + dθ + eθ2 ) (2) φ3(S, T, θ) = a + (b + c|S| + d|T| + e|S||T|) exp (f θ + gθ2 ) (3) where a∗ , b∗ , · · · = arg min E(S, T, θ, R)(a, b, . . . ) Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 11 / 23
  • 12.
    Experiment set-up Datasets: 3 benchmarkdatasets: Amazon-GP, DBLP-ACM and DBLP-Scholar scalability: MOVIES and VILLAGES all English labels from DBpedia 2014 Similarity measures and atomic LS Ed-Join [3]: Levenshtein string distance PPJoin+ [4]: Jaccard, Overlap, Cosine and Trigrams string similarity measures θ ∈ [0.5, 1] Evaluation metric: root mean squared error (RMSE) Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 12 / 23
  • 13.
    Phases of anexperiment Training: each model trained independently computation of the set of coefficients for each model with minimum RMSE Testing (Evaluation): accuracy of the runtime estimation of each model performance of the currently best LD planner, HELIOS Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 13 / 23
  • 14.
    Experiment 1 Q1 :How do our models fit each class separately? Split S and T into non-overlapping parts of equal size Training with the first half: selection of 15 source and 15 target random samples of random sizes comparison each source sample with each target sample 3 times Testing with the second half: execution of Ed-Join and PPJoin+ with random θ ∈ [0.5, 1] store real execution runtime 100 experiments to test each model on each dataset Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 14 / 23
  • 15.
    Experiment 1 cont. PPJoin+Ed-Join exp DBLP-ACM x linear MOVIES, VILLAGES Amazon-GP, DBLP-ACM mixed Amazon-GP, DBLP-Scholar DBLP-Scholar, MOVIES, VILLAGES PPJoin+ Ed-Join 0 1 2 3 4 5 6 7 8 9 10 11 12 AverageRMSEamongdatasets exp linear mixed Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 15 / 23
  • 16.
    Experiment 2 Q2 :How do our models generalize across classes? Split S and T into non-overlapping parts Train on one dataset, Test on the remaining four Training with the first half: same as in Q1 Testing with the second half: same as in Q1 Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 16 / 23
  • 17.
    Experiment 2 cont. PPJoin+Ed-Join exp VILLAGES VILLAGES linear AMAZON-GP, DBLP-ACM, DBLP-Scholar AMAZON-GP, DBLP-ACM, DBLP-Scholar, MOVIES mixed MOVIES x PPJoin+ Ed-Join 1.0E1 1.0E2 1.0E3 1.0E4 1.0E5 1.0E6 1.0E7 AverageRMSE(log) exp linear mixed Figure: Training on MOVIES Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 17 / 23
  • 18.
    Experiment 2 cont. PPJoin+Ed-Join 1.0E6 1.0E12 1.0E18 1.0E24 1.0E30 1.0E36 1.0E42 AverageRMSE(log) exp linear mixed Table: Training on AMAZON-GP PPJoin+ Ed-Join 1.0E6 1.0E12 1.0E18 1.0E24 1.0E30 1.0E36 1.0E42 AverageRMSE(log) exp linear mixed Table: Training on DBLP-ACM Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 18 / 23
  • 19.
    Experiment 3 Q3 :How do our models perform when trained on a large dataset? Train on English labels of DBpedia, Test on Amazon-GP, DBLP-ACM, MOVIES and VILLAGES Training: same as in Q1, but 15 source and 15 target random samples of various sizes between 10, 000 and 100, 000 the samples were taken from the English labels of DBpedia Testing: use of the EAGLE [1] to learn 100 LSs for each testing dataset use of evaluation models obtained by training execution of 100 LSs by HELIOS against the four datasets Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 19 / 23
  • 20.
    Experiment 3 cont. 1E02 1E03 1E04 1E05 1E06 AverageRMSE(log)exp linear mixed Figure: Training on DBpedia English labels Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 20 / 23
  • 21.
    Conclusions Runtime Approximation inLink Discovery: Detailed study of 3 models: linear, exponential, mixed Integrated into HELIOS Experiments on 6 datasets: variety in size and classes On average, linear models outperform the others Mixed and exponential model are inadequate Future Work: Study other models for runtime estimation Consider more features for runtime approximation Different combinations of features Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 21 / 23
  • 22.
    Thank you! Visit http://aksw.org/Projects/LIMES.html https://twitter.com/diceupb Questions? KleanthiGeorgala georgala@informatik.uni-leipzig.de AKSW Research Group at Leipzig University and DICE Group at Paderborn University http://aksw.org/KleanthiGeorgala.html This work has been supported by the H2020 project HOBBIT (GA no. 688227), the EuroStars project QAMEL (project no. 01QE1549C) and the BMWi project SAKE (project no. 01MD15006E). Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 22 / 23
  • 23.
    References A.-C. N. Ngomoand K. Lyko. Eagle: Efficient active learning of link specifications using genetic programming. In The Semantic Web: Research and Applications, pages 149–163. Springer, 2012. A.-C. Ngonga Ngomo. HELIOS - Execution Optimization for Link Discovery. In The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, pages 17–32. Springer, 2014. C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. Proceedings of the VLDB Endowment, 1(1):933–944, 2008. C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 131–140, New York, NY, USA, 2008. ACM. Georgala Hoffmann Ngonga Ngomo (InfAI) September 15, 2017 23 / 23