Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection

Links between datasets are an essential ingredient of Linked Open Data. Since the manual creation of links is expensive at large-scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a higher dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54, and an area under the ROC curve of up to 0.86.
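As a rough illustration of the idea in this abstract (and explicitly not the paper's RapidMiner implementation), one could score link feature vectors with an off-the-shelf outlier detector such as scikit-learn's LocalOutlierFactor. The matrix X below is a random placeholder for the binary link feature vectors described in the slides that follow.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)  # placeholder for binary link feature vectors

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(X)                        # fits the model and labels each point
scores = -lof.negative_outlier_factor_    # higher score = more likely a wrong link

suspicious = np.argsort(scores)[::-1][:10]  # the ten links to inspect first
print(suspicious)
```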

Slide 1: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Heiko Paulheim
Slide 2: Motivation
• Dataset interlinks can be wrong for many reasons
  – Oversimplified heuristic generation (e.g., label equality)
  – owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)
  – Concept drift of link targets
    • e.g., dbpedia:Prong used to denote a band until DBpedia 3.1
    • now it is a disambiguation page
• Example of a resulting wrong link:
  <http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .
Slide 3: Overall Idea
• Links between datasets follow certain patterns
  – e.g., linking a mo:MusicArtist to a dbo:Artist, and a mo:MusicalWork to a dbo:Album or a dbo:Song
• Wrong links violate those patterns
• Hence, outlier detection should find wrong links
  – Definition: "finding patterns in data that do not conform to the expected normal behavior" (Chandola et al., 2009)
• Differences from related approaches
  – does not require the same schema to be used in both datasets
  – does not require schema mappings
  – no external or human knowledge required
Slide 4: Projection of Links into Vector Space
• Represent each link as a point in an n-dimensional vector space
  – e.g., using the direct types of the linked resources
• Outliers are found in sparse areas of that space
Slide 5: Projection of Links into Vector Space
• Types
  – each type of the LHS and RHS resource becomes a binary (0/1) feature
  – types on both sides are treated separately
    • i.e., LHS_foaf:Person and RHS_foaf:Person are distinct features
• Properties
  – each ingoing/outgoing property of the LHS and RHS resource becomes a binary (0/1) feature
  – properties on both sides are treated separately
  – ingoing and outgoing properties are treated separately
    • i.e., LHS_foaf:based_near, RHS_foaf:based_near, foaf:based_near_LHS, and foaf:based_near_RHS are all distinct features
• A joint feature set of types and properties is used as a third configuration (a sketch of the projection follows below)
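To make the projection concrete, here is a minimal sketch in Python. It covers only types and outgoing properties for brevity (ingoing properties would be added the same way), and the exact prefix/suffix naming of the side-aware features, as well as the toy links and their types, are illustrative assumptions rather than details from the slides.

```python
from typing import List, Set

def link_features(lhs_types: Set[str], rhs_types: Set[str],
                  lhs_out: Set[str], rhs_out: Set[str]) -> Set[str]:
    """Side-aware feature names for one link (types + outgoing properties)."""
    feats = set()
    feats |= {"LHS_" + t for t in lhs_types}   # types of the LHS resource
    feats |= {"RHS_" + t for t in rhs_types}   # types of the RHS resource
    feats |= {p + "_LHS" for p in lhs_out}     # outgoing properties of the LHS resource
    feats |= {p + "_RHS" for p in rhs_out}     # outgoing properties of the RHS resource
    return feats

# Hypothetical links: (LHS types, RHS types, LHS outgoing props, RHS outgoing props)
links = [
    ({"mo:MusicArtist"}, {"dbo:Band"},  {"foaf:name"}, {"dbo:genre"}),
    ({"mo:MusicArtist"}, {"dbo:Band"},  {"foaf:name"}, {"dbo:genre"}),
    ({"mo:MusicArtist"}, {"dbo:Place"}, {"foaf:name"}, {"dbo:country"}),  # deviates from the pattern
]

feature_sets: List[Set[str]] = [link_features(*link) for link in links]
vocabulary = sorted(set.union(*feature_sets))   # one binary dimension per observed feature
vectors = [[1 if f in fs else 0 for f in vocabulary] for fs in feature_sets]

for v in vectors:
    print(v)
```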
Slide 6: Experiments
• Datasets: link sets between
  – the BBC Peel Sessions and DBpedia (2,087 links)
  – DBTropes and DBpedia (4,229 links)
• Gold standard
  – 100 randomly sampled links from each set, manually evaluated
  – Peel: 90 out of 100 are correct
  – DBTropes: 76 out of 100 are correct
• We run outlier detection on the whole link set
  – and validate the output only on the gold standard
Slide 7: Experiments
• Outlier detection approaches
  – assign a score (or label) to each data point
  – the higher the score, the more likely the point is an outlier
• Evaluation (a sketch follows below)
  – order the links descending by outlier score
  – ideally, all outliers are ranked above all non-outliers
  – plot a ROC curve to measure the ranking quality, reported as the AUC
  – additionally report the F-measure at the best possible threshold
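A small sketch of this evaluation protocol, with made-up labels and scores standing in for the gold standard and the detectors' output (in the paper, the evaluation is done on the 100-link gold standard samples):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score

# made-up data: 1 = wrong link (outlier), 0 = correct link
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.10, 0.20, 0.15, 0.90, 0.30, 0.70, 0.20, 0.10, 0.40, 0.25])

# area under the ROC curve of the score ranking
auc = roc_auc_score(y_true, scores)

# F-measure at the best possible threshold: evaluate every threshold implied by the scores
precision, recall, _ = precision_recall_curve(y_true, scores)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best_f1 = f1.max()

print(f"AUC = {auc:.3f}, best F1 = {best_f1:.3f}")
```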
Slide 8: Results
• Type features work better than property features
• LoOP delivers reliably good results, though not the best
• Best performance on the Peel dataset
  – CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
• Best performance on the DBTropes dataset
  – LOF (F1 = 0.5, AUC = 0.619)
Slide 9: Results
• ROC curves for the Peel dataset
  [Figure: ROC curves for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM; the curves for GAS k=10/25/50 are identical, as are those for LoOP k=25/50]
Slide 10: Results
• ROC curves for the DBTropes dataset
  [Figure: ROC curves for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM; GAS k=25/50 are mostly identical, LoOP k=25/50 are identical, and CBLOF and LDCOF are mostly identical]
Slide 11: Runtimes
• Most outlier detection algorithms are reasonably fast
  – both link sets are processed in less than 10 seconds on a normal laptop
• Exceptions:
  – clustering (for CBLOF/LDCOF) takes up to 30 seconds
  – the 1-class SVM takes up to 15 minutes
• ...but creating the feature vector representation takes much more time
  – some hours against public SPARQL endpoint(s) (see the sketch below)
  – reasonably fast with downloaded dumps
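For illustration (not code from the paper): the slow path amounts to one query per linked resource against a public endpoint, e.g. fetching direct types from DBpedia with the SPARQLWrapper library. Doing this thousands of times over the network is what pushes feature construction into the range of hours, while a local store loaded from the dumps avoids the round trips.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def direct_types(resource_uri: str, endpoint: str = "http://dbpedia.org/sparql"):
    """Fetch the rdf:type statements of a single resource from a SPARQL endpoint."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"SELECT ?type WHERE {{ <{resource_uri}> a ?type }}")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["type"]["value"] for b in results["results"]["bindings"]]

# one such call per link end; repeated thousands of times, this dominates the runtime
print(direct_types("http://dbpedia.org/resource/Prong"))
```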
Slide 12: Discussion of Results
• Results on the Peel dataset are better than on the DBTropes dataset
• Projection based on types works better than projection based on properties
  – most likely due to the lower dimensionality of the vector space
  – Peel: #types = 34, #properties = 60
  – DBTropes: #types = 81, #properties = 142
• The performance of the outlier detection algorithms varies across datasets
  – also observed in other experiments
  – general rules of thumb are hard to come up with
Slide 13: Possible Improvements & Future Work
• Other projection methods
  – e.g., using numeric counts of relations instead of binary features
• Other outlier detection algorithms
  – e.g., Replicator Neural Networks and their generalizations
• Preprocessing
  – e.g., feature subset selection
  – caveat: the valuable features are often sparse
Slide 14: Possible Improvements & Future Work
• So far, we have looked at owl:sameAs links
• The approach is not limited to that
  – it should work for other link predicates as well
  – e.g., a dataset of persons and a dataset of places, linked by foaf:based_near
• It is not even limited to link sets
  – it can also be used for debugging statements inside a knowledge base
  – e.g., dbpedia-owl:deathPlace statements
Slide 15: Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
Heiko Paulheim
