https://www.insight-centre.org/content/leveraging-matching-dependencies-guided-user-feedback-linked-data-applications
Presented at IIWeb2012
ABSTRACT
This paper presents a new approach for managing integration quality and user feedback for entity consolidation within applications consuming Linked Open Data. The quality of a dataspace containing multiple linked datasets is defined in terms of a utility measure based on domain-specific matching dependencies. The user is involved in the consolidation process through soliciting feedback about identity-resolution links, where each candidate link is ranked according to its benefit to the dataspace, calculated by approximating the improvement in dataspace utility. The approach is evaluated on real-world and synthetic datasets, demonstrating the effectiveness of the utility measure: dataspace integration quality improves while requiring fewer user-feedback iterations overall.
Implementing Semantic Web applications: reference architecture and challenges (Benjamin Heitmann)
Best paper award at the workshop for Semantic Web enabled software engineering 2009, at the International Semantic Web Conference 2009.
Full paper at: http://ceur-ws.org/Vol-524/swese2009_2.pdf
Summary of the slides and the paper:
* an empirical analysis of 98 Semantic Web applications based on an architectural analysis and an application functionality questionnaire
* a reference architecture for Semantic Web applications
* the main challenges of implementing Semantic Web technologies and their effect on an example application
* approaches for mitigating the challenges
An architecture for privacy-enabled user profile portability on the Web of Data (Benjamin Heitmann)
Presentation at the Heterogeneous Recommendation Workshop at the ACM Recommender Systems Conference 2010.
Providing relevant recommendations requires access to user profile data. Current social networking ecosystems allow third party services to request user authorisation for accessing profile data, thus enabling cross-domain recommendation. However, these ecosystems create user lock-in and social networking data silos, as the profile data is neither portable nor interoperable. We argue that innovations in reconciling heterogeneous data sources must also be matched by innovations in architecture design and recommender methodology. We present and qualitatively evaluate an architecture for privacy-enabled user profile portability, which is based on technologies from the emerging Web of Data (FOAF, WebIDs and the Web Access Control vocabulary). The proposed architecture enables the creation of a universal “private by default” ecosystem with interoperability of user profile data. The privacy of the user is protected by allowing multiple data providers to host their part of the user profile. This provides an incentive for more users to make profile data from different domains available for recommendations.
What your hairstyle says about your political preferences, and why you should... (Benjamin Heitmann)
Recent developments in the area of social networking have led to prominent users leaving Facebook due to privacy concerns. To really understand what motivated Facebook to implement these controversial changes, you have to look at the future of recommender systems. I will introduce my current research in the areas of multi-source, cross-domain and privacy-enabled user profiling and recommendation, and show how it relates to current developments in the social networking space.
Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers (Umair ul Hassan)
https://www.insight-centre.org/content/towards-expertise-modelling-routing-data-cleaning-tasks-within-community-knowledge-workers
Presented at ICIQ 2012
ABSTRACT:
Applications consuming data have to deal with a variety of data quality issues such as missing values, duplication, incorrect values, etc. Although automatic approaches can be utilized for data cleaning, the results can remain uncertain. Therefore, updates suggested by automatic data cleaning algorithms require further human verification. This paper presents an approach for generating tasks for uncertain updates and routing these tasks to appropriate workers based on their expertise. Specifically, the paper tackles the problem of modelling the expertise of knowledge workers for the purpose of routing tasks within collaborative data quality management. The proposed expertise model represents the profile of a worker against a set of concepts describing the data. A simple routing algorithm is employed for leveraging the expertise profiles to match data cleaning tasks with workers. The proposed approach is evaluated on a real-world dataset using human workers. The results demonstrate the effectiveness of using concepts for modelling expertise, in terms of the likelihood of receiving responses to tasks routed to workers.
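The concept-based routing idea in this abstract can be sketched as follows. The profile representation and scoring rule here are illustrative assumptions, not the paper's actual algorithm:

```python
from collections import Counter

def route_task(task_concepts, worker_profiles):
    """Return the worker whose concept profile best overlaps the task's concepts.

    worker_profiles maps a worker id to a Counter of concept weights,
    e.g. built from that worker's past responses.
    """
    def score(profile):
        # Sum the worker's weight for every concept the task mentions.
        return sum(profile.get(c, 0) for c in task_concepts)

    return max(worker_profiles, key=lambda w: score(worker_profiles[w]))

# Hypothetical worker profiles over data-describing concepts.
profiles = {
    "alice": Counter({"film": 5, "music": 1}),
    "bob": Counter({"geography": 4, "film": 1}),
}
print(route_task({"film", "music"}, profiles))  # alice
```

A real system would also handle worker availability and profile updates after each response; this sketch covers only the matching step.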
Enabling Case-Based Reasoning on the Web of Data (How to create a Web of Exp...) (Benjamin Heitmann)
Presentation at the "Reasoning from experiences on the Web" workshop (WebCBR 2010) at the International Conference on Case Based Reasoning 2010.
Abstract:
While case-based reasoning (CBR) has successfully been deployed on the Web, its data models are typically inconsistent with existing information infrastructure and standards. In this paper, we examine how CBR can operate on the emerging Web of Data, with mutual benefits. The expense of knowledge engineering and curating a case base can be reduced by using Linked Data from the Web of Data. While Linked Data provides experiential data from many different domains, it also contains inconsistencies, missing data and noise, which provide challenges for logic-based reasoning. CBR is well suited to provide alternative and robust reasoning approaches. We introduce (i) a lightweight CBR vocabulary which is suited for the open ecosystem of the emerging Web of Data, and provide (ii) a detailed example of a case base using data from multiple sources. We propose that for the first time the Web of Data provides data and a real context for open CBR systems.
Webinar reporting results from a Moxie Software usability study with 10 community managers. The study objective was to explore why the user interface design of a social computing platform matters for employee adoption.
Transitioning web application frameworks towards the Semantic Web (master the...) (Benjamin Heitmann)
Presents the results of a survey of 54 Semantic Web applications and shows how they fit into 6 broad application types/patterns. For every pattern, the capabilities, requirements and components are presented.
The full version of the master thesis is available at: http://eyaloren.org/pubs/heitmann-thesis.pdf
The survey itself is available at http://activerdf.org/survey
One-stop shop for software development information (Aftab Iqbal)
Discusses the issues developers face while interacting with the many software repositories, and the questions they usually have in mind while searching. Introduces the Linked Data approach to integrating information from different software repositories.
Enterprise Energy Management using a Linked Dataspace for Energy Intelligence (Edward Curry)
Energy Intelligence platforms can help organizations manage power consumption more efficiently by providing a functional view of the entire organization so that the energy consumption of business activities can be understood, changed, and reinvented to better support sustainable practices. Significant technical challenges exist in terms of information management, cross-domain data integration, leveraging real-time data, and assisting users to interpret the information to optimize energy usage. This paper presents an architectural approach to overcome these challenges using a Dataspace, Linked Data, and Complex Event Processing. The paper describes the fundamentals of the approach and demonstrates it within an Enterprise Energy Observatory.
E. Curry, S. Hasan, and S. O’Riáin, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in The Second IFIP Conference on Sustainable Internet and ICT for Sustainability (SustainIT 2012), 2012.
Leveraging existing Web Frameworks for a SIOC explorer (Scripting for the Sem...) (Benjamin Heitmann)
The SIOC data format enables mash-ups of community-focused content. This presentation introduces the SIOC format and the SIOC explorer web application, which allows you to browse and navigate such data. The slides also show how the SIOC explorer is implemented with ActiveRDF and Ruby on Rails.
Understanding Composite Web Applications with SharePoint 2010 (SharePoint Universe)
SharePoint 2010 makes it possible for IT to provide governance over enterprise application development, while allowing schools and universities to create robust web applications with or without coding.
To address the emerging importance of services and the relevance of relationships, we have developed and introduced the concept of the Open Semantic Service Network (OSSN). OSSNs are networks which relate services, with the assumption that firms make the information about their services openly available using suitable models. Services, relationships and networks are said to be open (similar to LOD) when their models are transparently available and accessible by external entities and follow an open world assumption. Networks are said to be semantic when they explicitly describe their capabilities and usage, typically using a conceptual or domain model, and ideally using Semantic Web standards and techniques. One limitation of OSSNs is that they were conceived without accounting for the dynamic behavior of service networks. In other words, they can only capture static snapshots of service-based economies, but do not include any mechanism to model the reactions and effects that services have on other services, or the notion of time.
Advanced Fuzzy Logic Based Image Watermarking Technique for Medical Images (IJARIIT)
Segmentation algorithms vary with the type of medical image, such as MRI, CT, US, etc. The current study can be further extended to develop a GUI-tool-based approach for separating the ROI. Additionally, a new technique for separating the ROI from the original image, applicable to all types of medical images, could be developed. The separated ROI can be stored with its xmin, xmax, ymin and ymax values so that, at the end of the embedding process and before transmitting the watermarked image, the segmented ROI can be attached to the watermarked image. Any medical image watermarking approach will be suitable if the ROI is segmented from the medical image with these four values; embedding of the watermark can then be done on the whole medical image. In this paper we work on different scans, such as CT scans, brain scans, etc.; our results are significantly higher than others.
A Capability Requirements Approach for Predicting Worker Performance in Crowdsourcing (Umair ul Hassan)
https://www.insight-centre.org/content/capability-requirements-approach-predicting-worker-performance-crowdsourcing
Presented at CollaborateCom 2013
Abstract:
Assigning heterogeneous tasks to workers is an important challenge for crowdsourcing platforms. Current approaches to task assignment have primarily focused on content-based approaches, qualifications, or work history. We propose an alternative and complementary approach that focuses on the capabilities workers employ to perform tasks. First, we model various tasks according to the human capabilities required to perform them. Second, we capture the capability traces of crowd workers' performance on existing tasks. Third, we predict the performance of workers on new tasks, with the help of capability traces, to make task routing decisions. We evaluate the effectiveness of our approach on three different tasks: fact verification, image comparison, and information extraction. The results demonstrate that we can predict workers' performance based on worker capabilities. We also highlight limitations and extensions of the proposed approach.
SLUA: Towards Semantic Linking of Users with Actions in Crowdsourcing (Umair ul Hassan)
https://www.insight-centre.org/content/slua-towards-semantic-linking-users-actions-crowdsourcing
Presented at ISWC 2013
Abstract:
Recent advances in web technologies allow people to help solve complex problems by performing online tasks in return for money, learning, or fun. At present, human contribution is limited to the tasks defined on individual crowdsourcing platforms. Furthermore, there is a lack of tools and technologies that support matching of tasks with appropriate users across multiple systems. A more explicit capture of the semantics of crowdsourcing tasks could enable the design and development of matchmaking services between users and tasks. The paper presents the SLUA ontology, which aims to model users and tasks in crowdsourcing systems in terms of the relevant actions, capabilities, and rewards. This model describes different types of human tasks that help in solving complex problems using crowds. The paper provides examples of describing users and tasks in some real-world systems with the SLUA ontology.
Effects of Expertise Assessment on the Quality of Task Routing in Human Computation (Umair ul Hassan)
https://www.insight-centre.org/content/effects-expertise-assessment-quality-task-routing-human-computation
Presented at SoHuman'12
Abstract:
Human computation systems are characterized by the use of human workers to solve computationally difficult problems. Expertise profiling involves assessment and representation of a worker’s expertise, in order to route human computation tasks to appropriate workers. This paper studies the relationship between the assessment workload on workers and the quality of task routing. Three expertise assessment approaches were compared with the help of a user study, using two different groups of human workers. The first approach requests workers to provide self-assessment of their knowledge. The second approach measures the knowledge of workers through their performance against tasks with known responses. We propose a third approach based on a combination of self-assessment and task-assessment. The results suggest that the self-assessment approach requires minimum assessment workload from workers during expertise profiling. By comparison, the task-assessment approach achieved the highest response rate and accuracy. The proposed approach requires less assessment workload, while achieving the response rate and accuracy similar to the task-assessment approach.
A Collaborative Approach for Metadata Management for Internet of Things (Umair ul Hassan)
https://www.insight-centre.org/content/collaborative-approach-metadata-management-internet-things-linking-micro-tasks-physical
Presented at CollaborateCom 2013
ABSTRACT:
There have been considerable efforts in modelling the semantics of the Internet of Things and its specific context. Acquiring and managing metadata related to physical devices and their surrounding environment becomes challenging due to the dynamic nature of the environment. This paper focuses on managing metadata for the Internet of Things with the help of crowds. Specifically, the paper proposes a collaborative approach for collecting and maintaining metadata through micro-tasks that can be performed using a variety of platforms, e.g. mobiles, laptops, kiosks, etc. The approach allows non-experts to contribute towards metadata management through micro-tasks, resulting in reduced cost and time. The applicability of the proposed approach is demonstrated through a use-case implementation for managing sensor metadata for energy management in small buildings.
Schema-agnostic queries over large-schema databases: a distributional semanti... (Andre Freitas)
The evolution of data environments towards growth in the size, complexity, dynamicity and decentralisation (SCoDD) of schemas drastically impacts contemporary data management. The SCoDD trend emerges as a central data management concern in Big Data scenarios, where users and applications demand more complete data, produced by independent data sources, under different semantic assumptions and contexts of use. Most Database Management Systems (DBMSs) today target a closed communication scenario, where the symbolic schema of the database is known a priori by the database user, who is able to interpret it in an unambiguous way. The context in which the data is consumed and produced is well-defined, and it is typically the same context in which the data was created. In contrast, data management under SCoDD conditions targets an open communication scenario, where the symbolic system of the database is unknown to the user and multiple interpretation contexts are possible. In this case the database can be created under a different context from that of the database user. The emergence of this new data environment demands revisiting the semantic assumptions behind databases and designing data access mechanisms which can support semantically heterogeneous (open communication) data environments.
This work aims at filling this gap by proposing a complementary semantic model for databases, based on distributional semantic models. Distributional semantics provides a complementary perspective to the formal perspective of database semantics, and supports semantic approximation as a first-class database operation. Differently from models which describe uncertain and incomplete data, or probabilistic databases, distributional-relational models focus on the construction of conceptual approximation approaches for databases, supported by a comprehensive semantic model automatically built from large-scale unstructured data external to the database, which serves as a semantic/commonsense knowledge base. The semantic model can be used to support schema-agnostic queries, i.e. abstracting the data consumer from the specific conceptualization behind the data.
The proposed distributional-relational semantic model is supported by a distributional structured vector space model, named τ-Space, which represents structured data under a distributional semantic model representation and, in coordination with a query planning approach, supports a schema-agnostic query mechanism for large-schema databases. The query mechanism is materialized in the Treo query engine and is evaluated using schema-agnostic natural language queries.
The evaluation of the query mechanism confirms that distributional semantics provides a high-recall, medium-high-precision, and low-maintainability solution to cope with the abstraction and conceptual-level differences in schema-agnostic queries over large-schema/schema-less open domain datasets.
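A toy illustration of the distributional idea (this is not the actual τ-Space model; the vectors and attribute names below are invented): a query term can be matched to a schema attribute by comparing sparse distributional vectors, rather than by exact symbolic agreement:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term weights)."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up co-occurrence vectors for two schema attributes; in a real
# distributional model these would be built from a large text corpus.
attributes = {
    "spouse": {"married": 3.0, "wife": 2.0, "husband": 2.0},
    "birthPlace": {"born": 3.0, "city": 1.0},
}

# A query term that never appears literally in the schema still matches
# the semantically closest attribute.
query_term = {"wife": 1.0, "married": 1.0}
best = max(attributes, key=lambda a: cosine(query_term, attributes[a]))
print(best)  # spouse
```

The point of the sketch is the abstraction: the query author never needs to know that the attribute is called "spouse", which is what "schema-agnostic" means here.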
A Multi-armed Bandit Approach to Online Spatial Task Assignment (Umair ul Hassan)
https://www.insight-centre.org/content/multi-armed-bandit-approach-online-spatial-task-assignment
Presented at UIC 2014
Abstract
Spatial crowdsourcing uses workers to perform tasks that require travel to different locations in the physical world. This paper considers the online spatial task assignment problem, in which spatial tasks arrive in an online manner and an appropriate worker must be assigned to each task. However, the outcome of an assignment is stochastic, since the worker can choose to accept or reject the task. The primary goal of the assignment algorithm is to maximize the number of successful assignments over all tasks. This presents an exploration-exploitation challenge: the algorithm must learn the task acceptance behavior of workers while selecting the best worker based on what it has learned so far. We address this challenge by defining a framework for online spatial task assignment based on a multi-armed bandit formalization of the problem. Furthermore, we adapt a contextual bandit algorithm to assign a worker based on the spatial features of tasks and workers. The algorithm simultaneously adapts the worker assignment strategy based on the observed task acceptance behavior of workers. Finally, we present an evaluation methodology based on a real-world dataset and evaluate the performance of the proposed algorithm against baseline algorithms. The results demonstrate that the proposed algorithm performs better in terms of the number of successful assignments.
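To illustrate the bandit framing above, here is a minimal sketch, not the paper's actual algorithm: an epsilon-greedy assigner in which each worker is an arm, the reward is task acceptance, and the worker's distance to the task serves as a simple spatial context feature. The class name and the distance-discounted scoring heuristic are assumptions.

```python
import math
import random

class EpsilonGreedyAssigner:
    """Online spatial task assignment as a multi-armed bandit:
    each worker is an arm; reward is 1 if the worker accepts the task."""

    def __init__(self, workers, epsilon=0.1):
        self.workers = workers                  # worker id -> (x, y) location
        self.epsilon = epsilon                  # exploration probability
        self.accepts = {w: 0 for w in workers}  # observed acceptances per worker
        self.trials = {w: 0 for w in workers}   # assignments offered per worker

    def _score(self, worker, task_xy):
        # Estimated acceptance rate (optimistic 1.0 for untried workers),
        # discounted by travel distance -- the spatial "context" feature.
        t = self.trials[worker]
        rate = self.accepts[worker] / t if t else 1.0
        wx, wy = self.workers[worker]
        dist = math.hypot(wx - task_xy[0], wy - task_xy[1])
        return rate / (1.0 + dist)

    def assign(self, task_xy):
        # Explore with probability epsilon, otherwise exploit the best score.
        if random.random() < self.epsilon:
            return random.choice(list(self.workers))
        return max(self.workers, key=lambda w: self._score(w, task_xy))

    def update(self, worker, accepted):
        # Learn from the observed accept/reject outcome.
        self.trials[worker] += 1
        self.accepts[worker] += 1 if accepted else 0
```

With epsilon set to 0 the assigner is purely greedy: a nearby worker is preferred until it starts rejecting tasks, after which the estimate drops and other workers are tried.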
https://www.insight-centre.org/content/research-toolbox-data-analysis-python-waternomics-case-study
This seminar aims to highlight the flexibility of Python as a useful programming language for everyday tasks in research. It is based on the presenter's experience in the Waternomics project and research experiments. The overall goal is to share experience of data access, manipulation, and visualization. The seminar will focus on the following main topics and their relevant Python libraries:
(1) The Python ecosystem for data science
(2) Data access with pandas, RDFLib, requests, json
(3) Data manipulation with numpy, scipy, statsmodels
(4) Data visualization with matplotlib, seaborn, and bokeh
(5) Tips and tricks (Jupyter server, pgfplots, LaTeX, PyCharm)
(6) Advanced libraries (scikit-learn, pyomo, NLTK)
The seminar is expected to use the full slot of the Reading Group session, with opportunities for questions and discussion between topics.
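As a small taste of the kind of everyday task the seminar covers, the sketch below parses a JSON payload (as might come back from a REST API queried with requests) and summarizes it. It deliberately uses only the standard library's json and statistics modules so it runs anywhere; the data and field names are made up.

```python
import json
import statistics

# Toy sensor payload; in practice this string might be the body of an
# HTTP response fetched with `requests` (hypothetical data).
payload = '{"sensor": "s1", "litres": [12.0, 15.0, 9.0]}'

record = json.loads(payload)                    # data access: parse JSON
mean_usage = statistics.mean(record["litres"])  # data manipulation: aggregate
print(round(mean_usage, 2))                     # prints 12.0
```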
E2.0 - Next Generation Portal and Content Management - muratc2a
Presentation titled "E2.0 - Next Generation Portal and Content Management - Oracle Success Stories", given by Andrew Gilboy for Oracle Day, 9 November 2009.
Layer 7 Mobile Security Workshop with CA Technologies and Forrester Research ... - CA API Management
The bring-your-own-device (BYOD) trend is in full swing as the growth of mobile devices within the enterprise explodes. How do you enable secure data access for mobile applications? How do you deal with user authentication? How do you allow broader adoption of enterprise applications on user owned devices? CA and Layer 7 outline solutions to these issues, explore different approaches to mobile security, and use case studies to illustrate how others have solved these problems.
This workshop was all about:
• The latest mobile trends and opportunities
• Emerging mobile risks and how these can be addressed
• A reference architecture for secure enterprise mobility
A Distributional Structured Semantic Space for Querying RDF Graph Data - Andre Freitas
The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data on the Web today, end users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively query relationships in RDF while abstracting them from the underlying data model represents a fundamental problem for Web-scale Linked Data consumption. This article introduces a distributional structured semantic space which enables data model independent natural language queries over RDF data. The center of the approach relies on the use of a distributional semantic model to address the level of semantic interpretation demanded to build the data model independent approach. The article analyzes the geometric aspects of the proposed space, providing its description as a distributional structured vector space, which is built upon the Generalized Vector Space Model (GVSM). The final semantic space proved to be flexible and precise under real-world query conditions achieving mean reciprocal rank = 0.516, avg. precision = 0.482 and avg. recall = 0.491.
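The mean reciprocal rank figure reported above can be computed as in the sketch below; this follows the standard definition of MRR and is not the authors' evaluation code.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR over a batch of queries: the average of 1/rank of the first
    relevant answer per query (0 when no relevant answer is returned)."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Two toy queries: first relevant hit at rank 1 and at rank 2.
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))  # 0.75
```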
Querying Heterogeneous Datasets on the Linked Data Web - Edward Curry
The growing number of datasets published on the Web as linked data brings both opportunities for high data availability and challenges inherent to querying data in a semantically heterogeneous and distributed environment. Approaches used for querying siloed databases fail at Web-scale because users don't have an a priori understanding of all the available datasets. This article investigates the main challenges in constructing a query and search solution for linked data and analyzes existing approaches and trends.
Northridge Webinar SharePoint 2010 Public Web - jfarq
Microsoft SharePoint continues to accelerate as a platform for both “in front of the firewall” solutions and “behind the firewall” solutions. Gartner has reported that more than 50% of its own client organizations are using SharePoint in some capacity, and with the recent introduction of SharePoint 2010 exponential growth is further anticipated. During this session, Northridge SharePoint consulting experts will discuss how SharePoint is more than an enterprise intranet, enterprise content management, and BI platform -- SharePoint is a solid foundation for external web solutions.
Whether you are currently leveraging your organization’s SharePoint platform investment for your external web marketing or business solutions, or considering it, this webinar will be valuable in understanding how the SharePoint platform aligns with your business and marketing requirements, including areas such as:
• User Experience & Creative Design
• Web Content Management
• Search
• Custom Application Development
• Rich Internet Applications
A Multidimensional Semantic Space for Data Model Independent Queries over RDF... - Andre Freitas
IEEE International Conference on Semantic Computing (ICSC 2011).
A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data
André Freitas, João Gabriel Oliveira, Edward Curry, Seán O’Riain
http://andrefreitas.org/papers/preprint_multidimensional_ieee_icsc_2011.pdf
Abstract: The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data
on the Web today, end-users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively
query relationships in RDF while abstracting them from the underlying data model represents a fundamental problem for Web-scale Linked Data consumption. This article introduces a multidimensional semantic space model which enables data model independent natural language queries over RDF data. The center of the approach relies on the use of a distributional semantic model to address the level of semantic interpretation
demanded to build the data model independent approach. The final multidimensional semantic space proved to be flexible and precise under real-world query conditions, achieving mean reciprocal rank = 0.516, avg. precision = 0.482 and avg. recall = 0.491.
Identity access and privacy in the new hybrid enterprise slides - CA API Management
Identity, Access & Privacy in the New Hybrid Enterprise featuring Forrester Research, Inc.
Make sense of OAuth, OpenID Connect and UMA
Overview
In the new hybrid enterprise, organizations need to manage business functions that flow across their domain boundaries in all directions: partners accessing internal applications; employees using mobile devices; internal developers mashing up Cloud services; internal business owners working with third-party app developers.
Integration increasingly happens via APIs and native apps, not browsers. Zero Trust is the new starting point for security and access control and it demands Internet scale and technical simplicity – requirements the go-to Web services solutions of the past decade, like SAML and WS-Trust, struggle to solve.
This webinar from Layer 7 Technologies, featuring special guest Eve Maler of Forrester Research, Inc., will:
• Discuss emerging trends for access control inside the enterprise
• Provide a blueprint for understanding adoption considerations
You Will Learn
• Why access control is evolving to support mobile, Cloud and API-based interactions
• How the new standards (OAuth, OpenID Connect and UMA) compare to technologies like SAML
• How to implement OAuth and OpenID Connect, based on case study examples
• Futures around UMA and enterprise-scale API access
Presented by
• Scott Morrison
CTO, Layer 7 Technologies
• Eve Maler
Principal Analyst, Forrester Research, Inc.
Intranet 2.0 - Integrating Enterprise 2.0 into your corporate intranet - James Dellow
Enterprise 2.0 opportunities and challenges; The technology building blocks: Blogs, RSS,
tags, search and wikis; Implementation approaches: Nature or nurture? Pulling it all together and getting started.
This presentation was made as a workshop at Intranet '07 on 20th September, 2007 in Sydney, Australia. Note: This version of the presentation pack contains only key slides and omits additional reading materials provided.
The World Wide Web is booming and radically vibrant thanks to well-established standards and a widely accountable framework that guarantees interoperability at various levels of applications and of society as a whole. So far, the Web has largely functioned on the basis of human intervention and manual processing, but the next-generation Web, which researchers call the Semantic Web, aims at automatic processing and machine-level understanding. The Semantic Web becomes possible only if further levels of interoperability prevail among applications and networks. To achieve this interoperability and greater functionality among applications, the W3C has already released well-defined standards such as RDF/RDF Schema and OWL. Using XML alone as a tool for semantic interoperability has not achieved effective results and has failed to provide interconnection at a larger level. This motivates the inclusion of an inference layer at the top of the Web architecture and paves the way for a common design for encoding ontology representation languages in data models such as RDF/RDFS. In this research article, we give a clear account of the roots of Semantic Web research and its ontological background, which may help to deepen the understanding of named entities on the Web.
Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications
1. Digital Enterprise Research Institute www.deri.ie
Leveraging Matching Dependencies for Guided
User Feedback in Linked Data Applications
Umair ul Hassan, Sean O’Riain, Edward Curry
Digital Enterprise Research Institute
National University of Ireland, Galway
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
2. Outline
Motivation & Problem Space
Identity Resolution on the Linked Open Data (LOD) Web
Proposed Approach
LOD Application Architecture
How it relates to existing works
Evaluation
Conclusion & Future Work
3. Overview
Identity Resolution in the Linked Open Data Web
Real-world entities have multiple identifiers in LOD
Identity resolution links have associated uncertainty
LOD Applications require user verification of links
Problem
Feedback for all links is infeasible for large datasets
LOD Applications have domain specific utility of links
Proposed Approach
Leverages matching dependencies to define domain specific
requirements of identity resolution
Ranks identity resolution links according to value of perfect information
4. Linked Open Data (LOD)
Expose and interlink datasets on the Web
Using URIs to identify “things” in your data
Using a graph representation (RDF) to describe URIs
Vision: The Web as a huge graph database
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
5. Linked Data Example
Identity resolution links
Multiple Identifiers
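The example above (one real-world entity with multiple identifiers, connected by identity resolution links) can be sketched as plain RDF triples serialized to N-Triples using only the standard library; the URIs are illustrative, not real dataset identifiers.

```python
# An entity described by one URI, with an owl:sameAs identity link to
# another URI believed to denote the same real-world entity
# (illustrative URIs).
triples = [
    ("http://dbpedia.org/resource/Galway",
     "http://www.w3.org/2000/01/rdf-schema#label", '"Galway"'),
    ("http://dbpedia.org/resource/Galway",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://sws.geonames.org/2964180/"),
]

def to_ntriples(triples):
    # Serialize as N-Triples: URIs in angle brackets, literals kept quoted.
    def term(t):
        return t if t.startswith('"') else "<" + t + ">"
    return "\n".join("{} {} {} .".format(term(s), term(p), term(o))
                     for s, p, o in triples)

print(to_ntriples(triples))
```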
6. Identity Resolution in LOD
Identity resolution is required for consolidation of data in
applications consuming LOD
Three sources of identity resolution links
Provided by data publishers (e.g. dbpedia.org)
Generated by consumer through tools (e.g. SILK, SEMIRI, RiMOM)
Maintained by third party web services (e.g. sameas.org)
Uncertainty associated with links
Due to multiple identity equivalence interpretations
Due to characteristics of link generation algorithms (similarity based)
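A similarity-based link generator of the kind mentioned above (SILK-style, but greatly simplified) can be sketched with the standard library's difflib; the URIs, labels, similarity measure, and threshold below are all illustrative assumptions.

```python
from difflib import SequenceMatcher

def candidate_links(entities_a, entities_b, threshold=0.8):
    # Propose an owl:sameAs candidate whenever label similarity
    # meets the threshold; real tools use richer comparators.
    links = []
    for uri_a, label_a in entities_a.items():
        for uri_b, label_b in entities_b.items():
            score = SequenceMatcher(None, label_a.lower(),
                                    label_b.lower()).ratio()
            if score >= threshold:
                links.append((uri_a, uri_b, round(score, 2)))
    return links

# Hypothetical entities from two datasets, keyed by URI with a label.
a = {"http://ex.org/a/AspirinDrug": "Aspirin"}
b = {"http://ex.org/b/aspirin": "aspirin",
     "http://ex.org/b/ibuprofen": "ibuprofen"}
print(candidate_links(a, b))
```

The score attached to each candidate is exactly the similarity-derived uncertainty the slide refers to: links are proposed, not guaranteed, and still need verification.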
7. Identity Resolution Problem
User feedback for uncertain links
Verify uncertain identity resolution links from users/experts
Improve quality of entity consolidation
Challenges
Domain specific semantic requirements
– How to define domain specific requirements of quality for Linked
Data applications?
Limited user attention
– How to rank candidate links according to their benefit to maximize
utility of user feedback?
8. Identity Resolution Problem
User feedback for uncertain links
Verify uncertain identity resolution links from users/experts
Improve quality of entity consolidation
Proposed Approach
Domain specific semantic requirements
– Leverage Matching Dependencies
Limited user attention
– Employ value of perfect information theory
9. LOD Application Architecture
[Architecture diagram: Utility Module, Feedback Module, and Consolidation Module, with matching dependencies, rules, candidate links, questions, feedback, utility improvement, and ranked feedback tasks flowing between them.]
Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.
10. Related Work
Jeffery et al., “Pay-as-you-go user feedback for dataspace
systems,” in Proceedings of the 2008 ACM SIGMOD
Conference, 2008, pp. 847-860.
Utility:
In terms of cardinality of query results on dataspace
General metric not suitable for application specific data quality
Assumption:
Availability of global query statistics
– Problematic for Linked Open Data
11. Proposed Approach
Domain Specific Utility
Define utility in terms of user-specified rules, i.e. matching dependencies
Rank candidate links for user feedback according to the value of perfect information
Assumptions
Matching dependencies are either provided by the user or generated through existing tools
Utility is based on the satisfaction ratio of dependencies in the dataspace
12. Proposed Approach
Matching Dependencies
Matching Rule
Example
Utility of a rule: its contribution to the overall dataspace utility U(D, M)
Value of perfect information for candidate link m_k with confidence p_k:
g(m_k) = p_k · U(D_{m_k}, M ∪ {m_k}) + (1 − p_k) · U(D_{m_k}, M \ {m_k}) − U(D, M)
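The ranking idea on the slide above can be sketched as follows, under the simplifying assumptions that utility is the satisfaction ratio of matching rules and that confirming or rejecting a link reduces to adding or removing a triple; all names are hypothetical.

```python
def utility(dataspace, rules):
    # Utility as the satisfaction ratio of matching dependencies:
    # the fraction of rules that hold in the current dataspace.
    return sum(1 for rule in rules if rule(dataspace)) / len(rules)

def vpi(dataspace, rules, link, p):
    # g(m_k): expected utility after learning the truth of link m_k
    # (confirmed with probability p, rejected otherwise), minus the
    # current utility U(D, M).
    u_now = utility(dataspace, rules)
    u_accept = utility(dataspace | {link}, rules)  # user confirms the link
    u_reject = utility(dataspace - {link}, rules)  # user rejects the link
    return p * u_accept + (1 - p) * u_reject - u_now

# Toy dataspace: one rule is satisfied only when the sameAs link is present.
rules = [lambda d: ("a", "sameAs", "b") in d,
         lambda d: True]
space = set()
candidates = {("a", "sameAs", "b"): 0.9, ("a", "sameAs", "c"): 0.9}

# Rank candidate links for user feedback, highest expected gain first.
ranked = sorted(candidates, key=lambda m: vpi(space, rules, m, candidates[m]),
                reverse=True)
```

Here the link affecting a matching rule ranks first: verifying it can change utility, while the other candidate cannot, so asking the user about it wastes attention.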
13. Evaluation
Measure change in utility of a dataspace according to
matching rules after a specific number of feedback iterations
Candidate links generated by the Silk framework
14. Evaluation
Datasets
IIMB 2009 Dataset
- Data source: Instance Matching Benchmark 2009 (IIMB 2009 collection; reference ontology, with ontology #16 containing errors in data value attributes)
- Entity types: imdb:Movie
- Total triples: 291; entity IDs: 44; attributes: 9; values: 130
- Candidate links: 81; correct links: 22
UCI-Adult Dataset
- Data source: UCI Machine Learning Repository (US Census dataset; manually created duplicates and value errors)
- Entity types: foaf:Person
- Total triples: 64,000; entity IDs: 4,000; attributes: 16; values: 10,878
- Candidate links: 72; correct links: 72
Drug Dataset
- Data source: Instance Matching Benchmark 2010 (DrugBank and Sider datasets; interlinking between two datasets of the same domain)
- Entity types: drugbank:drugs, sider:drugs
- Total triples: 14,348; entity IDs: 5,696; attributes: 3; values: 8,473
- Candidate links: 94; correct links: 66
16. Conclusion
Matching dependencies provide an effective mechanism to:
Represent entity matching rules
Specify domain specific semantic requirements
Measure utility of dataspaces
Value of perfect information enables effective ranking strategy
for user feedback
In all three datasets, 100% of the utility improvement was reached with less than 40% of the user feedback iterations
17. Future Work
Expand to other data quality problems
Expand on types of dependencies such as comparable
dependencies and order dependencies
Allow multi-user feedback for collaborative data cleaning
Editor's Notes
Personal background
Executive summary vs. overview
The complete stack of Semantic Web technologies is based on open standards and protocols. Semantic Web technologies focus on the application layer of the Internet stack.
Go back to research question slides. Go back to the workflow and highlight what's needed.