Published in: Technology, Education

Identifying Relevant Sources for Data Linking using a Semantic Web Index

  1. Identifying Relevant Sources for Data Linking using a Semantic Web Index
     Andriy Nikolov, Mathieu d’Aquin
     Knowledge Media Institute, The Open University, UK
  2. How to link a new dataset?
      • What other repositories contain relevant data that I should link to?
          • Select the external repository
      • How to select the relevant data instances to link?
          • Select the relevant classes within the chosen repository
     [Diagram labels: TV programs, movies, pieces of music, actors, composers; LinkedMDB, DBPedia, Freebase, MusicBrainz, bestbuy; ?]
  3. Selection criteria
      • Additional information about local instances
      • Popularity
      • Degree of overlap
     [Diagram labels: data.open.ac.uk, Publication data, DBPedia, DBLP, rae:RKBExplorer]
  4. Available information
      • Additional information about resources
          • Schema ontology
          • Test examples
      • Popularity
          • VoID descriptors
              • Linking repositories
          • Catalog of repositories (CKAN)
      • Degree of overlap
          • VoID descriptors (only topic relevance)
          • Relevant info hard to obtain on the client side
  5. Approach
      • Search for sources with potentially high degree of overlap
          • Use a subset of entity labels from the original dataset as keywords for entity search
  6. Approach
      • Aggregate results
          • Group instances occurring in returned result sets by their source repositories
  7. Approach
      • Rank sources
          • Sort by number of individuals returned in search results
  8. Approach
      • Select “most relevant” class
          • Select the class in each source that covers most of the instances
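
The aggregation, ranking, and class-selection steps of the approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hit` record, the function names, the `ex:` identifiers, and the toy hits are all hypothetical, and the search step itself (querying a semantic index with instance labels as keywords) is replaced by a hard-coded result list.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One instance returned by entity search (hypothetical record layout)."""
    uri: str
    source: str
    classes: frozenset


def rank_sources(hits):
    """Group hits by source repository; sort by number of distinct individuals."""
    individuals = defaultdict(set)
    for hit in hits:
        individuals[hit.source].add(hit.uri)
    counts = [(source, len(uris)) for source, uris in individuals.items()]
    # Most individuals first; ties broken by name so the order is deterministic.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


def most_relevant_class(hits, source):
    """Return the class in `source` that covers the most distinct individuals."""
    classes_by_uri = {hit.uri: hit.classes for hit in hits if hit.source == source}
    coverage = Counter(c for classes in classes_by_uri.values() for c in classes)
    if not coverage:
        return None
    # Highest coverage first; ties broken by class name.
    return min(coverage, key=lambda c: (-coverage[c], c))


# Toy stand-in for search results obtained with labels from the local dataset.
toy_hits = [
    Hit("ex:a1", "DBPedia", frozenset({"dbpedia:Person", "dbpedia:MusicArtist"})),
    Hit("ex:a2", "DBPedia", frozenset({"dbpedia:Person", "dbpedia:MusicArtist"})),
    Hit("ex:a3", "DBPedia", frozenset({"dbpedia:Person"})),
    Hit("ex:b1", "DBLP", frozenset({"ex:Author"})),
]

print(rank_sources(toy_hits))                    # [('DBPedia', 3), ('DBLP', 1)]
print(most_relevant_class(toy_hits, "DBPedia"))  # dbpedia:Person
```

On this toy data the generic `dbpedia:Person` wins over `dbpedia:MusicArtist`, because one hit is typed only as a person. Picking the class by raw coverage alone can therefore favour overly broad classes.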
  9. Issues: imprecise results
      • Main cause: ambiguous instance labels
      • Inclusion of irrelevant sources
          • E.g., DBLP for movie score composers
      • Selection of inappropriate classes within the selected source
          • Too generic: e.g., dbpedia:Person vs dbpedia:MusicArtist
          • Irrelevant: e.g., akt:Publication-Reference (journal volume) vs akt:Journal
  10. Filtering results
      • Determine potentially irrelevant classes
          • Use state-of-the-art schema matching to select relevant classes
  11. Filtering results
      • Filter out irrelevant search results
          • Only consider search result instances belonging to “approved” classes
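
The filtering step can be illustrated with a short Python sketch. It assumes the set of “approved” classes has already been produced by a schema matcher; here that set is simply hard-coded. The function name, the tuple layout, and the `ex:` identifiers are hypothetical and chosen only for illustration.

```python
def filter_hits(hits, approved_classes):
    """Keep only hits typed with at least one approved class."""
    return [
        (uri, source, classes)
        for uri, source, classes in hits
        if classes & approved_classes
    ]


# Toy search results as (instance URI, source repository, classes) tuples.
toy_hits = [
    ("ex:a1", "DBPedia", {"dbpedia:Person", "dbpedia:MusicArtist"}),
    ("ex:a2", "DBPedia", {"dbpedia:Person"}),
    ("ex:b1", "DBLP", {"ex:Author"}),
]

# Assumed output of schema matching against the local class; not computed here.
approved = {"dbpedia:MusicArtist"}

kept = filter_hits(toy_hits, approved)
remaining_sources = sorted({source for _, source, _ in kept})

print([uri for uri, _, _ in kept])  # ['ex:a1']
print(remaining_sources)            # ['DBPedia']
```

A source whose hits all fall outside the approved classes drops out of the ranking entirely, which is how filtering can remove an irrelevant repository but also discard a relevant one.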
  12. Preliminary experiments
      • Datasets
          • ORO journals (data.open.ac.uk): 3110 instances
          • LinkedMDB films: 400 instances
          • LinkedMDB music contributors: 400 instances
      • External components
          • Semantic index: Sig.ma
          • Ontology matching techniques: CIDER, instance-based schema mappings retrieved from BTC2009 dataset
  13. Preliminary experiments
      • Performance measure:
          • Proportion of relevant sources among the top-10 returned results

      Before filtering     +/-    After filtering    +/-
      rae2001 (RKB)        +      rae2001 (RKB)      +
      dotac (RKB)          +      DBPedia            +
      DBPedia              +      dblp.l3s.de        +
      oai (RKB)            +      Freebase           +
      dblp.l3s.de          +      DBLP (RKB)         +
      wordnet (RKB)        -      eprints (RKB)      +
      bibsonomy            -
      eprints (RKB)        +
      Freebase             +
      www.examiner.com     -

  14. Preliminary experiments
      • Summary:
          • Top-ranked returned repositories are largely relevant from the point of view of linking
          • Filtering using schema matching techniques greatly improves precision (all remaining sources are relevant)
          • … but at the expense of some recall
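
The performance measure (proportion of relevant sources among the top-10 returned results) can be recomputed from the before/after listing on slide 13. The sketch below is illustrative only: the function name is hypothetical, and the two lists are a reading of the flattened slide table in which the "Before filtering" column has ten entries and the "After filtering" column has six, so the row alignment is an assumption.

```python
def proportion_relevant(judged_sources, k=10):
    """Share of sources marked relevant among the first k returned sources."""
    top = judged_sources[:k]
    if not top:
        return 0.0
    return sum(1 for _, relevant in top if relevant) / len(top)


# (source, relevant?) pairs as read from the slide; "+" maps to True, "-" to False.
before_filtering = [
    ("rae2001 (RKB)", True), ("dotac (RKB)", True), ("DBPedia", True),
    ("oai (RKB)", True), ("dblp.l3s.de", True), ("wordnet (RKB)", False),
    ("bibsonomy", False), ("eprints (RKB)", True), ("Freebase", True),
    ("www.examiner.com", False),
]
after_filtering = [
    ("rae2001 (RKB)", True), ("DBPedia", True), ("dblp.l3s.de", True),
    ("Freebase", True), ("DBLP (RKB)", True), ("eprints (RKB)", True),
]

print(proportion_relevant(before_filtering))  # 0.7
print(proportion_relevant(after_filtering))   # 1.0
```

Under this reading, filtering raises the proportion of relevant sources from 7 of 10 to 6 of 6, while two sources judged relevant before filtering (dotac and oai) no longer appear.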
  15. Future work
      • Improving the quality of results
          • E.g., estimating the potential loss of precision/recall for different filtering decisions
      • Integrating with the data linking workflow
          • Automatically pre-configuring the data linking algorithm
      • Repository search as a potentially useful semantic search use case (in addition to entity and document search)
  16. Questions?
      • Thanks for your attention
