Instance Matching

Published in: Technology

Transcript

  • 1. Instance Matching. Robert Isele, Freie Universität Berlin. WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 2. Outline Motivation Link Discovery Tools Linking Workflow Silk Workbench WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 3. Motivation The Web of Data is a single global data space because data sources are connected by links Over 31 billion triples published as Linked Open Data and growing But: ● Less than 500 million links ● Most publishers only link to one other dataset WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 4. Use Case 1: Publishing a New Dataset A data provider wants to publish a new dataset Wants to interlink with existing data sets from the same domain Example ● A data publisher wants to publish a new dataset about movies ● Interlink movies with LinkedMDB (Linked Movie Data Base) ● Interlink directors with DBpedia (Wikipedia)
  • 5. Use Case 2: Linked Data Application A Linked Data application integrates multiple data sources from the same domain. In the decentralized Web of Data, many data sources use different URIs for the same real-world object. Identifying these URI aliases is a central problem in Linked Data.
  • 6. Challenges for Link Discovery The Web of Data is heterogeneous ● Many different vocabularies are in use ● Different data formats ● Many different ways to represent the same information Distribution of the most widely used vocabularies WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 7. Challenges for Link Discovery Large range of domains ● 256 data sources in the LOD cloud from a variety of domains ● Linkage Rules are different in each domain ● Writing a Linkage Rule is for each of these domains is usually not trivial Distribution of triples by domain WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 8. Challenges for Link Discovery Scalability ● The current LOD cloud contains 277 datasets (August 2011) ● 30 billion triples in total ● Infeasible to compare every possible entity pair WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 9. Link Discovery Tools Tools enable data publishers to set links Most tools generate links based on user-defined linkage rules A linkage rule specifies the conditions data items must fulfill in order to be interlinked Popular Link Discovery Tools: ● Silk Link Discovery Framework ● LIMES ● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
  • 10. Silk Link Discovery Framework Tool for discovering links between data items within different Linked Data sources. The Silk Link Specification Language (Silk-LSL) allows users to express complex linkage rules Can be used to generate owl:sameAs links as well as other relationships Scalability and high performance through efficient data handling
  • 11. Silk Versions Silk Single Machine ● Generate links on a single machine ● Local or remote data sets Silk MapReduce ● Generate RDF links using a cluster of multiple machines ● Based on Hadoop (Can be run on Amazon Elastic MapReduce) Silk Server ● Provides an HTTP API for matching instances from an incoming stream of RDF data while keeping track of known entities ● Can be used as an identity resolution component within applications that consume Linked Data from the Web
  • 12. Silk Workbench Silk Workbench is a web application which guides the user through the process of interlinking different data sources. Enables the user to manage different sets of data sources and linking tasks. Offers a graphical editor which enables the user to easily create and edit linkage rules Offers tools to evaluate the current linkage rule Includes experimental support for learning linkage rules
  • 13. Linking Workflow
  • 14. Typical linkage rule Select the values to be compared ● Example: Select labels and dates of a music record Normalize the values ● Example: Transform dates to a common format Compare different values using similarity measures ● Example: Compare labels and dates of a music record Aggregate the results of multiple comparisons ● Example: Compute the average of the label and date similarity
  • 15. Value selectors Values in the graph around the entities can be used for comparison Property path languages have been developed for that purpose Examples (SPARQL 1.1 Property Paths Language): ● Entity label: rdfs:label ● Movie director name: dbpedia-owl:director/foaf:name ● All movies of a director: ^dbpedia-owl:director/rdfs:label
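The path examples on this slide can be illustrated with a small sketch. It is not how Silk or a SPARQL engine evaluates paths; it only assumes a toy in-memory list of triples and supports two operators, sequence (`/`) and inverse (`^`). The function name `follow_path` and the `ex:` identifiers are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def follow_path(triples: Iterable[Triple], start: str, path: str) -> Set[str]:
    """Evaluate a '/'-separated property path starting at `start`.

    A step prefixed with '^' is followed backwards (object to subject).
    Assumes prefixed names such as rdfs:label, so '/' never occurs
    inside a property name.
    """
    triples = list(triples)
    current = {start}
    for step in path.split("/"):
        inverse = step.startswith("^")
        prop = step[1:] if inverse else step
        if inverse:
            current = {s for (s, p, o) in triples if p == prop and o in current}
        else:
            current = {o for (s, p, o) in triples if p == prop and s in current}
    return current


# Toy data; all ex: identifiers and literals are made up for illustration.
TRIPLES = [
    ("ex:Movie1", "rdfs:label", "Movie One"),
    ("ex:Movie1", "dbpedia-owl:director", "ex:Director1"),
    ("ex:Movie2", "rdfs:label", "Movie Two"),
    ("ex:Movie2", "dbpedia-owl:director", "ex:Director1"),
    ("ex:Director1", "foaf:name", "Jane Doe"),
]

director_names = follow_path(TRIPLES, "ex:Movie1", "dbpedia-owl:director/foaf:name")
movie_labels = follow_path(TRIPLES, "ex:Director1", "^dbpedia-owl:director/rdfs:label")
```

Here `director_names` holds the director's name reached from a movie, and `movie_labels` holds the labels of all movies that point at the director.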
  • 16. Data Transformations Different data sets may use different data formats Data sets may be noisy ⇒ Values must be normalized prior to comparison
  • 17. Common Transformations Case normalization Structural transformation Extract values from URIs
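The three kinds of transformation listed on this slide might look as follows in a minimal sketch. The function names are hypothetical, and the choice of "Last, First" reordering as the structural transformation is an assumption made for illustration; the original slide showed its own examples as images.

```python
from urllib.parse import unquote, urlparse


def lower_case(value: str) -> str:
    """Case normalization: compare values independently of capitalization."""
    return value.lower()


def reorder_name(value: str) -> str:
    """Structural transformation: turn 'Last, First' into 'First Last'."""
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return f"{first} {last}"
    return value.strip()


def strip_uri_prefix(uri: str) -> str:
    """Extract a readable value from a URI: fragment or last path segment."""
    parsed = urlparse(uri)
    local = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(local).replace("_", " ")


label_from_uri = strip_uri_prefix("http://dbpedia.org/resource/Sean_Connery")
normalized = lower_case(reorder_name("Connery, Sean"))
```

Transformations are typically chained, as in the last line, so that both data sources end up with values in the same shape before any similarity measure is applied.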
  • 18. Similarity Measures A similarity measure compares two values It returns a value between 0 (no similarity) and 1 (equality) Formally, a similarity measure is a function: $\mathrm{sim} : \Sigma^* \times \Sigma^* \to [0,1]$ Various similarity measures have been proposed ● Character-based measures ● Token-based measures ● Domain-specific measures
  • 19. Character-Based Similarity Measures Usually rely on character edit operations Often used for catching typographical errors Most popular ● Levenshtein ● Jaro/Jaro-Winkler
  • 20. Levenshtein Distance The minimum number of edits needed to transform one string into the other Allowed edit operations: ● insert a character into the string ● delete a character from the string ● replace one character with a different character Examples: ● levenshtein(Table, Cable) = 1 (1 substitution) ● levenshtein(Table, able) = 1 (1 deletion)
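The definition above can be sketched with the standard dynamic-programming formulation, which keeps only the previous row of the edit matrix. The helper `levenshtein_similarity`, which maps the distance into [0, 1] by dividing by the longer string's length, is one common convention and an assumption here, not something the slides specify.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting the first i characters of a
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("Table", "Cable"))  # 1 (one substitution)
print(levenshtein("Table", "able"))   # 1 (one deletion)
```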
  • 21. Token-Based Similarity Measures Character-based measures work well for typographical errors, but fail when word arrangements differ Example: "John Doe", "Doe, John", "Mr. John Doe" Token-based measures split the values into tokens before computing the similarity Example: tokenize(Mr. John Doe) = {Mr., John, Doe} Most popular: Jaccard, Dice
  • 22. Jaccard coefficient Intuition: Measure the fraction of the tokens which are shared by both strings Defined as the number of matching words divided by the total number of distinct words: $\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$ Example: $\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$
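A minimal sketch of the Jaccard coefficient over whitespace tokens, reproducing the example from the slide. Splitting on whitespace and returning 1.0 for two empty token sets are assumptions made here; real tools may tokenize differently.

```python
from typing import Set


def tokenize(value: str) -> Set[str]:
    """Split a value into a set of whitespace-separated tokens."""
    return set(value.split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared tokens divided by the total number of distinct tokens."""
    if not a and not b:
        return 1.0  # assumed convention for two empty token sets
    return len(a & b) / len(a | b)


score = jaccard(tokenize("Thomas Sean Connery"), tokenize("Sir Sean Connery"))
print(score)  # 0.5: two shared tokens out of four distinct tokens

# Word order does not matter once values are tokenized.
reordered = jaccard(tokenize("John Doe"), tokenize("Doe John"))
```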
  • 23. Domain-Specific Similarity Measures Geographic distance Date/Time Numbers
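As an illustration of a domain-specific measure, the sketch below turns geographic distance into a similarity in [0, 1]. It assumes the haversine great-circle formula with a mean Earth radius of 6371 km and a linear fall-off up to a user-chosen `max_distance_km`; the slides do not say which formula or scaling a given tool uses, so both are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_similarity(lat1: float, lon1: float, lat2: float, lon2: float,
                   max_distance_km: float) -> float:
    """1.0 at distance zero, falling linearly to 0.0 at max_distance_km and beyond."""
    distance = haversine_km(lat1, lon1, lat2, lon2)
    return max(0.0, 1.0 - distance / max_distance_km)


one_degree_km = haversine_km(0.0, 0.0, 1.0, 0.0)  # one degree of latitude
```

Date and numeric measures can follow the same pattern: compute an absolute difference and scale it by the largest difference still considered a possible match.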
  • 24. Aggregating Similarity Values In order to determine if two entities are duplicates it is usually not sufficient to compare a single property Aggregation Functions aggregate the similarity of multiple comparisons Example: Interlinking geographical datasets ● Compare by label and geographic coordinates ● Aggregate similarity values
  • 25. Popular Aggregation Functions Minimum ● Choose the lowest value ● ⇒ All values must exceed the threshold Maximum ● Choose the highest value ● ⇒ At least one value must exceed the threshold Weighted Average ● Assign a weight to each comparison ● Compute the weighted mean
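The three aggregation functions can be sketched in a few lines. The function names, the example scores and the 0.7 threshold are hypothetical values chosen for illustration.

```python
from typing import Sequence


def aggregate_min(scores: Sequence[float]) -> float:
    """Lowest score: exceeds a threshold only if every comparison does."""
    return min(scores)


def aggregate_max(scores: Sequence[float]) -> float:
    """Highest score: exceeds a threshold if at least one comparison does."""
    return max(scores)


def aggregate_weighted_average(scores: Sequence[float],
                               weights: Sequence[float]) -> float:
    """Weighted mean of the scores; weights need not sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical similarities: label comparison 0.9, coordinate comparison 0.6.
scores = [0.9, 0.6]
threshold = 0.7  # illustrative value
is_link_min = aggregate_min(scores) >= threshold
is_link_max = aggregate_max(scores) >= threshold
is_link_avg = aggregate_weighted_average(scores, [2, 1]) >= threshold
```

With these numbers the minimum rejects the pair, while the maximum and the label-weighted average accept it, which shows how the choice of aggregation changes the strictness of a rule.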
  • 26. Putting it all together
  • 27. Example Interlink cities in different data sources: WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 28. Evaluating Linkage Rules Gold standard in the form of reference links ● Positive links (definitive matches) ● Negative links (definitive non-matches) Based on the reference links, we can determine the number of correct and incorrect matches We distinguish between 4 cases: Positive Link Negative Link match(a,b) = link True positive False positive match(a,b) = nonlink False negative True negative WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 29. Evaluating Linkage Rules Recall: Ratio of correct links compared to all known links ∣true positives∣ recall = ∣true positives∣+ ∣ false positives∣ Precision: Ratio of correct links compared to all found links ∣true positives∣ precision = ∣true positives∣+ ∣ false negatives∣ F-measure: Harmonic mean of precision and recall 2⋅precision⋅recall F= precision + recall WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 30. Recall-Precision diagram A recall-precision diagram visualizes the trade-off between maximizing the recall and maximizing the precision. From: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh · Renée J. Miller (VLDBJ)
  • 31. Outline Motivation Link Discovery Tools Linking Workflow Silk Workbench WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • 32. Silk Workbench Silk Workbench offers a GUI for: ● Managing different data sources and linkage rules ● Creating linkage rules ● Executing linkage rules ● Evaluating linkage rules ● Learning linkage rules
  • 33. Workspace The Workspace holds a set of projects consisting of: Data Sources ● Hold all information that is needed by Silk to retrieve entities from them ● Usually a file dump or a SPARQL endpoint Linking Tasks ● Interlink a type of entity between two data sources ● e.g. interlinking movies in DBpedia and LinkedMDB
  • 34. Linkage Rule Editor Allows users to view and edit linkage rules Linkage Rules are shown as a tree Editing using drag & drop.
  • 35. Generating Links
  • 36. Managing Reference Links
  • 37. Conclusion In order to publish a new data set or to consume an existing dataset we need to generate links. A linkage rule specifies the conditions which must hold true for two entities in order to be considered the same real-world object. The Silk Workbench provides a graphical user interface to create and edit linking tasks. The hands-on session will cover a simple example interlinking musical artists in Freebase and DBpedia.
  • 38. Q&A