Instance Matching

Usage Rights

CC Attribution-ShareAlike License

Presentation Transcript

  • WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
    Instance Matching
    Robert Isele, Freie Universität Berlin
  • Outline Motivation Link Discovery Tools Linking Workflow Silk Workbench WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Motivation The Web of Data is a single global data space because data sources are connected by links Over 31 billion triples published as Linked Open Data and growing But: ● Less than 500 million links ● Most publishers only link to one other dataset WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Use Case 1: Publishing a New Dataset
    A data provider wants to publish a new dataset and interlink it with existing datasets from the same domain.
    Example:
    ● A data publisher wants to publish a new dataset about movies
    ● Interlink movies with LinkedMDB (Linked Movie Data Base)
    ● Interlink directors with DBpedia (Wikipedia)
  • Use Case 2: Linked Data Application
    A Linked Data application integrates multiple data sources from the same domain.
    In the decentralized Web of Data, many data sources use different URIs for the same real-world object.
    Identifying these URI aliases is a central problem in Linked Data.
  • Challenges for Link Discovery The Web of Data is heterogeneous ● Many different vocabularies are in use ● Different data formats ● Many different ways to represent the same information Distribution of the most widely used vocabularies WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Challenges for Link Discovery Large range of domains ● 256 data sources in the LOD cloud from a variety of domains ● Linkage Rules are different in each domain ● Writing a Linkage Rule is for each of these domains is usually not trivial Distribution of triples by domain WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Challenges for Link Discovery Scalability ● The current LOD cloud contains 277 datasets (August 2011) ● 30 billion triples in total ● Infeasible to compare every possible entity pair WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Link Discovery Tools Tools enable data publishers to set links Most tools generate links based on user-defined linkage rules A linkage rule specifies the conditions data items must fulfill in order to be interlinked Popular Link Discover Tools: ● Silk Link Discovery Framework ● LIMES ● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Silk Link Discovery Framework
    A tool for discovering links between data items within different Linked Data sources.
    ● The Silk Link Specification Language (Silk-LSL) allows users to express complex linkage rules
    ● Can be used to generate owl:sameAs links as well as other relationships
    ● Scalability and high performance through efficient data handling
  • Silk Versions
    Silk Single Machine
    ● Generates links on a single machine
    ● Local or remote data sets
    Silk MapReduce
    ● Generates RDF links using a cluster of multiple machines
    ● Based on Hadoop (can be run on Amazon Elastic MapReduce)
    Silk Server
    ● Provides an HTTP API for matching instances from an incoming stream of RDF data while keeping track of known entities
    ● Can be used as an identity resolution component within applications that consume Linked Data from the Web
  • Silk Workbench Silk Workbench is a web application which guides the user through the process of interlinking different data sources. Enables the user to manage different sets of data sources and linking tasks. Offers a graphical editor which enables the user to easily create and edit linkage rules Offers tools to evaluate the current linkage rule Includes experimental support for learning linkage rules WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Linking Workflow
  • Typical Linkage Rule
    Select the values to be compared
    ● Example: Select labels and dates of a music record
    Normalize the values
    ● Example: Transform dates to a common format
    Compare different values using similarity measures
    ● Example: Compare labels and dates of a music record
    Aggregate the results of multiple comparisons
    ● Example: Compute the average of the label and date similarity
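The four steps of a typical linkage rule (select, normalize, compare, aggregate) can be sketched in Python as follows. This is a minimal illustration and not how Silk implements linkage rules: the record layout, the function names, the list of date formats, and the use of `difflib.SequenceMatcher` as a stand-in string similarity are all assumptions made for the example.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Hypothetical date formats that the two sources might use.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_label(label: str) -> str:
    """Lower-case the label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def normalize_date(value: str) -> str:
    """Convert a date string to ISO format if one of the known formats matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value.strip()  # leave unparseable values unchanged


def label_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not a measure named on the slides)."""
    return SequenceMatcher(None, a, b).ratio()


def date_similarity(a: str, b: str) -> float:
    """1.0 for identical normalized dates, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def linkage_score(record_a: dict, record_b: dict) -> float:
    # 1. Select the values to be compared.
    label_a, label_b = record_a["label"], record_b["label"]
    date_a, date_b = record_a["date"], record_b["date"]
    # 2. Normalize the values.
    label_a, label_b = normalize_label(label_a), normalize_label(label_b)
    date_a, date_b = normalize_date(date_a), normalize_date(date_b)
    # 3. Compare the values using similarity measures.
    label_sim = label_similarity(label_a, label_b)
    date_sim = date_similarity(date_a, date_b)
    # 4. Aggregate the results (here: the plain average).
    return (label_sim + date_sim) / 2


if __name__ == "__main__":
    a = {"label": "Abbey  Road", "date": "1969-09-26"}
    b = {"label": "abbey road", "date": "26.09.1969"}
    print(linkage_score(a, b))
```

Two records would then be linked when the score exceeds a threshold chosen by the user.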
  • Value Selectors
    Values in the graph around the entities can be used for comparison.
    Property path languages have been developed for that purpose.
    Examples (SPARQL 1.1 Property Paths Language):
    ● Entity label: rdfs:label
    ● Movie director name: dbpedia-owl:director/foaf:name
    ● All movies of a director: ^dbpedia-owl:director/rdfs:label
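To make the path examples concrete, here is a small Python sketch that evaluates such paths over an in-memory list of triples. It only handles the two operators used in the examples (`/` for sequence and a leading `^` for an inverse step) and is not a SPARQL implementation. The `ex:` resources and all literal values are made-up sample data.

```python
# Each triple is (subject, predicate, object); the data is hypothetical.
TRIPLES = [
    ("ex:movie1", "rdfs:label", "Example Movie"),
    ("ex:movie1", "dbpedia-owl:director", "ex:director1"),
    ("ex:movie2", "rdfs:label", "Another Movie"),
    ("ex:movie2", "dbpedia-owl:director", "ex:director1"),
    ("ex:director1", "foaf:name", "Jane Example"),
]


def follow(triples, nodes, step):
    """Follow one path step from a set of nodes; '^prop' walks the property backwards."""
    inverse = step.startswith("^")
    prop = step[1:] if inverse else step
    result = set()
    for subject, predicate, obj in triples:
        if predicate != prop:
            continue
        if inverse and obj in nodes:
            result.add(subject)
        elif not inverse and subject in nodes:
            result.add(obj)
    return result


def evaluate_path(triples, entity, path):
    """Evaluate a '/'-separated property path starting at a single entity."""
    nodes = {entity}
    for step in path.split("/"):
        nodes = follow(triples, nodes, step)
    return nodes


if __name__ == "__main__":
    print(evaluate_path(TRIPLES, "ex:movie1", "dbpedia-owl:director/foaf:name"))
    print(evaluate_path(TRIPLES, "ex:director1", "^dbpedia-owl:director/rdfs:label"))
```

The inverse step `^dbpedia-owl:director` moves from a director to every movie that points at that director, which is why the last path yields the labels of all of the director's movies.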
  • Data Transformations
    ● Different data sets may use different data formats
    ● Data sets may be noisy
    ⇒ Values must be normalized prior to comparison
  • Common Transformations
    ● Case normalization
    ● Structural transformation
    ● Extract values from URIs
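Two of these transformations are easy to sketch in Python. The function names below are hypothetical, and treating underscores as spaces is an assumption that fits DBpedia-style resource URIs but may not suit other sources.

```python
from urllib.parse import unquote, urlparse


def normalize_case(value: str) -> str:
    """Case normalization: lower-case the value and collapse whitespace."""
    return " ".join(value.lower().split())


def value_from_uri(uri: str) -> str:
    """Extract a readable value from a URI's fragment or last path segment."""
    parsed = urlparse(uri)
    local_name = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    # Decode percent-escapes and treat underscores as spaces (assumption).
    return unquote(local_name).replace("_", " ")


if __name__ == "__main__":
    print(normalize_case("  The  MATRIX "))
    print(value_from_uri("http://dbpedia.org/resource/New_York_City"))
```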
  • Similarity Measures
    A similarity measure compares two values.
    It returns a value between 0 (no similarity) and 1 (equality).
    Formally, a similarity measure is a function:
        sim: Σ* × Σ* → [0, 1]
    Various similarity measures have been proposed:
    ● Character-based measures
    ● Token-based measures
    ● Domain-specific measures
  • Character-Based Similarity Measures
    ● Usually rely on character edit operations
    ● Often used for catching typographical errors
    Most popular:
    ● Levenshtein
    ● Jaro/Jaro-Winkler
  • Levenshtein Distance
    The minimum number of edits needed to transform one string into the other.
    Allowed edit operations:
    ● Insert a character into the string
    ● Delete a character from the string
    ● Replace one character with a different character
    Examples:
    ● levenshtein(Table, Cable) = 1 (1 substitution)
    ● levenshtein(Table, able) = 1 (1 deletion)
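The distance can be computed with the standard dynamic-programming approach, sketched here in Python. The `levenshtein_similarity` helper, which maps the distance into the range [0, 1] by dividing by the longer string's length, is one common normalization and is an assumption: the slides do not say how the distance is turned into a similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as equal
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("Table", "Cable"))  # 1 (one substitution)
    print(levenshtein("Table", "able"))   # 1 (one deletion)
```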
  • Token-Based Similarity Measures
    Character-based measures work well for typographical errors, but fail when word arrangements differ.
    ● Example: "John Doe", "Doe, John", "Mr. John Doe"
    Token-based measures split the values into tokens before computing the similarity.
    ● Example: tokenize(Mr. John Doe) = {Mr., John, Doe}
    Most popular: Jaccard, Dice
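A short Python sketch shows why tokenization helps with reordered names. Splitting on whitespace and commas is an assumption chosen so that the slide's example `tokenize(Mr. John Doe) = {Mr., John, Doe}` comes out as stated. The Dice coefficient is only named on the slide; the formula used here, 2·|A ∩ B| / (|A| + |B|), is its standard definition.

```python
import re


def tokenize(value: str) -> set:
    """Split a value into a set of tokens on whitespace and commas (assumed rule)."""
    return {token for token in re.split(r"[\s,]+", value) if token}


def dice(tokens_a: set, tokens_b: set) -> float:
    """Dice coefficient: twice the shared tokens over the sum of both set sizes."""
    total = len(tokens_a) + len(tokens_b)
    if total == 0:
        return 1.0  # two empty token sets are treated as equal
    return 2 * len(tokens_a & tokens_b) / total


if __name__ == "__main__":
    # Word order no longer matters once values are compared as token sets.
    print(tokenize("John Doe") == tokenize("Doe, John"))
    print(dice(tokenize("John Doe"), tokenize("Mr. John Doe")))
```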
  • Jaccard Coefficient
    Intuition: Measure the fraction of the tokens which are shared by both strings.
    Defined as the number of matching words divided by the total number of distinct words:
        Jaccard(A, B) = |A ∩ B| / |A ∪ B|
    Example:
        Jaccard({Thomas, Sean, Connery}, {Sir, Sean, Connery}) = 2 / 4 = 0.5
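The definition translates directly into Python set operations. The sketch below reproduces the Sean Connery example; returning 1.0 for two empty sets is a convention chosen for this example, since the formula itself is undefined in that case.

```python
def jaccard(tokens_a: set, tokens_b: set) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(tokens_a & tokens_b) / len(union)


if __name__ == "__main__":
    a = {"Thomas", "Sean", "Connery"}
    b = {"Sir", "Sean", "Connery"}
    print(jaccard(a, b))  # 2 shared tokens / 4 distinct tokens = 0.5
```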
  • Domain-Specific Similarity Measures
    ● Geographic distance
    ● Date/Time
    ● Numbers
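As an illustration of a domain-specific measure, the following Python sketch computes the great-circle distance between two coordinates with the haversine formula and maps it to a similarity in [0, 1]. The linear fall-off and the `max_distance_km` parameter are assumptions made for this example; the slides do not specify how a distance is converted into a similarity.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_similarity(lat1, lon1, lat2, lon2, max_distance_km=50.0) -> float:
    """Assumed mapping: 1.0 at distance 0, falling linearly to 0.0 at max_distance_km."""
    distance = haversine_km(lat1, lon1, lat2, lon2)
    return max(0.0, 1.0 - distance / max_distance_km)


if __name__ == "__main__":
    berlin = (52.52, 13.405)
    paris = (48.8566, 2.3522)
    print(haversine_km(*berlin, *paris))
    print(geo_similarity(*berlin, *berlin))
```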
  • Aggregating Similarity Values
    In order to determine if two entities are duplicates, it is usually not sufficient to compare a single property.
    Aggregation functions aggregate the similarity of multiple comparisons.
    Example: Interlinking geographical datasets
    ● Compare by label and geographic coordinates
    ● Aggregate similarity values
  • Popular Aggregation Functions
    Minimum
    ● Choose the lowest value
    ● ⇒ All values must exceed the threshold
    Maximum
    ● Choose the highest value
    ● ⇒ At least one value must exceed the threshold
    Weighted Average
    ● Assign a weight to each comparison
    ● Compute the weighted mean
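The three aggregation functions can be sketched in a few lines of Python. The function names, the example similarity values, and the weights are hypothetical.

```python
def aggregate_min(similarities):
    """Minimum: the result exceeds a threshold only if every value does."""
    return min(similarities)


def aggregate_max(similarities):
    """Maximum: the result exceeds a threshold if at least one value does."""
    return max(similarities)


def aggregate_weighted_average(similarities, weights):
    """Weighted mean of the similarity values."""
    if len(similarities) != len(weights):
        raise ValueError("need exactly one weight per similarity value")
    return sum(s * w for s, w in zip(similarities, weights)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical label and coordinate similarities for one entity pair.
    label_sim, geo_sim = 0.9, 0.6
    print(aggregate_min([label_sim, geo_sim]))
    print(aggregate_max([label_sim, geo_sim]))
    # Count the label comparison twice as much as the coordinates.
    print(aggregate_weighted_average([label_sim, geo_sim], [2, 1]))
```

With a threshold of 0.7, this pair would be rejected under the minimum (0.6) but accepted under the maximum (0.9) and under the weighted average (about 0.8).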
  • Putting It All Together
  • Example Interlink cities in different data sources: WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Evaluating Linkage Rules Gold standard in the form of reference links ● Positive links (definitive matches) ● Negative links (definitive non-matches) Based on the reference links, we can determine the number of correct and incorrect matches We distinguish between 4 cases: Positive Link Negative Link match(a,b) = link True positive False positive match(a,b) = nonlink False negative True negative WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Evaluating Linkage Rules Recall: Ratio of correct links compared to all known links ∣true positives∣ recall = ∣true positives∣+ ∣ false positives∣ Precision: Ratio of correct links compared to all found links ∣true positives∣ precision = ∣true positives∣+ ∣ false negatives∣ F-measure: Harmonic mean of precision and recall 2⋅precision⋅recall F= precision + recall WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Recall-Precision Diagram
    A recall-precision diagram visualizes the trade-off between maximizing the recall and maximizing the precision.
    [Figure from: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh, Renée J. Miller (VLDBJ)]
  • Outline Motivation Link Discovery Tools Linking Workflow Silk Workbench WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Silk Workbench
    Silk Workbench offers a GUI for:
    ● Managing different data sources and linkage rules
    ● Creating linkage rules
    ● Executing linkage rules
    ● Evaluating linkage rules
    ● Learning linkage rules
  • Workspace
    The Workspace holds a set of projects consisting of:
    Data Sources
    ● A data source holds all the information that Silk needs to retrieve entities from it
    ● Usually a file dump or a SPARQL endpoint
    Linking Tasks
    ● A linking task interlinks a type of entity between two data sources
    ● e.g. interlinking movies in DBpedia and LinkedMDB
  • Linkage Rule Editor
    ● Allows the user to view and edit linkage rules
    ● Linkage rules are shown as a tree
    ● Editing using drag & drop
  • Generating Links
  • Managing Reference Links
  • Conclusion In order to publish a new data set or to consume an existing dataset we need to generate links A linkage rule specifies the conditions which must hold true for two entities in order to be considered the same real- world object. The Silk Workbench provides a graphical user interface to create and edit linking tasks The hands on session will cover a simple example interlinking musical artists in freebase and DBpedia WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
  • Q&A