WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
Instance Matching
Robert Isele
Freie Universität Berlin

Outline
● Motivation
● Link Discovery Tools
● Linking Workflow
● Silk Workbench

Motivation
The Web of Data is a single global data space because data sources are connected by links
Over 31 billion triples have been published as Linked Open Data, and the number is growing
But:
● Fewer than 500 million links
● Most publishers only link to one other dataset

Use Case 1: Publishing a New Dataset
A data provider wants to publish a new dataset
Wants to interlink with existing datasets from the same domain
Example
● A data publisher wants to publish a new dataset about movies
● Interlink movies with LinkedMDB (Linked Movie Data Base)
● Interlink directors with DBpedia (Wikipedia)

Use Case 2: Linked Data Application
A Linked Data application integrates multiple data sources from the same domain
In the decentralized Web of Data, many data sources use different URIs for the same real-world object.
Identifying these URI aliases is a central problem in Linked Data.

Challenges for Link Discovery
The Web of Data is heterogeneous
● Many different vocabularies are in use
● Different data formats
● Many different ways to represent the same information
[Figure: Distribution of the most widely used vocabularies]

Challenges for Link Discovery
Large range of domains
● 256 data sources in the LOD cloud from a variety of domains
● Linkage rules are different in each domain
● Writing a linkage rule for each of these domains is usually not trivial
[Figure: Distribution of triples by domain]

Challenges for Link Discovery
Scalability
● The current LOD cloud contains 277 datasets (August 2011)
● 30 billion triples in total
● Infeasible to compare every possible entity pair

Link Discovery Tools
Tools enable data publishers to set links
Most tools generate links based on user-defined linkage rules
A linkage rule specifies the conditions data items must fulfill in order to be interlinked
Popular Link Discovery Tools:
● Silk Link Discovery Framework
● LIMES
● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining

Silk Link Discovery Framework
Tool for discovering links between data items within different Linked Data sources.
The Silk Link Specification Language (Silk-LSL) allows expressing complex linkage rules
Can be used to generate owl:sameAs links as well as other relationships
Scalability and high performance through efficient data handling

Silk Versions
Silk Single Machine
● Generate links on a single machine
● Local or remote datasets
Silk MapReduce
● Generate RDF links using a cluster of multiple machines
● Based on Hadoop (can be run on Amazon Elastic MapReduce)
Silk Server
● Provides an HTTP API for matching instances from an incoming stream of RDF data while keeping track of known entities
● Can be used as an identity resolution component within applications that consume Linked Data from the Web

Silk Workbench
Silk Workbench is a web application which guides the user through the process of interlinking different data sources.
Enables the user to manage different sets of data sources and linking tasks.
Offers a graphical editor which enables the user to easily create and edit linkage rules
Offers tools to evaluate the current linkage rule
Includes experimental support for learning linkage rules

Linking Workflow

Typical linkage rule
Select the values to be compared
● Example: Select labels and dates of a music record
Normalize the values
● Example: Transform dates to a common format
Compare different values using similarity measures
● Example: Compare labels and dates of a music record
Aggregate the results of multiple comparisons
● Example: Compute the average of the label and date similarity
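
The four steps of a typical linkage rule can be sketched in a few lines of Python. This is a minimal illustration, not how Silk implements linkage rules: the record layout, the list of date formats, the one-year date window, and the use of `difflib.SequenceMatcher` as the string measure are all assumptions made for this example.

```python
from datetime import date, datetime
from difflib import SequenceMatcher

# Hypothetical date formats; a real dataset may need others.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_label(label: str) -> str:
    """Normalize a label: lowercase and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def normalize_date(text: str) -> date:
    """Normalize a date string by trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def label_similarity(a: str, b: str) -> float:
    """Stand-in string measure from the standard library, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def date_similarity(a: date, b: date, max_days: int = 365) -> float:
    """Linearly decreasing similarity; 0 once the dates are max_days apart."""
    return max(0.0, 1.0 - abs((a - b).days) / max_days)


def linkage_score(rec_a: dict, rec_b: dict) -> float:
    """Select, normalize, compare, then aggregate with a plain average."""
    label_sim = label_similarity(
        normalize_label(rec_a["label"]), normalize_label(rec_b["label"])
    )
    date_sim = date_similarity(
        normalize_date(rec_a["date"]), normalize_date(rec_b["date"])
    )
    return (label_sim + date_sim) / 2


# Made-up music records that differ only in formatting.
a = {"label": "Example Album", "date": "2001-03-05"}
b = {"label": "example  album", "date": "05.03.2001"}
print(linkage_score(a, b))  # 1.0
```

Both records normalize to the same label and date, so each comparison yields 1.0 and so does the average. A threshold on the aggregated score then decides whether a link is generated.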

Value selectors
Values in the graph around the entities can be used for comparison
Property path languages have been developed for that purpose
Examples (SPARQL 1.1 Property Paths Language):
● Entity label: rdfs:label
● Movie director name: dbpedia-owl:director/foaf:name
● All movies of a director: ^dbpedia-owl:director/rdfs:label
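
To make the path semantics concrete, here is a toy evaluator over an in-memory list of triples. It is only a sketch: it handles just the sequence operator `/` and the inverse operator `^` on prefixed names, which is a small subset of SPARQL 1.1 property paths, and every `ex:` identifier and literal in the graph is made up.

```python
# Toy graph as (subject, predicate, object) triples; all ex: names are made up.
TRIPLES = [
    ("ex:movie1", "rdfs:label", "Movie One"),
    ("ex:movie2", "rdfs:label", "Movie Two"),
    ("ex:movie1", "dbpedia-owl:director", "ex:director1"),
    ("ex:movie2", "dbpedia-owl:director", "ex:director1"),
    ("ex:director1", "foaf:name", "Jane Example"),
]


def evaluate_path(start, path, triples=TRIPLES):
    """Follow a '/'-separated path from start; a leading '^' inverts a step."""
    current = {start}
    for step in path.split("/"):
        if step.startswith("^"):
            # Inverse step: move from object back to subject.
            predicate = step[1:]
            current = {s for (s, p, o) in triples if p == predicate and o in current}
        else:
            # Forward step: move from subject to object.
            current = {o for (s, p, o) in triples if p == step and s in current}
    return current


print(evaluate_path("ex:movie1", "rdfs:label"))
print(evaluate_path("ex:movie1", "dbpedia-owl:director/foaf:name"))
print(evaluate_path("ex:director1", "^dbpedia-owl:director/rdfs:label"))
```

The third call starts at the director, walks the `dbpedia-owl:director` edge backwards to reach both movies, and then collects their labels.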

Data Transformations
Different datasets may use different data formats
Datasets may be noisy
⇒ Values must be normalized prior to comparison

Common Transformations
● Case normalization
● Structural transformation
● Extract values from URIs
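
Each of these transformations can be illustrated with a small Python function. The function names and the specific choice of structural transformation (reordering "Last, First" names) are assumptions for this sketch and are not Silk's built-in transformation operators.

```python
from urllib.parse import unquote, urlparse


def lower_case(value: str) -> str:
    """Case normalization."""
    return value.lower()


def reorder_name(value: str) -> str:
    """Structural transformation: 'Doe, John' becomes 'John Doe'.

    Values without a comma are returned unchanged (apart from trimming).
    """
    if "," not in value:
        return value.strip()
    last, first = (part.strip() for part in value.split(",", 1))
    return f"{first} {last}"


def label_from_uri(uri: str) -> str:
    """Extract a readable value from a URI.

    Uses the fragment if present, otherwise the last path segment,
    and turns underscores into spaces.
    """
    parsed = urlparse(uri)
    local = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(local).replace("_", " ")


print(lower_case("Sean CONNERY"))                                  # sean connery
print(reorder_name("Doe, John"))                                   # John Doe
print(label_from_uri("http://dbpedia.org/resource/Sean_Connery"))  # Sean Connery
```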

Similarity Measures
A similarity measure compares two values
It returns a value between 0 (no similarity) and 1 (equality)
Formally, a similarity measure is a function:
$\mathrm{sim} : \Sigma^* \times \Sigma^* \to [0, 1]$
Various similarity measures have been proposed
● Character-based measures
● Token-based measures
● Domain-specific measures

Character-Based Similarity Measures
Usually rely on character edit operations
Often used for catching typographical errors
Most popular
● Levenshtein
● Jaro/Jaro-Winkler

Levenshtein Distance
The minimum number of edits needed to transform one string into the other
Allowed edit operations:
● insert a character into the string
● delete a character from the string
● replace one character with a different character
Examples:
● levenshtein(Table, Cable) = 1 (1 substitution)
● levenshtein(Table, able) = 1 (1 deletion)
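
The distance can be computed with the standard dynamic-programming scheme, sketched below. The second function, which turns the distance into a similarity in [0, 1] by dividing by the longer string's length, is one common normalization and an assumption of this example; the slides do not say which normalization is used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # replace, free if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Table", "Cable"))  # 1
print(levenshtein("Table", "able"))   # 1
```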

Token-Based Similarity Measures
Character-based measures work well for typographical errors, but fail when word arrangements differ
Example: "John Doe", "Doe, John", "Mr. John Doe"
Token-based measures split the values into tokens before computing the similarity
Example: tokenize(Mr. John Doe) = {Mr., John, Doe}
Most popular: Jaccard, Dice

Jaccard coefficient
Intuition: Measure the fraction of the tokens which are shared by both strings
Defined as the number of matching words divided by the total number of distinct words:
$$\mathrm{Jaccard}(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$$
Example:
$$\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$$
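
The definition translates almost directly into Python sets. In this sketch the whitespace tokenizer and the convention of returning 1.0 for two empty token sets are assumptions; a real tokenizer would likely also handle punctuation and case.

```python
def tokenize(value: str) -> set:
    """Split a value into a set of whitespace-separated tokens."""
    return set(value.split())


def jaccard(a: set, b: set) -> float:
    """Shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 1.0  # assumed convention: two empty values count as equal
    return len(a & b) / len(a | b)


print(jaccard(tokenize("Thomas Sean Connery"), tokenize("Sir Sean Connery")))  # 0.5
```

Because sets ignore order, "John Doe" and "Doe John" score 1.0 here, which is exactly the case where character-based measures struggle.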

Domain-Specific Similarity Measures
● Geographic distance
● Date/Time
● Numbers
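
As one illustration of a domain-specific measure, geographic distance can be turned into a similarity. The sketch below uses the haversine great-circle formula and a linear falloff; the 50 km cutoff and the function names are assumptions for this example, not values taken from Silk.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geo_similarity(p, q, max_km=50.0):
    """Assumed scheme: 1 at distance 0, falling linearly to 0 at max_km."""
    return max(0.0, 1.0 - haversine_km(p[0], p[1], q[0], q[1]) / max_km)


berlin = (52.52, 13.405)
paris = (48.8566, 2.3522)
print(geo_similarity(berlin, berlin))  # 1.0
print(geo_similarity(berlin, paris))   # 0.0
```

Date and numeric measures can follow the same pattern: compute an absolute difference and scale it against a maximum tolerated difference.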

Aggregating Similarity Values
In order to determine if two entities are duplicates, it is usually not sufficient to compare a single property
Aggregation functions aggregate the similarity of multiple comparisons
Example: Interlinking geographical datasets
● Compare by label and geographic coordinates
● Aggregate similarity values

Popular Aggregation Functions
Minimum
● Choose the lowest value
● ⇒ All values must exceed the threshold
Maximum
● Choose the highest value
● ⇒ At least one value must exceed the threshold
Weighted Average
● Assign a weight to each comparison
● Compute the weighted mean
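
The three aggregation functions are short enough to write out. This is a sketch with hypothetical function names and example values; the 0.7 threshold is chosen only to show how the functions differ.

```python
def aggregate_min(similarities):
    """Link only if every comparison exceeds the threshold."""
    return min(similarities)


def aggregate_max(similarities):
    """Link if at least one comparison exceeds the threshold."""
    return max(similarities)


def aggregate_weighted_average(similarities, weights):
    """Weighted mean of the similarity values."""
    if len(similarities) != len(weights):
        raise ValueError("need exactly one weight per similarity value")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s * w for s, w in zip(similarities, weights)) / total


# Made-up example: label similarity 0.9, coordinate similarity 0.5.
similarities = [0.9, 0.5]
threshold = 0.7
print(aggregate_min(similarities) > threshold)  # False
print(aggregate_max(similarities) > threshold)  # True
print(aggregate_weighted_average(similarities, [3, 1]) > threshold)  # True
```

With weights 3 and 1 the weighted average is (3 × 0.9 + 1 × 0.5) / 4 = 0.8, so the stronger label match outweighs the weaker coordinate match.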

Putting it all together

Example
Interlink cities in different data sources:

Evaluating Linkage Rules
Gold standard in the form of reference links
● Positive links (definitive matches)
● Negative links (definitive non-matches)
Based on the reference links, we can determine the number of correct and incorrect matches
We distinguish between 4 cases:
                        Positive link     Negative link
match(a,b) = link       True positive     False positive
match(a,b) = nonlink    False negative    True negative

Evaluating Linkage Rules
Recall: Ratio of correct links compared to all known links
$$\mathrm{recall} = \frac{\lvert\text{true positives}\rvert}{\lvert\text{true positives}\rvert + \lvert\text{false negatives}\rvert}$$
Precision: Ratio of correct links compared to all found links
$$\mathrm{precision} = \frac{\lvert\text{true positives}\rvert}{\lvert\text{true positives}\rvert + \lvert\text{false positives}\rvert}$$
F-measure: Harmonic mean of precision and recall
$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
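
These measures can be computed directly from a set of generated links and the reference links. The sketch below assumes links are represented as (source, target) pairs and that generated links which appear in neither reference set are ignored; the identifiers are made up.

```python
def evaluate(generated, positive, negative):
    """Score generated links against positive and negative reference links."""
    tp = len(generated & positive)  # reference matches that were linked
    fp = len(generated & negative)  # reference non-matches that were linked
    fn = len(positive - generated)  # reference matches that were missed
    tn = len(negative - generated)  # reference non-matches left unlinked
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f_measure": f_measure}


# Made-up reference links and rule output.
positive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
negative = {("a1", "b2"), ("a2", "b3")}
generated = {("a1", "b1"), ("a2", "b2"), ("a1", "b2")}
print(evaluate(generated, positive, negative))
```

Here two of the three generated links are correct and two of the three known matches are found, so precision, recall, and F-measure all equal 2/3.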

Recall-Precision diagram
A recall-precision diagram visualizes the trade-off between maximizing recall and maximizing precision
From: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh, Renée J. Miller (VLDBJ)

Outline
● Motivation
● Link Discovery Tools
● Linking Workflow
● Silk Workbench

Silk Workbench
Silk Workbench offers a GUI for:
● Managing different data sources and linkage rules
● Creating linkage rules
● Executing linkage rules
● Evaluating linkage rules
● Learning linkage rules

Workspace
The Workspace holds a set of projects consisting of:
Data Sources
● A data source holds all information that Silk needs to retrieve entities from it.
● Usually a file dump or a SPARQL endpoint
Linking Tasks
● A linking task interlinks a type of entity between two data sources
● e.g. interlinking movies in DBpedia and LinkedMDB

Linkage Rule Editor
Allows viewing and editing linkage rules
Linkage rules are shown as a tree
Editing using drag & drop.

Generating Links

Managing Reference Links

Conclusion
In order to publish a new dataset or to consume an existing dataset, we need to generate links
A linkage rule specifies the conditions which must hold true for two entities in order to be considered the same real-world object.
The Silk Workbench provides a graphical user interface to create and edit linking tasks
The hands-on session will cover a simple example interlinking musical artists in Freebase and DBpedia

Q&A