Contextual ontology alignment may 2011

An approach for automated schema-level matching of Linked Open Data. It is evaluated, with high scores, against manually created mappings from the schemata of major LOD datasets to the PROTON upper-level ontology.

  • The Linking Open Data (LOD) initiative aims to facilitate the emergence of a web of linked data by publishing and interlinking open data on the web in RDF. One can explore linked data across servers by following the links in the graph, much as the HTML web is navigated.
    • Wealth of information: more than 25 billion RDF triples
    • Variety of data sources: 203+ datasets
    • Heterogeneity: different subject domains, with contributions from companies, government and public sector projects, as well as individual Web enthusiasts
    • Issues with the quality of the data: inconsistent, incomplete, with mistakes, not suitable for automated reasoning

Presentation Transcript

  • Contextual Ontology Alignment of LOD with an Upper Ontology: A Case Study with PROTON
    • Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymond Vasquez, Mariana Damova, Pascal Hitzler and Amit Sheth
    • ESWC’2011
  • Outline
    • Introduction
    • Problem and Conceptual Solution
    • Approach
    • Evaluation
    • Related Work
    • Conclusion
  • Linking Open Data (LOD)
    • Providers creating links across datasets
    • 203 datasets in LOD cover a wide spectrum of subject domains – biomedical, science, geographic, generic knowledge, entertainment, government.
    • The linkage of these ever-growing data sources takes place at the instance level
  • Linking Open Data (LOD)
    • Without schema-level linkages, the LOD cloud will not have enough semantic information to enable reasoning-based applications
      • Question Answering
      • Agent-based information brokering
      • and the like
  • Managing Linking Open Data (LOD)
    • FactForge ( http://factforge.net )
    • a reason-able view of the web of data
    • the biggest and most heterogeneous body of factual knowledge on which inference is performed
    • 8 datasets from the LOD cloud (DBpedia, Freebase, UMBEL, CIA World Factbook, MusicBrainz, WordNet, Lingvoj, Geonames)
    • overall, 1.4 million loaded statements, 2.2 million stored (indexed) statements and 9.8 million distinct retrievable statements
    • manually developed schema-level alignment to efficiently query across multiple datasets
  • The Problem
    • Manual creation of schema-level mappings across LOD ontologies is not a viable solution
      • Size of the LOD
      • Rate at which it is growing
    • An automated solution is needed so that applications such as FactForge can effectively scale and keep up with the size of LOD.
  • Solution: BLOOMS+ - Bootstrapping-based LOD Ontology Matching
    • Extends BLOOMS
    • - uses a more sophisticated metric to determine which classes of the two ontologies to align
    • - considers contextual information to further support or reject an alignment
  • Access to FactForge
    • Forest Interface
    • - keyword search (for molecules in the RDF graph)
    • - auto-suggestion of URI
    • - SPARQL queries using facts from different datasets
    • (formulating SPARQL queries requires in-depth knowledge of the datasets and the schemata in FactForge)
    • - exploration
  • Knowledge Requirements
    • A knowledge source must be:
      • organized as a class hierarchy, in which links between classes capture sub- and superclass relationships
      • broad in coverage, spanning a wide range of concepts and domains, so that it is widely applicable across the domains covered by the LOD cloud
    • BLOOMS+ uses the Wikipedia category hierarchy: a taxonomy of over 10 million categories across many domains.
  • Approach
    • BLOOMS+ aligns two ontologies in the following steps:
      • Use Wikipedia to construct a set of category trees for each class in the source and target ontologies
      • Determine which classes to align, using:
        • a sophisticated measure to compute the similarity between source and target classes
        • the contextual similarity between the classes, to support or reject an alignment
      • Align classes with high similarity, based on the class and contextual similarity
  • Construct BLOOMS+ Forest
    • The root of the tree is s_i
    • The immediate children of s_i are all categories that s_i belongs to
    • Each subsequent level includes all unique, direct super categories of the categories at the current level
    • BLOOMS+ limits the depth of the tree to 4, as it has been empirically observed that depths beyond 4 typically include very general categories that are not useful for alignment.
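
The tree construction described on this slide can be sketched as a breadth-first walk up a category hierarchy. This is only an illustration under stated assumptions: `SUPER_CATEGORIES` is a hypothetical toy mapping standing in for Wikipedia's category data, `build_tree` is a made-up helper name, and the root is counted as depth 1 so that the limit of 4 includes the root level.

```python
from collections import deque

# Hypothetical toy data: each article/category maps to its direct super categories.
SUPER_CATEGORIES = {
    "Jazz": ["Jazz genres", "American music"],
    "Jazz genres": ["Music genres"],
    "American music": ["Music by country"],
    "Music genres": ["Music"],
    "Music by country": ["Music"],
    "Music": ["Arts"],
    "Arts": ["Culture"],
}

def build_tree(sense, super_categories, max_depth=4):
    """Return {node: depth} for a tree rooted at `sense` (root depth = 1, an assumption)."""
    depths = {sense: 1}
    queue = deque([sense])
    while queue:
        node = queue.popleft()
        if depths[node] >= max_depth:
            continue  # do not expand beyond the depth limit
        for parent in super_categories.get(node, []):
            if parent not in depths:  # keep each category once, at its shallowest depth
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths

tree = build_tree("Jazz", SUPER_CATEGORIES)
print(sorted(tree.items(), key=lambda item: item[1]))
```

With the toy data, "Arts" and "Culture" are cut off by the depth limit, which mirrors the point that very general categories are not useful for alignment.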
  • Compute Class Similarity
    • Each class C from the source ontology is compared with each class D from the target ontology
    • Each tree Ti ∈ F_C is compared with each tree Tj ∈ F_D, where F_C and F_D are the BLOOMS+ forests of C and D
    • For each Ti, its overlap with Tj is determined
    • Issues:
    • - common nodes that appear often result in false alignments
    • - a large tree can be unfairly penalized because it must have more nodes in common with another tree to reach a high similarity score
  • Compute Class Similarity
    • To cope with these issues, the overlap is computed as:
    • $\mathrm{Overlap}(T_i, T_j) = \dfrac{\log \sum_{n \in T_i \cap T_j} \left(1 + e^{d(n)^{-1} - 1}\right)}{\log 2|T_i|}$
    • n ∈ Ti ∩ Tj are the common nodes
    • d(n) is the depth of the common node n in Ti
    • The exponentiation of the inverse depth of a common node gives less importance to the node if it is generic
    • The log of the tree size avoids bias against large trees
    • The score ranges from 0.0 to 1.0
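
A minimal sketch of the overlap measure, assuming each tree is represented as a `{node: depth}` dictionary with the root at depth 1 (so the inverse depth is always defined) and that trees with no common node score 0.0. The representation and the function name are illustrative choices, not taken from the slides.

```python
import math

def overlap(tree_i, tree_j):
    """Overlap(Ti, Tj) for trees given as {node: depth}; depths are taken from Ti."""
    common = set(tree_i) & set(tree_j)
    if not common:
        return 0.0  # assumption: no common nodes means no overlap
    # Each common node contributes 1 + e^(1/d(n) - 1): at most 2 at depth 1, less when deeper.
    total = sum(1.0 + math.exp(1.0 / tree_i[n] - 1.0) for n in common)
    # Dividing by log(2|Ti|) normalises the score and offsets the size of Ti.
    return math.log(total) / math.log(2 * len(tree_i))

t_i = {"Jazz": 1, "Music genres": 2, "Music": 3}
t_j = {"Blues": 1, "Music genres": 2, "Music": 3}
print(round(overlap(t_i, t_j), 3))
```

Because the depths and the tree size are taken from the first argument, `overlap(a, b)` and `overlap(b, a)` can differ.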
  • Compute Contextual Similarity
    • A good source of contextual information is the superclasses of C and D in their respective ontologies.
    • If the superclasses agree, the alignment is supported; if not, the alignment is penalized.
    • The method:
    • - for each pairwise class comparison (C, D), all superclasses of C and D are retrieved up to a specified level of 2.
    • - the two sets of superclasses N(C) and N(D) are the neighbourhoods of C and D.
    • - for each tree pair (Ti, Tj) between C and D, the number of superclasses in N(C) and N(D) that are supported by Ti and Tj is determined.
    • A superclass c ∈ N(C) is supported by Ti if either:
    • - the name of c matches a node in Ti, or
    • - a Wikipedia article corresponding to c matches a node in Ti
  • Compute Contextual Similarity
    • The overall contextual similarity between C and D with respect to Ti and Tj is computed with a harmonic mean:
    • $\mathrm{CSim}(T_i, T_j) = \dfrac{2 R_C R_D}{R_C + R_D}$
    • R_C and R_D are the fractions of superclasses in N(C) and N(D) supported by Ti and Tj
    • The harmonic mean emphasizes superclass neighbourhoods that are not well supported, thus lowering the overall contextual similarity.
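
The harmonic mean can be illustrated with a short sketch. It makes two simplifying assumptions: only the name-matching test is implemented (case-insensitive exact match; the Wikipedia article lookup is omitted), and the result is defined as 0.0 when neither neighbourhood is supported. Function names are illustrative.

```python
def supported_fraction(superclasses, tree_nodes):
    """Fraction of superclass names that match a node name in the tree."""
    if not superclasses:
        return 0.0
    names = {node.lower() for node in tree_nodes}
    hits = sum(1 for c in superclasses if c.lower() in names)
    return hits / len(superclasses)

def contextual_similarity(r_c, r_d):
    """Harmonic mean of the two support fractions R_C and R_D."""
    if r_c + r_d == 0:
        return 0.0  # assumption: avoid division by zero when nothing is supported
    return 2.0 * r_c * r_d / (r_c + r_d)

r_c = supported_fraction(["Person", "Agent"], {"person", "Organisms"})
r_d = supported_fraction(["Person"], {"Person", "People"})
print(r_c, r_d, contextual_similarity(r_c, r_d))
```

A poorly supported neighbourhood pulls the harmonic mean down more strongly than an arithmetic mean would, which is the penalising effect the slide describes.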
  • Compute Overall Similarity
    • The overall similarity between classes C and D w.r.t. BLOOMS+ trees Ti and Tj is computed by taking the weighted average of the class and contextual similarity:
    • $O(T_i, T_j) = \dfrac{\alpha \, \mathrm{Overlap}(T_i, T_j) + \beta \, \mathrm{CSim}(T_i, T_j)}{2}$
    • α and β are weights for the concept (class) and contextual similarity
    • Both default to 1.0 to give equal importance to the two components.
    • The tree pair (Ti, Tj) ∈ F_C × F_D with the highest overall similarity score is selected; if that score is greater than the alignment threshold H_A, a link between C and D is established.
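
As a rough sketch, the weighted average and the selection of the best tree pair might look as follows. The `score` callable and the way forests are represented (any iterables of trees) are assumptions made for the example.

```python
from itertools import product

def overall_similarity(overlap_score, csim_score, alpha=1.0, beta=1.0):
    """Weighted average of class similarity (overlap) and contextual similarity."""
    return (alpha * overlap_score + beta * csim_score) / 2.0

def best_pair(forest_c, forest_d, score):
    """Return (best_score, tree_i, tree_j) over all pairs, or None if a forest is empty."""
    best = None
    for t_i, t_j in product(forest_c, forest_d):
        s = score(t_i, t_j)
        if best is None or s > best[0]:
            best = (s, t_i, t_j)
    return best

# Toy usage: trees are just labels, with precomputed (hypothetical) scores.
scores = {("t1", "u1"): 0.4, ("t2", "u1"): 0.9}
print(best_pair(["t1", "t2"], ["u1"], lambda a, b: scores[(a, b)]))
```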
  • Determining the Type of Link
    • If O(Ti, Tj) = O(Tj, Ti) then C owl:equivalentClass D
    • If O(Ti, Tj) < O(Tj, Ti) then C rdfs:subClassOf D
    • Otherwise, D rdfs:subClassOf C
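
The three rules can be written as a small decision function. This is a sketch under assumptions: the threshold is applied to O(Ti, Tj) of the selected pair, the default of 0.85 is the value reported in the experimental setup, and equality is tested with `math.isclose` because exact floating-point equality is fragile.

```python
import math

def link(c, d, o_ij, o_ji, threshold=0.85):
    """Return an (subject, predicate, object) triple, or None if below the threshold."""
    if o_ij <= threshold:
        return None  # score not greater than the alignment threshold: no link
    if math.isclose(o_ij, o_ji):
        return (c, "owl:equivalentClass", d)
    if o_ij < o_ji:
        return (c, "rdfs:subClassOf", d)
    return (d, "rdfs:subClassOf", c)

print(link("dbpedia:Person", "proton:Person", 0.92, 0.92))
print(link("dbpedia:Athlete", "proton:Person", 0.88, 0.95))
```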
  • Evaluation
    • Criteria:
    • BLOOMS+ can outperform state-of-the-art solutions on the task of aligning LOD ontologies
    • BLOOMS+ performs well because it accounts for the critical factors when computing the similarity between two classes
      • The importance of common nodes between the BLOOMS+ trees for the two classes
      • Bias against large trees
    • The performance of BLOOMS+ can be further improved by using contextual information
  • Dataset – the Gold Standard
    • Schema-level mappings from 3 LOD datasets to PROTON
    • PROTON – an upper-level ontology with over 300 classes and 100 properties designed to support semantic annotation, indexing and search
    • DBpedia – the RDF version of Wikipedia, with 259 classes (ranging from Event to Protein)
    • Freebase – a knowledge base of structured data collected from multiple sources
    • Geonames – a geographic dataset with over 6 million locations of interest, which are classified into 11 different classes
    • These mappings have been systematically created by knowledge engineers at Ontotext for FactForge, which enables SPARQL query over the LOD cloud
    • 544 mappings in total:
      • 373 PROTON-DBpedia
      • 150 PROTON-Freebase
      • 21 PROTON-Geonames
  • Dataset – the Gold Standard
    • Advantages
    • The mappings were created by an independent source for a real world use case
    • The mappings were created by knowledge engineers through a systematic process, hence their high quality
    • The mappings cover a diverse set of ontologies
  • Experimental Setup
    • (a) BLOOMS+ can outperform state-of-the-art solutions on the task of aligning LOD ontologies
    • - Precision and Recall of the mappings from the LOD to PROTON
    • BLOOMS+ was applied to each PROTON-LOD ontology pair with an alignment threshold of 0.85, because this threshold empirically produces the best F-measure
    • Then the results were compared with the Gold Standard
    • Precision : the number of correct mappings over the total number of mappings generated by BLOOMS+
    • Recall : the number of correct mappings over all mappings in the gold standard
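
The two definitions translate directly into set operations. The sketch below assumes mappings are represented as hashable tuples and takes the F-measure to be the balanced harmonic mean of precision and recall; the example mappings are invented.

```python
def evaluate(generated, gold):
    """Return (precision, recall, f_measure) for generated vs. gold-standard mappings."""
    generated, gold = set(generated), set(gold)
    correct = generated & gold
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical mappings as (source class, relation, target class) tuples.
gold = {("A", "sub", "X"), ("B", "eq", "Y"), ("C", "sub", "Z")}
generated = {("A", "sub", "X"), ("B", "eq", "Y"), ("D", "sub", "X"), ("E", "eq", "Z")}
print(evaluate(generated, gold))
```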
  • Experimental Setup
    • (a) BLOOMS+ can outperform state-of-the-art solutions on the task of aligning LOD ontologies
    • - Precision and Recall of the mappings from the LOD to PROTON
    • Comparison with the results of
    • BLOOMS
    • S-Match (three matching algorithms – basic, minimal, structure preserving)
    • AROMA (association rule mining paradigm to discover relationships)
    • The best alignment threshold for BLOOMS is 0.6
  • Experimental Setup
    • (b) BLOOMS+ performs well because it accounts for the critical factors when computing the similarity between two classes
      • The importance of common nodes between the BLOOMS+ trees for the two classes
      • Bias against large trees
    • (c) The performance of BLOOMS+ can be further improved by using contextual information
    • Comparisons:
      • BLOOMS+ without contextual information vs. BLOOMS: tests (b), the measure used to compute the similarity between two classes
      • BLOOMS+ without contextual information vs. BLOOMS+: tests (c), the use of contextual information
    • Alignment threshold set to 0.85
  • Experimental Results: Linked Open Data and PROTON Schema Ontology Alignment
    (Rec = recall, Prec = precision, F = F-measure; DB = DBpedia, GEO = Geonames, FB = Freebase, PRO = PROTON)

                          DB-PRO               GEO-PRO              FB-PRO               Overall
    System                Rec   Prec    F      Rec   Prec    F      Rec   Prec    F      Rec   Prec    F
    AROMA                 0.19  0.59    0.28   0.04  8/1000  0.01   0.31  0.49    0.38   0.22  0.37    0.28
    S-Match-M             0.26  0.05    0.08   0.04  6/1000  0.01   0.2   0.05    0.08   0.23  0.05    0.08
    S-Match-C             0.33  3/1000  6/1000 0.04  0.009   0.01   0.3   0.4     0.34   0.31  4/1000  0.007
    BLOOMS                0.48  0.19    0.27   0.04  6/1000  0.01   0.28  0.32    0.3    0.42  0.19    0.26
    BLOOMS+ no context    0.77  0.59    0.67   0.04  5/1000  0.01   0.48  0.65    0.55   0.66  0.45    0.54
    BLOOMS+               0.73  0.90    0.81   0.04  5/1000  0.01   0.49  0.59    0.54   0.63  0.55    0.59
  • Related Work
    • Euzenat, J. et al., Matching ontologies for context, a tech report of the NEON project from 2007
      • relies on background knowledge from online ontologies
      • the process relies on identifying contextual relationships using the relationships encoded in the ontologies
    • Ontology matching surveys (Euzenat and Shvaiko, 2007) emphasize that systems typically utilize a structured source of information (dictionaries or upper-level ontologies)
    • Wikipedia categorization has been utilized for creating and restructuring taxonomies
    • Identification and creation of links between LOD cloud datasets
    • Ontology schema matching used to improve instance coreference resolution
    • UMBEL: a unified reference point to LOD schemas
  • Conclusion
    • A solution for automated ontology matching – BLOOMS+
    • Evaluated on a real world Gold Standard mapping LOD ontologies to PROTON created for FactForge
    • Compared the performance of BLOOMS+ to other state-of-the-art systems
    • BLOOMS+ is the only system utilizing both the contextual information present in the ontology and the Wikipedia category hierarchy for ontology matching (OM)
  • Future Work
    • Use BLOOMS+ for LOD querying
    • Use other techniques to improve BLOOMS+
    • Test BLOOMS+ with other knowledge sources
    • Incorporate additional contextual information
    • Use BLOOMS+ in data integration scenarios
    • Thank you for your attention!
    • Questions?