A Distributional Structured Semantic Space for Querying RDF Graph Data

The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data on the Web today, end users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively query relationships in RDF while abstracting them from the underlying data model represents a fundamental problem for Web-scale Linked Data consumption. This article introduces a distributional structured semantic space which enables data model independent natural language queries over RDF data. The center of the approach relies on the use of a distributional semantic model to address the level of semantic interpretation demanded to build the data model independent approach. The article analyzes the geometric aspects of the proposed space, providing its description as a distributional structured vector space, which is built upon the Generalized Vector Space Model (GVSM). The final semantic space proved to be flexible and precise under real-world query conditions achieving mean reciprocal rank = 0.516, avg. precision = 0.482 and avg. recall = 0.491.

Transcript

  • 1. Digital Enterprise Research Institute www.deri.ie A Distributional Structured Semantic Space for Querying RDF Graph Data André Freitas, Edward Curry, João G. Oliveira, Seán O’Riain Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
  • 2. Outline
      • Motivation & Problem Space
      • Description of the Proposed Approach
      • Evaluation
      • Conclusions
  • 3. Motivation & Problem Space
      • Linked Data: heterogeneous and distributed data environment.
      • Traditional database and information retrieval approaches for querying/searching Linked Data provide limited solutions for casual users.
      • Querying and searching Linked Data remains a fundamental challenge.
  • 4. Query/Search Spectrum
      • Adapted from Kauffman et al. (2009)
  • 5. Semantic Gap
      • From which university did the wife of Barack Obama graduate?
      • Data model independent (DMI) queries
      • Semantic matching
  • 6. Semantic Matching Problem
      • From which university did the wife of Barack Obama graduate?
  • 7. Semantic Matching Problem
      • From which university did the wife of Barack Obama graduate?
      • Semantic matching
  • 8. Semantic Matching Requirements
      • Expressivity:
          • Supporting expressive natural language queries.
          • Paths, conjunctions, disjunctions, aggregations, conditions.
      • Flexibility & Accuracy:
          • Accurate and flexible semantic matching approach.
      • Maintainability:
          • Easily transportable across datasets from different domains.
          • Can support open-domain (e.g. Wikipedia) and domain-specific (e.g. financial) datasets without customization.
      • Performance:
          • Suitable for real-time querying (low query execution time).
      • Scalability:
          • Scalable to a large number of datasets (organization-scale, Web-scale).
  • 9. Research Questions
      • Q1: How to provide a query mechanism which allows users to expressively query linked datasets without a previous understanding of the vocabularies behind the data?
      • Q2: Which semantic model could support this query mechanism?
  • 10. Proposed Approach
  • 11. Proposed Approach (Outline)
      • Treo:
          • Query approach based on dynamic Linked Data navigation.
          • Performance limitations.
      • Treo T-Space:
          • Existing IR approaches: traditional Vector Space Models (VSMs) were not able to (i) capture the structure of graphs and (ii) support a precise semantic matching behavior.
          • A VSM supporting these two requirements was formulated: T-Space.
          • Ranking function based on semantic relatedness.
  • 12. Proposed Approach (Outline)
  • 13. First Approach (Treo)
      • Natural language queries.
      • Ranked query results.
      • Two-phase search process combining entity search with spreading activation search.
      • Use of Wikipedia-based semantic relatedness to semantically match query terms to dataset terms.
      • Limited query execution performance.
      • Requirements: Expressivity, Flexibility & Accuracy.
  • 14. Semantic Relatedness
      • Computation of a measure of “semantic proximity” between two terms.
      • Allows approximate semantic matching between query terms and dataset terms.
      • Most existing approaches use WordNet-based solutions for approximate semantic matching (limited solution).
      • Use of Wikipedia-based semantic relatedness measures (addresses the limitations of WordNet-based measures).
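
The two-phase search from slide 13, driven by the relatedness measure from slide 14, can be illustrated with a minimal sketch. It is not the Treo implementation: the triples are a toy graph, the relatedness scores are hypothetical hand-picked values, the `threshold` parameter is an assumption, and the navigation step is a simplified, threshold-based stand-in for spreading activation.

```python
# Toy RDF-like graph as (subject, predicate, object) triples (illustrative only).
TRIPLES = [
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Barack_Obama", "birthPlace", "Honolulu"),
    ("Michelle_Obama", "almaMater", "Princeton_University"),
    ("Michelle_Obama", "birthPlace", "Chicago"),
]

# Hypothetical relatedness scores between query terms and predicates.
# A real system would compute these from a corpus; these values are made up.
RELATEDNESS = {
    ("wife", "spouse"): 0.8,
    ("wife", "birthPlace"): 0.1,
    ("graduate", "almaMater"): 0.7,
    ("graduate", "birthPlace"): 0.05,
}


def relatedness(term, predicate):
    """Look up the (hypothetical) relatedness score; unknown pairs score 0."""
    return RELATEDNESS.get((term, predicate), 0.0)


def find_pivot(entity_label):
    """Phase 1 (entity search): resolve a label to a node by exact name match."""
    wanted = entity_label.replace(" ", "_")
    subjects = {s for s, _, _ in TRIPLES}
    return wanted if wanted in subjects else None


def navigate(pivot, query_terms, threshold=0.5):
    """Phase 2: for each query term in turn, follow the outgoing edges whose
    predicate is related to the term at or above the threshold."""
    frontier = [(pivot, [])]
    for term in query_terms:
        next_frontier = []
        for node, path in frontier:
            for s, p, o in TRIPLES:
                if s == node and relatedness(term, p) >= threshold:
                    next_frontier.append((o, path + [(s, p, o)]))
        frontier = next_frontier
    return frontier


# "From which university did the wife of Barack Obama graduate?"
pivot = find_pivot("Barack Obama")
results = navigate(pivot, ["wife", "graduate"])
for answer, path in results:
    print(answer, path)
```

The query terms never need to match the predicate names `spouse` and `almaMater` literally; the relatedness scores bridge that vocabulary gap, which is the point made on the slides.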
  • 15. Query Approach Rationale (Treo)
  • 16. Query Approach Rationale (Treo)
      • RDF
  • 17. Query Approach Rationale (Treo)
      • Wikipedia Link Measure
  • 18. Query Approach Rationale (Treo)
  • 19. Query Approach Rationale (Treo)
      • RDF
  • 20. Query Approach Rationale (Treo)
  • 21. Query Approach Rationale (Treo)
  • 22. Query Approach Rationale (Treo)
  • 23. Query Approach Rationale (Treo)
      • Final query-data matching: currently limited to path queries.
      • Querying Linked Data using Semantic Relatedness: A Vocabulary Independent Approach. In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB), 2011.
  • 24. Treo (Irish): Direction, path
  • 25. From which university did the wife of Barack Obama graduate?
  • 26. Second Approach: Treo T-Space
      • Motivation:
          • Performance problem of the original approach.
          • Transportability across different domains.
      • Rationale:
          • Rephrasing of the proposed approach as a vector space model.
          • Addressing the limitations of the vector space model (BoW: lack of structure) to represent the structure present in the data.
          • Keeping the semantic matching behavior.
      • Requirements: Performance & Scalability, Maintainability.
  • 27. T-Space
      • A Vector Space Model for building RDF indexes that preserve the graph structure and allow a semantic matching/search behavior.
  • 28. Generalization (Treo T-Space)
      • Approach:
          • Ranked results.
          • Natural language queries.
          • Two-phase search process combining entity search with spreading activation search.
          • Use of distributional semantics to semantically match query terms to dataset terms.
          • Semantic manifold (‘structured vector space’) model: T-Space.
  • 29. Distributional Semantic Model
      • Assumption: the context surrounding a given word in a text provides important information about its meaning.
      • Simplified semantic model.
      • Explicit Semantic Analysis (ESA) is the primary distributional model used in this work.
      • A wife is a female partner in a marriage. The term "wife" seems to be a close term to bride, the latter is a female participant in a wedding ceremony, while a wife is a married woman during her marriage. ...
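
The distributional assumption on slide 29 can be made concrete with a small ESA-style sketch. This is an illustration under stated assumptions, not the model used in the work: the three "concept" texts are hypothetical stand-ins for Wikipedia articles, the weighting is a plain TF-IDF, and relatedness is taken as the cosine between the resulting concept vectors.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles; a real ESA index is built
# over a full corpus, not three sentences.
CONCEPTS = {
    "Marriage": "a marriage joins a wife and a husband as spouse partners",
    "Wedding": "a wedding is the ceremony where a bride becomes a wife",
    "University": "a university grants a degree when students graduate",
}


def tokenize(text):
    return text.lower().split()


def build_index(concepts):
    """Map each word to a sparse {concept name: TF-IDF weight} vector."""
    n = len(concepts)
    counts = {name: Counter(tokenize(text)) for name, text in concepts.items()}
    # Document frequency: in how many concept texts each word occurs.
    df = Counter(word for c in counts.values() for word in c)
    index = {}
    for name, c in counts.items():
        for word, tf in c.items():
            idf = math.log(n / df[word])
            if idf > 0:  # words present in every concept carry no signal
                index.setdefault(word, {})[name] = tf * idf
    return index


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


INDEX = build_index(CONCEPTS)


def relatedness(a, b):
    """Relatedness of two words as the cosine of their concept vectors."""
    return cosine(INDEX.get(a, {}), INDEX.get(b, {}))


print(relatedness("wife", "spouse"))
print(relatedness("wife", "graduate"))
```

In this toy corpus "wife" and "spouse" share the Marriage concept and so score above zero, while "wife" and "graduate" share no concept and score zero, even though none of the words match as strings.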
  • 30. T-Space
      • Embedding the structure of a graph into a semantic vector space.
      • Distributional semantic model: semantic statistical knowledge extracted from large Web corpora.
      • Figure labels: relations, instances, classes.
  • 31. Querying the T-Space
  • 32. Querying the T-Space
  • 33. Querying the T-Space
  • 34. Querying the T-Space
      • Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends. IEEE Internet Computing, Special Issue on Internet-Scale Data, 2012.
      • A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data. In Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC), 2011.
  • 35. Ranking/Filtering Results
      • How to filter non-related results.
      • Analysis of semantic relatedness as a ranking function.
      • Creation of a methodology for the determination of a semantic relatedness threshold.
      • Semantic differential analysis.
      • Requirements: Flexibility & Accuracy.
      • A Distributional Approach for Terminological Semantic Search on the Linked Data Web. 27th ACM Applied Computing Symposium, Semantic Web and Its Applications Track, 2012.
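
Using semantic relatedness both as a ranking function and as a filter, as slide 35 describes, might look like the sketch below. The candidate scores and the threshold value of 0.5 are hypothetical; the slide refers to a methodology for determining the threshold, which is not reproduced here.

```python
def rank_and_filter(candidates, threshold):
    """Keep candidates scoring at or above the threshold, best first.

    candidates: list of (dataset term, relatedness score) pairs.
    """
    kept = [(term, score) for term, score in candidates if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical relatedness scores of dataset terms against the query term "wife".
candidates = [
    ("birthPlace", 0.04),
    ("partner", 0.55),
    ("almaMater", 0.12),
    ("spouse", 0.71),
]

ranked = rank_and_filter(candidates, threshold=0.5)
print(ranked)
```

A higher threshold removes more non-related results at the cost of possibly dropping correct but weakly related terms, which is the flexibility/accuracy trade-off named in the slide's requirements.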
  • 36. VSM & Formalization
      • Based on the Generalized Vector Space Model.
      • Simplification and clarification of T-Space properties.
      • Tensors are used to support domain-specific meaning.
      • Tensors can be used to connect concepts across datasets and T-Spaces.
  • 37. VSM & Formalization
      • 1. The distributional space can be transformed to the keyword space.
      • Requirements: Flexibility & Accuracy.
  • 38. VSM & Formalization
      • 2. The vector field defines an additional structure over the vector space model.
      • Requirements: Expressivity.
  • 39. Customized Distributional Model
      • Distributional models customizable for different datasets.
  • 40. Customized Distributional Model
      • Transformation from different distributional models.
      • Figure labels: Wikipedia, Financial Corpus, DBpedia, Financial Dataset.
      • Requirements: Maintainability.
      • A Distributional Structured Semantic Space for Querying RDF Graph Data. International Journal of Semantic Computing (IJSC), 2012.
  • 41. Evaluation (Treo T-Space)
  • 42. Quality of Results
      • Q1: How does the approach work for addressing the vocabulary problem?
      • Q2: How does the approach work for addressing natural language queries?
      • Precision, recall, mean reciprocal rank, % of answered queries.
  • 43. Quality of Results
      • QALD DBpedia Training Set.
      • 50 natural language queries.
      • DBpedia 3.6.

      Query set                             Avg. Precision   Avg. Recall   MRR     % of queries answered
      Full DBpedia QuerySet (50 queries)    0.482            0.491         0.516   58%
      Partial DBpedia QuerySet (38 queries) 0.634            0.645         0.679   76%
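
The metrics reported on slide 43 can be computed as in the sketch below. The three query runs are made-up data, and two conventions are assumptions rather than the authors' stated protocol: a query counts as "answered" when at least one gold answer is returned, and a query with no results scores zero on every metric.

```python
def precision_recall(returned, gold):
    """Set-based precision and recall for one query."""
    returned, gold = set(returned), set(gold)
    hits = len(returned & gold)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct answer, or 0 if none is returned."""
    for position, answer in enumerate(ranked, start=1):
        if answer in gold:
            return 1.0 / position
    return 0.0


def evaluate(runs):
    """Average the per-query metrics over all runs.

    runs: non-empty list of (ranked results, set of gold answers) pairs.
    """
    if not runs:
        raise ValueError("at least one query run is required")
    totals = {"precision": 0.0, "recall": 0.0, "mrr": 0.0, "answered": 0.0}
    for ranked, gold in runs:
        p, r = precision_recall(ranked, gold)
        rr = reciprocal_rank(ranked, gold)
        totals["precision"] += p
        totals["recall"] += r
        totals["mrr"] += rr
        totals["answered"] += 1.0 if rr > 0 else 0.0
    return {name: total / len(runs) for name, total in totals.items()}


# Made-up runs: one fully correct, one partially correct, one unanswered.
RUNS = [
    (["a", "b"], {"a"}),
    (["x", "y", "z"], {"z", "w"}),
    ([], {"q"}),
]

SCORES = evaluate(RUNS)
print(SCORES)
```

Under the zero-score convention, dropping unanswered queries from the set raises every average, which is one possible reading of why the partial query set scores higher than the full one; the slide itself does not state how the 38 queries were selected.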
  • 44. Error Distribution
      • Q4: What is the error associated with each component of the search approach?
      • Error distribution measurements.
  • 45. Error Distribution
      • Conclusion: Robustness of the distributional semantic model for open-domain queries.
      • Conclusion: Major need for improvement of the pre/post-processing phases.
  • 46. Evaluating Terminology-level Semantic Matching
      • Q1: How does the approach work for addressing the vocabulary problem?
      • Q5: Does the approach improve the semantic matching of the queries?
      • Quantitative evaluation: P@5, P@10, MRR, % of the queries answered, comparative evaluation using string matching and WordNet-based query expansion.
  • 47. Evaluating Terminology-level Semantic Matching

      Avg. Precision@5   Avg. Precision@10   MRR     % of queries answered
      0.732              0.646               0.646   92.25%

      Approach                        % of queries answered
      ESA                             92.25%
      String matching                 45.77%
      String matching + WordNet QE    52.48%
  • 48. Comparative Analysis
      • Q3: What is the best distributional semantic model for the vocabulary problem?
      • Preliminary comparative analysis between different distributional semantic models.
  • 49. Comparative Analysis (Treo vs Treo T-Space)
      • Wikipedia Link Measure (WLM) vs Explicit Semantic Analysis (ESA)

      Model (Full Query Set)   Avg. Precision   Avg. Recall   MRR     % of queries answered
      ESA                      0.482            0.491         0.516   58%
      WLM                      0.395            0.451         0.489   56%
      Improvement              18%              8.2%          5.2%    3.5%
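
The slide does not say how the improvement percentages were derived. As a hedged check, the sketch below computes the ESA-over-WLM gain with two possible denominators, using only the figures from slide 49. The values relative to the ESA score come out close to the reported ones (about 18%, 8.1%, 5.2% and 3.4%); treating that as the authors' formula is an inference, not something the source states.

```python
# Figures copied from slide 49 (full query set).
ESA = {"precision": 0.482, "recall": 0.491, "mrr": 0.516, "answered": 0.58}
WLM = {"precision": 0.395, "recall": 0.451, "mrr": 0.489, "answered": 0.56}


def gains(new, old, relative_to):
    """Difference new - old per metric, divided by the chosen baseline."""
    return {k: (new[k] - old[k]) / relative_to[k] for k in new}


# Two candidate readings of "improvement"; which one was used is not stated.
gain_vs_esa = gains(ESA, WLM, relative_to=ESA)
gain_vs_wlm = gains(ESA, WLM, relative_to=WLM)

for metric in ESA:
    print(
        f"{metric:10s} vs ESA: {gain_vs_esa[metric]:6.1%}   "
        f"vs WLM: {gain_vs_wlm[metric]:6.1%}"
    )
```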
  • 50. Comparative Analysis
      • Conclusion: Of the two tested approaches, ESA provides a better semantic model.
  • 51. Conclusions
      • The T-Space model provides a principled way to build data model independent queries over RDF graphs.
      • The distributional semantic model supports a flexible matching between query terms and dataset terms in a semantic best-effort scenario.
      • The ESA semantic model provides a better distributional model compared to WLM.
      • Improvements are needed in the pre/post-processing phase of the approach.
  • 52. Associated Article
      • André Freitas, Edward Curry, João Gabriel Oliveira, Seán O’Riain. A Distributional Structured Semantic Space for Querying RDF Graph Data. International Journal of Semantic Computing (IJSC), 2012.
          http://www.worldscinet.com/ijsc/05/0504/S1793351X1100133X.html
          http://andrefreitas.org/papers/preprint_distributional_structured_space.pdf
  • 53. Related Publications
      • André Freitas, Edward Curry, João Gabriel Oliveira, Seán O’Riain. Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches and Trends. IEEE Internet Computing, Special Issue on Internet-Scale Data, 2012 (Article).
      • André Freitas, Edward Curry, João Gabriel Oliveira, Seán O’Riain. A Distributional Structured Semantic Space for Querying RDF Graph Data. International Journal of Semantic Computing (IJSC), 2012 (Article).
      • André Freitas, Seán O’Riain, Edward Curry. A Distributional Approach for Terminological Semantic Search on the Linked Data Web. 27th ACM Applied Computing Symposium, Semantic Web and Its Applications Track, 2012 (Conference Full Paper).
      • André Freitas, João Gabriel Oliveira, Edward Curry, Seán O’Riain. A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data. In Proceedings of the 5th International Conference on Semantic Computing (ICSC), 2011 (Conference Full Paper).
      • André Freitas, João Gabriel Oliveira, Seán O’Riain, Edward Curry, João Carlos Pereira da Silva. Querying Linked Data using Semantic Relatedness: A Vocabulary Independent Approach. In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB), 2011 (Conference Full Paper).
      • André Freitas, João Gabriel Oliveira, Seán O’Riain, Edward Curry, João Carlos Pereira da Silva. Treo: Combining Entity-Search, Spreading Activation and Semantic Relatedness for Querying Linked Data. In 1st Workshop on Question Answering over Linked Data (QALD-1), Workshop at the 8th Extended Semantic Web Conference (ESWC), 2011 (Workshop Full Paper).
      • André Freitas, João Gabriel Oliveira, Seán O’Riain, Edward Curry, João Carlos Pereira da Silva. Treo: Best-Effort Natural Language Queries over Linked Data. In Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems (NLDB), 2011 (Poster in Proceedings).
  • 54. http://andrefreitas.org