Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach


The demand to access large amounts of heterogeneous structured
data is emerging as a trend for many users and applications.
However, the effort involved in querying heterogeneous
and distributed third-party databases can create major
barriers for data consumers. At the core of this problem is
the semantic gap between the way users express their information
needs and the representation of the data. This work
aims to provide a natural language interface and an associated
semantic index to support an increased level of vocabulary
independence for queries over Linked Data/Semantic
Web datasets, using a distributional-compositional semantics
approach. Distributional semantics focuses on the automatic
construction of a semantic model based on the statistical distribution
of co-occurring words in large-scale texts. The proposed
query model targets the following features: (i) a principled
semantic approximation approach with low adaptation
effort (independent from manually created resources such as
ontologies, thesauri or dictionaries), (ii) comprehensive semantic
matching supported by the inclusion of large volumes
of distributional (unstructured) commonsense knowledge into
the semantic approximation process and (iii) expressive natural language queries. The approach was evaluated using natural language queries on an open-domain dataset and achieved an average recall of 0.81, a mean average precision of 0.62 and a mean reciprocal rank of 0.49.
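
The idea of building a semantic model from co-occurrence statistics can be illustrated with a minimal sketch. It is not the model used in this work: the toy corpus, the stopword list and the function names below are assumptions made for illustration, and plain sentence-level co-occurrence counts with cosine similarity stand in for a real distributional model built from large-scale text.

```python
import math
from collections import Counter

# Toy corpus and stopword list: illustrative assumptions, not the corpus used in the talk.
CORPUS = [
    "her husband and spouse attended the wedding",
    "the spouse or husband signs the marriage form",
    "the child and daughter played in the garden",
    "a daughter is a female child of a parent",
    "the engine of the car needs oil",
]
STOPWORDS = {"the", "a", "of", "and", "or", "in", "is", "her"}


def context_vector(word, corpus=CORPUS):
    """Count the non-stopword tokens that share a sentence with `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word and t not in STOPWORDS)
    return counts


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness(a, b):
    """Distributional relatedness of two words as the cosine of their context vectors."""
    return cosine(context_vector(a), context_vector(b))


# Words used in similar contexts score higher than unrelated ones.
print(relatedness("husband", "spouse") > relatedness("husband", "engine"))
```

On this toy corpus, "husband" and "spouse" share most of their contexts, while "husband" and "engine" share none, so the first pair ranks higher. No manually created ontology or thesaurus is involved, which is the property the abstract refers to as low adaptation effort.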



  1. Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach. André Freitas and Edward Curry, Insight Centre for Data Analytics. International Conference on Intelligent User Interfaces, Haifa, 2014.
  2. Talking to your (Big) Data
  3. Motivation
  4. Shift in the Database Landscape
     • Heterogeneous, complex and large-scale databases.
     • Very large and dynamic “schemas”.
     • Circa 2000: 10s-100s of attributes. Circa 2014: 1,000s-1,000,000s of attributes.
  5. Databases for a Complex World: How do you query data in this scenario?
  6. Vocabulary Problem for Databases. Query: Who is the daughter of Bill Clinton married to? [Figure labels: Semantic Gap; Possible representations; Semantic approximation = Commonsense Knowledge]
  7. Semantics for a Complex World. [Figure labels: Formal World; Real World; Distributional Semantics; Query Approach]
  8. Does it work?
  9. Addressing the Vocabulary Problem for Databases (with Distributional Semantics). Gaelic: direction
  10. Solution (Video)
  11. More Complex Queries (Video)
  12. Treo Answers Jeopardy Queries (Video) http://bit.ly/1hWcch9
  13. Evaluation
     • 102 natural language queries (Test Collection: QALD 2011).
     • Avg. query execution time: 1.52 s (simple queries) to 8.53 s (all queries).
     • Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances.
  14. Comparative Evaluation
  15. Query Approach
  16. Distributional Semantics: “Words occurring in similar (linguistic) contexts are semantically related.”
     • If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).
     • This can then be used as a surrogate of its semantic representation.
  17. Distributional Semantic Model. [Figure: vector space with context dimensions c1, c2, ..., cn and the words husband, spouse and child; example weights 0.7 and 0.5; each weight is a function of the number of times that the words occur in a context such as c1.] Commonsense is here.
  18. Semantic Relatedness. [Figure: angle θ between word vectors (husband, spouse, child) in the space c1, c2, ..., cn.] Works as a semantic ranking function.
  19. Approach Overview. Query → Query Analysis → Query Features → Query Planner → Query Plan → Core semantic approximation & composition operations over the Ƭ-Space. [Figure labels: Database; Distributional semantics; Large-scale unstructured data; Commonsense knowledge]
  20. Approach Overview. Query → Query Analysis → Query Features → Query Planner → Query Plan → Core semantic approximation & composition operations over the Ƭ-Space. [Figure labels: RDF Data; Explicit Semantic Analysis (ESA); Wikipedia; Commonsense knowledge]
  21. Ƭ-Space. [Figure labels: r, p, e]
  22. Core Operations. [Figure label: Query]
  23. Core Operations. [Figure labels: Query; Search & Composition Operations]
  24. Search and Composition Operations
     • Instance search: proper nouns; string similarity + node cardinality.
     • Class (unary predicate) search: nouns, adjectives and adverbs; string similarity + distributional semantic relatedness.
     • Property (binary predicate) search: nouns, adjectives, verbs and adverbs; distributional semantic relatedness.
     • Navigation
     • Extensional expansion: expands the instances associated with a class.
     • Operator application: aggregations, conditionals, ordering, position.
     • Disjunction & Conjunction
     • Disambiguation dialog (instance, predicate)
  25. Core Principles
     • Minimize the impact of ambiguity, vagueness and synonymy.
     • Address the simplest matchings first (heuristics).
     • Semantic relatedness as a primitive operation.
     • Distributional semantics as commonsense knowledge.
  26. Question Analysis: Transform natural language queries into triple patterns. “Who is the daughter of Bill Clinton married to?” PODS: Bill Clinton (INSTANCE), daughter (PREDICATE), married to (PREDICATE). [Figure label: Query Features]
  27. Query Plan: Map query features into a query plan. A query plan contains a sequence of core operations. Query features: (INSTANCE) (PREDICATE) (PREDICATE). Query plan:
     • (1) INSTANCE SEARCH (Bill Clinton)
     • (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
     • (3) e1 <- NAVIGATE (Bill Clinton, p1)
     • (4) p2 <- SEARCH PREDICATE (e1, married to)
     • (5) e2 <- NAVIGATE (e1, p2)
  28. Instance Search. Query: Bill Clinton | daughter | married to. Linked Data: :Bill_Clinton
  29. Predicate Search. Query: Bill Clinton | daughter | married to. Linked Data: :Bill_Clinton (PIVOT ENTITY) with ASSOCIATED TRIPLES :child :Chelsea_Clinton; :religion :Baptists; :almaMater :Yale_Law_School; ...
  30. Predicate Search. Query: Bill Clinton | daughter | married to. Which properties are semantically related to "daughter"? Linked Data: :Bill_Clinton with :child :Chelsea_Clinton, sem_rel(daughter, child)=0.054; :religion :Baptists, sem_rel(daughter, religion)=0.004; :almaMater :Yale_Law_School, sem_rel(daughter, alma mater)=0.001; ...
  31. Navigate. Query: Bill Clinton | daughter | married to. Linked Data: :Bill_Clinton :child :Chelsea_Clinton
  32. Navigate. Query: Bill Clinton | daughter | married to. Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
  33. Predicate Search. Query: Bill Clinton | daughter | married to. Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY) :spouse :Mark_Mezvinsky
  34. Results
  35. Conclusions
     • The compositional-distributional model supports a schema-agnostic natural language query mechanism over a large-schema (open domain) database.
     • Comprehensive and accurate semantic matching: avg. recall=0.81, MAP=0.62, MRR=0.49.
     • Medium-high expressivity: 80% of queries answered.
     • Interactive query execution time: avg. 1.52 s (simple queries) to 8.53 s (all queries) per query.
     • Better recall and query coverage compared to baselines with equivalent precision.
     • Low adaptation effort for new datasets.
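
The query plan walked through in slides 27 to 33 (instance search, then alternating predicate search and navigation) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the Treo implementation: the triples are a toy graph taken from the running example, `SEM_REL` is a hand-written stand-in for a distributional relatedness measure such as ESA, the 0.004 score is assumed to belong to `:religion`, the "married to" score and the `threshold` value are invented for illustration, and instance search uses string similarity only (node cardinality is omitted).

```python
from difflib import SequenceMatcher

# Toy graph following the running example in the slides (illustrative only).
TRIPLES = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Bill_Clinton", ":almaMater", ":Yale_Law_School"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

# Hand-written stand-in for a distributional relatedness measure.
# The "daughter" scores echo the slides; the "married to" score is an assumption.
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,
}


def label(node):
    """Turn ':Bill_Clinton' into 'bill clinton' for string matching."""
    return node.lstrip(":").replace("_", " ").lower()


def instance_search(term):
    """Return the entity whose label is most similar to the query term."""
    entities = {s for s, _, _ in TRIPLES} | {o for _, _, o in TRIPLES}
    return max(entities, key=lambda e: SequenceMatcher(None, term.lower(), label(e)).ratio())


def search_predicate(pivot, term, threshold=0.01):
    """Rank the pivot's predicates by relatedness to the term; keep the best above the threshold."""
    candidates = {p for s, p, _ in TRIPLES if s == pivot}
    scored = [(SEM_REL.get((term, p), 0.0), p) for p in candidates]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None


def navigate(pivot, predicate):
    """Follow the predicate from the pivot entity to its objects."""
    return [o for s, p, o in TRIPLES if s == pivot and p == predicate]


def run_plan(instance_term, predicate_terms):
    """Instance search, then one predicate search and navigation step per remaining term."""
    pivots = [instance_search(instance_term)]
    for term in predicate_terms:
        next_pivots = []
        for pivot in pivots:
            predicate = search_predicate(pivot, term)
            if predicate is not None:
                next_pivots.extend(navigate(pivot, predicate))
        pivots = next_pivots
    return pivots


print(run_plan("Bill Clinton", ["daughter", "married to"]))  # [':Mark_Mezvinsky']
```

Each navigation step turns the previous answer into the new pivot entity, so the relatedness function only ever has to rank the handful of predicates attached to that pivot, never the whole schema.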
