Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach

The demand to access large amounts of heterogeneous structured
data is emerging as a trend for many users and applications.
However, the effort involved in querying heterogeneous
and distributed third-party databases can create major
barriers for data consumers. At the core of this problem is
the semantic gap between the way users express their information
needs and the representation of the data. This work
aims to provide a natural language interface and an associated
semantic index to support an increased level of vocabulary
independence for queries over Linked Data/Semantic
Web datasets, using a distributional-compositional semantics
approach. Distributional semantics focuses on the automatic
construction of a semantic model based on the statistical distribution
of co-occurring words in large-scale texts. The proposed
query model targets the following features: (i) a principled
semantic approximation approach with low adaptation
effort (independent from manually created resources such as
ontologies, thesauri or dictionaries), (ii) comprehensive semantic
matching supported by the inclusion of large volumes
of distributional (unstructured) commonsense knowledge into
the semantic approximation process and (iii) expressive natural
language queries. The approach is evaluated using natural language
queries on an open domain dataset and achieves an average recall of
0.81, a mean average precision of 0.62 and a mean reciprocal rank of 0.49.

Transcript

  • 1. Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach. André Freitas and Edward Curry, Insight Centre for Data Analytics. International Conference on Intelligent User Interfaces, Haifa, 2014.
  • 2. Talking to your (Big) Data
  • 3. Motivation
  • 4. Shift in the Database Landscape
      • Heterogeneous, complex and large-scale databases.
      • Very large and dynamic “schemas”.
      • Circa 2000: 10s-100s of attributes; circa 2014: 1,000s-1,000,000s of attributes.
  • 5. Databases for a Complex World: How do you query data in this scenario?
  • 6. Vocabulary Problem for Databases. Query: “Who is the daughter of Bill Clinton married to?” [Figure: a semantic gap separates the query from its possible representations in the data; semantic approximation = commonsense knowledge.]
  • 7. Semantics for a Complex World [Figure labels: Formal World, Real World, Distributional Semantics, Query Approach.]
  • 8. Does it work?
  • 9. Addressing the Vocabulary Problem for Databases (with Distributional Semantics) Gaelic: direction
  • 10. Solution (Video)
  • 11. More Complex Queries (Video)
  • 12. Treo Answers Jeopardy Queries (Video) http://bit.ly/1hWcch9
  • 13. Evaluation
      • 102 natural language queries (test collection: QALD 2011).
      • Avg. query execution time: 1.52 s (simple queries) to 8.53 s (all queries).
      • Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances.
  • 14. Comparative Evaluation
  • 15. Query Approach
  • 16. Distributional Semantics: “Words occurring in similar (linguistic) contexts are semantically related.”
      • If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).
      • This record can then be used as a surrogate for the word's semantic representation.
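  • 17. Distributional Semantic Model [Figure: a vector space with context dimensions c1, c2, ..., cn; the words husband, spouse and child are plotted as vectors whose coordinates (e.g. 0.7, 0.5) are a function of the number of times the word occurs in each context. Annotation: “Commonsense is here”.]
  • 18. Semantic Relatedness: works as a semantic ranking function. [Figure: the angle θ between word vectors (husband, spouse, child) in the space with axes c1, c2, ..., cn.]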
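The idea on slides 16-18 can be illustrated with a minimal sketch. It assumes a toy three-sentence corpus, whole-sentence context windows and raw co-occurrence counts as weights; the approach in the slides uses Explicit Semantic Analysis over Wikipedia, which this sketch does not reproduce. The function names `build_vectors` and `relatedness` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_vectors(sentences):
    """Map each word to a Counter of the words it co-occurs with in a sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[word][context] += 1
    return vectors

def relatedness(u, v):
    """Cosine of the angle between two sparse context vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus: an assumption for illustration only.
corpus = [
    "her husband attended the wedding",
    "her spouse attended the wedding",
    "the child played in the garden",
]
vectors = build_vectors(corpus)
sim_spouse = relatedness(vectors["husband"], vectors["spouse"])
sim_child = relatedness(vectors["husband"], vectors["child"])
print(sim_spouse, sim_child)
```

In this toy corpus "husband" and "spouse" share all their contexts, so they score higher than "husband" and "child", which only share the word "the". A realistic model would weight or filter such frequent words.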
  • 19. Approach Overview [Diagram: Query → Query Analysis → Query Features → Query Planner → Query Plan → core semantic approximation & composition operations over the Ƭ-Space. The Ƭ-Space is built from the database and from distributional semantics (large-scale unstructured data as commonsense knowledge).]
  • 20. Approach Overview [Same diagram, instantiated: the database is RDF data, and the distributional semantics is Explicit Semantic Analysis (ESA) over Wikipedia as commonsense knowledge.]
  • 21. Ƭ-Space [Figure labels: r, p, e.]
  • 22. Core Operations Query
  • 23. Core Operations Query Search & Composition Operations
  • 24. Search and Composition Operations
      • Instance search
          - Proper nouns
          - String similarity + node cardinality
      • Class (unary predicate) search
          - Nouns, adjectives and adverbs
          - String similarity + distributional semantic relatedness
      • Property (binary predicate) search
          - Nouns, adjectives, verbs and adverbs
          - Distributional semantic relatedness
      • Navigation
      • Extensional expansion
          - Expands the instances associated with a class.
      • Operator application
          - Aggregations, conditionals, ordering, position
      • Disjunction & conjunction
      • Disambiguation dialog (instance, predicate)
  • 25. Core Principles
      • Minimize the impact of ambiguity, vagueness and synonymy.
      • Address the simplest matchings first (heuristics).
      • Semantic relatedness as a primitive operation.
      • Distributional semantics as commonsense knowledge.
  • 26. Question Analysis: Transform natural language queries into triple patterns. “Who is the daughter of Bill Clinton married to?” PODS: Bill Clinton, daughter, married to. Query features: (INSTANCE), (PREDICATE), (PREDICATE).
  • 27. Query Plan: Map query features into a query plan. A query plan contains a sequence of core operations. Query features: (INSTANCE), (PREDICATE), (PREDICATE). Query plan:
      • (1) INSTANCE SEARCH (Bill Clinton)
      • (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
      • (3) e1 <- NAVIGATE (Bill Clinton, p1)
      • (4) p2 <- SEARCH PREDICATE (e1, married to)
      • (5) e2 <- NAVIGATE (e1, p2)
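The mapping from query features to a plan on slide 27 can be sketched as follows. This is a minimal illustration, assuming the features arrive as an ordered list of (type, text) pairs that starts with an instance and continues with predicates; `build_plan` and the tuple encoding of operations are hypothetical, not the system's actual interface.

```python
def build_plan(features):
    """Turn ordered query features into a list of core operations.

    features: list of (kind, text) pairs; the first must be an INSTANCE and
    the rest PREDICATEs, as in the slide 27 example.
    """
    kind, text = features[0]
    if kind != "INSTANCE":
        raise ValueError("this sketch only handles plans that start at an instance")
    plan = [("INSTANCE_SEARCH", text)]
    pivot = text  # current pivot: an entity name or a bound variable
    for step, (kind, text) in enumerate(features[1:], start=1):
        if kind != "PREDICATE":
            raise ValueError(f"unsupported feature type: {kind}")
        p_var, e_var = f"p{step}", f"e{step}"
        plan.append(("SEARCH_PREDICATE", p_var, pivot, text))
        plan.append(("NAVIGATE", e_var, pivot, p_var))
        pivot = e_var  # the entity reached becomes the next pivot
    return plan

features = [
    ("INSTANCE", "Bill Clinton"),
    ("PREDICATE", "daughter"),
    ("PREDICATE", "married to"),
]
plan = build_plan(features)
for op in plan:
    print(op)
```

Each predicate feature contributes one search step and one navigation step, which yields the five operations listed on the slide.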
  • 28. Instance Search. Query: Bill Clinton, daughter, married to. Instance search maps “Bill Clinton” to the Linked Data node :Bill_Clinton.
  • 29. Predicate Search. Query: Bill Clinton, daughter, married to. Linked Data: :Bill_Clinton (PIVOT ENTITY) with its associated triples:
      • :child :Chelsea_Clinton
      • :religion :Baptists
      • :almaMater :Yale_Law_School
      • ...
  • 30. Predicate Search. Query: Bill Clinton, daughter, married to. Which properties are semantically related to “daughter”? Linked Data triples of :Bill_Clinton and their relatedness scores:
      • :child :Chelsea_Clinton, sem_rel(daughter, child) = 0.054
      • :religion :Baptists, sem_rel(daughter, religion) = 0.004
      • :almaMater :Yale_Law_School, sem_rel(daughter, alma mater) = 0.001
      • ...
  • 31. Navigate. Query: Bill Clinton, daughter, married to. Linked Data: :Bill_Clinton :child :Chelsea_Clinton.
  • 32. Navigate. Query: Bill Clinton, daughter, married to. Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY).
  • 33. Predicate Search. Query: Bill Clinton, daughter, married to. Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY) :spouse :Mark_Mezvinsky.
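The walkthrough on slides 28-33 can be reproduced on a toy graph. This is a minimal sketch under stated assumptions: the triples are the ones shown on the slides; `SEM_REL` is a hand-filled lookup table standing in for the distributional relatedness measure (the “daughter” scores follow slide 30, where assigning 0.004 to :religion is an assumption, and the “married to” score is invented for illustration); the 0.01 threshold and the helper names `search_predicate` and `navigate` are hypothetical.

```python
# Toy Linked Data graph: (subject, predicate, object) triples from the slides.
TRIPLES = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Bill_Clinton", ":almaMater", ":Yale_Law_School"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

# Stand-in for the distributional relatedness measure (hand-filled assumption).
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,  # invented value, for illustration only
}

def search_predicate(pivot, term, threshold=0.01):
    """Return the pivot's predicate most related to the query term, or None."""
    candidates = {p for s, p, _ in TRIPLES if s == pivot}
    scored = [(SEM_REL.get((term, p), 0.0), p) for p in candidates]
    score, best = max(scored, default=(0.0, None))
    return best if score >= threshold else None

def navigate(pivot, predicate):
    """Follow a predicate from the pivot and return the entities reached."""
    return [o for s, p, o in TRIPLES if s == pivot and p == predicate]

p1 = search_predicate(":Bill_Clinton", "daughter")
e1 = navigate(":Bill_Clinton", p1)[0]
p2 = search_predicate(e1, "married to")
e2 = navigate(e1, p2)[0]
print(p1, e1, p2, e2)
```

Only the predicates attached to the current pivot entity are scored, so the search space stays small even when the dataset has many thousands of predicates.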
  • 34. Results
  • 35. Conclusions
      • The compositional-distributional model supports a schema-agnostic natural language query mechanism over a large-schema (open domain) database.
      • Comprehensive and accurate semantic matching: avg. recall = 0.81, MAP = 0.62, MRR = 0.49.
      • Medium-high expressivity: 80% of queries answered.
      • Interactive query execution time: avg. 1.52 s (simple queries) to 8.53 s (all queries) per query.
      • Better recall and query coverage compared to baselines with equivalent precision.
      • Low adaptation effort for new datasets.
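For readers unfamiliar with the reported metrics, the sketch below computes recall, average precision and reciprocal rank for a single query using the standard information-retrieval definitions; mean average precision and mean reciprocal rank are the means of the last two over all queries. The evaluation in the slides may differ in details, and the ranked list and relevant set here are made-up examples, not data from the evaluation.

```python
def recall(ranked, relevant):
    """Fraction of relevant items that appear anywhere in the ranked list."""
    return len(set(ranked) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Sum of precision@k at each rank k holding a relevant item, divided by
    the total number of relevant items."""
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is returned."""
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / k
    return 0.0

# Made-up example: answers returned for one query, and the gold-standard set.
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
r = recall(ranked, relevant)
ap = average_precision(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
print(r, ap, rr)
```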