The increase in the size, heterogeneity and complexity of contemporary Big Data environments brings major challenges for the consumption of structured and semi-structured data. Addressing these challenges requires a convergence of approaches from different communities including databases, natural language processing, and information retrieval. Research on Natural Language Interfaces (NLI) and Question Answering systems has played a prominent role in stimulating a multidisciplinary approach to the problem that has moved the field from a futuristic vision to a concrete industry-level technological trend.
In this talk we distill the key principles of state-of-the-art approaches for data consumption using NLI. Particular attention is paid to the maturity and effectiveness of each approach, together with a discussion of future trends and active research questions.
Talking to your Data: Natural Language Interfaces for a schema-less world (Keynote at NLIWoD, ISWC 2014)
1. Talking to your Data:
Natural Language Interfaces for a
schema-less world
André Freitas
NLIWoD at ISWC 2014
Riva del Garda
2. Outline
Shift in the Database Landscape
On Schema-agnosticism & Semantics
Distributional Semantics to the Rescue
Case Study: Treo QA System
Living in a Schema-less World
Take-away Message
14. First-level independency
(Relational Model)
“… it provides a basis for a high level data language which
will yield maximal independence between programs on
the one hand and representation and organization of data
on the other”
Codd, 1970
Second-level independency
(Schema-agnosticism)
16. Vocabulary Problem for Databases
Query: Who is the daughter of Bill Clinton married to?
Semantic Gap
Possible representations
Schema-agnostic query
mechanisms
Abstraction level differences
Lexical variation
Structural (compositional) differences
Operational/functional differences
17. Robust Semantic Model
Semantic intelligent behaviour is highly dependent on
knowledge scale (commonsense, semantic)
Semantics
=
Formal meaning representation model
(lots of data)
+
inference model
18. Robust Semantic Model
Not scalable!
1st Hard problem: Acquisition
Semantics
=
Formal meaning representation model
(lots of data)
+
inference model
19. Robust Semantic Model
Not scalable!
2nd Hard problem: Consistency
Semantics
=
Formal meaning representation model
(lots of data)
+
inference model
20. Semantics for a Complex World
“Most semantic models have dealt with particular types of
constructions, and have been carried out under very simplifying
assumptions, in true lab conditions.”
“If these idealizations are removed it is not clear at all that modern
semantics can give a full account of all but the simplest
models/statements.”
Formal World Real World
Baroni et al. 2013
21. Distributional Semantic Models
Semantic Model with low acquisition effort
(automatically built from text)
Simplification of the representation
Enables the construction of comprehensive
commonsense/semantic KBs
What is the cost?
Some level of noise
(semantic best-effort)
22. Distributional Hypothesis
“Words occurring in similar (linguistic) contexts tend
to be semantically similar”
He filled the wampimuk with the substance, passed it
around and we all drunk some
23. Distributional Semantic Models (DSMs)
“The dog barked in the park. The owner of the dog put him on the
leash since he barked.”
contexts = nouns and verbs in the same
sentence
24. Distributional Semantic Models (DSMs)
“The dog barked in the park. The owner of the dog put him on the
leash since he barked.”
bark
dog
park
leash
contexts = nouns and verbs in the same
sentence
bark : 2
park : 1
leash : 1
owner : 1
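The counts above can be reproduced with a few lines of code. This is a minimal sketch, assuming the counts are the context vector of "dog", that nouns and verbs have been picked out and lemmatized by hand, and that the context vocabulary is limited to the four terms listed on the slide (the verb "put" is left out there). The names SENTENCES, CONTEXT_TERMS and context_vector are illustrative only.

```python
from collections import Counter

# Hand-annotated nouns and verbs (lemmatized) for the two example sentences.
# A real system would get these from a POS tagger and a lemmatizer.
SENTENCES = [
    ["dog", "bark", "park"],
    ["owner", "dog", "put", "leash", "bark"],
]

# Context terms listed on the slide (assumption: "put" is omitted there).
CONTEXT_TERMS = {"bark", "park", "leash", "owner"}


def context_vector(target, sentences, context_terms):
    """Count how often each context term shares a sentence with the target."""
    counts = Counter()
    for sentence in sentences:
        if target not in sentence:
            continue
        for word in sentence:
            if word != target and word in context_terms:
                counts[word] += 1
    return counts


dog_vector = context_vector("dog", SENTENCES, CONTEXT_TERMS)
print(dict(dog_vector))
```

Each target word ends up with one such vector, and words with similar vectors are treated as semantically related.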
40. Relevance
Test Collection: QALD 2011.
DBpedia.
Dataset (DBpedia + YAGO links): 45,767 predicates, 9,434,677
instances, more than 200,000 classes
41. Query Pre-Processing
(Question Analysis)
Transform natural language queries into triple
patterns.
“Who is the daughter of Bill Clinton married to?”
43. Query Pre-Processing
(Question Analysis)
Step 2: Core Entity Recognition
- Rules-based: POS Tag + TF/IDF
Who is the daughter of Bill Clinton married to?
(PROBABLY AN INSTANCE)
44. Query Pre-Processing
(Question Analysis)
Step 3: Determine answer type
Rules-based.
Who is the daughter of Bill Clinton married to?
(PERSON)
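A rules-based answer type step could look roughly like the sketch below. Only the "who" to PERSON case comes from the slide; the other rules, the UNKNOWN fallback and the function name answer_type are assumptions made for illustration.

```python
# Hypothetical rule table; only "who" -> PERSON is taken from the slide.
ANSWER_TYPE_RULES = {
    "who": "PERSON",
    "where": "PLACE",
    "when": "DATE",
}


def answer_type(question, default="UNKNOWN"):
    """Return the expected answer type based on the leading question word."""
    words = question.strip().lower().split()
    if not words:
        return default
    return ANSWER_TYPE_RULES.get(words[0], default)


print(answer_type("Who is the daughter of Bill Clinton married to?"))  # PERSON
```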
46. Query Pre-Processing
(Question Analysis)
Step 5: Determine Partial Ordered Dependency Structure
(PODS)
- Rules-based.
• Remove stop words.
• Merge words into entities.
• Reorder structure from core entity position.
Bill Clinton daughter married to
(INSTANCE): lower level of ambiguity, vagueness, synonymy
QUESTION FOCUS
ANSWER TYPE: Person
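The three PODS rules (remove stop words, merge words into entities, reorder from the core entity) can be sketched as below. This is a crude stand-in, not the rules used in the talk: the stop word list, the phrase list and the reordering heuristic (terms before the core entity are visited nearest first, then the terms after it) are all assumptions.

```python
STOP_WORDS = {"who", "is", "the", "of"}         # hypothetical minimal list
KNOWN_PHRASES = ["Bill Clinton", "married to"]  # stands in for entity/phrase recognition


def to_pods(question, core_entity):
    """Build a flat, reordered term list starting at the core entity."""
    text = question.rstrip("?")
    # Merge multi-word phrases into single tokens by protecting their spaces.
    for phrase in KNOWN_PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = [token.replace("_", " ") for token in text.split()]
    # Remove stop words.
    terms = [token for token in tokens if token.lower() not in STOP_WORDS]
    # Reorder: core entity first, then preceding terms (nearest first),
    # then the terms that follow it.
    pivot = terms.index(core_entity)
    return [core_entity] + terms[:pivot][::-1] + terms[pivot + 1:]


print(to_pods("Who is the daughter of Bill Clinton married to?", "Bill Clinton"))
```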
47. Question Analysis
Transform natural language queries into triple
patterns
“Who is the daughter of Bill Clinton married to?”
Bill Clinton daughter married to
PODS
(INSTANCE) (PREDICATE) (PREDICATE) Query Features
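One way to read the query features as triple patterns is to chain them: the instance is the first subject, and each predicate term introduces a fresh variable that becomes the next subject. The sketch below assumes that reading; the variable naming (?e1, ?e2) and the function name to_triple_patterns are illustrative.

```python
def to_triple_patterns(features):
    """Chain an INSTANCE followed by PREDICATE terms into (s, p, o) patterns."""
    (start, kind), *rest = features
    if kind != "INSTANCE":
        raise ValueError("this sketch expects the first feature to be an INSTANCE")
    patterns = []
    subject = start
    for i, (term, term_kind) in enumerate(rest, start=1):
        if term_kind != "PREDICATE":
            raise ValueError(f"unsupported feature type: {term_kind}")
        obj = f"?e{i}"  # fresh variable for the entity reached by this step
        patterns.append((subject, term, obj))
        subject = obj
    return patterns


features = [
    ("Bill Clinton", "INSTANCE"),
    ("daughter", "PREDICATE"),
    ("married to", "PREDICATE"),
]
for pattern in to_triple_patterns(features):
    print(pattern)
```

The predicate positions still hold query terms ("daughter", "married to"), not dataset properties; resolving them is the job of the later search steps.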
48. Query Plan
Map query features into a query plan.
A query plan contains a sequence of core operations.
(INSTANCE) (PREDICATE) (PREDICATE) Query Features
Query Plan
(1) INSTANCE SEARCH (Bill Clinton)
(2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
(3) e1 <- NAVIGATE (Bill Clinton, p1)
(4) p2 <- SEARCH PREDICATE (e1, married to)
(5) e2 <- NAVIGATE (e1, p2)
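The five plan steps can be run against a toy graph as a sanity check. This is a sketch under stated assumptions: the dictionary-based graph, the lookup table of relatedness scores and the function names are all hypothetical. The triples follow the talk's running example, the score for "married to" is invented, and pairing 0.004 with :religion is a guess.

```python
# Toy Linked Data graph: entity -> {predicate: object}.
GRAPH = {
    ":Bill_Clinton": {
        ":child": ":Chelsea_Clinton",
        ":religion": ":Baptists",
        ":almaMater": ":Yale_Law_School",
    },
    ":Chelsea_Clinton": {":spouse": ":Mark_Mezvinsky"},
}

# Hypothetical relatedness scores standing in for a distributional model.
RELATEDNESS = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,   # assumed pairing
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,    # invented value
}


def instance_search(label):
    """Resolve a label to an entity in the toy graph."""
    uri = ":" + label.replace(" ", "_")
    if uri not in GRAPH:
        raise KeyError(f"no instance found for {label!r}")
    return uri


def search_predicate(entity, term):
    """Pick the predicate of `entity` most related to the query term."""
    candidates = GRAPH[entity]
    return max(candidates, key=lambda p: RELATEDNESS.get((term, p), 0.0))


def navigate(entity, predicate):
    """Follow a predicate from an entity to the next pivot entity."""
    return GRAPH[entity][predicate]


def run_plan():
    e0 = instance_search("Bill Clinton")       # (1)
    p1 = search_predicate(e0, "daughter")      # (2)
    e1 = navigate(e0, p1)                      # (3)
    p2 = search_predicate(e1, "married to")    # (4)
    e2 = navigate(e1, p2)                      # (5)
    return e2


print(run_plan())  # :Mark_Mezvinsky
```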
49. Instance Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton
50. Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton (PIVOT ENTITY)
(ASSOCIATED TRIPLES)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
51. Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton
Which properties are semantically related to ‘daughter’?
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
sem_rel(daughter,child)=0.054
sem_rel(daughter,religion)=0.004
sem_rel(daughter,alma mater)=0.001
52. Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton
Which properties are semantically related to ‘daughter’?
(In the context of Bill Clinton)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
sem_rel(daughter,child)=0.054
sem_rel(daughter,religion)=0.004
sem_rel(daughter,alma mater)=0.001
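A semantic relatedness score of this kind can be approximated by comparing distributional vectors. The sketch below uses cosine similarity over toy co-occurrence counts. This is a stand-in, not necessarily the measure behind the scores shown in the talk, and the vectors are made up, so the resulting values differ from the slide's figures. Only the ranking idea carries over.

```python
import math

# Hypothetical toy context vectors (term -> co-occurrence counts).
VECTORS = {
    "daughter":   {"family": 4, "parent": 3, "born": 2},
    "child":      {"family": 3, "parent": 4, "born": 1, "school": 1},
    "religion":   {"church": 4, "faith": 3, "family": 1},
    "alma mater": {"school": 4, "university": 3, "degree": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_predicates(term, candidates):
    """Sort candidate predicate labels by relatedness to the query term."""
    scored = [(c, cosine(VECTORS[term], VECTORS[c])) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_predicates("daughter", ["alma mater", "religion", "child"])
for label, score in ranking:
    print(f"sem_rel(daughter,{label})={score:.3f}")
```

Restricting the candidates to the predicates attached to the pivot entity is what keeps the search space small enough for a noisy measure to work.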
53. Navigate
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton :child :Chelsea_Clinton
54. Navigate
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton :child :Chelsea_Clinton
:Chelsea_Clinton (PIVOT ENTITY)
55. Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton :child :Chelsea_Clinton
:Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky
57. Core Principles
Minimize the impact of Ambiguity, Vagueness, Synonymy with
semantic pivoting.
Semantic pivoting: Address the simplest matchings first
(heuristics).
Semantic Relatedness as a primitive semantic approximation
operation.
Distributional semantics as commonsense/semantic
knowledge.
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI 2014
63. Data variety +
Data
Full knowledge
Full data coverage
Full automation
64. Linked Data: Datasets are easier to integrate and to
consume (data model level). However, the semantic
barrier for consumption is still there
66. Distributional DBMS
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI 2014
68. Simplification of Information Extraction
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs, WoLE, 2012
69. Simplification of Information Extraction
General Electric Company, or GE , is an American multinational conglomerate
corporation incorporated in Schenectady , New York
73. Reasoning with Distributional Semantics
A Distributional Semantics Approach for Selective Reasoning on Commonsense Graph
Knowledge Bases, NLDB 2014
75. Take-away Message
Existing semantic technologies can already address major data
management problems
Multi-disciplinarity is one key (and NLI people are very good at it!):
- NLP + IR + Semantic Web + Databases
Schema-agnosticism is a central property/functionality/goal!
Distributional Semantics + semantics of structured data =
schema-agnosticism
Schema-agnosticism brings major impact for information systems.
We can tame the long tail of data variety!
The wave is just starting. Be a part of it!
76. Want to play with Distributional
Semantics?
http://easy-esa.org