Talking to your Data: Natural Language Interfaces for a schema-less world (Keynote at NLIWoD, ISWC 2014)

Talking to your Data:
Natural Language Interfaces for a
schema-less world
André Freitas
NLIWoD at ISWC 2014
Riva del Garda

Outline
- Shift in the Database Landscape
- On Schema-agnosticism & Semantics
- Distributional Semantics to the Rescue
- Case Study: Treo QA System
- Living in a Schema-less World
- Take-away Message

Shift in the Database Landscape

Big Data (Data Variety)
- Vision: a more complete data-based picture of the world for systems and users.

The Long Tail of Data Variety

Data variety +
Data
Programs
Full knowledge
Full data coverage
Full automation
Data generation

Very large and dynamic “schemas”
circa 2000: 10s-100s of attributes
circa 2014: 1,000s-1,000,000s of attributes

Semantic Heterogeneity
- Decentralized content generation.
- Multiple perspectives (conceptualizations) of reality.
- Ambiguity, vagueness, inconsistency.

Data variety +
Data
Programs
Full knowledge
Full data coverage
Full automation
Data consumption
Data generation

Databases for a Complex World
How do you query data at this scale?

Schema-agnosticism
Abstraction Layer
User

First-level independence (Relational Model)
“… it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and representation and organization of data on the other” (Codd, 1970)
Second-level independence (Schema-agnosticism)

On Schema-agnosticism & Semantics

Vocabulary Problem for Databases
Query: Who is the daughter of Bill Clinton married to?
Semantic Gap
Possible representations
Schema-agnostic query mechanisms
- Abstraction level differences
- Lexical variation
- Structural (compositional) differences
- Operational/functional differences

Robust Semantic Model
- Semantically intelligent behaviour is highly dependent on knowledge scale (commonsense, semantic)
Semantics = formal meaning representation model (lots of data) + inference model

1st Hard problem: Acquisition
- Not scalable!
Semantics = (lots of data) + inference model

2nd Hard problem: Consistency
- Not scalable!
Semantics = (lots of data) + inference model

Semantics for a Complex World
- “Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.”
- “If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements.”
Baroni et al. 2013
Formal World vs. Real World

Distributional Semantic Models
- Semantic model with low acquisition effort (automatically built from text)
  - Simplification of the representation
- Enables the construction of comprehensive commonsense/semantic KBs
- What is the cost?
  - Some level of noise (semantic best-effort)

Distributional Hypothesis
“Words occurring in similar (linguistic) contexts tend to be semantically similar”
- He filled the wampimuk with the substance, passed it around and we all drunk some

Distributional Semantic Models (DSMs)
“The dog barked in the park. The owner of the dog put him on the leash since he barked.”
contexts = nouns and verbs in the same sentence

“The dog barked in the park. The owner of the dog put him on the leash since he barked.”
contexts = nouns and verbs in the same sentence
bark, dog, park, leash
Context vector for dog:
bark : 2
park : 1
leash : 1
owner : 1
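The counting step on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the construction used by any particular DSM: the content words of each sentence are listed by hand (a real pipeline would obtain them from a POS tagger and a lemmatizer), and `context_vector` is a hypothetical helper name. Under a literal reading of "nouns and verbs in the same sentence", the verb "put" is also counted as a context of "dog", although the slide does not list it.

```python
from collections import Counter

# Hand-picked, lemmatized nouns and verbs for each sentence (an assumption;
# a real system would derive these with a POS tagger and a lemmatizer).
sentences = [
    ["dog", "bark", "park"],                   # "The dog barked in the park."
    ["owner", "dog", "put", "leash", "bark"],  # "The owner of the dog put him on the leash since he barked."
]

def context_vector(target, sentences):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for words in sentences:
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

dog = context_vector("dog", sentences)
for context, count in dog.most_common():
    print(f"{context} : {count}")
```

Each word in the corpus gets one such vector, and the set of all context words defines the dimensions of the vector space.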

Context
[Figure: vector space; labels: car, dog, bark, run, leash]

Semantic Similarity & Relatedness
Query: cat
[Figure: vector space; labels: dog, car, bark, run, leash]

Semantic Similarity & Relatedness
Query: cat
[Figure: vector space; labels: dog, cat, car, bark, run, leash; angle θ]
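A common way to turn the angle θ between two word vectors into a relatedness score is the cosine of that angle. The sketch below assumes sparse vectors stored as dictionaries; the counts for dog, cat and car are invented for illustration and do not come from the talk.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(context, 0) for context, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)

# Invented context counts over the dimensions bark, run, leash.
dog = {"bark": 4, "run": 2, "leash": 3}
cat = {"run": 1, "leash": 2}
car = {"run": 5}

# Rank the candidate words by their relatedness to the query word "cat".
candidates = {"dog": dog, "car": car}
ranked = sorted(candidates, key=lambda word: cosine(cat, candidates[word]), reverse=True)
print(ranked)
```

With these toy counts the query "cat" ends up closer to "dog" than to "car", because cat and dog share more of their context mass.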

DSMs as Commonsense Reasoning
[Figure: vector space; labels: car, dog, cat, bark, run, leash; angle θ; annotations: “Commonsense is here”, “Semantic Approximation is here”]

DSMs as Commonsense Reasoning
[Figure: vector space; labels: car, dog, cat, bark, run, leash, ...; angle θ]
vs.
Semantic best-effort

Case Study: Treo QA System

Approach Overview
[Figure: architecture diagram; labels: Query, Query Analysis, Query Features, Query Planner, Query Plan, Ƭ-Space, Structured Data, Large-scale unstructured data, Distributional semantics, Commonsense knowledge, Core semantic approximation & composition operations]

Approach Overview
[Figure: architecture diagram; labels: Query, Query Analysis, Query Features, Query Planner, Query Plan, Ƭ-Space, RDF Data, Wikipedia, Explicit Semantic Analysis (ESA), Commonsense knowledge, Core semantic approximation & composition operations]

Core Operations
[Figure: labels: Query, Search & Composition Operations]

Addressing the Vocabulary Problem for
Databases (with Distributional Semantics)
Gaelic: direction

More Complex Queries (Video)

Treo Answers Jeopardy Queries (Video)
http://bit.ly/1hWcch9

Relevance
- Test Collection: QALD 2011.
- DBpedia.
  - Dataset (DBpedia + YAGO links): 45,767 predicates, 9,434,677 instances, more than 200,000 classes

Query Pre-Processing (Question Analysis)
- Transform natural language queries into triple patterns.
“Who is the daughter of Bill Clinton married to?”

(Question Analysis)
- Step 1: POS Tagging
  - Who/WP
  - is/VBZ
  - the/DT
  - daughter/NN
  - of/IN
  - Bill/NNP
  - Clinton/NNP
  - married/VBN
  - to/TO
  - ?/.

(Question Analysis)
- Step 2: Core Entity Recognition
  - Rules-based: POS Tag + TF/IDF
Who is the daughter of Bill Clinton married to?
Bill Clinton (PROBABLY AN INSTANCE)

(Question Analysis)
- Step 3: Determine answer type
  - Rules-based.
Who is the daughter of Bill Clinton married to?
(PERSON)

(Question Analysis)
- Step 4: Dependency parsing
  - dep(married-8, Who-1)
  - auxpass(married-8, is-2)
  - det(daughter-4, the-3)
  - nsubjpass(married-8, daughter-4)
  - prep(daughter-4, of-5)
  - nn(Clinton-7, Bill-6)
  - pobj(of-5, Clinton-7)
  - root(ROOT-0, married-8)
  - xcomp(married-8, to-9)

(Question Analysis)
- Step 5: Determine Partial Ordered Dependency Structure (PODS)
  - Rules-based.
    • Remove stop words.
    • Merge words into entities.
    • Reorder structure from core entity position.
Bill Clinton (INSTANCE) daughter married to
ANSWER TYPE: Person
QUESTION FOCUS
Lower level of ambiguity, vagueness, synonymy
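The three PODS rules can be sketched over the dependency triples from Step 4. This is a toy reading of the rules, not the Treo implementation: merging via the `nn` and `xcomp` relations, walking the head chain from the core entity up to the root, and the names `build_pods` and `STOP_WORDS` are all assumptions made for illustration.

```python
# Tokens indexed as in the dependency parse of
# "Who is the daughter of Bill Clinton married to?"
tokens = {1: "Who", 2: "is", 3: "the", 4: "daughter", 5: "of",
          6: "Bill", 7: "Clinton", 8: "married", 9: "to"}

# (relation, head index, dependent index), copied from Step 4.
deps = [("dep", 8, 1), ("auxpass", 8, 2), ("det", 4, 3),
        ("nsubjpass", 8, 4), ("prep", 4, 5), ("nn", 7, 6),
        ("pobj", 5, 7), ("root", 0, 8), ("xcomp", 8, 9)]

STOP_WORDS = {"who", "is", "the", "of"}  # hypothetical stop word list

def build_pods(tokens, deps, core_idx):
    head_of = {dep: head for _, head, dep in deps}

    # Rule "merge words into entities": attach compound modifiers (nn) and,
    # as an assumption, the trailing "to" (xcomp) to their head word.
    phrase = {i: [i] for i in tokens}
    for rel, head, dep in deps:
        if rel in ("nn", "xcomp"):
            phrase[head].append(dep)

    # Rule "reorder from core entity position": follow heads up to ROOT (0),
    # skipping stop words along the way.
    pods = []
    node = core_idx
    while node != 0:
        if tokens[node].lower() not in STOP_WORDS:
            pods.append(" ".join(tokens[j] for j in sorted(phrase[node])))
        node = head_of[node]
    return pods

pods = build_pods(tokens, deps, core_idx=7)  # 7 = "Clinton", the core entity
print(pods)  # ['Bill Clinton', 'daughter', 'married to']
```

Starting from the core entity puts the least ambiguous term first, which is the order in which the later query plan resolves the terms.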

Question Analysis
- Transform natural language queries into triple patterns
“Who is the daughter of Bill Clinton married to?”
PODS
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)

Query Plan
- Map query features into a query plan.
- A query plan contains a sequence of core operations.
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)
Query Plan:
- (1) INSTANCE SEARCH (Bill Clinton)
- (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
- (3) e1 <- NAVIGATE (Bill Clinton, p1)
- (4) p2 <- SEARCH PREDICATE (e1, married to)
- (5) e2 <- NAVIGATE (e1, p2)
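The five-step plan can be traced over a toy graph. The sketch below is an illustration under stated assumptions rather than Treo's code: `GRAPH` holds only the triples shown in the slides, `SEM_REL` is a hand-filled lookup table standing in for a distributional relatedness measure (only 0.054 and 0.001 come from the slides, the other scores are invented), and the function names are hypothetical.

```python
# Toy Linked Data graph: subject -> list of (predicate, object).
GRAPH = {
    ":Bill_Clinton": [(":child", ":Chelsea_Clinton"),
                      (":religion", ":Baptists"),
                      (":almaMater", ":Yale_Law_School")],
    ":Chelsea_Clinton": [(":spouse", ":Mark_Mezvinsky")],
}

# Stand-in for a distributional relatedness measure.
# 0.054 and 0.001 are from the slides; the other values are invented.
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":almaMater"): 0.001,
    ("daughter", ":religion"): 0.002,
    ("married to", ":spouse"): 0.060,
}

def instance_search(label):
    """Resolve a query term to an instance by naive label matching."""
    uri = ":" + label.replace(" ", "_")
    return uri if uri in GRAPH else None

def search_predicate(entity, term):
    """Pick the predicate of `entity` most related to the query term."""
    predicates = [p for p, _ in GRAPH.get(entity, [])]
    return max(predicates, key=lambda p: SEM_REL.get((term, p), 0.0))

def navigate(entity, predicate):
    """Follow `predicate` from `entity` to the next pivot entity."""
    return next(o for p, o in GRAPH[entity] if p == predicate)

def run_plan(instance_label, predicate_terms):
    entity = instance_search(instance_label)           # step (1)
    for term in predicate_terms:
        predicate = search_predicate(entity, term)     # steps (2), (4)
        entity = navigate(entity, predicate)           # steps (3), (5)
    return entity

answer = run_plan("Bill Clinton", ["daughter", "married to"])
print(answer)  # :Mark_Mezvinsky
```

Each navigation step narrows the search space: the second predicate search only looks at the triples of the entity reached by the first one.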

Instance Search
[Figure: the query aligned with the Linked Data graph; matched instance: :Bill_Clinton]

Predicate Search
[Figure: the query aligned with the Linked Data graph]
:Bill_Clinton (PIVOT ENTITY)
(ASSOCIATED TRIPLES)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...

Predicate Search
Which properties are semantically related to ‘daughter’?
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
sem_rel(daughter,child)=0.054
sem_rel(daughter,alma mater)=0.001
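The predicate search step amounts to scoring each property of the pivot entity against the query term and keeping the best matches. In the minimal sketch below, the two scores for "child" and "alma mater" are the ones shown on the slide; the score for "religion", the threshold of 0.01 and the function name `rank_predicates` are assumptions made for illustration.

```python
def rank_predicates(term, predicates, sem_rel, threshold=0.01):
    """Return (predicate, score) pairs above the threshold, best first."""
    scored = [(p, sem_rel.get((term, p), 0.0)) for p in predicates]
    kept = [(p, score) for p, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Relatedness scores: the first two are from the slide, the third is invented.
SCORES = {
    ("daughter", "child"): 0.054,
    ("daughter", "alma mater"): 0.001,
    ("daughter", "religion"): 0.004,
}

ranking = rank_predicates("daughter", ["child", "religion", "alma mater"], SCORES)
print(ranking)  # [('child', 0.054)]
```

Filtering by a threshold rather than always taking the top candidate lets the search return nothing when no property of the pivot entity is related to the query term.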

Predicate Search
Which properties are semantically related to ‘daughter’? (In the context of Bill Clinton)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
sem_rel(daughter,alma mater)=0.001

Navigate
:Bill_Clinton :child :Chelsea_Clinton
:Chelsea_Clinton (PIVOT ENTITY)

Predicate Search
:Bill_Clinton :child :Chelsea_Clinton
:Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky

Core Principles
- Minimize the impact of ambiguity, vagueness and synonymy with semantic pivoting.
- Semantic pivoting: address the simplest matchings first (heuristics).
- Semantic relatedness as a primitive semantic approximation operation.
- Distributional semantics as commonsense/semantic knowledge.
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI 2014

Living in a Schema-less World

How do we build systems today?
Structure the domain

Generalize and encode some rules

Allow some constrained interaction
Query is here

Data variety +
Data
Full knowledge
Full data coverage
Full automation

Linked Data: Datasets are easier to integrate and to consume (data model level). However, the semantic barrier for consumption is still there.

Distributional DBMS
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI 2014

Simplification of Information Extraction
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs, WoLE, 2012

Simplification of Information Extraction
General Electric Company, or GE, is an American multinational conglomerate corporation incorporated in Schenectady, New York

Schema-agnostic programs
Towards An Approximative Ontology-Agnostic Approach for Logic Programs, FOIKS 2014

Reasoning with Distributional Semantics
A Distributional Semantics Approach for Selective Reasoning on Commonsense Graph Knowledge Bases, NLDB 2014

Take-away Message
- Existing semantic technologies can address major data management problems today.
- Multi-disciplinarity is one key (and NLI people are very good at it!):
  - NLP + IR + Semantic Web + Databases
- Schema-agnosticism is a central property/functionality/goal!
- Distributional semantics + semantics of structured data = schema-agnosticism
- Schema-agnosticism brings major impact for information systems.
- We can tame the long tail of data variety!
- The wave is just starting. Be a part of it!

Want to play with Distributional Semantics?
http://easy-esa.org
