Talking to your Data: Natural Language Interfaces for a schema-less world
André Freitas 
NLIWoD at ISWC 2014 
Riva del Garda
Outline
- Shift in the Database Landscape
- On Schema-agnosticism & Semantics
- Distributional Semantics to the Rescue
- Case Study: Treo QA System
- Living in a Schema-less World
- Take-away Message
Shift in the Database Landscape
Big Data (Data Variety)
- Vision: a more complete data-based picture of the world for systems and users.
The Long Tail of Data Variety
[Figure: the long tail of data variety. Axis label: Data variety (+). Labels: Data, Programs, Full knowledge, Full data coverage, Full automation. Annotation: Data generation.]
Very-large and dynamic “schemas”
circa 2000: 10s-100s attributes
circa 2014: 1,000s-1,000,000s attributes
Semantic Heterogeneity
- Decentralized content generation.
- Multiple perspectives (conceptualizations) of reality.
- Ambiguity, vagueness, inconsistency.
The Long Tail of Data Variety
[Figure: the long tail of data variety. Axis label: Data variety (+). Labels: Data, Programs, Full knowledge, Full data coverage, Full automation. Annotations: Data consumption, Data generation.]
Databases for a Complex World 
How do you query data at this scale? 
Schema-agnosticism
[Figure labels: User, Abstraction Layer]
First-level independency (Relational Model)
“… it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and representation and organization of data on the other” (Codd, 1970)
Second-level independency (Schema-agnosticism)
On Schema-agnosticism & Semantics
Vocabulary Problem for Databases
Query: Who is the daughter of Bill Clinton married to?
Semantic Gap
Possible representations
Schema-agnostic query mechanisms
- Abstraction level differences
- Lexical variation
- Structural (compositional) differences
- Operational/functional differences
Robust Semantic Model
- Semantically intelligent behaviour is highly dependent on knowledge scale (commonsense, semantic).
Semantics = Formal meaning representation model (lots of data) + inference model
- Not scalable!
- 1st hard problem: Acquisition
- 2nd hard problem: Consistency
Semantics for a Complex World
- “Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.”
- “If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements.”
(Baroni et al. 2013)
[Figure labels: Formal World, Real World]
Distributional Semantic Models
- Semantic model with low acquisition effort (automatically built from text)
- Simplification of the representation
- Enables the construction of comprehensive commonsense/semantic KBs
- What is the cost? Some level of noise (semantic best-effort)
Distributional Hypothesis
“Words occurring in similar (linguistic) contexts tend to be semantically similar”
- He filled the wampimuk with the substance, passed it around and we all drank some
Distributional Semantic Models (DSMs)
“The dog barked in the park. The owner of the dog put him on the leash since he barked.”
contexts = nouns and verbs in the same sentence
Context counts for “dog”:
bark : 2
park : 1
leash : 1
owner : 1
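The counting on this slide can be reproduced with a minimal sketch. It assumes the nouns and verbs of each sentence have already been extracted and lemmatized by hand (a real pipeline would use a POS tagger and a lemmatizer); the names `SENTENCES` and `cooccurrence_vectors` are illustrative and not part of Treo.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Nouns and verbs of the two example sentences, lemmatized by hand.
# Hard-coding them is an assumption of this sketch, not the talk's pipeline.
SENTENCES = [
    ["dog", "bark", "park"],
    ["owner", "dog", "put", "leash", "bark"],
]


def cooccurrence_vectors(sentences):
    """For every word, count how often each other word shares a sentence with it."""
    vectors = defaultdict(Counter)
    for words in sentences:
        # Every ordered pair of distinct positions in the same sentence.
        for target, context in permutations(words, 2):
            vectors[target][context] += 1
    return vectors


vectors = cooccurrence_vectors(SENTENCES)
print(dict(vectors["dog"]))
```

Under these assumptions the vector for "dog" contains the four counts listed on the slide, plus the verb "put" with count 1, which the slide leaves out.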
Distributional Semantic Models (DSMs)
[Figure: vector space. Labels: Context, car, dog, bark, run, leash]
Semantic Similarity & Relatedness
Query: cat
[Figure: vector space with angle θ. Labels: dog, cat, car, bark, run, leash]
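The angle θ in the figure is commonly turned into a relatedness score with the cosine measure. The sketch below assumes cosine similarity over three context dimensions (bark, run, leash); the vector values are invented for illustration and do not come from the talk.

```python
import math

# Toy context-count vectors over the dimensions (bark, run, leash).
# The numbers are invented for illustration only.
VECTORS = {
    "dog": (5.0, 4.0, 6.0),
    "cat": (1.0, 2.0, 2.0),
    "car": (0.0, 5.0, 0.0),
}


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_relatedness(query, vectors):
    """Rank all other words by cosine similarity to the query word, best first."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_relatedness("cat", VECTORS)
for word, score in ranking:
    theta = math.degrees(math.acos(min(1.0, score)))
    print(f"{word}: cos={score:.3f}, theta={theta:.1f} degrees")
```

With these toy values "dog" ranks above "car" for the query "cat", since the smaller angle corresponds to the higher relatedness score.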
DSMs as Commonsense Reasoning
[Figure: vector space with angle θ. Labels: car, dog, cat, bark, run, leash. Annotations: “Commonsense is here”, “Semantic Approximation is here”]
Semantic best-effort
Case Study: Treo QA System
Approach Overview
[Figure: system architecture. Components:]
Query → Query Analysis → Query Features
Query Planner
Query Plan
Ƭ-Space
Large-scale unstructured data (Wikipedia)
Structured data (RDF data)
Distributional semantics (Explicit Semantic Analysis, ESA)
Commonsense knowledge
Core semantic approximation & composition operations
Ƭ-Space
[Figure labels: e, p, r]
Core Operations
[Figure labels: Query, Search & Composition Operations]
Does it work? 
Addressing the Vocabulary Problem for Databases (with Distributional Semantics)
Gaelic: direction
Solution (Video)
More Complex Queries (Video)
Treo Answers Jeopardy Queries (Video): http://bit.ly/1hWcch9
Relevance
- Test collection: QALD 2011.
- Dataset: DBpedia + YAGO links (45,767 predicates, 9,434,677 instances, more than 200,000 classes).
Query Pre-Processing (Question Analysis)
- Transform natural language queries into triple patterns.
“Who is the daughter of Bill Clinton married to?”
Step 1: POS Tagging
- Who/WP
- is/VBZ
- the/DT
- daughter/NN
- of/IN
- Bill/NNP
- Clinton/NNP
- married/VBN
- to/TO
- ?/.
Step 2: Core Entity Recognition
- Rules-based: POS Tag + TF/IDF
Who is the daughter of Bill Clinton married to?
(Bill Clinton: PROBABLY AN INSTANCE)
Step 3: Determine answer type
- Rules-based.
Who is the daughter of Bill Clinton married to?
(PERSON)
Step 4: Dependency parsing
- dep(married-8, Who-1)
- auxpass(married-8, is-2)
- det(daughter-4, the-3)
- nsubjpass(married-8, daughter-4)
- prep(daughter-4, of-5)
- nn(Clinton-7, Bill-6)
- pobj(of-5, Clinton-7)
- root(ROOT-0, married-8)
- xcomp(married-8, to-9)
Step 5: Determine Partial Ordered Dependency Structure (PODS)
- Rules-based.
• Remove stop words.
• Merge words into entities.
• Reorder structure from core entity position.
Bill Clinton daughter married to
(INSTANCE)
ANSWER TYPE: Person
QUESTION FOCUS
Lower level of ambiguity, vagueness, synonymy
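The three PODS rules and the answer-type rule can be illustrated with a rough sketch that starts from the POS tags of Step 1. It is a simplification under stated assumptions, not Treo's implementation: it picks the first proper-noun run as the core entity (instead of POS tag + TF/IDF rules), reorders by token distance from the core entity (instead of using the dependency parse), and knows only the single answer-type rule shown on the slides. All names (`STOP_TAGS`, `merge_terms`, `build_pods`, `answer_type`) are hypothetical.

```python
# POS-tagged question, as produced in Step 1.
TAGGED = [
    ("Who", "WP"), ("is", "VBZ"), ("the", "DT"), ("daughter", "NN"),
    ("of", "IN"), ("Bill", "NNP"), ("Clinton", "NNP"),
    ("married", "VBN"), ("to", "TO"), ("?", "."),
]

# Tags treated as stop words in this sketch (an assumption).
STOP_TAGS = {"WP", "VBZ", "DT", "IN", "."}

# The only answer-type rule taken from the slides: "Who" asks for a person.
ANSWER_TYPE_RULES = {"who": "PERSON"}


def answer_type(tagged):
    """Look up the answer type from the first word of the question."""
    return ANSWER_TYPE_RULES.get(tagged[0][0].lower())


def merge_terms(tagged):
    """Merge proper-noun runs into one entity and attach 'to' to a preceding verb."""
    merged = []
    for word, tag in tagged:
        if merged and tag == "NNP" and merged[-1][1] == "NNP":
            merged[-1] = (merged[-1][0] + " " + word, "NNP")
        elif merged and tag == "TO" and merged[-1][1].startswith("VB"):
            merged[-1] = (merged[-1][0] + " " + word, merged[-1][1])
        else:
            merged.append((word, tag))
    return merged


def build_pods(tagged):
    """Remove stop words, then order the terms outward from the core entity."""
    terms = [(w, t) for w, t in merge_terms(tagged) if t not in STOP_TAGS]
    core = next(i for i, (_, t) in enumerate(terms) if t == "NNP")
    before = [w for w, _ in reversed(terms[:core])]  # closest term first
    after = [w for w, _ in terms[core + 1:]]
    return [terms[core][0]] + before + after


pods = build_pods(TAGGED)
print(pods, answer_type(TAGGED))
```

For the example question this yields the sequence Bill Clinton, daughter, married to, with answer type PERSON, which matches the PODS shown on the slide.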
Question Analysis
Transform natural language queries into triple patterns
“Who is the daughter of Bill Clinton married to?”
PODS: Bill Clinton daughter married to
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)
Query Plan
Map query features into a query plan.
A query plan contains a sequence of core operations.
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)
Query Plan:
- (1) INSTANCE SEARCH (Bill Clinton)
- (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
- (3) e1 <- NAVIGATE (Bill Clinton, p1)
- (4) p2 <- SEARCH PREDICATE (e1, married to)
- (5) e2 <- NAVIGATE (e1, p2)
Instance Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton
Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton (PIVOT ENTITY)
(ASSOCIATED TRIPLES)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton
Which properties are semantically related to ‘daughter’? (In the context of Bill Clinton)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
sem_rel(daughter,child)=0.054
sem_rel(daughter,religion)=0.004
sem_rel(daughter,alma mater)=0.001
Navigate
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky
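The walkthrough above (instance search, predicate search, navigate, repeated per query term) can be sketched over a toy triple store. Everything here is an illustrative assumption: the in-memory `TRIPLES` list holds only the triples shown on the slides, the `SEM_REL` lookup table stands in for a distributional relatedness model such as ESA, the score for "married to" and the `threshold` value are made up, and the function names are hypothetical rather than Treo's API.

```python
# Toy triple store containing only the triples shown on the slides.
TRIPLES = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Bill_Clinton", ":almaMater", ":Yale_Law_School"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

# Lookup table standing in for a distributional relatedness measure.
# The "daughter" scores follow the slides (0.004 is assumed to belong to
# :religion); the "married to" score is invented for this sketch.
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,
}


def instance_search(label):
    """Resolve an entity label to a URI by naive name matching."""
    uri = ":" + label.replace(" ", "_")
    known = {s for s, _, _ in TRIPLES} | {o for _, _, o in TRIPLES}
    return uri if uri in known else None


def search_predicate(entity, term, threshold=0.01):
    """Pick the predicate of `entity` most related to `term`, if above the threshold."""
    candidates = {p for s, p, _ in TRIPLES if s == entity}
    scored = [(SEM_REL.get((term, p), 0.0), p) for p in candidates]
    score, best = max(scored, default=(0.0, None))
    return best if score >= threshold else None


def navigate(entity, predicate):
    """Follow `predicate` from `entity` and return the reached entities."""
    return [o for s, p, o in TRIPLES if s == entity and p == predicate]


def run_plan(instance_label, predicate_terms):
    """Execute the plan: one instance search, then search/navigate per term."""
    pivots = [instance_search(instance_label)]
    for term in predicate_terms:
        next_pivots = []
        for pivot in pivots:
            predicate = search_predicate(pivot, term)
            if predicate is not None:
                next_pivots.extend(navigate(pivot, predicate))
        pivots = next_pivots
    return pivots


answer = run_plan("Bill Clinton", ["daughter", "married to"])
print(answer)
```

Each navigation step produces the pivot entity for the next predicate search, which is how the five-step query plan composes into a path through the graph.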
Results 
Core Principles
- Minimize the impact of ambiguity, vagueness, and synonymy with semantic pivoting.
- Semantic pivoting: address the simplest matchings first (heuristics).
- Semantic relatedness as a primitive semantic approximation operation.
- Distributional semantics as commonsense/semantic knowledge.
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI 2014
Living in a Schema-less World
How do we build systems today?
- Structure the domain
- Generalize and encode some rules
- Allow some constrained interaction
[Figure annotation: Query is here]
Siloed Systems 
[Figure: the long tail of data variety. Axis label: Data variety (+). Labels: Data, Full knowledge, Full data coverage, Full automation.]
Linked Data: Datasets are easier to integrate and to consume (data model level). However, the semantic barrier for consumption is still there.
Distributional DBMS
Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach, IUI 2014
Simplification of Information Extraction
A Semantic Best-Effort Approach for Extracting Structured Discourse Graphs, WoLE, 2012
General Electric Company, or GE, is an American multinational conglomerate corporation incorporated in Schenectady, New York
Schema-agnostic programs 
Towards An Approximative Ontology-Agnostic Approach for Logic Programs, FOIKS 2014
Reasoning with Distributional Semantics
A Distributional Semantics Approach for Selective Reasoning on Commonsense Graph Knowledge Bases, NLDB 2014
Take-away Message
- Existing semantic technologies can address major data management problems today.
- Multi-disciplinarity is one key (and NLI people are very good at it!): NLP + IR + Semantic Web + Databases
- Schema-agnosticism is a central property/functionality/goal!
- Distributional semantics + semantics of structured data = schema-agnosticism
- Schema-agnosticism brings major impact for information systems.
- We can tame the long tail of data variety!
- The wave is just starting. Be a part of it!
Want to play with Distributional Semantics?
http://easy-esa.org
Any Queries?

Talking to your Data: Natural Language Interfaces for a schema-less world (Keynote at NLIWoD, ISWC 2014)
