Natural Language Data
Management and Interfaces
Recent Development and Open Challenges
Davood Rafiei
University of Alberta
Yunyao Li
IBM Research - Almaden
Chicago
2017
“If we are to satisfy the needs of
casual users of data bases, we
must break through the barriers
that presently prevent these users
from freely employing their native
languages”
Ted Codd, 1974
Employing Native Languages
•As data for describing things and relationships
• Otherwise a huge volume of data will end up outside
databases
•As an interface to databases
• Otherwise we limit database use to professionals
Outline
•Natural Language Data Management
•Natural Language Interfaces for Databases
•Open Challenges and Opportunities
Natural Language Data
Management
Outline of Part I
•The ubiquity of natural language data
•A few areas of application
•Challenges
•Areas of progress
•Querying natural language text
•Transforming natural language text
•Integration
The Ubiquity of
Natural Language Data
Data Domains
•Corporate data
•Scientific literature
•News articles
•Wikipedia
Corporate Data
Merrill Lynch rule
“unstructured data
comprises the vast
majority of data found
in an organization.
Some estimates run as
high as 80%.”
Unstructured data
Scientific Literature
Impact of less invasive treatments including sclerotherapy with a new agent and
hemorrhoidopexy for prolapsing internal hemorrhoids.
Tokunaga Y, Sasaki H. (Int Surg. 2013)
Abstract
Abstract Conventional hemorrhoidectomy is applied for the treatment of prolapsing internal
hemorrhoids. Recently, less-invasive treatments such as sclerotherapy using aluminum
potassium sulphate/tannic acid (ALTA) and a procedure for prolapse and hemorrhoids (PPH)
have been introduced. We compared the results of sclerotherapy with ALTA and an improved
type of PPH03 with those of hemorrhoidectomy. Between January 2006 and March 2009, we
performed hemorrhoidectomy in 464 patients, ALTA in 940 patients, and PPH in 148 patients
with second- and third-degree internal hemorrhoids according to the Goligher's classification.
The volume of ALTA injected into a hemorrhoid was 7.3 ± 2.2 (mean ± SD) mL. The duration
of the operation was significantly shorter in ALTA (13 ± 2 minutes) than in hemorrhoidectomy
(43 ± 5 minutes) or PPH (32 ± 12 minutes). Postoperative pain, requiring intravenous pain
medications, occurred in 65 cases (14%) in hemorrhoidectomy, in 16 cases (1.7%) in ALTA,
and in 1 case (0.7%) in PPH. The disappearance rates of prolapse were 100% in
hemorrhoidectomy, 96% in ALTA, and 98.6% in PPH. ALTA can be performed on an
outpatient basis without any severe pain or complication, and PPH is a useful alternative
treatment with less pain. Less-invasive treatments are beneficial when performed with care to
avoid complications.
Highlighted in the abstract: treatment, number of patients it was tried on, duration
News Articles
April 25, 2017 12:48 pm
Loonie hits 14-month low as softwood lumber duties expected to impact jobs
By Ross Marowits The Canadian Press
MONTREAL – The loonie hit a 14-month low on Tuesday at 73.60 cents, the lowest
level since February 2016.
The U.S. Commerce Department levied countervailing duties ranging between 3.02
and 24.12 per cent on five large Canadian producers and 19.88 per cent for all other
firms effective May 1. The duties will be retroactive 90 days for J.D. Irving and
producers other than Canfor, West Fraser, Resolute Forest Products and Tolko.
Anti-dumping duties to be announced June 23 could raise the total to as much as 30
to 35 per cent.
Dias anticipates that 25,000 jobs will eventually be hit, including 10,000 direct jobs and 15,000 indirect
ones tied to the sector.
Event
Triggering event
Following events expected
Wikipedia
•42 million pages
•Only 2.4 million infobox triplets
•Lots of data not in infobox
Obama was hired in Chicago as director of the Developing Communities
Project, a church-based community organization originally comprising
eight Catholic parishes in Roseland, West Pullman, and Riverdale on
Chicago's South Side.
…
In 1991, Obama accepted a two-year position as Visiting Law and
Government Fellow at the University of Chicago Law School to work on
his first book.
…
From April to October 1992, Obama directed Illinois's Project Vote, a
voter registration campaign…
Community QA
•Services such as Yahoo Answers, Stack
Overflow, AnswerBag, …
•Data: question and answer pairs
•Want answers to new queries
Q: How to fix auto terminate mac terminal
Two StackOverflow pages returned by Google
- osx - How do I get a Mac “.command” file to automatically quit
after running a shell script?
- OSX - How to auto Close Terminal window after the “exit”
command executed.
Vision
Natural Language
Data Management
+
Structured Data
Queries Results
Challenges
Challenge – Lack of Schema
treatment                patientCnt  duration (min)  noOfPatients  disappearanceRate (%)
sclerotherapy with ALTA  940         13 ± 2          16            96
PPH03                    148         32 ± 12         1             98.6
hemorrhoidectomy         464         43 ± 5          65            100
•The scientific article shown earlier contains
structured data (as shown) but hard to query
due to the lack of schema
Challenge - Opacity of References
•Anaphora
• “Joe did not interrupt Sue because he was polite”
• “the lion bit the gazelle, because it had sharp teeth”
•Ambiguity of ids
• Does “john” in article A refer to the same “john” in
article B?
•Variations due to spatiotemporal differences
• “police chief” is ambiguous without a
spatiotemporal anchor
Challenge - Richness of Semantics
•Semantic relations
•crow ⊆ bird; bird ∩ nonbird= {};
bird ∪ nonbird=U
•Pragmatics
•The meaning depends on the context
•E.g. “Sherlock saw the man with binoculars”
•Textual entailment
•“every dog danced” ⟼ “every poodle moved”
Challenge - Correctness of Data
•Incorrect or sarcastic
•“Vladimir Putin is the president of the US”
•Correct at some point in time (but not now)
•“Barack Obama is the president of the US”
•Correct now
•“Donald Trump is the president of the US”
•Always correct
•“Barack Obama was born in Hawaii”
•“Earth rotates around the sun”
Natural Language Data
•Text
•Speech
Focus: natural language text
System Architecture
[Figure: system architecture. Components: text system with a text store, RDF store, and DBMS with structured data; Transform (information extraction, entity resolution, enrichment) and Integrate steps; knowledge base and domain schema; query support for SQL, SPARQL, and natural language text (rich) queries.]
Progress
•Entity resolution
•Information extraction
•Question answering
•Reasoning
Progress
•Support natural language text queries
(rich queries)
•Transform
•Integrate
Support Natural Language
Text Queries
Approaches
•Boolean queries
•Grammar-based schema and searches
•Text pattern queries
•Tree pattern queries
Boolean Queries
•TREC legal track 2006-2012
•Retrieve documents as evidence in civil
litigation
•Default search in Quicklaw and Westlaw
•E.g.
( (memory w/2 loss) OR amnesia OR Alzheimer! OR
dementia) AND (lawsuit! OR litig! OR case OR
(tort w/2 claim!) OR complaint OR allegation!)
from TREC 09
Legal track
memory /2 loss
memory /s loss
•Because of the many variants in natural language,
Boolean queries can become extremely complex
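The example query above can be evaluated with a few lines of code. The following is a minimal sketch, assuming that `w/2` means "within two token positions, in either order" and that a trailing `!` means prefix truncation; both readings are assumptions about the query syntax, and the function names are hypothetical.

```python
import re

def tokenize(text):
    # Lowercase and keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(token, term):
    # "litig!" is read as a prefix pattern (assumed truncation semantics).
    if term.endswith("!"):
        return token.startswith(term[:-1])
    return token == term

def positions(tokens, term):
    return [i for i, tok in enumerate(tokens) if term_matches(tok, term)]

def contains(tokens, term):
    return bool(positions(tokens, term))

def within(tokens, a, b, k):
    # Assumed meaning of "a w/k b": some occurrence of a and b at most k positions apart.
    return any(abs(i - j) <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def trec_example(text):
    t = tokenize(text)
    condition = (within(t, "memory", "loss", 2)
                 or any(contains(t, x) for x in ("amnesia", "alzheimer!", "dementia")))
    legal = (any(contains(t, x)
                 for x in ("lawsuit!", "litig!", "case", "complaint", "allegation!"))
             or within(t, "tort", "claim!", 2))
    return condition and legal
```

Even this small query needs two disjunctions with nine alternatives, which illustrates how quickly Boolean queries grow when they must enumerate language variants.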
Boolean Queries (Cont.)
•Not much use of the grammar
•Except ordering and term distance
•Research issues
•Optimization
• Selectivity estimation for boolean queries
[Chen et al., PODS 2000]
• String selectivity estimation [Jagadish et al.,
PODS 1999], [Chaudhuri et al., ICDE 2004]
•Query evaluation [Broder et al., CIKM 2003]
PAT Expressions
[Salminen & Tompa, Acta Linguistica Hungarica 1994]
•A set-at-a-time algebra for text
•Text normalization
•Delimiters mapped to blank, lowercasing, etc.
•Searches make less use of grammar
•Lexical: e.g. “joe”, “bo”..“jo”
•Position: e.g. [20], shift.2 “2010”..“2017”
• The last two characters of the matches
•Frequency: e.g. signif.2 “computer”
• Significant two terms that start with “computer” such as
“computer systems”
Mind your Grammar [Gonnet and Tompa, VLDB 1987]
•Schema expressed
as a grammar
•Studied in the context
of Oxford English
Dictionary
Word Pos_tag Pr_brit Pr_us Plurals …
Man-trap n
Grammar-based Data
•The grammar (when known) allows data
to be represented and retrieved
•Compared to relational data
•Grammar ~ table schema
•Parsed strings (p-strings) ~ table instance
Grammar-based Data
(another context)
•Data wrapped in text and html formatting
•Many ecommerce sites with back-end rel.
data
•Grammar often simple
•Schema finding ~ grammar induction
•Input: (a) html pages with wrapped data, (b)
sample/tagged tuples
•Output: a grammar (or a wrapper)
Grammar Induction
•Challenge: Regular grammars cannot be
learned from positive samples only [Gold,
Inf. Cont. 1967]
• Many web pages use grammars that are
identifiable in the limit (e.g. [Crescenzi & Mecca,
J. ACM 2004])
•With natural language text
• Context free production rules exist for good
subsets
• Not deterministic (multiple derivations per input)
• The rules are usually complex, less uniform, and
maybe ambiguous
Text Pattern Queries
•Text modeled as “a sequence of tokens”
•Data wrapped in text patterns
•<name> was born in <year>
•Also referred to as surface text patterns
[Ravichandran and Hovy, ACL 2002]
•Queries ~ text patterns
Google Search: “is a car manufacturer”
DeWild [Li & Rafiei, SIGIR 2006, CIKM 2009]
•Queries match short text (instead of a page)
•Result ranking
•To improve “precision at k”
•Query rewritings
DeWild Query: % is a car manufacturer
Rewriting Rules
•Hyponym patterns [Hearst, 1992]
• X such as Y
• X including Y
• Y and other X
•Morphological patterns
• X invents Y
• Y is invented by X
•Specific patterns
• X discovers Y
• X finds Y
• X stumbles upon Y
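The hyponym patterns above can be turned into a small extractor. This is a sketch only, assuming single-word X and Y; the pattern table and the function name `hyponym_pairs` are illustrative and are not DeWild's implementation.

```python
import re

# Each entry: (regex, (group index of X, group index of Y)),
# where X is the hypernym and Y the hyponym.
PATTERNS = [
    (re.compile(r"(\w+) such as (\w+)"), (1, 2)),    # X such as Y
    (re.compile(r"(\w+) including (\w+)"), (1, 2)),  # X including Y
    (re.compile(r"(\w+) and other (\w+)"), (2, 1)),  # Y and other X
]

def hyponym_pairs(sentence):
    """Return a set of (hyponym, hypernym) pairs found in the sentence."""
    pairs = set()
    for regex, (x, y) in PATTERNS:
        for m in regex.finditer(sentence):
            pairs.add((m.group(y), m.group(x)))
    return pairs
```

For example, `hyponym_pairs("countries such as Italy")` yields the pair `("Italy", "countries")`.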
Rewriting Rules in DeWild
# nopos
(.+),? such as (.+)
such (.+) as (.+)
(.+),? especially (.+)
(.+),? including (.+)
->
$1 such as $2 && noun(,$1)
such $1 as $2 && noun(,$1)
$1, especially $2 && noun(,$1)
$1, including $2 && noun(,$1)
$2, and other $1 && noun(,$1)
$2, or other $1 && noun(,$1)
$2, a $1 && noun($1,)
$2 is a $1 && noun($1,)
#pos
N<([^<>]+)>N,? V<(\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<is (\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<are (\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<was (\w+)>V by N<([^<>]+)>N
N<([^<>]+)>N V<were (\w+)>V by N<([^<>]+)>N
->
$3 $2 $1 && verb($2,,,)
$3 $2 $1 && verb(,$2,,)
$3 $2 $1 && verb(,,$2,)
$3 will $2 $1 && verb($2,,,)
$3 is going to $2 $1 && verb($2,,,)
$1 is $2 by $3 && verb(,,,$2)
$1 was $2 by $3 && verb(,,,$2)
$1 are $2 by $3 && verb(,,,$2)
noun(country, countries) verb(go, goes, went, gone)
Queries in DeWild
•Text patterns with some wild cards
•E.g
•% is the prime minister of Canada
•% invented the light bulb
•% invented %
•% is a summer *blockbuster*
Indexing for Text Pattern Queries
•Method 1: Inverted index
34,480,000 -> …, <2,1,[10]>, …
is -> <1,5,[4,16,35,58,89]>, …. <2,1,[9]>, …
population -> … <2,1,[8]> <3,1,[10]>, …
Canada -> … <2,1,[7]>, …
Query: Canada population is %
docId tf offset list
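A positional inverted index of this kind can answer a pattern such as "Canada population is %" by checking consecutive offsets. The sketch below assumes whitespace tokenization, 0-based offsets, and a single trailing wildcard; the function names are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    # term -> docId -> list of offsets (tf is the length of that list)
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, tok in enumerate(text.split()):
            index[tok][doc_id].append(offset)
    return index

def match_pattern(index, docs, query):
    """Return (docId, filler) pairs for a query of literals followed by one '%'."""
    *literals, wildcard = query.split()
    if wildcard != "%" or not literals:
        raise ValueError("this sketch supports only 'w1 ... wn %' queries")
    fillers = []
    for doc_id, offsets in index.get(literals[0], {}).items():
        toks = docs[doc_id].split()
        for start in offsets:
            # Join step: literal k must occur at offset start + k in the same doc.
            ok = all(start + k in index.get(term, {}).get(doc_id, [])
                     for k, term in enumerate(literals))
            end = start + len(literals)
            if ok and end < len(toks):
                fillers.append((doc_id, toks[end]))
    return fillers
```

The inner `all(...)` is the positional join over posting lists; its cost grows with the length of those lists, which is what the alternative index structures try to reduce.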
Indexing for Text Pattern Queries (Cont.)
•Method 2: Neighbor index
[Cafarella & Etzioni, WWW 2005]
34,480,000 -> …, <2,1,[(10,is,-)]>, …
is -> …. <2,1,[(9,population,34,480,000)]>, …
population -> … <2,1,[(8,Canada,is)]>, …
Canada -> … <2,1,[(7,though,population)]>, …
Problems: (1) long posting lists e.g. for “is”, “and”, …
(2) join costs |#(query terms) - 1| * |post_list(term_i)|
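In a neighbor index each posting carries the left and right neighbor of the term, so a wildcard next to a term can be read directly from one posting list. A minimal sketch, assuming whitespace tokenization and 0-based offsets; the names are hypothetical and the layout is simplified from the postings shown above.

```python
from collections import defaultdict

def build_neighbor_index(docs):
    # term -> list of (docId, offset, left neighbor, right neighbor)
    index = defaultdict(list)
    for doc_id, text in docs.items():
        toks = text.split()
        for i, tok in enumerate(toks):
            left = toks[i - 1] if i > 0 else None
            right = toks[i + 1] if i + 1 < len(toks) else None
            index[tok].append((doc_id, i, left, right))
    return index

def fill_after(index, left_term, term):
    """Answer '<left_term> <term> %' from the postings of `term` alone (no join)."""
    return [(doc_id, right)
            for doc_id, _, left, right in index.get(term, [])
            if left == left_term and right is not None]
```

Longer patterns still need joins across posting lists, and lists for frequent words such as "is" stay long, which are the two problems noted above.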
Indexing for Text Pattern Queries (Cont.)
•Method 3: Word Permuterm Index (WPI)
[Chubak & Rafiei, CIKM 2010]
•Based on Permuterm index [Garfield, JAIS 1976]
•Burrows-Wheeler transformation of text [Burrows
& Wheeler, 1994]
•Structures to maintain the alphabet and to
access ranks
•E.g. three sentences (lexicographically sorted)
T = $ Rome is a city $ Rome is the capital of Italy $ countries such
as Italy $ ~
•BW-transform
• Find all word-level rotations of T
• Sort rotations
• The vector of the last elements is BW-transform
Word-level Burrows-Wheeler
transformation
$ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~
$ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city
$ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy
$ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy
Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of
Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as
Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $
Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $
a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is
as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such
capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the
city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a
countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $
is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome
is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome
of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital
such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries
the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is
~ $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $
BW-transformation: the last word of each sorted rotation (rows 1 to 19, top to bottom)
Traversing L backwards
Prev(i) = Count[L[i]] + Rank_{L[i]}(L, i)
• Count[L[i]]: number of elements in L that are smaller than L[i]
• Rank_{L[i]}(L, i): occurrences of L[i] in the range L[1..i]
i   L
1   ~
2   city
3   Italy
4   Italy
5   of
6   as
7   $
8   $
9   is
10  such
11  the
12  a
13  $
14  Rome
15  Rome
16  capital
17  countries
18  is
19  $
T = $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~
Prev(8) = Count[$] + Rank_$(L, 8) = 0 + 2 = 2
• The second $ is preceded by city in T
Prev(10) = Count[such] + Rank_such(L, 10) = 16 + 1 = 17
• such is preceded by countries in T
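The word-level transform and the Prev function can be reproduced directly. This is a naive sketch that materializes and sorts all rotations and recomputes Count and Rank by scanning, assuming plain string comparison for the word order (so that $ sorts first and ~ last); a real index would use compact rank structures instead.

```python
def word_bwt(tokens):
    """Sort all word-level rotations and return the last word of each (the vector L)."""
    n = len(tokens)
    rotations = sorted(tokens[i:] + tokens[:i] for i in range(n))
    return [rotation[-1] for rotation in rotations]

def prev(L, i):
    """Prev(i) = Count[L[i]] + Rank_{L[i]}(L, i), with 1-based i as on the slide."""
    word = L[i - 1]
    count = sum(1 for w in L if w < word)  # elements of L smaller than L[i]
    rank = L[:i].count(word)               # occurrences of L[i] in L[1..i]
    return count + rank

T = ("$ Rome is a city $ Rome is the capital of Italy "
     "$ countries such as Italy $ ~").split()
L = word_bwt(T)
```

With this input, `prev(L, 8)` is 2 and `L[2 - 1]` is `city`, and `prev(L, 10)` is 17 and `L[17 - 1]` is `countries`, matching the two worked cases above.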
Tree Pattern Queries
•Text often modeled as a set of “ordered
node-labeled trees”
•Order usually corresponds to the order of the
words in a sentence
•Queries
•Navigational axes: XPath style queries
• E.g. find sentences that include `dog’ as a subject
•Boolean queries
• E.g. Find sentences that contain any of the words w1, w2 or
w3.
•Quantifiers and implications
•Subtree searches
Subtree Searches
What kind of animal is agouti? (TREC-2004 QA track)
Approaches
•Literature on general tree matching
•E.g. ATreeGrep [Shasha et al., PODS 2002]
•Often do not exploit properties of
Syntactically-Annotated Tree (SAT)
• E.g. distinct labels on nodes
•Querying SATs
•Work from the NLP community
• E.g. TGrep2, CorpusSearch, Lpath
•Scan-based, inefficient
•Indexing unique subtrees
Indexing Unique Subtrees
[Chubak & Rafiei, PVLDB 2012]
•Keys: unique subtrees of up to a certain size
•Posting lists: structural info. of keys
•Evaluation strategy: break queries into
subtrees, fetch lists and join
•Syntactically annotated trees
•Abundant frequent patterns → small number of
keys
•Small average branching factor → small number
of postings
Example Subtrees
[Figure: an example tree with node labels A, B, C, and D, followed by its unique subtrees of size 1, size 2, and size 3.]
Subtree Coding
•Filter-based
• Store only tid for each unique subtree in the
posting list
• No other structural information
•Subtree interval coding
• Store pre, post and order values in a pre-order
traversal (for containment rel.) and level (for
parent-child rel.)
•Root split coding
• Optimize the storage for subtree interval coding
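The containment and parent-child tests behind interval coding can be sketched with pre-order and post-order numbers. This simplified version stores pre, post, and level per node and omits the separate order value; the tree encoding `(label, [children])` and the function names are assumptions for illustration.

```python
def number_nodes(tree):
    """tree = (label, [children]); returns one row per node with pre, post, level."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        row = {"label": label, "pre": counters["pre"], "level": level}
        counters["pre"] += 1
        rows.append(row)
        for child in children:
            visit(child, level + 1)
        row["post"] = counters["post"]  # assigned once all descendants are done
        counters["post"] += 1

    visit(tree, 0)
    return rows

def is_ancestor(u, v):
    # u contains v iff u starts before v and finishes after v.
    return u["pre"] < v["pre"] and u["post"] > v["post"]

def is_parent(u, v):
    return is_ancestor(u, v) and v["level"] == u["level"] + 1
```

Because both tests use only stored numbers, posting lists can be joined structurally without revisiting the trees.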
Query Decomposition
[Figure: a query tree with nodes labeled A to F and a query cover, i.e. a set of indexed subtrees that together cover the query.]
•Want an optimal cover to reduce the join cost
•Guarantee an optimal cover for filter-based and
subtree interval coding
• For subtrees of size 6 or less
•Bound the number of joins in a root split cover
Transforming & Integrating
Natural Language Data
Transforming Natural Language Data
•Transformation to a meaning
representation (aka semantic parsing)
such as
•RDF triples
•Other forms of logical
predicates
Integrating Natural Language Data
•Tight integration
•Text is maintained by a relational system
•Loose integration
•Text is maintained by a text system
Transforming Natural
Language Data to a Meaning
Representation
Challenges
(with logical inference in general)
•Detecting that
•Crow is a bird,
•Bird is an animal
•Crows can fly but pigs cannot
•Attending an organization relates to
education
•A person has a mother and a father but can
have many children
•Many more
Progress
•Brachman & Levesque, Knowledge
representation & reasoning, 2000.
•RTE entailment challenge
•Since 2005
•Knowledge bases and resources
such as Freebase, Wordnet, Yago,
dbpedia, …
•Shallow semantic parsers
Mapping to DCS Trees [Tian et al., ACL 2014]
•Dependency-based compositional
semantics (DCS) trees [Liang et al., ACL 2011]
•Similar to (and generated from) dependency
parse trees
love
Mary dog
subj obj
F1 = love ∩ (Mary[subj] × W[obj])
F2 = animal ∩ π_obj(F1)
F3 = have ∩ (John[subj] × F2[obj])
Does John have an animal that Mary loves?
DCS tree node ~ table
Subtree ~ rel. algebra exp.
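The three expressions can be evaluated with ordinary set operations. The sketch below uses made-up toy tables (the names Rex, Tom, and Sue are hypothetical) and represents each binary relation as a set of (subj, obj) pairs, with W as the set of all entities.

```python
# Toy tables; every tuple here is invented for illustration.
love = {("Mary", "Rex"), ("Sue", "Tom")}   # (subj, obj)
have = {("John", "Rex"), ("John", "car")}  # (subj, obj)
animal = {"Rex", "Tom"}
W = {"Mary", "Sue", "John", "Rex", "Tom", "car"}

# F1 = love ∩ (Mary[subj] × W[obj]): things that Mary loves
F1 = love & {("Mary", w) for w in W}

# F2 = animal ∩ π_obj(F1): animals that Mary loves
F2 = animal & {obj for _, obj in F1}

# F3 = have ∩ (John[subj] × F2[obj]): such animals that John has
F3 = have & {("John", x) for x in F2}

# The question holds on this data iff F3 is non-empty.
answer = bool(F3)
```

On this toy data the answer is true, since John has Rex and Mary loves Rex.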
Logical Inference on DCS
•Some of the axioms
•(R ⊂ S & S ⊂ T) ⇒ R ⊂ T
•R ⊂ S ⇒ πA(R) ⊂ πA(S)
•W != ∅
•Inference ~ deriving new relations using
the tables and the axioms
•Performance on inference problems
•Comparable to systems in FraCaS and
Pascal RTE
Addressing Knowledge Shortage
•Treat DCS tree fragments as paraphrase
candidates
•Establish paraphrases based on
distributional similarity (as in [Lewis &
Steedman, TACL 2013] and others)
[Figure: two DCS tree fragments, rooted at “blame” and “cause”, over the words Debby, tropical storm, death, and loss of life, with edges labeled subj, obj, iobj, and mod.]
Semantic Parsing using Freebase
[Berant et al., EMNLP 2013]
•Transform questions to Freebase derivations
•Learn the mapping from a large collection of
question-answer pairs
Approach
•15 million triplets (text phrases) from
ClueWeb09 mapped to Freebase predicates
• Dates are normalized and text phrases are
lemmatized
• Unary predicates are extracted
• E.g. city(Chicago) from (Chicago, “is a city in”, Illinois)
• 6,299 such unary predicates
• Entity types are checked when there is ambiguity
• E.g. (BarackObama, 1961) is added to “born in” [person,date]
and not to “born in” [person,location]
• 55,081 typed binary predicates
Two-Step Mapping
•Alignment
•Map each phrase to a set of logical forms
•Bridging
•Establish a relation between multiple
predicates in a sentence
•E.g. Marriage.Spouse.TomCruise and 2006
will form Marriage.(Spouse.TomCruise ∩
startDate.2006)
The transformation helps to answer questions using Freebase
Storage and Querying of Triples
•RDF stores
•Native: Apache Jena TDB, Virtuoso,
Algebraix, 4store, GraphDB, …
•Relational-backed: Jena SDB, C-store, …
•Semantic reasoners
•Open source: Apache Jena, and many more
•A list at Manchester U.
• http://owl.cs.manchester.ac.uk/tools/list-of-reasoners/
Integrating
Natural Language Data
Challenges
•Structure in text
•Often not known in advance
•Sometimes subjective
•Optimization and plan generation
•Difficult with less stats, cost estimates and
join dependencies
•Interaction with other systems (e.g. IE,
NER)
•Adds another layer of abstraction
Integration Schemes
• Tight integration
• A Rel. Approach to Querying
Text
[Chu et al., VLDB 2007]
• Loose integration
• Join queries with external
text sources
[Chaudhuri et al., SIGMOD Record
1995]
• Optimizing SQL queries over
text databases
[Jain et al., ICDE 2008]
A Rel. Approach to Querying Text
[Chu et al., VLDB 2007]
•Each document is stored in a wide table
•Attributes are added as discovered
•Two tables
•Attribute catalog
•Records (one row per document)
•Attributes
•Two documents can have different attributes
•Multiple attributes in a doc can have the same
name
•Only non-null values are stored
Attribute Catalog
Records
Operators
•Extract
•Extract desired entities and relationships
•Integrate
•Suggest mappings between attributes
•Cluster
•Group documents into one or more clusters
Operator interaction
Integrate(address, sent-to) – extract(city,street,zipcode)
Loose Integration of Text
[Chaudhuri et al., SIGMOD Record 1995]
•Documents stored in a text system
•Relational view of documents
Relational
Database
System
Text
System
(mercury)
docid title author abstract …
Search, retrieve, join
Integration Techniques
•Tuple substitution
•Nested loop with the db tuple as the outer
relation
SELECT p.member, p.name, m.docid
FROM projects p, mercury m
WHERE p.sponsor='NSF' AND p.name in m.title
AND p.member in m.author
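Tuple substitution amounts to a nested loop that sends one text query per qualifying database tuple. In the sketch below the text system is replaced by a stand-in function over an in-memory list; the sample rows and the `text_search` helper are hypothetical.

```python
# Hypothetical contents of the relational table and of the text system.
projects = [
    {"name": "Text Search", "member": "Lee", "sponsor": "NSF"},
    {"name": "Query Planning", "member": "Kim", "sponsor": "DOE"},
]
mercury = [
    {"docid": 1, "title": "Advances in Text Search", "author": "Lee and Park"},
    {"docid": 2, "title": "Text Search at Scale", "author": "Chen"},
    {"docid": 3, "title": "Query Planning Basics", "author": "Kim"},
]

def text_search(title_phrase, author_phrase):
    """Stand-in for one query sent to the text system."""
    return [d["docid"] for d in mercury
            if title_phrase in d["title"] and author_phrase in d["author"]]

def tuple_substitution(projects, search):
    rows = []
    for p in projects:                                # outer relation: db tuples
        if p["sponsor"] != "NSF":                     # relational selection first
            continue
        for docid in search(p["name"], p["member"]):  # one text query per tuple
            rows.append((p["member"], p["name"], docid))
    return rows
```

The number of text queries equals the number of qualifying tuples, which is why the techniques that follow try to batch or prune them.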
Integration Techniques -- Cont.
•Semi-join
•Suppose the text system can take k terms
•For n members, send n/k queries of the form
(m1 OR m2 OR … OR mk) to the text system
•Probing
•Select a set of terms (how?) from project title
and check their mentions in the text system
•Keep a list of terms (or assignments) that
return empty
•Probing with tuple substitution
•Maintain a cache
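The semi-join step batches the n member names into disjunctive queries of at most k terms each. A minimal sketch, with a hypothetical function name and the query syntax taken from the form shown above:

```python
def semi_join_queries(members, k):
    """Split the member names into ceil(n / k) queries of the form (m1 OR ... OR mk)."""
    if k < 1:
        raise ValueError("the text system must accept at least one term")
    return ["(" + " OR ".join(members[i:i + k]) + ")"
            for i in range(0, len(members), k)]
```

For five members and k = 2 this produces three queries, the last holding a single term.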
SQL Queries over Text Databases
[Jain et al., ICDE 2008]
•Information Extraction (IE) modules over
text
•headquarter(company, location)
•ceoOf(company, ceo)
•Relational view of text
•A set of full outer joins over IE modules
•e.g. companies =headquarter ⋈ ceoOf ⋈ …
•SQL queries over relational views
•Want to improve upon “extract-then-query”
Problem
•Given a SQL query
•Find execution strategies that meet some
efficiency and quality constraints
•In terms of runtime, precision, recall, …
•On-the-fly IE from text
SELECT company, ceo, location
FROM companies
WHERE location='Chicago'
Retrieval Strategies
•scan
•Process all documents
•const
•Process documents that contain query
keywords
•promD
•Only process the promising documents for
each IE system (using IE specific keywords)
•promC
•AND the predicates of const and promD
•Example keyword queries
•const: chicago
•promD: headquarter OR (based AND shares)
•promC: chicago AND (headquarter OR (based AND shares))
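The four strategies differ only in which documents they pass on to the IE modules. The sketch below expresses each one as a keyword predicate over a toy collection; the tokenization, function names, and sample documents are assumptions for illustration.

```python
def words(doc):
    # Crude tokenization: lowercase, drop commas and periods, split on whitespace.
    return set(doc.lower().replace(",", " ").replace(".", " ").split())

def scan(doc):
    return True                                   # process all documents

def const(doc):
    return "chicago" in words(doc)                # query keyword only

def prom_d(doc):
    w = words(doc)                                # IE-specific keywords only
    return "headquarter" in w or ("based" in w and "shares" in w)

def prom_c(doc):
    return const(doc) and prom_d(doc)             # AND of const and promD

def retrieve(docs, strategy):
    return [d for d in docs if strategy(d)]
```

Moving from scan to promC shrinks the set of documents to process, trading possible loss of recall for lower extraction time.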
Selecting an Execution Plan
•Stats estimated for each strategy
•# of matching docs docs(E, promC, D)
•Retrieval time rTime(E, scan, D)
•Cost estimation
•Stratified sampling (with one stratum for PD
and another stratum for D-PD)
•For const use both strata
•For promC & promD use PD only
[Figure: the document set D and its subset PD, with the portions covered by scan, const, promD, and promC.]
Natural Language Interface to
Databases (NLIDB)
Anatomy of a NLIDB
[Figure: components of an NLIDB: Query Understanding, Query Translation, Data store, Feedback Generation, and Domain knowledge, some marked as optional; edges are labeled NLQ interpretation, interactions, and queries.]
Query Understanding
– Scope of Natural Language Support
Ad-hoc
NLQs
Controlled
NLQs
Grammar complexity
Vocabulary complexity
Ambiguity
Parser error
Query naturalness
Query Understanding – Stateless and Stateful
Stateless NLQs
NLQ → NLQ Engine → Databases
Each query must be
• Fully specified
• Processed independently
Stateful NLQs
NLQ → NLQ Engine (with query history) → Databases
Each query
• Can be partially specified
• Processed with regard to previous queries
Query Understanding - Parser Error
Handling
Parsers make mistakes.
• News: Accuracy of a dependency parser = ~90% [Andor et al., 2016]
• Questions: ~80% [Judge et al., 2006]
Different approaches:
• Ignore: do nothing
• Auto-correction: detect and correct certain parser mistakes
• Interactive correction: query reformulation, parse tree correction
Query Translation - Bridging the Semantic
Gaps
• Vocabulary gap
“Bill Clinton” vs. “William Jefferson Clinton”
“IBM” vs. “International Business Machines Corporation”
• Leaky abstraction
• Mismatch between abstraction (e.g. data schema/domain ontology) and
user assumptions
“top executives” vs “person with title CEO, CFO, CIO, etc.”
• Ambiguity in user queries
• Underspecified queries
“Watson movie” → “Watson” as actor/actress, e.g. Emma Watson
→ “Watson” as a movie character, e.g. Dr. Watson in the movie “Holmes and Watson”
…
Query Translation – Query Construction
• Approaches
• Machine learning
• Construct formal queries from NLQ interpretations with
deterministic algorithms
• Query
• Formal query languages (e.g. XQuery / SQL)
• Intermediate language independent of underlying data stores
• The same intermediate query for different data stores
Systems
• PRECISE
• NaLIX
• NLPQC
• FREyA
• NaLIR
• ML2SQL
• NL2CM
• ATHENA
PRECISE [Popescu et al., 2003,2004]
• Controlled NLQ based on Semantic Tractability
[Figure: PRECISE architecture. Components: dependency parsing, lexicon, matcher, semantic override, equivalence checker, query generator, feedback generation, and the RDBMS; edges are labeled NLQ interpretation, queries, and interactions.]
PRECISE [Popescu et al., 2003,2004]
• Semantic Tractability
Database element: relations, attributes, or values
Token: a set of word stems that matches a database element
Syntactic marker: a term from a fixed set of database-independent terms that
make no semantic contribution to the interpretation of the NLQ
Semantically tractable sentence: Given a set of database elements E, a
sentence S is considered semantically tractable when its complete tokenization
satisfies the following conditions:
• Every token matches a unique data element in E
• Every attribute token attaches to a unique value token
• Every relation token attaches to either an attribute token or a value token
PRECISE [Popescu et al., 2003,2004]
• Explicitly correct parsing errors:
• Preposition attachment
• Preposition ellipsis
What are flights from Boston to Chicago on Monday?
What/pronoun are/verb flights/noun from/prep Boston/noun to/prep Chicago/noun on/prep Monday/noun
[Figure: parse tree in which the NPs and PPs combine into an NP, a VP, and S.]
What are flights from Boston to Chicago Monday?
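What/pronoun are/verb flights/noun from/prep Boston/noun to/prep Chicago/noun Monday/noun
[Figure: parse tree for the sentence with the preposition before “Monday” omitted.]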
PRECISE [Popescu et al., 2003,2004]
• Mapping parse tree nodes based on lexicon built from database
PRECISE [Popescu et al., 2003,2004]
• Addressing ambiguities through lexicon + semantic tractability
• Maximum-flow solution
PRECISE [Popescu et al., 2003,2004]
• Addressing ambiguities through lexicon + semantic tractability + user input
What are the systems analyst jobs in Austin?
Interpretation 1 Job title: systems analyst
Interpretation 2 Area: systems
Job title: analyst
NLQ
PRECISE [Popescu et al., 2003,2004]
• 1-to-many translation from interpretations to SQL based on all
possible join-paths
NLQ
What are the HP jobs on Unix in a small town?
Interpretations
Job.Description ↔ What
Job.Company ↔ ‘HP’
Job.Platform ↔ ‘Unix’
City.Size ↔ ‘small’
DB Schema
Job(JobID, Description, Company, Platform)
City(CityID, Name, State, Size)
SELECT DISTINCT Job.Description
FROM Job, City
WHERE Job.Company = ‘HP’
AND Job.Platform = ‘Unix’
AND City.Size = ‘small’
AND Job.JobID = City.CityID
PRECISE [Popescu et al., 2003,2004]
• 1-to-many translation from interpretations to SQL based on all
possible join-paths
NLQ
What are the HP jobs on Unix in a small town?
Interpretations
Job.Description ↔ What
Job.Company ↔ ‘HP’
Job.Platform ↔ ‘Unix’
City.Size ↔ ‘small’
DB Schema
Job(JobID, Description, Company, Platform)
City(CityID, Name, State, Size)
WorkLocation(JobID, CityID)
PostLocation(JobID, CityID)
SELECT DISTINCT Job.Description
FROM Job, WorkLocation, City
WHERE Job.Company = ‘HP’
AND Job.Platform = ‘Unix’
AND City.Size = ‘small’
AND Job.JobID = WorkLocation.JobID
AND WorkLocation.CityID = City.CityID
SELECT DISTINCT Job.Description
FROM Job, PostLocation, City
WHERE Job.Company = ‘HP’
AND Job.Platform = ‘Unix’
AND City.Size = ‘small’
AND Job.JobID = PostLocation.JobID
AND PostLocation.CityID = City.CityID
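The 1-to-many translation can be sketched as one SQL string per join path between Job and City. The schema description below is a hand-written assumption that mirrors the slide (two link tables), and the function name is hypothetical; this is not PRECISE's query generator.

```python
# Hypothetical schema graph: each link table joins Job and City on the listed keys.
LINK_TABLES = {
    "WorkLocation": [("Job", "JobID"), ("City", "CityID")],
    "PostLocation": [("Job", "JobID"), ("City", "CityID")],
}

# One interpretation of "What are the HP jobs on Unix in a small town?"
CONDITIONS = [("Job.Company", "HP"), ("Job.Platform", "Unix"), ("City.Size", "small")]

def sql_for_join_paths(select, conditions, link_tables):
    """Emit one query per join path; the user or a ranker then picks among them."""
    queries = []
    for link, keys in link_tables.items():
        where = [f"{attr} = '{value}'" for attr, value in conditions]
        where += [f"{table}.{key} = {link}.{key}" for table, key in keys]
        tables = ", ".join(["Job", link, "City"])
        queries.append(f"SELECT DISTINCT {select}\nFROM {tables}\nWHERE "
                       + "\n  AND ".join(where))
    return queries

queries = sql_for_join_paths("Job.Description", CONDITIONS, LINK_TABLES)
```

Each generated query uses exactly one link table in both of its join predicates, so the work-location and post-location readings stay separate.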
NLPQC [Stratica et al., 2005]
[Figure: NLPQC architecture. Components: preprocessor, schema, rule templates, question parsing, link parser, semantic analysis, query translation, and the RDBMS; edges are labeled NLQ interpretation and queries.]
• Controlled NLQ based on predefined rule templates
• No query history
NLPQC [Stratica et al., 2005]
• Build mapping rules for table names and attributes
• Automatically generated using WordNet
• Curated by system administrator
Table name: resource
…
Synonyms: 3 senses of resource
Sense 1: resource
Sense 2: resource
Sense 3: resource, resourcefulness, imagination
Hypernyms: 3 senses of resource
…
Hyponyms: 3 senses of resource
…
…
accept/reject/add
Databases
NLPQC [Stratica et al., 2005]
• Mapping parse tree node to data schema and value based on mapping
rules
Who is the author of book Algorithms
Table name: resource Table name: resource resource.default_attribute
NLQ
NLPQC [Stratica et al., 2005]
• Mapping parse tree node to data schema and value based on pre-defined
mapping rules
• Mapping parse trees to SQL statements based on pre-defined rule templates
Who is the author of book Algorithms
Table name: resource Table name: resource resource.default_attribute
Rule
template
SELECT author.name FROM author, resource, resource_author
WHERE resource.title = “Algorithms”
AND resource_author.resource_id=resource.resource_id
AND resource_author.author_id=author.author_id
NLQ
NLPQC [Stratica et al., 2005]
• No explicit ambiguity handling → left to mapping rules and rule
templates
• No parsing error handling → assumes no parsing errors
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Controlled NLQ based on pre-defined controlled grammar
[Figure: NaLIX architecture. Components: dependency parser, classifier, validator, query translation, message generator, domain adapter, knowledge extractor, and XML DBs; resources: classification tables, controlled grammar, translation patterns, query history, and domain knowledge; edges are labeled NLQ, validated parse tree, schema-free XQuery, interactions, warnings, and errors.]
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Classify parse tree nodes into different types based on classification
tables
• Token: word/phrase that can be mapped into an XQuery component
• Constructs in FLWOR expressions
• Marker: word/phrase that cannot be mapped into an XQuery component
• Connecting tokens, modifying tokens, pronouns, stopwords
What are the state that share a watershed with
California
NLQ
What are [CMT]
state[NT]
the [MM]
that [MM]
Classified parse tree
share [CM]
watershed [NT]
a [MM]
with [CM]
California [VT]
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Expand scope of NLQ support via domain adaptation
What are the state that share a watershed with
California
NLQ
What are [CMT]
state[NT]
the [MM]
that [MM]
Classified parse tree
share [CM]
watershed [NT]
a [MM]
with [CM]
California [VT]
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
California [VT]
Updated classified parse tree with domain knowledge
where [MM]
river [NT] river [NT]
a [MM] of [CM]
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Validate classified parse tree + term expansion + insert implicit nodes
What are the state that share a watershed with
California
NLQ
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
California [VT]
Updated classified parse tree with domain knowledge
where [MM]
river [NT] river [NT]
a [MM] of [CM]
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
Updated classified parse tree post validation
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
Implicit
node
Term expansion to
bridge
terminology gap
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (1) Variable binding
What are the state that share a watershed with
California
NLQ
$v1
*
$v2
$v1
*
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
Updated classified parse tree post validation
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
$v3
$v4
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (2) Pattern Mapping
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
where $v2 = $v3
where $v4 = “CA”
XQuery fragments
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (3) Nesting and grouping
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
where $v2 = $v3
where $v4 = “CA”
XQuery fragments
No aggregation function/qualifier → no nesting/grouping
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (3) Nesting and grouping
Find all the states whose number of rivers is the same as the number of
rivers in California?
NLQ
$v1
*
$v2
$v1
*
What are [CMT]
state[NT]
the [MM]
each [MM]
is the same as[CM]
state[NT]
a [MM] of [CM]
CA [VT]
where [MM]
river [NT] river [NT]
a [MM] of [CM]
state[NT]
$v3
$v4
the number of [FT] ($cv1) the number of [FT] ($cv2)
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “CA”
XQuery fragments
Aggregation function → nesting and grouping based on $v2 and $v3
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Translation: (4) Construct full query
Find all the states whose number of rivers is the same as the number of
rivers in California?
NLQ
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “CA”
XQuery fragments
for $v1 in doc(“geo.xml”)//state,
$v4 in doc(“geo.xml”)//state
let $vars1 := {
for $v2 in doc(“geo.xml”)//river,
$v5 in doc(“geo.xml”)//state
where mqf($v2,$v5)
and $v5 = $v1
return $v2}
let $vars2 := {
for $v3 in doc(“geo.xml”)//river,
$v6 in doc(“geo.xml”)//state
where mqf($v3,$v6)
and $v6 = $v4
return $v3}
where count($vars1) = count($vars2)
and $v4 = “CA”
return $v1
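To make the intended semantics of the full query concrete, here is a small Python sketch that answers the same question over a toy document. The XML layout (`state` elements with a `name` child and nested `river` children) and the data are assumptions made for illustration; the `mqf` relatedness test of Schema-Free XQuery is approximated here by plain parent-child nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for geo.xml; the real schema may differ.
GEO_XML = """
<geo>
  <state><name>CA</name><river>Sacramento</river><river>Colorado</river></state>
  <state><name>AZ</name><river>Colorado</river><river>Gila</river></state>
  <state><name>NV</name><river>Colorado</river></state>
</geo>
"""


def states_with_same_river_count(xml_text, reference):
    """Return states whose river count equals that of the reference state."""
    root = ET.fromstring(xml_text)
    # Group rivers by the state they are nested in, then count each group.
    counts = {s.findtext("name"): len(s.findall("river")) for s in root.iter("state")}
    target = counts[reference]
    return [name for name, n in counts.items() if n == target]


result = states_with_same_river_count(GEO_XML, "CA")
print(result)
```

Like the XQuery on the slide, the sketch does not exclude the reference state itself from the answer.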
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Support partially specified follow-up queries
• Detect topic switch to refresh query context
How about with
Texas?
NLQ
How about [SM]
Validated parse tree
with [CM]
TX [VT]
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “CA”
Query context
for $v1 in doc//state
for $v2 in doc//river
for $v3 in doc//river
for $v4 in doc//state
for $cv1 = count($v2)
for $cv2 = count($v3)
where $cv1 = $cv2
where $v4 = “TX”
Updated query context
Substitution
marker
Updated value
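The follow-up handling above amounts to substituting one value in the stored query context. A minimal Python sketch, assuming the context is kept as a list of fragment strings (the `apply_follow_up` helper is hypothetical, not part of NaLIX):

```python
def apply_follow_up(context, old_value, new_value):
    """Return a copy of the query context with one quoted value replaced."""
    old, new = f'"{old_value}"', f'"{new_value}"'
    return [
        # Only 'where' fragments carry value constraints in this sketch.
        clause.replace(old, new) if clause.startswith("where") else clause
        for clause in context
    ]


context = [
    "for $v1 in doc//state",
    "for $v2 in doc//river",
    "for $v3 in doc//river",
    "for $v4 in doc//state",
    "for $cv1 = count($v2)",
    "for $cv2 = count($v3)",
    "where $cv1 = $cv2",
    'where $v4 = "CA"',
]
updated = apply_follow_up(context, "CA", "TX")
print(updated[-1])
```

The original context is left untouched, so it can be restored or refreshed when a topic switch is detected.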
NaLIX [Li et al., 2007a, 2007b, 2007c]
• Handle ambiguity
• Ambiguity in terms → user feedback
e.g. “California” can be the name of a state as well as a city
• Ambiguity in join path → leverage Schema-Free XQuery to find the optimal join path
e.g. there could be multiple ways for a river to be related to a state
• Error handling
• Parser errors are not handled explicitly
• Interactive UI encourages NLQ input that the system can understand
FREyA [Damljanovic et al., 2013,2014]
• Support ad-hoc NLQs, including ill-formed queries
• Direct ontology lookup + parse tree mapping → a certain level of robustness
[Figure: FREyA workflow. Components: syntactic parsing, syntactic mapping (mapping rules), ontology-based lookup, consolidation, feedback generation (dialogs), triple generation, query generation. Labelled data: NLQ, ontology, POCs, OCs, ambiguous OCs/POCs, SPARQL queries.]
FREyA [Damljanovic et al., 2013,2014]
• Parse tree mapping based on predefined heuristic rules → finds POCs (Potential Ontology Concepts)
• Direct ontology lookup → finds OCs (Ontology Concepts)
What is the highest point of the state bordering
Mississippi?
NLQ
POCs: the highest point; the state; Mississippi
OCs: geo:isHighestPointOf (PROPERTY), geo:State (CLASS), geo:border (PROPERTY), geo:mississippi (INSTANCE)
FREyA [Damljanovic et al., 2013,2014]
• Consolidate POCs and OCs
• If span(POC) ⊆ span(OC) → merge POC and OC
What is the highest point of the state bordering
Mississippi?
NLQ
POCs: the highest point; state; Mississippi
OCs: geo:isHighestPointOf (PROPERTY), geo:State (CLASS), geo:border (PROPERTY), geo:mississippi (INSTANCE)
FREyA
[Damljanovic et al., 2013,2014]
• Consolidate POCs and OCs
• If span(POC) ⊆ span(OC) → merge POC and OC
• Otherwise, provide suggestions and ask for user feedback
Return the population of California
NLQ
POCs: population
OCs: geo:california (INSTANCE)
Suggestions ranked by string similarity (Monge-Elkan + Soundex):
1. state population 2. state population density 3. has low point, …
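A rough Python sketch of similarity-based ranking of suggestions follows. It is an assumption-laden illustration, not FREyA's code: it uses a symmetric Monge-Elkan score with `difflib.SequenceMatcher` as the inner token similarity, and it leaves out the Soundex component mentioned on the slide.

```python
from difflib import SequenceMatcher


def token_sim(a, b):
    """Inner similarity between two tokens (an assumption of this sketch)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def monge_elkan(source, target):
    """Average, over source tokens, of the best match among target tokens."""
    s_tokens, t_tokens = source.split(), target.split()
    return sum(max(token_sim(s, t) for t in t_tokens) for s in s_tokens) / len(s_tokens)


def symmetric_monge_elkan(a, b):
    # Monge-Elkan is asymmetric; averaging both directions penalises
    # candidates that contain many unrelated extra tokens.
    return (monge_elkan(a, b) + monge_elkan(b, a)) / 2


def rank_suggestions(poc, candidates):
    return sorted(candidates, key=lambda c: symmetric_monge_elkan(poc, c), reverse=True)


ranked = rank_suggestions(
    "population",
    ["has low point", "state population density", "state population"],
)
print(ranked)
```

With these inputs the candidates come out in the same order as the suggestions listed on the slide.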
FREyA
[Damljanovic et al., 2013,2014]
• Triple Generation: (1) Insert joker class
OCs: geo:isHighestPointOf (PROPERTY), geo:State (CLASS), geo:border (PROPERTY), geo:mississippi (INSTANCE)
OCs with joker: ? (JOKER), geo:isHighestPointOf (PROPERTY1), geo:State (CLASS1), geo:border (PROPERTY2), geo:mississippi (INSTANCE)
FREyA
[Damljanovic et al., 2013,2014]
• Triple Generation: (2) Generate triples
OCs with joker: ? (JOKER), geo:isHighestPointOf (PROPERTY1), geo:State (CLASS1), geo:border (PROPERTY2), geo:mississippi (INSTANCE)
Triples:
? - geo:isHighestPointOf - geo:State;
geo:State - geo:borders - geo:mississippi (geo:State);
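The two triple-generation steps can be sketched as follows. This is a simplified illustration under the assumption that the consolidated concepts strictly alternate between entities (class, instance, joker) and properties; `insert_joker` and `generate_triples` are hypothetical helper names, not FREyA's API.

```python
def insert_joker(ocs):
    """Prepend a joker when the concept sequence starts with a property."""
    if ocs and ocs[0][1] == "PROPERTY":
        return [("?", "JOKER")] + list(ocs)
    return list(ocs)


def generate_triples(ocs):
    """Chain (subject, property, object) triples; each object is the next subject."""
    return [
        (ocs[i][0], ocs[i + 1][0], ocs[i + 2][0])
        for i in range(0, len(ocs) - 2, 2)
    ]


ocs = [
    ("geo:isHighestPointOf", "PROPERTY"),
    ("geo:State", "CLASS"),
    ("geo:borders", "PROPERTY"),
    ("geo:mississippi", "INSTANCE"),
]
triples = generate_triples(insert_joker(ocs))
for triple in triples:
    print(" - ".join(triple))
```

The joker stands for the unknown element the question asks about, which is why it ends up in the subject position of the first triple.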
FREyA
[Damljanovic et al., 2013,2014]
• Generate SPARQL query
? - geo:isHighestPointOf - geo:State;
geo:State - geo:borders - geo:mississippi (geo:State);
Triples
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix geo: <http://www.mooney.net/geo#>
select ?firstJoker ?p0 ?c1 ?p2 ?i3
where { { ?firstJoker ?p0 ?c1 .
filter (?p0=geo:isHighestPointOf) . }
?c1 rdf:type geo:State .
?c1 ?p2 ?i3 .
filter (?p2=geo:borders) .
?i3 rdf:type geo:State .
filter (?i3=geo:mississippi) . }
FREyA
[Damljanovic et al., 2013,2014]
• Determine return type
• Result of a SPARQL query is a graph
• Identify answer type to decide the result display
Show lakes in Minnesota.
NLQ
FREyA
[Damljanovic et al., 2013,2014]
• Handle ambiguities via user interactions
• Provide suggestions
• Leverage reinforcement learning to improve ranking of suggestions
• No parser error handling
NaLIR [Li and Jagadish, 2014]
• Controlled NLQ based on predefined grammar
• No query history
[Figure: NaLIR architecture. Components: dependency parser, parse tree node mapper, parse tree structure adjuster, query tree translator, interactive communicator, data index & schema graph, RDBMS. Labelled data: NLQ, candidate mappings, candidate query trees, user choices, query tree, interactions, queries.]
NaLIR [Li and Jagadish, 2014]
• Mapping parse tree node to data schema and value based on WUP similarity [Wu
and Palmer, 1994]
• Explicitly request user input on ambiguous mappings and interpretations
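The Wu-Palmer (WUP) measure scores two concepts by the depth of their lowest common ancestor in a taxonomy: 2 · depth(lcs) / (depth(a) + depth(b)). The Python sketch below computes it over a tiny hand-made taxonomy; the `PARENT` table is a made-up example, whereas a real system would use a lexical resource such as WordNet.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "publication": "entity",
    "paper": "publication",
    "article": "publication",
    "person": "entity",
    "author": "person",
}


def path_to_root(node):
    """Return the node followed by all its ancestors, root last."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    return len(path_to_root(node))  # the root has depth 1


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b that is also an ancestor of a is the deepest one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("paper", "article"), wup("paper", "author"))
```

A node mapper could keep the schema elements whose score exceeds a threshold and ask the user to choose when several candidates score similarly.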
NaLIR [Li and Jagadish, 2014]
• Automatically adjust parse tree structure into a valid parse tree
[Figure: example parse tree before and after structure adjustment. Nodes: ROOT, return, author, paper, more, Bob, VLDB, after, 2000.]
NaLIR [Li and Jagadish, 2014]
• Automatically adjust parse tree structure into a valid parse tree
• Further rewrite the parse tree into a semantically reasonable one
[Figure: the example parse tree at three stages: original, after structure adjustment, and after rewriting. In the rewritten tree, two “number of” nodes are inserted under “more”, each with its own author / paper / VLDB / after / 2000 subtree, one of them for Bob.]
NaLIR [Li and Jagadish, 2014]
• 1-1 translation from query tree to SQL
Learning NLQ → SQL [Palakurthi et al., 2015]
[Figure: system overview. Training phase: training data → attribute classifier (Conditional Random Fields) → trained model. Runtime: NLQ → Stanford Parser → attribute classifier → classified attributes → query translation (with the entity-relationship schema) → queries → RDBMS.]
• Ad-hoc NLQ queries with explicit attribute mentions
• Implicit restriction imposed by the capability of the system itself
Learning NLQ → SQL [Palakurthi et al., 2015]
• Explicit attributes: attributes mentioned explicitly in the NLQ
List all the grades of all the students in Mathematics
NLQ
Explicit attributes:
grade and student
Implicit attribute:
course_name
by the classifier
Learning NLQ → SQL
[Palakurthi et al., 2015]
• Learn to map explicit attributes in the NLQ to SQL clauses
Features computed over the training data
Type of feature: example features
Token-based: isSymbol
Grammatical: POS tags and grammatical relations
Contextual: tokens preceding or following the current token
Other: isAttribute; presence of other attributes; trigger words (e.g. “each”)
Learning NLQ → SQL
[Palakurthi et al., 2015]
• Learn to map explicit attributes in the NLQ to SQL clauses
Who are the professors teaching more than 2 courses?
NLQ
GROUP BY HAVING FROM
Learning NLQ → SQL
[Palakurthi et al., 2015]
• Construct full SQL queries
• Attribute-clause mapping
• Identify joins based on the ER diagram
• Add missing implicit attributes via Concept Identification [Srirampur et al., 2014]
Who are the professors teaching more than 2 courses?
NLQ
GROUP BY HAVING FROM
SELECT professor_name
FROM COURSES, TEACH, PROFESSOR
WHERE course_id = course_teach_id
AND prof_teach_id = prof_id
GROUP BY professor_name
HAVING COUNT(course_name) > 2
SQL
Identified based on the ER schema
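To check that the constructed SQL behaves as intended, the sketch below runs it against an in-memory SQLite database. The table layouts and rows are invented for the example; only the table names, column names and the query itself come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and toy data matching the names used in the query.
conn.executescript("""
CREATE TABLE PROFESSOR (prof_id INTEGER, professor_name TEXT);
CREATE TABLE COURSES (course_id INTEGER, course_name TEXT);
CREATE TABLE TEACH (prof_teach_id INTEGER, course_teach_id INTEGER);
INSERT INTO PROFESSOR VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO COURSES VALUES (10, 'Databases'), (11, 'Compilers'),
                           (12, 'Logic'), (13, 'Graphics');
INSERT INTO TEACH VALUES (1, 10), (1, 11), (1, 12), (2, 13);
""")

QUERY = """
SELECT professor_name
FROM COURSES, TEACH, PROFESSOR
WHERE course_id = course_teach_id
  AND prof_teach_id = prof_id
GROUP BY professor_name
HAVING COUNT(course_name) > 2
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Only the professor with three courses survives the HAVING filter, which is the behaviour the phrase "more than 2 courses" asks for.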
Learning NLQ → SQL
[Palakurthi et al., 2015]
• No parsing error handling
• No explicit ambiguity handling
What length is the Mississippi?
NLQ
Implicit attribute: State (wrongly identified)
NL2CM [Amsterdamer et al., 2015]
[Figure: NL2CM architecture. Components: Stanford Parser, IX detector (IX patterns, vocabularies), general query generator (ontology), query generator, query verification, feedback generation, OASIS crowd mining engine. Labelled data: NLQ, interactions, SPARQL triples, OASIS-QL triples, formal query.]
IX: Individual Expression
• Controlled NLQ based on predefined types (e.g. no “why”
questions)
• Query verification with feedback
• No query history
NL2CM [Amsterdamer et al., 2015]
• Map parse tree with Individual Expression (IX) patterns and
vocabularies
• Lexical individuality: Individual terms convey certain meaning
• Participant individuality: Participants or agents in the text that
are relative to the person addressed by the request
• Syntactic individuality: Certain syntactic constructs in a sentence.
What are the most interesting places near Forest Hotel, Buffalo that we should vi
NL2CM [Amsterdamer et al., 2015]
• Map parse tree with Individual Expression (IX) patterns and
vocabularies
• Lexical individuality: Individual terms convey certain meaning
• Participant individuality: Participants or agents in the text that
are relative to the person addressed by the request
• Syntactic individuality: Certain syntactic constructs in a sentence.
What are the most interesting places near Forest Hotel, Buffalo that we should vi
$x interesting [] visit $x
Opinion Lexicon
NL2CM [Amsterdamer et al., 2015]
• Map parse tree with Individual Expression (IX) patterns and
vocabularies
• Processing the general parts of the query with FREyA system
• Interact with user to resolve ambiguities
What are the most interesting places near Forest Hotel, Buffalo that we should vi
$x interesting [] visit $x
Opinion Lexicon
$x near Forest Hotel,_Buffalo,_NY
User interaction
$x instanceOf Place
NL2CM [Amsterdamer et al., 2015]
• No parsing error handling
• Return error for partially interpretable queries
• SPARQL + OASIS-QL triples → a complete OASIS-QL query
$x interesting
[] visit $x
$x near Forest Hotel,_Buffalo,_NY
$x instanceOf Place
SELECT VARIABLES
WHERE
{$x instanceOf Place.
$x near Forest_Hotel,_Buffalo,_NY}
SATISFYING
{$x hasLabel “interesting”}
ORDER BY DESC(SUPPORT)
LIMIT 5
AND
{ [ ] visit $x}
WITH SUPPORT THRESHOLD = 0.1
NL2CM [Amsterdamer et al., 2015]
• Handling ambiguity via user input
ATHENA [Saha et al., 2016]
• Permit ad-hoc queries
• No explicit constraints on NLQ
• Implicit limit on expressivity of NLQs by query expressivity limitation (e.g.
nested query with more than 1 level)
• No query history
[Figure: ATHENA architecture. Components: NLQ engine, query translation, translation index, domain ontology, databases. Labelled data: NLQ, OQL with NL explanations, SQL queries with NL explanations, top-ranked SQL query, user-selected SQL query.]
ATHENA [Saha et al., 2016]
• Annotate NLQ into evidences → no explicit parsing
• Handle ambiguity based on translation index and domain ontology
Translation index (key → entries)
“Alibaba”, “Alibaba Inc”, “Alibaba Incorporated”, “Alibaba Holding”, … → Company.name: Alibaba Inc; Company.name: Alibaba Holding Inc.; Company.name: Alibaba Capital Partners; …
“Investments”, “investment” → PersonalInvestment; InstitutionalInvestment; …
Sources: data values from the databases, metadata from the domain ontology
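A translation index of this kind can be pictured as a lookup from surface variants to candidate ontology elements or data values. The Python sketch below is a hypothetical simplification (plain case-insensitive dictionary lookup, with `build_translation_index` and `lookup` as made-up names); the actual system's index is not described at this level on the slides.

```python
from collections import defaultdict


def build_translation_index(entries):
    """Map each lower-cased key phrase to the list of elements it may denote."""
    index = defaultdict(list)
    for key, element in entries:
        index[key.lower()].append(element)
    return index


def lookup(index, phrase):
    """Return all candidate elements for a phrase (empty list if unknown)."""
    return index.get(phrase.lower(), [])


entries = [
    ("Alibaba", "Company.name: Alibaba Inc"),
    ("Alibaba", "Company.name: Alibaba Holding Inc."),
    ("Alibaba", "Company.name: Alibaba Capital Partners"),
    ("investments", "PersonalInvestment"),
    ("investments", "InstitutionalInvestment"),
]
index = build_translation_index(entries)
print(lookup(index, "alibaba"))
```

A phrase that maps to several entries, as "Alibaba" does here, is exactly the kind of ambiguity that the domain ontology and the ranking of interpretations then have to resolve.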
ATHENA [Saha et al., 2016]
Show me restricted stock investments in Alibaba since 2012 by year
Evidence
“restricted stock” (indexed value): Holding.type, Transaction.type, InstitutionalInvestment.type, …
“investments” (metadata): PersonalInvestment, InstitutionalInvestment, VCInvestment, …
“Alibaba” (indexed value): Company.name:Alibaba Inc., Company.name:Alibaba Holding Inc., …
“since 2012” (time range), “year” (metadata): Transaction.reported_year, Transaction.purchase_year, InstitutionalInvestment.reported_year, …
Interpretation trees
[Figure: two candidate interpretation trees linking the evidence. Concepts: Institutional investment (type, reported_year), Investment, Investee Company (name), Security. Relations: is-a, unionOf, investedIn (“in”), issuedBy.]
ATHENA [Saha et al., 2016]
• Ontology Query Language
• Intermediate language over domain ontologies
• Separate query semantics from underlying data stores
• Support common OLAP-style queries
ATHENA [Saha et al., 2016]
• 1-1 translation from interpretation tree to OQL
• 1-1 translation from OQL to SQL per relational schema
SELECT Sum(oInstitutionalInvestment.amount),
oInstitutionalInvestment.reported_year
FROM InstitutionalInvestment oInstitutionalInvestment,
InvesteeCompany oInvesteeCompany
WHERE oInstitutionalInvestment.type = ‘restricted_stock’,
oInstitutionalInvestment.reported_year >= ‘2012’,
oInstitutionalInvestment.reported_year <= Inf,
oInvesteeCompany.name = (‘Alibaba Holdings Ltd.’, ‘Alibaba Inc.’, ‘Alibaba Capital Partners’),
oInstitutionalInvestment→isa→InvestedIn→unionOf_Security→issuedBy = oInvesteeCompany
GROUP BY oInstitutionalInvestment.reported_year
OQL query → one SQL statement per database: SQL Statement 1 (Database 1), SQL Statement 2 (Database 2), SQL Statement 3 (Database 3), …
NLIDBs Summary
Systems | Scope of NLQ support (Controlled, Ad-hoc*) | Capability (Fixed, Self-improving) | State (Stateless, Stateful) | Parsing error handling (Auto-correction, Interactive-correction)
PRECISE
NLPQC
NaLIX
FREyA
NaLIR
NL2CM
ML2SQL
ATHENA N/A N/A
* Implicit limitation by system capability
NLIDBs Summary – Cont.
Systems | Ambiguity handling (Automatic, Interactive) | Query construction (Rule-based, Machine-learning) | Target language
PRECISE SQL
NLPQC SQL
NaLIX (Schema-free) XQuery
FREyA SPARQL
NaLIR SQL
NL2CM OASIS-QL
ML2SQL * SQL
ATHENA OQL
* Partially
Relationship to Semantic Parsing
[Figure: NLIDB pipeline (NLQ → query understanding, using domain knowledge → interpretations → query translation → queries → data store → query results) next to a semantic parser (NL sentence → ML model learned from training data → semantic parsing results).]
Semantic parsing can be used to build an NLIDB
Relationship to Question Answering
[Figure: NLIDB pipeline (NLQ → query understanding → interpretations → query translation → database queries → data store → query results) next to a question answering pipeline (NLQ → query understanding → interpretations → query translation → document search queries → document collection → top results); both use domain knowledge.]
NLIDB and question answering use similar techniques
Open Challenges and
Opportunities
Querying Natural Language Data -
Review
•Covered
•Boolean queries
•Grammar-based schema and searches
•Text pattern queries
•Tree pattern queries
• Developments beyond
•Keyword searches as input
•Documents as output
Querying Natural Language Data –
Challenges & Opportunities
• Grammar-based schemas
• Promising direction
• Challenges
• Queries w/o knowing the schema
• Many table schemas!
• Overlap and equivalence relationships
• Promising developments
• Paraphrasing relationships between text phrases, tree patterns,
DCS trees, etc.
• Development of resources (e.g. KBs) and shallow semantic
parsers to understand semantics
• Self-improving systems
Integrating & Transforming Natural
Language Data - Review
•Covered
•Transformations on text
•Loose and tight integration
•More work on
•Loose integration
•Optimizing query plans
Integrating & Transforming Natural Language
Data – Challenges & Opportunities
• Challenges
• Lack of schema, opacity of references, richness of
semantics and correctness of data
• Much to draw inspiration from
• Work on transforming text
• Size and scope of resources for understanding text
• Progress in shallow semantic parsing
• Other areas such as translation and speech recognition
• Opportunities
• Lots of demand for relevant tools
• Natural language text has more structure than plain text (as a
seq. of tokens)
• Strong ties to deductive databases
NLIDB: Ideal and Reality
Systems | Scope of NLQ support (Controlled, Ad-hoc) | Capability (Fixed, Self-improving) | State (Stateless, Stateful) | Parsing error handling (Auto-correction, Interactive-correction)
PRECISE *
NLPQC
NaLIX *
FREyA *
NaLIR *
NL2CM
ML2SQL *
ATHENA * N/A N/A
Ideal NLIDB
* Supported to a limited extent
NLIDB: Ideal and Reality – Cont.
Systems | Ambiguity handling (Automatic, Interactive) | Query construction (Rule-based, Machine-learning) | Target language
PRECISE * SQL
NLPQC SQL
NaLIX * (Schema-free) XQuery
FREyA SPARQL
NaLIR SQL
NL2CM OASIS-QL
ML2SQL * SQL
ATHENA * OQL
Ideal NLIDB Polystore language
* Supported to a limited extent
NLIDB: Open Challenges
Query
Understanding
Query
Translation Data store
Feedback
Generation
Domain
knowledge
NLQ Interpretation
interactions
queries
queries
• Effectively communicate
limitations to users
• Engage user at the right moment
• Multi-modal interaction
• Support ad-hoc NLQs with complex
semantics
• Better handle parser errors
• Automatically bridge terminology gaps
• Automatically identify and resolve
ambiguity
• Multilingual/crosslingual support
• Polystore
• Structured data + (un-/semi-)structured data
• Construct domain knowledge with
minimal development effort
• Construct complex queries
• Self-improving
• Personalization
• Conversational
Document
Collection
Transform
& integrate
Natural Language DM & Interfaces:
Opportunities
Database
Human
Computer
Interaction
Natural
Language
Processing
Machine
Learning
References
• [Agichtein and Gravano, 2003] Agichtein, E. and Gravano, L. (2003). Querying text databases for efficient
information extraction. In Proc. of the ICDE Conference, pages 113–124, Bangalore, India.
• [Agrawal et al., 2008] Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. (2008). Scalable ad-hoc entity
extraction from text collections. PVLDB, 1(1):945–957.
• [Amsterdamer et al., 2015] Amsterdamer, Y., Kukliansky, A., and Milo, T. (2015). A natural language interface for
querying general and individual knowledge. PVLDB, 8(12):1430–1441.
• [Andor et al., 2016] Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and
Collins, M. (2016). Globally normalized transition-based neural networks. CoRR, abs/1603.06042.
• [Berant et al., 2013] Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on freebase from
question-answer pairs. In Proc. of the EMNLP Conference, volume 2, page 6.
• [Bertino et al., 2012] Bertino, E., Ooi, B. C., Sacks-Davis, R., Tan, K.-L., Zobel, J., Shidlovsky, B., and
Andronico, D. (2012). Indexing techniques for advanced database systems, volume 8. Springer Science &
Business Media.
• [Broder et al., 2003] Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. (2003). Efficient query
evaluation using a two-level retrieval process. In Proc. of the CIKM Conf., pages 426–434. ACM.
• [Cafarella and Etzioni, 2005] Cafarella, M. J. and Etzioni, O. (2005). A search engine for natural language
applications. In Proc. of the WWW conference, pages 442–452. ACM.
• [Cafarella et al., 2007] Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. (2007). Structured querying of web text
data: A technical challenge. In Proc. of the CIDR Conference, pages 225–234, Asilomar, CA.
• [Cai et al., 2005]Cai, G., Wang, H., MacEachren, A. M., Tokensregex: Defining cascaded regular expressions
over tokens. Technical Report CSTR-2014-02, Department of Computer Science, Stanford University.
References – Cont.
• [Chaudhuri et al., 1995] Chaudhuri, S., Dayal, U., and Yan, T. W. (1995). Join queries with external text sources:
Execution and optimization techniques. In ACM SIGMOD Record, pages 410–422, San Jose, California.
• [Chaudhuri et al., 2004] Chaudhuri, S., Ganti, V., and Gravano, L. (2004). Selectivity estimation for string
predicates: Overcoming the underestimation problem. In Proc. of the ICDE Conf., pages 227–238. IEEE.
• [Chen et al., 2000] Chen, Z., Koudas, N., Korn, F., and Muthukrishnan, S. (2000). Selectivity estimation for
boolean queries. In Proc. of the PODS Conf., pages 216–225. ACM.
• [Chu et al., 2007] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. (2007a). A relational approach to
incrementally extracting and querying structure in unstructured data. In Proc. of the VLDB Conference.
• [Chubak and Rafiei, 2010] Chubak, P. and Rafiei, D. (2010). Index Structures for Efficiently Searching Natural
Language Text. In Proc. of the CIKM Conference.
• [Chubak and Rafiei, 2012] Chubak, P. and Rafiei, D. (2012). Efficient indexing and querying over syntactically
annotated trees. PVLDB, 5(11):1316–1327.
• [Codd, 1974] Codd, E. (1974). Seven steps to rendezvous with the casual user. In IFIP Working Conference
Data Base Management, pages 179–200.
• [Ferrucci, 2012] Ferrucci, D. A. (2012). Introduction to “This is Watson”. IBM Journal of Research and
Development, 56(3):1.
• [Gonnet and Tompa, 1987] Gonnet, G. H. and Tompa, F. W. (1987). Mind your grammar: a new approach to
modelling text. In Proc. of the VLDB Conference, pages 339–346, Brighton, England.
• [Gyssens et al., 1989] Gyssens, M., Paredaens, J., and Gucht, D. V. (1989). A grammar-based approach
towards unifying hierarchical data models (extended abstract). In Proc. of the SIGMOD Conference, pages
263–272, Portland, Oregon.
References – Cont.
• [Jagadish et al., 1999] Jagadish, H., Ng, R. T., and Srivastava, D. (1999). Substring selectivity estimation. In
Proc. of the PODS Conf., pages 249–260. ACM.
• [Jain et al., 2008] Jain, A., Doan, A., and Gravano, L. (2008). Optimizing SQL queries over text databases. In
Proc. of the ICDE Conference, pages 636–645, Cancun, Mexico.
• [Kaoudi and Manolescu, 2015] Kaoudi, Z. and Manolescu, I. (2015). Rdf in the clouds: a survey. The VLDB
Journal, 24(1):67–91.
• [Lewis and Steedman, 2013] Lewis, M. and Steedman, M. (2013). Combining distributional and logical
semantics. Transactions of the Association for Computational Linguistics, 1:179–192.
• [Li and Jagadish, 2014] Li, F. and Jagadish, H. V. (2014). Constructing an interactive natural language interface
for relational databases. PVLDB, 8(1):73–84.
• [Li et al., 2007] Li, Y., Yang, H., and Jagadish, H. V. (2007). Nalix: A generic natural language search
environment for XML data. ACM Trans. Database Systems, 32(4).
• [Liang et al., 2011] Liang, P., Jordan, M. I., and Klein, D. (2011). Learning dependency-based compositional
semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies-Volume 1, pages 590–599. Association for Computational Linguistics.
• [Lin and Pantel, 2001] Lin, D. and Pantel, P. (2001). Dirt - discovery of inference rules from text. In Proc. of the
KDD Conference, pages 323–328.
• [Rafiei and Li, 2009] Rafiei, D. and Li, H. (2009). Data extraction from the web using wild card queries. In Proc.
of the CIKM Conference, pages 1939–1942.
• [Ravichandran and Hovy, 2002] Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a
question answering system. In Proc. of the ACL Conference.
References – Cont.
• [Popescu et al., 2004] Popescu et al., A. (2004). Modern natural language interfaces to databases: Composing
statistical parsing with semantic tractability. In Proc. of the COLING Conference.
• [Saha et al., 2016] Saha, D., Floratou, A., Sankaranarayanan, K., Minhas, U. F., Mittal, A. R., and Özcan, F.
(2016). Athena: An ontology-driven system for natural language querying over relational data stores. PVLDB,
9(12):1209–1220.
• [Salminen and Tompa, 1994] Salminen, A. and Tompa, F. (1994). PAT expressions: an algebra for text search.
Acta Linguistica Hungarica, 41(1):277–306.
• [Stratica et al., 2005] Stratica, N., Kosseim, L., and Desai, B. C. (2005). Using semantic templates for a
natural language interface to the CINDI virtual library. Data and Knowledge Engineering, 55(1):4–19.
• [Suchanek and Preda, 2014] Suchanek, F. M. and Preda, N. (2014). Semantic culturomics. Proc. of the VLDB
Endowment, 7(12):1215–1218.
• [Tague et al., 1991] Tague, J., Salminen, A., and McClellan, C. (1991). A complete model for information
retrieval systems. In Proc. of the SIGIR Conference, pages 14–20, Chicago, Illinois.
• [Tian et al., 2014] Tian, R., Miyao, Y., and Matsuzaki, T. (2014). Logical inference on dependency-based
compositional semantics. In Proc. of the ACL Conference, pages 79–89.
• [Wu and Palmer, 1994] Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In ACL
• [Valenzuela-Escarcega et al., 2016] Valenzuela-Escarcega, M. A., Hahn-Powell, G., and Surdeanu, M. (2016).
Odin’s runes: A rule language for information extraction. In Proc. of the Language Resources and Evaluation
Conference (LREC).
• [Xu, 2014] Xu, W. (2014). Data-driven approaches for paraphrasing across language variations. PhD thesis,
New York University.
Relevant Tutorials
• Semantic parsing
• Percy Liang: “natural language understanding: foundations and state-
of-the-art”, ICML 2015.
• Information extraction
• Laura Chiticariu, Yunyao Li, Sriram Raghavan, Frederick Reiss:
“Enterprise information extraction: recent developments and open
challenges.” SIGMOD 2010
• Entity resolution
• Lise Getoor and Ashwin Machanavajjhala: “Entity Resolution for Big
Data” KDD 2013

More Related Content

What's hot

Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsPeter Mika
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and ChallengesJens Lehmann
 
Search the Internet like an Expert
Search the Internet like an ExpertSearch the Internet like an Expert
Search the Internet like an ExpertDaniyar Salikhov
 
Metadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentMetadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentDiane Hillmann
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked datavafopoulos
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Semantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parserSemantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parserSerdar Sönmez
 
From the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upFrom the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upDavide Palmisano
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the WebRoi Blanco
 
Ph.D_slide
Ph.D_slidePh.D_slide
Ph.D_slidePei Li
 

What's hot (13)

Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
 
Search the Internet like an Expert
Search the Internet like an ExpertSearch the Internet like an Expert
Search the Internet like an Expert
 
Metadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentMetadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data Environment
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Semantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parserSemantic web xml-rdf-dom parser
Semantic web xml-rdf-dom parser
 
From the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking upFrom the Semantic Web to the Web of Data: ten years of linking up
From the Semantic Web to the Web of Data: ten years of linking up
 
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
NISO/DCMI Webinar: Schema.org and Linked Data: Complementary Approaches to Pu...
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
 
Ph.D_slide
Ph.D_slidePh.D_slide
Ph.D_slide
 

Natural Language Data Management and Interfaces: Recent Development and Open Challenges

  • 1. Natural Language Data Management and Interfaces Recent Development and Open Challenges Davood Rafiei University of Alberta Yunyao Li IBM Research - Almaden Chicago 2017
  • 2. “If we are to satisfy the needs of casual users of data bases, we must break through the barriers that presently prevent these users from freely employing their native languages” Ted Codd, 1974
  • 3. Employing Native Languages •As data for describing things and relationships • Otherwise a huge volume of data will end up outside databases •As an interface to databases • Otherwise we limit database use to professionals
  • 4. Outline •Natural Language Data Management •Natural Language Interfaces for Databases •Open Challenges and Opportunities
  • 6. Outline of Part I •The ubiquity of natural language data •A few areas of application •Challenges •Areas of progress •Querying natural language text •Transforming natural language text •Integration
  • 7. The Ubiquity of Natural Language Data
  • 8. Data Domains •Corporate data •Scientific literature •News articles •Wikipedia
  • 9. Corporate Data Merrill Lynch rule: “unstructured data comprises the vast majority of data found in an organization. Some estimates run as high as 80%.”
  • 10. Scientific Literature Impact of less invasive treatments including sclerotherapy with a new agent and hemorrhoidopexy for prolapsing internal hemorrhoids. Tokunaga Y, Sasaki H. (Int Surg. 2013) Abstract: Conventional hemorrhoidectomy is applied for the treatment of prolapsing internal hemorrhoids. Recently, less-invasive treatments such as sclerotherapy using aluminum potassium sulphate/tannic acid (ALTA) and a procedure for prolapse and hemorrhoids (PPH) have been introduced. We compared the results of sclerotherapy with ALTA and an improved type of PPH03 with those of hemorrhoidectomy. Between January 2006 and March 2009, we performed hemorrhoidectomy in 464 patients, ALTA in 940 patients, and PPH in 148 patients with second- and third-degree internal hemorrhoids according to the Goligher's classification. The volume of ALTA injected into a hemorrhoid was 7.3 ± 2.2 (mean ± SD) mL. The duration of the operation was significantly shorter in ALTA (13 ± 2 minutes) than in hemorrhoidectomy (43 ± 5 minutes) or PPH (32 ± 12 minutes). Postoperative pain, requiring intravenous pain medications, occurred in 65 cases (14%) in hemorrhoidectomy, in 16 cases (1.7%) in ALTA, and in 1 case (0.7%) in PPH. The disappearance rates of prolapse were 100% in hemorrhoidectomy, 96% in ALTA, and 98.6% in PPH. ALTA can be performed on an outpatient basis without any severe pain or complication, and PPH is a useful alternative treatment with less pain. Less-invasive treatments are beneficial when performed with care to avoid complications. [Slide annotations mark the treatment, number of patients, and duration.]
  • 11. News Articles April 25, 2017 12:48 pm Loonie hits 14-month low as softwood lumber duties expected to impact jobs By Ross Marowits The Canadian Press MONTREAL – The loonie hit a 14-month low on Tuesday at 73.60 cents, the lowest level since February 2016. The U.S. Commerce Department levied countervailing duties ranging between 3.02 and 24.12 per cent on five large Canadian producers and 19.88 per cent for all other firms effective May 1. The duties will be retroactive 90 days for J.D. Irving and producers other than Canfor, West Fraser, Resolute Forest Products and Tolko. Anti-dumping duties to be announced June 23 could raise the total to as much as 30 to 35 per cent. Dias anticipates that 25,000 jobs will eventually be hit, including 10,000 direct jobs and 15,000 indirect ones tied to the sector. [Slide annotations: event, triggering event, following events expected.]
  • 12. Wikipedia •42 million pages •Only 2.4 million infobox triplets •Lots of data not in infobox Obama was hired in Chicago as director of the Developing Communities Project, a church-based community organization originally comprising eight Catholic parishes in Roseland, West Pullman, and Riverdale on Chicago's South Side. … In 1991, Obama accepted a two-year position as Visiting Law and Government Fellow at the University of Chicago Law School to work on his first book. … From April to October 1992, Obama directed Illinois's Project Vote, a voter registration campaign…
  • 13. Community QA •Services such as Yahoo Answers, Stack Overflow, AnswerBag, … •Data: question and answer pairs •Want answers to new queries Q: How to fix auto terminate mac terminal. Two Stack Overflow pages returned by Google: - osx - How do I get a Mac “.command” file to automatically quit after running a shell script? - OSX - How to auto Close Terminal window after the “exit” command executed.
  • 16. Challenge – Lack of Schema •The scientific article shown earlier contains structured data (tabulated below), but it is hard to query because the text carries no schema
      treatment                patientCnt  duration (min)  noOfPatients  disappearanceRate (%)
      sclerotherapy with ALTA  940         13 ± 2          16            96
      PPH03                    148         32 ± 12         1             98.6
      hemorrhoidectomy         464         43 ± 5          65            100
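To make the point about schema concrete, here is a minimal Python sketch of what becomes possible once the figures from the abstract on slide 10 are placed under a schema. The class name, the field names, and the helper functions are illustrative assumptions and do not come from the slides; the numbers are the ones reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreatmentResult:
    # Field names are assumptions chosen for this sketch.
    treatment: str
    patient_cnt: int
    duration_min: float        # mean operation time in minutes
    pain_cases: int            # patients needing intravenous pain medication
    disappearance_rate: float  # percent of cases where prolapse disappeared


# Values as reported in the abstract quoted on slide 10.
ROWS = [
    TreatmentResult("sclerotherapy with ALTA", 940, 13, 16, 96.0),
    TreatmentResult("PPH03", 148, 32, 1, 98.6),
    TreatmentResult("hemorrhoidectomy", 464, 43, 65, 100.0),
]


def pain_rate(row):
    """Postoperative pain cases as a percentage of treated patients."""
    return 100.0 * row.pain_cases / row.patient_cnt


def treatments_with_pain_rate_below(rows, threshold_pct):
    """A query that is trivial with a schema and hard against raw text."""
    return [r.treatment for r in rows if pain_rate(r) < threshold_pct]


print(treatments_with_pain_rate_below(ROWS, 5.0))
```

The same question asked of the raw abstract would require locating each number and working out which treatment and which measure it belongs to.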
  • 17. Challenge - Opacity of References •Anaphora • “Joe did not interrupt Sue because he was polite” • “the lion bit the gazelle, because it had sharp teeth” •Ambiguity of ids • Does “john” in article A refer to the same “john” in article B? •Variations due to spatiotemporal differences • “police chief” is ambiguous without a spatiotemporal anchor
  • 18. Challenge - Richness of Semantics •Semantic relations •crow ⊆ bird; bird ∩ nonbird = {}; bird ∪ nonbird = U •Pragmatics •The meaning depends on the context •E.g. “Sherlock saw the man with binoculars” •Textual entailment •“every dog danced” ⟼ “every poodle moved”
  • 19. Challenge - Correctness of Data •Incorrect or sarcastic •“Vladimir Putin is the president of the US” •Correct at some point in time (but not now) •“Barack Obama is the president of the US” •Correct now •“Donald Trump is the president of the US” •Always correct •“Barack Obama was born in Hawaii” •“Earth rotates around the sun”
  • 21. System Architecture [Diagram; component labels: Text, Text System, Text store, RDF store, Transform, Information Extraction, Integrate, Entity Resolution, Enrichment, Support Natural Language Text Queries, Rich Queries, SQL, SPARQL, Knowledge Base, Domain schema, Structured Data, DBMS]
  • 23. Progress •Support natural language text queries (rich queries) •Transform •Integrate
  • 25. Approaches •Boolean queries •Grammar-based schema and searches •Text pattern queries •Tree pattern queries
  • 26. Boolean Queries •TREC legal track 2006-2012 •Retrieve documents as evidence in civil litigation •Default search in Quicklaw and Westlaw •E.g. ((memory w/2 loss) OR amnesia OR Alzheimer! OR dementia) AND (lawsuit! OR litig! OR case OR (tort w/2 claim!) OR complaint OR allegation!) from TREC 09 Legal track •Proximity operator variants: memory /2 loss, memory /s loss •Because natural language has so many variants, Boolean queries can become extremely complex
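The example query can be hand-translated into a small Python sketch to show how proximity and truncation operators are evaluated over tokens. The semantics here are assumptions made for illustration: `w/k` is read as "within k tokens, in either order" and a trailing `!` as a suffix wildcard. The function names are hypothetical and this is not how Quicklaw or Westlaw implement search.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, term):
    """Token positions matching a term; a trailing '!' matches any suffix."""
    if term.endswith("!"):
        stem = term[:-1].lower()
        return [i for i, t in enumerate(tokens) if t.startswith(stem)]
    return [i for i, t in enumerate(tokens) if t == term.lower()]


def has(tokens, term):
    return bool(positions(tokens, term))


def within(tokens, a, b, k):
    """True if some match of a and some match of b are at most k tokens apart."""
    pa, pb = positions(tokens, a), positions(tokens, b)
    return any(0 < abs(i - j) <= k for i in pa for j in pb)


def matches(text):
    """Hand-translation of the TREC legal track query quoted on the slide."""
    t = tokenize(text)
    topic = (within(t, "memory", "loss", 2) or has(t, "amnesia")
             or has(t, "Alzheimer!") or has(t, "dementia"))
    legal = (has(t, "lawsuit!") or has(t, "litig!") or has(t, "case")
             or within(t, "tort", "claim!", 2) or has(t, "complaint")
             or has(t, "allegation!"))
    return topic and legal


print(matches("The complaint describes loss of memory after surgery."))
print(matches("Memory loss is common in old age."))
```

Even this one query needs ten alternatives to cover two concepts, which illustrates how quickly such queries grow.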
  • 27. Boolean Queries (Cont.) •Not much use of the grammar •Except ordering and term distance •Research issues •Optimization • Selectivity estimation for Boolean queries [Chen et al., PODS 2000] • String selectivity estimation [Jagadish et al., PODS 1999], [Chaudhuri et al., ICDE 2004] •Query evaluation [Broder et al., CIKM 2003]
  • 28. PAT Expressions [Salminen & Tompa, Acta Linguistica Hungarica 1994] •A set-at-a-time algebra for text •Text normalization •Delimiters mapped to blank, lowercasing, etc. •Searches make less use of grammar •Lexical: e.g. “joe”, “bo”..“jo” •Position: e.g. [20], shift.2 “2010”..“2017” • The last two characters of the matches •Frequency: e.g. signif.2 “computer” • The two most significant terms that start with “computer”, such as “computer systems”
  • 29. Mind your Grammar [Gonnet and Tompa, VLDB 1987] •Schema expressed as a grammar •Studied in the context of Oxford English Dictionary Word Pos_tag Pr_brit Pr_us Plurals … Man-trap n
  • 30. Grammar-based Data •The grammar (when known) allows data to be represented and retrieved •Compared to relational data •Grammar ~ table schema •Parsed strings (p-strings) ~ table instance
  • 31. Grammar-based Data (another context) •Data wrapped in text and html formatting •Many ecommerce sites with back-end rel. data •Grammar often simple •Schema finding ~ grammar induction •Input: (a) html pages with wrapped data, (b) sample/tagged tuples •Output: a grammar (or a wrapper)
  • 32. Grammar Induction •Challenge: Regular grammars cannot be learned from positive samples only [Gold, Inf. Cont. 1967] • Many web pages use grammars that are identifiable in the limit (e.g. [Crescenzi & Mecca, J. ACM 2004]) •With natural language text • Context free production rules exist for good subsets • Not deterministic (multiple derivations per input) • The rules are usually complex, less uniform, and maybe ambiguous
  • 33. Text Pattern Queries •Text modeled as “a sequence of tokens” •Data wrapped in text patterns •<name> was born in <year> •Also referred to as surface text patterns [Ravichandran and Hovy, ACL 2002] •Queries ~ text patterns
  • 34. Google Search: “is a car manufacturer”
  • 35. DeWild [Li & Rafiei, SIGIR 2006, CIKM 2009] •Query match short text (instead of a page) •Result ranking •To improve “precision at k” •Query rewritings DeWild Query: % is a car manufacturer
  • 36. Rewriting Rules •Hyponym patterns [Hearst, 1992] • X such as Y • X including Y • Y and other X •Morphological patterns • X invents Y • Y is invented by X •Specific patterns • X discovers Y • X finds Y • X stumbles upon Y
  • 37. Rewriting Rules in DeWild
      # nopos
      (.+),? such as (.+)
      such (.+) as (.+)
      (.+),? especially (.+)
      (.+),? including (.+)
      ->
      $1 such as $2 && noun(,$1)
      such $1 as $2 && noun(,$1)
      $1, especially $2 && noun(,$1)
      $1, including $2 && noun(,$1)
      $2, and other $1 && noun(,$1)
      $2, or other $1 && noun(,$1)
      $2, a $1 && noun($1,)
      $2 is a $1 && noun($1,)
      # pos
      N<([^<>]+)>N,? V<(\w+)>V by N<([^<>]+)>N
      N<([^<>]+)>N V<is (\w+)>V by N<([^<>]+)>N
      N<([^<>]+)>N V<are (\w+)>V by N<([^<>]+)>N
      N<([^<>]+)>N V<was (\w+)>V by N<([^<>]+)>N
      N<([^<>]+)>N V<were (\w+)>V by N<([^<>]+)>N
      ->
      $3 $2 $1 && verb($2,,,)
      $3 $2 $1 && verb(,$2,,)
      $3 $2 $1 && verb(,,$2,)
      $3 will $2 $1 && verb($2,,,)
      $3 is going to $2 $1 && verb($2,,,)
      $1 is $2 by $3 && verb(,,,$2)
      $1 was $2 by $3 && verb(,,,$2)
      $1 are $2 by $3 && verb(,,,$2)
    •Word forms, e.g. noun(country, countries), verb(go, goes, went, gone)
  • 38. Queries in DeWild •Text patterns with some wild cards •E.g •% is the prime minister of Canada •% invented the light bulb •% invented % •% is a summer *blockbuster*
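A minimal sketch of how a `%` wildcard pattern could be matched against short text. This is not DeWild's implementation: it assumes each `%` binds to one or more words, uses plain regular expressions, and ignores query rewriting and result ranking. The function names and sample sentences are hypothetical.

```python
import re

def pattern_to_regex(query):
    """Translate a text pattern into a regex; each % captures one or more words."""
    parts = [r"(\w+(?:\s+\w+)*)" if tok == "%" else re.escape(tok)
             for tok in query.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

def match_pattern(query, sentences):
    """Return the wildcard bindings for every sentence that matches the pattern."""
    regex = pattern_to_regex(query)
    bindings = []
    for sentence in sentences:
        found = regex.search(sentence)
        if found:
            bindings.append(found.groups())
    return bindings

sentences = [
    "Toyota is a car manufacturer based in Japan.",
    "Edison invented the light bulb.",
    "The weather is nice today.",
]
print(match_pattern("% is a car manufacturer", sentences))
print(match_pattern("% invented %", sentences))
```

Because the wildcard is greedy, a leading `%` absorbs every word before the fixed part of the pattern; a real system would need ranking to prefer short, precise bindings.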
  • 39. Indexing for Text Pattern Queries •Method 1: Inverted index, with postings of the form <docId, tf, [offset list]>
      34,480,000 -> …, <2,1,[10]>, …
      is -> <1,5,[4,16,35,58,89]>, …, <2,1,[9]>, …
      population -> …, <2,1,[8]>, <3,1,[10]>, …
      Canada -> …, <2,1,[7]>, …
    •Query: Canada population is %
  • 40. Indexing for Text Pattern Queries (Cont.) •Method 2: Neighbor index [Cafarella & Etzioni, WWW 2005]
      34,480,000 -> …, <2,1,[(10,is,-)]>, …
      is -> …, <2,1,[(9,population,34,480,000)]>, …
      population -> …, <2,1,[(8,Canada,is)]>, …
      Canada -> …, <2,1,[(7,though,population)]>, …
    •Problems: (1) long posting lists, e.g. for “is”, “and”, … (2) join costs: |#(query terms) - 1| * |post_list(term_i)|
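A minimal sketch of Method 1, assuming whitespace tokenization and a query whose last token is the wildcard. The documents, offsets and function names below are made up for illustration and do not reproduce the slide's posting lists.

```python
from collections import defaultdict

def build_index(docs):
    """Positional inverted index: term -> {doc_id: [offsets]}; tf = len(offsets)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index

def answer(query, index, docs):
    """Evaluate a pattern query that ends in %, e.g. 'Canada population is %'."""
    terms = query.split()
    if terms[-1] != "%":
        raise ValueError("this sketch only handles a trailing wildcard")
    fixed = terms[:-1]
    # Candidate documents must contain every fixed term.
    candidates = set(index.get(fixed[0], {}))
    for term in fixed[1:]:
        candidates &= set(index.get(term, {}))
    results = []
    for doc_id in sorted(candidates):
        tokens = docs[doc_id].split()
        for start in index[fixed[0]][doc_id]:
            # The fixed terms must appear at consecutive offsets.
            consecutive = all(start + k in index[t][doc_id]
                              for k, t in enumerate(fixed))
            slot = start + len(fixed)
            if consecutive and slot < len(tokens):
                results.append((doc_id, tokens[slot]))
    return results

docs = {
    1: "the capital is Ottawa and the official language is English",
    2: "although large in area Canada population is 34,480,000",
}
index = build_index(docs)
print(answer("Canada population is %", index, docs))
```

Every fixed term contributes a posting list to the join, which is why frequent words such as "is" make this approach expensive.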
  • 41. Indexing for Text Pattern Queries (Cont.) •Method 3: Word Permuterm Index (WPI) [Chubak & Rafiei, CIKM 2010] •Based on Permuterm index [Garfield, JASIS 1976] •Burrows-Wheeler transformation of text [Burrows & Wheeler, 1994] •Structures to maintain the alphabet and to access ranks
  • 42. Word-level Burrows-Wheeler transformation •E.g. three sentences (lexicographically sorted) T = $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ •BW-transform • Find all word-level rotations of T • Sort rotations • The vector of the last elements is the BW-transform
  • 43. BW-transformation: the sorted word-level rotations of T (the last word of each row forms the transform)
      1   $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~
      2   $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city
      3   $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy
      4   $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy
      5   Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of
      6   Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as
      7   Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $
      8   Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $
      9   a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is
      10  as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries such
      11  capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the
      12  city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome is a
      13  countries such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $
      14  is a city $ Rome is the capital of Italy $ countries such as Italy $ ~ $ Rome
      15  is the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome
      16  of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is the capital
      17  such as Italy $ ~ $ Rome is a city $ Rome is the capital of Italy $ countries
      18  the capital of Italy $ countries such as Italy $ ~ $ Rome is a city $ Rome is
      19  ~ $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $
  • 44. Traversing L backwards •Prev(i) = Count[L[i]] + Rank_{L[i]}(L, i) •Count[L[i]]: number of elements in L smaller than L[i] •Rank_{L[i]}(L, i): occurrences of L[i] in the range L[1..i] •T = $ Rome is a city $ Rome is the capital of Italy $ countries such as Italy $ ~
      i:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
      L:  ~ city Italy Italy of as $ $ is such the a $ Rome Rome capital countries is $
    •Prev(8) = Count($) + Rank_$(L,8) = 0 + 2 = 2 • The second $ is preceded by city in T •Prev(10) = Count(such) + Rank_such(L,10) = 16 + 1 = 17 • such is preceded by countries in T
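The word-level transform and the backward step can be checked with a short sketch. It assumes plain code-point ordering of words (so `$` sorts before capitalized words, then lowercase words, then `~`), which reproduces the order used in the example; the function names are hypothetical, and the counting is done naively rather than with the rank structures a real index would maintain.

```python
T = ("$ Rome is a city $ Rome is the capital of Italy "
     "$ countries such as Italy $ ~").split()

def word_bwt(tokens):
    """Sort all word-level rotations and keep the last word of each one."""
    n = len(tokens)
    rotations = sorted(tokens[i:] + tokens[:i] for i in range(n))
    return [rotation[-1] for rotation in rotations]

def prev(L, i):
    """Prev(i) = Count[L[i]] + Rank_{L[i]}(L, i), with 1-based i."""
    symbol = L[i - 1]
    count = sum(1 for word in L if word < symbol)   # elements smaller than L[i]
    rank = L[:i].count(symbol)                      # occurrences of L[i] in L[1..i]
    return count + rank

L = word_bwt(T)
print(" ".join(L))
print(prev(L, 8), L[prev(L, 8) - 1])     # 2 city
print(prev(L, 10), L[prev(L, 10) - 1])   # 17 countries
```

Following `prev` repeatedly walks the text backwards one word at a time, so a pattern can be extended to the left without storing the original text.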
  • 45. Tree Pattern Queries •Text often modeled as a set of “ordered node-labeled trees” •Order usually corresponds to the order of the words in a sentence •Queries •Navigational axes: XPath style queries • E.g. find sentences that include ‘dog’ as a subject •Boolean queries • E.g. find sentences that contain any of the words w1, w2 or w3 •Quantifiers and implications •Subtree searches
  • 46. Subtree Searches What kind of animal is agouti? (TREC-2004 QA track)
  • 47. Approaches •Literature on general tree matching •E.g. ATreeGrep [Shasha et al., PODS 2002] •Often do not exploit properties of Syntactically-Annotated Tree (SAT) • E.g. distinct labels on nodes •Querying SATs •Work from the NLP community • E.g. TGrep2, CorpusSearch, Lpath •Scan-based, inefficient •Indexing unique subtrees
  • 48. Indexing Unique Subtrees [Chubak & Rafiei, PVLDB 2012] •Keys: unique subtrees of up to a certain size •Posting lists: structural info. of keys •Evaluation strategy: break queries into subtrees, fetch lists and join •Syntactically annotated trees •Abundant frequent patterns → small number of keys •Small average branching factor → small number of postings
  • 49. Example Subtrees A B A BC D C A B C D size = 1 A B C D size = 2 B A A A size = 3 A B C A B A A C A A A B A A C A B D A B C A B C D A B C D A B A C A C A D A B C B A A A C A A A A A B A C DC
  • 50. Subtree Coding •Filter-based • Store only tid for each unique subtree in the posting list • No other structural information •Subtree interval coding • Store pre, post and order values in a pre-order traversal (for containment rel.) and level (for parent-child rel.) •Root split coding • Optimize the storage for subtree interval coding
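A minimal sketch of the idea behind interval coding, assuming the usual pre/post numbering for containment plus a level for parent-child tests. It illustrates the general technique only, not the exact encoding of the paper; the tuple-based tree format and function names are hypothetical.

```python
def number_tree(tree):
    """tree = (label, [children]); returns (label, pre, post, level) per node."""
    nodes = []
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        pre = counters["pre"]
        counters["pre"] += 1
        slot = len(nodes)
        nodes.append(None)            # reserve the slot in pre-order
        for child in children:
            visit(child, level + 1)
        post = counters["post"]
        counters["post"] += 1
        nodes[slot] = (label, pre, post, level)

    visit(tree, 0)
    return nodes

def is_ancestor(a, b):
    """a contains b iff a starts before b and finishes after b."""
    return a[1] < b[1] and a[2] > b[2]

def is_parent(a, b):
    """Parent-child additionally requires adjacent levels."""
    return is_ancestor(a, b) and b[3] == a[3] + 1

sentence = ("S", [("NP", [("dog", [])]), ("VP", [("barks", [])])])
S, NP, dog, VP, barks = number_tree(sentence)
print(S, NP, dog, VP, barks)
print(is_ancestor(S, dog), is_parent(S, dog), is_parent(NP, dog))
```

With these numbers in the posting lists, structural joins can be evaluated by integer comparisons instead of re-parsing the trees.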
  • 51. Query Decomposition B C D FE Query A B A Query Cover = { , } A D FE , C B C D •Want an optimal cover to reduce the join cost •Guarantee an optimal cover for filter-based and subtree interval coding • For subtrees of size 6 or less •Bound the number of joins in a root split cover
  • 52. System Architecture Transform RDF store Text store Text System Integrate Enrichment Entity Resolution Information Extraction SQL SPARQL Support Natural Language Text Queries Rich Queries Knowledge Base Domain schema Structured Data DBMS
  • 54. Transforming Natural Language Data •Transformation to a meaning representation (aka semantic parsing) such as •RDF triples •Other form of logical predicates
  • 55. Integrating Natural Language Data •Tight integration •Text is maintained by a relational system •Loose integration •Text is maintained by a text system
  • 56. Transforming Natural Language Data to a Meaning Representation
  • 57. Challenges (with logical inference in general) •Detecting that •A crow is a bird •A bird is an animal •Crows can fly but pigs cannot •Attending an organization relates to education •A person has a mother and a father but can have many children •Many more
  • 58. Progress •Brachman & Levesque, Knowledge representation & reasoning, 2000. •RTE entailment challenge •Since 2005 •Knowledge bases and resources such as Freebase, Wordnet, Yago, dbpedia, … •Shallow semantic parsers
  • 59. Mapping to DCS Trees [Tian et al., ACL 2014] •Dependency-based compositional semantics (DCS) trees [Liang et al., ACL 2011] •Similar to (and generated from) dependency parse trees (figure: DCS tree with nodes love, Mary, dog and edges subj, obj) •E.g. Does John have an animal that Mary loves? •F1 = love ∩ (Mary[subj] × W[obj]) •F2 = animal ∩ π_obj(F1) •F3 = have ∩ (John[subj] × F2[obj]) •DCS tree node ~ table •Subtree ~ rel. algebra exp.
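The three expressions F1 to F3 can be evaluated directly over small tables. The sketch below uses Python sets as relations; the facts (Rex, Tom) and the universe W are invented for illustration, and binary relations are stored as (subj, obj) pairs.

```python
from itertools import product

# Hypothetical toy tables; binary relations hold (subj, obj) pairs.
love = {("Mary", "Rex"), ("John", "Tom")}
have = {("John", "Rex"), ("Mary", "Tom")}
animal = {"Rex", "Tom"}
W = {"Mary", "John", "Rex", "Tom"}   # the universe of entities

# F1 = love ∩ (Mary[subj] × W[obj]): everything Mary loves
F1 = love & set(product({"Mary"}, W))
# F2 = animal ∩ π_obj(F1): the animals among the things Mary loves
F2 = animal & {obj for (_subj, obj) in F1}
# F3 = have ∩ (John[subj] × F2[obj]): John's possessions drawn from F2
F3 = have & set(product({"John"}, F2))

# "Does John have an animal that Mary loves?" holds iff F3 is non-empty.
print(F1, F2, F3, bool(F3))
```

Each DCS tree node corresponds to one table here, and each subtree to one of the set expressions.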
  • 60. Logical Inference on DCS •Some of the axioms •(R ⊂ S & S ⊂ T) ⇒ R ⊂ T •R ⊂ S ⇒ πA(R) ⊂ πA(S) •W != ∅ •Inference ~ deriving new relations using the tables and the axioms •Performance on inference problems •Comparable to systems in FraCaS and Pascal RTE
  • 61. Addressing Knowledge Shortage •Treat DCS tree fragments as paraphrase candidates •Establish paraphrases based on distributional similarity (as in [Lewis & Steedman, TACL 2013] and others) blame cause Debby Debbydeath storm tropical storm tropical loss life obj iobj objsubj mod mod mod
  • 62. Semantic Parsing using Freebase [Berant et al., EMNLP 2013] •Transform questions to Freebase derivations •Learn the mapping from a large collection of question-answer pairs
  • 63. Approach •15 million triplets (text phrases) from ClueWeb09 mapped to Freebase predicates • Dates are normalized and text phrases are lemmatized • Unary predicates are extracted • E.g. city(Chicago) from (Chicago, “is a city in”, Illinois) • 6,299 such unary predicates • Entity types are checked when there is ambiguity • E.g. (BarackObama, 1961) is added to “born in” [person,date] and not to “born in” [person,location] • 55,081 typed binary predicates
  • 64. Two Steps Mapping •Alignment •Map each phrase to a set of logical forms •Bridging •Establish a relation between multiple predicates in a sentence •E.g. Marriage.Spouse.TomCruise and 2006 will form Marriage.(Spouse.TomCruise ∩ startDate.2006) The transformation helps to answer questions using Freebase
  • 65. Storage and Querying of Triples •RDF stores •Native: Apache Jena TDB, Virtuoso, Algebraix, 4store, GraphDB, … •Relational-backed: Jena SDB, C-store, … •Semantic reasoners •Open source: Apache Jena, and many more •A list at Manchester U. • http://owl.cs.manchester.ac.uk/tools/list-of-reasoners/
  • 67. Challenges •Structure in text •Often not known in advance •Sometimes subjective •Optimization and plan generation •Difficult with less stats, cost estimates and join dependencies •Interaction with other systems (e.g. IE, NER) •Adds another layer of abstraction
  • 68. Integration Schemes • Tight integration • A Rel. Approach to Querying Text [Chu et al., VLDB 2007] • Loose integration • Join queries with external text sources [Chaudhuri et al., SIGMOD Record 1995] • Optimizing SQL queries over text databases [Jain et al., ICDE 2008]
  • 69. A Rel. Approach to Querying Text [Chu et al., VLDB 2007] •Each document is stored in a wide table •Attributes are added as discovered •Two tables •Attribute catalog •Records (one row per document) •Attributes •Two documents can have different attributes •Multiple attributes in a doc can have the same name •Only non-null values are stored
  • 71. Operators •Extract •Extract desired entities and relationships •Integrate •Suggest mappings between attributes •Cluster •Group documents into one or more clusters Operator interaction Integrate(address, sent-to) – extract(city,street,zipcode)
  • 72. Loose Integration of Text [Chaudhuri et al., SIGMOD Record 1995] •Documents stored in a text system •Relational view of documents Relational Database System Text System (mercury) docid title author abstract … Search, retrieve, join
  • 73. Integration Techniques •Tuple substitution •Nested loop with the db tuple as the outer relation SELECT p.member, p.name, m.docid FROM projects p, mercury m WHERE p.sponsor=‘NSF’ AND p.name in m.title AND p.member in m.author
  • 74. Integration Techniques -- Cont. •Semi-join •Suppose the text system can take k terms •For n members, send n/k queries of the form (m1 OR m2 OR … OR mk) to the text system •Probing •Select a set of terms (how?) from project title and check their mentions in the text system •Keep a list of terms (or assignments) that return empty •Probing with tuple substitution •Maintain a cache
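The semi-join batching can be sketched in a few lines. It assumes the text system accepts at most k terms per query; the function name and the member names are hypothetical.

```python
import math

def semi_join_queries(terms, k):
    """Group n terms into ceil(n/k) disjunctive queries of at most k terms each."""
    return ["(" + " OR ".join(terms[i:i + k]) + ")"
            for i in range(0, len(terms), k)]

members = ["m1", "m2", "m3", "m4", "m5"]
queries = semi_join_queries(members, k=2)
print(queries)
print(len(queries) == math.ceil(len(members) / 2))
```

Compared with tuple substitution, which issues one text query per database tuple, this sends roughly n/k queries and leaves the exact matching of members to authors to the relational side.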
  • 75. SQL Queries over Text Databases [Jain et al., ICDE 2008] •Information Extraction (IE) modules over text •headquarter(company, location) •ceoOf(company, ceo) •Relational view of text •A set of full outer joins over IE modules •e.g. companies =headquarter ⋈ ceoOf ⋈ … •SQL queries over relational views •Want to improve upon “extract-then-query”
  • 76. Problem •Given a SQL query •Find execution strategies that meet some efficiency and quality constraints •In terms of runtime, precision, recall, … •On-the-fly IE from text SELECT company, ceo, location FROM companies WHERE location=‘Chicago’
  • 77. Retrieval Strategies •scan •Process all documents •const •Process documents that contain query keywords •E.g. chicago •promD •Only process the promising documents for each IE system (using IE specific keywords) •E.g. headquarter OR (based AND shares) •promC •AND the predicates of const and promD •E.g. chicago AND (headquarter OR (based AND shares))
  • 78. Selecting an Execution Plan •Stats estimated for each strategy •# of matching docs docs(E, promC, D) •Retrieval time rTime(E, scan, D) •Cost estimation •Stratified sampling (with one stratum for PD and another stratum for D-PD) •For const use both strata •For promC & promD use PD only (figure: document sets D and PD, with the strategies scan, const, promD and promC)
  • 79. Natural Language Interface to Databases (NLIDB)
  • 80. Anatomy of a NLIDB Query Understanding Query Translation Data store Feedback Generation Domain knowledge Optional component NLQ Interpretation interactions queries queries
  • 81. Query Understanding – Scope of Natural Language Support Ad-hoc NLQs Controlled NLQs Grammar complexity Vocabulary complexity Ambiguity Parser error Query naturalness
  • 82. Query Understanding – Stateless and Stateful •Stateless NLQs (figure: NLQ, NLQ Engine, Databases) •Each query must be • Fully specified • Processed independently •Stateful NLQs (figure: NLQ, NLQ Engine with query history, Databases) •Each query • Can be partially specified • Processed with regard to previous queries
  • 83. Query Understanding - Parser Error Handling •Parsers make mistakes • News: Accuracy of a dependency parser = ~90% [Andor et al., 2016] • Questions: ~80% [Judge et al., 2006] •Different approaches •Ignore • Do nothing •Auto-correction • Detect and correct certain parser mistakes •Interactive correction • Query reformulation • Parse tree correction
  • 84. Query Translation - Bridging the Semantic Gaps • Vocabulary gap “Bill Clinton” vs. “William Jefferson Clinton” “IBM” vs. “International Business Machines Corporation” • Leaky abstraction • Mismatch between abstraction (e.g. data schema/domain ontology) and user assumptions “top executives” vs “person with title CEO, CFO, CIO, etc.” • Ambiguity in user queries • Underspecified queries “Watson movie” → “Watson” as actor/actress, e.g. Emma Watson; “Watson” as a movie character, e.g. Dr. Watson in movie “Holmes and Watson” …
  • 85. Query Translation – Query Construction • Approaches • Machine learning • Construct formal queries from NLQ interpretations with deterministic algorithms • Query • Formal query languages (e.g. XQuery / SQL) • Intermediate language independent of underlying data stores • The same intermediate query for different data stores
  • 86. Systems • PRECISE • NaLIX • NLPQC • FREyA • NaLIR • ML2SQL • NL2CM • ATHENA
  • 87. PRECISE [Popescu et al., 2003,2004] • Controlled NLQ based on Semantic Tractability Dependency Parsing Query Generator RDBMS Lexicon NLQ Interpretation queriesMatcher Semantic Override Feedback Generation interactions Equivalence Checker
  • 88. PRECISE [Popescu et al., 2003,2004] • Semantic Tractability Database element: relations, attributes, or values Token: a set of word stems that matches a database element Syntactic marker: a term from a fixed set of database-independent terms that make no semantic contribution to the interpretation of the NLQ Semantically tractable sentence: Given a set of database elements E, a sentence S is considered semantically tractable when its complete tokenization satisfies the following conditions: • Every token matches a unique database element in E • Every attribute token attaches to a unique value token • Every relation token attaches to either an attribute token or a value token
  • 89. PRECISE [Popescu et al., 2003,2004] • Explicitly correct parsing errors: • Preposition attachment • Preposition ellipsis What are flights from Boston to Chicago on Monday? pronoun verb noun prep noun noun nounprepprep NP NP NP NP NP PP PP NP PP NP VP S
  • 90. PRECISE [Popescu et al., 2003,2004] • Explicitly correct parsing errors: • Preposition attachment • Preposition ellipsis What are flights from Boston to Chicago on Monday? pronoun verb noun prep noun noun nounprepprep NP NP NP NP NP PP PP NP PP NP VP S What are flights from Boston to Chicago Monday? pronoun verb noun prep noun noun nounprep NP NP NP NP NP PP NP PP NP VP S
  • 91. PRECISE [Popescu et al., 2003,2004] • Mapping parse tree nodes based on lexicon built from database
  • 92. PRECISE [Popescu et al., 2003,2004] • Addressing ambiguities through lexicon + semantic tractability • Maximum-flow solution
  • 93. PRECISE [Popescu et al., 2003,2004] • Addressing ambiguities through lexicon + semantic tractability + user input What are the systems analyst jobs in Austin? Interpretation 1 Job title: systems analyst Interpretation 2 Area: systems Job title: analyst NLQ
  • 94. PRECISE [Popescu et al., 2003,2004] • 1-to-many translation from interpretations to SQL based on all possible join-paths •NLQ: What are the HP jobs on Unix in a small town? •Interpretations: Job.Description ← What; Job.Company ← 'HP'; Job.Platform ← 'Unix'; City.Size ← 'small' •DB schema: Job(JobID, Description, Company, Platform), City(CityID, Name, State, Size) •SELECT DISTINCT Job.Description FROM Job, City WHERE Job.Company = 'HP' AND Job.Platform = 'Unix' AND Job.JobID = City.CityID
  • 95. PRECISE [Popescu et al., 2003,2004] • 1-to-many translation from interpretations to SQL based on all possible join-paths •NLQ: What are the HP jobs on Unix in a small town? •Interpretations: Job.Description ← What; Job.Company ← 'HP'; Job.Platform ← 'Unix'; City.Size ← 'small' •DB schema: Job(JobID, Description, Company, Platform), City(CityID, Name, State, Size), WorkLocation(JobID, CityID), PostLocation(JobID, CityID) •SELECT DISTINCT Job.Description FROM Job, WorkLocation, City WHERE Job.Company = 'HP' AND Job.Platform = 'Unix' AND Job.JobID = WorkLocation.JobID AND WorkLocation.CityID = City.CityID •SELECT DISTINCT Job.Description FROM Job, PostLocation, City WHERE Job.Company = 'HP' AND Job.Platform = 'Unix' AND Job.JobID = PostLocation.JobID AND PostLocation.CityID = City.CityID
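The 1-to-many translation can be sketched as enumerating join paths in a schema graph and emitting one SQL statement per path. This is an illustration under assumptions, not PRECISE's algorithm: the `JOINS` dictionary, the helper names, and the explicit `City.Size = 'small'` filter are all hypothetical.

```python
# Hypothetical schema graph: (table_a, table_b) -> shared join column.
JOINS = {
    ("Job", "WorkLocation"): "JobID",
    ("WorkLocation", "City"): "CityID",
    ("Job", "PostLocation"): "JobID",
    ("PostLocation", "City"): "CityID",
}

def neighbors(table):
    """Tables directly joinable with the given table."""
    for (a, b) in JOINS:
        if a == table:
            yield b
        elif b == table:
            yield a

def join_paths(src, dst, path=None):
    """Enumerate all simple paths from src to dst in the schema graph."""
    path = path or [src]
    if src == dst:
        yield path
        return
    for nxt in neighbors(src):
        if nxt not in path:
            yield from join_paths(nxt, dst, path + [nxt])

def to_sql(path, select, filters):
    """Build one SQL statement for a join path and a list of value filters."""
    conditions = list(filters)
    for a, b in zip(path, path[1:]):
        column = JOINS.get((a, b)) or JOINS[(b, a)]
        conditions.append(f"{a}.{column} = {b}.{column}")
    return (f"SELECT DISTINCT {select} FROM {', '.join(path)} "
            f"WHERE {' AND '.join(conditions)}")

filters = ["Job.Company = 'HP'", "Job.Platform = 'Unix'", "City.Size = 'small'"]
paths = list(join_paths("Job", "City"))
statements = [to_sql(p, "Job.Description", filters) for p in paths]
for statement in statements:
    print(statement)
```

Each interpretation therefore yields as many candidate queries as there are join paths between the tables it mentions.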
  • 96. NLPQC [Stratica et al., 2005] Question Parsing Query Translation RDBMSNLQ Interpretation queries Preprocessor schema Rule template Link Parser Semantic Analysis • Controlled NLQ based on predefined rule templates • No query history
  • 97. NLPQC [Stratica et al., 2005] • Build mapping rules for table names and attributes • Automatically generated using WordNet • Curated by system administrator Table name: resource … Synonyms: 3 sense of resource Sense 1: resource Sense 2: resource Sense 3: resource, resourcefulness, imagination Hypernyms: 3 sense of resource … Hyponyms: 3 sense of resource … … accept/reject/add Databases
  • 98. NLPQC [Stratica et al., 2005] • Mapping parse tree node to data schema and value based on mapping rules Who is the author of book Algorithms Table name: resource Table name: resource resource.default_attribute NLQ
  • 99. NLPQC [Stratica et al., 2005] • Mapping parse tree node to data schema and value based on pre-defined mapping rules • Mapping parse trees to SQL statements based on pre-defined rule templates Who is the author of book Algorithms Table name: resource Table name: resource resource.default_attribute Rule template SELECT author.name FROM author, resource, resource_author WHERE resource.title = “Algorithms” AND resource_author.resource_id=resource.resource_id AND resource_author.author_id=author.author_id NLQ
  • 100. NLPQC [Stratica et al., 2005] • No explicit ambiguity handling → left to mapping rules and rule templates • No parsing error handling → assumes no parsing errors
  • 101. NaLIX [Li et al., 2007a, 2007b, 2007c] • Controlled NLQ based on pre-defined controlled grammar Dependency Parser Query Translation XML DBs Message Generator Translation Patterns NLQ Validated Parse Tree interactions Schema-free XQuery warning Classifier Validator Controlled Grammar Classification Tables Query History Domain Adapter Domain Knowledge Knowledge Extractor errors
  • 102. NaLIX [Li et al., 2007a, 2007b, 2007c] • Classify parse tree nodes into different types based on classification tables • Token: words/phrases that can be mapped into an XQuery component • Constructs in FLWOR expressions • Marker: word/phrase that cannot be mapped into an XQuery component • Connecting tokens, modify tokens, pronoun, stopwords What are the state that share a watershed with California NLQ What are [CMT] state[NT] the [MM] that [MM] Classified parse tree share [CM] watershed [NT] a [MM] with [CM] California [VT]
  • 103. NaLIX [Li et al., 2007a, 2007b, 2007c] • Expand scope of NLQ support via domain adaptation What are the state that share a watershed with California NLQ What are [CMT] state[NT] the [MM] that [MM] Classified parse tree share [CM] watershed [NT] a [MM] with [CM] California [VT] What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] California [VT] Updated classified parse tree with domain knowledge where [MM] river [NT] river [NT] a [MM] of [CM]
  • 104. NaLIX [Li et al., 2007a, 2007b, 2007c] • Validate classified parse tree + term expansion + insert implicit nodes What are the state that share a watershed with California NLQ What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] California [VT] Updated classified parse tree with domain knowledge where [MM] river [NT] river [NT] a [MM] of [CM] What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] CA [VT] Updated classified parse tree post validation where [MM] river [NT] river [NT] a [MM] of [CM] state[NT] Implicit node Term expansion to bridge terminology gap
  • 105. NaLIX [Li et al., 2007a, 2007b, 2007c] • Translation: (1) Variable binding What are the state that share a watershed with California NLQ $v1 * $v2 $v1 * What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] CA [VT] Updated classified parse tree post validation where [MM] river [NT] river [NT] a [MM] of [CM] state[NT] $v3 $v4
  • 106. NaLIX [Li et al., 2007a, 2007b, 2007c] • Translation: (2) Pattern Mapping What are the state that share a watershed with California NLQ $v1 * $v2 $v1 * What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] CA [VT] Updated classified parse tree post validation where [MM] river [NT] river [NT] a [MM] of [CM] state[NT] $v3 $v4 for $v1 in doc//state for $v2 in doc//river for $v3 in doc//river for $v4 in doc//state where $v2 = $v3 where $v4 = “CA” XQuery fragments
  • 107. NaLIX [Li et al., 2007a, 2007b, 2007c] • Translation: (3) Nesting and grouping What are the state that share a watershed with California NLQ $v1 * $v2 $v1 * What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] CA [VT] Updated classified parse tree post validation where [MM] river [NT] river [NT] a [MM] of [CM] state[NT] $v3 $v4 for $v1 in doc//state for $v2 in doc//river for $v3 in doc//river for $v4 in doc//state where $v2 = $v3 where $v4 = “CA” XQuery fragments No aggregation function/qualifier → No nesting/grouping
  • 108. NaLIX [Li et al., 2007a, 2007b, 2007c] • Translation: (3) Nesting and grouping Find all the states whose number of rivers is the same as the number of rivers in California? NLQ $v1 * $v2 $v1 * What are [CMT] state[NT] the [MM] each [MM] is the same as[CM] state[NT] a [MM] of [CM] CA [VT] where [MM] river [NT] river [NT] a [MM] of [CM] state[NT] $v3 $v4 the number of [FT] the number of [FT]$cv1 $cv2 for $v1 in doc//state for $v2 in doc//river for $v3 in doc//river for $v4 in doc//state for $cv1 = count($v2) for $cv2 = count($v3) where $cv1 = $cv2 where $v4 = “CA” XQuery fragments Aggregation function → Nesting and grouping based on $v2 and $v3
  • 109. NaLIX [Li et al., 2007a, 2007b, 2007c] • Translation: (4) Construct the full query Find all the states whose number of rivers is the same as the number of rivers in California? NLQ for $v1 in doc//state for $v2 in doc//river for $v3 in doc//river for $v4 in doc//state for $cv1 = count($v2) for $cv2 = count($v3) where $cv1 = $cv2 where $v4 = “CA” XQuery fragments for $v1 in doc(“geo.xml”)//state, $v4 in doc(“geo.xml”)//state let $vars1 := { for $v2 in doc(“geo.xml”)//river, $v5 in doc(“geo.xml”)//state where mqf($v2,$v5) and $v5 = $v1 return $v2} let $vars2 := { for $v3 in doc(“geo.xml”)//river, $v6 in doc(“geo.xml”)//state where mqf($v3,$v6) and $v6 = $v4 return $v3} where count($vars1) = count($vars2) and $v4 = “CA” return $v1
  • 110. NaLIX [Li et al., 2007a, 2007b, 2007c] • Support partially specified follow-up queries • Detect topic switch to refresh query context How about with Texas? NLQ How about [SM] Validated parse tree with [CM] TX [VT] for $v1 in doc//state for $v2 in doc//river for $v3 in doc//river for $v4 in doc//state for $cv1 = count($v2) for $cv2 = count($v3) where $cv1 = $cv2 where $v4 = “CA” Query context for $v1 in doc//state for $v2 in doc//river for $v3 in doc//river for $v4 in doc//state for $cv1 = count($v2) for $cv2 = count($v3) where $cv1 = $cv2 where $v4 = “TX” Updated query context Substitution marker Updated value
  • 111. NaLIX [Li et al., 2007a, 2007b, 2007c] • Handle ambiguity • Ambiguity in terms → user feedback, e.g. “California” can be the name of a state, as well as a city • Ambiguity in join-path → leverage Schema-free XQuery to find the optimal join-path, e.g. there could be multiple ways for a river to be related to a state • Error handling • Do not handle parser error explicitly • Interactive UI to encourage NLQ input understandable by the system
  • 112. FREyA [Damljanovic et al., 2013,2014] • Support ad-hoc NLQs, including ill-formed queries • Direct ontology look up + parse tree mapping → Certain level of robustness Syntactic parsing Query Generation Ontology Feedback Generation POCs dialogs SPARQL queries Ambiguous OCs/POCs Ontology-based Lookup Syntactic Mapping Mapping rules NLQ OCs Consolidation Triple Generation
  • 113. FREyA [Damljanovic et al., 2013,2014] • Parse tree mapping based on pre-defined heuristic rules → Finds POCs (Potential Ontology Concepts) • Direct ontology look up → Finds OCs (Ontology Concepts) What is the highest point of the state bordering Mississippi? NLQ the state Mississippi POCs the highest point OCs geo:isHighestPointOf geo:State geo:border geo:mississippi PROPERTY PROPERTYCLASS INSTANCE
  • 114. FREyA [Damljanovic et al., 2013,2014] • Consolidate POCs and OCs • If span(POC) ⊆ span(OC) → Merge POC and OC What is the highest point of the state bordering Mississippi? NLQ state Mississippi POCs the highest point OCs geo:isHighestPointOf geo:State geo:border geo:mississippi PROPERTY PROPERTYCLASS INSTANCE OCs geo:isHighestPointOf geo:State geo:border geo:mississippi PROPERTY PROPERTYCLASS INSTANCE
  • 115. FREyA [Damljanovic et al., 2013,2014] • Consolidate POCs and OCs • If span(POC) ⊆ span(OC) → Merge POC and OC • Otherwise, provide suggestions and ask for user feedback Return the population of California NLQ California POCs population OCs INSTANCE geo:california Suggestions ranked based on string similarity (Monge Elkan + Soundex) 1. state population 2. state population density 3. has low point, …
  • 116. FREyA [Damljanovic et al., 2013,2014] • Triple Generation: (1) Insert joker class OCs geo:isHighestPointOf geo:State geo:border geo:mississippi PROPERTY PROPERTYCLASS INSTANCE OCs geo:isHighestPointOf geo:State geo:border geo:mississippi PROPERTY1 PROPERTY2CLASS1 INSTANCEJOKER ?
  • 117. FREyA [Damljanovic et al., 2013,2014] • Triple Generation: (2) Generate triples OCs geo:isHighestPointOf geo:State geo:border geo:mississippi PROPERTY1 PROPERTY2CLASS1 INSTANCEJOKER ? ? - geo:isHighestPointOf - geo:State; geo:State - geo:borders - geo:mississippi (geo:State); Triples
  • 118. FREyA [Damljanovic et al., 2013,2014] • Generate SPARQL query ? - geo:isHighestPointOf - geo:State; geo:State - geo:borders - geo:mississippi (geo:State); Triples prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> prefix geo: <http://www.mooney.net/geo#> select ?firstJoker ?p0 ?c1 ?p2 ?i3 where { { ?firstJoker ?p0 ?c1 . filter (?p0=geo:isHighestPointOf) . } ?c1 rdf:type geo:State . ?c1 ?p2 ?i3 . filter (?p2=geo:borders) . ?i3 rdf:type geo:State . filter (?i3=geo:mississippi) . }
  • 119. FREyA [Damljanovic et al., 2013,2014] • Determine return type • Result of a SPARQL query is a graph • Identify answer type to decide the result display Show lakes in Minnesota. NLQ
  • 120. FREyA [Damljanovic et al., 2013,2014] • Handle ambiguities via user interactions • Provide suggestions • Leverage reinforcement learning to improve ranking of suggestions • No parser error handling
  • 121. NaLIR [Li and Jagadish, 2014] • Controlled NLQ based on predefined grammar • No query history Query Tree Translator RDBMS Interactive Communicator Data index & schema graph NLQ query tree interactions queries Dependency parser Parse Tree Node Mapper Parse Tree Structure Adjuster candidate mapping choice candidate query trees choice
  • 122. NaLIR [Li and Jagadish, 2014] • Mapping parse tree node to data schema and value based on WUP similarity [Wu and Palmer, 1994] • Explicitly request user input on ambiguous mappings and interpretations Query Tree Translator RDBMS Interactive Communicator Data index & schema graph NLQ query tree interactions queries Dependency parser Parse Tree Node Mapper Parse Tree Structure Adjuster candidate mapping choice candidate query trees choice
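The Wu-Palmer (WUP) measure scores two concepts by the depth of their lowest common subsumer: sim(a, b) = 2 · depth(lcs) / (depth(a) + depth(b)). The sketch below computes it on a tiny hand-made taxonomy with the root at depth 1; the taxonomy and function names are hypothetical, and a real system would typically compute the measure over WordNet rather than a dictionary like `PARENT`.

```python
# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "person": "entity",
    "author": "person",
    "writer": "person",
    "paper": "entity",
}

def ancestors(node):
    """The chain from a node up to the root, starting with the node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))

def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_b = set(ancestors(b))
    lcs = next(n for n in ancestors(a) if n in ancestors_of_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("author", "writer"))   # siblings under "person"
print(wup("author", "paper"))    # related only through the root
print(wup("author", "author"))
```

A parse tree node would then be mapped to the schema element with the highest score, and the user asked to choose when several candidates score similarly.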
  • 123. NaLIR [Li and Jagadish, 2014] • Automatically adjust parse tree structure into a valid parse tree Query Tree Translator RDBMS Interactive Communicator Data index & schema graph NLQ query tree interactions queries Dependency parser Parse Tree Node Mapper Parse Tree Structure Adjuster candidate mapping choice candidate query trees choice ROOT return author paper more Bob VLDB after 2000 ROOT return author Bob VLDB after 2000 more paper
  • 124. NaLIR [Li and Jagadish, 2014] • Automatically adjust parse tree structure into a valid parse tree • Further rewrite parse tree into one semantically reasonable Query Tree Translator RDBMS Interactive Communicator Data index & schema graph NLQ query tree interactions queries Dependency parser Parse Tree Node Mapper Parse Tree Structure Adjuster candidate mapping choice candidate query trees choice ROOT return author paper more Bob VLDB after 2000 ROOT return author Bob VLDB after 2000 more paper ROOT return author VLDB after 2000 more paper number of number of author paper VLDB after 2000 Bob
  • 125. NaLIR [Li and Jagadish, 2014] • 1-1 translation from query tree to SQL Query Tree Translator RDBMS Interactive Communicator Data index & schema graph NLQ query tree interactions queries Dependency parser Parse Tree Node Mapper Parse Tree Structure Adjuster candidate mapping choice candidate query trees choice
  • 126. Learning NLQ → SQL [Palakurthi et al., 2015] Stanford Parser Query Translation RDBMS Entity Relationship Schema NLQ queries Attribute Classifier Training Data Conditional Random Fields trained model Training phase Runtime Classified attributes • Ad-hoc NLQs with explicit attribute mentions • Implicit restriction imposed by the capability of the system itself
  • 127. Learning NLQ → SQL [Palakurthi et al., 2015] • Explicit attributes: attributes mentioned explicitly in the NLQ List all the grades of all the students in Mathematics NLQ Explicit attributes: grade and student Implicit attribute: course_name by the classifier
  • 128. Learning NLQ → SQL [Palakurthi et al., 2015] • Learn to map explicit attributes in the NLQ to SQL clauses • Features (type: examples) • Token-based: isSymbol • Grammatical: POS tags and grammatical relations • Contextual: tokens preceding or following the current token • Other: isAttribute, presence of other attributes, trigger words (e.g. “each”) Features Training data
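The feature types listed on slide 128 could be extracted per token along the following lines before being passed to a sequence labeller such as a CRF. This is a sketch under stated assumptions: the function names, the feature keys, the `<BOS>`/`<EOS>` markers and the contents of `TRIGGER_WORDS` beyond “each” are invented for illustration, and POS tags are assumed to come from an external tagger.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool]

# Hypothetical trigger list; the slide names only "each" as an example.
TRIGGER_WORDS: Set[str] = {"each", "every", "per"}


def token_features(tokens: List[str], pos_tags: List[str],
                   attributes: Set[str], i: int) -> Dict[str, Feature]:
    """Build the feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        # Token-based feature
        "isSymbol": not tok.isalnum(),
        # Grammatical feature (tags supplied by an external tagger)
        "pos": pos_tags[i],
        # Contextual features: neighbouring tokens
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Other features
        "isAttribute": tok.lower() in attributes,
        "otherAttributes": any(
            t.lower() in attributes for j, t in enumerate(tokens) if j != i
        ),
        "isTrigger": tok.lower() in TRIGGER_WORDS,
    }


def sentence_features(tokens: List[str], pos_tags: List[str],
                      attributes: Set[str]) -> List[Dict[str, Feature]]:
    """Feature dictionaries for every token in the question."""
    return [token_features(tokens, pos_tags, attributes, i)
            for i in range(len(tokens))]


tokens = ["List", "the", "grade", "of", "each", "student", "."]
pos_tags = ["VB", "DT", "NN", "IN", "DT", "NN", "."]
features = sentence_features(tokens, pos_tags, {"grade", "student"})
print(features[2])
```

A list of dictionaries per sentence is the input shape that common CRF toolkits accept, so the labels to learn would be SQL clause names aligned with the tokens.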
  • 129. Learning NLQ → SQL [Palakurthi et al., 2015] • Learn to map explicit attributes in the NLQ to SQL clauses Who are the professors teaching more than 2 courses? NLQ GROUP BY HAVING FROM
  • 130. Learning NLQ → SQL [Palakurthi et al., 2015] • Construct full SQL queries • Attribute Clause Mapping • Identify joins based on ER diagram • Add missing implicit attributes via Concept Identification [Srirampur et al., 2014] Who are the professors teaching more than 2 courses? NLQ GROUP BY HAVING FROM SELECT professor_name FROM COURSES, TEACH, PROFESSOR WHERE course_id = course_teach_id AND prof_teach_id = prof_id GROUP BY professor_name HAVING COUNT(course_name) > 2 SQL Identified based on ER schema
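Once attributes have been mapped to clauses and the join conditions read off the ER schema, assembling the final statement is mostly string construction. The sketch below rebuilds the slide's query and runs it on a small in-memory SQLite database. The `build_sql` helper, the column types and the sample rows are assumptions made for illustration; only the table names, column names and the resulting SQL come from the slide.

```python
import sqlite3
from typing import List, Optional


def build_sql(select: List[str], tables: List[str], joins: List[str],
              group_by: Optional[List[str]] = None,
              having: Optional[str] = None) -> str:
    """Assemble a SQL string from clause-level pieces (hypothetical helper)."""
    parts = ["SELECT " + ", ".join(select), "FROM " + ", ".join(tables)]
    if joins:
        parts.append("WHERE " + " AND ".join(joins))
    if group_by:
        parts.append("GROUP BY " + ", ".join(group_by))
    if having:
        parts.append("HAVING " + having)
    return " ".join(parts)


# Clause mapping for "Who are the professors teaching more than 2 courses?"
sql = build_sql(
    select=["professor_name"],
    tables=["COURSES", "TEACH", "PROFESSOR"],
    joins=["course_id = course_teach_id", "prof_teach_id = prof_id"],
    group_by=["professor_name"],
    having="COUNT(course_name) > 2",
)

# Invented sample data: Smith teaches three courses, Jones teaches one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COURSES (course_id INTEGER, course_name TEXT);
CREATE TABLE PROFESSOR (prof_id INTEGER, professor_name TEXT);
CREATE TABLE TEACH (course_teach_id INTEGER, prof_teach_id INTEGER);
INSERT INTO COURSES VALUES
    (1, 'Databases'), (2, 'Compilers'), (3, 'Logic'), (4, 'Algebra');
INSERT INTO PROFESSOR VALUES (10, 'Smith'), (20, 'Jones');
INSERT INTO TEACH VALUES (1, 10), (2, 10), (3, 10), (4, 20);
""")
rows = conn.execute(sql).fetchall()
conn.close()
print(sql)
print(rows)
```

Keeping clause assembly separate from clause prediction means a wrongly classified attribute shows up as a wrong but still well-formed query, which is easier to inspect than a malformed one.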
  • 131. Learning NLQ → SQL [Palakurthi et al., 2015] • No parsing error handling • No explicit ambiguity handling What length is the Mississippi? NLQ Implicit attribute: State Wrongly identified
  • 132. NL2CM [Amsterdamer et al., 2015] Query Verification Query Generator OASIS Feedback Generation Vocabularies NLQ interactions Formal query IX Detector IX: Individual Expression Ontology Stanford Parser IX Patterns General Query Generator OASIS-QL triples SPARQL triples Crowd mining engine • Controlled NLQ based on predefined types (e.g. no “why” questions) • Query verification with feedback • No query history
  • 133. NL2CM [Amsterdamer et al., 2015] • Map parse tree with Individual Expression (IX) patterns and vocabularies • Lexical individuality: Individual terms convey certain meaning • Participant individuality: Participants or agents in the text that are relative to the person addressed by the request • Syntactic individuality: Certain syntactic constructs in a sentence. What are the most interesting places near Forest Hotel, Buffalo that we should vi
  • 134. NL2CM [Amsterdamer et al., 2015] • Map parse tree with Individual Expression (IX) patterns and vocabularies • Lexical individuality: Individual terms convey certain meaning • Participant individuality: Participants or agents in the text that are relative to the person addressed by the request • Syntactic individuality: Certain syntactic constructs in a sentence. What are the most interesting places near Forest Hotel, Buffalo that we should vi $x interesting [] visit $x Opinion Lexicon
  • 135. NL2CM [Amsterdamer et al., 2015] • Map parse tree with Individual Expression (IX) patterns and vocabularies • Processing the general parts of the query with FREyA system • Interact with user to resolve ambiguities What are the most interesting places near Forest Hotel, Buffalo that we should vi $x interesting [] visit $x Opinion Lexicon $x near Forest Hotel,_Buffalo,_NY User interaction $x instanceOf Place
  • 136. NL2CM [Amsterdamer et al., 2015] • No parsing error handling • Return error for partially interpretable queries • SPARQL + OASIS-QL triples → a complete OASIS-QL query $x interesting [] visit $x $x near Forest Hotel,_Buffalo,_NY $x instanceOf Place SELECT VARIABLES WHERE {$x instanceOf Place. $x near Forest_Hotel,_Buffalo,_NY} SATISFYING {$x hasLabel “interesting”} ORDER BY DESC(SUPPORT) LIMIT 5 AND { [ ] visit $x} WITH SUPPORT THRESHOLD = 0.1
  • 137. NL2CM [Amsterdamer et al., 2015] Query Verification Query Generator OASIS Feedback Generation Vocabularies NLQ interactions Formal query IX Detector IX: Individual Expression Ontology Stanford Parser IX Patterns General Query Generator OASIS-QL triples SPARQL triples Crowd mining engine • Handling ambiguity via user input
  • 138. ATHENA [Saha et al., 2016] • Permit ad-hoc queries • No explicit constraints on NLQ • Implicit limit on expressivity of NLQs by query expressivity limitation (e.g. nested query with more than 1 level) • No query history NLQ Engine Query Translation Databases Domain Ontology NLQ OQL with NL explanations Top ranked SQL query SQL queries with NL explanations user-selected SQL query Translation Index
  • 139. ATHENA [Saha et al., 2016] • Annotate NLQ into evidences → No explicit parsing • Handle ambiguity based on translation index and domain ontology Key Entries “Alibaba” “Alibaba Inc” “Alibaba Incorporated” “Alibaba Holding” … Company.name: Alibaba Inc Company.name: Alibaba Holding Inc. Company.name: Alibaba Capital Partners … Translation Index “Investments” “investment” PersonalInvestment InstitutionalInvestment … … Databases Data Value Domain Ontology Metadata Data
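A translation index of the kind shown on slide 139 maps surface phrases to candidate ontology elements or data values, and the NLQ is annotated by looking up its token spans. The following sketch uses a plain dictionary with greedy longest-match lookup. The index contents are abridged from the slides, while the `annotate` function, the whitespace tokenization and the `max_len` parameter are assumptions for illustration rather than the system's actual mechanism.

```python
from typing import Dict, List, Tuple

# Toy translation index: phrase -> candidate ontology elements or data values.
TRANSLATION_INDEX: Dict[str, List[str]] = {
    "alibaba": [
        "Company.name: Alibaba Inc",
        "Company.name: Alibaba Holding Inc.",
        "Company.name: Alibaba Capital Partners",
    ],
    "investments": ["PersonalInvestment", "InstitutionalInvestment"],
    "restricted stock": [
        "Holding.type",
        "Transaction.type",
        "InstitutionalInvestment.type",
    ],
}


def annotate(nlq: str, index: Dict[str, List[str]],
             max_len: int = 3) -> List[Tuple[str, List[str]]]:
    """Return (phrase, candidates) evidences using greedy longest match."""
    tokens = nlq.lower().split()
    evidences: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in index:
                evidences.append((phrase, index[phrase]))
                i += n
                break
        else:
            i += 1  # no span starting here matched; skip this token
    return evidences


evidences = annotate(
    "Show me restricted stock investments in Alibaba since 2012 by year",
    TRANSLATION_INDEX,
)
for phrase, candidates in evidences:
    print(phrase, "->", candidates)
```

Each phrase keeps all of its candidates at this stage; choosing among them is the ambiguity-handling step that relies on the domain ontology.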
  • 140. ATHENA [Saha et al., 2016]
  • 141. ATHENA [Saha et al., 2016] Show me restricted stock investments in Alibaba since 2012 by year Holding.type Transaction.type InstitutionalInvestment.type … PersonalInvestment InstitutionalInvestment VCInvestment … Company.name:Alibaba Inc. Company.name:Alibaba Holding Inc. … Transaction.reported_year Transaction.purchase_year InstitutionalInvestment.reported_year … indexed value indexed value metadata metadata time range “since 2012”, “year” Institutional investment Investment Investee type Reported_ year name “investments” “restricted stock” “since 2012”, “year” “Alibaba” Investee Company investedIn “in” unionOf Is-a Institutional investment Investment Investee type Reported_ year name “investments” “restricted stock” “Alibaba” Investee Company investedIn “in” issuedBy Is-a unionOf Security Evidence Interpretation trees
  • 142. ATHENA [Saha et al., 2016] • Ontology Query Language • Intermediate language over domain ontologies • Separate query semantics from underlying data stores • Support common OLAP-style queries
  • 143. ATHENA [Saha et al., 2016] • 1-1 translation from interpretation tree to OQL • 1-1 translation from OQL to SQL per relational schema SELECT Sum(oInstitutionalInvestment.amount), oInstitutionalInvestment.reported_year FROM InstitutionalInvestment oInstitutionalInvestment, InvesteeCompany oInvesteeCompany WHERE oInstitutionalInvestment.type = 'restricted_stock', oInstitutionalInvestment.reported_year >= '2012', oInstitutionalInvestment.reported_year <= Inf, oInvesteeCompany.name = ('Alibaba Holdings Ltd.', 'Alibaba Inc.', 'Alibaba Capital Partners'), oInstitutionalInvestment->isa->InvestedIn->unionOf_Security->issuedBy = oInvesteeCompany GROUP BY oInstitutionalInvestment.reported_year Database 1 Database 2 Database 3 … SQL Statement1 SQL Statement2 SQL Statement3 SQL Statement …
  • 144. NLIDBs Summary Systems Scope of NLQ Support Capability State Parsing Error Handling Controlled Ad-hoc* Fixed Self-improving Stateless Stateful Auto-correction Interactive-correction PRECISE NLPQC NaLIX FREyA NaLIR NL2CM ML2SQL ATHENA N/A N/A * Implicit limitation by system capability
  • 145. NLIDBs Summary – Cont. Systems Ambiguity Handling Query Construction Target Language Automatic Interactive Rule-based Machine-learning PRECISE SQL NLPQC SQL NaLIX (Schema-free) XQuery FREyA SPARQL NaLIR SQL NL2CM OASIS-QL ML2SQL * SQL ATHENA OQL * Partially
  • 146. Relationship to Semantic Parsing Query Understanding NLQ Domain knowledge Query Translation interpretations queries Data store Semantic Parser NL Sentence Semantic parsing results Training data ML Model Semantic parsing can be used to build NLIDB query results NLIDB Semantic Parsing
  • 147. Relationship to Question Answering Query Understanding NLQ Domain knowledge Query Translation interpretations database queries Data store Similar techniques query results Query Understanding NLQ Query Translation interpretations Document search queries top results Document Collection Domain knowledge NLIDB Question Answering
  • 149. Querying Natural Language Data - Review •Covered •Boolean queries •Grammar-based schema and searches •Text pattern queries •Tree pattern queries • Developments beyond •Keyword searches as input •Documents as output
  • 150. Querying Natural Language Data – Challenges & Opportunities • Grammar-based schemas • Promising direction • Challenges • Queries w/o knowing the schema • Many table schemes! • Overlap and equivalence relationships • Promising developments • Paraphrasing relationships between text phrases, tree patterns, DCS trees, etc. • Development of resources (e.g. KBs) and shallow semantic parsers to understand semantics • Self-improving systems
  • 151. Integrating & Transforming Natural Language Data - Review •Covered •Transformations on text •Loose and tight integration •More work on •Loose integration •Optimizing query plans
  • 152. Integrating & Transforming Natural Language Data – Challenges & Opportunities • Challenges • Lack of schema, opacity of references, richness of semantics and correctness of data • Much to inspire from • Work on transforming text • Size and scope of resources for understanding text • Progress in shallow semantic parsing • Other areas such as translation and speech recognition • Opportunities • Lots of demand for relevant tools • More structure in natural language text than text (as a seq. of tokens) • Strong ties to deductive databases
  • 153. NLIDB: Ideal and Reality Systems Scope of NLQ Support Capability State Parsing Error Handling Controlled Ad-hoc Fixed Self-improving Stateless Stateful Auto-correction Interactive-correction PRECISE * NLPQC NaLIX * FREyA * NaLIR * NL2CM ML2SQL * ATHENA * N/A N/A Ideal NLIDB * Supported to a limited extent
  • 154. NLIDB: Ideal and Reality – Cont. Systems Ambiguity Handling Query Construction Target Language Automatic Interactive Rule-based Machine-learning PRECISE * SQL NLPQC SQL NaLIX * (Schema-free) XQuery FREyA SPARQL NaLIR SQL NL2CM OASIS-QL ML2SQL * SQL ATHENA * OQL Ideal NLIDB Polystore language * Supported to a limited extent
  • 155. NLIDB: Open Challenges Query Understanding Query Translation Data store Feedback Generation Domain knowledge NLQ Interpretation interactions queries queries • Effectively communicate limitations to users • Engage user at the right moment • Multi-modal interaction • Support ad-hoc NLQs with complex semantics • Better handle parser errors • Automatically bridge terminology gaps • Automatically identify and resolve ambiguity • Multilingual/crosslingual support • Polystore • Structured data + (un-/semi-)structured data • Construct domain knowledge with minimal development effort • Construct complex queries • Self-improving • Personalization • Conversational Document Collection Transform & integrate
  • 156. Natural Language DM & Interfaces: Opportunities Database Human Computer Interaction Natural Language Processing Machine Learning
  • 157. References • [Agichtein and Gravano, 2003] Agichtein, E. and Gravano, L. (2003). Querying text databases for efficient information extraction. In Proc. of the ICDE Conference, pages 113–124, Bangalore, India. • [Agrawal et al., 2008] Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. (2008). Scalable ad-hoc entity extraction from text collections. PVLDB, 1(1):945–957. • [Amsterdamer et al., 2015] Amsterdamer, Y., Kukliansky, A., and Milo, T. (2015). A natural language interface for querying general and individual knowledge. PVLDB, 8(12):1430–1441. • [Andor et al., 2016] Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., and Collins, M. (2016). Globally normalized transition-based neural networks. CoRR, abs/1603.06042. • [Berant et al., 2013] Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on Freebase from question-answer pairs. In Proc. of the EMNLP Conference, volume 2, page 6. • [Bertino et al., 2012] Bertino, E., Ooi, B. C., Sacks-Davis, R., Tan, K.-L., Zobel, J., Shidlovsky, B., and Andronico, D. (2012). Indexing techniques for advanced database systems, volume 8. Springer Science & Business Media. • [Broder et al., 2003] Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., and Zien, J. (2003). Efficient query evaluation using a two-level retrieval process. In Proc. of the CIKM Conf., pages 426–434. ACM. • [Cafarella and Etzioni, 2005] Cafarella, M. J. and Etzioni, O. (2005). A search engine for natural language applications. In Proc. of the WWW conference, pages 442–452. ACM. • [Cafarella et al., 2007] Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. (2007). Structured querying of web text data: A technical challenge. In Proc. of the CIDR Conference, pages 225–234, Asilomar, CA. • [Cai et al., 2005] Cai, G., Wang, H., MacEachren, A. M., Tokensregex: Defining cascaded regular expressions over tokens. Technical Report CSTR-2014-02, Department of Computer Science, Stanford University.
  • 158. References – Cont. • [Chaudhuri et al., 1995] Chaudhuri, S., Dayal, U., and Yan, T. W. (1995). Join queries with external text sources: Execution and optimization techniques. In ACM SIGMOD Record, pages 410–422, San Jose, California. • [Chaudhuri et al., 2004] Chaudhuri, S., Ganti, V., and Gravano, L. (2004). Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proc. of the ICDE Conf., pages 227–238. IEEE. • [Chen et al., 2000] Chen, Z., Koudas, N., Korn, F., and Muthukrishnan, S. (2000). Selectivity estimation for Boolean queries. In Proc. of the PODS Conf., pages 216–225. ACM. • [Chu et al., 2007] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. (2007). A relational approach to incrementally extracting and querying structure in unstructured data. In Proc. of the VLDB Conference. • [Chubak and Rafiei, 2010] Chubak, P. and Rafiei, D. (2010). Index Structures for Efficiently Searching Natural Language Text. In Proc. of the CIKM Conference. • [Chubak and Rafiei, 2012] Chubak, P. and Rafiei, D. (2012). Efficient indexing and querying over syntactically annotated trees. PVLDB, 5(11):1316–1327. • [Codd, 1974] Codd, E. (1974). Seven steps to rendezvous with the casual user. In IFIP Working Conference Data Base Management, pages 179–200. • [Ferrucci, 2012] Ferrucci, D. A. (2012). Introduction to “This is Watson”. IBM Journal of Research and Development, 56(3):1. • [Gonnet and Tompa, 1987] Gonnet, G. H. and Tompa, F. W. (1987). Mind your grammar: a new approach to modelling text. In Proc. of the VLDB Conference, pages 339–346, Brighton, England. • [Gyssens et al., 1989] Gyssens, M., Paredaens, J., and Gucht, D. V. (1989). A grammar-based approach towards unifying hierarchical data models (extended abstract). In Proc. of the SIGMOD Conference, pages 263–272, Portland, Oregon.
  • 159. References – Cont. • [Jagadish et al., 1999] Jagadish, H., Ng, R. T., and Srivastava, D. (1999). Substring selectivity estimation. In Proc. of the PODS Conf., pages 249–260. ACM. • [Jain et al., 2008] Jain, A., Doan, A., and Gravano, L. (2008). Optimizing SQL queries over text databases. In Proc. of the ICDE Conference, pages 636–645, Cancun, Mexico. • [Kaoudi and Manolescu, 2015] Kaoudi, Z. and Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1):67–91. • [Lewis and Steedman, 2013] Lewis, M. and Steedman, M. (2013). Combining distributional and logical semantics. Transactions of the Association for Computational Linguistics, 1:179–192. • [Li and Jagadish, 2014] Li, F. and Jagadish, H. V. (2014). Constructing an interactive natural language interface for relational databases. PVLDB, 8(1):73–84. • [Li et al., 2007] Li, Y., Yang, H., and Jagadish, H. V. (2007). NaLIX: A generic natural language search environment for XML data. ACM Trans. Database Systems, 32(4). • [Liang et al., 2011] Liang, P., Jordan, M. I., and Klein, D. (2011). Learning dependency-based compositional semantics. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 590–599. Association for Computational Linguistics. • [Lin and Pantel, 2001] Lin, D. and Pantel, P. (2001). DIRT - discovery of inference rules from text. In Proc. of the KDD Conference, pages 323–328. • [Rafiei and Li, 2009] Rafiei, D. and Li, H. (2009). Data extraction from the web using wild card queries. In Proc. of the CIKM Conference, pages 1939–1942. • [Ravichandran and Hovy, 2002] Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proc. of the ACL Conference.
  • 160. References – Cont. • [Popescu et al., 2004] Popescu, A.-M. et al. (2004). Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In Proc. of the COLING Conference. • [Saha et al., 2016] Saha, D., Floratou, A., Sankaranarayanan, K., Minhas, U. F., Mittal, A. R., and Özcan, F. (2016). ATHENA: An ontology-driven system for natural language querying over relational data stores. PVLDB, 9(12):1209–1220. • [Salminen and Tompa, 1994] Salminen, A. and Tompa, F. (1994). PAT expressions: an algebra for text search. Acta Linguistica Hungarica, 41(1):277–306. • [Stratica et al., 2005] Stratica, N., Kosseim, L., and Desai, B. C. (2005). Using semantic templates for a natural language interface to the CINDI virtual library. Data and Knowledge Engineering, 55(1):4–19. • [Suchanek and Preda, 2014] Suchanek, F. M. and Preda, N. (2014). Semantic culturomics. Proc. of the VLDB Endowment, 7(12):1215–1218. • [Tague et al., 1991] Tague, J., Salminen, A., and McClellan, C. (1991). A complete model for information retrieval systems. In Proc. of the SIGIR Conference, pages 14–20, Chicago, Illinois. • [Tian et al., 2014] Tian, R., Miyao, Y., and Matsuzaki, T. (2014). Logical inference on dependency-based compositional semantics. In Proc. of the ACL Conference, pages 79–89. • [Wu and Palmer, 1994] Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In Proc. of the ACL Conference. • [Valenzuela-Escarcega et al., 2016] Valenzuela-Escarcega, M. A., Hahn-Powell, G., and Surdeanu, M. (2016). Odin’s runes: A rule language for information extraction. In Proc. of the Language Resources and Evaluation Conference (LREC). • [Xu, 2014] Xu, W. (2014). Data-driven approaches for paraphrasing across language variations. PhD thesis, New York University.
  • 161. Relevant Tutorials • Semantic parsing • Percy Liang: “Natural language understanding: foundations and state-of-the-art”, ICML 2015. • Information extraction • Laura Chiticariu, Yunyao Li, Sriram Raghavan, Frederick Reiss: “Enterprise information extraction: recent developments and open challenges”, SIGMOD 2010. • Entity resolution • Lise Getoor and Ashwin Machanavajjhala: “Entity Resolution for Big Data”, KDD 2013.

Editor's Notes

  1. RDF store: jena, MarkLogic, …
  2. Chen: selectivity of Boolean queries. Jagadish, Chaudhuri: selectivity of strings.
  3. Signif.2 “computer”
  4. The paper appeared in VLDB 1987
  5. roadrunner assumes prefix markup encoding of text
  6. Hyponym [specific] – hypernym [general]
  7. Garfield, permuterm index, 1976. Compressed permuterm index, P. Ferragina, R. Venturini, 2007. A hash table supports access to the alphabet; a wavelet tree supports rank.
  8. TGrep2, CorpusSearch: load the corpus to main memory and scan
  9. Mss<6 is based on a bin-packing approximation that is optimal for bin sizes less than 6.
  10. RDF store: jena, MarkLogic, …
  11. “Tropical storm Debby is blamed for death” / “Tropical storm Debby has caused loss of life”
  12. On WebQuestions, a run on training examples sends 600K SPARQL queries to Freebase.
  13. Result of operators can be stored back in the wide table
  14. Const: applicable for queries with selection conditions
  15. GeoDialogue
  16. Potential Ontology Concepts (POCs) are derived from the syntactic parse tree, and refer to question terms which could be linked to an ontology concept. The syntactic parse tree is generated by the Stanford Parser [12]. We use several heuristic rules in order to identify POCs. For example, each NP (noun phrase) or NN (noun) is identified as a POC. Also, if a noun phrase contains adjectives, these are considered POCs as well. Next, the algorithm iterates through the list of POCs, attempting to map them to OC
  25. Resources to interpret/understand semantics Shallow semantic parsers such as Framenet, universal dependencies and abstract meaning rep