Foundational Research
Propelled by Text Analytics
Benny Kimelfeld
LogicBlox
Preamble
• Myself:
– Ph.D. @ HebrewU (DB uncertainty + search)
– IBM Almaden (DB theory, IR, Text Analytics)
– LogicBlox (ML in DB, Prob. Programming)
– Technion IL (Associate Prof., next year)
• This talk:
– Infrastructure for text analytics + DB theory, formal languages, NLP, data mining, computational complexity, …
2
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
3
Text Analytics Matters
Some important applications are based on the
analysis of text-centric data; for example:
Semantic Search
Semantic understanding & indexing of
content to better match user's intent
Life-Science Mining
Extract knowledge bases from
scientific publications
e-Commerce
Comparison Shopping extracts &
compares inventory from online sources
CRM / BI
Monitor customers’ social-media activity
for sentiment & business leads
Log Analysis
Summarize, visualize and analyze logs
produced by machines
4
Database Management Systems
• Old news: Data management is involved!
– Data semantics, query/analysis semantics, storage,
query evaluation, indices, consistency, transactions,
backup, privacy, recovery, …
– From-scratch engineering is highly challenging
• Motivation for the concept of a general-purpose
Database Management System
– Most notably: relational model (pioneered by Edgar F.
Codd in 1969) and SQL
5
“Big Data” Phenomena
Past:
• Proprietary data in orgs. (enterprises, governments, …)
• Massive-data analyses incurred high machinery/personnel cost
• Data structured/controlled by admins, e-forms, software, …
• Analyses by specialized teams of heavily trained experts
Present:
• Proliferation of publicly open data sources (Web, social, …)
• Business models (cloud, crowd, open source) facilitate analyses
• Uncontrolled data from humans’ free text, heterogeneous KBs, …
• Analyses by a wide community featuring a wide range of skills
6
“By 2018, the United States alone could face a shortage of 140,000
to 190,000 people with deep analytical skills as well as 1.5 million
managers and analysts with the know-how to use the analysis of
big data to make effective decisions.”
“Big data: The next frontier for innovation, competition, and productivity”
McKinsey Report, May 2011
We need dev. & management systems to
facilitate value extraction from Big Data
by a wide range of users / skills
7
Core Task: Information Extraction (IE)
“Information Extraction (IE) is the name given to any process
which selectively structures and combines data which is found,
explicitly stated or implied, in one or more texts. The final
output of the extraction process varies; in every case, however, it
can be transformed so as to populate some type of database.”
J. Cowie and Y. Wilks., Handbook of
Natural Language Processing, 2000
“Information extraction is the identification, and consequent or concurrent
classification and structuring into semantic classes, of specific
information found in unstructured data sources, such as natural language
text, making the information more suitable for information processing tasks.”
M. F. Moens, Information Extraction: Algorithms
and Prospects in a Retrieval Context, 2006
In short: data-in-text (unstructured) → data-in-db (structured)
8
Popular Classes of IE Tasks
• Named Entity Recognition
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
[Annotations: Turing, Church: person; Princeton University, Princeton: organization]
9
Popular Classes of IE Tasks
AdvisedBy
WorksIn
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
• Named Entity Recognition
• Relation Extraction
10
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Graduation
Where?
Who?
• Named Entity Recognition
• Relation Extraction
• Event Extraction
11
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
Education
Start End
Graduation
When?
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
12
Popular Classes of IE Tasks
From September 1936 to July 1938,
Turing spent most of his time studying
under Church at Princeton University.
In June 1938, he obtained his PhD
from Princeton.
SameEntity
SameEntity
• Named Entity Recognition
• Relation Extraction
• Event Extraction
• Temporal IE
• Coreference Resolution
13
[Figure: Implementations of Entity Extraction, from a screenshot of the paper by Chiticariu, Li, Reiss (IBM Research - Almaden)]
                              Rule-Based   Hybrid   Machine-Learning-Based
NLP Papers (2003-2012)        3.5%         21%      75%
Commercial Vendors (2013)
  Large Vendors               67%          17%      17%
  All Vendors                 45%          22%      33%
IE Paradigms: Rules & Statistics
• Rules
• ML classification
• Probabilistic graphical models
• Soft logic
[Chiticariu, Li, Reiss, EMNLP’13]
• EMNLP, ACL, NAACL, 2003-2012
• 54 industrial vendors (Who’s
Who in Text Analytics, 2012)
“[…] rules are effective,
interpretable, and are easy
to customize by non-experts
to cope with errors.”
Gupta & Manning, CONLL’14
14
+
NLP
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
15
Xlog: Datalog for IE
• Extension of (non-recursive) Datalog
• Use case: DBLife (db research kb: dblife.cs.wisc.edu)
• Data types: string, document, span
– Focus on single-document programs
• “Procedural predicates” (p-predicates) are user-defined
functions that produce relations over spans
– Example: sentence(doc, span)
• Query-plan optimization
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Same string, different spans
Span [42,47)
16
Xlog Example
“Declarative Information Extraction using Datalog with Embedded Extraction Predicates”
[Shen, Doan, Naughton, Ramakrishnan, VLDB 2007]
[Figure: Xlog program with callouts: regex (string); unary regex formula; binary regex formula]
17
Instaread: Datalog + NLP
• Datalog syntax
– Types: string, span
• Built-in collection of p-predicates
– Various types of built-in regex formulas
– Linguistic: deep parsing, coreference resolution, named-entity extractor
[Figure: example program with callouts: unary regex formulas; binary regex formulas]
[Hoffmann, 2012]
18
IBM SystemT: SQL for IE
• Engine for AQL: SQL-like declarative IE lang.
– AQL = Annotation Query Language
• SystemT = AQL + Runtime + Dev. Tooling
– [Chiticariu et al., ACL 2010]: position SystemT as a
high-quality and high-efficiency IE solution
– System and IDE demos in ACL 2011, SIGMOD 2011
• Commercial product, high academic presence
– Integration on public financial records [Hernández et al., EDBT’13,
Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10,
ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et
al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et
al., Interact’13], social media [Sindhwani et al., IBM Journal 2011]
19
SystemT’s AQL Example
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
[Figure: AQL program with callouts: unary regex formulas; regex + join w/ previous views; projection; union; cleaning]
20
Formal Framework
• Repeated concept: Extend a relational query
language with text transducers (p-predicates,
usually regex formulas)
• Research challenge: theoretical underpinnings
of this combined document/relation model
• Expressive power
– Query-plan optimization: Can we rewrite an operator via “easier”
building blocks?
– System extensions: Can we express a new operation using
existing ones, or prove impossibility?
• Next: a formal framework
– With Fagin, Reiss, Vansummeren, PODS’13, JACM
21
22
Terminology
Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul
Otellini and the Intel board had no idea what they were in for when
the company announced it was acquiring McAfee on August 19,
2010.
Relation over spans from the document:
Company                   CEO                          CompanyCEO
[1,14) (Kaspersky Lab)    [19,36) (Eugene Kaspersky)   [1,36)
[42,47) (Intel)           [52,65) (Paul Otellini)      [42,65)
Example span: [52,65)
Document Spanners
Document d Relation over the spans of d
Kaspersky Lab CEO Eugene
Kaspersky said Intel CEO Paul Otellini
and the Intel board had no idea what
they were in for when the company
announced it was acquiring McAfee
on August 19, 2010.
x y z
[1,14) [30,36) [1,36)
[42,47) [52,65) [42,65)
[102,110) [115,125) [102,125)
Document Spanner: a function that maps every
doc. (string) into a relation over the doc.’s spans
More formally:
• Finite alphabet Σ of symbols
• A spanner maps each doc. d ∈ Σ* into a relation over the spans [i,j) of d
• The relation has a fixed signature (set of attributes)
− The attributes come from an infinite domain of variables x, y, z, …
23
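The definition of a document spanner can be made concrete with a small Python sketch. It assumes 1-based, half-open spans [i,j) as on the slides; the helper names (`all_spans`, `span_text`, `occurrences`) and the toy document are illustrative and not taken from the talk.

```python
from typing import Callable, Iterator, Set, Tuple

Span = Tuple[int, int]  # [i, j): 1-based start, exclusive end


def all_spans(d: str) -> Iterator[Span]:
    """Enumerate every span [i, j) of d with 1 <= i <= j <= len(d) + 1."""
    n = len(d)
    for i in range(1, n + 2):
        for j in range(i, n + 2):
            yield (i, j)


def span_text(d: str, s: Span) -> str:
    """Return the substring of d covered by span s."""
    i, j = s
    return d[i - 1 : j - 1]


def occurrences(word: str) -> Callable[[str], Set[Tuple[Span]]]:
    """Build a toy unary spanner: document -> relation (set of 1-tuples)
    containing every span whose text equals `word`."""
    def spanner(d: str) -> Set[Tuple[Span]]:
        return {(s,) for s in all_spans(d) if span_text(d, s) == word}
    return spanner


d = "Intel CEO and the Intel board"
intel = occurrences("Intel")
result = sorted(intel(d))  # two tuples: same string, different spans
```

The relation has a single attribute, and the two tuples carry the same string at different positions, which is why spans rather than strings are the atomic values.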
Spanners as Regex Formulas
• Regular expression with embedded variables
• Examples:
– .* x{dddd} .*
– .* in w{Alabama | Alaska | Arizona | …} .*
– (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | …
– In the first example, “.*” is an ordinary regex and “x{…}” is a span variable
• Restriction: each “evaluation” (parse tree) assigns one span to each variable (see [Fagin et al., PODS’13])
Representation system for spanners
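As a rough illustration, the first regex formula can be emulated with Python's `re` module. This sketch assumes that `d` in `x{dddd}` stands for "any digit" and that spans are 1-based and half-open; the function name is hypothetical. A lookahead is used because a regex formula returns one tuple per possible evaluation, including overlapping ones, whereas an ordinary regex search reports only non-overlapping matches.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based, half-open


def four_digit_spans(d: str) -> List[Span]:
    """Evaluate  .* x{dddd} .*  on d, reading `d` as a digit (an assumption).

    The zero-width lookahead lets finditer visit every start position,
    so overlapping four-digit windows are all reported.
    """
    return [(m.start(1) + 1, m.end(1) + 1)
            for m in re.finditer(r"(?=(\d{4}))", d)]


years = four_digit_spans("From 1936 to 1938")
overlapping = four_digit_spans("12345")
```

On the second input the formula has two evaluations, one binding x to the first four digits and one to the last four.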
24
Spanners as Datalog w/ Regex
• Non-recursive Datalog (NR-Datalog)
• Operate over a document (not a relational db)
Token(x) := [ (ε | .*_) x{[a-zA-Z]+} ( ((,|_) .*) | ε) ]
State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*] , Loc(z)
Document: Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia
Result of RETURN(x,z):
x                      z
[1,7) Carter           [13,28) Plains,_Georgia
[30,40) Washington     [46,68) Westmoreland,_Virginia
EDBs = Spanners!
RETURN is the query goal
Another representation system for spanners
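The NR-Datalog program can be evaluated naively on the example document. The following Python sketch assumes 1-based half-open spans and reads the separator after a token as "comma or underscore"; each function mirrors one rule, the function names are illustrative, and no query-plan optimization is attempted.

```python
import re
from typing import Set, Tuple

Span = Tuple[int, int]  # 1-based, half-open

DOC = "Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia"


def text(d: str, s: Span) -> str:
    return d[s[0] - 1 : s[1] - 1]


def token(d: str) -> Set[Span]:
    """Token(x): letters delimited by start/'_' on the left, ','/'_'/end on the right."""
    out = set()
    for m in re.finditer(r"[a-zA-Z]+", d):
        left_ok = m.start() == 0 or d[m.start() - 1] == "_"
        right_ok = m.end() == len(d) or d[m.end()] in ",_"
        if left_ok and right_ok:
            out.add((m.start() + 1, m.end() + 1))
    return out


def state(d: str) -> Set[Span]:
    return {s for s in token(d) if text(d, s) in {"Georgia", "Virginia", "Washington"}}


def cap1st(d: str) -> Set[Span]:
    return {s for s in token(d) if text(d, s)[0].isupper()}


def comma_sp(d: str) -> Set[Tuple[Span, Span, Span]]:
    """CommaSp(x, y, z): z is x, then ',_', then y, for arbitrary spans x and y."""
    out = set()
    for m in re.finditer(r",_", d):
        c = m.start() + 1                        # 1-based position of the comma
        for i in range(1, c + 1):                # x = [i, c)
            for j in range(c + 2, len(d) + 2):   # y = [c + 2, j)
                out.add(((i, c), (c + 2, j), (i, j)))
    return out


def loc(d: str) -> Set[Span]:
    caps, states = cap1st(d), state(d)
    return {z for (x, y, z) in comma_sp(d) if x in caps and y in states}


def answer(d: str) -> Set[Tuple[Span, Span]]:
    """RETURN(x, z): capitalized token x, then '_from_', then a location z."""
    sep = "_from_"
    locations = loc(d)
    return {(x, z) for x in cap1st(d) for z in locations
            if z[0] == x[1] + len(sep) and text(d, (x[1], z[0])) == sep}


rows = answer(DOC)
pairs = {(text(DOC, x), text(DOC, z)) for x, z in rows}
```

Note that Loc also contains the span of "Georgia,_Washington", but it never reaches the result because no capitalized token is followed by "_from_" at that position.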
25
Spanners as Automata
[Figure: three automata running over the input 1 0 0 1 1 1 0 1]
• Ordinary NFA: transitions labeled 0, 1
• Var-Stack Automaton: adds transitions x{, y{ and }, where } closes the most recently opened variable (e.g., y{ x{ } })
• Var-Set Automaton: adds transitions x{, y{ and }x, }y, where }x closes x explicitly (e.g., y{ x{ }x }y)
• In an accepting run, each variable opens and later closes exactly once
⇒ Each accepting run defines an assignment to the variables
• Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
Another representation system for spanners
26
Study of Expressive Power
Spanners definable by regex formulas
= Spanners definable by var-stack automata
Spanners definable by var-set automata
= Spanners definable by NR Datalog w/ regex formulas
[Figure: thumbnails of the regex formula, var-stack automaton, var-set automaton and NR-Datalog program from the preceding slides]
27
Consequences
• Connections between Datalog+regex
spanners and other language formalisms
– Classic string relations [Berstel 79]
– Graph queries (CRPQs) [Cruz et al. 87]
• Extension with string equality & difference
– Expressiveness / closure properties
• Principles for cleaning inconsistencies
– Follow up work [PODS’14]
– Next…
28
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
29
Next, highlight 3 lines of foundational research that
were motivated by our work on text analytics:
1. Database inconsistency w/ repair priorities
2. Frequent subgraph mining
3. Update propagation
30
Cleaning IE Inconsistencies
• Extractors may produce inconsistent results
– Data artifacts
– Developer limitations
• Rather than repairing the existing extractors, common practice is to clean (intermediate) results
– SystemT “consolidators” [Chiticariu et al. 10]
– GATE/JAPE “controls” [Cunningham 02]
– Implicit in other rule systems, e.g., WHISK [Soderland 99]
– POSIX regex disambiguation [Fowler 03]
Example: 33 Martin Luther King Jr. Dr., SE, Atlanta, GA 30303
[Overlapping annotations on the example: Person1, Person2, Address1]
31
SystemT Consolidators
[Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010]
Other policies
built in
32
Five GATE/JAPE Controls
• All
• Once
• First
• Appelt
• Brill
Document spanner: .* x{dd+} .*
Document: Sequence 12345 and sequence 12.
[Screenshots from GATE UI]
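To see why some cleaning policy is needed, consider what the spanner `.* x{dd+} .*` returns on the example sentence before any control is applied. The Python sketch below assumes `d` stands for a digit and uses 1-based half-open spans. The cleaning step shown (drop every span strictly contained in another) is only one possible policy chosen for illustration; it is not a reimplementation of any specific GATE/JAPE control or SystemT consolidator.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based, half-open


def digit_runs(d: str) -> List[Span]:
    """All spans x admitted by  .* x{dd+} .*  : two or more consecutive digits."""
    n = len(d)
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(i + 2, n + 2)
            if re.fullmatch(r"[0-9]{2,}", d[i - 1 : j - 1])]


def drop_contained(spans: List[Span]) -> List[Span]:
    """Illustrative cleaning policy: keep spans not strictly inside another span."""
    def inside(a: Span, b: Span) -> bool:
        return a != b and b[0] <= a[0] and a[1] <= b[1]
    return [s for s in spans if not any(inside(s, t) for t in spans)]


doc = "Sequence 12345 and sequence 12."
raw = digit_runs(doc)        # every sub-run of length >= 2 is a separate answer
clean = drop_contained(raw)  # only the two maximal runs survive
```

The uncleaned relation has eleven tuples for a sentence that a human reads as containing two numbers, which is the kind of inconsistency the controls are meant to resolve.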
33
Cleaning via Prioritized Repairs
• Problem: existing policies are ad-hoc; how to
expose a language for user declaration?
• [Fagin, K, Reiss, Vansummeren 2014]: spanner
formalism for declarative cleaning
• Key: prioritized repairs [Staworko, et al. 12]
• Idea: Extend extraction programs with
– Denial constraints: which facts are in conflict?
– Priority declarations: preference between facts
• Captures SystemT, GATE, WHISK, POSIX, …
• We are now trying to improve our understanding
of prioritized repairs…
34
Prioritized Repairs: Definition
Inconsistent Database Instance:
• Database: collection of facts
• Denial constraints: which sets of facts cannot co-exist?
• Priority relation: binary “is preferred to” relation
• [Arenas, Bertossi, Chomicki 99]: Inconsistent DB represents a set of (equally likely) “repairs”
– Then we can ask for the “possible” or “consistent” query answers
• [Staworko, Chomicki, Marcinkowski 12] add priorities:
– Let A and B be two consistent subsets of the database
– Say that A improves B if we can obtain A from B by a “profitable” exchange of facts (precision later…)
– A repair is a consistent subset that cannot be improved
35
Example
professor   university   city
Monica      ubiobio      Concepción
Monica      carleton     Ottawa
Jorge       uchile       Santiago
Jorge       ubiobio      Santiago
Pablo       uchile       Santiago
Violated constraints (functional dependencies):
• professor → university, city (“key constraint”)
• university → city
[Figure: further copies of the table highlighting “ordinary” repairs [Arenas et al. 99]]
Tuple priority ⇒ some repairs can be discarded [Staworko et al.]
A improves B if we get A from B by removing tuples & adding tuples, where for each removed tuple some added tuple is preferred to it
36
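The notions of repair and improvement can be checked by brute force on the five-row example. This Python sketch follows the informal definition of "A improves B" given above; the priority used (the ubiobio row for Monica is preferred to the carleton row) is a hypothetical choice for illustration and does not come from the slides. With an empty priority relation the procedure returns the ordinary repairs.

```python
from itertools import combinations

ROWS = [
    ("Monica", "ubiobio", "Concepción"),
    ("Monica", "carleton", "Ottawa"),
    ("Jorge", "uchile", "Santiago"),
    ("Jorge", "ubiobio", "Santiago"),
    ("Pablo", "uchile", "Santiago"),
]


def conflict(s, t):
    """Rows conflict if they violate professor -> university, city or university -> city."""
    same_prof = s[0] == t[0] and s[1:] != t[1:]
    same_univ = s[1] == t[1] and s[2] != t[2]
    return same_prof or same_univ


def consistent_subsets(rows):
    for k in range(len(rows) + 1):
        for sub in combinations(rows, k):
            if not any(conflict(s, t) for s, t in combinations(sub, 2)):
                yield frozenset(sub)


def improves(a, b, prefer):
    """a improves b: something is added, and every removed row is beaten by an added row.
    `prefer` is a set of (better, worse) pairs."""
    added, removed = a - b, b - a
    return bool(added) and all(any((x, r) in prefer for x in added) for r in removed)


def repairs(rows, prefer):
    """Consistent subsets that no other consistent subset improves."""
    subs = list(consistent_subsets(rows))
    return {b for b in subs if not any(improves(a, b, prefer) for a in subs)}


ordinary = repairs(ROWS, set())
# Hypothetical priority: Monica@ubiobio is preferred to Monica@carleton.
prioritized = repairs(ROWS, {(ROWS[0], ROWS[1])})
```

Under this priority, the repair that keeps Monica at carleton together with Jorge at uchile is discarded, because swapping in the preferred Monica row yields a consistent improvement.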
Complexity of Testing Improvability
Can a consistent subset be improved?
Theorem:
• In the case of a single functional dependency or two keys per relation, improvability can be tested in polynomial time
• For any other combination of FDs, the problem is NP-complete!
Example with two keys:
university   faculty     dean
UChile       Economics   Agosin
Technion     CS          Yavneh
Stanford     Law         Magill
Recent work (unpublished) w/ Fagin & Kolaitis
37
IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
1. Apply
dependency
parsing
38
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
IE with Recurring Patterns
I want to buy my advisor a gift.
I really want to buy a gift to my advisor.
I want to buy a gift to the secretary and to my advisor.
I
want
buy
gift advisor
1. Apply
dependency
parsing
2. Find freq.
recurring
patterns
39
[Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, sms, etc.
Maximal Frequent Subgraphs
[Figure: input graphs g1, g2, g3, g4 with threshold = 3; patterns are marked “Freq.” (frequent) or “Freq. Max.” (maximal frequent)]
Complexity Study
• Naturally, there has been a lot of work on this problem
– SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], …
• But little was known about the computational complexity
• Studied: impact of assumptions on comp. complexity
– Graph properties (e.g., trees, treewidth, etc.)
– Label repeatability
– Bounded #results desired
– Bounded threshold
• This work led to novel complexity results and new
methodologies for mining maximal subgraphs
– [K & Kolaitis, ACM PODS’13, ACM TODS]
• Next, some complexity nuggets
41
Complexity Nuggets
• Good news: If labels do not repeat in each input
graph, then there are PTime solutions when
– The threshold is bounded; or
– Graphs are trees & few results are desired
• In general graphs w/o label repetition, you can
find 2 results in PTime
– Bad news: But finding 3rd is NP-hard!
– Bad news: And if labels repeat and graphs are
trees, then finding 2nd is already NP-hard!
• Even for a bounded threshold
42
Improving Dictionaries w/ Feedback
[Figure: IE pipeline over Web data]
• Input: text fragments (sentences, tables, rows, …) from Web data
• IE produces company occurrences and address occurrences (companies, countries, …)
• Join of the two: IBM, San Jose; Apple, Cupertino; IBM, Armonk; Yahoo!, Cupertino; Goo [cut off in slide]
• Goal: auto. suggest a “good” fix to the IE program
– “good” = small effect on other results
View Updates
• View-update problem: Translate an update on a view to
an update on the base relations
• Deletion propagation as a special case
– Update is delete(a set of view tuples)
• Motivation:
– Classic: database/view maintenance
• DB access only through views, hidden join keys, etc.
– Debugging
• [K&al.12]: deletion propagation for debugging text extractors
– Database causality [Meliou&al.10]
• Intuition: good propagation provides a good explanation of why we
have the tuples to begin with
• [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis,
Repairs and View-Updates in Databases”
44
Example: File Access
Access(u,f) :– UserGroup(u,g), GroupFile(g,f)
Access = UserGroup ⋈ GroupFile
[Cui&Widom01; Buneman&al.02]
UserGroup
user     group
Emma     ai
Emma     db
Olivia   os
Olivia   db
Jacob    ai
GroupFile
group    file
ai       a.txt
ai       b.txt
db       a.txt
db       b.txt
os       a.txt
Access
user     file
Emma     a.txt
Emma     b.txt
Olivia   a.txt
Olivia   b.txt
Jacob    a.txt
Jacob    b.txt
Delete source rows, s.t. Emma won’t access a.txt.
But, maintain maximum access permissions!
45
Decision variant is NP-complete [Buneman et al. 02]
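For an instance this small, the deletion-propagation task can be solved by exhaustive search. The Python sketch below is a brute-force illustration only (exponential in the number of source rows, in line with the hardness result) and is not the algorithm of any cited paper; the function names are hypothetical. It looks for a set of source rows whose deletion removes (Emma, a.txt) from the view while losing as few other view tuples as possible.

```python
from itertools import combinations

USER_GROUP = {("Emma", "ai"), ("Emma", "db"), ("Olivia", "os"),
              ("Olivia", "db"), ("Jacob", "ai")}
GROUP_FILE = {("ai", "a.txt"), ("ai", "b.txt"), ("db", "a.txt"),
              ("db", "b.txt"), ("os", "a.txt")}


def access(ug, gf):
    """Access(u, f) :- UserGroup(u, g), GroupFile(g, f)."""
    return {(u, f) for (u, g) in ug for (g2, f) in gf if g == g2}


def best_deletion(ug, gf, unwanted):
    """Try every set of source rows, smallest first; return (deleted, side_effect)
    with the fewest lost view tuples other than the unwanted ones."""
    facts = ([("UserGroup", t) for t in sorted(ug)] +
             [("GroupFile", t) for t in sorted(gf)])
    before = access(ug, gf)
    best = None
    for k in range(len(facts) + 1):
        for deleted in combinations(facts, k):
            ug2 = ug - {t for (rel, t) in deleted if rel == "UserGroup"}
            gf2 = gf - {t for (rel, t) in deleted if rel == "GroupFile"}
            after = access(ug2, gf2)
            if after & unwanted:
                continue  # the unwanted view tuple is still derivable
            side_effect = (before - unwanted) - after
            if best is None or len(side_effect) < len(best[1]):
                best = (set(deleted), side_effect)
    return best


deleted, side_effect = best_deletion(USER_GROUP, GROUP_FILE, {("Emma", "a.txt")})
```

On this instance a side-effect-free solution exists: removing Emma from group ai and removing a.txt from group db cuts both derivations of (Emma, a.txt), while every other access is still derivable through a remaining group.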
47
Trichotomy in Complexity
Fix a schema (w/ FDs) and a CQ w/o self-joins
What is the complexity of finding a solution with a minimal side effect?
We have established a precise (easily testable) criterion
that partitions all cases into 3 categories:
1. The problem is solvable in PTime, and even via a
straightforward algorithm [Buneman et al. 02]
2. The problem is NP-hard, but constant-ratio
approximable in PTime (ILP relaxation)
3. The problem is inapproximable for every ratio
[K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]
48
Outline
• Text Analytics in the Big Data Era
• Information Extraction Systems & Formalism
• Foundational Research Challenges
• Conclusions and Outlook
49
Summary
• Text analytics & IE
• Rule systems for IE
• A formal framework for rules, relating IE to
traditional DB concepts such as Datalog
• Research directions motivated by IE
– Prioritized repairs
– Graph mining
– Update propagation
50
Outlook: DB w/ Deep Text Support
• We need a uniform & elegant data/query model to
combine structured data & text, useful for querying
both text and relations
• We need a principled, simple & transparent probability
model + effective quality + practical execution cost
• We need to balance between automation and control:
from full specification by experts to feature generation for
non-experts
– Maximally realize the potential of every developer!
– LogicBlox is working on incorporating ML in Datalog!
51
BACKUP SLIDES
52
Room for Both
Statistical
Solution
Rule
System
Feature Engineering
Model Space, Runtime
Cleaning + Post Proc.
Cleaning + Post Proc.
Building blocks
(e.g., dictionaries, NER)
“What doesn’t work: Anything requiring high
precision and full automation”
Feldman & Ungar, KDD’08 tutorial on text mining
53
String DB, Spanners, Interval Algebra
String Databases
• Example tuples: (Kaspersky, Kaspersky), (Intel, Otellini), (IBM, Rometty)
• Atomic value: string
• Join by string conditions (e.g., x is a substring of y)
• Apps: text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98]
Spanners
• Example tuples: ([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))
• Atomic value: span (pointing to doc)
• Join by interval+string conditions (e.g., x a token in y)
• App: IE
Interval Algebra
• Example tuples: ([10,20), [16,26)), ([32,37), [50,58)), ([105,108), [121,128))
• Atomic value: interval (no text)
• Join by interval conditions (e.g., x is a sub-interval of y)
• Apps: temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & Bürckert 95] [Krokhin et al. 03]
54
55
Imp. 1: Connection to Known Concepts
• Connection to Recognizable Relations [Berstel 79]
– These are unions of cross products of regular languages
– THM: The class of regular spanners is closed under
a string-selection predicate iff the predicate is a
recognizable relation
• Connection to CRPQs [Cruz et al. 87]
– Conjunctive Regular Path Queries have been studied as a
query language for labeled graphs
– THM: Regular spanners have the same expressive
power as unions of CRPQs on paths “with marked
endpoints”
• Up to some simple and necessary adaptation between the models
[Figure: path with marked endpoints, with nodes labeled S I G M O D]
Imp. 2: Adding String Equality
• NR Datalog w/ regex formulas: Regular Spanners
• + String-equality predicate (+ substring-of, prefix-of, …): Regularstr= Spanners
Document: …application from Jane Doe, social 012-345-6789, on Mar 20th… identified as John Doe, 012-345-6789, ask us to…
NameSSN(x,y) := …
SameSSN(x1,x2) := NameSSN(x1,y1) , NameSSN(x2,y2) , str(y1)=str(y2)
x1                      x2
[117,125) (Jane Doe)    [875,883) (John Doe)
⋮                       ⋮
Same string, different spans
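The string-equality selection in SameSSN can be sketched in Python as a join on the text under two spans rather than on the spans themselves. Since the slide elides the body of NameSSN, the regex used for it here is a hypothetical stand-in, and the document is a shortened variant of the one on the slide (so the offsets differ from those shown).

```python
import re
from typing import Set, Tuple

Span = Tuple[int, int]  # 1-based, half-open


def name_ssn(d: str) -> Set[Tuple[Span, Span]]:
    """Hypothetical NameSSN(x, y): a two-word capitalized name x, a comma,
    an optional 'social', then a ddd-ddd-dddd string y."""
    pat = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+),(?: social)? (\d{3}-\d{3}-\d{4})")
    return {((m.start(1) + 1, m.end(1) + 1), (m.start(2) + 1, m.end(2) + 1))
            for m in pat.finditer(d)}


def same_ssn(d: str) -> Set[Tuple[Span, Span]]:
    """SameSSN(x1, x2) := NameSSN(x1, y1), NameSSN(x2, y2), str(y1) = str(y2)."""
    def txt(s: Span) -> str:
        return d[s[0] - 1 : s[1] - 1]
    pairs = name_ssn(d)
    return {(x1, x2)
            for (x1, y1) in pairs
            for (x2, y2) in pairs
            if txt(y1) == txt(y2)}  # compare strings, not span positions


doc = ("application from Jane Doe, social 012-345-6789, on Mar 20th; "
       "identified as John Doe, 012-345-6789, ask us to")
matches = same_ssn(doc)
names = {(doc[a[0] - 1 : a[1] - 1], doc[b[0] - 1 : b[1] - 1]) for a, b in matches}
```

As the rule is written, the result also contains the symmetric and reflexive pairs; the two SSN spans are different spans of the document, and only the comparison of their strings makes the join succeed.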
56
Difference with String Equality
• Are regularstr= spanners closed under difference?
– Why should they? Only positive operators are used…
– However, regex formulas (our EDBs) can introduce
“negative” operations (NFAs closed under complement)
• THM: The class of regular spanners is closed under
difference
• PROP: The class of regularstr= spanners is closed
under string-inequality selection
• THM: The class of regularstr= spanners is closed
under string-containment selection, but then, not
under non-string-containment selection!
• COR: The class of regularstr= is not closed under
difference
57
Formal Optimization Problem
Fixed: • Schema S w/ fun. dependencies
• Conjunctive query Q
Input: • Database instance I over S
• Set A⊆ Q(I) of answers to delete
Output: J ⊆ I s.t. Q(J) ∩ A = ∅
Goal: Minimize |(Q(I) – A) – Q(J)|
Side Effect
58
 
Synthesys Technical Overview
Synthesys Technical OverviewSynthesys Technical Overview
Synthesys Technical Overview
 
DBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support EngineDBLP-SSE: A DBLP Search Support Engine
DBLP-SSE: A DBLP Search Support Engine
 
Data and Information Integration: Information Extraction
Data and Information Integration: Information ExtractionData and Information Integration: Information Extraction
Data and Information Integration: Information Extraction
 
Open IE tutorial 2018
Open IE tutorial 2018Open IE tutorial 2018
Open IE tutorial 2018
 
Database Essay
Database EssayDatabase Essay
Database Essay
 
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & ImpactData Science - An emerging Stream of Science with its Spreading Reach & Impact
Data Science - An emerging Stream of Science with its Spreading Reach & Impact
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
 
Rule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes ReportsRule-based Information Extraction for Airplane Crashes Reports
Rule-based Information Extraction for Airplane Crashes Reports
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
Search Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By DesignSearch Solutions 2011: Successful Enterprise Search By Design
Search Solutions 2011: Successful Enterprise Search By Design
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
 

More from Pedro Contreras Flores

El dilema de las redes sociales
El dilema de las redes sociales El dilema de las redes sociales
El dilema de las redes sociales
Pedro Contreras Flores
 
Tipos de sistemas de información
Tipos de sistemas de informaciónTipos de sistemas de información
Tipos de sistemas de información
Pedro Contreras Flores
 
Servicio de información para bibliotecas
Servicio de información para bibliotecasServicio de información para bibliotecas
Servicio de información para bibliotecas
Pedro Contreras Flores
 
Gestión del conocimiento
Gestión del conocimientoGestión del conocimiento
Gestión del conocimiento
Pedro Contreras Flores
 
Business intelligence (bi) y big data0
Business intelligence (bi) y big data0Business intelligence (bi) y big data0
Business intelligence (bi) y big data0
Pedro Contreras Flores
 
Bibliotecas moviles y calidad
Bibliotecas moviles y calidadBibliotecas moviles y calidad
Bibliotecas moviles y calidad
Pedro Contreras Flores
 
Sistemas y servicios de informacion intro
Sistemas y servicios de informacion introSistemas y servicios de informacion intro
Sistemas y servicios de informacion intro
Pedro Contreras Flores
 
Plataforma de Digitalización
Plataforma de DigitalizaciónPlataforma de Digitalización
Plataforma de Digitalización
Pedro Contreras Flores
 
Red de transporte urbano
Red de transporte urbanoRed de transporte urbano
Red de transporte urbano
Pedro Contreras Flores
 
Packing
PackingPacking
Hormigas arfificiales - Mauro San Martín
Hormigas arfificiales - Mauro San MartínHormigas arfificiales - Mauro San Martín
Hormigas arfificiales - Mauro San Martín
Pedro Contreras Flores
 
Tecnologías de la información
Tecnologías de la informaciónTecnologías de la información
Tecnologías de la información
Pedro Contreras Flores
 
Modelamiento y simulación
Modelamiento y simulaciónModelamiento y simulación
Modelamiento y simulación
Pedro Contreras Flores
 
Java 3D
Java 3DJava 3D
Complementos de programación
Complementos de programaciónComplementos de programación
Complementos de programación
Pedro Contreras Flores
 
4 memoria dinamica
4 memoria dinamica4 memoria dinamica
4 memoria dinamica
Pedro Contreras Flores
 
3 recursividad
3 recursividad3 recursividad
3 recursividad
Pedro Contreras Flores
 
2 punteros y lenguaje c
2 punteros y lenguaje c2 punteros y lenguaje c
2 punteros y lenguaje c
Pedro Contreras Flores
 
Programación grafica en lenguaje c
Programación grafica en lenguaje cProgramación grafica en lenguaje c
Programación grafica en lenguaje c
Pedro Contreras Flores
 
2 archivos
2 archivos2 archivos

More from Pedro Contreras Flores (20)

El dilema de las redes sociales
El dilema de las redes sociales El dilema de las redes sociales
El dilema de las redes sociales
 
Tipos de sistemas de información
Tipos de sistemas de informaciónTipos de sistemas de información
Tipos de sistemas de información
 
Servicio de información para bibliotecas
Servicio de información para bibliotecasServicio de información para bibliotecas
Servicio de información para bibliotecas
 
Gestión del conocimiento
Gestión del conocimientoGestión del conocimiento
Gestión del conocimiento
 
Business intelligence (bi) y big data0
Business intelligence (bi) y big data0Business intelligence (bi) y big data0
Business intelligence (bi) y big data0
 
Bibliotecas moviles y calidad
Bibliotecas moviles y calidadBibliotecas moviles y calidad
Bibliotecas moviles y calidad
 
Sistemas y servicios de informacion intro
Sistemas y servicios de informacion introSistemas y servicios de informacion intro
Sistemas y servicios de informacion intro
 
Plataforma de Digitalización
Plataforma de DigitalizaciónPlataforma de Digitalización
Plataforma de Digitalización
 
Red de transporte urbano
Red de transporte urbanoRed de transporte urbano
Red de transporte urbano
 
Packing
PackingPacking
Packing
 
Hormigas arfificiales - Mauro San Martín
Hormigas arfificiales - Mauro San MartínHormigas arfificiales - Mauro San Martín
Hormigas arfificiales - Mauro San Martín
 
Tecnologías de la información
Tecnologías de la informaciónTecnologías de la información
Tecnologías de la información
 
Modelamiento y simulación
Modelamiento y simulaciónModelamiento y simulación
Modelamiento y simulación
 
Java 3D
Java 3DJava 3D
Java 3D
 
Complementos de programación
Complementos de programaciónComplementos de programación
Complementos de programación
 
4 memoria dinamica
4 memoria dinamica4 memoria dinamica
4 memoria dinamica
 
3 recursividad
3 recursividad3 recursividad
3 recursividad
 
2 punteros y lenguaje c
2 punteros y lenguaje c2 punteros y lenguaje c
2 punteros y lenguaje c
 
Programación grafica en lenguaje c
Programación grafica en lenguaje cProgramación grafica en lenguaje c
Programación grafica en lenguaje c
 
2 archivos
2 archivos2 archivos
2 archivos
 

Recently uploaded

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 

Recently uploaded (20)

一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 

Text Analytics - JCC2014 Kimelfeld

  • 1. Foundational Research Propelled by Text Analytics Benny Kimelfeld LogicBlox
  • 2. Preamble • Myself: – Ph.D. @ HebrewU (DB uncertainty + search) – IBM Almaden (DB theory, IR, Text Analytics) – LogicBlox (ML in DB, Prob. Programming) – Technion IL (Associate Prof., next year) • This talk:  Infrastructure for text analytics + DB theory, formal languages, NLP, data mining, computational complexity, … 2
  • 3. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 3
  • 4. Text Analytics Matters Some important applications are based on the analysis of text-centric data; for example: Semantic Search Semantic understanding & indexing of content to better match user's intent Life-Science Mining Extract knowledge bases from scientific publications e-Commerce Comparison Shopping extracts & compares inventory from online sources CRM / BI Monitor customer’s social-media activity for sentiment & business leads Log Analysis Summarize, visualize and analyze logs produced by machines 4
  • 5. Database Management Systems • Old news: Data management is involved! – Data semantics, query/analysis semantics, storage, query evaluation, indices, consistency, transactions, backup, privacy, recovery, … – From-scratch engineering is highly challenging • Motivation to the concept of a general-purpose Database Management System – Most notably: relational model (pioneered by Edgar F. Codd in 1969) and SQL 5
  • 6. “Big Data” Phenomena
      Past → Present:
      – Proprietary data in orgs. (enterprises, governments, …) → Proliferation of publicly open data sources (Web, social, …)
      – Massive-data analyses incurred high machinery/personnel cost → Business models (cloud, crowd, open source) facilitate analyses
      – Data structured/controlled by admins, e-forms, software, … → Uncontrolled data from humans’ free text, heterogeneous KBs, …
      – Analyses by specialized teams of heavily trained experts → Analyses by a wide community featuring a wide range of skills
  • 7. “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” “Big data: The next frontier for innovation, competition, and productivity” McKinsey Report, May 2011 We need dev. & management systems to facilitate value extraction from Big Data by a wide range of users / skills 7
  • 8. Core Task: Information Extraction (IE) “Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database.” J. Cowie and Y. Wilks, Handbook of Natural Language Processing, 2000 “Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, making the information more suitable for information processing tasks.” M. F. Moens, Information Extraction: Algorithms and Prospects in a Retrieval Context, 2006 In short: data-in-text (unstructured) → data-in-db (structured)
  • 9. Popular Classes of IE Tasks • Named Entity Recognition From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. person person organization organization 9
  • 10. Popular Classes of IE Tasks • Named Entity Recognition • Relation Extraction From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Relations: AdvisedBy, WorksIn
  • 11. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Graduation Where? Who? • Named Entity Recognition • Relation Extraction • Event Extraction 11
  • 12. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. Education Start End Graduation When? • Named Entity Recognition • Relation Extraction • Event Extraction • Temporal IE 12
  • 13. Popular Classes of IE Tasks From September 1936 to July 1938, Turing spent most of his time studying under Church at Princeton University. In June 1938, he obtained his PhD from Princeton. SameEntity SameEntity • Named Entity Recognition • Relation Extraction • Event Extraction • Temporal IE • Coreference Resolution 13
  • 14. IE Paradigms: Rules & Statistics
      • Rules • ML classification • Probabilistic graphical models • Soft logic (+ NLP)
      Implementations of entity extraction [Chiticariu, Li, Reiss, EMNLP’13]:
      – NLP papers (EMNLP, ACL, NAACL, 2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning based
      – Commercial vendors (2013; 54 industrial vendors, Who’s Who in Text Analytics, 2012):
        large vendors: 67% rule-based, 17% hybrid, 17% machine-learning based
        all vendors: 45% rule-based, 22% hybrid, 33% machine-learning based
      “[…] rules are effective, interpretable, and are easy to customize by non-experts to cope with errors.” Gupta & Manning, CONLL’14
  • 15. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 15
  • 16. Xlog: Datalog for IE • Extension of (non-recursive) Datalog • Use case: DBLife (db research kb: dblife.cs.wisc.edu) • Data types: string, document, span – Focus on single-document programs • “Procedural predicates” (p-predicates) are user-defined functions that produce relations over spans – Example: sentence(doc, span) • Query-plan optimization [Shen, Doan, Naughton, Ramakrishnan, VLDB 2007] Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010. Same string, different spans Span [42,47) 16
  • 17. Xlog Example “Declarative Information Extraction using Datalog with Embedded Extraction Predicates” [Shen, Doan, Naughton, Ramakrishnan, VLDB 2007] Regex. (string) Unary regex formula Binary regex formula 17
  • 18. • Datalog syntax – Types: string, span • Built in collection of p-predicates – Various types of built-in regex formulas – Linguistic: deep parsing, coreference resolution, named-entity extractor Instaread: Datalog + NLP Binary regex formulas Unary regex formulas [Hoffmann, 2012] 18
  • 19. IBM SystemT: SQL for IE • Engine for AQL: SQL-like declarative IE lang. – AQL = Annotation Query Language • SystemT = AQL + Runtime + Dev. Tooling – [Chiticariu et al., ACL 2010]: position SystemT as a high-quality and high-efficiency IE solution – System and IDE demos in ACL 2011, SIGMOD 2011 • Commercial product, high academic presence – Integration on public financial records [Hernández et al., EDBT’13, Balakrishnan et al. SIGMOD’10], NER [Chiticariu et al. EMNLP’10, ACL’10, Nagesh et al. EMNLP’12, Roy et al. SIGMOD’13], IR [Zhu et al. WWW’10, K et al. SIGIR’12, CIKM’12], sentiment analysis [Hu et al., Interact’13], social media [Sindhwani et al., IBM Journal 2011] 19
  • 20. SystemT’s AQL Example [Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010] regex + join w/ previous views projection union Cleaning Unary regex formulas 20
  • 21. Formal Framework • Repeated concept: Extend a relational query language with text transducers (p-predicates, usually regex formulas) • Research challenge: theoretical underpinnings of this combined document/relation model • Expressive power – Query-plan optimization: Can we rewrite an operator via “easier” building blocks? – System extensions: Can we express a new operation using existing ones, or prove impossibility? • Next: a formal framework – With Fagin, Reiss, Vansummeren, PODS’13, JACM 21
  • 22. Terminology
      Document: Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010.
      Span: e.g., [52,65)
      Relation over spans from the document:
      Company | CEO | CompanyCEO
      [1,14) (Kaspersky Lab) | [19,36) (Eugene Kaspersky) | [1,36)
      [42,47) (Intel) | [52,65) (Paul Otellini) | [42,65)
  • 23. Document Spanners
      Document Spanner: a function that maps every doc. (string) into a relation over the doc.’s spans
      Document d: Kaspersky Lab CEO Eugene Kaspersky said Intel CEO Paul Otellini and the Intel board had no idea what they were in for when the company announced it was acquiring McAfee on August 19, 2010.
      Relation over the spans of d:
      x | y | z
      [1,14) | [30,36) | [1,36)
      [42,47) | [52,65) | [42,65)
      [102,110) | [115,125) | [102,125)
      More formally:
      • Finite alphabet Σ of symbols
      • A spanner maps each doc. d ∈ Σ* into a relation over the spans [i,j) of d
      • The relation has a fixed signature (set of attributes)
        – The attributes come from an infinite domain of variables x, y, z, …
  • 24. Spanners as Regex Formulas (a representation system for spanners)
      • Regular expression (ordinary regex) with embedded span variables
      • Examples:
        .* x{\d\d\d\d} .*
        .* in w{Alabama | Alaska | Arizona | …} .*
        (.* z{[A-Z][a-z]*, y{[A-Z][a-z]*}} .*) | …
      • Restriction: each “evaluation” (parse tree) assigns one span to each variable (see [Fagin et al., PODS’13])
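The semantics of a unary regex formula can be illustrated with a small brute-force evaluator. This is only a sketch under stated assumptions: the helper name `eval_unary_formula` and the split of a formula into prefix, body and suffix are hypothetical conveniences, not part of the talk. It uses Python's `re` module and reports spans as 1-indexed, half-open intervals, as on the slides. Unlike `re.finditer`, it returns every assignment, including overlapping and nested ones.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # 1-indexed, half-open [i, j)


def eval_unary_formula(doc: str, prefix: str, body: str, suffix: str) -> List[Span]:
    """Evaluate the formula  prefix x{body} suffix  on doc by brute force.

    A span [i, j) is returned when doc splits into three consecutive parts
    that fully match prefix, body and suffix respectively.
    """
    pre, mid, suf = (re.compile(p) for p in (prefix, body, suffix))
    n = len(doc)
    result = []
    for i in range(n + 1):
        if not pre.fullmatch(doc, 0, i):
            continue
        for j in range(i, n + 1):
            if mid.fullmatch(doc, i, j) and suf.fullmatch(doc, j, n):
                result.append((i + 1, j + 1))  # convert to 1-indexed spans
    return result


doc = "Sequence 12345 and sequence 12."
spans = eval_unary_formula(doc, r".*", r"\d\d+", r".*")
for i, j in spans:
    print((i, j), doc[i - 1:j - 1])
```

The evaluator tries a quadratic number of candidate spans, which is fine for illustration but not how an actual engine would evaluate such formulas.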
  • 25. Spanners as Datalog w/ Regex (another representation system for spanners)
      • Non-recursive Datalog (NR-Datalog)
      • Operate over a document (not a relational db); EDBs = Spanners!
      Program:
        Token(x) := [ (ε | .*_) x{[a-zA-Z]+} ( ((,V_) .*) | ε) ]
        State(x) := Token(x) , [.* x{Georgia|Virginia|Washington}.*]
        Cap1st(x) := Token(x) , [.* x{[A-Z].*}.*]
        CommaSp(x,y,z) := [.* z{x{.*} ,_ y{.*}}.*]
        Loc(z) := CommaSp(x,y,z) , Cap1st(x) , State(y)
        RETURN(x,z) := Cap1st(x) , [.*x{.*}_from_z{.*}.*] , Loc(z)    (query goal)
      Document: Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia
      Result:
      x | z
      [1,7) Carter | [13,28) Plains,_Georgia
      [30,40) Washington | [46,69) Westmoreland,_Virginia
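To make the rule evaluation concrete, the program above can be imitated in Python with sets of spans as relations and set comprehensions as joins. This is a rough sketch, not the authors' engine: the helpers `unary`, `text` and `run` are hypothetical names, the regex atoms are evaluated by brute force, and CommaSp is restricted up front to Cap1st(x) and State(y) to keep the intermediate relation small.

```python
import re

DOC = "Carter_from_Plains,_Georgia,_Washington_from_Westmoreland,_Virginia"


def unary(doc, prefix, body, suffix):
    """All 1-indexed half-open spans x with doc = prefix, then x{body}, then suffix."""
    pre, mid, suf = (re.compile(p) for p in (prefix, body, suffix))
    n = len(doc)
    return {
        (i + 1, j + 1)
        for i in range(n + 1) if pre.fullmatch(doc, 0, i)
        for j in range(i, n + 1)
        if mid.fullmatch(doc, i, j) and suf.fullmatch(doc, j, n)
    }


def text(doc, span):
    """The substring covered by a 1-indexed half-open span."""
    return doc[span[0] - 1:span[1] - 1]


def run(doc):
    # Token(x), State(x), Cap1st(x): unary relations over spans.
    token = unary(doc, r"(?:.*_)?", r"[a-zA-Z]+", r"(?:[,_].*)?")
    state = token & unary(doc, r".*", r"Georgia|Virginia|Washington", r".*")
    cap1st = token & unary(doc, r".*", r"[A-Z].*", r".*")
    # CommaSp(x, y, z) joined with Cap1st(x) and State(y): z covers x ",_" y.
    comma_sp = {
        (x, y, (x[0], y[1]))
        for x in cap1st for y in state
        if y[0] == x[1] + 2 and text(doc, (x[1], y[0])) == ",_"
    }
    loc = {z for (_, _, z) in comma_sp}
    # RETURN(x, z): x, then "_from_", then z.
    return {
        (x, z)
        for x in cap1st for z in loc
        if z[0] == x[1] + 6 and text(doc, (x[1], z[0])) == "_from_"
    }


answers = run(DOC)
for x, z in sorted(answers):
    print(text(DOC, x), "|", text(DOC, z))
```

Note that Loc also contains the span covering "Georgia,_Washington"; it is filtered out by the final rule because it is not preceded by "_from_".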
  • 26. Spanners as Automata (another representation system for spanners)
      • Ordinary NFA: transitions read symbols (0, 1)
      • Var-Stack Automaton: adds transitions x{ (open variable x) and } (close the most recently opened variable)
      • Var-Set Automaton: adds transitions x{ (open variable x) and }x (close variable x)
      • In an accepting run, each variable opens and later closes exactly once ⇒ Each accepting run defines an assignment to the variables
      • Nondeterministic ⇒ multiple accepting runs ⇒ multiple tuples
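The idea that each accepting run yields one tuple can be sketched with a tiny simulator for a var-set automaton. Everything here is an illustrative assumption: the transition encoding (a character, or an `("open", var)` / `("close", var)` pair), the function name `run_vset`, and the example automaton, which is meant to behave like the formula `.* x{1+} .*` over the alphabet {0, 1}.

```python
def run_vset(doc, transitions, start, accept, variables):
    """Enumerate accepting runs and return the set of span tuples they define."""
    results = set()

    def step(state, pos, opened, closed):
        # Accept only at the end of the document with every variable closed.
        if state == accept and pos == len(doc) and len(closed) == len(variables):
            results.add(tuple(closed[v] for v in variables))
        for src, label, dst in transitions:
            if src != state:
                continue
            if isinstance(label, str):
                # Ordinary transition: consume one symbol.
                if pos < len(doc) and doc[pos] == label:
                    step(dst, pos + 1, opened, closed)
            else:
                op, var = label
                if op == "open" and var not in opened and var not in closed:
                    step(dst, pos, {**opened, var: pos + 1}, closed)
                elif op == "close" and var in opened:
                    rest = {v: s for v, s in opened.items() if v != var}
                    step(dst, pos, rest, {**closed, var: (opened[var], pos + 1)})

    step(start, 0, {}, {})
    return results


# Hypothetical automaton for  .* x{1+} .*  over {0, 1}.
TRANSITIONS = [
    ("q0", "0", "q0"), ("q0", "1", "q0"),
    ("q0", ("open", "x"), "q1"),
    ("q1", "1", "q2"), ("q2", "1", "q2"),
    ("q2", ("close", "x"), "q3"),
    ("q3", "0", "q3"), ("q3", "1", "q3"),
]

print(sorted(run_vset("0110", TRANSITIONS, "q0", "q3", ["x"])))
```

Because each variable may open and close at most once and every other transition consumes a symbol, the exhaustive search always terminates, although it may explore exponentially many runs in general.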
  • 27. Study of Expressive Power
      Spanners definable by regex formulas = Spanners definable by var-stack automata
      Spanners definable by var-set automata = Spanners definable by NR-Datalog w/ regex formulas
  • 28. Consequences • Connections between Datalog+regex spanners and other language formalisms – Classic string relations [Berstel 79] – Graph queries (CRPQs) [Cruz et al. 87] • Extension with string equality & difference – Expressiveness / closure properties • Principles for cleaning inconsistencies – Follow up work [PODS’14] – Next… 28
  • 29. • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook Outline 29
  • 30. Next, highlight 3 lines of foundational research that were motivated by our work on text analytics: 1. Database inconsistency w/ repair priorities 2. Frequent subgraph mining 3. Update propagation 30
  • 31. Cleaning IE Inconsistencies
      • Extractors may produce inconsistent results
        – Data artifacts
        – Developer limitations
      • Rather than repairing the existing extractors, common practice is to clean (intermediate) results
        – SystemT “consolidators” [Chiticariu et al. 10]
        – GATE/JAPE “controls” [Cunningham 02]
        – Implicit in other rule systems, e.g., WHISK [Soderland 99]
        – POSIX regex disambiguation [Fowler 03]
      Example: “Martin Luther King Jr. Dr., SE, Atlanta, GA 30303” is annotated with Person1, Person2 and Address1
  • 32. SystemT Consolidators [Chiticariu, Krishnamurthy, Li, Raghavan, Reiss, Vaithyanathan, ACL 2010] Other policies built in 32
  • 33. Five GATE/JAPE Controls: All, Once, First, Appelt, Brill
      Document: “Sequence 12345 and sequence 12.”
      Spanner: .* x{\d\d+} .*
      (Screenshots from GATE UI)
  • 34. Cleaning via Prioritized Repairs • Problem: existing policies are ad-hoc; how to expose a language for user declaration? • [Fagin, K, Reiss, Vansummeren 2014]: spanner formalism for declarative cleaning • Key: prioritized repairs [Staworko, et al. 12] • Idea: Extend extraction programs with – Denial constraints: which facts are in conflict? – Priority declarations: preference between facts • Captures SystemT, GATE, WHISK, POSIX, … • We are now trying to improve our understanding of prioritized repairs… 34
  • 35. Prioritized Repairs: Definition • Setting (inconsistent database instance): – Database: collection of facts – Denial constraints: which sets of facts cannot co-exist? – Priority relation: binary “is preferred to” relation • [Arenas, Bertossi, Chomicki 99]: Inconsistent DB represents a set of (equally likely) “repairs” → Then we can ask for the “possible” or “consistent” query answers • [Staworko, Chomicki, Marcinkowski 12] add priorities: – Let A and B be two consistent subsets of the database – Say that A improves B if we can obtain A from B by a “profitable” exchange of facts (made precise later…) – A repair is a consistent subset that cannot be improved
  • 36. Example
    Violated constraints (functional dependencies):
    • professor → university, city (“key constraint”)
    • university → city
    professor   university   city
    Monica      ubiobio      Concepción
    Monica      carleton     Ottawa
    Jorge       uchile       Santiago
    Jorge       ubiobio      Santiago
    Pablo       uchile       Santiago
    [The slide repeats this table three times, alongside the captions below.]
    • “Ordinary” repairs [Arenas et al. 99]
    • Tuple priority → some repairs can be discarded [Staworko et al.]
    • A improves B if we get A from B by removing tuples and adding tuples, such that for each removed tuple, some added tuple is preferred to it
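The definitions on the last two slides can be tried out with a brute-force Python sketch over the five facts of this example. The facts and the two functional dependencies come from the slide; the priority relation `PREFERS` is hypothetical (the slide's own priorities are not recoverable from the text), and `improves` follows the one-line definition above. Enumerating all subsets is exponential and only meant for a toy instance.

```python
from itertools import combinations

# Facts from the example slide: (professor, university, city).
T1 = ("Monica", "ubiobio", "Concepción")
T2 = ("Monica", "carleton", "Ottawa")
T3 = ("Jorge", "uchile", "Santiago")
T4 = ("Jorge", "ubiobio", "Santiago")
T5 = ("Pablo", "uchile", "Santiago")
FACTS = [T1, T2, T3, T4, T5]

# Hypothetical priority: the carleton fact is preferred to the ubiobio fact for Monica.
PREFERS = {(T2, T1)}

def in_conflict(s, t):
    """True if two distinct facts jointly violate one of the two FDs."""
    same_professor = s[0] == t[0]                      # professor -> university, city
    same_univ_other_city = s[1] == t[1] and s[2] != t[2]  # university -> city
    return s != t and (same_professor or same_univ_other_city)

def consistent_subsets(facts):
    """All subsets of the facts that contain no conflicting pair."""
    for k in range(len(facts) + 1):
        for combo in combinations(facts, k):
            if not any(in_conflict(s, t) for s, t in combinations(combo, 2)):
                yield frozenset(combo)

def ordinary_repairs(facts):
    """Maximal consistent subsets."""
    cons = list(consistent_subsets(facts))
    return [a for a in cons if not any(a < b for b in cons)]

def improves(a, b, prefers):
    """a improves b: a differs from b, and each removed fact is beaten by some added fact."""
    removed, added = b - a, a - b
    return a != b and all(any((x, y) in prefers for x in added) for y in removed)

def prioritized_repairs(facts, prefers):
    """Consistent subsets that no consistent subset improves."""
    cons = list(consistent_subsets(facts))
    return [b for b in cons if not any(improves(a, b, prefers) for a in cons)]

print(len(ordinary_repairs(FACTS)), len(prioritized_repairs(FACTS, PREFERS)))
```

Under this assumed priority, the ordinary repair that keeps the ubiobio fact for Monica is discarded, because swapping in the carleton fact improves it.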
  • 37. Complexity of Testing Improvability • Question: Can a consistent subset be improved? • Theorem: – In the case of a single functional dependency or two keys per relation, improvability can be tested in polynomial time – For any other combination of FDs, the problem is NP-complete! • Example relation with two keys: (university, faculty, dean) with rows (UChile, Economics, Agosin), (Technion, CS, Yavneh), (Stanford, Law, Magill) • Recent work (unpublished) w/ Fagin & Kolaitis
  • 38. IE with Recurring Patterns • Example sentences: “I want to buy my advisor a gift.” “I really want to buy a gift to my advisor.” “I want to buy a gift to the secretary and to my advisor.” • Steps: 1. Apply dependency parsing • [Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, SMS, etc.
  • 39. IE with Recurring Patterns • Example sentences: “I want to buy my advisor a gift.” “I really want to buy a gift to my advisor.” “I want to buy a gift to the secretary and to my advisor.” • Steps: 1. Apply dependency parsing 2. Find freq. recurring patterns • Recurring pattern: I want buy gift advisor • [Zhang, Baldwin, Ho, K, Li, ACL13]: Restoring grammar in social media, SMS, etc.
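  • 40. Maximal Frequent Subgraphs • [Figure: graphs g1, g2, g3, g4 with the annotation “= 3” (left-hand symbol lost in extraction); graph labels “Freq.”, “Freq. Max.”, “Freq. Max.”]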
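A brute-force Python sketch can make the notion of a maximal frequent subgraph concrete. It rests on several assumptions that are not in the slides: node labels do not repeat within a graph (so a pattern is just a set of labeled edges), edges are undirected and unlabeled, and the three input graphs are hypothetical, simplified dependency graphs for the three example sentences, restricted to content words. A pattern is frequent if it is a connected edge set contained in at least `threshold` input graphs, and maximal if no frequent pattern properly contains it.

```python
from itertools import combinations

def edge(a, b):
    """Undirected edge between two uniquely labeled nodes."""
    return frozenset((a, b))

# Hypothetical, simplified dependency graphs for the three example sentences.
G1 = {edge("I", "want"), edge("want", "buy"), edge("buy", "advisor"), edge("buy", "gift")}
G2 = G1 | {edge("want", "really")}
G3 = G1 | {edge("buy", "secretary")}
GRAPHS = [G1, G2, G3]

def is_connected(edges):
    """True if the (non-empty) edge set forms one connected component."""
    nodes = set().union(*edges)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for e in edges:
            if u in e:
                stack.extend(e - seen)
    return seen == nodes

def frequent_patterns(graphs, threshold):
    """All connected edge sets contained in at least `threshold` graphs."""
    universe = sorted(set().union(*graphs), key=sorted)
    found = []
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            pattern = frozenset(combo)
            support = sum(pattern <= g for g in graphs)
            if support >= threshold and is_connected(pattern):
                found.append(pattern)
    return found

def maximal(patterns):
    """Patterns that are not properly contained in another pattern."""
    return [p for p in patterns if not any(p < q for q in patterns)]

result = maximal(frequent_patterns(GRAPHS, threshold=3))
print([sorted(set().union(*p)) for p in result])
```

With threshold 3, the only maximal frequent pattern in this toy input spans the nodes I, want, buy, gift, advisor, which matches the recurring pattern shown on the earlier slide. The enumeration is exponential in the number of distinct edges and is meant only to illustrate the definitions.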
  • 41. Complexity Study • Naturally, there has been a lot of work on this problem – SPIN [Huan et al. 04], MARGIN [Thomas et al. 10], … • But little was known about the computational complexity • Studied: impact of assumptions on comp. complexity – Graph properties (e.g., trees, treewidth, etc.) – Label repeatability – Bounded #results desired – Bounded threshold • This work led to novel complexity results and new methodologies for mining maximal subgraphs – [K & Kolaitis, ACM PODS’13, ACM TODS] • Next, some complexity nuggets
  • 42. Complexity Nuggets • Good news: If labels do not repeat in each input graph, then there are PTime solutions when – The threshold is bounded; or – Graphs are trees & few results are desired • In general graphs w/o label repetition, you can find 2 results in PTime – Bad news: But finding a 3rd is NP-hard! – Bad news: And if labels repeat and graphs are trees, then finding a 2nd is already NP-hard! • Even for a bounded threshold
  • 43. Improving Dictionaries w/ Feedback • [Figure labels: Web data; IE (three times); text fragments (sentences, tables, rows, …); companies, countries, …; company occurrences; address occurrences; join; results “IBM, San Jose”, “Apple, Cupertino”, “IBM, Armonk”, “Yahoo!, Cupertino”, “Goo”] • Auto. suggest a “good” fix to the IE program • “good” = small effect on other results
  • 44. View Updates • View-update problem: Translate an update on a view to an update on the base relations • Deletion propagation as a special case – Update is delete(a set of view tuples) • Motivation: – Classic: database/view maintenance • DB access only through views, hidden join keys, etc. – Debugging • [K&al.12]: deletion propagation for debugging text extractors – Database causality [Meliou&al.10] • Intuition: good propagation provides a good explanation of why we have the tuples to begin with • [Bertossi, Salimi 14]: “Unifying Causality, Diagnosis, Repairs and View-Updates in Databases” 44
  • 45. Example: File Access
    Access(u,f) :– UserGroup(u,g), GroupFile(g,f) [Cui&Widom01; Buneman&al.02]
    UserGroup (user, group): (Emma, ai), (Emma, db), (Olivia, os), (Olivia, db), (Jacob, ai)
    GroupFile (group, file): (ai, a.txt), (ai, b.txt), (db, a.txt), (db, b.txt), (os, a.txt)
    Access (user, file) = UserGroup ⋈ GroupFile: (Emma, a.txt), (Emma, b.txt), (Olivia, a.txt), (Olivia, b.txt), (Jacob, a.txt), (Jacob, b.txt)
    Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions!
  • 47. Example: File Access (same UserGroup, GroupFile, and Access relations as on slide 45) • Access(u,f) :– UserGroup(u,g), GroupFile(g,f) [Cui&Widom01; Buneman&al.02] • Delete source rows, s.t. Emma won’t access a.txt. But, maintain maximum access permissions! • Decision variant is NP-complete [Buneman et al. 02]
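The task in this example can be checked by exhaustive search, sketched below in Python. The relations and the view are copied from the slide; the function names (`access`, `best_deletion`) and the choice to count lost view tuples as the side effect are illustrative. The search tries every set of source rows, so it only suits a toy instance, consistent with the NP-completeness of the decision variant.

```python
from itertools import combinations

USER_GROUP = {("Emma", "ai"), ("Emma", "db"), ("Olivia", "os"),
              ("Olivia", "db"), ("Jacob", "ai")}
GROUP_FILE = {("ai", "a.txt"), ("ai", "b.txt"), ("db", "a.txt"),
              ("db", "b.txt"), ("os", "a.txt")}

def access(user_group, group_file):
    """Access(u,f) :- UserGroup(u,g), GroupFile(g,f)."""
    return {(u, f) for (u, g) in user_group for (g2, f) in group_file if g == g2}

def best_deletion(user_group, group_file, to_delete):
    """Try every set of source rows; return (side effect, removed rows) with minimal side effect."""
    sources = ([("UserGroup", t) for t in sorted(user_group)]
               + [("GroupFile", t) for t in sorted(group_file)])
    must_keep = access(user_group, group_file) - to_delete
    best = None
    for k in range(len(sources) + 1):
        for removed in combinations(sources, k):
            ug = user_group - {t for rel, t in removed if rel == "UserGroup"}
            gf = group_file - {t for rel, t in removed if rel == "GroupFile"}
            view = access(ug, gf)
            if view & to_delete:
                continue  # an unwanted answer survives; not a solution
            side_effect = len(must_keep - view)
            if best is None or side_effect < best[0]:
                best = (side_effect, removed)
    return best

side_effect, removed = best_deletion(USER_GROUP, GROUP_FILE, {("Emma", "a.txt")})
print(side_effect, removed)
```

For this instance the search finds a solution with no side effect: removing Emma from group ai and removing a.txt from group db blocks both of Emma's paths to a.txt while every other access permission survives.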
  • 48. Trichotomy in Complexity • Setting: fix a schema (w/ FDs) and a CQ w/o self-joins • Question: what is the complexity of finding a solution with a minimal side effect? • We have established a precise (easily testable) criterion that partitions all cases into 3 categories: 1. The problem is solvable in PTime, and even via a straightforward algorithm [Buneman et al. 02] 2. The problem is NP-hard, but constant-ratio approximable in PTime (ILP relaxation) 3. The problem is inapproximable for every ratio • [K, Vondrak, Williams, Woodruff, PODS11, PODS12, TODS12, VLDB14]
  • 49. Outline • Text Analytics in the Big Data Era • Information Extraction Systems & Formalism • Foundational Research Challenges • Conclusions and Outlook
  • 50. Summary • Text analytics & IE • Rule systems for IE • A formal framework for rules, relating IE to traditional DB concepts such as Datalog • Research directions motivated by IE – Prioritized repairs – Graph mining – Update propagation 50
  • 51. Outlook: DB w/ Deep Text Support • We need a uniform & elegant data/query model that combines structured data & text and is useful for querying both text and relations • We need a principled, simple & transparent probability model + effective quality + practical execution cost • We need to balance between automation and control: from full specification by experts to feature generation for inexperienced developers – Maximally realize the potential of every developer! – LogicBlox is working on incorporating ML in Datalog!
  • 53. Room for Both • [Figure labels: Statistical Solution; Rule System; Feature Engineering; Model Space, Runtime; Cleaning + Post Proc. (twice); Building blocks (e.g., dictionaries, NER)] • “What doesn’t work: Anything requiring high precision and full automation” (Feldman & Ungar, KDD’08 tutorial on text mining)
  • 54. String DB, Spanners, Interval Algebra
    Aspect | String Databases | Spanners | Interval Algebra
    Atomic value | string | span (pointing to doc) | interval (no text)
    Join | by string conditions (e.g., x is a substring of y) | by interval+string conditions (e.g., x is a token in y) | by interval conditions (e.g., x is a sub-interval of y)
    Example values | Kaspersky, Kaspersky, Intel, Otellini, IBM, Rometty | [10,20), [16,26), [32,37), [50,58), [105,108), [121,128) | [10,20), [16,26), [32,37), [50,58), [105,108), [121,128)
    Applications | text predicates in DBs [Grahne & al. 99] [Benedikt & al. 03], string manipulation [Bonner & Mecca 98] [Ginsburg and Wang 98] | IE | temporal reasoning [Allen 83] [Vilain & Kautz 86] [Nebel & Bürckert 95] [Krokhin et al. 03]
  • 55. Imp. 1: Connection to Known Concepts • Connection to Recognizable Relations [Berstel 79] – These are unions of cross products of regular languages – THM: The class of regular spanners is closed under a string-selection predicate iff the predicate is a recognizable relation • Connection to CRPQs [Cruz et al. 87] – Conjunctive Regular Path Queries have been studied as a query language for labeled graphs – THM: Regular spanners have the same expressive power as unions of CRPQs on paths “with marked endpoints” • Up to some simple and necessary adaptation between the models • [Figure: the string “SIGMOD” drawn as a path with marked endpoints]
  • 56. Imp. 2: Adding String Equality • NR Datalog w/ regex formulas: regular spanners • + String-equality predicate (+substring-of, prefix-of, …): regular^{str=} spanners • Text: “…application from Jane Doe, social 012-345-6789, on Mar 20th… identified as John Doe, 012-345-6789, ask us to…” • Rules: NameSSN(x,y) := … ; SameSSN(x1,x2) := NameSSN(x1,y1), NameSSN(x2,y2), str(y1)=str(y2) • Result (x1, x2): [117,125) (Jane Doe), [875,883) (John Doe), ⋮ • Same string, different spans
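The SameSSN rule can be mimicked with a short Python sketch in which spans are (start, end) offsets into the document and the string-equality predicate compares the text under two spans. The body of NameSSN is elided on the slide, so the regular expression `NAME_SSN` below is a hypothetical stand-in, and the shortened document text is illustrative, which is why the offsets differ from those on the slide. Note also that `re.finditer` returns non-overlapping matches, which is narrower than the all-matches semantics of a spanner.

```python
import re

DOC = ("... application from Jane Doe, social 012-345-6789, on Mar 20th ... "
       "identified as John Doe, 012-345-6789, ask us to ...")

# Hypothetical stand-in for the elided NameSSN(x, y) rule.
NAME_SSN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+),(?: social)? (?P<ssn>\d{3}-\d{3}-\d{4})"
)

def name_ssn(doc):
    """NameSSN(x, y): pairs of spans (name span, SSN span)."""
    return [(m.span("name"), m.span("ssn")) for m in NAME_SSN.finditer(doc)]

def text(doc, span):
    """str(x): the substring that a span points to."""
    start, end = span
    return doc[start:end]

def same_ssn(doc):
    """SameSSN(x1, x2) := NameSSN(x1, y1), NameSSN(x2, y2), str(y1) = str(y2)."""
    rel = name_ssn(doc)
    return [(x1, x2)
            for (x1, y1) in rel
            for (x2, y2) in rel
            if text(doc, y1) == text(doc, y2)]

for x1, x2 in same_ssn(DOC):
    print(x1, text(DOC, x1), x2, text(DOC, x2))
```

As in the rule on the slide, the join compares strings rather than spans: the two SSN occurrences sit at different offsets but carry the same text, so the name spans for Jane Doe and John Doe are paired.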
  • 57. Difference with String Equality • Are regular^{str=} spanners closed under difference? – Why should they be? Only positive operators are used… – However, regex formulas (our EDBs) can introduce “negative” operations (NFAs are closed under complement) • THM: The class of regular spanners is closed under difference • PROP: The class of regular^{str=} spanners is closed under string-inequality selection • THM: The class of regular^{str=} spanners is closed under string-containment selection, but not under non-string-containment selection! • COR: The class of regular^{str=} spanners is not closed under difference
  • 58. Formal Optimization Problem • Fixed: – Schema S w/ functional dependencies – Conjunctive query Q • Input: – Database instance I over S – Set A ⊆ Q(I) of answers to delete • Output: J ⊆ I s.t. Q(J) ∩ A = ∅ • Goal: minimize the side effect |(Q(I) – A) – Q(J)|