2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers

Uncertainty-Annotated Databases
A lightweight approach for approximating certain answers
Aaron Huber
Oliver Kennedy
Su Feng
Boris Glavic

Uncertainty example
Street, state
(State, IL) or (State, IN) or (State, CA)

Uncertainty Example
Street, state
1 (Lasalle, NY) or (Lasalle, IL) or (Lasalle, CA)
2 (Lasalle, NY) or (Lasalle, AZ)
3 (Lake, IL)
4 (Tucson, AZ)
5 (State, IL) or (State, IN) or (State, CA)
Possible World
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN

Uncertainty Example
Possible World 1 (P=0.5)
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Possible World 2 (P=0.3)
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
Possible World 3 (P= …)
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
…

Incomplete data model
Possible World 1 (P=0.5)
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Possible World 2 (P=0.3)
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
Possible World 3 (P= …)
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
…
Q

Incomplete data model
Q: SELECT DISTINCT state FROM address
Possible World 1 (P=0.5)
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Possible World 2 (P=0.3)
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
Possible World 3 (P=0.2)
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
Q(Possible World 1)
state
NY
IL
AZ
Q(Possible World 2)
state
NY
IL
AZ
IN
Q(Possible World 3)
state
NY
IL
AZ
IN
Result {NY, IL, AZ}: P=0.5
Result {NY, IL, AZ, IN}: P=0.3+0.2=0.5
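The possible-world evaluation of Q can be sketched in a few lines of Python. This is only an illustration of the semantics on the slide, not how a probabilistic database evaluates queries. The names WORLDS, distinct_states and result_distribution are hypothetical, and pairing P=0.3 with world 2 and P=0.2 with world 3 is an assumption.

```python
from collections import defaultdict

# (probability, rows) per possible world; rows are (id, street, state).
# Assumption: P=0.3 belongs to world 2 and P=0.2 to world 3.
WORLDS = [
    (0.5, [(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
           (4, "Tucson", "AZ"), (5, "State", "IL")]),
    (0.3, [(1, "Lasalle", "IL"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
           (4, "Tucson", "AZ"), (5, "State", "IN")]),
    (0.2, [(1, "Lasalle", "NY"), (2, "Lasalle", "AZ"), (3, "Lake", "IL"),
           (4, "Tucson", "AZ"), (5, "State", "IN")]),
]


def distinct_states(rows):
    """Q: SELECT DISTINCT state FROM address, evaluated on one world."""
    return frozenset(state for _id, _street, state in rows)


def result_distribution(worlds):
    """Add up the probabilities of all worlds that yield the same result."""
    dist = defaultdict(float)
    for prob, rows in worlds:
        dist[distinct_states(rows)] += prob
    return dict(dist)


if __name__ == "__main__":
    for result, prob in result_distribution(WORLDS).items():
        print(sorted(result), prob)
```

Worlds 2 and 3 return the same set of states, so their probabilities are added into one entry of the distribution.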

Incomplete/Probabilistic
databases
• MayBMS [1] – Lineage based
• MCDB [2] – Sampling based
[1] L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
[2] R. Jampani. 2008. MCDB: a monte carlo approach to managing uncertain data.

Incomplete database
×3
×300

Features
Compared on: Efficiency, Expressiveness
Shown: Probabilistic databases

Certain answers
Possible World 1
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Possible World 2
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
Possible World 3
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
Certain Answer
id Street state
3 Lake IL
4 Tucson AZ
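Under set semantics, a row is a certain answer when it appears in every possible world. A minimal Python sketch of that definition follows, using the three worlds from the slide; the names WORLDS and certain_answers are hypothetical, and enumerating worlds like this is only practical for toy inputs.

```python
# The three possible worlds from the slide as sets of (id, street, state) rows.
WORLDS = [
    {(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
     (4, "Tucson", "AZ"), (5, "State", "IL")},
    {(1, "Lasalle", "IL"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
     (4, "Tucson", "AZ"), (5, "State", "IN")},
    {(1, "Lasalle", "NY"), (2, "Lasalle", "AZ"), (3, "Lake", "IL"),
     (4, "Tucson", "AZ"), (5, "State", "IN")},
]


def certain_answers(worlds):
    """Rows present in every possible world (set semantics)."""
    return set.intersection(*worlds)


if __name__ == "__main__":
    for row in sorted(certain_answers(WORLDS)):
        print(row)
```

Rows 1, 2 and 5 each differ in at least one world, so only rows 3 and 4 survive the intersection.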

Certain answers
• Under-approximated certain answers [1]
• Lower bounds [2]
[1] L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers.
[2] P. Guagliardo. 2017. Correctness of SQL Queries on Databases with Nulls.

Features
Compared on: Efficiency, Expressiveness
Shown: Probabilistic databases, Approximated Certain Answers

Ground truth
Ground Truth
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Possible World 2
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
Possible World 3
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
…
Certain Answer
id Street state
3 Lake IL
4 Tucson AZ

Features
Compared on: Efficiency, Expressiveness, Utility, Trustworthiness
Shown: Probabilistic Databases, Approximated Certain Answers

“Best Guess”
Best Guess
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
No confidence in answers

Features
Compared on: Efficiency, Expressiveness, Utility, Trustworthiness
Shown: Probabilistic databases, Approximated Certain answers, Best guess, UADB

Solution
“Best Guess”
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Approx. Certain
id Street state
3 Lake IL
4 Tucson AZ

Uncertainty-annotated databases
UADB
id Street state
1 Lasalle NY : Uncertain
2 Lasalle NY : Uncertain
3 Lake IL : Certain
4 Tucson AZ : Certain
5 State IL : Uncertain
Under-approximated Certain Answers
Certain Answers
Best Guess

Interoperability with IDBs/PDBs
[1] T. Imielinski. 1984. Incomplete Information in Relational Databases.
[2] P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage.

UADB example
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
C: certainty label
F : Uncertain
T : Certain

UADB Query Processing
Π_state(Streets)
state C
NY
IL
AZ
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain

UADB Query Processing cont.
state C
NY
IL
AZ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain

state C
NY
IL T = T ∨ F
AZ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain

state C
NY F = F ∨ F
IL T
AZ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain
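The projection steps shown on these slides combine the labels of all input rows that collapse into the same output row with a logical OR. A small Python sketch of that rule follows; the names STREETS and project_state are hypothetical and the code only mirrors the slide example.

```python
# Streets relation with certainty label C (True = certain, False = uncertain).
STREETS = [
    ("Lasalle", "NY", False),
    ("Lasalle", "NY", False),
    ("Lake", "IL", True),
    ("Tucson", "AZ", True),
    ("State", "IL", False),
]


def project_state(rows):
    """Project onto state; an output row is certain if any input row
    that produces it is certain (labels are combined with OR)."""
    labels = {}
    for _street, state, certain in rows:
        labels[state] = labels.get(state, False) or certain
    return labels


if __name__ == "__main__":
    for state, certain in project_state(STREETS).items():
        print(state, "T" if certain else "F")
```

NY stays uncertain because both contributing rows are uncertain (F ∨ F), while IL becomes certain through the Lake row (T ∨ F).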

User
LName Street C
Smith Lasalle T
Smith Lake T
Jones State F
User ⋈ Streets
Street state LName C
Lasalle NY Smith F = F ∧ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain

K-relations [1]
• Incomplete set semantics
• Incomplete bag semantics
• Incomplete provenance
• Etc.
[1] T. J. Green. 2007. Provenance Semirings.

Technical contributions
• Generalized Incomplete Databases and Certain Answers to K-relations
– Each possible world is a K-relation
– Certain answers are the greatest lower bound (GLB) on annotations across all worlds (using natural order)
• Derive UADBs from probabilistic/incomplete data models
– Proven to under-approximate certain answers
– Extract one possible world (with highest prob. if feasible)
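For bags, annotations are multiplicities and the natural order is ≤, so the GLB across worlds is the minimum multiplicity. The Python sketch below illustrates this and pairs each row of the highest-probability world with its certain multiplicity. Everything here is an assumption made for illustration: the names, the (certain, best guess) pair encoding, the probabilities 0.3 and 0.2 for worlds 2 and 3, and the explicit enumeration of worlds, which is only feasible for toy inputs.

```python
from collections import Counter

# (probability, bag of rows) per possible world; rows are (id, street, state).
WEIGHTED_WORLDS = [
    (0.5, Counter([(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
                   (4, "Tucson", "AZ"), (5, "State", "IL")])),
    (0.3, Counter([(1, "Lasalle", "IL"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
                   (4, "Tucson", "AZ"), (5, "State", "IN")])),
    (0.2, Counter([(1, "Lasalle", "NY"), (2, "Lasalle", "AZ"), (3, "Lake", "IL"),
                   (4, "Tucson", "AZ"), (5, "State", "IN")])),
]


def certain_multiplicities(worlds):
    """GLB under the natural order of the naturals: the minimum
    multiplicity of each row over all worlds (0 if missing somewhere)."""
    rows = set().union(*worlds)  # iterating a Counter yields its keys
    return {row: min(world[row] for world in worlds) for row in rows}


def derive_uadb(weighted_worlds):
    """Annotate each row of the most probable world with the pair
    (certain multiplicity, best-guess multiplicity)."""
    _prob, best_guess = max(weighted_worlds, key=lambda pw: pw[0])
    certain = certain_multiplicities([world for _p, world in weighted_worlds])
    return {row: (certain[row], count) for row, count in best_guess.items()}


if __name__ == "__main__":
    for row, annotation in sorted(derive_uadb(WEIGHTED_WORLDS).items()):
        print(row, annotation)
```

Rows 3 and 4 get the pair (1, 1) and the remaining rows of the best-guess world get (0, 1), which corresponds to the certain and uncertain labels in the UADB example.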

Technical contributions cont.
• Bounds on certain answers are preserved by standard K-relational query semantics
– Generalizes result for sets due to Reiter [1]
– In contrast to certain answers, UADBs are closed under queries!
• Implementation for bags
– Rewriting frontend for a deterministic relational database
[1] R. Reiter. 1986. A sound and sometimes complete query evaluation algorithm for relational databases with null values.
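To give a feel for what a rewriting frontend on top of a deterministic database could look like, the sketch below stores the certainty label as an ordinary column and rewrites a duplicate-eliminating projection so that the label is carried along. This is a hypothetical illustration using Python's sqlite3 module. The table layout, the column name c and the rewritten query are assumptions, not the rewriting rules of the actual implementation.

```python
import sqlite3

# Original user query:  SELECT DISTINCT state FROM streets
# Hypothetical rewrite: group by the projected column and OR the 0/1
# labels of each group together, expressed here as MAX(c).
REWRITTEN_QUERY = """
    SELECT state, MAX(c) AS c
    FROM streets
    GROUP BY state
    ORDER BY state
"""


def build_db():
    """Create an in-memory database holding the Streets example,
    with the certainty label stored as an integer column c."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE streets (street TEXT, state TEXT, c INTEGER)")
    conn.executemany(
        "INSERT INTO streets VALUES (?, ?, ?)",
        [("Lasalle", "NY", 0), ("Lasalle", "NY", 0), ("Lake", "IL", 1),
         ("Tucson", "AZ", 1), ("State", "IL", 0)],
    )
    return conn


def run_rewritten(conn):
    """Run the rewritten query and return (state, c) rows."""
    return conn.execute(REWRITTEN_QUERY).fetchall()


if __name__ == "__main__":
    for state, c in run_rewritten(build_db()):
        print(state, "T" if c else "F")
```

Because the labels live in a regular column, the rewritten query runs on an unmodified database engine.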

Experiments
• Do UADBs scale well?
• Do UADBs have good utility?
• How good is the approximation?

Setups
• Data sets
– PD-Bench [1] (TPC-H + error)
– Real-world datasets with natural errors
– Cleaned datasets with known ground truth
• Comparing with
– Approximated certain answers by Libkin [2]
– Sampling-based PDB – MCDB [3]
– Lineage-based PDB – MayBMS [4]
[1] L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data.
[2] L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers.
[3] R. Jampani. 2008. MCDB: a monte carlo approach to managing uncertain data.
[4] L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.

Experimental results
• PD-Bench Q1
• 1GB
• No probability
5%

• PD-Bench Q1
• 2% error
• No probability

“Best guess”

“Random guess”

• Real-world data
• On projection

Conclusions & Future work
• Best Guess
• Approx. Certain answers
• Lightweight, implementation friendly
o Larger class of queries?
o More precise while still efficient?

Questions?
• Vizier - https://vizierdb.info - demo session C (W/TH 4:20pm)
• GitHub - https://github.com/IITDBGroup/gprom
• References
Y. Yang. 2015. Lenses: An On-demand Approach to ETL.
T. Imielinski. 1984. Incomplete Information in Relational Databases.
P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage.
L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data.
L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers.
R. Jampani. 2008. MCDB: a monte carlo approach to managing uncertain data.
L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
R. Reiter. 1986. A sound and sometimes complete query evaluation algorithm for relational databases with null values.
T. J. Green. 2007. Provenance Semirings.
Aaron Huber
Oliver Kennedy
Su Feng
Boris Glavic
