Uncertainty-Annotated Databases
A lightweight approach for approximating certain
answers
1
Aaron Huber
Oliver Kennedy
Su Feng
Boris Glavic
Uncertainty is everywhere
Uncertainty example
6
Street, state
(State, IL) or (State, IN) or (State, CA)
Uncertainty Example
7
Street, state
1 (Lasalle, NY) or (Lasalle, IL) or (Lasalle, CA)
2 (Lasalle, NY) or (Lasalle, AZ)
3 (Lake, IL)
4 (Tucson, AZ)
5 (State, IL) or (State, IN) or (State, CA)
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
Possible World
Uncertainty Example
8
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
…
Possible World
1
Possible World
2
Possible World
3
P=0.5 P=0.3 P= …
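A minimal sketch of how the x-tuple list from the previous slide expands into possible worlds: one alternative is picked per row, and every combination of picks is one world. The names `x_tuples` and `possible_worlds` are illustrative and do not come from the talk.

```python
from itertools import product

# Alternatives per address id, copied from the x-tuple slide.
x_tuples = {
    1: [("Lasalle", "NY"), ("Lasalle", "IL"), ("Lasalle", "CA")],
    2: [("Lasalle", "NY"), ("Lasalle", "AZ")],
    3: [("Lake", "IL")],
    4: [("Tucson", "AZ")],
    5: [("State", "IL"), ("State", "IN"), ("State", "CA")],
}

def possible_worlds(rows):
    """Yield every world: one alternative chosen independently for each row."""
    ids = sorted(rows)
    for choice in product(*(rows[i] for i in ids)):
        yield [(i, street, state) for i, (street, state) in zip(ids, choice)]

worlds = list(possible_worlds(x_tuples))
print(len(worlds))  # 3 * 2 * 1 * 1 * 3 = 18 worlds
```

Even this five-row example yields 18 worlds, which illustrates why explicit enumeration does not scale.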
Incomplete data model
9
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
…
Q
Possible World
1
Possible World
2
Possible World
3
P=0.5 P=0.3 P= …
Incomplete data model
10
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Q
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
Q Q
Q: SELECT distinct state FROM address
state
NY
IL
AZ
state
NY
IL
AZ
IN
state
NY
IL
AZ
IN
Possible World 1 (P=0.5) Possible World 2 (P=0.3) Possible World 3 (P=0.2)
Result {NY, IL, AZ}: P=0.5
Result {NY, IL, AZ, IN}: P=0.3+0.2=0.5
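The naive evaluation on this slide can be sketched as follows, assuming (as the slide does) that only these three worlds exist and that they carry probabilities 0.5, 0.3 and 0.2. The query is run in every world, and the probabilities of worlds with identical results are added. The helper name `distinct_states` is hypothetical.

```python
from collections import defaultdict

# (probability, world) pairs; each world lists (street, state) rows.
worlds = [
    (0.5, [("Lasalle", "NY"), ("Lasalle", "NY"), ("Lake", "IL"), ("Tucson", "AZ"), ("State", "IL")]),
    (0.3, [("Lasalle", "IL"), ("Lasalle", "NY"), ("Lake", "IL"), ("Tucson", "AZ"), ("State", "IN")]),
    (0.2, [("Lasalle", "NY"), ("Lasalle", "AZ"), ("Lake", "IL"), ("Tucson", "AZ"), ("State", "IN")]),
]

def distinct_states(world):
    """Q: SELECT DISTINCT state FROM address, evaluated in one world."""
    return frozenset(state for _street, state in world)

# Sum the probabilities of all worlds that produce the same query result.
result_prob = defaultdict(float)
for p, world in worlds:
    result_prob[distinct_states(world)] += p

for result, p in result_prob.items():
    print(sorted(result), p)
```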
Incomplete/Probabilistic
databases
• MayBMS1 – Lineage based
• MCDB2 – Sampling based
11
1L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
2R. Jampani. 2008. MCDB: a monte carlo approach to managing uncertain data.
Incomplete database
12
×3
×300
Features
13
features
Probabilistic
databases
Efficiency
Expressiveness
Certain answers
14
id Street state
3 Lake IL
4 Tucson AZ
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
Possible World 1 Possible World 2 Possible World 3
Certain Answer
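Under set semantics, the certain answers are the tuples present in every possible world, so they can be computed as a plain intersection. The sketch below uses only the three worlds shown on this slide; with more worlds the result could only shrink.

```python
# The three possible worlds from the slide, as sets of (id, street, state).
world1 = {(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
          (4, "Tucson", "AZ"), (5, "State", "IL")}
world2 = {(1, "Lasalle", "IL"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
          (4, "Tucson", "AZ"), (5, "State", "IN")}
world3 = {(1, "Lasalle", "NY"), (2, "Lasalle", "AZ"), (3, "Lake", "IL"),
          (4, "Tucson", "AZ"), (5, "State", "IN")}

# A tuple is certain only if it appears in every world.
certain = set.intersection(world1, world2, world3)
print(sorted(certain))
```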
Certain answers
• Under-approximated certain answers1
• Lower bounds2
15
1L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers.
2P. Guagliardo. 2017. Correctness of SQL Queries on Databases with Nulls
Approximated
Certain
Answers
Features
16
features
Probabilistic
databases
Efficiency
Expressiveness
Ground truth
17
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
id Street state
1 Lasalle IL
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IN
id Street state
1 Lasalle NY
2 Lasalle AZ
3 Lake IL
4 Tucson AZ
5 State IN
…
Ground Truth Possible World 2 Possible World 3
id Street state
3 Lake IL
4 Tucson AZ
Certain Answer
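To make the utility question concrete, here is a small sketch that scores the certain answers against the ground truth using precision and recall. It assumes the first world on this slide is the ground truth, as the slide labels it; the variable names are illustrative.

```python
# Ground truth world and certain answers from the slide, as (id, street, state).
ground_truth = {(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
                (4, "Tucson", "AZ"), (5, "State", "IL")}
certain = {(3, "Lake", "IL"), (4, "Tucson", "AZ")}

true_positives = len(certain & ground_truth)
precision = true_positives / len(certain)      # share of returned tuples that are correct
recall = true_positives / len(ground_truth)    # share of correct tuples that are returned

print(precision, recall)
```

Every certain tuple is in the ground truth, so precision is perfect, while recall is low because three of the five true tuples are dropped.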
Utility of certain answers
18
Trustworthiness
Features
19
features
Probabilistic
Databases
Approximated
Certain
Answers
Efficiency
Expressiveness
Utility
“Best Guess”
20
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
No confidence in answers
Best Guess
UADB Best guess
Features
21
Features
Probabilistic
databases
Approximated
Certain
answers
Efficiency
Expressiveness
Utility
Trustworthiness
Solution
22
“Best Guess”
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Approx. Certain
id Street state
3 Lake IL
4 Tucson AZ
Uncertainty-annotated
databases
23
UADB
id Street state
1 Lasalle NY
2 Lasalle NY
3 Lake IL
4 Tucson AZ
5 State IL
Under-approximated
Certain Answers
Best Guess
Certain Answers
: Uncertain
: Certain
Interoperability with
IDBs/PDBs
24
1T. Imielinski. 1984. Incomplete Information in Relational Databases.
2P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage.
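A rough sketch of deriving a UADB from x-tuples, under two simplifying assumptions that are not spelled out on the slides: the first listed alternative stands in for whatever best-guess heuristic is available, and a row is labeled certain only when it has exactly one alternative. The function name `to_uadb` is hypothetical.

```python
# Alternatives per address id, copied from the x-tuple slide.
x_tuples = {
    1: [("Lasalle", "NY"), ("Lasalle", "IL"), ("Lasalle", "CA")],
    2: [("Lasalle", "NY"), ("Lasalle", "AZ")],
    3: [("Lake", "IL")],
    4: [("Tucson", "AZ")],
    5: [("State", "IL"), ("State", "IN"), ("State", "CA")],
}

def to_uadb(rows):
    """Return (street, state, certain) rows: one best guess per x-tuple."""
    uadb = []
    for row_id in sorted(rows):
        alternatives = rows[row_id]
        street, state = alternatives[0]      # assumption: first alternative is the best guess
        certain = len(alternatives) == 1     # a single alternative exists in every world
        uadb.append((street, state, certain))
    return uadb

for row in to_uadb(x_tuples):
    print(row)
```

With these assumptions the output matches the Streets table used in the UADB example slides: only the Lake and Tucson rows are labeled certain.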
Implementing UADB Queries
25
UADB example
26
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
C: certain label
F : Uncertain
T : Certain
UADB Query Processing
27
Π 𝑠𝑡𝑎𝑡𝑒
state C
NY
IL
AZ
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain
UADB Query Processing cont.
28
Π 𝑠𝑡𝑎𝑡𝑒
state C
NY
IL
AZ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain
UADB Query Processing cont.
29
Π 𝑠𝑡𝑎𝑡𝑒
State C
NY
IL T=T⋁F
AZ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain
UADB Query Processing cont.
30
Π 𝑠𝑡𝑎𝑡𝑒
state C
NY F=F⋁F
IL T
AZ T
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain
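The projection steps on the last few slides can be sketched as follows: the output label for each state is the logical OR of the labels of all input rows that map to it. This is a set-style illustration with a hypothetical helper `project_state`, not the talk's implementation.

```python
# Streets relation from the slide: (street, state, C), where C=True means certain.
streets = [
    ("Lasalle", "NY", False),
    ("Lasalle", "NY", False),
    ("Lake", "IL", True),
    ("Tucson", "AZ", True),
    ("State", "IL", False),
]

def project_state(rows):
    """Project on state; an output tuple is certain if any contributing input is."""
    labels = {}
    for _street, state, certain in rows:
        labels[state] = labels.get(state, False) or certain
    return labels

print(project_state(streets))  # NY is uncertain (F or F); IL and AZ are certain
```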
UADB Query Processing cont.
31
User
LName Street C
Smith Lasalle T
Smith Lake T
Jones State F
C
Lasalle NY Smith F=F⋀T
⋈
Streets
Street state C
Lasalle NY F
Lasalle NY F
Lake IL T
Tucson AZ T
State IL F
F : Uncertain
T : Certain
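The join rule can be sketched in the same style: a joined tuple is certain only if both inputs are certain, so the labels are combined with logical AND. The sketch assumes an equi-join on Street, which the slide's single result row suggests; `join_on_street` is a hypothetical name.

```python
# Relations from the slide, each row ending in its certainty label C.
users = [("Smith", "Lasalle", True), ("Smith", "Lake", True), ("Jones", "State", False)]
streets = [
    ("Lasalle", "NY", False),
    ("Lasalle", "NY", False),
    ("Lake", "IL", True),
    ("Tucson", "AZ", True),
    ("State", "IL", False),
]

def join_on_street(user_rows, street_rows):
    """Join User and Streets on street; the output label is the AND of the input labels."""
    return [
        (street, state, lname, user_c and street_c)
        for lname, user_street, user_c in user_rows
        for street, state, street_c in street_rows
        if user_street == street
    ]

for row in join_on_street(users, streets):
    print(row)
```

The first output row is the one on the slide: (Lasalle, NY, Smith) is uncertain because the Streets input is uncertain, even though the User input is certain.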
• Incomplete sets semantics
• Incomplete bags semantics
• Incomplete Provenance
• Etc.
32
K-relations1
1T. J. Green. 2007. Provenance Semirings.
Technical contributions
• Generalized Incomplete Databases and
Certain Answers to K-relations
– Each possible world is a K-relation
– Certain answers are the greatest lower bound (GLB) on
annotations across all worlds (using natural order)
• Derive UADBs from probabilistic/incomplete
data models
– Proven to under-approximate certain answers
– Extract one possible world (with highest prob. if feasible)
33
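For bags, annotations are natural-number multiplicities and the natural order is the usual ≤, so the greatest lower bound across worlds is the minimum multiplicity of each tuple. The sketch below applies this to a non-distinct projection on state over the three example worlds from the earlier slides; choosing that query is an illustration, not an example from the talk.

```python
from collections import Counter
from functools import reduce

# Bag results of projecting on state (without DISTINCT) in each example world.
world_results = [
    Counter(["NY", "NY", "IL", "AZ", "IL"]),   # possible world 1
    Counter(["IL", "NY", "IL", "AZ", "IN"]),   # possible world 2
    Counter(["NY", "AZ", "IL", "AZ", "IN"]),   # possible world 3
]

# Counter's & operator keeps the minimum count per key, which is the GLB
# under the natural order on multiplicities; zero counts are dropped.
certain_bag = reduce(lambda a, b: a & b, world_results)
print(certain_bag)
```

IN does not survive because it has multiplicity 0 in world 1, and each of NY, IL and AZ is certain only once even though some worlds contain it twice.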
Technical contributions cont.
• Bounds on certain answers are preserved by
standard K-relational query semantics
– Generalizes result for sets due to Reiter1
– In contrast to certain answers, UADBs are closed under
queries!
• Implementation for bags
– rewriting frontend for deterministic relational database
34
1R. Reiter. 1986. A sound and sometimes complete query evaluation algorithm for relational databases with null values.
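To give a feel for the rewriting-frontend idea, here is a hedged sketch in which the certainty label is stored as an extra integer column `c` and a distinct projection on state is rewritten into an ordinary SQL aggregate. It uses SQLite through Python's standard library purely for illustration. The table layout and the rewritten query are assumptions, follow set-style semantics, and are not the rewrite produced by the authors' system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (street TEXT, state TEXT, c INTEGER)")
conn.executemany(
    "INSERT INTO streets VALUES (?, ?, ?)",
    [
        ("Lasalle", "NY", 0),
        ("Lasalle", "NY", 0),
        ("Lake", "IL", 1),
        ("Tucson", "AZ", 1),
        ("State", "IL", 0),
    ],
)

# Original query:  SELECT DISTINCT state FROM streets
# Rewritten query: MAX over the 0/1 label acts as a logical OR per output tuple.
rewritten = "SELECT state, MAX(c) AS c FROM streets GROUP BY state ORDER BY state"
rows = conn.execute(rewritten).fetchall()
print(rows)
conn.close()
```

Because the rewritten query is plain SQL over an ordinary table, it can run on a conventional database engine without any changes to the engine itself.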
Experiments
• Do UADBs scale well?
• Do UADBs have good Utility?
• How good is the approximation?
35
Setups
• Data sets
– PD-Bench1 (TPC-H +error)
– Real world datasets with natural errors
– Cleaned datasets with known ground truth
• Comparing with
– Approximated certain answers by Libkin2
– Sampling based PDB – MCDB3
– Lineage based PDB – MayBMS4
36
1L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data.
2L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers.
3R. Jampani. 2008. MCDB: a monte carlo approach to managing uncertain data.
4L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
Experimental results
37
• PD-Bench Q1
• 1GB
• No probability
5%
Experimental results
38
• PD-Bench Q1
• 2% error
• No probability
Experimental results
39
“Best guess”
Experimental results
40
“Random guess”
Experimental results
41
• Real world data
• On projection
Conclusions & Future work
42
• Best Guess
Certain answers
• Approx. Certain
• Lightweight, implementation friendly
o Larger class of queries?
o More precise while still efficient?
Questions?
• Vizier - https://vizierdb.info - demo session C (W/TH 4:20pm)
• GitHub - https://github.com/IITDBGroup/gprom
• References
43
Y. Yang. 2015. Lenses: An On-demand Approach to ETL.
T. Imielinski. 1984. Incomplete Information in Relational Databases.
P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage.
L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data.
L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers.
R. Jampani. 2008. MCDB: a monte carlo approach to managing uncertain data.
L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
R. Reiter. 1986. A sound and sometimes complete query evaluation algorithm for relational databases with null values.
T. J. Green. 2007. Provenance Semirings.

2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers

Editor's Notes

  1. Uncertainty is everywhere.
  2. Different sensors on the same target may give you different readings.
  3. Data cleaning may introduce uncertainty.
  4. Information extraction, such as searching for a street name, may return multiple options.
  5. For example, if we are unsure which address is correct, we can record all possible options. In this case we do not know whether State Street is in IL, IN, or CA. This specific representation is called x-tuples.
  6. Suppose we have a list of such options. Every row represents a tuple, and every row with multiple options represents a decision. Each combination of decisions across the rows yields one possible instance, which we call a possible world. For example, choosing the marked options results in the possible world on the right-hand side.
  7. In principle we can enumerate all possible worlds, for example these three and so on. This gives a naive representation of an incomplete database. We can also assign a probability to each possible world, in which case we call it a probabilistic database.
  8. Of course we want to run queries over it.
  9. Naively, we can do this by running the query over each possible world to get all possible outcomes. For example, if we have only three possible worlds and project distinct on state, we get one corresponding result per possible world. If probabilities are assigned, we add the probabilities of worlds that produce identical results. However, this is completely impractical, since the number of possible worlds can be exponential in the number of decisions we have to make about the data. People have come up with several ways to deal with this more effectively.
  10. Working with full PDBs is hard. There is extensive work on representing PDBs compactly and querying them with relatively better performance. Most of these systems are either lineage based or sampling based. Computing exact probabilities is #P-hard. Probabilities can be approximated in polynomial time, but those systems still need to track all, or a large set of, possible answers, which remains a performance bottleneck.
  11. Here is a performance comparison between conventional query processing and the two types of probabilistic databases just introduced. In the best case PQP is 3 times slower than conventional query processing, and as uncertainty increases PQP can be up to 300 times slower in our test case.
  12. So PDBs are very expressive because they track all possible answers and their probabilities, but doing so is a lot of work, which makes them very slow.
  13. What if we do not have probabilities? And since the set of possible answers can be very large, what if we keep only the certain answers? Certain answers are the answers common to all possible worlds. For example, if we treat id as one of the attributes, then these two tuples exist in all possible worlds, so they are the certain answers. However, exact certain answers are still expensive to compute.
  14. Thanks to previous work, we can compute approximate certain answers efficiently. These answers are a lower bound, that is, a subset of the actual certain answers.
  15. So approximate certain answers are efficient, but they do not track all possible answers or their probabilities. Also, since certain answers give you only a subset of the query result, how bad is this in practice?
  16. To test how the certain-answers approach behaves, recall that in reality there is always a ground truth that is fundamentally correct. Of course the ground truth is unknown, but let's assume we know it. Certain answers are guaranteed to be a subset of the ground truth. However, they may be only a very conservative approximation of it.
  17. So if we know the ground truth, we can measure utility as the precision and recall of the certain answers relative to the ground truth, plotted against the amount of uncertainty, as shown in the graph. In one sentence: certain answers have high precision and low recall.
  18. So certain answers lack utility because they drop all tuples that are merely possible. These are the principled approaches, and both give you answers you can trust. Unfortunately, nobody uses them in practice.
  19. In practice, especially in many data cleaning tasks, people just make heuristic choices and ignore uncertainty afterwards. This is equivalent to picking one possible world, which we call the best-guess world. The downside of this approach is that we lose all information about uncertainty.
  20. Although you cannot trust any answer from the best-guess world, you gain efficiency by working on only one world and gain utility by getting closer to the ground truth. There is no way to get both efficiency and expressiveness. Since people want efficiency in practice, PDBs are out. Of the other two approaches, one gives up utility and the other gives up trustworthiness. Our approach, UADB, achieves efficiency, trustworthiness, and utility. For this we have to give up expressiveness, which means we do not track all possible answers.
  21. The obvious solution is to combine approximate certain answers with the best guess. We noticed that approximate certain answers are always a subset of the best-guess world.
  22. So we can use heuristics to find a best-guess world and mark whether each tuple is certain (blue tuples) or uncertain (red tuples). Remember that the approximation of certain answers is a subset of the actual certain answers. Every tuple we mark as certain is guaranteed to be certain. Every tuple we mark as uncertain may or may not be uncertain. But we always bound the actual certain answers between the under-approximation and the best guess. This works because the certain answers are a subset of every possible world.
  23. Uncertainty comes from different sources, and a lot of existing work has come up with ways to model uncertainty. We do not want to reinvent the wheel, so we want to be able to translate existing models of uncertainty into UADBs. In our paper we present translations from three common representations: tuple-independent databases, c-tables, and x-tuples. Adding new models is easy.
  24. Once we have a UADB instance, querying it is lightweight and easy to implement. Our system is based on query rewriting, so the rewritten query can run on any conventional database system such as Postgres or Oracle.
  25. To use a relational database, we need a way to encode a UADB as relations. We implement the color labeling as an attribute at the end of the table, where false means uncertain and true means certain.
  26. We have a representation, so how do we rewrite queries to operate over it? Let's look at a query that projects on state. Here is the result on the best-guess world. What is the correct annotation that preserves the bounds?
  27. For the last tuple, AZ, it is clear that there is certainly one AZ tuple in the input, so the output must contain an AZ tuple, and we label it as certain.
  28. For the second tuple, IL, two IL tuples in the input correspond to it; one of them is certain and the other is not. Since one of the input IL tuples is certain, the output will always contain the IL tuple, no matter whether the other, uncertain IL tuple exists. So we label it as certain.
  29. For the first tuple, NY, both input NY tuples are uncertain. The output NY tuple can be missing when both of its uncertain inputs are missing, so we label it as uncertain.
  30. As another example, consider a cross join, showing only the join of these two specific tuples. If we join an uncertain tuple with a certain tuple, the output tuple is uncertain, because as soon as one of the joined tuples is missing, the output tuple does not exist. Many in the audience may feel these examples look very familiar. And they are, because we are using K-relations.
  31. This means our approach works not only on uncertain sets but also on uncertain bags, uncertain provenance, and many other semirings whose query semantics are adaptable to K-relations.
  32. To summarize the technical contributions, our approach..
  33. We use both synthetic datasets, namely TPC-H with randomly generated errors, and real-world datasets.
  34. When we run our approach on the synthetic dataset, measuring runtime in log scale against the percentage of uncertainty, our approach has roughly 5% overhead compared with conventional query processing. The overhead stays low as the amount of uncertainty increases.
  35. The low overhead also scales with data size.
  36. In terms of utility, if we use a good guess given by a good ML model, e.g. Spark ML, we get precision over 90% and recall over 80%.
  37. And if we have no information, or it is too expensive to obtain the best-guess world, here is the result if we randomly pick one world.
  38. This is one of ten example results showing the misclassification of certain tuples on real-world data. It really depends on the data itself, but in this case the worst case is 0.8% misclassified tuples.
  39. By combining the best guess with an under-approximation of certain answers, we developed an uncertainty model that bounds the actual certain answers and is lightweight and easy to implement.
  40. Our model also plays a role in the Vizier system, so feel free to try the Vizier demo too!