
2019 - SIGMOD - Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers


Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
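To make the notion of certain answers concrete, the sketch below computes them by brute force over an explicit list of possible worlds. It uses the three address worlds from the slides and takes the minimum multiplicity of each result tuple across all worlds (the greatest lower bound under bag semantics). The names `WORLDS`, `certain_answers`, `identity`, and `project_state` are illustrative assumptions, not part of the authors' system, and enumerating worlds like this is exactly the expensive approach that UA-DBs are meant to avoid.

```python
from collections import Counter

# Three possible worlds of the address relation; rows are (id, street, state).
WORLDS = [
    [(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
     (4, "Tucson", "AZ"), (5, "State", "IL")],
    [(1, "Lasalle", "IL"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
     (4, "Tucson", "AZ"), (5, "State", "IN")],
    [(1, "Lasalle", "NY"), (2, "Lasalle", "AZ"), (3, "Lake", "IL"),
     (4, "Tucson", "AZ"), (5, "State", "IN")],
]


def certain_answers(worlds, query):
    """Bag-semantics certain answers: for every result tuple, keep the
    minimum multiplicity it has across all worlds (assumes >= 1 world)."""
    results = [Counter(query(world)) for world in worlds]
    certain = results[0]
    for result in results[1:]:
        certain = certain & result  # Counter '&' keeps the minimum count
    return certain


def identity(world):
    """Query that returns the relation unchanged."""
    return world


def project_state(world):
    """Bag projection onto the state column (duplicates are kept)."""
    return [(state,) for (_id, _street, state) in world]


if __name__ == "__main__":
    print(certain_answers(WORLDS, identity))
    print(certain_answers(WORLDS, project_state))
```

Only the rows for Lake and Tucson appear in every world, so they are the certain answers of the identity query. For the projection, NY, IL, and AZ each occur at least once in every world, while IN is missing from the first world and is therefore not certain.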

Published in: Science

  1. Uncertainty-Annotated Databases: A lightweight approach for approximating certain answers. Aaron Huber, Oliver Kennedy, Su Feng, Boris Glavic
  2. Uncertainty is everywhere
  3. Uncertainty is everywhere
  4. Uncertainty is everywhere
  5. Uncertainty is everywhere
  6. Uncertainty example. (Street, state): (State, IL) or (State, IN) or (State, CA)
  7. Uncertainty example. Uncertain input, id: (Street, state). 1: (Lasalle, NY) or (Lasalle, IL) or (Lasalle, CA); 2: (Lasalle, NY) or (Lasalle, AZ); 3: (Lake, IL); 4: (Tucson, AZ); 5: (State, IL) or (State, IN) or (State, CA). One possible world (id, Street, state): (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN)
  8. Uncertainty example. Rows are (id, Street, state). Possible World 1 (P = 0.5): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Possible World 2 (P = 0.3): (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Possible World 3 (P = …): (1, Lasalle, NY), (2, Lasalle, AZ), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). …
  9. Incomplete data model. A query Q is evaluated over every possible world; rows are (id, Street, state). Possible World 1 (P = 0.5): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Possible World 2 (P = 0.3): (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Possible World 3 (P = …): (1, Lasalle, NY), (2, Lasalle, AZ), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). …
  10. Incomplete data model. Q: SELECT DISTINCT state FROM address, evaluated over the same three possible worlds as slide 9. Possible World 1 (P = 0.5): state = {NY, IL, AZ}. Possible World 2 (P = 0.3): state = {NY, IL, AZ, IN}. Possible World 3 (P = 0.2): state = {NY, IL, AZ, IN}. Result probabilities: {NY, IL, AZ} with P = 0.5; {NY, IL, AZ, IN} with P = 0.3 + 0.2 = 0.5
  11. Incomplete/Probabilistic databases. • MayBMS [1]: lineage based • MCDB [2]: sampling based. [1] L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions. [2] R. Jampani. 2008. MCDB: A Monte Carlo Approach to Managing Uncertain Data.
  12. Incomplete database. ×3, ×300
  13. Features. Comparison table. Approaches: Probabilistic databases. Features: Efficiency, Expressiveness
  14. Certain answers. Rows are (id, Street, state). Possible World 1: (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Possible World 2: (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Possible World 3: (1, Lasalle, NY), (2, Lasalle, AZ), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Certain answer: (3, Lake, IL), (4, Tucson, AZ)
  15. Certain answers. • Under-approximated certain answers [1] • Lower bounds [2]. [1] L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers. [2] P. Guagliardo. 2017. Correctness of SQL Queries on Databases with Nulls.
  16. Features. Comparison table. Approaches: Probabilistic databases, Approximated certain answers. Features: Efficiency, Expressiveness
  17. Ground truth. Rows are (id, Street, state). Ground Truth: (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Possible World 2: (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Possible World 3: (1, Lasalle, NY), (2, Lasalle, AZ), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). … Certain answer: (3, Lake, IL), (4, Tucson, AZ)
  18. Utility of certain answers
  19. Features. Comparison table. Approaches: Probabilistic databases, Approximated certain answers. Features: Efficiency, Expressiveness, Utility, Trustworthiness
  20. “Best Guess”. Best-guess world (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). No confidence in the answers
  21. Features. Comparison table. Approaches: Probabilistic databases, Approximated certain answers, Best guess, UADB. Features: Efficiency, Expressiveness, Utility, Trustworthiness
  22. Solution. “Best Guess” (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Approx. Certain (id, Street, state): (3, Lake, IL), (4, Tucson, AZ)
  23. Uncertainty-annotated databases. UADB (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Diagram labels: Under-approximated Certain Answers, Best Guess, Certain Answers. Legend: Uncertain, Certain
  24. Interoperability with IDBs/PDBs. [1] T. Imielinski. 1984. Incomplete Information in Relational Databases. [2] P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage.
  25. Implementing UADB Queries
  26. UADB example. Streets (Street, state, C): (Lasalle, NY, F), (Lasalle, NY, F), (Lake, IL, T), (Tucson, AZ, T), (State, IL, F). C: certain label; F: Uncertain; T: Certain
  27. UADB Query Processing. Π_state over Streets (Street, state, C): (Lasalle, NY, F), (Lasalle, NY, F), (Lake, IL, T), (Tucson, AZ, T), (State, IL, F). Result (state, C): (NY, ), (IL, ), (AZ, ). F: Uncertain; T: Certain
  28. UADB Query Processing cont. Π_state over the same Streets relation. Result (state, C): (NY, ), (IL, ), (AZ, T). F: Uncertain; T: Certain
  29. UADB Query Processing cont. Π_state over the same Streets relation. Result (state, C): (NY, ), (IL, T = T ∨ F), (AZ, T). F: Uncertain; T: Certain
  30. UADB Query Processing cont. Π_state over the same Streets relation. Result (state, C): (NY, F = F ∨ F), (IL, T), (AZ, T). F: Uncertain; T: Certain
  31. UADB Query Processing cont. User ⋈ Streets. User (LName, Street, C): (Smith, Lasalle, T), (Smith, Lake, T), (Jones, State, F). Streets (Street, state, C): (Lasalle, NY, F), (Lasalle, NY, F), (Lake, IL, T), (Tucson, AZ, T), (State, IL, F). Result row (Street, state, LName, C): (Lasalle, NY, Smith, F = F ∧ T). F: Uncertain; T: Certain
  32. K-relations [1]. • Incomplete set semantics • Incomplete bag semantics • Incomplete provenance • Etc. [1] T. J. Green. 2007. Provenance Semirings.
  33. Technical contributions. • Generalized incomplete databases and certain answers to K-relations: each possible world is a K-relation; certain answers are the greatest lower bound (GLB) on annotations across all worlds (using the natural order) • Derive UADBs from probabilistic/incomplete data models: proven to under-approximate certain answers; extract one possible world (with the highest probability, if feasible)
  34. Technical contributions cont. • Bounds on certain answers are preserved by standard K-relational query semantics: generalizes a result for sets due to Reiter [1]; in contrast to certain answers, UADBs are closed under queries! • Implementation for bags: rewriting frontend for a deterministic relational database. [1] R. Reiter. 1986. A Sound and Sometimes Complete Query Evaluation Algorithm for Relational Databases with Null Values.
  35. Experiments. • Do UADBs scale well? • Do UADBs have good utility? • How good is the approximation?
  36. Setups. • Data sets: PD-Bench [1] (TPC-H + error); real-world datasets with natural errors; cleaned datasets with known ground truth • Comparing with: approximated certain answers by Libkin [2]; sampling-based PDB (MCDB [3]); lineage-based PDB (MayBMS [4]). [1] L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data. [2] L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers. [3] R. Jampani. 2008. MCDB: A Monte Carlo Approach to Managing Uncertain Data. [4] L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
  37. Experimental results. • PD-Bench Q1 • 1GB • No probability. 5%
  38. Experimental results. • PD-Bench Q1 • 2% error • No probability
  39. Experimental results. “Best guess”
  40. Experimental results. “Random guess”
  41. Experimental results. • Real world data • On projection
  42. Conclusions & Future work. • Best Guess Certain answers • Approx. Certain • Lightweight, implementation friendly. Future work: larger class of queries? More precise while still efficient?
  43. Questions? • Vizier - https://vizierdb.info - demo session C (W/TH 4:20pm) • GitHub - https://github.com/IITDBGroup/gprom • References: Y. Yang. 2015. Lenses: An On-demand Approach to ETL. T. Imielinski. 1984. Incomplete Information in Relational Databases. P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage. L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data. L. Libkin. 2016. SQL’s Three-Valued Logic and Certain Answers. R. Jampani. 2008. MCDB: A Monte Carlo Approach to Managing Uncertain Data. L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions. R. Reiter. 1986. A Sound and Sometimes Complete Query Evaluation Algorithm for Relational Databases with Null Values. T. J. Green. 2007. Provenance Semirings. Aaron Huber, Oliver Kennedy, Su Feng, Boris Glavic
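
The label propagation walked through on the UADB query-processing slides can be sketched as follows. This is a minimal illustration under set semantics, assuming each tuple carries a Boolean certain label (True for T, False for F): a projection merges duplicate output tuples and ORs their labels, and a join ANDs the labels of the two input tuples. The function and variable names (`project`, `join_on_street`, `STREETS`, `USERS`) are hypothetical, and this is not the authors' implementation, which the slides describe as a rewriting frontend for a deterministic relational database.

```python
# Each relation is a list of (row, certain) pairs; `certain` is the
# Boolean label from the slides (True = T / certain, False = F / uncertain).
STREETS = [
    ({"street": "Lasalle", "state": "NY"}, False),
    ({"street": "Lasalle", "state": "NY"}, False),
    ({"street": "Lake", "state": "IL"}, True),
    ({"street": "Tucson", "state": "AZ"}, True),
    ({"street": "State", "state": "IL"}, False),
]

USERS = [
    ({"lname": "Smith", "street": "Lasalle"}, True),
    ({"lname": "Smith", "street": "Lake"}, True),
    ({"lname": "Jones", "street": "State"}, False),
]


def project(relation, columns):
    """Set projection: duplicate output tuples are merged and their
    labels are combined with OR (certain if any contributor is)."""
    out = {}
    for row, certain in relation:
        key = tuple(row[c] for c in columns)
        out[key] = out.get(key, False) or certain
    return out


def join_on_street(users, streets):
    """Equi-join on street: a joined tuple is certain only if both inputs
    are certain (AND); duplicate results are merged with OR."""
    out = {}
    for user, user_certain in users:
        for street, street_certain in streets:
            if user["street"] == street["street"]:
                key = (street["street"], street["state"], user["lname"])
                label = user_certain and street_certain
                out[key] = out.get(key, False) or label
    return out


if __name__ == "__main__":
    print(project(STREETS, ["state"]))
    print(join_on_street(USERS, STREETS))
```

On the example data, the projection labels NY as uncertain (F ∨ F), IL as certain (T ∨ F), and AZ as certain, matching slides 27 to 30, and the join labels (Lasalle, NY, Smith) as uncertain (F ∧ T), matching slide 31.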
