Boris Glavic

Aug. 2, 2019•0 likes•215 views

Science

Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.

- 1. Uncertainty-Annotated Databases: A lightweight approach for approximating certain answers. Aaron Huber, Oliver Kennedy, Su Feng, Boris Glavic
- 2. Uncertainty is everywhere
- 3. Uncertainty is everywhere
- 4. Uncertainty is everywhere
- 5. Uncertainty is everywhere
- 6. Uncertainty example. (Street, state): (State, IL) or (State, IN) or (State, CA)
- 7. Uncertainty example. X-tuples (id: Street, state): 1: (Lasalle, NY) or (Lasalle, IL) or (Lasalle, CA); 2: (Lasalle, NY) or (Lasalle, AZ); 3: (Lake, IL); 4: (Tucson, AZ); 5: (State, IL) or (State, IN) or (State, CA). Possible world (id, Street, state): (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN)
- 8. Uncertainty example. Possible World 1 (P=0.5): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Possible World 2 (P=0.3): (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Possible World 3 (P=…): (1, Lasalle, NY), (2, Lasalle, AZ), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). …
- 9. Incomplete data model. Query Q is posed over Possible World 1 (P=0.5), Possible World 2 (P=0.3), Possible World 3 (P=…), … (same worlds as on slide 8)
- 10. Incomplete data model. Q: SELECT DISTINCT state FROM address, evaluated on each world of slide 8. Possible World 1 (P=0.5): {NY, IL, AZ}. Possible World 2 (P=0.3): {NY, IL, AZ, IN}. Possible World 3 (P=0.2): {NY, IL, AZ, IN}. Combined: P({NY, IL, AZ}) = 0.5; P({NY, IL, AZ, IN}) = 0.3 + 0.2 = 0.5
- 11. Incomplete/Probabilistic databases. MayBMS [1]: lineage based. MCDB [2]: sampling based. [1] L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions. [2] R. Jampani. 2008. MCDB: A Monte Carlo Approach to Managing Uncertain Data.
- 12. Incomplete database: ×3, ×300
- 13. Features. Approach: probabilistic databases. Features: efficiency, expressiveness
- 14. Certain answers. Possible Worlds 1 to 3 as on slide 8. Certain answer (id, Street, state): (3, Lake, IL), (4, Tucson, AZ)
- 15. Certain answers. Under-approximated certain answers [1]. Lower bounds [2]. [1] L. Libkin. 2016. SQL's Three-Valued Logic and Certain Answers. [2] P. Guagliardo. 2017. Correctness of SQL Queries on Databases with Nulls.
- 16. Features. Approaches: probabilistic databases, approximated certain answers. Features: efficiency, expressiveness
- 17. Ground truth. Ground truth (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Possible World 2: (1, Lasalle, IL), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). Possible World 3: (1, Lasalle, NY), (2, Lasalle, AZ), (3, Lake, IL), (4, Tucson, AZ), (5, State, IN). … Certain answer: (3, Lake, IL), (4, Tucson, AZ)
- 18. Utility of certain answers
- 19. Features. Approaches: probabilistic databases, approximated certain answers. Features: efficiency, expressiveness, utility, trustworthiness
- 20. “Best Guess”. Best guess (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). No confidence in answers
- 21. Features. Approaches: probabilistic databases, approximated certain answers, best guess, UADB. Features: efficiency, expressiveness, utility, trustworthiness
- 22. Solution. “Best Guess” (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Approx. certain: (3, Lake, IL), (4, Tucson, AZ)
- 23. Uncertainty-annotated databases. UADB (id, Street, state): (1, Lasalle, NY), (2, Lasalle, NY), (3, Lake, IL), (4, Tucson, AZ), (5, State, IL). Labels: under-approximated certain answers, certain answers, best guess. Legend: uncertain, certain
- 24. Interoperability with IDBs/PDBs. [1] T. Imielinski. 1984. Incomplete Information in Relational Databases. [2] P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage.
- 25. Implementing UADB Queries
- 26. UADB example. Streets (Street, state, C): (Lasalle, NY, F), (Lasalle, NY, F), (Lake, IL, T), (Tucson, AZ, T), (State, IL, F). C: certain label; F: uncertain; T: certain
- 27. UADB Query Processing. Π_state over Streets (as on slide 26). Result (state, C): NY, IL, AZ, with no labels assigned yet
- 28. UADB Query Processing cont. Π_state over Streets. Result (state, C): NY, IL, (AZ, T)
- 29. UADB Query Processing cont. Π_state over Streets. Result (state, C): NY, (IL, T = T ∨ F), (AZ, T)
- 30. UADB Query Processing cont. Π_state over Streets. Result (state, C): (NY, F = F ∨ F), (IL, T), (AZ, T)
- 31. UADB Query Processing cont. User (LName, Street, C): (Smith, Lasalle, T), (Smith, Lake, T), (Jones, State, F). User ⋈ Streets (as on slide 26) yields (Lasalle, NY, Smith) with C: F = F ∧ T
- 32. K-relations [1]: incomplete set semantics, incomplete bag semantics, incomplete provenance, etc. [1] T. J. Green. 2007. Provenance Semirings.
- 33. Technical contributions. Generalized incomplete databases and certain answers to K-relations: each possible world is a K-relation; certain answers are the greatest lower bound (GLB) on annotations across all worlds (using the natural order). Derive UADBs from probabilistic/incomplete data models: proven to under-approximate certain answers; extract one possible world (with highest probability if feasible)
- 34. Technical contributions cont. Bounds on certain answers are preserved by standard K-relational query semantics: generalizes the result for sets due to Reiter [1]; in contrast to certain answers, UADBs are closed under queries. Implementation for bags: rewriting frontend for a deterministic relational database. [1] R. Reiter. 1986. A sound and sometimes complete query evaluation algorithm for relational databases with null values.
- 35. Experiments. Do UADBs scale well? Do UADBs have good utility? How good is the approximation?
- 36. Setups. Data sets: PD-Bench [1] (TPC-H + error); real-world datasets with natural errors; cleaned datasets with known ground truth. Compared with: approximated certain answers by Libkin [2]; sampling-based PDB, MCDB [3]; lineage-based PDB, MayBMS [4]. [1] L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data. [2] L. Libkin. 2016. SQL's Three-Valued Logic and Certain Answers. [3] R. Jampani. 2008. MCDB: A Monte Carlo Approach to Managing Uncertain Data. [4] L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions.
- 37. Experimental results: PD-Bench Q1, 1 GB, no probability (chart annotation: 5%)
- 38. Experimental results: PD-Bench Q1, 2% error, no probability
- 39. Experimental results: “Best guess”
- 40. Experimental results: “Random guess”
- 41. Experimental results: real-world data, on projection
- 42. Conclusions & Future work. Conclusions: best guess, certain answers, approx. certain; lightweight, implementation friendly. Future work: larger class of queries? More precise while still efficient?
- 43. Questions? Vizier: https://vizierdb.info, demo session C (W/TH 4:20pm). GitHub: https://github.com/IITDBGroup/gprom. References: Y. Yang. 2015. Lenses: An On-demand Approach to ETL. T. Imielinski. 1984. Incomplete Information in Relational Databases. P. Agrawal. 2006. Trio: A System for Data, Uncertainty, and Lineage. L. Antova. 2008. Fast and Simple Relational Processing of Uncertain Data. L. Libkin. 2016. SQL's Three-Valued Logic and Certain Answers. R. Jampani. 2008. MCDB: A Monte Carlo Approach to Managing Uncertain Data. L. Antova. 2007. MayBMS: Managing Incomplete Information with Probabilistic World Set Decompositions. R. Reiter. 1986. A sound and sometimes complete query evaluation algorithm for relational databases with null values. T. J. Green. 2007. Provenance Semirings. Aaron Huber, Oliver Kennedy, Su Feng, Boris Glavic

- Uncertainty is everywhere.
- Different sensors observing the same target may give different readings.
- Data cleaning may introduce uncertainty.
- Information extraction, such as searching for a street name, may return multiple options.
- For example, if we are unsure which address is correct, we can record all possible options. In this case we do not know whether State Street is in IL, IN, or CA. This specific representation is called x-tuples.
- Suppose we have a list of such options. Every row represents a tuple, and every row with multiple options represents a decision. Each combination of decisions across the rows yields a possible instance, which we call a possible world. For example, choosing the marked options results in the possible world on the right-hand side.
- In principle we can enumerate all possible worlds, for example these three and so on. This gives a naive representation of an incomplete database. We can also assign a probability to each possible world, in which case we call it a probabilistic database.
- Of course we want to run queries over it.
- Naively, we can do this by running the query over each possible world to get all possible outcomes. For example, if we have only three possible worlds and project the distinct states, we get these corresponding results, one per possible world. If probabilities are assigned, we add up the probabilities of worlds that produce identical results. However, this is completely impractical, since the number of possible worlds can be exponential in the number of decisions we have to make about the data. People have come up with several ways to deal with this more effectively.
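
The naive evaluation just described can be sketched in a few lines of Python. This is only an illustration on the three example worlds from the slides; the names `WORLDS`, `distinct_states`, and `answer_distribution` are hypothetical, and the probability 0.2 for the third world is an assumption read off the slide's sum 0.3 + 0.2.

```python
from collections import defaultdict

# Three possible worlds from the running example, each paired with its
# probability. Rows are (id, street, state). The 0.2 is an assumption.
WORLDS = [
    (0.5, [(1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
           (4, "Tucson", "AZ"), (5, "State", "IL")]),
    (0.3, [(1, "Lasalle", "IL"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
           (4, "Tucson", "AZ"), (5, "State", "IN")]),
    (0.2, [(1, "Lasalle", "NY"), (2, "Lasalle", "AZ"), (3, "Lake", "IL"),
           (4, "Tucson", "AZ"), (5, "State", "IN")]),
]


def distinct_states(world):
    """Q: SELECT DISTINCT state FROM address, evaluated on one world."""
    return frozenset(state for _id, _street, state in world)


def answer_distribution(worlds):
    """Run Q on every world and add up probabilities of identical results."""
    dist = defaultdict(float)
    for prob, world in worlds:
        dist[distinct_states(world)] += prob
    return dict(dist)


dist = answer_distribution(WORLDS)
for result, prob in dist.items():
    print(sorted(result), round(prob, 2))
```

The loop visits every world once, which is exactly what stops scaling: the five x-tuples of the running example alone already induce 3 · 2 · 1 · 1 · 3 = 18 worlds.
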
- Working with full PDBs is hard. There is extensive work on compactly representing PDBs and querying them with better performance. Most of these systems are either lineage based or sampling based. Computing exact probabilities is #P-hard in general. Probabilities can be approximated in polynomial time, but those systems still need to track all or a large set of possible answers, which remains a performance bottleneck.
- Here is a performance comparison between conventional query processing and the two types of probabilistic databases just introduced. In the best case, probabilistic query processing (PQP) is 3 times slower than conventional query processing, and as uncertainty increases PQP can be up to 300 times slower in our test case.
- So PDBs are very expressive because they track all possible answers and their probabilities, but doing so is a lot of work, which makes them very slow.
- What if we do not have probabilities? And since the set of possible answers can be very large, what if we keep only the certain answers? Certain answers are the answers common to all possible worlds. For example, if we treat id as one of the attributes, then these two tuples exist in all possible worlds, so they are the certain answers. However, exact certain answers are still expensive to compute.
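
As a rough sketch of this definition, the certain answers can be computed by brute force as the intersection of all possible worlds. The x-tuples below are the ones from the running example; `possible_worlds` and `certain_answers` are hypothetical helper names, and the sketch assumes every x-tuple contributes exactly one alternative to each world.

```python
from itertools import product

# X-tuples of the running example: id -> alternatives (street, state).
X_TUPLES = {
    1: [("Lasalle", "NY"), ("Lasalle", "IL"), ("Lasalle", "CA")],
    2: [("Lasalle", "NY"), ("Lasalle", "AZ")],
    3: [("Lake", "IL")],
    4: [("Tucson", "AZ")],
    5: [("State", "IL"), ("State", "IN"), ("State", "CA")],
}


def possible_worlds(x_tuples):
    """Yield one world (a set of rows) per combination of alternatives."""
    ids = sorted(x_tuples)
    for choice in product(*(x_tuples[i] for i in ids)):
        yield {(i, street, state) for i, (street, state) in zip(ids, choice)}


def certain_answers(worlds):
    """Rows that appear in every world (assumes at least one world)."""
    worlds = iter(worlds)
    certain = set(next(worlds))
    for world in worlds:
        certain &= world
    return certain


worlds = list(possible_worlds(X_TUPLES))
certain = certain_answers(worlds)
print(len(worlds), sorted(certain))
```

Only the rows with ids 3 and 4 survive the intersection, matching the slide. The cost is the problem: the intersection touches every world.
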
- Thanks to previous work, we can compute approximate certain answers efficiently. The result is a lower bound, that is, a subset of the actual certain answers.
- So approximate certain answers are efficient, but they track neither all possible answers nor their probabilities. Also, since certain answers give you only a subset of the query result, how bad is this in practice?
- To test how the certain-answers approach behaves, recall that in reality there is always a ground truth that is fundamentally correct. Of course the ground truth is unknown, but let's assume we know it. Certain answers are guaranteed to be a subset of the ground truth. However, they may be only a very conservative approximation of it.
- If we know the ground truth, we can measure utility as the precision and recall of the certain answers with respect to the ground truth, plotted against the amount of uncertainty, as shown on the graph. In one sentence: certain answers have high precision but low recall.
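
To make the utility measure concrete, here is a small sketch that computes precision and recall for the running example, taking the ground truth and certain answers from the slides. The function name `precision_recall` is hypothetical, and treating an empty answer as having precision 1.0 is a convention assumed here, not something the talk specifies.

```python
def precision_recall(answer, ground_truth):
    """Precision and recall of an answer set against the ground truth."""
    answer, ground_truth = set(answer), set(ground_truth)
    hits = len(answer & ground_truth)
    # Assumed convention: empty sets count as perfectly precise/complete.
    precision = hits / len(answer) if answer else 1.0
    recall = hits / len(ground_truth) if ground_truth else 1.0
    return precision, recall


# Ground truth and certain answers as shown on the slides.
GROUND_TRUTH = {
    (1, "Lasalle", "NY"), (2, "Lasalle", "NY"), (3, "Lake", "IL"),
    (4, "Tucson", "AZ"), (5, "State", "IL"),
}
CERTAIN = {(3, "Lake", "IL"), (4, "Tucson", "AZ")}

precision, recall = precision_recall(CERTAIN, GROUND_TRUTH)
print(precision, recall)
```

Every certain tuple is in the ground truth, so precision is 1.0, but only two of the five true tuples are returned, so recall is 0.4.
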
- So certain answers lack utility because they drop all merely possible tuples. These are the principled approaches, and both give you answers you can trust. Unfortunately, nobody uses them in practice.
- In practice, especially in many data cleaning tasks, people just make heuristic choices and ignore uncertainty afterwards. This is equivalent to picking one possible world, which we call the best-guess world. The downside of this approach is that we lose all information about uncertainty.
- Although you cannot trust any answer from the best-guess world, you gain efficiency by working on only one world and utility by getting closer to the ground truth. There is no way to get both efficiency and expressiveness. Since people want efficiency in practice, PDBs are out. Of the other two approaches, one gives up utility and the other gives up trustworthiness. Our approach, UADB, achieves efficiency, trustworthiness, and utility. For this we have to give up expressiveness, which means we do not track all possible answers.
- The obvious solution is to combine approximate certain answers with the best guess. We observe that approximate certain answers are always a subset of the best-guess world.
- So we can use heuristics to find a best-guess world and mark whether each tuple is certain (blue tuples) or uncertain (red tuples). Remember that the approximation of certain answers is a subset of the actual certain answers. Every tuple we mark as certain is guaranteed to be certain. Every tuple we mark as uncertain may or may not actually be uncertain. But we always bound the actual certain answers between the under-approximation and the best guess. This works because the certain answers are a subset of every possible world.
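
The following sketch shows one way such a labeled best-guess world could be derived from x-tuples, and checks the bounding property by brute force. It rests on assumptions that are not from the talk: the first alternative of each x-tuple is taken as the best guess, a row is labeled certain only when its x-tuple has a single alternative, and no x-tuple is optional. The names `to_uadb` and `certain_answers` are hypothetical.

```python
from itertools import product

# X-tuples of the running example: id -> alternatives (street, state).
X_TUPLES = {
    1: [("Lasalle", "NY"), ("Lasalle", "IL"), ("Lasalle", "CA")],
    2: [("Lasalle", "NY"), ("Lasalle", "AZ")],
    3: [("Lake", "IL")],
    4: [("Tucson", "AZ")],
    5: [("State", "IL"), ("State", "IN"), ("State", "CA")],
}


def to_uadb(x_tuples):
    """Return [(row, is_certain)] for an assumed best-guess world."""
    uadb = []
    for i, alternatives in sorted(x_tuples.items()):
        street, state = alternatives[0]  # assumed heuristic: first option
        uadb.append(((i, street, state), len(alternatives) == 1))
    return uadb


def certain_answers(x_tuples):
    """Exact certain answers by enumerating all worlds (for checking only)."""
    ids = sorted(x_tuples)
    worlds = [
        {(i, *alt) for i, alt in zip(ids, choice)}
        for choice in product(*(x_tuples[i] for i in ids))
    ]
    return set.intersection(*worlds)


uadb = to_uadb(X_TUPLES)
labeled_certain = {row for row, is_certain in uadb if is_certain}
best_guess = {row for row, _ in uadb}
exact = certain_answers(X_TUPLES)

# The exact certain answers are sandwiched between the two approximations.
print(labeled_certain <= exact <= best_guess)
```

With this heuristic the best-guess world coincides with the one on the slide, and rows 3 and 4 are the only ones labeled certain.
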
- Uncertainty comes from different sources, and a lot of existing work has proposed ways to model it. We do not want to reinvent the wheel, so we want to be able to translate existing models of uncertainty into UADBs. In our paper we present translations from three common representations: tuple-independent databases, c-tables, and x-tuples. Adding new models is easy.
- Once we have a UADB instance, querying it is lightweight and easy to implement. Our system is based on query rewriting, so the rewritten query can run on any conventional database system such as Postgres or Oracle.
- To use a relational database, we need a way to encode a UADB as relations. We implement the color labeling as an extra attribute at the end of the table, where false means uncertain and true means certain.
- Now that we have a representation, how do we rewrite queries to operate over it? Let's look at a projection on state. Here is the result on the best-guess world. What is the correct annotation that preserves the bounds?
- For the last tuple, AZ, there is clearly a certain AZ tuple in the input, so the output must contain the AZ tuple, and we label it as certain.
- For the second tuple, IL, two IL tuples in the input correspond to it; one of them is certain and the other is not. Since one of the input IL tuples is certain, the IL tuple will always be in the output whether or not the other, uncertain IL tuple exists. So we label it as certain.
- For the first tuple, NY, both input NY tuples are uncertain. The output NY tuple can therefore be missing when both of its uncertain inputs are missing, so we label it as uncertain.
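
The projection example can be reproduced on an ordinary database. The sketch below uses Python's built-in `sqlite3` module and a hand-written rewritten query; it is not the output of the actual rewriting frontend, and the table name `streets` and label column `c` (1 for certain, 0 for uncertain) are assumed encodings. With 0/1 labels, `MAX(c)` per group behaves like the logical OR of the input labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (street TEXT, state TEXT, c INTEGER)")
conn.executemany(
    "INSERT INTO streets VALUES (?, ?, ?)",
    [
        ("Lasalle", "NY", 0),
        ("Lasalle", "NY", 0),
        ("Lake", "IL", 1),
        ("Tucson", "AZ", 1),
        ("State", "IL", 0),
    ],
)

# Original query: SELECT DISTINCT state FROM streets
# Hand-written rewrite: an output row is certain if any input row
# that produces it is certain (OR over labels, expressed as MAX).
REWRITTEN = """
    SELECT state, MAX(c) AS c
    FROM streets
    GROUP BY state
    ORDER BY state
"""

rows = conn.execute(REWRITTEN).fetchall()
print(rows)  # [('AZ', 1), ('IL', 1), ('NY', 0)]
```

The result matches the labels derived on the slides: AZ and IL are certain, NY is uncertain.
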
- As another example, consider a join, showing only the join of these two specific tuples. If we join an uncertain tuple with a certain tuple, the output tuple is uncertain, because the output tuple does not exist as soon as one of the joining tuples is missing. Many of you may feel that these examples look very familiar, and they are, because we are using K-relations.
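
The join case can be sketched the same way, again with `sqlite3` and a hand-written rewrite rather than the system's actual output. The table and column names are assumptions; the `users` rows are the ones from the slide. SQLite's two-argument scalar `MIN` acts as the logical AND of two 0/1 labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (street TEXT, state TEXT, c INTEGER)")
conn.execute("CREATE TABLE users (lname TEXT, street TEXT, c INTEGER)")
conn.executemany(
    "INSERT INTO streets VALUES (?, ?, ?)",
    [
        ("Lasalle", "NY", 0),
        ("Lasalle", "NY", 0),
        ("Lake", "IL", 1),
        ("Tucson", "AZ", 1),
        ("State", "IL", 0),
    ],
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [("Smith", "Lasalle", 1), ("Smith", "Lake", 1), ("Jones", "State", 0)],
)

# Hand-written rewrite: a joined row is certain only if both inputs are
# certain (AND over labels, expressed as the scalar two-argument MIN).
REWRITTEN = """
    SELECT u.lname, s.street, s.state, MIN(u.c, s.c) AS c
    FROM users u JOIN streets s ON u.street = s.street
    ORDER BY u.lname, s.street
"""

rows = conn.execute(REWRITTEN).fetchall()
for row in rows:
    print(row)
```

Smith joined with the uncertain Lasalle rows is uncertain, while Smith joined with the certain Lake row stays certain.
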
- This means our approach works not only on uncertain sets but also on uncertain bags, uncertain provenance, and many other semirings whose query semantics can be expressed with K-relations.
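
A minimal sketch of the K-relation view, under assumptions: a relation is a dictionary from tuples to annotations, the class `Semiring` and the helpers `project` and `certain_annotation` are hypothetical names, and only the Boolean (sets) and natural-number (bags) semirings are shown. The certain annotation of a tuple is taken as the greatest lower bound of its annotations across worlds, which is `min` for bags and `and` for sets.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    plus: Callable[[Any, Any], Any]   # alternative derivations (projection)
    times: Callable[[Any, Any], Any]  # joint derivations (join)
    glb: Callable[[Any, Any], Any]    # greatest lower bound, natural order


BOOL = Semiring(False, lambda a, b: a or b, lambda a, b: a and b,
                lambda a, b: a and b)
NAT = Semiring(0, lambda a, b: a + b, lambda a, b: a * b, min)


def project(relation, positions, K):
    """Project onto the given positions, summing annotations with K.plus."""
    out = {}
    for tup, ann in relation.items():
        key = tuple(tup[p] for p in positions)
        out[key] = K.plus(out.get(key, K.zero), ann)
    return out


def certain_annotation(worlds, tup, K):
    """GLB of the tuple's annotation across all worlds."""
    return reduce(K.glb, (w.get(tup, K.zero) for w in worlds))


# The three example worlds as bags of (street, state) with multiplicities.
BAG_WORLDS = [
    {("Lasalle", "NY"): 2, ("Lake", "IL"): 1, ("Tucson", "AZ"): 1,
     ("State", "IL"): 1},
    {("Lasalle", "IL"): 1, ("Lasalle", "NY"): 1, ("Lake", "IL"): 1,
     ("Tucson", "AZ"): 1, ("State", "IN"): 1},
    {("Lasalle", "NY"): 1, ("Lasalle", "AZ"): 1, ("Lake", "IL"): 1,
     ("Tucson", "AZ"): 1, ("State", "IN"): 1},
]

# Project every world on state (position 1) under bag semantics.
state_worlds = [project(w, [1], NAT) for w in BAG_WORLDS]
certain_ny = certain_annotation(state_worlds, ("NY",), NAT)
certain_in = certain_annotation(state_worlds, ("IN",), NAT)
print(certain_ny, certain_in)
```

Under bag semantics NY is certain with multiplicity 1 (it occurs 2, 1, and 1 times), while IN has certain multiplicity 0 because it is absent from the first world.
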
- To summarize the technical contributions, our approach..
- We use both synthetic datasets, namely TPC-H with randomly generated errors, and real-world datasets.
- When we run our approach on the synthetic dataset and plot runtime on a log scale against the percentage of uncertainty, we see that it has roughly 5% overhead compared with conventional query processing. The overhead stays low as the amount of uncertainty increases.
- The low overhead also holds as the data size grows.
- In terms of utility, if we use a good best guess produced by a good ML model, e.g. Spark ML, we get precision over 90% and recall over 80%.
- And if we have no information, or it is too expensive to obtain the best-guess world, here is the result when we randomly pick one world.
- This is one of ten examples showing the misclassification of certain tuples on real-world data. It really depends on the data itself, but in this case the worst case is 0.8% misclassified tuples.
- By combining the best guess with an under-approximation of certain answers, we developed an uncertainty model that bounds the actual certain answers and is lightweight and easy to implement.
- Our model also plays a role in the Vizier system, so feel free to try the Vizier demo too!