
2019 - SIGMOD - Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances

Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.

Published in: Science

  1. 1. Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances SIGMOD 2019 Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy Illinois Institute of Technology Duke University SIGMOD Research Session 5 - July 3rd - 11:30am Slide 1 of 16 Q. Zeng - CAPE:
  2. 2. Explain surprising query results Slide 2 of 16 Q. Zeng - CAPE: Introduction
  5. 5. Related Work Provenance: semiring model [Green et al., 2007]; causality based [Meliou et al., 2010]; provenance systems [Arab et al., 2014] Slide 3 of 16 Q. Zeng - CAPE: Introduction
  6. 6. Related Work Provenance: semiring model [Green et al., 2007]; causality based [Meliou et al., 2010]; provenance systems [Arab et al., 2014] "Why high/low" questions [Wu and Madden, 2013] [Roy and Suciu, 2014] Intervention: a subset of provenance whose removal would cause the result to move in the opposite direction Slide 3 of 16 Q. Zeng - CAPE: Introduction
  7. 7. Related Work Provenance: semiring model [Green et al., 2007]; causality based [Meliou et al., 2010]; provenance systems [Arab et al., 2014] "Why high/low" questions [Wu and Madden, 2013] [Roy and Suciu, 2014] Intervention: a subset of provenance whose removal would cause the result to move in the opposite direction All based on provenance Slide 3 of 16 Q. Zeng - CAPE: Introduction
  8. 8. Only provenance is useful? Slide 4 of 16 Q. Zeng - CAPE: Introduction
  9. 9. Only provenance is useful? No. Non-provenance can be useful. Slide 4 of 16 Q. Zeng - CAPE: Introduction
  11. 11. Only provenance is useful? Boris: Why did you work only 2 hours yesterday? Slide 4 of 16 Q. Zeng - CAPE: Introduction
  12. 12. Only provenance is useful? Boris: Why did you work only 2 hours yesterday? Qitian (provenance based explanation): Yeah, I worked from 9-11 AM. Slide 4 of 16 Q. Zeng - CAPE: Introduction
  13. 13. Only provenance is useful? Boris: Why did you work only 2 hours yesterday? Qitian (provenance based explanation): Yeah, I worked from 9-11 AM. Boris: Okay, I’m cutting your stipend. Slide 4 of 16 Q. Zeng - CAPE: Introduction
  14. 14. Only provenance is useful? Boris: Why did you work only 2 hours yesterday? Qitian: I was on a plane to SIGMOD for 8 hours. Boris: Fair enough. Slide 4 of 16 Q. Zeng - CAPE: Introduction
  15. 15. Example - Table Pub(author, pubid, year, venue) with rows (AX, P1, 2005, SIGKDD), (AY, P2, 2004, SIGKDD), (AZ, P2, 2004, SIGKDD), (AZ, P3, 2004, SIGMOD) Q = SELECT author, year, venue, count(*) AS pubcnt FROM Pub GROUP BY author, year, venue Slide 5 of 16 Q. Zeng - CAPE: Introduction
  16. 16. Example - Table Pub(author, pubid, year, venue) with rows (AX, P1, 2005, SIGKDD), (AY, P2, 2004, SIGKDD), (AZ, P2, 2004, SIGKDD), (AZ, P3, 2004, SIGMOD) Q = SELECT author, year, venue, count(*) AS pubcnt FROM Pub GROUP BY author, year, venue Result rows (author, venue, year, pubcnt): (AX, SIGKDD, 2006, 4), (AX, SIGKDD, 2007, 1), (AX, SIGKDD, 2008, 4) Slide 5 of 16 Q. Zeng - CAPE: Introduction
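
The grouping query Q on this slide can be tried on the four example rows. The following is a minimal Python sketch, not the authors' code; the function name run_query is hypothetical. Because the slide shows only an excerpt of Pub, every group here has a count of 1, and the result rows on the slide (for example AX, SIGKDD, 2007, 1) presumably come from the full data set.

```python
from collections import Counter

# The four example rows of Pub(author, pubid, year, venue) shown on the slide.
pub = [
    ("AX", "P1", 2005, "SIGKDD"),
    ("AY", "P2", 2004, "SIGKDD"),
    ("AZ", "P2", 2004, "SIGKDD"),
    ("AZ", "P3", 2004, "SIGMOD"),
]

def run_query(rows):
    """Mimics: SELECT author, year, venue, count(*) AS pubcnt
    FROM Pub GROUP BY author, year, venue."""
    counts = Counter((author, year, venue) for author, _pubid, year, venue in rows)
    # One output row per group, sorted for a stable order.
    return [(a, y, v, n) for (a, y, v), n in sorted(counts.items())]

result = run_query(pub)
for row in result:
    print(row)
```

Each output row has the shape (author, year, venue, pubcnt), which is the kind of tuple the user question φ later refers to.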
  17. 17. Example - Query Result φ = “Why is the number of AX’s SIGKDD 2007 papers low?” (a why high/low question over an aggregate query) Slide 6 of 16 Q. Zeng - CAPE: Introduction
  18. 18. Example - Query Result φ = “Why is the number of AX’s SIGKDD 2007 papers low?” (a why high/low question over an aggregate query) Provenance-based approach: by "intervention" Slide 6 of 16 Q. Zeng - CAPE: Introduction
  19. 19. Example - Query Result φ = “Why is the number of AX’s SIGKDD 2007 papers low?” (a why high/low question over an aggregate query) Provenance-based approach: by "intervention", a subset of provenance whose removal makes AX’s SIGKDD 2007 paper count go up Slide 6 of 16 Q. Zeng - CAPE: Introduction
  21. 21. Example - Query Result φ = “Why is the number of AX’s SIGKDD 2007 papers low?” (a why high/low question over an aggregate query) Provenance-based approach: by "intervention", a subset of provenance whose removal makes AX’s SIGKDD 2007 paper count go up Our approach: by counterbalance, AX’s high publication numbers in other conferences or other years Slide 6 of 16 Q. Zeng - CAPE: Introduction
  22. 22. Our Approach Assumptions of φ: A pattern exists which describes the data (Aggregate Regression Pattern, or ARP); (AX, SIGKDD, 2007, 1) is a low outlier of the pattern Slide 7 of 16 Q. Zeng - CAPE: Introduction
  23. 23. Our Approach Assumptions of φ: A pattern exists which describes the data (Aggregate Regression Pattern, or ARP); (AX, SIGKDD, 2007, 1) is a low outlier of the pattern Mine ARPs Slide 7 of 16 Q. Zeng - CAPE: Introduction
  24. 24. Our Approach Assumptions of φ: A pattern exists which describes the data (Aggregate Regression Pattern, or ARP); (AX, SIGKDD, 2007, 1) is a low outlier of the pattern Mine ARPs → Look for counterbalance Slide 7 of 16 Q. Zeng - CAPE: Introduction
  25. 25. Our Approach Assumptions of φ: A pattern exists which describes the data (Aggregate Regression Pattern, or ARP); (AX, SIGKDD, 2007, 1) is a low outlier of the pattern Mine ARPs → Look for counterbalance → Present top k Slide 7 of 16 Q. Zeng - CAPE: Introduction
  26. 26. Our Approach Assumptions of φ: A pattern exists which describes the data (Aggregate Regression Pattern, or ARP); (AX, SIGKDD, 2007, 1) is a low outlier of the pattern Mine ARPs (offline) → Look for counterbalance → Present top k (interactive with user question) Slide 7 of 16 Q. Zeng - CAPE: Introduction
  27. 27. Our Approach Assumptions of φ: A pattern exists which describes the data (Aggregate Regression Pattern, or ARP); (AX, SIGKDD, 2007, 1) is a low outlier of the pattern Mine ARPs (offline) → Look for counterbalance → Present top k (interactive with user question); together: CAPE Slide 7 of 16 Q. Zeng - CAPE: Introduction
  28. 28. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  29. 29. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A set of partition attributes (here: author) Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  30. 30. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A set of partition attributes (here: author) A set of predictor attributes (here: year) Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  31. 31. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A set of partition attributes (here: author) A set of predictor attributes (here: year) An aggregate function (here: count(*)) Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  32. 32. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A set of partition attributes (here: author) A set of predictor attributes (here: year) An aggregate function (here: count(*)) A regression model type (here: linear) Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  33. 33. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A pattern can hold locally on a fixed value of the partition attributes Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  34. 34. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A pattern can hold locally on a fixed value of the partition attributes Say, P holds on AX Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  36. 36. Aggregate Regression Pattern P = "For each author, the total publication count (count(*)) is linear over the years" A pattern can hold locally on a fixed value of the partition attributes A pattern can also hold globally if it holds for sufficiently many values of the partition attributes (a good number of authors) Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
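
The four components of an ARP and the local/global distinction can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the CAPE implementation: the class name ARP, the use of R² as the goodness-of-fit measure, the thresholds 0.8 and 0.5, and the per-author yearly counts, which are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ARP:
    partition: tuple   # partition attributes, e.g. ("author",)
    predictor: tuple   # predictor attributes, e.g. ("year",)
    aggregate: str     # aggregate function, e.g. "count(*)"
    model: str         # regression model type, e.g. "linear"

def r_squared_linear(xs, ys):
    """R^2 of the least-squares line y = a + b*x (squared correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        # Constant y is fitted perfectly; constant x cannot explain varying y.
        return 1.0 if syy == 0 else 0.0
    return (sxy * sxy) / (sxx * syy)

def holds_locally(series, threshold=0.8):
    """series maps predictor value (year) to aggregate value for ONE partition value."""
    xs, ys = zip(*sorted(series.items()))
    return r_squared_linear(xs, ys) >= threshold

def holds_globally(per_partition, local_threshold=0.8, min_fraction=0.5):
    """Holds globally if it holds locally for a large enough share of partition values."""
    good = sum(holds_locally(s, local_threshold) for s in per_partition.values())
    return good / len(per_partition) >= min_fraction

pattern = ARP(("author",), ("year",), "count(*)", "linear")

# Hypothetical yearly publication counts per author (not from the talk).
counts = {
    "A1": {2004: 2, 2005: 4, 2006: 6, 2007: 8},   # perfectly linear
    "A2": {2004: 5, 2005: 1, 2006: 6, 2007: 2},   # no linear trend
    "A3": {2004: 3, 2005: 5, 2006: 6, 2007: 9},   # close to linear
}
print(pattern, holds_globally(counts))
```

In this toy data the pattern holds locally for A1 and A3 but not for A2, so with the assumed 50% cutoff it also holds globally.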
  37. 37. Mining ARP Brute force: at least 3^|R| candidate patterns Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  38. 38. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  39. 39. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size: maximum 4 attributes in a pattern. This alone reduces the number of candidate patterns to polynomial in |R|. Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  40. 40. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size; Reusing sort order Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  41. 41. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size; Reusing sort order, e.g. (partition attributes | predictor attributes): (A,B,C | D), (A,B | C,D), (A | B,C,D) Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  42. 42. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size; Reusing sort order; Detecting and applying functional dependencies Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  43. 43. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size; Reusing sort order; Detecting and applying functional dependencies: "For each A, agg(α) is linear over C" and A → B ⇒ "For each A and B, agg(α) is linear over C" Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  44. 44. Mining ARP Brute force: at least 3^|R| candidate patterns Optimization: Restricting size; Reusing sort order; Detecting and applying functional dependencies Performance evaluation on Chicago crime data, PostgreSQL [Figure: (a) 10k rows: time (sec) vs. #attributes (4 to 11) for naive, cube, and ARP-mine; (b) 8 attributes: time (sec) vs. #rows (10k, 50k, 100k) with and without FD optimization] Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
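
The effect of the size restriction can be checked with a short count. This is a simplified sketch under stated assumptions: it only counts role assignments, where each attribute of R is a partition attribute, a predictor attribute, or unused, and it ignores the choice of aggregate function and model type (which is why the slide says "at least"). The function names are hypothetical.

```python
from itertools import product
from math import comb

def count_role_assignments(num_attrs):
    """Every attribute gets one of three roles: partition, predictor, or unused."""
    return 3 ** num_attrs

def count_restricted(num_attrs, max_size=4):
    """Assignments that use at most max_size attributes:
    pick k attributes, then give each one of the two non-trivial roles."""
    return sum(comb(num_attrs, k) * 2 ** k for k in range(max_size + 1))

def enumerate_restricted(attrs, max_size=4):
    """Brute-force enumeration, used only to cross-check the closed form."""
    roles = ("partition", "predictor", "unused")
    for assignment in product(roles, repeat=len(attrs)):
        used = sum(role != "unused" for role in assignment)
        if used <= max_size:
            yield dict(zip(attrs, assignment))

for n in (5, 8, 11):
    print(n, count_role_assignments(n), count_restricted(n))
```

For a fixed maximum size the restricted count grows polynomially in the number of attributes (dominated by the C(n, 4) term), whereas the unrestricted count grows as 3^n.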
  45. 45. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  46. 46. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  47. 47. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Holds locally on φ Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  48. 48. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Holds locally on φ, e.g. P1 = "For each author and venue, the total publication count is constant over the years" needs to hold on (AX, SIGKDD) Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  49. 49. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Holds locally on φ, e.g. P1 = "For each author and venue, the total publication count is constant over the years" needs to hold on (AX, SIGKDD) (chart: AX’s number of SIGKDD publications each year) Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  50. 50. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Holds locally on φ, e.g. P1 = "For each author and venue, the total publication count is constant over the years" needs to hold on (AX, SIGKDD) Generalizes φ Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  51. 51. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Holds locally on φ, e.g. P1 = "For each author and venue, the total publication count is constant over the years" needs to hold on (AX, SIGKDD) Generalizes φ, e.g. P = "For each author, the total publication count is linear over the years" Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  52. 52. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) Holds locally on φ, e.g. P1 = "For each author and venue, the total publication count is constant over the years" needs to hold on (AX, SIGKDD) Generalizes φ, e.g. P = "For each author, the total publication count is linear over the years" (chart: AX’s number of publications each year) Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  53. 53. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  54. 54. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) P = "For author AX, the total publication count is linear over the years" Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  55. 55. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) P = "For author AX, the total publication count is linear over the years" → refined to author AX and venue ICDE, constant model Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  56. 56. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) P = "For author AX, the total publication count is linear over the years" → refined to author AX and venue ICDE, constant model: P1 = "For author AX and ICDE, the total publication count is constant over the years" Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  57. 57. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) P = "For author AX, the total publication count is linear over the years" → refined to author AX and venue ICDE, constant model: P1 = "For author AX and ICDE, the total publication count is constant over the years" In this simple example the refinement happens to lead back to the same attributes as the user question, but this does not have to be the case Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  58. 58. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) P1 = "For author AX and ICDE, the total publication count is constant over the years" 3. t = (AX, ICDE, 2007, 6) ∈ Q_P1 Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  59. 59. Steps of Counterbalancing φ = “Why is the number of AX’s SIGKDD 2007 papers low?” 1. Relevant pattern (not all patterns are useful) 2. Refinement (there might not be a direct counterbalance on the relevant pattern) P1 = "For author AX and ICDE, the total publication count is constant over the years" 3. t = (AX, ICDE, 2007, 6) ∈ Q_P1; t[pubcnt] = 6 is a high outlier Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
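
Step 3 can be sketched for a constant-model pattern such as P1: predict the mean for every year and report tuples that deviate in the direction opposite to the question. This is a rough illustration, not CAPE's algorithm. The function name, the deviation cutoff, and all yearly ICDE counts except the 2007 value of 6 (taken from the slide) are hypothetical.

```python
def find_counterbalances(series, question_direction, min_deviation=1.0):
    """series maps year -> aggregate value for one refined pattern
    that holds locally with a constant model.
    Returns (year, value, deviation) for outliers in the opposite
    direction of the question ("low" question -> high outliers)."""
    expected = sum(series.values()) / len(series)  # constant model: the mean
    opposite_sign = 1 if question_direction == "low" else -1
    found = []
    for year, value in sorted(series.items()):
        deviation = value - expected
        if opposite_sign * deviation >= min_deviation:
            found.append((year, value, deviation))
    return found

# AX's ICDE publication counts per year; only 2007 -> 6 is from the slide,
# the other values are made up for illustration.
icde_counts = {2005: 2, 2006: 2, 2007: 6, 2008: 2}

counterbalances = find_counterbalances(icde_counts, "low")
print(counterbalances)  # [(2007, 6, 3.0)]
```

Fitting the mean over all years, including the outlier itself, is a simplification; a more careful version would use whatever model the mined pattern stores for that partition value.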
  60. 60. Explanation Explanations returned by CAPE for φ contain AX’s number of publications in other venues or other years, e.g. (AX, ICDE, 2006, 6), (AX, VLDB, 2007, 4). They don’t need to have the same schema as φ, e.g. (AX, 2010, 63) Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  61. 61. Explanation Explanations returned by CAPE for φ contain AX’s number of publications in other venues or other years, e.g. (AX, ICDE, 2006, 6), (AX, VLDB, 2007, 4). They don’t need to have the same schema as φ, e.g. (AX, 2010, 63) Not all counterbalances are good. We need to score them and return the top ones. Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  62. 62. Scoring Explanations 1 The distance between user question tuple and explanation tuple. Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  63. 63. Scoring Explanations 1 The distance between user question tuple and explanation tuple. ⇒ Tuples that are more similar to the question are more likely to have caused the unusual result. For φ = (AX, SIGKDD, 2007, 1), 2007 is a better answer than 2006, and ICDE is better than a conference in another area such as SIGCOMM Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  64. 64. Scoring Explanations 1 The distance between user question tuple and explanation tuple. 2 The deviation of explanation tuple from its expected value. Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  65. 65. Scoring Explanations 1 The distance between user question tuple and explanation tuple. 2 The deviation of explanation tuple from its expected value. ⇒ Higher deviation means more unusual, which is more likely to cause other unusual events. Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
  66. 66. Scoring Explanations 1 The distance between user question tuple and explanation tuple. 2 The deviation of explanation tuple from its expected value. ⇒ Higher deviation means more unusual, which is more likely to cause other unusual events. (charts: AX’s SIGKDD publications; AX’s ICDE publications) Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
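
The two scoring factors can be combined in many ways. The sketch below uses deviation divided by (1 + distance), which is an assumption chosen for illustration and not the scoring formula from the paper. The distance function is also hypothetical: it treats any two different venues as equally far apart, so it does not capture the point that ICDE is closer to SIGKDD than SIGCOMM.

```python
def distance(question, candidate, numeric_span=10):
    """Average per-attribute distance over the attributes both tuples share.
    Numeric attributes (e.g. year) use a scaled absolute difference capped at 1;
    categorical attributes (e.g. venue) are 0 if equal and 1 otherwise."""
    shared = [attr for attr in question if attr in candidate]
    if not shared:
        return 1.0  # nothing in common: treat as maximally distant
    total = 0.0
    for attr in shared:
        q, c = question[attr], candidate[attr]
        if isinstance(q, (int, float)) and isinstance(c, (int, float)):
            total += min(abs(q - c) / numeric_span, 1.0)
        else:
            total += 0.0 if q == c else 1.0
    return total / len(shared)

def score(question, candidate, deviation):
    """Higher deviation raises the score, larger distance lowers it."""
    return abs(deviation) / (1.0 + distance(question, candidate))

question = {"author": "AX", "venue": "SIGKDD", "year": 2007}
same_year = {"author": "AX", "venue": "ICDE", "year": 2007}
other_year = {"author": "AX", "venue": "ICDE", "year": 2006}

# Same (hypothetical) deviation of 3.0 for both candidates.
print(score(question, same_year, 3.0), score(question, other_year, 3.0))
```

With equal deviation the 2007 candidate scores higher than the 2006 one, matching the slide's remark that 2007 is a better answer than 2006. Candidates with a different schema, such as (AX, 2010, 63), are compared on the shared attributes only.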
  67. 67. Qualitative Evaluation Another example: Chicago crime data: Crime(id, type, community, year) Q = γ_{type, community, year, count(*)}(Crime) φ = "Why is battery crime in 2011 at community area 26 low (16)?" Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
  68. 68. Qualitative Evaluation Another example: Chicago crime data: Crime(id, type, community, year) Q = γ_{type, community, year, count(*)}(Crime) φ = "Why is battery crime in 2011 at community area 26 low (16)?" Explanations (rank, type, community, year, count(*), score; "-" marks a cell left empty on the slide): (1, -, 26, 2012, 117, 63.9) Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
  69. 69. Qualitative Evaluation Another example: Chicago crime data: Crime(id, type, community, year) Q = γ_{type, community, year, count(*)}(Crime) φ = "Why is battery crime in 2011 at community area 26 low (16)?" Explanations (rank, type, community, year, count(*), score; "-" marks a cell left empty on the slide): (1, -, 26, 2012, 117, 63.9), (2, Battery, 25, 2011, 79, 60.5) Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
  70. 70. Qualitative Evaluation Another example: Chicago crime data: Crime(id, type, community, year) Q = γ_{type, community, year, count(*)}(Crime) φ = "Why is battery crime in 2011 at community area 26 low (16)?" Explanations (rank, type, community, year, count(*), score; "-" marks a cell left empty on the slide): (1, -, 26, 2012, 117, 63.9), (2, Battery, 25, 2011, 79, 60.5), (3, Battery, -, 2010, 1095, 49.0) Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
  71. 71. Qualitative Evaluation Another example: Chicago crime data: Crime(id, type, community, year) Q = γ_{type, community, year, count(*)}(Crime) φ = "Why is battery crime in 2011 at community area 26 low (16)?" Explanations (rank, type, community, year, count(*), score; "-" marks a cell left empty on the slide): (1, -, 26, 2012, 117, 63.9), (2, Battery, 25, 2011, 79, 60.5), (3, Battery, -, 2010, 1095, 49.0), (4, Assault, 26, 2011, 10, 40.1) Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
  72. 72. Conclusion & Future Work Conclusions Provenance may be insufficient Reasonable explanations can be given by counterbalance Mine patterns offline Look for counterbalance and rank online Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
  73. 73. Conclusion & Future Work Conclusions Provenance may be insufficient Reasonable explanations can be given by counterbalance Mine patterns offline Look for counterbalance and rank online Future Work Extend to a larger class of queries, e.g., joins Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
  74. 74. Questions? ? GitHub https://github.com/IITDBGroup/cape Demo VLDB 2019 Slide 15 of 16 Q. Zeng - CAPE: Conclusion & Future Work
  75. 75. References I [Arab et al., 2014] Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., and Glavic, B. (2014). A generic provenance middleware for database queries, updates, and transactions. In Proceedings of the 6th USENIX Workshop on the Theory and Practice of Provenance. [Green et al., 2007] Green, T. J., Karvounarakis, G., and Tannen, V. (2007). Provenance semirings. In PODS, pages 31–40. [Meliou et al., 2010] Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. (2010). The complexity of causality and responsibility for query answers and non-answers. PVLDB, 4(1):34–45. [Roy and Suciu, 2014] Roy, S. and Suciu, D. (2014). A formal approach to finding explanations for database queries. In SIGMOD, pages 1579–1590. [Wu and Madden, 2013] Wu, E. and Madden, S. (2013). Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553–564. Slide 16 of 16 Q. Zeng - CAPE: Bibliography
