Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
1. Going Beyond Provenance: Explaining Query Answers
with Pattern-based Counterbalances
SIGMOD 2019
Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy
Illinois Institute of Technology and Duke University
SIGMOD Research Session 5 - July 3rd - 11:30am
5. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
6. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
"Why high/low" question [Wu and Madden, 2013] [Roy and Suciu, 2014]
Intervention: a subset of provenance whose removal would cause
the result to move in the opposite direction
7. Related Work
Provenance
Semiring model [Green et al., 2007]
Causality-based [Meliou et al., 2010]
Provenance systems [Arab et al., 2014]
"Why high/low" question [Wu and Madden, 2013] [Roy and Suciu, 2014]
Intervention: a subset of provenance whose removal would cause
the result to move in the opposite direction
All based on provenance
11. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
12. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
Qitian (provenance-based explanation): Yeah, I worked from 9-11 AM.
13. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
Qitian (provenance-based explanation): Yeah, I worked from 9-11 AM.
Boris: Okay, I’m cutting your stipend.
14. Is only provenance useful?
Boris: Why did you work only 2 hours yesterday?
Qitian: I was on a plane to SIGMOD for 8 hours.
Boris: Fair enough.
15. Example - Table
Pub
author pubid year venue
AX P1 2005 SIGKDD
AY P2 2004 SIGKDD
AZ P2 2004 SIGKDD
AZ P3 2004 SIGMOD
Q =
SELECT author, year, venue, count(*) AS pubcnt
FROM Pub
GROUP BY author, year, venue
16. Example - Table
Pub
author pubid year venue
AX P1 2005 SIGKDD
AY P2 2004 SIGKDD
AZ P2 2004 SIGKDD
AZ P3 2004 SIGMOD
Q =
SELECT author, year, venue, count(*) AS pubcnt
FROM Pub
GROUP BY author, year, venue
author venue year pubcnt
AX SIGKDD 2006 4
AX SIGKDD 2007 1
AX SIGKDD 2008 4
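To make the toy example concrete, the following minimal Python/sqlite3 sketch loads only the four Pub rows shown above and runs the slide's query Q. (The result table on the slide, with AX's SIGKDD counts for 2006-2008, presumably comes from a larger instance; over just these four rows every group has pubcnt 1.)

import sqlite3

# Toy Pub instance copied from the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pub (author TEXT, pubid TEXT, year INT, venue TEXT)")
conn.executemany(
    "INSERT INTO Pub VALUES (?, ?, ?, ?)",
    [("AX", "P1", 2005, "SIGKDD"),
     ("AY", "P2", 2004, "SIGKDD"),
     ("AZ", "P2", 2004, "SIGKDD"),
     ("AZ", "P3", 2004, "SIGMOD")])

# The query Q from the slide.
Q = """SELECT author, year, venue, count(*) AS pubcnt
       FROM Pub
       GROUP BY author, year, venue"""
for row in conn.execute(Q):
    print(row)   # e.g. ('AX', 2005, 'SIGKDD', 1)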
17. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
18. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
19. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of provenance whose removal makes
AX’s SIGKDD 2007 publication count go up
20. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of provenance whose removal makes
AX’s SIGKDD 2007 publication count go up
21. Example - Query Result
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Why high/low question
Aggregate query
Provenance-based approach: by "intervention"
A subset of provenance whose removal makes
AX’s SIGKDD 2007 publication count go up
Our approach: by counterbalance
AX’s high publication count in another venue or year
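To illustrate the counterbalance idea only (the scoring below is a simple stand-in, not CAPE's pattern-based ranking), one can scan the aggregated result for rows of the same author at other venues or years whose count is unusually high, i.e. outliers in the opposite direction of the "low" question:

from statistics import mean

def counterbalances(result, author, venue, year, k=3):
    # result: the aggregated rows as dicts with keys author, venue, year, pubcnt.
    mine = [r for r in result if r["author"] == author]
    avg = mean(r["pubcnt"] for r in mine)
    # Same author, but a different venue or year, with an unusually high count.
    others = [r for r in mine if (r["venue"], r["year"]) != (venue, year)]
    high = [r for r in others if r["pubcnt"] > avg]
    return sorted(high, key=lambda r: r["pubcnt"] - avg, reverse=True)[:k]

# Hypothetical usage for the question about (AX, SIGKDD, 2007):
#   counterbalances(query_result, "AX", "SIGKDD", 2007)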
22. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
23. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs
24. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance
Slide 7 of 16 Q. Zeng - CAPE: Introduction
25. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance → Present top k
Slide 7 of 16 Q. Zeng - CAPE: Introduction
26. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance → Present top k
(ARP mining is offline; the counterbalance search and ranking are interactive, driven by the user question)
Slide 7 of 16 Q. Zeng - CAPE: Introduction
27. Our Approach
Assumptions of φ:
A pattern exists which describes the data (Aggregate Regression
Pattern, or ARP)
(AX, SIGKDD, 2007, 1) is a low outlier of the pattern
Mine ARPs → Look for counterbalance → Present top k
(ARP mining is offline; the counterbalance search and ranking are interactive, driven by the user question)
CAPE
Slide 7 of 16 Q. Zeng - CAPE: Introduction
28. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
29. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
30. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
A set of predictor attributes
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
31. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
A set of predictor attributes
An aggregate function
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
32. Aggregate Regression Pattern
A set of partition attributes
P="For each author , the total publication (count(*)) is linear over
the years "
A set of predictor attributes
An aggregate function
A regression model type
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
33. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
A pattern can hold locally on a fixed value of partition attributes
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
34. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
A pattern can hold locally on a fixed value of partition attributes Say,
P holds on AX
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
35. Aggregate Regression Pattern
P="For each author , the total publication (count(*)) is linear over
the years "
A pattern can hold locally on a fixed value of partition attributes
A pattern can also hold globally if it holds for sufficiently many values
of partition attributes (A good number of authors)
Slide 8 of 16 Q. Zeng - CAPE: Counterbalance with ARP
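As a rough, self-contained illustration of the definitions above, the sketch below represents an ARP by its four components and checks whether it holds locally (for one author) or globally (for sufficiently many authors). The R^2 goodness-of-fit test, its threshold, and the support fraction are assumptions made for the example; the slides do not give CAPE's exact fit and support criteria.

from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

@dataclass(frozen=True)
class ARP:
    partition_attrs: Tuple[str, ...]   # e.g. ("author",)
    predictor_attrs: Tuple[str, ...]   # e.g. ("year",)
    agg: str                           # e.g. "count(*)"
    model: str                         # e.g. "linear"

def fits_linear(xs, ys, r2_min=0.75):
    # Least-squares line for one partition value, accepted if R^2 clears a
    # threshold (the threshold is an assumption, not the paper's definition).
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    slope, intercept = np.polyfit(xs, ys, deg=1)
    residuals = ys - (slope * xs + intercept)
    ss_tot = np.sum((ys - ys.mean()) ** 2)
    r2 = 1.0 - np.sum(residuals ** 2) / ss_tot if ss_tot > 0 else 1.0
    return r2 >= r2_min

def holds_locally(series: List[Tuple[int, int]]) -> bool:
    # P holds locally on one partition value if its (year, count) series fits the model.
    if len(series) < 3:
        return False
    xs, ys = zip(*series)
    return fits_linear(xs, ys)

def holds_globally(groups: Dict[str, List[Tuple[int, int]]], min_frac=0.6) -> bool:
    # P holds globally if it holds locally on a large enough fraction of partition values.
    local = [holds_locally(s) for s in groups.values()]
    return sum(local) / len(local) >= min_frac

# P = "for each author, count(*) is linear over the years"
P = ARP(("author",), ("year",), "count(*)", "linear")
groups = {   # author -> [(year, pubcnt), ...]; the numbers are invented
    "AX": [(2005, 3), (2006, 5), (2007, 6), (2008, 8)],
    "AY": [(2005, 1), (2006, 2), (2007, 3)],
    "AZ": [(2005, 4), (2006, 1), (2007, 5)],
}
print(holds_locally(groups["AX"]), holds_globally(groups))   # True True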
37. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
38. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
39. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
at most 4 attributes in a pattern. This alone reduces the number of candidate patterns to a polynomial in |R|.
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
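To see the effect of the size restriction, the snippet below counts candidate role assignments for an illustrative |R| = 10, assuming each attribute is either a partition attribute, a predictor attribute, or unused, and that a restricted pattern uses at most 4 attributes split into two nonempty roles; the exact counting in the paper may differ.

from math import comb

n = 10  # |R|: number of attributes; illustrative

# Brute force: each attribute is partition, predictor, or unused -> at least 3^n candidates.
brute = 3 ** n

# Restricted: at most 4 attributes appear in a pattern, split between the two roles
# (both roles nonempty here; that detail is an assumption).
restricted = sum(comb(n, k) * (2 ** k - 2) for k in range(2, 5))

print(brute, restricted)   # 59049 vs 3750 for n = 10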
40. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
41. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Partition Attributes | Predictor Attributes
A, B, C | D
A, B | C, D
A | B, C, D
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
42. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Detecting and Applying Functional Dependency
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
43. Mining ARP
Brute Force: at least 3^|R| candidate patterns
Optimization:
Restricting size:
Reusing sort order
Detecting and Applying Functional Dependency
"For each A, agg(α) is linear over C"
A → B
⇒ "For each A and B, agg(α) is linear over C"
Slide 9 of 16 Q. Zeng - CAPE: Counterbalance with ARP
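The sketch below illustrates two of the optimizations listed above: all prefix splits of one attribute order can be evaluated from a single sort, and candidates implied by a functional dependency are pruned. The attribute "institution" and the FD author → institution are invented for the example, and the enumeration is a simplification, not CAPE's actual mining algorithm.

# Attributes of the aggregate query result; "institution" is invented so the
# FD example below has something to fire on.
ATTRS = ("author", "institution", "venue", "year")
FDS = [(("author",), ("institution",))]   # author -> institution, illustrative

def prefix_splits(order):
    # All (partition | predictor) splits served by ONE sort on `order`:
    # sorting by (A,B,C,D) serves A|BCD, A,B|CD and A,B,C|D (cf. the table above).
    return [(order[:i], order[i:]) for i in range(1, len(order))]

def fd_closure(attrs):
    # Attributes functionally determined by `attrs` under FDS.
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in FDS:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def implied(partition, validated_partitions):
    # FD pruning: if the pattern holds with partition attrs X and X -> Y,
    # it also holds with X ∪ Y, so that candidate need not be fitted again.
    return any(set(p) <= set(partition) <= fd_closure(p) for p in validated_partitions)

print(prefix_splits(ATTRS))
print(implied(("author", "institution"), [("author",)]))   # True: pruned via the FD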
45. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
46. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
47. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
48. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
49. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
AX’s number of SIGKDD publications each year:
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
50. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Generalizes φ
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
51. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Generalizes φ
E.g. P="For each author, the total publication is linear over the years"
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
52. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
Holds locally on φ
E.g. P1="For each author and venue, the total publication is constant over the years" needs to hold on (AX, SIGKDD)
Generalizes φ
E.g. P="For each author, the total publication is linear over the years"
AX’s number of publications each year:
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
53. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
54. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
55. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Refine: partition on author AX and venue ICDE; the regression model becomes constant
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
56. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Refine: partition on author AX and venue ICDE; the regression model becomes constant
P1="For author AX and ICDE, the total publication is constant over
the years"
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
57. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P="For author AX, the total publication is linear over the years"
Refine: partition on author AX and venue ICDE; the regression model becomes constant
P1="For author AX and ICDE, the total publication is constant over
the years"
In this simple example the refinement happens to end up with the same attributes as the user question, but that is not required.
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
58. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P1="For author AX and ICDE, the total publication is constant over
the years"
3 t = (AX, ICDE, 2007, 6) ∈ QP1
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
59. Steps of Counterbalancing
φ = “Why is the number of AX’s SIGKDD 2007 papers low?”
1 Relevant pattern (Not all patterns are useful)
2 Refinement (There might not be a direct counterbalance on the relevant pattern)
P1="For author AX and ICDE, the total publication is constant over
the years"
3 t = (AX, ICDE, 2007, 6) ∈ QP1
t[pubcnt] = 6 is a high outlier
Slide 10 of 16 Q. Zeng - CAPE: Counterbalance with ARP
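Putting the three steps together for the running example, here is a small self-contained sketch of the counterbalance search under the refined pattern P1. The aggregate values, the z-score outlier test, and its threshold are illustrative assumptions, not the paper's definitions.

import statistics as st

# Aggregated result of Q as (author, venue, year, pubcnt); the values are invented.
result = [
    ("AX", "SIGKDD", 2006, 4), ("AX", "SIGKDD", 2007, 1), ("AX", "SIGKDD", 2008, 4),
    ("AX", "ICDE",   2005, 2), ("AX", "ICDE",   2006, 2), ("AX", "ICDE",   2007, 6),
    ("AX", "ICDE",   2008, 2), ("AX", "VLDB",   2006, 1), ("AX", "VLDB",   2007, 4),
]
question = ("AX", "SIGKDD", 2007, 1)   # phi: why is this count LOW?

def is_high_outlier(value, peers, z=1.5):
    # Under "for this author and venue, pubcnt is constant over the years",
    # flag value as a high outlier if it sits well above the group's mean.
    if len(peers) < 3 or st.pstdev(peers) == 0:
        return False
    return (value - st.mean(peers)) / st.pstdev(peers) >= z

author, venue, year, _ = question
counterbalances = []
for a, v, y, cnt in result:
    if a != author or (v, y) == (venue, year):
        continue                      # only tuples about the same author, not phi itself
    peers = [c for (a2, v2, _y2, c) in result if (a2, v2) == (a, v)]
    if is_high_outlier(cnt, peers):
        counterbalances.append((a, v, y, cnt))

print(counterbalances)   # [('AX', 'ICDE', 2007, 6)] for this toy data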
60. Explanation
Explanations returned by CAPE for φ
contain AX’s number of publications in another venue or another year
E.g. (AX , ICDE, 2006, 6), (AX , VLDB, 2007, 4)
don’t need to have the same schema as φ
E.g. (AX , 2010, 63)
Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
61. Explanation
Explanations returned by CAPE for φ
contain AX’s number of publications in another venue or another year
E.g. (AX , ICDE, 2006, 6), (AX , VLDB, 2007, 4)
don’t need to have the same schema as φ
E.g. (AX , 2010, 63)
Not all counterbalances are good. We need to score them and return top
ones.
Slide 11 of 16 Q. Zeng - CAPE: Counterbalance with ARP
62. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
63. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
⇒ Tuples that are more similar are more likely to have caused the unusual result.
For φ = (AX, SIGKDD, 2007, 1), an answer from 2007 is better than one from 2006, and ICDE is better than a conference in another area such as SIGCOMM.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
64. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
2 The deviation of explanation tuple from its expected value.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
65. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
2 The deviation of explanation tuple from its expected value.
⇒ Higher deviation means more unusual, which is more likely to cause
other unusual events.
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
66. Scoring Explanations
1 The distance between user question tuple and explanation tuple.
2 The deviation of explanation tuple from its expected value.
⇒ Higher deviation means more unusual, which is more likely to cause
other unusual events.
AX’s SIGKDD publications: AX’s ICDE publications:
Slide 12 of 16 Q. Zeng - CAPE: Counterbalance with ARP
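A sketch of how the two factors could be combined into a single score follows; the slides do not give CAPE's actual scoring formula, so the distance measure, the linear combination, and the weights below are assumptions for illustration only.

def distance(question, explanation):
    # Attribute-wise distance: numeric attributes (year) by absolute difference,
    # categorical ones by 0/1 mismatch; attributes missing from the explanation count as 1.
    d = 0.0
    for attr, q_val in question.items():
        e_val = explanation.get(attr)
        if e_val is None:
            d += 1.0
        elif isinstance(q_val, (int, float)):
            d += abs(q_val - e_val)
        else:
            d += 0.0 if q_val == e_val else 1.0
    return d

def score(question, explanation, deviation, w_dev=1.0, w_dist=1.0):
    # Larger deviation from the pattern's prediction is better;
    # larger distance from the question tuple is worse.
    return w_dev * deviation - w_dist * distance(question, explanation)

phi  = {"author": "AX", "venue": "SIGKDD", "year": 2007}
expl = {"author": "AX", "venue": "ICDE",   "year": 2007}
print(score(phi, expl, deviation=4.0))   # deviation value is illustrative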
67. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
68. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
69. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
2 | Battery | 25 | 2011 | 79 | 60.5
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
70. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
2 | Battery | 25 | 2011 | 79 | 60.5
3 | Battery | - | 2010 | 1095 | 49.0
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
71. Qualitative Evaluation
More example:
Chicago crime data: Crime(id, type, community, year)
Q = γ_{type, community, year, count(*)}(Crime)
φ="Why is battery crime in 2011 at community area 26 low (16)?"
Explanation
rank | type | community | year | count(*) | score
1 | - | 26 | 2012 | 117 | 63.9
2 | Battery | 25 | 2011 | 79 | 60.5
3 | Battery | - | 2010 | 1095 | 49.0
4 | Assault | 26 | 2011 | 10 | 40.1
Slide 13 of 16 Q. Zeng - CAPE: Qualitative Evaluation
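For reference, the relational-algebra query Q above is an ordinary group-by count; a minimal sqlite3 sketch over a toy stand-in for the Crime table (values invented) could look as follows.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Crime (id INT, type TEXT, community INT, year INT)")
conn.executemany("INSERT INTO Crime VALUES (?, ?, ?, ?)", [
    (1, "Battery", 26, 2011), (2, "Battery", 26, 2012),
    (3, "Assault", 26, 2011), (4, "Battery", 25, 2011),
])

# Q = gamma_{type, community, year, count(*)}(Crime)
q = """SELECT type, community, year, COUNT(*) AS cnt
       FROM Crime
       GROUP BY type, community, year"""
for row in conn.execute(q):
    print(row)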
72. Conclusion & Future Work
Conclusions
Provenance may be insufficient
Reasonable explanations can be given by counterbalance
Mine patterns offline
Look for counterbalance and rank online
Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
73. Conclusion & Future Work
Conclusions
Provenance may be insufficient
Reasonable explanations can be given by counterbalance
Mine patterns offline
Look for counterbalance and rank online
Future Work
Extend to a larger class of queries
e.g., joins
Slide 14 of 16 Q. Zeng - CAPE: Conclusion & Future Work
75. References I
[Arab et al., 2014] Arab, B., Gawlick, D., Radhakrishnan, V., Guo, H., and Glavic, B. (2014).
A generic provenance middleware for database queries, updates, and transactions.
In Proceedings of the 6th USENIX Workshop on the Theory and Practice of Provenance.
[Green et al., 2007] Green, T. J., Karvounarakis, G., and Tannen, V. (2007).
Provenance semirings.
In PODS, pages 31–40.
[Meliou et al., 2010] Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. (2010).
The complexity of causality and responsibility for query answers and non-answers.
PVLDB, 4(1):34–45.
[Roy and Suciu, 2014] Roy, S. and Suciu, D. (2014).
A formal approach to finding explanations for database queries.
In SIGMOD, pages 1579–1590.
[Wu and Madden, 2013] Wu, E. and Madden, S. (2013).
Scorpion: Explaining away outliers in aggregate queries.
PVLDB, 6(8):553–564.
Slide 16 of 16 Q. Zeng - CAPE: Bibliography