Answering Why-Not Questions on Top-K Queries

Andy He and Eric Lo
The Hong Kong Polytechnic University

Background
 The database community has focused on
the performance issues for decades
 Recently, more people have turned their focus
to usability issues
 Supporting keyword search
 Query auto-completion
 Explaining your query result (a.k.a. Why and
Why-Not Questions)

Why-Not Questions
 You pose a query Q
 Database returns you a result R
 R surprises you
 E.g., a tuple m that you expected in the
result is missing, and you ask “WHY??!”
 You pose a why-not question (Q,R,m)
 Database returns you an explanation E

The (short) history of Why-Not
 Chapman and Jagadish
 “Why Not?” [SIGMOD 09]
 Select-Project-Join (SPJ) Queries
 Explanation E = “tell you which operator excludes
the expected tuple”
 Huang, Chen, A. Doan, and J. Naughton
 “On the Provenance of Non-Answers to Queries
Over Extracted Data” [PVLDB 09]
 SPJ Queries
 Explanation E =“tell you how to modify the data”

The (short) history of Why-Not
 Herschel and Hernández
 “Explaining Missing Answers to SPJUA Queries”
[PVLDB 10]
 SPJUA Queries
 Explanation E =“tell you how to modify the data”
 Tran and C.Y. Chan
 “How to ConQueR Why-not Questions” [SIGMOD 10]
 SPJA Queries
 Explanation E = “tell you how to modify your
query”

About this work
 Why-Not question on Top-k queries.
 Hotel <Price, Distance to CityCenter>
 Top-3 Hotel
 Weighting worigin =<0.5, 0.5>
 Result
 Rank 1: Sheraton
 Rank 2: Westin
 Rank 3: InterContinental
 “WHY is my favorite Renaissance NOT in the Top-3 result?”
 Is my value of k too small?
 Or should I revise my weighting?
 Or do I need to modify both k and weighting?
 Explanation E = “tell you how to refine your Top-K query in
order to get your favorites back to the result”

One possible answer
-only modify k
 Original query
Q(koriginal=3,woriginal=<0.5,0.5>)
 The ranking of Renaissance under the
original weighting woriginal=<0.5,0.5>
 Rank 1: Sheraton
 Rank 2: Westin
 Rank 3: InterContinental
 Rank 4: Hilton
 Rank 5: Renaissance
 Refined query #1: Q1(k=5,w=<0.5,0.5>)
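The "only modify k" refinement can be sketched in a few lines. This is a minimal illustration, assuming a weighted-sum score where a higher score ranks better. The attribute values are made up so that the ranking matches the slide, and `score`, `rank_of` and `refine_k_only` are hypothetical helper names, not part of the original work.

```python
# Hypothetical attribute values, chosen only so that the ranking
# under w = <0.5, 0.5> matches the slide; they are not real data.
HOTELS = {
    "Sheraton": (12, 8),
    "Westin": (8, 10),
    "InterContinental": (10, 6),
    "Hilton": (6, 8),
    "Renaissance": (8, 4),
}


def score(attrs, w):
    """Weighted sum of the attributes (assumed: higher is better)."""
    return sum(wi * ai for wi, ai in zip(w, attrs))


def rank_of(name, data, w):
    """1-based rank: one plus the number of tuples scoring strictly higher."""
    target = score(data[name], w)
    return 1 + sum(score(a, w) > target for a in data.values())


def refine_k_only(name, data, k_original, w_original):
    """Keep the weighting and enlarge k just enough to include `name`."""
    return max(k_original, rank_of(name, data, w_original))


print(refine_k_only("Renaissance", HOTELS, 3, (0.5, 0.5)))  # prints 5
```

The refined k never drops below the original k, so a tuple that is already in the result leaves the query unchanged.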

Another possible answer
-only modify weighting
 Original query Q(k=3,woriginal=<0.5,0.5>)
 If we set weighting w=<0.1,0.9>
 Rank 1: Hotel E
 Rank 2: Hotel F

Yet another possible answer
-modify both
 Original query Q(k=3,w=<0.5,0.5>)
 Rank 1: Hotel A
 Rank 2: Hotel B
 Rank 3: Hotel C
 …
 …

Our objective
 Find the refined query that minimizes a
penalty function while keeping the missing tuple m
in the Top-K results
 Penalty options:
 Prefer Modify K (PMK)
 Prefer Modify Weighting (PMW)
 Never Mind (NM, default)

Basic idea
 For each weighting wi ∈ W
 Run PROGRESS(wi, UNTIL-SEE-m)
 Obtain the ranking ri of m under the weighting
wi
 Form a refined query Qi(k=ri,w=wi)
 Return the refined query with the least
penalty
W is infinite!
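The loop above can be sketched for a finite list of candidate weightings. This is only an illustration under stated assumptions: PROGRESS is replaced by a full sort (a real progressive top-k operator would emit tuples one at a time), a higher weighted sum ranks better, the hotel attributes and candidate weightings are made up, and the penalty is passed in as a function. All function names are hypothetical.

```python
import math


def score(attrs, w):
    """Weighted sum of the attributes (assumed: higher is better)."""
    return sum(wi * ai for wi, ai in zip(w, attrs))


def progress_until_see(data, w, m):
    """Stand-in for PROGRESS(w, UNTIL-SEE-m): walk the tuples in
    descending score order and return the rank at which m appears."""
    ordered = sorted(data, key=lambda name: score(data[name], w), reverse=True)
    for rank, name in enumerate(ordered, start=1):
        if name == m:
            return rank
    raise KeyError(m)


def best_refined_query(data, m, k0, weightings, penalty):
    """Try every candidate weighting; keep the least-penalty (penalty, k, w)."""
    best = None
    for w in weightings:
        k = max(k0, progress_until_see(data, w, m))
        p = penalty(k, w)
        if best is None or p < best[0]:
            best = (p, k, w)
    return best


# Hypothetical data and candidate weightings, for illustration only.
HOTELS = {
    "Sheraton": (12, 8), "Westin": (8, 10), "InterContinental": (10, 6),
    "Hilton": (6, 8), "Renaissance": (8, 4),
}
W0 = (0.5, 0.5)
CANDIDATES = [(0.5, 0.5), (0.1, 0.9), (0.9, 0.1)]


def basic_penalty(k, w, lam_k=0.5, lam_w=0.5):
    """Assumed weights of 0.5 each; the original k is 3."""
    return lam_k * (k - 3) + lam_w * math.dist(w, W0)


p, k, w = best_refined_query(HOTELS, "Renaissance", 3, CANDIDATES, basic_penalty)
print(k, w)  # prints 4 (0.9, 0.1)
```

Because the real weighting space is continuous, this enumeration only works once the candidates have been reduced to a finite set.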

Our approach: sampling
 For each weighting wi ∈ W
 Run PROGRESS(wi, UNTIL-SEE-m)
 Obtain the ranking ri of m under the weighting
wi
 Form a refined query Qi(k=ri,w=wi)
 Return the refined query with the least
penalty
W is a set of weightings drawn from a restricted weighting space
Key Theorem: The optimal refined query
Qbest is either Q1 or else Qbest has a weighting
wbest in a restricted weighting space.

How large should the sample size be?
 We say a refined query is a best-T% refined
query if its penalty is smaller than that of (1-T)% of all refined
queries
 And we hope to get such a query with a
probability larger than a threshold Pr
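One standard way to turn this requirement into a number is sketched below, assuming the samples are drawn independently and uniformly. Each sample then misses the best-T% region with probability 1 - T, so s samples hit it at least once with probability 1 - (1 - T)^s, and solving 1 - (1 - T)^s >= Pr gives s >= log(1 - Pr) / log(1 - T). The function name `sample_size` is hypothetical, and T is passed as a fraction rather than a percentage.

```python
import math


def sample_size(t, pr):
    """Smallest integer s with 1 - (1 - t)**s >= pr.

    t  -- fraction of refined queries counted as "best" (0 < t < 1)
    pr -- required probability of sampling at least one of them (0 < pr < 1)
    """
    if not (0 < t < 1 and 0 < pr < 1):
        raise ValueError("t and pr must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - pr) / math.log(1 - t))


s = sample_size(0.01, 0.9)
print(s)  # prints 230
```

The bound depends only on T and Pr, not on the size of the data set.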

The PROGRESS operation can be
expensive
 Original query Q(k=3,woriginal=<0.5,0.5>)
 Rank 1: Hotel A
 Rank 2: Hotel B
 Rank 3: Hotel C
 …
 …
 Refined query: Q2(k=10000,w=<0.5,0.5>)
Very slow!

Two optimization techniques
 Stop each PROGRESS operation early
 Skip some PROGRESS operations

Stop earlier
 The original query Q(k=3,worigin=<0.5,0.5>)
 Rank 1: Hotel A
 Rank 2: Hotel B
 Rank 3: Hotel C
 …
 Rank 5: Hotel D
 …
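One way to stop a PROGRESS operation early is sketched below. It assumes the basic penalty model λk ∆k + λw ∆w with ∆k = max(0, rank - k): once the scan has passed the rank at which the penalty can no longer beat the best refined query found so far, there is no point in continuing. The names `rank_limit` and `progress_with_limit`, the weights of 0.5, and the best-so-far penalty of 1.0 are assumptions made for illustration.

```python
import math


def rank_limit(best_penalty, dw, k0, lam_k=0.5, lam_w=0.5):
    """Largest rank r whose penalty lam_k * max(0, r - k0) + lam_w * dw
    is still strictly below best_penalty; 0 if no rank can beat it."""
    budget = (best_penalty - lam_w * dw) / lam_k
    if budget <= 0:
        return 0
    return k0 + math.ceil(budget) - 1


def progress_with_limit(ranked_names, m, limit):
    """Scan a ranked list, but give up once `limit` tuples have been seen.
    Returns the rank of m, or None if the scan stopped first."""
    for rank, name in enumerate(ranked_names, start=1):
        if rank > limit:
            return None
        if name == m:
            return rank
    return None


# Ranking under the original weighting, as listed on the earlier slide.
RANKED = ["Sheraton", "Westin", "InterContinental", "Hilton", "Renaissance"]

limit = rank_limit(best_penalty=1.0, dw=0.0, k0=3)
print(limit)                                              # prints 4
print(progress_with_limit(RANKED, "Renaissance", limit))  # prints None
```

Here the scan stops after four tuples, because finding Renaissance at rank 5 would only tie the assumed best penalty of 1.0 rather than improve on it.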

Skip PROGRESS operation (a)
 Similar weightings may lead to similar rankings
 Based on “Reverse Top-K” paper, ICDE’10
 Therefore
 The query result of PROGRESS(wx, UNTIL-SEE-m)
 could be used to deduce
 The query result of PROGRESS(wy, UNTIL-SEE-m)
 [Provided that wx and wy are similar]

Skip PROGRESS operation (a)
 E.g., Original query Q(k=3,worigin=<0.5,0.5>)
 What do the scores look like if we set w=<0.6,0.4>?
Hotel              Score under w=<0.5,0.5>   Score under w=<0.6,0.4>
Sheraton           10                        9
Westin             9                         10
InterContinental   8                         7
Hilton             7                         8
Renaissance        6                         5

Skip PROGRESS operation (b)
 We can skip a weighting w if its
change ∆w from the original weighting
worigin is too large.
 E.g., if we already have a refined query with a penalty
of 0.5, and a weighting w has a change
∆w of 1, we can skip it entirely.
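The skip test can be sketched as a one-line bound. It assumes the basic penalty model λk ∆k + λw ∆w, in which the weighting term alone is a lower bound on the penalty of any refined query using w. The function name `can_skip` and the value λw = 0.5 are assumptions; with that value, the example above (best penalty 0.5, ∆w = 1) meets the bound exactly.

```python
import math


def can_skip(w, w_origin, best_penalty, lam_w=0.5):
    """True if the weighting change alone already costs at least
    best_penalty, so no value of k can make w the new best."""
    return lam_w * math.dist(w, w_origin) >= best_penalty


# Hypothetical weightings; the first pair is exactly 1 apart.
print(can_skip((1.0, 0.5), (0.0, 0.5), best_penalty=0.5))  # prints True
print(can_skip((0.6, 0.4), (0.5, 0.5), best_penalty=0.5))  # prints False
```

The check needs no access to the data, which is why it can replace a whole PROGRESS run.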

Experiments
 Case Study on NBA data
 Experiments on Synthetic Data

Case study on NBA data
 Compare with a pure random sampling
version
 It does not draw samples from the restricted
weighting space but from the complete
weighting space
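The pure random baseline could be sketched as follows, assuming that the complete weighting space is the set of non-negative weightings that sum to 1. Normalizing independent exponential draws gives a point distributed uniformly on that set. The function name `sample_weighting` is hypothetical, and the slides do not say how the baseline draws its samples.

```python
import random


def sample_weighting(dims, rng=random):
    """Draw one weighting uniformly from the set of non-negative
    vectors of length `dims` whose components sum to 1."""
    draws = [rng.expovariate(1.0) for _ in range(dims)]
    total = sum(draws)
    return tuple(d / total for d in draws)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_weighting(5, rng) for _ in range(3)]
for w in samples:
    print(w)
```

Sampling from a restricted weighting space would replace this generator while leaving the rest of the loop unchanged.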

Find the top-3 centers in NBA history
 5 Attributes (Weighting = 1/5)
 POINTS
 REBOUND
 BLOCKING
 FIELD GOAL
 FREE THROW
 Initial Result
 Rank 1: Chamberlain
 Rank 2: Abdul-Jabbar
 Rank 3: O’Neal

Find the top-3 centers in NBA history
                 Sampling on the restricted   Sampling on the whole
                 sampling space               weighting space
Refined query    Top-3                        Top-7
∆k               0                            4
Time (ms)        156                          154
Penalty          0.069                        0.28
Why Not ?!
We choose “Prefer Modify Weighting”

Synthetic Data
 Uniform, Anti-correlated, Correlated
 Scalability

Varying query dimensions

Varying the ranking of the missing object

Varying the number of missing objects

Varying T%
[Figure panels: Time, Quality]

Optimization effectiveness

Conclusions
 We are the first to answer why-not questions on top-k
queries
 We prove that finding the optimal answer is
computationally expensive
 A sampling based method is proposed
 The optimal answer is proved to be in a restricted
sample space
 Two optimization techniques are proposed
 Stop each PROGRESS operation early
 Skip some PROGRESS operations

Deal with multiple missing objects M
 We have to modify the algorithm a little bit:
 Do a simple filtering on the set of missing
objects
 If mi dominates mj in the data space
 Remove mi from M, because every time mj shows
up in a top-k result, mi must be there too
 Condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M
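The filtering step can be sketched as below. It assumes that higher attribute values are better and that all weights are non-negative, so an object that dominates another never ranks below it. The names `dominates` and `filter_missing`, and the three example objects, are hypothetical.

```python
def dominates(a, b):
    """a dominates b: at least as good in every attribute and strictly
    better in at least one (assumed: higher values are better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def filter_missing(missing):
    """Drop every missing object that dominates another missing object.
    Whenever the dominated one enters the top-k, the dropped one is in too."""
    return {
        name: attrs
        for name, attrs in missing.items()
        if not any(
            dominates(attrs, other)
            for other_name, other in missing.items()
            if other_name != name
        )
    }


# Hypothetical missing objects: B dominates A, C is incomparable to both.
M = {"A": (8, 4), "B": (9, 5), "C": (3, 9)}
print(sorted(filter_missing(M)))  # prints ['A', 'C']
```

After filtering, the stop condition only has to wait for the remaining objects.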

Penalty Model
 Original Query Q(3, worigin)
 Refined Query Q1(5, worigin)
 Penalty of changing k
 ∆ k = 5 - 3 = 2
 Penalty of changing w
 ∆ w = ||worigin - worigin||2 = 0
 Basic penalty model
 Penalty(5,worigin) = λk ∆ k + λw ∆ w
 (λk + λw = 1)
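The basic penalty model can be written out directly. In this sketch the slide's example (k changed from 3 to 5, weighting unchanged) is evaluated with λk = λw = 0.5, which is an assumed choice since the slide gives no values. The name `basic_penalty` is hypothetical, and ∆k is clamped at zero on the assumption that a refined k is never smaller than the original.

```python
import math


def basic_penalty(k, w, k_origin, w_origin, lam_k=0.5, lam_w=0.5):
    """lam_k * delta_k + lam_w * delta_w, with lam_k + lam_w = 1."""
    if not math.isclose(lam_k + lam_w, 1.0):
        raise ValueError("lam_k and lam_w must sum to 1")
    delta_k = max(0, k - k_origin)      # change in k
    delta_w = math.dist(w, w_origin)    # Euclidean distance between weightings
    return lam_k * delta_k + lam_w * delta_w


w_origin = (0.5, 0.5)
print(basic_penalty(5, w_origin, 3, w_origin))  # prints 1.0
```

Shifting weight from λk to λw makes changes to the weighting more expensive relative to changes in k, and vice versa.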

Normalized penalty function
