In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation: how often a significant difference is found between systems, and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying numbers of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as few as half the number of queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.
1. Audio Music Similarity and Retrieval: Evaluation Power and Stability
Julián Urbano @julian_urbano
Diego Martín, Mónica Marrero and Jorge Morato
University Carlos III of Madrid
ISMIR 2011
Picture by Michael Shane · Miami, USA · October 26th
4-6. grand results (MIREX 2009)
"I won!"
"oh, come on! it's so close!"
"but the difference is not significant…"
"yeah, it's not significant!"
"did you hear?"
"shut up… we are!"
"damn it!"
"don't worry about it"
8-9. proper interpretation of p-values
H0: mean score of system A = mean score of system B
H1: the mean scores are different
a statistical test returns p < 0.01, so we conclude A >> B
[figure: score distributions of systems B and A]
it means that if we assume H0 and repeat the experiment, there is a < 0.01 probability of getting this result again*
*or one even more extreme
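To make that definition concrete, here is a minimal simulation sketch (not from the talk; the score distribution and the observed difference of 0.05 are made up): we assume H0 is true and count how often a difference at least as extreme as the observed one reappears.

```python
# Minimal sketch: the frequentist meaning of a p-value, by simulation.
# Hypothetical setup: under H0, per-query score differences have mean 0;
# 0.05 is a made-up observed mean difference between systems A and B.
import numpy as np

rng = np.random.default_rng(42)
n_queries = 100
observed_diff = 0.05          # hypothetical observed mean(A) - mean(B)

# Repeat the experiment many times assuming H0 (true mean difference = 0).
reps = 10_000
sim_diffs = rng.normal(0.0, 0.15, size=(reps, n_queries)).mean(axis=1)

# p ~= probability of a result at least this extreme, given H0 (two-sided).
p = np.mean(np.abs(sim_diffs) >= observed_diff)
print(f"P(|diff| >= {observed_diff} | H0) ~= {p:.4f}")
```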
10. conclusions about general behavior
this evaluation (MIREX 2009) vs. a different collection (MIREX 2010):
A > B is not powerful: system A is better than B, but not statistically significantly so; with a different collection we can expect anything (A ? B)
A >> B is powerful: A is significantly better than B
…and stable if the other collection also yields A >> B (we expect the same)
but these could also happen: A > B (lack of power in MIREX 2010), A < B (minor stability conflict), A << B (major stability conflict)
13. Text REtrieval Conference
[Buckley and Voorhees, 2000]: no significance testing; 1% to 14% of comparisons show stability conflicts, depending on the measure used; ~25% differences needed to ensure <5% conflicts with 50 queries
[Sanderson and Zobel, 2005]: sensitivity; reliability improved with pairwise t-tests, others were not as good; virtually no conflicts if >10% differences with significance
[Voorhees, 2009]: effort; with many queries, even significance is unreliable
[Sakai, 2007]: major review with other collections and more recent measures; some measures are much better than others, but that does not mean the others should not be used!
14. Music Similarity and Retrieval
[Typke et al., 2005][Urbano et al., 2010]: alternative forms of ground truth for SMS, reliable and comprehensive but too expensive
[Typke et al., 2006]: a specific measure for the task, with no prefixed relevance scale
[Jones et al., 2007]: agreement between judgments by different people; propose to use more queries
[Urbano et al., 2010][Lee, 2010]: despite high agreement, the evaluation does change…; cheaper judgments via crowdsourcing seem reliable
[Urbano, 2011]: more about this and many other things, in 30 mins
16. it's actually about the effort-reliability tradeoff
task · relevance judgments · # of systems · # of queries · measures · system similarity · statistical methods
17. measures & judgments
Picture by Wessex Archaeology
18. how much information does the user gain?
results as a set:
AG@5: Average Gain in the top 5 documents (the measure used in MIREX, under a different name)
results as a list (a more realistic user model): the lower the rank, the lower the gain*
NDCG@5: Normalized Discounted Cumulated Gain
ANDCG@5: Average NDCG across ranks
ADR@5: Average Dynamic Recall (best documents first)
*details in the paper
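As a rough illustration of the set-vs-list distinction, here is a sketch of AG@5 and NDCG@5 under simple assumptions (gains are the judgment values of the documents, and at least 5 documents are judged per query; the exact definitions of all four measures, including ANDCG@5 and ADR@5, are in the paper):

```python
import numpy as np

def ag_at_k(gains, k=5):
    """Average Gain: mean gain of the top-k results; order does not matter."""
    return float(np.mean(gains[:k]))

def ndcg_at_k(gains, all_judged_gains, k=5):
    """NDCG: gains are discounted by rank, then normalized by the ideal list.
    Assumes all_judged_gains contains at least k values."""
    disc = 1.0 / np.log2(np.arange(2, k + 2))     # discounts for ranks 1..k
    dcg = float(np.sum(np.asarray(gains[:k], dtype=float) * disc))
    ideal = np.sort(np.asarray(all_judged_gains, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * disc))
    return dcg / idcg if idcg > 0 else 0.0

# BROAD gains (0/1/2) of the 5 retrieved documents, best one ranked second:
print(ag_at_k([1, 2, 0, 2, 1]))                        # 1.2, same in any order
print(ndcg_at_k([1, 2, 0, 2, 1], [2, 2, 2, 1, 1, 0]))  # penalizes the ranking
```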
19. how much information does a result provide?
BROAD relevance judgments: not similar = 0, somewhat similar = 1, very similar = 2
FINE relevance judgments: real-valued, from 0 to 10 or 100
22-29. % of pairwise comparisons that are significant
what's the effect of: number of queries, relevance judgments, effectiveness measures
evaluation: from the full 100-query set, draw a random sample of 5, 10, … up to all 100 queries
stratified random sampling with equal priors: subsets balanced across the 10 genres (baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)
repeat 500 times per subset size to minimize random effects: 52,500 system comparisons
[figure: % significant vs. # queries, one panel each for Broad and Fine judgments]
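A sketch of this sampling procedure, under stated assumptions (`scores` is a hypothetical systems-by-queries matrix of per-query effectiveness values and `genres` the per-query genre labels; the slides do not prescribe the test used here, so a paired Wilcoxon stands in):

```python
from itertools import combinations
import numpy as np
from scipy import stats

def sample_queries(genres, per_genre, rng):
    """Stratified random sampling with equal priors: same count per genre."""
    idx = []
    for g in np.unique(genres):
        pool = np.where(genres == g)[0]
        idx.extend(rng.choice(pool, size=per_genre, replace=False))
    return np.array(idx)

def pct_significant(scores, genres, per_genre, reps=500, alpha=0.05, seed=0):
    """% of pairwise system comparisons found significant on random subsets."""
    rng = np.random.default_rng(seed)
    sig = total = 0
    for _ in range(reps):
        q = sample_queries(genres, per_genre, rng)
        for a, b in combinations(range(scores.shape[0]), 2):
            total += 1
            # paired test over the same query subset
            if stats.wilcoxon(scores[a, q], scores[b, q]).pvalue < alpha:
                sig += 1
    return 100.0 * sig / total
```

With 15 systems this yields 105 pairs per repetition, hence the 52,500 comparisons per subset size quoted above.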
36-41. % of pairwise comparisons that are conflicting
what's the effect of: number of queries, relevance judgments, effectiveness measures
evaluation: draw two independent query subsets, each balanced across the 10 genres, with 5 to 50 queries each (with 100 total queries we can't go beyond 50)
compare each pair of systems across the two subsets: 52,500 cross-collection system comparisons
repeat 500 times to minimize random effects
[figure: % conflicting vs. # queries, one panel each for Broad and Fine judgments]
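The cross-collection bookkeeping could look like this sketch (hypothetical helper names, reusing the per-query `scores` idea from the previous snippet): each pair of systems gets a verdict on each subset, and pairs found significant on one subset are classified by what the other subset says.

```python
import numpy as np
from scipy import stats

def verdict(x, y, alpha=0.05):
    """'>>' or '<<' if the paired difference is significant, '>' or '<' else."""
    d = float(np.mean(x) - np.mean(y))
    sig = stats.wilcoxon(x, y).pvalue < alpha
    return ('>>' if d > 0 else '<<') if sig else ('>' if d > 0 else '<')

def conflict_type(v1, v2):
    """Given A >> B on subset 1, what does subset 2 say about A vs B?"""
    if v1 != '>>':
        return None  # only significant differences are checked for stability
    return {'>>': 'stable',
            '>':  'lack of power',
            '<':  'minor conflict',
            '<<': 'major conflict'}[v2]
```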
51. compare two systems
is the difference significant? t-test, Wilcoxon test, sign test, etc.; they make different assumptions
significance level α: the probability of a Type I error (finding a significant difference when there is none, which surfaces as a stability conflict)
usually α = 0.05 or α = 0.01: 5% or 1% of my significant results are just wrong
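For two concrete systems, the three tests named above might be run like this (toy per-query scores; `binomtest` implements the sign test by counting only which system wins each query):

```python
import numpy as np
from scipy import stats

# Toy paired per-query scores for two hypothetical systems A and B.
a = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51])
b = np.array([0.58, 0.53, 0.65, 0.49, 0.60, 0.57, 0.70, 0.50])

print(stats.ttest_rel(a, b).pvalue)   # paired t-test: assumes normality
print(stats.wilcoxon(a, b).pvalue)    # Wilcoxon: symmetric differences
d = a - b                             # sign test: only the sign matters
print(stats.binomtest(int(np.sum(d > 0)), n=int(np.sum(d != 0))).pvalue)
```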
52. compare several systems (MIREX 2009)
15 systems = 105 pairwise comparisons
experiment-wide significance level = 1-(1-α)^105 ≈ 0.995
we can expect at least one significant comparison to be wrong
instead, compare all systems at once: ANOVA, Friedman test (used in MIREX), Kruskal-Wallis, etc., with different assumptions
and correct p-values to keep the experiment-wide significance level < 0.05: Tukey's HSD, Bonferroni, Scheffé, Duncan, Newman-Keuls, etc.
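A sketch of the omnibus-then-correct approach, with toy data (Friedman as the omnibus test; Bonferroni-corrected pairwise Wilcoxon tests stand in for the post-hoc step here, while MIREX itself pairs Friedman with Tukey's HSD):

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Toy data: 15 systems x 100 queries of per-query effectiveness scores.
scores = np.random.default_rng(1).random((15, 100))

# Omnibus test over all systems at once (one sample per system).
stat, p = stats.friedmanchisquare(*scores)
print(f"Friedman omnibus p = {p:.4f}")

# Post-hoc pairwise tests at a Bonferroni-corrected alpha.
pairs = list(combinations(range(scores.shape[0]), 2))   # 105 pairs
alpha = 0.05 / len(pairs)
for a, b in pairs[:3]:                                  # first few, for brevity
    p = stats.wilcoxon(scores[a], scores[b]).pvalue
    print(f"{a} vs {b}: {'significant' if p < alpha else 'not significant'}")
```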
54. what a MIREX participant wants
compare my system with the other 14; comparisons among those 14 are uninteresting
subexperiment: only 14 pairwise comparisons, not 105
get back the power missed by considering the other 91 comparisons
should throw out more conflicts too
the number of comparisons grows linearly with the number of systems
subexperiment-wide significance level = 1-(1-α)^14 ≈ 0.512
compare all systems with 1-tailed Wilcoxon tests at α = 0.01:
experiment-wide significance level = 1-(1-0.01)^105 ≈ 0.652
subexperiment-wide significance level = 1-(1-0.01)^14 ≈ 0.131
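The family-wise levels quoted above follow directly from the formula 1-(1-α)^m for m independent comparisons:

```python
# Family-wise significance level for m independent comparisons at level alpha.
def fwer(alpha, m):
    return 1 - (1 - alpha) ** m

print(fwer(0.05, 105))  # ~0.995: experiment-wide, 105 comparisons
print(fwer(0.05, 14))   # ~0.512: subexperiment-wide, 14 comparisons
print(fwer(0.01, 105))  # ~0.652: all pairwise 1-tailed Wilcoxon at alpha=0.01
print(fwer(0.01, 14))   # ~0.131: the participant's 14 comparisons
```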
59. stability results (lower is better)
earlier convergence because of increased power
[figure: % conflicting comparisons vs. query subset size (5 to 50), one panel each for Broad and Fine judgments, one line per measure (AG, NDCG, ANDCG, ADR)]
AG converges again to 3-4%; (A)NDCG converge to 5-6%
60. type of conflicts (50 queries)

                   conflicts   A>B (power)   A<B (minor)   A<<B (major)
  Broad   AG         3.68%       96.32%         3.68%          0%
          NDCG       5.05%       96.82%         3.18%          0%
          ANDCG      6.08%       96.84%         3.13%          0.03%
          ADR        5.93%       95.12%         4.88%          0%
  Fine    AG         3.32%       98.34%         1.66%          0%
          NDCG       6.58%       96.61%         3.39%          0%
          ANDCG      6.44%       94.94%         5.06%          0%
          ADR       12.48%       90.58%         9.37%          0.05%

again, most disagreements are due to lack of power in one collection
conflicts are within known Type III error rates
virtually no major conflicts
64. Do not attempt to accomplish greater results by a greater effort of your little understanding, but by a greater understanding of your little effort.
– Walter Russell
65. using more and more queries is pointless
too much effort for the small gain in power and stability
using different similarity scales has little effect
using only one is probably just fine
some effectiveness measures are better than others
they should still be used: they measure different things
but bear in mind their power and stability
some statistical methods are better than others
virtually the same reliability with half the effort
if significance shows up it most probably is true
at worst, conflicts are due to lack of power
67. forget about power and worry about effect size
eventually, significance becomes meaningless
reduce the judging effort with crowdsourcing
more queries in Symbolic Melodic Similarity
reliable low-cost in-house evaluations
deeper evaluation cutoffs
not just the top 5 documents: pay attention to ranking
probably more reliable, and certainly more reusable
effect of the number of systems
especially if developed by the same research group
other statistical methods
Multiple Comparisons with a Control (baseline)
other collections, tasks and measures
68. guide experimenters in the interpretation of the results and the tradeoff between effort and reliability