In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation: how often a significant difference is found between systems, and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying numbers of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as few as half the number of queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.
1. Audio Music Similarity and Retrieval: Evaluation Power and Stability
Julián Urbano @julian_urbano
Diego Martín, Mónica Marrero and Jorge Morato
University Carlos III of Madrid
ISMIR 2011
Picture by Michael Shane · Miami, USA · October 26th
4-6. grand results (MIREX 2009)
"I won!"
"oh, come on! it's so close!"
"but the difference is not significant…"
"yeah, it's not significant!"
"did you hear?"
"shut up… we are!"
"damn it!"
"don't worry about it"
8-9. proper interpretation of p-values
H0: mean score of system A = mean score of system B
H1: the mean scores are different
a statistical test returns p < 0.01, so we conclude A >> B
[figure: score distributions of systems B and A]
it means that if we assume H0 and repeat the experiment, there is a < 0.01 probability of getting this result again*
*or one even more extreme
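To make that definition concrete, here is a minimal simulation sketch (not from the talk; the score distribution and the observed difference of 0.05 are made up): we assume H0 is true and count how often a difference at least as extreme as the observed one reappears.

```python
# Minimal sketch: the frequentist meaning of a p-value, by simulation.
# Hypothetical setup: under H0, per-query score differences have mean 0;
# 0.05 is a made-up observed mean difference between systems A and B.
import numpy as np

rng = np.random.default_rng(42)
n_queries = 100
observed_diff = 0.05          # hypothetical observed mean(A) - mean(B)

# Repeat the experiment many times assuming H0 (true mean difference = 0).
reps = 10_000
sim_diffs = rng.normal(0.0, 0.15, size=(reps, n_queries)).mean(axis=1)

# p ~= probability of a result at least this extreme, given H0 (two-sided).
p = np.mean(np.abs(sim_diffs) >= observed_diff)
print(f"P(|diff| >= {observed_diff} | H0) ~= {p:.4f}")
```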
10. conclusions about general behavior
this evaluation (MIREX 2009) vs. a different collection (MIREX 2010):
A > B is not powerful: system A is better than B, but not statistically significantly so; with a different collection we can expect anything (A ? B)
A >> B is powerful: A is significantly better than B
…and stable if the other collection also yields A >> B (we expect the same)
but these could also happen: A > B (lack of power in MIREX 2010), A < B (minor stability conflict), A << B (major stability conflict)
13. Text REtrieval Conference
[Buckley and Voorhees, 2000]: no significance testing; 1% to 14% of comparisons show stability conflicts, depending on the measure used; ~25% differences needed to ensure <5% conflicts with 50 queries
[Sanderson and Zobel, 2005]: sensitivity; reliability improved with pairwise t-tests, others were not as good; virtually no conflicts if >10% differences with significance
[Voorhees, 2009]: effort; with many queries, even significance is unreliable
[Sakai, 2007]: major review with other collections and more recent measures; some measures are much better than others, but that does not mean the others should not be used!
14. Music Similarity and Retrieval
[Typke et al., 2005][Urbano et al., 2010]: alternative forms of ground truth for SMS, reliable and comprehensive but too expensive
[Typke et al., 2006]: a specific measure for the task, with no prefixed relevance scale
[Jones et al., 2007]: agreement between judgments by different people; propose to use more queries
[Urbano et al., 2010][Lee, 2010]: despite high agreement, the evaluation does change…; cheaper judgments via crowdsourcing seem reliable
[Urbano, 2011]: more about this and many other things, in 30 mins
16. it's actually about the effort-reliability tradeoff
task · relevance judgments · # of systems · # of queries · measures · system similarity · statistical methods
17. measures & judgments
Picture by Wessex Archaeology
18. how much information does the user gain?
results as a set:
AG@5: Average Gain in the top 5 documents (the measure used in MIREX, under a different name)
results as a list (a more realistic user model): the lower the rank, the lower the gain*
NDCG@5: Normalized Discounted Cumulated Gain
ANDCG@5: Average NDCG across ranks
ADR@5: Average Dynamic Recall (best documents first)
*details in the paper
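As a rough illustration of the set-vs-list distinction, here is a sketch of AG@5 and NDCG@5 under simple assumptions (gains are the judgment values of the documents, and at least 5 documents are judged per query; the exact definitions of all four measures, including ANDCG@5 and ADR@5, are in the paper):

```python
import numpy as np

def ag_at_k(gains, k=5):
    """Average Gain: mean gain of the top-k results; order does not matter."""
    return float(np.mean(gains[:k]))

def ndcg_at_k(gains, all_judged_gains, k=5):
    """NDCG: gains are discounted by rank, then normalized by the ideal list.
    Assumes all_judged_gains contains at least k values."""
    disc = 1.0 / np.log2(np.arange(2, k + 2))     # discounts for ranks 1..k
    dcg = float(np.sum(np.asarray(gains[:k], dtype=float) * disc))
    ideal = np.sort(np.asarray(all_judged_gains, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * disc))
    return dcg / idcg if idcg > 0 else 0.0

# BROAD gains (0/1/2) of the 5 retrieved documents, best one ranked second:
print(ag_at_k([1, 2, 0, 2, 1]))                        # 1.2, same in any order
print(ndcg_at_k([1, 2, 0, 2, 1], [2, 2, 2, 1, 1, 0]))  # penalizes the ranking
```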
19. how much information does a result provide?
BROAD relevance judgments: not similar = 0, somewhat similar = 1, very similar = 2
FINE relevance judgments: real-valued, from 0 to 10 or 100
22-29. % of pairwise comparisons that are significant
what's the effect of: number of queries, relevance judgments, effectiveness measures
evaluation: from the full 100-query set, draw a random sample of 5, 10, … up to all 100 queries
stratified random sampling with equal priors: subsets balanced across the 10 genres (baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic)
repeat 500 times per subset size to minimize random effects: 52,500 system comparisons
[figure: % significant vs. # queries, one panel each for Broad and Fine judgments]
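A sketch of this sampling procedure, under stated assumptions (`scores` is a hypothetical systems-by-queries matrix of per-query effectiveness values and `genres` the per-query genre labels; the slides do not prescribe the test used here, so a paired Wilcoxon stands in):

```python
from itertools import combinations
import numpy as np
from scipy import stats

def sample_queries(genres, per_genre, rng):
    """Stratified random sampling with equal priors: same count per genre."""
    idx = []
    for g in np.unique(genres):
        pool = np.where(genres == g)[0]
        idx.extend(rng.choice(pool, size=per_genre, replace=False))
    return np.array(idx)

def pct_significant(scores, genres, per_genre, reps=500, alpha=0.05, seed=0):
    """% of pairwise system comparisons found significant on random subsets."""
    rng = np.random.default_rng(seed)
    sig = total = 0
    for _ in range(reps):
        q = sample_queries(genres, per_genre, rng)
        for a, b in combinations(range(scores.shape[0]), 2):
            total += 1
            # paired test over the same query subset
            if stats.wilcoxon(scores[a, q], scores[b, q]).pvalue < alpha:
                sig += 1
    return 100.0 * sig / total
```

With 15 systems this yields 105 pairs per repetition, hence the 52,500 comparisons per subset size quoted above.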
36-41. % of pairwise comparisons that are conflicting
what's the effect of: number of queries, relevance judgments, effectiveness measures
evaluation: draw two independent query subsets, each balanced across the 10 genres, with 5 to 50 queries each (with 100 total queries we can't go beyond 50)
compare each pair of systems across the two subsets: 52,500 cross-collection system comparisons
repeat 500 times to minimize random effects
[figure: % conflicting vs. # queries, one panel each for Broad and Fine judgments]
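The cross-collection bookkeeping could look like this sketch (hypothetical helper names, reusing the per-query `scores` idea from the previous snippet): each pair of systems gets a verdict on each subset, and pairs found significant on one subset are classified by what the other subset says.

```python
import numpy as np
from scipy import stats

def verdict(x, y, alpha=0.05):
    """'>>' or '<<' if the paired difference is significant, '>' or '<' else."""
    d = float(np.mean(x) - np.mean(y))
    sig = stats.wilcoxon(x, y).pvalue < alpha
    return ('>>' if d > 0 else '<<') if sig else ('>' if d > 0 else '<')

def conflict_type(v1, v2):
    """Given A >> B on subset 1, what does subset 2 say about A vs B?"""
    if v1 != '>>':
        return None  # only significant differences are checked for stability
    return {'>>': 'stable',
            '>':  'lack of power',
            '<':  'minor conflict',
            '<<': 'major conflict'}[v2]
```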
51. compare two systems
is the difference significant? t-test, Wilcoxon test, sign test, etc.; they make different assumptions
significance level α: the probability of a Type I error (finding a significant difference when there is none, which surfaces as a stability conflict)
usually α = 0.05 or α = 0.01: 5% or 1% of my significant results are just wrong
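For two concrete systems, the three tests named above might be run like this (toy per-query scores; `binomtest` implements the sign test by counting only which system wins each query):

```python
import numpy as np
from scipy import stats

# Toy paired per-query scores for two hypothetical systems A and B.
a = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51])
b = np.array([0.58, 0.53, 0.65, 0.49, 0.60, 0.57, 0.70, 0.50])

print(stats.ttest_rel(a, b).pvalue)   # paired t-test: assumes normality
print(stats.wilcoxon(a, b).pvalue)    # Wilcoxon: symmetric differences
d = a - b                             # sign test: only the sign matters
print(stats.binomtest(int(np.sum(d > 0)), n=int(np.sum(d != 0))).pvalue)
```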
52. compare several systems (MIREX 2009)
15 systems = 105 pairwise comparisons
experiment-wide significance level = 1-(1-α)^105 ≈ 0.995
we can expect at least one significant comparison to be wrong
instead, compare all systems at once: ANOVA, Friedman test (used in MIREX), Kruskal-Wallis, etc., with different assumptions
and correct p-values to keep the experiment-wide significance level < 0.05: Tukey's HSD, Bonferroni, Scheffé, Duncan, Newman-Keuls, etc.
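A sketch of the omnibus-then-correct approach, with toy data (Friedman as the omnibus test; Bonferroni-corrected pairwise Wilcoxon tests stand in for the post-hoc step here, while MIREX itself pairs Friedman with Tukey's HSD):

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Toy data: 15 systems x 100 queries of per-query effectiveness scores.
scores = np.random.default_rng(1).random((15, 100))

# Omnibus test over all systems at once (one sample per system).
stat, p = stats.friedmanchisquare(*scores)
print(f"Friedman omnibus p = {p:.4f}")

# Post-hoc pairwise tests at a Bonferroni-corrected alpha.
pairs = list(combinations(range(scores.shape[0]), 2))   # 105 pairs
alpha = 0.05 / len(pairs)
for a, b in pairs[:3]:                                  # first few, for brevity
    p = stats.wilcoxon(scores[a], scores[b]).pvalue
    print(f"{a} vs {b}: {'significant' if p < alpha else 'not significant'}")
```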
54. what a MIREX participant wants
compare my system with the other 14; comparisons among those 14 are uninteresting
subexperiment: only 14 pairwise comparisons, not 105
get back the power missed by considering the other 91 comparisons
should throw out more conflicts too
the number of comparisons grows linearly with the number of systems
subexperiment-wide significance level = 1-(1-α)^14 ≈ 0.512
compare all systems with 1-tailed Wilcoxon tests at α = 0.01:
experiment-wide significance level = 1-(1-0.01)^105 ≈ 0.652
subexperiment-wide significance level = 1-(1-0.01)^14 ≈ 0.131
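The family-wise levels quoted above follow directly from the formula 1-(1-α)^m for m independent comparisons:

```python
# Family-wise significance level for m independent comparisons at level alpha.
def fwer(alpha, m):
    return 1 - (1 - alpha) ** m

print(fwer(0.05, 105))  # ~0.995: experiment-wide, 105 comparisons
print(fwer(0.05, 14))   # ~0.512: subexperiment-wide, 14 comparisons
print(fwer(0.01, 105))  # ~0.652: all pairwise 1-tailed Wilcoxon at alpha=0.01
print(fwer(0.01, 14))   # ~0.131: the participant's 14 comparisons
```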
59. stability results (lower is better)
earlier convergence because of increased power
[figure: % conflicting comparisons vs. query subset size (5 to 50), one panel each for Broad and Fine judgments, one line per measure (AG, NDCG, ANDCG, ADR)]
AG converges again to 3-4%; (A)NDCG converge to 5-6%
60. type of conflicts (50 queries)

                   conflicts   A>B (power)   A<B (minor)   A<<B (major)
  Broad   AG         3.68%       96.32%         3.68%          0%
          NDCG       5.05%       96.82%         3.18%          0%
          ANDCG      6.08%       96.84%         3.13%          0.03%
          ADR        5.93%       95.12%         4.88%          0%
  Fine    AG         3.32%       98.34%         1.66%          0%
          NDCG       6.58%       96.61%         3.39%          0%
          ANDCG      6.44%       94.94%         5.06%          0%
          ADR       12.48%       90.58%         9.37%          0.05%

again, most disagreements are due to lack of power in one collection
conflicts are within known Type III error rates
virtually no major conflicts
64. Do not attempt to accomplish greater results by a greater effort of your little understanding, but by a greater understanding of your little effort.
– Walter Russell
65. using more and more queries is pointless
too much effort for the small gain in power and stability
using different similarity scales has little effect
using only one is probably just fine
some effectiveness measures are better than others
they should still be used: they measure different things
but bear in mind their power and stability
some statistical methods are better than others
virtually the same reliability with half the effort
if significance shows up it most probably is true
at worst, conflicts are due to lack of power
67. forget about power and worry about effect size
eventually, significance becomes meaningless
reduce the judging effort with crowdsourcing
more queries in Symbolic Melodic Similarity
reliable low-cost in-house evaluations
deeper evaluation cutoffs
not just the top 5 documents: pay attention to ranking
probably more reliable, and certainly more reusable
effect of the number of systems
especially if developed by the same research group
other statistical methods
Multiple Comparisons with a Control (baseline)
other collections, tasks and measures
68. guide experimenters in the interpretation of the results and the tradeoff between effort and reliability