
Audio Music Similarity and Retrieval: Evaluation Power and Stability


In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation, that is, how often a significant difference is found between systems and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying number of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as little as half the number of queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.


Transcript of "Audio Music Similarity and Retrieval: Evaluation Power and Stability"

  1. Audio Music Similarity and Retrieval: Evaluation Power and Stability. Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato. University Carlos III of Madrid. ISMIR 2011, Miami, USA · October 26th. Picture by Michael Shane.
  2. AMS: retrieve audio clips musically similar to a query clip
  3. grand results (MIREX 2009)
  4. grand results (MIREX 2009). Reactions: "I won!" / "oh, come on! it’s so close!" / "but the difference is not significant…" / "yeah, it’s not significant!"
  5. grand results (MIREX 2009). Reactions: "I won!" / "oh, come on! it’s so close!" / "but the difference is not significant…" / "did you hear?" / "yeah, it’s not significant!" / "shut up… we are!"
  6. grand results (MIREX 2009). Reactions: "I won!" / "oh, come on! it’s so close!" / "but the difference is not significant…" / "did you hear?" / "yeah, it’s not significant!" / "shut up… we are!" / "damn it!" / "don’t worry about it"
  7. what does it mean? Picture by Sara A. Beyer
  8. proper interpretation of p-values. H0: mean score of system A = mean score of B. H1: mean scores are different. A statistical test returns p<0.01, so we conclude A >> B.
  9. proper interpretation of p-values. H0: mean score of system A = mean score of B. H1: mean scores are different. A statistical test returns p<0.01, so we conclude A >> B. It means that if we assume H0 and repeat the experiment, there is a <0.01 probability of having this result again* (*or one even more extreme).
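The reading of a p-value on slide 9 can be made concrete with a small sketch. It is not the test used in the talk: it is an exact sign-flip randomization test, chosen only because it computes "the probability, under H0, of a result at least as extreme" by direct enumeration. The function name and the per-query scores are hypothetical.

```python
from itertools import product


def sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided p-value for H0: mean score of A = mean score of B."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    # Under H0 each per-query difference is equally likely to carry either
    # sign, so we enumerate every sign assignment (2**n; fine for small n).
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # "this result or one more extreme"
            extreme += 1
    return extreme / total


# Hypothetical per-query scores for two systems over 10 queries.
system_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51, 0.64, 0.58]
system_b = [0.50, 0.47, 0.60, 0.45, 0.52, 0.49, 0.61, 0.44, 0.50, 0.46]
p = sign_flip_p_value(system_a, system_b)
print(f"p = {p:.4f}")
```

The value printed is a statement about repeated experiments under H0, not the probability that A is better than B.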
  10. conclusions about general behavior: what a MIREX 2009 result tells us about MIREX 2010
      this evaluation is not powerful: A > B in MIREX 2009 (system A is better than B, but it’s not statistically significant), so A ? B in MIREX 2010 (we can expect anything with a different collection)
      this one is powerful… and stable: A >> B in MIREX 2009 (A is better than B, and it’s statistically significant), so we expect the same in MIREX 2010: A >> B (A is significantly better than B)
      but these could also happen in MIREX 2010: A > B (lack of power in MIREX 2010), A < B (minor stability conflict) or A << B (major stability conflict)
  11. it’s all about reliability
  12. Isaac Newton: on the shoulders of giants
  13. Text REtrieval Conference
      [Buckley and Voorhees, 2000] no significance testing; 1% to 14% of comparisons show stability conflicts, depending on the measure used; ~25% differences to ensure <5% conflicts with 50 queries
      [Sanderson and Zobel, 2005] improved reliability with pairwise t-tests (others were not as good); virtually no conflicts if >10% differences with significance
      [Voorhees, 2009] with many queries, even significance is unreliable
      [Sakai, 2007] major review: other collections and more recent measures; some measures are much better than others, which does not mean they should not be used!
      slide labels: sensitivity, effort
  14. Music Similarity and Retrieval
      [Typke et al., 2005][Urbano et al., 2010] alternative forms of ground truth for SMS: reliable and comprehensive but too expensive; no prefixed relevance scale
      [Typke et al., 2006] specific measure for the task
      [Jones et al., 2007] agreement between judgments by different people: despite high agreement, evaluation does change…; propose to use more queries
      [Urbano et al., 2010][Lee, 2010] cheaper judgments via crowdsourcing; seems reliable
      [Urbano, 2011] more about this and many other things in 30 mins
  15. it’s actually about the effort-reliability tradeoff
  16. it’s actually about the effort-reliability tradeoff: task, relevance judgments, # of systems, # of queries, measures, system similarity, statistical methods
  17. measures & judgments. Picture by Wessex Archaeology
  18. how much information does the user gain?
      results as a set: AG@5, Average Gain in the top 5 documents (measure used in MIREX, with different name)
      results as a list (more realistic user model): NDCG@5, Normalized Discounted Cumulated Gain; ANDCG@5, Average NDCG across ranks; ADR@5, Average Dynamic Recall
      first, best documents first, and the lower the rank the lower the gain* (*details in the paper)
  19. how much information does a result provide?
      BROAD relevance judgments: not similar = 0, somewhat similar = 1, very similar = 2
      FINE relevance judgments: real-valued, from 0 to 10 or 100
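A minimal sketch of the set-based and list-based views from slide 18, using Broad gains (0, 1, 2) from slide 19. The slides defer the exact formulas to the paper, so everything here is an assumption: AG@k is taken as the plain mean gain of the top k results, and NDCG@k uses the common log2 rank discount with the ideal ranking built from all judged documents for the query. The function names and the sample data are hypothetical.

```python
import math


def average_gain(gains, k=5):
    """AG@k (assumed form): mean gain of the top-k results, order ignored."""
    return sum(gains[:k]) / k


def dcg(gains, k=5):
    # Assumed discount: the gain at rank r is divided by log2(r + 1),
    # so the lower a document is ranked, the less it contributes.
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))


def ndcg(gains, judged_gains, k=5):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(judged_gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical Broad gains of one system's top 5 results for one query,
# and the gains of every judged document for that query.
ranking = [2, 0, 1, 2, 0]
judged = [2, 2, 2, 1, 1, 0, 0, 0]

print("AG@5   =", average_gain(ranking))
print("NDCG@5 =", round(ndcg(ranking, judged), 3))
# Same documents, best first: AG@5 is unchanged, NDCG@5 increases.
reordered = sorted(ranking, reverse=True)
print("AG@5   =", average_gain(reordered))
print("NDCG@5 =", round(ndcg(reordered, judged), 3))
```

Reordering the same five documents leaves the set-based measure untouched and changes the list-based one, which is the distinction the slide draws.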
  20. look at MIREX 2009: largest evaluation until 2011
  21. power. Picture by Roger Green
  22. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures
  23. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set]
  24. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, random sample, 5 query subset]
  25. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, random sample, 5 query subset, evaluation; plots of % significant against # queries for Broad and Fine judgments]
  26. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, random sample, 5 query subset, evaluation; plots of % significant against # queries for Broad and Fine judgments.] repeat 500 times for 5 query subsets to minimize random effects: 52,500 system comparisons
  27. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, 10 query subset, evaluation; plots of % significant against # queries for Broad and Fine judgments.] repeat another 500 times for 10 query subsets: 52,500 system comparisons
  28. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, 10 query subset, evaluation; plots of % significant against # queries for Broad and Fine judgments.] stratified random sampling with equal priors, balanced across 10 genres: baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic
  29. % of pairwise comparisons that are significant. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 query subset, evaluation; plots of % significant against # queries for Broad and Fine judgments]
  30. we simulate possible evaluation scenarios
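The simulation built up on slides 22 to 29 can be sketched as follows: draw a random query subset, test every pair of systems on it, and report the share of significant comparisons, repeating to smooth out random effects. This is a sketch under stated assumptions, not the procedure from the paper. The scores are synthetic, the subsets are simple (not genre-stratified) random samples, and a two-sided exact sign test stands in for the statistical procedures discussed in the talk. All names are hypothetical.

```python
import math
import random


def sign_test_p(diffs):
    """Two-sided exact sign test on per-query differences (ties dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def percent_significant(scores, subset_size, trials, alpha, rng):
    """scores[s][q] is the score of system s on query q."""
    n_systems, n_queries = len(scores), len(scores[0])
    pairs = [(a, b) for a in range(n_systems) for b in range(a + 1, n_systems)]
    significant = 0
    for _ in range(trials):
        subset = rng.sample(range(n_queries), subset_size)
        for a, b in pairs:
            diffs = [scores[a][q] - scores[b][q] for q in subset]
            if sign_test_p(diffs) < alpha:
                significant += 1
    return 100.0 * significant / (trials * len(pairs))


# 15 systems give C(15, 2) = 105 pairs; 500 trials give 52,500 comparisons.
print(math.comb(15, 2), 500 * math.comb(15, 2))

# Synthetic scores (assumption): 15 systems of increasing quality, 100 queries.
rng = random.Random(0)
scores = [[min(1.0, max(0.0, rng.gauss(0.40 + 0.02 * s, 0.15)))
           for _ in range(100)] for s in range(15)]
for size in (5, 10, 50, 100):
    # 20 trials instead of 500 to keep the sketch fast.
    print(size, round(percent_significant(scores, size, 20, 0.05, rng), 1))
```

With 5 queries a two-sided sign test cannot reach p < 0.05 at all, so the first row is always 0%; this is a property of the stand-in test, not a finding of the talk.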
  31. power results (larger is better). [Plots: % significant comparisons against query set size (40 to 100) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the power in MIREX 2009 is marked.]
  32. power results (larger is better). similar logarithmic trend, except for ADR with Fine judgments (expected). [Plots: % significant comparisons against query set size (40 to 100) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the power in MIREX 2009 is marked.]
  33. power results (larger is better). similar logarithmic trend, except for ADR with Fine judgments (expected). only 2 significant pairs missed with 70% effort (probably unstable). same power with 70% effort! [Plots: % significant comparisons against query set size (40 to 100) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the power in MIREX 2009 is marked.]
  34. merely using more queries does not pay off when looking for power
  35. stability. Picture by Dave Hunt
  36. % of pairwise comparisons that are conflicting. what’s the effect of: number of queries, relevance judgments, effectiveness measures
  37. % of pairwise comparisons that are conflicting. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, split by genre (baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic), 5 query subset]
  38. % of pairwise comparisons that are conflicting. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, split by genre (baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic), two 5 query subsets drawn as independent samples]
  39. % of pairwise comparisons that are conflicting. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, split by genre, two independent 5 query subsets, one evaluation per subset; plots of % conflicting against # queries for Broad and Fine judgments]
  40. % of pairwise comparisons that are conflicting. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: all 100 queries set, split by genre, two independent 5 query subsets, one evaluation per subset; plots of % conflicting against # queries for Broad and Fine judgments.] repeat 500 times to minimize random effects: 52,500 cross-collection system comparisons
  41. % of pairwise comparisons that are conflicting. what’s the effect of: number of queries, relevance judgments, effectiveness measures. [Diagram: two 50 query subsets, one evaluation per subset; plots of % conflicting against # queries for Broad and Fine judgments.] with 100 total queries we can’t go beyond 50
  42. we simulate comparisons across possible collections
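The cross-collection comparison on slides 37 to 41, together with the conflict types named on slide 10, can be sketched as below. The labels (lack of power, minor, major) come from the slides. The function names are hypothetical, and so is the choice to count a comparison only when at least one of the two subsets shows a significant difference.

```python
import random


def outcome(mean_diff, p_value, alpha=0.05):
    """Result of A versus B on one collection: '>>', '>', '<' or '<<'."""
    significant = p_value < alpha
    if mean_diff >= 0:
        return ">>" if significant else ">"
    return "<<" if significant else "<"


def conflict_type(first, second):
    """Classify the outcomes of one system pair on two independent subsets."""
    significant = {">>", "<<"}
    if first not in significant and second not in significant:
        return None  # nothing significant, so nothing to contradict
    # Orient the pair so that `first` is the significant outcome...
    if first not in significant:
        first, second = second, first
    # ...and reads as A >> B, mirroring both outcomes if necessary.
    if first == "<<":
        mirror = {">>": "<<", ">": "<", "<": ">", "<<": ">>"}
        first, second = mirror[first], mirror[second]
    return {">>": None, ">": "power", "<": "minor", "<<": "major"}[second]


def independent_subsets(n_queries, subset_size, rng):
    """Two disjoint query subsets, playing the role of two collections."""
    ids = rng.sample(range(n_queries), 2 * subset_size)
    return ids[:subset_size], ids[subset_size:]


rng = random.Random(0)
first_half, second_half = independent_subsets(100, 50, rng)
print(len(first_half), len(second_half), set(first_half) & set(second_half))
print(conflict_type(">>", ">"), conflict_type(">>", "<"), conflict_type(">>", "<<"))
```

Because the two subsets must be disjoint, 100 queries allow subsets of at most 50; asking for 51 raises `ValueError` from `random.sample`.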
  43. stability results (lower is better). [Plots: % conflicting comparisons against query subset size (5 to 50) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the stability in MIREX 2009 is marked.]
  44. stability results (lower is better). lack of power in one collection but not in the other. [Plots: % conflicting comparisons against query subset size (5 to 50) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the stability in MIREX 2009 is marked.]
  45. stability results (lower is better). lack of power in one collection but not in the other. ADR takes longer to converge. [Plots: % conflicting comparisons against query subset size (5 to 50) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the stability in MIREX 2009 is marked.]
  46. stability results (lower is better). lack of power in one collection but not in the other. ADR takes longer to converge. converge to <5% for >40 queries (consistent with α=0.05). [Plots: % conflicting comparisons against query subset size (5 to 50) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; the stability in MIREX 2009 is marked.]
  47. merely using more queries does not pay off when looking for stability
  48. type of conflicts (50 queries)
      judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
      Broad      AG        3.36%     100%         0%           0%
      Broad      NDCG      3.77%     99.90%       0.10%        0%
      Broad      ANDCG     4.73%     99.96%       0.04%        0%
      Broad      ADR       9.03%     99.94%       0.06%        0%
      Fine       AG        2.64%     99.86%       0.14%        0%
      Fine       NDCG      2.94%     99.74%       0.26%        0%
      Fine       ANDCG     4.03%     99.91%       0.09%        0%
      Fine       ADR      19.08%     99.50%       0.50%        0%
      no major conflict whatsoever
      virtually all conflicts due to lack of power in one collection
  49. if significance shows up it most probably is correct. are we being too conservative?
  50. statistics. Milton Friedman, Frank Wilcoxon, John Tukey
  51. compare two systems: is the difference significant?
      t-test, Wilcoxon test, sign test, etc.: they make different assumptions
      significance level α: probability of Type I error (finding a significant difference when there is none), that is, a stability conflict
      usually, α=0.05 or α=0.01: 5% or 1% of my significant results are just wrong
  52. MIREX 2009: compare several systems
      15 systems = 105 comparisons
      experiment-wide significance level = 1-(1-α)^105 = 0.995 (with α=0.05): we can expect at least one significant comparison to be wrong
      instead, compare all systems at once: ANOVA, Friedman test (used in MIREX), Kruskal-Wallis, etc. (with different assumptions)
      correct p-values to keep experiment-wide significance level <0.05: Tukey’s HSD, Bonferroni, Scheffe, Duncan, Newman-Keuls, etc.
  53. more stability at the cost of less power: is it worth it?
  54. what a MIREX participant wants: compare my system with the other 14; comparisons between those 14 are uninteresting
      subexperiment: only 14 pairwise comparisons, not 105
      get back the power missed by considering the other 91; should throw out more conflicts too
      number of comparisons grows linearly with number of systems
      subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (with α=0.05)
      compare all systems with 1-tailed Wilcoxon tests at α=0.01:
      experiment-wide significance level = 1-(1-0.01)^105 = 0.652
      subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131
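The four significance levels quoted on slides 52 and 54 follow from the same expression, 1-(1-α)^n, which treats the n comparisons as independent. A short sketch that reproduces them; the function name is hypothetical:

```python
import math


def experiment_wide_level(alpha, n_comparisons):
    """Probability of at least one Type I error in n independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


n_systems = 15
all_pairs = math.comb(n_systems, 2)  # every system against every other: 105
one_vs_rest = n_systems - 1          # my system against the other 14

for alpha in (0.05, 0.01):
    print(
        f"alpha={alpha}: "
        f"experiment-wide={experiment_wide_level(alpha, all_pairs):.3f}, "
        f"subexperiment-wide={experiment_wide_level(alpha, one_vs_rest):.3f}"
    )
```

All pairs grow quadratically with the number of systems, while comparisons against a single system grow linearly, which is why the subexperiment keeps the level much lower.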
  55. power results (larger is better). [Plots: % significant comparisons against query set size (40 to 100) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; Friedman+Tukey (as in MIREX) is marked.]
  56. power results (larger is better). all 1-tailed Wilcoxon comparisons are up to 20% more powerful than Friedman+Tukey. [Plots: % significant comparisons against query set size (40 to 100) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; Friedman+Tukey (as in MIREX) is marked.]
  57. power results (larger is better). all 1-tailed Wilcoxon comparisons are up to 20% more powerful than Friedman+Tukey. same power with 50% effort! [Plots: % significant comparisons against query set size (40 to 100) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR; Friedman+Tukey (as in MIREX) is marked.]
  58. stability results (lower is better). earlier convergence because of increased power. [Plots: % conflicting comparisons against query subset size (5 to 50) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR.]
  59. stability results (lower is better). earlier convergence because of increased power. AG converges again to 3-4%; (A)NDCG converge to 5-6%. [Plots: % conflicting comparisons against query subset size (5 to 50) for Broad and Fine judgments, one line each for AG, NDCG, ANDCG and ADR.]
  60. type of conflicts (50 queries)
      judgments  measure  conflicts  A>B (power)  A<B (minor)  A<<B (major)
      Broad      AG        3.68%     96.32%       3.68%        0%
      Broad      NDCG      5.05%     96.82%       3.18%        0%
      Broad      ANDCG     6.08%     96.84%       3.13%        0.03%
      Broad      ADR       5.93%     95.12%       4.88%        0%
      Fine       AG        3.32%     98.34%       1.66%        0%
      Fine       NDCG      6.58%     96.61%       3.39%        0%
      Fine       ANDCG     6.44%     94.94%       5.06%        0%
      Fine       ADR      12.48%     90.58%       9.37%        0.05%
      again, due to lack of power in one collection
      minor conflicts: within known Type III error rates
      no major conflicts
  61. effort-reliability tradeoff
                          Friedman+Tukey with 100 queries    1-tailed Wilcoxon with 50 queries
      judgments  measure  power - conflicts = stable         power - conflicts = stable
      Broad      AG       57.14% -  3.64% = 53.50%           55.10% -  3.68% = 51.42%
      Broad      NDCG     57.14% -  4.08% = 53.06%           57.01% -  5.05% = 51.96%
      Broad      ANDCG    57.14% -  4.19% = 52.95%           57.37% -  6.08% = 51.29%
      Broad      ADR      56.19% -  7.13% = 49.06%           57.30% -  5.93% = 51.37%
      Fine       AG       54.29% -  3.20% = 51.09%           54.31% -  3.32% = 50.99%
      Fine       NDCG     56.19% -  3.04% = 53.15%           57.56% -  6.58% = 50.98%
      Fine       ANDCG    56.19% -  2.96% = 53.23%           57.38% -  6.44% = 50.94%
      Fine       ADR      56.19% - 19.97% = 36.22%           55.03% - 12.48% = 42.55%
      virtually same reliability with half the effort!
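The "stable" column on slide 61 is power minus conflicts, in percentage points. The sketch below recomputes it from the figures on the slide and shows the gap between the two procedures. The variable and function names are hypothetical; the numbers are copied from the slide.

```python
# (judgments, measure, FT power, FT conflicts, W power, W conflicts)
# FT = Friedman+Tukey with 100 queries, W = 1-tailed Wilcoxon with 50 queries.
ROWS = [
    ("Broad", "AG",    57.14,  3.64, 55.10,  3.68),
    ("Broad", "NDCG",  57.14,  4.08, 57.01,  5.05),
    ("Broad", "ANDCG", 57.14,  4.19, 57.37,  6.08),
    ("Broad", "ADR",   56.19,  7.13, 57.30,  5.93),
    ("Fine",  "AG",    54.29,  3.20, 54.31,  3.32),
    ("Fine",  "NDCG",  56.19,  3.04, 57.56,  6.58),
    ("Fine",  "ANDCG", 56.19,  2.96, 57.38,  6.44),
    ("Fine",  "ADR",   56.19, 19.97, 55.03, 12.48),
]


def stable(power, conflicts):
    """Share of comparisons that are significant and not in conflict."""
    return round(power - conflicts, 2)


gaps = []
for judgments, measure, ft_pow, ft_con, w_pow, w_con in ROWS:
    ft, w = stable(ft_pow, ft_con), stable(w_pow, w_con)
    gaps.append(round(ft - w, 2))
    print(f"{judgments:5} {measure:5} FT={ft:.2f}% W={w:.2f}% gap={ft - w:+.2f}")
```

A positive gap favours Friedman+Tukey with 100 queries; a negative one favours the 1-tailed Wilcoxon tests with 50 queries, which happens for ADR under both judgment scales.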
  62. Friedman-Tukey requires too much effort
  63. my point?
  64. "Do not attempt to accomplish greater results by a greater effort of your little understanding, but by a greater understanding of your little effort." (Walter Russell)
  65. using more and more queries is pointless: too much effort for the small gain in power and stability
      using different similarity scales has little effect: using only one is probably just fine
      some effectiveness measures are better than others: they should still be used, since they measure different things, but bear in mind their power and stability
      some statistical methods are better than others: virtually same reliability with half the effort
      if significance shows up it most probably is true: at worst, conflicts are due to lack of power
  66. Picture by Ronny Welter
  67. forget about power and worry about effect-size: eventually, significance becomes meaningless
      reduce the judging effort: more queries in Symbolic Melodic Similarity; reliable low-cost in-house evaluations and crowdsourcing
      deeper evaluation cutoffs: not just the top 5 documents, pay attention to ranking; probably more reliable, and certainly more reusable
      effect of the number of systems: especially if developed by the same research group
      other statistical methods: Multiple Comparisons with a Control (baseline)
      other collections, tasks and measures
  68. guide experimenters in the interpretation of the results and the tradeoff between effort and reliability