Audio Music Similarity and Retrieval: Evaluation Power and Stability

In this paper we analyze the reliability of the results in the evaluation of Audio Music Similarity and Retrieval systems. We focus on the power and stability of the evaluation: how often a significant difference is found between systems, and how often these significant differences are incorrect. We study the effect of using different effectiveness measures with different sets of relevance judgments, for varying numbers of queries and alternative statistical procedures. Different measures are shown to behave similarly overall, though some are much more sensitive and stable than others. The use of different statistical procedures does improve the reliability of the results, and it allows using as few as half the queries currently used in MIREX evaluations while still offering very similar reliability levels. We also conclude that experimenters can be very confident that if a significant difference is found between two systems, the difference is indeed real.



Transcript

  • 1. Audio Music Similarity and Retrieval: Evaluation Power and Stability. Julián Urbano (@julian_urbano), Diego Martín, Mónica Marrero and Jorge Morato, University Carlos III of Madrid. ISMIR 2011, Miami, USA · October 26th. Picture by Michael Shane
  • 2. AMS: retrieve audio clips musically similar to a query clip
  • 3. grand results (MIREX 2009)
  • 4. grand results (MIREX 2009). "I won!" "oh, come on! it's so close!" "but the difference is not significant…" "yeah, it's not significant!"
  • 5. grand results (MIREX 2009). "I won!" "oh, come on! it's so close!" "but the difference is not significant…" "did you hear?" "yeah, it's not significant!" "shut up…" "we are!"
  • 6. grand results (MIREX 2009). "I won!" "oh, come on! it's so close!" "but the difference is not significant…" "did you hear?" "yeah, it's not significant!" "shut up…" "we are!" "damn it!" "don't worry about it"
  • 7. what does it mean? Picture by Sara A. Beyer
  • 8. proper interpretation of p-values. H0: mean score of system A = mean score of B. H1: mean scores are different. A statistical test returns p<0.01, so we conclude A >> B.
  • 9. proper interpretation of p-values. H0: mean score of system A = mean score of B. H1: mean scores are different. A statistical test returns p<0.01, so we conclude A >> B. It means that if we assume H0 and repeat the experiment, there is a <0.01 probability of obtaining this result* again (*or one even more extreme).
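As a concrete illustration of this interpretation, here is a minimal sketch (not from the paper) that runs a paired Wilcoxon test over synthetic per-query scores for two hypothetical systems A and B; the score distributions are made up for the example.

```python
# A paired significance test over per-query scores for two hypothetical
# systems A and B. The scores here are synthetic, for illustration only.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
scores_a = rng.beta(2, 5, 100)                                        # system A, 100 queries
scores_b = np.clip(scores_a - 0.05 + rng.normal(0, 0.1, 100), 0, 1)   # system B

stat, p = wilcoxon(scores_a, scores_b)  # H0: A and B perform the same
print(p)
# p < 0.01 means: IF we assume H0 and repeat the experiment, a result at
# least this extreme occurs with probability < 0.01. It is NOT the
# probability that H0 is true.
```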
  • 10. conclusions about general behavior. If MIREX 2009 yields A > B (system A is better than B, but not statistically significant), the evaluation is not powerful: with a different collection (MIREX 2010) we can expect anything (A ? B). If it yields A >> B (A is better than B, and it's statistically significant), the evaluation is powerful… and stable if MIREX 2010 also yields A >> B, as expected. But these could also happen: A > B (lack of power in MIREX 2010), A < B (minor stability conflict) or A << B (major stability conflict).
  • 11. it's all about reliability
  • 12. Isaac Newton on the shoulders of giants
  • 13. Text REtrieval Conference. [Buckley and Voorhees, 2000]: no significance testing; stability depends on the measure used; 1% to 14% of comparisons show stability conflicts; ~25% differences needed to ensure <5% conflicts with 50 queries. [Sanderson and Zobel, 2005]: improved reliability (sensitivity vs. effort) with pairwise t-tests; virtually no conflicts if >10% differences with significance. [Voorhees, 2009]: with many queries, even significance is unreliable. [Sakai, 2007]: major review with other collections and more recent measures; some measures are much better than others, and others not as good, but that does not mean they should not be used!
  • 14. Music Similarity and Retrieval. [Typke et al., 2005]: alternative forms of ground truth for SMS, reliable and comprehensive but too expensive; no prefixed relevance scale. [Typke et al., 2006]: specific measure for the task. [Jones et al., 2007]: agreement between judgments by different people; propose to use more queries. [Urbano et al., 2010]: despite high agreement, evaluation does change… [Lee, 2010]: cheaper judgments via crowdsourcing; seems reliable. [Urbano, 2011]: more about this and many other things in 30 mins.
  • 15. it's actually about the effort-reliability tradeoff
  • 16. it's actually about the effort-reliability tradeoff: task, relevance judgments, # of systems, # of queries, measures, system similarity, statistical methods.
  • 17. measures & judgments. Picture by Wessex Archaeology
  • 18. how much information does the user gain? AG@5: Average Gain in the top 5 documents; treats the results as a set; the measure used in MIREX (with a different name). NDCG@5: Normalized Discounted Cumulated Gain; treats the results as a list, a more realistic user model: the lower the rank, the lower the gain. ANDCG@5: Average NDCG across ranks. ADR@5: Average Dynamic Recall; best documents first. *details in the paper
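For concreteness, a minimal sketch of AG@5 and NDCG@5 following common textbook definitions; the exact formulations in the paper may differ in details, and ANDCG@5 and ADR@5 are omitted.

```python
# AG@5 and NDCG@5 over a ranked list of gains. Broad gains are used in the
# example: not similar = 0, somewhat similar = 1, very similar = 2.
import math

def ag_at_k(gains, k=5):
    """Average Gain: mean gain of the top-k results (order-insensitive)."""
    return sum(gains[:k]) / k

def dcg_at_k(gains, k=5):
    """Discounted Cumulated Gain: the lower the rank, the lower the gain."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, judged_gains, k=5):
    """NDCG: DCG normalized by the DCG of the ideal (best-first) ranking."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

system_gains = [2, 0, 1, 2, 0]     # gains of the 5 documents a system retrieved
judged_gains = [2, 2, 2, 1, 1, 0]  # gains of all judged documents for the query
print(ag_at_k(system_gains))                  # 1.0
print(ndcg_at_k(system_gains, judged_gains))  # ~0.66
```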
  • 19. how much information does a result provide? BROAD relevance judgments: not similar = 0, somewhat similar = 1, very similar = 2. FINE relevance judgments: real-valued, from 0 to 10 or 100.
  • 20. look at MIREX 2009: largest evaluation until 2011
  • 21. power. Picture by Roger Green
  • 22. power: % of pairwise comparisons that are significant. What's the effect of: number of queries, relevance judgments, effectiveness measures?
  • 23. …starting from the set of all 100 queries.
  • 24. …take a 5-query subset by random sampling.
  • 25. …evaluate with the subset: % of significant comparisons as a function of the number of queries, with Broad and Fine judgments.
  • 26. …52,500 system comparisons; repeat 500 times for 5-query subsets to minimize random effects.
  • 27. …repeat another 500 times for 10-query subsets.
  • 28. …subsets balanced across 10 genres (baroque, blues, classical, country, edance, jazz, metal, rap-hiphop, rock&roll, romantic): stratified random sampling with equal priors.
  • 29. …and so on, up to the full set of all 100 queries.
  • 30. we simulate possible evaluation scenarios
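A minimal sketch of this simulation protocol, under stated assumptions: per-query scores come as a systems × queries matrix, and significance is decided with a paired two-tailed Wilcoxon test at α=0.05 (a stand-in; the procedures actually compared in the paper are Friedman+Tukey and 1-tailed Wilcoxon). With 15 systems, 105 pairs × 500 trials gives the 52,500 comparisons mentioned above.

```python
# Power simulation: draw random query subsets of a given size, test every
# pair of systems on each subset, and report the fraction of significant
# comparisons. With 15 systems, 105 pairs x 500 trials = 52,500 comparisons.
import itertools
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)

def power(scores, subset_size, trials=500, alpha=0.05):
    """scores: (n_systems, n_queries) matrix of per-query effectiveness."""
    n_systems, n_queries = scores.shape
    pairs = list(itertools.combinations(range(n_systems), 2))
    significant = 0
    for _ in range(trials):
        queries = rng.choice(n_queries, size=subset_size, replace=False)
        for a, b in pairs:
            _, p = wilcoxon(scores[a, queries], scores[b, queries])
            significant += p < alpha
    return significant / (trials * len(pairs))

# Synthetic scores for 15 systems over 100 queries, for illustration:
scores = rng.beta(2, 5, size=(15, 100))
print(power(scores, subset_size=50))
```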
  • 31. power results (larger is better). [Plots: % of significant comparisons vs. query set size (40 to 100), for Broad and Fine judgments, with AG, NDCG, ANDCG and ADR; reference level: power in MIREX 2009.]
  • 32. power results (larger is better). [Same plots.] Similar logarithmic trend, except for ADR-Fine (expected).
  • 33. power results (larger is better). [Same plots.] Only 2 significant pairs missed with 70% effort (probably unstable): same power with 70% effort!
  • 34. merely using more queries does not pay off when looking for power
  • 35. stability. Picture by Dave Hunt
  • 36. stability: % of pairwise comparisons that are conflicting. What's the effect of: number of queries, relevance judgments, effectiveness measures?
  • 37. …from the set of all 100 queries, take a 5-query subset balanced across the 10 genres.
  • 38. …take two independent 5-query subset samples.
  • 39. …evaluate with each subset: % of conflicting comparisons as a function of the number of queries, with Broad and Fine judgments.
  • 40. …52,500 cross-collection system comparisons; repeat 500 times to minimize random effects.
  • 41. …with 100 total queries we can't go beyond 50-query subsets.
  • 42. we simulate comparisons across possible collections
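A minimal sketch of the cross-collection comparison, under the same assumptions as the power sketch above: two disjoint query subsets play the role of two collections, and a conflict is counted whenever one of them finds a significant difference that the other does not confirm (the accounting of conflict types is simplified here).

```python
# Stability simulation: split the queries into two disjoint subsets (two
# simulated "collections"), evaluate every pair of systems on both, and
# flag a conflict whenever one subset finds a significant difference that
# the other does not confirm. Disjointness is why subsets cannot exceed
# 50 queries out of the 100 available.
import itertools
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)

def outcome(x, y, alpha=0.05):
    """+2/-2: x significantly better/worse than y; +1/-1: not significant."""
    _, p = wilcoxon(x, y)
    sign = 1 if x.mean() >= y.mean() else -1
    return sign * (2 if p < alpha else 1)

def conflict_rate(scores, subset_size, trials=500):
    n_systems, n_queries = scores.shape
    pairs = list(itertools.combinations(range(n_systems), 2))
    conflicting = total = 0
    for _ in range(trials):
        q = rng.permutation(n_queries)
        q1, q2 = q[:subset_size], q[subset_size:2 * subset_size]
        for a, b in pairs:
            o1 = outcome(scores[a, q1], scores[b, q1])
            o2 = outcome(scores[a, q2], scores[b, q2])
            if 2 in (abs(o1), abs(o2)):  # significance found in some collection
                total += 1
                conflicting += o1 != o2  # lack of power, minor or major conflict
    return conflicting / total if total else 0.0

scores = rng.beta(2, 5, size=(15, 100))
print(conflict_rate(scores, subset_size=50))
```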
  • 43. stability results (lower is better). [Plots: % of conflicting comparisons vs. query subset size (5 to 50), for Broad and Fine judgments, with AG, NDCG, ANDCG and ADR; reference level: stability in MIREX 2009.]
  • 44. stability results (lower is better). [Same plots.] Conflicts come from lack of power in one collection but not in the other.
  • 45. stability results (lower is better). [Same plots.] ADR takes longer to converge.
  • 46. stability results (lower is better). [Same plots.] All converge to <5% for >40 queries (consistent with α=0.05).
  • 47. merely using more queries does not pay off when looking for stability
  • 48. type of conflicts (50 queries): no major conflict whatsoever.
               measure   conflicts   A>B (power)   A<B (minor)   A<<B (major)
        Broad  AG        3.36%       100%          0%            0%
               NDCG      3.77%       99.90%        0.10%         0%
               ANDCG     4.73%       99.96%        0.04%         0%
               ADR       9.03%       99.94%        0.06%         0%
        Fine   AG        2.64%       99.86%        0.14%         0%
               NDCG      2.94%       99.74%        0.26%         0%
               ANDCG     4.03%       99.91%        0.09%         0%
               ADR       19.08%      99.50%        0.50%         0%
        virtually all conflicts due to lack of power in one collection
  • 49. if significance shows up, it most probably is correct. Are we being too conservative?
  • 50. statistics: Milton Friedman, Frank Wilcoxon, John Tukey
  • 51. compare two systems: is the difference significant? t-test, Wilcoxon test, sign test, etc. (they make different assumptions). Significance level α: the probability of a Type I error (finding a significant difference when there is none, i.e. a stability conflict). Usually α=0.05 or α=0.01: 5% or 1% of my significant results are just wrong.
  • 52. MIREX 2009: compare several systems. 15 systems = 105 pairwise comparisons; experiment-wide significance level = 1-(1-α)^105 = 0.995 (with α=0.05), so we can expect at least one significant comparison to be wrong. Instead, compare all systems at once: ANOVA, Friedman test (used in MIREX), Kruskal-Wallis, etc. (with different assumptions), and correct p-values to keep the experiment-wide significance level <0.05: Tukey's HSD, Bonferroni, Scheffé, Duncan, Newman-Keuls, etc.
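A small sketch of the arithmetic and of the omnibus-then-post-hoc approach, assuming per-query scores in a systems × queries matrix; scipy's friedmanchisquare is a standard Friedman test, and Bonferroni is used below as a simple stand-in for Tukey's HSD (which is what MIREX actually uses).

```python
# Family-wise error arithmetic, then the omnibus-first approach: a Friedman
# test across all systems, followed by post-hoc pairwise tests at a
# corrected level (Bonferroni here, as a stand-in for Tukey's HSD).
import itertools
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

alpha, n_systems = 0.05, 15
n_pairs = n_systems * (n_systems - 1) // 2   # 105 comparisons
print(1 - (1 - alpha) ** n_pairs)            # ~0.995 experiment-wide level

scores = np.random.default_rng(0).beta(2, 5, (n_systems, 100))

stat, p = friedmanchisquare(*scores)         # any differences at all?
if p < alpha:
    for a, b in itertools.combinations(range(n_systems), 2):
        _, p_ab = wilcoxon(scores[a], scores[b])
        if p_ab < alpha / n_pairs:           # corrected significance level
            print(f"system {a} >> system {b}")
```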
  • 53. more stability at the cost of less power: is it worth it?
  • 54. what a MIREX participant wants: compare my system with the other 14; comparisons between those 14 are uninteresting. Subexperiment: only 14 pairwise comparisons, not 105. Get back the power missed by considering the other 91 (it should throw out more conflicts too). The number of comparisons grows linearly with the number of systems: subexperiment-wide significance level = 1-(1-α)^14 = 0.512 (with α=0.05). Compare all systems with 1-tailed Wilcoxon tests at α=0.01: experiment-wide significance level = 1-(1-0.01)^105 = 0.652; subexperiment-wide significance level = 1-(1-0.01)^14 = 0.131.
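A minimal sketch of this alternative protocol and the quoted family-wise levels; choosing the tail from the observed means is one plausible reading of "1-tailed" here, not necessarily the paper's exact procedure.

```python
# All pairwise 1-tailed Wilcoxon tests at alpha=0.01, plus the family-wise
# significance levels quoted above.
import itertools
import numpy as np
from scipy.stats import wilcoxon

alpha = 0.01
print(1 - (1 - alpha) ** 105)  # ~0.652 experiment-wide (all 105 pairs)
print(1 - (1 - alpha) ** 14)   # ~0.131 subexperiment-wide (my system vs 14)

scores = np.random.default_rng(0).beta(2, 5, (15, 100))
for a, b in itertools.combinations(range(len(scores)), 2):
    # test whether the system with the higher mean is significantly better
    hi, lo = (a, b) if scores[a].mean() >= scores[b].mean() else (b, a)
    _, p = wilcoxon(scores[hi], scores[lo], alternative='greater')
    if p < alpha:
        print(f"system {hi} >> system {lo} (p={p:.4f})")
```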
  • 55. power results (larger is better). [Plots as before, now with Friedman+Tukey (as in MIREX) as the baseline.]
  • 56. power results (larger is better). [Same plots.] Using all 1-tailed Wilcoxon comparisons is up to 20% more powerful than Friedman+Tukey.
  • 57. power results (larger is better). [Same plots.] Same power with 50% effort!
  • 58. stability results (lower is better). [Plots as before.] Earlier convergence because of the increased power.
  • 59. stability results (lower is better). [Same plots.] AG converges again to 3-4%; (A)NDCG converge to 5-6%.
  • 60. type of conflicts (50 queries): virtually no major conflicts, within known Type III error rates.
               measure   conflicts   A>B (power)   A<B (minor)   A<<B (major)
        Broad  AG        3.68%       96.32%        3.68%         0%
               NDCG      5.05%       96.82%        3.18%         0%
               ANDCG     6.08%       96.84%        3.13%         0.03%
               ADR       5.93%       95.12%        4.88%         0%
        Fine   AG        3.32%       98.34%        1.66%         0%
               NDCG      6.58%       96.61%        3.39%         0%
               ANDCG     6.44%       94.94%        5.06%         0%
               ADR       12.48%      90.58%        9.37%         0.05%
        again, conflicts mostly due to lack of power in one collection
  • 61. effort-reliability tradeoff: virtually the same reliability with half the effort!
                        Friedman+Tukey, 100 queries      1-tailed Wilcoxon, 50 queries
               measure  power  - conflicts = stable      power  - conflicts = stable
        Broad  AG       57.14% - 3.64%  = 53.50%         55.10% - 3.68%  = 51.42%
               NDCG     57.14% - 4.08%  = 53.06%         57.01% - 5.05%  = 51.96%
               ANDCG    57.14% - 4.19%  = 52.95%         57.37% - 6.08%  = 51.29%
               ADR      56.19% - 7.13%  = 49.06%         57.30% - 5.93%  = 51.37%
        Fine   AG       54.29% - 3.20%  = 51.09%         54.31% - 3.32%  = 50.99%
               NDCG     56.19% - 3.04%  = 53.15%         57.56% - 6.58%  = 50.98%
               ANDCG    56.19% - 2.96%  = 53.23%         57.38% - 6.44%  = 50.94%
               ADR      56.19% - 19.97% = 36.22%         55.03% - 12.48% = 42.55%
  • 62. Friedman+Tukey requires too much effort
  • 63. my point?
  • 64. Do not attempt to accomplish greater results by a greater effort of your little understanding, but by a greater understanding of your little effort. – Walter Russell
  • 65. using more and more queries is pointless: too much effort for the small gain in power and stability. Using different similarity scales has little effect: using only one is probably just fine. Some effectiveness measures are better than others; they should still be used (they measure different things), but bear in mind their power and stability. Some statistical methods are better than others: virtually the same reliability with half the effort. If significance shows up, it most probably is true; at worst, conflicts are due to lack of power.
  • 66. Picture by Ronny Welter
  • 67. forget about power and worry about effect-size: eventually, significance becomes meaningless. Reduce the judging effort: more queries in Symbolic Melodic Similarity; reliable low-cost in-house evaluations and crowdsourcing. Deeper evaluation cutoffs: not just the top 5 documents, pay attention to ranking; probably more reliable, and certainly more reusable. Effect of the number of systems, especially if developed by the same research group. Other statistical methods: Multiple Comparisons with a Control (baseline). Other collections, tasks and measures.
  • 68. guide experimenters in the interpretation of the results and the tradeoff between effort and reliability