Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval


Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Making these annotations is quite complex and tedious for many Music Information Retrieval tasks, so performing such evaluations requires too much effort. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer quite reliable results while significantly reducing the annotation effort. The idea is to incrementally select what documents to judge so that we can compute estimates of the effectiveness differences between systems with a certain degree of confidence. In this paper we show a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% to obtain results with 95% confidence.


  1. Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval. @julian_urbano (University Carlos III of Madrid), @m_schedl (Johannes Kepler University). AdMIRe 2012, Lyon, France, April 17th. Picture by ERdi43 (Wikipedia).
  2. Problem: evaluation of IR systems is costly, and annotations are time-consuming, expensive, and boring. (Bad) consequence: small and biased test collections, unlikely to change from year to year. Solution: apply low-cost evaluation methodologies.
  3. Nearly 2 decades of meta-evaluation in Text IR; some good practices are inherited from here: Cranfield 2 (1962-1966), MEDLARS (1966-1967), SMART (1961-1995), TREC (1992-today), NTCIR (1999-today), CLEF (2000-today). In MIR: ISMIR (2000-today), MIREX (2005-today). A lot of things have happened here!
  4. Minimal Test Collections (MTC) [Carterette et al.]: estimate the ranking of systems with very few judgments (high incompleteness). Application in Audio Music Similarity (AMS): dozens of volunteers are required by MIREX every year to make thousands of judgments.

     Year | Teams | Systems | Queries | Results | Judgments | Overlap
     2006 |     5 |       6 |      60 |   1,800 |     1,629 |     10%
     2007 |     8 |      12 |     100 |   6,000 |     4,832 |     19%
     2009 |     9 |      15 |     100 |   7,500 |     6,732 |     10%
     2010 |     5 |       8 |     100 |   4,000 |     2,737 |     32%
     2011 |    10 |      18 |     100 |   9,000 |     6,322 |     30%
  5. Evaluation with incomplete judgments
  6. Basic idea: treat similarity scores as random variables that can be estimated with uncertainty. The gain of an arbitrary document is multinomial, Gᵢ ~ multinomial, with E[Gᵢ] = Σ_{l∈ℒ} P(Gᵢ = l) · l, where ℒ_BROAD = {0, 1, 2} and ℒ_FINE = {0, 1, …, 100}. Whenever document i is judged with label l: E[Gᵢ] = l and Var[Gᵢ] = 0. (*All variance formulas are in the paper.)
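The expected gain on this slide can be sketched in a few lines of Python. This is an illustration, not code from the paper; the uniform prior over the BROAD scale is the simplifying assumption the talk adopts later.

```python
# Treat the similarity score of an unjudged document as a multinomial
# random variable over the judgment scale and compute its moments.

def expected_gain(prior):
    """E[Gi] = sum over labels l of P(Gi = l) * l."""
    return sum(p * l for l, p in prior.items())

def variance_gain(prior):
    """Var[Gi] = E[Gi^2] - E[Gi]^2 (zero once the document is judged)."""
    mean = expected_gain(prior)
    return sum(p * l * l for l, p in prior.items()) - mean ** 2

BROAD = [0, 1, 2]  # not similar / somewhat similar / very similar
uniform_broad = {l: 1 / len(BROAD) for l in BROAD}

print(round(expected_gain(uniform_broad), 6))   # 1.0
print(round(variance_gain(uniform_broad), 6))   # 0.666667
```

Once a document is judged, its expected gain becomes the true label and its variance drops to zero, which is how uncertainty shrinks as judgments accumulate.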
  7. AG@k is also treated as a random variable: E[AG@k] = (1/k) Σ_{i∈𝒟} E[Gᵢ] · I(Aᵢ ≤ k), iterating over all documents (in practice, only the top k retrieved), where Aᵢ is the rank at which document i was retrieved. Ultimate goal: compute a good estimate with the least effort.
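With expected gains in hand, the AG@k estimate is just their average over a system's top k. A minimal sketch; the ranking and gain values below are hypothetical:

```python
# E[AG@k] = (1/k) * sum of E[Gi] over the documents a system retrieved
# in its top k; judged documents contribute their true label, unjudged
# ones their prior mean.

def estimated_ag_at_k(ranking, expected_gains, k):
    return sum(expected_gains[doc] for doc in ranking[:k]) / k

# d1, d3, d5 are judged (labels 2, 0, 2); d2 and d4 are unjudged and
# carry the uniform BROAD prior mean of 1.0
expected_gains = {"d1": 2, "d2": 1.0, "d3": 0, "d4": 1.0, "d5": 2}
ranking_A = ["d1", "d2", "d3", "d4", "d5"]

print(estimated_ag_at_k(ranking_A, expected_gains, k=5))  # 1.2
```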
  8. Comparing two systems: what really matters is the sign of the difference, E[ΔAG@k] = (1/k) Σ_{i∈𝒟} E[Gᵢ] · (I(Aᵢ ≤ k) − I(Bᵢ ≤ k)). Evaluating several queries: E[ΔAG@k] = (1/|𝒬|) Σ_{q∈𝒬} E[ΔAG@k_q], iterating over all queries. The rationale: if α < P(ΔAG@k ≤ 0) < 1 − α, then judge another document; else stop judging.
  9. Distribution of AG@k: what are the possible assignments of similarity? P(AG@k = z) = Σ_{γₖ∈Γₖ} P(AG@k = z | γₖ) · P(γₖ), iterating over all possible permutations of k similarity assignments; this ultimately depends on the distribution of Gᵢ. In plain English: the ratio of similarity assignments such that AG@k = z. For complex measures or large similarity scales, run a Monte Carlo simulation.
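For the FINE scale, enumerating every similarity assignment is infeasible, which is why the slide suggests Monte Carlo. A sketch under the uniform BROAD prior; the query setup (3 judged and 2 unjudged documents in the top 5) is hypothetical:

```python
import random

# Sample the unjudged gains from their prior many times and record the
# resulting AG@k; the samples approximate the distribution P(AG@k = z).

def sample_ag_at_k(judged_gains, n_unjudged, labels, k, trials=50_000, seed=7):
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        unjudged = [rng.choice(labels) for _ in range(n_unjudged)]
        samples.append((sum(judged_gains) + sum(unjudged)) / k)
    return samples

# top 5 of some query: judged gains 2, 1, 0, plus 2 unjudged documents
samples = sample_ag_at_k([2, 1, 0], n_unjudged=2, labels=[0, 1, 2], k=5)
print(round(sum(samples) / len(samples), 2))  # close to E[AG@5] = 1.0
```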
  10. Actually, AG@k is a special case. Let G be the similarity of the top k for all queries; AG@k for a single query is a sample mean: (1) take a sample of k documents, mean = X₁; (2) take a sample of k documents, mean = X₂; …; (Q) take a sample of k documents, mean = X_Q. The mean of the sample means, X̄, is the mean AG@k over all queries. By the Central Limit Theorem, as Q→∞, X̄ approximates a normal distribution regardless of the distribution of G.
  11. AG@k is normally distributed, so use the normal cumulative distribution function: P(ΔAG@k ≤ 0) = Φ(−E[ΔAG@k] / √Var[ΔAG@k]). (Plots: density of AG@5 under the BROAD and FINE scales.)
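Under the normal approximation, the confidence computation is a one-liner with the standard normal CDF. A sketch with made-up moments (0.08 and 0.002 are illustrative values, not MIREX numbers):

```python
from math import erf, sqrt

# P(ΔAG@k <= 0) = Φ(−E[ΔAG@k] / sqrt(Var[ΔAG@k]))

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_delta_leq_zero(e_delta, var_delta):
    return normal_cdf(-e_delta / sqrt(var_delta))

p = prob_delta_leq_zero(e_delta=0.08, var_delta=0.002)
print(round(p, 3))  # about 0.037: roughly 96% confidence that A > B
```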
  12. Confidence as a function of the number of judgments (plot: confidence in the ranking of systems vs. percent of judgments): we can stop judging when we are really confident, or keep judging, or waste our time. What documents should we judge? Those that maximize the confidence.
  13. The trick: documents retrieved by both systems are useless, so there is no need to judge them; whatever Gᵢ is, it is added and then subtracted. Comparing several systems: compute a weight wᵢ for each query-document and judge the document with the largest effect. In the original MTC, wᵢ is the largest weight across system pairs, which reduces to the number of system pairs affected by query-document i.
  14. wᵢ dependent on confidence: if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight: wᵢ = Σ_{(A,B)∈𝒮∖ℛ} (1 − C_{A,B}) · (I(Aᵢ ≤ k) − I(Bᵢ ≤ k))², iterating over system pairs with low confidence; the weight is inversely proportional to confidence. This gives better results than the traditional weights.
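The confidence-dependent weight can be sketched as follows. The rankings and pairwise confidences are hypothetical, and the (I(Aᵢ ≤ k) − I(Bᵢ ≤ k))² factor is implemented as "retrieved in the top k by exactly one system of the pair":

```python
from itertools import combinations

# w_i sums (1 - C_AB) over the not-yet-confident system pairs that
# document i affects, i.e. pairs where exactly one system has i in its top k.

def weight(doc, topk, confidence):
    """topk: system -> set of top-k docs; confidence: (A, B) -> C_AB in [0, 1]."""
    w = 0.0
    for A, B in combinations(sorted(topk), 2):
        if (doc in topk[A]) != (doc in topk[B]):   # (I_A - I_B)^2 == 1
            w += 1.0 - confidence[(A, B)]
    return w

topk = {"S1": {"d1", "d2"}, "S2": {"d2", "d3"}, "S3": {"d1", "d3"}}
confidence = {("S1", "S2"): 0.9, ("S1", "S3"): 0.5, ("S2", "S3"): 0.5}

# d1 affects pairs (S1,S2) and (S2,S3): weight (1-0.9) + (1-0.5)
print(round(weight("d1", topk, confidence), 6))  # 0.6
```

Judging d1 contributes little to the nearly-decided pair (S1, S2) and much more to the still-uncertain (S2, S3), which is exactly the point of the confidence term.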
  15. MTC for AMS with AG@k
  16. MTC for ΔAG@k:

     while (1/|𝒮|) Σ_{(A,B)∈𝒮} C_{A,B} ≤ 1 − α do      (average confidence on the ranking)
         i* ← argmaxᵢ wᵢ                                 (select the best document among all unjudged query-documents)
         judge query-document i* (obtain the true gainᵢ*)
         E[Gᵢ*] ← gainᵢ*; Var[Gᵢ*] ← 0                   (update; confidence increases)
     end while
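The loop on this slide can be sketched directly. Everything below is a toy stand-in: the weight, judging, and confidence functions would come from the machinery on the previous slides.

```python
# Judge the highest-weight unjudged query-document until the average
# pairwise confidence in the ranking exceeds 1 - alpha.

def mtc_loop(unjudged, weight, judge, update, avg_confidence, alpha=0.05):
    while unjudged and avg_confidence() <= 1 - alpha:
        best = max(unjudged, key=weight)    # i* = argmax_i w_i
        update(best, judge(best))           # E[G_i*] <- true gain, Var <- 0
        unjudged.discard(best)
    return unjudged                         # never judged: effort saved

# toy demo: confidence grows by 0.06 with every judgment made
state = {"judged": 0}
def update(i, gain):
    state["judged"] += 1

left = mtc_loop(
    unjudged={1, 2, 3, 4, 5, 6},
    weight=lambda i: i,                     # pretend the doc id is its weight
    judge=lambda i: 1,                      # pretend every doc has gain 1
    update=update,
    avg_confidence=lambda: 0.8 + 0.06 * state["judged"],
)
print(sorted(left))  # [1, 2, 3]: three judgments were enough
```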
  17. MTC in MIREX AMS 2011
  18. Why MIREX 2011? The largest edition so far: 18 systems (153 pairwise comparisons), 100 queries and 6,322 judgments. Distribution of Gᵢ: let us work with a uniform distribution for now.
  19. Confidence as judgments are made. Correct bins: the estimated sign is correct, or the difference is not significant anyway.
  22. High confidence with considerably less effort
  23. Accuracy as judgments are made: estimated bins are always better than expected.
  24. Accuracy as judgments are made: estimated signs are highly correlated with confidence.
  25. Accuracy as judgments are made: rankings with τ = 0.9 are traditionally considered equivalent (same as 95% accuracy).
  26. High confidence and high accuracy with considerably less effort
  27. Statistical significance: MTC allows us to accurately estimate the ranking, but only for the current set of queries. Can we generalize to a broader population of queries? Not trivial: we have the variance of the estimates, but not the sample variance.
  28. Work with upper and lower bounds of ΔAG@k. Upper bound: best case for A; lower bound: best case for B.

     ΔAG@k⁺ = (1/k) Σ_{i∈π} Gᵢ · (I(Aᵢ ≤ k) − I(Bᵢ ≤ k))      (known judgments)
            + (1/k) Σ_{i∈π̄} l⁺ · I(Aᵢ ≤ k)                    (unknown judgments retrieved by A: best similarity score l⁺)
            − (1/k) Σ_{i∈π̄} l⁻ · I(Bᵢ ≤ k ∧ Aᵢ > k)           (unknown judgments retrieved by B but not by A: worst similarity score l⁻)

     *The lower bound is analogous, with the roles of A and B swapped.
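The upper bound can be computed mechanically from the rankings and the judged gains. A sketch with toy rankings on the BROAD scale, so l⁺ = 2 and l⁻ = 0; the lower bound would mirror it with A and B swapped:

```python
# Upper bound on ΔAG@k, i.e. the best case for system A: judged
# documents keep their true gains; unjudged documents retrieved by A
# get the best label l+; unjudged documents retrieved by B but not by A
# get the worst label l-.

def delta_ag_upper(gains, rank_A, rank_B, k, l_best=2, l_worst=0):
    """gains: doc -> judged gain; rank_X: doc -> rank in system X's list."""
    total = 0.0
    for doc in set(rank_A) | set(rank_B):
        in_A = rank_A.get(doc, k + 1) <= k
        in_B = rank_B.get(doc, k + 1) <= k
        if doc in gains:                      # known judgment
            total += gains[doc] * (in_A - in_B)
        elif in_A:                            # unknown, retrieved by A
            total += l_best
        elif in_B:                            # unknown, retrieved only by B
            total -= l_worst
    return total / k

rank_A = {"d1": 1, "d2": 2}
rank_B = {"d2": 1, "d3": 2}
print(delta_ag_upper({"d2": 1}, rank_A, rank_B, k=2))  # 1.0
```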
  31. Three rules: (1) Assume the best case for A (upper bound); if still A <<< B, then conclude A <<< B. (2) Assume the best case for B (lower bound); if still B <<< A, then conclude B <<< A. (3) If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, then conclude they are not significantly different. Problem: the upper and lower bounds are very unrealistic.
  32. Incorporate a heuristic: (4) If the estimated difference is larger than t, naively conclude significance. Choose t based on power analysis: t is the effect size detectable by a t-test with sample variance σ² = 0.0615 and sample size n = 100 (from previous MIREX editions), Type I error rate α = 0.05 and Type II error rate β = 0.15 (typical values): t ≈ 0.067.
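The choice of t can be reproduced with a standard power-analysis formula. A sketch using the normal approximation; the slide's exact t ≈ 0.067 presumably comes from the slightly wider t-distribution quantiles, and the normal version lands just below it:

```python
from math import sqrt
from statistics import NormalDist

# Smallest difference detectable with Type I error alpha (one-sided) and
# Type II error beta: d = (z_{1-alpha} + z_{1-beta}) * sqrt(sigma^2 / n).

def detectable_effect(variance, n, alpha=0.05, beta=0.15):
    z = NormalDist().inv_cdf
    return (z(1 - alpha) + z(1 - beta)) * sqrt(variance / n)

t = detectable_effect(variance=0.0615, n=100)
print(round(t, 3))  # 0.066, close to the slide's t ≈ 0.067
```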
  33. Accuracy of the significance estimates: pretty good around 95% confidence, but rule 4 (the heuristic) ends up overestimating significance.
  34. Accuracy of the significance estimates: rules 1 to 3 begin to apply and correct the overestimations of rule 4 (the heuristic).
  35. Accuracy of the significance estimates: closer to expected, and never under 90%.
  36. Significance can be estimated fairly well too
  37. what we did
  38. Introduce MTC to the MIR folks. Work out the math for MTC with AG@k. See how well it would have done in AMS 2011: quite well, actually!
  39. what now
  40. Learn the true distribution of similarity judgments: it's clearly not uniform, and a better prior would give more accurate estimates with less effort; use previous AMS data, or fit a model as we judge. Significance testing with incomplete judgments: the best-case scenarios are very unrealistic. Study low-cost methodologies for other MIR tasks.
  41. what for
  42. MTC greatly reduces the effort for AMS (and SMS): have MIREX volunteers incrementally create brand new test collections for other tasks. Better yet: study low-cost methodologies for those other tasks. Not only for MIREX: in private collections for in-house evaluations, with no possibility of gathering large pools of annotators, low cost becomes paramount.
  43. The MIR community needs a paradigm shift: from a priori to a posteriori evaluation methods, to reduce cost and gain reliability.
