
- 1. Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval. @julian_urbano, University Carlos III of Madrid · @m_schedl, Johannes Kepler University. AdMIRe 2012, Lyon, France · April 17th. Picture by ERdi43 (Wikipedia).
- 2. Problem: evaluation of IR systems is costly. Annotations are time-consuming, expensive and boring. (Bad) consequence: small and biased test collections, unlikely to change from year to year. Solution: apply low-cost evaluation methodologies.
- 3. Nearly 2 decades of meta-evaluation in Text IR; some good practices inherited from here: Cranfield 2 (1962–1966), MEDLARS (1966–1967), SMART (1961–1995), TREC (1992–today), NTCIR (1999–today), CLEF (2000–today). In MIR: ISMIR (2000–today), MIREX (2005–today). A lot of things have happened here!
- 4. Minimal Test Collections (MTC) [Carterette et al.]: estimate the ranking of systems with very few judgments (high incompleteness). Application in Audio Music Similarity (AMS): dozens of volunteers are required by MIREX every year to make thousands of judgments.

  Year  Teams  Systems  Queries  Results  Judgments  Overlap
  2006    5      6        60      1,800     1,629      10%
  2007    8     12       100      6,000     4,832      19%
  2009    9     15       100      7,500     6,732      10%
  2010    5      8       100      4,000     2,737      32%
  2011   10     18       100      9,000     6,322      30%
- 5. evaluation with incomplete judgments
- 6. Basic idea: treat similarity scores as random variables, which can be estimated with uncertainty. The gain of an arbitrary document, Gi, follows a multinomial distribution: E[Gi] = Σ_{l∈L} P(Gi = l) · l, where L_BROAD = {0, 1, 2} and L_FINE = {0, 1, …, 100}. Whenever document i is judged: E[Gi] = l and Var(Gi) = 0. (*All variance formulas are in the paper.)
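The expected gain of an unjudged document can be sketched in a few lines of Python; the uniform prior over labels below is an assumption for illustration, not the paper's estimator:

```python
# Expected gain E[G_i] of an unjudged document under the BROAD scale.
# The prior P(G_i = l) is assumed uniform here, purely for illustration.

BROAD = [0, 1, 2]  # similarity labels in the BROAD scale

def expected_gain(prior):
    """E[G_i] = sum over labels l of P(G_i = l) * l."""
    return sum(p * l for l, p in prior.items())

uniform = {l: 1 / len(BROAD) for l in BROAD}
print(expected_gain(uniform))  # approximately 1.0 under the uniform prior
```

Once a document is judged, its label is known, so the "prior" collapses to probability 1 on the observed label and the variance drops to 0.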
- 7. AG@k is also treated as a random variable: E[AG@k] = (1/k) Σ_{i∈D} E[Gi] · I(Ai ≤ k), iterating over all documents (in practice, only the top k retrieved), where Ai is the rank at which document i was retrieved. Ultimate goal: compute a good estimate with the least effort.
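As a minimal sketch of this estimator (the document ids and gain values below are made up):

```python
# E[AG@k] for one system: average expected gain over its top-k results.
# `exp_gain` maps document id -> E[G_i]; documents without an estimate
# default to 0 here, which is an assumption of this sketch.

def expected_ag_at_k(exp_gain, ranking, k):
    """E[AG@k] = (1/k) * sum of E[G_i] for documents retrieved at rank <= k."""
    return sum(exp_gain.get(doc, 0.0) for doc in ranking[:k]) / k

exp_gain = {"d1": 2.0, "d2": 1.0, "d3": 0.5}
print(expected_ag_at_k(exp_gain, ["d1", "d2", "d3"], k=2))  # (2.0 + 1.0) / 2 = 1.5
```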
- 8. Comparing two systems: what really matters is the sign of the difference, E[ΔAG@k] = (1/k) Σ_{i∈D} E[Gi] · (I(Ai ≤ k) − I(Bi ≤ k)). Evaluating several queries: E[ΔAG@k] = (1/|Q|) Σ_{q∈Q} E[ΔAG@k | q], iterating over all queries. The rationale: if α < P(ΔAG@k ≤ 0) < 1 − α, then judge another document; else, stop judging.
- 9. Distribution of AG@k: what are the possible assignments of similarity? P(AG@k = z) := Σ_{γk∈Γk} P(AG@k = z | γk) · P(γk), iterating over all possible permutations γk of k similarity assignments; it ultimately depends on the distribution of Gi. In plain English: the ratio of similarity assignments such that AG@k = z. For complex measures or large similarity scales, run a Monte Carlo simulation.
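For the FINE scale, the Monte Carlo route mentioned above might look like this sketch, again assuming a uniform distribution over labels for illustration:

```python
import random

# Monte Carlo estimate of P(AG@k <= z) when enumerating all k-label
# permutations is infeasible (e.g. the FINE scale with 101 labels).
# Labels are drawn from an assumed uniform distribution, for illustration.

def mc_prob_ag_leq(k, labels, z, trials=50_000, seed=0):
    rng = random.Random(seed)
    hits = sum(sum(rng.choice(labels) for _ in range(k)) / k <= z
               for _ in range(trials))
    return hits / trials

fine = list(range(101))  # FINE scale: 0..100
print(mc_prob_ag_leq(k=5, labels=fine, z=50))  # roughly 0.5 by symmetry
```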
- 10. Actually, AG@k is a special case. Let G be the similarity of the top k documents for all queries; AG@k for a single query is then a sample mean: 1. take a sample of k documents, mean = X1; 2. take a sample of k documents, mean = X2; …; Q. take a sample of k documents, mean = XQ. The mean of the sample means, X̄, is the mean AG@k over all queries. By the Central Limit Theorem, as Q→∞, X̄ approximates a normal distribution regardless of the distribution of G.
- 11. AG@k is normally distributed, so use the normal cumulative distribution function Φ: P(ΔAG@k ≤ 0) = Φ(−E[ΔAG@k] / √Var(ΔAG@k)). (The slide shows density plots of AG@5 under the BROAD and FINE scales.)
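A minimal sketch of this confidence computation, using Python's standard-library normal distribution (the mean and variance values are made up):

```python
from statistics import NormalDist

# P(dAG@k <= 0) under the normal approximation justified by the CLT:
# Phi(-E[dAG@k] / sqrt(Var[dAG@k])).

def prob_diff_nonpositive(exp_diff, var_diff):
    return NormalDist().cdf(-exp_diff / var_diff ** 0.5)

# Example: estimated mean difference 0.2 with standard deviation 0.1.
p = prob_diff_nonpositive(exp_diff=0.2, var_diff=0.01)
print(p)  # Phi(-2), about 0.023: roughly 97.7% confident that A beats B
```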
- 12. Confidence in the ranking of systems as a function of the number of judgments: we can stop judging, or keep judging to be really confident, or waste our time. (Plot: confidence in the ranking, 50–100%, vs. percent of judgments, 0–100%.) What documents should we judge? Those that maximize the confidence.
- 13. The trick: documents retrieved by both systems are useless; there is no need to judge them, because whatever Gi is, it is added and then subtracted. Comparing several systems: compute a weight wi for each query-document and judge the document with the largest effect. In the original MTC, wi is the largest weight across system pairs, which reduces to the number of system pairs affected by query-document i.
- 14. wi dependent on confidence: if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight: wi = Σ_{(A,B)∈S} (1 − C_{A,B})² · |I(Ai ≤ k) − I(Bi ≤ k)|. The weight is inversely proportional to confidence, iterating over system pairs with low confidence; this gives better results than the traditional weights.
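A sketch of this confidence-weighted scheme; the data structures and all names are illustrative, not the paper's implementation:

```python
from itertools import combinations

# Confidence-dependent weight of one query-document: pairs of systems we are
# already confident about contribute little. `topk` maps each system to the
# set of documents in its top k; `conf` maps a system pair to its current
# confidence C_{A,B}. All names and values are made up for illustration.

def weight(doc, topk, conf):
    """w_i = sum over pairs (A,B) of (1 - C_{A,B})^2 * |I(A_i<=k) - I(B_i<=k)|."""
    w = 0.0
    for a, b in combinations(sorted(topk), 2):
        disagree = abs(int(doc in topk[a]) - int(doc in topk[b]))
        w += (1 - conf[(a, b)]) ** 2 * disagree
    return w

topk = {"A": {"d1", "d2"}, "B": {"d2", "d3"}}
conf = {("A", "B"): 0.6}
print(weight("d1", topk, conf))  # (1 - 0.6)^2 * 1, i.e. about 0.16
```

Note that a document retrieved by both systems (like "d2" here) gets weight 0, matching the trick on the previous slide.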
- 15. MTC for AMS with AG@k
- 16. MTC for ΔAG@k:

  while (1/|S|) Σ_{(A,B)∈S} C_{A,B} ≤ 1 − α do      ▹ average confidence on the ranking
      i* ← argmax_i wi                              ▹ select the best document from all unjudged query-documents
      judge query-document i* (obtain true gain_i*)
      E[G_i*] ← gain_i*; Var(G_i*) ← 0              ▹ update (increase confidence)
  end while
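The judging loop can be sketched as follows; `judge`, `pairwise_confidences` and `weights` stand in for the components described on the earlier slides and are assumptions of this sketch:

```python
# Control flow of the MTC judging loop: keep judging the highest-weight
# query-document until the average pairwise confidence exceeds 1 - alpha.

def mtc_loop(unjudged, exp_gain, var_gain, judge,
             pairwise_confidences, weights, alpha=0.05):
    while True:
        confs = pairwise_confidences(exp_gain, var_gain)
        if sum(confs.values()) / len(confs) > 1 - alpha:
            break                          # confident enough: stop judging
        w = weights(unjudged, confs)
        best = max(unjudged, key=w.get)    # i* = argmax_i w_i
        exp_gain[best] = judge(best)       # obtain the true gain of i*
        var_gain[best] = 0.0               # judged: no uncertainty left
        unjudged.remove(best)
```

With stub components, the loop stops as soon as the average confidence crosses 1 − α, leaving the remaining documents unjudged.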
- 17. MTC in MIREX AMS 2011
- 18. Why MIREX 2011? Largest edition so far: 18 systems (153 pairwise comparisons), 100 queries and 6,322 judgments. Distribution of Gi: let us work with a uniform distribution for now.
- 19–21. Confidence as judgments are made; correct bins: the estimated sign is correct, or not significant anyway.
- 22. high confidence with considerably less effort
- 23. Accuracy as judgments are made: estimated bins are always better than expected.
- 24. Accuracy as judgments are made: estimated signs are highly correlated with confidence.
- 25. Accuracy as judgments are made: rankings with τ = 0.9 are traditionally considered equivalent (same as 95% accuracy).
- 26. high confidence and high accuracy with considerably less effort
- 27. Statistical significance: MTC allows us to accurately estimate the ranking, but only for the current set of queries. Can we generalize to a general set of queries? Not trivial: we have the variance of the estimates, but not the sample variance.
- 28–30. Work with upper and lower bounds of ΔAG@k. Upper bound: best case for A; lower bound: best case for B. With π the judged documents and π̄ the unjudged ones:

  ΔAG@k⁺ = (1/k) Σ_{i∈π} Gi · (I(Ai ≤ k) − I(Bi ≤ k))    ▹ known judgments
         + (1/k) Σ_{i∈π̄} l⁺ · I(Ai ≤ k)                  ▹ unknown judgments retrieved by A: best similarity score
         − (1/k) Σ_{i∈π̄} l⁻ · I(Bi ≤ k ∧ Ai > k)         ▹ unknown judgments retrieved by B but not by A: worst similarity score

  (*The same reasoning gives the lower bound.)
- 31. 3 rules: 1. Assume the best case for A (upper bound); if A <<< B, then conclude A <<< B. 2. Assume the best case for B (lower bound); if B <<< A, then conclude B <<< A. 3. If in the best case for A we do not have A >>> B, and in the best case for B we do not have B >>> A, then conclude they are not significantly different. Problem: the upper and lower bounds are very unrealistic.
- 32. Incorporate a heuristic: 4. If the estimated difference is larger than t, naively conclude significance. Choose t based on power analysis: t = the effect size detectable by a t-test with sample variance σ² = 0.0615 (from previous MIREX editions), sample size n = 100, Type I error rate α = 0.05 and Type II error rate β = 0.15 (typical values): t ≈ 0.067.
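One plausible reading of this power analysis, using a normal approximation to the one-sided t-test (the exact computation behind the slide may use t quantiles instead):

```python
from statistics import NormalDist

# Smallest effect detectable with the quoted parameters, approximated as
# delta = (z_{1-alpha} + z_{1-beta}) * sqrt(sigma^2 / n).
# This normal-approximation formula is an assumption of this sketch.

def detectable_effect(sigma2, n, alpha=0.05, beta=0.15):
    z = NormalDist().inv_cdf
    return (z(1 - alpha) + z(1 - beta)) * (sigma2 / n) ** 0.5

t = detectable_effect(sigma2=0.0615, n=100)
print(round(t, 3))  # close to the slide's t = 0.067
```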
- 33. Accuracy of the significance estimates: pretty good around the 95% confidence mark, though rule 4 (the heuristic) ends up overestimating significance.
- 34. Accuracy of the significance estimates: rules 1 to 3 begin to apply and correct the overestimations of rule 4.
- 35. Accuracy of the significance estimates: closer to expected, never under 90%.
- 36. significance can be estimated fairly well too
- 37. what we did
- 38. Introduce MTC to the MIR folks. Work out the math for MTC with AG@k. See how well it would have done in AMS 2011: quite well, actually!
- 39. what now
- 40. Learn the true distribution of similarity judgments: it's clearly not uniform, and a better model would give more accurate estimates with less effort; use previous AMS data, or fit a model as we judge. Significance testing with incomplete judgments: best-case scenarios are very unrealistic. Study low-cost methodologies for other MIR tasks.
- 41. what for
- 42. MTC greatly reduces the effort for AMS (and SMS): have MIREX volunteers incrementally create brand new test collections for other tasks. Better yet, study low-cost methodologies for the other tasks. Not only for MIREX: with private collections for in-house evaluations there is no possibility of gathering large pools of annotators, so low cost becomes paramount.
- 43. the MIR community needs a paradigm shift, from a priori to a posteriori evaluation methods, to reduce cost and gain reliability
