How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval
 

Presentation Transcript

• How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval. @julian_urbano (University Carlos III of Madrid), J. Stephen Downie (University of Illinois at Urbana-Champaign), Brian McFee (University of California at San Diego), Markus Schedl (Johannes Kepler University Linz). ISMIR 2012, Porto, Portugal, October 9th. Picture by Humberto Santos.
    • let’s review two papers
• paper A: +0.14* (statistically significant) vs. paper B: +0.21 … which one should get published? a.k.a. which research line should we follow?
• paper A: +0.14* vs. paper B: +0.14* … which one should get published? a.k.a. which research line should we follow?
• Goal of Comparing Systems: find out the effectiveness difference 𝒅 (for an arbitrary query and an arbitrary user). Impossible! It requires running the systems over the universe of all queries. [Plot: 𝑑 on the Δeffectiveness axis, from -1 to 1]
• …what Evaluations can do: estimate 𝒅 with the average 𝑑̄ over a sample of queries 𝓠. [Plot: sample estimates of 𝑑 on the Δeffectiveness axis] There is always random error… so we need a measure of confidence.
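To make the estimation concrete, a minimal Python sketch (all scores and variable names are hypothetical, made up for illustration):

```python
import numpy as np

# Hypothetical effectiveness scores of systems A and B over a
# sample of queries Q (all numbers are made up for illustration).
scores_a = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59])
scores_b = np.array([0.51, 0.53, 0.60, 0.50, 0.58, 0.49])

# The average per-query difference estimates d; a different query
# sample would give a (slightly) different estimate.
d_bar = (scores_a - scores_b).mean()
print(f"estimated d = {d_bar:+.3f}")
```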
• The Significance Drill: test the hypotheses H₀: 𝑑 = 0 vs. H₁: 𝑑 ≠ 0.
• Result of the test: p-value = P(𝑑̄ | H₀). Interpretation of the test: if the p-value is very small, reject H₀; otherwise, accept H₀.
• We accept/reject H₀ (based on the p-value and α)… not the test!
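In code, the drill amounts to a single call; a sketch using SciPy's one-sample t-test on hypothetical per-query differences (the numbers are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical per-query differences between systems A and B.
diffs = np.array([0.11, 0.02, 0.11, -0.02, 0.08, 0.10])

# Two-sided test of H0: d = 0 against H1: d != 0.
t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
# Very small p-value: reject H0; otherwise: accept H0 (given some alpha).
# Note that the p-value says nothing about how large d actually is.
```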
• Usual (wrong) conclusions: A is substantially better than B; A is much better than B; the difference is important; the difference is significant.
• What does it mean? That there is a difference (unlikely due to chance/random error). We don't need fancy statistics… we already know they are different!
• H₀: 𝒅 = 0 is false by definition, because systems A and B are different to begin with.
• What is really important? The effect size: the magnitude of 𝑑. This is what predicts user satisfaction, not p-values. 𝒅 = +0.6 is a huge improvement; 𝒅 = +0.0001 is irrelevant… and yet it can easily be statistically significant.
• Example: t-test. 𝑡 = 𝑑̄ · √|𝓠| / 𝑠𝑑: the larger the statistic 𝑡, the smaller the p-value. How to achieve statistical significance? a) Reduce variance. b) Further improve the system. c) Evaluate with more queries!
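Plugging illustrative (made-up) numbers into 𝑡 = 𝑑̄ · √|𝓠| / 𝑠𝑑 shows the three levers at work; the helper name below is mine:

```python
import math

def t_statistic(d_bar, s_d, n_queries):
    """Paired t statistic: t = d_bar * sqrt(|Q|) / s_d."""
    return d_bar * math.sqrt(n_queries) / s_d

print(t_statistic(0.05, 0.20, 25))   # baseline:               t = 1.25
print(t_statistic(0.05, 0.10, 25))   # a) reduce variance:     t = 2.50
print(t_statistic(0.10, 0.20, 25))   # b) improve the system:  t = 2.50
print(t_statistic(0.05, 0.20, 100))  # c) more queries:        t = 2.50
```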
• Statistical significance is eventually meaningless… all you have to do is use enough queries.
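A quick simulation makes the point (the effect and noise level are made up): a negligible true improvement fails the test on small query sets but comes out "significant" once the query set is large enough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A negligible true improvement (d = 0.001) buried in per-query noise.
for n_queries in (50, 5_000, 500_000):
    diffs = rng.normal(loc=0.001, scale=0.1, size=n_queries)
    _, p = stats.ttest_1samp(diffs, popmean=0.0)
    print(f"|Q| = {n_queries:>7}: p = {p:.4f}")
# With enough queries, even d = 0.001 becomes statistically significant.
```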
• Practical significance is the effect size 𝑑: it speaks to effectiveness / satisfaction. Statistical significance is the p-value: it speaks to confidence. An improvement may be statistically significant, but that doesn't mean it's important!
• the real importance of an improvement
• Purpose of Evaluation: How good is my system? (effectiveness, from 0 to 1) Is system A better than system B? (Δeffectiveness, from -1 to 1) We measure system effectiveness.
• Assumption: system effectiveness corresponds to user satisfaction. [Plots: user satisfaction vs. system effectiveness] This is our ultimate goal! Does it? How well?
• How we measure System Effectiveness. Similarity scale (we normalize to [0, 1]): Broad (0, 1 or 2) or Fine (0, 1, 2, …, 100). Effectiveness measure: AG@5 (ignores the ranking) or nDCG@5 (discounts by rank). What correlates better with user satisfaction?
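A sketch of the two measures as described on the slide: AG@5 as the plain average of the five normalized gains, and nDCG@5 with the common log2 rank discount (the exact discount and normalization used in the paper may differ; function names are mine):

```python
import numpy as np

def ag_at_5(gains):
    """Average Gain at 5: mean of the normalized gains, rank ignored."""
    return float(np.mean(gains[:5]))

def ndcg_at_5(gains):
    """nDCG at 5 with the usual log2 rank discount (one common variant)."""
    g = np.asarray(gains[:5], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(g) + 2))  # ranks 1..5
    dcg = np.sum(g * discounts)
    idcg = np.sum(np.sort(g)[::-1] * discounts)  # gains in ideal order
    return float(dcg / idcg) if idcg > 0 else 0.0

# Fine-scale judgments (0..100) normalized to [0, 1]; a made-up example.
gains = np.array([80, 100, 40, 0, 60]) / 100.0
print(ag_at_5(gains), ndcg_at_5(gains))
```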
• Experiment. [Diagram, annotated step by step: known effectiveness, user preference, non-preference]
• What can we infer? Preference (difference noticed by the user): positive means the user agrees with the evaluation; negative means the user disagrees with the evaluation. Non-preference (difference not noticed by the user): good means both systems are satisfying; bad means both systems are unsatisfying.
• Data: clips and similarity judgments from MIREX 2011 Audio Music Similarity. Random and artificial examples: queries selected randomly; system outputs are random lists of 5 documents. 2200 examples for 73 unique queries; 2869 unique lists with 3031 unique clips; balanced and complete design.
• Subjects: crowdsourcing from a worker pool: cheap, fast and… a diverse pool of subjects. 2200 examples at $0.03 per example; quality control via trap examples (known answers).
• Results: 6895 total answers, from 881 workers in 62 countries. 3393 accepted answers (41%), from 100 workers (87% rejected!). 95% average quality when accepted.
• How good is my system? 884 non-preferences (40%). What do we expect? A linear mapping between effectiveness and satisfaction. What do we have? [Plots] There is room for ~20% improvement with personalization.
• Is system A better than B? 1316 preferences (60%). What do we expect? Users always notice the difference… regardless of how large it is. What do we have? [Plots] Differences of >0.3 and >0.4 are needed for >50% of users to agree. The Fine scale is closer to the ideal 100%. Do users prefer the (supposedly) worse system?
    • Statistical Significance has nothing to do with this
    • Picture by Ronny Welter
• Reporting Results, confidence intervals / variance: report 0.584 ± .023 instead of just 0.584. An indicator of evaluation error, and a better understanding of expected user satisfaction.
• Reporting Results, actual p-values: report +0.037 ± .031 (p=0.02) instead of +0.037 ± .031 *. Statistical significance is relative: α=0.05 and α=0.01 are completely arbitrary; it depends on context, the cost of Type I errors and implementation, etc.
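A sketch of such a full report, computing the mean difference, a 95% confidence interval and the actual p-value over hypothetical per-query differences (all numbers made up):

```python
import numpy as np
from scipy import stats

# Hypothetical per-query differences between two systems.
diffs = np.array([0.08, 0.01, 0.06, -0.03, 0.07, 0.05, 0.02, 0.04])

d_bar = diffs.mean()
sem = stats.sem(diffs)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(diffs) - 1, loc=d_bar, scale=sem)
_, p = stats.ttest_1samp(diffs, popmean=0.0)

# Report the effect size, the interval and the actual p-value,
# not just an asterisk.
print(f"+{d_bar:.3f} ± {d_bar - lo:.3f} (p = {p:.3f})")
```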
    • let’s review two papers (again)
• paper A: +0.14* vs. paper B: +0.21 … which one should get published? a.k.a. which research line should we follow?
• paper A (500 queries): +0.14 ± 0.03 (p=0.048) vs. paper B (50 queries): +0.21 ± 0.02 (p=0.052) … which one should get published? a.k.a. which research line should we follow?
• paper A: +0.14* vs. paper B: +0.14* … which one should get published? a.k.a. which research line should we follow?
• paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004) vs. paper B (cost=$50): +0.14 ± 0.03 (p=0.043) … which one should get published? a.k.a. which research line should we follow?
• Effect sizes are indicators of user satisfaction: we need to personalize results, and small differences are not noticed. p-values are indicators of confidence: beware of the collection size, and provide full reports.
    • The difference between “Significant” and “Not Significant” is not itself statistically significant ― A. Gelman & H. Stern