How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

  1. How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval. @julian_urbano (University Carlos III of Madrid), J. Stephen Downie (University of Illinois at Urbana-Champaign), Brian McFee (University of California at San Diego), Markus Schedl (Johannes Kepler University Linz). ISMIR 2012, Porto, Portugal, October 9th. Picture by Humberto Santos.
  2. let’s review two papers
  3. paper A: +0.14* (statistically significant) · paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
  4. paper A: +0.14* (statistically significant) · paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
  5. paper A: +0.14* · paper B: +0.14* …which one should get published? a.k.a. which research line should we follow?
  6. paper A: +0.14* · paper B: +0.14* …which one should get published? a.k.a. which research line should we follow?
  7. Goal of Comparing Systems… Find out the effectiveness difference 𝒅 (arbitrary query and arbitrary user). Impossible! It requires running the systems for the universe of all queries. [plot: Δeffectiveness axis from −1 to 1]
  8. …what Evaluations can do: Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. [plot: Δeffectiveness axis from −1 to 1]
  9. …what Evaluations can do: Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. [plot: Δeffectiveness axis from −1 to 1]
  10. …what Evaluations can do: Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. [plot: Δeffectiveness axis from −1 to 1]
  11. …what Evaluations can do: Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. [plot: Δeffectiveness axis from −1 to 1]
  12. …what Evaluations can do: Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. There is always random error… so we need a measure of confidence.
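A minimal sketch of what slides 7–12 describe, on made-up per-query effectiveness scores (everything here is hypothetical, not the MIREX data): the true difference 𝒅 over all possible queries is unknowable, so we estimate it with the sample average 𝑑, which carries random error from the particular query sample drawn.

```python
import numpy as np

rng = np.random.default_rng(42)
n_queries = 50

# Hypothetical per-query effectiveness scores for systems A and B
# (in a real evaluation these would be, e.g., nDCG@5 per query).
scores_a = rng.uniform(0.3, 0.9, size=n_queries)
scores_b = np.clip(scores_a - rng.normal(0.05, 0.15, size=n_queries), 0.0, 1.0)

# Per-query differences and their average: the estimate of the true d.
d = scores_a - scores_b
print(f"estimated difference = {d.mean():+.3f}")

# Drawing a different query sample yields a different estimate:
# that spread is the random error the next slides worry about.
```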
  13. The Significance Drill. Test these hypotheses: H0: 𝒅 = 0 vs. H1: 𝒅 ≠ 0.
  14. The Significance Drill. Test these hypotheses: H0: 𝒅 = 0 vs. H1: 𝒅 ≠ 0. Result of the test: p-value = P(observing 𝑑 | H0). Interpretation of the test: if the p-value is very small, reject H0; otherwise, accept H0.
  15. The Significance Drill. Test these hypotheses: H0: 𝒅 = 0 vs. H1: 𝒅 ≠ 0. We accept/reject H0 (based on the p-value and α)… not the test!
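In practice, the drill on slides 13–15 is a paired test over the per-query differences. A sketch with SciPy, again on made-up numbers (a one-sample t-test of the differences against zero is equivalent to the paired t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d = rng.normal(0.05, 0.15, size=50)  # hypothetical per-query differences

# H0: d = 0 vs. H1: d != 0, tested on the per-query differences.
t_stat, p_value = stats.ttest_1samp(d, 0.0)
print(f"mean = {d.mean():+.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")

# The drill: compare the p-value against a pre-chosen alpha.
alpha = 0.05
print("reject H0" if p_value < alpha else "accept H0")
```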
  16. Usual (wrong) conclusions: A is substantially better than B. A is much better than B. The difference is important. The difference is significant.
  17. What does it mean? That there is a difference (unlikely due to chance/random error).
  18. What does it mean? That there is a difference (unlikely due to chance/random error). We don’t need fancy statistics… we already know they are different!
  19. H0: 𝒅 = 0 is false by definition, because systems A and B are different to begin with.
  20. What is really important? The effect-size: the magnitude of 𝑑. This is what predicts user satisfaction, not p-values.
  21. What is really important? The effect-size: the magnitude of 𝑑. This is what predicts user satisfaction, not p-values. 𝒅 = +0.6 is a huge improvement; 𝒅 = +0.0001 is irrelevant… and yet, it can easily be statistically significant.
  22. Example: t-test. 𝑡 = 𝑑·√|𝓠| / 𝑠𝑑. The larger the statistic 𝑡, the smaller the p-value. How to achieve statistical significance?
  23. Example: t-test. 𝑡 = 𝑑·√|𝓠| / 𝑠𝑑. The larger the statistic 𝑡, the smaller the p-value. How to achieve statistical significance? a) Reduce variance.
  24. Example: t-test. 𝑡 = 𝑑·√|𝓠| / 𝑠𝑑. The larger the statistic 𝑡, the smaller the p-value. How to achieve statistical significance? a) Reduce variance. b) Further improve the system.
  25. Example: t-test. 𝑡 = 𝑑·√|𝓠| / 𝑠𝑑. The larger the statistic 𝑡, the smaller the p-value. How to achieve statistical significance? a) Reduce variance. b) Further improve the system. c) Evaluate with more queries!
  26. Statistical Significance is eventually meaningless… all you have to do is use enough queries.
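Slides 22–26 can be checked directly: holding 𝑑 and 𝑠𝑑 fixed, 𝑡 = 𝑑·√|𝓠| / 𝑠𝑑 grows with the number of queries, so even an irrelevant +0.0001 improvement eventually becomes statistically significant. A sketch with made-up numbers:

```python
import numpy as np
from scipy import stats

d_bar, s_d = 0.0001, 0.1  # tiny, irrelevant improvement; typical spread

for n in (100, 10_000, 10_000_000):
    # Paired t statistic: t = d_bar * sqrt(|Q|) / s_d
    t = d_bar * np.sqrt(n) / s_d
    # Two-sided p-value from the t distribution with |Q| - 1 dof.
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    print(f"|Q| = {n:>10,}: t = {t:5.2f}, p = {p:.4f}")
```

With 100 or 10,000 queries this difference is nowhere near significant; with ten million queries it is, even though no user would ever notice it.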
  27. Practical Significance: effect-size 𝑑, i.e. effectiveness / satisfaction. Statistical Significance: p-value, i.e. confidence. An improvement may be statistically significant, but that doesn’t mean it’s important!
  28. the real importance of an improvement
  29. Purpose of Evaluation: How good is my system? (effectiveness, from 0 to 1) · Is system A better than system B? (Δeffectiveness, from −1 to 1). We measure system effectiveness.
  30. Assumption: System Effectiveness corresponds to User Satisfaction. [plot: user satisfaction vs. system effectiveness]
  31. Assumption: System Effectiveness corresponds to User Satisfaction. [plot: user satisfaction vs. system effectiveness]
  32. Assumption: System Effectiveness corresponds to User Satisfaction. [plot: user satisfaction vs. system effectiveness]
  33. Assumption: System Effectiveness corresponds to User Satisfaction. [plot: user satisfaction vs. system effectiveness]
  34. Assumption: System Effectiveness corresponds to User Satisfaction. [plot: user satisfaction vs. system effectiveness]
  35. Assumption: System Effectiveness corresponds to User Satisfaction. This is our ultimate goal! Does it? How well?
  36. How we measure System Effectiveness. Similarity scale, normalized to [0, 1]: Broad (0, 1 or 2) or Fine (0, 1, 2, …, 100). Effectiveness measure: AG@5 (ignores the ranking) or nDCG@5 (discounts by rank). Which correlates better with user satisfaction?
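A sketch of the two measures on slide 36, under one common reading: AG@5 as the plain average of the normalized similarity gains of the top 5 results, and nDCG@5 with a log2 rank discount normalized by the ideal ordering. The exact MIREX definitions may differ in details (e.g., the discount function), so treat this as illustrative:

```python
import numpy as np

def ag_at_5(gains):
    """AG@5: average gain of the top 5 results; the ranking is ignored."""
    return float(np.mean(gains[:5]))

def ndcg_at_5(gains):
    """nDCG@5: gains discounted by log2(rank + 1), normalized by the
    ideal (descending) ordering of the same gains."""
    g = np.asarray(gains[:5], dtype=float)
    discounts = np.log2(np.arange(2, len(g) + 2))  # rank 1 -> log2(2) = 1
    dcg = np.sum(g / discounts)
    idcg = np.sum(np.sort(g)[::-1] / discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

# Broad-scale judgments (0, 1 or 2) normalized to [0, 1].
gains = [2 / 2, 0 / 2, 1 / 2, 2 / 2, 0 / 2]
print(f"AG@5 = {ag_at_5(gains):.3f}, nDCG@5 = {ndcg_at_5(gains):.3f}")
```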
  37. Experiment
  38. Experiment
  39. Experiment: known effectiveness
  40. Experiment: user preference
  41. Experiment: non-preference
  42. What can we infer? Preference (difference noticed by the user): Positive, the user agrees with the evaluation; Negative, the user disagrees with the evaluation. Non-preference (difference not noticed by the user): Good, both systems are satisfying; Bad, both systems are unsatisfying.
  43. Data: clips and similarity judgments from MIREX 2011 Audio Music Similarity. Random and artificial examples; query selected randomly; system outputs are random lists of 5 documents. 2200 examples for 73 unique queries; 2869 unique lists with 3031 unique clips; balanced and complete design.
  44. Subjects: crowdsourcing. Cheap, fast and… a diverse pool of subjects. 2200 examples at $0.03 per example. Quality control via trap examples (known answers) over the worker pool.
  45. Results: 6895 total answers from 881 workers in 62 countries. 3393 accepted answers (41%) from 100 workers (87% rejected!). 95% average quality when accepted.
  46. How good is my system? 884 non-preferences (40%). What do we expect?
  47. How good is my system? 884 non-preferences (40%). Linear mapping.
  48. How good is my system? 884 non-preferences (40%). What do we have?
  49. How good is my system? 884 non-preferences (40%).
  50. How good is my system? 884 non-preferences (40%).
  51. How good is my system? 884 non-preferences (40%). Room for ~20% improvement with personalization.
  52. Is system A better than B? 1316 preferences (60%). What do we expect?
  53. Is system A better than B? 1316 preferences (60%). Users always notice the difference… regardless of how large it is.
  54. Is system A better than B? 1316 preferences (60%). What do we have?
  55. Is system A better than B? 1316 preferences (60%).
  56. Is system A better than B? 1316 preferences (60%).
  57. Is system A better than B? 1316 preferences (60%). Differences larger than 0.3 and 0.4 are needed for more than 50% of users to agree.
  58. Is system A better than B? 1316 preferences (60%). The Fine scale is closer to the ideal 100%.
  59. Is system A better than B? 1316 preferences (60%). Do users prefer the (supposedly) worse system?
  60. Is system A better than B? 1316 preferences (60%).
  61. Statistical Significance has nothing to do with this.
  62. Picture by Ronny Welter
  63. Reporting Results: confidence intervals / variance. 0.584
  64. Reporting Results: confidence intervals / variance. 0.584 ± .023. An indicator of evaluation error; a better understanding of expected user satisfaction.
  65. Reporting Results: actual p-values. +0.037 ± .031 *
  66. Reporting Results: actual p-values. +0.037 ± .031 (p=0.02). Statistical Significance is relative: α=0.05 and α=0.01 are completely arbitrary; it depends on context, the cost of Type I errors and of implementation, etc.
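What the full report on slides 63–66 amounts to, sketched on hypothetical per-query differences: the effect size, a 95% confidence interval, and the actual p-value rather than a bare asterisk at an arbitrary α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d = rng.normal(0.037, 0.15, size=100)  # hypothetical per-query differences

d_bar = d.mean()
sem = stats.sem(d)                               # standard error of the mean
half = sem * stats.t.ppf(0.975, df=len(d) - 1)   # 95% CI half-width
_, p = stats.ttest_1samp(d, 0.0)

# Report effect size, confidence interval and the exact p-value.
print(f"{d_bar:+.3f} ± {half:.3f} (p={p:.3f})")
```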
  67. let’s review two papers (again)
  68. paper A: +0.14* · paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
  69. paper A (500 queries): +0.14 ± 0.03 (p=0.048) · paper B (50 queries): +0.21 ± 0.02 (p=0.052) …which one should get published? a.k.a. which research line should we follow?
  70. paper A (500 queries): +0.14 ± 0.03 (p=0.048) · paper B (50 queries): +0.21 ± 0.02 (p=0.052) …which one should get published? a.k.a. which research line should we follow?
  71. paper A: +0.14* · paper B: +0.14* …which one should get published? a.k.a. which research line should we follow?
  72. paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004) · paper B (cost=$50): +0.14 ± 0.03 (p=0.043) …which one should get published? a.k.a. which research line should we follow?
  73. paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004) · paper B (cost=$50): +0.14 ± 0.03 (p=0.043) …which one should get published? a.k.a. which research line should we follow?
  74. Effect-sizes are indicators of user satisfaction: we need to personalize results; small differences are not noticed. p-values are indicators of confidence: beware of collection size; we need to provide full reports.
  75. The difference between “significant” and “not significant” is not itself statistically significant. ― A. Gelman & H. Stern
