
- 1. How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval @julian_urbano University Carlos III of Madrid J. Stephen Downie University of Illinois at Urbana-Champaign Brian McFee University of California at San Diego Markus Schedl Johannes Kepler University Linz ISMIR 2012 Picture by Humberto Santos Porto, Portugal · October 9th
- 2. let’s review two papers
- 3. statistically significant paper A: +0.14* paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
- 5. paper A: +0.14* paper B: +0.14* …which one should get published? a.k.a. which research line should we follow?
- 7. Goal of Comparing Systems… Find out the effectiveness difference 𝒅 (arbitrary query and arbitrary user) Impossible! requires running the systems for the universe of all queries [figure: Δeffectiveness axis from -1 to 1]
- 8. …what Evaluations can do Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠 [figure: Δeffectiveness axis from -1 to 1]
- 12. …what Evaluations can do Estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠 There is always random error …so we need a measure of confidence
- 13. The Significance Drill Test these hypotheses H0: 𝑑 = 0 vs. H1: 𝑑 ≠ 0
- 14. The Significance Drill Test these hypotheses H0: 𝑑 = 0 vs. H1: 𝑑 ≠ 0 Result of the test… p-value = P(observing a difference at least as large as 𝑑 | H0) …interpretation of the test p-value is very small: reject H0 otherwise: accept H0
- 15. The Significance Drill Test these hypotheses H0: 𝑑 = 0 vs. H1: 𝑑 ≠ 0 We accept/reject H0… (based on the p-value and α) …not the test!
- 16. Usual (wrong) conclusions A is substantially better than B A is much better than B The difference is important The difference is significant
- 17. What does it mean? That there is a difference (unlikely due to chance/random error)
- 18. What does it mean? That there is a difference (unlikely due to chance/random error) We don’t need fancy statistics… …we already know they are different!
- 19. H0: 𝒅 = 0 is false by definition because systems A and B are different to begin with
- 20. What is really important? The effect-size: magnitude of 𝑑 This is what predicts user satisfaction, not p-values
- 21. What is really important? The effect-size: magnitude of 𝑑 This is what predicts user satisfaction, not p-values 𝒅 = +0.6 is a huge improvement 𝒅 = +0.0001 is irrelevant… …and yet, it can easily be statistically significant
- 22. Example: t-test 𝒕 = 𝒅 · √|𝓠| / 𝒔𝒅 The larger the statistic 𝑡, the smaller the p-value How to achieve statistical significance? a) Reduce variance b) Further improve the system c) Evaluate with more queries!
- 26. Statistical Significance is eventually meaningless… …all you have to do is use enough queries
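The point of these slides can be simulated in a few lines. The sketch below (hypothetical effect size and noise figures, and a normal-approximation p-value rather than the exact t distribution) shows the t statistic growing with the number of queries even though the underlying improvement stays negligible:

```python
import math
import random

def paired_t(diffs):
    """t statistic for H0: mean per-query difference = 0 (paired t-test)."""
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in diffs) / (n - 1))
    return mean * math.sqrt(n) / sd

def p_value(t):
    """Two-sided p-value via the normal approximation (fine for large n)."""
    return math.erfc(abs(t) / math.sqrt(2))

random.seed(7)
# A tiny, practically irrelevant true improvement of +0.001,
# with per-query noise of 0.05 (all numbers made up for illustration)
for n in (50, 5_000, 500_000):
    diffs = [random.gauss(0.001, 0.05) for _ in range(n)]
    t = paired_t(diffs)
    print(f"n={n:>7}  t={t:6.2f}  p={p_value(t):.4f}")
```

With 50 queries the difference is nowhere near significant; with 500,000 queries the same +0.001 improvement comes out as highly "significant", which is exactly the problem the slide describes.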
- 27. Practical Significance: effect-size 𝑑 (effectiveness / satisfaction) Statistical Significance: p-value (confidence) An improvement may be statistically significant, but that doesn’t mean it’s important!
- 28. the real importance of an improvement
- 29. Purpose of Evaluation How good is my system? (effectiveness: 0 to 1) Is system A better than system B? (Δeffectiveness: -1 to 1) We measure system effectiveness
- 30. Assumption System Effectiveness corresponds to User Satisfaction user satisfaction system effectiveness
- 35. Assumption System Effectiveness corresponds to User Satisfaction this is our ultimate goal! Does it? How well?
- 36. How we measure System Effectiveness Similarity scale (normalized to [0, 1]) Broad: 0, 1 or 2 Fine: 0, 1, 2, ..., 100 Effectiveness measure AG@5: ignore the ranking nDCG@5: discount by rank What correlates better with user satisfaction?
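The two measures on this slide can be sketched briefly. The judgment lists below are made-up examples, and the nDCG variant assumed here uses the common log2 rank discount normalized by the ideal ranking (the exact MIREX formulation may differ in detail):

```python
import math

def ag_at_k(gains, k=5):
    """Average Gain@k: mean of the (normalized) gains, rank-agnostic."""
    top = gains[:k]
    return sum(top) / len(top)

def ndcg_at_k(gains, k=5):
    """nDCG@k with a log2 rank discount, normalized by the ideal ranking."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Broad scale (0, 1 or 2) normalized to [0, 1]
broad = [g / 2 for g in (2, 0, 1, 2, 0)]
# Fine scale (0..100) normalized to [0, 1]
fine = [g / 100 for g in (85, 10, 60, 90, 5)]

print(ag_at_k(broad), ndcg_at_k(broad))
print(ag_at_k(fine), ndcg_at_k(fine))
```

AG@5 gives the same score for any permutation of the list, while nDCG@5 rewards placing the most similar clips near the top.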
- 37. Experiment
- 39. Experiment known effectiveness
- 40. Experiment user preference
- 41. Experiment non-preference
- 42. What can we infer? Preference (difference noticed by user) Positive: user agrees with evaluation Negative: user disagrees with evaluation Non-preference (difference not noticed by user) Good: both systems are satisfying Bad: both systems are unsatisfying
- 43. Data Clips and Similarity Judgments from MIREX 2011 Audio Music Similarity Random and Artificial examples Query: selected randomly System outputs: random lists of 5 documents 2200 examples for 73 unique queries 2869 unique lists with 3031 unique clips balanced and complete design
- 44. Subjects Crowdsourcing: cheap, fast and… diverse pool of subjects 2200 examples Quality control: trap examples (known answers) $0.03 per example
- 45. Results 6895 total answers from 881 workers in 62 countries 3393 accepted answers (41%) from 100 workers (87% of workers rejected!) 95% average quality when accepted
- 46. How good is my system? 884 nonpreferences (40%) What do we expect?
- 47. How good is my system? 884 nonpreferences (40%) Linear mapping
- 48. How good is my system? 884 nonpreferences (40%) What do we have?
- 49. How good is my system? 884 nonpreferences (40%)
- 51. How good is my system? 884 nonpreferences (40%) room for ~20% improvement with personalization
- 52. Is system A better than B? 1316 preferences (60%) What do we expect?
- 53. Is system A better than B? 1316 preferences (60%) Users always notice the difference… …regardless of how large it is
- 54. Is system A better than B? 1316 preferences (60%) What do we have?
- 55. Is system A better than B? 1316 preferences (60%)
- 57. Is system A better than B? 1316 preferences (60%) differences of >0.3 and >0.4 are needed for >50% of users to agree
- 58. Is system A better than B? 1316 preferences (60%) Fine scale is closer to the ideal 100%
- 59. Is system A better than B? 1316 preferences (60%) Do users prefer the (supposedly) worse system?
- 61. Statistical Significance has nothing to do with this
- 62. Picture by Ronny Welter
- 63. Reporting Results Confidence intervals / Variance 0.584
- 64. Reporting Results Confidence intervals / Variance 0.584 ± .023 Indicator of evaluation error Better understanding of expected user satisfaction
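A minimal sketch of the kind of report this slide advocates: mean effectiveness plus a confidence interval. The per-query scores below are hypothetical, and the z = 1.96 normal approximation for a ~95% interval is an assumption; with small query sets the t distribution should be used instead:

```python
import math

def mean_ci(scores, z=1.96):
    """Mean with a ~95% confidence half-width (normal approximation)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    half = z * sd / math.sqrt(n)
    return mean, half

# Hypothetical per-query effectiveness scores for one system
scores = [0.61, 0.55, 0.58, 0.62, 0.57, 0.60, 0.56, 0.59]
mean, half = mean_ci(scores)
print(f"{mean:.3f} ± {half:.3f}")
```

Reporting the interval (and, for comparisons, the actual p-value) lets readers judge both the evaluation error and the expected user satisfaction, instead of a bare starred score.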
- 65. Reporting Results Actual p-values +0.037 ± .031 *
- 66. Reporting Results Actual p-values +0.037 ± .031 (p=0.02) Statistical Significance is relative α=0.05 and α=0.01 are completely arbitrary Depends on context, cost of Type I errors and implementation, etc.
- 67. let’s review two papers (again)
- 68. paper A: +0.14* paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
- 69. paper A (500 queries): +0.14 ± 0.03 (p=0.048) paper B (50 queries): +0.21 ± 0.02 (p=0.052) …which one should get published? a.k.a. which research line should we follow?
- 71. paper A: +0.14 * paper B: +0.14 * …which one should get published? a.k.a. which research line should we follow?
- 72. paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004) paper B (cost=$50): +0.14 ± 0.03 (p=0.043) …which one should get published? a.k.a. which research line should we follow?
- 74. effect-sizes are indicators of user satisfaction: need to personalize results; small differences are not noticed. p-values are indicators of confidence: beware of collection size; need to provide full reports
- 75. The difference between “Significant” and “Not Significant” is not itself statistically significant ― A. Gelman & H. Stern