- 1. How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval. Julián Urbano (@julian_urbano), University Carlos III of Madrid; J. Stephen Downie, University of Illinois at Urbana-Champaign; Brian McFee, University of California at San Diego; Markus Schedl, Johannes Kepler University Linz. ISMIR 2012, Porto, Portugal · October 9th. Picture by Humberto Santos.
- 2. let’s review two papers
- 3-4. statistically significant. paper A: +0.14*  paper B: +0.21 … which one should get published? a.k.a. which research line should we follow?
- 5-6. paper A: +0.14*  paper B: +0.14* … which one should get published? a.k.a. which research line should we follow?
- 7. Goal of Comparing Systems: find out the effectiveness difference 𝒅 (for an arbitrary query and an arbitrary user). Impossible! It requires running the systems for the universe of all queries. [figure: 𝒅 on the Δeffectiveness axis, from -1 to 1]
- 8-11. …what Evaluations can do: estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. [figure: 𝑑 moving along the Δeffectiveness axis, from -1 to 1]
- 12. …what Evaluations can do: estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. There is always random error… so we need a measure of confidence.
- 13. The Significance Drill. Test these hypotheses: H₀: 𝑑 = 0 vs. H₁: 𝑑 ≠ 0.
- 14. The Significance Drill. Result of the test: the p-value = P(observing a difference at least as large as 𝒅 | H₀). Interpretation of the test: if the p-value is very small, reject H₀; otherwise, accept H₀.
- 15. The Significance Drill. We accept/reject H₀ based on the p-value and α… we interpret the test, the test itself decides nothing!
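The significance drill above can be sketched numerically. The per-query differences below are invented for illustration, and the two-sided p-value uses a normal approximation via `erfc`, which is reasonable for a largish query sample (a real test would use the t distribution):

```python
# Sketch of the significance drill on made-up per-query differences
# d_i = effectiveness(B, query_i) - effectiveness(A, query_i).
from math import sqrt, erfc
from statistics import mean, stdev

d = [0.03, -0.01, 0.05, 0.02, 0.00, 0.04, -0.02, 0.06, 0.01, 0.03,
     0.02, 0.05, -0.03, 0.04, 0.02, 0.01, 0.03, 0.00, 0.04, 0.02]

# t statistic of the mean difference: t = mean(d) * sqrt(|Q|) / s_d
t = mean(d) * sqrt(len(d)) / stdev(d)
# two-sided tail probability under the normal approximation: P(|Z| >= |t|)
p = erfc(abs(t) / sqrt(2))

print(f"t = {t:.2f}, p = {p:.4f}")
print("reject H0" if p < 0.05 else "accept H0")  # rejected for these numbers
```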
- 16. Usual (wrong) conclusions: A is substantially better than B; A is much better than B; the difference is important; the difference is significant.
- 17-18. What does it mean? That there is a difference (unlikely due to chance/random error). We don’t need fancy statistics… we already know they are different!
- 19. H₀: 𝒅 = 0 is false by definition, because systems A and B are different to begin with.
- 20-21. What is really important? The effect-size: the magnitude of 𝑑. This is what predicts user satisfaction, not p-values. 𝒅 = +0.6 is a huge improvement; 𝒅 = +0.0001 is irrelevant… and yet, it can easily be statistically significant.
- 22-25. Example: t-test. 𝒕 = 𝒅·√|𝓠| / 𝒔𝒅 : the larger the statistic 𝑡, the smaller the p-value. How to achieve statistical significance? a) Reduce variance. b) Further improve the system. c) Evaluate with more queries!
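Point c) follows directly from the formula: holding 𝒅 and 𝒔𝒅 fixed, 𝑡 grows with √|𝓠|. The numbers below are made up to illustrate how a practically irrelevant +0.0001 improvement becomes significant given enough queries:

```python
# For a fixed mean difference and standard deviation, the t statistic
# t = d_mean * sqrt(|Q|) / s_d grows with the number of queries |Q|,
# so any nonzero difference is eventually "statistically significant".
import math

d_mean = 0.0001   # tiny, practically irrelevant improvement (made up)
s_d = 0.05        # per-query standard deviation of the differences (made up)

for n in (25, 100, 10_000, 1_000_000):
    t = d_mean * math.sqrt(n) / s_d
    print(f"|Q| = {n:>9}: t = {t:.3f}")
# t crosses the usual 1.96 threshold once |Q| is large enough
```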
- 26. Statistical Significance is eventually meaningless… all you have to do is use enough queries.
- 27. Practical Significance (effect-size 𝑑): effectiveness / satisfaction. Statistical Significance (p-value): confidence. An improvement may be statistically significant, but that doesn’t mean it’s important!
- 28. the real importance of an improvement
- 29. Purpose of Evaluation. How good is my system? (effectiveness, 0 to 1) Is system A better than system B? (Δeffectiveness, -1 to 1) We measure system effectiveness.
- 30-34. Assumption: System Effectiveness corresponds to User Satisfaction. [figure: user satisfaction vs. system effectiveness]
- 35. Assumption: System Effectiveness corresponds to User Satisfaction. This is our ultimate goal! Does it? How well?
- 36. How we measure System Effectiveness. Similarity scale (we normalize to [0, 1]): Broad: 0, 1 or 2; Fine: 0, 1, 2, …, 100. Effectiveness measure: AG@5 (ignore the ranking); nDCG@5 (discount by rank). What correlates better with user satisfaction?
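A minimal sketch of the two effectiveness measures, assuming gains already normalized to [0, 1] (e.g. Broad 0/1/2 mapped to 0/.5/1). The exact MIREX formulations, and in particular the ideal ranking used to normalize nDCG, may differ from this common textbook version:

```python
# AG@5 ignores the ranking; nDCG@5 discounts gains by rank position.
import math

def ag_at_k(gains, k=5):
    """Average gain of the top-k documents (rank-insensitive)."""
    return sum(gains[:k]) / k

def dcg_at_k(gains, k=5):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k=5):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# hypothetical judged similarities for one retrieved list of 5 clips
gains = [0.5, 1.0, 0.0, 0.5, 0.0]
print(ag_at_k(gains))    # identical for any permutation of the list
print(ndcg_at_k(gains))  # below 1: the best clip is not ranked first
```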
- 37-41. Experiment. [figure builds: known effectiveness, user preference, non-preference]
- 42. What can we infer? Preference (difference noticed by the user): positive means the user agrees with the evaluation; negative means the user disagrees with the evaluation. Non-preference (difference not noticed by the user): good means both systems are satisfying; bad means both systems are unsatisfying.
- 43. Data: clips and similarity judgments from MIREX 2011 Audio Music Similarity. Random and artificial examples. Query: selected randomly. System outputs: random lists of 5 documents. 2200 examples for 73 unique queries; 2869 unique lists with 3031 unique clips. Balanced and complete design.
- 44. Subjects: crowdsourcing. Cheap, fast and… a diverse pool of subjects. 2200 examples at $0.03 per example. Quality control over the worker pool: trap examples (known answers).
- 45. Results: 6895 total answers from 881 workers in 62 countries. 3393 accepted answers (41%) from 100 workers (87% of workers rejected!). 95% average quality when accepted.
- 46-51. How good is my system? 884 non-preferences (40%). What do we expect? A linear mapping between effectiveness and satisfaction. What do we have? Room for ~20% improvement with personalization.
- 52-60. Is system A better than B? 1316 preferences (60%). What do we expect? Users always notice the difference… regardless of how large it is. What do we have? Differences of >.3 and >.4 are needed for >50% of users to agree. The Fine scale is closer to the ideal 100%. Do users prefer the (supposedly) worse system?
- 61. Statistical Significance has nothing to do with this
- 62. Picture by Ronny Welter
- 63-64. Reporting Results: confidence intervals / variance. Not just "0.584", but "0.584 ± .023": an indicator of evaluation error and a better understanding of expected user satisfaction.
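The "0.584 ± .023" style of report can be produced as below. The scores are made-up illustration data, and the margin uses the normal-approximation multiplier 1.96 for a 95% interval (a t multiplier would be slightly wider for only 10 queries):

```python
# Mean effectiveness over a query sample plus a 95% confidence interval.
from statistics import mean, stdev
from math import sqrt

# hypothetical per-query effectiveness scores of one system
scores = [0.61, 0.55, 0.49, 0.72, 0.58, 0.63, 0.51, 0.66, 0.57, 0.60]

m = mean(scores)
# normal-approximation 95% margin: 1.96 * standard error of the mean
margin = 1.96 * stdev(scores) / sqrt(len(scores))
print(f"{m:.3f} ± {margin:.3f}")
```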
- 65-66. Reporting Results: actual p-values. Not just "+0.037 ± .031 *", but "+0.037 ± .031 (p=0.02)". Statistical significance is relative: α=0.05 and α=0.01 are completely arbitrary; the right threshold depends on context, the cost of Type I errors, implementation, etc.
- 67. let’s review two papers (again)
- 68. paper A: +0.14*  paper B: +0.21 … which one should get published? a.k.a. which research line should we follow?
- 69-70. paper A (500 queries): +0.14 ± 0.03 (p=0.048). paper B (50 queries): +0.21 ± 0.02 (p=0.052). …which one should get published? a.k.a. which research line should we follow?
- 71. paper A: +0.14*  paper B: +0.14* … which one should get published? a.k.a. which research line should we follow?
- 72-73. paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004). paper B (cost=$50): +0.14 ± 0.03 (p=0.043). …which one should get published? a.k.a. which research line should we follow?
- 74. Effect-sizes are indicators of user satisfaction; we need to personalize results; small differences are not noticed. P-values are indicators of confidence; beware of collection size; we need to provide full reports.
- 75. The difference between “significant” and “not significant” is not itself statistically significant ― A. Gelman & H. Stern
