
Statistical Analysis of Results in Music Information Retrieval: Why and How



Slides from the ISMIR 2018 tutorial



  1. 1. Statistical Analysis of Results in Music Information Retrieval: Why and How Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  2. 2. Who 2
  3. 3. Julián Urbano ● Assistant Professor @ TU Delft, The Netherlands ● BSc-PhD Computer Science ● 10 years of research in (Music) Information Retrieval o And related, like information extraction or crowdsourcing for IR ● Active @ ISMIR since 2010 ● Research topics o Evaluation methodologies o Statistical methods for evaluation o Simulation for evaluation o Low-cost evaluation 3 Supported by the European Commission H2020 project TROMPA (770376-2)
  4. 4. Arthur Flexer ● Austrian Research Institute for Artificial Intelligence - Intelligent Music Processing and Machine Learning Group ● PhD in Psychology, minor degree in Computer Science ● 10 years of research in neuroscience, 13 years in MIR ● Active @ ISMIR since 2005 ● Published on: o role of experiments in MIR o problems of ground truth o problems of inter-rater agreement 4 Supported by the Vienna Science and Technology Fund (WWTF, project MA14-018)
  5. 5. Arthur Flexer ● Austrian Research Institute for Artificial Intelligence - Intelligent Music Processing and Machine Learning Group ● PhD in Psychology, minor degree in Computer Science ● 10 years of research in neuroscience, 13 years in MIR ● Active @ ISMIR since 2005 ● Published on: o role of experiments in MIR o problems of ground truth o problems of inter-rater agreement 5 Semi-retired veteran DJ Supported by the Vienna Science and Technology Fund (WWTF, project MA14-018)
  6. 6. Disclaimer ● Design of experiments (DOE) used and needed in all kinds of sciences ● DOE is a science in its own right ● No fixed “how-tos” or “cookbooks” ● Different schools and opinions ● Established ways to proceed in different fields ● We will present current procedures in (M)IR ● But also discuss, criticize, point to problems ● Present alternatives and solutions (?) 6
  7. 7. Program ● Part I: Why (we evaluate the way we do it)? o Tasks and use cases o Cranfield o Validity and reliability ● Part II: How (should we not analyze results)? o Populations and samples o Estimating means o Fisher, Neyman-Pearson, NHST o Tests and multiple comparisons ● Part III: What else (should we care about)? o Inter-rater agreement o Adversarial examples ● Part IV: So (what does it all mean)? ● Discussion? 7
  8. 8. 8 #ismir2018
  9. 9. Part I: Why? Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  10. 10. Typical Information Retrieval task 2
  11. 11. Typical Information Retrieval task 2 IR System
  12. 12. Typical Information Retrieval task 2 Documents IR System
  13. 13. Typical Information Retrieval task 2 Documents IR System
  14. 14. Typical Information Retrieval task 2 Documents Information Need or Topic IR System
  15. 15. Typical Information Retrieval task 2 Documents Information Need or Topic IR System query
  16. 16. Typical Information Retrieval task 2 Documents Information Need or Topic IR System query
  17. 17. Typical Information Retrieval task 2 Documents Information Need or Topic IR System query Results
  18. 18. Typical Information Retrieval task 2 Documents Information Need or Topic IR System query Results
  19. 19. Typical Information Retrieval task 2 Documents Information Need or Topic IR System query Results query
  20. 20. Typical Information Retrieval task 2 Documents Information Need or Topic IR System query ResultsResults query
  21. 21. Two recurrent questions  How good is my system? ○ What does good mean? ○ What is good enough?  Is system A better than system B? ○ What does better mean? ○ How much better?  What do we talk about? ○ Efficiency? ○ Effectiveness? ○ Ease? 3
  22. 22. Hypothesis: A is better than B How would you design this experiment?
  23. 23. Measure user experience  We are interested in user-measures ○ Time to complete task ○ Idle time ○ Success/Failure rate ○ Frustration ○ Ease of learning ○ Ease of use …  Their distributions describe user experience ○ For an arbitrary user and topic (and document collection?) ○ What can we expect? 5 [Figures: example distributions of time to complete task, and of frustration (none, some, much)]
  24. 24. Sources of variability user-measure = f(documents, topic, user, system)  Our goal is the distribution of the user-measure for our system, which is impossible to calculate ○ (Possibly?) infinite populations  As usual, the best we can do is estimate it ○ Becomes subject to random error 6
  25. 25. Desired: Live Observation  Estimate distributions with a live experiment  Sample documents, topics and users  Have them use the system, for real  Measure user experience, implicitly or explicitly  Many problems ○ High cost, representativeness ○ Ethics, privacy, hidden effects, inconsistency ○ Hard to replicate experiment and repeat results ○ Just plain impossible to reproduce results *replicate = same method, different sample reproduce = same method, same sample 7
  26. 26. Alternative: Fixed samples  Get (hopefully) good samples, fix them and reuse ○ Documents ○ Topics ○ Users  Promotes reproducibility and reduces variability  But we can’t just fix the users! 8
  27. 27. Simulate users  Cranfield paradigm: remove users, but include a user abstraction, fixed across experiments ○ Static user component: judgments or annotations in ground truth ○ Dynamic user component: effectiveness or performance measures  Removes all sources of variability, except systems user-measure = f(documents, topic, user, system) 9
  28. 28. Simulate users  Cranfield paradigm: remove users, but include a user abstraction, fixed across experiments ○ Static user component: judgments or annotations in ground truth ○ Dynamic user component: effectiveness or performance measures  Removes all sources of variability, except systems user-measure = f(documents, topic, user, system) 9 user-measure = f(system)
  29. 29. Datasets (aka Test Collections)  Controlled sample of documents, topics and judgments, shared across researchers…  …combined with performance measures  (Most?) important resource for IR research ○ Experiments are inexpensive (datasets are not!) ○ Research becomes systematic ○ Evaluation is deterministic ○ Reproducibility is not only possible but easy 10
  30. 30. Repeat over topics Cranfield-like evaluation 11 [Diagram: collection + topic → System → documents; Annotator produces relevance judgments via an Annotation protocol; Measure → score]
  31. 31. User Models & Annotation Protocols  In practice, there are hundreds of options  Utility of a document w.r.t. scale of annotation ○ Binary or graded relevance? ○ Linear utility w.r.t. relevance? Exponential? ○ Independent of other documents?  Top heaviness to penalize late arrival ○ No discount? ○ Linear discount? Logarithmic? ○ Independent of other documents?  Interaction, browsing?  Cutoff ○ Fixed: only top k documents? ○ Dynamic: wherever some condition is met? ○ All documents?  etc 12
  32. 32. Tasks vs Use Cases  Everything depends on the use case of interest  The same task may have several use cases (or subtasks) ○ Informational ○ Navigational ○ Transactional ○ etc  Different use cases may imply, suggest or require different decisions wrt system input/output, goal, annotations, measures... 13
  33. 33. Task: instrument recognition What is the use case?
  34. 34. Instrument recognition 1) Given a piece of music as input, identify the instruments that are played in it: ○ For each window of T milliseconds, return a list of instruments being played (extraction). ○ Return the instruments being played anywhere in the piece (classification). 2) Given an instrument as input, retrieve a list of music pieces in which the instrument is played: ○ Return the list of music pieces (retrieval). ○ Return the list, but for each piece also provide a clip (start-end) where the instrument is played (retrieval+extraction).  Each case implies different systems, annotations and measures, and even different end-users (non-human?) https://github.com/cosmir/open-mic/issues/19 15
  35. 35. But wait a minute...  Are we estimating distributions about users or distributions about systems? user-measure = f(system) system-measure = f(system, protocol, measure) 16
  36. 36. But wait a minute...  Are we estimating distributions about users or distributions about systems? user-measure = f(system) system-measure = f(system, protocol, measure) 16 system-measure = f(system, protocol, measure, annotator, context, ...)  Whether the system output satisfies the user or not, has nothing to do with how we measure its performance  What is the best way to predict user satisfaction?
  37. 37. Real world vs. The lab 17 The Web Abstraction Prediction Real World Cranfield IR System Topic Relevance Judgments IR System Documents AP DCG RR Static Component Dynamic Component Test Collection Effectiveness Measures Information need
  38. 38. Output Cranfield in Music IR 18 System Input Measure Annotations Users
  39. 39. Classes of Tasks in Music IR ● Retrieval ○ Music similarity 19 System collection track track track track track track
  40. 40. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming 20 System collection hum track track track track track
  41. 41. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation 21 System collection user track track track track track
  42. 42. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation ● Annotation ○ Genre classification 22 System track genre
  43. 43. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation ● Annotation ○ Genre classification ○ Mood recognition 23 System track mood1 mood2
  44. 44. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation ● Annotation ○ Genre classification ○ Mood recognition ○ Autotagging 24 System track tag tag tag
  45. 45. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation ● Annotation ○ Genre classification ○ Mood recognition ○ Autotagging ● Extraction ○ Structural segmentation 25 System track seg segseg seg
  46. 46. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation ● Annotation ○ Genre classification ○ Mood recognition ○ Autotagging ● Extraction ○ Structural segmentation ○ Melody extraction 26 System track
  47. 47. Classes of Tasks in Music IR ● Retrieval ○ Music similarity ○ Query by humming ○ Recommendation ● Annotation ○ Genre classification ○ Mood recognition ○ Autotagging ● Extraction ○ Structural segmentation ○ Melody extraction ○ Chord estimation 27 System track chocho cho
  48. 48. Evaluation as Simulation  Cranfield-style evaluation with datasets is a simulation of the user-system interaction, deterministic, maybe even simplistic, but a simulation nonetheless  Provides us with data to estimate how good our systems are, or which one is better  Typically, many decisions are made for the practitioner  Comes with many assumptions and limitations 28
  49. 49. Validity and Reliability  Validity: are we measuring what we want to? ○ Internal: are observed effects due to hidden factors? ○ External: are input items, annotators, etc generalizable? ○ Construct: do system-measures match user-measures? ○ Conclusion: how good is good and how better is better? Systematic error  Reliability: how repeatable are the results? ○ Will I obtain the same results with a different collection? ○ How large do collections need to be? ○ What statistical methods should be used? Random error 29
  50. 50. 30 [Figure: dartboard illustration of the four combinations — Not Valid & Reliable, Valid & Not Reliable, Not Valid & Not Reliable, Valid & Reliable]
  51. 51. So long as... • So long as the dataset is large enough to minimize random error and draw reliable conclusions • So long as the tools we use to make those conclusions can be trusted • So long as the task and use case are clear • So long as the annotation protocol and performance measure (ie. user model) are realistic and actually measure something meaningful for the use case • So long as the samples of inputs and annotators present in the dataset are representative for the task 31
  52. 52. What else How So long as... • So long as the dataset is large enough to minimize random error and draw reliable conclusions • So long as the tools we use to make those conclusions can be trusted • So long as the task and use case are clear • So long as the annotation protocol and performance measure (ie. user model) are realistic and actually measure something meaningful for the use case • So long as the samples of inputs and annotators present in the dataset are representative for the task 31
  53. 53. “If you can’t measure it, you can’t improve it.” —Lord Kelvin 32
  54. 54. “If you can’t measure it, you can’t improve it.” —Lord Kelvin 32 “But measurements have to be trustworthy.” —yours truly
  55. 55. Part II: How? Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  56. 56. Populations and Samples
  57. 57. Populations of interest ● The task and use case define the populations of interest ○ Music tracks ○ Users ○ Annotators ○ Vocabularies ● Impossible to study all entities in these populations ○ Too many ○ Don’t exist anymore (or yet) ○ Too far away ○ Too expensive ○ Illegal 3
  58. 58. Populations and samples ● Our goal is to study the performance of the system on these populations ○ Lets us know what to expect from the system in the real world ● We’re typically interested in the mean: the expectation 𝜇 ○ Based on this we would decide what research line to pursue, what paper to publish, what project to fund, etc. ○ But variability is also important, though often neglected ● A dataset represents just a sample from that population ○ By studying the sample we could generalize back to the population ○ But will bear some degree of random error due to sampling 4
  59. 59. Populations and samples 5 Target population
  60. 60. Populations and samples 5 Target population External validity Accessible population or sampling frame
  61. 61. Populations and samples 5 Target population External validity Accessible population or sampling frame Content Validity and Reliability Sample
  62. 62. Inference Populations and samples 5 Target population External validity Accessible population or sampling frame Content Validity and Reliability Sample
  63. 63. Generalization Inference Populations and samples 5 Target population External validity Accessible population or sampling frame Content Validity and Reliability Sample
  64. 64. Populations and samples ● This is an estimation problem ● The objective is the distribution 𝐹 of performance over the population, specifically the mean 𝜇 ● Given a sample of observations 𝑋₁, … , 𝑋ₙ, estimate 𝜇 ● Most straightforward estimator is the sample mean: 𝜇̂ = 𝑋̄ ● Problem: 𝝁̂ = 𝝁 + 𝒆, with 𝒆 the random error ● For any given sample or dataset, we only know 𝑿̄ ● How confident we are in the results and our conclusions depends on the size of 𝒆 with respect to 𝝁 6
  65.–76. Populations and samples (slides 7–9) [Figures: a population performance distribution, and histograms of repeated samples of size n = 10, n = 30 and n = 100, each with its sample mean 𝑋̄ marked]
  77. 77. Sampling distribution and standard error ● Let us assume some distribution with some 𝜇 ● Experiment: draw a random sample of size 𝑛 and compute 𝑋̄ ● The sampling distribution is the distribution of 𝑋̄ over replications of the experiment ● Standard error is the std. dev. of the sampling distribution 10 [Figures: population performance distribution, and sampling distributions of the mean 𝑋̄ for n = 10, 30, 50, 100]
  78. 78. Estimating the mean ● The true distribution 𝐹 has mean 𝜇 and variance 𝜎² ○ 𝐸[𝑋] = 𝜇 ○ 𝑉𝑎𝑟(𝑋) = 𝜎² ● For the sample mean 𝑋̄ = (1/𝑛)∑𝑋ᵢ we have ○ 𝐸[𝑋̄] = (1/𝑛)∑𝐸[𝑋ᵢ] = 𝜇 ○ 𝑉𝑎𝑟(𝑋̄) = (1/𝑛²)∑𝑉𝑎𝑟(𝑋ᵢ) = 𝜎²/𝑛, so std. error = 𝜎_𝑋̄ = 𝜎/√𝑛 ● Law of large numbers: 𝑋̄ → 𝜇 as 𝑛 → ∞ ● The larger the dataset, the better our estimates ● Regardless of the true distribution 𝑭 over the population 11
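A minimal simulation sketch of this point (assuming NumPy; the Beta population is purely illustrative, not from the slides): the spread of the sample mean over replications shrinks like σ/√n, whatever the underlying distribution.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of per-item performance scores in [0, 1]:
# a Beta(2, 3) distribution, chosen only for illustration.
a, b = 2.0, 3.0
sigma = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))   # true std. dev.

for n in (10, 30, 100):
    # Replicate the experiment many times: draw a sample of size n,
    # compute the sample mean, and look at its spread across replications.
    means = rng.beta(a, b, size=(10_000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical SE={means.std(ddof=1):.4f}  "
          f"theoretical sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```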
  79. 79. Gaussians everywhere ● For the special case where 𝐹 = 𝑁(𝜇, 𝜎²), the sample mean is also Gaussian, specifically 𝑋̄ ~ 𝑁(𝜇, 𝜎²/𝑛) ● But a Gaussian distribution is sometimes a very unrealistic model for our data ● Central Limit Theorem (CLT): 𝑋̄ →d 𝑁(𝜇, 𝜎²/𝑛) as 𝑛 → ∞ ● Regardless of the shape of the original 𝑭 12 [Figures: a non-Gaussian population distribution, and sampling distributions of 𝑋̄ for n = 10 and n = 100]
  80. 80. Approximations ● Until the late 1890s, the CLT was invoked everywhere for the simplicity of working with Gaussians ● Tables of the Gaussian distribution were used to test ● Still, there were two main problems ○ 𝜎² is unknown ○ The rate of convergence, ie. small samples ● But something happened at the Guinness factory in 1908 13
  81. 81. 14 William Gosset
  82. 82. Student-t distribution ● Gaussian approximations for sampling distributions were reasonably good for large samples, but not for small ● Gosset thought about deriving the theoretical distributions under assumptions of the underlying model ● Specifically, when 𝑋~𝑁(𝜇, 𝜎²): ○ If 𝜎 is known, we know that 𝑧 = (𝑋̄ − 𝜇)/(𝜎/√𝑛) ~ 𝑁(0, 1) ○ If 𝜎 is unknown, Gosset introduced the Student-t distribution: 𝑡 = (𝑋̄ − 𝜇)/(𝑠/√𝑛) ~ 𝑇(𝑛 − 1), where 𝑠 is the sample standard deviation ● In a sense, it accounts for the uncertainty in estimating 𝜎 by 𝑠 15
  83. 83. Small-sample problems ● In non-English literature there are earlier mentions, but it was popularized by Gosset and, mostly, Fisher ● He initiated the study of the so-called small-sample problems, specifically with the Student-𝑡 distribution 16 [Figure: Student-t densities for n = 2, 3, 6, 30]
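To make the small-sample point concrete, a short sketch (assuming SciPy) comparing two-sided 95% critical values of the Gaussian and of Student-t for the sample sizes shown in the figure:

```python
from scipy import stats

# The t distribution has heavier tails than the Gaussian,
# especially for very small samples.
print("Gaussian:", round(stats.norm.ppf(0.975), 3))
for n in (2, 3, 6, 30):
    print(f"t, n={n:2d} (df={n - 1:2d}):", round(stats.t.ppf(0.975, df=n - 1), 3))
```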
  84. 84. Ronald Fisher
  85. 85. Fisher and small samples ● Gosset did not provide a proof of the 𝑡 distribution, but Fisher did in 1912-1915 ● Fisher stopped working on small sample problems until Gosset convinced him in 1922 ● He then worked out exact distributions for correlation coefficients, regression coefficients, 𝜒² tests, etc. in the early 1920s ● These, and much of his work on estimation and design of experiments, were collected in his famous 1925 book ● This book is sometimes considered the birth of modern statistical methods 18 Ronald Fisher
  86. 86. 19
  87. 87. Fisher’s significance testing ● In those other papers Fisher developed his theory of significance testing ● Suppose we have observed data 𝑋 ~ 𝑓(𝑥; 𝜃) and we are interested in testing the null hypothesis 𝐻0: 𝜃 = 𝜃0 ● We choose a relevant test statistic 𝑇 s.t. large values of 𝑇 reflect evidence against 𝐻0 ● Compute the 𝒑-value 𝑝 = 𝑃(𝑇(𝑋∗) ≥ 𝑇(𝑋) | 𝐻0), that is, the probability that, under 𝐻0, we observe a sample 𝑋∗ with a test statistic at least as extreme as we observed initially ● Assess the statistical significance of the results, that is, reject 𝐻0 if 𝑝 is small 20
  88. 88. Testing the mean ● We observed 𝑋 = {−0.13, 0.68, −0.34, 2.10, 0.83, −0.32, 0.99, 1.24, 1.08, 0.19} and assume a Gaussian model ● We set 𝐻0: 𝜇 = 0 and choose a 𝑡 statistic ● For our data, 𝑝 = 0.0155 (𝑡 = 2.55) ● If we consider 𝑝 small enough, we reject 𝐻0 21 [Figure: t density with the observed statistic and the p-value area shaded]
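The slide's numbers can be reproduced with SciPy's one-sample t-test; the one-sided alternative matches the reported p = 0.0155 (the `alternative` argument assumes SciPy ≥ 1.6).

```python
from scipy import stats

# Data from the slide
X = [-0.13, 0.68, -0.34, 2.10, 0.83, -0.32, 0.99, 1.24, 1.08, 0.19]

# H0: mu = 0 against the one-sided alternative mu > 0
# (a one-sided test matches the p = 0.0155 reported on the slide)
t, p = stats.ttest_1samp(X, popmean=0, alternative="greater")
print(f"t = {t:.2f}, p = {p:.4f}")   # t = 2.55, p = 0.0155
```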
  89. 89. Small p-values ● “we do not want to know the exact value of p […], but, in the first place, whether or not the observed value is open to suspicion” ● Fisher provided in his book tables not of the new small-sample distributions, but of selected quantiles ● Allow for calculation of ranges of 𝑝-values given test statistics, as different degrees of evidence against 𝐻0 ● The 𝑝-value is gradable, a continuous measure of evidence 22
  90. 90. 23
  91. 91. 𝑝 and 𝛼 ● Fisher employed the term significance level 𝛼 for these theoretical 𝑝-values used as reference points to identify statistically significant results: reject 𝐻0 if 𝑝 ≤ 𝛼 ● This is context-dependent, is not prefixed beforehand and can change from time to time ● He arbitrarily “suggested” 𝛼 = .05 for illustration purposes ● Observing 𝑝 > 𝛼 does not prove 𝐻0; it just fails to reject it 24
  92. 92. Jerzy Neyman & Egon Pearson
  93. 93. Pearson ● Pearson saw Fisher’s tables as a way to compute critical values that “lent themselves to the idea of choice, in advance of experiment, of the risk of the ‘first kind of error’ which the experimenter was prepared to take” ● In a letter to Pearson, Gosset replied “if the chance is very small, say .00001, […] what it does is to show that if there is any alternative hypothesis which will explain the occurrence of the sample with a more reasonable probability, say .05 […], you will be very much more inclined to consider that the original hypothesis is not true” 26 Egon Pearson
  94. 94. Neyman ● Pearson saw the light: “the only valid reason for rejecting a statistical hypothesis is that some alternative explains the observed events with a greater degree of probability” ● In 1926 Pearson writes to Neyman to propose his ideas of hypothesis testing, which they developed and published in 1928 27 Jerzy Neyman
  95. 95. 28
  96. 96. Errors ● 𝛼 = 𝑃(type 1 error) ● 𝛽 = 𝑃(type 2 error) ● Power = 1 − 𝛽 29 [Figure: densities of the test statistic under H0 and H1 (n=10), with the 𝛼 and 𝛽 regions marked] Truth: 𝐻0 true / 𝐻1 true — Test accepts 𝐻0: true negative / Type 2 error — Test rejects 𝐻0: Type 1 error / true positive
  97. 97. Errors 30 ● 𝐻0: 𝜇 = 0, 𝐻1: 𝜇 = 0.5 ● 𝜎 = 1 ● 𝑡 = (𝑋̄ − 𝜇)/(𝜎/√𝒏) [Figures: H0 and H1 densities for n = 10 and n = 30, with 𝛼 and 𝛽 marked]
  98. 98. Errors 31 ● 𝐻0: 𝜇 = 0, 𝐻1: 𝜇 = 0.25 ● 𝜎 = 1 ● 𝑡 = (𝑋̄ − 𝝁)/(𝜎/√𝑛) [Figures: same plots with the smaller effect size]
  99. 99. Errors 32 ● 𝐻0: 𝜇 = 0, 𝐻1: 𝜇 = 0.25 ● 𝜎 = 3 ● 𝑡 = (𝑋̄ − 𝜇)/(𝝈/√𝑛) [Figures: same plots with the larger variance]
  100. 100. Neyman-Pearson hypothesis testing ● Define the null and alternative hypotheses, eg. 𝐻0: 𝜇 = 0 and 𝐻1: 𝜇 = 0.5 ● Set the acceptable error rates 𝛼 (type 1) and 𝛽 (type 2) ● Select the most powerful test 𝑇 for the hypotheses and 𝛼, which sets the critical value 𝑐 ● Given 𝐻1 and 𝛽, select the sample size 𝑛 required to detect an effect 𝒅 or larger ● Collect data and reject 𝐻0 if 𝑇 𝑋 ≥ 𝑐 ● The testing conditions are set beforehand: 𝐻0, 𝐻1, 𝛼, 𝛽 ● The experiment is designed for a target effect 𝑑: 𝑛 33
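A rough sketch of the sample-size step (assuming SciPy), using the textbook Gaussian-approximation formula n ≈ ((z₁₋α + z₁₋β)/d)² for a one-sided test with standardized effect d = (μ₁ − μ₀)/σ; the formula is a standard approximation, not taken from the slides.

```python
from math import ceil
from scipy import stats

def sample_size(mu0, mu1, sigma, alpha=0.05, beta=0.20):
    """Gaussian-approximation sample size for a one-sided test of
    H0: mu = mu0 vs H1: mu = mu1 (> mu0), with type 1 error alpha
    and type 2 error at most beta."""
    d = abs(mu1 - mu0) / sigma           # standardized effect size
    z_a = stats.norm.ppf(1 - alpha)
    z_b = stats.norm.ppf(1 - beta)
    return ceil(((z_a + z_b) / d) ** 2)

# Illustrative values from the earlier slides: H0: mu=0, H1: mu=0.5, sigma=1
print(sample_size(0, 0.5, 1, alpha=0.05, beta=0.20))   # about 25 observations
```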
  101. 101. Error rates and tests ● Under repeated experiments, the long-run error rate is 𝛼 ● Neyman-Pearson did not suggest values for it: “the balance [between the two kinds of error] must be left to the investigator […] we attempt to adjust the balance between the risks 1 and 2 to meet the type of problem before us” ● For 𝛽 they “suggested” 𝛼 ≤ 𝛽 ≤ 0.20 ● To Fisher, the choice of test statistic in his methodology was rather obvious to the investigator and wasn’t important to him ● Neyman-Pearson answered this by defining the “best” test: that which minimizes error 2 subject to a bound in error 1 34
  102. 102. Likelihood ratio test ● Pearson apparently suggested the likelihood ratio test for their new hypothesis testing methodology: ℒ = 𝑝(𝑋 | 𝐻0) / 𝑝(𝑋 | 𝐻1) ● Later found that as 𝑛 → ∞, −2 log ℒ ~ 𝜒² ● Neyman was reluctant, as he thought some Bayesian consideration had to be taken about prior distributions over the hypotheses (“inverse probability” at the time) ● For simple point hypotheses like 𝐻0: 𝜃 = 𝜃0 and 𝐻1: 𝜃 = 𝜃1, the likelihood ratio test turned out to be the most powerful ● In the case of comparing means of Gaussians, this reduces to Student’s 𝑡-test! 35
  103. 103. Composite hypotheses ● Neyman-Pearson theory extends to composite hypotheses of the form 𝐻: 𝜃 ∈ Θ, such as 𝐻1: 𝜇 > 0.5 ● The math got more complex, and Neyman was still somewhat reluctant: “it may be argued that it is impossible to estimate the probability of such a hypothesis without a knowledge of the relative a priori probabilities of the constituent simple hypotheses” ● Although “wishing to test the probability of a hypothesis A we have to assume that all hypotheses are a priori equally probable and calculate the probability a posteriory of A” 36
  104. 104. Null Hypothesis Significance Testing
  105. 105. Recap ● Fisher: significance testing ○ Inductive inference: rational belief when reasoning from sample to population ○ Rigorous experimental design to extract results from few samples ○ Replicate and develop your hypotheses, consider all significant and non-significant results together ○ Power can not be computed beforehand ● Neyman-Pearson: hypothesis testing ○ Inductive behavior: frequency of errors in judgments ○ Long-run results from many samples ○ p-values don’t have frequentist interpretations ● In the 1940s the two worlds began to appear as just one in statistics textbooks, and were rapidly adopted by researchers 38
  106. 106. 39
  107. 107. 39
  108. 108. Null Hypothesis Significance Testing ● Collect data ● Set hypotheses, typically 𝐻0: 𝜇 = 0 and 𝐻1: 𝜇 ≠ 0 ○ Either there is an effect or there isn’t ● Set 𝛼, typically to 0.05 or 0.01 ● Select test statistic based on hypotheses and compute 𝑝 ● If 𝑝 ≤ 𝛼, reject the null; fail to reject if 𝑝 > 𝛼 40
  109. 109. Common bad practices ● Run tests blindly without looking at your data ● Decide on 𝛼 after computing 𝑝 ● Report “(not) significant at the 0.05 level” instead of providing the actual 𝑝-value ● Report degrees of significance like “very” or “barely” ● Do not report test statistic alongside 𝑝, eg. 𝑡 58 = 1.54 ● Accept 𝐻0 if 𝑝 > 𝛼 or accept 𝐻1 if 𝑝 ≤ 𝛼 ● Interpret 𝑝 as the probability of the null ● Simply reject the null, without looking at the effect size ● Ignore the type 2 error rate 𝛽, ie. power analysis a posteriori ● Interpret statistically significant result as important ● Train the same models until significance is found ● Publish only statistically significant results ¯_(ツ)_/¯ 41
  110. 110. NHST for (M)IR 2 systems
  111. 111. Paired tests ● We typically want to compare our system B with some baseline system A ● We have the scores over 𝑛 inputs from some dataset ● The hypotheses are 𝐻0: 𝜇𝐴 = 𝜇𝐵 and 𝐻1: 𝜇𝐴 ≠ 𝜇𝐵 [Histograms of per-item performance: A mean=0.405, sd=0.213; B mean=0.425, sd=0.225]
  112. 112. Paired test ● If we ignore the structure of the experiment, we have a bad model and a test with low power Simple test: 𝑡 = 0.28, 𝑝 = 0.78 ● In our experiments, every observation from A corresponds to an observation from B, ie. they are paired observations ● The test can account for this to better model our data ● Instead of looking at A vs B, we look at A−B vs 0 Paired test: 𝑡 = 2.16, 𝑝 = 0.044 44 [Figure: per-item scores of A and B joined by lines]
  113. 113. Paired 𝑡-test ● Assumption ○ Data come from Gaussian distributions ● Equivalent to a 𝑡-test of 𝜇𝐷 = 0, where 𝐷ᵢ = 𝐵ᵢ − 𝐴ᵢ: 𝑡 = √𝑛 · 𝑋̄𝐵−𝐴 / 𝑠𝐵−𝐴 = √𝑛 · 𝐷̄ / 𝑠𝐷 = 0.35, 𝑝 = 0.73 45 Data (A, B): (.76, .75), (.33, .37), (.59, .59), (.28, .15), (.36, .49), (.43, .50), (.21, .33), (.43, .27), (.72, .81), (.40, .36)
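A sketch reproducing the paired t-test above with SciPy, plus the unpaired test on the same scores for contrast (it ignores the item-level pairing):

```python
from scipy import stats

# Per-item scores from the slide (10 items, systems A and B)
A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

# Paired t-test: equivalent to a one-sample t-test on D = B - A against 0
t, p = stats.ttest_rel(B, A)
print(f"paired:   t = {t:.2f}, p = {p:.2f}")   # t = 0.35, p = 0.73

# For comparison, the unpaired test ignores the pairing by item
t, p = stats.ttest_ind(B, A)
print(f"unpaired: t = {t:.2f}, p = {p:.2f}")
```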
  114. 114. Wilcoxon signed-rank test ● Assumptions: ○ Data measured at least at interval level ○ Distribution is symmetric ● Disregard for magnitudes ● Convert all untied 𝐷𝑖 to ranks 𝑅𝑖 ● Compute 𝑊+ and 𝑊− equal to the sum of 𝑅𝑖 that are positive or negative ● The test statistic is 𝑊 = min 𝑊+, 𝑊− = 21 ● 𝑊 follows a Wilcoxon distribution, from which one can calculate 𝑝 = 0.91 46 A B D rank .76 .75 -.01 1 .33 .37 .04 2 .59 .59 0 - .28 .15 -.13 8 .36 .49 .13 7 .43 .50 .07 4 .21 .33 .12 6 .43 .27 -.16 9 .72 .81 .09 5 .40 .36 -.04 3
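The same data through SciPy's Wilcoxon signed-rank test, as a sketch; note that SciPy assigns midranks to the tied |D| values and drops the zero, so its output can differ slightly from the slide's hand-ranked W = 21, p = 0.91.

```python
from scipy import stats

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]
D = [b - a for a, b in zip(A, B)]

# Signed-rank test on the deltas; the tied pair (D = 0) is dropped.
# SciPy uses midranks for tied |D|, the slide breaks ties arbitrarily,
# so W and p may differ a little from W = 21, p = 0.91.
w, p = stats.wilcoxon(D, zero_method="wilcox")
print(f"W = {w:.0f}, p = {p:.2f}")
```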
  115. 115. Sign test ● Complete disregard for magnitudes ● Simulate coin flips: was system B better (or worse) than A for some input? ● Follows Binomial distribution ● The test statistic is the number of successes (B>A), which is 5 ● 𝑝-value is the probability of 5 or more successes in 9 coin flips = 0.5 47 A B sign .76 .75 -1 .33 .37 +1 .59 .59 0 .28 .15 -1 .36 .49 +1 .43 .50 +1 .21 .33 +1 .43 .27 -1 .72 .81 +1 .40 .36 -1
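A sign-test sketch using SciPy's exact binomial test (`binomtest` assumes SciPy ≥ 1.7):

```python
from scipy import stats

# Signs of B - A from the slide: 5 positives, 4 negatives, 1 tie (dropped)
successes, trials = 5, 9

# One-sided sign test: probability of 5 or more successes in 9 fair coin flips
res = stats.binomtest(successes, trials, p=0.5, alternative="greater")
print(round(res.pvalue, 2))   # 0.5, as on the slide
```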
  116. 116. Bootstrap test ● Compute deltas, 𝐷ᵢ = 𝐵ᵢ − 𝐴ᵢ ● The empirical distribution 𝑒𝑐𝑑𝑓𝐷 estimates the true distribution 𝐹𝐷 ● Repeat for 𝑖 = 1, … , 𝑇, with 𝑇 large (thousands of times) ○ Draw a bootstrap sample 𝐵𝑖 by sampling 𝑛 scores with replacement from 𝑒𝑐𝑑𝑓𝐷 ○ Compute the mean of the bootstrap sample, 𝐵̄𝑖 ○ Let 𝐵̄ = (1/𝑇)∑𝐵̄𝑖 ○ The 𝐵̄𝑖 estimate the sampling distribution of the mean ● The 𝑝-value is (1/𝑇)∑𝕀(|𝐵̄𝑖 − 𝐵̄| ≥ |𝐷̄|) ≈ 0.71 48
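A minimal bootstrap sketch with NumPy following the procedure above; the two-sided comparison of |B̄ᵢ − B̄| against |D̄| is an interpretation consistent with the reported ≈ 0.71.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A
T = 100_000

# Resample the observed deltas with replacement and compute each bootstrap mean
boot_means = rng.choice(D, size=(T, len(D)), replace=True).mean(axis=1)

# p-value: how often a bootstrap mean deviates from the bootstrap centre
# by at least as much as the observed mean delta (two-sided)
p = np.mean(np.abs(boot_means - boot_means.mean()) >= abs(D.mean()))
print(round(p, 2))   # roughly 0.7, in line with the slide's ~0.71
```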
  117. 117. Permutation test ● Under the null hypothesis, an arbitrary score could have been generated by system A or by system B ● Repeat for 𝑖 = 1, … , 𝑇, with 𝑇 large (thousands of times) ○ Create a sample 𝑃𝑖 by randomly swapping the sign of each observation ○ Compute the mean 𝑃̄𝑖 ● The 𝑝-value is (1/𝑇)∑𝕀(|𝑃̄𝑖| ≥ |𝐷̄|) ≈ 0.73 49
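A matching sign-flip permutation sketch with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A
T = 100_000

# Under H0 the A/B labels are exchangeable per item, which for paired data
# amounts to randomly flipping the sign of each delta
signs = rng.choice([-1, 1], size=(T, len(D)))
perm_means = (signs * D).mean(axis=1)

# Two-sided p-value: fraction of sign-flipped means at least as extreme
# as the observed mean delta
p = np.mean(np.abs(perm_means) >= abs(D.mean()))
print(round(p, 2))   # roughly 0.7, in line with the slide's ~0.73
```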
  118. 118. The computer does all this for you 50
  119. 119. In practice 51 [Carterette, 2015a]
  120. 120. NHST for (M)IR Multiple systems
  121. 121. ANOVA ● Assume a model 𝑦𝑠𝑖 = 𝜇 + 𝜈𝑠 + 𝜈𝑖 + 𝑒𝑠𝑖, where 𝜈𝑠 = 𝑦̄𝑠∙ − 𝜇 ○ Implicitly “pairs” the observations by item ● The variance of the observed scores can be decomposed: 𝜎²𝑦 = 𝜎²𝑠 + 𝜎²𝑖 + 𝜎²𝑒 ● where 𝜎²𝑠 is the variance across system means ○ Low: system means are close to each other ○ High: system means are far from each other ● The null hypothesis is 𝐻0: 𝜇1 = 𝜇2 = 𝜇3 = ⋯ ○ Even if we reject, we still don’t know which system is different! ● The test statistic is of the form 𝐹 = 𝜎̂²𝑠 / 𝜎̂²𝑒 ● We’d like to have 𝜎²𝑠 ≫ 𝜎²𝑖, 𝜎²𝑒
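A sketch of this decomposition with statsmodels' ANOVA on a long-format table with system and item effects; the data and all names below are hypothetical, only the model formula mirrors the slide.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Hypothetical long-format results: 4 systems evaluated on the same 20 items
systems, items = 4, 20
item_effect = rng.normal(0, 0.15, size=items)        # some items are harder
system_effect = np.array([0.0, 0.02, 0.05, 0.05])    # small system differences
rows = [(f"s{s}", f"i{i}",
         0.5 + system_effect[s] + item_effect[i] + rng.normal(0, 0.05))
        for s in range(systems) for i in range(items)]
df = pd.DataFrame(rows, columns=["system", "item", "score"])

# y_si = mu + nu_s + nu_i + e_si: system and item entered as effects,
# so the comparison is implicitly "paired" by item
model = smf.ols("score ~ C(system) + C(item)", data=df).fit()
print(anova_lm(model))   # F test for the system effect
```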
  122. 122. Friedman test ● Same principle as ANOVA, but non-parametric ● Similarly to Wilcoxon, rank observations (per item) and estimate effects ● Ignores actual magnitudes; simply uses ranks 54
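A Friedman-test sketch on hypothetical scores of three systems over the same items (assuming SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical scores of 3 systems on the same 20 items (one array per system)
items = rng.uniform(0.2, 0.8, size=20)
s1 = items
s2 = items + rng.normal(0.02, 0.05, size=20)
s3 = items + rng.normal(0.00, 0.05, size=20)

# Friedman test: ranks the systems within each item and ignores magnitudes
chi2, p = stats.friedmanchisquare(s1, s2, s3)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```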
  123. 123. Multiple testing ● When testing multiple hypotheses, the probability of at least one type 1 error increases ● Multiple testing procedures correct 𝑝-values for a family-wise error rate 55 [Carterette, 2015a]
  124. 124. Tukey’s HSD ● Follow ANOVA to test 𝐻0: 𝜇1 = 𝜇2 = 𝜇3 = ⋯ ● The maximum observed difference between systems is likely the one causing the rejection ● Tukey’s HSD compares all pairs of systems, each with an individual 𝑝-value ● These 𝑝-values are then corrected based on the expected distribution of maximum differences under 𝐻0 ● In practice, it inflates 𝑝-values ● Ensures a family-wise type 1 error rate 56
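A Tukey HSD sketch with statsmodels on hypothetical scores; note this is a plain one-way layout and does not model the per-item blocking used in the ANOVA above.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)

# Hypothetical per-item scores for 4 systems, stacked into one long array
scores = np.concatenate([rng.normal(loc, 0.1, size=20)
                         for loc in (0.40, 0.42, 0.45, 0.50)])
groups = np.repeat(["A", "B", "C", "D"], 20)

# All pairwise comparisons, with p-values adjusted to keep a
# family-wise type 1 error rate of 0.05
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```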
  125. 125. Tukey’s HSD 57 [Carterette, 2015a]
  126. 126. Others ● There are many other procedures to control for multiple comparisons ● Bonferroni: very conservative (low power) ● Dunnett’s: compare all against a control (eg. baseline) ● Other procedures control for the false discovery rate, ie. the probability of a type 1 error given 𝑝 < 𝛼 ● One way or another, they all inflate 𝑝-values 58
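For comparison, a sketch of generic p-value adjustments (Bonferroni, Holm, Benjamini-Hochberg FDR) with statsmodels on hypothetical raw p-values:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from 6 pairwise system comparisons
pvals = np.array([0.003, 0.021, 0.048, 0.12, 0.31, 0.74])

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    # One way or another, the adjusted p-values are inflated
    print(f"{method:10s}", np.round(adjusted, 3), reject)
```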
  127. 127. Part III: What Else? Validity! Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  128. 128. What else can go wrong? Validity! ● Validity reprise ● Example I: Inter-rater agreement ● Example II: Adversarial examples 2
  129. 129. Validity Reprise 3
  130. 130. Validity ● Validity ○ A valid experiment is an experiment actually measuring what the experimenter intended to measure ○ Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ○ Internal validity: is this relationship causal, or could confounding factors explain the relation? ○ External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? ○ Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment? 4
  131. 131. Inter-rater agreement in music similarity 5
  132. 132. Automatic recommendation / Playlisting 6 Millions of songs Result list Query song + =
  133. 133. Automatic recommendation / Playlisting 7 Millions of songs Result list Query song + = Similarity
  134. 134. Computation of similarity between songs ● Collaborative filtering (Spotify, Deezer?) ● Social meta-data (Last.Fm?) ● Expert knowledge (Pandora?) ● Meta-data from the web ● ... ● Audio-based 8
  135. 135. Computation of similarity between songs 9 Songs as audio Switching to frequencies Computation of features Machine Learning → similarity (metric) S(a1, a2) = ? Pictures from E. Pampalk’s PhD thesis, 2006
  136. 136. Computation of similarity between songs 10 Query song Similar? Similar? …………….
  137. 137. Computation of similarity between songs 11 Query song Similar!! Similar!! ……………. max(S)=97.9
  138. 138. Are we there yet? 14
  139. 139. How can we evaluate our models of music similarity? 15 45 87 100 23 100 87 100 23 45
  140. 140. How can we evaluate our models of music similarity? 16 45 87 100 23 100 87 100 23 45 Do these numbers correspond to a human assessment of music similarity?
  141. 141. MIREX - Music Information Retrieval eXchange 17
  142. 142. MIREX - Music Information Retrieval eXchange ● Standardized testbeds allowing for fair comparison of MIR systems ● range of different tasks ● based on human evaluation ○ Cranfield: remove users, look at annotations only 18
  143. 143. MIREX - Music Information Retrieval eXchange ● Standardized testbeds allowing for fair comparison of MIR systems ● range of different tasks ● based on human evaluation ○ Cranfield: remove users, look at annotations only ● What is the level of agreement between human raters/annotators? ● What does this mean for the evaluation of MIR systems? ● Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016. 19
  144. 144. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task, 2006-2014 20
  145. 145. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task, 2006-2014 ● 5000 song database ● participating MIR systems compute 5000x5000 distance matrix ● 60 randomly selected queries ● return 5 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● “Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 10 (very similar or identical).” 21
  146. 146. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task, 2006-2014 ● 7000 song database ● participating MIR systems compute 7000x7000 distance matrix ● 100 randomly selected queries ● return 5 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● “Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical).” 22
  147. 147. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task, 2006-2014 ● 7000 song database ● participating MIR systems compute 7000x7000 distance matrix ● 50 randomly selected queries ● return 10 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● “Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical).” 23
  148. 148. Experimental design 24 Independent variable treatment manipulated by researcher Dependent variable effect measured by researcher ● measure the effect of different treatments on a dependent variable
  149. 149. Experimental design 25 Independent variable treatment manipulated by researcher Type of algorithm Dependent variable effect measured by researcher FINE similarity rating ● measure the effect of different treatments on a dependent variable
  150. 150. Experimental design 26 Independent variable treatment manipulated by researcher Type of algorithm Dependent variable effect measured by researcher FINE similarity rating ● measure the effect of different treatments on a dependent variable MIREX AMS 2014
  151. 151. Experimental design 27 Independent variable treatment manipulated by researcher Type of algorithm Dependent variable effect measured by researcher FINE similarity rating ● measure the effect of different treatments on a dependent variable MIREX AMS 2014
  152. 152. What about validity? ● A valid experiment is an experiment actually measuring what the experimenter intended to measure ● What is the intention of the experimenter in the AMS task? ● What do we want to measure here? 28
  153. 153. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task, 2006-2014 ● 7000 song database ● participating MIR systems compute 7000x7000 distance matrix ● 100 randomly selected queries ● return 5 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● “Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical).” 29
  154. 154. Rate the similarity! 30
  155. 155. Rate the similarity! 31 Query song Candidate song 0 … 100
  156. 156. Rate the similarity! 32
  157. 157. Rate the similarity! 33
  158. 158. Rate the similarity! 34
  159. 159. Rate the similarity! 35
  160. 160. Rate the similarity! 36 ● Factors that influence human music perception ○ Schedl M., Flexer A., Urbano J.: The Neglected User in Music Information Retrieval Research, J. of Intelligent Information Systems, December 2013, Volume 41, Issue 3, pp 523-539, 2013.
  161. 161. Rate the similarity! 37
  162. 162. Inter-rater agreement in AMS 38
  163. 163. Inter-rater agreement in AMS ● AMS 2006 is the only year with multiple graders ● each query/candidate pair evaluated by three different human graders ● each grader gives a FINE score between 0 … 10 (not similar … very similar) 39
  164. 164. Inter-rater agreement in AMS ● AMS 2006 is the only year with multiple graders ● each query/candidate pair evaluated by three different human graders ● each grader gives a FINE score between 0 … 10 (not similar … very similar) ● correlation between pairs of graders 40
  165. 165. Inter-rater agreement in AMS ● inter-rater agreement for different intervals of FINE scores 41
  166. 166. Inter-rater agreement in AMS ● inter-rater agreement for different intervals of FINE scores 42
  167. 167. Inter-rater agreement in AMS ● inter-rater agreement for different intervals of FINE scores 43
  168. 168. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval 44
  169. 169. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval 45
  170. 170. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval 46 Average = 6.54
  171. 171. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval ● what sounds very similar to one grader, will on average receive a score of only 6.54 from other graders ● this constitutes an upper bound for average FINE scores in AMS ● there will always be users that disagree moving target 47 Average = 6.54
  172. 172. Comparison to the upper bound ● compare top performing systems 2007, 2009 - 2014 to upper bound 48
  173. 173. Comparison to the upper bound ● compare top performing systems 2007, 2009 - 2014 to upper bound 49
  174. 174. Comparison to the upper bound ● upper bound has already been reached in 2009 50 PS2
  175. 175. Comparison to the upper bound ● upper bound has already been reached in 2009 51 [Chart: PS2 marked across years]
  176. 176. Comparison to the upper bound 52 ● upper bound has already been reached in 2009 ● can upper bound be surpassed in the future? ● or is this an inherent problem due to low inter-rater agreement in human evaluation of music similarity? ● this prevents progress in MIR research on music similarity ● AMS task dead since 2015
  177. 177. What went wrong here? 53
  178. 178. What about validity? ● A valid experiment is an experiment actually measuring what the experimenter intended to measure ● What is the intention of the experimenter in the AMS task? ● What do we want to measure here? 54
  179. 179. Construct Validity ● Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment? 55
  180. 180. Construct Validity ● Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment? ● Unclear intention: to measure an abstract concept of music similarity? ● Possible solutions: ○ more fine-grained notion of similarity ○ ask a more specific question? ○ does something like abstract music similarity even exist? ○ evaluation of complete MIR systems centered around a specific task (use case) could lead to a much clearer hypothesis ○ Remember the MIREX Grand Challenge on User Experience (2014)? ○ “You are creating a short video about a memorable occasion that happened to you recently, and you need to find some (copyright-free) songs to use as background music.” 56
  181. 181. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors that influence human music perception need to be controlled in the experimental design 57
  182. 182. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors that influence human music perception need to be controlled in the experimental design ● Possible solutions: 58 Independent variable Type of algorithm Dependent variable FINE similarity rating
  183. 183. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors that influence human music perception need to be controlled in the experimental design ● Possible solutions: 59 Independent variable Type of algorithm Dependent variable FINE similarity rating Control variable gender, age musical training/experience/preference type of music, ...
  184. 184. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors that influence human music perception need to be controlled in the experimental design ● Possible solutions: 60 Independent variable Type of algorithm Dependent variable FINE similarity rating Control variable gender, age: female only, age 20-30y musical training/experience/preference: music professionals type of music: piano concertos only
  185. 185. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors that influence human music perception need to be controlled in the experimental design ● Possible solutions: 61 Independent variable Type of algorithm Dependent variable FINE similarity rating Control variable gender, age: female only, age 20-30y musical training/experience/preference: music professionals type of music: piano concertos only Very specialized, limited generality
  186. 186. Internal Validity ● Control variable, monitor it: 62
  187. 187. Internal Validity ● Control variable, monitor it: 63 Exponential complexity
  188. 188. External Validity ● External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? 64
  189. 189. External Validity ● External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? ● Unclear target population: identical with the sample of 7000 US pop songs? All US pop music in general? ● Beware: cross-collection studies show dramatic losses in performance ○ Bogdanov, D., Porter, A., Herrera Boyer, P., & Serra, X. (2016). Cross-collection evaluation for music classification tasks. ISMIR 2016. ● Possible solutions: ○ Clear target population ○ More restricted target populations ○ Much larger data samples ○ Use case? 65
  190. 190. Conclusion Validity ● Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ● A large difference in effect measures is needed for users to see the difference 66
  191. 191. Conclusion Validity ● Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ● A large difference in effect measures is needed for users to see the difference 67 J. Urbano, J. S. Downie, B. McFee and M. Schedl: How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval, ISMIR 2012.
  192. 192. Conclusion Validity ● Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ● A large difference in effect measures is needed for users to see the difference ● Possible solutions: ○ Are there system measures that better correspond to user measures? ○ Use case! 68
  193. 193. Lack of inter-rater agreement in other areas 69
  194. 194. Lack of inter-rater agreement ● It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound 70
  195. 195. Lack of inter-rater agreement ● It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound ● MIREX ‘Music Structural Segmentation’ task ○ Human annotations of structural segmentations (structural boundaries and labels denoting repeated segments, e.g. chorus, verse, …) ○ Algorithms have to produce such annotations ○ F1-score between different annotators as upper bound ○ Upper bound reached at least for certain music (classical and world music) 71
  196. 196. Lack of inter-rater agreement ● It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound ● MIREX ‘Music Structural Segmentation’ task ○ Human annotations of structural segmentations (structural boundaries and labels denoting repeated segments, e.g. chorus, verse, …) ○ Algorithms have to produce such annotations ○ F1-score between different annotators as upper bound ○ Upper bound reached at least for certain music (classical and world music) ○ Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, J. of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016. ○ Smith, J.B.L., Chew, E.: A meta-analysis of the MIREX structure segmentation task, ISMIR, 2013. ○ Serrà, J., Müller, M., Grosche, P., & Arcos, J.L.: Unsupervised music structure annotation by time series structure features and segment similarity, IEEE Transactions on Multimedia, Special Issue on Music Data Mining, 16(5), 1229–1240, 2014. 72
  197. 197. Inter-rater agreement and upper bounds ● Extraction of metrical structure ○ Quinton, E., Harte, C., Sandler, M.: Extraction of metrical structure from music recordings, DAFX 2015. ● Melody estimation ○ Balke, S., Driedger, J., Abeßer, J., Dittmar, C., Müller, M.: Towards Evaluating Multiple Predominant Melody Annotations in Jazz Recordings, ISMIR 2016. ○ Bosch J.J.,Gomez E..: Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms, Proc. of the Conference on Interdisciplinary Musicology, 2014. ● Timbre and rhythm similarity ○ Panteli, M., Rocha, B., Bogaards, N., Honingh, A.: A model for rhythm and timbre similarity in electronic dance music. Musicae Scientiae, 21(3), 338-361, 2017. ● Many more? 73
  198. 198. Adversarial Examples 74
  199. 199. Adversarial Examples - Image Recognition ● An adversary slightly and imperceptibly changes an input image to fool a machine learning system ○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014. 75 original + noise = adversarial example all classified as “Camel”
  200. 200. Adversarial Examples - MIR ● Imperceptibly filtered audio fools genre recognition system ○ Sturm B.L.: A simple method to determine if a music information retrieval system is a “horse”, IEEE Trans. on Multimedia, 16(6), pp. 1636-1644, 2014. 76 deflate
  201. 201. Adversarial Examples - MIR ● Imperceptibly filtered audio fools genre recognition system http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html 77 deflate
  202. 202. External Validity ● External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? ● Unclear target population: identical with a sample of a few hundred ISMIR or GTZAN songs? ● Or are we aiming at genre classification in general? ● If the target is genre classification in general, there is a problem! 78
  203. 203. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Why can these MIR systems be fooled so easily? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ○ What is the confounding variable? 79
  204. 204. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Why can these MIR systems be fooled so easily? ● E.g.: in the case of rhythm classification, systems were picking up tempo, not rhythm! Tempo acted as a confounding factor! ○ Sturm B.L.: The Horse Inside: Seeking Causes Behind the Behaviors of Music Content Analysis Systems, Computers in Entertainment, 14(2), 2016. 80
  205. 205. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ○ What is the confounding variable? ● high dimensionality of the data input space? ○ Small perturbations to input data might accumulate over many dimensions, with minor changes ‘snowballing’ into larger changes in the transfer functions of deep neural networks ○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014 81
  206. 206. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ○ What is the confounding variable? ● linearity of models? ○ linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect … rectified linear units (ReLU)? ○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014 82
  207. 207. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ● Open question: what is the confounding variable? 83
  208. 208. Summary: Validity 84
  209. 209. Validity ● Validity ○ A valid experiment is an experiment actually measuring what the experimenter intended to measure ○ Conclusion validity ○ Internal validity ○ External validity ○ Construct validity ● Care about the validity of your experiments! ● Validity is the right framework to talk about these problems 85
  210. 210. Part IV: So? Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  211. 211. What’s in a 𝑝-value? ● Confounds effect size and sample size, eg. 𝑡 = √𝑛 (𝑋̄ − 𝜇)/𝜎 ● Unfortunately, we virtually never check power. Don't ever accept 𝐻0 ● "Easy" way to achieve significance is obtaining more data, but the true effect remains the same ● Even if one rejects 𝐻0, it could still be true 2
  212. 212. 𝑃(𝐻0 | 𝑝 ≤ 𝛼) = 𝑃(𝑝 ≤ 𝛼 | 𝐻0)𝑃(𝐻0) / 𝑃(𝑝 ≤ 𝛼) = 𝑃(𝑝 ≤ 𝛼 | 𝐻0)𝑃(𝐻0) / [𝑃(𝑝 ≤ 𝛼 | 𝐻0)𝑃(𝐻0) + 𝑃(𝑝 ≤ 𝛼 | 𝐻1)𝑃(𝐻1)] = 𝛼𝑃(𝐻0) / [𝛼𝑃(𝐻0) + (1 − 𝛽)𝑃(𝐻1)] ● 𝑃(𝐻0) = 𝑃(𝐻1) = 0.5 ○ 𝛼 = 0.05, 𝛽 = 0.05 → 𝑃(𝐻0 | p ≤ 𝛼) = 0.05 ○ 𝛼 = 0.05, 𝛽 = 0.5 → 𝑃(𝐻0 | p ≤ 𝛼) = 0.09 ● 𝑃(𝐻0) = 0.8, 𝑃(𝐻1) = 0.2 ○ 𝛼 = 0.05, 𝛽 = 0.05 → 𝑃(𝐻0 | p ≤ 𝛼) = 0.17 ○ 𝛼 = 0.05, 𝛽 = 0.5 → 𝑃(𝐻0 | p ≤ 𝛼) = 0.29 3
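The four numbers above follow directly from the last expression; a small sketch to verify them:

```python
def prob_h0_given_rejection(alpha, beta, prior_h0):
    """P(H0 | p <= alpha) for simple hypotheses, as derived on the slide."""
    prior_h1 = 1 - prior_h0
    return alpha * prior_h0 / (alpha * prior_h0 + (1 - beta) * prior_h1)

for prior_h0 in (0.5, 0.8):
    for beta in (0.05, 0.5):
        p = prob_h0_given_rejection(alpha=0.05, beta=beta, prior_h0=prior_h0)
        print(f"P(H0)={prior_h0}, beta={beta}: P(H0 | p<=alpha) = {p:.2f}")
        # prints 0.05, 0.09, 0.17, 0.29 for the four combinations above
```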
  213. 213. 𝐻0 is always false ● In this kind of dataset-based experiment, 𝐻0: 𝜇 = 0 is always false ● Two systems may be veeeeeery similar, but not the same ● Binary accept/reject decisions don't even make sense ● Why bother with multiple comparisons then? ● Care about type S(ign) and M(agnitude) errors ● To what extent do non-parametric methods make sense (Wilcoxon, Sign, Friedman), especially combined with parametric procedures like Tukey’s? 4
  214. 214. Binary thinking no more ● Nothing wrong with the 𝑝-value, but with its use ● 𝑝 as a detective vs 𝑝 as a judge ● Any 𝛼 is completely arbitrary ● What is the cost of a type 2 error? ● How does the lack of validity affect NHST? (measures, sampling frames, ignoring cross-assessor variability, etc) ● What about researcher degrees of freedom? ● Why not focus on effect sizes? Intervals, correlations, etc. ● Bayesian methods? What priors? 5
  215. 215. Assumptions ● In dataset-based (M)IR experiments, test assumptions are false by definition ● 𝑝-values are, to some degree, approximated ● So again, why use any threshold? ● So which test should you choose? ● Run them all, and compare ● If they tend to disagree, take a closer look at the data ● Look beyond the experiment at hand, gather more data ● Always perform error analysis to make sense of it 6
  216. 216. Replication ● Fisher, and especially Neyman-Pearson, advocated for replication ● A 𝑝-value is only concerned with the current data ● The hypothesis testing framework only makes sense with repeated testing ● In (M)IR we hardly do it; we're stuck with the same datasets 7
  217. 217. Significant ≠ Relevant ≠ Interesting 8 All research
  218. 218. Significant ≠ Relevant ≠ Interesting 8 All research Interesting
  219. 219. Significant ≠ Relevant ≠ Interesting 8 All research Interesting Relevant
  220. 220. Significant ≠ Relevant ≠ Interesting 8 All research Interesting Relevant Statistically Significant
  221. 221. There is always random error in our experiments, so we always need some kind of statistical analysis But there is no point in being too picky or intense about how we do it Nobody knows how to do it properly, and different fields adopt different methods What is far more productive is to adopt an exploratory attitude rather than mechanically testing 9
  222. 222. References Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  223. 223. ● Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User Satisfaction. ACM SIGIR ● Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null Hypothesis Testing: Problems, Prevalence, and an Alternative. Journal of Wildlife Management ● Armstrong, T.G., Moffat, A., Webber, W. & Zobel, J. (2009). Improvements that don't add up: ad-hoc retrieval results since 1998. CIKM ● Balke, S., Driedger, J., Abeßer, J., Dittmar, C. & Müller, M. (2016). Towards Evaluating Multiple Predominant Melody Annotations in Jazz Recordings. ISMIR ● Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science ● Bosch J.J. & Gómez E. (2014). Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms. Conference on Interdisciplinary Musicology ● Boytsov, L., Belova, A. & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR experiments. SIGIR ● Carterette, B. (2012). Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM Transactions on Information Systems ● Carterette, B. (2015a). Statistical Significance Testing in Information Retrieval: Theory and Practice. ICTIR ● Carterette, B. (2015b). Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR ● Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum ● Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. ACM SIGIR ● Downie, J. S. (2004). The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future. Computer Music Journal ● Fisher, R. A. (1925). Statistical Methods for Research Workers. Cosmo Publications ● Flexer, A. (2006). Statistical Evaluation of Music Information Retrieval Experiments. Journal of New Music Research 2
  224. 224. ● Flexer, A., & Grill, T. (2016). The Problem of Limited Inter-rater Agreement in Modelling Music Similarity. Journal of New Music Research ● Gelman, A. (2013b). The problem with p-values is how they’re used. ● Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don’t Have to Worry About Multiple Comparisons. Journal of Research on Educational Effectiveness ● Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time. ● Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist ● Gelman, A., & Stern, H. (2006). The Difference Between Significant and Not Significant is not Itself Statistically Significant. The American Statistician ● Goodfellow, I.J., Shlens, J. & Szegedy, C. (2014). Explaining and harnessing adversarial examples. ICLR ● Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., & Langlois, T. (2014). On Evaluation Validity in Music Autotagging. ACM Computing Research Repository ● Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User Evaluations Give the Same Results? ACM SIGIR ● Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. ISMIR ● Hull, D. (1993). Using Statistical Testing in the Evaluation of Retrieval Experiments. ACM SIGIR ● Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine ● Lehmann, E.L. (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? Journal of the American Statistical Association ● Lehmann, E.L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer 3
  225. 225. ● Lee, J. H., & Cunningham, S. J. (2013). Toward an understanding of the history and impact of user studies in music information retrieval. Journal of Intelligent Information Systems ● Marques, G., Domingues, M. A., Langlois, T., & Gouyon, F. (2011). Three Current Issues In Music Autotagging. ISMIR ● Neyman, J. & Pearson, E.S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I. Biometrika ● Panteli, M., Rocha, B., Bogaards, N. & Honingh, A. (2017). A model for rhythm and timbre similarity in electronic dance music. Musicae Scientiae ● Quinton, E., Harte, C. & Sandler, M. (2015). Extraction of metrical structure from music recordings. DAFX ● Sakai, T. (2014). Statistical Reform in Information Retrieval? ACM SIGIR Forum ● Savoy, J. (1997). Statistical Inference in Retrieval Effectiveness Evaluation. Information Processing and Management ● Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal of Intelligent Information Systems ● Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin ● Serrà, J., Müller, M., Grosche, P., & Arcos, J.L. (2014). Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Trans. on Multimedia ● Smith, J.B.L. & Chew, E. (2013). A meta-analysis of the MIREX structure segmentation task. ISMIR ● Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. ACM CIKM ● Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes. ACM SIGIR 4
  226. 226. ● Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that They Are Unrealistic. European Workshop on Human-Computer Interaction and Information Retrieval ● Student. (1908). The Probable Error of a Mean. Biometrika ● Sturm, B. L. (2013). Classification Accuracy is Not Enough: On the Evaluation of Music Genre Recognition Systems. Journal of Intelligent Information Systems ● Sturm, B. L. (2014). The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval. Journal of New Music Research ● Sturm, B.L. (2014). A simple method to determine if a music information retrieval system is a horse. IEEE Trans. on Multimedia ● Sturm, B.L. (2016). “The Horse” Inside: Seeking Causes Behind the Behaviors of Music Content Analysis Systems. Computers in Entertainment ● Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information Processing and Management ● Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. ACM SIGIR ● Urbano, J. (2015). Test Collection Reliability: A Study of Bias and Robustness to Statistical Assumptions via Stochastic Simulation. Information Retrieval Journal ● Urbano, J., Downie, J. S., McFee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval. ISMIR ● Urbano, J., Marrero, M., & Martín, D. (2013a). A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation. ACM SIGIR ● Urbano, J., Marrero, M. & Martín, D. (2013b). On the Measurement of Test Collection Reliability. ACM SIGIR ● Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems 5
  227. 227. ● Urbano, J. & Marrero, M. (2016). Toward Estimating the Rank Correlation between the Test Collection Results and the True System Performance. SIGIR ● Urbano, J. & Nagler, T. (2018). Stochastic Simulation of Test Collections: Evaluation Scores. SIGIR ● Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. ACM SIGIR ● Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. ACM CIKM ● Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press ● Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? ACM SIGIR 6
