Statistical Analysis of Results in Music Information Retrieval

Statistical Analysis of Results in Music Information Retrieval: Why and How
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris

Julián Urbano
● Assistant Professor @ TU Delft, The Netherlands
● BSc-PhD Computer Science
● 10 years of research in (Music) Information Retrieval
o And related, like information extraction or crowdsourcing for IR
● Active @ ISMIR since 2010
● Research topics
o Evaluation methodologies
o Statistical methods for evaluation
o Simulation for evaluation
o Low-cost evaluation
Supported by the European Commission H2020 project TROMPA (770376-2)

Arthur Flexer
● Austrian Research Institute for Artificial Intelligence -
Intelligent Music Processing and Machine Learning Group
● PhD in Psychology, minor degree in Computer Science
● 10 years of research in neuroscience, 13 years in MIR
● Published on:
o role of experiments in MIR
o problems of ground truth
o problems of inter-rater agreement
● Semi-retired veteran DJ
Supported by the Vienna Science and Technology Fund (WWTF, project MA14-018)

Disclaimer
● Design of experiments (DOE) used and needed in all kinds
of sciences
● DOE is a science in its own right
● No fixed “how-tos” or “cookbooks”
● Different schools and opinions
● Established ways to proceed in different fields
● We will present current procedures in (M)IR
● But also discuss, criticize, point to problems
● Present alternatives and solutions (?)

Program
● Part I: Why (we evaluate the way we do it)?
o Tasks and use cases
o Cranfield
o Validity and reliability
● Part II: How (should we not analyze results)?
o Populations and samples
o Estimating means
o Fisher, Neyman-Pearson, NHST
o Tests and multiple comparisons
● Part III: What else (should we care about)?
o Inter-rater agreement
o Adversarial examples
● Part IV: So (what does it all mean)?
● Discussion?

Part I: Why?

Typical Information Retrieval task
[Diagram, built up step by step: Documents, an Information Need or Topic, the IR System, a query, the Results, and a further query after the Results]

Two recurrent questions
● How good is my system?
○ What does good mean?
○ What is good enough?
● Is system A better than system B?
○ What does better mean?
○ How much better?
● What do we talk about?
○ Efficiency?
○ Effectiveness?
○ Ease?

Hypothesis: A is better than B
How would you design
this experiment?

Measure user experience
● We are interested in user-measures
○ Time to complete task
○ Idle time
○ Success/Failure rate
○ Frustration
○ Ease of learning
○ Ease of use …
● Their distributions describe user experience
○ For an arbitrary user and topic (and document collection?)
○ What can we expect?
[Figures: distribution of time to complete task (from 0 upwards); distribution of frustration (none, some, much)]
Sources of variability
user-measure = f(documents, topic, user, system)
● Our goal is the distribution of the user-measure for our system, which is impossible to calculate
○ (Possibly?) infinite populations
● As usual, the best we can do is estimate it
○ The estimate becomes subject to random error

Desired: Live Observation
● Estimate distributions with a live experiment
● Sample documents, topics and users
● Have them use the system, for real
● Measure user experience, implicitly or explicitly
● Many problems
○ High cost, representativeness
○ Ethics, privacy, hidden effects, inconsistency
○ Hard to replicate experiment and repeat results
○ Just plain impossible to reproduce results
* replicate = same method, different sample; reproduce = same method, same sample

Alternative: Fixed samples
● Get (hopefully) good samples, fix them and reuse
○ Documents
○ Topics
○ Users
● Promotes reproducibility and reduces variability
● But we can’t just fix the users!

Simulate users
● Cranfield paradigm: remove users, but include a user-abstraction, fixed across experiments
○ Static user component: judgments or annotations in ground truth
○ Dynamic user component: effectiveness or performance measures
● Removes all sources of variability, except systems
user-measure = f(system)

Datasets (aka Test Collections)
● Controlled sample of documents, topics and judgments, shared across researchers…
● …combined with performance measures
● (Most?) important resource for IR research
○ Experiments are inexpensive (datasets are not!)
○ Research becomes systematic
○ Evaluation is deterministic
○ Reproducibility is not only possible but easy

Cranfield-like evaluation
[Diagram: for each topic, the System returns docs from the collection; an Annotator following an Annotation protocol provides a relevance label (rel) per doc; a Measure turns these into a score. Repeat over topics.]

User Models & Annotation Protocols
● In practice, there are hundreds of options
● Utility of a document w.r.t. scale of annotation
○ Binary or graded relevance?
○ Linear utility w.r.t. relevance? Exponential?
○ Independent of other documents?
● Top heaviness to penalize late arrival
○ No discount?
○ Linear discount? Logarithmic?
○ Independent of other documents?
● Interaction, browsing?
● Cutoff
○ Fixed: only top k documents?
○ Dynamic: wherever some condition is met?
○ All documents?
● etc
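The options above can be made concrete with a small sketch. The function below is an illustrative DCG-style measure in which the gain function, the rank discount and the cutoff are explicit parameters. The function name, parameter names and the example judgments are assumptions made for this illustration, not part of any specific evaluation toolkit.

```python
import math

def dcg(relevances, gain="linear", discount="log", k=None):
    """DCG-style score of a ranked list of graded relevance labels.

    Illustrative sketch: `gain`, `discount` and `k` stand for the
    user-model decisions (utility, top-heaviness, cutoff).
    """
    gains = {
        "linear": lambda r: float(r),             # utility linear in relevance
        "exponential": lambda r: 2.0 ** r - 1.0,  # utility exponential in relevance
    }
    discounts = {
        "none": lambda rank: 1.0,                       # no top-heaviness
        "linear": lambda rank: 1.0 / rank,              # linear discount
        "log": lambda rank: 1.0 / math.log2(rank + 1),  # logarithmic discount
    }
    top = relevances if k is None else relevances[:k]   # fixed cutoff, if any
    return sum(gains[gain](r) * discounts[discount](rank)
               for rank, r in enumerate(top, start=1))

ranking = [2, 0, 1, 0, 2]  # hypothetical graded judgments, 2 = most relevant
for g in ("linear", "exponential"):
    for d in ("none", "linear", "log"):
        print(g, d, round(dcg(ranking, g, d, k=5), 3))
```

The same system output receives a different score under each combination, which is why the user model has to be fixed before systems are compared.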

Tasks vs Use Cases
● Everything depends on the use case of interest
● The same task may have several use cases (or subtasks)
○ Informational
○ Navigational
○ Transactional
○ etc
● Different use cases may imply, suggest or require different decisions w.r.t. system input/output, goal, annotations, measures...

Task: instrument recognition
What is the use case?

Instrument recognition
1) Given a piece of music as input, identify the instruments
that are played in it:
○ For each window of T milliseconds, return a list of instruments being
played (extraction).
○ Return the instruments being played anywhere in the piece (classification).
2) Given an instrument as input, retrieve a list of music
pieces in which the instrument is played:
○ Return the list of music pieces (retrieval).
○ Return the list, but for each piece also provide a clip (start-end) where the instrument is played (retrieval+extraction).
● Each case implies different systems, annotations and measures, and even different end-users (non-human?)
https://github.com/cosmir/open-mic/issues/19

But wait a minute...
● Are we estimating distributions about users or distributions about systems?
system-measure = f(system, protocol, measure)
system-measure = f(system, protocol, measure, annotator, context, ...)
● Whether or not the system output satisfies the user has nothing to do with how we measure its performance
● What is the best way to predict user satisfaction?

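Real world vs. The lab
[Diagram: the Real World (the Web, an information need, an IR System) is related by Abstraction to the Cranfield setting (Topic, Documents and Relevance Judgments in a Test Collection as the Static Component; Effectiveness Measures such as AP, DCG and RR as the Dynamic Component; an IR System), and by Prediction back to the real world]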

Cranfield in Music IR
[Diagram: Input, System, Output, Annotations, Measure, Users]

Classes of Tasks in Music IR
● Retrieval
○ Music similarity (collection and query track in, list of tracks out)
○ Query by humming (collection and hum in, list of tracks out)
○ Recommendation (collection and user in, list of tracks out)
● Annotation
○ Genre classification (track in, genre out)
○ Mood recognition (track in, moods out)
○ Autotagging (track in, tags out)
● Extraction
○ Structural segmentation (track in, segments out)
○ Melody extraction
○ Chord estimation (track in, chord sequence out)

Evaluation as Simulation
● Cranfield-style evaluation with datasets is a simulation of the user-system interaction, deterministic, maybe even simplistic, but a simulation nonetheless
● Provides us with data to estimate how good our systems are, or which one is better
● Typically, many decisions are made for the practitioner
● Comes with many assumptions and limitations

Validity and Reliability
● Validity: are we measuring what we want to? (Systematic error)
○ Internal: are observed effects due to hidden factors?
○ External: are input items, annotators, etc generalizable?
○ Construct: do system-measures match user-measures?
○ Conclusion: how good is good and how better is better?
● Reliability: how repeatable are the results? (Random error)
○ Will I obtain the same results with a different collection?
○ How large do collections need to be?
○ What statistical methods should be used?

[Figure: four panels labeled Not Valid / Reliable, Valid / Not Reliable, Not Valid / Not Reliable, and Valid / Reliable]

So long as...
• So long as the dataset is large enough to minimize random
error and draw reliable conclusions
• So long as the tools we use to make those conclusions can
be trusted
• So long as the task and use case are clear
• So long as the annotation protocol and performance
measure (ie. user model) are realistic and actually measure
something meaningful for the use case
• So long as the samples of inputs and annotators present in
the dataset are representative for the task

“If you can’t measure it, you can’t improve it.”
—Lord Kelvin
“But measurements have to be trustworthy.”
—yours truly

Part II: How?

Populations of interest
● The task and use case define the populations of interest
○ Music tracks
○ Users
○ Annotators
○ Vocabularies
● Impossible to study all entities in these populations
○ Too many
○ Don’t exist anymore (or yet)
○ Too far away
○ Too expensive
○ Illegal
3

Populations and samples
● Our goal is to study the performance of the system on
these populations
○ Lets us know what to expect from the system in the real world
● We’re typically interested in the mean: the expectation 𝜇
○ Based on this we would decide what research line to pursue, what paper
to publish, what project to fund, etc.
○ But variability is also important, though often neglected
● A dataset represents just a sample from that population
○ By studying the sample we could generalize back to the population
○ But will bear some degree of random error due to sampling
4

[Diagram: Target population, Accessible population or sampling frame, and Sample. External validity is marked between the target population and the sampling frame; Content Validity and Reliability between the sampling frame and the sample. Inference and Generalization lead from the sample back to the target population.]

● This is an estimation problem
● The objective is the distribution $F$ of performance over the population, specifically the mean $\mu$
● Given a sample of observations $X_1, \dots, X_n$, estimate $\mu$
● Most straightforward estimator is the sample mean: $\hat{\mu} = \bar{X}$
● Problem: $\hat{\mu} = \mu + e$, where $e$ is random error
● For any given sample or dataset, we only know $\bar{X}$
● How confident we are in the results and our conclusions depends on the size of $e$ with respect to $\mu$

[Figures: population density of performance (0 to 1), with histograms of repeated samples of size n=10, n=30 and n=100]

Sampling distribution and standard error
● Let us assume some distribution with some $\mu$
● Experiment: draw random sample of size $n$ and compute $\bar{X}$
● The sampling distribution is the distribution of $\bar{X}$ over replications of the experiment
● Standard error is the std. dev. of the sampling distribution
[Figure: population density of performance, and the sampling distribution of the mean performance $\bar{X}$ for n=10, 30, 50 and 100]

Estimating the mean
● The true distribution $F$ has mean $\mu$ and variance $\sigma^2$
○ $E[X] = \mu$
○ $Var[X] = \sigma^2$
● For the sample mean $\bar{X} = \frac{1}{n}\sum_i X_i$ we have
○ $E[\bar{X}] = \frac{1}{n}\sum_i E[X_i] = \mu$
○ $Var[\bar{X}] = \frac{1}{n^2}\sum_i Var[X_i] = \frac{\sigma^2}{n}$
○ std. error $= \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
● Law of large numbers: $\bar{X} \to \mu$ as $n \to \infty$
● The larger the dataset, the better our estimates
● Regardless of the true distribution $F$ over the population
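The standard error formula can be checked by simulation. The sketch below assumes a hypothetical population of performance scores following a Beta(2, 3) distribution (mean 0.4, standard deviation 0.2); this choice, like the number of replications, is an assumption made only for illustration. It replicates the experiment many times and compares the standard deviation of the sample means with $\sigma/\sqrt{n}$.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, replications=5000):
    """Means of `replications` random samples of size n from Beta(2, 3)."""
    return [statistics.fmean([random.betavariate(2, 3) for _ in range(n)])
            for _ in range(replications)]

# Beta(a, b) has mean a/(a+b) and variance ab / ((a+b)^2 (a+b+1))
mu = 2 / 5
sigma = (2 * 3 / (5 ** 2 * 6)) ** 0.5

results = {}
for n in (10, 30, 100):
    means = sample_means(n)
    # (mean of the sample means, empirical standard error)
    results[n] = (statistics.fmean(means), statistics.stdev(means))
    print(n, round(results[n][0], 3), round(results[n][1], 4),
          round(sigma / n ** 0.5, 4))
```

For each n, the last two printed columns (empirical and theoretical standard error) should be close, and both shrink as n grows.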

Gaussians everywhere
● For the special case where $F = N(\mu, \sigma^2)$, the sample mean is also Gaussian, specifically $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$
● But a Gaussian distribution is sometimes a very unrealistic model for our data
● Central Limit Theorem (CLT): $\bar{X} \xrightarrow{d} N(\mu, \frac{\sigma^2}{n})$ as $n \to \infty$
● Regardless of the shape of the original $F$
[Figure: population density of performance, and the sampling distribution of the mean performance $\bar{X}$ for n=10 and n=100]

Approximations
● Until the late 1890s, the CLT was invoked everywhere for the simplicity of working with Gaussians
● Tables of the Gaussian distribution were used to test
● Still, there were two main problems
○ $\sigma^2$ is unknown
○ The rate of convergence, ie. small samples
● But something happened at the Guinness factory in 1908

Student-t distribution
● Gaussian approximations for sampling distributions were reasonably good for large samples, but not for small
● Gosset thought about deriving the theoretical distributions under assumptions of the underlying model
● Specifically, when $X \sim N(\mu, \sigma^2)$:
○ If $\sigma$ is known, we know that $z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$
○ If $\sigma$ is unknown, Gosset introduced the Student-t distribution: $t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim T(n-1)$, where $s$ is the sample standard deviation
● In a sense, it accounts for the uncertainty in $\hat{\sigma} = s$

Small-sample problems
● In non-English literature there are earlier mentions, but the distribution was popularized by Gosset and, mostly, Fisher
● He initiated the study of the so-called small-sample problems, specifically with the Student-$t$ distribution
[Figure: density of the Student-t distribution (t statistic from -4 to 4) for n=2, 3, 6 and 30]
Fisher and small samples
● Gosset did not provide a proof of the $t$ distribution, but Fisher did in 1912-1915
● Fisher stopped working on small sample problems until Gosset convinced him in 1922
● He then worked out exact distributions for correlation coefficients, regression coefficients, $\chi^2$ tests, etc. in the early 1920s
● These, and much of his work on estimation and design of experiments, were collected in his famous 1925 book
● This book is sometimes considered the birth of modern statistical methods
[Photo: Ronald Fisher]

Fisher’s significance testing
● In those other papers Fisher developed his theory of
significance testing
● Suppose we have observed data $X \sim f(x \mid \theta)$ and we are interested in testing the null hypothesis $H_0: \theta = \theta_0$
● We choose a relevant test statistic $T$ s.t. large values of $T$ reflect evidence against $H_0$
● Compute the $p$-value $p = P(T(X^*) \geq T(X) \mid H_0)$, that is, the probability that, under $H_0$, we observe a sample $X^*$ with a test statistic at least as extreme as we observed initially
● Assess the statistical significance of the results, that is, reject $H_0$ if $p$ is small

Testing the mean
● We observed $X = \{-0.13, 0.68, -0.34, 2.10, 0.83, -0.32, 0.99, 1.24, 1.08, 0.19\}$ and assume a Gaussian model
● We set $H_0: \mu = 0$ and choose a $t$ statistic
● For our data, $t = 2.55$ and $p = 0.0155$
● If we consider $p$ small enough, we reject $H_0$
[Figure: density of the test statistic $t$ under $H_0$, with the area corresponding to $p$]
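The numbers on this slide can be recomputed with the standard library alone. The sketch below computes the one-sample $t$ statistic and the upper-tail probability $P(T \geq t)$, which is the one-sided reading consistent with the reported $p$. The helper names are assumptions for this example, and the tail area is obtained by numerical integration (Simpson's rule) rather than from a statistics package.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T >= t), integrating the density on [0, |t|] with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t >= 0 else 0.5 + area

x = [-0.13, 0.68, -0.34, 2.10, 0.83, -0.32, 0.99, 1.24, 1.08, 0.19]
n = len(x)
mu0 = 0  # value of the mean under H0

# t = (sample mean - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom
t = (statistics.fmean(x) - mu0) / (statistics.stdev(x) / math.sqrt(n))
p = t_upper_tail(t, n - 1)
print(round(t, 2), round(p, 4))
```

A two-sided test would double this tail area, so it matters which alternative is stated before looking at the data.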

Small p-values
● “we do not want to know the exact value of p […], but, in
the first place, whether or not the observed value is open
to suspicion”
● Fisher provided in his book tables not of the new small-
sample distributions, but of selected quantiles
● Allow for calculation of ranges of 𝑝-values given test
statistics, as different degrees of evidence against 𝐻0
● The 𝑝-value is gradable, a continuous measure of evidence
22

𝑝 and 𝛼
● Fisher employed the term significance level 𝛼 for these
theoretical 𝑝-values used as reference points to identify
statistically significant results: reject 𝐻0 if 𝑝 ≤ 𝛼
● This is context-dependent, is not prefixed beforehand and
can change from time to time
● He arbitrarily “suggested” 𝛼 = .05 for illustration purposes
● Observing 𝑝 > 𝛼 does not prove 𝐻0; it just fails to reject it
24

Pearson
● Pearson saw Fisher’s tables as a way to compute critical values that “lent themselves to the idea of choice, in advance of experiment, of the risk of the ‘first kind of error’ which the experimenter was prepared to take”
● In a letter to Pearson, Gosset replied “if the chance is very
small, say .00001, […] what it does is to show that if there is
any alternative hypothesis which will explain the
occurrence of the sample with a more reasonable
probability, say .05 […], you will be very much more
inclined to consider that the original hypothesis is not
true”
26
Egon Pearson

Neyman
● Pearson saw the light: “the only valid
reason for rejecting a statistical hypothesis
is that some alternative explains the
observed events with a greater degree
of probability”
● In 1926 Pearson writes to Neyman to propose his ideas of
hypothesis testing, which they developed and published
in 1928
27
Jerzy Neyman

Errors
● $\alpha = P(\text{type 1 error})$
● $\beta = P(\text{type 2 error})$
● Power $= 1 - \beta$

                   Truth: $H_0$ true    Truth: $H_1$ true
Test: accept $H_0$    true negative        Type 2 error
Test: reject $H_0$    Type 1 error         true positive

[Figure: densities of the test statistic under $H_0$ and $H_1$ for n=10, with the areas $\alpha$ and $\beta$]

Errors
● Test statistic: $t = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
● Highlighting $n$: $H_0: \mu = 0$, $H_1: \mu = 0.5$, $\sigma = 1$
[Figure: densities of the test statistic under $H_0$ and $H_1$ for n=10 and n=30, with the areas $\alpha$ and $\beta$]
● Highlighting $\mu$: $H_0: \mu = 0$, $H_1: \mu = 0.25$, $\sigma = 1$
[Figure: densities of the test statistic under $H_0$ and $H_1$ for n=10 and n=30, with the areas $\alpha$ and $\beta$]
● Highlighting $\sigma$: $H_0: \mu = 0$, $H_1: \mu = 0.25$, $\sigma = 3$
[Figure: densities of the test statistic under $H_0$ and $H_1$ for n=10 and n=30, with the areas $\alpha$ and $\beta$]

Neyman-Pearson hypothesis testing
● Define the null and alternative hypotheses, eg.
𝐻0: 𝜇 = 0 and 𝐻1: 𝜇 = 0.5
● Set the acceptable error rates 𝛼 (type 1) and 𝛽 (type 2)
● Select the most powerful test 𝑇 for the hypotheses and 𝛼,
which sets the critical value 𝑐
● Given 𝐻1 and 𝛽, select the sample size 𝑛 required to detect
an effect 𝒅 or larger
● Collect data and reject 𝐻0 if 𝑇 𝑋 ≥ 𝑐
● The testing conditions are set beforehand: 𝐻0, 𝐻1, 𝛼, 𝛽
● The experiment is designed for a target effect 𝑑: 𝑛
33
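The procedure above can be sketched numerically. The example below is a simplification under stated assumptions: $\sigma$ is treated as known, the test is a one-sided Gaussian (z) test of $H_0: \mu = 0$ against $H_1: \mu = d > 0$, and the defaults $\alpha = 0.05$ and $\beta = 0.20$ are illustrative. The function names are hypothetical. It computes the power for the scenarios shown on the "Errors" slides and the sample size needed for a target effect.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard Gaussian

def power(d, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test of H0: mu = 0 vs H1: mu = d > 0."""
    critical = Z.inv_cdf(1 - alpha)  # critical value c on the z scale
    return 1 - Z.cdf(critical - d * math.sqrt(n) / sigma)

def required_n(d, sigma, alpha=0.05, beta=0.20):
    """Smallest n with type 2 error rate at most beta for an effect d."""
    z = Z.inv_cdf(1 - alpha) + Z.inv_cdf(1 - beta)
    return math.ceil((z * sigma / d) ** 2)

# (effect d, standard deviation sigma) as in the three "Errors" scenarios
for d, sigma in ((0.5, 1), (0.25, 1), (0.25, 3)):
    print(d, sigma,
          [round(power(d, sigma, n), 3) for n in (10, 30)],
          required_n(d, sigma))
```

Smaller effects and noisier scores both reduce power for a fixed n, so the required sample size grows quickly in either case.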

Error rates and tests
● Under repeated experiments, the long-run error rate is 𝛼
● Neyman-Pearson did not suggest values for it:
“the balance [between the two kinds of error] must be left
to the investigator […] we attempt to adjust the balance
between the risks 1 and 2 to meet the type of problem
before us”
● For 𝛽 they “suggested” 𝛼 ≤ 𝛽 ≤ 0.20
● To Fisher, the choice of test statistic in his methodology
was rather obvious to the investigator and wasn’t important
to him
● Neyman-Pearson answered this by defining the “best” test:
that which minimizes error 2 subject to a bound in error 1
34

Likelihood ratio test
● Pearson apparently suggested the likelihood ratio test for
their new hypothesis testing methodology
$\mathcal{L} = \frac{p(X \mid H_0)}{p(X \mid H_1)}$
● Later found that as $n \to \infty$, $-2 \log \mathcal{L} \sim \chi^2$
● Neyman was reluctant, as he thought some Bayesian
consideration had to be taken about prior distributions
over the hypotheses (“inverse probability” at the time)
● For simple point hypotheses like 𝐻0: 𝜃 = 𝜃0 and 𝐻1: 𝜃 = 𝜃1,
the Likelihood ratio test turned out to be the most powerful
● In the case of comparing means of Gaussians, this reduces
to Student’s 𝑡-test!
35

Composite hypotheses
● Neyman-Pearson theory extends to composite hypotheses
of the form 𝐻: 𝜃 ∈ Θ, such as 𝐻1: 𝜇 > 0.5
● The math got more complex, and Neyman was still
somewhat reluctant: “it may be argued that it is
impossible to estimate the probability of such a hypothesis
without a knowledge of the relative a priori probabilities of
the constituent simple hypotheses”
● Although “wishing to test the probability of a hypothesis A we have to assume that all hypotheses are a priori equally probable and calculate the probability a posteriori of A”
36

Null Hypothesis
Significance Testing

Recap
● Fisher: significance testing
○ Inductive inference: rational belief when reasoning from sample to
population
○ Rigorous experimental design to extract results from few samples
○ Replicate and develop your hypotheses, consider all significant and non-
significant results together
○ Power can not be computed beforehand
● Neyman-Pearson: hypothesis testing
○ Inductive behavior: frequency of errors in judgments
○ Long-run results from many samples
○ p-values don’t have frequentist interpretations
● In the 1940s the two worlds began to appear as just one in statistics textbooks, and were rapidly adopted by researchers
38

Null Hypothesis Significance Testing
● Collect data
● Set hypotheses, typically 𝐻0: 𝜇 = 0 and 𝐻1: 𝜇 ≠ 0
○ Either there is an effect or there isn’t
● Set 𝛼, typically to 0.05 or 0.01
● Select test statistic based on hypotheses and compute 𝑝
● If 𝑝 ≤ 𝛼, reject the null; fail to reject if 𝑝 > 𝛼
40

Common bad practices
● Run tests blindly without looking at your data
● Decide on 𝛼 after computing 𝑝
● Report “(not) significant at the 0.05 level” instead of
providing the actual 𝑝-value
● Report degrees of significance like “very” or “barely”
● Do not report test statistic alongside $p$, eg. $t(58) = 1.54$
● Accept 𝐻0 if 𝑝 > 𝛼 or accept 𝐻1 if 𝑝 ≤ 𝛼
● Interpret 𝑝 as the probability of the null
● Simply reject the null, without looking at the effect size
● Ignore the type 2 error rate 𝛽, ie. power analysis a posteriori
● Interpret statistically significant result as important
● Train the same models until significance is found
● Publish only statistically significant results ¯_(ツ)_/¯
41

Paired tests
● We typically want to compare our system B with some baseline system A
● We have the scores over $n$ inputs from some dataset
● The hypotheses are $H_0: \mu_A = \mu_B$ and $H_1: \mu_A \neq \mu_B$
[Figure: histograms of performance for system A (mean=0.405, sd=0.213) and system B (mean=0.425, sd=0.225)]

Paired test
● If we ignore the structure of the experiment, we have a bad
model and a test with low power
Simple test: 𝑡 = 0.28, 𝑝 = 0.78
● In our experiments, every observation from A corresponds
to an observation from B, ie. they are paired observations
● The test can account for this to better model our data
● Instead of looking at A vs B, we look at A-B vs 0
Paired test: 𝑡 = 2.16, 𝑝 = 0.044
[Figure: performance of A and B, on a 0 to 1 scale]

Paired $t$-test
● Assumption
○ Data come from Gaussian distributions
● Equivalent to a t-test of $\mu_D = 0$, where $D_i = B_i - A_i$
$t = \sqrt{n}\,\frac{\bar{X}_{B-A}}{s_{B-A}} = \sqrt{n}\,\frac{\bar{D}}{s_D} = 0.35$, $p = 0.73$

A    B
.76  .75
.33  .37
.59  .59
.28  .15
.36  .49
.43  .50
.21  .33
.43  .27
.72  .81
.40  .36
[Figure: scores of B plotted against scores of A]
Wilcoxon signed-rank test
● Assumptions:
○ Data measured at least at interval level
○ Distribution is symmetric
● Disregard for magnitudes
● Convert all untied (nonzero) $D_i$ to ranks $R_i$ of $|D_i|$
● Compute $W^+$ and $W^-$, the sums of the ranks $R_i$ of the positive and of the negative $D_i$, respectively
● The test statistic is $W = \min(W^+, W^-) = 21$
● $W$ follows a Wilcoxon distribution, from which one can calculate $p = 0.91$

A    B    D     rank
.76  .75  -.01  1
.33  .37   .04  2
.59  .59   0    -
.28  .15  -.13  8
.36  .49   .13  7
.43  .50   .07  4
.21  .33   .12  6
.43  .27  -.16  9
.72  .81   .09  5
.40  .36  -.04  3
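For a sample this small, the null distribution of $W$ can be enumerated exactly. The sketch below takes the signed ranks exactly as listed in the table, where the ties at $|D| = .04$ and $|D| = .13$ are broken arbitrarily (assigning tied values their average rank is the more common convention and would give a slightly different $W$). Variable and function names are assumptions for this example.

```python
from itertools import product

# Signed ranks as listed in the table; the zero difference is dropped.
signed_ranks = [-1, 2, -8, 7, 4, 6, -9, 5, -3]
ranks = [abs(r) for r in signed_ranks]

def w_statistic(signs):
    """W = min(W+, W-) for a given assignment of signs to the ranks."""
    w_plus = sum(r for r, s in zip(ranks, signs) if s > 0)
    w_minus = sum(r for r, s in zip(ranks, signs) if s < 0)
    return min(w_plus, w_minus)

observed = w_statistic(signed_ranks)

# Under H0 every sign pattern is equally likely: enumerate all 2^9 of them
# and count those with a statistic at least as extreme (as small) as observed.
patterns = list(product((-1, 1), repeat=len(ranks)))
p = sum(w_statistic(s) <= observed for s in patterns) / len(patterns)
print(observed, round(p, 2))
```

Enumeration is only feasible for small n; statistical packages typically switch to a Gaussian approximation for larger samples.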

Sign test
● Complete disregard for magnitudes
● Simulate coin flips: was system B better (or worse) than A for some input?
● Follows Binomial distribution
● The test statistic is the number of successes (B>A), which is 5
● $p$-value is the probability of 5 or more successes in 9 coin flips = 0.5

A    B    sign
.76  .75  -1
.33  .37  +1
.59  .59   0
.28  .15  -1
.36  .49  +1
.43  .50  +1
.21  .33  +1
.43  .27  -1
.72  .81  +1
.40  .36  -1
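The sign test on the same scores amounts to a Binomial tail probability, which can be computed directly. A minimal sketch, with variable names chosen for this example:

```python
from math import comb

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

wins = sum(b > a for a, b in zip(A, B))    # inputs where B beats A
losses = sum(b < a for a, b in zip(A, B))  # inputs where A beats B
n = wins + losses                          # ties are dropped

# P(at least `wins` successes in n fair coin flips), Binomial(n, 0.5)
p = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
print(wins, n, p)
```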

Bootstrap test
● Compute deltas, $D_i = B_i - A_i$
● The empirical distribution $ecdf_D$ estimates the true distribution $F_D$
● Repeat for $i = 1, \dots, T$, with $T$ large (thousands of times)
○ Draw a bootstrap sample $B_i$ by sampling $n$ scores with replacement from $ecdf_D$ (here $B_i$ denotes a bootstrap sample, not the scores of system B)
○ Compute the mean of the bootstrap sample, $\bar{B}_i$
○ Let $\bar{B} = \frac{1}{T}\sum_i \bar{B}_i$
○ The $\bar{B}_i$ estimate the sampling distribution of the mean
● The $p$-value is $\frac{1}{T}\sum_i \mathbb{I}(|\bar{B}_i - \bar{B}| \geq |\bar{D}|) \approx 0.71$
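The steps above can be sketched as follows, using the ten score pairs from the paired $t$-test table. The number of bootstrap samples, the random seed and the variable names are assumptions for this example, and the $p$-value is computed two-sided (comparing absolute values), which is the reading consistent with the value reported above. Being a Monte Carlo procedure, the result varies slightly with the seed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]
D = [b - a for a, b in zip(A, B)]  # deltas D_i = B_i - A_i
d_bar = statistics.fmean(D)        # observed mean difference

T = 10000  # number of bootstrap samples
# Mean of each bootstrap sample (n scores drawn with replacement from D)
boot_means = [statistics.fmean(random.choices(D, k=len(D))) for _ in range(T)]
center = statistics.fmean(boot_means)

# Centering the bootstrap means approximates the distribution under H0
p = sum(abs(m - center) >= abs(d_bar) for m in boot_means) / T
print(round(p, 2))
```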

Permutation test
● Under the null hypothesis, an arbitrary score could have been generated by system A or by system B
● Repeat for $i = 1, \dots, T$, with $T$ large (thousands of times)
○ Create a sample $P_i$ by randomly swapping the sign of each observation $D_j$
○ Compute the mean $\bar{P}_i$
● The $p$-value is $\frac{1}{T}\sum_i \mathbb{I}(|\bar{P}_i| \geq |\bar{D}|) \approx 0.73$
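With only ten differences, all $2^{10}$ sign assignments can be enumerated, so the random sampling step can be replaced by an exact computation. The sketch below does this for the ten score pairs from the paired $t$-test table. Exhaustive enumeration is a swapped-in variant of the random procedure described above, the test is taken as two-sided, and the small tolerance in the comparison is an assumption added to guard against floating-point rounding.

```python
from itertools import product

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]
D = [b - a for a, b in zip(A, B)]

observed = abs(sum(D))  # comparing sums is equivalent to comparing means
count = 0
total = 0
# Under H0 the sign of each difference is arbitrary: try every assignment
for signs in product((-1, 1), repeat=len(D)):
    total += 1
    permuted = abs(sum(s * d for s, d in zip(signs, D)))
    if permuted >= observed - 1e-12:  # at least as extreme as observed
        count += 1

p = count / total
print(total, round(p, 2))
```

For larger n the number of assignments grows as $2^n$, which is why a random sample of $T$ assignments is used in practice.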

The computer does all this for you
50

In practice
51
[Carterette, 2015a]

NHST for (M)IR
Multiple systems

ANOVA
● Assume a model $y_{si} = \mu + \nu_s + \nu_i + e_{si}$, where $\nu_s = \bar{y}_{s\cdot} - \mu$
○ Implicitly “pairs” the observations by item
● The variance of the observed scores can be decomposed: $\sigma^2_y = \sigma^2_s + \sigma^2_i + \sigma^2_e$
● where $\sigma^2_s$ is the variance across system means
○ Low: system means are close to each other
○ High: system means are far from each other
● The null hypothesis is $H_0: \mu_1 = \mu_2 = \mu_3 = \dots$
○ Even if we reject, we still don’t know which system is different!
● The test statistic is of the form $F = \frac{\sigma^2_s}{\sigma^2_e}$
● We’d like to have $\sigma^2_s \gg \sigma^2_i, \sigma^2_e$
Friedman test
● Same principle as ANOVA, but non-parametric
● Similarly to Wilcoxon, rank observations (per item) and
estimate effects
● Ignores actual magnitudes; simply uses ranks
54

Multiple testing
● When testing multiple hypotheses, the probability of at
least one type 1 error increases
● Multiple testing procedures correct 𝑝-values for a family-
wise error rate
55
[Carterette, 2015a]

Tukey’s HSD
● Follow ANOVA to test 𝐻0: 𝜇1 = 𝜇2 = 𝜇3 = ⋯
● The maximum observed difference between systems is
likely the one causing the rejection
● Tukey’s HSD compares all pairs of systems, each with an
individual 𝑝-value
● These 𝑝-values are then corrected based on the expected
distribution of maximum differences under 𝐻0
● In practice, it inflates 𝑝-values
● Ensures a family-wise type 1 error rate
56

Tukey’s HSD
57
[Carterette, 2015a]

Others
● There are many other procedures to control for multiple
comparisons
● Bonferroni: very conservative (low power)
● Dunnett’s: compare all against a control (eg. baseline)
● Other procedures control for the false discovery rate, ie.
the probability of a type 1 error given 𝑝 < 𝛼
● One way or another, they all inflate 𝑝-values
58
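A short numerical sketch of why corrections are needed and of the simplest one. The first function assumes independent tests, which is a simplification (comparisons among the same systems on the same dataset are not independent); the second applies the Bonferroni correction by multiplying each $p$-value by the number of tests. Function names and the example $p$-values are hypothetical.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one type 1 error) over m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# 45 is the number of pairwise comparisons among 10 systems
for m in (1, 3, 10, 45):
    print(m, round(family_wise_error_rate(0.05, m), 3))

print(bonferroni([0.004, 0.03, 0.2]))  # hypothetical unadjusted p-values
```

Each adjusted $p$-value is then compared with the original $\alpha$, which is the sense in which these procedures inflate $p$-values.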

Part III: What Else? Validity!

What else can go wrong? Validity!
● Validity reprise
● Example I: Inter-rater agreement
● Example II: Adversarial examples

Validity
● Validity
○ Valid experiment is an experiment actually measuring what the experimenter intended to measure
○ Conclusion validity: does a difference between system measures correspond to a difference in user measures and is it noticeable to users
○ Internal validity: is this relationship causal or could confounding factors explain the relation
○ External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment
○ Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment

Inter-rater agreement in music
similarity
5

Automatic recommendation / Playlisting
[Diagram: Query song + Millions of songs = Result list, computed via Similarity]

Computation of similarity between songs
● Collaborative filtering (Spotify, Deezer?)
● Social meta-data (Last.Fm?)
● Expert knowledge (Pandora?)
● Meta-data from the web
● ...
● Audio-based
8

9
Songs as audio
Switching to
frequencies
Computation of
features
Machine Learning
→ similarity (metric)
S(a1, a2) = ?
Pictures from E. Pampalk’s Phd thesis 2006

10
Query song
Similar? Similar?
…………….

11
Query song
Similar!! Similar!!
…………….
max(S)=97.9

How can we evaluate our models of music similarity?
[Figure: matrix of similarity values with rows 45 87 100, 23 100 87 and 100 23 45]
● Do these numbers correspond to a human assessment of music similarity?

MIREX - Music Information Retrieval eXchange
● Standardized testbeds allowing for fair comparison of MIR systems
● range of different tasks
● based on human evaluation
○ Cranfield: remove users, look at annotations only
● What is the level of agreement between human raters/annotators?
● What does this mean for the evaluation of MIR systems?
● Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016.

Audio music similarity
● Audio Music Similarity and Retrieval AMS task 2006-2014
20

● 5000 song database
● participating MIR systems compute 5000x5000 distance
matrix
● 60 randomly selected queries
● return 5 closest candidate songs for each of the MIR
systems
● for each query/candidate pair, ask the human grader:
● “Rate the similarity of the following Query-Candidate pairs.
Assign a categorical similarity (Not similar, Somewhat
Similar, or Very Similar) and a numeric similarity score. The
numeric similarity score ranges from 0 (not similar) to 10
(very similar or identical).”
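
The retrieval step of the task description can be sketched as follows. This is a minimal illustration, assuming a toy random distance matrix of 20 songs instead of 5000 and 3 queries instead of 60; the function name `top_k_candidates` is hypothetical and not part of any MIREX tooling.

```python
import random

def top_k_candidates(dist, query, k=5):
    """Return the indices of the k songs closest to `query`, excluding the query itself."""
    others = [j for j in range(len(dist)) if j != query]
    return sorted(others, key=lambda j: dist[query][j])[:k]

# Toy stand-in for one system's distance matrix (assumption: symmetric, random values).
rng = random.Random(0)
n_songs = 20
dist = [[0.0] * n_songs for _ in range(n_songs)]
for i in range(n_songs):
    for j in range(i + 1, n_songs):
        dist[i][j] = dist[j][i] = rng.random()

# Randomly selected queries; each gets a result list of 5 candidates to be graded by humans.
queries = rng.sample(range(n_songs), 3)
result_lists = {q: top_k_candidates(dist, q) for q in queries}
```

Each query/candidate pair in `result_lists` is what a human grader would then be asked to rate.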
21

Experimental design
24
● measure the effect of different treatments on a dependent
variable
● Independent variable (treatment): manipulated by researcher
● Dependent variable (effect): measured by researcher

Experimental design
25
● Independent variable (treatment): type of algorithm
● Dependent variable (effect): FINE similarity rating
[Figure: MIREX AMS 2014]

What about validity?
● Valid experiment is an experiment actually measuring
what the experimenter intended to measure
● What is the intention of the experimenter in the AMS task?
● What do we want to measure here?
28

Rate the similarity!
31
Query song Candidate song
0 … 100

Rate the similarity!
36
● Factors that influence human music perception
○ Schedl M., Flexer A., Urbano J.: The Neglected User in Music Information
Retrieval Research, J. of Intelligent Information Systems, Volume 41, Issue 3,
pp. 523-539, December 2013.

Inter-rater agreement in AMS
38

● AMS 2006 is the only year with multiple graders
● each query/candidate pair evaluated by three different
human graders
● each grader gives a FINE score between 0 and 10 (not similar … very
similar)
● correlation between pairs of graders
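
A minimal sketch of the pairwise correlation between graders, assuming synthetic FINE scores (the real AMS 2006 ratings are not reproduced here). The grader names and the noise model are made up for illustration.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic data (assumption): each query/candidate pair has a latent similarity,
# and every grader perceives it with independent noise, clipped to the 0..10 scale.
rng = random.Random(1)
latent = [rng.uniform(0, 10) for _ in range(200)]

def noisy_rating(value):
    return min(10.0, max(0.0, value + rng.gauss(0, 2.5)))

ratings = {g: [noisy_rating(v) for v in latent]
           for g in ("grader1", "grader2", "grader3")}

# One correlation per pair of graders: three graders give three pairs.
pairwise_r = {(a, b): pearson(ratings[a], ratings[b])
              for a, b in combinations(ratings, 2)}
```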
40

● inter-rater agreement for different intervals of FINE scores
41

● look at very similar ratings in the [9,10] interval
44

46
Average = 6.54

● what sounds very similar to one grader, will on average
receive a score of only 6.54 from other graders
● this constitutes an upper bound for average FINE scores in
AMS
● there will always be users that disagree (moving target)
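
One plausible reading of how such an upper bound can be computed is sketched below, on made-up ratings. The exact procedure in Flexer & Grill (2016) may differ in detail, and the function name `upper_bound` is hypothetical.

```python
def upper_bound(ratings, low=9.0, high=10.0):
    """Mean score given by the *other* graders to pairs that one grader rated in [low, high].

    `ratings` holds one list of grader scores per query/candidate pair.
    """
    others_scores = []
    for pair_scores in ratings:
        for i, score in enumerate(pair_scores):
            if low <= score <= high:
                # Collect what everybody else said about this "very similar" pair.
                others_scores.extend(s for j, s in enumerate(pair_scores) if j != i)
    if not others_scores:
        return float("nan")
    return sum(others_scores) / len(others_scores)

# Made-up FINE scores from three graders for three query/candidate pairs.
toy_ratings = [
    [9.5, 6.0, 7.0],
    [10.0, 9.0, 4.0],
    [3.0, 2.0, 5.0],
]
bound = upper_bound(toy_ratings)
```

A system whose average FINE score approaches this value is already as good as the graders' mutual agreement allows us to measure.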
47

Comparison to the upper bound
● compare top performing systems (2007, 2009-2014) to
upper bound
48

● upper bound has already been reached in 2009
50
PS2

52
● can upper bound be surpassed in the future?
● or is this an inherent problem due to low inter-rater
agreement in human evaluation of music similarity?
● this prevents progress in MIR research on music similarity
● AMS task dead since 2015

What about validity?
● Valid experiment is an experiment actually measuring
what the experimenter intended to measure
● What is the intention of the experimenter in the AMS task?
● What do we want to measure here?
54

Construct Validity
● Construct validity: are intentions and hypotheses of the
experimenter represented in the actual experiment?
● Unclear intention: to measure an abstract concept of music
similarity?
● Possible solutions:
○ more fine-grained notion of similarity
○ ask a more specific question?
○ does something like abstract music similarity even exist?
○ evaluation of complete MIR systems centered around a specific task (use
case) could lead to much clearer hypotheses
○ Remember the MIREX Grand Challenge on User Experience 2014?
○ “You are creating a short video about a memorable occasion that happened to you
recently, and you need to find some (copyright-free) songs to use as background
music.”
56

Internal Validity
● Internal validity: is the relationship causal or could
confounding factors explain the relation?
● Many factors that influence human music perception, need
to be controlled in experimental design
57

Internal Validity
59
Type of algorithm
Dependent variable
Control variable
gender, age
musical training/experience/preference
type of music, ...

Internal Validity
61
Type of algorithm
Dependent variable
Control variable
gender, age: female only, age 20-30y
musical training/experience/preference: music professionals
type of music: piano concertos only
Very specialized, limited generality

Internal Validity
● Control variable, monitor it:
63
Exponential complexity
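
To see why monitoring control variables becomes expensive, the sketch below counts the cells of a full factorial design. The factor levels are illustrative assumptions, not taken from any actual study.

```python
from itertools import product

# Hypothetical control variables and levels, for illustration only.
factors = {
    "gender": ["female", "male"],
    "age": ["20-30", "30-40", "40+"],
    "musical training": ["none", "amateur", "professional"],
    "type of music": ["piano concertos", "pop", "jazz"],
}

def n_cells(levels_per_factor):
    """Number of cells in a full factorial design: the product of the level counts."""
    total = 1
    for levels in levels_per_factor:
        total *= len(levels)
    return total

# Every combination of levels needs its own group of graders.
cells = list(product(*factors.values()))

# With k binary control variables, the number of cells is 2**k.
binary_growth = {k: n_cells([[0, 1]] * k) for k in (1, 5, 10, 20)}
```

Each extra control variable multiplies the number of cells, and every cell needs enough graders to support its own estimates.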

External Validity
● External validity: do cause-effect relationships also hold
for target populations beyond the sample used in the
experiment?
● Unclear target population: identical with sample of 7000
US pop songs? All US pop music in general?
● Beware: cross-collection studies show dramatic losses in
performance
○ Bogdanov, D., Porter, A., Herrera Boyer, P., & Serra, X. (2016). Cross-collection evaluation for
music classification tasks. ISMIR 2016.
○ Clear target population
○ More constricted target populations
○ Much larger data samples
○ Use case?
65

Conclusion Validity
● Conclusion validity: does a difference between system
measures correspond to a difference in user measures,
and is it noticeable to users?
● A large difference in effect measures is needed for users
to see the difference
67
J. Urbano, J. S. Downie, B. McFee and M.
Schedl: How Significant is Statistically
Significant? The case of Audio Music
Similarity and Retrieval, ISMIR 2012.

Conclusion Validity
○ Are there system measures that better correspond to user measures?
○ Use case!
68

Lack of inter-rater agreement in other areas
69

Lack of inter-rater agreement
● It does not make sense to go beyond inter-rater
agreement, this constitutes an upper bound
70

● MIREX ‘Music Structural Segmentation’ task
○ Human annotations of structural segmentations (structural boundaries and
labels denoting repeated segments: chorus, verse, …)
○ Algorithms have to produce such annotations
○ F1-score between different annotators as upper bound
○ Upper bound reached at least for certain music (classical and world music)
○ Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity,
J. of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016.
○ Smith, J.B.L., Chew, E.: A meta-analysis of the MIREX structure segmentation task, ISMIR,
2013.
○ Serrà, J., Müller, M., Grosche, P., & Arcos, J.L.: Unsupervised music structure annotation by
time series structure features and segment similarity., IEEE Transactions on Multimedia,
Special Issue on Music Data Mining, 16(5), 1229–1240, 2014.
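
A minimal sketch of an F1-score between two annotators' boundary annotations, assuming boundary times in seconds, a 0.5 s tolerance window and simple greedy matching. MIREX and libraries such as mir_eval use their own matching procedures, so the function below is only an illustration; the annotations are made up.

```python
def boundary_f1(reference, estimate, tolerance=0.5):
    """F1 between two lists of boundary times; each boundary can be matched at most once."""
    unmatched = list(estimate)
    hits = 0
    for r in reference:
        close = [e for e in unmatched if abs(e - r) <= tolerance]
        if close:
            # Greedy choice (assumption): take the nearest still-unmatched boundary.
            unmatched.remove(min(close, key=lambda e: abs(e - r)))
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(estimate)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)

# Made-up boundary annotations of the same song by two human annotators.
annotator_a = [0.0, 15.2, 31.0, 62.5, 90.1]
annotator_b = [0.0, 15.0, 45.3, 62.9, 90.0]
agreement = boundary_f1(annotator_a, annotator_b)
```

Treating one annotator as the reference and the other as the "system" gives the human-vs-human score that algorithms are compared against.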
72

Inter-rater agreement and upper bounds
● Extraction of metrical structure
○ Quinton, E., Harte, C., Sandler, M.: Extraction of metrical structure from
music recordings, DAFX 2015.
● Melody estimation
○ Balke, S., Driedger, J., Abeßer, J., Dittmar, C., Müller, M.: Towards
Evaluating Multiple Predominant Melody Annotations in Jazz Recordings,
ISMIR 2016.
○ Bosch J.J., Gómez E.: Melody extraction in symphonic classical music: a
comparative study of mutual agreement between humans and
algorithms, Proc. of the Conference on Interdisciplinary Musicology,
2014.
● Timbre and rhythm similarity
○ Panteli, M., Rocha, B., Bogaards, N., Honingh, A.: A model for rhythm
and timbre similarity in electronic dance music. Musicae Scientiae, 21(3),
338-361, 2017.
● Many more?
73

Adversarial Examples - Image Recognition
● An adversary slightly and imperceptibly changes an input
image to fool a machine learning system
○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing
adversarial examples, ICLR, 2014.
75
original + noise =
adversarial example
all classified as “Camel”

Adversarial Examples - MIR
● Imperceptibly filtered audio fools genre recognition
system
○ Sturm B.L.: A simple method to determine if a music information retrieval
system is a “horse”, IEEE Trans. on Multimedia, 16(6), pp. 1636-1644, 2014.
76
deflate

Adversarial Examples - MIR
● Imperceptibly filtered audio fools genre recognition
system
http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html
77
deflate

External Validity
● Unclear target population: identical with sample of few
hundred ISMIR or GTZAN songs?
● Or are we aiming at genre classification in general?
● If target is genre classification in general, there is a
problem!
78

Internal Validity
● Why can these MIR systems be fooled so easily?
○ no causal relation between the class (e.g. genre) represented in the data
and the label returned by the classifier
○ What is the confounding variable?
79

Internal Validity
● Why can these MIR systems be fooled so easily?
● E.g.: in case of rhythm classification, systems were picking
up tempo, not rhythm! Tempo acted as confounding factor!
○ Sturm B.L.: The Horse Inside: Seeking Causes Behind the Behaviors of
Music Content Analysis Systems, Computers in Entertainment, 14(2), 2016.
80

Internal Validity
● high dimensionality of the data input space?
○ Small perturbations to input data might accumulate over many dimensions
with minor changes ‘snowballing’ into larger changes in transfer functions
of deep neural networks
○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing
adversarial examples, ICLR, 2014.
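
The accumulation over dimensions can be illustrated with a single linear unit, following the linear argument in Goodfellow et al. The sketch assumes random Gaussian weights and inputs; it is not an attack on any actual MIR system.

```python
import random

def activation_shift(n_dims, eps=0.01, seed=0):
    """Change in w.x when every input dimension is nudged by eps in the direction sign(w)."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n_dims)]   # assumed random weights
    x = [rng.gauss(0, 1) for _ in range(n_dims)]   # assumed random input
    sign = lambda v: (v > 0) - (v < 0)
    x_adv = [xi + eps * sign(wi) for xi, wi in zip(x, w)]
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    # Equals eps * sum(|w_i|), so it grows roughly linearly with n_dims.
    return dot(w, x_adv) - dot(w, x)

# Same tiny per-dimension perturbation, increasingly large effect on the activation.
shifts = {n: activation_shift(n) for n in (10, 1000, 100000)}
```

No single dimension changes by more than `eps`, yet the shift in the unit's activation scales with the number of dimensions.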
81

Internal Validity
● linearity of models?
○ linear responses are overly confident at points that do not occur in the
data distribution, and these confident predictions are often highly
incorrect … rectified linear units (ReLU)?
○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing
adversarial examples, ICLR, 2014.
82

Internal Validity
Open question: what is the confounding
variable????
83

Validity
● Validity
○ Valid experiment is an experiment actually measuring what the
experimenter intended to measure
○ Conclusion validity
○ Internal validity
○ External validity
○ Construct validity
● Care about validity of your experiments!
● Validity is the right framework to talk about these
problems
85

Part IV: So?

What’s in a 𝑝-value?
● Confounds effect size and sample size, e.g. $t = \sqrt{n}\,\dfrac{\bar{X}-\mu}{\sigma}$
● Unfortunately, we virtually never check power. Don't ever
accept 𝐻0
● "Easy" way to achieve significance is obtaining more data,
but the true effect remains the same
● Even if one rejects 𝐻0, it could still be true
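
The confounding of effect size and sample size can be sketched numerically. The mean difference and standard deviation below are made-up values, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for larger n.

```python
import math
from statistics import NormalDist

def t_statistic(mean_diff, sd, n, mu0=0.0):
    """t = sqrt(n) * (mean - mu0) / sd, as in the formula above."""
    return math.sqrt(n) * (mean_diff - mu0) / sd

def approx_two_sided_p(t):
    """Two-sided p-value under a normal approximation (assumption, not the exact t distribution)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# The same observed effect (standardized effect size 0.1) at different sample sizes.
mean_diff, sd = 0.01, 0.1
results = {n: (t_statistic(mean_diff, sd, n),
               approx_two_sided_p(t_statistic(mean_diff, sd, n)))
           for n in (10, 100, 1000)}
```

The effect is identical in all three cases; only n changes, and with it the verdict of a test at any fixed threshold.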
2

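$$P(H_0 \mid p \le \alpha) = \frac{P(p \le \alpha \mid H_0)\,P(H_0)}{P(p \le \alpha)} = \frac{P(p \le \alpha \mid H_0)\,P(H_0)}{P(p \le \alpha \mid H_0)\,P(H_0) + P(p \le \alpha \mid H_1)\,P(H_1)} = \frac{\alpha\,P(H_0)}{\alpha\,P(H_0) + (1-\beta)\,P(H_1)}$$
● $P(H_0) = P(H_1) = 0.5$
○ $\alpha = 0.05,\ \beta = 0.05 \rightarrow P(H_0 \mid p \le \alpha) = 0.05$
○ $\alpha = 0.05,\ \beta = 0.5 \rightarrow P(H_0 \mid p \le \alpha) = 0.09$
● $P(H_0) = 0.8,\ P(H_1) = 0.2$
○ $\alpha = 0.05,\ \beta = 0.05 \rightarrow P(H_0 \mid p \le \alpha) = 0.17$
○ $\alpha = 0.05,\ \beta = 0.5 \rightarrow P(H_0 \mid p \le \alpha) = 0.29$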
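
The four values above follow directly from Bayes' rule and can be checked with a few lines. The function name is hypothetical; the inputs are the ones on the slide.

```python
def prob_h0_given_significant(alpha, beta, prior_h0):
    """P(H0 | p <= alpha) = alpha*P(H0) / (alpha*P(H0) + (1 - beta)*P(H1))."""
    prior_h1 = 1 - prior_h0
    return alpha * prior_h0 / (alpha * prior_h0 + (1 - beta) * prior_h1)

# (prior P(H0), beta) combinations from the slide, all with alpha = 0.05.
cases = [(0.5, 0.05), (0.5, 0.5), (0.8, 0.05), (0.8, 0.5)]
posteriors = {(prior, beta): round(prob_h0_given_significant(0.05, beta, prior), 2)
              for prior, beta in cases}
```

Low power (large β) and a high prior probability of the null both raise the chance that a significant result is a false alarm.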
3

𝐻0 is always false
● In this kind of dataset-based experiments, 𝐻0: 𝜇 = 0 is
always false
● Two systems may be veeeeeery similar, but not the same
● Binary accept/reject decisions don't even make sense
● Why bother with multiple comparisons then?
● Care about type S(ign) and M(agnitude) errors
● To what extent do non-parametric methods make sense
(Wilcoxon, Sign, Friedman), especially combined with
parametric ones like Tukey’s?
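
Type S and type M errors can be estimated by simulation. The sketch below assumes a known standard error and a z-test, with a made-up true effect equal to one standard error, that is, a low-powered experiment; the function name `type_s_m` is hypothetical.

```python
import random
from statistics import NormalDist

def type_s_m(true_effect, std_error, alpha=0.05, n_sims=20000, seed=0):
    """Simulate replications; return (power, type S rate, type M exaggeration ratio)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = (rng.gauss(true_effect, std_error) for _ in range(n_sims))
    # Keep only the replications that would be declared statistically significant.
    significant = [est for est in estimates if abs(est) > z_crit * std_error]
    if not significant:
        return 0.0, float("nan"), float("nan")
    power = len(significant) / n_sims
    # Type S: significant estimate with the wrong sign.
    type_s = sum(est * true_effect < 0 for est in significant) / len(significant)
    # Type M: how much significant estimates overstate the true magnitude, on average.
    type_m = sum(abs(est) for est in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = type_s_m(true_effect=0.02, std_error=0.02)
```

In such a setting every significant estimate must be at least about twice the true effect, so the reported improvements are exaggerated by construction.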
4

Binary thinking no more
● Nothing wrong with the 𝑝-value, but with its use
● 𝑝 as a detective vs 𝑝 as a judge
● Any 𝛼 is completely arbitrary
● What is the cost of a type 2 error?
● How does the lack of validity affect NHST? (measures,
sampling frames, ignoring cross-assessor variability, etc)
● What about researcher degrees of freedom?
● Why not focus on effect sizes? Intervals, correlations, etc.
● Bayesian methods? What priors?
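
A minimal sketch of reporting an effect size with an interval instead of a binary decision, assuming synthetic per-query score differences between two hypothetical systems and a percentile bootstrap. The data, sample size and number of resamples are illustrative choices.

```python
import random
from statistics import mean, stdev

def bootstrap_ci(diffs, level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int((1 - level) / 2 * n_boot)]
    hi = boot_means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Synthetic per-query differences in effectiveness between systems A and B (assumption).
rng = random.Random(42)
diffs = [rng.gauss(0.03, 0.10) for _ in range(60)]

effect = mean(diffs)                   # unstandardized effect size
cohens_d = effect / stdev(diffs)       # standardized effect size for paired data
ci_low, ci_high = bootstrap_ci(diffs)  # plausible range for the mean difference
```

Reporting `effect`, `cohens_d` and the interval shows how large the difference is and how uncertain we are about it, which a bare accept/reject decision hides.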
5

Assumptions
● In dataset-based (M)IR experiments, test assumptions are
false by definition
● 𝑝-values are, to some degree, approximated
● So again, why use any threshold?
● So which test should you choose?
● Run them all, and compare
● If they tend to disagree, take a closer look at the data
● Look beyond the experiment at hand, gather more data
● Always perform error analysis to make sense of it
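
"Run them all, and compare" can be sketched with three paired tests implemented on the standard library only: an exact sign test, a sign-flip randomization test, and a t-test whose p-value uses a normal approximation (an assumption). The per-query differences are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def sign_test(diffs):
    """Exact two-sided sign test on the signs of the paired differences (ties dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n, k = pos + neg, min(pos, neg)
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def randomization_test(diffs, n_perm=5000, seed=0):
    """Two-sided randomization test: randomly flip the sign of each paired difference."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

def approx_t_test(diffs):
    """Paired t statistic with a normal approximation for the two-sided p-value."""
    t = math.sqrt(len(diffs)) * mean(diffs) / stdev(diffs)
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Synthetic per-query differences between two hypothetical systems.
rng = random.Random(7)
diffs = [rng.gauss(0.05, 0.10) for _ in range(50)]
p_values = {"sign": sign_test(diffs),
            "randomization": randomization_test(diffs),
            "t (normal approx.)": approx_t_test(diffs)}
```

If the entries of `p_values` tell clearly different stories, that is a cue to look at the distribution of `diffs` rather than to pick the most convenient test.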
6

Replication
● Fisher, and especially Neyman-Pearson, advocated for
replication
● A 𝑝-value is only concerned with the current data
● The hypothesis testing framework only makes sense with
repeated testing
● In (M)IR we hardly do it; we're stuck with the same datasets
7

Significant ≠ Relevant ≠ Interesting
8
All research
Interesting
Relevant
Statistically Significant

There is always random error in our experiments,
so we always need some kind of statistical analysis
But there is no point in being too picky
or intense about how we do it
Nobody knows how to do it properly,
and different fields adopt different methods
What is far more productive,
is to adopt an exploratory attitude
rather than mechanically testing
9

References

● Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness
Measures and User Satisfaction. ACM SIGIR
● Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null Hypothesis Testing: Problems,
Prevalence, and an Alternative. Journal of Wildlife Management
● Armstrong, T.G., Moffat, A., Webber, W. & Zobel, J. (2009). Improvements that don't add up: ad-hoc
retrieval results since 1998. CIKM
● Balke, S., Driedger, J., Abeßer, J., Dittmar, C. & Müller, M. (2016). Towards Evaluating Multiple
Predominant Melody Annotations in Jazz Recordings. ISMIR
● Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science
● Bosch, J.J. & Gómez, E. (2014). Melody extraction in symphonic classical music: a comparative study of
mutual agreement between humans and algorithms. Conference on Interdisciplinary Musicology
● Boytsov, L., Belova, A. & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR
experiments. SIGIR
● Carterette, B. (2012). Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval
Experiments. ACM Transactions on Information Systems
● Carterette, B. (2015a). Statistical Significance Testing in Information Retrieval: Theory and Practice.
ICTIR
● Carterette, B. (2015b). Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR
● Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum
● Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. ACM
SIGIR
● Downie, J. S. (2004). The Scientific Evaluation of Music Information Retrieval Systems: Foundations
and Future. Computer Music Journal
● Fisher, R. A. (1925). Statistical Methods for Research Workers. Cosmo Publications
● Flexer, A. (2006). Statistical Evaluation of Music Information Retrieval Experiments. Journal of New
Music Research
● Flexer, A. & Grill, T. (2016). The Problem of Limited Inter-rater Agreement in Modelling Music Similarity.
Journal of New Music Research
● Gelman, A. (2013b). The problem with p-values is how they’re used.
● Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don’t Have to Worry About Multiple
Comparisons. Journal of Research on Educational Effectiveness
● Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a
problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was
posited ahead of time.
● Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist
● Gelman, A., & Stern, H. (2006). The Difference Between Significant and Not Significant is not Itself
Statistically Significant. The American Statistician
● Goodfellow I.J., Shlens J. & Szegedy C. (2014). Explaining and harnessing adversarial examples. ICLR
● Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., & Langlois, T. (2014). On Evaluation Validity in
Music Autotagging. ACM Computing Research Repository.
● Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and
User Evaluations Give the Same Results? ACM SIGIR
● Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar
Songs. ISMIR
● Hull, D. (1993). Using Statistical Testing in the Evaluation of Retrieval Experiments. ACM SIGIR
● Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine
● Lehmann, E.L. (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or
Two? Journal of the American Statistical Association
● Lehmann, E.L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer
● Lee, J. H., & Cunningham, S. J. (2013). Toward an understanding of the history and impact of user
studies in music information retrieval. Journal of Intelligent Information Systems
● Marques, G., Domingues, M. A., Langlois, T., & Gouyon, F. (2011). Three Current Issues In Music
Autotagging. ISMIR
● Neyman, J. & Pearson, E.S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes
of Statistical Inference: Part I. Biometrika
● Panteli, M., Rocha, B., Bogaards, N. & Honingh, A. (2017). A model for rhythm and timbre similarity in
electronic dance music. Musicae Scientiae
● Quinton, E., Harte, C. & Sandler, M. (2015). Extraction of metrical structure from music recordings.
DAFX
● Sakai, T. (2014). Statistical Reform in Information Retrieval? ACM SIGIR Forum
● Savoy, J. (1997). Statistical Inference in Retrieval Effectiveness Evaluation. Information Processing and
Management
● Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval
Research. Journal of Intelligent Information Systems
● Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs
for Generalized Causal Inference. Houghton-Mifflin
● Serrà, J., Müller, M., Grosche, P., & Arcos, J.L. (2014). Unsupervised music structure annotation by time
series structure features and segment similarity. IEEE Trans. on Multimedia
● Smith, J.B.L. & Chew, E. (2013). A meta-analysis of the MIREX structure segmentation task. ISMIR
● Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for
Information Retrieval Evaluation. ACM CIKM
● Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement Among Statistical Significance Tests for
Information Retrieval Evaluation at Varying Sample Sizes. ACM SIGIR
● Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our
Metrics, that They Are Unrealistic. European Workshop on Human-Computer Interaction and
Information Retrieval
● Student. (1908). The Probable Error of a Mean. Biometrika
● Sturm, B. L. (2013). Classification Accuracy is Not Enough: On the Evaluation of Music Genre
Recognition Systems. Journal of Intelligent Information Systems
● Sturm, B. L. (2014). The State of the Art Ten Years After a State of the Art: Future Research in Music
Information Retrieval. Journal of New Music Research
● Sturm, B. L. (2014). A simple method to determine if a music information retrieval system is a "horse".
IEEE Trans. on Multimedia
● Sturm, B. L. (2016). "The Horse" Inside: Seeking Causes Behind the Behaviors of Music Content
Analysis Systems. Computers in Entertainment
● Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited.
Information Processing and Management
● Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. ACM
SIGIR
● Urbano, J. (2015). Test Collection Reliability: A Study of Bias and Robustness to Statistical
Assumptions via Stochastic Simulation. Information Retrieval Journal
● Urbano, J., Downie, J. S., McFee, B., & Schedl, M. (2012). How Significant is Statistically Significant?
The case of Audio Music Similarity and Retrieval. ISMIR
● Urbano, J., Marrero, M., & Martín, D. (2013a). A Comparison of the Optimality of Statistical Significance
Tests for Information Retrieval Evaluation. ACM SIGIR
● Urbano, J., Marrero, M. & Martín, D. (2013b). On the Measurement of Test Collection Reliability. SIGIR
● Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of
Intelligent Information Systems
● Urbano, J. & Marrero, M. (2016). Toward Estimating the Rank Correlation between the Test Collection
Results and the True System Performance. SIGIR
● Urbano, J. & Nagler, T. (2018). Stochastic Simulation of Test Collections: Evaluation Scores. SIGIR
● Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error.
ACM SIGIR
● Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. ACM CIKM
● Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error
Costs Us Jobs, Justice, and Lives. University of Michigan Press
● Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? ACM
SIGIR