3. Julián Urbano
• Assistant Professor @ TU Delft, The Netherlands
• BSc-PhD Computer Science
• 10 years of research in (Music) Information Retrieval
o And related areas, like information extraction or crowdsourcing for IR
• Active @ ISMIR since 2010
• Research topics
o Evaluation methodologies
o Statistical methods for evaluation
o Simulation for evaluation
o Low-cost evaluation
Supported by the European Commission H2020 project TROMPA (770376-2)
4. Arthur Flexer
• Austrian Research Institute for Artificial Intelligence - Intelligent Music Processing and Machine Learning Group
• PhD in Psychology, minor degree in Computer Science
• 10 years of research in neuroscience, 13 years in MIR
• Active @ ISMIR since 2005
• Published on:
o role of experiments in MIR
o problems of ground truth
o problems of inter-rater agreement
Supported by the Vienna Science and Technology Fund (WWTF, project MA14-018)
5. Arthur Flexer
• Austrian Research Institute for Artificial Intelligence - Intelligent Music Processing and Machine Learning Group
• PhD in Psychology, minor degree in Computer Science
• 10 years of research in neuroscience, 13 years in MIR
• Active @ ISMIR since 2005
• Published on:
o role of experiments in MIR
o problems of ground truth
o problems of inter-rater agreement
• Semi-retired veteran DJ
Supported by the Vienna Science and Technology Fund (WWTF, project MA14-018)
6. Disclaimer
• Design of experiments (DOE) is used and needed in all kinds of sciences
• DOE is a science in its own right
• No fixed "how-tos" or "cookbooks"
• Different schools and opinions
• Established ways to proceed in different fields
• We will present current procedures in (M)IR
• But also discuss, criticize, point to problems
• Present alternatives and solutions (?)
7. Program
• Part I: Why (we evaluate the way we do it)?
o Tasks and use cases
o Cranfield
o Validity and reliability
• Part II: How (should we not analyze results)?
o Populations and samples
o Estimating means
o Fisher, Neyman-Pearson, NHST
o Tests and multiple comparisons
• Part III: What else (should we care about)?
o Inter-rater agreement
o Adversarial examples
• Part IV: So (what does it all mean)?
• Discussion?
21. Two recurrent questions
• How good is my system?
o What does good mean?
o What is good enough?
• Is system A better than system B?
o What does better mean?
o How much better?
• What do we talk about?
o Efficiency?
o Effectiveness?
o Ease?
22. Hypothesis: A is better than B
How would you design
this experiment?
23. Measure user experience
• We are interested in user-measures
o Time to complete task
o Idle time
o Success/Failure rate
o Frustration
o Ease of learning
o Ease of use …
• Their distributions describe user experience
o For an arbitrary user and topic (and document collection?)
o What can we expect?
[Figure: example distributions of time to complete a task (from 0 upwards) and of frustration (none, some, much)]
24. Sources of variability
user-measure = f(documents, topic, user, system)
• Our goal is the distribution of the user-measure for our system, which is impossible to calculate
o (Possibly?) infinite populations
• As usual, the best we can do is estimate it
o Becomes subject to random error
25. Desired: Live Observation
• Estimate distributions with a live experiment
• Sample documents, topics and users
• Have them use the system, for real
• Measure user experience, implicitly or explicitly
• Many problems
o High cost, representativeness
o Ethics, privacy, hidden effects, inconsistency
o Hard to replicate the experiment and repeat results
o Just plain impossible to reproduce results
*replicate = same method, different sample; reproduce = same method, same sample
26. Alternative: Fixed samples
• Get (hopefully) good samples, fix them and reuse
o Documents
o Topics
o Users
• Promotes reproducibility and reduces variability
• But we can't just fix the users!
27. Simulate users
• Cranfield paradigm: remove users, but include a user abstraction, fixed across experiments
o Static user component: judgments or annotations in the ground truth
o Dynamic user component: effectiveness or performance measures
• Removes all sources of variability, except systems
user-measure = f(documents, topic, user, system)
28. Simulate users
• Cranfield paradigm: remove users, but include a user abstraction, fixed across experiments
o Static user component: judgments or annotations in the ground truth
o Dynamic user component: effectiveness or performance measures
• Removes all sources of variability, except systems
user-measure = f(documents, topic, user, system) → user-measure = f(system)
29. Datasets (aka Test Collections)
• Controlled sample of documents, topics and judgments, shared across researchers…
• …combined with performance measures
• (Most?) important resource for IR research
o Experiments are inexpensive (datasets are not!)
o Research becomes systematic
o Evaluation is deterministic
o Reproducibility is not only possible but easy
31. User Models & Annotation Protocols
• In practice, there are hundreds of options
• Utility of a document w.r.t. the scale of annotation
o Binary or graded relevance?
o Linear utility w.r.t. relevance? Exponential?
o Independent of other documents?
• Top-heaviness to penalize late arrival
o No discount?
o Linear discount? Logarithmic?
o Independent of other documents?
• Interaction, browsing?
• Cutoff
o Fixed: only top k documents?
o Dynamic: wherever some condition is met?
o All documents?
• etc.
32. Tasks vs Use Cases
• Everything depends on the use case of interest
• The same task may have several use cases (or subtasks)
o Informational
o Navigational
o Transactional
o etc.
• Different use cases may imply, suggest or require different decisions w.r.t. system input/output, goal, annotations, measures...
34. Instrument recognition
1) Given a piece of music as input, identify the instruments that are played in it:
o For each window of T milliseconds, return a list of instruments being played (extraction).
o Return the instruments being played anywhere in the piece (classification).
2) Given an instrument as input, retrieve a list of music pieces in which the instrument is played:
o Return the list of music pieces (retrieval).
o Return the list, but for each piece also provide a clip (start-end) where the instrument is played (retrieval+extraction).
• Each case implies different systems, annotations and measures, and even different end-users (non-human?)
https://github.com/cosmir/open-mic/issues/19
35. But wait a minute...
• Are we estimating distributions about users or distributions about systems?
user-measure = f(system)
system-measure = f(system, protocol, measure)
36. But wait a minute...
• Are we estimating distributions about users or distributions about systems?
user-measure = f(system)
system-measure = f(system, protocol, measure, annotator, context, ...)
• Whether the system output satisfies the user or not has nothing to do with how we measure its performance
• What is the best way to predict user satisfaction?
37. Real world vs. The lab
[Diagram: in the real world, a user with an information need searches the Web with an IR system over documents; in the Cranfield lab abstraction, the information need becomes a topic, the test collection with relevance judgments is the static user component, and effectiveness measures (AP, DCG, RR) are the dynamic user component; lab results are used to predict real-world behavior]
39. Classes of Tasks in Music IR
• Retrieval
o Music similarity [System: track + collection → ranked list of tracks]
o Query by humming [System: hummed query + collection → ranked list of tracks]
o Recommendation [System: user + collection → ranked list of tracks]
• Annotation
o Genre classification [System: track → genre]
o Mood recognition [System: track → mood labels]
o Autotagging [System: track → tags]
• Extraction
o Structural segmentation [System: track → segments]
o Melody extraction [System: track → melody]
o Chord estimation [System: track → chord sequence]
48. Evaluation as Simulation
• Cranfield-style evaluation with datasets is a simulation of the user-system interaction: deterministic, maybe even simplistic, but a simulation nonetheless
• It provides us with data to estimate how good our systems are, or which one is better
• Typically, many decisions are made for the practitioner
• It comes with many assumptions and limitations
49. Validity and Reliability
• Validity: are we measuring what we want to measure?
o Internal: could the observed effects be due to hidden factors?
o External: are the input items, annotators, etc. generalizable?
o Construct: do system-measures match user-measures?
o Conclusion: how good is good, and how much better is better?
→ Systematic error
• Reliability: how repeatable are the results?
o Will I obtain the same results with a different collection?
o How large do collections need to be?
o What statistical methods should be used?
→ Random error
51. So long as...
• So long as the dataset is large enough to minimize random error and draw reliable conclusions
• So long as the tools we use to make those conclusions can be trusted
• So long as the task and use case are clear
• So long as the annotation protocol and performance measure (i.e. the user model) are realistic and actually measure something meaningful for the use case
• So long as the samples of inputs and annotators present in the dataset are representative for the task
52. What else
How
So long as...
• So long as the dataset is large enough to minimize random error and draw reliable conclusions
• So long as the tools we use to make those conclusions can be trusted
• So long as the task and use case are clear
• So long as the annotation protocol and performance measure (i.e. the user model) are realistic and actually measure something meaningful for the use case
• So long as the samples of inputs and annotators present in the dataset are representative for the task
53. "If you can't measure it, you can't improve it."
– Lord Kelvin
54. "If you can't measure it, you can't improve it."
– Lord Kelvin
"But measurements have to be trustworthy."
– yours truly
57. Populations of interest
• The task and use case define the populations of interest
o Music tracks
o Users
o Annotators
o Vocabularies
• Impossible to study all entities in these populations
o Too many
o Don't exist anymore (or yet)
o Too far away
o Too expensive
o Illegal
58. Populations and samples
• Our goal is to study the performance of the system on these populations
o Lets us know what to expect from the system in the real world
o We're typically interested in the mean: the expectation μ
o Based on this we would decide what research line to pursue, what paper to publish, what project to fund, etc.
o But variability is also important, though often neglected
• A dataset represents just a sample from that population
o By studying the sample we could generalize back to the population
o But this will bear some degree of random error due to sampling
64. Populations and samples
• This is an estimation problem
• The objective is the distribution F of performance over the population, specifically its mean μ
• Given a sample of observations X₁, …, Xₙ, estimate μ
o The most straightforward estimator is the sample mean: μ̂ = X̄
o Problem: X̄ = μ + ε, where ε is random error
o For any given sample or dataset, we only know X̄
• How confident we are in the results and our conclusions depends on the size of ε with respect to μ
76. Populations and samples
[Figure: three random samples (n = 100) of performance scores, shown as histograms, all drawn from the same population distribution of performance over [0, 1]]
77. Sampling distribution and standard error
[Figure: the population distribution of performance, and the sampling distribution of the mean X̄ for n = 10, 30, 50, 100; larger samples give a narrower sampling distribution]
• Let us assume some distribution with some mean μ
• Experiment: draw a random sample of size n and compute X̄
• The sampling distribution is the distribution of X̄ over replications of the experiment
• The standard error is the std. dev. of the sampling distribution
78. Estimating the mean
• The true distribution F has mean μ and variance σ²
o E[X] = μ
o Var(X) = σ²
• For the sample mean X̄ = (1/n) Σ Xᵢ we have
o E[X̄] = (1/n) Σ E[Xᵢ] = μ
o Var(X̄) = (1/n²) Σ Var(Xᵢ) = σ²/n
o std. error = σ_X̄ = σ/√n
• Law of large numbers: X̄ → μ as n → ∞
o The larger the dataset, the better our estimates
o Regardless of the true distribution F over the population (see the sketch below)
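A minimal numpy sketch of this estimation (the population, its distribution and the sample size are made-up assumptions): draw one sample, then report the sample mean and its estimated standard error s/√n.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of performance scores in [0, 1] (Beta is just an assumption).
population = rng.beta(2, 3, size=1_000_000)
mu = population.mean()                        # true mean we want to estimate

n = 50                                        # dataset (sample) size
sample = rng.choice(population, size=n, replace=False)

x_bar = sample.mean()                         # sample mean, our estimate of mu
std_error = sample.std(ddof=1) / np.sqrt(n)   # estimated standard error of the mean

print(f"true mean = {mu:.3f}   estimate = {x_bar:.3f}   std. error = {std_error:.3f}")
```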
79. Gaussians everywhere
• For the special case where F = N(μ, σ²), the sample mean is also Gaussian, specifically X̄ ~ N(μ, σ²/n)
• But a Gaussian distribution is sometimes a very unrealistic model for our data
• Central Limit Theorem (CLT): X̄ → N(μ, σ²/n) in distribution as n → ∞
o Regardless of the shape of the original F (see the sketch below)
[Figure: a clearly non-Gaussian population distribution of performance, and the sampling distributions of X̄ for n = 10 and n = 100, which look increasingly Gaussian]
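A small simulation sketch of the CLT (the population and sample sizes are illustrative assumptions): replicate the experiment many times and check that the spread of X̄ matches σ/√n even though the population is far from Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
# A clearly non-Gaussian population, just for illustration.
population = rng.exponential(scale=1.0, size=1_000_000)
sigma = population.std()

for n in (10, 100):
    # Sampling distribution of the mean: many replications of the same experiment.
    means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])
    print(f"n={n:3d}  sd of sample means = {means.std():.4f}  sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")
```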
80. Approximations
• Until the late 1890s, the CLT was invoked everywhere for the simplicity of working with Gaussians
• Tables of the Gaussian distribution were used to test
• Still, there were two main problems
o σ² is unknown
o The rate of convergence, i.e. small samples
• But something happened at the Guinness factory in 1908
82. Student-t distribution
• Gaussian approximations for sampling distributions were reasonably good for large samples, but not for small ones
• Gosset thought about deriving the theoretical distributions under assumptions about the underlying model
• Specifically, when X ~ N(μ, σ²):
o If σ is known, we know that z = (X̄ − μ) / (σ/√n) ~ N(0, 1)
o If σ is unknown, Gosset introduced the Student-t distribution: t = (X̄ − μ) / (s/√n) ~ t(n − 1), where s is the sample standard deviation
o In a sense, it accounts for the uncertainty of estimating σ with s
83. Small-sample problems
• In non-English literature there are earlier mentions, but it was popularized by Gosset and, mostly, Fisher
• He initiated the study of the so-called small-sample problems, specifically with the Student-t distribution
[Figure: density of the Student-t distribution for n = 2, 3, 6, 30; it approaches the standard Gaussian as n grows]
85. Fisher and small samples
• Gosset did not provide a proof of the t distribution, but Fisher did in 1912-1915
• Fisher stopped working on small-sample problems until Gosset convinced him in 1922
• He then worked out exact distributions for correlation coefficients, regression coefficients, χ² tests, etc. in the early 1920s
• These, and much of his work on estimation and design of experiments, were collected in his famous 1925 book
• This book is sometimes considered the birth of modern statistical methods
Ronald Fisher
87. Fisher's significance testing
• In those other papers Fisher developed his theory of significance testing
• Suppose we have observed data X ~ P(x; θ) and we are interested in testing the null hypothesis H₀: θ = θ₀
• We choose a relevant test statistic T s.t. large values of T reflect evidence against H₀
• Compute the p-value p = P(T(X*) ≥ T(X) | H₀), that is, the probability that, under H₀, we observe a sample X* with a test statistic at least as extreme as the one observed initially
• Assess the statistical significance of the results, that is, reject H₀ if p is small
88. Testing the mean
• We observed X = {−0.13, 0.68, −0.34, 2.10, 0.83, −0.32, 0.99, 1.24, 1.08, 0.19} and assume a Gaussian model
• We set H₀: μ = 0 and choose a t statistic
• For our data, t = 2.55 and p = 0.0155 (see the sketch below)
• If we consider p small enough, we reject H₀
[Figure: density of the test statistic under H₀, with the observed t marked and the p-value as the area in the tail]
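The same test can be reproduced with scipy; the one-sided alternative is an assumption here, since the p reported on the slide corresponds to P(T ≥ t | H₀).

```python
from scipy import stats

x = [-0.13, 0.68, -0.34, 2.10, 0.83, -0.32, 0.99, 1.24, 1.08, 0.19]

# One-sample t-test of H0: mu = 0 against H1: mu > 0.
res = stats.ttest_1samp(x, popmean=0, alternative="greater")
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}")   # t = 2.55, p ≈ 0.0155
```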
89. Small p-values
• "we do not want to know the exact value of p […], but, in the first place, whether or not the observed value is open to suspicion"
• Fisher provided in his book tables not of the new small-sample distributions, but of selected quantiles
• These allow for calculation of ranges of p-values given test statistics, as different degrees of evidence against H₀
• The p-value is gradable, a continuous measure of evidence
91. p and α
• Fisher employed the term significance level α for these theoretical p-values used as reference points to identify statistically significant results: reject H₀ if p ≤ α
• This is context-dependent, is not fixed beforehand, and can change from time to time
• He arbitrarily "suggested" α = .05 for illustration purposes
• Observing p > α does not prove H₀; it just fails to reject it
93. Pearson
• Pearson saw Fisher's tables as a way to compute critical values that "lent themselves to the idea of choice, in advance of experiment, of the risk of the 'first kind of error' which the experimenter was prepared to take"
• In a letter to Pearson, Gosset replied "if the chance is very small, say .00001, […] what it does is to show that if there is any alternative hypothesis which will explain the occurrence of the sample with a more reasonable probability, say .05 […], you will be very much more inclined to consider that the original hypothesis is not true"
Egon Pearson
94. Neyman
• Pearson saw the light: "the only valid reason for rejecting a statistical hypothesis is that some alternative explains the observed events with a greater degree of probability"
• In 1926 Pearson wrote to Neyman to propose his ideas of hypothesis testing, which they developed and published in 1928
Jerzy Neyman
100. Neyman-Pearson hypothesis testing
• Define the null and alternative hypotheses, e.g. H₀: θ = 0 and H₁: θ = 0.5
• Set the acceptable error rates α (type 1) and β (type 2)
• Select the most powerful test T for the hypotheses and α, which sets the critical value T_α
• Given H₁ and β, select the sample size n required to detect an effect θ or larger (sketch below)
• Collect data and reject H₀ if T(X) ≥ T_α
• The testing conditions are set beforehand: H₀, H₁, α, β
• The experiment is designed for a target effect θ, which determines n
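A sketch of this design-beforehand step with statsmodels; the effect size, α and power below are illustrative choices, not values from the slide.

```python
from statsmodels.stats.power import TTestPower

# Sample size needed by a one-sample/paired t-test to detect a standardized
# effect of d = 0.5 with alpha = 0.05 and power = 1 - beta = 0.8.
n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                             alternative="two-sided")
print(f"required n ≈ {n:.1f}")   # roughly 34 observations
```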
101. Error rates and tests
• Under repeated experiments, the long-run error rate is α
• Neyman-Pearson did not suggest values for it: "the balance [between the two kinds of error] must be left to the investigator […] we attempt to adjust the balance between the risks 1 and 2 to meet the type of problem before us"
• For β they "suggested" α ≤ β ≤ 0.20
• To Fisher, the choice of test statistic in his methodology was rather obvious to the investigator and wasn't important to him
• Neyman-Pearson answered this by defining the "best" test: the one that minimizes error 2 subject to a bound on error 1
102. Likelihood ratio test
• Pearson apparently suggested the likelihood ratio test for their new hypothesis testing methodology: λ = P(X | H₀) / P(X | H₁)
• It was later found that, as n → ∞, −2 log λ ~ χ²
• Neyman was reluctant, as he thought some Bayesian consideration had to be taken about prior distributions over the hypotheses ("inverse probability" at the time)
• For simple point hypotheses like H₀: θ = θ₀ and H₁: θ = θ₁, the likelihood ratio test turned out to be the most powerful
• In the case of comparing means of Gaussians, this reduces to Student's t-test!
103. Composite hypotheses
• Neyman-Pearson theory extends to composite hypotheses of the form H: θ ∈ Θ, such as H₁: θ > 0.5
• The math got more complex, and Neyman was still somewhat reluctant: "it may be argued that it is impossible to estimate the probability of such a hypothesis without a knowledge of the relative a priori probabilities of the constituent simple hypotheses"
• Although "wishing to test the probability of a hypothesis A we have to assume that all hypotheses are a priori equally probable and calculate the probability a posteriori of A"
105. Recap
• Fisher: significance testing
o Inductive inference: rational belief when reasoning from sample to population
o Rigorous experimental design to extract results from few samples
o Replicate and develop your hypotheses, consider all significant and non-significant results together
o Power cannot be computed beforehand
• Neyman-Pearson: hypothesis testing
o Inductive behavior: frequency of errors in judgments
o Long-run results from many samples
o p-values don't have frequentist interpretations
• In the 1940s the two worlds began to appear as just one in statistics textbooks, and were rapidly adopted by researchers
108. Null Hypothesis Significance Testing
• Collect data
• Set hypotheses, typically H₀: θ = 0 and H₁: θ ≠ 0
o Either there is an effect or there isn't
• Set α, typically to 0.05 or 0.01
• Select a test statistic based on the hypotheses and compute p
• If p ≤ α, reject the null; fail to reject if p > α
109. Common bad practices
• Run tests blindly without looking at your data
• Decide on α after computing p
• Report "(not) significant at the 0.05 level" instead of providing the actual p-value
• Report degrees of significance like "very" or "barely"
• Do not report the test statistic alongside p, e.g. t(58) = 1.54
• Accept H₀ if p > α, or accept H₁ if p ≤ α
• Interpret p as the probability of the null
• Simply reject the null, without looking at the effect size
• Ignore the type 2 error rate β, i.e. power analysis a posteriori
• Interpret a statistically significant result as important
• Train the same models until significance is found
• Publish only statistically significant results ¯\_(ツ)_/¯
111. Paired tests
• We typically want to compare our system B with some baseline system A
• We have the scores over n inputs from some dataset
• The hypotheses are H₀: μ_A = μ_B and H₁: μ_A ≠ μ_B
[Figure: histograms of performance scores for A (mean = 0.405, sd = 0.213) and B (mean = 0.425, sd = 0.225)]
112. Paired test
• If we ignore the structure of the experiment, we have a bad model and a test with low power
o Simple (unpaired) test: t = 0.28, p = 0.78
• In our experiments, every observation from A corresponds to an observation from B, i.e. they are paired observations
• The test can account for this to better model our data
o Instead of looking at A vs B, we look at A − B vs 0
o Paired test: t = 2.16, p = 0.044
[Figure: per-item scores of A and B, connected by lines to show the pairing]
113. Paired t-test
• Assumption:
o Data come from Gaussian distributions
• Equivalent to a t-test of μ_D = 0, where Dᵢ = Bᵢ − Aᵢ
• t = X̄_D / se(X̄_D) = 0.35, p = 0.73 (see the sketch below)

A    B
.76  .75
.33  .37
.59  .59
.28  .15
.36  .49
.43  .50
.21  .33
.43  .27
.72  .81
.40  .36

[Figure: scatter plot of the ten paired scores of A and B]
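With scipy, both the unpaired and the paired test can be run on the ten scores from the table; the paired version matches the slide (t = 0.35, p = 0.73).

```python
from scipy import stats

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

# Unpaired test: ignores that the scores are matched per input item.
print(stats.ttest_ind(B, A))
# Paired test: equivalent to a one-sample t-test of D = B - A against 0.
print(stats.ttest_rel(B, A))   # t ≈ 0.35, p ≈ 0.73
```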
114. Wilcoxon signed-rank test
• Assumptions:
o Data measured at least at interval level
o Distribution is symmetric
o Disregard for magnitudes
• Convert all untied Dᵢ to ranks Rᵢ
• Compute W⁺ and W⁻, the sums of the Rᵢ with positive or negative Dᵢ
• The test statistic is W = min(W⁺, W⁻) = 21
• W follows a Wilcoxon distribution, from which one can calculate p = 0.91 (sketch below)

A    B    D     rank
.76  .75  -.01  1
.33  .37  .04   2
.59  .59  0     -
.28  .15  -.13  8
.36  .49  .13   7
.43  .50  .07   4
.21  .33  .12   6
.43  .27  -.16  9
.72  .81  .09   5
.40  .36  -.04  3
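The same comparison with scipy's signed-rank test; the slide reports the exact p = 0.91, while scipy may fall back to a normal approximation because of the tied |D| values, so its p can differ slightly.

```python
from scipy import stats

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

# Signed-rank test on the paired differences; the zero difference is dropped.
res = stats.wilcoxon(B, A, zero_method="wilcox")
print(f"W = {res.statistic}, p = {res.pvalue:.2f}")   # W = 21, p ≈ 0.9
```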
115. Sign test
• Complete disregard for magnitudes
• Simulate coin flips: was system B better (or worse) than A for some input?
• Follows a Binomial distribution
• The test statistic is the number of successes (B > A), which is 5
• The p-value is the probability of 5 or more successes in 9 coin flips = 0.5 (sketch below)

A    B    sign
.76  .75  -1
.33  .37  +1
.59  .59  0
.28  .15  -1
.36  .49  +1
.43  .50  +1
.21  .33  +1
.43  .27  -1
.72  .81  +1
.40  .36  -1
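The sign test is just a binomial test on how often B beats A, with ties removed (scipy ≥ 1.7 provides binomtest).

```python
from scipy.stats import binomtest

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

wins = sum(b > a for a, b in zip(A, B))      # B better: 5 times
trials = sum(b != a for a, b in zip(A, B))   # ties dropped: 9 "coin flips"

# Probability of 5 or more successes in 9 fair coin flips.
res = binomtest(wins, trials, p=0.5, alternative="greater")
print(f"successes = {wins}/{trials}, p = {res.pvalue:.2f}")   # p = 0.5
```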
116. Bootstrap test
• Compute the deltas, Dᵢ = Bᵢ − Aᵢ
• The empirical distribution of the Dᵢ estimates the true distribution F(D)
• Repeat for t = 1, …, T, with T large (thousands of times):
o Draw a bootstrap sample by sampling n scores with replacement from the empirical distribution
o Compute the mean of the bootstrap sample, B̄_t
• Let B̄ = (1/T) Σ_t B̄_t
• The B̄_t estimate the sampling distribution of the mean
• The p-value is (1/T) Σ_t 1[B̄_t − B̄ ≥ D̄] ≈ 0.71 (see the sketch below)
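A numpy sketch of one common way to implement a bootstrap test on these deltas (centering them at zero under H₀ and using a two-sided count are my assumptions; the slide's own variant reports ≈ 0.71).

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A
T = 10_000

# Resample the deltas with replacement under H0 by centering them at zero, then
# count how often the bootstrap mean is at least as extreme as the observed one.
D0 = D - D.mean()
boot_means = np.array([rng.choice(D0, size=D0.size, replace=True).mean() for _ in range(T)])
p = np.mean(np.abs(boot_means) >= abs(D.mean()))
print(f"observed mean delta = {D.mean():.3f}, bootstrap p ≈ {p:.2f}")
```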
117. Permutation test
• Under the null hypothesis, an arbitrary score could have been generated by system A or by system B
• Repeat for t = 1, …, T, with T large (thousands of times):
o Create a sample X_t by randomly swapping the sign of each observation
o Compute its mean X̄_t
• The p-value is (1/T) Σ_t 1[X̄_t ≥ D̄] ≈ 0.73 (see the sketch below)
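A numpy sketch of the sign-flipping permutation test on the same deltas (a two-sided count here, as an assumption; the slide's one-sided count gives ≈ 0.73).

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A
T = 10_000

# Under H0 either system could have produced either score, so the sign of each
# delta is arbitrary: flip signs at random and recompute the mean every time.
signs = rng.choice([-1, 1], size=(T, D.size))
perm_means = (signs * D).mean(axis=1)
p = np.mean(np.abs(perm_means) >= abs(D.mean()))
print(f"permutation p ≈ {p:.2f}")
```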
121. ANOVA
• Assume a model y_ij = m + s_i + t_j + e_ij, with system effects s_i, item effects t_j and residual error e_ij
o Implicitly "pairs" the observations by item
• The variance of the observed scores can be decomposed as σ²_y = σ²_s + σ²_t + σ²_e
o where σ²_s is the variance across system means
o Low: system means are close to each other
o High: system means are far from each other
• The null hypothesis is H₀: s₁ = s₂ = s₃ = ⋯
o Even if we reject, we still don't know which system is different!
• The test statistic is of the form F = σ̂²_s / σ̂²_e (sketch below)
o We'd like to have σ²_s ≫ σ²_t, σ²_e
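A sketch of this model with statsmodels on a made-up long-format table of (system, item, score) observations; the F statistic of the system factor is the one the slide describes.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical results: one score per (system, item) pair.
scores = pd.DataFrame({
    "system": ["S1"] * 5 + ["S2"] * 5 + ["S3"] * 5,
    "item":   list(range(5)) * 3,
    "score":  [.4, .5, .3, .6, .5,
               .5, .6, .4, .7, .5,
               .3, .4, .3, .5, .4],
})

# Two-way ANOVA without replication: system effect plus item effect.
model = ols("score ~ C(system) + C(item)", data=scores).fit()
print(sm.stats.anova_lm(model, typ=2))
```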
122. Friedman test
• Same principle as ANOVA, but non-parametric
• Similarly to Wilcoxon, rank the observations (per item) and estimate effects
• Ignores actual magnitudes; simply uses ranks (sketch below)
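With scipy, the Friedman test takes the per-item scores of each system (the same hypothetical numbers as in the ANOVA sketch above).

```python
from scipy.stats import friedmanchisquare

# Per-item scores of three systems over the same five items (hypothetical).
s1 = [.4, .5, .3, .6, .5]
s2 = [.5, .6, .4, .7, .5]
s3 = [.3, .4, .3, .5, .4]

stat, p = friedmanchisquare(s1, s2, s3)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```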
123. Multiple testing
• When testing multiple hypotheses, the probability of at least one type 1 error increases
• Multiple testing procedures correct p-values for a family-wise error rate
[Carterette, 2015a]
124. Tukey's HSD
• Follow ANOVA to test H₀: s₁ = s₂ = s₃ = ⋯
• The maximum observed difference between systems is likely the one causing the rejection
• Tukey's HSD compares all pairs of systems, each with an individual p-value
• These p-values are then corrected based on the expected distribution of maximum differences under H₀
• In practice, it inflates p-values
• Ensures a family-wise type 1 error rate (sketch below)
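Tukey's HSD over all pairs of systems can be run with statsmodels on the same kind of long-format table (hypothetical data; note that this simple one-way call ignores the item pairing).

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = pd.DataFrame({
    "system": ["S1"] * 5 + ["S2"] * 5 + ["S3"] * 5,
    "score":  [.4, .5, .3, .6, .5,
               .5, .6, .4, .7, .5,
               .3, .4, .3, .5, .4],
})

# All pairwise comparisons with family-wise corrected p-values.
print(pairwise_tukeyhsd(scores["score"], scores["system"], alpha=0.05))
```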
126. Others
• There are many other procedures to control for multiple comparisons
o Bonferroni: very conservative (low power)
o Dunnett's: compare all against a control (e.g. the baseline)
• Other procedures control for the false discovery rate, i.e. the probability of a type 1 error given p < α
• One way or another, they all inflate p-values (sketch below)
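Generic p-value corrections (Bonferroni, Holm, Benjamini-Hochberg FDR) are available in statsmodels; the p-values below are made up for illustration.

```python
from statsmodels.stats.multitest import multipletests

pvalues = [0.01, 0.04, 0.03, 0.20]   # hypothetical per-comparison p-values

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvalues, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adj])
```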
127. Part III: What Else? Validity!
Julián Urbano, Arthur Flexer
An ISMIR 2018 Tutorial · Paris
128. What else can go wrong? Validity!
• Validity reprise
• Example I: Inter-rater agreement
• Example II: Adversarial examples
130. Validity
• Validity
o A valid experiment is an experiment that actually measures what the experimenter intended to measure
o Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users?
o Internal validity: is this relationship causal, or could confounding factors explain the relation?
o External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment?
o Construct validity: are the intentions and hypotheses of the experimenter represented in the actual experiment?
134. Computation of similarity between songs
• Collaborative filtering (Spotify, Deezer?)
• Social meta-data (Last.Fm?)
• Expert knowledge (Pandora?)
• Meta-data from the web
• ...
• Audio-based
135. Computation of similarity between songs
Songs as audio → switch to frequencies → computation of features → machine learning → similarity (metric): S(a1, a2) = ?
Pictures from E. Pampalk's PhD thesis, 2006
139. How can we evaluate our models of music similarity?
[Figure: a matrix of pairwise similarity scores between songs (e.g. 45, 87, 100, ...)]
140. How can we evaluate our models of music similarity?
• Do these numbers correspond to a human assessment of music similarity?
142. MIREX -
Music Information Retrieval eXchange
• Standardized testbeds allowing for fair comparison of MIR systems
o range of different tasks
o based on human evaluation
• Cranfield: remove users, look at annotations only
143. MIREX -
Music Information Retrieval eXchange
• Standardized testbeds allowing for fair comparison of MIR systems
o range of different tasks
o based on human evaluation
• Cranfield: remove users, look at annotations only
• What is the level of agreement between human raters/annotators?
• What does this mean for the evaluation of MIR systems?
o Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016.
145. Audio music similarity
• Audio Music Similarity and Retrieval (AMS) task, 2006-2014
• 5000-song database
• participating MIR systems compute a 5000×5000 distance matrix
• 60 randomly selected queries
• return the 5 closest candidate songs for each of the MIR systems
• for each query/candidate pair, ask the human grader:
o "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 10 (very similar or identical)."
146. Audio music similarity
• Audio Music Similarity and Retrieval (AMS) task, 2006-2014
• 7000-song database
• participating MIR systems compute a 7000×7000 distance matrix
• 100 randomly selected queries
• return the 5 closest candidate songs for each of the MIR systems
• for each query/candidate pair, ask the human grader:
o "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical)."
147. Audio music similarity
• Audio Music Similarity and Retrieval (AMS) task, 2006-2014
• 7000-song database
• participating MIR systems compute a 7000×7000 distance matrix
• 50 randomly selected queries
• return the 10 closest candidate songs for each of the MIR systems
• for each query/candidate pair, ask the human grader:
o "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical)."
152. What about validity?
• A valid experiment is an experiment that actually measures what the experimenter intended to measure
• What is the intention of the experimenter in the AMS task?
• What do we want to measure here?
153. Audio music similarity
• Audio Music Similarity and Retrieval (AMS) task, 2006-2014
• 7000-song database
• participating MIR systems compute a 7000×7000 distance matrix
• 100 randomly selected queries
• return the 5 closest candidate songs for each of the MIR systems
• for each query/candidate pair, ask the human grader:
o "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not Similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical)."
160. Rate the similarity!
• Factors that influence human music perception
o Schedl M., Flexer A., Urbano J.: The Neglected User in Music Information Retrieval Research, Journal of Intelligent Information Systems, Volume 41, Issue 3, pp. 523-539, 2013.
163. Inter-rater agreement in AMS
• AMS 2006 is the only year with multiple graders
• each query/candidate pair was evaluated by three different human graders
• each grader gives a FINE score between 0 … 10 (not … very similar)
• correlation between pairs of graders
165. Inter-rater agreement in AMS
• inter-rater agreement for different intervals of FINE scores
[Figure: agreement shown separately for successive intervals of the FINE scores]
170. Inter-rater agreement in AMS
• look at "very similar" ratings in the [9,10] interval
• what sounds very similar to one grader will, on average, receive a score of only 6.54 from other graders
• this constitutes an upper bound for average FINE scores in AMS
• there will always be users that disagree (a moving target)
[Figure: average FINE score of 6.54 given by the other graders]
172. Comparison to the upper bound
• compare top-performing systems (2007, 2009-2014) to the upper bound
[Figure: average FINE scores of the top systems per year against the 6.54 upper bound]
174. Comparison to the upper bound
• the upper bound has already been reached in 2009
[Figure: the top system (PS2) reaches the upper bound from 2009 onwards]
176. Comparison to the upper bound
• the upper bound has already been reached in 2009
• can the upper bound be surpassed in the future?
• or is this an inherent problem due to low inter-rater agreement in human evaluation of music similarity?
• this prevents progress in MIR research on music similarity
• AMS task dead since 2015
178. What about validity?
• A valid experiment is an experiment that actually measures what the experimenter intended to measure
• What is the intention of the experimenter in the AMS task?
• What do we want to measure here?
179. Construct Validity
• Construct validity: are the intentions and hypotheses of the experimenter represented in the actual experiment?
• Unclear intention: to measure an abstract concept of music similarity?
• Possible solutions:
o a more fine-grained notion of similarity
o ask a more specific question?
o does something like abstract music similarity even exist?
o evaluation of complete MIR systems centered around a specific task/use case could lead to much clearer hypotheses
• Remember the MIREX Grand Challenge on user experience (2014)?
o "You are creating a short video about a memorable occasion that happened to you recently, and you need to find some (copyright-free) songs to use as background music."
181. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
• Many factors that influence human music perception need to be controlled in the experimental design
182. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
• Many factors that influence human music perception need to be controlled in the experimental design
• Possible solutions:
o Independent variable: type of algorithm
o Dependent variable: FINE similarity rating
o Control variables: gender and age (e.g. female only, 20-30y), musical training/experience/preference (e.g. music professionals), type of music (e.g. piano concertos only)
o Very specialized, limited generality
188. External Validity
• External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment?
• Unclear target population: identical with the sample of 7000 US pop songs? All US pop music in general?
• Beware: cross-collection studies show dramatic losses in performance
o Bogdanov, D., Porter, A., Herrera Boyer, P., & Serra, X. (2016). Cross-collection evaluation for music classification tasks. ISMIR 2016.
• Possible solutions:
o Clear target population
o More constrained target populations
o Much larger data samples
o Use case?
190. Conclusion Validity
• Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users?
• A large difference in effectiveness measures is needed before users actually notice the difference
o J. Urbano, J. S. Downie, B. McFee and M. Schedl: How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval, ISMIR 2012.
• Possible solutions:
o Are there system measures that better correspond to user measures?
o Use case!
194. Lack of inter-rater agreement
• It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound
• MIREX "Music Structural Segmentation" task
o Human annotations of structural segmentations (structural boundaries and labels denoting repeated segments, e.g. chorus, verse, …)
o Algorithms have to produce such annotations
o F1-score between different annotators as the upper bound
o Upper bound reached, at least for certain music (classical and world music)
• Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016.
• Smith, J.B.L., Chew, E.: A meta-analysis of the MIREX structure segmentation task, ISMIR, 2013.
• Serrà, J., Müller, M., Grosche, P., & Arcos, J.L.: Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Transactions on Multimedia, Special Issue on Music Data Mining, 16(5), 1229-1240, 2014.
197. Inter-rater agreement and upper bounds
• Extraction of metrical structure
o Quinton, E., Harte, C., Sandler, M.: Extraction of metrical structure from music recordings, DAFX 2015.
• Melody estimation
o Balke, S., Driedger, J., Abeßer, J., Dittmar, C., Müller, M.: Towards Evaluating Multiple Predominant Melody Annotations in Jazz Recordings, ISMIR 2016.
o Bosch J.J., Gómez E.: Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms, Proc. of the Conference on Interdisciplinary Musicology, 2014.
• Timbre and rhythm similarity
o Panteli, M., Rocha, B., Bogaards, N., Honingh, A.: A model for rhythm and timbre similarity in electronic dance music. Musicae Scientiae, 21(3), 338-361, 2017.
• Many more?
199. Adversarial Examples - Image Recognition
• An adversary slightly and imperceptibly changes an input image to fool a machine learning system
o Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014.
[Figure: original image + noise = adversarial example; all classified as "Camel"]
200. Adversarial Examples - MIR
• Imperceptibly filtered audio fools a genre recognition system
o Sturm B.L.: A simple method to determine if a music information retrieval system is a horse, IEEE Trans. on Multimedia, 16(6), pp. 1636-1644, 2014.
[Figure: the "deflate" filtering example]
201. Adversarial Examples - MIR
• Imperceptibly filtered audio fools a genre recognition system
o Audio examples: http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html
202. External Validity
• External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment?
• Unclear target population: identical with the sample of a few hundred ISMIR or GTZAN songs?
• Or are we aiming at genre classification in general?
• If the target is genre classification in general, there is a problem!
203. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
• Why can these MIR systems be fooled so easily?
o no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier
• What is the confounding variable?
204. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
• Why can these MIR systems be fooled so easily?
• E.g.: in the case of rhythm classification, systems were picking up tempo, not rhythm! Tempo acted as a confounding factor!
o Sturm B.L.: "The Horse" Inside: Seeking Causes Behind the Behaviors of Music Content Analysis Systems, Computers in Entertainment, 14(2), 2016.
205. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
o no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier
• What is the confounding variable?
o high dimensionality of the input data space?
o small perturbations to the input data might accumulate over many dimensions, with minor changes "snowballing" into larger changes in the transfer functions of deep neural networks
o Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014
206. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
o no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier
• What is the confounding variable?
o linearity of the models?
o "linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect" … rectified linear units (ReLU)?
o Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014
207. Internal Validity
• Internal validity: is the relationship causal, or could confounding factors explain the relation?
o no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier
• Open question: what is the confounding variable?
209. Validity
• Validity
o A valid experiment is an experiment that actually measures what the experimenter intended to measure
o Conclusion validity
o Internal validity
o External validity
o Construct validity
• Care about the validity of your experiments!
• Validity is the right framework to talk about these problems
211. What's in a p-value?
• It confounds effect size and sample size, e.g. t = √n (X̄ − μ₀) / s
• Unfortunately, we virtually never check power. Don't ever accept H₀
• An "easy" way to achieve significance is obtaining more data, but the true effect remains the same
• Even if one rejects H₀, it could still be true
213. H₀ is always false
• In this kind of dataset-based experiment, H₀: θ = 0 is always false
o Two systems may be veeeeeery similar, but not the same
• Binary accept/reject decisions don't even make sense
o Why bother with multiple comparisons then?
• Care about type S (sign) and type M (magnitude) errors
• To what extent do non-parametric methods make sense (Wilcoxon, Sign, Friedman), especially combined with parametric procedures like Tukey's?
214. Binary thinking no more
• Nothing wrong with the p-value, but with its use
o p as a detective vs p as a judge
• Any α is completely arbitrary
o What is the cost of a type 2 error?
• How does the lack of validity affect NHST? (measures, sampling frames, ignoring cross-assessor variability, etc.)
• What about researcher degrees of freedom?
• Why not focus on effect sizes? Intervals, correlations, etc.
• Bayesian methods? What priors?
215. Assumptions
• In dataset-based (M)IR experiments, test assumptions are false by definition
o p-values are, to some degree, approximated
o So again, why use any threshold?
• So which test should you choose?
o Run them all, and compare
o If they tend to disagree, take a closer look at the data
• Look beyond the experiment at hand, gather more data
• Always perform error analysis to make sense of it
216. Replication
• Fisher, and especially Neyman-Pearson, advocated for replication
o A p-value is only concerned with the current data
o The hypothesis testing framework only makes sense with repeated testing
• In (M)IR we hardly do it; we're stuck with the same datasets
220. Significant ≠ Relevant ≠ Interesting
[Venn diagram: all research, interesting research, relevant research, statistically significant research]
221. There is always random error in our experiments, so we always need some kind of statistical analysis.
But there is no point in being too picky or intense about how we do it.
Nobody knows how to do it properly, and different fields adopt different methods.
What is far more productive is to adopt an exploratory attitude rather than mechanically testing.
223. β Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness
Measures and User Satisfaction. ACM SIGIR
• Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null Hypothesis Testing: Problems, Prevalence, and an Alternative. Journal of Wildlife Management
β Armstrong, T.G., Moffat, A., Webber, W. & Zobel, J. (2009). Improvements that don't add up: ad-hoc
retrieval results since 1998. CIKM
• Balke, S., Driedger, J., Abeßer, J., Dittmar, C. & Müller, M. (2016). Towards Evaluating Multiple Predominant Melody Annotations in Jazz Recordings. ISMIR
β Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science
• Bosch J.J. & Gómez E. (2014). Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms. Conference on Interdisciplinary Musicology
β Boytsov, L., Belova, A. & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR
experiments. SIGIR
β Carterette, B. (2012). Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval
Experiments. ACM Transactions on Information Systems
β Carterette, B. (2015a). Statistical Significance Testing in Information Retrieval: Theory and Practice.
ICTIR
β Carterette, B. (2015b). Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR
β Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum
β Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. ACM
SIGIR
β Downie, J. S. (2004). The Scientific Evaluation of Music Information Retrieval Systems: Foundations
and Future. Computer Music Journal
β Fisher, R. A. (1925). Statistical Methods for Research Workers. Cosmo Publications
β Flexer, A. (2006). Statistical Evaluation of Music Information Retrieval Experiments. Journal of New
Music Research
224. β Flexer, A., Grill, T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal
of New Music Research
• Gelman, A. (2013b). The problem with p-values is how they're used.
• Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don't Have to Worry About Multiple Comparisons. Journal of Research on Educational Effectiveness
• Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time.
β Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist
β Gelman, A., & Stern, H. (2006). The Difference Between Significant and Not Significant is not Itself
Statistically Significant. The American Statistician
β Goodfellow I.J., Shlens J. & Szegedy C. (2014). Explaining and harnessing adversarial examples. ICLR
β Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., & Langlois, T. (2014). On Evaluation Validity in
Music Autotagging. ACM Computing Research Repository.
β Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and
User Evaluations Give the Same Results? ACM SIGIR
β Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar
Songs. ISMIR
β Hull, D. (1993). Using Statistical Testing in the Evaluation of Retrieval Experiments. ACM SIGIR
β Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine
β Lehmann, E.L. (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or
Two? Journal of the American Statistical Association
β Lehmann, E.L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer
225. β Lee, J. H., & Cunningham, S. J. (2013). Toward an understanding of the history and impact of user
studies in music information retrieval. Journal of Intelligent Information Systems
β Marques, G., Domingues, M. A., Langlois, T., & Gouyon, F. (2011). Three Current Issues In Music
Autotagging. ISMIR
β Neyman, J. & Pearson, E.S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes
of Statistical Inference: Part I. Biometrika
β Panteli, M., Rocha, B., Bogaards, N. & Honingh, A. (2017). A model for rhythm and timbre similarity in
electronic dance music. Musicae Scientiae
β Quinton, E., Harte, C. & Sandler, M. (2015). Extraction of metrical structure from music recordings.
DAFX
β Sakai, T. (2014). Statistical Reform in Information Retrieval? ACM SIGIR Forum
β Savoy, J. (1997). Statistical Inference in Retrieval Effectiveness Evaluation. Information Processing and
Management
β Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval
Research. Journal of Intelligent Information Systems
β Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs
for Generalized Causal Inference. Houghton-Mifflin
• Serrà, J., Müller, M., Grosche, P., & Arcos, J.L. (2014). Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Trans. on Multimedia
β Smith, J.B.L. & Chew, E. (2013). A meta-analysis of the MIREX structure segmentation task. ISMIR
β Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for
Information Retrieval Evaluation. ACM CIKM
• Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes. ACM SIGIR
226. β Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our
Metrics, that They Are Unrealistic. European Workshop on Human-Computer Interaction and
Information Retrieval
β Student. (1908). The Probable Error of a Mean. Biometrika
β Sturm, B. L. (2013). Classification Accuracy is Not Enough: On the Evaluation ofMusic Genre
Recognition Systems. Journal of Intelligent Information Systems
β Sturm, B. L. (2014). The State of the Art Ten Years After a State of the Art: Future Research in Music
Information Retrieval. Journal of New Music Research
β Sturm, B.L. (2014). A simple method to determine if a music information retrieval system is a horse,
IEEE Trans. on Multimedia
β Sturm B.L. (2016). "The Horse" Inside: Seeking Causes Behind the Behaviors of Music Content
Analysis Systems, Computers in Entertainment
β Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited.
Information Processing and Management
β Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. ACM
SIGIR
β Urbano, J. (2015). Test Collection Reliability: A Study of Bias and Robustness to Statistical
Assumptions via Stochastic Simulation. Information Retrieval Journal
β Urbano, J., Downie, J. S., McFee, B., & Schedl, M. (2012). How Significant is Statistically Significant?
The case of Audio Music Similarity and Retrieval. ISMIR
• Urbano, J., Marrero, M., & Martín, D. (2013a). A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation. ACM SIGIR
• Urbano, J., Marrero, M. & Martín, D. (2013b). On the Measurement of Test Collection Reliability. SIGIR
β Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of
Intelligent Information Systems
227. β Urbano, J. & Marrero, M. (2016). Toward Estimating the Rank Correlation between the Test Collection
Results and the True System Performance. SIGIR
β Urbano, J. & Nagler, T. (2018). Stochastic Simulation of Test Collections: Evaluation Scores. SIGIR
β Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error.
ACM SIGIR
β Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. ACM CIKM
β Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error
Costs Us Jobs, Justice, and Lives. University of Michigan Press
β Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? ACM
SIGIR