Statistical Analysis of
Results in Music Information
Retrieval: Why and How
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris
Who
2
Julián Urbano
● Assistant Professor @ TU Delft, The Netherlands
● BSc-PhD Computer Science
● 10 years of research in (Music) Information Retrieval
o And related, like information extraction or crowdsourcing for IR
● Active @ ISMIR since 2010
● Research topics
o Evaluation methodologies
o Statistical methods for evaluation
o Simulation for evaluation
o Low-cost evaluation
3
Supported by the European Commission H2020 project TROMPA (770376-2)
Arthur Flexer
● Austrian Research Institute for Artificial Intelligence -
Intelligent Music Processing and Machine Learning Group
● PhD in Psychology, minor degree in Computer Science
● 10 years of research in neuroscience, 13 years in MIR
● Active @ ISMIR since 2005
● Published on:
o role of experiments in MIR
o problems of ground truth
o problems of inter-rater agreement
● Semi-retired veteran DJ
4
Supported by the Vienna Science and Technology Fund (WWTF, project MA14-018)
Disclaimer
● Design of experiments (DOE) used and needed in all kinds
of sciences
● DOE is a science in its own right
● No fixed β€œhow-tos” or β€œcookbooks”
● Different schools and opinions
● Established ways to proceed in different fields
● We will present current procedures in (M)IR
● But also discuss, criticize, point to problems
● Present alternatives and solutions (?)
6
Program
● Part I: Why (we evaluate the way we do it)?
o Tasks and use cases
o Cranfield
o Validity and reliability
● Part II: How (should we not analyze results)?
o Populations and samples
o Estimating means
o Fisher, Neyman-Pearson, NHST
o Tests and multiple comparisons
● Part III: What else (should we care about)?
o Inter-rater agreement
o Adversarial examples
● Part IV: So (what does it all mean)?
● Discussion?
7
8
#ismir2018
Part I: Why?
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris
Typical Information Retrieval task
2
[Diagram: a user with an Information Need or Topic issues a query; the IR System matches it against the Documents and returns Results]
Two recurrent questions
● How good is my system?
○ What does good mean?
○ What is good enough?
● Is system A better than system B?
○ What does better mean?
○ How much better?
● What do we talk about?
○ Efficiency?
○ Effectiveness?
○ Ease?
3
Hypothesis: A is better than B
How would you design
this experiment?
Measure user experience
● We are interested in user-measures
○ Time to complete task
○ Idle time
○ Success/Failure rate
○ Frustration
○ Ease of learning
○ Ease of use …
● Their distributions describe user experience
○ For an arbitrary user and topic (and document collection?)
○ What can we expect?
5
[Sketch: example distributions of "time to complete task" and of "frustration" (none, some, much)]
Sources of variability
user-measure = f(documents, topic, user, system)
● Our goal is the distribution of the user-measure for our
system, which is impossible to calculate
○ (Possibly?) infinite populations
● As usual, the best we can do is estimate it
○ Becomes subject to random error
6
Desired: Live Observation
● Estimate distributions with a live experiment
● Sample documents, topics and users
● Have them use the system, for real
● Measure user experience, implicitly or explicitly
● Many problems
○ High cost, representativeness
○ Ethics, privacy, hidden effects, inconsistency
○ Hard to replicate experiment and repeat results
○ Just plain impossible to reproduce results
*replicate = same method, different sample
reproduce = same method, same sample
7
Alternative: Fixed samples
● Get (hopefully) good samples, fix them and reuse
○ Documents
○ Topics
○ Users
● Promotes reproducibility and reduces variability
● But we can’t just fix the users!
8
Simulate users
● Cranfield paradigm: remove users, but include a user
abstraction, fixed across experiments
○ Static user component: judgments or annotations in ground truth
○ Dynamic user component: effectiveness or performance measures
● Removes all sources of variability, except systems
user-measure = f(documents, topic, user, system)
user-measure = f(system)
9
Datasets (aka Test Collections)
● Controlled sample of documents, topics and judgments,
shared across researchers…
● …combined with performance measures
● (Most?) important resource for IR research
○ Experiments are inexpensive (datasets are not!)
○ Research becomes systematic
○ Evaluation is deterministic
○ Reproducibility is not only possible but easy
10
Cranfield-like evaluation
11
[Diagram: for each topic, the system retrieves documents from the collection; an Annotator, following an annotation protocol, provides relevance judgments for those documents; a Measure combines the system output with the judgments into a score; this is repeated over topics]
User Models & Annotation Protocols
● In practice, there are hundreds of options
● Utility of a document w.r.t. scale of annotation
○ Binary or graded relevance?
○ Linear utility w.r.t. relevance? Exponential?
○ Independent of other documents?
● Top heaviness to penalize late arrival
○ No discount?
○ Linear discount? Logarithmic?
○ Independent of other documents?
● Interaction, browsing?
● Cutoff
○ Fixed: only top k documents?
○ Dynamic: wherever some condition is met?
○ All documents?
● etc
12
Tasks vs Use Cases
● Everything depends on the use case of interest
● The same task may have several use cases (or subtasks)
○ Informational
○ Navigational
○ Transactional
○ etc
● Different use cases may imply, suggest or require
different decisions wrt system input/output, goal,
annotations, measures...
13
Task: instrument recognition
What is the use case?
Instrument recognition
1) Given a piece of music as input, identify the instruments
that are played in it:
○ For each window of T milliseconds, return a list of instruments being
played (extraction).
○ Return the instruments being played anywhere in the piece (classification).
2) Given an instrument as input, retrieve a list of music
pieces in which the instrument is played:
○ Return the list of music pieces (retrieval).
○ Return the list, but for each piece also provide a clip (start-end) where the
instrument is played (retrieval+extraction).
● Each case implies different systems, annotations and
measures, and even different end-users (non-human?)
https://github.com/cosmir/open-mic/issues/19
15
But wait a minute...
● Are we estimating distributions about users or distributions
about systems?
user-measure = f(system)
system-measure = f(system, protocol, measure)
system-measure = f(system, protocol, measure, annotator, context, ...)
● Whether the system output satisfies the user or not has
nothing to do with how we measure its performance
● What is the best way to predict user satisfaction?
16
Real world vs. The lab
17
[Diagram: the real world (the Web, an information need, the IR system and its output) is abstracted into the Cranfield lab setting; the test collection (documents, topics, relevance judgments) is the static component and the effectiveness measures (AP, DCG, RR) are the dynamic component, used to predict real-world behaviour]
Cranfield in Music IR
18
[Diagram: Input → System → Measure, where Annotations stand in for the Users]
Classes of Tasks in Music IR
● Retrieval
○ Music similarity
○ Query by humming
○ Recommendation
● Annotation
○ Genre classification
○ Mood recognition
○ Autotagging
● Extraction
○ Structural segmentation
○ Melody extraction
○ Chord estimation
27
Evaluation as Simulation
● Cranfield-style evaluation with datasets is a simulation of
the user-system interaction, deterministic, maybe even
simplistic, but a simulation nonetheless
● Provides us with data to estimate how good our systems
are, or which one is better
● Typically, many decisions are made for the practitioner
● Comes with many assumptions and limitations
28
Validity and Reliability
● Validity: are we measuring what we want to?
○ Internal: are observed effects due to hidden factors?
○ External: are input items, annotators, etc generalizable?
○ Construct: do system-measures match user-measures?
○ Conclusion: how good is good and how better is better?
Systematic error
● Reliability: how repeatable are the results?
○ Will I obtain the same results with a different collection?
○ How large do collections need to be?
○ What statistical methods should be used?
Random error
29
30
[Figure: four dartboard targets illustrating the combinations: valid & reliable, valid & not reliable, not valid & reliable, not valid & not reliable]
So long as...
β€’ So long as the dataset is large enough to minimize random
error and draw reliable conclusions
β€’ So long as the tools we use to make those conclusions can
be trusted
β€’ So long as the task and use case are clear
β€’ So long as the annotation protocol and performance
measure (ie. user model) are realistic and actually measure
something meaningful for the use case
β€’ So long as the samples of inputs and annotators present in
the dataset are representative for the task
31
“If you can’t measure it, you can’t improve it.”
– Lord Kelvin
“But measurements have to be trustworthy.”
– yours truly
32
Part II: How?
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris
Populations and Samples
Populations of interest
● The task and use case define the populations of interest
○ Music tracks
○ Users
○ Annotators
○ Vocabularies
● Impossible to study all entities in these populations
○ Too many
○ Don’t exist anymore (or yet)
○ Too far away
○ Too expensive
○ Illegal
3
Populations and samples
● Our goal is to study the performance of the system on
these populations
○ Lets us know what to expect from the system in the real world
● We’re typically interested in the mean: the expectation μ
○ Based on this we would decide what research line to pursue, what paper
to publish, what project to fund, etc.
○ But variability is also important, though often neglected
● A dataset represents just a sample from that population
○ By studying the sample we could generalize back to the population
○ But will bear some degree of random error due to sampling
4
Populations and samples
5
[Diagram: from the target population (external validity) we define an accessible population or sampling frame (content validity and reliability), from which we draw a sample; inference goes from the sample back to the sampling frame, and generalization back to the target population]
Populations and samples
● This is an estimation problem
● The objective is the distribution F of performance over the
population, specifically the mean μ
● Given a sample of observations X1, …, Xn, estimate μ
● Most straightforward estimator is the sample mean: μ̂ = X̄
● Problem: μ̂ = μ + e, where e is random error
● For any given sample or dataset, we only know X̄
● How confident we are in the results and our conclusions
depends on the size of e with respect to μ
6
Populations and samples
7
[Figure: the population distribution of performance and histograms of several random samples of size n=10]
Populations and samples
8
[Figure: the same population and histograms of several random samples of size n=30]
Populations and samples
9
[Figure: the same population and histograms of several random samples of size n=100]
[Figure: sampling distribution of the mean performance X̄ for n=10, 30, 50, 100; it concentrates around the population mean as n grows]
Sampling distribution and standard error
● Let us assume some distribution with some mean μ
● Experiment: draw a random sample of size n and compute X̄
● The sampling distribution is the distribution of X̄ over
replications of the experiment
● The standard error is the std. dev. of the sampling distribution
10
Estimating the mean
● The true distribution F has mean μ and variance σ²
○ E[X] = μ
○ Var[X] = σ²
● For the sample mean X̄ = (1/n) Σ Xi we have
○ E[X̄] = (1/n) Σ E[Xi] = μ
○ Var[X̄] = (1/n²) Σ Var[Xi] = σ²/n
std. error = σ_X̄ = σ/√n
● Law of large numbers: X̄ → μ as n → ∞
● The larger the dataset, the better our estimates
● Regardless of the true distribution F over the population
11
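A minimal simulation sketch (assuming Python with NumPy; the Beta-shaped population below is made up for illustration) that checks numerically that the standard error of the sample mean behaves like σ/√n, whatever the population looks like:

```python
import numpy as np

rng = np.random.default_rng(42)

# A non-Gaussian "population" of per-topic performance scores, bounded in [0, 1].
# mu and sigma are the true population mean and standard deviation of this Beta(a, b).
a, b = 2.0, 3.0
mu = a / (a + b)
sigma = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

for n in [10, 30, 100]:
    # Replicate the experiment many times: draw a sample of size n and compute its mean
    means = rng.beta(a, b, size=(10000, n)).mean(axis=1)
    print(f"n={n:4d}  empirical std. error={means.std(ddof=1):.4f}  "
          f"theory sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```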
Gaussians everywhere
● For the special case where F = N(μ, σ²), the sample mean
is also Gaussian, specifically X̄ ~ N(μ, σ²/n)
● But a Gaussian distribution is sometimes a very unrealistic
model for our data
● Central Limit Theorem (CLT): X̄ →d N(μ, σ²/n) as n → ∞
● Regardless of the shape of the original F
12
[Figure: a non-Gaussian population distribution and the sampling distributions of X̄ for n=10 and n=100, which look increasingly Gaussian]
Approximations
● Until the late 1890s, the CLT was invoked everywhere for
the simplicity of working with Gaussians
● Tables of the Gaussian distribution were used to test
● Still, there were two main problems
○ σ² is unknown
○ The rate of convergence, ie. small samples
● But something happened at the Guinness factory in 1908
13
14
William Gosset
Student-t distribution
● Gaussian approximations for sampling distributions were
reasonably good for large samples, but not for small
● Gosset thought about deriving the theoretical distributions
under assumptions of the underlying model
● Specifically, when X ~ N(μ, σ²):
○ If σ is known, we know that z = (X̄ − μ) / (σ/√n) ~ N(0, 1)
○ If σ is unknown, Gosset introduced the Student-t distribution:
t = (X̄ − μ) / (s/√n) ~ T(n − 1), where s is the sample standard deviation
● In a sense, it accounts for the uncertainty in σ̂ = s
15
Small-sample problems
● In non-English literature there are earlier mentions, but it
was popularized by Gosset and, mostly, Fisher
● He initiated the study of the so-called small-sample
problems, specifically with the Student-t distribution
16
[Figure: the Student-t density for n = 2, 3, 6, 30, approaching the standard Gaussian as n grows]
Ronald Fisher
Fisher and small samples
● Gosset did not provide a proof of the t
distribution, but Fisher did in 1912-1915
● Fisher stopped working on small-sample
problems until Gosset convinced him in 1922
● He then worked out exact distributions for correlation
coefficients, regression coefficients, χ² tests, etc. in the
early 1920s
● These, and much of his work on estimation and design of
experiments, were collected in his famous 1925 book
● This book is sometimes considered the birth of modern
statistical methods
18
Ronald Fisher
19
Fisher’s significance testing
● In those other papers Fisher developed his theory of
significance testing
● Suppose we have observed data X ~ f(x | θ) and we are
interested in testing the null hypothesis H0: θ = θ0
● We choose a relevant test statistic T s.t. large values of T
reflect evidence against H0
● Compute the p-value p = P(T(X*) ≥ T(X) | H0), that is, the
probability that, under H0, we observe a sample X* with a
test statistic at least as extreme as the one we observed initially
● Assess the statistical significance of the results, that is,
reject H0 if p is small
20
Testing the mean
● We observed X = {−0.13, 0.68, −0.34, 2.10, 0.83, −0.32,
0.99, 1.24, 1.08, 0.19} and assume a Gaussian model
● We set H0: μ = 0 and choose a t statistic
● For our data, p = 0.0155 (t = 2.55)
● If we consider p small enough, we reject H0
21
[Figure: the t density with the observed statistic marked and the p-value shaded in the upper tail]
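A sketch of this exact calculation (assuming Python with SciPy; the slide reports the one-sided p-value, and the alternative argument needs SciPy ≥ 1.6):

```python
import numpy as np
from scipy import stats

X = np.array([-0.13, 0.68, -0.34, 2.10, 0.83, -0.32, 0.99, 1.24, 1.08, 0.19])

# t statistic for H0: mu = 0 under a Gaussian model
t = np.sqrt(len(X)) * X.mean() / X.std(ddof=1)

# One-sided p-value: probability of a t statistic at least this large under H0
p = stats.t.sf(t, df=len(X) - 1)
print(f"t = {t:.2f}, p = {p:.4f}")   # t = 2.55, p = 0.0155

# Equivalently, with SciPy's built-in one-sample test
print(stats.ttest_1samp(X, popmean=0, alternative='greater'))
```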
Small p-values
● “we do not want to know the exact value of p […], but, in
the first place, whether or not the observed value is open
to suspicion”
● Fisher provided in his book tables not of the new small-
sample distributions, but of selected quantiles
● These allow for calculation of ranges of p-values given test
statistics, as different degrees of evidence against H0
● The p-value is gradable, a continuous measure of evidence
22
23
p and α
● Fisher employed the term significance level α for these
theoretical p-values used as reference points to identify
statistically significant results: reject H0 if p ≤ α
● This is context-dependent, is not prefixed beforehand and
can change from time to time
● He arbitrarily “suggested” α = .05 for illustration purposes
● Observing p > α does not prove H0; it just fails to reject it
24
Jerzy Neyman & Egon Pearson
Pearson
● Pearson saw Fisher’s tables as a way to
compute critical values that “lent
themselves to the idea of choice, in advance
of experiment, of the risk of the ‘first kind
of error’ which the experimenter was prepared to take”
● In a letter to Pearson, Gosset replied “if the chance is very
small, say .00001, […] what it does is to show that if there is
any alternative hypothesis which will explain the
occurrence of the sample with a more reasonable
probability, say .05 […], you will be very much more
inclined to consider that the original hypothesis is not
true”
26
Egon Pearson
Neyman
● Pearson saw the light: “the only valid
reason for rejecting a statistical hypothesis
is that some alternative explains the
observed events with a greater degree
of probability”
● In 1926 Pearson writes to Neyman to propose his ideas of
hypothesis testing, which they developed and published
in 1928
27
Jerzy Neyman
28
Errors
● α = P(type 1 error)
● β = P(type 2 error)
● Power = 1 − β
29
[Table: test decision vs truth — accept H0 when H0 is true: true negative; accept H0 when H1 is true: type 2 error (β); reject H0 when H0 is true: type 1 error (α); reject H0 when H1 is true: true positive]
[Figure: densities of the test statistic under H0 and H1 (n=10), with α and β as the corresponding tail areas]
Errors
30
● H0: μ = 0, H1: μ = 0.5
● σ = 1
t = (X̄ − μ) / (σ/√n)
[Figure: distributions of the test statistic under H0 and H1 for n=10 and n=30; increasing n separates them, so β shrinks for a fixed α]
Errors
31
● H0: μ = 0, H1: μ = 0.25
● σ = 1
t = (X̄ − μ) / (σ/√n)
[Figure: same setup for n=10 and n=30; a smaller true effect brings the two distributions closer, so β grows for a fixed α]
Errors
32
● H0: μ = 0, H1: μ = 0.25
● σ = 3
t = (X̄ − μ) / (σ/√n)
[Figure: same setup for n=10 and n=30; a larger σ also brings the two distributions closer, so β grows for a fixed α]
Neyman-Pearson hypothesis testing
● Define the null and alternative hypotheses, eg.
H0: μ = 0 and H1: μ = 0.5
● Set the acceptable error rates α (type 1) and β (type 2)
● Select the most powerful test T for the hypotheses and α,
which sets the critical value c
● Given H1 and β, select the sample size n required to detect
an effect d or larger
● Collect data and reject H0 if T(X) ≥ c
● The testing conditions are set beforehand: H0, H1, α, β
● The experiment is designed for a target effect d: n
33
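A simulation sketch of the design step (assuming Python with NumPy/SciPy; the numbers are illustrative, not from the slides): estimate the power of a one-sided t-test for H1: μ = 0.5 with σ = 1 at α = 0.05, and find the smallest n that reaches a target power of 1 − β = 0.8.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, mu1, sigma, target_power = 0.05, 0.5, 1.0, 0.8

def power(n, reps=20000):
    # Draw samples under H1 and count how often the one-sided t-test rejects H0: mu = 0
    samples = rng.normal(mu1, sigma, size=(reps, n))
    t = np.sqrt(n) * samples.mean(axis=1) / samples.std(axis=1, ddof=1)
    crit = stats.t.ppf(1 - alpha, df=n - 1)   # critical value c
    return np.mean(t >= crit)

for n in range(5, 100):
    if power(n) >= target_power:
        print(f"smallest n with power >= {target_power}: {n}")
        break
```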
Error rates and tests
● Under repeated experiments, the long-run error rate is α
● Neyman-Pearson did not suggest values for it:
“the balance [between the two kinds of error] must be left
to the investigator […] we attempt to adjust the balance
between the risks 1 and 2 to meet the type of problem
before us”
● For β they “suggested” α ≤ β ≤ 0.20
● To Fisher, the choice of test statistic in his methodology
was rather obvious to the investigator and wasn’t important
to him
● Neyman-Pearson answered this by defining the “best” test:
that which minimizes error 2 subject to a bound on error 1
34
Likelihood ratio test
● Pearson apparently suggested the likelihood ratio test for
their new hypothesis testing methodology
ℒ = p(X | H0) / p(X | H1)
● It was later found that as n → ∞, −2 log ℒ ~ χ²
● Neyman was reluctant, as he thought some Bayesian
consideration had to be taken about prior distributions
over the hypotheses (“inverse probability” at the time)
● For simple point hypotheses like H0: θ = θ0 and H1: θ = θ1,
the likelihood ratio test turned out to be the most powerful
● In the case of comparing means of Gaussians, this reduces
to Student’s t-test!
35
Composite hypotheses
● Neyman-Pearson theory extends to composite hypotheses
of the form H: θ ∈ Θ, such as H1: μ > 0.5
● The math got more complex, and Neyman was still
somewhat reluctant: “it may be argued that it is
impossible to estimate the probability of such a hypothesis
without a knowledge of the relative a priori probabilities of
the constituent simple hypotheses”
● Although “wishing to test the probability of a hypothesis A
we have to assume that all hypotheses are a priori equally
probable and calculate the probability a posteriory of A”
36
Null Hypothesis
Significance Testing
Recap
● Fisher: significance testing
○ Inductive inference: rational belief when reasoning from sample to
population
○ Rigorous experimental design to extract results from few samples
○ Replicate and develop your hypotheses, consider all significant and non-
significant results together
○ Power cannot be computed beforehand
● Neyman-Pearson: hypothesis testing
○ Inductive behavior: frequency of errors in judgments
○ Long-run results from many samples
○ p-values don’t have frequentist interpretations
● In the 1940s the two worlds began to appear as just one in
statistics textbooks, and were rapidly adopted by researchers
38
39
Null Hypothesis Significance Testing
● Collect data
● Set hypotheses, typically H0: μ = 0 and H1: μ ≠ 0
○ Either there is an effect or there isn’t
● Set α, typically to 0.05 or 0.01
● Select test statistic based on hypotheses and compute p
● If p ≤ α, reject the null; fail to reject if p > α
40
Common bad practices
● Run tests blindly without looking at your data
● Decide on α after computing p
● Report “(not) significant at the 0.05 level” instead of
providing the actual p-value
● Report degrees of significance like “very” or “barely”
● Do not report the test statistic alongside p, eg. t(58) = 1.54
● Accept H0 if p > α or accept H1 if p ≤ α
● Interpret p as the probability of the null
● Simply reject the null, without looking at the effect size
● Ignore the type 2 error rate β, ie. power analysis a posteriori
● Interpret a statistically significant result as important
● Train the same models until significance is found
● Publish only statistically significant results ¯\_(ツ)_/¯
41
NHST for (M)IR
2 systems
Paired tests
● We typically want to compare our system B with some
baseline system A
● We have the scores over n inputs from some dataset
● The hypotheses are H0: μA = μB and H1: μA ≠ μB
[Figure: histograms of per-input performance for A (mean = 0.405, sd = 0.213) and B (mean = 0.425, sd = 0.225)]
Paired test
● If we ignore the structure of the experiment, we have a bad
model and a test with low power
Simple test: t = 0.28, p = 0.78
● In our experiments, every observation from A corresponds
to an observation from B, ie. they are paired observations
● The test can account for this to better model our data
● Instead of looking at A vs B, we look at A−B vs 0
Paired test: t = 2.16, p = 0.044
44
[Figure: per-input scores of systems A and B]
Paired t-test
● Assumption
○ Data come from Gaussian distributions
● Equivalent to a t-test of H0: μD = 0, where Di = Bi − Ai
t = √n · X̄(B−A) / s(B−A) = √n · D̄ / sD = 0.35, p = 0.73
45
A B
.76 .75
.33 .37
.59 .59
.28 .15
.36 .49
.43 .50
.21 .33
.43 .27
.72 .81
.40 .36
[Figure: scatter plot of per-input scores, A vs B]
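A sketch reproducing the numbers on this slide (assuming Python with SciPy): ttest_rel is the paired test, and an unpaired ttest_ind on the same data shows what happens when the pairing is ignored.

```python
import numpy as np
from scipy import stats

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])

# Paired t-test: equivalent to a one-sample t-test of D = B - A against 0
t, p = stats.ttest_rel(B, A)
print(f"paired:   t = {t:.2f}, p = {p:.2f}")    # t = 0.35, p = 0.73

# Ignoring the pairing (two independent samples) is the wrong model here
t, p = stats.ttest_ind(B, A)
print(f"unpaired: t = {t:.2f}, p = {p:.2f}")
```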
Wilcoxon signed-rank test
● Assumptions:
○ Data measured at least at interval level
○ Distribution is symmetric
● Disregard for magnitudes
● Convert all untied Di to ranks Ri
● Compute W+ and W−, equal to the sums
of the Ri whose Di are positive or negative
● The test statistic is
W = min(W+, W−) = 21
● W follows a Wilcoxon distribution, from
which one can calculate p = 0.91
46
A B D rank
.76 .75 -.01 1
.33 .37 .04 2
.59 .59 0 -
.28 .15 -.13 8
.36 .49 .13 7
.43 .50 .07 4
.21 .33 .12 6
.43 .27 -.16 9
.72 .81 .09 5
.40 .36 -.04 3
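A sketch with SciPy's implementation (it drops the zero difference and uses average ranks for tied magnitudes, so the statistic and p-value can differ slightly from the hand computation in the table above):

```python
import numpy as np
from scipy import stats

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])

# Wilcoxon signed-rank test on the paired differences B - A.
# zero_method='wilcox' discards zero differences, as on the slide.
res = stats.wilcoxon(B, A, zero_method='wilcox')
print(res.statistic, res.pvalue)   # W = min(W+, W-) and the two-sided p-value
```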
Sign test
● Complete disregard for magnitudes
● Simulate coin flips: was system B better (or
worse) than A for some input?
● Follows a Binomial distribution
● The test statistic is the number of successes
(B>A), which is 5
● The p-value is the probability of 5 or more
successes in 9 coin flips = 0.5
47
A B sign
.76 .75 -1
.33 .37 +1
.59 .59 0
.28 .15 -1
.36 .49 +1
.43 .50 +1
.21 .33 +1
.43 .27 -1
.72 .81 +1
.40 .36 -1
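The same computation as a sketch (assuming Python with SciPy; ties are dropped before counting the "coin flips"):

```python
import numpy as np
from scipy import stats

A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])

D = B - A
successes = np.sum(D > 0)          # inputs where B beat A: 5
trials = np.sum(D != 0)            # ties are dropped: 9 "coin flips"

# One-sided p-value: probability of 5 or more successes in 9 fair coin flips
p = stats.binom.sf(successes - 1, trials, 0.5)
print(successes, trials, p)        # 5 9 0.5
```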
Bootstrap test
● Compute deltas, Di = Bi − Ai
● The empirical distribution ecdf_D estimates the true
distribution F_D
● Repeat for i = 1, …, T, with T large (thousands of times)
○ Draw a bootstrap sample Bi by sampling n scores with replacement from
ecdf_D
○ Compute the mean of the bootstrap sample, B̄i
○ Let B̄ = (1/T) Σ B̄i
○ The B̄i estimate the sampling distribution of the mean
● The p-value is (1/T) Σ 𝕀(|B̄i − B̄| ≥ |D̄|) ≈ 0.71
48
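A minimal bootstrap sketch for the same deltas (assuming Python with NumPy; the exact p-value fluctuates with the random resampling):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A

T = 100_000
# Bootstrap sampling distribution of the mean: resample the deltas with replacement
boot_means = rng.choice(D, size=(T, len(D)), replace=True).mean(axis=1)

# Two-sided p-value: how often a bootstrap mean deviates from the bootstrap
# grand mean by at least as much as the observed mean delta
p = np.mean(np.abs(boot_means - boot_means.mean()) >= np.abs(D.mean()))
print(p)   # roughly 0.7
```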
Permutation test
● Under the null hypothesis, an arbitrary score could have
been generated by system A or by system B
● Repeat for i = 1, …, T, with T large (thousands of times)
○ Create a sample Pi by randomly swapping the sign of each observation
○ Compute the mean P̄i
● The p-value is (1/T) Σ 𝕀(|P̄i| ≥ |D̄|) ≈ 0.73
49
49
The computer does all this for you
50
In practice
51
[Carterette, 2015a]
NHST for (M)IR
Multiple systems
ANOVA
● Assume a model y_si = μ + ν_s + ν_i + e_si, where ν_s = ȳ_s· − μ
○ Implicitly “pairs” the observations by item
● The variance of the observed scores can be decomposed
σ²_y = σ²_s + σ²_i + σ²_e
● where σ²_s is the variance across system means
○ Low: system means are close to each other
○ High: system means are far from each other
● The null hypothesis is H0: μ1 = μ2 = μ3 = ⋯
○ Even if we reject, we still don’t know which system is different!
● The test statistic is of the form F = σ̂²_s / σ̂²_e
● We’d like to have σ²_s ≫ σ²_i, σ²_e
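A sketch of this system-plus-item decomposition (assuming Python with pandas and statsmodels; the long-format toy data, the column names score/system/item and the effect sizes are all made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy long-format results: one score per (system, item) pair
rng = np.random.default_rng(3)
systems, items = ['A', 'B', 'C'], [f'q{i}' for i in range(20)]
item_effect = {q: rng.normal(0, 0.15) for q in items}
rows = [{'system': s, 'item': q,
         'score': 0.4 + item_effect[q] + 0.05 * (s == 'B') + rng.normal(0, 0.05)}
        for s in systems for q in items]
df = pd.DataFrame(rows)

# Two-way ANOVA with system and item effects: modelling the item effect
# is what "pairs" the observations by item
model = ols('score ~ C(system) + C(item)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F and p-value for the system effect
```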
Friedman test
● Same principle as ANOVA, but non-parametric
● Similarly to Wilcoxon, rank observations (per item) and
estimate effects
● Ignores actual magnitudes; simply uses ranks
54
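A minimal sketch of the Friedman test (assuming Python with SciPy; the third system's scores are invented so that there are more than two systems to rank per item):

```python
import numpy as np
from scipy import stats

# Scores of three systems over the same 10 items (rows aligned by item)
sys1 = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
sys2 = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
sys3 = np.array([.70, .41, .62, .22, .45, .48, .30, .35, .78, .39])  # hypothetical

# Friedman test: ranks the systems within each item and tests whether
# the mean ranks differ across systems
stat, p = stats.friedmanchisquare(sys1, sys2, sys3)
print(stat, p)
```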
Multiple testing
● When testing multiple hypotheses, the probability of at
least one type 1 error increases
● Multiple testing procedures correct p-values for a family-
wise error rate
55
[Carterette, 2015a]
Tukey’s HSD
● Follow ANOVA to test H0: μ1 = μ2 = μ3 = ⋯
● The maximum observed difference between systems is
likely the one causing the rejection
● Tukey’s HSD compares all pairs of systems, each with an
individual p-value
● These p-values are then corrected based on the expected
distribution of maximum differences under H0
● In practice, it inflates p-values
● Ensures a family-wise type 1 error rate
56
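A sketch with statsmodels (same made-up long-format data as the ANOVA sketch above; note that pairwise_tukeyhsd treats this as a one-way layout, so the per-item pairing is not modelled here):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Toy long-format results, as in the ANOVA sketch
rng = np.random.default_rng(3)
systems, items = ['A', 'B', 'C'], list(range(20))
item_effect = rng.normal(0, 0.15, size=len(items))
rows = [{'system': s, 'item': q,
         'score': 0.4 + item_effect[q] + 0.05 * (s == 'B') + rng.normal(0, 0.05)}
        for s in systems for q in items]
df = pd.DataFrame(rows)

# All pairwise comparisons, with p-values corrected for the family-wise error rate
print(pairwise_tukeyhsd(df['score'], df['system'], alpha=0.05))
```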
Tukey’s HSD
57
[Carterette, 2015a]
Others
● There are many other procedures to control for multiple
comparisons
● Bonferroni: very conservative (low power)
● Dunnett’s: compare all against a control (eg. baseline)
● Other procedures control the false discovery rate, ie.
the probability of a type 1 error given p ≤ α
● One way or another, they all inflate p-values
58
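A small sketch of generic p-value corrections (assuming Python with statsmodels; the raw p-values are made up):

```python
from statsmodels.stats.multitest import multipletests

# Raw p-values from, say, comparing a baseline against several systems
pvals = [0.003, 0.021, 0.044, 0.26, 0.71]

# Family-wise corrections (Bonferroni, Holm) vs false-discovery-rate control (BH)
for method in ['bonferroni', 'holm', 'fdr_bh']:
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, p_adj.round(3), reject)
```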
Part III: What Else? Validity!
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris
What else can go wrong? Validity!
● Validity reprise
● Example I: Inter-rater agreement
● Example II: Adversarial examples
2
Validity Reprise
3
Validity
● Validity
○ A valid experiment is an experiment actually measuring what the
experimenter intended to measure
○ Conclusion validity: does a difference between system measures
correspond to a difference in user measures, and is it noticeable to users?
○ Internal validity: is this relationship causal, or could confounding factors
explain the relation?
○ External validity: do cause-effect relationships also hold for target
populations beyond the sample used in the experiment?
○ Construct validity: are the intentions and hypotheses of the experimenter
represented in the actual experiment?
4
Inter-rater agreement in music
similarity
5
Automatic recommendation / Playlisting
7
[Diagram: Query song + Millions of songs = Result list, computed via Similarity]
Computation of similarity between songs
● Collaborative filtering (Spotify, Deezer?)
● Social meta-data (Last.Fm?)
● Expert knowledge (Pandora?)
● Meta-data from the web
● ...
● Audio-based
8
Computation of similarity between songs
9
[Diagram: songs as audio → switching to frequencies → computation of features → machine learning → similarity (metric) S(a1, a2) = ?]
Pictures from E. Pampalk’s PhD thesis 2006
Computation of similarity between songs
10
[Figure: a query song and several candidate songs — similar?]
Computation of similarity between songs
11
[Figure: the system answers with similarity scores, e.g. max(S) = 97.9 — similar!!]
Are we there yet?
14
How can we evaluate our models of music
similarity?
15
[Figure: a matrix of similarity scores between songs, e.g. 45, 87, 100, 23, …]
How can we evaluate our models of music
similarity?
16
Do these numbers correspond to a human assessment of music similarity?
MIREX -
Music Information Retrieval eXchange
17
MIREX -
Music Information Retrieval eXchange
● Standardized testbeds allowing for fair comparison of MIR
systems
● range of different tasks
● based on human evaluation
○ Cranfield: remove users, look at annotations only
● What is the level of agreement between human
raters/annotators?
● What does this mean for the evaluation of MIR systems?
● Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling
Music Similarity, Journal of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016.
19
Audio music similarity
● Audio Music Similarity and Retrieval (AMS) task 2006-2014
20
Audio music similarity
● Audio Music Similarity and Retrieval (AMS) task 2006-2014
● 5000 song database
● participating MIR systems compute 5000x5000 distance
matrix
● 60 randomly selected queries
● return 5 closest candidate songs for each of the MIR
systems
● for each query/candidate pair, ask the human grader:
● “Rate the similarity of the following Query-Candidate pairs.
Assign a categorical similarity (Not similar, Somewhat
Similar, or Very Similar) and a numeric similarity score. The
numeric similarity score ranges from 0 (not similar) to 10
(very similar or identical).”
21
Audio music similarity
● Audio Music Similarity and Retrieval (AMS) task 2006-2014
● 7000 song database
● participating MIR systems compute 7000x7000 distance
matrix
● 100 randomly selected queries
● return 5 closest candidate songs for each of the MIR
systems
● for each query/candidate pair, ask the human grader:
● “Rate the similarity of the following Query-Candidate pairs.
Assign a categorical similarity (Not similar, Somewhat
Similar, or Very Similar) and a numeric similarity score. The
numeric similarity score ranges from 0 (not similar) to 100
(very similar or identical).”
22
Audio music similarity
● Audio Music Similarity and Retrieval (AMS) task 2006-2014
● 7000 song database
● participating MIR systems compute 7000x7000 distance
matrix
● 50 randomly selected queries
● return 10 closest candidate songs for each of the MIR
systems
● for each query/candidate pair, ask the human grader:
● “Rate the similarity of the following Query-Candidate pairs.
Assign a categorical similarity (Not similar, Somewhat
Similar, or Very Similar) and a numeric similarity score. The
numeric similarity score ranges from 0 (not similar) to 100
(very similar or identical).”
23
Experimental design
● measure the effect of different treatments on a dependent
variable (here: MIREX AMS 2014)
Independent variable (treatment, manipulated by researcher):
Type of algorithm
Dependent variable (effect, measured by researcher):
FINE similarity rating
27
What about validity?
● A valid experiment is an experiment actually measuring
what the experimenter intended to measure
● What is the intention of the experimenter in the AMS task?
● What do we want to measure here?
28
Audio music similarity
● (AMS task instructions, again) for each query/candidate pair, ask the human grader:
● “Rate the similarity of the following Query-Candidate pairs.
Assign a categorical similarity (Not similar, Somewhat
Similar, or Very Similar) and a numeric similarity score. The
numeric similarity score ranges from 0 (not similar) to 100
(very similar or identical).”
29
Rate the similarity!
[Figure: a query song and a candidate song, rated on a scale from 0 … 100]
● Factors that influence human music perception
○ Schedl M., Flexer A., Urbano J.: The Neglected User in Music Information
Retrieval Research, J. of Intelligent Information Systems, December 2013,
Volume 41, Issue 3, pp 523-539, 2013.
37
Inter-rater agreement in AMS
38
Inter-rater agreement in AMS
● AMS 2006 is the only year with multiple graders
● each query/candidate pair evaluated by three different
human graders
● each grader gives a FINE score between 0 … 10 (not … very
similar)
● correlation between pairs of graders
40
Inter-rater agreement in AMS
● inter-rater agreement for different intervals of FINE scores
[Figure: agreement between graders for different intervals of FINE scores]
43
Inter-rater agreement in AMS
● look at very similar ratings in the [9,10] interval
[Figure: distribution of the scores other graders gave to pairs rated in [9,10]; Average = 6.54]
46
Inter-rater agreement in AMS
● look at very similar ratings in the [9,10] interval
● what sounds very similar to one grader will on average
receive a score of only 6.54 from other graders
● this constitutes an upper bound for average FINE scores in
AMS
● there will always be users that disagree (moving target)
47
Comparison to the upper bound
● compare top performing systems (2007, 2009 - 2014) to the
upper bound
[Figure: average FINE scores of the top systems per year against the 6.54 upper bound; the bound has been reached since 2009 by system PS2 and later systems]
51
Comparison to the upper bound
52
● upper bound has already been reached in 2009
● can the upper bound be surpassed in the future?
● or is this an inherent problem due to low inter-rater
agreement in human evaluation of music similarity?
● this prevents progress in MIR research on music similarity
● AMS task dead since 2015
What went wrong here?
53
What about validity?
● A valid experiment is an experiment actually measuring
what the experimenter intended to measure
● What is the intention of the experimenter in the AMS task?
● What do we want to measure here?
54
Construct Validity
● Construct validity: are the intentions and hypotheses of the
experimenter represented in the actual experiment?
● Unclear intention: to measure an abstract concept of music
similarity?
● Possible solutions:
○ more fine-grained notion of similarity
○ ask a more specific question?
○ does something like abstract music similarity even exist?
○ evaluation of complete MIR systems centered around a specific task/use
case could lead to a much clearer hypothesis
○ Remember the MIREX Grand Challenge on user experience (2014)?
○ “You are creating a short video about a memorable occasion that happened to you
recently, and you need to find some (copyright-free) songs to use as background
music.”
56
Internal Validity
● Internal validity: is the relationship causal, or could
confounding factors explain the relation?
● Many factors influence human music perception and need
to be controlled in the experimental design
● Possible solutions:
Independent variable: Type of algorithm
Dependent variable: FINE similarity rating
Control variables: gender, age (e.g. female only, age 20-30y),
musical training/experience/preference (e.g. music professionals),
type of music (e.g. piano concertos only), ...
● Very specialized, limited generality
61
Internal Validity
● Control variables: monitor them instead of fixing them
● Exponential complexity
63
External Validity
● External validity: do cause-effect relationships also hold
for target populations beyond the sample used in the
experiment?
● Unclear target population: identical with the sample of 7000
US pop songs? All US pop music in general?
● Beware: cross-collection studies show dramatic losses in
performance
○ Bogdanov, D., Porter, A., Herrera Boyer, P., & Serra, X. (2016). Cross-collection evaluation for
music classification tasks. ISMIR 2016.
● Possible solutions:
○ Clear target population
○ More constricted target populations
○ Much larger data samples
○ Use case?
65
Conclusion Validity
● Conclusion validity: does a difference between system
measures correspond to a difference in user measures,
and is it noticeable to users?
● A large difference in effect measures is needed for users to
see the difference
○ J. Urbano, J. S. Downie, B. McFee and M. Schedl: How Significant is Statistically
Significant? The Case of Audio Music Similarity and Retrieval, ISMIR 2012.
● Possible solutions:
○ Are there system measures that better correspond to user measures?
○ Use case!
68
Lack of inter-rater agreement in
other areas
69
Lack of inter-rater agreement
● It does not make sense to go beyond inter-rater
agreement; this constitutes an upper bound
● MIREX ‘Music Structural Segmentation’ task
○ Human annotations of structural segmentations (structural boundaries and
labels denoting repeated segments), chorus, verse, …
○ Algorithms have to produce such annotations
○ F1-score between different annotators as upper bound
○ Upper bound reached, at least for certain music (classical and world music)
○ Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity,
J. of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016.
○ Smith, J.B.L., Chew, E.: A meta-analysis of the MIREX structure segmentation task, ISMIR,
2013.
○ Serrà, J., Müller, M., Grosche, P., & Arcos, J.L.: Unsupervised music structure annotation by
time series structure features and segment similarity. IEEE Transactions on Multimedia,
Special Issue on Music Data Mining, 16(5), 1229–1240, 2014.
72
Inter-rater agreement and upper bounds
● Extraction of metrical structure
○ Quinton, E., Harte, C., Sandler, M.: Extraction of metrical structure from
music recordings, DAFX 2015.
● Melody estimation
○ Balke, S., Driedger, J., Abeßer, J., Dittmar, C., Müller, M.: Towards
Evaluating Multiple Predominant Melody Annotations in Jazz Recordings,
ISMIR 2016.
○ Bosch J.J., Gómez E.: Melody extraction in symphonic classical music: a
comparative study of mutual agreement between humans and
algorithms, Proc. of the Conference on Interdisciplinary Musicology,
2014.
● Timbre and rhythm similarity
○ Panteli, M., Rocha, B., Bogaards, N., Honingh, A.: A model for rhythm
and timbre similarity in electronic dance music. Musicae Scientiae, 21(3),
338-361, 2017.
● Many more?
73
Adversarial Examples
74
Adversarial Examples - Image Recognition
● An adversary slightly and imperceptibly changes an input
image to fool a machine learning system
○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing
adversarial examples, ICLR, 2014.
75
[Figure: original image + noise = adversarial example; all classified as “Camel”]
Adversarial Examples - MIR
● Imperceptibly filtered audio fools a genre recognition
system
○ Sturm B.L.: A simple method to determine if a music information retrieval
system is a “horse”, IEEE Trans. on Multimedia, 16(6), pp. 1636-1644, 2014.
○ http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html
[Audio example: “deflate”]
77
External Validity
● External validity: do cause-effect relationships also hold
for target populations beyond the sample used in the
experiment?
● Unclear target population: identical with the sample of a few
hundred ISMIR or GTZAN songs?
● Or are we aiming at genre classification in general?
● If the target is genre classification in general, there is a
problem!
78
Internal Validity
● Internal validity: is the relationship causal, or could
confounding factors explain the relation?
● Why can these MIR systems be fooled so easily?
○ no causal relation between the class (e.g. genre) represented in the data
and the label returned by the classifier
○ What is the confounding variable?
79
Internal Validity
● Why can these MIR systems be fooled so easily?
● E.g.: in the case of rhythm classification, systems were picking
up tempo, not rhythm! Tempo acted as the confounding factor!
○ Sturm B.L.: The Horse Inside: Seeking Causes Behind the Behaviors of
Music Content Analysis Systems, Computers in Entertainment, 14(2), 2016.
80
Internal Validity
● Internal validity: is the relationship causal, or could
confounding factors explain the relation?
○ no causal relation between the class (e.g. genre) represented in the data
and the label returned by the classifier
○ What is the confounding variable?
● High dimensionality of the data input space?
○ Small perturbations to input data might accumulate over many dimensions,
with minor changes ‘snowballing’ into larger changes in the transfer functions
of deep neural networks
○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing
adversarial examples, ICLR, 2014
81
Internal Validity
● Linearity of models?
○ linear responses are overly confident at points that do not occur in the
data distribution, and these confident predictions are often highly
incorrect … rectified linear units (ReLU)?
○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing
adversarial examples, ICLR, 2014
82
Internal Validity
Open question: what is the confounding variable?
83
Summary: Validity
84
Validity
● Validity
○ A valid experiment is an experiment actually measuring what the
experimenter intended to measure
○ Conclusion validity
○ Internal validity
○ External validity
○ Construct validity
● Care about the validity of your experiments!
● Validity is the right framework to talk about these
problems
85
Part IV: So?
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris
What’s in a p-value?
● Confounds effect size and sample size, eg. t = √n (X̄ − μ) / σ
● Unfortunately, we virtually never check power. Don't ever
accept H0
● An "easy" way to achieve significance is obtaining more data,
but the true effect remains the same
● Even if one rejects H0, it could still be true
2
P(H0 | p ≤ α) = P(p ≤ α | H0) P(H0) / P(p ≤ α)
             = P(p ≤ α | H0) P(H0) / [ P(p ≤ α | H0) P(H0) + P(p ≤ α | H1) P(H1) ]
             = α P(H0) / [ α P(H0) + (1 − β) P(H1) ]
● P(H0) = P(H1) = 0.5
○ α = 0.05, β = 0.05 → P(H0 | p ≤ α) = 0.05
○ α = 0.05, β = 0.5 → P(H0 | p ≤ α) = 0.09
● P(H0) = 0.8, P(H1) = 0.2
○ α = 0.05, β = 0.05 → P(H0 | p ≤ α) = 0.17
○ α = 0.05, β = 0.5 → P(H0 | p ≤ α) = 0.29
3
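The same arithmetic as a tiny sketch (plain Python; it reproduces the four numbers on the slide):

```python
def p_h0_given_rejection(alpha, beta, prior_h0):
    """P(H0 | p <= alpha) for a test with size alpha and type 2 error rate beta."""
    prior_h1 = 1 - prior_h0
    return alpha * prior_h0 / (alpha * prior_h0 + (1 - beta) * prior_h1)

for prior_h0 in (0.5, 0.8):
    for beta in (0.05, 0.5):
        print(prior_h0, beta, round(p_h0_given_rejection(0.05, beta, prior_h0), 2))
# 0.5 0.05 -> 0.05   0.5 0.5 -> 0.09   0.8 0.05 -> 0.17   0.8 0.5 -> 0.29
```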
H0 is always false
● In this kind of dataset-based experiment, H0: μ = 0 is
always false
● Two systems may be veeeeeery similar, but not the same
● Binary accept/reject decisions don't even make sense
● Why bother with multiple comparisons then?
● Care about type S(ign) and M(agnitude) errors
● To what extent do non-parametric methods make sense
(Wilcoxon, Sign, Friedman), especially combined with
parametrics like Tukey’s?
4
Binary thinking no more
● Nothing wrong with the p-value, but with its use
● p as a detective vs p as a judge
● Any α is completely arbitrary
● What is the cost of a type 2 error?
● How does the lack of validity affect NHST? (measures,
sampling frames, ignoring cross-assessor variability, etc)
● What about researcher degrees of freedom?
● Why not focus on effect sizes? Intervals, correlations, etc.
● Bayesian methods? What priors?
5
Assumptions
● In dataset-based (M)IR experiments, test assumptions are
false by definition
● p-values are, to some degree, approximated
● So again, why use any threshold?
● So which test should you choose?
● Run them all, and compare
● If they tend to disagree, take a closer look at the data
● Look beyond the experiment at hand, gather more data
● Always perform error analysis to make sense of it
6
Replication
● Fisher, and especially Neyman-Pearson, advocated for
replication
● A p-value is only concerned with the current data
● The hypothesis testing framework only makes sense with
repeated testing
● In (M)IR we hardly do it; we're stuck with the same datasets
7
Significant ≠ Relevant ≠ Interesting
8
[Diagram: within all research, the subsets of interesting, relevant, and statistically significant results; they are not the same thing]
There is always random error in our experiments,
so we always need some kind of statistical analysis
But there is no point in being too picky
or intense about how we do it
Nobody knows how to do it properly,
and different fields adopt different methods
What is far more productive is
to adopt an exploratory attitude
rather than mechanically testing
9
References
Julián Urbano Arthur Flexer
An ISMIR 2018 Tutorial · Paris
● Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness
Measures and User Satisfaction. ACM SIGIR
● Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null Hypothesis Testing: Problems,
Prevalence, and an Alternative. Journal of Wildlife Management
● Armstrong, T.G., Moffat, A., Webber, W. & Zobel, J. (2009). Improvements that don't add up: ad-hoc
retrieval results since 1998. CIKM
● Balke, S., Driedger, J., Abeßer, J., Dittmar, C. & Müller, M. (2016). Towards Evaluating Multiple
Predominant Melody Annotations in Jazz Recordings. ISMIR
● Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science
● Bosch J.J. & Gómez E. (2014). Melody extraction in symphonic classical music: a comparative study of
mutual agreement between humans and algorithms. Conference on Interdisciplinary Musicology
● Boytsov, L., Belova, A. & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR
experiments. SIGIR
● Carterette, B. (2012). Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval
Experiments. ACM Transactions on Information Systems
● Carterette, B. (2015a). Statistical Significance Testing in Information Retrieval: Theory and Practice.
ICTIR
● Carterette, B. (2015b). Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR
● Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum
● Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. ACM
SIGIR
● Downie, J. S. (2004). The Scientific Evaluation of Music Information Retrieval Systems: Foundations
and Future. Computer Music Journal
● Fisher, R. A. (1925). Statistical Methods for Research Workers. Cosmo Publications
● Flexer, A. (2006). Statistical Evaluation of Music Information Retrieval Experiments. Journal of New
Music Research
2
● Flexer, A., & Grill, T. (2016). The Problem of Limited Inter-rater Agreement in Modelling Music Similarity.
Journal of New Music Research
● Gelman, A. (2013b). The problem with p-values is how they’re used.
● Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don’t Have to Worry About Multiple
Comparisons. Journal of Research on Educational Effectiveness
● Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a
problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was
posited ahead of time.
● Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist
● Gelman, A., & Stern, H. (2006). The Difference Between Significant and Not Significant is not Itself
Statistically Significant. The American Statistician
● Goodfellow I.J., Shlens J. & Szegedy C. (2014). Explaining and harnessing adversarial examples. ICLR
● Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., & Langlois, T. (2014). On Evaluation Validity in
Music Autotagging. ACM Computing Research Repository.
● Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and
User Evaluations Give the Same Results? ACM SIGIR
● Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar
Songs. ISMIR
● Hull, D. (1993). Using Statistical Testing in the Evaluation of Retrieval Experiments. ACM SIGIR
● Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine
● Lehmann, E.L. (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or
Two? Journal of the American Statistical Association
● Lehmann, E.L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer
3
● Lee, J. H., & Cunningham, S. J. (2013). Toward an understanding of the history and impact of user
studies in music information retrieval. Journal of Intelligent Information Systems
● Marques, G., Domingues, M. A., Langlois, T., & Gouyon, F. (2011). Three Current Issues In Music
Autotagging. ISMIR
● Neyman, J. & Pearson, E.S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes
of Statistical Inference: Part I. Biometrika
● Panteli, M., Rocha, B., Bogaards, N. & Honingh, A. (2017). A model for rhythm and timbre similarity in
electronic dance music. Musicae Scientiae
● Quinton, E., Harte, C. & Sandler, M. (2015). Extraction of metrical structure from music recordings.
DAFX
● Sakai, T. (2014). Statistical Reform in Information Retrieval? ACM SIGIR Forum
● Savoy, J. (1997). Statistical Inference in Retrieval Effectiveness Evaluation. Information Processing and
Management
● Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval
Research. Journal of Intelligent Information Systems
● Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs
for Generalized Causal Inference. Houghton-Mifflin
● SerrΓ , J., MΓΌller, M., Grosche, P., & Arcos, J.L. (2014). Unsupervised music structure annotation by time
series structure features and segment similarity. IEEE Trans. on Multimedia
● Smith, J.B.L. & Chew, E. (2013). A meta-analysis of the MIREX structure segmentation task. ISMIR
● Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for
Information Retrieval Evaluation. ACM CIKM
● Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement Among Statistical Significance Tests for
Information Retrieval Evaluation at Varying Sample Sizes. ACM SIGIR
4
● Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our
Metrics, that They Are Unrealistic. European Workshop on Human-Computer Interaction and
Information Retrieval
● Student. (1908). The Probable Error of a Mean. Biometrika
● Sturm, B. L. (2013). Classification Accuracy is Not Enough: On the Evaluation of Music Genre
Recognition Systems. Journal of Intelligent Information Systems
● Sturm, B. L. (2014). The State of the Art Ten Years After a State of the Art: Future Research in Music
Information Retrieval. Journal of New Music Research
● Sturm, B.L. (2014). A simple method to determine if a music information retrieval system is a horse.
IEEE Trans. on Multimedia
● Sturm, B.L. (2016). "The Horse" Inside: Seeking Causes Behind the Behaviors of Music Content
Analysis Systems. Computers in Entertainment
● Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited.
Information Processing and Management
● Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. ACM
SIGIR
● Urbano, J. (2015). Test Collection Reliability: A Study of Bias and Robustness to Statistical
Assumptions via Stochastic Simulation. Information Retrieval Journal
● Urbano, J., Downie, J. S., McFee, B., & Schedl, M. (2012). How Significant is Statistically Significant?
The case of Audio Music Similarity and Retrieval. ISMIR
● Urbano, J., Marrero, M., & MartΓ­n, D. (2013a). A Comparison of the Optimality of Statistical Significance
Tests for Information Retrieval Evaluation. ACM SIGIR
● Urbano, J., Marrero, M. & MartΓ­n, D. (2013b). On the Measurement of Test Collection Reliability. SIGIR
● Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of
Intelligent Information Systems
5
● Urbano, J. & Marrero, M. (2016). Toward Estimating the Rank Correlation between the Test Collection
Results and the True System Performance. SIGIR
● Urbano, J. & Nagler, T. (2018). Stochastic Simulation of Test Collections: Evaluation Scores. SIGIR
● Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error.
ACM SIGIR
● Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. ACM CIKM
● Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error
Costs Us Jobs, Justice, and Lives. University of Michigan Press
● Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? ACM
SIGIR
6
  • 22. Hypothesis: A is better than B How would you design this experiment?
  • 23. Measure user experience ο‚· We are interested in user-measures β—‹ Time to complete task β—‹ Idle time β—‹ Success/Failure rate β—‹ Frustration β—‹ Ease of learning β—‹ Ease of use … ο‚· Their distributions describe user experience β—‹ For an arbitrary user and topic (and document collection?) β—‹ What can we expect? 5 0 time to complete task none frustration muchsome
  • 24. Sources of variability user-measure = f(documents, topic, user, system) ο‚· Our goal is the distribution of the user-measure for our system, which is impossible to calculate β—‹ (Possibly?) infinite populations ο‚· As usual, the best we can do is estimate it β—‹ Becomes subject to random error 6
  • 25. Desired: Live Observation ο‚· Estimate distributions with a live experiment ο‚· Sample documents, topics and users ο‚· Have them use the system, for real ο‚· Measure user experience, implicitly or explicitly ο‚· Many problems β—‹ High cost, representativeness β—‹ Ethics, privacy, hidden effects, inconsistency β—‹ Hard to replicate experiment and repeat results β—‹ Just plain impossible to reproduce results *replicate = same method, different sample reproduce = same method, same sample 7
  • 26. Alternative: Fixed samples ο‚· Get (hopefully) good samples, fix them and reuse β—‹ Documents β—‹ Topics β—‹ Users ο‚· Promotes reproducibility and reduces variability ο‚· But we can’t just fix the users! 8
  • 27. Simulate users ο‚· Cranfield paradigm: remove users, but include a user- abstraction, fixed across experiments β—‹ Static user component: judgments or annotations in ground truth β—‹ Dynamic user component: effectiveness or performance measures ο‚· Removes all sources of variability, except systems user-measure = f(documents, topic, user, system) 9
  • 28. Simulate users ο‚· Cranfield paradigm: remove users, but include a user- abstraction, fixed across experiments β—‹ Static user component: judgments or annotations in ground truth β—‹ Dynamic user component: effectiveness or performance measures ο‚· Removes all sources of variability, except systems user-measure = f(documents, topic, user, system) 9 user-measure = f(system)
  • 29. Datasets (aka Test Collections) ο‚· Controlled sample of documents, topics and judgments, shared across researchers… ο‚· …combined with performance measures ο‚· (Most?) important resource for IR research β—‹ Experiments are inexpensive (datasets are not!) β—‹ Research becomes systematic β—‹ Evaluation is deterministic β—‹ Reproducibility is not only possible but easy 10
  • 31. User Models & Annotation Protocols ο‚· In practice, there are hundreds of options ο‚· Utility of a document w.r.t. scale of annotation β—‹ Binary or graded relevance? β—‹ Linear utility w.r.t. relevance? Exponential? β—‹ Independent of other documents? ο‚· Top heaviness to penalize late arrival β—‹ No discount? β—‹ Linear discount? Logarithmic? β—‹ Independent of other documents? ο‚· Interaction, browsing? ο‚· Cutoff β—‹ Fixed: only top k documents? β—‹ Dynamic: wherever some condition is met? β—‹ All documents? ο‚· etc 12
  • 32. Tasks vs Use Cases ο‚· Everything depends on the use case of interest ο‚· The same task may have several use cases (or subtasks) β—‹ Informational β—‹ Navigational β—‹ Transactional β—‹ etc ο‚· Different use cases may imply, suggest or require different decisions wrt system input/output, goal, annotations, measures... 13
  • 34. Instrument recognition 1) Given a piece of music as input, identify the instruments that are played in it: β—‹ For each window of T milliseconds, return a list of instruments being played (extraction). β—‹ Return the instruments being played anywhere in the piece (classification). 2) Given an instrument as input, retrieve a list of music pieces in which the instrument is played: β—‹ Return the list of music pieces (retrieval). β—‹ Return the list, but for each piece also provide a clip (start-end) where the instrument is played (retrieval+extraction). ο‚· Each case implies different systems, annotations and measures, and even different end-users (non-human?) https://github.com/cosmir/open-mic/issues/19 15
  • 35. But wait a minute... ο‚· Are we estimating distributions about users or distributions about systems? user-measure = f(system) system-measure = f(system, protocol, measure) 16
  • 36. But wait a minute... ο‚· Are we estimating distributions about users or distributions about systems? user-measure = f(system) system-measure = f(system, protocol, measure) 16 system-measure = f(system, protocol, measure, annotator, context, ...) ο‚· Whether the system output satisfies the user or not, has nothing to do with how we measure its performance ο‚· What is the best way to predict user satisfaction?
  • 37. Real world vs. The lab 17 The Web Abstraction Prediction Real World Cranfield IR System Topic Relevance Judgments IR System Documents AP DCG RR Static Component Dynamic Component Test Collection Effectiveness Measures Information need
  • 38. Output Cranfield in Music IR 18 System Input Measure Annotations Users
  • 39. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity 19 System collection track track track track track track
  • 40. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming 20 System collection hum track track track track track
  • 41. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation 21 System collection user track track track track track
  • 42. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation ● Annotation β—‹ Genre classification 22 System track genre
  • 43. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation ● Annotation β—‹ Genre classification β—‹ Mood recognition 23 System track mood1 mood2
  • 44. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation ● Annotation β—‹ Genre classification β—‹ Mood recognition β—‹ Autotagging 24 System track tag tag tag
  • 45. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation ● Annotation β—‹ Genre classification β—‹ Mood recognition β—‹ Autotagging ● Extraction β—‹ Structural segmentation 25 System track seg segseg seg
  • 46. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation ● Annotation β—‹ Genre classification β—‹ Mood recognition β—‹ Autotagging ● Extraction β—‹ Structural segmentation β—‹ Melody extraction 26 System track
  • 47. Classes of Tasks in Music IR ● Retrieval β—‹ Music similarity β—‹ Query by humming β—‹ Recommendation ● Annotation β—‹ Genre classification β—‹ Mood recognition β—‹ Autotagging ● Extraction β—‹ Structural segmentation β—‹ Melody extraction β—‹ Chord estimation 27 System track chocho cho
  • 48. Evaluation as Simulation ο‚· Cranfield-style evaluation with datasets is a simulation of the user-system interaction, deterministic, maybe even simplistic, but a simulation nonetheless ο‚· Provides us with data to estimate how good our systems are, or which one is better ο‚· Typically, many decisions are made for the practitioner ο‚· Comes with many assumptions and limitations 28
  • 49. Validity and Reliability ο‚· Validity: are we measuring what we want to? β—‹ Internal: are observed effects due to hidden factors? β—‹ External: are input items, annotators, etc generalizable? β—‹ Construct: do system-measures match user-measures? β—‹ Conclusion: how good is good and how better is better? Systematic error ο‚· Reliability: how repeatable are the results? β—‹ Will I obtain the same results with a different collection? β—‹ How large do collections need to be? β—‹ What statistical methods should be used? Random error 29
  • 50. 30 Not Valid Reliable Valid Not Reliable Not Valid Not Reliable Valid Reliable
  • 51. So long as... β€’ So long as the dataset is large enough to minimize random error and draw reliable conclusions β€’ So long as the tools we use to make those conclusions can be trusted β€’ So long as the task and use case are clear β€’ So long as the annotation protocol and performance measure (ie. user model) are realistic and actually measure something meaningful for the use case β€’ So long as the samples of inputs and annotators present in the dataset are representative for the task 31
  • 52. What else How So long as... β€’ So long as the dataset is large enough to minimize random error and draw reliable conclusions β€’ So long as the tools we use to make those conclusions can be trusted β€’ So long as the task and use case are clear β€’ So long as the annotation protocol and performance measure (ie. user model) are realistic and actually measure something meaningful for the use case β€’ So long as the samples of inputs and annotators present in the dataset are representative for the task 31
  • 53. β€œIf you can’t measure it, you can’t improve it.” β€”Lord Kelvin 32
  • 54. β€œIf you can’t measure it, you can’t improve it.” β€”Lord Kelvin 32 β€œBut measurements have to be trustworthy.” β€”yours truly
  • 55. Part II: How? JuliΓ‘n Urbano Arthur Flexer An ISMIR 2018 Tutorial Β· Paris
  • 57. Populations of interest ● The task and use case define the populations of interest β—‹ Music tracks β—‹ Users β—‹ Annotators β—‹ Vocabularies ● Impossible to study all entities in these populations β—‹ Too many β—‹ Don’t exist anymore (or yet) β—‹ Too far away β—‹ Too expensive β—‹ Illegal 3
  • 58. Populations and samples ● Our goal is to study the performance of the system on these populations β—‹ Lets us know what to expect from the system in the real world ● We’re typically interested in the mean: the expectation πœ‡ β—‹ Based on this we would decide what research line to pursue, what paper to publish, what project to fund, etc. β—‹ But variability is also important, though often neglected ● A dataset represents just a sample from that population β—‹ By studying the sample we could generalize back to the population β—‹ But will bear some degree of random error due to sampling 4
  • 60. Populations and samples 5 Target population External validity Accessible population or sampling frame
  • 61. Populations and samples 5 Target population External validity Accessible population or sampling frame Content Validity and Reliability Sample
  • 62. Inference Populations and samples 5 Target population External validity Accessible population or sampling frame Content Validity and Reliability Sample
  • 63. Generalization Inference Populations and samples 5 Target population External validity Accessible population or sampling frame Content Validity and Reliability Sample
  • 64. Populations and samples ● This is an estimation problem ● The objective is the distribution 𝐹 of performance over the population, specifically the mean πœ‡ ● Given a sample of observations 𝑋1, … , 𝑋 𝑛, estimate πœ‡ ● Most straightforward estimator is the sample mean: πœ‡ = 𝑋 ● Problem: 𝝁 = 𝝁 + 𝒆, being 𝒆 random error ● For any given sample or dataset, we only know 𝑿 ● How confident we are in the results and our conclusions, depends on the size of 𝒆 with respect to 𝝁 6
  • 65. Populations and samples 7 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 66. Populations and samples 7 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density sample (n=10) performance frequency 0 1^
  • 67. Populations and samples 7 sample (n=10) performance frequency 0 1^ 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density sample (n=10) performance frequency 0 1^
  • 68. Populations and samples 7 sample (n=10) performance frequency 0 1^ 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density sample (n=10) performance frequency 0 1^ sample (n=10) performance frequency 0 1^
  • 69. Populations and samples 8 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 70. sample (n=30) performance frequency 0 1^ Populations and samples 8 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 71. sample (n=30) performance frequency 0 1^ sample (n=30) performance frequency 0 1^ Populations and samples 8 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 72. sample (n=30) performance frequency 0 1^ sample (n=30) performance frequency 0 1^ sample (n=30) performance frequency 0 1^ Populations and samples 8 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 73. Populations and samples 9 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 74. sample (n=100) performance frequency 0 1^ Populations and samples 9 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 75. sample (n=100) performance frequency 0 1^ sample (n=100) performance frequency 0 1^ Populations and samples 9 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 76. sample (n=100) performance frequency 0 1^ sample (n=100) performance frequency 0 1^ sample (n=100) performance frequency 0 1^ Populations and samples 9 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 population performance density
  • 77. 0.35 0.40 0.45 0.50 0.55 010203040 sampling distribution mean performance X density n=10 n=30 n=50 n=100 0.0 0.2 0.4 0.6 0.8 1.0 01234 population performance density Sampling distribution and standard error ● Let us assume some distribution with some πœ‡ ● Experiment: draw random sample of size 𝑛 and compute 𝑋 ● The sampling distribution is the distribution of 𝑋 over replications of the experiment ● Standard error is the std. dev. of the sampling distribution 10
  • 78. Estimating the mean ● The true distribution 𝐹 has mean πœ‡ and variance 𝜎2 β—‹ 𝐸 𝑋 = πœ‡ β—‹ π‘‰π‘Žπ‘Ÿ 𝑋 = 𝜎2 ● For the sample mean 𝑋 = 1 𝑛 βˆ‘π‘‹π‘– we have β—‹ 𝐸 𝑋 = 1 𝑛 βˆ‘πΈ 𝑋𝑖 = πœ‡ β—‹ π‘‰π‘Žπ‘Ÿ 𝑋 = 1 𝑛2 βˆ‘π‘‰π‘Žπ‘Ÿ 𝑋𝑖 = 𝜎2 𝑛 std. error = 𝜎 𝑋 = 𝜎 𝑛 ● Law of large numbers: 𝑋 β†’ πœ‡ as 𝑛 β†’ ∞ ● The larger the dataset, the better our estimates ● Regardless of the true distribution 𝑭 over the population 11
  • 79. Gaussians everywhere ● For the special case where 𝐹 = 𝑁 πœ‡, 𝜎2 , the sample mean is also Gaussian, specifically 𝑋~𝑁 πœ‡, 𝜎2 𝑛 ● But a Gaussian distribution is sometimes a very unrealistic model for our data ● Central Limit Theorem (CLT): 𝑋 𝑑 β†’ 𝑁 πœ‡, 𝜎2 𝑛 as 𝑛 β†’ ∞ ● Regardless of the shape of the original 𝑭 12 0.0 0.2 0.4 0.6 0.8 1.0 05101520 population performance density n=10 mean performance X density 0.00 0.04 0.08 0.12 051525 n=100 mean performance X density 0.00 0.04 0.08 0.12 02060
  • 80. Approximations ● Until the late 1890s, the CLT was invoked everywhere for the simplicity of working with Gaussians ● Tables of the Gaussian distribution were used to test ● Still, there were two main problems β—‹ 𝜎2 is unknown β—‹ The rate of converge, ie. small samples ● But something happened at the Guinness factory in 1908 13
  • 82. Student-t distribution ● Gaussian approximations for sampling distributions were reasonably good for large samples, but not for small ● Gosset thought about deriving the theoretical distributions under assumptions of the underlying model ● Specifically, when 𝑋~𝑁 πœ‡, 𝜎2 : β—‹ If 𝜎 is known, we know that 𝑧 = π‘‹βˆ’πœ‡ 𝜎 𝑛 ~𝑁 0,1 β—‹ If 𝜎 is unknown Gosset introduced the Student-t distribution: 𝑑 = π‘‹βˆ’πœ‡ 𝑠/ 𝑛 ~𝑇 𝑛 βˆ’ 1 , where 𝑠 is the sample standard deviation ● In a sense, it accounts for the uncertainty in 𝜎 = 𝑠 15
  • 83. Small-sample problems ● In non-English literature there are earlier mentions, but it was popularized by Gosset and, mostly, Fisher ● He initiated the study of the so-called small-sample problems, specifically with the Student-𝑑 distribution 16 -4 -2 0 2 4 0.00.10.20.30.4 Student-t distribution t statistic density n=2 n=3 n=6 n=30
  • 85. Fisher and small samples ● Gosset did not provide a proof of the 𝑑 distribution, but Fisher did in 1912-1915 ● Fisher stopped working on small sample problems until Gosset convinced him in 1922 ● He then worked out exact distributions for correlation coefficients, regression coefficients, πœ’2 tests, etc. in the early 1920s ● These, and much of his work on estimation and design of experiments, were collected in his famous 1925 book ● This is book is sometimes considered the birth of modern statistical methods 18 Ronald Fisher
  • 86. 19
  • 87. Fisher’s significance testing ● In those other papers Fisher developed his theory of significance testing ● Suppose we have observed data 𝑋~𝑓 π‘₯ πœƒ and we are interested in testing the null hypothesis 𝐻0: πœƒ = πœƒ0 ● We choose a relevant test statistic 𝑇 s.t. large values of 𝑇 reflect evidence against 𝐻0 ● Compute the 𝒑-value 𝑝 = 𝑃 𝑇 π‘‹βˆ— β‰₯ 𝑇 𝑋 𝐻0 , that is, the probability that, under 𝐻0, we observe a sample π‘‹βˆ— with a test statistic at least as extreme as we observed initially ● Assess the statistical significance of the results, that is, reject 𝐻0 if 𝑝 is small 20
  • 88. Testing the mean ● We observed 𝑋 = {βˆ’0.13, 0.68, βˆ’0.34, 2.10, 0.83, βˆ’0.32, 0.99, 1.24, 1.08, 0.19} and assume a Gaussian model ● We set 𝐻0: πœ‡ = 0 and choose a 𝑑 statistic ● For our data, 𝑝 = 0.0155 𝑑 = 2.55 ● If we consider 𝑝 small enough, we reject 𝐻0 21 -4 -2 0 2 4 0.00.10.20.30.4 test statistic t density p
  • 89. Small p-values ● β€œwe do not want to know the exact value of p […], but, in the first place, whether or not the observed value is open to suspicion” ● Fisher provided in his book tables not of the new small- sample distributions, but of selected quantiles ● Allow for calculation of ranges of 𝑝-values given test statistics, as different degrees of evidence against 𝐻0 ● The 𝑝-value is gradable, a continuous measure of evidence 22
  • 90. 23
  • 91. p and α ● Fisher employed the term significance level α for these theoretical p-values used as reference points to identify statistically significant results: reject H0 if p ≤ α ● This is context-dependent, not fixed beforehand, and can change from time to time ● He arbitrarily "suggested" α = .05 for illustration purposes ● Observing p > α does not prove H0; it just fails to reject it 24
  • 92. Jerzy Neyman & Egon Pearson
  • 93. Pearson ● Pearson saw Fisher's tables as a way to compute critical values that "lent themselves to the idea of choice, in advance of experiment, of the risk of the 'first kind of error' which the experimenter was prepared to take" ● In a letter to Pearson, Gosset replied "if the chance is very small, say .00001, […] what it does is to show that if there is any alternative hypothesis which will explain the occurrence of the sample with a more reasonable probability, say .05 […], you will be very much more inclined to consider that the original hypothesis is not true" 26 Egon Pearson
  • 94. Neyman ● Pearson saw the light: "the only valid reason for rejecting a statistical hypothesis is that some alternative explains the observed events with a greater degree of probability" ● In 1926 Pearson writes to Neyman to propose his ideas of hypothesis testing, which they developed and published in 1928 27 Jerzy Neyman
  • 95. 28
  • 96. Errors ● α = P(type 1 error) ● β = P(type 2 error) ● Power = 1 − β 29 [Figure: test statistic densities under H0 and H1 (n=10), with the α and β areas marked] Decision table: accept H0 → true negative (H0 true) or Type 2 error (H1 true); reject H0 → Type 1 error (H0 true) or true positive (H1 true)
  • 97. Errors 30 ● H0: μ = 0, H1: μ = 0.5 ● σ = 1 ● t = √n (X̄ − μ)/σ — here the sample size n varies (n=10 vs n=30) [Figure: densities of the test statistic under H0 and H1 for n=10 and n=30, with the α and β areas marked]
  • 98. Errors 31 ● H0: μ = 0, H1: μ = 0.25 ● σ = 1 ● t = √n (X̄ − μ)/σ — here the effect under H1 is smaller [Figure: densities of the test statistic under H0 and H1 for n=10 and n=30, with the α and β areas marked]
  • 99. Errors 32 ● H0: μ = 0, H1: μ = 0.25 ● σ = 3 ● t = √n (X̄ − μ)/σ — here σ is larger [Figure: densities of the test statistic under H0 and H1 for n=10 and n=30, with the α and β areas marked]
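The three Errors slides above vary the sample size n, the effect under H1 and σ. A rough sketch of the resulting power, assuming a one-sided z-test with σ known (a simplification of the slides' setup):

```python
# Sketch: power of a one-sided z-test of H0: mu = 0 against H1: mu = mu1,
# assuming sigma is known (normal approximation of the slides' setting).
from scipy import stats

def power(mu1, sigma, n, alpha=0.05):
    crit = stats.norm.ppf(1 - alpha)      # critical value under H0
    shift = mu1 * (n ** 0.5) / sigma      # mean of the statistic under H1
    return stats.norm.sf(crit - shift)    # P(reject H0 | H1)

for n in (10, 30):
    for mu1, sigma in ((0.5, 1), (0.25, 1), (0.25, 3)):
        print(f"n={n}, mu1={mu1}, sigma={sigma}: power = {power(mu1, sigma, n):.2f}")
```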
  • 100. Neyman-Pearson hypothesis testing ● Define the null and alternative hypotheses, e.g. H0: μ = 0 and H1: μ = 0.5 ● Set the acceptable error rates α (type 1) and β (type 2) ● Select the most powerful test T for the hypotheses and α, which sets the critical value c ● Given H1 and β, select the sample size n required to detect an effect d or larger ● Collect data and reject H0 if T(X) ≥ c ● The testing conditions are set beforehand: H0, H1, α, β ● The experiment is designed for a target effect d, which determines n 33
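A sketch of the sample-size step, using the usual normal approximation with σ assumed known (a t-based calculation would give a slightly larger n):

```python
# Sketch: sample size needed to detect effect d with a one-sided z-test,
# given alpha and beta (normal approximation, sigma assumed known).
from math import ceil
from scipy import stats

def required_n(d, sigma, alpha=0.05, beta=0.20):
    z_a = stats.norm.ppf(1 - alpha)
    z_b = stats.norm.ppf(1 - beta)
    return ceil(((z_a + z_b) * sigma / d) ** 2)

print(required_n(d=0.5, sigma=1))    # about 25 observations
print(required_n(d=0.25, sigma=1))   # about 99: half the effect, four times the sample
```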
  • 101. Error rates and tests ● Under repeated experiments, the long-run error rate is α ● Neyman-Pearson did not suggest values for it: "the balance [between the two kinds of error] must be left to the investigator […] we attempt to adjust the balance between the risks 1 and 2 to meet the type of problem before us" ● For β they "suggested" α ≤ β ≤ 0.20 ● To Fisher, the choice of test statistic in his methodology was rather obvious to the investigator, so it wasn't important to him ● Neyman-Pearson answered this by defining the "best" test: the one that minimizes the type 2 error rate subject to a bound on the type 1 error rate 34
  • 102. Likelihood ratio test ● Pearson apparently suggested the likelihood ratio test for their new hypothesis testing methodology: ℒ = p(X|H0) / p(X|H1) ● Later found that, as n → ∞, −2 log ℒ ~ χ² ● Neyman was reluctant, as he thought some Bayesian consideration had to be taken about prior distributions over the hypotheses ("inverse probability" at the time) ● For simple point hypotheses like H0: θ = θ0 and H1: θ = θ1, the likelihood ratio test turned out to be the most powerful ● In the case of comparing means of Gaussians, this reduces to Student's t-test! 35
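A small sketch (my own check, not from the slides) of why the likelihood ratio reduces to a test on the sample mean for two simple Gaussian hypotheses: log ℒ is linear in X̄, so thresholding ℒ is equivalent to thresholding X̄, which is exactly what the z/t test does.

```python
# Sketch: for simple Gaussian hypotheses, log L = log p(X|H0) - log p(X|H1)
# is linear in the sample mean, so rejecting for small L is the same as
# rejecting for large X-bar.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, mu1, sigma, n = 0.0, 0.5, 1.0, 10
X = rng.normal(mu1, sigma, size=n)

log_L = (stats.norm.logpdf(X, mu0, sigma).sum()
         - stats.norm.logpdf(X, mu1, sigma).sum())
# closed form: linear in X.mean()
closed = (mu0 - mu1) * n * X.mean() / sigma**2 + n * (mu1**2 - mu0**2) / (2 * sigma**2)
print(log_L, closed)   # identical up to floating-point error
```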
  • 103. Composite hypotheses ● Neyman-Pearson theory extends to composite hypotheses of the form H: θ ∈ Θ, such as H1: μ > 0.5 ● The math got more complex, and Neyman was still somewhat reluctant: "it may be argued that it is impossible to estimate the probability of such a hypothesis without a knowledge of the relative a priori probabilities of the constituent simple hypotheses" ● Although "wishing to test the probability of a hypothesis A we have to assume that all hypotheses are a priori equally probable and calculate the probability a posteriori of A" 36
  • 105. Recap ● Fisher: significance testing ○ Inductive inference: rational belief when reasoning from sample to population ○ Rigorous experimental design to extract results from few samples ○ Replicate and develop your hypotheses, consider all significant and non-significant results together ○ Power cannot be computed beforehand ● Neyman-Pearson: hypothesis testing ○ Inductive behavior: frequency of errors in judgments ○ Long-run results from many samples ○ p-values don't have frequentist interpretations ● In the 1940s the two worlds began to appear as just one in statistics textbooks, and were rapidly adopted by researchers 38
  • 106. 39
  • 107. 39
  • 108. Null Hypothesis Significance Testing ● Collect data ● Set hypotheses, typically H0: μ = 0 and H1: μ ≠ 0 ○ Either there is an effect or there isn't ● Set α, typically to 0.05 or 0.01 ● Select test statistic based on hypotheses and compute p ● If p ≤ α, reject the null; fail to reject if p > α 40
  • 109. Common bad practices ● Run tests blindly without looking at your data ● Decide on α after computing p ● Report "(not) significant at the 0.05 level" instead of providing the actual p-value ● Report degrees of significance like "very" or "barely" ● Do not report test statistic alongside p, e.g. t(58) = 1.54 ● Accept H0 if p > α or accept H1 if p ≤ α ● Interpret p as the probability of the null ● Simply reject the null, without looking at the effect size ● Ignore the type 2 error rate β, i.e. power analysis a posteriori ● Interpret statistically significant result as important ● Train the same models until significance is found ● Publish only statistically significant results ¯\_(ツ)_/¯ 41
  • 110. NHST for (M)IR 2 systems
  • 111. Paired tests ● We typically want to compare our system B with some baseline system A ● We have the scores over n inputs from some dataset ● The hypotheses are H0: μA = μB and H1: μA ≠ μB [Figure: histograms of performance scores; A: mean=0.405, sd=0.213; B: mean=0.425, sd=0.225]
  • 112. Paired test ● If we ignore the structure of the experiment, we have a bad model and a test with low power: simple test t = 0.28, p = 0.78 ● In our experiments, every observation from A corresponds to an observation from B, i.e. they are paired observations ● The test can account for this to better model our data ● Instead of looking at A vs B, we look at A−B vs 0: paired test t = 2.16, p = 0.044 44 [Figure: paired scores of A and B across inputs]
  • 113. Paired t-test ● Assumption ○ Data come from Gaussian distributions ● Equivalent to a t-test of μD = 0, where Di = Bi − Ai ● t = √n · X̄B−A / sB−A = √n · D̄ / sD = 0.35, p = 0.73 45 ● Data (A, B): (.76, .75), (.33, .37), (.59, .59), (.28, .15), (.36, .49), (.43, .50), (.21, .33), (.43, .27), (.72, .81), (.40, .36) [Figure: scatterplot of A vs B scores]
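A sketch with scipy on the slide's data; ttest_rel is the paired test, and ttest_ind shows what happens if the pairing is ignored.

```python
# Sketch: paired t-test on the slide's data (equivalent to a one-sample
# test on D = B - A), plus the unpaired version for comparison.
from scipy import stats

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

paired = stats.ttest_rel(B, A)
print(f"paired:   t = {paired.statistic:.2f}, p = {paired.pvalue:.2f}")   # t = 0.35, p = 0.73

unpaired = stats.ttest_ind(B, A)   # ignores the pairing: worse model, lower power
print(f"unpaired: t = {unpaired.statistic:.2f}, p = {unpaired.pvalue:.2f}")
```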
  • 114. Wilcoxon signed-rank test ● Assumptions: ○ Data measured at least at interval level ○ Distribution is symmetric ● Disregard for magnitudes ● Convert all untied Di to ranks Ri ● Compute W+ and W− equal to the sum of Ri that are positive or negative ● The test statistic is W = min(W+, W−) = 21 ● W follows a Wilcoxon distribution, from which one can calculate p = 0.91 46 ● Data (A, B, D, rank): (.76, .75, −.01, 1), (.33, .37, .04, 2), (.59, .59, 0, −), (.28, .15, −.13, 8), (.36, .49, .13, 7), (.43, .50, .07, 4), (.21, .33, .12, 6), (.43, .27, −.16, 9), (.72, .81, .09, 5), (.40, .36, −.04, 3)
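The same data through scipy's Wilcoxon signed-rank test; note that scipy drops the zero difference and uses midranks for tied |D|, so the statistic can differ slightly from the slide's hand computation.

```python
# Sketch: Wilcoxon signed-rank test on the same paired data.
from scipy import stats

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

w = stats.wilcoxon(B, A)                      # works on D = B - A
print(f"W = {w.statistic}, p = {w.pvalue:.2f}")   # p close to the slide's 0.91
```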
  • 115. Sign test ● Complete disregard for magnitudes ● Simulate coin flips: was system B better (or worse) than A for some input? ● Follows a Binomial distribution ● The test statistic is the number of successes (B>A), which is 5 ● The p-value is the probability of 5 or more successes in 9 coin flips = 0.5 47 ● Data (A, B, sign): (.76, .75, −1), (.33, .37, +1), (.59, .59, 0), (.28, .15, −1), (.36, .49, +1), (.43, .50, +1), (.21, .33, +1), (.43, .27, −1), (.72, .81, +1), (.40, .36, −1)
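The sign test is just a binomial tail probability; a sketch on the same data:

```python
# Sketch: sign test as a binomial test. Ties (A == B) are dropped, leaving
# 5 "successes" (B > A) out of 9 informative pairs.
from scipy import stats

A = [.76, .33, .59, .28, .36, .43, .21, .43, .72, .40]
B = [.75, .37, .59, .15, .49, .50, .33, .27, .81, .36]

wins = sum(b > a for a, b in zip(A, B))
n = sum(b != a for a, b in zip(A, B))
p = stats.binom.sf(wins - 1, n, 0.5)               # P(X >= wins) under a fair coin
print(f"{wins}/{n} wins, one-sided p = {p:.2f}")   # 5/9, p = 0.50
```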
  • 116. Bootstrap test ● Compute deltas, Di = Bi − Ai ● The empirical distribution ecdfD estimates the true distribution FD ● Repeat for i = 1, …, T, with T large (thousands of times): ○ Draw a bootstrap sample by sampling n scores with replacement from ecdfD ○ Compute the mean of the bootstrap sample, B̄i ● Let B̄ = (1/T) ∑ B̄i; the B̄i estimate the sampling distribution of the mean ● The p-value is (1/T) ∑ 𝕀(|B̄i − B̄| ≥ |D̄|) ≈ 0.71 48
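A sketch of a bootstrap test along these lines (my implementation of the idea, with a two-sided decision rule; the tutorial's exact code may differ):

```python
# Sketch: bootstrap test on the paired differences D = B - A.
import numpy as np

rng = np.random.default_rng(7)
A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A

T = 100_000
boot_means = rng.choice(D, size=(T, len(D)), replace=True).mean(axis=1)
p = np.mean(np.abs(boot_means - boot_means.mean()) >= abs(D.mean()))
print(f"bootstrap p = {p:.2f}")   # around 0.7, as on the slide
```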
  • 117. Permutation test ● Under the null hypothesis, an arbitrary score could have been generated by system A or by system B ● Repeat for i = 1, …, T, with T large (thousands of times): ○ Create a sample Pi by randomly swapping the sign of each observation Di ○ Compute the mean P̄i ● The p-value is (1/T) ∑ 𝕀(|P̄i| ≥ |D̄|) ≈ 0.73 49
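And a sketch of the sign-flip permutation test (again my implementation of the idea, not the tutorial's code):

```python
# Sketch: sign-flip permutation test. Under H0 each paired difference is
# equally likely to be positive or negative.
import numpy as np

rng = np.random.default_rng(7)
A = np.array([.76, .33, .59, .28, .36, .43, .21, .43, .72, .40])
B = np.array([.75, .37, .59, .15, .49, .50, .33, .27, .81, .36])
D = B - A

T = 100_000
signs = rng.choice([-1, 1], size=(T, len(D)))
perm_means = (signs * D).mean(axis=1)
p = np.mean(np.abs(perm_means) >= abs(D.mean()))
print(f"permutation p = {p:.2f}")   # around 0.7, as on the slide
```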
  • 118. The computer does all this for you 50
  • 121. ANOVA ● Assume a model ysi = μ + νs + νi + esi, where νs = ȳs· − μ ○ Implicitly "pairs" the observations by item ● The variance of the observed scores can be decomposed as σ²y = σ²s + σ²i + σ²e ● where σ²s is the variance across system means ○ Low: system means are close to each other ○ High: system means are far from each other ● The null hypothesis is H0: μ1 = μ2 = μ3 = ⋯ ○ Even if we reject, we still don't know which system is different! ● The test statistic is of the form F = σ²s / σ²e ● We'd like to have σ²s ≫ σ²i, σ²e
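A sketch of such a two-way decomposition with statsmodels on hypothetical data (the column names 'system', 'item', 'score' and the effect sizes are my own assumptions):

```python
# Sketch: fit y_si = mu + nu_s + nu_i + e_si on hypothetical scores and
# read off the F tests for the system and item effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
systems, items = ['A', 'B', 'C'], range(20)
item_effect = {i: rng.normal(0, 0.10) for i in items}        # per-item difficulty
rows = [{'system': s, 'item': i,
         'score': 0.5 + (0.05 if s == 'B' else 0.0)          # small system effect
                  + item_effect[i] + rng.normal(0, 0.05)}    # item effect + noise
        for s in systems for i in items]
df = pd.DataFrame(rows)

model = smf.ols('score ~ C(system) + C(item)', data=df).fit()
print(anova_lm(model))   # F and p for the system and item effects
```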
  • 122. Friedman test ● Same principle as ANOVA, but non-parametric ● Similarly to Wilcoxon, rank observations (per item) and estimate effects ● Ignores actual magnitudes; simply uses ranks 54
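A sketch with scipy's Friedman test on hypothetical scores of three systems over the same items:

```python
# Sketch: Friedman test; each argument holds one system's scores over the
# same set of items (so the ranking is done per item).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
items = rng.beta(2, 2, size=20)                           # shared item difficulty
A = np.clip(items + rng.normal(0, .05, 20), 0, 1)
B = np.clip(items + 0.03 + rng.normal(0, .05, 20), 0, 1)
C = np.clip(items - 0.02 + rng.normal(0, .05, 20), 0, 1)

stat, p = stats.friedmanchisquare(A, B, C)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```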
  • 123. Multiple testing ● When testing multiple hypotheses, the probability of at least one type 1 error increases ● Multiple testing procedures correct p-values for a family-wise error rate 55 [Carterette, 2015a]
  • 124. Tukey's HSD ● Follow ANOVA to test H0: μ1 = μ2 = μ3 = ⋯ ● The maximum observed difference between systems is likely the one causing the rejection ● Tukey's HSD compares all pairs of systems, each with an individual p-value ● These p-values are then corrected based on the expected distribution of maximum differences under H0 ● In practice, it inflates p-values ● Ensures a family-wise type 1 error rate 56
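A sketch of Tukey's HSD with statsmodels on hypothetical per-item scores for three systems (the means and group sizes are made up):

```python
# Sketch: all-pairs comparisons with Tukey's HSD, with family-wise
# adjusted p-values for every pair of systems.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.40, 0.1, 50),    # system A
                         rng.normal(0.45, 0.1, 50),    # system B
                         rng.normal(0.42, 0.1, 50)])   # system C
groups = np.repeat(['A', 'B', 'C'], 50)

result = pairwise_tukeyhsd(scores, groups, alpha=0.05)
print(result.summary())
```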
  • 126. Others ● There are many other procedures to control for multiple comparisons ● Bonferroni: very conservative (low power) ● Dunnett's: compare all against a control (e.g. a baseline) ● Other procedures control the false discovery rate, i.e. the probability of a type 1 error given p < α ● One way or another, they all inflate p-values 58
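A sketch of p-value adjustment with statsmodels; the p-values below are made up for illustration:

```python
# Sketch: correcting a family of p-values with different procedures
# (Bonferroni, Holm, Benjamini-Hochberg FDR).
from statsmodels.stats.multitest import multipletests

pvals = [0.003, 0.012, 0.031, 0.044, 0.260]   # hypothetical per-pair p-values
for method in ('bonferroni', 'holm', 'fdr_bh'):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in adjusted], reject)
```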
  • 127. Part III: What Else? Validity! Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  • 128. What else can go wrong? Validity! ● Validity reprise ● Example I: Inter-rater agreement ● Example II: Adversarial examples 2
  • 130. Validity ● Validity ○ A valid experiment is an experiment actually measuring what the experimenter intended to measure ○ Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ○ Internal validity: is this relationship causal, or could confounding factors explain the relation? ○ External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? ○ Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment? 4
  • 131. Inter-rater agreement in music similarity 5
  • 132. Automatic recommendation / Playlisting 6 Millions of songs Result list Query song + =
  • 133. Automatic recommendation / Playlisting 7 Millions of songs Result list Query song + = Similarity
  • 134. Computation of similarity between songs ● Collaborative filtering (Spotify, Deezer?) ● Social meta-data (Last.Fm?) ● Expert knowledge (Pandora?) ● Meta-data from the web ● ... ● Audio-based 8
  • 135. Computation of similarity between songs 9 Songs as audio → Switching to frequencies → Computation of features → Machine Learning → similarity (metric) S(a1, a2) = ? Pictures from E. Pampalk's PhD thesis 2006
  • 136. Computation of similarity between songs 10 [Figure: query song compared against candidate songs — Similar?]
  • 137. Computation of similarity between songs 11 [Figure: the candidates most similar to the query song — Similar!! max(S)=97.9]
  • 138. Are we there yet? 14
  • 139. How can we evaluate our models of music similarity? 15 [Figure: similarity scores (e.g. 45, 87, 100, 23) computed between songs]
  • 140. How can we evaluate our models of music similarity? 16 [Figure: similarity scores (e.g. 45, 87, 100, 23) computed between songs] Do these numbers correspond to a human assessment of music similarity?
  • 141. MIREX - Music Information Retrieval eXchange 17
  • 142. MIREX - Music Information Retrieval eXchange ● Standardized testbeds allowing for fair comparison of MIR systems ● range of different tasks ● based on human evaluation ○ Cranfield: remove users, look at annotations only 18
  • 143. MIREX - Music Information Retrieval eXchange ● Standardized testbeds allowing for fair comparison of MIR systems ● range of different tasks ● based on human evaluation ○ Cranfield: remove users, look at annotations only ● What is the level of agreement between human raters/annotators? ● What does this mean for the evaluation of MIR systems? ● Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016. 19
  • 144. Audio music similarity ● Audio Music Similarity and Retrieval AMS task 2006-2014 20
  • 145. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task 2006-2014 ● 5000 song database ● participating MIR systems compute 5000x5000 distance matrix ● 60 randomly selected queries ● return 5 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 10 (very similar or identical)." 21
  • 146. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task 2006-2014 ● 7000 song database ● participating MIR systems compute 7000x7000 distance matrix ● 100 randomly selected queries ● return 5 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical)." 22
  • 147. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task 2006-2014 ● 7000 song database ● participating MIR systems compute 7000x7000 distance matrix ● 50 randomly selected queries ● return 10 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical)." 23
  • 148. Experimental design 24 Independent variable (treatment manipulated by researcher) → Dependent variable (effect measured by researcher) ● measure the effect of different treatments on a dependent variable
  • 149. Experimental design 25 Independent variable (treatment manipulated by researcher): Type of algorithm → Dependent variable (effect measured by researcher): FINE similarity rating ● measure the effect of different treatments on a dependent variable
  • 150. Experimental design 26 Independent variable treatment manipulated by researcher Type of algorithm Dependent variable effect measured by researcher FINE similarity rating ● measure the effect of different treatments on a dependent variable MIREX AMS 2014
  • 151. Experimental design 27 Independent variable treatment manipulated by researcher Type of algorithm Dependent variable effect measured by researcher FINE similarity rating ● measure the effect of different treatments on a dependent variable MIREX AMS 2014
  • 152. What about validity? ● A valid experiment is an experiment actually measuring what the experimenter intended to measure ● What is the intention of the experimenter in the AMS task? ● What do we want to measure here? 28
  • 153. Audio music similarity ● Audio Music Similarity and Retrieval (AMS) task 2006-2014 ● 7000 song database ● participating MIR systems compute 7000x7000 distance matrix ● 100 randomly selected queries ● return 5 closest candidate songs for each of the MIR systems ● for each query/candidate pair, ask the human grader: ● "Rate the similarity of the following Query-Candidate pairs. Assign a categorical similarity (Not similar, Somewhat Similar, or Very Similar) and a numeric similarity score. The numeric similarity score ranges from 0 (not similar) to 100 (very similar or identical)." 29
  • 155. Rate the similarity! 31 Query song Candidate song 0 … 100
  • 160. Rate the similarity! 36 ● Factors that influence human music perception ○ Schedl M., Flexer A., Urbano J.: The Neglected User in Music Information Retrieval Research, Journal of Intelligent Information Systems, December 2013, Volume 41, Issue 3, pp 523-539, 2013.
  • 163. Inter-rater agreement in AMS ● AMS 2006 is the only year with multiple graders ● each query/candidate pair evaluated by three different human graders ● each grader gives a FINE score between 0 and 10 (not similar … very similar) 39
  • 164. Inter-rater agreement in AMS ● AMS 2006 is the only year with multiple graders ● each query/candidate pair evaluated by three different human graders ● each grader gives a FINE score between 0 and 10 (not similar … very similar) ● correlation between pairs of graders 40
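A sketch of this kind of analysis on hypothetical data (the real AMS 2006 ratings are not included here): a matrix of FINE scores from three graders over the same query/candidate pairs, and the Pearson correlation between each pair of graders.

```python
# Sketch on hypothetical data: pairwise correlations between graders.
import numpy as np

rng = np.random.default_rng(3)
true_sim = rng.uniform(0, 10, size=200)                              # 200 query/candidate pairs
ratings = np.clip(true_sim + rng.normal(0, 2.0, (3, 200)), 0, 10)    # 3 noisy graders

for i in range(3):
    for j in range(i + 1, 3):
        r = np.corrcoef(ratings[i], ratings[j])[0, 1]
        print(f"graders {i} vs {j}: r = {r:.2f}")
```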
  • 165. Inter-rater agreement in AMS ● inter-rater agreement for different intervals of FINE scores 41
  • 166. Inter-rater agreement in AMS ● inter-rater agreement for different intervals of FINE scores 42
  • 167. Inter-rater agreement in AMS ● inter-rater agreement for different intervals of FINE scores 43
  • 168. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval 44
  • 169. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval 45
  • 170. Inter-rater agreement in AMS ● look at very similar ratings in the [9,10] interval 46 Average = 6.54
  • 171. Inter-rater agreement in AMS ● look at "very similar" ratings in the [9,10] interval ● what sounds very similar to one grader will, on average, receive a score of only 6.54 from other graders ● this constitutes an upper bound for average FINE scores in AMS ● there will always be users that disagree (a moving target) 47 Average = 6.54
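A sketch of how such an upper bound can be computed, again on hypothetical ratings rather than the real AMS 2006 data (which yields the 6.54 reported above): take the pairs one grader rated in [9, 10] and average the scores given by the other graders.

```python
# Sketch on hypothetical data: average score given by the other graders to
# pairs that one grader rated as "very similar" ([9, 10]).
import numpy as np

rng = np.random.default_rng(3)
true_sim = rng.uniform(0, 10, size=2000)
ratings = np.clip(true_sim + rng.normal(0, 2.0, (3, 2000)), 0, 10)

others_scores = []
for g in range(3):
    high = ratings[g] >= 9                     # "very similar" according to grader g
    others = np.delete(ratings, g, axis=0)     # the remaining graders
    others_scores.append(others[:, high].ravel())
print(f"average score by the other graders: {np.concatenate(others_scores).mean():.2f}")
```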
  • 172. Comparison to the upper bound ● compare top performing systems 2007, 2009 - 2014 to upper bound 48
  • 173. Comparison to the upper bound ● compare top performing systems 2007, 2009 - 2014 to upper bound 49
  • 174. Comparison to the upper bound ● upper bound has already been reached in 2009 50 PS2
  • 175. Comparison to the upper bound ● upper bound has already been reached in 2009 51 PS2 PS2 PS2 PS2 PS2 PS2
  • 176. Comparison to the upper bound 52 ● upper bound has already been reached in 2009 ● can upper bound be surpassed in the future? ● or is this an inherent problem due to low inter-rater agreement in human evaluation of music similarity? ● this prevents progress in MIR research on music similarity ● AMS task dead since 2015
  • 177. What went wrong here? 53
  • 178. What about validity? ● A valid experiment is an experiment actually measuring what the experimenter intended to measure ● What is the intention of the experimenter in the AMS task? ● What do we want to measure here? 54
  • 179. Construct Validity ● Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment? 55
  • 180. Construct Validity ● Construct validity: are intentions and hypotheses of the experimenter represented in the actual experiment? ● Unclear intention: to measure an abstract concept of music similarity? ● Possible solutions: ○ more fine-grained notion of similarity ○ ask a more specific question? ○ does something like abstract music similarity even exist? ○ evaluation of complete MIR systems centered around a specific task (use case) could lead to much clearer hypotheses ○ Remember the MIREX Grand Challenge on User Experience 2014? ○ "You are creating a short video about a memorable occasion that happened to you recently, and you need to find some (copyright-free) songs to use as background music." 56
  • 181. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors influence human music perception and need to be controlled in the experimental design 57
  • 182. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors influence human music perception and need to be controlled in the experimental design ● Possible solutions: 58 Independent variable: Type of algorithm → Dependent variable: FINE similarity rating
  • 183. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors influence human music perception and need to be controlled in the experimental design ● Possible solutions: 59 Independent variable: Type of algorithm → Dependent variable: FINE similarity rating; Control variables: gender, age, musical training/experience/preference, type of music, ...
  • 184. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors influence human music perception and need to be controlled in the experimental design ● Possible solutions: 60 Independent variable: Type of algorithm → Dependent variable: FINE similarity rating; Control variables: gender, age: female only, age 20-30y; musical training/experience/preference: music professionals; type of music: piano concertos only
  • 185. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Many factors influence human music perception and need to be controlled in the experimental design ● Possible solutions: 61 Independent variable: Type of algorithm → Dependent variable: FINE similarity rating; Control variables: gender, age: female only, age 20-30y; musical training/experience/preference: music professionals; type of music: piano concertos only → Very specialized, limited generality
  • 186. Internal Validity ● Control variable, monitor it: 62
  • 187. Internal Validity ● Control variable, monitor it: 63 Exponential complexity
  • 188. External Validity ● External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? 64
  • 189. External Validity ● External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? ● Unclear target population: identical with sample of 7000 US pop songs? All US pop music in general? ● Beware: cross-collection studies show dramatic losses in performance ○ Bogdanov, D., Porter, A., Herrera Boyer, P., & Serra, X. (2016). Cross-collection evaluation for music classification tasks. ISMIR 2016. ● Possible solutions: ○ Clear target population ○ More constricted target populations ○ Much larger data samples ○ Use case? 65
  • 190. Conclusion Validity ● Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ● A large difference in effect measures is needed for users to see the difference 66
  • 191. Conclusion Validity ● Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ● A large difference in effect measures is needed for users to see the difference 67 J. Urbano, J. S. Downie, B. McFee and M. Schedl: How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval, ISMIR 2012.
  • 192. Conclusion Validity ● Conclusion validity: does a difference between system measures correspond to a difference in user measures, and is it noticeable to users? ● A large difference in effect measures is needed for users to see the difference ● Possible solutions: ○ Are there system measures that better correspond to user measures? ○ Use case! 68
  • 193. Lack of inter-rater agreement in other areas 69
  • 194. Lack of inter-rater agreement ● It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound 70
  • 195. Lack of inter-rater agreement ● It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound ● MIREX 'Music Structural Segmentation' task ○ Human annotations of structural segmentations (structural boundaries and labels denoting repeated segments: chorus, verse, …) ○ Algorithms have to produce such annotations ○ F1-score between different annotators as upper bound ○ Upper bound reached at least for certain music (classical and world music) 71
  • 196. Lack of inter-rater agreement ● It does not make sense to go beyond inter-rater agreement; it constitutes an upper bound ● MIREX 'Music Structural Segmentation' task ○ Human annotations of structural segmentations (structural boundaries and labels denoting repeated segments: chorus, verse, …) ○ Algorithms have to produce such annotations ○ F1-score between different annotators as upper bound ○ Upper bound reached at least for certain music (classical and world music) ○ Flexer A., Grill T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, J. of New Music Research, Vol. 45, No. 3, pp. 239-251, 2016. ○ Smith, J.B.L., Chew, E.: A meta-analysis of the MIREX structure segmentation task, ISMIR, 2013. ○ Serrà, J., Müller, M., Grosche, P., & Arcos, J.L.: Unsupervised music structure annotation by time series structure features and segment similarity, IEEE Transactions on Multimedia, Special Issue on Music Data Mining, 16(5), 1229–1240, 2014. 72
  • 197. Inter-rater agreement and upper bounds ● Extraction of metrical structure ○ Quinton, E., Harte, C., Sandler, M.: Extraction of metrical structure from music recordings, DAFX 2015. ● Melody estimation ○ Balke, S., Driedger, J., Abeßer, J., Dittmar, C., Müller, M.: Towards Evaluating Multiple Predominant Melody Annotations in Jazz Recordings, ISMIR 2016. ○ Bosch J.J., Gómez E.: Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms, Proc. of the Conference on Interdisciplinary Musicology, 2014. ● Timbre and rhythm similarity ○ Panteli, M., Rocha, B., Bogaards, N., Honingh, A.: A model for rhythm and timbre similarity in electronic dance music. Musicae Scientiae, 21(3), 338-361, 2017. ● Many more? 73
  • 199. Adversarial Examples - Image Recognition ● An adversary slightly and imperceptibly changes an input image to fool a machine learning system ○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014. 75 original + noise = adversarial example, all classified as "Camel"
  • 200. Adversarial Examples - MIR ● Imperceptibly filtered audio fools genre recognition system ○ Sturm B.L.: A simple method to determine if a music information retrieval system is a "horse", IEEE Trans. on Multimedia, 16(6), pp. 1636-1644, 2014. 76 deflate
  • 201. Adversarial Examples - MIR ● Imperceptibly filtered audio fools genre recognition system http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html 77 deflate
  • 202. External Validity ● External validity: do cause-effect relationships also hold for target populations beyond the sample used in the experiment? ● Unclear target population: identical with sample of few hundred ISMIR or GTZAN songs? ● Or are we aiming at genre classification in general? ● If target is genre classification in general, there is a problem! 78
  • 203. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Why can these MIR systems be fooled so easily? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ○ What is the confounding variable? 79
  • 204. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ● Why can these MIR systems be fooled so easily? ● E.g.: in the case of rhythm classification, systems were picking up tempo, not rhythm! Tempo acted as a confounding factor! ○ Sturm B.L.: The Horse Inside: Seeking Causes Behind the Behaviors of Music Content Analysis Systems, Computers in Entertainment, 14(2), 2016. 80
  • 205. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ○ What is the confounding variable? ● high dimensionality of the data input space? ○ Small perturbations to input data might accumulate over many dimensions, with minor changes 'snowballing' into larger changes in the transfer functions of deep neural networks ○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014 81
  • 206. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ○ What is the confounding variable? ● linearity of models? ○ linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect … rectified linear units (ReLU)? ○ Goodfellow I.J., Shlens J., Szegedy C.: Explaining and harnessing adversarial examples, ICLR, 2014 82
  • 207. Internal Validity ● Internal validity: is the relationship causal or could confounding factors explain the relation? ○ no causal relation between the class (e.g. genre) represented in the data and the label returned by the classifier ● Open question: what is the confounding variable???? 83
  • 209. Validity ● Validity ○ A valid experiment is an experiment actually measuring what the experimenter intended to measure ○ Conclusion validity ○ Internal validity ○ External validity ○ Construct validity ● Care about the validity of your experiments! ● Validity is the right framework to talk about these problems 85
  • 210. Part IV: So? Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  • 211. What's in a p-value? ● It confounds effect size and sample size, e.g. t = √n (X̄ − μ)/σ ● Unfortunately, we virtually never check power. Don't ever accept H0 ● An "easy" way to achieve significance is obtaining more data, but the true effect remains the same ● Even if one rejects H0, it could still be true 2
  • 212. P(H0 | p ≤ α) = P(p ≤ α | H0) P(H0) / P(p ≤ α) = P(p ≤ α | H0) P(H0) / [P(p ≤ α | H0) P(H0) + P(p ≤ α | H1) P(H1)] = α P(H0) / [α P(H0) + (1 − β) P(H1)] ● P(H0) = P(H1) = 0.5 ○ α = 0.05, β = 0.05 → P(H0 | p ≤ α) = 0.05 ○ α = 0.05, β = 0.5 → P(H0 | p ≤ α) = 0.09 ● P(H0) = 0.8, P(H1) = 0.2 ○ α = 0.05, β = 0.05 → P(H0 | p ≤ α) = 0.17 ○ α = 0.05, β = 0.5 → P(H0 | p ≤ α) = 0.29 3
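The numbers above follow directly from the last expression; a quick sketch:

```python
# Sketch: P(H0 | p <= alpha) = alpha*P(H0) / (alpha*P(H0) + (1-beta)*P(H1)).
def posterior_h0(alpha, beta, prior_h0):
    prior_h1 = 1 - prior_h0
    return alpha * prior_h0 / (alpha * prior_h0 + (1 - beta) * prior_h1)

for prior_h0 in (0.5, 0.8):
    for beta in (0.05, 0.5):
        print(f"P(H0)={prior_h0}, beta={beta}: "
              f"P(H0 | p<=alpha) = {posterior_h0(0.05, beta, prior_h0):.2f}")
```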
  • 213. H0 is always false ● In this kind of dataset-based experiments, H0: μ = 0 is always false ● Two systems may be veeeeeery similar, but not the same ● Binary accept/reject decisions don't even make sense ● Why bother with multiple comparisons then? ● Care about type S(ign) and M(agnitude) errors ● To what extent do non-parametric methods make sense (Wilcoxon, Sign, Friedman), especially combined with parametric ones like Tukey's? 4
  • 214. Binary thinking no more ● Nothing wrong with the p-value, but with its use ● p as a detective vs p as a judge ● Any α is completely arbitrary ● What is the cost of a type 2 error? ● How does the lack of validity affect NHST? (measures, sampling frames, ignoring cross-assessor variability, etc.) ● What about researcher degrees of freedom? ● Why not focus on effect sizes? Intervals, correlations, etc. ● Bayesian methods? What priors? 5
  • 215. Assumptions ● In dataset-based (M)IR experiments, test assumptions are false by definition ● p-values are, to some degree, approximated ● So again, why use any threshold? ● So which test should you choose? ● Run them all, and compare ● If they tend to disagree, take a closer look at the data ● Look beyond the experiment at hand, gather more data ● Always perform error analysis to make sense of it 6
  • 216. Replication ● Fisher, and especially Neyman-Pearson, advocated for replication ● A p-value is only concerned with the current data ● The hypothesis testing framework only makes sense with repeated testing ● In (M)IR we hardly do it; we're stuck with the same datasets 7
  • 217. Significant β‰  Relevant β‰  Interesting 8 All research
  • 218. Significant β‰  Relevant β‰  Interesting 8 All research Interesting
  • 219. Significant β‰  Relevant β‰  Interesting 8 All research Interesting Relevant
  • 220. Significant β‰  Relevant β‰  Interesting 8 All research Interesting Relevant Statistically Significant
  • 221. There is always random error in our experiments, so we always need some kind of statistical analysis. But there is no point in being too picky or intense about how we do it. Nobody knows how to do it properly, and different fields adopt different methods. What is far more productive is to adopt an exploratory attitude rather than mechanically testing. 9
  • 222. References Julián Urbano Arthur Flexer An ISMIR 2018 Tutorial · Paris
  • 223. ● Al-Maskari, A., Sanderson, M., & Clough, P. (2007). The Relationship between IR Effectiveness Measures and User Satisfaction. ACM SIGIR ● Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null Hypothesis Testing: Problems, Prevalence, and an Alternative. Journal of Wildlife Management ● Armstrong, T.G., Moffat, A., Webber, W. & Zobel, J. (2009). Improvements that don't add up: ad-hoc retrieval results since 1998. CIKM ● Balke, S., Driedger, J., Abeßer, J., Dittmar, C. & Müller, M. (2016). Towards Evaluating Multiple Predominant Melody Annotations in Jazz Recordings. ISMIR ● Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman Have Agreed on Testing? Statistical Science ● Bosch J.J. & Gómez E. (2014). Melody extraction in symphonic classical music: a comparative study of mutual agreement between humans and algorithms. Conference on Interdisciplinary Musicology ● Boytsov, L., Belova, A. & Westfall, P. (2013). Deciding on an adjustment for multiplicity in IR experiments. SIGIR ● Carterette, B. (2012). Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM Transactions on Information Systems ● Carterette, B. (2015a). Statistical Significance Testing in Information Retrieval: Theory and Practice. ICTIR ● Carterette, B. (2015b). Bayesian Inference for Information Retrieval Evaluation. ACM ICTIR ● Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum ● Cormack, G. V., & Lynam, T. R. (2006). Statistical Precision of Information Retrieval Evaluation. ACM SIGIR ● Downie, J. S. (2004). The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future. Computer Music Journal ● Fisher, R. A. (1925). Statistical Methods for Research Workers. Cosmo Publications ● Flexer, A. (2006). Statistical Evaluation of Music Information Retrieval Experiments. Journal of New Music Research 2
  • 224. ● Flexer, A., Grill, T.: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity, Journal of New Music Research ● Gelman, A. (2013b). The problem with p-values is how they're used. ● Gelman, A., Hill, J., & Yajima, M. (2012). Why We (Usually) Don't Have to Worry About Multiple Comparisons. Journal of Research on Educational Effectiveness ● Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time. ● Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist ● Gelman, A., & Stern, H. (2006). The Difference Between Significant and Not Significant is not Itself Statistically Significant. The American Statistician ● Goodfellow I.J., Shlens J. & Szegedy C. (2014). Explaining and harnessing adversarial examples. ICLR ● Gouyon, F., Sturm, B. L., Oliveira, J. L., Hespanhol, N., & Langlois, T. (2014). On Evaluation Validity in Music Autotagging. ACM Computing Research Repository. ● Hersh, W., Turpin, A., Price, S., Chan, B., Kraemer, D., Sacherek, L., & Olson, D. (2000). Do Batch and User Evaluations Give the Same Results? ACM SIGIR ● Hu, X., & Kando, N. (2012). User-Centered Measures vs. System Effectiveness in Finding Similar Songs. ISMIR ● Hull, D. (1993). Using Statistical Testing in the Evaluation of Retrieval Experiments. ACM SIGIR ● Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine ● Lehmann, E.L. (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? Journal of the American Statistical Association ● Lehmann, E.L. (2011). Fisher, Neyman, and the Creation of Classical Statistics. Springer 3
  • 225. ● Lee, J. H., & Cunningham, S. J. (2013). Toward an understanding of the history and impact of user studies in music information retrieval. Journal of Intelligent Information Systems ● Marques, G., Domingues, M. A., Langlois, T., & Gouyon, F. (2011). Three Current Issues In Music Autotagging. ISMIR ● Neyman, J. & Pearson, E.S. (1928). On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I. Biometrika ● Panteli, M., Rocha, B., Bogaards, N. & Honingh, A. (2017). A model for rhythm and timbre similarity in electronic dance music. Musicae Scientiae ● Quinton, E., Harte, C. & Sandler, M. (2015). Extraction of metrical structure from music recordings. DAFX ● Sakai, T. (2014). Statistical Reform in Information Retrieval? ACM SIGIR Forum ● Savoy, J. (1997). Statistical Inference in Retrieval Effectiveness Evaluation. Information Processing and Management ● Schedl, M., Flexer, A., & Urbano, J. (2013). The Neglected User in Music Information Retrieval Research. Journal of Intelligent Information Systems ● Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin ● Serrà, J., Müller, M., Grosche, P., & Arcos, J.L. (2014). Unsupervised music structure annotation by time series structure features and segment similarity. IEEE Trans. on Multimedia ● Smith, J.B.L. & Chew, E. (2013). A meta-analysis of the MIREX structure segmentation task. ISMIR ● Smucker, M. D., Allan, J., & Carterette, B. (2007). A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. ACM CIKM ● Smucker, M. D., Allan, J., & Carterette, B. (2009). Agreement Among Statistical Significance Tests for Information Retrieval Evaluation at Varying Sample Sizes. ACM SIGIR 4
  • 226. ● Smucker, M. D., & Clarke, C. L. A. (2012). The Fault, Dear Researchers, is Not in Cranfield, But in Our Metrics, that They Are Unrealistic. European Workshop on Human-Computer Interaction and Information Retrieval ● Student. (1908). The Probable Error of a Mean. Biometrika ● Sturm, B. L. (2013). Classification Accuracy is Not Enough: On the Evaluation of Music Genre Recognition Systems. Journal of Intelligent Information Systems ● Sturm, B. L. (2014). The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval. Journal of New Music Research ● Sturm, B.L. (2014). A simple method to determine if a music information retrieval system is a horse, IEEE Trans. on Multimedia ● Sturm B.L. (2016). "The Horse" Inside: Seeking Causes Behind the Behaviors of Music Content Analysis Systems, Computers in Entertainment ● Tague-Sutcliffe, J. (1992). The Pragmatics of Information Retrieval Experimentation, Revisited. Information Processing and Management ● Turpin, A., & Hersh, W. (2001). Why Batch and User Evaluations Do Not Give the Same Results. ACM SIGIR ● Urbano, J. (2015). Test Collection Reliability: A Study of Bias and Robustness to Statistical Assumptions via Stochastic Simulation. Information Retrieval Journal ● Urbano, J., Downie, J. S., McFee, B., & Schedl, M. (2012). How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval. ISMIR ● Urbano, J., Marrero, M., & Martín, D. (2013a). A Comparison of the Optimality of Statistical Significance Tests for Information Retrieval Evaluation. ACM SIGIR ● Urbano, J., Marrero, M. & Martín, D. (2013b). On the Measurement of Test Collection Reliability. SIGIR ● Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in Music Information Retrieval. Journal of Intelligent Information Systems 5
  • 227. ● Urbano, J. & Marrero, M. (2016). Toward Estimating the Rank Correlation between the Test Collection Results and the True System Performance. SIGIR ● Urbano, J. & Nagler, T. (2018). Stochastic Simulation of Test Collections: Evaluation Scores. SIGIR ● Voorhees, E. M., & Buckley, C. (2002). The Effect of Topic Set Size on Retrieval Experiment Error. ACM SIGIR ● Webber, W., Moffat, A., & Zobel, J. (2008). Statistical Power in Retrieval Experimentation. ACM CIKM ● Ziliak, S. T., & McCloskey, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. University of Michigan Press ● Zobel, J. (1998). How Reliable are the Results of Large-Scale Information Retrieval Experiments? ACM SIGIR 6