Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts were set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has enormously gained from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of the analysis of experimental validity, this paper presents a survey of past meta-evaluation work in the context of Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR cycle, pointing out directions for further research and proposals in this line.

    Presentation Transcript

    • Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain. Julián Urbano (@julian_urbano), University Carlos III of Madrid. ISMIR 2011, Miami, USA, October 26th. Picture by Daniel Ray.
    • Picture by Bill Mill
    • current evaluation practices hinder the proper development of Music IR
    • we lack meta-evaluation studies; we can't complete the IR research & development cycle
    • how did we get here? (Picture by NASA History Office)
    • [Timeline, 1960-2011, of Text IR evaluation efforts: Cranfield 2 (1962-1966), the basis; MEDLARS (1966-1967), users; SMART (1961-1995); TREC (1992-today), large-scale collections; NTCIR (1999-today) and CLEF (2000-today), multi-language & multi-modal]
    • ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks and evaluation metrics for MIR research and development. [The timeline adds ISMIR (2000-today) and 3 workshops (2002-2003): The MIR/MDL Evaluation Project]
    • follow the steps of the Text IR folks, but carefully: not everything applies to music. [The timeline adds MIREX (2005-today), with >1200 runs!]
    • are we done already? evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR; positive impact on MIR
    • some good practices were inherited from the early Text IR efforts, but a lot of things have happened there since: "not everything applies" to music, but much of it does!
    • we still have a very long way to go
    • evaluation (Picture by Official U.S. Navy Imagery)
    • Cranfield Paradigm: Task, User Model
    • Experimental Validity: how well an experiment meets the well-grounded requirements of the scientific method; do the results fairly and actually assess what was intended? Meta-Evaluation: analyze the validity of IR Evaluation experiments
    • [Table: evaluation components involved in each type of experimental validity. Rows: Construct, Content, Convergent, Criterion, Internal, External, Conclusion; columns: Task, Ground truth, User model, Documents, Measures, Systems, Queries]
    • experimental failures
    • construct validity. #fail: measure the quality of a Web search engine by the number of visits. what? do the variables of the experiment correspond to the theoretical meaning of the concept they purport to measure? how? thorough selection and justification of the variables used
    • construct validity in IR: effectiveness measures and their user model [Carterette, SIGIR2011]; set-based measures do not resemble real users [Sanderson et al., SIGIR2010]; rank-based measures are better [Järvelin et al., TOIS2002]; graded relevance is better [Voorhees, SIGIR2001][Kekäläinen, IP&M2005]; other forms of ground truth are better [Bennet et al., SIGIRForum2008] (see the sketch below)
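
To make the set-based vs. rank-based contrast concrete, here is a minimal Python sketch (not part of the slides; the gain and discount functions are just one common choice in the discounted-gain family, and all document ids and grades are made up):

```python
import math

def precision_at_k(ranking, relevant, k=10):
    """Set-based, binary: fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def dcg_at_k(ranking, grades, k=10):
    """Rank-based, graded: discounted cumulative gain with the common
    (2^rel - 1) / log2(rank + 1) formulation."""
    return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
               for i, d in enumerate(ranking[:k]))

def ndcg_at_k(ranking, grades, k=10):
    """Normalize by the DCG of an ideal ranking, so scores are comparable across queries."""
    ideal = sorted(grades, key=grades.get, reverse=True)
    idcg = dcg_at_k(ideal, grades, k)
    return dcg_at_k(ranking, grades, k) / idcg if idcg > 0 else 0.0

# Two hypothetical systems retrieve the same set of documents, so the set-based
# measure cannot tell them apart, but the rank-based graded measure can.
grades = {"d1": 3, "d2": 2, "d3": 1}              # graded ground truth
run_a = ["d1", "d2", "d3", "d9", "d8"]            # relevant documents ranked first
run_b = ["d9", "d8", "d3", "d2", "d1"]            # same documents, relevant ones last
print(precision_at_k(run_a, grades, 5), precision_at_k(run_b, grades, 5))  # 0.6 0.6
print(round(ndcg_at_k(run_a, grades, 5), 3), round(ndcg_at_k(run_b, grades, 5), 3))  # 1.0 vs ~0.48
```
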
    • content validity. #fail: measure reading comprehension only with sci-fi books. what? do the experimental units reflect and represent the elements of the domain under study? how? careful selection of the experimental units
    • content validity in IR: tasks closely resembling real-world settings; systems completely fulfilling real-user needs; the user component is heavy and difficult to control, so evaluate the system component instead [Cleverdon, SIGIR2001][Voorhees, CLEF2002]; the actual value of systems is really unknown [Marchionini, CACM2006]; sometimes they just do not work with real users [Turpin et al., SIGIR2001]
    • content validity in IR: documents resembling real-world settings; large and representative samples, especially for Machine Learning; careful selection of queries, diverse but reasonable [Voorhees, CLEF2002][Carterette et al., ECIR2009]; random selection is not good; some queries are better to differentiate bad systems [Guiver et al., TOIS2009][Robertson, ECIR2011]
    • convergent validity. #fail: measures of math skills not correlated with abstract thinking. what? do the results agree with others, theoretical or experimental, they should be related with? how? careful examination and confirmation of the relationship between the results and others supposedly related
    • convergent validity in IR: ground truth data is subjective, with differences across groups and over time; different results depending on who evaluates; absolute numbers change, but relative differences stand still for the most part [Voorhees, IP&M2000]; for large-scale evaluations or varying experience of assessors, differences do exist [Carterette et al., 2010]
    • convergent validity in IR: measures are precision- or recall-oriented and should therefore be correlated with each other, but they actually are not; reliability? [Kekäläinen, IP&M2005][Sakai, IP&M2007]; better correlated with others than with themselves! [Webber et al., SIGIR2008]; correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]; ranks, unconventional judgments, discounted gain… [Bennet et al., SIGIRForum2008][Järvelin et al., TOIS2002]
    • criterion validity. #fail: ask if the new drink is good instead of better than the old one. what? are the results correlated with those of other experiments already known to be valid? how? careful examination and confirmation of the correlation between our results and previous ones
    • criterion validity in IR: practical large-scale methodologies such as pooling [Buckley et al., SIGIR2004]: less effort, but same results? judgments by non-experts [Bailey et al., SIGIR2008]; crowdsourcing for low cost [Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]; estimate measures with fewer judgments [Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]; select what documents to judge, by informativeness [Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]; use no relevance judgments at all [Soboroff et al., SIGIR2001] (see the pooling sketch below)
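
As an illustration of the pooling idea mentioned above (a sketch under the usual depth-k convention, not code from any of the cited papers), the judgment pool for a query is the union of the top-k documents returned by the participating runs; everything outside the pool is treated as non-relevant, which is precisely where the incompleteness issues discussed in the following slides come from.

```python
def build_pool(runs, depth=100):
    """Depth-k pooling: only the union of the top-`depth` documents of each run
    is sent to the assessors; unpooled documents are assumed non-relevant.
    `runs` maps a system name to its ranked list of document ids for one query."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical toy example with three runs and a very shallow pool.
runs = {
    "system_A": ["d1", "d2", "d3", "d4"],
    "system_B": ["d2", "d5", "d1", "d6"],
    "system_C": ["d7", "d2", "d8", "d1"],
}
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd5', 'd7']
```
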
    • internal validity. #fail: measure the usefulness of Windows vs Linux vs iOS only with Apple employees. what? can the conclusions be rigorously drawn from the experiment alone and not other overlooked factors? how? careful identification and control of possible confounding variables and selection of design
    • internal validity in IR: inconsistency, performance depends on assessors [Voorhees, IP&M2000][Carterette et al., SIGIR2010]; incompleteness, performance depends on pools and system reinforcement [Zobel, SIGIR2008], affecting the reliability of measures and overall results [Sakai, JIR2008][Buckley et al., SIGIR2007]; especially for Machine Learning, train and test sets need the same characteristics in queries and docs; improvements on the same collections lead to overfitting [Voorhees, CLEF2002]; measures must be fair to all systems
    • external validity. #fail: study a cancer treatment mostly with teenage males. what? can the results be generalized to other populations and experimental settings? how? careful design and justification of sampling and selection methods
    • external validity in IR: the weakest point of IR Evaluation [Voorhees, CLEF2002]; large-scale is always incomplete [Zobel, SIGIR2008][Buckley et al., SIGIR2004]; test collections are themselves an evaluation result, but they become hardly reusable [Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
    • external validity in IR: systems perform differently with different collections, so cross-collection comparisons are unjustified; performance highly depends on test collection characteristics [Bodoff et al., SIGIR2007][Voorhees, CLEF2002]; interpretation of results must be in terms of pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]; do not claim anything about the state of the art based on a handful of experiments; baselines can be used to compare across collections [Armstrong et al., CIKM2009], but they must be meaningful, not random!
    • conclusion validity. #fail: more access to the Internet in China than in the US because of the larger total number of users. what? are the conclusions justified based on the results? how? careful selection of the measuring instruments and statistical methods used to draw grand conclusions
    • conclusion validity in IR: measures should be sensitive and stable [Buckley et al., SIGIR2000] and also powerful [Voorhees et al., SIGIR2002][Sakai, IP&M2007], with little effort [Sanderson et al., SIGIR2005], always bearing in mind the user model and the task
    • conclusion validity in IR: statistical methods to compare score distributions [Smucker et al., CIKM2007][Webber et al., CIKM2008]; correct interpretation of the statistics; hypothesis testing is troublesome: statistical significance ≠ practical significance; increasing #queries (sample size) increases the power to detect ever smaller differences (effect size); eventually, everything is statistically significant (see the simulation sketch below)
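
The power issue is easy to see with a few lines of simulation (made-up numbers, assuming numpy and scipy are available; this is not an analysis from the cited papers): keep the true per-query difference between two systems fixed at a practically negligible 0.005 and watch the paired t-test declare it significant once enough queries are used.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)

def paired_pvalue(n_queries, mean_diff=0.005, noise=0.1):
    """Simulate per-query scores of two systems whose true mean difference is
    a tiny 0.005, and return the p-value of a paired t-test over the queries."""
    scores_a = rng.uniform(0.2, 0.8, n_queries)                    # per-query scores of system A
    scores_b = scores_a + rng.normal(mean_diff, noise, n_queries)  # system B, barely better
    return ttest_rel(scores_b, scores_a).pvalue

for n in (25, 100, 1000, 10000):
    print(n, round(paired_pvalue(n), 4))
# With enough queries the p-value eventually drops below 0.05, even though a
# difference of 0.005 in the evaluation measure has no practical importance.
```
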
    • challenges (Picture by Brian Snelson)
    • IR Research & Development Cycle
    • MIR evaluation practices do not allow us to complete this cycle
    • IR Research & Development Cycle: loose definition of task intent and user model; realistic data
    • IR Research & Development Cycle: collections are too small and/or biased; lack of realistic, controlled public collections; standard formats and evaluation software to minimize bugs; can't replicate results, often leading to wrong conclusions; private, undescribed and unanalyzed collections emerge
    • IR Research & Development Cycle: undocumented measures; no accepted evaluation software; lack of baselines as lower bound (random is not a baseline!); proper statistics; correct interpretation of statistics
    • IR Research & Development Cycle: raw musical material unknown; undocumented queries and/or documents; go back to private collections: overfitting!
    • IR Research & Development Cycle: collections can't be reused; blind improvements; go back to private collections: overfitting!
    • Picture by Donna Grayson
    • collections: large, heterogeneous and controlled; not a hard endeavour, except for the damn copyright; the Million Song Dataset! still problematic (new features? actual music?); standardize collections across tasks for a better understanding and use of improvements
    • raw music data: essential for the Learning and Improvement phases; use copyright-free data, Jamendo!; study possible biases; reconsider artificial material
    • evaluation model: let teams run their own algorithms (needs public collections); relief for IMIRSEL, and promotes wider participation; successfully used for 20 years in Text IR venues; adopted by MusiCLEF; the only viable alternative in the long run; MIREX-DIY platforms still don't allow full completion of the IR Research & Development Cycle
    • organization: IMIRSEL plans, schedules and runs everything; add a 2nd tier of task-specific organizers for logistics, planning, evaluation, troubleshooting…; the format of large forums like TREC and CLEF; smooth the process and develop tasks that really push the limits of the state of the art
    • overview papers: every year, by task organizers; detail the evaluation process, data and results; discussion to boost Interpretation and Learning; a perfect wrap-up for team papers, which rarely discuss results and many of which are not even drafted
    • specific methodologies: MIR has unique methodologies and measures; meta-evaluate them: analyze and improve; human effects on the evaluation; user satisfaction
    • standard evaluation software: bugs are inevitable; open the evaluation software to everybody; gain reliability; speed up the development process; serve as documentation for newcomers; promote standardization of formats
    • baselines: help measure the overall progress of the field; standard formats + standard software + public controlled collections + raw music + task-specific organization; measure the state of the art
    • commitment: we need to acknowledge the current problems; MIREX should not only be a place to evaluate and improve systems, but also a place to meta-evaluate and improve how we evaluate, and a place to design tasks that challenge researchers; analyze our evaluation methodologies
    • we all need to start questioning evaluation practices
    • it's worth it (Picture by Brian Snelson)
    • we all need to start questioning evaluation practices: it's not that everything we do is wrong…
    • we all need to start questioning evaluation practices: it's not that everything we do is wrong… it's that we don't know it!