Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain
The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts were set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has enormously gained from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of the analysis of experimental validity, this paper presents a survey of past meta-evaluation work in the context of Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR cycle, pointing out directions for further research and proposals in this line.

    Presentation Transcript

    • Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain. Julián Urbano (@julian_urbano), University Carlos III of Madrid. ISMIR 2011, Miami, USA, October 26th. Picture by Daniel Ray.
    • Picture by Bill Mill
    • current evaluation practices hinder the proper development of Music IR
    • we lack meta-evaluation studies; we can't complete the IR research & development cycle
    • how did we get here? (Picture by NASA History Office)
    • [Timeline, 1960-2011, of Text IR evaluation efforts: Cranfield 2 (1962-1966), the basis; MEDLARS (1966-1967), users; SMART (1961-1995); TREC (1992-today), large-scale collections; NTCIR (1999-today) and CLEF (2000-today), multi-language & multi-modal]
    • ISMIR 2001 resolution on the need to create standardized MIR test collections, tasks and evaluation metrics for MIR research and development. [The timeline adds ISMIR (2000-today) and 3 workshops (2002-2003): The MIR/MDL Evaluation Project]
    • follow the steps of the Text IR folks, but carefully: not everything applies to music. [The timeline adds MIREX (2005-today), with >1200 runs!]
    • are we done already? evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR; positive impact on MIR
    • some good practices were inherited from the early Text IR efforts, but a lot of things have happened there since: "not everything applies" to music, but much of it does!
    • we still have a very long way to go
    • evaluation (Picture by Official U.S. Navy Imagery)
    • Cranfield Paradigm: Task, User Model
    • Experimental Validity: how well an experiment meets the well-grounded requirements of the scientific method; do the results fairly and actually assess what was intended? Meta-Evaluation: analyze the validity of IR Evaluation experiments
    • [Table: evaluation components involved in each type of experimental validity. Rows: Construct, Content, Convergent, Criterion, Internal, External, Conclusion; columns: Task, Ground truth, User model, Documents, Measures, Systems, Queries]
    • experimental failures
    • construct validity. #fail: measure the quality of a Web search engine by the number of visits. what? do the variables of the experiment correspond to the theoretical meaning of the concept they purport to measure? how? thorough selection and justification of the variables used
    • construct validity in IR: effectiveness measures and their user model [Carterette, SIGIR2011]; set-based measures do not resemble real users [Sanderson et al., SIGIR2010]; rank-based measures are better [Järvelin et al., TOIS2002]; graded relevance is better [Voorhees, SIGIR2001][Kekäläinen, IP&M2005]; other forms of ground truth are better [Bennet et al., SIGIRForum2008] (see the sketch below)
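
To make the set-based vs. rank-based contrast concrete, here is a minimal Python sketch (not part of the slides; the gain and discount functions are just one common choice in the discounted-gain family, and all document ids and grades are made up):

```python
import math

def precision_at_k(ranking, relevant, k=10):
    """Set-based, binary: fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def dcg_at_k(ranking, grades, k=10):
    """Rank-based, graded: discounted cumulative gain with the common
    (2^rel - 1) / log2(rank + 1) formulation."""
    return sum((2 ** grades.get(d, 0) - 1) / math.log2(i + 2)
               for i, d in enumerate(ranking[:k]))

def ndcg_at_k(ranking, grades, k=10):
    """Normalize by the DCG of an ideal ranking, so scores are comparable across queries."""
    ideal = sorted(grades, key=grades.get, reverse=True)
    idcg = dcg_at_k(ideal, grades, k)
    return dcg_at_k(ranking, grades, k) / idcg if idcg > 0 else 0.0

# Two hypothetical systems retrieve the same set of documents, so the set-based
# measure cannot tell them apart, but the rank-based graded measure can.
grades = {"d1": 3, "d2": 2, "d3": 1}              # graded ground truth
run_a = ["d1", "d2", "d3", "d9", "d8"]            # relevant documents ranked first
run_b = ["d9", "d8", "d3", "d2", "d1"]            # same documents, relevant ones last
print(precision_at_k(run_a, grades, 5), precision_at_k(run_b, grades, 5))  # 0.6 0.6
print(round(ndcg_at_k(run_a, grades, 5), 3), round(ndcg_at_k(run_b, grades, 5), 3))  # 1.0 vs ~0.48
```
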
    • content validity. #fail: measure reading comprehension only with sci-fi books. what? do the experimental units reflect and represent the elements of the domain under study? how? careful selection of the experimental units
    • content validity in IR: tasks closely resembling real-world settings; systems completely fulfilling real-user needs; the user component is heavy and difficult to control, so evaluate the system component instead [Cleverdon, SIGIR2001][Voorhees, CLEF2002]; the actual value of systems is really unknown [Marchionini, CACM2006]; sometimes they just do not work with real users [Turpin et al., SIGIR2001]
    • content validity in IR: documents resembling real-world settings; large and representative samples, especially for Machine Learning; careful selection of queries, diverse but reasonable [Voorhees, CLEF2002][Carterette et al., ECIR2009]; random selection is not good; some queries are better to differentiate bad systems [Guiver et al., TOIS2009][Robertson, ECIR2011]
    • convergent validity. #fail: measures of math skills not correlated with abstract thinking. what? do the results agree with others, theoretical or experimental, they should be related with? how? careful examination and confirmation of the relationship between the results and others supposedly related
    • convergent validity in IR: ground truth data is subjective, with differences across groups and over time; different results depending on who evaluates; absolute numbers change, but relative differences stand still for the most part [Voorhees, IP&M2000]; for large-scale evaluations or varying experience of assessors, differences do exist [Carterette et al., 2010]
    • convergent validity in IR: measures are precision- or recall-oriented and should therefore be correlated with each other, but they actually are not; reliability? [Kekäläinen, IP&M2005][Sakai, IP&M2007]; better correlated with others than with themselves! [Webber et al., SIGIR2008]; correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]; ranks, unconventional judgments, discounted gain… [Bennet et al., SIGIRForum2008][Järvelin et al., TOIS2002]
    • criterion validity. #fail: ask if the new drink is good instead of better than the old one. what? are the results correlated with those of other experiments already known to be valid? how? careful examination and confirmation of the correlation between our results and previous ones
    • criterion validity in IR: practical large-scale methodologies such as pooling [Buckley et al., SIGIR2004]: less effort, but same results? judgments by non-experts [Bailey et al., SIGIR2008]; crowdsourcing for low cost [Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]; estimate measures with fewer judgments [Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]; select what documents to judge, by informativeness [Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]; use no relevance judgments at all [Soboroff et al., SIGIR2001] (see the pooling sketch below)
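
As an illustration of the pooling idea mentioned above (a sketch under the usual depth-k convention, not code from any of the cited papers), the judgment pool for a query is the union of the top-k documents returned by the participating runs; everything outside the pool is treated as non-relevant, which is precisely where the incompleteness issues discussed in the following slides come from.

```python
def build_pool(runs, depth=100):
    """Depth-k pooling: only the union of the top-`depth` documents of each run
    is sent to the assessors; unpooled documents are assumed non-relevant.
    `runs` maps a system name to its ranked list of document ids for one query."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical toy example with three runs and a very shallow pool.
runs = {
    "system_A": ["d1", "d2", "d3", "d4"],
    "system_B": ["d2", "d5", "d1", "d6"],
    "system_C": ["d7", "d2", "d8", "d1"],
}
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd5', 'd7']
```
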
    • internal validity. #fail: measure the usefulness of Windows vs Linux vs iOS only with Apple employees. what? can the conclusions be rigorously drawn from the experiment alone and not other overlooked factors? how? careful identification and control of possible confounding variables and selection of design
    • internal validity in IR: inconsistency, performance depends on assessors [Voorhees, IP&M2000][Carterette et al., SIGIR2010]; incompleteness, performance depends on pools and system reinforcement [Zobel, SIGIR2008], affecting the reliability of measures and overall results [Sakai, JIR2008][Buckley et al., SIGIR2007]; especially for Machine Learning, train and test sets need the same characteristics in queries and docs; improvements on the same collections lead to overfitting [Voorhees, CLEF2002]; measures must be fair to all systems
    • external validity. #fail: study a cancer treatment mostly with teenage males. what? can the results be generalized to other populations and experimental settings? how? careful design and justification of sampling and selection methods
    • external validity in IR: the weakest point of IR Evaluation [Voorhees, CLEF2002]; large-scale is always incomplete [Zobel, SIGIR2008][Buckley et al., SIGIR2004]; test collections are themselves an evaluation result, but they become hardly reusable [Carterette et al., WSDM2010][Carterette et al., SIGIR2010]
    • external validity in IR: systems perform differently with different collections, so cross-collection comparisons are unjustified; performance highly depends on test collection characteristics [Bodoff et al., SIGIR2007][Voorhees, CLEF2002]; interpretation of results must be in terms of pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]; do not claim anything about the state of the art based on a handful of experiments; baselines can be used to compare across collections [Armstrong et al., CIKM2009], but they must be meaningful, not random!
    • conclusion validity. #fail: more access to the Internet in China than in the US because of the larger total number of users. what? are the conclusions justified based on the results? how? careful selection of the measuring instruments and statistical methods used to draw grand conclusions
    • conclusion validity in IR: measures should be sensitive and stable [Buckley et al., SIGIR2000] and also powerful [Voorhees et al., SIGIR2002][Sakai, IP&M2007], with little effort [Sanderson et al., SIGIR2005], always bearing in mind the user model and the task
    • conclusion validity in IR: statistical methods to compare score distributions [Smucker et al., CIKM2007][Webber et al., CIKM2008]; correct interpretation of the statistics; hypothesis testing is troublesome: statistical significance ≠ practical significance; increasing #queries (sample size) increases the power to detect ever smaller differences (effect size); eventually, everything is statistically significant (see the simulation sketch below)
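
The power issue is easy to see with a few lines of simulation (made-up numbers, assuming numpy and scipy are available; this is not an analysis from the cited papers): keep the true per-query difference between two systems fixed at a practically negligible 0.005 and watch the paired t-test declare it significant once enough queries are used.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)

def paired_pvalue(n_queries, mean_diff=0.005, noise=0.1):
    """Simulate per-query scores of two systems whose true mean difference is
    a tiny 0.005, and return the p-value of a paired t-test over the queries."""
    scores_a = rng.uniform(0.2, 0.8, n_queries)                    # per-query scores of system A
    scores_b = scores_a + rng.normal(mean_diff, noise, n_queries)  # system B, barely better
    return ttest_rel(scores_b, scores_a).pvalue

for n in (25, 100, 1000, 10000):
    print(n, round(paired_pvalue(n), 4))
# With enough queries the p-value eventually drops below 0.05, even though a
# difference of 0.005 in the evaluation measure has no practical importance.
```
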
    • challenges (Picture by Brian Snelson)
    • IR Research & Development Cycle
    • MIR evaluation practices do not allow us to complete this cycle
    • IR Research & Development Cycle: loose definition of task intent and user model; realistic data
    • IR Research & Development Cycle: collections are too small and/or biased; lack of realistic, controlled public collections; standard formats and evaluation software to minimize bugs; can't replicate results, often leading to wrong conclusions; private, undescribed and unanalyzed collections emerge
    • IR Research & Development Cycle: undocumented measures; no accepted evaluation software; lack of baselines as lower bound (random is not a baseline!); proper statistics; correct interpretation of statistics
    • IR Research & Development Cycle: raw musical material unknown; undocumented queries and/or documents; go back to private collections: overfitting!
    • IR Research & Development Cycle: collections can't be reused; blind improvements; go back to private collections: overfitting!
    • Picture by Donna Grayson
    • collections: large, heterogeneous and controlled; not a hard endeavour, except for the damn copyright; the Million Song Dataset! still problematic (new features? actual music?); standardize collections across tasks for a better understanding and use of improvements
    • raw music data: essential for the Learning and Improvement phases; use copyright-free data, Jamendo!; study possible biases; reconsider artificial material
    • evaluation model: let teams run their own algorithms (needs public collections); relief for IMIRSEL, and promotes wider participation; successfully used for 20 years in Text IR venues; adopted by MusiCLEF; the only viable alternative in the long run; MIREX-DIY platforms still don't allow full completion of the IR Research & Development Cycle
    • organization: IMIRSEL plans, schedules and runs everything; add a 2nd tier of task-specific organizers for logistics, planning, evaluation, troubleshooting…; the format of large forums like TREC and CLEF; smooth the process and develop tasks that really push the limits of the state of the art
    • overview papers: every year, by task organizers; detail the evaluation process, data and results; discussion to boost Interpretation and Learning; a perfect wrap-up for team papers, which rarely discuss results and many of which are not even drafted
    • specific methodologies: MIR has unique methodologies and measures; meta-evaluate them: analyze and improve; human effects on the evaluation; user satisfaction
    • standard evaluation software: bugs are inevitable; open the evaluation software to everybody; gain reliability; speed up the development process; serve as documentation for newcomers; promote standardization of formats
    • baselines: help measure the overall progress of the field; standard formats + standard software + public controlled collections + raw music + task-specific organization; measure the state of the art
    • commitment: we need to acknowledge the current problems; MIREX should not only be a place to evaluate and improve systems, but also a place to meta-evaluate and improve how we evaluate, and a place to design tasks that challenge researchers; analyze our evaluation methodologies
    • we all need to start questioning evaluation practices
    • it's worth it (Picture by Brian Snelson)
    • we all need to start questioning evaluation practices: it's not that everything we do is wrong…
    • we all need to start questioning evaluation practices: it's not that everything we do is wrong… it's that we don't know it!