Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain

The Music Information Retrieval field has acknowledged the need for rigorous scientific evaluations for some time now. Several efforts have been set out to develop and provide the necessary infrastructure, technology and methodologies to carry out these evaluations, out of which the annual Music Information Retrieval Evaluation eXchange emerged. The community as a whole has gained enormously from this evaluation forum, but very little attention has been paid to reliability and correctness issues. From the standpoint of experimental validity, this paper surveys past meta-evaluation work in the context of Text Information Retrieval, arguing that the music community still needs to address various issues concerning the evaluation of music systems and the IR research and development cycle, and pointing out directions for further research and proposals along this line.

  1. Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Music Domain. Julián Urbano (@julian_urbano), University Carlos III of Madrid. ISMIR 2011, Miami, USA · October 26th. Picture by Daniel Ray
  2. Picture by Bill Mill
  3. current evaluation practices hinder the proper development of Music IR
  4. we lack meta-evaluation studies; we can’t complete the IR research & development cycle
  5. how did we get here? Picture by NASA History Office
  6. Timeline of Text IR evaluation: Cranfield 2 (1962-1966) and MEDLARS (1966-1967) laid the basis, with users; SMART (1961-1995); TREC (1992-today) brought large-scale collections; NTCIR (1999-today) and CLEF (2000-today) added multi-language & multi-modal evaluation.
  7. ISMIR 2001 resolution on the need to create standardized MIR test collections: collections, tasks and evaluation metrics for MIR research and development. On the timeline: ISMIR (2000-today) and 3 workshops (2002-2003), The MIR/MDL Evaluation Project.
  8. Follow in the steps of the Text IR folks, but carefully: not everything applies to music. On the timeline: MIREX (2005-today), with >1200 runs!
  9. are we done already? Evaluation is not easy: nearly 2 decades of Meta-Evaluation in Text IR. Positive impact of MIREX on MIR.
  10. Some good practices were inherited from Text IR, and a lot of things have happened there since: “not everything applies” to music, but much of it does!
  11. we still have a very long way to go
  12. evaluation. Picture by Official U.S. Navy Imagery
  13. Cranfield Paradigm: Task, User Model
  14. Experimental Validity: how well an experiment meets the well-grounded requirements of the scientific method; do the results fairly and actually assess what was intended? Meta-Evaluation: analyze the validity of IR Evaluation experiments.
  15. [Table mapping each type of validity (Construct, Content, Convergent, Criterion, Internal, External, Conclusion) to the evaluation components involved: Task, User model, Queries, Documents, Ground truth, Measures, Systems.]
  16. experimental failures
  17. construct validity #fail: measure the quality of a Web search engine by the number of visits. what? do the variables of the experiment correspond to the theoretical meaning of the concept they purport to measure? how? thorough selection and justification of the variables used.
  18. construct validity in IR: effectiveness measures and their user model [Carterette, SIGIR2011]; set-based measures do not resemble real users [Sanderson et al., SIGIR2010]; rank-based measures are better [Järvelin et al., TOIS2002]; graded relevance is better [Voorhees, SIGIR2001][Kekäläinen, IP&M2005]; other forms of ground truth are better [Bennett et al., SIGIRForum2008].
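Since slide 18 argues for rank-based measures with graded relevance, a minimal sketch of nDCG (the measure behind [Järvelin et al., TOIS2002]) may help; the function names, toy gain values and log2 discount are assumptions of this illustration, not taken from the slides.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: graded gains discounted by log2 of the rank
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranked_gains, all_gains):
    # Normalize by the ideal DCG: the same gains sorted best-first
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

# Graded relevance (0-3) of the documents one system returned, in rank order,
# against the full set of known gains for the query
print(ndcg([3, 0, 2, 1], all_gains=[3, 2, 1, 0, 0]))  # about 0.93
```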
  19. content validity #fail: measure reading comprehension only with sci-fi books. what? do the experimental units reflect and represent the elements of the domain under study? how? careful selection of the experimental units.
  20. content validity in IR: tasks closely resembling real-world settings; systems completely fulfilling real-user needs; heavy user component, difficult to control, so evaluate the system component instead [Cleverdon, SIGIR2001][Voorhees, CLEF2002]; the actual value of systems is really unknown [Marchionini, CACM2006]; sometimes they just do not work with real users [Turpin et al., SIGIR2001].
  21. content validity in IR: documents resembling real-world settings; large and representative samples, especially for Machine Learning; careful selection of queries, diverse but reasonable [Voorhees, CLEF2002][Carterette et al., ECIR2009]; random selection is not good; some queries are better to differentiate bad systems [Guiver et al., TOIS2009][Robertson, ECIR2011].
  22. convergent validity #fail: measures of math skills not correlated with abstract thinking. what? do the results agree with others, theoretical or experimental, they should be related with? how? careful examination and confirmation of the relationship between the results and others supposedly related.
  23. convergent validity in IR: ground truth data is subjective; differences across groups and over time; different results depending on who evaluates; absolute numbers change, but relative differences hold for the most part [Voorhees, IP&M2000]; for large-scale evaluations or varying experience of assessors, differences do exist [Carterette et al., 2010].
  24. convergent validity in IR: measures are precision- or recall-oriented; they should therefore be correlated with each other, but they actually are not. reliability? [Kekäläinen, IP&M2005][Sakai, IP&M2007]; better correlated with others than with themselves! [Webber et al., SIGIR2008]; correlation with user satisfaction in the task [Sanderson et al., SIGIR2010]; ranks, unconventional judgments, discounted gain… [Bennett et al., SIGIRForum2008][Järvelin et al., TOIS2002].
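To make the convergent-validity check on slide 24 concrete: correlate how two supposedly related measures rank the same systems. A hedged sketch, assuming scipy and made-up per-system scores:

```python
from scipy.stats import kendalltau

# Hypothetical scores of five systems under two measures that, in theory,
# should agree on which systems are better
map_scores  = [0.31, 0.27, 0.42, 0.25, 0.38]  # a precision-oriented measure
ndcg_scores = [0.55, 0.49, 0.61, 0.52, 0.58]  # a gain-based measure

tau, p_value = kendalltau(map_scores, ndcg_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
# A low tau between measures that should converge signals a validity problem.
```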
  25. criterion validity #fail: ask if the new drink is good instead of better than the old one. what? are the results correlated with those of other experiments already known to be valid? how? careful examination and confirmation of the correlation between our results and previous ones.
  26. criterion validity in IR: practical large-scale methodologies: pooling [Buckley et al., SIGIR2004]; less effort, but same results? judgments by non-experts [Bailey et al., SIGIR2008]; crowdsourcing for low cost [Alonso et al., SIGIR2009][Carvalho et al., SIGIRForum2010]; estimate measures with fewer judgments [Yilmaz et al., CIKM2006][Yilmaz et al., SIGIR2008]; select what documents to judge, by informativeness [Carterette et al., SIGIR2006][Carterette et al., SIGIR2007]; use no relevance judgments at all [Soboroff et al., SIGIR2001].
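As a hedged illustration of the pooling methodology mentioned above ([Buckley et al., SIGIR2004]): judge only the union of each run's top-k documents. The data structures and pool depth below are assumptions for the sketch, not from the slides:

```python
def pool(runs, depth):
    # Union of the top-`depth` documents across all submitted runs;
    # only pooled documents get relevance judgments, the rest are
    # assumed non-relevant (the incompleteness discussed on slide 28)
    pooled = set()
    for ranked_docs in runs:
        pooled.update(ranked_docs[:depth])
    return pooled

# Two hypothetical ranked runs for the same query
runs = [["d3", "d1", "d7", "d2"],
        ["d1", "d5", "d3", "d9"]]
print(sorted(pool(runs, depth=2)))  # ['d1', 'd3', 'd5'] get judged
```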
  27. internal validity #fail: measure usefulness of Windows vs Linux vs iOS only with Apple employees. what? can the conclusions be rigorously drawn from the experiment alone and not other overlooked factors? how? careful identification and control of possible confounding variables and selection of design.
  28. internal validity in IR: inconsistency: performance depends on assessors [Voorhees, IP&M2000][Carterette et al., SIGIR2010]; incompleteness: performance depends on pools; system reinforcement [Zobel, SIGIR2008]; affects reliability of measures and overall results [Sakai, JIR2008][Buckley et al., SIGIR2007]; especially for Machine Learning, train and test sets need the same characteristics in queries and docs; improvements on the same collections: overfitting [Voorhees, CLEF2002]; measures must be fair to all systems.
  29. external validity #fail: study cancer treatment mostly with teenage males. what? can the results be generalized to other populations and experimental settings? how? careful design and justification of sampling and selection methods.
  30. external validity in IR: the weakest point of IR Evaluation [Voorhees, CLEF2002]; large-scale is always incomplete [Zobel, SIGIR2008][Buckley et al., SIGIR2004]; test collections are themselves an evaluation result, but they become hardly reusable [Carterette et al., WSDM2010][Carterette et al., SIGIR2010].
  31. external validity in IR: systems perform differently with different collections; cross-collection comparisons are unjustified; performance highly depends on test collection characteristics [Bodoff et al., SIGIR2007][Voorhees, CLEF2002]; interpretation of results must be in terms of pairwise comparisons, not absolute numbers [Voorhees, CLEF2002]; do not claim anything about the state of the art based on a handful of experiments; baselines can be used to compare across collections [Armstrong et al., CIKM2009]: meaningful, not random!
  32. conclusion validity #fail: more access to the Internet in China than in the US, because of the larger total number of users. what? are the conclusions justified based on the results? how? careful selection of the measuring instruments and statistical methods used to draw grand conclusions.
  33. conclusion validity in IR: measures should be sensitive and stable [Buckley et al., SIGIR2000], and also powerful [Voorhees et al., SIGIR2002][Sakai, IP&M2007], with little effort [Sanderson et al., SIGIR2005]; always bearing in mind the user model and the task.
  34. conclusion validity in IR: statistical methods to compare score distributions [Smucker et al., CIKM2007][Webber et al., CIKM2008]; correct interpretation of the statistics; hypothesis testing is troublesome: statistical significance ≠ practical significance; increasing #queries (sample size) increases power to detect ever smaller differences (effect size); eventually, everything is statistically significant.
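To ground slide 34: a hedged sketch of a paired randomization test over per-query scores, in the spirit of the tests compared by [Smucker et al., CIKM2007]; the scores, seed and trial count are invented for illustration:

```python
import random

def randomization_test(a, b, trials=10_000, seed=42):
    # Paired randomization test: under H0 the per-query labels A/B are
    # exchangeable, so randomly swap each pair and count how often the
    # absolute mean difference is at least as large as the observed one
    rng = random.Random(seed)
    observed = abs(sum(x - y for x, y in zip(a, b)) / len(a))
    hits = 0
    for _ in range(trials):
        diff = sum((x - y) if rng.random() < 0.5 else (y - x)
                   for x, y in zip(a, b))
        if abs(diff / len(a)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical per-query AP scores of two systems on the same queries
sys_a = [0.41, 0.22, 0.35, 0.50, 0.28, 0.44]
sys_b = [0.38, 0.25, 0.30, 0.47, 0.27, 0.40]
print(f"p = {randomization_test(sys_a, sys_b):.3f}")
# A small p-value alone is not practical significance: report the
# effect size (the mean difference) alongside it.
```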
  35. challenges. Picture by Brian Snelson
  36. IR Research & Development Cycle (cycle diagram, built up over slides 36-40)
  41. MIR evaluation practices do not allow us to complete this cycle
  43. IR Research & Development Cycle: loose definition of task intent and user model · realistic data
  45. IR Research & Development Cycle: collections are too small and/or biased; lack of realistic, controlled, public collections; no standard formats and evaluation software to minimize bugs; can’t replicate results, often leading to wrong conclusions; private, undescribed and unanalyzed collections emerge.
  47. IR Research & Development Cycle: undocumented measures and no accepted evaluation software; lack of baselines as lower bound (random is not a baseline!); need for proper statistics and correct interpretation of statistics.
  49. IR Research & Development Cycle: raw musical material unknown; undocumented queries and/or documents; go back to private collections: overfitting!
  51. IR Research & Development Cycle: collections can’t be reused; blind improvements; go back to private collections: overfitting!
  52. Picture by Donna Grayson
  53. collections: large, heterogeneous and controlled; not a hard endeavour, except for the damn copyright; Million Song Dataset! still problematic (new features? actual music?); standardize collections across tasks, for better understanding and use of improvements.
  54. raw music data: essential for Learning and Improvement phases; use copyright-free data (Jamendo!); study possible biases; reconsider artificial material.
  55. evaluation model: let teams run their own algorithms (needs public collections); relief for IMIRSEL and promotes wider participation; successfully used for 20 years in Text IR venues, adopted by MusiCLEF; the only viable alternative in the long run; MIREX-DIY platforms still don’t allow full completion of the IR Research & Development Cycle.
  56. organization: IMIRSEL plans, schedules and runs everything; add a 2nd tier of task-specific organizers for logistics, planning, evaluation, troubleshooting… the format of large forums like TREC and CLEF; smooth the process and develop tasks that really push the limits of the state of the art.
  57. overview papers: every year, by task organizers; detail the evaluation process, data and results; discussion to boost Interpretation and Learning; the perfect wrap-up for team papers, which rarely discuss results and many of which are not even drafted.
  58. specific methodologies: MIR has unique methodologies and measures; meta-evaluate: analyze and improve; human effects on the evaluation; user satisfaction.
  59. standard evaluation software: bugs are inevitable; open the evaluation software to everybody; gain reliability; speed up the development process; serve as documentation for newcomers; promote standardization of formats.
  60. baselines: help measure the overall progress of the field; standard formats + standard software + public controlled collections + raw music + task-specific organization, to measure the state of the art.
  61. commitment: we need to acknowledge the current problems; MIREX should not only be a place to evaluate and improve systems, but also a place to meta-evaluate and improve how we evaluate, and a place to design tasks that challenge researchers and to analyze our evaluation methodologies.
  62. we all need to start questioning evaluation practices
  63. it’s worth it. Picture by Brian Snelson
  64. we all need to start questioning evaluation practices: it’s not that everything we do is wrong…
  65. we all need to start questioning evaluation practices: it’s not that everything we do is wrong… it’s that we don’t know it!