Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks


Music similarity tasks, where musical pieces similar to a query should be retrieved, are quite troublesome to evaluate. Ground truths based on partially ordered lists were developed to cope with problems regarding relevance judgment, but they require so much manpower to generate that the official MIREX evaluations had to turn to more affordable alternatives. However, in-house evaluations keep using these partially ordered lists because they are still more suitable for similarity tasks. In this paper we propose a cheaper alternative to generate these lists by using crowdsourcing to gather music preference judgments. We show that our method produces lists very similar to the original ones, while dealing with some defects of the original methodology. With this study, we show that crowdsourcing is a perfectly viable alternative to evaluate music systems without the need for experts.

  1. Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
     Julián Urbano, Jorge Morato, Mónica Marrero and Diego Martín
     http://julian-urbano.info · Twitter: @julian_urbano
     SIGIR CSE 2010 · Geneva, Switzerland · July 23rd
  2. Outline
     • Introduction
     • Motivation
     • Alternative Methodology
     • Crowdsourcing Preferences
     • Results
     • Conclusions and Future Work
  3. Evaluation Experiments
     • Essential for Information Retrieval [Voorhees, 2002]
     • Traditionally followed the Cranfield paradigm
       ▫ Relevance judgments are the most important part of test collections (and the most expensive)
     • In the music domain evaluation has not been taken too seriously until very recently
       ▫ MIREX appeared in 2005 [Downie et al., 2010]
       ▫ Additional problems with the construction and maintenance of test collections [Downie, 2004]
  4. Music Similarity Tasks
     • Given a music piece (i.e. the query) return a ranked list of other pieces similar to it
       ▫ Actual music contents, forget the metadata!
     • It comes in two flavors
       ▫ Symbolic Melodic Similarity (SMS)
       ▫ Audio Music Similarity (AMS)
     • It is inherently more complex to evaluate
       ▫ Relevance judgments are very problematic
  5. Relevance (Similarity) Judgments
     • Relevance is usually considered on a fixed scale
       ▫ Relevant, not relevant, very relevant…
     • For music similarity tasks relevance is rather continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007]
       ▫ Single melodic changes are not perceived to change the overall melody
         ▪ Move a note up or down in pitch, shorten it, etc.
       ▫ But the similarity is weaker as more changes apply
     • Where is the line between relevance levels?
  6. Partially Ordered Lists
     • The relevance of a document is implied by its position in a partially ordered list [Typke et al., 2005]
       ▫ Does not need any prefixed relevance scale
     • Ordered groups of documents equally relevant
       ▫ Have to keep the order of the groups
       ▫ Allow permutations within the same group
     • Assessors only need to be sure that any pair of documents is ordered properly
  7. Partially Ordered Lists (II)
  8. Partially Ordered Lists (and III)
     • Used in the first edition of MIREX in 2005 [Downie et al., 2005]
     • Widely accepted by the MIR community to report new developments [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006]
     • MIREX was forced to move to traditional level-based relevance in 2006
       ▫ Partially ordered lists are expensive
       ▫ And have some inconsistencies
  9. Expensiveness
     • The ground truth for just 11 queries took 35 music experts for 2 hours [Typke et al., 2005]
       ▫ Only 11 of them had time to work on all 11 queries
       ▫ This exceeds MIREX’s resources for a single task
     • MIREX had to move to level-based relevance
       ▫ BROAD: Not Similar, Somewhat Similar, Very Similar
       ▫ FINE: numerical, from 0 to 10 with one decimal digit
     • Problems with assessor consistency came up
  10. Issues with Assessor Consistency
     • The line between levels is certainly unclear [Jones et al., 2007][Downie et al., 2010]
  11. Original Methodology
     • Go back to partially ordered lists
       ▫ Filter the collection
       ▫ Have the experts rank the candidates
       ▫ Arrange the candidates by rank
       ▫ Aggregate candidates whose ranks are not significantly different (Mann-Whitney U)
     • There are known odd results and inconsistencies [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b]
       ▫ Disregard changes that do not alter the actual perception, such as clef or key and time signature
       ▫ Something like changing the language of a text and using synonyms [Urbano et al., 2010a]
  12. Inconsistencies due to Ranking
  13. Alternative Methodology
     • Minimize inconsistencies [Urbano et al., 2010b]
     • Cheapen the whole process
     • Reasonable Person hypothesis [Downie, 2004]
       ▫ With crowdsourcing (finally)
     • Use Amazon Mechanical Turk
       ▫ Get rid of experts [Alonso et al., 2008][Alonso et al., 2009]
       ▫ Work with “reasonable turkers”
       ▫ Explore other domains to apply crowdsourcing
  14. Equally Relevant Documents
     • Experts were forced to give totally ordered lists
     • One would expect ranks to randomly average out
       ▫ Half the experts prefer one document
       ▫ Half the experts prefer the other one
     • That is hardly the case
       ▫ Do not expect similar ranks if the experts cannot give similar ranks in the first place
  15. Give Audio instead of Images
     • Experts may be guided by the images, not the music
       ▫ Some irrelevant changes in the image can deceive
     • No music expertise should be needed
       ▫ Reasonable person turker hypothesis
  16. Preference Judgments
     • In their heads, experts actually do preference judgments
       ▫ Similar to a binary search
       ▫ Accelerates assessor fatigue as the list grows
     • Already noted for level-based relevance
       ▫ Go back and re-judge [Downie et al., 2010][Jones et al., 2007]
       ▫ Overlapping between BROAD and FINE scores
     • Change the relevance assessment question
       ▫ Which is more similar to Q: A or B? [Carterette et al., 2008]
  17. Preference Judgments (II)
     • Better than traditional level-based relevance
       ▫ Inter-assessor agreement
       ▫ Time to answer
     • In our case, three-point preferences
       ▫ A < B (A is more similar)
       ▫ A = B (they are equally similar/dissimilar)
       ▫ A > B (B is more similar)
  18. Preference Judgments (and III)
     • Use a modified QuickSort algorithm to sort documents in a partially ordered list
       ▫ Do not need all O(n²) judgments, but O(n·log n)
     • Figure legend (markers lost in extraction): one mark shows the current pivot on the segment, another shows a document that has been pivot already
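The idea on slide 18 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `partial_order_sort`, the `prefer` callback, and the choice of the first document as pivot are all assumptions, and the sketch assumes preferences are transitive. Documents judged equal to the pivot join its group instead of being ordered arbitrarily.

```python
def partial_order_sort(docs, prefer):
    """Sort docs into ordered groups of equally similar documents.

    prefer(a, b) is a hypothetical oracle following the slides' convention:
    -1 if a is more similar to the query (A < B), 0 if equally similar
    (A = B), +1 if b is more similar (A > B).
    """
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]   # pivot choice is an assumption
    more, same, less = [], [pivot], []
    for doc in rest:
        outcome = prefer(doc, pivot)  # one preference judgment per document
        if outcome < 0:
            more.append(doc)          # more similar than the pivot
        elif outcome > 0:
            less.append(doc)          # less similar than the pivot
        else:
            same.append(doc)          # same group as the pivot
    # Recurse on each side; the pivot's group sits between them.
    return partial_order_sort(more, prefer) + [same] + partial_order_sort(less, prefer)


# Toy oracle built from made-up similarity values (illustration only).
similarity = {"A": 3, "B": 3, "C": 2, "D": 1, "E": 1}

def toy_prefer(a, b):
    if similarity[a] > similarity[b]:
        return -1
    if similarity[a] < similarity[b]:
        return 1
    return 0

groups = partial_order_sort(["D", "A", "C", "E", "B"], toy_prefer)
print(groups)  # [['A', 'B'], ['C'], ['D', 'E']]
```

In a crowdsourced setting each call to `prefer` would stand for an aggregated answer from several workers, and each level of recursion would correspond to one batch of judgments.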
  19. How Many Assessors?
     • Ranks are given to each document in a pair
       ▫ +1 if it is preferred over the other one
       ▫ -1 if the other one is preferred
       ▫ 0 if they were judged equally similar/dissimilar
     • Test for signed differences in the samples
     • In the original lists 35 experts were used
       ▫ Ranks of a document between 1 and more than 20
     • Our rank sample is less (and equally) variable
       ▫ rank(A) = -rank(B) ⇒ var(A) = var(B)
       ▫ Effect size is larger so statistical power increases
       ▫ Fewer assessors are needed overall
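One way to turn the +1 / -1 / 0 ranks of several workers into a single three-point preference is sketched below. The slides do not name the statistical test, so the two-sided sign test used here is an assumption, as are the function names and the 0.05 threshold.

```python
from math import comb


def sign_test_p_value(ranks):
    """Two-sided sign test on a list of +1 / -1 / 0 ranks (zeros are dropped)."""
    positives = sum(1 for r in ranks if r > 0)
    negatives = sum(1 for r in ranks if r < 0)
    n = positives + negatives
    if n == 0:
        return 1.0  # every worker judged the pair as equal
    k = min(positives, negatives)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def aggregate_preference(ranks, alpha=0.05):
    """Return +1, -1 or 0 for document A from the workers' ranks of A."""
    if sign_test_p_value(ranks) >= alpha:
        return 0  # no significant difference: treat A and B as equal
    return 1 if sum(ranks) > 0 else -1


print(aggregate_preference([1] * 10))             # 1
print(aggregate_preference([1] * 5 + [-1] * 5))   # 0
print(aggregate_preference([-1] * 9 + [0]))       # -1
```

Because rank(A) = -rank(B) for every judgment, only the ranks of one document in the pair need to be tested.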
  20. Crowdsourcing Preferences
     • Crowdsourcing seems very appropriate
       ▫ Reasonable person hypothesis
       ▫ Audio instead of images
       ▫ Preference judgments
       ▫ QuickSort for partially ordered lists
     • The task can be split into very small assignments
     • It should be much cheaper and more consistent
       ▫ Do not need experts
       ▫ Do not deceive and increase consistency
       ▫ Easier and faster to judge
       ▫ Need fewer judgments and judges
  21. New Domain of Application
     • Crowdsourcing has been used mainly to evaluate text documents in English
     • How about other languages?
       ▫ Spanish [Alonso et al., 2010]
     • How about multimedia?
       ▫ Image tagging? [Nowak et al., 2010]
       ▫ Music similarity?
  22. Data
     • MIREX 2005 Evaluation collection
       ▫ ~550 musical incipits in MIDI format
       ▫ 11 queries also in MIDI format
       ▫ 4 to 23 candidates per query
     • Convert to MP3 as it is easier to play in browsers
     • Trim the leading and trailing silence
       ▫ 1 to 57 secs. (mean 6) to 1 to 26 secs. (mean 4)
       ▫ 4 to 24 secs. (mean 13) to listen to all 3 incipits
     • Uploaded all MP3 files and a Flash player to a private server to stream data on the fly
  23. HIT Design
     • 2 yummy cents of dollar
  24. Threats to Validity
     • Basically had to randomize everything
       ▫ Initial order of candidates in the first segment
       ▫ Alternate between queries
       ▫ Alternate between pivots of the same query
       ▫ Alternate pivots as variations A and B
     • Let the workers know about this randomization
     • In first trials some documents were judged more similar to the query than the query itself!
       ▫ Require at least 95% acceptance rate
       ▫ Ask for 10 different workers per HIT [Alonso et al., 2009]
       ▫ Beware of bots (always judged equal in 8 secs.)
  25. Summary of Submissions
     • The 11 lists account for 119 candidates to judge
     • Sent 8 batches (QuickSort iterations) to MTurk
     • Had to judge 281 pairs (38%) = 2810 judgments
     • 79 unique workers for about 1 day and a half
     • A total cost (excluding trials) of $70.25
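The figures on slide 25 can be cross-checked with a few lines of arithmetic. The numbers below are taken from the slides (281 pairs, 10 workers per HIT, $70.25 in total, a 2-cent reward); the variable names are illustrative. The slides do not say what the amount above the reward covers, so reading it as a platform fee is only an assumption.

```python
from decimal import Decimal

pairs = 281                      # pairs that had to be judged (slide 25)
workers_per_pair = 10            # different workers per HIT (slide 24)
judgments = pairs * workers_per_pair

total_cost = Decimal("70.25")    # total cost excluding trials (slide 25)
reward = Decimal("0.02")         # reward shown in the HIT design (slide 23)

cost_per_judgment = total_cost / judgments
overhead = cost_per_judgment - reward   # possibly fees; not stated in the slides

print(judgments)           # 2810
print(cost_per_judgment)   # 0.025
print(overhead)            # 0.005
```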
  26. Feedback and Music Background
     • 23 of the 79 workers gave us feedback
       ▫ 4 very positive comments: very relaxing music
       ▫ 1 greedy worker: give me more money
       ▫ 2 technical problems loading the audio in 2 HITs
         ▪ Not reported by any of the other 9 workers
       ▫ 5 reported no music background
       ▫ 6 had formal music education
       ▫ 9 professional practitioners for several years
       ▫ 9 play an instrument, mainly piano
       ▫ 6 performers in choir
  27. Agreement between Workers
     • Forget about Fleiss’ Kappa
       ▫ Does not account for the size of the disagreement
       ▫ A<B and A=B is not as bad as A<B and B<A
     • Look at all 45 pairs of judgments per pair
       ▫ +2 if total agreement (e.g. A<B and A<B)
       ▫ +1 if partial agreement (e.g. A<B and A=B)
       ▫ 0 if no agreement (i.e. A<B and B<A)
       ▫ Divide by 90 (all pairs with total agreement)
     • Average agreement score per pair was 0.664
       ▫ From 0.506 (iteration 8) to 0.822 (iteration 2)
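The scoring rule on slide 27 can be written down directly. This is a minimal sketch: the function names and the encoding of judgments as the strings "<", "=" and ">" are assumptions made for illustration. With 10 judgments per pair of documents there are 45 pairs of judgments, so the maximum total is 90.

```python
from itertools import combinations


def judgment_pair_score(first, second):
    """+2 for total agreement, +1 for partial agreement, 0 for none."""
    if first == second:
        return 2                 # e.g. A<B and A<B
    if "=" in (first, second):
        return 1                 # e.g. A<B and A=B
    return 0                     # A<B and B<A


def agreement_score(judgments):
    """Agreement among all workers who judged the same pair of documents."""
    judgment_pairs = list(combinations(judgments, 2))   # 45 pairs for 10 workers
    total = sum(judgment_pair_score(a, b) for a, b in judgment_pairs)
    return total / (2 * len(judgment_pairs))            # divide by 90 for 10 workers


print(agreement_score(["<"] * 10))                        # 1.0
print(round(agreement_score(["<"] * 5 + [">"] * 5), 3))   # 0.444
print(round(agreement_score(["<"] * 5 + ["="] * 5), 3))   # 0.722
```

Unlike a nominal agreement coefficient, this score penalizes opposite preferences more than a preference paired with a tie.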
  28. Agreement Workers-Experts
     • Those 10 judgments were actually aggregated
     • (Table caption: percentages per row total)
       ▫ 155 (55%) total agreement
       ▫ 102 (36%) partial agreement
       ▫ 23 (8%) no agreement
     • Total agreement score = 0.735
     • Supports the reasonable person hypothesis
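The total agreement score on slide 28 is consistent with the +2 / +1 / 0 weighting from slide 27 applied to the three counts. The check below assumes that weighting carries over to the worker-expert comparison; the variable names are illustrative. Note that the three counts add up to 280 pairs.

```python
total_agreement = 155
partial_agreement = 102
no_agreement = 23

compared_pairs = total_agreement + partial_agreement + no_agreement

# +2 per total agreement, +1 per partial agreement, 0 otherwise,
# normalized by the score of full agreement on every pair.
score = (2 * total_agreement + 1 * partial_agreement) / (2 * compared_pairs)

print(compared_pairs)    # 280
print(round(score, 4))   # 0.7357
```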
  29. Agreement Single Worker-Experts
  30. Agreement (Summary)
     • Very similar judgments overall
       ▫ The reasonable person hypothesis still stands
       ▫ Crowdsourcing seems a doable alternative
       ▫ No music expertise seems necessary
     • We could use just one assessor per pair
       ▫ If we could keep him/her throughout the query
  31. Ground Truth Similarity
     • Do high agreement scores translate into highly similar ground truth lists?
     • Consider the original lists (All-2) as ground truth
     • And the crowdsourced lists as a system’s result
       ▫ Compute the Average Dynamic Recall [Typke et al., 2006]
       ▫ And then the other way around
     • Also compare with the (more consistent) original lists aggregated in Any-1 form [Urbano et al., 2010b]
  32. Ground Truth Similarity (II)
     • The result depends on the initial ordering
       ▫ Ground truth = (A, B, C), (D, E)
       ▫ Results1 = (A, B), (D, E, C)
         ▪ ADR score = 0.933
       ▫ Results2 = (A, B), (C, D, E)
         ▪ ADR score = 1
     • Results1 is identical to Results2
     • Generate 1000 (identical) versions by randomly permuting the documents within a group
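The two ADR scores on slide 32 can be reproduced with a short sketch of Average Dynamic Recall. The function names are illustrative, and the computation follows the usual definition: at each position i, count how many of the first i results belong to the ground truth groups reaching down to position i, divide by i, and average over all positions. The second function mimics the random permutation within groups mentioned on the slide.

```python
import random


def average_dynamic_recall(ground_truth_groups, ranked_results):
    """ADR of a ranked list against a partially ordered ground truth."""
    n = sum(len(group) for group in ground_truth_groups)
    allowed = set()      # documents acceptable up to the current position
    position = 0
    total = 0.0
    for group in ground_truth_groups:
        allowed |= set(group)
        for _ in group:
            position += 1
            hits = sum(1 for doc in ranked_results[:position] if doc in allowed)
            total += hits / position
    return total / n


def random_linearization(groups, rng):
    """Flatten a partially ordered list, shuffling documents within each group."""
    ranked = []
    for group in groups:
        shuffled = list(group)
        rng.shuffle(shuffled)
        ranked.extend(shuffled)
    return ranked


ground_truth = [["A", "B", "C"], ["D", "E"]]

print(round(average_dynamic_recall(ground_truth, ["A", "B", "D", "E", "C"]), 3))  # 0.933
print(average_dynamic_recall(ground_truth, ["A", "B", "C", "D", "E"]))            # 1.0

# 1000 random versions of the same partially ordered result list.
rng = random.Random(0)   # fixed seed, chosen arbitrarily for repeatability
results_groups = [["A", "B"], ["D", "E", "C"]]
scores = [average_dynamic_recall(ground_truth, random_linearization(results_groups, rng))
          for _ in range(1000)]
mean_score = sum(scores) / len(scores)
```

Averaging over many random versions removes the dependence on the arbitrary order of documents inside a group.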
  33. Ground Truth Similarity (and III)
     • (Table caption: min. and max. between square brackets)
     • Very similar to the original All-2 lists
     • Like the Any-1 version, also more restrictive
     • More consistent (workers were not deceived)
  34. MIREX 2005 Revisited
     • Would the evaluation have been affected?
       ▫ Re-evaluated the 7 systems that participated
       ▫ Included our Splines system [Urbano et al., 2010a]
     • All systems perform significantly worse
       ▫ ADR score drops between 9-15%
     • But their ranking is just the same
       ▫ Kendall’s τ = 1
  35. Conclusions
     • Partially ordered lists should come back
     • We proposed an alternative methodology
       ▫ Asked for three-point preference judgments
       ▫ Used Amazon Mechanical Turk
         ▪ Crowdsourcing can be used for music-related tasks
         ▪ Provided empirical evidence supporting the reasonable person hypothesis
     • What for?
       ▫ More affordable and large-scale evaluations
  36. Conclusions (and II)
     • We need fewer assessors
       ▫ More queries with the same man-power
     • Preferences are easier and faster to judge
     • Fewer judgments are required
       ▫ Sorting algorithm
     • Avoid inconsistencies (A=B option)
     • Using audio instead of images gets rid of experts
     • From 70 expert hours to 35 hours for $70
  37. Future Work
     • Choice of pivots in the sorting algorithm
       ▫ e.g. the query itself would not provide information
     • Study the collections for Audio Tasks
       ▫ They have more data
         ▪ Inaccessible
       ▫ But no partially ordered list (yet)
     • Use our methodology with one real expert judging preferences for the same query
     • Try crowdsourcing too with one single worker
  38. Future Work (and II)
     • Experimental study on the characteristics of music similarity perception by humans
       ▫ Is it transitive?
         ▪ We assumed it is
       ▫ Is it symmetrical?
     • If these properties do not hold we have problems
     • If they do, we can start thinking about Minimal and Incremental Test Collections [Carterette et al., 2005]
  39. And That’s It!
     Picture by 姒儿喵喵