Crowdsourcing
Preference Judgments for
Evaluation of Music Similarity Tasks

Julián Urbano, Jorge Morato,
Mónica Marrero and Diego Martín
http://julian-urbano.info
Twitter: @julian_urbano


SIGIR CSE 2010
Geneva, Switzerland · July 23rd
2



Outline
•   Introduction
•   Motivation
•   Alternative Methodology
•   Crowdsourcing Preferences
•   Results
•   Conclusions and Future Work
3



Evaluation Experiments
• Essential for Information Retrieval [Voorhees, 2002]

• Traditionally followed the Cranfield paradigm
  ▫ Relevance judgments are the most important
    part of test collections (and the most expensive)

• In the music domain, evaluation was not taken
  seriously until very recently
  ▫ MIREX appeared in 2005 [Downie et al., 2010]
  ▫ Additional problems with the construction and
    maintenance of test collections [Downie, 2004]
4



Music Similarity Tasks
• Given a music piece (i.e. the query) return a
  ranked list of other pieces similar to it
 ▫ Actual music contents, forget the metadata!

• It comes in two flavors
 ▫ Symbolic Melodic Similarity (SMS)
 ▫ Audio Music Similarity (AMS)

• It is inherently more complex to evaluate
 ▫ Relevance judgments are very problematic
5



Relevance (Similarity) Judgments
• Relevance is usually considered on a fixed scale
  ▫ Relevant, not relevant, very relevant…

• For music similarity tasks relevance is rather
  continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007]
  ▫ Single melodic changes are not perceived to
    change the overall melody
      Move a note up or down in pitch, shorten it, etc.
  ▫ But the similarity is weaker as more changes apply

• Where is the line between relevance levels?
6



Partially Ordered Lists
• The relevance of a document is implied by its
  position in a partially ordered list [Typke et al., 2005]
  ▫ Does not need any prefixed relevance scale

• Ordered groups of documents equally relevant
  ▫ Have to keep the order of the groups
  ▫ Allow permutations within the same group

• Assessors only need to be sure that any pair of
  documents is ordered properly
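The structure above can be sketched in code (a hypothetical Python representation, ours rather than Typke et al.'s): a partially ordered list is a sequence of ordered groups, and a total order is consistent with it iff it only permutes documents within each group.

```python
# A partially ordered list: ordered groups of equally relevant documents.
# Hypothetical sketch; names and data structure are ours, not MIREX's.
ground_truth = [{"A", "B", "C"}, {"D", "E"}]  # group 1 is more relevant than group 2

def is_valid_order(total_order, groups):
    """A total order is consistent with the partial order iff each group's
    documents occupy a contiguous block of ranks, in group order."""
    start = 0
    for group in groups:
        block = set(total_order[start:start + len(group)])
        if block != group:
            return False
        start += len(group)
    return start == len(total_order)

print(is_valid_order(["B", "A", "C", "E", "D"], ground_truth))  # True: permutes within groups
print(is_valid_order(["A", "D", "B", "C", "E"], ground_truth))  # False: crosses group boundaries
```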
7



Partially Ordered Lists (II)
8



Partially Ordered Lists (and III)
• Used in the first edition of MIREX in 2005
 [Downie et al., 2005]



• Widely accepted by the MIR community
  to report new developments
 [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006]



• MIREX was forced to move to traditional
  level-based relevance since 2006
 ▫ Partially ordered lists are expensive
 ▫ And have some inconsistencies
9



Expensiveness
• The ground truth for just 11 queries required 35
  music experts working for 2 hours [Typke et al., 2005]
 ▫ Only 11 of them had time to work on all 11 queries
 ▫ This exceeds MIREX’s resources for a single task

• MIREX had to move to level-based relevance
 ▫ BROAD: Not Similar, Somewhat Similar, Very Similar
 ▫ FINE: numerical, from 0 to 10 with one decimal digit

• Problems with assessor consistency came up
10



Issues with Assessor Consistency
• The line between levels is certainly unclear
 [Jones et al., 2007][Downie et al., 2010]
11



Original Methodology
• Go back to partially ordered lists
 ▫   Filter the collection
 ▫   Have the experts rank the candidates
 ▫   Arrange the candidates by rank
 ▫   Aggregate candidates whose ranks are not
     significantly different (Mann-Whitney U)
• There are known odd results and inconsistencies
 [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b]
 ▫ Disregard changes that do not alter the actual
   perception, such as clef or key and time signature
  ▫ Something like changing the language of a text
    and using synonyms [Urbano et al., 2010a]
12



Inconsistencies due to Ranking
13



Alternative Methodology
• Minimize inconsistencies [Urbano et al., 2010b]
• Cheapen the whole process

• Reasonable Person hypothesis [Downie, 2004]
  ▫ With crowdsourcing (finally)

• Use Amazon Mechanical Turk
  ▫ Get rid of experts [Alonso et al., 2008][Alonso et al., 2009]
  ▫ Work with “reasonable turkers”
  ▫ Explore other domains to apply crowdsourcing
14



Equally Relevant Documents
• Experts were forced to give totally ordered lists

• One would expect ranks to randomly average out
  ▫ Half the experts prefer one document
  ▫ Half the experts prefer the other one

• That is hardly the case
  ▫ Do not expect similar ranks if the experts
    can not give similar ranks in the first place
15



Give Audio instead of Images
• Experts may be guided by the images, not the music
 ▫ Some irrelevant changes in the image can deceive them




• No music expertise should be needed
 ▫ Reasonable person turker hypothesis
16



Preference Judgments
• In their heads, experts actually do
  preference judgments
 ▫ Similar to a binary search
 ▫ Accelerates assessor fatigue as the list grows

• Already noted for level-based relevance
 ▫ Go back and re-judge [Downie et al., 2010][Jones et al., 2007]
 ▫ Overlapping between BROAD and FINE scores

• Change the relevance assessment question
 ▫ Which is more similar to Q: A or B? [Carterette et al., 2008]
17



Preference Judgments (II)
• Better than traditional level-based relevance
 ▫ Inter-assessor agreement
 ▫ Time to answer

• In our case, three-point preferences
 ▫ A < B (A is more similar)
 ▫ A = B (they are equally similar/dissimilar)
 ▫ A > B (B is more similar)
18



Preference Judgments (and III)
• Use a modified QuickSort algorithm to sort
  documents in a partially ordered list
 ▫ Do not need all O(n²) judgments, but O(n·log n)




              X is the current pivot on the segment
                     X has been pivot already
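The sorting idea can be sketched roughly as follows (our reconstruction, not the exact procedure used; `prefer`, the document names, and the toy distance oracle are all hypothetical): a three-way QuickSort whose comparator is a three-point preference judgment, where documents tied with the pivot end up in the same group.

```python
def pos_quicksort(docs, prefer):
    """Sort documents into a partially ordered list with a 3-way QuickSort.
    `prefer(a, b)` returns -1 if a is more similar to the query (A < B),
    0 if equally similar (A = B), and +1 if b is more similar (A > B).
    Needs O(n log n) preference judgments on average, not all O(n^2) pairs."""
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]
    more, equal, less = [], [pivot], []
    for d in rest:
        p = prefer(d, pivot)
        if p < 0:
            more.append(d)    # preferred over the pivot: earlier group
        elif p == 0:
            equal.append(d)   # tied with the pivot: same group
        else:
            less.append(d)    # pivot preferred: later group
    return pos_quicksort(more, prefer) + [sorted(equal)] + pos_quicksort(less, prefer)

# Toy oracle: a smaller hidden "distance" means more similar (made-up data).
dist = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 3}
prefer = lambda a, b: (dist[a] > dist[b]) - (dist[a] < dist[b])
print(pos_quicksort(list("DACEB"), prefer))  # [['A', 'B'], ['C'], ['D', 'E']]
```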
19



How Many Assessors?
• Ranks are given to each document in a pair
 ▫ +1 if it is preferred over the other one
 ▫ -1 if the other one is preferred
 ▫ 0 if they were judged equally similar/dissimilar
• Test for signed differences in the samples
• In the original lists 35 experts were used
 ▫ Ranks of a document ranged from 1 to more than 20
• Our rank sample is less (and equally) variable
  ▫ rank(A) = -rank(B) ⇒ var(A) = var(B)
 ▫ Effect size is larger so statistical power increases
 ▫ Fewer assessors are needed overall
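The test for signed differences can be made concrete with an exact sign test (one possible instantiation; the actual test and the sample judgments below are our assumptions, not the paper's data):

```python
from math import comb

def sign_test(prefs):
    """Two-sided exact sign test on 3-point preferences.
    prefs: +1 if the worker preferred A, -1 if B, 0 if tied (ties dropped)."""
    pos = sum(p > 0 for p in prefs)
    neg = sum(p < 0 for p in prefs)
    n, k = pos + neg, min(pos, neg)
    # Two-sided p-value: 2 * P(X <= k) with X ~ Binomial(n, 1/2), capped at 1
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** (n - 1)
    return min(p, 1.0)

# Hypothetical sample: of 10 workers, 8 prefer A, 1 prefers B, 1 ties.
print(round(sign_test([+1] * 8 + [-1] + [0]), 4))  # 0.0391
```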
20



Crowdsourcing Preferences
• Crowdsourcing seems very appropriate
 ▫   Reasonable person hypothesis
 ▫   Audio instead of images
 ▫   Preference judgments
 ▫   QuickSort for partially ordered lists
• The task can be split into very small assignments
• It should be much cheaper and more consistent
 ▫   Do not need experts
 ▫   Audio does not deceive, which increases consistency
 ▫   Easier and faster to judge
 ▫   Need fewer judgments and judges
21



New Domain of Application
• Crowdsourcing has been used mainly to evaluate
  text documents in English

• How about other languages?
 ▫ Spanish [Alonso et al., 2010]

• How about multimedia?
 ▫ Image tagging? [Nowak et al., 2010]
 ▫ Music similarity?
22



Data
• MIREX 2005 Evaluation collection
 ▫ ~550 musical incipits in MIDI format
 ▫ 11 queries also in MIDI format
 ▫ 4 to 23 candidates per query

• Convert to MP3 as it is easier to play in browsers
• Trim the leading and trailing silence
 ▫ From 1–57 secs. (mean 6) down to 1–26 secs. (mean 4)
 ▫ 4–24 secs. (mean 13) to listen to all 3 incipits
• Uploaded all MP3 files and a Flash player to a
  private server to stream data on the fly
23



HIT Design




         2 yummy cents of dollar
24



Threats to Validity
• Basically had to randomize everything
 ▫   Initial order of candidates in the first segment
 ▫   Alternate between queries
 ▫   Alternate between pivots of the same query
 ▫   Alternate pivots as variations A and B
• Let the workers know about this randomization
• In first trials some documents were judged more
  similar to the query than the query itself!
 ▫ Require at least 95% acceptance rate
 ▫ Ask for 10 different workers per HIT [Alonso et al., 2009]
 ▫ Beware of bots (always judged equal in 8 secs.)
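The bot symptom above (identical "equal" answers in suspiciously constant time) suggests a trivial screening rule; the heuristic below is purely illustrative (our names and thresholds), not the screening actually applied:

```python
def looks_like_bot(judgments, times, max_secs=10):
    """Flag a worker whose answers are all 'equal' and uniformly fast.
    Hypothetical heuristic; thresholds would be tuned against real data."""
    all_equal = all(j == "A=B" for j in judgments)
    all_fast = max(times) <= max_secs
    return all_equal and all_fast

print(looks_like_bot(["A=B"] * 10, [8] * 10))               # True
print(looks_like_bot(["A<B", "A=B", "A>B"], [20, 25, 18]))  # False
```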
25



Summary of Submissions
•   The 11 lists account for 119 candidates to judge
•   Sent 8 batches (QuickSort iterations) to MTurk
•   Had to judge 281 pairs (38%) = 2810 judgments
•   79 unique workers over about a day and a half
•   A total cost (excluding trials) of $70.25
26



Feedback and Music Background
• 23 of the 79 workers gave us feedback
 ▫ 4 very positive comments: very relaxing music
 ▫ 1 greedy worker: give me more money
 ▫ 2 technical problems loading the audio in 2 HITs
      Not reported by any of the other 9 workers

 ▫   5 reported no music background
 ▫   6 had formal music education
 ▫   9 professional practitioners for several years
 ▫   9 play an instrument, mainly piano
 ▫   6 performers in choir
27



Agreement between Workers
• Forget about Fleiss’ Kappa
 ▫ Does not account for the size of the disagreement
 ▫ Disagreeing A<B vs. A=B is not as bad as A<B vs. B<A
• Look at all 45 pairs of judgments per pair
 ▫   +2 if total agreement (e.g. A<B and A<B)
 ▫   +1 if partial agreement (e.g. A<B and A=B)
 ▫   0 if no agreement (i.e. A<B and B<A)
 ▫   Divide by 90 (all pairs with total agreement)

• Average agreement score per pair was 0.664
 ▫ From 0.506 (iteration 8) to 0.822 (iteration 2)
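The scoring scheme can be written out as follows (our sketch; with 10 workers per pair there are C(10,2) = 45 pairs of judgments, hence the normalising maximum of 90):

```python
from itertools import combinations

def agreement_score(judgments):
    """judgments: one of -1 (A<B), 0 (A=B), +1 (A>B) per worker.
    Each pair of workers scores 2 if identical, 1 if exactly one of
    them said 'equal', 0 if they preferred opposite documents.
    Normalise by the maximum, 2 * C(n, 2)."""
    pairs = list(combinations(judgments, 2))
    score = sum(2 if a == b else (1 if 0 in (a, b) else 0) for a, b in pairs)
    return score / (2 * len(pairs))

print(agreement_score([-1] * 10))            # 1.0: total agreement
print(agreement_score([-1] * 5 + [+1] * 5))  # 40/90 ≈ 0.444: split preferences
```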
28



Agreement Workers-Experts
• Those 10 judgments were actually aggregated




                  Percentages per row total
 ▫ 155 (55%) total agreement
 ▫ 102 (36%) partial agreement
 ▫ 23 (8%) no agreement
• Total agreement score = 0.735
• Supports the reasonable person hypothesis
29



Agreement Single Worker-Experts
30



Agreement (Summary)




• Very similar judgments overall
 ▫ The reasonable person hypothesis still stands
 ▫ Crowdsourcing seems a doable alternative
 ▫ No music expertise seems necessary
• We could use just one assessor per pair
 ▫ If we could keep him/her throughout the query
31



Ground Truth Similarity
• Do high agreement scores translate into
  highly similar ground truth lists?

• Consider the original lists (All-2) as ground truth
• And the crowdsourced lists as a system’s result
  ▫ Compute the Average Dynamic Recall [Typke et al., 2006]
  ▫ And then the other way around

• Also compare with the (more consistent) original
  lists aggregated in Any-1 form [Urbano et al., 2010b]
32



Ground Truth Similarity (II)
• The result depends on the initial ordering
  ▫ Ground truth = (A, B, C), (D, E)
  ▫ Results1 = (A, B), (D, E, C)
    ADR score = 0.933
 ▫ Results2 = (A, B), (C, D, E)
    ADR score = 1

• Results1 is equivalent to Results2 (they differ only
  by a permutation within a ground-truth group)

• Generate 1000 (equivalent) versions by randomly
  permuting the documents within a group
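The ADR numbers in the example can be reproduced with a short sketch (our reading of Average Dynamic Recall [Typke et al., 2006]: at each rank r, count how many of the first r retrieved documents could legitimately occupy a top-r position in the ground truth):

```python
def adr(groups, ranking):
    """Average Dynamic Recall of a flat ranking against ordered groups."""
    n = len(ranking)
    total, allowed, g = 0.0, set(), 0
    for r in range(1, n + 1):
        # Extend the allowed set with every group that can reach rank r
        while g < len(groups) and len(allowed) < r:
            allowed |= groups[g]
            g += 1
        hits = sum(1 for d in ranking[:r] if d in allowed)
        total += hits / r
    return total / n

gt = [{"A", "B", "C"}, {"D", "E"}]
print(round(adr(gt, ["A", "B", "D", "E", "C"]), 3))  # 0.933
print(adr(gt, ["A", "B", "C", "D", "E"]))            # 1.0
```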
33



Ground Truth Similarity (and III)




             Min. and Max. between square brackets



• Very similar to the original All-2 lists
• Like the Any-1 version, also more restrictive
• More consistent (workers were not deceived)
34



MIREX 2005 Revisited
• Would the evaluation have been affected?
 ▫ Re-evaluated the 7 systems that participated
 ▫ Included our Splines system [Urbano et al., 2010a]




• All systems perform significantly worse
 ▫ ADR scores drop by 9–15%
• But their ranking is just the same
 ▫ Kendall’s τ = 1
35



Conclusions
• Partially ordered lists should come back

• We proposed an alternative methodology
 ▫ Asked for three-point preference judgments
 ▫ Used Amazon Mechanical Turk
    Crowdsourcing can be used for music-related tasks
    Provided empirical evidence supporting the
     reasonable person hypothesis


• What for?
 ▫ More affordable and large-scale evaluations
36



Conclusions (and II)
• We need fewer assessors
 ▫ More queries with the same manpower
• Preferences are easier and faster to judge
• Fewer judgments are required
 ▫ Sorting algorithm

• Avoid inconsistencies (A=B option)
• Using audio instead of images removes the need for experts

• From 70 expert hours to 35 hours for $70
37



Future Work
• Choice of pivots in the sorting algorithm
 ▫ e.g. the query itself would not provide information

• Study the collections for Audio Tasks
 ▫ They have more data
    Inaccessible
 ▫ But no partially ordered list (yet)

• Use our methodology with one real expert
  judging preferences for the same query
• Try crowdsourcing too with one single worker
38



Future Work (and II)
• Experimental study on the characteristics of
  music similarity perception by humans
 ▫ Is it transitive?
     We assumed it is
 ▫ Is it symmetrical?

• If these properties do not hold we have problems

• If they do, we can start thinking about Minimal
  and Incremental Test Collections
 [Carterette et al., 2005]
39



And That’s It!




                 Picture by 姒儿喵喵
