Generating Ground Truth for Music Mood Classification Using Mechanical Turk

Presentation of "Generating Ground Truth for Music Mood Classification Using Mechanical Turk" by Jin Ha Lee and Xiao Hu at the 12th Annual ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL).

Transcript

  • 1. Generating Ground Truth for Music Mood Classification Using Mechanical Turk Jin Ha Lee & Xiao Hu JCDL 2012
  • 2. Mood: a relatively long lasting and stable emotional state (Meyer, 1956) Emotion? Affect?
  • 3. Music mood • Has recently received a lot of attention in the MIR (Music Information Retrieval) domain • “Audio Music Mood Classification” task in MIREX (Music Information Retrieval Evaluation eXchange), starting in 2007 • Critical for developing MDL
  • 4. • Evaluation is based on ground truth [Figure labels: Passionate, Bittersweet, Bittersweet, Bittersweet]
  • 5. More is better! However, generating ground truth based on human input is expensive and time-consuming
  • 6. How is it done in MIREX? • A web-based survey system called E6K • Invitations posted to MIREX and music-ir mailing lists in order to recruit volunteers
  • 7. Can we use the CROWD instead of MUSIC EXPERTS? Is there a better way?
  • 8. 1. How do music mood classification results obtained from Mechanical Turk compare to those collected from music experts in MIREX? 2. How different or similar are the evaluation outcomes for the MIREX AMC task when based on ground truth collected from Mechanical Turk vs. E6K?
  • 9. Amazon Mechanical Turk (MTurk) [Diagram labels: Requester, Task, Workers (Turkers)]
  • 10. TASK: Listen to 30-second music clips → Select one of the five mood clusters ↓
      Cluster1: passionate, rousing, confident, boisterous, rowdy
      Cluster2: cheerful, fun, rollicking, sweet, amiable/good natured
      Cluster3: bittersweet, poignant, wistful, literate, autumnal, brooding
      Cluster4: humorous, silly, campy, quirky, whimsical, witty, wry
      Cluster5: aggressive, intense, fiery, tense/anxious, volatile, visceral
  • 11. Qualification test Consistency check Review process
  • 12. Basic Stats • 1250 songs × 2 judgments = 2500 unique mood judgments • 186 HITs collected − 86 HITs rejected = 100 HITs accepted • 1 HIT = 25 songs
  • 13. Stats on Collecting Data
      Measure                                 | EVALUTRON 6000 (E6K)                       | MTurk
      Average Time Spent on Each Music Clip   | 21.54 seconds                              | 17.46 seconds
      Total Time for Collecting All Judgments | 38 days (+ additional in-house assessment) | 19 days
      Cost for Collecting All Judgments       | $0                                         | $60.50
  • 14. Comparison of E6K and MTurk data
  • 15. Number of Judgments and Distribution across Clusters
      Cluster  | E6K         | MTurk       | Diff. in % (E6K-MTurk)
      Cluster1 | 405 (16.4%) | 450 (18.0%) | -1.6%
      Cluster2 | 472 (19.1%) | 536 (21.4%) | -2.3%
      Cluster3 | 542 (22.0%) | 622 (24.9%) | -2.9%
      Cluster4 | 412 (16.7%) | 367 (14.7%) | 2.0%
      Cluster5 | 400 (16.2%) | 403 (16.1%) | 0.1%
      Other    | 237 (9.6%)  | 122 (4.9%)  | 4.7%
      Total    | 2468        | 2500        | -
  • 16. Distribution of Agreement
      Cluster  | E6K | MTurk | Both
      Cluster1 | 121 | 89    | 29
      Cluster2 | 130 | 131   | 44
      Cluster3 | 163 | 216   | 91
      Cluster4 | 121 | 85    | 42
      Cluster5 | 126 | 121   | 64
      Total    | 661 | 642   | 270
  • 17. Confusion among the Clusters
      Clusters              | Disagreed in E6K | Disagreed in MTurk
      Cluster 1 & Cluster 2 | 20               | 95
      Cluster 2 & Cluster 4 | 31               | 86
      Cluster 1 & Cluster 5 | 13               | 74
      ⁞                     | ⁞                | ⁞
      Cluster 3 & Cluster 4 | 6                | 27
      Cluster 2 & Cluster 5 | 1                | 22
      Cluster 3 & Cluster 5 | 1                | 20
      Total                 | 253              | 595
  • 18. Russell’s model [Figure labels: Cluster 1, Cluster 2, Cluster 5, Cluster 4, Cluster 3]
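  • 19. System Performance
      Rank | E6K system | E6K average accuracy | MTurk system | MTurk average accuracy
      1    | CL         | 0.65                 | GT           | 0.66
      2    | GT         | 0.64                 | CL           | 0.63
      3    | TL         | 0.64                 | TL           | 0.63
      4    | ME1        | 0.61                 | ME1          | 0.57
      5    | ME2        | 0.61                 | ME2          | 0.57
      6    | IM2        | 0.57                 | IM2          | 0.57
      7    | KL1        | 0.56                 | KL1          | 0.55
      8    | IM1        | 0.53                 | IM1          | 0.54
      9    | KL2        | 0.29                 | KL2          | 0.29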
  • 20. TK-HSD Rank Comparison [Figure panels: MTurk, E6K]
  • 21. Conclusion • Overall, the human judgments from E6K and MTurk showed similar patterns: – Judgment distribution across five mood clusters – Agreement distribution across clusters – Confusion among clusters • System performance rankings from E6K and MTurk were also comparable
  • 22. Conclusion (Cont’d.) • However, the combined ground truth from E6K and MTurk is only about 60% of the size of the original E6K ground truth • Mood is a highly subjective feature for describing and organizing music • Other means for judging moods should be explored (e.g., ranking)
  • 23. Future work • In-depth interviews with users to investigate factors affecting people’s judgments on music mood • A more controlled study with different user groups
  • 24. Questions?
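
The arithmetic on slides 12, 13, and 15 can be re-derived from the figures in the transcript. The Python sketch below is not part of the presentation. Its function and variable names are illustrative assumptions, and the cost per judgment is a derived figure that the slides do not state. The percentage differences are computed from the rounded percentages, which appears to be how the slide's "Diff. in %" column was produced.

```python
# Figures copied from the slide transcript (slides 12, 13, and 15).
SONGS = 1250
JUDGMENTS_PER_SONG = 2
SONGS_PER_HIT = 25
HITS_COLLECTED = 186
HITS_REJECTED = 86
MTURK_COST_USD = 60.50

# Judgment counts per cluster as (E6K, MTurk), from slide 15.
COUNTS = {
    "Cluster1": (405, 450),
    "Cluster2": (472, 536),
    "Cluster3": (542, 622),
    "Cluster4": (412, 367),
    "Cluster5": (400, 403),
    "Other": (237, 122),
}


def percentages(counts):
    """Return each count as a percentage of the column total, rounded to 0.1."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


def basic_stats():
    """Re-derive the slide 12 totals and a per-judgment cost (not on the slides)."""
    accepted = HITS_COLLECTED - HITS_REJECTED
    judgments = SONGS * JUDGMENTS_PER_SONG
    return {
        "accepted_hits": accepted,
        "judgments": judgments,
        "judgments_from_accepted_hits": accepted * SONGS_PER_HIT,
        "cost_per_judgment_usd": MTURK_COST_USD / judgments,
    }


def distribution_table():
    """Rebuild the slide 15 rows: (cluster, E6K %, MTurk %, E6K % minus MTurk %)."""
    e6k_pct = percentages([pair[0] for pair in COUNTS.values()])
    mturk_pct = percentages([pair[1] for pair in COUNTS.values()])
    return [
        (name, e, m, round(e - m, 1))
        for name, e, m in zip(COUNTS, e6k_pct, mturk_pct)
    ]


if __name__ == "__main__":
    print(basic_stats())
    for row in distribution_table():
        print(row)
```

Under these figures, the 100 accepted HITs at 25 songs each account for exactly the 2500 judgments, and the recomputed percentages and differences match the slide 15 table.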