Ismir2012 tutorial2



  1. ISMIR 2012 Tutorial 2 (10/5/2012)
     Music Affect Recognition: The State-of-the-art and Lessons Learned
     Speakers: Xiao Hu, Ph.D. (The University of Hong Kong); Yi-Hsuan Eric Yang, Ph.D. (Academia Sinica, Taiwan)

     The Audience
       • Do you believe that music is powerful? Why do you think so?
       • Have you searched for music by affect?
       • Have you searched for other things (photos, video) by affect?
       • Have you questioned the difference between emotion and mood?
       • Is your research related to affect?

     Music Affect: [two image slides]
  2. Music Affect: [two image slides]

     Agenda
       • Grand challenges on music affect
       • Music affect taxonomy and annotation
       • Automatic music affect analysis
           - Categorical approach
           - Multimodal approach
           - Dimensional approach
           - Temporal approach
       • Beyond music
       • Conclusion

     Emotion or Mood?
       • Mood: “relatively permanent and stable”
       • Emotion: “temporary and evanescent”
       • “most of the supposed [psychological] studies of emotion in music are actually concerned with mood and association.” (Leonard Meyer)
       Meyer, Leonard B. (1956). Emotion and Meaning in Music. Chicago: Chicago University Press
  3. Expressed or Induced
       • Designated/indicated/expressed by a music piece
       • Induced/evoked/felt by a listener
       • Both are studied in MIR
       • Mainly differ in the ways of collecting labels
           - “indicate how you feel when listen to the music”
           - “indicate the mood conveyed by the music”

     Which Moods? 1/2
       • Different websites / studies use different terms
       • Thayer’s stress-energy model gives 4 clusters
       • Farnsworth’s 10 adjective groups
       • Tellegen-Watson-Clark model

     Which Moods? 2/2
       • Lack of a general theory of emotions
       • Ekman’s 6 basic emotions: anger, joy, surprise, disgust, sadness, fear
       • Verbalization of emotional states is often a “distortion” (Meyer, 1956)
           - “unspeakable feelings”
           - “a restful feeling throughout ... like one of going downstream while swimming”

     Sources of Music Emotion
       • Intrinsic (structural characteristics of the music)
           - e.g., modality -> happy vs. sad
           - What about melody?
       • Extrinsic emotion (semantic context related but outside the music)
       • Lee et al. (2012) identified a range of factors in people’s assessment of music mood
           - Lyrics, tempo, instrumentation, genre, delivery, and even cultural context
       • Little is known about the mapping of these factors to music mood
       Lee, J. H., Hill, T., & Work, L. (2012) What does music mood mean for real users? Proceedings of the iConference

     Let’s ask the users… (Lee et al., 2012) [image slide]

     Data, data, data!
       • Extremely scarce resource
       • Annotations are time consuming
       • Consistency is low across annotators
       • Existing public datasets on mood:
           - MoodSwings Turk dataset: 240 30-sec clips; Arousal-Valence scores
           - MIREX mood classification task: 600 30-sec clips; in 5 mood clusters
           - MIREX tag classification task (mood sub-task): 3,469 30-sec clips; in 18 mood-related tag groups
           - Yang’s emotion regression dataset: 193 25-sec clips; on an 11-level Arousal-Valence scale
  4. Suboptimal Performance
       • MIREX Mood Classification (2012): accuracy 46% - 68%
       • MIREX Tag Classification mood subtask (2011) [results figure]

     Newer Challenges
       • Cross-cultural applicability
           - Existing efforts focus on Western music
           - OS1 @ ISMIR 2012 (tomorrow): Yang & Hu: Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs
       • Personalization
           - Ultimate solution to the subjectivity problem
       • Contextualization
           - Even the same person’s emotional responses change across times, locations, and occasions
           - PS1 @ ISMIR 2012 (tomorrow): Watson & Mandryk: Modeling Musical Mood From Audio Features and Listening Context on an In-Situ Data Set

     Summary of Challenges
       • Terminology
       • Models and categories: no consensus
       • Sources and factors: no clear mapping between sources and affects
       • Data scarcity
       • Suboptimal performances
       • Newer issues: cross-cultural, personalization, contextualization, ...

     Agenda [repeated slide]

     Music affect taxonomy and annotation
       • Background
           - What are taxonomies?
           - Taxonomy vs. Folksonomy
       • Developing music mood taxonomies
           - Taxonomy from Editorial Labels
           - Taxonomies from Social Tags
       • Annotations
           - Experts
           - Crowdsourcing (e.g., MTurks, games)
           - Subjects
           - Derived from online services

     Taxonomy
       • Domain oriented controlled vocabulary
       • Contains labels (metadata)
       • Commonly used on websites
           - Pick list; browsable directory, etc.
  5. Taxonomy vs. Folksonomy
       • Taxonomy
           - Controlled, structured vocabulary
           - Often requires expert knowledge
           - Top-down and bottom-up approaches
           - e.g., [image]
       • Folksonomy
           - Uncontrolled, unstructured vocabulary
           - Social tags freely applied by users
           - Commonality exists in large numbers of tags
           - e.g., [image]

     Models in Music Psychology 1/2
       • Categorical: Hevner’s adjective circle (1936) [figure]
       Hevner, K. 1936. Experimental studies of the elements of expression in music. American Journal of Psychology, 48

     Models in Music Psychology 2/2
       • Dimensional: Russell’s circumplex model [figure]
       Russell, J. A. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39: 1161-1178.

     Borrow from Psychology to MIR
       • Thayer’s stress-energy model gives 4 clusters
       • Farnsworth’s 10 adjective groups
       • Tellegen-Watson-Clark model
       • Grounded in music perception research, but lack the social context of music listening (Juslin & Laukka, 2004)
       Juslin, P. N. and Laukka, P. (2004). Expression, perception, and induction of musical emotions: a review and a questionnaire study of everyday listening. JNMR.

     Taxonomy Built from Editorial Labels
       • Editorial labels:
           - Given by professional editors of online repositories
           - Have a certain level of control
           - Rooted in realistic social contexts
       • “the most comprehensive music reference source on the planet”
       • 288 mood labels created and assigned to music works
  6. Mood Label Clustering
       • Mood labels for albums; mood labels for songs
       [clustering figures for album and song labels, each showing clusters C1-C5]

     A Taxonomy of 5 Mood Clusters
       Cluster_1: passionate, rousing, confident, boisterous, rowdy
       Cluster_2: rollicking, cheerful, fun, sweet, amiable/good natured
       Cluster_3: literate, poignant, wistful, bittersweet, autumnal, brooding
       Cluster_4: humorous, silly, campy, quirky, whimsical, witty, wry
       Cluster_5: aggressive, fiery, tense/anxious, intense, volatile, visceral
       Hu, X., & Downie, J. S. (2007). Exploring Mood Metadata: Relationships with Genre, Artist and Usage Metadata. In Proceedings of ISMIR

     Taxonomy from Social Tags
       • Social tags: “The largest music tagging site for Western music”
       • Pros:
           - Users’ perspectives
           - Large quantity
       • Cons:
           - Non-standardized
           - Ambiguous

     The Method
       • Linguistic resources:
             1,586 terms in WordNet-Affect (a lexicon of affective words)
           -   202 evaluation terms in General Inquirer (“good”, “great”, “poor”, etc.)
           -   135 non-affect/ambiguous terms by experts (“cold”, “chill”, “beat”, etc.)
           = 1,249 terms
       • 476 terms are tags
       • Human expertise: group the tags by WordNet-Affect and experts => 36 categories
       Hu, X. (2010). Music and Mood: Where Theory and Reality Meet. In Proceedings of the 5th iConference (Best Student Paper).

     2-D Mood Taxonomy: 2-Dimensional Representation [figure]

     Comparison to Russell’s 2-D Model [figure]
  7. Our Taxonomy [figure; axes: VALENCE, AROUSAL]

     Laurier et al. (2009) Taxonomy from Social Tags 1/2
       • Manually compiled 120 mood words from the literature
       • Crawled 6.8M social tags from [site logo]
       • 107 unique tags matched mood words
       • 80 tags with more than 100 occurrences
       • Most used: sad, fun, melancholy, happy
       • Least used: rollicking, solemn, rowdy, tense
       Laurier et al. (2009) Music mood representations from social tags, ISMIR

     Laurier et al. (2009) Taxonomy from Social Tags 2/2
       • Used LSA to project the tag-track matrix to a space of 100 dimensions
       • Clustering trials with varied numbers of clusters
           - Intra-cluster similarity
           - Inter-cluster dissimilarity

       cluster 1     cluster 2     cluster 3    cluster 4
       (+A -V)       (-A -V)       (-A +V)      (+A +V)
       angry         sad           tender       happy
       aggressive    bittersweet   soothing     joyous
       visceral      sentimental   sleepy       bright
       rousing       tragic        tranquil     cheerful
       intense       depressing    quiet        humorous
       confident     sadness       calm         gay
       anger         spooky        serene       amiable

     Agreement between Laurier’s and the 5 cluster taxonomy
       • Based on Laurier’s 100-dimensional space

             C1    C2    C3    C4    C5
       C1    0     .74   .13   .20   .11
       C2          0     .86   .82   .88
       C3                0     .32   .27
       C4                      0     .53
       C5                            0

     Summary on Taxonomy
       • What are taxonomies?
       • Taxonomy vs. Folksonomy
       • Developing music mood taxonomies
           - from Editorial Labels
           - from Social Tags

     Mood Annotations
       • All annotation needs three things: taxonomy, music, people
       • People
           - Experts
           - Subjects
           - Crowdsourcing (e.g., MTurks, games)
           - Derive annotations from online services
  8. Expert Annotation
       • The MIREX Audio Mood Classification (AMC) task
       • 5 cluster taxonomy
       • 1,250 tracks selected from the APM libraries
       • A Web-based annotation system called E6K
       • Dataset built from agreements among experts
       Hu, X., Downie, J. S., Laurier, C., Bay, M., & Ehmann, A. (2008). The 2007 MIREX Audio Mood Classification Task: Lessons Learned. In ISMIR.

     Expert Annotation: MIREX AMC
       • 2468 judgments collected (3750 planned)
       • Each expert had 250 clips
       • 8 of 21 experts finished all assignments
       • Each clip had 2 or 3 judgments
       • Avg. Cohen’s Kappa: 0.5

       Agreements       C1    C2    C3    C4    C5    Total   Accuracy
       3 of 3 judges    21    24    56    21    31    153     0.59
       2 of 3 judges    41    35    18    26    14    134     0.38
       2 of 2 judges    58    61    46    73    75    313     0.54
       Total            120   120   120   120   120   600

       Lessons:
         1. Missed judgments -> low accuracy
         2. Need more motivated annotators

     Crowdsourcing: Amazon Mechanical Turk
       • Lee & Hu (2012): compare expert and MTurk annotations
           - The same 1,250 music clips as in MIREX AMC
           - The same 5 clusters
           - Annotators: “Turkers” who work on human intelligence tasks for very low payment
       • Advantages of MTurk
           - Plenty of labor
       • Disadvantages of MTurk
           - Quality control
       Lee, J. H. & Hu, X. (2012) Generating Ground Truth for Music Mood Classification Using Mechanical Turk, In Proceedings of Joint Conference on Digital Libraries

     Annotation: Amazon Mechanical Turk
       • Human Intelligence Task (HIT)
       • Each HIT had 27 clips
       • 2 duplicates for consistency check
       • Each clip had 2 judges
       • Paid 0.55 USD for 1 HIT
       • Qualification test before proceeding to task
       • 186 HITs collected
       • 100 HITs accepted
       • Avg. Cohen’s kappa: 0.48

     Comparison: Stats on Collecting Data
                                                 EVALUTRON 6000       MTurk
       Number of judgments collected             2468 (incomplete)    2500 (complete)
       Total time for collecting all judgments   38 days              19 days (+ additional in-house assessment)
       Cost for collecting all judgments         $0                   $60.50
       Average time spent on each music clip     21.54 seconds        17.46 seconds

     Comparison: Agreement Rates (% of clips with agreements)
                EVALUTRON 6000   MTurk
       C1       40.2%            39.6%
       C2       60.2%            48.9%
       C3       70.5%            69.5%
       C4       39.6%            46.3%
       C5       70.8%            60.0%
       Other    16.9%            21.3%
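The average Cohen's kappa values reported for the expert (0.5) and MTurk (0.48) annotations can be made concrete with a small sketch. The slides do not show how kappa was computed, so the code below is the standard two-annotator formula, and the judgments in `judge_1` and `judge_2` are hypothetical, not data from the tutorial.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same clips."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of clips where both judges chose the same cluster.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each judge's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical cluster judgments for ten clips (illustration only).
judge_1 = ["C1", "C1", "C2", "C3", "C3", "C4", "C5", "C5", "C2", "C4"]
judge_2 = ["C1", "C2", "C2", "C3", "C3", "C4", "C5", "C1", "C2", "C2"]
kappa = cohen_kappa(judge_1, judge_2)
```

Here the judges agree on 7 of 10 clips, but kappa is lower than 0.7 because some of that agreement is expected by chance; this is why kappa, rather than raw agreement, is reported for both annotation settings.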
  9. Comparison: Confusions among Clusters
       Clusters disagreed       EVALUTRON 6000   MTurk
       Cluster 1 & Cluster 2    20               95
       Cluster 2 & Cluster 4    31               86
       Cluster 1 & Cluster 5    13               74
       ⁞                        ⁞                ⁞
       Cluster 3 & Cluster 4    6                27
       Cluster 2 & Cluster 5    1                22
       Cluster 3 & Cluster 5    1                20
       Total                    253              595

     Confusions Shown in Russell’s Model [figure: Clusters 1-5 placed in Russell’s model]

     Comparison: System Performances (MIREX 2007) [figure]

     Crowdsourcing: Games
       • MoodSwings (Kim et al., 2008)
       • 2-player Web-based game to collect annotations of music pieces in the arousal-valence space
       • Time-varying annotations are collected at a rate of 1 sample per second
       • Players “score” for agreement with their competitor
       Kim, Y. E., Schmidt, E., and Emelle, L. (2008). Moodswings: a collaborative game for music mood label collection, ISMIR

     MoodSwings: Challenges
       • Needs a pair of players
           - Simulated AI player
               · Randomly following the real player: less challenging
               · Based on a prediction model: needs training data
       • Attracting players (true for all games)
           - Must be challenging and fun
           - Music: more recent and entertaining
           - Game interface: sleek, aesthetic
       • Research values
           - Variety of music and mood
       B. G. Morton, J. A. Speck, E. M. Schmidt, and Y. E. Kim (2010). Improving music emotion labeling using human computation, in HCOMP

     MoodSwings: MTurk version
       • Single person game
       • No competition, no scores
       • Monetary reward (0.25 USD/11 pieces)
       • Consistency check:
           - 2 identical pieces whose labels must be within experts’ decision boundary
           - must not label all clips the same way
       Speck, J. A., Schmidt, E. M., Morton, B. G., and Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation, ISMIR
  10. MoodSwings: 2-Version Comparison
       • Label correlation: V: 0.71, A: 0.85
       Speck, J. A., Schmidt, E. M., Morton, B. G., and Kim, Y. E. (2011). A comparative study of collaborative vs. traditional music mood annotation, ISMIR

     Subject Annotation
       • Do not require music expertise
           - Easier to recruit than experts
           - Arguably more authentic to MIR situations
       • Can be trained for the annotation task
           - Higher data quality than MTurk
           - Still needs verification/evaluation
       • Often with payments
           - Rates much higher than MTurk

     Derive Annotations from online services
       • Harness the power of Music 2.0
       • Based on editorial labels and noisy user tags
           - e.g., the MSD
           - e.g., MIREX Audio Tag Classification mood dataset
       Music 2.0 Logo by Rocketsurgeon

     MIREX Mood Tag Classification [figure]

     MIREX Mood Tag Classification Dataset: Positive Examples in Each Category [figure]

     MIREX Mood Tag Classification Dataset: An Example
       • Based on the top 100 tags provided by API
       • Select songs tagged heavily with terms in a category
  11. Cross-Cultural Issue in Annotation
       • A survey of 30 clips on Americans and Chinese
       • Clusters: C1: passionate; C2: cheerful; C3: bittersweet; C4: humorous; C5: aggressive
       • Example: “Got to Get You into My Life” by The Beatles [figure]
       Hu, X. & Lee, J. H. (2012). A Cross-cultural Study of Music Mood Perception between American and Chinese Listeners, ISMIR (PS3 - Thursday!)

     Annotation Derived from Music 2.0
       PROS
         • Grounded on real-life usage
         • Larger dataset, supporting multi-label
         • No manual annotation required
       CONS
         • Need mood-related social tags
         • Need clever ways to filter out noise
         • May be culturally dependent

     Summary on Annotation
       • Expert annotation for small datasets
       • Crowdsourcing with careful designs
       • Music 2.0 for super size datasets
       • ??

     Agenda [repeated slide]

     Categorical and Multimodal Approaches
       • Classification problem and framework
       • Audio features and classification models
       • Existing experiments
       • Multimodal classification
       • Cross-cultural classification

     Automatic Approaches: Categorical vs. Dimensional
       Categorical
         Pros: • Intuitive  • Natural language
         Cons: • Terms are ambiguous  • Difficult to offer fine-grained differentiation
       Dimensional
         Pros: • Continuous affective scales  • Good user interface
         Cons: • Less intuitive  • Difficult to annotate
  12. A Framework for Multimodal Mood Classification [diagram]
       • Dataset construction: social tags, lyrics, MP3s, ...
       • Feature extraction: textual (linguistic, stylistic, ...) and audio (tempo, timbral, ...)
       • Feature generation and selection: F-score, language modeling, PCA, ...
       • Classification and testing: SVM, KNN, ...
       • Multimodal combination: feature concatenation, late fusion, hybrid methods, ...
       • Evaluation and analysis: performance comparison, learning curves, feature comparison, ...

     Automatic Classification (supervised learning) [diagram]
       • Training examples with mood labels (e.g., “Here comes the sun” -> Happy, “Down with the sickness” -> Angry) are used to train a classifier
       • The classifier predicts labels for new examples (e.g., Song X -> Happy, Song Y -> Sad)

     Audio Features
       Type             Description                                                          Tool
       Energy           The mean and standard deviation of root mean square energy           Marsyas, MIR Toolbox
       Rhythm           Fluctuation pattern and tempo                                        MIR Toolbox, PsySound
       Pitch            Pitch class profile, the intensity of 12 semitones of the            MIR Toolbox, PsySound
                        musical octave in Western twelve-tone scale
       Tonal            Key clarity, musical mode (major/minor), and harmonic change         MIR Toolbox
                        (e.g., chord change)
       Timbre           The mean and standard deviation of the first 13 MFCCs,               Marsyas, MIR Toolbox
                        delta MFCCs, and delta delta MFCCs
       Psychoacoustic   Perceptual loudness, volume, sharpness (dull/sharp), timbre          PsySound
                        width (flat/rough), spectral and tonal dissonance
                        (dissonant/consonant) of music

     Classification Models
       • Generic supervised learning algorithms
           - neural network, k-nearest neighbor (k-NN), maximum likelihood, decision tree, support vector machine (SVM), Gaussian mixture models (GMM), etc.
       • Tools: generic machine learning packages
           - Weka, RapidMiner, LibSVM, SVMLight
       • SVM seems superior
       MIREX AMC 2007 Results [figure]

     Audio signal’s “glass-ceiling”
       • Aucouturier & Pachet (2004)
       • “Semantic Gap” between low-level music features and high-level human perception
       • MIREX AMC performance (5 classes)

         Year    Top 3 accuracies
         2007    61.50%, 60.50%, 59.67%
         2008    63.67%, 58.20%, 56.00%
         2009    65.67%, 65.50%, 63.67%
         2010    63.83%, 63.50%, 63.17%
         2011    69.50%, 67.17%, 66.67%
         2012    67.83%, 67.67%, 67.17%
       Aucouturier, J-J., & Pachet, F. (2004), Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1).

     Multimodal Classification
       • Improving classification performance by combining multiple independent sources
       • Sources around MUSIC: social tags, metadata, lyrics, audio [diagram]
       • Cited work: Bischoff et al., 2009; Schuller et al., 2011; Yang & Lee, 2004; Laurier et al., 2009; Hu & Downie, 2010
  13. Lyric Features
       • Basic features: content words, part-of-speech, function words
       • Lexicon features: words in WordNet-Affect
       • Psycholinguistic features:
           - Psychological categories in GI (General Inquirer)
           - Scores in ANEW (Affective Norms for English Words)
       • Stylistic features:
           - Punctuation marks; interjection words
           - Statistics: e.g., how many words per minute
       Hu, X. & Downie, J. S. (2010) Improving Mood Classification in Music Digital Libraries by Combining Lyrics and Audio, JCDL

     Lyric Feature Example
       ANEW examples
         Word     Valence   Arousal   Dominance
         Happy    8.21      6.49      6.63
         Sad      1.61      4.13      3.45
         Thrill   8.05      8.02      6.54
         Kiss     8.26      7.32      6.93
         Dead     1.94      5.73      2.84
         Dream    6.73      4.53      5.53
         Angry    2.85      7.17      5.55
         Fear     2.76      6.96      3.22

       Top General Inquirer (GI) features in category “Aggressive”
         GI Feature   Description                                                      Example
         WlbPhys      words connoting the physical aspects of well being,              blood, dead, drunk, pain
                      including its absence
         Perceiv      words referring to the perceptual process of recognizing         dazzle, fantasy, hear, look,
                      or identifying something by means of the senses                  make, tell, view
         Exert        action words                                                     hit, kick, drag, upset
         TIME         words indicating time                                            noon, night, midnight
         COLL         words referring to all human collectivities                      people, gang, party
         WlbLoss      words related to a loss in a state of well being,                burn, die, hurt, mad
                      including being upset

     Distribution of feature “!” [figure]

     Lyric Classification Results
       • No significant difference between top combinations [figure]

     Distribution of feature “hey” [figure]

     “number of words per minute” [figure]
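To illustrate how the psycholinguistic and stylistic lyric features above might be turned into numbers, here is a minimal sketch. The ANEW scores are the eight examples listed on the slide; the tokenization, the averaging of scores over matched words, and the helper names `anew_features` and `stylistic_features` are assumptions for illustration, not the feature extraction used by Hu & Downie.

```python
# (valence, arousal, dominance) scores copied from the ANEW examples on the slide.
ANEW = {
    "happy": (8.21, 6.49, 6.63),
    "sad": (1.61, 4.13, 3.45),
    "thrill": (8.05, 8.02, 6.54),
    "kiss": (8.26, 7.32, 6.93),
    "dead": (1.94, 5.73, 2.84),
    "dream": (6.73, 4.53, 5.53),
    "angry": (2.85, 7.17, 5.55),
    "fear": (2.76, 6.96, 3.22),
}


def anew_features(lyrics):
    """Mean (valence, arousal, dominance) over lyric words found in ANEW, or None."""
    words = [w.strip(".,!?;:\"'()").lower() for w in lyrics.split()]
    hits = [ANEW[w] for w in words if w in ANEW]
    if not hits:
        return None  # no lexicon word in these lyrics
    return tuple(sum(h[i] for h in hits) / len(hits) for i in range(3))


def stylistic_features(lyrics):
    """Counts of two stylistic cues shown on the slides: '!' and the interjection 'hey'."""
    words = [w.strip(".,!?;:\"'()").lower() for w in lyrics.split()]
    return {"exclamations": lyrics.count("!"), "hey": words.count("hey")}


# Hypothetical lyric line (illustration only).
line = "Hey! I dream of a happy kiss, hey hey!"
vad = anew_features(line)
style = stylistic_features(line)
```

For this line, three ANEW words match ("dream", "happy", "kiss"), so `vad` is the mean of their three score triples, while `style` counts two exclamation marks and three occurrences of "hey".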
  14. Combine with Audio-based Classifier
       • A leading system in MIREX AMC 2007 and 2008: Marsyas
           - Music Analysis, Retrieval and Synthesis for Audio Signals
           - Led by Prof. Tzanetakis at University of Victoria
           - Uses audio spectral features
           - Finalist in the Sourceforge Community Choice Awards 2009

     Hybrid Methods
       • Late fusion: the lyric classifier and the audio classifier each produce a prediction, and the two predictions are combined into a final prediction
           - Dominates due to clarity and the avoidance of the “curse of dimensionality”
       • Feature concatenation (early fusion): lyric and audio features are concatenated and fed to a single classifier, which produces the prediction

     Effectiveness [figure: Audio, Lyrics, Hybrid (late fusion), Hybrid (early fusion)]

     Audio vs. Lyrics [figure]

     Learning Curves [figure]
       Hu & Downie (2010) When Lyrics Outperform Audio for Music Mood Classification: A Feature Analysis, ISMIR
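The two hybrid methods can be sketched as follows. The slides do not state how the lyric and audio predictions are combined in late fusion; the weighted average of class probabilities below is one common choice and is an assumption, as are the function names and the example probabilities.

```python
def late_fusion(p_lyrics, p_audio, weight_lyrics=0.5):
    """Combine two classifiers' class probabilities by weighted averaging.

    Returns (best_label, fused_probabilities). The weighting scheme is an
    assumption; the tutorial does not specify the combination rule.
    """
    if not 0.0 <= weight_lyrics <= 1.0:
        raise ValueError("weight_lyrics must be in [0, 1]")
    fused = {
        label: weight_lyrics * p_lyrics[label] + (1.0 - weight_lyrics) * p_audio[label]
        for label in p_lyrics
    }
    return max(fused, key=fused.get), fused


def early_fusion(lyric_features, audio_features):
    """Feature concatenation: one longer vector for a single classifier."""
    return list(lyric_features) + list(audio_features)


# Hypothetical posteriors from a lyric classifier and an audio classifier.
p_lyrics = {"happy": 0.6, "sad": 0.3, "angry": 0.1}
p_audio = {"happy": 0.3, "sad": 0.5, "angry": 0.2}

label_equal, fused_equal = late_fusion(p_lyrics, p_audio)           # equal weights
label_audio, _ = late_fusion(p_lyrics, p_audio, weight_lyrics=0.2)  # trust audio more
combined = early_fusion([0.1, 0.9], [12.5, 3.0, 7.7])
```

Late fusion keeps each classifier in its own, smaller feature space, which is the point the slide makes about avoiding the curse of dimensionality; early fusion grows the feature vector instead.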
  15. Top Lyric Features [figure]

     Top Lyric Features in “Calm” [figure]

     Top Affective Words [figure]

     Other Textual Features used in Music Mood Classification
       • Based on SentiWordNet
           - Assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity
       • Simple syntactic structures
           - Negation, modifier
       • Lyric rhyme patterns (inspired by poems)
       • Contextual features (beyond lyrics)
           - Social tags, blogs, playlists, etc.

     Cross-cultural Mood Classification
       • Tomorrow, Oral Session 1
       • Cross-cultural model applicability:
           - 23 mood categories based on [logo]
           - Train on songs in one culture and classify songs in the other
       Yang & Hu (2012) Cross-cultural Music Mood Classification: A Comparison on English and Chinese Songs, ISMIR

     Summary of Categorical and Multimodal Approaches
       • Natural language labels are intuitive to end users
       • Based on supervised learning techniques
       • Studies mostly focus on feature engineering
       • Multimodal approaches improve performance
           - Effectiveness and efficiency
       • Cross-cultural mood classification: just started
       • Challenges
           - Ambiguity inherent in terms (Meyer’s “distortion”)
           - Hierarchy of mood categories
           - Connections between features and mood categories
  16. Agenda [repeated slide]

     Dimensional Approach
       • What the dimensional model is, and why
       • Computational model for dimensional music emotion recognition
       • Issues
           - Difficulty of emotion rating
           - Subjectivity of emotion perception
           - Context of music listening
           - Usability of UI

     Categorical Approach [figure: audio spectrum mapped to Hevner’s model (1936)]

     Dimensional Approach [figure: audio spectrum mapped to the circumplex model (Russell 1980)]

     The Valence-Arousal (VA) Emotion Model [psp80]
       • Activation-Arousal
           ○ Energy or neurophysiological stimulation level
       • Evaluation-Valence
           ○ Pleasantness
           ○ Positive and negative affective states

     What is the Dimensional Model
       • Alternative conceptualization of emotions based on their placement along broad affective dimensions
       • It is obtained by analyzing “similarity ratings” of emotion words or facial expressions by factor analysis or multi-dimensional scaling
       • For example, Russell (1980) asked 343 subjects to describe their emotional states using 28 emotion words and used four different methods to analyze the correlation between the emotion ratings
       • Many studies identify similar dimensions
  17. More Dimensions
       • The world of emotions is not 2D (Fontaine et al., 2007)
       • 3rd dimension: potency-control
           - Feeling of power/weakness; dominance/submission
           - Anger ↔ fear
           - Pride ↔ shame
           - Interest ↔ disappointment
       • 4th dimension: predictability
           - Surprise
           - Stress ↔ fear
           - Contempt ↔ disgust
       • However, the 2D model seems to work fine for music emotion

     Why the Dimensional Model 1/3
       • Free of emotion words
           - Emotion words are not always precise and consistent
           - We often cannot find proper words to express our feelings
           - Different people have different understandings of the words
           - Emotion words are difficult to translate and might not exist with the exact same meaning in different languages (Russell 1991)
       • Semantic overlap between emotion categories
           - Cheerful, happy, joyous, party/celebratory
           - Melancholy, gloomy, sad, sorrowful
       • Difficult to determine how many and what categories to use in a mood classification system

     No Consensus on Mood Taxonomy in MIR
       Work                              #    Emotion description
       Katayose et al [icpr98]           4    Gloomy, urbane, pathetic, serious
       Feng et al [sigir03]              4    Happy, angry, fear, sad
       Li et al [ismir03],               13   Happy, light, graceful, dreamy, longing, dark, sacred, dramatic,
       Wieczorkowska et al [imtci04]          agitated, frustrated, mysterious, passionate, bluesy
       Wang et al [icsp04]               6    Joyous, robust, restless, lyrical, sober, gloomy
       Tolos et al [ccnc05]              3    Happy, aggressive, melancholic+calm
       Lu et al [taslp06]                4    Exuberant, anxious/frantic, depressed, content
       Yang et al [mm06]                 4    Happy, angry, sad, relaxed
       Skowronek et al [ismir07]         12   Arousing, angry, calming, carefree, cheerful, emotional, loving,
                                              peaceful, powerful, sad, restless, tender
       Wu et al [mmm08]                  8    Happy, light, easy, touching, sad, sublime, grand, exciting
       Hu et al [ismir08]                5    Passionate, cheerful, bittersweet, witty, aggressive
       Trohidis et al [ismir08]          6    Surprised, happy, relaxed, quiet, sad, angry

     Why the Dimensional Model 2/3
       • Reliable and economical model
           - Only two variables (valence, arousal), instead of tens or hundreds of mood tags
           - Easy to compare the performance of different systems
       • Suitable for continuous measurements
           - Emotions may change over time [figure: emotion changes as time unfolds, in the arousal-valence plane]
           - Emotion intensity (very angry, angry, neutral)
       • More precise and intuitive than emotion words

     Why the Dimensional Model 3/3
       • Ready canvas for user interaction
           - Emotion-based retrieval
           - Song collection navigation
       [figure; three dimensions are used: valence, arousal, synthetic/acoustic]

     Mapping Songs to the VA Space
       • Assumption
           - View the VA space as a continuous, Euclidean space
           - View each point as an emotional state (valence, arousal)
       • Goal
           - Given a short music clip (e.g., 10 to 30 seconds)
           - Automatically compute a pair of valence and arousal (VA) values that best quantify (summarize) the expressed emotion of the overall clip
       • The research on time-dependent second-by-second emotion recognition (emotion tracking) will be introduced in the next session
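Treating the VA space as a continuous Euclidean space makes emotion-based retrieval a nearest-neighbor search: the user picks a point on the plane, and the system returns the songs whose (valence, arousal) values lie closest. The sketch below assumes songs already have VA values in [-1, 1]; the song names, coordinates, and the `nearest_songs` helper are hypothetical.

```python
import math

# Hypothetical (valence, arousal) coordinates in [-1, 1]; not values from the tutorial.
SONGS = {
    "song_a": (0.8, 0.7),    # positive valence, high arousal
    "song_b": (-0.6, 0.8),   # negative valence, high arousal
    "song_c": (-0.7, -0.5),  # negative valence, low arousal
    "song_d": (0.6, -0.4),   # positive valence, low arousal
}


def nearest_songs(query, songs, k=2):
    """Return the k song names closest to the query point in the VA plane."""
    ranked = sorted(songs, key=lambda name: math.dist(query, songs[name]))
    return ranked[:k]


# A user clicks a point in the positive-valence, high-arousal region.
results = nearest_songs((0.7, 0.5), SONGS)
```

Because distance is defined everywhere on the plane, the same function supports both retrieval from a clicked point and navigation of a collection by moving that point around.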
  18. How to Predict Emotion Values 1/3
       • Sol (A): by dividing the emotion space into several mood classes
           - For example, into 16 classes
       • Pros
           - Standard classification problem y = f(x), where x is a feature vector and y is a discrete label (1-16)
       • Cons
           - Poor granularity of the emotion space (not really VA values)
       [image: Moody by Crayonroom]

     How to Predict Emotion Values 2/3
       • Sol (B): by further exploiting the “geographic information” (Yang et al., 2006)
           - For example, perform binary classification for each quadrant
           - Apply arithmetic operations to the probability estimates (u denotes likelihood):
               Valence = u1 + u4 - u2 - u3
               Arousal = u1 + u2 - u3 - u4
       • Pros
           - Easy to compute
       • Cons
           - Lacks theoretical foundation
       [figure: probability estimates (0 to 1) for class 1 to class 4]

     How to Predict Emotion Values 3/3
       • Sol (C): by means of regression (Yang et al., 2007, 2008; MacDorman et al., 2007; Eerola et al., 2009)
           - Given features, predict a numerical value
           - One regressor for valence, one for arousal:
               yv = fv(x), ya = fa(x), where x is a feature vector and yv and ya are both numerical values
       • Pros
           - Regression analysis is theoretically sound and well-developed
           - Many good off-the-shelf regression algorithms
       • Cons
           - Requires ground truth “emotion values”
           - Need to ask human subjects to “rate” the emotion values of songs

     Linear Regression: Example
       • Linear regression: f(x) = w^T x + b
       • Possible (hypothesized) w for valence and arousal:

                    loudness       tempo          pitch level   harmony                  mode
                    (loud/soft)    (fast/slow)    (high/low)    (consonant/dissonant)    (major/minor)
         valence    0              0              0             1                        1
         arousal    1              1              1             0                        0

           - positive valence = consonant harmony & major mode
           - high arousal = loud loudness & fast tempo & high pitch
       • Nonlinear regression functions can also be used

     Computational Framework
       • Emotion annotation: obtain y for training data
       • Feature extraction: obtain x
       • Regression model training: obtain w
       • Automatic prediction: obtain y for test data
       [diagram: training data -> emotion annotation (y) and feature extraction (x) -> regressor training (w); test data -> feature extraction (x) -> regressor -> automatic prediction (y)]

     Feature Extraction: Get x
       Extractor                   Language   Features
       Marsyas-0.2                 C          MFCC, LPCC, spectral properties (centroid, moment, flatness, crest factor)
       MIR toolbox                 Matlab     Spectral features, rhythm features, pitch, key clarity, harmonic change, mode
       MA toolbox                  Matlab     MFCC, spectral histogram, periodic histogram, fluctuation pattern
       PsySound                    Matlab     Psychoacoustic model-based features (loudness, sharpness, roughness,
                                              virtual pitch, volume, timbre width, dissonance)
       Rhythm pattern extractor    Matlab     Rhythm pattern, beat histogram, tempo
       EchoNest API                Python     Timbre, pitch, loudness, key, mode, tempo
       MPEG-7 audio encoder        Java       Spectral properties, harmonic ratio, noise level, fundamental frequency type
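Two of the prediction strategies above reduce to a few lines of arithmetic. The sketch below applies the Sol (B) formulas to hypothetical quadrant likelihoods, and evaluates f(x) = w^T x + b with the hypothesized weights from the linear regression example. The quadrant numbering (1: +V +A, 2: -V +A, 3: -V -A, 4: +V -A) is inferred from the signs in the formulas, and the 0/1 feature coding and the bias b = 0 are assumptions.

```python
def va_from_quadrant_likelihoods(u1, u2, u3, u4):
    """Sol (B): combine per-quadrant likelihoods into valence and arousal.

    Assumed numbering: quadrant 1 = +V +A, 2 = -V +A, 3 = -V -A, 4 = +V -A.
    """
    valence = u1 + u4 - u2 - u3
    arousal = u1 + u2 - u3 - u4
    return valence, arousal


# Feature order follows the table: loudness, tempo, pitch level, harmony, mode.
W_VALENCE = (0, 0, 0, 1, 1)  # hypothesized weights from the slide
W_AROUSAL = (1, 1, 1, 0, 0)


def linear_predict(x, w, b=0.0):
    """Sol (C) with a linear model: f(x) = w^T x + b."""
    if len(x) != len(w):
        raise ValueError("feature and weight vectors must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical likelihoods: the clip most likely belongs to quadrant 1.
valence_b, arousal_b = va_from_quadrant_likelihoods(0.7, 0.2, 0.05, 0.05)

# Hypothetical clip coded 1/0 per feature: loud, fast, high-pitched, dissonant, minor.
clip = (1, 1, 1, 0, 0)
valence_c = linear_predict(clip, W_VALENCE)
arousal_c = linear_predict(clip, W_AROUSAL)
```

With these inputs Sol (B) gives a clearly positive valence and high arousal, while the linear model scores the loud, fast, high-pitched clip as high arousal and zero valence, matching the reading of the weight table. In practice w would be learned from annotated data rather than set by hand.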
  19. Relevant Features [Gomez and Danuser, 2007]
       • Sound intensity
       • Tempo
       • Rhythm
       • Pitch range
       • Mode (major)
       • Consonance

     Example Matlab Code for Extracting MFCC Using the MA Toolbox
       [code screenshot; annotations: DC value; we take 20 coefficients; take mean & STD along time]

     Emotion Annotation: Get y
       • Rate the VA values of each song
           - Ordinal rating scale
           - Scroll bar
       • Only need to annotate y for the training data; y for the test data can be automatically predicted by our regression model
       [diagram: computational framework, as on the previous page]

     Example System
       • Data set (Yang et al., 2008)
           - 195 pop songs (Chinese, Japanese, and English)
           - Each song is rated by 10+ subjects
           - Ground truth is set by averaging
       • Use Marsyas and PsySound to extract features
       • Model learning (get w)
           - Linear regression
           - Adaboost.RT (nonlinear)
           - Support vector regression (SVR) (nonlinear)
       Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H.-H. Chen (2008) A regression approach to music emotion recognition, IEEE TASLP 16(2)

     Performance Evaluation
       • Evaluation metric: R^2 statistics
           - Squared correlation between estimate and ground truth
           - The higher the better
           - R^2 = 1: perfect fit
           - R^2 = 0: random guess
       • 10-fold cross validation
           - 9/10 of the data for training and 1/10 for testing
           - Repeat 20 times to get the average result

     Quantitative Result
       Method                                        R^2 of valence   R^2 of arousal
       Linear regression                             0.109            0.568
       Adaboost.RT [ijcnn04]                         0.117            0.553
       SVR (support vector regression) [sc04]        0.222            0.570
       SVR + RReliefF (feature selection) [ml03]     0.254            0.609

       • Result
           - SVR (nonlinear) performs the best
           - Feature selection by the algorithm RReliefF offers a gain
           - Valence: 0.254; Arousal: 0.609
       • Valence is more difficult to model (it is more subjective)
           - Valence: 0.25 - 0.35
           - Arousal: 0.60 - 0.85
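The evaluation protocol above (R^2 as the squared correlation between estimates and ground truth, 10-fold cross validation, repeated 20 times) can be sketched in plain Python. To stay short, the sketch uses a single-feature least-squares line in place of the multi-feature regressors from the slides, pools the held-out predictions of each repetition before computing R^2, and runs on synthetic data; all three are assumptions, and the function names are hypothetical.

```python
import random


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between predictions and ground truth."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return 0.0  # a constant series carries no correlation information
    return cov * cov / (var_t * var_p)


def fit_line(xs, ys):
    """One-feature least squares: returns (w, b) for y = w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return w, mean_y - w * mean_x


def repeated_kfold_r2(xs, ys, k=10, repeats=20, seed=0):
    """Average R^2 over `repeats` shuffled k-fold cross-validation runs."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        predictions = [0.0] * len(xs)
        for fold in range(k):
            test = set(indices[fold::k])  # every k-th shuffled index is held out
            train = [i for i in indices if i not in test]
            w, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
            for i in test:
                predictions[i] = w * xs[i] + b
        scores.append(r_squared(ys, predictions))
    return sum(scores) / len(scores)


# Synthetic, noise-free data (illustration only): arousal rises linearly with a feature.
feature = [float(i) for i in range(40)]
arousal = [2.0 * x + 1.0 for x in feature]
score = repeated_kfold_r2(feature, arousal)
```

On this noise-free line the cross-validated R^2 is essentially 1; with real annotations the score drops, and the slides report that it drops much further for valence than for arousal.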