Turnbull Music Vocab

1. Identifying Words that are Musically Meaningful
   David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet
   Computer Audition Lab, UC San Diego
   ISMIR, September 25, 2007
2. Introduction
   • Our goal: create a content-based music search engine for natural-language queries.
     • CAL Music Search Engine [SIGIR07]
   • Problem: how do we pick a vocabulary of musically meaningful words?
     • A word is meaningful if its presence corresponds to a pattern in the audio content.
   • Solution: find words that are correlated with a set of acoustic signals.
3. Two-View Representation
   • Consider a set of annotated songs. Each song is represented by:
     • an annotation vector in a semantic space
     • an audio feature vector (or vectors) in an acoustic space
   [Figure: the same songs ("Mustang Sally" by The Commitments, "Riverdance" by Bill Whelan, "Hot Pants" by James Brown) plotted both in a 2-D semantic space with axes 'funky' and 'Ireland' and in a 2-D acoustic space with axes x and y.]
4. Semantic Representation
   • Vocabulary of words:
     • CAL500: 174 phrases from a human survey
       • instrumentation, genre, emotion, usages, vocal characteristics
     • LastFM: ~15,000 tags from a social music site
     • Web mining: 100,000+ words mined from text documents
   • Annotation vector, denoted s:
     • each element represents the 'semantic association' between a word and the song
     • dimension D_S = size of the vocabulary
     • Example: Frank Sinatra's 'Fly Me to the Moon'
       • vocabulary = {funk, jazz, guitar, female vocals, sad, passionate}
       • annotation s_i = [0/4, 3/4, 4/4, 0/4, 2/4, 1/4]
   • The data set is represented by an N x D_S matrix S whose rows are the annotation vectors s_1, ..., s_N (a sketch of this construction follows below).
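A minimal sketch of how one annotation vector might be assembled from survey votes; the vocabulary and counts are taken from the slide's example, and the variable names are mine:

```python
import numpy as np

# Illustrative vocabulary and survey votes (from the slide's example):
# 4 listeners annotated "Fly Me to the Moon"; each element counts how many
# listeners applied that word, normalized by the number of listeners.
vocab = ["funk", "jazz", "guitar", "female vocals", "sad", "passionate"]
votes = {"jazz": 3, "guitar": 4, "sad": 2, "passionate": 1}
n_listeners = 4

s_i = np.array([votes.get(w, 0) / n_listeners for w in vocab])
print(s_i)  # [0.   0.75 1.   0.   0.5  0.25]

# Stacking one such vector per song gives the N x D_S matrix S:
# S = np.vstack([s_1, ..., s_N])  # one row per annotated song
```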
5. Acoustic Representation
   • Each song is represented by an audio feature vector a that is automatically extracted from the audio content.
   • The data set is represented by an N x D_A matrix A whose rows are the feature vectors a_1, ..., a_N (a small setup sketch follows below).
   [Figure: "Mustang Sally" shown as a point in both the 2-D semantic space ('funky', 'Ireland') and the 2-D acoustic space (x, y).]
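The two matrices are row-aligned: row i of S and row i of A describe the same song. A tiny setup sketch with placeholder data (real features come from the Dynamic MFCCs on slide 11; the matrix sizes match those used later in the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder two-view data: rows of S and A must describe the same song.
N, D_S, D_A = 500, 173, 52              # sizes used later in the talk
S = rng.random((N, D_S))                # stand-in annotation matrix
A = rng.standard_normal((N, D_A))       # stand-in audio feature matrix

# CCA works on centered data, so remove each column's mean in both views.
S_c = S - S.mean(axis=0)
A_c = A - A.mean(axis=0)
```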
6. Canonical Correlation Analysis (CCA)
   • CCA is a technique for exploring dependencies between two related spaces.
     • a generalization of PCA to multiple spaces
     • a constrained optimization problem:
       • find weight vectors w_s and w_a that define
         • a 1-D projection of the data in the semantic space, Sw_s
         • a 1-D projection of the data in the acoustic space, Aw_a
       • maximize the correlation of the two projections
       • constrain w_s and w_a to prevent infinite correlation

         maximize    (Sw_s)^T (Aw_a)
         over        w_a, w_s
         subject to  (Sw_s)^T (Sw_s) = 1
                     (Aw_a)^T (Aw_a) = 1

   (A numeric sketch of solving this problem follows below.)
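The first canonical pair has a closed-form solution via an SVD of the whitened cross-covariance. A minimal numpy sketch (the function names are mine, not from the paper):

```python
import numpy as np

def inv_sqrt(M, eps=1e-8):
    # Symmetric inverse square root via eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def cca_first_pair(S, A):
    S = S - S.mean(axis=0)           # CCA assumes centered views
    A = A - A.mean(axis=0)
    Css, Caa, Csa = S.T @ S, A.T @ A, S.T @ A
    Kss, Kaa = inv_sqrt(Css), inv_sqrt(Caa)
    U, sing, Vt = np.linalg.svd(Kss @ Csa @ Kaa)
    w_s = Kss @ U[:, 0]              # (Sw_s)^T(Sw_s) = 1 by construction
    w_a = Kaa @ Vt[0]                # (Aw_a)^T(Aw_a) = 1 by construction
    return w_s, w_a, sing[0]         # sing[0] is the achieved correlation
```

The unit-norm constraints are satisfied automatically because the weights are pulled back through the inverse square roots of the within-view covariances.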
7. CCA Visualization
   Worked example with four songs (a, b, c, d) in a 2-D semantic space ('funky', 'Ireland') and a 2-D audio feature space (x, y):

       S = [  1  1 ]        A = [  1 -1 ]
           [  0 -1 ]            [  1  1 ]
           [  0 -1 ]            [ -1 -1 ]
           [ -1 -1 ]            [ -1  1 ]

       w_s = [1, 0]^T   (a sparse solution: zero weight on 'Ireland')
       w_a = [1, -1]^T

       Sw_s = [1, 0, 0, -1]^T
       Aw_a = [2, 0, 0, -2]^T

       (Sw_s)^T (Aw_a) = (1)(2) + (0)(0) + (0)(0) + (-1)(-2) = 4
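The slide's arithmetic, verified numerically (note the example illustrates the objective only; the projections are not normalized to unit norm as the constraints would require):

```python
import numpy as np

# The worked example from the slide.
S = np.array([[1, 1], [0, -1], [0, -1], [-1, -1]])   # 'funky', 'Ireland'
A = np.array([[1, -1], [1, 1], [-1, -1], [-1, 1]])   # audio features x, y
w_s = np.array([1, 0])    # sparse: 'Ireland' gets zero weight
w_a = np.array([1, -1])

print(S @ w_s)                    # [ 1  0  0 -1]
print(A @ w_a)                    # [ 2  0  0 -2]
print((S @ w_s) @ (A @ w_a))      # 4
```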
8. What Sparsity Means
   • In the previous example:
     • w_s,'funky' ≠ 0
       → 'funky' is correlated with the audio signals
       → a musically meaningful word
     • w_s,'Ireland' = 0
       → 'Ireland' is not correlated
       → no linear relationship with the acoustic representation
   • In practice, w_s is dense even when most words are uncorrelated:
     • 'dense' means many non-zero values
     • this is due to random variability in the data
   • Key idea: reformulate CCA to produce a sparse solution.
   (A small demonstration of this density follows below.)
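A quick synthetic demonstration of the density problem, using scikit-learn's CCA (assumed available; the data is made up): one semantic dimension genuinely tracks the audio, the other is pure noise, yet plain CCA still assigns the noise dimension a non-zero weight.

```python
import numpy as np
from sklearn.cross_decomposition import CCA  # assumes scikit-learn is available

rng = np.random.default_rng(0)
N = 500

# One semantic dimension genuinely tracks the audio; one is pure noise.
audio = rng.standard_normal((N, 2))
semantic = np.column_stack([
    audio[:, 0] + 0.1 * rng.standard_normal(N),  # correlated ('funky')
    rng.standard_normal(N),                      # uncorrelated ('Ireland')
])

cca = CCA(n_components=1)
cca.fit(semantic, audio)
print(cca.x_weights_.ravel())
# The noise dimension's weight is small but (almost surely) non-zero:
# ordinary CCA gives a dense w_s, motivating an explicit sparsity penalty.
```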
9. Introducing Sparse CCA [ICML07]
   • Plan: penalize the objective function for each non-zero semantic dimension.
   • Pick a penalty function f(w_s) that penalizes each non-zero dimension:
     • Take 1: cardinality of w_s: f(w_s) = |w_s|_0
       • a combinatorial problem; NP-hard
     • Take 2: L1 relaxation: f(w_s) = |w_s|_1
       • non-convex, and not a very tight approximation
     • Take 3: SDP relaxation
       • prohibitively expensive for large problems
     • Solution: f(w_s) = Σ_i log |w_s,i|
       • a non-convex problem, but
       • it can be solved efficiently with a DC (difference of convex functions) program
       • a tight approximation
10. Introducing Sparse CCA [ICML07]
    • Plan: penalize the objective function for each non-zero semantic dimension.
    • Pick a penalty function f(w_s) that penalizes each non-zero dimension:
      • f(w_s) = Σ_i log |w_s,i|
    • Use a tuning parameter η to control the importance of sparsity:
      • increasing η → a smaller set of 'musically relevant' words

        maximize    (Sw_s)^T (Aw_a) - η f(w_s)
        over        w_a, w_s
        subject to  (Sw_s)^T (Sw_s) = 1
                    (Aw_a)^T (Aw_a) = 1

    (An illustrative sparse-CCA sketch follows below.)
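The paper's log penalty is optimized with a DC program, which is too long to reproduce here. As a minimal stand-in, the sketch below uses the diagonal-covariance sparse CCA of Witten & Tibshirani (L1 soft-thresholding of the semantic weights), which is a different relaxation but shows the same qualitative behavior: raising the penalty drives more entries of w_s to exactly zero.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_cca(S, A, penalty, n_iter=100):
    """Penalized matrix decomposition (Witten & Tibshirani, 2009):
    an L1-penalized stand-in for the paper's log-penalty Sparse CCA.
    Approximates S^T S and A^T A by the identity (diagonal covariance)."""
    S = S - S.mean(axis=0)
    A = A - A.mean(axis=0)
    Z = S.T @ A                        # cross-covariance, D_S x D_A
    w_a = np.linalg.svd(Z)[2][0]       # init with leading right singular vec
    w_s = np.zeros(S.shape[1])
    for _ in range(n_iter):
        w_s = soft_threshold(Z @ w_a, penalty)
        if np.linalg.norm(w_s) == 0:   # every word pruned
            break
        w_s /= np.linalg.norm(w_s)
        w_a = Z.T @ w_s
        w_a /= np.linalg.norm(w_a)
    return w_s, w_a

# With penalty = 0 this reduces to plain (diagonal-covariance) CCA;
# larger penalties zero out more words:
# w_s, _ = sparse_cca(S, A, penalty=0.5)
# kept = [vocab[i] for i in np.nonzero(w_s)[0]]
```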
11. Experimental Setup
    • CAL500 data set [SIGIR07]
      • 500 songs by 500 artists
    • Semantic representation:
      • 173 words
        • genre, instrumentation, usages, emotions, vocals, etc.
      • annotation vector is the average over 3+ listeners
      • word agreement score: measures how consistently listeners apply a word to songs
    • Acoustic representation:
      • bag of Dynamic MFCC vectors [McKinney03] (a hedged extraction sketch follows below)
        • 52-D vector of spectral modulation intensities
        • 160 vectors per minute of audio content
      • the annotation vector is duplicated for each Dynamic MFCC vector
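McKinney-style dynamic features summarize the modulation spectrum of MFCC trajectories. A hedged sketch using librosa: the window length, hop, and band edges below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
import librosa  # assumes librosa is available

def dynamic_mfccs(path, n_mfcc=13, win=64, hop=32):
    """Sketch of McKinney-style Dynamic MFCCs: over short windows of the
    13 MFCC trajectories, measure modulation energy in 4 frequency bands,
    giving 13 x 4 = 52-D vectors. Band edges here are illustrative."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (13, T)
    frame_rate = sr / 512                 # librosa's default hop is 512 samples
    feats = []
    for start in range(0, mfcc.shape[1] - win, hop):
        seg = mfcc[:, start:start + win]
        spec = np.abs(np.fft.rfft(seg, axis=1))        # modulation spectrum
        freqs = np.fft.rfftfreq(win, d=1.0 / frame_rate)
        bands = [(0, 1), (1, 2), (2, 15), (15, 43)]    # Hz, illustrative
        v = [spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
             for lo, hi in bands]
        feats.append(np.concatenate(v))                # one 52-D vector
    return np.array(feats)
```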
12. Experiment 1: Qualitative Results
    • Words with high acoustic correlation:
      hip-hop, arousing, sad, drum machine, heavy beat, at a party, rapping
    • Words with no acoustic correlation:
      classic rock, normal, constant energy, going to sleep, falsetto
13. Experiment 2: Vocabulary Pruning
    • AMG2131 text corpus [ISMIR06]
      • AMG Allmusic song reviews for most of the CAL500 songs
      • 315-word vocabulary
      • annotation vector based on the presence or absence of a word in the review
      • noisier word-song relationships than CAL500
    • Experimental design:
      • merge the vocabularies: 173 + 315 = 488 words
      • prune noisy words as we increase the amount of sparsity in CCA
    • Hypothesis:
      • AMG words will be pruned before CAL500 words
14. Experiment 2: Vocabulary Pruning
    • Experimental design:
      • merge the vocabularies: 488 words
      • prune noisy words as we increase the amount of sparsity in CCA
    • Result: as Sparse CCA becomes more aggressive, more AMG words are pruned.

        Vocabulary Size | # CAL500 Words | # AMG2131 Words | % AMG2131 Words
        ----------------|----------------|-----------------|----------------
                    488 |            173 |             315 |            0.64
                    249 |            118 |             131 |            0.52
                    149 |             85 |              64 |            0.42
                     50 |             39 |              11 |            0.22

    (A sketch of this pruning sweep follows below.)
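A sketch of the sweep, reusing the illustrative sparse_cca stand-in from the slide-10 sketch. The names S_merged, A, and is_cal500 are hypothetical bookkeeping (a merged 488-word annotation matrix, the audio features, and a boolean flag per word marking its source vocabulary), and the penalty grid is made up:

```python
import numpy as np

# Assumes sparse_cca() from the slide-10 sketch, a merged annotation
# matrix S_merged (N x 488), audio features A, and a boolean array
# is_cal500 (True for the 173 CAL500 words, False for the 315 AMG words).
for penalty in [0.0, 0.2, 0.4, 0.8]:          # illustrative grid
    w_s, _ = sparse_cca(S_merged, A, penalty)
    kept = np.abs(w_s) > 1e-12                # surviving words
    n_cal = int((kept & is_cal500).sum())
    n_amg = int((kept & ~is_cal500).sum())
    total = n_cal + n_amg
    frac = n_amg / total if total else 0.0
    print(f"{total:4d} words kept: {n_cal} CAL500, {n_amg} AMG ({frac:.2f})")
```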
15. Experiment 3: Vocabulary Selection
    • Experimental design:
      • rank words by
        • how aggressive Sparse CCA must be before the word is pruned, and
        • how consistently humans use the word across the CAL500 corpus
      • as we decrease the vocabulary size, calculate the average AROC
    • Result: Sparse CCA does predict words that have better AROC.
      [Figure: average AROC versus vocabulary size (173, 120, 70, 20 words); AROC rises from about 0.68 to about 0.76 as the vocabulary shrinks.]
16. Recap
    • Constructing a 'meaningful vocabulary' is the first step in building a content-based, natural-language search engine for music.
    • Given a semantic representation and an acoustic representation, Sparse CCA can be used to find 'musically meaningful' words:
      • i.e., semantic dimensions linearly correlated with the audio features
    • Automatically pruning words is important when using noisy sources of semantic information:
      • e.g., LastFM tags or web documents
17. Future Work
    • Theory: moving beyond linear correlation with kernel methods
    • Application: Sparse CCA can be used to find 'musically meaningful' audio features by imposing sparsity in the acoustic space
    • Practice: handling large, noisy, semantically annotated music corpora
18. Identifying Words that are Musically Meaningful
    David Torres, Douglas Turnbull, Luke Barrington, Gert Lanckriet
    Computer Audition Lab, UC San Diego
    ISMIR, September 25, 2007
19. Experiment 3: Vocabulary Selection
    • Our content-based music search engine rank-orders songs given a text-based query [SIGIR07].
      • The area under the ROC curve (AROC) measures the quality of each ranking:
        • 0.5 is random, 1.0 is perfect
        • 0.68 is the average AROC over all 1-word queries
    • Can Sparse CCA pick words that will have higher AROC?
      • Idea: words with high correlation have more signal in the audio representation and will be easier to model.
      • How does it compare to picking words that humans consistently use to label songs? (An AROC sketch follows below.)
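AROC for a single-word query can be computed directly from the ranking. A minimal sketch (the scores and relevance labels are made up for illustration):

```python
import numpy as np

def aroc(scores, relevant):
    """Area under the ROC curve for one ranking: the probability that a
    randomly chosen relevant song is ranked above a non-relevant one."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    pos, neg = scores[relevant], scores[~relevant]
    # Count concordant (relevant-above-non-relevant) pairs; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Example: a query where relevant songs tend to score higher.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
relevant = [True, False, True, False, False]
print(aroc(scores, relevant))  # 0.8333...
```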
