What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Features?

Much research in MIR is based on descriptors computed from audio signals. Some music corpora use different audio encodings, some do not contain audio but descriptors already computed in some particular way, and sometimes we have to gather audio files ourselves. We thus assume that descriptors are robust to these changes and algorithms are not affected. We investigated this issue for MFCCs and Chroma: how do encoding quality, analysis parameters and musical characteristics affect their robustness?

Julián Urbano, Dmitry Bogdanov, Perfecto Herrera, Emilia Gómez and Xavier Serra
Department of Information and Communication Technologies
Contact: julian.urbano@upf.edu · http://mtg.upf.edu
Supported by the A4U postdoctoral grants programme and the projects SIGMUS (TIN2012-36650), CompMusic (ERC 267583), PHENICX (ICT-2011.8.2) and GiantSteps (ICT-2013-10).

1. What factors did we study?

Encoding Quality
  • Sampling Rate: 22050 and 44100 Hz
  • Codec: WAV, MP3 CBR and MP3 VBR
  • Bitrate: 64 to 320 Kbps

Analysis Parameters
  • Analysis Tool:
      • Lib1 (Essentia 2.0.1)
          • MFCCs: 40 mel bands, bins equally spaced, 0-11000 Hz
          • Chroma: 40-5000 Hz, estimates the tuning frequency
      • Lib2 (QM Vamp Plugins 1.7)
          • MFCCs: 40 mel bands, 66-6364 Hz
          • Chroma: 65-2093 Hz, constant-Q transform, assumes tuning at 440 Hz, ignores harmonics and fixes the frame size to 16384
  • Frame Size: 1024, 2048, …, 32768 samples

Musical Characteristics
  • Genre: blues, classical, rock, jazz, disco/funk/soul, country, electronic, rap/hip-hop, reggae, rock'n'roll

2. What did we do?

Corpus
  • Compiled an ad-hoc corpus of 400 music tracks by 395 different artists, uniformly covering all 10 genres
  • Tracks clipped to 30 seconds for efficiency

File versions
  • Original: lossless, encoded in FLAC
  • Derived: lossy versions encoded in MP3, for all combinations of Sampling Rate, Codec and Bitrate

Method
  • Compute MFCC and Chroma vectors from all files (the first MFCC coefficient is removed)
  • Summarize the frame-wise feature vectors with the mean of each coefficient
  • Compute indicators of robustness between the derived and the original descriptors
  • Block the analysis by Tool, Sampling Rate and Frame Size (these are not mixed in practice)

3. How did we measure robustness?

Five robustness indicators (error) between the lossless and lossy descriptors (see the code sketch right after this section):
  • Mean relative error δ across coefficients (%)
  • Euclidean distance ε
  • Pearson correlation r
  • Spearman correlation ρ
  • Cosine similarity θ

We obtain 144400 datapoints per indicator and analyze their distribution:
  • High robustness (error close to 0) is good if we have heterogeneous encodings
  • Low stability (high variance σ² of the error) is bad with heterogeneous encodings, but we can analyze which factor is responsible for the variability and control it

We fitted random-effects models to study each effect separately (a simplified sketch appears at the very end of this page):
  • Controllable factors (Frame Size, Codec and Bitrate): fitted the main effects and all interactions among them
  • Uncontrollable factors (Genre and Track): fitted only the main effects
  • ANOVA to estimate the variance components (the contribution of each individual effect)
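The summarization step and the five indicators above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration for a single track and a single lossless/lossy pair, not the exact implementation behind the poster; in particular, how δ treats coefficients that are exactly zero is an assumption.

```python
import numpy as np
from scipy import stats


def summarize(frames):
    """Per-track summary: mean of each coefficient over all frames.

    `frames` is a 2-D array (n_frames x n_coefficients), e.g. frame-wise
    MFCCs with the first coefficient already removed, or frame-wise Chroma.
    """
    return np.asarray(frames, dtype=float).mean(axis=0)


def robustness_indicators(original, derived):
    """Five indicators between a lossless and a lossy summary descriptor."""
    original = np.asarray(original, dtype=float)
    derived = np.asarray(derived, dtype=float)

    # delta: mean relative error across coefficients, in percent
    # (how to treat coefficients that are exactly zero is left open here)
    delta = 100.0 * np.mean(np.abs(derived - original) / np.abs(original))
    # epsilon: Euclidean distance between the two summary vectors
    epsilon = np.linalg.norm(derived - original)
    # r and rho: Pearson and Spearman correlation of the coefficients
    r, _ = stats.pearsonr(original, derived)
    rho, _ = stats.spearmanr(original, derived)
    # theta: cosine similarity
    theta = (original @ derived) / (np.linalg.norm(original) * np.linalg.norm(derived))
    return {"delta": delta, "epsilon": epsilon, "r": r, "rho": rho, "theta": theta}
```

Applying robustness_indicators to every derived file against its original, for every tool, sampling rate and frame size, yields the per-indicator distributions analyzed in the next sections.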
4. How robust are MFCC features?

Variance components (% of total) in the distributions of robustness of MFCCs:

                                           22050 Hz               |                44100 Hz
                                 δ       ε       r       ρ       θ |      δ       ε       r       ρ       θ
  Lib1
  σ²(FSize)                     1%      3%      2%      0%      2% |     0%      0%      0%      0%      0%
  σ²(Codec)                     0%      0%      0%      0%      0% |     0%      0%      0%      0%      0%
  σ²(Brate:Codec)              31%     42%     22%      8%     21% |    47%     42%     23%     24%     22%
  σ²(FSize×Codec)               0%      0%      0%      0%      0% |     0%      0%      0%      0%      0%
  σ²(FSize×(Brate:Codec))       5%     12%     12%      1%     13% |     7%     18%     18%     11%     18%
  σ²(Genre)                     1%      5%      4%      0%      4% |     1%      1%      1%      0%      1%
  σ²(Track)                    20%      6%      6%     13%      6% |    10%      4%      3%      5%      3%
  σ²(residual)                 42%     33%     54%     79%     54% |    34%     35%     56%     60%     57%
  Grand mean               0.0591  1.6958  0.9999  0.9977  0.9999 | 0.0682  1.8820  0.9998  0.9939  0.9998
  Total variance           0.0032  3.4641  1.8e-7  3.2e-5  1.5e-7 | 0.0081   11.44  1.6e-6  0.0005  1.4e-6
  Standard deviation       0.0567  1.8612  0.0004  0.0056  0.0004 | 0.0897  3.3835  0.0013  0.0214  0.0012

  Lib2
  σ²(FSize)                     1%      0%      0%      0%      0% |     0%      0%      0%      0%      0%
  σ²(Codec)                     0%      0%      0%      0%      0% |     0%      0%      0%      0%      0%
  σ²(Brate:Codec)               5%      6%      2%      1%      3% |    23%     24%     14%     13%     15%
  σ²(FSize×Codec)               0%      0%      0%      0%      0% |     0%      0%      0%      0%      0%
  σ²(FSize×(Brate:Codec))       1%      0%      0%      0%      0% |     7%      8%     10%      6%     11%
  σ²(Genre)                     4%     15%      3%      1%      4% |     0%      5%      1%      0%      0%
  σ²(Track)                    52%     61%     32%     66%     41% |    27%     14%      7%     13%      6%
  σ²(residual)                 36%     18%     63%     32%     51% |    41%     48%     68%     67%     68%
  Grand mean               0.0622  0.0278  0.9999  0.9955  0.9999 | 0.0656  0.0342  0.9998  0.9947  0.9999
  Total variance            0.0040  0.0015  8.9e-8  0.0002  3.5e-8 | 0.0055  0.0034  6.4e-7  0.0002  4.8e-7
  Standard deviation        0.0631  0.0391  0.0003  0.0131  0.0002 | 0.0740  0.0587  0.0008  0.0150  0.0007

  • The shape of the MFCC vectors is preserved, but individual coefficients differ
  • Most of the variability is due to the Track+residual effect, which cannot be controlled anyway
  • Robustness is independent of Genre
  • Frame Size is irrelevant (except at 64 Kbps in Lib1, where low Frame Sizes are robust)
  • Lib2 is more stable than Lib1
  • But there is a large Codec:Bitrate effect, so we can achieve high robustness if we establish a minimum Bitrate

Establishing a minimum Bitrate for MFCCs (a sketch of this aggregation follows below):
  • Lib1 converges to δ ≈ 3% at 256 Kbps, and stability keeps improving with bitrate (lower within-group variance)
  • Lib2 converges to δ ≈ 5% at 160-192 Kbps, but stability does not change after 96 Kbps (same within-group variance)
  • Lib1 is twice as robust with a homogeneous encoding
  • Lib2 is more stable with heterogeneous encodings
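The minimum-Bitrate analysis above can be approximated from the per-file datapoints with a simple aggregation, sketched here with pandas. The DataFrame layout and the column names (codec, bitrate_kbps, delta) are hypothetical, not the ones used for the poster.

```python
import pandas as pd


def bitrate_profile(datapoints: pd.DataFrame) -> pd.DataFrame:
    """Mean and within-group standard deviation of delta per Codec x Bitrate.

    `datapoints` holds one row per (track, encoding) pair with the columns
    `codec`, `bitrate_kbps` and `delta` (all names are hypothetical).
    """
    return (datapoints
            .groupby(["codec", "bitrate_kbps"])["delta"]
            .agg(mean_delta="mean", within_group_std="std")
            .reset_index()
            .sort_values(["codec", "bitrate_kbps"]))

# Reading the resulting table, a reasonable minimum bitrate is the smallest one
# after which mean_delta (robustness) and within_group_std (stability) stop
# improving noticeably.
```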
5. How robust are Chroma features?

Variance components (% of total) in the distributions of robustness of Chroma:

                                           22050 Hz               |                44100 Hz
                                 δ       ε       r       ρ       θ |      δ       ε       r       ρ       θ
  Lib1
  σ²(FSize)                     2%      3%      0%      0%      0% |     2%      2%      0%      0%      1%
  σ²(Genre)                     3%      3%      1%      1%      1% |     3%      3%      1%      1%      1%
  σ²(Track)                    21%     19%     18%     19%     17% |    22%     21%     19%     20%     19%
  σ²(residual)                 75%     75%     81%     80%     82% |    72%     74%     80%     78%     80%
  Grand mean               0.0610  0.0545  0.9554  0.9366   0.992 | 0.0588  0.0521  0.9549  0.9375  0.9922
  Total variance           0.0046  0.0085  0.0276  0.0293  0.0014 | 0.0048  0.0082  0.0286  0.0298  0.0013
  Standard deviation       0.0682  0.0924  0.1663  0.1713  0.0373 | 0.0695  0.0904  0.1691  0.1725  0.0355

  Lib2
  σ²(Codec)                    64%     35%      0%      0%      0% |    32%     22%      0%      0%      0%
  σ²(Brate:Codec)               1%      0%      0%      0%      0% |    62%     40%      0%      0%      0%
  σ²(Genre)                     0%     16%      3%      4%      8% |     1%     10%      3%      1%      4%
  σ²(Track)                    19%     33%     97%     93%     92% |     3%     14%     94%     93%     77%
  σ²(residual)                 16%     17%      0%      3%      0% |     2%     15%      2%      6%     19%
  Grand mean               0.0346  0.0031  0.9915  0.9766  0.9998 | 0.0260  0.0022  0.9989  0.9928  1.0000
  Total variance            0.0004    5e-6  0.0002  0.0007  6.1e-8 | 0.0005  4.8e-6  3.7e-6  0.0001  1.8e-9
  Standard deviation        0.0195  0.0022  0.0135  0.0270  0.0002 | 0.0213  0.0022  0.0019  0.0122  0.0000

Note: there is no effect of Codec in Lib1, so it is omitted from the table; Frame Size is omitted in Lib2 because it is fixed at 16384.

  • Lib2 is more robust than Lib1, and the shape of the Chroma vector is preserved as well
  • Lib2 is more stable than Lib1
  • Almost all of the variability is due to the uncontrollable Track+residual effect
  • But much of the variability in δ and ε with Lib2 is due to Codec and Bitrate; this is avoided by normalizing the Chroma vector to unit maximum

6. What are the practical implications?

Consider genre classification as an example (a sketch follows below):
  • One SVM per Sampling Rate, Codec, Bitrate, Tool and feature
  • Accuracy shows no significant differences, whether training and testing use the same encoding or different encodings
  • Accuracy is nevertheless always best when training and testing use the same encoding
  • There is no correlation between bitrate and accuracy; the few differences are attributable to Type I errors
  • Other low-level tasks, more likely to be affected by lossy compression, should be studied
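A hedged sketch of the genre-classification check in section 6, using scikit-learn: train an SVM on descriptors computed from one encoding and evaluate it on descriptors of held-out tracks from another (or the same) encoding. The scaler, kernel and the train/test split over tracks are illustrative assumptions; the poster does not detail the classifier configuration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cross_encoding_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of an SVM trained on one encoding and tested on another.

    X_train and X_test hold summarized descriptors (e.g. mean MFCCs) of
    disjoint sets of tracks, possibly computed from different encodings;
    y_train and y_test are the corresponding genre labels.
    """
    classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    classifier.fit(X_train, y_train)
    return classifier.score(X_test, y_test)

# Comparing the accuracy obtained with matching encodings (e.g. train and test
# on 320 Kbps features) against mismatched ones (train on 320 Kbps, test on
# 64 Kbps) reproduces the kind of check summarized in section 6.
```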

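Finally, returning to the random-effects analysis described in section 3: a deliberately simplified sketch of how variance components like those in the tables above could be estimated with statsmodels. Only crossed Genre and Track components are fitted, for a single indicator; the full design with Codec, Bitrate and Frame Size interactions, and the exact estimation procedure used for the poster, are not reproduced here. The column names (delta, genre, track) are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf


def genre_track_components(datapoints: pd.DataFrame):
    """Random-effects fit of delta with crossed Genre and Track components.

    `datapoints` has one row per observation with hypothetical columns
    `delta`, `genre` and `track`.
    """
    df = datapoints.copy()
    # A single all-encompassing group lets the variance components below act
    # as crossed random effects rather than nested ones.
    df["all_data"] = 1
    vc = {"genre": "0 + C(genre)", "track": "0 + C(track)"}
    fit = smf.mixedlm("delta ~ 1", data=df, groups="all_data", vc_formula=vc).fit()
    # fit.vcomp holds the Genre and Track variance estimates and fit.scale the
    # residual variance; the fitted summary labels each component by name.
    print(fit.summary())
    return fit
```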