
Comparison of features for musical instrument recognition



Antti Eronen
Signal Processing Laboratory, Tampere University of Technology
P.O.Box 553, FIN-33101 Tampere, Finland

21-24 October 2001, New Paltz, New York

ABSTRACT

Several features were compared with regard to recognition performance in a musical instrument recognition system. Both mel-frequency and linear prediction cepstral and delta cepstral coefficients were calculated. Linear prediction analysis was carried out both on a uniform and a warped frequency scale, and reflection coefficients were also used as features. The performance of earlier described features relating to the temporal development, modulation properties, brightness, and spectral synchrony of sounds was also analysed. The database consisted of 5286 acoustic and synthetic solo tones from 29 different Western orchestral instruments, out of which 16 instruments were included in the test set. The best performance for solo tone recognition, 35% for individual instruments and 77% for families, was obtained with a feature set consisting of two sets of mel-frequency cepstral coefficients and a subset of the other analysed features. The confusions made by the system were analysed and compared to results reported in a human perception experiment.

1. INTRODUCTION

Automatic musical instrument recognition is a fascinating and essential subproblem in music indexing, retrieval, and automatic transcription. It is closely related to computational auditory scene analysis. However, musical instrument recognition has not received as much research interest as speaker recognition, for instance.

The implemented musical instrument recognition systems still have limited practical usability. Brown has reported a system that is able to recognize four woodwind instruments from monophonic recordings with a performance comparable to that of humans [1]. Martin's system recognized a wider set of instruments, although it did not perform as well as human subjects in a similar task [2].

This paper continues the work presented in [3] by using new cepstral features and introducing a significant extension to the evaluation data. The research focuses on comparing different features with regard to recognition accuracy in a solo tone recognition task. First, we analyse different cepstral features that are based either on linear prediction (LP) or filterbank analysis. Both conventional LP, which has uniform frequency resolution, and the more psychoacoustically motivated warped linear prediction (WLP) are used. WLP based features have not been used for musical instrument recognition before. Second, other features are analysed that are related to the temporal development, modulation properties, brightness, and spectral synchrony of sounds.

The evaluation database is extended to include several examples of a particular instrument. Both acoustic and synthetic isolated notes of 16 Western orchestral instruments are used for testing, whereas the training data includes examples of 29 instruments. The performance of the system and the confusions it makes are compared to the results reported in a human perception experiment, which used a subset of the same data as stimuli [2].

2. FEATURE EXTRACTION

2.1. Cepstral features

For isolated musical tones, the onset has been found to be important for recognition by human subjects [4]. Motivated by this, the cepstral analyses are made separately for the onset and steady state segments of a tone. Based on the root mean square (RMS) energy level of the signal, each tone is segmented into onset and steady state segments. The steady state begins when the signal achieves its average RMS energy level for the first time, and the onset segment is the 10 dB rise before this point.

For the onset portion of tones, both LP and filterbank analyses were performed in Hamming windowed frames of approximately 20 ms length with 25% overlap. In the steady state segment, a frame length of 40 ms was used. If the onset was shorter than 80 ms, the beginning of the steady state was moved forward so that at least 80 ms was analysed. Prior to the analyses, each acoustic signal was preemphasized with the high pass filter $1 - 0.97z^{-1}$ to flatten the spectrum.

The LP coefficients were obtained from an all-pole approximation of the windowed waveform, and were computed using the autocorrelation method. In the calculation of the WLP coefficients, the frequency warping transformation was obtained by replacing the unit delays of the predicting filter with first-order all-pass elements. In the z-domain this can be interpreted as the mapping

$$ z^{-1} \rightarrow \tilde{z}^{-1} = \frac{z^{-1} - \lambda}{1 - \lambda z^{-1}}. \tag{1} $$

In the implementation this means replacing the autocorrelation network with a warped autocorrelation network [5]. The parameter $\lambda$ is selected in such a way that the resulting frequency mapping approximates the desired frequency scale. By selecting $\lambda = 0.7564$ for 44.1 kHz samples, a Bark scale approximation was obtained [6]. Finally, the obtained linear prediction coefficients $a_n$ are transformed into cepstral coefficients $c_n$ with the recursion [7, p. 163]

$$ c_n = -a_n - \frac{1}{n} \sum_{k=1}^{n-1} k \, c_k \, a_{n-k}. \tag{2} $$

The number of cepstral coefficients was equal to the analysis order after the zeroth coefficient, which is a function of the channel gain, was discarded.

For the mel-frequency cepstral coefficient (MFCC) calculations, a discrete Fourier transform was first calculated for the windowed waveform. The length of the transform was 1024 or 2048 points for 20 ms and 40 ms frames, respectively. Forty triangular bandpass filters having equal bandwidth on the mel-frequency scale were simulated, and the MFCCs were calculated from the log filterbank amplitudes using a shifted discrete cosine transform [7, p. 189].

In all cases, the median values of the cepstral coefficients were stored for the onset and steady state segments. Delta cepstral coefficients were calculated by fitting a first order polynomial over the cepstral trajectories. For the delta cepstral coefficients, the median of their absolute value was calculated. We also experimented with coefficient standard deviations in the case of the MFCCs.

2.2. Spectral and temporal features

Calculation of the other features analysed in this study has been described in [3] and is only briefly summarized here.

The amplitude envelope contains information about, for example, the type of excitation, i.e. whether a violin has been bowed or plucked. Tight coupling between the excitation and the resonance structure is indicated by a short onset duration. To measure the slope of the amplitude decay after the onset, a line was fitted over the amplitude envelope on a dB scale. The mean square error of the fit was also used as a feature. The crest factor, i.e. the maximum divided by the RMS value, was also used to characterize the shape of the amplitude envelope.

The strength and frequency of amplitude modulation (AM) were measured in two frequency ranges: 4-8 Hz to measure tremolo, i.e. AM in conjunction with vibrato, and 10-40 Hz for graininess or roughness of tones.

Spectral centroid (SC) corresponds to perceived brightness and has been one of the interpretations for the dissimilarity ratings in many multidimensional scaling studies [4]. SC was calculated from a short time power spectrum of the signal using logarithmic frequency resolution. The normalized value of SC is the absolute value in Hz divided by the fundamental frequency. The mean, maximum and standard deviation values of SC were used as features.

Onset asynchrony refers to the differences in the rate of the energy development of different frequency components. A sinusoid envelope representation was used to calculate the intensity envelopes for different harmonics, and the standard deviation of onset durations for different harmonics was used as one feature. Another feature measuring this property is obtained by fitting the intensity envelopes of individual harmonics to the overall intensity envelope during the onset period; the average mean square error of those fits was used as a feature.

The fundamental frequency (f0) of tones is measured using the algorithm from [8], and used as a feature. Its standard deviation was also used as a measure of vibrato.

3. EXPERIMENTAL SETUP

Samples from five different sources were included in the validation database. First, the samples used in [3] consisted of the samples from the McGill University Master Samples Collection (MUMS) [9], as well as recordings of an acoustic guitar made at Tampere University of Technology. The other sources of samples were the University of Iowa website, IRCAM Studio Online (SOL), and a Roland XP-30 synthesizer. The MUMS and SOL samples are recorded in studios with different acoustic characteristics and recording equipment, and the samples from Iowa University are recorded in an anechoic chamber. The samples from the Roland synthesizer were played on the keyboard and recorded through analog lines into a Silicon Graphics Octane workstation. The synthesizer has a dynamic keyboard, so these samples have varying dynamics. The samples from SOL include only the first 1.5 seconds of the played note.

Cross validation aimed at conditions as realistic as possible with this data set. On each trial, the training data consisted of all the samples except those of the particular performer and instrument being tested. In this way, the training data is maximally utilized, but the system has never heard the samples from that particular instrument in those circumstances before. There were 16 instruments that had at least three independent recordings, so these instruments were used for testing. The instruments can be seen in Figure 4. A total of 5286 samples of 29 Western orchestral instruments were included in the data set, out of which 3337 samples were used for testing. The classifier made its choice among the 29 instruments. In these tests, a random guesser would score 3.5% in the individual instrument recognition task, and 16.7% in family classification.

In each test, classifications were performed separately for the instrument family and individual instrument cases. A k-nearest neighbours (kNN) classifier was used, where the values of k were 11 for instrument family and 5 for individual instrument classification. The distance metric was Mahalanobis with an equal covariance matrix for all classes. It was implemented by using the discrete form of the Karhunen-Loeve transform to uncorrelate the features and normalize the variances, and then using the Euclidean distance metric in the normalized space.

4. RESULTS

Different orders of the linear prediction filter were used to see the effect of the order on the performance of several LP and WLP based features. The results for instrument family and individual instrument recognition are shown in Figure 1. The feature vector at all points consisted of two sets of coefficients: medians over the onset period and medians over the steady state. The optimal analysis order was between 9 and 14, above and below which performance degrades. The number of cepstral coefficients was one less than the analysis order. WLP cepstral and reflection coefficients outperformed LP cepstral and reflection coefficients at all analysis orders calculated. The best accuracy with LP based features was 33% for individual instruments (66% for instrument families), and was obtained with WLP cepstral coefficients (WLPCC) of order 13.

Figure 1. Classification performance as a function of analysis order for different LP based features. (Axes: LP analysis order vs. percent correct. Curves: WLP cepstra, WLP reflection coefficients, LP cepstra, and LP reflection coefficients, each for family recognition and instrument recognition.)

In Figure 2, the classification accuracy is presented as a function of features. The cepstral parameters are mel-frequency cepstral coefficients or their derivatives. The optimal number of MFCCs was 12, above and below which the performance slowly degraded. Optimization of the filter bank parameters should be done for the MFCCs, but was left for future research. By using the MFCCs both from the onset and steady state, the accuracies were 32% (69%). Because of computational cost considerations, the MFCCs were selected as the cepstrum features for the remaining experiments. Adding the mel-frequency delta cepstrum coefficients (DMFCC) slightly improved the performance: using the MFCCs and DMFCCs of the steady state resulted in 34% (72%) accuracy.

Figure 2. Classification performance as a function of features. The features printed in italics were included in the best performing configuration. (Axes: percent correct vs. feature. Series: individual instrument, instrument family, random guess (instrument), random guess (family). Features shown: slope of line fit (post onset decay), mean square error of line fit, crest factor, mean of normalized SC, mean of SC, std of SC, std of normalized SC, frequency of AM (4-8 Hz), strength of AM (4-8 Hz), heuristic strength of AM (4-8 Hz), frequency of AM (10-40 Hz), strength of AM (10-40 Hz), std of component onset durations, error of fit between onset intensities, onset duration, fundamental frequency (f0), std of f0, MFCCs of onset, MFCCs of steady state, DMFCCs of onset, DMFCCs of steady state, std of MFCCs of onset, std of MFCCs of steady state.)

The other features were not very successful on their own. Onset duration was the most successful, with 35% accuracy in instrument family classification. For individual instruments, spectral centroid gave the best accuracy, 10%. Both were clearly inferior to the MFCCs and DMFCCs. It should be noted, however, that the MFCC features are vectors of coefficients, whereas the other features consist of a single number each.

The best accuracy, 35% (77%), was obtained by using a feature vector consisting of the features printed in italics in Figure 2. The feature set was found by using a subset of the data and a simple backward select algorithm. If the MFCCs were replaced with order 13 WLPCCs, the accuracy was 35% (72%).

In practical situations, a recognition system is likely to have more than one note to use for classification. A simulation was made to test the system's behaviour in this situation. Random sequences of notes were generated and each note was classified individually. The final classification result was pooled across the sequence by using the majority rule. The recognition accuracies were averaged over 50 runs for each instrument and note sequence length. Figure 3 shows the average accuracies for individual instrument and family classification. With 11 random notes, the average accuracy increased to 51% (96%). In instrument family classification, the recognition accuracy for the tenor saxophone was the worst (55% with 11 notes), whereas the accuracy for all the other instruments was over 90%. In the case of individual instruments, the accuracy for the tenor trombone, tuba, cello, violin, viola and guitar was poorer than with one note, while the accuracy for the other instruments was higher.

Figure 3. Classification performance as a function of note sequence length. (Axes: note sequence length vs. percent correct. Curves: individual instrument, instrument family.)

The recognition accuracy depends on the recording circumstances, as may be expected. The individual instrument recognition accuracies were 32%, 87%, 21% and 37% for the samples from the MUMS, Iowa, Roland and SOL sources, respectively. The Iowa samples included only the woodwinds and the French horn, which were on average recognized with 49% accuracy. Thus, the recognition accuracy is clearly better for the Iowa samples recorded in an anechoic chamber. The samples from the other three sources are comparable, with the exception that the samples from SOL did not include tenor or soprano sax. With synthesized samples the performance is clearly worse, which is probably due to both the varying quality of the synthetic tones and the varying dynamics.

5. DISCUSSION

The confusion matrix for the feature set giving the best accuracy is presented in Figure 4. There are large differences in the recognition accuracies of different instruments. The soprano sax is recognized correctly in 72% of the cases, while the classification accuracies for the violin and guitar are only 4%. French horn is the most common target for misclassifications.

Figure 4. Confusion matrix for the best performing feature set. Entries are expressed as percentages and are rounded to the nearest integer. The boxes indicate instrument families.

It is interesting to compare the behaviour of the system to human subjects. Martin [2] has reported a listening experiment where fourteen subjects recognized 137 samples from the McGill collection, a subset of the data used in our evaluations. The differences in the instrument sets are small: Martin's samples did not include any sax or guitar samples, but had the piccolo and the English horn, which were not present in our test data. In his test, the subjects recognized the individual instrument correctly in 45.9% of cases (91.7% for instrument families). Our system made more outside family confusions than the subjects in Martin's test. It was not able to generalize into more abstract instrument families as well as humans, which was also the case in Martin's computer simulations [2]. In individual instrument classification, the difference is perhaps smaller.

The within-family confusions made by the system are quite similar to the confusions made by humans. Examples include the French horn as tenor trombone and vice versa, tuba as French horn, or B-flat clarinet as E-flat clarinet. The confusions between the viola and the violin, and the cello and the double bass were also common to both humans and our system. In the confusions occurring outside the instrument family, confusions of the B-flat clarinet as soprano or alto sax were common to both our system and the subjects.

6. CONCLUSIONS

Warped linear prediction based features proved to be successful in the automatic recognition of musical instrument solo tones, and resulted in better accuracy than what was obtained with corresponding conventional LP based features. The mel-frequency cepstral coefficients gave the best accuracy in instrument family classification, and would also be the selection for the sake of computational complexity. The best overall accuracy was obtained by augmenting the mel-cepstral coefficients with features describing the type of excitation, brightness, modulations, synchrony and fundamental frequency of tones.

Care should be taken while interpreting the presented results on the accuracy obtained with different features. First, the best set of features for musical instrument recognition depends on the context [2,4]. Second, the extraction algorithms for features other than cepstral coefficients are still in their early stages of development. However, since the accuracy improved when cepstral features were combined with other features, this approach should be further developed.

7. ACKNOWLEDGEMENT

The samples made available on the web by the University of Iowa and IRCAM helped greatly in collecting our database. The warping toolbox by Härmä and Karjalainen (http://www.acous-...) was used for the calculation of WLP based features. Our MFCC analysis was based on Slaney's implementation.

8. REFERENCES

[1] Brown, J. C. "Feature dependence in the automatic identification of musical woodwind instruments." J. Acoust. Soc. Am., Vol. 109, No. 3, pp. 1064-1072, 2001.
[2] Martin, K. D. Sound-Source Recognition: A Theory and Computational Model. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999. Available at: http://...
[3] Eronen, A. & Klapuri, A. "Musical Instrument Recognition Using Cepstral Coefficients and Temporal Features". Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, June 5-9, 2000.
[4] Handel, S. Timbre perception and auditory object identification. In Moore (ed.) Hearing. New York, Academic Press.
[5] Härmä, A. "Frequency-Warped Signal Processing for Audio Applications". J. Audio Eng. Soc., Vol. 48, No. 11, pp. 1011-1031, 2000.
[6] Smith, J. O. & Abel, J. S. "Bark and ERB Bilinear Transforms". IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 6, pp. 697-708, 1999.
[7] Rabiner, L. R. & Juang, B. H. Fundamentals of speech recognition. Prentice-Hall 1993.
[8] Klapuri, A. "Pitch Estimation Using Multiple Independent Time-Frequency Windows". Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, Oct. 17-20, 1999.
[9] Opolko, F. & Wapnick, J. McGill University Master Samples (compact disk). McGill University, 1987.
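
The LP-to-cepstrum recursion of Eq. (2) in Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper. The function name `lpc_to_cepstrum` is hypothetical, and the sketch assumes the prediction polynomial is written as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, a convention the text does not state explicitly but which agrees with the signs in Eq. (2).

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=None):
    """Convert LP coefficients a_1..a_p to cepstral coefficients c_1..c_N.

    Assumes A(z) = 1 + sum_k a_k z^-k (an assumption, see lead-in).
    Implements c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    with a_n taken as zero for n > p. The zeroth (gain) coefficient is
    not computed, mirroring its removal in the text.
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    if n_ceps is None:
        n_ceps = p
    c = np.zeros(n_ceps + 1)  # index 0 is unused
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:  # a_{n-k} exists only for n-k <= p
                acc += k * c[k] * a[n - k - 1]
        c[n] = -a_n - acc / n
    return c[1:]

# One-pole example: 1 / (1 - 0.5 z^-1) has cepstrum c_n = 0.5**n / n.
print(lpc_to_cepstrum([-0.5], n_ceps=3))
```

The one-pole case is a convenient sanity check because its cepstrum has a closed form, so the recursion can be verified without any audio data.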
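
The classifier of Section 3 (Mahalanobis distance realized as a decorrelating, variance-normalizing transform followed by Euclidean kNN) can be illustrated as below. This is a sketch under stated assumptions: the names `fit_whitening` and `knn_predict` are hypothetical, the shared covariance is estimated from all training vectors pooled together (the paper does not say how it was estimated), and the toy data are synthetic.

```python
import numpy as np
from collections import Counter

def fit_whitening(X):
    """Estimate a decorrelating, variance-normalizing transform from X.

    The eigenvectors of the covariance matrix decorrelate the features;
    dividing by the square roots of the eigenvalues normalizes variances.
    """
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X - mean, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 1e-12)  # guard against zero variance
    W = eigvecs / np.sqrt(eigvals)        # scales column j by 1/sqrt(eigval j)
    return mean, W

def knn_predict(X_train, y_train, x, k, mean, W):
    """Classify x by majority vote among its k nearest training vectors,
    using Euclidean distance in the whitened space."""
    Z = (X_train - mean) @ W
    z = (x - mean) @ W
    dist = np.linalg.norm(Z - z, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage with two synthetic, correlated classes (illustrative only).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = np.vstack([rng.normal(size=(200, 2)) @ A,
               rng.normal(size=(200, 2)) @ A + 6.0])
y = np.array([0] * 200 + [1] * 200)
mean, W = fit_whitening(X)
print(knn_predict(X, y, np.array([0.0, 0.0]), k=5, mean=mean, W=W))
```

Whitening once and then using plain Euclidean distance avoids inverting the covariance matrix for every distance evaluation, which is presumably why the two-step formulation is attractive for a kNN search.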
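
The note sequence simulation of Section 4 pools per-note decisions with the majority rule. The following sketch shows one way such pooling could work. The names `majority_vote` and `simulate_sequence_accuracy` are hypothetical, notes are drawn with replacement, and ties go to the label encountered first; the paper specifies neither the sampling scheme nor the tie-breaking, so both are assumptions.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def simulate_sequence_accuracy(note_predictions, true_label, seq_len,
                               n_runs=50, seed=0):
    """Estimate pooled accuracy for one instrument.

    note_predictions: per-note classifier outputs for notes of true_label.
    Each run draws seq_len predictions at random (with replacement, an
    assumption) and pools them with the majority rule.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        sequence = [rng.choice(note_predictions) for _ in range(seq_len)]
        hits += majority_vote(sequence) == true_label
    return hits / n_runs

# Illustrative per-note outputs for a hypothetical instrument.
outputs = ["oboe"] * 6 + ["clarinet"] * 3 + ["flute"]
print(simulate_sequence_accuracy(outputs, "oboe", seq_len=11))
```

Under this kind of pooling, an instrument whose single most frequent per-note output is a wrong label tends to get worse as the sequence grows, which is consistent with the reported drop for instruments such as the violin and guitar.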