Computational Approaches to Melodic Analysis of Indian Art Music discusses various computational approaches to analyzing the melody of Indian art music, including tonic identification and predominant pitch estimation. For tonic identification, the document describes applying multipitch analysis to the audio signal, exploiting the drone that is always present in the background, to identify the tonic note, with a reported accuracy of about 90%. It also covers the signal processing techniques involved, such as the STFT and spectral peak picking. For predominant pitch estimation, it reviews a range of pitch estimation algorithms, including autocorrelation-based, frequency-domain, and multipitch approaches, highlighting the YIN algorithm, which produces fewer errors than the other methods.
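As a rough illustration of the STFT-plus-peak-picking front end mentioned above, here is a minimal sketch in Python/NumPy; the window choice, FFT size, and number of peaks are illustrative assumptions, not the tutorial's settings:

```python
import numpy as np

def spectral_peaks(frame, fs, nfft=4096, n_peaks=10):
    """Pick prominent spectral peaks from one analysis frame.

    Hann window, zero-padded FFT, simple local-maximum picking;
    all parameter values here are illustrative assumptions.
    """
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), nfft))
    # Local maxima: bins larger than both neighbours
    locs = np.flatnonzero((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])) + 1
    # Keep the n_peaks largest-magnitude maxima
    top = locs[np.argsort(mag[locs])[::-1][:n_peaks]]
    return np.sort(top) * fs / nfft  # peak frequencies in Hz
```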
This document presents a method for improving pitch extraction from noisy speech signals using a fuzzy weighted autocorrelation function (FWS-ACF). Simulation results show that the FWS-ACF method is more robust against background noise than conventional autocorrelation and other pitch extraction methods. The fuzzy weighting scheme assigns membership values between 0 and 1 to emphasize the true peaks of the autocorrelation function. The FWS-ACF achieves lower gross pitch error rates than methods such as the cepstrum, the average magnitude difference function, and weighted autocorrelation when extracting pitch from noisy speech.
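A minimal sketch of the general idea, assuming Python/NumPy; the membership function used here is an illustrative choice, not the paper's exact fuzzy weighting:

```python
import numpy as np

def fuzzy_weighted_acf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate from a fuzzy-weighted autocorrelation function.

    The membership function (values in 0..1) is an illustrative choice:
    it emphasizes strong positive ACF peaks and suppresses the weak,
    spurious ones introduced by noise.
    """
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf = acf / (acf[0] + 1e-12)               # normalize so acf[0] == 1
    membership = np.clip(acf, 0.0, 1.0) ** 2   # illustrative fuzzy weights
    weighted = acf * membership
    lo, hi = int(fs / fmax), int(fs / fmin)    # admissible pitch lag range
    lag = lo + int(np.argmax(weighted[lo:hi]))
    return fs / lag                            # estimated pitch in Hz

# Example: noisy 150 Hz square-like wave
fs = 8000
t = np.arange(0, 0.04, 1 / fs)
x = np.sign(np.sin(2 * np.pi * 150 * t)) + 0.3 * np.random.randn(len(t))
print(round(fuzzy_weighted_acf_pitch(x, fs), 1))  # close to 150.0
```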
A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif... (April Smith)
This paper presents a method for classifying phonemes that combines reconstructed phase space (RPS) representations with sub-band decomposition of speech signals. Experiments on the TIMIT database show that different phonological classes (vowels, fricatives, nasals, stops) are recognized with varying accuracy depending on the frequency sub-band. The results indicate filtering signals before embedding in RPS has potential to improve classification accuracy by exploiting differences in how well phonemes of different classes are represented in different frequency ranges. Combining classifications from multiple sub-bands may yield better performance than using the full-band signal alone.
Application of Fisher Linear Discriminant Analysis to Speech/Music Classifica... (Lushanthan Sivaneasharajah)
This document describes research applying Fisher Linear Discriminant Analysis (LDA) and K-Nearest Neighbors (K-NN) algorithms to classify speech and music audio clips. It finds that Fisher LDA using single features like mel-frequency cepstral coefficients achieves classification error rates below 5%, outperforming K-NN. While combining multiple features does not improve LDA results, combining the outputs of LDA and K-NN classifiers using majority voting further lowers the error rate to 4.5%, demonstrating the benefit of classifier ensembles for this task.
1) Equalizer matching involves finding the power spectrum of an example audio, then multiplying the input audio's magnitude spectrogram by a filter matching the example's power spectrum (see the sketch after this list).
2) Noise matching involves denoising the input and example separately, then recombining their clean and noise components using the original signal-to-noise ratio.
3) Reverberation matching uses convolutive non-negative matrix factorization to decompose the input into a dry sound and reverb kernel, and convolve the estimated dry input with the example's reverb kernel.
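A minimal sketch of the equalizer-matching step (1), assuming Python with SciPy; parameters such as the FFT size are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def match_equalization(input_sig, example_sig, fs, nfft=1024):
    """Shape the input's long-term spectrum to match the example's.

    Sketch of item (1): estimate average power spectra from the STFT,
    build a per-bin matching filter, and apply it to the input's
    magnitude spectrogram while keeping the input's phase.
    """
    _, _, Z_in = stft(input_sig, fs, nperseg=nfft)
    _, _, Z_ex = stft(example_sig, fs, nperseg=nfft)
    ps_in = np.mean(np.abs(Z_in) ** 2, axis=1)   # input power spectrum
    ps_ex = np.mean(np.abs(Z_ex) ** 2, axis=1)   # example power spectrum
    gain = np.sqrt(ps_ex / (ps_in + 1e-12))      # per-bin matching filter
    Z_out = Z_in * gain[:, None]                 # scale magnitudes, keep phase
    _, y = istft(Z_out, fs, nperseg=nfft)
    return y
```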
The document discusses spatial hearing and head-related transfer functions (HRTFs) for virtual audio. It covers measuring HRTFs using a KEMAR manikin, constructing filters based on measured HRTFs to localize sound, issues with non-individualized HRTFs, synthetic HRTF approaches, and techniques for externalization like reverberation and decorrelation. Applications mentioned include immersive environments, hearing aids, and representational sounds.
An efficient peak valley detection based vad algorithm for robust detection o... (csandit)
The Voice Activity Detection (VAD) problem concerns detecting the presence of speech in a noisy signal. The speech/non-speech classification task is not as trivial as it appears, and most VAD algorithms fail as the level of background noise increases. This research presents a new technique for VAD in EEG-recorded brain stem speech-evoked potentials data [7, 8, 9]. It is a spectral subtraction method built around a purpose-developed mathematical formula for peak valley detection (PVD) in the frequency spectra [1]. The purpose of the research is to compare the performance of this SNR-based PVD (SNRPVD) method against a zero-crossing rate detector [5] and statistical-analysis-based algorithms [10]. All three algorithms were applied to the data sets of this experiment [7, 8, 9], implemented as MATLAB routines, and their VAD results were verified and compared. The conclusion is that the SNRPVD method performs better than the ZCR and statistical algorithms.
An efficient peak valley detection based vad algorithm for robust detection o... (csandit)
Biometrics is the science of measuring and statistically analyzing biological data. A biometric system establishes the identity of a person based on a unique physical or behavioral characteristic possessed by an individual. Behavioral biometrics measures characteristics that are acquired naturally over time, while physical biometrics measures inherent physical characteristics of an individual. Over the last few decades, enormous attention has been drawn to ocular biometrics, and cues provided by the ocular region have led to the exploration of newer traits. The feasibility of the periocular region as a useful biometric trait has been explored recently, and with the promising results of preliminary examinations, research on the periocular region is currently gaining prominence. Researchers have analyzed various techniques for feature extraction and classification in the periocular region. This paper investigates the effect of using the Lower Central Periocular Region (LCPR) for identification. The results obtained are comparable with those acquired for full periocular texture features, with the advantage of a reduced periocular area.
AN EFFICIENT PEAK VALLEY DETECTION BASED VAD ALGORITHM FOR ROBUST DETECTION O... (cscpconf)
This document presents a new peak valley detection (PVD) based voice activity detection (VAD) algorithm for detecting speech in EEG data collected from brain stem responses to speech stimuli. It compares the performance of this signal-to-noise ratio PVD (SNRPVD) method to a zero-crossing rate detector and statistical analysis based algorithms. The SNRPVD method detects vowel sounds by identifying spectral peaks, which remain prominent even in noise, and calculates similarity to a registered peak signature vector. Results on 10 subject datasets show SNRPVD outperforms other methods, correctly detecting speech at lower signal-to-noise ratios. Further research will compare SNRPVD to additional VAD algorithms to validate its superior performance.
A Combined Voice Activity Detector Based On Singular Value Decomposition and ... (CSCJournals)
A voice activity detector (VAD) is used to separate the speech-bearing parts of a signal from the silence parts. In this paper, a new VAD algorithm based on singular value decomposition is presented. Feature vector extraction proceeds in two stages: in the first, voiced frames are separated from unvoiced and silence frames; in the second, unvoiced frames are separated from silence frames. First the noisy signal is windowed, and a Hankel matrix is formed for each frame. The statistical feature of the proposed system is the slope of the singular-value curve of each frame, obtained by linear regression. It is shown that, across different SNRs, this slope is larger for voiced frames than for the other frame types, so it can be used for the first stage. Because the feature vectors of unvoiced and silence frames are highly similar, this slope cannot separate those two categories; in the second stage, frequency characteristics are therefore used to distinguish unvoiced frames from silence frames. Simulation results show that high speed and accuracy are the advantages of the proposed system.
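A minimal sketch of the per-frame feature described above, assuming Python with NumPy/SciPy; the model order is an illustrative assumption:

```python
import numpy as np
from scipy.linalg import hankel

def singular_value_slope(frame, order=20):
    """Slope of the frame's singular-value curve, the VAD feature above.

    Builds a Hankel matrix from the frame, takes its singular values,
    and fits a line by least squares; voiced frames are reported to
    yield a steeper curve. The model order is an illustrative choice.
    """
    H = hankel(frame[:order], frame[order - 1:])    # order x (N - order + 1)
    s = np.linalg.svd(H, compute_uv=False)          # descending singular values
    slope, _ = np.polyfit(np.arange(len(s)), s, 1)  # linear-regression slope
    return slope
```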
This document discusses a study investigating the combined use of Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) features in automatic speech recognition systems. It begins by outlining the challenges of automatic speech recognition and then describes the MFCC and LPC algorithms for extracting basic speech features. The study suggests combining MFCC and LPC-based recognition subsystems to improve reliability. Neural networks are used for training and recognition, and results show the combined approach improves recognition quality compared to individual methods.
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea... (IRJET Journal)
This document presents a modified least mean square (LMS) algorithm to reduce noise in real-time speech signals. The proposed approach modifies the standard LMS algorithm by incorporating a Wiener filter. Experiments are conducted on speech samples from the NOIZEUS database with various types of noise at different signal-to-noise ratios. Objective measures like segmental SNR, log likelihood ratio, Itakura-Saito spectral distance, and cepstrum are used to evaluate the performance of the proposed algorithm compared to the standard LMS algorithm. The results show that the modified LMS algorithm with Wiener filter outperforms the standard LMS algorithm in enhancing the quality of noisy speech signals based on the objective measure values.
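For reference, a minimal sketch of the standard LMS baseline that the paper modifies, in Python/NumPy; the filter order and step size are illustrative, and the paper's Wiener-filter modification is not reproduced here:

```python
import numpy as np

def lms_noise_canceller(primary, reference, order=16, mu=0.01):
    """Standard LMS adaptive noise canceller (the unmodified baseline).

    primary:   speech plus noise picked up by the main channel
    reference: input correlated with the noise
    Returns the error signal, which approximates the enhanced speech
    once the filter converges. Order and mu are illustrative values.
    """
    w = np.zeros(order)
    e = np.zeros(len(primary))
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]   # most recent samples first
        y = w @ x                          # adaptive filter output
        e[n] = primary[n] - y              # error = enhanced sample
        w = w + mu * e[n] * x              # LMS weight update
    return e
```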
This document presents an investigation into using adaptive filter bank analysis (AFBA) to derive robust mel frequency cepstral features for noisy speech recognition. AFBA adaptively incorporates the signal-to-noise ratio (SNR) value into filter bank analysis by making the weighting factor for each log filter bank energy component dependent on the SNR frame-by-frame. Experimental results on a Mandarin speech database show AFBA provides higher recognition rates than other techniques in various noisy conditions.
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a target speaker. It works by analyzing the source speech into an excitation signal and filter components, then resynthesizing it with the pitch and vocal characteristics of the target speaker. The key steps are detecting the pitches of the source and target speakers, scaling the source pitch to match the target, then resynthesizing the source speech using the target's vocal filter characteristics and the pitch-scaled excitation signal. Voice morphing was developed in 1999 and has applications in text-to-speech, dubbing, voice disguising, and public announcement systems.
The document proposes a time-frequency domain approach to pitch estimation of noisy speech that uses an inverse circular average magnitude difference function to weight the autocorrelation function of pre-filtered noisy speech. It estimates the dominant pitch harmonic in the frequency domain using a cosine model of the autocorrelation function, and then optimally fits a variable-period impulse train to the weighted autocorrelation function to estimate the pitch. Simulation results on the Keele speech database show that the proposed method achieves better pitch estimation accuracy than conventional autocorrelation-based methods, even at signal-to-noise ratios down to -10 dB.
Graphical visualization of musical emotions (Pranay Prasoon)
The document discusses graphical visualization of musical emotions using artificial neural networks. 13 audio features are extracted from Hindustani classical music clips labeled as happy or sad. An ANN model with backpropagation algorithm is trained on 70% of data, validated on 15% and tested on 15%. The model correctly classified 15 of 17 happy clips and 21 of 22 sad clips. Testing was repeated 10 times with over 90% accuracy each time, showing the model effectively recognizes musical emotions. Future work involves expanding the model to recognize additional emotions and incorporating physiological features.
CORRELATION BASED FUNDAMENTAL FREQUENCY EXTRACTION METHOD IN NOISY SPEECH SIGNAL (ijcseit)
This paper proposes a correlation-based method using the autocorrelation function and YIN. Both the autocorrelation function and YIN are popular time-domain measurements for estimating the fundamental frequency. The performance of these two methods, however, is affected by the position of dominant harmonics (usually the first formant) and by spurious peaks introduced in noisy conditions. Experimental results from computer simulations on female and male voices in different noises show that the gross pitch errors of the proposed method are lower than those of related methods under various signal-to-noise ratio conditions.
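Since YIN recurs throughout these summaries, here is a minimal sketch of its core, the cumulative mean normalized difference function, in Python/NumPy; the threshold and search range are illustrative values:

```python
import numpy as np

def yin_pitch(frame, fs, fmin=60.0, fmax=400.0, threshold=0.15):
    """Core of YIN: the cumulative mean normalized difference function.

    Illustrative sketch; the frame must be longer than fs/fmin samples.
    """
    max_lag = int(fs / fmin)
    d = np.zeros(max_lag + 1)
    for tau in range(1, max_lag + 1):
        diff = frame[:-tau] - frame[tau:]
        d[tau] = np.dot(diff, diff)                # difference function d(tau)
    cmndf = np.ones(max_lag + 1)
    cmndf[1:] = d[1:] * np.arange(1, max_lag + 1) / (np.cumsum(d[1:]) + 1e-12)
    lo = int(fs / fmax)
    for tau in range(lo, max_lag + 1):             # first dip under threshold
        if cmndf[tau] < threshold:
            return fs / tau
    return fs / (lo + int(np.argmin(cmndf[lo:])))  # fallback: global minimum
```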
Robot navigation in unknown environment with obstacle recognition using laser... (IJECEIAES)
Robot navigation in unknown and dynamic environments may result in aimless wandering, corner traps, and repetitive path loops. To address these issues, this paper presents a solution based on comparing the standard deviations of the distance ranges of obstacles appearing in the robot's navigation path. For similar obstacles, the standard deviations of the distance range vectors, obtained from the robot's laser range finder at similar poses, are very close to each other. The measurements of the odometer sensor are therefore combined with the standard deviation to recognize the locations of the obstacles. A novel algorithm with an obstacle detection feature is presented for robot navigation in unknown and dynamic environments. The algorithm checks the similarity of the distance range vectors of obstacles in the path and uses this information, in combination with the odometer measurements, to identify the obstacles and their locations. The experimental work is carried out using the Gazebo simulator.
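A minimal sketch of that similarity check, assuming Python/NumPy; the thresholds are illustrative, not values from the paper:

```python
import numpy as np

def same_obstacle(ranges_a, ranges_b, pose_a, pose_b,
                  std_tol=0.05, pose_tol=0.3):
    """Heuristic from the paper's idea: two laser scans likely see the
    same obstacle if the standard deviations of their range vectors are
    close and odometry says the poses are close. Thresholds are
    illustrative assumptions.
    """
    std_close = abs(np.std(ranges_a) - np.std(ranges_b)) < std_tol
    pose_close = np.linalg.norm(np.asarray(pose_a) - np.asarray(pose_b)) < pose_tol
    return std_close and pose_close
```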
Handling Inharmonic Series with Median-Adjustive Trajectories (Matthieu Hodgkinson)
This document summarizes a new method for analyzing inharmonic instrumental tones called Median-Adjustive Trajectories (MAT). The method exploits an equation that relates the inharmonicity coefficient to the frequencies and numbers of any two partials from an inharmonic series. It estimates the frequencies of the first two prominent peaks to calculate an initial inharmonicity coefficient. This is then used along with the partial frequencies in iterative steps to estimate subsequent partial frequencies, refining the coefficient at each step. The estimates are based on medians of arrays calculated from the relevant equations to improve accuracy. The method allows efficient analysis of inharmonic spectra without exhaustive searches over parameter ranges.
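The exploited relation is the stiff-string partial model f_k = k * f0 * sqrt(1 + B * k^2); solving it for the inharmonicity coefficient B from any two measured partials gives the per-pair estimate that MAT takes medians over. A minimal sketch in Python/NumPy:

```python
import numpy as np

def inharmonicity_from_two_partials(fm, m, fn, n):
    """Two-partial estimate of the inharmonicity coefficient B from the
    stiff-string model f_k = k * f0 * sqrt(1 + B * k**2). Any pair of
    measured partials (fm, m) and (fn, n) yields one estimate; MAT
    takes medians over many such pairs.
    """
    num = (m ** 2) * fn ** 2 - (n ** 2) * fm ** 2
    den = (n ** 4) * fm ** 2 - (m ** 4) * fn ** 2
    return num / den

# Check with partials generated from f0 = 110 Hz, B = 1e-4
f0, B = 110.0, 1e-4
f = lambda k: k * f0 * np.sqrt(1 + B * k ** 2)
print(inharmonicity_from_two_partials(f(2), 2, f(5), 5))  # ~1e-4
```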
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara... (IJERA Editor)
This work presents an application of the fundamental frequency (pitch), Linear Predictive Cepstral Coefficients (LPCC), and Mel Frequency Cepstral Coefficients (MFCC) to the identification of a speaker's sex in speech recognition research. The aim of the article is to compare the performance of these three methods for identifying the sex of speakers. A successful speech recognition system can help in non-critical operations such as presenting a driving route to the driver, dialing a phone number, or switching lights and coffee machines on and off, apart from speaker verification by caste, community, and locality, including identification of sex. Here an attempt is made to identify the sex of Bodo speakers from vowel utterances using pitch values, LPCC, and MFCC techniques. It is found that the feature vector organization of the LPCC coefficients provides a more promising approach to speech and speaker recognition for the Bodo language than pitch or MFCC.
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin... (Venkata Sudhir Vedurla)
This document presents an analysis of acoustic echo cancellation for speech processing using the LMS adaptive filtering algorithm. It begins with an abstract that outlines the challenges of conventional echo cancellation techniques and the need for a computationally efficient, rapidly converging algorithm. It then provides background on acoustic echo, the principles of echo cancellation, discrete-time signals, speech signals, and an overview of the LMS adaptive filtering algorithm and its application to echo cancellation. The document analyzes the performance of the LMS algorithm for echo cancellation by examining how the step size parameter affects convergence and steady-state error. It concludes that the LMS algorithm is well suited for echo cancellation due to its computational simplicity, though the step size must be carefully selected for optimal performance.
This document summarizes a research paper on speech enhancement using the signal subspace algorithm. It begins with an abstract describing how noise degrades speech quality and intelligibility in communication systems. It then provides background on speech enhancement objectives and commonly used methods like spectral subtraction and signal subspace. The paper describes the signal subspace algorithm and shows its ability to enhance speech signals by suppressing noise. Experimental results on sine waves with added Gaussian noise demonstrate improved peak signal-to-noise ratios when using the signal subspace method compared to the noisy signals. The conclusion is that the algorithm removes noise to a great extent from noisy speech.
This document summarizes a study on independent speaker recognition for native English vowels. The study used a standard approach for vowel classification based on formant frequencies, which depend on vocal tract shape and dimensions. Formants F1 and F2 were extracted from speech samples and used as features. Euclidean distance was used to measure similarity between test samples and reference formant values. The method achieved 80-95% recognition accuracy for vowels from male and female speakers. Vowels /a/ and /o/ had the highest recognition rates while /e/ and /i/ were more likely to be confused due to inter-speaker variation. The study demonstrated the viability of using formant frequencies for automatic vowel and speaker recognition.
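A minimal sketch of the classification rule described above, assuming Python/NumPy; the reference formant values are illustrative textbook-style numbers, not the study's measured averages:

```python
import numpy as np

# Reference (F1, F2) values in Hz -- illustrative numbers only.
REFERENCE_FORMANTS = {
    "a": (730, 1090), "e": (530, 1840), "i": (270, 2290),
    "o": (570, 840),  "u": (300, 870),
}

def classify_vowel(f1, f2):
    """Nearest-reference vowel by Euclidean distance in (F1, F2) space."""
    point = np.array([f1, f2], dtype=float)
    return min(REFERENCE_FORMANTS,
               key=lambda v: np.linalg.norm(point - REFERENCE_FORMANTS[v]))

print(classify_vowel(700, 1100))  # -> 'a'
```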
Modified synthesis strategy for vowels and semi vowels Klatt synthesizer (IAEME Publication)
This document discusses modifications to the Klatt synthesizer for synthesizing vowels and semi-vowels. It proposes storing control parameters and updating strategies to improve naturalness. The Klatt synthesizer uses a source-filter model with voicing and frication sources, and cascade/parallel vocal tract filters. Parameters like pitch, formants and bandwidths are stored in a database that is segmented into frames. Synthesis involves generating the excitation signal then filtering with resonators updated per frame. The modified approach varies the frame size for more precise parameter tracking as in the KlattGrid synthesizer.
Broad Phoneme Classification Using Signal Based Features (ijsc)
Speech is the most efficient and popular means of human communication. Speech is produced as a sequence of phonemes, and phoneme recognition is the first step performed by an automatic speech recognition system. State-of-the-art recognizers use mel-frequency cepstral coefficient (MFCC) features derived through short-time analysis, for which the recognition accuracy is limited. Instead, broad phoneme classification is achieved here using features derived directly from the speech at the signal level. The broad phoneme classes are vowels, nasals, fricatives, stops, approximants, and silence. The features identified as useful for broad phoneme classification are the voiced/unvoiced decision, zero crossing rate (ZCR), short-time energy, most dominant frequency, energy in the most dominant frequency, spectral flatness measure, and the first three formants. Features derived from short-time frames of training speech are used to train a multilayer feedforward neural network classifier with manually marked class labels as output, and the classification accuracy is then tested. This broad phoneme classifier is later used for broad syllable structure prediction, which is useful for applications such as automatic speech recognition and automatic language identification.
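A minimal sketch of a few of these signal-level features, assuming Python/NumPy; framing and windowing are left to the caller:

```python
import numpy as np

def frame_features(frame, fs):
    """Compute four of the listed signal-level features for one frame."""
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2      # zero crossing rate
    energy = np.sum(frame ** 2)                             # short-time energy
    mag = np.abs(np.fft.rfft(frame)) + 1e-12
    flatness = np.exp(np.mean(np.log(mag))) / np.mean(mag)  # spectral flatness
    dom_freq = np.argmax(mag) * fs / len(frame)             # most dominant frequency
    return zcr, energy, flatness, dom_freq
```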
Mining Melodic Patterns in Large Audio Collections of Indian Art Music (Sankalp Gulati)
More info: http://mtg.upf.edu/node/3108
Abstract: Discovery of repeating structures in music is fundamental to its analysis, understanding and interpretation. We present a data-driven approach for the discovery of short-time melodic patterns in large collections of Indian art music. The approach first discovers melodic patterns within an audio recording and subsequently searches for their repetitions in the entire music collection. We compute similarity between melodic patterns using dynamic time warping (DTW). Furthermore, we investigate four different variants of the DTW cost function for rank refinement of the obtained results. The music collection used in this study comprises 1,764 audio recordings with a total duration of 365 hours. Over 13 trillion DTW distance computations are done for the entire dataset. Due to the computational complexity of the task, different lower bounding and early abandoning techniques are applied during DTW distance computation. An evaluation based on expert feedback on a subset of the dataset shows that the discovered melodic patterns are musically relevant. Several musically interesting relationships are discovered, yielding further scope for establishing novel similarity measures based on melodic patterns. The discovered melodic patterns can further be used in challenging computational tasks such as automatic raga recognition, composition identification and music recommendation.
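For concreteness, a minimal sketch of the plain DTW distance underlying the pattern comparison, in Python/NumPy; the lower-bounding and early-abandoning optimizations the abstract mentions are omitted:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(len(a) * len(b)) DTW between two pitch sequences.

    The paper adds lower bounding and early abandoning on top of this
    to make trillions of comparisons feasible; those are omitted here.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```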
Landmark Detection in Hindustani Music Melodies (Sankalp Gulati)
More info: http://mtg.upf.edu/node/2998
Abstract: Musical melodies contain hierarchically organized events, where some events are more salient than others, acting as melodic landmarks. In Hindustani music melodies, an important landmark is the occurrence of a nyas. Occurrence of nyas is crucial to build and sustain the format of a rag and mark the boundaries of melodic motifs. Detection of nyas segments is relevant to tasks such as melody segmentation, motif discovery and rag recognition. However, detection of nyas segments is challenging as these segments do not follow explicit set of rules in terms of segment length, contour characteristics, and melodic context. In this paper we propose a method for the automatic detection of nyas segments in Hindustani music melodies. It consists of two main steps: a segmentation step that incorporates domain knowledge in order to facilitate the placement of nyas boundaries, and a segment classification step that is based on a series of musically motivated pitch contour features. The proposed method obtains significant accuracies for a heterogeneous data set of 20 audio music recordings containing 1257 nyas svar occurrences and total duration of 1.5 hours. Further, we show that the proposed segmentation strategy significantly improves over a classical piece-wise linear segmentation approach.
Phrase-based Rāga Recognition Using Vector Space Modeling (Sankalp Gulati)
This document describes an approach for automatic raga recognition in Indian art music using phrase-based vector space modeling. It involves discovering melodic patterns from a Carnatic music collection through intra-recording and inter-recording analysis. The patterns are then clustered into communities based on their network of similarities. Features are extracted from the pattern-recording relationships using term frequency-inverse document frequency weighting. These features are used to build a feature matrix for raga recognition.
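A minimal sketch of the TF-IDF weighting step, assuming Python/NumPy and the standard formulation (the paper may use a variant):

```python
import numpy as np

def tf_idf(counts):
    """counts: (n_recordings, n_patterns) matrix of pattern occurrence
    counts. Returns a TF-IDF weighted feature matrix of the kind used
    for raga recognition; this is the textbook formulation.
    """
    tf = counts / (counts.sum(axis=1, keepdims=True) + 1e-12)
    df = (counts > 0).sum(axis=0)                 # document frequency
    idf = np.log(counts.shape[0] / (df + 1e-12))  # inverse document frequency
    return tf * idf
```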
Computational Melodic Analysis of Indian Art Music (Sankalp Gulati)
This document summarizes research on computational melodic analysis of Indian art music. Key areas discussed include tonic identification, predominant melody estimation, motif discovery, melodic similarity, melodic pattern networks, raga recognition using melodic patterns, and resources like datasets and demonstrations of the techniques.
Computational Approaches for Melodic Description in Indian Art Music Corpora (Sankalp Gulati)
Presentation for my PhD defense, Music Technology Group, Barcelona, Spain.
Resources: http://compmusic.upf.edu/node/304
Short abstract:
Automatically describing contents of recorded music is crucial for interacting with large volumes of audio recordings, and for developing novel tools to facilitate music pedagogy. Melody is a fundamental facet in most music traditions and, therefore, is an indispensable component in such description. In this thesis, we develop computational approaches for analyzing high-level melodic aspects of music performances in Indian art music (IAM), with which we can describe and interlink large amounts of audio recordings. With its complex melodic framework and well-grounded theory, the description of IAM melody beyond pitch contours offers a very interesting and challenging research topic. We analyze melodies within their tonal context, identify melodic patterns, compare them both within and across music pieces, and finally, characterize the specific melodic context of IAM, the rāgas. All these analyses are done using data-driven methodologies on sizable curated music corpora. Our work paves the way for addressing several interesting research problems in the field of music information research, as well as developing novel applications in the context of music discovery and music pedagogy.
Our presentation of Hindify at MusicHackDay Barcelona, 2013. Hindify is a music hack that automatically transforms an audio song by embedding characteristics of Hindustani music, such as a slow, relaxed tempo, a drone (tanpura) in the background, and tabla as the percussion instrument. This hack was done by Varun Jewalikar and Sankalp Gulati.
Some audio examples: https://soundcloud.com/sankalpg/sets/hindify
Discovery and Characterization of Melodic Motives in Large Audio Music Collec... (Sankalp Gulati)
Sankalp Gulati proposed a methodology for discovering and characterizing melodic motives in large audio music collections using domain knowledge of Indian art music. The methodology involves extracting pitch, loudness, and timbre features from audio signals, representing melodies, calculating melodic similarity, extracting repeated patterns as motives, and analyzing the extracted motives. Gulati aims to apply this methodology to a collection of over 550 hours of Indian art music audio and evaluate the results through listening tests and user feedback.
Tonic Identification System for Indian Art Music (Sankalp Gulati)
The document describes a system for identifying the tonic pitch in Hindustani and Carnatic music recordings. The system utilizes both audio signals and metadata. It analyzes the audio to extract sinusoidal components and computes pitch salience over time. Potential tonic candidates are identified and further processed using signal processing techniques like harmonic summation. The system then identifies the tonic pitch class using a multi-pitch histogram and machine learning. It also estimates the correct octave of the tonic using predominant melody extraction and classification methods. The goal is to automatically label the tonic in large music databases, which provides fundamental information for music analysis tasks.
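A minimal sketch of the multi-pitch histogram stage described above, assuming Python/NumPy; the actual system adds harmonic summation and a trained classifier on top of this:

```python
import numpy as np

def tonic_candidates(pitch_tracks_hz, n_candidates=5):
    """Fold multi-pitch estimates into a pitch-class histogram (cents
    relative to 55 Hz) and return the most salient bins as tonic
    candidates. Simplified stand-in for the histogram stage; assumes
    voiced (nonzero) pitch values. Bin width is an illustrative choice.
    """
    cents = 1200 * np.log2(np.concatenate(pitch_tracks_hz) / 55.0)
    pitch_class = np.mod(cents, 1200)            # fold into one octave
    hist, edges = np.histogram(pitch_class, bins=120, range=(0, 1200))
    top = np.argsort(hist)[::-1][:n_candidates]  # most salient bins
    return 55.0 * 2 ** (edges[top] / 1200)       # candidate frequencies in Hz
```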
[Tutorial] Computational Approaches to Melodic Analysis of Indian Art Music
1. Computational Approaches to Melodic Analysis of Indian Art Music
Indian Institute of Science, Bengaluru, India, 2016
Sankalp Gulati
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
4. Tonic Identification
[Figure: spectrogram of a vocal excerpt (0–8 s, 0–5000 Hz) alongside a pitch salience histogram (frequency bins of 10 cents, ref. 55 Hz; normalized salience 0–1) with peaks labeled f2–f6 and the tonic marked.]
Cues exploited by signal-processing and learning-based approaches:
• Tanpura / drone background sound
• Extent of gamakas on the Sa and Pa svaras
• Vadi and samvadi svaras of the rāga
S. Gulati, A. Bellur, J. Salamon, H. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.
Salamon, J., Gulati, S., & Serra, X. (2012). A multipitch approach to tonic identification in Indian classical music. In Proc. of Int. Conf. on Music Information Retrieval (ISMIR), pp. 499–504, Porto, Portugal.
Bellur, A., Ishwar, V., Serra, X., & Murthy, H. (2012). A knowledge-based signal processing approach to tonic identification in Indian classical music. In 2nd CompMusic Workshop, pp. 113–118, Istanbul, Turkey.
Ranjani, H. G., Arthi, S., & Sreenivas, T. V. (2011). Carnatic music analysis: Shadja, swara identification and raga verification in Alapana using stochastic models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 29–32, New Paltz, NY.
Reported accuracy: ~90%
5. Tonic Identification: Multipitch Approach
• Audio example: vocals and drone (tanpura) heard separately
• Utilizing the drone sound
• Multi-pitch analysis
J. Salamon, E. Gómez, and J. Bonada. Sinusoid extraction and salience function design for predominant melody estimation. In Proc. 14th Int. Conf. on Digital Audio Effects (DAFx-11), pages 73–80, Paris, France, Sep. 2011.
10. Tonic Identification: Signal Processing
• Harmonic summation (see the sketch below)
  ◦ Spectrum considered: 55–7200 Hz
  ◦ Frequency range: 55–1760 Hz
  ◦ Base frequency: 55 Hz
  ◦ Bin resolution: 10 cents per bin (120 bins per octave)
  ◦ Number of octaves: 5
  ◦ Maximum harmonics: 20
  ◦ Squared-cosine weighting window across 50 cents
[Diagram: spectral peaks → bin-salience mapping → harmonic summation]
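Read as a recipe, these settings define a 600-bin salience function (5 octaves × 120 bins) built by weighted summation of up to 20 harmonics per spectral peak. The following sketch, assuming precomputed spectral peaks, shows one plausible realization of the bin-salience mapping; the harmonic weighting (`alpha`) and all constants other than those on the slide are simplifying assumptions, not the exact implementation from the cited papers.

```python
import numpy as np

F_REF = 55.0          # base frequency (Hz)
BINS_PER_OCT = 120    # 10 cents per bin
N_OCTAVES = 5         # maps 55-1760 Hz and beyond into 600 bins
N_BINS = BINS_PER_OCT * N_OCTAVES
N_HARMONICS = 20

def freq_to_bin(f_hz):
    """Fractional bin index: 10 cents per bin relative to 55 Hz."""
    return 1200.0 * np.log2(f_hz / F_REF) / 10.0

def salience(peak_freqs, peak_mags, alpha=0.8):
    """Accumulate harmonic-summation salience from spectral peaks (55-7200 Hz)."""
    S = np.zeros(N_BINS)
    for f, m in zip(peak_freqs, peak_mags):
        if not (55.0 <= f <= 7200.0):
            continue
        for h in range(1, N_HARMONICS + 1):
            f0 = f / h                                 # candidate fundamental
            if not (55.0 <= f0 <= 1760.0):
                continue
            b = freq_to_bin(f0)
            lo, hi = int(np.ceil(b - 5)), int(np.floor(b + 5))   # +/- 50 cents
            for k in range(max(lo, 0), min(hi, N_BINS - 1) + 1):
                w = np.cos((k - b) * np.pi / 10.0) ** 2  # squared-cosine window
                S[k] += (alpha ** (h - 1)) * w * m       # decaying harmonic weight
    return S

# Hypothetical peaks of a 220 Hz tone with three harmonics:
S = salience([220.0, 440.0, 660.0], [1.0, 0.5, 0.3])
print(np.argmax(S), freq_to_bin(220.0))   # maximum salience lands near the 220 Hz bin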
11. Tonic Identification: Signal Processing
• Tonic candidate generation (see the sketch below)
  ◦ Salience peaks per frame: 5
  ◦ Frequency range: 110–550 Hz
[Diagram: frame-wise salience peaks accumulated into a multi-pitch histogram]
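Per the slide, tonic candidates are the top five salience peaks per frame, restricted to 110–550 Hz and aggregated over time into a multi-pitch histogram. A minimal sketch of that aggregation, reusing `freq_to_bin` from the previous sketch and assuming `salience_frames` is an array of per-frame salience vectors (the salience-weighted voting here is an assumption for illustration):

```python
import numpy as np
from scipy.signal import find_peaks

def multipitch_histogram(salience_frames, peaks_per_frame=5):
    """Accumulate the top salience peaks of each frame into a candidate histogram."""
    hist = np.zeros(salience_frames.shape[1])
    lo, hi = int(freq_to_bin(110.0)), int(freq_to_bin(550.0))  # 110-550 Hz range
    for frame in salience_frames:
        peaks, _ = find_peaks(frame)
        peaks = peaks[(peaks >= lo) & (peaks <= hi)]
        if peaks.size:
            top = peaks[np.argsort(frame[peaks])[-peaks_per_frame:]]
            hist[top] += frame[top]                 # salience-weighted vote
    return hist / max(hist.max(), 1e-9)             # normalized, as in the figure
```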
12. Tonic Identification: Feature Extraction
• Identifying the tonic in the correct octave using the multi-pitch histogram
• Classification-based template learning
• The class of an instance is the rank of the tonic peak (a sketch follows the figure below)
[Figure: multipitch histogram (frequency bins of 10 cents, ref. 55 Hz; normalized salience 0–1) with the leading peaks labeled f2–f5.]
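The slide frames octave selection as classification: features describe the top histogram peaks, and the class label is which ranked peak is the tonic. A toy version of that idea, with a generic scikit-learn decision tree standing in for the template learning described in the paper; the feature design (peak intervals and relative heights) and all names here are illustrative assumptions:

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.tree import DecisionTreeClassifier

def peak_features(hist, n_peaks=5):
    """Interval (bins) and relative height of top peaks w.r.t. the highest peak.

    Assumes the histogram contains at least one peak.
    """
    peaks, _ = find_peaks(hist)
    top = peaks[np.argsort(hist[peaks])[-n_peaks:]][::-1]   # descending salience
    ref = top[0]
    feats = []
    for p in top[1:]:
        feats.extend([p - ref, hist[p] / hist[ref]])
    feats += [0.0] * (2 * (n_peaks - 1) - len(feats))       # pad if fewer peaks
    return np.array(feats)

# X: one feature vector per recording; y: rank (0-4) of the true tonic peak,
# obtained from annotated recordings (training data assumed available).
clf = DecisionTreeClassifier(max_depth=4)
# clf.fit(np.stack([peak_features(h) for h in train_hists]), y_train)
# tonic_rank = clf.predict([peak_features(test_hist)])[0]
```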
14. Tonic Identification: Results
S. Gulati, A. Bellur, J. Salamon, H. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.
16. Pitch Estimation Algorithms
• Time-domain approaches
  ◦ ACF-based (Rabiner, 1977)
  ◦ AMDF-based: YIN (de Cheveigné and Kawahara, 2002)
• Frequency-domain approaches
  ◦ Two-way mismatch (Maher and Beauchamp, 1994)
  ◦ Subharmonic summation (Hermes, 1988)
• Multi-pitch approaches
  ◦ Source-separation-based (Klapuri, 2003)
  ◦ Harmonic summation: Melodia (Salamon and Gómez, 2012)

Rabiner, L. (1977). On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(1), 24–33.
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
Medan, Y., & Yair, E. (1991). Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1), 40–48.
Maher, R., & Beauchamp, J. W. (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. The Journal of the Acoustical Society of America, 95(4), 2254–2263.
Hermes, D. (1988). Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83, 257–264.
Klapuri, A. (2003). Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6), 804–816.
Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
18. Predominant Pitch Estimation: YIN
[Block diagram: signal → difference function / autocorrelation → cumulative mean normalized difference function]
The autocorrelation function is defined as

$$r_t(\tau) = \sum_{j=t+1}^{t+W} x_j x_{j+\tau}, \qquad (1)$$

where $r_t(\tau)$ is the autocorrelation function of lag $\tau$ calculated at time index $t$, and $W$ is the integration window size. This function is illustrated in Fig. 1(b) for the signal plotted in Fig. 1(a). It is common in signal processing to use a slightly different definition:

$$r'_t(\tau) = \sum_{j=t+1}^{t+W-\tau} x_j x_{j+\tau}. \qquad (2)$$

Here the integration window size shrinks with increasing values of $\tau$, with the result that the envelope of the function decreases as a function of lag, as illustrated in Fig. 1(c).

[Fig. 1: (a) example speech waveform; (b) autocorrelation function (ACF) calculated from the waveform according to Eq. (1); (c) the same, calculated according to Eq. (2), whose envelope tapers to zero because of the smaller number of terms in the summation at larger $\tau$. Horizontal arrows symbolize the search range for the period.]

[Fig. 2: F0 estimation error rates as a function of the slope of the ACF envelope, quantified by its intercept with the abscissa. The dotted line represents errors for which the F0 estimate was too high, the dashed line those for which it was too low, and the full line their sum. Triangles at the right represent error rates for the ACF calculated as in Eq. (1) ($\tau_{\max} = \infty$), measured over a subset of the database used in Sec. III.]
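To make the tapering effect concrete, here is a small numerical sketch comparing the two definitions on a synthetic sinusoid; the signal, window size, and lags are arbitrary choices for illustration:

```python
import numpy as np

def acf_eq1(x, t, W, max_lag):
    """r_t(tau) = sum_{j=t+1}^{t+W} x_j * x_{j+tau}   (fixed window, Eq. 1)."""
    return np.array([np.dot(x[t + 1:t + 1 + W], x[t + 1 + tau:t + 1 + W + tau])
                     for tau in range(max_lag)])

def acf_eq2(x, t, W, max_lag):
    """r'_t(tau) = sum_{j=t+1}^{t+W-tau} x_j * x_{j+tau}   (shrinking window, Eq. 2)."""
    return np.array([np.dot(x[t + 1:t + 1 + W - tau], x[t + 1 + tau:t + 1 + W])
                     for tau in range(max_lag)])

fs, f0 = 8000, 200                        # hypothetical sampling rate and pitch
n = np.arange(4 * fs // f0)               # four periods (period = 40 samples)
x = np.sin(2 * np.pi * f0 * n / fs)
r1 = acf_eq1(x, t=0, W=80, max_lag=60)
r2 = acf_eq2(x, t=0, W=80, max_lag=60)
# Eq. (1) keeps a full-height peak at the period; Eq. (2) attenuates it:
print(r1[40] / r1[0], r2[40] / r2[0])
```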
The present article introduces a method for F0 estimation that produces fewer errors than other well-known methods. The name YIN (from the "yin" and "yang" of oriental philosophy) alludes to the interplay between autocorrelation and cancellation that it involves.
The parameter $\tau_{\max}$ allows the algorithm to be biased to favor one form of error at the expense of the other, with a minimum of total error for intermediate values. Using Eq. (2) rather than Eq. (1) introduces a natural bias that can be tuned by adjusting $W$. However, changing the window size has other effects, and one can argue that a bias of this sort, if useful, should be applied explicitly rather than implicitly. This is one reason to prefer the definition of Eq. (1).
The autocorrelation method compares the signal to its shifted self. In that sense it is related to the AMDF method (average magnitude difference function; Ross et al., 1974; Ney, 1982), which performs its comparison using differences rather than products, and more generally to time-domain methods that measure intervals between events in time (Hess, 1983). The ACF is the Fourier transform of the power spectrum, and can be seen as measuring the regular spacing of harmonics within that spectrum. The cepstrum method (Noll, 1967) replaces the power spectrum by the log magnitude spectrum and thus puts less weight on high-amplitude parts of the spectrum (particularly near the first formant that often dominates the ACF). Similar "spectral whitening" effects can be obtained by linear predictive inverse filtering or center-clipping (Rabiner and Schafer, 1978), or by splitting the signal over a bank of filters, calculating ACFs within each channel, and adding the results after amplitude normalization (de Cheveigné, 1991). Auditory models based on autocorrelation are currently one of the more popular ways to explain pitch perception.
A signal that is periodic with period $T$ is unchanged when shifted by $T$; the same is true after taking the square and averaging over a window:

$$\sum_{j=t+1}^{t+W} (x_j - x_{j+T})^2 = 0. \qquad (5)$$

Conversely, an unknown period may be found by forming the difference function

$$d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2, \qquad (6)$$

and searching for the values of $\tau$ for which the function is zero. There is an infinite set of such values, all multiples of the period. The difference function calculated from the signal in Fig. 1(a) is illustrated in Fig. 3(a).

[Fig. 3: (a) difference function calculated for the speech signal of Fig. 1(a); (b) cumulative mean normalized difference function, which starts at 1 rather than 0 and remains high until the dip at the period.]
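Combining the difference function of Eq. (6) with the cumulative mean normalization of Fig. 3(b) and the absolute threshold (step 4 of the cited paper) gives the core of YIN. The sketch below is a compact illustration under those equations, not the authors' implementation: frame handling and the parabolic-interpolation refinement (step 5) are omitted, and the test signal is invented.

```python
import numpy as np

def yin_f0(x, fs, fmin=40.0, fmax=800.0, threshold=0.1):
    """Estimate the F0 of one frame via the cumulative mean normalized difference."""
    tau_max = int(fs / fmin)
    W = tau_max                                     # integration window size
    # Eq. (6): difference function
    d = np.array([np.sum((x[:W] - x[tau:tau + W]) ** 2) for tau in range(tau_max)])
    # Fig. 3(b): cumulative mean normalized difference, d'(0) = 1
    dprime = np.ones_like(d)
    cumsum = np.cumsum(d[1:])
    dprime[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(cumsum, 1e-12)
    # Step 4: first dip below the absolute threshold within the search range
    tau_min = int(fs / fmax)
    for tau in range(tau_min, tau_max):
        if dprime[tau] < threshold:
            while tau + 1 < tau_max and dprime[tau + 1] < dprime[tau]:
                tau += 1                            # descend to the local minimum
            return fs / tau
    return fs / (tau_min + np.argmin(dprime[tau_min:]))   # fallback: global minimum

fs = 8000
t = np.arange(2 * int(fs / 40)) / fs                # frame long enough for fmin
frame = np.sin(2 * np.pi * 210.0 * t)               # hypothetical 210 Hz tone
print(yin_f0(frame, fs))                            # approximately 210 Hz
```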
Table I. Gross error rates for the simple unbiased autocorrelation method (step 1) and for the cumulated steps described in the text. These rates were measured over a subset of the database used in Sec. III. Integration window size was 25 ms, window shift was one sample, search range was 40–800 Hz, and the threshold (step 4) was 0.1.

Version   Gross error (%)
Step 1    10.0
Step 2    1.95
Step 3    1.69
Step 4    0.78
Step 5    0.77
Step 6    0.50
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
20. Predominant Pitch Estimation: Melodia
Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
22. Predominant Pitch Estimation: Melodia
Pipeline stage: audio → spectrogram → spectral peaks
23. Predominant Pitch Estimation: Melodia
Pipeline stage: spectral peaks → time-frequency salience
24. Predominant Pitch Estimation: Melodia
Pipeline stage: time-frequency salience → salience peaks → contours
25. Predominant Pitch Estimation: Melodia
Pipeline stage: contours → predominant melody contours (a usage sketch follows below)
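For readers who want to try this contour-based pipeline directly, Melodia is available in the Essentia library (developed at the same research group) as `PredominantPitchMelodia`. A minimal usage sketch, assuming Essentia is installed, a mono 44.1 kHz recording, and a placeholder file name:

```python
import numpy as np
import essentia.standard as es

# Load audio (hypothetical path) and run the Melodia predominant-melody extractor.
audio = es.MonoLoader(filename='recording.wav', sampleRate=44100)()
melodia = es.PredominantPitchMelodia(frameSize=2048, hopSize=128)
pitch_hz, pitch_confidence = melodia(audio)      # per-frame F0 (0 Hz = unvoiced)

# Time stamps for each pitch value, from the hop size:
times = np.arange(len(pitch_hz)) * 128 / 44100.0
print(times[:5], pitch_hz[:5])
```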