Emotion recognition through speech has many potential applications; the challenge, however, lies in achieving high recognition accuracy with limited resources or in the presence of interference such as noise. In this paper we explore the possibility of improving speech emotion recognition by applying voice activity detection (VAD). The emotional voice data from the Berlin Emotion Database (EMO-DB) and a custom-made database, the LQ Audio Dataset, are first preprocessed by VAD before feature extraction. The features are then passed to a deep neural network for classification. In this paper, we chose MFCC as the sole feature. Comparing the results obtained with and without VAD, we found that VAD improved the recognition rate for 5 emotions (happy, angry, sad, fear, and neutral) by 3.7% on clean signals, while using VAD when training the network on both clean and noisy signals improved our previous results by 50%.
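The abstract does not specify its VAD algorithm; a minimal energy-based sketch of the idea (frame-level energy thresholding, with an assumed threshold ratio) looks like this:

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_ratio=0.1):
    """Flag frames whose energy exceeds a fraction of the peak frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    threshold = threshold_ratio * energy.max()
    return energy > threshold  # boolean mask: True = speech frame

def trim_silence(signal, frame_len=400, threshold_ratio=0.1):
    """Keep only the frames marked as speech, before feature extraction."""
    mask = energy_vad(signal, frame_len, threshold_ratio)
    frames = signal[:len(mask) * frame_len].reshape(len(mask), frame_len)
    return frames[mask].ravel()
```

The trimmed signal would then feed the MFCC extraction stage, so silent or noise-only frames never reach the classifier.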
GENDER RECOGNITION SYSTEM USING SPEECH SIGNAL (IJCSEIT Journal)
In this paper, a system developed for speech encoding, analysis, synthesis and gender identification is presented. A typical gender recognition system can be divided into a front-end system and a back-end system. The task of the front-end system is to extract gender-related information from a speech signal and represent it by a set of vectors called features. Features like power spectral density and the frequency at maximum power carry speaker information. The features are extracted using the Fast Fourier Transform (FFT) algorithm. The task of the back-end system (also called the classifier) is to create a gender model to recognize the speaker's gender from his/her speech signal in the recognition phase. This paper also presents the digital processing of speech signals (pronounced "A" and "B") taken from 10 persons, 5 male and 5 female. The power spectrum estimate of the signal is examined, and the frequency at maximum power of the English phonemes is extracted from the estimated power spectrum. The system uses a threshold technique as the identification tool. The recognition accuracy of this system is 80% on average.
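The pipeline described (FFT power spectrum, peak frequency, threshold decision) can be sketched as follows; the 160 Hz threshold is a hypothetical illustration, not the paper's value:

```python
import numpy as np

def dominant_frequency(signal, fs):
    """Frequency of the bin with maximum power in the FFT power spectrum."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

def classify_gender(signal, fs, threshold_hz=160.0):
    """Hypothetical threshold rule: lower dominant frequency -> 'male'."""
    return "male" if dominant_frequency(signal, fs) < threshold_hz else "female"
```

In practice the threshold would be tuned on the enrolled speakers' estimated peak frequencies rather than fixed.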
This document presents research on developing a machine learning model to recognize gender from voice recordings. The researchers collected a dataset of over 60,000 audio samples from online repositories. They extracted acoustic features from the recordings like mean, median and standard deviation of dominant frequencies. These features were used to train four machine learning algorithms - Support Vector Machine, Decision Tree, Gradient Boosted Trees and Random Forest. The Random Forest model achieved the highest testing accuracy of 89.3% for gender recognition. Feature importance analysis showed that standard deviation, mean and kurtosis were the most important discriminative features for the tree-based models. The research demonstrates that machine learning can effectively perform voice-based gender recognition using acoustic features extracted from speech.
Comparative Study of Different Techniques in Speaker Recognition: Review (IJAEMS Journal)
Speech is the most basic and essential method of communication used by people. The speaker is recognized on the basis of individual information included in speech signals. Speaker recognition (SR) is used to identify the person who is speaking, and in recent years it has been applied to security systems. In this paper we discuss feature extraction techniques such as Mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC) and dynamic time warping (DTW), and, for classification, Gaussian mixture models (GMM), artificial neural networks (ANN) and support vector machines (SVM).
Speech Emotion Recognition is a recent research topic in the Human Computer Interaction (HCI) field. The need has arisen for a more natural communication interface between humans and computers, as computers have become an integral part of our lives. A lot of work is currently going on to improve the interaction between humans and computers. To achieve this goal, a computer would have to be able to assess its present situation and respond differently depending on that observation. Part of this process involves understanding a user's emotional state. To make human-computer interaction more natural, the objective is that the computer should be able to recognize emotional states in the same way a human does. The efficiency of an emotion recognition system depends on the type of features extracted and the classifier used for detection of emotions. The proposed system aims at identifying basic emotional states such as anger, joy, neutral and sadness from human speech. For classifying the different emotions, features like MFCC (Mel Frequency Cepstral Coefficients) and energy are used. In this paper, a standard emotional database (an English database) is used, which gives more satisfactory detection of emotions than self-recorded samples. The methodology describes and compares the performance of a Learning Vector Quantization Neural Network (LVQ NN), a Multiclass Support Vector Machine (SVM) and their combination for emotion recognition.
This document discusses an approach to extract features from speech signals using Mel Frequency Cepstral Coefficients (MFCC). MFCC is a popular feature extraction technique that models human auditory perception. It involves preprocessing the speech signal, framing it, applying the discrete cosine transform to the mel scale filter bank energies, and optionally adding delta and acceleration features. MFCC extraction was tested on isolated word samples to obtain coefficients representing the short-term power spectrum. These coefficients can then be used for applications like speech recognition by matching patterns in the feature space.
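The MFCC steps listed (framing, windowing, power spectrum, mel filterbank, log, DCT) can be sketched with numpy/scipy; frame size, hop and filter counts below are common defaults, not values from the document:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, fs, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_filters, frame_len, fs)
    feats = np.empty((n_frames, n_ceps))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2 / frame_len
        mel_energies = np.maximum(fbank @ power, 1e-10)  # avoid log(0)
        feats[i] = dct(np.log(mel_energies), norm='ortho')[:n_ceps]
    return feats
```

Delta and acceleration features, mentioned as optional in the abstract, would be first and second differences of these frame-wise vectors.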
Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (TELKOMNIKA Journal)
This paper proposes the combined methods of the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected feature vector of Indonesian syllables. This research aims to find the most effective and efficient properties for performing feature extraction of each syllable sound, to be applied in speech recognition systems. The proposed approach, which extends the previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in further research, together with recommendations on the most effective and efficient WT to use for syllable sound recognition.
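The WT+ED combination can be sketched as below; a one-level Haar transform stands in for whichever wavelet the paper recommends, and using ED to match a new feature against stored syllable templates is an assumed reading of the pipeline:

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        x = x[:-1]  # Haar pairs samples; drop a trailing odd sample
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def nearest_syllable(feature, templates):
    """Pick the syllable template with minimum Euclidean distance."""
    names = list(templates)
    dists = [np.linalg.norm(feature - templates[n]) for n in names]
    return names[int(np.argmin(dists))]
```

Here the approximation coefficients (or statistics of them) would serve as the feature vector for each syllable.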
Adaptive wavelet thresholding with robust hybrid features for text-independe... (IJECE IAES)
The robustness of a speaker identification system over an additive noise channel is crucial for real-world applications. In speaker identification (SID) systems, the features extracted from each speech frame are an essential factor in building a reliable identification system. In clean environments the identification system works well; in noisy environments, additive noise degrades it. To overcome the problem of additive noise and achieve high accuracy, a feature extraction algorithm based on speech enhancement and combined features is proposed. In this paper, a wavelet thresholding pre-processing stage and feature warping (FW) are used with two combined features, power normalized cepstral coefficients (PNCC) and gammatone frequency cepstral coefficients (GFCC), to improve the identification system's robustness against different types of additive noise. A universal background model Gaussian mixture model (UBM-GMM) is used for feature matching between the claimed and actual speakers. The results showed a performance improvement for the proposed feature extraction algorithm compared with conventional features over most noise types and various SNRs.
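The wavelet-thresholding pre-processing step can be sketched as a one-level Haar decomposition with soft thresholding of the detail coefficients; the universal threshold used here is a textbook choice, not the paper's adaptive rule:

```python
import numpy as np

def haar_denoise(x, sigma):
    """One-level Haar decomposition, soft-threshold the details, reconstruct."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))  # universal threshold
    detail = np.sign(detail) * np.maximum(np.abs(detail) - thr, 0.0)
    out = np.empty_like(x)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out
```

Because speech energy concentrates in the approximation band while broadband noise spreads into the details, shrinking the details removes noise with little speech distortion.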
Automatic speech emotion and speaker recognition based on hybrid GMM and FFBNN (IJCSA)
In this paper we present text-dependent speaker recognition enhanced by first detecting the emotion of the speaker, using hybrid FFBNN and GMM methods. The emotional state of the speaker influences the recognition system. The Mel-frequency Cepstral Coefficient (MFCC) feature set is used for experimentation. To recognize the emotional state of a speaker, a Gaussian Mixture Model (GMM) is used in the training phase and a Feed Forward Back Propagation Neural Network (FFBNN) in the testing phase. A speech database of 25 speakers recorded in five different emotional states (happy, angry, sad, surprise and neutral) is used for experimentation. The results reveal that the emotional state of the speaker has a significant impact on the accuracy of speaker recognition.
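A full GMM/FFBNN pipeline is beyond a sketch, but the generative-classification idea can be illustrated with a single diagonal Gaussian per emotion class as a stand-in for the mixture model (an explicit simplification, not the paper's method):

```python
import numpy as np

def fit_gaussian(features):
    """Per-class diagonal Gaussian: a one-component stand-in for a GMM."""
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # floor variance for stability
    return mean, var

def log_likelihood(x, mean, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x, models):
    """Pick the emotion whose Gaussian gives the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(x, *models[name]))
```

A real GMM would sum weighted component likelihoods per class; the decision rule is otherwise the same.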
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition... (TELKOMNIKA Journal)
Speech recognition can be defined as the process of converting voice signals into a sequence of words by applying a specific algorithm implemented in a computer program. Research on speech recognition in Indonesia is relatively limited. This paper studies which feature extraction method, Linear Predictive Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), is better for speech recognition in the Indonesian language. This is important because a method that produces high accuracy for one language does not necessarily produce the same accuracy for other languages, considering that every language has different characteristics. This research can therefore hopefully help accelerate the use of automatic speech recognition for the Indonesian language. There are two main processes in speech recognition: feature extraction and recognition. The feature extraction methods compared in this study are LPC and MFCC, while the recognition method is the Hidden Markov Model (HMM). The test results showed that the MFCC method is better than LPC for Indonesian-language speech recognition.
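The LPC side of this comparison is commonly computed with the autocorrelation method and the Levinson-Durbin recursion; a minimal sketch (model order is illustrative):

```python
import numpy as np

def lpc(signal, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin).

    Returns the prediction polynomial A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    """
    x = np.asarray(signal, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        new_a = a.copy()
        new_a[1:i] += k * a[i - 1:0:-1]     # update previous coefficients
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # shrink residual energy
    return a
```

For a true AR process the recovered polynomial approaches the generating coefficients, which is why LPC models the vocal-tract filter well.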
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSC Journals)
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information from their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique, performing gender recognition before speaker recognition to improve the accuracy and speed of the baseline speaker recognition system. Based on experiments conducted on the MIT database, we demonstrate that our proposed system improves accuracy over the baseline by approximately 2%, while reducing computational time by more than 30%.
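The pruning idea (decide gender first, then score only same-gender speakers) can be sketched with a nearest-centroid scorer; the enrollment structure and scoring rule here are illustrative assumptions, not the paper's statistical models:

```python
import numpy as np

def identify(feature, enrolled, gender_centroids):
    """Two-stage search: classify gender, then score only that gender."""
    # Stage 1: coarse gender decision by nearest gender centroid.
    gender = min(gender_centroids,
                 key=lambda g: np.linalg.norm(feature - gender_centroids[g]))
    # Stage 2: nearest-centroid speaker match within the predicted gender only,
    # so roughly half the speaker models are never scored.
    candidates = {s: c for s, (g, c) in enrolled.items() if g == gender}
    return min(candidates,
               key=lambda s: np.linalg.norm(feature - candidates[s]))
```

Skipping the other gender's models is where the reported >30% reduction in computation comes from; accuracy can also improve because cross-gender confusions are eliminated.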
This paper reports on an Audio-Visual Client Recognition System, implemented in MATLAB, which identifies five clients and can be extended to identify as many clients as it is trained on. The visual recognition part was implemented using Principal Component Analysis, Linear Discriminant Analysis and a Nearest Neighbour classifier. The audio recognition part was implemented using Mel-Frequency Cepstral Coefficients, Linear Discriminant Analysis and a Nearest Neighbour classifier. The system was tested with images and sounds it had not been trained on, to see whether it can detect an intruder, and responded to intruders with high precision.
This document discusses speaker recognition techniques and applications. It provides an overview of common feature extraction methods for speaker recognition systems such as Linear Predictive Coding (LPC), Perceptual Linear Predictive Coefficients (PLPC), Mel Frequency Cepstral Coefficients (MFCC), and Gamma Tone Frequency Cepstral Coefficients (GFCC). It compares these techniques and finds that MFCC and PLPC are most often used due to being derived from models of human auditory processing. GFCC has benefits of higher frequency resolution and noise robustness compared to MFCC. Speaker recognition has applications in security systems for access control.
The document discusses two experiments on classifying heart sounds using deep learning models. Experiment 1 compares the performance of various pre-trained deep learning models, finding that models pre-trained on audio data performed better than models pre-trained on image data. Experiment 2 evaluates model performance under domain shift conditions by training on one heart sound database and testing on others. It finds that data augmentation techniques like trimming and respiratory scaling can improve robustness, with all augmentation techniques together working best on 3 of 6 test databases.
This document presents a text-dependent speaker recognition system using neural networks that aims to improve recognition accuracy. It proposes changing the number of Mel Frequency Cepstral Coefficients (MFCCs) used in training. Voice Activity Detection is also used as a preprocessing step. Experimental results show recognition accuracy increases from 70.41% to a maximum of 92.91% as the number of MFCCs increases from 14 to 20, but then decreases with more MFCCs. The system is implemented on a Raspberry Pi for hardware acceleration.
This document summarizes a student's final year project analyzing cognitive radio spectrum access through mathematical modeling and simulation. The student developed queueing models to represent the radio spectrum and primary and cognitive users. Simulation programs were created and numerical results were discussed to analyze the performance of the cognitive radio system and effects of providing extra buffers. The models, programs, and results allow studying how cognitive radio systems can access licensed spectrum while avoiding interference to primary users.
This document outlines a research project for a Master's degree in computer engineering. The project aims to improve the performance of speaker verification systems in noisy environments. It will use voice activity detection (VAD) methods to estimate noise models and parallel model combination (PMC) to estimate noisy speech models. Two VAD methods will be implemented and compared: vector quantization VAD (VQVAD) and minimum mean squared error (MMSE). Experiments will be conducted on the NIST 1998 and NOISEX-92 databases at various signal-to-noise ratios to evaluate mel frequency cepstral coefficients using PMC. The results will be used to analyze the effectiveness of the VAD methods in improving speaker verification accuracy in noisy conditions.
Speaker Identification & Verification Using MFCC & SVM (IRJET Journal)
This document discusses a method for speaker identification and verification using Mel Frequency Cepstral Coefficients (MFCC) and Support Vector Machines (SVM). MFCC is used to extract features from speech samples, representing the spectral properties of the voice. SVM is then used to match these features between speech samples to identify or verify speakers. The authors claim this method can identify speakers with up to 95% accuracy and is computationally efficient. It is proposed that MFCC and SVM outperform other techniques for speaker recognition tasks. The system is described as having applications in security, voice interfaces, and other voice-based systems.
We propose a model for carrying out deep learning based multimodal sentiment analysis. The MOUD dataset is taken for experimentation purposes. We developed two parallel text-based and audio-based models and further fused these heterogeneous feature maps, taken from intermediate layers, to complete the architecture. Performance measures (accuracy, precision, recall and F1-score) are observed to outperform the existing models.
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas... (IJCER Online)
Comparative study of voice print based acoustic features MFCC and LPCC (IJAEMS Journal)
Voice is the best biometric feature for investigation and authentication; it has both biological and behavioural characteristics, and the acoustic features are related to the voice. The Speaker Recognition System is designed for automatic authentication of a speaker's identity based on the human voice. Mel Frequency Cepstrum Coefficients (MFCC) and Linear Prediction Cepstrum Coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper provides a comparative study of MFCC and LPCC based on the accuracy of results and their working methodology. The results are better when MFCC is used for feature extraction.
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY (CS & IT)
The rapid increase in data transmission over the internet puts an emphasis on information security. Audio steganography is used for secure transmission of secret data with an audio signal as the carrier. In the proposed method, the cover audio file is transformed from the space domain to the wavelet domain using a lifting scheme, leading to secure data hiding. The text message is encrypted using a dynamic encryption algorithm, and the cipher text is then hidden in the wavelet coefficients of the cover audio signal. Signal-to-Noise Ratio (SNR) and Squared Pearson Correlation Coefficient (SPCC) values are computed to judge the quality of the stego audio signal. Results show that the stego audio signal is perceptually indistinguishable from the cover audio signal and remains robust even in the presence of external noise. The proposed method provides secure, low-error data extraction.
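The hide-in-wavelet-coefficients step can be sketched with an integer lifting Haar transform and LSB substitution in the detail coefficients; the paper's dynamic encryption stage is omitted here, and plain LSB embedding is an assumed stand-in for its hiding rule:

```python
import numpy as np

def lifting_haar_fwd(x):
    """Integer lifting Haar: detail d = odd - even, approx a = even + d//2."""
    even, odd = x[0::2].copy(), x[1::2].copy()
    d = odd - even
    a = even + d // 2
    return a, d

def lifting_haar_inv(a, d):
    """Exact integer inverse of the lifting steps above."""
    even = a - d // 2
    odd = d + even
    out = np.empty(2 * len(a), dtype=a.dtype)
    out[0::2], out[1::2] = even, odd
    return out

def embed(cover, bits):
    """Hide bits in the LSBs of the detail coefficients, then invert."""
    a, d = lifting_haar_fwd(cover)
    d[:len(bits)] = (d[:len(bits)] & ~1) | np.asarray(bits)
    return lifting_haar_inv(a, d)

def extract(stego, n_bits):
    """Re-run the forward transform and read the LSBs back."""
    _, d = lifting_haar_fwd(stego)
    return (d[:n_bits] & 1).tolist()
```

Because integer lifting is exactly invertible, the embedded bits survive the round trip, and each coefficient LSB change perturbs the audio samples by at most a couple of quantization steps.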
Estimating the quality of digitally transmitted speech over satellite communi... (Alexander Decker)
This document discusses methods for estimating the quality of digitally transmitted speech over satellite
communication channels. It proposes a methodology that divides the speech signal into segments and calculates the
segmental signal-to-noise ratio (SNR) for each segment. This allows assessing changes in speech quality over time
through statistical modeling. It also considers the spectral characteristics of the speech signal, providing a better
measure than just signal power. The document reviews other existing objective and subjective methods for quality
estimation and argues that the proposed segmental SNR approach has advantages over these other approaches.
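The segmental SNR measure described above is straightforward to sketch: compute a per-frame SNR in dB and average, conventionally clipping each frame to a fixed range so silent frames do not dominate (the frame length and clip limits below are common choices, not the document's):

```python
import numpy as np

def segmental_snr(clean, degraded, frame_len=256,
                  floor_db=-10.0, ceil_db=35.0):
    """Mean per-frame SNR in dB, clipped to a standard [-10, 35] dB range."""
    n = min(len(clean), len(degraded)) // frame_len * frame_len
    c = clean[:n].reshape(-1, frame_len)
    d = degraded[:n].reshape(-1, frame_len)
    noise = c - d
    snr = 10.0 * np.log10(np.sum(c ** 2, axis=1) /
                          np.maximum(np.sum(noise ** 2, axis=1), 1e-12))
    return float(np.mean(np.clip(snr, floor_db, ceil_db)))
```

Averaging frame-wise rather than globally is what lets the measure track quality changes over time, as the document argues.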
Single Channel Speech Enhancement using Wiener Filter and Compressive Sensing (IJECE IAES)
Speech enhancement algorithms are used to overcome multiple limiting factors in recent applications such as mobile phones and communication channels. The challenge lies in the trade-off between noise reduction and signal distortion for corrupted speech. We used a modified Wiener filter and compressive sensing (CS) to investigate and evaluate the improvement in speech quality. The new method adapts the noise estimate and the Wiener filter gain function to increase the weight of the amplitude spectrum and better preserve the signal of interest. CS is then applied using the gradient projection for sparse reconstruction (GPSR) technique to empirically investigate the interactive effects of the corrupting noise and obtain better perceptual quality with less listener fatigue under noise-reduction conditions. The proposed algorithm shows improved performance on objective assessment tests, outperforming other conventional algorithms under various noise types at SNRs of 0, 5, 10 and 15 dB. The proposed algorithm therefore significantly improves speech quality and achieves better noise reduction than other conventional algorithms.
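A textbook frame-wise Wiener gain, with the noise spectrum estimated from leading noise-only frames, gives the flavor of the filtering stage (the paper's modified gain function and the CS/GPSR stage are not reproduced; overlap-add windowing is omitted for brevity):

```python
import numpy as np

def wiener_enhance(noisy, frame_len=256, noise_frames=5, gain_floor=0.05):
    """Per-frame FFT with Wiener-style gain G = max(1 - N/P, floor)."""
    n_frames = len(noisy) // frame_len
    frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    power = np.abs(spectra) ** 2
    # Assume the first few frames contain noise only (a common simplification).
    noise_psd = power[:noise_frames].mean(axis=0)
    gain = np.maximum(1.0 - noise_psd / np.maximum(power, 1e-12), gain_floor)
    enhanced = np.fft.irfft(gain * spectra, n=frame_len, axis=1)
    return enhanced.ravel()
```

Bins dominated by noise get a gain near the floor, while bins where speech power far exceeds the noise estimate pass through almost unchanged.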
A novel automatic voice recognition system based on text-independent in a noi... (IJECE IAES)
An automatic voice recognition system aims to limit fraudulent access to sensitive areas such as labs. The primary objective of this paper is to increase the accuracy of voice recognition in a noisy environment using the Microsoft Research (MSR) Identity Toolbox. The proposed system lets the user speak into the microphone and then matches the unknown voice against the human voices in the database using a statistical model, in order to grant or deny access to the system. Voice recognition is done in two steps: training and testing. During training, a Universal Background Model and a Gaussian Mixture Model (GMM-UBM) are computed from different sentences pronounced by the human voice(s) used to record the training data. Testing of the voice signal in a noisy environment then computes the log-likelihood ratio of the GMM-UBM models in order to classify the user's voice. Before testing, noise and de-noising methods were applied; we investigated different MFCC features of the voice to determine the best possible feature, as well as the noise filter algorithm, which subsequently improved the performance of the automatic voice recognition system.
This document summarizes a research paper on speaker recognition using vocal tract features. It discusses extracting pitch using FFT as a vocal source feature and Mel Frequency Cepstral Coefficients (MFCCs) as vocal tract features. These features are used as inputs to a Support Vector Machine (SVM) classifier for binary speaker classification. Experimental results show the SVM achieves up to 94.42% accuracy in classifying speakers based on their MFCC features, and accuracy increases with higher signal-to-noise ratios when noise is added to speech signals. The paper concludes vocal source and tract features can be used together with SVM for text-independent speaker recognition.
Speech compression analysis using MATLAB (eSAT Journals)
This document discusses speech compression analysis using MATLAB. It begins with an introduction to speech compression, noting its importance for efficient storage and transmission of audio data. It then discusses various speech compression techniques, including lossy and lossless compression as well as standards like MPEG. It focuses on using the discrete cosine transform and MATLAB commands to analyze speech signals, including reading wav files, applying windowing functions and the DCT, and playing/viewing the output. The document concludes by discussing current applications of speech compression technologies like MPEG.
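The DCT-based analysis described can be sketched as simple transform coding: keep only the largest-magnitude DCT coefficients and reconstruct (the 10% keep ratio is an illustrative choice, not a figure from the document):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct_compress(signal, keep_ratio=0.1):
    """Keep only the largest-magnitude DCT coefficients; zero the rest."""
    coeffs = dct(signal, norm='ortho')
    n_keep = max(1, int(len(coeffs) * keep_ratio))
    discard = np.argsort(np.abs(coeffs))[:-n_keep]  # indices of small coeffs
    compressed = coeffs.copy()
    compressed[discard] = 0.0
    return idct(compressed, norm='ortho'), n_keep
```

The DCT's energy compaction is what makes this lossy scheme work: for voiced speech most energy sits in a few low-order coefficients, so discarding the rest changes the waveform little.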
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decom... (BIJIAM Journal)
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied for the extraction of emotional features from speech, and a deep neural network (DNN) is used to classify speech emotions. The emotional components in speech signals are enhanced by combining EMD with the acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates of the DNN classifier. EMD is first used to decompose the speech signals, which contain emotional components, into multiple intrinsic mode functions (IMFs); emotional features are then derived from the IMFs and calculated using MFCC. The emotional features are used to train the DNN model, and the trained model is then used to identify emotions in speech. Experimental results reveal that the proposed method is effective.
Speech emotion recognition with light gradient boosting decision trees machine (IJECEIAES)
Speech emotion recognition aims to identify the emotion expressed in the speech by analyzing the audio signals. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are comprehensively encoded as the frequency and temporal domain features. In the classification, a light gradient boosting machine is leveraged. The hyperparameter tuning of the light gradient boosting machine is performed to determine the optimal hyperparameter settings. As the speech emotion recognition datasets are imbalanced, the class weights are regulated to be inversely proportional to the sample distribution where minority classes are assigned higher class weights. The experimental results demonstrate that the proposed method outshines the state-of-the-art methods with 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
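The inverse-frequency class weighting described above can be sketched as follows. The "balanced" N/(K·n_c) heuristic used here is an assumption, since the abstract does not give the exact formula; it assigns the largest weight to the rarest class.

```python
from collections import Counter

def class_weights(labels):
    # Weights inversely proportional to class frequency, normalised so that
    # a perfectly balanced dataset gets weight 1.0 for every class:
    # w_c = N / (K * n_c), with N samples, K classes, n_c samples in class c.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# Hypothetical imbalanced emotion label distribution, for illustration only.
labels = ["neutral"] * 60 + ["angry"] * 30 + ["sad"] * 10
w = class_weights(labels)
# The minority class "sad" receives the largest weight.
```

Passing such weights to the classifier's loss makes errors on minority emotions cost more, counteracting the imbalance the abstract mentions.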
Comparison of Feature Extraction MFCC and LPC in Automatic Speech Recognition... (TELKOMNIKA JOURNAL)
Speech recognition can be defined as the process of converting voice signals into words by applying a specific algorithm implemented in a computer program. Speech recognition research in Indonesia is relatively limited. This paper studies which feature extraction method, Linear Predictive Coding (LPC) or Mel Frequency Cepstral Coefficients (MFCC), is best for speech recognition in the Indonesian language. This matters because a method that produces high accuracy for one language does not necessarily produce the same accuracy for another, since every language has different characteristics; this research therefore aims to help accelerate the adoption of automatic speech recognition for Indonesian. There are two main processes in speech recognition: feature extraction and recognition. The feature extraction methods compared in this study are LPC and MFCC, while recognition uses a Hidden Markov Model (HMM). Test results showed that MFCC outperforms LPC for Indonesian-language speech recognition.
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique (CSCJournals)
An automatic speaker recognition system recognizes an unknown speaker among several reference speakers by making use of speaker-specific information in their speech. In this paper, we introduce a novel, hierarchical, text-independent speaker recognition technique. Our baseline speaker recognition system, built using statistical modeling techniques, gives an accuracy of 81% on the standard MIT database, and our baseline gender recognition system gives an accuracy of 93.795%. We then propose and implement a novel state-space pruning technique that performs gender recognition before speaker recognition, to improve the accuracy and timeliness of the baseline speaker recognition system. Based on experiments conducted on the MIT database, we demonstrate that the proposed system improves accuracy over the baseline by approximately 2% while reducing computational time by more than 30%.
This paper reports on an Audio-Visual Client Recognition System, successfully implemented in MATLAB, which identifies five clients and can be extended to identify as many clients as it is trained on. The visual recognition part was implemented using Principal Component Analysis, Linear Discriminant Analysis, and a Nearest Neighbour classifier. The audio recognition part was implemented using Mel-Frequency Cepstral Coefficients, Linear Discriminant Analysis, and a Nearest Neighbour classifier. The system was tested with images and sounds it had not been trained on to see whether it could detect an intruder, and it responded to intruders with high precision.
This document discusses speaker recognition techniques and applications. It provides an overview of common feature extraction methods for speaker recognition systems such as Linear Predictive Coding (LPC), Perceptual Linear Predictive Coefficients (PLPC), Mel Frequency Cepstral Coefficients (MFCC), and Gamma Tone Frequency Cepstral Coefficients (GFCC). It compares these techniques and finds that MFCC and PLPC are most often used due to being derived from models of human auditory processing. GFCC has benefits of higher frequency resolution and noise robustness compared to MFCC. Speaker recognition has applications in security systems for access control.
The document discusses two experiments on classifying heart sounds using deep learning models. Experiment 1 compares the performance of various pre-trained deep learning models, finding that models pre-trained on audio data performed better than models pre-trained on image data. Experiment 2 evaluates model performance under domain shift conditions by training on one heart sound database and testing on others. It finds that data augmentation techniques like trimming and respiratory scaling can improve robustness, with all augmentation techniques together working best on 3 of 6 test databases.
This document presents a text-dependent speaker recognition system using neural networks that aims to improve recognition accuracy. It proposes changing the number of Mel Frequency Cepstral Coefficients (MFCCs) used in training. Voice Activity Detection is also used as a preprocessing step. Experimental results show recognition accuracy increases from 70.41% to a maximum of 92.91% as the number of MFCCs increases from 14 to 20, but then decreases with more MFCCs. The system is implemented on a Raspberry Pi for hardware acceleration.
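The Voice Activity Detection preprocessing mentioned above could, in its simplest form, look like the following energy-threshold sketch. The frame length and threshold ratio are arbitrary illustrative choices, not the paper's settings.

```python
import math

def frame_energy_vad(samples, frame_len=160, threshold_ratio=0.1):
    # Toy energy-based VAD: a frame counts as "speech" when its RMS energy
    # exceeds threshold_ratio times the loudest frame's RMS energy.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    rms = [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]
    peak = max(rms)
    return [e > threshold_ratio * peak for e in rms]

# Silence, then a burst of "speech", then silence again.
signal = [0.0] * 320 + [math.sin(0.3 * n) for n in range(320)] + [0.0] * 320
mask = frame_energy_vad(signal)
# mask -> [False, False, True, True, False, False]
```

Frames flagged False are discarded before MFCC extraction, so the classifier never trains on silence; production VADs add smoothing, hangover, and noise-floor tracking on top of this idea.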
This document summarizes a student's final year project analyzing cognitive radio spectrum access through mathematical modeling and simulation. The student developed queueing models to represent the radio spectrum and primary and cognitive users. Simulation programs were created and numerical results were discussed to analyze the performance of the cognitive radio system and effects of providing extra buffers. The models, programs, and results allow studying how cognitive radio systems can access licensed spectrum while avoiding interference to primary users.
This document outlines a research project for a Master's degree in computer engineering. The project aims to improve the performance of speaker verification systems in noisy environments. It will use voice activity detection (VAD) methods to estimate noise models and parallel model combination (PMC) to estimate noisy speech models. Two VAD methods will be implemented and compared: vector quantization VAD (VQVAD) and minimum mean squared error (MMSE). Experiments will be conducted on the NIST 1998 and NOISEX-92 databases at various signal-to-noise ratios to evaluate mel frequency cepstral coefficients using PMC. The results will be used to analyze the effectiveness of the VAD methods in improving speaker verification accuracy in noisy conditions.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Speaker Identification & Verification Using MFCC & SVM (IRJET Journal)
This document discusses a method for speaker identification and verification using Mel Frequency Cepstral Coefficients (MFCC) and Support Vector Machines (SVM). MFCC is used to extract features from speech samples, representing the spectral properties of the voice. SVM is then used to match these features between speech samples to identify or verify speakers. The authors claim this method can identify speakers with up to 95% accuracy and is computationally efficient. It is proposed that MFCC and SVM outperform other techniques for speaker recognition tasks. The system is described as having applications in security, voice interfaces, and other voice-based systems.
We propose a model for deep learning based multimodal sentiment analysis. The MOUD dataset is used for experimentation. We developed two parallel models, one text based and one audio based, and fused their heterogeneous feature maps, taken from intermediate layers, to complete the architecture. The performance measures (accuracy, precision, recall, and F1-score) are observed to outperform the existing models.
Emotion Recognition based on audio signal using GFCC Extraction and BPNN Clas... (ijceronline)
Comparative study of voice print based acoustic features MFCC and LPCC (IJAEMS Journal)
Voice is the best biometric feature for investigation and authentication, having both biological and behavioural characteristics. The acoustic features are related to the voice. The Speaker Recognition System is designed for the automatic authentication of a speaker's identity based on the human voice. Mel Frequency Cepstrum Coefficients (MFCC) and Linear Prediction Cepstrum Coefficients (LPCC) are used for feature extraction from the provided voice sample. This paper provides a comparative study of MFCC and LPCC based on the accuracy of their results and their working methodology. The results are better when MFCC is used for feature extraction.
DATA HIDING IN AUDIO SIGNALS USING WAVELET TRANSFORM WITH ENHANCED SECURITY (csandit)
The rapid increase in data transmission over the internet puts emphasis on information security. Audio steganography is used for secure transmission of secret data with an audio signal as the carrier. In the proposed method, the cover audio file is transformed from the space domain to the wavelet domain using a lifting scheme, leading to secure data hiding. The text message is encrypted using a dynamic encryption algorithm, and the cipher text is then hidden in the wavelet coefficients of the cover audio signal. Signal-to-Noise Ratio (SNR) and Squared Pearson Correlation Coefficient (SPCC) values are computed to judge the quality of the stego audio signal. Results show that the stego audio signal is perceptually indistinguishable from the cover audio signal and remains robust even in the presence of external noise. The proposed method provides secure data extraction with minimal error.
Estimating the quality of digitally transmitted speech over satellite communi... (Alexander Decker)
This document discusses methods for estimating the quality of digitally transmitted speech over satellite communication channels. It proposes a methodology that divides the speech signal into segments and calculates the segmental signal-to-noise ratio (SNR) for each segment, which allows assessing changes in speech quality over time through statistical modeling. It also considers the spectral characteristics of the speech signal, providing a better measure than signal power alone. The document reviews other existing objective and subjective quality estimation methods and argues that the proposed segmental SNR approach has advantages over them.
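The segmental SNR idea described above can be sketched as follows; the segment length and the two toy distortions are illustrative assumptions, not the document's parameters.

```python
import math

def segmental_snr(clean, degraded, seg_len=256):
    # Mean of per-segment SNRs in dB. Unlike a single global SNR, this
    # tracks how speech quality varies over time across the signal.
    snrs = []
    for i in range(0, len(clean) - seg_len + 1, seg_len):
        c = clean[i:i + seg_len]
        d = degraded[i:i + seg_len]
        sig = sum(s * s for s in c)
        err = sum((a - b) ** 2 for a, b in zip(c, d))
        if sig > 0 and err > 0:           # skip silent or error-free segments
            snrs.append(10 * math.log10(sig / err))
    return sum(snrs) / len(snrs)

clean = [math.sin(0.05 * n) for n in range(2048)]
# Two synthetic degradations: light and heavy alternating-sign noise.
light = [c + 0.001 * (1 if n % 2 else -1) for n, c in enumerate(clean)]
heavy = [c + 0.1 * (1 if n % 2 else -1) for n, c in enumerate(clean)]
```

Collecting the per-segment values (rather than only their mean) is what enables the statistical modeling of quality over time that the document proposes.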
Single Channel Speech Enhancement using Wiener Filter and Compressive Sensing (IJECEIAES)
Speech enhancement algorithms are used to overcome limiting factors in recent applications such as mobile phones and communication channels. The challenge for corrupted speech lies in the trade-off between noise reduction and signal distortion. We used a modified Wiener filter and compressive sensing (CS) to investigate and evaluate the improvement of speech quality. The new method adapts the noise estimate and the Wiener filter gain function to increase the weight of the amplitude spectrum and better preserve the signal of interest. CS is then applied using the gradient projection for sparse reconstruction (GPSR) technique to empirically investigate the interactive effects of the corrupting noise and obtain better perceptual quality, reducing listener fatigue under noise-reduction conditions. The proposed algorithm outperforms other conventional algorithms in objective assessment tests across various noise types at SNRs of 0, 5, 10, and 15 dB. It therefore achieves significant speech quality improvement and more efficient noise reduction than conventional algorithms.
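A textbook spectral-subtraction form of the Wiener gain, one plausible reading of the gain function discussed above (though not necessarily the authors' modified version), can be sketched as:

```python
def wiener_gain(noisy_power, noise_power):
    # Classic per-bin Wiener gain with the a priori SNR estimated by
    # spectral subtraction: G = max(|Y|^2 - |N|^2, 0) / |Y|^2.
    clean_est = max(noisy_power - noise_power, 0.0)
    return clean_est / noisy_power if noisy_power > 0 else 0.0

def enhance_spectrum(noisy_psd, noise_psd):
    # Apply the gain to every bin of a noisy power spectrum estimate.
    return [wiener_gain(y, n) * y for y, n in zip(noisy_psd, noise_psd)]

# Hypothetical 4-bin power spectra, for illustration only.
noisy = [5.0, 1.2, 0.8, 3.0]
noise = [1.0, 1.0, 1.0, 1.0]
clean = enhance_spectrum(noisy, noise)
# Bins dominated by the noise estimate are attenuated toward zero.
```

The max(·, 0) flooring is what produces the noise-reduction/distortion trade-off the abstract describes: aggressive noise estimates drive more bins to zero and distort the speech.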
A comparison of different support vector machine kernels for artificial speec... (TELKOMNIKA JOURNAL)
As voice biometrics provide enhanced security and convenience, voice biometric-based applications such as speaker verification are gradually replacing less secure authentication techniques. However, automatic speaker verification (ASV) systems are exposed to spoofing attacks, especially artificial speech attacks, which state-of-the-art speech synthesis and voice conversion algorithms can generate in large volumes in a short period of time. Despite the extensive use of the support vector machine (SVM) in recent works, no study had investigated the performance of different SVM settings for artificial speech detection. In this paper, the performance of different SVM settings in artificial speech detection is investigated, with the objective of identifying the appropriate SVM kernels for the task. An experiment was conducted to find the appropriate combination of the proposed features and SVM kernels. Experimental results showed that the polynomial kernel detects artificial speech effectively, with an equal error rate (EER) of 1.42% when applied to the presented handcrafted features.
Intelligent Arabic letters speech recognition system based on mel frequency c... (IJECEIAES)
Speech recognition is one of the important applications of artificial intelligence (AI). It aims to recognize spoken words regardless of who is speaking. The process involves extracting meaningful features from spoken words and then classifying those features into their classes. This paper presents a neural network classification system for Arabic letters and studies the effect of changing the properties of a multi-layer perceptron (MLP) artificial neural network (ANN) to obtain optimized performance. The proposed system consists of two main stages: first, the recorded spoken letters are transformed from the time domain into the frequency domain using the fast Fourier transform (FFT), and features are extracted using mel frequency cepstral coefficients (MFCC); second, the extracted features are classified using the MLP ANN with the back-propagation (BP) learning algorithm. The results show that the proposed system, with the extracted features, can classify Arabic spoken letters using two neural network hidden layers with an accuracy of around 86%.
Modeling Text Independent Speaker Identification with Vector Quantization (TELKOMNIKA JOURNAL)
Speaker identification is one of the most important technologies nowadays, used in fields such as bioinformatics and security, and in almost all electronic devices. Based on the text used, speaker identification is divided into text dependent and text independent. In many fields, text-independent identification is preferred because the text is unlimited, which also makes it generally more challenging than text-dependent identification. In this research, text-independent speaker identification with Indonesian speaker data was modelled with Vector Quantization (VQ). VQ with K-Means initialization was used: K-Means clustering initialized the means, and Hierarchical Agglomerative Clustering was used to identify the K value for VQ. The best VQ accuracy was 59.67%, obtained when K was 5. According to the results, the Indonesian language can be modelled by VQ. This research can be developed further using optimization methods for the VQ parameters, such as Genetic Algorithms or Particle Swarm Optimization.
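The VQ-with-K-Means modelling described above can be sketched in one dimension as follows. Real systems cluster MFCC vectors; the scalar features, two-speaker setup, and all values below are illustrative assumptions.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain 1-D k-means: returns the codebook (list of centroids).
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def vq_distortion(features, codebook):
    # Average squared distance from each feature to its nearest codeword.
    return sum(min((f - c) ** 2 for c in codebook) for f in features) / len(features)

def identify(features, codebooks):
    # The identified speaker is the one whose codebook gives minimum distortion.
    return min(codebooks, key=lambda spk: vq_distortion(features, codebooks[spk]))

# Toy training features for two speakers.
a_train = [0.0, 0.1, 1.0, 1.1, 0.5, 0.6]
b_train = [10.0, 10.1, 11.0, 11.1, 10.5, 10.6]
books = {"A": kmeans(a_train, 2), "B": kmeans(b_train, 2)}
```

Choosing K (here fixed at 2) is exactly the step the paper delegates to Hierarchical Agglomerative Clustering.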
MFCC based enlargement of the training set for emotion recognition in speech (sipij)
Emotional state recognition through speech is a very interesting research topic nowadays. Using the subliminal information of speech known as "prosody", it is possible to recognize the emotional state of a person. One of the main problems in the design of automatic emotion recognition systems is the small number of available patterns, which makes the learning process more difficult due to the generalization problems that arise under these conditions.
In this work we propose a solution to this problem: enlarging the training set by creating new virtual patterns. In emotional speech, most of the emotional information is carried by speed and pitch variations, so a change in the average pitch that modifies neither the speed nor the pitch variations does not affect the expressed emotion. We use this prior information to create new patterns by applying a gender dependent pitch shift modification in the feature extraction process of the classification system, implemented as a gender dependent frequency scaling modification of the Mel Frequency Cepstral Coefficients used to classify the emotion. This process synthetically increases the number of available patterns in the training set, increasing the generalization capability of the system and reducing the test error. Results obtained with two classifiers of different generalization capability demonstrate the suitability of the proposal.
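The frequency scaling of the mel axis proposed above might be sketched like this. The choice to scale the filterbank center frequencies and the factor alpha are assumptions for illustration, not the paper's exact procedure.

```python
import math

def hz_to_mel(f):
    # Standard mel scale (O'Shaughnessy formula).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filterbank_centers(n_filters, fs, alpha=1.0):
    # Center frequencies of a mel filterbank, with the linear-frequency axis
    # scaled by alpha after the mel warping. Recomputing MFCCs with alpha != 1
    # emulates an average pitch shift without touching speed or pitch
    # variation, which is the invariance the paper exploits.
    lo, hi = hz_to_mel(0.0), hz_to_mel(fs / 2.0)
    mels = [lo + (hi - lo) * (i + 1) / (n_filters + 1) for i in range(n_filters)]
    return [min(alpha * mel_to_hz(m), fs / 2.0) for m in mels]

centers = filterbank_centers(26, 16000)
shifted = filterbank_centers(26, 16000, alpha=1.05)  # 5% upward shift
```

Each distinct alpha (chosen per gender in the paper) yields one additional virtual training pattern from the same utterance.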
Text independent speaker identification system using average pitch and forman... (ijitjournal)
The aim of this paper is to design a closed-set, text-independent speaker identification system using average pitch and speech features from formant analysis. The speech features represented by the speech signal are characterized by formant analysis (power spectral density). We designed two methods: one for average pitch estimation based on autocorrelation, and one for formant analysis. The average pitches of the speech signals are calculated and combined with the formant analysis. A performance comparison with existing methods shows that the speaker identification system designed with the proposed method is superior.
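The autocorrelation-based average pitch estimation mentioned above can be sketched as follows; the frame length and the F0 search range are illustrative choices, not the paper's settings.

```python
import math

def pitch_autocorr(frame, fs, f_lo=50.0, f_hi=500.0):
    # Estimate pitch as fs / lag, where lag (restricted to the plausible
    # F0 range) maximises the frame's autocorrelation.
    lag_min = int(fs / f_hi)
    lag_max = int(fs / f_lo)
    best_lag, best_r = lag_min, -math.inf
    for lag in range(lag_min, min(lag_max, len(frame) - 1) + 1):
        r = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag

# Synthetic voiced frame: a pure 200 Hz tone sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / fs) for n in range(800)]
estimate = pitch_autocorr(frame, fs)  # close to 200 Hz
```

Averaging such per-frame estimates over the voiced frames of an utterance gives the "average pitch" feature the paper combines with formant analysis.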
This document presents research on emotion recognition from speech using a combination of MFCC and LPCC features with support vector machine (SVM) classification. Two databases were used: the Berlin Emotional Database and SAVEE database. MFCC and LPCC features were extracted from the speech samples and combined. SVM with radial basis function kernel achieved the highest accuracy of 88.59% for emotion recognition on the Berlin database using the combined features. Confusion matrices are presented to evaluate performance on each database.
This document describes a mobile application developed to monitor speech rate in real time for speech therapy patients. The application uses a digital signal processing algorithm to extract phonemes from speech and calculate the speech rate. It was tested on Hebrew speakers with over 90% accuracy. The application provides real-time feedback to patients and allows speech therapists to track patient progress. The goal is to help patients learn to control their speech fluency between therapy sessions.
Parameters Optimization for Improving ASR Performance in Adverse Real World N... (Waqas Tariq)
Existing research shows that many techniques and methodologies are available for every step of an Automatic Speech Recognition (ASR) system, but performance (minimizing the Word Error Rate, WER, and maximizing the Word Accuracy Rate, WAR) does not depend only on the technique applied in a given method. It depends mainly on the category and level of the noise, and on the variable sizes of window, frame, and frame overlap considered in existing methods. The main aim of the work presented in this paper is to vary parameters such as window size, frame size, and frame overlap percentage in order to observe algorithm performance across noise categories and levels, and to train the system over all parameter sizes and categories of real-world noisy environments to improve speech recognition performance. The paper presents signal-to-noise ratio (SNR) and accuracy test results for variable parameter sizes. Since it is very hard to evaluate test results and choose a parameter size for ASR performance improvement, the study further suggests feasible, optimal parameter sizes using a Fuzzy Inference System (FIS) to enhance accuracy in adverse real-world noisy environmental conditions. This work should help in the discriminative training of ubiquitous ASR systems for better Human Computer Interaction (HCI). Keywords: ASR performance, ASR parameter optimization, multi-environment training, Fuzzy Inference System for ASR, ubiquitous ASR systems, Human Computer Interaction (HCI)
Emotion Recognition through Speech Analysis using various Deep Learning Algor... (IRJET Journal)
This document summarizes a research paper on emotion recognition through speech analysis using deep learning algorithms. The researchers used datasets containing speech samples labeled with seven emotions to train and test convolutional neural network (CNN), support vector machine (SVM), recurrent neural network (RNN), and random forest models. They found that the RNN model achieved the highest testing accuracy at 75.31% for emotion recognition. The researchers concluded that speech emotion recognition systems could be useful for applications like dialogue systems, call centers, student voice reviews, and building healthy relationships.
This document summarizes a semi-supervised clustering approach for classifying P300 signals for brain-computer interface (BCI) speller systems. It involves using k-means clustering on wavelet features extracted from EEG data, with some data points labeled to initialize the clusters. An ensemble of support vector machines is then trained on the clustered data points to classify new unlabeled P300 signals. The document outlines the P300 speller paradigm used to collect the EEG data, pre-processing steps like filtering and wavelet transformation, the seeded k-means semi-supervised clustering method, and using an ensemble SVM classifier trained on the clustered data for classification.
This document discusses a fast algorithm for noisy speaker recognition using artificial neural networks. It summarizes the following:
- The algorithm first records voice patterns from speakers via a noisy channel and applies noise removal techniques.
- It then extracts features using Mel Frequency Cepstral Coefficients (MFCC) and reduces features using Principal Component Analysis.
- The reduced feature vectors are classified using an artificial neural network classifier.
- Experimental results showed the proposed algorithm achieved about 99% accuracy on average, which is higher than other methods, and was also faster.
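The feature-reduction-then-classify pipeline above can be sketched as follows. This is a toy numpy version under stated simplifications: MFCC extraction is skipped (random vectors stand in for MFCC features), PCA is done via SVD, and the ANN is replaced by a nearest-centroid classifier for brevity; all function names are ours.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project feature vectors onto their top principal components."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def nearest_centroid_fit(Z, y):
    """One centroid per class in the reduced space (stand-in for the ANN)."""
    return {c: Z[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, Z):
    classes = sorted(model)
    d = np.stack([np.linalg.norm(Z - model[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# Synthetic "MFCC-like" 13-dimensional feature vectors for two speakers.
rng = np.random.default_rng(1)
s0 = rng.normal(0.0, 0.5, size=(40, 13))
s1 = rng.normal(2.0, 0.5, size=(40, 13))
X = np.vstack([s0, s1])
y = np.repeat([0, 1], 40)

Z = pca_reduce(X, 5)                      # dimensionality reduction step
model = nearest_centroid_fit(Z, y)        # classifier training step
accuracy = (nearest_centroid_predict(model, Z) == y).mean()
```

The design point the paper makes survives even in this sketch: PCA shrinks the vectors the classifier must handle, which is where the speed gain comes from.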
A survey on Enhancements in Speech Recognition (IRJET Journal)
This document discusses enhancements in speech recognition and provides an overview of the history and basic model of speech recognition. It summarizes key enhancements researchers have made to improve speech recognition, especially in noisy environments. The basic model of speech recognition involves speech input, preprocessing using techniques like MFCCs, classification models like RNNs and HMMs, and output of a transcript. Researchers are working to develop robust speech recognition that can understand speech in any environment.
Utterance Based Speaker Identification Using ANN (IJCSEA Journal)
This document summarizes a research paper on speaker identification using artificial neural networks. The paper presents a speaker identification system that uses digital signal processing and ANN techniques. Speech features are extracted from utterances using FFT and windowing. These features are used to train a multi-layer perceptron network to classify speakers. The system was tested on Bangla speech and achieved accurate identification of speakers from their utterances.
In this paper we present the implementation of speaker identification system using artificial neural network with digital signal processing. The system is designed to work with the text-dependent speaker identification for Bangla Speech. The utterances of speakers are recorded for specific Bangla words using an audio wave recorder. The speech features are acquired by the digital signal processing technique. The identification of speaker using frequency domain data is performed using back propagation algorithm. Hamming window and Blackman-Harris window are used to investigate better speaker identification performance. Endpoint detection of speech is developed in order to achieve high accuracy of the system.
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation of context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature-vector size; curvelets offer better accuracy and a varying window size, making them suitable for non-stationary signals. For better word classification and recognition, a discrete hidden Markov model (HMM) is used, as it accounts for the time distribution of speech signals. The HMM classifier attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction and classification phases of a speech recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using the discrete curvelet transform over conventional methods.
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ... (ijcsit)
Similar to On the use of voice activity detection in speech emotion recognition (20)
Square transposition: an approach to the transposition process in block cipher (journalBEEI)
The transposition process is needed in cryptography to create a diffusion effect on data encryption standard (DES) and advanced encryption standard (AES) algorithms as standard information security algorithms by the National Institute of Standards and Technology. The problem with DES and AES algorithms is that their transposition index values form patterns and do not form random values. This condition will certainly make it easier for a cryptanalyst to look for a relationship between ciphertexts because some processes are predictable. This research designs a transposition algorithm called square transposition. Each process uses square 8 × 8 as a place to insert and retrieve 64-bits. The determination of the pairing of the input scheme and the retrieval scheme that have unequal flow is an important factor in producing a good transposition. The square transposition can generate random and non-pattern indices so that transposition can be done better than DES and AES.
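The insert-and-retrieve idea over an 8 × 8 square can be illustrated with the simplest possible scheme pair: write the 64 bits row by row, read them back column by column. This is a hedged sketch of the principle only; the paper's actual insertion and retrieval schemes are deliberately more irregular, and both function names here are ours.

```python
def square_transpose(bits):
    """Write 64 bits into an 8x8 square row by row, read column by column."""
    assert len(bits) == 64
    square = [bits[r * 8:(r + 1) * 8] for r in range(8)]       # row-wise insert
    return [square[r][c] for c in range(8) for r in range(8)]  # column-wise read

def square_transpose_inverse(bits):
    """Undo the row-in / column-out transposition."""
    square = [[None] * 8 for _ in range(8)]
    i = 0
    for c in range(8):          # bits were read out column by column...
        for r in range(8):
            square[r][c] = bits[i]
            i += 1
    return [b for row in square for b in row]  # ...so rebuild the rows

block = [(i * 7) % 2 for i in range(64)]      # arbitrary 64-bit test block
scrambled = square_transpose(block)
restored = square_transpose_inverse(scrambled)
```

The security argument in the paper rests on choosing an insertion/retrieval pair whose index mapping is random and pattern-free, unlike this deliberately regular row/column example.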
Hyper-parameter optimization of convolutional neural network based on particle swarm optimization (journalBEEI)
The document proposes using a particle swarm optimization (PSO) algorithm to optimize the hyperparameters of a convolutional neural network (CNN) for image classification. The PSO algorithm is used to find optimal values for CNN hyperparameters like the number and size of convolutional filters. In experiments on the MNIST handwritten digit dataset, the optimized CNN achieved a testing error rate of 0.87%, which is competitive with state-of-the-art models. The proposed approach finds optimized CNN architectures automatically without requiring manual design or encoding strategies during training.
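The PSO search loop itself is compact enough to sketch. Below is a generic numpy implementation minimizing a toy objective that stands in for "validation error as a function of hyperparameters"; it is not the paper's CNN setup, and the coefficient values `w`, `c1`, `c2` are common textbook defaults, not values from the paper.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=20, n_iter=60, bounds=(-5.0, 5.0), seed=0):
    """Minimal particle swarm optimization.

    Each particle keeps a velocity and its personal best, and is pulled
    toward the swarm's global best -- the same search idea the paper
    applies to CNN hyperparameters such as filter count and filter size.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))
    v = np.zeros((n_particles, dim))
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients
    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity update: inertia + pull toward personal and global bests.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Toy objective: minimum at (1, 1, 1), mimicking a smooth error surface.
best_x, best_val = pso_minimize(lambda p: float(np.sum((p - 1.0) ** 2)), dim=3)
```

In the paper's use case, `f` would train a CNN with the particle's hyperparameters and return its validation error, so each evaluation is expensive; that cost is what motivates the small swarm sizes typically used.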
Supervised machine learning based liver disease prediction approach with LASSO feature selection (journalBEEI)
In this contemporary era, the uses of machine learning techniques are increasing rapidly in the field of medical science for detecting various diseases such as liver disease (LD). Around the globe, a large number of people die because of this deadly disease. By diagnosing the disease in a primary stage, early treatment can be helpful to cure the patient. In this research paper, a method is proposed to diagnose the LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting and support vector machine (SVM). We also deployed a least absolute shrinkage and selection operator (LASSO) feature selection technique on our taken dataset to suggest the most highly correlated attributes of LD. The predictions with 10 fold cross-validation (CV) made by the algorithms are tested in terms of accuracy, sensitivity, precision and f1-score values to forecast the disease. It is observed that the decision tree algorithm has the best performance score where accuracy, precision, sensitivity and f1-score values are 94.295%, 92%, 99% and 96% respectively with the inclusion of LASSO. Furthermore, a comparison with recent studies is shown to prove the significance of the proposed system.
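The LASSO feature-selection step works by driving the weights of uninformative features to exactly zero. A minimal numpy sketch via cyclic coordinate descent with soft-thresholding is shown below, on synthetic data rather than the liver-disease dataset; the function name and the regularization value are ours.

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by coordinate descent.

    Features whose optimal weight is exactly zero are "deselected",
    which is how LASSO acts as a feature selector.
    """
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual excluding feature j's current contribution.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            # Soft-thresholding: small correlations collapse to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w

# Synthetic data: only the first two of six features actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j in range(6) if abs(w[j]) > 1e-6]
```

The surviving indices in `selected` would then be the "most highly correlated attributes" fed to the downstream classifiers in a pipeline like the paper's.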
A secure and energy saving protocol for wireless sensor networks (journalBEEI)
The research domain of wireless sensor networks (WSN) has been extensively studied thanks to innovative technologies and research directions addressing the usability of WSNs under various schemes. This domain permits dependable tracking of a diversity of environments for both military and civil applications. The key management mechanism is a primary protocol for keeping the privacy and confidentiality of the data transmitted among different sensor nodes in WSNs. Since the nodes are small, they are intrinsically limited by inadequate resources such as battery lifetime and memory capacity. The proposed secure and energy saving protocol (SESP) for wireless sensor networks has a significant impact on the overall network lifetime and energy dissipation. To encrypt sent messages, SESP uses the concept of public-key cryptography. It relies on sensor nodes' identities (IDs) to prevent messages from being replayed, so that the security goals of authentication, confidentiality, integrity, availability, and freshness are achieved. Finally, simulation results show that the proposed approach yields better energy consumption and network lifetime than the low-energy adaptive clustering hierarchy (LEACH) protocol: sensors die after 900 rounds in the proposed SESP protocol, while in the LEACH scheme they die after 750 rounds.
Plant leaf identification system using convolutional neural network (journalBEEI)
This paper proposes a leaf identification system using a convolutional neural network (CNN). The system can identify five types of local Malaysian leaves: acacia, papaya, cherry, mango and rambutan. Using deep learning, the network is trained for image classification on a database of leaf images captured with a mobile phone. ResNet-50 was the architecture used to classify the images and train the network for leaf identification. Recognizing leaf photographs requires several steps, starting with image pre-processing, feature extraction, plant identification, matching and testing, and finally extracting the results in MATLAB. The test set consists of three types of images: white-background, noise-added, and random-background images. Finally, an interface for the leaf identification system was developed as the end software product using MATLAB App Designer. The accuracy achieved on the training sets for the five leaf classes was recorded above 98%, so the recognition process was successfully implemented.
Customized moodle-based learning management system for socially disadvantaged schools (journalBEEI)
This study aims to develop a Moodle-based LMS with customized learning content and a modified user interface to facilitate pedagogical processes during the COVID-19 pandemic, and to investigate how teachers of socially disadvantaged schools perceived its usability and technology acceptance. A co-design process was conducted with two activities: 1) a needs-assessment phase using an online survey and interview sessions with the teachers, and 2) the development phase of the LMS. The system was evaluated by 30 teachers from socially disadvantaged schools for relevance to their distance-learning activities. We employed the computer software usability questionnaire (CSUQ) to measure perceived usability, and the technology acceptance model (TAM) with its 3 original variables (perceived usefulness, perceived ease of use, and intention to use) plus 5 external variables (attitude toward the system, perceived interaction, self-efficacy, user interface design, and course design). The average CSUQ rating exceeded 5.0 on the 7-point scale, indicating that teachers agreed that the information quality, interaction quality, and user interface quality were clear and easy to understand. The TAM results concluded that the LMS design was usable, interactive, and well developed. Teachers reported an effective user interface that allows effective teaching operations, leading to prompt adoption of the system.
Understanding the role of individual learner in adaptive and personalized e-learning systems (journalBEEI)
A dynamic learning environment has emerged as a powerful platform in modern e-learning systems. Learning situations that constantly change have forced learning platforms to adapt and personalize their learning resources for students. Evidence suggests that adaptation and personalization of e-learning systems (APLS) can be achieved by utilizing learner modeling, domain modeling, and instructional modeling. In the APLS literature, questions have been raised about the role of the individual characteristics that are relevant for adaptation; with several options available, the attributes used to model students in APLS often overlap and are inconsistent across studies. Therefore, this study proposes a list of learner-model attributes in dynamic learning to support adaptation and personalization. The study was conducted by exploring concepts from literature selected against defined criteria. We describe the important concepts in student modeling and provide definitions and examples of the data values that researchers have used. We also discuss the implementation of the selected learner model in providing adaptation in dynamic learning.
Prototype mobile contactless transaction system in traditional markets to support physical distancing (journalBEEI)
1) Researchers developed a prototype contactless transaction system using QR codes and digital payments to support physical distancing during the COVID-19 pandemic in traditional markets.
2) The system allows sellers and buyers in traditional markets to conduct fast, secure transactions via smartphones without direct cash exchange. Buyers scan sellers' QR codes to view product details and make e-wallet payments.
3) Testing showed the system's functions worked properly and users found it easy to use and useful for supporting contactless transactions and digital transformation of traditional markets. However, further development is needed to increase trust in digital payments for users unfamiliar with the technology.
Wireless HART stack using multiprocessor technique with laxity algorithm (journalBEEI)
A real-time operating system (RTOS) is required for the demarcation of industrial wireless sensor network (IWSN) stacks. In the industrial world, a vast number of sensors are utilised to gather various types of data, and the data gathered by the sensors cannot be prioritised ahead of time because all of the information is equally essential. As a result, a protocol stack is employed to guarantee that data is acquired and processed fairly. In IWSN, the protocol stack is implemented using an RTOS. The data collected from IWSN sensor nodes is processed using non-preemptive scheduling and the protocol stack, and then sent in parallel to the IWSN's central controller. The RTOS mediates between hardware and software. Packets must be sent at a certain time, and some packets may collide during transmission; this project is undertaken to avoid such collisions. As a prototype, the project is divided into two parts: the first uses an RTOS and the LPC2148 as a master node, while the second serves as a standard data-collection node to which sensors are attached. Any controller may be used in the second part, depending on the situation. WirelessHART allows the two nodes to communicate with each other.
Implementation of double-layer loaded on octagon microstrip yagi antenna (journalBEEI)
This document describes the implementation of a double-layer structure on an octagon microstrip yagi antenna (OMYA) to improve its performance at 5.8 GHz. The double-layer consists of two double positive (DPS) substrates placed above the OMYA. Simulation and experimental results show that the double-layer configuration increases the gain of the OMYA by 2.5 dB compared to without the double-layer. The measured bandwidth of the OMYA with double-layer is 14.6%, indicating the double-layer can increase both the gain and bandwidth of the OMYA.
The calculation of the field of an antenna located near the human head (journalBEEI)
In this work, a numerical calculation was carried out in one of the universal programs for automatic electrodynamic design. The calculation is aimed at obtaining numerical values of the specific absorption rate (SAR). The SAR value can be used to determine the effect of a wireless device's antenna on biological objects; the dipole parameters were selected for GSM1800. Investigation of the influence of the distance to the cell phone shows that the electromagnetic radiation absorbed in a person's head, and its effect on the brain, decreases by about a factor of three. This is a very important result: the SAR value decreased by almost three times, which is an acceptable result.
Exact secure outage probability performance of uplink-downlink multiple access... (journalBEEI)
In this paper, we study uplink-downlink non-orthogonal multiple access (NOMA) systems, considering secure performance at the physical layer. In the considered system model, the base station acts as a relay to allow two users on the left side to communicate with two users on the right side. Under imperfect channel state information (CSI), secure performance must be studied since an eavesdropper wants to overhear signals processed on the downlink. To provide a secure-performance metric, we derive exact expressions for the secrecy outage probability (SOP) and evaluate the impact of the main parameters on the SOP. The important finding is that higher secrecy performance can be achieved at high signal-to-noise ratio (SNR). Moreover, the numerical results demonstrate that the SOP tends to a constant at high SNR. Finally, our results show that the power allocation factors and target rates are the main factors affecting the secrecy performance of the considered uplink-downlink NOMA systems.
Design of a dual-band antenna for energy harvesting application (journalBEEI)
This report presents an investigation of how to improve a current dual-band antenna to obtain better antenna parameters for energy harvesting applications, and to develop and validate a new design operating at 2.4 GHz and 5.4 GHz. At 5.4 GHz more data can be transmitted than at 2.4 GHz; however, 2.4 GHz radiates over a longer distance, so it can be used far from the antenna module, whereas the 5 GHz band has a shorter radiation range. The development of this project includes designing and testing the antenna using Computer Simulation Technology (CST) 2018 software and vector network analyzer (VNA) equipment. During the design process, fundamental antenna parameters are measured and validated in order to identify the better antenna performance.
Transforming data-centric eXtensible markup language into relational database... (journalBEEI)
eXtensible markup language (XML) appeared internationally as the format for data representation over the web. Yet, most organizations are still utilising relational databases as their database solutions. As such, it is crucial to provide seamless integration via effective transformation between these database infrastructures. In this paper, we propose XML-REG to bridge these two technologies based on node-based and path-based approaches. The node-based approach is good to annotate each positional node uniquely, while the path-based approach provides summarised path information to join the nodes. On top of that, a new range labelling is also proposed to annotate nodes uniquely by ensuring the structural relationships are maintained between nodes. If a new node is to be added to the document, re-labelling is not required as the new label will be assigned to the node via the new proposed labelling scheme. Experimental evaluations indicated that the performance of XML-REG exceeded XMap, XRecursive, XAncestor and Mini-XML concerning storing time, query retrieval time and scalability. This research produces a core framework for XML to relational databases (RDB) mapping, which could be adopted in various industries.
Key performance requirement of future next wireless networks (6G) (journalBEEI)
The document provides an overview of the key performance indicators (KPIs) for 6G wireless networks compared to 5G networks. Some of the major KPIs discussed for 6G include: achieving data rates of up to 1 Tbps and individual user data rates up to 100 Gbps; reducing latency below 10 milliseconds; supporting up to 10 million connected devices per square kilometer; improving spectral efficiency by up to 100 times through technologies like terahertz communications and smart surfaces; and achieving an energy efficiency of 1 pico-joule per bit transmitted through techniques like wireless power transmission and energy harvesting. The document outlines how 6G aims to integrate terrestrial, aerial and maritime communications into a single network to provide ubiquitous connectivity with higher
Noise resistance territorial intensity-based optical flow using inverse confidential technique (journalBEEI)
This paper presents the use of the inverse confidential technique on a bilateral function with territorial intensity-based optical flow, to prove its effectiveness in noisy environments. In general, an image's motion vector is computed by the technique called optical flow, where sequences of images are used to determine the motion vector; but the accuracy of the motion vector is reduced when the source image sequences are corrupted by noise. This work shows that the inverse confidential technique on a bilateral function can increase the accuracy of motion-vector determination by territorial intensity-based optical flow under noisy conditions. We tested with several kinds of non-Gaussian noise on several standard image-sequence patterns, analyzing the resulting motion vectors in the form of the error vector magnitude (EVM) and comparing against several noise-resistance techniques for territorial intensity-based optical flow.
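The EVM metric used above to score recovered motion fields is simple to state explicitly. A minimal numpy sketch follows; the function name and the toy flow field are ours, not the paper's test sequences.

```python
import numpy as np

def error_vector_magnitude(estimated, ground_truth):
    """Mean error vector magnitude between estimated and true flow fields.

    Both arguments are (H, W, 2) arrays of per-pixel motion vectors; the
    EVM at each pixel is the Euclidean length of the difference vector,
    and the field is summarized by the mean over all pixels.
    """
    diff = estimated - ground_truth
    return float(np.mean(np.linalg.norm(diff, axis=-1)))

# A uniform true motion of (1, 0) pixels, and an estimate with one bad vector.
truth = np.zeros((4, 4, 2))
truth[..., 0] = 1.0
estimate = truth.copy()
estimate[0, 0] = [1.0, 3.0]   # error vector (0, 3) at a single pixel
evm = error_vector_magnitude(estimate, truth)
```

A lower EVM means the recovered motion field is closer to the ground truth, which is how the paper compares methods under each noise condition.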
Modeling climate phenomenon with software grids analysis and display system in Indonesia (journalBEEI)
This study aims to model climate change based on rainfall, air temperature, pressure, humidity and wind with the GrADS software, and to create a global warming module. The research uses a 3D (define, design, develop) model. The modeling results for the five climate elements are as follows: the annual average temperature in Indonesia in 2009-2015 was between 29 °C and 30.1 °C; the horizontal distribution of annual average pressure in Indonesia in 2009-2018 was between 800 mbar and 1000 mbar; the horizontal distribution of annual average humidity in Indonesia ranged between 27-57 in 2009 and 2011, and between 30-60 in 2012-2015, 2017 and 2018. During the east monsoon, wind circulation moves from northern Indonesia to southern Indonesia; during the west monsoon, it moves from southern Indonesia to northern Indonesia. The global warming module produced for SMA/MA is feasible to use, in accordance with the validator's score of 69, which is in the appropriate category, and with the 91% positive response of teachers and students to a questionnaire.
An approach of re-organizing input dataset to enhance the quality of emotion recognition (journalBEEI)
The purpose of this paper is to propose an approach of re-organizing input data to recognize emotion based on short signal segments and increase the quality of emotional recognition using physiological signals. MIT's long physiological signal set was divided into two new datasets, with shorter and overlapped segments. Three different classification methods (support vector machine, random forest, and multilayer perceptron) were implemented to identify eight emotional states based on statistical features of each segment in these two datasets. By re-organizing the input dataset, the quality of recognition results was enhanced. The random forest shows the best classification result among three implemented classification methods, with an accuracy of 97.72% for eight emotional states, on the overlapped dataset. This approach shows that, by re-organizing the input dataset, the high accuracy of recognition results can be achieved without the use of EEG and ECG signals.
Parking detection system using background subtraction and HSV color segmentation (journalBEEI)
With manual vehicle parking, finding a vacant lot is difficult, since drivers must check each space directly; when many vehicles are parked, this takes considerable time or requires many attendants. This research develops a real-time parking detection system. The system uses the HSV color segmentation method to determine the background image, and the detection process uses background subtraction. Applying these two methods requires image preprocessing with grayscaling and blurring (low-pass filtering), followed by thresholding and filtering to obtain the best image for detection. A region of interest (ROI) is defined to focus on the area identified as empty parking. The parking detection process achieves a best average accuracy of 95.76%, with a minimum threshold of 0.4 over the 255-pixel range. This value is the best among 33 test data covering several criteria, such as the time of capture, the composition and color of the vehicle, the shape of shadows in the object's environment, and the light intensity. The parking detection system can be implemented in real time to determine the position of an empty space.
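The background-subtraction decision for a single parking ROI can be sketched in a few lines. This is a hedged numpy illustration, not the paper's pipeline: the grayscaling, blurring, and HSV segmentation steps are omitted, the 0.4 threshold (relative to the 255-pixel maximum) is borrowed from the text, and the fill threshold and function name are our assumptions.

```python
import numpy as np

def occupied(background, frame, diff_thresh=0.4, fill_thresh=0.5):
    """Decide whether a parking ROI is occupied via background subtraction.

    background, frame : 2-D grayscale ROIs scaled to [0, 1]
    diff_thresh       : per-pixel absolute-difference threshold
    fill_thresh       : fraction of changed pixels needed to call it occupied
    """
    changed = np.abs(frame.astype(float) - background.astype(float)) > diff_thresh
    return bool(changed.mean() > fill_thresh)

bg = np.full((10, 10), 0.2)              # empty asphalt patch
empty = bg + 0.05                        # small illumination change only
car = bg.copy()
car[2:9, 1:9] = 0.9                      # a bright vehicle covers most pixels

result_empty = occupied(bg, empty)       # illumination change ignored
result_car = occupied(bg, car)           # vehicle detected
```

The two thresholds play the roles described in the abstract: the per-pixel threshold rejects lighting noise, and the fill fraction decides whether enough of the ROI has changed to count as a parked vehicle.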
Quality of service performances of video and voice transmission in universal mobile telecommunications system (journalBEEI)
The universal mobile telecommunications system (UMTS) has the distinct benefit of supporting the wide range of quality of service (QoS) criteria that users require to fulfill their requirements. Real-time transmission of video and audio places a high demand on the cellular network, so QoS is a major problem in these applications. Providing QoS in the UMTS backbone network necessitates an active QoS mechanism to maintain the necessary level of convenience on UMTS networks. For UMTS networks, investigation models for end-to-end QoS, total transmitted and received data, packet loss, and throughput-providing techniques are run and assessed, and the simulation results are examined. According to the results, appropriate QoS adaptation allows for specific voice and video transmission. Finally, by analyzing the existing QoS parameters, the QoS performance of 4G/UMTS networks may be improved.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL (gerogepatton)
As digital technology becomes more deeply embedded in power systems, protecting the communication networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3) is a multi-tiered application-layer protocol extensively utilized in Supervisory Control and Data Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities. Because the interconnection of these networks makes them vulnerable to a variety of cyberattacks, robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation. To address this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion detection in smart grids. The proposed approach combines a Convolutional Neural Network (CNN) with the Long Short-Term Memory (LSTM) algorithm. We employed a recent intrusion detection dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to train and test our model. Our experiments show that the CNN-LSTM method is much better at finding smart grid intrusions than other deep learning algorithms used for classification. In addition, the proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection accuracy rate of 99.50%.
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) (gerogepatton)
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines (Christina Lin)
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMS (IJNSA Journal)
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threats and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system.
By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman Rimland, and Hegemonic Stability theories, examines China's role
in Central Asia. This study adheres to the empirical epistemological method and has taken care of
objectivity. This study analyze primary and secondary research documents critically to elaborate role of
china’s geo economic outreach in central Asian countries and its future prospect. China is thriving in trade,
pipeline politics, and winning states, according to this study, thanks to important instruments like the
Shanghai Cooperation Organisation and the Belt and Road Economic Initiative. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence on other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
Bulletin of Electr Eng and Inf ISSN: 2302-9285
On the use of voice activity detection in speech emotion recognition (Muhammad Fahreza Alghifari)
elicited emotional speech. However, there are also noisy audio datasets that imitate the conditions of a natural environment. To handle this, a preprocessing step is usually performed before training.
Figure 1. A typical speech emotion recognition algorithm
Voice activity detection (VAD) is one prominent method of pre-processing speech data. A VAD functions to detect the presence or absence of human voice in a signal [9]. Research on practical applications of VAD and how to implement them is extensive, such as the paper published by [10], which proposes a real-time VAD smartphone application, or the algorithm proposed by [11] for assisting those with hearing problems. The personality identification research by [12] was also improved by utilizing VAD.
In this research, the main usage of VAD is to improve the results of SER. Other similar research using VAD to improve SER, such as [13], yielded an accuracy rate of 96.97%. However, this was achieved for only 3 emotions: angry, neutral, and sad. Another study conducted by [14] incorporated the arousal-valence theory and obtained a recognition rate of 95% for 4 emotions: happy, fear, anger, and sad. Other research, such as [15], investigated the effects of VAD by introducing 5 different types of noise (voice babble, factory noise, HF radio channel, F-16 fighter jets, and Volvo 340) to clean audio signals. Their proposed VAD system showed a mean improvement of 5.54% across the 5 noise types when tested on the Berlin Emotion Database (EMO-DB).
While research on VAD in SER has already been performed, there is still room for improvement in terms of accuracy or the number of emotions detected. Hence, in this paper we contribute to the field in two ways. The first is showcasing the results of our proposed SER system using VAD, benchmarked against other papers using a similar methodology. The second is investigating the effects of VAD when mixing a clean emotion dataset with a noisy one for training and testing a deep neural network. The rest of the paper is organized as follows. Section 2 briefly outlines the steps taken to perform SER. Section 3 presents the results obtained, along with a discussion and benchmarking. Section 4 concludes by summarizing the content of this paper.
2. RESEARCH METHOD
The research conducted in this paper continues our previous research in [16], but now with the addition of VAD. Our system has 4 steps to perform SER. The VAD is performed in the preprocessing stage, where the audio files are passed through Sohn's VAD algorithm to remove the silent segments. Afterwards, the MFCCs of the speech signals are extracted in the feature extraction step. The MFCCs are then used to train and test the deep neural network. For this system, we adopt 2 datasets: the German Emotion Database (EMO-DB) and a custom-made low-quality database. An overview of the system can be viewed in Figure 2.
Figure 2. Proposed SER with VAD methodology
Bulletin of Electr Eng and Inf, Vol. 8, No. 4, December 2019: 1324 – 1332
As mentioned previously, the investigation has been divided into two parts. The first is investigating the effects of VAD on a clean signal dataset, then comparing the recognition results obtained with and without VAD. The second part is testing the VAD on noisy speech input. The following subsections elaborate on each step in more detail.
2.1. Voice activity detection pre-processing
A typical VAD algorithm is shown in Figure 3. For this project, VAD is performed using Sohn's VAD algorithm, which integrates decision-directed parameter estimation and an HMM-based hang-over scheme to improve the results [17]. This is implemented as 'vadsohn' in the VOICEBOX toolbox [18]. The VAD output is a probability decision on a frame-by-frame basis, with a threshold probability of 0.70, as shown in Figure 4.
Each audio file is then resampled at the original sampling frequency of 16 kHz, concatenating only the segments in which voice was detected, before being passed to the feature extractor. Figure 5 shows a sample of VAD pre-processing performed on a sample audio file from the EMO-DB and the LQ Emo Dataset.
Figure 3. A typical voice activity detection algorithm
Figure 4. Sohn’s VAD Algorithm
Figure 5. VAD pre-processing step, (a) EMO-DB, and (b) LQ Audio Database
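The frame-by-frame keep/drop decision described above can be sketched as follows. This is a simplified energy-based stand-in for illustration, not Sohn's statistical-model VAD; the frame length, the 0.70 threshold, and all function names here are illustrative assumptions, with normalized frame energy standing in for the speech-presence probability.

```python
import numpy as np

def simple_vad(signal, fs=16000, frame_ms=20, threshold=0.70):
    """Keep only frames whose normalized energy exceeds the threshold
    (a crude stand-in for the frame-level speech probability)."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    # Normalize energies to [0, 1] so they act like a rough "speech probability"
    prob = energy / (energy.max() + 1e-12)
    # Concatenate only the frames in which voice was detected
    return frames[prob >= threshold].reshape(-1)

# Example: 0.5 s silence + 1 s "speech" (a loud tone) + 0.5 s silence
fs = 16000
t = np.arange(fs) / fs
speech = 0.8 * np.sin(2 * np.pi * 220 * t)
silence = np.zeros(fs // 2)
audio = np.concatenate([silence, speech, silence])
trimmed = simple_vad(audio, fs)
print(len(audio), len(trimmed))  # the leading and trailing silence is removed
```

As in the paper's pipeline, the silent segments at the beginning and end are discarded and only the voiced portion is passed on to feature extraction.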
(Block diagram text: Input Speech Signal → Signal Processing and Feature Extraction → Model Fitting and Decision Making → Output Decision)
2.2. Mel-frequency cepstral coefficients (MFCCs) feature extraction
For the reasons outlined in [19], the Mel-Frequency Cepstral Coefficients (MFCCs) were chosen as the sole feature for classification. MFCCs use a non-linear frequency scale, i.e. the mel scale, based on auditory perception. A mel is a unit of measure of the perceived pitch or frequency of a tone. Equation (1) can be used to convert the frequency scale to the mel scale:

f_mel = 2595 log10(1 + f/700) (1)

where f_mel is the frequency in mels and f is the normal frequency in Hz. MFCCs are often calculated using a filter bank of M filters, in which each filter has a triangular shape and is spaced uniformly on the mel scale, as shown in (2).
H_m[k] = 0                                  for k < f[m-1]
H_m[k] = (k - f[m-1]) / (f[m] - f[m-1])     for f[m-1] <= k <= f[m]
H_m[k] = (f[m+1] - k) / (f[m+1] - f[m])     for f[m] <= k <= f[m+1]
H_m[k] = 0                                  for k > f[m+1]    (2)

where 1 <= m <= M, M is the number of filters, and f[m] are the filter boundary points. The log-energy mel spectrum is then calculated as follows:
S[m] = ln( sum_{k=0}^{N-1} |X[k]|^2 H_m[k] ),  1 <= m <= M    (3)

where X[k] is the discrete Fourier transform (DFT) of a speech input x[n].
Although the traditional cepstrum uses the inverse discrete Fourier transform (IDFT), the mel-frequency cepstrum is normally implemented using the discrete cosine transform (DCT), since S[m] is even, as shown in (4):

c[n] = sum_{m=1}^{M} S[m] cos( pi n (m - 1/2) / M ),  0 <= n < M    (4)
MFCC is one of the most popular speech features utilized for SER, as shown in research [1-7]. This common usage enables us to benchmark the results. For this step, the melcepst module from VOICEBOX is used to extract the MFCCs. We have used the default parameters, namely a Hamming window in the time domain with triangular-shaped filters in the mel domain.
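As a concrete illustration of the chain in equations (1)-(4), one frame of MFCCs can be computed as below. This is a minimal sketch and does not reproduce VOICEBOX's melcepst defaults exactly; the filter count, coefficient count, and function names are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # Equation (1): frequency in Hz -> mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=16000, n_filters=26, n_ceps=13):
    """MFCCs of one Hamming-windowed frame: DFT power spectrum ->
    triangular mel filter bank (2) -> log energies (3) -> DCT (4)."""
    N = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(N))) ** 2
    # Filter boundary points, spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((N + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for m in range(1, n_filters + 1):          # triangular filters, eq. (2)
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    log_energy = np.log(fbank @ spectrum + 1e-12)   # eq. (3)
    n = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (m + 0.5) / n_filters)  # eq. (4), DCT-II basis
    return dct @ log_energy

frame = np.random.randn(400)  # one 25 ms frame at 16 kHz
coeffs = mfcc_frame(frame)
print(coeffs.shape)  # (13,)
```

In the paper's methodology these per-frame coefficients are then averaged over the whole file before being fed to the classifier.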
2.3. Deep feedforward neural network classification
In this paper, we utilize a deep neural network for the system to learn and classify the emotions. The MATLAB neural network pattern recognition tool is used with a 70/15/15 training/validation/testing ratio. As the purpose of this paper is to investigate the effects of VAD on the recognition rate, the network size is kept constant at 2 hidden layers with 10 neurons each, using a varying number of MFCCs as the features, as shown in Figure 6. For each iteration, the training/validation/testing data is randomized.
Figure 6. Deep feedforward neural network configuration
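The fixed network shape (30 MFCC inputs, two hidden layers of 10 neurons each, 5 emotion outputs) can be sketched as a forward pass. This is a structural illustration only; MATLAB's pattern recognition tool has its own activations and training procedure, and the weights below are random, untrained values.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    # Small random weights and zero biases for an untrained layer
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

# Architecture from the paper: 30 MFCC inputs -> two hidden layers of
# 10 neurons -> 5 emotion classes (happy, angry, sad, fear, neutral)
W1, b1 = layer(30, 10)
W2, b2 = layer(10, 10)
W3, b3 = layer(10, 5)

def forward(x):
    h1 = np.tanh(x @ W1 + b1)          # hidden layer 1
    h2 = np.tanh(h1 @ W2 + b2)         # hidden layer 2
    logits = h2 @ W3 + b3
    e = np.exp(logits - logits.max())  # softmax over the 5 emotion classes
    return e / e.sum()

probs = forward(rng.normal(size=30))   # e.g. the 30 averaged MFCCs of one file
print(probs.shape, round(float(probs.sum()), 6))
```

Keeping this architecture constant across runs, as the paper does, isolates the effect of the VAD pre-processing on the recognition rate.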
2.4. Dataset
This research uses the Berlin Emotion Database (EMO-DB) [20] as the primary dataset. The EMO-DB contains simulated emotional voices from 5 female and 5 male actors speaking 10 German utterances (5 short and 5 longer sentences) in 7 emotions. For this experiment, we use only 60 samples for each emotion so that the
training would be equal across emotions. As we are analyzing up to 5 emotions, the total number of voice lines used is 300.
For the second part of the research, we reused the Low Quality Emotional dataset (LQ Emo Dataset) collected in [16], which contains 148 low-quality emotional speech samples in .ogg format. The data collection was performed using a common mobile phone recorder in a noisy environment, transmitted as WhatsApp audio messages. WhatsApp uses the Opus codec, a lossy audio coding format, for voice media streams at either 8 kHz or 16 kHz [21]. The database contains short utterances in the 4 emotions that have the most potential applications in fields such as customer service: happy, angry, sad, and neutral. Each audio file contains 5 daily conversational lines in English, such as "hello, good morning", and the average length of the files is 2.26 seconds.
3. RESULTS AND ANALYSIS
The SER is performed using MATLAB R2018b on an Intel(R) Core i5-7200U CPU @ 2.50 GHz. The number of MFCCs is varied from 1 to 30. Each investigation is repeated 3-5 times and the average is taken when necessary to ensure accuracy.
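The repeated runs with a re-randomized 70/15/15 split can be sketched as follows; the helper name and the placeholder loop body are illustrative assumptions, not the paper's actual code.

```python
import random

def split_indices(n, pct=(70, 15, 15), seed=None):
    """Randomized 70/15/15 train/validation/test split, re-drawn per run."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * pct[0] // 100
    n_val = n * pct[1] // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 300 samples (60 per emotion, 5 emotions); each repetition draws a fresh split
for run in range(3):
    train, val, test = split_indices(300, seed=run)
    # ... train the network on `train`, validate on `val`, score on `test` ...

print(len(train), len(val), len(test))  # 210 45 45
```

Averaging the test accuracy over several such runs reduces the variance introduced by the random split, which is why the paper repeats each investigation 3-5 times.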
3.1. Experiments of VAD in EMO-DB
Following the methodology, we first evaluate the SER accuracy for 5 emotions. The chosen emotions are happy, angry, sad, fear, and neutral. 300 emotional voice audio signals from the clean dataset (EMO-DB) are passed to the system for MFCC feature extraction without pre-processing. The obtained emotion recognition rate is 88%, as shown in Figure 7(a).
The experimental setup is then repeated, but now with VAD pre-processing. All 300 files are passed through the vadsohn VAD, removing any segments without detected voice. Although the dataset is considered to have minimal noise, the VAD typically trims the beginning and the end of each emotion audio file. The processed data is then passed to the feature extraction and finally the classifier. The obtained recognition rate is 91.7%, as shown in Figure 7(b), an improvement of 3.7%. The best results were obtained when the number of MFCCs is set to 30 coefficients.
Figure 7. Emotion recognition results for clean dataset, (a) Without VAD, and (b) with VAD
Benchmarking against other papers using a similar methodology and VAD, our results were not as accurate as those obtained in [13] or [14]; however, considering that we have also included more emotions to detect, this recognition rate is acceptable. The paper by [13] detects only angry, neutral, and sad; [14] detects happiness, anger, sadness, and fear; while our study recognizes up to 5 emotions: happiness, anger, sadness, neutral, and fear. As the number of emotions detected increases, the complexity of the system also increases.
Another aspect to consider is the training/validation/testing ratio. The research in [13] used a ratio of 80/10/10, compared to our 70/15/15. Theoretically, as the training set size increases, so does the accuracy rate.
One last factor to consider is that [13] split every file into 20-millisecond chunks with no overlap, i.e. vectors of length 320 (16 kHz × 20 ms), while our methodology takes the whole audio file and averages the obtained MFCCs, which limits the amount of training and testing data. Ideally, for training a deep neural network, more samples are better; hence, in a future study we plan to add more sample files for training.
3.2. VAD Experiment on Both EMO-DB and LQ Emo Dataset
In the second part of the investigation, we examined the effects of utilizing VAD when mixing clean and noisy datasets. In our previous study [16], we found that the accuracy was poor when noisy and clean data are mixed, yielding an accuracy rate of 23.3%, as shown in Figure 8(a).
240 lines covering 4 emotions (happy, angry, sad, and neutral) from the EMO-DB are mixed with 120 noisy lines for network training and testing. A sample of the VAD process on the LQ Audio Dataset can be viewed in Figure 5(b). The number of MFCCs is fixed at 30, following the best results obtained in Section 3.1. By applying VAD to both datasets, we achieved a greatly improved recognition rate of 73.6%, which implies not only that VAD can be used to process noisy signals, but also that it can be used when dealing with a mix of low-quality and clean audio, as displayed in Figure 8(b). This result is consistent with the results obtained by [15], which show that VAD is particularly useful when dealing with noisy signals, or when mixing clean and noisy data.
Figure 8. Emotion recognition results for the LQ Audio Database, (a) results without VAD [16], and (b) results with VAD
Another factor to consider is that the EMO-DB dataset is in German while the LQ Audio Dataset is in English. The large improvement obtained using VAD may be attributed to the fact that certain languages convey information at different rates. According to [22], among 7 languages, German is one of the slowest in terms of syllables per second, while English hovers around the middle. There is also the consideration of different lingo or dialects for each speaker. By applying VAD, unnecessary information such as gaps between words can be filtered out, and the SER system can focus on the important signals only.
4. CONCLUSION
To achieve more accurate speech emotion recognition, this study has employed Sohn's VAD algorithm in the pre-processing step. We have tested the algorithm against two datasets: a clean dataset from the EMO-DB and a noisy dataset created in our previous study. From the results obtained, we have shown that by using VAD, the accuracy of SER improved by 3.7% for 5 emotions when testing against a clean dataset. The second part of the study tested the usage of VAD when mixing clean and noisy datasets. From the results obtained, we found that the recognition rate greatly improved over our previous study, by about 50 percentage points. Our results encourage future studies to include VAD in the pre-processing step, especially when dealing with mixed datasets.
There are a few shortcomings in this research. The first is the limited amount of data for training and testing. Our methodology takes the average MFCC of a single speech sample rather than taking multiple
frames per sample, hence limiting the total number of samples. The next is the usage of the LQ Emo Dataset: its number of speech samples is smaller than the number of EMO-DB samples used. Ideally, an equal amount from each dataset would be used to train the network. Nonetheless, both limitations are mitigated by repeating the experiment 3-5 times to keep the errors at a minimum. Future studies can further expand the number of emotions detected, as the EMO-DB has 7 types of emotion recorded. For the LQ Emo Dataset, additional audio is planned to be recorded in different languages.
ACKNOWLEDGEMENTS
The authors would like to express their gratitude to the Malaysian Ministry of Education (MOE),
through the Fundamental Research Grant Scheme, FRGS19-076-0684 and Universiti Teknologi MARA
(UiTM) Shah Alam which have provided funding and facilities for the research.
REFERENCES
[1] A. Rajasekhar and M. K. Hota, "A Study of Speech, Speaker and Emotion Recognition Using Mel Frequency
Cepstrum Coefficients and Support Vector Machines," in 2018 International Conference on Communication and
Signal Processing (ICCSP), pp. 0114-0118. 2018.
[2] P. P. Dahake, K. Shaw, and P. Malathi, "Speaker dependent speech emotion recognition using MFCC and Support
Vector Machine," in 2016 International Conference on Automatic Control and Dynamic Optimization Techniques
(ICACDOT), pp. 1080-1084. 2016.
[3] A. Sonawane, M. U. Inamdar, and K. B. Bhangale, "Sound based human emotion recognition using MFCC
& multiple SVM," in 2017 International Conference on Information, Communication, Instrumentation
and Control (ICICIC), pp. 1-4. 2017.
[4] J. Watada and Hanayuki, "Speech Recognition in a Multi-speaker Environment by Using Hidden Markov Model
and Mel-frequency Approach," in 2016 Third International Conference on Computing Measurement Control and
Sensor Network (CMCSN), pp. 80-83. 2016.
[5] Chandni, G. Vyas, M. K. Dutta, K. Riha, and J. Prinosil, "An automatic emotion recognizer using MFCCs and
Hidden Markov Models," in 2015 7th International Congress on Ultra Modern Telecommunications and Control
Systems and Workshops (ICUMT), pp. 320-324. 2015.
[6] S. Basu, J. Chakraborty, and M. Aftabuddin, "Emotion recognition from speech using convolutional neural network
with recurrent neural network architecture," in 2017 2nd International Conference on Communication and
Electronics Systems (ICCES), pp. 333-336. 2017.
[7] S. Amiriparian, N. Cummins, S. Ottl, M. Gerczuk, and B. Schuller, "Sentiment analysis using image-based deep
spectrum features," in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction
Workshops and Demos (ACIIW), pp. 26-29. 2017.
[8] S. Mirsamadi, E. Barsoum and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks
with local attention," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
New Orleans, LA, 2017, pp. 2227-2231.
[9] W. Q. Ong and A. W. C. Tan, "Robust voice activity detection using gammatone filtering and entropy," in 2016
International Conference on Robotics, Automation and Sciences (ICORAS), pp. 1-5. 2016.
[10] A. Sehgal and N. Kehtarnavaz, "A Convolutional Neural Network Smartphone App for Real-Time Voice Activity
Detection," in IEEE Access, vol. 6, pp. 9017-9026, 2018.
[11] J. Song et al., "Research on Digital Hearing Aid Speech Enhancement Algorithm," in 2018 37th Chinese Control
Conference (CCC), pp. 4316-4320. 2018.
[12] K. Gokul and S. Lalitha, "Personality Identification Using Auditory Nerve Modelling of Human Speech," in 2018
International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1731-1737.
2018.
[13] P. Harár, R. Burget and M. K. Dutta, "Speech emotion recognition with deep learning," 2017 4th International
Conference on Signal Processing and Integrated Networks (SPIN), Noida, 2017, pp. 137-140.
[14] P. A. Bustamante, N. M. Lopez Celani, M. E. Perez and O. L. Quintero Montoya, "Recognition and regionalization
of emotions in the arousal-valence plane," 2015 37th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBC), Milan, 2015, pp. 6042-6045.
[15] M. Pandharipande, R. Chakraborty, A. Panda, and S. K. Kopparapu, "An Unsupervised frame Selection Technique
for Robust Emotion Recognition in Noisy Speech," in 2018 26th European Signal Processing Conference
(EUSIPCO), pp. 2055-2059. 2018.
[16] M. F. Alghifari, T. S. Gunawan, and M. Kartiwi, "Speech Emotion Recognition Using Deep Feedforward Neural
Network," Indonesian Journal of Electrical Engineering and Computer Science, vol. 10, no. 2, 2018.
[17] Jongseo Sohn, Nam Soo Kim and Wonyong Sung, "A statistical model-based voice activity detection," in IEEE
Signal Processing Letters, vol. 6, no. 1, pp. 1-3, Jan. 1999.
[18] D. Brookes. (2010, 14/2/2019). VOICEBOX: A speech processing toolbox for MATLAB. Available:
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.
[19] T. S. Gunawan, M. F. Alghifari, M. A. Morshidi, and M. Kartiwi, "A Review on Emotion Recognition Algorithms
using Speech Analysis," Indonesian Journal of Electrical Engineering and Informatics (IJEEI), vol. 6, no. 1,
pp. 12-20, 2018.
[20] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of German emotional speech,"
in Ninth European Conference on Speech Communication and Technology, 2005.
[21] F. Karpisek, I. Baggili, and F. Breitinger, "WhatsApp network forensics: Decrypting and understanding the
WhatsApp call signaling messages," Digital Investigation, vol. 15, pp. 110-118, 2015.
[22] F. Pellegrino, C. Coupé, and E. Marsico, "Across-language perspective on speech information rate," Language,
vol. 87, no. 3, pp. 539-558, 2011.
BIOGRAPHIES OF AUTHORS
Muhammad Fahreza Alghifari has completed his B.Eng. (Hons) degree in Electronics: Computer
Information Engineering from International Islamic University Malaysia (IIUM) in 2018 and is
currently pursuing his Masters in Computer Engineering while working as a research assistant.
His research interests are in signal processing, artificial intelligence and affective computing. He
received a best FYP award from IEEE Signal Processing–Malaysia chapter and achieved
recognition in several national level competitions such as Alliance Bank EcoBiz Competition
and IMDC2018.
Teddy Surya Gunawan received his BEng degree in Electrical Engineering with cum laude
award from Institut Teknologi Bandung (ITB), Indonesia in 1998. He obtained his M.Eng degree
in 2001 from the School of Computer Engineering at Nanyang Technological University,
Singapore, and PhD degree in 2007 from the School of Electrical Engineering and
Telecommunications, The University of New South Wales, Australia. His research interests are
in speech and audio processing, biomedical signal processing and instrumentation, image and
video processing, and parallel computing. He is currently an IEEE Senior Member (since 2012),
was chairman of IEEE Instrumentation and Measurement Society–Malaysia Section (2013 and
2014), Associate Professor (since 2012), Head of Department (2015-2016) at Department of
Electrical and Computer Engineering, and Head of Programme Accreditation and Quality
Assurance for Faculty of Engineering (2017-2018), International Islamic University Malaysia.
He is Chartered Engineer (IET, UK) and Insinyur Profesional Madya (PII, Indonesia) since
2016, and registered ASEAN engineer since 2018.
Mimi Aminah binti Wan Nordin received her B.Eng and Masters at International Islamic
University Malaysia and completed her PhD at National University of Malaysia in 2014. She
also obtained her MBA at Asia School of Business in 2018. Mimi has started several startups
and has experience working on sensor-based solutions. Currently, she is the deputy director for
innovation and commercialization unit at research management centre, IIUM, since 2019.
Syed Asif Ahmad Qadri completed his Bachelor of Technology in Computer Science and Engineering from Baba Ghulam Shah Badshah University, Kashmir. Currently, he is working as a Research Assistant with the Department of Electrical and Computer Engineering at International Islamic University Malaysia (IIUM). His research interests include speech processing, artificial intelligence, information security, and IoT. He is currently working on the application of speech emotion recognition using deep neural networks.
Mira Kartiwi completed her studies at the University of Wollongong, Australia resulting in the
following degrees being conferred: Bachelor of Commerce in Business Information Systems,
Master in Information Systems in 2001 and her Doctor of Philosophy in 2009. She is currently
an Associate Professor in Department of Information Systems, Kulliyyah of Information and
Communication Technology, and Deputy Director of e-learning at Centre for Professional
Development, International Islamic University Malaysia. Her research interests include
electronic commerce, data mining, e-health and mobile applications development.
Zuriati Janin received her B.Eng in Electrical Engineering from the Universiti Teknologi Mara,
Malaysia in 1996 and MSc. in Remote Sensing & GIS from the Universiti Putra Malaysia
(UPM) in 2001. In 2007, she began her study towards a Ph.D in Instrumentation and Control
System at the Universiti Teknologi Mara, Malaysia. She has served as a lecturer at Universiti
Teknologi Mara for more than 20 years and currently she is a Senior Lecturer at Faculty of
Electrical Engineering, UiTM, Shah Alam. She has been involved with IEEE since 2012 and has mainly been working with the IEEE Instrumentation & Measurement Chapter (IM09), Malaysia Section, since 2013. The IM09 acknowledged her role as founding Treasurer in initiating and promoting ICSIMA as the chapter's series of annual flagship conferences since its inception in 2013. She also has more than 10 years of experience in organizing international conferences, workshops and seminars. Her role as a conference treasurer started in 2005.