The transcription accuracy of automatic speech recognition (ASR) systems may suffer when recognizing
accented speech. An ASR system becomes biased toward a specific accent when that accent is
under-represented in the training dataset. Accent recognition of existing speech samples can help with the
preparation of training datasets, which is an important step toward closing the accent gap and
eliminating biases in ASR systems. To that end, we built a system that recognizes accent from spoken speech data.
In this study, we explored prosodic and vocal speech features as well as speaker embeddings for
accent recognition on our custom English speech dataset, which covers speakers from around the world with
varying accents. We demonstrate that our selected speech features are more effective in recognizing
non-native accents. Additionally, we experimented with a hierarchical classification model for multi-level
accent classification. To establish an accent hierarchy, we employed a bottom-up approach, combining
regional accents and categorizing them as either native or non-native at the top level. Furthermore, we
conducted a comparative study between flat classification and hierarchical classification using the accent
hierarchy structure.
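As a rough sketch of the two-level scheme described above, the following uses toy 2-D "embeddings" and a nearest-centroid stand-in classifier at each level; the class names, data, and models are illustrative assumptions, not the paper's actual features or classifiers.

```python
import numpy as np

class NearestCentroid:
    """Minimal nearest-centroid classifier used as a stand-in model."""
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.labels_ = sorted(set(y.tolist()))
        self.centroids_ = {c: X[y == c].mean(axis=0) for c in self.labels_}
        return self
    def predict(self, X):
        return [min(self.labels_, key=lambda c: np.linalg.norm(x - self.centroids_[c]))
                for x in np.asarray(X, float)]

class HierarchicalAccentClassifier:
    """Level 1 predicts native vs non-native; level 2 predicts the regional accent."""
    def fit(self, X, regions, top_of):
        X = np.asarray(X, float)
        top = [top_of[r] for r in regions]
        self.top_clf = NearestCentroid().fit(X, top)
        self.sub_clfs = {
            t: NearestCentroid().fit(
                X[[i for i, v in enumerate(top) if v == t]],
                [regions[i] for i, v in enumerate(top) if v == t])
            for t in set(top)
        }
        return self
    def predict(self, X):
        X = np.asarray(X, float)
        return [(t, self.sub_clfs[t].predict([x])[0])
                for t, x in zip(self.top_clf.predict(X), X)]

# Toy embeddings: two "native" regions cluster left, two "non-native" cluster right.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [9, 0], [9, 1], [10, 0], [10, 1]]
regions = ["us", "us", "uk", "uk", "indian", "indian", "german", "german"]
top_of = {"us": "native", "uk": "native", "indian": "non-native", "german": "non-native"}
model = HierarchicalAccentClassifier().fit(X, regions, top_of)
print(model.predict([[0.1, 0.1], [9.2, 0.9]]))
```

A flat baseline would instead train a single classifier over all four regional labels; comparing the two setups mirrors the flat-vs-hierarchical study described above.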
Accents of English have been investigated for many years from the perspectives of both native and non-native speakers of the language. Various research results imply that non-native speakers of English produce certain speech characteristics that are uncommon in native speakers’ speech, because non-native speakers do not produce the same tongue movements as native speakers. This paper presents an isolated English word recognition system devised with the speech of local Bangladeshi people, who are non-native speakers of English. Here, we also noticed a speech characteristic that is not present in the speech of native English speakers. Two acoustic features, ‘pitch’ and ‘formants’, have been utilized to develop the system. The system is speaker-independent and is based on a template-based approach. The recognition method applied here is simple, and the recognition accuracy is satisfactory.
Implementation of Marathi Language Speech Databases for Large Dictionary (iosrjce)
IOSR Journal of VLSI and Signal Processing (IOSRJVSP) is a double-blind peer-reviewed International Journal that publishes articles contributing new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and to establish new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
English speaking proficiency assessment using speech and electroencephalograp... (IJECEIAES)
In this paper, the English speaking proficiency level of non-native English speakers was automatically estimated as high, medium, or low. For this purpose, the speech of 142 non-native English speakers was recorded, and electroencephalography (EEG) signals of 58 of them were recorded while speaking in English. Two systems were proposed for estimating the English proficiency level of the speaker: one used 72 audio features extracted from speech signals, and the other used 112 features extracted from EEG signals. Multi-class support vector machines (SVMs) were used for training and testing both systems with a cross-validation strategy. The speech-based system outperformed the EEG system, achieving 68% accuracy on 60 test audio recordings, compared with 56% accuracy on 30 test EEG recordings.
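The cross-validation protocol mentioned in this abstract can be sketched as follows; a nearest-centroid classifier stands in for the paper's multi-class SVM, and the round-robin fold assignment and toy proficiency data are assumptions for illustration.

```python
import numpy as np

def kfold_accuracy(X, y, k=3):
    """k-fold cross-validation with a nearest-centroid stand-in classifier."""
    X, y = np.asarray(X, float), np.asarray(y)
    folds = np.arange(len(X)) % k          # simple round-robin fold assignment
    accs = []
    for f in range(k):
        tr, te = folds != f, folds == f
        # Train: one centroid per class from the training folds only.
        cents = {c: X[tr & (y == c)].mean(axis=0) for c in np.unique(y[tr])}
        # Test: assign each held-out sample to the nearest class centroid.
        pred = [min(cents, key=lambda c: np.linalg.norm(x - cents[c])) for x in X[te]]
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

# Toy 1-D "feature": low values -> "low" proficiency, high values -> "high".
X = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.1]]
y = ["low", "low", "low", "high", "high", "high"]
print(kfold_accuracy(X, y, k=3))
```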
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ... (ijma)
In this work, a new Bangla speech corpus with proper transcriptions has been developed; in addition,
various acoustic feature extraction methods have been investigated using a Long Short-Term Memory
(LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition
system. The acoustic features are usually a sequence of representative vectors extracted from speech
signals, and the classes are either words or sub-word units such as phonemes. The most commonly
used feature extraction method, linear predictive coding (LPC), has been applied first in this work.
The two other popular methods, Mel-frequency cepstral coefficients (MFCC) and perceptual linear
prediction (PLP), have also been applied; these methods are based on models of the human auditory
system. A detailed review of the implementation of these methods has been given first, followed by
the steps of the implementation for the development of an automatic speech recognition (ASR) system
for Bangla speech.
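LPC, the first feature extraction method named in this abstract, can be illustrated with the autocorrelation method on a synthetic second-order autoregressive signal; the signal, coefficients, and model order below are assumptions chosen for the demo, not the paper's setup.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    x = np.asarray(x, float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])   # predictor: x[n] ~ sum_k a[k] * x[n-k-1]

rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(e)):             # synthetic AR(2): x[n] = 1.5 x[n-1] - 0.7 x[n-2] + e[n]
    x[n] = 1.5 * x[n-1] - 0.7 * x[n-2] + e[n]
a = lpc(x, 2)
print(a)   # should lie close to the true coefficients [1.5, -0.7]
```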
Emotional telugu speech signals classification based on k-NN classifier (eSAT Publishing House)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Emotional telugu speech signals classification based on k-NN classifier (eSAT Journals)
Abstract: Speech processing is the study of speech signals and the methods used to process them. It is employed in applications such as speech coding, speech synthesis, speech recognition, and speaker recognition. In speech classification, the computation of prosody effects from speech signals plays a major role. In emotional speech signals, pitch and frequency are the most important parameters. Normally, the pitch values of sad and happy speech signals differ greatly, and the frequency of happy speech is higher than that of sad speech. In some cases, however, the frequency of happy speech is nearly similar to that of sad speech, or vice versa, and it is difficult to recognize the emotion correctly. To reduce such drawbacks, in this paper we propose a Telugu speech emotion classification system with three features, Energy Entropy, Short-Time Energy, and Zero-Crossing Rate, and a K-NN classifier for the classification. Features are extracted from the speech signals and given to the K-NN. The implementation results show the effectiveness of the proposed speech emotion classification system in classifying Telugu speech signals based on their prosody effects. The performance of the proposed system is evaluated by conducting cross-validation on the Telugu speech database. Keywords: Emotion Classification, K-NN classifier, Energy Entropy, Short-Time Energy, Zero-Crossing Rate.
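The three frame-level features named in this abstract (Energy Entropy, Short-Time Energy, Zero-Crossing Rate) have standard definitions; a minimal sketch, with the sub-block count and test signal chosen for illustration:

```python
import numpy as np

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) / 2))

def energy_entropy(frame, n_blocks=10):
    """Entropy of the energy distribution across sub-blocks of the frame."""
    blocks = np.array_split(frame, n_blocks)
    e = np.array([np.sum(b ** 2) for b in blocks])
    p = e / (np.sum(e) + 1e-12)
    return float(-np.sum(p * np.log2(p + 1e-12)))

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 100 * t)     # 100 Hz tone: 200 zero crossings per second
print(zero_crossing_rate(tone), short_time_energy(tone), energy_entropy(tone))
```

For the steady tone, ZCR is about 200/8000 = 0.025, the energy is about 0.5, and the energy entropy approaches log2(10) because the energy is spread evenly over the 10 sub-blocks.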
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK (ijitcs)
Speech technology is an emerging field, and automatic speech recognition has made advances in recent years. Much research has been performed for many foreign and regional languages, and multilingual speech processing has recently been attracting research interest. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
High Quality Arabic Concatenative Speech Synthesis (sipij)
This paper describes the implementation of TD-PSOLA tools to improve the quality of an Arabic text-to-speech (TTS) system based on diphone concatenation with a TD-PSOLA modifier synthesizer. It describes techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method. This approach is based on the decomposition of the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modifications of the speech signal, together with adjustment and optimisation of the diphone database with integrated vowels.
Segmentation Words for Speech Synthesis in Persian Language Based On Silence (paperpublications3)
Abstract: In text-to-speech synthesis systems, words are usually broken into parts, and the recorded sound of each part is used to play back the word. This paper uses silence in word pronunciation to obtain better speech quality. Most algorithms divide words into syllables, and some divide words into phonemes; this paper instead exploits silences in intonation, divides words at silent regions, and then assigns the equivalent sound to each part, so that joining the parts is reliable and the speech quality is smoother. The paper concerns the Persian language, but the method is extendable to other languages. It has been tested with a MOS test, and intelligibility, naturalness, and fluidity are improved. Keywords: TTS, SBS, Syllable, Diphone.
Title: Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Authors: Sohrab Hojjatkhah, Ali Jowharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
SPEAKER VERIFICATION USING ACOUSTIC AND PROSODIC FEATURES (acijjournal)
In this paper we report an experiment carried out on a recently collected speaker recognition database,
the Arunachali Language Speech Database (ALS-DB), to make a comparative study of the performance of
acoustic and prosodic features for the speaker verification task. The database consists of speech data
recorded from 200 speakers with Arunachali languages of North-East India as their mother tongue. The
collected database is evaluated using a Gaussian mixture model-Universal Background Model (GMM-UBM)
based speaker verification system. The acoustic feature considered in the present study is Mel-Frequency
Cepstral Coefficients (MFCC) along with its derivatives. The performance of the system has been
evaluated for the acoustic and prosodic features individually as well as in combination. It has been
observed that the acoustic feature, when considered individually, provides better performance than the
prosodic features. However, when prosodic features are combined with the acoustic feature, the system
outperforms both systems in which the features are considered individually: there is nearly a 5%
improvement in recognition accuracy with respect to the system using acoustic features alone, and nearly
a 20% improvement with respect to the system using only prosodic features.
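A GMM-UBM verifier of the kind described in this abstract scores a trial by the log-likelihood ratio between a speaker model and the universal background model. In the simplified sketch below, single diagonal Gaussians stand in for the GMMs, and the feature dimensions, data, and threshold are toy assumptions.

```python
import numpy as np

def diag_gauss_logpdf(X, mu, var):
    """Log-density of each row of X under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

def verify(trial, speaker_model, ubm, threshold=0.0):
    """Accept if the average per-frame log-likelihood ratio exceeds the threshold."""
    llr = np.mean(diag_gauss_logpdf(trial, *speaker_model)
                  - diag_gauss_logpdf(trial, *ubm))
    return llr > threshold, float(llr)

rng = np.random.default_rng(1)
pool = rng.normal(0, 3, size=(2000, 2))                 # background "UBM" training pool
target = rng.normal([2.0, -1.0], 0.5, size=(200, 2))    # target speaker's features
ubm = (pool.mean(0), pool.var(0))
spk = (target.mean(0), target.var(0))

same = rng.normal([2.0, -1.0], 0.5, size=(50, 2))       # trial from the same speaker
other = rng.normal([-3.0, 3.0], 0.5, size=(50, 2))      # trial from an impostor
print(verify(same, spk, ubm), verify(other, spk, ubm))
```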
Hindi digits recognition system on speech data collected in different natural... (csandit)
This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment is different for each speaker, since the data is collected in their respective homes. The different environments include vehicle horn noise in some road-facing rooms, internal background noise in some rooms such as opening doors, and silence in others. All these recordings are used for training the acoustic model, which is trained on 8 speakers’ audio data. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training Ensembles (Mohamed El-Geish)
Arabic speech recognition suffers from the scarcity of properly labeled data. In this project, we introduce a pipeline that performs semi-supervised segmentation of audio and then, after hand-labeling a small dataset, feeds labeled segments to a supervised learning framework that selects, through many rounds of hyperparameter optimization, an ensemble of models to infer labels for a larger dataset. Using these inferred labels, we improved the keyword spotter’s F1 score from 75.85% (using a baseline model) to 90.91% on a ground-truth test set. We picked the keyword na`am (yes) to spot; we defined the system’s input as an audio file of an utterance and the output as a binary label: keyword or filler.
Hybrid Phonemic and Graphemic Modeling for Arabic Speech Recognition (Waqas Tariq)
In this research, we propose a hybrid approach to acoustic and pronunciation modeling for Arabic speech recognition. The hybrid approach benefits from both vocalized and non-vocalized Arabic resources, exploiting the fact that the amount of non-vocalized resources is always higher than that of vocalized resources. Two baseline speech recognition systems were built: phonemic and graphemic. The two baseline acoustic models were trained independently and then fused to create a hybrid acoustic model. Pronunciation modeling was also hybrid, generating graphemic pronunciation variants as well as phonemic variants, and different techniques are proposed to reduce model complexity. Experiments were conducted on a large-vocabulary news broadcast speech domain. The proposed hybrid approach showed a relative WER reduction of 8.8% to 12.6%, depending on the pronunciation modeling settings and the supervision in the baseline systems.
FORMANT ANALYSIS OF BANGLA VOWEL FOR AUTOMATIC SPEECH RECOGNITION (sipij)
To provide new technological benefits to the general population, regional and local language recognition
is nowadays drawing researchers’ attention. As for other languages, a Bangla speech recognition scheme
is in demand. A formant is the resonance frequency of the vocal tract. Formant frequencies play an
important role in automatic speech recognition due to their noise-robust characteristics. In this paper,
Bangla vowels are investigated to acquire formant frequencies and their corresponding bandwidths from
continuous Bangla sentences, which are considered potential parameters for a wide range of voice
applications. For the formant analysis, cepstrum-based formant estimation and Linear Predictive Coding
(LPC) techniques are used. To acquire formant characteristics, the widely available Bangla language
corpus “SHRUTI”, rich in continuous sentences, is used. Intensive experimentation is carried out to
determine the formant characteristics (frequency and bandwidth) of Bangla vowels for both male and
female speakers. Finally, the vowel recognition accuracy for Bangla is reported considering the first
three formants.
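LPC-based formant estimation, one of the two techniques named in this abstract, reads formants off the angles of the LPC poles. A minimal sketch on a synthetic one-resonance "vowel"; the sampling rate, pole radius, and model order are assumptions for the demo.

```python
import numpy as np

def lpc_coeffs(x, order):
    """Autocorrelation-method LPC predictor coefficients."""
    x = np.asarray(x, float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def formants(x, fs, order=8):
    """Estimate formant frequencies from the angles of the LPC poles."""
    a = lpc_coeffs(x, order)
    poles = np.roots(np.concatenate(([1.0], -a)))       # roots of A(z)
    freqs = np.angle(poles) * fs / (2 * np.pi)          # pole angle -> frequency
    return sorted(f for f in freqs if 50 < f < fs / 2)  # keep positive, in-band

# Synthetic "vowel": white noise through one resonance near 700 Hz.
fs = 8000
rng = np.random.default_rng(2)
e = rng.standard_normal(16000)
r, th = 0.97, 2 * np.pi * 700 / fs
a1, a2 = 2 * r * np.cos(th), -r * r                     # AR(2) resonator coefficients
x = np.zeros_like(e)
for n in range(2, len(e)):
    x[n] = a1 * x[n-1] + a2 * x[n-2] + e[n]
print(formants(x, fs, order=2))                         # one formant near 700 Hz
```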
A high-quality end-to-end speech translation model relies on large-scale speech-to-text training data,
which is usually scarce or even unavailable for some low-resource language pairs. To overcome this, we
propose a target-side data augmentation method for low-resource speech translation. In particular,
we first generate large-scale target-side paraphrases based on a paraphrase generation model
that incorporates several statistical machine translation (SMT) features and a commonly used
recurrent neural network (RNN). Then, a filtering model based on semantic similarity and
word-pair co-occurrence between speech and text is proposed to select the highest-scoring
paraphrase pairs from the candidates. Experimental results on English, Arabic, German, Latvian, Estonian,
Slovenian, and Swedish paraphrase generation show that the proposed method achieves significant
and consistent improvements over several strong baseline models on PPDB datasets (http://paraphrase.
org/). To introduce the paraphrase generation results into low-resource speech translation, we
propose two strategies: audio-text pair recombination and multi-reference training. Experimental
results show that speech translation models trained on new audio-text datasets that combine
the paraphrase generation results lead to substantial improvements over the baselines, especially for
low-resource languages.
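The audio-text pair recombination strategy can be sketched as a simple dataset expansion, pairing each audio clip with its original transcript plus the filtered paraphrases of that transcript; the data shapes below are assumptions, not the paper's actual format.

```python
def recombine(pairs, paraphrases):
    """Expand (audio_id, target_text) pairs with paraphrases of each target text."""
    out = []
    for audio_id, text in pairs:
        out.append((audio_id, text))             # keep the original pair
        for p in paraphrases.get(text, []):      # add one new pair per paraphrase
            out.append((audio_id, p))
    return out

pairs = [("utt1.wav", "the weather is nice"), ("utt2.wav", "close the door")]
paraphrases = {"the weather is nice": ["it is nice weather", "the weather is good"]}
print(recombine(pairs, paraphrases))
```

Multi-reference training, the second strategy, would instead keep one audio entry with a list of reference texts rather than duplicating the audio.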
SPEECH EVALUATION WITH SPECIAL FOCUS ON CHILDREN SUFFERING FROM AP... (sipij)
Speech disorders are very complicated in individuals suffering from Apraxia of Speech (AOS). In this
paper, the pathological cases of speech-disabled children affected by AOS are analyzed. Speech signal
samples of children aged between three and eight years are considered for the present study. These
speech signals are digitized and enhanced, and analyzed using the Speech Pause Index, Jitter, Skew, and
Kurtosis. This analysis is conducted on speech data samples concerned with both the place of articulation
and the manner of articulation. The speech disability of the pathological subjects was estimated using
the results of the above analysis.
Artificially Generated of Concatenative Syllable based Text to Speech Synthesi... (iosrjce)
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (ijnlc)
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
An Approach to Detecting Writing Styles Based on Clustering Techniques (ambekarshweta25)
An Approach to Detecting Writing Styles Based on Clustering Techniques
Authors:
-Devkinandan Jagtap
-Shweta Ambekar
-Harshit Singh
-Nakul Sharma (Assistant Professor)
Institution:
VIIT Pune, India
Abstract:
This paper proposes a system to differentiate between human-generated and AI-generated texts using stylometric analysis. The system analyzes text files and classifies writing styles by employing various clustering algorithms, such as k-means, k-means++, hierarchical, and DBSCAN. The effectiveness of these algorithms is measured using silhouette scores. The system successfully identifies distinct writing styles within documents, demonstrating its potential for plagiarism detection.
Introduction:
Stylometry, the study of linguistic and structural features in texts, is used for tasks like plagiarism detection, genre separation, and author verification. This paper leverages stylometric analysis to identify different writing styles and improve plagiarism detection methods.
Methodology:
The system includes data collection, preprocessing, feature extraction, dimensional reduction, machine learning models for clustering, and performance comparison using silhouette scores. Feature extraction focuses on lexical features, vocabulary richness, and readability scores. The study uses a small dataset of texts from various authors and employs algorithms like k-means, k-means++, hierarchical clustering, and DBSCAN for clustering.
Results:
Experiments show that the system effectively identifies writing styles, with silhouette scores indicating reasonable to strong clustering when k=2. As the number of clusters increases, the silhouette scores decrease, indicating a drop in accuracy. K-means and k-means++ perform similarly, while hierarchical clustering is less optimized.
Conclusion and Future Work:
The system works well for distinguishing writing styles with two clusters but becomes less accurate as the number of clusters increases. Future research could focus on adding more parameters and optimizing the methodology to improve accuracy with higher cluster values. This system can enhance existing plagiarism detection tools, especially in academic settings.
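The silhouette score used throughout this summary has a simple closed form, (b - a) / max(a, b) averaged over samples, where a is a sample's mean intra-cluster distance and b is its mean distance to the nearest other cluster. A minimal NumPy sketch on toy 1-D data (the data and labels are illustrative assumptions):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per sample."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        a = D[i, same].sum() / max(same.sum() - 1, 1)          # mean intra-cluster dist
        b = min(D[i, labels == lj].mean()                      # nearest other cluster
                for lj in set(labels.tolist()) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated 1-D clusters score close to 1, consistent with k=2 fitting best.
X = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
labels = [0, 0, 0, 1, 1, 1]
print(silhouette(X, labels))
```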
Similar to ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBEDDINGS, PROSODIC AND VOCAL SPEECH FEATURES
performance compared to prosodic features. However, if prosodic features are combined with acoustic
feature, performance of the system outperforms both the systems where the features are considered
individually. There is a nearly 5% improvement in recognition accuracy with respect to the system where
acoustic features are considered individually and nearly 20% improvement with respect to the system
where only prosodic features are considered.
Hindi digits recognition system on speech data collected in different natural...csandit
This paper presents a baseline digits speech recognizer for Hindi language. The recording environment is different for all speakers, since the data is collected in their respective homes. The different environment refers to vehicle horn noises in some road facing rooms, internal background noises in some rooms like opening doors, silence in some rooms etc. All these recordings are used for training acoustic model. The Acoustic Model is trained on 8 speakers’ audio data. The vocabulary size of the recognizer is 10 words. HTK toolkit is used for building
acoustic model and evaluating the recognition rate of the recognizer. The efficiency of the recognizer developed on recorded data, is shown at the end of the paper and possible directions for future research work are suggested.
Semi-Supervised Keyword Spotting in Arabic Speech Using Self-Training EnsemblesMohamed El-Geish
Arabic speech recognition suffers from the scarcity of properly labeled data. In this project, we introduce a pipeline that performs semi-supervised segmentation of audio then— after hand-labeling a small dataset—feeds labeled segments to a supervised learning framework to select, through many rounds of hyperparameter optimization, an ensemble of models to infer labels for a larger dataset; using which we improved the keyword spotter’s F1 score from 75.85% (using a baseline model) to 90.91% on a ground-truth test set. We picked the keyword na`am (yes) to spot; we defined the system’s input as an audio file of an utterance and the output as a binary label: keyword or filler.
Hybrid Phonemic and Graphemic Modeling for Arabic Speech RecognitionWaqas Tariq
In this research, we propose a hybrid approach for acoustic and pronunciation modeling for Arabic speech recognition. The hybrid approach benefits from both vocalized and non-vocalized Arabic resources, based on the fact that the amount of non-vocalized resources is always higher than vocalized resources. Two speech recognition baseline systems were built: phonemic and graphemic. The two baseline acoustic models were fused together after two independent trainings to create a hybrid acoustic model. Pronunciation modeling was also hybrid by generating graphemic pronunciation variants as well as phonemic variants. Different techniques are proposed for pronunciation modeling to reduce model complexity. Experiments were conducted on large vocabulary news broadcast speech domain. The proposed hybrid approach has shown a relative reduction in WER of 8.8% to 12.6% based on pronunciation modeling settings and the supervision in the baseline systems.
Segmentation Words for Speech Synthesis in Persian Language Based On Silencepaperpublications3
Abstract: In speech synthesis in text to speech systems, the words usually break to different parts and use from recorded sound of each part for play words. This paper use silent in word's pronunciation for better quality of speech. Most algorithms divide words to syllable and some of them divide words to phoneme, but This paper benefit from silent in intonation and divide words at silent region and then set equivalent sound of each parts whereupon joining the parts is trusty and speech quality being more smooth . this paper concern Persian language but extendable to another language. This method has been tested with MOS test and intelligibility, naturalness and fluidity are better.
Keywords:TTS, SBS, Sillable, Diphone.
FORMANT ANALYSIS OF BANGLA VOWEL FOR AUTOMATIC SPEECH RECOGNITIONsipij
To provide new technological benefits to the mass people, nowadays, regional and local language
recognition draws attention to the researchers. Similarly to other languages, Bangla speech recognition
scheme is demandable. A formant is considered as the resonance frequency of vocal tract. Formant
frequencies play an important role for the purpose of automatic speech recognition, due to its noise robust
characteristics. In this paper, Bangla vowels are investigated to acquire formant frequencies and its
corresponding bandwidth from continuous Bangla sentences, which are considered as potential parameters
for wide voice applications. For the purpose of formant analysis, cepstrum based formant estimation and
Linear Predictive Coding (LPC) techniques are used. In order to acquire formant characteristics, enrich
continuous sentences and widely available Bangla language corpus namely “SHRUTI” is considered.
Intensive experimentation is carried out to determine formant characteristics (frequency and bandwidth) of
Bangla vowels for both male and female speakers. Finally, vowel recognition accuracy of Bangla language
is reported considering first three formants
El modelo de traducción de voz de extremo a extremo de alta calidad se basa en una gran escala de datos de entrenamiento de voz a texto,
que suele ser escaso o incluso no está disponible para algunos pares de idiomas de bajos recursos. Para superar esto, nos
proponer un método de aumento de datos del lado del objetivo para la traducción del habla en idiomas de bajos recursos. En particular,
primero generamos paráfrasis del lado objetivo a gran escala basadas en un modelo de generación de paráfrasis
que incorpora varias características de traducción automática estadística (SMT) y el uso común
función de red neuronal recurrente (RNN). Luego, un modelo de filtrado que consiste en similitud semántica
y se propuso la co-ocurrencia de pares de palabras y habla para seleccionar la fuente con la puntuación más alta
pares de paráfrasis de los candidatos. Resultados experimentales en inglés, árabe, alemán, letón, estonio,
La generación de paráfrasis eslovena y sueca muestra que el método propuesto logra resultados significativos.
y mejoras consistentes sobre varios modelos de referencia sólidos en conjuntos de datos PPDB (http://paraphrase.
org/). Para introducir los resultados de la generación de paráfrasis en la traducción de voz de bajo recurso,
proponen dos estrategias: recombinación de pares audio-texto y entrenamiento de referencias múltiples. Experimental
Los resultados muestran que los modelos de traducción de voz entrenados en nuevos conjuntos de datos de audio y texto que combinan
los resultados de la generación de paráfrasis conducen a mejoras sustanciales sobre las líneas de base, especialmente en
lenguas de escasos recursos.
Sipij040305SPEECH EVALUATION WITH SPECIAL FOCUS ON CHILDREN SUFFERING FROM AP...sipij
Speech disorders are very complicated in individuals suffering from Apraxia of Speech-AOS. In this paper ,
the pathological cases of speech disabled children affected with AOS are analyzed. The speech signal
samples of children of age between three to eight years are considered for the present study. These speech
signals are digitized and enhanced using the using the Speech Pause Index, Jitter,Skew ,Kurtosis analysis
This analysis is conducted on speech data samples which are concerned with both place of articulation and
manner of articulation. The speech disability of pathological subjects was estimated using results of above
analysis.
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...iosrjce
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Modelsijnlc
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
Similar to ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBEDDINGS, PROSODIC AND VOCAL SPEECH FEATURES (20)
An Approach to Detecting Writing Styles Based on Clustering Techniquesambekarshweta25
An Approach to Detecting Writing Styles Based on Clustering Techniques
Authors:
-Devkinandan Jagtap
-Shweta Ambekar
-Harshit Singh
-Nakul Sharma (Assistant Professor)
Institution:
VIIT Pune, India
Abstract:
This paper proposes a system to differentiate between human-generated and AI-generated texts using stylometric analysis. The system analyzes text files and classifies writing styles by employing various clustering algorithms, such as k-means, k-means++, hierarchical, and DBSCAN. The effectiveness of these algorithms is measured using silhouette scores. The system successfully identifies distinct writing styles within documents, demonstrating its potential for plagiarism detection.
Introduction:
Stylometry, the study of linguistic and structural features in texts, is used for tasks like plagiarism detection, genre separation, and author verification. This paper leverages stylometric analysis to identify different writing styles and improve plagiarism detection methods.
Methodology:
The system includes data collection, preprocessing, feature extraction, dimensional reduction, machine learning models for clustering, and performance comparison using silhouette scores. Feature extraction focuses on lexical features, vocabulary richness, and readability scores. The study uses a small dataset of texts from various authors and employs algorithms like k-means, k-means++, hierarchical clustering, and DBSCAN for clustering.
Results:
Experiments show that the system effectively identifies writing styles, with silhouette scores indicating reasonable to strong clustering when k=2. As the number of clusters increases, the silhouette scores decrease, indicating a drop in accuracy. K-means and k-means++ perform similarly, while hierarchical clustering is less optimized.
Conclusion and Future Work:
The system works well for distinguishing writing styles with two clusters but becomes less accurate as the number of clusters increases. Future research could focus on adding more parameters and optimizing the methodology to improve accuracy with higher cluster values. This system can enhance existing plagiarism detection tools, especially in academic settings.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Water billing management system project report.pdfKamal Acharya
Our project entitled “Water Billing Management System” aims is to generate Water bill with all the charges and penalty. Manual system that is employed is extremely laborious and quite inadequate. It only makes the process more difficult and hard.
The aim of our project is to develop a system that is meant to partially computerize the work performed in the Water Board like generating monthly Water bill, record of consuming unit of water, store record of the customer and previous unpaid record.
We used HTML/PHP as front end and MYSQL as back end for developing our project. HTML is primarily a visual design environment. We can create a android application by designing the form and that make up the user interface. Adding android application code to the form and the objects such as buttons and text boxes on them and adding any required support code in additional modular.
MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software. It is a stable ,reliable and the powerful solution with the advanced features and advantages which are as follows: Data Security.MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Forklift Classes Overview by Intella PartsIntella Parts
Discover the different forklift classes and their specific applications. Learn how to choose the right forklift for your needs to ensure safety, efficiency, and compliance in your operations.
For more technical information, visit our website https://intellaparts.com
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
DESIGN AND ANALYSIS OF A CAR SHOWROOM USING E TABS
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBEDDINGS, PROSODIC AND VOCAL SPEECH FEATURES
Signal & Image Processing: An International Journal (SIPIJ) Vol.14, No.2/3, June 2023
DOI: 10.5121/sipij.2023.14302
ENHANCING NON-NATIVE ACCENT RECOGNITION THROUGH A COMBINATION OF SPEAKER EMBEDDINGS, PROSODIC AND VOCAL SPEECH FEATURES
Anup Bera and Aanchal Agarwal
Accenture India
ABSTRACT
The transcription accuracy of an automatic speech recognition (ASR) system may suffer when recognizing accented speech. The resulting bias of an ASR system towards a specific accent arises from the under-representation of that accent in the training dataset. Accent recognition of existing speech samples can help with the preparation of training datasets, which is an important step toward closing the accent gap and eliminating biases in ASR systems. To that end, we built a system to recognize accent from spoken speech data. In this study, we explored prosodic and vocal speech features as well as speaker embeddings for accent recognition on our custom English speech dataset, which covers speakers from around the world with varying accents. We demonstrate that our selected speech features are more effective in recognizing non-native accents. Additionally, we experimented with a hierarchical classification model for multi-level accent classification. To establish an accent hierarchy, we employed a bottom-up approach, combining regional accents and categorizing them as either native or non-native at the top level. Furthermore, we conducted a comparative study between flat classification and hierarchical classification using the accent hierarchy structure.
KEYWORDS
Automatic speech recognition; accent recognition; hierarchical classification; MLP classifier; speech features; speaker embedding; native and non-native accents
1. INTRODUCTION
The advent of automatic speech recognition (ASR) systems and their wide adoption in industry across applications such as voice bots, voice search, mobile phones, and home appliances demands high transcription accuracy. The performance of some ASR systems falls short of the desirable mark for accented speech, especially for non-native speakers. The poor performance on a particular accent may be due to poor representation of that accent in the training dataset. Identifying the accent gap and enriching the training dataset with speech data from different accents therefore helps improve an ASR model's overall performance. To address this problem, it is necessary to recognize the spoken accent of the speech data and use that information to create a well-balanced speech training dataset. We built a supervised classification model for accent recognition. In this paper we present a comparative study of flat classification and hierarchical classification for accent recognition on English speech data. The classification models are trained using input features such as prosodic and vocal speech features as well as speaker embeddings of the speech data. We also demonstrate that overall model accuracy improves when prosodic and vocal speech features are used along with speaker embedding vectors, and that our speech features enhance model performance on non-native accent recognition. As our accent recognition experiment is not limited to a few selected accents or a specific geolocation, we prepared a custom dataset that covers speakers from around the world with varying accents.
1.1. Related Work
Most previous works [1] used MFCC as the input features and chose a neural network model for accent prediction. A comparative study of various speech features, such as MFCC, spectrogram, chromagram, spectral centroid, and spectral roll-off, was conducted by [2] using a 2-layer CNN model; they concluded that MFCC gave the best accuracy on 5 accents, namely Arabic, English, French, Mandarin and Spanish, at 48.4%. Arlo Faria [3] explored accent classification for native and non-native speakers using acoustic and lexical features, achieving 84.5% accuracy. L. M. K. Sheng [4] explored a deep learning approach for accent classification on 3 accents, namely English, Chinese, and Korean, and achieved 87.6% accuracy. Other works have jointly studied the relationship between ASR and accent identification, using that information to select accented speech [5], to improve the accuracy of both [6], or to use accent information to improve ASR accuracy [7]. To our knowledge, no prior experiment has used speaker embeddings as input features for accent recognition. As speaker embeddings carry salient information such as the speaker's speaking style, age, dialect, and accent, this intrigued us to experiment with this feature. Further, we appended prosodic and vocal features to the embedding vectors to understand the accent recognition model's performance on the combined features.
2. RESEARCH METHODOLOGY
2.1. Dataset Preparation
We prepared a custom dataset by combining multiple open datasets that together provide sufficient representation of various accents around the world.
2.1.1. Dataset Description
Our goal is to recognize the accent of English speech spoken anywhere in the world. To achieve that, we prepared a dataset that combines speech data spoken across geographic regions. To conduct our experiments, we used 5 open-source speech datasets: Kaggle Common Voice [8], the George Mason University Speech Accent Archive [9], the CSTR VCTK Corpus [10], NISP [11] and TIMIT [12]. We combined these 5 datasets to train our model. For this experiment we used a total of 22.5 hours of speech data, under-sampling some accent groups to make the dataset more balanced and comprehensive. Details of the dataset preparation approach are given in [13].
2.1.2. Data Processing
Each dataset has a different file format, sampling rate, encoding, and file duration. For example, the TIMIT dataset has files mostly shorter than 5 seconds, with a 16 kHz sampling rate in wav format, whereas the Common Voice data have a 48 kHz sampling rate in mp3 format. The VCTK data are in flac format with a 48 kHz sampling rate. Some datasets also contain data in languages other than English. As this study focuses on English speech data, it was necessary to filter the original datasets for English data only.
2.1.3. Data Standardization
As the above 5 datasets have different file formats, sampling rates, bit rates, encodings, and file durations, it is necessary to standardize all speech data in order to create a single training dataset. Regardless of the original file format, sampling rate and bit rate of each dataset, a pipeline was implemented to standardize all the data. Speech file duration also needs to be considered during data preparation. After some initial experiments, we found that files with a minimum duration of 5 seconds and a maximum duration of 20 seconds give the optimum result. So, we kept only files longer than 5 seconds and clipped all files longer than 20 seconds. All the data are standardized to the following format:
File format: wav
Sampling rate: 16 kHz
Bits per sample: 16
Encoding: PCM_S
Channels: single (mono)
Max file duration: 20 seconds
Min file duration: 5 seconds
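The standardization step above can be sketched as follows. This is a minimal illustration: the paper does not name the conversion tool its pipeline used, so the use of ffmpeg, the helper names, and the file names here are our assumptions.

```python
# Sketch of the standardization pipeline: duration filtering/clipping plus
# conversion of any input (mp3/flac/wav) to 16 kHz, 16-bit PCM, mono wav.
from typing import Optional, List

MIN_DUR, MAX_DUR = 5.0, 20.0  # duration limits (seconds) from the paper

def keep_or_clip(duration_s: float) -> Optional[float]:
    """Return the target duration for a file, or None to drop it."""
    if duration_s <= MIN_DUR:
        return None                  # too short: filtered out
    return min(duration_s, MAX_DUR)  # clip anything longer than 20 s

def ffmpeg_standardize_cmd(src: str, dst: str,
                           duration_s: float) -> Optional[List[str]]:
    """Build an ffmpeg command implementing the target format above."""
    target = keep_or_clip(duration_s)
    if target is None:
        return None
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to a single channel
        "-c:a", "pcm_s16le",   # 16-bit signed PCM encoding
        "-t", str(target),     # clip to at most 20 s
        dst,
    ]
```

For example, a 30-second mp3 would be clipped to 20 seconds, while a 3-second clip would be dropped entirely.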
2.2. Accent Label Preparation
Wherever accent labels were available, we used them as-is. Where accent labels were not directly available, we used indirect information, such as the speaker's first language, mother tongue, or region of origin, to assign an accent label. For example, if the dataset metadata mentions that a speaker's first language is Hindi, then the accent of the data is assigned as Indian. Similarly, if a speaker's first language is listed as French, then the accent of the speech data is assigned as French.
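This labeling rule can be written as a simple lookup. Only the Hindi-to-Indian and French-to-French examples come from the text; the fallback behaviour and helper names are our illustrative assumptions.

```python
# Illustrative sketch of assigning accent labels from indirect metadata.
FIRST_LANGUAGE_TO_ACCENT = {
    "Hindi": "Indian",    # example given in the paper
    "French": "French",   # example given in the paper
}

def assign_accent(metadata: dict):
    """Use the explicit accent label if present, else infer it
    from the speaker's first language; None if neither applies."""
    if metadata.get("accent"):
        return metadata["accent"]              # direct label: use as-is
    lang = metadata.get("first_language")
    return FIRST_LANGUAGE_TO_ACCENT.get(lang)  # indirect inference
```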
2.3. Hierarchical Accent Preparation
We now have accent information for each speech sample in the combined dataset. As some accents share characteristics, those accents were grouped together into a hierarchical accent tree using a bottom-up approach. We created a variable called "geolocation_accent" by grouping all accents that originate from a similar geographic location. For example, the Arabic, Farsi, Hebrew, and Kurdish accents are spoken in the Middle East; all of these are grouped together and mapped to the "Middle East" label of the "geolocation_accent" variable. The motivation for this type of geolocation accent is purely the needs of business use cases: most companies operate in a specific geographic location, with an operational unit serving that area. In that scenario it is useful to recognize the speaker's geolocation_accent so that companies can take the necessary action. There are 9 geolocation accent labels: 'Africa', 'Australasia', 'British', 'East Asia', 'Europe', 'Middle East', 'North America', 'South Asia', and 'South East Asia'. For details of this grouping and mapping of accents, please refer to Table 11 in the Appendix.
Further, all geolocation_accent labels whose speakers' first language is English are combined into the "Native" accent category, and the rest, where the first language is other than English, into the "Non-Native" category. A new variable, "binary_accent", comprises the "Native" and "Non-Native" accent labels. Here "North America", "Australasia" and "British" are grouped under the "Native" label of "binary_accent", and the "Africa", "East Asia", "Europe", "Middle East", "South Asia", and "South East Asia" labels are grouped under the "Non-Native" label. We thus prepared a 2-level hierarchical accent structure: at the top is "binary_accent" with its "Native" and "Non-Native" labels, and "geolocation_accent", comprising 9 labels, is at the second level.
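The 2-level hierarchy described above can be written out as explicit mappings. The group memberships are taken directly from the text; the helper name is ours.

```python
# Top level: binary_accent; second level: geolocation_accent (9 labels).
NATIVE_GROUPS = {"North America", "Australasia", "British"}
NON_NATIVE_GROUPS = {
    "Africa", "East Asia", "Europe",
    "Middle East", "South Asia", "South East Asia",
}

def binary_accent(geolocation_accent: str) -> str:
    """Map a 2nd-level geolocation_accent label to its top-level label."""
    if geolocation_accent in NATIVE_GROUPS:
        return "Native"
    if geolocation_accent in NON_NATIVE_GROUPS:
        return "Non-Native"
    raise ValueError(f"unknown geolocation accent: {geolocation_accent}")
```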
The speech file distributions and counts for each label of "binary_accent" and "geolocation_accent" are shown in Table 1 and Table 2.
Table 1: Native and non-native accent distribution in the binary_accent variable

binary_accent   File counts   Distribution percentage   Total duration in seconds
Native          2998          44%                       69259
Non-Native      3802          56%                       11395
Table 2: Distribution of the different geolocation_accent categories

geolocation_accent   File counts   File distribution percentage   Total duration in seconds
Africa               733           10.78                          5788.7
Australasia          998           14.68                          3838.1
British              1000          14.70                          3716.8
East Asia            406           5.97                           3840.4
Europe               978           14.38                          8211.2
Middle East          251           3.69                           27197.1
North America        1000          14.70                          7394.6
South Asia           1000          14.70                          12003.3
South East Asia      434           6.38                           8663.7
2.4. Features Preparation
For accent recognition, we use several prosodic and vocal speech features as well as speaker
embeddings as model inputs. Below we discuss the methods used to create these features.
2.4.1. Speaker Embeddings
We have chosen a freely available pre-trained speaker embedding model built by the Real-
Time-Voice-Cloning [14] team. The model generates a 512-dimensional embedding vector for a
given speech sample, and the embedding vectors remain almost the same for all speech data
from the same speaker. The team implemented the model [15] following the paper [16] and
trained it on 3 datasets: LibriSpeech [17], VoxCeleb1 [18] and VoxCeleb2 [19]. The team has
used this model for their speaker recognition project.
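Embeddings from the same speaker are typically compared by cosine similarity. The numpy sketch below uses random stand-in vectors (not real embeddings) to illustrate why two utterances of one speaker separate cleanly from a different speaker:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 512-dim vectors: a second utterance of the same speaker is
# modelled as the first embedding plus small intra-speaker noise.
rng = np.random.default_rng(0)
speaker_a = rng.normal(size=512)
same_speaker = speaker_a + 0.1 * rng.normal(size=512)
other_speaker = rng.normal(size=512)

same_score = cosine_similarity(speaker_a, same_speaker)     # close to 1
other_score = cosine_similarity(speaker_a, other_speaker)   # close to 0
```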
2.4.2. Speech Features
Initially, we started with more than 20 prosodic and vocal speech features. After some analysis,
we identified 7 prosodic and vocal speech features, namely harmonics-to-noise, pitch mean,
speech rate, number of pauses, speaking ratio, energy, and pitch frequency mean, that show a
high correlation, either positive or negative, with the “geolocation_accent” labels. The heatmap
in Figure 1 shows the correlation value of each feature with the “geolocation_accent” labels;
the correlations were calculated using the Pearson approach. To extract these speech features,
we used the librosa [20], Parselmouth [21] and MyProsody [22] libraries. Table 3 provides
details of each of the 7 speech features: feature name, unit of measurement, description,
computed values, and the library used.
Figure 1: Correlation heatmap among speech features and geolocation accents
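Since the accent labels are categorical, a Pearson correlation like the one behind this heatmap can be computed by one-hot encoding the labels first. A minimal pandas sketch with random stand-in feature values (column names follow Table 3):

```python
import numpy as np
import pandas as pd

# Random stand-in values for two of the speech features and two of the
# geolocation accents.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "harmonics_to_noise": rng.normal(size=100),
    "speech_rate": rng.normal(size=100),
    "geolocation_accent": rng.choice(["British", "South Asia"], size=100),
})

# One-hot encode the accent labels so Pearson correlation is defined,
# then correlate every feature with every label.
labels = pd.get_dummies(df["geolocation_accent"]).astype(float)
features = df.drop(columns="geolocation_accent")
corr = pd.concat([features, labels], axis=1).corr(method="pearson")

# The feature-vs-label block of this matrix is what the heatmap visualises.
heatmap_block = corr.loc[features.columns, labels.columns]
```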
Table 3: Speech feature extraction methods

Feature Name | Unit of Measurement | Description | Computed Values | Library and Function Used
harmonics_to_noise | Decibel (dB) | Ratio between f0 and noise components, which indirectly correlates with perceived aspiration; this may be due to reduced laryngeal muscle tension resulting in a more open, turbulent glottis. | Harmonicity Mean | Parselmouth: sound.to_harmonicity()
pitch_xs_mean | dB | Pitch is a measure of how high or low something sounds and is related to the speed of the vibrations that produce the sound. | Pitch XS Mean | Parselmouth: sound.to_pitch()
speech_rate | Utterances/Second | Number of speech utterances per second over the duration of the speech sample (including pauses). | Speech Rate | MyProsody: myspsr()
n_pauses | Count | Number of pauses taken in a piece of audio. | Number of Pauses | MyProsody: mysppaus()
speaking_ratio | Ratio | Ratio between speaking duration and total duration. | Speaking Ratio | MyProsody: myspbala()
energy | dB | Measured as the mean-squared central difference across frames; may correlate with motor coordination. | Energy | Parselmouth: sound.get_energy()
pitch_frequency_mean | dB | Mean pitch frequency of the fragment, where pitch measures how high or low something sounds. | Pitch Frequency Mean | Parselmouth: sound.to_pitch().selected_array['frequency'].mean()
2.5. Experimental Design
Experiment 1: Binary classification on the target variable “binary_accent”
Exp 1.1: Binary classification between the “Native” and “Non-Native” labels of the target
variable “binary_accent”, using speaker embedding vectors as input features.
Exp 1.2: Binary classification between the “Native” and “Non-Native” labels of the target
variable “binary_accent”, using the combination of speech features and speaker embedding
vectors as input features.
Experiment 2: Multi-class classification on the target variable “geolocation_accent”
Exp 2.1: Multi-class classification over the 9 “geolocation_accent” labels, using speaker
embedding vectors as input features.
Exp 2.2: Multi-class classification over the 9 “geolocation_accent” labels, using the
combination of speech features and speaker embedding vectors as input features.
Experiment 3: Hierarchical classification of the accents using HiClass [23]
Exp 3.1: Hierarchical classification on the hierarchical structure of “binary_accent” and
“geolocation_accent”, using speaker embedding vectors as input features.
Exp 3.2: Hierarchical classification on the hierarchical structure of “binary_accent” and
“geolocation_accent”, using the combination of speaker embedding vectors and speech
features as input features.
2.6. Train and Test Dataset Preparation
For all the experiments, the data was split into train and test sets in a 3:1 ratio, stratified on the
target variable “binary_accent” so that the Native and Non-Native accents remain in the same
ratio in both the train and test sets. The dataset is further stratified on the “geolocation_accent”
classes, so all geolocation accents are distributed in the same ratio between the train and test
sets. The accent file count distribution is shown in Table 4.
Table 4: Accent file count distribution in the train and test sets

dataset | Australasia | British | North America | Africa | East Asia | Europe | Middle East | South Asia | South East Asia | Native total | Non-Native total
train   | 748         | 750     | 750           | 550    | 305       | 733    | 188         | 750        | 325             | 2248         | 2852
test    | 250         | 250     | 250           | 183    | 101       | 245    | 63          | 250        | 109             | 750          | 950
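A minimal sketch of this stratified 3:1 split with scikit-learn's train_test_split (random stand-in data; stratifying on the geolocation label also preserves the binary ratio, since each geolocation accent maps to exactly one binary label):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins: 400 embedding vectors and their geolocation labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 512))
geo = rng.choice(["British", "Europe", "South Asia", "Africa"], size=400)

# 3:1 split, stratified on the geolocation label so each accent keeps
# the same train/test ratio.
X_train, X_test, geo_train, geo_test = train_test_split(
    X, geo, test_size=0.25, stratify=geo, random_state=1)
```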
3. EXPERIMENTS AND RESULTS
3.1. Experiment 1
The experiment is carried out with “binary_accent” as the target variable, whose classes are
“Native” and “Non-Native”.
Model: In the first experiment, we used a multi-layer perceptron for binary classification. For
this purpose we used the scikit-learn MLPClassifier with 1 hidden layer of 100 nodes, ReLU as
the activation function, Adam as the optimizer, and log-loss as the loss function.
Features: In Exp 1.1, the input features are the speaker embedding vectors described in the
Features Preparation section; in Exp 1.2, the input features are prepared by appending the 7
speech features to the speaker embedding vectors.
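A minimal sketch of this classifier setup with the reported hyperparameters (random stand-in features, not the real embeddings):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random stand-ins for the 512-dim speaker embeddings and binary labels.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 512))
y_train = rng.choice(["Native", "Non-Native"], size=200)

# One hidden layer of 100 nodes, ReLU activation, Adam optimizer;
# MLPClassifier minimises log-loss by default.
clf = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", random_state=1, max_iter=500, shuffle=True)
clf.fit(X_train, y_train)
pred = clf.predict(X_train)
```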
Result: The optimum model hyperparameters found through the experiments are listed in
Table 5. Accuracy is calculated on the test dataset prepared as described in Section 2.6. The
results of the 2 experiments show that including the 7 prosodic and vocal speech features
along with the embedding vectors gives better accuracy. The reported result is not the overall
prediction accuracy; instead, the accuracy of each class is computed and then averaged over
the Native and Non-Native classes. The individual accuracies in Table 6 show that the Non-
Native accent prediction accuracy improves when the 7 speech features are added to the
model input.
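This per-class averaging is equivalent to macro-averaged recall, i.e. scikit-learn's balanced_accuracy_score; a small worked example:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array(["Native", "Native", "Native", "Non-Native"])
y_pred = np.array(["Native", "Native", "Non-Native", "Non-Native"])

# Accuracy computed per class, then averaged over the classes.
per_class = {}
for cls in np.unique(y_true):
    mask = y_true == cls
    per_class[cls] = float((y_pred[mask] == cls).mean())

avg_accuracy = sum(per_class.values()) / len(per_class)  # (2/3 + 1) / 2
```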
Table 5: Overall accuracy when the classification model is trained on embedding vectors alone and
combined with speech features

Experiment | Target Variable | Classes | Features | Classification model | Model hyperparameters | Accuracy | F1 score
Exp 1.1 | binary_accent | Native, Non-Native | Speaker embedding | MLPClassifier | random_state=1, max_iter=500, shuffle=True | 90.71% | 91
Exp 1.2 | binary_accent | Native, Non-Native | Speaker embedding + speech features | MLPClassifier | random_state=1, max_iter=500, shuffle=True | 92.5% | 93
Table 6: Individual native and non-native accent accuracy in percentage
binary_accent Accuracy of Exp 1.1 Accuracy of Exp 1.2
Native 90.3% 90.8%
Non-Native 91.1% 94.2%
3.2. Experiment 2
This experiment is carried out with “geolocation_accent” as the target variable, which has 9
accent classes.
Model: The model remains the same as for the binary classification, the only difference being
the output layer, which has 9 nodes.
Features: Exp 2.1 is carried out using the embedding vectors as input features; in Exp 2.2, the
speech features are appended to the embedding vectors to create a combined input feature.
Results: The model hyperparameters remain the same as in Experiment 1. Here again, the
final accuracy is the average of the per-class accuracies on the test dataset. The individual
accuracy of each class is shown in Table 8 and the overall accuracy in Table 7.
Table 7: Overall accuracy when the classification model is trained on embedding vectors alone and
combined with speech features

Experiment | Target Variable | Classes | Features | Classification model | Model hyperparameters | Accuracy | F1 score
Exp 2.1 | geolocation_accent | 9 classes | Speaker embedding | MLPClassifier | hidden layers=1, hidden nodes=100, random_state=1, max_iter=500, shuffle=True | 60.24% | 60
Exp 2.2 | geolocation_accent | 9 classes | Speaker embedding + speech features | MLPClassifier | hidden layers=1, hidden nodes=100, random_state=1, max_iter=500, shuffle=True | 62.14% | 62.78
Table 8: Individual accuracy for each accent in geolocation_accent
geolocation_accent Accuracy of Exp 2.1 Accuracy of Exp 2.2
Africa 63% 62%
Australasia 81% 82%
British 67% 74%
East Asia 34% 40%
Europe 61% 76%
Middle East 25% 17%
North America 70% 75%
South Asia 84% 86%
South East Asia 57% 47%
3.3. Experiment 3
Next, we experimented with hierarchical classification on both the “binary_accent” and
“geolocation_accent” variables.
Model: For the hierarchical classification, we used the HiClass [23] library. HiClass is
compatible with scikit-learn [24] and supports various hierarchical classifiers, such as Local
Classifier per Parent Node, Local Classifier per Node, and Local Classifier per Level. In this
experiment, the Local Classifier per Parent Node approach is chosen because it trains a
multi-class classifier for each parent node in the hierarchy, which matches the current problem
statement exactly: first classify a speech sample as Native or Non-Native, and then further
classify it into the Native or the Non-Native geolocation accents. HiClass needs a base
classification model that is compatible with scikit-learn; to keep all the experiments
comparable, we chose the MLPClassifier as the base model.
Features: The features remain the same as in the previous experiments. Exp 3.1 is done with
speaker embedding vectors as input features, and in Exp 3.2 the 7 speech
features are combined with the speaker embedding vectors.
Results: The base MLPClassifier’s hyperparameters remain the same as in the previous
experiments; for the hierarchical classification model, the default hyperparameters are used.
Here also, appending the speech features gives overall better accuracy. In this experiment too,
the Non-Native accent and the non-native geolocation accent recognition accuracies improve
after appending the 7 speech features to the embedding vectors; the only exception is the
“Middle East” accent, whose accuracy decreases. The individual accent accuracies are shown
in Table 10 and the overall accuracies in Table 9 for the Native, Non-Native and geolocation
accents.
Table 9: Overall accuracy when hierarchical classification model trained on embedding vectors and
combined with speech features
Table 10: Individual accuracy from hierarchical classification model for each accent in binary_accent and
geolocation_accent
binary_accent Accuracy of Exp 3.1 Accuracy of Exp 3.2
Native 90.3% 90.8%
Non-Native 91.1% 94.2%
geolocation_accent
Africa 65% 67%
Australasia 87% 84%
British 76% 67%
East Asia 42% 46%
Europe 53% 72%
Middle East 35% 24%
North America 74% 72%
South Asia 82% 85%
South East Asia 51% 52%
4. DISCUSSION
In the case of the “Native” vs. “Non-Native” binary classification, the flat classification of
Experiment 1 and the hierarchical classification of Experiment 3 give the same result. But in
the case of the “geolocation_accent” classification, the hierarchical classification of
Experiment 3 gives better results than the flat classification of Experiment 2. The results above
show that the accuracies for “East Asia”, “Middle East” and “South East Asia” are lower than
for the other classes, which we believe is due to the low representation of those accents in the
training dataset. Another interesting finding is that, after including the selected speech features,
model performance improves for the non-native geolocation accents, with the exception of
“Middle East”.
5. CONCLUSION AND FUTURE WORK
In this work we have demonstrated four key points. First, using embedding vectors we can get
decent accuracy for accent recognition across a wide variety of accents. Second, combining
prosodic and vocal speech features with the speaker embedding vectors improves the
classification results. Third, instead of using 2 separate flat classification models for 2-level
classification, it is better to use a single hierarchical classification model. Fourth, our selected
7 prosodic and vocal speech features improve non-native accent recognition.
The training dataset for each geolocation accent is very small, and we believe that due to this
lack of data the overall accuracy is not as high as it could be. In the future, we will include
more training data, as well as better hierarchical and flat classification models, to improve the
overall accuracy. Our next focus area will be to extend the hierarchy further down to regional
dialects.
Appendix
The accent mapping for the binary_accent and geolocation_accent variables from the original
data is shown in Table 11.
Table 11: Accent mapping with geolocation accents and native/non-native accents

Binary Accent | Geolocation Accent | Original Accent Labels
Native        | North America      | Canadian, American, New England, North Midland, South Midland
Native        | British            | Irish, Welsh, British, Scottish
Native        | Australasia        | Australian, New Zealand
Non-Native    | South Asia         | Indian, Nepali, Sinhala, Burmese
Non-Native    | Africa             | Hausa, African, Amharic, Kiswahili, South African
Non-Native    | Europe             | French, Spanish, Russian, German, Portuguese
Non-Native    | South East Asia    | Malaysia, Singapore, Philippines, Vietnamese
Non-Native    | East Asia          | Korean, Japanese, Mandarin, Hongkong, Cantonese
Non-Native    | Middle East        | Farsi, Arabic, Kurdish, Hebrew
REFERENCES
[1] C. Shih, "Speech Accent Classification," Project Report, Stanford University, Stanford, CA, 2017.
[2] Y. e. e. Singh, "Features of Speech Audio for Accent Recognition," 2020 International Conference on
Artificial Intelligence, IEEE, 2020.
[3] A. Faria, "Accent Classification for Speech Recognition," International Computer Science Institute,
Berkeley CA 94704, USA, p. 285–293, 2005.
[4] L. M. A. Sheng, "Deep Learning Approach to Accent Classification," Project Report, Stanford
University, Stanford, CA, 2017.
[5] M. &. R. M. Najafian, "Automatic accent identification as an analytical tool for accent robust
automatic speech recognition.," Speech Communication, 2020.
[6] K. C. S. &. M. L. Deng, "Improving accent identification and accented speech recognition under a
framework of self-supervised learning," arXiv preprint arXiv:2109.07349, 2021.
[7] K. Sreenivasa Rao, "Transfer Accent Identification Learning for Enhancing Speech Emotion
Recognition".
[8] "Kaggle CV," [Online]. Available: https://www.kaggle.com/datasets/mozillaorg/common-voice.
[9] "GMU accent," [Online]. Available:
https://accent.gmu.edu/browse_language.php?function=find&language=english.
[10] J. Yamagishi, "VCTK," [Online]. Available: https://datashare.ed.ac.uk/handle/10283/3443.
[11] S. B. V. D. G. S. K. P. Kalluri, "NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling,"
In ICASSP 2021-2021 IEEE International Conference on Acoustics, pp. 6953–6957, IEEE, 2021.
[12] J. S. Garofolo, "TIMIT," 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93s1.
[13] A. Bera and V. V. Kandasamy, "Improving Robustness of Age and Gender Prediction Based on
Custom Speech Data," in AIRCC, 2022.
[14] "voice cloning," 2021. [Online]. Available: https://github.com/CorentinJ/Real-Time-Voice-Cloning.
[15] "speaker encoder," 2021. [Online]. Available: https://github.com/CorentinJ/Real-Time-Voice-
Cloning/tree/master/encoder.
[16] W. Q. R. A. M. I. L. Wan L., "Generalized End-to-End Loss for Speaker Verification," Retrieved
from https://arxiv.org/pdf/1710.10467.pdf, 2020.
[17] C. G. P. D. K. S. Panayotov V., "Librispeech: An ASR Corpus Based on Public Domain Audio
Books," ICASSP, 2015.
[18] C. J. S. X. W. Z. A. Nagrani A., "VoxCeleb: Large-scale speaker verification in the wild," Computer
Science and Language, 2019.
[19] N. S. Z. A. Chung J. S., "VoxCeleb2: Deep Speaker Recognition," INTERSPEECH, 2018.
[20] "librosa," [Online]. Available: https://librosa.org.
[21] "praat," [Online]. Available: https://parselmouth.readthedocs.io/.
[22] S. Shahab, "prosody," [Online]. Available: https://shahabks.github.io/myprosody/.
[23] K. N. R. B. Y. Miranda F. M., "HiClass: a Python library for local hierarchical classification
compatible with scikit-learn," Retrieved from https://arxiv.org/pdf/2112.06560.pdf, 2022.
[24] "scikit-learn," [Online]. Available: https://scikit-learn.org/stable/.
[25] A. T. a. S. U. K. Chakraborty, "Voice Recognition Using MFCC Algorithm," International Journal of
Innovative Research in Advanced Engineering, vol. 1, no. 10, p. 4, 2014.
AUTHORS
Anup Bera holds an M.Tech from IIT Kharagpur, has 17 years of industry experience,
and is currently working at Accenture India.
Aanchal Agarwal completed her Master's in Computer Science in 2020 and is currently
working as a Machine Learning expert at Accenture India.