This document presents a new framework that combines K-nearest neighbors (KNN) and decision trees (DT) for speech identification through emphatic letters in the Moroccan dialect of Arabic. The framework first uses KNN and DT individually to predict the gender of the speaker and the emphatic letter and diacritic pronounced. It then uses these predictions as additional features to improve the overall prediction of the sound content, achieving an accuracy of 71.43%, a 12.1% improvement over applying the classifiers directly. The study examines 720 speech samples from 12 speakers and evaluates the performance of hidden Markov models, DT, and KNN applied individually, finding that KNN best recognizes diacritics while DT performs best for gender recognition.
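The two-stage idea described above (first-stage predictions appended as extra features for the final classifier) can be sketched in a few lines. This is a minimal pure-Python illustration with invented toy vectors; a plain 1-NN stands in for both KNN and DT, and none of the data values come from the paper:

```python
# Toy sketch of the two-stage idea: a first-stage classifier's
# prediction is appended as an extra feature for the second stage.
# A plain 1-NN stands in for both KNN and DT; all vectors are invented.

def nearest_label(train, x):
    """Return the label of the training example nearest to x (1-NN)."""
    best = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
    return best[1]

# Hypothetical first-stage task: predict speaker gender from two features.
gender_train = [((1.0, 0.2), "m"), ((0.9, 0.3), "m"),
                ((0.1, 0.8), "f"), ((0.2, 0.9), "f")]

def augment(x):
    """Append the predicted gender (0.0 = m, 1.0 = f) as a new feature."""
    return x + ((0.0 if nearest_label(gender_train, x) == "m" else 1.0),)

# Second-stage task: predict the letter class on the augmented vectors.
letter_train = [(augment((1.0, 0.2)), "sad"), (augment((0.1, 0.8)), "dad")]

print(nearest_label(letter_train, augment((0.95, 0.25))))  # prints "sad"
```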
High Quality Arabic Concatenative Speech Synthesis (sipij)
This paper describes the implementation of TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) tools to improve the quality of an Arabic text-to-speech (TTS) system. The system is based on diphone concatenation with a TD-PSOLA modification synthesizer. The paper describes techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA method, which decomposes the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modifications of the speech signal, together with adjustment and optimisation of the diphone database with integrated vowels.
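The core PSOLA operation can be sketched as follows: extract Hann-windowed frames of two pitch periods centered at each pitch mark, then overlap-add them at re-spaced mark positions to modify the pitch. This is a simplified pure-Python sketch assuming a constant pitch period, not the paper's implementation:

```python
import math

def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def psola(signal, marks, factor):
    """Overlap-add two-period Hann frames taken at the pitch marks,
    re-spaced by 1/factor (factor > 1 raises the pitch).
    Assumes a constant pitch period, for simplicity."""
    period = marks[1] - marks[0]
    win = hann(2 * period + 1)
    out = [0.0] * len(signal)
    for k, m in enumerate(marks):
        new_m = int(round(marks[0] + k * period / factor))  # re-spaced mark
        for i in range(-period, period + 1):
            src, dst = m + i, new_m + i
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += signal[src] * win[i + period]
    return out
```

With `factor = 1.0` the Hann windows at a one-period hop satisfy the constant-overlap-add property, so the interior of the signal is reconstructed unchanged; this is one reason accurate pitch marks matter.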
Hybrid Phonemic and Graphemic Modeling for Arabic Speech Recognition (Waqas Tariq)
The document summarizes a study that proposes a hybrid approach for acoustic and pronunciation modeling in Arabic speech recognition. It combines phonemic and graphemic modeling techniques. Two baseline speech recognition systems were built using phonemic and graphemic acoustic models. These models were then fused into a hybrid acoustic model. Different hybrid techniques for pronunciation modeling were also proposed and evaluated on a broadcast news speech corpus, showing error rate reductions of 8.8-12.6% over the baselines. The hybrid approach aims to benefit from both vocalized and non-vocalized Arabic resources.
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR (ijcseit)
The proposed system is a syllable-based Myanmar speech recognition system. There are three stages: feature extraction, phone recognition, and decoding. In feature extraction, the system transforms the input speech waveform into a sequence of acoustic feature vectors, each vector representing the information in a small time window of the signal. The likelihood of the observed feature vectors given linguistic units (words, phones, subparts of phones) is then computed in the phone recognition stage. Finally, the decoding stage takes the acoustic model (AM), which consists of this sequence of acoustic likelihoods, plus a phonetic dictionary of word pronunciations, combined with the language model (LM). The system produces the most likely sequence of words as the output. The language model for Myanmar is created using syllable segmentation and a syllable-based n-gram method.
This document summarizes a research paper on developing a syllable-based speech recognition system for the Myanmar language. The proposed system has three main components: feature extraction, phone recognition, and decoding. Feature extraction transforms speech into acoustic feature vectors. Phone recognition computes likelihoods of acoustic observations given linguistic units like phones. Decoding uses acoustic and language models to find the most likely sequence of words. The paper discusses building acoustic and language models for Myanmar. The acoustic model is trained using Hidden Markov Models and Gaussian mixture models. The language model is an n-gram model built using syllable segmentation of text. Developing the first speech recognition system for Myanmar poses technical challenges due to its tonal syllabic structure.
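The syllable-based n-gram language model mentioned above can be illustrated with a toy bigram trainer; the romanized syllables below are invented purely for illustration:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count syllable bigrams (with sentence markers) and normalize the
    counts into conditional probabilities P(next | current)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        sylls = ["<s>"] + sent + ["</s>"]
        for a, b in zip(sylls, sylls[1:]):
            counts[a][b] += 1
    lm = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        lm[a] = {b: n / total for b, n in nxt.items()}
    return lm

# Invented romanized syllable sequences, purely for illustration.
lm = train_bigram([["min", "ga", "la", "ba"], ["min", "ga", "la"]])
print(lm["la"])  # {'ba': 0.5, '</s>': 0.5}
```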
Effect of Dynamic Time Warping on Alignment of Phrases and Phonemes (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons. Dynamic time warping (DTW) is widely used for aligning two given multidimensional sequences: it finds an optimal match between them, and the distance between the aligned sequences should be smaller than between the unaligned sequences. The improvement in the alignment may be estimated from the corresponding distances. This technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was segmented manually and aligned at the sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than a 20% reduction in the average Mahalanobis distances.
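The DTW computation itself is a small dynamic program. A minimal sketch over one-dimensional sequences follows (the paper applies it to multidimensional HNM parameter frames, where `dist` would be a vector distance):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the optimal monotonic alignment between sequences a and b."""
    INF = float("inf")
    D = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # extend the cheapest of the three admissible predecessor paths
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 aligns at no cost
```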
Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (TELKOMNIKA JOURNAL)
This paper proposes the combined methods of the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected feature vector of Indonesian syllables. The research aims to find the most effective and efficient properties for performing feature extraction of each syllable sound, to be applied in speech recognition systems. The proposed approach, which builds on a previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in further research, along with recommendations on the most effective and efficient WT to use in performing syllable sound recognition.
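As a hedged illustration of the two ingredients named above, the sketch below shows one level of the simplest wavelet (the Haar wavelet) plus a Euclidean distance; the paper evaluates several wavelet families, and this toy code is not its implementation:

```python
def haar_step(x):
    """One level of the Haar wavelet transform: pairwise averages
    (approximation) and pairwise half-differences (detail)."""
    approx = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return approx, detail

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

approx, detail = haar_step([4, 2, 5, 5])
print(approx, detail)  # [3.0, 5.0] [1.0, 0.0]
```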
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ... (ijma)
This document discusses performance analysis of different acoustic features for Bangla speech recognition using LSTM neural networks. It develops a Bangla speech corpus and extracts linear predictive coding (LPC), Mel frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) acoustic features from the corpus. The features are then used to train LSTM models for Bangla speech recognition and their performance is evaluated based on sentence correct rates on test data sets consisting of male and female speakers.
An expert system for automatic reading of a text written in Standard Arabic (ijnlc)
In this work we present our expert system for automatic reading (speech synthesis) of text written in Standard Arabic. The work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (text-to-speech, TTS). This transformation is done first by a phonetic-orthographic transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and second by the generation of the voice signal that corresponds to the transcribed chain. We lay out the different design stages of the system, as well as the results obtained compared with other studied works, to realize TTS based on Standard Arabic.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ... (ijma)
In this work a new Bangla speech corpus along with proper transcriptions has been developed; in addition, various acoustic feature extraction methods have been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system. The acoustic features are a sequence of representative vectors extracted from the speech signals, and the classes are either words or sub-word units such as phonemes. The most commonly used feature extraction method, linear predictive coding (LPC), has been applied first in this work. Then two other popular methods, the Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have also been applied; these methods are based on models of the human auditory system. A detailed review of the implementation of these methods is given first, and then the implementation steps are elaborated for the development of an automatic speech recognition (ASR) system for Bangla speech.
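Of the three feature types, LPC is the easiest to illustrate compactly. Below is a minimal sketch of LPC coefficient estimation via the autocorrelation method and the Levinson-Durbin recursion; it is a textbook sketch, not the paper's exact configuration:

```python
def autocorr(x, order):
    """Autocorrelation lags r[0..order] of the signal x."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(order + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion: coefficients a[1..order] of the
    predictor x_hat[n] = sum_k a[k] * x[n - k]."""
    r = autocorr(x, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]  # reflect previous coefficients
        a = new_a
        err *= 1 - k * k  # remaining prediction error
    return a[1:]

# A deterministic AR(1)-style signal: x[n] = 0.9 * x[n-1].
x = [1.0]
for _ in range(199):
    x.append(0.9 * x[-1])
print(lpc(x, 1))  # recovers a coefficient very close to 0.9
```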
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI... (ijnlc)
Researchers of many nations have developed automatic speech recognition (ASR) to show their national improvement in information and communication technology for their languages. This work intends to improve ASR performance for the Myanmar language by changing different Convolutional Neural Network (CNN) hyperparameters, such as the number of feature maps and the pooling size. A CNN can reduce spectral variations and model the spectral correlations that exist in the signal, owing to its locality and pooling operations. Therefore, the impact of these hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hour data set is used as training data, and ASR performance was evaluated on two open test sets: web news and recorded data. As Myanmar is a syllable-timed language, a syllable-based ASR was built and compared with a word-based ASR. As a result, the system achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
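The WER and SER figures above are both edit-distance-based error rates; a minimal sketch of the standard computation (applied to words here, but identical for syllables) is:

```python
def wer(ref, hyp):
    """Word error rate: minimum substitutions + insertions + deletions
    needed to turn hyp into ref, divided by the reference length."""
    r, h = ref.split(), hyp.split()
    D = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        D[i][0] = i
    for j in range(len(h) + 1):
        D[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + sub)
    return D[len(r)][len(h)] / len(r)

print(wer("a b c d", "a x c"))  # 0.5: one substitution and one deletion
```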
Arabic digits speech recognition and speaker identification in noisy environm... (TELKOMNIKA JOURNAL)
This paper presents automatic speaker identification and speech recognition for Arabic digits in a noisy environment. The proposed system is able to identify a speaker after saving his voice in the database and adding noise. Mel frequency cepstral coefficients (MFCC) are the feature extraction approach used in a program built on the Matlab platform, and vector quantization (VQ) is used for generating the codebooks. Gaussian mixture modelling (GMM) algorithms are used to generate templates for feature matching. The paper proposes a system based on the MFCC-GMM and MFCC-VQ approaches on the one hand, and on the hybrid MFCC-VQ-GMM approach on the other, for speaker modeling. White Gaussian noise is added to the clean speech at several signal-to-noise ratio (SNR) levels to test the system in a noisy environment. The proposed system gives good recognition rates.
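The VQ side of such a pipeline can be illustrated with a toy codebook-distortion classifier. The codebooks and feature values below are invented for illustration; a real system would train the codebooks from MFCC frames:

```python
def distortion(frames, codebook):
    """Total squared distance from each frame to its nearest codeword."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(cw, f))
                   for cw in codebook)
               for f in frames)

def identify(frames, speaker_codebooks):
    """Pick the speaker whose codebook yields the lowest distortion."""
    return min(speaker_codebooks,
               key=lambda s: distortion(frames, speaker_codebooks[s]))

# Invented 2-D codebooks; real ones would be trained from MFCC frames.
books = {"spk1": [(0.0, 0.0), (1.0, 1.0)],
         "spk2": [(5.0, 5.0), (6.0, 6.0)]}
print(identify([(0.9, 1.1), (0.1, 0.2)], books))  # prints "spk1"
```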
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS (cscpconf)
Human beings generate different speech waveforms when speaking the same word at different times. Moreover, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages like English, or to build pronunciation dictionaries for lesser-known, sparsely resourced languages. The precision measurement experiments show 88.9% accuracy.
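Since DPW rests on dynamic-programming global alignment of phone sequences, a minimal sketch is given below. The normalization by the longer sequence length is a hypothetical choice for illustration; the paper's exact scoring may differ:

```python
def dpw(p, q):
    """Global alignment cost between two phone sequences (unit costs for
    substitution, insertion, deletion), normalized by the longer length
    so that 0.0 means identical pronunciations."""
    D = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        D[i][0] = i
    for j in range(len(q) + 1):
        D[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            sub = 0 if p[i - 1] == q[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                          D[i - 1][j - 1] + sub)
    return D[len(p)][len(q)] / max(len(p), len(q))

# Two pronunciations of "tomato" differing in one vowel.
d = dpw(["t", "ah", "m", "ey", "t", "ow"], ["t", "ah", "m", "aa", "t", "ow"])
print(round(d, 3))  # 0.167
```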
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC... (kevig)
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signals. HNM analysis and synthesis provide high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences: it locates an optimal match between them, and the improvement in the alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded. The recorded material was segmented manually and aligned at the sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
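Under a diagonal covariance assumption, the Mahalanobis distance used in these alignment studies reduces to a variance-weighted Euclidean distance; a minimal sketch is:

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance assuming a diagonal covariance matrix,
    i.e. a per-dimension variance-weighted Euclidean distance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

# A deviation of 2 in a dimension with variance 4 counts the same as a
# deviation of 1 in a unit-variance dimension.
print(mahalanobis_diag((2.0, 0.0), (0.0, 0.0), (4.0, 1.0)))  # 1.0
```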
Accents of English have been investigated for many years, both from the perspective of native and non-native speakers of the language. Various research results imply that non-native speakers of English produce certain speech characteristics that are uncommon in native speakers' speech, because non-native speakers do not produce the same tongue movements as native speakers. This paper presents an isolated English word recognition system devised with the speech of local Bangladeshi people, who are non-native speakers of English. Here, we have also noticed a speech characteristic that is not present in the speech of native English speakers. Two acoustic features, pitch and formants, have been utilized to develop the system. The system is speaker-independent and based on a template-matching approach. The recognition method applied here is very simple, and the recognition accuracy is very satisfactory.
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK (ijistjournal)
This paper describes a Hidden Markov Model-based Punjabi text-to-speech synthesis system (HTS), in which the speech waveform is generated from Hidden Markov Models themselves, and applies it to Punjabi speech synthesis using the general speech synthesis architecture of HTK (the HMM Tool Kit). This Hidden Markov Model-based TTS can be used in mobile phones for a stored phone directory or messages. Text messages and the caller's identity in English are mapped to tokens in Punjabi, which are then concatenated to form speech following certain rules and procedures.
To build the synthesizer we recorded the speech database and phonetically segmented it, first extracting context-independent monophones and then context-dependent triphones. For example, for the word bharat the monophones are a, bh, t, etc., and a triphone is bh-a+r. These speech utterances and their phone-level transcriptions (monophones and triphones) are the inputs to the speech synthesis system. The system outputs the sequence of phonemes after resolving various ambiguities in phoneme selection using word network files; for example, for the word Tapas the output phoneme sequence is ਤ,ਪ,ਸ instead of ਟ,ਪ,ਸ.
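The monophone-to-triphone expansion in HTK's left-center+right notation, as in the bh-a+r example above, can be sketched directly:

```python
def triphones(phones):
    """Expand a monophone sequence into context-dependent labels in
    HTK's left-center+right notation (e.g. bh-a+r)."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""           # left context
        right = "+" + phones[i + 1] if i + 1 < len(phones) else ""  # right context
        out.append(left + p + right)
    return out

print(triphones(["bh", "a", "r"]))  # ['bh+a', 'bh-a+r', 'a-r']
```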
Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Abstract: In text-to-speech systems, words are usually broken into parts, and the recorded sound of each part is used to play the word. This paper uses the silences in a word's pronunciation to improve speech quality. Most algorithms divide words into syllables, and some divide words into phonemes; this paper instead exploits silences in the intonation, dividing words at silent regions and then assigning the equivalent sound of each part, so that joining the parts is reliable and the resulting speech is smoother. The paper concerns the Persian language, but the method is extendable to other languages. The method has been tested with a MOS test, and intelligibility, naturalness, and fluidity are all improved.
Keywords: TTS, SBS, Syllable, Diphone.
SPEAKER VERIFICATION USING ACOUSTIC AND PROSODIC FEATURES
In this paper we report an experiment carried out on a recently collected speaker recognition database, the Arunachali Language Speech Database (ALS-DB), to make a comparative study of the performance of acoustic and prosodic features for the speaker verification task. The speech database consists of speech data recorded from 200 speakers with Arunachali languages of North-East India as their mother tongue. The collected database is evaluated using a Gaussian mixture model-Universal Background Model (GMM-UBM) based speaker verification system. The acoustic feature considered in the present study is Mel-Frequency Cepstral Coefficients (MFCC) along with its derivatives. The performance of the system has been evaluated for the acoustic and prosodic features both individually and in combination. It has been observed that the acoustic feature, when considered individually, provides better performance than the prosodic features. However, when prosodic features are combined with the acoustic feature, the system outperforms both systems in which the features are considered individually. There is nearly a 5% improvement in recognition accuracy over the system where acoustic features are considered individually, and nearly a 20% improvement over the system where only prosodic features are considered.
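A common way to combine the two streams is score-level fusion: each verification trial receives one acoustic score (e.g. a GMM-UBM log-likelihood ratio) and one prosodic score, and a weighted sum decides acceptance. The sketch below illustrates the idea only; the function names, the 0.8 weight, and the zero threshold are illustrative assumptions, not values from the paper.

```python
def fuse_scores(acoustic, prosodic, alpha=0.8):
    """Weighted linear fusion of per-trial verification scores.

    alpha weights the (typically stronger) acoustic stream; 0.8 is
    an arbitrary illustrative choice, not tuned on any data.
    """
    return [alpha * a + (1 - alpha) * p for a, p in zip(acoustic, prosodic)]

def verify(scores, threshold=0.0):
    """Accept a trial when its fused score exceeds the threshold."""
    return [s > threshold for s in scores]

acoustic = [1.2, -0.4, 0.9]   # e.g. GMM-UBM log-likelihood ratios
prosodic = [0.3, -1.0, -0.2]  # scores from a prosodic subsystem
fused = fuse_scores(acoustic, prosodic)
print(verify(fused))  # [True, False, True]
```

In practice the streams are usually calibrated (e.g. normalized to comparable score ranges) before fusing, and the weight is chosen on held-out data.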
Title: Segmentation Words for Speech Synthesis in Persian Language Based On Silence
Author: Sohrab Hojjatkhah, Ali Jowharpour
International Journal of Recent Research in Mathematics Computer Science and Information Technology (IJRRMCSIT)
Paper Publications
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS
Recently, many researchers have focused on building and improving speech recognition systems to facilitate and enhance human-computer interaction. Today, Automatic Speech Recognition (ASR) systems have become important and common tools, from games to translation systems, robots, and so on. However, there is still a need for research on speech recognition systems for low-resource languages. This article deals with isolated word recognition for the Dari language, using the Mel-frequency cepstral coefficients (MFCCs) feature extraction method and three different deep neural networks: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Multilayer Perceptron (MLP). We evaluate our models on our purpose-built isolated Dari words corpus, which consists of 1000 utterances for 20 short Dari terms. This study obtained an average accuracy of 98.365%.
Development of depth map from stereo images using sum of absolute differences...
This article proposes a framework for depth map reconstruction using stereo images. Fundamentally, this map provides important information commonly used in essential applications such as autonomous vehicle navigation, drone navigation, and 3D surface reconstruction. To develop an accurate depth map, the framework must be robust against the challenging regions of low texture, plain color, and repetitive pattern in the input stereo image. The development of this map requires several stages, starting with the matching cost calculation, followed by cost aggregation, optimization, and refinement. Hence, this work develops a framework combining the sum of absolute differences (SAD) with two edge-preserving filters to increase robustness against the challenging regions. The SAD is convolved using a block matching technique to increase the efficiency of the matching process in low-texture and plain-color regions, while the two edge-preserving filters increase accuracy in repetitive-pattern regions. The results, obtained on the Middlebury standard dataset, show that the proposed method is accurate and capable of working with the challenging regions. The framework is also efficient and can be applied to 3D surface reconstruction. Moreover, this work is highly competitive with previously available methods.
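The matching-cost stage described above can be illustrated with a minimal SAD block-matching sketch: for each pixel in the left image, slide a small block along the same row of the right image and keep the horizontal shift (disparity) with the lowest sum of absolute differences. This is a bare illustration of the cost calculation only, without the aggregation, optimization, and refinement stages the paper adds; the block size and search range are arbitrary assumptions, and inputs are assumed to be float arrays.

```python
import numpy as np

def sad_disparity(left, right, block=3, max_disp=4):
    """Minimal SAD block matching on a rectified stereo pair.

    For each interior pixel of the left image, test horizontal shifts
    d = 0..max_disp into the right image and keep the shift whose
    block-wise sum of absolute differences is smallest.
    """
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.int64)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1]
            best_sad, best_d = None, 0
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                sad = np.abs(ref - cand).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_d = sad, d
            disp[y, x] = best_d
    return disp

# Synthetic pair: the left view is the right view shifted by 2 pixels,
# so the recovered disparity in the interior should be 2.
left = np.tile(np.arange(10.0) - 2.0, (6, 1))
right = np.tile(np.arange(10.0), (6, 1))
print(sad_disparity(left, right)[3, 5])  # 2
```

The quadruple loop is deliberately naive for readability; real implementations vectorize over shifts and aggregate costs over neighborhoods before picking the minimum.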
Model predictive controller for a retrofitted heat exchanger temperature control
This paper aims to demonstrate the practical aspects of process control theory for undergraduate students at the Department of Chemical Engineering at the University of Bahrain. Both the ubiquitous proportional integral derivative (PID) controller and model predictive control (MPC), together with their auxiliaries, were designed and implemented in a real-time framework. The latter was realized by retrofitting an existing plate-and-frame heat exchanger unit that had been operated using an analog PID temperature controller. The upgraded control system consists of a personal computer (PC), a low-cost signal conditioning circuit, a National Instruments USB 6008 data acquisition card, and LabVIEW software. LabVIEW control design and simulation modules were used to design and implement the PID and MPC controllers. The performance of the designed controllers was evaluated while controlling the outlet temperature of the retrofitted plate-and-frame heat exchanger. The distinguishing feature of the MPC controller, its handling of input and output constraints, was observed in real time. From a pedagogical point of view, realizing the theory of process control through practical implementation was substantial in enhancing the students' learning and the instructor's teaching experience.
Similar to A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type
of communication is valuable when our hands and eyes are busy in some other task such as driving a
vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used
for aligning two given multidimensional sequences. It finds an optimal match between the given sequences.
The distance between the aligned sequences should be relatively lesser as compared to unaligned
sequences. The improvement in the alignment may be estimated from the corresponding distances. This
technique has applications in speech recognition, speech synthesis, and speaker transformation. The
objective of this research is to investigate the amount of improvement in the alignment corresponding to the
sentence based and phoneme based manually aligned phrases. The speech signals in the form of twenty five
phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was
segmented manually and aligned at sentence and phoneme level. The aligned sentences of different speaker
pairs were analyzed using HNM and the HNM parameters were further aligned at frame level using DTW.
Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than
20 % reduction in the average Mahalanobis distances.
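For readers unfamiliar with DTW, a minimal sketch of the alignment recurrence is shown below. It aligns two one-dimensional sequences with an absolute-difference local cost; the papers above apply the same recurrence to multidimensional HNM parameter vectors with a Mahalanobis local distance, which this sketch does not reproduce.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping between two 1-D sequences.

    Returns the accumulated cost of the optimal monotonic alignment
    path through the local-cost matrix.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])           # local cost
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

# A time-stretched copy of a sequence aligns perfectly (distance 0),
# which is exactly why DTW distances drop after good alignment.
a = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0, 2.0, 1.0, 0.0]  # same shape, stretched
print(dtw_distance(a, b))  # 0.0
```

The O(nm) table can be reduced to two rows of memory, and practical systems usually add a band constraint so the warping path cannot drift arbitrarily far from the diagonal.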
Wavelet Based Feature Extraction for the Indonesian CV Syllables Sound (TELKOMNIKA JOURNAL)
This paper proposes the combined methods of the Wavelet Transform (WT) and Euclidean Distance (ED) to estimate the expected value of the feature vector of Indonesian syllables. This research aims to find the most effective and efficient properties for performing feature extraction of each syllable sound, to be applied in speech recognition systems. The proposed approach, which builds on a previous study, consists of three main phases. In the first phase, the speech signal is segmented and normalized. In the second phase, the signal is transformed into the frequency domain using the WT. In the third phase, the ED algorithm is used to estimate the expected feature vector. The result is a list of features for each syllable that can be used in further research, along with recommendations on the most effective and efficient WT for performing syllable sound recognition.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ...
This document discusses performance analysis of different acoustic features for Bangla speech recognition using LSTM neural networks. It develops a Bangla speech corpus and extracts linear predictive coding (LPC), Mel frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) acoustic features from the corpus. The features are then used to train LSTM models for Bangla speech recognition and their performance is evaluated based on sentence correct rates on test data sets consisting of male and female speakers.
An expert system for automatic reading of a text written in standard arabic
In this work we present our expert system for automatic reading, or speech synthesis, from a text written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound database, and the transformation of the written text into speech (Text To Speech, TTS). This transformation is done firstly by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and secondly by the generation of the voice signal that corresponds to the transcribed chain. We present the different design choices of the system, as well as the results obtained, compared with other works studied that realize TTS based on Standard Arabic.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ...
In this work a new Bangla speech corpus, along with proper transcriptions, has been developed; in addition, various acoustic feature extraction methods have been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system. The acoustic features are a sequence of representative vectors extracted from the speech signals, and the classes are either words or sub-word units such as phonemes. The commonly used feature extraction method known as linear predictive coding (LPC) has been applied first in this work, followed by two other popular methods, the Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), which are based on models of the human auditory system. A detailed review of the implementation of these methods is given first; then the steps of the implementation are elaborated for the development of an automatic speech recognition (ASR) system for Bangla speech.
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTIONAL...
Researchers of many nations have developed automatic speech recognition (ASR) to show their national improvement in information and communication technology for their languages. This work intends to improve ASR performance for the Myanmar language by varying different Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. CNNs have the ability to reduce spectral variations and to model spectral correlations in the signal, owing to their locality and pooling operations. Therefore, the impact of these hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hour data set is used as training data, and ASR performance was evaluated on two open test sets: web news and recorded data. As the Myanmar language is syllable-timed, a syllable-based ASR system was built and compared with a word-based one. As a result, it achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and a 21.83% WER and a 15.76% SER on TestSet2.
Arabic digits speech recognition and speaker identification in noisy environment (TELKOMNIKA JOURNAL)
This paper presents automatic speaker identification and speech recognition for Arabic digits in a noisy environment. In this work, the proposed system is able to identify the speaker after saving his voice in the database and adding noise. Mel frequency cepstral coefficients (MFCC) are used in building a program on the Matlab platform, and vector quantization is used for generating the codebooks. Gaussian mixture modelling (GMM) algorithms are used for template generation and feature matching. In this paper, we propose a system based on the MFCC-GMM and MFCC-VQ approaches on the one hand, and on the hybrid MFCC-VQ-GMM approach on the other, for speaker modeling. White Gaussian noise is added to the clean speech at several signal-to-noise ratio (SNR) levels to test the system in a noisy environment. The proposed system gives good recognition rates.
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS
Human beings generate different speech waveforms when speaking the same word at different times. Also, different human beings have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distances between various words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses a dynamic programming technique for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages like English, or to build pronunciation dictionaries for lesser-known sparse languages. The precision measurement experiments show 88.9% accuracy.
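DPW's global alignment is the classic dynamic programming recurrence also used for edit distance. The sketch below computes a unit-cost Levenshtein distance between two phone sequences; the paper's actual substitution costs are not reproduced here, so uniform costs are an assumption for illustration.

```python
def phone_distance(seq1, seq2):
    """Levenshtein distance between two phone sequences via dynamic
    programming - the global-alignment idea behind DPW, with unit
    insertion/deletion/substitution costs assumed for illustration."""
    n, m = len(seq1), len(seq2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all of seq1's first i phones
    for j in range(m + 1):
        d[0][j] = j  # insert all of seq2's first j phones
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if seq1[i - 1] == seq2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[n][m]

# Two pronunciations of "tomato" as (hypothetical) phone strings
# differ in exactly one phone:
print(phone_distance(["t", "ah", "m", "ey", "t", "ow"],
                     ["t", "ah", "m", "aa", "t", "ow"]))  # 1
```

A refinement in the DPW spirit is to replace the unit substitution cost with a phone-similarity cost (e.g. smaller penalties for acoustically close phones), which changes only the `sub` line.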
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITCHED...
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
Effect of MFCC Based Features for Speech Signal Alignments
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signals. HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them. The improvement in the alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded. The recorded material was segmented manually and aligned at the sentence, word, and phoneme levels. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain. It has been seen that effective speech alignment can be carried out even at the phrase level.
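The Mahalanobis distance used to score aligned frames can be computed directly from the frame difference and a covariance matrix, as in this small sketch. The covariance here is illustrative (the papers estimate it from their HNM parameter data); with an identity covariance the measure reduces to ordinary Euclidean distance.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between two feature vectors given a
    shared covariance matrix: sqrt((x-y)^T cov^-1 (x-y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# With an identity covariance this is just Euclidean distance:
cov = np.eye(2)
print(mahalanobis([0.0, 0.0], [3.0, 4.0], cov))  # 5.0
```

A non-identity covariance down-weights directions in which the features naturally vary a lot, which is why the distance is preferred over Euclidean distance for comparing cepstral or HNM parameter frames.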
Control of a servo-hydraulic system utilizing an extended wavelet functional ...nooriasukmaningtyas
Servo-hydraulic systems have been extensively employed in various industrial applications. However, these systems are characterized by their highly complex and nonlinear dynamics, which complicates the control design stage of such systems. In this paper, an extended wavelet functional link neural network (EWFLNN) is proposed to control the displacement response of the servo-hydraulic system. To optimize the controller's parameters, a recently developed optimization technique, which is called the modified sine cosine algorithm (M-SCA), is exploited as the training method. The proposed controller has achieved remarkable results in terms of tracking two different displacement signals and handling external disturbances. From a comparative study, the proposed EWFLNN controller has attained the best control precision compared with those of other controllers, namely, a proportional-integralderivative (PID) controller, an artificial neural network (ANN) controller, a wavelet neural network (WNN) controller, and the original wavelet functional link neural network (WFLNN) controller. Moreover, compared to the genetic algorithm (GA) and the original sine cosine algorithm (SCA), the M-SCA has shown better optimization results in finding the optimal values of the controller's parameters.
Decentralised optimal deployment of mobile underwater sensors for covering la...nooriasukmaningtyas
This paper presents the problem of sensing coverage of layers of the ocean in three dimensional underwater environments. We propose distributed control laws to drive mobile underwater sensors to optimally cover a given confined layer of the ocean. By applying this algorithm at first the mobile underwater sensors adjust their depth to the specified depth. Then, they make a triangular grid across a given area. Afterwards, they randomly move to spread across the given grid. These control laws only rely on local information also they are easily implemented and computationally effective as they use some easy consensus rules. The feature of exchanging information just among neighbouring mobile sensors keeps the information exchange minimum in the whole networks and makes this algorithm practicable option for undersea. The efficiency of the presented control laws is confirmed via mathematical proof and numerical simulations.
Evaluation quality of service for internet of things based on fuzzy logic: a ...nooriasukmaningtyas
The development of the internet of thing (IoT) technology has become a major concern in sustainability of quality of service (SQoS) in terms of efficiency, measurement, and evaluation of services, such as our smart home case study. Based on several ambiguous linguistic and standard criteria, this article deals with quality of service (QoS). We used fuzzy logic to select the most appropriate and efficient services. For this reason, we have introduced a new paradigmatic approach to assess QoS. In this regard, to measure SQoS, linguistic terms were collected for identification of ambiguous criteria. This paper collects the results of other work to compare the traditional assessment methods and techniques in IoT. It has been proven that the comparison that traditional valuation methods and techniques could not effectively deal with these metrics. Therefore, fuzzy logic is a worthy method to provide a good measure of QoS with ambiguous linguistic and criteria. The proposed model addresses with constantly being improved, all the main axes of the QoS for a smart home. The results obtained also indicate that the model with its fuzzy performance importance index (FPII) has efficiently evaluate the multiple services of SQoS.
Low power architecture of logic gates using adiabatic techniquesnooriasukmaningtyas
The growing significance of portable systems to limit power consumption in ultra-large-scale-integration chips of very high density, has recently led to rapid and inventive progresses in low-power design. The most effective technique is adiabatic logic circuit design in energy-efficient hardware. This paper presents two adiabatic approaches for the design of low power circuits, modified positive feedback adiabatic logic (modified PFAL) and the other is direct current diode based positive feedback adiabatic logic (DC-DB PFAL). Logic gates are the preliminary components in any digital circuit design. By improving the performance of basic gates, one can improvise the whole system performance. In this paper proposed circuit design of the low power architecture of OR/NOR, AND/NAND, and XOR/XNOR gates are presented using the said approaches and their results are analyzed for powerdissipation, delay, power-delay-product and rise time and compared with the other adiabatic techniques along with the conventional complementary metal oxide semiconductor (CMOS) designs reported in the literature. It has been found that the designs with DC-DB PFAL technique outperform with the percentage improvement of 65% for NOR gate and 7% for NAND gate and 34% for XNOR gate over the modified PFAL techniques at 10 MHz respectively.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
Smart monitoring system using NodeMCU for maintenance of production machinesnooriasukmaningtyas
Maintenance is an activity that helps to reduce risk, increase productivity, improve quality, and minimize production costs. The necessity for maintenance actions will increase efficiency and enhance the safety and quality of products and processes. On getting these conditions, it is necessary to implement a monitoring system used to observe machines' conditions from time to time, especially the machine parts that often experience problems. This paper presents a low-cost intelligent monitoring system using NodeMCU to continuously monitor machine conditions and provide warnings in the case of machine failure. Not only does it provide alerts, but this monitoring system also generates historical data on machine conditions to the Google Cloud (Google Sheet), includes which machines were down, downtime, issues occurred, repairs made, and technician handling. The results obtained are machine operators do not need to lose a relatively long time to call the technician. Likewise, the technicians assisted in carrying out machine maintenance activities and online reports so that errors that often occur due to human error do not happen again. The system succeeded in reducing the technician-calling time and maintenance workreporting time up to 50%. The availability of online and real-time maintenance historical data will support further maintenance strategy.
Design and simulation of a software defined networkingenabled smart switch, f...nooriasukmaningtyas
Using sustainable energy is the future of our planet earth, this became not only economically efficient but also a necessity for the preservation of life on earth. Because of such necessity, smart grids became a very important issue to be researched. Many literatures discussed this topic and with the development of internet of things (IoT) and smart sensors, smart grids are developed even further. On the other hand, software defined networking is a technology that separates the control plane from the data plan of the network. It centralizes the management and the orchestration of the network tasks by using a network controller. The network controller is the heart of the SDN-enabled network, and it can control other networking devices using software defined networking (SDN) protocols such as OpenFlow. A smart switching mechanism called (SDN-smgrid-sw) for the smart grid will be modeled and controlled using SDN. We modeled the environment that interact with the sensors, for the sun and the wind elements. The Algorithm is modeled and programmed for smart efficient power sharing that is managed centrally and monitored using SDN controller. Also, all if the smart grid elements (power sources) are connected to the IP network using IoT protocols.
Efficient wireless power transmission to remote the sensor in restenosis coro...nooriasukmaningtyas
In this study, the researchers have proposed an alternative technique for designing an asymmetric 4 coil-resonance coupling module based on the series-to-parallel topology at 27 MHz industrial scientific medical (ISM) band to avoid the tissue damage, for the constant monitoring of the in-stent restenosis coronary artery. This design consisted of 2 components, i.e., the external part that included 3 planar coils that were placed outside the body and an internal helical coil (stent) that was implanted into the coronary artery in the human tissue. This technique considered the output power and the transfer efficiency of the overall system, coil geometry like the number of coils per turn, and coil size. The results indicated that this design showed an 82% efficiency in the air if the transmission distance was maintained as 20 mm, which allowed the wireless power supply system to monitor the pressure within the coronary artery when the implanted load resistance was 400 Ω.
Grid reactive voltage regulation and cost optimization for electric vehicle p...nooriasukmaningtyas
Expecting large electric vehicle (EV) usage in the future due to environmental issues, state subsidies, and incentives, the impact of EV charging on the power grid is required to be closely analyzed and studied for power quality, stability, and planning of infrastructure. When a large number of energy storage batteries are connected to the grid as a capacitive load the power factor of the power grid is inevitably reduced, causing power losses and voltage instability. In this work large-scale 18K EV charging model is implemented on IEEE 33 network. Optimization methods are described to search for the location of nodes that are affected most due to EV charging in terms of power losses and voltage instability of the network. Followed by optimized reactive power injection magnitude and time duration of reactive power at the identified nodes. It is shown that power losses are reduced and voltage stability is improved in the grid, which also complements the reduction in EV charging cost. The result will be useful for EV charging stations infrastructure planning, grid stabilization, and reducing EV charging costs.
Topology network effects for double synchronized switch harvesting circuit on...nooriasukmaningtyas
Energy extraction takes place using several different technologies, depending on the type of energy and how it is used. The objective of this paper is to study topology influence for a smart network based on piezoelectric materials using the double synchronized switch harvesting (DSSH). In this work, has been presented network topology for circuit DSSH (DSSH Standard, Independent DSSH, DSSH in parallel, mono DSSH, and DSSH in series). Using simulation-based on a structure with embedded piezoelectric system harvesters, then compare different topology of circuit DSSH for knowledge is how to connect the circuit DSSH together and how to implement accurately this circuit strategy for maximizing the total output power. The network topology DSSH extracted power a technique allows again up to in terms of maximal power output compared with network topology standard extracted at the resonant frequency. The simulation results show that by using the same input parameters the maximum efficiency for topology DSSH in parallel produces 120% more energy than topology DSSH-series. In addition, the energy harvesting by mono-DSSH is more than DSSH-series by 650% and it has exceeded DSSHind by 240%.
Improving the design of super-lift Luo converter using hybrid switching capac...nooriasukmaningtyas
In this article, an improvement to the positive output super-lift Luo converter (POSLC) has been proposed to get high gain at a low duty cycle. Also, reduce the stress on the switch and diodes, reduce the current through the inductors to reduce loss, and increase efficiency. Using a hybrid switch unit composed of four inductors and two capacitors it is replaced by the main inductor in the elementary circuit. It’s charged in parallel with the same input voltage and discharged in series. The output voltage is increased according to the number of components. The gain equation is modeled. The boundary condition between continuous conduction mode (CCM) and discontinuous conduction mode (DCM) has been derived. Passive components are designed to get high output voltage (8 times at D=0.5) and low ripple about (0.004). The circuit is simulated and analyzed using MATLAB/Simulink. Maximum power point tracker (MPPT) controls the converter to provide the most interest from solar energy.
Third harmonic current minimization using third harmonic blocking transformernooriasukmaningtyas
Zero sequence blocking transformers (ZSBTs) are used to suppress third harmonic currents in 3-phase systems. Three-phase systems where singlephase loading is present, there is every chance that the load is not balanced. If there is zero-sequence current due to unequal load current, then the ZSBT will impose high impedance and the supply voltage at the load end will be varied which is not desired. This paper presents Third harmonic blocking transformer (THBT) which suppresses only higher harmonic zero sequences. The constructional features using all windings in single-core and construction using three single-phase transformers explained. The paper discusses the constructional features, full details of circuit usage, design considerations, and simulation results for different supply and load conditions. A comparison of THBT with ZSBT is made with simulation results by considering four different cases
Power quality improvement of distribution systems asymmetry caused by power d...nooriasukmaningtyas
With an increase of non-linear load in today’s electrical power systems, the rate of power quality drops and the voltage source and frequency deteriorate if not properly compensated with an appropriate device. Filters are most common techniques that employed to overcome this problem and improving power quality. In this paper an improved optimization technique of filter applies to the power system is based on a particle swarm optimization with using artificial neural network technique applied to the unified power flow quality conditioner (PSO-ANN UPQC). Design particle swarm optimization and artificial neural network together result in a very high performance of flexible AC transmission lines (FACTs) controller and it implements to the system to compensate all types of power quality disturbances. This technique is very powerful for minimization of total harmonic distortion of source voltages and currents as a limit permitted by IEEE-519. The work creates a power system model in MATLAB/Simulink program to investigate our proposed optimization technique for improving control circuit of filters. The work also has measured all power quality disturbances of the electrical arc furnace of steel factory and suggests this technique of filter to improve the power quality.
Studies enhancement of transient stability by single machine infinite bus sys...nooriasukmaningtyas
Maintaining network synchronization is important to customer service. Low fluctuations cause voltage instability, non-synchronization in the power system or the problems in the electrical system disturbances, harmonics current and voltages inflation and contraction voltage. Proper tunning of the parameters of stabilizer is prime for validation of stabilizer. To overcome instability issues and get reinforcement found a lot of the techniques are developed to overcome instability problems and improve performance of power system. Genetic algorithm was applied to optimize parameters and suppress oscillation. The simulation of the robust composite capacitance system of an infinite single-machine bus was studied using MATLAB was used for optimization purpose. The critical time is an indication of the maximum possible time during which the error can pass in the system to obtain stability through the simulation. The effectiveness improvement has been shown in the system
Renewable energy based dynamic tariff system for domestic load managementnooriasukmaningtyas
To deal with the present power-scenario, this paper proposes a model of an advanced energy management system, which tries to achieve peak clipping, peak to average ratio reduction and cost reduction based on effective utilization of distributed generations. This helps to manage conventional loads based on flexible tariff system. The main contribution of this work is the development of three-part dynamic tariff system on the basis of time of utilizing power, available renewable energy sources (RES) and consumers’ load profile. This incorporates consumers’ choice to suitably select for either consuming power from conventional energy sources and/or renewable energy sources during peak or off-peak hours. To validate the efficiency of the proposed model we have comparatively evaluated the model performance with existing optimization techniques using genetic algorithm and particle swarm optimization. A new optimization technique, hybrid greedy particle swarm optimization has been proposed which is based on the two aforementioned techniques. It is found that the proposed model is superior with the improved tariff scheme when subjected to load management and consumers’ financial benefit. This work leads to maintain a healthy relationship between the utility sectors and the consumers, thereby making the existing grid more reliable, robust, flexible yet cost effective.
Energy harvesting maximization by integration of distributed generation based...nooriasukmaningtyas
The purpose of distributed generation systems (DGS) is to enhance the distribution system (DS) performance to be better known with its benefits in the power sector as installing distributed generation (DG) units into the DS can introduce economic, environmental and technical benefits. Those benefits can be obtained if the DG units' site and size is properly determined. The aim of this paper is studying and reviewing the effect of connecting DG units in the DS on transmission efficiency, reactive power loss and voltage deviation in addition to the economical point of view and considering the interest and inflation rate. Whale optimization algorithm (WOA) is introduced to find the best solution to the distributed generation penetration problem in the DS. The result of WOA is compared with the genetic algorithm (GA), particle swarm optimization (PSO), and grey wolf optimizer (GWO). The proposed solutions methodologies have been tested using MATLAB software on IEEE 33 standard bus system
Intelligent fault diagnosis for power distribution systemcomparative studiesnooriasukmaningtyas
Short circuit is one of the most popular types of permanent fault in power distribution system. Thus, fast and accuracy diagnosis of short circuit failure is very important so that the power system works more effectively. In this paper, a newly enhanced support vector machine (SVM) classifier has been investigated to identify ten short-circuit fault types, including single line-toground faults (XG, YG, ZG), line-to-line faults (XY, XZ, YZ), double lineto-ground faults (XYG, XZG, YZG) and three-line faults (XYZ). The performance of this enhanced SVM model has been improved by using three different versions of particle swarm optimization (PSO), namely: classical PSO (C-PSO), time varying acceleration coefficients PSO (T-PSO) and constriction factor PSO (K-PSO). Further, utilizing pseudo-random binary sequence (PRBS)-based time domain reflectometry (TDR) method allows to obtain a reliable dataset for SVM classifier. The experimental results performed on a two-branch distribution line show the most optimal variant of PSO for short fault diagnosis.
A deep learning approach based on stochastic gradient descent and least absol...nooriasukmaningtyas
More than eighty-five to ninety percentage of the diabetic patients are affected with diabetic retinopathy (DR) which is an eye disorder that leads to blindness. The computational techniques can support to detect the DR by using the retinal images. However, it is hard to measure the DR with the raw retinal image. This paper proposes an effective method for identification of DR from the retinal images. In this research work, initially the Weiner filter is used for preprocessing the raw retinal image. Then the preprocessed image is segmented using fuzzy c-mean technique. Then from the segmented image, the features are extracted using grey level co-occurrence matrix (GLCM). After extracting the fundus image, the feature selection is performed stochastic gradient descent, and least absolute shrinkage and selection operator (LASSO) for accurate identification during the classification process. Then the inception v3-convolutional neural network (IV3-CNN) model is used in the classification process to classify the image as DR image or non-DR image. By applying the proposed method, the classification performance of IV3-CNN model in identifying DR is studied. Using the proposed method, the DR is identified with the accuracy of about 95%, and the processed retinal image is identified as mild DR.
Design and optimization of ion propulsion dronebjmsejournal
Electric propulsion technology is widely used in many kinds of vehicles in recent years, and aircrafts are no exception. Technically, UAVs are electrically propelled but tend to produce a significant amount of noise and vibrations. Ion propulsion technology for drones is a potential solution to this problem. Ion propulsion technology is proven to be feasible in the earth’s atmosphere. The study presented in this article shows the design of EHD thrusters and power supply for ion propulsion drones along with performance optimization of high-voltage power supply for endurance in earth’s atmosphere.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
Use PyCharm for remote debugging of WSL on a Windo cf5c162d672e4e58b4dde5d797...shadow0702a
This document serves as a comprehensive step-by-step guide on how to effectively use PyCharm for remote debugging of the Windows Subsystem for Linux (WSL) on a local Windows machine. It meticulously outlines several critical steps in the process, starting with the crucial task of enabling permissions, followed by the installation and configuration of WSL.
The guide then proceeds to explain how to set up the SSH service within the WSL environment, an integral part of the process. Alongside this, it also provides detailed instructions on how to modify the inbound rules of the Windows firewall to facilitate the process, ensuring that there are no connectivity issues that could potentially hinder the debugging process.
The document further emphasizes on the importance of checking the connection between the Windows and WSL environments, providing instructions on how to ensure that the connection is optimal and ready for remote debugging.
It also offers an in-depth guide on how to configure the WSL interpreter and files within the PyCharm environment. This is essential for ensuring that the debugging process is set up correctly and that the program can be run effectively within the WSL terminal.
Additionally, the document provides guidance on how to set up breakpoints for debugging, a fundamental aspect of the debugging process which allows the developer to stop the execution of their code at certain points and inspect their program at those stages.
Finally, the document concludes by providing a link to a reference blog. This blog offers additional information and guidance on configuring the remote Python interpreter in PyCharm, providing the reader with a well-rounded understanding of the process.
artificial intelligence and data science contents.pptxGauravCar
What is artificial intelligence? Artificial intelligence is the ability of a computer or computer-controlled robot to perform tasks that are commonly associated with the intellectual processes characteristic of humans, such as the ability to reason.
› ...
Artificial intelligence (AI) | Definitio
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Rainfall intensity duration frequency curve statistical analysis and modeling...bijceesjournal
Using data from 41 years in Patna’ India’ the study’s goal is to analyze the trends of how often it rains on a weekly, seasonal, and annual basis (1981−2020). First, utilizing the intensity-duration-frequency (IDF) curve and the relationship by statistically analyzing rainfall’ the historical rainfall data set for Patna’ India’ during a 41 year period (1981−2020), was evaluated for its quality. Changes in the hydrologic cycle as a result of increased greenhouse gas emissions are expected to induce variations in the intensity, length, and frequency of precipitation events. One strategy to lessen vulnerability is to quantify probable changes and adapt to them. Techniques such as log-normal, normal, and Gumbel are used (EV-I). Distributions were created with durations of 1, 2, 3, 6, and 24 h and return times of 2, 5, 10, 25, and 100 years. There were also mathematical correlations discovered between rainfall and recurrence interval.
Findings: Based on findings, the Gumbel approach produced the highest intensity values, whereas the other approaches produced values that were close to each other. The data indicates that 461.9 mm of rain fell during the monsoon season’s 301st week. However, it was found that the 29th week had the greatest average rainfall, 92.6 mm. With 952.6 mm on average, the monsoon season saw the highest rainfall. Calculations revealed that the yearly rainfall averaged 1171.1 mm. Using Weibull’s method, the study was subsequently expanded to examine rainfall distribution at different recurrence intervals of 2, 5, 10, and 25 years. Rainfall and recurrence interval mathematical correlations were also developed. Further regression analysis revealed that short wave irrigation, wind direction, wind speed, pressure, relative humidity, and temperature all had a substantial influence on rainfall.
Originality and value: The results of the rainfall IDF curves can provide useful information to policymakers in making appropriate decisions in managing and minimizing floods in the study area.
A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect
Indonesian Journal of Electrical Engineering and Computer Science
Vol. 21, No. 3, March 2021, pp. 1417-1423
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v21.i3.pp1417-1423
Journal homepage: http://ijeecs.iaescore.com
A new framework based on KNN and DT for speech identification through emphatic letters in Moroccan dialect
Bezoui Mouaz (1), Cherif Walid (2), Beni-Hssane Abderrahim (3), Elmoutaouakkil Abdelmajid (4)
(1) Faculty of Science El Jadida (FSJ), Université Chouaïb Doukkali (UCD), Morocco
(2, 3, 4) SI2M Laboratory, National Institute of Statistics and Applied Economics, Rabat, Morocco
Article Info
Article history: Received Apr 17, 2020; Revised Jul 8, 2020; Accepted Aug 30, 2020

ABSTRACT
Arabic dialects differ substantially from Modern Standard Arabic and from each other in phonology, morphology, lexical choice, and syntax. This makes identifying dialects from speech a very difficult task. In this paper, we introduce a speech recognition system that automatically identifies the gender of the speaker, the emphatic letter pronounced, and the diacritic of that emphatic letter, given a sample of the speaker's speech. We first examine the performance of a single classifier, the hidden Markov model (HMM), applied to the samples of our data corpus. We then evaluate our proposed approach, KNN-DT, a hybridization of two classifiers: decision trees (DT) and K-nearest neighbors (KNN). Each model is first applied individually to the data corpus to recognize the emphatic letter of the sound, its diacritic, and the gender of the speaker. This hybridization proved quite effective: it improved speech recognition accuracy by more than 10% compared with state-of-the-art approaches.
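The two-stage idea in the abstract — first predict auxiliary labels (speaker gender, diacritic) with KNN and DT, then append those predictions to the feature vector for the final classifier — can be sketched as follows. This is an illustrative sketch on synthetic data using scikit-learn, not the authors' exact pipeline; the feature dimensionality, label counts, and model settings are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 13))             # stand-in for 13 acoustic features per sample
gender = rng.integers(0, 2, size=120)      # auxiliary label 1: speaker gender
diacritic = rng.integers(0, 3, size=120)   # auxiliary label 2: diacritic class
letter = rng.integers(0, 4, size=120)      # final target: 4 emphatic letters

# Stage 1: auxiliary predictors (KNN for diacritics, DT for gender, mirroring
# the study's finding of which classifier suits which sub-task).
knn = KNeighborsClassifier(n_neighbors=5).fit(X, diacritic)
dt = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, gender)

# Stage 2: augment the features with the stage-1 predictions, then train the
# final emphatic-letter classifier on the enlarged feature vectors.
X_aug = np.column_stack([X, knn.predict(X), dt.predict(X)])
final = DecisionTreeClassifier(random_state=0).fit(X_aug, letter)
print(X_aug.shape)  # (120, 15): 13 original features + 2 predicted labels
```

In practice the stage-1 predictions fed into stage 2 should come from cross-validated folds (e.g. `sklearn.model_selection.cross_val_predict`) rather than from predictions on the training data itself, to avoid leaking the auxiliary labels into the final model.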
Keywords: Decision tree; Hidden Markov model; K-nearest neighbor; Machine learning; Speaker identification
This is an open access article under the CC BY-SA license.
Corresponding Author:
Bezoui Mouaz
Department of Computer Science
Chouaïb Doukkali University
24000, El jadida, Morocco
Email: mbezoui@gmail.com
1. INTRODUCTION
Automatic speech recognition is gaining considerable interest due to the high variability in speech signals. Indeed, speakers may express their ideas in different accents, dialects, and pronunciations. Speech signals also differ from person to person, and between the two genders.
The existence of ambient noise, sound echoes, sound recording devices, and stereo microphones results in additional variability. Conventional speech recognition systems use the hidden Markov model (HMM) to represent the sequential structure of speech signals. There are four emphatic consonants in Arabic which are of interest here: two of them are plosives, /d/ and /t/, and the other two are fricatives, /s/ and /z/ [1]. The letter /d/ is an emphatic plosive with an alveo-dental point of articulation; this phoneme is rare in human languages [2].
Arabic is commonly referred to as "the Dhaad language", where Dhaad is the name of the Arabic letter that carries the phoneme /d/. This name was given to Arabic based on the classical Arabic version of the /d/ phoneme, which is an emphatic lateral fricative rather than the plosive of the MSA version. The letter /t/ is an unvoiced emphatic plosive with an alveo-dental articulation point, while the letter /s/ is an unvoiced emphatic fricative with an alveo-dental articulation point. The letter /z/ is an emphatic fricative with an interdental point of articulation [3]. Table 1 lists the four Arabic emphatic sounds as well as their non-emphatic counterparts.
Table 1. Arabic emphatic consonants (ض, ص, ط, ظ)
Arabic letter | LDC symbol | Non-emphatic counterparts
Dhaad (ض) | d | /d/ (Daal)
Saad (ص) | s | Voiced: /z/ (Zain); Unvoiced: /s/ (Seen)
Taà (ط) | t | Voiced: /d/ (Daal); Unvoiced: /t/ (Taà)
Thaà (ظ) | z | /th/ (Thaal)
In the literature, the majority of previous works on Arabic speaker identification has focused on MSA speech recognition. Speech data are usually news broadcasts, where MSA is the formal language [4, 5]. Different classifiers have been investigated in this sense, such as neural networks (NN) [6, 7], K-nearest neighbors (KNN) [8], and hidden Markov models (HMM) [9, 10]. The common objective of these works was the improvement of models' accuracies [11]. Other researchers opted for ensemble methods [12]: they combined different classifiers during the training stage [13, 14], and the final decision was based on a comparison of the obtained individual results. Another category of works proposed hybridizations of these classifiers by optimizing their algorithms. In this paper, we propose a new hybrid KNN-DT classifier. The main idea is to combine the robustness of KNN with the representativeness of the DT classifier. The computational time was also investigated; our hybrid model achieves optimal accuracy in a reasonable recognition time. The rest of this paper is organized as follows. Section 2 describes the dataset and its preprocessing steps, summarizes the main classifiers used, and introduces the proposed framework based on a combination of KNN and DT. Section 3 discusses the experimental results. Finally, the last section concludes this work.
2. METHOD
2.1. Data description
We experimented with three systems: hidden Markov models, decision trees, and KNN. In the case of the decision-tree-based system, the emphatic consonant /d/ was not always recognized properly. This was due to the specific acoustic characteristics of this consonant, which make it difficult to pronounce and therefore to recognize [15]. Only few native speakers are able to pronounce it correctly. Overall, emphatic consonants yielded considerably lower accuracies compared to fricative, nasal, and plosive consonants [16].
While experimenting with the proposed model for speech recognition, we used a dataset consisting of 720 sounds. 12 people participated in this collection: 4 women and 8 men. 4 letters with 3 diacritics each were considered during the experimentation, and each participant recorded each sound 5 times. We divided this collection as follows: the first four records of the participants woman1, woman2, woman3, man1, man2, man3, man5, man6, and man7 constitute the training set, while the fifth record of each of them, together with all records of woman4, man4, and man8, is used for the test. To evaluate the proposed model, we used accuracy as the metric, i.e., the number of correctly predicted sounds divided by the total number of sounds in the test set. Table 2 presents the grapheme-phoneme correspondences of the Arabic consonants.
Table 2. Arabic consonants table: grapheme-phoneme correspondences
Arabic consonants | Occlusive | Emphatic | Fricative
Labial | ب | | ف
Inter-dental | | ظ | ذ ث
Dental | د ت | ض ط |
Pharyngeal | | | ع ح
Velar | ك | |
Glottal | ء | | ه
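The speaker-based corpus split and the accuracy metric described above can be sketched as follows. The speaker names and counts follow the paper; the sample objects are placeholders, not the authors' actual acoustic features:

```python
# Sketch of the corpus split described in Section 2.1: takes 1-4 of the
# nine training speakers form the training set; take 5 of those speakers
# plus all takes of the three held-out speakers form the test set.

TRAIN_SPEAKERS = ["woman1", "woman2", "woman3",
                  "man1", "man2", "man3", "man5", "man6", "man7"]
HELDOUT_SPEAKERS = ["woman4", "man4", "man8"]

def split_corpus(records):
    """records: list of (speaker, take_index, sample) with take_index in 1..5."""
    train, test = [], []
    for speaker, take, sample in records:
        if speaker in TRAIN_SPEAKERS and take <= 4:
            train.append((speaker, sample))
        else:
            test.append((speaker, sample))
    return train, test

def accuracy(predicted, actual):
    """Ratio of correctly predicted sounds over all sounds in the test set."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With 12 speakers, 12 sounds (4 letters x 3 diacritics), and 5 takes each, this yields 720 records split into 432 training and 288 test samples.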
2.2. Data preparation
The training corpus was collected in both MSA and the Moroccan Arabic dialect. The recordings were carried out at a sampling frequency of 16 kHz and were segmented in order to clean the speech from external sounds such as noise and background music [17]. A high-quality microphone (Labtec AM-232) was used to record speech from the participants over the period from October 2019 to December 2019.
2.3. The used classifiers overview
2.3.1. Hidden markov model
Among many other approaches, HMMs have proven to be the most efficient method in speech recognition. The reasons behind this method's popularity are the availability of training algorithms for estimating the parameters of the models from finite training sets of speech data, its solid mathematical foundation, and its ability to model time series of variable lengths. Despite its great success, it is well known that one of the main weaknesses of the conventional HMM classifier is the large number of tuning parameters (e.g., the number of states, the number of Gaussians per state, and the number of training iterations). These parameters have to be set experimentally, and they depend crucially on the training and test data, which affects the robustness of the classifier. Hidden Markov models, introduced in the early 1970s, became the standard solution to the problems of this subfield, namely automatic speech recognition. The acoustic signal of speech is modeled by a small set of acoustic units, which can be considered the elementary sounds of the language. Traditionally, the chosen unit is the phoneme, and words are formed by concatenating phonemes [18]. More specific units, such as syllables, disyllables, or phonemes in context, can be used to make the model more discriminating, but this theoretical improvement is limited in practice by the complexity involved and by estimation problems. The speech signal can thus be likened to a series of units. In the context of Markov ASR, the acoustic units are modeled by HMMs, which are typically left-right tristate as shown in Figure 1.
Figure 1. The left-right hidden Markov model (HMM) topology used
At each state of the Markov model, there is an associated probability distribution modeling the generation of acoustic vectors via this state. An HMM is characterized by several parameters [19]:
N: the number of states of the model.
The matrix of transition probabilities over the set of states of the model:

A = {a_ij} = {P(q_t = j | q_{t-1} = i)}

The matrix of emission probabilities of the observations X_t for the state q_k:

B = {b_k(X_t)} = {P(X_t | q_t = k)}

π: the initial distribution of states, π_i = P(q_0 = i).
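Given these parameters, the likelihood of an observation sequence can be computed with the standard forward recursion. The following is a minimal sketch with discrete observations and illustrative numbers (the transition matrix encodes a left-right tristate topology as in Figure 1); it is not estimated from the paper's corpus:

```python
# Forward algorithm for a discrete HMM defined by (A, B, pi):
#   A[i][j] = P(q_t = j | q_{t-1} = i)   transition probabilities
#   B[k][x] = P(X_t = x | q_t = k)       emission probabilities
#   pi[i]   = P(q_0 = i)                 initial state distribution

def forward(A, B, pi, obs):
    """Return P(obs | model) via the forward recursion."""
    N = len(pi)
    # Initialisation: alpha_0(i) = pi_i * b_i(x_0)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) a_ij) * b_j(x_t)
    for x in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][x]
                 for j in range(N)]
    # Termination: P(obs) = sum_i alpha_T(i)
    return sum(alpha)

# Left-right tristate topology: each state may only stay or move forward.
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]
pi = [1.0, 0.0, 0.0]

likelihood = forward(A, B, pi, [0, 1, 1])
```

In practice the emissions are Gaussian mixtures over acoustic vectors rather than a discrete table, but the recursion has the same shape.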
2.3.2. Decision trees
Our first experimentation concerns the prediction of sound contents using decision trees. The choice of this model is justified by the explicit rules it generates. The decision tree method is very easy to read and interpret. It illustrates that machine learning is not always synonymous with statistical models, but can also target symbolic objects [20]. In our setting, this is of great importance, as one purpose behind the use of machine learning techniques is to expose visible rules that can orient further actions.
Several algorithms have been proposed to build the optimal tree; the best-known are the ID3 algorithm [21], which was designed for nominal attributes, and its successors C4.5 and C5 [22], which also support quantitative attributes. Technically, the information gain of the set L with respect to the feature x_j is the variation of entropy [23] caused by the partition of L according to x_j:
Gain(L, x_j) = H(L) - Σ_{v ∈ values(x_j)} [card(l_{x_j=v}) / card(L)] · H(l_{x_j=v})

where l_{x_j=v} denotes the subset of samples for which the feature x_j has the value v. Similarly, the gain ratio is computed by:

Gain ratio(L, x_j) = Gain(L, x_j) / SplitInfo(L, x_j)

The feature having the highest information gain / gain ratio is then the most significant.
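The entropy and information-gain computation can be sketched as follows. The feature names and labels are hypothetical, only to illustrate the formulas:

```python
import math
from collections import Counter

# Entropy and information gain as used by ID3-style decision trees.
# `samples` is a list of (features_dict, label) pairs.

def entropy(labels):
    """H(L) = -sum_c p_c * log2(p_c) over the class distribution of L."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, feature):
    """Gain(L, x_j) = H(L) - sum_v card(l_{x_j=v})/card(L) * H(l_{x_j=v})."""
    labels = [y for _, y in samples]
    gain = entropy(labels)
    for v in {x[feature] for x, _ in samples}:
        subset = [y for x, y in samples if x[feature] == v]
        gain -= len(subset) / len(samples) * entropy(subset)
    return gain
```

A feature that splits the classes perfectly yields a gain equal to H(L); a feature independent of the class yields a gain of zero, which is why the split is made on the feature with the highest gain.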
2.3.3. K-nearest neighbors
Our second experimentation concerns another machine learning model, the lazy learner k-nearest neighbors. This second technique is robust to noise and has been used in several pattern recognition applications [24-26]. However, it needs very high storage and computational time for large volumes of data. Its algorithm is based on similarities and is considered one of the simplest learning algorithms. When classifying a given sample, the algorithm collects the votes of its most similar samples in the sense of a predefined distance; the class of the new sample is then determined by the majority among these k most similar samples (nearest neighbors) [27]. The performance of KNN depends largely on two factors: the value of k and the distance measure used [28].
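The vote described above can be sketched in a few lines. Euclidean distance is used here for illustration; the feature vectors are placeholders, not the paper's acoustic features:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training samples.

    train: list of (feature_vector, label) pairs; query: feature_vector.
    """
    # Sort training samples by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    # The predicted class is the majority class among the k neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because all distances to the training set must be computed at query time, the cost grows with the corpus size, which is the storage and time limitation noted above.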
2.3.4. The proposed approach
The proposed model for speaker identification is a hybridization of decision trees (DT) and k-nearest neighbors (KNN). Both models are first applied directly on the data to recognize the letter of the sound, the diacritic, and the gender of each speaker.
Instead of directly predicting the content of a sound, we judged it better practice to first detect the gender of the speaker and the diacritic of the sound. Once these two parameters are predicted, we add them as additional features to recognize the overall content. By doing so, we considerably refine the prediction by eliminating sounds that may contain different diacritics or letters pronounced by participants of the other gender.
For the overall sound content prediction, the decomposed model achieved an accuracy of 71.43%, an improvement of 12.1%.
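The two-stage scheme above can be sketched as follows. Classifier objects with fit/predict interfaces are assumed (a scikit-learn-style convention); this is an illustration of the decomposition, not the authors' exact implementation:

```python
# Two-stage prediction: the gender and diacritic classifiers are applied
# first, and their outputs are appended to each feature vector before the
# final sound-content classifier is applied.

def two_stage_predict(gender_clf, diacritic_clf, content_clf, features):
    """features: list of per-sample acoustic feature vectors (lists)."""
    genders = gender_clf.predict(features)        # stage 1a: gender
    diacritics = diacritic_clf.predict(features)  # stage 1b: diacritic
    # Stage 2: augment each feature vector with the two sub-predictions.
    augmented = [f + [g, d]
                 for f, g, d in zip(features, genders, diacritics)]
    return content_clf.predict(augmented)
```

Any pair of the base classifiers (DT, KNN) can play the stage-1 and stage-2 roles; in the paper, the combination that performs best on each sub-task is retained.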
3. RESULTS AND DISCUSSION
First, we applied the previously cited algorithms directly. In Tables 3-5, we note that KNN is the most appropriate to recognize diacritics, with an accuracy exceeding 71%, followed by HMM, while DT returns its highest accuracy for gender prediction. Our proposed approach combines both predictors in an attempt to improve the overall performance.
The comparison of the three techniques, DT, KNN, and the decomposed approach, is shown in Figure 2. The figure shows that the sound content prediction is improved by the sub-predictions of gender, letter, and diacritic. The accuracy increased by 12.1%. Indeed, the delimitation of the speaker's gender, as well as the designation of the letter, reduces the risk of error caused by severe female accents or acute male accents; the designation of the letter then guides the prediction of diacritics towards a reduced list, thereby increasing the accuracy of the overall sound content prediction.
Table 3. HMM Accuracy for different predictions
Prediction Gender prediction Diacritic prediction Sound content prediction
HMM Accuracy 63.42% 70.64% 67.03%
Table 4. DT Accuracy for different predictions
Prediction Gender prediction Diacritic prediction Sound content prediction
DT Accuracy 66.67% 63.56% 59.33%
Table 5. KNN Accuracy for different predictions
Prediction Gender prediction Diacritic prediction Sound content prediction
KNN Accuracy 52.78% 71.62% 59.33%
Figure 2. Accuracy of the three experimented models: initial model (59.33%), model with gender detection (61.22%), and composed model (71.43%)
4. CONCLUSION AND PERSPECTIVES
In this paper, we presented a new model of speech identification for the Arabic language. Instead of applying traditional machine learning classifiers to recognize speech directly, we investigated the feasibility of decomposing the identification task into gender, emphatic letter, and speech categorization. We proposed this hybridization to supersede the common direct mapping of the global sound to its corresponding dialect. By dividing the speech identification task, we improved the accuracy of our model by 12.1%. This approach can be used for various language processing applications such as sentiment analysis and voicebots. In future work, we will continue investigating approaches that directly map the raw acoustic waveform to the corresponding dialects. We are currently exploring long short-term memory (LSTM) and recurrent neural networks (RNN) to further improve our hybrid model. Another line of research worth exploring is the effect of integrating social data during the training stage. This could help detect new correlations between records and highlight new important features for speech recognition.
ACKNOWLEDGEMENTS
We would like to acknowledge the main site where this research was carried out: the Department of Computer Science, Chouaïb Doukkali University, Faculty of Science, El Jadida, Morocco.
REFERENCES
[1] Ouni, S., Cohen, M., Massaro, W., “Training Baldi to be Multilingual: A Case Study for an Arabic Badr”, Speech
Communication, vol. 45, pp. 115-37, 2005.
[2] Al-Muhtaseb, H., Elshafei, M., & Alghamdi, M. “Techniques for High Quality Arabic Text-tospeech”, In The Third
Workshop on Computer and Information Sciences pp. 73-83, 2000.
[3] Selouani, S., Caelen, J., “Arabic Phonetic Features Recognition using Modular Connectionist Architectures”,
Interactive Voice Technology for Communication, IVTTA’98, Proceedings 1998 IEEE 4th Workshop 29, pp. 155-
160, 1998.
[4] Maamouri, M., Bies, A., & Kulick, S. ”Diacritization: A challenge to Arabic treebank annotation and parsing”, In
Proceedings of the Conference of the Machine Translation SIG of the British Computer Society.
[5] Yaseen, M., Attia, M., Maegaard, B., Choukri, K., Paulsson, N., Haamid, S., ... & Haddad, B. “Building Annotated
Written and Spoken Arabic LRs in NEMLAR Project”, In LREC, pp. 533-538, 2006.
[6] Masmoudi, S., Frikha, M., Chtourou, M., & Hamida, A. B. “Efficient MLP constructive training algorithm using a
neuron recruiting approach for isolated word recognition system”, International Journal of Speech Technology, vol.
14, no. 1, pp. 1-10, 2011.
[7] Dhanashri, D., & Dhonde, S. B. ‘’Isolated word speech recognition system using deep neural networks”, In
Proceedings of the international conference on data engineering and communication technology, pp. 9-17, 2017.
[8] Xu, B., Wang, N., Chen, T., & Li, M. “Empirical evaluation of rectified activations in convolutional network”,
2015. arXiv preprint arXiv:1505.00853.
[9] Khelifa, M. O., Elhadj, Y. M., Abdellah, Y., & Belkasmi, M. “Constructing accurate and robust HMM/GMM
models for an Arabic speech recognition system”, International Journal of Speech Technology, vol. 20, no. 4, pp.
937-949, 2017.
[10] Rabiner, L. R., Wilpon, J. G., & Soong, F. K. “High performance connected digit recognition using hidden Markov
models”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 8, pp. 1214-1225, 1989.
[11] Zhang, X., Sun, J., & Luo, Z. "One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions", PLoS ONE, vol. 9, no. 2, e85458, 2014. https://doi.org/10.1371/journal.pone.0085458.
[12] Hazmoune, S., Bougamouza, F., Mazouzi, S., & Benmohammed, M. “A new hybrid framework based on Hidden
Markov models and K-nearest neighbors for speech recognition”, International Journal of Speech Technology, vol.
21, no. 3, pp. 689-704, 2018.
[13] Hazmoune, S., Bougamouza, F., Mazouzi, S., & Benmohammed, M. “A novel speech recognition approach based
on multiple modeling by hidden Markov models”, In International Conference on Computer Applications
Technology (ICCAT), pp. 1-6, 2013. Sousse: IEEE.
[14] Zhang, X., Povey, D., & Khudanpur, S. “A diversity-penalizing training method for deep learning”, In
INTERSPEECH, pp. 3590-3594, 2015.
[15] Hamza, M., Khodadadi, T., & Palaniappan, S. "A novel automatic voice recognition system based on text-independent in a noisy environment", International Journal of Electrical and Computer Engineering, vol. 10, no. 4, pp. 3643, 2020.
[16] Ouisaadane, A., Safi, S., & Frikel, M. ‘’Arabic digits speech recognition and speaker identification in noisy
environment using a hybrid model of VQ and GMM”, TELKOMNIKA (Telecommunication Computing
Electronics and Control), vol. 18, no. 4, pp. 2193-2204, 2020.
[17] Mouaz, B., Abderrahim, B. H., & Abdelmajid, E., “Speech Recognition of Moroccan Dialect Using Hidden
Markov Models”, International Journal of Artificial Intelligence (IJ-AI), vol. 8, no. 1, 2019, DOI:
10.11591/ijai.v8.i1.pp7-13.
[18] Rabiner L-R., Juang B-H., “Fundamentals of Speech Recognition”, Prentice-Hall, 1993.
[19] Cing, D. L., & Soe, K. M. “Improving accuracy of part-of-speech (POS) tagging using hidden markov model and
morphological analysis for Myanmar Language”, International Journal of Electrical & Computer Engineering
(IJECE), vol. 10, pp. 2088-8708, 2020.
[20] Quinlan, J.R, “Simplifying decision trees”, International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221-
234, 1987.
[21] Quinlan, J.R. “Induction of decision trees”, Mach Learn vol. 1, pp. 81–106, 1986,
https://doi.org/10.1007/BF00116251.
[22] Quinlan, J. R., & Cameron-Jones, R. M. “FOIL: A midterm report”, In European conference on machine learning
pp. 1-20, 1993.
[23] Graja S., and Boucher J., "Hidden Markov tree model applied to ECG delineation," in IEEE Transactions on
Instrumentation and Measurement, vol. 54, no. 6, pp. 2163-2168, 2005.
[24] Alkhateeb, F., Baget, J.-F., & Euzenat, J., "Extending SPARQL with regular expression patterns (for querying RDF)", Journal of Web Semantics, vol. 7, no. 2, pp. 57-73, 2009.
[25] Li, J., & Wang, J., "System and method for automatic linguistic indexing of images by a statistical modeling approach", U.S. Patent No. 7,394,947, 2008.
[26] Wang, X. H., Liu, A., & Zhang, S. Q. "New facial expression recognition based on FSVM and KNN", Optik - International Journal for Light and Electron Optics, vol. 126, no. 21, pp. 3132-3134, 2015.
[27] Cherif, W. Optimization of K-NN algorithm by clustering and reliability coefficients: application to breast-cancer
diagnosis. Procedia Computer Science, vol. 127, pp. 293-299, 2018.
[28] Cherif, W., Madani, A., & Kissi, M. “A combination of low-level light stemming and support vector machines for
the classification of Arabic opinions”, In 2016 11th International Conference on Intelligent Systems: Theories and
Applications (SITA), pp. 1-5, 2016. IEEE.
BIOGRAPHIES OF AUTHORS
Bezoui Mouaz (born in 1985) received a master's degree in networks and telecommunications in 2011 from Chouaïb Doukkali University, Faculty of Science, El Jadida, Morocco. He is currently pursuing his Ph.D. degree in Computer Science at Chouaïb Doukkali University, Faculty of Science, El Jadida, Morocco. Bezoui Mouaz is the author of over 3 technical publications. His research interests include Artificial Intelligence, Natural Language Processing, Machine Learning, Speech Recognition, and Signal Processing.
Walid Cherif completed his PhD in Data Science at Chouaib Doukkali University (Morocco). He is an IT engineer from the National Institute of Statistics and Applied Economics (Rabat, Morocco), where he is now a member of the SI2M laboratory. He is the author of many papers in reputed journals and international conferences. His research interests include Data Science, Artificial Intelligence, Machine Learning, Deep Learning, and Text and Data Mining.
Beni-hssane Abderrahim is currently a Full Professor in the Department of Computer Science at Chouaïb Doukkali University, Faculty of Science, El Jadida, Morocco, where he is a member of the LAROSERI laboratory. He is the author of many papers and technical publications in reputed journals and international conferences. His research interests include Artificial Intelligence, Natural Language Processing, Text and Data Mining, Machine Learning, Data Mining and Knowledge Discovery, the Internet of Things, and Big Data.
Elmoutaouakkil Abdelmajid is currently a Full Professor in the Department of Computer Science at Chouaïb Doukkali University, Faculty of Science, El Jadida, Morocco, where he is a member of the LAROSERI laboratory. He is the author of many papers and technical publications in reputed journals and international conferences. His research interests include Image Processing, Speech Recognition, Data Mining, Knowledge Discovery, Machine Learning, and Big Data.