Researchers in many countries have developed automatic speech recognition (ASR) systems to advance information and communication technology for their languages. This work aims to improve ASR performance for the Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. Owing to its locality and pooling operations, a CNN can reduce spectral variation and model the spectral correlations that exist in the signal; the impact of these hyperparameters on CNN accuracy in ASR tasks is therefore investigated. A 42-hour data set is used for training, and ASR performance is evaluated on two open test sets: web news and recorded data. As Myanmar is a syllable-timed language, a syllable-based ASR system was built and compared with a word-based one. The system achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and a 21.83% WER and a 15.76% SER on TestSet2.
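Since the results above are reported as WER and SER, a minimal sketch of how such error rates are typically computed may help; this is plain Levenshtein alignment in Python, not the paper's own scoring tool, and the function names are mine.

```python
# Minimal sketch: word/syllable error rate via Levenshtein alignment.
def edit_distance(ref, hyp):
    """Dynamic-programming edit distance between two token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def error_rate(ref_tokens, hyp_tokens):
    """WER if tokens are words, SER if tokens are syllables."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)
```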
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are basic techniques for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons. Dynamic time warping (DTW) is widely used for aligning two multidimensional sequences: it finds an optimal match between them, and the distance between the aligned sequences should be smaller than that between the unaligned sequences, so the improvement in alignment can be estimated from the corresponding distances. The technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of alignment improvement for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 male and 3 female). The recorded material was segmented manually and aligned at the sentence and phoneme levels. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations show more than a 20% reduction in the average Mahalanobis distance.
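As an illustration of the alignment step this abstract relies on, the following is a minimal DTW sketch in NumPy. It assumes a plain Euclidean frame distance rather than the paper's HNM/Mahalanobis setup, and the function name is illustrative.

```python
import numpy as np

def dtw_align(x, y):
    """Minimal DTW sketch: align two feature sequences (frames x dims)
    and return the accumulated cost and the warping path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    # Backtrack to recover the frame-to-frame alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)],
                   key=lambda t: cost[t])
    return cost[n, m], path[::-1]
```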
EFFECT OF MFCC BASED FEATURES FOR SPEECH SIGNAL ALIGNMENTS (ijnlc)
The fundamental techniques for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signal, and HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping (DTW) is a well-known technique for aligning two multidimensional sequences: it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of DTW on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the material was segmented manually and aligned at the sentence, word, and phoneme levels, and the Mahalanobis distance (MD) was computed between aligned frames. The investigation shows better alignment in the HNM parametric domain and indicates that effective speech alignment can be carried out even at the phrase level.
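A minimal sketch of the feature side follows, assuming the librosa library for MFCC extraction and a corpus-level inverse covariance for the Mahalanobis distance; none of these specifics come from the paper.

```python
import numpy as np
import librosa  # assumed available; any MFCC extractor would do

def mfcc_frames(path, n_mfcc=13):
    """Extract MFCC frames (time x coefficients) from a wav file."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def mahalanobis(u, v, cov_inv):
    """Mahalanobis distance between two aligned feature frames."""
    diff = u - v
    return float(np.sqrt(diff @ cov_inv @ diff))

# cov_inv would typically be the inverse covariance estimated over all
# frames of the corpus, e.g. (wav_paths is a hypothetical file list):
# frames = np.vstack([mfcc_frames(p) for p in wav_paths])
# cov_inv = np.linalg.inv(np.cov(frames.T))
```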
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco... (TELKOMNIKA JOURNAL)
Sundanese is one of the most widely spoken languages in Indonesia, which makes research on it essential and motivates this study. The components most critical to high recognition accuracy are feature extraction and the classifier, and the goal of this study is to analyze the former. Three feature extraction methods were tested: Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The output of each method became the input of the classifier, a Hidden Markov Model. Before classification, however, the features had to be quantized; in this study, quantization was based on clustering. Each result was compared across the number of clusters and hidden states used. The dataset came from four people who each spoke the digits zero to nine 60 times. All three feature extraction methods ultimately produced the same performance on the corpus used.
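As a sketch of the clustering-based quantization step that precedes discrete-HMM classification, the following uses scikit-learn's KMeans; the codebook size shown is an arbitrary assumption, since the paper compares several cluster counts.

```python
from sklearn.cluster import KMeans

def build_codebook(frames, n_clusters=64):
    """Cluster feature frames (n_frames x n_dims) into a codebook."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(frames)
    return km

def quantize(codebook, frames):
    """Map each frame to its nearest codeword index (discrete symbols
    suitable as observations for a discrete HMM)."""
    return codebook.predict(frames)
```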
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient (ijait)
Development of Malayalam speech recognition is still in its infancy, although much work has been done on other Indian languages. In this paper we present the first speaker-independent Malayalam isolated-word recognizer based on Perceptual Linear Predictive (PLP) cepstral coefficients and a Hidden Markov Model (HMM). The performance of the developed system has been evaluated with different numbers of HMM states. The system was trained on 21 male and female speakers aged 19 to 41 years and obtained an accuracy of 99.5% on unseen data.
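A minimal sketch of isolated-digit recognition with one HMM per digit is shown below, assuming the hmmlearn library; feature extraction (PLP in the paper) is abstracted away, and the state count is illustrative.

```python
import numpy as np
from hmmlearn import hmm  # assumed available

def train_digit_models(features_by_digit, n_states=5):
    """features_by_digit: {digit: [utterance arrays (frames x dims)]}.
    Trains one Gaussian HMM per digit class."""
    models = {}
    for digit, utts in features_by_digit.items():
        X = np.vstack(utts)
        lengths = [len(u) for u in utts]
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[digit] = m
    return models

def recognize(models, utt):
    """Pick the digit whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda d: models[d].score(utt))
```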
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK (ijitcs)
Speech technology is an emerging field, and automatic speech recognition has made advances in recent years. Much research has been performed for many foreign and regional languages, and multilingual speech processing has recently attracted particular research interest. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION (ijma)
The performance of various acoustic feature extraction methods is compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. The acoustic features are a series of vectors that represent the speech signal and can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is used first as the acoustic vector extraction technique, chosen for its widespread popularity. Other extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), are then also used; these two methods closely model the human auditory system. The feature vectors are trained with the LSTM network, and the resulting models of different phonemes are compared using two statistical tools, the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of the acoustic features.
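The two distance measures named above have closed forms for Gaussian phoneme models; a minimal NumPy sketch follows, with function names of my own choosing.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians:
    1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 det S2)),
    where S = (S1 + S2) / 2."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return float(term1 + term2)

def mahalanobis_to_model(x, mu, cov):
    """Mahalanobis distance from a feature vector to a phoneme model."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```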
Performance Calculation of Speech Synthesis Methods for Hindi language (iosrjce)
We propose a model for deep learning based multimodal sentiment analysis, using the MOUD dataset for experimentation. We developed two parallel models, one text based and one audio based, and then fused the heterogeneous feature maps taken from their intermediate layers to complete the architecture. The performance measures (accuracy, precision, recall, and F1-score) are observed to outperform the existing models.
In present-day communications, speech signals are contaminated by various sorts of noise that degrade speech quality and adversely impact speech recognition performance. To overcome these issues, a novel speech enhancement approach using modified Wiener filtering is developed: power spectrum computation is applied to the degraded signal to obtain the noise characteristics from the noisy spectrum. In the next phase, an MMSE technique is applied in which the Gaussian distribution of each signal, i.e. the original and the noisy signal, is analyzed. The Gaussian distribution provides spectrum estimates and spectral-coefficient parameters that can be used for probabilistic model formulation. Moreover, a-priori-SNR computation is incorporated for coefficient updating and noise-presence estimation, which operates similarly to conventional VAD. However, the conventional VAD scheme is based on a hard threshold that cannot deliver satisfactory performance, so a soft-decision threshold is developed to improve speech enhancement. An extensive simulation study is carried out in MATLAB on the NOIZEUS speech database, and a comparative study shows the proposed approach outperforming the existing technique.
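A minimal sketch of a Wiener gain driven by decision-directed a-priori-SNR estimation is given below; it operates on STFT magnitudes, omits the paper's soft-decision noise update, and all parameter values are assumptions.

```python
import numpy as np

def wiener_enhance(noisy_mag, noise_psd, alpha=0.98):
    """noisy_mag: |STFT| magnitudes (frames x bins);
    noise_psd: per-bin noise power estimate.
    Returns enhanced magnitudes via a Wiener gain with
    decision-directed a-priori-SNR smoothing."""
    prev_clean_psd = np.zeros_like(noise_psd)
    out = np.empty_like(noisy_mag)
    for t, frame in enumerate(noisy_mag):
        post_snr = (frame ** 2) / noise_psd                # a posteriori SNR
        prio_snr = (alpha * prev_clean_psd / noise_psd
                    + (1 - alpha) * np.maximum(post_snr - 1, 0))
        gain = prio_snr / (1.0 + prio_snr)                 # Wiener gain
        out[t] = gain * frame
        prev_clean_psd = out[t] ** 2
    return out
```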
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute pronunciation-by-pronunciation phonetic distances and a critical-distance threshold criterion for classification. The proposed method consists of two steps: a training step that estimates the critical-distance parameter from transcribed data, and a classification step that uses this criterion to sort input utterances into pronunciation variants and OOV words. The algorithm is implemented in Java, and the classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. A confusion matrix and the precision, recall, and accuracy metrics are used for performance evaluation. Experimental results show significant improvement over existing classifiers.
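A minimal sketch of the threshold classification step might look as follows; it reuses the edit_distance helper from the WER sketch earlier as a stand-in for Dynamic Phone Warping, so the phone-similarity costs of the actual DPW technique are omitted, and all names here are illustrative.

```python
def phone_distance(phones_a, phones_b):
    """Length-normalized phonetic distance between two phone sequences
    (edit_distance is the helper defined in the WER sketch above)."""
    return edit_distance(phones_a, phones_b) / max(len(phones_a),
                                                   len(phones_b))

def classify_utterance(phones, lexicon, critical_distance):
    """Return the nearest word if within the critical distance, else 'OOV'.
    lexicon maps words to canonical phone sequences."""
    word, dist = min(((w, phone_distance(phones, p))
                      for w, p in lexicon.items()),
                     key=lambda item: item[1])
    return word if dist <= critical_distance else "OOV"
```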
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include formant synthesis and articulatory synthesis. Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate approximations of the vocal-tract transfer function; articulatory synthesis produces speech by directly modeling the behavior of the human articulators. Second-generation techniques comprise concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech output by concatenating segments of recorded speech and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and periods of the harmonics, so the value of the fundamental can be changed while keeping the same basic spectral envelope. Finally, third-generation techniques include Hidden Markov Model (HMM) synthesis and unit selection synthesis. HMM synthesis trains a parametric model and produces high-quality speech, while unit selection operates by selecting, from a large speech database, the best sequence of units matching the target specification.
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ... (kevig)
Neural machine translation is a new approach to machine translation that has shown effective results for high-resource languages. Recently, attention-based neural machine translation with large-scale parallel corpora has played an important role in achieving high translation performance. In this research, a parallel corpus for the Myanmar-English language pair is prepared, and attention-based neural machine translation models are introduced at the word-to-word, character-to-word, and syllable-to-word levels. We experiment with the proposed models on translating long sentences and on addressing morphological problems, and source-side monolingual data are also used to mitigate the low-resource problem. This work thus investigates improving the Myanmar-to-English neural machine translation system. The experimental results show that the syllable-to-word neural machine translation model obtains an improvement over the baseline systems.
Spoken language identification using i-vectors, x-vectors, PLDA and logistic ... (journalBEEI)
In this paper, i-vectors and x-vectors are used to extract features from speech signals in local Indonesian languages, namely Javanese, Sundanese, and Minang, to help a classifier identify the language spoken. Probabilistic linear discriminant analysis (PLDA) is used as the baseline classifier, and logistic regression is also used because prior studies show it outperforms PLDA for classifying speech data. Once extracted, the features are classified with these classifiers. In the experiment, the test data were segmented into 3-, 10-, and 30-second segments. The study further tests multiple parameters of the i-vector and x-vector methods and compares PLDA and logistic regression as classifiers. The x-vector scores better than the i-vector for every segment length when PLDA is the classifier; with logistic regression, however, the i-vector still has better accuracy than the x-vector.
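Once fixed-length i-vector or x-vector embeddings are extracted, the logistic-regression stage is straightforward; a minimal scikit-learn sketch follows, with embedding extraction itself out of scope and all names illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_lid(train_vecs, train_langs):
    """Fit a multinomial logistic-regression language classifier on
    fixed-length utterance embeddings (i-vectors or x-vectors)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_vecs, train_langs)
    return clf

def evaluate_lid(clf, test_vecs, test_langs):
    """Accuracy of the classifier on held-out embeddings."""
    return accuracy_score(test_langs, clf.predict(test_vecs))
```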
Identification of frequency domain using quantum based optimization neural ne... (eSAT Publishing House)
Artificial Neural Networks have proved their efficiency in a large number of research domains. In this paper, we apply them to Arabic text for language modeling, text generation, and missing-text prediction. On the one hand, we adapt Recurrent Neural Network architectures to model the Arabic language and generate correct Arabic sequences. On the other hand, Convolutional Neural Networks are parameterized, based on specific features of Arabic, to predict missing text in Arabic documents. We demonstrate the power of our adapted models in generating and predicting correct Arabic text compared to the standard model. The models were trained and tested on well-known free Arabic datasets, and the results were promising, with sufficient accuracy.
Marathi Isolated Word Recognition System using MFCC and DTW Features (IDES Editor)
This paper presents a Marathi database and an isolated word recognition system based on Mel-frequency cepstral coefficients (MFCC) and Dynamic Time Warping (DTW) as features. For feature extraction, the Marathi speech database was designed using the Computerized Speech Lab. The database consists of the Marathi vowels, isolated words starting with each vowel, and simple Marathi sentences; each word was repeated three times by each of the 35 speakers. The paper presents the comparative recognition accuracy of DTW and MFCC.
SPEAKER VERIFICATION USING ACOUSTIC AND PROSODIC FEATURES (acijjournal)
In this paper we report an experiment carried out on a recently collected speaker recognition database, the Arunachali Language Speech Database (ALS-DB), to compare the performance of acoustic and prosodic features for the speaker verification task. The database consists of speech recorded from 200 speakers with Arunachali languages of North-East India as their mother tongue, and it is evaluated using a Gaussian mixture model-Universal Background Model (GMM-UBM) based speaker verification system. The acoustic feature considered in the present study is Mel-Frequency Cepstral Coefficients (MFCC) along with its derivatives. The performance of the system has been evaluated for the acoustic and prosodic features individually as well as in combination. Acoustic features, considered individually, provide better performance than prosodic features; however, when prosodic features are combined with acoustic features, the system outperforms both feature types taken individually, with nearly 5% improvement in recognition accuracy over the acoustic-only system and nearly 20% over the prosodic-only system.
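A minimal sketch of GMM-UBM style scoring with scikit-learn follows. A full system would MAP-adapt the UBM to each speaker; training a separate speaker GMM, as here, is a simplification of the paper's setup, and the component count and threshold are assumptions.

```python
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames, n_components=64):
    """Fit a universal background model on pooled feature frames."""
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=100)
    ubm.fit(background_frames)
    return ubm

def verify(speaker_gmm, ubm, test_frames, threshold=0.0):
    """Accept the claimed identity if the average log-likelihood
    ratio between speaker model and UBM exceeds the threshold."""
    llr = speaker_gmm.score(test_frames) - ubm.score(test_frames)
    return llr > threshold
```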
Machine learning for Arabic phonemes recognition using electrolarynx speech (IJECEIAES)
Automatic speech recognition is one of the essential ways of interacting with machines, and interest in speech-based intelligent systems has grown over the past few decades. There is therefore a need for more efficient human speech recognition methods to ensure reliable communication between individuals and machines. This paper is concerned with Arabic phoneme recognition from an electrolarynx, a device used by cancer patients whose vocal cords have been removed; the goal is to find the machine learning model that best classifies phonemes produced with the device. The recognition experiments employ several machine learning schemes, including a convolutional neural network, a recurrent neural network, an artificial neural network (ANN), random forest, extreme gradient boosting (XGBoost), and long short-term memory. Modern Standard Arabic is used for the training and testing phases, and the dataset covers both ordinary speech and electrolarynx speech recorded by the same person. Mel frequency cepstral coefficients are used as speech features. The results show that the ANN outperformed the other methods with an accuracy of 75%, a precision of 77%, and a phoneme error rate (PER) of 21.85%.
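For the classical models in the comparison, a minimal scikit-learn sketch of cross-validated evaluation on MFCC feature vectors might look like this; the deep models and XGBoost are omitted, and all hyperparameters are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    """X: utterance-level MFCC features (n_samples x n_dims);
    y: phoneme labels. Returns mean 5-fold CV accuracy per model."""
    models = {
        "ANN": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
        "RandomForest": RandomForestClassifier(n_estimators=200),
    }
    return {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}
```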
High Quality Arabic Concatenative Speech Synthesis (sipij)
This paper describes the implementation of TD-PSOLA tools to improve the quality of an Arabic Text-to-Speech (TTS) system based on diphone concatenation with a TD-PSOLA modification synthesizer. It presents techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add) method, an approach based on decomposing the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modification of the speech signal, together with adjustment and optimization of the diphone database with integrated vowels.
TUNING DARI SPEECH CLASSIFICATION EMPLOYING DEEP NEURAL NETWORKS (IJCI JOURNAL)
Recently, many researchers have focused on building and improving speech recognition systems to facilitate and enhance human-computer interaction. Today, Automatic Speech Recognition (ASR) systems have become important and common tools, from games to translation systems, robots, and so on. However, there is still a need for research on speech recognition for low-resource languages. This article deals with isolated-word recognition for the Dari language, using the Mel-frequency cepstral coefficient (MFCC) feature extraction method and three different deep neural networks, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), and a Multilayer Perceptron (MLP), plus two hybrid CNN-RNN models. We evaluate our models on our in-house isolated Dari words corpus, which consists of 1000 utterances of 20 short Dari terms. This study obtained an impressive average accuracy of 98.365%.
Hindi digits recognition system on speech data collected in different natural... (csandit)
This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs for each speaker, since the data were collected in their respective homes: vehicle horn noise in road-facing rooms, internal background noise such as opening doors in some rooms, silence in others. All these recordings are used for training the acoustic model, which is built from the audio data of 8 speakers, and the vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate. The efficiency of the recognizer on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR (ijcseit)
The proposed system is a syllable-based Myanmar speech recognition system with three stages: feature extraction, phone recognition, and decoding. In feature extraction, the system transforms the input speech waveform into a sequence of acoustic feature vectors, each representing the information in a small time window of the signal. The phone recognition stage then computes the likelihood of the observed feature vectors given linguistic units (words, phones, sub-parts of phones). Finally, the decoding stage combines the Acoustic Model (AM), which consists of this sequence of acoustic likelihoods, with a phonetic dictionary of word pronunciations and the Language Model (LM), and produces the most likely word sequence as output. The system builds the Myanmar language model using syllable segmentation and a syllable-based n-gram method.
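A minimal sketch of the syllable-based n-gram step follows: a bigram model with add-one smoothing over pre-segmented syllable sequences. The smoothing choice is an assumption, as the paper does not specify one, and Myanmar syllable segmentation is assumed to happen upstream.

```python
from collections import Counter

def train_bigram_lm(corpus):
    """corpus: list of syllable sequences (lists of strings).
    Returns a conditional probability function p(cur | prev)
    with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    for syllables in corpus:
        seq = ["<s>"] + syllables + ["</s>"]
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    vocab_size = len(unigrams)
    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

# Usage during decoding (training_syllable_sequences is hypothetical):
# lm = train_bigram_lm(training_syllable_sequences)
# score = lm("<s>", first_syllable)
```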
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE (ijnlc)
Named Entity Recognition (NER) for Myanmar is essential to Myanmar natural language processing research. In this work, NER for Myanmar is treated as a sequence tagging problem, and the effectiveness of deep neural networks on the task is investigated by applying deep neural network architectures to syllable-level Myanmar contexts. The first manually annotated NER corpus for Myanmar is also constructed and proposed; it draws sentences from online news websites and from the ALT-Parallel-Corpus, part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on the NER task for Myanmar. The experimental results show that these neural sequence models produce promising results compared to the baseline CRF model; among the neural architectures, a bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also aims to establish the effectiveness of neural network approaches to Myanmar text processing and to promote further research on this understudied language.
Hybrid Phonemic and Graphemic Modeling for Arabic Speech Recognition (Waqas Tariq)
In this research, we propose a hybrid approach to acoustic and pronunciation modeling for Arabic speech recognition. The hybrid approach benefits from both vocalized and non-vocalized Arabic resources, exploiting the fact that the amount of non-vocalized material always exceeds the vocalized material. Two baseline speech recognition systems were built, phonemic and graphemic, and the two baseline acoustic models were fused after independent training to create a hybrid acoustic model. Pronunciation modeling was also hybrid, generating graphemic as well as phonemic pronunciation variants, and different techniques are proposed to reduce pronunciation-model complexity. Experiments were conducted on a large-vocabulary broadcast news domain. The proposed hybrid approach showed a relative WER reduction of 8.8% to 12.6%, depending on the pronunciation modeling settings and the supervision in the baseline systems.
A new framework based on KNN and DT for speech identification through emphat...nooriasukmaningtyas
Arabic dialects differ substantially from Modern Standard Arabic and from each other in terms of phonology, morphology, lexical choice and syntax. This makes the identification of dialects from speech a very difficult task. In this paper, we introduce a speech recognition system that automatically identifies the gender of the speaker, the emphatic letter pronounced, and the diacritic of these emphatic letters, given a sample of the speaker's speech. First, we examined the performance of the single classifier hidden Markov model (HMM) applied to the samples of our data corpus. Then we evaluated our proposed approach KNN-DT, a hybridization of two classifiers, namely decision trees (DT) and K-nearest neighbors (KNN). Both models are singularly applied directly to the data corpus to recognize the emphatic letter of the sound as well as the diacritic and the gender of the speaker. This hybridization proved quite interesting; it improved the speech recognition accuracy by more than 10% compared to state-of-the-art approaches.
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORK PARAMETERS
International Journal on Natural Language Computing (IJNLC) Vol.7, No.6, December 2018
DOI: 10.5121/ijnlc.2018.7601
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORK PARAMETERS
Aye Nyein Mon¹, Win Pa Pa² and Ye Kyaw Thu³
¹,² Natural Language Processing Lab., University of Computer Studies, Yangon, Myanmar
³ Language and Speech Science Research Lab., Waseda University, Japan
ABSTRACT
Researchers of many nations have developed automatic speech recognition (ASR) to show their national
improvement in information and communication technology for their languages. This work intends to
improve ASR performance for the Myanmar language by changing different Convolutional Neural Network
(CNN) hyperparameters such as the number of feature maps and the pooling size. The CNN has the ability
to reduce spectral variations and to model the spectral correlations that exist in the signal, due to
its locality and pooling operations. Therefore, the impact of these hyperparameters on CNN accuracy in
ASR tasks is investigated. A 42-hour data set is used as training data, and ASR performance is
evaluated on two open test sets: web news and recorded data. As the Myanmar language is syllable-timed,
an ASR based on syllables was built and compared with an ASR based on words. As a result, it achieved
16.7% word error rate (WER) and 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76%
SER on TestSet2.
KEYWORDS
Automatic Speech Recognition (ASR), Myanmar Speech Corpus, Convolutional Neural Network (CNN)
1. INTRODUCTION
Automatic speech recognition research has been ongoing for more than four decades. The purpose of a
speech recognition system is to make interaction between humans and a computer, robot or any other
machine convenient via speech. Speech recognition is the conversion of a speech signal to text.
Speech recognition systems can be characterized by a number of parameters that specify their
performance, such as utterances (isolated words vs. continuous speech), speaking style (read speech
in a formal style vs. spontaneous and conversational speech in a casual style), speaking situation
(human-to-machine speech vs. human-to-human speech), speaker dependency (speaker dependent, speaker
adaptive, vs. speaker independent), vocabulary size (small vocabulary of fewer than 100 words, medium
vocabulary of fewer than 10,000 words, vs. large vocabulary of more than 10,000 words), language model
(N-gram statistical model, finite state model, context free model, vs. context sensitive model),
environmental condition (high signal-to-noise ratio (SNR) of greater than 30 dB, medium SNR, vs. low
SNR of lower than 10 dB) and transducer (close-talking microphone, far-field microphone, microphone
array, vs. telephone) [1].
In recent decades, speech recognition systems have been developed in many nations for their own
languages. Researchers have applied statistical pattern recognition to develop ASR for their
languages. Furthermore, they have used particular features of their languages, such as tones and
pitch. For example, for Mandarin, Vietnamese and Thai, tone-related information was appended to the
acoustic modelling to increase ASR accuracy [2] [3] [4].
For low-resourced languages, researchers have developed speech recognition systems by building speech
corpora from scratch. For instance, the AGH corpus of Polish speech [5], an agglutinative Hungarian
spontaneous speech database [6], a Bengali speech corpus [7], a Bulgarian speech corpus [8], and a
European Portuguese corpus [9] were developed for ASR technology.
There are previous works on the Myanmar language using deep learning techniques. Thandar Soe et al.
[10] built a robust automatic speech recognizer using deep convolutional neural networks (CNNs).
Multiple acoustic models were trained with different acoustic feature scales, and the multi-CNN
acoustic models were combined based on a Recognizer Output Voting Error Reduction (ROVER) algorithm
for the final speech recognition experiments. They showed that integrating temporal multi-scale
features in model training achieved a lower error rate than the best individual system on one
temporal-scale feature. Aye Nyein Mon et al. [11] designed and developed a speech corpus from web news
for Myanmar ASR. The speech corpus consists of 15 hours of read speech data collected from online web
news, with 155 speakers (109 females and 46 males). Moreover, this corpus was evaluated using a
state-of-the-art acoustic modeling approach, CNN, with experiments conducted using the default CNN
parameters. It showed that corpora built from Internet resources give promising results. In [12],
Aye Nyein Mon et al. explored the effect of tones on Myanmar language speech recognition using a
Convolutional Neural Network (CNN). Experiments were conducted by modeling tones, integrating them
into the phoneme set and incorporating them into the convolutional neural network, a state-of-the-art
acoustic model. Moreover, tonal questions were utilized to build the phonetic decision tree. In that
work, the hyperparameters of the CNN architecture were not changed and the default settings were used
for all experiments. It showed that, in comparison with a Deep Neural Network (DNN) baseline, the CNN
model achieves better ASR performance, and that the CNN model with tone information obtained better
accuracy than the one without tone information. Hay Mar Soe Naing et al. [13] presented a Myanmar
large vocabulary continuous speech recognition system for the travel domain. A phonemically balanced
corpus consisting of 4,000 sentences and 40 hours of speech was used as training data. In that work,
three kinds of acoustic models were developed and compared: Gaussian mixture model (GMM), DNN (cross
entropy), and DNN (state-level minimum Bayes risk, sMBR). Tonal and pitch features were incorporated
into the acoustic modeling, and experiments were conducted with and without those features. The
differences between a word-based language model (LM) and a syllable-based LM were also investigated.
The best WER and SER were achieved with the sequence-discriminatively trained DNN (sMBR).
In this paper, Myanmar ASR performance is improved by altering the hyperparameters of the CNN. The CNN
has a locality property that reduces the number of neural network weights to be learned and hence
decreases overfitting. Moreover, the pooling operation is very useful in handling the small frequency
shifts that are common in speech signals. Therefore, different CNN parameters, such as the number of
feature maps and the pooling size, have been varied to obtain better ASR accuracy for the Myanmar
language. In addition, as Myanmar is a tonal and syllable-timed language, a syllable-based Myanmar ASR
is developed using a 42-hour data set, and the performance of word-based and syllable-based ASR is
compared on two open test sets: web data and recorded data.
This paper is organized as follows. Section 2 presents the Myanmar language. The Convolutional Neural
Network (CNN) is described in Section 3. Experiments on improving the CNN architecture and on
comparing word-based and syllable-based ASR are reported in Section 4. The conclusion and future work
are given in Section 5.
2. MYANMAR LANGUAGE
The Myanmar language is a Sino-Tibetan language spoken in Myanmar. It is the official language of
Myanmar, spoken by about 33 million people; ethnic groups related to the Myanmar people use it as a
first language, and about 10 million members of ethnic minorities in Myanmar speak it as a second
language.
Myanmar is a tonal, largely monosyllabic and analytic language. The Myanmar language has four
contrastive tones, and syllables can have different meanings on the basis of tone. A tone is denoted
by a diacritic mark. A simple syllable of the Myanmar language is composed of either a vowel by itself
or a consonant combined with a vowel. Unlike languages such as English, it has subject-object-verb
order, and there are no spaces between words in writing; therefore, segmentation is needed to
determine word boundaries. Myanmar characters fall into three groups: consonants (known as "Byee"),
medials (known as "Byee Twe"), and vowels (known as "Thara"). There are 33 basic consonants, 4 basic
medials, and 12 basic vowels in Myanmar script. Myanmar numerals are decimal-based for counting [14].
3. CONVOLUTIONAL NEURAL NETWORK (CNN)
The Convolutional Neural Network is a feed-forward artificial neural network which includes
convolutional and pooling layers. The CNN requires the input data to be organized in a certain way:
for processing speech signals, the features must be organized along frequency or time (or both) so
that the convolution operation can be applied correctly. In this work, a number of one-dimensional
(1-D) feature maps are used to consider only the frequency variations, which normally occur in speech
signals due to speaker differences [15].
Figure 1: An illustration of one CNN “layer” consisting of a pair of a convolution ply and a pooling ply in
succession, where mapping from either the input layer or a pooling ply to a convolution ply is based on eq.
(2) and mapping from a convolution ply to a pooling ply is based on eq. (3).
Once the input feature maps are formed, the convolution and pooling layers apply their respective
operations to generate the activations of the units in those layers, in sequence, as depicted in
Figure 1.
3.1 Convolution Ply
As shown in Figure 1, all input feature maps (assume I in total), Oi (i =1,…, I) are mapped into a
number of feature maps (assume J in total),Qj (j = 1, …, J) in the convolution ply based on a
number of local filters (I × J in total), wi,j (i = 1, . . ., I; j = 1, . . ., J ). The mapping can be
represented as the well-known convolution operation in signal processing [15]. Assuming input
feature maps are all one dimensional, each unit of one feature map in the convolution layer can be
computed as:
$q_{j,m} = \sum_{i=1}^{I} \sum_{n=1}^{F} o_{i,m+n-1}\, w_{i,j,n} + w_{0,j}, \quad (j = 1, \ldots, J)$    (1)

where,
o_{i,m} = the m-th unit of the i-th input feature map O_i,
q_{j,m} = the m-th unit of the j-th feature map Q_j in the convolution ply,
w_{i,j,n} = the n-th element of the weight vector w_{i,j}, connecting the i-th feature map of the
input to the j-th feature map of the convolution layer,
w_{0,j} = the bias added to the j-th feature map,
F = the filter size, which is the number of input bands that each unit of the convolution layer
receives.
When equation (1) is written in a more concise matrix form using the convolution operator $\ast$, it
becomes:

$Q_j = \sum_{i=1}^{I} O_i \ast w_{i,j}, \quad (j = 1, \ldots, J)$    (2)

where,
O_i = the i-th input feature map,
w_{i,j} = each local filter, with the weights flipped to adhere to the definition of the convolution
operation.
A complete convolutional layer is composed of many feature maps, generated with different
localized filters, to extract multiple kinds of local patterns at every location.
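To make the indexing in eq. (1) concrete, the following is a minimal NumPy sketch of a convolution ply
(our illustration, not the authors' Kaldi implementation); the array names mirror the symbols above,
and the bias term and the absence of a nonlinearity follow eq. (1) as written.

```python
import numpy as np

def convolution_ply(O, W, b):
    """Convolution ply of eq. (1): I input feature maps -> J output maps.

    O : (I, M) array    -- I one-dimensional input feature maps
    W : (I, J, F) array -- local filters w_{i,j} of size F
    b : (J,) array      -- bias w_{0,j} of each output feature map
    """
    I, M = O.shape
    _, J, F = W.shape
    Q = np.zeros((J, M - F + 1))
    for j in range(J):
        for m in range(M - F + 1):
            # q_{j,m} = sum_i sum_n o_{i,m+n-1} * w_{i,j,n} + w_{0,j}
            Q[j, m] = sum(O[i, m + n] * W[i, j, n]
                          for i in range(I) for n in range(F)) + b[j]
    return Q
```

Equivalently, each inner term is `np.convolve(O[i], W[i, j][::-1], mode="valid")`, which matches the
remark above that the filter weights are flipped to adhere to the convolution definition.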
3.2 Pooling Ply
Each neuron of the pooling layer receives activations from a local region of the previous
convolutional layer and reduces the resolution to produce a single output from that region. The
pooling function computes some overall property of a local region by using a simple function like
maximization or averaging. In this case, max pooling function is applied. The max pooling layer
performs down-sampling by dividing the input into rectangular pooling regions, and computing
the maximum of each region. The max pooling layer is defined as:
$p_{j,m} = \max_{n=1}^{G} q_{j,(m-1)\cdot s + n}$    (3)
where,
G = the pooling size,
s = the shift size, which determines the overlap of adjacent pooling windows.
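A matching sketch of the max pooling ply in eq. (3), again as illustrative NumPy rather than the
toolkit's implementation:

```python
import numpy as np

def max_pooling_ply(Q, G, s):
    """Max pooling ply of eq. (3).

    Q : (J, M) array -- convolution feature maps
    G : pooling size; s : shift size (s < G gives overlapping windows).
    With G = 3 and s = 1, this reproduces the setting used in Sec. 4.1.2.
    """
    J, M = Q.shape
    num_windows = (M - G) // s + 1
    P = np.zeros((J, num_windows))
    for j in range(J):
        for m in range(num_windows):
            # p_{j,m} = max over the G units starting at offset m*s
            P[j, m] = Q[j, m * s : m * s + G].max()
    return P
```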
4. EXPERIMENTS
Experiments are conducted using our developed speech corpus, which contains 42 hours of data. A 3-gram
language model with Kneser-Ney discounting is built using the SRILM language modeling toolkit [16].
Two open test sets are used, and detailed statistics of the training and testing sets are given in
Table 1.
The input features are 40-dimensional log mel-filter bank features spliced with an 11-frame context
window (5 frames to the left and right). The targets are 2,052 context-dependent (CD) triphone states
generated from a GMM-HMM model.
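As a concrete reading of this input layout, here is a small NumPy sketch of the frame splicing; the
edge-padding convention is our assumption, since the paper does not state how boundary frames are
handled.

```python
import numpy as np

def splice_frames(fbank, left=5, right=5):
    """Splice 40-dim log mel-filterbank frames with +/-5 frames of context,
    giving an 11-frame input window per target frame.

    fbank : (T, 40) array -> returns a (T, 11, 40) array.
    Edge frames are padded by repeating the first/last frame, one common
    convention (an assumption here).
    """
    T, D = fbank.shape
    padded = np.pad(fbank, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t : t + left + right + 1] for t in range(T)])
```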
For CNN training, a constant learning rate of 0.008 is used and is halved when the cross-validation
error stops decreasing noticeably; when the error rate no longer decreases, or starts to increase,
the training process is stopped. Stochastic gradient descent with mini-batches of 256 training
examples is used for backpropagation.
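This learning-rate schedule amounts to a "halve on plateau, then stop" loop. A schematic sketch
follows, where `model.sgd_epoch` and `cv_error_fn` are hypothetical stand-ins for one SGD pass and
the cross-validation evaluation; the exact stopping rule in the experiments may differ.

```python
def train_with_halving(model, train_batches, cv_error_fn, lr=0.008,
                       min_improvement=0.001):
    """Sketch of the schedule described above: start at 0.008, begin halving
    once cross-validation error stops improving noticeably, and stop when
    halving no longer helps."""
    prev_err = cv_error_fn(model)
    halving = False
    while True:
        model.sgd_epoch(train_batches, lr=lr, batch_size=256)
        err = cv_error_fn(model)
        if prev_err - err < min_improvement:
            if halving:      # still no gain while halving: stop training
                break
            halving = True   # improvement stalled: start halving the rate
        if halving:
            lr /= 2.0
        prev_err = err
    return model
```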
A Tesla K80 GPU is used for training all the neural networks, and the Kaldi [17] automatic speech
recognition toolkit is used for the experiments.
Table 1: Train and test sets statistics

Data       Size             Domain                     Speakers (F / M / Total)   Utterances
TrainSet   42 hrs 39 mins   Web News + Recorded Data   219 / 88 / 307             31,114
TestSet1   31 mins 55 sec   Web News                   5 / 3 / 8                  193
TestSet2   32 mins 40 sec   Recorded Data              3 / 2 / 5                  887
4.1. Optimization of CNN Parameters
In this experiment, the effect of changing various CNN parameters on Myanmar ASR performance is
investigated. One-dimensional convolution across the frequency domain is applied, and the network has
two convolutional layers, one pooling layer, and two fully connected hidden layers with 300 units per
layer. The pooling size and the number of feature maps have a great impact on ASR performance
[18] [19]. Hence, ASR accuracy is evaluated on the two open test sets, web news and recorded data,
while varying the number of feature maps and the pooling size.
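Purely to make the layer stack concrete, the following PyTorch sketch mirrors the configuration
described above; PyTorch and the sigmoid nonlinearities are our assumptions for illustration, since
the experiments themselves were run in Kaldi. The numbers (filter sizes 8 and 4, pooling with shift
size 1, 300-unit hidden layers, 2,052 targets) come from Sections 4 and 4.1.

```python
import torch.nn as nn

class MyanmarCNN(nn.Module):
    """Sketch of the CNN described above (not the authors' exact config):
    1-D convolution across the 40 mel-frequency bands, with the 11 spliced
    frames treated as input channels."""

    def __init__(self, n_maps1=128, pool_size=3, n_maps2=1024,
                 n_bands=40, n_targets=2052):
        super().__init__()
        conv1_out = n_bands - 8 + 1            # filter size 8 (Sec. 4.1.1)
        pool_out = conv1_out - pool_size + 1   # max pooling, shift size 1
        conv2_out = pool_out - 4 + 1           # filter size 4 (Sec. 4.1.3)
        self.net = nn.Sequential(
            nn.Conv1d(11, n_maps1, kernel_size=8), nn.Sigmoid(),
            nn.MaxPool1d(kernel_size=pool_size, stride=1),
            nn.Conv1d(n_maps1, n_maps2, kernel_size=4), nn.Sigmoid(),
            nn.Flatten(),
            nn.Linear(n_maps2 * conv2_out, 300), nn.Sigmoid(),
            nn.Linear(300, 300), nn.Sigmoid(),
            nn.Linear(300, n_targets),         # 2,052 CD triphone states
        )

    def forward(self, x):                      # x: (batch, 11, 40)
        return self.net(x)
```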
4.1.1 Number of Feature Maps of First Convolutional Layer
Firstly, different numbers of feature maps in the first convolutional layer are compared on the
evaluation results. The filter size of the first layer is set to 8, and the number of feature maps is
set to 32, 64, 128, 256, 512, and 1,024 respectively. Table 2 shows the WER% on the two open test
sets for the different numbers of feature maps in the first convolutional layer.
Table 2: Evaluation results on number of feature maps of the first convolutional layer

Number of Feature Maps   WER% (TestSet1)   WER% (TestSet2)
32                       18.97             24.01
64                       18.59             23.80
128                      18.18             23.50
256                      18.52             23.75
512                      18.31             23.65
1,024                    18.53             23.85
It is observed that the lowest WERs on both test sets are obtained with 128 feature maps. According to
Table 2, the WERs decrease slightly as the number of feature maps is increased up to 128; beyond 128,
the WERs do not decrease further and instead increase slightly. As a result, the lowest WERs are
attained with 128 feature maps.
4.1.2 Pooling Size
Pooling helps handle small frequency shifts in the speech signal, and max pooling has achieved greater
accuracy than other pooling types in ASR tasks [18]. Thus, max pooling is used in this experiment and
the best pooling size is investigated. The pooling layer is added above the first convolution layer,
and the experiments vary the max pooling size with a shift size of 1. The number of feature maps of
the first convolution layer is fixed at 128. Table 3 gives the evaluation results for the different
pooling sizes.
Table 3: Evaluation results on pooling size

Pooling Size   WER% (TestSet1)   WER% (TestSet2)
No pooling     18.18             23.50
2              18.05             23.00
3              17.22             22.56
4              18.06             23.33
From Table 3, it can be seen that the first convolution layer with 128 feature maps, followed by max
pooling of size 3, achieves the lowest WERs: 17.22% on TestSet1 and 22.56% on TestSet2. After the
pooling layer is appended, the WER decreases significantly compared with not using it. Hence, pooling
has an impact on ASR performance, and the lowest error rates are achieved with a pooling size of 3.
4.1.3 Number of Feature Maps of Second Convolutional Layer
The second convolutional layer is added on top of the pooling layer, and further experiments
investigate the best number of feature maps for this layer. The filter size of this layer is set to 4.
The best setting of 128 feature maps is kept in the first convolution layer, and the max pooling layer
with pool size 3 is also kept fixed. From Table 4, when the number of feature maps in the second
convolutional layer is set to 32, 64 or 128, the error rates on the test sets do not improve; but when
the number of feature maps is set above 128, the WERs on both test sets decrease slightly. Hence, the
more filters the convolutional layers have, the more acoustic features are extracted and the better
the network becomes at recognizing the acoustic patterns of unseen speakers.
Table 4: Evaluation results on number of feature maps in the second convolutional layer

Number of Feature Maps   WER% (TestSet1)   WER% (TestSet2)
32                       18.14             23.40
64                       18.40             23.64
128                      17.86             22.65
256                      17.03             22.20
512                      17.02             21.89
1,024                    16.70             21.83
Therefore, 128 and 1,024 feature maps in the first and second convolutional layers, together with max
pooling size 3, give the best accuracy for Myanmar ASR.
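Note that this search is sequential rather than a full grid: each hyperparameter is tuned with the
earlier choices held fixed. A schematic sketch, with a hypothetical `evaluate()` helper (our
invention) that trains the CNN with the given settings and returns its WER%:

```python
def greedy_search(evaluate):
    """Sketch of the one-parameter-at-a-time search in Sec. 4.1: each step
    fixes the best value found so far and tunes the next parameter."""
    cfg = {}
    # 4.1.1: feature maps of the first convolutional layer
    cfg["n_maps1"] = min([32, 64, 128, 256, 512, 1024],
                         key=lambda v: evaluate(n_maps1=v))
    # 4.1.2: max pooling size (pool size 1 stands in for "no pooling")
    cfg["pool_size"] = min([1, 2, 3, 4],
                           key=lambda v: evaluate(pool_size=v, **cfg))
    # 4.1.3: feature maps of the second convolutional layer
    cfg["n_maps2"] = min([32, 64, 128, 256, 512, 1024],
                         key=lambda v: evaluate(n_maps2=v, **cfg))
    return cfg   # the paper's result: 128 / pool size 3 / 1,024
```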
4.2 Comparison of Syllable vs. Word-Based ASR
The syllable is a basic unit of the Myanmar language, and every syllable has a meaning. Therefore, in
this experiment, a syllable-based ASR is developed, and the evaluation results of the syllable-based
and word-based ASR are compared. The CNN architecture above, with 128/1,024 feature maps in the first
and second convolutional layers and max pooling size 3, is applied to develop the ASR with syllable
units. As in the word-based ASR, a 3-gram language model with Kneser-Ney discounting is used.
Figure 2: Comparison of word and syllable-based ASRs
Figure 2 compares the word-based and syllable-based ASR in terms of word error rate (WER%) and
syllable error rate (SER%). It can be clearly seen that the SER% is significantly lower than the WER%
on both test sets, so the accuracy of the ASR using syllable units appears better than that of the ASR
using word units. Nevertheless, for a fair comparison, the syllables in the hypothesis text of the
syllable-based ASR are converted to word level and the WERs are calculated again. When the error rates
of the syllable-based and word-based ASR are compared in this way, the error rates of the
syllable-based ASR are greater than those of the word-based ASR. Therefore, the accuracy of the
word-based ASR is better than that of the syllable-based ASR.

Moreover, the hypothesis texts of the word-based and syllable-based ASR were manually evaluated by a
native Myanmar speaker. The recognition output of the word-based ASR is more effective, meaningful and
accurate than the syllable-based ASR's output. Therefore, it can be summarized that the word-based ASR
gives more reasonable output text than the syllable-based ASR.
Example reference and hypothesis sentences of the word-based ASR are given below. The reference text
consists of 41 correct words; in the hypothesis text, the total count of inserted, substituted and
deleted words is 6, so the WER is 14.63%. The error words in the hypothesis text are marked with
numbers.
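The WER and SER arithmetic above follows from a standard Levenshtein alignment over word or syllable
tokens; a minimal Python sketch (our illustration, not the Kaldi scoring used in the experiments)
precedes the example transcripts.

```python
def error_rate(reference, hypothesis):
    """WER/SER: (substitutions + deletions + insertions) / reference length,
    via edit-distance alignment of whitespace-separated tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i ref tokens and j hyp tokens
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# Worked numbers from the text: 6 errors over a 41-word reference gives
# 100 * 6 / 41 = 14.63% WER; 9 errors over 73 syllables gives 12.33% SER.
```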
Reference Text of Word-based ASR:
Hypothesis Text of Word-based ASR:
Example reference and hypothesis sentences of the syllable-based ASR are shown below. The reference
sentences of the word-based and syllable-based ASR are the same. The total number of syllables in the
reference text is 73; the total count of inserted, substituted and deleted syllables in the hypothesis
text is 9, so an SER of 12.33% is obtained. The highlighted words marked with numbers in the
hypothesis are the erroneous words.
Reference Text of Syllable-based ASR:
Hypothesis Text of Syllable-based ASR:
5. CONCLUSION
In this work, better accuracy of automatic speech recognition for the Myanmar language is obtained by
changing the hyperparameters of the CNN. It is found that the number of feature maps and the pooling
size of the CNN have a great impact on ASR performance, and lower error rates are achieved by varying
these hyperparameters. In addition, as the syllable is a basic unit of the Myanmar language, a
syllable-based ASR is also built and the performance of the syllable-based and word-based ASR is
compared. It can be concluded that although the SER% of the syllable-based ASR is lower than the WER%
of the word-based ASR, the recognition output of the word-based ASR is more effective, meaningful and
accurate than that of the syllable-based ASR. Thus, word-based ASR gives better recognizable results
than syllable-based ASR for the Myanmar language.
In the future, Myanmar ASR will be developed using a state-of-the-art end-to-end learning approach.
REFERENCES
1. S. K. Saksamudre, P. P. Shrishrimal, and R. R. Deshmukh, "A Review on Different Approaches for
Speech Recognition System", International Journal of Computer Applications (0975-8887), Vol. 115,
No. 22, April 2015.
2. X. Hu, M. Saiko, and C. Hori, "Incorporating Tone Features to Convolutional Neural Network to
Improve Mandarin/Thai Speech Recognition", Asia-Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA 2014), Chiang Mai, Thailand, December 9-12, 2014, pp. 1-5.
3. V. H. Nguyen, C. M. Luong, and T. T. Vu, "Tonal Phoneme Based Model for Vietnamese LVCSR", 2015
International Conference Oriental COCOSDA held jointly with the 2015 Conference on Asian Spoken
Language Research and Evaluation (O-COCOSDA/CASLRE), Shanghai, China, October 28-30, 2015, pp. 118-122.
4. X. Lei, M. Siu, M. Hwang, M. Ostendorf, and T. Lee, "Improved Tone Modeling for Mandarin Broadcast
News Speech Recognition", INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language
Processing, Pittsburgh, PA, USA, September 17-21, 2006.
5. P. Zelasko, B. Ziolko, T. Jadczyk, and D. Skurzok, "AGH Corpus of Polish Speech", Language
Resources and Evaluation, Vol. 50, No. 3, pp. 585-601, 2016.
6. T. Neuberger, D. Gyarmathy, T. E. Graczi, V. Horvath, M. Gosy, and A. Beke, "Development of a Large
Spontaneous Speech Database of Agglutinative Hungarian Language", in Text, Speech and Dialogue - 17th
International Conference (TSD 2014), Brno, Czech Republic, September 8-12, 2014, Proceedings,
pp. 424-431.
7. S. Mandal, B. Das, P. Mitra, and A. Basu, "Developing Bengali Speech Corpus for Phone Recognizer
Using Optimum Text Selection Technique", in International Conference on Asian Language Processing
(IALP 2011), Penang, Malaysia, November 15-17, 2011, pp. 268-271.
8. N. Hateva, P. Mitankin, and S. Mihov, "Bulphonc: Bulgarian Speech Corpus for the Development of ASR
Technology", in Proceedings of the Tenth International Conference on Language Resources and Evaluation
(LREC 2016), Portoroz, Slovenia, May 23-28, 2016.
9. F. Santos and T. Freitas, "CORP-ORAL: Spontaneous Speech Corpus for European Portuguese", in
Proceedings of the International Conference on Language Resources and Evaluation (LREC 2008),
Marrakech, Morocco, May 26 - June 1, 2008.
10. T. Soe, S. S. Maung, and N. N. Oo, "Combination of Multiple Acoustic Models with Multi-scale
Features for Myanmar Speech Recognition", International Journal of Computer (IJC), Vol. 28, No. 01,
pp. 112-121, February 2018. ISSN 2307-4523. Available at:
<http://ijcjournal.org/index.php/InternationalJournalOfComputer/article/view/1150>. Date accessed:
03 Sep. 2018.
11. A. N. Mon, W. P. Pa, Y. K. Thu, and Y. Sagisaka, "Developing a Speech Corpus from Web News for
Myanmar (Burmese) Language", 2017 20th Conference of the Oriental Chapter of the International
Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul,
2017, pp. 1-6.
12. A. N. Mon, W. P. Pa, and Y. K. Thu, "Exploring the Effect of Tones for Myanmar Language Speech
Recognition Using Convolutional Neural Network (CNN)", in: Hasida K., Pa W. (eds) Computational
Linguistics, PACLING 2017, Communications in Computer and Information Science, Vol. 781, Springer,
Singapore.
13. H. M. S. Naing, A. M. Hlaing, W. P. Pa, X. Hu, Y. K. Thu, C. Hori, and H. Kawai, "A Myanmar Large
Vocabulary Continuous Speech Recognition System", in Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA 2015), Hong Kong, December 16-19, 2015, pp. 320-327.
14. Myanmar Language Committee, "Myanmar Grammar", Ministry of Education, Myanmar, 2005.
15. O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional Neural
Networks for Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing,
Vol. 22, No. 10, October 2014.
16. A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit", pp. 901-904, 2002.
17. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek,
Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit",
IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society,
December 2011, IEEE Catalog No. CFP11SRW-USB.
18. T. N. Sainath, B. Kingsbury, A. Mohamed, G. E. Dahl, G. Saon, H. Soltau, T. Beran, A. Y. Aravkin,
and B. Ramabhadran, "Improvements to Deep Convolutional Neural Networks for LVCSR", 2013 IEEE Workshop
on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013,
pp. 315-320.
19. W. Chan and I. Lane, "Deep Convolutional Neural Networks for Acoustic Modeling in Low Resource
Languages", 2015 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP 2015), South Brisbane, Queensland, Australia, April 19-24, 2015, pp. 2056-2060.
AUTHORS
Aye Nyein Mon received her M.C.Sc. (Credit) in Computer Science from the University of Computer
Studies, Yangon (UCSY), Myanmar. She is currently pursuing a Ph.D. at the University of Computer
Studies, Yangon. Her research interests are Natural Language Processing, Speech Processing, and
Machine Learning.
Dr. Win Pa Pa is a Professor doing research at the Natural Language Processing Lab of UCSY. She has
supervised Master's and Ph.D. theses on natural language processing topics such as Information
Retrieval, Morphological Analysis, Part-of-Speech Tagging, Parsing, Automatic Speech Recognition and
Speech Synthesis. She took part in the ASEAN MT project, a machine translation project for South East
Asian languages. She also participated in the Myanmar Automatic Speech Recognition and HMM-Based
Myanmar Speech Synthesis (Text-to-Speech) projects, which were research collaborations between NICT,
Japan and UCSY.
Dr. Ye Kyaw Thu is a Visiting Researcher at the Language and Speech Science Research Lab., Waseda
University, Japan, and a co-supervisor of master's and doctoral students at several universities. His
research interests lie in the fields of AI, natural language processing (NLP) and human-computer
interaction (HCI). His experience includes research and development of text input methods, corpus
building, statistical machine translation (SMT), automatic speech recognition (ASR) and text-to-speech
(TTS) engines for his native language, Myanmar. He is currently working on robot language acquisition
research and sign language processing.