The proposed system is a syllable-based Myanmar speech recognition system with three stages:
feature extraction, phone recognition, and decoding. In feature extraction, the system transforms the
input speech waveform into a sequence of acoustic feature vectors, each vector representing the
information in a small time window of the signal. The phone recognition stage then computes the
likelihood of the observed feature vectors given linguistic units (words, phones, subparts of phones).
Finally, the decoding stage takes the Acoustic Model (AM), which consists of this sequence of
acoustic likelihoods, plus a phonetic dictionary of word pronunciations, combined with the Language
Model (LM), and produces the most likely sequence of words as the output. The system builds
the language model for Myanmar using syllable segmentation and a syllable-based n-gram method.
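The syllable-based n-gram language model described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the corpus, syllable strings, and add-alpha smoothing are illustrative stand-ins for a real syllable-segmented Myanmar corpus.

```python
from collections import defaultdict

def train_bigram_lm(corpus):
    """Count syllable unigrams and bigrams from pre-segmented sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        syllables = ["<s>"] + sentence + ["</s>"]
        for syl in syllables:
            unigrams[syl] += 1
        for prev, cur in zip(syllables, syllables[1:]):
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, cur, alpha=1.0):
    """Add-alpha smoothed P(cur | prev) over the syllable vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)

# Toy corpus of syllable-segmented "sentences" (placeholder syllables).
corpus = [["ka", "la", "ma"], ["ka", "la"], ["ma", "la"]]
uni, bi = train_bigram_lm(corpus)
p = bigram_prob(uni, bi, "ka", "la")
```

A decoder would multiply (or sum, in log space) such probabilities along each candidate syllable sequence and combine them with the acoustic likelihoods.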
International Journal of Signal and Image Processing Issues, vol. 2015, no. 1 ...
This document presents a study on automatic speech recognition (ASR). It discusses the different types of ASR systems including speaker-dependent, speaker-independent, and speaker-adaptive systems. It also covers the different types of utterances that can be recognized, such as isolated words, connected words, continuous speech, and spontaneous speech. The document then describes the basic phases involved in ASR, including front-end analysis using techniques like pre-emphasis, framing, windowing and feature extraction. It also discusses back-end processing using acoustic and language models to map features to words. Hidden Markov models are presented as a commonly used acoustic modeling technique in ASR systems.
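The front-end steps this summary names (pre-emphasis, framing, windowing) can be sketched in plain Python. The coefficient, frame length, and hop size below are common textbook values, not parameters taken from the document.

```python
import math

def preemphasize(signal, coeff=0.97):
    # y[n] = x[n] - coeff * x[n-1]; boosts high frequencies.
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    # Split into overlapping frames; a trailing partial frame is dropped.
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    # Taper frame edges to reduce spectral leakage.
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
            for k, s in enumerate(frame)]

# 400 synthetic samples framed into 160-sample windows with 50% overlap.
signal = [math.sin(0.1 * n) for n in range(400)]
frames = [hamming(f) for f in frame_signal(preemphasize(signal), 160, 80)]
```

Each windowed frame would then feed the feature-extraction step (e.g. MFCCs) before acoustic modeling.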
An expert system for automatic reading of a text written in standard Arabic
In this work we present our expert system for automatic reading, or speech synthesis, of a text
written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound
database, and the transformation of the written text into speech (Text-To-Speech, TTS). This transformation is
done firstly by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with
the aim of transforming it into its corresponding phonetic sequence, and secondly by the generation of
the voice signal which corresponds to the transcribed chain. We set out the different design stages of
the system, as well as the results obtained, compared with other works studied to realize TTS for
Standard Arabic.
Development of text to speech system for Yoruba language
This document describes the development of a text-to-speech (TTS) system for the Yoruba language. It begins with background on TTS systems and an overview of previous work developing TTS for other languages but not extensively for Yoruba. The authors then describe the architecture and design of the Yoruba TTS system they developed using a concatenative synthesis method. This includes analyzing the phonology and syllable structure of Yoruba, and developing components for syllable identification, prosody assignment, and speech signal processing. An evaluation of the system found that 70% of respondents considered it usable.
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
This document describes an expert system for speech synthesis of Standard Arabic text. It involves two main stages: 1) creation of a sound database and 2) text-to-speech transformation. The transformation process involves phonetic orthographic transcription of the text and then generating voice signals corresponding to the transcribed phonetic sequence. The expert system uses a knowledge base containing sound data and rewriting rules. It transcribes text using graphemes as basic units and then concatenates sound units from the database to synthesize speech. Tests achieved a 96% success rate in pronouncing sentences correctly. Future work aims to improve prosody and develop fully automatic signal segmentation.
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
This paper describes a Hidden Markov Model-based Punjabi text-to-speech synthesis system (HTS), in which the speech waveform is generated from Hidden Markov Models themselves, and applies it to Punjabi speech synthesis using the general speech synthesis architecture of HTK (the HMM Tool Kit). This Hidden Markov Model-based TTS can be used in mobile phones for a stored phone directory or messages. Text messages and the caller's identity in English are mapped to tokens in Punjabi, which are then concatenated to form speech according to certain rules and procedures.
To build the synthesizer we recorded the speech database and phonetically segmented it, first extracting context-independent monophones and then context-dependent triphones. For example, for the word bharat the monophones are bh, a, t, etc., and a triphone is bh-a+r. These speech utterances and their phone-level transcriptions (monophones and triphones) are the inputs to the speech synthesis system. The system outputs the sequence of phonemes after resolving various ambiguities in phoneme selection using word network files; e.g., for the word Tapas the output phoneme sequence is ਤ,ਪ,ਸ instead of ਟ,ਪ,ਸ.
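The monophone-to-triphone expansion in that example can be illustrated with a short sketch. The HTK-style `left-center+right` labeling is assumed from the bh-a+r example; this is illustrative code, not part of the described system.

```python
def to_triphones(phones):
    """Expand a monophone sequence into HTK-style l-c+r triphone labels.

    Boundary phones get partial contexts (c+r at the start, l-c at the end).
    """
    tri = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        label = p
        if left:
            label = f"{left}-{label}"
        if right:
            label = f"{label}+{right}"
        tri.append(label)
    return tri

# Monophones for the word "bharat", as in the paper's example.
print(to_triphones(["bh", "a", "r", "a", "t"]))
# → ['bh+a', 'bh-a+r', 'a-r+a', 'r-a+t', 'a-t']
```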
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers from variation in context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments. The input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature-vector size; curvelets also offer better accuracy and a varying window size, making them suitable for non-stationary signals. For word classification and recognition, discrete hidden Markov models can be used, as they model the time distribution of speech signals. The HMM classification method attained maximum accuracy in terms of identification rate, with detection rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control. The objective of this study is to characterize the feature extraction methods and the classification phase in a speech
recognition system. The various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using the discrete curvelet transform over conventional methods.
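As a minimal illustration of the discrete-HMM classification step mentioned above, the forward algorithm computes the likelihood of an observation sequence under a model; a classifier would score a word's feature sequence against each word model and pick the highest. All probabilities below are toy values, not trained parameters.

```python
def forward_likelihood(obs, init, trans, emit):
    """Forward algorithm: P(obs | model) for a discrete HMM.

    obs   : list of symbol indices
    init  : init[s] = P(state s at t=0)
    trans : trans[p][s] = P(s | previous state p)
    emit  : emit[s][o] = P(symbol o | state s)
    """
    states = range(len(init))
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states]
    return sum(alpha)

# Toy 2-state HMM over a 2-symbol alphabet (all numbers illustrative).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
p = forward_likelihood([0, 1, 0], init, trans, emit)
```

In practice the recursion is run in log space to avoid underflow on long observation sequences.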
A Marathi Hidden-Markov Model Based Speech Synthesis System
IOSR journal of VLSI and Signal Processing (IOSRJVSP) is a double blind peer reviewed International Journal that publishes articles which contribute new results in all areas of VLSI Design & Signal Processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI Design & Signal Processing concepts and establishing new collaborations in these areas.
Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels.
MULTILINGUAL SPEECH IDENTIFICATION USING ARTIFICIAL NEURAL NETWORK
Speech technology is an emerging field, and automatic speech recognition has made advances in recent years. Much research has been performed for many foreign and regional languages, and multilingual speech processing is now attracting research attention. This paper proposes a methodology for developing a bilingual speech identification system for the Assamese and English languages based on an artificial neural network.
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe in detail a Hidden Markov Model (HMM)-based machine learning approach to identifying named entities. The main idea behind using an HMM for building an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic, and one can define them according to the task at hand. The corpus used by our NER system is also not domain specific.
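A minimal sketch of the HMM tagging idea: Viterbi decoding picks the most likely state (tag) sequence for a word sequence. The tags, words, and probabilities below are toy values for illustration, not the paper's trained model.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely tag sequence for an observation (word) sequence."""
    # V[t][s] = (best probability of reaching state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][best_prev][0] * trans_p[best_prev][s]
                       * emit_p[s].get(obs[t], 1e-6), best_prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy model: "O" = other, "LOC" = location name.
states = ["O", "LOC"]
start_p = {"O": 0.9, "LOC": 0.1}
trans_p = {"O": {"O": 0.7, "LOC": 0.3}, "LOC": {"O": 0.6, "LOC": 0.4}}
emit_p = {"O": {"went": 0.4, "to": 0.4}, "LOC": {"paris": 0.8}}
tags = viterbi(["went", "to", "paris"], states, start_p, trans_p, emit_p)
```

The `1e-6` floor stands in for proper smoothing of unseen word emissions; a real system would estimate all three probability tables from a tagged corpus.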
Approach of Syllable Based Unit Selection Text-To-Speech Synthesis System fo...
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
This research paper presents an approach to converting text to speech using a new methodology. The
text-to-speech conversion system enables the user to enter text in Marathi and produces speech as
output. The paper presents the steps followed for converting text to speech for the Marathi language
and the algorithm used for it. The focus of this paper is the tokenisation process and the
orthographic representation of the text, which shows the mapping of letters to sounds using a
description of the language's phonetics. The main focus here is the text-to-IPA transcription
concept: the system translates text to an IPA transcription, which is the primary stage of text-to-speech
conversion. The whole procedure for converting text to speech is not an easy task and requires a great deal of time and effort.
Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer
This document summarizes an implementation of an English-text to Marathi-speech synthesizer. The synthesizer uses a unit selection approach based on concatenative synthesis to produce natural sounding Marathi speech from English text input. Over 28,000 Marathi syllables, words and sentences were recorded from a female speaker and used to create the speech corpus. Formant frequencies (F1, F2, F3) were analyzed from the synthesized speech using MATLAB and PRAAT tools to evaluate the quality and naturalness of the output.
Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes...
Marathi is one of the oldest languages in India. This research paper describes the development of a
Marathi Text-to-Speech (TTS) system. In Marathi TTS the input is Marathi text in Unicode, and the voices
are sampled from real recorded speech. The objective of a text-to-speech system is to convert an
arbitrary text into its corresponding spoken waveform. Speech synthesis is the process of building
machinery that can generate human-like speech from any text input, imitating human speakers. Text
processing and speech generation are the two main components of a text-to-speech system. To build a
natural-sounding speech synthesis system, it is essential that the text-processing component produce an
appropriate sequence of phonemic units. Generating the sequence of phonetic units for a given standard
word is referred to as a letter-to-phoneme (or text-to-phoneme) rule. The complexity of these rules and
their derivation depends upon the nature of the language. The quality of a speech synthesizer is judged
by its closeness to the natural human voice and its understandability. In this research paper we describe
an approach to building a Marathi TTS system using the concatenative synthesis method with the syllable
as the basic unit of concatenation.
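A letter-to-phoneme rule system of the kind just described can be sketched as a greedy longest-match rewriter. The rules and phone symbols below are hypothetical; a real Marathi system would map Devanagari graphemes using the language's phonetic description.

```python
# Hypothetical grapheme-to-phone rules, ordered longest-first so that
# digraphs like "bh" win over single letters.
RULES = [("bh", "BH"), ("sh", "SH"), ("a", "AH"), ("r", "R"), ("t", "T")]

def letter_to_phoneme(word):
    """Greedy longest-match application of letter-to-phoneme rules."""
    phones, i = [], 0
    while i < len(word):
        for grapheme, phone in RULES:
            if word[i:].startswith(grapheme):
                phones.append(phone)
                i += len(grapheme)
                break
        else:
            i += 1  # skip letters covered by no rule
    return phones

print(letter_to_phoneme("bharat"))
# → ['BH', 'AH', 'R', 'AH', 'T']
```

Real rule sets also condition on context (neighboring letters, syllable position), which greedy matching alone cannot capture.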
This document summarizes the technical progress in speech recognition over the past few decades. It discusses the key components of a speech recognition system including speech analysis, feature extraction, language translation, and message understanding. The document also outlines the main challenges in speech recognition like variability in context, environmental conditions, speakers, and audio quality. Furthermore, it covers different types of speech recognition systems and their applications in various sectors such as education, medicine, transportation and more. The document concludes with an overview of the major approaches to speech recognition including acoustic-phonetic, pattern recognition, and artificial intelligence and discusses popular feature extraction methods used.
Evaluation of Hidden Markov Model based Marathi Text-To-Speech Synthesis System
The objective of this paper is to evaluate the quality of an HMM-based Marathi TTS system. The main advantage of the HMM technique is its ability to allow easy variation in the voice, and the output speech produced by this method better conveys emotion, style, and intonation. Naturalness and intelligibility are the two important parameters for deciding the quality of synthetic speech. Depending on the parameters specified, the synthetic speech results are categorized into 4 categories: natural speech, high-quality synthetic speech, low-quality synthetic speech, and moderate-quality synthetic speech. The results are obtained using CT, DRT, and MOS tests.
This document provides a review of speech recognition by machines over the past 60 years. It discusses the major approaches to speech recognition, including acoustic phonetic, pattern recognition, and artificial intelligence approaches. The pattern recognition approach using hidden Markov models has become predominant. The document outlines the basic model of speech recognition systems and various issues that affect recognition accuracy such as environment, speakers, speech styles, and vocabulary. It also discusses applications of speech recognition in different domains.
A Novel Approach for Rule Based Translation of English to Marathi
This paper presents a design for a rule-based machine translation system for the English-Marathi language pair. The machine translation system takes an English sentence as input and parses it with the help of the Stanford parser, which handles the main source-side processing. An English-to-Marathi bilingual dictionary is created. The system takes the parsed output, separates the source text word by word, and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
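The dictionary-lookup-plus-reordering step can be sketched as follows. The three-word SVO-to-SOV rule and the tiny dictionary are illustrative only, not the paper's hand-coded rules, and real reordering would operate on the parse tree rather than on raw word positions.

```python
# Hypothetical mini-dictionary; real entries would come from the
# English-to-Marathi bilingual dictionary the paper describes.
dictionary = {"ram": "राम", "eats": "खातो", "mango": "आंबा"}

def translate(sentence):
    """Dictionary lookup plus a single SVO -> SOV reordering rule."""
    words = sentence.lower().split()
    if len(words) == 3:
        # Toy rule: subject verb object -> subject object verb,
        # since Marathi is verb-final.
        words = [words[0], words[2], words[1]]
    # Unknown words pass through untranslated.
    return " ".join(dictionary.get(w, w) for w in words)

print(translate("Ram eats mango"))  # → राम आंबा खातो
```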
Classification improvement of spoken Arabic language based on radial basis fu...
This document summarizes a research paper that aimed to improve classification of spoken Arabic letters using a Radial Basis Function (RBF) neural network. The paper proposes a three-step approach: 1) preprocessing the speech signals, which includes removing noise and segmenting the signals, 2) extracting statistical features from the preprocessed signals, such as zero-crossing rate and MFCCs, and 3) classifying the letters using an RBF neural network. The researchers tested different parameters and achieved classification accuracies of 90-99.375%, an improvement over prior works. They concluded that combining statistical features with RBF neural networks provided recognition rates over 1.845% better than other methods for Arabic speech classification.
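One of the statistical features mentioned, the zero-crossing rate, is simple to compute. This sketch uses synthetic signals (not speech data) to show why the feature separates smooth, voiced-like frames from rapidly alternating, noise-like ones.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# A slow sinusoid crosses zero rarely; a sample-by-sample alternation
# crosses at every step.
voiced_like = [math.sin(0.05 * n) for n in range(200)]
noisy_like = [(-1) ** n for n in range(200)]
```

Voiced speech frames tend toward low ZCR and unvoiced/fricative or noisy frames toward high ZCR, which is why ZCR is often paired with energy and MFCCs as a cheap discriminative feature.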
This document is a B Tech project report submitted by Abhishek Agarwal in April 2005 at the Dhirubhai Ambani Institute of Information & Communication Technology. The project aims to develop machine understanding of Indian spoken languages. It discusses work done in language identification based on phonetic characteristics. The report covers background on language identification systems, objectives of the project, a discussion on developing language models for the identification process, and proposes tools that could utilize language identification algorithms.
This document reviews techniques used in spoken-word recognition systems. It discusses popular feature extraction techniques like MFCC, LPC, DWT, WPD that are used to represent speech signals in a compact form before classification. Classification techniques discussed are ANN, HMM, DTW, and VQ. The document provides a brief overview of each technique and their advantages. It also presents the generalized workflow of a spoken-word recognition system including stages of speech acquisition, pre-emphasis, feature extraction, modeling, classification, and output of recognized text.
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
This document discusses the development of an Amharic text-to-speech synthesis system. It provides background on text-to-speech synthesis and challenges in developing systems for under-resourced languages like Amharic. The document outlines objectives to build an Amharic TTS prototype using deep learning methods like Tacotron 2 and WaveNet. It reviews related work on Amharic speech synthesis and discusses Amharic phonology. The proposed system aims to overcome limitations of previous rule-based and concatenative Amharic TTS systems by leveraging recent advances in neural network-based speech synthesis techniques.
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include formant synthesis and articulatory synthesis. Formant synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory synthesis produces speech by directly modeling human articulator behavior. Second-generation techniques comprise concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech output by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and periods of the harmonics. With these, the value of the fundamental can be changed while keeping the same basic spectral shape. Third-generation techniques include Hidden Markov Model (HMM) synthesis and unit selection synthesis. HMM synthesis trains the parameter module and produces high-quality speech. Finally, unit selection operates by selecting the best sequence of units from a large speech database that matches the specification.
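The harmonic decomposition behind sinusoidal synthesis can be sketched by resynthesizing a frame as a sum of harmonics of the fundamental. The f0, amplitudes, and sample rate below are arbitrary illustrative values, and phases are set to zero for simplicity.

```python
import math

def synthesize_harmonics(f0, amplitudes, sample_rate, n_samples):
    """Sum harmonics k*f0 with the given amplitudes (zero phases)."""
    return [sum(a * math.sin(2 * math.pi * (k + 1) * f0 * n / sample_rate)
                for k, a in enumerate(amplitudes))
            for n in range(n_samples)]

# 120 Hz fundamental with three harmonics; halving f0 while reusing the
# same amplitude pattern is a crude stand-in for changing pitch while
# keeping the spectral shape.
frame_a = synthesize_harmonics(120.0, [1.0, 0.5, 0.25], 16000, 160)
frame_b = synthesize_harmonics(60.0, [1.0, 0.5, 0.25], 16000, 160)
```

A full sinusoidal coder would estimate f0, harmonic amplitudes, and phases per frame from analysis of real speech, then overlap-add the resynthesized frames.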
This document describes a speaker independent Malayalam word recognition system using support vector machines. It extracts mel frequency cepstral coefficients (MFCCs) as features and compares the performance of linear and radial basis function kernels in the SVM classifier. An overall accuracy of 91.8% was achieved on a database containing isolated word utterances from different speakers, demonstrating the effectiveness of the MFCC features and RBF kernel for speaker independent speech recognition.
SMATalk: Standard Malay Text to Speech Talk System
This document summarizes a research paper on SMaTTS, a rule-based text-to-speech synthesis system for Standard Malay. The system uses a sinusoidal synthesis method and some pre-recorded wave files to generate speech. It has two main phases: natural language processing (NLP) and digital signal processing (DSP). The NLP phase analyzes text and converts it to phonetic representations with prosody information. The DSP phase generates the speech waveform. The system was evaluated using diagnostic rhyme tests and mean opinion scores, and areas for improvement are discussed.
Isolated Word Recognition System For Tamil Spoken Language Using Back Propaga...
Speech recognition has been an active research topic for more than 50 years. Interacting with the
computer through speech is one of the active scientific research fields particularly for the disable
community who face variety of difficulties to use the computer. Such research in Automatic Speech
Recognition (ASR) is investigated for different languages because each language has its specific features.
Especially the need for ASR system in Tamil language has been increased widely in the last few years. In
this paper, a speech recognition system for individually spoken word in Tamil language using multilayer
feed forward network is presented. To implement the above system, initially the input signal is
preprocessed using four types of filters namely preemphasis, median, average and Butterworth bandstop
filter in order to remove the background noise and to enhance the signal. The performance of these filters
are measured based on MSE and PSNR values. The best filtered signal is taken as the input for the further
process of ASR system
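The four preprocessing filters named in the abstract can be sketched with NumPy/SciPy, along with a PSNR measure for comparing them. The test signal, noise level, filter orders, and band edges below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

sr = 8000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.05 * np.random.default_rng(1).normal(size=sr)

# Pre-emphasis: boost high frequencies, y[n] = x[n] - 0.97 x[n-1]
pre = np.append(noisy[0], noisy[1:] - 0.97 * noisy[:-1])

# Median and moving-average filters for impulsive/random noise
med = medfilt(noisy, kernel_size=5)
avg = np.convolve(noisy, np.ones(5) / 5, mode="same")

# Butterworth band-stop, e.g. to reject 50 Hz mains hum
b, a = butter(4, [45, 55], btype="bandstop", fs=sr)
stopped = filtfilt(b, a, noisy)

def psnr(ref, sig):
    """Peak signal-to-noise ratio in dB, relative to the clean reference."""
    mse = np.mean((ref - sig) ** 2)
    return 10 * np.log10(np.max(ref ** 2) / mse)

print(psnr(clean, noisy), psnr(clean, med))
```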
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK (ijistjournal)
This paper describes a Hidden Markov Model-based Punjabi text-to-speech synthesis system (HTS), in which the speech waveform is generated from the Hidden Markov Models themselves, applied to Punjabi speech synthesis using the general speech synthesis architecture of HTK (the HMM Tool Kit). This HMM-based TTS can be used in mobile phones for a stored phone directory or messages: text messages and the caller's identity in English are mapped to tokens in Punjabi, which are then concatenated to form speech according to certain rules and procedures.
To build the synthesizer, the speech database was recorded and phonetically segmented, first extracting context-independent monophones and then context-dependent triphones. For example, for the word 'bharat' the monophones include bh, a, r, and t, and bh-a+r is a triphone. These speech utterances and their phone-level transcriptions (monophones and triphones) are the inputs to the speech synthesis system. The system outputs the sequence of phonemes after resolving ambiguities in phoneme selection using word network files; for example, for the word 'Tapas' the output phoneme sequence is ਤ,ਪ,ਸ rather than ਟ,ਪ,ਸ.
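The monophone-to-triphone expansion illustrated by the 'bh-a+r' example can be sketched directly, using the common left-center+right label notation:

```python
def to_triphones(phones):
    """Expand a monophone sequence into context-dependent triphone labels.

    Uses the common left-center+right notation: ['bh', 'a', 'r', ...]
    yields 'bh-a+r' for the second phone.
    """
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        labels.append(left + p + right)
    return labels

print(to_triphones(["bh", "a", "r", "a", "t"]))
```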
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons. Dynamic time warping (DTW) is widely used for aligning two given multidimensional sequences: it finds an optimal match between them, and the distance between the aligned sequences should be smaller than between the unaligned sequences, so the improvement in the alignment can be estimated from the corresponding distances. The technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 male and 3 female). The recorded material was segmented manually and aligned at the sentence and phoneme levels. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences; the investigation showed a reduction of more than 20% in the average Mahalanobis distance.
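A minimal sketch of the DTW alignment step, assuming Euclidean frame distances. The paper aligns HNM parameter sequences; the sequences below are synthetic stand-ins.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping between two feature sequences.

    x, y: arrays of shape (n, d) and (m, d); returns the accumulated
    cost of the optimal alignment path.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A sequence aligned with a time-stretched copy of itself should be
# much closer than two unrelated sequences.
a = np.sin(np.linspace(0, 3, 30))[:, None]
b = np.sin(np.linspace(0, 3, 45))[:, None]   # same curve, different rate
c = np.random.default_rng(0).normal(size=(30, 1))
print(dtw_distance(a, b), dtw_distance(a, c))
```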
Named Entity Recognition using Hidden Markov Model (HMM) (kevig)
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe in detail a Hidden Markov Model (HMM)-based machine learning approach to identifying named entities. The main idea behind using an HMM to build an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed but dynamic in nature and can be chosen according to the user's interest. The corpus used by our NER system is also not domain specific.
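A toy sketch of HMM tagging with Viterbi decoding, the standard inference step in HMM-based NER. The states, vocabulary, and all probabilities below are invented for illustration; they are not taken from the paper.

```python
import numpy as np

# Tiny illustrative HMM: three states, four-word vocabulary.
states = ["O", "PER", "LOC"]
vocab = {"john": 0, "lives": 1, "in": 2, "paris": 3}
start = np.log([0.6, 0.2, 0.2])
trans = np.log([[0.7, 0.15, 0.15],   # from O
                [0.6, 0.3, 0.1],     # from PER
                [0.6, 0.1, 0.3]])    # from LOC
emit = np.log([[0.1, 0.5, 0.3, 0.1],   # O
               [0.7, 0.1, 0.1, 0.1],   # PER
               [0.2, 0.1, 0.1, 0.6]])  # LOC

def viterbi(words):
    """Most likely state sequence for a word sequence (log domain)."""
    obs = [vocab[w] for w in words]
    V = start + emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = V[:, None] + trans            # scores[prev, cur]
        back.append(scores.argmax(axis=0))     # best predecessor per state
        V = scores.max(axis=0) + emit[:, o]
    path = [int(V.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["john", "lives", "in", "paris"]))
```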
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Approach of Syllable Based Unit Selection Text-To-Speech Synthesis System fo... (iosrjce)
IOSR Journal of VLSI and Signal Processing (IOSRJVSP) is a double-blind peer-reviewed international journal that publishes articles which contribute new results in all areas of VLSI design and signal processing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on advanced VLSI design and signal processing concepts and to establish new collaborations in these areas. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.
Implementation of Text To Speech for Marathi Language Using Transcriptions Co... (IJERA Editor)
This research paper presents an approach to converting text to speech using a new methodology. The text-to-speech conversion system enables the user to enter text in Marathi and produces sound as output. The paper presents the steps followed for converting Marathi text to speech and the algorithm used. The focus of this paper is the tokenisation process and the orthographic representation of the text, which shows the mapping of letters to sounds using a description of the language's phonetics. The main focus here is the text-to-IPA transcription concept: the system translates text to an IPA transcription, which is the primary stage of text-to-speech conversion. The whole procedure for converting text to speech is not an easy task and requires considerable time and effort.
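The letter-to-sound mapping behind text-to-IPA transcription can be sketched as greedy longest-match rule application. The rules below are hypothetical placeholders, not actual Marathi orthography or the paper's rule set.

```python
# Hypothetical letter-to-IPA rules; a real system uses the language's
# full phonetic description.
RULES = {"sh": "ʃ", "ch": "tʃ", "a": "ə", "i": "i", "t": "t̪", "p": "p"}

def to_ipa(word):
    """Greedy longest-match letter-to-sound transcription."""
    out, i = [], 0
    keys = sorted(RULES, key=len, reverse=True)   # try longer rules first
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(RULES[k])
                i += len(k)
                break
        else:
            out.append(word[i])   # pass through unknown letters
            i += 1
    return "".join(out)

print(to_ipa("shita"))
```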
Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer (IOSR Journals)
This document summarizes an implementation of an English-text to Marathi-speech synthesizer. The synthesizer uses a unit selection approach based on concatenative synthesis to produce natural sounding Marathi speech from English text input. Over 28,000 Marathi syllables, words and sentences were recorded from a female speaker and used to create the speech corpus. Formant frequencies (F1, F2, F3) were analyzed from the synthesized speech using MATLAB and PRAAT tools to evaluate the quality and naturalness of the output.
Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes... (IJERA Editor)
Marathi is one of the oldest languages in India. This research paper describes the development of a Marathi Text-to-Speech (TTS) system whose input is Marathi text in Unicode; the voices are sampled from real recorded speech. The objective of a text-to-speech system is to convert arbitrary text into its corresponding spoken waveform. Speech synthesis is the process of building machinery that can generate human-like speech from any text input so as to imitate human speakers. Text processing and speech generation are the two main components of a text-to-speech system. To build a natural-sounding speech synthesis system, it is essential that the text processing component produce an appropriate sequence of phonemic units. Generating the sequence of phonetic units for a given standard word is referred to as a letter-to-phoneme or text-to-phoneme rule, and the complexity of these rules and their derivation depends on the nature of the language. The quality of a speech synthesizer is judged by its closeness to the natural human voice and by its understandability. In this research paper we describe an approach to building a Marathi TTS system using the concatenative synthesis method with the syllable as the basic unit of concatenation.
This document summarizes the technical progress in speech recognition over the past few decades. It discusses the key components of a speech recognition system including speech analysis, feature extraction, language translation, and message understanding. The document also outlines the main challenges in speech recognition like variability in context, environmental conditions, speakers, and audio quality. Furthermore, it covers different types of speech recognition systems and their applications in various sectors such as education, medicine, transportation and more. The document concludes with an overview of the major approaches to speech recognition including acoustic-phonetic, pattern recognition, and artificial intelligence and discusses popular feature extraction methods used.
Evaluation of Hidden Markov Model based Marathi Text-To-Speech Synthesis System (IJERA Editor)
The objective of this paper is to evaluate the quality of an HMM-based Marathi TTS system. The main advantage of the HMM technique is its ability to vary the voice easily, and the output speech produced by this method captures emotion, style, and intonation well. Naturalness and intelligibility are the two important parameters for judging the quality of synthetic speech. Depending on the parameters specified, the synthetic speech results are grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The results are obtained using CT, DRT, and MOS tests.
International Journal of Engineering Research and Development (IJERD Editor)
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
This document provides a review of speech recognition by machines over the past 60 years. It discusses the major approaches to speech recognition, including acoustic phonetic, pattern recognition, and artificial intelligence approaches. The pattern recognition approach using hidden Markov models has become predominant. The document outlines the basic model of speech recognition systems and various issues that affect recognition accuracy such as environment, speakers, speech styles, and vocabulary. It also discusses applications of speech recognition in different domains.
A Novel Approach for Rule Based Translation of English to Marathi (aciijournal)
This paper presents the design of a rule-based machine translation system for the English-Marathi language pair. The system takes an English sentence as input and parses it with the Stanford parser, which handles the main source-side processing. An English-Marathi bilingual dictionary is created; the system takes the parsed output, separates the source text word by word, and searches for the corresponding target words in the bilingual dictionary. Hand-coded rules are written for Marathi inflections, along with reordering rules. After the reordering rules are applied, the English sentence is syntactically reordered to suit the Marathi language.
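A toy sketch of the dictionary-lookup-plus-reordering idea. The lexicon entries are romanized placeholders (not a real Marathi lexicon), and the single SVO-to-SOV rule stands in for the paper's full hand-coded rule set.

```python
# Hypothetical bilingual entries, romanized for illustration only.
LEXICON = {"ram": "raam", "eats": "khaato", "mango": "aamba"}

def translate_svo(sentence):
    """Word-by-word dictionary lookup followed by one reordering rule."""
    words = sentence.lower().split()
    target = [LEXICON.get(w, w) for w in words]   # unknown words pass through
    # Reordering rule: English Subject-Verb-Object becomes
    # Subject-Object-Verb, as in Marathi.
    if len(target) == 3:
        s, v, o = target
        target = [s, o, v]
    return " ".join(target)

print(translate_svo("Ram eats mango"))
```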
Classification improvement of spoken arabic language based on radial basis fu... (IJECEIAES)
This document summarizes a research paper that aimed to improve the classification of spoken Arabic letters using a Radial Basis Function (RBF) neural network. The paper proposes a three-step approach: 1) preprocessing the speech signals, including noise removal and segmentation; 2) extracting statistical features such as the zero-crossing rate and MFCCs from the preprocessed signals; and 3) classifying the letters with an RBF neural network. The researchers tested different parameters and reported classification accuracies of 90-99.375%, an improvement over prior works, concluding that combining statistical features with RBF neural networks provided over 1.845% better recognition rates than other methods for Arabic speech classification.
This document is a B Tech project report submitted by Abhishek Agarwal in April 2005 at the Dhirubhai Ambani Institute of Information & Communication Technology. The project aims to develop machine understanding of Indian spoken languages. It discusses work done in language identification based on phonetic characteristics. The report covers background on language identification systems, objectives of the project, a discussion on developing language models for the identification process, and proposes tools that could utilize language identification algorithms.
Speech processing is a crucial and intensive field of research in the development of robust and efficient speech recognition systems, but recognition accuracy still suffers under variation of context, speaker variability, and environmental conditions. In this paper, we present a curvelet-based feature extraction (CFE) method for speech recognition in noisy environments: the input speech signal is decomposed into different frequency channels using the characteristics of the curvelet transform, which reduces the computational complexity and the feature vector size, gives better accuracy, and supports varying window sizes, making it suitable for non-stationary signals. For better word classification and recognition, a discrete hidden Markov model can be used, as it accounts for the time distribution of speech signals. The HMM classification method attained maximum identification rates of 80.1% for informal phrases, 86% for scientific phrases, and 63.8% for control phrases. The objective of this study is to characterize the feature extraction methods and the classification phase in a speech recognition system; the various approaches available for developing speech recognition systems are compared along with their merits and demerits. The statistical results show that recognition accuracy is increased by using discrete curvelet transforms over conventional methods.
This document reviews techniques used in spoken-word recognition systems. It discusses popular feature extraction techniques like MFCC, LPC, DWT, WPD that are used to represent speech signals in a compact form before classification. Classification techniques discussed are ANN, HMM, DTW, and VQ. The document provides a brief overview of each technique and their advantages. It also presents the generalized workflow of a spoken-word recognition system including stages of speech acquisition, pre-emphasis, feature extraction, modeling, classification, and output of recognized text.
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT (Nathan Mathis)
This document discusses the development of an Amharic text-to-speech synthesis system. It provides background on text-to-speech synthesis and challenges in developing systems for under-resourced languages like Amharic. The document outlines objectives to build an Amharic TTS prototype using deep learning methods like Tacotron 2 and WaveNet. It reviews related work on Amharic speech synthesis and discusses Amharic phonology. The proposed system aims to overcome limitations of previous rule-based and concatenative Amharic TTS systems by leveraging recent advances in neural network-based speech synthesis techniques.
American Standard Sign Language Representation Using Speech Recognition (paperpublications3)
Abstract: For many deaf people, sign language is the principal means of communication, which increases the isolation of hearing-impaired people. This paper presents a system prototype that can automatically recognize speech, helping users communicate more effectively with hearing- or speech-impaired people. The system recognizes the speech signal, and the recognized spoken words are represented in American Standard Sign Language via a robotic arm and on the computer using Visual Basic. In this project a software package is provided to convert the speech signal, which carries no meaning for deaf and mute users, into sign language. The main purpose of this project is to bridge the communication and expression gap between people who cannot understand sign language and deaf and mute people who cannot understand normal speech.
A Review On Speech Feature Techniques And Classification Techniques (Nicole Heredia)
This document discusses speech feature extraction and classification techniques for speech recognition systems. It provides an overview of common feature extraction methods like MFCC and LPC, and classification algorithms like ANN and SVM. MFCC mimics human auditory perception but provides weak power spectrum, while LPC is easy to calculate but does not capture information at a linear scale. ANN can learn from data but is complex for large datasets, while SVM is accurate and suitable for pattern recognition but requires fixed-length coefficients. The document evaluates these techniques and concludes that MFCC performance is more efficient than LPC for feature extraction in speech recognition.
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP), and verb phrase identification within a sentence is very useful for a variety of NLP applications. One of the core enabling technologies required in NLP applications is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and there are very few lexicon resources and little lexicon research for Myanmar compared with other languages such as English, French, and Czech. Therefore, this system proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results showed that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
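The rule-based maximum matching approach can be sketched as greedy longest-match segmentation against a lexicon. The lexicon below is a romanized toy, not the paper's Myanmar-English lexicon.

```python
# Toy lexicon of romanized placeholder "words".
LEXICON = {"thu", "thwa", "thwathi", "ma", "mathwa"}

def max_match(text):
    """Greedy longest-match segmentation against a lexicon."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # unknown character falls through
            i += 1
    return tokens

print(max_match("thuthwathi"))
```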
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of the speech signals, and HNM analysis and synthesis provides high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them, and the improvement in the alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the recorded material was segmented manually and aligned at the sentence, word, and phoneme levels, and the Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
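The frame-wise Mahalanobis distance used to score alignments can be sketched as follows; the "parameter frames" here are synthetic stand-ins for HNM parameters, and the covariance is estimated from the reference sequence.

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between two feature frames."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Average Mahalanobis distance between two aligned parameter sequences.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 4))              # stand-in parameter frames
warped = ref + 0.1 * rng.normal(size=(100, 4))   # well-aligned copy
cov_inv = np.linalg.inv(np.cov(ref.T))
avg_md = np.mean([mahalanobis(a, b, cov_inv) for a, b in zip(ref, warped)])
print(avg_md)
```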
SPEAKER VERIFICATION USING ACOUSTIC AND PROSODIC FEATURES (acijjournal)
In this paper we report an experiment carried out on a recently collected speaker recognition database, the Arunachali Language Speech Database (ALS-DB), to compare the performance of acoustic and prosodic features for the speaker verification task. The database consists of speech data recorded from 200 speakers with Arunachali languages of North-East India as their mother tongue, and it is evaluated using a Gaussian Mixture Model-Universal Background Model (GMM-UBM) based speaker verification system. The acoustic features considered in the present study are Mel-Frequency Cepstral Coefficients (MFCC) along with their derivatives. The performance of the system has been evaluated for the acoustic and prosodic features individually as well as in combination. It has been observed that acoustic features, considered individually, provide better performance than prosodic features; however, when prosodic features are combined with acoustic features, the combined system outperforms both systems where the features are considered individually. There is nearly a 5% improvement in recognition accuracy with respect to the system using acoustic features alone, and nearly a 20% improvement with respect to the system using prosodic features alone.
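A hedged sketch of GMM-UBM log-likelihood-ratio scoring with scikit-learn. Real systems MAP-adapt the speaker model from the UBM; here the two models are trained independently for brevity, on synthetic stand-ins for MFCC frames.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for MFCC frames: a large pool for the UBM and a smaller
# speaker-specific set with a shifted distribution.
ubm_frames = rng.normal(0.0, 1.0, (2000, 13))
spk_frames = rng.normal(0.8, 1.0, (400, 13))

ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm_frames)
spk = GaussianMixture(n_components=4, random_state=0).fit(spk_frames)

def llr(test_frames):
    """Average per-frame log-likelihood ratio: speaker model vs. UBM."""
    return spk.score(test_frames) - ubm.score(test_frames)

# Genuine trials should score higher than impostor trials.
genuine = rng.normal(0.8, 1.0, (200, 13))
impostor = rng.normal(0.0, 1.0, (200, 13))
print(llr(genuine), llr(impostor))
```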
Malayalam Isolated Digit Recognition using HMM and PLP cepstral coefficient (ijait)
Development of Malayalam speech recognition systems is still in its infancy, although much work has been done for other Indian languages. In this paper we present the first work on a speaker-independent Malayalam isolated-speech recognizer based on Perceptual Linear Predictive (PLP) cepstral coefficients and Hidden Markov Models (HMM). The performance of the developed system has been evaluated with different numbers of HMM states. The system was trained with 21 male and female speakers aged 19 to 41 years and obtained an accuracy of 99.5% on unseen data.
Emotional telugu speech signals classification based on k nn classifiereSAT Journals
Abstract Speech processing is the study of speech signals, and the methods used to process them. In application such as speech coding, speech synthesis, speech recognition and speaker recognition technology, speech processing is employed. In speech classification, the computation of prosody effects from speech signals plays a major role. In emotional speech signals pitch and frequency is a most important parameters. Normally, the pitch value of sad and happy speech signals has a great difference and the frequency value of happy is higher than sad speech. But, in some cases the frequency of happy speech is nearly similar to sad speech or frequency of sad speech is similar to happy speech. In such situation, it is difficult to recognize the exact speech signal. To reduce such drawbacks, in this paper we propose a Telugu speech emotion classification system with three features like Energy Entropy, Short Time Energy, Zero Crossing Rate and K-NN classifier for the classification. Features are extracted from the speech signals and given to the K-NN. The implementation result shows the effectiveness of proposed speech emotion classification system in classifying the Telugu speech signals based on their prosody effects. The performance of the proposed speech emotion classification system is evaluated by conducting cross validation on the Telugu speech database. Keywords: Emotion Classification, K-NN classifier, Energy Entropy, Short Time Energy, Zero Crossing Rate.
Similar to SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR (20)
Embracing Deep Variability For Reproducibility and Replicability
Abstract: Reproducibility (aka determinism in some cases) constitutes a fundamental aspect in various fields of computer science, such as floating-point computations in numerical analysis and simulation, concurrency models in parallelism, reproducible builds for third parties integration and packaging, and containerization for execution environments. These concepts, while pervasive across diverse concerns, often exhibit intricate inter-dependencies, making it challenging to achieve a comprehensive understanding. In this short and vision paper we delve into the application of software engineering techniques, specifically variability management, to systematically identify and explicit points of variability that may give rise to reproducibility issues (eg language, libraries, compiler, virtual machine, OS, environment variables, etc). The primary objectives are: i) gaining insights into the variability layers and their possible interactions, ii) capturing and documenting configurations for the sake of reproducibility, and iii) exploring diverse configurations to replicate, and hence validate and ensure the robustness of results. By adopting these methodologies, we aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical aspects.
https://hal.science/hal-04582287
Evaluation and Identification of J'BaFofi the Giant Spider of Congo and Moke...MrSproy
ABSTRACT
The J'BaFofi, or "Giant Spider," is a mainly legendary arachnid by reportedly inhabiting the dense rain forests of
the Congo. As despite numerous anecdotal accounts and cultural references, the scientific validation remains more elusive.
My study aims to proper evaluate the existence of the J'BaFofi through the analysis of historical reports,indigenous
testimonies and modern exploration efforts.
Anti-Universe And Emergent Gravity and the Dark UniverseSérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆Sérgio Sacani
Context. The early-type galaxy SDSS J133519.91+072807.4 (hereafter SDSS1335+0728), which had exhibited no prior optical variations during the preceding two decades, began showing significant nuclear variability in the Zwicky Transient Facility (ZTF) alert stream from December 2019 (as ZTF19acnskyy). This variability behaviour, coupled with the host-galaxy properties, suggests that SDSS1335+0728 hosts a ∼ 106M⊙ black hole (BH) that is currently in the process of ‘turning on’. Aims. We present a multi-wavelength photometric analysis and spectroscopic follow-up performed with the aim of better understanding the origin of the nuclear variations detected in SDSS1335+0728. Methods. We used archival photometry (from WISE, 2MASS, SDSS, GALEX, eROSITA) and spectroscopic data (from SDSS and LAMOST) to study the state of SDSS1335+0728 prior to December 2019, and new observations from Swift, SOAR/Goodman, VLT/X-shooter, and Keck/LRIS taken after its turn-on to characterise its current state. We analysed the variability of SDSS1335+0728 in the X-ray/UV/optical/mid-infrared range, modelled its spectral energy distribution prior to and after December 2019, and studied the evolution of its UV/optical spectra. Results. From our multi-wavelength photometric analysis, we find that: (a) since 2021, the UV flux (from Swift/UVOT observations) is four times brighter than the flux reported by GALEX in 2004; (b) since June 2022, the mid-infrared flux has risen more than two times, and the W1−W2 WISE colour has become redder; and (c) since February 2024, the source has begun showing X-ray emission. From our spectroscopic follow-up, we see that (i) the narrow emission line ratios are now consistent with a more energetic ionising continuum; (ii) broad emission lines are not detected; and (iii) the [OIII] line increased its flux ∼ 3.6 years after the first ZTF alert, which implies a relatively compact narrow-line-emitting region. Conclusions. 
We conclude that the variations observed in SDSS1335+0728 could be either explained by a ∼ 106M⊙ AGN that is just turning on or by an exotic tidal disruption event (TDE). If the former is true, SDSS1335+0728 is one of the strongest cases of an AGNobserved in the process of activating. If the latter were found to be the case, it would correspond to the longest and faintest TDE ever observed (or another class of still unknown nuclear transient). Future observations of SDSS1335+0728 are crucial to further understand its behaviour. Key words. galaxies: active– accretion, accretion discs– galaxies: individual: SDSS J133519.91+072807.4
The Limited Role of the Streaming Instability during Moon and Exomoon FormationSérgio Sacani
It is generally accepted that the Moon accreted from the disk formed by an impact between the proto-Earth and
impactor, but its details are highly debated. Some models suggest that a Mars-sized impactor formed a silicate
melt-rich (vapor-poor) disk around Earth, whereas other models suggest that a highly energetic impact produced a
silicate vapor-rich disk. Such a vapor-rich disk, however, may not be suitable for the Moon formation, because
moonlets, building blocks of the Moon, of 100 m–100 km in radius may experience strong gas drag and fall onto
Earth on a short timescale, failing to grow further. This problem may be avoided if large moonlets (?100 km)
form very quickly by streaming instability, which is a process to concentrate particles enough to cause gravitational
collapse and rapid formation of planetesimals or moonlets. Here, we investigate the effect of the streaming
instability in the Moon-forming disk for the first time and find that this instability can quickly form ∼100 km-sized
moonlets. However, these moonlets are not large enough to avoid strong drag, and they still fall onto Earth quickly.
This suggests that the vapor-rich disks may not form the large Moon, and therefore the models that produce vaporpoor disks are supported. This result is applicable to general impact-induced moon-forming disks, supporting the
previous suggestion that small planets (<1.6 R⊕) are good candidates to host large moons because their impactinduced disks would likely be vapor-poor. We find a limited role of streaming instability in satellite formation in an
impact-induced disk, whereas it plays a key role during planet formation.
Unified Astronomy Thesaurus concepts: Earth-moon system (436)
Discovery of Merging Twin Quasars at z=6.05Sérgio Sacani
We report the discovery of two quasars at a redshift of z = 6.05 in the process of merging. They were
serendipitously discovered from the deep multiband imaging data collected by the Hyper Suprime-Cam (HSC)
Subaru Strategic Program survey. The quasars, HSC J121503.42−014858.7 (C1) and HSC J121503.55−014859.3
(C2), both have luminous (>1043 erg s−1
) Lyα emission with a clear broad component (full width at half
maximum >1000 km s−1
). The rest-frame ultraviolet (UV) absolute magnitudes are M1450 = − 23.106 ± 0.017
(C1) and −22.662 ± 0.024 (C2). Our crude estimates of the black hole masses provide log 8.1 0. ( ) M M BH = 3
in both sources. The two quasars are separated by 12 kpc in projected proper distance, bridged by a structure in the
rest-UV light suggesting that they are undergoing a merger. This pair is one of the most distant merging quasars
reported to date, providing crucial insight into galaxy and black hole build-up in the hierarchical structure
formation scenario. A companion paper will present the gas and dust properties captured by Atacama Large
Millimeter/submillimeter Array observations, which provide additional evidence for and detailed measurements of
the merger, and also demonstrate that the two sources are not gravitationally lensed images of a single quasar.
Unified Astronomy Thesaurus concepts: Double quasars (406); Quasars (1319); Reionization (1383); High-redshift
galaxies (734); Active galactic nuclei (16); Galaxy mergers (608); Supermassive black holes (1663)
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Physics Investigatory Project on transformers. Class 12thpihuart12
Physics investigatory project on transformers with required details for 12thes. with index, theory, types of transformers (with relevant images), procedure, sources of error, aim n apparatus along with bibliography🗃️📜. Please try to add your own imagination rather than just copy paste... Hope you all guys friends n juniors' like it. peace out✌🏻✌🏻
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptxgoluk9330
Ahota Beel, nestled in Sootea Biswanath Assam , is celebrated for its extraordinary diversity of bird species. This wetland sanctuary supports a myriad of avian residents and migrants alike. Visitors can admire the elegant flights of migratory species such as the Northern Pintail and Eurasian Wigeon, alongside resident birds including the Asian Openbill and Pheasant-tailed Jacana. With its tranquil scenery and varied habitats, Ahota Beel offers a perfect haven for birdwatchers to appreciate and study the vibrant birdlife that thrives in this natural refuge.
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
International Journal of Computer Science, Engineering and Information Technology (IJCSEIT), Vol. 5, No.2, April 2015
DOI : 10.5121/ijcseit.2015.5201
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM
FOR MYANMAR
Wunna Soe1 and Dr. Yadana Thein2
1 University of Computer Studies, Yangon (UCSY), Yangon, Myanmar
2 Department of Computer Hardware, University of Computer Studies, Yangon (UCSY), Yangon, Myanmar
ABSTRACT
This paper proposes a syllable-based Myanmar speech recognition system. There are three stages: feature extraction, phone recognition, and decoding. In feature extraction, the system transforms the input speech waveform into a sequence of acoustic feature vectors, each vector representing the information in a small time window of the signal. The likelihood of the observed feature vectors given linguistic units (words, phones, subparts of phones) is then computed in the phone recognition stage. Finally, the decoding stage takes the acoustic model (AM), which produces this sequence of acoustic likelihoods, plus a phonetic dictionary of word pronunciations, combined with the language model (LM), and produces the most likely sequence of words as output. The system creates the language model for Myanmar using syllable segmentation and a syllable-based n-gram method.
KEYWORDS
Speech Recognition, Language Model, Myanmar, Syllable
1. INTRODUCTION
Speech recognition is one of the major tasks in natural language processing (NLP). It is the process by which a computer maps an acoustic speech signal to text. In general, there are three types of speech recognition systems: speaker-dependent, speaker-independent, and speaker-adaptive. A speaker-dependent system is trained on a single speaker and can recognize only that speaker's speech. Speaker-independent systems can recognize any speaker; they are the most difficult and most expensive to develop, and their accuracy is lower than that of speaker-dependent systems, but they are more flexible. A speaker-adaptive system is built to adapt its processing to the characteristics of new speakers.
Viewed another way, there are two types of speech recognition systems: continuous speech recognition and isolated-word recognition. An isolated-word recognizer processes a single word at a time, requiring a pause between words. A continuous speech system recognizes speech in which words are connected together, i.e. not separated by pauses.
Generally, most speech recognition systems are implemented based on one of Hidden Markov Models (HMM), deep belief neural networks, or dynamic time warping.
Myanmar is a tonal, syllable-timed, largely monosyllabic and analytic language with subject-object-verb word order. It has 9 parts of speech and is spoken by 32 million people as a first language and by 10 million as a second language. No Myanmar speech recognition engine has been built before.
In this paper, we mainly focus on the Myanmar phonetic structure for a speech recognition system. The paper is organized as follows. In section 2, we discuss related work on syllable-based speech recognition. In section 3, we describe the characteristics of Myanmar phones and syllables. In section 4, we present the architecture of the speech recognition system. In sections 5 and 6, we discuss how to build the acoustic model and language model. In section 7, we describe the phonetic dictionary. Finally, we conclude with the results of the proposed system and its difficulties and limitations.
2. RELATED WORK
Many researchers have worked on syllable-based speech recognition in other languages, but no one has implemented a syllable-based speech recognition system for our language, Myanmar. In the following paragraphs, we present some related work on syllable-based speech recognition systems for other languages and on speech recognition for the Myanmar language.
Piotr Majewski proposed a syllable-based language model for a highly inflectional language, Polish. The author demonstrated that syllables are useful sub-word units in language modeling of Polish, and that a syllable-based model is a very promising choice in many cases, such as small available corpora or highly inflectional languages. [7]
R. Thangarajan, A.M. Natarajan, and M. Selvam described syllable modeling in continuous speech recognition for the Tamil language. Two methodologies are proposed that demonstrate the syllable's significance in speech recognition. The first models the syllable as an acoustic unit: context-independent (CI) syllable models are trained and tested. The second integrates syllable information into conventional triphone, i.e. context-dependent (CD), phone modeling. [8]
Xunying Liu, James L. Hieronymus, Mark J. F. Gales, and Philip C. Woodland presented syllable language models for Mandarin speech recognition. Character-level language models were used as an approximation of the allowed syllable sequences that follow Mandarin Chinese syllabotactic rules. A range of combination schemes was investigated to integrate character-sequence-level constraints into a standard word-based speech recognition system. [9]
Ingyin Khaing presented a Myanmar continuous speech recognition system based on DTW and HMM. Combinations of LPC, MFCC, and GTCC techniques are applied in the feature extraction part of that system. The HMM method is extended by combining it with the DTW algorithm in order to combine the advantages of these two powerful pattern recognition techniques. [10]
3. MYANMAR SYLLABLE
The Myanmar language is a member of the Sino-Tibetan family of languages, of which the Tibeto-Burman subfamily forms a part. The Myanmar script derives from the Brahmi script. There are 12 basic vowels, 33 consonants, and 4 medials in the Myanmar language. Words are formed by combining basic characters with extended characters; a Myanmar syllable can carry one or more extended characters, and consonants can be combined to form compound words. The 33 Myanmar consonants are listed in the following table.
Table 1. Myanmar Consonants
The sequential extension of the 12 basic vowels results in the 22 vowels listed in the original thinbongyi. These 22 extended vowels are shown in the following table.
Table 2. Basic and Extension Vowels
4. ARCHITECTURAL OVERVIEW OF SPEECH RECOGNITION SYSTEM
Generally, a speech recognition system takes speech as input, consults a knowledge base of voice data, and outputs text, as shown in figure 1 below. The knowledge base is the data that drives the decoder of the speech recognition system. It is created from three sets of data:
• Dictionary
• Acoustic Model
• Language Model
Figure 1. Speech Recognition System
The dictionary contains a mapping from words to phones. An acoustic model contains acoustic properties for each senone (a state of a phone). A language model is used to restrict the word search. It defines which words can follow previously recognized words (recall that matching is a sequential process) and helps significantly restrict the matching process by pruning words that are not probable.
The proposed speech recognition system has three main components: feature extraction, phone recognition, and decoding. The architecture of a simplified speech recognition system is shown below. In this architecture, we compute the most probable word sequence W given an observation sequence O by choosing the sentence for which the product of the two probabilities is greatest, as in the following equation. [1]
Figure 2. A Simple Discriminative Speech Recognition System Overview
Ŵ = argmax_{W ∈ L} P(O|W) P(W)    (1)
In equation (1), the acoustic model computes the observation likelihood P(O|W), and the language model provides the prior probability P(W). [1]
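The decision rule in equation (1) is usually evaluated in log space to avoid numerical underflow. The sketch below illustrates it with toy scores; the candidate sentences and probabilities are invented for illustration, not outputs of the proposed system:

```python
import math

def best_hypothesis(hypotheses):
    """Pick the word sequence W maximizing P(O|W) * P(W).

    `hypotheses` maps a candidate sentence to a pair of
    (acoustic log-likelihood log P(O|W), language-model log-prob log P(W)).
    Summing logs equals multiplying the probabilities in equation (1).
    """
    return max(hypotheses, key=lambda w: sum(hypotheses[w]))

# Toy scores for illustration only (not real model outputs).
scores = {
    "recognize speech": (math.log(1e-5), math.log(1e-3)),
    "wreck a nice beach": (math.log(2e-5), math.log(1e-6)),
}
print(best_hypothesis(scores))  # → recognize speech
```

Here the second hypothesis has a slightly better acoustic score, but the language model prior overwhelms it, which is exactly the role the LM plays in equation (1).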
4.1. Feature Extraction
Feature extraction is the stage that transforms the speech waveform into a sequence of acoustic feature vectors. Each feature vector represents the information in a small time window of the
signal. The acoustic waveform is divided into frames (usually 10, 15, or 20 milliseconds of frame size) that are transformed into spectral features, as the following figure shows. Each time frame (window) is thus represented by a vector of around 39 features representing this spectral information. [1]
There are seven steps in feature extraction process:
1. Pre-emphasis
2. Windowing
3. Discrete Fourier Transform
4. Mel Filter Bank
5. Log
6. Inverse Discrete Fourier Transform
7. Deltas and Energy.
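The first two steps can be sketched in pure Python as follows. The pre-emphasis coefficient 0.97, the Hamming window, the toy sine signal, and the frame sizes are assumed illustrative values, not the paper's settings:

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """Boost high-frequency energy: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frames(signal, size, shift):
    """Slice the waveform into overlapping fixed-size frames."""
    return [signal[i:i + size]
            for i in range(0, len(signal) - size + 1, shift)]

def hamming(frame):
    """Apply a Hamming window so the frame edges taper smoothly toward zero."""
    N = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n, s in enumerate(frame)]

signal = [math.sin(0.1 * n) for n in range(160)]   # toy 10 ms of 16 kHz audio
emphasized = pre_emphasis(signal)
windowed = [hamming(f) for f in frames(emphasized, size=32, shift=16)]
```

Each windowed frame would then be passed to the DFT and mel filter bank in steps 3-4.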
Figure 3. Windowing Process of Feature Extraction
The pre-emphasis stage boosts the amount of energy in the high frequencies. Boosting the high-frequency energy makes information from the higher formants more available to the acoustic model, and this can improve phone detection accuracy. The roughly stationary portion of speech is extracted using a window that is non-zero inside some region and zero elsewhere, by running this window across the speech signal and extracting the waveform inside it. The method for extracting spectral information in discrete frequency bands from a discrete-time signal is the discrete Fourier transform (DFT).
The model used in Mel-frequency cepstral coefficient (MFCC) extraction warps the frequencies output by the DFT onto the mel scale. A mel is a unit of pitch. In general, the human response to signal level is logarithmic; humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes. In addition, taking the log makes the feature estimates less sensitive to variations in the input, such as power variations due to the distance between the speaker and the microphone. (In Figure 3, the frame size is 20 ms and the frame shift is 4 ms.) The next step in MFCC feature extraction is the computation of the cepstrum, also called the spectrum of the log of the spectrum. The cepstrum
can be seen as the inverse DFT of the log magnitude of the DFT of a signal.
Extracting the cepstrum with the inverse DFT in the previous steps yields 12 cepstral coefficients for each frame. The energy in a frame, the 13th feature, is the sum over time of the power of the samples in the frame. The delta features estimate the slope of each feature using a wider context of frames.
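The energy and delta computations above can be sketched as follows; the two-frame symmetric difference used for the slope is one common choice, assumed here rather than taken from the paper:

```python
def frame_energy(frame):
    """13th feature: the sum over time of the power of the samples."""
    return sum(s * s for s in frame)

def deltas(feature_track):
    """Estimate the slope of one cepstral feature across frames using a
    symmetric difference, padding the edges by repeating the end values."""
    padded = [feature_track[0]] + feature_track + [feature_track[-1]]
    return [(padded[i + 2] - padded[i]) / 2 for i in range(len(feature_track))]

track = [1.0, 2.0, 4.0, 4.0]   # one cepstral coefficient over 4 frames
print(deltas(track))           # → [0.5, 1.5, 1.0, 0.0]
```

Stacking the 12 cepstral coefficients, the energy, and their deltas and double-deltas gives the roughly 39-dimensional vectors mentioned above.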
4.2. Phone Recognition
The phone recognition stage computes the likelihood of the observed spectral feature vectors given Myanmar phone units or subparts of phones. In this proposed system, we use Gaussian mixture model (GMM) classifiers to compute, for each HMM state q corresponding to a phone or subphone, the likelihood p(o|q) of a given feature vector. The GMM is computed as the following equation. [1]
f(x | µ_j, Σ_j) = Σ_{k=1}^{M} c_{jk} · 1 / ((2π)^{D/2} |Σ_{jk}|^{1/2}) · exp(−½ (x − µ_{jk})^T Σ_{jk}^{−1} (x − µ_{jk}))    (2)
In equation (2), M is the number of Gaussian components, the c_{jk} are the mixture weights, and D is the dimensionality; in this system D = 39.
Most speech recognition algorithms compute observation probabilities directly on the real-valued, continuous input feature vectors. The acoustic models are based on computing a probability density function (pdf) over a continuous space. By far the most common method for computing acoustic likelihoods is the Gaussian mixture model (GMM) pdf, although neural networks, support vector machines (SVMs), and conditional random fields (CRFs) are also used.
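A minimal sketch of evaluating the mixture likelihood of equation (2), assuming diagonal covariances (a common simplification, not stated in the paper) and computed in log space for numerical stability:

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """log f(x | state) for a diagonal-covariance Gaussian mixture:
    sum over k of c_k * N(x; mu_k, diag(var_k)), as in equation (2)."""
    log_terms = []
    D = len(x)
    for c, mu, var in zip(weights, means, variances):
        # log of one D-dimensional diagonal Gaussian component
        log_det = sum(math.log(v) for v in var)
        quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
        log_terms.append(math.log(c)
                         - 0.5 * (D * math.log(2 * math.pi) + log_det + quad))
    m = max(log_terms)               # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Toy 2-component, 2-dimensional mixture (illustrative values only).
ll = gmm_log_likelihood(x=[0.0, 0.0],
                        weights=[0.6, 0.4],
                        means=[[0.0, 0.0], [1.0, 1.0]],
                        variances=[[1.0, 1.0], [1.0, 1.0]])
```

In a real system one such mixture is evaluated per HMM state for every 39-dimensional feature vector.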
4.3. Decoding
In the decoding stage, the proposed system uses the Viterbi algorithm as the decoder. The decoder is the heart of the speech recognition process. Its task is to find the best hidden sequence of states using the sequence of observations as input. First the decoder selects the next set of likely states, then scores the incoming features against those states, prunes low-scoring states, and finally generates the result.
5. HOW TO BUILD ACOUSTIC MODEL
The acoustic model is trained by analyzing large corpora of Myanmar language speech. Hidden
Markov Models (HMM) represent each unit of speech in the acoustic model. HMMs are used by
a scorer to calculate the acoustic probability for a particular unit of speech. Each state of an HMM
is represented by a set of Gaussian mixture density functions. A Gaussian mixture model (GMM)
is a parametric probability density function represented as a weighted sum of Gaussian
component densities. There are many acoustic model training tools; among them, we chose the sphinxtrain tool to build an acoustic model for a new language, Myanmar.
To build a speech recognition system for a single speaker, we collected one hour of recording files, each 7 seconds long on average. The parameters of the acoustic models of the sound units, represented by feature vectors, are learnt by the trainer. This collection is called the training database. The file structure of the database is:
/etc
/db_name.dic
/db_name.phone
/db_name.lm.DMP
/db_name.filler
/db_name_train.fileids
/db_name_train.transcription
/db_name_test.fileids
/db_name_test.transcription
/wav
/speaker_1
/file1.wav
/file2.wav
…
In the above file structure, etc, wav, and speaker_1 are folder names. The db_name.dic file is the phonetic dictionary that maps words to phones. The db_name.phone file is the phone set file, with one phone per line. The db_name.lm.DMP file is the language model; it may be in ARPA format or in DMP format. The db_name.filler file is a filler dictionary that contains filler phones: non-linguistic sounds not covered by the language model, such as breath, hmm, or laughter. The db_name_train.fileids file is a text file listing the names of the training recordings, one per line, and db_name_test.fileids lists the test recordings in the same way. The db_name_train.transcription file contains the transcription for each training audio file, and db_name_test.transcription the transcription for each test audio file. The wav files (filename.wav) are recordings with a specific sample rate: 16 kHz, 16-bit, mono. [4]
After training, the acoustic model is located in the db_name.cd_cont_<number_of_senones> folder, where <number_of_senones> is the number of senones produced by the training tool. This folder is under the model_parameters folder auto-generated by sphinxtrain. In the db_name.cd_cont_<number_of_senones> folder, the model should have the following files:
/db_name.cd_cont_<number_of_senones>
/mdef
/feat.params
/mixture_weights
/means
/noisedict
/transition_matrices
/variances.
The feat.params file contains the feature extraction parameters, a list of options used to configure feature extraction. The mdef file is the definition file that maps triphone contexts to GMM ids (senones). The means file contains the Gaussian codebook means, and the variances file the Gaussian codebook variances. The mixture_weights file describes the mixture weights for the Gaussians. The transition_matrices file contains the HMM transition matrices. The noisedict file is the dictionary for filler words.
6. HOW TO BUILD LANGUAGE MODEL
The language model describes what is likely to be spoken in a particular context. Two types of models are used in speech recognition systems: grammars and statistical language models. Grammar-type language models describe very simple languages for command and control, and they are usually written by hand or generated automatically with plain code. Statistical language models use a stochastic approach called the n-gram language model. An n-gram is an n-token sequence of words: a 2-gram (bigram) is a two-word sequence; a 3-gram (trigram) is a three-word sequence. N-gram conditional probabilities can be computed from plain text based on the relative frequency of word sequences. Along another dimension, there are two types of statistical language model. The first is the closed-vocabulary language model, which assumes that the test set can only contain words from the given lexicon; there are no unknown words. An open-vocabulary language model is one in which possible unknown words in the test set are modeled by adding a pseudo-word called UNK, and its training process estimates the probabilities of the unknown word model.
There are many approaches and tools for creating statistical language models. We use the CMU language modeling toolkit to create an n-gram language model. The toolkit expects its input to be normalized text files, with utterances delimited by <s> and </s> tags. [4] In this pre-processing step, we create the normalized text files in a syllable-based way. The syllable-based normalization is as follows:
(normalized sentence)
Before normalization, we split the sentences syllable by syllable. In the syllable segmentation process, we use rule-based segmentation to split syllables from sentences. The output is a 3-gram language model over the vocabulary from the normalized text file, but our output language model file is based on Myanmar syllables. In this system, we chose the closed-vocabulary model.
The output language model file is in ARPA format or binary format. An ARPA-format language model file is shown in figure 4 below.
Figure 4. 3-grams (3 syllables sequences) in Language Model File
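The syllable-based n-gram counting described above can be sketched as follows. The syllables "ka", "la", "ma" are invented transliterations for illustration, not the paper's training data:

```python
from collections import Counter

# Hypothetical syllable-segmented training lines; real input would be
# Myanmar sentences split by the rule-based syllable segmenter.
corpus = [
    ["ka", "la", "ma"],
    ["ka", "la", "ka"],
]

def ngram_counts(sentences, n):
    """Count n-grams with <s>/</s> sentence delimiters, as in ARPA-style LMs."""
    counts = Counter()
    for s in sentences:
        toks = ["<s>"] * (n - 1) + s + ["</s>"]
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

bigrams = ngram_counts(corpus, 2)
# Maximum-likelihood bigram probability P(la | ka):
p = bigrams[("ka", "la")] / sum(c for g, c in bigrams.items() if g[0] == "ka")
```

The CMU toolkit performs this counting (with smoothing) over the whole normalized corpus and emits the probabilities in ARPA format.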
7. PHONETIC DICTIONARY (LEXICON)
A phonetic dictionary is a text file containing a mapping from words to phones. It is also a lexicon: a list of words with a pronunciation for each word, expressed as a phone sequence. Each phone's HMM sequence is composed of subphones, each with a Gaussian emission likelihood model. An example of the phonetic dictionary follows in Table 3.
Table 3. Part of Phonetic Dictionary
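The word-to-phones mapping can be sketched as a simple parser. The entries below are invented, Latin-transliterated placeholders in CMU-Sphinx style, not the paper's actual Myanmar lexicon:

```python
# Hypothetical dictionary lines: WORD  PH1 PH2 ... (placeholders, not real data)
raw = """\
MIN.GA.LA.BA  m i n g a l a b a
SA  s a
"""

def load_lexicon(text):
    """Parse 'word -> phone sequence' lines into a lookup table."""
    lexicon = {}
    for line in text.strip().splitlines():
        word, *phones = line.split()
        lexicon[word] = phones
    return lexicon

lex = load_lexicon(raw)
print(lex["SA"])  # → ['s', 'a']
```

The decoder uses exactly this lookup to expand each candidate word into the HMM state sequence of its phones.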
7. RESULT OF SPEECH RECOGNITION
In Figure 6, hidden states are shown in circles and observations in squares. Dotted (unfilled) circles indicate illegal transitions. For a given state qj at time t, the value vt(j) is computed as follows:
vt(j) = max over 1 ≤ i ≤ N of [ vt−1(i) · aij · bj(ot) ]                  (3)
In equation (3), vt(j) is the Viterbi probability at time t, aij is the transition probability from previous state qi to current state qj, and bj(ot) is the state observation likelihood of the observation ot given the current state j [1].
Figure 5. A Hidden Markov Model for relating feature values and Myanmar syllables
Figure 6. The Viterbi trellis for computing the best sequence of Myanmar syllables
9. CONCLUSION AND FUTURE WORK
The standard evaluation metric for speech recognition systems is the word error rate (WER), which measures how much the word string returned by the recognizer differs from a correct or reference transcription. The proposed system, however, uses the syllable error rate (SER) as its evaluation metric instead of the word error rate; the result is therefore based on how much the syllable string returned by the recognition engine differs from a correct or reference transcription. At present, the proposed system is speaker-dependent and its language model is of the closed-vocabulary type. In the future, we plan to develop it into a speaker-independent speech recognition system with an open-vocabulary language model.
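The SER computation described above is the standard edit-distance calculation applied to syllables; a minimal sketch (ours, with romanized toy syllables) is:

```python
# Syllable error rate: Levenshtein distance between reference and hypothesis
# syllable strings, divided by the reference length.
def syllable_error_rate(ref, hyp):
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[R][H] / R

print(syllable_error_rate(["thu", "kyaung", "thwa", "de"],
                          ["thu", "kyaung", "de"]))  # 0.25
```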
REFERENCES
[1] Daniel Jurafsky and James H. Martin (2009), Speech and Language Processing, Pearson Education Ltd., Upper Saddle River, New Jersey 07458
[2] Myanmar Language Commission (2011), Myanmar-English Dictionary, Department of Myanmar
Language Commission, Ministry of Education, Union of Myanmar
[3] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel (2004), "Sphinx-4: A Flexible Open Source Framework for Speech Recognition", SMLI TR2004-0811, Sun Microsystems Inc.
[4] Hassan Satori, Hussein Hiyassat, Mostafa Harti, and Noureddine Chenfour (2009), “Investigation
Arabic Speech Recognition Using CMU Sphinx System”, The International Arab Journal of
Information Technology, Vol. 6, April
[5] http://en.wikipedia.org/wiki/Speech_recognition
[6] http://cmusphinx.sourceforge.net/
[7] Piotr Majewski (2008), “Syllable Based Language Model for Large Vocabulary Continuous Speech
Recognition of Polish”, University of Łód´z, Faculty of Mathematics and Computer Science ul.
Banacha 22, 90-238 Łód´z, Poland, P. Sojka et al. (Eds.): TSD 2008, LNAI 5246, pp. 397–401
[8] R. Thangarajan, A.M. Natarajan, M. Selvam(2009), “Syllable modeling in continuous speech
recognition for Tamil language”, Department of Information Technology, Kongu Engineering
College, Perundurai 638 052, Erode, India, Int J Speech Technol (2009) 12: 47–57
[9] Xunying Liu, James L. Hieronymus, Mark J. F. Gales and Philip C. Woodland (2013), “Syllable
language models for Mandarin speech recognition: Exploiting character language models”,
Cambridge University Engineering Department, Cambridge, United Kingdom, J. Acoust. Soc. Am.
133 (1), January 2013
[10] Ingyin Khaing (2013), “Myanmar Continuous Speech Recognition System Based on DTW and
HMM”, Department of Information and Technology, University of Technology (Yatanarpon Cyber
City),near Pyin Oo Lwin, Myanmar, International Journal of Innovations in Engineering and
Technology (IJIET), Vol. 2 Issue 1 February 2013
[11] Ciro Martins, António Teixeira, João Neto (2004), “Language Models in Automatic Speech
Recognition”, VOL. 4, Nº 2, JANEIRO 2004, L2F – Spoken Language Systems Lab; INESC-ID/IST,
Lisbon
[12] Edward W. D. Whittaker, Statistical Language Modeling for Automatic Speech Recognition of Russian and English, Trinity College, University of Cambridge
[13] Mohammad Bahrani, Hossein Sameti, Nazila Hafezi, and Saeedeh Momtazi (2008), “A New Word
Clustering Method for Building N-Gram Language Models in Continuous Speech Recognition
Systems”, Speech Processing Lab, Computer Engineering Department, Sharif University of
Technology, Tehran, Iran, N.T. Nguyen et al. (Eds.): IEA/AIE 2008, LNAI 5027, pp. 286–293
Authors
Dr. Yadana Thein is working as an associate professor at the Department of Computer Hardware Technology, University of Computer Studies, Yangon. She received her master's degree and her doctoral degree from the University of Computer Studies, Yangon. Her research interest is in Natural Language Processing.
Wunna Soe is currently a Ph.D. candidate at the University of Computer Studies, Yangon. He received the Master of Computer Science (M.C.Sc.) degree from the University of Computer Studies, Mandalay (UCSM). His current research covers Automatic Speech Recognition and Natural Language Processing.