This paper describes the implementation of TD-PSOLA tools to improve the quality of an Arabic text-to-speech (TTS) system based on diphone concatenation with a TD-PSOLA modification synthesizer. It presents techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time-Domain Pitch-Synchronous Overlap-Add) method, which decomposes the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modification of the speech signal, together with adjustment and optimization of a diphone database with integrated vowels.
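The overlap-add core of TD-PSOLA can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the pitch marks are already known, and it omits the frame duplication a real system uses to preserve duration.

```python
import numpy as np

def td_psola(x, marks, pitch_scale):
    """Toy TD-PSOLA sketch: extract two-period Hanning-windowed frames
    centred on the given pitch marks, then overlap-add them at a spacing
    scaled by 1/pitch_scale (pitch_scale > 1 raises the pitch). A real
    system would also repeat or drop frames to preserve duration."""
    period = int(np.mean(np.diff(marks)))            # analysis pitch period
    new_period = max(1, int(round(period / pitch_scale)))
    y = np.zeros(2 * period + len(marks) * new_period)
    out_pos = period
    for m in marks:
        lo, hi = m - period, m + period
        if lo < 0 or hi > len(x):
            continue                                 # skip incomplete edge frames
        y[out_pos - period:out_pos + period] += x[lo:hi] * np.hanning(2 * period)
        out_pos += new_period
    return y

# A glottal-pulse-like train with a 100-sample period, shifted up an octave
n = np.arange(2000)
x = np.exp(-0.5 * (((n % 100) - 50) / 3.0) ** 2)     # one pulse every 100 samples
marks = np.arange(150, 1900, 100)                    # one pitch mark per pulse
y = td_psola(x, marks, pitch_scale=2.0)              # output pulses every ~50 samples
```

The windowed two-period frames overlap by design, so respacing them changes the repetition rate (the pitch) while each frame keeps the local waveform shape intact.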
This paper proposes a voice morphing system for people suffering from laryngectomy, the surgical removal of all or part of the larynx (voice box), particularly performed in cases of laryngeal cancer. A basic method of achieving voice morphing is to extract the source speaker's vocal coefficients and convert them into the target speaker's vocal parameters. In this paper, we deploy Gaussian Mixture Models (GMM) for mapping the coefficients from source to target. However, the conventional GMM-based mapping approach suffers from over-smoothing of the converted voice. We therefore propose a method to perform efficient GMM-based voice morphing and conversion that overcomes this over-smoothing. It uses glottal waveform separation and prediction of excitations; the results show that not only is over-smoothing eliminated, but the transformed vocal-tract parameters also match the target. Moreover, the synthesized speech thus obtained is found to be of sufficiently high quality. The proposed GMM-based voice morphing approach is critically evaluated on various subjective and objective evaluation parameters, and an application of voice morphing for laryngectomees deploying this approach is recommended.
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a target speaker. It works by analyzing the source speech into an excitation signal and filter components, then resynthesizing it with the pitch and vocal characteristics of the target speaker. The key steps are detecting the pitches of the source and target speakers, scaling the source pitch to match the target, then resynthesizing the source speech using the target's vocal filter characteristics and the pitch-scaled excitation signal. Voice morphing was developed in 1999 and has applications in text-to-speech, dubbing, voice disguising, and public announcement systems.
Hybrid Phonemic and Graphemic Modeling for Arabic Speech Recognition (Waqas Tariq)
The document summarizes a study that proposes a hybrid approach for acoustic and pronunciation modeling in Arabic speech recognition. It combines phonemic and graphemic modeling techniques. Two baseline speech recognition systems were built using phonemic and graphemic acoustic models. These models were then fused into a hybrid acoustic model. Different hybrid techniques for pronunciation modeling were also proposed and evaluated on a broadcast news speech corpus, showing error rate reductions of 8.8-12.6% over the baselines. The hybrid approach aims to benefit from both vocalized and non-vocalized Arabic resources.
This document describes the process of voice morphing, which involves transitioning one speech signal into another while preserving shared characteristics. It discusses representing speech signals in a domain that separates pitch and envelope information. A key step is dynamic time warping to match pitch features between signals. The morphed signal is created through interpolation and reconverted to an acoustic waveform. Examples show morphing between different gender pairs of speakers. Voice morphing aims to smoothly transition one voice into another in a similar manner as image morphing blends two faces.
This document presents an overview of voice morphing technology. It discusses that voice morphing is a technique to modify a source speaker's voice to sound like a target speaker. It describes the need for voice morphing in applications like text-to-speech, public address systems, and for special effects. The technical process involves extracting spectral and pitch information from both voices and using algorithms like dynamic time warping and signal re-estimation to morph the source voice into the target voice. Some applications discussed are for altering evidence in courts or creating fake orders in military conflicts.
Voice morphing is a technique that modifies a source speaker's speech to sound like a target speaker. It was developed by George Papcun at Los Alamos National Laboratory. The process involves preprocessing the speech signals, analyzing pitch and formants, matching signals using dynamic time warping, and re-estimating the signals. Voice morphing can be used for text-to-speech, public address systems, and special effects. While it allows cloning speech patterns, it has limitations around normalization and requires extensive sound libraries for different languages.
It is a technique to modify a source speaker's speech to sound as if it was spoken by a target speaker.
Voice morphing enables speech patterns to be cloned
An accurate copy of a person's voice can be made and used to say anything in the voice of someone else.
The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include formant synthesis and articulatory synthesis. Formant synthesis uses individually controllable formant filters, which can be set to produce accurate estimates of the vocal-tract transfer function. Articulatory synthesis produces speech by directly modeling the behavior of the human articulators. Second-generation techniques include concatenative synthesis and sinusoidal synthesis. Concatenative synthesis generates speech by concatenating segments of recorded speech, and generally produces natural-sounding synthesized speech. Sinusoidal synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency; the model parameters are the amplitudes and phases of the harmonics, so the value of the fundamental can be changed while keeping the same basic spectral envelope. Third-generation techniques include Hidden Markov Model (HMM) synthesis, which trains a parametric model and produces high-quality speech, and unit-selection synthesis, which operates by selecting the best sequence of units from a large speech database matching the target specification.
Voice morphing is a technique that modifies a source speaker's speech to sound like a target speaker. It does this by changing the pitch from the source speaker, like a male voice, to the target speaker, like a female voice. This is done by interpolating the linear predictive coding coefficients of the source and target signals. The pitch of the morphed signal can be positioned between the source and target by varying a constant value between 0 and 1. Applications include changing voices for security or entertainment purposes, but limitations include difficulties with voice detection and requiring extensive sound libraries.
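The interpolation described above can be sketched in a few lines. This is a toy illustration of the idea only: the function name and coefficient values are assumptions, and real systems usually interpolate in a stability-preserving domain such as line spectral frequencies rather than raw LPC coefficients.

```python
import numpy as np

def morph_parameters(src_lpc, tgt_lpc, src_pitch, tgt_pitch, alpha):
    """Blend source and target parameters with a morph constant alpha
    in [0, 1]: alpha=0 gives the pure source, alpha=1 the pure target."""
    lpc = (1.0 - alpha) * np.asarray(src_lpc) + alpha * np.asarray(tgt_lpc)
    pitch = (1.0 - alpha) * src_pitch + alpha * tgt_pitch
    return lpc, pitch

# Halfway morph between a 120 Hz (male-like) and 220 Hz (female-like) voice
lpc, pitch = morph_parameters([1.2, -0.5], [0.8, -0.3], 120.0, 220.0, 0.5)
# pitch == 170.0, lpc == [1.0, -0.4]
```

Varying alpha continuously positions the morphed pitch anywhere between source and target, which is exactly the degree of freedom the summary describes.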
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a different target speaker. The process involves preprocessing the speech signal, analyzing the pitch and envelope, morphing through warping and interpolation, and re-estimating the signal. To morph voices between a male and female speaker, the pitch of the male speaker is shifted to match that of the female speaker by time-stretching the residue signal and adjusting the LPC coefficients. Potential applications include using popular speakers for public announcements, and effects in films, but limitations include difficulties in voice detection and updating systems for new languages.
Voice morphing is a technique that modifies a source speaker's speech to sound like a target speaker. It works by preprocessing the speech signals, analyzing pitch and envelope, matching and warping the signals, and re-estimating the signal. This allows for an accurate copy of a person's voice to be made. Voice morphing has applications in text-to-speech systems, public address systems, and special effects. It can also be used to create fake conversations for use in courts or as a battlefield deception tool. The process separates speech into spectral envelope and pitch/voicing information before realigning and recombining the signals.
This document summarizes a seminar on voice morphing techniques. It discusses what voice morphing is, including modifying a source speaker's voice to sound like a target speaker. It then outlines the main topics that were covered in the seminar, including transform-based voice morphing systems, enhancing systems with phase prediction and spectral refinement, enabling real-time voice morphing, and conclusions about extending the techniques to other audio sounds.
This document analyzes speech coding algorithms for Hindi and English languages. It discusses Linear Predictive Coding (LPC), an algorithm that accurately estimates speech parameters and represents speech signals at reduced bit rates while preserving quality. The paper proposes a voice-excited LPC algorithm and implements it on Hindi and English male and female voices. It analyzes tradeoffs between bit rates, delay, signal-to-noise ratio, and complexity. The results show low bit-rates and better signal-to-noise ratio with this algorithm.
This document discusses speech signal processing and speech recognition. It begins by defining speech processing and its relationship to digital signal processing. It then outlines several disciplines related to speech processing including signal processing, physics, pattern recognition, and computer science. The document discusses aspects of speech signals including phonemes, the speech waveform, and spectral envelope. It covers various aspects of speech processing including pre-processing, feature extraction, and recognition. It provides details on techniques for pre-processing, feature extraction including filtering, linear predictive coding, and cepstrum. Finally, it summarizes the main steps in a speech recognition procedure including endpoint detection, framing and windowing, feature extraction, and distortion measure calculations for recognition.
State-of-the-art Automatic Speech Recognition (ASR) systems lack the ability to identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the phonetic distance between pronunciations, together with a critical-distance threshold criterion for the classification. The proposed method consists of two steps: a training step that estimates the critical-distance parameter from transcribed data, and a classification step that uses this criterion to classify input utterances into pronunciation variants and OOV words.
The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary, and evaluated using a confusion matrix with precision, recall, and accuracy metrics. Experimental results show significant performance improvement over existing classifiers.
This document summarizes digital modeling techniques for speech signals. It describes the vocal source and vocal tract that produce speech. It then discusses using sampling and techniques like PCM to digitally represent speech signals. Linear predictive coding is presented as a simple method to analyze speech that approximates samples as combinations of past signals. The summary concludes that linear prediction can be used for spectrum estimation by representing the vocal tract transfer function, pitch detection, and speech synthesis.
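The core idea of linear prediction, approximating each sample as a combination of past samples, can be sketched with a small autocorrelation-based solver. A compact numpy solve stands in here for the usual Levinson-Durbin recursion, and the test signal and model order are illustrative.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Solve for coefficients a[1..p] so that x[n] ~ sum_k a[k] * x[n-k],
    using the autocorrelation (normal-equation) formulation."""
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# A decaying second-order resonance is predicted almost perfectly at order 2
n = np.arange(512)
x = (0.98 ** n) * np.sin(2 * np.pi * 0.1 * n)
a = lpc_coefficients(x, order=2)
pred = a[0] * x[1:-1] + a[1] * x[:-2]        # predict each sample from the past two
err = np.mean((x[2:] - pred) ** 2)           # near-zero prediction error
```

Because the coefficients describe a resonant all-pole filter, the same fit doubles as a vocal-tract spectrum estimate, which is the use the summary mentions.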
The document summarizes a technical seminar presentation on voice morphing. It describes voice morphing as a technique that modifies a source speaker's speech to sound like a target speaker. It discusses the need for voice morphing in applications like text-to-speech, discusses the voice morphing process which includes preprocessing, pitch and envelope analysis, morphing and signal re-estimation, and addresses limitations and advantages.
This document discusses voice modification techniques. It compares four models for voice modification: LPC, H/S, TD-PSOLA, and MBR-PSOLA. TD-PSOLA relies on pitch-synchronous overlap-add and modifies the speech signal in the time domain based on analysis and synthesis markings. The document implements a voice modification system using TD-PSOLA that can modify vocal tract, pitch, and time scale parameters to change voice quality, such as making a female voice sound more husky or nasal. Results show the algorithm can effectively modify input voices as desired.
Speaker recognition systems aim to automatically identify or verify a speaker's identity based on characteristics of their voice. There are two main types: speaker identification determines which registered speaker is speaking, while speaker verification accepts or rejects a speaker's claimed identity. All systems contain modules for feature extraction and feature matching. Feature extraction represents the voice signal with parameters like MFCCs that can distinguish speakers. Feature matching compares extracted features from an unknown voice to known speaker models. The document describes the process of MFCC feature extraction in detail, including framing the speech signal, windowing frames, taking the FFT, mapping to the mel scale, and finally the DCT to produce MFCC coefficients.
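The MFCC pipeline summarized above (windowing a frame, taking the FFT, mapping to the mel scale, then the DCT) can be sketched for a single frame as follows. The filter count, FFT size, and number of coefficients are common defaults assumed for illustration, not values from the document.

```python
import numpy as np

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """One frame of the MFCC pipeline: window -> power spectrum ->
    mel filterbank -> log -> DCT."""
    n_fft = 512
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2 / n_fft

    # Triangular filters equally spaced on the mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)

    log_energy = np.log(fbank @ power + 1e-10)

    # DCT-II of the log filterbank energies gives the cepstral coefficients
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return dct @ log_energy

fs = 16000
t = np.arange(400) / fs                       # one 25 ms frame
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), fs)
```

The resulting low-order coefficients compactly describe the spectral envelope, which is why they are effective for distinguishing speakers in the matching stage.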
Voice morphing involves transforming one speech signal into another through three main steps: 1) extracting pitch and envelope information through cepstral analysis, 2) using dynamic time warping to match pitch contours between signals and interpolating values to create intermediate morphs, and 3) re-estimating and reconstructing the morphed signal as an acoustic waveform. Applications include military psychological operations, creating fake audio evidence, and voice acting in cartoons. The document provides details on preprocessing signals, the morphing and warping process, and applications and conclusions regarding voice morphing technology.
This document discusses homomorphic speech processing and techniques for speech enhancement. It provides an overview of modeling speech production as the excitation of a linear time-invariant system. Homomorphic filtering is introduced as a way to deconvolve speech into excitation and system response using logarithmic transformations. The complex cepstrum is discussed as a representation of speech that can be used to estimate pitch, voicing and formant frequencies. Homomorphic vocoding is described as a speech coding technique that quantizes the low-time part of the cepstrum at regular intervals to encode speech. Common techniques for speech enhancement like spectral subtraction and adaptive noise cancellation are also mentioned.
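The homomorphic deconvolution idea can be sketched with the real cepstrum, the inverse FFT of the log magnitude spectrum. The synthetic test signal and the quefrency search range below are illustrative assumptions.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum. The
    low-quefrency part holds the slowly varying (vocal-tract) envelope,
    while voiced excitation shows up as a peak at the pitch period."""
    spectrum = np.fft.fft(x)
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))

fs = 8000
n = np.arange(1024)
# Crude voiced-speech stand-in: 5 harmonics of 250 Hz (period = 32 samples)
x = sum(np.sin(2 * np.pi * 250 * k * n / fs) for k in range(1, 6))
c = real_cepstrum(x)
peak = 20 + int(np.argmax(c[20:200]))   # quefrency of the excitation peak
```

Because the log turns the convolution of excitation and vocal-tract response into a sum, liftering (keeping only the low-time part) separates the envelope, which is the basis of the homomorphic vocoder the document describes.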
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS (cscpconf)
Human beings generate different speech waveforms when speaking the same word at different times. Different human beings also have different accents and generate significantly varying speech waveforms for the same word. There is a need to measure the distance between various pronunciations of words, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses a dynamic programming technique for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages such as English, or to build pronunciation dictionaries for lesser-known, sparse languages. Precision measurement experiments show 88.9% accuracy.
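The global-alignment idea behind DPW can be sketched as a dynamic-programming edit distance over phone sequences. The unit substitution/indel costs and the length normalization below are illustrative assumptions, not the paper's exact formulation.

```python
def phone_distance(seq_a, seq_b, sub_cost=1.0, indel_cost=1.0):
    """Global alignment (edit) distance between two phone sequences,
    computed with the standard dynamic-programming recurrence."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = cost of aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (0.0 if seq_a[i - 1] == seq_b[j - 1] else sub_cost)
            dp[i][j] = min(match, dp[i - 1][j] + indel_cost, dp[i][j - 1] + indel_cost)
    # Normalize by the longer sequence so distances are comparable
    return dp[n][m] / max(n, m, 1)

# Two pronunciations of "tomato" (ARPAbet-style phones): one substitution
d = phone_distance("T AH M EY T OW".split(), "T AH M AA T OW".split())  # ~0.167
```

Comparing such distances against a trained critical-distance threshold is what separates pronunciation variants of a known word from genuinely out-of-vocabulary words.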
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco... (TELKOMNIKA JOURNAL)
Sundanese is one of the popular languages in Indonesia, which makes research on Sundanese speech recognition essential and motivated this study. The vital parts in achieving high recognition accuracy are feature extraction and the classifier, and the main goal of this study was to analyze the first. Three types of feature extraction were tested: Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The outputs of the three feature extractors became the input to the classifier, for which the study applied Hidden Markov Models. Before classification, vector quantization based on clustering was performed, and each result was compared across the number of clusters and hidden states used. The dataset came from four people who each spoke the digits zero through nine 60 times. The experiments showed that all three feature extraction methods produced the same performance on the corpus used.
IRJET - Pitch Detection Algorithms in Time Domain (IRJET Journal)
This document discusses pitch detection algorithms in the time domain. It describes two common time domain pitch detection methods: the autocorrelation method and average magnitude difference function (AMDF) method. The autocorrelation method detects the periodicity of a speech signal by finding the highest value of the autocorrelation function. The AMDF method calculates the average magnitude of differences between the original and delayed speech signal at different lags, and identifies the pitch period as the lag with the minimum AMDF value. The document also provides implementation results of these two methods on speech samples, demonstrating their ability to estimate pitch periods in the time domain.
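The AMDF method described above can be sketched as follows. The search band and the decaying test tone are illustrative choices, not values from the document.

```python
import numpy as np

def amdf_pitch(signal, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch via the Average Magnitude Difference Function:
    the lag with the minimum AMDF value inside the plausible pitch range
    is taken as the pitch period."""
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    x = np.asarray(signal, dtype=float)
    best_lag, best_val = lag_min, np.inf
    for lag in range(lag_min, min(lag_max, len(x) - 1) + 1):
        val = np.mean(np.abs(x[lag:] - x[:-lag]))   # AMDF at this lag
        if val < best_val:
            best_val, best_lag = val, lag
    return fs / best_lag                            # pitch estimate in Hz

fs = 8000
t = np.arange(int(0.05 * fs)) / fs
# A slowly decaying 200 Hz tone: AMDF dips sharply at a 40-sample lag
f0 = amdf_pitch(np.exp(-10.0 * t) * np.sin(2 * np.pi * 200 * t), fs)
```

The autocorrelation method is the mirror image of this: instead of finding the lag that minimizes the mean difference, it finds the lag that maximizes the correlation of the signal with its delayed copy.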
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara... (IJERA Editor)
This work presents an application of fundamental frequency (pitch), Linear Predictive Cepstral Coefficients (LPCC), and Mel Frequency Cepstral Coefficients (MFCC) to identifying the sex of the speaker in speech recognition research. The aim of this article is to compare the performance of these three methods for identifying the sex of speakers. A successful speech recognition system can help in non-critical operations such as presenting a driving route to the driver, dialing a phone number, or switching lights and a coffee machine on and off, apart from speaker verification by caste, community, and locality, including identification of sex. Here an attempt has been made to identify the sex of Bodo speakers through vowel utterances using the pitch, LPCC, and MFCC techniques. It is found that the feature-vector organization of LPCC coefficients provides a more promising approach to speech and speaker recognition for the Bodo language than pitch or MFCC.
Dynamic Spectrum Derived Mfcc and Hfcc Parameters and Human Robot Speech Inte... (IDES Editor)
Mel-frequency cepstral coefficients (MFCC), Human Factor cepstral coefficients (HFCC), and their new parameters derived from the log dynamic spectrum and the dynamic log spectrum are widely used for speech recognition in various applications. However, recognition systems based on these features do not perform efficiently in noisy conditions, in mobile environments, or under speech variation between users of different genders and ages. To maximize the recognition rate of a speaker-independent isolated-word recognition system, we combine both of the above features and propose a hybrid feature set. Testing the system with this hybrid feature vector yielded accuracies of 86.17% in a clean condition (closed window), 82.33% in a classroom open-window environment, and 73.67% outdoors in a noisy environment.
A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-... (CSCJournals)
This paper proposes a new voicing detection and pitch estimation method that is particularly robust for noisy speech. The method is based on spectral analysis of the speech multi-scale product (MP), which consists of taking the product of wavelet transform coefficients; the wavelet used is the quadratic spline function. We argue that the spectrum of the multi-scale product reveals an estimate of the pitch harmonic more accurately, even in heavily noisy scenarios. We evaluate our approach on the Keele database. The experimental results show the robustness of our method for noisy speech, and good performance for clean speech in comparison with state-of-the-art algorithms.
An Introduction to Various Features of Speech Signal (Sivaranjan Goswami)
An overview of various temporal, spectral and cepstral features of speech signal used in digital speech processing.
An expert system for automatic reading of a text written in standard Arabic (ijnlc)
In this work we present our expert system for automatic reading, or speech synthesis, of text written in Standard Arabic. The work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (Text-To-Speech, TTS). This transformation is done first by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of transforming it into its corresponding phonetic sequence, and second by the generation of the voice signal corresponding to the transcribed chain. We lay out the different design stages of the system, as well as the results obtained, compared with other works studied to realize TTS for Standard Arabic.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used for aligning two given multidimensional sequences; it finds an optimal match between them. The distance between aligned sequences should be relatively smaller than that between unaligned sequences, so the improvement in alignment can be estimated from the corresponding distances. The technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was segmented manually and aligned at the sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than a 20% reduction in the average Mahalanobis distances.
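The DTW alignment at the heart of this study can be sketched for one-dimensional sequences as follows. The absolute-difference local cost and the test sequences are illustrative; the study itself aligns multidimensional HNM parameter frames.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW: cumulative cost of the optimal monotonic alignment
    between two sequences, using an absolute-difference local cost."""
    n, m = len(a), len(b)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths
            dp[i, j] = cost + min(dp[i - 1, j], dp[i, j - 1], dp[i - 1, j - 1])
    return float(dp[n, m])

# A time-stretched copy of a contour aligns with zero DTW cost, even though
# a plain pointwise comparison of the unaligned sequences would be large
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 2, 2, 3, 2, 1, 1, 0]
d = dtw_distance(x, y)   # 0.0
```

Comparing such distances before and after alignment is exactly how an improvement figure like the 20% reduction above can be quantified.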
Voice morphing is a technique that modifies a source speaker's speech to sound like a target speaker. It does this by changing the pitch from the source speaker, like a male voice, to the target speaker, like a female voice. This is done by interpolating the linear predictive coding coefficients of the source and target signals. The pitch of the morphed signal can be positioned between the source and target by varying a constant value between 0 and 1. Applications include changing voices for security or entertainment purposes, but limitations include difficulties with voice detection and requiring extensive sound libraries.
Voice morphing is a technique that modifies a source speaker's speech to sound like it was spoken by a different target speaker. The process involves preprocessing the speech signal, analyzing the pitch and envelope, morphing through warping and interpolation, and re-estimating the signal. To morph voices between a male and female speaker, the pitch of the male speaker is shifted to match that of the female speaker by time-stretching the residue signal and adjusting the LPC coefficients. Potential applications include using popular speakers for public announcements, and effects in films, but limitations include difficulties in voice detection and updating systems for new languages.
Voice morphing is a technique that modifies a source speaker's speech to sound like a target speaker. It works by preprocessing the speech signals, analyzing pitch and envelope, matching and warping the signals, and re-estimating the signal. This allows for an accurate copy of a person's voice to be made. Voice morphing has applications in text-to-speech systems, public address systems, and special effects. It can also be used to create fake conversations for use in courts or as a battlefield deception tool. The process separates speech into spectral envelope and pitch/voicing information before realigning and recombining the signals.
This document summarizes a seminar on voice morphing techniques. It discusses what voice morphing is, including modifying a source speaker's voice to sound like a target speaker. It then outlines the main topics that were covered in the seminar, including transform-based voice morphing systems, enhancing systems with phase prediction and spectral refinement, enabling real-time voice morphing, and conclusions about extending the techniques to other audio sounds.
This document analyzes speech coding algorithms for the Hindi and English languages. It discusses Linear Predictive Coding (LPC), an algorithm that accurately estimates speech parameters and represents speech signals at reduced bit rates while preserving quality. The paper proposes a voice-excited LPC algorithm and applies it to Hindi and English male and female voices, analyzing the tradeoffs among bit rate, delay, signal-to-noise ratio, and complexity. The results show low bit rates and a better signal-to-noise ratio with this algorithm.
This document discusses speech signal processing and speech recognition. It begins by defining speech processing and its relationship to digital signal processing, then outlines several related disciplines, including signal processing, physics, pattern recognition, and computer science. It discusses aspects of speech signals such as phonemes, the speech waveform, and the spectral envelope, and details techniques for pre-processing and feature extraction, including filtering, linear predictive coding, and the cepstrum. Finally, it summarizes the main steps in a speech recognition procedure: endpoint detection, framing and windowing, feature extraction, and distortion-measure calculation for recognition.
State-of-the-art Automatic Speech Recognition (ASR) systems cannot identify spoken words that have non-standard pronunciations. In this paper, we present a new classification algorithm to identify pronunciation variants. It uses the Dynamic Phone Warping (DPW) technique to compute the phonetic distance between pronunciations, together with a critical-distance threshold criterion for the classification. The proposed method consists of two steps: a training step that estimates the critical-distance parameter from transcribed data, and a classification step that uses this criterion to sort input utterances into pronunciation variants and OOV words.
The algorithm is implemented in Java. The classifier is trained on data sets from the TIMIT speech corpus and the CMU pronunciation dictionary. A confusion matrix and the precision, recall, and accuracy metrics are used for performance evaluation. Experimental results show a significant improvement over existing classifiers.
This document summarizes digital modeling techniques for speech signals. It describes the vocal source and vocal tract that produce speech, then discusses using sampling and techniques like PCM to represent speech signals digitally. Linear predictive coding is presented as a simple analysis method that approximates each sample as a combination of past samples. The summary concludes that linear prediction can be used for spectrum estimation (by representing the vocal-tract transfer function), pitch detection, and speech synthesis.
The document summarizes a technical seminar presentation on voice morphing. It describes voice morphing as a technique that modifies a source speaker's speech to sound like a target speaker. It discusses the need for voice morphing in applications like text-to-speech, discusses the voice morphing process which includes preprocessing, pitch and envelope analysis, morphing and signal re-estimation, and addresses limitations and advantages.
This document discusses voice modification techniques. It compares four models for voice modification: LPC, H/S, TD-PSOLA, and MBR-PSOLA. TD-PSOLA relies on pitch-synchronous overlap-add and modifies the speech signal in the time domain based on analysis and synthesis markings. The document implements a voice modification system using TD-PSOLA that can modify vocal tract, pitch, and time scale parameters to change voice quality, such as making a female voice sound more husky or nasal. Results show the algorithm can effectively modify input voices as desired.
Speaker recognition systems aim to automatically identify or verify a speaker's identity based on characteristics of their voice. There are two main types: speaker identification determines which registered speaker is speaking, while speaker verification accepts or rejects a speaker's claimed identity. All systems contain modules for feature extraction and feature matching. Feature extraction represents the voice signal with parameters like MFCCs that can distinguish speakers. Feature matching compares extracted features from an unknown voice to known speaker models. The document describes the process of MFCC feature extraction in detail, including framing the speech signal, windowing frames, taking the FFT, mapping to the mel scale, and finally the DCT to produce MFCC coefficients.
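The MFCC extraction chain described above (framing, windowing, FFT, mel mapping, DCT) can be sketched in a few lines of NumPy. The frame length, hop, and filter counts below are common defaults, not values taken from the document:

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filt=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> Hamming window -> |FFT| ->
    mel filterbank -> log -> DCT. (Illustrative; 25 ms frames with a
    10 ms hop are common defaults, assumed here.)"""
    # 1) frame the signal with overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    # 2) window each frame and take the magnitude spectrum
    frames = frames * np.hamming(frame_len)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 3) triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filt, spec.shape[1]))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4) log filterbank energies, then DCT-II to decorrelate
    logE = np.log(spec @ fbank.T + 1e-10)
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filt)))
    return logE @ dct.T

coeffs = mfcc(np.random.randn(16000))   # one second of noise -> (frames, 13)
print(coeffs.shape)
```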
Voice morphing involves transforming one speech signal into another through three main steps: 1) extracting pitch and envelope information through cepstral analysis, 2) using dynamic time warping to match pitch contours between signals and interpolating values to create intermediate morphs, and 3) re-estimating and reconstructing the morphed signal as an acoustic waveform. Applications include military psychological operations, creating fake audio evidence, and voice acting in cartoons. The document provides details on preprocessing signals, the morphing and warping process, and applications and conclusions regarding voice morphing technology.
This document discusses homomorphic speech processing and techniques for speech enhancement. It provides an overview of modeling speech production as the excitation of a linear time-invariant system. Homomorphic filtering is introduced as a way to deconvolve speech into excitation and system response using logarithmic transformations. The complex cepstrum is discussed as a representation of speech that can be used to estimate pitch, voicing and formant frequencies. Homomorphic vocoding is described as a speech coding technique that quantizes the low-time part of the cepstrum at regular intervals to encode speech. Common techniques for speech enhancement like spectral subtraction and adaptive noise cancellation are also mentioned.
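The deconvolution idea can be illustrated with the real cepstrum. This is a toy sketch of the textbook technique, not the vocoder described above: the log-magnitude spectrum turns the excitation-filter convolution into a sum, and the inverse FFT separates the two by quefrency, so a voiced frame shows a peak at the quefrency equal to the pitch period:

```python
import numpy as np

def cepstral_pitch(frame, sr=16000, fmin=60.0, fmax=400.0):
    """Estimate pitch from the real cepstrum of one frame.

    (Hedged sketch: the spectral floor below is a stabilizing trick
    for this toy, not a step from the summarized document.)"""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    mag = np.maximum(mag, 1e-2 * mag.max())   # floor deep valleys before the log
    cep = np.fft.irfft(np.log(mag))
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch-period range
    period = lo + int(np.argmax(cep[lo:hi]))
    return sr / period

# impulse train with a 64-sample period at 16 kHz -> 250 Hz
frame = np.zeros(2048)
frame[::64] = 1.0
print(cepstral_pitch(frame))
```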
DYNAMIC PHONE WARPING – A METHOD TO MEASURE THE DISTANCE BETWEEN PRONUNCIATIONS (cscpconf)
Human beings generate different speech waveforms when speaking the same word at different times. Different speakers also have different accents and produce significantly varying waveforms for the same word. There is therefore a need to measure the distance between pronunciations, which facilitates the preparation of pronunciation dictionaries. A new algorithm called Dynamic Phone Warping (DPW) is presented in this paper. It uses dynamic programming for global alignment and shortest-distance measurement. The DPW algorithm can be used to enhance the pronunciation dictionaries of well-known languages such as English, or to build pronunciation dictionaries for less-known, sparsely resourced languages. Precision-measurement experiments show 88.9% accuracy.
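A minimal sketch of the global-alignment idea behind DPW, assuming uniform substitution and gap costs (the abstract does not specify the actual cost model, so these are placeholders):

```python
def dpw_distance(phones_a, phones_b, sub_cost=1.0, gap_cost=1.0):
    """Needleman-Wunsch-style global alignment between two phone
    sequences via dynamic programming, normalized by the longer length.
    (Uniform costs are a simplifying assumption; a real DPW system
    would likely weight costs by phoneme similarity.)"""
    n, m = len(phones_a), len(phones_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap_cost
    for j in range(1, m + 1):
        d[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0.0 if phones_a[i - 1] == phones_b[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + match,   # match / substitute
                          d[i - 1][j] + gap_cost,    # delete a phone
                          d[i][j - 1] + gap_cost)    # insert a phone
    return d[n][m] / max(n, m, 1)

# two "tomato" pronunciations: one substitution over six phones
print(dpw_distance("t ah m ey t ow".split(), "t ah m aa t ow".split()))
```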
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco... (TELKOMNIKA JOURNAL)
Sundanese is one of the popular languages in Indonesia, so research on it is essential; that motivation underlies this study. The vital components for achieving high recognition accuracy are feature extraction and the classifier, and the main goal of this study was to analyze the former. Three types of feature extraction were tested: Linear Predictive Coding (LPC), Mel Frequency Cepstral Coefficients (MFCC), and Human Factor Cepstral Coefficients (HFCC). The output of each became the input to the classifier, for which the study applied Hidden Markov Models. Before classification, vector quantization based on clustering was performed, and each result was compared across the number of clusters and hidden states used. The dataset came from four people who each spoke the digits from zero to nine 60 times. All three feature extraction methods produced the same performance on the corpus used.
IRJET – Pitch Detection Algorithms in Time Domain (IRJET Journal)
This document discusses pitch detection algorithms in the time domain. It describes two common time domain pitch detection methods: the autocorrelation method and average magnitude difference function (AMDF) method. The autocorrelation method detects the periodicity of a speech signal by finding the highest value of the autocorrelation function. The AMDF method calculates the average magnitude of differences between the original and delayed speech signal at different lags, and identifies the pitch period as the lag with the minimum AMDF value. The document also provides implementation results of these two methods on speech samples, demonstrating their ability to estimate pitch periods in the time domain.
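Both time-domain methods can be sketched directly from their definitions. In the demo below the search range is deliberately narrowed to sidestep the octave ambiguity both methods share; the test tone and parameters are assumptions for illustration, not from the document:

```python
import numpy as np

def pitch_autocorr(x, sr, fmin=60, fmax=400):
    """Autocorrelation method: the pitch period is the lag of the
    highest autocorrelation value inside the plausible period range."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    r = [np.dot(x[:-lag], x[lag:]) for lag in range(lo, hi)]
    return sr / (lo + int(np.argmax(r)))

def pitch_amdf(x, sr, fmin=60, fmax=400):
    """AMDF method: the pitch period is the lag that minimizes the
    average magnitude difference between the signal and its delayed copy."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    d = [np.mean(np.abs(x[:-lag] - x[lag:])) for lag in range(lo, hi)]
    return sr / (lo + int(np.argmin(d)))

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t)   # 200 Hz tone, period 40 samples
# raise fmin so the double period (lag 80) falls outside the search range
print(pitch_autocorr(x, sr, fmin=120), pitch_amdf(x, sr, fmin=120))
```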
Identification of Sex of the Speaker With Reference To Bodo Vowels: A Compara... (IJERA Editor)
This work presents an application of fundamental frequency (pitch), Linear Predictive Cepstral Coefficients (LPCC), and Mel Frequency Cepstral Coefficients (MFCC) to identifying the sex of the speaker in speech recognition research. The aim of this article is to compare the performance of these three methods. A successful speech recognition system can help in non-critical operations such as presenting a driving route to the driver, dialing a phone number, or turning a light switch or coffee machine on and off, apart from speaker verification by caste, community, and locality, including identification of sex. Here an attempt has been made to identify the sex of Bodo speakers from vowel utterances using pitch, LPCC, and MFCC techniques. It is found that the feature-vector organization of LPCC coefficients provides a more promising approach to speech and speaker recognition for the Bodo language than pitch or MFCC.
Dynamic Spectrum Derived MFCC and HFCC Parameters and Human Robot Speech Inte... (IDES Editor)
Mel-frequency cepstral coefficients (MFCC), Human Factor cepstral coefficients (HFCC), and their new parameters derived from the log dynamic spectrum and the dynamic log spectrum are widely used features for speech recognition in various applications. However, recognition systems based on these features do not perform efficiently in noisy conditions, in mobile environments, or under speech variation between users of different genders and ages. To maximize the recognition rate of a speaker-independent isolated word recognition system, we combine both of the above features and propose a hybrid feature set. Testing the system with this hybrid feature vector, we obtained accuracies of 86.17% in a clean condition (closed window), 82.33% in a classroom open-window environment, and 73.67% outdoors in a noisy environment.
A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-... (CSCJournals)
This paper proposes a new voicing detection and pitch estimation method that is particularly robust for noisy speech. The method is based on spectral analysis of the speech multi-scale product. The multi-scale product (MP) consists of taking the product of wavelet transform coefficients; the wavelet used is the quadratic spline function. We argue that the spectrum of the multi-scale product can reveal an estimate of a pitch harmonic more accurately, even in a heavily noisy scenario. We evaluate our approach on the Keele database. The experimental results show the robustness of our method for noisy speech and its good performance for clean speech in comparison with state-of-the-art algorithms.
An Introduction to Various Features of Speech Signal (Sivaranjan Goswami)
An overview of various temporal, spectral and cepstral features of speech signal used in digital speech processing.
An Expert System for Automatic Reading of a Text Written in Standard Arabic (ijnlc)
In this work we present our expert system for automatic reading, or speech synthesis, of text written in Standard Arabic. Our work is carried out in two main stages: the creation of the sound database, and the transformation of written text into speech (Text-To-Speech, TTS). This transformation is done firstly by a Phonetic Orthographical Transcription (POT) of any written Standard Arabic text, with the aim of converting it into its corresponding phonetic sequence, and secondly by generating the voice signal that corresponds to the transcribed chain. We lay out the design of the system, as well as the results obtained, in comparison with other works studied in order to realize TTS based on Standard Arabic.
EFFECT OF DYNAMIC TIME WARPING ON ALIGNMENT OF PHRASES AND PHONEMES (kevig)
Speech synthesis and recognition are the basic techniques used for man-machine communication. This type of communication is valuable when our hands and eyes are busy with some other task, such as driving a vehicle, performing surgery, or firing weapons at the enemy. Dynamic time warping (DTW) is mostly used for aligning two given multidimensional sequences; it finds an optimal match between them. The distance between the aligned sequences should be relatively small compared to the unaligned sequences, so the improvement in alignment may be estimated from the corresponding distances. This technique has applications in speech recognition, speech synthesis, and speaker transformation. The objective of this research is to investigate the amount of improvement in alignment for sentence-based and phoneme-based manually aligned phrases. Speech signals in the form of twenty-five phrases were recorded from each of six speakers (3 males and 3 females). The recorded material was segmented manually and aligned at the sentence and phoneme level. The aligned sentences of different speaker pairs were analyzed using HNM, and the HNM parameters were further aligned at the frame level using DTW. Mahalanobis distances were computed for each pair of sentences. The investigations have shown more than a 20% reduction in the average Mahalanobis distance.
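The core DTW recurrence these studies rely on can be sketched as follows, using a Euclidean local distance in place of the Mahalanobis distance on HNM parameters (a simplification for illustration):

```python
import numpy as np

def dtw(seq_a, seq_b):
    """Accumulated dynamic time warping cost between two sequences of
    feature vectors, with a Euclidean local distance. (The paper uses
    Mahalanobis distances on HNM parameters; this sketch simplifies
    that to plain Euclidean distance.)"""
    a = np.asarray(seq_a, dtype=float).reshape(len(seq_a), -1)
    b = np.asarray(seq_b, dtype=float).reshape(len(seq_b), -1)
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # cheapest way to reach (i, j): match, or stretch either axis
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[len(a), len(b)]

# a time-stretched copy aligns at zero cost
print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))   # 0.0
```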
Modeling of Speech Synthesis of Standard Arabic Using an Expert System (csandit)
This document describes an expert system for speech synthesis of Standard Arabic text. It involves two main stages: 1) creation of a sound database and 2) text-to-speech transformation. The transformation process involves phonetic orthographic transcription of the text and then generating voice signals corresponding to the transcribed phonetic sequence. The expert system uses a knowledge base containing sound data and rewriting rules. It transcribes text using graphemes as basic units and then concatenates sound units from the database to synthesize speech. Tests achieved a 96% success rate in pronouncing sentences correctly. Future work aims to improve prosody and develop fully automatic signal segmentation.
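The two-stage transcribe-then-concatenate architecture can be illustrated schematically. The rewrite rules and sound units below are invented placeholders, not the system's actual knowledge base:

```python
import numpy as np

# Toy pipeline: rule-based transcription of graphemes to phonemes, then
# concatenation of recorded sound units. Both tables are hypothetical
# stand-ins for the expert system's knowledge base.
RULES = {"ba": ["b", "a"], "ta": ["t", "a"]}         # grapheme rewrite rules
UNITS = {p: np.zeros(80) for p in ("b", "a", "t")}   # 80-sample stand-in units

def synthesize(graphemes):
    phones = [p for g in graphemes for p in RULES[g]]    # transcription stage
    return np.concatenate([UNITS[p] for p in phones])    # concatenation stage

wave = synthesize(["ba", "ta"])   # four units of 80 samples each
print(wave.shape)
```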
Effect of MFCC Based Features for Speech Signal Alignments (kevig)
The fundamental techniques used for man-machine communication include speech synthesis, speech recognition, and speech transformation. Feature extraction techniques provide a compressed representation of speech signals, and HNM analysis and synthesis provide high-quality speech with a small number of parameters. Dynamic time warping is a well-known technique for aligning two given multidimensional sequences; it locates an optimal match between them, and the improvement in alignment is estimated from the corresponding distances. The objective of this research is to investigate the effect of dynamic time warping on phrase-, word-, and phoneme-based alignments. Speech signals in the form of twenty-five phrases were recorded; the material was segmented manually and aligned at the sentence, word, and phoneme level. The Mahalanobis distance (MD) was computed between the aligned frames. The investigation has shown better alignment in the HNM parametric domain, and it has been seen that effective speech alignment can be carried out even at the phrase level.
IMPROVING MYANMAR AUTOMATIC SPEECH RECOGNITION WITH OPTIMIZATION OF CONVOLUTI... (ijnlc)
Researchers of many nations have developed automatic speech recognition (ASR) for their languages to demonstrate national progress in information and communication technology. This work intends to improve ASR performance for the Myanmar language by varying Convolutional Neural Network (CNN) hyperparameters such as the number of feature maps and the pooling size. CNNs can reduce spectral variation and model the spectral correlations that exist in the signal, thanks to locality and the pooling operation; therefore, the impact of these hyperparameters on CNN accuracy in ASR tasks is investigated. A 42-hour data set is used for training, and ASR performance was evaluated on two open test sets: web news and recorded data. As Myanmar is a syllable-timed language, a syllable-based ASR was built and compared with a word-based ASR. As a result, it achieved a 16.7% word error rate (WER) and an 11.5% syllable error rate (SER) on TestSet1, and 21.83% WER and 15.76% SER on TestSet2.
A New Framework Based on KNN and DT for Speech Identification Through Emphat... (nooriasukmaningtyas)
This document presents a new framework that combines K-nearest neighbors (KNN) and decision trees (DT) for speech identification through emphatic letters in the Moroccan dialect of Arabic. The framework first uses KNN and DT individually to predict the speaker's gender and the emphatic letter and diacritic pronounced, then uses these predictions as additional features to improve the overall prediction of the sound content, achieving an accuracy of 71.43%, a 12.1% improvement over directly applying the classifiers. The study examines 720 speech samples from 12 speakers and evaluates hidden Markov models, DT, and KNN applied individually, finding that KNN best recognizes diacritics while DT performs best for gender identification.
FORMANT ANALYSIS OF BANGLA VOWEL FOR AUTOMATIC SPEECH RECOGNITION (sipij)
To bring new technological benefits to the general population, regional and local language recognition is nowadays drawing researchers' attention. As with other languages, a Bangla speech recognition scheme is in demand. A formant is the resonance frequency of the vocal tract, and formant frequencies play an important role in automatic speech recognition due to their noise-robust characteristics. In this paper, Bangla vowels are investigated to acquire formant frequencies and their corresponding bandwidths from continuous Bangla sentences, which are considered potential parameters for a wide range of voice applications. For formant analysis, cepstrum-based formant estimation and Linear Predictive Coding (LPC) techniques are used. To acquire formant characteristics, rich continuous sentences from the widely available Bangla language corpus "SHRUTI" are considered. Intensive experimentation is carried out to determine the formant characteristics (frequency and bandwidth) of Bangla vowels for both male and female speakers. Finally, the vowel recognition accuracy of Bangla is reported using the first three formants.
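The LPC route to formants rests on reading resonances off the roots of the LPC polynomial A(z). A hedged sketch follows, demonstrated on a synthetic single-resonance filter rather than real Bangla speech:

```python
import numpy as np

def formants_from_lpc(lpc_coeffs, sr):
    """Estimate formant frequencies as the angles of the complex roots
    of the LPC polynomial A(z): each conjugate pole pair corresponds to
    a vocal-tract resonance. (Sketch: expects coefficients [1, a1..ap];
    a real analyzer would also screen roots by bandwidth.)"""
    roots = np.roots(lpc_coeffs)
    roots = roots[np.imag(roots) > 0]            # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # rad/sample -> Hz
    return np.sort(freqs)

# synthetic filter with one resonance at 700 Hz: pole pair at radius 0.98
sr = 8000
pole = 0.98 * np.exp(1j * 2 * np.pi * 700 / sr)
a = np.poly([pole, np.conj(pole)])               # A(z) with that pole pair
print(formants_from_lpc(a.real, sr))             # close to [700.]
```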
Hindi Digits Recognition System on Speech Data Collected in Different Natural... (csandit)
This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs for each speaker, since the data was collected in their respective homes: vehicle horn noise in road-facing rooms, internal background noise such as opening doors in some rooms, and silence in others. All these recordings are used to train the acoustic model, which is built from 8 speakers' audio data. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ... (ijma)
This document discusses performance analysis of different acoustic features for Bangla speech recognition using LSTM neural networks. It develops a Bangla speech corpus and extracts linear predictive coding (LPC), Mel frequency cepstral coefficients (MFCC), and perceptual linear prediction (PLP) acoustic features from the corpus. The features are then used to train LSTM models for Bangla speech recognition and their performance is evaluated based on sentence correct rates on test data sets consisting of male and female speakers.
Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer (IOSR Journals)
This document summarizes an implementation of an English-text to Marathi-speech synthesizer. The synthesizer uses a unit selection approach based on concatenative synthesis to produce natural sounding Marathi speech from English text input. Over 28,000 Marathi syllables, words and sentences were recorded from a female speaker and used to create the speech corpus. Formant frequencies (F1, F2, F3) were analyzed from the synthesized speech using MATLAB and PRAAT tools to evaluate the quality and naturalness of the output.
PERFORMANCE ANALYSIS OF DIFFERENT ACOUSTIC FEATURES BASED ON LSTM FOR BANGLA ... (ijma)
In this work a new Bangla speech corpus with proper transcriptions has been developed, and various acoustic feature extraction methods have been investigated using a Long Short-Term Memory (LSTM) neural network to find their effective integration into a state-of-the-art Bangla speech recognition system. The acoustic features are a sequence of representative vectors extracted from the speech signals, and the classes are either words or sub-word units such as phonemes. The most commonly used feature extraction method, linear predictive coding (LPC), was applied first, followed by two other popular methods, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), which are based on models of the human auditory system. A detailed review of the implementation of these methods is given first, and then the implementation steps are elaborated for the development of an automatic speech recognition (ASR) system for Bangla speech.
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR (ijcseit)
This proposed system is a syllable-based Myanmar speech recognition system with three stages: feature extraction, phone recognition, and decoding. In feature extraction, the system transforms the input speech waveform into a sequence of acoustic feature vectors, each representing the information in a small time window of the signal. The likelihood of observing the feature vectors given linguistic units (words, phones, subparts of phones) is then computed in the phone recognition stage. Finally, the decoding stage takes the acoustic model (AM), which consists of this sequence of acoustic likelihoods, plus a phonetic dictionary of word pronunciations, combined with the language model (LM), and produces the most likely sequence of words as output. The system creates the language model for Myanmar using syllable segmentation and a syllable-based n-gram method.
This document summarizes a research paper on developing a syllable-based speech recognition system for the Myanmar language. The proposed system has three main components: feature extraction, phone recognition, and decoding. Feature extraction transforms speech into acoustic feature vectors. Phone recognition computes likelihoods of acoustic observations given linguistic units like phones. Decoding uses acoustic and language models to find the most likely sequence of words. The paper discusses building acoustic and language models for Myanmar. The acoustic model is trained using Hidden Markov Models and Gaussian mixture models. The language model is an n-gram model built using syllable segmentation of text. Developing the first speech recognition system for Myanmar poses technical challenges due to its tonal syllabic structure.
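The syllable-based n-gram idea can be sketched as a bigram model over pre-segmented syllable sequences. The smoothing choice (add-one) and the toy corpus below are assumptions for illustration, not details from the paper:

```python
from collections import Counter

def bigram_lm(sentences):
    """Train a syllable-level bigram model with add-one smoothing from
    pre-segmented sentences (each a list of syllables). Returns a
    conditional probability function P(cur | prev)."""
    unigrams, bigrams = Counter(), Counter()
    for syls in sentences:
        toks = ["<s>"] + syls                 # sentence-start token
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def prob(prev, cur):
        # add-one smoothing keeps unseen bigrams from getting zero mass
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return prob

# toy syllable-segmented corpus (placeholder syllables, not Myanmar data)
corpus = [["ma", "la", "ka"], ["ma", "la"], ["ka", "ma"]]
p = bigram_lm(corpus)
print(p("ma", "la"))   # (2+1)/(3+4) = 3/7
```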
Applications of artificial Intelligence in Mechanical Engineering.pdfAtif Razi
Historically, mechanical engineering has relied heavily on human expertise and empirical methods to solve complex problems. With the introduction of computer-aided design (CAD) and finite element analysis (FEA), the field took its first steps towards digitization. These tools allowed engineers to simulate and analyze mechanical systems with greater accuracy and efficiency. However, the sheer volume of data generated by modern engineering systems and the increasing complexity of these systems have necessitated more advanced analytical tools, paving the way for AI.
AI offers the capability to process vast amounts of data, identify patterns, and make predictions with a level of speed and accuracy unattainable by traditional methods. This has profound implications for mechanical engineering, enabling more efficient design processes, predictive maintenance strategies, and optimized manufacturing operations. AI-driven tools can learn from historical data, adapt to new information, and continuously improve their performance, making them invaluable in tackling the multifaceted challenges of modern mechanical engineering.
Supermarket Management System Project Report.pdfKamal Acharya
Supermarket management is a stand-alone J2EE using Eclipse Juno program.
This project contains all the necessary required information about maintaining
the supermarket billing system.
The core idea of this project to minimize the paper work and centralize the
data. Here all the communication is taken in secure manner. That is, in this
application the information will be stored in client itself. For further security the
data base is stored in the back-end oracle and so no intruders can access it.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Generative AI Use cases applications solutions and implementation.pdfmahaffeycheryld
Generative AI solutions encompass a range of capabilities from content creation to complex problem-solving across industries. Implementing generative AI involves identifying specific business needs, developing tailored AI models using techniques like GANs and VAEs, and integrating these models into existing workflows. Data quality and continuous model refinement are crucial for effective implementation. Businesses must also consider ethical implications and ensure transparency in AI decision-making. Generative AI's implementation aims to enhance efficiency, creativity, and innovation by leveraging autonomous generation and sophisticated learning algorithms to meet diverse business challenges.
https://www.leewayhertz.com/generative-ai-use-cases-and-applications/
Generative AI Use cases applications solutions and implementation.pdf
High Quality Arabic Concatenative Speech Synthesis
Signal & Image Processing : An International Journal (SIPIJ) Vol.2, No.4, December 2011
DOI : 10.5121/sipij.2011.2403
Abdelkader Chabchoub and Adnan Cherif
Signal Processing Laboratory, Science Faculty of Tunis, 1060 Tunisia
achabchoub@yahoo.fr, adnen2fr@yahoo.fr
ABSTRACT
This paper describes the implementation of TD-PSOLA tools to improve the quality of an Arabic Text-to-Speech (TTS) system. The system is based on diphone concatenation with a TD-PSOLA modification synthesizer. The paper describes techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method. This approach is based on the decomposition of the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modification of the speech signal, together with adjustment and optimisation of a diphone database with integrated vowels.
KEYWORDS
Speech processing and synthesis, Arabic speech, prosody, diphone, spectrum analysis, pitch mark, timbre,
TD-PSOLA.
1. INTRODUCTION
Producing a synthetic voice that imitates human speech from plain text is not a trivial task, since it generally requires great knowledge about the real world, the language, the context the text comes from, and a deep understanding of the semantics of the text and the relations underlying all this information. However, the many research and commercial speech synthesis systems developed have contributed to our understanding of these phenomena, and have been successful in various ways in many applications such as human-machine interaction, hands- and eyes-free access to information, and interactive voice response systems.
There have been three major approaches to speech synthesis: articulatory, formant and concatenative [1][2][3][4]. Articulatory synthesis tries to model the human articulatory system, i.e. the vocal cords, vocal tract, etc. Formant synthesis employs a set of rules to synthesize speech using the formants, which are the resonance frequencies of the vocal tract [19]. Since the formants constitute the main frequencies that make sounds distinct, speech is synthesized using these estimated frequencies. Several speech synthesis systems were developed, such as vocoders and LPC synthesizers [5][6], but most of them did not reproduce synthetic speech of a quality as high as that of PSOLA-based systems [7] such as MBROLA synthesizers [8].
The TD-PSOLA method (Time Domain Pitch Synchronous Overlap-Add) in particular is among the most efficient at producing satisfactory speech [9] and is one of the most popular concatenation synthesis techniques today. LP-PSOLA (Linear Predictive PSOLA) and FD-PSOLA (Frequency Domain PSOLA), though able to produce equivalent results, require much more computational power. The first step of TD-PSOLA is to run a pitch detection algorithm and generate pitch marks through overlapping windowed speech. To synthesize speech, the Short-Time signals (ST signals) are simply overlapped and added with the desired spacing between the ST signals.
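The analysis-synthesis loop just described (pitch marks, windowed short-time signals, overlap-add at a new spacing) can be sketched as follows. This is an illustrative Python/NumPy sketch of the general TD-PSOLA idea, not the authors' Matlab implementation; the function name and parameters are hypothetical:

```python
import numpy as np

def psola_shift_pitch(x, pitch_marks, factor):
    """Illustrative TD-PSOLA pitch modification: extract Hanning-windowed
    short-time signals (two pitch periods long) centred on the analysis
    pitch marks, then overlap-add them at a spacing scaled by 1/factor
    (factor > 1 moves the marks closer together, raising the pitch)."""
    y = np.zeros(len(x))
    for i in range(1, len(pitch_marks) - 1):
        m = pitch_marks[i]
        # local pitch period estimated from the neighbouring marks
        period = (pitch_marks[i + 1] - pitch_marks[i - 1]) // 2
        if m - period < 0 or m + period > len(x):
            continue
        win = np.hanning(2 * period)
        frame = x[m - period:m + period] * win  # one ST signal
        # new centre position: spacing between marks divided by `factor`
        c = int(pitch_marks[0] + (m - pitch_marks[0]) / factor)
        lo, hi = c - period, c + period
        if lo >= 0 and hi <= len(y):
            y[lo:hi] += frame
    return y
```

Note that raising the pitch this way shortens the signal, which is why practical systems also duplicate or drop ST signals to control duration independently.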
2. THE ARABIC DATABASE
2.1. Introduction for Arabic language
The Arabic language is spoken throughout the Arab world and is the liturgical language of Islam, which means that Arabic is widely known by Muslims everywhere. Arabic refers either to Standard Arabic or to the many dialectal variations of Arabic. Standard Arabic is the language used by the media and the language of the Qur'an. Modern Standard Arabic is generally adopted as the common medium of communication throughout the Arab world today. Dialectal Arabic refers to the dialects derived from Classical Arabic [10]. These dialects sometimes differ so much that it is hard for a Lebanese speaker to understand an Algerian, and it is worth mentioning that there are differences even within the same country.
Standard Arabic has 34 basic phonemes, of which six are vowels, and 28 are consonants [11].
Several factors affect the pronunciation of phonemes. An example is the position of the phoneme
in the syllable as initial, closing, intervocalic, or suffix. The pronunciation of consonants may also
be influenced by the interaction (co-articulation) with other phonemes in the same syllable.
Among these coarticulation effects are accentuation and nasalization. Arabic vowels are affected as well by the adjacent phonemes. Accordingly, each Arabic vowel has at least three allophones: the normal, the accentuated, and the nasalized allophone. In classical Arabic, the consonants can be divided into three categories with respect to dilution and accentuation [12]. The Arabic language has five syllable patterns: CV, CW, CVC, CWC and CCV, where C represents a consonant, V represents a vowel and W represents a long vowel.
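As an illustration of the syllable inventory above (a hypothetical helper, not part of the described system), a C/V/W skeleton can be validated against the five allowed patterns:

```python
# The five Arabic syllable patterns from the text: C = consonant,
# V = vowel, W = long vowel.
ARABIC_SYLLABLE_PATTERNS = {"CV", "CW", "CVC", "CWC", "CCV"}

def is_valid_syllable(skeleton):
    """Return True if a C/V/W skeleton is one of the five allowed patterns."""
    return skeleton in ARABIC_SYLLABLE_PATTERNS
```

For example, `is_valid_syllable("CWC")` accepts a consonant-long vowel-consonant syllable, while `is_valid_syllable("VC")` rejects a vowel-initial one.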
Table 1. Arabic consonants and vowels and their SAMPA phonetic notations

SAMPA | Grapheme | SAMPA | Grapheme | SAMPA | Grapheme | SAMPA | Grapheme
j     | ي        | G     | غ        | r     | ر        |       | ء
a     | َ◌        | f     | ف        | z     | ز        | b     | ب
u     | ُ◌        | q     | ق        | s     | س        | t     | ت
i     | ِ◌        | K     | ك        | S     | ش        | T     | ث
an    | ً◌        | l     | ل        | s.    | ص        | Z     | ج
un    | ٌ◌        | m     | م        | d.    | ض        | X     | ح
in    | ٍ◌        | n     | ن        | t.    | ط        | X     | خ
h     | ه        | z.    | ظ        | d     | د        | w     | و
H     | ع        | D     | ذ        |       |          |       |
2.2. Database construction
The first step in constructing a diphone database for Arabic is to determine all possible diphone
pairs of Arabic. In general, the typical diphone size is the square of the phone number for any
language [13]. In reality, additional sound segments and various allophonic variations may in
some cases be also included. The basic idea is to define classes of diphones, for example: vowel-
consonant, consonant- vowel, vowel-vowel, and consonant-consonant.
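A minimal sketch of enumerating such a diphone inventory and grouping it into the four classes, assuming illustrative phone sets (the function name and the tiny phone sets are hypothetical):

```python
from itertools import product

def diphone_inventory(consonants, vowels):
    """Enumerate all diphone pairs over a phone set and group them into the
    four classes mentioned in the text (CV, VC, VV, CC). The full inventory
    size is the square of the phone count."""
    phones = consonants | vowels
    classes = {"CV": [], "VC": [], "VV": [], "CC": []}
    for a, b in product(phones, repeat=2):
        key = ("V" if a in vowels else "C") + ("V" if b in vowels else "C")
        classes[key].append(a + b)
    return classes
```

With 34 phones this yields 34^2 = 1156 raw pairs, which motivates the syllable-based reduction described next.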
The syllabic structure of the Arabic language is exploited here to simplify the required diphone database. The proposed sound segments may be considered as "sub-syllabic" units [10]. For good quality, the diphone boundaries are taken from the middle portion of vowels. Because diphones need to be clearly articulated, various techniques have been proposed to extract them from subjects. One technique uses words within carrier sentences to ensure that the diphones are pronounced with acceptable (i.e. consistent) duration and prosody [20]. Ideally, the diphones should come from a middle syllable of nonsense words, so that each diphone is fully articulated and the articulatory effects at the start and end of the word are minimized [14].
The second step is to record the corpus. The recording was made by a native speaker of Standard Arabic using a cardioid microphone with a high-quality flat frequency response. The signal was sampled at 16 kHz with 16-bit resolution.
The final step is segmentation and annotation: the recorded database must be prepared so that the selection method has all the information necessary for its operation. The base is first segmented into phones and, in a second step, into diphones. This was done by hand with the Studio Diphone software developed by the TCTS laboratory of Mons. A correction of the units to ensure quality was made with the Praat software (Boersma and Weenink, 2008). Prosodic analysis was performed on the corrected signal to determine the pitch and duration of each phone.
This segmentation provides a significant reduction in units, the total not exceeding 400 (diphones, phonemes and phones), in comparison with other bases developed by other laboratories: the Chanfour base (CENT laboratory, Rabat Faculty of Sciences), the diphone database of S. Baloul's thesis (LIUM, Le Mans, France) and a diphone base by Noufel Tounsi (TCTS laboratory, Mons). The SAMPA code (Speech Assessment Methods Phonetic Alphabet) is used for grapheme-to-phoneme transformation.
3. SPEECH ANALYSIS AND SYNTHESIS
This section describes the procedures of synchronous analysis and synthesis using the TD-PSOLA modifier. Figure 2 presents the block diagram of these two stages.
3.1. Speech analysis
The first step in the speech analysis is to filter the speech signal with an FIR filter (pre-emphasis). The next step is to provide a sequence of pitch marks and a voiced/unvoiced classification for each segment between two consecutive pitch marks. This decision is based on the zero-crossing rate and the short-time energy (Figure 1). A voicing coefficient (v/uv) can be computed in order to quantify the periodicity of the signal [15].
3.1.1. Segmentation
The segmentation of a speech signal is used in order to identify the voiced and unvoiced frames. This classification is based on the zero-crossing ratio and the energy value of each signal frame.
Figure 1. Automatic segmentation of Arabic speech: (a) «باب ; babun», (b) «شمس ; chamsun».
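The voiced/unvoiced decision based on zero-crossings and short-time energy might be sketched as follows. The frame length and thresholds are illustrative assumptions, not values from the paper:

```python
import numpy as np

def voiced_unvoiced(x, frame_len=320, energy_thr=0.01, zcr_thr=0.25):
    """Label each non-overlapping frame 'V' or 'U'. Voiced frames show high
    short-time energy and a low zero-crossing ratio; unvoiced frames show
    the opposite. Thresholds here are illustrative."""
    labels = []
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero-crossing ratio
        labels.append("V" if energy > energy_thr and zcr < zcr_thr else "U")
    return labels
```

On a 16 kHz signal, a 320-sample frame corresponds to 20 ms; a low-frequency periodic frame (few zero crossings, high energy) is labelled voiced, while noise-like frames are labelled unvoiced.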
3.1.2. Speech marks
Different procedures for placing the marks t_a(i) are used according to the local features of the signal components. A prior segmentation of the signal into zones of identical features makes it possible to direct the marking toward the suitable method. The results of this segmentation will also be necessary for the synthesis stage.
3.1.2.1. Reading marks
The idea of our algorithm is to select pitch marks among the local extrema of the speech signal. Given a set of mark candidates which are all negative peaks or all positive peaks:

T_a = [t_a(1), ..., t_a(i), ..., t_a(N)]

where t_a(i) is the sample of the i-th peak and N is the number of peaks extracted ([16] explains how these candidates are found). Pitch marks are a subset of the points of T_a, spaced by the pitch periods given by the pitch extraction algorithm. The selection can be represented by a sequence of indices:

J = [j(1), ..., j(k), ..., j(K)]    (1)

with K < N. J has to preserve the chronological order, which requires the monotonicity of j: j(k) < j(k+1).
The sequence of indices along with the corresponding peaks defines the set of pitch marks:

T_a(J) = [t_a(j(1)), ..., t_a(j(k)), ..., t_a(j(K))]    (2)

The determination of J requires a criterion expressing the reliability of two consecutive pitch marks with respect to the pitch values previously determined. The local criterion we chose is:

d(c(l); c(i)) = |(c(i) - c(l)) - P_a(c(l))|    (3)

where l < i. It takes into account the time interval between two marks compared to the pitch period P_a in samples. This criterion returns zero if the two peaks are exactly P_a(c(l)) samples away from one another, and a positive value if the distance between these peaks is greater or less than the pitch period. The overall criterion is:

D = Σ_{k=1}^{K-1} [ d(t_a(j(k)), t_a(j(k+1))) - B(t_a(j(k+1))) ]    (4)

where B is the bonus for selecting an extremum as a pitch mark. As a first approach,

B(t_a(j(k))) = δ · amplitude(t_a(j(k)))    (5)

The coefficient δ expresses the compromise between closeness to the pitch values and the strength of the pitch marks. Minimising D is achieved by dynamic programming. The pitch marking results are shown in Figure 2.
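A rough sketch of minimising the pitch-marking criterion described above by dynamic programming; this illustrates the idea rather than reproducing the authors' algorithm, and all names and the δ value are hypothetical:

```python
def select_pitch_marks(peaks, amps, pitch_period, delta=20.0):
    """Select pitch marks from candidate peaks by dynamic programming.
    peaks: candidate peak positions in samples, ascending; amps: their
    amplitudes; pitch_period: expected local period in samples. Minimises
    D = sum[d - B], where d = |spacing - pitch_period| penalises spacing
    that deviates from the period and B = delta * amplitude rewards
    strong peaks."""
    n = len(peaks)
    cost, prev = [0.0] * n, [-1] * n
    for i in range(n):
        best_c, best_l = -delta * amps[i], -1   # option: start a sequence at i
        for l in range(i):
            d = abs((peaks[i] - peaks[l]) - pitch_period)
            c = cost[l] + d - delta * amps[i]   # option: chain l -> i
            if c < best_c:
                best_c, best_l = c, l
        cost[i], prev[i] = best_c, best_l
    i = min(range(n), key=lambda k: cost[k])    # cheapest sequence end
    marks = []
    while i != -1:
        marks.append(peaks[i])
        i = prev[i]
    return marks[::-1]
```

On candidates spaced one period apart with low-amplitude spurious peaks in between, the selection keeps the strong, correctly spaced peaks and skips the spurious ones.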
Figure 2. Pitch marks of Arabic speech: (a) «باب ; babun», (b) «أكل ; akala».
3.1.2.2. Synthesis marks
The OLA synthesis is based on the overlap-addition of elementary signals Y_j(n), obtained from the X_i(n) placed at the new positions t_s(j). These positions are determined by the pitch and the length of the synthesis signal. In such a synthesis one can modify the time scale by a coefficient tscale. The positions t_s(k-1) and the pitch period P_a(k) being supposed known, we can deduce t_s(k) as [17]:

t_s(k) = t_s(k-1) + tscale · P_a(n_s(k)),    n_s(k+1) = n_s(k) + 1/tscale    (6)

where tscale is the coefficient of length modification.
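Equation (6) can be turned into a short sketch (names are hypothetical; the pitch-period lookup is an assumed callback):

```python
def synthesis_marks(t0, pitch_period, tscale, count):
    """Generate `count` synthesis mark positions t_s(k) from a start
    position t0, following equation (6). pitch_period(n) is an assumed
    callback returning the analysis pitch period P_a at fractional
    analysis index n; tscale is the length-modification coefficient."""
    ts, n = [float(t0)], 0.0
    for _ in range(count):
        ts.append(ts[-1] + tscale * pitch_period(n))  # t_s(k) = t_s(k-1) + tscale * P_a(n_s(k))
        n += 1.0 / tscale                             # n_s(k+1) = n_s(k) + 1/tscale
    return ts
```

With a constant period of 100 samples and tscale = 1, the marks fall every 100 samples; doubling tscale doubles the spacing and halves the rate at which the analysis index advances.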
Figure 3. TD-PSOLA for pitch (F0) modification. In order to increase the pitch, the individual pitch-synchronous frames are extracted, Hanning-windowed, moved closer together and then added up. To decrease the pitch, we move the frames further apart. Increasing the pitch results in a shorter signal, so we also need to duplicate frames if we want to change the pitch while holding the duration constant.
3.2. Synthesis speech
Therefore, given the pitch mark and the synthesis mark of a given frame, we use a fast re-sampling method, described below, to shift the frame precisely to where it will appear in the new signal. Let x[n] be the original frame; the re-sampled signal is given by A. Oppenheim [18]:

x(t) = Σ_{n=-∞}^{∞} x[n] sinc(π (t - nTs)/Ts)    (7)

where Ts is the sampling period. Calculating the result frame y[m] corresponding to the frame x[n] shifted by a small delay δ amounts to evaluating x(mTs - δ). Therefore, y[m] = x(mTs - δ), i.e.:
y[m] = Σ_{n=-∞}^{∞} x[n] sinc(π fs (mTs - nTs - δ)) = Σ_{n=-∞}^{∞} x[n] sinc(π fs ((m - n)Ts - δ))    (8)
where fs is the sampling frequency (1/Ts). Now, rewriting sinc(x) as sin(x)/x and using the following formula:
sin(π fs ((m - n)Ts - δ)) = sin(π(m - n) - π fs δ) = sin(π(m - n)) cos(π fs δ) - cos(π(m - n)) sin(π fs δ)

but sin(π(m - n)) = 0 and cos(π(m - n)) = ±1 = (-1)^(m-n), so we get

y[m] = Σ_{n=-∞}^{∞} (-1)^(m-n+1) x[n] sin(π fs δ) / (π fs ((m - n)Ts - δ))    (9)
As 0 < δ < Ts (resp. -Ts < δ < 0), we define
δ = α Ts, where 0 < α < 1 (resp. -1 < α < 0).
Then the synthesized speech is
y[m] = Σ_{n=-∞}^{∞} (-1)^(m-n+1) x[n] sin(απ) / (π (m - n - α))    (10)
The result is shown in Figure 4.
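Equations (7)-(10) amount to fractional-delay sinc interpolation. A truncated-sum sketch (the `taps` parameter and the function name are illustrative assumptions; the infinite sum is cut off at a finite number of terms):

```python
import numpy as np

def fractional_shift(x, alpha, taps=64):
    """Shift frame x by a fractional delay alpha*Ts (0 < alpha < 1) using
    the truncated form of equation (10). `taps` bounds the (theoretically
    infinite) sum on each side of m; larger values reduce truncation
    error at the cost of computation."""
    y = np.zeros(len(x))
    for m in range(len(x)):
        lo, hi = max(0, m - taps), min(len(x), m + taps + 1)
        n = np.arange(lo, hi)
        # y[m] = sum_n (-1)^(m-n+1) x[n] sin(alpha*pi) / (pi*(m-n-alpha))
        y[m] = np.sum(((-1.0) ** (m - n + 1)) * x[n]
                      * np.sin(alpha * np.pi) / (np.pi * (m - n - alpha)))
    return y
```

For a slowly varying sinusoid, the output closely matches the analytically delayed signal away from the frame edges, which is exactly what is needed to place an ST signal at a non-integer synthesis mark.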
Figure 4. A waveform of a human utterance and its synthesized equivalent using TD-PSOLA tools: «أكل ; akala».
4. RESULTS AND EVALUATION
Two types of tests were applied to evaluate the speech of the developed system with regard to intelligibility and naturalness. The first test, which measures intelligibility, is the Diagnostic Rhyme Test (DRT). In this test, twenty pairs of words that differ only in a single consonant are uttered, and the listeners are asked to mark on an answer sheet which word of each pair they think is correct [21]. In the second evaluation test, Categorical Estimation (CE), the listeners were asked a few questions about several attributes of the speech, such as speed, pronunciation and stress [22], and were asked to rank the voice quality on a five-level scale. The test group consisted of sixteen persons, and the two tests were repeated twice to see whether the results would improve through a learning effect, i.e., whether the listeners become accustomed to the synthesized speech and understand it better after every listening session. The following tables and charts illustrate the results of these tests.
For both listening tests we prepared listening test programs, and a brief introduction was given before each test. In the first listening test, each sound was played once at 4-second intervals and the listeners wrote the script corresponding to the word they heard on the given answer sheet. In the second listening test, we played all 15 sentences together and in random order for each listener. Each subject listened to the 15 sentences and gave a judgment score using the listening test program, with a quality scale as follows: (5 – Excellent, 4 – Good, 1 – Bad). They evaluated the system by considering the naturalness aspect. Each listener did the listening test fifteen times, and we took the last ten results, considering the first five tests as training.
After collecting all listeners' responses, we calculated the average values and found the following results. In the first listening test, the average correct rate for the original and analysis-synthesis sounds was 98%, and that of the rule-based synthesized sounds was 90%. We found the synthesized words to be very intelligible (Figure 5).
Figure 5. Average scores for the first test (Euler system, our system, natural speech and the Acapela system) for the intelligibility of speech.
5. CONCLUSION
In this work, a voice quality conversion algorithm with the TD-PSOLA modifier was implemented and tested in a Matlab environment using our database. The results of the perceptual evaluation test indicate that the algorithm can effectively convert modal voice into the desired voice quality. Results of the simulation verify that the quality of the signal synthesized with the TD-PSOLA technique depends on the precision of the analysis marking as well as the synthesis marking, which must be placed precisely to avoid phase errors. Our higher-precision algorithm for pitch marking during the synthesis stage increases the signal quality; this gain in accuracy reduces the difference between the original and synthetic signals. We have shown that syllables produce reasonably natural quality speech and that durational modeling is crucial for naturalness, with a significant reduction in the number of units in the total base developed. This quality can be seen from the listening tests and from the objective evaluation comparing the original and synthetic speech.
REFERENCES
[1] Huang, X., A. Acero and H. W. Hon (2001), Spoken Language Processing, Prentice Hall PTR, New
Jersey.
[2] Greenwood, A. R.(1997) “Articulatory Speech Synthesis Using Diphone Units”, IEEE international
Conference on Acoustics, Speech and Signal Processing, pp. 1635–1638.
[3] Sagisaka, Y., N. Iwahashi and K. Mimura, (1992) “ATR v-TALK Speech Synthesis System”,
Proceedings of the ICSLP, Vol. 1, pp. 483–486.
[4] Black, A. W. and P. Taylor, (1994) “CHATR: A Generic Speech Synthesis System”, Proceedings of
the International Conference on Computational Linguistics, Vol. 2, pp. 983–986.
[5] Childers, D.G. «Glottal source modeling for voice conversion». Speech communication, 16(2): 127-
138, 1995.
[6] Childers, D.G., and Lee, C.K. «Vocal quality factors: Analysis, synthesis, and perception».Journal of
the Acoustical Society of America, 1991.
[7] Acero A. «Source-filter Models for Time-Scale Pitch-Scale Modification of Speech». IEEE
International Conference on Acoustics, Speech, and Signal Processing, Seattle, USA, pp.881-884.
May, 1998.
[8] Dutoit, T., Pagel, V., Pierret, N., Bataille, F. & van der Vrecken, O. (1996) The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use.
[9] Moulines, E., and Charpentier, F. «Pitch-Synchronous Waveform Processing Techniques for TTS
Synthesis».Speech communication Vol 9, pp453-467, 1990.
[10] Alghmadi, M., (2003) “KACST Arabic Phonetic Database”, the Fifteenth International Congress of
Phonetics Science, Barcelona, pp 3109-3112.
[11] Assaf, M.,(2005). “A Prototype of an Arabic Diphone Speech Synthesizer in Festival,” Master Thesis,
Department of Linguistics and Philology, Uppsala University.
[12] Al-Zabibi, M., (1990) “An Acoustic–Phonetic Approach in Automatic Arabic Speech Recognition,”
The British Library in Association with UMI.
[13] Ibraheem A.(1990). “Al-Aswat Al-Arabia”, Arabic title, Anglo-Egyptian Publisher, Egypt.
[14] Maria M., “A Prototype of an Arabic Diphone Speech Synthesizer in Festival", Master Thesis in
Computational Linguistics, Uppsala university, 2004.
[15] Laprie, Y. and Colotte, V. (1998) “Automatic pitch marking for speech transformations via TD-
PSOLA”. In IX European Signal Processing Conference, Rhodes, Greece, 1998.
[16] Mower, L., Boeffard, O., Cherbonnel, B. (1991) “An algorithm of speech synthesis high-quality”
Proceeding of a Seminar SFA/GCP, pp 104-107.
[17] Oppenheim, A. V. and Schafer, R. W. (1975) Digital Signal Processing. Prentice-Hall, Inc.
[18] Oppenheim, A. V. and Schafer, R. W. (1975) Digital Signal Processing. Prentice-Hall, Inc., New York.
[19] Walker, J., Murphy, P. (2007) “A review of glottal waveform analysis”. In: Progress in Nonlinear Speech Processing.
[20] Demenko, G., Grocholewski, S., Wagner, A. & Szymański, M. (2006). “Prosody Annotation for
Corpus Based Speech Synthesis”. In: Proceedings of the Eleventh Australasian International
Conference on Speech Science and Technology. Auckland, New Zealand, pp. 460-465.
[21] Maria M., (2004) "A Prototype of an Arabic Diphone Speech Synthesizer in Festival",Master Thesis
in Computational Linguistics, Uppsala university, 2004.
[22] Kraft V., Portele T.,(1995) "Quality Evaluation of Five German Speech Synthesis Systems" Acta
Acustica 3, pp. 351-365.