ESTIMATION AND OPTIMIZATION OF PROSODIC PARAMETERS TO IMPROVE THE QUALITY OF ARABIC SYNTHETIC SPEECH


International Journal of Advances in Engineering & Technology, Jan 2012. ©IJAET, ISSN: 2231-1963, Vol. 2, Issue 1, pp. 632-639

ESTIMATION AND OPTIMIZATION OF PROSODIC PARAMETERS TO IMPROVE THE QUALITY OF ARABIC SYNTHETIC SPEECH

Abdelkader CHABCHOUB & Adnen CHERIF
Signal Processing Laboratory, Science Faculty of Tunis, 1060, Tunisia

ABSTRACT
Prosody modeling has been extensively applied in speech synthesis, because every speech synthesis system needs to generate the prosodic properties of speech in order to produce natural and intelligible synthetic speech. This paper introduces a new technique for predicting a deterministic prosodic target at an early stage; it relies on probabilistic models of the F0 contour and can also predict duration. The paper also proposes a method that searches for the optimal unit sequence by maximizing a joint likelihood at both the segmental and prosodic levels. This method has been implemented on the analysis corpus used to develop the Arabic prosody database, which in turn is the input of the Arabic speech synthesizer. Extensive objective and subjective evaluation shows a marked improvement in Arabic prosodic quality.

KEYWORDS: Segmental duration, pitch, prediction, prosodic model, neural network, speech synthesis, Arabic speech.

I. INTRODUCTION
Generating natural-sounding prosody is a central challenge in text-to-speech (TTS) synthesis, a technology that enables computers to talk and assist people in learning languages. While existing synthesis techniques produce speech that is intelligible, few people would claim that listening to computer speech is natural or expressive.
Therefore, in recent years, research in speech synthesis has been directed more towards improving the intelligibility and naturalness of synthetic systems, to achieve better quality in the synthetic voice and its intonation [1][2]. In several systems, producing a synthetic voice of good quality still requires extensive research before such systems can be widely used.

In the Arabic language, linguistic and prosodic processing [3] is essential for synthesis quality. A processing stage based on the modification of Arabic prosody (optimization of pitch and prediction of duration) is therefore trained to improve the new Arabic voice. From the phonetic point of view, this concerns the prosodic parameters: fundamental frequency (F0), segmental duration and intensity. Modeling these parameters is the main target of our research, which concentrates essentially on fundamental frequency and duration [4].

This paper is organized as follows. Section 2 presents the morphological model of the Arabic language, in particular the concept of the word. Section 3 describes the corpus used in the study and presents the list of phonemes with the corresponding acoustic parameters (duration and F0) for each phoneme. These values are the input to the parameter-modification module that optimizes the prosodic parameters (pitch and duration), presented in Section 4. Section 5 presents
the results and evaluation of the algorithm, as well as the implementation of the speech synthesis system.

II. DATABASE OF ARABIC SPEECH PROSODY
The quality of a speech synthesis system depends on the intelligibility and naturalness of the generated speech; hence the need to generate quality prosody. Our database has been developed to improve the quality of Arabic synthetic speech with MBROLA [5]. The fundamental idea is to create a speech corpus consisting of phone-sequence/prosodic-context combinations that forms a specially structured subset of the set of all such combinations, and then to use Arabic prosody transplantation [6].

The modules are cascaded in the order Phonetisation - Duration - Pitch. The input is a pair consisting of a speech signal file and a time-aligned phonemic annotation, followed by phoneme validation (SAMPA code), identification of the voiced and unvoiced frames (V/UV), duration extraction, pitch extraction [7], and finally prosodic modification/optimization. The results of this algorithm are the entries of our Arabic prosodic database. The main data flow, shown in Figure 1, is: original speech, automatic annotation and segmentation, V/UV classification, duration and pitch measurement, prosodic modification and optimization, Arabic prosodic database.

Figure 1. Arabic prosodic database generation with duration prediction and pitch optimization.

2.1. Description of the analysis corpus
The corpus used to build our database is composed of 120 sentences, with an average of 5 words per sentence. These sentences contain in total 1296 syllables and 3240 phonemes, including short vowels, long vowels and semi-vowels, fricative, plosive and liquid consonants, and nasal consonants.
Breaks were marked with "_" in the text corresponding to the natural voice. The sentences were read at an average speed (10 to 12 phonemes per second) by a speaker who received no specific instruction, so as to avoid any influence that could affect spontaneity. The corpus was recorded at a 16-kHz sampling rate with 16-bit encoding.

2.2. Segmentation and labeling of the corpus
The continuous speech corpus has been segmented and labeled by a semi-automatic procedure involving the following steps [12]:
• Step 1: manual phonetic transcription of each sentence using the SAMPA transcription system.
• Step 2: automatic segmentation with Praat.

2.3. Automatic segmentation of the corpus
The extraction of pitch is an important step. Over the duration (in ms) of each phoneme, we extract the pitch at several positions; these values become the parameters of the MBROLA input file. The result is a robust and accurate pitch extraction algorithm that yields good-quality synthetic speech.

2.4. Identification of the voiced and unvoiced frames
Automatic segmentation of the speech signal is used to identify the voiced and unvoiced frames. This classification is based on the zero-crossing rate and the energy of each signal frame.

Figure 2. Voiced zones of an Arabic sentence meaning "door" (panels, top to bottom: segmented signal, zero-crossing rate, energy, voiced zones; horizontal axis: time in seconds). Between 0.4 and 0.75 s, the voiced sections correspond to a low zero-crossing rate and high energy.

Figure 3. Unvoiced zones of an Arabic sentence meaning "sun" (same panels). Between 0.65 and 0.8 s, the unvoiced sections correspond to a high zero-crossing rate and low energy.

2.5. Duration and pitch extraction
The extraction of pitch is the next step.
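As a concrete illustration of the voiced/unvoiced classification of Section 2.4, the sketch below labels fixed-length frames from their zero-crossing rate and short-time energy. The 20-ms frame length (320 samples at 16 kHz) and the two thresholds are illustrative assumptions, not values given in the paper.

```python
import numpy as np

def classify_voicing(signal, frame_len=320, zcr_thresh=0.1, energy_thresh=0.01):
    """Label each frame voiced (True) or unvoiced (False) using the
    zero-crossing rate and short-time energy, as in Section 2.4.
    Frame length and thresholds are assumed values for illustration."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # zero-crossing rate: fraction of sample-to-sample sign changes
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # short-time energy: mean squared amplitude of the frame
        energy = np.mean(frame ** 2)
        # voiced frames combine a low ZCR with high energy
        labels.append(bool(zcr < zcr_thresh and energy > energy_thresh))
    return labels
```

A 100-Hz tone, for example, yields voiced frames (few zero crossings, high energy), while low-amplitude noise yields unvoiced ones (many zero crossings).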
Copying the phonemes and their durations from the annotation file, and measuring the pitch values from the original recording of a human utterance, allows best-case speech synthesis. To extract pitch from the recordings, a Praat script called max-pitch was implemented as in [8]. This script goes through the Sound and TextGrid files in a directory, opens each
pair of Sound and TextGrid files, calculates the pitch maxima of each labeled interval, and saves the results to a text file [9]. Implementing this script raised a problem, and some modifications to the script were made. The inputs to the script are WAV files and TextGrid annotation files. The Praat pitch extraction produces one TXT file with the pitch values of all the phonemes in the files in the directory. The output file "pitchresults.txt" contains the following information:
1. The file names of the files in the directory.
2. The labels.
3. The maximum pitch values of the labeled intervals, in Hz.

The pitch results for one file, automatically extracted by the Praat script into .pho format (phoneme, duration in ms, then position (%) / pitch (Hz) pairs), are shown in the following example:

_ 387
s 90 83 132
a 104 14 126 29 123 43 119 58 116 72 120 87 120
d 118 25 133 38 137 51 145 64 152 76 154 89 155
i 77 19 153 39 149 58 145 78 141 97 135
q 125
i 63 24 129 48 129 71 127 95 119
l 103 15 122 29 129 44 125 58 124 73 125 87 120
H 71 21 113 42 111 63 111 85 112
a 75 20 117 40 119 60 119 80 116
z 116 13 109 26 109 39 108 52 108 65 109 91 122
i 155 10 136 19 139 29 141 39 141 48 141 97 138
z 152 10 135 20 133 30 129 39 123 49 119 99 122

III. ARABIC PROSODIC MODELLING
3.1. Prediction models of segmental duration
Studies of the automatic generation of duration have evolved considerably in recent years. The model proposed in this paper is based on two basic techniques: linear prediction and neural networks. The model of W. N. Campbell assumes that the temporal organization of an utterance is decided at a higher level than that of the phonemes. Two stages are distinguished in the implementation of this model: the first is the prediction of syllable durations, and the second is the prediction of the phoneme durations within each syllable. A learning process automatically allows the prediction of syllable durations.
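The .pho listing above follows the MBROLA input layout: a phoneme label, a duration in milliseconds, then (position %, pitch Hz) pairs. A minimal reader for one such line could look as follows; `parse_pho_line` is a hypothetical helper for illustration, not part of the system described.

```python
def parse_pho_line(line):
    """Parse one MBROLA .pho line: phoneme label, duration in ms,
    then (position-%, pitch-Hz) pairs, as in the listing above."""
    parts = line.split()
    phoneme, duration = parts[0], int(parts[1])
    values = [int(v) for v in parts[2:]]
    # pair up the remaining numbers as (position %, pitch Hz)
    pitch_points = list(zip(values[0::2], values[1::2]))
    return phoneme, duration, pitch_points
```

For instance, `parse_pho_line("a 104 14 126 29 123")` yields the phoneme "a", a 104-ms duration, and the pitch points (14 %, 126 Hz) and (29 %, 123 Hz); a pause line such as `_ 387` yields no pitch points.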
It uses neural networks for learning because it is assumed that they can learn the fundamental interactions between contextual effects, which should represent the rule-governed behavior implicit in the data. If the networks can encode these fundamental interactions, then they should do the same with data not previously encountered [10].

Regarding segmental durations, their distribution is given by the calculation of an elongation coefficient (deviation from the mean). Campbell suggested that all the phonemes of one syllable share the same elongation factor z, the z-score. The z-score of each phonemic realization in the study corpus is calculated by:

z(realisation) = (duree_observee(realisation) − µ_phoneme) / σ_phoneme        (1)

where µ_phoneme and σ_phoneme are the mean and standard deviation obtained from the absolute durations of the realizations of each phoneme in the corpus. Once every phonetic realization is normalized using the z-score (mean = 0 and standard deviation = 1), the durations of the syllables are determined by the neural network [11]. Moreover, the model calculates the z-score associated with each syllable by solving the following equation:

Duree(syllabe) = Σ_{i=1..n} exp(µ_i + z σ_i)        (2)
where the sum runs over the phonemic elements of the syllable, z is the z-score associated with that syllable, and the pair (µ_i, σ_i) contains the mean and standard deviation associated with phoneme i, obtained from the logarithms of the durations (in milliseconds) of the realizations of this phoneme in the corpus. Thus, the duration of each phoneme of the syllable is calculated using equation (3):

Duree(phoneme_i) = exp(µ_i + z σ_i)        (3)

3.2. F0 prediction module based on a neural network
Neural networks provide a good solution for problems involving strong non-linearity between input and output parameters, and also when the quantitative mechanism of the mapping is not well understood. The use of neural networks in prosodic modeling has been reported in [13] and [14], but those methods do not make use of a model to limit the degrees of freedom of the problem. Additional care must be taken to account for the continuity of F0 contours (using recurrent networks). In the proposed model, the continuity and basic shape of the F0 contours are ensured by the F0 model [15][16].

In this paper, three types of neural network structures are evaluated: the multi-layer perceptron (MLP), the Jordan network (with feedback from the output units), and the Elman network (with feedback from the hidden units). The latter two are called partially recurrent networks, and are tested here to account for the mutual influence of neighboring accentual phrases. All structures have a single hidden layer containing either 10 or 20 units. For the experiments, we used the SNNS neural network simulation software [17]. The results of F0 contour prediction on the test data set are shown in Figure 4. Figure 6 shows the pitch contours of an original and a synthetic utterance used with our system.
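The duration model of equations (1)-(3) can be sketched as follows. The (µ_i, σ_i) log-duration statistics would be estimated from the corpus; the bisection step that inverts equation (2) is our addition for illustration (in the paper, the syllable z-score is tied to the neural network's syllable duration prediction).

```python
import math

def syllable_duration(z, stats):
    """Equation (2): syllable duration for elongation z, where stats is a
    list of (mu_i, sigma_i) log-duration statistics of its phonemes."""
    return sum(math.exp(mu + z * sigma) for mu, sigma in stats)

def solve_z(target_ms, stats, lo=-5.0, hi=5.0):
    """Invert equation (2) by bisection: find the z-score whose phoneme
    durations sum to the target syllable duration (monotonic in z)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if syllable_duration(mid, stats) < target_ms:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def phoneme_durations(z, stats):
    """Equation (3): per-phoneme durations for the syllable's z-score."""
    return [math.exp(mu + z * sigma) for mu, sigma in stats]
```

With z = 0, each phoneme takes its (geometric) mean duration; a larger target syllable duration yields a positive z that stretches every phoneme of the syllable by the same elongation factor.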
Figure 4. Evaluation of the fundamental frequency F0 of an Arabic phrase; from top to bottom: voice signal, F0 by the autocorrelation method, F0 by the spectral method, segment annotation, average F0 value per syllable, and F0 estimation by MOMEL.

IV. RESULTS AND EVALUATION
4.1. Implementation of prosodic values into MBROLA
The MBROLA synthesis system is multilingual; since it was originally designed around the phonotactic characteristics of the French language, our synthesis system requires segmental and prosodic adjustments for its adaptation to Arabic. A first look at the results showed that, beyond general similarities between the natural and synthetic versions, there is a considerable resemblance between the natural and synthetic F0 contours. Only a few minor differences can be observed, since the F0 values were extracted only once
every 10 ms. Note also the halved F0 in the creaky parts of the synthetic versions, which successfully simulated creak. Similarly, the spectrogram shows only a small difference with the estimation algorithm. This can be seen in Figure 5 and Figure 6. The implementation of our estimation and optimization algorithm for the prosodic parameters produced Arabic synthetic speech that is intelligible and natural.

Figure 5. Natural and synthetic speech: signal and spectrogram of an Arabic sentence.

Figure 6. Natural and synthetic speech: pitch contours of an Arabic sentence.
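A simple objective measure of the closeness of natural and synthetic F0 contours sampled every 10 ms, as above, is the RMSE over frames where both contours are voiced. This is a sketch of such a measure; the paper does not specify its exact objective criterion.

```python
import math

def f0_rmse(natural, synthetic):
    """RMSE in Hz between two F0 contours sampled on the same frame grid,
    restricted to frames where both are voiced (F0 > 0). Unvoiced frames
    are conventionally coded as 0 Hz here -- an assumed convention."""
    pairs = [(a, b) for a, b in zip(natural, synthetic) if a > 0 and b > 0]
    if not pairs:
        return 0.0
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))
```

For example, contours that agree exactly on every shared voiced frame give an RMSE of 0 Hz, and frames voiced in only one contour are simply excluded from the comparison.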
4.2. Subjective evaluation
The evaluation consists of a subjective comparison between the four models. A comparison category rating (CCR) test was used to compare the quality of the synthetic speech generated by our system, the Euler system, the Acapela system, and natural speech. The listening tests were conducted by four Arab adults who are native speakers of the language, all born and raised in Arab countries. For both listening tests we prepared listening-test programs, and a brief introduction was given before the test. Listeners were asked to attribute a preference score to each sample pair according to its quality, on the comparison mean opinion score (CMOS) scale [18]. The listening test was performed with headphones. After collecting all the listeners' responses, we calculated the average values and found the following results. In the first listening test, the average correct-rate for original and analysis-synthesis sounds was 98%, and that of rule-based synthesized sounds was 90%. We found the synthesized words to be very intelligible (Figure 7).

Figure 7. Average intelligibility scores for the first test (Euler system, our system, natural speech, and the Acapela system).

V. CONCLUSIONS
A new high-quality Arabic speech synthesis technique has been introduced in this paper. The technique is based on the estimation and optimization of prosodic parameters, such as pitch and duration, for the MBROLA method. It has also been shown that syllables produce reasonably natural-quality speech and that durational modeling is crucial for naturalness, with a significant reduction in the number of units in the total database developed. This was readily observed during the listening tests and the objective evaluation comparing the original with the synthetic speech.

REFERENCES
[1] S.
Baloul, (2003) "Développement d'un système automatique de synthèse de la parole à partir du texte arabe standard voyellé" [Development of an automatic text-to-speech system for vowelized Standard Arabic], PhD thesis, Université du Maine, Le Mans, France.
[2] M. Elshafi, H. Al-Muhtaseb & M. Al-Ghamdi, (2002) "Techniques for high quality Arabic speech synthesis", Information Sciences 140, pp. 255-267, Elsevier.
[3] M. Assaf, (2005) "A Prototype of an Arabic Diphone Speech Synthesizer in Festival", Master's thesis, Department of Linguistics and Philology, Uppsala University.
[4] B. Möbius & G. Dogil, (2002) "Phonemic and postural effects on the production of prosody", Speech Prosody 2002 (Aix-en-Provence), pp. 523-526.
[5] T. Dutoit, V. Pagel, N. Pierret, F. Bataille & O. van der Vrecken, (1996) "The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use".
[6] M. Al-Zabibi, (1990) "An Acoustic-Phonetic Approach in Automatic Arabic Speech Recognition", The British Library in Association with UMI.
[7] G. Demenko, S. Grocholewski, A. Wagner & M. Szymański, (2006) "Prosody Annotation for Corpus Based Speech Synthesis", in Proceedings of the Eleventh Australasian International Conference on Speech Science and Technology,
Auckland, New Zealand, pp. 460-465.
[8] P. Boersma & D. Weenink, (2005) "Praat: doing phonetics by computer" [computer program], version 4.3.04, retrieved March 31, 2005 from http://www.praat.org/
[9] J. Bachan & D. Gibbon, (2006) "Close Copy Speech Synthesis for Speech Perception Testing", Investigationes Linguisticae, vol. 13, pp. 9-24.
[10] W. N. Campbell, (1992) "Syllable-based segmental duration", in G. Bailly & C. Benoît (eds.), Talking Machines: Theories, Models and Designs, Elsevier Science Publishers, Amsterdam, pp. 211-224.
[11] A. Lacheret-Dujour & B. Beaugendre, (1999) "La prosodie du français" [The prosody of French], Paris, Éditions du CNRS.
[12] F. Chouireb, M. Guerti, M. Naïl & Y. Dimeh, (2007) "Development of a Prosodic Database for Standard Arabic", The Arabian Journal for Science and Engineering, vol. 32, no. 2B, pp. 251-262, ISSN: 1319-8025, October.
[13] S. Keagy, (2000) "Integrating Voice and Data Networks: Practical Solutions for the New World of Packetized Voice over Data Networks", Cisco Press.
[14] G. Sonntag, T. Portele & B. Heuft, (1997) "Prosody generation with a neural network: Weighing the importance of input parameters", in Proceedings of ICASSP, Munich, Germany, pp. 931-934.
[15] J. P. Teixeira, D. Freitas & H. Fujisaki, (2003) "Prediction of Fujisaki model's phrase commands", in Proceedings of Eurospeech, Geneva, pp. 397-400.
[16] J. P. Teixeira, D. Freitas & H. Fujisaki, (2004) "Prediction of accent commands for the Fujisaki intonation model", in Proceedings of Speech Prosody 2004, Nara, Japan, March 23-26, pp. 451-454.
[17] SNNS (Stuttgart Neural Network Simulator) User Manual, (1995) version 4.1, University of Stuttgart, Institute for Parallel and Distributed High Performance Systems (IPVR).
[18] K. S. Rao & B.
Yegnanarayana, (2004) "Intonation modeling for Indian languages", in Proceedings of Interspeech 2004, Jeju Island, Korea, October 4-8, pp. 733-736.

AUTHORS
A. Chabchoub is a researcher in the Signal Processing Laboratory at the University of Sciences of Tunis, Tunisia (FST). He holds a degree in electronics and received an M.Sc. degree in Automatic and Signal Processing (ATS) from the National Engineering School of Tunis (ENIT). He is currently a PhD student under the supervision of Prof. A. Cherif. His research interests include speech synthesis and analysis.

A. Cherif received his engineering diploma from the Engineering Faculty of Tunis and his PhD in electrical engineering and electronics from the National Engineering School of Tunis (ENIT). He is currently a professor at the Science Faculty of Tunis, responsible for the Signal Processing Laboratory. He has participated in several research and cooperation projects, and is the author of international communications and publications.
