Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA)...
Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA)...
Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA)...
Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA)...
Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA)...
Upcoming SlideShare
Loading in …5
×

Bz33462466

123 views
86 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
123
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Bz33462466

  1. 1. Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.461-465462 | P a g eVerification of TD-PSOLA for Implementing Voice ModificationVivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee(Electrical Department, VJTI, Mumbai, India)(Electrical Department, VJTI, Mumbai, India)(Electrical Department, VJTI, Mumbai, India)ABSTRACTVoice modification is conversion of a model inputvoice signal into desired output signal. This canbe achieved by modifying basic parameters ofvoice viz. vocal tract, pitch and time scale. Fourmodels which are used for this purpose arediscussed here namely LPC model, H/S model,TD-PSOLA model and MBR-PSOLA model. Inthis paper, TD-PSOLA technique is studied andimplemented. Results of this implementation arederived and reviewed. The voice modificationsystem thus developed can contribute greatly tothe Medical and Entertainment Industry wheredesire voice modification is required.Keywords – Vocal tract, formants, pitch, voice.I. INTRODUCTIONIt is noteworthy fact that for people the voice is notonly just a tool for communication, but also anidentifying feature that allows expression ofpersonality. Researchers observed changes informant tracks and pitch contours, during clinicaldiagnosis of certain diseases. Diseases likeLaryngeal cancer can have a drastic impact on thespeech leading to disturbances of voice. Rehan A.Kazi concludes that the formant frequencies oflaryngectomy patients are higher than the formantfrequencies of normal subjects [1]. Gregory K.Sewall claims that patients with Parkinson‟s-relateddysphonia usually have reduced pitch ranges andincreased vocal tremor [2]. In such cases a needarises for voice modification. In voice modificationa model input voice signal is modified by varyingvocal tract, pitch and time scale. Men have a largervocal tract, which essentially gives the resultantvoice a lower-sounding timbre. Whereas Femalehave a smaller vocal tract, giving the resultant voicea higher-sounding timbre. Adult male voices areusually lower-pitched and adult female voices arehigher-pitched. Several works have been done forachieving voice modification and many methods areavailable for the same. Some of these are Theclassical AutoRegressive (LPC) [3], The hybridHarmonic/Stochastic (H/S) [4],[5], Time-DomainPitch-Synchronous Overlap-Add (TD-PSOLA) [6]and Multi-Band Re-synthesis Pitch-SynchronousOverlap-Add (MBR-PSOLA) [7]. These methodscan process stored data as well as data in real time.E.g. in karaoke, pitch shifting is used to tune theusers singing so that it reaches the original melodyin real-time [8]. In music composition, composersprocess music using pitch shifting [9]. Also it isused in applications like public speech system,entertainment Industry where specific voice style,tone, expressions are required.Voice modification process using TD-PSOLA isexplained in subsequent sections. Section 2describes speech signal fundamentals andimportance parameters of voice. The algorithmsused for modification are compared in section 3.Section 4 gives the details of TD-PSOLA method.Implementation of method and results areinterpreted in section 5 and 6. Section 7 summarizesand throws light on future scope.II. SPEECH SIGNAL FUNDAMENTALSThe vocal cords periodically vibrate to generateglottal flow. Human pronounces a vowel or a voicedconsonant which is composed of glottal pulses.Pitch period is the period of a glottal pulse.Fundamental frequency is reciprocal of the pitchperiod. The vocal tract acts as a time-varying filterto the glottal flow. The characteristics of the vocaltract include the frequency response, which dependson the position of organs, such as the pharynx andtongue. The peak frequencies in the frequencyresponse of the vocal tract are formants, also knownas formant frequencies. A speech signal is aconvolution of a time-varying stimulus (GlottalFlow) and a time-varying filter (Vocal Tract) asshown in Fig.1.Time-varyingStimulus(Glottal Flow) (Speech Signal)Fig.1. Modeling of speech signalII. COMPARISON OF ALGORITHMSThe four models covered are:1. The linear-prediction voice model is bestclassified as a parametric, spectral, source-filtermodel, in which the short-time spectrum isdecomposed into a flat excitation spectrummultiplied by a smooth spectral envelope capturingprimarily vocal formants [3].2. The hybrid Harmonic/Stochastic (H/S) modelproposed in [4] and [5]. In this speech signalsexpresses as the summation of slowly varyingharmonic and stochastic components. This is basedon Griffins analysis algorithm, uses the OLAsynthesis approach, the prosody matching andsegment concatenation.Time-varying Filter(Vocal Tract)
  2. 2. Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.461-465463 | P a g e3. TD-PSOLA algorithms rely on a pitch-synchronous overlap-add approach for modifyingthe speech prosody and concatenating speechwaveforms [6]. The modifications of the speechsignal are performed in the time domain dependingon the length of the window used in the synthesisprocess. The time domain approach provides veryefficient solutions for the real time implementationof synthesis systems.4. The Multi-Band Re-synthesis Pitch-SynchronousOverlap-Add (MBR-PSOLA) model is based on re-synthesis of the segments database of an originaland efficient hybrid H/S [7]. It supports spectralinterpolation between voiced parts of segments, withvirtually no increase in complexity.Result of ComparisonTD-PSOLA virtually exhibits no segmentsconcatenation capabilities. LPC is slightly superiorin this. LPC, hybrid H/S and MBR-PSOLA aresuperior to TD-PSOLA for automatic analysisprocedures. An efficient segment databasecompression algorithm is ensured for LPC andhybrid H/S synthesizers. It is currently beingdeveloped for MBR-PSOLA. Fluidity is better ofMBR-PSOLA and hybrid H/S due to superiorconcatenation capabilities. Prosody matching givescomparable results with all four models. TD-PSOLA is computationally simple compared tocomplexity of their respective synthesizers. It ismuch more intelligible than other synthesizers. It ismuch less sensitive to analysis V/UV errors thanMBR-PSOLA. It is perceived as equally naturalbecause it does not make use of any speech model.As seen from above, TD-PSOLA technique is betterthan other.III. TD-PSOLA (TIME DOMAIN PITCHSYNCHRONOUS OVERLAP-ADD)This approach is based on the decomposition of thesignal into overlapping frames synchronized withthe pitch period. After modifications of the speechsignal, consistency and accuracy of the pitch marksmust be preserved [10]. For this pitch detection isperformed to generate pitch marks throughoverlapping windowed speech.Input signal 𝑠 𝑛 and 𝑠𝑎 𝑛 centered at 𝑡 𝑎 time isdefined as:𝑠𝑎 𝑛 = 𝑠 𝑡 𝑎+𝑛Where, 𝑡 𝑎 is an analysis marks.A short-time version of 𝑠𝑎 𝑛 by multiplying it by awindow 𝑤𝑎 𝑛 is defined as 𝑧 𝑎 𝑛 :𝑧 𝑎 𝑛 = 𝑤𝑎 𝑛 × 𝑠𝑎 𝑛The window length is two times of the local pitchperiod. To synthesize speech at different pitchperiods, the Short Time signals (ST) are simplyoverlapped and added with desired spacing. Thesynthesized speech is defined as:𝑧 𝑛 = 𝑧 𝑛 [𝑛 − 𝑡 𝑎 ]∞𝑎=−∞A good choice for the time marks 𝑡 𝑎 is to coincidewith the instants of closing of the vocal folds whichindicates the periodicity of speech. For unvoicedspeech, these marks could be arbitrarily placed. Thisestimation from speech waveforms is a very difficultproblem.In speech analysis, a sequence of pitch-marks isprovided after filtering the speech signal.Voiced/unvoiced decision is based on the zero-crossing and the short time energy for each segmentbetween two consecutive pitch marks. A coefficientof voicement (v/uv) can be computed in order toquantize the periodicity of the signal [11]. To selectpitch marks among local extreme of the speechsignal, a set of mark candidates given with allnegative and positive peaks. The OLA synthesis isbased on the superposition-addition of elementarysignals in the new positions. These positions aredetermined by the height and the length of thesynthesis signal. To increase the pitch, theindividual pitch-synchronous frames are extractedand given to Hanning window. Then output framemoved close together and added up, whereas outputframe moved further apart to decrease the pitch.Increasing the pitch will result in a shorter signal, soto keep constant duration duplicate frames need tobe added. A fast re-sampling method is used to shiftthe frame precisely, where it will appear in the newsignal using the pitch mark and the synthesis markof a given frame.IV. IMPLEMENTATIONFollowing steps were implemented for achievingvoice modification:1. A wave file of mono sound quality, rate of11025 with 8 bits per sample is given asinput and amplitude vs. time graph wasplotted.2. Fast Fourier Transform of input file wastaken and then magnitude (dB) vs.frequency graph of the same was plotted.3. Input value of pitch scale, time scale andvocal tract were stored in a variable „a‟, „b‟,„c‟ respectively which ranges from -1 to 1.4. Using a polyphase filter implementation,resampling of the input at (P/Q) times ofthe original sampling rate is done.Resample applies an anti-aliasing(lowpass) FIR filter to input.Where,P = round (GValue*10); GValue = 2^c;Q=10.5. Approximate pitch contour was calculatedbased on energy peaks for finding pitchmarks and Path was found out usingrridden function. By differentiating pitch
  3. 3. Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.461-465464 | P a g emarks pitch period was found and first andlast pitch marks were removed.6. Analysis segment center was found. Thensegment was stretched and new segmentwas added and amount of overlap betweenwindows was defined.7. Output signal so acquired, was plotted toget graphs mentioned in first two steps.8. Finally both input and output files wereplayed and result was concluded.V. RESULTSIn this study, a voice quality modification algorithmwith TD-PSOLA modifier was implemented andtested. Normal female voice was taken as thereference and graphs plotted as shown in Fig.2 andFig.3.Fig. 2 Time domain graph of inputFig. 3 Frequency domain graph of inputAbove signal is then modified in different voices byvarying parameters. After changing Vocal Tractparameter from 0 to +1 and from 0 to -1, the changein frequency domain graphs is shown in Fig. 4 and5.Fig. 4 FFT graph of VT= +1Fig. 5 FFT graph of VT= -1After changing Time Scale parameter from 0 to +1,signal is expanded in time domain as shown in Fig.6 and when change from 0 to -1 signal shrinks asshown in Fig. 7.Fig. 6 FFT graph of TS= +1
  4. 4. Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.461-465465 | P a g eFig. 7 FFT graph of TS= -1After changing Pitch Scale parameter from 0 to +1and from 0 to -1, the change in frequency domaingraphs is shown in Fig. 8 and 9.Fig. 8 FFT graph of PS= +1Fig. 9 FFT graph of PS= -1VI. CONCLUSION AND FUTURE SCOPEAs seen from above results, since male have largerVocal Tract, when VT parameter increased to +1,husky male voice is heard and decreasing it to -1result into small child voice. Since female havehigher pitch, when PS parameter increased to +1,nasal female voice is heard and decreasing it to -1result into male voice. Same is summarized in Tablebelow.Table 1 Result of voice modificationIt is seen from above that vocal tract controls timbreof the voice signal whereas pitch varies tone of thesame. Above results show that the algorithm caneffectively modify input voice into the desired voicequality.Results of the simulation verify that the quality ofthe synthesized signal by TD-PSOLA techniquedepends on the precision of the analysis markingand the synthesis marking. In future, Algorithm withbetter analysis and synthesis marking can beimplemented for better voice quality and voicemodification technique can be used for voiceconversion applications.REFERENCES[1] Kazi, Rehan A., Vyas M.N. Prasad, JeeveKanagalingam, Christopher M. Nutting,Peter Clarke, Peter Rhys-Evans, and KevinJ. Harrington, “Assessment of the FormantFrequencies in Normal and LaryngectomyIndividuals Using Linear PredictiveCoding”, Journal of Voice 21, no. 6:661-668.[2] Sewall, Gregory K. MD; Jack Jiang, MD,PhD; and Charles N. Ford, MD, “ClinicalEvaluation of Parkinson‟s-RelatedDysphonia”, The Laryngoscope, 2006,116:1740-1744.[3] J.D. MARKEL, A.H. GRAY Jr, ”LinearPrediction of Speech”, Springer Verlag,New York, pp. 10-42, 1976.[4] D.W. GRIFFIN, J.S. LIM, “Multi-BandExcitation Vocoder”, IEEE Trans. onASSP, vol. ASSP-36, pp. 1223-1235, august1988.[5] A. J. ABRANTES, J. S. MARQUES, I. M.TRANSCOSO, "Hybrid SinusoidalModeling of Speech without VoicingDecision", EUROSPEECH 91, pp. 231-234.[6] E.MOULINES, F. CHARPENTIER, "PitchSynchronous waveform Processingtechniques for Text-To-Speech Synthesisusing diphones", Speech Communication,Vol. 9, n°5-6. 1989.[7] T. DUTOIT, H. LEICH, "Improving theTD-PSOLA Text-To-Speech Synthesizerwith a Specially Designed MBE Re-Synthesis of the Segments Database", Proc.-1 0 1Vocal Tract Small Child Normal Female Husky MalePitch Scale Normal Male Normal Female Nasal FemaleTime Scale Fast Normal Female SlowParameter ValuesParameter
  5. 5. Vivek Vijay Nar, Alice N. Cheeran, Souvik Banerjee / International Journal of EngineeringResearch and Applications (IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.461-465466 | P a g eEUSIPCO 92, 25-28 august 92, Brussels,pp. 343-347.[8] M. Ryynanen, T. Virtanen, J. Paulus, A.Klapuri, “Accompaniment Separation andKaraoke Application Based on AutomaticMelody Transcription”, IEEE InternationalConference on Multimedia and Expo, 2008,pp. 1147-1420.[9] E. Gomez, G. Peterschmitt, X. Amatriain,P. Herrera, “Content-Based MelodicTransformations of Audio Material for aMusic Processing Application”, Proc of6th International Conference on DigitalAudio Effects, London, UK, Sept 2003.[10] Abdelkader Chabchoub and Adnan Cherif,“Implementation of the Arabic SpeechSynthesis with TD-PSOLA Modifier”,International Journal of Signal SystemControl and Engineering Application,2010, Volume: 3, Issue: 4, pp. 77-80.[11] Cheveigne,A. and H. Ahara,”Acomparative evaluation of Fo estimationalgorithm”, Proceedings of the EuroSpeech Conference. (ESC’98), Norvege,pp: 453-467.

×