Speech signal processing lizy


Based on Kerala University M-Tech 1st Semester Speech Signal Processing of Signal Processing Branch.

Published in: Technology

  1. 1. SPEECH SIGNAL PROCESSING: KERALA UNIVERSITY M-TECH 1ST SEMESTER. Lizy Abraham, Assistant Professor, Department of ECE, LBS Institute of Technology for Women (A Govt. of Kerala Undertaking), Poojappura, Trivandrum - 695012, Kerala, India. lizytvm@yahoo.com, +919495123331
  2. 2. SYLLABUS: TSC 1004 SPEECH SIGNAL PROCESSING (3-0-0-3). Speech Production: Acoustic theory of speech production (excitation, vocal tract model for speech analysis, formant structure, pitch). Articulatory phonetics (articulation, voicing, articulatory model). Acoustic phonetics (basic speech units and their classification). Speech Analysis: Short-time speech analysis; time domain analysis (short-time energy, short-time zero-crossing rate, ACF); frequency domain analysis (filter banks, STFT, spectrogram, formant estimation and analysis); cepstral analysis. Parametric Representation of Speech: AR model, ARMA model; LPC analysis (LPC model, autocorrelation method, covariance method, Levinson-Durbin algorithm, lattice form); LSF, LAR, MFCC, sinusoidal model, GMM, HMM. Speech Coding: Phase vocoder, LPC, sub-band coding, adaptive transform coding, harmonic coding, vector quantization based coders, CELP. Speech Processing: Fundamentals of speech recognition, speech segmentation, text-to-speech conversion, speech enhancement, speaker verification, language identification, issues of voice transmission over the Internet.
  3. 3. REFERENCES: 1. Douglas O'Shaughnessy, Speech Communications: Human and Machine, 2nd edition, IEEE Press, 1999, ISBN 0780334493. 2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999, ISBN 0471351547. 3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978. 4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994. 5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, 1st edition, Prentice Hall, ISBN 013242942X. 6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999, ISBN 0471349593. For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, out of which any five shall be answered. It shall have 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each and 10 marks for assignments (minimum two) or a term project.
  4. 4. Speech processing means the processing of discrete-time speech signals.
  5. 5. [Diagram: Speech Processing at the center, drawing on: Algorithms (programming); Acoustics (psychoacoustics, room acoustics, speech production); Phonetics; Signal Processing (Fourier transforms, discrete-time filters, AR(MA) models); Information Theory (entropy, communication theory, rate-distortion theory); Statistical SP (stochastic models).]
  6. 6. 6
  7. 7. 7
  8. 8. HOW IS SPEECH PRODUCED? Speech can be defined as “a pressure acoustic signal that is articulated in the vocal tract”. Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract.
  9. 9. This air flow is referred to as the “excitation signal”. The excitation signal causes the vocal cords to vibrate and propagates the energy to excite the oral and nasal openings, which play a major role in shaping the sound produced. Vocal tract components: the oral tract (from the lips to the vocal cords) and the nasal tract (from the velum to the nostrils).
  10. 10. 10
  11. 11. 11
  12. 12. Larynx: the source of speech. Vocal cords (folds): the two folds of tissue in the larynx; they can open and shut like a pair of fans. Glottis: the gap between the vocal cords; as air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. The frequency of vibration determines the pitch of the voice (for a male, 50-200 Hz; for a female, up to 500 Hz).
  14. 14. Places of articulation alveolar post-alveolar/palatal dental velar uvular labial pharyngeal laryngeal/glottal 14
  15. 15. Classes of speech sounds. Voiced sounds: the vocal cords vibrate open and closed, producing quasi-periodic pulses of air; the rate of the opening and closing is the pitch. Unvoiced sounds: produced by forcing air at high velocities through a constriction; noise-like turbulence; they show little long-term periodicity, but short-term correlations are still present; e.g. “S”, “F”. Plosive sounds: a complete closure in the vocal tract; air pressure is built up and released suddenly; e.g. “B”, “P”.
  16. 16. Speech Model 16
  17. 17. SPEECH SOUNDS Coarse classification with phonemes. A phone is the acoustic realization of a phoneme. Allophones are context dependent phonemes. 17
  18. 18. PHONEME HIERARCHY: Speech sounds are language dependent; there are about 50 in English. Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw. Diphthongs: ay, ey, oy, aw. Consonants: lateral liquid (l); glides (w, y); retroflex liquid (r); plosives (p, b, t, d, k, g); fricatives (f, v, th, dh, s, z, sh, zh, h); nasals (m, n, ng).
  19. 19. 19
  20. 20. 20
  21. 21. Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.
  22. 22. 22
  23. 23. Vowel Chart Front Center Back i uHigh ɪ ʊ e o ə ʌ ɪMid ɛLow æ a ɪ
  24. 24. 24
  25. 25. SPEECH WAVEFORM CHARACTERISTICS Loudness Voiced/Unvoiced. Pitch. Fundamental frequency. Spectral envelope. Formants. 25
  26. 26. Acoustic Characteristics of Speech. Pitch: The signal within each voiced interval is quasi-periodic. The period T is called the pitch period. The pitch depends on the vowel being spoken and changes in time. T ≈ 70 samples in this example. f0 = 1/T is the fundamental frequency (the pitch frequency). It should not be confused with the formant frequencies, which are the resonances of the vocal tract.
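As a small worked example of f0 = 1/T: when T is counted in samples, the sampling rate converts it to Hz. The 8 kHz sampling rate below is an assumption made only for illustration, since the slide does not state it, and the function name is hypothetical.

```python
def fundamental_frequency_hz(period_samples, sample_rate_hz):
    """Return f0 in Hz for a pitch period given in samples (f0 = fs / T)."""
    if period_samples <= 0:
        raise ValueError("pitch period must be positive")
    return sample_rate_hz / period_samples

# T ~ 70 samples as on the slide; 8000 Hz is an assumed sampling rate.
f0 = fundamental_frequency_hz(70, 8000)
print(round(f0, 1))  # about 114.3 Hz under that assumption
```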
  27. 27. FORMANTS Formants can be recognized in the frequency content of the signal segment. Formants are best described as high energy peaks in the frequency spectrum of speech sound. 27
  28. 28. The resonant frequencies of the vocal tract are called formant frequencies or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants.
  29. 29. Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form:
  30. 30. 30
  31. 31. A detailed acoustic theory must consider the effects of the following: time variation of the vocal tract shape; losses due to heat conduction and viscous friction at the vocal tract walls; softness of the vocal tract walls; radiation of sound at the lips; nasal coupling; excitation of sound in the vocal tract. Let us begin by considering a simple case of a lossless tube:
  32. 32. 28 December 2012. MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT: We can represent the vocal tract as a concatenation of N lossless tubes with areas {Ak} and equal length ∆x = l/N. The wave propagation time through each tube is τ = ∆x/c = l/(Nc).
  33. 33. 33
  34. 34. Consider the N-tube model of the previous figure. Each tube has length lk and cross-sectional area Ak. Assume no losses and planar wave propagation. The wave equations for section k, 0 ≤ x ≤ lk:
  35. 35. 35
  36. 36. SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL: Boundary conditions. Physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system. At the k'th/(k+1)'st junction we have:
  38. 38. PROPAGATION OF SOUND IN A UNIFORM TUBE: The vocal tract transfer function of volume velocities is
  39. 39. PROPAGATION OF SOUND IN A UNIFORM TUBE: Using the boundary conditions U(0, s) = UG(s) and P(−l, s) = 0 (derivation in the Quatieri text, pages 122-125), the poles of the transfer function T(jΩ) are where cos(Ωl/c) = 0. See Quatieri, pages 119-124; the derivation of eqn. 4.18 is important.
  40. 40. PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT.): For c = 34,000 cm/sec and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, … The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles. The formant frequencies will have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to (1/4)λ1, (3/4)λ2, (5/4)λ3, …, where λi is the wavelength of the ith natural frequency.
  41. 41. UNIFORM TUBE MODEL: Example. Consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz. Compare its resonances with those of a tube of length l = 17.5 cm. With $f = \Omega/2\pi$ and $\Omega = k\,\frac{\pi}{2}\,\frac{c}{l}$, k = 1, 3, 5, ..., we get $f = \frac{\Omega}{2\pi} = k\,\frac{\pi}{2}\,\frac{c}{l}\,\frac{1}{2\pi} = k\,\frac{350}{4 \times 0.35} = 250k$, so f = 250, 750, 1250, ... Hz.
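  42. 42. UNIFORM TUBE MODEL: For the 17.5 cm tube: $f = \frac{\Omega}{2\pi} = k\,\frac{\pi}{2}\,\frac{c}{l}\,\frac{1}{2\pi} = k\,\frac{350}{4 \times 0.175} = 500k$, k = 1, 3, 5, ..., so f = 500, 1500, 2500, ... Hz. Halving the tube length doubles every resonance.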
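The two tube examples can be checked numerically with a minimal sketch of the quarter-wavelength relation f_n = (2n − 1)c/(4l). The function name is hypothetical, and the tube is assumed to be lossless, uniform, closed at one end, and open at the other.

```python
def tube_resonances_hz(length_m, speed_m_per_s, count=3):
    """Resonances of a uniform lossless tube, closed at one end and open at the other:
    f_n = (2n - 1) * c / (4 * l) for n = 1, 2, ..., count."""
    return [(2 * n - 1) * speed_m_per_s / (4.0 * length_m) for n in range(1, count + 1)]

for length in (0.35, 0.175):
    # Rounded to whole Hz for display.
    print(length, [round(f) for f in tube_resonances_hz(length, 350.0)])
# 0.35 [250, 750, 1250]
# 0.175 [500, 1500, 2500]
```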
  43. 43. 43
  45. 45. 45
  46. 46. VOWELS: Modeled as a tube closed at one end and open at the other. The closure is a membrane with a slit in it; the tube has uniform cross-sectional area; the membrane represents the source of energy (vocal folds); the energy travels through the tube; the tube generates no energy on its own. The tube represents an important class of resonators with an odd-quarter-wavelength relationship: Fn = (2n − 1)c/(4l).
  47. 47. VOWELS: Filter characteristics for vowels. The vocal tract is a dynamic filter; it is frequency dependent; it has, theoretically, an infinite number of resonances; each resonance has a center frequency, an amplitude, and a bandwidth. For speech, these resonances are called formants. Formants are numbered in succession from the lowest: F1, F2, F3, etc.
  48. 48. FRICATIVES: Modeled as a tube with a very severe constriction. The air exiting the constriction is turbulent. Because of the turbulence, there is no periodicity unless accompanied by voicing.
  49. 49. When a fricative constriction is tapered, the back cavity is involved. This resembles a tube closed at both ends, with Fn = nc/(2l). Such a situation occurs primarily for articulation disorders.
  50. 50. Introduction to Digital Speech Processing (Rabiner & Schafer), pages 20-23
  51. 51. 52
  52. 52. Rabiner & Schafer, pages 98-105
  53. 53. 54
  54. 54. SOUND SOURCE: VOCAL FOLD VIBRATION. Modeled as a volume velocity source at the glottis, UG(jΩ).
  55. 55. 56
  56. 56. SHORT-TIME SPEECH ANALYSIS: Segments (or frames, or vectors) are typically about 20 ms long. Within a segment, speech characteristics are assumed to be approximately constant, which allows for relatively simple modeling. Overlapping segments are often extracted.
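A minimal sketch of the framing step: the 8 kHz sampling rate and the 50% overlap are assumptions chosen for illustration (the slide gives only the 20 ms length), and `frame_signal` is a hypothetical helper name.

```python
def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of frame_len samples, advancing by hop samples.
    Trailing samples that do not fill a whole frame are dropped."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [x[start:start + frame_len] for start in range(0, len(x) - frame_len + 1, hop)]

fs = 8000                      # assumed sampling rate in Hz
frame_len = int(0.020 * fs)    # 20 ms -> 160 samples
hop = frame_len // 2           # assumed 50% overlap -> 80 samples
x = list(range(400))           # stand-in for 50 ms of speech samples
frames = frame_signal(x, frame_len, hop)
print(len(frames), len(frames[0]))
```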
  58. 58. The system is an all-pole system with a system function of the form: For all-pole linear systems, the input and output are related by a difference equation of the form:
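The slide's equations did not survive extraction, so the sketch below assumes the common convention H(z) = G / (1 − Σ a_k z^{-k}), which gives the difference equation s[n] = Σ a_k s[n−k] + G e[n]. The sign convention and the function name are assumptions; note that the linear prediction slides later in this deck write the sum with the opposite sign.

```python
def all_pole_filter(excitation, a, gain=1.0):
    """Run s[n] = sum_{k=1..p} a[k-1] * s[n-k] + gain * e[n] with zero initial conditions."""
    s = []
    for n, e_n in enumerate(excitation):
        acc = gain * e_n
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:          # samples before n = 0 are taken as zero
                acc += a_k * s[n - k]
        s.append(acc)
    return s

# Impulse response of a single real pole at z = 0.5.
print(all_pole_filter([1.0, 0.0, 0.0, 0.0], a=[0.5]))  # [1.0, 0.5, 0.25, 0.125]
```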
  59. 59. 60
  60. 60. The operator T{ } defines the nature of the short-time analysis function, and w[n̂ − m] represents a time-shifted window sequence.
  61. 61. 62
  62. 62. SHORT-TIME ENERGY simple to compute, and useful for estimating properties of the excitation function in the model. In this case the operator T{ } is simply squaring the windowed samples. 63
  63. 63. SHORT-TIME ZERO-CROSSING RATE Weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to: 64
  64. 64. Since 0.5|sgn{x[m]} − sgn{x[m − 1]}| is equal to 1 if x[m] and x[m − 1] have different algebraic signs and 0 if they have the same sign, it follows that the short-time zero-crossing rate is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n̂ − m].
  65. 65. The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.
  66. 66. Short-time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore they can be sampled at a much lower rate than that of the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n̂ in jumps of more than one sample.
  67. 67. During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.
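The voiced/unvoiced contrast described above can be illustrated on synthetic data. This is only a sketch under stated assumptions: a 150 Hz sinusoid stands in for voiced speech, low-level uniform noise stands in for unvoiced speech, and the zero-crossing measure is normalized by the window sum (a choice made here, not taken from the slides). All function names are hypothetical.

```python
import math
import random

def hamming(length):
    """Hamming window of the given length (length >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def sgn(value):
    return 1 if value >= 0 else -1

def short_time_energy(frame, window):
    """Sum of the squared windowed samples."""
    return sum((x * w) ** 2 for x, w in zip(frame, window))

def short_time_zcr(frame, window):
    """Window-weighted fraction of adjacent sample pairs whose signs differ."""
    weighted = sum(0.5 * abs(sgn(frame[m]) - sgn(frame[m - 1])) * window[m]
                   for m in range(1, len(frame)))
    return weighted / sum(window[1:])

fs = 16000
length = 401                       # 25 ms at 16 kHz, as on the slide
window = hamming(length)
rng = random.Random(0)             # fixed seed so the example is repeatable
voiced_like = [0.8 * math.sin(2 * math.pi * 150 * n / fs) for n in range(length)]
unvoiced_like = [rng.uniform(-0.1, 0.1) for _ in range(length)]

for name, frame in (("voiced-like", voiced_like), ("unvoiced-like", unvoiced_like)):
    print(name, short_time_energy(frame, window), short_time_zcr(frame, window))
```

Under these assumptions the noise frame shows the higher zero-crossing rate and the sinusoid the higher energy, matching the trend the slides describe.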
  68. 68. SHORT-TIME AUTOCORRELATION FUNCTION (STACF): The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. The STACF is defined as the deterministic autocorrelation function of the sequence $x_{\hat{n}}[m] = x[m]\,w[\hat{n} - m]$ that is selected by the window shifted to time n̂, i.e., $\phi_{\hat{n}}[\ell] = \sum_{m} x_{\hat{n}}[m]\,x_{\hat{n}}[m + \ell]$.
  69. 69. 70
  70. 70. e[n] is the excitation to the linear system with impulse response h[n]. A well-known, and easily proved, property of the autocorrelation function is that the autocorrelation function of s[n] = e[n] * h[n] is the convolution of the autocorrelation functions of e[n] and h[n].
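To illustrate the use of the short-time autocorrelation as a periodicity detector, here is a minimal sketch that picks the lag of the largest autocorrelation value within a plausible pitch range. The rectangular window, the 8 kHz sampling rate, the lag range, and the synthetic two-harmonic signal are all assumptions for illustration; the function names are hypothetical.

```python
import math

def short_time_autocorrelation(frame, max_lag):
    """phi[lag] = sum_m frame[m] * frame[m + lag] for lag = 0..max_lag.
    The frame is assumed to be already windowed (rectangular here)."""
    size = len(frame)
    return [sum(frame[m] * frame[m + lag] for m in range(size - lag))
            for lag in range(max_lag + 1)]

def acf_pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) of the largest autocorrelation value in [min_lag, max_lag]."""
    phi = short_time_autocorrelation(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda lag: phi[lag])

fs = 8000                          # assumed sampling rate
period = 80                        # synthetic pitch period in samples (100 Hz)
frame = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
         for n in range(400)]
lag = acf_pitch_period(frame, min_lag=20, max_lag=160)
print(lag, fs / lag)  # 80 100.0
```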
  71. 71. 72
  72. 72. SHORT-TIME FOURIER TRANSFORM (STFT): The expression for the discrete-time STFT at time n, where w[n] is assumed to be non-zero only in the interval $[0, N_w - 1]$ and is referred to as the analysis window or sometimes as the analysis filter.
  73. 73. 74
  74. 74. FILTERING VIEW 75
  75. 75. 76
  76. 76. 77
  77. 77. SHORT-TIME SYNTHESIS: The problem of obtaining a sequence back from its discrete-time STFT. This equation represents a synthesis equation for the discrete-time STFT.
  78. 78. FILTER BANK SUMMATION (FBS) METHOD: The discrete STFT is considered to be the set of outputs of a bank of filters. The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence. That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y(n) satisfying the following equation:
  79. 79. 80
  80. 80. 81
  81. 81. 82
  82. 82. 83
  83. 83. OVERLAP-ADD METHOD: Just as the FBS method was motivated from the filtering view of the STFT, the OLA method is motivated from the Fourier transform view of the STFT. In this method, for each fixed time, we take the inverse DFT of the corresponding frequency function. However, instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-and-add operation between the short-time sections.
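The overlap-add idea can be sketched end to end on a toy signal: window and transform each frame, inverse-transform each frame, and add the sections back at their original positions. The assumptions here are a periodic Hann window at 50% overlap (so the shifted windows sum to one), a direct O(N²) DFT for self-containment, and frame-local time indexing; all names are hypothetical.

```python
import cmath
import math

def hann_periodic(size):
    """Periodic Hann window; copies shifted by size/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / size) for n in range(size)]

def dft(x):
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(size))
            for k in range(size)]

def idft_real(spectrum):
    size = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / size)
                for k in range(size)).real / size for n in range(size)]

def stft(x, size, hop):
    """DFT of each windowed frame (time index local to the frame)."""
    w = hann_periodic(size)
    return [dft([x[start + n] * w[n] for n in range(size)])
            for start in range(0, len(x) - size + 1, hop)]

def overlap_add(frames, size, hop):
    """Inverse-DFT every frame and add the sections at their original offsets."""
    y = [0.0] * ((len(frames) - 1) * hop + size)
    for p, spectrum in enumerate(frames):
        for n, value in enumerate(idft_real(spectrum)):
            y[p * hop + n] += value
    return y

x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
size, hop = 16, 8
y = overlap_add(stft(x, size, hop), size, hop)
# Samples covered by two frames are recovered; the first and last half-frames are not.
max_err = max(abs(y[n] - x[n]) for n in range(hop, len(x) - hop))
print(max_err < 1e-9)
```

The edges are excluded from the comparison because only one window covers them, so the window sum there is less than one.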
  84. 84. Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by
  85. 85. 86
  86. 86. Furthermore, if the discrete STFT had been decimated in time by a factor L, it can be similarly shown that if the analysis window satisfies
  87. 87. 88
  88. 88. DESIGN OF DIGITAL FILTER BANKS: Rabiner & Schafer, pages 282-297
  89. 89. 90
  90. 90. 91
  91. 91. 92
  92. 92. USING IIR FILTER 93
  93. 93. 94
  94. 94. USING FIR FILTER 95
  95. 95. 96
  96. 96. 97
  97. 97. 98
  98. 98. 99
  99. 99. 100
  101. 101. 102
  102. 102. 103
  103. 103. FBS synthesis results in multiple copies of the input:
  104. 104. PHASE VOCODER: The Fourier series is computed over a sliding window of a single pitch period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.
  105. 105. 106
  106. 106. 107
  107. 107. which can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency. The STFT of a continuous-time signal is written as,
  108. 108. 109
  109. 109. where is an initial condition. The signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with generally a time-varying amplitude and frequency modulation. An alternative expression is,
  110. 110. which is the time-domain counterpart to the frequency-domain phase derivative.
  111. 111. We can sample the continuous-time STFT, with sampling interval T, to obtain the discrete-time STFT.
  112. 112. 113
  113. 113. 114
  114. 114. 115
  115. 115. 116
  116. 116. 117
  118. 118. 119
  119. 119. 120
  120. 120. 121
  121. 121. 122
  122. 122. HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS: Use of the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. Cepstrum of a discrete-time signal:
  123. 123. 124
  124. 124. That is, the complex cepstrum operator transforms convolution into addition. This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.
  125. 125. The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., the computation of the phase angle arg[X(e^{jω})], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.
  126. 126. THE SHORT-TIME CEPSTRUM: The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform.
  127. 127. 128
  128. 128. RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM: Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are inside the unit circle. An example would be the impulse response of an all-pole vocal tract model with system function
  129. 129. In this case, all the poles ck must be inside the unit circle for stability of the system.
  131. 131. The low-quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high-quefrency components would correspond to the more rapid fluctuations of the log spectrum.
  132. 132. The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. This periodic structure in the log spectrum manifests itself as a cepstrum peak at a quefrency of about 9 ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech. Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as the cepstrum does. In contrast, the rapid variations of the unvoiced spectra appear random with no periodic structure. As a result, there is no strong peak indicating periodicity as in the voiced case.
  133. 133. These slowly varying log spectra clearly retain the general spectral shape, with peaks corresponding to the formant resonance structure for the segment of speech under analysis.
  134. 134. APPLICATION TO PITCH DETECTION The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec). 135
  135. 135. For positions 1 through 5, the window includes only unvoiced speech. For positions 6 and 7, the signal within the window is partly voiced and partly unvoiced. For positions 8 through 15, the window includes only voiced speech. The rapid variations of the unvoiced spectra appear random with no periodic structure. The spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.
  136. 136. 137
  137. 137. The cepstrum peak at a quefrency of about 11-12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.
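The peak-picking rule above can be sketched on a synthetic frame. Assumptions made for illustration: a decaying impulse train (period 40 samples) passed through a one-pole filter stands in for voiced speech, the window is rectangular, the sampling rate is 8 kHz, the search range is 2.5 to 16 ms, and a direct DFT is used so the block needs only the standard library. Function names are hypothetical.

```python
import cmath
import math

def log_magnitude_spectrum(x):
    """log|X[k]| from a direct N-point DFT (natural logarithm)."""
    size = len(x)
    return [math.log(abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / size)
                             for n in range(size)))) for k in range(size)]

def real_cepstrum_at(log_mag, quefrency):
    """One sample of the inverse DFT of log|X[k]|; real because log|X| is even."""
    size = len(log_mag)
    return sum(log_mag[k] * math.cos(2 * math.pi * k * quefrency / size)
               for k in range(size)) / size

def cepstral_pitch_period(x, min_q, max_q):
    """Quefrency (in samples) of the largest cepstrum value in [min_q, max_q]."""
    log_mag = log_magnitude_spectrum(x)
    return max(range(min_q, max_q + 1), key=lambda q: real_cepstrum_at(log_mag, q))

fs = 8000                      # assumed sampling rate
size, period = 256, 40         # synthetic pitch period of 40 samples (5 ms)
excitation = [0.0] * size
for r in range(6):
    excitation[r * period] = 0.9 ** r      # slowly decaying pulse train
frame, previous = [], 0.0
for e_n in excitation:
    previous = e_n + 0.8 * previous        # one-pole "vocal tract" stand-in
    frame.append(previous)

q = cepstral_pitch_period(frame, min_q=20, max_q=128)
print(q, 1000.0 * q / fs, fs / q)          # period in samples, in ms, and f0 in Hz
```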
  138. 138. MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC): The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used. A short-time Fourier analysis is done first, resulting in a DFT X_n̂[k] for analysis time n̂. Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function.
  139. 139. The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate (4 kHz), resulting in a total of 22 filters. The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, ..., R as
  140. 140. 141
  141. 141. is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n̂[m], i.e.,
  142. 142. 143
  143. 143. The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra are different, but they have in common that they have peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum has more smoothing due to the structure of the filter bank.
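The last two MFCC steps (normalized triangular weighting of the power spectrum, then a cosine transform of the log outputs) can be sketched as follows. The slide equations were lost in extraction, so the exact normalization, the DCT-II form, and the evenly spaced toy filters below are assumptions; a real front end would place the filters on a mel or critical-band scale. All names are hypothetical.

```python
import math

def triangular_filter(center_bin, half_width, n_bins):
    """Triangular weights V_r[k]: 1 at center_bin, falling to 0 half_width bins away."""
    return [max(0.0, 1.0 - abs(k - center_bin) / half_width) for k in range(n_bins)]

def filter_bank_outputs(power_spectrum, filters):
    """MF[r] = (1/A_r) * sum_k V_r[k]^2 * |X[k]|^2, with A_r = sum_k V_r[k]^2 (assumed form)."""
    outputs = []
    for weights in filters:
        norm = sum(v * v for v in weights)
        outputs.append(sum(v * v * p for v, p in zip(weights, power_spectrum)) / norm)
    return outputs

def cepstral_coefficients(outputs, n_coeffs):
    """DCT-II of the log filter-bank outputs, scaled by 1/R (one common choice)."""
    count = len(outputs)
    logs = [math.log(v) for v in outputs]
    return [sum(logs[r] * math.cos(math.pi * m * (r + 0.5) / count)
                for r in range(count)) / count for m in range(n_coeffs)]

n_bins = 33
filters = [triangular_filter(c, 4.0, n_bins) for c in range(4, 29, 4)]  # 7 toy filters
flat_power = [2.0] * n_bins                                             # flat spectrum
coeffs = cepstral_coefficients(filter_bank_outputs(flat_power, filters), 4)
print(coeffs)
```

A flat spectrum is a useful sanity check: every filter output is the same, so only the zeroth coefficient (the log level) is non-zero.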
  144. 144. THE SPEECH SPECTROGRAM: Simply a display of the magnitude of the STFT. Specifically, the images in the figure are plots of the STFT magnitude, where the plot axes are labeled in terms of analog time and frequency through the relations t_r = rRT and f_k = k/(NT), where T is the sampling period of the discrete-time signal x[n] = x_a(nT).
  145. 145. In order to make the image smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length L. Such a function of two variables can be plotted on a two-dimensional surface as either a gray-scale or a color-mapped image. The bars on the right calibrate the color map (in dB).
  146. 146. 147
  147. 147. If the analysis window is short, the spectrogram is called a wide-band spectrogram, which is characterized by good time resolution and poor frequency resolution. When the window length is long, the spectrogram is a narrow-band spectrogram, which is characterized by good frequency resolution and poor time resolution.
  148. 148. THE SPECTROGRAM: A classic analysis tool. It consists of DFTs of overlapping, windowed frames, and displays the distribution of energy in time and frequency. $10 \log_{10} |X_m(f)|^2$ is typically displayed.
  149. 149. THE SPECTROGRAM CONT. 150
  150. 150. 151
  151. 151. Note the three broad peaks in the spectrum slice at time t_r = 430 ms, and observe that similar slices would be obtained at other times around t_r = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.
  152. 152. The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations, since several periods are included in the window.
  153. 153. SHORT-TIME ACF [Figure: short-time autocorrelation functions for the sounds /m/, /ow/, and /s/.]
  154. 154. CEPSTRUM: Speech wave (S) = excitation (E) · filter (H), where E is the glottal excitation from the vocal cords (glottis) and H is the vocal tract filter. Figure source: http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
  155. 155. CEPSTRAL ANALYSIS: The signal s is the convolution (*) of the glottal excitation e and the vocal tract filter h: s(n) = e(n) * h(n), where n is the time index. After the Fourier transform, FT{s(n)} = FT{e(n) * h(n)}, convolution (*) becomes multiplication (·), and time n maps to frequency w: S(w) = E(w) · H(w). Taking the magnitude of the spectrum gives |S(w)| = |E(w)| · |H(w)|, and taking the logarithm gives log10|S(w)| = log10|E(w)| + log10|H(w)|. Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
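The chain from convolution to a sum of log magnitudes can be verified numerically on short toy sequences. The sequences, the DFT length, and the helper names below are illustrative assumptions; the DFT length is chosen at least as long as the convolved sequence so that the linear convolution is represented exactly.

```python
import cmath
import math

def dft(x, size):
    """size-point DFT of x, implicitly zero-padded to size samples."""
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / size) for n in range(len(x)))
            for k in range(size)]

def convolve(e, h):
    """Linear convolution s(n) = e(n) * h(n)."""
    s = [0.0] * (len(e) + len(h) - 1)
    for i, e_i in enumerate(e):
        for j, h_j in enumerate(h):
            s[i + j] += e_i * h_j
    return s

e = [1.0, 0.0, 0.0, 0.5]       # toy "excitation"
h = [1.0, 0.6, 0.36]           # toy "vocal tract" impulse response
s = convolve(e, h)
size = 8                       # >= len(s), so no time aliasing
S, E, H = dft(s, size), dft(e, size), dft(h, size)

# log10|S(w)| should equal log10|E(w)| + log10|H(w)| at every frequency bin.
max_err = max(abs(math.log10(abs(S[k])) - (math.log10(abs(E[k])) + math.log10(abs(H[k]))))
              for k in range(size))
print(max_err < 1e-12)
```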
  156. 156. CEPSTRUM: C(n) = IDFT[log10|S(w)|] = IDFT[log10|E(w)| + log10|H(w)|]. Processing chain: s(n) → windowing → x(n) → DFT → X(w) → log|X(w)| → IDFT → C(n), where n is the time index, w is the frequency, and IDFT is the inverse discrete Fourier transform. In C(n), the contributions of the excitation E and the filter H appear at two different positions. Application: useful for analyzing (i) the glottal excitation and (ii) the vocal tract filter.
  157. 157. EXAMPLE OF CEPSTRUM: sampling frequency 22.05 kHz
  158. 158. SUB BAND CODING 159
  159. 159. The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.
  160. 160. Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. Quadrature mirror filters can be further subdivided from high to low filters by splitting the fullband into two, then the resulting lower band into two, and so on.
  161. 161. This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
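A minimal sketch of octave-band splitting with perfect reconstruction, using the two-tap Haar pair as the simplest filters with the mirror relationship H1(z) = H0(−z). This filter choice is an assumption for illustration (practical subband coders use longer filters), quantization is omitted, and the function names are hypothetical.

```python
import math

def analyze(x):
    """Two-band split with 2-tap Haar filters, each followed by decimation by 2."""
    r = math.sqrt(0.5)
    low = [r * (x[2 * n] + x[2 * n + 1]) for n in range(len(x) // 2)]
    high = [r * (x[2 * n] - x[2 * n + 1]) for n in range(len(x) // 2)]
    return low, high

def synthesize(low, high):
    """Inverse of analyze: interleave the reconstructed even and odd samples."""
    r = math.sqrt(0.5)
    x = []
    for l_n, h_n in zip(low, high):
        x.extend([r * (l_n + h_n), r * (l_n - h_n)])
    return x

def octave_split(x, levels):
    """Repeatedly split the lower band, keeping each high band."""
    bands = []
    for _ in range(levels):
        x, high = analyze(x)
        bands.append(high)
    return x, bands

def octave_merge(low, bands):
    """Rebuild the fullband signal from the lowest band upward."""
    for high in reversed(bands):
        low = synthesize(low, high)
    return low

x = [0.5, -1.0, 2.0, 3.5, -0.25, 0.0, 1.5, 4.0]   # length must be divisible by 2**levels
lowest, bands = octave_split(x, levels=3)
y = octave_merge(lowest, bands)
print(max(abs(a - b) for a, b in zip(x, y)) < 1e-12)
```

In a coder, each band in `bands` and `lowest` would be quantized before merging, which is where the noise-cancellation property of the filter design matters.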
  162. 162. 163
  163. 163. LINEAR PREDICTION (INTRODUCTION): The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both: $\hat{y}(n) = \sum_{j=0}^{q} b(j)\,x(n-j) - \sum_{i=1}^{p} a(i)\,y(n-i)$. The factors a(i) and b(j) are called predictor coefficients.
  164. 164. LINEAR PREDICTION (INTRODUCTION): Many systems of interest to us are describable by a linear, constant-coefficient difference equation: $\sum_{i=0}^{p} a(i)\,y(n-i) = \sum_{j=0}^{q} b(j)\,x(n-j)$. If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then $N(z) = \sum_{j=0}^{q} b(j)\,z^{-j}$ and $D(z) = \sum_{i=0}^{p} a(i)\,z^{-i}$. Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).
  165. 165. LINEAR PREDICTION (TYPES OF SYSTEM MODEL): There are two important variants. All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant. All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity. The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.
  166. 166. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS): Given a zero-mean signal y(n), in the AR model: $\hat{y}(n) = -\sum_{i=1}^{p} a(i)\,y(n-i)$. The error is $e(n) = y(n) - \hat{y}(n) = \sum_{i=0}^{p} a(i)\,y(n-i)$, with a(0) = 1. To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those that make the error orthogonal to the samples y(n-1), y(n-2), …, y(n-p).
  167. 167. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS): Thus we require that $\langle y(n-j)\,e(n) \rangle = 0$ for j = 1, 2, ..., p, or $\langle y(n-j) \sum_{i=0}^{p} a(i)\,y(n-i) \rangle = 0$. Interchanging the operations of averaging and summing, and representing ⟨ ⟩ by summing over n, we have $\sum_{i=0}^{p} a(i) \sum_{n} y(n-i)\,y(n-j) = 0$, j = 1, ..., p. The required predictors are found by solving these equations.
  168. 168. LINEAR PREDICTION (DERIVATION OF LP EQUATIONS): The orthogonality principle also states that the resulting minimum error is given by $E = \langle e^{2}(n) \rangle = \langle y(n)\,e(n) \rangle$, or $\sum_{i=0}^{p} a(i) \sum_{n} y(n-i)\,y(n) = E$. We can minimize the error over all time: $\sum_{i=0}^{p} a(i)\,r_{i-j} = 0$, j = 1, 2, ..., p, and $\sum_{i=0}^{p} a(i)\,r_{i} = E$, where $r_{i} = \sum_{n=-\infty}^{\infty} y(n)\,y(n-i)$.
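The normal equations above have Toeplitz structure, so they can be solved with the Levinson-Durbin recursion named in the syllabus. The sketch below follows the sign convention of these slides (a(0) = 1 and e(n) = Σ a(i) y(n−i)); the function names and the AR(1)-style test autocorrelation are assumptions for illustration.

```python
def autocorrelation(y, max_lag):
    """r_i = sum_n y(n) * y(n - i) over the available samples, for i = 0..max_lag."""
    return [sum(y[n] * y[n - i] for n in range(i, len(y))) for i in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_{i=0..p} a(i) r_{|i-j|} = 0 (j = 1..p) with a(0) = 1.
    Returns (a, E), where E = sum_{i=0..p} a(i) r_i is the minimum error."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for m in range(1, order + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / error                      # reflection coefficient for stage m
        updated = a[:]
        for i in range(1, m):
            updated[i] = a[i] + k * a[m - i]
        updated[m] = k
        a = updated
        error *= (1.0 - k * k)
    return a, error

# Autocorrelation proportional to 0.5**|i|, as produced by y(n) = 0.5 y(n-1) + x(n).
r = [1.0, 0.5, 0.25]
a, E = levinson_durbin(r, order=2)
print(a, E)   # expect a close to [1, -0.5, 0] and E close to 0.75
```

With a real frame of speech, `autocorrelation(frame, p)` would supply `r`, which corresponds to the autocorrelation method listed in the syllabus.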
  169. 169. LINEAR PREDICTION (APPLICATIONS): Autocorrelation matching: We have a signal y(n) with known autocorrelation $r_{yy}(n)$. We model this with an AR system in which the input e(n) drives the filter H(z) to produce y(n): $H(z) = \frac{\sigma}{A(z)} = \frac{\sigma}{1 - \sum_{i=1}^{p} a_i z^{-i}}$.
  170. 170. LINEAR PREDICTION (ORDER OF LINEAR PREDICTION): The choice of predictor order depends on the analysis bandwidth. The rule of thumb is $p = \frac{2\,BW}{1000} + c$, where BW is the analysis bandwidth in Hz and c is a constant. For a normal vocal tract, there is an average of about one formant per kilohertz of bandwidth. One formant requires a pair of complex-conjugate poles. Hence for every formant we require two predictor coefficients, or two coefficients per kilohertz of bandwidth.
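  171. 171. LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL): True model. [Block diagram: for voiced speech, a discrete-time impulse generator controlled by the pitch drives the glottal filter G(z), followed by a gain; for unvoiced speech, an uncorrelated noise generator is followed by a gain. A voiced/unvoiced (V/U) switch selects the volume velocity U(n), which passes through the vocal tract filter H(z) and the LP filter R(z) to give the speech signal s(n).]
  172. 172. LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL): Using LP analysis. [Block diagram: a discrete-time impulse generator controlled by the pitch estimate (voiced) or a white noise generator (unvoiced) is selected by a V/U switch, scaled by a gain, and passed through the all-pole (AR) filter H(z) to give the speech signal s(n).]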