The document discusses speech processing and vocoding. It begins by defining speech and how it is produced, including voiced and unvoiced sounds. It then describes the human speech production system and various speech coding techniques like waveform coding, vocoding, and analysis-by-synthesis coding. Finally, it provides details on the G.729 speech codec, including its operations, process flow, specifications, and how it achieves speech compression to 8 kbps from the original 128 kbps.
2. What is Speech?
• Speech is composed of phonemes, which are produced by the vocal cords and the vocal tract (which includes the mouth and the lips).
• It is the ability to express thoughts and feelings by vocalizing sounds.
3. • Speech can be further divided into voiced and unvoiced speech.
• Voiced Speech: Voiced signals are produced when the vocal cords vibrate during the pronunciation of a phoneme. Voiced signals tend to be louder, like the vowels /a/, /e/, /i/, /o/, /u/.
• Unvoiced Speech: Unvoiced signals, on the other hand, tend to be more abrupt, like the stop consonants /p/, /t/, /k/.
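The voiced/unvoiced distinction can be illustrated with a rough frame classifier. This is only a sketch: it uses frame energy (voiced sounds tend to be louder) together with the zero-crossing rate, which is a common heuristic and an assumption here rather than something taken from the slides. The function names and both thresholds are hypothetical and would need tuning on real speech.

```python
import math
import random

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def is_voiced(frame, energy_threshold=0.01, zcr_threshold=0.25):
    """Call a frame voiced if it is loud and crosses zero rarely.

    Both thresholds are illustrative assumptions, not standard values.
    """
    return (short_time_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)

fs = 8000  # samples per second

# 10 ms of a 150 Hz tone stands in for a vowel-like (voiced) frame.
voiced_like = [0.5 * math.sin(2 * math.pi * 150 * n / fs) for n in range(80)]

# 10 ms of weak noise stands in for an unvoiced frame.
rng = random.Random(0)
unvoiced_like = [0.05 * rng.uniform(-1, 1) for _ in range(80)]

voiced_result = is_voiced(voiced_like)
unvoiced_result = is_voiced(unvoiced_like)
```

With these synthetic frames, the tone is classified as voiced and the weak noise is not.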
4. Human Speech Production System
When we speak:
• Air is pushed from the lungs through the vocal tract and out of the mouth, producing speech.
• For voiced sounds, the vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of the voice.
• Women and young children tend to have a high pitch (fast vibration), while adult males tend to have a low pitch (slow vibration).
• For certain fricative and plosive (unvoiced) sounds, the vocal cords do not vibrate but remain open.
• The shape of the vocal tract determines the sound that is produced.
• As we speak, the vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms).
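Since pitch is the vibration rate of the vocal cords, it can be estimated from a short frame by looking for the lag at which the signal best repeats itself. The sketch below uses a plain autocorrelation search; the function name, the 50 to 400 Hz search range, and the 40 ms frame length are assumptions for illustration. The unnormalized sum is a simplification that can pick a slightly wrong lag for low-pitched signals.

```python
import math

def estimate_pitch_hz(frame, fs=8000, f_min=50.0, f_max=400.0):
    """Estimate pitch as fs divided by the lag with the largest autocorrelation."""
    lag_min = int(fs / f_max)                        # shortest period considered
    lag_max = min(int(fs / f_min), len(frame) - 1)   # longest period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return fs / best_lag

fs = 8000
# 40 ms of a 200 Hz tone as a stand-in for a voiced frame.
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(320)]
pitch = estimate_pitch_hz(tone, fs)
```

For this test tone the best lag is 40 samples, which corresponds to 200 Hz.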
6. Speech Coding
• It is a procedure to represent a digitized speech signal using as few bits as possible while
maintaining the speech quality.
• It is the process of converting a speech signal from a higher bit rate to a lower bit rate.
• It is a speech compression process.
Figure: Speech codec in mobile phone technology
8. Speech Coding Performance Attributes
• The primary requirements of a speech coder are:
• Low bit rate: Less bandwidth is required for transmission, leading to a more cost-efficient system.
• High speech quality: High signal-to-noise ratio (SNR) and high PESQ (Perceptual Evaluation of Speech Quality) scores.
• Other desirable requirements are:
• Robustness across different speakers / languages: Must support different speakers
(adult male, adult female, and children) and different languages.
• Robustness in the presence of channel errors
• Good performance on non-speech signals such as telephone signaling tones
• Low memory size and low computational complexity
• Low coding delay
9. Speech Coder
• In speech codecs, speech is represented in the form of a code and the code is stored or transmitted.
• The implementation of a speech codec essentially means the implementation of a speech coder.
• The operation of the speech decoder depends on the method of coding employed in the speech coder.
Classification of speech coders (based on coding techniques):
• Waveform Coders
• Voice Coders (Vocoders)
• Hybrid Coders
• Analysis-by-Synthesis Coders
10. • Waveform Coders:
• They preserve the original shape of the signal waveform.
• Better suited to higher bit-rate coding, e.g. PCM, ADPCM.
• Voice Coders (Vocoders):
• Speech signal is assumed to be generated from a model which is controlled by some
parameters.
• During encoding, parameters of the model are estimated from the input speech signal.
• Then the parameters are transmitted as the encoded bit-stream.
• Quality of the decoded speech depends on how well the model fits the signal.
• Low bit rate coder.
• Example: LPC-10 vocoder. (G.729, though often called a vocoder, is more precisely an analysis-by-synthesis coder.)
11. • Hybrid Coders
• Hybrid coders combine the strengths of waveform coders and vocoders.
• Additional parameters of the model are optimized such that the decoded speech is as close as
possible to the original waveform.
• Medium bit rate coder.
• Analysis-by-Synthesis Coders
• Improved form of vocoders.
• Candidate excitation signals are drawn from a given codebook and passed through the synthesis filter to produce synthesized speech.
• The best perceptual match to the original speech is found by comparing each synthesized signal with the original and selecting the one with minimum error.
• The parameters representing the best excitation signal and the corresponding production filters are then sent to the decoder.
• Example: G.729 (CS-ACELP) and other ACELP speech coders
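The analysis-by-synthesis loop can be sketched as a brute-force codebook search: each candidate excitation is run through an all-pole synthesis filter, scaled by its best gain, and compared with the target. This is a toy illustration under stated assumptions, not the G.729 search. The function names, the tiny three-entry codebook, and the single filter coefficient are made up, and plain squared error stands in for the perceptually weighted error a real coder would use.

```python
def synthesize(excitation, lp_coeffs):
    """All-pole synthesis filter: s[n] = e[n] - sum_k a_k * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lp_coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out

def search_codebook(target, codebook, lp_coeffs):
    """Return (index, gain, error) of the entry whose synthesis best matches target."""
    best_index, best_gain, best_err = None, 0.0, float("inf")
    for i, code in enumerate(codebook):
        y = synthesize(code, lp_coeffs)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = i, gain, err
    return best_index, best_gain, best_err

lp = [-0.9]  # hypothetical first-order filter: s[n] = e[n] + 0.9 * s[n-1]
codebook = [
    [1.0, 0, 0, 0, 0, 0, 0, 0],  # pulse at position 0
    [0, 0, 0, 1.0, 0, 0, 0, 0],  # pulse at position 3
    [0, 0, 0, 0, 0, 1.0, 0, 0],  # pulse at position 5
]
# Build a target from entry 1 at half amplitude, then try to recover it.
target = [0.5 * v for v in synthesize(codebook[1], lp)]
index, gain, err = search_codebook(target, codebook, lp)
```

The search recovers entry 1 with a gain of about 0.5, which are the only values the decoder would need in this toy setting.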
12. Applications
• Mobile VoIP
• Audio and video conferencing
• VoIP services
• Wi-Fi phones (VoWLAN)
• Wireless GPRS/EDGE systems
• Wideband IP telephony
• Transcoding between vocoders
• Fax over IP/Fax Relay
16. Comparison of a few speech coders with respect to their bit rate, quality, and delay
17. G.729 Speech Codec
• Input speech: sampled at 8 kHz with 16 bits per sample (PCM), so 8 kHz × 16 bits = 128 kbps.
• Uses input frames of 10 ms, equal to 80 samples, for analysis.
• Each frame has two subframes of 5 ms (40 samples each).
• Encoding rate: 80 bits per 10 ms = 8 kbps.
• Output speech: the encoded speech is decoded back to 16-bit PCM at 128 kbps.
G.729 Speech codec
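The frame and subframe sizes above translate directly into how a sample stream would be chopped up before encoding. A minimal sketch, assuming the input is a plain list of PCM samples; the helper name `split_into_frames` is hypothetical, and dropping a trailing partial frame is a choice made here for simplicity.

```python
FRAME_SAMPLES = 80      # 10 ms at 8 kHz
SUBFRAME_SAMPLES = 40   # 5 ms at 8 kHz

def split_into_frames(samples):
    """Return a list of (frame, [subframe1, subframe2]) for each complete frame."""
    result = []
    n_frames = len(samples) // FRAME_SAMPLES  # a trailing partial frame is dropped
    for i in range(n_frames):
        frame = samples[i * FRAME_SAMPLES:(i + 1) * FRAME_SAMPLES]
        subframes = [frame[j:j + SUBFRAME_SAMPLES]
                     for j in range(0, FRAME_SAMPLES, SUBFRAME_SAMPLES)]
        result.append((frame, subframes))
    return result

pcm = list(range(200))  # 25 ms of dummy samples
frames = split_into_frames(pcm)
```

Here 200 samples give two complete frames of two subframes each, and the last 40 samples are left over.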
18. Speech Compression in G.729
• Speech in telecommunication applications has frequency content ranging from 300 Hz to 3.4 kHz, so its bandwidth is taken as approximately 4 kHz.
• The Nyquist theorem states that the sampling frequency must be at least twice the bandwidth of the continuous-time signal to avoid aliasing.
• Therefore, the standard sampling frequency (fs) for speech signals is
fs = 2 × 4 kHz = 8 kHz
• Each sample is represented with 16 bits.
• Bit rate = 8 kHz × 16 bits = 128 kbps
• This bit rate is very high, and the speech encoder reduces it to 8 kbps.
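The arithmetic on this slide can be checked in a few lines. The constant names are illustrative; the values are the ones stated above (8 kHz, 16 bits per sample, 80 coded bits per 10 ms frame).

```python
SAMPLE_RATE_HZ = 8000    # fs = 2 x 4 kHz
BITS_PER_SAMPLE = 16
FRAME_MS = 10            # G.729 frame length
BITS_PER_FRAME = 80      # coded bits per frame

pcm_bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE       # bits per second before coding
frames_per_second = 1000 // FRAME_MS
coded_bit_rate = BITS_PER_FRAME * frames_per_second   # bits per second after coding
compression_ratio = pcm_bit_rate / coded_bit_rate
```

This gives 128000 bit/s for the PCM input, 8000 bit/s for the coded stream, and a ratio of 16.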
19. [Figure: G.729 encoder and decoder block diagram]
ENCODER blocks: Input Speech (16-bit PCM signal), Pre-processing, LP Analysis / Quantization / Interpolation, Synthesis Filter, Perceptual Weighting, Pitch Analysis, Adaptive Codebook [14 bits], Fixed Codebook Search, Fixed Codebook [34 bits], Gain Quantization (codebook gain, 14 bits), Parameter Encoding, Encoded Bit-stream.
DECODER blocks: Bit-stream Segmentation, Adaptive Codebook [14 bits], Fixed Codebook [34 bits], gains Gp and Gc, Short-term Filter, Post-processing, Output Speech.
Signals shown: LPC information, adaptive codevector (40 samples), fixed codebook vector (40 samples).
Gp: Adaptive Codebook Gain
Gc: Fixed Codebook Gain
20. How Does G.729 Work ?
• These operations are performed once per frame:
1. Pre-processing: The input signal is scaled down by a factor of 2 and passed through a high-pass (HP) filter.
2. LP (linear prediction) Analysis: Linear prediction is used to model the signal; the LP coefficients are converted to line spectral pair (LSP) coefficients, which are less sensitive to quantization noise.
21. How Does G.729 Work ? (contd.)
3. Quantization: The LSP coefficients are quantized and used throughout the rest of the algorithm.
4. Open-loop Pitch Analysis: Because exact pitch analysis is computationally expensive, this step produces a rough estimate of the pitch, which narrows the range of the later closed-loop search.
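The LP analysis step boils down to solving for predictor coefficients from the frame's autocorrelation, which the Levinson-Durbin recursion does efficiently. The sketch below is a textbook floating-point version, not the fixed-point routine of the codec; the function names are assumptions, and windowing and the LSP conversion are left out.

```python
def autocorrelation(frame, order):
    """r[k] = sum_n frame[n] * frame[n-k] for k = 0..order."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for LP coefficients a[0..order] (a[0] = 1) from autocorrelation r.

    The predictor is x[n] ~ -sum_{k=1..order} a[k] * x[n-k].
    Returns (a, prediction_error_energy).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Autocorrelation of an idealized first-order process: r[k] = 0.9 ** k.
r = [0.9 ** k for k in range(3)]
coeffs, residual_energy = levinson_durbin(r, order=2)
```

For this idealized input the recursion returns a[1] close to -0.9 and a[2] close to 0, meaning one coefficient already captures the signal. G.729 runs the same kind of recursion at order 10.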
22. How Does G.729 Work ? (contd.)
• These operations are performed once per subframe (twice per frame):
1. Adaptive Codebook Search: Determines the exact pitch delay through a closed-loop search.
2. Fixed Codebook Search: Models the remaining excitation (such as the unvoiced component) by searching the codebook for the best codevector.
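The idea behind the adaptive codebook search can be sketched as follows: around the rough open-loop pitch estimate, try each delay, take the segment of past excitation that lies that many samples back, and keep the delay that correlates best with the current subframe. This is a heavily simplified illustration. The function name and search range are assumptions, only integer delays of at least one subframe are tried, and the fractional delays, synthesis filtering, and perceptual weighting of the real codec are omitted.

```python
import math

def adaptive_codebook_search(target, past_excitation, open_loop_delay, search_range=3):
    """Return (delay, gain) maximizing normalized correlation near the open-loop delay."""
    n = len(target)
    history = len(past_excitation)
    best_delay, best_gain, best_score = None, 0.0, float("-inf")
    lo = max(n, open_loop_delay - search_range)   # keep the candidate fully in the past
    hi = min(history, open_loop_delay + search_range)
    for delay in range(lo, hi + 1):
        start = history - delay
        candidate = past_excitation[start:start + n]
        energy = sum(c * c for c in candidate)
        if energy == 0.0:
            continue
        corr = sum(t * c for t, c in zip(target, candidate))
        score = corr / math.sqrt(energy)
        if score > best_score:
            best_delay, best_gain, best_score = delay, corr / energy, score
    return best_delay, best_gain

# Synthetic excitation: one pulse every 57 samples.
signal = [1.0 if n % 57 == 0 else 0.0 for n in range(200)]
past = signal[:160]      # excitation already known to encoder and decoder
target = signal[160:]    # current 40-sample subframe
delay, gain = adaptive_codebook_search(target, past, open_loop_delay=55)
```

Starting from a rough estimate of 55, the search settles on the true period of 57 samples with a gain of 1.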
23. G.729 Based Speech Codec Process Flow
Figure: G.729 based speech codec process (panels: encoder process flow, decoder process flow)
24. Specifications of G.729 Speech Codec:
• Speech coding algorithm: Conjugate-structure algebraic code-excited linear prediction (CS-ACELP);
• Speech sample stream: 16-bit PCM;
• Sampling rate: 8000 samples per second;
• Codec operates on: 10 ms speech frames corresponding to 80 samples;
• Input speech: 128 kbps;
• Compressed bit rate: 8 kbps;
• Compression ratio: 16:1;
• Arithmetic operations: Fixed-point;
• Algorithmic delay: 15 ms (10 ms frame plus 5 ms look-ahead);
• Linear-prediction filter: 10th order;
• Windowing: Asymmetric window formed from half of a Hamming window and a quarter of a cosine cycle;
• Conversion of autocorrelation coefficients to LP coefficients: Levinson-Durbin algorithm;
25. • Adaptive codebook search for pitch delay: Correlation method and interpolating filter;
• Adaptive codebook size: 14 bits per frame (8 bits in 1st subframe, 5 bits in 2nd subframe, plus 1 parity bit);
• Fixed codebook structure: Algebraic codebook structure using an interleaved single-pulse permutation (ISPP) design;
• Fixed codebook search approach: Focused and depth-first tree search;
• Fixed codebook size: 34 bits per frame (17 bits per subframe);
• Excitation parameter determination: Once per 5 ms subframe, in the adaptive and fixed codebook searches;
• Output speech: 128 kbps;
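The per-frame bit counts quoted in this deck can be tallied against the 80-bit frame. The sketch below assumes that whatever is left after the adaptive codebook, fixed codebook, and codebook gain fields is what remains for the quantized LP (spectral) parameters; the dictionary keys are illustrative names.

```python
BITS_PER_FRAME = 80  # 8 kbps at one frame per 10 ms

# Per-frame bit counts as quoted in the slides.
bit_budget = {
    "adaptive_codebook": 14,
    "fixed_codebook": 34,
    "codebook_gains": 14,
}

excitation_bits = sum(bit_budget.values())
# Assumption: the remainder is what is left for the LP parameters.
lp_parameter_bits = BITS_PER_FRAME - excitation_bits
```

Under that assumption, 62 of the 80 bits describe the excitation and 18 bits remain for the LP parameters.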