3. QUANTIZATION (SCALAR
QUANTIZATION)
• Assume $|x[n]| \le A$
• Divide the range $[-A, A]$ into L quantization levels $\{J_1, J_2, \dots, J_k, \dots, J_L\}$ with boundaries $m_0 = -A, m_1, m_2, \dots, m_k, m_{k+1}, \dots, m_{L-1}, m_L = A$
• $J_k : [m_{k-1}, m_k]$
• $L = 2^R$
• Each quantization level $J_k$ is represented by a value $v_k$
• $S = \bigcup_k J_k$, $V = \{v_1, v_2, \dots, v_k, \dots, v_L\}$
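A minimal sketch of a scalar quantizer, assuming equal-width intervals and taking each representative value as the interval midpoint (the slide fixes neither choice). The function name, the default values of `A` and `R`, and the 0-based index `k` are illustrative assumptions; `R` is read here as the number of bits per sample.

```python
def uniform_quantizer(x, A=1.0, R=3):
    """Map x in [-A, A] to (k, v) using L = 2**R equal-width intervals.

    k is a 0-based interval index, so it corresponds to J_(k+1) on the slide.
    """
    L = 2 ** R
    step = 2 * A / L                      # width of every interval
    x = max(-A, min(A, x))                # enforce the assumption |x| <= A
    k = min(int((x + A) / step), L - 1)   # the top boundary belongs to the last interval
    v = -A + (k + 0.5) * step             # representative value: interval midpoint
    return k, v


k, v = uniform_quantizer(0.3, A=1.0, R=3)
print(k, v)
```

With midpoint representatives the reconstruction error never exceeds half an interval width, which is why adding one bit (doubling L) halves the worst-case error.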
6. BINARY ENCODING
• Binary encoding: to represent a finite set of symbols using
binary codewords.
• Fixed length coding: N levels represented by ⌈log2(N)⌉
bits.
• Variable length coding (VLC): more frequently appearing
symbols represented by shorter codewords (e.g. Huffman,
arithmetic, LZW).
• The minimum average number of bits per symbol required to
represent a source losslessly is bounded below by its entropy
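The relationship between fixed-length coding, variable-length coding and entropy can be checked numerically. The sketch below uses a made-up four-symbol source and a hand-picked prefix code (0, 10, 110, 111); both are illustrative assumptions, not taken from the slides.

```python
import math


def fixed_length_bits(n_levels):
    """Bits per symbol for a fixed-length code; equals ceil(log2(n_levels))."""
    return (n_levels - 1).bit_length()


def entropy_bits(probs):
    """Entropy of a memoryless source in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


probs = [0.5, 0.25, 0.125, 0.125]      # hypothetical symbol probabilities
code_lengths = [1, 2, 3, 3]            # lengths of the prefix code 0, 10, 110, 111

fixed = fixed_length_bits(len(probs))
vlc_average = sum(p * n for p, n in zip(probs, code_lengths))
H = entropy_bits(probs)
print(fixed, vlc_average, H)
```

For this particular source the variable-length code reaches the entropy exactly, because every probability is a power of 1/2; in general the average length stays at or above the entropy.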
7. TYPES OF SPEECH CODECS
• Waveform codecs, source codecs (also known as
vocoders), and hybrid codecs.
9. WAVEFORM-BASED CODERS
• Non-predictive coding (uniform or non-uniform): samples
are encoded independently; PCM
• Predictive coding: samples are encoded as difference from
other samples; LPC or Differential PCM (DPCM)
10. PCM (PULSE CODE
MODULATION)
• In PCM each sample of the signal is quantized to one of $2^B$
amplitude levels, where B is the number of bits used to
represent each sample
• The bitrate of the encoded signal is B*F bps, where F
is the sampling frequency
• The quantized waveform is modeled as:
$\tilde{s}(n) = s(n) + q(n)$
where q(n) is the quantization noise
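To make the B*F bitrate and the additive-noise model concrete, here is a small sketch that quantizes a test tone with B bits and measures the resulting signal-to-noise ratio. The quantizer design (uniform, midpoint reconstruction on [-1, 1]) and the test-tone frequency are assumptions chosen for illustration.

```python
import math


def pcm_bitrate(bits_per_sample, sample_rate_hz):
    """Bitrate of PCM in bits per second: B * F."""
    return bits_per_sample * sample_rate_hz


def quantize(s, B):
    """Uniform quantizer on [-1, 1] with 2**B levels (midpoint reconstruction)."""
    levels = 2 ** B
    step = 2.0 / levels
    k = min(max(int(math.floor((s + 1.0) / step)), 0), levels - 1)
    return -1.0 + (k + 0.5) * step


def snr_db(B, n=10000):
    """SNR in dB for a full-scale sine, with q(n) = quantized s(n) minus s(n)."""
    signal = [math.sin(2 * math.pi * 0.01234 * i) for i in range(n)]
    noise = [quantize(s, B) - s for s in signal]
    signal_power = sum(s * s for s in signal) / n
    noise_power = sum(q * q for q in noise) / n
    return 10 * math.log10(signal_power / noise_power)


print(pcm_bitrate(8, 8000))
print(snr_db(8), snr_db(9))
```

Each extra bit halves the step size, so the measured SNR should rise by roughly 6 dB per bit.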
11. PREDICTIVE CODING (LPC
OR DPCM)
• Observation: Adjacent samples are often similar
• Predictive coding:
• Predict the current sample from previous samples, quantize and
code the prediction error, instead of the original sample.
• If the prediction is accurate most of the time, the prediction error is
concentrated near zero and can be coded with fewer bits than the
original signal
• Usually a linear predictor is used (linear predictive coding):
$x_p(n) = \sum_{k=1}^{p} a_k \, x(n-k)$
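The idea of coding the prediction error instead of the sample can be sketched with a first-order predictor (p = 1). The coefficient `a`, the step size and the test signal below are illustrative assumptions; the predictor runs on reconstructed samples so that an encoder and a decoder would stay in step.

```python
import math


def dpcm_encode_decode(x, a=0.9, step=0.05):
    """First-order DPCM: returns the quantized error indices and the reconstruction."""
    codes, recon = [], []
    prev = 0.0                              # previous reconstructed sample
    for sample in x:
        pred = a * prev                     # x_p(n) = a_1 * x(n-1)
        err = sample - pred                 # prediction error
        code = round(err / step)            # quantized error index: what gets coded
        prev = pred + code * step           # reconstruction available to the decoder
        codes.append(code)
        recon.append(prev)
    return codes, recon


x = [math.sin(2 * math.pi * i / 50) for i in range(200)]
codes, recon = dpcm_encode_decode(x)
print(max(abs(c) for c in codes), round(1.0 / 0.05))
```

With the same step size, quantizing the samples directly would need indices up to about 20 in magnitude, while the error indices stay far smaller, so fewer bits are needed for the same accuracy.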
14. SPEECH SOURCE MODEL AND
SOURCE CODING
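Excitation: a random sequence generator (unvoiced) or a periodic pulse train generator (voiced, pitch period N), selected by the v/u switch and scaled by the gain G, produces the excitation signal u[n]
Vocal Tract Model: u[n] drives the filter $G(z) = \dfrac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$ (frequency response G(ω), impulse response g[n]) to produce the speech signal x[n]
Excitation parameters → excitation signal u[n]
• v/u : voiced/unvoiced
• N : pitch for voiced
• G : signal gain
Vocal Tract parameters → formant structure of speech signals
• {a_k} : LPC coefficients
A good approximation, though not precise enough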
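A rough sketch of the source model as code: an excitation generator followed by the all-pole vocal tract filter. Function names, the Gaussian noise source for the unvoiced case and the example coefficients are assumptions for illustration only.

```python
import random


def excitation(n_samples, voiced, N=80, G=1.0, seed=0):
    """u[n]: a pulse every N samples if voiced, scaled random noise if unvoiced."""
    if voiced:
        return [G if n % N == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [G * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def vocal_tract(u, a):
    """All-pole filter 1 / (1 - sum_k a_k z^-k): x[n] = u[n] + sum_k a_k * x[n-k]."""
    x = []
    for n, u_n in enumerate(u):
        acc = u_n
        for k, a_k in enumerate(a, start=1):
            if n >= k:
                acc += a_k * x[n - k]
        x.append(acc)
    return x


u = excitation(8, voiced=True, N=4)
x = vocal_tract(u, a=[0.5])      # hypothetical single LPC coefficient
print(x[:5])
```

Only the switch, N, G and the coefficients {a_k} need to be known to regenerate a signal of this form, which is what makes the model attractive for low-bit-rate coding.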
15. LPC VOCODER(VOICE CODER)
Encoder: x[n] → LPC Analysis → {a_k}, N, G, v/u → Encoder → …11011…
• N by pitch detection
• v/u by voicing detection
Decoder (receiver): …11011… → Decoder → {a_k}, N, G, v/u → excitation (Ex) → G(z), g[n] → x[n]
• {a_k} can be non-uniform or vector quantized to reduce bit rate further
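The slides do not say how the LPC analysis block obtains {a_k}. One common approach, shown here purely as an assumed example, is the autocorrelation method solved with the Levinson-Durbin recursion; the sign convention matches the predictor $x_p(n) = \sum_k a_k x(n-k)$.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]


def levinson_durbin(r, P):
    """Solve for a_1..a_P from autocorrelation r; also returns the residual energy."""
    a = [0.0] * (P + 1)              # a[0] is unused
    E = r[0]
    for i in range(1, P + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / E                  # reflection coefficient for order i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E *= (1.0 - k * k)           # prediction error energy shrinks at each order
    return a[1:], E


# Autocorrelation of an ideal first-order process with coefficient 0.5 (made-up values).
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], P=2)
print(coeffs, residual)
```

For this input the recursion recovers a_1 = 0.5 and a_2 = 0, as expected for a first-order process; a gain term could then be derived from the residual energy.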
16. SPEECH CODING
CHARACTERISTICS
• Speech coders are lossy coders, i.e. the decoded signal is
different from the original
• The goal in speech coding is to minimize the distortion at a given
bit rate, or minimize the bit rate to reach a given distortion
• Metrics in speech coding:
• Objective measure of distortion is SNR (Signal to noise ratio);
SNR does not correlate well with perceived speech quality
• Subjective measure - MOS (mean opinion score):
• 5: excellent
• 4: good
• 3: fair
• 2: poor
• 1: bad
17. G.711
• The most commonplace codec
• Used in circuit-switched telephone network
• PCM, Pulse-Code Modulation
• If uniform quantization
• 12 bits * 8 k/sec = 96 kbps
• Non-uniform quantization
• 8 bits * 8 k/sec = 64 kbps, the DS0 rate
• µ-law
• North America
• A-law
• Other countries, a little friendlier to
lower signal levels
• An MOS of about 4.3
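Non-uniform quantization can be pictured as compressing the amplitude, quantizing uniformly, then expanding. The sketch below uses the continuous µ-law curve with µ = 255; G.711 itself specifies a segmented approximation and a particular bit layout that this sketch does not reproduce, and the 8-bit helper functions are illustrative assumptions.

```python
import math

MU = 255


def mu_law_compress(x, mu=MU):
    """Continuous mu-law curve for x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode_8bit(x):
    """Compress, then quantize uniformly to one of 256 codes."""
    y = mu_law_compress(max(-1.0, min(1.0, x)))
    return min(int((y + 1.0) / 2.0 * 256), 255)


def decode_8bit(code):
    """Reconstruct from the midpoint of the code's interval, then expand."""
    y = (code + 0.5) / 256 * 2.0 - 1.0
    return mu_law_expand(y)


quiet = 0.01
print(decode_8bit(encode_8bit(quiet)))
print(12 * 8000, 8 * 8000)       # uniform vs companded bit rate in bps
```

A uniform 8-bit quantizer over the same range has a step of about 0.0078, comparable to the quiet sample itself, whereas the companded version reconstructs it to within a few percent.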
18. ADPCM(ADAPTIVE
DIFFERENTIAL PCM)
• DPCM and ADPCM.
• ADPCM: adaptive prediction in DPCM plus adaptive quantization
• Adaptive Quantization
• Quantization step size ∆ varies with local signal level
• $\Delta[n] = a\,\sigma_x[n]$
• $\sigma_x[n]$ : locally estimated standard deviation of x[n]
• G.721: ADPCM-coded speech at 32 kbps.
• G.726 (A-law or µ-law)
• 16, 24, 32, 40 kbps
• MOS 4.0 at 32 kbps
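A sketch of adaptive quantization along the lines of ∆[n] = a·σx[n]. It assumes a zero-mean signal (so the local RMS stands in for the standard deviation), estimates σ from already-reconstructed samples so that a decoder could track the step size without side information, and leaves the code range unbounded; none of these choices come from the slides or from G.721/G.726.

```python
import math


def adaptive_quantize(x, a=0.5, window=16, min_sigma=1e-3):
    """Quantize x[n] with a step proportional to the local signal level."""
    codes, recon, steps = [], [], []
    for n in range(len(x)):
        past = recon[max(0, n - window):n]             # decoded samples only
        if past:
            sigma = math.sqrt(sum(v * v for v in past) / len(past))
        else:
            sigma = min_sigma
        delta = a * max(sigma, min_sigma)              # Delta[n] = a * sigma_x[n]
        code = round(x[n] / delta)
        codes.append(code)
        recon.append(code * delta)
        steps.append(delta)
    return codes, recon, steps


quiet = [0.01 * math.sin(0.3 * i) for i in range(100)]
loud = [math.sin(0.3 * i) for i in range(100)]
signal = quiet + loud
codes, recon, steps = adaptive_quantize(signal)
print(steps[50], steps[150])
```

The step size in the loud half comes out much larger than in the quiet half, so the relative accuracy stays similar across signal levels.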
19. ANALYSIS-BY-SYNTHESIS (ABS)
CODECS
• Hybrid codec
• Fill the gap between waveform and source codecs
• The most successful and commonly used
• Time-domain AbS codecs
• Not a simple two-state, voiced/unvoiced
• Different excitation signals are attempted
• Closest to the original waveform is selected
• MPE, Multi-Pulse Excited
• RPE, Regular-Pulse Excited
• CELP, Code-Excited Linear Predictive
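The analysis-by-synthesis loop can be sketched as: run each candidate excitation through the synthesis filter and keep the one whose output is closest to the original. The tiny codebook, the single filter coefficient and the plain squared-error measure are assumptions for illustration; real codecs use perceptually weighted errors and far larger codebooks.

```python
def synthesize(excitation, a):
    """All-pole synthesis: x[n] = u[n] + sum_k a_k * x[n-k]."""
    x = []
    for n, u_n in enumerate(excitation):
        acc = u_n
        for k, a_k in enumerate(a, start=1):
            if n >= k:
                acc += a_k * x[n - k]
        x.append(acc)
    return x


def abs_search(target, codebook, a):
    """Return (index, error) of the excitation whose synthesis is closest to target."""
    best_index, best_error = -1, float("inf")
    for index, candidate in enumerate(codebook):
        synth = synthesize(candidate, a)
        error = sum((t - s) ** 2 for t, s in zip(target, synth))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


codebook = [                               # hypothetical excitation vectors
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0, 0.0],
]
a = [0.5]
target = synthesize([0.9, 0.0, -1.1, 0.0, 0.0], a)   # stands in for the original speech
index, error = abs_search(target, codebook, a)
print(index, error)
```

Only the winning index needs to be sent, since the decoder holds the same codebook and filter.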
20. G.728 LD-CELP
• CELP codecs
• A filter; its characteristics change over time
• A codebook of acoustic vectors
• A vector = a set of elements representing various characteristics
of the excitation
• Transmit
• Filter coefficients, gain, a pointer to the vector chosen
• Low Delay CELP
• Backward-adaptive coder
• Use previous samples to determine filter coefficients
• Operates on five samples at a time
• Delay < 1 ms
• Only the pointer is transmitted
21. LD-CELP encoder
1024 vectors in the codebook
10-bit pointer (index)
16 kbps
Minimize a frequency-weighted mean-square error
22. LD-CELP decoder
An MOS score of about 3.9
One-quarter of G.711 bandwidth
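The LD-CELP figures fit together arithmetically, as the following check shows. It assumes the same 8 kHz sampling rate used for G.711 earlier in the deck; the constant names are illustrative.

```python
SAMPLE_RATE_HZ = 8000        # assumed: same sampling rate as G.711
VECTOR_SAMPLES = 5           # samples coded per codebook index
CODEBOOK_SIZE = 1024

index_bits = (CODEBOOK_SIZE - 1).bit_length()                   # bits per pointer
vectors_per_second = SAMPLE_RATE_HZ // VECTOR_SAMPLES
bit_rate_bps = index_bits * vectors_per_second
buffering_delay_ms = 1000 * VECTOR_SAMPLES / SAMPLE_RATE_HZ     # time to collect one vector
g711_rate_bps = 8 * SAMPLE_RATE_HZ

print(index_bits, bit_rate_bps, buffering_delay_ms, g711_rate_bps / bit_rate_bps)
```

Ten bits for every five samples gives 16 kbps, a quarter of the 64 kbps G.711 rate, and collecting five samples takes well under 1 ms.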
23. G.723.1 ACELP
6.3 or 5.3 kbps
Both mandatory
Can change from one to another during a conversation
The coder
A band-limited input speech signal
Sampled at 8 kHz, 16-bit uniform PCM quantization
Operates on blocks of 240 samples at a time
A look-ahead of 7.5 ms
A total algorithmic delay of 37.5 ms + other delays
A high-pass filter to remove any DC component
24. G.723.1 Annex A
Silence Insertion Descriptor (SID) frames of size four octets
The two LSBs of the first octet give the frame type:
LSBs  Frame type  Octets/frame
00    6.3 kbps    24
01    5.3 kbps    20
10    SID frame   4
An MOS of about 3.8
At least 37.5 ms delay
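A small sketch of how a receiver might read the frame type from the first octet, together with the frame-duration and delay arithmetic. The function and constant names are hypothetical; the value 11 is not described in the slides, so it is mapped to `None` here.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 240
LOOKAHEAD_MS = 7.5

# Two least significant bits of the first octet -> (frame type, octets per frame)
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),
    0b01: ("5.3 kbps", 20),
    0b10: ("SID frame", 4),
}


def frame_info(first_octet):
    """Return (frame type, frame size in octets), or None for an undescribed value."""
    return FRAME_TYPES.get(first_octet & 0b11)


frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE_HZ       # duration of one frame
algorithmic_delay_ms = frame_ms + LOOKAHEAD_MS


def payload_kbps(octets_per_frame):
    """Bit rate implied by sending this many octets once per frame."""
    return octets_per_frame * 8 / frame_ms


print(frame_ms, algorithmic_delay_ms)
print(frame_info(0x01), payload_kbps(24), payload_kbps(20))
```

The octet-aligned frame sizes work out slightly above the nominal 6.3 and 5.3 kbps, which is consistent with frames being padded to whole octets.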
25. G.729
8 kbps
Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
5 ms look-ahead
Algorithmic delay of 15 ms
An 80-bit frame for 10 ms of speech
A complex codec
G.729A (Annex A), a number of simplifications
Same frame structure
Encoder and decoder can be mixed: G.729 at one end, G.729A at the other
Slightly lower quality
26. G.729.B
VAD, Voice Activity Detection
Based on analysis of several parameters of the input
The current frame plus the two preceding frames
DTX, Discontinuous Transmission
Send nothing or send an SID frame
SID frame contains information to generate comfort noise
CNG, Comfort Noise Generation
G.729, an MOS of about 4.0
G.729A an MOS of about 3.7
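The interplay of VAD and DTX can be sketched as a per-frame decision: send speech, send one SID frame when silence begins, otherwise send nothing. The detector below is a plain energy threshold on the current frame only, which is a deliberately simplified stand-in and not the multi-parameter VAD of G.729 Annex B; the threshold and names are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def dtx_decisions(frames, threshold=1e-4):
    """Return 'SPEECH', 'SID' or 'NONE' for each frame."""
    decisions = []
    prev_active = True
    for frame in frames:
        active = frame_energy(frame) > threshold      # toy voice activity detector
        if active:
            decisions.append("SPEECH")                # normal coded speech frame
        elif prev_active:
            decisions.append("SID")                   # first silent frame: describe the noise
        else:
            decisions.append("NONE")                  # silence continues: send nothing
        prev_active = active
    return decisions


frames = [[0.5] * 80, [0.0] * 80, [0.0] * 80, [0.5] * 80]   # 80 samples = 10 ms at 8 kHz
print(dtx_decisions(frames))
```

At the receiver, comfort noise generation would fill the frames marked `NONE` using the parameters carried in the last SID frame.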
27. Other Codecs
CDMA QCELP defined in IS-733
Variable-rate coder
Two most common rates
The high rate, 13.3 kbps
A lower rate, 6.2 kbps
Silence suppression
For use with RTP, RFC 2658
28. GSM Enhanced Full-Rate (EFR)
GSM 06.60
An enhanced version of GSM Full-Rate
ACELP-based codec
The same bit rate and the same overall
packing structure
12.2 kbps
Support discontinuous transmission
For use with RTP, RFC 1890
29. GSM Adaptive Multi-Rate (AMR) codec
GSM 06.90
Eight different modes
4.75 kbps to 12.2 kbps
12.2 kbps, GSM EFR
7.4 kbps, IS-641 (TDMA cellular systems)
Change the mode at any time
Offer discontinuous transmission
The coding choice of many 3G wireless
networks
30. The MOS values are for laboratory
conditions
G.711 does not deal with lost packets
G.729 can accommodate a lost frame by
interpolating from previous frames
But this causes errors in subsequent speech frames
Processing Power
G.728 or G.729, 40 MIPS
G.726 10 MIPS
31. Cascaded Codecs
E.g., G.711 stream -> G.729
encoder/decoder
The resulting quality might not even come close to G.729
Each coder only generates an
approximation of the incoming signal
32. Tones, Signal, and DTMF
Digits
The hybrid codecs are optimized for human
speech
Other data may need to be transmitted
Tones: fax tones, dialing tone, busy tone
DTMF digits for two-stage dialing or voice-mail
G.711 is OK
With G.723.1 and G.729 the decoded tones can be unintelligible
The ingress gateway needs to intercept
the tones and DTMF digits
Use an external signaling system
33.
Easy at the start of a call
Difficult in the middle of a call
Encode the tones differently from the speech
Send them along the same media path
An RTP packet provides the name of the tone and the
duration
Or, a dynamic RTP profile; an RTP packet containing the
frequency, volume and the duration
RFC 2198
An RTP payload format for redundant audio data
Sending both types of RTP payload
34. RTP Payload Format for DTMF Digits
An Internet Draft
Both methods described before
A large number of tones and events
DTMF digits, a busy tone, a congestion tone, a
ringing tone, etc.
The named events
E: the end of the tone, R: reserved
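A sketch of packing and parsing a named-event payload. The slide names only the E and R bits; the 4-octet layout assumed here (8-bit event, E bit, R bit, 6-bit volume, 16-bit duration in timestamp units) follows the format as it was later published in RFC 2833, and the function names are hypothetical.

```python
import struct


def pack_event(event, end, volume, duration):
    """Build a 4-octet named-event payload; the R bit is left at 0."""
    flags = (0x80 if end else 0x00) | (volume & 0x3F)
    return struct.pack("!BBH", event, flags, duration)


def unpack_event(payload):
    """Parse the first 4 octets of a named-event payload."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return {
        "event": event,                    # e.g. 0-9 for the DTMF digits 0-9
        "end": bool(flags & 0x80),         # E: the end of the tone
        "reserved": bool(flags & 0x40),    # R: reserved
        "volume": flags & 0x3F,
        "duration": duration,
    }


payload = pack_event(event=5, end=True, volume=10, duration=800)
print(payload.hex(), unpack_event(payload))
```

Sending the event by name rather than as coded audio is what lets the far-end gateway regenerate a clean tone regardless of the speech codec in use.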