Speech coding techniques

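$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$
[Figure: spectrum X_c(jΩ) of the band-limited signal, occupying −Ω_N to Ω_N, and the spectrum after sampling, with copies repeated at multiples of Ω_S; the lower edge of the first copy is marked at Ω_S − Ω_N]
• Nyquist sampling theorem: the copies do not overlap when Ω_S − Ω_N > Ω_N, i.e. Ω_S > 2Ω_N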
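What happens when the Nyquist condition is violated can be checked numerically. The sketch below assumes an 8 kHz sampling rate and two illustrative tone frequencies (neither is taken from the slides) and shows that a tone above half the sampling rate yields exactly the same samples as a lower-frequency tone.

```python
import numpy as np

fs = 8000.0              # assumed sampling rate in Hz (T = 1/fs)
n = np.arange(64)        # sample indices
t = n / fs               # sampling instants nT

f_low = 1000.0           # below fs/2, so the Nyquist condition holds
f_high = f_low + fs      # 9000 Hz, well above fs/2

x_low = np.cos(2 * np.pi * f_low * t)
x_high = np.cos(2 * np.pi * f_high * t)

# The two tones are indistinguishable once sampled: the 9 kHz tone aliases to 1 kHz.
aliased = np.allclose(x_low, x_high)
print(aliased)
```

This is why the input is band-limited to below half the sampling rate before it is sampled.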

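QUANTIZATION (SCALAR QUANTIZATION)
[Figure: the range [−A, A] split at the boundaries m_0 = −A, m_1, m_2, …, m_k, m_{k+1}, …, m_{L−1}, m_L = A into intervals J_k, J_{k+1}, …, each with a representative value v_1, v_2, …, v_{k+1}, …, v_L]
• Assume |x[n]| ≤ A
• Divide the range [−A, A] into L quantization levels {J_1, J_2, …, J_k, …, J_L}
• J_k : [m_{k−1}, m_k]
• L = 2^R
• Each quantization level J_k is represented by a value v_k
• S = ∪ J_k, V = {v_1, v_2, …, v_k, …, v_L}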
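The scheme above can be sketched as a uniform quantizer. The equal-width intervals and the choice of the interval midpoint as v_k are assumptions made for illustration; the slide does not fix either, and the function name is hypothetical.

```python
import numpy as np

def uniform_quantize(x, A=1.0, R=3):
    """Map samples in [-A, A] to one of L = 2**R intervals; return (k, v_k)."""
    L = 2 ** R
    step = 2 * A / L                                  # width of each interval J_k
    x = np.clip(np.asarray(x, dtype=float), -A, A)    # enforce |x[n]| <= A
    k = np.minimum((x + A) // step, L - 1).astype(int)  # 0-based interval index
    v = -A + (k + 0.5) * step                         # assumed: midpoint as v_k
    return k, v

k, v = uniform_quantize([-1.0, -0.3, 0.0, 0.3, 1.0], A=1.0, R=3)
print(k)   # interval indices, each representable with R = 3 bits
print(v)   # reconstructed values
```

With midpoint reconstruction the error never exceeds half an interval width.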

COMPANDING
[Figure: x[n] → F(x) (Compressor) → Uniform Quantization → …1101… → Uniform Decoder → F^{-1}(x) (Expandor) → x̂[n]]
• Compressor + Expandor → Compandor
• F(x) specifies the non-uniform quantization characteristics
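As one concrete choice of F(x), the sketch below uses the μ-law curve with μ = 255. That choice is an assumption for illustration, since the slide leaves F(x) unspecified, and the function names are hypothetical. The signal is compressed, quantized uniformly, and expanded again.

```python
import numpy as np

MU = 255.0  # assumed value, as commonly used for mu-law telephony

def compress(x, mu=MU):
    """F(x): logarithmic compressor for x in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def expand(y, mu=MU):
    """F^{-1}(y): inverse of compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def compand_quantize(x, bits=8):
    """Compress, quantize uniformly on [-1, 1], decode, then expand."""
    levels = 2 ** bits
    step = 2.0 / levels
    y = compress(np.clip(x, -1.0, 1.0))
    idx = np.minimum(np.floor((y + 1.0) / step), levels - 1)
    y_hat = -1.0 + (idx + 0.5) * step        # uniform decoder (midpoints)
    return expand(y_hat)

x = np.array([-0.5, -0.01, 0.001, 0.2, 0.9])
x_hat = compand_quantize(x)
print(np.abs(x_hat - x))   # error is small in absolute terms for small samples
```

The effective step size grows with amplitude, so quiet samples are quantized more finely than loud ones.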

BINARY ENCODING
• Binary encoding: to represent a finite set of symbols using
binary codewords.
• Fixed length coding: N levels represented by ⌈log2(N)⌉ bits.
• Variable length coding (VLC): more frequently appearing
symbols represented by shorter codewords (Huffman,
arithmetic, Lempel-Ziv coding as used in zip).
• The minimum average number of bits per symbol needed to
represent a source losslessly is bounded below by its entropy
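The gap between fixed-length coding and the entropy bound can be made concrete with a small calculation. The symbol sequence below is a hypothetical source chosen so the numbers come out exactly.

```python
import math
from collections import Counter

def fixed_length_bits(num_levels):
    """Bits per symbol for a fixed-length code over num_levels symbols."""
    return math.ceil(math.log2(num_levels))

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

symbols = list("aaaabbcd")            # hypothetical source: p = 1/2, 1/4, 1/8, 1/8
fixed = fixed_length_bits(len(set(symbols)))
bound = entropy_bits(symbols)
print(fixed, bound)                   # 2 bits fixed-length vs. 1.75 bits entropy
```

A variable-length code with codeword lengths 1, 2, 3, 3 would reach the 1.75-bit average for this particular source.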

TYPES OF SPEECH CODECS
• Waveform codecs, source codecs (also known as
vocoders), and hybrid codecs.

WAVEFORM-BASED CODERS
• Non-predictive coding (uniform or non-uniform): samples
are encoded independently; PCM
• Predictive coding: samples are encoded as differences from
other samples; LPC or Differential PCM (DPCM)

PCM (PULSE CODE MODULATION)
• In PCM each sample of the signal is quantized to one of 2^B
amplitude levels, where B is the number of bits used to
represent each sample
• The bit rate of the encoded signal is B·F bps, where F
is the sampling frequency
• The quantized waveform is modeled as:
$$\tilde{s}(n) = s(n) + q(n)$$
where q(n) is the quantization noise
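The bit-rate formula and the additive noise model can both be checked with a short experiment. The sketch assumes a full-scale, uniformly distributed test signal and a mid-rise uniform quantizer, neither of which comes from the slide. For that signal the measured signal-to-quantization-noise ratio should land near the familiar 6.02·B dB figure.

```python
import numpy as np

def pcm_bit_rate(bits_per_sample, sample_rate_hz):
    """Bit rate B * F in bits per second."""
    return bits_per_sample * sample_rate_hz

def pcm_round_trip(s, B):
    """Quantize s in [-1, 1) to 2**B uniform levels and reconstruct."""
    step = 2.0 / 2 ** B
    codes = np.clip(np.floor(s / step), -2 ** (B - 1), 2 ** (B - 1) - 1)
    return (codes + 0.5) * step

rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0, 100_000)    # assumed full-scale test signal
s_tilde = pcm_round_trip(s, B=8)
q = s_tilde - s                        # quantization noise q(n)
sqnr_db = 10 * np.log10(np.mean(s ** 2) / np.mean(q ** 2))

print(pcm_bit_rate(8, 8000))           # 8 bits at 8 kHz
print(round(sqnr_db, 1))
```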

PREDICTIVE CODING (LPC OR DPCM)
• Observation: Adjacent samples are often similar
• Predictive coding:
• Predict the current sample from previous samples, then quantize and
code the prediction error instead of the original sample.
• If the prediction is accurate most of the time, the prediction error is
concentrated near zero and can be coded with fewer bits than the
original signal
• Usually a linear predictor is used (linear predictive coding):
$$x_p(n) = \sum_{k=1}^{p} a_k\, x(n-k)$$
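A minimal closed-loop DPCM sketch built on the linear predictor above follows. The predictor coefficient, the step size and the test signal are assumptions for illustration. Predicting from reconstructed samples, rather than from the original ones, is a common design choice that keeps encoder and decoder in step.

```python
import numpy as np

def dpcm_encode_decode(x, a, step):
    """Closed-loop DPCM: quantize the error of a linear prediction."""
    p = len(a)
    x_hat = np.zeros(len(x) + p)          # reconstruction, with p zeros of history
    codes = np.zeros(len(x), dtype=int)
    for n in range(len(x)):
        past = x_hat[n:n + p][::-1]       # x_hat(n-1), ..., x_hat(n-p)
        x_p = np.dot(a, past)             # x_p(n) = sum_k a_k * x_hat(n-k)
        e = x[n] - x_p                    # prediction error
        codes[n] = int(np.round(e / step))
        x_hat[n + p] = x_p + codes[n] * step
    return codes, x_hat[p:]

n = np.arange(200)
x = np.sin(2 * np.pi * 0.01 * n)          # assumed slowly varying test signal
codes, x_rec = dpcm_encode_decode(x, a=[0.95], step=0.05)

print(np.max(np.abs(codes)))              # error codes span only a few levels
print(np.max(np.abs(x - x_rec)))          # reconstruction error stays within step/2
```

The transmitted codes cover a much smaller range than the signal itself, which is where the bit saving comes from.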

SPEECH SOURCE MODEL AND SOURCE CODING
[Figure: a periodic pulse train generator (voiced, period N) or a random sequence generator (unvoiced) is selected by the v/u switch and multiplied by the gain G to give the excitation u[n]; u[n] drives the vocal tract model G(z), G(ω), g[n], which outputs x[n]]
$$G(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
• Excitation parameters → excitation signal u[n]
• v/u : voiced/unvoiced
• N : pitch for voiced
• G : signal gain
• Vocal tract parameters → formant structure of speech signals
• {a_k} : LPC coefficients
• A good approximation, though not precise enough
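The source model can be sketched directly in code: build an excitation u[n] from the v/u flag, pitch period N and gain G, then pass it through the all-pole vocal tract filter. The coefficient values, pitch period and gains below are illustrative assumptions, not values from the slides.

```python
import numpy as np

def excitation(num_samples, voiced, pitch_period, gain, rng):
    """u[n]: gain-scaled pulse train (voiced) or random sequence (unvoiced)."""
    if voiced:
        u = np.zeros(num_samples)
        u[::pitch_period] = 1.0                 # one pulse every N samples
    else:
        u = rng.standard_normal(num_samples)
    return gain * u

def all_pole_filter(u, a):
    """x[n] = u[n] + sum_{k=1}^{p} a_k x[n-k], i.e. G(z) = 1 / (1 - sum a_k z^-k)."""
    x = np.zeros(len(u))
    for n in range(len(u)):
        acc = u[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc += a[k - 1] * x[n - k]
        x[n] = acc
    return x

rng = np.random.default_rng(0)
a = [1.3, -0.8]                                 # assumed stable 2nd-order vocal tract
voiced = all_pole_filter(excitation(400, True, 80, 1.0, rng), a)
unvoiced = all_pole_filter(excitation(400, False, 80, 0.1, rng), a)
print(voiced[:5], unvoiced[:5])
```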

LPC VOCODER (VOICE CODER)
[Figure: transmitter: x[n] → LPC Analysis → {a_k}, N, G, v/u → Encoder → …11011…; receiver: …11011… → Decoder → {a_k}, N, G, v/u → excitation (Ex) → G(z), g[n] → x[n]]
• N by pitch detection
• v/u by voicing detection
• {a_k} can be non-uniform or vector quantized to reduce bit rate further
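The LPC analysis step can be sketched with the autocorrelation method, which solves the normal equations R a = r for the coefficients {a_k}. This is one standard way to do it. The slides do not say which method is used, and the test signal with its "true" coefficients is an assumption for illustration.

```python
import numpy as np

def lpc_analysis(frame, order):
    """Estimate a_1..a_p by solving the autocorrelation normal equations."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    # Autocorrelation values r[0..p]
    r = np.array([np.dot(frame[:N - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz matrix R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Synthesize a test signal from a known all-pole model, then recover its coefficients.
rng = np.random.default_rng(1)
true_a = np.array([1.3, -0.8])                  # assumed model coefficients
u = rng.standard_normal(20_000)
x = np.zeros_like(u)
for n in range(len(u)):
    x[n] = u[n] + sum(true_a[k] * x[n - 1 - k] for k in range(2) if n - 1 - k >= 0)

est_a = lpc_analysis(x, order=2)
print(est_a)                                    # close to true_a
```

Production coders typically window each frame and use the Levinson-Durbin recursion instead of a general linear solve; the direct solve is used here only to keep the sketch short.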

SPEECH CODING CHARACTERISTICS
• Speech coders are lossy coders, i.e. the decoded signal is
different from the original
• The goal in speech coding is to minimize the distortion at a given
bit rate, or minimize the bit rate to reach a given distortion
• Metrics in speech coding:
• Objective measure of distortion is SNR (Signal to noise ratio);
SNR does not correlate well with perceived speech quality
• Subjective measure - MOS (mean opinion score):
• 5: excellent
• 4: good
• 3: fair
• 2: poor
• 1: bad
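Both metrics above reduce to simple computations. The sketch below uses made-up signal values and listener ratings purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def snr_db(original, decoded):
    """Signal-to-noise ratio in dB between an original and a decoded signal."""
    original = np.asarray(original, dtype=float)
    noise = original - np.asarray(decoded, dtype=float)
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(noise ** 2))

def mean_opinion_score(ratings):
    """MOS: the average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    return float(np.mean(ratings))

original = [1.0, -1.0, 1.0, -1.0]          # made-up samples
decoded = [0.9, -0.9, 0.9, -0.9]           # made-up decoder output
print(snr_db(original, decoded))           # about 20 dB
print(mean_opinion_score([5, 4, 4, 3, 4])) # made-up ratings
```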

G.711
• The most commonplace codec
• Used in the circuit-switched telephone network
• PCM, Pulse-Code Modulation
  • With uniform quantization
    • 12 bits × 8 k samples/sec = 96 kbps
  • With non-uniform quantization
    • 8 bits × 8 k samples/sec = 64 kbps, the DS0 rate
    • μ-law
      • North America
    • A-law
      • Other countries; a little friendlier to lower signal levels
• An MOS of about 4.3

ADPCM (ADAPTIVE DIFFERENTIAL PCM)
• DPCM and ADPCM
• ADPCM: adaptive prediction in DPCM, combined with adaptive quantization
• Adaptive quantization
  • The quantization step size Δ varies with the local signal level
  • Δ[n] = a σ_x[n]
  • σ_x[n]: locally estimated standard deviation of x[n]
• G.721: ADPCM-coded speech at 32 kbps
• G.726 (A-law or μ-law)
  • 16, 24, 32, 40 kbps
  • MOS of 4.0 at 32 kbps
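The adaptive quantization rule Δ[n] = a σ_x[n] can be sketched as follows. The window length, the scale factor a and the lower bound on Δ are assumptions. The standard deviation is estimated from the input itself for simplicity, whereas a backward-adaptive coder would estimate it from already decoded samples so that the decoder can track it.

```python
import numpy as np

def adaptive_step_sizes(x, a=0.5, window=32, floor=1e-4):
    """Delta[n] = a * sigma_x[n], with sigma estimated over the previous samples."""
    x = np.asarray(x, dtype=float)
    deltas = np.empty(len(x))
    for n in range(len(x)):
        past = x[max(0, n - window):n]
        sigma = np.std(past) if len(past) > 1 else 0.0
        deltas[n] = max(a * sigma, floor)     # assumed floor avoids a zero step
    return deltas

# Quiet passage followed by a loud one (made-up test signal).
t = np.arange(200)
x = np.concatenate([0.01 * np.sin(0.3 * t), 1.0 * np.sin(0.3 * t)])
deltas = adaptive_step_sizes(x)
print(deltas[100], deltas[300])               # small step when quiet, large when loud
```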

ANALYSIS-BY-SYNTHESIS (AbS) CODECS
• Hybrid codecs
  • Fill the gap between waveform and source codecs
  • The most successful and commonly used
• Time-domain AbS codecs
  • Not a simple two-state (voiced/unvoiced) excitation model
  • Different excitation signals are attempted
  • The one whose output is closest to the original waveform is selected
  • MPE, Multi-Pulse Excited
  • RPE, Regular-Pulse Excited
  • CELP, Code-Excited Linear Predictive
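The analysis-by-synthesis idea, trying each candidate excitation and keeping the one whose synthesized output is closest to the original, can be sketched as an exhaustive codebook search. The random codebook, filter coefficients and plain (unweighted) squared-error criterion are assumptions for illustration; real coders use a perceptually weighted error.

```python
import numpy as np

def synthesize(excitation, a):
    """All-pole synthesis: x[n] = u[n] + sum_k a_k x[n-k]."""
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        feedback = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x[n] = excitation[n] + feedback
    return x

def abs_search(target, codebook, a):
    """Return (index, gain, error) of the excitation that best matches target."""
    best_index, best_gain, best_err = -1, 0.0, np.inf
    for i, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = np.dot(y, y)
        gain = np.dot(target, y) / energy if energy > 0 else 0.0  # least-squares gain
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_index, best_gain, best_err = i, gain, err
    return best_index, best_gain, best_err

rng = np.random.default_rng(2)
a = [1.3, -0.8]                                    # assumed synthesis filter
codebook = rng.standard_normal((16, 20))           # assumed: 16 vectors of 20 samples
target = 2.5 * synthesize(codebook[7], a)          # a target the codebook can match
index, gain, err = abs_search(target, codebook, a)
print(index, round(gain, 3), err)
```

Only the winning index and gain need to be sent; the decoder holds the same codebook and repeats the synthesis.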

G.728 LD-CELP
• CELP codecs
  • A filter; its characteristics change over time
  • A codebook of acoustic vectors
    • A vector = a set of elements representing various characteristics of the excitation
  • Transmit
    • Filter coefficients, gain, a pointer to the vector chosen
• Low-Delay CELP
  • Backward-adaptive coder
    • Uses previous samples to determine filter coefficients
    • Operates on five samples at a time
    • Delay < 1 ms
  • Only the pointer is transmitted
    • 1024 vectors in the codebook
    • 10-bit pointer (index)
    • 16 kbps
• LD-CELP encoder
  • Minimizes a frequency-weighted mean-square error
• LD-CELP decoder
• An MOS score of about 3.9
• One-quarter of G.711 bandwidth
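The G.728 figures quoted above are mutually consistent, as a short calculation shows. The 8 kHz sampling rate is assumed (standard for narrowband telephony); the other constants are taken from the slide.

```python
import math

SAMPLE_RATE_HZ = 8000        # assumed narrowband sampling rate
SAMPLES_PER_VECTOR = 5       # from the slide
CODEBOOK_SIZE = 1024         # from the slide
G711_BIT_RATE_BPS = 64000    # 8 bits x 8000 samples/sec

index_bits = int(math.log2(CODEBOOK_SIZE))                 # bits per codebook pointer
vectors_per_second = SAMPLE_RATE_HZ / SAMPLES_PER_VECTOR
bit_rate_bps = index_bits * vectors_per_second
vector_duration_ms = 1000 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ

print(index_bits, bit_rate_bps, vector_duration_ms)
print(bit_rate_bps / G711_BIT_RATE_BPS)                    # fraction of G.711 bandwidth
```

Each five-sample vector lasts 0.625 ms, which is what keeps the algorithmic delay below 1 ms.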

G.723.1 ACELP
• 6.3 or 5.3 kbps
  • Both mandatory
  • Can change from one to the other during a conversation
• The coder
  • A band-limited input speech signal
  • Sampled at 8 kHz, 16-bit uniform PCM quantization
  • Operates on blocks of 240 samples at a time
  • A look-ahead of 7.5 ms
  • A total algorithmic delay of 37.5 ms, plus other delays
  • A high-pass filter to remove any DC component
• G.723.1 Annex A
  • Silence Insertion Descriptor (SID) frames of size four octets
  • The two LSBs of the first octet indicate the frame type:

    Bits   Frame type   Octets/frame
    00     6.3 kbps     24
    01     5.3 kbps     20
    10     SID frame    4

• An MOS of about 3.8
• At least 37.5 ms delay

G.729
• 8 kbps
• Input frames of 10 ms, 80 samples at an 8 kHz sampling rate
• 5 ms look-ahead
• Algorithmic delay of 15 ms
• An 80-bit frame for 10 ms of speech
• A complex codec
• G.729A (Annex A), a number of simplifications
  • Same frame structure
  • A G.729 encoder interoperates with a G.729A decoder, and vice versa
  • Slightly lower quality
• G.729B
  • VAD, Voice Activity Detection
    • Based on analysis of several parameters of the input
    • The current frame plus the two preceding frames
  • DTX, Discontinuous Transmission
    • Send nothing or send an SID frame
    • The SID frame contains information to generate comfort noise
  • CNG, Comfort Noise Generation
• G.729, an MOS of about 4.0
• G.729A, an MOS of about 3.7
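The frame length, algorithmic delay and bit rate quoted for G.729 and G.723.1 follow from the frame size, look-ahead and bits per frame. The helper below is a hypothetical sketch of that arithmetic. For G.723.1 it uses the octet-aligned frame sizes (24 and 20 octets), so the computed rates come out slightly above the nominal 6.3 and 5.3 kbps.

```python
def frame_stats(samples_per_frame, lookahead_ms, bits_per_frame, sample_rate_hz=8000):
    """Frame duration, algorithmic delay (frame + look-ahead) and bit rate."""
    frame_ms = 1000 * samples_per_frame / sample_rate_hz
    return {
        "frame_ms": frame_ms,
        "algorithmic_delay_ms": frame_ms + lookahead_ms,
        "bit_rate_bps": bits_per_frame * 1000 / frame_ms,
    }

g729 = frame_stats(80, 5.0, 80)               # 80 samples, 5 ms look-ahead, 80 bits
g7231_high = frame_stats(240, 7.5, 24 * 8)    # 24-octet frames
g7231_low = frame_stats(240, 7.5, 20 * 8)     # 20-octet frames

print(g729)
print(g7231_high)
print(g7231_low)
```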

Other Codecs
• CDMA QCELP defined in IS-733
  • Variable-rate coder
  • Two most common rates
    • The high rate, 13.3 kbps
    • A lower rate, 6.2 kbps
  • Silence suppression
  • For use with RTP, RFC 2658
• GSM Enhanced Full-Rate (EFR)
  • GSM 06.60
  • An enhanced version of GSM Full-Rate
  • ACELP-based codec
  • The same bit rate and the same overall packing structure
    • 12.2 kbps
  • Supports discontinuous transmission
  • For use with RTP, RFC 1890
• GSM Adaptive Multi-Rate (AMR) codec
  • GSM 06.90
  • Eight different modes
  • 4.75 kbps to 12.2 kbps
  • 12.2 kbps, GSM EFR
  • 7.4 kbps, IS-641 (TDMA cellular systems)
  • Can change the mode at any time
  • Offers discontinuous transmission
• The coding choice of many 3G wireless networks

• The MOS values are for laboratory conditions
  • G.711 does not deal with lost packets
  • G.729 can accommodate a lost frame by interpolating from previous frames
    • But this causes errors in subsequent speech frames
• Processing power
  • G.728 or G.729: 40 MIPS
  • G.726: 10 MIPS
• Cascaded codecs
  • E.g., a G.711 stream passed through a G.729 encoder/decoder
  • The resulting quality might not even come close to that of G.729 alone
  • Each coder only generates an approximation of the incoming signal

Tones, Signals, and DTMF Digits
• The hybrid codecs are optimized for human speech
  • Other data may need to be transmitted
  • Tones: fax tones, dial tone, busy tone
  • DTMF digits for two-stage dialing or voice-mail
• G.711 is OK
• Tones passed through G.723.1 and G.729 can be unintelligible
• The ingress gateway needs to intercept the tones and DTMF digits
  • Use an external signaling system
    • Easy at the start of a call
    • Difficult in the middle of a call
  • Encode the tones differently from the speech
    • Send them along the same media path
    • An RTP packet provides the name of the tone and the duration
    • Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
    • RFC 2198
      • An RTP payload format for redundant audio data
      • Sending both types of RTP payload
• RTP Payload Format for DTMF Digits
  • An Internet Draft
  • Both methods described before
  • A large number of tones and events
    • DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
  • The named events
  • E: the end of the tone, R: reserved
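A named-event payload of this kind can be sketched as a four-byte structure. The layout below (8-bit event code, E bit, R bit, 6-bit volume, 16-bit duration) is an assumption based on the format later published as RFC 2833 and may differ from the draft the slide refers to; the function names are hypothetical.

```python
import struct

# Assumed event codes: digits 0-9, then *, #, A-D.
DTMF_EVENTS = {digit: code for code, digit in enumerate("0123456789*#ABCD")}

def pack_named_event(digit, end, volume, duration):
    """Pack one named event: event | E,R,volume | duration (network byte order)."""
    event = DTMF_EVENTS[digit]
    flags = ((1 if end else 0) << 7) | (volume & 0x3F)   # E bit set, R bit left at 0
    return struct.pack("!BBH", event, flags, duration)

def unpack_named_event(payload):
    """Return (event code, end flag, volume, duration)."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration

payload = pack_named_event("5", end=True, volume=10, duration=800)
print(payload.hex())
print(unpack_named_event(payload))
```

Sending the digit by name in this way avoids pushing the tone itself through a speech-optimized codec.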

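DISCRETE TIME LTI SYSTEMS: THE CONVOLUTION SUM
$$x[n] = \sum_{k=-\infty}^{+\infty} x[k]\,\delta[n-k]$$
[Figure: stem plots of x[n] (n = 0, 1), h[n] (n = 0, 1, 2) and y[n] (n = 0 to 3); amplitude labels shown: 1, 0.5, 2, 0.5, 2.5, 2]
$$y[n] = \sum_{k=-\infty}^{+\infty} x[k]\,h[n-k]$$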
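The convolution sum can be evaluated directly for finite-length sequences. The input values below are an assumption chosen to be consistent with the amplitude labels in the figure (x[n] = 0.5, 2 and h[n] = 1, 1, 1); the original plot may differ.

```python
import numpy as np

def convolve_sum(x, h):
    """y[n] = sum_k x[k] h[n-k] for finite sequences starting at n = 0."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = np.array([0.5, 2.0])          # assumed input samples at n = 0, 1
h = np.array([1.0, 1.0, 1.0])     # assumed impulse response at n = 0, 1, 2
y = convolve_sum(x, h)
print(y)                          # [0.5 2.5 2.5 2. ]
```

Each output sample is the sum of shifted, scaled copies of h[n]: 0.5·h[n] + 2·h[n−1].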

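FREQUENCY-DOMAIN REPRESENTATION OF SAMPLING
$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$
$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$
[Figure: spectrum X_c(jΩ) of the band-limited signal, occupying −Ω_N to Ω_N, and the spectrum after sampling, with copies repeated at multiples of Ω_S; the lower edge of the first copy is marked at Ω_S − Ω_N]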

SPEECH SOURCE MODEL AND SOURCE CODING
• Vocal Tract Model
$$u[n] + \sum_{k=1}^{p} a_k\, x[n-k] = x[n]$$
$$G(z) = \frac{X(z)}{U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
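The step from the difference equation to the transfer function can be written out explicitly. The derivation below takes the z-transform of both sides and uses only the time-shift property, under the usual assumption of zero initial conditions.

```latex
\begin{align*}
x[n] &= u[n] + \sum_{k=1}^{p} a_k\, x[n-k] \\
X(z) &= U(z) + \sum_{k=1}^{p} a_k z^{-k} X(z) \\
X(z)\Bigl(1 - \sum_{k=1}^{p} a_k z^{-k}\Bigr) &= U(z) \\
G(z) = \frac{X(z)}{U(z)} &= \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}
\end{align*}
```

The denominator has no zeros of its own in the numerator to offset it, which is why the vocal tract model is called an all-pole filter.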
