A Tutorial on
MPEG/Audio
Compression
Davis Pan, IEEE Multimedia Journal,
Summer 1995
Presented by:
Randeep Singh Gakhal
CMPT 820, Spring 2004
Outline
 Introduction
 Technical Overview
 Polyphase Filter Bank
 Psychoacoustic Model
 Coding and Bit Allocation
 Conclusions and Future Work
Introduction
 What does MPEG-1 Audio provide?
An audio compression system whose losses are perceptually
transparent because it exploits the weaknesses of the human ear.
 Can provide compression by a factor of 6 while
retaining sound quality.
 One part of a three part standard that includes
audio, video, and audio/video synchronization.
Technical Overview
MPEG-I Audio Features
 PCM sampling rate of 32, 44.1, or 48 kHz
 Four channel modes:
 Monophonic and Dual-monophonic
 Stereo and Joint-stereo
 Three modes (layers in MPEG-I speak):
 Layer I: Computationally cheapest, bit rates > 128 kbps
 Layer II: Bit rate ~128 kbps, used in VCD
 Layer III: Most complicated encoding/decoding, bit rates
~64 kbps, originally intended for streaming audio
Human Audio System (ear + brain)
 Human sensitivity to sound is non-linear
across the audible range (20 Hz – 20 kHz)
 Audible range is broken into regions within which
humans cannot perceive a difference
 called the critical bands
MPEG-I Encoder Architecture[1]
MPEG-I Encoder Architecture
 Polyphase Filter Bank: Transforms PCM samples
to frequency domain signals in 32 subbands
 Psychoacoustic Model: Calculates acoustically
irrelevant parts of signal
 Bit Allocator: Allots bits to subbands according to
input from psychoacoustic calculation.
 Frame Creation: Generates an MPEG-I compliant
bit stream.
The Polyphase
Filter Bank
Polyphase Filter Bank
 Divides audio signal into 32 equal width
subband streams in the frequency domain.
 Inverse filter at decoder cannot recover
signal without some, albeit inaudible, loss.
 Based on work by Rothweiler[2].
 Standard specifies 512 coefficient analysis
window, C[n]
Polyphase Filter Bank
 Buffer of 512 PCM samples with 32 new
samples, X[n], shifted in every computation cycle
 Calculate window samples for i = 0…511:
Z[i] = C[i] · X[i]
 Partial calculation for i = 0…63:
Y[i] = Σ_{j=0…7} Z[i + 64j]
 Calculate 32 subband samples for i = 0…31:
S[i] = Σ_{k=0…63} M[i][k] · Y[k]
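The three steps above can be sketched as one NumPy routine. This is a sketch only: `C` (the 512-coefficient window) and `M` (the 32×64 analysis matrix) are taken as inputs rather than reproduced from the standard's tables.

```python
import numpy as np

def polyphase_analysis(buffer, C, M):
    """One computation cycle of the MPEG-1 analysis filter bank.

    buffer : the 512 most recent PCM samples X[0..511]
    C      : 512-coefficient analysis window from the standard
    M      : 32x64 analysis matrix
    Returns the 32 subband samples S[0..31].
    """
    Z = C * buffer                    # Z[i] = C[i] * X[i],        i = 0..511
    Y = Z.reshape(8, 64).sum(axis=0)  # Y[i] = sum_j Z[i + 64j],   i = 0..63
    S = M @ Y                         # S[i] = sum_k M[i][k] * Y[k]
    return S
```

Each call consumes 32 new PCM samples shifted into the 512-sample buffer, exactly as the bullet above describes.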
Polyphase Filter Bank
 Visualization of the filter[1]:
Polyphase Filter Bank
 The net effect:
S[i] = Σ_{k=0…63} Σ_{j=0…7} M[i][k] · C[k + 64j] · X[k + 64j]
 Analysis matrix:
M[i][k] = cos( (2i + 1)(k − 16) π / 64 )
 Requires 512 + 32×64 = 2560 multiplies.
 Each subband has bandwidth π/32T centered at
odd multiples of π/64T
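As a quick check on the formulas above, the analysis matrix and the per-cycle multiply count can be reproduced in a few lines of Python (`M` and `multiplies` are illustrative names, not from the standard):

```python
import numpy as np

# Analysis matrix M[i][k] = cos((2i + 1)(k - 16) * pi / 64)
# for i = 0..31 (subbands) and k = 0..63.
i = np.arange(32)[:, None]
k = np.arange(64)[None, :]
M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)

# Per cycle: 512 window multiplies (C[i] * X[i]) plus a
# 32x64 matrix-vector product.
multiplies = 512 + 32 * 64  # = 2560
```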
Polyphase Filter Bank
 Shortcomings:
 Equal width filters do not correspond with critical
band model of auditory system.
 Filter bank and its inverse are NOT lossless.
 Frequency overlap between subbands.
Polyphase Filter Bank
 Comparison of filter banks and critical bands[1]:
Polyphase Filter Bank
 Frequency response of one subband[1]:
Psychoacoustic
Model
The Weakness of the Human Ear
 Frequency dependent resolution:
 We do not have the ability to discern minute
differences in frequency within the critical bands.
 Auditory masking:
 When two signals of very close frequency are
both present, the louder will mask the softer.
 A masked signal must be louder than some
threshold to be heard, which gives us room to
introduce inaudible quantization noise.
MPEG-I Psychoacoustic Models
 MPEG-I standard defines two models:
 Psychoacoustic Model 1:
 Less computationally expensive
 Makes some serious compromises in what it
assumes a listener cannot hear
 Psychoacoustic Model 2:
 Provides more features suited for Layer III
coding, assuming of course, increased processor
bandwidth.
Psychoacoustic Model
 Convert samples to frequency domain
 Use a Hann weighting and then a DFT
 Gives a frequency-domain representation free of the
edge artifacts caused by the finite window size.
 Model 1 uses 512 (Layer I) or 1024 (Layers II
and III) sample window.
 Model 2 uses a 1024 sample window and two
calculations per frame.
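The windowing-plus-DFT step can be sketched as below, assuming the Model 1 / Layer I window of 512 samples; the function name is illustrative:

```python
import numpy as np

def perceptual_spectrum(frame):
    """Hann-weighted DFT of one analysis window (512 samples for
    Model 1 at Layer I; 1024 for Layers II/III and Model 2).

    The Hann weighting tapers the frame edges so the finite window
    does not smear edge artifacts into the spectrum.
    """
    n = len(frame)
    hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)
    return np.fft.rfft(frame * hann)
```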
Psychoacoustic Model
 Need to separate sound into “tones” and “noise”
components
 Model 1:
 Local peaks are tones, lump remaining spectrum per
critical band into noise at a representative frequency.
 Model 2:
 Calculate “tonality” index to determine likelihood of each
spectral point being a tone
 based on previous two analysis windows
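Model 1's peak picking can be sketched as a simple local-maximum scan. This is a simplification: the standard also compares each candidate against bins several positions away before declaring it tonal, which is omitted here.

```python
import numpy as np

def find_tonal_peaks(power_db):
    """Model 1 sketch: flag local spectral peaks as tonal components.

    A bin is a candidate tone if it is louder than both neighbours;
    the remaining energy per critical band would then be lumped into
    a noise component at a representative frequency.
    """
    p = np.asarray(power_db)
    return [i for i in range(1, len(p) - 1)
            if p[i] > p[i - 1] and p[i] >= p[i + 1]]
```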
Psychoacoustic Model
 “Smear” each signal within its critical band
 Use either a masking (Model 1) or a spreading
function (Model 2).
 Adjust calculated threshold by incorporating
a “quiet” mask – masking threshold for
each frequency when no other frequencies
are present.
Psychoacoustic Model
 Calculate a masking threshold for each subband in the
polyphase filter bank
 Model 1:
 Selects minima of masking threshold values in range of each
subband
 Inaccurate at higher frequencies – recall how subbands are
linearly distributed, critical bands are NOT!
 Model 2:
 If subband wider than critical band:
 Use minimal masking threshold in subband
 If critical band wider than subband:
 Use average masking threshold in subband
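The Model 2 rule above reduces to a one-line decision per subband; a minimal sketch (function and argument names are ours):

```python
import numpy as np

def subband_threshold(thresholds_in_subband, subband_wider_than_critical_band):
    """Model 2 rule: collapse the per-frequency masking thresholds
    falling inside one polyphase subband to a single value.

    Subband wider than the critical band -> be conservative, take the
    minimum; critical band wider than the subband -> take the average.
    """
    t = np.asarray(thresholds_in_subband)
    return t.min() if subband_wider_than_critical_band else t.mean()
```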
Psychoacoustic Model
 The hard work is done – now, we just
calculate the signal-to-mask ratio (SMR)
per subband
 SMR = signal energy / masking threshold
 We pass our result on to the coding unit
which can now produce a compressed
bitstream
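The SMR computation is a one-liner. The slide states the raw ratio; expressing it in dB (as the bit allocator consumes it) is our addition:

```python
import math

def signal_to_mask_ratio_db(signal_energy, masking_threshold):
    """SMR for one subband: signal energy over masking threshold,
    converted to dB for the bit-allocation stage."""
    return 10 * math.log10(signal_energy / masking_threshold)
```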
Psychoacoustic Model (example)
 Input[1]:
Psychoacoustic Model (example)
 Transformation to perceptual domain[1]:
Psychoacoustic Model (example)
 Calculation of masking thresholds[1]:
Psychoacoustic Model (example)
 Signal-to-mask ratios[1]:
Psychoacoustic Model (example)
 What we actually send[1]:
Coding and Bit
Allocation
Layer Specific Coding
 Layer specific frame formats[1]:
Layer Specific Coding
 Stream of samples is processed in groups[1]:
Layer I Coding
 Group 12 samples from each of the 32 subbands and
encode them in each frame (32 × 12 = 384 samples)
 Each group encoded with 0-15 bits/sample
 Each group has 6-bit scale factor
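A Layer I group can be sketched as a uniform quantizer driven by one scale factor. This is a simplified illustration, not the standard's exact quantizer tables or 6-bit scale-factor index coding:

```python
import numpy as np

def encode_group(samples, bits):
    """Layer I sketch: quantize 12 subband samples with one scale factor.

    `bits` is the allocation for this group (0-15 bits/sample); the
    scale factor itself would be coded as a 6-bit table index.
    """
    samples = np.asarray(samples, dtype=float)
    scale = np.abs(samples).max() or 1.0   # normalize group to [-1, 1]
    levels = 2 ** bits - 1
    codes = np.round((samples / scale + 1) / 2 * levels).astype(int)
    return scale, codes
```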
Layer II Coding
 Similar to Layer I except:
 Groups are now 3 × 12 samples per subband =
1152 samples per frame
 Can have up to 3 scale factors per subband to
avoid audible distortion in special cases
 Called scale factor selection information (SCFSI)
Layer III Coding
 Further subdivides subbands using Modified
Discrete Cosine Transform (MDCT) – a lossless
transform
 Larger frequency resolution => smaller time
resolution
 possibility of pre-echo
 Layer III encoder can detect and reduce pre-echo
by “borrowing bits” from future encodings
Bit Allocation
 Determine number of bits to allot for each
subband given SMR from psychoacoustic model.
 Layers I and II:
 Calculate mask-to-noise ratio:
 MNR = SNR – SMR (in dB)
 SNR given by MPEG-I standard (as function of quantization
levels)
 Now iterate until no bits to allocate left:
 Allocate bits to subband with lowest MNR.
 Re-calculate MNR for subband allocated more bits.
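The greedy loop above can be sketched as follows. Two assumptions are ours: a toy SNR model of ~6.02 dB per quantizer bit stands in for the standard's SNR table, and one bit is granted per iteration:

```python
def allocate_bits(smr, total_bits):
    """Layers I/II sketch of the greedy bit-allocation loop.

    smr        : signal-to-mask ratio per subband, in dB
    total_bits : per-sample bit budget to distribute
    """
    alloc = [0] * len(smr)
    for _ in range(total_bits):
        # MNR = SNR - SMR; the neediest subband has the lowest MNR.
        mnr = [6.02 * b - s for b, s in zip(alloc, smr)]
        alloc[mnr.index(min(mnr))] += 1  # give it one more bit, then re-check
    return alloc
```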
Bit Allocation
 Layer III:
 Employs “noise allocation”
 Quantizes each spectral value and employs
Huffman coding
 If Huffman encoding results in noise in excess of
allowed distortion for a subband, encoder
increases resolution on that subband
 Whole process repeats until one of three
specified stop conditions is met.
Conclusions and
Future Work
Conclusions
 MPEG-I provides tremendous compression
for relatively cheap computation.
 Not suitable for archival or audiophile-grade
music, as very seasoned listeners can
discern distortion.
 Modifying or searching MPEG-I content
requires decompression and is not cheap!
Future Work
 MPEG-1 audio lays the foundation for all modern
audio compression techniques
 Lots of progress since then (1994!)
 MPEG-2 (1996) extends MPEG audio
compression to support 5.1 channel audio
 MPEG-4 (1998) attempts to code based on
perceived audio objects in the stream
 Finally, MPEG-7 (2001) operates at an even
higher level of abstraction, focusing on meta-data
coding to make content searchable and
retrievable
References
[1] D. Pan, “A Tutorial on MPEG/Audio Compression”,
IEEE Multimedia Journal, 1995.
[2] J. H. Rothweiler, “Polyphase Quadrature Filters – A New
Subband Coding Technique”, Proc. IEEE Int. Conf. on Acoustics,
Speech, and Signal Processing, pp. 1280–1283, Boston, 1983.