A Tutorial on
MPEG/Audio
Compression
Davis Pan, IEEE Multimedia Journal,
Summer 1995
Presented by:
Randeep Singh Gakhal
CMPT 820, Spring 2004
Outline
 Introduction
 Technical Overview
 Polyphase Filter Bank
 Psychoacoustic Model
 Coding and Bit Allocation
 Conclusions and Future Work
Introduction
 What does MPEG-1 Audio provide?
An audio compression system whose losses are perceptually
transparent because it exploits the weaknesses of the human ear.
 Can provide compression by a factor of 6 while
retaining sound quality.
 One part of a three part standard that includes
audio, video, and audio/video synchronization.
Technical Overview
MPEG-I Audio Features
 PCM sampling rate of 32, 44.1, or 48 kHz
 Four channel modes:
 Monophonic and Dual-monophonic
 Stereo and Joint-stereo
 Three modes (layers in MPEG-I speak):
 Layer I: Computationally cheapest, bit rates > 128 kbps
 Layer II: Bit rate ~128 kbps, used in VCD
 Layer III: Most complicated encoding/decoding, bit rates
~64 kbps, originally intended for streaming audio
Human Audio System (ear + brain)
 Human sensitivity to sound is non-linear
across the audible range (20 Hz – 20 kHz)
 Audible range is broken into regions within which
humans cannot perceive a difference
 called the critical bands
MPEG-I Encoder Architecture[1]
MPEG-I Encoder Architecture
 Polyphase Filter Bank: Transforms PCM samples
to frequency domain signals in 32 subbands
 Psychoacoustic Model: Calculates acoustically
irrelevant parts of signal
 Bit Allocator: Allots bits to subbands according to
input from psychoacoustic calculation.
 Frame Creation: Generates an MPEG-I compliant
bit stream.
The Polyphase
Filter Bank
Polyphase Filter Bank
 Divides audio signal into 32 equal width
subband streams in the frequency domain.
 Inverse filter at decoder cannot recover
signal without some, albeit inaudible, loss.
 Based on work by Rothweiler[2].
 Standard specifies 512 coefficient analysis
window, C[n]
Polyphase Filter Bank
 Buffer of 512 PCM samples with 32 new
samples, X[n], shifted in every computation cycle
 Calculate window samples for i = 0…511:
Z[i] = C[i] · X[i]
 Partial calculation for i = 0…63:
Y[i] = Σ_{j=0…7} Z[i + 64j]
 Calculate 32 subband samples for i = 0…31:
S[i] = Σ_{k=0…63} M[i][k] · Y[k]
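The three steps above can be sketched as one NumPy routine. This is a sketch only: `C` (the 512-coefficient window) and `M` (the 32×64 analysis matrix) are taken as inputs rather than reproduced from the standard's tables.

```python
import numpy as np

def polyphase_analysis(buffer, C, M):
    """One computation cycle of the MPEG-1 analysis filter bank.

    buffer : the 512 most recent PCM samples X[0..511]
    C      : 512-coefficient analysis window from the standard
    M      : 32x64 analysis matrix
    Returns the 32 subband samples S[0..31].
    """
    Z = C * buffer                    # Z[i] = C[i] * X[i],        i = 0..511
    Y = Z.reshape(8, 64).sum(axis=0)  # Y[i] = sum_j Z[i + 64j],   i = 0..63
    S = M @ Y                         # S[i] = sum_k M[i][k] * Y[k]
    return S
```

Each call consumes 32 new PCM samples shifted into the 512-sample buffer, exactly as the bullet above describes.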
Polyphase Filter Bank
 Visualization of the filter[1]:
Polyphase Filter Bank
 The net effect:
S[i] = Σ_{k=0…63} Σ_{j=0…7} M[i][k] · C[k + 64j] · X[k + 64j]
 Analysis matrix:
M[i][k] = cos( (2i + 1)(k − 16) π / 64 )
 Requires 512 + 32×64 = 2560 multiplies.
 Each subband has bandwidth π/32T centered at
odd multiples of π/64T
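As a quick check on the formulas above, the analysis matrix and the per-cycle multiply count can be reproduced in a few lines of Python (`M` and `multiplies` are illustrative names, not from the standard):

```python
import numpy as np

# Analysis matrix M[i][k] = cos((2i + 1)(k - 16) * pi / 64)
# for i = 0..31 (subbands) and k = 0..63.
i = np.arange(32)[:, None]
k = np.arange(64)[None, :]
M = np.cos((2 * i + 1) * (k - 16) * np.pi / 64)

# Per cycle: 512 window multiplies (C[i] * X[i]) plus a
# 32x64 matrix-vector product.
multiplies = 512 + 32 * 64  # = 2560
```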
Polyphase Filter Bank
 Shortcomings:
 Equal width filters do not correspond with critical
band model of auditory system.
 Filter bank and its inverse are NOT lossless.
 Frequency overlap between subbands.
Polyphase Filter Bank
 Comparison of filter banks and critical bands[1]:
Polyphase Filter Bank
 Frequency response of one subband[1]:
Psychoacoustic
Model
The Weakness of the Human Ear
 Frequency dependent resolution:
 We do not have the ability to discern minute
differences in frequency within the critical bands.
 Auditory masking:
 When two signals of very close frequency are
both present, the louder will mask the softer.
 A masked signal must be louder than some
threshold to be heard, which gives us room to
introduce inaudible quantization noise.
MPEG-I Psychoacoustic Models
 MPEG-I standard defines two models:
 Psychoacoustic Model 1:
 Less computationally expensive
 Makes some serious compromises in what it
assumes a listener cannot hear
 Psychoacoustic Model 2:
 Provides more features suited for Layer III
coding, assuming of course, increased processor
bandwidth.
Psychoacoustic Model
 Convert samples to frequency domain
 Use a Hann weighting and then a DFT
 Gives a frequency-domain representation free of the
edge artifacts caused by the finite window size.
 Model 1 uses 512 (Layer I) or 1024 (Layers II
and III) sample window.
 Model 2 uses a 1024 sample window and two
calculations per frame.
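The windowing-plus-DFT step can be sketched as below, assuming the Model 1 / Layer I window of 512 samples; the function name is illustrative:

```python
import numpy as np

def perceptual_spectrum(frame):
    """Hann-weighted DFT of one analysis window (512 samples for
    Model 1 at Layer I; 1024 for Layers II/III and Model 2).

    The Hann weighting tapers the frame edges so the finite window
    does not smear edge artifacts into the spectrum.
    """
    n = len(frame)
    hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)
    return np.fft.rfft(frame * hann)
```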
Psychoacoustic Model
 Need to separate sound into “tones” and “noise”
components
 Model 1:
 Local peaks are tones, lump remaining spectrum per
critical band into noise at a representative frequency.
 Model 2:
 Calculate “tonality” index to determine likelihood of each
spectral point being a tone
 based on previous two analysis windows
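Model 1's peak picking can be sketched as a simple local-maximum scan. This is a simplification: the standard also compares each candidate against bins several positions away before declaring it tonal, which is omitted here.

```python
import numpy as np

def find_tonal_peaks(power_db):
    """Model 1 sketch: flag local spectral peaks as tonal components.

    A bin is a candidate tone if it is louder than both neighbours;
    the remaining energy per critical band would then be lumped into
    a noise component at a representative frequency.
    """
    p = np.asarray(power_db)
    return [i for i in range(1, len(p) - 1)
            if p[i] > p[i - 1] and p[i] >= p[i + 1]]
```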
Psychoacoustic Model
 “Smear” each signal within its critical band
 Use either a masking (Model 1) or a spreading
function (Model 2).
 Adjust calculated threshold by incorporating
a “quiet” mask – masking threshold for
each frequency when no other frequencies
are present.
Psychoacoustic Model
 Calculate a masking threshold for each subband in the
polyphase filter bank
 Model 1:
 Selects minima of masking threshold values in range of each
subband
 Inaccurate at higher frequencies – recall how subbands are
linearly distributed, critical bands are NOT!
 Model 2:
 If subband wider than critical band:
 Use minimal masking threshold in subband
 If critical band wider than subband:
 Use average masking threshold in subband
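The Model 2 rule above reduces to a one-line decision per subband; a minimal sketch (function and argument names are ours):

```python
import numpy as np

def subband_threshold(thresholds_in_subband, subband_wider_than_critical_band):
    """Model 2 rule: collapse the per-frequency masking thresholds
    falling inside one polyphase subband to a single value.

    Subband wider than the critical band -> be conservative, take the
    minimum; critical band wider than the subband -> take the average.
    """
    t = np.asarray(thresholds_in_subband)
    return t.min() if subband_wider_than_critical_band else t.mean()
```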
Psychoacoustic Model
 The hard work is done – now, we just
calculate the signal-to-mask ratio (SMR)
per subband
 SMR = signal energy / masking threshold
 We pass our result on to the coding unit
which can now produce a compressed
bitstream
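The SMR computation is a one-liner. The slide states the raw ratio; expressing it in dB (as the bit allocator consumes it) is our addition:

```python
import math

def signal_to_mask_ratio_db(signal_energy, masking_threshold):
    """SMR for one subband: signal energy over masking threshold,
    converted to dB for the bit-allocation stage."""
    return 10 * math.log10(signal_energy / masking_threshold)
```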
Psychoacoustic Model (example)
 Input[1]:
Psychoacoustic Model (example)
 Transformation to perceptual domain[1]:
Psychoacoustic Model (example)
 Calculation of masking thresholds[1]:
Psychoacoustic Model (example)
 Signal-to-mask ratios[1]:
Psychoacoustic Model (example)
 What we actually send[1]:
Coding and Bit
Allocation
Layer Specific Coding
 Layer specific frame formats[1]:
Layer Specific Coding
 Stream of samples is processed in groups[1]:
Layer I Coding
 Group 12 samples from each of the 32 subbands and
encode them in each frame (32 × 12 = 384 samples)
 Each group encoded with 0-15 bits/sample
 Each group has 6-bit scale factor
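A Layer I group can be sketched as a uniform quantizer driven by one scale factor. This is a simplified illustration, not the standard's exact quantizer tables or 6-bit scale-factor index coding:

```python
import numpy as np

def encode_group(samples, bits):
    """Layer I sketch: quantize 12 subband samples with one scale factor.

    `bits` is the allocation for this group (0-15 bits/sample); the
    scale factor itself would be coded as a 6-bit table index.
    """
    samples = np.asarray(samples, dtype=float)
    scale = np.abs(samples).max() or 1.0   # normalize group to [-1, 1]
    levels = 2 ** bits - 1
    codes = np.round((samples / scale + 1) / 2 * levels).astype(int)
    return scale, codes
```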
Layer II Coding
 Similar to Layer I except:
 Groups are now 3 × 12 samples per subband =
1152 samples per frame
 Can have up to 3 scale factors per subband to
avoid audible distortion in special cases
 Called scale factor selection information (SCFSI)
Layer III Coding
 Further subdivides subbands using Modified
Discrete Cosine Transform (MDCT) – a lossless
transform
 Larger frequency resolution => smaller time
resolution
 possibility of pre-echo
 Layer III encoder can detect and reduce pre-echo
by “borrowing bits” from future encodings
Bit Allocation
 Determine number of bits to allot for each
subband given SMR from psychoacoustic model.
 Layers I and II:
 Calculate mask-to-noise ratio:
 MNR = SNR – SMR (in dB)
 SNR given by MPEG-I standard (as function of quantization
levels)
 Now iterate until no bits to allocate left:
 Allocate bits to subband with lowest MNR.
 Re-calculate MNR for subband allocated more bits.
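The greedy loop above can be sketched as follows. Two assumptions are ours: a toy SNR model of ~6.02 dB per quantizer bit stands in for the standard's SNR table, and one bit is granted per iteration:

```python
def allocate_bits(smr, total_bits):
    """Layers I/II sketch of the greedy bit-allocation loop.

    smr        : signal-to-mask ratio per subband, in dB
    total_bits : per-sample bit budget to distribute
    """
    alloc = [0] * len(smr)
    for _ in range(total_bits):
        # MNR = SNR - SMR; the neediest subband has the lowest MNR.
        mnr = [6.02 * b - s for b, s in zip(alloc, smr)]
        alloc[mnr.index(min(mnr))] += 1  # give it one more bit, then re-check
    return alloc
```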
Bit Allocation
 Layer III:
 Employs “noise allocation”
 Quantizes each spectral value and employs
Huffman coding
 If Huffman encoding results in noise in excess of
allowed distortion for a subband, encoder
increases resolution on that subband
 Whole process repeats until one of three
specified stop conditions is met.
Conclusions and
Future Work
Conclusions
 MPEG-I provides tremendous compression
for relatively cheap computation.
 Not suitable for archival or audiophile-grade
music, as very seasoned listeners can
discern distortion.
 Modifying or searching MPEG-I content
requires decompression and is not cheap!
Future Work
 MPEG-1 audio lays the foundation for all modern
audio compression techniques
 Lots of progress since then (1994!)
 MPEG-2 (1996) extends MPEG audio
compression to support 5.1 channel audio
 MPEG-4 (1998) attempts to code based on
perceived audio objects in the stream
 Finally, MPEG-7 (2001) operates at an even
higher level of abstraction, focusing on meta-data
coding to make content searchable and
retrievable
References
[1] D. Pan, “A Tutorial on MPEG/Audio Compression”,
IEEE Multimedia Journal, 1995.
[2] J. H. Rothweiler, “Polyphase Quadrature Filters – A New
Subband Coding Technique”, Proc. IEEE Int. Conf. on Acoustics,
Speech, and Signal Processing, pp. 1280–1283, Boston, 1983.