Audio Compression using
Discrete Wavelet
Transform
Thyagarajan Venkatanarayanan
Meghasyam Tummalacherla
Overview
 Why audio compression?
 Overview of system
 Wavelet Representation
 The psychoacoustic model
 Results
Overview of the system
Encoding: Audio Signal → Wavelet Analysis → Quantization → (Mu-law) Compression → New representation, with the Psychoacoustic Model setting the thresholds and deciding the bit allocation
Decoding: New representation → (Mu-law) Expansion → Wavelet Synthesis → Reconstructed Signal
Why audio compression?
 To represent a signal with the minimum number of bits without losing the quality/message of the signal
 We use a Wavelet based coding method with a
psychoacoustic model to exploit perceptual masking and
eliminate source redundancies
Masking phenomena
 Masking refers to a process where one sound is rendered
inaudible because of the presence of another sound
 Simultaneous masking refers to a frequency domain
phenomenon which has been observed within critical
bands (in-band).
 Important to distinguish between two types of
simultaneous masking, namely
 tone-masking-noise: a tone occurring at the center of a
critical band masks noise of any subcritical bandwidth
 noise-masking-tone: follows the same pattern with the
roles of masker and maskee reversed
Temporal masking
 Masking also occurs in the time-domain.
 In the context of audio signal analysis, abrupt signal
transients (e.g., the onset of a percussive musical
instrument) create pre- and post- masking regions in time
Simultaneous masking
Critical Band and Masking
Adjacent critical bands are separated by a distance of one Bark
The Psychoacoustic model
 Based on empirical tests of human hearing
 Uses an N-point DFT for high resolution spectral analysis,
then estimates for each input frame individual
simultaneous masking thresholds due to the presence of
tone-like and noise-like maskers in the signal spectrum.
 A global masking threshold is then estimated for a subset
of the original N/2 frequency bins by (power) additive
combination of the tonal and non-tonal individual masking
thresholds
Step 1: Spectral analysis and
Normalization
 Input is segmented into frames of 512 samples, each multiplied by a Hanning window, and the power spectral density (PSD) is obtained using an N-point FFT:

$$P(k) = 10\log_{10}\left|\sum_{n=0}^{N-1} w(n)\,x(n)\,e^{-j2\pi kn/N}\right|^{2}, \qquad 0 \le k \le N/2$$

 Normalized so the maximum sits at 96 dB, per MPEG-1 codec parameters:

$$P_N(k) = P(k) - \max_k P(k) + 96 \ \text{dB}$$
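As a concrete sketch of this step (Python with NumPy assumed; `normalized_psd` is an illustrative name, not part of any codec API):

```python
import numpy as np

def normalized_psd(x, N=512):
    """Hann-windowed power spectral density of one frame, normalized so
    its maximum sits at 96 dB (MPEG-1 convention)."""
    w = np.hanning(N)                          # Hanning window
    X = np.fft.rfft(w * x[:N], n=N)            # N-point FFT, bins 0..N/2
    P = 10 * np.log10(np.abs(X) ** 2 + 1e-12)  # PSD in dB (eps avoids log 0)
    return P - P.max() + 96                    # normalize: max = 96 dB

frame = np.sin(2 * np.pi * 3000 / 44100 * np.arange(512))  # 3 kHz tone
PN = normalized_psd(frame)
```

For this tone, `PN` has 257 bins (0 through N/2) and peaks near bin 35 ≈ 3000/44100 × 512.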
Step 2: Identification of tonal &
noise (non tonal) frequencies
 Find Local Maxima
 A local maximum is classified as a tonal frequency if it exceeds its neighbours within a frequency-dependent range ±Δk by at least 7 dB
 Remaining maxima, not within the ±Δk range of a tonal frequency, are classified as noise frequencies
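A minimal sketch of the tonal classification (the neighbourhood Δk is fixed at ±2 bins here for simplicity; in MPEG-1 it widens with frequency):

```python
import numpy as np

def tonal_bins(PN, delta_k=2, margin_db=7.0):
    """Classify PSD bins as tonal: a strict local maximum that exceeds its
    neighbours at +/- delta_k bins by at least 7 dB.  (A fixed +/-2-bin
    neighbourhood is an assumed simplification of the MPEG-1 rule.)"""
    tonal = []
    for k in range(delta_k + 1, len(PN) - delta_k - 1):
        if PN[k] <= PN[k - 1] or PN[k] <= PN[k + 1]:
            continue                           # not a strict local maximum
        if PN[k] - PN[k - delta_k] >= margin_db and \
           PN[k] - PN[k + delta_k] >= margin_db:
            tonal.append(k)
    return tonal

PN = np.full(257, 20.0)
PN[34:37] = [60.0, 96.0, 60.0]    # a sharp spectral peak around bin 35
print(tonal_bins(PN))             # [35]
```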
Tonal and Noise Components
Tonal and Noise Masks
Step 3: Thresholding and
reorganization of Masks
 Any tonal/noise maskers below the absolute hearing
threshold are discarded
$$P_{TM,NM}(k) \le T_q(k)$$

where $T_q(k)$ is the absolute threshold of hearing (the amount of energy needed in a pure tone for it to be detected by a listener in a noiseless environment)
 Next, a 0.5 Bark sliding window is used to replace any pair of maskers occurring within that distance of each other by the stronger of the two
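The Bark conversion and the 0.5 Bark pruning pass might be sketched as follows (the `(bin, level_dB)` masker format is an assumption of this sketch, and the absolute-threshold test is omitted):

```python
import numpy as np

def bark(f):
    """Zwicker's Hz-to-Bark mapping."""
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def prune_maskers(maskers, fs=44100, N=512):
    """Keep only the stronger of any pair of maskers closer than 0.5 Bark.
    `maskers` is a list of (bin, level_dB) pairs (assumed format)."""
    out = []
    for k, p in sorted(maskers):
        z = bark(k * fs / N)
        if out and z - bark(out[-1][0] * fs / N) < 0.5:
            if p > out[-1][1]:
                out[-1] = (k, p)               # replace the weaker neighbour
        else:
            out.append((k, p))
    return out

# Bins 35 and 36 are ~0.16 Bark apart, so the weaker one is dropped.
pruned = prune_maskers([(35, 96.0), (36, 80.0), (100, 70.0)])
print(pruned)  # [(35, 96.0), (100, 70.0)]
```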
Thresholding and reorganization
Step 4: Individual masking thresholds
 Each individual threshold represents a masking contribution at
frequency bin i due to the tone or noise masker located at bin
j.
 The tonal masker thresholds, $T_{TM}(i,j)$, are given (in dB) by:

$$T_{TM}(i,j) = P_{TM}(j) - 0.275\,z(j) + SF(i,j) - 6.025$$

 The noise masker thresholds, $T_{NM}(i,j)$, are given (in dB) by:

$$T_{NM}(i,j) = P_{NM}(j) - 0.175\,z(j) + SF(i,j) - 2.025$$
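The two formulas translate directly into code. The sketch below substitutes a crude two-slope spreading function for SF(i, j) (the full model uses level-dependent piecewise-linear slopes), so it illustrates the arithmetic rather than the exact MPEG-1 curves:

```python
def spreading(dz):
    """Very rough spreading function SF(i,j) in dB versus Bark separation
    dz = z(i) - z(j): steep below the masker, shallower above.  (Assumed
    two-slope simplification of the level-dependent piecewise-linear SF.)"""
    return 27.0 * dz if dz < 0 else -10.0 * dz

def tonal_threshold(P_tm_j, z_j, dz):
    """T_TM(i,j) = P_TM(j) - 0.275 z(j) + SF(i,j) - 6.025 dB."""
    return P_tm_j - 0.275 * z_j + spreading(dz) - 6.025

def noise_threshold(P_nm_j, z_j, dz):
    """T_NM(i,j) = P_NM(j) - 0.175 z(j) + SF(i,j) - 2.025 dB."""
    return P_nm_j - 0.175 * z_j + spreading(dz) - 2.025
```

Note the asymmetry in the offsets (6.025 vs 2.025 dB): noise is a more effective masker than a tone of the same level.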
Individual threshold corresponding
to tonal components
Individual threshold corresponding
to non-tonal components
Step 5: Global masking threshold
 The global masking threshold, Tg(i), is obtained by:
$$T_g(i) = 10\log_{10}\left(10^{0.1\,T_q(i)} + \sum_{l=1}^{L} 10^{0.1\,T_{TM}(i,l)} + \sum_{m=1}^{M} 10^{0.1\,T_{NM}(i,m)}\right) \text{dB}$$
 The global threshold for each frequency bin represents a
signal dependent, power additive modification of the
absolute threshold due to the spread of all tonal and noise
maskers in the signal power spectrum
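The power-additive combination is a one-liner once the individual thresholds are in hand; per the description above, the absolute threshold at bin i is included in the sum (sketch):

```python
import math

def global_threshold(Tq_i, T_tm, T_nm):
    """Power-additive combination (in dB) of the absolute threshold and all
    tonal/noise masking thresholds at one frequency bin i."""
    total = 10.0 ** (0.1 * Tq_i)                       # absolute threshold
    total += sum(10.0 ** (0.1 * t) for t in T_tm)      # tonal contributions
    total += sum(10.0 ** (0.1 * t) for t in T_nm)      # noise contributions
    return 10.0 * math.log10(total)

# Two equal 40 dB maskers combine to ~43 dB (a 3 dB gain); the absolute
# threshold here is far below and contributes nothing noticeable.
Tg = global_threshold(-100.0, [40.0], [40.0])
print(round(Tg, 2))  # 43.01
```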
Global masking threshold
Effective SNR (PSD – Global Threshold)
Wavelet representation of signal
$$g(t) = \sum_{k} c_{j_0}(k)\, 2^{j_0/2}\, \Phi(2^{j_0} t - k) + \sum_{k} \sum_{j=j_0}^{j_1} d_j(k)\, 2^{j/2}\, \Psi(2^j t - k)$$

 The audio signal is divided into non-overlapping frames of 512 samples (11.6 ms at 44.1 kHz). Each frame is multiplied by a Hanning window of the same length to avoid border distortions
Wavelet decomposition – initial step
 Given a signal s of length N, the DWT consists of log2 N
stages at most.
 The first step produces 2 sets of coefficients: approximation coefficients cA1 and detail coefficients cD1.
Recursive wavelet decomposition
 The wavelet decomposition of the signal s analyzed at
level j has the following structure: [cAj, cDj, ..., cD1].
Wavelet decomposition
 The wavelet transform coefficients are computed
recursively using an efficient pyramid algorithm
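The pyramid recursion is easy to sketch with the simplest (Haar) filter pair; real use would call MATLAB's wavedec or a library such as PyWavelets rather than this toy:

```python
import numpy as np

def haar_dwt(s, levels):
    """1-D pyramid DWT with the Haar filter pair: at each level the running
    approximation is split into a new approximation (lowpass) and detail
    (highpass) half-length sequence.  Returns [cA_J, cD_J, ..., cD_1],
    mirroring the ordering MATLAB's wavedec uses."""
    cA, details = np.asarray(s, dtype=float), []
    for _ in range(levels):
        even, odd = cA[0::2], cA[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail coefficients cD
        cA = (even + odd) / np.sqrt(2)              # approximation cA
    return [cA] + details[::-1]

# A constant signal has all its energy in the final approximation:
coeffs = haar_dwt(np.ones(8), 3)
```

Orthonormality shows up directly: the L2 energy of the input (8 here) equals the total energy of the coefficients.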
Wavelet: Vanishing moments
 We choose orthonormal wavelets
 A wavelet with K filter coefficients can have at most K/2 vanishing moments
 To ensure regularity (how fast the coefficients decay to zero), we choose a wavelet with a high number of vanishing moments, as these are best suited for audio processing
Wavelet: Sparsity
 Sparsity – the number of non-zero coefficients; the fewer, the better
Dependence of efficiency on choice
of wavelet
 Type of wavelet basis has a significant impact on
efficiency of coding scheme.
 Used the MATLAB function “wavedec” to perform our 1-D wavelet decomposition
 Compression using “wdencmp”, with parameters obtained from “ddencmp”
 Efficiency of compression measured using two metrics returned by the function: PERFL0 and PERFL2
Dependence of efficiency on choice
of wavelet
 Daubechies refers to a particular family of wavelets. The
number refers to the number of vanishing moments
 Simply put, the higher the number of vanishing moments,
the smoother the wavelet (and longer the wavelet filter).
Wavelet        PERFL0   PERFL2
Haar           31.01    99.93
Daubechies-2   64.63    99.95
Daubechies-10  65.59    99.97
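The two scores can be approximated as follows (assumed definitions mirroring the MATLAB metrics: percentage of coefficients zeroed, and percentage of L2 energy retained):

```python
import numpy as np

def perf_scores(coeffs, kept):
    """Roughly what MATLAB's wdencmp reports: PERFL0 = percentage of wavelet
    coefficients set to zero, PERFL2 = percentage of L2 energy retained.
    (Assumed definitions; names follow the MATLAB metrics.)"""
    perfl0 = 100.0 * np.mean(kept == 0)
    perfl2 = 100.0 * np.sum(kept ** 2) / np.sum(coeffs ** 2)
    return perfl0, perfl2

c = np.array([4.0, 0.1, -0.1, 3.0])
kept = np.where(np.abs(c) > 1.0, c, 0.0)   # hard-threshold the small ones
p0, p2 = perf_scores(c, kept)
print(round(p0, 2), round(p2, 2))  # 50.0 99.92
```

Note how zeroing half the coefficients here barely dents the energy, which is exactly the pattern in the table above.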
Bit-rate reduction
 After testing the coder (with the Daubechies-10 wavelet) on 4 different music signals originally at 16 bits/sample (violin, drums, piano, Adele), we observed that the average number of bits required to encode them was around 7.5, i.e., more than a 50% reduction.
 We allocated one bit for every 6.02 dB of the maximum effective SNR in each frame
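The allocation rule above amounts to the following sketch (the 6.02 dB-per-bit figure is the standard uniform-quantizer SNR rule of thumb; the 4-bit floor matches the next slide):

```python
import math

def bits_for_frame(effective_snr_db, floor_bits=4):
    """Allocate one quantizer bit per 6.02 dB of the frame's peak effective
    SNR (PSD minus global masking threshold), never below 4 bits."""
    return max(floor_bits, math.ceil(max(effective_snr_db) / 6.02))

print(bits_for_frame([12.0, 45.0, 30.1]))  # ceil(45/6.02) = 8
print(bits_for_frame([1.0]))               # clamped to the 4-bit floor: 4
```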
Bit allocation per frame (minimum is 4)
Original vs reconstructed signal
Subjective tests
 It is important to eliminate chance in listening tests, so we provided several stimuli of each source material to each listener. We also did not reveal to the listener the actual order in which the stimuli were presented (e.g., original, coder 1, coder 2, etc.).
 Figures indicate the coder provided transparent coding for all audio sources.
 The quality of the piano signal was not as good as the others because it contains long segments of nearly steady or slowly decaying sinusoids, which the wavelet-based coder did not seem to handle well
Sample   Avg. probability original preferred over encoded   Sample size   Comments
Violin   0.25                                               12            Transparent
Piano    0.50                                               10            Nearly transparent
Drums    0.30                                               10            Transparent
Adele    0.27                                               15            Transparent
References
 [1] D. Sinha and A. Tewfik. “Low Bit Rate Transparent Audio Compression using
Adapted Wavelets”, IEEE Trans. ASSP, Vol. 41, No. 12, December 1993
 [2] T. Painter and A. Spanias, “A review of algorithms for perceptual coding of
digital audio signals,” DSP-97, 1997.
 [3] I. Daubechies, “Orthonormal bases of compactly supported wavelets”,
Commun. Pure Appl. Math., vol. 41, pp. 909-996, Nov. 1988
 [4] ISO/IEC JTC1/SC29/WG11 MPEG, IS11172-3 “Information Technology - Coding of
Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5
Mbit/s, Part 3: Audio” 1992. (“MPEG-1”)
 [5] R. Hellman, “Asymmetry of Masking Between Noise and Tone,” Percep. and
Psychophys., vol. 11, pp. 241-246, 1972
 [6] M. Schroeder, et al., “Optimizing Digital Speech Coders by Exploiting Masking
Properties of the Human Ear,” J. Acoust. Soc. Am., pp. 1647-1652, Dec. 1979
 [7] E. Zwicker and H. Fastl, Psychoacoustics Facts and Models, Springer-Verlag,
1990
 [8] C. Burrus, R. A. Gopinath and H. Guo, “Introduction to Wavelets & Wavelet
Transforms”, Prentice-Hall 1998


Editor's Notes

  • #5 Used to obtain compact digital representations of wideband audio signals for efficient transmission or storage. Central objective: to represent the signal with the minimum number of bits while achieving transparent reconstruction, i.e., generating output audio that cannot be distinguished from the original input, even by a sensitive listener. An audio compression scheme must exploit the two sources of irrelevancy and redundancy in audio signals: the masking characteristics of human hearing and the statistical redundancies in the signal. Our approach employs a wavelet-based coding method with a psychoacoustic model to exploit perceptual masking and eliminate source redundancies
  • #7 Masking also occurs in the time-domain. In the context of audio signal analysis, abrupt signal transients (e.g., the onset of a percussive musical instrument) create pre- and post- masking regions in time during which a listener will not perceive signals beneath the elevated audibility thresholds produced by a masker
  • #9 We consider the case of a single masking tone occurring at the center of a critical band. All levels in the figure are given in terms of dB. A hypothetical masking tone occurs at some masking level. This generates an excitation along the basilar membrane which is modeled by a spreading function and a corresponding masking threshold
  • #12 Local maxima in the sample PSD which exceed neighboring components within a certain Bark distance by at least 7 dB are classified as tonal. Tonal maskers are then computed from the spectral peaks listed in S_T as: $P_{TM}(k) = 10\log_{10} \sum_{j=-1}^{1} 10^{0.1\,P(k+j)}\ \text{dB}$. A single noise masker for each critical band is computed from (remaining) spectral lines not within the ±Δk neighborhood of a tonal masker using a similar sum
  • #17 where PTM(j) denotes the tonal masker in frequency bin j, z(j) denotes the Bark frequency of bin j, and SF(i, j), the spread of masking from masker bin j to maskee bin i, is a piecewise-linear function of the masker level P(j) and the Bark maskee-masker separation Δz = z(i) − z(j)
  • #23 The audio signal is represented in terms of the translates and dilates of the scaling function (say Daubechies-10) as: $g(t) = \sum_k c_{j_0}(k)\, 2^{j_0/2}\, \Phi(2^{j_0} t - k) + \sum_k \sum_{j=j_0}^{\infty} d_j(k)\, 2^{j/2}\, \Psi(2^j t - k)$. Such an expansion provides a multiresolution analysis of g(t). The choice of j0 sets the coarsest scale, whose space is spanned by Φ_{j0,k}(t). The audio signal is divided into non-overlapping frames of length 512 samples (11.6 ms at 44.1 kHz). Each frame is multiplied by a Hanning window of the same length to avoid border distortions. Restrictions: compact-support wavelets, to create orthogonal translates and dilates of the wavelet, and to ensure regularity (fast decay of coefficients, controlled by choosing wavelets with a large number of vanishing moments)
  • #24 Given a signal s of length N, the DWT consists of at most log2 N stages. The first step produces 2 sets of coefficients: approximation coefficients cA1 and detail coefficients cD1. The next step splits the approximation coefficients cA1 in two parts using the same scheme, replacing s by cA1 and producing cA2 and cD2, and so on