2. Overview
• Why audio compression?
• Overview of the system
• Wavelet representation
• The psychoacoustic model
• Results
3. Overview of the system
[Block diagram]
Encoding: Audio Signal → Wavelet Analysis → Quantization → (Mu-law) Compression → New representation
(the Psychoacoustic Model supplies the thresholds used to decide the quantization bits)
Decoding: New representation → (Mu-law) Expansion → Wavelet Synthesis → Reconstructed Signal
4. Why audio compression?
• To represent the signal with the minimum number of bits without losing the quality/message of the signal
• We use a wavelet-based coding method with a psychoacoustic model to exploit perceptual masking and eliminate source redundancies
5. Masking phenomena
• Masking refers to a process where one sound is rendered inaudible by the presence of another sound
• Simultaneous masking refers to a frequency-domain phenomenon which has been observed within critical bands (in-band)
• It is important to distinguish between two types of simultaneous masking, namely:
• tone-masking-noise: a tone occurring at the center of a critical band masks noise of any subcritical bandwidth
• noise-masking-tone: follows the same pattern with the roles of masker and maskee reversed
6. Temporal masking
• Masking also occurs in the time domain.
• In the context of audio signal analysis, abrupt signal transients (e.g., the onset of a percussive musical instrument) create pre- and post-masking regions in time
8. Critical Band and Masking
• The distance between adjacent critical bands is one Bark
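The slides measure masker separation in Bark but do not say which Hz-to-Bark mapping was used; a common choice is Zwicker's approximation, sketched here in Python:

```python
import math

def hz_to_bark(f):
    """Zwicker's approximation of the Bark critical-band scale.
    Assumption: the slides do not specify the mapping; this is one
    standard form, not necessarily the one the authors used."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# Adjacent critical bands are one Bark apart, e.g. 1 kHz sits near 8.5 Bark:
print(round(hz_to_bark(1000), 2))
```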
9. The Psychoacoustic model
• Based on tests done on human hearing
• Uses an N-point DFT for high-resolution spectral analysis, then estimates for each input frame individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the signal spectrum
• A global masking threshold is then estimated for a subset of the original N/2 frequency bins by (power-)additive combination of the tonal and non-tonal individual masking thresholds
10. Step 1: Spectral analysis and normalization
• Input is segmented into 512-sample frames by applying a Hanning window, and the power spectral density (PSD) is obtained using an N-point FFT:
P(k) = 10·log10 | Σ_{n=0}^{N−1} w(n) x(n) e^{−j2πkn/N} |²,   0 ≤ k ≤ N/2
• Normalized to 96 dB, matching the MPEG-1 codec parameters:
PN(k) = P(k) − max_k P(k) + 96
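The two formulas above can be sketched in pure Python (a direct DFT with a small N for speed; the slides use N = 512):

```python
import cmath
import math

def psd_db(x):
    """Hanning-window a frame and return its PSD in dB, normalized so
    the maximum is 96 dB (the MPEG-1 reference level from the slide)."""
    N = len(x)
    w = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
    xw = [w[n] * x[n] for n in range(N)]
    P = []
    for k in range(N // 2 + 1):                     # 0 <= k <= N/2
        X = sum(xw[n] * cmath.exp(-2j * math.pi * k * n / N)
                for n in range(N))
        P.append(10 * math.log10(abs(X) ** 2 + 1e-12))  # avoid log(0)
    pmax = max(P)
    return [p - pmax + 96 for p in P]               # PN(k) = P(k) - max + 96

# A pure tone at bin 8 of a 64-sample frame peaks at exactly 96 dB:
frame = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]
P = psd_db(frame)
print(max(P))  # 96.0
```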
11. Step 2: Identification of tonal & noise (non-tonal) frequencies
• Find local maxima of the PSD
• A local maximum is a tonal frequency if it exceeds the neighboring components within a certain Bark distance by at least 7 dB
• Remaining maxima, not in the ±Δk range of a tonal frequency, are classified as noise frequencies
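A minimal Python sketch of the tonal classification, assuming a fixed ±2-bin neighborhood for simplicity (the MPEG-1 model actually widens Δk with frequency):

```python
def find_tonal(P, dk=2):
    """Classify PSD bins as tonal maskers: a local maximum that exceeds
    all neighbors within +/-dk bins by at least 7 dB.  A fixed dk is an
    assumption; the full model makes dk frequency-dependent."""
    tonal = []
    for k in range(dk, len(P) - dk):
        if P[k] <= P[k - 1] or P[k] <= P[k + 1]:
            continue                                  # not a local maximum
        neighbors = [P[k + d] for d in range(-dk, dk + 1)
                     if d not in (-1, 0, 1)]
        if all(P[k] - p >= 7 for p in neighbors):     # 7 dB prominence rule
            tonal.append(k)
    return tonal

# Bin 3 stands 18 dB above its neighborhood; the bump near bin 7-9 does not:
P = [0, 2, 1, 20, 2, 1, 0, 12, 11, 12, 0]
print(find_tonal(P))  # [3]
```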
14. Step 3: Thresholding and reorganization of maskers
• Any tonal/noise maskers below the absolute hearing threshold are discarded, i.e., those with
P_{TM,NM}(j) ≤ Tq(j)
where Tq(f) is the absolute threshold of hearing (the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment)
• Next, a sliding window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two
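Both pruning steps can be sketched in Python. The Terhardt approximation of Tq(f) below is an assumption, since the slides do not give a formula for the absolute threshold:

```python
import math

def abs_threshold_db(f_hz):
    """Terhardt's approximation of the absolute threshold of hearing
    Tq(f) in dB SPL (assumed; the slides give no explicit formula)."""
    f = f_hz / 1000.0
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def prune_maskers(maskers):
    """maskers: list of (bark, level_dB, freq_hz).  Drop maskers below
    Tq, then keep only the stronger of any pair closer than 0.5 Bark."""
    alive = [m for m in maskers if m[1] > abs_threshold_db(m[2])]
    alive.sort()                                   # ascending Bark
    kept = []
    for m in alive:
        if kept and m[0] - kept[-1][0] < 0.5:      # within 0.5 Bark
            if m[1] > kept[-1][1]:
                kept[-1] = m                       # stronger one wins
        else:
            kept.append(m)
    return kept

# The -20 dB masker is below Tq; the 8.0/8.3-Bark pair merges to the stronger:
ms = [(8.0, 60, 1000), (8.3, 70, 1050), (10.0, -20, 1400)]
print(prune_maskers(ms))  # [(8.3, 70, 1050)]
```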
16. Step 4: Individual masking thresholds
• Each individual threshold represents a masking contribution at frequency bin i due to the tone or noise masker located at bin j
• The tonal masker thresholds, T_TM(i, j), are given (in dB) by:
T_TM(i, j) = P_TM(j) − 0.275·z(j) + SF(i, j) − 6.025
• The noise masker thresholds, T_NM(i, j), are given (in dB) by:
T_NM(i, j) = P_NM(j) − 0.175·z(j) + SF(i, j) − 2.025
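These two formulas can be evaluated directly once SF(i, j) is chosen. The piecewise-linear spreading function below is the standard MPEG-1 psychoacoustic model 1 form, which we assume is the one the slides intend:

```python
def spread_db(dz, pj):
    """Piecewise-linear spreading function SF(i, j) in dB, as a function
    of the Bark separation dz = z(i) - z(j) and the masker level P(j).
    Assumption: the standard MPEG-1 psychoacoustic model 1 form."""
    if -3 <= dz < -1:
        return 17 * dz - 0.4 * pj + 11
    if -1 <= dz < 0:
        return (0.4 * pj + 6) * dz
    if 0 <= dz < 1:
        return -17 * dz
    if 1 <= dz < 8:
        return (0.15 * pj - 17) * dz - 0.15 * pj
    return -float("inf")               # negligible outside [-3, 8) Bark

def t_tm(p_tm_j, z_j, dz):
    """Tonal threshold: T_TM = P_TM(j) - 0.275 z(j) + SF - 6.025 dB."""
    return p_tm_j - 0.275 * z_j + spread_db(dz, p_tm_j) - 6.025

def t_nm(p_nm_j, z_j, dz):
    """Noise threshold: T_NM = P_NM(j) - 0.175 z(j) + SF - 2.025 dB."""
    return p_nm_j - 0.175 * z_j + spread_db(dz, p_nm_j) - 2.025

# A 70 dB tonal masker at 8 Bark, evaluated at its own bin (dz = 0):
print(round(t_tm(70, 8.0, 0.0), 3))  # 61.775
```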
19. Step 5: Global masking threshold
• The global masking threshold, Tg(i), is obtained by:
Tg(i) = 10·log10( 10^{0.1·Tq(i)} + Σ_{l=1}^{L} 10^{0.1·T_TM(i,l)} + Σ_{m=1}^{M} 10^{0.1·T_NM(i,m)} )   [dB]
• The global threshold for each frequency bin represents a signal-dependent, power-additive modification of the absolute threshold due to the spread of all tonal and noise maskers in the signal power spectrum
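The power-additive combination is a one-liner: convert each dB threshold to power, sum, and convert back.

```python
import math

def global_threshold(tq_i, ttm_list, tnm_list):
    """Tg(i) = 10 log10( 10^{0.1 Tq(i)} + sum_l 10^{0.1 T_TM(i,l)}
    + sum_m 10^{0.1 T_NM(i,m)} )  [dB]."""
    total = 10 ** (0.1 * tq_i)
    total += sum(10 ** (0.1 * t) for t in ttm_list)
    total += sum(10 ** (0.1 * t) for t in tnm_list)
    return 10 * math.log10(total)

# Two equal 40 dB contributions combine to 3 dB more, as expected of powers:
print(round(global_threshold(40, [40], []), 2))  # 43.01
```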
22. Wavelet representation of signal
g(t) = Σ_k c_{j0}(k)·2^{j0/2}·Φ(2^{j0} t − k) + Σ_k Σ_{j=j0}^{j1} d_j(k)·2^{j/2}·Ψ(2^{j} t − k)
• The audio signal is divided into non-overlapping frames of length 512 samples (≈11.6 ms at 44.1 kHz). Each frame is multiplied by a Hanning window of the same length to avoid border distortions
23. Wavelet decomposition - initial step
• Given a signal s of length N, the DWT consists of at most log2(N) stages.
• The first step produces 2 sets of coefficients: approximation coefficients cA1 and detail coefficients cD1.
24. Recursive wavelet decomposition
• The wavelet decomposition of the signal s analyzed at level j has the following structure: [cAj, cDj, ..., cD1].
25. Wavelet decomposition
• The wavelet transform coefficients are computed recursively using an efficient pyramid algorithm
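The pyramid algorithm of slides 23-25 can be sketched in Python with Haar filters as a stand-in for the Daubechies filters used later (MATLAB's wavedec computes the equivalent coefficient structure):

```python
import math

def dwt_step(s, h, g):
    """One analysis step: convolve with lowpass h / highpass g and
    downsample by 2 (periodic extension), giving (cA, cD)."""
    N, L = len(s), len(h)
    cA = [sum(h[k] * s[(2 * n + k) % N] for k in range(L))
          for n in range(N // 2)]
    cD = [sum(g[k] * s[(2 * n + k) % N] for k in range(L))
          for n in range(N // 2)]
    return cA, cD

def wavedec_sketch(s, h, g, level):
    """Pyramid algorithm: repeatedly split cA, returning the structure
    [cAj, cDj, ..., cD1] (a minimal sketch of MATLAB's wavedec)."""
    coeffs = []
    for _ in range(level):
        s, cD = dwt_step(s, h, g)
        coeffs.insert(0, cD)
    coeffs.insert(0, s)
    return coeffs

r = 1 / math.sqrt(2)
h, g = [r, r], [r, -r]                 # Haar analysis filters
out = wavedec_sketch([4.0, 2.0, 6.0, 8.0], h, g, 2)
print(out)  # [cA2, cD2, cD1], approx [[10.0], [-4.0], [1.414, -1.414]]
```

Orthonormality of the Haar pair means the coefficient energy equals the signal energy, which is what makes thresholding small coefficients safe.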
26. Wavelet: Vanishing moments
• We choose orthonormal wavelets
• A wavelet with K coefficients can have at most K/2 vanishing moments
• To ensure regularity (how fast the coefficients decay to zero), we choose a wavelet with a high number of vanishing moments
• Such wavelets are best suited for audio processing
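The at-most-K/2 claim can be checked numerically for the Daubechies-2 wavelet (K = 4 taps): its highpass filter annihilates constant and linear sequences but not quadratics, i.e., it has exactly 2 vanishing moments.

```python
import math

# Daubechies-2 (4-tap) scaling filter; its wavelet filter should have
# K/2 = 2 vanishing moments.
s3 = math.sqrt(3)
h = [(1 + s3) / (4 * math.sqrt(2)), (3 + s3) / (4 * math.sqrt(2)),
     (3 - s3) / (4 * math.sqrt(2)), (1 - s3) / (4 * math.sqrt(2))]
g = [(-1) ** n * h[3 - n] for n in range(4)]   # quadrature mirror filter

# p-th discrete moment of the highpass filter: sum_n n^p g[n]
moments = [sum((n ** p) * g[n] for n in range(4)) for p in range(3)]
print(moments)  # first two are ~0 (constants and ramps vanish), third is not
```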
28. Dependence of efficiency on choice of wavelet
• The type of wavelet basis has a significant impact on the efficiency of the coding scheme.
• We used the MATLAB function 'wavedec' to perform our 1-D wavelet decomposition.
• Compression was performed with 'wdencmp', using parameters obtained from 'ddencmp'.
• Compression efficiency was measured using two metrics returned by the function: PERFL0 and PERFL2.
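On our reading (an assumption; MATLAB's documentation is the authority), PERFL0 is the percentage of coefficients set to zero by thresholding and PERFL2 the percentage of L2 energy retained. A Python sketch of both scores:

```python
def compression_scores(coeffs, thr):
    """Return (perfl0, perfl2) for hard-thresholded coefficients:
    perfl0 = % of coefficients zeroed, perfl2 = % of energy retained.
    Assumed to mirror the scores MATLAB's wdencmp reports."""
    kept = [c if abs(c) > thr else 0.0 for c in coeffs]
    perfl0 = 100.0 * kept.count(0.0) / len(coeffs)
    perfl2 = 100.0 * sum(c * c for c in kept) / sum(c * c for c in coeffs)
    return perfl0, perfl2

# Two small coefficients are zeroed, yet almost all energy survives:
p0, p2 = compression_scores([10.0, 0.1, -0.2, 5.0], thr=1.0)
print(p0, round(p2, 2))  # 50.0 99.96
```

This is why the table below can show high PERFL2 alongside very different PERFL0 values: the energy concentrates in few coefficients.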
29. Dependence of efficiency on choice of wavelet
• Daubechies refers to a particular family of wavelets; the number refers to the number of vanishing moments.
• Simply put, the higher the number of vanishing moments, the smoother the wavelet (and the longer the wavelet filter).

Wavelet         PERFL0   PERFL2
Haar            31.01    99.93
Daubechies-2    64.63    99.95
Daubechies-10   65.59    99.97
30. Bit-rate reduction
• After testing the coder (with the Daubechies-10 wavelet) on 4 different music signals originally at 16 bits/sample (violin, drums, piano, Adele), we observed that the average number of bits required to encode them was around 7.5, i.e., we are able to attain more than a 50% reduction.
• We assumed one bit for every 6.02 dB of the maximum effective SNR of a frame
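The bit-allocation rule on this slide follows from 20·log10(2) ≈ 6.02 dB per bit, and reduces to a one-liner:

```python
import math

def bits_for_snr(snr_db):
    """Bit allocation rule from the slide: one bit per 6.02 dB of the
    frame's maximum effective SNR (each bit buys ~6.02 dB of SNR)."""
    return math.ceil(snr_db / 6.02)

# A frame whose masked SNR requirement is 45 dB needs 8 bits:
print(bits_for_snr(45.0))  # 8
```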
33. Subjective tests
• It is important to eliminate chance in listening tests, so we presented several stimuli of each source material to each listener. We also did not reveal to the listener the order in which the stimuli were presented (e.g., original, coder 1, coder 2, etc.).
• The figures indicate the coder provided transparent coding for all audio sources.
• Quality of the piano signal was not as good as the others because it contains long segments of nearly steady or slowly decaying sinusoids, which the wavelet-based coder did not seem to handle well

Sample   Avg. prob. original preferred   Sample size   Comments
Violin   0.25                            12            Transparent
Piano    0.50                            10            Nearly transparent
Drums    0.30                            10            Transparent
Adele    0.27                            15            Transparent
34. References
• [1] D. Sinha and A. Tewfik, "Low Bit Rate Transparent Audio Compression using Adapted Wavelets," IEEE Trans. ASSP, vol. 41, no. 12, Dec. 1993.
• [2] T. Painter and A. Spanias, "A review of algorithms for perceptual coding of digital audio signals," DSP-97, 1997.
• [3] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Commun. Pure Appl. Math., vol. 41, pp. 909-996, Nov. 1988.
• [4] ISO/IEC JTC1/SC29/WG11 MPEG, IS11172-3, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s, Part 3: Audio," 1992. ("MPEG-1")
• [5] R. Hellman, "Asymmetry of Masking Between Noise and Tone," Percep. and Psychophys., vol. 11, pp. 241-246, 1972.
• [6] M. Schroeder, et al., "Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear," J. Acoust. Soc. Am., pp. 1647-1652, Dec. 1979.
• [7] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, 1990.
• [8] C. Burrus, R. A. Gopinath and H. Guo, Introduction to Wavelets and Wavelet Transforms, Prentice-Hall, 1998.
Editor's Notes
Used to obtain compact digital representations of wideband audio signals for the purposes of efficient transmission or storage.
Central objective: to represent signal with minimum number of bits while achieving transparent signal reconstruction, i.e., generating output audio which cannot be distinguished from the original input, even by a sensitive listener
An audio compression scheme must exploit the two sources of irrelevancies and redundancies in audio signals:
the masking characteristics of the human hearing process and
the statistical redundancies in the signal
Our approach employs a wavelet-based coding method together with a psychoacoustic model to exploit perceptual masking and eliminate source redundancies
Masking also occurs in the time-domain.
In the context of audio signal analysis, abrupt signal transients (e.g., the onset of a percussive musical instrument) create pre- and post- masking regions in time during which a listener will not perceive signals beneath the elevated audibility thresholds produced by a masker
We consider the case of a single masking tone occurring
at the center of a critical band.
All levels in the figure are given in terms of dB.
A hypothetical masking tone occurs at some masking
level. This generates an excitation along the basilar
membrane which is modeled by a spreading function and
a corresponding masking threshold
Local maxima in the sample PSD which exceed neighboring components within a certain bark distance by at least 7 dB are classified as tonal.
Tonal maskers are then computed from the spectral peaks listed in ST as follows:
P_TM(j) = 10·log10 Σ_{k=−1}^{1} 10^{0.1·P(j+k)}   [dB]
A single noise masker for each critical band is computed from (remaining) spectral lines not within the Β±Ξk neighborhood of a tonal masker using a similar sum
where
PTM(j) denotes the tonal masker level in frequency bin j,
z(j) denotes the Bark frequency of bin j, and
SF(i, j), the spread of masking from masker bin j to maskee bin i, is a piecewise-linear function of the masker level P(j) and the Bark maskee-masker separation Δz = z(i) − z(j)
The audio signal is represented in terms of the translates and dilates of the scaling function (say Daubechies 10) as:
g(t) = Σ_k c_{j0}(k)·2^{j0/2}·Φ(2^{j0} t − k) + Σ_k Σ_{j=j0}^{∞} d_j(k)·2^{j/2}·Ψ(2^{j} t − k)
Such an expansion provides a multiresolution analysis of g(t). The choice of j0 sets the coarsest scale whose space is spanned by Ξ¦j0, k (t).
The audio signal is divided into non-overlapping frames of length 512 samples (≈11.6 ms at 44.1 kHz). Each frame is multiplied by a Hanning window of the same length to avoid border distortions
Restrictions: compact support wavelets, to create orthogonal translates and dilates of the wavelet and to ensure regularity (fast decay of coefficients controlled by choosing wavelets with large number of vanishing moments)
Given a signal s of length N, the DWT consists of log2 N stages at most.
First step produces 2 sets of coefficients: approximation coefficients CA1, and detail coefficients CD1.
More precisely, the first step is:
The next step splits the approximation coefficientsΒ cA1Β in two parts using the same scheme, replacingΒ sΒ byΒ cA1, and producingΒ cA2Β andΒ cD2, and so on