1) Equalizer matching involves finding the power spectrum of an example audio, then multiplying the input audio's magnitude spectrogram by a filter matching the example's power spectrum.
2) Noise matching involves denoising the input and example separately, then recombining their clean and noise components using the original signal-to-noise ratio.
3) Reverberation matching uses convolutive non-negative matrix factorization to decompose the input into a dry sound and a reverb kernel, then convolves the estimated dry input with the example's reverb kernel.
5. How? Signal Processing!
Preprocessing: the input and the example sounds are trimmed, resampled to 44.1 kHz, and normalized; each is then taken to the STFT domain, processed by the matching function, and returned to the time domain with the ISTFT to produce the result.
Notation: R is the hop size, L is the length of the signal, k is the frequency index, and w is the window function; frames are indexed by time.
(Smith, J.O., Spectral Audio Signal Processing, http://ccrma.stanford.edu/~jos/sasp/, online book, 2011 edition)
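The preprocessing chain above (trim, resample to 44.1 kHz, normalize, then STFT → function → ISTFT) can be sketched with plain numpy. This is a minimal analysis/synthesis pair, not the presenter's actual code; the Hann window, the 1024-sample frame, and the 512-sample hop R are illustrative choices.

```python
import numpy as np

def stft(x, win_len=1024, hop=512):
    """Window each frame of x and take its FFT.
    Rows are frequency bins k, columns are time frames."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[m * hop : m * hop + win_len] * w
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T          # shape: (bins, frames)

def istft(X, win_len=1024, hop=512):
    """Inverse FFT each frame, window again, and overlap-add,
    normalizing by the summed squared window (weighted overlap-add)."""
    w = np.hanning(win_len)
    frames = np.fft.irfft(X.T, n=win_len, axis=1)
    out = np.zeros((X.shape[1] - 1) * hop + win_len)
    norm = np.zeros_like(out)
    for m, frame in enumerate(frames):
        out[m * hop : m * hop + win_len] += frame * w
        norm[m * hop : m * hop + win_len] += w ** 2
    return out / np.maximum(norm, 1e-8)

# Round trip: interior samples come back almost exactly.
x = np.random.randn(44100)
y = istft(stft(x))
```

Any spectral processing (the "Function" box in the pipeline) happens between the two calls; the per-sample window normalization makes the round trip exact away from the signal edges.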
11. Denoising: Spectral Subtraction
Estimate a noise profile, then estimate the clean power spectrum by subtracting the scaled noise profile from the power spectrum of the Fourier transform of the noisy signal in one frame; the scale is the noise suppression factor. The model assumes additive stationary noise.
In practice,
• The noise profile is estimated over multiple frequency bands.
• Spectral subtraction fails in low-SNR regions by creating musical noise. This artifact is reduced by post-filtering the spectral subtraction output.
(Philipos C. Loizou, Speech Enhancement: Theory and Practice, 2013)
(Esch and Vary, Efficient Musical Noise Suppression for Speech Enhancement Systems, 2009)
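The subtraction rule on this slide can be sketched directly on a magnitude spectrogram. A minimal version assuming additive stationary noise; the over-subtraction factor `alpha` and the spectral floor are illustrative values, and the per-band refinements mentioned above are omitted.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_psd, alpha=2.0, floor=0.01):
    """Power spectral subtraction.
    noisy_mag: |Y| (bins x frames); noise_psd: estimated noise power per bin;
    alpha: noise suppression (over-subtraction) factor;
    floor: fraction of the noisy power kept, to limit musical noise."""
    noisy_psd = noisy_mag ** 2
    clean_psd = noisy_psd - alpha * noise_psd[:, None]
    clean_psd = np.maximum(clean_psd, floor * noisy_psd)   # spectral floor
    return np.sqrt(clean_psd)
```

The noise profile `noise_psd` would typically be estimated by averaging |Y|² over noise-only frames.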
13. Reverberation
Example: Falkland Palace Bottle Dungeon reverb (OpenAir database, www.openairlib.net).
A reverberant sound is the dry sound convolved with a reverb kernel. We approximate this in the magnitude STFT domain as a convolution between the time frames of the magnitudes X and H at each frequency index.
(R. Talmon, I. Cohen, and S. Gannot, "Relative transfer function identification using convolutive transfer function approximation," IEEE Trans. Audio, Speech, and Language Process., 2009)
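The approximation on this slide — convolution between time frames of the magnitudes X and H at each frequency index — is easy to state in code. A sketch, assuming both magnitudes are already on the same STFT grid:

```python
import numpy as np

def conv_mag_spectrograms(X_mag, H_mag):
    """Approximate reverberation in the magnitude STFT domain:
    for each frequency bin k, convolve the frame sequence of the dry
    magnitude X with the kernel magnitude H along time."""
    bins, frames = X_mag.shape
    _, taps = H_mag.shape
    Y = np.zeros((bins, frames + taps - 1))
    for k in range(bins):
        Y[k] = np.convolve(X_mag[k], H_mag[k])
    return Y
```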
15. Reverberation Matching 1
Dereverberate both sounds: the input decomposes into a dry sound A_dry with reverb kernel R_a, and the example into B_dry with reverb kernel R_b. ("Running out of letters!") The ideal case is a perfect decomposition of the reverb sounds into dry sounds and reverb kernels.
The focus is on decomposing magnitude spectrograms into magnitude spectrograms; the signals are taken back to the time domain using the reverberated input's phase information.
16. Convolutive Non-negative Matrix Factorization
The spectrogram V is modeled as a convolution of non-negative matrices, V ≈ Σ_t W_t · (H → t), where (H → t) is the shift operator applied to the activations H (columns moved right by t frames) and W_t holds the spectrum at time frame t of each basis. The multiplicative update equations also use a matrix of size Ly × k with all its elements set to 1.
(Paul O'Grady & Barak Pearlmutter, Convolutive NMF with a Sparseness Constraint, MLSP Conference, 2006)
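A compact numpy sketch of the convolutive NMF model with KL-divergence multiplicative updates, in the spirit of O'Grady & Pearlmutter (the paper's sparseness constraint is omitted here); matrix shapes and iteration counts are illustrative.

```python
import numpy as np

def shift(M, t):
    """Shift columns right by t (left if t is negative), zero-filling."""
    out = np.zeros_like(M)
    if t == 0:
        out[:] = M
    elif t > 0:
        out[:, t:] = M[:, :-t]
    else:
        out[:, :t] = M[:, -t:]
    return out

def conv_nmf(V, rank=2, taps=3, iters=300, eps=1e-9):
    """Convolutive NMF: V (bins x frames) ~= sum_t W[t] @ shift(H, t),
    fit with KL-divergence multiplicative updates."""
    rng = np.random.default_rng(0)
    W = rng.random((taps, V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    ones = np.ones_like(V)                       # the all-ones matrix
    for _ in range(iters):
        Lam = sum(W[t] @ shift(H, t) for t in range(taps)) + eps
        R = V / Lam
        for t in range(taps):                    # update each basis slice
            Ht = shift(H, t)
            W[t] *= (R @ Ht.T) / (ones @ Ht.T + eps)
        Lam = sum(W[t] @ shift(H, t) for t in range(taps)) + eps
        R = V / Lam
        num = sum(W[t].T @ shift(R, -t) for t in range(taps))
        den = sum(W[t].T @ ones for t in range(taps))
        H *= num / (den + eps)                   # averaged over all taps
    return W, H
```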
17. Dereverberation
• Initialize the dry-sound factor with positive random values.
• Initialize the reverb kernel with positive exponential decays.
• On each iteration, enforce anti-sparsity on the estimated factors.
(Indices and absolute values are dropped here, but they're there.)
18. Set of dry speech bases (trained offline)
The dry speech bases come with a corresponding activation matrix; what we observe is the reverberated activation matrix, and dereverberation recovers the clean one. We can do better by using more prior knowledge: convolution is associative, so the reverb kernel R can be averaged over multiple frequency bands.
(Paris Smaragdis, "Convolutive speech bases and their application to supervised speech separation," IEEE Trans. Audio, Speech, and Language Processing, 2007)
22. Summary
• Equalizer matching: find the power spectrums => find the EQ filter to match them => multiply the EQ filter with every time frame in the input sound's magnitude spectrogram.
• Noise matching: denoise => EQ-match the estimated clean and noise signals individually => add the resulting input noise to the resulting clean signal using their original SNR.
• Reverberation matching: decompose into dry sounds and reverb kernels => convolve the estimated dry input sound with the example sound's estimated reverb kernel.
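The first summary item (equalizer matching) fits in a few lines. A sketch: average the power spectra over time, build a per-bin EQ gain, and apply it to every frame of the input magnitude spectrogram.

```python
import numpy as np

def eq_match(input_mag, example_mag, eps=1e-8):
    """Match the input's average power spectrum to the example's.
    Both arguments are magnitude spectrograms (bins x frames)."""
    input_psd = np.mean(input_mag ** 2, axis=1)
    example_psd = np.mean(example_mag ** 2, axis=1)
    gain = np.sqrt(example_psd / (input_psd + eps))  # EQ filter, one gain per bin
    return input_mag * gain[:, None]                 # applied to every frame
```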
26. Spectral Subtraction
The noisy signal is the clean signal plus the noise. A common assumption in most papers: the noise and the clean signal are uncorrelated, so their power spectra add. Working with the Fourier transform over a segment of x(n), the noise is modeled as AWGN, the same over all clean input segments, and its PSD is estimated from the signal. In practice, the noise suppression factor H is learned over different frequency bands.
(Philipos C. Loizou, Speech Enhancement: Theory and Practice, 2013)
27. Musical Noise Reduction
(Esch and Vary, Efficient Musical Noise Suppression for Speech Enhancement Systems, 2009)
Aim: retain the naturalness of the remaining background noise. How?
1. Detect low-SNR frames based on the noisy signal and the estimated clean signal.
2. Design a smoothing window based on 1: the lower the SNR, the longer the window.
3. Design a post-filter to smooth the low-SNR frames, i.e. an FIR low-pass filter designed based on 2.
4. Element-wise multiply the noise suppression factor by the smoothing from 2 and 3.
The result is an enhanced spectral subtraction.
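The four steps above can be sketched as a post-filter on the suppression gain. The 10 dB threshold and the SNR-to-window-length mapping below are made-up illustration values, not the ones from Esch and Vary:

```python
import numpy as np

def smooth_low_snr_frames(G, snr_db, max_len=9):
    """Smooth the suppression gain G (bins x frames) in low-SNR frames:
    the lower the frame SNR, the longer the FIR smoothing window."""
    G = G.copy()
    for m, snr in enumerate(snr_db):
        if snr >= 10:                               # high-SNR frame: untouched
            continue
        # Map lower SNR to a longer (odd) window length in [3, max_len].
        frac = np.clip((10 - snr) / 20.0, 0.0, 1.0)
        L = 3 + 2 * int(frac * (max_len - 3) / 2)
        w = np.ones(L) / L                          # FIR low-pass (moving average)
        G[:, m] = np.convolve(G[:, m], w, mode="same")
    return G
```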
28. SS + Musical Noise Reduction
Spectrogram comparison: the noisy input at SNR = 22 dB versus spectral subtraction with the combined gain G .* H and the musical-noise-suppression post-filter. Much better!
29. Metrics for Ideal Reverberation
• Energy Decay Curve (EDC): remaining energy in dB (magnitude) against time.
• Energy Decay Relief: the EDC at multiple frequency bands.
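The energy decay curve is computed from a measured impulse response by Schroeder backward integration; applying it band by band gives the energy decay relief. A sketch:

```python
import numpy as np

def energy_decay_curve(h):
    """Schroeder backward integration: the energy remaining in the
    impulse response after each sample, in dB relative to the total."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]
    return 10 * np.log10(energy / energy[0] + 1e-12)
```

For an exponentially decaying response the EDC is a straight line in dB, whose slope gives the reverberation time.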
30. Reverberation Model
• Time-domain statistical model: h(t) = b(t) e^(−t/τ), where b(t) is a zero-mean Gaussian noise and the decay constant τ is related to the reverberation time.
• Reverberation time RT60: the length of time for the level to drop 60 dB below the original level.
• Sabine formula: RT60 = 0.161 V / A, where V is the volume of the enclosure and A = Σᵢ Sᵢ αᵢ is the effective absorbing area (Sᵢ: area of each wall; αᵢ: its absorption coefficient).
• Reflection coefficients: β = √(1 − α).
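Sabine's formula as stated above translates directly. A sketch, with `surfaces` given as (area, absorption coefficient) pairs:

```python
def sabine_rt60(volume_m3, surfaces):
    """Sabine formula: RT60 = 0.161 * V / A, where the effective
    absorbing area A sums each wall's area times its absorption
    coefficient."""
    A = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / A
```

For a 5 m x 4 m x 3 m room with a uniform absorption coefficient of 0.1 (94 m² of walls, A = 9.4 m²), this gives an RT60 of about 1.03 s.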
31. Image Source Method
Each wall mirrors the source into a mirror image of the original source, so the actual reflected path from source to microphone is replaced by a straight perceived path from the image; every image source in turn produces another image source for the higher-order reflections. The model's parameters control which image source appears in which dimension, the reflection coefficients of the six surfaces of the rectangular room, and the time delay of each considered image source.
(Allen, J. and Berkley, D., "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, Vol. 65, No. 4, pp. 943–950, 1979)
(Pictures from: Alex Tu, Reverberation simulation from impulse response using the Image Source Method)
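A toy version of the idea, keeping only the direct path and the six first-order mirror images of a rectangular room (the full Allen & Berkley method enumerates higher-order images and per-wall reflection coefficients); the sample rate and the single shared reflection coefficient `beta` are illustrative.

```python
import numpy as np

def first_order_ir(room, src, mic, beta=0.8, fs=8000, c=343.0, length=1024):
    """Direct path plus the six first-order image sources of a
    rectangular room with corner at the origin."""
    room, src, mic = map(np.asarray, (room, src, mic))
    images = [(src.astype(float), 1.0)]         # the actual source
    for axis in range(3):
        for wall in (0.0, room[axis]):          # mirror across each wall
            img = src.copy().astype(float)
            img[axis] = 2 * wall - src[axis]
            images.append((img, beta))          # one reflection -> one beta
    h = np.zeros(length)
    for pos, gain in images:
        d = np.linalg.norm(pos - mic)
        n = int(round(d / c * fs))              # propagation delay in samples
        if n < length:
            h[n] += gain / (4 * np.pi * d)      # 1/r spherical spreading
    return h
```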
32. Non-Negative Matrix Factorization
V ≈ W H, with all entries of V, W, and H non-negative. Applying gradient descent under positive initial conditions for W and H, with a 'clever' (adaptive) learning rate, results in the following multiplicative update rules for the least-squares cost (Lee and Seung, 1999):
H ← H ⊗ (Wᵀ V) ⊘ (Wᵀ W H),  W ← W ⊗ (V Hᵀ) ⊘ (W H Hᵀ),
followed by normalizing the columns of W.
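The update rules fit in a few lines of numpy; this is the least-squares variant of the Lee & Seung multiplicative updates, with the column normalization of W folded back into H so the product W @ H is unchanged.

```python
import numpy as np

def nmf(V, rank, iters=500, eps=1e-9):
    """Multiplicative updates for V ~= W @ H under the least-squares
    cost; all entries stay non-negative throughout."""
    rng = np.random.default_rng(1)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        norm = W.sum(axis=0, keepdims=True)     # normalize columns of W,
        W /= norm + eps
        H *= norm.T                             # folding the scale into H
    return W, H
```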
33. Why NMF? (Lee and Seung, 1999)
• Visually meaningful: the decomposition can only be positive, giving a parts-based representation.
• Statistically meaningful alternatives behave differently: eigenfaces lie in the directions of the largest variance, and subtraction can occur.
39. Spectral Subtraction
Noisy input at SNR = 22 dB versus the denoised-ish result: musical noise appears, mainly in the low-SNR regions. H is the noise suppression factor; to go back to the time domain, use the noisy input's phase.
40. With and Without Musical Noise Reduction
Spectrograms of the noisy signal at SNR = 22 dB, with and without musical noise suppression: same results, better colormap?
Editor's Notes
Why? What?
Fix the STFT equations
Use the powerful yet so simple equalizer matching to do denoising as well.
Well, now we can't ignore time here anymore. Reverbs are usually longer than a time frame and act in a convolutive manner. FIR filtering here gives you too many taps, and even when inverting you have to deal with whether it's minimum phase and invertible and …
Use a sound that gives you a less artifact-prone result.
Def of conv2
Put some sounds here…
If you're interested, I designed a user interface to play with.
Might want to get rid of details and only show some to intrigue questions
The energy decay curve at a given time is the total amount of signal energy remaining in the reverberator impulse response at that time.
(Smith, J.O. "Delay Lines", in Physical Audio Signal Processing, http://ccrma.stanford.edu/~jos/pasp/Delay_Lines.html, online book, 2010 edition)