Poster cs543

Ramin Anushiravani – CS 543 Spring 2015
Department of Electrical and Computer Engineering, College of Engineering, University of Illinois at Urbana-Champaign
Audio Enhancement: A Computer Vision Approach
Aim
Introduction
Many audio enhancement projects can be simplified by some form of
user interface. One example, studied in this project, is removing a
specific, user-selected noise from a recording. To illustrate the
goal, imagine recording a live concert or a lecture when someone’s
cellphone rings in the middle of it. There is no easy way to
identify the ringtone automatically as an undesired noise. One
heuristic way of removing the ringtone is to identify every
time-frequency bin of the ringtone in the spectrogram and remove
them, much like editing an image in Adobe Photoshop. With some help
from the user, however, we can automate the process of
removing/transferring most sounds from most recordings based on the
similarity between the two in the image spectrogram. The algorithm
developed in this project asks the user to mimic the noise that
they want to remove from the recording. It then looks for the
closest match to the user’s input in the time-frequency spectrogram
of the noisy recording.
Motivation
Since a spectrogram shows us what a sound looks like in the
time-frequency domain, we can think of editing the spectrogram of a
sound as editing an image. Inspired by this idea, I decided to
apply a computer vision method, object recognition, to remove a
chosen noise from a recording. The problem of finding a specific
noise in a noisy recording is therefore analogous to the problem of
finding a cat in an image that contains one or more cats. This is
illustrated in the next section.
Both scanning methods can be sped up using the following trick,
which is close to the idea behind Viola-Jones features. By
convolving the image with the noise object image, we get a rough
idea of where the object is, and so we can limit the scanning of
the image to those areas (white areas in the figure below). In
effect, a weak classifier is used first to discard many of the
patches.
• HOG Features
HOG features are descriptors that capture the edge orientations of
an image within cells of a defined size, and they are robust to
small geometric and photometric changes. HOG features are mainly
known for object detection applications in computer vision. Since
they require very careful tuning and normalization, I used an
external library, VLFeat [2], to compute them. In this project I
used a cell size of 8 and extracted the HOG features from a
grayscale image (instead of an RGB color image).
After extracting the HOG features from each window in the noisy
image and from the defined noise object, we must check which
patches are most similar to the noise object.
Classification
In order to classify each patch of the image, I used two different
methods.
1- K-Nearest Neighbor. Vectorize all the HOG features of the image
into one big matrix. The error function used in K-NN is a Euclidean
distance,
$$\mathrm{error} = \left( \mathrm{vec}(\mathrm{noise}_{hog})^{2} - \mathrm{vec}(\mathrm{image}_{hog})^{2} \right)$$
This error function produced many misclassifications, so I propose
the following error function for better accuracy.
2- The modified error function is as follows,
$$\mathrm{error} = \frac{\lVert \mathrm{noise}_{hog} - \mathrm{image}_{hog} \rVert_{\ell_2}}{\lVert \mathrm{noise}_{hog} \rVert_{\ell_2}}$$
The latter error function gave much better accuracy in localizing
the noise object.
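The two error functions can be sketched as follows. This is a minimal illustration rather than the project's code: it assumes each HOG descriptor has already been flattened into a plain list of numbers, it reads "Euclidean distance" as the ℓ2 norm of the difference between the two vectors, and the function names (`euclidean_error`, `relative_error`, `best_patch`) are hypothetical.

```python
import math


def euclidean_error(noise_hog, image_hog):
    """Plain Euclidean (l2) distance between two flattened HOG vectors."""
    return math.sqrt(sum((n - p) ** 2 for n, p in zip(noise_hog, image_hog)))


def relative_error(noise_hog, image_hog):
    """Modified error: l2 distance normalized by the norm of the noise object."""
    norm = math.sqrt(sum(n * n for n in noise_hog))
    if norm == 0:
        raise ValueError("noise object descriptor has zero norm")
    return euclidean_error(noise_hog, image_hog) / norm


def best_patch(noise_hog, patch_hogs, error_fn=relative_error):
    """Return the index of the patch with the smallest error (1-nearest neighbor)."""
    errors = [error_fn(noise_hog, patch) for patch in patch_hogs]
    return min(range(len(errors)), key=errors.__getitem__)
```

Normalizing by the norm of the noise object makes the error a relative quantity, so a single threshold can be reused across objects with different overall gradient energy.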
For example, even though
the audio samples are still
in the spectrogram, we can
barely see the pixels of the
clean signal or the desired
noise.
In the extraction formula, 𝑖𝑛𝑑 𝑦 is a two-element vector with the
start and end y-positions of the spectrogram, w is the width of the
image, and the prime (′) operator denotes the gradient of the image
with respect to the x and y positions. α is a threshold factor
greater than one that determines the major peaks in the mean
gradient. The same procedure can be applied to the transpose of the
image, summing over the height of the image, to extract the start
and end x-positions.
I used a 1024-sample Hann window with 25% overlap to compute the
STFT, and overlap-add for the inverse STFT. For the colormap, I
used Matlab's “hot” colormap raised to the power of 0.35.
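The framing step can be sketched as below. This is only an illustration under stated assumptions: "25% overlap" is taken to mean a hop of 75% of the window length, the Hann window is the periodic variant, and a naive DFT is used so that the example needs no external library (a real implementation would use an FFT). The function names are hypothetical.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (assumed variant)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitude(samples, win_len=1024, overlap=0.25):
    """Magnitude STFT: one list of win_len // 2 + 1 bins per frame."""
    hop = max(1, int(win_len * (1 - overlap)))
    window = hann(win_len)
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        seg = [samples[start + i] * window[i] for i in range(win_len)]
        spectrum = []
        for k in range(win_len // 2 + 1):
            # Naive DFT of the windowed segment at bin k
            acc = sum(
                seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

Each inner list is one column of the spectrogram image; stacking the frames side by side gives the time-frequency picture that the rest of the pipeline treats as an image.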
Object Extraction
When a user is asked to mimic the noise in a noisy signal, there
might be some background noise and most probably many frequencies
that do not correspond to the actual desired noise. In order to
create a better object, stationary noise in the mimicked recording
is removed using a very strong Spectral Subtraction algorithm [1].
A threshold is then applied to extract just enough pixel
information from the mimicked sound to use as an object. This is
illustrated below.
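As a rough illustration of this step, the sketch below applies basic magnitude spectral subtraction followed by a threshold. It is not the algorithm of [1] and not the project's implementation. It assumes the spectrogram is a list of frames of magnitudes, that the first few frames contain only background noise, and that the threshold is a fraction of the peak value; all names and default values are hypothetical.

```python
def spectral_subtract(mag, noise_frames=2, over_subtraction=2.0):
    """Subtract a scaled per-bin noise estimate and clip negative values to zero."""
    n_bins = len(mag[0])
    # Noise estimate: per-bin mean over the first frames (assumed noise-only)
    noise = [
        sum(frame[k] for frame in mag[:noise_frames]) / noise_frames
        for k in range(n_bins)
    ]
    return [
        [max(frame[k] - over_subtraction * noise[k], 0.0) for k in range(n_bins)]
        for frame in mag
    ]


def threshold_mask(mag, fraction=0.5):
    """Binary mask keeping only bins at or above a fraction of the peak magnitude."""
    peak = max(max(frame) for frame in mag)
    if peak == 0:
        return [[0 for _ in frame] for frame in mag]
    return [[1 if v >= fraction * peak else 0 for v in frame] for frame in mag]
```

An over-subtraction factor greater than one removes more than the estimated noise, which matches the idea of a "very strong" subtraction at the cost of also removing weak parts of the mimicked sound.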
The resulting objects for the case of 50% overlap are shown here.
The score above each object is the value of the latter error
function. The resulting object for the 12.5% overlap scanning is
similar.
Non-Maximum Suppression
The purpose of NMS is to check whether the objects found in the
image overlap. If they do, we pick the one with the highest score;
if they do not overlap much, we keep both. The figure below shows
the amount of overlap between each patch and the resulting object.
The entries on the diagonal compare each patch with itself.
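A greedy version of this procedure can be sketched as follows. This is a generic illustration, not the project's NMS: boxes are assumed to be `(x, y, w, h)` tuples, overlap is measured as intersection over union, the 0.5 limit is an arbitrary default, and scores are assumed to be "higher is better" (an error value would need to be negated first).

```python
def overlap_ratio(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def non_max_suppression(boxes, scores, max_overlap=0.5):
    """Keep the best-scoring box of each overlapping group; return kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(overlap_ratio(boxes[i], boxes[j]) <= max_overlap for j in keep):
            keep.append(i)
    return keep
```

Calling `overlap_ratio` on every pair of boxes yields an overlap matrix like the one in the figure, with ones on the diagonal.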
[Figure: example object and noisy image]
The user gives us an example object: in the case of images, an
example image, and in the case of sounds, an example sound (which
the user can also mimic). We can then localize the noise in the
noisy signal using object recognition algorithms.
[Figure: noise mimicked by the user; noisy spectrogram “image”]
When an image is saved in Matlab, a white area around the image,
including the titles, is saved as well. To extract just the
spectrogram, we can do the following.
[Figure: user-mimicked noise; after Spectral Subtraction; final object]
• Vectorized method
There is also a vectorized way of finding the most likely object
without having to scan the image, using an integral image and the
2D Fourier transform to speed up the recognition. This is discussed
in detail in the paper.
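For readers unfamiliar with integral images, the sketch below shows the basic idea: after one pass over the image, the sum of any rectangular window can be read off in constant time. How the project combines this with the 2D Fourier transform is described in the paper and is not reproduced here; the function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the w-by-h window whose top-left corner is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```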
Pre-processing
From Sound Samples to Image Pixels
When visualizing an audio signal, a time-domain representation
does not tell us much about what is going on in the signal. A
better visualization can be obtained through the Short-Time Fourier
Transform (STFT). Since the purpose of this project is to treat
audio as just another image, we should choose a colormap that makes
sense visually.
$$ind_y = \arg\max_i \left( \frac{\sum_{i=1}^{w} \mathrm{image}'}{w} > \frac{\sum_{i=1}^{w} \mathrm{image}'}{\alpha w} \right)$$
Object Recognition
A common object recognition pipeline follows these steps:
• Scan the image with a fixed window at different scales.
• Extract Histogram of Oriented Gradients (HOG) features from each
patch.
• Score each patch by comparing it to the object HOG features.
• Perform Non-Maximum Suppression.
The object recognition algorithm in this project also follows these
steps, but the user interface gives us a few advantages. Since the
user is asked to mimic the noise in the noisy signal, we know how
long the noise is and approximately which frequencies are most
important (ideally the fundamental frequency). As a result, we know
the size of the search window (w, h) and do not need to search the
image spectrogram at different scales.
Scanning the Image
Scanning the image with overlapping windows can be very time
consuming, depending on the implementation, and the scanning
strategy can also greatly affect the accuracy of the algorithm. I
tried multiple ways of scanning the image spectrogram, listed
below.
1- At each position, extract four windows with 50% overlap.
2- Extract windows in a row from the image with 12.5% overlap.
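The effect of the overlap on the number of windows can be seen with a small sketch. It assumes the overlap is expressed as a fraction of the window size along each axis and that windows are listed row by row; the function name and arguments are hypothetical and do not reproduce the exact "four windows per position" scheme of the first method.

```python
def window_positions(image_w, image_h, win_w, win_h, overlap):
    """Top-left corners of fixed-size windows scanned with a given overlap."""
    step_x = max(1, int(win_w * (1 - overlap)))
    step_y = max(1, int(win_h * (1 - overlap)))
    return [
        (x, y)
        for y in range(0, image_h - win_h + 1, step_y)
        for x in range(0, image_w - win_w + 1, step_x)
    ]
```

For a 32-by-8 image and an 8-by-8 window, a 12.5% overlap gives four windows per row, while a 50% overlap gives seven, so a larger overlap means more patches to describe and score.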
One patch of the noisy signal
Synthesize and Voila!
When resynthesizing the sound, we can either multiply the
spectrogram of the sound by the mask and get rid of the whole
object (right), or subtract only the noise template within the mask
from the signal (left).
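The two options can be sketched on a magnitude spectrogram as follows. This is an illustration under assumptions: the mask is taken to be 1 inside the detected object and 0 elsewhere, the template has the same shape as the spectrogram, negative magnitudes are clipped to zero, and the function names are hypothetical. Phase handling and the inverse STFT are omitted.

```python
def remove_object(spec, mask):
    """Zero every bin inside the mask (removes the whole object region)."""
    return [
        [0.0 if m else v for v, m in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(spec, mask)
    ]


def subtract_template(spec, mask, template):
    """Subtract the noise template only inside the mask, clipping at zero."""
    return [
        [max(v - t, 0.0) if m else v for v, m, t in zip(spec_row, mask_row, tmpl_row)]
        for spec_row, mask_row, tmpl_row in zip(spec, mask, template)
    ]
```

Zeroing the region also removes any clean signal that shares those bins, whereas subtracting the template keeps whatever energy exceeds the template.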
Ideally, we would hope to subtract all of the noise without
subtracting any of the signal. For future work, I suggest looking
into ways to predict the most likely pixels inside the removed
noise object. In addition, when localizing a deformed object (when
the user cannot mimic the noise accurately), it is important to
look for techniques that take this deformation into account.
I then extracted the object with the highest overlap (these objects
already have the highest scores). The resulting object and its mask
are shown below. This result was improved by using the 12.5%
overlap and a stronger NMS, which is discussed in the paper.
[Figure: time-domain and spectrogram views]
References
[1] Y. Ephraim and D. Malah, "Speech enhancement using a minimum
mean-square error short-time spectral amplitude estimator," IEEE
Trans. Acoustics, Speech, Signal Processing, vol. 32, pp. 1109-1121,
Dec. 1984.
[2] A. Vedaldi and B. Fulkerson, "VLFeat: An Open and Portable Library
of Computer Vision Algorithms," 2008, http://www.vlfeat.org/
