3D AUDIO RECONSTRUCTION AND SPEAKER RECOGNITION USING SUPERVISED
LEARNING METHODS BASED ON VOICE AND VISUAL CUES
Ramin Anushiravani†, Marcell Vazquez-Chanlatte∗‡, Faraz Faghri∗
University of Illinois at Urbana-Champaign
† Dept. of Electrical and Computer Engineering
∗ Dept. of Computer Science
‡ Dept. of Physics
ABSTRACT
In this paper we present a creative approach to recon-
struct 3D audio for multiple sources from a single chan-
nel input by detecting and tracking visual cues using
supervised learning methods. We also discuss a similar
approach for improving speaker classification from a
video stream by employing both facial and speech likelihoods,
or simply Multimodal Speaker Recognition on
a video stream.
Index Terms— 3D audio, speaker classification, vi-
sual cues, supervised learning, Multimodal Speaker
Recognition
1. INTRODUCTION
Spatial audio and speaker classification both have important
applications in video conferencing and entertainment.
With spatial audio, one can listen to music
and feel the depth and the directionality of the sound
using headphones or crosstalk-canceller speakers [1].
Imagine hearing a tank approaching you from the living
room when playing a high-resolution video game.
3D audio has yet to achieve these ambitious goals,
though it can still create a more realistic sound field
than surround sound. 3D audio is normally recorded
through binaural microphones mounted on a KEMAR
dummy head [2]. One can also reconstruct 3D audio using
Head-Related Transfer Functions (HRTFs) by forming
beams in every direction using microphone arrays [3].
These techniques require precise calibration and are not
very accurate in real environments due to noise and
reverberation. Most importantly, if a video was not recorded
using microphone arrays or binaural microphones, it is
very difficult, if not impossible, to recover spatial audio.
That is why we propose a method that uses the content
of the video, the visual cues, to help reconstruct 3D audio.
The other goal of this project is to perform multimodal
speaker recognition using facial and speech likelihoods,
by first learning different features from a training
video using a user-friendly calibration procedure
discussed in this paper. In section 2, we describe the
general flow of the 3D audio reconstruction and speaker
recognition algorithms. In section 3, face detection,
classification, and tracking are discussed in detail.
In section 4, voice activity detection (VAD) and speech
classification algorithms are explained thoroughly.
In section 5, we go over the results for 3D audio
reconstruction and multimodal speaker recognition.
Finally, in section 6, we suggest future possibilities
for our project in other systems and applications.
2. GENERAL FLOW
In this section, the basic procedures for reconstructing
3D audio and performing multimodal speaker recognition
are discussed.
2.1. 3D Audio Reconstruction
One can reconstruct 3D audio by simply convolving
a mono sound with the spatial response corresponding
to a desired location in space. These impulse responses
can be captured by recording a maximum length sequence
(MLS) at the listener's ears at different angles. We
can then extract the corresponding impulse responses by
cross-correlating the recorded MLS with the original
MLS, as shown in equation 1. These spatial responses
are also called Head-Related Impulse Responses (HRIRs).
index = argmax(C(MLS_recorded, MLS_original))   (1a)
hrir = C(MLS_recorded, MLS_original)[index − L : index + L]   (1b)
3Daudio = signal_mono ∗ hrir   (1c)
where L is half the desired impulse response length.
This number is usually about 128 samples, but it can
differ based on the recording room, e.g. on the early
reflection sample index. C(x, y) represents the cross-correlation
between x and y. In this project we used the
MIT HRTF database for reconstructing spatial audio [4].
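To make equation 1 concrete, the following Python sketch (not the authors' MATLAB code) estimates an HRIR from a recorded MLS and spatializes a mono signal; the function names, the peak normalization, and the use of scipy are our own choices.

import numpy as np
from scipy.signal import correlate

def extract_hrir(mls_recorded, mls_original, L=128):
    # Cross-correlate the recorded MLS with the original MLS (eq. 1a)
    # and keep 2*L samples around the correlation peak (eq. 1b).
    c = correlate(mls_recorded, mls_original, mode="full")
    index = int(np.argmax(np.abs(c)))
    hrir = c[max(index - L, 0): index + L]
    return hrir / np.max(np.abs(hrir))   # optional normalization for playback

def spatialize(mono, hrir_left, hrir_right):
    # Equation (1c): convolve the mono signal with the left/right HRIRs.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)   # two-channel (binaural) output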
In order to avoid clicking sounds when reconstructing
3D audio, and also for simplicity, we did the reconstruction
in the frequency domain. In short, one must first
take the short-time Fourier transform (STFT) of the signal,
multiply it by the desired zero-padded HRTF,
and take the inverse STFT to recover the time-domain signal,
as shown in figure 1.
Fig. 1. 3D Audio Reconstruction
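The frequency-domain reconstruction can be approximated by block-wise fast convolution with overlap-add, which avoids clicks at frame boundaries. This sketch is our own approximation of the procedure in figure 1, not the original code; frame_len and the list of per-frame HRIRs are assumed inputs.

import numpy as np

def spatialize_framewise(mono, hrirs_per_frame, frame_len):
    # Each non-overlapping audio frame is convolved with the HRIR chosen for
    # that frame by multiplying zero-padded FFTs; the convolution tails are
    # overlap-added so the apparent position can change frame by frame.
    mono = np.asarray(mono, dtype=float)
    max_h = max(len(h) for h in hrirs_per_frame)
    n_fft = int(2 ** np.ceil(np.log2(frame_len + max_h - 1)))
    out = np.zeros(len(mono) + n_fft)
    for k, hrir in enumerate(hrirs_per_frame):
        start = k * frame_len
        block = mono[start:start + frame_len]
        if block.size == 0:
            break
        X = np.fft.rfft(block, n_fft)
        H = np.fft.rfft(hrir, n_fft)              # zero-padded HRTF
        out[start:start + n_fft] += np.fft.irfft(X * H, n_fft)
    return out

# left = spatialize_framewise(mono, left_hrirs, frame_len)
# right = spatialize_framewise(mono, right_hrirs, frame_len)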
We made a few assumptions to make this project
doable in our limited timeline.
Assumption
1. This algorithm is only able to spatialize speech
based on a speaker’s face.
2. There are at most two speakers in the video.
3. At least one of the two speakers in the room is in
the training database.
4. There are no sudden movements in the video
stream.
Most of these assumptions can be relaxed even further,
some of which are discussed in section 6. For simplicity,
we used the following video clip for testing our
applications. We recorded a clip of two people
sitting on the left and right of the video frame having a
conversation and used that video for reconstructing 3D audio
and speaker recognition, as shown in figure 2. We also
propose a calibration procedure that requires some user
interaction for training the facial and speech databases
for each user.
2.2. Calibration
In order to reduce the algorithm's complexity, we
employ a user-friendly interface when collecting a
training database. Each user is asked to perform two
tasks at least once, which takes about a minute or so.
This ensures an accurate training dataset. In practice,
one might have to repeat this procedure if the lighting
or the background sound changes, for both 3D audio
reconstruction and speaker recognition. A bigger and
better training database would obviously give better
classification results, which is the essence of this project.
Fig. 2. Recording of two people having a conversation
followed by a depiction of 3D audio and speaker recognition
Calibration
1. Clap hands and wave to the camera.
2. Speak for about a minute while facing the camera.
This calibration procedure is repeated until another
clap sound is detected, which means there is another
user whose database needs to be collected as well. This
calibration procedure helps us particularly in labeling
the training database. This is summarized in figure 3.
Fig. 3. Calibration
2.3. 3D Audio Reconstruction
After collecting a training database, we proceed to build
the classifiers, which are discussed in more detail in sections
3 and 4. We wanted to test our classifier results on
different applications, and one that came to mind was
reconstructing spatial audio from a single channel input.
The idea is to find out when and where the speaker is
talking and use the location (HRTF angle) and the time
(frame numbers) to create 3D audio without having to
use microphone arrays. The procedure for reconstructing
3D audio is as follows.
3D audio
1. Detect faces in the video stream in each frame and
classify them against the database.
2. Map the position of each face to a meaningful
HRTF angle in each frame (a sketch of this mapping
follows this list).
3. Detect whether there is speech in a frame or not.
4. Label the frames with speech from step 3 with their
corresponding class label in the database.
5. Use the speech classification results to assign the
speech from step 4 to the location found in step 2.
That is, given a speech signal, find the corresponding
face location.
6. Pick the corresponding HRTF from step 5 for each
frame.
7. Reconstruct 3D audio as shown in figure 1.
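As a rough illustration of step 2, the sketch below maps a detected face's horizontal pixel position to an HRTF azimuth. The linear mapping, the ±80° range, and the 5° grid are our own assumptions; the paper only states that face positions are mapped to meaningful HRTF angles.

def face_x_to_azimuth(x_center, frame_width, max_angle=80.0, grid=5):
    # 0 maps to the far left of the frame, frame_width to the far right.
    normalized = (x_center / frame_width) * 2.0 - 1.0     # in [-1, 1]
    azimuth = normalized * max_angle
    # Snap to the nearest angle available in the HRTF database.
    return int(grid * round(azimuth / grid))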
2.4. Speaker Recognition
Consider the same video from section 2.1. We would
like to be able to identify each user using their facial and
speech features. If the speaker is not talking, then the
recognition system should put more weight towards the
facial classifier. If the speaker’s face cannot be detected,
then the classification algorithm only uses the speech
classifier. If both face and speech are available, then the
classification algorithm will use both with some weight-
ing on each model to determine and identify each user
for maximizing the user recognition likelihood given the
training database [9].
P(user) = P(face | model)^{W_i} · P(face model) + P(speech | model)^{W_max − W_i} · P(speech model)   (2)
The prior probabilities in equation 2 are assumed
to be equal; however, in practice, for better accuracy,
they need to be estimated more carefully given
the environment and the recording equipment. Steps 1
through 4 of the 3D audio reconstruction procedure are
also performed when doing speaker recognition. After
building training databases for faces and speech, we then
do the following.
Multimodal Speaker Recognition
1. Find the likelihood of each user's face given the
face model in each analysis frame.
2. Find the likelihood of each user's speech given the
speech model in each analysis frame.
3. Plug the values from steps 1 and 2 into equation 2.
4. Evaluate equation 2 for w = 1, 2, ..., 10.
5. Find w such that w = argmax_w(P(user)).
A higher w indicates that the face classification results
were better than the speech classification results, so we
should put more weight on the facial classifier. This
could be due to the quality of the training set, or simply
the fact that the face classifier works better than the speech
classifier. As an example, one way of choosing w is to
measure the SNR of the signal and put more weight
on the speech classifier when the SNR is higher. Figure 4
summarizes the general flow of this project, which
concludes section 2.
Fig. 4. Speaker Classification
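Taking equation 2 literally, the sketch below fuses per-frame face and speech likelihoods for a candidate weight w and searches w = 1, ..., 10; the uniform priors and the summation over frames are our own assumptions, not the paper's exact procedure.

import numpy as np

def fused_score(p_face, p_speech, w, w_max=10, prior_face=0.5, prior_speech=0.5):
    # Equation (2) with equal priors: the face likelihood is weighted by w
    # and the speech likelihood by (w_max - w).
    return (p_face ** w) * prior_face + (p_speech ** (w_max - w)) * prior_speech

def best_weight(face_lik, speech_lik, w_max=10):
    # face_lik, speech_lik: arrays of per-frame likelihoods for one user.
    # Return the w that maximizes the fused score summed over all frames.
    totals = [np.sum(fused_score(face_lik, speech_lik, w, w_max))
              for w in range(1, w_max + 1)]
    return int(np.argmax(totals)) + 1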
3. FACE DETECTION, CLASSIFICATION AND
TRACKING
In this section, we discuss the details of detecting faces
in a video stream, training facial features, and tracking
faces throughout the video frames.
3.1. Face Detection
The first step in detecting faces is preprocessing every
frame. In most image processing applications, preprocessing
is done to ensure better quality results, whether
that is for recognizing an object in the image or deconvolving
a blur from the image.
Face Detection
1. Resize the video frames to smaller dimensions so
we can detect faces faster.
2. Map the RGB colors to a grayscale colormap to
simplify the computation.
3. Apply histogram equalization to each frame to
spread the contrast evenly across all pixels.
We then set up Matlab's vision library for face detection [5].
Since some detected faces are going to be small,
we defined a threshold that disregards any face
smaller than a certain number of pixels. This
creates a precise training database. We then resize and
vectorize all detected faces into a matrix for each user
(the face database). This is summarized in figure 5.
Fig. 5. Face Detection
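A minimal sketch of the preprocessing and detection steps, using OpenCV as a stand-in for MATLAB's vision library; the Haar cascade model, the 0.5 scale factor, and the 40-pixel minimum face size are our own assumptions.

import cv2

def detect_faces(frame_bgr, cascade, scale=0.5, min_face=40):
    small = cv2.resize(frame_bgr, None, fx=scale, fy=scale)      # 1. resize
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)               # 2. grayscale
    gray = cv2.equalizeHist(gray)                                # 3. contrast
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Discard detections smaller than the pixel threshold.
    return [(x, y, w, h) for (x, y, w, h) in boxes if min(w, h) >= min_face]

# cascade = cv2.CascadeClassifier(
#     cv2.data.haarcascades + "haarcascade_frontalface_default.xml")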
Note that if there is more than one user in the room,
we need to be able to identify the approximate location
of each so we can label them. This is why we defined the
calibration procedure in section 2. We first attempted to
find this location by detecting mouth movement; however,
due to resizing, the resolution was not high enough
to detect lip movement. For simplicity, we ask our users
to clap and wave to the camera, since it is easier to detect
a larger area of motion. We simply use frame subtraction
to identify the pixels that correspond to higher
variance. The clap sound detection from the previous
section tells us the approximate frame number of the
clap motion in the video, so the search area is only a
few frames.
subframe = (frame_i − mean(frame_{1:i−1}))^2
[i_max, j_max] = argmax_{i=1:n, j=1:m}(subframe_{i,j})
where n and m are the numbers of pixels along the vertical
and horizontal axes. We do not care about the exact position
of those high-variance pixels; we only need an approximation
of which part of the frame that specific user is
located in, e.g. the left, right, top, or bottom of the frame.
The resulting pictures for frame subtraction, division, and
face detection are shown in figure 6.
Fig. 6. Frame subtraction and division into two parti-
tions for labeling each class
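A rough sketch of the frame-subtraction step, assuming the grayscale frames are stacked in a NumPy array; summing the squared difference per column and comparing against the frame midpoint is our own simplification of the left/right assignment.

import numpy as np

def locate_motion(frames, clap_frame, search=5):
    # frames: array of shape (num_frames, height, width).
    # Subtract the running mean of all frames before the clap, square the
    # difference over a few frames around the clap, and find the column
    # with the highest energy; only the coarse side of the frame matters.
    mean_prev = frames[:clap_frame].mean(axis=0)
    diff = (frames[clap_frame:clap_frame + search] - mean_prev) ** 2
    col_energy = diff.sum(axis=(0, 1))
    j_max = int(np.argmax(col_energy))
    return "left" if j_max < frames.shape[2] / 2 else "right"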
3.2. Face Classification
Now that we have a database of faces for each user, we
need to train on each database. We used Gaussian Mixture
Models (GMMs) for training the facial database. Since the
dimensionality of the training database is fairly large, we
first reduce it to 36 using PCA and then fit a 5-component
GMM to the result. Dimensionality reduction is also
important when training a database with a GMM, since in
high dimensions the GMM might not be able to estimate
the covariance matrix. Face training is summarized below.
Train = {train_user1, train_user2}   (2a)
P_t = PCA(Train, number of eigenvectors)   (2b)
G_pt = GMM(P_t, dimensions)   (2c)
The eigenfaces from the PCA are shown in figure 7.
Fig. 7. Eigen faces for class 1
Having the GMM model for each database, one can
then easily find the probability of each face given each
model by calculating the log-likelihood of the testing
data projected into the PCA space of equation 2b, given
the mean and the variance found in equation 2c. This is
shown in equation 3.
f(test | µ, σ) = 1 / (test · σ · √(2π)) · exp(−(ln(test) − µ)^2 / (2σ^2))   (3)
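A minimal sketch of the PCA-plus-GMM training (equations 2a-2c) and of the log-likelihood scoring that plays the role of equation 3, using scikit-learn as a stand-in for the MATLAB tools; the diagonal covariance is our own assumption.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_face_model(face_matrix, n_components=36, n_mixtures=5):
    # face_matrix: one vectorized face per row. Reduce to 36 PCA dimensions
    # (eq. 2b) and fit a 5-component GMM to the projected data (eq. 2c).
    pca = PCA(n_components=n_components).fit(face_matrix)
    feats = pca.transform(face_matrix)
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag").fit(feats)
    return pca, gmm

def face_log_likelihood(face_vec, pca, gmm):
    # Project a test face into the PCA space and return its log-likelihood
    # under the trained GMM.
    feat = pca.transform(np.asarray(face_vec).reshape(1, -1))
    return gmm.score_samples(feat)[0]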
Figure 8 shows the scatter plot of each class using the two
eigenvectors with the largest eigenvalues from the PCA
matrix. As you can see, the two classes are linearly
separable, so we should be able to get high accuracy from
our classifier.
Fig. 8. 2 Dim PCA Face Features
We set aside 10% of our training database for testing and
trained on the remaining 90%. The results of our facial
classifier are tabulated below.
Class 1 95%
Class 2 100%
Table 1: Face Classification Accuracy
As expected, we obtain high accuracies, since the two
classes were shown to be linearly separable even in two
dimensions.
3.3. Face Tracking
Matlab's face detection function draws a rectangle
around each detected face. We extracted the coordinates
of that rectangle and use its center as the position of the
detected user. Such a tracking algorithm usually does not
give good results, due to face detection inaccuracies.
Since we do not need to know the exact position of the
user in each frame, we can compensate for this flaw by
applying a moving average to the tracking results as well
as fitting a polynomial to them (in this case a polynomial
of order 10 and a moving average of 25 frames in length).
If the sources were stationary, we could use a longer
moving average for smoothing the results. The result of
this process is shown in figure 9.
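A small sketch of the smoothing just described (a 25-frame moving average followed by a degree-10 polynomial fit); the edge handling of the moving average is our own choice.

import numpy as np

def smooth_track(x_positions, win=25, poly_order=10):
    # Moving average over the per-frame face x-positions, then a polynomial
    # fit to suppress detection spikes and slight head movements.
    x = np.asarray(x_positions, dtype=float)
    averaged = np.convolve(x, np.ones(win) / win, mode="same")
    t = np.arange(len(x))
    coeffs = np.polyfit(t, averaged, deg=poly_order)
    return np.polyval(coeffs, t)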
Fig. 9. Tracking Results
The x-axis corresponds to the time frame, and the
y-axis is the approximate x-position of the detected
face. We only consider the azimuth, since the 3D
audio does not sound very good at elevation angles.
In general, to recreate spatial audio for n people,
we need at least n−1 training databases. As discussed in
section 2.1, we assumed that we have at most 2 users in
a video frame. Therefore, if we have the training dataset
for one of the users, we can classify that one with a label
and define a threshold for the other user, so that we can
label him/her as an unknown class. This threshold is
defined as follows:
if P(face | model) < TH
then face_label : unknown
The tracking result with one unknown and one
known user in the video frame shown in figure 6 is
visualized in figure 10.
Fig. 10. Tracking results for two users, one labeled, and
one unknown
As you can see, each source is correctly detected at
opposite edges of the video frame. The spikes in these
graphs have two main causes: 1) slight head movement
and 2) face detection errors. As mentioned earlier,
the first problem can be eased by using a moving average
and the second by fitting a polynomial to the data, as
seen in figure 9. The spectrogram-style plot shown in
figure 10 also shows the tracking results; we mainly use
it because it is clearer for visualizing tracking over the
video frames.
4. VOICE ACTIVITY DETECTION AND
CLASSIFICATION
4.1. Voice Activity Detection
Voice activity detection enables the filtering of non-speech
components (usually silence) from speech components.
As mentioned in section 2, the calibration
procedure labels the speech signals. We also filtered
the signal with a bandpass filter between ω1 = 300 Hz
and ω2 = 8 kHz, since our focus in this project is on
speech signals. To perform VAD, we developed a supervised
method by training on speech signals (the signals between
claps) and non-speech signals (the signals before the
first clap). The VAD classification results for several
classifiers using Log Spectral Coefficients (LSC) or
Mel Frequency Cepstral Coefficients (MFCC) are listed
in table 2.
Classifier LSC MFCC
Linear SVM 99.9% 99.9%
Gaussian Naive Bayes 99.9% 99.9%
20-Nearest Neighbor 99.9% 99.9%
Table 2: VAD Accuracy
As you can see, the results from each classifier are near
perfect. This is perhaps expected given the clear linear
separability in figure 11. In fact, on inspection, the primary
feature is unsurprisingly dominated by the energy
level.
Ultimately, we selected the linear SVM for further
processing due to its simplicity and speed. After removing
non-speech components from the signal, we create
a database of each user's speech using the calibration
procedure mentioned in section 2.2. The system then
trains each database separately, in a manner similar to
that described for faces in section 3.
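A sketch of the VAD front end under our own assumptions: a 4th-order Butterworth band-pass between 300 Hz and 8 kHz (which requires a sampling rate above 16 kHz), MFCC features via librosa, and a linear SVM from scikit-learn; the paper's exact feature settings are not specified.

import librosa
from scipy.signal import butter, sosfilt
from sklearn.svm import LinearSVC

def vad_features(audio, sr, n_mfcc=13):
    # Band-pass to the speech band, then compute per-frame MFCCs; log
    # spectral coefficients could be used instead.
    sos = butter(4, [300, 8000], btype="bandpass", fs=sr, output="sos")
    speech_band = sosfilt(sos, audio)
    mfcc = librosa.feature.mfcc(y=speech_band, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                      # one row of features per frame

# Frames before the first clap are labeled 0 (non-speech), frames between
# claps are labeled 1 (speech), then:
# vad = LinearSVC().fit(train_features, train_labels)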
4.2. Voice Classification
10% of the data was set aside for testing and the remaining
90% was used for training speech models for each class.
The STFT of the signal uses non-overlapping rectangular
windows whose length corresponds to one video frame.
The number of samples per frame is computed as
(samples/second) · (seconds/frame). The unit of frame
time is used because frames serve as the base unit in the
facial analysis.1 We then project the extracted features to
a lower-dimensional space using PCA. This procedure
mirrors the technique described in section 3 for training
the face databases.
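A short sketch of the frame-aligned slicing described above: the audio is cut into non-overlapping rectangular windows whose length equals one video frame; returning magnitude spectra is our own simplification of the LSC/MFCC front end.

import numpy as np

def frame_aligned_spectra(audio, sr, fps):
    # samples per frame = (samples/second) * (seconds/frame)
    audio = np.asarray(audio, dtype=float)
    samples_per_frame = int(round(sr / fps))
    n_frames = len(audio) // samples_per_frame
    blocks = audio[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
    return np.abs(np.fft.rfft(blocks, axis=1))   # one spectrum per video frame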
The procedure for classifying speech is summarized
in figure 12. The resulting classification results are
shown in table 3 below. In addition, the corresponding
LSC and MFCC features are shown in a 2-dimensional
space in figure 11 for the non-speech, class 1, and class 2
speech signals. The non-speech and speech classes are
clearly separable; however, the separability of the speech
classes remains suspect. Nevertheless, given the empirical
success of the classifiers, the speech classes appear to be
separable in higher dimensions. As you can see, Ada-Boost
has the highest classification accuracy of all. Our
suspicion is that the other classifiers make Gaussian
assumptions, either explicitly or implicitly through
Euclidean distance measures. Ada-Boost avoids this fate
by using a collection of classifiers that, while potentially
making individual Gaussian assumptions, are not
necessarily Gaussian distributed themselves.
Classifier LSC MFCC
Linear SVM 83.0% 85.3%
Gaussian Naive Bayes 82.4% 80.2%
20-Nearest Neighbor 84.9% 86.7%
Ada-Boost 91.2% 90.3%
GMM 83.5% − − −
Table 3: Voice Classification Accuracy
1The classification tool also supports averaging over a number of
frames.
Fig. 11. MFCC (top) and LSC features (bottom) for three
classes
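For reference, here is a sketch of the classifier comparison behind table 3, using scikit-learn equivalents; the hyperparameters shown are our own defaults, not the paper's settings.

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

def compare_classifiers(X_train, y_train, X_test, y_test):
    # Fit each classifier on the same speech features and report held-out accuracy.
    models = {
        "Linear SVM": LinearSVC(),
        "Gaussian Naive Bayes": GaussianNB(),
        "20-Nearest Neighbor": KNeighborsClassifier(n_neighbors=20),
        "Ada-Boost": AdaBoostClassifier(n_estimators=100),
    }
    return {name: m.fit(X_train, y_train).score(X_test, y_test)
            for name, m in models.items()}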
Now that we have our VAD and speech classifiers,
we can easily assign a label to every analysis frame,
e.g. non-speech, class 1, or class 2. The labels then allow
selecting which face rectangle is active, yielding the
location of each user at each frame from the section 3
results. We then reconstruct spatial audio using the
techniques explained in section 2.1.
5. RESULTS
We recorded a few video clips from which we attempted
to reconstruct 3D audio; the link is provided in [6].
For the first case, we simply tracked one person in a
video frame, mapped his face location to HRTF angles,
and then reconstructed spatial audio from the single-channel
audio input using the corresponding HRTFs, as
shown in figure 1. The resulting video, named gobble
cs598.mp4, can be found in [6]. For the second video,
we recorded two speakers talking; a snapshot was provided
earlier in figure 2. We trained both speakers' facial
and speech features beforehand, as discussed in sections
3 and 4, and used them to find each face location.
At every frame we detect and capture both speech and
faces. Note that given our face model we already know
where the speakers are located, so we simply classify each
audio frame into one of the three classes discussed in
section 4.
Fig. 12. Speech Classification
We can then create spatial sound for the two users.
In general, we were able to recreate a sound that was
guided by the speakers' faces. We also included some
example video clips where a piece of music was guided
by the user's face. This technology could be especially
useful in hearing aids. Imagine wearing Google Glass.
You could capture faces using the glasses and capture
speech using your hearing aid. Tracking and classification
could be done on the glasses, offline, or in the cloud, and
the resulting information fed to the hearing aid to form
beams toward the desired sound source and undo artifacts
such as noise and reverberation in the room.
We use the same video with two speakers from before
to perform multimodal speaker recognition. We can
calculate the recognition of each speaker using equation 2.
We did not, however, try to estimate w from the
environment. We used values from 1 to 10 for w,
where 1 puts more weight on the face classifier and 10
more weight on the speech classifier. We then evaluate
P(user) for every video frame k = 1, ..., K and for all 10
values of w. We can then create a matrix of likelihoods
as the one shown below.
P(user) = [ p_{1,1} ... p_{1,wmax} ; ... ; p_{K,1} ... p_{K,wmax} ]
We then look for the value of w that maximizes the user
classification over most frames, that is,
w_max = argmax_{w=1,...,10} Σ_{k=1}^{K} p_{k,w}
where K is the number of frames. For our video, it
turned out that the best value of w is 4. This means that
the multimodal classifier puts more weight on the face
classifier. This makes sense, since the face classifier was
able to achieve higher accuracy than the speech classifier.
6. FUTURE WORK
One of the assumptions made in section 2 was that
speakers do not talk at the same time. One could apply
source separation techniques [7] to the input signal first,
based on the number of speakers in the room, and then
follow the same procedure as in section 4 to classify each
speaker into their corresponding class while separating
each source.
Another future enhancement is the ability to spatialize
sounds other than speech in a desired video clip.
A simple but exhaustive way of doing this is to also learn
features for different objects and their sounds. This would
also require an even more exhaustive search to detect
and recognize objects in defined blocks of every video
frame. This work, however, could be useful for
entertainment applications, such as movies and video
games. Based on the key objects in a movie, one could
learn a large database of the sounds they make and their
shapes, and then reconstruct 3D audio throughout the
whole clip, which might take hours or even days.
One can also apply this project to teleconferencing
applications. If you have a room full of stationary
speakers, you can easily learn their facial and speech
features by recording one of the sessions. One can then
use the tracking procedure discussed in section 3 to locate
each speaker and use the techniques explained in section 4
to label the signal as well. That way, if the conference
room is equipped with microphone arrays, one can use
this information to form a beam at the current speaker to
enhance speech intelligibility. Note that in teleconferencing,
we need to be able to do this in real time. There are
assumptions that we can make to speed things up. For
example, users are probably not moving around the room,
so one only needs to locate them once. After that, we only
need to classify the speech and label the signal, and further
assumptions can be made, given the application, to
simplify and speed up the process for real-time scenarios.
There are other possibilities for the future of this
project. For example, is it possible to learn a relationship
between speech and facial features using unsupervised
methods? Earlier, in figures 8 and 11, we depicted the
facial and speech features for two and three classes,
respectively. We could use clustering methods to group
each set of features together, e.g. K-Means, GMM
clustering, or, even better, time-series clustering such as
Hidden Markov Models (HMMs). One obvious
disadvantage of clustering is not having access to the
labels of each class. For both 3D audio and speaker
recognition, we need labels at each frame of analysis, e.g.
given my speech signal, where is the user? Or, given the
speech and face likelihoods, who is the most likely
speaker? We think there might be a relationship between
facial features and the corresponding speech features at
every frame. If such a correlation exists between the two,
one could find a person's face just by analyzing the
speech signals.
One last application that we can think of is supervised
source separation. Since we already have a dictionary of
eigenfaces and speech features for every user, we can
simply scan the whole video and signal to extract a user's
face and speech by finding the best match between the
signals and the corresponding eigen-features. This idea of
unconstrained source separation has already been
investigated extensively for video content analysis and
sound recognition in [8].
7. REFERENCES
[1] Choueiri, E., "Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers," Princeton University.
[2] "Binaural Recording," Wiki 100k, Princeton University.
[3] Friedman, Michael, "Capturing spatial audio from arbitrary microphone arrays for binaural reproduction," 2014.
[4] Gardner, Bill, "HRTF Measurements of a KEMAR Dummy-Head Microphone."
[5] "Face Detection and Tracking Using CAMShift," MATLAB documentation, accessed 14 Dec. 2014.
[6] Recorded videos: https://www.dropbox.com/sh/s16c6lrx9tay065/AADZpsofFS4ANiRzpQA8fHPca?dl=0
[7] Bryan, Nick J., "ISSE," Stanford University.
[8] Smaragdis, Paris, "Audio Demos - Sound Recognition for Content Analysis."
[9] "Multimodal Person Identification (Face Recognition, Person ID)," ECE 417 - Multimedia Signal Processing, Mar.-Apr. 2014.

More Related Content

What's hot

Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...CSCJournals
 
Enhancement in frequency domain
Enhancement in frequency domainEnhancement in frequency domain
Enhancement in frequency domainAshish Kumar
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...IJORCS
 
Lec 07 image enhancement in frequency domain i
Lec 07 image enhancement in frequency domain iLec 07 image enhancement in frequency domain i
Lec 07 image enhancement in frequency domain iAli Hassan
 
Image Denoising Using Wavelet
Image Denoising Using WaveletImage Denoising Using Wavelet
Image Denoising Using WaveletAsim Qureshi
 
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color ImagesReduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color ImagesIDES Editor
 
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...IRJET Journal
 
Frequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital ImagesFrequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital ImagesUpendra Pratap Singh
 
04 1 - frequency domain filtering fundamentals
04 1 - frequency domain filtering fundamentals04 1 - frequency domain filtering fundamentals
04 1 - frequency domain filtering fundamentalscpshah01
 
Frequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement TechniquesFrequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement TechniquesDiwaker Pant
 
08 frequency domain filtering DIP
08 frequency domain filtering DIP08 frequency domain filtering DIP
08 frequency domain filtering DIPbabak danyal
 
Curved Wavelet Transform For Image Denoising using MATLAB.
Curved Wavelet Transform For Image Denoising using MATLAB.Curved Wavelet Transform For Image Denoising using MATLAB.
Curved Wavelet Transform For Image Denoising using MATLAB.Nikhil Kumar
 
129966864160453838[1]
129966864160453838[1]129966864160453838[1]
129966864160453838[1]威華 王
 
Robust Sound Field Reproduction against Listener’s Movement Utilizing Image ...
Robust Sound Field Reproduction against  Listener’s Movement Utilizing Image ...Robust Sound Field Reproduction against  Listener’s Movement Utilizing Image ...
Robust Sound Field Reproduction against Listener’s Movement Utilizing Image ...奈良先端大 情報科学研究科
 
Filtering in frequency domain
Filtering in frequency domainFiltering in frequency domain
Filtering in frequency domainGowriLatha1
 
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...Norishige Fukushima
 

What's hot (20)

Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
Speech Processing in Stressing Co-Channel Interference Using the Wigner Distr...
 
Enhancement in frequency domain
Enhancement in frequency domainEnhancement in frequency domain
Enhancement in frequency domain
 
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
Improving the Efficiency of Spectral Subtraction Method by Combining it with ...
 
Lec 07 image enhancement in frequency domain i
Lec 07 image enhancement in frequency domain iLec 07 image enhancement in frequency domain i
Lec 07 image enhancement in frequency domain i
 
Image Denoising Using Wavelet
Image Denoising Using WaveletImage Denoising Using Wavelet
Image Denoising Using Wavelet
 
Lecture 10
Lecture 10Lecture 10
Lecture 10
 
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color ImagesReduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
Reduced Ordering Based Approach to Impulsive Noise Suppression in Color Images
 
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
Speech Enhancement Based on Spectral Subtraction Involving Magnitude and Phas...
 
Frequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital ImagesFrequency Domain Filtering of Digital Images
Frequency Domain Filtering of Digital Images
 
04 1 - frequency domain filtering fundamentals
04 1 - frequency domain filtering fundamentals04 1 - frequency domain filtering fundamentals
04 1 - frequency domain filtering fundamentals
 
Frequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement TechniquesFrequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement Techniques
 
08 frequency domain filtering DIP
08 frequency domain filtering DIP08 frequency domain filtering DIP
08 frequency domain filtering DIP
 
K31074076
K31074076K31074076
K31074076
 
Curved Wavelet Transform For Image Denoising using MATLAB.
Curved Wavelet Transform For Image Denoising using MATLAB.Curved Wavelet Transform For Image Denoising using MATLAB.
Curved Wavelet Transform For Image Denoising using MATLAB.
 
129966864160453838[1]
129966864160453838[1]129966864160453838[1]
129966864160453838[1]
 
Robust Sound Field Reproduction against Listener’s Movement Utilizing Image ...
Robust Sound Field Reproduction against  Listener’s Movement Utilizing Image ...Robust Sound Field Reproduction against  Listener’s Movement Utilizing Image ...
Robust Sound Field Reproduction against Listener’s Movement Utilizing Image ...
 
APPRAISAL AND ANALOGY OF MODIFIED DE-NOISING AND LOCAL ADAPTIVE WAVELET IMAGE...
APPRAISAL AND ANALOGY OF MODIFIED DE-NOISING AND LOCAL ADAPTIVE WAVELET IMAGE...APPRAISAL AND ANALOGY OF MODIFIED DE-NOISING AND LOCAL ADAPTIVE WAVELET IMAGE...
APPRAISAL AND ANALOGY OF MODIFIED DE-NOISING AND LOCAL ADAPTIVE WAVELET IMAGE...
 
Filtering in frequency domain
Filtering in frequency domainFiltering in frequency domain
Filtering in frequency domain
 
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
Comparison between Blur Transfer and Blur Re-Generation in Depth Image Based ...
 
Microphone arrays
Microphone arraysMicrophone arrays
Microphone arrays
 

Similar to 3D Audio playback for single channel audio using visual cues

silent sound technology pdf
silent sound technology pdfsilent sound technology pdf
silent sound technology pdfrahul mishra
 
Bachelors project summary
Bachelors project summaryBachelors project summary
Bachelors project summaryAditya Deshmukh
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET Journal
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEIRJET Journal
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosCodiax
 
Lip Reading by Using 3-D Discrete Wavelet Transform with Dmey Wavelet
Lip Reading by Using 3-D Discrete Wavelet Transform with Dmey WaveletLip Reading by Using 3-D Discrete Wavelet Transform with Dmey Wavelet
Lip Reading by Using 3-D Discrete Wavelet Transform with Dmey WaveletCSCJournals
 
Speaker Recognition System using MFCC and Vector Quantization Approach
Speaker Recognition System using MFCC and Vector Quantization ApproachSpeaker Recognition System using MFCC and Vector Quantization Approach
Speaker Recognition System using MFCC and Vector Quantization Approachijsrd.com
 
Paper id 23201490
Paper id 23201490Paper id 23201490
Paper id 23201490IJRAT
 
On the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognitionOn the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognitionjournalBEEI
 
Isolated word recognition using lpc &amp; vector quantization
Isolated word recognition using lpc &amp; vector quantizationIsolated word recognition using lpc &amp; vector quantization
Isolated word recognition using lpc &amp; vector quantizationeSAT Journals
 
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantizationIsolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantizationeSAT Publishing House
 
A Review of Video Classification Techniques
A Review of Video Classification TechniquesA Review of Video Classification Techniques
A Review of Video Classification TechniquesIRJET Journal
 
M sc thesis_presentation_
M sc thesis_presentation_M sc thesis_presentation_
M sc thesis_presentation_Dia Abdulkerim
 
Speaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained DataSpeaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained Datasipij
 
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...IRJET Journal
 

Similar to 3D Audio playback for single channel audio using visual cues (20)

silent sound technology pdf
silent sound technology pdfsilent sound technology pdf
silent sound technology pdf
 
Bachelors project summary
Bachelors project summaryBachelors project summary
Bachelors project summary
 
Speaker Recognition Using Vocal Tract Features
Speaker Recognition Using Vocal Tract FeaturesSpeaker Recognition Using Vocal Tract Features
Speaker Recognition Using Vocal Tract Features
 
IRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound RecognitionIRJET- A Survey on Sound Recognition
IRJET- A Survey on Sound Recognition
 
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLEMULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
MULTILINGUAL SPEECH TO TEXT CONVERSION USING HUGGING FACE FOR DEAF PEOPLE
 
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
 
Lip Reading by Using 3-D Discrete Wavelet Transform with Dmey Wavelet
Lip Reading by Using 3-D Discrete Wavelet Transform with Dmey WaveletLip Reading by Using 3-D Discrete Wavelet Transform with Dmey Wavelet
Lip Reading by Using 3-D Discrete Wavelet Transform with Dmey Wavelet
 
Speaker Recognition System using MFCC and Vector Quantization Approach
Speaker Recognition System using MFCC and Vector Quantization ApproachSpeaker Recognition System using MFCC and Vector Quantization Approach
Speaker Recognition System using MFCC and Vector Quantization Approach
 
Paper id 23201490
Paper id 23201490Paper id 23201490
Paper id 23201490
 
On the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognitionOn the use of voice activity detection in speech emotion recognition
On the use of voice activity detection in speech emotion recognition
 
Isolated word recognition using lpc &amp; vector quantization
Isolated word recognition using lpc &amp; vector quantizationIsolated word recognition using lpc &amp; vector quantization
Isolated word recognition using lpc &amp; vector quantization
 
Isolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantizationIsolated word recognition using lpc & vector quantization
Isolated word recognition using lpc & vector quantization
 
A Review of Video Classification Techniques
A Review of Video Classification TechniquesA Review of Video Classification Techniques
A Review of Video Classification Techniques
 
Dy36749754
Dy36749754Dy36749754
Dy36749754
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech Recognition
 
Kf2517971799
Kf2517971799Kf2517971799
Kf2517971799
 
Kf2517971799
Kf2517971799Kf2517971799
Kf2517971799
 
M sc thesis_presentation_
M sc thesis_presentation_M sc thesis_presentation_
M sc thesis_presentation_
 
Speaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained DataSpeaker Identification From Youtube Obtained Data
Speaker Identification From Youtube Obtained Data
 
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
 

More from Ramin Anushiravani

More from Ramin Anushiravani (7)

recommender_systems
recommender_systemsrecommender_systems
recommender_systems
 
Techfest jan17
Techfest jan17Techfest jan17
Techfest jan17
 
Beamforming and microphone arrays
Beamforming and microphone arraysBeamforming and microphone arrays
Beamforming and microphone arrays
 
3D audio
3D audio3D audio
3D audio
 
Poster cs543
Poster cs543Poster cs543
Poster cs543
 
3D Spatial Response
3D Spatial Response3D Spatial Response
3D Spatial Response
 
example based audio editing
example based audio editingexample based audio editing
example based audio editing
 

Recently uploaded

Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 

Recently uploaded (20)

Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 

3D Audio playback for single channel audio using visual cues

  • 1. 3D AUDIO RECONSTRUCTION AND SPEAKER RECOGNITION USING SUPERVISED LEARNING METHODS BASED ON VOICE AND VISUAL CUES Ramin Anushiravani† , Marcell Vazquez-Chanlatte∗‡ , Faraz Faghri∗ University of Illinois at Urbana-Champaign † Dept. of Electrical and Computer Engineering ∗ Dept. of Computer Science ‡ Dept. of Physics ABSTRACT In this paper we present a creative approach to recon- struct 3D audio for multiple sources from a single chan- nel input by detecting and tracking visual cues using supervised learning methods. We also discuss a similar approach for improving speaker’s classification from a video stream by employing both facial and speech like- lihoods, or simply Multimodal Speaker Recognition on a video stream. Index Terms— 3D audio, speaker classification, vi- sual cues, supervised learning, Multimodal Speaker Recognition 1. INTRODUCTION Spatial audio and speaker classifications both have im- portant applications in video conferencing and enter- tainment. With spatial audio, one can listen to music and feel the depth and the directionality of the sound using headphones or crosstalk canceller speakers [1]. Imagine hearing a tank approaching you from the living room when playing high-end resolution video games. 3D audio is yet to be able achieve these ambitious goals, though it can still create a more realistic sound fields than surround sound. 3D audio is normally recorded through binaural microphones mounted on a Kemar [2]. One can also reconstruct 3D audio using Head Related Transfer Function by forming beams at every direction using microphone arrays [3]. These techniques require precise calibration and are not very accurate in real en- vironments due to noise and reverberation. Most im- portantly, if a video was not recorded using microphone arrays or binaural microphones, it is too difficult, if not impossible, to recover spatial audio. That is why we purpose a method by using the contents in the video, the visual cues, to help reconstruct 3D audio. The other goal of this project is to perform multi- modal speaker recognition using facial and speech like- lihoods by first learning different features from a train- ing video using a user friendly calibration procedure discussed in this paper. In section 2, we talk about the general flow of the algorithm for reconstructing 3D au- dio and speaker recognition algorithms. In section 4, face detection, classification, and tracking are discussed in detail. In section 4, voice activity detection (VAD) and speech classification algorithms are explained thor- oughly. In section 5, we go over the results for 3D au- dio reconstruction and multimodal speaker recognition. And finally, in section 6, we suggest future possibilities for our project in other systems and applications. 2. GENERAL FLOW In this section, the basic procedure for reconstructing 3D audio and multimodal speaker recognition are dis- cussed. 2.1. 3D Audio Reconstruction One can reconstruct a 3D audio by simply convolving a mono sound with the spatial response corresponding to a desired location in space. These impulse responses can be captured by recording a maximum length se- quence (MLS) at listener’s ears at different angles. We can then extract corresponding impulse responses by cross-correlating the recorded MLS with the original MLS as shown in equation 1. These spatial responses are also called Head Related Impulse Response (hrir). 
index = argmax(C(MLSrecorded,MLSoriginal)) (1a) hrir = C(MLSrecorded,MLSoriginal)(index − L : index + L) (1b) 3Daudio = signalmono ∗ hrir (1c) Where L is half the impulse response desired length. This number is usually about 128 samples, but it can differ based on the recording room, e.g. such the early
  • 2. reflection sample index and Cxy represents the cross- correlation between x and y. In this project we used MIT HRTF database for reconstructing spatial audio [4]. In order to avoid clicking sounds when reconstruct- ing 3D audio and also for simplicity, we did the recon- struction in frequency domain. In short, one must first find the short time Fourier transform of the signal and then multiplied that by the desired zero-padded HRTF and take inverse STFT to recover the time domain signal back, as shown in figure 1. Fig. 1. 3D Audio Reconstruction We made a few assumptions to make this project doable in our limited timeline. Assumption 1. This algorithm is only able to spatialize speech based on a speaker’s face. 2. There are at most two speakers in the video. 3. At least one of the two speakers in the room is in the training database. 4. There are no sudden movements in the video stream. Most of these assumptions can be relaxed even fur- ther, some which are discussed in section VI. For sim- plicity, we used the following video clip for testing our applications. We recorded a video clip of two people sitting on left and right of a video frame having a con- versation and use that video for reconstructing 3D audio and speaker recognition, as shown in figure 2. We also purpose a calibration procedure that require some user interference for training facial and speech database from each user. 2.2. Calibration In order to simplify the algorithm complexity, we de- cided to employ some user friendly interface when collecting a training database. Each user is asked to do perform two tasks for at least one time which might take about a minute or so. This will assure an accurate training dataset. In practice, one might have to repeat Fig. 2. Recording of two people having a conversation followed by a depiction of 3D audio and speaker recog- nition this procedure if/when the lighting and the background sound is different for both 3D audio reconstruction and speaker recognition. A better and bigger training database, obviously would have better classification results which is the essence of this project. Calibration 1. Clap hands and wave to the camera. 2. Speak for about a minute while facing the camera. This calibration procedure is repeated until another clap sound is detected, which means there is another user whose database needs to be collected as well. This calibration procedure helps us particularly in labeling the training database. This is summarized in figure 3. 2.3. 3D Audio Reconstruction After collecting a training database, we proceed making classifier which is discussed in more details in section
• 3. Fig. 3. Calibration

2.3. 3D Audio Reconstruction

After collecting a training database, we proceed to build classifiers, which are discussed in more detail in sections 3 and 4. We wanted to test the classifier results on different applications, and one that came to mind was reconstructing spatial audio from a single-channel input. The idea is to find out when and where each speaker is talking, and to use the location (HRTF angle) and the time (frame numbers) to create 3D audio without microphone arrays. The procedure for reconstructing 3D audio is as follows.

3D audio
1. Detect faces in the video stream in each frame and classify them against the database.
2. Map the position of each face to a meaningful HRTF angle in each frame (a possible mapping is sketched below).
3. Detect whether there is speech in a frame or not.
4. Label the frames with speech from step 3 with their corresponding class label in the database.
5. Use the speech classification results to assign the speech from step 4 to the position found in step 2; that is, given a speech signal, find the corresponding face location.
6. Pick the corresponding HRTFs from step 5 for each frame.
7. Reconstruct 3D audio as shown in figure 1.

2.4. Speaker Recognition

Consider the same video from section 2.1. We would like to identify each user using their facial and speech features. If the speaker is not talking, the recognition system should put more weight on the facial classifier. If the speaker's face cannot be detected, the system only uses the speech classifier. If both face and speech are available, the system uses both, with a weight on each model, to identify each user by maximizing the recognition likelihood given the training database [9].

P(user) = P(face | model)^{w_i} · P(face model) + P(speech | model)^{w_max − w_i} · P(speech model)   (2)

The prior probabilities in equation 2 are assumed to be equal; in practice, for better accuracy, they should be estimated more carefully given the environment and the recording equipment. Steps 1 through 4 of the 3D audio reconstruction are also performed for speaker recognition. After building training databases for faces and speech, we do the following.

Multimodal Speaker Recognition
1. Find the likelihood of each user's face given the face model in each analysis frame.
2. Find the likelihood of each user's speech given the speech model in each analysis frame.
3. Plug the values from steps 1 and 2 into equation 2.
4. Evaluate equation 2 for w = 1, 2, ..., 10.
5. Find the w that maximizes P(user), i.e. argmax_w P(user).

Higher w's indicate that the face classification results were better than the speech classification results, so more weight should be put on the facial classifier. This could be due to the quality of the training set, or simply because the face classifier works better than the speech classifier. As an example, one way of choosing w is to measure the SNR of the signal and put more weight on the speech classifier when the SNR is higher. With figure 4 summarizing the general flow of this project, we conclude section 2.

Fig. 4. Speaker Classification

3. FACE DETECTION, CLASSIFICATION AND TRACKING

In this section, we discuss the details of detecting faces in a video stream, training facial features, and tracking faces throughout the video frames.

3.1. Face Detection

The first step in detecting faces is preprocessing every frame. In most image processing applications, preprocessing is done to ensure better-quality results, whether for recognizing an object in the image or for deconvolving a blur from the image.
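Before turning to the face detection details, here is the sketch referenced in step 2 of the 3D audio procedure above. The paper maps a detected face's horizontal position to an HRTF angle but does not give the exact mapping; the linear mapping, the ±40-degree field of view, and the 5-degree grid below are assumptions for illustration only.

# Hedged sketch of step 2: map the horizontal position of a detected face to the
# nearest azimuth in an HRIR database. The field of view and grid are assumptions.
import numpy as np

def face_x_to_azimuth(x_center, frame_width, fov_deg=80.0, grid_deg=5.0):
    # normalize x to [-1, 1], with 0 at the middle of the frame
    x_norm = 2.0 * x_center / frame_width - 1.0
    azimuth = x_norm * (fov_deg / 2.0)            # left edge -> -40 deg, right edge -> +40 deg
    return grid_deg * round(azimuth / grid_deg)   # snap to the HRIR measurement grid

# e.g. a face centered at pixel 160 in a 640-pixel-wide frame
print(face_x_to_azimuth(160, 640))                # -> -20.0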
• 4. Face Detection
1. Resize the video frames to smaller dimensions so faces can be detected faster.
2. Map the RGB colors to a gray colormap to simplify the computation.
3. Apply histogram equalization to each frame so that the contrast is evenly spread across all pixels.

We then set up Matlab's vision library for face detection [5]. Since some detected faces are small, we defined a threshold that disregards any face smaller than a certain number of pixels; this keeps the training database precise. We then resized and vectorized all detected faces into a matrix for each user (the face database). This is summarized in figure 5.

Fig. 5. Face Detection

Note that if there is more than one user in the room, we need to identify the approximate location of each so we can label them; this is why we defined the calibration procedure in section 2. We first attempted to find this location by detecting mouth movement; however, because of the resizing, the resolution was not high enough to detect lip movement. For simplicity, we ask users to clap and wave to the camera, since a larger area of motion is easier to detect. We simply use frame subtraction to identify the pixels with the highest variance (sketched below). The clap sound detected in the previous section tells us the approximate frame number of the clap motion in the video, so the search is limited to only a few frames.

subframe = (frame_i − mean(frame_{1:i−1}))^2
[i_max, j_max] = argmax_{i=1..n, j=1..m} subframe(i, j)

where n and m are the numbers of pixels along the vertical and horizontal axes. We do not care about the exact positions of those high-variance pixels; we only need an approximation of which part of the frame the user occupies, e.g. the left, right, top, or bottom of the frame. The resulting pictures for frame subtraction, division, and face detection are shown in figure 6.

Fig. 6. Frame subtraction and division into two partitions for labeling each class

3.2. Face Classification

Now that we have a database of faces for each user, we need to train on each database. We used Gaussian Mixture Models (GMMs) to train the facial database. Since the dimensionality of the training data is fairly large, we first reduce it to 36 dimensions using PCA and then fit a 5th-order GMM to the result. Dimensionality reduction is also important when training with a GMM, since in high dimensions the GMM might not be able to estimate the covariance matrix.
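As referenced earlier on this page, a minimal sketch of the frame-subtraction localization follows. Averaging only a short window of previous frames and splitting the frame into left/right halves are our assumptions (matching the two-speaker setup), not details from the text.

# Hedged sketch of the frame-subtraction step: square the difference between the
# current frame and the mean of recent frames, then decide which half of the
# frame contains the motion. Assumes clap_index > 0 and grayscale frames.
import numpy as np

def locate_motion(frames, clap_index, window=5):
    # frames: (num_frames, height, width) grayscale array
    history = frames[max(0, clap_index - window):clap_index].mean(axis=0)
    subframe = (frames[clap_index].astype(np.float64) - history) ** 2
    i_max, j_max = np.unravel_index(np.argmax(subframe), subframe.shape)
    side = "left" if j_max < subframe.shape[1] // 2 else "right"
    return side, (i_max, j_max)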
• 5. Face training is summarized below.

Train = {train_user1, train_user2}   (2a)
P_t = PCA(Train, number of eigenvectors)   (2b)
G_pt = GMM(P_t, dimensions)   (2c)

The eigenfaces from the PCA are shown in figure 7.

Fig. 7. Eigenfaces for class 1

Having a GMM model for each database, one can then find the probability of each face given each model by calculating the log-likelihood of the test data, projected into the PCA space of equation (2b), under the mean and variance found in equation (2c). This is shown in equation 3.

f(test | µ, σ) = \frac{1}{test \, σ \sqrt{2π}} \exp\left(−\frac{(\ln test − µ)^2}{2σ^2}\right)   (3)

Figure 8 shows the scatter plot of each class using the two largest eigenvectors from the PCA matrix. As can be seen, the two classes are linearly separable, so we should be able to get high accuracy from our classifier.

Fig. 8. 2-Dim PCA Face Features

We left out 10% of our training database for testing and trained on the remaining 90%. The results of our facial classifier are tabulated below.

Class 1   95%
Class 2   100%
Table 1: Face Classification Accuracy

As expected, the accuracies are high, since the two classes were shown to be linearly separable even in 2 dimensions.

3.3. Face Tracking

Matlab's face detection function draws a rectangle around each detected face. We extracted the coordinates of that rectangle and used its center as the position of the detected user. Such a tracking algorithm usually does not give good results on its own, due to face detection inaccuracies. Since we do not need the exact position of the user at each frame, we can compensate for this by applying a moving average to the tracking results as well as fitting a polynomial to them (in this case a polynomial of order 10 and a moving average of 25 frames; a sketch of this smoothing appears below). If the sources are stationary, a longer moving average can be used to smooth the results. The result of this process is shown in figure 9.

Fig. 9. Tracking Results

The x-axis corresponds to the time frame, and the y-axis is the approximate x-position of the detected face. We only consider azimuth, since the 3D audio does not sound very good at elevation angles. In general, to recreate spatial audio for n people, we need at least n − 1 training databases. As discussed in section 2.1, we assumed at most 2 users in a video frame. Therefore, if we have a training dataset for one of the users, we can classify that user with a label and define a threshold for the other user, so that he/she can be labeled as an unknown class. This threshold is defined as follows:

if P(face | model) < TH, then Face_label := unknown
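As referenced in section 3.3, a minimal sketch of the track-smoothing step. The window length and polynomial order come from the text; the uniform-kernel implementation is an assumption.

# Sketch of the tracking smoothing in section 3.3: a 25-frame moving average
# followed by a 10th-order polynomial fit to the face x-positions.
import numpy as np

def smooth_track(x_positions, window=25, poly_order=10):
    kernel = np.ones(window) / window
    averaged = np.convolve(x_positions, kernel, mode="same")   # moving average
    frames = np.arange(len(averaged))
    coeffs = np.polyfit(frames, averaged, poly_order)           # polynomial fit
    return np.polyval(coeffs, frames)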
• 6. The tracking results for one known and one unknown user, for the video frame shown in figure 6, are visualized in figure 10.

Fig. 10. Tracking results for two users, one labeled and one unknown

As you can see, each source is correctly detected at opposite edges of the video frame. The spikes in these graphs have two main causes: (1) slight head movements and (2) face detection errors. As mentioned earlier, the first problem can be smoothed out with a moving average and the second by fitting a polynomial to the data, as seen in figure 9. The spectrogram-style plot shown in figure 10 also depicts the tracking results; we mainly use it because it is clearer for visualizing tracking over the video frames.

4. VOICE ACTIVITY DETECTION AND CLASSIFICATION

4.1. Voice Activity Detection

Voice activity detection (VAD) enables the filtering of non-speech components (usually silence) from speech components. As mentioned in section 2, the calibration procedure labels the speech signals. We also filtered the signal with a bandpass filter from ω1 = 300 Hz to ω2 = 8 kHz, since our focus in this project is on speech signals. To perform VAD, we developed a supervised method by training on speech signals (the signals between claps) and non-speech signals (the signals before the first clap). The VAD classification results for several classifiers using Log Spectral Coefficients (LSC) or Mel Frequency Cepstral Coefficients (MFCC) are listed in table 2.

Classifier              LSC     MFCC
Linear SVM              99.9%   99.9%
Gaussian Naive Bayes    99.9%   99.9%
20-Nearest Neighbor     99.9%   99.9%
Table 2: VAD Accuracy

As you can see, the results from each classifier are near perfect. This is perhaps expected given the clear linear separability in figure 11; in fact, on inspection, the primary feature is unsurprisingly dominated by the energy level. Ultimately, we selected the linear SVM for further processing, due to its simplicity and speed. After removing the non-speech components from the signal, we create a database of each user's speech using the calibration procedure described in section 2.1. The system then trains each database separately, in a manner similar to that described for faces in section 3.

4.2. Voice Classification

10% of the data was set aside for testing and the remaining 90% used for training a speech model for each class. The STFT of the signal uses non-overlapping rectangular windows whose length corresponds to one video frame; the samples per frame are computed as (samples/second) × (seconds/frame). Frame time is used as the unit because the frame serves as the base unit in the facial analysis.¹ We then project the extracted features to a lower-dimensional space using PCA. This procedure mirrors the technique described in section 3 for training the face databases, and is summarized in figure 12. The resulting classification results are shown in table 3 below. In addition, the corresponding LSC and MFCC features are shown in a 2-dimensional space in figure 11 for non-speech signals and for class 1 and class 2 speech signals. The non-speech and speech classes are clearly separable; however, the separability of the two speech classes remains suspect. Nevertheless, given the empirical success of the classifiers, the speech classes appear to be separable in higher dimensions. As you can see, Ada-Boost has the highest classification accuracy of all. Our suspicion is that the other classifiers make Gaussian assumptions, either explicitly or implicitly through Euclidean distance measures.
Ada-Boost avoids this fate by using a collection of classifiers that, while potentially making individual Gaussian assumptions, are not necessarily Gaussian distributed themselves.

Classifier              LSC     MFCC
Linear SVM              83.0%   85.3%
Gaussian Naive Bayes    82.4%   80.2%
20-Nearest Neighbor     84.9%   86.7%
Ada-Boost               91.2%   90.3%
GMM                     83.5%   —
Table 3: Voice Classification Accuracy

¹ The classification tool also supports averaging over a number of frames.
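To make the framing and feature pipeline of section 4 concrete, a minimal sketch using log-spectral coefficients and scikit-learn is given below. The tooling, feature size, and classifier settings are stand-ins and assumptions, not the values used in this project.

# Hedged sketch of the per-video-frame feature extraction and classification in
# section 4: log-spectral coefficients over non-overlapping windows whose length
# matches one video frame, PCA, then an Ada-Boost classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

def log_spectral_features(signal, fs, video_fps=30):
    samples_per_frame = int(fs / video_fps)             # (samples/s) * (s/frame)
    n_frames = len(signal) // samples_per_frame
    frames = signal[: n_frames * samples_per_frame].reshape(n_frames, -1)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + 1e-10)                     # log spectral coefficients

# X: features from the labeled calibration audio; y: 0 = non-speech, 1/2 = speaker id
# clf = make_pipeline(PCA(n_components=20), AdaBoostClassifier(n_estimators=100))
# clf.fit(X_train, y_train); labels = clf.predict(X_test)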
• 7. Fig. 11. MFCC (top) and LSC (bottom) features for three classes

Now that we have our VAD and speech classifiers, we can assign a label to every analysis frame, e.g. non-speech, class 1, class 1, class 2, etc. The labels then allow selecting which face rectangle is active, yielding the location of each user at each frame from the section 3 results. We then reconstruct spatial audio using the techniques explained in section 2.1.

5. RESULTS

We recorded a few video clips from which we attempted to reconstruct 3D audio; the link is provided in [6]. For the first case, we simply tracked one person in the video frame, mapped his face location to HRTF angles, and then reconstructed spatial audio from the single-channel audio input using the corresponding HRTFs, as shown in figure 1. The resulting video, named gobble cs598.mp4, can be found in [6]. For the second video, we recorded two speakers talking; a snapshot was provided earlier in figure 2. We trained both speakers' facial and speech features beforehand, as discussed in sections 3 and 4, and used them to find the face locations. At every frame we detect and capture both speech and faces. Note that, given our face models, we already know where the speakers are located, so we simply classify each audio frame into one of the three classes discussed in section 4. We can then create spatial sound for the two users.

Fig. 12. Speech Classification

In general, we were able to recreate a sound field that was guided by the speakers' faces. We also included some example video clips where a piece of music is guided by the user's face. This technology could be especially useful in hearing aids. Imagine wearing Google Glass: faces could be captured by the glasses and speech by the hearing aid. Tracking and classification could run on the glasses, offline, or in the cloud, and the resulting information fed back to the hearing aid to form beams toward the desired sound source and undo artifacts such as noise and reverberation in the room.

We use the same two-speaker video to perform multimodal speaker recognition. We can calculate the recognition of each speaker using equation 2; we did not, however, try to estimate the w's from the environment. We used values from 1 to 10 for w, where 1 puts more focus on the face classifier and 10 puts more weight on the speech classifier. We then evaluate {P(user)}_{k=1}^{K} over all video frames for all 10 values of w, which yields a matrix of likelihoods such as the one shown below.

P(user) = \begin{bmatrix} p_{1,1} & \cdots & p_{1,w_max} \\ \vdots & \ddots & \vdots \\ p_{K,1} & \cdots & p_{K,w_max} \end{bmatrix}

We then look for the value of w that maximizes the user classification over the most frames, that is,

w_max = argmax_{w=1,...,10} \sum_{k=1}^{K} p_{k,w}

where K is the number of frames. For our video, the chosen value of w was 4, which means the multimodal classifier puts more weight on the face classifier. This makes sense, since the face classifier achieved higher accuracy than the speech classifier.
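A minimal sketch of this weighted fusion and the sweep over w is given below. The equal priors follow the assumption stated with equation 2; the per-frame likelihood arrays, the frame-counting selection rule, and the label array are placeholders for illustration.

# Hedged sketch of equation (2) and the sweep over w described above.
# face_lik and speech_lik are assumed (K, num_users) arrays of per-frame
# likelihoods; true_user holds the per-frame labels used for scoring.
import numpy as np

def fuse(face_lik, speech_lik, w, w_max=10, prior=0.5):
    return prior * face_lik ** w + prior * speech_lik ** (w_max - w)    # equation (2)

def best_weight(face_lik, speech_lik, true_user, w_values=range(1, 11)):
    scores = []
    for w in w_values:
        p_user = fuse(face_lik, speech_lik, w)            # (K, num_users) likelihood matrix
        correct = np.argmax(p_user, axis=1) == true_user   # frames classified correctly
        scores.append(correct.sum())
    return list(w_values)[int(np.argmax(scores))]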
• 8. 6. FUTURE WORK

One of the assumptions made in section 2 was that the speakers do not talk at the same time. One could first apply source separation techniques [7] to the input signal, based on the number of speakers in the room, and then follow the same procedure as in section 4 to classify each separated source into its corresponding class.

Another future enhancement is the ability to spatialize sounds other than speech in a desired video clip. A simple but exhaustive way of doing this is to learn features for different objects and their sounds as well. This would also require an even more exhaustive search when detecting and recognizing objects in defined blocks of every video frame. This work, however, could be useful for entertainment applications such as movies and video games: based on the key objects in a movie, one could learn a large database of the sounds they make and their shapes, and then reconstruct 3D audio throughout the whole clip, which might take hours or even days.

One could also apply this project to teleconferencing. If a room is full of stationary speakers, their facial and speech features can easily be learned by recording one of the sessions. The tracking procedure discussed in section 3 can then locate each speaker, and the techniques explained in section 4 can label the signal. That way, if the conference room is equipped with microphone arrays, this information can be used to form a beam toward the current speaker to enhance speech intelligibility. Note that in teleconferencing this needs to run in real time, but there are assumptions that can speed things up. For example, users are probably not moving around the room, so they only need to be located once; after that, only speech classification and labeling of the signal are needed, and further assumptions can be made, depending on the application, to simplify and speed up the process for real-time scenarios.

There are other possibilities for the future of this project. For example, is it possible to learn a relationship between speech and facial features using unsupervised methods? Earlier, in figures 8 and 11, we depicted the facial and speech features for two and three classes, respectively. We could use clustering methods to group each set of features, e.g. K-Means, GMM clustering, or, better yet, time-series clustering such as Hidden Markov Models (HMMs). One obvious disadvantage of clustering is not having access to the labels of each class. For both 3D audio and speaker recognition, we need labels at each analysis frame, e.g. given a speech signal, where is the user? Or, given the speech and face likelihoods, who is the most likely speaker? We suspect there may be a relationship between facial features and the corresponding speech features at every frame; if such a correlation exists, one could locate a face just by analyzing the speech signal.

One last application we can think of is supervised source separation. Since we already have a dictionary of eigenfaces and speech features for every user, we can simply scan the whole video and signal to extract a user's face and speech by finding the best match between the signals and the corresponding eigenvectors. This idea of unconstrained source separation has already been investigated extensively for video content analysis and sound recognition in [8].
• 9. 7. REFERENCES

[1] E. Choueiri, "Optimal Crosstalk Cancellation for Binaural Audio with Two Loudspeakers," Princeton University.
[2] "Binaural Recording," Wiki 100k, Princeton University.
[3] M. Friedman, "Capturing spatial audio from arbitrary microphone arrays for binaural reproduction," 2014.
[4] B. Gardner, "HRTF Measurements of a KEMAR Dummy-Head Microphone."
[5] "Face Detection and Tracking Using CAMShift," MATLAB documentation, accessed 14 Dec. 2014.
[6] Recorded videos: https://www.dropbox.com/sh/s16c6lrx9tay065/AADZpsofFS4ANiRzpQA8fHPca?dl=0
[7] N. J. Bryan, "ISSE," Stanford University.
[8] P. Smaragdis, "Audio Demos - Sound Recognition for Content Analysis."
[9] "Multimodal Person Identification (Face Recognition, Person ID)," ECE 417 - Multimedia Signal Processing, Mar.-Apr. 2014.