M.sc. presentation t.bagheri fashkhami

A Model of Moving Source Localization Based on Binaural Hearing
Faculty of Electrical and Computer Engineering
Department of Communication
Dr. Masoud Geravanchizadeh
Taher Bagheri
University of Tabriz
Summer 92

Contents
• Introduction
• Model Architecture
• Simulation Setup
• Results

Introduction
• Localization of multiple sound sources from a
binaural input is a challenging problem that has
applications in hearing prostheses, spatial sound
reproduction, and mobile robotics.
• Binaural localization has received significant
attention in the field of computational auditory scene
analysis (CASA), which is guided by principles of the
perceptual organization of sound by human listeners.

• Two principal localization cues are interaural time
difference (ITD) and interaural level difference
(ILD).
• ITD is commonly referred to as the time difference of
arrival, and ILD is due to the effects of the head,
torso, and outer ear.
• The generalized cross correlation (GCC) method is a
well-known approach for ITD estimation.

• If it can be assumed that sources are spatially
stationary over a given interval of time, a simple
approach is to first integrate azimuth information
across frequency, then average across time and select
multiple peaks in the resulting azimuth-dependent
response function.

• There are two main approaches to target tracking that
utilize Bayesian inference.
– Multiple Hypothesis Tracking (MHT)
– Bayesian filtering
• The Bayesian tracker has a closed-form solution only
for a linear process with Gaussian noise, in which
case it is equivalent to the Kalman filter.

• However, when restricting the size of the array to
only two sensors, as in the case of human audition,
the multisource tracking problem becomes more
challenging.
• Many approaches for tracking are based on full field
calculations which are computationally intensive and
sensitive to assumptions on the structure of the
environment.

• Alternative methods that use only select features of
the acoustic field for localization and environmental
parameter estimation have been proposed.
• One of the best methods proposed extracts the
arrival times and amplitudes of distinct paths from
measured acoustic time series using sequential
Bayesian filtering, namely particle filtering.
• Particle filters are popular for tracking with non-linear
and/or non-Gaussian models.
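To make the particle-filtering idea concrete, the following is a minimal sketch of a generic bootstrap particle filter that tracks a single azimuth. It is not the tracker used in this work: the random-walk motion model, the Gaussian measurement likelihood, the function name `track_azimuth`, and every parameter value are assumptions chosen for illustration.

```python
import math
import random


def track_azimuth(measurements, num_particles=500, process_std=2.0,
                  meas_std=5.0, seed=0):
    """Bootstrap particle filter for one azimuth track (degrees).

    Hypothetical model: random-walk motion, Gaussian measurement noise.
    """
    rng = random.Random(seed)
    # Start with particles spread uniformly over the frontal half-plane.
    particles = [rng.uniform(-90.0, 90.0) for _ in range(num_particles)]
    estimates = []
    for z in measurements:
        # Predict: move each particle with the assumed random-walk model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # Update: weight each particle by the measurement likelihood.
        weights = [math.exp(-0.5 * ((z - p) / meas_std) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / num_particles] * num_particles
        else:
            weights = [w / total for w in weights]
        # Estimate: posterior mean of the weighted particles.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resample (multinomial) to concentrate particles on likely azimuths.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return estimates
```

Feeding the function one azimuth measurement per frame returns one estimate per frame. Because nothing in the predict or update step needs to be linear or Gaussian, either model can be swapped for a more realistic one without changing the structure.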

Model Architecture
• The model has two main parts: the first localizes the
initial position of the target source, and the second
tracks the target.
• First, the algorithm trains the model on extracted
primary features such as ITD and ILD.
• The next step determines the position (azimuth) of the
target in order to initialize the particle filter.

• The model contains three main parts:
►Monaural pathway
►Binaural pathway
►Localization framework

Monaural Pathway
• Onset synchrony is known to be a strong cue for
across-frequency grouping in auditory scene analysis
and has been shown to influence localization
judgments by human listeners.
• The proposed framework incorporates a monaural
pathway that uses an onset/offset analysis to group
T-F units dominated by the same underlying source.

• The grouping is used to constrain the integration of
binaural cues for azimuth estimation.
• Monaural pathway includes two parts:
• A. Onset/Offset-Based Segmentation
• B. Onset-Based Weights

• Monaural Pathway
[Block diagram of the monaural pathway. Blocks: Onset/Offset-based Segmentation, Onset-based Weights. Signals: T-F segments, $e^E_{c,m}$, $w^E_{c,m}$.]

Onset/Offset-Based Segmentation
• The method first identifies onsets (increases in signal
energy) and offsets (decreases in signal energy)
across time within gammatone sub-bands.
• The set of T-F units between a pair of onset and offset
fronts forms a T-F segment.
• This segmentation system has been used to generate
T-F segments for the left and the right mixture
independently.

Onset-Based Weights
• In challenging acoustic environments, many T-F units
will be corrupted by diffuse noise or reverberation.
• The method first extracts the signal envelope for each
frequency channel of the left and right signals by
squaring each sub-band and passing it through a
first-order IIR filter.

• Finally, we compute the weight for unit $u^E_{c,m}$ as:
$$w^E_{c,m} = \left[ \frac{e^E_c[m] - e^E_c[m-1]}{e^E_c[m-1]} \right]^{+}$$
where $[x]^{+} = \max(x, 0)$.
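The envelope extraction and onset weighting described above can be sketched as follows. This is an illustration only: the smoothing coefficient `alpha`, the guard value `eps`, and the assumption that the envelope has already been sampled once per frame are not specified in the slides.

```python
def envelope(x, alpha=0.9):
    """Square a sub-band signal and smooth it with a first-order IIR low-pass.

    alpha is a hypothetical smoothing coefficient (0 < alpha < 1).
    """
    env, state = [], 0.0
    for sample in x:
        state = alpha * state + (1.0 - alpha) * sample * sample
        env.append(state)
    return env


def onset_weights(env_frames, eps=1e-12):
    """Half-wave-rectified relative envelope change, one weight per frame.

    w[m] = max((e[m] - e[m-1]) / e[m-1], 0); frame 0 has no predecessor
    and gets weight 0. eps guards against division by zero.
    """
    weights = [0.0]
    for m in range(1, len(env_frames)):
        prev = max(env_frames[m - 1], eps)
        weights.append(max((env_frames[m] - env_frames[m - 1]) / prev, 0.0))
    return weights
```

For example, the frame envelopes `[1.0, 2.0, 1.0, 1.5]` give weights `[0.0, 1.0, 0.0, 0.5]`: rising energy is emphasized and falling energy is ignored.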

• Binaural Pathway
[Block diagram of the binaural pathway. Blocks: Auditory Periphery and Binaural Feature Extraction (ITD & ILD); GMM model of ITD and ILD, $P_c(\tau,\lambda|\theta)$; Model Training. Other labels: noise; $P_c(\tau_{c,m},\lambda_{c,m}|\theta)$.]

Binaural Pathway
• Binaural pathway contains three stages:
• A. Auditory Periphery and Binaural Feature
Extraction
• B. Azimuth-Dependent Binaural Model
• C. Model Training

Binaural Feature Extraction
• The binaural pathway consists of a low-level feature
extraction stage where we calculate the ITD, denoted
$\tau_{c,m}$, and ILD, denoted $\lambda_{c,m}$, for each T-F unit pair.
• A T-F unit is an elemental sound component from one
frame, indexed by m, and one filter channel, indexed
by c.

• We calculate the ITD as the maximum peak in a running
cross-correlation between T-F units $u^L_{c,m}$ and $u^R_{c,m}$, where
we consider time lags between -1 and 1 ms.
• So the ITD is:
$$C(c,m,\tau) = \frac{\sum_{n=0}^{T_n-1} u^L_{c,m}\left(m\frac{T_n}{2}-n\right)\, u^R_{c,m}\left(m\frac{T_n}{2}-n-\tau\right)}{\sqrt{\sum_{n=0}^{T_n-1} \left(u^L_{c,m}\left(m\frac{T_n}{2}-n\right)\right)^2}\,\sqrt{\sum_{n=0}^{T_n-1} \left(u^R_{c,m}\left(m\frac{T_n}{2}-n-\tau\right)\right)^2}}$$
$$\tau_{c,m} = \underset{\tau \in L}{\arg\max}\; C(c,m,\tau)$$
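As a rough illustration of the ITD computation, the sketch below searches integer sample lags for the peak of a normalized cross-correlation between one left and one right T-F unit. The function name `itd_samples`, the use of plain lists, and the truncation of the sums at the frame edges are assumptions; with a sampling rate `fs`, a lag range of 1 ms would correspond to `max_lag = round(0.001 * fs)`.

```python
import math


def itd_samples(left, right, max_lag):
    """Return the lag (in samples) that maximizes the normalized
    cross-correlation sum_i left[i] * right[i - lag].

    Sign convention (an assumption): a negative result means the right
    signal is a delayed copy of the left one.
    """
    n = len(left)
    best_lag, best_c = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        num = e_left = e_right = 0.0
        for i in range(n):
            j = i - lag
            if 0 <= j < n:  # only samples inside the frame contribute
                num += left[i] * right[j]
                e_left += left[i] ** 2
                e_right += right[j] ** 2
        denom = math.sqrt(e_left * e_right)
        c = num / denom if denom > 0.0 else 0.0
        if c > best_c:
            best_lag, best_c = lag, c
    return best_lag
```

Dividing the returned lag by the sampling rate converts it to seconds.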

• ILD corresponds to the energy ratio in dB between
$u^L_{c,m}$ and $u^R_{c,m}$:
$$\lambda_{c,m} = 10\log_{10}\left(\frac{\sum_{n=0}^{T_n-1} \left(u^L_{c,m}\left(m\frac{T_n}{2}-n\right)\right)^2}{\sum_{n=0}^{T_n-1} \left(u^R_{c,m}\left(m\frac{T_n}{2}-n\right)\right)^2}\right)$$
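The ILD of one T-F unit pair reduces to a ratio of frame energies, as in this minimal sketch. The function name `ild_db` and the small constant `eps`, which avoids taking the logarithm of zero for silent frames, are assumptions.

```python
import math


def ild_db(left, right, eps=1e-12):
    """Energy ratio in dB between the left and right samples of one T-F unit."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))
```

A left unit with twice the amplitude of the right unit has four times the energy, so the ILD is about 6.02 dB.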

Azimuth-Dependent Binaural Model
• The model independently captures the frequency-
dependent pattern of ITD and ILD values due to
direct-path propagation, which we refer to as direct-
path cues.
• The azimuth-dependent model of ITD and ILD is
denoted $P_c(\tau,\lambda|\theta)$, which represents the
likelihood of observing a pair of ITD and ILD values
in frequency channel c given energy from a point
source at azimuth θ.

• Due to spatial aliasing, the probability space for
observed ITDs in higher frequency channels is
multi-modal. We therefore use a mixture of Gaussians
to capture the ITD likelihood:
$$P_c(\tau|\theta,r) = \sum_{k=1}^{K_c} \rho^{\tau}_{c,k}(\theta,r)\, \mathcal{N}\left(\tau \,\middle|\, \mu^{\tau}_{c,k}(\theta,r),\, \sigma^{\tau}_{c,k}(\theta,r)\right)$$
• The ILD likelihood is well described by a single
Gaussian:
$$P_c(\lambda|\theta,r) = \mathcal{N}\left(\lambda \,\middle|\, \mu^{\lambda}_{c}(\theta,r),\, \sigma^{\lambda}_{c}(\theta,r)\right)$$
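Evaluating these likelihoods for one channel and one azimuth could look like the sketch below. It assumes the second Gaussian parameter is a standard deviation (the slides do not say whether it is a standard deviation or a variance), leaves out the dependence on r, and uses hypothetical function names; the parameter values in the usage note are made up.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density; std is assumed to be a standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def itd_likelihood(tau, components):
    """Mixture-of-Gaussians ITD likelihood.

    components: list of (weight, mean, std) tuples for one frequency
    channel and one azimuth; the weights should sum to 1.
    """
    return sum(w * gaussian_pdf(tau, mu, sd) for w, mu, sd in components)


def ild_likelihood(lam, mean, std):
    """Single-Gaussian ILD likelihood."""
    return gaussian_pdf(lam, mean, std)
```

With two made-up components such as `[(0.5, -0.4, 0.05), (0.5, 0.4, 0.05)]` (ITD in ms), the likelihood is high near either mode and low between them, which is the multi-modal shape that spatial aliasing produces.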

Model Training
• In this work, we generate training mixtures by
combining a point source with a simulated diffuse
noise, and in doing so, avoid capturing environment-
specific effects.
• Only the head-related transfer functions (HRTFs) of
the binaural setup are known.
• We simulate a point source by filtering monaural
signals using the HRTF for a given azimuth.

• The diffuse noise is created by passing uncorrelated
noise signals through each of the HRTFs for the
binaural setup and then adding them together.
• Given a set of training data for a specific azimuth, the
model measures τ and λ from each pair of mixture T-F
units and calculates $P_c(\tau,\lambda|\theta)$.
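The diffuse-noise construction can be sketched as follows: one independent noise signal per direction is filtered by that direction's left and right impulse responses, and the results are summed at each ear. The toy impulse responses in the usage note, the function names, and the use of Gaussian white noise are assumptions; a real setup would load measured HRIRs.

```python
import random


def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def diffuse_noise(hrirs, num_samples, seed=0):
    """Sum independently generated noise filtered by every HRIR pair.

    hrirs: list of (left_ir, right_ir) pairs, one pair per azimuth.
    Returns the left-ear and right-ear noise signals.
    """
    rng = random.Random(seed)
    longest = max(max(len(hl), len(hr)) for hl, hr in hrirs)
    out_len = num_samples + longest - 1
    left = [0.0] * out_len
    right = [0.0] * out_len
    for h_left, h_right in hrirs:
        # A fresh noise signal per direction keeps the directions uncorrelated.
        noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
        for k, v in enumerate(convolve(noise, h_left)):
            left[k] += v
        for k, v in enumerate(convolve(noise, h_right)):
            right[k] += v
    return left, right
```

A call such as `diffuse_noise([([1.0, 0.2], [0.6, 0.1]), ([0.6, 0.1], [1.0, 0.2])], 1000)` uses two made-up directions. Adding the output to an HRTF-filtered point source gives one training mixture.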

Localization Framework
• The binaural pathway extracts azimuth-dependent
information from each T-F unit pair while the
monaural pathway groups T-F units that are likely to
be dominated by the same source.
• The final stage of the proposed system then integrates
this information and produces a set of N azimuth
estimates.

• To perform localization, we first postulate a set of
N possible azimuths, where we assume N is known.
• For each simultaneous stream or T-F segment, we
find the most likely azimuth from the postulated set
and integrate likelihood scores over all streams and
segments.

• The process generates a total likelihood for each
postulated set of azimuths, and we choose the set that
maximizes this value.
• Formally, let $I^E$ be the total number of simultaneous
streams and T-F segments from ear signal E.

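• Assuming conditional independence between T-F
units, the weighted log-likelihood for $s^E_i$ is then:
$$\beta^E_i(\theta) = \sum_{(c,m) \in s^E_i} w^E_{c,m} \ln\left(P_c(\tau_{c,m}, \lambda_{c,m} | \theta)\right)$$
• We search for the most likely set of N azimuths using:
$$\hat{\Theta} = \underset{\Theta}{\arg\max} \left( \sum_{i=1}^{I^L} \beta^L_i\left(\theta_{\hat{y}^L_i}\right) + \sum_{j=1}^{I^R} \beta^R_j\left(\theta_{\hat{y}^R_j}\right) \right)$$
where $\hat{y}^E_i$ indexes the most likely azimuth in the postulated set for $s^E_i$.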
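The search over azimuth sets can be sketched as an exhaustive enumeration, assuming the per-segment scores have already been computed for every candidate azimuth. The function name `localize`, the dictionary layout of `segment_scores`, and the pooling of left-ear and right-ear segments into one list are illustrative choices rather than the implementation used in this work.

```python
import itertools


def localize(segment_scores, candidate_azimuths, num_sources):
    """Pick the set of num_sources azimuths with the highest total score.

    segment_scores: one dict per stream/segment (left and right ears pooled),
    mapping each candidate azimuth to that segment's weighted log-likelihood.
    """
    best_set, best_total = None, float("-inf")
    for azimuth_set in itertools.combinations(candidate_azimuths, num_sources):
        # Each segment is assigned to the most likely azimuth within the set.
        total = sum(max(scores[a] for a in azimuth_set)
                    for scores in segment_scores)
        if total > best_total:
            best_set, best_total = azimuth_set, total
    return best_set, best_total
```

The number of candidate sets grows combinatorially with N, so this brute-force form is practical only for a small number of sources on a coarse azimuth grid.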

Simulation Setup
• The Roomsim simulator was used to test the model in
this study.
• Roomsim is a simulation of the acoustics of a simple
rectangular prism room, constructed in the MATLAB
m-code programming language.

• The image method of simulating room acoustics is
often used to provide a means of generating signals
incorporating “sufficiently realistic” reverberation
and directional cues for the testing of audio/speech
processing algorithms.
• The foundation on which RoomSim is built is the
publication of a Fortran routine by Allen and Berkley
in 1979.

The RoomSim Program
• The program simulates the geometrical acoustics of
a perfect rectangular parallelepiped enclosure using
the image-source model to produce an impulse
response from each omni-directional primary
source to a directional receiver system.
• This software combines the image method for
reverberation with HRTF measurements made using a
KEMAR dummy head.
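To illustrate the image-source idea, the sketch below mirrors a source across each of the six walls of a rectangular room and computes the arrival delay of every path at a receiver. It covers first-order reflections only and is not Roomsim's code: the function names, the omission of wall absorption and receiver directivity, and the speed of sound of 343 m/s are assumptions.

```python
import math


def first_order_images(source, room):
    """Return the source plus its six first-order mirror images.

    source: (x, y, z) in metres; room: (Lx, Ly, Lz), with walls at 0 and L
    on each axis.
    """
    images = [tuple(source)]  # the direct path comes first
    for axis in range(3):
        for wall in (0.0, room[axis]):
            mirrored = list(source)
            mirrored[axis] = 2.0 * wall - source[axis]
            images.append(tuple(mirrored))
    return images


def arrival_delays(source, receiver, room, speed_of_sound=343.0):
    """Propagation delay in seconds from each image source to the receiver."""
    return [math.dist(image, receiver) / speed_of_sound
            for image in first_order_images(source, room)]
```

Placing an attenuated impulse at each delay gives a crude impulse response. Higher reflection orders repeat the mirroring on the images themselves.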

• The simulation of a head uses Head Related
Transfer Function (HRTF) data, actually Head
Related Impulse Response (HRIR) data, from
measurements made on a Kemar dummy head at
the Center for Image Processing and Integrated
Computing (CIPIC), University of California.

RoomSim Operation
• In operation, the user specifies the dimensions of the
room, its surface materials, the type, location, and
orientation of the receiver system, and the location of
the primary source(s).
• This can be done by submitting a Microsoft Excel
spreadsheet form or a text file, or by selecting a
MATLAB *.mat file that holds a configuration saved
from a previous run.

• If a simulated head has been selected, the response
from each quantized image-source direction is
convolved with the relevant HRIR data.
• The individual image-source responses are then
accumulated to form the complete pressure impulse
response from each primary source to the receiver,
and the result is plotted and saved to file.

Figure 1. RoomSim setup: importing user parameters such as room dimensions, humidity, temperature ...

Figure 2. Room dimensions and positions of the source(s) and receivers.

Figure 3. Spectrum of left and right ears.

• As mentioned, the simulator saves user parameters as
MATLAB files so that the configuration can be used by
other MATLAB functions.
• In this study, we save the Roomsim configuration in a
MATLAB file and use it in the tracking algorithm.

Results
• To evaluate the model, we use a set of binaural impulse
responses (BIRs).
• The simulated BIRs, which we refer to as the Kemar
set, are generated using the Roomsim package.
• We create a library of BIRs by generating room
configurations in which room size, array position, and
array orientation are varied.

• BIRs are generated for azimuths between -90° and
90°, spaced by 5°, at distances of 4, 4, and 3 m.
• In order to train binaural models, we generate
anechoic BIRs for the same azimuths using the HRTF
measurements directly.
• Speech sources are selected from the CIPIC database,
by a selected Kemar BIR.
• The model has been tested in an anechoic room with
additive noise such as babble noise, restaurant noise, …

Figure 6. Without noise and reverberation.

Figure 7. Different noise.

Figure 8. With reverberation.

Figure . With reverberation and noise.

• Table 1. Mean results over three experiments.
• T60 = 600 ms

Noise                    Accuracy (%)    Reverberation                   Accuracy (%)
Clean                    99.2            Clean                           99.2
Babble Noise (5dB)       98.5            T60=600ms                       98
Babble Noise (15dB)      97.9            T60=600ms + Babble Noise        97.4
Restaurant Noise (5dB)   97.8            T60=600ms + Restaurant Noise    97.0
Car Noise (5dB)          98.9            T60=600ms + Car Noise           97.6

Conclusion
• The model does not give good results when tracking
more than one source.
• Compared with a previous tracking model (Roman
and Wang 2008), the proposed model shows better
results under the same noisy conditions.
• Model performance in reverberant environments is
not as good as expected.

Future works
• In this work, we assumed prior knowledge of the
number of sources, and thus a key problem for future
work is estimating the number of sources.
• Using joint visual and auditory information will lead
to better results.
