1 November 2016
HANYANG UNIVERSITY
ARCHITECTURAL ACOUSTICS LAB
Office +82-2-2220-1795 | Fax +82-2-2220-4794
http://acoustics.hanyang.ac.kr
Muhammad Imran, Jin Yong Jeon
December 12, 2015
A Steered-Response Power (SRP) based Framework for
Sound Source Localization using Microphone Arrays in
Reverberant Rooms for Enhancement of Speech Intelligibility
42. Jahrestagung für Akustik, 14–17 March 2016
o Introduction
o Background and Motivation
o Sound Source Localization
• Methodology
◦ VAD: Voice Activity Detection
◦ SRP (Beamforming) filters
◦ PHAT-weighting
• Real-time Framework and Implementation
• Optimization and Clustering
o Results
o Conclusion
Contents
o Sound Source Localization and Tracking using microphone arrays for
• Room Acoustics measurements
• Teleconference Systems
o Traditional methods
• Time-difference-of-arrival (TDOA) estimation between microphone pairs using a correlation function, ignoring
◦ Ambient noise
◦ Reflections from surrounding surfaces
◦ Reverberation in closed spaces
o Therefore, these methods
• Produce poor results in terms of precision, resolution, and robustness
• Require additional post-processing to track multiple sources in real-time applications
• Offer only limited bandwidth
Introduction (Background and Motivation)
o Methods for sound source localization using microphone arrays fall into two groups
• Time-difference-of-arrival (TDOA) estimation
◦ Generalized cross-correlation (GCC) with weighting functions (a minimal GCC-PHAT sketch follows this slide)
◦ Optimum detection in reverberant environments
◦ Improved signal-to-noise ratio (SNR)
• Steered response power (SRP)
◦ Weighting function acts as a beamformer
◦ Source localization and tracking
◦ Robust in reverberant conditions
Sound Source Localization
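The slides contain no code; purely as an illustration of the GCC-based TDOA approach listed above, a minimal NumPy sketch of GCC-PHAT delay estimation between two microphone signals might look as follows (the function name and parameters are assumptions, not taken from the presentation).

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the time delay (seconds) between signals x and y via GCC-PHAT."""
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)                       # cross-power spectrum
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                  # optionally limit the search range
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```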
o The method is based on steered-response power (SRP)
o The power at the beamformer output as a function of the look direction 𝑐 is
o Weighted steered-response power
• MVDR beamformer used as the weighting
o After simplification
Methodology (1/3)
$$P_{BF}(c) = D(c)^{H} S\, D(c)$$

where $S$ is the cross-power matrix and $D(c)$ is the array directivity (steering) vector.

$$w = \frac{S^{-1} D(c)}{D(c)^{H} S^{-1} D(c)}$$

$$P_{MVDR}(c) = \frac{1}{D(c)^{H} S^{-1} D(c)}$$
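As a sketch of the per-bin SRP-MVDR evaluation above, the following assumes a far-field steering model and adds a small diagonal loading for numerical robustness; both choices and all names are assumptions, not details stated in the slides.

```python
import numpy as np

def steering_vector(mic_pos, direction, freq, c_sound=343.0):
    """Far-field steering vector D(c) for microphones at mic_pos (M, 3),
    pointing toward the unit look-direction vector `direction`, at `freq` Hz."""
    delays = mic_pos @ direction / c_sound      # per-microphone propagation delays
    return np.exp(-2j * np.pi * freq * delays)

def mvdr_power(S, D, diag_load=1e-6):
    """Per-bin steered response power P_MVDR(c) = 1 / (D(c)^H S^-1 D(c))."""
    M = S.shape[0]
    # Diagonal loading keeps the inverse well conditioned (an added assumption)
    S_inv = np.linalg.inv(S + diag_load * np.real(np.trace(S)) / M * np.eye(M))
    denom = np.real(np.conj(D) @ S_inv @ D)     # D^H S^-1 D (real, positive)
    return 1.0 / max(denom, 1e-12)
```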
o Combining the bins
• The signals 𝑃_MVDR(𝑐, 𝑘) from the different frequency bins are combined
• The approach used is PHAT weighting
o Noise-variance information can also be used, which further improves the final beamformer
o Therefore,
Methodology (2/3)
$$P_{SSL}(c) = \frac{1}{K}\sum_{k=1}^{K} \frac{M}{X_k^{H} X_k}\, P_{MVDR}(c, k)$$

where $X_k$ is the length-$M$ vector containing the input signals of all microphones for frequency bin $k$.

$$N_k = \frac{1}{M}\sum_{i} N_i(k)$$

$$P_{SSL}(c) = \frac{1}{K}\sum_{k=1}^{K} \frac{M}{q\, X_k^{H} X_k + (1-q)\, N_k}\, P_{MVDR}(c, k)$$
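A minimal sketch of how the per-bin MVDR powers could be combined with the PHAT-style weighting above; the data layout, the mixing factor q, and the helper names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def srp_phat_mvdr(X, S_inv, D, q=0.7, N_noise=None):
    """Combine per-bin MVDR powers into P_SSL(c) for one look direction c.

    X       : (K, M) microphone spectra, one row per frequency bin
    S_inv   : (K, M, M) inverse cross-power matrices per bin
    D       : (K, M) steering vectors toward the look direction, per bin
    N_noise : (K,) average noise variance per bin (optional)
    """
    K, M = X.shape
    p_ssl = 0.0
    for k in range(K):
        p_mvdr = 1.0 / np.real(np.conj(D[k]) @ S_inv[k] @ D[k])   # per-bin MVDR power
        sig_pow = np.real(np.conj(X[k]) @ X[k])                   # X_k^H X_k
        if N_noise is None:
            weight = M / max(sig_pow, 1e-12)                      # plain PHAT weighting
        else:
            weight = M / max(q * sig_pow + (1 - q) * N_noise[k], 1e-12)
        p_ssl += weight * p_mvdr
    return p_ssl / K
```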
o Post-processing (simple clustering)
o The algorithm used is so-called "bucket clustering"
o Step 1: grouping the measurements
• Based on single-frame estimates of azimuth 𝜑 and elevation 𝜃 and their standard deviations 𝜎_𝜑 and 𝜎_𝜃, the microphone-array working volume is divided into sections (with 50% overlap):
o Step 2: number of cluster candidates
• A threshold is applied, defined as the average confidence of sections containing more than one measurement:
o Step 3: averaging the measurements in each cluster candidate
Methodology (3/3)
$$M = 4\,\frac{\varphi_{max} - \varphi_{min}}{6\sigma_{\varphi}} \times \frac{\theta_{max} - \theta_{min}}{6\sigma_{\theta}}$$

$$C_{Th} = \frac{1}{L}\sum_{i=1}^{M}\sum_{j=1}^{N_i} C_{ij}$$

where $C_{ij}$ is the confidence of the $j$-th measurement in the $i$-th section, $N_i$ is the number of measurements in the $i$-th section, $M$ is the number of sections, and $L$ is the number of sections containing more than one measurement.
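The slides only outline the bucket-clustering steps; the sketch below is one possible reading of them, with the grid construction simplified (the 50% cell overlap is omitted) and all names and data layouts assumed.

```python
import numpy as np

def bucket_cluster(az, el, conf, sigma_az, sigma_el):
    """Sketch of 'bucket clustering': grid the working volume, threshold the
    sections by confidence, and average the measurements in each candidate."""
    # Step 1: ~2 cells per 6-sigma span on each axis, so the total cell count
    # roughly follows M = 4 * (az range / 6 sigma) * (el range / 6 sigma).
    n_az = max(1, int(np.ceil(2 * (az.max() - az.min()) / (6 * sigma_az))))
    n_el = max(1, int(np.ceil(2 * (el.max() - el.min()) / (6 * sigma_el))))
    az_idx = np.minimum((n_az * (az - az.min()) / (az.max() - az.min() + 1e-9)).astype(int), n_az - 1)
    el_idx = np.minimum((n_el * (el - el.min()) / (el.max() - el.min() + 1e-9)).astype(int), n_el - 1)
    section = az_idx * n_el + el_idx

    # Step 2: threshold = average confidence over sections holding >1 measurement
    counts = np.bincount(section)
    multi = np.where(counts > 1)[0]
    c_th = conf[np.isin(section, multi)].sum() / max(len(multi), 1)

    # Step 3: average the measurements in each section passing the threshold
    clusters = []
    for s in np.unique(section):
        mask = section == s
        if conf[mask].sum() >= c_th:
            clusters.append((np.average(az[mask], weights=conf[mask]),
                             np.average(el[mask], weights=conf[mask])))
    return clusters
```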
Real-time framework
o Sound captured by a 3D microphone array (6 channels)
o Data is passed to a framing and windowing block
o Short-time discrete Fourier transform (DFT)
o Voice activity detection (VAD)
• VAD is used to detect active speech frames (a minimal sketch follows this slide)
• Based on the energy and spectral shifts of each frame
o Localization block (source estimation)
• MVDR weights for each frequency bin
• PHAT weighting to combine all frequency bins
o Optimization (improving the localization estimates by averaging several measurements)
• Simple clustering
Framework
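The presentation does not specify the VAD implementation; a minimal energy- and spectral-flux-based detector, with assumed thresholds and names, could look like this:

```python
import numpy as np

def simple_vad(frames, energy_floor_db=-40.0, flux_thresh=0.1):
    """Mark frames as voiced using per-frame energy and spectral flux.

    frames : (n_frames, frame_len) array of windowed time-domain frames
    Returns a boolean mask of frames treated as active speech.
    """
    energy_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)

    # Spectral flux: change of the normalized magnitude spectrum between frames
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    norm = spectra / (np.linalg.norm(spectra, axis=1, keepdims=True) + 1e-12)
    flux = np.zeros(len(frames))
    flux[1:] = np.linalg.norm(np.diff(norm, axis=0), axis=1)

    # Keep frames that are loud enough (relative to the loudest frame) and changing
    return (energy_db > energy_db.max() + energy_floor_db) & (flux > flux_thresh)
```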
Measurement setup and procedure
o Microphone array:
• 6-channel orthogonal array
o Speech sources:
• Three speech sources placed at 0°, +45°, and −45° azimuth
• Speech duration = 20 s
• 1.5 m from the array
• Clean speech mixed with pink noise
• SNR of 15 dB
Evaluation
[Setup diagram: Source 01 at 0°, Source 02 at −45°, Source 03 at +45° around the array]
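As an aside on the test-signal preparation, a common way to mix clean speech with an already generated pink-noise signal at a 15 dB SNR is sketched below; this is an illustration under assumed names, not the authors' exact procedure.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=15.0):
    """Scale `noise` so that speech + noise reaches the requested SNR in dB."""
    noise = noise[: len(speech)]
    gain = np.sqrt(np.mean(speech ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```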
Results (1/3)
o Localization results at 0° azimuth
• VAD voiced frames = 500

Results: number of frames = 496; STD = 2.03; localization error = 0.027
$$\text{Localization error} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{\varphi}_i - \varphi_i\right|$$
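For illustration only, the reported per-source statistics could be computed from the per-frame azimuth estimates as below; the slides do not state the units of the localization error, so the function and its names are assumptions.

```python
import numpy as np

def localization_stats(est_az_deg, true_az_deg):
    """Mean absolute DOA error and standard deviation of the per-frame estimates."""
    est = np.asarray(est_az_deg, dtype=float)
    return np.mean(np.abs(est - true_az_deg)), np.std(est)
```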
Results (2/3)
o Localization results at 45° azimuth
• VAD voiced frames = 500

Results: number of frames = 493; STD = 1.9; localization error = 0.081
$$\text{Localization error} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{\varphi}_i - \varphi_i\right|$$
Results (3/3)
o Localization results at −45° azimuth
• VAD voiced frames = 500

Results: number of frames = 494; STD = 1.3; localization error = 0.058
$$\text{Localization error} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{\varphi}_i - \varphi_i\right|$$
Conclusion
o Presented a framework for sound source localization
• Using a six-channel spherical microphone array, based on
◦ SRP-MVDR beamforming weighted with PHAT
◦ VAD for extracting the voiced frames of speech
◦ Optimization using a clustering method for accurate localization
o Produces convincing results within an accuracy of ±2° at 15 dB SNR
o The MVDR-weighted SRP localizer estimated the sound sources with a DOA standard deviation of ±1.3°

Editor's Notes

  • #2 Hello, I will be presenting a framework for sound source localization using the SRP method, aimed at obtaining a clean speech signal from a particular speaker in reverberant environments.
  • #3 Starting from the background and motivation of this work, I will describe the methodology of the sound localizer, including the use of VAD to detect voiced frames in the speech signal and PHAT weighting to combine the frequency bins of the beamformer across the microphone pairs of the array. Then a real-time framework will be introduced along with optimization of the DOA results. Finally, the results will be shown and discussed.
  • #4 The common purpose of sound source localizers and trackers is room acoustics measurement and teleconference systems, where speakers are detected and localized in reverberant rooms. Traditional methods use the TDOA between a single pair of microphones computed with GCC. Some recent techniques use different microphone combinations to compute the GCC function with certain weighting functions. However, these methods often ignore ambient noise and reverberant room conditions, and therefore offer poor results in terms of precision, resolution, and robustness. Additionally, they require a post-processing algorithm to track multiple sources in real-time applications and have limited bandwidth.
  • #5 Generally, sound source localization is categorized into two areas: time-difference-of-arrival (TDOA) estimation, where generalized cross-correlation is used to find the delay time between different microphone pairs of the array, and steered response power (SRP) methods, which steer toward the source and are commonly applied for tracking. In this presentation we discuss an SRP-based approach.
  • #6 These slides present the methodology and the necessary mathematical background for our proposed framework. The general beamforming formula is shown in the first equation, where 𝑆 is the cross-power matrix and 𝐷(𝑐) is the array directivity. We use MVDR as the weighting function for the beamformer, and after simplification we obtain the final equation for the beamformer steered toward direction c. This beamformer applies to a single frequency bin.
  • #7 After obtaining the beamformer functions for all frequencies of interest, we combine the signals 𝑃_MVDR(𝑐, 𝑘) from the different frequency bins. The approach followed uses the PHAT weight as a combiner. Here, 𝑋_𝑘 is the input vector of length M containing the input signals of all microphones for this frequency bin. In addition, the noise-variance information is used, and the final beamformer is improved as shown in the last equation.
  • #8 Post-processing is performed using the so-called "bucket clustering" algorithm to improve the confidence of the measurements and to distinguish the actual sound source from reflections.
  • #9 This slide describes the real-time framework of the method. Sound is captured and transformed into the frequency domain, VAD is applied, and the data of the voiced frames are passed to the localization block. Finally, optimization is performed to improve the localization estimates by averaging several measurements using simple clustering.
  • #10 For evaluation, a 6-channel microphone array is used. Three speech sources were placed at 0°, +45°, and −45° azimuth at a distance of 1.5 m from the array, and speech was recorded for 20 s. The speech was mixed with pink noise, and the final signal has an SNR of 15 dB.
  • #11 Localization results at 0° azimuth
  • #12 Localization results at 45° azimuth
  • #13 Localization results at −45° azimuth
  • #14 Presented a framework for sound source localization using a six-channel spherical microphone array, based on SRP-MVDR beamforming weighted with PHAT, VAD for extracting the voiced frames of speech, and optimization using a clustering method for accurate localization. It produces convincing results within an accuracy of ±2° at 15 dB SNR. The MVDR-weighted SRP localizer estimated the sound sources with a DOA standard deviation of ±1.3°.