K. Yamaoka, N. Ono, S. Makino, and T. Yamada, “Time-frequency-bin-wise switching of minimum variance distortionless response beamformer for underdetermined situations,” in Proc. ICASSP, pp. 7908-7912, 2019
16-17 min. + 4-3 min. Thank you, Mr. Chairman. Today, I would like to talk about time-frequency-bin-wise switching of the minimum variance distortionless response beamformer for underdetermined situations. We conducted this study in collaboration with Tokyo Metropolitan University, Japan.
Under this concept, we have proposed the time-frequency-bin-wise switching (TFS) beamformer. In this method, we design multiple beamformers and switch between them at each time-frequency bin. Let’s consider a situation with two microphones and three sound sources: one target and two interferers. In this situation, we cannot construct a null beamformer that suppresses both interferers, like this. But if interferer 2 does not exist, a beamformer that suppresses only interferer 1 can be constructed. Then, the spatial filter w_1 outputs the target signal. In the filtering stage, it also outputs a slightly altered version of interferer 2 if it exists. w_2 can be constructed in the same way. So, here, we have K spatial filters, where K = N-1 and N is the number of sources. K equals the number of interferers in the two-microphone case. The important point is that every filter has the same steering direction but a different null direction. The k-th filter w_k suppresses only the k-th interferer n_k, and every filter enhances the target s. Then, we switch between them at each time-frequency bin. In other words, we use w_1 for the bins dominated by n_1 and w_2 for the bins dominated by n_2. This binary mask shows an example of the selected beamformers. We used w_1 for the red bins and w_2 for the blue bins.
In this work, we focus on the MVDR beamformer. Let’s recall the optimization problem of the MVDR beamformer. When the exact transfer function a_s is available, the MVDR beamformer minimizes the variance of the output signal subject to the distortionless response constraint. In the TFS framework, a joint optimization problem that estimates the masks m_k and the MVDR beamformers w_k is defined as this equation. It should be emphasized that every parameter is optimized using the same criterion, that is, minimum variance, subject to the distortionless response of every filter. Here, note that the transfer function is assumed to be precisely known. Although it is difficult to optimize w_k and m_k simultaneously, we can optimize them alternately.
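The slide equations are not part of these notes, so the following is only a plausible written form of the two problems described above, for a single frequency bin. The frame index t, the covariance R, and the one-hot constraint on the binary mask are assumptions inferred from the spoken description, not copied from the slide.

```latex
% Conventional MVDR: minimum output variance, distortionless toward a_s
\min_{\mathbf{w}} \ \mathbf{w}^{\mathsf H} \mathbf{R}\, \mathbf{w}
\quad \text{s.t.} \quad \mathbf{w}^{\mathsf H} \mathbf{a}_s = 1,
\qquad \mathbf{R} = \mathbb{E}\!\left[\mathbf{x}\mathbf{x}^{\mathsf H}\right]

% Joint TFS-MVDR problem over the filters w_k and the binary masks m_k
\min_{\{\mathbf{w}_k\},\,\{m_k\}} \
\sum_{t} \sum_{k=1}^{K} m_k(t)\,\bigl|\mathbf{w}_k^{\mathsf H}\mathbf{x}(t)\bigr|^2
\quad \text{s.t.} \quad
\mathbf{w}_k^{\mathsf H}\mathbf{a}_s = 1 \ \ \forall k,
\quad m_k(t)\in\{0,1\},
\quad \sum_{k=1}^{K} m_k(t) = 1
```

Both problems use the same cost, the output power, and the same constraint, which is the point emphasized in the talk.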
First, we fix w_k and solve the optimization problem with respect to m_k. Since the red part corresponding to w_k is a fixed value, the solution is this. This is equivalent to switching the filters. Next, we fix m_k and solve the optimization problem with respect to w_k. Let’s consider the perforated signal x_k, which is the observation x masked by m_k. x_k contains many discontinuous zero points. Then, this optimization has a closed-form solution, and it has exactly the same form as the conventional MVDR solution. Finally, we update them alternately.
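The alternating updates described above can be sketched for one frequency bin as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the mask update is taken to be the minimum-output-power selection, the filter update is the usual closed-form MVDR computed from the covariance of the perforated observation x_k, and the small diagonal loading is an added numerical safeguard that is not mentioned in the talk.

```python
import numpy as np

def mvdr(R, a):
    """Closed-form MVDR filter w = R^{-1} a / (a^H R^{-1} a)."""
    Ra = np.linalg.solve(R, a)
    return Ra / (a.conj() @ Ra)

def update_masks(W, X):
    """Fix the filters; select, per frame, the filter with the smallest output power."""
    Y = W.conj() @ X                              # Y[k, t] = w_k^H x(t)
    best = np.argmin(np.abs(Y) ** 2, axis=0)      # index of the selected filter
    return np.eye(W.shape[0])[:, best]            # one-hot masks, shape (K, T)

def update_filters(masks, X, a, loading=1e-8):
    """Fix the masks; recompute each MVDR filter from its perforated observation."""
    M = X.shape[0]
    W = np.empty((masks.shape[0], M), dtype=complex)
    for k, m_k in enumerate(masks):
        X_k = X * m_k                             # zero out frames not assigned to filter k
        R_k = X_k @ X_k.conj().T / max(m_k.sum(), 1.0)
        # Diagonal loading is an assumption added here, e.g. to survive an empty mask.
        R_k = R_k + (loading * np.trace(R_k).real / M + 1e-12) * np.eye(M)
        W[k] = mvdr(R_k, a)
    return W

def tfs_cost(W, masks, X):
    """Total masked output power: sum_k sum_t m_k(t) |w_k^H x(t)|^2."""
    return float(np.sum(masks * np.abs(W.conj() @ X) ** 2))

def tfs_mvdr(X, a, W0, n_iter=20):
    """Alternate the mask and filter updates, starting from initial filters W0 (K, M)."""
    W = W0.copy()
    masks = update_masks(W, X)
    for _ in range(n_iter):
        W = update_filters(masks, X, a)
        masks = update_masks(W, X)
    return W, masks
```

Here `X` has shape (microphones, frames) and `a` is the target transfer function for that frequency. Each step minimizes the same cost over one set of variables, so the cost should not increase from one iteration to the next, apart from the small effect of the loading term.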
Let me elaborate further on what this equation means. Here, we consider that the key assumption of this method is satisfied, that is, a single interferer exists at each time-frequency bin. First, let’s take this observation as an example. This is an illustrated spectrogram of the interferer signals. Then, the output signal of w_1 is this, and thus m_1 becomes this. Therefore, x_1 contains only the target and interferer 1. So, this equation means the computation of the MVDR beamformer in the determined situation where only the target and interferer 1 exist.
Here, the important property is that this minimization of the filter output powers is equal to the minimization of the interferer powers. This is because the target signal in these outputs ideally matches, owing to the distortionless response constraints. The output of the k-th filter also contains the interferers other than the k-th one. Therefore, comparing the output powers is equivalent to comparing the residual interferer powers. Additionally, the magnitude and phase of the target signal perfectly match in the ideal case. Therefore, no target distortion occurs due to the switching.
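This property can be written out in a few lines. The interferer transfer functions a_j and the residual r_k are symbols introduced here for illustration, and the derivation assumes ideal nulls (w_k^H a_k = 0) and a target that is uncorrelated with the interferers.

```latex
% Observation model in one time-frequency bin
\mathbf{x} = \mathbf{a}_s\, s + \sum_{j=1}^{K} \mathbf{a}_j\, n_j

% Output of the k-th filter, using w_k^H a_s = 1 and w_k^H a_k = 0
y_k = \mathbf{w}_k^{\mathsf H}\mathbf{x}
    = s + \underbrace{\sum_{j \neq k} \bigl(\mathbf{w}_k^{\mathsf H}\mathbf{a}_j\bigr)\, n_j}_{r_k}

% The target term is common to all outputs, so only the residuals differ
\mathbb{E}\bigl[|y_k|^2\bigr] = \mathbb{E}\bigl[|s|^2\bigr] + \mathbb{E}\bigl[|r_k|^2\bigr]
\;\;\Longrightarrow\;\;
\arg\min_k \mathbb{E}\bigl[|y_k|^2\bigr] = \arg\min_k \mathbb{E}\bigl[|r_k|^2\bigr]

% Switched output: the target passes unchanged whichever filter is selected
y = \sum_{k=1}^{K} m_k\, y_k = s + \sum_{k=1}^{K} m_k\, r_k
```

In a bin where only n_k is active, r_k = 0 and the selected output is exactly s.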
This shows the directivity patterns of the initialized spatial filters. The vertical axis is the angle, the horizontal axis is the frequency, and the color scale shows the gain. We consider an example where the target DOA is 90 degrees and the DOAs of interferers n_1 and n_2 are 150 and 40 degrees, respectively. We initialize the filters w with nulls that suppress sounds coming from 124 and 166 degrees, both on the same side of the target. This can be called a bad initialization, since the interferers we want to suppress come from both sides of the target. Then, using these filters, we compute m and the perforated spectrograms m_1 y_1 and m_2 y_2.
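An initialization of this kind can be sketched for two microphones as below. The far-field plane-wave steering vector, the 4 cm microphone spacing, and the 1 kHz frequency are illustrative assumptions, not the conditions used in the talk; only the angles (target at 90 degrees, initial nulls at 124 and 166 degrees) come from the example above. The function names are hypothetical.

```python
import numpy as np

def steering_vector(theta_deg, freq_hz, mic_pos_m, c=343.0):
    """Far-field steering vector of a linear array for a plane wave from theta_deg."""
    delays = mic_pos_m * np.cos(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def null_beamformer(a_target, a_null):
    """Two-microphone filter w with w^H a_target = 1 and w^H a_null = 0."""
    A = np.stack([a_target, a_null]).conj()       # rows are a_target^H and a_null^H
    return np.linalg.solve(A, np.array([1.0, 0.0]))

# Hypothetical setup: two microphones 4 cm apart, evaluated at 1 kHz.
mic_pos = np.array([0.0, 0.04])
freq = 1000.0
a_s = steering_vector(90.0, freq, mic_pos)

# Initial filters: same steering direction, nulls at 124 and 166 degrees.
W0 = np.stack([
    null_beamformer(a_s, steering_vector(null_deg, freq, mic_pos))
    for null_deg in (124.0, 166.0)
])

# Gain of each initial filter toward the actual interferers at 150 and 40 degrees.
for doa in (150.0, 40.0):
    gains = np.abs(W0.conj() @ steering_vector(doa, freq, mic_pos))
    print(f"DOA {doa:5.1f} deg: |w_1^H a| = {gains[0]:.3f}, |w_2^H a| = {gains[1]:.3f}")
```

Repeating the same construction over all frequency bins gives one (2, 2) array of initial filters per frequency.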
Next, we update the filters using the perforated observation, that is, x masked by m. Now w_1 has a straight null steered toward n_1, and w_2 suppresses n_2 at some frequencies. Then, we update m and compute the perforated spectrograms again.
%w_2 has a straight null because the initialized w_2 had the good null steering to n_1. %Even though the initialized w_1 had null in the direction away from n_2, updated w_1 suppresses n_2 in some frequencies. %This is because almost all time-frequency bins dominated by n_A are used for w_2, in other words, there is no component of n_A in m_1y_1. %So, updated w_1 computed from m_1y_1 has the null steering to the direction of n_B.
Finally, after sufficient iterations, we obtain the m and the spatial filters w shown in these figures. Then we compute the enhanced target y by adding m_1 y_1 to m_2 y_2. It can be considered that this optimization has many local optima, such as at these frequencies. And a better initialization should lead to better performance.
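The final filtering and switching step over the whole spectrogram can be sketched as follows. The array layout (filters of shape (K, F, M), STFT of shape (F, M, T)) and the function names are assumptions made for this illustration.

```python
import numpy as np

def apply_filters(W, X):
    """W: (K, F, M) filters, X: (F, M, T) STFT. Returns Y[k, f, t] = w_k(f)^H x(f, t)."""
    return np.einsum("kfm,fmt->kft", W.conj(), X)

def combine_outputs(Y, masks):
    """Enhanced spectrogram y = sum_k m_k * y_k, with one-hot masks of shape (K, F, T)."""
    return np.sum(masks * Y, axis=0)
```

With K = 2 this is y = m_1 y_1 + m_2 y_2. Because the masks are one-hot, each bin of y is simply the output of the filter selected for that bin.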
To evaluate the effectiveness of the proposed method, we conducted experiments in a reverberant environment simulated with an RIR generator. This figure shows the layout of the experiments, where we have two microphones and a target signal whose DOA was 90 degrees. We prepared 6 interferers and selected 2 or 3 of them for each mixture; for example, we used the three-source mixture consisting of the target and interferers B and D. The input SNR was about -4 dB. For the initialization of the filters, we used null beamformers, each suppressing a random direction. We evaluated five different initial values.
We evaluated four methods: the conventional single MVDR beamformer; DUET, a conventional binary time-frequency masking method; the proposed TFS beamformer; and the conventional TFS beamformer with oracle filtering. We gave the exact transfer function to the three beamforming methods and also gave the interferer-wise covariance matrices for the oracle filtering. We used objective criteria, namely SDR and SIR, to quantify the results.
The SDR and SIR improvements for the three-source mixtures are shown in this figure for each noise set. The proposed method, the red bar, outperforms the conventional MVDR beamformer and DUET regardless of the interferer DOAs. Additionally, its performance is close to that of the conventional TFS beamformer with oracle filtering, even though we use random initialization. Thus, we concluded that the proposed new TFS algorithm, which estimates the filters and selects the best one simultaneously, is effective for speech enhancement in underdetermined situations.
These show the results for mixtures of the target and three interferers. We concluded that our proposed method improves speech enhancement performance in an underdetermined situation.
That’s all. Thank you for your attention.
ICASSP 2019 Speech & Acoustics Paper Reading Meeting, Author Presentation 2 (Signal Processing 2)
Time-Frequency-Bin-Wise Switching of Minimum Variance Distortionless Response Beamformer for Underdetermined Situations
K. Yamaoka, Nobutaka Ono, Shoji Makino, and Takeshi Yamada
1 University of Tsukuba, Japan
2 Tokyo Metropolitan University, Japan
ICASSP 2019, SS-L6
Acoustic Scene analysis and Tracking for Time-Varying Reverberant Environments