- 1. DNN-based frequency component prediction for frequency-domain audio source separation Rui Watanabe, Daichi Kitamura (National Institute of Technology, Japan) Hiroshi Saruwatari (The University of Tokyo, Japan) Yu Takahashi, Kazunobu Kondo (Yamaha Corporation, Japan) 28th European Signal Processing Conference (EUSIPCO) SS-2.4
- 2. Background Audio source separation – aims to separate audio sources such as speech, singing voice, and musical instruments. Products with audio source separation – Intelligent speaker – Hearing-aid system – Music editing by users, etc.
- 3. Background Multichannel audio source separation (MASS) – estimates a separation system using multichannel observation without knowing the mixing system Popular methods for each condition – Underdetermined (number of mics < number of sources) • Multichannel nonnegative matrix factorization (MNMF) [Sawada+, 2013] • Approaches based on deep neural networks (DNNs) – Overdetermined (number of mics ≥ number of sources) • Frequency-domain independent component analysis [Smaragdis, 1998] • Independent vector analysis [Kim+, 2007] • Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
- 4. Background Frequency-domain MASS – applies a short-time Fourier transform to the observed time-domain signal to obtain the spectrograms – estimates a frequency-wise separation filter
- 5. Conventional frequency-domain MASS Multichannel nonnegative matrix factorization (MNMF) [Sawada+, 2013] – Unsupervised source separation algorithm without any prior information or training – High-quality MASS can be achieved – Huge computational cost for estimating the parameters
- 6. Proposed method: motivation High-quality MASS with low computational cost A new framework combining frequency-domain MASS and DNN – Separate specific frequencies via MNMF and obtain the separated source components – The source components at the other frequencies are predicted by the DNN
- 7. Proposed method: interpretation of DNN DNN in proposed framework can be interpreted in two ways 1. Audio source separation of specific frequencies (high-frequency band) • Low-frequency bands can be used for predicting high-frequency separated components 2. Audio bandwidth expansion of each source • High-frequency band of the mixture is a strong cue for expanding bandwidth
- 8. Proposed method: details of framework Observed multichannel spectrograms M1 and M2 are divided into low- and high-frequency bands Apply MNMF to the low-frequency bands M1(L) and M2(L) to obtain the separated source components Y1(L) and Y2(L) – High-frequency bands M1(H) and M2(H) are not separated in this step
- 9. Proposed method: details of framework Input M1(H), Y1(L), and Y2(L) to DNN – DNN outputs soft masks W1 and W2 such that the high-frequency bands of the separated sources are estimated from M1(H) by applying the soft masks
- 10. Proposed method: input vector of DNN DNN prediction is performed for each time frame (each column of spectrograms) – Input vector is a concatenation of several time frames around the jth frame in M1(H), Y1(L), and Y2(L) – Normalize the concatenated vector
- 11. Proposed method: DNN architecture Simple fully connected network – Four hidden layers with Swish activations, and a frequency-wise Softmax just before the output
- 12. Experiment 1: bandwidth expansion Validation of the proposed framework – Evaluate bandwidth expansion performance from the low-frequency band of true sources with/without mixture – Confirm validity of the proposed framework that utilizes mixture components for predicting the separated sources – Use sources-to-artifacts ratio (SAR) [Vincent+, 2006]
- 13. Experiment 1: bandwidth expansion

  Training conditions of DNN

  | Item | Setting |
  | --- | --- |
  | Training dataset | 100 drums (Dr.) and vocals (Vo.) songs in SiSEC2016 database [Liutkus+, 2016] |
  | FFT length / shift length | 128 ms / 64 ms |
  | Boundary frequency | 4 kHz (half of Nyquist frequency) |
  | Epochs / batch size | 1000 / 128 |
  | Optimizer | Adam (learning rate = 0.001) |

  Test dataset (SiSEC2011) [Araki+, 2012] for evaluation

  | Song ID | Song name | Signal length [s] |
  | --- | --- | --- |
  | 1 | dev1__bearlin-roads (Dr. & Vo.) | 14.0 |
  | 2 | dev2__another_dreamer-the_ones_we_love (Dr. & Vo.) | 25.0 |
  | 3 | dev2__fort_minor-remember_the_name (Dr. & Vo.) | 24.0 |
  | 4 | dev2__ultimate_nz_tour (Dr. & Vo.) | 18.0 |
- 14. Experiment 1: bandwidth expansion Mixture components help to predict the high-frequency band of the separated sources

  | Song ID | Source | SAR, DNN w/o mixture [dB] | SAR, DNN w/ mixture [dB] |
  | --- | --- | --- | --- |
  | 1 | Dr. | 21.1 | 28.0 |
  | 1 | Vo. | 21.8 | 31.5 |
  | 2 | Dr. | 22.0 | 21.8 |
  | 2 | Vo. | 12.7 | 19.6 |
  | 3 | Dr. | 15.0 | 20.4 |
  | 3 | Vo. | 11.2 | 18.5 |
  | 4 | Dr. | 11.0 | 18.2 |
  | 4 | Vo. | 10.4 | 15.3 |
- 15. Experiment 2: evaluate proposed MASS framework Compare conventional fullband MNMF and the proposed framework – In terms of separation accuracy (source-to-distortion ratio: SDR [Vincent+, 2006]) and computational efficiency
- 16. Experiment 2: evaluate proposed MASS framework

  Experimental conditions of MNMF

  | Item | Setting |
  | --- | --- |
  | Multichannel observed signal | Two-channel mixtures produced by convolving the sources of the test dataset with E2A impulse responses |
  | Boundary frequency | 4 kHz |
  | Number of bases in MNMF | 13 |
- 17. Experiment 2: evaluate proposed MASS framework
- 18. Experiment 2: evaluate proposed MASS framework Song ID 4 – Since the number of frequencies is reduced by half, the proposed method is about twice as fast – Fullband MNMF reached 13 dB in 120 s – The proposed method reached 13 dB in less than 50 s
- 19. Experiment 2: evaluate proposed MASS framework
- 20. Conclusion In this paper – We proposed a computationally efficient audio source separation framework that combines frequency-domain MASS with DNN-based frequency component prediction – In the proposed framework, MASS is applied only to a limited set of frequencies, and the DNN predicts the other frequency components of the sources – Compared with fullband MNMF, the proposed method achieves almost the same quality at about half the computational cost Thank you for your attention!

- Hi everyone, I’m Rui Watanabe from National Institute of Technology, Kagawa College, Japan. I’m gonna talk about DNN-based frequency component prediction for frequency-domain audio source separation.
- Audio source separation is a technique to separate audio sources such as speech, singing voice, musical instruments, and so on. This technology can be used in many products, including intelligent speakers, hearing-aid systems, and music editing by users.
- In particular, multichannel audio source separation, MASS for short, estimates a separation system W using multichannel observations without knowing the mixing system A. (pointing to W and A) This technique can be divided into two categories, for underdetermined and overdetermined situations. In the underdetermined situation, the number of microphones is less than the number of sources in the mixture. For this case, multichannel nonnegative matrix factorization, MNMF for short, is a popular algorithm. Many DNN-based approaches have also been proposed for this case. On the other hand, in the overdetermined situation, the number of microphones is equal to or larger than the number of sources. In this case, frequency-domain independent component analysis and independent low-rank matrix analysis are the most reliable approaches.
- In this presentation, we only treat “frequency-domain MASS”. In this algorithm, we apply a short-time Fourier transform to the observed time-domain signal and obtain the multichannel spectrograms. (pointing to the purple part of the figure) Then, we estimate a frequency-wise separation filter, which is applied to each frequency like this (pointing to the center of the figure) to estimate the separated source signals.
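To make "frequency-wise separation filter" concrete, here is a minimal NumPy sketch. It assumes the determined linear case, in which every frequency bin has its own demixing matrix. The function name, the array layout, and the toy data are illustrative assumptions, and the filter here is an oracle inverse of a known toy mixing system, whereas MASS has to estimate it blindly.

```python
import numpy as np

def apply_separation_filters(X, W):
    """Apply a separate demixing matrix to every frequency bin.

    X: complex array, shape (F, T, M), observed multichannel spectrogram
    W: complex array, shape (F, N, M), one N-by-M separation filter per bin
    Returns Y with shape (F, T, N), where Y[f, t] = W[f] @ X[f, t].
    """
    return np.einsum("fnm,ftm->ftn", W, X)

# Toy demo (all data hypothetical): 5 bins, 8 frames, 2 sources, 2 microphones.
rng = np.random.default_rng(0)
F, T, M = 5, 8, 2
S = rng.standard_normal((F, T, M)) + 1j * rng.standard_normal((F, T, M))  # toy sources
A = rng.standard_normal((F, M, M)) + 1j * rng.standard_normal((F, M, M))  # toy mixing system
X = np.einsum("fmn,ftn->ftm", A, S)  # frequency-wise mixing: X[f, t] = A[f] @ S[f, t]
W = np.linalg.inv(A)                 # oracle filter, used only for this demo
Y = apply_separation_filters(X, W)
```

With the oracle filter the separated spectrogram `Y` matches the toy sources `S`; a real method would replace `np.linalg.inv(A)` with an estimate obtained from `X` alone.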
- Let me introduce the conventional frequency-domain MASS called “multichannel nonnegative matrix factorization,” MNMF for short. This is an unsupervised source separation algorithm and does not require any prior information or training. As an unsupervised technique, MNMF tends to provide high-quality separation performance. In MNMF, the observed multichannel signal is represented by the time-frequency-wise channel correlation matrices denoted by X. Since X is a frequency-by-time matrix whose elements are channel-by-channel matrices, it is a matrix of matrices, that is, a fourth-order tensor. (pointing while saying “frequency-by-time”) MNMF decomposes X into the source-wise spatial model and the low-rank spectral model of all the sources. Thus, by clustering the spectral model into the individual sources using the estimated spatial model, source separation is achieved. However, estimating the parameters requires a huge computational cost because there are so many parameters in this model.
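The "matrix of matrices" can be illustrated with a short NumPy sketch. It assumes that each time-frequency entry of X is the outer product of the observed channel vector with its own conjugate; the function name and array layout are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def channel_correlation(X):
    """Build time-frequency-wise channel correlation matrices.

    X: complex array, shape (F, T, M), multichannel spectrogram
    Returns a fourth-order tensor of shape (F, T, M, M) whose (f, t) element
    is the M-by-M matrix x x^H, with x = X[f, t].
    """
    return np.einsum("ftm,ftn->ftmn", X, X.conj())

# Toy demo with random data (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6, 2)) + 1j * rng.standard_normal((4, 6, 2))
C = channel_correlation(X)
```

Every `C[f, t]` is Hermitian, and the tensor has F x T x M x M entries, which gives a feel for why a model that explains all of them has many parameters.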
- In this presentation, our motivation is that we want to achieve high-quality MASS with a low computational cost. We propose a new source separation framework combining frequency-domain MASS and deep neural networks. In this framework, as an initial process, the mixture signal at specific frequencies is separated by MNMF, and we obtain the separated source components at those frequencies. In this figure, since only the low-frequency band of the mixture is input to MNMF, we get the separated components in the low-frequency band. Of course, the high-frequency bands of the separated sources are missing. (pause clearly) As a post process, we apply DNN-based frequency component prediction, namely, the missing high-frequency bands of the separated sources are predicted by the DNN, where we input not only the separated low-frequency bands but also the high-frequency band of the mixture. (pointing to each input arrow) Since the DNN prediction process is much faster than the MNMF process, we can reduce the total computational cost of this framework. For example, if we divide the frequency bands in half as in the figure, we can reduce the computational time by almost half.
- In our framework, the post DNN process can be interpreted in two ways. First, the DNN performs audio source separation at specific frequencies, the high-frequency band in this figure. Please note that the low-frequency bands can be used for predicting the high-frequency separated components in our DNN model. Second, the DNN can be seen as performing bandwidth expansion of each source, because the high-frequency bands are predicted. In general, bandwidth expansion is a hard task even for a DNN. However, in our model, the high-frequency band of the mixture becomes a strong cue for achieving the bandwidth expansion.
- The details of the proposed method are as follows. First, the observed multichannel spectrograms M1 and M2 are divided into low- and high-frequency bands. Then, we apply MNMF to only the low-frequency bands M1(L) and M2(L) to obtain the separated source components Y1(L) and Y2(L). The high-frequency bands M1(H) and M2(H) are not separated in this step.
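The band split can be sketched in a few lines of NumPy. The function name and the choice to put the boundary bin itself into the high band are assumptions made for illustration; the toy sizes (1025 bins, boundary at bin 512) are likewise hypothetical.

```python
import numpy as np

def split_bands(M, boundary_bin):
    """Split a (F, T) spectrogram at a boundary frequency bin.

    Returns (low, high): low holds bins [0, boundary_bin) and high holds the rest.
    Assigning the boundary bin to the high band is an assumption of this sketch.
    """
    return M[:boundary_bin], M[boundary_bin:]

# Toy demo: two observed channels, 1025 frequency bins, 10 frames.
rng = np.random.default_rng(0)
M1 = rng.standard_normal((1025, 10)) + 1j * rng.standard_normal((1025, 10))
M2 = rng.standard_normal((1025, 10)) + 1j * rng.standard_normal((1025, 10))
M1_L, M1_H = split_bands(M1, 512)
M2_L, M2_H = split_bands(M2, 512)
# Only M1_L and M2_L would be passed on to MNMF in this step.
```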
- Next, we input the high-frequency band of the mixture and the low-frequency bands of the separated sources to the DNN, as in this figure. The DNN outputs two soft masks, W1 and W2, such that the high-frequency bands of the separated sources are calculated from M1(H) by multiplying it by the masks. Of course, the masks are matrices with elements between zero and one, and the corresponding elements of W1 and W2 always sum to unity.
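The masking step amounts to an element-wise product, sketched below in NumPy. The function name and the random toy masks are illustrative assumptions; in the framework the masks come from the DNN, and whether they are applied to the complex spectrogram or to its magnitude is not stated in the talk.

```python
import numpy as np

def apply_softmasks(M1_H, W1, W2):
    """Estimate the high-band source components by element-wise masking.

    M1_H: (F_high, T) high-frequency band of the mixture
    W1, W2: (F_high, T) soft masks in [0, 1] that sum to one element-wise
    """
    if not np.allclose(W1 + W2, 1.0):
        raise ValueError("soft masks must sum to one at every time-frequency point")
    return W1 * M1_H, W2 * M1_H

# Toy demo with random masks (hypothetical data).
rng = np.random.default_rng(0)
M1_H = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))
W1 = rng.uniform(0.0, 1.0, size=(6, 4))
W2 = 1.0 - W1
Y1_H, Y2_H = apply_softmasks(M1_H, W1, W2)
```

Because the masks sum to one, the two estimates add back up to the mixture band, so the DNN only decides how the mixture energy is shared between the sources.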
- The DNN prediction is performed for each time frame j, that is, for each column of the spectrograms. To utilize information along time in the prediction, the input vector for the DNN is a concatenation of several time frames around j in the mixture and the separated sources. Also, before we input the vector to the DNN, we normalize it to stabilize the model training, and the normalization coefficient is added to the input to keep the information about the signal volume.
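One possible way to build such an input vector is sketched below. The talk does not specify the number of context frames, the kind of normalization, or how the coefficient is passed on, so this sketch assumes magnitude spectrograms, two context frames on each side, L2 normalization, and the norm appended as one extra element; all names are hypothetical.

```python
import numpy as np

def build_input_vector(M1_H, Y1_L, Y2_L, j, context=2, eps=1e-12):
    """Concatenate frames j-context .. j+context of the three spectrograms.

    Inputs are magnitude spectrograms of shape (F_band, T).
    Frames outside the signal are replaced by the nearest edge frame (assumption).
    The vector is L2-normalized and the norm is appended as the last element.
    """
    T = M1_H.shape[1]
    frames = np.clip(np.arange(j - context, j + context + 1), 0, T - 1)
    parts = [S[:, frames].T.ravel() for S in (M1_H, Y1_L, Y2_L)]
    v = np.concatenate(parts)
    scale = np.linalg.norm(v) + eps          # normalization coefficient
    return np.concatenate([v / scale, [scale]])

# Toy demo: 4 high-band bins, 3 low-band bins, 10 frames (hypothetical sizes).
rng = np.random.default_rng(0)
M1_H = np.abs(rng.standard_normal((4, 10)))
Y1_L = np.abs(rng.standard_normal((3, 10)))
Y2_L = np.abs(rng.standard_normal((3, 10)))
v = build_input_vector(M1_H, Y1_L, Y2_L, j=0)
```

Here the vector has 5 x (4 + 3 + 3) + 1 = 51 elements; the last one carries the signal level that the normalization would otherwise discard.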
- The DNN model in the proposed method is very simple. We have four fully connected hidden layers, and we apply the Swish function to each hidden layer. Just before the output, we apply a frequency-wise Softmax function to ensure that the sum of the masks equals unity at each frequency. The mean squared error between the separated source vector and the label vector is used as the loss function for the DNN training.
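The forward pass of such a network can be sketched in plain NumPy. The layer widths, the weight initialization, the two-source output layout, and the exact form of the loss (masked mixture magnitude against target magnitudes) are assumptions made for illustration; training code is omitted.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def init_params(rng, in_dim, hidden_dim, n_freq, n_src=2, n_hidden=4):
    """Random weights for n_hidden hidden layers plus one output layer."""
    dims = [in_dim] + [hidden_dim] * n_hidden + [n_freq * n_src]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def predict_masks(params, v, n_freq, n_src=2):
    """Return soft masks of shape (n_src, n_freq) for one time frame."""
    h = v
    for Wt, b in params[:-1]:
        h = swish(h @ Wt + b)                        # Swish on every hidden layer
    Wt, b = params[-1]
    logits = (h @ Wt + b).reshape(n_src, n_freq)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=0, keepdims=True)          # softmax over sources, per frequency

def mse_loss(masks, mix_high, targets):
    """Mean squared error between masked mixture and label vectors."""
    return np.mean((masks * mix_high - targets) ** 2)

# Toy demo (hypothetical sizes): 51-dimensional input, 4 high-band bins.
rng = np.random.default_rng(0)
params = init_params(rng, in_dim=51, hidden_dim=32, n_freq=4)
v = rng.standard_normal(51)
masks = predict_masks(params, v, n_freq=4)
mix_high = np.abs(rng.standard_normal(4))
targets = np.abs(rng.standard_normal((2, 4)))
loss = mse_loss(masks, mix_high, targets)
```

Taking the softmax across the source axis is what guarantees that the masks sum to one at each frequency, regardless of the network weights.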
- To confirm the validity of the proposed method, we have done two experiments. In the first experiment, we evaluate the performance of the DNN model as bandwidth expansion. That is, the DNN restores the high-frequency band from the low-frequency band of the completely separated sources, and we check whether the high-frequency band of the mixture is effective by comparing these two models. (pointing to the figure) In this way, we can confirm the validity of the proposed framework, which utilizes mixture components for predicting the separated sources. As an evaluation score, we use the sources-to-artifacts ratio, SAR, which shows the absence of artificial distortions in the estimated audio signals.
- This slide shows the experimental conditions. For the training of the DNN, we used 100 songs with drums and vocals in the SiSEC2016 database. The boundary frequency between the low- and high-frequency bands was set to 4 kHz, which is half the Nyquist frequency. As the test dataset, we used four songs included in the SiSEC2011 database, where these songs are mixtures of drums and vocals.
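The conditions can be turned into sample and bin counts with a little arithmetic. The sampling rate is not stated explicitly; 16 kHz is inferred here from the statement that 4 kHz is half the Nyquist frequency, and the 128 ms FFT length and 64 ms shift come from the conditions slide. The variable names are illustrative.

```python
fs = 16000             # Hz; inferred: 4 kHz is half the Nyquist frequency, so Nyquist = 8 kHz
fft_ms, shift_ms = 128, 64
boundary_hz = 4000

n_fft = fs * fft_ms // 1000               # FFT length in samples
hop = fs * shift_ms // 1000               # shift length in samples
n_bins = n_fft // 2 + 1                   # non-negative frequency bins
boundary_bin = boundary_hz * n_fft // fs  # index of the 4 kHz bin

print(n_fft, hop, n_bins, boundary_bin)
```

Under these assumptions the FFT length is 2048 samples, the shift is 1024 samples, there are 1025 frequency bins, and the 4 kHz boundary falls on bin 512, so each band covers roughly half of the bins.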
- This is the result of bandwidth expansion. For each song, we show the SAR values of drums and vocals. A higher SAR indicates better audio quality. The two columns show the results of the DNN without the mixture and the DNN with the mixture. In almost all results, the DNN with the mixture outperforms the DNN without the mixture. From this result, we can confirm that the mixture components help to predict the high-frequency band of the separated sources. Thus, we can expect that the proposed framework will perform effectively in a source separation task.
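The phrase "in almost all results" can be checked directly against the SAR values on the results slide. The short script below only restates those numbers and computes the per-source gain; the dictionary layout and variable names are illustrative.

```python
# SAR in dB, copied from the results slide: (DNN w/o mixture, DNN w/ mixture)
sar = {
    (1, "Dr."): (21.1, 28.0), (1, "Vo."): (21.8, 31.5),
    (2, "Dr."): (22.0, 21.8), (2, "Vo."): (12.7, 19.6),
    (3, "Dr."): (15.0, 20.4), (3, "Vo."): (11.2, 18.5),
    (4, "Dr."): (11.0, 18.2), (4, "Vo."): (10.4, 15.3),
}

# Gain from adding the mixture, rounded to the table's one-decimal precision.
gains = {key: round(with_mix - without, 1) for key, (without, with_mix) in sar.items()}
mean_gain = sum(gains.values()) / len(gains)
worse = [key for key, gain in gains.items() if gain < 0]

print(gains)
print(mean_gain, worse)
```

The mean gain is about 6.0 dB, and the drums of Song ID 2 are the only case where adding the mixture is slightly worse (by 0.2 dB).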
- Next, we conducted the MASS experiment. We compare the conventional MNMF and the proposed framework. The conventional method separates the fullband mixture by MNMF, whereas the proposed framework separates only the low-frequency band by MNMF, and the high-frequency band is predicted by the DNN post process. We expect that the computational time is reduced by skipping half of the frequencies in the MNMF process while the separation performance stays almost the same. As a source separation score, we used the source-to-distortion ratio, SDR, which represents the total performance of source separation including both the “degree of separation” and the “quality of separated signals.”
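For reference, the two scores used in these experiments can be written out following the BSS Eval definitions of [Vincent+, 2006]. The symbol names below are the ones commonly used for that decomposition and are not taken from the slides: an estimated source is split into a target component and interference, noise, and artifact errors.

```latex
\begin{align}
  \hat{s} &= s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}, \\
  \mathrm{SDR} &= 10 \log_{10}
    \frac{\| s_{\mathrm{target}} \|^2}
         {\| e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \|^2}, \\
  \mathrm{SAR} &= 10 \log_{10}
    \frac{\| s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} \|^2}
         {\| e_{\mathrm{artif}} \|^2}.
\end{align}
```

SDR penalizes every error term at once, which is why it reflects both the degree of separation and the quality of the separated signals, whereas SAR isolates the artifacts introduced by the processing.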
- The other conditions are shown in this slide. The DNN is trained using the same dataset as in the previous experiment. For the MASS test data, we produced two-channel observed mixtures by convolving the drums and vocals sources of the test dataset with E2A impulse responses, where the recording condition of the E2A impulse response is depicted here. (pointing to the figure) The reverberation time of E2A is 300 ms. The number of bases in MNMF was set to 13, which provides the best result for both the conventional and proposed methods.
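Producing a multichannel mixture by convolution can be sketched as follows. The function name and array layout are assumptions, and the impulse responses in the demo are short random decaying sequences standing in for measured ones; they are not the E2A responses.

```python
import numpy as np

def make_multichannel_mixture(sources, rirs):
    """Convolve each source with its impulse responses and sum per microphone.

    sources: list of N one-dimensional arrays of equal length
    rirs: array of shape (N, M, L); rirs[n, m] is the response from source n to mic m
    Returns an array of shape (M, len(source) + L - 1).
    """
    n_mics, rir_len = rirs.shape[1], rirs.shape[2]
    mix = np.zeros((n_mics, len(sources[0]) + rir_len - 1))
    for s, h in zip(sources, rirs):
        for m in range(n_mics):
            mix[m] += np.convolve(s, h[m])
    return mix

# Toy demo: two sources, two microphones, hypothetical decaying impulse responses.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(100), rng.standard_normal(100)]
decay = np.exp(-np.arange(8) / 2.0)
rirs = rng.standard_normal((2, 2, 8)) * decay
mix = make_multichannel_mixture(sources, rirs)
```

The result has one row per microphone; in the experiment the same operation would be carried out with the drums and vocals signals and the measured impulse responses.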
- This is the result for each song. The vertical axis indicates the SDR score averaged over 10 random initial values. (pointing) The horizontal axis shows the average elapsed time. (pointing) The black line is the conventional method, fullband MNMF, and the red circles are the results of the proposed framework. Since the elapsed time depends on the number of parameter update iterations in MNMF, for the proposed framework we plot the results every 10 iterations of the MNMF process. Of course, the computational time for the DNN prediction process is included in each red circle, although the DNN process requires less than 0.1 s. In all the results, we can confirm the efficacy of the proposed method. In particular, Song ID 4 shows the result just as we expected, so let me explain the result of Song ID 4.
- In the case of Song ID 4, the proposed method achieves 13 dB in less than 50 s, whereas fullband MNMF converged to 13 dB in 120 s. This is because the number of frequencies in MNMF is reduced by half.
- In addition, the proposed method outperforms fullband MNMF in Song IDs 1, 2, and 4. In particular, the improvement in Song ID 1 is very large. The reason for these improvements might be that the proposed method estimates the high-frequency band of the sources more accurately, based on the training with 100 songs. Also, in the case of Song ID 1, fullband MNMF might be trapped in a bad local minimum during the iterative optimization.