14th Asia Pacific Signal and Information Processing Association Annual Summit and Conference
Session: Deep Learning: Algorithm, Implementations, and Applications
Time: Wed., 9. Nov., 15:20-15:40 (UTC +7)
DNN-Based Frequency-Domain Permutation Solver
for Multichannel Audio Source Separation
Fumiya Hasuike*, Daichi Kitamura*, Rui Watanabe†
*National Institute of Technology, Kagawa College, Japan
†Japan Advanced Institute of Science and Technology, Japan
2
Background
• Audio source separation
– extracts specific sounds (e.g., speech, noise, singing voice, musical instruments, machine sounds)
• Applications
⁃ Speech recognition
⁃ AI speakers
⁃ Increased functionality of hearing aids
⁃ Noise cancellation, etc.
[Figure: audio source separation]
3
Background
• Blind source separation (BSS)
– assumes that the mixing system is unknown and estimates the demixing system
– BSS for the determined situation (microphones ≥ sources)
• The demixing matrix can be defined as the inverse of the mixing matrix because the mixing matrix can be square
• The separated sound has good quality because linear separation is possible
• This research addresses determined BSS
– High sound quality and applicable to a wide variety of fields
Examples: independent component analysis (ICA) [Comon, 1994], independent vector analysis (IVA) [Hiroe, 2006], [Kim+, 2006], independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
[Figure: mixing system, BSS, demixing system]
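The inverse-matrix argument for the determined case can be checked with a minimal sketch. The mixing matrix `A` below is a made-up example for one frequency bin; in real BSS it is unknown and the demixing matrix has to be estimated rather than computed from `A`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two sources observed by two microphones (determined case), one frequency bin.
n_src, n_frames = 2, 100
s = rng.standard_normal((n_src, n_frames)) + 1j * rng.standard_normal((n_src, n_frames))

# Hypothetical mixing matrix (unknown in real BSS; fixed here for illustration).
A = np.array([[1.0 + 0.2j, 0.5 - 0.1j],
              [0.3 + 0.4j, 1.0 - 0.3j]])
x = A @ s                  # observed signals

# Because A is square and invertible, a linear demixing matrix exists.
W = np.linalg.inv(A)
y = W @ x                  # separated signals

recovered = np.allclose(y, s)
```

Here `recovered` is true because the separation is a purely linear operation, which is the reason the slides give for the good sound quality of determined BSS.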
4
[Figure: overview of related methods]
• Conventional BSS: encounters the permutation problem
– Frequency-domain independent component analysis (FDICA) [Smaragdis, 1998]
– Full-rank spatial covariance analysis (FCA) [Duong+, 2010]
• Methods that avoid the permutation problem
– Independent vector analysis (IVA) [Hiroe, 2006], [Kim+, 2006]
– Auxiliary independent vector analysis (AuxIVA) [Ono, 2011]
– Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
• Methods that solve the permutation problem
– Permutation solution based on frequency correlation [Murata+, 2001], [Sawada+, 2004]
– Permutation solution based on direction of arrival [Saruwatari+, 2006]
– Deep permutation solver based on local time-frequency structure [Yamaji+, 2020]
– Proposed method
• The DNN-based solvers are supervised methods; the others are unsupervised methods
5
Contents
• Related methods
– Permutation problem in FDICA or FCA
– Methods to avoid permutation problem
– Deep permutation solver based on local time-frequency structure
• Proposed method
– Overview and permutation matrix estimation
– Estimate separated signals and record losses
– Process against test data
• Experiments
7
Permutation problem in FDICA or FCA
• Permutation problem
– The order of the separated components is mixed up in each frequency
[Figure: source signals 1 and 2 are mixed into observed signals 1 and 2; ICA or FCA is applied to all frequency components and yields permutation-inconsistent signals 1 and 2; a permutation solver converts them into separated signals 1 and 2]
Permutation problem: the source components are separated in each frequency, but the order of the separated signals is inconsistent across frequencies
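The effect can be imitated with toy data. The sketch below assumes an independent coin flip per frequency bin, which is a simplification for illustration and not a model of how FDICA or FCA actually orders its outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames = 6, 4

# Toy "spectrograms": source 1 is all ones, source 2 is all twos.
aligned = np.stack([np.full((n_freq, n_frames), 1.0),
                    np.full((n_freq, n_frames), 2.0)])   # (source, freq, time)

# Each frequency is separated independently, so the output order in each
# bin is arbitrary. Model this with one random swap decision per bin.
swap = rng.random(n_freq) < 0.5
inconsistent = aligned.copy()
inconsistent[:, swap, :] = aligned[::-1][:, swap, :]
```

Every bin of `inconsistent` still holds both sources, but output 1 is no longer the same source in all bins. Restoring that consistency is the job of the permutation solver.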
8
Methods to avoid permutation problem
• IVA [Hiroe, 2006], [Kim+, 2006]
– assumes that, in each source, all frequency components have strong power simultaneously
• ILRMA [Kitamura+, 2016]
– assumes that each source has a low-rank time-frequency structure (i.e., it contains many repeated spectral patterns)
[Figures: source model of IVA and source model of ILRMA (axes: time, frequency)]
9
Motivation of deep permutation solver
• Sources have their own time-frequency structures (e.g., drums, guitar, vocals)
– When the source model does not fit a source, the BSS performance of IVA or ILRMA degrades
– A source model that fits a wide variety of sources is required
– Discovering such a universal source model is difficult
• A DNN that focuses only on permutation alignment
– may be applicable to a wide variety of source signals
– The deep permutation solver [Yamaji+, 2020] takes this approach
10
Conventional deep permutation solver
• Deep permutation solver based on local time-frequency structure [Yamaji+, 2020]
– The DNN predicts whether the input reference frequency component and each of its neighbors belong to the same source or to different sources
• “0” means the same source
• “1” means different sources
[Figure: input vectors taken from the time-frequency plane are fed to the DNN, which outputs one label per neighboring component, e.g., 1 (different), 1 (different), 0 (same), 1 (different), 0 (same)]
11
Conventional deep permutation solver
• The algorithm becomes complex with three or more sources
– When the DNN predicts 1 (different), the permutation cannot be determined
• because several permutations are possible
– The algorithm is complicated because it requires a combinatorial search that depends on the number of sources
[Figure: with three or more sources, the DNN outputs (e.g., 1: different, 1: different, 0: same, 1: different, 0: same) leave the combination of sources unknown]
13
Overview of the proposed method
• The permutation solver must estimate the inverse of the permutation matrix
– Permutation-aligned signals can be obtained by a matrix product
[Figure: the permutation matrices (two possibilities for two sources) are predicted by the DNN; their matrix product with the permutation-inconsistent signals yields the permutation-aligned signals]
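For a single frequency bin and two sources, the idea can be written out directly. The signal values below are toy numbers chosen for illustration; the only fact used is that a permutation matrix is orthogonal, so its inverse equals its transpose.

```python
import numpy as np

# The two possible permutation matrices for two sources.
P_keep = np.array([[1, 0], [0, 1]])
P_swap = np.array([[0, 1], [1, 0]])

# Aligned components of two sources at one frequency bin (toy values).
aligned = np.array([[1.0, 1.5, 0.5],    # source 1 over three time frames
                    [4.0, 3.0, 5.0]])   # source 2

# The separation stage applies an unknown permutation in this bin.
inconsistent = P_swap @ aligned

# A permutation matrix is orthogonal, so its inverse is its transpose.
P_inv = P_swap.T
restored = P_inv @ inconsistent
```

Once the solver has identified which of the two matrices was applied in a bin, one matrix product per bin is enough to restore the order.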
14
Preprocessing
• Normalization process for permutation-inconsistent
signals [Sawada+, 2004]
– emphasizes the correlation of components of the same source
– restricts the value of the DNN input to the interval [0, 1]
[Figure: spectrograms (axes: frequency, time) illustrating the normalization]
15
Input for DNN
• Extract several consecutive time frames of each signal as local-time components
– The resulting vector is input to the DNN
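One possible way to build such input vectors is sketched below. The slides do not state whether segments overlap or how the components are ordered inside the vector, so the non-overlapping segmentation, the flattening order, the function name `local_time_inputs`, and the reading of the "local time parameter" of 13 as the number of frames per segment are all assumptions.

```python
import numpy as np

def local_time_inputs(signals, local_len=13):
    """Cut a (source, freq, time) array into non-overlapping local-time
    segments and flatten each segment into one input vector."""
    n_src, n_freq, n_frames = signals.shape
    n_seg = n_frames // local_len                  # drop the incomplete tail
    trimmed = signals[:, :, :n_seg * local_len]
    segs = trimmed.reshape(n_src, n_freq, n_seg, local_len)
    # Reorder to (segment, source, freq, local frame), then flatten per segment.
    segs = segs.transpose(2, 0, 1, 3)
    return segs.reshape(n_seg, -1)

signals = np.random.default_rng(0).random((2, 8, 30))
inputs = local_time_inputs(signals)
```

With 30 frames and 13 frames per segment, this gives two input vectors of length 2 × 8 × 13 and discards the last four frames.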
16
DNN architecture
• The DNN consists of three hidden layers
– Fully connected dense layers with ReLU functions
• A softmax function is applied in the output layer
– The predicted values become probabilities in each frequency
[Figure: example DNN output probabilities in each frequency]
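A forward pass with this structure can be sketched in NumPy. The layer sizes are small placeholders (the conditions slide lists 4096 units per hidden layer), the weights are random rather than trained, and the output layout of one softmax over two candidate permutations per frequency is an assumption about how the probabilities are arranged.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def forward(x, params, n_freq, n_perm=2):
    """Three ReLU hidden layers, then a softmax over permutations per frequency."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)        # fully connected layer + ReLU
    W, b = params[-1]
    logits = (h @ W + b).reshape(-1, n_freq, n_perm)
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
n_freq, in_dim, hidden = 8, 2 * 8 * 13, 32    # toy sizes, not the slide values
sizes = [in_dim, hidden, hidden, hidden, n_freq * 2]
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

probs = forward(rng.random((5, in_dim)), params, n_freq)
```

`probs` has shape (batch, frequency, 2), and each pair sums to one, so it can be read as the probability of each candidate permutation in each frequency.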
17
Create predicted permutation matrices
• Create predicted permutation matrices using the output values (probability values) of the DNN
– Use the probability values as coefficients of the permutation matrix
– In the two-source case, there are two possible permutation matrices
[Figure: the DNN output probabilities in each frequency are converted to predicted permutation matrices]
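One reading of "use the probability values as coefficients" is a probability-weighted sum of the two candidate matrices, sketched below. The slides do not give the exact formula, so this weighting and the names `P_KEEP`, `P_SWAP`, and `soft_permutation_matrices` are assumptions.

```python
import numpy as np

P_KEEP = np.eye(2)
P_SWAP = np.array([[0.0, 1.0], [1.0, 0.0]])

def soft_permutation_matrices(probs):
    """probs: (freq, 2) array with probs[f] = (p_keep, p_swap), summing to one.
    Returns a (freq, 2, 2) array of probability-weighted permutation matrices."""
    return probs[:, 0, None, None] * P_KEEP + probs[:, 1, None, None] * P_SWAP

probs = np.array([[1.0, 0.0], [0.1, 0.9], [0.5, 0.5]])
P_soft = soft_permutation_matrices(probs)
```

A confident prediction such as (1.0, 0.0) gives an exact permutation matrix, while (0.5, 0.5) gives a matrix that averages the two signals. Keeping the matrix soft in this way makes the later matrix product differentiable with respect to the DNN output.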
18
Create estimated signals in local time
• Apply matrix product to create estimated signals
– Matrix product between predicted soft permutation matrices and
permutation-inconsistent signals
[Figure: matrix product of the predicted permutation matrices and the permutation-inconsistent signals, yielding permutation-aligned signals]
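The per-frequency matrix product can be applied to all bins at once with `np.einsum`. The array layout (source, frequency, time) and the helper name `apply_permutation` are choices made for this sketch; hard 0/1 matrices are used so the result is easy to check, but the same call works with soft matrices.

```python
import numpy as np

def apply_permutation(P, signals):
    """P: (freq, src, src) matrices; signals: (src, freq, time).
    Computes P[f] @ signals[:, f, :] for every frequency f."""
    return np.einsum("fnm,mft->nft", P, signals)

# Toy local-time signals: the source rows are swapped in frequency bin 1 only.
signals = np.array([[[1.0, 1.0], [2.0, 2.0]],
                    [[2.0, 2.0], [1.0, 1.0]]])       # (src=2, freq=2, time=2)
P = np.array([[[1.0, 0.0], [0.0, 1.0]],              # bin 0: keep the order
              [[0.0, 1.0], [1.0, 0.0]]])             # bin 1: swap back
estimated = apply_permutation(P, signals)
```

After the product, `estimated[0]` contains only the first source and `estimated[1]` only the second.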
19
Design of Loss Function
• Apply mean squared error (MSE) between the estimated and completely aligned spectrograms
– Introduce permutation-invariant training (PIT) [Yu+, 2017] to permit source-order ambiguity
[Figure: MSE and PIT computed between the estimated and completely aligned spectrograms (axes: frequency, time)]
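A minimal sketch of an MSE loss with PIT follows. It evaluates the MSE for every global ordering of the target sources and keeps the smallest value; the function name `pit_mse` and the array layout are assumptions, and a training framework would compute the same quantity on differentiable tensors.

```python
import numpy as np
from itertools import permutations

def pit_mse(estimated, target):
    """Permutation-invariant MSE. Both arrays have shape (src, freq, time).
    Returns the smallest MSE over all global orderings of the target sources."""
    n_src = target.shape[0]
    losses = [np.mean((estimated - target[list(order)]) ** 2)
              for order in permutations(range(n_src))]
    return min(losses)

rng = np.random.default_rng(0)
target = rng.random((2, 4, 3))
loss_same = pit_mse(target, target)
loss_swapped = pit_mse(target[::-1], target)   # same signals, opposite source order
```

Both losses are zero. An output that is consistently aligned but lists the sources in the opposite order is not penalized, which is the source-order ambiguity that PIT permits.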
20
Majority decision for test data
• Take a majority decision to obtain the most reliable permutation matrix in each frequency
[Figure: the DNN output for each local-time segment (axes: frequency, time) is converted to a permutation matrix, and a majority decision is taken over the segments]
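The majority decision can be sketched as a vote over local-time segments. The input layout (segment, frequency, permutation) and the tie-breaking rule (the lower index wins, a side effect of `argmax`) are assumptions made for this example.

```python
import numpy as np

def majority_decision(probs):
    """probs: (segment, freq, n_perm) DNN outputs for all local-time segments.
    Returns, per frequency, the permutation index chosen by the most segments."""
    votes = probs.argmax(axis=-1)                        # (segment, freq)
    n_perm = probs.shape[-1]
    counts = np.stack([(votes == k).sum(axis=0) for k in range(n_perm)])
    return counts.argmax(axis=0)                         # (freq,)

# Three segments, two frequencies, two candidate permutations (keep, swap).
probs = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.6, 0.4], [0.7, 0.3]],
                  [[0.3, 0.7], [0.1, 0.9]]])
decision = majority_decision(probs)
```

In the first frequency two of the three segments vote "keep", and in the second two of three vote "swap", so `decision` is `[0, 1]`.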
22
Conditions
• Compared methods
– Conventional deep permutation solver [Yamaji+, 2020]
– Proposed method
• Evaluation criterion
– Source-to-distortion ratio (SDR) [Vincent+, 2006]
• represents the overall quality of the source separation
• Experimental data
– Two pairs of dry source signals
– Obtained from SiSEC2011 [Araki+, 2012]

Signal type  Source         Data name                    Length
Speech       Male speech    dev2_male4_inst_src_2.wav    10.0 s
Speech       Female speech  dev3_female4_inst_src_2.wav  10.0 s
Music        Drums          dev1_wdrums_src_3.wav        11.0 s
Music        Guitar         dev1_wdrums_src_2.wav        11.0 s
23
Conditions
• Training data
– The speech and music signals are divided into 64 blocks
– These blocks are randomly swapped to simulate the block permutation problem
– A total of 300 random swapping patterns are prepared
• Test data
– Ten new patterns of randomly swapped signals are prepared as the test dataset
• Two models are trained: speech and music
– Speech model: trained using only the speech pair
– Music model: trained using only the music pair
• Two test conditions: in-domain and out-of-domain
– In-domain: the same sources are used in training and testing
– Out-of-domain: different sources are used in training and testing
[Figure: random shuffling of the blocks]
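The data generation can be sketched as follows. The slides do not say along which axis the 64 blocks are taken; this sketch assumes contiguous blocks along the frequency axis, uses random arrays in place of real spectrograms, and introduces the helper name `block_permute` for illustration.

```python
import numpy as np

def block_permute(aligned, n_blocks=64, rng=None):
    """aligned: (2, freq, time). Splits the frequency axis into n_blocks
    contiguous blocks and swaps the two sources in randomly chosen blocks.
    Returns the permuted signals and the per-frequency swap labels."""
    rng = np.random.default_rng() if rng is None else rng
    n_freq = aligned.shape[1]
    blocks = np.array_split(np.arange(n_freq), n_blocks)
    swap = np.zeros(n_freq, dtype=bool)
    for idx in blocks:
        swap[idx] = rng.random() < 0.5       # one swap decision per block
    permuted = aligned.copy()
    permuted[:, swap, :] = aligned[::-1][:, swap, :]
    return permuted, swap

rng = np.random.default_rng(0)
aligned = rng.random((2, 128, 5))            # toy stand-in for a source pair
patterns = [block_permute(aligned, rng=rng) for _ in range(300)]
```

Each entry of `patterns` pairs a permutation-inconsistent input with the swap labels that generated it, which is the kind of supervision a permutation solver needs for training.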
24
Conditions
• Other conditions

Sampling frequency      16 kHz
Window function         Hann window
Window length           128 ms
Shift length            64 ms
Local time parameter    13
Epochs                  1000
Units of hidden layers  4096
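The time-based settings translate into sample counts as shown below. The number of frequency bins assumes that the FFT length equals the window length (no zero padding), which the slides do not state.

```python
fs_hz = 16_000
window_ms, shift_ms = 128, 64

window_samples = fs_hz * window_ms // 1000      # samples per analysis window
shift_samples = fs_hz * shift_ms // 1000        # hop size in samples
overlap_ratio = 1 - shift_samples / window_samples

# Assumption: FFT length equals the window length (no zero padding).
n_freq_bins = window_samples // 2 + 1
```

This gives a 2048-sample window, a 1024-sample shift (50% overlap), and 1025 frequency bins under the stated assumption.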
25
Results
[Figure: results for test data patterns 1 to 10 and their average, in four panels: using a speech model (in-domain), using a music model (in-domain), using a speech model (out-of-domain), and using a music model (out-of-domain); the vertical axis is annotated from Poor to Good]
26
Results
• Spectrograms (in-domain test dataset with a music model)
– Conventional method fails to align mainly in the low-frequency domain
– Proposed method solves the permutation problem almost perfectly
[Figure: spectrograms (frequency 0 to 3 kHz, time 0 to 5 s) of the original signal, the input signal (permutation-inconsistent signal), the conventional method, and the proposed method]
27
Conclusion
• Motivation
– Construct a model for solving permutation problems (deep permutation solver) rather than assuming specific source models
– Construct algorithms that can be extended to a larger number of sound sources
• The conventional deep permutation solver is difficult to apply to cases with more than two sources
• Proposed method
– is a simple, robust, and precise DNN-based permutation solver
– The model can be trained with few samples (few-shot learning)
• The proposed method outperformed the conventional method
Thank you for your attention.

70 POWER PLANT IAE V2500 technical training70 POWER PLANT IAE V2500 technical training
70 POWER PLANT IAE V2500 technical training
 
Javier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptxJavier_Fernandez_CARS_workshop_presentation.pptx
Javier_Fernandez_CARS_workshop_presentation.pptx
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 

DNN-based frequency-domain permutation solver for multichannel audio source separation

  • 1. 14th Asia Pacific Signal and Information Processing Association Annual Summit and Conference. Session: Deep Learning: Algorithm, Implementations, and Applications. Time: Wed., 9 Nov., 15:20-15:40 (UTC+7). DNN-Based Frequency-Domain Permutation Solver for Multichannel Audio Source Separation. Fumiya Hasuike*, Daichi Kitamura*, Rui Watanabe†. *National Institute of Technology, Kagawa College, Japan. †Japan Advanced Institute of Science and Technology, Japan.
  • 2. Background • Audio source separation – extracts specific sounds such as voice, noise, singing, instrumental, and machine sounds • Applications ⁃ Speech recognition ⁃ AI speakers ⁃ Increased functionality of hearing aids ⁃ Noise canceling, etc.
  • 3. Background • Blind source separation (BSS) – assumes that the mixing system is unknown and estimates the demixing system – BSS for the determined situation (microphones ≥ sources) • an inverse matrix can be defined because the mixing system can be a square matrix • the separated sound has good quality because linear separation is possible • This research handles determined BSS – High sound quality and applicable to a wide variety of fields • Examples: independent component analysis (ICA) [Comon, 1994], independent vector analysis (IVA) [Hiroe, 2006], [Kim+, 2006], independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016] • Figure: mixing system, BSS, demixing system
  • 4. Conventional BSS • Encounter permutation problem – Frequency-domain independent component analysis (FDICA) [Smaragdis, 1998] – Full-rank spatial covariance analysis (FCA) [Duong+, 2010] • Avoid permutation problem – Independent vector analysis (IVA) [Hiroe, 2006], [Kim+, 2006] – Auxiliary independent vector analysis (AuxIVA) [Ono, 2011] – Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016] • Solve permutation problem – Unsupervised methods: permutation solutions based on frequency correlation and on direction of arrival [Saruwatari+, 2006], [Murata+, 2001], [Sawada+, 2004] – Supervised methods: deep permutation solver based on local time-frequency structure [Yamaji+, 2020]; proposed method
  • 5. Contents • Related methods – Permutation problem in FDICA or FCA – Methods to avoid permutation problem – Deep permutation solver based on local time-frequency structure • Proposed method – Overview and permutation matrix estimation – Estimate separated signals and record losses – Process against test data • Experiments
  • 6. Contents (outline repeated from slide 5)
  • 7. Permutation problem in FDICA or FCA • Permutation problem – The source components are separated in each frequency, but the order of the separated signals is mixed up across frequencies • Figure: source signals 1 and 2, observed signals 1 and 2, ICA or FCA applied to all frequency components, permutation-inconsistent signals 1 and 2, permutation solver, separated signals 1 and 2
  • 8. Methods to avoid permutation problem • IVA [Hiroe, 2006], [Kim+, 2006] – assumes that, in each source, all the frequency components have strong power simultaneously • ILRMA [Kitamura+, 2016] – assumes that each source has a low-rank time-frequency structure (containing many repetitions) • Figure: source model of IVA and source model of ILRMA (time-frequency plots)
  • 9. Motivation of deep permutation solver • Sources have their own time-frequency structures (figure: drums, guitar, vocals) – When the source model does not fit a source, the BSS performance of IVA or ILRMA is degraded – A source model that fits versatile sources is required, but discovering such a universal source model is difficult • A DNN that focuses only on the permutation alignment – may be applicable to versatile source signals – We therefore focus on the deep permutation solver [Yamaji+, 2020]
  • 10. Conventional deep permutation solver • Deep permutation solver based on local time-frequency structure [Yamaji+, 2020] – The DNN predicts whether the input reference frequency component and each of its neighbors belong to the same source or to different sources • "0" means the same source • "1" means different sources • Figure: input vectors taken from the time-frequency plane are fed to the DNN, whose outputs are, for example, 1 (different), 1 (different), 0 (same), 1 (different), 0 (same)
  • 11. Conventional deep permutation solver • The algorithm becomes complex with three or more sources – We cannot determine the permutation when the DNN predicts 1 • there are several possible permutations (unknown combination of sources) – The algorithm is complicated because it needs a combination process that depends on the number of sources
  • 12. Contents (outline repeated from slide 5)
  • 13. Overview of the proposed method • The permutation solver must estimate the inverse of the permutation matrix – Permutation-aligned signals can be obtained by a matrix product • Figure: the matrix predicted by the DNN is multiplied by the permutation-inconsistent signals to give the permutation-aligned signals; the remaining factor can be calculated; the permutation matrices of two sources are shown
  • 14. Preprocessing • Normalization process for permutation-inconsistent signals [Sawada+, 2004] – emphasizes the correlation of components of the same source – restricts the value of the DNN input to the interval [0, 1] • Figure: time-frequency spectrograms illustrating the normalization
  • 15. Input for DNN • Extract several time frames of each signal as local-time components – The resulting vector is input to the DNN
  • 16. DNN architecture • The DNN consists of three hidden layers – Fully connected dense layers with ReLU functions • A softmax function is applied in the output layer – The predicted values become probabilities in each frequency • Figure: example output probabilities for each frequency (for example, 0.9 and 0.1, or 0.5 and 0.5)
  • 17. Create predicted permutation matrices • Create predicted permutation matrices using the output values (probability values) of the DNN – Use the probability values as coefficients of the permutation matrices – In the two-source case, there are two possible permutation matrices • Figure: the per-frequency probabilities are converted to predicted permutation matrices
  • 18. Create estimated signals in local time • Apply a matrix product to create the estimated signals – The matrix product between the predicted soft permutation matrices and the permutation-inconsistent signals gives the permutation-aligned signals
  • 19. Design of loss function • Apply the mean squared error (MSE) between the estimated and completely aligned spectrograms – Introduce permutation-invariant training (PIT) [Yu+, 2017] to permit source-order ambiguity
  • 20. Majority decision for test data • Take a majority decision to obtain the most reliable permutation matrix in each frequency • Figure: each local-time output is converted to a permutation matrix, and a majority decision is taken over these matrices
  • 21. Contents (outline repeated from slide 5)
  • 22. Conditions • Compared methods – Conventional deep permutation solver [Yamaji+, 2020] – Proposed method • Evaluation criterion – Source-to-distortion ratio (SDR) [Vincent+, 2006] • represents the total quality of the source separation • Experimental data – Two pairs of dry source signals – Obtained from SiSEC2011 [Araki+, 2012] • Table (signal type, source, data name, length): Speech, male speech, dev2_male4_inst_src_2.wav, 10.0 s; Speech, female speech, dev3_female4_inst_src_2.wav, 10.0 s; Music, drums, dev1_wdrums_src_3.wav, 11.0 s; Music, guitar, dev1_wdrums_src_2.wav, 11.0 s
  • 23. Conditions • Training data – The speech and music signals were divided into 64 blocks – These blocks were randomly swapped to simulate the block permutation problem – A total of 300 random swapping patterns were prepared • Test data – Ten new patterns of randomly swapped signals were prepared as the test dataset • Two models were trained: speech and music – Speech model: trained using only the speech pair – Music model: trained using only the music pair • Two test conditions: in-domain and out-of-domain – In-domain: the same sources are used in training and testing – Out-of-domain: different sources are used in training and testing
  • 24. Conditions • Other conditions: sampling frequency, 16 kHz; window function, Hann window; window length, 128 ms; shift length, 64 ms; local time parameter, 13; epochs, 1000; units of hidden layers, 4096
  • 25. Results • Figure: four panels plotting the result for test data patterns 1-10 and their average, with the vertical axis running from poor to good: speech model (in-domain), music model (in-domain), speech model (out-of-domain), music model (out-of-domain)
  • 26. Results • Spectrograms (in-domain test dataset with the music model) – The conventional method fails to align the components, mainly in the low-frequency range – The proposed method solves the permutation problem almost perfectly • Figure: four spectrograms (original signal; input, permutation-inconsistent signal; conventional method; proposed method), frequency 0-3 kHz, time 0-5 s
  • 27. Conclusion • Motivation – Construct a model for solving the permutation problem (deep permutation solver) rather than assuming specific source models – Construct an algorithm that can cope with an increasing number of sound sources • The conventional deep permutation solver is difficult to apply to more than two sources • Proposed method – is a simple, robust, and precise DNN-based permutation solver – The model can be trained with few samples (few-shot learning) • The proposed method outperformed the conventional method • Thank you for your attention.
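
The steps on slides 17 to 19 (soft permutation matrices, matrix product, MSE with PIT) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array layout (frequency, source, frame), and the exact form of the loss are hypothetical, and the softmax probabilities are passed in directly instead of coming from a trained DNN.

```python
import itertools

import numpy as np


def all_permutation_matrices(n_src):
    """Return all n_src! permutation matrices, shape (n_src!, n_src, n_src)."""
    eye = np.eye(n_src)
    return np.stack(
        [eye[list(p)] for p in itertools.permutations(range(n_src))]
    )


def soft_permutation(probs, perms):
    """Weight the permutation matrices by per-frequency probabilities.

    probs: (n_freq, n_perm) softmax outputs, each row summing to 1.
    perms: (n_perm, n_src, n_src) candidate permutation matrices.
    Returns (n_freq, n_src, n_src) soft permutation matrices.
    """
    return np.einsum("fk,kmn->fmn", probs, perms)


def apply_permutation(P, Y):
    """Frequency-wise matrix product of P (n_freq, n_src, n_src)
    with Y (n_freq, n_src, n_frames)."""
    return np.einsum("fmn,fnt->fmt", P, Y)


def pit_mse(Z, target):
    """MSE minimized over all global source orders (assumed PIT form)."""
    n_src = Z.shape[1]
    return min(
        np.mean((Z[:, list(p), :] - target) ** 2)
        for p in itertools.permutations(range(n_src))
    )
```

With two sources, `all_permutation_matrices(2)` gives the identity and the swap matrix, so a probability pair of (0.5, 0.5) at one frequency produces an output that is the average of the two input components at that frequency. With three sources the same code yields six candidate matrices.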

Editor's Notes

  1. 【0:00】 Hi everyone, I'm Fumiya Hasuike from National Institute of Technology, Japan. I'm going to talk about a DNN-based frequency-domain permutation solver for multichannel audio source separation.
  2. 【0:15】 First, let us explain audio source separation. Audio source separation is the process of extracting specific sounds such as voice, noise, singing, and instrumental sounds. Its applications include speech recognition, AI speakers, advanced hearing aids, and noise cancellation.
  3. 【0:40】 Blind source separation (BSS) is a technique for extracting sources from an observed multichannel mixture when the mixing system, A, is unknown. BSS estimates the demixing system, W, which is the inverse system of A, and thereby obtains the separated signals. In particular, the situation in which the number of microphones is greater than or equal to the number of sources is called determined BSS; for example, observing two speakers with two microphones is a determined situation. In general, determined BSS allows linear separation, so it introduces little artificial distortion and has few adverse effects on the processing that follows BSS. This makes it applicable to a wide variety of fields. This research handles determined BSS.
  4. 【1:25】 Various audio source separation methods have been proposed. When frequency-domain independent component analysis (FDICA) or full-rank spatial covariance analysis (FCA) is applied, we encounter the permutation problem, which we will explain in more detail later. Both methods that solve the permutation problem and methods that avoid it have been proposed. The solutions include methods based on frequency correlation and methods based on direction of arrival (DOA), and from around 2006, methods that avoid the permutation problem were proposed. In 2020, a supervised method was proposed that solves the permutation problem using deep learning based on local time-frequency structure. However, that algorithm is limited to two-source inputs and is difficult to extend to three or more sources. In this presentation, we propose a new supervised deep permutation solver that does not lose generality for three or more sources.
  5. 【2:25】 These are the contents of today's talk.
  6. 【2:30】 First, we explain some related methods.
  7. 【2:35】 We now explain the permutation problem that arises in FDICA and FCA. In this figure, the depth axis is frequency and the horizontal axis is time. FDICA applies independent component analysis (ICA) independently in each frequency. The red and blue components are separated in each frequency, but ICA does not care about the order of the separated signals, so in FDICA the order of the ICA outputs differs from frequency to frequency. This is generally called the permutation problem, and methods to solve it are still being sought. From now on, a signal whose components are mixed up across frequencies, such as Y1 and Y2, is called a permutation-inconsistent signal. We propose a new method that solves this permutation problem using deep learning (DNN), which is currently used in a wide variety of fields. % Note: the frequency and time axes belong to Y1 and Y2, not to the ICA block.
  8. 【3:45】 Next, we explain the methods that avoid the permutation problem. Independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA) perform audio source separation using a source model. IVA assumes that, in each source, all the frequency components have strong power simultaneously. In contrast, ILRMA assumes that each source has a low-rank time-frequency structure, that is, one containing many repetitions. These source models are used to avoid the permutation problem while estimating the demixing matrix.
  9. 【4:20】 However, sources such as vocals, drums, and guitars have their own time-frequency structures. When the source model does not fit a source, the BSS performance of IVA or ILRMA degrades. A source model that fits versatile sources is required, but such a universal source model is difficult to discover. On the other hand, a model that focuses only on the permutation alignment may enable us to construct a versatile source separation method. Motivated by this expectation, the deep permutation solver was recently proposed. % As I mentioned earlier, vocals, drums, and guitar have their own time-frequency structures, as shown in this figure.
  10. 【5:00】 As prior work, there is the existing deep permutation solver. In this method, we first prepare a power spectrogram of the permutation-inconsistent signal, which is the output of FDICA or FCA. Then the components of the reference frequency and its neighboring frequencies are input to the DNN; in the time direction, the input is also restricted to a local time segment. We call a segment extracted locally in both frequency and time a subband. The DNN predicts whether the input reference frequency component and its neighbors belong to the same source or to different sources, where 0 means the same source and 1 means different sources. In the top example of this figure, the reference and neighboring frequency components come from different sources, so the DNN outputs 1. % Note: the red of Y2 is too dark.
  11. 【5:30】 However, this method becomes highly complex as the number of sources increases. For example, in a three-source case, we cannot determine the permutation when the DNN predicts 1, namely, that the reference and neighboring components are "different sources", because there are several possible permutations, as shown in this figure. This problem arises because the DNN is trained as a binary classifier, so a combination process that depends on the number of sources is needed, which makes the algorithm considerably more complicated. In addition, the performance of this deep permutation solver is not satisfactory, as our experiment will confirm later. We therefore propose a new deep permutation solver that does not lose generality for three or more sources.
  12. 【6:10】 Next, we explain our proposed method.
  13. 【6:15】 This is an overview of the proposed method. For the sake of clarity, we use a two-source example, but the same algorithm can be applied to three or more sources. After applying FDICA, we obtain the estimated demixing matrix, W-hat, and the estimated signal y is the product of this matrix and the observed signal. However, this demixing matrix has scale and permutation ambiguities, which are represented by the diagonal matrix D and the permutation matrix P, respectively. These ambiguities can be resolved by multiplying by the inverses of D and P, where the inverse of D can easily be calculated with the projection-back method. Thus, the permutation solver must estimate the inverse of the permutation matrix P. A permutation matrix reorders the components; in the two-source case, there are two possible permutation matrices, as shown here. In the proposed method, the DNN predicts the frequency-wise permutation matrix. The permutation-aligned signals are obtained by multiplying the permutation-inconsistent signals, y, by the inverse of P.
  14. 【7:30】 As a preprocessing step, we apply a normalization process to the permutation-inconsistent signals. The normalization is represented by this equation, where the absolute value of a matrix denotes the element-wise absolute value, the dotted exponent denotes the element-wise power, and the fraction denotes the element-wise quotient. This process emphasizes the correlation between components of the same source and, at the same time, restricts the values of the DNN input to the interval between 0 and 1, which stabilizes the training of the DNN.
  15. 【8:05】 Next, we describe the input to the DNN. We focus on a specific time, j, which is chosen at random as the reference time in the permutation-inconsistent signal. We extract several time frames of each signal, Y, as local-time components. These components are then vectorized, and the concatenated vectors are used as the input vector for the DNN.
  16. 【8:30】 This is the architecture of our DNN. The model is a fully connected multilayer perceptron with five layers in total: an input layer, three hidden layers with ReLU activations, and an output layer, which is prepared according to the number of sources. A frequency-wise softmax function is applied to the output so that the predicted values become probabilities in each frequency; as with the values 0.9 and 0.1 at the top of this figure, the values for each frequency component sum to 1. These frequency-wise probabilities are then used to construct the permutation matrix.
  17. 【9:05】 We explain how to obtain the predicted permutation matrix, which is necessary to align the permutation-inconsistent signal. As explained earlier, the DNN outputs probability values. These probability values are used as the coefficients of the permutation matrices. For example, in the two-source case, there are two possible permutation matrices, which correspond to [1.0, 0.0, 0.0, 1.0] and [0.0, 1.0, 1.0, 0.0] in this figure; with three sources there are six, and the number grows as the factorial of the number of sources. Each permutation matrix is multiplied by the corresponding DNN output as a weight, and the weighted matrices are summed. In this way, we obtain the soft permutation matrices. % Note: prepare the permutation matrices for the three-source case in the Q&A slides, and also include the Birkhoff-von Neumann theorem there (a doubly stochastic matrix is a convex combination of permutation matrices).
  18. 【9:50】 After predicting the soft permutation matrices, we take their matrix product with the permutation-inconsistent signals, Y, that is, the local-time normalized power spectrograms used as the DNN input. This reorders each component according to the values of the permutation matrix, and the permutation-aligned signals, Z, are thus estimated in local time. In this figure, the predicted permutation matrix for the second frequency from the bottom is [0.5, 0.5, 0.5, 0.5], so the estimated signal at that frequency contains half of each of the two permutation-inconsistent components.
  19. 【10:05】 To train the DNN, we use the mean squared error (MSE) between the estimated local-time spectrograms and the completely aligned local-time spectrograms, and the DNN is trained by backpropagating this loss. The order of the separated sources as a whole should not be a prediction target during training, so we introduce permutation-invariant training (PIT). The loss function is represented by this equation: the loss is computed for every possible source order, so the lowest loss is always recorded regardless of the order of the estimated separated signals. % Note: the gradient can be computed by automatic differentiation.
  20. 【10:35】 When we apply the pretrained DNN model to the test data, we can further improve the estimation accuracy. Since the permutation problem is time-invariant, we can stride along the time axis of the permutation-inconsistent signal and input multiple local-time spectrograms, as shown in this figure. Each DNN output is converted to a permutation matrix, and we then take a majority decision over these matrices to obtain the most reliable permutation matrix in each frequency, which finally consists only of 0s and 1s. The estimated separated signals are obtained using this matrix.
  21. 【11:00】 Let’s move on to the experiments.
  22. 【11:05】 To evaluate the performance of the proposed method, we conducted an experiment. We compared the conventional deep permutation solver and the proposed method. The evaluation criterion is SDR, which is commonly used in audio source separation tasks; it increases with a higher degree of separation and with less artificial distortion caused by the separation process. We used two pairs of signals, one of speech and one of music, as our experimental data. The speech data consisted of male and female speech signals, and the music data consisted of drums and guitar signals. These data were obtained from SiSEC2011.
  23. 【11:40】 For the training data, the frequency components of the speech and music signals were divided into 64 blocks, and these blocks were randomly swapped. That is, we prepared signals that simulate the block permutation problem, which occurs in IVA and ILRMA. We prepared a total of 300 random swapping patterns as training data. Then, we prepared ten new patterns of randomly swapped signals as the test dataset. We trained two models, one for speech and the other for music, and evaluated each model in in-domain and out-of-domain experiments. Using the same sources for training and test data is called in-domain, while using different sources is called out-of-domain. The speech signals are 10 seconds long each, and the music signals are 11 seconds long each. % Notes: state gamma = 16 in the figure; the vertical rules of the table may be unnecessary, and the top rule should not be bold; do not mention that shuffling every row individually did not work; it might be interesting to feed the output spectrogram back into the DNN as a new input.
  24. 【12:30】 The other conditions are shown in this table.
  25. 【12:35】 These figures show the SDR values for the in-domain and out-of-domain test data. The SDR values are averaged over the two sources. For both the speech and music results of the in-domain evaluation, the proposed method improved SDR by over 20 dB on average. In contrast, the conventional method often failed to improve SDR, particularly for the music signals. Furthermore, the efficacy of the proposed method is confirmed even in the out-of-domain evaluation: as the two figures on the right show, the proposed method improved SDR by over 10 dB for both speech and music signals. This result shows the robustness of the proposed method to the domain of the dataset. % Note: add good/poor axis labels for SDR, referring to Oyabu-san's slides.
  26. 【13:30】 These are spectrograms for the in-domain test dataset with the music model. From left to right, the spectrograms of the original signal, the input signal, the conventional method, and the proposed method are shown. The conventional method fails to align the components, mainly in the low-frequency range. On the other hand, the proposed method solves the permutation problem almost perfectly. % Notes: include the original signal; limit the vertical axis to about 3 kHz if it is hard to read; make the input signal the guitar-dominant one; a 2 × 2 layout may be better. % What happens when the output of FDICA is used? This is future work; those outputs could be used as training data. Block permutation is simulated because it occurs in ILRMA and IVA.
  27. 【14:00】 This is the conclusion. That's all. Thank you for your attention. 【14:10】 End.
  28. 【15:05】 Next is the derivation of the estimated separated signals. We take the matrix product between the predicted permutation matrices created by the preceding majority decision and the original permutation-inconsistent signals. In this way, each frequency component is always assigned to one of the two signals without mixing. Thus, the final estimated separated signals are obtained.
  29. 【4:50】 IVA and ILRMA perform audio source separation using a source model. The source model of IVA assumes that the strength of all frequency components is synchronized in each source. In contrast, the source model of ILRMA assumes that each source has a low-rank time-frequency structure, that is, one that contains many repetitions. Since each source has its own unique characteristics, it is difficult to create a universal source model that can be applied to any source. On the other hand, a model with only the ability to align frequency components may allow us to construct a versatile source separation method. This is one of our motivations for this research.
  30. Independent component analysis (ICA), a determined BSS method, is a technique for estimating the demixing matrix W under the condition that the mixing matrix A is unknown. ICA estimates W, the inverse of A, using two assumptions: first, the sources are independent; second, the mixing matrix is invertible and time-invariant. In ICA, the order of the separated signals is undetermined. For example, in this figure, it is not determined in which order the red and blue signals are output. In addition, since acoustic signals generally contain reverberation, it is necessary to move to the frequency domain to remove the effect of reverberation.
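The ordering ambiguity can be illustrated with a small NumPy sketch. The mixing matrix values are hypothetical, and the demixing matrix is taken directly as the inverse of A (an "oracle") instead of being estimated by ICA, so this only demonstrates why the output order cannot be fixed, not how ICA is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))           # two independent sources (one per row)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                # hypothetical invertible, time-invariant mixing matrix
x = A @ s                                 # observed mixtures

W = np.linalg.inv(A)                      # oracle demixing matrix (ICA would estimate this)
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                # exchanges the two outputs

y_a = W @ x                               # sources in the original order
y_b = (P @ W) @ x                         # same sources, opposite order
```

Both `W` and `P @ W` yield mutually independent outputs, so a criterion based only on independence has no way to prefer one order over the other.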
  31. Even if the signals are convolutive mixtures in the time domain, transforming them into the frequency domain turns the convolution into a simple multiplication. This led to a method called frequency-domain ICA (FDICA). In this figure, the depth represents the number of microphones, the vertical axis represents frequency, and the horizontal axis represents time. This method performs source separation by applying ICA independently to the complex time series in each frequency bin.
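The per-bin structure of FDICA can be sketched as below. This is a structural sketch only: the array layout `(F, T, M)`, the random per-bin mixing matrices, and the helper name `demix_per_bin` are assumptions for illustration, and the demixing matrices are oracle inverses because the ICA estimation step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, M = 5, 100, 2                        # frequency bins, time frames, microphones (= sources)

# Hypothetical source spectrogram and one mixing matrix per frequency bin.
S = rng.laplace(size=(F, T, M)) + 1j * rng.laplace(size=(F, T, M))
A = rng.normal(size=(F, M, M)) + 1j * rng.normal(size=(F, M, M))
X = np.einsum("fij,ftj->fti", A, S)        # per bin, mixing is a plain matrix multiplication

def demix_per_bin(X, W):
    """Apply a separate demixing matrix W[f] to the complex time series of each bin."""
    Y = np.empty_like(X)
    for f in range(X.shape[0]):
        # Row t of the result is W[f] @ X[f, t, :].
        Y[f] = X[f] @ W[f].T
    return Y

W = np.linalg.inv(A)                       # oracle; FDICA would estimate W[f] with ICA in each bin
Y = demix_per_bin(X, W)
```

Since each bin is processed on its own, the output order in one bin is unrelated to the order in any other bin, which is how the permutation problem arises.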
  32. After the majority decision along the time axis, the method also applies a majority decision along the frequency axis to build the full-band vector. However, because the DNN performs a binary classification of whether each neighboring frequency component matches the reference frequency component, a similarity comparison is needed to create the full-band vector. We prepare two vectors: the subband vector predicted by the DNN and its logical inversion. In the case of this figure, the first prediction result is compared with the second prediction result, and the candidate (the predicted vector or its logical inversion) that is closer to the first prediction result is adopted as a component of the full-band vector. In this method, the process of creating the full-band vector is complicated because the DNN output is a binary classification relative to a reference frequency component.
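One possible reading of the similarity comparison is sketched below. The notes do not state which similarity measure is used, so the Hamming distance here is an assumption, as are the function name `closer_candidate` and the toy vectors, which are taken to cover the same frequency components.

```python
import numpy as np

def closer_candidate(first, second):
    """Return `second` or its logical inversion, whichever is closer to `first`.

    first, second : binary vectors over the same frequency components
                    (hypothetical DNN prediction results).
    Closeness is measured by Hamming distance, which is an assumption.
    """
    first = np.asarray(first, dtype=bool)
    second = np.asarray(second, dtype=bool)
    inverted = np.logical_not(second)
    d_same = np.count_nonzero(first != second)
    d_inverted = np.count_nonzero(first != inverted)
    return second if d_same <= d_inverted else inverted

first = [1, 0, 1, 1]
flipped = closer_candidate(first, [0, 1, 0, 1])   # the inversion is closer to `first`
kept = closer_candidate(first, [1, 0, 0, 1])      # the prediction itself is closer
```

The inversion is needed because a binary "same as the reference or not" label is only meaningful relative to its own reference component, so two predictions can describe the same alignment with opposite labels.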
  33. This is the DNN architecture. The DNN is a fully connected multilayer perceptron with five layers: an input layer, three hidden layers, and an output layer. The ReLU function is used as the activation function. The output layer has as many outputs as there are sources. A softmax function is applied to the output values for each frequency component, so the DNN outputs probability values. The output probabilities for each frequency component are constrained to sum to 1, for example 0.9 and 0.1 at the top of this figure.
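A forward pass with this layer structure can be sketched in NumPy as follows. The layer sizes, the random weights, the input dimension, and the function names are all hypothetical, since the notes give only the number of layers, the activation function, and the per-frequency softmax.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_mlp(rng, sizes):
    """Random weights and zero biases for consecutive fully connected layers."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x, n_freq, n_src):
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)                   # three hidden layers with ReLU
    W, b = params[-1]
    logits = (h @ W + b).reshape(n_freq, n_src)
    return softmax(logits, axis=-1)           # one softmax per frequency component

rng = np.random.default_rng(0)
n_freq, n_src = 4, 2                          # hypothetical subband size and number of sources
sizes = [32, 64, 64, 64, n_freq * n_src]      # input, three hidden layers, output (sizes assumed)
params = init_mlp(rng, sizes)
probs = forward(params, rng.normal(size=32), n_freq, n_src)
```

Each row of `probs` corresponds to one frequency component and sums to 1, like the 0.9 and 0.1 pair mentioned in the notes.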