WEAKLY-SUPERVISED SOUND EVENT DETECTION
WITH SELF-ATTENTION
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
This work was done during an internship at LINE Corporation
ICASSP2020
Session WE1.L5: Acoustic Event Detection
Outline of this work
● Goal
  – Improve sound event detection (SED) performance
  – Utilize weakly labeled data for training
● Contributions
  – Propose self-attention-based weakly-supervised SED
  – Introduce a special tag token to handle weak-label information
● Evaluation
  – Improved SED performance compared with a CRNN baseline
    • CRNN baseline: 30.61% → Proposed: 34.28% (event-based F1)
[Figure: a clip with the weak label "Alarm, Dog, Speech" is fed to the stacked Transformer encoder, which detects the onset and offset of "Alarm" on the time axis]
Background
● Sound event detection (SED)
  – Identifying environmental sounds together with their timestamps
● Collecting an annotated dataset
  – Strong label
    • Easy to handle (+)
    • Expensive annotation cost (−)
  – Weak label
    • Hard to handle (−)
    • Cheap annotation cost (+)
[Figure: a strong label includes timestamps, i.e. each event (Alarm, Dog, Speech) with its onset and offset on the time axis; a weak label includes no timestamps, only the clip-level tags "Alarm, Dog, Speech". Handling weak labels is the problem addressed in this work]
Weakly-supervised training for SED
● Multi-instance learning (MIL)
  – An effective approach for training with weak labels
  – Predict frame-by-frame scores, then aggregate them over time to obtain a sequence-level prediction (a minimal training-step sketch follows the figure below)
[Figure: frame-level predicted scores for class1, class2, and class3 over time are aggregated in the time domain, and the loss is calculated between the aggregated scores and the clip-level weak label]
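The following is a minimal sketch (not the authors' code) of this MIL training step: frame-level scores are aggregated over time and the loss is computed against the clip-level weak label. Shapes and values are placeholders.

```python
import torch
import torch.nn.functional as F

# Stand-ins for model outputs and annotations: (batch, time, classes) and (batch, classes).
frame_scores = torch.rand(4, 500, 10, requires_grad=True)   # frame-level scores in [0, 1)
weak_labels = torch.randint(0, 2, (4, 10)).float()          # clip-level tags, no timestamps

clip_scores = frame_scores.mean(dim=1)                       # aggregate in the time domain
loss = F.binary_cross_entropy(clip_scores, weak_labels)      # loss against the weak label
loss.backward()                                              # gradients reach every frame
```

Because the loss is applied only at the clip level, the network still produces frame-level scores that can be thresholded at inference time.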
Which aggregation approach is effective?
How to aggregate frame-level predictions
● Global max pooling
  – Captures short-duration events
  – Sensitive to noise
● Global average pooling
  – Captures long-duration events
  – Ignores short-duration events
● Attention pooling
  – Flexible decisions via an attention mechanism
(each pooling strategy is sketched in code after the figure below)
[Figure: a frame-level prediction curve over time; taking the max, the average, or a weighted sum yields different sequence-level predictions]
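As a reference for the three strategies above, here is a minimal sketch assuming frame-level scores of shape (batch, time, classes); the function names are illustrative and not taken from the paper.

```python
import torch

def global_max_pool(frame_scores):
    # Picks the strongest frame per class: good for short events, but a single
    # noisy frame can dominate the clip-level score.
    return frame_scores.max(dim=1).values

def global_average_pool(frame_scores):
    # Averages over time: good for long events, but short events get diluted.
    return frame_scores.mean(dim=1)

def attention_pool(frame_scores, attention_logits):
    # Learns per-class weights over time (softmax) and takes a weighted sum,
    # letting the model decide which frames matter for each class.
    weights = torch.softmax(attention_logits, dim=1)   # (batch, time, classes)
    return (weights * frame_scores).sum(dim=1)         # (batch, classes)
```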
Attention pooling
● Calculate a prediction and a confidence for each frame according to the input (a minimal sketch follows the figure below)
[Figure: the input frame-level features feed two branches: a sigmoid produces the frame-level prediction and a softmax over time produces the frame-level confidence (attention weight); their weighted sum over time gives the sequence-level prediction]
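The following is a hedged sketch of such an attention-pooling head, assuming frame-level features of shape (batch, time, feat_dim); the class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class AttentionPoolingHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # frame-level prediction branch
        self.attention = nn.Linear(feat_dim, num_classes)   # frame-level confidence branch

    def forward(self, frames):                               # frames: (batch, time, feat_dim)
        frame_pred = torch.sigmoid(self.classifier(frames))      # sigmoid -> frame-level prediction
        weight = torch.softmax(self.attention(frames), dim=1)    # softmax over time -> attention weight
        clip_pred = (weight * frame_pred).sum(dim=1)             # weighted sum -> sequence-level prediction
        return frame_pred, clip_pred
```

The clip-level output can be trained against the weak labels with binary cross-entropy, while the frame-level output serves as the strong prediction at inference time.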
Self-attention
● Transformer [Vaswani+17]
  – Makes effective use of the self-attention mechanism
  – Able to capture both local and global context information
  – Great success in NLP and in various audio/speech tasks
    • ASR, speaker recognition, speaker diarization, TTS, etc.
[Figure: Transformer encoder block: input plus positional encoding, multi-head attention, add & norm, feed-forward, add & norm, stacked N times]
In this work, we use the Transformer encoder.
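A minimal sketch of such an encoder stack using PyTorch's standard modules is shown below; the feed-forward size and dropout are assumptions (the slides do not state them), and the positional encoding from the figure is omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 128, 16, 3      # attention dim and heads follow the conditions slide; layer count varies
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=512, dropout=0.1,
    batch_first=True,                        # inputs as (batch, time, d_model)
)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

x = torch.randn(8, 500, d_model)             # e.g. 8 clips of 500 frames
h = encoder(x)                               # same shape, each frame now carries local and global context
```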
Overview of self-attention
[Figure: overview of self-attention: three Dense layers project the input frame-level features; the attention weights multiply the event features to produce the output frame-level features]
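A sketch of a single-head version of this computation, under the assumption of scaled dot-product attention as in the Transformer, could look as follows; the names are illustrative.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)   # the "event feature" projection

    def forward(self, x):                           # x: (batch, time, d_model)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(1, 2) / math.sqrt(x.size(-1))
        weights = torch.softmax(scores, dim=-1)     # attention weights, (batch, time, time)
        return weights @ v                          # output frame-level features
```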
In weakly-supervised SED, how should weak label data be handled?
Proposed method
● Weakly-supervised training for SED with self-attention and a tag token
  – Introduce the Transformer encoder as self-attention for sequence modeling
  – Introduce a tag token dedicated to weak label estimation (a model sketch follows the figure below)
[Figure: the tag token is appended at the first frame of the input feature sequence; the sequence passes through the stacked Transformer encoder, and sigmoid classifiers predict the strong label from the frame positions and the weak label from the tag-token position]
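A hedged sketch of this forward pass is given below. It assumes the strong and weak branches share one classifier, a fixed feed-forward size, and that a frame-level feature extractor precedes this module; none of these details are specified on the slide.

```python
import torch
import torch.nn as nn

class TagTokenSED(nn.Module):
    def __init__(self, d_model=128, n_heads=16, n_layers=3, num_classes=10):
        super().__init__()
        # The slides describe the tag token as a constant value appended to the input.
        self.register_buffer("tag_token", torch.ones(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, frames):                                  # frames: (batch, time, d_model)
        tag = self.tag_token.expand(frames.size(0), -1, -1)
        h = self.encoder(torch.cat([tag, frames], dim=1))       # (batch, 1 + time, d_model)
        weak = torch.sigmoid(self.classifier(h[:, 0]))          # weak label prediction (tag position)
        strong = torch.sigmoid(self.classifier(h[:, 1:]))       # strong label prediction (frame positions)
        return strong, weak
```

During training, binary cross-entropy can be applied to the weak output against the clip tags, and to the strong output whenever strongly labeled data is available.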
Self-attention with tag token
[Figure: the self-attention computation from the earlier overview, now with the tag token included among the input frame-level features]
Self-attention with tag token
[Figure: the tag token (a constant value) is appended to the input and passes through encoder 1, encoder 2, ..., encoder N; the attention weights capture the relationship between the tag token and the input, so tag information is aggregated into the tag token in each encoder]
Self-attention with tag token
[Figure: the same encoder stack; the frame-level outputs of the final encoder give the strong label prediction, and the tag-token output gives the weak label prediction]
Experiments
● DCASE2019 task 4
  – Sound event detection in domestic environments
  – Evaluation metrics: event-based and segment-based macro F1
  – Baseline model: CRNN
  – Provided dataset
[Table: details of the provided dataset]
Experimental conditions
● Network training configuration
  – Feature: 64-dimensional log mel filterbank
  – Transformer setting: 128 attention dimensions, 16 heads (each head handles 8 dimensions)
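As an illustration of the front-end, a possible 64-dimensional log mel filterbank extraction with librosa is sketched below; the sampling rate, FFT size, and hop length are assumptions, since the slide only specifies the 64 mel bins, and "clip.wav" is a hypothetical file name.

```python
import librosa

y, sr = librosa.load("clip.wav", sr=44100)     # hypothetical input clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=64)
logmel = librosa.power_to_db(mel).T            # shape: (frames, 64), fed to the network
```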
Experimental results
Method             Event-based[%]  Segment-based[%]  Frame-based[%]
CRNN (baseline)         30.61           62.21            60.94
Transformer (E=3)       34.27           65.07            61.85
Transformer (E=4)       33.05           65.14            62.00
Transformer (E=5)       31.81           63.90            60.78
Transformer (E=6)       34.28           64.33            61.26
(E: number of stacked encoder layers)
The Transformer models outperformed the CRNN baseline.
Experimental results
[Figure: class-wise comparison of CRNN and Transformer results]
Experimental results
[Figure: class-wise results with gains of +10.4% and +13.5% highlighted]
The Blender and Dishes classes improve the most, so the method is especially effective for sounds that appear repeatedly.
Experimental results
Attention pooling vs. tag token
Method                              Encoder stack  Event-based[%]  Segment-based[%]  Frame-based[%]
Self-attention + Attention pooling        3            33.99           65.95            62.36
Self-attention + Attention pooling        6            33.84           65.61            62.10
Self-attention + Tag token                3            34.27           65.07            61.85
Self-attention + Tag token                6            34.28           64.33            61.26
The two approaches perform comparably.
Predicted example
[Figure: example of predicted event activities]
Visualization of attention weights
[Figure: visualization of the attention weights]
Conclusion
● Proposed method
  – Weakly-supervised training for SED with self-attention and a tag token
    • Self-attention: effective sequence modeling using local and global context
    • Tag token: aggregates tag information through self-attention
● Results
  – Improved SED performance compared with the CRNN baseline
    • CRNN baseline: 30.61% → Proposed: 34.28%
  – Effective for sounds that appear repeatedly
