1. Weakly-Supervised Sound Event Detection with Self-Attention
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
Toda Laboratory, Department of Intelligent Systems, Graduate School of Informatics, Nagoya University
IEEE ICASSP 2020, Session WE1.L5: Acoustic Event Detection, May 2020
This work was done during an internship at LINE Corporation.
2. Outline of this work
● Goal
  – Improve sound event detection (SED) performance
  – Utilize weak label data for training
● Contributions
  – Propose self-attention based weakly-supervised SED
  – Introduce a special tag token to handle weak label information
● Evaluation
  – Improved SED performance compared with the CRNN baseline
    • CRNN baseline: 30.61% → Proposed: 34.28%
[Figure: SED detects the onset/offset of an event (e.g., "Alarm") along the time axis, while a weak label provides only the tags "Alarm, Dog, Speech".]
3. Background
● Sound event detection (SED)
  – Identifying environmental sounds with timestamps
● Collecting an annotated dataset
  – Strong label → includes timestamps (onset/offset)
    • Easy to handle ☺
    • Expensive annotation cost ☹
  – Weak label → does NOT include timestamps; only tags are available
    • Hard to handle ☹
    • Cheap annotation cost ☺
[Figure: strong labels mark the onset/offset of Alarm, Dog, and Speech on the time axis; the weak label gives only the tag list "Alarm, Dog, Speech". Problem: how to train SED from weak labels.]
4. Weakly-supervised training for SED
● Multi-instance learning (MIL)
  – An effective approach for training with weak labels
  – Predict frame by frame, then aggregate the predictions to obtain a sequence-level prediction (see the sketch after this slide)
[Figure: frame-level predicted scores for class1/class2/class3 are aggregated in the time domain into sequence-level scores, and the loss is calculated against the weak label.]
Which approach is effective for aggregation?
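As a concrete illustration, here is a minimal PyTorch sketch of one MIL training step, assuming average pooling as the aggregator; the function name and the pooling choice are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def mil_loss(frame_logits, weak_labels):
    """frame_logits: (batch, time, n_classes) frame-by-frame scores.
    weak_labels: (batch, n_classes) 0/1 clip-level tags (float)."""
    frame_probs = torch.sigmoid(frame_logits)  # frame-level predictions
    clip_probs = frame_probs.mean(dim=1)       # aggregate in the time domain
    # the loss is calculated against the weak label only
    return F.binary_cross_entropy(clip_probs, weak_labels)
```

Only the clip-level term supervises the model here; the next slide compares choices for the aggregation step.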
5. How to aggregate frame-level predictions
● Global max pooling
  – Captures short-duration events
  – Vulnerable to the effect of noise
● Global average pooling
  – Captures long-duration events
  – Ignores short-duration events
● Attention pooling
  – Flexible decisions via the attention mechanism (compared in the sketch below)
[Figure: max, average, and weighted-sum (attention) pooling each map the frame-level prediction curve over time to a sequence-level prediction score.]
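The three options as small PyTorch functions over a batch of frame-level probabilities of shape (batch, time, n_classes); names and shapes are illustrative.

```python
import torch

def max_pool(frame_probs):
    # captures short-duration events, but a single noisy frame can dominate
    return frame_probs.max(dim=1).values

def average_pool(frame_probs):
    # captures long-duration events, but short events are averaged away
    return frame_probs.mean(dim=1)

def attention_pool(frame_probs, attn_logits):
    # learned frame-level confidences decide each frame's contribution
    weights = torch.softmax(attn_logits, dim=1)  # normalized over time
    return (weights * frame_probs).sum(dim=1)    # weighted sum
```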
6. Attention pooling
● Calculate a prediction and a confidence for each frame according to the input
[Figure: from the input frame-level features, a sigmoid head yields the frame-level prediction (event feature) and a softmax head yields the frame-level confidence (attention weight); their weighted sum over time is the sequence-level prediction.]
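A sketch of attention pooling as a PyTorch module: two linear heads over the frame-level features produce a sigmoid prediction and a softmax confidence per frame, and their weighted sum over time is the sequence-level prediction (class and layer names are illustrative).

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, n_classes)  # -> frame-level prediction
        self.att_head = nn.Linear(feat_dim, n_classes)  # -> frame-level confidence

    def forward(self, x):  # x: (batch, time, feat_dim) input frame-level features
        probs = torch.sigmoid(self.cls_head(x))           # sigmoid branch
        weights = torch.softmax(self.att_head(x), dim=1)  # softmax over time
        return (probs * weights).sum(dim=1)               # sequence-level prediction
```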
7. Self-attention
● Transformer [Vaswani+17]
  – Makes effective use of the self-attention model
  – Able to capture both local and global context information
  – Great success in NLP and in various audio/speech tasks
    • ASR, speaker recognition, speaker diarization, TTS, etc.
[Figure: Transformer encoder block: positional encoding on the input, then N× (multi-head attention → add & norm → feed-forward → add & norm) to the output.]
In this work, we use the Transformer encoder (a minimal usage sketch follows).
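PyTorch provides this block as `nn.TransformerEncoder`; a minimal sketch with the paper's attention size is below. The feed-forward width and the number of layers N are assumptions, and note that `nn.TransformerEncoder` does not apply the positional encoding shown in the figure, so that is added to the input beforehand.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=128, nhead=16,
                                   dim_feedforward=512)  # FF width assumed
encoder = nn.TransformerEncoder(layer, num_layers=3)     # N = 3 assumed

x = torch.randn(500, 8, 128)  # (time, batch, feature): 500 frames, batch of 8
h = encoder(x)                # contextualized frame features, same shape as x
```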
9. Proposed method
● Weakly-supervised training for SED with self-attention and a tag token
  – Introduce a Transformer encoder as the self-attention module for sequence modeling
  – Introduce a tag token dedicated to weak label estimation (sketched after the figure)
[Figure: the tag token is appended at the first frame of the input feature sequence; a stacked Transformer encoder processes the sequence, and sigmoid classifiers predict the strong label from the frame outputs and the weak label from the tag-token output.]
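A minimal sketch of the proposed model as the figure describes it: a constant-valued tag token is appended at the first frame, the stacked Transformer encoder relates it to every frame via self-attention, and sigmoid classifiers read the strong label from the frame outputs and the weak label from the tag-token output. Class, layer, and default sizes are illustrative, and the positional encoding and any front-end feature extractor are omitted.

```python
import torch
import torch.nn as nn

class TagTokenSED(nn.Module):
    def __init__(self, feat_dim=128, n_classes=10, n_layers=3, n_heads=16):
        super().__init__()
        # tag token: a constant-valued vector (per the slide), not learned
        self.register_buffer("tag_token", torch.ones(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):  # x: (time, batch, feat_dim) feature sequence
        token = self.tag_token.expand(-1, x.size(1), -1)
        h = self.encoder(torch.cat([token, x], dim=0))  # tag token at first frame
        weak = torch.sigmoid(self.classifier(h[0]))     # tag-token output -> weak label
        strong = torch.sigmoid(self.classifier(h[1:]))  # frame outputs -> strong label
        return strong, weak
```

Training then combines the clip-level loss on `weak` against the tags with a frame-level loss on `strong` when strong labels are available.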
10-12. Self-attention with tag token
[Figure, built up over three slides: the tag token (a constant value) is appended to the input frame-level features; inside each self-attention layer, dense projections produce an event feature and an attention weight whose product gives the output frame-level features. Through encoder 1 … encoder N, the relationship between the tag token and the input is modeled, and tag information is aggregated into the tag token in each encoder; the frame outputs give the strong label prediction and the tag-token output gives the weak label prediction.]
13. Experiments
● DCASE2019 task 4
  – Sound event detection in domestic environments
  – Evaluation metrics: event-based and segment-based macro F1
  – Baseline model: CRNN
  – Dataset: the official provided dataset
14. Experimental conditions
● Network training configuration
  – Feature: 64-dim log mel filterbank (extraction sketched below)
  – Transformer setting: 128 attention dimensions, 16 heads (each head handles 128 / 16 = 8 dimensions)
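A sketch of the feature extraction with librosa; the slide specifies only the 64 mel bins, so the sample rate, FFT size, and hop length below are assumptions.

```python
import librosa
import numpy as np

wav, sr = librosa.load("example.wav", sr=44100)  # sample rate assumed
mel = librosa.feature.melspectrogram(y=wav, sr=sr,
                                     n_fft=2048,      # assumed
                                     hop_length=512,  # assumed
                                     n_mels=64)       # 64-dim, per the slide
log_mel = np.log(mel + 1e-10).T  # (time, 64) log mel filterbank features
```

Since the Transformer uses 128 attention dimensions, these 64-dim features would be projected to 128 dims by a front-end network before entering the encoder.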
23. Conclusion
● Proposed method
  – Weakly-supervised training for SED with self-attention and a tag token
    • Self-attention: effective sequence modeling using local and global context
    • Tag token: aggregates tag information through self-attention
● Results
  – Improved SED performance compared with the CRNN baseline
    • CRNN baseline: 30.61% → Proposed: 34.28%
  – Especially effective for sounds that appear repeatedly