WEAKLY-SUPERVISED SOUND EVENT DETECTION
WITH SELF-ATTENTION
Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda
This work was done during an internship at LINE Corporation
ICASSP2020
Session WE1.L5: Acoustic Event Detection
Outline of this work
• Goal
  – Improve sound event detection (SED) performance
  – Utilize weak-label data for training
• Contributions
  – Propose self-attention-based weakly-supervised SED
  – Introduce a special tag token to handle weak-label information
• Evaluation
  – Improved SED performance compared with CRNN
    • CRNN baseline: 30.61% → Proposed: 34.28% (event-based macro F1)
[Figure: SED detects an event such as "Alarm" with its onset and offset on the time axis; a weak label lists only the tags, e.g. "Alarm, Dog, Speech"]
Background
• Sound event detection (SED)
  – Identifying environmental sounds with timestamps
• Collecting an annotated dataset
  – Strong label
    • Easy to handle ☺
    • Expensive annotation cost ☹
  – Weak label
    • Hard to handle ☹
    • Cheap annotation cost ☺
[Figure: a strong label includes timestamps (onset/offset) for each event (Alarm, Dog, Speech); a weak label includes no timestamps, only the tags "Alarm, Dog, Speech" are available. Problem: training SED with weak labels]
Weakly-supervised training for SED
• Multiple-instance learning (MIL)
  – An effective approach for training with weak labels (a minimal sketch follows the figure below)
  – Predict frame by frame, then aggregate the frame-level predictions into a sequence-level prediction
[Figure: frame-by-frame scores for class1, class2, class3 are aggregated in the time domain into a sequence-level prediction, and the loss is calculated against the weak label]
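To make the MIL recipe concrete, here is a minimal PyTorch sketch of weak-label training; the shapes and the mean aggregation are placeholder assumptions of mine, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 8 clips, 500 frames, 10 event classes.
frame_probs = torch.rand(8, 500, 10)                # frame-level probabilities from an SED model
weak_labels = torch.randint(0, 2, (8, 10)).float()  # clip-level (weak) labels: tags only, no timestamps

# MIL: aggregate the frame-level predictions over time into a clip-level
# prediction, then compute the loss against the weak label only.
clip_probs = frame_probs.mean(dim=1)                # placeholder aggregation (see the next slide)
loss = F.binary_cross_entropy(clip_probs, weak_labels)
print(loss.item())
```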
Which aggregation approach is effective?
How to aggregate frame-level predictions
• Global max pooling
  – Captures short-duration events
  – Sensitive to noise
• Global average pooling
  – Captures long-duration events
  – Ignores short-duration events
• Attention pooling
  – Flexible weighting via the attention mechanism (compared in the sketch below)
[Figure: frame-level predictions (score vs. time) are aggregated into a sequence-level prediction by max, average, or weighted sum]
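Each of the three aggregation strategies can be written as a one-line tensor operation; a minimal sketch, with function names of my own choosing and the attention logits assumed to come from a separate learned layer:

```python
import torch

def max_pool(frame_probs):                     # (B, T, C) -> (B, C)
    # Global max pooling: captures short events but is sensitive to noisy frames.
    return frame_probs.max(dim=1).values

def avg_pool(frame_probs):                     # (B, T, C) -> (B, C)
    # Global average pooling: captures long events but washes out short ones.
    return frame_probs.mean(dim=1)

def attention_pool(frame_probs, attn_logits):  # both (B, T, C) -> (B, C)
    # Attention pooling: a learned per-frame weighting (weights sum to 1 over time).
    weights = torch.softmax(attn_logits, dim=1)
    return (weights * frame_probs).sum(dim=1)
```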
Attention pooling
• Calculates a prediction and a confidence (attention weight) for each frame according to the input (see the sketch after the figure)
[Figure: input frame-level features are mapped to event features (sigmoid) and frame-level confidences (softmax attention weights); their weighted sum over time gives the sequence-level prediction]
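A minimal module matching the figure, with a sigmoid branch for frame-level event probabilities and a softmax branch for frame-level confidences; the layer names and sizes are assumptions rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weighted sum of frame-level predictions with learned attention weights."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.event = nn.Linear(feat_dim, num_classes)   # frame-level event scores
        self.attn = nn.Linear(feat_dim, num_classes)    # frame-level confidence scores

    def forward(self, x):                                # x: (B, T, feat_dim)
        frame_probs = torch.sigmoid(self.event(x))       # (B, T, C): frame-level prediction
        weights = torch.softmax(self.attn(x), dim=1)     # (B, T, C): attention over time
        clip_probs = (weights * frame_probs).sum(dim=1)  # (B, C): sequence-level prediction
        return clip_probs, frame_probs
```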
Self-attention
• Transformer [Vaswani+17]
  – Makes effective use of the self-attention model
  – Captures both local and global context information
  – Great success in NLP and in various audio/speech tasks
    • ASR, speaker recognition, speaker diarization, TTS, etc.
[Figure: Transformer encoder block: input with positional encoding → multi-head attention → add & norm → feed-forward → add & norm, stacked N×]
In this work, we use the Transformer encoder.
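As a sketch of the block in the figure, a stack of encoder layers can be assembled from PyTorch's built-in modules; the sinusoidal positional encoding and the feed-forward size below are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):            # x: (B, T, d_model)
        return x + self.pe[: x.size(1)]

# N x (multi-head attention -> add & norm -> feed-forward -> add & norm)
layer = nn.TransformerEncoderLayer(d_model=128, nhead=16, dim_feedforward=512, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)
x = SinusoidalPositionalEncoding(128)(torch.randn(4, 500, 128))
out = encoder(x)                     # (4, 500, 128) output frame-level features
```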
Overview of self-attention
[Figure: self-attention: three Dense layers project the input frame-level features; the attention weights are multiplied with the event features and summed to give the output frame-level features]
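The "Dense" blocks in the figure correspond to the query/key/value projections of self-attention; a minimal single-head sketch (multi-head attention splits this computation across several heads):

```python
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)  # the three Dense layers in the figure
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                     # x: (B, T, d_model) input frame-level features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)  # (B, T, T)
        return attn @ v                       # (B, T, d_model) output frame-level features
```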
In weakly-supervised SED, how should weak-label data be handled?
Proposed method
• Weakly-supervised training for SED with self-attention and a tag token
  – Introduce a Transformer encoder as the self-attention model for sequence modeling
  – Introduce a tag token dedicated to weak-label estimation (sketched in code after the figure)
[Figure: the input is converted into a feature sequence, the tag token is appended at the first frame, the sequence passes through the stacked Transformer encoder, and a classifier with sigmoid outputs predicts the strong label from the frame outputs and the weak label from the tag token]
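A minimal sketch of this idea, under my own assumptions about details the slides leave open (a constant all-ones tag token, a single shared classifier, PyTorch's built-in encoder); this is not the authors' released code. The tag token is appended at the first frame, the stacked Transformer encoder processes the whole sequence, the frame outputs give the strong-label prediction, and the tag-token output gives the weak-label prediction.

```python
import torch
import torch.nn as nn

class TagTokenSED(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 16, n_layers: int = 3, n_classes: int = 10):
        super().__init__()
        # Tag token: a constant (non-trainable) vector appended at the first frame.
        self.register_buffer("tag_token", torch.ones(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)   # classifier head (assumed shared here)

    def forward(self, feats):                             # feats: (B, T, d_model) feature sequence
        tag = self.tag_token.expand(feats.size(0), -1, -1)
        h = self.encoder(torch.cat([tag, feats], dim=1))  # (B, 1 + T, d_model)
        weak = torch.sigmoid(self.classifier(h[:, 0]))    # weak-label prediction from the tag token
        strong = torch.sigmoid(self.classifier(h[:, 1:])) # strong-label prediction, frame by frame
        return strong, weak
```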
Self-attention with tag token
[Figure: the same self-attention computation as before, now with the tag token prepended to the input frame-level features; the output contains an updated tag token alongside the output frame-level features]
Self-attention with tag token
[Figure: the tag token (a constant value) is appended to the input sequence; in each of the stacked encoders (encoder 1, encoder 2, …, encoder N), self-attention captures the relationship between the tag token and the input frames and aggregates it into the tag token; at the output, the frame-level features give the strong-label prediction and the tag token gives the weak-label prediction]
Experiments
• DCASE2019 Task 4
  – Sound event detection in domestic environments
  – Evaluation metrics: event-based and segment-based macro F1
  – Baseline model: CRNN
  – Dataset: the provided DCASE2019 Task 4 dataset
Experimental conditions
• Network training configuration
  – Feature: 64-dim log mel filterbank
  – Transformer setting: 128 attention dimensions, 16 heads (each head handles 8 dims); see the feature-extraction sketch below
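A hedged sketch of the 64-dim log mel feature extraction using librosa; the sample rate, FFT size, and hop length are my assumptions (the slide does not specify them), and the mapping from the 64 feature dims to the 128-dim attention dimension is likewise left out here.

```python
import librosa
import numpy as np

def logmel64(wav_path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Load audio and return a (T, 64) log mel filterbank feature matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, n_mels=n_mels)
    return np.log(mel + 1e-10).T   # transpose so frames are on the first axis
```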
Experimental results
Method             Event-based [%]   Segment-based [%]   Frame-based [%]
CRNN (baseline)    30.61             62.21               60.94
Transformer (E=3)  34.27             65.07               61.85
Transformer (E=4)  33.05             65.14               62.00
Transformer (E=5)  31.81             63.90               60.78
Transformer (E=6)  34.28             64.33               61.26
(E: number of stacked encoders)

The Transformer models outperformed the CRNN baseline.
Experimental results
[Figure: class-wise results for CRNN vs. Transformer]
Especially the Blender and Dishes classes are improved (+10.4%, +13.5%)
=> Effective for sounds that appear repeatedly
Experimental results
Attention pooling vs. tag token

Method                               Encoder stack   Event-based [%]   Segment-based [%]   Frame-based [%]
Self-attention + attention pooling   3               33.99             65.95               62.36
Self-attention + attention pooling   6               33.84             65.61               62.10
Self-attention + tag token           3               34.27             65.07               61.85
Self-attention + tag token           6               34.28             64.33               61.26

The two approaches perform comparably.
Predicted example
Visualization of attention weights
Conclusion
• Proposed method
  – Weakly-supervised training for SED with self-attention and a tag token
    • Self-attention: effective sequence modeling using local and global context
    • Tag token: aggregates tag information through self-attention
• Results
  – Improved SED performance compared with CRNN
    • CRNN baseline: 30.61% → Proposed: 34.28% (event-based macro F1)
  – Effective for sounds that appear repeatedly
