In this talk, we explain the techniques and models behind the Cochlear.ai team's submission to DCASE 2018 task 2: general-purpose audio tagging of Freesound content with AudioSet labels. We focused mainly on how to train deep learning models efficiently under strong augmentation and label noise. First, we built a single-block DenseNet architecture with a multi-head softmax classifier for efficient learning with mixup augmentation. To handle label noise, we applied batch-wise loss masking to exclude the loss of outliers within a mini-batch. We also tried an ensemble of various models trained with different sampling rates or audio representations.
Audio tagging system using densely connected convolutional networks (DCASE2018 task2)
1. Audio tagging system using
densely connected convolutional networks
Presented by: Il-Young Jeong
Authors: Il-Young Jeong and Hyungui Lim
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018
20 November 2018, Surrey, UK
2. Introduction: DCASE 2018 challenge task 2
General-purpose audio tagging of
Freesound content with AudioSet labels
• Classifying sound events of very diverse nature including:
- musical instruments
- human sounds
- domestic sounds
- animals
- etc.
• Dataset: Subset of Freesound Dataset with AudioSet Ontology
3. Introduction: DCASE 2018 challenge task 2
Difficulty of the task was due to:
• Varied input length
- from 300 ms to 30 s
• Insufficient training data
- ~9.5k recordings for 41 classes
• Imbalanced class distribution
- from 94 to 300 samples per class
• Unreliable annotation
- only ~40% of labels were verified
4. Introduction: DCASE 2018 challenge task 2
Our solutions to each difficulty:
• Varied input length (300 ms to 30 s) → Segment-wise learning
• Insufficient training data (~9.5k recordings for 41 classes) → Strong augmentation (mixup)
• Imbalanced class distribution (94 to 300 samples per class) → Evenly distributed batches
• Unreliable annotation (only ~40% of labels verified) → Batch-wise loss masking
• Plus an ensemble approach
5. Framework: (On-the-fly) preprocessing
• All preprocessing steps are performed during each batch generation.
- Pros: fast implementation of various settings
- Cons: extra computation during batch generation
• Segmentation
- Long data → take excerpts
- Short data → zero-padding
• Mixup augmentation
- New data generated by mixing two segments
• T-F representation
- Raw waveform / log-mel
- Faster operation on GPU, thanks to kapre
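The segmentation step above can be sketched as follows. This is a minimal NumPy illustration, not the team's original code; the function name and the 64,000-sample segment length (from the experiments slide) are assumptions for the example.

```python
import numpy as np

SEGMENT_LEN = 64_000  # samples per segment, e.g. 4 s at 16 kHz (illustrative)

def make_segment(waveform, segment_len=SEGMENT_LEN, rng=None):
    """Crop a random excerpt from long clips; zero-pad short ones."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(waveform)
    if n >= segment_len:
        # Long data: take a random excerpt of exactly segment_len samples.
        start = rng.integers(0, n - segment_len + 1)
        return waveform[start:start + segment_len]
    # Short data: copy the clip into a zero-filled buffer.
    out = np.zeros(segment_len, dtype=waveform.dtype)
    out[:n] = waveform
    return out
```

Doing this per batch (rather than precomputing fixed segments) is what enables the "on-the-fly" preprocessing trade-off described above.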
6. Framework: Evenly distributed batch generation
• Mini-batch learning: updates the model using a subset of the training data.
• Randomly selected batch: randomly selects N examples from the training set.
- Does not guarantee that a mini-batch contains every class.
- Inherits any class imbalance present in the full training set.
• Evenly distributed batch: choose M examples per class, so N = M * C (C = number of classes).
- Every mini-batch contains all classes.
- Has a balanced class distribution.
- (Empirically) shows more stable and faster convergence.
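A simple sketch of evenly distributed batch sampling, under the N = M * C scheme above. This is an illustrative implementation, not the original; sampling with replacement within a class is an assumption to handle classes smaller than M.

```python
import numpy as np

def even_batch_indices(labels, m_per_class, rng=None):
    """Sample M example indices from each class; batch size N = M * C."""
    if rng is None:
        rng = np.random.default_rng()
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Draw m_per_class indices per class (with replacement, in case a
    # class has fewer than m_per_class examples).
    picks = [rng.choice(np.flatnonzero(labels == c), size=m_per_class,
                        replace=True)
             for c in classes]
    batch = np.concatenate(picks)
    rng.shuffle(batch)  # avoid class-ordered batches
    return batch
```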
7. Framework: Mixup augmentation
• Mixup: data augmentation using linear interpolation between two examples.
• We used mixup to train the model to predict the relative scale of the two sources, rather than for binary classification.
x: data
t: label
λ: mixing parameter
w: scale parameter
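For reference, standard mixup can be sketched as below. Note this shows the usual interpolated labels; the slide's scale-prediction target w is a variant on top of this, whose exact formulation (shown as an equation image in the original deck) is not reproduced here.

```python
import numpy as np

def mixup(x1, t1, x2, t2, alpha=1.0, rng=None):
    """Standard mixup: linearly interpolate two inputs and their labels
    with a Beta(alpha, alpha)-distributed mixing parameter lambda."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    t = lam * t1 + (1.0 - lam) * t2
    return x, t
```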
9. Framework: End-to-end DenseNet
• DenseNet: densely connected network
f_dense(x) = concatenate(f(x), x)
• Allows a direct path for backpropagation.
• End-to-end DenseNet:
- All layers from input (log-mel) to output (loss) are densely concatenated.
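The dense connection f_dense(x) = concatenate(f(x), x) can be illustrated with a toy NumPy sketch (the actual model uses convolutional layers; the lambda layers here are placeholders):

```python
import numpy as np

def dense_block(x, layer_fns):
    """Densely connected block: each layer's output is concatenated with
    its input, so every later layer sees all earlier feature maps and
    gradients have a direct path back to the input."""
    for f in layer_fns:
        x = np.concatenate([f(x), x], axis=-1)
    return x
```

With a width-preserving layer, each dense connection doubles the feature dimension, which is why DenseNets typically interleave compression layers.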
10. Framework: Multi-head softmax
• Replace the softmax layer with the average of multiple softmax outputs.
• Why?
- Gives initial predictions close to 0.5, which suits mixup targets.
- Makes predictions near 0.5 easier to produce.
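A minimal sketch of the multi-head softmax idea: average the softmax outputs of several independent heads. The intuition is that disagreeing heads average toward a flat (near-0.5 for two classes) prediction, which matches mixup's soft targets better than a single confident softmax.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_softmax(logits_per_head):
    """Average the softmax outputs of multiple classifier heads."""
    probs = [softmax(z) for z in logits_per_head]
    return np.mean(probs, axis=0)
```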
11. Framework: Batch-wise loss masking (1)
• Categorical cross-entropy for a mini-batch:
• Masked loss when falsely annotated data are known:
m_n: 1 when the n-th example has a true label
0 when the n-th example has a false label
12. Framework: Batch-wise loss masking (2)
• Our solution: remove outliers with the highest loss from the gradient calculation.
- x may be falsely annotated if:
1) it is non-verified, and
2) it shows the highest (or a similarly high) loss in the current batch / iteration.
• Efficient computation of max(loss) using batch-wise calculation.
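A sketch of batch-wise loss masking under the two conditions above: among the non-verified samples in a mini-batch, zero out the loss of the k highest-loss ones before averaging. The parameter k and the exact "similar loss" threshold are simplifications of the slide's description, not the team's exact rule.

```python
import numpy as np

def masked_mean_loss(losses, verified, k=1):
    """Mean per-sample loss, with the k highest-loss NON-verified samples
    masked out (m_n = 0) so likely mislabeled outliers contribute no
    gradient. Verified samples are never masked."""
    losses = np.asarray(losses, dtype=float)
    mask = np.ones_like(losses)  # m_n = 1 by default
    candidates = np.flatnonzero(~np.asarray(verified))  # non-verified only
    if candidates.size:
        worst = candidates[np.argsort(losses[candidates])[-k:]]
        mask[worst] = 0.0
    return (losses * mask).sum() / mask.sum()
```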
13. Experimental results
• Audio segment: 64,000 samples for all experiments
- 16 kHz / 4 s, 32 kHz / 2 s, 44.1 kHz / 1.45 s
• Input domain: log-mel or waveform
• MAP@3 results
15. Future work
• Verifying ideas with additional experiments.
• Model size minimization
• Implementation for real-world applications
16. Thank you!
• We thank @Zafar and @daisukelab, who provided wonderful kernels and discussions for the task.
• If you are interested in Cochlear.ai, please visit www.cochlear.ai