Marmara University,
Electrical and Electronics Engineering
Spring 2020, EEE7000 – Seminar
CONFERENCE PAPER
S. Hershey et al., "CNN Architectures for Large-Scale Audio Classification," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 131-135.
by Mehmet Çağrı Aksoy
24/04/2020
What is Audio Classification and CNN?
Audio classification is the process of listening to and analyzing audio recordings. Also known as
sound classification, this process is at the heart of a variety of modern AI technology including
virtual assistants, automatic speech recognition, and text to speech applications. [1]
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural
networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or
space invariant artificial neural networks (SIANN), based on their shared-weights architecture
and translation invariance characteristics. [2]
Index Terms of this Paper
Acoustic Event Detection, Acoustic Scene Classification, Convolutional Neural Networks, Deep
Neural Networks, Video Classification
What is the main event?
The main task and purpose of this work is "Acoustic Event Detection," also known as "Audio Classification."
Historically, audio classification tasks have been addressed with other methods such as SVMs and LSTMs. More recent approaches use some form of DNN (Deep Neural Network), including CNNs and RNNs.
What is the main event? Cont.
Prior work has been reported on datasets such as TRECVid, ActivityNet, Sports1M, and DCASE Acoustic Scenes 2016, which are much smaller than the dataset used in this paper.
What are the issues they are facing?
Problem Statements
Datasets
Audio file overview
Data Exploratory
Data Pre-processing
Extract Features
Building the model
Observing the results
YouTube-100M
The YouTube-100M dataset consists of 100 million YouTube videos: 70M training videos, 10M evaluation videos, and a pool of 20M videos used for validation. Videos average 4.6 minutes each, for a total of 5.4M training hours. Each video is labeled with one or more topic identifiers from a set of 30,871 labels.
YouTube-100M Cont.
Some of the labels in the dataset are noisy or wrongly assigned, and this label noise has to be handled.
Being machine generated, the labels are not 100% accurate and of the 30K labels, some are
clearly acoustically relevant (“Trumpet”) and others are less so (“Web Page”). Videos often bear
annotations with multiple degrees of specificity. For example, videos labeled with “Trumpet” are
often labeled “Entertainment” as well, although no hierarchy is enforced.
Audio file overview & Data Exploratory
The audio is divided into non-overlapping 960 ms frames.
This gave approximately 20 billion examples from the 70M videos.
Each frame inherits all the labels of its parent video.
The 960 ms frames are decomposed with a short-time Fourier transform applying 25 ms
windows every 10 ms.
The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude
of each bin is log transformed after adding a small offset to avoid numerical issues.
This gives log-mel spectrogram patches of 96 × 64 bins that form the input to all classifiers.
During training, mini-batches of 128 examples are fetched by randomly sampling from all patches.
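As a sanity check on the numbers above: 5.4M hours is roughly 1.9 × 10^10 seconds, which at one non-overlapping 960 ms frame per 0.96 s gives approximately 20 billion examples. Below is a minimal sketch of this front end, assuming 16 kHz mono input (the sample rate is not stated on the slides); the window, hop, mel-bin, and patch sizes follow the slides.

```python
import tensorflow as tf

SAMPLE_RATE = 16000                      # assumption, not given on the slides
FRAME_LEN = int(0.025 * SAMPLE_RATE)     # 25 ms analysis window
FRAME_HOP = int(0.010 * SAMPLE_RATE)     # 10 ms hop
N_MELS = 64                              # 64 mel-spaced frequency bins
LOG_OFFSET = 0.01                        # small offset to avoid log(0)

def log_mel_patches(waveform):
    """waveform: float32 tensor of shape [num_samples]; returns [num_patches, 96, 64]."""
    stft = tf.signal.stft(waveform, frame_length=FRAME_LEN, frame_step=FRAME_HOP)
    magnitude = tf.abs(stft)                                   # [frames, fft_bins]
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=N_MELS,
        num_spectrogram_bins=magnitude.shape[-1],
        sample_rate=SAMPLE_RATE)
    mel = tf.matmul(magnitude, mel_matrix)                     # [frames, 64]
    log_mel = tf.math.log(mel + LOG_OFFSET)
    # Group 96 consecutive frames (~960 ms) into non-overlapping 96 x 64 patches.
    num_patches = tf.shape(log_mel)[0] // 96
    return tf.reshape(log_mel[: num_patches * 96], [-1, 96, 64])
```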
Spectrogram examples
System Model
They have used various CNN architectures to classify the soundtracks of a dataset of 70M
training videos (5.24 million hours) with 30,871 video-level labels.
They examine fully connected Deep Neural Networks (DNNs), AlexNet, VGG, Inception, and
ResNet.
System Model cont.
All experiments used TensorFlow and were trained asynchronously on multiple GPUs using the
Adam optimizer.
Batch normalization was applied after all convolutional layers.
All models used a final sigmoid layer rather than a softmax layer since each example can have
multiple labels. Cross-entropy was the loss function.
In view of the large training set size, they did not use dropout, weight decay, or other common
regularization techniques.
For the models trained on 7M or more examples, they saw no evidence of overfitting during training.
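A minimal sketch of this training configuration, in Keras terms, is below; it shows the per-label sigmoid outputs, cross-entropy loss, and Adam optimizer, while the asynchronous multi-GPU setup and the batch normalization inside the architectures are omitted.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy()   # cross-entropy over per-label sigmoid outputs
optimizer = tf.keras.optimizers.Adam()           # learning rate was tuned per experiment

def train_step(model, patches, labels):
    """patches: [batch, 96, 64, 1] log-mel inputs; labels: multi-hot [batch, num_labels]."""
    with tf.GradientTape() as tape:
        predictions = model(patches, training=True)   # final layer is a sigmoid, not a softmax
        loss = loss_fn(labels, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```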
CNN Architectures
Their baseline is a fully connected DNN, which they compared to several networks closely
modeled on successful image classifiers.
These image classifiers include AlexNet, VGG, Inception, and ResNet.
For their baseline experiments, they trained and evaluated using only the 10% most frequent labels of the original 30K (i.e., 3K labels).
For each experiment, they optimized the number of GPUs and the learning rate for frame-level classification accuracy.
Fully Connected
Their baseline network is a fully connected model with ReLU activations, N layers, and M units per layer, where N ∈ {2, 3, 4, 5, 6} and M ∈ {500, 1000, 2000, 3000, 4000}.
Their best-performing model had N = 3 layers, M = 1000 units, a learning rate of 3e-5, 10 GPUs, and 5 parameter servers.
This network has approximately 11.2M weights and 11.2M multiplies.
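A minimal Keras sketch of this best-performing baseline is below (3 hidden layers of 1,000 ReLU units on flattened 96 × 64 patches, and a 3,087-unit sigmoid output); the multi-GPU/parameter-server setup is omitted. With these sizes the parameter count comes out to roughly 11.2M, matching the figure above.

```python
import tensorflow as tf

def build_fc_baseline(num_layers=3, units=1000, num_labels=3087):
    """Fully connected baseline: Flatten -> num_layers x Dense(ReLU) -> sigmoid output."""
    model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(96, 64))])
    for _ in range(num_layers):
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(num_labels, activation="sigmoid"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
                  loss="binary_crossentropy")
    return model

model = build_fc_baseline()
model.summary()   # roughly 11.2M trainable parameters
```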
AlexNet
The original AlexNet architecture was designed for a 224 × 224 × 3 input with an initial 11 × 11 convolutional layer with a stride of 4. Because the inputs here are 96 × 64, they use a stride of 2 × 1 so that the number of activations is similar after the initial layer.
They also use batch normalization after each convolutional layer instead of local response normalization (LRN) and replace the final 1000-unit layer with a 3,087-unit layer.
While the original AlexNet has approximately 62.4M weights and 1.1G multiplies, their version has 37.3M weights and 767M multiplies.
Also, for simplicity, unlike the original AlexNet, they do not split filters across multiple devices.
They trained with 20 GPUs and 10 parameter servers.
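A minimal sketch of the modified front end described above: an 11 × 11 convolution with stride (2, 1) on 96 × 64 × 1 inputs, followed by batch normalization in place of LRN. The 96-filter width is taken from the original AlexNet (an assumption here; the slides do not state the filter count), and the remaining layers are omitted.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(96, 64, 1))                       # log-mel patch
x = tf.keras.layers.Conv2D(96, kernel_size=11, strides=(2, 1),   # stride 2x1 instead of 4
                           padding="same", activation="relu")(inputs)
x = tf.keras.layers.BatchNormalization()(x)                      # replaces LRN
x = tf.keras.layers.MaxPooling2D(pool_size=3, strides=2)(x)
# ... remaining AlexNet-style convolutional blocks would follow,
# ending in a 3,087-unit sigmoid output layer ...
```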
VGG
The only changes they made to VGG were to the final layer (3,087 units with a sigmoid) as well as the use of batch normalization instead of LRN.
While the original network had 144M weights and 20B multiplies, the audio variant uses 62M weights and 2.4B multiplies.
They tried another variant that reduced the initial strides (as they did with AlexNet), but found that not modifying the strides resulted in faster training and better performance.
With their setup, parallelizing beyond 10 GPUs did not help significantly, so they trained with 10 GPUs and 5 parameter servers.
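A loose sketch of this adaptation is below, using the stock Keras VGG16 as a stand-in backbone (note it does not include the batch normalization the audio variant adds) and attaching fully connected layers sized as in the original VGG before the 3,087-unit sigmoid output; treat all layer sizes other than the output as assumptions.

```python
import tensorflow as tf

backbone = tf.keras.applications.VGG16(weights=None, include_top=False,
                                       input_shape=(96, 64, 1))
x = tf.keras.layers.Flatten()(backbone.output)
x = tf.keras.layers.Dense(4096, activation="relu")(x)            # sizes follow the original VGG
x = tf.keras.layers.Dense(4096, activation="relu")(x)
outputs = tf.keras.layers.Dense(3087, activation="sigmoid")(x)   # replaces the 1000-way softmax
vgg_audio = tf.keras.Model(backbone.input, outputs)
```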
Inception V3
They modified the Inception V3 network by removing the first four layers of the stem, up to and including the MaxPool, as well as removing the auxiliary network.
They changed the Average Pool size to 10 × 6 to reflect the change in activations.
The original network has 27M weights with 5.6B multiplies, and the audio variant has 28M weights and 4.7B multiplies.
They trained with 40 GPUs and 20 parameter servers.
ResNet-50
They modified ResNet-50 by removing the stride of 2 from the first 7 × 7 convolution so that the number of activations was not too different in the audio version.
They changed the Average Pool size to 6 × 4 to reflect the change in activations.
The original network has 26M weights and 3.8B multiplies. The audio variant has 30M weights
and 1.9B multiplies.
They trained with 20 GPUs and 10 parameter servers.
Performance Metrics
mAP -> mean Average Precision
AUC -> the area under the Receiver Operating Characteristic (ROC) curve
D-prime -> the separation between the means of the signal and noise score distributions; it can be computed directly from AUC
Higher mAP values are better.
Higher D-prime values are better.
Perfect classification achieves an AUC of 1.0, and random guessing gives an AUC of 0.5.
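A minimal sketch of these three metrics is below, assuming multi-hot ground truth and per-class scores of shape [num_examples, num_classes]; D-prime is obtained from AUC via the inverse normal CDF, d' = sqrt(2) * Phi^-1(AUC), the standard conversion for this metric.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, y_score):
    """y_true: multi-hot array [n_examples, n_classes]; y_score: predicted scores."""
    mAP = average_precision_score(y_true, y_score, average="macro")  # mean Average Precision
    auc = roc_auc_score(y_true, y_score, average="macro")            # mean area under the ROC curve
    d_prime = np.sqrt(2.0) * norm.ppf(auc)                           # separation of the score distributions
    return mAP, auc, d_prime
```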
Results
Table 2 shows the evaluation results calculated over the 100K balanced videos.
All CNNs beat the fully connected baseline.
Inception and ResNet achieve the best performance:
◦ They provide high model capacity, and their convolutional units can efficiently capture common structures that may occur in different areas of the input array for both image and audio representations.
Results of comparison between architectures
Results of varying label set size
Results of training with different amount of data
AED with the Audio Set Dataset
Audio Set is a dataset of over 1 million 10-second excerpts labeled with a vocabulary of acoustic events.
They train two fully connected models to predict labels for Audio Set.
The first model uses 64 × 20 log-mel patches as input, and the second uses the output of the "embedding" layer of their best ResNet model.
The log-mel baseline achieves a balanced mAP of 0.137 and an AUC of 0.904 (equivalent to a d-prime of 1.846).
The model trained on embeddings achieves mAP / AUC / d-prime of 0.314 / 0.959 / 2.452.
This jump in performance reflects the benefit of the larger YouTube-100M training set embodied in the ResNet classifier outputs.
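A minimal sketch of the second (embedding-based) model is below: a small fully connected classifier over pre-computed ResNet embedding features. The embedding dimensionality, hidden-layer size, and the 527-class Audio Set vocabulary are assumptions, as the slides do not state them.

```python
import tensorflow as tf

EMB_DIM = 128       # assumed dimensionality of the ResNet "embedding" features
NUM_CLASSES = 527   # assumed Audio Set label vocabulary size

embedding_classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(1000, activation="relu", input_shape=(EMB_DIM,)),  # assumed hidden size
    tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),                # multi-label output
])
embedding_classifier.compile(optimizer="adam", loss="binary_crossentropy")
```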
Conclusions
The results show that state-of-the-art image networks are capable of excellent results on audio
classification when compared to a simple fully connected network or earlier image classification
architectures.
They saw results showing that training on larger label set vocabularies can improve
performance, albeit modestly, when evaluating on smaller label sets.
They saw that increasing the number of videos up to 7M improves performance for the best-performing ResNet-50 architecture. They note that regularization could have reduced the gap between the models trained on smaller datasets and the 7M and 70M datasets.
They see a significant increase over the baseline when training a model for AED with ResNet embeddings on the Audio Set dataset.
What do we need to do to move forward?
Designing more accurate architectures.
Training times are very long; either faster hardware is needed or faster architectures should be designed.
Removing noisy data from the dataset.
Training the model with more labels to detect a wider variety of audio events.
What have you learned?
The importance of dataset size.
Differences between CNN architectures and their responses.
CNN behavior on audio classification.
How label set size and training data size affect the results.
References
[1] https://lionbridge.ai/articles/what-is-audio-classification/
[2] https://en.wikipedia.org/wiki/Convolutional_neural_network
[3] https://www.researchgate.net/figure/Polyphonic-acoustic-event-detection-task_fig2_322910427
[4] http://150.162.46.34:8080/icassp2017/pdfs/0000131.pdf
Q&A