AURALISATION OF DEEP CONVOLUTIONAL NEURAL NETWORKS: LISTENING TO LEARNED FEATURES

Keunwoo Choi (Queen Mary University of London, keunwoo.choi@qmul.ac.uk), Jeonghee Kim (Naver Labs, South Korea, jeonghee.kim@navercorp.com), George Fazekas (Queen Mary University of London, g.e.fazekas@qmul.ac.uk), Mark Sandler (Queen Mary University of London, m.sandler@qmul.ac.uk)
ABSTRACT

Deep learning has been actively adopted in the field of music information retrieval, e.g. genre classification, mood detection, and chord recognition. Deep convolutional neural networks (CNNs), one of the most popular deep learning approaches, have also been used for these tasks. However, the process of learning and prediction is little understood, particularly when applied to spectrograms. We introduce the auralisation of CNNs to understand their underlying mechanism.
1. INTRODUCTION
In the field of computer vision, deep learning approaches became the de facto standard after convolutional neural networks (CNNs) showed breakthrough results in the ImageNet competition in 2012 [3]. The approach rapidly became popular even though the reasons for its success were not completely understood.
One effective way to understand and explain CNNs was introduced in [5], where the features at deeper levels are visualised by a method called deconvolution. By deconvolving and un-pooling layers, it enables people to see which parts of the input image each filter focuses on.
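As a rough single-layer illustration of this idea (not the implementation used in [5], which walks back through every layer of the trained network), the sketch below records the max-pooling locations ("switches") in the forward pass, then un-pools, applies a ReLU, and maps back through the filter. All names and the numpy-only setting are assumptions for illustration.

```python
# Simplified single-layer sketch of deconvolution in the sense of [5]:
# forward = conv + ReLU + max-pool (recording the "switches");
# backward = un-pool via the switches + ReLU + transposed convolution.
# Assumes input dimensions divisible by the pooling size.
import numpy as np
from scipy.signal import correlate2d, convolve2d

def forward(x, w, pool=2):
    a = np.maximum(correlate2d(x, w, mode="same"), 0.0)  # conv + ReLU
    h, v = a.shape[0] // pool, a.shape[1] // pool
    blocks = a.reshape(h, pool, v, pool)
    pooled = blocks.max(axis=(1, 3))                     # max-pooling
    switches = blocks == pooled[:, None, :, None]        # max locations
    return pooled, switches

def deconvolve(pooled, switches, w, pool=2):
    # Un-pool: place each pooled value back at its recorded location.
    up = (switches * pooled[:, None, :, None]).reshape(
        switches.shape[0] * pool, switches.shape[2] * pool)
    up = np.maximum(up, 0.0)   # ReLU (a no-op here, but part of [5]'s procedure)
    # The adjoint of cross-correlation is convolution with the same kernel.
    return convolve2d(up, w, mode="same")
```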
However, spectrograms have not been analysed by this approach, and it is not immediately clear what can be understood by deconvolving them. The information obtained from 'seeing' a part of a spectrogram can be extended by auralising the convolutional filters.
In this paper, we introduce the procedure and results of the deconvolution of spectrograms. Furthermore, we propose the auralisation of filters by extending deconvolution to time-domain reconstruction. In Section 2, the background of CNNs and deconvolution is explained. The proposed auralisation method is introduced in Section 3. The results are discussed in Section 4.
2. BACKGROUND
The majority of CNN research on audio signals uses a 2D time-frequency representation as input data, treating it as an image.
Figure 1. Deconvolution of CNNs trained for image classification. The filters' responses and the corresponding parts of the input images are shown on the left and right, respectively, for each layer. Image courtesy of [5].
Various types of representations have been used, including the short-time Fourier transform (STFT), the mel-spectrogram, and the constant-Q transform (CQT). In [4], for example, an 80-by-15 (mel-band-by-frame) mel-spectrogram is used with 7-by-3 and 3-by-3 convolutions for onset detection, while a CQT is used with 5-by-25 and 5-by-13 convolutions for chord recognition in [2].
Visualisation of CNNs was introduced in [5], which showed how high-level features (postures/objects) are combined from low-level features (lines/curves), as illustrated in Figure 1. Visualising CNNs helps not only to understand the process inside the black-box model, but also to decide on the hyper-parameters of the networks. For example, redundancy or deficiency in the capacity of the networks, which is limited by hyper-parameters such as the number of layers and filters, can be judged by inspecting the learned filters. This is useful information, since fine-tuning hyper-parameters is among the most crucial factors in designing CNNs, yet it is a computationally expensive process.
3. AURALISATION OF FILTERS
The spectrograms used in CNNs can be deconvolved in the same way. Unlike visual images, however, a deconvolved spectrogram does not generally lend itself to an intuitive explanation. This is because, first, seeing a spectrogram does not necessarily provide the clear intuition that observing an image does. Second, detecting the edges of a spectrogram, which is known to happen at the first layer of CNNs, removes components of the spectrogram.
This paper has been supported by EPSRC Grant EP/L019981/1, Fusing Audio and Semantic Technologies for Intelligent Music Production and Consumption.
Figure 2. Filters at the first layer of a CNN trained for genre classification: initial filters (left) and learned filters (right).
To solve this problem, we propose to reconstruct the audio signal using a process called auralisation. This requires an additional stage that inverse-transforms a deconvolved spectrogram. The phase information is provided by the phase of the original STFT, following the generic approach of spectrogram-based sound source separation algorithms. Using the STFT as the input representation is therefore recommended, as it allows a time-domain signal to be obtained easily.
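A minimal sketch of this reconstruction step is shown below, using librosa for illustration. The variable names are assumptions; the 512-point FFT and 50% hop size are taken from the experiment in Section 4.

```python
# Auralisation sketch: combine a deconvolved magnitude spectrogram
# with the phase of the original STFT, then invert to the time domain.
# `deconv_mag` (magnitude spectrogram, shape matching the STFT) and
# `y` (original audio) are assumed inputs.
import numpy as np
import librosa

def auralise(deconv_mag, y, n_fft=512, hop_length=256):
    stft_orig = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    phase = np.exp(1j * np.angle(stft_orig))      # unit-magnitude phase
    return librosa.istft(deconv_mag * phase, hop_length=hop_length)
```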
4. EXPERIMENTS AND DISCUSSION
We implemented a CNN-based genre classification algorithm using a dataset obtained from Naver Music (http://music.naver.com), a Korean music streaming service. Three genres (ballade, dance, and hip-hop) were classified using 8,000 songs in total. Ten 4-second clips were extracted from each song, generating 80,000 data samples via a 512-point windowed STFT with 50% hop size. 6,600/700/700 songs were designated as training/validation/test sets, respectively. As a performance reference, a CNN with 5 layers, 3-by-3 filters, and max-pooling with size and stride of 2 was used. This system showed 75% accuracy with a loss of 0.62.
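A hedged Keras sketch of this reference architecture follows. The filter counts per layer, the number of STFT frames per clip, and the optimiser are not specified in the paper; they are assumptions here.

```python
# Sketch of the reference network: five convolutional layers with
# 3-by-3 filters, each followed by 2-by-2 max-pooling with stride 2,
# and a softmax over the three genres. Widths and optimiser assumed.
from tensorflow import keras
from tensorflow.keras import layers

n_frames = 128  # hypothetical frame count for a 4-second clip

model = keras.Sequential([keras.Input(shape=(257, n_frames, 1))])
for n_filters in (32, 32, 64, 64, 128):          # assumed widths
    model.add(layers.Conv2D(n_filters, (3, 3),
                            activation="relu", padding="same"))
    model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
model.add(layers.Flatten())
model.add(layers.Dense(3, activation="softmax"))  # 3 genre classes

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```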
4.1 Visualisation of First-layer Filters
A 4-layer CNN was built with larger filter sizes at the first layer. Since max-pooling is involved in the deeper layers, filters at those layers can only be illustrated when input data is given. To obtain a more intuitive visualisation, large convolutional filters were used, as shown in [1]. This system showed 76% accuracy.
Figure 2 shows the visualisation of the first layer, which consists of 12-by-12 filters. The filters are initialised with uniformly distributed random values, which resemble white noise, as shown on the left. After training, some of the filters develop patterns that can detect certain shapes. In image classification tasks, edge detectors with various orientations are usually observed, since the outlines of objects are a useful cue for the task. Here, however, several vertical edge detectors are observed, as shown on the right. This may be because the networks learn to extract not only edges but also more complex patterns from spectrograms.
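A short sketch of how such first-layer filters can be rendered as in Figure 2 is given below; it assumes the learned 12-by-12 kernels have already been extracted into a numpy array (the names are illustrative).

```python
# Render learned first-layer kernels as a grid of greyscale images.
# `weights` is assumed to have shape (n_filters, 12, 12).
import numpy as np
import matplotlib.pyplot as plt

def plot_filters(weights, n_cols=8):
    n_rows = int(np.ceil(len(weights) / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for ax in axes.flat:
        ax.axis("off")                        # hide unused axes too
    for ax, w in zip(axes.flat, weights):
        ax.imshow(w, cmap="gray", interpolation="nearest")
    plt.show()
```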
4.2 Auralisation of Filters
Figure 3. Original and deconvolved signals and spectrograms, with ground-truth onsets annotated (on the left).

Figure 3 shows the original and deconvolved signals and spectrograms of Bach's English Suite No. 1: Prelude, deconvolved with the 7th feature at the 1st layer. Listening to the deconvolved signals reveals that this feature acts as an onset detector-like filter. This can also be explained by visualising the filter: a vertical edge detector works as a crude onset detector when applied to spectrograms, since rapid changes along the time axis pass through the edge detector while the sustained parts are filtered out.
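To make this point concrete, here is a small illustration (not the paper's code) of a vertical edge kernel acting as a crude onset detector on a magnitude spectrogram:

```python
# The kernel computes a temporal difference, so sharp energy rises
# pass through while sustained regions cancel out.
import numpy as np
from scipy.signal import convolve2d

spec = np.abs(np.random.randn(257, 100))   # stand-in magnitude spectrogram
kernel = np.array([[1.0, -1.0]])           # after conv's flip: x[t] - x[t-1]
edges = convolve2d(spec, kernel, mode="same")
onset_strength = np.maximum(edges, 0.0).sum(axis=0)  # keep rises only
```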
The behaviour differed at deeper layers, where high-level information can be expressed in the deconvolved spectrograms. For example, one of the features at the deepest layer was mostly deactivated by hip-hop music. However, not all features can be easily interpreted by listening to the signals.
5. CONCLUSIONS
We introduced the auralisation of CNNs, an extension of CNN visualisation. It is performed by inverse-transforming a deconvolved spectrogram to obtain a time-domain audio signal. Listening to this audio signal enables researchers to understand the mechanism of CNNs when they are applied to spectrograms. Further research will include a more in-depth interpretation of the learned features.
6. REFERENCES
[1] Luiz Gustavo Hafemann. An analysis of deep neural networks for texture classification. 2014.

[2] Eric J. Humphrey and Juan P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In International Conference on Machine Learning and Applications (ICMLA). IEEE, 2012.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[4] Jan Schlüter and Sebastian Böck. Improved musical onset detection with convolutional neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

[5] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pages 818–833. Springer, 2014.
The results are demonstrated during the LBD session and online on the author's SoundCloud: https://soundcloud.com/kchoi-research. Example code for the whole deconvolution procedure is publicly available at https://github.com/gnuchoi/CNNauralisation.
