B. Tech Project Report First Phase
Polyphonic music transcription using machine learning techniques
Submitted in partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology from
Indian Institute of Technology, Guwahati
Under the supervision of
Associate Professor Girish Sampath Setlur
Assistant Professor Amit Sethi
Submitted by:
Lalit Pradhan
10012119
November 15, 2013
Department of Physics
Indian Institute of Technology Guwahati
Guwahati 781039, Assam, INDIA
Certificate
This is to certify that the work presented in the report entitled “Polyphonic music transcription using
machine learning techniques” by Lalit Pradhan, 10012119, represents an original work under the
guidance of Associate Professor Girish Sampath Setlur and Assistant Professor Amit Sethi. This study
has not been submitted elsewhere for a degree.
Signature of student: Lalit Pradhan, 10012119
Date:
Place:
Signature of supervisor I: Associate Professor Girish S Setlur
Date:
Place:
Signature of supervisor II: Assistant Professor Amit Sethi
Date:
Place:
Signature of examiner:
Date:
Place:
Abstract
In this project report we present a method for the recognition of timbre, the identification of the different musical instruments, and the automatic transcription of these individual instruments. We introduce time-frequency distributions for the analysis of musical instruments. The method presented in the earlier semester uses Independent Component Analysis (hereafter ICA), a special case of Blind Source Separation (hereafter BSS), to solve the problem at hand, and we studied how the ICA method works. This term we also introduce a more efficient algorithm for the aforementioned problem statement, based on a reduced-dimensional set of autoencoders for the identification of timbre. This report describes several approaches to analyzing the frequency (pitch) content of the sound produced by musical instruments: Fourier analysis, spectrograms and scalograms. Scalograms allow one to zoom in on selected regions of the time-frequency plane in a more flexible manner than is possible with spectrograms. These time-frequency portraits seem to correlate well with our perception of the sounds produced by these instruments and of the differences between instruments. This property will be used as the feature with which the sparse autoencoders are trained to recognize the instruments and extract the individual instrument notes. The training sets are fed to a learning algorithm which forms a hypothesis function that recognizes any new input and gives the appropriate output, i.e., in our case, identifies the appropriate instrument.
Keywords
Blind Source Separation, Independent Component Analysis, Identification of Timbre,
Sparse Autoencoders, Spectrograms, Scalograms.
Table of Contents
1. Introduction
2. Objective
3. Literature Research and user study
   3.1 Identification of the instruments
      3.1.1 Fourier Series
         3.1.1.1 FFTs
      3.1.2 Spectrograms
         3.1.2.1 Spectrograms of Piano and Flute
      3.1.3 Scalograms
4. Neural Networks
   4.1 Neural Network Model
   4.2 Backpropagation Algorithm
   4.3 Autoencoders and Sparsity
5. Observations and Conclusions
References
Acknowledgements
1. Introduction
A typical musical piece consists of several instruments playing the same or different notes at different pitches, loudness and intensity. We will try to develop a computer-based model, trained on a set of examples, that identifies the instruments and separates the sources. Separation of the individual instruments is an essential prerequisite for automatic music transcription: identifying which instruments are playing, and when and for how long each note is played.
The musical concepts here are fundamentals and overtones. Mathematically, these concepts are described via Fourier coefficients, and their role in producing sounds is modeled by Fourier series. Although Fourier series are an essential tool, they do have limitations; in particular, they are not effective at capturing abrupt changes in the frequency content of sounds, such as the transitions between individual notes. Hence we describe a modern method of time-frequency analysis, the spectrogram, which better handles changes in frequency content over time. Spectrograms provide a type of "fingerprint" of the sounds from various instruments, and these fingerprints allow us to distinguish one instrument from another.
While spectrograms are a fine tool for many situations, they are not closely correlated with the frequencies (pitches) typically found on musical scales, and there are cases where this leads to problems. Hence we also describe a method of time-frequency analysis, the scalogram, which does correlate well with musical-scale frequencies. Scalograms yield a powerful new approach, based on the mathematical theory of wavelets, which solves problems lying beyond the scope of either Fourier series or spectrograms.
The situation at hand is a logistic classification problem in the world of machine learning. We will look at learning algorithms such as gradient descent and Newton's method to arrive at the right cost function. The autoencoder tries to learn the hypothesis function so that it can identify the individual inputs (instruments) and reproduce the input function (the identity function).
2. Objective
The objective of this report is to understand Fourier spectra, spectrograms and scalograms, and the working of sparse autoencoders, and to use the aforementioned time-frequency analyses as features when learning the hypothesis function from the training input sets for the autoencoders.
3. Literature Research and user study
The basic idea is to teach a computer-based model to identify the different instruments. After identification the machine would separate the individual instruments and then transcribe the notes being played. The machine is made to learn the different attributes of an instrument. This uses an exemplar-based learning method in which the machine learns to identify an instrument after being trained on a number of training sets. Once we identify the instrument, autoencoders separate the individual instruments.
3.1 Identification of the instruments
To understand the relation between pitch and frequency, consider the sound from a tuning fork recorded with a microphone attached to an oscilloscope. This produces a graph similar to the one shown below.
The above graph was created by plotting the function $100\sin(2\pi\nu t)$, a sinusoid of frequency $\nu = 440$ cycles/sec (Hz), a tone identical to that of a tuning fork of pitch $A_4$ on the well-tempered scale. A pure tone with a single pitch is thus associated with a single frequency, in this case 440 Hz. The next figure shows the Fourier spectrum of the sinusoid; the single peak at 440 Hz is clearly evident. The formulas used to generate this Fourier spectrum will be discussed below.
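As a quick illustration (a Python sketch of our own, not part of the report's original Matlab/Octave work; the sampling rate is an assumed value), one can synthesize this 440 Hz tone and locate its spectral peak with an FFT:

    import numpy as np

    fs = 8000                              # sampling rate in Hz (assumed)
    t = np.arange(0, 1.0, 1 / fs)          # one second of sample times
    f = 100 * np.sin(2 * np.pi * 440 * t)  # the pure tone 100 sin(2*pi*440*t)

    # Spectrum of the sampled tone; with a 1 s record the bin spacing is 1 Hz.
    F = np.fft.rfft(f) / len(f)
    freqs = np.fft.rfftfreq(len(f), d=1 / fs)
    print("spectral peak at", freqs[np.argmax(np.abs(F))], "Hz")  # -> 440.0 Hz

The single peak at 440 Hz mirrors the figure: one pitch, one frequency.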
Unlike tuning forks, sounds from musical instruments are time-evolving superpositions of several pure tones, or sinusoidal waves. For example, the next figure shows the Fourier spectrum of the piano note $E_4$, with base frequency 330 Hz. In this spectrum there are peaks located at the (approximate) frequencies 330 Hz, 660 Hz, 990 Hz, 1320 Hz and 1650 Hz. Notice that these frequencies are all integral multiples of the base frequency, 330 Hz, called the fundamental. The integral multiples of this fundamental are called overtones.
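To make the fundamental/overtone picture concrete, here is a similar sketch with a synthetic stand-in for the piano note (the relative amplitudes of the partials are assumed, not measured from a recording); its spectrum shows exactly the peaks listed above:

    import numpy as np

    fs = 8000
    t = np.arange(0, 1.0, 1 / fs)
    amps = [1.0, 0.5, 0.3, 0.2, 0.1]   # assumed amplitudes of the partials
    note = sum(a * np.sin(2 * np.pi * 330 * (k + 1) * t)
               for k, a in enumerate(amps))

    # The peaks sit at integral multiples of the 330 Hz fundamental.
    F = np.abs(np.fft.rfft(note)) / len(note)
    freqs = np.fft.rfftfreq(len(note), d=1 / fs)
    print(freqs[F > 0.02])             # -> [ 330.  660.  990. 1320. 1650.]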
3.1.1 Fourier Series
The classic mathematical theory for
describing musical notes is that of Fourier
series. Given a sound signal $f(t)$ (such as a musical note or chord) defined on the interval $[0, \Omega]$, its Fourier series is
$$c_0 + \sum_{n=1}^{\infty}\left\{ a_n \cos\frac{2\pi n t}{\Omega} + b_n \sin\frac{2\pi n t}{\Omega} \right\}$$

with Fourier constants $c_0$, $a_n$ and $b_n$ defined by

$$c_0 = \frac{1}{\Omega}\int_0^{\Omega} f(t)\,dt,$$

$$a_n = \frac{2}{\Omega}\int_0^{\Omega} f(t)\cos\frac{2\pi n t}{\Omega}\,dt, \qquad n = 1, 2, 3, \ldots$$

$$b_n = \frac{2}{\Omega}\int_0^{\Omega} f(t)\sin\frac{2\pi n t}{\Omega}\,dt, \qquad n = 1, 2, 3, \ldots$$
Thus the input signal is a superposition of waves of frequencies $1/\Omega, 2/\Omega, 3/\Omega, \ldots$. It is more convenient to write the above equations in complex notation, via Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$. The series can be rewritten as

$$c_0 + \sum_{n=1}^{\infty}\left\{ c_n e^{i 2\pi n t/\Omega} + c_{-n} e^{-i 2\pi n t/\Omega} \right\}$$

with complex Fourier coefficients

$$c_n = \frac{1}{\Omega}\int_0^{\Omega} f(t)\, e^{-i 2\pi n t/\Omega}\,dt, \qquad n = 0, \pm 1, \pm 2, \ldots$$

The relation between the two sets of Fourier constants is $c_n = (a_n - i b_n)/2$ and $c_{-n} = \overline{c_n}$.
3.1.1.1 FFTs
The audio signals of the piano notes discussed above were recorded digitally. The method of digitally computing Fourier spectra is referred to as the FFT (fast Fourier transform). An FFT provides an extremely efficient method for computing approximations to the Fourier series coefficients; these approximations, called DFTs (discrete Fourier transforms), are defined via Riemann sum approximations of the integrals described above. For a (large) positive integer $N$, let $\Delta t = \Omega/N$ and $t_k = k\,\Delta t$ for $k = 0, 1, 2, \ldots, N-1$. Then the $n$th Fourier coefficient $c_n$ is approximated as follows:

$$c_n \approx \frac{1}{\Omega}\sum_{k=0}^{N-1} f(t_k)\, e^{-i 2\pi n t_k/\Omega}\,\Delta t = \frac{1}{N}\sum_{k=0}^{N-1} f(t_k)\, e^{-i 2\pi n k/N} = F[n]$$

The quantity $F[n]$ is the DFT of the finite sequence of numbers $\{f(t_k)\}$. The spectra shown in the figures above were obtained via the DFT approximations $\{2|F[n]|^2\}_{n \ge 1}$.
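A short numerical check of this approximation (our own sketch, with assumed test values): for $f(t) = 3\cos(2\pi t/\Omega)$ we know $a_1 = 3$ and $b_1 = 0$, so $c_1 = (a_1 - i b_1)/2 = 1.5$, and numpy's FFT divided by $N$ reproduces it:

    import numpy as np

    Omega, N = 2.0, 1024
    t = np.arange(N) * Omega / N             # sample points t_k = k * dt
    f = 3 * np.cos(2 * np.pi * t / Omega)    # a_1 = 3, b_1 = 0, so c_1 = 1.5

    F = np.fft.fft(f) / N     # F[n] = (1/N) sum_k f(t_k) e^{-i 2 pi n k / N}
    print(np.round(F[1], 6))  # -> (1.5+0j), matching c_1
    print(np.round(F[2], 6))  # -> ~0, as expected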
3.1.2 Spectrograms
Fourier spectra are not as useful for analyzing several notes in a musical passage. For example, below is a graph of a recording of a piano playing the notes $E_4$, $F_4$, $G_4$ and $A_4$. The spectrum of this musical passage is shown below it; unlike the single-note case, it is not easy to assign fundamentals and overtones. One way of handling such a "mixed" signal is to compute spectrograms, which are a moving sequence of local spectra for the signal. In order to isolate the individual notes in the musical passage, the sound signal $f(t)$ is multiplied by a succession of time-windows $\{w(t - \tau_m)\}$, $m = 1, 2, \ldots, M$. Each window $w(t - \tau_m)$ is equal to 1 in a time interval $(\tau_m - \epsilon, \tau_m + \epsilon)$ centered at $\tau_m$ and decreases smoothly down to 0 for $t < \tau_{m-1} + \delta$ and $t > \tau_{m+1} - \delta$. These windows also satisfy

$$\sum_{m=1}^{M} w(t - \tau_m) = 1$$

over the time interval $[0, \Omega]$. Multiplying both sides by $f(t)$ we see that

$$f(t) = \sum_{m=1}^{M} f(t)\, w(t - \tau_m)$$

Thus the sound signal is the sum of the sub-signals
$f(t)\,w(t - \tau_m)$. Notice that the sub-signal $f(t)\,w(t - \tau_m)$ has a restricted domain of $[\tau_{m-1} + \delta, \tau_{m+1} - \delta]$. When an FFT is applied to the sequence $\{f(t_k)\,w(t_k - \tau_m)\}$ with points $t_k \in [\tau_{m-1}, \tau_{m+1}]$, it produces Fourier coefficients that are localized to the time interval $[\tau_{m-1} + \delta, \tau_{m+1} - \delta]$ for each $m$. This localization in time of the Fourier coefficients constitutes the spectrogram solution to the problem of separating the spectra of the individual notes in the musical passage.
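The following sketch (ours, assuming scipy is available; the two-note test signal and window sizes are stand-ins for the actual piano recording) carries out this moving-window FFT, the overlapping Hann windows playing the role of the $w(t - \tau_m)$:

    import numpy as np
    from scipy.signal import spectrogram

    fs = 8000
    t = np.arange(0, 0.5, 1 / fs)
    signal = np.concatenate([np.sin(2 * np.pi * 330 * t),   # 0.0-0.5 s: E4
                             np.sin(2 * np.pi * 440 * t)])  # 0.5-1.0 s: A4

    # Each column of Sxx is the local spectrum of one windowed sub-signal.
    freqs, times, Sxx = spectrogram(signal, fs=fs, window="hann",
                                    nperseg=512, noverlap=256)
    for m in (2, 20):              # one window inside each note
        peak = freqs[np.argmax(Sxx[:, m])]
        print(f"t = {times[m]:.2f} s: peak near {peak:.0f} Hz")

Each note shows up in its own local spectrum, which is exactly the time-localization described above.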
3.1.2.1 Spectrograms of Piano and Flute
The next figure shows a spectrogram for the sequence of piano notes $E_4$, $F_4$, $G_4$ and $A_4$. The sound signal is plotted at the bottom of the figure. Above the sound signal is a plot of the FFT spectra $\{2|F_m[n]|^2\}$, $m = 1, 2, \ldots, M$, obtained for the sub-signals. The vertical scale is a frequency scale (in Hz) and the horizontal scale is a time scale (in sec). It can be seen clearly in the figure that the spectra for the individual notes are well separated in time. Below is a similar plot for a flute.
Comparing these two spectrograms, there are clear differences between the attack and decay of the spectral line segments for the notes played by the two instruments. For the piano there is a very prominent attack, due to the striking of the piano hammer on its strings. There is also a longer decay for the piano notes, due to the slow damping down of the piano string vibrations, which is evident in the overlapping of the time intervals underlying each note's line segment.
3.1.3 Scalograms
Spectrograms display frequencies on a uniform scale, whereas musical scales such as the well-tempered scale are based on a logarithmic scale for frequencies.
Consider the figure below: the spectrogram of the note $E_4$ played on a guitar. In this spectrogram there are a number of spectral line segments crowded together at the lower end of the frequency scale. These correspond to the lower-frequency peaks in the Fourier spectrum for the note. The lower frequencies are integral divisors of some of the overtones of the note; they are called undertones and result from body-cavity resonances in the guitar.
A technique for mathematically "zooming in" on these lower frequencies is needed, and it is provided by the scalogram. The vertical scale on this scalogram consists of multiples of a base frequency of 80 Hz, viz., $80 \cdot 2^0 = 80$ Hz, $80 \cdot 2^1 = 160$ Hz, $80 \cdot 2^2 = 320$ Hz. This is a logarithmic scale of frequencies, in octaves, as in the well-tempered scale.
We compute the scalograms via a method known
as the continuous wavelet transform (CWT). The
CWT differs from the spectrogram approach in
that it does not use translations of a window of
fixed width. Instead it uses translations of
differently sized dilations of a window.
Scalograms are based on a discretization of the
CWT.
Given a function $g$, called the wavelet, the CWT is

$$W_g[f](\tau, s) = \frac{1}{s}\int_{-\infty}^{\infty} f(t)\,\overline{g\!\left(\frac{t - \tau}{s}\right)}\,dt$$
for scale $s > 0$ and time-translation $\tau$. If we assume that the sound signal $f(t)$ is non-zero only over the time interval $[0, \Omega]$, then the limits of the above equation change accordingly. As we did for the Fourier coefficients, we make a Riemann sum approximation to this integral using $t_m = m\,\Delta t$, with a uniform spacing $\Delta t = \Omega/N$; we also discretize the time variable as $\tau_k = k\,\Delta t$.
This yields
$$W_g[f](k\,\Delta t, s) \approx \frac{\Omega}{N}\,\frac{1}{s}\sum_{m=0}^{N-1} f(m\,\Delta t)\,\overline{g\!\left(\frac{(m - k)\,\Delta t}{s}\right)}$$
The above sum is a correlation of two discrete sequences. Given two $N$-point discrete sequences $\{f_k\}$ and $\{g_k\}$, their correlation $\{(f:g)_k\}$ is the sequence defined by

$$(f:g)_k = \sum_{m=0}^{N-1} f_m\, \overline{g_{m-k}}$$
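A rough numpy sketch of this discretized CWT (our own construction: the wavelet is a Gaussian-windowed complex exponential, and the test signal, window width w and base frequency f0 are all assumed values):

    import numpy as np

    fs = 4000
    t = np.arange(0, 1.0, 1 / fs)
    f = np.sin(2 * np.pi * 80 * t) + np.sin(2 * np.pi * 320 * t)

    def cwt_row(f, s, f0=320.0, w=0.01):
        """One scalogram row: correlate f with the conjugated, dilated wavelet."""
        u = np.arange(-0.1, 0.1, 1 / fs)      # support of the window
        g = np.exp(-(u / s) ** 2 / (2 * w ** 2)) * np.exp(2j * np.pi * f0 * u / s)
        return np.convolve(f, np.conj(g[::-1]), mode="same") / s

    for j in range(3):             # octave-spaced scales s = 2^0, 2^1, 2^2
        s = 2.0 ** j               # analysis frequency f0/s: 320, 160, 80 Hz
        print(f"s = {s}: mean |W| = {np.mean(np.abs(cwt_row(f, s))):.1f}")

The rows at $s = 1$ and $s = 4$ (320 Hz and 80 Hz) respond strongly while $s = 2$ (160 Hz) stays quiet, giving the octave-spaced frequency axis of the scalogram.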
4. Neural Networks
Consider a supervised learning problem where we have access to labelled training examples $(x^{(i)}, y^{(i)})$. Neural networks give a way of defining a complex, non-linear hypothesis $h_{W,b}(x)$, with parameters $W, b$ that we can fit to our data.
To describe neural networks, we will begin by
describing the simplest possible neural network,
one which comprises a single "neuron." We will
use the following diagram to denote a single
neuron:
This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a $+1$ intercept term), and outputs

$$h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right),$$

where $f: \mathbb{R} \to \mathbb{R}$ is called the activation function. In our logistic regression case we take $f(\cdot)$ to be the sigmoid function

$$f(z) = \frac{1}{1 + \exp(-z)}$$

Another choice is the hyperbolic tangent, $f(z) = \tanh(z)$.
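In code, the single neuron is just a weighted sum passed through the activation; a minimal sketch (the weight, bias and input values are arbitrary placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W = np.array([0.5, -0.3, 0.8])   # example weights (arbitrary values)
    b = 0.1                          # bias, the weight on the +1 intercept
    x = np.array([1.0, 2.0, 3.0])    # inputs x1, x2, x3

    h = sigmoid(W @ x + b)           # h_{W,b}(x) = f(sum_i W_i x_i + b)
    print(h)                         # a value in (0, 1)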
4.1 Neural Network Model
A neural network is put together by hooking
together many of our simple neurons, so that the
output of a neuron can be the input of another.
For example, here is a small neural network:
In this figure, we have used circles to also denote
the inputs to the network. The circles labeled
"+1" are called bias units, and correspond to the
intercept term. The leftmost layer of the network
is called the input layer, and the rightmost layer
the output layer (which, in this example, has
only one node). The middle layer of nodes is
called the hidden layer, because its values are
not observed in the training set. We also say that
our example neural network has 3 input units
(not counting the bias unit), 3 hidden units, and
1 output unit.
We will let $n_l$ denote the number of layers in our network; thus $n_l = 3$ in our example. We label layer $l$ as $L_l$, so layer $L_1$ is the input layer and layer $L_{n_l}$ the output layer. Our neural network has parameters $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$,
where we write $W_{ij}^{(l)}$ to denote the parameter (or weight) associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$. (Note the order of the indices.) Also, $b_i^{(l)}$ is the bias associated with unit $i$ in layer $l+1$. Here we have $W^{(1)} \in \mathbb{R}^{3\times 3}$ and $W^{(2)} \in \mathbb{R}^{1\times 3}$. Note that bias units don't have inputs or connections going into them, since they always output the value $+1$. We also let $s_l$ denote the number of nodes in layer $l$ (not counting the bias unit).
We will write $a_i^{(l)}$ to denote the activation (meaning output value) of unit $i$ in layer $l$. For layer $l = 1$, we also use $a_i^{(1)} = x_i$ to denote the $i$th input. Given a fixed setting of the parameters $W, b$, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number. Specifically, the computation that this neural network represents is given by:

$$a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})$$
$$a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})$$
$$a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})$$
$$h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})$$
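This forward computation is a pair of matrix-vector products; a minimal sketch for the 3-3-1 network (the random parameter values are placeholders, not trained weights):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.01, (3, 3)), np.zeros(3)   # W^(1) in R^{3x3}
    W2, b2 = rng.normal(0, 0.01, (1, 3)), np.zeros(1)   # W^(2) in R^{1x3}

    x = np.array([0.2, 0.7, 0.1])
    a2 = sigmoid(W1 @ x + b1)   # a^(2) = f(W^(1) x + b^(1)), hidden layer
    h = sigmoid(W2 @ a2 + b2)   # h_{W,b}(x) = a^(3) = f(W^(2) a^(2) + b^(2))
    print(h)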
4.2 Backpropagation Algorithm
Suppose we have a fixed training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $m$ training examples. We can train our neural network using batch gradient descent. In detail, for a single training example $(x, y)$, we define the cost function with respect to that single example to be:

$$J(W, b; x, y) = \frac{1}{2}\left\| h_{W,b}(x) - y \right\|^2$$
This is a (one-half) squared-error cost function.
Given a training set of 𝑚 examples, we then
define the overall cost function to be:
$$J(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)})\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$

$$= \left[\frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2$$
The first term in the definition of 𝐽(𝑊, 𝑏) is an
average sum-of-squares error term. The second
term is a regularization term (also called a
weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
The weight decay parameter 𝜆 controls the
relative importance of the two terms. Note also
the slightly overloaded notation: 𝐽(𝑊, 𝑏; 𝑥, 𝑦) is
the squared error cost with respect to a single
example; 𝐽(𝑊, 𝑏) is the overall cost function,
which includes the weight decay term.
The cost function above is often used both for classification and for regression problems. For classification, we let $y = 0$ or $1$ represent the two class labels (recall that the sigmoid activation function outputs values in $[0, 1]$; if we were using a tanh activation function, we would instead use $-1$ and $+1$ to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the $[0, 1]$ range (or, if we were using a tanh activation function, the $[-1, 1]$ range).
Our goal is to minimize $J(W, b)$ as a function of $W$ and $b$. To train our neural network, we will initialize each parameter $W_{ij}^{(l)}$ and each $b_i^{(l)}$ to a small random value near zero (say, according to a Normal$(0, \varepsilon^2)$ distribution for some small $\varepsilon$, say 0.01), and then apply an optimization algorithm such as batch gradient descent. Since $J(W, b)$ is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden-layer units will end up learning the same function of the input (more formally, $W_{ij}^{(l)}$ will be the same for all values of $i$, so that $a_1^{(2)} = a_2^{(2)} = a_3^{(2)} = \cdots$ for any input $x$). The random initialization serves the purpose of symmetry breaking.
One iteration of gradient descent updates the parameters $W, b$ as follows:

$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b)$$

$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b)$$
where $\alpha$ is the learning rate. The key step is computing the partial derivatives above. We will now describe the backpropagation algorithm, which gives an efficient way to compute these partial derivatives.
We will first describe how backpropagation can be used to compute $\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y)$ and $\frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y)$, the partial derivatives of the cost function $J(W, b; x, y)$ defined with respect to a single example $(x, y)$. Once we can compute these, the derivative of the overall cost function $J(W, b)$ can be computed as:
$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x^{(i)}, y^{(i)})\right] + \lambda W_{ij}^{(l)}$$

$$\frac{\partial}{\partial b_i^{(l)}} J(W, b) = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial b_i^{(l)}} J(W, b; x^{(i)}, y^{(i)})$$

The two lines above differ slightly because weight decay is applied to $W$ but not $b$.
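A hedged sketch of one backpropagation step for the 3-3-1 sigmoid network (our own numpy code following the equations above; the example, learning rate and weight decay values are assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.01, (3, 3)), np.zeros(3)
    W2, b2 = rng.normal(0, 0.01, (1, 3)), np.zeros(1)
    alpha, lam = 0.1, 1e-4          # learning rate and weight decay (assumed)
    x, y = np.array([0.2, 0.7, 0.1]), np.array([1.0])

    a2 = sigmoid(W1 @ x + b1)       # forward pass
    a3 = sigmoid(W2 @ a2 + b2)

    # Backward pass; the sigmoid satisfies f'(z) = f(z)(1 - f(z)).
    d3 = (a3 - y) * a3 * (1 - a3)             # output-layer error
    d2 = (W2.T @ d3) * a2 * (1 - a2)          # propagated to the hidden layer

    # Gradients of J(W,b;x,y); weight decay applies to W but not b.
    gW2, gb2 = np.outer(d3, a2) + lam * W2, d3
    gW1, gb1 = np.outer(d2, x) + lam * W1, d2

    W2, b2 = W2 - alpha * gW2, b2 - alpha * gb2   # one gradient descent step
    W1, b1 = W1 - alpha * gW1, b1 - alpha * gb1
    a3_new = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
    print(0.5 * np.sum((a3_new - y) ** 2))        # squared error after the step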
4.3 Autoencoders and sparsity
So far, we have described the application of neural networks to supervised learning, in which we have labelled training examples. Now suppose we have only a set of unlabelled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in \mathbb{R}^n$. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs, i.e., it uses $y^{(i)} = x^{(i)}$.
Here is an autoencoder:
The autoencoder tries to learn a function $h_{W,b}(x) \approx x$. In other words, it is trying to learn an approximation to the identity function, so as to output an $\hat{x}$ that is similar to $x$. The identity function seems a particularly trivial function to try to learn; but by placing constraints on the network, such as limiting the number of hidden units, we can discover interesting structure in the data. If the input were completely random---say, each $x_i$ came from an IID Gaussian independent of the other features---then this compression task would be very difficult. But if there is structure in the data, for example if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to that of PCA.
Our argument above relied on the number of hidden units $s_2$ being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.
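To close the section, here is a rough numpy sketch of a sparse autoencoder (tiny made-up sizes and data; our own code following the formulation above, with a KL-divergence penalty of weight beta pushing each hidden unit's average activation rho_hat toward a small target rho):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n, s2, m = 8, 4, 100              # input size, hidden units, examples
    X = rng.random((m, n))            # made-up unlabelled examples in [0, 1]
    W1, b1 = rng.normal(0, 0.01, (s2, n)), np.zeros(s2)
    W2, b2 = rng.normal(0, 0.01, (n, s2)), np.zeros(n)
    alpha, beta, rho = 0.5, 0.1, 0.05 # step size, sparsity weight, target

    for _ in range(500):
        A2 = sigmoid(X @ W1.T + b1)   # hidden activations, shape (m, s2)
        A3 = sigmoid(A2 @ W2.T + b2)  # reconstructions x_hat; targets y = x
        rho_hat = A2.mean(axis=0)     # average activation per hidden unit

        D3 = (A3 - X) * A3 * (1 - A3) # output-layer error terms
        # Hidden-layer error with the KL-divergence sparsity term added in.
        sparse = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        D2 = (D3 @ W2 + sparse) * A2 * (1 - A2)

        W2 -= alpha * (D3.T @ A2) / m
        b2 -= alpha * D3.mean(axis=0)
        W1 -= alpha * (D2.T @ X) / m
        b1 -= alpha * D2.mean(axis=0)

    print("mean hidden activation:", A2.mean().round(3))    # pulled toward rho
    print("reconstruction MSE:", np.mean((A3 - X) ** 2).round(4))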
5. Observations and conclusions
The first part of the objective of this project, i.e. the identification of the instrument, was successfully achieved with the three techniques described above, the scalogram method being the most conclusive. The chosen time-frequency representation is fed as the feature used to train the autoencoder, which learns a hypothesis that identifies each instrument individually and gives the individual instruments as outputs. An instrument-pass filter is to be built in order to extract the instruments separately. Once an individual instrument is extracted, its notes can be identified from the corresponding spectrograms.
References
1. Uhle, Dittmar and Sporer. Extraction of drum tracks from polyphonic music using independent subspace analysis. 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), April 2003, Nara, Japan.
2. Endelt and Harbo. Time frequency distribution of music based on sparse wavelet packet representations. SPAR05.
3. Elver and Akan. Recognition of musical instruments using time-frequency analysis.
4. Niva Das. ICA methods for BSS of instantaneous mixtures: A case study. Neural Information Processing - Letters and Reviews, Vol. 11, No. 11, Nov 2007.
5. Naik and Kumar. An overview of ICA and its applications. Informatica 35: 63-81, 2011.
6. Fujinaga. Machine recognition of timbre using steady-state tone of acoustic instruments.
7. John M. Barry. Polyphonic music transcription using independent component analysis.
8. Jeremy F. Alm and James S. Walker. Time frequency analysis of musical instruments. SIAM Review, Vol. 44, No. 3, Society for Industrial and Applied Mathematics.
Acknowledgements
I would like to thank my guide Dr. Setlur and my co-guide Dr. Sethi for their guidance, support and patience. I would also like to acknowledge Dr. Tristan Jehan, whose master's and Ph.D. theses have immensely helped me in learning about music in its true essence, and Dr. Andrew Ng, Stanford University and www.coursera.org for helping me understand prerequisites like machine learning. I would also like to mention the UFLDL Tutorial by Stanford, which guided me through the working of autoencoders. Finally I would like to thank the creators of Matlab and Octave and the owners of the Wikipedia and HyperPhysics webpages.