B. Tech Project Report First Phase
Polyphonic music transcription using Machine learning techniques
Submitted in partial fulfillment of requirements
For the award of the degree of Bachelor of Technology from
Indian Institute of Technology, Guwahati
Under the supervision of
Associate Professor Girish Sampath Setlur
Assistant Professor Amit Sethi
Submitted by-
Lalit Pradhan
10012119
November 15, 2013
Department of Physics
Indian Institute of Technology Guwahati
Guwahati 781039, Assam, INDIA
Certificate
This is to certify that the work presented in the report entitled “Polyphonic music transcription using
machine learning techniques” by Lalit Pradhan, 10012119, represents an original work under the
guidance of Associate Professor Girish Sampath Setlur and Assistant Professor Amit Sethi. This study
has not been submitted elsewhere for a degree.
Signature of student:
Date:
Place: Lalit Pradhan, 10012119
Signature of supervisor I
Date:
Place: Associate Professor Girish S Setlur
Signature of supervisor II
Date:
Place: Assistant Professor Amit Sethi
Signature of Examiner
Date:
Place:
Abstract
In this project report we present a method for recognition of timbre, identification of the
different musical instruments, and automatic transcription of these individual instruments.
We introduce time-frequency distributions for the analysis of musical instruments. The
method presented in the earlier semester uses Independent Component Analysis (hereafter
ICA), a special case of Blind Source Separation (hereafter BSS), to solve the issue at hand,
and we studied how the ICA method works. In this term we also introduce a more
efficient algorithm for the aforementioned problem via the use of a reduced-dimensional
autoencoder for identification of timbre. This report describes several approaches to
analyzing the frequency or pitch content of the sound produced by musical instruments,
using methods such as Fourier analysis, spectrograms and scalograms. Scalograms allow
one to zoom in on selected regions of the time-frequency plane in a more flexible manner
than is possible with spectrograms. These time-frequency portraits correlate well with our
perception of the sounds produced by these instruments and of the differences between
them. This property will be used as a feature for the sparse autoencoders in training the
machine to recognize each instrument and extract individual instrument notes. The
training sets are fed to a learning algorithm, which forms a hypothesis function that will in
turn recognize any new input and give the appropriate output, i.e. in our case identify the
appropriate instrument.
Keywords
Blind Source Separation, Independent Component Analysis, Identification of Timbre,
Sparse Autoencoders, Spectrograms, Scalograms.
Table of Contents
1. Introduction
2. Objective
3. Literature Research and user study
   3.1 Identification of the instruments
       3.1.1 Fourier Series
             3.1.1.1 FFTs
       3.1.2 Spectrograms
             3.1.2.1 Spectrograms of Piano and Flute
       3.1.3 Scalograms
4. Neural Networks
   4.1 Neural Network Model
   4.2 Backpropagation Algorithm
   4.3 Autoencoders and Sparsity
5. Observations and Conclusions
References
Acknowledgements
1. Introduction
A typical musical piece consists of several
instruments playing the same or different notes
at different pitches, loudness and intensity. We
will try to develop a computer-based model,
trained on a set of examples, that identifies the
instruments and separates the sources.
Separation of the individual instruments is an
essential prerequisite for automatic music
transcription, which amounts to identifying
which instruments are playing, and when and for
how long each note is played.
The musical concepts here are fundamentals
and overtones. Mathematically, these concepts
are described via Fourier coefficients and
their role in producing sounds is modeled by
Fourier series. Although Fourier series are an
essential tool, they do have limitations; in
particular, they are not effective at capturing
abrupt changes in the frequency content of
sounds. These abrupt changes occur, for
instance, in transitions between individual
notes. Hence we will describe a modern
method of time frequency analysis, known as
spectrogram, which better handles changes in
frequency content over time. They provide a
type of “fingerprint” of sounds from various
instruments. These fingerprints allow us to
distinguish one instrument from another.
While spectrograms are a fine tool for many
situations, they are not closely correlated
with the frequencies (pitches) typically found
on musical scales, and there are cases where
this leads to problems. Hence we describe a
method of time-frequency analysis, known as
scalograms, which does correlate well with
music scale frequencies. Scalograms yield a
powerful new approach, based on the
mathematical theory of wavelets, which will
solve problems lying beyond the scope of
either Fourier series or spectrograms.
The situation at hand is a logistic classification
problem in the world of machine learning. We
will study learning algorithms such as gradient
descent and Newton's method to minimize the
appropriate cost function. The autoencoder tries
to learn the hypothesis function so that it can
identify the individual inputs (instruments) and
reproduce the input function (the identity
function).
2. Objective
The objective of this report is to understand
Fourier spectra, spectrograms and scalograms, to
understand the working of sparse autoencoders,
and to use the aforementioned time-frequency
analyses as features for learning the hypothesis
function from the training input sets for the
autoencoders.
3. Literature Research and user study
The basic idea is to teach a computer-based
model to identify each instrument. After
identification the machine separates the
individual instruments and then transcribes the
notes being played. The machine is made to learn
the different attributes of an instrument. This
uses an exemplar-based learning method, where
the machine learns to identify an instrument
after being trained with a number of training
examples. Once the instrument is identified,
autoencoders separate the individual
instruments.
3.1 Identification of the instruments
To understand the relation between pitch and
frequency, we demonstrate the sound from a
tuning fork recorded with an oscilloscope
attached to a microphone. This will produce a
graph similar to the one shown below.
The above graph was created by plotting the
function $100\sin(2\pi\nu t)$, a sinusoid of frequency
$\nu = 440$ cycles/sec (Hz), a tone identical to that of
a tuning fork of pitch $A_4$ on the well-tempered
scale. A pure tone with a single pitch is thus
associated with a single frequency, in this
case 440 Hz. The next figure shows the
Fourier spectrum of the sinusoid; the single peak
at 440 Hz is clearly evident.
The formulas used to generate this Fourier
spectrum will be discussed below.
Unlike tuning forks, sounds from musical
instruments are time-evolving superpositions
of several pure tones, or sinusoidal waves. For
example, in the next figure we show the
Fourier spectrum of the piano note $E_4$, with a
base frequency of 330 Hz. In this
spectrum, there are peaks located at the
(approximate) frequencies 330 Hz, 660 Hz,
990 Hz, 1320 Hz and 1650 Hz. Notice that
these frequencies are all integral multiples of
the base frequency, 330 Hz, called the
fundamental. The integral multiples of this
fundamental are called overtones.
3.1.1 Fourier Series
The classic mathematical theory for
describing musical notes is that of Fourier
series. Given a sound signal 𝑓(𝑡) (such as a
musical note or chord) defined on the
interval [0, Ω], its Fourier series is
$$c_0 + \sum_{n=1}^{\infty}\left\{ a_n \cos\frac{2\pi n t}{\Omega} + b_n \sin\frac{2\pi n t}{\Omega} \right\}$$
with Fourier constants $c_0$, $a_n$ and $b_n$ defined by
$$c_0 = \frac{1}{\Omega}\int_0^{\Omega} f(t)\,dt$$
$$a_n = \frac{2}{\Omega}\int_0^{\Omega} f(t)\cos\frac{2\pi n t}{\Omega}\,dt, \qquad n = 1, 2, 3, \ldots$$
$$b_n = \frac{2}{\Omega}\int_0^{\Omega} f(t)\sin\frac{2\pi n t}{\Omega}\,dt, \qquad n = 1, 2, 3, \ldots$$
Thus the input signal is a superposition of waves of frequencies $1/\Omega, 2/\Omega, 3/\Omega, \ldots$.
It is more convenient to write the above equations using complex notation, via
Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$. The series can then be rewritten as
$$c_0 + \sum_{n=1}^{\infty}\left\{ c_n e^{i 2\pi n t/\Omega} + c_{-n} e^{-i 2\pi n t/\Omega} \right\}$$
with complex Fourier coefficients
$$c_n = \frac{1}{\Omega}\int_0^{\Omega} f(t)\, e^{-i 2\pi n t/\Omega}\,dt, \qquad n = 0, \pm 1, \pm 2, \ldots$$
The relation between the two sets of Fourier constants is $c_n = (a_n - i b_n)/2$ and $c_{-n} = \overline{c_n}$.
3.1.1.1 FFTs
The audio signals of piano notes discussed above
were recorded digitally. The method of digitally
computing Fourier spectra is referred to as the FFT
(fast Fourier transform). An FFT provides an
extremely efficient method for computing
approximations to Fourier series coefficients;
these approximations are called DFTs (discrete
Fourier transforms) and are defined via Riemann sum
approximations of the integrals described above.
For a (large) positive integer $N$, let $t_k = k\Omega/N$
for $k = 0, 1, 2, \ldots, N-1$, and let $\Delta t = \Omega/N$. Then the
$n$th Fourier coefficient $c_n$ is approximated as
follows:
$$c_n \approx \frac{1}{\Omega}\sum_{k=0}^{N-1} f(t_k)\, e^{-i 2\pi n t_k/\Omega}\,\Delta t = \frac{1}{N}\sum_{k=0}^{N-1} f(t_k)\, e^{-i 2\pi n k/N} = F[n]$$
The above quantity is the DFT of the finite
sequence of numbers $\{f(t_k)\}$. The spectra shown
in the figures above were obtained via the DFT
approximations $\{2|F[n]|^2\}_{n \ge 1}$.
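As a minimal sketch (not part of the original report), the following Python/NumPy snippet generates a 330 Hz test tone, computes $F[n]$ directly from the Riemann-sum definition above, and confirms that it matches the normalized output of a library FFT; the sampling rate and signal are illustrative assumptions.

    import numpy as np

    # Sketch: DFT F[n] = (1/N) * sum_k f(t_k) e^{-i 2 pi n k / N} for a toy tone.
    fs, Omega = 8000, 1.0              # assumed sampling rate and interval length
    N = int(fs * Omega)
    t_k = np.arange(N) * Omega / N
    f = np.sin(2 * np.pi * 330 * t_k)  # toy "note" at 330 Hz

    F = np.fft.fft(f) / N              # fast computation of F[n]
    n = 330                            # coefficient n corresponds to n/Omega Hz
    F_direct = np.sum(f * np.exp(-2j * np.pi * n * np.arange(N) / N)) / N
    print(np.allclose(F[n], F_direct))         # True: FFT matches the definition

    spectrum = 2 * np.abs(F[1:N // 2]) ** 2    # {2|F[n]|^2}, n >= 1
    print((np.argmax(spectrum) + 1) / Omega)   # spectral peak at ~330 Hz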
3.1.2 Spectrograms
Fourier spectra are not as useful for
analyzing several notes in a musical passage. For
example, below is a graph of a recording of a
piano playing the notes $E_4$, $F_4$, $G_4$ and $A_4$.
The spectrum of this musical passage is
shown below. Unlike the single-note case, it is
not easy to assign fundamentals and overtones.
One way of handling this mixing of the signal
is to compute spectrograms, which are
a moving sequence of local spectra for the signal.
In order to isolate the individual notes in the
musical passage, the sound signal $f(t)$ is
multiplied by a succession of time-windows
$\{w(t - \tau_m)\}$, $m = 1, 2, \ldots, M$. Each window $w(t - \tau_m)$
is equal to 1 in a time interval $(\tau_m - \epsilon, \tau_m + \epsilon)$
centered at $\tau_m$ and decreases smoothly down to 0
for $t < \tau_{m-1} + \delta$ and $t > \tau_{m+1} - \delta$. These windows
also satisfy
$$\sum_{m=1}^{M} w(t - \tau_m) = 1$$
over the time interval $[0, \Omega]$. Multiplying both
sides by $f(t)$ we see that
$$f(t) = \sum_{m=1}^{M} f(t)\, w(t - \tau_m)$$
Thus the sound signal is the sum of the sub-signals
$f(t)\, w(t - \tau_m)$. Notice that the sub-signal
$f(t)\, w(t - \tau_m)$ has a restricted
domain of $[\tau_{m-1} + \delta, \tau_{m+1} - \delta]$. When an FFT is
applied to the sequence $\{f(t_k)\, w(t_k - \tau_m)\}$ with
points $t_k \in [\tau_{m-1}, \tau_{m+1}]$, it produces
Fourier coefficients that are localized to the time
interval $[\tau_{m-1} + \delta, \tau_{m+1} - \delta]$ for each $m$. This
localization in time of the Fourier coefficients
constitutes the spectrogram solution to the
problem of separating the spectra of the
individual notes in the musical passage.
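As an illustrative sketch (not from the report), the following Python snippet applies a succession of overlapping windows to a toy two-note passage and takes an FFT of each windowed sub-signal; the Hann window, window length and hop size are assumed choices.

    import numpy as np

    # Sketch of a spectrogram: local FFTs of windowed sub-signals f(t) w(t - tau_m).
    fs = 8000
    t = np.arange(0, 2.0, 1.0 / fs)
    # toy "passage": a 330 Hz tone for one second, then a 349 Hz tone
    f = np.where(t < 1.0, np.sin(2 * np.pi * 330 * t), np.sin(2 * np.pi * 349 * t))

    win_len, hop = 1024, 512
    window = np.hanning(win_len)                          # assumed smooth window
    spectra = []
    for start in range(0, len(f) - win_len + 1, hop):
        segment = f[start:start + win_len] * window       # f(t) w(t - tau_m)
        spectra.append(np.abs(np.fft.rfft(segment)) ** 2) # local spectrum
    spectrogram = np.array(spectra)   # rows: time windows, columns: frequencies
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    print(freqs[np.argmax(spectrogram[0])], freqs[np.argmax(spectrogram[-1])])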
3.1.2.1 Spectrograms of Piano and Flute
The next figure shows a spectrogram for the
sequence of piano notes $E_4$, $F_4$, $G_4$ and $A_4$. The
sound signal is plotted at the bottom of the
figure.
Above the sound signal is a plot of the FFT
spectra $\{2|F_m[n]|^2\}$, $m = 1, 2, \ldots, M$, obtained for the
sub-signals. The vertical scale is a frequency scale
(in Hz) and the horizontal scale is a time scale (in
sec). It can be seen in the figure that the
spectra for the individual notes are clearly separated
in time. Below is a similar plot for the flute.
Comparing these two spectrograms, there are
clear differences between attack and decay of the
spectral line segments for the notes played by the
two instruments. For the piano there is a very
prominent attack, due to the striking of the
piano hammer on its strings. There is also a
longer decay for the piano notes, due to the slow
damping down of the piano string vibrations,
which is evident in the overlapping of the time
intervals underlying each note's line segment.
3.1.3 Scalograms
Spectrograms display frequencies on a uniform
scale, whereas musical scales such as the
well-tempered scale are based on a logarithmic
scale for frequencies.
Consider the figure below.
It shows the spectrogram of the note $E_4$ played on a
guitar. In this spectrogram there are a number of
spectral line segments crowded together at the
lower end of the frequency scale. These
correspond to the lower-frequency peaks in the
Fourier spectrum for the note.
The lower frequencies are integral divisors of
some of the overtones of the note and are called
undertones, resulting from body-cavity
resonances in the guitar.
A technique for mathematically "zooming
in" on these lower frequencies is needed; it is
provided by the scalogram.
The vertical scale on this scalogram consists of
multiples of a base frequency of 80 Hz, viz.
$80 \cdot 2^0 = 80$ Hz, $80 \cdot 2^1 = 160$ Hz, $80 \cdot 2^2 = 320$ Hz.
This is a logarithmic scale of frequencies, in octaves,
as in the well-tempered scale.
We compute the scalograms via a method known
as the continuous wavelet transform (CWT). The
CWT differs from the spectrogram approach in
that it does not use translations of a window of
fixed width. Instead it uses translations of
differently sized dilations of a window.
Scalograms are based on a discretization of the
CWT.
Given a function $g$, called the wavelet, the CWT is
$$W_g[f](\tau, s) = \frac{1}{s}\int_{-\infty}^{\infty} f(t)\,\overline{g\!\left(\frac{t - \tau}{s}\right)}\,dt$$
for scale $s > 0$ and time-translation $\tau$. If we
assume that the sound signal $f(t)$ is non-zero
only over the time interval $[0, \Omega]$, then the limits
of the above equation change accordingly.
As we did for the Fourier coefficients, we make a
Riemann sum approximation to this integral
using $t_m = m\Delta t$, with a uniform spacing $\Delta t = \Omega/N$;
we also discretize the time variable as $\tau_k = k\Delta t$.
This yields
$$W_g[f](k\Delta t, s) \approx \frac{\Omega}{N}\,\frac{1}{s}\sum_{m=0}^{N-1} f(m\Delta t)\,\overline{g\!\left(\frac{(m - k)\,\Delta t}{s}\right)}$$
The above sum is a correlation of two discrete
sequences. Given two $N$-point discrete sequences
$\{f_k\}$ and $\{g_k\}$, their correlation $\{(f : g)_k\}$ is the
sequence defined by
$$(f : g)_k = \sum_{m=0}^{N-1} f_m\, \overline{g_{m-k}}$$
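The following Python sketch (an illustration under assumptions, not taken from the report) evaluates the discretized CWT above for a toy tone; the Gaussian-windowed complex sinusoid used as the wavelet $g$ and the dyadic scales are assumed choices, since the report does not specify a particular wavelet.

    import numpy as np

    # Sketch of the discretized CWT:
    # W_g[f](k*dt, s) ~ (1/s) * sum_m f(m*dt) * conj(g((m - k)*dt / s)) * dt
    fs = 4000
    t = np.arange(0, 1.0, 1.0 / fs)
    dt = 1.0 / fs
    f = np.sin(2 * np.pi * 160 * t)          # toy signal: a 160 Hz tone

    def g(u, base_freq=80.0, width=0.02):
        # assumed wavelet: complex sinusoid at the base frequency, localized by a Gaussian
        return np.exp(2j * np.pi * base_freq * u) * np.exp(-u ** 2 / (2 * width ** 2))

    scales = [1.0, 0.5, 0.25]                # octaves: 80 Hz, 160 Hz, 320 Hz
    taus = t[::200]                          # coarse grid of time translations
    cwt = np.empty((len(scales), len(taus)), dtype=complex)
    for i, s in enumerate(scales):
        for j, tau in enumerate(taus):
            cwt[i, j] = (1.0 / s) * np.sum(f * np.conj(g((t - tau) / s))) * dt
    print(np.abs(cwt).mean(axis=1))          # strongest response at s = 0.5 (160 Hz)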
4. Neural Networks
Consider a supervised learning problem where
we have access to labelled training examples
$(x^{(i)}, y^{(i)})$. Neural networks give a way of defining
a complex, non-linear hypothesis $h_{W,b}(x)$,
with parameters $W, b$ that we can fit to our data.
To describe neural networks, we will begin by
describing the simplest possible neural network,
one which comprises a single "neuron." We will
use the following diagram to denote a single
neuron:
This "neuron" is a computational unit that takes
as inputs $x_1, x_2, x_3$ (and a +1 as the intercept
term), and outputs
$$h_{W,b}(x) = f(W^T x) = f\!\left(\sum_{i=1}^{3} W_i x_i + b\right),$$
where $f : \Re \rightarrow \Re$ is called the activation function.
In our logistic-regression setting we take $f(\cdot)$ to
be the sigmoid function:
$$f(z) = \frac{1}{1 + \exp(-z)}$$
Another choice is $f(z) = \tanh(z)$.
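As a minimal sketch (with illustrative weights, not values from the report), the single-neuron computation above can be written as:

    import numpy as np

    # Sketch of one "neuron": h_{W,b}(x) = f(sum_i W_i x_i + b) with sigmoid f.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W = np.array([0.5, -0.3, 0.8])   # one weight per input x1, x2, x3 (illustrative)
    b = 0.1                          # weight on the +1 intercept term (illustrative)
    x = np.array([1.0, 2.0, 0.5])

    print(sigmoid(W @ x + b))        # the neuron's output h_{W,b}(x)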
4.1 Neural Network Model
A neural network is put together by hooking
together many of our simple neurons, so that the
output of a neuron can be the input of another.
For example here is a small neural network:
In this figure, we have used circles to also denote
the inputs to the network. The circles labeled
"+1" are called bias units, and correspond to the
intercept term. The leftmost layer of the network
is called the input layer, and the rightmost layer
the output layer (which, in this example, has
only one node). The middle layer of nodes is
called the hidden layer, because its values are
not observed in the training set. We also say that
our example neural network has 3 input units
(not counting the bias unit), 3 hidden units, and
1 output unit.
We will let $n_l$ denote the number of layers in our
network; thus $n_l = 3$ in our example. We label
layer $l$ as $L_l$, so layer $L_1$ is the input layer and
layer $L_{n_l}$ the output layer. Our neural network
has parameters $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$,
where we write $W_{ij}^{(l)}$ to denote the parameter (or
weight) associated with the connection between
unit $j$ in layer $l$ and unit $i$ in layer $l+1$. (Note
the order of the indices.) Also, $b_i^{(l)}$ is the bias
associated with unit $i$ in layer $l+1$. We have
$W^{(1)} \in \Re^{3 \times 3}$ and $W^{(2)} \in \Re^{1 \times 3}$. Note that bias units
don't have inputs or connections going into them,
since they always output the value +1. We also
let $s_l$ denote the number of nodes in layer $l$ (not
counting the bias unit).
We will write $a_i^{(l)}$ to denote the activation
(meaning output value) of unit $i$ in layer $l$. For
layer $l = 1$, we also use $a_i^{(1)} = x_i$ to denote the $i$th
input. Given a fixed setting of the parameters
$W, b$, our neural network defines a hypothesis
$h_{W,b}(x)$ that outputs a real number. Specifically,
the computation that this neural network
represents is given by:
$$a_1^{(2)} = f\big(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\big)$$
$$a_2^{(2)} = f\big(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\big)$$
$$a_3^{(2)} = f\big(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\big)$$
$$h_{W,b}(x) = a_1^{(3)} = f\big(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big)$$
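The forward pass written out above can be sketched in a few lines of Python; the small random parameters of the 3-3-1 network ($W^{(1)} \in \Re^{3\times 3}$, $W^{(2)} \in \Re^{1\times 3}$) are illustrative, not values from the report.

    import numpy as np

    # Sketch of the forward pass for the 3-3-1 example network above.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.01, (3, 3)), np.zeros(3)   # layer 1 -> 2 parameters
    W2, b2 = rng.normal(0, 0.01, (1, 3)), np.zeros(1)   # layer 2 -> 3 parameters

    x = np.array([0.2, -0.5, 1.0])    # an illustrative input
    a2 = sigmoid(W1 @ x + b1)         # hidden-layer activations a^(2)
    h = sigmoid(W2 @ a2 + b2)         # output h_{W,b}(x) = a^(3)
    print(h)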
4.2 Backpropagation Algorithm
Suppose we have a fixed training set
$\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$ of $m$ training
examples. We can train our neural network using
batch gradient descent. In detail, for a single
training example $(x, y)$, we define the cost
function with respect to that single example to
be:
$$J(W, b; x, y) = \frac{1}{2}\,\big\|h_{W,b}(x) - y\big\|^2$$
This is a (one-half) squared-error cost function.
Given a training set of 𝑚 examples, we then
define the overall cost function to be:
$$J(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} J\big(W, b; x^{(i)}, y^{(i)}\big)\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$$
$$\phantom{J(W, b)} = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\big\|h_{W,b}(x^{(i)}) - y^{(i)}\big\|^2\right] + \frac{\lambda}{2}\sum_{l=1}^{n_l - 1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$$
The first term in the definition of 𝐽(𝑊, 𝑏) is an
average sum-of-squares error term. The second
term is a regularization term (also called a
weight decay term) that tends to decrease the
magnitude of the weights, and helps prevent
overfitting.
The weight decay parameter 𝜆 controls the
relative importance of the two terms. Note also
the slightly overloaded notation: 𝐽(𝑊, 𝑏; 𝑥, 𝑦) is
the squared error cost with respect to a single
example; 𝐽(𝑊, 𝑏) is the overall cost function,
which includes the weight decay term.
The cost function above is often used both for
classification and for regression problems. For
classification, we let $y = 0$ or $1$ represent the
two class labels (recall that the sigmoid
activation function outputs values in $[0, 1]$; if we
were using a tanh activation function, we would
instead use -1 and +1 to denote the labels). For
regression problems, we first scale our outputs to
ensure that they lie in the [0,1] range (or if we
were using a tanh activation function, then the
$[-1, 1]$ range).
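A small sketch of the overall cost $J(W, b)$ above (with an illustrative network and data, not from the report) makes the two terms, the average squared error and the weight decay, explicit:

    import numpy as np

    # Sketch: J(W, b) = average one-half squared error + (lambda/2) * sum of squared weights.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(W1, b1, W2, b2, X, Y, lam):
        errs = [0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) - y) ** 2)
                for x, y in zip(X, Y)]
        weight_decay = (lam / 2.0) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
        return np.mean(errs) + weight_decay

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.01, (3, 3)), np.zeros(3)
    W2, b2 = rng.normal(0, 0.01, (1, 3)), np.zeros(1)
    X, Y = rng.normal(size=(5, 3)), rng.uniform(size=(5, 1))   # m = 5 toy examples
    print(cost(W1, b1, W2, b2, X, Y, lam=1e-4))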
Our goal is to minimize $J(W, b)$ as a function
of $W$ and $b$. To train our neural network, we
will initialize each parameter $W_{ij}^{(l)}$ and each
$b_i^{(l)}$ to a small random value near zero (say
according to a $\mathrm{Normal}(0, \varepsilon^2)$ distribution for
some small $\varepsilon$, say 0.01), and then apply an
optimization algorithm such as batch
gradient descent. Since $J(W, b)$ is a non-convex
function, gradient descent is
susceptible to local optima; however, in
practice gradient descent usually works
fairly well. Finally, note that it is important
to initialize the parameters randomly, rather
than to all 0's. If all the parameters start off
at identical values, then all the hidden-layer
units will end up learning the same function
of the input (more formally, $W_{ij}^{(l)}$ will be the
same for all values of $i$, so that
$a_1^{(2)} = a_2^{(2)} = a_3^{(2)} = \cdots$ for any input $x$). The random
initialization serves the purpose of symmetry
breaking.
One iteration of gradient descent updates the
parameters 𝑊, 𝑏 as follows:
$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha\,\frac{\partial}{\partial W_{ij}^{(l)}}\, J(W, b)$$
$$b_i^{(l)} = b_i^{(l)} - \alpha\,\frac{\partial}{\partial b_i^{(l)}}\, J(W, b)$$
where $\alpha$ is the learning rate. The key step is
computing the partial derivatives above. We will
now describe the backpropagation algorithm,
which gives an efficient way to compute these
partial derivatives.
We will first describe how backpropagation can
be used to compute $\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b; x, y)$ and
$\frac{\partial}{\partial b_i^{(l)}} J(W, b; x, y)$, the partial derivatives of the
cost function $J(W, b; x, y)$ defined with respect to a
single example $(x, y)$. Once we can compute
these, we see that the derivatives of the overall
cost function $J(W, b)$ can be computed as:
$$\frac{\partial}{\partial W_{ij}^{(l)}}\, J(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial W_{ij}^{(l)}}\, J\big(W, b; x^{(i)}, y^{(i)}\big)\right] + \lambda\, W_{ij}^{(l)}$$
$$\frac{\partial}{\partial b_i^{(l)}}\, J(W, b) = \left[\frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial b_i^{(l)}}\, J\big(W, b; x^{(i)}, y^{(i)}\big)\right]$$
The two lines above differ slightly because
weight decay is applied to $W$ but not to $b$.
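As an assumed implementation sketch of the equations above (the report itself gives no code), the following snippet runs one batch gradient-descent iteration for the 3-3-1 sigmoid network, accumulating per-example gradients by backpropagation and adding the weight-decay term for $W$:

    import numpy as np

    # Sketch: one batch gradient-descent step with backpropagated gradients.
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.01, (3, 3)), np.zeros(3)
    W2, b2 = rng.normal(0, 0.01, (1, 3)), np.zeros(1)
    alpha, lam = 0.1, 1e-4                     # learning rate and weight decay

    X = rng.normal(size=(5, 3))                # m = 5 toy training examples
    Y = rng.uniform(size=(5, 1))               # toy targets in [0, 1]

    gW1, gb1 = np.zeros_like(W1), np.zeros_like(b1)
    gW2, gb2 = np.zeros_like(W2), np.zeros_like(b2)
    for x, y in zip(X, Y):
        a2 = sigmoid(W1 @ x + b1)              # forward pass
        a3 = sigmoid(W2 @ a2 + b2)
        d3 = (a3 - y) * a3 * (1 - a3)          # output-layer error term
        d2 = (W2.T @ d3) * a2 * (1 - a2)       # error propagated back to the hidden layer
        gW2 += np.outer(d3, a2); gb2 += d3     # per-example gradient contributions
        gW1 += np.outer(d2, x);  gb1 += d2

    m = len(X)
    W1 -= alpha * (gW1 / m + lam * W1); b1 -= alpha * gb1 / m
    W2 -= alpha * (gW2 / m + lam * W2); b2 -= alpha * gb2 / m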
4.3 Autoencoders and sparsity
So far, we have described the application of
neural networks to supervised learning, in which
we have labelled training examples. Now
suppose we have only a set of unlabelled training
examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in \Re^n$. An
autoencoder neural network is an unsupervised
learning algorithm that applies backpropagation,
setting the target values to be equal to the
inputs, i.e., it uses $y^{(i)} = x^{(i)}$.
Here is an autoencoder:
The autoencoder tries to learn a
function $h_{W,b}(x) \approx x$. In other words, it is trying
to learn an approximation to the identity
function, so as to output an $\hat{x}$ that is similar to $x$.
The identity function seems a particularly trivial
function to try to learn; but by placing
constraints on the network, such as limiting
the number of hidden units, we can discover
interesting structure in the data. If the input
were completely random (say, each $x_i$ came from
an IID Gaussian independent of the other
features), then this compression task would be
very difficult. But if there is structure in the
data, for example if some of the input features
are correlated, then this algorithm will be able to
discover some of those correlations. In fact, this
simple autoencoder often ends up learning a low-
dimensional representation very similar to PCA.
Our argument above relied on the number of
hidden units $s_2$ being small. But even when the
number of hidden units is large (perhaps even
greater than the number of inputs), we can
still discover interesting structure by imposing
other constraints on the network. In particular, if
we impose a sparsity constraint on the hidden
units, then the autoencoder will still discover
interesting structure in the data, even if the
number of hidden units is large.
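A minimal sketch of this idea follows (illustrative dimensions and data; the particular sparsity measure, pushing the mean hidden activation toward a small target $\rho$, is an assumption and is not specified in the report):

    import numpy as np

    # Sketch of an autoencoder: the target is the input itself, y^(i) = x^(i).
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n, s2 = 8, 3                               # 8 inputs, 3 hidden units (compression)
    W1, b1 = rng.normal(0, 0.01, (s2, n)), np.zeros(s2)   # encoder parameters
    W2, b2 = rng.normal(0, 0.01, (n, s2)), np.zeros(n)    # decoder parameters
    rho = 0.05                                 # assumed target average hidden activation

    X = rng.uniform(size=(10, n))              # unlabelled training examples
    A2 = sigmoid(X @ W1.T + b1)                # hidden representations
    X_hat = sigmoid(A2 @ W2.T + b2)            # reconstructions of the inputs

    reconstruction_error = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))
    sparsity_gap = np.mean(A2, axis=0) - rho   # how far each hidden unit is from rho
    print(reconstruction_error, sparsity_gap)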
5. Observations and conclusions
The first part of the objective of this project, i.e.
identification of the instrument, was successfully
demonstrated with the three aforementioned
techniques, the scalogram method being the
most conclusive. This representation is fed as a
feature to train the autoencoder to arrive at a
hypothesis that identifies each instrument
individually and gives the individual instruments
as outputs. An instrument-pass filter is to be
built in order to extract the instruments
separately.
Once an individual instrument is extracted, the
notes can be identified from the corresponding
spectrograms.
References
1. Uhle, Dittmar and Sporer. Extraction of drum tracks from polyphonic music using independent
subspace analysis. 4th International Symposium on Independent Component Analysis and Blind Signal
Separation (ICA2003), April 2003, Nara, Japan.
2. Endelt and Harbo. Time frequency distribution of music based on sparse wavelet packet
representations. SPAR05.
3. Elver and Akan. Recognition of musical instruments using time-frequency analysis.
4. Niva Das. ICA methods for BSS of instantaneous mixtures: A case study. Neural Information
Processing - Letters and Reviews, Vol. 11, No. 11, Nov 2007.
5. Naik and Kumar. An overview of ICA and its applications. Informatica 35: 63-81, 2011.
6. Fujinaga. Machine recognition of timbre using steady-state tone of acoustic instruments.
7. John M. Barry. Polyphonic music transcription using independent component analysis.
8. Jeremy F. Alm and James S. Walker. Time frequency analysis of musical instruments. SIAM Review,
Vol. 44, No. 3, Society for Industrial and Applied Mathematics.
Acknowledgements
I would like to thank my guide Dr. Setlur and my co-guide Dr. Sethi for their guidance, support and
patience. I would also like to acknowledge Dr. Tristan Jehan, whose master's and Ph.D. theses have
immensely helped me in learning about music in its true essence, and Dr. Andrew Ng, Stanford
University and www.coursera.org for helping me understand prerequisites like machine learning.
I would also like to mention the UFLDL Tutorial by Stanford, which guided me through the working of
autoencoders. Finally, I would like to thank the creators of Matlab and Octave and the owners of the
webpages Wikipedia and HyperPhysics.