B. Tech Project Report First Phase
Polyphonic music transcription using Machine learning techniques
Submitted in partial fulfillment of requirements
for the award of the degree of Bachelor of Technology from
Indian Institute of Technology, Guwahati
Under the supervision of
Associate Professor Girish Sampath Setlur
Assistant Professor Amit Sethi
Submitted by-
Lalit Pradhan
10012119
November 15, 2013
Department of Physics
Indian Institute of Technology Guwahati
Guwahati 781039, Assam, INDIA
Certificate
This is to certify that the work presented in the report entitled “Polyphonic music transcription using
machine learning techniques” by Lalit Pradhan, 10012119, represents an original work under the
guidance of Associate Professor Girish Sampath Setlur and Assistant Professor Amit Sethi. This study
has not been submitted elsewhere for a degree.
Signature of student:
Date:
Place: Lalit Pradhan, 10012119
Signature of supervisor I
Date:
Place: Associate Professor Girish S Setlur
Signature of supervisor II
Date:
Place: Assistant Professor Amit Sethi
Signature of Examiner
Date:
Place:
Abstract
In this project report we present a method for recognition of timbre, identification of the
different musical instruments, and automatic transcription of these individual instruments.
We introduce time-frequency distributions for the analysis of musical instruments. The
method presented here uses Independent Component Analysis (hereafter ICA), a special
case of Blind Source Separation (hereafter BSS), to address the problem at hand. We explain
how the ICA method works, and we also introduce a more efficient approach to the same
problem based on a reduced-dimensional autoencoder for the identification of timbre.
Keywords
Blind Source Separation, Independent Component Analysis, Identification of Timbre
Table of Contents
1. Introduction
2. Objective
3. Literature Research and user study
   3.1 Identification of the instruments
   3.2 Separation of source signals
      3.2.1 Cocktail Party Problem
      3.2.2 Independent Component Analysis
         3.2.2.1 ICA Model
         3.2.2.2 Independence
         3.2.2.3 Non-Gaussianity and independence
         3.2.2.4 Preprocessing
4. ICA Test Algorithm
5. Observations and Conclusions
6. Present and future work
References
1. Introduction
A typical musical piece consists of several instruments playing the same or different notes at different pitches, loudness and intensity. We aim to develop a computer-based model, trained on example data, that identifies the instruments and separates the sources. Separation of the individual instruments is an essential prerequisite for automatic music transcription, which amounts to identifying the instruments playing and determining when and for how long each note is played. We can look at the problem in two different ways. Consider a piece of music being played from a loudspeaker: the different instruments play simultaneously to create a harmony, and we can treat the speaker as a single origin or source carrying multiple instruments. Alternatively, we can assume that multiple instruments are being played inside a room, so that there are multiple source origins.

The problem of source separation is an inductive inference problem; deducing the most probable solution is possible only if we have some a priori knowledge about the sources. The signal perceived by the ear can be modeled as a linear combination of the source signals. Blind source separation is a technique developed to recover the set of source signals from mixed signals without knowledge of the source signals or the mixing characteristics. The term blind is used because we have no information about how the signals were generated or how they were mixed. In fact, without some a priori knowledge about the signals, it is not possible to estimate the source signals without indeterminacies and ambiguities. Mathematically, these indeterminacies appear as arbitrary scaling or delay of the estimated source signals. For the identification of the instruments, the estimated waveform is of more importance than the estimated amplitude scaling or delays. ICA is a special case of BSS and is used here for source separation.

The latter situation, with multiple sources, can be modeled as multiple speakers and other auditory sources at a party. This situation is referred to as the Cocktail Party Problem (hereafter CPP) and is a classic setting for ICA. We show how any two statistically independent signals can be separated into individual components. This report also discusses the pros and cons of this technique and suggests a better method on which work is in progress.
2. Objective
The objective of this report is to understand how ICA works and why it is a prominent tool for solving the BSS problem.
3. Literature Research and user study
The basic idea is to teach a computer-based model to identify the different instruments. After identification, the machine separates the individual instruments and then transcribes the notes being played. The machine is made to learn the different attributes of an instrument. This uses an exemplar-based learning method, in which the machine learns to identify an instrument after being trained with a number of training examples. Once the instrument is identified, ICA is applied to separate the individual instruments.
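As a rough illustration of exemplar-based learning (not the exact pipeline used in this project), the sketch below trains a one-nearest-neighbour classifier on magnitude-spectrum features extracted from labelled single-note recordings; the file names, window length and feature choice are assumptions for demonstration.

```python
# Illustrative sketch only: exemplar-based instrument identification using
# a nearest-neighbour classifier on magnitude-spectrum features.
# File names and the feature choice are assumptions, not the project's exact setup.
import numpy as np
from scipy.io import wavfile
from sklearn.neighbors import KNeighborsClassifier

def spectrum_features(path, n_fft=4096):
    """Load a recording of a single note and return a normalized
    magnitude spectrum of a short window as the feature vector."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                       # mix stereo down to mono
        samples = samples.mean(axis=1)
    window = samples[:n_fft] * np.hanning(min(n_fft, len(samples)))
    mag = np.abs(np.fft.rfft(window, n=n_fft))
    return mag / (np.linalg.norm(mag) + 1e-12)

# Hypothetical training exemplars: (file, instrument label)
train = [("guitar_e2.wav", "guitar"), ("cymbal_hit.wav", "cymbal")]
X = np.array([spectrum_features(f) for f, _ in train])
y = [label for _, label in train]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([spectrum_features("unknown_note.wav")]))
```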
3.1 Identification of the instruments
A characteristic feature of any musical instrument is its timbre. A sound spectrum displays the different frequencies present in a sound, i.e., it represents the amount of vibration at each individual frequency as a graph of power or pressure versus frequency. The basic input here is a microphone signal. A very short time window is selected from the signal for a particular note. The input waveform and its Fourier transform together characterize the instrument. Instruments are identified on the basis of which overtones are emphasized, while the notes themselves are identified from the harmonic content. A typical example is shown in the figure below: the frequency spectrum (cubic-spline filtered) of the low E string (E2, 83 Hz) of two Morrison Classic guitars (y axis: dB, x axis: Hz). To provide some context, #211 has an Engelmann spruce top and a Brazilian rosewood back, while #212 has a Western red cedar top and back.

Figure: Frequency spectrum of the low E string of two guitars

Both spectra are almost identical up to 1000 Hz, yet #211 sounds deep while #212 sounds very bright. Most guitar makers concern themselves with the region below 1000 Hz and ignore the rest of the spectrum as noise. On further filtering, the peaks obtained from the Fourier transform of the amplitude-versus-time curve give a clearer picture of the peaks shown above, which are found to be harmonics of the fundamental frequency. Furthermore, the amplitude-versus-time plot is itself a characteristic of an instrument.

Figure: Attack and decay of a guitar

The figure above illustrates the attack and decay of a plucked guitar string. The plucking action gives it a sudden attack characterized by a rapid rise to its peak amplitude; the decay is long and gradual by comparison.

Figure: Attack and decay of a cymbal

The figure above shows the sound envelope of a cymbal struck with a stick. The attack is almost instantaneous while the decay envelope is very long. A comparison of the two envelopes is conclusive about the difference between the instruments. Both are plotted for a time window of about half a second, but since the frequency of the guitar is much lower, the individual periods of its sound envelope can be resolved.
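As a rough sketch of how such an amplitude envelope can be computed (not the project's exact processing chain), the example below rectifies a recorded note and smooths it with a short moving-average window; the file name and window length are assumptions.

```python
# Illustrative sketch: amplitude envelope (attack/decay) of a recorded note.
# The file name and smoothing window are assumptions for demonstration only.
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("guitar_e2.wav")    # hypothetical recording
if samples.ndim > 1:
    samples = samples.mean(axis=1)               # mix down to mono
samples = samples.astype(float) / (np.abs(samples).max() + 1e-12)

# Rectify and smooth with a ~10 ms moving-average window to get the envelope.
win = max(1, int(0.01 * rate))
envelope = np.convolve(np.abs(samples), np.ones(win) / win, mode="same")

t = np.arange(len(samples)) / rate
print("peak amplitude %.3f reached at t = %.3f s" % (envelope.max(), t[envelope.argmax()]))
```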
3.2 Separation of source signals
The problem at hand can be simplified to a two-source problem. Consider a situation in which two instruments are being played in a room; the mixed signals perceived are incomprehensible individually. In signal processing, ICA is a computational method for separating a multivariate signal into additive subcomponents by assuming that the subcomponents are non-Gaussian and statistically independent from each other. The aforementioned situation is a classic example of the CPP.
3.2.1 Cocktail Party Problem
The simultaneous signals are recorded with two spatially separated microphones as shown in the figure below. The spatial separation between the microphones ensures that two different mixtures of the sources are recorded. The ICA algorithm is then run to separate out the individual source signals.

Figure: Setup for identification of sources in a Cocktail Party Problem
3.2.2 Independent Component Analysis
ICA is a statistical technique widely used for solving the BSS problem. The basic assumption in ICA is that the mixing is instantaneous and linear. Here we discuss the conditions under which the signals can be estimated and the method of estimation.
3.2.2.1 ICA model
Suppose we have N statistically independent signals s_i(t), i = 1, ..., N. We assume that the sources themselves cannot be directly observed and that each signal s_i(t) is a realization of some fixed probability distribution at each point t. Suppose also that we observe these signals using N sensors; we then obtain a set of N observation signals x_i(t), i = 1, ..., N, that are mixtures of the sources. With the assumption of spatially separated sensors in mind, we can model the mixing as a matrix multiplication:

x(t) = A s(t)    (1)

where A is an unknown matrix called the mixing matrix and x(t), s(t) are the vectors representing the observed and source signals respectively. The problem is called blind because we have information neither about the matrix A nor about the source vector s(t).

The objective is to recover the original signals s_i(t) from the observed vector x(t). We achieve this by estimating the un-mixing matrix W, where W = A⁻¹. This yields an estimate ŝ(t) of the independent sources:

ŝ(t) = W x(t) = A⁻¹ x(t)    (2)

Figure: BSS block diagram. s(t) are the sources, x(t) are the recordings, ŝ(t) are the estimated sources; A is the mixing matrix and W is the un-mixing matrix.
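To make equations (1) and (2) concrete, the minimal sketch below mixes two synthetic sources with a known matrix A (the one used in Section 4) and recovers them with W = A⁻¹; in the truly blind setting A is unknown and W must be estimated from the observations alone.

```python
# Minimal sketch of the ICA mixing model x(t) = A s(t) and its inversion
# s_hat(t) = W x(t) with W = inv(A). Here A is known only for illustration;
# in the blind setting W must be estimated from x alone.
import numpy as np

t = np.linspace(0, 1, 1000)
s = np.vstack([np.sin(2 * np.pi * 5 * t),            # source 1: sinusoid
               np.sign(np.sin(2 * np.pi * 3 * t))])  # source 2: square wave

A = np.array([[0.3816, 0.8678],
              [0.8534, -0.5853]])                    # mixing matrix from Section 4
x = A @ s                                            # observations, eq. (1)

W = np.linalg.inv(A)                                 # un-mixing matrix
s_hat = W @ x                                        # estimated sources, eq. (2)
print("max reconstruction error:", np.abs(s_hat - s).max())
```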
3.2.2.2 Independence
A key concept that constitutes the foundation of ICA is statistical independence. To simplify the discussion, let us assume a two-source model in which the two sources s1 and s2 are independent.

- Probability density function

Let the joint probability density function (pdf) of s1 and s2 be p(s1, s2), and let the marginal pdfs of s1 and s2 be denoted by p(s1) and p(s2) respectively. s1 and s2 are said to be independent if

p(s1, s2) = p(s1) p(s2)    (3)

Equivalently, expectations of functions of independent variables factorize:

E{g1(s1) g2(s2)} = E{g1(s1)} E{g2(s2)}    (4)

where E{·} is the expectation operator.
8
- Uncorrelatedness

Two sources are said to be uncorrelated if their covariance is zero:

C(s1, s2) = E{(s1 − m_s1)(s2 − m_s2)}
          = E{s1 s2 − s1 m_s2 − s2 m_s1 + m_s1 m_s2}    (5)
          = E{s1 s2} − E{s1} E{s2}
          = 0

where m_s1 and m_s2 denote the means of the respective signals. Independence implies uncorrelatedness, but uncorrelated signals are not necessarily independent.
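As a quick numerical sketch of equation (5): two independently generated signals have an empirical covariance close to zero, while linear mixtures of them generally do not. The specific source distributions and mixing weights below are arbitrary illustrative choices.

```python
# Sketch: empirical covariance of independent sources vs. their mixtures.
import numpy as np

rng = np.random.default_rng(0)
s1 = rng.uniform(-1, 1, 100_000)            # independent, non-Gaussian sources
s2 = np.sign(rng.uniform(-1, 1, 100_000))

cov_sources = np.mean(s1 * s2) - np.mean(s1) * np.mean(s2)
x1, x2 = 0.6 * s1 + 0.4 * s2, 0.5 * s1 - 0.7 * s2    # linear mixtures
cov_mixtures = np.mean(x1 * x2) - np.mean(x1) * np.mean(x2)

print("covariance of sources  ~ %.4f" % cov_sources)    # close to 0
print("covariance of mixtures ~ %.4f" % cov_mixtures)   # generally non-zero
```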
- Rank

For linearly dependent signals the rank of the matrix is less than its size, and for linearly independent signals the rank equals the size of the matrix; however, this cannot be relied upon exactly because of noise in the signals.

- Determinant

In real-time applications, the determinant is zero for linear dependence and non-zero for linear independence.
3.2.2.3 Non-Gaussianity and independence
The central limit theorem states that the sum of independent signals with arbitrary distributions tends towards a Gaussian distribution under certain conditions. Hence a linear mixture of many independent source signals is generally closer to Gaussian than the individual sources. The separation of independent signals from their mixtures can therefore be accomplished by making the linear transformation of the observed signals as non-Gaussian as possible. A quantitative measure of the non-Gaussianity of the normalized signals is the kurtosis.

- Kurtosis

When the data are preprocessed to have unit variance, the kurtosis is essentially the fourth moment of the data. The kurtosis of a signal s is defined by

kurt(s) = E{s⁴} − 3 (E{s²})²    (6)

Here we have assumed a normalized distribution, hence the mean is zero and the variance E{s²} = 1, and the equation simplifies to

kurt(s) = E{s⁴} − 3    (7)

The kurtosis of a Gaussian variable is zero, and it is typically non-zero for non-Gaussian variables.
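As a sketch, the kurtosis of equation (7) can be estimated from samples as below; the Gaussian, uniform and Laplacian examples are illustrative choices showing zero, negative and positive kurtosis respectively.

```python
# Sketch: sample kurtosis kurt(s) = E{s^4} - 3 for zero-mean, unit-variance data.
import numpy as np

def kurtosis(s):
    s = np.asarray(s, dtype=float)
    s = (s - s.mean()) / s.std()        # center and normalize to unit variance
    return np.mean(s ** 4) - 3.0

rng = np.random.default_rng(0)
print("Gaussian :", round(kurtosis(rng.normal(size=200_000)), 3))    # ~ 0
print("Uniform  :", round(kurtosis(rng.uniform(size=200_000)), 3))   # ~ -1.2 (sub-Gaussian)
print("Laplacian:", round(kurtosis(rng.laplace(size=200_000)), 3))   # ~ +3 (super-Gaussian)
```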
3.2.2.4 Preprocessing
Before running the ICA algorithm, the signals were preprocessed.

- Centering

The observation vector is centered by subtracting the mean vector m = E{x} from it. This makes the mean zero, with the centered observation vector x_c = x − m. The sources can then be estimated by

ŝ(t) = A⁻¹(x_c + m)    (8)

- Whitening

Another useful step is to prewhiten the observation vector x. Whitening linearly transforms the observation vector so that its components are uncorrelated and have unit variance. Let x_w denote the whitened vector; it then satisfies

E{x_w x_wᵀ} = I    (9)

where E{x_w x_wᵀ} is the covariance matrix of x_w. A simple way to perform the whitening transformation is to use the eigenvalue decomposition of the covariance of x, i.e.

E{x xᵀ} = V D Vᵀ    (10)

where V is the matrix of eigenvectors of E{x xᵀ} and D is the diagonal matrix of its eigenvalues. The observation vector can then be whitened by the transformation

x_w = V D^(−1/2) Vᵀ x    (11)

where D^(−1/2) = diag{λ1^(−1/2), λ2^(−1/2), …, λn^(−1/2)}. Whitening transforms the mixing matrix into a new one which is orthogonal:

x_w = V D^(−1/2) Vᵀ A s = A_w s    (12)

Whitening thus reduces the number of parameters to be estimated from the n² elements of the original mixing matrix to the n(n − 1)/2 free parameters of an orthogonal matrix.
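A minimal sketch of centering and eigenvalue-decomposition whitening (equations 8–11) might look as follows; it assumes the observations are stored as rows of a matrix, one row per sensor.

```python
# Sketch: centering and whitening of an observation matrix X (rows = signals,
# columns = samples), using the eigenvalue decomposition of the covariance.
import numpy as np

def center(X):
    m = X.mean(axis=1, keepdims=True)   # mean vector m = E{x}
    return X - m, m

def whiten(X):
    Xc, m = center(X)
    cov = Xc @ Xc.T / Xc.shape[1]       # sample covariance E{x x^T}
    eigvals, V = np.linalg.eigh(cov)    # cov = V D V^T
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))
    Xw = V @ D_inv_sqrt @ V.T @ Xc      # x_w = V D^(-1/2) V^T x_c, eq. (11)
    return Xw, m

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))                      # two non-Gaussian sources
A = np.array([[0.3816, 0.8678], [0.8534, -0.5853]])  # mixing matrix from Section 4
Xw, _ = whiten(A @ S)
print(np.round(Xw @ Xw.T / Xw.shape[1], 3))          # approximately the identity
```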
4. ICA Test Algorithm
Here we took two source signals s1 and s2 as a test case. The 2×2 mixing matrix was taken as

A = [ 0.3816   0.8678
      0.8534  −0.5853 ]

and was treated as unknown while estimating the original signals.
Figure: Independent sources 𝑠1 and 𝑠2
Figure: Observed signals x1 and x2 from an unknown linear mixture of unknown independent components
The mixed signals were then separated using the ICA algorithm. Here we show the estimated output as well as the detailed intermediate steps. We used the FastICA algorithm to estimate the signals.
Figure: Estimated source signals
Figure: Intermediate steps of the estimation, to be read in an anti-clockwise sense starting from the top left
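A sketch of this two-source experiment using the FastICA implementation in scikit-learn is shown below; the synthetic source waveforms are assumptions, while the mixing matrix is the one stated above. The recovered components may come back permuted and rescaled, reflecting the indeterminacies discussed earlier.

```python
# Sketch of the two-source ICA experiment: mix two synthetic signals with the
# matrix A from this section, then recover them blindly with FastICA.
# The source waveforms here are illustrative choices.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 2000)
s1 = np.sin(2 * np.pi * 7 * t)                        # sinusoidal source
s2 = np.sign(np.sin(2 * np.pi * 3 * t))               # square-wave source
S = np.c_[s1, s2]                                     # samples x sources

A = np.array([[0.3816, 0.8678],
              [0.8534, -0.5853]])
X = S @ A.T                                           # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                          # estimated sources (up to
                                                      # permutation and scaling)
print("estimated mixing matrix:\n", ica.mixing_)
```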
5. Observations and Conclusions
The estimated signals produced waveforms similar to those of the original signals except for amplitude scaling, owing to the indeterminacies inherent in the assumptions of the ICA method. However, it was realized that this method would not be a very practical solution to the problem at hand, which involves more than two source signals: we would require N spatially separated sensors/microphones to record N source signals, and the whitening would have to be performed in N dimensions rather than the two dimensions used here. Hence we opted for an entirely new approach to the given problem statement.
6. Present and future work
The new method deals with unsupervised and deep learning algorithms.

Figure: Circles denote the inputs to the network; the circles labeled "+1" are called bias units and correspond to the intercept term.

Layer 1 is the input signal. Layer 2 is the hidden layer, which is trained to identify the instruments and constitutes the autoencoder. For example, the data from the attack and decay of a guitar note will act as a filter in the autoencoder so that the output contains just the guitar. The goal of the next semester is to collect data on attributes such as harmonicity and the attack and decay of each note for a few instruments and to train the autoencoder. Unlike ICA, this method does not require multiple sensors; it is a purely neural-network-based method of unsupervised/deep learning.
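As a rough sketch of the direction described above (not the final architecture), a single-hidden-layer autoencoder can be trained to reconstruct spectral feature vectors of notes; here it is approximated with scikit-learn's MLPRegressor fitted with the input as its own target, and the feature dimension and hidden-layer size are assumptions.

```python
# Sketch: a single-hidden-layer autoencoder approximated with an MLP trained to
# reconstruct its own input. Feature dimension and hidden size are assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 64))            # placeholder for 64-bin spectral features
                                     # of training notes (real data to be collected)

autoencoder = MLPRegressor(hidden_layer_sizes=(16,),   # compressed hidden layer
                           activation="logistic",
                           max_iter=2000,
                           random_state=0)
autoencoder.fit(X, X)                # learn to reproduce the input at the output

reconstruction = autoencoder.predict(X)
print("mean reconstruction error:", np.mean((reconstruction - X) ** 2))
```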