Composing Monophonic World Music Using Deep Learning
Nithin Xavier
x17110530
MSc in Data Analytics
14th August 2018
Keywords: Music Generation; Computer Composing; Neural Networks; Deep Learning.
Abstract: Music is considered a universal language, loved by all. We enjoy music in any language, without a language barrier, since music acts as a medium connecting the mind and the soul. Computer-generated music is a relatively new term and the associated domain is still in its infancy; nevertheless, the limited research carried out so far has yielded good results. The task of the proposed system is to emulate a human composer and generate good-sounding music. Alan Turing's test can be applied here: when a human cannot distinguish between computer-generated music and a human composer's work, that music generation system can be deemed perfect. Neural networks can effectively solve the current problem faced by composers of producing long hours of music of the kind normally played in airports, restaurants, flights, malls and other public places. They can also synthesize sleep music, a growing genre that is becoming popular with people who have sleeping problems or discomforts. We generalize these problems and strive to develop a model to compose monophonic world music using a good training dataset and a novel application of music theory knowledge.
Contents
1 Introduction
2 State of the Art
   2.1 Filetype of Dataset used
   2.2 Model Generation
3 Research Question
4 Proposed Approach
5 Proposed Implementation
6 Proposed Evaluation
7 Conclusion
1 Introduction
Composition of music is considered a creative and innovative work. It involves applying music principles such as chord progressions, scales, harmony and dynamics. Even though a composer ought to follow these musical rules, he or she may sometimes introduce out-of-the-box arrangements by changing chord progressions or by introducing accidental notes and styles. Also, because a great deal of music already exists and is constantly being produced, the composer has to be alert to avoid his/her music sounding similar to work that is already available. There are seven main notes in a scale, for example C to B; the C of the higher octave constitutes the 8th note, completing the C scale. Every note has a sharp or flat neighbour except for two notes (there is no sharp between E and F or between B and C). This same arrangement of keys is repeated to form higher or lower octaves. Hence, there are only 12 distinct notes in music, and composers have only these 12 notes with which to compose. The permutations and combinations of these 12 notes are what we find in every piece of music produced or arranged. Sometimes, after composing a song, the composer may find a similar existing tune matching his/her work and is then forced to change the entire affected sequence. Musical rules and theory can be fed into neural networks to generate unique sequences of musical arrangement, which forms the novelty of this project. Computer systems, if properly trained, can learn such patterns effectively. Computer-generated music has been researched for a long time, but studies in this field remain few, owing to the lack of musical training or expertise among researchers and to limited interest from sound technologists and musicians. Recurrent neural networks are employed in many of these studies because of their good performance in this arena. Through this research we address the problems of music composers, such as composing long hours of music for meditation and leisure to be played at public places like airports, flights and malls. Composing music lasting more than 2-3 hours can be a time-consuming and exhausting experience. Such music may contain similar patterns throughout, which a well-trained neural network can generate effectively. Hence, the research undertaken in this regard has the following research question:
How effectively can deep learning techniques generate unique monophonic world music based on a music dataset and musical theory? We address this research question by proposing an appropriate methodology, implementation and expected result evaluation. In the next section we discuss the various studies related to our proposed objective.
2 State of the Art
In this section we analyse and survey various research papers related to our research objective. The first sub-section considers the format or type of dataset used in the referenced papers.
2.1 Filetype of Dataset used
Since our research question seeks to develop monophonic music samples based on a contemporary music dataset and music theory rules, the processing of the dataset and the output format of the generated music play a crucial role. The type of data used for processing decides both the quality of the generated music and the complexity of the neural network. There are two main file formats: MIDI (Musical Instrument Digital Interface) and WAV (Waveform Audio File Format). If we analyse the type of data used in the related work, we observe a clear dominance of the MIDI format for computer music generation. The following research works used the MIDI filetype: Madhok et al. (2018), Mao (2018), Yang et al. (2017), Sabathe et al. (2017), Liang (2016), Lyu et al. (2015), Goel et al. (2014), Chung et al. (2014), Roig et al. (2014), Boulanger-Lewandowski et al. (2012), Yuksel et al. (2011), Oliwa and Wagner (2008), Cameron (2001) and Masako and Kazuyuki (1992). This shows that from the early 90s until the present, the MIDI filetype has been the preferred choice of researchers in this domain. The main reasons are its low computational complexity and the fact that the output is recognizable and editable in most digital audio workstations and music production software. The MIDI format contains the pitch, velocity and other information about each played note. Its main advantage is that the instrument of the generated music can be changed to any desirable, high-quality virtual instrument to improve the sound quality of the final output. Its disadvantage is that, on direct playback, the built-in instrument sounds are substandard compared to all other formats. The sketch below illustrates reading this note-level information from a MIDI file; we then turn to the other filetypes used in this domain.
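As an illustrative sketch (assuming the pretty_midi library and a hypothetical file example.mid; neither is prescribed by this proposal, and other MIDI parsers would work equally well), the note-level information stored in a MIDI file can be read as follows:

```python
import pretty_midi

midi = pretty_midi.PrettyMIDI("example.mid")        # parse the MIDI file
for instrument in midi.instruments:                 # one entry per track/instrument
    for note in instrument.notes[:10]:              # inspect the first few notes
        print(instrument.name, note.pitch, note.velocity,
              round(note.start, 3), round(note.end, 3))
```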
WAV is the preferred filetype in music production scenarios because it is a lossless audio format. However, WAV files are significantly larger than MIDI files. Engel et al. (2017) and Mańdziuk et al. (2016) used the WAV format in their research. These files cannot be edited note by note but must be edited at the waveform level, which is a more complex task. Because of the large file size, the computational complexity is high, and high-end GPUs are required to run the machine learning algorithms. The next filetype concerns the extraction of information from sheet music to train the system.
Lichtenwalter and Lichtenwalter (2009) used the MusicXML format to gather information about notes, time signature, pitch, dynamics and other musical parameters, teaching the system about chord progressions and other music theory for effective music generation. Eck and Schmidhuber (2002), in contrast, extracted musical information directly from sheet music to train the system on the chords and note sequences to be played. Extracting information from sheet music is not as effective as the two commonly playable formats mentioned before, MIDI and WAV. Hence, most research in computerised music generation or music composition utilizes those filetypes.
2.2 Model Generation
Model generation, i.e. the development of the neural network that learns to produce monophonic music samples as per our objective, is the major part of this project. This section discusses and compares the various methods used for network generation. Madhok et al. (2018) recognised seven major human emotions and then, according to the detected emotion, generated music suited to the observed scenario using a dual-layer Long Short-Term Memory (LSTM) architecture. The work was evaluated using the correlation between the detected facial expression and the probability that the resulting music falls in the same category; the correlation was 0.93, a good score. Mao (2018) used a dual-axis LSTM architecture in which one axis models the time of the generated music and the other axis the notes to be output; adding style and volume features enhanced the production quality. The approach was evaluated with a statistical hypothesis test at a significance level of 0.05, and the value z = 0.945 indicates that the classification precision of human composers and of the proposed approach was almost the same. Solutions to three different music generation problems, namely harmonization, chord inversion and voicing, and chord estimation, were achieved by Kitahara (2017) using Bayesian networks. Yang et al. (2017) implemented a model based on Convolutional Neural Networks (CNN) and a Generative Adversarial Network (GAN) in which information about the previous bar and the chord sequence is incorporated, producing results similar to other work. The drawback of this model is that velocity and musical pauses are not considered, which makes the generated music sound artificial. Sabathe et al. (2017) used LSTM networks for music generation with optimized parameters, such as 167 LSTM units for both the encoding and decoding functions and 23 steps of sequential automatic encoding. The major drawback of this approach was that music pieces longer than the training samples could not be generated.
Music theory and other musical features were used more effectively by Mańdziuk et al. (2016), whose combined algorithm of a genetic algorithm and local optimization captures the necessary technicalities of music theory to produce aesthetically and theoretically superior music. Liang (2016) developed a sequential LSTM network trained to produce good quality music without much explicit training on music theory concepts. Lyu et al. (2015) combined LSTM units with a Recurrent Temporal Restricted Boltzmann Machine (RTRBM) and obtained only average results, attributed to the absence of optimization techniques. Goel et al. (2014) used an RNN with two layers of Restricted Boltzmann Machines for sequence modelling; the results are only on a par with other research, owing to the lack of optimization methods and of pretraining on music theory. Chung et al. (2014) show that LSTM and GRU units fare better than plain tanh units in recurrent networks for raw speech and polyphonic music modelling. Eck and Schmidhuber (2002) learnt chord sequences and melody sequences and fed the learnt information into LSTM networks. Hence, a majority of researchers have used recurrent neural networks, more specifically LSTM networks, to learn and generate music pieces based on an input corpus and musical technicalities.
3 Research Question
The research question for this project is as follows: How effectively can deep learning techniques generate unique monophonic world music based on a music dataset and musical theory?
This falls under the domain of computer music generation. Neural networks have been successful in producing good musical recordings, as seen in the literature review above. Being able to evaluate and learn the previous notes and chords played in a musical sequence is necessary in computerised music generation. Hence, we use recurrent neural networks (RNN), which have recurrent loops in their nodes. LSTM units appear to be the best RNN units in the referenced papers. For the effective implementation of our research objective, we additionally introduce a novel component that learns music theory: scales, time signatures, chord progressions, the typical velocity of each note according to genre, and accidental chords and progressions that may be introduced unexpectedly into a piece. This information is contained in MusicXML files, which are parsed into the system along with the MIDI information. The related literature mentions that musical knowledge can complement and improve a music generation system, and researchers have suggested collaboration between musical experts and computing experts. Since I am a musician and have the requisite music theory knowledge to train the neural networks, this project is feasible and can improve upon the other research in this domain.
4 Proposed Approach
In this section the approach to be followed for the given research question is detailed and an overview of the complete picture is given. We propose to implement recurrent neural networks (RNN) to analyse a global contemporary MIDI music dataset and recreate similar sequences of music. We use MIDI files as input to the network to train it on known melody lines and musical sequences and test for the probability of producing similar but unique music. The dataset used is the Lakh MIDI Dataset v1.0, available at http://colinraffel.com/projects/lmd/#get. It contains around one lakh (100,000) MIDI files of songs listed in the Million Song Dataset, so it is a subset of the Million Song Dataset covering contemporary global music. MIDI files consist only of information about the notes played, their timing, the time signature, dynamics and velocity. In addition, we use information related to music theory, such as chord progressions, scales, accidental note and chord usage and other musical rules, in the MusicXML format, to be fed into the recurrent neural network. This information is represented as sheet music in that format, from which it will be extracted and used to train the proposed system, as sketched below.
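A minimal sketch of this extraction step (assuming the music21 library and a hypothetical file lead_sheet.musicxml; the exact parsing toolkit is not fixed by this proposal) could look as follows:

```python
from music21 import converter

score = converter.parse("lead_sheet.musicxml")           # parse the MusicXML file
print("Estimated key / scale:", score.analyze("key"))    # e.g. C major

for ts in score.recurse().getElementsByClass("TimeSignature"):
    print("Time signature:", ts.ratioString)             # e.g. 4/4

for element in score.recurse().notes:                    # notes and chords only
    if element.isChord:
        print("Chord:", [p.nameWithOctave for p in element.pitches],
              "duration:", element.quarterLength)
    else:
        print("Note:", element.nameWithOctave, "duration:", element.quarterLength)
```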
The desired design constraints of the network are as follows:
Time Signature The Recurrent neural network should be able to identify the current playing time
with reference to the musical time signature. Time signature refers to the number of beats occurring
in one single bar of music. In common time, there are four beats in one bar which is denoted by 4/4.
Likewise, there are many time signatures like 3/4, 5/8, 6/8, 7/8, etc.
Invariance in Notes There should be independence in the music with respect to the octave. Changing octaves should not affect the basic note, chord structure and progressions.
Repetition of Notes The sustain of one note over two bars should be distinguished from playing that
same note twice.
Invariance in Time There should be freedom for the network to generate music independent of the time frame, like an ad lib, as it is called in musical terms.
Accidental Note and Chord Changes Accidental notes or chords can be termed out-of-the-box devices that do not feature in standard music theory. These can be innovative and enriching to hear if used correctly, with some developed rules.
Figure 1: Network Design
The property of being invariant in time is achievable with an RNN. However, note invariance is not, because the fully connected layer has one node to represent each note in the MIDI range: if we raise the pitch of every note by one half step, the output of the network will be entirely different from the desired output. This drawback can be resolved by borrowing an idea from convolutional neural networks (CNN). In image recognition applications, the same convolution kernel is applied across all pixels of the input image. Now assume that the kernel of the CNN is replaced with an RNN; the network then consists of the same RNN applied at every position, so each cell (pixel) effectively possesses a neural network of its own. Applying this idea to our study, we replace the pixels with notes, the main elements in our research. If we implement a stack of identical RNNs, one per note, every note receives a neighbourhood as input, and this neighbourhood spans one octave above and one octave below the note's own pitch. Hence, we achieve invariance in time as well as in notes. A sketch of constructing such a per-note neighbourhood is given below.
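A minimal NumPy sketch of this neighbourhood construction (illustrative only; the window radius of 12 semitones, one octave in each direction, follows the description above):

```python
import numpy as np

def note_vicinity(piano_roll, radius=12):
    """piano_roll: binary vector over the 128 MIDI pitches for one time step."""
    n = len(piano_roll)
    padded = np.pad(piano_roll, radius)                 # zero-pad beyond the octave edges
    # one row per pitch: its local window from one octave below to one octave above
    return np.stack([padded[i:i + 2 * radius + 1] for i in range(n)])

step = np.zeros(128)
step[[60, 64, 67]] = 1                                  # a C major triad at this time step
print(note_vicinity(step).shape)                        # (128, 25)
```

Transposing the input by one semitone simply shifts the rows of this matrix, which is exactly the invariance the stacked per-note RNNs exploit.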
Because memory is retained concerning previous notes and sequences, we must now build a method to produce good, innovative chords for the music. Hence, we divide this approach into two parts. A bi-axial recurrent neural network is suitable for meeting our research objective: the first axis represents time and the other axis represents the note. The network design is as shown in Figure 1.
The following are the details concerning the inputs and outputs of the proposed network. The inputs to the time layer of the bi-axial RNN are discussed first; a sketch assembling these inputs into a single vector follows the list.
1. Note Value: The note value refers to the MIDI value, which describes the register of the played note, i.e. whether it lies in a lower or a higher register.
2. Pitch: This refers to the pitch value of the played note, where the A note's pitch value is 0 and the value increases by one for every half-step increase in pitch.
3. Scale: The scale refers to a sequence of notes following certain musical rules. The many scales in music theory can be input to the system so that it emulates world music without mistakes.
4. Previous State: This input tells the network whether, and how often, a particular note was played during the previous time step.
5. Rhythm: This is a useful input that lets the network understand the position of the current note with respect to the time measure and time signature.
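The following sketch (illustrative only; the exact encoding and feature sizes are assumptions rather than fixed choices) assembles these five inputs into one vector for a single note at a single time step:

```python
import numpy as np

def note_input(midi_pitch, scale_pitch_classes, was_played, beat_in_bar, beats_per_bar=4):
    register = midi_pitch / 127.0                          # 1. note value / register
    pitch_class = np.zeros(12)
    pitch_class[(midi_pitch - 21) % 12] = 1                # 2. pitch class, A (MIDI 21) = 0
    in_scale = 1.0 if midi_pitch % 12 in scale_pitch_classes else 0.0   # 3. scale membership
    previous = float(was_played)                           # 4. previous state
    beat = np.zeros(beats_per_bar)
    beat[beat_in_bar] = 1                                  # 5. rhythm: position within the bar
    return np.concatenate([[register], pitch_class, [in_scale], [previous], beat])

c_major = {0, 2, 4, 5, 7, 9, 11}                           # pitch classes of the C major scale
print(note_input(60, c_major, was_played=True, beat_in_bar=2).shape)   # (19,)
```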
Along the time axis, LSTM layers with recurrent loops form the first hidden layers. The other axis, the note axis, scans the notes from the low registers up to the high registers. After the last LSTM layer has run, a final fully connected, non-recurrent layer outputs two kinds of probabilities (a sketch of sampling from them follows the list):
1. The probability of each note being played.
2. For a note that is played, the probability that it is articulated, i.e. struck again rather than held over; this also forms one of the outputs of the non-recurrent layer.
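A minimal sketch of one plausible sampling rule for these two probabilities (the exact procedure is an assumption, not fixed by this proposal):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_note(p_play, p_articulate):
    play = rng.random() < p_play                           # does the note sound at this step?
    articulate = play and (rng.random() < p_articulate)    # if so, is it struck again or held?
    return play, articulate

print(sample_note(p_play=0.8, p_articulate=0.6))
```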
Processing of the musical output: the MIDI file generated as output may then be loaded into music production software and edited to change the instrument. Since the built-in MIDI playback quality is very poor, high-quality virtual instruments from third parties can be used to provide scoring-level quality. This option is available because the MIDI format represents information about note, velocity, pitch and other musical parameters rather than raw audio; writing the generated notes out as MIDI is sketched below.
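A minimal sketch (again assuming pretty_midi and an arbitrary, made-up list of generated notes) of writing the output to a MIDI file that a digital audio workstation can then re-voice:

```python
import pretty_midi

generated = [(60, 0.0, 0.5), (62, 0.5, 1.0), (64, 1.0, 2.0)]   # (pitch, start, end) in seconds

out = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)                      # placeholder instrument
for pitch, start, end in generated:
    piano.notes.append(pretty_midi.Note(velocity=90, pitch=pitch, start=start, end=end))
out.instruments.append(piano)
out.write("generated.mid")                                     # open in a DAW and re-voice
```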
5 Proposed Implementation
Our proposed music generation model will be implemented in the Python programming language. In particular, we will use a Python library called Theano, which simplifies the computation and provides flexibility in the network architecture. The step-by-step implementation is described below.
Random small segments of the MIDI files are fed into the recurrent neural network during training. The cross-entropy cost is obtained from the probabilities of all the outputs: the probabilities are log-transformed and negated, and the result is fed as the cost into the AdaDelta optimizer for weight optimization. The time-axis layers are trained by batching all the notes together, and the note-axis layers by batching all the time steps together; the processor is better utilised because of the resulting large matrix multiplications. Dropout is used in our network to counter overfitting. Applying dropout in each layer eliminates 50% of the hidden nodes: the output of every layer is multiplied with a mask, so the dropped nodes are eliminated by multiplying their output by zero. This encourages specialization and prevents the nodes from relying on weak dependencies. We then multiply the output of each node by 0.5 as a correction factor, compensating for the larger number of nodes that are active when dropout is not applied. The cost and the dropout mask are sketched below.
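A minimal NumPy sketch of the two ideas above, the negative-log-probability cost handed to the optimizer and the 50% dropout mask with its 0.5 correction (illustrative only; the actual implementation would express these symbolically in Theano):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy_cost(probs, targets, eps=1e-8):
    # negative log-likelihood of the observed notes under the predicted probabilities
    return -np.mean(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))

def dropout(layer_output, keep_prob=0.5, training=True):
    if training:
        mask = rng.random(layer_output.shape) < keep_prob   # drop roughly 50% of the nodes
        return layer_output * mask
    return layer_output * keep_prob                          # the 0.5 correction factor

probs = np.array([0.9, 0.2, 0.7])     # predicted note probabilities
targets = np.array([1.0, 0.0, 1.0])   # notes actually played
print(cross_entropy_cost(probs, targets))
```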
For training the model we use Amazon Web Services (AWS) instances, specifically the cheaper spot instances, which cost roughly 10 to 15 US cents per hour. Our proposed model consists of two note-axis hidden layers and two time-axis hidden layers. The note-axis layers have 100 and 50 nodes respectively, and the two time-axis hidden layers have 300 nodes each. Training over the MIDI files in our dataset is performed by choosing 8-count segments of the clips and batching them together, as sketched below.
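A minimal sketch (illustrative only; segment length and batch size are assumed values) of sampling fixed-length segments from piano-roll matrices and batching them for training:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(piano_rolls, segment_len=8, batch_size=4):
    """piano_rolls: list of (time_steps, 128) arrays, one per MIDI file."""
    batch = []
    for _ in range(batch_size):
        roll = piano_rolls[rng.integers(len(piano_rolls))]          # pick a random file
        start = rng.integers(0, roll.shape[0] - segment_len + 1)    # pick a random segment
        batch.append(roll[start:start + segment_len])
    return np.stack(batch)                                          # (batch_size, segment_len, 128)

rolls = [rng.integers(0, 2, size=(64, 128)) for _ in range(3)]      # dummy piano rolls
print(sample_batch(rolls).shape)                                    # (4, 8, 128)
```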
Figure 2: Decision Model (flowchart: Parsing MIDI Dataset and Parsing Dataset of Music Theory → Dimension Reduction → Final Features → Training and Testing Combined Datasets → Training and Testing Feature Vectors → Music Generation → Evaluation)
6 Proposed Evaluation
The evaluation of the output of our music generation model will be performed through an open survey, described below.
For the open survey we will select a group of 50 people, of whom 80 percent will have a musical background and 20 percent will not. There will be three sets of identification tasks; each set contains three musical recordings, of which two are composed by humans and one is generated by our system. The participants will know these rules and must identify, from the three recordings in each set, the one generated by our system. The participants will also be given the option of writing comments about each recording in each set. The evaluation metrics for this survey are as follows: the Recording Set Identifier denotes each of the three sets of recordings; Incorrect Identification denotes the percentage of incorrectly identified recordings in a set; and Correct Identification denotes the percentage of correctly identified recordings in a set. We estimate that the percentage of correctly identified samples will lie between 20 and 40 percent. The estimated results can be tabulated as follows:
Recording Set Identifier   Incorrect Identification   Correct Identification
1                          66%                        34%
2                          75%                        25%
3                          80%                        20%
Total                      73.60%                     26.40%

Hence, we anticipate the incorrect identification to be around 73.6 percent and the correct identification to be 26.4 percent. The assumption behind this survey is that when participants are asked to pick the computer-generated music from the three recordings, they will tend to pick the least pleasing, most inferior recording, because of the limited advances in computer-generated music compared to the ability of humans to create world-class, superior music. Hence, the incorrect identification rate serves as the accuracy of the evaluation of our system, and we expect around 70 to 75 percent accuracy in this respect.
In the next evaluation, we survey the genre fidelity of each recording. The same participants are given 20 recordings across the three genres selected for this evaluation. Recordings of these genres from the training dataset and generated recordings of the same genres are selected to be reviewed by the participants. The metric used to review how well the generated data matches the training data is the similarity score. The mean of these similarity scores is taken for each genre to obtain the similarity score of all recordings in that genre. The similarity score ranges from 1 to 5, where 1 denotes that the recordings sound very different from the training dataset and 5 denotes that they sound very similar to it. We expect a mean similarity score of 4.1 for the pop genre, since pop music is very widely known and there won't be much difficulty in identifying differences in this genre; moreover, the more accurate the proposed model, the higher the similarity score. The similarity score for jazz is expected to be lower than for the other genres because of the complexity of the genre's music and its smaller following. The following table shows the expected mean similarity score for each genre; a sketch of computing both survey metrics follows the table.
Genre Mean
Pop 4.1
Jazz 3.8
Blues 3.9
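A minimal sketch of how both survey metrics would be computed from the collected responses (the response values below are purely hypothetical placeholders, not results):

```python
# Hypothetical responses, used only to illustrate the metric arithmetic.
identified_correctly = {"Set 1": [True, False, False], "Set 2": [False, False, True]}
similarity_ratings = {"Pop": [5, 4, 4], "Jazz": [4, 3, 4], "Blues": [4, 4, 4]}

for set_id, answers in identified_correctly.items():
    correct = 100 * sum(answers) / len(answers)              # Correct Identification (%)
    print(set_id, "correct:", round(correct, 1), "% incorrect:", round(100 - correct, 1), "%")

for genre, ratings in similarity_ratings.items():
    print(genre, "mean similarity:", round(sum(ratings) / len(ratings), 1))
```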
Hence, we calculate the accuracy of the proposed music generation model from the survey reviews. The anticipated results were shown in the tables above. We expect a state-of-the-art model using the proposed methodology and implementation.
7 Conclusion
In conclusion, we have proposed the plan, or blueprint, of the research to be undertaken with regard to composing or generating monophonic music using neural networks. Our plan aims to guide the research so that the project is completed within the three-month time frame and to give a better understanding during the actual implementation of the methodology. We anticipate state-of-the-art results compared to other research in this field, as indicated in the proposed evaluation. We will detail the implementation steps and further enhance the proposed model after trying and testing different approaches.
References
Boulanger-Lewandowski, N., Bengio, Y. and Vincent, P. (2012). Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription.
URL: http://arxiv.org/abs/1206.6392
Cameron, B. B. (2001). System and Method for Automatic Music Generation using a Neural Network
Architecture, 2(12).
Chung, J., Gulcehre, C., Cho, K. and Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent
Neural Networks on Sequence Modeling, pp. 1–9.
URL: http://arxiv.org/abs/1412.3555
Eck, D. and Schmidhuber, J. (2002). A First Look at Music Composition using LSTM Recurrent Neural
Networks, Idsia pp. 1–11.
URL: http://people.idsia.ch/~juergen/blues/IDSIA-07-02.pdf
Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K. and Norouzi, M. (2017). Neural
Audio Synthesis of Musical Notes with WaveNet Autoencoders.
URL: http://arxiv.org/abs/1704.01279
Goel, K., Vohra, R. and Sahoo, J. K. (2014). Polyphonic music generation by modeling temporal
dependencies using a RNN-DBN, Lecture Notes in Computer Science (including subseries Lecture
Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8681 LNCS: 217–224.
Kitahara, T. (2017). Music Generation Using Bayesian Networks, pp. 3–6.
URL: http://www.kthrlab.jp/
Liang, F. (2016). BachBot: Automatic composition in the style of Bach chorales - Developing, analyzing,
and evaluating a deep LSTM model for musical style, (August).
Lichtenwalter, R. and Lichtenwalter, K. (2009). Applying learning algorithms to music generation,
Proceedings of the 4th pp. 483–502.
URL: http://www.cse.nd.edu/Reports/2008/TR-2008-10.pdf
Lyu, Q., Wu, Z., Zhu, J. and Meng, H. (2015). Modelling high-dimensional sequences with LSTM-RTRBM: Application to polyphonic music generation, IJCAI International Joint Conference on Artificial Intelligence 2015: 4138–4139.
Madhok, R., Goel, S. and Garg, S. (2018). SentiMozart: Music Generation based on Emotions, ICAART 2018, 2: 501–506.
Mańdziuk, J., Woźniczko, A. and Goss, M. (2016). A Neuro-memetic System for Music Composing.
Mao, H. H. (2018). DeepJ: Style-Specific Music Generation, Proceedings - 12th IEEE International Conference on Semantic Computing, ICSC 2018: 377–382.
Masako, N. and Kazuyuki, W. (1992). Interactive Music Composer Based on Neural Networks.
Oliwa, T. and Wagner, M. (2008). Composing music with Neural Networks and probabilistic finite-
state machines, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics) 4974 LNCS: 503–508.
Roig, C., Tardón, L. J., Barbancho, I. and Barbancho, A. M. (2014). Automatic melody composition
based on a probabilistic model of music style and harmonic rules, Knowledge-Based Systems 71: 419–
434.
URL: http://dx.doi.org/10.1016/j.knosys.2014.08.018
Sabathe, R., Coutinho, E. and Schuller, B. (2017). Deep recurrent music writer: Memory-enhanced
variational autoencoder-based musical score composition and an objective measure, Proceedings of the
International Joint Conference on Neural Networks 2017-May: 3467–3474.
Yang, L.-C., Chou, S.-Y. and Yang, Y.-H. (2017). MidiNet: A Convolutional Generative Adversarial
Network for Symbolic-domain Music Generation.
URL: http://arxiv.org/abs/1703.10847
Yuksel, A., Karci, M. and Uyar, A. (2011). Automatic music generation using evolutionary algorithms
and neural networks, pp. 354–358.
10

More Related Content

Similar to Nithin Xavier research_proposal

Automatic Music Generation Using Deep Learning
Automatic Music Generation Using Deep LearningAutomatic Music Generation Using Deep Learning
Automatic Music Generation Using Deep Learning
IRJET Journal
 
AI THROUGH THE EYES OF ORGANISE SOUND
AI THROUGH THE EYES OF ORGANISE SOUNDAI THROUGH THE EYES OF ORGANISE SOUND
AI THROUGH THE EYES OF ORGANISE SOUND
Jaideep Ghosh
 
IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...
IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...
IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...
IRJET Journal
 
Analysis Synthesis Comparison
Analysis Synthesis ComparisonAnalysis Synthesis Comparison
Analysis Synthesis Comparison
Jim Webb
 
Extraction and Conversion of Vocals
Extraction and Conversion of VocalsExtraction and Conversion of Vocals
Extraction and Conversion of Vocals
IRJET Journal
 
MUSZIC GENERATION USING DEEP LEARNING PPT.pptx
MUSZIC GENERATION USING DEEP LEARNING  PPT.pptxMUSZIC GENERATION USING DEEP LEARNING  PPT.pptx
MUSZIC GENERATION USING DEEP LEARNING PPT.pptx
life45165
 
Application of Recurrent Neural Networks paired with LSTM - Music Generation
Application of Recurrent Neural Networks paired with LSTM - Music GenerationApplication of Recurrent Neural Networks paired with LSTM - Music Generation
Application of Recurrent Neural Networks paired with LSTM - Music Generation
IRJET Journal
 
2012 a rebeloijmir
2012 a rebeloijmir2012 a rebeloijmir
2012 a rebeloijmir
Miguel Ponce
 
Applsci 08-00606-v3
Applsci 08-00606-v3Applsci 08-00606-v3
Applsci 08-00606-v3
IsraelEbonko
 
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET Journal
 
AUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORK
AUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORKAUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORK
AUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORK
Jennifer Roman
 
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...
TELKOMNIKA JOURNAL
 
Collins
CollinsCollins
Collins
anesah
 
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUES
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUESCONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUES
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUES
AM Publications
 
Modeling of Song Pattern Similarity using Coefficient of Variance
Modeling of Song Pattern Similarity using Coefficient of VarianceModeling of Song Pattern Similarity using Coefficient of Variance
Modeling of Song Pattern Similarity using Coefficient of Variance
Gobinda Karmakar ☁
 
Emofy
Emofy Emofy
IRJET- A Personalized Music Recommendation System
IRJET- A Personalized Music Recommendation SystemIRJET- A Personalized Music Recommendation System
IRJET- A Personalized Music Recommendation System
IRJET Journal
 
Wilkie
WilkieWilkie
Wilkie
anesah
 
The kusc classical music dataset for audio key finding
The kusc classical music dataset for audio key findingThe kusc classical music dataset for audio key finding
The kusc classical music dataset for audio key finding
ijma
 
survey on Hybrid recommendation mechanism to get effective ranking results fo...
survey on Hybrid recommendation mechanism to get effective ranking results fo...survey on Hybrid recommendation mechanism to get effective ranking results fo...
survey on Hybrid recommendation mechanism to get effective ranking results fo...
Suraj Ligade
 

Similar to Nithin Xavier research_proposal (20)

Automatic Music Generation Using Deep Learning
Automatic Music Generation Using Deep LearningAutomatic Music Generation Using Deep Learning
Automatic Music Generation Using Deep Learning
 
AI THROUGH THE EYES OF ORGANISE SOUND
AI THROUGH THE EYES OF ORGANISE SOUNDAI THROUGH THE EYES OF ORGANISE SOUND
AI THROUGH THE EYES OF ORGANISE SOUND
 
IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...
IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...
IRJET- Music Genre Classification using Machine Learning Algorithms: A Compar...
 
Analysis Synthesis Comparison
Analysis Synthesis ComparisonAnalysis Synthesis Comparison
Analysis Synthesis Comparison
 
Extraction and Conversion of Vocals
Extraction and Conversion of VocalsExtraction and Conversion of Vocals
Extraction and Conversion of Vocals
 
MUSZIC GENERATION USING DEEP LEARNING PPT.pptx
MUSZIC GENERATION USING DEEP LEARNING  PPT.pptxMUSZIC GENERATION USING DEEP LEARNING  PPT.pptx
MUSZIC GENERATION USING DEEP LEARNING PPT.pptx
 
Application of Recurrent Neural Networks paired with LSTM - Music Generation
Application of Recurrent Neural Networks paired with LSTM - Music GenerationApplication of Recurrent Neural Networks paired with LSTM - Music Generation
Application of Recurrent Neural Networks paired with LSTM - Music Generation
 
2012 a rebeloijmir
2012 a rebeloijmir2012 a rebeloijmir
2012 a rebeloijmir
 
Applsci 08-00606-v3
Applsci 08-00606-v3Applsci 08-00606-v3
Applsci 08-00606-v3
 
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
IRJET- Implementation of Emotion based Music Recommendation System using SVM ...
 
AUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORK
AUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORKAUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORK
AUTOMATED MUSIC MAKING WITH RECURRENT NEURAL NETWORK
 
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit...
 
Collins
CollinsCollins
Collins
 
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUES
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUESCONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUES
CONTENT BASED AUDIO CLASSIFIER & FEATURE EXTRACTION USING ANN TECNIQUES
 
Modeling of Song Pattern Similarity using Coefficient of Variance
Modeling of Song Pattern Similarity using Coefficient of VarianceModeling of Song Pattern Similarity using Coefficient of Variance
Modeling of Song Pattern Similarity using Coefficient of Variance
 
Emofy
Emofy Emofy
Emofy
 
IRJET- A Personalized Music Recommendation System
IRJET- A Personalized Music Recommendation SystemIRJET- A Personalized Music Recommendation System
IRJET- A Personalized Music Recommendation System
 
Wilkie
WilkieWilkie
Wilkie
 
The kusc classical music dataset for audio key finding
The kusc classical music dataset for audio key findingThe kusc classical music dataset for audio key finding
The kusc classical music dataset for audio key finding
 
survey on Hybrid recommendation mechanism to get effective ranking results fo...
survey on Hybrid recommendation mechanism to get effective ranking results fo...survey on Hybrid recommendation mechanism to get effective ranking results fo...
survey on Hybrid recommendation mechanism to get effective ranking results fo...
 

Recently uploaded

一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 

Recently uploaded (20)

一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 

Nithin Xavier research_proposal

  • 1. Composing Monophonic World Music Using Deep Learning Nithin Xavier x17110530 MSc in Data Analytics 14th August 2018 Keywords Music Generation; Computer Composing; Neural Networks; Deep Learning. Abstract: Music is considered as a universal language, loved by all. We enjoy music of languages, without any language barrier since music defines itself as a medium which connects the mind and the soul. Computer generated music is a relatively new term and the associated domain is still in its infancy. Still, there have been good result yielding researches in this field albeit there are limited researches. Emulating a human composer is the task of the system proposed to generate good sounding music. Alan Turings theory can be applied here that when a human cannot distinguish between a computer-generated music and a human composers work, that computer music generation system will be deemed as perfect. The current problems of music composers to generate long hours of music which are normally played in airports, restaurants, flights, malls and other public places can be effectively solved by neural networks. These neural networks can also synthesize Sleep Music, which also is a growing genre is becoming popular with people having sleeping problems or discomforts. We generalize these problems and strive to develop a model to compose monophonic world music using a good training dataset and a novel application of musical theory knowledge. 1
  • 2. Contents 1 Introduction 3 2 State of the Art 3 2.1 Filetype of Dataset used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Model Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3 Research Question 5 4 Proposed Approach 5 5 Proposed Implementation 7 6 Proposed Evaluation 8 7 Conclusion 9 2
  • 3. 1 Introduction Composition of music is considered a creative and innovative work. It involves application of music principles like chord progressions, scales, harmony, dynamics, etc. Even though a composer ought to follow these musical rules or theory, sometimes a composer may introduce out of the box arrangements by making changes in chord progressions or by inducing accidental notes and styles. Also, due to the fact that there are lot of music already available and under development, the composer has to be alert to avoid the chances of his/her music to be similar to another already available music. There are seven main notes in a scale ranging from C to B. The higher octave C will constitute the 8th note forming the C scale in music. For every note there are sharp notes or flat notes except for two notes. This same arrangement of keys are repeated to form higher or lower octaves. Hence, there are only 12 distinct notes in music. Composers have only these 12 notes to compose any music. The permutations and combinations of these 12 notes is what we find in every music produced or arranged. Sometimes, after composing a song, the composer may find a similar tune matching his/her work and is in a position to change the complete affected sequence. The musical rules and theory can be fed into neural networks to generate a unique sequence of musical arrangement which form the novelty in this project. Computer systems can learn effectively better than humans if properly trained. Computer generated music have been researched since long, but there are only a few researches in this field, owing to the lack of musical training or expertise and also due to the lack of interest shown by sound technologists and musicians. Recurrent neural networks are employed in many researches because of its good performance in this arena. Through this research we address the problems of the music composers such as composing long hours of music for meditation and leisure music to be played at public places like airports, flights, malls, etc. Composing music which has a duration for more than 2-3 hours can be time consuming and exhaustive experience. These music may contain similar patterns throughout which can be generated effectively by neural networks by training the system well. Hence, the research undertaken with this regard has the following research question: How effectively can deep learning techniques generate unique monophonic world music based on a music dataset and musical theory?. We address this research question to propose an appropriate methodology, implementation and expected result evaluation. In the next section we talk about the various researches carried out related to our proposed objective. 2 State of the Art In this section we analyse, and survey various research papers related to our research objective. The following first sub-section under consideration is the format or type of the dataset used in the researches of the referenced papers. 2.1 Filetype of Dataset used Since our research question seeks solution to develop a monophonic music sample based on a contem- porary music dataset and music theory rules, the processing of the dataset and the output format of the music generated plays a crucial role. The type of the dataset used for the processing in neural networks decides the quality of the output of generated music and the complexity of neural networks. 
There are two main types of audio formats which are MIDI (Musical Instrument Digital Interface Format) and WAV (Waveform Audio File Format) filetypes. If we analyse the type of data used in the research pa- pers of related work, we can observe that there is a clear dominance of the usage of MIDI format for computer music generation. The following are the citations of research works which have used MIDI filetype: Madhok et al. (2018), Mao (2018), Yang et al. (2017), Sabathe et al. (2017), Liang (2016), Lyu et al. (2015), Goel et al. (2014), Chung et al. (2014), Roig et al. (2014), Boulanger-Lewandowski et al. (2012), Yuksel et al. (2011), Oliwa and Wagner (2008), Cameron (2001) and Masako and Kazuyuki (1992). This shows that from the early 90s till the present time, MIDI filetype have been the preferred choice of researchers in this domain. The main reason of this choice is because of low complexity in computation and the output is recognizable and editable in most of the digital audio workstations or music production softwares. MIDI format contains information like the pitch, velocity and many other information regarding the played notation. The main advantage is that instrument of the music gener- ated can be changed to any desirable, high quality virtual instrument to increase the sound quality of the final output. The disadvantage of the direct playback of MIDI files is that the sound quality of the instrument is substandard, and poor compared to all other formats. Now, well see the next filetype used for the researches in this domain. 3
  • 4. WAV is the preferred filetype in music production scenarios because of the feature of lossless audio. But they face a drawback of a significantly large file size than compared to the MIDI format. Engel et al. (2017) and Ma´ndziuk et al. (2016) have used the WAV format in their researches. These files cannot be edited by note but will have to be edited by the waveform which is a more complex task. Because of the large file size, the computational complexity is high, and it will require superior level GPUs to execute the machine learning algorithms. The next filetype concerns extraction of information related to sheet music to train the system. Lichtenwalter and Lichtenwalter (2009) has used MusicXML format to garner information regarding the notes, time signature, pitch, dynamics and other musical parameters to feed into the system for teaching the system about chord progressions and other music theory for effective music generation. While, Eck and Schmidhuber (2002) have directly extracted musical information from sheet music to train the system of the chords and the sequence of notes to be played. Extracting information from sheet music is not as effective as the two commonly playable audio formats mentioned before which are MIDI and WAV. Hence, most of the researches in the domain of computerised music generation or music composing utilizes these filetypes. 2.2 Model Generation The model generation or the development of neural network is the major part of this project which learns and trains itself to produce monophonic music samples as per our objective. This section aims to discuss and compare the various methods used for the network generation. Madhok et al. (2018) in their research have recorded 7 major human emotions and then as per the detected emotion, generated music apt for that observed scenario and emotion using dual layer Long Short Term Memory Network (LSTM) architecture. The evaluation of this work was performed using a correlation between the facial expression detected and the probability that the resulting music falls in the same section. This correlation resulted in 0.93, which proves to be a good score. Mao (2018) have used a dual axis LSTM architecture wherein one axis provides provision for the desired time of generated music and the other axis facilitates the output of the desired notes. By the addition of style and volume features the music production quality was enhanced. The evaluation of this approach was done by a statistical hypothesis with the level of significance 0.05 and the value z = 0.945 conveys that the classification precision of human composers and the proposed approach was almost similar. Solutions to three different music generation problems like harmonization, chord inversion and voicings and chord estimation were achieved by Kitahara (2017) by using Bayesian Networks. Yang et al. (2017) have implemented model based on Convolutional Neural Networks (CNN) and Generative Adversarial Network (GAN) in which information regarding the previ- ous bar and the sequence of chord structure is incorporated and have produced similar results as others. The drawback of this model is highlighted by the absence of consideration to velocity and musical pauses which makes the music produced to be aligned towards artificial music. Sabathe et al. (2017) used LSTM networks for music generation with optimized parameters like 167 units of LSTM for both the decoding and encoding functions and 23 steps to perform sequential automatic encoding. 
The major drawback in this approach was that production of music pieces longer than the trained samples could not be generated. Music theory and other features of music were utilised more effectively by Ma´ndziuk et al. (2016) in which the authors developed a combined algorithm consisting of a genetic algorithm and local optimiza- tion which captures all necessary technicalities of music theory to produce aesthetically and theoretically superior music.Liang (2016) have developed a sequential LSTM network where they train the system to produce good quality music without much training of musical theory concepts. Lyu et al. (2015) have done an amalgamation of LSTM units to Recurrent Temporal Restricted Boltzmann Machine (RTRBM) and have secured average results which were caused by the absence of optimization techniques. Goel et al. (2014) used RNN with two layers of Restricted Boltzmann Machine for sequence modelling to produce music whose results are only at par with the other researches owing to lack of optimization methods and pretraining of musical theory.Chung et al. (2014) shows that LSTM and GRU units fare better in LSTM networks as opposed to tanh unit in the applications of raw speech and polyphonic music generation. Eck and Schmidhuber (2002) facilitated learning chord sequences and melody sequences to input the learnt information to LSTM networks. Hence, a majority of the researchers have used recurrent neural networks, more specifically LSTM networks to be able to learn and generate music pieces based on an input work and musical technicalities. 4
3 Research Question

The research question for this project is as follows: How effectively can deep learning techniques generate unique monophonic world music based on a music dataset and musical theory? This falls under the domain of computer music generation. Neural networks have been successful in producing good musical recordings, as seen in the literature review in the previous section. In computerised music generation it is necessary to evaluate and learn from the previous notes and chords played in a musical sequence; hence we use recurrent neural networks (RNNs), which have recurrent loops over their nodes. LSTM units have been shown to be the best-performing RNN unit in the referenced papers. For the effective implementation of our research objective, we additionally introduce a novel use of music theory: scales, time signatures, chord progressions, per-note velocity conventions for each genre, and accidental chords and progressions that may be introduced unexpectedly into a piece. This information is contained in the MusicXML format and is extracted into the system along with the MIDI information. The related literature observes that musical knowledge can complement and improve a music generation system, and the researchers suggest collaboration between music experts and computer experts. Since I am a musician and have the requisite music theory knowledge to train the neural networks, this project is feasible and can improve upon the other research in this domain.

4 Proposed Approach

In this section the approach to be followed for the given research question is detailed and an overview of the complete picture is given. We propose to implement recurrent neural networks (RNNs) to analyse global contemporary music from a MIDI dataset and recreate similar sequences of music. We use MIDI files as input to the network to train it on known melody lines and musical sequences, and test for the probability of producing similar but unique music. The dataset used is the Lakh MIDI Dataset v1.0, available at http://colinraffel.com/projects/lmd/#get. It contains around one lakh (100,000) MIDI files of songs listed in the Million Song Dataset, so this dataset is a subset of the Million Song Dataset containing contemporary global music. MIDI files contain only information about the notes played, the timing of the notes, the time signature, dynamics and velocity. In addition, we use information related to music theory, such as chord progressions, scales, accidental note and chord usage and other musical rules, in the MusicXML format, to be fed into the recurrent neural network; this information is present as sheet music in that format and will be extracted and used to train the proposed system (a short parsing sketch follows the list of design constraints below). The desired design constraints of the network are as follows:

Time Signature: The recurrent neural network should be able to identify the current playing position with reference to the musical time signature. The time signature refers to the number of beats occurring in one bar of music. In common time there are four beats in one bar, denoted by 4/4. Likewise, there are many other time signatures such as 3/4, 5/8, 6/8 and 7/8.

Invariance in Notes: The music should be independent of the octave. Changing octaves should not affect the basic note, chord structure or progressions.

Repetition of Notes: Sustaining one note over two bars should be distinguished from playing that same note twice.
Invariance in Time: The network should be free to generate music independent of a fixed time frame, like an ad lib, as it is called in musical terms.

Accidental Note and Chord Changes: Accidental notes or chords can be described as out-of-the-box devices that do not feature in standard music theory. They can be innovative and enriching to hear if used correctly under some developed rules.
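As referenced above, the following is a minimal sketch of how the note, timing, velocity and time-signature information in a Lakh MIDI file could be read into Python using the pretty_midi library; the file path is an illustrative assumption rather than a specific file from the dataset.

```python
import pretty_midi

# Load one file from the Lakh MIDI Dataset (the path is illustrative).
midi = pretty_midi.PrettyMIDI("lmd_full/0/example_song.mid")

# Time signature changes, e.g. 4/4 at time 0.0 for common time.
print(midi.time_signature_changes)

# A rough global tempo estimate for the piece.
print(midi.estimate_tempo())

# Collect (pitch, start, end, velocity) for every note of every pitched instrument.
notes = [(note.pitch, note.start, note.end, note.velocity)
         for instrument in midi.instruments if not instrument.is_drum
         for note in instrument.notes]
notes.sort(key=lambda n: n[1])  # chronological order for sequence modelling
```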
Figure 1: Network Design

The property of time invariance is achievable with an RNN. However, note invariance is not achievable with a plain RNN, because the fully connected layer has one node for every MIDI note: if we raise the pitch of every note by one half step, the output of the network will be entirely different from the desired output. This drawback can be resolved by borrowing an idea from convolutional neural networks (CNNs). In image recognition, the CNN kernel is applied across all pixels of the input image. Now suppose the CNN kernel is replaced with an RNN, so that the network consists of an RNN whose kernel is itself another RNN; this gives each cell, or pixel, a small neural network of its own. Applying this idea to our study, we replace the pixels in the analogy with notes, which are the main elements of our research. If we implement a stack of identical RNNs, one for each note, every note gets a neighbourhood of RNNs, and the neighbourhood RNNs of each note span one octave above and one octave below the note's normal pitch. Hence we achieve invariance in both time and notes. Because memory is retained about previous notes and sequences, we must also build a method to produce good, innovative chords; hence we divide the approach into two parts. A bi-axial recurrent neural network is suitable for meeting our research objective: in a bi-axial RNN, the first axis represents time and the other axis represents the note. The network design is shown in the figure. The inputs to the time-axis layer of the bi-axial RNN are as follows (an encoding sketch follows this list):

1. Note Value: The note value refers to the MIDI value, which describes the register of the played note, i.e. whether it lies in a lower or a higher register.

2. Pitch: This refers to the pitch class of the played note, where the note A has the value 0 and the value increases by one for every half-step increase in pitch.

3. Scale: The scale refers to a sequence of notes following certain musical rules. There are many scales in music theory that can be input to the system to emulate world music without mistakes.

4. Previous State: This input tells the network how the particular note was played during the previous time step.

5. Rhythm: This input lets the network understand the position of the current note with respect to the measure and the time signature.
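As referenced above, here is a minimal NumPy sketch of how these five per-note inputs could be encoded into a single feature vector for one time step. The exact dimensionality and encoding (one-hot pitch class, a 16th-note beat position, an in-scale flag) are assumptions made for illustration, not the layout of the actual implementation.

```python
import numpy as np

BEATS_PER_BAR = 16  # assume a 16th-note grid over a 4/4 bar

def note_input_vector(midi_pitch, prev_played, prev_articulated,
                      scale_pitch_classes, beat_index):
    """Build the input vector for one note at one time step (illustrative)."""
    pitch_class = (midi_pitch - 21) % 12          # MIDI 21 is A0, so A maps to 0
    features = [midi_pitch / 127.0]               # 1. note value / register

    pc_one_hot = [0.0] * 12                       # 2. pitch class, one-hot
    pc_one_hot[pitch_class] = 1.0
    features += pc_one_hot

    features.append(1.0 if pitch_class in scale_pitch_classes else 0.0)  # 3. scale membership

    features.append(1.0 if prev_played else 0.0)        # 4. previous state: was the note
    features.append(1.0 if prev_articulated else 0.0)   #    played / re-articulated last step?

    beat_one_hot = [0.0] * BEATS_PER_BAR          # 5. rhythm: position within the bar
    beat_one_hot[beat_index % BEATS_PER_BAR] = 1.0
    features += beat_one_hot

    return np.asarray(features, dtype=np.float32)

# Example: middle C (MIDI 60) in a C major scale, on the first beat of a bar.
c_major = {0, 2, 3, 5, 7, 8, 10}  # pitch classes of C major when A = 0
vec = note_input_vector(60, prev_played=True, prev_articulated=False,
                        scale_pitch_classes=c_major, beat_index=0)
```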
Along the time axis, LSTM layers with recurrent loops over time form the first hidden layers. The other axis, the note axis, runs over the notes from the low registers up to the high registers. After the last LSTM layer, a final fully connected, non-recurrent layer outputs two kinds of probabilities:

1. The probability of each note being played.

2. Given that a note is played, the probability that the note is articulated, which also forms one of the outputs of the non-recurrent layer.

Processing of the musical output: The MIDI file generated from the output can then be fed into music production software and edited to change the instrument. Since the default MIDI playback quality is poor, high-quality third-party virtual instruments can be used to provide scoring-level quality. This option is available because the MIDI format represents information about notes, velocity, pitch and other musical parameters.

5 Proposed Implementation

Our proposed music generation model will be implemented in the Python programming language. In particular, we will use the Python library Theano, which simplifies the computational work and provides flexibility in the network architecture. The step-by-step implementation is described below.

Random short segments of the MIDI files are fed into the recurrent neural network during training. The cross-entropy cost is obtained from the probabilities of all the outputs: the probabilities are log-transformed and negated, and the result is fed as the cost to the AdaDelta optimizer for the optimization of the weights. The time-axis layers are trained by batching all notes together, and the note-axis layers are trained by batching all time steps together; the processing unit of the computer is better utilised because of its ability to multiply large matrices.

Dropout is used in our network to reduce overfitting. Applying dropout in each layer eliminates 50% of the hidden nodes on each update: the output of each layer is multiplied by a mask, so the dropped nodes are removed by multiplying their output by zero. This encourages specialization and prevents nodes from relying on weak dependencies. We then multiply the output of every node by 0.5 as a correction factor, to compensate for the larger number of active nodes when dropout is not applied.

For training the model we use Amazon Web Services (AWS) instances; we use the cheaper spot instances, which cost around 10 to 15 US cents per hour. Our proposed model consists of two hidden note-axis layers and two hidden time-axis layers. The two note-axis layers have 100 and 50 nodes respectively, and each of the two time-axis layers has 300 nodes. Training on all the MIDI files in our dataset is performed by choosing 8-count segments of the MIDI clips and batching them together.
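The following is a minimal NumPy sketch of the cost computation and the hyperparameters described above. It illustrates the negated log-probability (cross-entropy) cost fed to AdaDelta; it is not the actual Theano computation graph, and the variable names are assumptions.

```python
import numpy as np

# Hyperparameters as stated in the text above.
TIME_AXIS_LAYERS = [300, 300]   # two hidden LSTM layers along the time axis
NOTE_AXIS_LAYERS = [100, 50]    # two hidden LSTM layers along the note axis
DROPOUT_RATE = 0.5              # half of the hidden nodes dropped per layer
SEGMENT_LENGTH = 8              # 8-count MIDI segments batched for training

def cross_entropy_cost(play_prob, artic_prob, played, articulated, eps=1e-7):
    """Negated log-likelihood of the observed notes under the model's two
    output probabilities (play and articulation), averaged over the batch."""
    play_prob = np.clip(play_prob, eps, 1.0 - eps)
    artic_prob = np.clip(artic_prob, eps, 1.0 - eps)
    log_lik = played * np.log(play_prob) + (1.0 - played) * np.log(1.0 - play_prob)
    # The articulation term only contributes where the note is actually played.
    log_lik += played * (articulated * np.log(artic_prob)
                         + (1.0 - articulated) * np.log(1.0 - artic_prob))
    return -np.mean(log_lik)
```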
Figure 2: Decision Model (parsing of the MIDI dataset and the music theory dataset, dimension reduction to the final features, training and testing feature vectors, music generation, and evaluation)

6 Proposed Evaluation

The output of our music generation model will be evaluated by conducting an open survey, described in this section. For the survey we will select a group of 50 people, of whom 80 percent will have a musical background and 20 percent will not. There will be three sets of identification tasks, and in each set there will be three musical recordings. Of these three recordings, two will be composed by humans and one will be generated by our system. The participants will know these rules and will have to identify, from the three recordings, the piece generated by our system. They will also be given the option to write comments on each recording in each set.

We now describe the evaluation metrics used for this survey. The Recording Set Identifier denotes each of the three sets of recordings. The metric Incorrect Identification denotes the percentage of incorrectly identified recordings in each set, and Correct Identification denotes the percentage of correctly identified recordings in each set. We estimate that the percentage of correctly identified samples will range from 20 percent to below 40 percent. The estimated results are tabulated below: we anticipate the incorrect identification to be around 73.6 percent and the correct identification to be around 26.4 percent. The highlight of this survey is that, when asked to pick the computer-generated music out of the three recordings, participants will tend to pick the least pleasing and most inferior recording, because of the limited advances in the field of computer-generated music and the ability of humans to create world-class, superior music.
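The sketch below shows how the per-set and overall identification percentages in the table that follows could be computed from raw survey answers; the response data here is hypothetical and only illustrates the calculation.

```python
# Hypothetical survey answers: True means the participant correctly picked
# the computer-generated recording in that set (50 participants per set).
responses = {
    1: [True] * 17 + [False] * 33,
    2: [True] * 12 + [False] * 38,
    3: [True] * 10 + [False] * 40,
}

for set_id, answers in responses.items():
    correct = 100.0 * sum(answers) / len(answers)
    print(f"Set {set_id}: correct {correct:.1f}%, incorrect {100.0 - correct:.1f}%")

all_answers = [a for answers in responses.values() for a in answers]
overall = 100.0 * sum(all_answers) / len(all_answers)
print(f"Overall: correct {overall:.1f}%, incorrect {100.0 - overall:.1f}%")
```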
Recording Set Identifier   Incorrect Identification   Correct Identification
1                          66%                        34%
2                          75%                        25%
3                          80%                        20%
Total                      73.60%                     26.40%

Hence, the incorrect identification rate serves as the accuracy of this evaluation of our system, and we expect an accuracy of around 70 to 75 percent in this respect.

In the next evaluation we survey the correctness of the genre of each recording. The same participants are given 20 recordings from the three genres selected for this evaluation: recordings of these genres from the training dataset, together with generated recordings of the same genres, are selected for review by the participants. The metric used to review how well the generated data matches the training data is the similarity score; the mean of these scores is taken for each genre to obtain the similarity score over all recordings in that genre. The similarity score ranges from 1 to 5, where 1 denotes that the recordings sound very different from the training dataset and 5 denotes that they sound very similar. We expect a mean similarity score of 4.1 for the pop genre, since pop music is very familiar to most listeners, so there will not be much difficulty in judging differences in this genre; in addition, a more accurate model will increase the similarity score. The similarity score for jazz is expected to be lower than for the other genres because of the complexity of the music and the smaller number of followers of this genre. The following table shows the expected mean similarity score for each genre:

Genre   Mean
Pop     4.1
Jazz    3.8
Blues   3.9

Hence, we estimate the accuracy of the proposed music generation model from the survey reviews; the anticipated results are shown in the tables above. We expect a state-of-the-art model using the proposed methodology and implementation.

7 Conclusion

Hence, we have proposed the plan, or blueprint, of the research to be undertaken on composing monophonic music using neural networks. The plan aims to guide the research so that the project is completed within the three-month time frame and to give a better understanding during the actual implementation of the methodology. We anticipate state-of-the-art results compared with other research in this field, as discussed in the proposed evaluation, and we intend to detail the implementation steps and enhance the proposed model further after trying and testing different approaches.

References

Boulanger-Lewandowski, N., Bengio, Y. and Vincent, P. (2012). Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. URL: http://arxiv.org/abs/1206.6392

Cameron, B. B. (2001). System and Method for Automatic Music Generation using a Neural Network Architecture, 2(12).

Chung, J., Gulcehre, C., Cho, K. and Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, pp. 1–9. URL: http://arxiv.org/abs/1412.3555

Eck, D. and Schmidhuber, J. (2002). A First Look at Music Composition using LSTM Recurrent Neural Networks, IDSIA, pp. 1–11. URL: http://people.idsia.ch/~juergen/blues/IDSIA-07-02.pdf
Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K. and Norouzi, M. (2017). Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders. URL: http://arxiv.org/abs/1704.01279

Goel, K., Vohra, R. and Sahoo, J. K. (2014). Polyphonic music generation by modeling temporal dependencies using a RNN-DBN, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8681 LNCS: 217–224.

Kitahara, T. (2017). Music Generation Using Bayesian Networks, pp. 3–6. URL: http://www.kthrlab.jp/

Liang, F. (2016). BachBot: Automatic composition in the style of Bach chorales - Developing, analyzing, and evaluating a deep LSTM model for musical style, (August).

Lichtenwalter, R. and Lichtenwalter, K. (2009). Applying learning algorithms to music generation, Proceedings of the 4th, pp. 483–502. URL: http://www.cse.nd.edu/Reports/2008/TR-2008-10.pdf

Lyu, Q., Wu, Z., Zhu, J. and Meng, H. (2015). Modelling high-dimensional sequences with LSTM-RTRBM: Application to polyphonic music generation, IJCAI International Joint Conference on Artificial Intelligence 2015-January (IJCAI): 4138–4139.

Madhok, R., Goel, S. and Garg, S. (2018). SentiMozart: Music Generation based on Emotions, 2(ICAART): 501–506.

Mańdziuk, J., Woźniczko, A. and Goss, M. (2016). A Neuro-memetic System for Music Composing.

Mao, H. H. (2018). DeepJ: Style-Specific Music Generation, Proceedings - 12th IEEE International Conference on Semantic Computing, ICSC 2018, pp. 377–382.

Masako, N. and Kazuyuki, W. (1992). Interactive Music Composer Based on Neural Networks.

Oliwa, T. and Wagner, M. (2008). Composing music with Neural Networks and probabilistic finite-state machines, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 4974 LNCS: 503–508.

Roig, C., Tardón, L. J., Barbancho, I. and Barbancho, A. M. (2014). Automatic melody composition based on a probabilistic model of music style and harmonic rules, Knowledge-Based Systems 71: 419–434. URL: http://dx.doi.org/10.1016/j.knosys.2014.08.018

Sabathe, R., Coutinho, E. and Schuller, B. (2017). Deep recurrent music writer: Memory-enhanced variational autoencoder-based musical score composition and an objective measure, Proceedings of the International Joint Conference on Neural Networks 2017-May: 3467–3474.

Yang, L.-C., Chou, S.-Y. and Yang, Y.-H. (2017). MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation. URL: http://arxiv.org/abs/1703.10847

Yuksel, A., Karci, M. and Uyar, A. (2011). Automatic music generation using evolutionary algorithms and neural networks, pp. 354–358.