Making Music with Machine Learning
Tyler Doll
Jan 29 · 9 min read
Image from https://www.maxpixel.net/Circle-Structure-Music-Points-Clef-Pattern-Heart-1790837
Music is not just an art; it is an expression of the human condition.
When an artist makes a song, you can often hear the emotions,
experiences, and energy they have in that moment. Music connects people
all over the world and is shared across cultures. So surely a computer
couldn't possibly compete with this, right? That's the question my
group and I asked when we chose our semester project for our Machine
Learning class. Our goal was to create something that would make the
listener believe they were hearing music created by a human. Personally,
I think we succeeded, but I'll let you be the judge (see the results
toward the bottom of this post).
Approach
To create music, we needed some way to learn the patterns and
behaviors of existing songs so that we could reproduce something that
sounded like actual music. All of us were interested in deep learning, so
we saw this as a perfect opportunity to explore the technology. To begin,
we researched existing solutions to this problem and came across a great
tutorial from Sigurður Skúli on how to generate music using Keras. After
reading the tutorial, we had a pretty good idea of what we wanted to do.
File format was important, as it would decide how we approached the
problem. The tutorial used MIDI files, so we followed suit, since they
are easy to parse and learn from (you can learn more about them here).
MIDI files gave us a couple of advantages: we could easily detect both
the pitch of a note and its duration.
But before we dove in and began building our network, we needed more
information on how music is structured and which patterns to consider.
For this we went to a good friend of mine, Mitch Burdick. He helped us
determine a few things about our approach and gave us a crash course in
basic music theory.
After our conversation we realized that the time step and sequence length
would be two important factors for our network. The time step determined
when we analyzed and produced each note, while the sequence length
determined how we learned patterns in a song. For our solution we chose a
time step of 0.25 seconds and 8 notes per time step. This corresponded to
a 4/4 time signature, which for us meant eight different sequences of
four notes. By learning these sequences and repeating them, we could
generate a pattern that sounded like actual music and build from there.
As a starting point we used the code from Skúli's tutorial; however, in
the end our implementation differed from the original in several ways:
Network architecture
Restricted to single key
Use of variable length notes and rests
Use of the structure/patterns of a song
Network Architecture
For our architecture we decided to lean heavily on Bidirectional Long Short-
Term Memory (BLSTM) layers. Below is the Keras code we used:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation, Dropout, Bidirectional

model = Sequential()
model.add(
    Bidirectional(
        LSTM(512, return_sequences=True),
        input_shape=(network_input.shape[1], network_input.shape[2]),
    )
)
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(512)))
model.add(Dense(n_vocab))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
Our thinking was that by using the notes both before and after a
particular spot in a song, we could generate melodies that sounded more
human. When listening to music, what came before often helps the
listener predict what comes next. There have been many times when I've
been listening to a song and could bob along to a particular beat because
I could predict what would come next. This is exactly what happens when a
song builds up to a drop: the song gets more and more intense, the
listener builds tension in anticipation, and there is a moment of relief
and excitement when the drop finally hits. By taking advantage of this,
we were able to produce beats that sounded natural and brought forth the
same emotions we have become accustomed to expecting in modern music.
For the number of nodes in our BLSTM layers we chose 512, as that was
what Skúli used. We experimented with this a little, but due to time
constraints we ended up sticking with the original number. The same goes
for the dropout rate of 30% (read more about dropout rates here). For
the activation function we chose softmax, and for our loss function we
chose categorical cross-entropy, as both work well for multi-class
classification problems such as note prediction (you can read more about
both of them here). Lastly, we chose RMSprop for our optimizer, as this
is recommended by Keras for RNNs.
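As a quick refresher on why categorical cross-entropy suits note prediction, here is a tiny worked example (the numbers are made up for illustration):

```python
import math

# One-hot target: the "true" next note is class 2 of a 4-note vocabulary.
target = [0.0, 0.0, 1.0, 0.0]
# Softmax output of the network for the same step.
predicted = [0.1, 0.2, 0.6, 0.1]

# Categorical cross-entropy is -sum(t * log(p)); with a one-hot target
# this reduces to -log of the probability assigned to the true class, so
# the loss shrinks as the network grows more confident in the right note.
loss = -sum(t * math.log(p) for t, p in zip(target, predicted))
```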
Key Restriction
An important assumption we made was that we would only use songs in the
same key: C major/A minor. By keeping every song we produced in the same
key, our output would sound more song-like, as the network would never
learn notes that would cause a song to go off key. To do this we used a
script from Nick Kelly, found here. This part was really simple but gave
us a huge improvement in our results.
Variable Length Notes and Rests
An important part of music is the dynamic and creative use of variable
length notes and rests. That one long note struck by the guitarist followed
by a peaceful pause can send a wave of emotion to the listener as we hear
the heart and soul of the player spilled out into the world. To capture this
we looked into ways of introducing long notes, short notes, and rests so that
we could create different emotions throughout the song.
In order to implement this we looked at the pitch and duration of a note and
treated this as a separate value we could input into our network. This meant
that a C# played for 0.5 seconds and a C# played for 1 second would be
treated as different values by the network. This allowed us to learn what
pitches were played longer or shorter than others and enabled us to
combine notes to produce something that sounded natural and fitting for
that part of the song.
Of course rests cannot be forgotten as they are crucial for guiding the
listener to a place of anticipation or excitement. A slow note and a pause
followed by a burst of quick firing notes can create a different emotion than
several long notes with long pauses between. We felt this was important in
order to replicate the experience the listener has when listening to a
relaxing Sunday afternoon song or a Friday night party anthem.
To achieve these goals we had to focus on our preprocessing. Again here we
started with the code from Skúli’s tutorial and adapted it to fit our needs.
from music21 import chord, note

notes = []
prev_offset = 0
for element in notes_to_parse:
    if isinstance(element, (note.Note, chord.Chord)):
        duration = element.duration.quarterLength
        if isinstance(element, note.Note):
            name = element.pitch
        else:
            name = ".".join(str(n) for n in element.normalOrder)
        notes.append(f"{name}${duration}")
        # fill any gap since the previous note with rest tokens
        rest_notes = int((element.offset - prev_offset) / TIMESTEP - 1)
        for _ in range(rest_notes):
            notes.append("NULL")
        prev_offset = element.offset
To elaborate on the code above: we create tokens by combining a note's
pitch and duration with a "$" before feeding them into our network. For
example, "A$1.0", "A$0.75", and "B$0.25" would all be encoded as
separate values. (Inputs are encoded by mapping each unique
note/duration combination to an integer, then dividing each integer by
the number of unique combinations, so that each one becomes a floating
point number between 0 and 1.) The more interesting part is calculating
how many rests to insert. We take the offset of the current note,
compare it to the offset of the last note we looked at, and divide the
gap by our time step to see how many rest notes fit (minus 1, because
this really calculates how many notes fit in the gap, and one of them is
the actual next note, which we don't want to double count). For example,
if one note started at 0.5 s and the next didn't start until 1.0 s, then
with a time step of 0.25 s (each note is played in 0.25 s intervals) we
would need one rest note to fill the gap.
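The encoding and the rest arithmetic described above can be sketched in a few lines of plain Python (the token names are illustrative):

```python
# Each unique "pitch$duration" token maps to an integer, which is then
# scaled to [0, 1); rests between notes become "NULL" tokens.
TIMESTEP = 0.25

notes = ["A$1.0", "NULL", "A$0.75", "B$0.25", "A$1.0"]
vocab = sorted(set(notes))                      # 4 unique tokens
note_to_int = {n: i for i, n in enumerate(vocab)}
encoded = [note_to_int[n] / float(len(vocab)) for n in notes]

# Rest calculation from the example above: a note at offset 0.5 s
# followed by one at 1.0 s leaves (1.0 - 0.5) / 0.25 - 1 = 1 rest.
rest_notes = int((1.0 - 0.5) / TIMESTEP - 1)
```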
Song Structure
Lastly, one of the most important parts of writing a song is its
structure, and this is one of the things we found lacking in existing
solutions. From what I have seen, most researchers hope their network
will learn this on its own, and I don't think that is a misguided
approach. However, it adds complexity to the problem and leads to
further difficulty. We took a more manual approach and assumed a
constant pattern, which could be a place to improve upon our solution.
One of the key assumptions we made is that we would only produce songs
that follow the specific pattern ABCBDB where:
A is the first verse
B is the chorus
C is the second verse
and D is the bridge
Initially we tried ABABCB, but this felt too formulaic. To resolve this,
we introduced a second verse that was different from the first but still
related: we generated the first verse from a random note, then generated
the second verse based on the first. Effectively, this generates a
single section that is twice as long and splits it in half. The thought
process was that if we create one verse, the second should fit the same
vibe, and by using the first as a reference we could achieve this.
import numpy

def generate_notes(self, model, network_input, pitchnames, n_vocab):
    """Generate notes from the neural network based on a sequence
    of notes."""
    int_to_note = {
        number + 1: note for number, note in enumerate(pitchnames)
    }
    int_to_note[0] = "NULL"

    def get_start():
        # pick a random sequence from the input as a starting point
        # for the prediction
        start = numpy.random.randint(0, len(network_input) - 1)
        return network_input[start]

    def generate_section(pattern):
        # predict 4 * SEQUENCE_LEN notes, sliding the input window by
        # one note after each prediction
        prediction_output = []
        for _ in range(4 * SEQUENCE_LEN):
            prediction_input = numpy.reshape(pattern, (1, len(pattern), 1))
            prediction_input = prediction_input / float(n_vocab)
            prediction = model.predict(prediction_input, verbose=0)
            index = numpy.argmax(prediction)
            prediction_output.append(int_to_note[index])
            pattern.append(index)
            pattern = pattern[1:]
        return pattern, prediction_output

    # generate verse 1 from a random seed, then seed verse 2 with the
    # final pattern of verse 1 so the two verses stay related
    verse1_pattern, verse1_prediction_output = generate_section(get_start())
    _, verse2_prediction_output = generate_section(verse1_pattern)

    # the chorus and bridge each start from their own random seed
    _, chorus_prediction_output = generate_section(get_start())
    _, bridge_prediction_output = generate_section(get_start())

    # assemble the song in the ABCBDB pattern described above
    return (
        verse1_prediction_output
        + chorus_prediction_output
        + verse2_prediction_output
        + chorus_prediction_output
        + bridge_prediction_output
        + chorus_prediction_output
    )
Results
We achieved surprising results with this approach: we could consistently
generate unique songs that fit the genre each respective network was
trained on. Below are some example outputs from our various networks.
Ragtime
Christmas
Rap
Conclusion
Music generation by machines is indeed possible. Is it, or could it ever
be, better than music made by humans? Only time will tell, but from
these results I would say it's definitely possible.
Future Work
Several improvements could be made that would bring this even closer to
true music. Some possible ideas/experiments include:
Learn patterns in songs rather than manually piecing together parts
Take note duration as a separate input to the network rather than
treating each pitch/duration separately
Expand to multiple instruments
Move away from midi files and produce/learn from actual MP3s
Learn the time step, sequence length, and time signature
Introduce randomness to emulate “human error/experimentation”
Allow for multiple keys
Learn how to use intros and outros
Acknowledgments
I would like to thank my teammates Izaak Sulka and Jeff Greene for their
help on this project as well as my friend Mitch Burdick for his expertise on
music that enabled us to get these great results. And of course we would like
to thank Sigurður Skúli for their tutorial as it gave us a great starting point
and something to reference. Last but not least I would like to thank Nick
Kelly for his script to transpose songs to C major.
The code for this project can be found here:
https://github.com/tylerdoll/music-generator
Disclaimer: the music used in our project does not belong to us and was
sourced from various public websites.