Automatic Music Transcription
for Polyphonic Music
CS599 Deep Learning Final Project
By the Keen Quartet team
Guided by Prof. Joseph Lim
& Artem Molchanov
Project Overview
● Attempt to design a system that can transcribe music
● Musical piece characteristics:
○ Has multiple musical sources (many instruments, vocals)
○ Each instrument's part is polyphonic (more than one note at a given time)
● Motivation:
○ Make it easier for amateur musicians to learn to play an instrument
Approach
Challenges
○ Polyphonic music: multiple notes per time frame → an exponential number of note combinations → a difficult learning problem
○ Multiple instruments and vocals → a separate model is needed to transcribe each instrument
● We address these challenges by incorporating:
○ Separation of music piece into its sources
■ Current focus: separating vocals and background instruments only
○ Identify the predominant instrument in each source and transcribe it accordingly
○ Currently, we focus on transcription of piano music only
Our Project Pipeline
Input music file → Source Separation → multiple files after source separation (e.g., voice, piano) → Predominant Instrument Identification → Transcription → notes for each source
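To make the data flow concrete, here is a minimal sketch of the pipeline glue in Python; separate_sources, identify_instrument, and transcribe are hypothetical stand-ins for the three models described on the following slides.

```python
# Hypothetical pipeline glue; the three helper functions stand in for the
# source separation, instrument identification, and transcription models
# described on the following slides.
def transcribe_music(mix_wav_path):
    stems = separate_sources(mix_wav_path)              # e.g. {"voice": ..., "accompaniment": ...}
    scores = {}
    for name, stem_wav in stems.items():
        instrument = identify_instrument(stem_wav)       # predominant-instrument CNN (11 classes)
        scores[name] = transcribe(stem_wav, instrument)  # instrument-specific transcription
    return scores                                        # note sequences per separated source
```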
Source Separation
Goal
● Separate out different musical sources
○ Sources: voice, various instruments, etc.
● Separating multiple instruments is a highly complex task:
○ Needs labels for each source type
○ Requires tuning the loss function
● We focus on separating two sources: vocals and accompanying instruments
● Input: a spectrogram of the mixed audio signal (see the sketch after this slide)
● Output: two audio files, one per separated source
● Dataset used: MIR-1K
Difficult to retrieve
Image source:
http://www.cs.northwestern.edu/~pardo/courses/eecs352/lectures/source%20separation.pdf
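A minimal sketch of the spectrogram front end assumed above, using librosa; the sampling rate and STFT parameters (n_fft, hop_length) are illustrative choices, not the project's exact settings.

```python
import numpy as np
import librosa

def mixture_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=512):
    # Load the mixed recording and compute its short-time Fourier transform.
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    # The separation model operates on the magnitude; the phase of the mixture
    # is kept so each separated magnitude can be inverted back to audio.
    return np.abs(stft), np.angle(stft)
```

After separation, each estimated source magnitude is recombined with the mixture phase and inverted with librosa.istft to produce the two output audio files.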
Source Separation - Our Approach
● LSTM-based approach (sketched after this slide)
● Two dense layers: one to capture each source
● Masking layer:
○ Normalizes the outputs of the dense layers
○ Masks out the other source from the mixed spectrogram
● Joint training of:
○ Network parameters
○ Output of the masking layer
● Discriminative training:
○ Increase the difference between:
■ Predicted vocals and actual instruments
■ Predicted instruments and actual vocals
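A minimal sketch of the separator described above, loosely following the recurrent masking network of Huang et al. (2014) [1]; it is written in PyTorch, and the layer sizes, the ReLU non-linearity, and the gamma weight are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskingSeparator(nn.Module):
    def __init__(self, n_bins=513, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.vocal_head = nn.Linear(hidden, n_bins)   # dense layer capturing source 1
        self.instr_head = nn.Linear(hidden, n_bins)   # dense layer capturing source 2

    def forward(self, mix_mag):                       # mix_mag: (batch, time, n_bins)
        h, _ = self.lstm(mix_mag)
        y1 = F.relu(self.vocal_head(h))
        y2 = F.relu(self.instr_head(h))
        # Masking layer: normalize the two estimates into a soft time-frequency
        # mask, then mask the other source out of the mixed spectrogram.
        mask = y1 / (y1 + y2 + 1e-8)
        return mask * mix_mag, (1.0 - mask) * mix_mag

def discriminative_loss(v_hat, i_hat, v_true, i_true, gamma=0.05):
    # Fit each predicted source to its target while pushing it away from the
    # other source's target (the discriminative term).
    return (F.mse_loss(v_hat, v_true) + F.mse_loss(i_hat, i_true)
            - gamma * (F.mse_loss(v_hat, i_true) + F.mse_loss(i_hat, v_true)))
```

Because the mask is applied inside the forward pass, the network parameters and the masking output are optimized jointly.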
Source Separation: Results
[Figure: spectrograms of the original mixture and the separated music-only and voice tracks, shown to illustrate the effectiveness of our model]
Predominant Instrument Identification
Goal
Identify the predominant instrument in each file obtained from source separation
Why? Transcription is instrument-specific, so it is important to know the instrument before transcribing
Approach
● Train a CNN model on ~6000 audio files to learn instrument-specific patterns
● 11 instrument categories for training
Input: .wav files obtained from the previous step
Output: label of the predominant instrument
Dataset: IRMAS
Predominant Instrument Identification
Model
[Figure: CNN architecture diagram]
Results
● Initially very low accuracy (15%)
○ Why? Too little training and larger-than-usual input images (43 × 128)
● Improved to ~60% accuracy with batch normalization and more training (150 epochs with early stopping), as in the sketch below
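A minimal PyTorch sketch of a CNN of the kind described above, taking 43 × 128 spectrogram patches and predicting one of the 11 IRMAS instrument classes, with batch normalization; the filter counts and kernel sizes are illustrative assumptions, not our exact architecture.

```python
import torch.nn as nn

class InstrumentCNN(nn.Module):
    def __init__(self, n_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                          # 43 x 128 -> 21 x 64
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                          # 21 x 64 -> 10 x 32
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 10 * 32, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes),                # logits over the 11 instrument classes
        )

    def forward(self, x):                             # x: (batch, 1, 43, 128)
        return self.classifier(self.features(x))      # train with nn.CrossEntropyLoss
```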
Automatic Transcription for Polyphonic Music
Goal
Obtain transcription (note representation) of music
Seems easy: a one-to-one mapping between notes and notation
But it is not. Why?
● Polyphonic music: several notes can play at the same time
● Exponentially many note combinations per time frame (an 88-key piano allows 2^88 possible combinations)
● Multiple instruments: a separate model is needed for each instrument, since the loss function differs for each
Currently, we focus on piano music
The same approach works for any instrument:
● Given a good dataset
● With sufficient training and a proper loss function
Automatic Transcription for Polyphonic Music
Approach:
● Train a ConvNet model on polyphonic piano music
● Used the MAPS dataset:
○ 45 GB of audio files, around 60 hours of recordings
○ Processed about 6 million time frames
Approach 1:
● Use the whole dataset
○ Computationally intensive
○ Trained for 7 epochs before early stopping
Approach 2:
● Iterative training, using one category at a time (outlined below)
○ Trained for 63, 20, 7, 7, and 7 epochs
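A hypothetical outline of the iterative schedule in Approach 2: the same model is fine-tuned on one MAPS category at a time, using the per-category epoch counts from this slide; maps_categories, make_loader, and train_one_epoch are placeholder helpers, not the project's actual code.

```python
# Hypothetical training schedule for Approach 2; maps_categories, make_loader,
# and train_one_epoch are placeholders for the project's actual data handling.
epochs_per_category = [63, 20, 7, 7, 7]
for category, n_epochs in zip(maps_categories, epochs_per_category):
    loader = make_loader(category)                    # frames from one MAPS category
    for _ in range(n_epochs):
        train_one_epoch(model, loader, optimizer)     # fine-tune the same model
```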
The model outputs a probability for each note at each time frame; we infer the notes being played by thresholding these probabilities (see the sketch after this slide).
Result: ~96% accuracy
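A minimal PyTorch sketch of frame-wise transcription in the spirit of the ConvNet acoustic model of Sigtia et al. (2016) [6]: each input is a short context window of spectrogram frames, the output is one sigmoid per piano key, and the 0.5 threshold, layer sizes, and 229-bin input are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class TranscriptionCNN(nn.Module):
    def __init__(self, n_bins=229, context=7, n_notes=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * context * (n_bins // 4), 512), nn.ReLU(),
            nn.Linear(512, n_notes),                  # one logit per piano key
        )

    def forward(self, x):                             # x: (batch, 1, context, n_bins)
        return self.fc(self.conv(x))                  # train with nn.BCEWithLogitsLoss

def notes_at_frame(model, frames, threshold=0.5):
    # Independent per-note probabilities, then a fixed threshold to decide
    # which notes are playing in each frame (one piano-roll slice per input).
    probs = torch.sigmoid(model(frames))              # (batch, 88)
    return probs > threshold
```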
Learning outcome
● Explored a domain completely new to us, as beginners in deep learning
● Our pipeline has three different models, one for each step, all using a deep learning approach. This required an extensive literature survey, implementation, and training effort for each, and each model is trained on a different dataset
● Attempted to build on existing concepts in each part:
○ Source separation: LSTM, discriminative training
○ Predominant instrument identification: batch normalization
○ Transcription: different training approaches for better generalization
Summary
● Our system is divided into three components:
Source separation → predominant instrument identification → transcription
● A first attempt to transcribe polyphonic music with multiple instruments using deep learning techniques
● Future directions:
○ Extend source separation to multiple instruments
○ Make the transcription model more flexible
References
1. Huang, Po-Sen, et al. “Singing-Voice Separation from Monaural Recordings using Deep Recurrent Neural
Networks.” ISMIR. 2014.
2. Chandna, Pritish, et al. “Monoaural audio source separation using deep convolutional neural networks.”
International Conference on Latent Variable Analysis and Signal Separation. Springer, Cham, 2017.
3. MIR-1K dataset: Chao-Ling Hsu, DeLiang Wang, Jyh-Shing Roger Jang, and Ke Hu, “A Tandem Algorithm for Singing Pitch Extraction and Voice Separation from Music Accompaniment,” IEEE Trans. Audio, Speech, and Language Processing, 2011.
4. Han, Yoonchang, et al. “Deep convolutional neural networks for predominant instrument recognition in
polyphonic music.” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25.1 (2017):
208-221.
5. IRMAS Dataset: Bosch, J. J., Janer, J., Fuhrmann, F., & Herrera, P. “A Comparison of Sound Segregation Techniques
for Predominant Instrument Recognition in Musical Audio Signals”, in Proc. ISMIR (pp. 559-564), 2012.
6. Sigtia, Siddharth, Emmanouil Benetos, and Simon Dixon. “An end-to-end neural network for polyphonic piano
music transcription.” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.5 (2016):
927-939.
7. MAPS Dataset: V. Emiya, R. Badeau, and B. David, “Multi-pitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech and Language Processing, 2010.
Thank You...