Automatic Music Transcription
for Polyphonic Music
CS599 Deep Learning Final Project
By the Keen Quartet team
Guided by Prof. Joseph Lim
& Artem Molchanov
Project Overview
● Attempt to design a system that can transcribe music
● Musical piece characteristics:
○ Has multiple musical sources (many instruments, vocals)
○ Each instrument's part is polyphonic (more than one note at a given time)
● Motivation:
○ Make it easier for amateur musicians to learn to play an instrument
Approach
Challenges
○ Polyphonic music: multiple notes per time frame → an exponential number of note combinations → a difficult learning problem
○ Multiple instruments and vocals → a separate model is needed to transcribe each instrument
● We address these challenges by incorporating:
○ Separation of music piece into its sources
■ Current focus: separating vocals and background instruments only
○ Identify the predominant instrument in each source and transcribe it accordingly
○ Currently, we focus on transcription of piano music only
Our Project Pipeline
Input music file → Source Separation → multiple files after source separation (e.g., voice, piano) → Predominant Instrument Identification → Transcription → notes for each source
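To make the data flow concrete, here is a minimal sketch of the pipeline glue in Python; separate_sources, identify_instrument, and transcribe are hypothetical stand-ins for the three models described on the following slides.

```python
# Hypothetical pipeline glue; the three helper functions stand in for the
# source separation, instrument identification, and transcription models
# described on the following slides.
def transcribe_music(mix_wav_path):
    stems = separate_sources(mix_wav_path)              # e.g. {"voice": ..., "accompaniment": ...}
    scores = {}
    for name, stem_wav in stems.items():
        instrument = identify_instrument(stem_wav)       # predominant-instrument CNN (11 classes)
        scores[name] = transcribe(stem_wav, instrument)  # instrument-specific transcription
    return scores                                        # note sequences per separated source
```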
Source Separation
Goal
● Separate out different musical sources
○ Sources: voice, various instruments, etc.
● Separating multiple instruments is a highly complex task:
○ Needs labels for each source type
○ Requires tuning the loss function
● We focus on separating two sources: vocals and accompanying instruments
● Input: a spectrogram of the mixed audio signal (see the sketch after this slide)
● Output: two audio files, one per separated source
● Dataset used: MIR-1K
Difficult to retrieve
Image source:
http://www.cs.northwestern.edu/~pardo/courses/eecs352/lectures/source%20separation.pdf
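A minimal sketch of the spectrogram front end assumed above, using librosa; the sampling rate and STFT parameters (n_fft, hop_length) are illustrative choices, not the project's exact settings.

```python
import numpy as np
import librosa

def mixture_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=512):
    # Load the mixed recording and compute its short-time Fourier transform.
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    # The separation model operates on the magnitude; the phase of the mixture
    # is kept so each separated magnitude can be inverted back to audio.
    return np.abs(stft), np.angle(stft)
```

After separation, each estimated source magnitude is recombined with the mixture phase and inverted with librosa.istft to produce the two output audio files.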
Source Separation - Our Approach
● LSTM-based approach (sketched after this slide)
● Two dense layers: one to capture each source
● Masking layer:
○ Normalizes the outputs of the dense layers
○ Masks out the other source from the mixed spectrogram
● Joint training of:
○ Network parameters
○ Output of the masking layer
● Discriminative training:
○ Increase the difference between:
■ Predicted vocals and actual instruments
■ Predicted instruments and actual vocals
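A minimal sketch of the separator described above, loosely following the recurrent masking network of Huang et al. (2014) [1]; it is written in PyTorch, and the layer sizes, the ReLU non-linearity, and the gamma weight are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskingSeparator(nn.Module):
    def __init__(self, n_bins=513, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.vocal_head = nn.Linear(hidden, n_bins)   # dense layer capturing source 1
        self.instr_head = nn.Linear(hidden, n_bins)   # dense layer capturing source 2

    def forward(self, mix_mag):                       # mix_mag: (batch, time, n_bins)
        h, _ = self.lstm(mix_mag)
        y1 = F.relu(self.vocal_head(h))
        y2 = F.relu(self.instr_head(h))
        # Masking layer: normalize the two estimates into a soft time-frequency
        # mask, then mask the other source out of the mixed spectrogram.
        mask = y1 / (y1 + y2 + 1e-8)
        return mask * mix_mag, (1.0 - mask) * mix_mag

def discriminative_loss(v_hat, i_hat, v_true, i_true, gamma=0.05):
    # Fit each predicted source to its target while pushing it away from the
    # other source's target (the discriminative term).
    return (F.mse_loss(v_hat, v_true) + F.mse_loss(i_hat, i_true)
            - gamma * (F.mse_loss(v_hat, i_true) + F.mse_loss(i_hat, v_true)))
```

Because the mask is applied inside the forward pass, the network parameters and the masking output are optimized jointly.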
Source Separation: Results
[Figure: spectrograms of the original mixture and the separated music-only and voice tracks, shown to illustrate the effectiveness of our model]
Predominant Instrument Identification
Goal
Identify the predominant instrument in each file obtained from source separation
Why? Transcription is instrument-specific, so it is important to know the instrument before transcribing
Approach
● Train a CNN model on ~6000 audio files to learn instrument-specific patterns
● 11 instrument categories for training
Input: .wav files obtained from the previous step
Output: label of the predominant instrument
Dataset: IRMAS
Predominant Instrument Identification
Model
[Figure: CNN architecture diagram]
Results
● Initially very low accuracy (15%)
○ Why? Too little training and larger-than-usual input images (43 × 128)
● Improved to ~60% accuracy with batch normalization and more training (150 epochs with early stopping), as in the sketch below
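A minimal PyTorch sketch of a CNN of the kind described above, taking 43 × 128 spectrogram patches and predicting one of the 11 IRMAS instrument classes, with batch normalization; the filter counts and kernel sizes are illustrative assumptions, not our exact architecture.

```python
import torch.nn as nn

class InstrumentCNN(nn.Module):
    def __init__(self, n_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                          # 43 x 128 -> 21 x 64
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                          # 21 x 64 -> 10 x 32
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 10 * 32, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes),                # logits over the 11 instrument classes
        )

    def forward(self, x):                             # x: (batch, 1, 43, 128)
        return self.classifier(self.features(x))      # train with nn.CrossEntropyLoss
```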
Automatic Transcription for Polyphonic Music
Goal
Obtain transcription (note representation) of music
Seems easy: a one-to-one mapping between notes and notation
But it is not. Why?
● Polyphonic music: several notes can play at the same time
● Exponentially many note combinations per time frame (an 88-key piano allows 2^88 possible combinations)
● Multiple instruments: a separate model is needed for each instrument, since the loss function differs for each
Currently, we focus on piano music
The same approach works for any instrument:
● Given a good dataset
● With sufficient training and a proper loss function
Automatic Transcription for Polyphonic Music
Approach:
● Train a ConvNet model on polyphonic piano music
● Used the MAPS dataset:
○ 45 GB of audio files, around 60 hours of recordings
○ Processed about 6 million time frames
Approach 1:
● Use the whole dataset
○ Computationally intensive
○ Trained for 7 epochs before early stopping
Approach 2:
● Iterative training, using one category at a time (outlined below)
○ Trained for 63, 20, 7, 7, and 7 epochs
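A hypothetical outline of the iterative schedule in Approach 2: the same model is fine-tuned on one MAPS category at a time, using the per-category epoch counts from this slide; maps_categories, make_loader, and train_one_epoch are placeholder helpers, not the project's actual code.

```python
# Hypothetical training schedule for Approach 2; maps_categories, make_loader,
# and train_one_epoch are placeholders for the project's actual data handling.
epochs_per_category = [63, 20, 7, 7, 7]
for category, n_epochs in zip(maps_categories, epochs_per_category):
    loader = make_loader(category)                    # frames from one MAPS category
    for _ in range(n_epochs):
        train_one_epoch(model, loader, optimizer)     # fine-tune the same model
```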
The model outputs a probability for each note at each time frame; we infer the notes being played by thresholding these probabilities (see the sketch after this slide).
Result: ~96% accuracy
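A minimal PyTorch sketch of frame-wise transcription in the spirit of the ConvNet acoustic model of Sigtia et al. (2016) [6]: each input is a short context window of spectrogram frames, the output is one sigmoid per piano key, and the 0.5 threshold, layer sizes, and 229-bin input are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class TranscriptionCNN(nn.Module):
    def __init__(self, n_bins=229, context=7, n_notes=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * context * (n_bins // 4), 512), nn.ReLU(),
            nn.Linear(512, n_notes),                  # one logit per piano key
        )

    def forward(self, x):                             # x: (batch, 1, context, n_bins)
        return self.fc(self.conv(x))                  # train with nn.BCEWithLogitsLoss

def notes_at_frame(model, frames, threshold=0.5):
    # Independent per-note probabilities, then a fixed threshold to decide
    # which notes are playing in each frame (one piano-roll slice per input).
    probs = torch.sigmoid(model(frames))              # (batch, 88)
    return probs > threshold
```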
Learning outcome
● Explored a domain completely new to us, as beginners in deep learning
● Our pipeline has three different models, one for each step, all using a deep learning approach. This required an extensive literature survey, implementation, and training effort for each, and each model is trained on a different dataset
● Attempted to build on existing concepts in each part:
○ Source separation: LSTM, discriminative training
○ Predominant instrument identification: batch normalization
○ Transcription: different training approaches for better generalization
Summary
● Our system is divided into three components:
Source separation → predominant instrument identification → transcription
● A first attempt to transcribe polyphonic music with multiple instruments using deep learning techniques
● Future directions:
○ Extend source separation to multiple instruments
○ Make the transcription model more flexible
References
1. Huang, Po-Sen, et al. “Singing-Voice Separation from Monaural Recordings using Deep Recurrent Neural
Networks.” ISMIR. 2014.
2. Chandna, Pritish, et al. “Monoaural audio source separation using deep convolutional neural networks.”
International Conference on Latent Variable Analysis and Signal Separation. Springer, Cham, 2017.
3. MIR-1K dataset: Chao-Ling Hsu, DeLiang Wang, Jyh-Shing Roger Jang, and Ke Hu, “A Tandem Algorithm for Singing Pitch Extraction and Voice Separation from Music Accompaniment,” IEEE Trans. Audio, Speech, and Language Processing, 2011.
4. Han, Yoonchang, et al. “Deep convolutional neural networks for predominant instrument recognition in
polyphonic music.” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25.1 (2017):
208-221.
5. IRMAS Dataset: Bosch, J. J., Janer, J., Fuhrmann, F., & Herrera, P. “A Comparison of Sound Segregation Techniques
for Predominant Instrument Recognition in Musical Audio Signals”, in Proc. ISMIR (pp. 559-564), 2012.
6. Sigtia, Siddharth, Emmanouil Benetos, and Simon Dixon. “An end-to-end neural network for polyphonic piano
music transcription.” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.5 (2016):
927-939.
7. MAPS Dataset: V. Emiya, R. Badeau, and B. David, “Multi-pitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech and Language Processing, 2010.
Thank You...