Automatic Music
Transcription :
Overview
01/26/2016
Cho Won Ik
• Introduction : What is transcription?
• Goal of automatic music transcription
• Early days in AMT research
• Current research areas on AMT
• Multi-pitch analysis
• Semi-automatic (informed) transcription
• Complete music notation
• Challenge
• Future works
Contents
• What is transcription?
• Notating a piece or a sound which was previously unnotated
• Usually hand-written in the past; digitally notated nowadays
• Why is it necessary?
• Information retrieval from blind source
• e.g. traditional music, improvised performances, pieces whose scores are unreleased …
• Objective musical performance measurement
• Application to systematic/computational musicology
Introduction
• Example of transcription software
• Mostly pitch estimation
Introduction
• What is required in music transcription?
• Pitch, onset time, duration (frequency-temporal analysis)
• Loudness (amplitude)
• Instrumentation (waveform, after source separation)
• High-level features
• Melody tracking (often among same instrument)
• Rhythmic information : tempo and beat
• Harmonic data : key and chord
Introduction
“Can a machine transcribe music just as a (trained) human does?”
• Melograph [Metfessel, 1928]
• Special-purpose hardware device that graphs the pitch of the input waveform over time
Early days in AMT research
• Segmentation and analysis of continuous musical sound
by digital computer [Moorer, 1975]
• First paper to discuss automatic transcription from a signal
processing viewpoint (especially filter theory)
• An optimum comb method is used to detect F0 (a comb-style harmonic-sum sketch follows below)
Early days in AMT research
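A minimal sketch of comb-style F0 detection in this spirit: for each candidate F0, sum the magnitude spectrum at its harmonic positions (weighted by 1/k to discourage subharmonic/octave errors) and pick the best candidate. This is a plain harmonic-sum illustration, not Moorer's exact optimum-comb filter; the 1 Hz candidate grid and harmonic count are arbitrary choices.

import numpy as np

def comb_f0_estimate(frame, sr, f0_min=50.0, f0_max=1000.0, n_harmonics=10):
    # Harmonic-sum sketch: score each candidate F0 by the (1/k-weighted) spectral
    # energy collected at its first harmonics, and return the best candidate.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    bin_width = freqs[1] - freqs[0]

    candidates = np.arange(f0_min, f0_max, 1.0)   # 1 Hz grid of candidate F0s
    scores = []
    for f0 in candidates:
        harmonics = f0 * np.arange(1, n_harmonics + 1)
        harmonics = harmonics[harmonics < freqs[-1]]
        bins = np.round(harmonics / bin_width).astype(int)
        weights = 1.0 / np.arange(1, len(bins) + 1)   # de-emphasize high harmonics
        scores.append((spectrum[bins] * weights).sum())
    return candidates[int(np.argmax(scores))]

# Toy usage: a 220 Hz tone built from its first five partials.
sr = 16000
t = np.arange(2048) / sr
tone = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 6))
print(comb_f0_estimate(tone, sr))   # expected to land close to 220 Hz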
• Blackboard system [Martin, 1996]
• Various forms of knowledge integrated for specific purpose
• human physiology, acoustics, musical practice etc.
• Blackboard workspace is arranged in a
hierarchy of five hypothesis levels that become
more abstract going upward
• Tracks, Partials, Notes, Intervals, Chords
Early days in AMT research
• Blackboard system (cont’d)
• Input
• Discretized version of the information
in the spectrogram representations
• Output
• Textual representation of
the detected notes
• Graphical display of the
note onset data
Early days in AMT research
• Connectionist approach [Marolt, 2004]
• Resembles human perception of pitch
• Auditory-model based partial tracking
• Networks of adaptive oscillators inspired by the hair cells of the cochlea
• Note recognition based on a neural network
Early days in AMT research
• Current research areas
• Multi-pitch analysis
• Frame-level
• Note-level
• Timbre tracking
• Semi-automatic (informed) transcription
• Complete music notation
Research areas on AMT
• Core problem in automatic music transcription
• Most studies deal with western classical piano pieces
• Due to clarity, polyphony, plentiful DB
• Multi-pitch analysis is difficult even for humans
• Overlapping partials
Multi-pitch analysis
• Octave ambiguity
• Ambiguity in estimation of the number of sources
• Obscurity from instrumentation
Multi-pitch analysis
• Frame-level analysis
• Estimate pitches and polyphony in each frame
• Feature-based analysis
• Statistical model-based analysis
• Spectrogram decomposition-based analysis
• Note-level analysis
• Estimate pitch, onset & offset of notes
• Minimum duration pruning
• Hidden Markov model
• Efficient convolutional sparse coding
Multi-pitch analysis
• Feature-based analysis
• Pitch of complex tone : fundamental frequency (F0)
• Partials/Overtones
• Harmonics
• Harmonic instrument
• String, winds, piano etc.
• Differences in the produced overtones
cause diversity in timbre (spectral envelope)
Frame-level analysis
f = 440 Hz (n = 1) : fundamental tone / 1st harmonic / 1st partial
f = 880 Hz (n = 2) : 1st overtone / 2nd harmonic / 2nd partial
f = 1320 Hz (n = 3) : 2nd overtone / 3rd harmonic / 3rd partial
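A small sketch of the point about spectral envelopes: two synthetic tones with the same F0 but different (made-up) partial amplitudes share a pitch yet differ in timbre.

import numpy as np

sr, f0, dur = 16000, 440.0, 0.5
t = np.arange(int(sr * dur)) / sr

def harmonic_tone(partial_amps):
    # Sum of sinusoids at integer multiples of f0; partial_amps sets the spectral envelope.
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(partial_amps))

tone_a = harmonic_tone([1.0, 0.5, 0.25, 0.12])   # illustrative envelope A
tone_b = harmonic_tone([1.0, 0.1, 0.40, 0.05])   # illustrative envelope B
# Same F0 (same pitch), different partial amplitudes (different timbre).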
• Multiple-F0 estimation based on polyphony inference [Yeh, 2008]
• Goal : extract multiple-F0 from STFT frame of harmonic instrument
• Noise model / Source model / Source interaction model
• The noise model distinguishes components unnecessary for harmonic analysis
• Non-harmonically related F0s (NHRF0s)
• Reduced computation for proper F0 candidate selection
• Hypothetical partial sequence (HPS)
Frame-level analysis
Extraction of an HRF0 (F0c) from
the HPS of an NHRF0 (F0a)
• Multiple-F0 estimation based on polyphony inference (cont’d)
• Source model : Quasi-periodic
• Partial frequencies and amplitudes of hypothetical sources are estimated
• Source interaction model : Guiding principles for generative signal model
• Harmonicity
• Smoothness of spectral envelopes
• Synchronous amplitude evolution of partials
• A scoring function for joint evaluation is proposed
• Criteria : Harmonicity/Mean bandwidth/Spectral centroid/Synchronicity
• A smaller weighted sum indicates a better score (weights p_i determined experimentally)
Frame-level analysis
S = p1 ∙ HAR + p2 ∙ MBW + p3 ∙ SPC + p4 ∙ SYNC
• Multiple-F0 estimation based on polyphony inference (cont’d)
• Polyphony is inferred based on the assumption that the combination with the
correct number of F0s is expected to give the best score (a schematic sketch follows below)
Frame-level analysis
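A schematic sketch of the joint-scoring and polyphony-inference idea. The criterion functions and weights below are toy stand-ins, not Yeh's actual HAR/MBW/SPC/SYNC definitions; only the structure follows the slides: score every candidate F0 combination with a weighted sum (smaller is better) and read the polyphony off the best-scoring combination.

import itertools

def joint_score(combo, criteria, weights):
    # Weighted sum S = p1*HAR + p2*MBW + p3*SPC + p4*SYNC; smaller is better.
    return sum(p * crit(combo) for p, crit in zip(weights, criteria))

def infer_polyphony(candidates, criteria, weights, max_polyphony=4):
    # Score every combination of candidate F0s up to max_polyphony and keep the
    # lowest-scoring one; its size is the inferred polyphony.
    best_combo, best_score = (), float("inf")
    for n in range(1, max_polyphony + 1):
        for combo in itertools.combinations(candidates, n):
            s = joint_score(combo, criteria, weights)
            if s < best_score:
                best_combo, best_score = combo, s
    return best_combo, best_score

# Toy usage with made-up criteria: prefer combinations whose harmonics cover a
# fixed set of observed peak frequencies, with a small penalty per extra F0.
observed_peaks = {220.0, 330.0, 440.0, 660.0, 880.0, 990.0}

def coverage(combo):
    covered = {p for p in observed_peaks
               if any(abs(p - k * f0) < 1.0 for f0 in combo for k in range(1, 6))}
    return 1.0 - len(covered) / len(observed_peaks)    # smaller = better

def size_penalty(combo):
    return 0.05 * len(combo)                           # discourage spurious F0s

combo, score = infer_polyphony([110.0, 220.0, 330.0, 440.0],
                               criteria=[coverage, size_penalty],
                               weights=[1.0, 1.0])
print(combo, score)   # (220.0, 330.0) is expected to score best here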
• Statistical model-based analysis
• Multi-pitch estimation using a
new probabilistic spectral
smoothness principle [Emiya, 2010]
• Given an observed frame X and a set
𝑪 of all possible fundamental
frequency combinations, the multi-pitch
estimate Ĉ is obtained as a
maximum a posteriori (MAP) decision:
Frame-level analysis
𝐶𝐶̂ = 𝑃𝑃 𝐶𝐶 𝑋𝑋𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎
𝐶𝐶 ∈ 𝑪𝑪
=
𝑃𝑃 𝐶𝐶 𝑋𝑋 𝑃𝑃(𝐶𝐶)
𝑃𝑃(𝑋𝑋)
𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎
𝐶𝐶 ∈ 𝑪𝑪
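In code the MAP rule amounts to scoring each candidate combination by likelihood times prior and taking the argmax (P(X) is constant over C and can be ignored). The model functions below are placeholders; Emiya's actual spectral-smoothness likelihood and prior are not reproduced here.

import numpy as np

def map_estimate(frame, candidate_combos, log_likelihood, log_prior):
    # C_hat = argmax_C log P(X|C) + log P(C); the evidence P(X) is dropped.
    scores = [log_likelihood(frame, C) + log_prior(C) for C in candidate_combos]
    return candidate_combos[int(np.argmax(scores))]

# Toy usage with made-up models: a likelihood that happens to favour two-note combinations.
combos = [(440.0,), (550.0,), (440.0, 550.0)]
best = map_estimate(None, combos,
                    log_likelihood=lambda x, C: -abs(len(C) - 2),   # toy likelihood
                    log_prior=lambda C: 0.0)                        # flat prior
print(best)   # (440.0, 550.0)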
• Spectrogram decomposition-based analysis
• Nonnegative matrix factorization [Smaragdis, 2003]
• The NMF model decomposes an input spectrogram X with K frequency
bins and N frames into X ≈ WH
• For a number of pitch bases R ≪ N, K: W contains the spectral bases
for each of the R pitch components, and H is the pitch activity matrix
across time (see the sketch below)
Frame-level analysis
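A minimal NMF sketch with scikit-learn on a placeholder magnitude spectrogram (in practice X would be the magnitude STFT of the recording, e.g. via librosa.stft); W holds the R spectral bases and H the pitch activations, matching X ≈ WH.

import numpy as np
from sklearn.decomposition import NMF

K, N, R = 513, 200, 8                        # frequency bins, frames, pitch bases
X = np.random.default_rng(0).random((K, N))  # placeholder for a real |STFT|

# KL divergence with multiplicative updates is a common choice for audio NMF.
model = NMF(n_components=R, beta_loss="kullback-leibler", solver="mu",
            init="random", max_iter=300, random_state=0)
W = model.fit_transform(X)    # (K, R): spectral basis per pitch component
H = model.components_         # (R, N): activation of each basis over time
print(W.shape, H.shape)       # X is approximated by W @ H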
• The time-pitch representation must be further processed to
detect note events with
• Discrete pitch value
• An onset time and offset time (duration)
• Minimum duration pruning [Dessein, 2010]
• Simple and fast solution
• Applied after thresholding
• Note events which have a
duration smaller than a
pre-defined value are removed from the final score
Note-level analysis
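A minimal sketch of minimum duration pruning, assuming note events are (pitch, onset, offset) tuples obtained after thresholding the activation matrix; the 50 ms threshold is illustrative.

def prune_short_notes(notes, min_duration=0.05):
    # Drop note events shorter than min_duration (seconds).
    return [(p, on, off) for (p, on, off) in notes if off - on >= min_duration]

notes = [(60, 0.00, 0.40), (62, 0.40, 0.43), (64, 0.50, 1.10)]
print(prune_short_notes(notes))   # the 30 ms note (pitch 62) is removed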
• Hidden Markov models [Ryynanen, 2005]
• Model each note with a 3-state note event HMM
• 3 states : attack, sustain, noise states of each sound
• A musicological model is used to estimate the musical key and note
transition probabilities
• Observation :
• Pitch deviation
• Pitch salience
• Onset strength
• Model silence with a 1-state
silence HMM
Note-level analysis
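A compact Viterbi sketch for a single 3-state note event HMM (attack / sustain / noise). All probabilities below are invented for illustration; Ryynanen and Klapuri learn them from data and combine the note HMMs with the musicological model and a silence model.

import numpy as np

# States: 0 = attack, 1 = sustain, 2 = noise. Numbers are illustrative only.
log_A = np.log(np.array([[0.60, 0.35, 0.05],    # state transition probabilities
                         [0.00, 0.90, 0.10],
                         [0.00, 0.00, 1.00]]) + 1e-12)
log_pi = np.log(np.array([0.9, 0.1, 1e-6]))     # notes are entered via the attack state

def viterbi(log_B, log_A, log_pi):
    # log_B: (T, S) per-frame log observation likelihoods, e.g. computed from
    # pitch deviation / pitch salience / onset strength features.
    T, S = log_B.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # (S, S): previous x current state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy observation likelihoods over 6 frames favouring attack -> sustain -> noise.
log_B = np.log(np.array([[0.8, 0.1, 0.1],
                         [0.6, 0.3, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.8, 0.1],
                         [0.1, 0.3, 0.6],
                         [0.1, 0.1, 0.8]]))
print(viterbi(log_B, log_A, log_pi))   # [0, 0, 1, 1, 2, 2]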
• Efficient convolutional sparse coding [Wohlberg, 2014]
• Tracks notes directly from audio
s[t] : monaural, polyphonic audio recording of a piano piece
d_m[t] : dictionary element representing the notes of the piano
x_m[t] : activation vectors
• A nonzero value at index t of x_m[t] represents activation of note
m at sample t
Note-level analysis
s[t] ≅ Σ_m d_m[t] ∗ x_m[t]
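A numpy sketch of the convolutional mixture model above: the signal is approximated by the sum, over notes, of each note template convolved with its sparse activation vector. The templates and activations here are placeholders, and the actual convolutional sparse coding step (solving for x_m under a sparsity penalty) is not shown.

import numpy as np

rng = np.random.default_rng(0)
T, M, L = 4000, 3, 400    # signal length, number of note templates, template length

# Placeholder note templates d_m[t]; in practice these come from recorded piano notes.
templates = [rng.standard_normal(L) * np.hanning(L) for _ in range(M)]

# Sparse activations x_m[t]: a nonzero value at sample t triggers note m there.
activations = np.zeros((M, T))
activations[0, 100] = 1.0
activations[1, 100] = 0.7     # a two-note chord starting at the same sample
activations[2, 2000] = 1.0

# s[t] ~ sum over m of d_m[t] convolved with x_m[t], truncated to the signal length.
s = sum(np.convolve(activations[m], templates[m])[:T] for m in range(M))
print(s.shape)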
• An area closely related to the source separation problem
• Also known as multi-pitch streaming
• Supervised
• Train timbre models of sound sources
• Apply timbre models during pitch estimation
• Classify estimated pitches/notes
• Supervised with timbre adaptation
• Adapt trained timbre models to sources in mixture
• Unsupervised
• Cluster pitch estimates according to timbre
• Includes problem of percussive instrument separation
• Spectrogram decomposition is still useful
Timbre tracking
• Spectrogram decomposition-based analysis
• Probabilistic latent component analysis [Smaragdis, 2007]
• For an N-dimensional random variable x and latent variable z, the model is
P(x) = Σ_z P(z) Π_j P(x_j|z)
• Estimation of the marginals P(x_j|z) is performed using the EM algorithm
• In source separation, the magnitude spectrogram is expressed as
P(f, t) ≈ Σ_z P(z) P(f|z) P(t|z), so that the decomposition results in two sets of marginals
Timbre tracking
• Probabilistic latent component analysis (cont’d)
• P(f|z) = P1(f|z) ∪ P2(f|z)
• P1(f|z) and P2(f|z) are the known frequency marginals of the two sources
• For P(f|z) to explain the mixture spectrogram
P(f, t), we only need to estimate P(t|z)
• P(t|z) is split into two sets which
correspond to each source
• Reconstruction of the parts of the input spectrogram
that correspond to only one source
Timbre tracking
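A minimal EM sketch for the PLCA use described above, assuming the frequency marginals P(f|z) are already known (e.g. trained per source in isolation), so only P(t|z) and P(z) are estimated from the mixture spectrogram.

import numpy as np

def plca_time_marginals(V, Pf_given_z, n_iter=100):
    # V          : (F, T) nonnegative magnitude spectrogram of the mixture
    # Pf_given_z : (F, Z) fixed frequency marginals, columns summing to 1
    F, T = V.shape
    Z = Pf_given_z.shape[1]
    rng = np.random.default_rng(0)
    Pt_given_z = rng.random((T, Z))
    Pt_given_z /= Pt_given_z.sum(axis=0)
    Pz = np.full(Z, 1.0 / Z)

    for _ in range(n_iter):
        # E-step: posterior P(z|f,t) proportional to P(z) P(f|z) P(t|z)
        joint = Pz[None, None, :] * Pf_given_z[:, None, :] * Pt_given_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)    # (F, T, Z)
        # M-step: reweight the posterior by the observed spectrogram
        weighted = V[:, :, None] * post                              # (F, T, Z)
        Pt_given_z = weighted.sum(axis=0)                            # (T, Z)
        Pz = Pt_given_z.sum(axis=0)
        Pt_given_z /= Pt_given_z.sum(axis=0, keepdims=True) + 1e-12
        Pz /= Pz.sum()
    return Pt_given_z, Pz

# Toy usage with random data; components can then be grouped by which source's
# frequency marginals they came from, and each source reconstructed from its group.
V = np.random.default_rng(1).random((257, 100))
Pf = np.random.default_rng(2).random((257, 4))
Pf /= Pf.sum(axis=0)
Pt, Pz = plca_time_marginals(V, Pf, n_iter=50)
print(Pt.shape, Pz)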
• Current state-of-the-art AMT systems do not reach the same
level of accuracy as transcriptions made by human experts
• Humans can assist with aspects of the computational transcription process that
are crucial for an accurate transcription but difficult to model
algorithmically
• Instrument identification
• Auditory stream segregation
• Not applicable to the analysis of large music databases
• Useful for more detailed and accurate transcription of music
Semi-automatic transcription
• Current AMT systems can
• Detect (multiple) pitches,
onsets, offsets
• Identify instruments and
track notes in polyphony
• Identify articulation and
rhythm information
• Analyzed data need to be
translated into musical form
• Score form / MIDI form
• Fingering / string detection
• Direct mapping to software tools
Complete music notation
• MIREX (MIR Evaluation eXchange)
• Multiple F0 estimation & tracking
• Performance measure
• Precision (the proportion of correctly retrieved pitches among all pitches
retrieved for each frame)
• Recall (the ratio of correct pitches to all ground truth pitches for each
frame)
• Audio onset detection
• Performance measure
• Precision / Recall / F-measure / Scoring for doubled onset
• Time precision (tolerance from ±50 ms down to smaller values)
• Separate scoring for different instrument types
• Singing voice separation
• Performance measure
• SDR / SIR (Source to interference ratio) / SAR (Source to artifacts ratio)
Challenge
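A small sketch of the frame-level precision / recall / F-measure computation for multiple-F0 estimation (MIREX additionally reports accuracy and chroma-level variants, not shown here).

def frame_metrics(reference, estimated):
    # reference, estimated: lists of sets of active pitches, one set per frame.
    tp = sum(len(r & e) for r, e in zip(reference, estimated))
    n_est = sum(len(e) for e in estimated)
    n_ref = sum(len(r) for r in reference)
    precision = tp / n_est if n_est else 0.0
    recall = tp / n_ref if n_ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

ref = [{60, 64}, {60, 64, 67}, {60}]
est = [{60}, {60, 64, 67, 70}, {60, 62}]
print(frame_metrics(ref, est))   # approximately (0.714, 0.833, 0.769)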
• Apply ideas of AED/source separation in AMT
• Instrument identification and timbre tracking are still difficult
• AED can be used to identify onsets and offsets of instruments
• Source separation can be applied to decompose polyphonic
music into a set of monophonic streams
• Instrument presence information from AED can be useful
Future works
• Z. Duan and E. Benetos, “Tutorial : Automatic music transcription,” 16th International
Society for Music Information Retrieval Conference, 2015.
• E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic music
transcription: challenges and future directions,” Journal of Intelligent Information
Systems, Vol. 41, No. 3, pp. 407-434, 2013.
• J. A. Moorer, “On the segmentation and analysis of continuous musical sound by digital
computer,” PhD thesis, Stanford University, 1975.
• K. D. Martin, “A blackboard system for automatic transcription of simple polyphonic
music,” Massachusetts Institute of Technology Media Laboratory Perceptual Computing
Section Technical Report, No. 385, 1996.
• M. Marolt, “A Connectionist Approach to Automatic Transcription of Polyphonic Piano
Music,” IEEE Transactions on Multimedia, Vol. 6, No. 3, Jun. 2004.
• C. Yeh, “Multiple fundamental frequency estimation of polyphonic recordings,” PhD
thesis, Université Paris VI (Pierre et Marie Curie), 2008.
• V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a
new probabilistic spectral smoothness principle,” IEEE Transactions on Audio,
Speech, and Language Processing, Vol. 18, No. 6, pp. 1643-1654, Aug. 2010.
Reference
• P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music
transcription,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics,
NY, USA, Oct. 2003.
• A. Dessein, A. Cont, and G. Lemaitre, “Real-time polyphonic music transcription with
non-negative matrix factorization and beta-divergence,” In Proceedings of the 11th
International Society for Music Information Retrieval Conference, pp. 489-494, 2010.
• M. P. Ryynanen and A. Klapuri, “Polyphonic music transcription using note event
modeling,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics,
NY, USA, Oct. 2005.
• A. Cogliati, Z. Duan, and B. Wohlberg, “Piano music transcription with fast convolutional
sparse coding,” IEEE Workshop on Machine Learning for Signal Processing, Boston, USA,
Sep. 2015.
• P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised
separation of sounds from single-channel mixtures,” In Proceedings of the 7th International
Conference on Independent Component Analysis and Signal Separation, pp. 414-421,
2007.
Reference
Thank You!!!
