This document investigates whether explicitly modeling inharmonicity improves piano transcription accuracy when using non-negative matrix factorization (NMF). It compares three models for the note spectra dictionary in NMF-based piano transcription: 1) strictly harmonic, 2) strictly following theoretical inharmonicity, and 3) relaxed inharmonicity constraints. Experimental results found the inharmonic models improved transcription accuracy compared to the harmonic model, but only when provided a good initialization. The paper aims to better understand how precisely a model needs to capture characteristics like inharmonicity for robust NMF analysis of piano recordings.
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Based Piano Transcription Model
DOES INHARMONICITY IMPROVE AN NMF-BASED PIANO TRANSCRIPTION MODEL?
François Rigaud (1), Antoine Falaize (2), Bertrand David (1), and Laurent Daudet (3)
(1) Institut Télécom; Télécom ParisTech; CNRS LTCI; Paris, France
(2) Sound Analysis/Synthesis Team; STMS IRCAM-CNRS-UPMC; Paris, France
(3) Institut Langevin; Paris Diderot Univ.; ESPCI ParisTech; CNRS; Paris, France and Institut Universitaire de France
ABSTRACT
This paper investigates how precise a model should be for a robust model-based NMF analysis of piano recordings. While inharmonicity is an essential feature of piano tones from a perceptual point of view, its explicit inclusion in sound models is not straightforward and may even damage the quality of the analysis. Here, we assess the quality of the analysis with a transcription task, and compare three different models for the spectra of the dictionary: one strictly harmonic, one following the theoretical inharmonicity law, and one with relaxed inharmonicity constraints. Experimental results show that both inharmonic models can indeed significantly enhance the results, but only when a good initialization is provided.
Index Terms: non-negative matrix factorization, music transcription, piano, inharmonicity
1. INTRODUCTION
Methods based on non-negative signal representations (such as Non-negative Matrix Factorization, NMF [1], or Probabilistic Latent Component Analysis, PLCA [2]) have been widely applied to audio analysis in the last decade, giving promising results in transcription [3, 4] or source separation [5, 6], to name but a few applications. These approaches mainly target the decomposition of time-frequency representations of musical pieces into two non-negative matrices: one dictionary containing the spectra/atoms of the notes/instruments, and one activation matrix containing their temporal activations. Besides the generic "blind" approaches, prior information can also be used explicitly in order to better fit the decomposition to specific properties of audio data (these can be physics-based or signal-based properties). For instance, when isolated note recordings are available, on the same instrument and with the same recording conditions, the spectra of the dictionary can be learned independently [7, 4]. In that case, since the dictionary is learned on monophonic data and fixed at the learning step, high transcription performance can be obtained. Another approach consists in including the information as a parametrization of the model. For instance, harmonicity [8, 9], temporal evolution of the spectral envelope [10], sparsity of simultaneously activated notes [11], beat structure [12], etc., have been considered in modelling the dictionary or the time-activation matrices.
In the case of piano tones, a perceptually important feature is their inharmonicity: due to the string stiffness [13], the frequencies of the partials slightly deviate from a purely harmonic relation (the frequency $f_n$ of the $n$-th partial is above $n f_0$, where $f_0$ is the fundamental frequency). Taking into account the inharmonicity of the tones in an NMF-based model has been proposed [14], but surprisingly the transcription results were found to be slightly below those obtained by a simpler harmonic model. These results seem to contradict the naive intuition that inharmonicity should help resolve typical transcription ambiguities, such as those between harmonically related notes (octave or fifth relations, for instance, where partials fully overlap).
This work is supported by the DReaM project of the French Agence Nationale de la Recherche (ANR-09-CORD-006, under CONTINT program).
The goal of this paper is to gain a better understanding of these issues. To this end, two ways of including inharmonicity, with different degrees of constraint, are proposed and compared to a harmonic model. The first inharmonic model (later called Inh) forces the partial frequencies to strictly follow the theoretical inharmonicity law. The second inharmonic model (later called InhR) relaxes this constraint, and promotes inharmonicity through a weighted penalty term. We discuss in particular the influence of the initialization on the optimization process. Since it is difficult to gauge intrinsically the quality of NMF decompositions, these are here evaluated on a transcription task, on a large database of piano recordings for which a ground-truth transcription is available. It should be emphasized that the proposed algorithms are not intended to compete with state-of-the-art, fully dedicated piano transcription algorithms, since the only information that is taken into account is the inharmonicity of piano tones (for instance, no model of smooth spectral envelope or temporal continuity of the activations is considered). On the contrary, the simple post-processing of the data should allow one to better highlight the differences in the core model.
2. NMF MODELLING FOR PIANO TRANSCRIPTION
NMF consists in solving the approximate low rank factorization:
$$ V \approx WH \;\Leftrightarrow\; V_{kt} \approx \hat{V}_{kt} = \sum_{r=1}^{R} W_{kr} H_{rt}, \tag{1} $$
where $V$, $W$, and $H$ are respectively $K \times T$, $K \times R$ and $R \times T$ non-negative matrices. In audio applications, $V$ usually corresponds to a magnitude or power spectrogram, $K$ being the number of frequency bins and $T$ the number of time frames. $W$ is then expected to be a dictionary containing the spectra/atoms of $R$ sources or notes, and $H$ a time activation matrix [3].
In the following, a parametric model for the spectra of the dictionary $W$ is introduced. It is based on an additive model for which three different constraints on the partial frequencies are introduced: these are constrained to follow a strict harmonic, a strict inharmonic and a relaxed inharmonic relation.
2.1. General additive model for the dictionary of spectra
Each spectrum of a note, indexed by $r \in [1, R]$, is composed of the sum of $N_r$ partials, the partial rank being denoted by $n \in [1, N_r]$. Each partial is parametrized by its amplitude $a_{nr}$ and its frequency $f_{nr}$. Thus, the set of parameters for a single atom is denoted by $\theta_r = \{a_{nr}, f_{nr} \mid n \in [1, N_r]\}$ and the set of parameters for the dictionary by $\theta = \{\theta_r \mid r \in [1, R]\}$. Then, the expression of a parametric atom indexed by $r$ (as similarly proposed in [9]) is set to:
$$ W^{\theta_r}_{kr} = \sum_{n=1}^{N_r} a_{nr} \cdot g_\tau(f_k - f_{nr}), \tag{2} $$
where $f_k$ corresponds to the frequency associated with the $k$-th bin, and $g_\tau$ to the magnitude of the Fourier transform of the $\tau$-length analysis window. In order to limit the computational time and to obtain simple optimization rules, the spectral support of $g_\tau$ is restricted to its main lobe. For the experiments proposed in this paper, a Hann window was used to compute the spectrograms. The expression of its main-lobe magnitude spectrum (normalized to a maximal magnitude of 1) is given by
$$ g_\tau(f_k) = \frac{1}{\pi\tau} \cdot \frac{\sin(\pi f_k \tau)}{f_k - \tau^2 f_k^3}, \quad \text{for } f_k \in [-2/\tau,\, 2/\tau]. $$
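As an illustration of eq. (2), the following NumPy sketch evaluates the Hann main lobe and builds one parametric atom. It is not the authors' code: the function names `hann_main_lobe` and `parametric_atom` are hypothetical, and the lobe is evaluated in the equivalent form $\mathrm{sinc}(f\tau)/(1-(f\tau)^2)$ so that the removable singularities at $f = 0$ and $f = \pm 1/\tau$ can be handled explicitly (the value 0.5 is substituted within NumPy's default `isclose` tolerance of $|f\tau| = 1$).

```python
import numpy as np

def hann_main_lobe(f, tau):
    """Main-lobe magnitude spectrum of a tau-second Hann window, peak normalized to 1."""
    x = np.asarray(f, dtype=float) * tau               # frequency in units of 1/tau
    singular = np.isclose(np.abs(x), 1.0)              # removable 0/0 at f = +/- 1/tau
    denom = np.where(singular, 1.0, 1.0 - x**2)
    # np.sinc(x) = sin(pi x) / (pi x), already well defined at x = 0
    g = np.where(singular, 0.5, np.sinc(x) / denom)
    # restrict the support to the main lobe |f| <= 2/tau
    return np.where(np.abs(x) <= 2.0, np.abs(g), 0.0)

def parametric_atom(bin_freqs, amps, partial_freqs, tau):
    """Eq. (2): sum over partials of a_n * g_tau(f_k - f_n), for one note."""
    fk = np.asarray(bin_freqs, dtype=float)[:, None]
    fn = np.asarray(partial_freqs, dtype=float)[None, :]
    return hann_main_lobe(fk - fn, tau) @ np.asarray(amps, dtype=float)

# Toy usage: 1 Hz bin spacing, two partials, tau = 90 ms (the window length used later in the paper)
W_r = parametric_atom(np.arange(0.0, 1001.0), [1.0, 0.5], [100.0, 200.0], tau=0.09)
```

Because each lobe only spans $\pm 2/\tau$ around its partial, most entries of an atom are exactly zero, which is what keeps the update rules cheap.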
In order to estimate the parameters, the reconstruction cost function is chosen as the β-divergence between $V$ and $W^{\theta}H$:
$$ C_0(\theta, H) = \sum_{k=1}^{K} \sum_{t=1}^{T} d_\beta\!\left( V_{kt} \,\Big|\, \sum_{r=1}^{R} W^{\theta_r}_{kr} \cdot H_{rt} \right), \tag{3} $$
with
$$ d_\beta(x \mid y) = \frac{x^{\beta} + (\beta - 1)\, y^{\beta} - \beta\, x\, y^{\beta-1}}{\beta(\beta - 1)}. \tag{4} $$
The family of β-divergences is widely used in audio applications [15] because it encompasses three common metrics: a Euclidean distance ($\beta = 2$), the Kullback-Leibler divergence ($\beta \to 1$) and the Itakura-Saito divergence ($\beta \to 0$).
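A small sketch of eq. (4) may help make the three special cases concrete. The function name `beta_divergence` is hypothetical; the closed forms used for $\beta = 1$ and $\beta = 0$ are the standard Kullback-Leibler and Itakura-Saito expressions obtained as limits of eq. (4), and strictly positive inputs are assumed.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Entry-wise beta-divergence d_beta(x | y) of eq. (4), for x, y > 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 1:                      # limit beta -> 1: Kullback-Leibler
        return x * np.log(x / y) - x + y
    if beta == 0:                      # limit beta -> 0: Itakura-Saito
        return x / y - np.log(x / y) - 1.0
    return (x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)) / (beta * (beta - 1))

# With beta = 2, eq. (4) reduces to half the squared difference: 0.5 * (3 - 1)**2
print(beta_divergence(3.0, 1.0, 2))
```

Summing this quantity over all bins and frames, with `y` set to the model spectrogram, gives the reconstruction cost $C_0$ of eq. (3).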
2.2. Constraints on partial frequencies
• Strictly harmonic / Ha-NMF
The strict harmonic constraint consists in fixing:
$$ f_{nr} = n F_{0r}, \quad n \in \mathbb{N}^*, \tag{5} $$
directly in the parametric model (eq. (2)). Then, the set of parameters for a single atom reduces to $\theta^{\mathrm{Ha}}_r = \{a_{nr}, F_{0r} \mid n \in [1, N_r]\}$.
• Strictly inharmonic / Inh-NMF
The strict inharmonic constraint consists in setting [13]:
$$ f_{nr} = n F_{0r} \sqrt{1 + B_r n^2}, \quad n \in \mathbb{N}^*, \tag{6} $$
which governs the partial frequencies of stiff strings with clamped ends. Here, $\theta^{\mathrm{Inh}}_r = \{a_{nr}, F_{0r}, B_r \mid n \in [1, N_r]\}$, and $B_r$ is called the inharmonicity coefficient. It depends on the piano string design and therefore differs from one piano to another, but also from one key to another. From the low bass to the high treble range it can vary from around $10^{-5}$ to $10^{-2}$ [16].
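To make eqs. (5) and (6) concrete, the sketch below computes the partial frequencies of one note under both constraints. The helper name `partial_frequencies` and the numerical values ($F_0 = 55$ Hz, $B = 10^{-4}$) are illustrative assumptions, chosen only to lie inside the range of $B$ quoted above.

```python
import numpy as np

def partial_frequencies(f0, n_partials, B=0.0):
    """Eq. (6): f_n = n * F0 * sqrt(1 + B * n^2); B = 0 gives the harmonic case of eq. (5)."""
    n = np.arange(1, n_partials + 1, dtype=float)
    return n * f0 * np.sqrt(1.0 + B * n**2)

harmonic = partial_frequencies(55.0, 10)            # strictly harmonic series
inharmonic = partial_frequencies(55.0, 10, B=1e-4)  # stiff-string series (illustrative B)

# Deviation of each partial from its harmonic position, in cents
deviation_cents = 1200.0 * np.log2(inharmonic / harmonic)
```

The deviation grows with the partial rank, which is why high-rank partials of a harmonic model drift away from the observed spectral peaks.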
• Inharmonic relaxed / InhR-NMF
An alternative way of enforcing inharmonicity is through an extra penalty term added to the reconstruction cost function $C_0$ [17]. Thus, the global cost function can be expressed as:
$$ C^{\mathrm{InhR}}(\theta, \gamma, H) = C_0(\theta, H) + \lambda \cdot C_1(f_{nr}, \gamma), \tag{7} $$
where the set of parameters of the constraint is denoted by $\gamma = \{F_{0r}, B_r \mid r \in [1, R]\}$. $\lambda$ is a parameter that sets the weight between the reconstruction cost error and the inharmonicity constraint. The constraint cost function $C_1$ is chosen as the sum on each note of the mean square error between the estimated partial frequencies $f_{nr}$ and those given by the inharmonicity relation:
$$ C_1(f_{nr}, \gamma) = \sum_{r=1}^{R} \sum_{n=1}^{N_r} \left( f_{nr} - n F_{0r} \sqrt{1 + B_r n^2} \right)^2. \tag{8} $$
A potential benefit of this relaxed formulation is to allow a slight deviation of the partial frequencies around the theoretical inharmonicity relation, which is typically observed in the low frequency range due to the coupling between the strings and the soundboard [13, 18].
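The penalty of eq. (8) can be sketched as follows. The function name `inharmonicity_penalty` and the per-note list layout are assumptions made for illustration (notes may have different numbers of partials $N_r$, hence a list of arrays rather than a matrix).

```python
import numpy as np

def inharmonicity_penalty(partials, F0, B):
    """Eq. (8): squared deviation of the free partial frequencies from the inharmonicity law.

    partials: list of 1-D arrays, partials[r][n-1] = f_nr (hypothetical layout)
    F0, B:    per-note fundamental frequencies and inharmonicity coefficients
    """
    total = 0.0
    for f_r, f0_r, b_r in zip(partials, F0, B):
        n = np.arange(1, len(f_r) + 1, dtype=float)
        law = n * f0_r * np.sqrt(1.0 + b_r * n**2)
        total += np.sum((np.asarray(f_r, dtype=float) - law) ** 2)
    return float(total)

# The relaxed cost of eq. (7) then combines both terms, e.g.
#   cost = C0 + lam * inharmonicity_penalty(partials, F0, B)
# where C0 is the beta-divergence reconstruction cost and lam the weight lambda.
```

With $\lambda = 0$ the partial frequencies are entirely free; increasing $\lambda$ pulls them towards the theoretical law, and the strict Inh model corresponds to the limit where no deviation is allowed.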
3. TRANSCRIPTION ALGORITHM
3.1. NMF optimization
3.1.1. Multiplicative update rules
As commonly proposed in NMF modelling, the optimization is performed iteratively, using multiplicative update rules. These are obtained in a similar way to [9]. In the following, $P(\theta^*)$ and $Q(\theta^*)$ refer to positive quantities obtained by decomposing the partial derivative of a cost function $C(\theta)$ with respect to a particular parameter $\theta^*$ so that $\partial C(\theta) / \partial \theta^* = P(\theta^*) - Q(\theta^*)$. The parameter is then updated as $\theta^* \leftarrow \theta^* \cdot Q(\theta^*)/P(\theta^*)$.
The updates for $H$ and $a_{nr}$ are identical for the three models (Ha / Inh / InhR), since these parameters only appear in the cost function $C_0$. The rule for $H$ is similar to standard NMF with β-divergence [15]:
$$ H \leftarrow H \otimes \frac{{}^{t}W^{\theta} \left( \hat{V}^{.[\beta-2]} \otimes V \right)}{{}^{t}W^{\theta}\, \hat{V}^{.[\beta-1]}}, \tag{9} $$
where ${}^{t}\cdot$ corresponds to the transpose operator, and $\otimes$, $\cdot^{.[\,]}$ and the fraction bar to entry-wise multiplication, exponentiation and division operators, respectively. $\hat{V} = W^{\theta}H$ is the spectrogram model.
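A minimal NumPy sketch of the activation update of eq. (9) is given below. The function name `update_H` and the small constant `eps` (added to avoid divisions by zero) are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def update_H(V, W, H, beta, eps=1e-12):
    """One multiplicative update of H following eq. (9), with the dictionary W held fixed."""
    V_hat = W @ H + eps                           # spectrogram model
    numerator = W.T @ (V_hat ** (beta - 2) * V)   # entry-wise power and product
    denominator = W.T @ (V_hat ** (beta - 1)) + eps
    return H * numerator / denominator            # entry-wise product and division

# Toy usage with random non-negative data (K = 30 bins, T = 20 frames, R = 4 atoms)
rng = np.random.default_rng(0)
V = rng.random((30, 20)) + 0.1
W = rng.random((30, 4)) + 0.1
H = rng.random((4, 20)) + 0.1
for _ in range(50):
    H = update_H(V, W, H, beta=1.0)
```

Since every factor in the update is non-negative, a non-negative initialization of $H$ stays non-negative at every iteration, which is the practical appeal of multiplicative rules.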
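The rule for $a_{nr}$ is obtained by a decomposition similar to [9]:
$$ a_{nr} \leftarrow a_{nr} \cdot \frac{\sum_{k,t} \left( g_\tau(f_k - f_{nr})\, H_{rt} \right) \hat{V}_{kt}^{\beta-2}\, V_{kt}}{\sum_{k,t} \left( g_\tau(f_k - f_{nr})\, H_{rt} \right) \hat{V}_{kt}^{\beta-1}}. \tag{10} $$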
Then, the update rules of the remaining parameters are specific to each of the three different NMF models. In the following, $g'_\tau(f_k)$ represents the derivative of $g_\tau(f_k)$ with respect to $f_k$ on the spectral support of the main lobe. For a Hann window (normalized to a maximal magnitude of 1) and $f_k \in [-2/\tau,\, 2/\tau]$:
$$ g'_\tau(f_k) = \frac{1}{\pi\tau} \cdot \frac{(3\tau^2 f_k^2 - 1) \sin(\pi\tau f_k) + \pi\tau (f_k - \tau^2 f_k^3) \cos(\pi\tau f_k)}{(f_k - \tau^2 f_k^3)^2}. $$
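The closed-form derivative above can be checked numerically against a central finite difference. The sketch below does so at a few frequencies inside the main lobe, away from the removable singularities at $f = 0$ and $f = \pm 1/\tau$; the function names `g` and `g_prime` are hypothetical.

```python
import numpy as np

def g(f, tau):
    """Hann main-lobe magnitude (valid for 0 < |f| < 2/tau, f != +/- 1/tau)."""
    return np.sin(np.pi * tau * f) / (np.pi * tau * (f - tau**2 * f**3))

def g_prime(f, tau):
    """Closed-form derivative of g with respect to f, as given above."""
    d = f - tau**2 * f**3
    num = (3 * tau**2 * f**2 - 1) * np.sin(np.pi * tau * f) \
        + np.pi * tau * d * np.cos(np.pi * tau * f)
    return num / (np.pi * tau * d**2)

tau = 0.09
f = np.array([3.0, 8.0, 15.0, -6.0])   # Hz, all inside |f| < 2/tau ~ 22.2 Hz
h = 1e-5
finite_diff = (g(f + h, tau) - g(f - h, tau)) / (2 * h)
max_error = np.max(np.abs(finite_diff - g_prime(f, tau)))

# On the main lobe the slope opposes the sign of f, so -g'(f)/f is non-negative.
# This is what keeps the P and Q terms of the frequency update rules positive.
ratio = -g_prime(f, tau) / f
```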
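• Ha-NMF / Inh-NMF
$$ F_{0r} \;\overset{\mathrm{Ha\,/\,Inh}}{\longleftarrow}\; F_{0r} \cdot \frac{Q_0(F_{0r})}{P_0(F_{0r})}, \tag{11} $$
$$ B_r \;\overset{\mathrm{Inh}}{\longleftarrow}\; B_r \cdot \frac{Q^{\mathrm{Inh}}_0(B_r)}{P^{\mathrm{Inh}}_0(B_r)}, \tag{12} $$
with
$$ P_0(F_{0r}) = \sum_{k,t} \left[ \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-C\, f_k\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-1} + \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-C\, f_{nr}\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-2}\, V_{kt} \right], \tag{13} $$
$$ Q_0(F_{0r}) = \sum_{k,t} \left[ \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-C\, f_k\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-2}\, V_{kt} + \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-C\, f_{nr}\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-1} \right], \tag{14} $$
$$ P^{\mathrm{Inh}}_0(B_r) = \sum_{k,t} \left[ \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-D\, f_k\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-1} + \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-D\, f_{nr}\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-2}\, V_{kt} \right], \tag{15} $$
$$ Q^{\mathrm{Inh}}_0(B_r) = \sum_{k,t} \left[ \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-D\, f_k\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-2}\, V_{kt} + \left( \sum_{n=1}^{N_r} a_{nr}\, \frac{-D\, f_{nr}\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt} \right) \hat{V}_{kt}^{\beta-1} \right]. \tag{16} $$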
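For the strict harmonic model (Ha-NMF), $f_{nr} = n F_{0r}$ and $C = \partial f_{nr} / \partial F_{0r} = n$. For the strict inharmonic model (Inh-NMF), $f_{nr} = n F_{0r} \sqrt{1 + B_r n^2}$, $C = \partial f_{nr} / \partial F_{0r} = n \sqrt{1 + B_r n^2}$, and $D = \partial f_{nr} / \partial B_r = n^3 F_{0r} / \left( 2 \sqrt{1 + B_r n^2} \right)$.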
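The two chain-rule factors $C$ and $D$ of the strict inharmonic model can be verified with finite differences, as in the sketch below. The function names and the numerical values ($F_0 = 55$ Hz, $B = 10^{-4}$) are illustrative assumptions.

```python
import numpy as np

def f_partial(n, F0, B):
    """Eq. (6): frequency of the n-th partial under the inharmonicity law."""
    return n * F0 * np.sqrt(1.0 + B * n**2)

def C_factor(n, F0, B):
    """C = d f_n / d F0 for the strict inharmonic model (reduces to n when B = 0)."""
    return n * np.sqrt(1.0 + B * n**2)

def D_factor(n, F0, B):
    """D = d f_n / d B for the strict inharmonic model."""
    return n**3 * F0 / (2.0 * np.sqrt(1.0 + B * n**2))

n = np.arange(1, 11, dtype=float)
F0, B = 55.0, 1e-4
# Central finite differences in F0 and in B
dF0 = (f_partial(n, F0 + 1e-4, B) - f_partial(n, F0 - 1e-4, B)) / 2e-4
dB = (f_partial(n, F0, B + 1e-8) - f_partial(n, F0, B - 1e-8)) / 2e-8
```

Note how $D$ scales with $n^3$: the high-rank partials dominate the gradient with respect to $B_r$, consistent with the fact that inharmonicity is mostly visible on upper partials.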
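• InhR-NMF:
$$ f_{nr} \;\overset{\mathrm{InhR}}{\longleftarrow}\; f_{nr} \cdot \frac{Q^{\mathrm{InhR}}_0(f_{nr}) + \lambda \cdot Q^{\mathrm{InhR}}_1(f_{nr})}{P^{\mathrm{InhR}}_0(f_{nr}) + \lambda \cdot P^{\mathrm{InhR}}_1(f_{nr})}, \tag{17} $$
$$ B_r \;\overset{\mathrm{InhR}}{\longleftarrow}\; B_r \cdot \frac{Q^{\mathrm{InhR}}_1(B_r)}{P^{\mathrm{InhR}}_1(B_r)}, \tag{18} $$
$$ F_{0r} \;\overset{\mathrm{InhR}}{=}\; \frac{\sum_{n=1}^{N_r} f_{nr}\, n \sqrt{1 + B_r n^2}}{\sum_{n=1}^{N_r} n^2 (1 + B_r n^2)}, \tag{19} $$
where
$$ P^{\mathrm{InhR}}_0(f_{nr}) = \sum_{k,t} \left[ a_{nr}\, \frac{-f_k\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt}\, \hat{V}_{kt}^{\beta-1} + a_{nr}\, \frac{-f_{nr}\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt}\, \hat{V}_{kt}^{\beta-2}\, V_{kt} \right], \tag{20} $$
$$ Q^{\mathrm{InhR}}_0(f_{nr}) = \sum_{k,t} \left[ a_{nr}\, \frac{-f_k\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt}\, \hat{V}_{kt}^{\beta-2}\, V_{kt} + a_{nr}\, \frac{-f_{nr}\, g'_\tau(f_k - f_{nr})}{f_k - f_{nr}}\, H_{rt}\, \hat{V}_{kt}^{\beta-1} \right], \tag{21} $$
$$ P^{\mathrm{InhR}}_1(f_{nr}) = 2 f_{nr}, \qquad Q^{\mathrm{InhR}}_1(f_{nr}) = 2 n F_{0r} \sqrt{1 + B_r n^2}, \tag{22} $$
$$ P^{\mathrm{InhR}}_1(B_r) = F_{0r} \sum_{n=1}^{N_r} n^4, \qquad Q^{\mathrm{InhR}}_1(B_r) = \sum_{n=1}^{N_r} \frac{n^3 f_{nr}}{\sqrt{1 + B_r n^2}}. \tag{23} $$
Note that for $F_{0r}$, an exact analytic solution is obtained (eq. (19)) when cancelling the partial derivative of the cost function $C_1$.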
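The two updates that only involve the penalty $C_1$ (eqs. (18), (19) and (23)) are simple enough to sketch for a single note. The function names `update_F0` and `update_B` are hypothetical, and the synthetic partials used in the usage lines follow the inharmonicity law exactly, which is an idealized assumption.

```python
import numpy as np

def update_F0(f, B):
    """Eq. (19): closed-form F0 minimizing C1 for fixed partial frequencies f and fixed B."""
    n = np.arange(1, len(f) + 1, dtype=float)
    s = np.sqrt(1.0 + B * n**2)
    return np.sum(f * n * s) / np.sum(n**2 * s**2)

def update_B(f, F0, B):
    """Eqs. (18) and (23): one multiplicative update of B for fixed f and F0."""
    n = np.arange(1, len(f) + 1, dtype=float)
    s = np.sqrt(1.0 + B * n**2)
    Q1 = np.sum(n**3 * f / s)
    P1 = F0 * np.sum(n**4)
    return B * Q1 / P1

# Synthetic partials that follow the law exactly (F0 = 110 Hz, B = 2e-4)
n = np.arange(1, 11, dtype=float)
f = n * 110.0 * np.sqrt(1.0 + 2e-4 * n**2)
F0_est = update_F0(f, 2e-4)          # recovers F0 when B is correct
B_next = update_B(f, 110.0, 2e-4)    # the true (F0, B) pair is a fixed point
```

When the current $B$ is below the value that fits the partials, the ratio $Q_1/P_1$ exceeds 1 and the update increases $B$, and conversely.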
3.1.2. Practical considerations
• Amplitude normalization: In order to obtain a unique decomposition, the $a_{nr}$ are normalized to a maximal value of 1 for each atom indexed by $r$. Thus, after each update of the $a_{nr}$ ($\forall r \in [1, R]$ and $\forall n \in [1, N_r]$), the following steps are applied:
$$ \begin{aligned} A_r &= \max_{n \in [1, N_r]} (a_{nr}), && \forall r \in [1, R], \\ a_{nr} &= a_{nr} / A_r, && \forall r \in [1, R],\ \forall n \in [1, N_r], \\ H &= \mathrm{diag}(A) \cdot H, \end{aligned} \tag{24} $$
where $\mathrm{diag}(A)$ is a diagonal matrix of dimension $R \times R$ containing the $A_r$, $\forall r \in [1, R]$.
• Algorithms: The steps of the InhR-NMF algorithm are summarized in Algorithm 1. The Ha-NMF and Inh-NMF algorithms are not given in this paper but are highly similar to InhR-NMF: instead of updating $f_{nr}$, then $F_{0r}$ and $B_r$ in lines 12-18, $F_{0r}$ (Ha-NMF) or $F_{0r}$ and $B_r$ (Inh-NMF) are directly updated and $W^{\theta}$ recomputed.
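The normalization of eq. (24) can be sketched as follows. The function name `normalize_amplitudes` and the array layout (amplitudes stored as an $N \times R$ matrix, i.e. the same number of partials for every note, all maxima strictly positive) are simplifying assumptions of this sketch.

```python
import numpy as np

def normalize_amplitudes(a, H):
    """Eq. (24): scale each note's amplitudes to a maximum of 1 and compensate in H.

    a: (N, R) array, a[n-1, r] = a_nr (hypothetical layout)
    H: (R, T) activation matrix
    """
    A = a.max(axis=0)               # A_r = max_n a_nr
    a_normalized = a / A            # a_nr <- a_nr / A_r
    H_scaled = A[:, None] * H       # H <- diag(A) . H
    return a_normalized, H_scaled

rng = np.random.default_rng(1)
a = rng.random((5, 3)) + 0.1
H = rng.random((3, 7))
a_n, H_n = normalize_amplitudes(a, H)
```

Because each atom in eq. (2) is linear in its amplitudes, dividing the amplitudes of note $r$ by $A_r$ while multiplying row $r$ of $H$ by $A_r$ leaves the model spectrogram $W^{\theta}H$ unchanged; only the scale ambiguity between the two factors is removed.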
Algorithm 1 InhR-NMF optimization
1: Input: $V$, magnitude spectrogram (normalized to a maximal value of 1)
2: Initialization: $\forall r \in [1, R],\ n \in [1, N_r]$,
3:    > $(B_r, F_{0r})$, then $f_{nr} = n F_{0r} \sqrt{1 + B_r n^2}$ / $a_{nr} = 1$
4:    > $W^{\theta}$ (cf. eq. (2)) / $H$ initialized with random positive values
5:    > $\beta$ / $\lambda$
6: Optimization:
7: for it = 1 to I do
8:    > $H$ update (eq. (9))
9:    > $a_{nr}$ update $\forall r \in [1, R],\ n \in [1, N_r]$ (eq. (10))
10:   $a_{nr}$ normalization and $H$ update (eq. (24))
11:   $W^{\theta}$ update (eq. (2))
12:   > $f_{nr}$ update $\forall r \in [1, R],\ n \in [1, N_r]$ (eq. (17))
13:   $W^{\theta}$ update (eq. (2))
14:   for u = 1 to 30 do
15:      $\forall r,\ n \in [1, N_r]$
16:      > $F_{0r}$ update (cf. eq. (19))
17:      > $B_r$ update (20 times) (cf. eq. (18))
18:   end for
19: end for
20: Output: $H$, $a_{nr}$, $f_{nr}$, $B_r$, $F_{0r}$
3.2. Post-processing
In order to obtain a list of the detected notes with their onset and
offset times, a post-processing step is applied to the activation
matrix H. Each row is processed by a low-pass differentiator filter.
The resulting matrix dH is then scaled so that its maximal element is
1. Finally, in each row, an onset is detected when dH exceeds a
threshold $10^{T_{on}/20}$, and the corresponding offset is found when
dH crosses (from negative to positive values) a second threshold
$-10^{T_{off}/20} < 0$. If the same note is detected repeatedly at
intervals of less than 100 ms, it is considered a single note. As
discussed in the introduction, this very simple post-processing has
been chosen in order to better highlight the differences between the
models themselves.
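The post-processing described above can be sketched as follows. Several details are assumptions, since the text does not specify them: the low-pass differentiator is approximated by a moving average followed by a first difference, the 100 ms rule is applied to the interval between consecutive onsets of the same row, and the function name `detect_notes` is hypothetical.

```python
import numpy as np

def detect_notes(H, hop_s, t_on_db=-18.0, t_off_db=-80.0, min_gap_s=0.1, smooth=1):
    """Return (row, onset_frame, offset_frame) tuples from an activation matrix H."""
    H = np.asarray(H, dtype=float)
    # Assumed low-pass differentiator: moving average, then first difference.
    kernel = np.ones(smooth) / smooth
    Hs = np.stack([np.convolve(row, kernel, mode="same") for row in H])
    dH = np.diff(Hs, axis=1, prepend=Hs[:, :1])
    peak = dH.max()
    if peak > 0:
        dH = dH / peak                        # maximal element scaled to 1
    thr_on = 10.0 ** (t_on_db / 20.0)         # positive onset threshold
    thr_off = -(10.0 ** (t_off_db / 20.0))    # negative offset threshold
    notes = []
    for r, d in enumerate(dH):
        row_notes = []
        # Onset: dH rises above thr_on between two consecutive frames.
        onsets = np.flatnonzero((d[1:] >= thr_on) & (d[:-1] < thr_on)) + 1
        for on in onsets:
            # Offset: dH goes from below thr_off back above it after the onset.
            later = np.flatnonzero((d[on:-1] < thr_off) & (d[on + 1:] >= thr_off))
            off = on + 1 + later[0] if later.size else len(d) - 1
            on, off = int(on), int(off)
            if row_notes and (on - row_notes[-1][1]) * hop_s < min_gap_s:
                prev = row_notes[-1]          # repeated note: merge with previous
                row_notes[-1] = (r, prev[1], max(prev[2], off))
            else:
                row_notes.append((r, on, off))
        notes.extend(row_notes)
    return notes

# One activation row with a single sustained note (illustrative values).
H_example = np.zeros((1, 40))
H_example[0, 10:30] = 1.0
notes = detect_notes(H_example, hop_s=0.01)
```

Frame indices are converted to seconds by multiplying by the hop size. The default thresholds mirror the values quoted in Section 4 (Ton = -18 dB, Toff = -80 dB).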
4. TRANSCRIPTION EVALUATION
4.1. Protocol
The three models have been tested on the MAPS database
(http://www.tsi.telecom-paristech.fr/aao/en/category/database/). 45
pieces were randomly chosen (5 out of 30 for each of the 9 pianos),
re-sampled to 22050 Hz, and 30-second excerpts were taken starting
from t0 = 5 s. The mean polyphony level per time frame is about 3.23.
In order to estimate an appropriate value for the parameter λ of the
InhR-NMF method, a learning set composed of 9 pieces was built in the
same way (1 piece for each piano, none of them in the test set).
The spectrograms were computed with a Hann window of length
τ = 90 ms, a hop size of τ/8 and a $2^{13}$-point FFT. The number of
spectra in the dictionary was fixed to R = 64, initialized for notes
with MIDI note numbers in [33, 96]. The piano keyboard usually
contains 88 notes, from A0 (21) to C8 (108). Our choice corresponds to
a reduction of one octave in the extreme bass (where the spectral
resolution is not sufficient to perform the analysis) and one octave
in the high treble range (where the non-linear coupling between
triplets of strings at the soundboard [18] produces complex spectra
with multiple partials that cannot be fully explained by a simple
harmonic or inharmonic model). However, these notes in the extreme
parts of the keyboard are rarely played: over the complete MAPS
dataset (352710 notes for 159 different pieces), they only account
for 1.66% of the notes.
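For reference, the analysis settings quoted above translate into the following sample counts and frequency values. The truncation of the window length to an integer number of samples and the tuning reference (A4 = 440 Hz at MIDI note 69) are assumptions of this sketch.

```python
sr = 22050                    # sampling rate in Hz
tau = 0.090                   # analysis window length in seconds
win_len = int(tau * sr)       # window length in samples (truncated, assumption)
hop = win_len // 8            # hop size of tau / 8
n_fft = 2 ** 13               # FFT size
bin_hz = sr / n_fft           # spacing between FFT bins in Hz

midi_notes = list(range(33, 97))                        # R = 64 notes
f_low = 440.0 * 2 ** ((midi_notes[0] - 69) / 12)        # lowest modelled note
f_high = 440.0 * 2 ** ((midi_notes[-1] - 69) / 12)      # highest modelled note
```

Under these assumptions the window spans 1984 samples, the hop is 248 samples, and the FFT bins are spaced by about 2.69 Hz. The 64 modelled notes run from 55 Hz (MIDI 33) to about 2093 Hz (MIDI 96).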
For Ha-NMF, F0r is initialized to exact equal temperament. For
Inh-NMF and InhR-NMF two initializations are tested. The first one
sets F0r to equal temperament (no “octave stretching”) and Br to
$5 \cdot 10^{-3}$, ∀r ∈ [1, R]. The second one is based on an average model
of inharmonicity and tuning on the whole piano tessitura considering
invariants in piano string design and tuner choices [16].
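A minimal sketch of the first initialization is given below: equal-tempered F0r, a constant Br, and partial frequencies fnr = n F0r √(1 + Br n²) with unit amplitudes, as in line 3 of Algorithm 1. The function name, the array layout and the A4 = 440 Hz reference are assumptions. The second initialization relies on the model of [16], whose parameters are not given here, so it is not reproduced.

```python
import numpy as np

def init_partials(midi_notes, n_partials, B0=5e-3, a4=440.0):
    """First initialization: equal temperament and a constant inharmonicity B0."""
    midi_notes = np.asarray(midi_notes, dtype=float)
    f0 = a4 * 2.0 ** ((midi_notes - 69.0) / 12.0)            # (R,) equal temperament
    B = np.full_like(f0, B0)                                 # (R,) same B for all notes
    n = np.arange(1, n_partials + 1)[:, None]                # (N, 1) partial ranks
    f = n * f0[None, :] * np.sqrt(1.0 + B[None, :] * n**2)   # (N, R) partial frequencies
    a = np.ones_like(f)                                      # a_nr = 1
    return f0, B, f, a

f0, B, f, a = init_partials(range(33, 97), n_partials=10)
```

With B = 5·10⁻³ the tenth partial is placed at √1.5 ≈ 1.22 times its harmonic position, which gives an idea of how far this constant value can move the upper partials of every note.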
The experiments are run for β = 1 (KL divergence), 150 iterations and
different values of Nr, the number of partials of the atoms (from 5 to
30). For the post-processing, the onset detection threshold Ton
varies from -80 to -1 dB in steps of 1 dB, and the offset detection
threshold is fixed to -80 dB.
4.2. Results and discussion
For each excerpt, performance is evaluated in terms of Precision (P),
Recall (R) and F-measure (F) [19] (a note is considered correctly
detected if, for a given pitch, the estimated onset time lies within
±50 ms of the ground truth onset). The mean and standard deviation
are then computed over the whole dataset.
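The onset-based evaluation can be sketched as follows. The matching strategy (each reference note can be matched at most once, and each estimate takes the first compatible reference note) and the function name are assumptions, since the text only specifies the ±50 ms tolerance.

```python
def onset_prf(ref, est, tol=0.05):
    """Precision, Recall and F-measure for lists of (pitch, onset_time_s) pairs."""
    unmatched = list(ref)
    tp = 0
    for pitch, onset in est:
        for cand in unmatched:
            # Same pitch and onset within the tolerance: count a true positive.
            if cand[0] == pitch and abs(cand[1] - onset) <= tol:
                unmatched.remove(cand)
                tp += 1
                break
    p = tp / len(est) if est else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# Toy example: one correct note, one late onset, one spurious note.
ref = [(60, 1.00), (64, 2.00)]
est = [(60, 1.03), (64, 2.20), (67, 3.00)]
p, r, f = onset_prf(ref, est)
```

In this toy example only the first estimated note is within tolerance, so P = 1/3, R = 1/2 and F = 0.4.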
On the learning database, the influence of the λ parameter of
InhR-NMF with Nr = 10 has been studied on a logarithmically spaced
grid covering the range $[10^{-6}, 100]$. The optimal F-measure was
obtained for λ = 1, and it should be noted that the performance did
not depend on a fine tuning of this parameter.
The performance obtained on the test database is presented in Table 1
for all the methods and for Ton = -18 dB (the influence of Ton is
depicted in Fig. 1 for Ha/Inh2/InhR2-NMF and Nr = 20). Standard
deviations are not reported in the table but are around 10 to 14 %.
For each method, increasing the number of partials Nr leads to higher
performance. It seems that adding more partials avoids situations
where high notes explain high-rank partials belonging to lower notes.
Ha-NMF results for Nr = 30 are comparable to those obtained by the NMF
under the harmonicity constraint presented in [20] (section V.B) for
similar experimental setups.
For the first initialization, Inh-NMF and InhR-NMF perform worse than
Ha-NMF (this is consistent with the observation in [14]). Conversely,
for the second initialization with the mean model of inharmonicity and
piano tuning, these methods perform significantly better than Ha-NMF
(ANOVA p-values lower than 0.05 for Nr < 30). Furthermore, both
inharmonic models give comparable mean F-measures (p-values higher
than 0.5).
Method   Metric   Nr = 5   Nr = 10   Nr = 20   Nr = 30
Ha       P        39.3     51.9      58.9      62.1
Ha       R        41.0     49.3      56.6      60.6
Ha       F        38.7     49.0      55.9      59.7
Inh 1    P        32.6     44.6      54.9      55.6
Inh 1    R        34.5     46.0      56.7      57.7
Inh 1    F        32.4     43.8      54.0      54.8
InhR 1   P        34.0     43.0      53.4      55.2
InhR 1   R        36.0     44.9      55.7      57.1
InhR 1   F        33.7     42.6      52.7      54.4
Inh 2    P        44.3     59.6      66.4      66.9
Inh 2    R        45.1     55.9      60.9      62.5
Inh 2    F        43.0     55.8      61.5      62.6
InhR 2   P        43.3     57.5      64.1      64.6
InhR 2   R        45.2     54.9      60.3      61.1
InhR 2   F        42.5     54.1      60.2      60.7
Table 1. Mean Precision (P), Recall (R) and F-measure (F), in %, as a
function of Nr for the Ha/Inh/InhR-NMF algorithms, for Ton = -18 dB.
Indices 1 and 2 refer to the two different initializations of Br and
F0r.
[Figure 1: F-measure (vertical axis, 0 to 0.6) plotted against the
onset detection threshold in dB (horizontal axis), with one curve each
for Ha, Inh 2 and InhR 2.]
Fig. 1. F-measure as a function of the onset detection threshold
Ton ∈ [−40, −1] for Ha/Inh2/InhR2-NMF methods and Nr = 20.
This tends to demonstrate that such strategies are highly dependent on
the initialization. Indeed, the reconstruction cost function C0 is
non-convex with respect to the fnr, F0r or Br parameters, and presents
a large number of local minima (this has been checked by computing the
cost function on a grid of these parameters). Hence, multiplicative
update rules (as well as other optimization methods based on gradient
descent) cannot ensure that these parameters will be correctly
estimated. Regarding the results, taking into account the dispersion
of the partial frequencies around the theoretical inharmonicity
relation, as done in InhR-NMF, does not seem to bring any benefit
compared to Inh-NMF.
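The kind of grid check mentioned above (evaluating the reconstruction cost over a grid of parameter values) can be illustrated on a toy problem. The sketch below is not the authors' experiment: it uses a single synthetic atom, a Gaussian lobe as a stand-in for the window kernel gτ, a fixed F0, and scans only B with the KL divergence (β = 1). All names and numerical values are assumptions.

```python
import numpy as np

def atom(freqs, f0, B, n_partials=20, width=5.0):
    """Toy spectrum: sum of Gaussian lobes (stand-in for g_tau) at inharmonic partials."""
    n = np.arange(1, n_partials + 1)
    fn = n * f0 * np.sqrt(1.0 + B * n**2)
    lobes = np.exp(-0.5 * ((freqs[:, None] - fn[None, :]) / width) ** 2)
    return lobes.sum(axis=1) + 1e-9        # small floor keeps the logarithm finite

def kl_cost(v, v_hat):
    """Generalized KL divergence between a target v and a model v_hat."""
    return float(np.sum(v * np.log(v / v_hat) - v + v_hat))

freqs = np.arange(4096) * (22050 / 8192)   # FFT bin frequencies up to Nyquist
target = atom(freqs, 110.0, 1e-3)          # "observed" spectrum with B = 1e-3

# Cost as a function of B on a regular grid, with F0 fixed at its true value.
B_grid = np.linspace(0.0, 5e-3, 251)
cost = np.array([kl_cost(target, atom(freqs, 110.0, b)) for b in B_grid])

# Interior grid points whose cost is lower than both neighbours.
interior = (cost[1:-1] < cost[:-2]) & (cost[1:-1] < cost[2:])
local_minima = B_grid[1:-1][interior]
```

The global minimum of `cost` falls on the grid point at the true value of B. The number of entries in `local_minima` indicates how many other basins a descent-type update could get trapped in for this toy setting, depending on where it is initialized.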
5. CONCLUSION
Including inharmonicity in parametric NMF models has been shown to be
relevant in a piano transcription task, provided that the
inharmonicity and tuning parameters are sufficiently well initialized.
More precisely, an initialization with the same average value for the
inharmonicity of all notes, and equal temperament for the tuning,
turns out to provide worse estimates than the simpler purely harmonic
model. However, a note-dependent inharmonicity law with fixed
parameters, together with the corresponding “stretched” tuning curves,
provides a good initialization for our models, which leads to a
significant improvement in the transcription results. Further work
will investigate how these models of partial frequencies can be
combined with amplitude models (smooth spectral envelopes), or with
frame dependencies in time.
6. REFERENCES
[1] D. D. Lee and H. S. Seung, “Learning the parts of objects by
non-negative matrix factorization,” Nature, vol. 401, pp. 788–
791, 1999.
[2] Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka,
“Sparse and shift-invariant feature extraction from non-
negative data,” in Proc. of the IEEE Int. Conf. on Acous-
tics, Speech and Signal Processing (ICASSP), 2008, pp. 2069–
2072.
[3] Paris Smaragdis and Judith C. Brown, “Non-negative matrix
factorization for polyphonic music transcription,” in Proc. of
the IEEE Workshop on Application of Signal Processing to Au-
dio and Acoustics (WASPAA), 2003, pp. 177–180.
[4] Arnaud Dessein, Arshia Cont, and Guillaume Lemaitre, “Real-time
polyphonic music transcription with non-negative matrix
factorization and beta-divergence,” in Proc. of the 11th
Int. Soc. for Music Information Retrieval Conference (ISMIR),
2010, pp. 489–494.
[5] Jean-Louis Durrieu, Gaël Richard, Bertrand David, and Cédric
Févotte, “Source/filter model for unsupervised main melody
extraction from polyphonic audio signals,” IEEE Trans. on
Audio, Speech, and Language Processing, vol. 18, no. 3, pp.
564–575, March 2010.
[6] Tuomas Virtanen, “Monaural sound source separation by
nonnegative matrix factorization with temporal continuity and
sparseness criteria,” IEEE Trans. on Audio, Speech, and Lan-
guage Processing, vol. 15, no. 3, pp. 1066–1074, March 2007.
[7] Bernhard Niedermayer, “Non-negative matrix division for the
automatic transcription of polyphonic music,” in Proc. of the
9th Int. Soc. for Music Information Retrieval conference (IS-
MIR), 2008, pp. 544–549.
[8] Nancy Bertin, Roland Badeau, and Emmanuel Vincent, “En-
forcing harmonicity and smoothness in bayesian non-negative
matrix factorization applied to polyphonic music transcrip-
tion,” IEEE Trans. on Audio, Speech and Language Process-
ing, vol. 18, no. 3, pp. 538–549, March 2010.
[9] Romain Hennequin, Roland Badeau, and Bertrand David,
“Time-dependent parametric and harmonic templates in non-
negative matrix factorization,” in Proc. of the 13th Int. Conf.
on Digital Audio Effects (DAFx-10), September 2010, pp. 246–
253.
[10] Romain Hennequin, Roland Badeau, and Bertrand David,
“Nmf with time-frequency activations to model nonstationary
audio events,” IEEE Trans. on Audio, Speech and Language
Processing, vol. 19, no. 4, pp. 744–753, May 2011.
[11] Patrik O. Hoyer, “Non-negative matrix factorization with
sparseness constraints,” Journal of Machine Learning Re-
search, vol. 5, pp. 1457–1469, 2004.
[12] Kazuki Ochiai, Hirokazu Kameoka, and Shigeki Sagayama,
“Explicit beat structure modeling for non-negative matrix
factorization-based multipitch analysis,” in Proc. of the
IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
(ICASSP), March 2012, pp. 133–136.
[13] N. H. Fletcher and T. D. Rossing, THE PHYSICS OF MUSI-
CAL INSTRUMENTS. 2nd Ed., pp. 64–69, 352–398, Springer,
1998.
[14] Emmanuel Vincent, Nancy Bertin, and Roland Badeau, “Har-
monic and inharmonic nonnegative matrix factorization for
polyphonic pitch transcription,” in Proc. of the IEEE Int. Conf.
on Acoustics, Speech and Signal Processing (ICASSP), 2008,
pp. 109–112.
[15] Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu,
“Non-negative matrix factorization with the Itakura-Saito divergence:
With application to music analysis,” Neural Computation, vol.
21, no. 3, pp. 793–830, March 2009.
[16] François Rigaud, Bertrand David, and Laurent Daudet, “A
parametric model of piano tuning,” in Proc. of the 14th Int.
Conf. on Digital Audio Effects (DAFx-11), September 2011,
pp. 393–399.
[17] François Rigaud, Bertrand David, and Laurent Daudet, “Piano
sound analysis using non-negative matrix factorization with
inharmonicity constraint,” in Proc. of the 20th European Signal
Processing Conference (EUSIPCO), August 2012, pp. 393–
399.
[18] G. Weinreich, “Coupled piano strings,” J. Acoust. Soc. Am.,
vol. 62, no. 6, pp. 1474–1484, 1977.
[19] C.J. van Rijsbergen, INFORMATION RETRIEVAL, Butter-
worths, London, UK, 2nd edition, 1979.
[20] Emmanuel Vincent, Nancy Bertin, and Roland Badeau, “Adap-
tive harmonic spectral decomposition for multiple pitch esti-
mation,” IEEE Trans. on Audio, Speech and Language Pro-
cessing, vol. 18, no. 3, pp. 528–537, March 2010.