Fundamentals of music processing chapter 6 발표자료

Review on ‘Fundamentals of Music Processing’
Ch.6 Tempo and Beat
Tracking
모두의 연구소
Music processing lab
최정

So far we’ve covered..
• Music representations (ch1)
: basic notations / representations, their
structure
• Fourier analysis (ch2)
: transforming signal into the Frequency
domain(spectrogram),
sampling / DFT, FFT, STFT
• Music Synchronization (ch3)
: log-frequency spectrogram, Chromagram,
synchronization between different
representation(DTW)
• Music Structure Analysis (ch4)
: Chroma-based self-similarity matrix
 path, block(+ enhancements)
Audio thumbnailing (fitness function 
optimization(DP)), Scape plot representation
• Chord Recognition (ch5)
: basic harmony theory
Template-based / HMM-based approach

Chapter 6: Tempo and Beat tracking
6.1 Onset Detection
6.2 Tempo Analysis
6.3 Beat and Pulse Tracking
6.4 Further Notes

Introduction
• beat : drives music forward / provides the temporal framework of a piece
of music
• Want to detect the phase and the period
• tempo : the rate of the pulse(given by the reciprocal of the beat period.)
• Humans are often able to tap along with the musical beat without difficulty
(sometimes even unconsciously).
• Simulating this cognitive process with an automated beat tracking system is
much harder.
• Recent beat tracking systems are mostly based on beat onsets  can’t
deal with music without beat onsets.(e.g. string music) / music with
dynamically changing tempo.

Introduction
• onset detection : estimate the positions of note onsets within the music
signal
• a novelty representation that captures certain changes in the signal’s
energy or spectrum  peaks of such a representation yield good indicators
for note onset candidates.
• Tempogram : local tempo information on different pulse levels
• 2 methods for periodicity analysis : using Fourier / using autocorrelation
analysis techniques
• Beat tracking : a mid-level representation for local pulse information (even
in the presence of significant tempo changes)  DP to get a robust beat
tracking procedure.

Onset detection
• Onset detection : determining the starting times of notes or other
musical events as they occur in a music recording
• a transient : a noise-like sound component of short duration and high
amplitude typically occurring at the beginning of a musical tone or a
more general sound event. (But as opposed to the attack and
transient, onset has no duration.)
• The onset of a note refers to the single instant (rather than a period)
that marks the beginning of the transient, or the earliest time point at
which the transient can be reliably detected.

Onset detection
Illustration of attack, transient, onset, and decay of a single note.
(a) Note played on a piano. (b) Idealized amplitude envelope.

Onset detection
• General idea
: to capture sudden changes that often mark the beginning of transient
regions.
1) Notes that have a pronounced attack phase
: just locate time positions where the signal’s amplitude envelope starts
increasing.  easy(relatively)
2) If not(non-percussive, soft onset, blurred note) Or polyphonic music
 hard

Onset detection
1) Signal converted into a suitable feature representation
2) a derivative operator is applied
3) a peak-picking algorithm is employed
(We’ve seen similar approach in novelty detection(from ch.4), but
tolerance window should be much smaller.)

Energy-Based Novelty
• Goal: Catch sudden increase of the signal’s energy
1) Transform the signal into a local energy function.
2) Look for sudden changes in this function.
_Procedure
Let x be a DT-signal.
Perform a discrete STFT
Fix a discrete window function w : Z → R
(to be shifted over the signal x to determine local sections)

• Let‘s assume..
• w is a bell-shaped function centered at time zero.
• w(m) for m ∈ [−M : M] comprises the nonzero samples of w for some M ∈ N.
• local energy of x with regard to w is defined to be the function 𝐸 𝑤
𝑥 : Z → R given by for n ∈ Z
contains the energy of the signal x
multiplied with a window(bell-shaped) shifted by n samples

• Intuitively, to measure energy changes, we take a derivative of the
local energy function
• discrete case : difference between two subsequent energy values
• Detect only energy increases : positive differences while setting the
negative differences to zero  half-wave rectification
• energy-based novelty function
(r ∈ R)

• human perception of sound intensity is logarithmic in nature.
 even musical events of rather low energy may still be perceptually
relevant.
 Use logarithmic energy enhancement.

Computation of an energy-based novelty function
Annotated note onsets
(the four beat positions are marked by thick lines)
Waveform.
Local energy function.
Discrete derivative.
Novelty function ∆Energy obtained after half-wave rectification.
Novelty function ∆Log Energy based on a logarithmic energy function.

Computation of an energy-based novelty function
4 quarter-note drum beats
correspond to the 4 highest
peaks.
(can be detected by a simple
peak-picking procedure)
4 hihat strokes between the beat
positions do not show up in ∆Energy
(relatively little energy) the weak hihat onsets become
visible

Energy Fluctuation problem
• energy fluctuation in non-steady sounds (e.g. vibrato or tremolo)
 Waveform and energy-based novelty function of the note C4 (261.6
Hz) played by different instruments
Piano. Violin. Flute.

To increase the robustness of onset detection
• Typical approach
1) decompose the signal into several subbands that contain complementary
frequency information.
2) compute a novelty function for each subband separately
3) combine the individual functions to derive the onset information.
e.g.
Subbands corresponding to musical pitches  pitch-based novelty functions
Subbands corresponding to spectral coefficients  spectral changes novelty functions
 more refined beat tracking information

Subband example.
• For example, the subbands may correspond to musical pitches, which
results in pitch-based novelty functions.

Problems with pure Energy-Based Novelty
• Polyphonic music with simultaneously occurring sound events : problematic
- A musical event of low intensity may be masked by an event of high intensity.
- Energy fluctuations (e.g., coming from vibrato) in the sustain phase of one instrument may be
stronger than energy increases in the attack phase of other instruments.
 Hard to detect all onsets when using purely energy-based methods.
- The characteristics of note onsets may strongly depend on the respective type of instrument.
- Noise-like broad-band transients may be observable in certain frequency bands even in
polyphonic mixtures.
- For example, for percussive instruments with an impulse-like onset, one can observe a sudden
increase in energy that is spread across the entire spectrum of frequencies. Such noise-like broad-
band transients may be observable in certain frequency bands even in polyphonic mixtures.
- Since the energy of harmonic sources is concentrated more in the lower part of the spectrum,
transients are often well detectable in the higher frequency region.

Spectral-Based Novelty
1) Convert the signal into a time–frequency representation
2) Capture changes in the frequency content.
let X be the discrete STFT of the DT-signal x :
(the sampling rate Fs= 1/T , the window length N of the discrete window w, the hop size
H.)
X(n,k) ∈ C (complex number)
: k-th Fourier coefficient for frequency index k ∈ [0 : K] and time frame n ∈ Z,
where K = N/2 is the frequency index corresponding to the Nyquist frequency.

• Spectral-based novelty function (spectral flux)
: compute the difference between subsequent spectral vectors using a
suitable distance measure.
Additional processing :
- change parameters of the STFT / the distance measure
- pre- and post-processing steps

First, Enhancement
• Apply a logarithmic compression to the spectral coefficients to
enhance weak spectral components. (remember in chromagram..)
logarithmic sensation of sound intensity and to balance out the
dynamic range of the signal.
Increasing γ  the low-intensity values are further enhanced
(On the downside, a large γ may also amplify non-relevant noise-like components.)

Then, get Temporal Derivative
• Compute the discrete temporal derivative of the compressed spectrum Y.
• Suitable post-processing steps for better results
In view of a subsequent peak-picking step :
Enhance the peak structure of the novelty function + while suppressing small
fluctuations.
subtracting the local average from ∆Spectralonly keeping the positive part

Getting Temporal Derivative
Get the local average :
M ∈ N : determines the size of an averaging window
Subtracting the local average from ∆Spectral and only keeping the positive part :
: a local average function (Remember also in computing chromagram..)

Waveform.
Compressed spectrogram
using γ = 100
Novelty function ∆Spectral
Novelty function ∆Spectral.
(averaged spectral data)
(thick lines : 4 beat positions).
local average function μ
(in thick/red).
significant peaks at the four weak
hihat onsets between the beats.
(compared to Energy-based model)

Effect of logarithmic compression
Magnitude
spectrogram.
Compressed
spectrogram
using γ = 1.
Compressed spectrogram
using γ = 1000.
magnitude
spectrogram
novelty function
∆Spectral

Comparing 2 models
Score representation
(in a piano reduced version).
Waveform.
Energy-based novelty function.
Spectral-based novelty function.
(thicker lines : downbeat positions).
Shostakovich’s Waltz No. 2 from the “Suite for Variety Orchestra No.
1.”

Comparing 2 models
The peaks that correspond to
downbeats are hardly visible or
even missing.
whereas the peaks that correspond to
the percussive beats are much more
pronounced.
improvements
 Spectral-based approach is better in this case.

• more approaches
First split up the spectrum into several frequency bands
(often five to eight logarithmically spaced bands are used)
Then get the derivative for each.
 weighted and summed up  single overall novelty function

Phase-Based Novelty
• Before, we only used the magnitude of the spectral coefficients, but now we use
the phase.
• let X(n,k) ∈ C be the complex-valued Fourier coefficient for frequency index k ∈ [0
: K] and time frame n ∈ Z.
• Using the polar coordinate representation,
• The phase φ(n,k) determines how the sinusoid of frequency Fcoef(k) = Fs · k/N
has to be shifted to best correlate with the windowed signal corresponding to
the nth frame.
complex coefficient phase

Phase-Based Novelty
Locally stationary signal and
its correlation to a sinusoid
(corresponding to frequency
index k for the frames n−2, n−1,
n, and n+ 1.)
The angular representation
of the phases

Phase-Based Novelty
1) If the signal x has a high correlation with this sinusoid (i.e., |X (n, k)| is large)
2) And if shows a steady behavior in a region of a number of subsequent frames
..., n−2, n−1, n, n+1, ... (i.e., x is locally stationary)
 Then the phases ..., φ(n−2,k), φ(n−1,k), φ(n,k), φ (n + 1, k), ... increase from
frame to frame in a fashion that is linear in the hop size H of the STFT.
 frame-wise phase difference in this region remains approximately constant

Phase-Based Novelty
• Constant frame-wise phase difference in a certain region :
• the first-order difference :
• the second-order difference :
So when in steady regions of x, 
But in transient regions,  the phase behaves quite unpredictably across the
entire frequency range

Phase-Based Novelty
• So, a simultaneous disturbance of the values φ′′(n,k) for k ∈ [0 : K]
 good indicator for note onsets.
• the phase-based novelty function ∆Phase :

In need of phase unwrapping
• The phase r (in radians) of a complex number c ∈ C is defined only up to integer multiples of 2π.
 the phase is often constrained to the interval [0,2π) and the number r ∈ [0,2π) is called the
principal value of the phase.
• In Fourier analysis, we are using the normalized phases φ = r /(2π).  the principal values : [0, 1)
• When considering a function or a time series of phase values (e.g., the phase values over the
frames of an STFT), the choice of principal values may introduce unwanted discontinuities.
• These artificial phase jumps are the results of phase wrapping, where a phase value just below
one is followed by a value just above zero (or vice versa).
• To avoid such discontinuities, one often applies a procedure called phase unwrapping, where the
objective is to recover a possibly continuous sequence of (unwrapped) phase values.

Phase unwrapping
• Such a procedure, however, is in general not well defined since the original time series may possess “real”
discontinuities that are hard to distinguish from “artificial” phase jumps.
• In the onset detection context, phase jumps due to wrapping may occur when computing the
differences.(1st-order/2nd-order). In these cases, one needs to use an unwrapped version of the phase. As an
alternative, we introduce a principal argument function :
• A suitable integer value is added to or subtracted from the original phase difference  yield a value in [−0.5,
0.5].
• Even though the principal argument function may cancel out large discontinuities in the phase differences,
this effect is attenuated since we consider the sum of differences over all frequency indices.

Phase unwrapping
Illustration of phase unwrapping.
Wrapped phase.
Unwrapped phase.

Complex-Domain Novelty
• If the magnitude of the Fourier coefficient X(n,k) is very small, the phase φ(n,k)
may exhibit a rather chaotic behavior due to small noise-like fluctuations that
may occur even within a steady region of the signal.
• To obtain a more robust detector, one idea is to weight the phase information
with the magnitude of the spectral coefficient. This leads to a complex-domain
variant of the novelty function, which jointly considers phase and magnitude.
• The assumption of this variant is that the phase differences as well as the
magnitude stay more or less constant in steady regions.
• Therefore, given the Fourier coefficient X(n, k), one obtains a steady-state
estimate 𝑋(n + 1, k) for the next frame by setting
steady-state estimate

• So, a steady-state estimate 𝑋(n + 1, k) for the next frame (given the Fourier
coefficient X(n, k)) :
• The magnitude between the estimate 𝑋(n + 1, k) and the actual coefficient X (n +
1, k)
 obtain a measure of novelty :
The complex-domain difference

• The complex-domain difference X′(n,k) quantifies the degree of non-stationarity
for frame n and coefficient k.
• To disciminate between energy increase(note onset) / decrease(note offset)
: decompose X′(n,k)  X+(n,k) increasing magnitude/ X−(n,k) decreasing magnitude

• A complex-domain novelty function ∆Complex for detecting note
onsets can then be defined by summing only the values X+(n,k) over
all frequency coefficients:
Cf. Similarly, for detecting general transients or note offsets, one may
compute a novelty function using X′(n,k) or X−(n,k), respectively.

Illustration of the complex-domain difference
X′(n, k) between an estimated spectral
coefficient 𝑋(n+ 1,k) and the actual coefficient
X(n+1,k).

Tempo Analysis
• Notions of tempo and beat remain rather vague or are even
nonexistent. Why?
• Weak note onsets / local tempo changes interrupted or disturbed by
syncopation.

Tempo Analysis
• Tempo can be very vague, but we set 2 assumptions.
1) Beat positions occur at note onset positions.
2) Beat positions are more or less equally spaced—at least for a certain period of
time.
: convenient and reasonable for a wide range of music including most rock and
popular songs.
 time–tempo or tempogram representations
: capture local tempo characteristics of music signals

Tempogram Representations
• Tempogram : indicates for each time instance the local relevance of a
specific tempo for a given music recording.
• Intuitively, the value T(t , τ ) indicates the extent to which the signal
contains a locally periodic pulse of a given tempo τ in a
neighborhood of time instance t. (on a discrete time–tempo grid)
time parameter t ∈ R : measured in seconds
tempo parameter τ ∈ R>0 : measured in beats per minute (BPM)

• the sampled time axis is given by [1 : N]. (we extend this axis to Z.)
• Θ⊂R > 0 : a finite set of tempi specified in BPM.
• a discrete tempogram is a function :

1) Based on the assumption that pulse positions usually go along with note
onsets, the music signal is first converted into a novelty function.
2) The locally periodic behavior of the novelty function is analyzed.
 the periodic behavior for various periods T > 0 (given in seconds) in a
neighborhood of a given time instance.
• The rate ω = 1/T (measured in Hz) and the tempo τ (measured in BPM) are
related by :

• Pulses in music are often organized in complex hierarchies that represent the
rhythm.
• One may consider the tempo on the tactus level (quarter note level) / the
measure level / the tatum (temporal atom) level (fastest repetition rate of
musically meaningful accents)
• Higher pulse levels often correspond to integer multiples or fractions (τ, 2τ, 3τ,... /
τ, τ/2, τ/3,...) of a tempo τ.
 tempo harmonics / subharmonics of τ
• Diff. between 2 tempi with half or double the value : tempo octave

Illustration of two different
tempogram representations T of a
click track with increasing tempo
(170 to 200 BPM).
Novelty function of click track.
Tempogram with harmonics.
Tempogram with subharmonics
The large values T(t, τ)
around t = 5 sec / τ = 180 BPM
 a dominant tempo of 180 BPM
around time position 5 sec

Fourier Tempogram
• Compute a spectrogram (STFT) of the novelty curve :
Convert frequency axis (given in Hertz)  tempo axis (given in BPM)
• Magnitude spectrogram : local tempo.

Recap Fourier..
• We want to represent a signal as a combination of a bunch of sine and consine
functions.
A sinusoid
Complex formulation of sinusoids

Recap Fourier..
Fourier Series in Trigonometric Form :
Fourier Series in Complex Form :
+ Decides the frequency
T : entire period

Recap Fourier..
• Tells which frequencies occur,
• But does not tell when the frequencies occur.
• Frequency information is averaged over the entire time interval.
• Time information is hidden in the phase.
We need STFT!

Fourier Tempogram
• Fix a window function w : Z → R of finite length centered at n = 0 (e.g., a sampled
Hann window)
• For a frequency parameter ω ∈ R≥0 and time parameter n ∈ Z, the complex
Fourier coefficient F(n,ω) is defined :
• Converting frequency to tempo values, we define the (discrete) Fourier
tempogram :

Fourier Tempogram
(a) Novelty function ∆ .
(b) Fourier tempogram T F.
(c–e) Correlation of ∆ and various
analyzing windowed sinusoids.

Autocorrelation Tempogram
• Autocorrelation : a mathematical tool for measuring the similarity of a signal with
a time-shifted version of itself.
• A mathematical tool for finding repeating patterns, such as the presence of a
periodic signal obscured by noise, or identifying the missing fundamental
frequency in a signal implied by its harmonic frequencies.
• Sliding inner product (discrete-time / real-valued signals)
Autocorrelation :
time-shift or lag parameter

• Apply the autocorrelation in a local fashion for analyzing a given novelty function
∆ : Z → R in the neighborhood of a given time parameter n.
(Finite length centered at n=0)
Windowing:
Short-time Autocorrelation :
When assuming that the window function w is of finite length, the autocorrelation of
the localized novelty function is zero for all but a finite number of time lag
parameters.

Novelty function ∆.
Time-lag representation A.
Correlation of ∆ and various time-
shifted windowed sections.
Visualizing the short-time
autocorrelation A leads to a time-lag
representation
The window w : 2 sec of the original audio recording.

Lag to time-tempo
• Convert the lag parameter into a tempo parameter to obtain a time–tempo
representation from the time-lag representation.
• Use standard resampling and interpolation techniques applied to the tempo
domain.

(a) Time-lag representation with linear lag axis.
(b) Representation from (a) with tempo axis.
(c) Time–tempo representation with linear tempo axis.
Conversion from lag to tempo.

As opposed to the Fourier tempogram, an
autocorrelation tempogram exhibits tempo
subharmonics, but suppresses tempo harmonics.
Fourier tempogram
autocorrelation tempogram

One global tempo value.(will be used later)
• Assuming a more or less steady tempo, it suffices to determine one global tempo
value for the entire recording.
• Averaging the tempo values obtained from a frame-wise periodicity analysis :
• Maximum of this value : an estimate for the global tempo of the recording

So far, we’ve figured out the onsets and set up the
tempogram.
We need to track down the local beat and tempo.

We need something more..
• When dealing with music that exhibits significant tempo changes, one needs to
estimate the local tempo in the neighborhood of each time instance, which is a
much harder problem than global tempo estimation.
• Having computed a tempogram, the frame-wise maximum yields a good
indicator of the locally dominating tempo.
• In the case that the tempo is relatively steady over longer periods of time, 
increase the window size to obtain more robust and smoother tempo estimates.
• Instead of simply taking the frame-wise maximum—a strategy that is prone to
local inconsistencies and outliers—global optimization techniques based on
dynamic programming may be used to obtain smooth tempo trajectories. 
beat tracking

Beat and Pulse Tracking
• Given the tempo, find the best sequence of beats.
• Complex Fourier tempogram contains magnitude and phase
information.
• The magnitude encodes how well the novelty curve resonates with a
sinusoidal kernel of a specific tempo.
• The phase optimally aligns the sinusoidal kernel with the peaks of the
novelty curve.

Predominant local pulse(PLP) function
• Construct a mid-level representation that explains the local periodic
nature of a given (possibly noisy) onset representation.
• Starting with a novelty function, we determine for each time position a
windowed sinusoid that best captures the local peak structure of the
novelty function.
• Employ an overlap-add technique by accumulating all sinusoids over time
• A local periodicity enhancement of the original novelty function :
predominant local pulse (PLP) information
• predominant pulse : the strongest pulse level that is measurable in the
underlying novelty function.

PLP function outcome
Tempogram T with frame-wise tempo maxima
(indicated by circles) shown at seven time
positions n.
Optimal windowed sinusoids κn (using a window size of 4
seconds) corresponding to the maxima(tempo)
Accumulation of all sinusoids (overlap-
add).
PLP function Γ obtained after half-wave rectification.

Deriving PLP function(1)
• Using Fourier tempogram TF, compute the tempo parameter τn ∈ Θ that
maximizes TF(n,τ) :
• Then, making use of the phase information that is given by the complex Fourier
coefficients F(n,ω) :
• The phase φn that belongs to the windowed sinusoid of tempo τn :
Time index n
Maximizing tempo among discrete tempi set

Deriving PLP function(2)
• Now, based on τn(tempo) and φn(phase), optimal windowed sinusoid (at time
index n) is κn : Z→R
• The sinusoid κn best explains the local periodic nature of the novelty function at
time position n with respect to the tempo set Θ.
• The period 60/τn corresponds to the predominant periodicity of the novelty
function, and the phase information φn takes care of accurately aligning the
maxima of κn and the peaks of the novelty function.
same window function w as for the Fourier tempogram
Increasing the window size typically yields more robust estimates at the cost of temporal flexibility.
Time index n

PLP computation
• To make the periodicity estimation more robust while keeping the
temporal flexibility, the idea is to form a single function instead of
looking at the sinusoids in a one-by-one fashion.
• Apply an overlap-add technique,
• the optimal windowed sinusoids κn are accumulated over all time
positions n ∈ Z :
PLP function
(+ half-wave rectification.)
Summed over all
time position

Again, PLP function outcome
Tempogram T with frame-wise tempo maxima
(indicated by circles) shown at seven time
positions n.
Optimal windowed sinusoids κn (using a window size of 4
seconds) corresponding to the maxima
Accumulation of all sinusoids (overlap-
add).
PLP function Γ obtained after half-wave rectification.
Sort of like a.. Back to signalizing(regenerating)
from frequency domain to time domain.

Properties of PLP function
• The maximizing tempo values as well as the corresponding optimal windowed
sinusoids are indicated for seven different time positions.
• Each of these windowed sinusoids tries to explain the locally periodic nature of
the peak structure of the novelty function.(where small deviations from the
“ideal” periodicity and weak peaks are balanced out)
 the predominant pulse positions are clearly indicated by the peaks of Γ even
though some of these pulse positions are rather weak in the original novelty
function.
• In this sense, the PLP function : a local periodicity enhancement of the original
novelty function.

Example with significant local tempo changes
(a) Score in a piano reduced version.
(b) Fourier tempogram. The rough tempo
range of the predominant pulse is
highlighted by the rectangular frames.
(c) Manual annotation of note onset positions.
(d) Novelty function ∆Spectral.
(e) PLP function Γ

Properties of PLP
where several sudden tempo changes occur
within the window size of the analysis sinusoids.

• Many note onsets are poorly captured by the
novelty function.
• + Because of large differences in dynamics,
there are some strong onsets that dominate
the novelty function as well as some weak
onsets  height of a peak is not necessarily
the only indicator of its relevance.
• Despite these challenges, the tempo is
reflected well by the Fourier tempogram 
the peak structure of the novelty function still
possesses some local periodic regularities 
the windowed sinusoids corresponding to the
predominant local tempo.

• PLP function not only reveals positions of
predominant pulses but also indicates a kind
of confidence in the estimation.
• Note that the amplitudes of the optimal
windowed sinusoids do not depend on the
amplitude of the novelty function.  PLP is
invariant to changes in dynamics

Properties of PLP function
• Since neighboring sinusoids overlap, (even though they are windowed..)
constructive and destructive interference phenomena in the overlapping regions
influence the amplitude of the resulting PLP function Γ .
• Consistent local tempo estimates result in sinusoids that produce constructive
interferences in the overlap-add process.  In such regions, the peaks of the PLP
function assume a value close to one.
• In contrast, sudden random-like changes in the local tempo estimates result in
inconsistencies in the overlap regions of subsequent sinusoids, which in turn
cause destructive interferences and lower values of Γ .
PLP function

Properties of PLP
• Although the PLP concept is designed to automatically adapt to the predominant
tempo, for some applications the local nature of these estimates might not be
desirable. (taking the frame-wise maximum to determine the predominant
tempo has both advantages and drawbacks.)
• On the one hand, it allows the PLP function to quickly adjust—even to sudden
changes in tempo.
• On the other hand, it may lead to unwanted jumps such as random switches
between tempo octaves.
 Such switches in the pulse level can be avoided by constraining the tempo
set Θ in the maximization.

Properties of PLP
Beat level adjustment by tempo
range restriction based on a piano
recording.
Tempograms and PLP functions
(KS= 4 sec) are shown for various
sets Θ specifying the tempo range
used (given in BPM).
(a) Θ = [30 : 600] (full tempo range).
(b) Θ = [60 : 200] (tempo range
around quarter-note level).
(c) Θ = [200 : 340] (tempo range
around eighth-note level).
(d) Θ = [450 : 600] (tempo range
around sixteenth-note level).
 Such switches in the pulse level can be avoided by constraining the tempo set Θ in the maximization.

Now we have overall locally dominating beats.
But doesn’t the dynamic(or relatively more
significant change in novelty) of each beat has
meaningful information also?

Beat Tracking by Dynamic Programming
• There are many types of music with a strong and steady beat, where the tempo
is more or less constant throughout the entire recording.
• 2 assumptions :
1) beat positions go along with the strongest note onsets
2) the tempo is roughly constant.
(the flexibility introduced by the PLP concept is not needed.)
 construct a score function that measures how well an arbitrary beat sequence
reflects these two assumptions.

Beat Tracking by DP
• Input :
1) a novelty function ∆ :[1:N]→R
2) a rough estimate τ ∈R>0 (global tempo)
(interval [1 : N]  sampled time axis used for the novelty feature computation.)
• Rough tempo estimate τ :
manually or
from τ and the feature rate  an estimate for the beat period δ.
: Maximum tempo from the time-averaged tempogram.

Beat Tracking by DP
• Assuming a roughly constant tempo, the difference δ of two consecutive beats
should be close to δ.
• the deviation of δ from the ideal beat period δ, we introduce a penalty function
P : N → R by setting for δ ∈ N.
 a maximal value : zero at δ = δ / larger deviation : increasingly negative values
• Furthermore, since tempo deviations are relative in nature (doubling the tempo
should be penalized to the same degree as halving the tempo), the penalty
function is defined to be symmetric on a logarithmic axis.
tempo deviations are relative
: doubling the tempo should be penalized to the same degree as halving the tempo

Beat Tracking by DP
Penalty function Pˆ
: measuring the deviation of a given beat period δ
from the ideal beat period δ.

Beat Tracking by DP
• For a given N ∈ N, let B = (b1,b2,...,bL) be a sequence of length L ∈ N0 consisting
of strictly monotonically increasing beat positions bl ∈ [1 : N] for l ∈ [1 : L].
 beat sequence (beat sequence of length L = 0 is the empty sequence)
• let BN denote the set consisting of all possible beat sequences for a given
parameter N ∈ N.
• a score value S(B) : the quality of a beat sequence B ∈ BN
(positive values of the novelty function ∆ + negative values of the penalty function Pδ
)

Beat Tracking by DP
The score value S(B) which involves
a (positive) novelty score ∆ (bl ) and a (negative) deviation penalty Pˆ (bl −
bl−1 )

Beat Tracking by DP
• E.g., if an empty beat sequence(L = 0)  S(B) = 0
• Or if L = 1  S(B) = ∆(b1).
• In order to achieve a large score, the novelty values ∆ (bl) should be large, while
the non-positive penalty values Pδ
(bl − bl −1) should be close to zero.
• The weight parameter λ ∈R>0 is introduced δ to balance out these two
conflicting conditions.
solution of the beat tracking problem.

Beat Tracking by DP
• Number of possible beat sequences is exponential in N  dynamic programming.
• Let Bn
N ⊂ BN denote the set of all(possible) beat sequences that end in n ∈ [0 : N].
In other words, for a beat sequence B = (b1,b2,...,bL) ∈ Bn
N, we have bL = n. (in the case n = 0, the
only possible beat sequence is the empty one.)
• the maximal score S(B∗) for an optimal beat sequence B∗ is obtained by looking for the largest
value of D:
accumulated score : maximal score over all beat sequences ending with n ∈ [0 : N]

Beat Tracking by DP
D(n) can be computed in an iterative fashion for n = 0,1,...,N
1) n = 0 : initializing the procedure
 D(n) = 0
2) n > 0
- Assuming that we already know the values D(m) for m ∈ [0 : n − 1], we need to compute the value
D(n).
- B∗n = (b1,b2,...,bL) with bL = n : a score-maximizing beat sequence that yields D(n) = S(B∗n).
 Though we may not know such a sequence explicitly, we know, at least, that the last beat bL = n
contributes with the novelty value ∆(n).

Beat Tracking by DP
• The first case is L = 1, where one has a single beat and D(n) = ∆ (n). The second case is L > 1,
where one has a beat bL−1 ∈ [1 : n − 1] that precedes bL = n. Using S(B), the accumulated score
D(n) is obtained by :
• In other words, the optimal score D(n) :
• the sum of the novelty value ∆(n)
• the (weighted) penalty of the beat period δ = bL− bL−1
• the optimal score D(bL−1) of a beat sequence ending at bL−1
• Even though we do not know bL−1 explicitly so far, we have already computed all values D(m) for
m ∈ [0 : n − 1]. From this and by considering the two cases (L = 1 and L > 1), we obtain the
following recursion:
: accumulated score
?

Beat Tracking by DP
• Apply a backtracking procedure
• While calculating D(n), we additionally store the information on the maximization
process by means of a number P(n) ∈ [0 : n − 1].
 In the case that the maximum is 0, we have L = 1 and there is no preceding beat.
 Therefore, we set P(n) := 0.
Otherwise, one has L > 1, and there is a preceding beat. In this case we set

Beat Tracking by DP
• Here, the maximizing index n∗ ∈ [0 : N] determines the last beat bL = n∗ of an optimal
beat sequence B∗. (Only in the case n∗ = 0, there is no last beat and B∗ = 0.)
• The remaining beats of B∗ can then be obtained by backtracking using the predecessor
information supplied by P.
• In the case P(n∗) = 0, the backtracking is terminated and L = 1.
• bL−1 = P(bL) determines the beat preceding the last beat bL = n∗. This procedure is then
iterated to determine bL−2 = P(bL−1), bL−3 = P(bL−2), and so on, until the condition P(b1) =
0 terminates the backtracking.
• (Note that the length L is not known a priori and results from the backtracking.)
• Computing  quadratic in N
n∗

Limitations
• The main limitation of the beat tracking procedure is its dependency on a single,
predefined tempo τ.
• Using a small weighting parameter λ, the procedure may yield good beat tracking
results even in the presence of local deviations from the ideal beat period δ.
• However, the presented procedure is not designed for handling music with
slowly varying tempo (such as ritardando or accelerando) or abrupt changes in
tempo.
• Despite these limitations, the simplicity and efficiency of the dynamic
programming approach to beat tracking makes it an attractive choice for many
types of music.

We are now the masters of beat tracking!
Can we use this information to improve other
feature extraction techniques?

Adaptive Windowing
• Onset and beat positions are expressive features that often segment a music
recording into semantically meaningful units.
• Such a segmentation  improve general feature extraction.
• MIR? : Transforming the given audio signal into a suitable feature representation
that captures certain musical properties while being invariant to other aspects.
(E.g., chroma features are a powerful representation for revealing harmonic
properties of a music recording.)
• Since most musical properties vary over time, the given audio signal is typically
split up into segments or frames, which are then further processed individually.
 The underlying assumption : the signal stays (approximately) stationary within
each segment with regard to the property to be captured.

Use of adaptive windowing
• In practice, (as is the case with the short-time Fourier transform,) a predefined
window of fixed size is used for the time localization, where the size is
determined empirically and optimized for the specific application in mind.
• Using fixed-size windowing, however, may lead to a violation of the homogeneity
assumption: the boundaries of the resulting windowed sections often do not
coincide with the positions where the changes of the signal occur.

• As an alternative to fixed-size windowing, one can employ a more musically
meaningful adaptive windowing strategy, where segment boundaries are induced
by previously extracted onset and beat positions.
• Since musical changes typically occur at onset positions, this often leads to an
increased homogeneity within the adaptively determined frames  a significant
improvement in the resulting feature quality.

(a) Musical score of the four chords.
(b) Segmentation using a window of fixed size.
(c) Adaptive segmentation resulting in beat-synchronized
features.
A chroma representation with 4 subsequent chords.
Note that the third frame comprises a chord
change leading to a rather “noisy” chroma
feature where the chroma bands contain
energy from two different chords.
To attenuate the problem, one may decrease
the window size at the cost of an increased
feature rate and a poorer frequency
resolution.
 an onset-based adaptive windowing leads to clean chroma features

• Adaptive windowing techniques based on beat information are of particular importance for
many music analysis and retrieval applications.
• In the previous case, a windowed section is determined by two consecutive beat positions, which
results in one feature vector per beat.
• Such beat-synchronous feature representations have the advantage of possessing a musical time
axis (given in beats) rather than a physical time axis (given in seconds).
 This makes the feature representation robust to differences in tempo (given in BPM).
• In the context of music synchronization (Chapter 3), the usage of beat-synchronous features
could make DTW-like alignment techniques obsolete.(at least in the ideal case of having perfect
beat positions.)
• (For example, knowing the beat positions already yields a beat-wise synchronization of two
different performances that follow the same musical score.)

• However, in practice, such strategies have to be treated with caution, in particular
when the beat positions are determined automatically.
• Automated beat tracking procedures work well for music with percussive onsets
and a steady tempo. However, when dealing with weak note onsets and
expressive music with local tempo changes, the automated generation of beat
positions becomes an error-prone task, not to mention the problems related to
tempo octave confusion.
• Using corrupt beat information at the feature extraction stage may have immense
consequences for the subsequent music processing tasks to be solved.
• For example, to compensate for beat tracking errors in the music synchronization
context, one may have to reintroduce error-tolerant techniques that are similar
to DTW.

Another way of contribution
• Recall that note onsets often go along with noise-like energy bursts spread over the entire
spectrum, especially for instruments such as the piano, guitar, or percussion.
Time–pitch representations for a piano
recording of a chromatic scale
Original time–pitch representation using
fixed-size windowing with a small window
size.
Adaptive windowing using λ = 1.
The transients that go along
with note onsets are clearly
visible as vertical structures at
the beginning of each note.

Shortened window
• (new pararm for shortening window) λ ∈ R, 0 < λ < 1 : determines the size of the neighborhoods.
• s,t ∈ [1 : N] : the start and end positions of a given adaptive window
 determine the start and end positions of the shortened windowing used for the
feature computation.
 With this definition, the center of the adaptive window is preserved, while its
size is reduced by a factor λ relative to its original size (t − s)

Shortened scanning
The nonharmonic transient components also enter
the analysis window and introduce noise-like
artifacts in pitch bands that are not related to the
underlying notes.
Adaptive windowing using λ = 0.5.Adaptive windowing using λ = 1.
 Exclude a neighborhood around each note onset position.

Fundamentals of music processing chapter 6 발표자료

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Fundamentals of music processing chapter 6 발표자료

Similar to Fundamentals of music processing chapter 6 발표자료 (20)

Recently uploaded

Recently uploaded (20)

Fundamentals of music processing chapter 6 발표자료