5. Music structure example
Mazurka Op.6, No.4 by Chopin
Sheet music representation
Waveform representation
Chroma representation
Manually annotated segmentation
(of the audio recording)
GOAL: How can we derive this structural information for a given audio recording?
28. SSM Enhancement : path smoothing
• Define a (finite) set Θ consisting of tempo parameters θ ∈ Θ for
different relative tempo differences.
• Compute for each such θ a matrix S_{L,θ} and obtain a final matrix S_{L,Θ}
by a cell-wise maximization over all θ ∈ Θ:
S_{L,Θ}(n,m) := max over θ ∈ Θ of S_{L,θ}(n,m)
• Use prior information on the expected relative tempo differences to choose Θ:
Θ = {0.66, 0.81, 1.00, 1.22, 1.50}
→ Filtering along 5 different directions
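The multi-directional filtering above can be sketched in Python with NumPy. This is a minimal sketch under assumptions that are not stated on the slide: the smoothing length L counts cells along the path, the column offset for step l is obtained by rounding l · θ, and cells beyond the matrix border count as zero. The function names are illustrative only.

```python
import numpy as np

def smooth_direction(S, L, theta=1.0):
    """Average L cells of S along the direction given by theta (zero beyond the border)."""
    N, M = S.shape
    out = np.zeros((N, M), dtype=float)
    for l in range(L):
        shift = int(np.round(l * theta))  # assumed rounding of the column offset
        if l >= N or shift >= M:
            continue
        # Adds S(n + l, m + shift) to out(n, m) wherever the shifted cell exists.
        out[:N - l, :M - shift] += S[l:, shift:]
    return out / L

def smooth_tempo_invariant(S, L, thetas):
    """Cell-wise maximum of the direction-smoothed matrices over all theta."""
    return np.max([smooth_direction(S, L, th) for th in thetas], axis=0)

thetas = [0.66, 0.81, 1.00, 1.22, 1.50]
S = np.eye(8)  # toy SSM with a single diagonal path
S_smooth = smooth_tempo_invariant(S, L=4, thetas=thetas)
```

In this toy case the value at the end of the diagonal drops to 1/L, because a forward-only filter runs out of cells there.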
29. SSM Enhancement : path smoothing
(a) Original SSM using chroma features
(resolution of 2 Hz).
(b) SSM after applying diagonal smoothing.
(c) SSM after applying tempo-invariant
smoothing.
(d) SSM after applying forward–backward
smoothing.
→ Takes care of the fading-out problem by
taking the cell-wise maximum over the forward-smoothed
and backward-smoothed matrices
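Forward-backward smoothing can be sketched as follows. The sketch assumes a square SSM, plain diagonal smoothing (no tempo variants) and zero values beyond the border. The backward pass is realized here by reversing both axes, which is one possible implementation rather than the one used for the figure.

```python
import numpy as np

def forward_smooth(S, L):
    """Average over L cells along the main-diagonal direction, looking forward in time."""
    N = S.shape[0]
    out = np.zeros(S.shape, dtype=float)
    for l in range(min(L, N)):
        out[:N - l, :N - l] += S[l:, l:]
    return out / L

def forward_backward_smooth(S, L):
    """Cell-wise maximum of the forward-smoothed and backward-smoothed matrices."""
    fwd = forward_smooth(S, L)
    # Backward smoothing: reverse time on both axes, smooth forward, reverse again.
    bwd = forward_smooth(S[::-1, ::-1], L)[::-1, ::-1]
    return np.maximum(fwd, bwd)

S = np.eye(5)
fwd_only = forward_smooth(S, L=3)
both = forward_backward_smooth(S, L=3)
```

With forward smoothing alone the last diagonal entry fades to 1/3, while the combined matrix keeps the full diagonal.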
34. SSM Enhancement : thresholding
• Scaling from the range [τ, μ] → [0, 1]
(for μ := max_{n,m} S(n,m) > τ; otherwise all entries are set to zero)
• Choose τ in a relative fashion:
keep the ρ · 100% of the cells with the highest values, using a relative
threshold parameter ρ ∈ [0, 1]
(Local strategy: setting τ in a column-wise and row-wise fashion)
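The global relative thresholding with scaling or binarization can be sketched like this. It is a minimal sketch: the function name, the rounding of ρ · (number of cells), and the handling of ties at τ (cells equal to τ are kept) are assumptions, and the local column-wise and row-wise strategy is not covered.

```python
import numpy as np

def threshold_ssm(S, rho, binarize=False):
    """Keep roughly the rho * 100% largest cells; scale [tau, mu] to [0, 1] or binarize."""
    num_keep = int(np.round(rho * S.size))
    if num_keep == 0:
        return np.zeros(S.shape)
    tau = np.sort(S, axis=None)[-num_keep]  # relative threshold derived from rho
    keep = S >= tau
    if binarize:
        return keep.astype(float)
    mu = S.max()
    if mu <= tau:
        return np.zeros(S.shape)  # degenerate case: all entries are set to zero
    return np.where(keep, (S - tau) / (mu - tau), 0.0)

S = np.array([[0.1, 0.5],
              [0.9, 0.3]])
S_scaled = threshold_ssm(S, rho=0.5)
S_binary = threshold_ssm(S, rho=0.5, binarize=True)
```

Note that after scaling, a cell whose value equals τ ends up at 0, so scaling and binarization can differ in the number of nonzero cells.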
35. SSM Enhancement : thresholding
(a) SSM
(b) SSM after thresholding and binarization (τ = 0.75).
(c) SSM after thresholding and scaling (ρ = 0.2).
(d) SSM after thresholding and scaling (ρ = 0.05).
40. Audio thumbnailing – fitness measure
• Fitness measure : simultaneously establish all relations between a given segment and its
repetitions.
(Figure labels: segment, induced segments, paths)
41. Audio thumbnailing – fitness measure
• Consider a fixed segment
• A path family over a segment is a family of paths such that the
induced segments do not overlap
Not a path family
42. Audio thumbnailing – fitness measure
• Choosing an optimal path family (for each segment)
Given the score σ(P) of a path family P, an optimal path family P∗ is a path family of maximal score.
(P∗ induces a segment family.)
43. Audio thumbnailing – fitness measure
• Optimizing algorithm : Dynamic programming
1) Given two sequences, say X = (x1,x2,...,xN) and Y = (y1,y2,...,yM),
compute an optimal path that globally aligns X and Y,
where the first elements as well as the last elements of the two sequences are to be aligned.
2) The step size condition as specified by the set Σ constrains the slope of the path.
Example: Σ = {(2, 1), (1, 2), (1, 1)}
3) Each element of X is aligned to at most one element of Y.
→ Find a score-maximizing path family.
46. Audio thumbnailing – fitness measure
Computing an optimal path family over a given segment α = [s : t] ⊆ [1 : N]
1) N × M submatrix S^α (segment α = [s : t] with M := |α|):
columns s to t of the self-similarity matrix S.
2) An accumulated score matrix D ∈ ℝ^{N×(M+1)} is computed by a recursive procedure.
(D: rows [1 : N], columns [0 : M])
3) Φ(n, m): the set of predecessors of cell (n, m)
→ all cells that may precede (n, m) in a valid path family.
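4) Accumulated score matrix:
D(n, m) = S^α(n, m) + max{D(i, j) | (i, j) ∈ Φ(n, m)} for n ∈ [2 : N], m ∈ [2 : M]
5) Constraint conditions
: values of D for the remaining index pairs (n, m) with n = 1 or m ∈ {0, 1}
D(n, 0) = max{D(n−1, 0), D(n−1, M)} and D(n, 1) = D(n, 0) + S^α(n, 1) for n ∈ [2 : N]
D(1, 0) = 0, D(1, 1) = S^α(1, 1), D(1, m) = −∞ for m ∈ [2 : M]
The score of an optimal path family is max{D(N, 0), D(N, M)}.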
Complexity: O(MN)
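The recursive procedure can be sketched in Python as follows. This is a hedged reconstruction, not the slide's own code: it assumes the step sizes Σ = {(1, 1), (2, 1), (1, 2)}, uses column 0 of D as an "elevator" column that lets a new path start after the previous one has ended, and uses 0-based indices. The function name is hypothetical, and backtracking of the paths themselves is omitted.

```python
import numpy as np

def path_family_score(S, s, t):
    """Score of an optimal path family over segment [s : t] (0-based, inclusive)."""
    S_alpha = S[:, s:t + 1]                # N x M submatrix: columns s..t of S
    N, M = S_alpha.shape
    D = np.full((N, M + 1), -np.inf)       # rows 0..N-1, columns 0..M
    D[0, 0] = 0.0
    D[0, 1] = S_alpha[0, 0]
    for n in range(1, N):
        # Column 0: carry the best score so far, or finish a path in column M.
        D[n, 0] = max(D[n - 1, 0], D[n - 1, M])
        # Column 1: a path may only start from the elevator column.
        D[n, 1] = D[n, 0] + S_alpha[n, 0]
        for m in range(2, M + 1):
            best = D[n - 1, m - 1]                 # step (1, 1)
            if n >= 2:
                best = max(best, D[n - 2, m - 1])  # step (2, 1)
            if m >= 3:
                best = max(best, D[n - 1, m - 2])  # step (1, 2)
            D[n, m] = S_alpha[n, m - 1] + best
    return max(D[N - 1, 0], D[N - 1, M]), D

# Idealized SSM for a structure A B A with two frames per part.
S = np.eye(6)
S[0, 4] = S[4, 0] = S[1, 5] = S[5, 1] = 1.0
score, D = path_family_score(S, 0, 1)      # segment covering the first A-part
```

Each cell of D is visited once with a constant number of predecessors, which matches the O(MN) complexity stated above.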
47. Audio thumbnailing – fitness measure
Computing an optimal path family over a given segment α = [s : t] ⊆ [1 : N]
Submatrix S^α with α = [50 : 100]
Accumulated score matrix D
Optimal path family
48. Audio thumbnailing – fitness measure
• Compute an optimal path family P∗ = {P1, ..., PK} for a given segment α → repetition relations of α
1) Simply using the total score σ(P∗) is not good: it depends on the lengths of α and of the paths, and it also
captures trivial self-explanations (each segment α explains itself perfectly, information that is encoded by the main diagonal
of a self-similarity matrix).
2) Instead, subtract the length |α| from the score σ(P∗) and normalize with regard to the lengths Lk := |Pk| of the paths Pk
contained in the optimal path family P∗:
normalized score σ̄(α) := (σ(P∗) − |α|) / (L1 + ... + LK)
Intuitively, the value σ̄(α) expresses the average score of the optimal path family P∗ (minus a proportion for the
self-explanation).
The normalization eliminates the influence of segment lengths → σ̄(α) indicates how well α explains other segments.
49. Audio thumbnailing – fitness measure
• Besides repetitiveness, another issue is how much of the underlying music recording is covered
by the thumbnail and its related segments.
• To capture this property, we define a coverage measure for a given α.
• To this end, let A∗ := {π1(P1), ..., π1(PK)} be the segment family induced by the
optimal path family P∗, and let γ(A∗) be its coverage.
• We define the normalized coverage γ̄(α) := (γ(A∗) − |α|) / N
γ̄(α) → the ratio between the length of the union of the induced segments of α and the total length of the original recording
(minus a proportion for the self-explanation)
50. Audio thumbnailing – fitness measure
• A high average score and a high coverage are both important.
• Shorter segments often have a higher average score but a lower
coverage, whereas longer segments tend to have a lower average
score but a higher coverage. → The two need to be balanced.
→ Define the fitness φ(α) of the segment α to be the harmonic mean of σ̄(α) and γ̄(α):
φ(α) := 2 · σ̄(α) · γ̄(α) / (σ̄(α) + γ̄(α))
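The three quantities can be combined in a few lines of Python. This is a minimal sketch that assumes the score, the path lengths and the coverage of an optimal path family have already been computed. The function name and the convention of returning a fitness of 0 when either normalized value is not positive are assumptions.

```python
def fitness(score, path_lengths, coverage, seg_len, N):
    """Return (normalized score, normalized coverage, fitness) for one segment."""
    # Subtracting seg_len removes the trivial self-explanation in both measures.
    norm_score = (score - seg_len) / sum(path_lengths)
    norm_cov = (coverage - seg_len) / N
    if norm_score <= 0 or norm_cov <= 0:
        return norm_score, norm_cov, 0.0   # assumed convention for degenerate cases
    phi = 2 * norm_score * norm_cov / (norm_score + norm_cov)  # harmonic mean
    return norm_score, norm_cov, phi

# Toy values for an idealized structure A B A with two frames per part (N = 6):
# the segment A has two paths of length 2, total score 4, and covers 4 frames.
norm_score, norm_cov, phi = fitness(score=4.0, path_lengths=[2, 2],
                                    coverage=4, seg_len=2, N=6)
```

The harmonic mean stays small whenever either of its two arguments is small, so a segment needs both a good average score and a good coverage to obtain a high fitness.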
51. Audio thumbnailing – fitness measure
Idealized SSM corresponding to the musical structure A1A2 ...A6
with optimal path families for various segments α corresponding to
(a) A1, (b) A1A2, and (c) A1A2A3
52. Audio thumbnailing – thumbnail selection
• Define the audio thumbnail to be the segment of maximal fitness:
α∗ := argmax_α φ(α)
• Add a lower bound θ for the minimal possible thumbnail length (maximize only over segments with |α| ≥ θ)
→ this segment has non-overlapping repetitions that cover a possibly
large portion of the audio recording
53. Audio thumbnailing – scape plotting
• There are (N + 1)N/2 different segments α = [s : t] ⊆ [1 : N] where s, t ∈ [1 : N] with s ≤ t
• Instead of considering start and end points, each segment can also be uniquely described by its center c(α) := (s + t)/2 and its length |α|:
scape plot Δ: Δ(c(α), |α|) := φ(α)
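A scape plot and the thumbnail selection with a minimal length can be sketched as follows. The sketch stores fitness values by (length, start index) rather than by (center, length), which is purely an assumed storage layout. `fitness_fn` is a hypothetical callback returning φ for a segment given 0-based inclusive bounds, and the toy fitness at the end is made up for illustration.

```python
import numpy as np

def scape_plot(N, fitness_fn):
    """SP[length - 1, s] holds the fitness of segment [s : s + length - 1]."""
    SP = np.full((N, N), -np.inf)          # cells without a valid segment stay at -inf
    for length in range(1, N + 1):
        for s in range(N - length + 1):
            SP[length - 1, s] = fitness_fn(s, s + length - 1)
    return SP

def select_thumbnail(SP, min_len=1):
    """Return (s, t, center, length) of the fitness-maximizing segment."""
    masked = SP.copy()
    masked[:min_len - 1, :] = -np.inf      # enforce the lower bound on the length
    row, s = np.unravel_index(np.argmax(masked), masked.shape)
    s, length = int(s), int(row) + 1
    t = s + length - 1
    return s, t, (s + t) / 2, length

# Made-up fitness that peaks at the segment [2 : 4].
toy_fitness = lambda s, t: -(abs(s - 2) + abs(t - 4))
SP = scape_plot(6, toy_fitness)
thumb = select_thumbnail(SP)
thumb_long = select_thumbnail(SP, min_len=4)
```

For N = 6 the plot contains 21 valid cells, in line with the count (N + 1)N/2.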
54. Audio thumbnailing – scape plotting
(b) α = α∗ = [68 : 89]
(corresponding to B2)
(c) α = [41 : 67]
(corresponding to B1 )
(d) α = [131 : 150]
(corresponding to A3 )
(e) α = [21 : 89]
(corresponding to A1B1B2)
The thumbnail is the segment of maximal fitness
(choose the maximum point of the scape plot):
c(α∗) = 78.5
|α∗| = 22
55. Audio thumbnailing – scape plotting
α = α∗ = [68 : 89]
(corresponding to B2)
α = [41 : 67]
(corresponding to B1 )
Recall that the introduced fitness measure slightly favors shorter segments.
→ Since in this recording the B2-part is played faster than the B1-part, the fitness measure favors the B2-part
segment over the B1-part segment.
70. Evaluation – part labeling
Pairwise precision, recall, and F-measure.
(a) Positive items (indicated by gray boxes) with regard to the reference annotation.
(b) Positive items (indicated by gray boxes) with regard to the estimated annotation.
(c) True positive (TP), false positive (FP), and false negative (FN) items.
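Pairwise precision, recall and F-measure can be computed as in the sketch below. It assumes both annotations are given as one label per frame, and that an item is a pair of frame indices (n, m) with n < m that counts as positive when both frames carry the same label. The function name is illustrative.

```python
from itertools import combinations

def pairwise_prf(ref_labels, est_labels):
    """Pairwise precision, recall and F-measure for two frame-wise label sequences."""
    assert len(ref_labels) == len(est_labels)
    pairs = list(combinations(range(len(ref_labels)), 2))
    ref_pos = {(n, m) for n, m in pairs if ref_labels[n] == ref_labels[m]}
    est_pos = {(n, m) for n, m in pairs if est_labels[n] == est_labels[m]}
    tp = len(ref_pos & est_pos)   # positive in both annotations
    fp = len(est_pos - ref_pos)   # positive only in the estimate
    fn = len(ref_pos - est_pos)   # positive only in the reference
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure

P, R, F = pairwise_prf("AABB", "AAAB")
```

In this example there is one true positive, two false positives and one false negative, giving P = 1/3, R = 1/2 and F = 0.4.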
71. Evaluation – boundary annotation
(a) Reference boundary annotation.
(b) Estimated boundary annotation.
(c) Evaluation of (b) with regard to (a).
(d) τ-Neighborhood of (a) using the tolerance parameter τ = 1.
(e) Evaluation of (b) with regard to (d).
(f) τ -Neighborhood of (a) using the tolerance parameter τ = 2.
(g) Evaluation of (b) with regard to (f).
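The boundary evaluation with a tolerance parameter τ can be sketched as follows. The sketch assumes boundaries are given as frame indices and matches each estimated boundary to at most one unused reference boundary within distance τ. This greedy one-to-one matching is a simplifying assumption and is adequate when the τ-neighborhoods of the reference boundaries do not overlap.

```python
def boundary_prf(ref, est, tau=0):
    """Precision, recall and F-measure of estimated boundaries with tolerance tau."""
    ref, est = sorted(ref), sorted(est)
    used = set()                  # reference boundaries that are already matched
    tp = 0
    for b in est:
        for k, r in enumerate(ref):
            if k not in used and abs(b - r) <= tau:
                used.add(k)
                tp += 1
                break
    fp = len(est) - tp            # estimated boundaries without a match
    fn = len(ref) - tp            # reference boundaries that were missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure

ref = [10, 20, 30]                # made-up reference boundaries
est = [11, 22, 40]                # made-up estimated boundaries
results = {tau: boundary_prf(ref, est, tau) for tau in (0, 1, 2)}
```

Increasing τ turns near misses into true positives, which is why the evaluation results differ between τ = 1 and τ = 2.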
72. Evaluation – thumbnail detection
Typical error sources in thumbnailing and music
structure analysis
(a) Confusion problem for Beatles song “Martha My
Dear.”
(b) Substructure (oversegmentation) problem for
Beatles song “While My Guitar Gently Weeps.”
(c) Superordinate structure (undersegmentation)
problem for Beatles song “For No One.”