This document summarizes the work done on evaluating pitch trackers for use in machine learning algorithms. The author developed a simple evaluation method that compares the pitch tracker's output to the true pitch at each time sample and reports the percentage of correct samples; a single performance score of this kind is what a machine learning algorithm requires. The author implemented this evaluator in MATLAB and was able to display the pitch tracker's output graphically alongside the true pitch. For their composition, the author used a pitch tracker to add harmony by thirds or fourths to live vocal input in the key of C. Completing the evaluator and using it to train a pitch tracker on guitar recordings is proposed as future work.
Joe Walker
Senior Experience
5/7/05
Evaluation of Pitch Trackers for Machine Learning Algorithms
1. Introduction
The goal of this independent research study started out as the creation of a new
neural networks opcode in Csound. A neural network is a machine learning algorithm in
which a program can be trained on potentially noisy data, alter the functions that it
executes on the inputs to obtain the appropriate outputs, then in theory perform well on
data similar to the data upon which it was trained. Csound is a programming language
written in C to allow synthesis of sounds. An opcode in Csound is simply a defined
function. An opcode can modify or create a signal and perform numerical calculations
among other things. A neural network opcode (or set of opcodes) could facilitate the
construction, training, testing, and use of a neural network. Other goals for the project
were a composition which made use of whatever tool I developed and a paper
documenting my work.
During the first month of the semester, the goals for this project underwent many
changes. Because of my interest in neural networks, it was suggested that I collaborate
with Matt Walsh, a student from Professor Thom's Machine Learning course interested in
doing a project relating to music. We eventually decided to delve into the topic of pitch
detection. Given the appropriate algorithms, pitch detection and tracking can be a very
complicated machine learning problem on its own. Even with robust algorithms in place,
there are still many inputs to consider that change the behavior of the tracker. Adding
additional components to the tracker to improve its performance generally requires
adding additional inputs. Manually tuning these parameters is difficult and
time-consuming. It is much more feasible to supply a learning algorithm with data and
evaluations of that data so that the parameters can be tuned automatically based on the
evaluations of the training data. My focus for the semester was finding a way to supply
these evaluations.
2. Pitch Tracker Evaluation
There are many ways in which one could evaluate a pitch tracker, but this
evaluation must apply specifically to use in a machine learning algorithm. The evaluation
must be purely quantitative. It must take as inputs the output of the pitch tracker and
some representation of the truth. In the context of a machine learning algorithm, truth is
what the pitch tracker would output if it functioned perfectly. In this case, the pitch
tracker could be used on any number of sound files. The truth is simply a transcription of
the music in those sound files.
2.1. Methods of Evaluation
When looking at how to analyze a pitch tracker, the general idea is to give a
higher rating to a tracker that correctly identifies the pitch more often. However,
pitch trackers tend to make errors that can be grouped into different categories. Some
common errors are harmonic errors, specifically the octave and the perfect fifth intervals.
Another common error is lag in identifying a note. A tracker must see at least one period
of the signal in order to identify the pitch. There is no avoiding this, so there will always
be some lag, but some trackers will lag more than others. Other errors occur when the
tracker outputs a pitch when there is no note (a rest), or when the tracker outputs no pitch
when there is in fact a note. The latter of these errors encompasses the lag error. These
last two errors are related to the intensity thresholds for the tracker. As an example,
training a neural network with these two error types taken into account could fine-tune
the tracker's intensity threshold for deciding when a note is on or off.
While there are many types of errors one could report, the ability to apply this
analyzer to a machine learning algorithm requires that there be a single output. The
evaluation of the tracker's performance on any given sound file must consist of a single
number (generally scaled to a range of [0, 1] or [-1, 1]). While it may be beneficial for a
human to read the results of a pitch tracker's performance in the form of a bunch of
statistics about the different errors, a machine learning algorithm requires no more or less
than a single evaluation number.
Given this principle, I was prompted to come up with a much simpler approach to
evaluation. Statistics on the different types of errors could be combined to yield a single
evaluation, but there is no clear method of combining the errors. One might be tempted to
say that one type of error is not as bad as another, but it is very difficult to quantify this
statement. The combination of these error statistics to yield a single output could be
considered a machine learning problem on its own, but one without any ground truth
against which to test. The bottom line is that attempting to combine these statistics into a
single output is complicated and, in my opinion, unnecessary. The approach I came up with completely
disregards any error classification. The tracker is either right or wrong. My approach is to
iterate through every sample in the sound file, compare the tracker's output to the truth
for that specific point in time, and tally the number of times the tracker is correct and
incorrect. After this process has traversed the length of the sound file, a single output
number scaled to the range [0, 1] can be obtained by dividing the number of correct
samples by the total number of samples. This method eliminates many of the potential
problems with evaluating a pitch tracker. There is no longer a problem of comparing a
note to a rest. There is no problem of figuring out how to combine the different types of
errors to come up with a single number. It is unclear to begin with why one would favor
one type of error over another. If I am using a pitch tracker that is functioning incorrectly,
I don’t care if it’s a tritone off or an octave off; it’s just wrong as far as I’m concerned.
This approach uses this line of thinking and simply reports a rating of how often the
tracker is correct. This definition of correct must allow for a tolerance threshold. If we
assume that we're using a 12-tone equal-tempered scale, it makes sense to define this
threshold as being within one quarter-tone of the true pitch.
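As a sketch of this per-sample scoring procedure (in Python rather than the MATLAB of the actual implementation; the function and argument names are my own), the evaluation might look like:

```python
import math

def score_tracker(tracker_hz, truth_hz, tol_semitones=0.5):
    """Fraction of samples where the tracker matches the truth.

    tracker_hz, truth_hz: equal-length sequences of per-sample frequencies
    in Hz, with 0.0 representing a rest. A sample counts as correct when
    both are rests, or when the pitches agree within one quarter-tone
    (half a semitone). Names here are illustrative, not from the original.
    """
    assert len(tracker_hz) == len(truth_hz)
    correct = 0
    for f, t in zip(tracker_hz, truth_hz):
        if f == 0.0 or t == 0.0:
            # A rest matches only a rest; pitch-vs-rest is simply wrong.
            correct += (f == t)
        else:
            # Compare in log-frequency, since pitch is logarithmic in Hz.
            correct += abs(12 * math.log2(f / t)) <= tol_semitones
    return correct / len(tracker_hz)
```

Because the score is a ratio of correct samples to total samples, it lands in [0, 1] with no further scaling; an octave error counts exactly as wrong as a tritone error.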
2.2. My Progress
I began implementing this evaluator in MATLAB using tools created by Matt
Walsh and Professor Thom. By the end of the semester, I did not fully complete the
evaluator, but I came quite close. I was given a pitch tracker implemented in Csound to
work with (this tracker will be discussed later), and the result of my work was the ability
to read in the results of the pitch tracker on a sound file and graphically compare it to the
transcription of the same file.
Before I was able to read the tracker’s output into MATLAB, I went through
some trouble just trying to view the output file. I could open it up in a wave editor
(though the file seemed to lack a valid wave header) and see what the file looked like
while I played it back. Before I started modifying the tracker, the output was simply a
sine wave at the frequency it thought it was detecting. I changed this so that the output
was simply the frequency it thought it was detecting, no sine wave.
Figure 1. Here is a screenshot of the output of the tracker after my modification. The x-axis is time in
seconds of the input wave file (b4-b6.wav from my guitar data, in this case), and the y-axis is the detected
frequency.
Rests are inherently represented as a frequency of zero in the output. Having a
note value rather than a frequency of zero for rests doesn't make much sense, as pitch and
frequency are related logarithmically. One could infinitely descend in pitch and never
reach a frequency of zero.
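The logarithmic relation can be illustrated with the octave-point values the tracker produces. This Python sketch assumes the usual Csound convention in which 8.00 is middle C (about 261.63 Hz):

```python
def cpsoct(oct_val):
    # Octave-point-decimal to Hz, following the usual Csound convention:
    # 8.00 is middle C (~261.63 Hz), and each whole unit is one octave.
    return 2.0 ** oct_val * 1.02197

# Each octave down halves the frequency but never reaches zero, which is
# why a rest cannot be encoded as a very low pitch and instead needs the
# explicit 0 Hz marker.
```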
At this point, I needed to find a method of reading the pitch tracker output into
MATLAB and comparing it to the truth, which was supplied in the same format by one
of Matt Walsh’s tools. I ended up opening the output file from the tracker in a wave
editor and saving it as a wave file. I could then use the built-in commands in MATLAB
to read the wave in.
Figure 2. This shows the tracker output on the top graph and the truth plotted on the bottom graph. This is
the same sound file that was analyzed in the previous figure, and it is apparent that the discrepancy in the
first half of the tracker output is indeed an octave error.
This is the extent of the work I was able to complete on the pitch tracker
evaluator. Future work on this would involve simply changing what MATLAB does with
the data from these two sources. In the above case, it is being processed only to be
displayed in a graph. The goal was to iterate through every time sample in the file and
compare to the truth. This doesn’t seem like it would be too difficult. It would be ideal to
write a script that could run this evaluation on multiple pieces of test data in sequence to
facilitate the automated training of a neural network.
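Such a batch script might look like the following sketch (in Python rather than MATLAB; the file-reading step is omitted, the per-sample comparison is inlined, and all names are hypothetical):

```python
import math

def sample_score(tracker_hz, truth_hz):
    # Correct when both are rests (0 Hz), or when the pitches agree
    # within one quarter-tone in log-frequency.
    ok = 0
    for f, t in zip(tracker_hz, truth_hz):
        if f == 0.0 or t == 0.0:
            ok += (f == t)
        else:
            ok += abs(12 * math.log2(f / t)) <= 0.5
    return ok / len(truth_hz)

def evaluate_batch(dataset):
    """dataset: mapping of file name -> (tracker output, transcription truth),
    both as per-sample frequency lists. Returns one score per file, ready
    to hand to an automated training loop."""
    return {name: sample_score(trk, tru)
            for name, (trk, tru) in dataset.items()}
```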
3. Creating a Composition
An additional requirement of this course was to produce a composition in which I
creatively use a pitch tracker. In order to do this, I needed to obtain a pitch tracker to
work with. Professor Alves provided me with a pitch tracker implemented in Csound.
This tracker was used in the development of a pitch tracker evaluator, as documented
above. I experimented with this tracker for a while, but eventually decided to use a
different one written by Barry Vercoe that I found in The Csound Book. This
implementation used some of the same opcodes as in the one Professor Alves provided
me with, but was better documented in the book, and also included a primitive
harmonizer.
3.1. Creative Use of a Pitch Tracker
When faced with the task of creating a composition using a pitch tracker, I needed
to come up with a creative way to use one. Having a pitch tracker at my disposal means
that I can write a Csound program that exhibits behavior based on the pitch being
played. Basically, I can have the program do something
different based on what note I’m playing. One simple idea is to detect a note and play
back the same note transposed over a specified interval. If played simultaneously with the
original signal, this would produce harmony at a fixed interval. This is what the Vercoe
tracker did with its harmony. It detected the current note and harmonized it into a major
triad with the input note as the root. This could be done more intelligently by changing
the intervals for the harmony based on what note is being played. For example, one could
remain in a specific key by detecting the note and specifying the appropriate intervals for
harmony based on that note. More advanced approaches could avoid specifying the key
and have the Csound program intelligently detect the key of the piece, or the key it is in
at any given moment. Some assumptions would have to be made to do this, such as only
looking for common key changes.
3.2. My Approach
For my composition, I ended up using a simple harmony that remains in the key
of C. I wrote programs to harmonize by thirds and by fourths. When harmonizing by
thirds, the program first runs the detector, then checks which of the twelve notes is
closest to the note being played, and finally harmonizes by a major or minor third, based
on the incoming note. This results in the original note sounding together with a note a
third above in the key of C. When harmonizing by fourths, the program does essentially
the same thing, but adds two notes: one note is a fourth below the original and the other is
another fourth below, transposed up an octave. This results in the original note sounding
together with a note a fourth below and a note a second above, both in the key of C.
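The mapping from detected note to harmony interval for the thirds version can be sketched in Python using MIDI note numbers (my own illustration; the interval table transcribes the major/minor branches of the Csound orchestra in this section, including its handling of non-diatonic pitch classes):

```python
# Semitones added above each pitch class (C = 0) to stay in the key of C:
# diatonic notes C, F, and G get a major third; the rest get a minor third.
THIRD_ABOVE = [4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]

def harmonize_third(midi_note):
    # Return the note a diatonic third above the input, in C major.
    return midi_note + THIRD_ABOVE[midi_note % 12]
```

For example, harmonize_third(62) gives 65, pairing D with F (a minor third), while C and G receive major thirds (E and B), keeping the harmony diatonic to C.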
Below is my Csound orchestra code for harmonizing by duplicating the input note
up a third. It is based on the pitch tracker mentioned above written by Barry Vercoe.
sr     = 22050
kr     = 220.5
ksmps  = 100
nchnls = 1

instr 1
  a1 in                                    ; read the live input
  a1 reson a1, 0, 3000, 1                  ; filter the input before analysis
  w1 spectrum a1, .02, 6, 24, 12, 1, 3     ; spectral analysis of the input
  koct, kamp specptrk w1, 1, 7.0, 9.0, 8.0, 10, 7, .7, 0, 3, 1, .1 ; track pitch (koct, octave-point) and amplitude
  a2 delay a1, .066                        ; delay the input to align with the tracker
  kn = frac(koct)                          ; pitch class, as a fraction of an octave
  ; choose a major or minor third so the harmony stays in the key of C
  if (abs(kn - 0/12) <= .5/12) kgoto major   ; C
  if (abs(kn - 1/12) <= .5/12) kgoto minor
  if (abs(kn - 2/12) <= .5/12) kgoto minor   ; D
  if (abs(kn - 3/12) <= .5/12) kgoto minor
  if (abs(kn - 4/12) <= .5/12) kgoto minor   ; E
  if (abs(kn - 5/12) <= .5/12) kgoto major   ; F
  if (abs(kn - 6/12) <= .5/12) kgoto major
  if (abs(kn - 7/12) <= .5/12) kgoto major   ; G
  if (abs(kn - 8/12) <= .5/12) kgoto minor
  if (abs(kn - 9/12) <= .5/12) kgoto minor   ; A
  if (abs(kn - 10/12) <= .5/12) kgoto minor
  if (abs(kn - 11/12) <= .5/12) kgoto minor  ; B
major:
  kharm1 = semitone(4)                     ; frequency ratio of a major third
  kgoto main
minor:
  kharm1 = semitone(3)                     ; frequency ratio of a minor third
  kgoto main
main:
  kharm2 = 0                               ; no second harmony voice
  kpch = cpsoct(koct)                      ; tracked pitch converted to Hz
  a3 harmon a2, kpch, .2, kharm1, kharm2, 0, 110, .1 ; generate the harmony voice
  ;a3 delay a3, .2
  out a2 + .8*a3                           ; mix the delayed input with the harmony
endin
I spent a good portion of time trying to tweak the parameters of the pitch tracker
to perform well on my guitar input with little success. There is also a delay in the output
due to computations in the pitch tracker that makes it very difficult to play in real time.
However, I was surprised to find that the pitch tracker and harmonizer work wonderfully
on a human voice. Because of this discovery, I decided to compose my piece with a few
guitar tracks for background and only vocals making use of the pitch tracker.
3.3. Future Ideas
In hindsight, it would have been ideal to have a working machine learning
algorithm that could come up with good parameter values for the pitch tracker to track
guitar input. I believe that the problems in tracking guitar input were rooted in the attack
of the notes. On a guitar, whether notes are picked with a plectrum or plucked with a
finger, there is a small amount of noise before the note actually starts. When the tracker
had problems, this attack noise usually seemed to be the cause.
4. Conclusion
Although this project did not meet its specified goals, significant progress was
made which opened up doors for possible future work. The pitch tracker evaluator should
only take a few steps to complete. This could be extremely useful for training a pitch
tracker with a machine learning algorithm. Once this process is feasible, a pitch tracker
could be trained on the types of data with which a user intends to use it. For example, if I
intend to use a pitch tracker on my guitar, I could create some labeled training data of my
guitar playing and train the pitch tracker to become well-attuned to my guitar playing in
particular. It is also possible further in the future that this process could be automated into
a Csound opcode, which was the original goal of this project before revision.