This document summarizes the work done on evaluating pitch trackers for use in machine learning algorithms. The author developed a simple evaluation method that compares the pitch tracker's output to the true pitch at each time sample and reports the percentage of correct samples; a single performance score of this kind is what a machine learning algorithm requires. The author implemented this evaluator in MATLAB and was able to display the pitch tracker's output graphically alongside the true pitch. For their composition, the author used a pitch tracker to add harmony by thirds or fourths to live vocal input in the key of C. Completing the evaluator and using it to train a pitch tracker on guitar recordings is proposed as future work.
Joe Walker
Senior Experience
5/7/05
Evaluation of Pitch Trackers for Machine Learning Algorithms
1. Introduction
The goal of this independent research study started out as the creation of a new
neural networks opcode in Csound. A neural network is a machine learning algorithm in
which a program can be trained on potentially noisy data, alter the functions that it
executes on the inputs to obtain the appropriate outputs, then in theory perform well on
data similar to the data upon which it was trained. Csound is a programming language
written in C to allow synthesis of sounds. An opcode in Csound is simply a defined
function. An opcode can modify or create a signal and perform numerical calculations
among other things. A neural network opcode (or set of opcodes) could facilitate the
construction, training, testing, and use of a neural network. Other goals for the project
were a composition which made use of whatever tool I developed and a paper
documenting my work.
During the first month of the semester, the goals for this project underwent many
changes. Because of my interest in neural networks, it was suggested that I collaborate
with Matt Walsh, a student from Professor Thom's Machine Learning course interested in
doing a project relating to music. We eventually decided to delve into the topic of pitch
detection. Given the appropriate algorithms, pitch detection and tracking can be a very
complicated machine learning problem on its own. Even with robust algorithms in place,
there are still many inputs to consider that change the behavior of the tracker. Adding
additional components to the tracker to improve its performance generally requires
adding additional inputs. Manually tuning these parameters is difficult and
time-consuming. It is much more feasible to supply a learning algorithm with data and
evaluations of that data so that the parameters can be tuned automatically based on the
evaluations of the training data. My focus for the semester was finding a way to supply
these evaluations.
2. Pitch Tracker Evaluation
There are many ways in which one could evaluate a pitch tracker, but this
evaluation must apply specifically to use in a machine learning algorithm. The evaluation
must be purely quantitative. It must take as inputs the output of the pitch tracker and
some representation of the truth. In the context of a machine learning algorithm, truth is
what the pitch tracker would output if it functioned perfectly. In this case, the pitch
tracker could be used on any number of sound files. The truth is simply a transcription of
the music in those sound files.
2.1. Methods of Evaluation
When looking at how to analyze a pitch tracker, the general idea is to give a
higher rating to a tracker that correctly identifies the pitch more often. However,
pitch trackers tend to make errors that can be grouped into different categories. Some
common errors are harmonic errors, specifically the octave and the perfect fifth intervals.
Another common error is lag in identifying a note. A tracker must see at least one period
of the signal in order to identify the pitch. There is no avoiding this, so there will always
be some lag, but some trackers will lag more than others. Other errors occur when the
tracker outputs a pitch when there is no note (a rest), or when the tracker outputs no pitch
when there is in fact a note. The latter of these errors encompasses the lag error. These
last two errors are related to the intensity thresholds for the tracker. As an example,
training a neural network with these two error types taken into account could fine-tune
the tracker's intensity threshold for deciding when a note is on or off.
While there are many types of errors one could report, the ability to apply this
analyzer to a machine learning algorithm requires that there be a single output. The
evaluation of the tracker's performance on any given sound file must consist of a single
number (generally scaled to a range of [0, 1] or [-1, 1]). While it may be beneficial for a
human to read the results of a pitch tracker's performance in the form of a bunch of
statistics about the different errors, a machine learning algorithm requires no more or less
than a single evaluation number.
Given this principle, I was prompted to come up with a much simpler approach to
evaluation. Statistics on the different types of errors could be combined to yield a single
evaluation, but there is no clear method of combining the errors. One might be tempted to
say that one type of error is not as bad as another, but it is very difficult to quantify this
statement. The combination of these error statistics to yield a single output could be
considered a machine learning problem on its own, but one without any ground truth
against which to test. The bottom line is that attempting to combine these statistics into a
single output is complicated and, in my opinion, unnecessary. The approach I came up with completely
disregards any error classification. The tracker is either right or wrong. My approach is to
iterate through every sample in the sound file, compare the tracker's output to the truth
for that specific point in time, and tally the number of times the tracker is correct and
incorrect. After this process has traversed the length of the sound file, a single output
number scaled to the range [0, 1] can be obtained by dividing the number of correct
samples by the total number of samples. This method eliminates many of the potential
problems with evaluating a pitch tracker. There is no longer a problem of comparing a
note to a rest. There is no problem of figuring out how to combine the different types of
errors to come up with a single number. It is unclear to begin with why one would favor
one type of error over another. If I am using a pitch tracker that is functioning incorrectly,
I don’t care if it’s a tritone off or an octave off; it’s just wrong as far as I’m concerned.
This approach uses this line of thinking and simply reports a rating of how often the
tracker is correct. This definition of correct must allow for a tolerance threshold. If we
assume that we're using a 12-tone equal-tempered scale, it makes sense to define this
threshold as being within one quarter-tone of the true pitch.
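As a sketch of this per-sample scoring procedure (in Python rather than the MATLAB of the actual implementation; the function and argument names are my own), the evaluation might look like:

```python
import math

def score_tracker(tracker_hz, truth_hz, tol_semitones=0.5):
    """Fraction of samples where the tracker matches the truth.

    tracker_hz, truth_hz: equal-length sequences of per-sample frequencies
    in Hz, with 0.0 representing a rest. A sample counts as correct when
    both are rests, or when the pitches agree within one quarter-tone
    (half a semitone). Names here are illustrative, not from the original.
    """
    assert len(tracker_hz) == len(truth_hz)
    correct = 0
    for f, t in zip(tracker_hz, truth_hz):
        if f == 0.0 or t == 0.0:
            # A rest matches only a rest; pitch-vs-rest is simply wrong.
            correct += (f == t)
        else:
            # Compare in log-frequency, since pitch is logarithmic in Hz.
            correct += abs(12 * math.log2(f / t)) <= tol_semitones
    return correct / len(tracker_hz)
```

Because the score is a ratio of correct samples to total samples, it lands in [0, 1] with no further scaling; an octave error counts exactly as wrong as a tritone error.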
2.2. My Progress
I began implementing this evaluator in MATLAB using tools created by Matt
Walsh and Professor Thom. By the end of the semester, I did not fully complete the
evaluator, but I came quite close. I was given a pitch tracker implemented in Csound to
work with (this tracker will be discussed later), and the result of my work was the ability
to read in the results of the pitch tracker on a sound file and graphically compare it to the
transcription of the same file.
Before I was able to read the tracker’s output into MATLAB, I went through
some trouble just trying to view the output file. I could open it up in a wave editor
(though the file seemed to lack a valid wave header) and see what the file looked like
while I played it back. Before I started modifying the tracker, the output was simply a
sine wave at the frequency it thought it was detecting. I changed this so that the output
was simply the frequency it thought it was detecting, no sine wave.
Figure 1. Here is a screenshot of the output of the tracker after my modification. The x-axis is time in
seconds of the input wave file (b4-b6.wav from my guitar data, in this case), and the y-axis is the detected
frequency.
Rests are inherently represented as a frequency of zero in the output. Having a
note value rather than a frequency of zero for rests doesn't make much sense, as pitch and
frequency are related logarithmically. One could infinitely descend in pitch and never
reach a frequency of zero.
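The logarithmic relation can be illustrated with the octave-point values the tracker produces. This Python sketch assumes the usual Csound convention in which 8.00 is middle C (about 261.63 Hz):

```python
def cpsoct(oct_val):
    # Octave-point-decimal to Hz, following the usual Csound convention:
    # 8.00 is middle C (~261.63 Hz), and each whole unit is one octave.
    return 2.0 ** oct_val * 1.02197

# Each octave down halves the frequency but never reaches zero, which is
# why a rest cannot be encoded as a very low pitch and instead needs the
# explicit 0 Hz marker.
```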
At this point, I needed to find a method of reading the pitch tracker output into
MATLAB and comparing it to the truth, which was supplied in the same format by one
of Matt Walsh’s tools. I ended up opening the output file from the tracker in a wave
editor and saving it as a wave file. I could then use the built-in commands in MATLAB
to read the wave in.
Figure 2. This shows the tracker output on the top graph and the truth plotted on the bottom graph. This is
the same sound file that was analyzed in the previous figure, and it is apparent that the discrepancy in the
first half of the tracker output is indeed an octave error.
This is the extent of the work I was able to complete on the pitch tracker
evaluator. Future work on this would involve simply changing what MATLAB does with
the data from these two sources. In the above case, it is being processed only to be
displayed in a graph. The goal was to iterate through every time sample in the file and
compare to the truth. This doesn’t seem like it would be too difficult. It would be ideal to
write a script that could run this evaluation on multiple pieces of test data in sequence to
facilitate the automated training of a neural network.
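Such a batch script might look like the following sketch (in Python rather than MATLAB; the file-reading step is omitted, the per-sample comparison is inlined, and all names are hypothetical):

```python
import math

def sample_score(tracker_hz, truth_hz):
    # Correct when both are rests (0 Hz), or when the pitches agree
    # within one quarter-tone in log-frequency.
    ok = 0
    for f, t in zip(tracker_hz, truth_hz):
        if f == 0.0 or t == 0.0:
            ok += (f == t)
        else:
            ok += abs(12 * math.log2(f / t)) <= 0.5
    return ok / len(truth_hz)

def evaluate_batch(dataset):
    """dataset: mapping of file name -> (tracker output, transcription truth),
    both as per-sample frequency lists. Returns one score per file, ready
    to hand to an automated training loop."""
    return {name: sample_score(trk, tru)
            for name, (trk, tru) in dataset.items()}
```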
3. Creating a Composition
An additional requirement of this course was to produce a composition in which I
creatively use a pitch tracker. In order to do this, I needed to obtain a pitch tracker to
work with. Professor Alves provided me with a pitch tracker implemented in Csound.
This tracker was used in the development of a pitch tracker evaluator, as documented
above. I experimented with this tracker for a while, but eventually decided to use a
different one written by Barry Vercoe that I found in The Csound Book. This
implementation used some of the same opcodes as in the one Professor Alves provided
me with, but was better documented in the book, and also included a primitive
harmonizer.
3.1. Creative Use of a Pitch Tracker
When faced with the task of creating a composition using a pitch tracker, I needed
to come up with a creative way to use one. Having a pitch tracker at my disposal means
that I can write a Csound program that exhibits behavior based on the pitch being
played. Basically, I can have the program do something
different based on what note I’m playing. One simple idea is to detect a note and play
back the same note transposed over a specified interval. If played simultaneously with the
original signal, this would produce harmony at a fixed interval. This is what the Vercoe
tracker did with its harmony. It detected the current note and harmonized it into a major
triad with the input note as the root. This could be done more intelligently by changing
the intervals for the harmony based on what note is being played. For example, one could
remain in a specific key by detecting the note and specifying the appropriate intervals for
harmony based on that note. More advanced approaches could avoid specifying the key
and have the Csound program intelligently detect the key of the piece, or the key it is in
at any given moment. Some assumptions would have to be made to do this, such as only
looking for common key changes.
3.2. My Approach
For my composition, I ended up using a simple harmony that remains in the key
of C. I wrote programs to harmonize by thirds and by fourths. When harmonizing by
thirds, the program first runs the detector, then checks which of the twelve notes is
closest to the note being played, and finally harmonizes by a major or minor third, based
on the incoming note. This results in the original note sounding together with a note a
third above in the key of C. When harmonizing by fourths, the program does essentially
the same thing, but adds two notes: one note is a fourth below the original and the other is
another fourth below, transposed up an octave. This results in the original note sounding
together with a note a fourth below and a note a second above, both in the key of C.
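The mapping from detected note to harmony interval for the thirds version can be sketched in Python using MIDI note numbers (my own illustration; the interval table transcribes the major/minor branches of the Csound orchestra in this section, including its handling of non-diatonic pitch classes):

```python
# Semitones added above each pitch class (C = 0) to stay in the key of C:
# diatonic notes C, F, and G get a major third; the rest get a minor third.
THIRD_ABOVE = [4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]

def harmonize_third(midi_note):
    # Return the note a diatonic third above the input, in C major.
    return midi_note + THIRD_ABOVE[midi_note % 12]
```

For example, harmonize_third(62) gives 65, pairing D with F (a minor third), while C and G receive major thirds (E and B), keeping the harmony diatonic to C.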
Below is my Csound orchestra code for harmonizing by duplicating the input note
up a third. It is based on the pitch tracker mentioned above written by Barry Vercoe.
sr     = 22050
kr     = 220.5
ksmps  = 100
nchnls = 1

instr 1
  a1 in                                    ; read the live input
  a1 reson a1, 0, 3000, 1                  ; filter the input before analysis
  w1 spectrum a1, .02, 6, 24, 12, 1, 3     ; spectral analysis of the input
  koct, kamp specptrk w1, 1, 7.0, 9.0, 8.0, 10, 7, .7, 0, 3, 1, .1 ; track pitch (koct, octave-point) and amplitude
  a2 delay a1, .066                        ; delay the input to align with the tracker
  kn = frac(koct)                          ; pitch class, as a fraction of an octave
  ; choose a major or minor third so the harmony stays in the key of C
  if (abs(kn - 0/12) <= .5/12) kgoto major   ; C
  if (abs(kn - 1/12) <= .5/12) kgoto minor
  if (abs(kn - 2/12) <= .5/12) kgoto minor   ; D
  if (abs(kn - 3/12) <= .5/12) kgoto minor
  if (abs(kn - 4/12) <= .5/12) kgoto minor   ; E
  if (abs(kn - 5/12) <= .5/12) kgoto major   ; F
  if (abs(kn - 6/12) <= .5/12) kgoto major
  if (abs(kn - 7/12) <= .5/12) kgoto major   ; G
  if (abs(kn - 8/12) <= .5/12) kgoto minor
  if (abs(kn - 9/12) <= .5/12) kgoto minor   ; A
  if (abs(kn - 10/12) <= .5/12) kgoto minor
  if (abs(kn - 11/12) <= .5/12) kgoto minor  ; B
major:
  kharm1 = semitone(4)                     ; frequency ratio of a major third
  kgoto main
minor:
  kharm1 = semitone(3)                     ; frequency ratio of a minor third
  kgoto main
main:
  kharm2 = 0                               ; no second harmony voice
  kpch = cpsoct(koct)                      ; tracked pitch converted to Hz
  a3 harmon a2, kpch, .2, kharm1, kharm2, 0, 110, .1 ; generate the harmony voice
  ;a3 delay a3, .2
  out a2 + .8*a3                           ; mix the delayed input with the harmony
endin
I spent a good portion of time trying to tweak the parameters of the pitch tracker
to perform well on my guitar input with little success. There is also a delay in the output
due to computations in the pitch tracker that makes it very difficult to play in real time.
However, I was surprised to find that the pitch tracker and harmonizer work wonderfully
on a human voice. Because of this discovery, I decided to compose my piece with a few
guitar tracks for background and only vocals making use of the pitch tracker.
3.3. Future Ideas
In hindsight, it would have been ideal to have a working machine learning
algorithm that could come up with good parameter values for the pitch tracker to track
guitar input. I believe that the problems in tracking guitar input were rooted in the attack
of the notes. On a guitar, whether notes are picked with a plectrum or plucked with a
finger, there is a small amount of noise before the note actually starts. When the tracker
had problems, this attack noise usually seemed to be the cause.
4. Conclusion
Although this project did not meet its specified goals, significant progress was
made which opened up doors for possible future work. The pitch tracker evaluator should
only take a few steps to complete. This could be extremely useful for training a pitch
tracker with a machine learning algorithm. Once this process is feasible, a pitch tracker
could be trained on the types of data with which a user intends to use it. For example, if I
intend to use a pitch tracker on my guitar, I could create some labeled training data of my
guitar playing and train the pitch tracker to become well-attuned to my guitar playing in
particular. It is also possible further in the future that this process could be automated into
a Csound opcode, which was the original goal of this project before revision.