ICT-GroupProject-Report2-NguyenDangHoa_2

University of Sciences and Technologies of Hanoi
ICT Department
GROUP PROJECT REPORT
Pitch detection algorithms
and application in musical key detection
Group members
NGUYEN Dang Hoa USTHBI4-055
NGUYEN Gia Khang USTHBI4-072
NGUYEN Thi Thu Linh USTHBI4-085
NGUYEN Duc Thang USTHBI4-139
NGUYEN Minh Tuan USTHBI4-155
Supervisor
Dr. TRAN Hoang Tung
University of Science and Technology of Hanoi
February, 2016

GROUP PROJECT REPORT Pitch detection algorithms
Contents
Table of Abbreviations 2
Abstract 3
1 Introduction 4
2 Project management status 4
3 Theoretical background and state of the art 6
3.1 Pitch detection algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Musical key detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4 Scientiﬁc methods and materials 8
4.1 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2 Pitch detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2.1 YIN Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.2.2 Cepstrum Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2.3 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.3 Musical key detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.3.1 Generating a PCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.3.2 PCP comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.3.3 JAVA implementation/Android application . . . . . . . . . . . . . . . 12
5 Results and discussion 14
5.1 Pitch detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.2 Key detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
6 Conclusion 17
Acknowledgement 18
References 19
Appendix A 20
Page 1 of 21

Table of Abbreviations
Discrete Fourier transform . . . DFT
Fundamental frequency . . . f0
Graphic user interface . . . GUI
Inverse discrete Fourier transform . . . IDFT
Low-pass filter . . . LPF
Pitch class profile . . . PCP
Pitch detection algorithm . . . PDA
Pitch period . . . PP
Simplified Inverse Filter Tracking . . . SIFT
Page 2 of 21

Abstract
A research was conducted to study the performance of pitch detection algorithms and their
application in musical key detection using pitch class profiles. Sample sets to test them was
constructed from voiced/unvoiced and clean/noisy sound signals. The tested algorithms are:
YIN estimator, Cepstrum analysis and Simplified Inverse Filter Tracking. Each algorithm
was tested on the samples, and the pitch contours generated are discussed to compare their
strength and weaknesses in different environments. A pitch profile generator was developed
in JAVA using all three algorithms, and an Android application was created afterwards to
perform musical key detection of sound sequences recorded by mobile phone users. Key
detection results are then shown and discussed for each of the tested algorithms.
Page 3 of 21

1 Introduction
Fundamental frequency (f0) in signal processing is defined as the lowest frequency of a
periodic signal and also the inverse of its period. f0 usually defines the subjective pitch of
a sound and f0 estimation, also known as pitch detection, throughout many years remains
a popular topic for research. It is useful in various contexts, from digital music processing
programs to voice encoder or speech support for the hearing impaired.
Despite being essential to signal processing systems, as of now there has yet to be an ideal
pitch detection algorithm (PDA). When subjected to a clean and clearly pitched signal, most
PDAs perform well, however if the input has heavy noise or multiple pitches, the results vary
significantly. Whether a PDA is truly good or not boils down to the condition of the signals
it is applied to, as each one has its own strength in some scenarios but do very weakly with
others.
At the time of starting this study a wide variety of PDAs were available, but we decided
to focus on a selected few that were deemed most reasonable in order to compare their
strengths and weaknesses within the same environment - detecting musical key of a recorded
voice sequence. The major goal is to establish a performance evaluation of these PDAs,
while developing a mobile application to implement them effectively.
The main part of our reports includes 5 sections excluding the Introduction. Section 2
(Project management status) assesses the overall progress. Section 3 (Theoretical background
and state of the art) provides a literature review on PDAs and basic music theories. Section
4 (Scientific methods and materials) describes the tools and step-by-step methods used dur-
ing the project. Section 5 (Results and discussion) explains the results obtained from our
implementations and our comments on them, and Section 6 concludes the report.
2 Project management status
From the start of this project, our group and the supervisor held weekly meetings in ICT Lab
to discuss the project goals and overall progress. Initially, as we had no prior experience with
mobile programming, work division between the five members was 3 on literature review and
2 on the development of an Android application. We aimed for a pitch detection application
Page 4 of 21

at first, but decided to expand into a wider scope after doing additional research on music
processing and the need of a related application. Adjustments were made on the way, and
eventually the group set down to a key detection application, since it is possible to develop
with our knowledge then and we would have difficulties trying to achieve a more complex
objective.
Details on the tasks and achievements are described in Table 2.1.
Task In charge Outcome
General research on digital sound
processing
Everyone Basic comprehension of digital sig-
nals, sampling, filtering, etc.
Develop basic Android application
for sound recording
Thang,
Tuan
Runnable application
Research on pitch detection algo-
rithms
Khang,
Linh, Hoa
Proposed three suitable algorithms
In-depth research on proposed al-
gorithms and key detection
Khang,
Linh, Hoa
Pseudocodes and MATLAB tests
Implement proposed algorithms on
JAVA
Thang,
Tuan
Done
Research on musical key detection Hoa, Linh Proposed a method to detect keys
Develop Android application for
key detection with user interface
Thang,
Tuan
Runnable application with simple
GUI
Putting the report together Linh,
Khang
Done
Table 2.1. Project management and progress.
Page 5 of 21

3 Theoretical background and state of the art
In this section, we provide an overview of the types of PDA chosen for investigation, their
characteristics, basic knowledge on music theories and some most prominent research regard-
ing these matters.
3.1 Pitch detection algorithms
Accurate and reliable pitch measurement is often extremely difficult for many reasons: the
voice sequence is often not a perfect train of periodic pulses, one sequence can be composed
from a variety of PPs which are hard to separate, etc. Therefore, it is necessary to have a
grasp on current studies so we can use them into this project.
PDAs are most commonly classified in three categories: time-domain, frequency-domain
or hybrid. Time-domain methods run directly on the speech waveform, frequency-domain
methods take advantage of the impulse series that arise in the frequency spectrum, and
hybrid ones incorporates properties of both domains. From each category, we picked one
signature PDA as follows:
• YIN Estimator (Time-domain): Autocorrelation method, a prominent representative
in this category, attempts to find PP by evaluating primary peaks of the input’s au-
tocorrelation. It is good with mid to low frequencies, but makes too many errors in
various applications. YIN estimator - developed by De Cheveigné and Kawahara in
the early 2000s - is based on the basic principles of autocorrelation method but with
several modifications in order to solve the problem. It minimizes the difference between
the input and its delayed copy, thus reduce the errors. Cheveigné and Kawahara have
theorized that YIN can be implemented efficiently with low latency, and has no upper
boundary in the pitch search range.
• Cepstrum Analysis (Frequency-domain): Cepstrum - a word play on spectrum first
defined by Bogert et al in a 1963 paper [1] - is essentially the inverse discrete Fourier
transform (IDFT) of the log magnitude of the spectrum of a signal. In 1967, Schroeder
and Noll proposed an application of cepstrum analysis in pitch detection, which is
based on the fact that the Fourier transform of a signal usually has regular peaks
Page 6 of 21

representing its harmonic spectrum [3]. Taking the cepstrum of a signal eliminates
those peaks, thus remove the effects of overtones in human voice and make it much
easier to define the pitch.
• Simplified Inverse Filter Tracking (Hybrid): This was first proposed by Markel in 1972
[5] as a simple algorithm, possible to be realized in real time yet covered the positive
traits of both autocorrelation and cepstral methods. This algorithm suggests fast
runtime with a composition of elementary computations, while also offers to classify
between voiced/unvoiced regions of an input. Its core operations were based on a
simplified version of digital inverse filtering, hence the name “Simplified Inverse Filter
Tracking” (from here on referred to as SIFT).
3.2 Musical key detection
In music, the term note is used to specify frequencies within a certain range of pitch which
human ear has similar perception and can hardly distinguish. Any two notes whose ratio is
a power of two are grouped into a pitch class. Generally, we divide pitches into 12 classes:
C, C (D ), D, D (E ), E, F, F (G ), G, G (A ), A, A (B ), and B. In each pitch class, notes
can be distinguished by adding a number after the notation of its class name. For instance,
C3 has lower frequency than C4, C5 and so on.
A piece of music is an ordered set of notes. However, in order to create a good music, this
set is often limited to less than twelve pitch classes. In most cases, this number is around
seven. These specific classes in the song, which can be denoted as its scale, forms an abstract
concept called tonality. Tonality is mostly derived from human sense over a song rather than
any exact definition, which means that two pieces of same-tonic music will be perceived
relatively similar.
Figure 3.1. Example of main pitch classes within C scale.
Page 7 of 21

Most music is composed in a major or minor scale, each scale has a ”key” note (for example,
C major scale is in key C major) which means there are a total of 24 major/minor scales.
Determining the key of a song is crucial to musician, yet also extremely difficult because
there is no mathematical formula to define or even guess it after capturing the set of notes
in the song.
3.3 State of the art
Throughout the history of pitch tracking, few thorough studies to compare different types
of detection methods have been conducted. Most research focus on the properties and
applications of one method alone, due to the difficulties in selecting algorithms to evaluate,
setting a reasonable standard of comparison and compiling a comprehensive database. For
the fundamental part of our study, we decided to look at papers which introduced the
concepts of chosen PDAs as follows:
• YIN, a fundamental frequency estimator for speech and music [2]
• Cepstrum Pitch Determination [6]
• The SIFT Algorithm for Fundamental Frequency Estimation [5]
Musical key detection using pitch class profiles (PCPs) on the other hand was under extensive
research, with different dataset of various genres generating different base key profiles [7].
The general goal for such research tends to be to shape the principle of key detection in
human brain. For practical purpose, we focused on one algorithm proposed in a 2007 Master
thesis from the University of Vienna [8].
4 Scientific methods and materials
We discuss in this section our approach to PDA implementation, to key detection and to
mobile application development. The step-by-step process we propose might not be optimal,
but is simple enough to deploy using our current skills and tools.
Page 8 of 21

4.1 Tools
For this study, the following softwares and tools were used:
• IntelliJ 14.1.5 / Eclipse 4.5.1
• Android Studio 1.5
• Audacity
• Android phones
4.2 Pitch detection
Initial experiments were conducted using JAVA. We implemented the three PDAs according
to their proposed formulas on a set of pre-recorded sound samples to see the margin of
difference in their pitch estimates.
The samples used are of a female voice singing ‘ah’ at pitches from G 3 to B3.
The steps for each of the PDAs are described as follows:
4.2.1 YIN Estimator
First, a difference function is applied on the input signal xt:
dt(τ) =
W
j=1
(xj − xj+τ )2
dt(τ) is zero at zero lag and often nonzero at the period because of the imperfect periodicity,
therefore a cumulative mean normalized difference function is applied to avoid the zero lag
dip, normalize the function for the next step and reduce too-high errors.
dt(τ) =



1 τ = 0
dt(τ)
1
τ
τ
1 dt(j)
otherwise
Page 9 of 21

An absolute threshold is applied to reduce the too-low errors, then each local minimum dt
is subjected to parabolic interpolation in order to define the PP estimate.
Finally, for each index t, we search for a minimum of dθ(Tθ) for θ within [t − Tmax/2, t +
Tmax/2] where Tθ is the estimate at time θ and Tmax is 25ms. The best local estimate
obtained is the pitch of xt.
4.2.2 Cepstrum Analysis
The cepstrum of a signal is defined with the following formula:
cn = F−1
{log(|F(xn)|)}
For our purpose of pitch detection, the cepstrum of a windowed frame of signal is necessary
an is defined through the Fourier series:
cn =
N−1
n=0
log(|
N−1
n
xne−jk 2π
N
n
|)ejk 2π
N
n
The pitch can then be estimated by picking the peak of the resulting signal.
DFT log IDFT
xn Xk
ˆXk cn
Figure 4.1. Block diagram of Cepstrum analysis.
4.2.3 SIFT
First, the input signal sn with sampling frequency 10kHz is low-pass filtered with a cutoff
at fc = 0.8kHz. The filter output xn is downsampled by a 5:1 ratio to reduce the number of
operations in later steps but still retains correctness.
The signal is then analyzed frame-by-frame, with a 64-sample frame length and 32-sample
frame shift. A 4th-order linear predictive analysis is then performed to obtain a set of
coefficients, then the frame is inverse filtered using said set to produce a residual signal.
Page 10 of 21

Consequently, the autocorrelation of that signal is searched for the primary peak which is used
to determine f0. Finally, the autocorrelation function is interpolated in the neighborhood of
the calculated pitch to increase the resolution of f0.
LPF 0.8kHz
5:1
Inverse Filter Autocorrelation Interpolation
sn xn wn yn rn f0
Figure 4.2. Block diagram of the SIFT algorithm.
Full details on the formulas involved can be found in Appendix A.
4.3 Musical key detection
The basic process is in three steps: pitch detection, pitch class profile (PCP) generation and
PCP comparison.
4.3.1 Generating a PCP
A pitch class profile (PCP) is a 12-dimension vector whose each parameter represents the
intensity of a pitch class. Generating a PCP is the first step in the key detection process
since it then will be compared to the referenced profile to find the most suitable key that fit
the generated PCP.
4.3.2 PCP comparison
A generated PCP will be compared to 24 standard PCPs of 24 keys to find the closest one.
In this project, we used the linear comparison algorithm, which was proved in several related
papers to give the closest result.
The base key profile used is one derived by Krumhansl and Kessler in 1982 [4].
Page 11 of 21

C C D D E F F G G A A B
1
2
3
4
5
6
7
Intensity
Figure 4.3. Example: C minor key proﬁle of Krumhansl and Kessler.
4.3.3 JAVA implementation/Android application
Before moving on to Android, a test version on JAVA is developed. The key detection part
of the program runs basically as follows:
• After obtaining the pitch array from the buﬀers created, the output will be put through
the function intensityNote() to generate a PCP vector (an array of Note objects) of
the whole song:
1 public Note [] intensityNote(List <Note > noteList){
2 Note [] notes = Note.copy(Note.NOTES);
3 for(int i = 0; i < notes.length; i++ ) {
4 double intensity = 0;
5 for (Note item: noteList) {
6 if(notes[i]. equals(item)){
7 intensity += item.getIntensity ();
8 }
9 }
10 notes[i]. setIntensity(intensity);
11 }
12 return notes;
13 }
• These ”raw” PCP will be normalised to the range 0.0 to 1.0 in order to be compatible
in the subsequent comparison process. The “loudest” note (the note that has highest
intensity) is set the value of 1 and vice versa:
1 public void normalize(Note [] notes){
2 double max = notes [0]. getIntensity ();
3 double min = notes [0]. getIntensity ();
4 for(Note note: notes){
5 if(max < note.getIntensity ()) max = note.getIntensity ();
Page 12 of 21

6 if(min > note.getIntensity ()) min = note.getIntensity ();
7 }
8
10 double intense = 1 - ((max - note.getIntensity ())/(max -min));
11 note.setIntensity(intense);
12 }
13 }
After that, we will get the PCP, which each key has an unique range (for example...).
2 profile[i] = note.getIntensity ();
3 i++;
4 }
• Using findKey(profile) to compare with the key database, we will get the key output.
In this project, we utilised the linear comparison algorithm. The key that has the
vector with the smallest distance to the generated PCP will be assigned to be the main
key of the song:
1 public static Key findKey(double [] notes){
2 Key key = new Key();
3
4 double min_error = Double.MAX_VALUE;
5
6 for(Key k:Key.KEYS){
7 double distance = 0;
8 for(int i = 0; i < notes.length; i++){
9 distance += (notes[i] - k.getSequence ()[i]. getIntensity ())*(
notes[i] - k.getSequence ()[i]. getIntensity ());
10 }
11 if(distance < min_error){
12 min_error = distance;
13 key = k;
14 }
15 System.out.println(k.getName () + ": " + distance);
16 }
17 return key;
18 }
The idea for the Android application is to let users sing and record a sequence of notes in
a song, analyze the input and return the appropriate key. The user can choose between the
three algorithms to use for pitch detecting.
The application is tested on several mobile phones with different OS version and hardware
specifications. Some of them are:
• Asus Zenphone 5 501CG, CPU x86 Intel, OS version: 4.3/5.0
Page 13 of 21

• Xiaomi Redmi Note 2, CPU ARM Mediatek, OS version: 5.0
• Vega Sky A850, CPU ARM Snapdragon, OS version: 4.1.2/4.4.4/5.0
5 Results and discussion
We demonstrate in this section the results obtained from the experiments in Section 3, along
with our discussion and comments on the matter based on our knowledge.
5.1 Pitch detection
Results from using YIN estimator, Cepstrum Analysis and SIFT on 16 cleanly-pitched sound
samples are presented in Figure 4.1. and Table 4.2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
200
250
300
350
400
450
500
Intensity
True pitch
SIFT
YIN estimator
Cepstrum
Figure 5.1. Pitch estimates comparison between three PDAs.
True pitch curve indicates the ideal pitches from G 3 to B3.
Page 14 of 21

Case Base YIN Cepstrum SIFT Case Base YIN Cepstrum SIFT
1 207.6 209.7 206.0 210.5 9 329.6 326.5 326.0 320.0
2 220.0 218.7 218.0 216.2 10 349.2 348.8 350.0 347.8
3 233.1 234.2 234.0 235.3 11 370.0 368.5 366.0 363.6
4 246.9 244.8 244.0 242.4 12 392.0 389.2 390.0 381.0
5 261.6 259.2 262.0 258.0 13 415.3 409.8 408.0 421.0
6 277.2 276.4 278.0 275.8 14 440.0 436.4 436.0 444.4
7 293.7 291.6 292.0 296.3 15 466.2 460.7 458.0 470.6
8 311.1 306.7 306.0 307.7 16 493.9 488.1 490.0 500.0
Table 5.2. Obtained f0 (Hz) as demonstrated in Figure 4.1.
We can see that the curves created by the pitch estimation of all three algorithms do not
deviate too far from the true pitch curve. The higher the pitch the less accurate the results
(from a margin of less than 2Hz in the first few samples to 7-10Hz in the last ones). However,
we only use the estimates to perform key detection and adjacent notes are more different
than another the higher their pitches are, so up-to-10Hz is a good enough margin.
A plot of the pitch contours of each case showed that SIFT provided most consistent results
over the frames analyzed (a perfect contour even, in case 8 and 16) except for very occasional
surges or dips. This could be because of the lack of a voiced/unvoiced decision criteria, as
even in a voice speech some frames can be unvoiced and have unrealistic f0.
It is worth noting however that due to the inaccuracies in human hearing while making the
samples, the real pitches of our samples are not exactly the same as the true pitch of the
note range chosen. YIN and Cepstrum generated almost the same results with margin of
difference within 2Hz in 12/16 cases, making it probable that their results are closer to the
real pitch estimates than SIFT.
We have also tested the algorithms with noisy and unvoiced samples to find out if there are
any weaknesses. SIFT and Cepstrum returned the same results in the experiment with 16
same samples but with additive white Gaussian noise (AWGN), signal-to-noise ratio from
0.001dB - 1dB, but YIN couldn’t detect the pitch at all. SIFT even yielded the exact same
Page 15 of 21

numbers, probably because the AWGN was eliminated during prefiltering. On the other
hand, with unvoiced but toned signals, Cepstrum performed poorly because of its nature:
generating irregular, too high or too low frequencies.
5.2 Key detection
We successfully developed a working application, albeit with a very simple user interface. As
shown in Figure 4.2., the GUI composes of a pair of record/stop button for sound recording
and the options to choose which algorithm the key detection process will be based on.
Figure 5.3. Mobile app GUI while recording and after doing key detection
To test the accuracy of the application, we recorded 8 melodies of different keys, tried
different algorithms and collected the results as shown in Table 4.3.
The majority of key detection is good but with some inaccuracies. This is a reasonable
outcome, as the accuracy relies on many factors: quality of the recordings, whether the
recorded tones is truly of the expected pitches, whether the base key profiles used are suitable,
etc. The key profile in particular is an extremely important factor, because it was derived
from a dataset of a group of input and can be good or bad depending on the dataset size
and their nature. The wrongly detected keys in this experiment all belong to the harmonic
scale of the true key and have little difference in pitch class intensity, so we can conclude
Page 16 of 21

that the algorithms are working very close to the expected results.
Case True key YIN SIFT Cepstrum
1 C major C major F major C major
2 C major G major C major G major
3 E minor C minor E minor E minor
4 F major F major F major F major
5 G major D major E minor E minor
6 G minor G minor G minor G minor
7 A minor A minor E major C major
8 A major A major A major A major
Table 5.4. Key detection test results on mobile application.
All algorithms missed 3 out of 8 cases.
6 Conclusion
At the end of this project, we have succeeded in using the three PDAs for pitch detection.
Apparently, while all PDAs deliver good f0 estimates in general, each of them has its own
pros and cons. Our experiment show that when applied to clean signals, YIN estimator
and Cepstrum have closer results than SIFT, but YIN cannot detect the pitch of heavily
noised signal and Cepstrum does not work with unvoiced inputs. We did not implement
the voiced/unvoiced decision part of SIFT but our version worked reasonably well with all
tested samples - although with results slightly less accurate - and the calculations in SIFT are
simple making it easy to implement on any platform. Our pitch and key detection programs,
starting from MATLAB to JAVA and Android Studio, were developed successfully, but not
without a long period spent on optimizing the algorithms to shorten runtime and make them
more suitable for mobile phones. Overall, we have proven the ability as well as limitations
of PDAs, so even when they are not the perfect solutions to pitch and key detection we can
still take advantage of them.
Page 17 of 21

Nevertheless, on the course of this project, our group have experienced various difficulties.
One of the most significant problem is the lack of recent research papers and theoretical
resources, as related documents are usually not available to public viewers. Even if there is
any they most likely are intended for experienced readers, so we had to rely mostly on the
original research of the PDAs along with the supplements they provided. Another difficulty
is time constraint, as we had to process a large amount of new information and skills beyond
our understanding at the start of the project, that without the advice from our supervisor
we wouldn’t be able to deal with. Task division and collaborations between group members
proved problematic too at first, but we have improve our teamworking skills over the time
and were able to overcome this obstacle.
Even though it has come to an end, from the knowledge we have gained during the project,
we are aware that the application has the potential of a fully usable and marketable product.
We are looking forward to further studies on signal and music processing in the future in
order to improve it into a more complete version.
Acknowledgement
We would like to express our heartfelt appreciation to Dr. TRAN Hoang Tung for the
patience and enthusiasm that he guided us with during the course of this project. Without
your help, we would not be able to complete our work successfully.
Our gratitude also to the staff of Information and Communication Technology Department
and ICT Lab for their valuable assistance.
Page 18 of 21

References
[1] Bogert, B. P., Healy, M. J. R., and Tukey, J. W. (1963). The Quefrency Alanysis of
Time Series for Echoes: Cepstrum, Pseudo Autocovariance, Cross-Cepstrum and Saphe
Cracking, Proceedings of the Symposium on Time Series Analysis, Chapter 15, 209-243.
[2] De Cheveign´e, A., and Kawahara, H. (2002). YIN, a fundamental frequency estimator
for speech and music, Journal of the Acoustical Society of America, 111, 1917-1930.
[3] Gerhard, D. (2003) Pitch Extraction and Fundamental Frequency: History and Current
Techniques, technical report, Dept. of Computer Science, University of Regina.
[4] Krumhansl, C. L., and Kessler, E. J. (1982) Tracing the Dynamic Changes in Perceived
Tonal Organization in a Spatial Representation of Musical Keys, Psychological Review,
89-4, 334-368.
[5] Markel, J. D. (1972). The SIFT algorithm for fundamental frequency estimation. IEEE
Trans. Audio Electroacoust., AU-20, 367-377.
[6] Noll, M. A. (1967), Cepstrum Pitch Determination, Journal of the Acoustical Society of
America, 41-2, 293-309.
[7] Temperley, D., and Marvin, E. W. (2007) Pitch-Class Distribution and the Identiﬁcation
of Key, Music Perception, 25-3, 193-212.
[8] Zenz, V. (2007) Automatic Chord Detection in Polyphonic Audio Data, diploma thesis,
Vienna University of Technology.
Page 19 of 21

Appendix A
The formulas involved in SIFT algorithm - adapted from the paper by Markel [5] is described as
follows:
1. The output xn after putting input sequence sn through the low-pass filter is obtained by:
xn = a1un + a2xn−1 + a3xn−2
un = a3sn + a2un−1
where xn = 0 and un = 0 if n < 0 and
a1 = 1 − e−α1T
a2 = e−α1T
a3 = 1 − 2e−α2T
cosβ2T + e−2a2T
a4 = 2e−α2T
cosβ2T
a5 = −e−2a2T
α1 = (0.3572)2πfc
α2 = (0.1786)2πfc
β2 = (0.8938)πfc
fc = 0.8kHz
T = 0.1ms
2. Assuming xn is at a 10kHz sampling rate, the downsampled sequence wn at 2kHz is then
created by taking every fifth sample of xn.
Note: From step 3 onwards, wn is analyzed frame-by-frame, each frame has 64 samples and
each frame shift takes 32 samples. After the analysis, we can obtain a sequence of f0 that is
consistent with the pitch contour of sn. Therefore it is assumed in step 3-5 that the input is
a 64-sample sequence and the final output is collected to a list of pitches.
3. The coefficients to the 4th inverse filter is computed by the autocorrelation equations
4
i=1
aipi−j = −pj
with the coefficients calculated by pi = N−1−j
n=0 wnwn+j with j = 0, 1, . . . , 4. The filter only
Page 20 of 21

uses 4 coefficients, so it is possible to obtain a solution for ai from the set of equations
a1p0 + a2p1 + a3p2 + a4p3 = −p1
a1p1 + a2p0 + a3p1 + a4p2 = −p2
a1p2 + a2p1 + a3p0 + a4p1 = −p3
a1p3 + a2p2 + a3p1 + a4p0 = −p4
The inverse filter output yn is then calculated as yn = wn + 4
i=1 aiwn−i where wn = 0 when
n < 0.
4. f0 can be obtained from 64 samples of the autocorrelation sequence of the inverse filter
output, defined as rn = N−1−n
j=0 yjyj+n, with n = 0, 1, . . . , 63.
The estimated pitch is defined as the distance from r0 to the first peak ˆr in ms.
5. For a more accurate estimation, the area surrounding peak ˆr is interpolated by a ratio of 4
to 1. Let’s say ˆr is obtained at position ˆN, define γa with a = 0, 1, . . . , 8 as the interpolated
sequence surrounding r ˆN with γ0 = r ˆN−1, γ4 = r ˆN , γ8 = r ˆN+1.
The rest of γa can be computed using the simplified interpolation equations:
γ±3/4 γ±1/2 γ±1/4
T
=





0.879124 0.321662 −0.150534
0.637643 0.636110 −0.212208
0.322745 0.878039 −0.158147





γ±1 γ0 γ 1
T
We re-examine γa to get the precise peak ˆγ at index â.
f0 of the frame in question is finally obtained then by P = ( ˆN + â−4
4 )/2. We can get the f0
in kHz by F0 = 1/P..
Page 21 of 21

ICT-GroupProject-Report2-NguyenDangHoa_2

Recommended

Recommended

More Related Content

Similar to ICT-GroupProject-Report2-NguyenDangHoa_2

Similar to ICT-GroupProject-Report2-NguyenDangHoa_2 (20)

ICT-GroupProject-Report2-NguyenDangHoa_2