NEW YORK UNIVERSITY
Urban Soundscape
Acoustic Event Classification
by
Zhiyuan Lin
Submitted in partial fulfillment of the requirements for the Master of Music in
Music Technology in the Department of Music and Performing Arts Professions in
The Steinhardt School New York University
Advisor: Tae Hong Park
June 2015
NEW YORK UNIVERSITY
Abstract
Steinhardt
Master of Music
by Zhiyuan Lin
Automatic urban soundscape classification is an emerging research field
that has in recent years grown alongside Big Data science. The field has
its roots in acoustic ecology and soundscape studies while also offering
interesting practical possibilities. For example, Soundscape Information
Retrieval (SIR) [ICMC 2014 paper on SIR] can provide city managers with
cyber-physical platforms for emergency response, noise feedback, and
other areas of urban management of significant importance. This thesis
explores different methods for automated real-time soundscape
classification, with artificial neural networks and deep learning as the
main research tools. The focus is on machine learning and SIR without
engineered salient features such as MFCCs or spectral centroids;
instead, we directly use raw spectral information computed from
soundscape recordings. We utilized the Citygram soundscape database to
train and develop the prototype model.
Acknowledgements
I would like to thank Professor Tae Hong Park for his guidance, and
Professor Juan Bello, who saved me from confusion. They both encouraged
me and pointed out the way forward. Xiang Zhang, who guided me into the
field of machine learning, is not only a good friend but also a good
teacher. Many people helped me with my English, including Huilin Pan,
Eric Zhang, and Samuel Mindlin. Of course, I thank my parents for their
financial support. There are countless people who encouraged and
supported me; I will not forget them.
Contents
1 Introduction......................................................................................................................7
2 Prior work ......................................................................................................................10
2.1 CityGram.................................................................................................................10
2.2 Machine Learning ...................................................................................................11
3 Dataset creation..............................................................................................................16
3.1 Data from CityGram Project ...................................................................................16
3.2 Phase 1: Semi-manual sorting data and ground truth..............................................18
3.3 Phase 2: Data Preprocessing ...................................................................................20
3.4 Phase 3: Principal Component Analysis (PCA).....................................................23
3.5 Phase 4: Training, Verification, and Testing...........................................................24
4 Neural network training .................................................................................................25
4.1 Sparse Auto-encoder...............................................................................................25
4.2 Back Propagation ....................................................................................................28
4.3 Conjugate Gradient Optimization ...........................................................................29
5 Reliability Verification...................................................................................................31
5.1 The control group data creation ..............................................................................31
5.2 Structure of an artificial neural network ....................................................32
5.3 The control group result..........................................................................................36
6 Training and Result........................................................................................................39
6.1 Training...................................................................................................................39
6.2 Result ......................................................................................................................40
7 Conclusions and Future Work........................................................................................47
7.1 Conclusions.............................................................................................................47
7.2 Future Work ............................................................................................................48
List of Figures
Figure 1 CityGram artificial acoustic event mark page.....................................17
Figure 2 audio cutting........................................................................................21
Figure 3 The auto-encoder layout [Andrew Ng, Jiquan Ngiam, Chuan Yu Foo,
Yifan Mai, Caroline Suen. 2013]...............................................................26
Figure 4 optimization algorithms for sparse auto-encoder [ Le, Q. V., et al.
2011]..........................................................................................................30
Figure 5 lambda-cost curve ...............................................................................35
Figure 6 Distribution matrix results of experimental group ..............................41
Figure 7 Distribution matrix results of smaller network 1 ................................42
Figure 8 Distribution matrix results of smaller network 2 ................................43
Figure 9 Accuracy over varying number of Mel filters.....................................45
List of Tables
Table 1 The training category............................................................................18
Table 2 The architecture and results of 6-layer network ...................................33
Table 3 The architecture and results of 7-layer sparse auto-encoder ................33
Table 4 The testing accuracy of different sample length and segment
combinations..............................................................................................34
Table 5 The iteration of training........................................................................36
Table 6 The control group training result..........................................................36
Table 7 The control group testing result............................................................37
Table 8 The architecture of training network ....................................................39
Table 9 The results of experimental group........................................................40
Table 10 The results of two control groups.......................................................42
Table 11 The architecture of two smaller network 1&2....................................43
Table 12 The results of two smaller network 1&2 ............................................44
1 Introduction
Although identification of sound has seen a lot of prior work in the
fields of music information retrieval (MIR) and soundscape information
retrieval (SIR) [1: Park, T. H., Lee, J. H., You, J., Yoo, M. J., &
Turner, J. 2014], research into urban soundscape classification remains
scarce. Today people use voice commands to control mobile phones and
home appliances, and even to unlock security locks. But these sounds
share two common features: a precise target and a single generating
source. In contrast, the composition of the urban soundscape is complex
and harder to predict. Many sources sound at the same time, and it is
hard to distinguish noise from useful information. These characteristics
make urban sound better suited to a fuzzy algorithm. For a long time,
researchers have relied on a variety of engineered sound features, such
as Mel-frequency Cepstral Coefficients (MFCC) [2: T. Ganchev, N.
Fakotakis, and G. Kokkinakis 2005]. These features and their associated
algorithms work well in certain situations. But for urban sound, the
rules needed for classification increase exponentially, and the
generalization power of a small number of features is limited [3:
Jean-Julien Aucouturier, Boris Defreville, and François Pachet. 2007]. To
engineer and specify such rules by hand is obviously very tedious work.
One branch of machine learning, the artificial neural network, can be
considered a bionic computing architecture. There are no hand-specified
rules within such a network; the machine determines the rules itself,
and the practical efficiency of a well-trained artificial neural network
is very high. It makes decisions quickly and can be used in real-time
systems. Research on artificial neural networks has been extensive, but
most of it still uses conventional audio features. However, most of
these features are not invertible after signal decomposition, which
means that some audio information is lost in the compression. Since
computing power is so advanced today, we can try to process raw data
instead. With less compression, the accuracy and generalization of sound
classification is expected to increase. This thesis presents an
experiment on the direct use of audio for urban soundscape
classification.
This thesis is a subproject of CityGram. Utilizing the Citygram
soundscape dataset for this research saved a lot of time that would
otherwise have been spent on data collection. At the same time, the work
of many predecessors served as important references, including that of a
former “Citygrammer” and former Music Technology student at NYU. Many
parameters of this study were largely based on that previous experience
[4: Jacoby. 2014].
2 Prior Work
A soundscape is a sound or combination of sounds that forms or arises
from an immersive environment [5: Wikipedia]. The sounds of the urban
soundscape are mostly generated by human activity [6: Raimbault, M., &
Dubois, D. 2005]. There is a lot of theoretical research on urban
soundscapes, but most of it lacks automatic scientific tools. In this
regard, Citygram is one of the few big city data research projects. When
introducing the CityGram project, we also need to include some
background about machine learning.
2.1 CityGram
CityGram is a large-scale urban sound data collection and analysis
project. In 2011, the Citygram Project was launched in its first
iteration to develop dynamic non-ocular energy maps focusing on acoustic
energy [7: Park, T. H., Lee, J. H., You, J., Yoo, M. J., & Turner, J.
2014]. Through Remote Sensing Devices (RSDs) installed or dispersed
throughout the city, it collects the city's sounds and generates a
regional soundscape. Researchers have access to real-time audio and
audio feature information through the Citygram server. One group within
CityGram is working on automated soundscape classification using a
database of human-annotated acoustic events in audio. More research will
be launched in the future.
2.2 Machine Learning
Neural networks and support vector machines are today two representative
statistical learning methods in machine learning. Both can be considered
descendants of the linear classification model (the perceptron) that
Rosenblatt invented in 1958. A perceptron only performs linear
classification [8: Freund, Y.; Schapire, R. E. 1999], but real problems
are usually non-linear. Neural networks and support vector machines are
non-linear classification models. In 1986, Rumelhart and McClelland
invented the back propagation algorithm, an important form of supervised
learning that is also used in this experiment. Later, Vapnik et al.
proposed the SVM in 1992 [9: Bottou, L., & Vapnik, V. 1992]. A neural
network is a multi-layer (usually three-layer) non-linear model, whereas
a support vector machine converts a non-linear problem into a linear
one. On a personal computer, training an artificial neural network takes
very long, while the SVM has no small advantage in this regard. SVMs
have therefore been widely used in industry for a long time [10:
Suykens, J. A., & Vandewalle, J. 1999]. However, with advances in theory
and hardware performance, artificial neural networks have once again
demonstrated their capabilities.
2.2.1 Artificial Neural Network
Artificial neural networks (ANNs) are a family of statistical learning
models inspired by biological neural networks. The network does not
store data or manually defined rules; instead, it functions by training
and recording how neuronal nodes react to input data. Artificial neural
networks typically have multiple layers of interconnected neurons. The
bottom layer is the input layer, responsible for receiving external
stimuli. The layers in the middle are called hidden layers. As in
biological neurons, information is passed up layer by layer, a process
similar to the human brain's patterns of induction, generalization, and
analysis. Finally, at the highest level, the output layer, the network
gains the ability to perceive and classify. For example, in the field of
image recognition, the colors and coordinates of the pixels in a picture
serve as the input layer. Lower layers can sum up and describe the lines
and edges in the picture, and higher layers can construct simple shapes
from those lines and edges [11: Duygulu, P., Barnard, K., de Freitas, J.
F., & Forsyth, D. A. 2002]. Finally, the highest level can identify the
object.
In the audio field, the human ear perceives frequency, intensity, and
duration [12: Gaskill, S. A., & Brown, A. M. 1990]. A short-time Fourier
transform spectrum reflects these three dimensions, and its data
structure is similar to a picture. Therefore, we assume that multilayer
neural networks can simulate human perception of sound, and thus achieve
the purpose of acoustic event classification.
A neural network must be trained before it can be used, and training is
an iterative process. Before 2006, a typical artificial neural network
was trained with the back propagation method and adjustable parameters.
Typically, the neuronal node parameters of the network are initialized
with random numbers. During each iteration, the existing data set is fed
to the network, and the network gives recognition results for all
samples at the output layer. The ground truth is compared to the output
layer, and an offset is calculated and passed from the output layer back
to the lower layers [13: Mason, L., Baxter, J., Bartlett, P., & Frean,
M. 1999]. In the next iteration, the node parameters start from the
result of the previous iteration. As this process repeats, the
recognition results of the network gradually move closer to the ground
truth. After training is completed, the network can classify new
samples.
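The iterative loop described above can be sketched in a few lines. The thesis implementation is in MATLAB; the following is a hypothetical, minimal Python illustration using a single sigmoid neuron on a toy OR problem, not the actual network.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=2000, lr=0.5):
    random.seed(0)
    # Node parameters are initialized with random numbers, as in the text.
    w = [random.uniform(-0.5, 0.5) for _ in samples[0]]
    b = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = out - y                 # offset between output and ground truth
            for i in range(len(w)):       # pass the correction back to the weights
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

# Toy ground truth: the OR function.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]
w, b = train(X, Y)
preds = [round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)) for x in X]
```

After enough iterations the outputs move toward the ground truth, so `preds` converges to `[0, 1, 1, 1]`.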
2.2.2 Deep Learning
Compared to the brain, the number of neurons in an artificial neural
network is still very small [14: Jain, A. K., Mao, J., & Mohiuddin, K.
M. 1996]. Moreover, the connections between neurons in the human brain
are more complex and diverse. Therefore, in the field of machine
learning, achieving a high level of perception with a small number of
neuronal connections is very difficult. We naturally think of increasing
the number of layers and nodes of artificial neural networks. But when
the number of layers is large, the offset value propagated from the
output layer becomes very small, which makes training stall at one
stage. In 2006, Hinton proposed Deep Belief Networks, which greatly
enhance the ability of neural networks. His approach is, for a
multi-layer ANN, to first learn the network structure with Boltzmann
Machines (unsupervised learning), and then fine-tune the weights through
back propagation (supervised learning) [15: Hinton, G. E., 2006]. Today,
the use of artificial neural networks for image recognition has made
great breakthroughs. The original intention of this thesis is to
transplant a method of image classification to sound classification, so
as to explore the classification performance of deep learning.
3 Dataset Creation
Similar to how humans learn to discern sounds, supervised machine
learning requires that the machine listen to adequate volumes of audio
samples while being given the category to which each sample belongs.
Each audio sample the machine listens to requires descriptions provided
by a human, and it would be a huge project for one researcher to label
so much audio. Therefore, we used the Citygram Database, built by
multiple researchers and multiple annotators (up to 7 per soundscape
recording at present), which holds the date and time of acoustic events
among other details. Humans are subjective in describing things and
events, especially sounds, yet the results of a supervised machine
learning project rely extensively on the quality of the samples and the
quality of the ground truth. We therefore normalized the audio sample
descriptions from each individual annotator.
3.1 Data from CityGram Project
The CityGram Database houses mainly audio clips and human descriptions
of the acoustic events within them. Participants repeatedly listened to
audio in 2-minute segments to extract meaningful acoustic events and
provide descriptions. They set the start and end times of the acoustic
events, gave verbal descriptions, and assessed the distance of the sound
source and other attributes based on personal judgment.
Figure 1 CityGram artificial acoustic event mark page
Urban soundscapes are noisy [16: Park, T. H., Lee, J. H., You, J., Yoo,
M. J., & Turner, J. 2014]. Moreover, the verbal descriptions are not
well normalized, as different people have different language
preferences. For example, some people would use "walking" while others
would use "footprint". Some labels differentiate the origins of the
sound, such as "men's voice" from "female speaking", while others would
simply categorize both as "human sound."
In the training step of supervised machine learning, each category
demands a certain number of samples. If we discerned sounds in great
detail, we would have too many categories and too few samples in each
category. Thus, it was necessary to merge some categories.
3.2 Phase 1: Semi-manual Sorting Data and Ground Truth
There were more than 1,700 acoustic events, and it would have been
inefficient to categorize them manually. We exported the original data
from the CityGram database server and then looked the records up in
Microsoft Excel with the Fuzzy Lookup plugin.
According to the description of the sample and their numbers in
database, we chose the following nine classes:
Class:  1        2        3       4     5           6        7      8      9
Label:  walking  vehicle  engine  horn  background  machine  quiet  music  speech
Table 1 The training category
Basically, “walking” means the sound of a human walking. The difference
between “vehicle” and “engine” is that “engine” refers specifically to
the sound of engines and motors, while “vehicle” usually refers to the
passing sound of vehicles. “Machine” includes various types of
percussion. “Music” refers specifically to melodic sound, not including
percussion. The last category, “speech”, includes all kinds of human
voices, such as laughing or yelling. “Quiet” is a special class that
includes a large number of audio clips that are not marked; it is not
considered significant and can be treated as silence.
There are several keywords under each category, such as "walking",
"walk", and "footstep" under the category "walking". Fuzzy Lookup
checked each data record against the keywords in each category and gave
a resemblance score. Based on the resemblance scores, we performed one
more manual check, eliminating incorrect categorizations and duplicate
labels. During the manual check, we also added data without keywords. In
the end, there were around 1,200 acoustic events for the training of
supervised machine learning, amounting to about 2 hours of audio in
total.
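As an illustration of the fuzzy keyword matching step: the thesis used Excel's Fuzzy Lookup plugin, so this Python sketch with `difflib` is only an approximation, and the keyword lists below are partial, hypothetical examples.

```python
import difflib

# Hypothetical keyword lists; category names follow Table 1.
CATEGORY_KEYWORDS = {
    "walking": ["walking", "walk", "footstep"],
    "speech":  ["speech", "talking", "voice", "laughing", "yelling"],
    "vehicle": ["vehicle", "car passing", "bus"],
}

def match_category(description, cutoff=0.6):
    """Return (category, score) for the keyword closest to the description."""
    best = ("unmatched", 0.0)
    for category, keywords in CATEGORY_KEYWORDS.items():
        for kw in keywords:
            score = difflib.SequenceMatcher(None, description.lower(), kw).ratio()
            if score > best[1]:
                best = (category, score)
    # Low-scoring records go to a manual check, as in the thesis workflow.
    return best if best[1] >= cutoff else ("manual check", best[1])

print(match_category("footsteps"))   # high score against "footstep"
```

A description like "footsteps" lands in "walking", while an unrecognized string falls through to the manual check.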
This classification draws on previous experience while also taking into
account the limitations of the database itself [17: Justin Salamon,
Christopher Jacoby, Juan Pablo Bello., 2014]. The core idea is to
separate sound into four categories: human, nature, machinery, and
music. However, 50% of the labels in the database belong to the human
category, which means that a classifier that always guessed "human"
would already be close to a 50% recognition accuracy rate. In order to
balance the class weights, we subdivided several classes to give the
final categories. In the process of subdividing, we also considered some
spectral characteristics of the sounds, such as percussion in music.
Even though percussion has a unique rhythm, the percussion samples total
less than 3 minutes, and such a small sample size makes it difficult to
extract a sound concept. So we assigned percussion to the machine
category together with other beating and striking sounds. In short,
these categories are not based on a generic taxonomy, but are customized
according to the existing resources. The algorithm itself is generic and
does not consider any special factors of the sound samples.
3.3 Phase 2: Data Preprocessing
Because our artificial neural network runs in MATLAB, we also used
MATLAB to preprocess the original data. Data was loaded into MATLAB and
cut into pieces from the original audio based on the start and end times
in the SQL database.
3.3.1 Windowing
During extraction of acoustic events, deviations could arise from
different choices of timeline segregation. In fact, we sometimes found
it hard to distinguish a meaningful acoustic event from background
noise. Thus, when separating the acoustic events from the original audio
clips, we attenuated the two edges of each clip. First, the Root Mean
Square (RMS) of the audio clip was computed with a frame length of 1,000
samples. We used the frame with the median RMS as a temporary center of
the clip. Then, we applied a half Hanning window on each side of the
temporary center to weight the clip. As a result, the important
information in the audio clip was reinforced.
Figure 2 audio cutting
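A rough sketch of this edge-attenuation step, assuming the interpretation above (1,000-sample RMS frames, a half Hanning window rising to the median-RMS frame and falling after it); the actual MATLAB implementation may differ.

```python
import math

FRAME = 1000  # RMS frame length from the text

def frame_rms(x):
    """RMS of consecutive 1000-sample frames."""
    return [math.sqrt(sum(v * v for v in x[i:i + FRAME]) / len(x[i:i + FRAME]))
            for i in range(0, len(x), FRAME)]

def half_hann_weights(n, center):
    """Rise with a half Hanning window up to `center`, fall after it."""
    w = []
    for i in range(n):
        if i <= center:
            w.append(0.5 - 0.5 * math.cos(math.pi * i / max(center, 1)))
        else:
            w.append(0.5 - 0.5 * math.cos(math.pi * (n - 1 - i) / max(n - 1 - center, 1)))
    return w

x = [1.0] * 5000                      # dummy 5000-sample clip
rms = frame_rms(x)
med = sorted(rms)[len(rms) // 2]      # median frame RMS
center_frame = min(range(len(rms)), key=lambda i: abs(rms[i] - med))
center = center_frame * FRAME + FRAME // 2
w = half_hann_weights(len(x), center)
weighted = [xi * wi for xi, wi in zip(x, w)]
```

The weight is 1 at the temporary center and decays to 0 at both edges, so material near the center is reinforced relative to the clip boundaries.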
3.3.2 Segregation
Although the acoustic events in the CityGram Database come in different
lengths, the audio samples provided for training the machine learning
model must be of the same length, and thus we had to trim some of the
input data. In reality, people do not need to listen to an entire
acoustic event to discern its category. For example, under the category
of “voices”, a listener only needs to hear a word, instead of a whole
conversation or paragraph, to tell that someone is talking. Therefore, a
long human conversation can be treated as an aggregation of multiple
samples in the “voices” category. In the data preprocessing stage, we
tried 3 sample lengths: 1 second, 2 seconds, and 4 seconds. For the
1-second length, we separated an acoustic event into multiple
overlapping 1-second clips, and clips shorter than 1 second were zero
padded. After the training, we examined how samples of different lengths
affected the results.
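The 1-second cutting can be sketched as follows. The thesis does not state the hop size or sample rate, so the 50% overlap and 44.1 kHz rate here are assumptions for illustration only.

```python
SR = 44100          # assumed sample rate
CLIP = SR           # 1-second clips
HOP = SR // 2       # assumed 50% overlap

def cut_event(event):
    """Cut an acoustic event into overlapping fixed-length clips."""
    if len(event) < CLIP:                         # short events: zero-pad
        return [event + [0.0] * (CLIP - len(event))]
    clips, start = [], 0
    while start + CLIP <= len(event):
        clips.append(event[start:start + CLIP])
        start += HOP
    return clips

short = cut_event([0.1] * 1000)       # padded to one 44100-sample clip
long_ = cut_event([0.1] * (3 * SR))   # a 3-second event -> 5 overlapping clips
```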
On the other hand, the ground truth of each sample was decided by the
ratio of each category within the sample. For example, if half of the
sample is human voice, the ground truth for “speech” is 0.5. Therefore,
the ground truth of each sample is an array of 9 numbers between 0 and
1, each of which corresponds to a pre-defined category.
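A minimal sketch of building such a 9-element soft label; the category order is taken from Table 1, and the fraction values are hypothetical.

```python
# Category order from Table 1.
CATEGORIES = ["walking", "vehicle", "engine", "horn", "background",
              "machine", "quiet", "music", "speech"]

def soft_label(fractions):
    """fractions: dict mapping category -> fraction of the sample it covers."""
    return [round(fractions.get(c, 0.0), 3) for c in CATEGORIES]

# Half of the clip is speech, a quarter is passing-vehicle noise:
y = soft_label({"speech": 0.5, "vehicle": 0.25})
print(y)  # [0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
```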
Then, taking 1-second samples as an example, in order to capture how the
spectrum within a sample changes with time, we performed a second
segregation on the samples. Each sample was separated into 64
overlapping segments. Each of the 64 segments was then transformed with
the Fast Fourier Transform (FFT), yielding 64 × n × 2 raw values per
sample, in which n refers to the number of FFT points per segment and 2
refers to the real and imaginary parts. Each sample thus includes the
real and imaginary parts of the 64 segments after the FFT.
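The second segregation can be sketched as below. The segment length of 128 and the hand-rolled radix-2 FFT are assumptions for illustration; the thesis's MATLAB code presumably used the built-in fft.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def segment_features(sample, n_seg=64, seg_len=128):
    """Split a sample into 64 overlapping segments; FFT each one."""
    hop = max((len(sample) - seg_len) // (n_seg - 1), 1)
    feats = []
    for s in range(n_seg):
        seg = sample[s * hop:s * hop + seg_len]
        seg = seg + [0.0] * (seg_len - len(seg))      # pad a short tail
        spec = fft([complex(v) for v in seg])
        feats.append([(c.real, c.imag) for c in spec])
    return feats  # 64 x seg_len x 2 raw values per sample

feats = segment_features([0.0] * 4096)
```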
3.4 Phase 3: Principal Component Analysis (PCA)
If data with little or no preprocessing were used for machine learning,
the input dimensionality would be too large (i.e., n > 20,000). In order
to run our neural network on a personal computer, we needed to reduce
the dimensionality, so we employed Principal Component Analysis (PCA).
PCA uses fewer dimensions to describe higher-dimensional data, although
some information may be lost [18: Barnett, T. P., and R. Preisendorfer.
1987]. However, the PCA computation consumes a large amount of memory
and CPU. To lighten the computing burden, we ran PCA on each of the 64
segment positions separately, 64 times in total, instead of computing
all 64 segments together. For reasons of computing power, we retained
only the components covering 98% of the information. The number of
features per segment ranged between 35 and 90. Finally, they were
combined into a single dataset.
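The 98% criterion can be illustrated independently of a full PCA implementation: given the eigenvalues of one segment's covariance matrix (hypothetical values below), keep the smallest number of principal components whose cumulative variance reaches 98%.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest k such that the top-k eigenvalues cover `target` variance."""
    total = sum(eigenvalues)
    acc, k = 0.0, 0
    for v in sorted(eigenvalues, reverse=True):
        acc += v
        k += 1
        if acc / total >= target:
            return k
    return len(eigenvalues)

eigs = [50.0, 30.0, 15.0, 3.0, 1.5, 0.4, 0.1]  # made-up spectrum
print(components_for_variance(eigs))           # -> 4
```

Applied per segment, this is what makes the retained feature count vary (35 to 90 in the thesis) with each segment's variance spectrum.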
3.5 Phase 4: Training, Verification, and Testing
In machine learning, data samples are usually divided into three groups,
of which the largest is used for training and another for verification.
The method of using a group of data for verification is called cross
validation. Observing the results of cross validation helps us revise
the coefficients of the neural network, resulting in better training
results [19: Krogh, A., & Vedelsby, J. 1995]. However, to detect
over-fitting, we need the test group for the final examination of the
neural network. Taking 1-second samples as an example, there were around
5,600 samples in the end, among which 3,600 were used for training,
1,000 for verification, and the other 1,000 for testing. In the process
of grouping, we performed two randomized shuffles. First, all of the
original audio clips were randomly ordered and roughly assigned to the
three groups according to their durations; this ensures that the audio
sources of the three groups are completely separate. Second, after
cutting, the resulting sample set was randomly shuffled again for
preprocessing. A predictable disadvantage is that two samples may come
from the same audio clip. In addition, because the samples themselves
overlap during cutting, the quality of the samples is somewhat lowered.
Given the limited number of human labels, this is a compromise.
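The first, source-level shuffle can be sketched as follows; the group ratios approximate the 3,600/1,000/1,000 split, and the clip names and durations are made up. The second shuffle would simply reorder the cut samples within each group afterwards.

```python
import random

def split_sources(clips, ratios=(0.64, 0.18, 0.18), seed=0):
    """Assign whole clips to train/val/test so no source straddles groups."""
    rng = random.Random(seed)
    order = clips[:]
    rng.shuffle(order)                      # first shuffle: whole clips
    total = sum(dur for _, dur in order)
    groups, g, acc = ([], [], []), 0, 0.0
    for clip in order:
        groups[g].append(clip)
        acc += clip[1]
        # Move to the next group once its duration quota is filled.
        if g < 2 and acc >= total * sum(ratios[:g + 1]):
            g += 1
    return groups

clips = [(f"clip{i}", 10) for i in range(100)]   # 100 ten-second clips
train, val, test = split_sources(clips)
names = {n for grp in (train, val, test) for n, _ in grp}
```

Because assignment happens before cutting, overlapping samples from one clip can never end up in both the training and the test group.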
4 Neural Network Training
This experiment used a sparse auto-encoder neural network. When building
the artificial neural network, we first pre-train each layer using a
sparse auto-encoder; then the parameters are fine-tuned with the back
propagation algorithm.
4.1 Sparse Auto-encoder
The sparse auto-encoder is relatively well known in the field of deep
learning and is a method of unsupervised learning [20: Bengio, Y. 2009].
It is essentially a three-layer neural network: an input layer, an
output layer, and a hidden layer. Unlike a common shallow neural
network, the ground truth of its output layer is the input layer data
itself. Its significance lies in learning a hidden layer that can
reconstruct the input layer information well. It is laid out as shown in
the figure:
Figure 3 The auto-encoder layout [21Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline
Suen. 2013]
This is a simple auto-encoder; its cost function is:

J(W, b) = [1/(2m) ∑_{i=1}^{m} ‖h_{W,b}(x^{(i)}) − x^{(i)}‖²] + λ/(2m) ∑_{l=1}^{n−1} ∑_{i} ∑_{j} (W_{ij}^{(l)})²

where J is the cost, m is the number of samples, l indexes the layers (n
in total), and λ is the regularization parameter. When J is 0, the
network reproduces the original data with 100 percent accuracy. But this
situation is too extreme and prone to over-fitting. When a network
over-fits, it performs very well on the training data set but fails to
generalize to new samples [5: Andrew Ng, Jiquan Ngiam, Chuan Yu Foo,
Yifan Mai, Caroline Suen. 2013].
Adding sparse coding is a way to place constraints on the network so
that only the more important nodes are activated and most of the hidden
layer nodes remain in a non-active state. This achieves the purpose of
sparse coding. Therefore, the sparse auto-encoder cost function is:

J_sparse(W, b) = J(W, b) + β ∑_{j=1}^{s₂} KL(ρ ‖ ρ̂_j)

The new term is the Kullback-Leibler (KL) divergence, which is expressed
as follows:

KL(ρ ‖ ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 − ρ) log((1 − ρ) / (1 − ρ̂_j))

The average output of hidden layer node j is calculated as follows:

ρ̂_j = (1/m) ∑_{i=1}^{m} [a_j^{(2)}(x^{(i)})]
Here the parameter ρ is generally small, such as 0.05, which means that
activation is treated as a low-probability event: the probability of
each hidden layer node being activated approaches 0.05 [5: Andrew Ng,
Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, Caroline Suen. 2013].
An auto-encoder with this sparseness constraint usually performs better
[22: Marc'Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann
LeCun. 2006]. It enhances the salient features and weakens the others.
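The sparsity term above can be computed directly. Here is a small Python sketch with ρ = 0.05 as in the text; the β value and the toy activations are only example numbers.

```python
import math

def kl(rho, rho_hat):
    """KL divergence between target sparsity rho and average output rho_hat."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """hidden_activations: rows = samples, columns = hidden units."""
    m = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    penalty = 0.0
    for j in range(n_hidden):
        rho_hat = sum(a[j] for a in hidden_activations) / m  # average output
        penalty += kl(rho, rho_hat)
    return beta * penalty

acts = [[0.05, 0.9], [0.05, 0.8]]   # unit 0 is sparse, unit 1 is not
print(sparsity_penalty(acts))       # penalty driven entirely by unit 1
```

A unit whose average activation already equals ρ contributes nothing, while a frequently active unit is penalized heavily, which is exactly the pressure toward sparse codes described above.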
4.2 Back Propagation
With the methods above, we can build the multi-layer structure that we
want, although we still lack a principled way to choose the number of
layers and the number of neuronal nodes in each layer. The neural
network is now able to reconstruct the original data well, but it does
not yet know how to classify. Next, we add an output layer on top of the
network and train it with a standard method: back propagation. In this
thesis we use the gradient descent method for back propagation. This
process is like walking downhill with a fixed step length to find the
lowest point along the path [23: Rumelhart, David E.; Hinton, Geoffrey
E.; Williams, Ronald J. 1986]. It transforms the problem of finding the
lowest point into one of successive approximation. But we need to choose
an appropriate step size, so that the search neither overshoots the
target concave point nor explores too slowly. The cost function is:
J = [1/(2m) ∑_{i=1}^{m} ∑_{j=1}^{c} (h_{ij} − y_{ij})²] + λ/(2m) ∑_{l=1}^{n−1} ∑_{i} ∑_{j} (W_{ij}^{(l)})²
Where m is the number of samples, c is the number of categories
and n is the number of layers.
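The cost above can be evaluated directly. Here is a toy Python sketch; the outputs h, labels y, and weight values are made up for illustration.

```python
def cost(h, y, weights, lam):
    """Squared-error cost plus weight decay, as in the formula above."""
    m = len(h)                         # number of samples
    err = sum((h[i][j] - y[i][j]) ** 2
              for i in range(m) for j in range(len(h[0])))
    decay = sum(w ** 2 for layer in weights for row in layer for w in row)
    return err / (2 * m) + lam * decay / (2 * m)

h = [[0.9, 0.1], [0.2, 0.8]]           # network outputs, 2 samples x 2 classes
y = [[1.0, 0.0], [0.0, 1.0]]           # ground truth
W = [[[0.5, -0.5], [0.25, 0.0]]]       # one toy weight matrix
print(cost(h, y, W, lam=3.0))
```

Gradient descent then repeatedly nudges `W` (and the biases) in the direction that lowers this value.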
4.3 Conjugate Gradient Optimization
As mentioned above, network training is an iterative process. Iterating
naively in MATLAB, while viable, is not efficient. In this experiment,
we use conjugate gradient as our optimization algorithm. Conjugate
gradient optimization is a numerical method usually applied to large
sparse systems [24: Hestenes, Magnus R.; Stiefel, Eduard 1952]. It makes
the cost of the network decline faster than ordinary gradient descent.
For the sparse auto-encoder network, the performance of several
optimization algorithms is shown below:
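For illustration, here is a textbook conjugate gradient solver on a tiny symmetric positive definite system; this is not the thesis's optimizer, which minimized the full network cost, but it shows the direction-update idea that makes the method faster than plain gradient descent.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, iters=25):
    """Solve Ax = b for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]   # initial residual
    p = r[:]
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = dot(r, r) / dot(p, Ap)                 # exact line search step
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r_new, r_new) < 1e-20:
            break
        beta = dot(r_new, r_new) / dot(r, r)           # conjugate direction update
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)   # close to [1/11, 7/11]
```

Because each search direction is conjugate to the previous ones, the exact solution of an n-dimensional quadratic is reached in at most n steps, here two.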
5 Reliability Verification
It is well known that the urban soundscape is an inherently noisy
environment. On the street, hearing car horns and engine sounds together
is very common, and this kind of classification is difficult to get
completely correct, even for a human. If we used the existing
experimental data directly, we would lack an effective control group,
and it would be difficult to judge whether defects of the system come
from the noisy, diverse spectra or from the system itself. Since
training the neural network takes a very long time, we first had to
verify the reliability of the system. The following experiment is the
control group.
5.1 The Control Group Data Creation
The data preprocessing of the control group and the experimental group is exactly the same; the only difference is the data itself. The control group audio consists of several common instrument solos. A solo is characterized by little noise and no confounding acoustic events: only one instrument sounds at any particular time. Using instrument solos as a control group lets us test the feasibility and performance of the sound classifier. We selected a total of 34 solo recordings divided into four classes: 15 piano pieces, 11 violin pieces, 7 guitar pieces, and 20 minutes of mute background containing only device noise as the silent group. The solo samples come from different styles and periods, mostly classical sources, and had silence removed. Segmentation and preprocessing resulted in a total of 2800 samples, each 1 second long. After shuffling, 1800 of them were used for training, 500 for validation, and 500 for testing.
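The shuffle-and-split step described above can be sketched as follows (Python for illustration; the thesis used MATLAB, and the fixed seed is an added assumption for reproducibility):

```python
import random

def split_dataset(samples, n_train=1800, n_val=500, n_test=500, seed=0):
    """Shuffle 1-second samples and split into train/validation/test,
    mirroring the 1800/500/500 split used for the control group."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the original order is preserved
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

train, val, test = split_dataset(list(range(2800)))
```

Shuffling before splitting keeps pieces from the same recording from clustering in one partition.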
5.2 Structure of the Artificial Neural Network
In an earlier study, we used several six-layer or seven-layer neural networks. A smaller network allows more input layer nodes. Each test sample has 64 segments, but the results of the six-layer network were rather unsatisfactory. The training accuracy was only 50.1%, equivalent to the probability of a coin toss, and the accuracy on the test group was only a little better than the baseline (11.11%).
layer              input  2     3     4     5     output
number of nodes    5696   4096  4096  4096  4096  4
training accuracy  50.1%
testing accuracy   33.70%
Table 2 The architecture and results of the 6-layer network (1 second sample length, 64 segments per sample, lambda = 3, iteration: 50)
The results of the 7-layer network are not much better than those of the 6-layer network.
layer              input  2     3     4     5     6     output
number of nodes    5696   4096  4096  4096  4096  4096  4
training accuracy  72.2%
testing accuracy   51.27%
Table 3 The architecture and results of the 7-layer sparse auto-encoder (1 second sample length, 64 segments per sample, lambda = 3, iteration: 50)
After we moved to a 9-layer network, the results became acceptable. At the same time we also tried different sample lengths and different numbers of segments per sample. Because of limited computing capability and time, we did not explore every experimental combination. Results are shown below:
segments \ length  1s      1.5s    2s      4s
16                 51.16%  52.34%  51.34%  34.65%
32                 51.16%  56.63%  46.46%  42.50%
64                 72.57%  51.64%  52.48%  66.64%
128                N/A     70.98%  21.85%  N/A
256                N/A     N/A     61.48%  N/A
Table 4 The testing accuracy of different sample length and segment combinations (lambda = 3, iteration: 50)
Although the trend is not very pronounced, more segments and a shorter sample length obtained better results. This experiment did not give us much guidance, since different sample lengths lead to large changes in the experimental samples, and the number of samples is relatively small, which makes the randomness of the results relatively large.
We also tried different values of lambda. The relation between network cost and lambda is shown below:
Figure 5 lambda-cost curve (1 second sample length, 64 segments per sample, iteration: 50)
The green curve is the validation set, and the blue curve is the training set. As shown in the figure, the validation curve did not reach an inflection point. This means the network is still under-fitting: the number of nodes in the artificial neural network is not enough to summarize and describe the input data [26 Ranzato, Poultney, Chopra, and LeCun 2007].
In addition, we also tried different numbers of auto-encoder training iterations. Too few iterations may cause underfitting; conversely, too many may cause overfitting [27 Amari, Murata, Müller, Finke, and Yang 1997]. Both are bad situations for back propagation. We tried 10, 20, 50, 100 and 200 iterations, and we believe that 50 is a suitable number.
iterations         10      20      50      100     200
training accuracy  50.20%  86.87%  87.13%  87.25%  88.10%
testing accuracy   33.15%  62.11%  63.12%  62.84%  63.11%
Table 5 The effect of training iterations (1 second sample length, 64 segments per sample, lambda = 3)
5.3 The Control Group Result
samples    piano  violin  guitar  silence
piano      601    21      27      15
violin     21     425     34      6
guitar     30     42      304     8
silence    3      5       28      248
F-measure  0.918  0.862   0.811   0.895
Table 6 The control group training result (1 second sample length, 64 segments per sample, lambda = 3, iteration: 50)
Training parameters and the results are shown above. The average F-measure of the training group reached 87.13%. This result comes from the optimum combination of several parameters. Test results are in the following table:
samples    piano  violin  guitar  silence
piano      133    15      21      14
violin     10     73      15      8
guitar     15     6       77      6
silence    5      11      7       84
F-measure  0.816  0.695   0.642   0.750
Table 7 The control group testing result
The average F-measure of the testing group reached 72.57%. Compared to the training group, all F-measures are lower. Although this is a small database, and only from the testing results of the control group, we can see that less diversity leads to a higher classification accuracy. The piano sound is produced by a relatively fixed mechanical action, while the playing techniques of the violin and guitar are much more varied than those of the piano; the experimental results support this hypothesis. In addition, this sound classifier algorithm is shown to be feasible, though its accuracy needs to be improved. It is worth noting that, since the amount of audio is limited, different parameters change the number of samples, and because of the computer's memory limit, an excessive number of samples makes the computer run out of memory. In the next step of training we need to weigh the selection of parameters: more segments and a longer sample length mean fewer samples. Before the results came out, this choice was very difficult.
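The per-class metrics reported in Tables 6 and 7 can be recomputed directly from the confusion matrix. A Python sketch, assuming rows are true classes and columns are predictions (a common convention; the thesis does not state the orientation):

```python
import numpy as np

# Confusion matrix from Table 6 (rows: true class, columns: predicted class;
# the row/column convention is an assumption, not stated in the thesis).
C = np.array([
    [601,  21,  27,  15],   # piano
    [ 21, 425,  34,   6],   # violin
    [ 30,  42, 304,   8],   # guitar
    [  3,   5,  28, 248],   # silence
])

tp = np.diag(C).astype(float)
precision = tp / C.sum(axis=0)   # correct / everything predicted as this class
recall = tp / C.sum(axis=1)      # correct / everything truly in this class
f1 = 2 * precision * recall / (precision + recall)
```

Under this convention the piano precision is 601/655 ≈ 0.918, which coincides with the first "F-measure" entry in Table 6, so some of the tabulated values appear to be column-wise precision rates rather than true F-measures.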
6 Training and Result
Although machine learning takes a long time, we conducted a number of experiments, including cross validation, and tried a variety of parameters. Even though not all of the experiments were successful, they gave us a good reference direction.
6.1 Training
We constructed a 9-layer sparse auto-encoder network in MATLAB. The difference from the control group is that this ground truth overlaps: a sample can belong to multiple categories. In training we used the parameters that performed better in the control group. The network architecture used in training is shown below:
layer            input  2     3     4     5     6     7     8     output
number of nodes  3516   4096  4096  4096  4096  4096  2048  2048  9
Table 8 The architecture of the training network (1 second sample length, 64 segments per sample, lambda = 3)
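To see why memory becomes a bottleneck, one can count the weights of the Table 8 architecture (a rough sketch that ignores biases, activations, and the decoder weights used during autoencoder pretraining):

```python
layers = [3516, 4096, 4096, 4096, 4096, 4096, 2048, 2048, 9]  # Table 8

# One weight matrix connects each pair of consecutive layers.
n_weights = sum(a * b for a, b in zip(layers[:-1], layers[1:]))
megabytes = n_weights * 8 / 1e6   # stored as 64-bit floats
```

This comes to roughly 94 million weights, about 750 MB in double precision before counting gradients and activations, which is plausibly near the limit of an ordinary PC of the time.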
6.2 Result
The number of samples for final training is 5395, and for testing 1618. After 46 iterations, the training reached a local minimum. Results and the distribution matrix are shown below:
class       walking  vehicle  engine  horn    background  machine  quiet  music   speech
percentage  38.32%   16.44%   10.01%  1.61%   25.90%      6.37%    2.97%  14.03%  35.35%
accuracy    59.16%   49.44%   0%      43.33%  53.93%      48.82%   100%   52.55%  59.83%
Table 9 The results of the experimental group (1 second sample length, 64 segments per sample, lambda = 3)
The average classification accuracy is 53.02%. Surprisingly, the "quiet" group obtained a 100% accuracy rate. This could be due to the fact that the "quiet" recordings contain little more than the noise floor of the recording device, and thus have the least diversity among all groups. Another group showing clearly low classification accuracy is the "engine" group. From the distribution matrix, we can see that most of its samples were classified as "vehicle", "walking" and "background". A probable cause is that the spectral characteristics of the engine sound are relatively weak, while its temporal characteristics are more prominent. This study focuses on the spectrum and downplays temporal factors, especially those that change slowly over time or in the rhythmic temporal range. Perhaps this is the same reason why the "music" group and the "speech" group are difficult to distinguish through spectral characteristics.
Figure 6 Distribution matrix results of experimental group
There were also two control groups with fewer training samples. The number of training samples for control group 1 was 2600, and for control group 2, 1300. Results and the distribution matrices are shown below:
accuracy  walking  vehicle  engine  horn    background  machine  quiet  music   speech
Group 1   56.64%   44.72%   44.78%  43.10%  51.39%      45.95%   100%   52.71%  54.09%
Group 2   35.28%   32.03%   29.39%  22.41%  34.56%      14.40%   50%    34.21%  36.55%
Table 10 The results of the two control groups (1 second sample length, 64 segments per sample, lambda = 3)
The average accuracy of control group 1 is 52.95%, and the average accuracy of control group 2 is 33.88%. The group with more samples has higher accuracy.
There are also results for two control groups with smaller networks.
Figure 7 Distribution matrix results of smaller network 1
Figure 8 Distribution matrix results of smaller network 2
The number of nodes in each layer is shown below:
layer      input  2     3     4     5     6     7    8    output
Network 1  3516   2500  2000  1500  1000  800   500  200  9
Network 2  3516   3000  2500  2000  1500  1000  800  500  9
Table 11 The architecture of the two smaller networks 1 & 2 (1 second sample length, 64 segments per sample, lambda = 3)
The results are shown below:
accuracy  walking  vehicle  engine  horn   background  machine  quiet  music  speech  average
Group 1   26.75%   9.79%    6.77%   4.17%  17.76%      1.64%    4.05%  8.31%  22.17%  17.73%
Group 2   25.79%   11.2%    7.6%    0%     18.6%       2.99%    0%     9.54%  23.82%  18.28%
Table 12 The results of the two smaller networks 1 & 2 (1 second sample length, 64 segments per sample, lambda = 3)
The results of the two smaller networks are poor, only a little better than the baseline of 11.11%. However, such a simple comparison cannot prove the relationship between network size and results.
Figure 9 Accuracy over varying number of Mel filters.
Compared to Jacoby's previous results [28 Jacoby 2014], our result did not improve, and even got worse. However, the samples and methods of the two experiments are different. First, his categories are more specific; for example, one of his classes, children playing, is only one item within our class "speech". Second, the sample length in his final results is 2 seconds, while each of our samples has 64 segments, each less than 0.1 second long. From a structural point of view, each of his samples is independent, whereas each of ours is a sequence. Furthermore, we did not use MFCC methods, so our samples are more primitive than his. These three reasons make our samples more diverse in the frequency domain. Our advantage is that our network has better generalization; the disadvantage is that such an algorithm needs more training samples.
7 Conclusions and Future Work
7.1 Conclusions
In this thesis, we tried using the original audio spectrum directly for machine learning in the analysis of urban soundscapes. We also tried different audio pre-processing parameters, as well as different numbers of samples. The average classification accuracy reached 53.02%, and the best individual class accuracy reached 59.83% over a total of 9 classes. Some results may be consistent with the hypothesis: the network has a high-bias problem, so an increase in the number of samples can improve accuracy. But given the low accuracy of the results, we cannot prove that spectral diversity and sample count are fully decisive in urban soundscape acoustic event classification.
The deficiencies of this experiment will be a lesson for future study. Due to the characteristics of urban sound, samples often contain more than one type of sound. Using the original recordings in the machine learning process without source separation inevitably introduces overlapping factors. Unfortunately, this experiment could not obtain more reference samples, and training with a small number of samples exacerbates the issue of overlapping samples. Meanwhile, the sample categories are also open to question. A more detailed classification would reduce the spectral diversity within each class, but each category would then contain fewer samples. The classification scheme used here, chosen for various practical reasons, is perhaps not the best one, and at this scale we cannot verify which way is better. For these reasons, it is difficult to draw a clear conclusion about which factor has the greater impact on the results: the lack of samples, or their complexity.
The exercise was also a prototypical experiment with regard to artificial neural networks on personal computers. Our experiments showed the trend for different network sizes: larger networks can achieve better results. In the end, the experiment approached the limits of an ordinary PC in many ways, and it also offers some guidance on the framework of large-scale neural networks.
7.2 Future Work
Firstly, the number of samples is limited by the processing power of the computer; more memory would allow a larger network.
Secondly, we can improve the quality of the samples. Semantic translation and voice recognition are currently two main subjects in audio machine learning research. The original sample labels of the CityGram project are manual annotations, and their descriptions have no clear specification. Classifying directly from free-form linguistic descriptions of sound would require a crossing of linguistics and signal processing that is clearly not realistic at present, so the manually assigned acoustic event labels themselves contain too much diversity to classify accurately. Future work should include how to choose the right sound classes based on the samples; more detailed and accurate class definitions could greatly improve classification accuracy.
Additionally, the whole experiment was run on an ordinary personal computer using MATLAB. Although MATLAB's linear algebra operations are very powerful, building a large-scale network also requires a lot of resources. Even if the system is divided into several smaller parts, the training time of a sufficiently large network is unacceptably long. This experiment therefore shows that an ordinary personal computer is far from sufficient for high-precision real-time urban soundscape classification. With C++ and parallel computing, however, one could greatly increase the number of segments and the segment length, greatly reduce the time required for training, and expect higher recognition accuracy.
Bibliography
1. Park, T. H., Lee, J. H., You, J., Yoo, M. J., & Turner, J. (2014). Towards Soundscape Information Retrieval (SIR). In Proceedings of the International Computer Music Conference (ICMC).
2. Ganchev, T., Fakotakis, N., & Kokkinakis, G. (2005). Comparative evaluation of various MFCC implementations on the speaker verification task. In 10th International Conference on Speech and Computer (SPECOM 2005), Vol. 1, pp. 191–194.
3. Aucouturier, J. J., Defreville, B., & Pachet, F. (2007). The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. The Journal of the Acoustical Society of America, 122(2), 881–891.
4. Jacoby, C. B. (April 2014). Automatic Urban Sound Classification Using Feature Learning Techniques. Master of Music in Music Technology, Department of Music and Performing Arts Professions, Steinhardt School, New York University.
5. Retrieved from https://en.wikipedia.org/wiki/Soundscape
6. Raimbault, M., & Dubois, D. (2005). Urban soundscapes: Experiences and knowledge. Cities, 22(5), 339–350.
7. Park, T. H., Lee, J. H., You, J., Yoo, M. J., & Turner, J. (2014). Towards Soundscape Information Retrieval (SIR). In Proceedings of the International Computer Music Conference (ICMC).
8. Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296. doi:10.1023/A:1007662407062
9. Bottou, L., & Vapnik, V. (1992). Local learning algorithms. Neural Computation, 4(6), 888–900.
10. Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
11. Duygulu, P., Barnard, K., de Freitas, J. F., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Computer Vision—ECCV 2002 (pp. 97–112). Springer Berlin Heidelberg.
12. Gaskill, S. A., & Brown, A. M. (1990). The behavior of the acoustic distortion product, 2f1−f2, from the human ear and its relation to auditory sensitivity. The Journal of the Acoustical Society of America, 88(2), 821–839.
13. Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999, May). Boosting algorithms as gradient descent in function space. NIPS.
14. Jain, A. K., Mao, J., & Mohiuddin, K. M. (1996). Artificial neural networks: A tutorial. Computer, (3), 31–44.
15. Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
16. Park, T. H., Lee, J. H., You, J., Yoo, M. J., & Turner, J. (2014). Towards Soundscape Information Retrieval (SIR). In Proceedings of the International Computer Music Conference (ICMC).
17. Salamon, J., Jacoby, C., & Bello, J. P. (2014). A Dataset and Taxonomy for Urban Sound Research. Music and Audio Research Laboratory, New York University; Center for Urban Science and Progress, New York University.
18. Barnett, T. P., & Preisendorfer, R. (1987). Origins and levels of monthly and seasonal forecast skill for United States surface air temperatures determined by canonical correlation analysis. Monthly Weather Review, 115.
19. Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, 7, 231–238.
20. Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2. doi:10.1561/2200000006
21. Ng, A., Ngiam, J., Foo, C. Y., Mai, Y., & Suen, C. (March 2013). UFLDL Tutorial. Retrieved from http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
22. Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient Learning of Sparse Representations with an Energy-Based Model. In J. Platt et al. (Eds.), Advances in Neural Information Processing Systems (NIPS 2006). MIT Press.
23. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (8 October 1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. doi:10.1038/323533a0
24. Hestenes, M. R., & Stiefel, E. (December 1952). Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards, 49(6).
25. Le, Q. V., et al. (2011). On optimization methods for deep learning. In Proc. of ICML.
26. Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06.
27. Amari, S., Murata, N., Müller, K.-R., Finke, M., & Yang, H. H. (1997). Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks, 8(5), 985–996. doi:10.1109/72.623200
28. Jacoby, C. B. (April 2014). Automatic Urban Sound Classification Using Feature Learning Techniques. Master of Music in Music Technology, Department of Music and Performing Arts Professions, Steinhardt School, New York University.