Automatic Sea Turtle Nest Detection via Deep Learning.

Automatic Sea Turtle Nest
Detection via Deep Learning.
Ricardo S´anchez Castillo
Student ID: 4225015
School of Computer Science
University of Nottingham
Submitted September 2015, in partial fulﬁlment of
the conditions of the award of the degree M.Sc in Computer Science
I hereby declare that this dissertation is all my own work, except as indicated
in the text:
Signature:
Date: / /

Abstract
Out of the seven species of sea turtles in the world, six are classified as en-
dangered. Mexico is home to five of these species which are protected by the
government laws, however these animals are still subject to many threats, the
most crucial is the poaching and illegal commerce of their eggs. There are
programs related to the prevention of these illegal activities such as the surveil-
lance of the beaches during nesting season but monitoring the long extension of
beaches in the country is exhausting. In order to solve this problem, the current
project proposes the use of drones to automatically detect sea turtle nests along
the beach making use of machine learning and computer vision algorithms.
However, sea turtle nests detection is a complex task due to the similarity
between a nest and its surrounding area, therefore this project has explored the
subject using the deep learning approach, particularly making use of Convolu-
tional Neural Network architectures which have shown a success in challenges
related to image classification and localisation and outperforms classical ma-
chine learning methods. In order to find the best architecture for this task,
different architectures were trained and tested in similar conditions to classify
an image as either nest or not a nest, then the best architecture was used for
detecting nests using frames extracted from a video obtained using a drone. Fi-
nally, a tracking algorithm is proposed for detecting nests within a video stream
in order to obtain a complete system for a real world application in a near
future.
Results show an encouraging performance in classification and recognition
which is also non-dependent of the current task such that by performing the
corresponding training, this algorithm could also be used for detecting another
kind of objects. There is still research to be done but probably the first steps
have been taken for this important task which can be useful for the conservation
of this sea species.
Keywords— sea turtle, nest, deep learning, convolutional neural network, ar-
tificial intelligence

Acknowledgements
First, I would like to offer my special thanks to Professor Tony Pridmore,
my supervisor, for all his support and constructive suggestions during the de-
velopment of this project. Without his immense knowledge, this project would
not have been possible.
I would like to thank to Consejo Nacional de Ciencia y Tecnolog´ıa (CONA-
CyT) and Consejo Quintanarroense de Ciencia y Tecnolog´ıa (COQCyT) for the
scholarship received within the framework stated by the CONACyT-Quintana
Roo State Government; Scholarship for the Training of High Level Human
Resources in Excellent Foreign Graduated Program with CVU identification
455628.
I would also like to express my gratitude to my parents for all their support
and advices since the beginning of this dream. You have always been the guid-
ance of my life and all this is for you. My grateful thanks to Israel Sánchez
Castillo, we can always count on each other.
Finally, my special thanks are extended to Deny Clementina Arista Torres for
all her valuable support and enthusiastic encouragement to pursue this dream.

Contents
1 Introduction 5
2 Literature Review 5
2.1 Object classification . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Softmax classifier . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Object recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Slide Window . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Region proposals . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Object tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Obtaining features to track . . . . . . . . . . . . . . . . . 23
2.3.2 Lucas-Kanade algorithm . . . . . . . . . . . . . . . . . . . 24
3 Data gathering 24
4 Design and Implementation 26
4.1.1 Analysis of the AlexNet architecture . . . . . . . . . . . . 26
4.1.2 Modification of the architecture . . . . . . . . . . . . . . . 28
4.1.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.4 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Object tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Results 37
5.1.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1.2 AlexNet first modification . . . . . . . . . . . . . . . . . . 38
5.1.3 AlexNet second modification . . . . . . . . . . . . . . . . 38
5.1.4 AlexNet third modification . . . . . . . . . . . . . . . . . 40
5.1.5 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1.6 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Object tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Conclusions 47
7 Further Research 48
7.1 Implementation over a GPU . . . . . . . . . . . . . . . . . . . . . 48
7.2 Cross validation on VGG architecture . . . . . . . . . . . . . . . 48
7.3 Region proposals . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.4 Scale of regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.5 The probability model . . . . . . . . . . . . . . . . . . . . . . . . 49
References 49
2

List of Figures
1 Pipeline for a pattern recognition task . . . . . . . . . . . . . . . 6
2 Orangutan nest, image obtained from [7] . . . . . . . . . . . . . . 7
3 Image obtained containing a sea turtle nest. . . . . . . . . . . . 8
4 Propagation of gradients for the Softmax loss function . . . . . . 10
5 Representation of an Artifical Neural Network and its analogy to
the human nervous system . . . . . . . . . . . . . . . . . . . . . 11
6 Sigmoid activation function . . . . . . . . . . . . . . . . . . . . . 11
7 Tanh activation function . . . . . . . . . . . . . . . . . . . . . . 12
8 Neural Network with three inputs, two hidden layers and two
outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
9 A region of neurons connected to a single neuron in the next layer. 14
10 The region is moved to the right. The final layer will have a size
of 24x24 neurons from an input layer of 28x28 and a kernel size
of 5x5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
11 3 feature maps calculated with a stride value of 4. . . . . . . . . 16
12 An example of a Convolutional Neural Network with an output
of two classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
13 Inception module used in the GoogLeNet architecture . . . . . . 18
14 Pyramid representation, original image is halved the number of
layers in the pyramd. . . . . . . . . . . . . . . . . . . . . . . . . 25
15 User Interface used to crop images . . . . . . . . . . . . . . . . . 25
16 Image obtained from cropping . . . . . . . . . . . . . . . . . . . 25
17 Input image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
18 Filters from the second layer . . . . . . . . . . . . . . . . . . . . 27
19 Output from the second layer . . . . . . . . . . . . . . . . . . . . 27
20 Filters for the first layer . . . . . . . . . . . . . . . . . . . . . . . 28
21 Output from the first layer . . . . . . . . . . . . . . . . . . . . . 28
22 Filters from the second layer . . . . . . . . . . . . . . . . . . . . 29
23 Output from the second layer . . . . . . . . . . . . . . . . . . . . 29
24 Output from the third layer . . . . . . . . . . . . . . . . . . . . . 30
25 Output from the fourth layer . . . . . . . . . . . . . . . . . . . . 30
26 Output from the fifth layer . . . . . . . . . . . . . . . . . . . . . 31
27 Output from the sixth layer . . . . . . . . . . . . . . . . . . . . . 31
28 Output from the seven layer . . . . . . . . . . . . . . . . . . . . . 32
29 Final output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
30 Sequence diagram for the Object Recognition task . . . . . . . . 35
31 Tracking description for a video frame, yellow arrows represents
the optic flow calculated for each region. . . . . . . . . . . . . . 36
32 Loss value for training from scratch and fine-tuning for the AlexNet
architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
33 Loss for the first modification of the AlexNet architecture . . . . 39
34 Loss for the second modification of the AlexNet architecture . . . 39
35 Loss plot for the third modification of the AlexNet architecture . 40
36 Loss function for the GoogLeNet architecture using fine-tuning
and training from scratch . . . . . . . . . . . . . . . . . . . . . . 41
37 Loss function for the VGG architecture using fine-tuning . . . . 42
38 Most of the images in the dataset correspond to this nest. . . . 44
39 Second most common nest in the dataset . . . . . . . . . . . . . 44
3

40 A possible nest found . . . . . . . . . . . . . . . . . . . . . . . . 46
41 The region does not coincide with the nest, consequently the
probability did not increase. . . . . . . . . . . . . . . . . . . . . . 46
42 Nest detected by processing the video reversed . . . . . . . . . . 47
43 Region remains at the same size when the nest reduced its size
due to the perspective. . . . . . . . . . . . . . . . . . . . . . . . . 48
List of Tables
1 Layers in the GoogLeNet architecture . . . . . . . . . . . . . . . 18
2 Description of the VGG architecture . . . . . . . . . . . . . . . . 19
3 Confusion matrix for the AlexNet architecture using fine-tuning . 38
4 Confusion matrix for the AlexNet architecture trained from scratch 38
5 Confusion matrix for the first modification of the AlexNet archi-
tecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6 Confusion matrix for the second modification of the AlexNet ar-
chitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Confusion matrix for the third modification of the AlexNet ar-
chitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
8 Confusion matrix for the GoogLeNet architecture using fine-tuning 41
9 Confusion matrix for the GoogLeNet architecture when training
from scratch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
10 Confusion matrix for the VGG architecture using the fine-tuning
method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
11 Results obtained for the object recognition task. . . . . . . . . . 43
4

1 Introduction
Every year hundreds of sea turtles arrive to the beach to deposit their eggs.
However, these animals are target of many predators including the human being
and its illegal activities which has caused to put these animals in a risk of
extinction.
Specifically in Mexico, a program was created for turtle conservation which
consists in walking every night along the beach for about 4 hours during all
the season, it is an exhausting activity for any volunteer. Even more, it is
impossible to cover the whole length of the beach causing that several nests are
not detected and remain vulnerable until the hatching season.
On the other hand, given the increase of popularity in the use of drones, there
are some emerging organizations making use of them for conservation purposes.
For instance, ”Conservation Drones” 1
is a non-profit organization with several
projects involving the use of drones in environmental tasks, one of them is the
orangutan nest spotting with conservation drones.
With this in mind, the current project aims to detect turtle nests within a
video stream taken from a drone perspective with the possibility to embed the
developed algorithm into an actual drone in a near future. However, turtle nests
are harder to detect than orangutan nests due to the sand which has the same
colour and texture as the nests, so that it is difficult to define the features that a
pattern recognition algorithm needs in order to perform the classification task.
Therefore, this project will explore the deep learning approach for the sake of
replacing those hand-crafted features into automatically detected features.
This leads to the following research questions: Is the deep learning a good
approach to deal with turtle nest detection? Which is the most suitable deep
architecture to perform this task?
The purpose of this document is to describe and analyse the approach per-
formed to detect turtle nest within a video using deep learning. Section 2 ex-
plains the main concepts related to this project, Section 4 describes the design
of the software and how was implemented and Section 5 compares the results
obtained.
2 Literature Review
In order to detect turtle nests from a video stream the following pipeline is fol-
lowed: First, it is necessary to be able to label a single image within two possible
classes: nest or not a nest. Second, small regions of a single video frame will
be labelled in order to detect all possible nests in a frame. Finally, information
about the nests related to a frame has to be shared with the subsequent frames
such that a nest can be tracked along the video stream. Therefore, the project
has been divided into the following tasks:
Object classification Given an input image, the problem consists to catego-
rize it into any of two classes: Nest or not a nest. Even more, the task
will calculate the likelihood of the image to be a nest.
Object recognition Once an input image can be classified, the next step is
1http://conservationdrones.org/
5

to identify all nests contained in a bigger image, i.e. recognise all nests
contained in a video frame.
Object tracking Finally, all detected nests will be tracked along the video
sequence in order to save processing power and improve the accuracy.
2.1 Object classification
Object classification is the procedure to assign an input vector to one of a
discrete set of finite categories (also known as classes) based on the knowledge
obtained from a previous training. These algorithms consist on two main stages:
Training and classification.
During the training process, the algorithm explores the initial data looking
for a pattern included in all images of the same class. On the other hand, on the
classification stage, the algorithm will search for the learned patterns on regions
believed to be part of one of the categories.
These algorithms require a previously labelled dataset which consist in a
set of examples belong to each class. According to [20] there are several learn-
ing scenarios concerned to machine learning, these scenarios are related to the
training data available for the learner, however the supervised learning scenario
is the most used for classification problems which it consists in provide the al-
gorithm with a set of input examples and its desired output vector, opposite to
the unsupervised learning where only the input examples are provided and the
algorithm has to infer the most efficient output.
As stated by [16] it is common to build the algorithm based on the process
shown in Figure 1. Basically, the process starts by extracting a set of features
from the image, for instance features based on edges, corners or even more
complex features as SIFT [17]. The output feature vector of this module is the
input for the pattern recognition algorithm as K-nearest neighbour [19], SVM
[31] or Neural Networks.
Feature Extraction Module Trainable Classifier Module
Input
Image
Class
Scores
Feature
Vector
Figure 1: Pipeline for a pattern recognition task
For instance, van Gemert et.al. in [30] compared three pattern recognition
algorithms searching for the best approach to detect animals using a drone.
Additionally, Chen et. al. in [6] improved the efficiency by using an active
detection approach where human supervision is incorporated during detection.
The main purpose of their project was to detect orangutan nests using drones
which produced images as the shown in Figure 2, the set of features used were a
variation of the HOG and SIFT features which are based on the gradients of the
image. On the other hand, Figure 3 shows one frame containing a nest obtained
from the video used for the purpose of this project, the colour and textures are
similar to the surrounding area so that it is a difficult task to define a set of
features to be used even though the human vision is capable to identify them
6

without any apparent effort. Therefore, in order to avoid the need to define
a set of features to be used, the deep learning approach aims to replace the
human-defined features by automatically detected features.
Figure 2: Orangutan nest, image obtained from [7]
The following subsections describe the Softmax classifier and Artificial Neu-
ral Networks, two common classifiers which require a human-defined set of fea-
tures. However, it will state the basis for describing the main approach for this
project, the deep learning which is presented at the end of the section.
2.1.1 Softmax classifier
The softmax classifier is a supervised learning algorithm similar to the Support
Vector Machine algorithm which differs in the output obtained from each. While
SVM produces a score for each class which can be contained in any range,
the Softmax classifier will produce an output vector where all values sum to 1
obtained from the Softmax function, this vector can also be interpreted as a
distribution of probability along the space of classes.
The first step is to map from an input vector x to a set of class scores K,
therefore the linear classifier is used which is defined as
f(xi; W, b) = Wxi + b (1)
Where W is the matrix of weights and b is the bias of the function. The
training process will search for the optimal values of W and b. For instance,
given an input image containing 4 pixels with values x = [9, 2, 8, 4], with K = 5
number of classes and a trained W matrix with bias b, the obtained vector of
class scores is
7

Figure 3: Image obtained containing a sea turtle nest.
W





0.01 0.08 0.01 0.09
0.007 0.02 −0.1 0.07
0.01 0.02 0.07 0.1
0.1 0.01 0.08 −0.07
0.01 0.05 −0.09 0.04






×
x



9
2
8
4




+
b





0.02
0.03
0.09
0.1
0.05






=
Scores





0.71
−0.387
1.18
1.38
−0.32






(2)
Moreover, in order to obtain an output with values between 0 and 1 from
the previous vector, the Softmax function is used which is defined in [4, p.198]
as
σ(xi) =
ef(xi;W,b)
K
j=1 ef(xj ;W,b)
(3)
For instance, the values obtained from scores in equation 2 are
Scores





0.71
−0.387
1.18
1.38
−0.32






=
ef(xi;W,b)






2.033
0.679
3.254
3.974
0.726






=
σ(xi)





0.19
0.063
0.305
0.372
0.068






(4)
Moreover, it is desired to measure a level of pleasantness with the results
obtained, for instance, assuming from the previous example that the correct
class would be K = 3 and given that the class with the highest probability is
K = 4 with a value of 0.372, it is necessary to find a function which measures
the output obtained and compares it with the correct class; this measure is
called the loss function, therefore a lower loss means that the classification
8

was correctly performed. The loss function for the Softmax classifier is defined
as
Li = −log
efyi
j efj
(5)
Where fyi
is the output of the correct class. Additionally, the full loss is
defined as the mean of the loss from each training data with a regularization
loss component as
L =
1
N i
Li + λR(W) (6)
Where λ is an hyper parameter used with the regularization function R(W)
to penalize large values in the parameters in order to reduce the overfitting
phenomenon.
Therefore, the extended loss for a training dataset in a Softmax classifier is
L = −
1
N


N
i=1
K
j=1
1{yi
= j}log
efxj
K
l=1 efxl

 + λR(W) (7)
The loss function is used during the training process to update the weights
matrices and biases in the classifier. Hence, the objective during training will
be to reduce the loss function for each input training data, this is achieved by
using the gradient of the loss function which it allows to find the direction that
would improve the set of weights.
Moreover, the loss of the Softmax function can be decomposed as a set of
different functions. For instance, given the Softmax function with two inputs
x1 and x2 and weights wi, from equation 5 the loss function is represented as
f(x) = −log
ew1x1
ew1x1 + ew2x2
(8)
Additionally, the function can be divided into a set of different functions as
p = wx →
δp
δx
= w;
δp
δw
= x (9)
q = ep
→
δq
δp
= ep
(10)
r = q1 + q2 →
δr
δq1
= 1;
δr
δq2
= 1 (11)
s =
q1
r
→
δs
δq1
=
r − q1
r2
;
δs
δr
= −
q1
r2
(12)
t = −log(s) →
δt
δs
= −
1
s ∗ ln(10)
(13)
Next, by using the chain rule defined as
dz
dx
=
dz
dy
∗
dy
dx
(14)
9

The gradients are calculated by propagating the gradient backwards, starting
by the function t until obtain the desired gradient
δt
δw
, this propagation can also
be represented in a diagram as the shown in Figure 4.
0.01
9.0
0.08
2.0
0.09
0.16
1.09
1.17
1.09
0.48
2.26
0.31
1
Figure 4: Propagation of gradients for the Softmax loss function
Once the gradient is obtained, it can be used to adjust the weights according
to the loss using
wτ+1
= wτ
− η wL (15)
Where η > 0 is a parameter known as learning rate. This method of pro-
cessing a vector forward and updating the weights by processing the gradients
backwards is known as error backpropagation or backpropagation and it gives
the basis for the training of more complex architectures as Artificial Neural
Networks where each node will be a neuron activated by a defined function.
2.1.2 Neural Networks
According to Chong Wank in [33], Artificial Neural Networks are inspired in the
human nervous system. A biological neuron receives an input from its dendrites
and produces an output using its axons, these neurons are connected with others
through the synapses. Similarly, an artificial neuron receives an input vector
which is applied to a function called the activation function and its output is
connected to the input of another neuron. A representation of an artificial
neuron is shown in Figure 5.
Moreover, an artificial neuron is a unit which receives an input vector and
produces an output, the input vector is affected by a set of weights and biases.
Therefore, the output of an artificial neuron is
o = σ(
N
j=1
Wijxj + bi) (16)
Where Wij is the matrix of weights, bi are the biases of the function and σ
is called the activation function.
Usually this activation function is a sigmoidal function [4, p.227] such as
Sigmoid Defined as
σ(x) =
1
1 + e−x
(17)
The sigmoid function is plotted in figure 6
10

Activation
function
Axon
Dendrite
Figure 5: Representation of an Artiﬁcal Neural Network and its analogy to the
human nervous system
−6 −4 −2 0 2 4 6
0
0.2
0.4
0.6
0.8
1
x
σ(x)=
1
1+e−x
Figure 6: Sigmoid activation function
11

Tanh It is a zero-centered activation function which outputs are in the values
of [-1,1]. It is plotted in figure 7
−6 −4 −2 0 2 4 6
−1
−0.5
0
0.5
1
x
σ(x)=tanh(x)
Figure 7: Tanh activation function
In order to create a complete Neural Network based on the previously de-
scribed neurons, they are interconnected such that a non-infinite loop is formed,
i.e. a set of layers is created to define the architecture where the middle lay-
ers are named as hidden layers. For instance, an architecture with two hidden
layers is shown in Figure 8. From this image, it can be seen that all neurons
from one layer are connected to all neurons of its consecutive layer, this kind
of layers are known as a Fully Connected Layers and they usually conform the
trainable classifier module of a deep learning architecture as the Convolutional
Neural Networks.
Input
Layer
Hidden
Layer 1
Hidden
Layer 2
Output
Layer
Figure 8: Neural Network with three inputs, two hidden layers and two outputs.
12

2.1.3 Deep Learning
The previously described classifiers require an input vector which is produced
from a feature extraction module, in order to produce a set of class scores based
on a previous training. However, according to Bengio in [3], the crucial task
in most of the machine learning algorithms is the selection of features to use,
such that much of the effort stays on the search for a good data representation,
for instance the search for features to describe a nest contained in Figure 2.
However this task can become more complicated depending on the data desired
to be classified as the nests obtained from figure 3.
Even more, as stated by Arel et al [1], the success of a pattern recognition
algorithm relies in the selected features whose process of extraction can be
complicated and the algorithm becomes problem-dependent.
Therefore, based on neuroscience findings into how the mammalian brain
represents and learn new information, it motivates to a new branch on machine
learning [1] called deep learning which is also known as feature learning
or learning representation [3]. The aim is to represent the data based on
automatically detected features such that human involvement is barely necessary
[2].
As stated by Bengio in [3], this field has become more important into the ma-
chine learning community with regular workshops and conferences, for instance
the 3rd International Conference on Learning Representation (ICLR
2016) 2
and the ICML 2013 Challenges in Representation Learning 3
.
All this activity has lead to a vast number of applications in several tasks as
speech recognition, signal processing, object recognition and natural language
processing [3].
To summarize, a deep architecture is composed of many levels as a neural
network does with hidden layers [2]. Each of these levels represent a degree of
abstraction which can be organized in several ways.
Moreover, from the analysis performed by Juergen Schmidhuber in [22] it
can be seen that there exist several proposed architectures. However, according
to [1] the most recognized approach is the Convolutional Neural Network.
Convolutional Neural Networks From Hubel and Wiesel in [12] it is known
that the cat’s visual cortex is composed by simple and complex cells whose
responses will vary according to the visual inputs as orientation in edges or
slits, this set of cells were the inspiration for the deep architectures designed
for two-dimensional shapes as images or videos with successful applications in
hand-writing recognition [15] or image classification challenges [14][36][25][10].
Therefore, a Convolutional Neural Network is composed by a set of layers
just as a neural network does but this layers are arranged in 3 dimensions,
Convolutional Networks are designed for images such that each dimension cor-
responds to the width, height and number of channels of the image. However,
the final layer will be reduced to a single vector which will be processed through
a series of Fully Connected Layers, producing a vector which contains the class
scores, then this vector is normalized using a Softmax function in order to obtain
the probability for each class.
2http://www.iclr.cc/doku.php
3http://deeplearning.net/icml2013-workshop-competition/
13

The components in a Convolutional Neural Network depend on the architec-
ture and design, however, the most common kind of layers are: Convolutional,
Subsampling and the Fully Connected Layer. Even though it is not manda-
tory, a convolutional layer is commonly followed by a subsampling layer, even
more, the combination of convolutional and subsampling layers detects the im-
age properties which they are trained to search for.
On the other hand, as stated by [15], Convolutional Neural Networks are
based in 3 main ideas: Local receptive fields, shared weights and spatial sub-
sampling which are now described.
Local receptive fields Each node in a layer can be seen as a neuron.
However, contrary to the Fully Connected Layers, in this layers the neurons
are not completely connected to all neurons in the next layer. Even more, the
neurons are divided in small regions connected to a single neuron in the next
layer. For instance, Figure 9 shows an input neurons where a region is connected
to a single neuron in the next layer.
Figure 9: A region of neurons connected to a single neuron in the next layer.
Then, as shown in figure 10 the local receptive fields is moved a certain value
of neurons defined as stride in order to connect to the next neuron. Even more,
the size of the region is known as the filter size.
The inputs in each neuron is called the receptive fields, as this layer receives
as input only layers from a certain region from the image, it is called the local
receptive fields. Actually this process is a convolution operation between an
image and a kernel so this layer is called a Convolutional layer.
Shared weights and biases The output of the neuron at position xj,l
given an input layer Y is given by
xj,l = σ(b +
k
r=0
k
s=0
Wr,sYj+r,l+s) (18)
Where k is the size of the kernel, W is the matrix of weights with size k, b
is the bias value for the current kernel and σ is the neural activation function.
14

Figure 10: The region is moved to the right. The final layer will have a size of
24x24 neurons from an input layer of 28x28 and a kernel size of 5x5.
From equation 18, it can be seen that during the processing of a single layer
only one matrix W and bias b are used, such that the process is looking for
the same feature along the whole image in order to detect the feature at any
position. In other words, ConvNets are robust to translations and distortions
of the object on the input image [15].
The output of a layer produced by the activation functions is known as a
feature map. In order to perform a classification, a Convolutional Neural
Network must contain more than one feature map in the same layer such that a
single Convolutional layer detects more than one feature using a set of different
kernels and biases, producing as many feature maps as the number of kernels.
For instance, given a set of input neurons as shown in figure 11, 3 feature maps
are obtained from 3 different kernels with size k = 11.
Spatial sub-sampling Any of the previously detected features can be
located at any point in the image, actually the exact position of this detected
features is not relevant. However, the relative position to other features is
relevant. As stated in [15], by reducing the spatial resolution will decrease the
effect caused by the exact position of the feature. This type of layer is called
Pooling layer, the objective is to reduce the size of the feature map in order to
minimize the susceptibility to shifts and distortions.
There are different types of pooling layers, one of them is the Max pooling
layer used by [14] which obtains the maximum value of its local receptive fields.
On the other hand, LeCun et.al. in [15] used an average pooling layer by obtain-
ing the average of its local receptive fields, multiply it by a trainable weight and
add a trainable bias. Moreover, Graham in [11] formulated a Fractional Max-
Pooling Layer and [26] replaced the Max-pooling operation by a convolutional
layer with increased stride.
The kind of pooling layer to use will depends on the architecture design,
however, the most common used is the Max Pooling layer.
15

Figure 11: 3 feature maps calculated with a stride value of 4.
Fully Connected Layer Finally, the last layer is a Fully Connected Layer
just as the used by a common Neural Network, all receptive fields are connected
to every neuron in this layer which are multiplied by a trainable matrix of
weights and the trainable bias is added. In order to obtain a final probability
for each class, the softmax function is used as described in equation 3. A final
representation of a simple Convolutional Neural Network using the described
layers is shown in figure 12.
Additionally, as stated by Yosinski et.al. [34], independently to the clas-
sification task, during the training of Convolutional Neural Networks the first
layers tend to learn basic features as Gabor filters or colour blobs. On the other
hand, the last layers contains more abstract features which depend from the
dataset and the classification task. However, the first layers can be used for
any other task, even more, the learning from a deep architecture can be used to
initialize the learning for another architecture in the first layers and increasing
the learning rate on the last layers to complete the training. This technique is
known as transfer learning, specifically when all the layers from the previ-
ously trained architecture are pre-initialized but the last Fully Connected Layer
is trained from scratch based on the new dataset is known as fine-tunning, it
is useful when the current dataset is not large enough to train the weights of
the desired architecture starting from random values.
An example of a large dataset used is the ImageNet which is used for the
Large Scale Visual Recognition Challenge (ILSVRC) whose architectures are
trained over 1.2 million images with 1000 categories [21]. Some of the architec-
tures submitted for this challenge include:
16

Input Image
28x28x3
Convolution layer
24x24x3
Filter: 3x5x5@1
Max Pooling Layer
12x12x3
Fully Connected Layer
432
Softmax Function
P(C1)
P(C2)
Figure 12: An example of a Convolutional Neural Network with an output of
two classes.
AlexNet It was submitted for the ILSVRC-2012 and developed by Alex
Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton [14]. The architecture achieved
a top-1 of 37.5% and top-5 of 17% better than the second place who achieved a
26%. The AlexNet architecture has 60 million parameters and 650,000 neurons.
It contains eight layers whose first five are convolutional layers and the rest are
Fully Connected Layers. In order to model the output of a neuron, AlexNet
proposed an activation function called Rectified Linear Units (ReLU) which it
has the form f(x) = max(0, x)
Moreover, the input image has size of 224x224x3 which is filtered by the
first convolutional layer containing 96 kernels of size 11x11x3 and it uses a
stride of 4 pixels followed by a Max pooling layer and the ReLU normalization
layer described above. On the other hand, the second convolutional layer is
composed by 256 kernels with size 5x5x48. Even more, the third, fourth and
fifth layer have 384, 384 and 256 respectively number of kernels with size 3x3.
Finally, the sixth layer which corresponds to a Fully Connected Layer has
4096 neurons as well as the seventh layer. The output of the seventh layer is
connected to the last fully connected layer which has 1000 neurons corresponding
to the number of classes in the dataset.
GoogLeNet The GoogLeNet was created by Szegedy et al[27] and was
submitted for the ILSVRC-2014. This architecture reduces the size of parame-
ters by a factor of 12 comparing to the AlexNet architecture using an Inception
Module which it is a network composed by convolutional layers with sizes 1x1,
3x3 and 5x5 and one max pooling layer, stacked upon each other as described
in Figure 13.
The architecture consists of 27 layers, but 84 independently building blocks
as described in Table 1.
VGGNet The VGGNet architecture proposed by Karen Simonyan and
Andrew Zisserman [24] explores the effects of depth in the final accuracy by
using a set of small convolutional layers. This architecture was submitted for
the ILSVRC-2014 obtaining the first place in the localization task and second
for classification.
17

Previous
Layer
Convolution
1x1
Convolution
1x1
MaxPool
3x3
Convolution
3x3
Convolution
5x5
Convolution
1x1
Convolution
1x1
Filter
Concatenation
Figure 13: Inception module used in the GoogLeNet architecture
GoogLeNet architecture
Layer type Size@Outputs
Convolution 7x7@64
Max Pool 3x3
Convolution 3x3@192
Max Pool 3x3@192
Inception @256
Inception @480
Max Pool 3x3@480
Inception @512
Inception @512
Inception @512
Inception @528
Inception @832
Max Pool 3x3@832
Inception @832
Inception @1024
Avg Pool 7x7@1024
Dropout (40%) @1024
FCL @1000
Softmax @1000
Table 1: Layers in the GoogLeNet architecture
18

VGG Architecture
16 layers 19 layers
Layer type Size@Outputs Layer type Size@Outputs
Convolution 3x3@64 Convolution 3x3@64
MaxPool @64 MaxPool @64
MaxPool @128 MaxPool @128
MaxPool @256 Convolution 3x3@256
Convolution 3x3@512 MaxPool @256
Convolution 3x3@512 MaxPool @512
FCL @4096 Convolution 3x3@512
FCL @4096 Convolution 3x3@512
FCL @1000 MaxPool @512
Softmax @1000 FCL @4096
FCL @4096
FCL @1000
Softmax @1000
Table 2: Description of the VGG architecture
Even though they tested over six different configurations, the best two con-
figurations are described in Table 2.
2.2 Object recognition
While the object classification task is centred in labelling an input region from a
set of different classes, the object recognition task aims to detect all the elements
contained in a bigger space, for instance, from an input frame, the algorithm
identifies all nests contained in the image. The pipeline for this task will be to
start generating a set of region proposals and then identify each region using
the classification task. This section describes the algorithms used for generating
the region proposals starting by the brute force and the algorithm used by [10]
which it is also used with convolutional Neural Networks for the classification
task.
19

2.2.1 Slide Window
A natural approach to search over the whole image space is iterate over each
row and column in the image. For instance, create a region with a fixed size
and set its left-top position at (0,0) in the image, then move the window over
the current row by changing the position of the column and repeat for every
row in the image, this process is described in algorithm 1.
Algorithm 1 Slide Window
1: procedure SlideWindowMethod(a, b) Input image a, Window size b
2: for r ← 1, rows − b do
3: for s ← 1, cols − b do
4: classify(a[r,s]) Using the object classification
5: end for
6: end for
7: end procedure
The idea behind this algorithm is the technique called Brute-force which it
basically consists in search for all possible combinations in a search space, this
can be more inefficient as the search space becomes larger.
Moreover, as the search becomes exhaustive, [32] used a weak classifier before
in order to discard all those obvious negative regions and use a stronger classifier
for the rest, this technique is known as cascade of classifiers where the process
starts by using the weakest classifiers.
2.2.2 Region proposals
The Selective Search method proposed by Uijlings et.al [29] generates a set of
possible object regions in order to be processed through a classifier, it was used
by [10] where it is claimed to be one of the fastest algorithms to extract region
proposals. It makes use of an initial segmentation to guide the search which it
is based on the following principles
Capture all scales Rather than having a fixed size window as the used by the
Slide Window algorithm, this algorithm includes any size regions given
that objects within an image can be presented at any size.
Diversification A set of different measures is used in order to create regions,
not only based on colour or textures which may be affected by changes
in brightness or contrast. Therefore, this approach combines a number of
strategies to create the set of regions proposals.
Fast to Compute Due to the algorithm aims to create a set of regions to be
classified, this process cannot be exhaustive in terms of time processing
and consume of resources such that this power can be used in the classi-
fication task.
The algorithm starts by segmenting the input image using the algorithm
described by Felzenszwalb and Huttenlocher in [8] which produces a set of small
regions. The Selective Search consists in joining the previously obtained regions
based on its similarity until there is only one region whose size corresponds to
the input image size.
20

Moreover, the similarity between regions is calculated based on four different
measures which will produce an output in the range of [0, 1]. These measures
are
scolour(ri, rj) It measures the similarity between two regions based on its colour
by using an histogram obtained from each channel divided in 25 bins and
normalized using the L1 norm. Then, the similarity is calculated as
scolour(ri, rj) =
n
k=1
min(ck
i , ck
j ) (19)
Where n = 75 corresponds to the histogram bins for each channel (input
images have 3 channels), the colour histogram for an image i is defined
as Ci = [c1
i , . . . , cn
i ] such that ck
i , ck
j corresponds to the element k at the
histogram calculated for image i and j respectively.
Moreover, the colour histograms for a new region can be easily calculated
based on its ancestors by
Ct =
size(ri) × Ci + size(rj) × Cj
size(ri) + size(rj)
(20)
stexture(ri, rj) This strategy measures the similarity between regions by using
as parameters its textures. Eight orientations are calculated using the
Gaussian derivatives with σ = 1, and its histogram is calculated using 10
bins and normalized using the L1 norm as the colour similarity. Then, the
similarity is calculated as
stexture(ri, rj) =
n
k=1
min(tk
i , tk
j ) (21)
Where n = 240 corresponds to the histograms bins for each channel and
its eight orientations, the texture histogram is defined as Ti = [t1
i , . . . , tn
i ]
and tk
i , tk
j corresponds to the element k at the histogram calculated for
image i and j respectively.
Finally, the histogram for a new region can be calculated by using the
same procedure as in scolour(ri, rj).
ssize(ri, rj) In order to obtain regions with any size, this measure aims to start
joining the smallest regions such that the size of the final regions obtained
from the algorithm will increase smoothly during the process, this measure
is designed to produce regions of all scales at any position.
Given a function size(r) which returns the number of pixels contained in
a region, the size similarity is obtained by
ssize(ri, rj) = 1 −
size(ri) + size(rj)
size(im)
(22)
21

sfill(ri, rj) The last strategy measures how well a region ri can fill a gap con-
tained in a region rj in order to join regions which complement each other.
This measure avoids to obtain holes during the search and to not generate
regions with unusual shapes formed by two regions which hardly touch
each other.
The fill measure makes use of the bounding box BBij around two regions
ri and rj and it is calculated as
sfill(ri, rj) = 1 −
size(BBij) − size(ri) − size(rj)
size(im)
(23)
Finally, the global similarity between two regions ri and rj is calculated by
summing all the previous measures with an activation parameter ai {0, 1} which
activates the measures. Therefore, the similarity is defined as
s(i, j) = acolourscolour(ri, rj)+atexturestexture(ri, rj)+asizessize(ri, rj)+afillsfill(ri, rj)
(24)
From the similarity obtained in equation 24, the algorithm performs an ex-
haustive search over all the initial regions obtained from the initial segmentation
as detailed in algorithm 2.
Algorithm 2 Selective Search algorithm
1: procedure SelectiveSearch(a) Input image a
2: R ← segmentation(a) Perform segmentation using [8]
3: S ← ∅ Similarity set
4: for ri, rj ← neighbouringRegionPairs do
5: S ← S ∪ s(ri, rj) Calculate similarity
6: end for
7: while S = ∅ do
8: ri, rj ← max(S) Get most similar regions
9: rt = ri ∪ rj
10: S ← Ss(ri, r∗) Remove similarities
11: S ← Ss(r∗, rj)
12: S ← S ∪ s(rt, rneighbour) Calculate similarity with neightbours
13: R ← R ∪ rt
14: end while
15: return Bounding boxes for each element in R
16: end procedure
2.3 Object tracking
Once a nest is recognised from a video frame, it is desirable to track it along the
video. Even more, all proposal regions obtained during the object recognition
task need to be tracked along the video such that, rather than obtaining a new
set of proposal regions, the algorithm can track the existing ones and perform
the classification.
22

Therefore, the problem in the object tracking task is: Given a region from
a video frame, the algorithm must be able to obtain its new location in the
consequent frames.
In order to track a region from a video frame, the algorithm has to obtain
first a set of features to track, and then the tracking can be applied over these
features into the current frame.
2.3.1 Obtaining features to track
In order to track a set of features through a video stream, it is necessary to
define what is a good set of features to track [23].
For instance, corners can be used as a feature given that they represent a
variation in the image, those variations can be measured similar as the definition
of a derivative, by obtaining the difference between an image region and a small
variation in its position ∆u which is known as auto-correlation function [28, p.
210] defined as
EAC(∆u) =
i
w(xi)[I0(xi + ∆u) − I0(xi)]2
(25)
Where w(xi) is the window at a certain position. From the Taylor series
expansion, this equation can be expressed as
EAC(∆u) = ∆uT
A∆u (26)
Where A is known as the auto-correlation matrix defined as
A = w ∗
I2
x IxIy
IxIy I2
y
(27)
In order to decide whether a region is a corner, the Harris detector calculate
a value based on eigenvalues of A defined as
R = det(A) − α(trace(A))2
(28)
where,
det(A) = λ1λ2 (29)
trace(A) = λ1 + λ2 (30)
Finally, a corner is detected when the eigenvalues are large and the score R
is greater than a defined value.
On the other hand, Shi and Tomasi in [23] proposed this score as
R = min(λ1, λ2) (31)
The score R is used to measure the quality of a certain feature, for instance,
the function implemented in the OpenCV library [5] starts by extracting all
corners with its corresponding scores, then it removes the non-maximal locals
using a 3x3 neighbourhood, removes the elements with a quality lower than the
specified in the parameter. Finally, it sorts the elements by quality measure
and removes those elements whose distance to a better element is less than a
maxDistance parameter.
23

2.3.2 Lucas-Kanade algorithm
In order to track a region from an image, it is necessary to get a set of good
features to track using the above method; these points are used as parameters
for the tracking algorithm which it will obtain the optic flow for this point.
Lucas-Kanade algorithm [18] calculates the disparity h from a set of input
points in an image. Given two curves F(x) and G(x) = F(x + h), the disparity
is obtained as
h ≈
x
δF
δx
T
[G(x) − F(x)]
x
δF
δx
T
δF
δx
−1
(32)
Where
δF
δx
is the gradient with respect to x, when generalizing for multiple
dimensions, it becomes
δ
δx
=
δ
δx1
,
δ
δx2
, . . . ,
δ
δxn
T
(33)
However, this method assumes the disparity is small, if the motion is large,
then the pyramids technique is used to perform the tracking at different reso-
lutions.
The pyramid method consists in halving the image as many times as the
number of layers defined for the pyramid, such that lowest resolution image will
be located at the head of the pyramid. Then, the top layer is processed and
the results are used to initialize the process in the next layer and this process is
repeated until there is no remaining layers to process, this method is illustrated
in Figure 14. Even more, Bouguet in [35] describes the complete implementation
of the Lucas-Kanade algorithm by using pyramids which is implemented into
the OpenCV library.
3 Data gathering
Two videos of 12 minutes each were obtained from the beach for the testing
and validation of this project. In order to obtain these videos, the Phantom 2
drone model was operated between 49 and 98 high feet within the points with
coordinates N 21 ◦
03 22.5 W 86 ◦
46 48.5 and N 21 ◦
03 55.7 W which has
3,783.28 ft long between both points approximately. The videos were recorded
using the camera model GoPro HERO 3 with a 1080p resolution.
From both videos, 48599 frames were extracted and analysed such that 8222
frames containing at least one nest were separated. Every frame was stored in
the Tagged Image File Format (TIF) with a resolution of 920x1080 pixels.
Once the frames containing nests were extracted, an application was devel-
oped to crop the region of the image containing a nest using the OpenCV library
[5]. This application iterates on the specified directory and shows every image
into the Graphic User Interface (GUI), the user selects a rectangle and presses
a key to crop that part of the image which is stored in another folder with an
automatic name. Figure 15 shows the software used and Figure 16 shows the
example image obtained. The number of cropped images was of 966 positive
useful samples of nests images and the same number for negative samples.
24

Figure 14: Pyramid representation, original image is halved the number of layers
in the pyramd.
Figure 15: User Interface used to crop images
Figure 16: Image obtained from cropping
25

4 Design and Implementation
Even though most of the libraries used during the implementation are available
in different programming languages, the core of this project was developed using
the C++ programming language such that the algorithm can be embed in an
actual drone for a real world test. Additionally, some Python scripts were used
to perform some tests and gathering data.
This section describes the design and the implementation of each task based
on the literature review.
Three different Convolutional Neural Networks architectures were tested in the
object classification task using the Caffe library developed by the Berkeley Vi-
sion and Learning Center (BVLC) [13]. Additionally, one of them was chosen to
be analysed based on an example given by the Caffe library in order to propose
three modifications which were tested as well.
4.1.1 Analysis of the AlexNet architecture
Due to the time allowed for the development of this project, the architecture
selected was the AlexNet architecture which is composed by 5 convolutional
layers, 2 Fully Connected Layers and the Softmax layer.
The input image used for this analysis is shown in Figure 17.
Figure 17: Input image
First layer contains 96 kernels of size 11x11x3 shown in Figure 20, the output
of this layer is composed by 96 feature maps of size 55x55 shown in Figure 21.
It can be seen that the filters present some aliasing possibly caused by the stride
of 4. Additionally, from the output images it can be seen that first filters seems
to be searching for some patterns in the texture of the sand while the others
could be looking for some frequencies. There is a considerable amount of black
outputs which means there are not features detected for those filters, ideally it
would be better to have more detected features as there is more information to
process.
26

0 100 200 300 400 500 600
0
100
200
300
400
500
600
Figure 18: Filters from the second layer
0 100 200 300 400
0
100
200
300
400
Figure 19: Output from the second layer
27

original image
Figure 20: Filters for the first layer
0 100 200 300 400 500
0
100
200
300
400
500
Figure 21: Output from the first layer
Moreover, all 256 filters from the second convolutional layer are shown in
Figure 22 where each column represents a channel. Additionally, the outputs
from this layer are shown in Figure 23. It is interesting to note how the first
filters contain high values maybe due to the number of iterations during the
training.
The output for the rest of convolutional layers can be seen in figures 24,
25 and 26. From the last layer output it can be seen that one of the abstract
features being detected is the hole around the nests, specifically the bottom left
hole appears in almost all images.
The rest of the architecture is composed by Fully Connected Layers, the
output values from the sixth layer are shown in Figure 27 with the histogram of
the positive values. At this point the representation of each output value from
the Fully Connected Layer is not clear, however, from the histogram it can be
seen than the outputs tends to lower values.
Similarly, the next fully connected layer is shown in Figure 28 where the
preference for the lower values is more clear.
Finally, the last fully connected layer is shown in figure 29. The final output
is the category 1, which is correct.
4.1.2 Modification of the architecture
Based on the analysis of the previous section, the following modifications to the
AlexNet architecture are proposed:
1. As suggested in [37], in order to deal with the aliasing of the first layer.
The size of the first filter was reduced from 11 to 7 and the stride was
decreased from 4 to 2.
2. The second approach reduces the size of the first layer from 11 to 7 but
it keeps the stride as is. On the other hand, the size of the third, fourth
and fifth filter was increased from 3 to 5.
28

0 100 200 300 400 500 600
0
100
200
300
400
500
600
Figure 22: Filters from the second layer
0 100 200 300 400
0
100
200
300
400
Figure 23: Output from the second layer
29

0 50 100 150 200 250
0
50
100
150
200
250
Figure 24: Output from the third layer
0 50 100 150 200 250
0
50
100
150
200
250
Figure 25: Output from the fourth layer
30

0 50 100 150 200
0
50
100
150
200
Figure 26: Output from the ﬁfth layer
0 500 1000 1500 2000 2500 3000 3500 4000 4500
0
5
10
15
20
0 5 10 15 20
0
5
10
15
20
25
30
35
40
45
Figure 27: Output from the sixth layer
31

0 500 1000 1500 2000 2500 3000 3500 4000 4500
0
1
2
3
4
5
6
7
0 1 2 3 4 5 6 7
0
10
20
30
40
50
60
Figure 28: Output from the seven layer
0.0 0.2 0.4 0.6 0.8 1.0
0.40
0.45
0.50
0.55
0.60
Figure 29: Final output
32

3. The last approach keeps the size of the filter at 11 but reduces the stride
from 4 to 2. Moreover, the size of the filter for the third, fourth and fifth
layer was increased from 3 to 5.
4.1.3 Training
Using the obtained data from section 3, the trained architectures were
• AlexNet architecture
• AlexNet architecture using fine-tuning
• AlexNet with the first modification
• AlexNet with the second modification
• AlexNet with the third modification
• GoogLeNet architecture
• GoogLeNet using fine-tuning
• VGG using fine-tuning.
These architectures were trained using independent virtual machines with
2 virtual CPUs and 13 GB of RAM memory each. For data augmentation
purposes during the training, each input image was scaled to 256x256 pixels,
processed using the original image and its mirror and perform a random cropping
from the center of the image to obtain random patches with size required by
the architecture. This data augmentation technique was also used for training
in the original architectures as stated in [14], [25] and [10].
4.1.4 Testing
In order to create the confusion matrices and calculate the accuracy both in
training and validating data for all architectures, there were used 342 sample
images from each dataset.
The dataset is divided in 4 folders, 2 of them contains positive samples and
the others negative samples. Those folders whose files were obtained from the
first video are used for training data while the rest is used for the validation
data. The script starts by obtaining a vector with 342 random numbers which
corresponds to the indexes of the files in each folder.
Then the script classifies each file and accordingly updates the number of
false-positive, positives, and false-negatives or negatives obtained. The pro-
cess is repeated for the validation data folder. Additionally, the loss value was
recorded for all iterations during the training stage such that it can be plotted
to analyse the training performance.
33

4.1.5 Implementation
As stated before, the Object recognition task is performed using the Caffe li-
brary [13].When the program is started, it creates an instance of the Classifier
class; there is only one instance during the whole process for efficiency pur-
poses. Moreover, the Classifier constructor receives as parameter the model
and weights file names which has the .prototxt and .caffemodel extension. These
files allow the constructor create the structure and reserve the needed space in
memory as well as initialize each layer with its corresponding parameters.
The main function in the Classifier class for the object classification task
is the predict function which receives as parameters the input image and the
region to be classified. This function converts the image from the OpenCV class
to the format used for the Caffe library and performs the classification. The
function returns the probability value of the region to be a nest.
For the implementation of the Object recognition task, two methods were im-
plemented: the Slide Window and Selective Search method. Both methods
make use of the ISlideMethod interface which contains the methods initializeS-
lideWindow and getProposedRegion. On the other hand, the object recognition
task is executed by calling the method classifyImage contained in the Classifier
class, the method receives as parameters the input image to be processed and
returns the number of nests found, additionally, it adds to the nests vector field
in Classifier an instance of the Nest class which contains the coordinates of the
region image processed along with the probability of being a nest.
In resume, when an input image is received for recognition, the process ex-
ecute the initializeSlideWindow function which prepares the method to extract
regions in the image, then it receives each region by using the method get-
ProposedRegion and classify it using the predict function which executes the
object classification task. Finally, with the region and probability obtained, the
method addPrediction is executed which creates an instance of the Nest class
and adds it to the nests vector unless there is an intersection with an existent
region, in this case the Nest instance with the highest probability is added and
the rest are delete it.
The first method implemented was the Slide Window method which is con-
tained in the SlideWindowMethod class, basically it keeps two indexes for the
current row and column being processed which are incremented each time the
getProposedRegion is executed. However, the videos were obtained with a high
quality resolution such that each video frame has a size of 1920x1080 pixels but
254x254 is the minimum size image required for all the architectures, it would re-
quire 1,376,116 iterations to slide a region from the top-left to the bottom-right
region of the image. Even more, the AlexNet takes approximately 8 seconds to
process a single image, such that it would take approximately 4.18 months to
process a frame. This amount of time is not affordable, given that the objective
is to perform the recognition using a drone.
On the other hand, the Selective Search method was implemented in the
SelectiveMethod class. The main process is executed during the initializeS-
lideWindow method where all the proposal regions are obtained and stored in a
queue such that the getProposedRegion takes one each time its called. Moreover,
34

the method starts by segmenting the image using the C++ implementation of
Pedro Felipe Felzenszwalb [9] and released under the GNU General Public Li-
cense; the obtained segments are processed using the Selective Search algorithm
described in 2, however, for this implementation only three measures are being
used: Colour, fill and size; instead of the four stated in the original paper, the
texture measure is not being used due to the textures for the nests are similar
to the sand, in addition the difficulty to obtain the histogram of gradients can
affects the performance of the application. Additionally, the object recognition
complete process is described in the sequence diagram in Figure 30.
Classifier
SelectiveMethod
: ISlideMethod
Classify(InputImage)
initializeSlideWindow(InputImage)
getProposedRegion()
Image Region
loop
[getProposedRegion()=NULL]]
Object recognition task
[InputImage]
addPrediction()
predict(region, InputImage)
Object Classification task
Nests
Figure 30: Sequence diagram for the Object Recognition task
4.3 Object tracking
The object tracking task is implemented within the Classifier class, the process
starts tracking all the previously obtained nests from the object detection task
and updating their location in the current image, then it executes the object
classification task for each region and executes the object detection in all new
regions of the image which has not been processed before.
First, the process executes the function updateExistingNests, this function
makes use of the nests obtained during the object recognition task which are
stored in the nests list, even more, before adding an instance of Nest into the
list, the best feature to track is obtained using the function getFeatureToTrack
which is part of the OpenCV library. The updateExistingNests uses the features
to track as a parameter for the function calcOpticalFlowPyrLK which performs
the tracking and returns the new location of every point receiver using the Lucas-
Kanade algorithm and pyramids as described in [35]. Then, all the locations
from the nests list are updated based on the locations obtained from the Lucas-
Kanade algorithm, however, if the tracking was unsuccessful for a certain region,
35

its location will be updated based on the average optic flow from the rest of the
points, this can be done due to all nests in the video should move to the same
direction at a similar speed.
Additionally, all nests with a height or width lower than certain value are
removed from the lists because it could not be processed by the Convolutional
Neural Network.
Second, for all regions with its location updated, the process performs the
Object Classification task again which returns the probability value of being a
nest, this probability is updated using the average of probabilities
P(r|xi) =
P(r|xi−1)(i − 1) + ρ
i
(34)
Where ρ corresponds to the new probability obtained from the object clas-
sification task.
Third, the process will obtain a bounding box from all regions contained in
the nests list and extract a set of image regions which have not been processed
yet, then it will perform the Object Recognition over all new regions. A maxi-
mum of four new image regions can be extracted from the current frame based
on the bounding box as shown in Figure 31, the top region is processed first,
then the bottom region followed by the right and left regions.
Bounding box
Top section
Bottom section
Left section Right section
Region
Prob.: 0.9
Region
Prob.: 0.42
Region
Prob.: 0.75
Region
Prob.: 0.18
Region
Prob.: 0.21
Region
Prob.: 0.02
Region
Prob.: 0.06
Region
Prob.: 0.0001
Figure 31: Tracking description for a video frame, yellow arrows represents the
optic flow calculated for each region.
36

5 Results
Based on the literature review, design and implementation described in the
previous sections, the evaluation method used for each task is described along
with the results obtained.
In order to evaluate each of the Convolutional Neural Network architectures
trained and given that the final layer in all architectures is a Softmax layer, the
loss function is a good parameter to be compared which is used for analysing the
training performance. On the other hand, in order to evaluate the performance
of the classification, the confusion matrix is presented for all architectures.
All the following architectures were trained with 624 positive samples and
equally number of negative samples obtained from the two available videos. On
the other hand, 342 positive samples and the same number of negative samples
were used for validating the results which were obtained only from the second
video.
The evaluation method recorded each value of the final loss from each itera-
tion during the training such that the value can be plotted against the number
of iterations. Ideally this value would decrease to zero even though it will
never reach zero value. Moreover, the images used for validation were classi-
fied for each architecture, when a positive sample was described as negative, it
is counted as a positive-negative; similarly those negative samples classified as
positives are counted as negative-positive. This measures allow to create the
confusion matrices shown.
5.1.1 AlexNet
The final loss during the training process for the AlexNet architecture is shown
in Figure 32 both for training from scratch and using the fine-tuning technique.
0 200 400 600 800 1000 1200 1400 1600
Iteration
0
1
2
3
4
5
6
7
8
Loss
AlexNet Fine-tuning vs training from scratch
Fine-tuning
Training from scratch
Figure 32: Loss value for training from scratch and fine-tuning for the AlexNet
architecture
37

It can be seen that the loss decays quickly using the fine-tuning technique.
On the other hand, the fine-tuning loss stabilizes faster than training from
scratch. Even though the function loss decreases, the results on the tests are
not good enough.
On the other hand, the confusion matrix for the architecture using fine-
tuning is shown in Table 3. The architecture classifies most of the samples as
positives.
Training data Validation data
Positive Negative Positive Negative
Positive 342 0 342 0
Negative 264 78 239 103
Accuracy 61% 65%
Table 3: Confusion matrix for the AlexNet architecture using fine-tuning
Moreover, the confusion matrix for the architecture trained from scratch is
shown in Table 4.
Positive 234 108 239 103
Negative 246 96 201 141
Accuracy 48% 55%
Table 4: Confusion matrix for the AlexNet architecture trained from scratch
It is interesting to note that the fine-tuning has a better performance than
the architecture trained from scratch, even though the images used for the
original AlexNet architecture are nothing similar to the images used for this
project, even more, the number of classes was reduced from 1000 to only 2.
The final loss for both architectures decays within the first 300 iterations
and then stays oscillating, presumably the accuracy will not improve with more
iterations during the training. Additionally, from the confusion matrices can
be seen that most of the images are classified as positives, it has a 100% of
accuracy for positive values but a poor performance for the negative samples
which it makes it a non suitable architecture for this task.
5.1.2 AlexNet first modification
The loss function for this architecture is shown in Figure 34. In this case the
loss function is not stable which means that requires to change some parame-
ters, perhaps by lowering the learning rate, the training would be improved, in
consequence the performance during the testing is not satisfying.
The confusion matrix is shown in Table 5. None of the positive samples were
classified correctly which means that the parameters have to be tuned, using
cross-validation for instance. However, it is unlikely that with a tuned set of
parameters can perform better than the other trained architectures.
5.1.3 AlexNet second modification
The loss graph for the second modification is shown in Figure 34.
38

0 200 400 600 800 1000 1200 1400 1600
Iteration
0.0
0.5
1.0
1.5
2.0
Loss
AlexNet first modification Fine-tuning vs training from scratch
Figure 33: Loss for the first modification of the AlexNet architecture
Negative 0 342 4 338
Accuracy 50% 49%
Table 5: Confusion matrix for the first modification of the AlexNet architecture.
0 200 400 600 800 1000 1200 1400 1600
Iteration
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Loss
AlexNet second modification Fine-tuning vs training from scratch
Figure 34: Loss for the second modification of the AlexNet architecture
39

The confusion matrix is shown in Table 6.
Accuracy 50% 48%
Table 6: Confusion matrix for the second modification of the AlexNet architec-
ture
The loss function is considerably unstable as the first modification and the
performance in the confusion matrix is also poor. As there is only a difference
in the second layer with the first modification, it means that this layer has not
a big impact on the overall results of the architecture.
5.1.4 AlexNet third modification
Similarly, the loss for the third modification is shown in Figure 35.
0 200 400 600 800 1000 1200 1400 1600
Iteration
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
Loss
AlexNet third modification Fine-tuning vs training from scratch
Figure 35: Loss plot for the third modification of the AlexNet architecture
Its confusion matrix is shown in Table 7.
Accuracy 49% 49%
Table 7: Confusion matrix for the third modification of the AlexNet architecture
For the third modification, the kernel size in the first layer remains at 11. In
general, the loss is lower for this modification and a bit more stable which means
that the first layer has an actual effect over the loss, however the performance
in the confusion matrix is worst than the first two modifications.
40

Positive 149 193 204 138
Accuracy 71% 79%
Table 8: Confusion matrix for the GoogLeNet architecture using fine-tuning
The proposed architectures did not work as expected, the main reason is
due to the amount of training data available, as the dataset is smaller, the
training would require more iterations and use of the cross-validation technique
for tuning the parameters. On the other hand, it is notable how the performance
improved using the fine-tuning technique, which supports the idea behind the
fine-tuning where the features on the first layers are useful for most of the tasks
such that is possible to take a trained architecture and complete its training
with the dataset available.
5.1.5 GoogLeNet
The final loss obtained during the training of the GoogLeNet architecture both
using fine-tuning and training from scratch is shown in Figure 36.
0 200 400 600 800 1000 1200 1400 1600
Iteration
0.0
0.5
1.0
1.5
2.0
Loss
GoogLeNet fine-tuning vs training from scratch
Fine-tuning
Train from scratch
Figure 36: Loss function for the GoogLeNet architecture using fine-tuning and
training from scratch
Moreover, the confusion matrix when the fine-tuning technique was applied
is shown in Table 8.
Additionally, the confusion matrix when the training was performed from
scratch is shown in Table 9.
Even though the loss function is unstable both for the training from scratch
and using the fine-tuning, the performance in the confusion matrix is better
than the AlexNet architecture, it achieves an accuracy of 81% on the validation
data by training from scratch. On the other hand, in this architecture the
performance of the fine-tuning technique is worst than the training from scratch,
41

Positive 272 70 254 88
Negative 64 278 39 303
Accuracy 80% 81%
Table 9: Confusion matrix for the GoogLeNet architecture when training from
scratch
the reason is possibly due to the number of layers contained in this architecture,
as the architecture is deeper, the features use to be more abstract and problem-
dependent which may not be useful for another task such as this project. It
would require more iterations during the training and a large dataset for adapt
this layers to the new task.
5.1.6 VGG
The loss function plotted for the VGG architecture is shown in Figure 37.
0 200 400 600 800 1000 1200 1400 1600
Iteration
0.0
0.2
0.4
0.6
0.8
1.0
Loss
VGG architecture using fine-tuning
Fine-tuning
Figure 37: Loss function for the VGG architecture using fine-tuning
Furthermore, the confusion matrix for the VGG trained using the fine-tuning
technique is shown in Table 10.
Positive 209 133 260 82
Accuracy 80% 88%
Table 10: Confusion matrix for the VGG architecture using the fine-tuning
method.
This architecture requires much more parameters than the rest of the ar-
chitectures, so it took more time to complete the 1500 iterations during the
42

training. However, it can be seen that the performance in the confusion matrix
is much better than the rest of the architectures.
Even more, the loss function is very unstable which can produce over-fitting,
perhaps by tuning the parameters using cross-validation could produce a more
stable loss and a better accuracy.
Overall the results for the object classification task, show that the best
architecture is the VGG, thus it is preferable a very deep architecture and
the use of the fine-tuning technique. From the modifications proposed, it can
be deducted that it is necessary a large dataset for training an architecture
from scratch, even more, the tuning of parameters is crucial as the modified
architectures were trained with the default parameters which obtained a poor
performance.
In order to measure the accuracy of the object recognition task for all archi-
tectures, 100 random frames were taken from the video test where 18 of these
images contain a nest. These samples were processed independently in each
architecture recording the total number of regions classified as positive and neg-
ative, additionally the images previously identified which contain a nest, were
analysed one by one for each architecture in order to obtain the number of real
positives (when a nest was successfully detected) and positive-negatives (when
a real nest was not detected), then the number of positives were subtracted
from the total of regions classified as positive in order to obtain the number
of negative-positives (those regions classified as nests which were actually not
nest) and the positive-negatives were subtracted from the total of negatives to
obtain the real negatives (those regions correctly classified as not nest).
In order to compare the results, the Positive Predictive Value (PPV) and
Negative Predictive Value (NPV) were calculated. The PPV is the result of the
division between the number of true positives and the sum of true positives and
false-positives, this value allows to measure the success of the nests detected.
On the other hand, NPV is defined as the quotient between the true negatives
and the sum of true negatives and false negatives, NPV allows to measure how
well the negatives values are being detected such that the architecture is not
classifying all regions as a nest. Table 11 shows the results obtained for each
architecture.
PPV NPV
AlexNet with fine-tuning 23.52% 12.98%
AlexNet 44.44% 0.002%
AlexNet First 0% 98.67%
AlexNet Second 0% 98.56%
AlexNet Third 0% 99.59%
GoogLeNet with fine-tuning 33.33% 98.89%
GoogLeNet 33.33% 78.31%
VGG with fine-tuning 44.44% 94.61%
Table 11: Results obtained for the object recognition task.
From Table 11 it can be seen that the AlexNet architecture using the fine-
43

tuning had a bad performance, both for detecting real nests and negative values.
Most of the regions were classified as positive but they do not coincide with the
actual nests. Moreover, this architecture trained from scratch has the best
PPV but the NPV is practically 0, this means that the architecture is actually
classifying all regions as positive, which increases the probability of detecting
actual nests, nevertheless this architecture is not useful for real implementations.
The modifications of the AlexNet architecture have the worst performance,
none of them detect actual nests and their NPV values are considerably high.
This means that the architectures tend to classify all regions as negatives and
a few as positive which are not correct.
On the other hand, it can be seen that the PPV value for the GoogLeNet
architecture, both using fine-tuning and trained from scratch, are the same. The
same number of nests were recognized (not necessary the same images), however
the architecture which used fine-tuning has a better NPV value, this means that
the GoogLeNet trained from scratch presents more negative-positives.
Finally, the VGG architecture which has the best performance on the classi-
fication task. It has the best PPV value and an acceptable NPV value, however
the PPV value is considerably small comparing with the performance obtained
during the classification task.
It has to be noted that the size of the dataset is small, even though the
first dataset was approximately close to 8,000 samples, it had to be reduced
due to most of the nests on the samples were not clear even for the human
vision which affected the performance. Once the dataset was re-sampled, the
number of samples was considerably reduced and the performance increased
for the classification task. From the analysis in the remaining samples in the
dataset, it can be seen that only four nests are being used for the training data
and most of the samples corresponds to only two nests shown in Figure 38 and
39. From each of the frames processed for this evaluation, it can be seen that
those nests with a clearly defined shape were correctly identified which are those
nests corresponding to the training data. Even more, those nests with a clearly
defined shape correspond to nests created two days ago at most. On the other
hand, those nests whose shape is not clear correspond to nests created more
than two days before the video was obtained, they are hardly detected and
presumably are contributing to the increase of the negative-positive errors.
Figure 38: Most of the images in the
dataset correspond to this nest.
Figure 39: Second most common nest in
the dataset
This behaviour can be compared to those classification challenges for which
these architectures were previously trained. The objects to be classified have a
clear and defined shape rather than the blurred and hardly defined nests. In
conclusion, even though this is a complex classification task, those nests with a
44

clear shape can actually be detected, but it is necessary to perform the training
with a larger dataset in order to find how sensitive are these architectures to
the shape due to it is unlikely that all the nests will share the same shape.
5.3 Object tracking
It is important to note that the core of the project is in the object classification
task and the deep learning evaluation, however, the current object tracking task
was developed as a test of many methods in which the tracking could have been
designed. This section shows the results obtained for the proposed tracking task
with its weaknesses and strengths as basis for further research.
The object tracking was evaluated taking the architecture with the best
performance and creating a new video with blue rectangles for those regions
evaluated as negative and green rectangles for those regions evaluated with a
probability bigger than 95% to be a nest. The selected architecture was the
VGG which has the best accuracy as shown in the Object classification task;
however it has the highest time processing.
During the processing of the video, the results did not reflect the accuracy
obtained from the evaluation on each of the previous tasks. The probable reason
is due to the perspective of each element in the image, when the video was
recorded, the drone was flying forward such that all elements appear from the
top of the image and moves to the bottom, in addition, the size of the elements
on top is small because they are further away from the drone. However, the first
time a probable nest appears in the video, there is a long distance between the
drone and the region, therefore the image will not be clear and it is more likely
that it is going to be classified as no nest. As the nest is moving downwards the
probability should gradually increment due to the image is more clear, however
the final average in each region will be low. Figure 40 shows a possible nest,
however the region is not clear enough.
On the other hand, the regions obtained from the object recognition task
are being tracked on the consecutive frames but they are not being resized; the
size of the regions should be scaled according to the perspective in the current
frame. Figure 41 shows the result when the drone was close to a possible nest
previously detected as a region in Figure 40, the region does not coincide with
the nest and the probability remains low.
A possible solution to this problem is that the camera has to point down-
wards at 90◦
in order to reduce the effect of the perspective. On the other hand,
the drone can also fly backwards, such that the objects appear from the bottom
to the top of the image. This way the regions are processed when the drone is
close to them. In order to test this solution, the video was reversed and pro-
cessed, the same nest is shown in Figure 42 and some frames later in Figure 43,
it can be seen that it remains the problem with the scale of the regions which
is proposed as a further research.
On the other hand, as the texture is very similar because of the sand, the
Lucas-Kanade algorithm has problems to track some regions. Sometimes it
follows paths on the sand, this behaviour makes that the regions are moving
along with the movement of the drone. This is a problem because on the next
frame, the new regions are calculated based on the bounding box of the existing
regions and this bounding box becomes very large to almost size of the image
45

Figure 40: A possible nest found
Figure 41: The region does not coincide with the nest, consequently the proba-
bility did not increase.
46

sometimes, which makes that no new regions are obtained and missing some
actual nests.
Figure 42: Nest detected by processing the video reversed
6 Conclusions
During the development of this project, it has been shown that the deep learn-
ing is a good alternative for classification in turtle nest but it requires a large
dataset with nests whose shapes are clearly defined. Moreover, the VGG ar-
chitecture using the fine-tuning technique is the best architecture for this task
which obtained the best performance from all the architectures trained, however
the processing time taken by Convolutional Neural Architectures makes a diffi-
cult task to obtain a real time classification such that it is required a tracking
system in order to do not process the same regions in each frame. Addition-
ally a selective search algorithm has been tested for extracting proposal regions
with the aim of avoiding use a brute search method over the whole frame, this
algorithm extracts regions proposals of any size at any position to be classified
afterwards by the deep architecture.
Additionally to the existent architectures trained, three modifications of one
of these architectures were proposed and trained, however the results show that
the best approach is to select an existing architecture trained with a large dataset
of images and use the fine-tuning technique for performing the training.
47

Figure 43: Region remains at the same size when the nest reduced its size due
to the perspective.
7 Further Research
7.1 Implementation over a GPU
The trained architectures require a considerable amount of time to perform
the prediction. As the project is intended to act on real time in a future real
application, it is necessary to reduce the processing time. All the algorithms
used for this project have an implementation for a GPU, therefore the time
could be reduced if those regions obtained from the object recognition task
were process in parallel due to they do not depend one of each other.
7.2 Cross validation on VGG architecture
From the results section, the VGG architecture had the best performance for this
task; however the loss function was unstable during the training. If the VGG is
analysed using cross validation in order to find a better set of parameters, the
accuracy could be improved for this task.
7.3 Region proposals
Even though the algorithm used for obtaining the region proposals is one of the
most efficient in terms of processing time, in the original paper it was tested over
images with a clearly defined shape in the image. For instance, the algorithm
aims to detect possible cars, horses or people. However, this is a more complex
task due to the similarity of the objects with the background. The measures
48

used by the Selective Search algorithm are the colour, texture, fill and size,
but half of these measures are useless for this task (colour and texture). It is
proposed to find a different set of techniques to join the initial regions from the
image segmentation aiming to obtain a more useful set of region proposals.
7.4 Scale of regions
As stated in the results for the tracking task, once the regions are obtained, their
size should be changed according to the perspective over the frames. This is not
easy task and a model should be proposed to perform this scaling. However, the
scaling of the regions is necessary due to the amount of information lost during
the tracking which it clearly affects the performance.
7.5 The probability model
During the tracking object task, the probability for a region to be a nest is
updated using the average of probabilities. However this model is very simple
comparing to the difficulty of the problem. It is proposed to use a different
probability model which it could improve the accuracy in each region.
References
[1] Itamar Arel, Derek C. Rose, and Thomas P. Karnowski.
Research frontier: Deep machine learning–a new frontier in artificial intel-
ligence research.
Comp. Intell. Mag., 5(4):13–18, November 2010.
[2] Yoshua Bengio.
Learning deep architectures for ai.
Found. Trends Mach. Learn., 2(1):1–127, January 2009.
[3] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent.
Unsupervised feature learning and deep learning: A review and new per-
spectives.
CoRR, abs/1206.5538, 2012.
[4] Christopher M. Bishop.
Pattern Recognition and Machine Learning (Information Science and
Statistics).
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[5] G. Bradski.
The opencv library.
Dr. Dobb’s Journal of Software Tools, 2000.
[6] Yuxin Chen, Hiroaki Shioi, Cesar Fuentes Montesinos, Lian Pin Koh, Serge
Wich, and Andreas Krause.
Active detection via adaptive submodularity.
In Proceedings of The 31st International Conference on Machine Learning,
pages 55–63, 2014.
[7] ConservationDrones.
Orangutan nest spotting with conservation drone, success!, 2015.
49

http://conservationdrones.org/2012/05/29/
orangutan-nest-spotting-success-2/, [Online; accessed 7-
September-2015].
[8] Pedro F. Felzenszwalb and Daniel P. Huttenlocher.
Efficient graph-based image segmentation.
Int. J. Comput. Vision, 59(2):167–181, September 2004.
[9] Pedro Felipe Felzenszwalb.
Graph based image segmentation, 2007.
http://cs.brown.edu/~pff/segment/index.html,[Online; accesed 12-
September-2015].
[10] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik.
Rich feature hierarchies for accurate object detection and semantic segmen-
tation.
CoRR, abs/1311.2524, 2013.
[11] Benjamin Graham.
Fractional max-pooling.
CoRR, abs/1412.6071, 2014.
[12] D. H. Hubel and T. N. Wiesel.
Receptive fields and functional architecture of monkey striate cortex.
The Journal of Physiology, 195(1):215–243, 1968.
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan
Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell.
Caffe: Convolutional architecture for fast feature embedding.
arXiv preprint arXiv:1408.5093, 2014.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.
Imagenet classification with deep convolutional neural networks.
In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc., 2012.
[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
Gradient-based learning applied to document recognition.
In Proceedings of the IEEE, pages 2278–2324, 1998.
[17] DavidG. Lowe.
Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91–110, 2004.
[18] Bruce D. Lucas and Takeo Kanade.
An iterative image registration technique with an application to stereo vi-
sion.
In Proceedings of the 7th International Joint Conference on Artificial Intel-
ligence - Volume 2, IJCAI’81, pages 674–679, San Francisco, CA, USA,
1981. Morgan Kaufmann Publishers Inc.
[19] Ryszard S. Michalski, Ivan Bratko, and Avan Bratko, editors.
Machine Learning and Data Mining; Methods and Applications.
50

John Wiley & Sons, Inc., New York, NY, USA, 1998.
[20] M. Mohri, A. Rostamizadeh, and A. Talwalkar.
Foundations of Machine Learning.
Adaptive computation and machine learning series. MIT Press, 2012.
[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael
Bernstein, Alexander C. Berg, and Li Fei-Fei.
ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV), pages 1–42, April 2015.
[22] J¨urgen Schmidhuber.
Deep learning in neural networks: An overview.
CoRR, abs/1404.7828, 2014.
[23] Jianbo Shi and Carlo Tomasi.
Good features to track.
Technical report, Ithaca, NY, USA, 1993.
[24] K. Simonyan and A. Zisserman.
Very deep convolutional networks for large-scale image recognition.
CoRR, abs/1409.1556, 2014.
[25] Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition.
CoRR, abs/1409.1556, 2014.
[26] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Mar-
tin A. Riedmiller.
Striving for simplicity: The all convolutional net.
CoRR, abs/1412.6806, 2014.
[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich.
Going deeper with convolutions.
CoRR, abs/1409.4842, 2014.
[28] Richard Szeliski.
Computer Vision: Algorithms and Applications.
Springer-Verlag New York, Inc., New York, NY, USA, 1st edition, 2010.
[29] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeulders.
Selective search for object recognition.
International Journal of Computer Vision, 104(2):154–171, 2013.
[30] JanC. van Gemert, CamielR. Verschoor, Pascal Mettes, Kitso Epema, Lian-
Pin Koh, and Serge Wich.
Nature conservation drones for automatic localization and counting of an-
imals.
In Lourdes Agapito, Michael M. Bronstein, and Carsten Rother, editors,
Computer Vision - ECCV 2014 Workshops, volume 8925 of Lecture
Notes in Computer Science, pages 255–270. Springer International Pub-
lishing, 2015.
[31] V. Vapnik.
The support vector method of function estimation.
51

In Neural Networks and Machine Learning.
[32] Paul Viola and Michael J. Jones.
Robust real-time face detection.
Int. J. Comput. Vision, 57(2):137–154, May 2004.
[33] Sun-Chong Wang.
Artiﬁcial neural network.
In Interdisciplinary Computing in Java Programming, volume 743 of The
Springer International Series in Engineering and Computer Science,
pages 81–100. Springer US, 2003.
[34] Jason Yosinski, Jeﬀ Clune, Yoshua Bengio, and Hod Lipson.
How transferable are features in deep neural networks?
CoRR, abs/1411.1792, 2014.
[35] Jean yves Bouguet.
Pyramidal implementation of the lucas kanade feature tracker.
Intel Corporation, Microprocessor Research Labs, 2000.
[36] Matthew D. Zeiler and Rob Fergus.
Visualizing and understanding convolutional networks.
CoRR, abs/1311.2901, 2013.
[37] Matthew D. Zeiler and Rob Fergus.
Visualizing and understanding convolutional networks.
CoRR, abs/1311.2901, 2013.
52

Automatic Sea Turtle Nest Detection via Deep Learning.

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Automatic Sea Turtle Nest Detection via Deep Learning.

Similar to Automatic Sea Turtle Nest Detection via Deep Learning. (20)

Recently uploaded

Recently uploaded (20)

Automatic Sea Turtle Nest Detection via Deep Learning.