Object tracking from a single camera

Image processing and object tracking from a single camera

JOHAN SOMMERFELD
Master's Degree Project
Stockholm, Sweden 2006-12-13
Abstract
In the last decades, computers have gained the ability to perform huge amounts of calculations and to handle information flows we never thought possible ten years ago. Despite this, a computer can extract only little information from an image in comparison to human vision. The way the human brain filters out useful information is not fully known, and this skill has not been merged into computer vision science.
The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should use both fast and advanced algorithms, aiming to achieve a better ratio between accuracy and speed than would be achieved with either fast or advanced algorithms alone. The system is tested by trying to follow a person's hand, placed in front of a computer with the web camera mounted on the screen.
The goal is to achieve a system with the potential to be implemented in a real-time environment. Therefore the system needs to be very fast. The work in this thesis is an initial step and will not be implemented to run in real time.
The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps).
The system works overall as expected and was able to track a person's hand with numerous configurations. It outperforms advanced algorithms in terms of the computational power needed, and is more stable than the fast ones. A drawback is that the system parameters were dependent on the object and its surroundings.
Acknowledgments
The thesis was written at the Sound and Image Processing Laboratory at the School of Electrical Engineering, the Royal Institute of Technology (KTH), during the school year of 2005-2006. I would like to take this opportunity to thank my supervisor M.Sc. Anders Ekman for his patience when things progressed a bit slowly, Ph.D. Disa Sommerfeld for proofreading, and assistant professor Danica Kragić for pushing me forward.
Chapter 1
Introduction
In the last decades, computers have gained the ability to perform huge amounts of calculations and to handle information flows we never thought possible ten years ago. Despite this, a computer can extract only little information from an image in comparison to human vision. The way the human brain filters out useful information is not fully known, and therefore this skill has not been merged into computer vision science.
1.1 Background
Even if we have not been able to teach a computer to process visual input in a complex sense, there is quite a lot a computer can do when it comes to following movement and performing simpler recognition tasks.
One of the key features in a computer vision system is for the computer to extract interesting areas (foreground). Research on this has mainly two approaches. The first group uses advanced algorithms for pattern recognition to extract the foreground. These methods often make little use of temporal redundancy, and are slow because of the large number of computations needed. The second approach is different, often using pixel-by-pixel computations with only a few computations per pixel. In general the latter methods are fast and may be implemented in real-time applications. Their drawback is that, due to the lack of complexity in the algorithms, they are sensitive to noise and often need a static environment to be able to function.
1.2 Related work
There are a few simple algorithms for tracking, for example: detection of discontinuities using Laplacian and Gaussian filters, often implemented with a simple kernel [1]; thresholding; and motion detection with a reference image. These algorithms are simple, but sensitive to noise and hard to generalize. A set of more advanced algorithms involves iterations and/or transformations, such as the Hough transform, region-based segmentation and morphological segmentation. These algorithms are generally more stable with respect to noise, although as pictures and/or frames grow larger, these algorithms get slow [1].
Other algorithms make use of pattern recognition, such as neural networks, maximum likelihood and support vector machines [2]. First the image has to be translated into something that the pattern recognition algorithms understand: the image is processed into a so-called feature vector. The majority of pattern recognition algorithms require a set of training data to form the decision boundary. The training is often slow; thereafter, however, the algorithm is fast. The problem is that extracting the feature vector might be a demanding task for the computer.
There is a number of interesting approaches to object tracking. Kyrki et al [3] use model-based features, such as a wire frame, combined with model-free features, such as points of interest on a calculated surface. Doulamis et al [4] use an implementation of neural networks to track objects in a video stream; the neural network is adaptive and changes over time as the object moves. Comaniciu et al [5] use a kernel-based solution for identifying an object. Amer [6] uses voting-based features and motion to detect objects, tuned for real-time processing. In the PhD thesis by Kragić [7], a multiple-cue algorithm is presented, using features that are fast to compute and relying on the assumption that not all cues fail at the same time. Cavallaro et al [8] present a hybrid algorithm using information about both objects and regions. Gastaud et al [9] track objects using active contours.
Kragić [7] uses multiple cues for better tracking. Instead of combining multiple fast algorithms as cues, the approach in the present thesis takes advantage of both fast and advanced algorithms, in order to achieve a system that outperforms the simple algorithms and operates faster than the advanced ones.
Chapter 2
Problem
The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should use both fast and advanced algorithms, aiming to achieve a better ratio between accuracy and speed than would be achieved with either fast or advanced algorithms alone. The system is tested by trying to follow a person's hand, placed in front of a computer with the web camera mounted on the screen.
2.1 System
The goal is to achieve a system with the potential to be implemented in a real-time environment. Therefore the system needs to be very fast. A higher accuracy than that of the simple methods described in section 1.2 also needs to be achieved. The system will use algorithms that need training. The work in this thesis is an initial step and will not be implemented to run in real time.
Figure 2.1: The main blocks of the system.
At the start, the user is asked to tell the system the whereabouts of the object to track. In this thesis the system is implemented in Matlab, and therefore only a proof of concept can be achieved.
More specifically, the system is based on four blocks, see figure 2.1.
Detection is responsible for detecting and segmenting the interesting parts of the image. The main algorithm here is most often one of the fast algorithms described in section 1.2.
Recognition is responsible for classifying the foreground extracted from the
image by the detection block.
Updating is responsible for updating the representation of the tracked ob-
ject, using information generated by the recognition block.
Prediction is responsible for using all available information to predict where to start the segmentation in the Detection block, in order to minimize both the time consumed and the error probability.
2.2 Hardware
The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps). The computer is an Apple Power Mac G5 2x2.7 GHz. The camera is an Apple iSight, and the video stream is in DV format. However, Matlab's Unix version only accepts uncompressed video, and therefore the stream is converted to uncompressed true-color video.
Chapter 3
Method
3.1 Adaptive filters
An adaptive filter is a filter that changes over time depending on the signal. For a résumé of the statistical theory used, see appendix A.1. Assume that we have two non-stationary signals with zero mean and known statistics, hence covariance and cross-covariance

$$r_{yy}(n, m) = E[y(n)y(n+m)]$$
$$r_{xy}(n, m) = E[x(n)y(n+m)].$$
The problem of estimating $x(n)$ given past $y(n)$ may be written as

$$\hat{x}(n) = \sum_{k=0}^{N-1} \theta(k)\, y(n-k) = Y^T(n)\,\theta,$$

where $Y(n) = [y(n), \ldots, y(n-N+1)]^T$ and $\theta = [\theta(0), \ldots, \theta(N-1)]^T$. The MSE is then given by

$$\mathrm{MSE}(n, \theta) = E[(x(n) - \hat{x}(n))^2].$$
The optimal $\theta$ may be obtained from the orthogonality condition, which states that $Y^T(n)\theta$ is the linear MMSE estimate of $x(n)$ if the estimation error is orthogonal to the observations $Y(n)$:

$$E[(x(n) - Y^T(n)\theta)Y(n)] = 0. \quad (3.1)$$
If we define the covariance matrices

$$\Sigma_{Yx}(n) = [r_{xy}(n, n), \ldots, r_{xy}(n, n-N+1)]^T$$

$$\Sigma_{YY}(n) = E[Y(n)Y^T(n)] =
\begin{bmatrix}
r_{yy}(0) & r_{yy}(1) & \cdots & r_{yy}(N-1) \\
r_{yy}(1) & r_{yy}(0) & \cdots & r_{yy}(N-2) \\
\vdots & \vdots & \ddots & \vdots \\
r_{yy}(N-1) & r_{yy}(N-2) & \cdots & r_{yy}(0)
\end{bmatrix}$$

and insert this into 3.1, we get

$$\Sigma_{Yx}(n) - \Sigma_{YY}(n)\theta = 0,$$

from which we get the optimum

$$\theta_{opt}(n) = \Sigma_{YY}^{-1}(n)\,\Sigma_{Yx}(n).$$
Here $\theta$ depends on time, so an algorithm to update $\theta$ is also needed. A common method is to take a step in the negative gradient direction of $\mathrm{MSE}(n, \theta)$:

$$\hat{\theta}(n) = \hat{\theta}(n-1) - \frac{\mu}{2}\, \frac{\partial}{\partial \theta} \mathrm{MSE}(n, \theta)\Big|_{\theta=\hat{\theta}(n-1)}, \quad (3.2)$$

where $\mu$ is a variable that controls the step size of the algorithm: a large $\mu$ is fast but can be unstable, while a small $\mu$ is slow but generally more stable. The gradient can be written as

$$\frac{\partial}{\partial \theta} \mathrm{MSE}(n, \theta) = -2\Sigma_{Yx} + 2\Sigma_{YY}\theta. \quad (3.3)$$

Inserting 3.3 into 3.2, we get

$$\hat{\theta}(n) = \hat{\theta}(n-1) + \mu\left(\Sigma_{Yx}(n) - \Sigma_{YY}\hat{\theta}(n-1)\right). \quad (3.4)$$
3.1.1 The Least Mean Square algorithm
In general, the statistical information about the signals is not available; more likely, the only things available are $y(n)$ and $x(n)$. We will still use the steepest-descent algorithm, see equation 3.4, with some modifications.

Since the statistical information is not available, we cannot calculate the MSE. Instead we estimate it by relaxing the expression, dropping the expectation operator:

$$\widehat{\mathrm{MSE}}(n, \theta) = (x(n) - Y^T(n)\theta)^2.$$
11
The gradient is then

$$\frac{\partial}{\partial \theta} \widehat{\mathrm{MSE}}(n, \theta) = -2(x(n) - Y^T(n)\theta)Y(n). \quad (3.5)$$

If we insert 3.5 into 3.2, we get

$$\hat{\theta}(n) = \hat{\theta}(n-1) + \mu Y(n)\left(x(n) - Y^T(n)\hat{\theta}(n-1)\right). \quad (3.6)$$
The theory for this section was collected from Hjalmarsson et al [10], which also supplies more information about adaptive filters.
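The recursion in equation 3.6 is short enough to sketch directly. The thesis implementation is in Matlab; below is a hedged NumPy sketch of a one-step-ahead LMS predictor, where the function name, signal, and the values of N and µ are illustration choices of my own.

```python
import numpy as np

def lms_predict(signal, N=4, mu=0.1):
    """One-step-ahead LMS prediction of a 1-D signal.

    Implements theta(n) = theta(n-1) + mu*Y(n)*(x(n) - Y(n)^T theta(n-1)),
    equation 3.6, where Y(n) holds the N most recent past samples.
    Returns the prediction made for each sample."""
    theta = np.zeros(N)                  # filter weights, start at zero
    preds = np.zeros(len(signal))
    for n in range(N, len(signal)):
        Y = signal[n - N:n][::-1]        # most recent sample first
        preds[n] = Y @ theta             # predict x(n) from the past
        err = signal[n] - preds[n]       # estimation error
        theta = theta + mu * Y * err     # LMS weight update (eq. 3.6)
    return preds

# The predictor should lock on to a smooth, slowly varying signal
t = np.arange(200)
sig = np.sin(0.05 * t)
pred = lms_predict(sig, N=4, mu=0.1)
```

Because equation 3.6 drops the expectation, each weight update is noisy; µ trades convergence speed against stability, exactly as noted above.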
3.2 Motion detection
Motion detection is often built into a larger system and is tweaked to fit the other algorithms. One commonly used algorithm is to apply a threshold to a difference image:

$$d(x, y) = \begin{cases} 1 & \text{if } |img(x,y,t) - img(x,y,t-1)| > T \\ 0 & \text{else} \end{cases}$$

where $T$ is a threshold variable. Even better is to use a reference image

$$ref(x,y,t) = \alpha \cdot ref(x,y,t-1) + (1-\alpha) \cdot img(x,y,t) \quad (3.7)$$

and then apply the threshold against this reference image:

$$d(x, y) = \begin{cases} 1 & \text{if } |img(x,y,t) - ref(x,y,t)| > T \\ 0 & \text{else} \end{cases} \quad (3.8)$$

The rate at which the reference image is updated over time is controlled by $\alpha$ [1]. This is a fast algorithm, but sensitive to noise.
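Equations 3.7 and 3.8 can be illustrated with a few lines of code. This is not the thesis implementation (which is in Matlab); it is a NumPy sketch with arbitrary example values for α and T.

```python
import numpy as np

def update_reference(ref, img, alpha=0.9):
    """Equation 3.7: ref(t) = alpha*ref(t-1) + (1 - alpha)*img(t)."""
    return alpha * ref + (1.0 - alpha) * img

def detect_motion(img, ref, T=10.0):
    """Equation 3.8: d = 1 where |img - ref| > T, 0 elsewhere."""
    return (np.abs(img - ref) > T).astype(np.uint8)

# Toy example: an empty reference scene, then a bright 10x10 "object"
ref = np.zeros((48, 64))
frame = np.zeros((48, 64))
frame[10:20, 10:20] = 100.0
d = detect_motion(frame, ref, T=10.0)          # flags only the object pixels
ref = update_reference(ref, frame, alpha=0.9)  # reference slowly absorbs it
```

With α = 0.9 the reference moves only 10% of the way toward each new frame, so a briefly passing object barely contaminates it.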
Irani et al [11] have developed a method for robust tracking of motion. In the study they use multiple scales and translations to detect and track motions. Though this is a robust technique, it puts a heavy load on the hardware, especially at the resolutions used in the present thesis.
3.3 Pattern recognition
When you use pattern recognition algorithms, you can seldom supply raw data, such as a video or audio stream, directly to the algorithms. The algorithms need some sort of feature(s). These features span a domain called the feature space. The choice of feature space is essential, and in some cases even more critical than the choice of pattern recognition algorithm. This is because you want to keep the dimensionality as low as possible: the higher the dimensionality, the more training data is needed and the heavier the load the algorithms put on the computer; but if the dimensionality is too low, the ability to separate patterns is reduced. If all statistics are known in advance, it is possible to analytically derive an optimal decision surface. In reality, however, this never happens. Instead, a training set that is supposed to represent the distribution of the signal/pattern is used to tune the chosen algorithm. There is a number of different algorithms with different approaches to how the training set and the different a priori assumptions are used.
3.3.1 Parametric algorithms
Parametric algorithms use the training set to train distributions chosen beforehand. When the distributions of the different patterns have been trained, the decision boundary can be calculated using, for example, maximum likelihood or Bayesian parameter estimation. These algorithms generally have good convergence and performance if they are tuned right, but quite a lot of tuning is needed to adapt them to different problems. Another problem is the curse of dimensionality, which appears when the feature space increases in dimensionality [2]. To cope with this problem it is possible to use Principal Component Analysis (PCA), which uses eigenvectors to decrease the dimensionality of the feature space [2]. The strength of parametric algorithms is that knowledge about the distributions can be taken into account, making better use of the available training data.
3.3.2 Nonparametric algorithms
In the previous section we discussed the idea behind algorithms that uses
training data to estimate pre-decided distributions. Unfortunately the knowl-
edge about the distribution of the patterns is rarely available. Nonparametric
algorithms do not assume any special distribution, instead they rely on the
training data to be accurately representative of the patterns.
One of the most known nonparametric algorithms is kn nearest neighbors.
14
21. The algorithm uses the training data to calculate the kn nearest neighbors
to the point in the feature space corresponding to the pattern that is to be
classied. The pattern that the majority of the kn neighbors belongs to is
assumed to be the pattern connected with that point. The strength of this
algorithm is the fact that with sucient training data it is able to represent
complex distributions. The drawback is that it puts a heavy load on the
computer and the complexity increases with the dimensionality and number
of training data.
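As an illustration of the idea (not code from the thesis), a minimal nearest-neighbors classifier can be sketched as follows; the toy data and the value of k are arbitrary.

```python
import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    """Majority vote among the k training points nearest to x
    (Euclidean distance in the feature space)."""
    dists = np.linalg.norm(train_X - x, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most common label wins

# Two well-separated clusters in a 2-D feature space
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = np.array([0, 0, 0, 1, 1, 1])
```

Note that every query scans all training points, which is exactly the computational drawback mentioned above.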
3.3.3 Linear discriminant
In the previous sections, two techniques with different approaches to how the given training set is used have been discussed.

This third algorithm is more or less in between the two previous ones. We neither define a specific distribution in advance, nor keep all the training data as a basis for calculations during run time. The training data is used directly to train the classifier, which is a set of linear discriminant functions

$$g(x) = w^t x + w_0$$

where $x$ is the point in the feature space that is to be classified, $w$ is the weight vector and $w_0$ is the bias [1]. Depending on the problem to solve, a number of discriminant functions can be trained and used in recognition problems. For instance, if the classifier is supposed to be binary, one discriminant function is sufficient. If there are many patterns to classify, the discriminant functions can be designed in multiple ways:

• One versus all is a training technique where the discriminant function is trained to separate the pattern connected with the discriminant function from all the other patterns.

• One versus one creates multiple binary discriminants with two patterns versus each other.

• In a linear machine, one discriminant per pattern is trained. A point is classified as the pattern whose discriminant produces the highest value.

One problem with these algorithms is that there are regions where the classifier is undefined. The linear machine is the one that usually produces the smallest amount of undefined space; undefined space only occurs where two or more discriminant functions are equal.
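A linear machine over $g(x) = w^t x + w_0$ can be sketched in a few lines. This is a NumPy illustration with hand-picked weights, not code from the thesis:

```python
import numpy as np

def linear_machine(W, w0, x):
    """Evaluate one linear discriminant g_i(x) = w_i^T x + w0_i per class
    and return the index of the class with the highest value."""
    scores = W @ x + w0
    return int(np.argmax(scores))

# Hand-picked discriminants for three classes in a 2-D feature space
W = np.array([[ 1.0,  0.0],    # class 0 scores high for large x[0]
              [-1.0,  0.0],    # class 1 scores high for small x[0]
              [ 0.0,  1.0]])   # class 2 scores high for large x[1]
w0 = np.zeros(3)
```

Ties between discriminants (the undefined space mentioned above) occur only where two rows of W give exactly equal scores; argmax then simply picks the first.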
3.3.4 Support Vector Machines
Support Vector Machines (SVM) are basically the same as linear discriminants, see section 3.3.3, but with a few features added to enhance the behavior on small training sets and the ability to create more advanced hyperplanes.

The reason for wanting more advanced hyperplanes is that the dimensionality must be high enough to give good separation between the different patterns. To be able to create advanced hyperplanes, the input data is mapped into a higher dimension, which is often done by kernels. Once the data is mapped into the higher dimension, the new data is processed in the same manner as with a regular linear SVM. The techniques for choosing dimensions and making general kernels is a field of research out of scope for this thesis. [12]
The linear SVM is similar to the binary linear discriminant. The main difference from the linear discriminant function is that during training the SVM algorithm maximizes the distance between the training data and the hyperplane, called margin maximization. This often results in a hyperplane that produces good results even when only small training sets are available.
The training of the SVM is a minimization of a cost function

$$\frac{1}{2}||w||^2 + C \sum_{i=1}^{N} \left(1 - y_i f(x_i)\right)_+ \quad (3.9)$$

where $C$ is a tuning parameter that controls the trade-off between training errors and margin maximization [13]. The $(\,)_+$ function is plotted in figure 3.1. If $y_i f(x_i)$ is larger than 1 there is no penalty, but if $y_i f(x_i)$ is less than 1 there is a linear penalty scaled by the tuning parameter $C$.
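The cost in equation 3.9 is easy to write out. The sketch below (NumPy, bias term omitted for brevity, toy data invented) only evaluates the cost; actual training minimizes it over w.

```python
import numpy as np

def svm_cost(w, X, y, C=1.0):
    """Equation 3.9: 0.5*||w||^2 + C * sum_i (1 - y_i f(x_i))_+ ,
    with f(x) = w^T x."""
    margins = y * (X @ w)                   # y_i f(x_i) for every sample
    hinge = np.maximum(0.0, 1.0 - margins)  # the ( )_+ penalty of figure 3.1
    return 0.5 * (w @ w) + C * hinge.sum()

X = np.array([[2.0, 0.0], [-2.0, 0.0]])    # two toy samples
y = np.array([1.0, -1.0])
w_good = np.array([1.0, 0.0])              # classifies both with margin 2
w_weak = np.array([0.25, 0.0])             # correct, but inside the margin
```

The weak weight vector classifies both points correctly yet pays a hinge penalty, which is what pushes training toward a large margin.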
The SVM algorithm has been widely used in pattern recognition, mainly for its good generalization [14-17].
Figure 3.1: The $(\,)_+$ function used in the cost function, equation 3.9, for SVM training.

3.3.5 ψ-learning
ψ-learning is a variant of the SVM algorithm, modified to generally produce better results when faced with sparse, non-linearly separable training sets [18]. The mathematical difference lies in the cost function which, for ψ-learning, looks like

$$\frac{1}{2}||w||^2 + C \sum_{i=1}^{N} \left(1 - \psi(y_i f(x_i))\right). \quad (3.10)$$
This cost function is similar to the one for SVM (eq 3.9), but with a $\psi(\,)$ function instead of the $(\,)_+$ function. The $\psi(\,)$ function is plotted in figure 3.2. The difference between the two cost functions is that SVM generates a linear cost as soon as $y_i f(x_i) < 1$, i.e. for a training sample that is close to the decision hyperplane. In ψ-learning there is also a linear cost as soon as the training sample is close to the decision hyperplane, but this only holds until the sample becomes misclassified. At that point the cost is doubled but static. In practice this means that the algorithm does not care about the magnitude of a misclassification, only the fact that there is one.
The reason why this algorithm is more complex than SVM is that the minimization of the cost function, equation 3.10, cannot be solved directly with quadratic programming, as is the case with SVM [18].

Figure 3.2: The ψ function used in the cost function, equation 3.10, for ψ-learning training.
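The ψ function described above can be written out as follows. This is a sketch of one common form of ψ consistent with the description and with [18], not the thesis code; the exact shape near the boundary can be defined differently.

```python
import numpy as np

def psi(u):
    """One form of psi for equation 3.10: psi(u) = 1 for u >= 1 (no cost),
    psi(u) = u for 0 <= u < 1 (linear cost near the boundary), and
    psi(u) = -1 for u < 0 (constant doubled cost when misclassified)."""
    u = np.asarray(u, dtype=float)
    out = np.ones_like(u)
    out[u < 1.0] = u[u < 1.0]
    out[u < 0.0] = -1.0
    return out

def psi_cost(w, X, y, C=1.0):
    """Equation 3.10: 0.5*||w||^2 + C * sum_i (1 - psi(y_i f(x_i)))."""
    margins = y * (X @ w)
    return 0.5 * (w @ w) + C * np.sum(1.0 - psi(margins))
```

With this ψ, a misclassified sample always contributes cost 2 regardless of how far it lies from the hyperplane, which is exactly the "doubled but static" behavior described above.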
Chapter 4
Implementation
The methods in chapter 3 were implemented to create a system able to track a specific object in a video stream.
4.1 Initiation
The system needs to train the pattern recognition algorithm, and it requires a point from which to start tracking. This is done during the initiation phase. To initiate the pattern recognition algorithm, some data used for training the algorithm is needed. The first frame is presented and the object that is supposed to be tracked is chosen, see figure 4.1. When the training algorithm is finished, the user is prompted to choose a starting position, from which the system will start tracking. The training is further discussed in section 4.3.
(a) foreground (b) background

Figure 4.1: The user manually chooses which blocks are the foreground/object; everything else is background.
4.2 Detection
Since the system is given a starting point for the object, it only needs to act when movement occurs. The detection is therefore a motion detection algorithm. The technique is rather simple and the algorithm works in two steps. First the stream is filtered with a high-pass filter, and then a threshold is applied to the output in order to detect motion. Since this algorithm is very simple it is not robust; however, it is very fast. To reduce the impact of noise, we first run a low-pass filter on each frame. This is done with a filter kernel: if the scale of the filter is 5, then the filter kernel is a 5x5 kernel in which all elements are $1/5^2$. The result is a smoother image, see figure 4.2.
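The smoothing step can be sketched as follows. This is a NumPy illustration of a box kernel with all elements 1/scale²; the thesis does this in Matlab, and the edge handling here is an arbitrary choice of mine.

```python
import numpy as np

def box_smooth(img, scale=5):
    """Low-pass filter: convolve with a scale x scale kernel whose
    elements are all 1/scale^2 (edge pixels replicated at the border)."""
    k = scale // 2
    padded = np.pad(img, k, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(-k, k + 1):            # sum the shifted copies...
        for dx in range(-k, k + 1):
            out += padded[k + dy:k + dy + img.shape[0],
                          k + dx:k + dx + img.shape[1]]
    return out / (scale * scale)           # ...and average them

# A single bright pixel gets spread evenly over the kernel footprint
img = np.zeros((9, 9))
img[4, 4] = 25.0
smooth = box_smooth(img, scale=5)
```

A larger scale spreads each pixel over a wider area, which is why the images in figure 4.2 grow progressively blurrier.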
The filter is implemented with the help of a reference image, see figure 4.3,

$$ref_n = \alpha \cdot ref_{n-1} + (1 - \alpha) \cdot img_n \quad (4.1)$$
(a) Original (b) Scale = 15 (c) Scale = 30 (d) Scale = 45

Figure 4.2: Image at different scales.
where $ref_{n-1}$ is the previous reference image, $img_n$ is the current image from the stream, and $\alpha$ is a variable for tuning how fast the reference image adapts to changes. Subtracting the reference from the current image gives a value that describes the amount of change in color at every pixel:

$$diff_n = img_n - ref_n. \quad (4.2)$$

A threshold is applied to the $diff_n$ image, reducing the noise; at pixels with values ≠ 0 some kind of motion is assumed, see figure 4.3. [1]
(a) reference image, equation 4.1 (b) difference image, equation 4.2 (c) motion detected

Figure 4.3: Results from the detection algorithm. The motion-detected image is binary, with ones where the difference image has a value over a threshold and zeros otherwise.
4.3 Recognition
To be able to track a specific object, motion detection is not sufficient, since the detection algorithm does not give any information about what is moving. The Recognition block, see section 2.1, is responsible for recognizing the object that is supposed to be tracked.
The recognition system in the present thesis is based on the system used for video object segmentation in Liu et al [13]. The learning algorithm used is ψ-learning, described in section 3.3.5. The algorithm is trained during the initiation process and is then used throughout the whole simulation.
4.3.1 Feature Space
The ψ-learning algorithm does not work directly on the image, so it needs to be provided with some form of feature space. The feature space is calculated on blocks of 9x9 pixels; the image is therefore divided into such blocks. There is an overlap of 1 pixel between the blocks: the first block spans pixels 0-8, the second block pixels 8-16, and so on, for both x and y coordinates. The feature space is a 24-dimensional space, 8 dimensions for each color channel:

1. $c(0,0)$

2. $\sum_{j=1}^{N-1} c(0,j)^2$

3. $\sum_{k=1}^{N-1} c(k,0)^2$

4. $\sum_{k=1}^{N-1} \sum_{j=1}^{N-1} c(k,j)^2$

5. $(B_{(-1,-1)} + B_{(-1,0)} + B_{(-1,1)})/3$

6. $(B_{(-1,1)} + B_{(0,1)} + B_{(1,1)})/3$

7. $(B_{(1,-1)} + B_{(1,0)} + B_{(1,1)})/3$

8. $(B_{(-1,-1)} + B_{(0,-1)} + B_{(1,-1)})/3$

where $c(k,j)$ are the coefficients of the Discrete Cosine Transform (DCT) calculated on the 9×9 blocks; the system uses Matlab's dct2. In this case only the first 3 coefficients ($N = 3$) of the DCT are used, since the high-frequency coefficients tend to be small. The last 4 dimensions are the average colors of the 9x9 neighboring blocks on each side, see figure 4.4. The combination of DCT coefficients and neighboring-block color values gives good classification of surfaces as well as grouping information, which reduces the impact of noise. [13]

B(-1,-1)  B(-1,0)  B(-1,1)
B(0,-1)   B(0,0)   B(0,1)
B(1,-1)   B(1,0)   B(1,1)

Figure 4.4: Neighbouring blocks of 9x9 pixels.
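The DCT part of the feature vector (dimensions 1-4) can be sketched as below. The thesis uses Matlab's dct2; here the orthonormal 2-D DCT-II is written out explicitly in NumPy, and the four neighbor-color dimensions are omitted for brevity.

```python
import numpy as np

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block (the transform that
    Matlab's dct2 computes), built from the 1-D DCT basis matrix."""
    n = block.shape[0]
    i = np.arange(n)
    D = np.cos(np.pi * (2 * i[None, :] + 1) * i[:, None] / (2 * n))
    D *= np.sqrt(2.0 / n)
    D[0, :] = np.sqrt(1.0 / n)       # the DC row has a different scale
    return D @ block @ D.T

def dct_features(block, N=3):
    """Dimensions 1-4 of the feature space for one 9x9 block:
    the DC coefficient and three sums of squared low-frequency terms."""
    c = dct2(block)
    return np.array([
        c[0, 0],                      # 1. c(0,0)
        np.sum(c[0, 1:N] ** 2),       # 2. sum_j c(0,j)^2
        np.sum(c[1:N, 0] ** 2),       # 3. sum_k c(k,0)^2
        np.sum(c[1:N, 1:N] ** 2),     # 4. sum over both k and j
    ])

f = dct_features(np.ones((9, 9)))     # a flat block has only a DC term
```

A flat surface produces almost nothing outside the DC coefficient, which is why the neighbor-color dimensions are needed for grouping information.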
4.3.2 Training
When the object has been chosen as described in section 4.1, the algorithm is trained using this data; the blocks that are not chosen are used as background, see figure 4.1. The training is done with Matlab's fminsearch. fminsearch needs a starting point in the feature space. This starting point is calculated using the minimum squared error solution with the pseudoinverse

$$w = (A^T A)^{-1} A^T Y$$

where $w$ is the weight vector, $A$ is a matrix in which each row represents a training point, and $Y$ is a matrix containing rows with the corresponding class for each training point.
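The starting-point computation is a standard least-squares fit. A NumPy sketch follows (toy data invented; np.linalg.pinv is used instead of forming $(A^T A)^{-1}$ explicitly, which is numerically equivalent here):

```python
import numpy as np

def mse_start_point(A, Y):
    """Minimum squared error solution w = (A^T A)^{-1} A^T Y,
    computed via the pseudoinverse of A."""
    return np.linalg.pinv(A) @ Y

# Rows of A are training points (last column is a bias term);
# Y holds the class labels: +1 foreground, -1 background.
A = np.array([[ 1.0,  2.0, 1.0],
              [ 1.5,  1.8, 1.0],
              [-1.0, -2.0, 1.0],
              [-1.2, -1.7, 1.0]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
w = mse_start_point(A, Y)
```

The optimizer then refines this w by minimizing the ψ-learning cost directly.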
(a) Classification output (b) Frame

Figure 4.5: Classification of an entire frame. The green dots represent blocks classified as foreground/object and the red ones blocks classified as background.
4.3.3 Detecting
When the training is done, each frame needs to be converted into the feature space. The image is divided into blocks as described in section 4.3.1, and each block is then classified as either foreground or background, see figure 4.5. To handle noise better, at least two blocks need to be connected in order for them to be accepted as part of the object.
4.4 Updating
When the detection is finished, a point of interest, which is used during optimization of the system, is calculated. The point of interest is computed by finding the block(s) with the lowest y coordinate ((0, 0) is in the upper left corner), and then taking the mean of the x coordinates in that group.
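The rule above can be sketched directly. This is a NumPy illustration with invented block coordinates; (0, 0) is the upper-left block.

```python
import numpy as np

def point_of_interest(blocks):
    """From (row, col) coordinates of blocks classified as the object,
    take the blocks with the lowest row value (closest to the top of
    the image) and return that row and the mean of their columns."""
    blocks = np.asarray(blocks)
    top_row = blocks[:, 0].min()
    top_cols = blocks[blocks[:, 0] == top_row, 1]
    return top_row, top_cols.mean()

# E.g. a hand whose topmost blocks sit at row 2, columns 4 and 6
obj_blocks = [(2, 4), (2, 6), (3, 5), (4, 5)]
row, col = point_of_interest(obj_blocks)
```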
4.5 Prediction
An LMS filter, see section 3.1.1, is used to predict the next point of interest, which is used in the optimization of the system.

The LMS filter is designed as a one-step-ahead predictor [10]: we want to predict the next coordinate using previous observations. Two filters were implemented, one for each coordinate:

$$\hat{x}(n+1) = \sum_{k=0}^{N} \theta_x(k)\, x(n-k)$$

$$\hat{y}(n+1) = \sum_{k=0}^{N} \theta_y(k)\, y(n-k).$$

During simulations the filter mostly kept the previous 6 ($N = 6$) coordinates, and µ was set around $10^{-8}$.
4.6 Optimization
To make the system run faster, a number of constraints were added in order to reduce the workload.

The detection described in section 4.2 is based on a filter which uses earlier images. Therefore it is not suitable to reduce the workload simply by processing only parts of the image.
The task that generated the heaviest load on the computer was the conversion from the pixel blocks to the feature space. In the study by Yi Liu et al [13], which uses the same feature space, calculation of the DCT is the major contributor to this load. Therefore, two constraints need to be fulfilled in order for a block to be converted. The first constraint is that only a certain number of blocks, σ, around the previous point of interest is checked. During simulations, typical values of σ were 5, 7 and 11. In an image with resolution 640 × 480 there are 4524 blocks on which the conversion would need to be made. With σ = 7, and therefore only 225 blocks, the number of conversions is reduced by a factor of 20. The other constraint is that a block is only converted if motion is detected, see section 4.2, in a certain percentage, γ, of the pixels in the block. Typical values of γ during simulations were 60-80%. After these constraints were applied, the conversion was no longer the bottleneck of the system.
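The two gating constraints can be sketched as follows. This is a NumPy illustration, not the thesis code: the block grid (9x9 blocks overlapping by 1 pixel) and the parameter roles follow the text, but names, edge handling, and the toy input are my own.

```python
import numpy as np

def blocks_to_convert(motion, poi, sigma=7, gamma=0.8, bsize=9):
    """Select the blocks to convert to the feature space.

    A block (r, c) qualifies only if it lies within `sigma` blocks of
    the previous point of interest `poi` (constraint 1) and at least a
    fraction `gamma` of its pixels are flagged in the binary `motion`
    image (constraint 2)."""
    step = bsize - 1                    # blocks overlap by one pixel
    n_rows = (motion.shape[0] - 1) // step
    n_cols = (motion.shape[1] - 1) // step
    selected = []
    for r in range(n_rows):
        for c in range(n_cols):
            if abs(r - poi[0]) > sigma or abs(c - poi[1]) > sigma:
                continue                # outside the search window
            patch = motion[r * step:r * step + bsize,
                           c * step:c * step + bsize]
            if patch.mean() >= gamma:   # enough pixels in motion
                selected.append((r, c))
    return selected

# Motion only in the top-left 9x9 block of a small test image
motion = np.zeros((65, 65))
motion[0:9, 0:9] = 1.0
```

Both tests are cheap compared to the DCT, so almost all of the per-block cost is skipped for blocks that fail either constraint.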
Chapter 5
Result
The system was tested by following a person's hand. The camera was mounted on the screen and the person sat down in front of the camera.
5.1 Simulations
The simulations were made on a sequence of 91 frames, with 5 different values of σ. Data on tracking error (Euclidean distance from ground truth) and the number of blocks calculated, see section 4.6, was collected. How the system performed with different σ is presented in figure 5.1, and table 5.1 shows the average values over all frames. The plots are shown separately in appendix B.
Besides σ, a number of other variables affect the performance of the system. Four variables control the motion detection: scale, which controls the smoothing of the image, see figure 4.2; α, which controls the rate at which the reference image is updated over time; diffThres, which tunes at what point a difference should be classed as motion; and γ, which controls the percentage of the pixels in a block that need to be classified as motion for the block to be evaluated. Two variables control the prediction: filterLength, which is the length of the LMS filter, and µ, which controls the step size of the LMS filter. During the simulations the variables were set to

scale = 15
α = 0.9
diffThres = 14
γ = 80%
filterLength = 6
µ = 13 · 10⁻⁷.

Figure 5.1: Plots of the simulations. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.

σ     Tracking error   Blocks calculated
5     43.29            21.90
7     26.18            31.96
9     19.12            53.38
11    19.01            78.56
13    143.62           59.03

Table 5.1: The average values of the plots in figure 5.1.
5.2 Color spaces
A number of color spaces were evaluated to see if there were any major differences in performance. The error rate on the training set after training, i.e. the amount of misclassifications when trying to classify the training set, is presented in table 5.2.

color space       foreground error   background error   total error
RGB               0.76%              10.57%             8.84%
normalized RGB    1.82%              12.82%             11.26%
HSV               0.15%              10.57%             9.10%
TSL               5.77%              11.04%             10.30%
YCrCb             1.82%              13.52%             11.86%
NTSC              1.67%              7.17%              6.39%

Table 5.2: Error rates for a number of color spaces.

The conversion from the RGB image was done either with Matlab's built-in functions or as described in the study by Sazonov [19]. The reason why the background has such a high error rate is that in the example in section 4.1, see figure 4.1, the face is not a part of the object but has features similar to the hand. The NTSC conversion (the YIQ color space) is supplied by Matlab and was used most extensively during the tests.
5.3 Tracking
Under favorable conditions, such as sufficient light and little or no disturbance in the background, the tracking worked well. The system still managed when noise, such as back light and/or motion of other objects in the background, was introduced. The filter allowed the system to keep working even though the tracking failed during small portions of time, and it was able to snap on again after a few frames. Due to limitations in the system, the tracking will fail if a block is misclassified as the object, which only occurs if motion is detected in the block. This occurs at frame 38 with σ = 13. The reason why the system performs well with σ = 9 and σ = 11 is that this gives a search area large enough to track well, while still small enough to miss eventual noise in other parts of the image. Fast motion is something a standard DV camera is unable to handle: it introduces motion blur, see figure 5.2, causing the hand to blur out into the background and change in color and texture.

Figure 5.2: 2 frames with motion blur.
5.4 Speed
Since the system is implemented in Matlab, it is hard to judge whether it could run in real time or not. With the system optimized as described in section 4.6, it runs on the Apple computer at roughly 1.3 fps. This frame rate is achieved even though Matlab does not utilize both processors and has poor performance when it comes to loops, since it does not optimize them as programs written in C/C++ would; also, the code written is not optimal in terms of minimizing workload. For each frame, roughly 20-200 blocks, depending on the size of σ, need to be calculated, and the detection part consists of pixel-by-pixel computations. Therefore this system could utilize the full power of computers with multiple cores, and perhaps even distributed systems.
Chapter 6
Discussion
The system works overall as expected. It outperforms advanced algorithms in terms of the computational power needed, and is more stable than the fast ones. A drawback is that the system parameters were dependent on the object and its surroundings. Much of the failure could probably be compensated for with more complex equipment. A more advanced camera could be configured to use a shorter shutter time, reducing the problem of tracking failure during motion blur.

Problems due to limitations in the algorithms of the system are harder. For example, when the tracker fails because of misclassification and motion, the problem will not be solved with better hardware. Also, if the object is big and has no texture, so that it is registered as a flat surface, the motion algorithm will only detect motion at the contours, giving a false representation of the object.
To improve the system, it might be possible to model the shape of the object and feed that to an adaptive filter, such as the Kalman filter [10, 20]. Introducing the Kalman filter would allow more complex constraints that are also adaptive during runtime. For example, the updating of the point of interest could be forced to be more like the motions of a human hand, and the shape could be forced to change more continuously. The drawback of these constraints is that the system becomes less general and harder to configure.
6.1 Future work
Though not in the scope of this thesis, the performance of the system could probably be improved by implementing it in a low-level language such as C or C++. The code could then be optimized further, making sure no unnecessary computations are made. Not until then will we be able to measure how well the system performs in real time. Stereo vision might make foreground detection easier; however, a real-time stereo system is not trivial. To make the system even faster it could be possible, for a simple object like a hand, to use simpler pattern recognition algorithms.
Appendix A
Mathematical cornerstones
A.1 Statistical Theory
Many of today's algorithms and systems use different forms of a priori knowledge to enhance their results.
Probabilities

There are a few probabilities that are frequently used when working with pattern recognition and other statistical frameworks. First there is the regular probability

P_X(x),

which describes how likely it is that the variable X takes the value x (P(x) and P(X = x) are different notations for the same thing).
Then there is the joint probability

P_{X,Y}(x, y),

which describes how likely it is that X takes the value x and Y takes the value y (P(x, y) and P(X = x, Y = y) are different notations for the same thing). The conditional probability

P_{X|Y}(x|y)

describes how likely it is that X takes the value x given that Y takes the value y (P(x|y) and P(X = x|Y = y) are different notations for the same thing). The definition is

P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y).
Bayes' formula

If we know both P_X(x) and P_{Y|X}(y|x), we can, from the definition of conditional probability, get

P_{X,Y}(x, y) = P_{X|Y}(x|y) P_Y(y) = P_{Y|X}(y|x) P_X(x),

which can be rewritten as

P_{Y|X}(y|x) = P_{X|Y}(x|y) P_Y(y) / P_X(x).

This is known as Bayes' formula [2, 21].
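As a quick numeric sanity check of these identities, the sketch below evaluates Bayes' formula on a small made-up joint distribution over two binary variables; all probability values are illustrative.

```python
# Joint distribution P_{X,Y}(x, y) over X in {0, 1}, Y in {0, 1}.
P_xy = {(0, 0): 0.1, (0, 1): 0.3,
        (1, 0): 0.2, (1, 1): 0.4}

def P_x(x):
    # Marginal P_X(x) = sum over y of P_{X,Y}(x, y).
    return sum(p for (xi, _), p in P_xy.items() if xi == x)

def P_y(y):
    # Marginal P_Y(y) = sum over x of P_{X,Y}(x, y).
    return sum(p for (_, yi), p in P_xy.items() if yi == y)

def P_x_given_y(x, y):
    # Definition of conditional probability.
    return P_xy[(x, y)] / P_y(y)

def P_y_given_x_bayes(y, x):
    # Bayes' formula: P(y|x) = P(x|y) P(y) / P(x).
    return P_x_given_y(x, y) * P_y(y) / P_x(x)

def P_y_given_x_direct(y, x):
    # Direct route via the joint distribution, for comparison.
    return P_xy[(x, y)] / P_x(x)
```

For every (x, y) pair, `P_y_given_x_bayes` and `P_y_given_x_direct` agree, confirming that the formula only rearranges the definition of conditional probability.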
Expected value

The expected value is the mean of a stochastic variable, or of a function of it:

E[X] = m_X
E[f(X)] = m_{f(X)}.

For a discrete stochastic variable the expected value is calculated as

E[X] = Σ_{x ∈ X} x P_X(x).
Variance

The expected value gives the mean value of the stochastic variable or function. The variance gives the expected value of the squared distance between the stochastic variable and m_X:

Var[X] = σ² = E[(X − m_X)²].

The variance can also be expressed as

Var[X] = E[X²] − (E[X])²
Var[f(X)] = E[f²(X)] − (E[f(X)])².
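The two expressions for the variance can be checked numerically; the discrete distribution below is made up purely for illustration.

```python
# A small discrete pmf P_X (values chosen for illustration; they sum to 1).
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

E_X  = sum(x * p for x, p in pmf.items())       # E[X]
E_X2 = sum(x * x * p for x, p in pmf.items())   # E[X^2]

# Variance via the definition E[(X - m_X)^2] ...
var_def = sum((x - E_X) ** 2 * p for x, p in pmf.items())
# ... and via the shortcut E[X^2] - (E[X])^2.
var_short = E_X2 - E_X ** 2
```

Both routes give the same number (0.49 for this pmf), as the algebra above guarantees.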
Appendix B
Simulation plots
The plots of the simulations described in section 5.1, separated into independent plots. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.
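The tracking-error metric can be stated compactly; the function below is our sketch of it, not code from the thesis.

```python
import math

def tracking_error(estimates, ground_truth):
    """Per-frame Euclidean distance between estimated and true
    (x, y) positions; inputs are equal-length sequences of pairs."""
    return [math.dist(e, g) for e, g in zip(estimates, ground_truth)]
```

For example, an estimate of (3, 4) against a true position of (0, 0) contributes an error of 5 pixels for that frame.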
Bibliography

[1] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Prentice-Hall, Inc., second edition, 2001.

[2] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, John Wiley & Sons, Inc., second edition, 2001.

[3] Ville Kyrki and Danica Kragić, Tracking rigid objects using integration of model-based and model-free cues, 2005.

[4] Nikolaos D. Doulamis, Anastasios D. Doulamis, and Klimis Ntalianis, Adaptive classification-based articulation and tracking of video objects employing neural network retraining.

[5] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–575, 2003.

[6] Aishy Amer, Voting-based simultaneous tracking of multiple video objects, 2003, vol. 5022, pp. 500–511, SPIE.

[7] Danica Kragić, Visual Servoing for Manipulation: Robustness and Integration Issues, Ph.D. thesis, Royal Institute of Technology, 2001.
[8] A. Cavallaro, O. Steiger, and T. Ebrahimi, Tracking video objects in cluttered background, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 4, pp. 575–584, 2005.

[9] M. Gastaud, M. Barlaud, and G. Aubert, Tracking video objects using active contours, in MOTION '02: Proceedings of the Workshop on Motion and Video Computing, Washington, DC, USA, 2002, p. 90, IEEE Computer Society.

[10] Håkan Hjalmarsson and Björn Ottersten, Lecture notes in adaptive signal processing, Tech. Rep., Signals, Sensors and Systems, Stockholm, Sweden, 2002.

[11] Michal Irani, Benny Rousso, and Shmuel Peleg, Computing occluding and transparent motions, Tech. Rep., Institute of Computer Science, Jerusalem, Israel, 1994.

[12] Christopher J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[13] Yi Liu and Yuan F. Zheng, Video object segmentation and tracking using ψ-learning, IEEE Transactions on Circuits and Systems for Video Technology, 2005.

[14] Anastasios Tefas, Constantine Kotropoulos, and Ioannis Pitas, Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication, IEEE Trans. Pattern Anal. Mach. Intell., 2001.
[15] Daniel J. Sebald and James A. Bucklew, Support vector machine techniques for nonlinear equalization, IEEE Transactions on Signal Processing, 2000.

[16] Edgar Osuna, Robert Freund, and Federico Girosi, Training support vector machines: an application to face detection, IEEE Computer Vision and Pattern Recognition, 1997.

[17] Massimiliano Pontil and Alessandro Verri, Support vector machines for 3D object recognition, IEEE Trans. Pattern Anal. Mach. Intell., 1998.

[18] Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing Hung Wong, On ψ-learning, Journal of the American Statistical Association, 2003.

[19] Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, A survey on pixel-based skin color detection techniques, Tech. Rep., Graphics and Media Laboratory, Faculty of Computational Mathematics and Cybernetics, Moscow, Russia, 2003.

[20] Monson H. Hayes, Statistical Digital Signal Processing and Modeling, John Wiley & Sons, Inc., first edition, 1996.

[21] Arne Leijon, Pattern recognition, Tech. Rep., Signals, Sensors and Systems, Stockholm, Sweden, 2005.