Image processing and object tracking from single camera

JOHAN SOMMERFELD

Master's Degree Project
Stockholm, Sweden 2006-12-13
Abstract

In the last decades, computers have gained the ability to perform huge amounts of calculation and to handle information flows we thought impossible ten years ago. Despite this, a computer can extract only little information from an image in comparison to human seeing. The way the human brain filters out useful information is not fully known, and this skill has not been merged into computer vision science.

The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should combine fast and advanced algorithms, aiming for a better trade-off between accuracy and speed than either fast or advanced algorithms achieve alone. The system is tested by trying to follow the hand of a person placed in front of a computer, with the web camera mounted on the screen.

The goal is a system with the potential to be implemented in a real-time environment; therefore the system needs to be very fast. The work in this thesis is an initial step and will not be implemented to run in real time.

The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps).

The system works overall as expected and was able to track a person's hand in numerous configurations. It outperforms the advanced algorithms in terms of the computational power needed, and is more stable than the fast ones. A drawback is that the system parameters depend on the object and its surroundings.




Acknowledgments

This thesis was written at the Sound and Image Processing Laboratory at the School of Electrical Engineering, Royal Institute of Technology (KTH), during the school year 2005-2006. I would like to take this opportunity to thank my supervisor M.Sc. Anders Ekman for his patience when things progressed a bit slowly, Ph.D. Disa Sommerfeld for proofreading, and assistant professor Danica Kragić for pushing me forward.
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
    2.1 System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
    2.2 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
    3.1 Adaptive filters . . . . . . . . . . . . . . . . . . . . . . . . . 9
        3.1.1 The Least Mean Square algorithm . . . . . . . . . . . . . . 11
    3.2 Motion detection . . . . . . . . . . . . . . . . . . . . . . . . . 12
    3.3 Pattern recognition . . . . . . . . . . . . . . . . . . . . . . . 13
        3.3.1 Parametric algorithms . . . . . . . . . . . . . . . . . . . 14
        3.3.2 Nonparametric algorithms . . . . . . . . . . . . . . . . . . 14
        3.3.3 Linear discriminant . . . . . . . . . . . . . . . . . . . . 15
        3.3.4 Support Vector Machines . . . . . . . . . . . . . . . . . . 16
        3.3.5 ψ-learning . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
    4.1 Initiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
    4.2 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
    4.3 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
        4.3.1 Feature Space . . . . . . . . . . . . . . . . . . . . . . . 25
        4.3.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . 26
        4.3.3 Detecting . . . . . . . . . . . . . . . . . . . . . . . . . 27
    4.4 Updating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
    4.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
    4.6 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
    5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
    5.2 Color spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
    5.3 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    5.4 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
    6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

A Mathematical cornerstones . . . . . . . . . . . . . . . . . . . . . . . . 39
    A.1 Statistical Theory . . . . . . . . . . . . . . . . . . . . . . . . 39

B Simulation plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chapter 1

Introduction

In the last decades, computers have gained the ability to perform huge amounts of calculation and to handle information flows we thought impossible ten years ago. Despite this, a computer can extract only little information from an image in comparison to human seeing. The way the human brain filters out useful information is not fully known, and therefore this skill has not been merged into computer vision science.




1.1 Background

Even if we have not been able to teach a computer to process visual input in a complex sense, there is quite a lot a computer can do when it comes to following movement and performing simpler recognition tasks.

One of the key features of a computer vision system is the ability to extract interesting areas (the foreground). Research on this has mainly two approaches. The first group uses advanced algorithms for pattern recognition to extract the foreground. These methods often make little use of temporal redundancy, and are slow because of the large amount of computation needed. The second approach is different, often using pixel-by-pixel computations with only a few computations per pixel. In general the latter methods are fast and may be implemented in real-time applications. Their drawback is that, due to the lack of complexity in the algorithms, they are sensitive to noise and often need a static environment to be able to function.




1.2 Related work

There are a few simple algorithms for tracking, for example: detection of discontinuities using Laplacian and Gaussian filters, often implemented with a simple kernel [1]; thresholding; and motion detection with a reference image. These algorithms are simple, but sensitive to noise and hard to generalize. A set of more advanced algorithms involves iterations and/or transformations, such as the Hough transform, region-based segmentation and morphological segmentation. These algorithms are generally more stable with respect to noise, although as pictures and/or frames grow larger, these algorithms get slow [1].

Other algorithms make use of pattern recognition, such as neural networks, maximum likelihood and support vector machines [2]. First the image has to be translated into something that the pattern recognition algorithms understand: the image is processed into a so-called feature vector. The majority of pattern recognition algorithms require a set of training data to form the decision boundary. The training is often slow; thereafter, however, the algorithm is fast. The problem is that extracting the feature vector might be a demanding task for the computer.

There are a number of interesting approaches to object tracking. Kyrki et al. [3] use model-based features, such as a wire frame, combined with model-free features, such as points of interest on a calculated surface. Doulamis et al. [4] use an implementation of neural networks to track objects in a video stream; the neural network is adaptive and changes over time as the object translates. Comaniciu et al. [5] use a kernel-based solution for identifying an object. Amer [6] uses voting-based features and motion to detect objects, tuned for real-time processing. In the PhD thesis by Kragić [7], a multiple-cue algorithm is presented, using features that are fast to compute and relying on the assumption that not all cues fail at the same time. Cavallaro et al. [8] present a hybrid algorithm using information about both objects and regions. Gastaud et al. [9] track objects using active contours.

Kragić [7] uses multiple cues for better tracking. Instead of combining multiple cues of fast algorithms, the approach in the present thesis takes advantage of both the fast and the advanced algorithms, in order to achieve a system that outperforms the simple algorithms and operates faster than the advanced ones.




Chapter 2

Problem

The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should combine fast and advanced algorithms, aiming for a better trade-off between accuracy and speed than either fast or advanced algorithms achieve alone. The system is tested by trying to follow the hand of a person placed in front of a computer, with the web camera mounted on the screen.




2.1 System

The goal is a system with the potential to be implemented in a real-time environment; therefore the system needs to be very fast. A higher accuracy than that of the simple methods described in section 1.2 also needs to be achieved. The system will use algorithms that need training. The work in this thesis is an initial step and will not be implemented to run in real time. At the start, the user is asked to tell the system the whereabouts of the object to track. Since the system is implemented in Matlab, only a proof of concept is possible to achieve.

Figure 2.1: The main blocks of the system.


More specifically, the system is based on four blocks, see figure 2.1.

Detection is responsible for detecting and segmenting the interesting parts of the image. The main algorithm here is most often one of the fast algorithms described in section 1.2.

Recognition is responsible for classifying the foreground extracted from the image by the detection block.

Updating is responsible for updating the representation of the tracked object, using information generated by the recognition block.

Prediction is responsible for using all available information to predict where the segmentation in the Detection block should start, to minimize both the time consumed and the error probability.
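The interaction of the four blocks can be sketched as a simple loop. The thesis implementation is in Matlab; the sketch below uses Python, and every name in it (`TrackerModel`, `detect_motion`, the stub bodies) is an illustrative placeholder, not the thesis code.

```python
from dataclasses import dataclass, field

# Skeleton of the four-block loop: Detection -> Recognition -> Updating
# -> Prediction. All stubs are placeholders for the real algorithms.

@dataclass
class TrackerModel:
    history: list = field(default_factory=list)

    def classify(self, region):
        return True                      # Recognition stub: accept everything

    def update(self, matches):
        self.history.append(matches)     # Updating stub: remember the evidence

    def predict(self, matches, position):
        # Prediction stub: start the next search at the last match
        return matches[-1] if matches else position

def detect_motion(frame, around):
    return [frame]                       # Detection stub: one candidate region

def track(frames, start, model):
    position = start
    for frame in frames:
        candidates = detect_motion(frame, around=position)      # Detection
        matches = [c for c in candidates if model.classify(c)]  # Recognition
        model.update(matches)                                   # Updating
        position = model.predict(matches, position)             # Prediction
    return position
```

The point of the structure is that each block can be swapped independently, e.g. a better predictor shrinks the area the detector must search.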




2.2 Hardware

The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps). The computer is an Apple Power Mac G5 with two 2.7 GHz processors. The camera is an Apple iSight, whose video stream is in DV format. However, Matlab's Unix version only accepts uncompressed video, and therefore the stream is converted to uncompressed video with true color.




Chapter 3

Method

3.1 Adaptive filters

An adaptive filter is a filter that changes over time depending on the signal. For a summary of the statistical theory used, see appendix A.1. Assume two non-stationary signals with zero mean and known second-order statistics, i.e. the covariance and cross-covariance




    r_{yy}(n,m) = E[y(n)y(n+m)]
    r_{xy}(n,m) = E[x(n)y(n+m)].


The problem of estimating x(n) given past y(n) may be written as

    \hat{x}(n) = \sum_{k=0}^{N-1} \theta(k) y(n-k) = Y^T(n)\theta,

where Y(n) = [y(n), \ldots, y(n-N+1)]^T and \theta = [\theta(0), \ldots, \theta(N-1)]^T. The MSE is then given by

    \mathrm{MSE}(n,\theta) = E[(x(n) - \hat{x}(n))^2].

The optimal \theta is obtained from the orthogonality condition, which states that Y^T(n)\theta is the linear MMSE estimate of x(n) if the estimation error is orthogonal to the observations Y(n):

    E[(x(n) - Y^T(n)\theta)\,Y(n)] = 0.    (3.1)




If we define the covariance matrices

    \Sigma_{Yx}(n) = [r_{xy}(n,n), \ldots, r_{xy}(n,n-N+1)]^T
    \Sigma_{YY}(n) = E[Y(n)Y^T(n)] =
      \begin{pmatrix}
        r_{yy}(0)   & r_{yy}(1)   & \cdots & r_{yy}(N-1) \\
        r_{yy}(1)   & r_{yy}(0)   & \cdots & r_{yy}(N-2) \\
        \vdots      &             & \ddots & \vdots      \\
        r_{yy}(N-1) & r_{yy}(N-2) & \cdots & r_{yy}(0)
      \end{pmatrix}

and insert this into 3.1, we get

    \Sigma_{Yx}(n) - \Sigma_{YY}(n)\theta = 0,

from which we obtain

    \theta_{opt}(n) = \Sigma_{YY}^{-1}(n)\,\Sigma_{Yx}(n).
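As an illustration of the closed-form solution \theta_{opt} = \Sigma_{YY}^{-1}\Sigma_{Yx}, the sketch below (Python/NumPy rather than the thesis' Matlab) estimates the covariances by time averages, which assumes stationarity, and solves the resulting linear system. The test signal and filter are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
true_theta = np.array([0.5, -0.3, 0.2, 0.1])
y = rng.standard_normal(100_000)            # observed signal
x = np.convolve(y, true_theta)[: len(y)]    # x(n) = sum_k theta(k) y(n-k)

# Time-average estimates of the covariances (stationarity assumed):
# r_yy(m) = E[y(n)y(n+m)] and Sigma_Yx[k] = E[x(n)y(n-k)]
def r_yy(m):
    return np.mean(y[: len(y) - m] * y[m:])

Sigma_YY = np.array([[r_yy(abs(i - j)) for j in range(N)] for i in range(N)])
Sigma_Yx = np.array([np.mean(x[k:] * y[: len(y) - k]) for k in range(N)])

# Solve Sigma_YY theta = Sigma_Yx rather than forming the inverse explicitly
theta_opt = np.linalg.solve(Sigma_YY, Sigma_Yx)
```

With enough samples, `theta_opt` recovers the filter that generated x from y.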
Here \theta depends on time, so an algorithm to update \theta is also needed. A common method is to take a step in the negative gradient direction of \mathrm{MSE}(n,\theta):

    \hat{\theta}(n) = \hat{\theta}(n-1) - \frac{\mu}{2}\,\frac{\partial}{\partial\theta}\mathrm{MSE}(n,\theta)\Big|_{\theta=\hat{\theta}(n-1)},    (3.2)

where \mu is a variable that controls the step size of the algorithm: a large \mu is fast but can be unstable, and a small \mu is slow but generally more stable. The gradient can be written as

    \frac{\partial}{\partial\theta}\mathrm{MSE}(n,\theta) = -2\Sigma_{Yx} + 2\Sigma_{YY}\theta.    (3.3)

Inserting 3.3 into 3.2, we get

    \hat{\theta}(n) = \hat{\theta}(n-1) + \mu\left(\Sigma_{Yx}(n) - \Sigma_{YY}\hat{\theta}(n-1)\right).    (3.4)




3.1.1 The Least Mean Square algorithm

In general, the statistical information about the signals is not available; more likely, the only things available are y(n) and x(n). We will still use the steepest descent algorithm, see equation 3.4, with some modifications.

Since the statistical information is not available, we cannot calculate the MSE. Instead we estimate it by relaxing the expression, dropping the expectation operator:

    \mathrm{MSE}(n,\theta) = (x(n) - Y^T(n)\theta)^2.

The gradient is then

    \frac{\partial}{\partial\theta}\mathrm{MSE}(n,\theta) = -2(x(n) - Y^T(n)\theta)\,Y(n),    (3.5)

and inserting 3.5 into 3.2 gives

    \hat{\theta}(n) = \hat{\theta}(n-1) + \mu\,Y(n)\left(x(n) - Y^T(n)\hat{\theta}(n-1)\right).    (3.6)

The theory for this section was collected from Hjalmarsson et al. [10], which also supplies more information about adaptive filters.
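Equation 3.6 translates almost line by line into code. The thesis works in Matlab; the sketch below uses Python/NumPy with a synthetic test signal, and all names are illustrative.

```python
import numpy as np

def lms(x, y, N=4, mu=0.01):
    """LMS filter, equation 3.6:
    theta(n) = theta(n-1) + mu * Y(n) * (x(n) - Y(n)^T theta(n-1)),
    with Y(n) = [y(n), ..., y(n-N+1)]."""
    theta = np.zeros(N)
    for n in range(N - 1, len(y)):
        Y = y[n - N + 1 : n + 1][::-1]   # most recent sample first
        err = x[n] - Y @ theta           # instantaneous estimation error
        theta = theta + mu * err * Y     # stochastic gradient step
    return theta

# Identify a known FIR filter from input/output data
rng = np.random.default_rng(1)
true_theta = np.array([0.5, -0.3, 0.2, 0.1])
y = rng.standard_normal(20_000)
x = np.convolve(y, true_theta)[: len(y)]   # x(n) = sum_k theta(k) y(n-k)
theta_hat = lms(x, y)
```

As the text notes, the step size matters: for white input a rule of thumb is to keep mu well below 2/(N * var(y)), or the recursion diverges.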




3.2 Motion detection

Motion detection is often built into a larger system and is tweaked to fit the other algorithms. One of the commonly used algorithms is to threshold a difference image:

    d(x,y) = \begin{cases} 1 & \text{if } |img(x,y,t) - img(x,y,t-1)| \geq T \\ 0 & \text{otherwise,} \end{cases}

where T is a threshold variable. Even better is to use a reference image,

    ref(x,y,t) = \alpha \cdot ref(x,y,t-1) + (1-\alpha) \cdot img(x,y,t),    (3.7)

and then threshold against this reference image:

    d(x,y) = \begin{cases} 1 & \text{if } |img(x,y,t) - ref(x,y,t)| \geq T \\ 0 & \text{otherwise.} \end{cases}    (3.8)

The rate at which the reference image is updated over time is controlled by \alpha [1]. This is a fast algorithm, but sensitive to noise.

Irani et al. [11] have developed a method for robust tracking of motion. In the study they use multiple scales and translations to detect and track motions. Though this is a robust technique, it puts a heavy load on the hardware, especially at the resolutions used in the present thesis.
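Equations 3.7 and 3.8 can be sketched directly in code. The thesis uses Matlab; this is a Python/NumPy sketch with a synthetic toy stream.

```python
import numpy as np

def detect_motion(frames, alpha=0.9, T=0.2):
    """Reference-image motion detection, equations 3.7 and 3.8.
    Returns one binary mask per frame (1 = motion assumed)."""
    ref = frames[0].astype(float)
    masks = []
    for img in frames[1:]:
        ref = alpha * ref + (1 - alpha) * img                    # eq. 3.7
        masks.append((np.abs(img - ref) >= T).astype(np.uint8))  # eq. 3.8
    return masks

# Toy stream: black background, one bright pixel moving along the diagonal
frames = [np.zeros((8, 8)) for _ in range(5)]
for t in range(1, 5):
    frames[t][t, t] = 1.0
masks = detect_motion(frames)
```

A large alpha makes the reference adapt slowly (stable, but old object positions linger as ghosts); a small alpha adapts quickly, but slow motion then blends into the background.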




3.3 Pattern recognition

Pattern recognition algorithms can seldom be fed raw data, such as a video or audio stream; they need some sort of feature(s). These features span a domain called the feature space. The choice of feature space is essential, and in some cases even more critical than the choice of pattern recognition algorithm. This is because one wants to keep the dimensionality as low as possible: the higher the dimensionality, the more training data is needed and the heavier the load on the computer, but if the dimensionality is too low, the ability to separate patterns is reduced. If all statistics were known in advance, it would be possible to analytically derive an optimal decision surface. In reality this never happens. Instead, a training set that is supposed to represent the distribution of the signal/pattern is used to tune the chosen algorithm. There are a number of different algorithms with different approaches to how the training set and the different a priori assumptions are used.




3.3.1 Parametric algorithms

Parametric algorithms use the training set to fit distributions chosen beforehand. When the distributions of the different patterns have been trained, the decision boundary can be calculated using, for example, maximum likelihood or Bayesian parameter estimation. These algorithms generally have good convergence and performance if they are tuned right; however, quite a lot of tuning is needed to adapt them to different problems. Another problem is the curse of dimensionality, which appears when the feature space increases in dimensionality [2]. To cope with this problem it is possible to use Principal Component Analysis (PCA), which uses eigenvectors to decrease the dimensionality of the feature space [2]. The strength of parametric algorithms is that knowledge about the distributions can be taken into account, making better use of the available training data.




3.3.2 Nonparametric algorithms

The previous section discussed algorithms that use training data to estimate pre-decided distributions. Unfortunately, knowledge about the distribution of the patterns is rarely available. Nonparametric algorithms do not assume any particular distribution; instead they rely on the training data being accurately representative of the patterns.

One of the best-known nonparametric algorithms is k_n nearest neighbors. The algorithm uses the training data to find the k_n nearest neighbors of the point in the feature space corresponding to the pattern that is to be classified. The pattern that the majority of the k_n neighbors belong to is assumed to be the pattern connected with that point. The strength of this algorithm is that, with sufficient training data, it is able to represent complex distributions. The drawback is that it puts a heavy load on the computer, and the complexity increases with the dimensionality and the amount of training data.
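The k_n nearest neighbors rule described above can be sketched in a few lines. The toy 2-D feature points and the "hand"/"background" labels below are purely illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, point, k=3):
    """Classify `point` as the majority label among its k nearest
    training points (Euclidean distance in the feature space)."""
    dists = np.linalg.norm(train_X - point, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two toy clusters standing in for "hand" vs "background" feature points
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array(["background", "background", "background",
                    "hand", "hand", "hand"])
label = knn_classify(train_X, train_y, np.array([0.95, 1.0]))
```

Note the cost structure the text describes: every query computes a distance to every training point, so classification time grows with both the training set size and the dimensionality.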




3.3.3 Linear discriminant

The previous sections discussed two techniques with different approaches to how the given training set is used.

This third algorithm is more or less in between the two previous ones. We do not define a specific distribution in advance, and we do not keep all the training data as the basis for calculations at run time. The training data is used directly to train the classifier, which is a set of linear discriminant functions

    g(x) = w^T x + w_0,

where x is the point in the feature space that is to be classified, w is the weight vector and w_0 is the bias [1]. Depending on the problem to solve, a number of discriminant functions can be trained and used in recognition problems. For instance, if the classifier is binary, one discriminant function is sufficient. If many patterns are to be classified, the discriminant functions can be designed in multiple ways:

   • One versus all is a training technique where the discriminant function is trained to separate the pattern connected with the discriminant function from all the other patterns.

   • One versus one creates multiple binary discriminants with two patterns versus each other.

   • In a linear machine, one discriminant is trained for each pattern. A point is classified as the pattern whose discriminant produces the highest value.

One problem with these algorithms is that there are regions where the classifier is undefined. The linear machine is the one that usually produces the least amount of undefined space; undefined space only occurs where two or more discriminant functions are equal.
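The linear machine variant from the list above is compact in code. The weights below are hand-picked for illustration, not trained.

```python
import numpy as np

def linear_machine(W, w0, x):
    """Linear machine: evaluate one discriminant g_i(x) = w_i^T x + w0_i
    per class and pick the class whose discriminant is highest."""
    return int(np.argmax(W @ x + w0))

# Three hypothetical classes with hand-picked (untrained) weight vectors
W = np.array([[ 1.0, 0.0],
              [-1.0, 0.0],
              [ 0.0, 1.0]])
w0 = np.zeros(3)
cls = linear_machine(W, w0, np.array([2.0, 0.5]))
```

Taking the argmax is what shrinks the undefined space: ties between discriminants are the only ambiguous points, a measure-zero set for generic weights.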




3.3.4 Support Vector Machines

Support Vector Machines (SVM) are basically the same as linear discriminants, see section 3.3.3, but a few features have been added to enhance the functionality when faced with small training sets, and to create more advanced hyperplanes.

The reason for wanting more advanced hyperplanes is that the dimensionality must be high enough to give good separation between the different patterns. To create advanced hyperplanes, the input data is mapped into a higher dimension, which is often done by kernels. Once the data is mapped into the higher dimension, the new data is processed in the same manner as in regular linear SVM. The techniques for choosing dimensions and making general kernels are a field of research out of scope for this thesis [12].

The linear SVM is similar to the binary linear discriminant. The main difference from the linear discriminant function is that during training the SVM algorithm works towards maximizing the distance between the training data and the hyperplane, called margin maximization. This often results in a hyperplane that produces good results even when only small training sets are available.

The training of the SVM is a minimization of the cost function

    \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(1 - y_i f(x_i)\right)_+,    (3.9)

where C is a tuning parameter that controls the trade-off between training errors and margin maximization [13]. The ()_+ function is plotted in figure 3.1. If y_i f(x_i) is larger than 1 there is no penalty, but if y_i f(x_i) is less than 1 there is a linear penalty scaled by the tuning parameter C.

The SVM algorithm has been widely used in pattern recognition, mainly for its good generalization [14-17].
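Equation 3.9 can be written out as code. A Python/NumPy sketch with toy data; the bias term of f is folded away for brevity.

```python
import numpy as np

def svm_cost(w, X, y, C=1.0):
    """SVM objective, equation 3.9: 0.5*||w||^2 + C * sum_i (1 - y_i f(x_i))_+
    with f(x) = w^T x (bias omitted for brevity)."""
    margins = y * (X @ w)                     # y_i f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)    # the ()_+ function of figure 3.1
    return 0.5 * (w @ w) + C * hinge.sum()

# Toy data: two points outside the margin, one inside it
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w = np.array([1.0, 0.0])
cost = svm_cost(w, X, y)   # 0.5*1 + (0 + 0 + 0.5) = 1.0
```

Only the third point contributes a penalty: its margin y_i f(x_i) = 0.5 is inside the unit margin, exactly the regime figure 3.1 penalizes linearly.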




Figure 3.1: The ()_+ function used in the cost function, equation 3.9, for SVM training.

3.3.5 ψ-learning

ψ-learning is a variant of the SVM algorithm, modified in order to generally produce better results when faced with sparse, non-linearly separable training sets [18]. The mathematical difference lies in the cost function, which for ψ-learning looks like

    \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(1 - \psi(y_i f(x_i))\right).    (3.10)

This cost function is similar to the one in SVM (eq. 3.9), but there is a ψ() function instead of the ()_+ function. The ψ() function is plotted in figure 3.2.

Figure 3.2: The ψ function used in the cost function, equation 3.10, for ψ-learning training.

The difference between the above cost functions is that SVM generates a linear cost as soon as y_i f(x_i) < 1, that is, for training data close to the decision hyperplane. In ψ-learning there is also a linear cost as soon as the training data is close to the decision hyperplane, but this holds only until the data becomes misclassified. At that point the cost is doubled but static. In practice this means that the algorithm does not care about the magnitude of the misclassification, only the fact that there is one.

The reason why this algorithm is more complex than SVM is that the minimization of the cost function, equation 3.10, cannot be solved directly with quadratic programming, as is the case with SVM [18].
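The penalty behaviour just described can be written out as a small piecewise function. Note that this is one reading of the textual description above, not necessarily the exact ψ used in [18].

```python
def psi_penalty(u):
    """Penalty 1 - psi(u), following the description in the text: no cost
    outside the margin (u >= 1), a hinge-like linear cost inside the margin
    (0 <= u < 1), and a doubled but static cost once the point is
    misclassified (u < 0). A reading of the description, not necessarily
    the exact psi of reference [18]."""
    if u >= 1.0:
        return 0.0
    if u >= 0.0:
        return 1.0 - u    # linear, like the SVM hinge
    return 2.0            # magnitude of the misclassification is ignored
```

The jump from a growing to a capped cost at u = 0 is what makes the objective non-convex, and hence out of reach for quadratic programming.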




Chapter 4

Implementation

The methods in chapter 3 were implemented to create a system capable of tracking a specific object in a video stream.




4.1 Initiation

The system needs to train the pattern recognition algorithm, and it requires a point from which to start tracking. This is done during the initiation phase.

To initiate the pattern recognition algorithm, some data for training it is needed. The first frame is presented and the object that is to be tracked is chosen, see figure 4.1. When the training algorithm has finished, the user is prompted to choose a starting position, from which the system will start tracking. The training is further discussed in section 4.3.
(a) foreground                        (b) background

Figure 4.1: The user manually chooses which blocks are the foreground/object; everything else is background.



4.2 Detection

Since the system is given a starting point for the object, it only needs to act when movement occurs. The detection is therefore a motion detection algorithm. The technique is rather simple, and the algorithm works in two steps: first the stream is filtered with a high-pass filter, and then a threshold is applied to the output in order to detect motion. Since this algorithm is very simple it is not robust, but it is very fast. To reduce the impact of noise, we first run a low-pass filter on each frame. This is done with a filter kernel: if the scale of the filter is 5, the kernel is a 5x5 kernel in which all elements are 1/5^2. The result is a smoother image, see figure 4.2.
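The smoothing step above amounts to a box filter. A Python/NumPy sketch (the thesis uses Matlab); it is unpadded, so the output shrinks by scale − 1 pixels per axis, where a real implementation would pad the borders.

```python
import numpy as np

def box_blur(img, scale=5):
    """Low-pass filter with a scale x scale kernel whose elements are all
    1/scale^2: each output pixel is the mean of a scale x scale window.
    No padding, so the output is (scale - 1) pixels smaller per axis."""
    k = scale
    h, w = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = img[i:i + k, j:j + k].mean()
    return out

img = np.zeros((9, 9))
img[4, 4] = 25.0                 # one bright noise pixel
smoothed = box_blur(img)         # the spike is spread evenly over each window
```

A single bright pixel is flattened to 1/25 of its value per window, which is exactly why isolated noise survives the subsequent threshold far less often.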


(a) Original    (b) Scale = 15    (c) Scale = 30    (d) Scale = 45

Figure 4.2: Image at different scales.

The filter is implemented with the help of a reference image, see figure 4.3:

    ref_n = \alpha \cdot ref_{n-1} + (1 - \alpha) \cdot img_n,    (4.1)

where ref_{n-1} is the previous reference image, img_n is the current image from the stream, and \alpha is a variable for tuning how fast the reference image should adapt to changes. Subtracting the reference from the current image gives a value that describes the amount of change in color at every pixel:

    diff_n = img_n - ref_n.    (4.2)

A threshold is applied to the diff_n image, reducing the noise; at pixels with values ≠ 0 after thresholding, some kind of motion is assumed, see figure 4.3 [1].
                                               23
(a) reference image, equation 4.1        (b) dierence image, equation 4.2




                                 (c) motion detected

  Figure 4.3: Results from the detection algorithm. Motion detected is
  binary with ones where the dierence image has a value over a threshold
  and zeros otherwise.




4.3 Recognition

To track a specific object, motion detection is not sufficient, since the detection algorithm gives no information about what is moving. The Recognition block, see section 2.1, is responsible for recognizing the object that is to be tracked.

The recognition system in the present thesis is based on the system used for video object segmentation in Liu et al. [13]. The learning algorithm used is ψ-learning, described in section 3.3.5. The algorithm is trained in the initiation process and is then used throughout the whole simulation.




4.3.1       Feature Space


The    ψ -learning   algorithm does not work directly on the image, thus it needs


to be provided with some form of feature space. The feature space is calcu-


lated on blocks of 9x9 pixels, the image is therefore divided into such blocks.


There is an overlap of 1 pixel between the blocks, where the rst block spans


from pixels   0 − 8 and the second block from pixel 8 − 16 and so on for both x
and   y coordinates.    The feature space is a        24- dimensional space, 8-dimensions
for each colorspace




                          1. c(0, 0)
                          2. Σ_{j=1}^{N−1} c(0, j)²
                          3. Σ_{k=1}^{N−1} c(k, 0)²
                          4. Σ_{k=1}^{N−1} Σ_{j=1}^{N−1} c(k, j)²
                          5. (B(−1,−1) + B(−1,0) + B(−1,1))/3
                          6. (B(−1,1) + B(0,1) + B(1,1))/3
                          7. (B(1,−1) + B(1,0) + B(1,1))/3
                          8. (B(−1,−1) + B(0,−1) + B(1,−1))/3


where c(k, j) are the coefficients of the Discrete Cosine Transform (DCT),
calculated on the 9×9 blocks; the system uses Matlab's dct2. In this case the
first 3 coefficients (N = 3) of the DCT are used, to deal with the fact that
the high frequency coefficients tend to be small. The last 4 dimensions


                            B(−1,−1)   B(−1,0)   B(−1,1)
                            B(0,−1)    B(0,0)    B(0,1)
                            B(1,−1)    B(1,0)    B(1,1)

                Figure 4.4: Neighbouring blocks of 9x9 pixels.



are the average color of the 9x9 neighboring blocks on each side, see figure
4.4. The combination of DCT and neighboring block color values gives good
classification of surfaces as well as grouping information, which reduces the
impact of noise. [13]
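As an illustration, the 8 feature dimensions for one color channel of one block can be sketched in Python. The function names and the dict-of-neighbour-means interface are invented for this example; the DCT coefficient is computed directly from the DCT-II definition with the same orthonormal scaling as Matlab's dct2:

```python
import numpy as np

def dct2_coeff(block, k, j):
    """Single 2-D DCT-II coefficient, matching Matlab's dct2 scaling."""
    M, N = block.shape
    m = np.arange(M)[:, None]
    n = np.arange(N)[None, :]
    basis = np.cos(np.pi * (2 * m + 1) * k / (2 * M)) * \
            np.cos(np.pi * (2 * n + 1) * j / (2 * N))
    ak = np.sqrt((1 if k == 0 else 2) / M)
    aj = np.sqrt((1 if j == 0 else 2) / N)
    return ak * aj * np.sum(block * basis)

def block_features(block, neighbors, n_coeff=3):
    """8-dimensional feature vector for one 9x9 block of one color
    channel (sketch of section 4.3.1).

    neighbors: dict mapping (dy, dx) offsets to the mean color of
    the eight surrounding 9x9 blocks (figure 4.4).
    """
    c = lambda k, j: dct2_coeff(block, k, j)
    feats = [
        c(0, 0),                                          # 1: DC level
        sum(c(0, j) ** 2 for j in range(1, n_coeff)),     # 2: horizontal energy
        sum(c(k, 0) ** 2 for k in range(1, n_coeff)),     # 3: vertical energy
        sum(c(k, j) ** 2                                  # 4: diagonal energy
            for k in range(1, n_coeff) for j in range(1, n_coeff)),
        (neighbors[-1, -1] + neighbors[-1, 0] + neighbors[-1, 1]) / 3,  # 5: above
        (neighbors[-1, 1] + neighbors[0, 1] + neighbors[1, 1]) / 3,     # 6: right
        (neighbors[1, -1] + neighbors[1, 0] + neighbors[1, 1]) / 3,     # 7: below
        (neighbors[-1, -1] + neighbors[0, -1] + neighbors[1, -1]) / 3,  # 8: left
    ]
    return np.array(feats)
```

For a constant block all AC coefficients vanish, so only the DC level and the neighbour averages carry information.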




4.3.2       Training




When the object is chosen as described in 4.1 the algorithm needs to be
trained using the selected data. The blocks that are not chosen are used as
background, see figure 4.1. The training is done with Matlab's fminsearch.
fminsearch needs a start point in the feature space. This start point is
calculated using the minimum squared error solution with the pseudoinverse

                               w = (A^T A)^{−1} A^T Y,

where w is the weight vector, A is a matrix where each row represents a
training point, and Y is a matrix containing rows with the corresponding
class for each training point.
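A minimal NumPy sketch of this start-point computation (the function name is illustrative, and a single class column is assumed here):

```python
import numpy as np

def initial_weights(A, Y):
    """Minimum squared error start point for fminsearch (sketch).

    A: (n_samples, n_features) matrix of training points.
    Y: (n_samples,) target classes, e.g. +1 for object and -1 for
       background.
    """
    # w = (A^T A)^{-1} A^T Y, computed via the pseudoinverse;
    # np.linalg.pinv is numerically safer than forming the inverse.
    return np.linalg.pinv(A) @ Y
```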


  Figure 4.5: Classification of an entire frame: (a) classification output;
  (b) frame. The green dots represent blocks that are classified as
  foreground/object and the red ones blocks that are classified as background.




4.3.3    Detecting


When the training is done, each frame needs to be converted into the feature
space. The image is divided into blocks as described in 4.3.1, and each block
is then classified as either foreground or background, see figure 4.5. To handle
noise better, at least two blocks need to be connected in order for them
to be accepted as part of the object.
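The two-connected-blocks rule can be sketched as a small NumPy operation; 4-connectivity is an assumption here, since the thesis does not state which connectivity is used:

```python
import numpy as np

def prune_isolated(mask):
    """Reject isolated foreground blocks (sketch of section 4.3.3).

    mask: 2-D binary array of per-block classifications.
    A block survives only if at least one 4-connected neighbour is
    also foreground, i.e. at least two blocks are connected.
    """
    m = mask.astype(bool)
    padded = np.pad(m, 1, mode="constant")
    # Count foreground neighbours above, below, left, and right of
    # each block.
    neighbours = (padded[:-2, 1:-1].astype(int) + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:])
    return m & (neighbours > 0)
```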




4.4 Updating
When the detection is finished, a point of interest, which is used during
optimization of the system, is calculated. The point of interest is computed
by finding the block or blocks with the lowest y coordinate ((0, 0) in the
upper left corner); the mean of the x coordinates of those blocks is then
used.
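This computation is a few lines of NumPy (the function name and the (row, column) return convention are illustrative):

```python
import numpy as np

def point_of_interest(mask):
    """Point of interest from the foreground mask (section 4.4).

    Finds the lowest y coordinate ((0, 0) in the upper left corner)
    containing foreground blocks, and returns that y together with
    the mean x coordinate of those blocks.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # nothing classified as foreground
    top = ys.min()
    return top, xs[ys == top].mean()
```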


4.5 Prediction
An LMS-filter, see section 3.1.1, is used to predict the next point of interest,
which is used in the optimization of the system.

   The LMS-filter is designed to be a one-step-ahead filter [10]. We want
to predict the next coordinate using previous observations. Two filters were
implemented, one for each coordinate:

                         x(n + 1) = Σ_{k=0}^{N} θ_x(k) x(n − k)

                         y(n + 1) = Σ_{k=0}^{N} θ_y(k) y(n − k).

During the simulations the filter mostly kept the previous 6 (N = 6) coordinates,
and µ was set around 10^−8.
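A sketch of the one-step-ahead predictor in Python, using the standard LMS recursion (the exact adaptation rule used in the thesis is defined in section 3.1.1; the function interface is invented for this example):

```python
import numpy as np

def lms_predict(coords, n_taps=6, mu=1e-8, theta=None):
    """One-step-ahead LMS prediction of one coordinate track (sketch).

    coords: 1-D sequence of past x (or y) positions.
    Runs the filter over the history, adapting theta with the usual
    LMS update, and returns the predicted next coordinate.
    """
    coords = np.asarray(coords, dtype=float)
    if theta is None:
        theta = np.zeros(n_taps)
    for n in range(n_taps, len(coords)):
        window = coords[n - n_taps:n][::-1]  # x(n-1) ... x(n-n_taps)
        pred = theta @ window                # predicted x(n)
        err = coords[n] - pred               # prediction error e(n)
        theta = theta + mu * err * window    # LMS step with step size mu
    window = coords[-n_taps:][::-1]
    return theta @ window                    # predicted x(n+1)
```

With µ as small as 10^−8 the weights adapt slowly; a larger step size converges faster on a short track but risks instability.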



4.6 Optimization
To make the system run faster, a number of constraints were added to the
system in order to reduce the work load.

   The Detection described in section 4.2 is based on a filter which uses
earlier images. Therefore it is not suitable to reduce the work load by only
calculating parts of the image.


   The task that generated the heaviest load on the computer was the conversion
from the pixel blocks to the feature space. In the study by Yi Liu
et al. [13], which uses the same feature space, the calculation of the DCT is the
major contributor to this load. Therefore two constraints need to be fulfilled
in order to perform the conversion. The first constraint is that only a
certain number of blocks, σ, around the previous point of interest is checked.
During the simulations typical values of σ are 5, 7 and 11. On an image with
resolution 640 × 480 there are 4524 blocks that the conversion needs to be
made on. Having σ = 7, and therefore only 225 blocks, reduces the number
of conversions by a factor of 20. The other constraint is that the conversion of
a block is only made if motion is detected, see section 4.2, in a certain
percentage, γ, of the pixels in the block. Typical values of γ during the
simulations are 60-80%. After these constraints were applied, the conversion
was no longer the bottleneck of the system.
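The two constraints can be sketched as a block-selection step; the block size and stride follow section 4.3.1, while the function and its (row, column) point-of-interest interface are illustrative:

```python
import numpy as np

def blocks_to_convert(motion_mask, poi, sigma=7, gamma=0.8,
                      block=9, step=8):
    """Select which blocks to convert to the feature space (section 4.6).

    motion_mask: per-pixel binary motion image from section 4.2.
    poi: (row, col) block coordinates of the previous point of interest.
    A block is kept only if it lies within sigma blocks of the point
    of interest AND at least a fraction gamma of its pixels moved.
    """
    h, w = motion_mask.shape
    n_rows = (h - block) // step + 1  # blocks overlap by 1 pixel
    n_cols = (w - block) // step + 1
    r0, c0 = poi
    selected = []
    for r in range(max(0, r0 - sigma), min(n_rows, r0 + sigma + 1)):
        for c in range(max(0, c0 - sigma), min(n_cols, c0 + sigma + 1)):
            patch = motion_mask[r * step:r * step + block,
                                c * step:c * step + block]
            # Second constraint: enough of the block's pixels moved.
            if patch.mean() >= gamma:
                selected.append((r, c))
    return selected
```

With σ = 7 the search window is (2·7 + 1)² = 225 blocks, matching the count in the text.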




Chapter 5

Result

The system was tested by following a person's hand. The camera was mounted
on the screen and the person sat down in front of the camera.




5.1 Simulations
The simulations were made on a sequence of 91 frames, with 5 different values
of σ. Data on tracking error (Euclidean distance from ground truth) and the
number of blocks calculated, see section 4.6, was collected. How the system
performed with different σ is presented in figure 5.1, and table 5.1 shows the
average values over all frames. The plots are shown as separate plots in
appendix B.


   Other than σ there are a number of variables that have an impact on the
performance of the system. There are 4 variables that control the motion
detection: scale, which controls the smoothing of the image, see figure 4.2; α,
which controls at what rate the reference image is updated over time; diffThres,


Figure 5.1: Plots of the simulations. Tracking error is the Euclidean
distance from ground truth at each frame; blocks calculated is the number
of blocks calculated at each frame.




                        σ     Tracking error    Blocks calculated
                        5     43.29             21.90
                        7     26.18             31.96
                        9     19.12             53.38
                        11    19.01             78.56
                        13    143.62            59.03


             Table 5.1: The average values of the plots in figure 5.1.



which is the variable that tunes at what point the difference should be classed
as motion; and γ, which controls the percentage of the pixels in a block that
need to be classified as motion for the block to be evaluated. There are 2
variables that control the prediction: filterLength, which is the length of the
LMS-filter, and µ, which controls the step size of the LMS-filter.
During the simulations the variables were set to


                                     scale = 15

                                         α = 0.9

                                 diffThres = 14

                                         γ = 80%

                              filterLength = 6

                                         µ = 13 · 10^−7.




5.2 Color spaces
A number of color spaces were evaluated to see if there were any major
differences in performance. The error rate on the training set after training,
        color space       foreground error   background error   total error
        RGB               0.76%              10.57%             8.84%
        normalized RGB    1.82%              12.82%             11.26%
        HSV               0.15%              10.57%             9.10%
        TSL               5.77%              11.04%             10.30%
        YCrCb             1.82%              13.52%             11.86%
        NTSC              1.67%              7.17%              6.39%


              Table 5.2: Error rates of a number of color spaces.



i.e. the amount of misclassifications when trying to classify the training
set, is presented in table 5.2. The conversion from the RGB image was
done either with Matlab's built-in functions, or as described by Vezhnevets
et al. [19]. The reason why the background has such a high error rate is
that, in the example in section 4.1, see figure 4.1, the face is not a part of
the object but has features similar to the hand. The NTSC conversion (the
YIQ color space) is supplied by Matlab and was used most extensively during
the tests.




5.3 Tracking
Under favorable conditions, such as sufficient light and little or no disturbance
in the background, the tracking worked well. The system still managed
when noise, such as back light and/or motion of other objects in the background,
was introduced. The filter allowed the system to keep working even when
the tracking failed during small portions of time, as it was able to snap on
again after a few frames. Due to limitations in the system the tracking will
fail if a block is misclassified as the object, which only occurs if motion is
detected in the block. This occurs at frame 38 with σ = 13. The reason why
Figure 5.2: Two frames with motion blur.




the system performs well with σ = 9 and σ = 11 is that these give a sufficiently
large search area to track well, while still being small enough to miss
occasional noise in other parts of the image. Fast motion is something a
standard DV camera is unable to handle; it introduces motion blur, see figure
5.2, resulting in the hand blurring into the background and changing in color
and texture.




5.4 Speed
Since the system is implemented in Matlab it is hard to reason about whether
it is possible to run it in real time or not. With the system optimized as
described in section 4.6 it runs on the Apple computer at roughly 1.3 fps. This
frame rate is possible even though Matlab does not utilize both processors and
has poor performance when it comes to loops, since it does not optimize them
as programs written in C/C++ would; also, the code written is not optimal
when it comes to minimizing work load. For each frame there are roughly
20 − 200 blocks, depending on the size of σ, that need to be calculated, and
the detection part consists of pixel-by-pixel computations. Therefore this
system could utilize the full power of computers with multiple cores, and
perhaps even distributed systems.




Chapter 6

Discussion

The system works overall as expected. It outperforms advanced algorithms
in terms of lower computational power needed, and is more stable than the
fast ones. A drawback is that the system parameters were dependent on
the object and its surroundings. Much of the failure could probably be
compensated for with more complex equipment. A more advanced camera
could be configured to use a shorter shutter time, reducing the problem of
tracking failure during motion blur.


   Problems due to limitations in the algorithms of the system are a more
complex matter. For example, when the tracker fails because of misclassification
and motion, the problem will not be solved with better hardware.
Also, if the object is big and has no texture, so that it is registered as a flat
surface, the motion algorithm will only detect motion on the contours, giving
a false representation of the object.


   To improve the system, it might be possible to model the shape of the
object and feed that to an adaptive filter, such as the Kalman filter [10, 20].
Introducing the Kalman filter would allow more complex constraints that
are also adaptive during runtime. For example, the updating of the point
of interest could be forced to be more like the motions of a human hand,
and the shape could be forced to change more continuously. The
drawback of these constraints is that the system becomes less general and
harder to configure.




6.1 Future work
Though not in the scope of this thesis, the performance of the system could
probably be improved by implementing it in a low level language such as C or
C++. The code could then be optimized further, making sure no unnecessary
computations are made. Not until then will we be able to measure how
well the system performs in real time. Stereo vision might make foreground
detection easier; however, a stereo system in real time is not trivial.
To make the system even faster it could be possible, for a simple object like
a hand, to use simpler pattern recognition algorithms.




Appendix A

Mathematical cornerstones

A.1 Statistical Theory
Many of today's algorithms and systems use different forms of a priori
knowledge to enhance the result.




Probabilities



There are a few probabilities that are frequently used when working with
pattern recognition and other statistical frameworks. First, there is the regular
probability

                                          P_X(x),

which is the value that describes how likely it is that the variable X will be
set to x (P(x) and P(X = x) are different notations for the same thing).


   Then there is the joint probability


                                        P_{X,Y}(x, y),

which describes how likely it is that X is set to x and Y is set to y (P(x, y)
and P(X = x, Y = y) are different notations for the same thing).



   The conditional probability

                                       P_{X|Y}(x|y)

describes how likely it is that X is set to x given that Y is set to y (P(x|y)
and P(X = x|Y = y) are different notations for the same thing). The
definition is

                            P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y).




Bayes formula




If we have knowledge of both P_X(x) and P_{Y|X}(y|x), we can, from the
definition of conditional probability, get


                         P_{X,Y}(x, y) = P_{X|Y}(x|y) P_Y(y)

                                       = P_{Y|X}(y|x) P_X(x),


which can be rewritten as

                          P_{Y|X}(y|x) = P_{X|Y}(x|y) P_Y(y) / P_X(x).

This is known as Bayes' formula [2, 21].
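As a quick sanity check, Bayes' formula can be verified numerically on a small joint distribution (the probability values below are chosen arbitrarily for illustration):

```python
from fractions import Fraction

# A toy joint distribution P_{X,Y}(x, y) over x, y in {0, 1}.
P = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
     (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}

# Marginals P_X(x) and P_Y(y) from the joint distribution.
PX = {x: sum(p for (xx, y), p in P.items() if xx == x) for x in (0, 1)}
PY = {y: sum(p for (x, yy), p in P.items() if yy == y) for y in (0, 1)}

# Conditional probabilities from the definition.
P_x_given_y = {(x, y): P[x, y] / PY[y] for (x, y) in P}
P_y_given_x = {(y, x): P[x, y] / PX[x] for (x, y) in P}

# Bayes' formula: P(y|x) = P(x|y) P(y) / P(x) holds for every pair.
for (x, y) in P:
    assert P_y_given_x[y, x] == P_x_given_y[x, y] * PY[y] / PX[x]
```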


Expected value




The expected value is the mean value of the stochastic variable or function:


                                   E[X] = m_X

                                E[f(X)] = m_{f(X)}.


For a discrete stochastic variable the expected value is calculated as


                               E[X] = Σ_{x∈X} x P_X(x).




Variance




The expected value gives the mean value of the stochastic variable or function.
The variance gives the expected value of the squared distance between the
stochastic variable and m_X:

                          Var[X] = σ² = E[(X − m_X)²].


The variance can also be expressed as


                           Var[X] = E[X²] − (E[X])²

                        Var[f(X)] = E[f²(X)] − (E[f(X)])².
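These identities can be checked numerically on a small discrete distribution; note that the second one uses E[f(X)], not E[X] (the values of x, p, and f below are arbitrary illustrations):

```python
import numpy as np

# A small discrete distribution P_X and an arbitrary function f.
x = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])
f = lambda v: v ** 2 + 1

E = lambda vals: np.sum(vals * p)  # expectation under P_X

# Var[X] = E[(X - m_X)^2] = E[X^2] - (E[X])^2
var_x = E((x - E(x)) ** 2)
assert np.isclose(var_x, E(x ** 2) - E(x) ** 2)

# Var[f(X)] = E[f^2(X)] - (E[f(X)])^2
var_fx = E((f(x) - E(f(x))) ** 2)
assert np.isclose(var_fx, E(f(x) ** 2) - E(f(x)) ** 2)
```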



Covariance



The covariance is defined as


               r_XY = Cov[X, Y]

                    = E[(X − m_X)(Y − m_Y)]

                    = Σ_{x∈X} Σ_{y∈Y} (x − m_X)(y − m_Y) P_{X,Y}(x, y).
Appendix B

Simulation plots

The plots of the simulations described in section 5.1, separated into
independent plots. Tracking error is the Euclidean distance from ground truth
at each frame; blocks calculated is the number of blocks calculated at each
frame.




Bibliography

[1] Rafael C. Gonzalez and Richard E. Woods,       Digital Image Processing,
   Prentice-Hall, Inc., second edition, 2001.



[2] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification,
    Wiley & Sons, Inc., second edition, 2001.



[3] Ville Kyrki and Danica Kragić, Tracking rigid objects using integration
    of model-based and model-free cues, 2005.

[4] Nikolaos D. Doulamis, Anastasios D. Doulamis, and Klimis Ntalianis,
    Adaptive classification-based articulation and tracking of video objects
    employing neural network retraining.



[5] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer, Kernel-based
    object tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5,
    pp. 564-575, 2003.



[6] Aishy Amer, Voting-based simultaneous tracking of multiple video
    objects, 2003, vol. 5022, pp. 500-511, SPIE.



[7] Danica Kragić, Visual Servoing for Manipulation: Robustness and
    Integration Issues, Ph.D. thesis, Royal Institute of Technology, 2001.


[8] A. Cavallaro, O. Steiger, and T. Ebrahimi, Tracking video objects in
    cluttered background, IEEE Transactions on Circuits and Systems for
    Video Technology, vol. 15, no. 4, pp. 575-584, 2005.

 [9] M. Gastaud, M. Barlaud, and G. Aubert, Tracking video objects using


    active contours,   in   MOTION '02: Proceedings of the Workshop on
    Motion and Video Computing, Washington, DC, USA, 2002, p. 90, IEEE
    Computer Society.



[10] Håkan Hjalmarsson and Björn Ottersten, Lecture notes in adaptive
     signal processing, Tech. Rep., Signals, Sensors and Systems, Stockholm,
     Sweden, 2002.



[11] Michal Irani, Benny Rousso, and Shmuel Peleg, Computing occluding
     and transparent motions, Tech. Rep., Institute of Computer Science,
     Jerusalem, Israel, 1994.



[12] Christopher J. C. Burges, A tutorial on support vector machines for
     pattern recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2,
     pp. 121-167, 1998.



[13] Yi Liu and Yuan F. Zheng, Video object segmentation and tracking
     using ψ-learning, IEEE Transactions on Circuits and Systems for Video
     Technology, 2005.

[14] Anastasios Tefas, Constantine Kotropoulos, and Ioannis Pitas, Using
     support vector machines to enhance the performance of elastic graph
     matching for frontal face authentication, IEEE Trans. Pattern Anal.
     Mach. Intell., 2001.

[15] Daniel J. Sebald and James A. Bucklew, Support vector machine tech-


    niques for nonlinear equalization,    IEEE Transactions on Signal Pro-
    cessing, 2000.

[16] Edgar Osuna, Robert Freund, and Federico Girosi, Training support
     vector machines: an application to face detection, IEEE Computer
     Vision and Pattern Recognition, 1997.

[17] Massimiliano Pontil and Alessandro Verri,      Support vector machines


    for 3d object recognition,   IEEE Transactions on Pattern Anal. Mach.
    Intell., 1998.

[18] Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing Hung
     Wong, On ψ-learning, Journal of the American Statistical Association,
     2003.



[19] Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, A survey on
     pixel-based skin color detection techniques, Tech. Rep., Graphics and
     Media Laboratory, Faculty of Computational Mathematics and
     Cybernetics, Moscow, Russia, 2003.



[20] Monson H. Hayes, Statistical Digital Signal Processing and Modeling,
     Wiley & Sons, Inc., first edition, 1996.



[21] Arne Leijon, Pattern recognition, Tech. Rep., Signals, Sensors and
     Systems, Stockholm, Sweden, 2005.




Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

Object tracking from a single camera

  • 1. Image processing and object tracking from single camera JOHAN SOMMERFELD Master's Degree Project Stockholm, Sweden 2006-12-13
  • 2.
  • 3. Abstract In the last decades the computers' ability to perform huge amounts of calculations, and to handle information flows we never thought possible ten years ago, has emerged. Despite this, a computer can only extract little information from an image in comparison to human vision. The way the human brain filters out useful information is not fully known, and this skill has not been merged into computer vision science. The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should use both fast and advanced algorithms, aiming to achieve a better ratio between accuracy and speed than would be achieved with either fast or advanced algorithms alone. The system will be tested by trying to follow a person's hand, placed in front of a computer with the web camera mounted on the screen. The goal is to achieve a system with the potential to be implemented in a real-time environment. Therefore the system needs to be very fast. The work in this thesis is an initial step and will not be implemented to run in real time. The hardware used is a standard computer and a regular web camera
  • 4. with a 640x480 resolution at 30 frames per second (fps). The system works overall as expected and was able to track a person's hand with numerous configurations. It outperforms advanced algorithms in terms of the lower computational power needed, and is more stable than the fast ones. A drawback is that the system parameters were dependent on the object and its surroundings. Acknowledgments The thesis was written at the sound and image processing laboratory at the school of electrical engineering, the Royal Institute of Technology (KTH), through the school year of 2005-2006. I would like to take this opportunity to thank my supervisor M.Sc. Anders Ekman for his patience when things progressed a bit slowly, PhD Disa Sommerfeld for proofreading, and also assistant professor Danica Kragić for pushing me forward.
  • 5. Contents
      1 Introduction ..... 1
        1.1 Background ..... 1
        1.2 Related work ..... 2
      2 Problem ..... 5
        2.1 System ..... 5
        2.2 Hardware ..... 7
      3 Method ..... 9
        3.1 Adaptive filters ..... 9
          3.1.1 The Least Mean Square algorithm ..... 11
        3.2 Motion detection ..... 12
        3.3 Pattern recognition ..... 13
          3.3.1 Parametric algorithms ..... 14
          3.3.2 Nonparametric algorithms ..... 14
          3.3.3 Linear discriminant ..... 15
          3.3.4 Support Vector Machines ..... 16
          3.3.5 ψ-learning ..... 17
  • 6.
      4 Implementation ..... 21
        4.1 Initiation ..... 21
        4.2 Detection ..... 22
        4.3 Recognition ..... 24
          4.3.1 Feature Space ..... 25
          4.3.2 Training ..... 26
          4.3.3 Detecting ..... 27
        4.4 Updating ..... 27
        4.5 Prediction ..... 28
        4.6 Optimization ..... 28
      5 Result ..... 31
        5.1 Simulations ..... 31
        5.2 Color spaces ..... 33
        5.3 Tracking ..... 34
        5.4 Speed ..... 35
      6 Discussion ..... 37
        6.1 Future work ..... 38
      A Mathematical cornerstones ..... 39
        A.1 Statistical Theory ..... 39
      B Simulation plots ..... 43
      Bibliography ..... 51
  • 7. Chapter 1 Introduction In the last decades the computers' ability to perform huge amounts of calculations, and to handle information flows we never thought possible ten years ago, has emerged. Despite this, a computer can only extract little information from an image in comparison to human vision. The way the human brain filters out useful information is not fully known, and therefore this skill has not been merged into computer vision science. 1.1 Background Even if we have not been able to teach a computer to process visual input in a complex sense, there is quite a lot a computer can do when it comes to following movement and performing easier recognitions. One of the key features in a computer vision system is for the computer to extract interesting areas (the foreground). Research on this has mainly two approaches. The first group uses advanced algorithms for pattern recognition
  • 8. to extract the foreground. Often these methods make little use of temporal redundancy, and are slow because of the large amount of computations needed. The second approach is different, often using pixel-by-pixel computations and only a few computations per pixel. In general the latter methods are fast and may be implemented in real-time applications. The drawback of these methods is that they are, due to the lack of complexity in the algorithms, sensitive to noise and often need a static environment to be able to function. 1.2 Related work There are a few simple algorithms for tracking, for example: detection of discontinuities using Laplacian and Gaussian filters, often implemented with a simple kernel [1]; thresholding; and motion detection with a reference image. These algorithms are simple, but sensitive to noise, and hard to generalize. A set of more advanced algorithms involves iterations and/or transformations, such as the Hough transform, region-based segmentation and morphological segmentation. These algorithms are generally more stable concerning noise, although as pictures and/or frames grow larger, these algorithms get slow [1]. Other algorithms make use of pattern recognition, such as neural networks, maximum likelihood and support vector machines [2]. First the image has to be translated into something that the pattern recognition algorithms understand: the image is processed into a so-called feature vector. The majority of pattern recognition algorithms require a set of training data to form the decision boundary. The training is often slow; however, thereafter the algorithm is fast. The problem is that extracting the feature vector might
  • 9. be a demanding task for the computer. There is a number of interesting approaches to object tracking. In the study by Kyrki et al [3] they use model-based features, such as a wire frame, combined with model-free features, such as points of interest on a calculated surface. In the study by Doulamis et al [4] they use an implementation of neural networks to track objects in a video stream. The neural network is adaptive and changes over time as the object translates. In the study by Comaniciu et al [5] a kernel-based solution is used for identifying an object. In a study, Amer [6] uses voting-based features and motion to detect objects, tuned for real-time processing. In the PhD thesis by Kragić [7], a multiple cue algorithm is presented, using features that are fast to compute and relying on the assumption that not all of them fail at the same time. In the study by Cavallaro et al [8] a hybrid algorithm is presented, using information about both objects and regions. In the study by Gastaud et al [9] they track objects using active contours. Kragić [7] uses multiple cues for better tracking. Instead of using multiple cues of fast algorithms, the approach in the present thesis takes advantage of both the fast and the advanced algorithms in order to achieve a system that outperforms the simple algorithms, and operates faster than the advanced ones.
  • 10. 4
  • 11. Chapter 2 Problem The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should use both fast and advanced algorithms, aiming to achieve a better ratio between accuracy and speed than would be achieved with either fast or advanced algorithms alone. The system will be tested by trying to follow a person's hand, placed in front of a computer with the web camera mounted on the screen. 2.1 System The goal is to achieve a system with the potential to be implemented in a real-time environment. Therefore the system needs to be very fast. Also, a higher accuracy than the simple methods described in section 1.2 needs to be achieved. The system will use algorithms that need training. The work in this thesis is an initial step and will not be implemented to run in real time.
  • 12. Figure 2.1: The main blocks of the system. At the start, a user input telling the system the whereabouts of the object to track is requested. In this thesis the system is implemented in Matlab and therefore only a proof of concept is possible to achieve. More specifically, the system is based on four blocks, see figure 2.1. Detection is responsible for detecting and segmenting the interesting parts of the image; the algorithm mainly responsible is most often one of the fast algorithms described in section 1.2. Recognition is responsible for classifying the foreground extracted from the image by the detection block. Updating is responsible for updating the representation of the tracked object, using information generated by the recognition block. Prediction is responsible for using all information to predict where to start the segmentation in the Detection block, to minimize the time consumed and to minimize the error probability.
  • 13. 2.2 Hardware The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps). The computer is an Apple Power Mac G5 2x2.7 GHz. The camera is an iSight from Apple, and the video stream is in DV format. However, Matlab's Unix version only takes uncompressed video, and therefore the stream is converted to uncompressed video with true color.
  • 14. 8
  • 15. Chapter 3 Method 3.1 Adaptive filters An adaptive filter is a filter that changes over time depending on the signal. For a summary of the statistical theory used, see appendix A.1. Assume that you have two non-stationary signals with zero mean and known stochastic functions, hence covariance and cross-covariance

      r_{yy}(n, m) = E[y(n)y(n + m)]
      r_{xy}(n, m) = E[x(n)y(n + m)].

      The problem of estimating x(n) given past y(n) may be written as

      \hat{x}(n) = \sum_{k=0}^{N-1} \theta(k) y(n - k) = Y^T(n)\theta,
  • 16. where Y(n) = [y(n), ..., y(n - N + 1)]^T and \theta = [\theta(0), ..., \theta(N - 1)]^T. The MSE is then given by

      MSE(n, \theta) = E[(x(n) - \hat{x}(n))^2].

      The optimal \theta may be obtained from the orthogonality condition, which states that Y^T(n)\theta is the linear MMSE estimate of x(n) if the estimation error is orthogonal to the observations Y(n):

      E[(x(n) - Y^T(n)\theta) Y(n)] = 0.    (3.1)

      If we define the covariance matrices

      \Sigma_{Yx}(n) = [r_{xy}(n, n), ..., r_{xy}(n, n - N + 1)]^T

      \Sigma_{YY}(n) = E[Y(n)Y^T(n)] =
      \begin{bmatrix}
        r_{yy}(0)   & r_{yy}(1)   & \cdots & r_{yy}(N-1) \\
        r_{yy}(1)   & r_{yy}(0)   & \cdots & r_{yy}(N-2) \\
        \vdots      & \vdots      & \ddots & \vdots      \\
        r_{yy}(N-1) & r_{yy}(N-2) & \cdots & r_{yy}(0)
      \end{bmatrix}

      and insert this in (3.1), we get

      \Sigma_{Yx}(n) - \Sigma_{YY}(n)\theta = 0,

      from which we get \theta_{opt}:

      \theta_{opt}(n) = \Sigma_{YY}^{-1}(n)\Sigma_{Yx}(n)
  • 17. Here \theta is dependent on time. An algorithm to update \theta is also needed; a common method is to take a step in the negative gradient direction of MSE(n, \theta):

      \hat{\theta}(n) = \hat{\theta}(n - 1) - \frac{\mu}{2} \frac{\partial}{\partial \theta} MSE(n, \theta) \big|_{\theta = \hat{\theta}(n-1)}    (3.2)

      where \mu is a variable that controls the step size of the algorithm; a large \mu is fast but can be unstable, and a small \mu is slow but generally more stable. The gradient can be written as

      \frac{\partial}{\partial \theta} MSE(n, \theta) = -2\Sigma_{Yx} + 2\Sigma_{YY}\theta.    (3.3)

      Inserting (3.3) in (3.2) we get

      \hat{\theta}(n) = \hat{\theta}(n - 1) + \mu \left( \Sigma_{Yx}(n) - \Sigma_{YY}\hat{\theta}(n - 1) \right)    (3.4)

      3.1.1 The Least Mean Square algorithm In general, the statistical information of the variables is not available. More likely, the only things available are y(n) and x(n). We will still use the steepest descent algorithm, see equation (3.4), with some modifications. Since the statistical information is not available, we will not be able to calculate the MSE. Instead we estimate the MSE by relaxing the expression, dropping the expectation operand:

      MSE(n, \theta) = (x(n) - Y^T(n)\theta)^2
  • 18. The gradient is then

      \frac{\partial}{\partial \theta} MSE(n, \theta) = -2(x(n) - Y^T(n)\theta)Y(n).    (3.5)

      If we insert (3.5) into (3.2) we get

      \hat{\theta}(n) = \hat{\theta}(n - 1) + \mu Y(n) \left( x(n) - Y^T(n)\hat{\theta}(n - 1) \right)    (3.6)

      The theory for this section was collected from Hjalmarsson et al [10], which also supplies more information about adaptive filters. 3.2 Motion detection Motion detection is often built into a larger system and is tweaked to fit the other algorithms. One of the commonly used algorithms is to take a threshold on a difference image:

      d(x, y) = 1 if |img(x, y, t) - img(x, y, t - 1)| > T, else 0,

      where T is a threshold variable. Even better is to use a reference image

      ref(x, y, t) = \alpha \cdot ref(x, y, t - 1) + (1 - \alpha) \cdot img(x, y, t)    (3.7)
  • 19. and then use this reference image to take the threshold:

      d(x, y) = 1 if |img(x, y, t) - ref(x, y, t)| > T, else 0.    (3.8)

      The rate at which the reference image is updated over time is controlled by \alpha [1]. This is a fast algorithm but sensitive to noise. Irani et al [11] have developed a method for robust tracking of motion. In their study they use multiple scales and translations to detect and track motions. Though this is a robust technique, it puts a heavy load on the hardware, especially at the resolutions used in the present thesis. 3.3 Pattern recognition When you use pattern recognition algorithms, you can seldom supply raw data, such as a video or audio stream, to the algorithms. The algorithms need some sort of feature(s). These features span a domain called the feature space. The choice of feature space is essential and in some cases even more critical than the choice of pattern recognition algorithm. This is because you want to keep the dimensionality as low as possible: the higher the dimensionality, the more training data is needed and the heavier the load the algorithms put on the computer; but if the dimensionality is too low, the ability to separate patterns is reduced. If all statistics are known in advance, it is possible to analytically decide an optimal decision surface. However, in reality this never happens. Instead, a training set that is supposed to represent the distribution of the signal/pattern is used to tune the chosen algorithm.
  • 20. There is a number of different algorithms with different approaches to how to use the training set and the different a prioris. 3.3.1 Parametric algorithms Parametric algorithms use the training set to train distributions chosen earlier. When the distributions of the different patterns are trained, the decision boundary can be calculated using for example maximum likelihood or Bayesian parameter estimation. These algorithms generally have good convergence and performance if they are tuned right. However, quite a lot of tuning is needed to adapt these algorithms for different problems. Another problem is the curse of dimensionality, which appears when the feature space increases in dimensionality [2]. To cope with this problem it is possible to use Principal Component Analysis (PCA). PCA uses eigenvectors to decrease the dimensionality of the feature space [2]. The strength of parametric algorithms is that knowledge about the distributions can be taken into account, making better use of the training data available. 3.3.2 Nonparametric algorithms In the previous section we discussed the idea behind algorithms that use training data to estimate pre-decided distributions. Unfortunately, knowledge about the distribution of the patterns is rarely available. Nonparametric algorithms do not assume any special distribution; instead they rely on the training data to be accurately representative of the patterns. One of the best known nonparametric algorithms is k_n nearest neighbors.
  • 21. The algorithm uses the training data to calculate the k_n nearest neighbors to the point in the feature space corresponding to the pattern that is to be classified. The pattern that the majority of the k_n neighbors belongs to is assumed to be the pattern connected with that point. The strength of this algorithm is the fact that, with sufficient training data, it is able to represent complex distributions. The drawback is that it puts a heavy load on the computer, and the complexity increases with the dimensionality and the number of training data. 3.3.3 Linear discriminant In the previous sections, two techniques with different approaches to how to use the given training set have been discussed. This third algorithm is more or less in between the two previous algorithms. We do not define a specific distribution in advance, and we do not keep all the training data as a base for calculations during run time. The training data is used directly to train the classifier, which is a set of linear discriminant functions

      g(x) = w^T x + w_0

      where x is the point in the feature space that is supposed to be classified, w is the weight vector and w_0 is the bias [1]. Depending on what problem to solve, a number of discriminant functions can be trained and used in recognition problems. For instance, if the classifier is supposed to be binary, one discriminant function is sufficient. If there are many patterns that are supposed to be classified, the discriminant functions can be designed in multiple
  • 22. ways:
      • One versus all is a training technique where the discriminant function is trained to separate the pattern connected with the discriminant function from all the other patterns.
      • One versus one creates multiple binary discriminants with two patterns versus each other.
      • In a Linear Machine, one discriminant for each pattern is trained. The pattern is classified as the pattern whose discriminant produces the highest value.
      One problem with these algorithms is that there are spaces where the classifier is undefined. The linear machine is the one that often produces the least amount of undefined space: undefined space only occurs when two or more discriminant functions are equal. 3.3.4 Support Vector Machines Support Vector Machines (SVM) are basically the same as linear discriminants, see section 3.3.3, but a few features have been added to enhance the functionality when faced with small training sets, as well as the ability to create more advanced hyperplanes. The reason for wanting more advanced hyperplanes is that the dimensionality must be high enough to have good separation between the different patterns. To be able to create advanced hyperplanes, the input data is mapped into a higher dimension, which is often done by kernels. Once the
  • 23. data is mapped into the higher dimension, the new data is processed in the same manner as in regular linear SVM. The techniques for choosing dimensions and making general kernels are a field of research out of scope for this thesis [12]. The linear SVM is similar to the binary linear discriminant. The main difference from the linear discriminant function is that during training, the SVM algorithm works towards maximizing the distance between the training data and the hyperplane, called margin maximization. This often results in a hyperplane that produces good results also when only small training sets are available. The training of the SVM is a minimization process of a cost function

      \frac{1}{2}||w||^2 + C \sum_{i=1}^{N} (1 - y_i f(x_i))_+    (3.9)

      where C is a tuning parameter that controls the relation between training errors and margin maximization [13]. The ()_+ function is plotted in figure 3.1. If y_i f(x_i) is larger than 1, there is no penalty, but if y_i f(x_i) is less than 1 there is a linear penalty scaled with the tuning parameter C. The SVM algorithm has been widely used in pattern recognition, mainly for its good generalization [14-17]. 3.3.5 ψ-learning ψ-learning is a variant of the SVM algorithm, modified in order to generally produce better results when faced with sparse, non-linearly separable training sets [18]. The mathematical difference lies within the cost function, which,
  • 24. Figure 3.1: The ()_+ function used in the cost function, equation 3.9, for SVM training. For ψ-learning, the cost function looks like

      \frac{1}{2}||w||^2 + C \sum_{i=1}^{N} (1 - ψ(y_i f(x_i))).    (3.10)

      This cost function is similar to the one in SVM (eq 3.9), but there is a ψ() function instead of a ()_+ function. The ψ() function is plotted in figure 3.2. The difference between the above cost functions is that SVM generates a linear cost as soon as y_i f(x_i) < 1, meaning a training data point that is close to the decision hyperplane. In ψ-learning there is also a linear cost as soon as the training data is close to the decision hyperplane; however, this is only valid until the data becomes misclassified. At that point the cost is doubled but static. In practice this means that the algorithm does not care about the magnitude of the misclassification, only the fact that there is one. The reason why this algorithm is more complex than SVM is that the minimization of the cost function, equation 3.10, can not be directly solved
  • 25. Figure 3.2: The ψ function used in the cost function, equation 3.10, for ψ-learning training. with quadratic programming, as is the case with SVM [18].
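  The contrast between the two cost functions can be made concrete by plotting the per-sample penalty as a function of the margin m = y_i f(x_i). Below is an illustrative NumPy sketch (the thesis implements everything in Matlab): the SVM hinge term (1 - m)_+ grows without bound for misclassified points, while the ψ-style penalty follows the verbal description above, so its exact scaling (linear inside the margin, constant 2 once misclassified) is an assumption, not the thesis's precise ψ().

  ```python
  import numpy as np

  def hinge_loss(m):
      """SVM penalty (1 - m)_+ on the margin m = y*f(x): zero beyond the
      margin, growing linearly with the size of the violation (eq 3.9)."""
      return np.maximum(0.0, 1.0 - m)

  def psi_loss(m):
      """Piecewise penalty matching the verbal description of psi-learning:
      zero beyond the margin, linear inside it, and a constant (doubled)
      penalty once the point is misclassified.  The exact scaling is an
      assumption; the thesis only specifies the shape."""
      m = np.asarray(m, dtype=float)
      return np.where(m >= 1.0, 0.0, np.where(m >= 0.0, 1.0 - m, 2.0))

  margins = np.array([1.5, 0.5, -0.5, -5.0])
  print(hinge_loss(margins))  # grows without bound for misclassified points
  print(psi_loss(margins))    # capped at 2, however wrong the point is
  ```

  The capped penalty is exactly why ψ-learning ignores the magnitude of a misclassification, and also why the cost is no longer convex and cannot be minimized with quadratic programming.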
  • 26. 20
  • 27. Chapter 4 Implementation The methods in chapter 3 were implemented to create a system able to track a specific object in a video stream. 4.1 Initiation The system needs to train the pattern recognition algorithm, and it requires a point from where it starts tracking. This is done during the initiation phase. To initiate the pattern recognition algorithm, some data used for training the algorithm is needed. The first frame is presented and the object that is supposed to be tracked is chosen, see figure 4.1. When the training algorithm is finished, the user is prompted to choose a starting position, from which the system will start tracking. The training is further discussed in section 4.3.
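  The labelling behind figure 4.1 can be sketched as follows. This is an illustrative Python sketch, not the thesis's Matlab GUI: the rectangle standing in for the user's selection and the exact block geometry (9x9 blocks, 1-pixel overlap, so an 8-pixel step) are assumptions, which is also why the block count below differs slightly from the 4524 stated later in section 4.6.

  ```python
  import numpy as np

  def label_blocks(h, w, rect, block=9, step=8):
      """Label every block of the first frame as foreground (1) or
      background (0).  `rect` = (x0, y0, x1, y1) stands in for the region
      the user clicks in figure 4.1; blocks are keyed by their top-left
      pixel and labelled by whether their centre falls inside the rect."""
      labels = {}
      for by in range(0, h - block + 1, step):
          for bx in range(0, w - block + 1, step):
              cx, cy = bx + block // 2, by + block // 2   # block centre
              inside = rect[0] <= cx <= rect[2] and rect[1] <= cy <= rect[3]
              labels[(bx, by)] = 1 if inside else 0
      return labels

  labels = label_blocks(480, 640, rect=(200, 150, 300, 250))
  print(sum(labels.values()), "foreground blocks of", len(labels))
  ```

  Everything outside the chosen region becomes background training data, exactly as described for figure 4.1.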
  • 28. (a) foreground (b) background Figure 4.1: The user manually chooses which blocks are the foreground/object; everything else is background. 4.2 Detection Since the system is given a start point for the object, it only needs to act when movement occurs. The detection is therefore a motion detection algorithm. The technique is rather simple and the algorithm works in two steps. First the stream is filtered with a high-pass filter, and then a threshold is applied to the output in order to detect motion. Since this algorithm is very simple it is not robust; however, it is very fast. To reduce the impact of noise, we first run a low-pass filter on each frame. This is done with a filter kernel. If the scale of the filter is 5, then the filter kernel is a 5x5 kernel and all elements are 1/5^2. The result is a smoother image, see figure 4.2. The filter is implemented with the help of a reference image, see figure 4.3:

      ref_n = \alpha \cdot ref_{n-1} + (1 - \alpha) \cdot img_n    (4.1)
  • 29. (a) Original (b) Scale = 15 (c) Scale = 30 (d) Scale = 45 Figure 4.2: Image at different scales. where ref_{n-1} is the previous reference image, img_n is the current image from the stream, and \alpha is a variable for tuning how fast the reference image should adapt to changes. When subtracting the reference from the current image, we obtain a value that describes the amount of change in color at every pixel:

      diff_n = img_n - ref_n.    (4.2)

      A threshold is applied to the diff_n image, reducing the noise, and at pixels with values ≠ 0 some kind of motion is assumed, see figure 4.3. [1]
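  The detection steps of section 4.2 (box low-pass filter, reference-image update of eq 4.1, difference image of eq 4.2, threshold) can be sketched in NumPy. This is an illustrative translation of the Matlab implementation; the frame sizes, alpha and threshold values are example choices.

  ```python
  import numpy as np

  def smooth(img, scale=5):
      """Low-pass box filter: every output pixel is the mean of a
      scale x scale neighbourhood (all kernel elements are 1/scale^2),
      implemented by summing shifted copies of an edge-padded image."""
      pad = scale // 2
      p = np.pad(img.astype(float), pad, mode='edge')
      out = np.zeros(img.shape, dtype=float)
      for dy in range(scale):
          for dx in range(scale):
              out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
      return out / scale**2

  def detect_motion(img, ref, alpha=0.95, thresh=20):
      """One detector step: update the reference image
      ref_n = alpha*ref_{n-1} + (1-alpha)*img_n (eq 4.1), form the
      difference image (eq 4.2) and threshold it to a binary motion mask."""
      ref = alpha * ref + (1 - alpha) * img
      diff = np.abs(img - ref)
      return (diff > thresh).astype(np.uint8), ref

  # A bright square appears in an otherwise static frame.
  prev = np.zeros((480, 640))
  curr = prev.copy()
  curr[100:150, 200:250] = 255
  mask, ref = detect_motion(smooth(curr), smooth(prev))
  print(int(mask.sum()))  # nonzero only around the square
  ```

  Because the reference image adapts slowly (alpha close to 1), an object that stops moving gradually fades into the reference and out of the motion mask, which matches the behaviour described for figure 4.3.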
  • 30. (a) reference image, equation 4.1 (b) difference image, equation 4.2 (c) motion detected Figure 4.3: Results from the detection algorithm. Motion detected is binary, with ones where the difference image has a value over a threshold and zeros otherwise. 4.3 Recognition To be able to track a specific object, motion detection is not sufficient, since the detection algorithm does not give any information regarding what is moving. The Recognition block, see section 2.1, is responsible for recognizing the object that is supposed to be tracked. The recognition system in the present thesis is based on the system used for video object segmentation in Liu et al [13]. The learning algorithm used
  • 31. is ψ-learning, described in section 3.3.5. The algorithm is trained during the initiation process and is then used throughout the whole simulation. 4.3.1 Feature Space The ψ-learning algorithm does not work directly on the image; it needs to be provided with some form of feature space. The feature space is calculated on blocks of 9x9 pixels, so the image is divided into such blocks. There is an overlap of 1 pixel between the blocks, where the first block spans pixels 0-8 and the second block pixels 8-16, and so on for both x and y coordinates. The feature space is a 24-dimensional space, 8 dimensions for each color channel:

      1. c(0, 0)
      2. \sum_{j=1}^{N-1} c(0, j)^2
      3. \sum_{k=1}^{N-1} c(k, 0)^2
      4. \sum_{k=1}^{N-1} \sum_{j=1}^{N-1} c(k, j)^2
      5. (B_{(-1,-1)} + B_{(-1,0)} + B_{(-1,1)})/3
      6. (B_{(-1,1)} + B_{(0,1)} + B_{(1,1)})/3
      7. (B_{(1,-1)} + B_{(1,0)} + B_{(1,1)})/3
      8. (B_{(-1,-1)} + B_{(0,-1)} + B_{(1,-1)})/3

      where c(k, j) are the coefficients of the Discrete Cosine Transform (DCT), calculated on the 9×9 blocks (the system uses Matlab's dct2). In this case the first 3 coefficients (N = 3) of the DCT are used, to deal with the fact that the high-frequency coefficients tend to be small. The last 4 dimensions
  • 32.
      B_{(-1,-1)} B_{(-1,0)} B_{(-1,1)}
      B_{(0,-1)}  B_{(0,0)}  B_{(0,1)}
      B_{(1,-1)}  B_{(1,0)}  B_{(1,1)}
      Figure 4.4: Neighboring blocks of 9x9 pixels.
      are the average color of the neighboring 9x9 blocks on each side, see figure 4.4. The combination of DCT and neighboring block color values gives good classification of the surface as well as grouping information, which reduces the impact of noise. [13] 4.3.2 Training When the object has been chosen as described in section 4.1, the algorithm needs to be trained using the training data. The blocks that are not chosen are used as background, see figure 4.1. The training is done with Matlab's fminsearch. fminsearch needs a start point in the feature space. This start point is calculated using the minimum squared error solution with the pseudoinverse

      w = (A^T A)^{-1} A^T Y

      where w is the weight vector, A is a matrix where each row represents a training point, and Y is a matrix containing rows with the corresponding class for each training point.
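  The minimum-squared-error start point can be sketched in a few lines of NumPy. The toy training set below (two Gaussian clusters standing in for object and background feature vectors) is an illustrative assumption; `np.linalg.lstsq` is used as the numerically safer equivalent of the explicit pseudoinverse w = (A^T A)^{-1} A^T Y.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  # Toy training set: 40 feature vectors (rows of A) in a 3-D feature
  # space plus a bias column, labelled +1 (object) or -1 (background).
  fg = rng.normal(loc=+2.0, size=(20, 3))
  bg = rng.normal(loc=-2.0, size=(20, 3))
  A = np.hstack([np.vstack([fg, bg]), np.ones((40, 1))])  # bias term
  Y = np.concatenate([np.ones(20), -np.ones(20)])

  # Minimum-squared-error solution w = (A^T A)^{-1} A^T Y; lstsq solves
  # the same least-squares problem without forming the inverse.
  w, *_ = np.linalg.lstsq(A, Y, rcond=None)

  pred = np.sign(A @ w)
  print((pred == Y).mean())  # training accuracy of the linear start point
  ```

  This linear solution is only the initial guess; in the thesis it seeds fminsearch, which then minimizes the non-convex ψ-learning cost of equation 3.10.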
  • 33. (a) Classification output (b) Frame Figure 4.5: Classification of an entire frame. The green dots represent blocks classified as foreground/object and the red ones blocks classified as background. 4.3.3 Detecting When the training is done, each frame needs to be converted into the feature space. The image is divided into blocks as described in section 4.3.1, then each block is evaluated binarily as foreground or background, see figure 4.5. To better handle noise, there need to be at least two connected blocks in order for them to be accepted as part of the object. 4.4 Updating When the detection is finished, a point of interest, which is used during optimization of the system, is calculated. The point of interest is computed by finding the block/blocks with the lowest value in y coordinates ((0, 0) in the upper left corner), then taking the mean of the x coordinates in that group.
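  The updating rule is small enough to state directly in code. A hedged NumPy sketch (the block coordinates are illustrative; the thesis works on the 9x9 block grid):

  ```python
  import numpy as np

  def point_of_interest(fg_blocks):
      """Given (x, y) coordinates of blocks classified as foreground,
      return the point of interest: among the blocks with the lowest y
      value ((0, 0) is the upper-left corner), take the mean x."""
      coords = np.asarray(fg_blocks)
      y_min = coords[:, 1].min()
      top = coords[coords[:, 1] == y_min]
      return float(top[:, 0].mean()), float(y_min)

  # Three foreground blocks; two share the topmost row y = 8.
  print(point_of_interest([(16, 8), (32, 8), (24, 16)]))  # -> (24.0, 8.0)
  ```

  For an upright hand this picks roughly the fingertips, which is a stable anchor for the prediction and optimization steps that follow.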
4.5 Prediction

An LMS filter, see section 3.1.1, is used to predict the next point of interest, which is used in the optimization of the system. The LMS filter is designed as a one-step-ahead filter [10]: we want to predict the next coordinate using previous observations. Two filters were implemented, one for each coordinate:

x(n + 1) = sum_{k=0}^{N} theta_x(k) x(n - k)
y(n + 1) = sum_{k=0}^{N} theta_y(k) y(n - k).

During simulations the filter mostly kept the previous 6 (N = 6) coordinates, and µ was set around 10^-8.

4.6 Optimization

To make the system run faster, a number of constraints were added in order to reduce the work load. The detection described in section 4.2 is based on a filter which uses earlier images; it is therefore not suitable to reduce the work load only by calculating parts of the image.

The task that generated the heaviest load on the computer was the conversion from the pixel blocks to the feature space. In a study by Yi Liu et al. [13], which uses the same feature space, calculation of the DCT is the major contributor to this load. Therefore two constraints need to be fulfilled in order to perform the conversion. The first constraint is that only a certain number of blocks, σ, around the previous point of interest is checked. During simulations typical values of σ were 5, 7 and 11. On an image with resolution 640 × 480 there are 4524 blocks on which the conversion needs to be made. Having σ = 7, and therefore only 225 blocks, reduces the number of conversions by a factor of 20. The other constraint is that the conversion of a block is only made if motion is detected, see section 4.2, in a certain percentage, γ, of the pixels in the block. Typical values of γ during simulations were 60-80%. After these constraints were applied, the conversion was no longer the bottleneck of the system.
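The one-step-ahead prediction of section 4.5 can be sketched as follows. A Python illustration (the thesis uses Matlab); the class shape is assumed, and the adaptation step shown is the standard LMS weight update theta <- theta + µ · error · input, applied per coordinate:

```python
class OneStepLMS:
    """One-step-ahead LMS predictor for a single coordinate,
        x_hat(n + 1) = sum_k theta(k) * x(n - k),
    with the usual LMS weight adaptation.  A sketch only; the thesis
    kept the previous 6 coordinates and used a much smaller mu."""

    def __init__(self, length=6, mu=1e-8):
        self.mu = mu
        self.theta = [0.0] * length    # filter weights theta(k)
        self.history = [0.0] * length  # x(n), x(n-1), ..., newest first

    def predict(self):
        # x_hat(n + 1) = sum_k theta(k) * x(n - k)
        return sum(t * x for t, x in zip(self.theta, self.history))

    def update(self, observed):
        # The prediction error drives the LMS update:
        #   theta(k) <- theta(k) + mu * error * x(n - k)
        error = observed - self.predict()
        self.theta = [t + self.mu * error * x
                      for t, x in zip(self.theta, self.history)]
        self.history = [observed] + self.history[:-1]
        return error
```

Two such filters, one per coordinate, would track the x and y parts of the point of interest independently.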
Chapter 5

Result

The system was tested by following a person's hand. The camera was mounted on the screen and the person sat down in front of the camera.

5.1 Simulations

The simulations were made on a sequence of 91 frames, with 5 different values of σ. Data on tracking error (Euclidean distance from ground truth) and the number of blocks calculated, see section 4.6, was collected. How the system performed with different σ is presented in figure 5.1, and table 5.1 shows the average values over all frames. The plots are separated into individual plots in appendix B.

Other than σ there are a number of variables that have an impact on the performance of the system. There are 4 variables that control the motion detection: scale, which controls the smoothing of the image, see figure 4.2; α, which controls at what rate the reference image is updated over time; diffThres,
Figure 5.1: Plots of the simulations. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.
σ     Tracking error    Blocks calculated
5         43.29              21.90
7         26.18              31.96
9         19.12              53.38
11        19.01              78.56
13       143.62              59.03

Table 5.1: The average values of the plots in figure 5.1.

which is the variable that tunes at what point the difference should be classed as motion; and γ, which controls the percentage of the pixels in a block that need to be classified as motion for the block to be evaluated. There are 2 variables that control the prediction: filterLength, which is the length of the LMS filter, and µ, which controls the step size of the LMS filter. During the simulations the variables were set to

scale = 15
α = 0.9
diffThres = 14
γ = 80%
filterLength = 6
µ = 13 · 10^-7.

5.2 Color spaces

A number of color spaces were evaluated to see if there were any major differences in performance. The error rate on the training set after training,
i.e. the amount of misclassifications when trying to classify the training set, is presented in table 5.2.

color space       foreground error   background error   total error
RGB                    0.76%             10.57%             8.84%
normalized RGB         1.82%             12.82%            11.26%
HSV                    0.15%             10.57%             9.10%
TSL                    5.77%             11.04%            10.30%
YCrCb                  1.82%             13.52%            11.86%
NTSC                   1.67%              7.17%             6.39%

Table 5.2: Error rates for a number of color spaces.

The conversion from the RGB image was done either with Matlab's built-in functions or as described in the study by Sazonov [19]. The reason why the background has such a high error rate is that in the example in section 4.1, see figure 4.1, the face is not part of the object but has similar features to the hand. The NTSC conversion, the YIQ color space, is supplied by Matlab and was used most extensively during the tests.

5.3 Tracking

Under preferable conditions, such as sufficient light and no or little disturbance in the background, the tracking worked well. The system still managed when noise, such as back light and/or motion of other objects in the background, was introduced. The filter allowed the system to keep working even though the tracking failed during small portions of time, and it was able to snap on again after a few frames. Due to limitations in the system, the tracking will fail if a block is misclassified as the object, which only occurs if motion is detected in the block. This occurs at frame 38 and σ = 13. The reason why
Figure 5.2: Two frames with motion blur.

the system performs well with σ = 9 and σ = 11 is that the search area is sufficiently large to track well, while still small enough to miss eventual noise in other parts of the image. Fast motion is something a standard DV camera is unable to handle; it introduces motion blur, see figure 5.2, resulting in the hand blurring into the background and changing in color and texture.

5.4 Speed

Since the system is implemented in Matlab it is hard to say whether it is possible to run it in real time or not. With the system optimized as described in section 4.6 it runs on the Apple computer at roughly 1.3 fps. This frame rate is achieved even though Matlab does not utilize both processors and has poor performance when it comes to loops, since it does not optimize them as programs written in C/C++ would; also, the code written is not optimal when it comes to minimizing work load. For each frame there are roughly 20 - 200 blocks, depending on the size of σ, that need to be calculated; also
the detection part is pixel-by-pixel computation. Therefore this system could utilize the full power of computers with multiple cores, and perhaps even distributed systems.
Chapter 6

Discussion

The system works overall as expected. It outperforms advanced algorithms in terms of the lower computational power needed, and is more stable than the fast ones. A drawback is that the system parameters were dependent on the object and its surroundings.

Much of the failure could probably be compensated for with more complex equipment. A more advanced camera could be configured to use a shorter shutter time, reducing the problem with tracking failure during motion blur.

Problems due to limitations in the algorithms of the system are a more complex matter. For example, when the tracker fails because of misclassification and motion, the problem will not be solved with better hardware. Also, if the object is big and has no texture, so that it is registered as a flat surface, the motion algorithm will only detect motion on the contours, giving a false representation of the object.

To improve the system, it might be possible to model the shape of the object and feed that to an adaptive filter, such as the Kalman filter [10, 20].
Introducing the Kalman filter would allow more complex constraints that are also adaptive during runtime. For example, the updating of the point of interest could be forced to follow motions more like those of a human hand, and the shape could be forced to change more continuously. The drawback of these constraints is that the system becomes less general and harder to configure.

6.1 Future work

Though not in the scope of this thesis, the performance of the system could probably be improved by implementing it in a low-level language such as C or C++. The code could then be optimized further, making sure no unnecessary computations are made. Not until then will we be able to measure how well the system performs in real time. Stereo vision might make foreground detection easier; however, a stereo system in real time is not trivial. To make the system even faster it might be possible, for a simple object like a hand, to use simpler pattern recognition algorithms.
Appendix A

Mathematical cornerstones

A.1 Statistical Theory

Many of today's algorithms and systems use different forms of a priori knowledge to enhance the result.

Probabilities

There are a few probabilities that are frequently used when working with pattern recognition and other statistical frameworks. The regular probability P_X(x) is the value that describes how likely it is that the variable X takes the value x (P(x) and P(X = x) are different notations for the same thing). Then there is the joint probability P_{X,Y}(x, y),
which describes how likely it is that X takes the value x and Y takes the value y (P(x, y) and P(X = x, Y = y) are different notations for the same thing). The conditional probability P_{X|Y}(x|y) describes how likely it is that X takes the value x given that Y takes the value y (P(x|y) and P(X = x|Y = y) are different notations for the same thing). The definition is

P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y).

Bayes formula

If we know both P_Y(y) and P_{X|Y}(x|y), we can, from the definition of conditional probability, get

P_{X,Y}(x, y) = P_{X|Y}(x|y) P_Y(y) = P_{Y|X}(y|x) P_X(x),

which can be rewritten as

P_{Y|X}(y|x) = P_{X|Y}(x|y) P_Y(y) / P_X(x).

This is known as Bayes formula [2, 21].
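These definitions are easy to check numerically. A small Python sketch, with an arbitrary joint distribution over two binary variables chosen purely for illustration:

```python
# Joint distribution P(X = x, Y = y) for two binary variables.
P = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def p_x(x):
    # Marginal P(x), summing the joint probability over y.
    return sum(p for (xi, _), p in P.items() if xi == x)

def p_y(y):
    # Marginal P(y), summing the joint probability over x.
    return sum(p for (_, yi), p in P.items() if yi == y)

def p_x_given_y(x, y):
    # Definition of conditional probability: P(x|y) = P(x, y) / P(y).
    return P[(x, y)] / p_y(y)

def p_y_given_x_bayes(y, x):
    # Bayes formula: P(y|x) = P(x|y) P(y) / P(x).
    return p_x_given_y(x, y) * p_y(y) / p_x(x)

# Bayes formula agrees with the direct definition P(y|x) = P(x, y)/P(x).
print(p_y_given_x_bayes(1, 1), P[(1, 1)] / p_x(1))
```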
Expected value

The expected value is the mean value of a stochastic variable or function,

E[X] = m_X,    E[f(X)] = m_{f(X)}.

For a discrete stochastic variable the expected value is calculated as

E[X] = sum_{x ∈ X} x P_X(x).

Variance

The expected value gives the mean value of the stochastic variable or function. The variance gives the expected value of the squared distance between the stochastic variable and m_X,

Var[X] = σ^2 = E[(X - m_X)^2].

The variance can also be expressed as

Var[X] = E[X^2] - (E[X])^2
Var[f(X)] = E[f^2(X)] - (E[f(X)])^2.
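The equivalence of the two variance expressions can be checked numerically. A small Python sketch, with the discrete pmf chosen arbitrarily for illustration:

```python
# Discrete pmf for X, as a dict mapping value -> probability.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

mean = sum(x * p for x, p in pmf.items())                    # E[X]
ex2 = sum(x * x * p for x, p in pmf.items())                 # E[X^2]
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E[(X - m_X)^2]
var_alt = ex2 - mean ** 2                                    # E[X^2] - (E[X])^2

print(var_def, var_alt)  # the two expressions agree
```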
Covariance

The covariance is defined as

r_XY = Cov[X, Y] = E[(X - m_X)(Y - m_Y)] = sum_{x ∈ X} sum_{y ∈ Y} (x - m_X)(y - m_Y) P_{X,Y}(x, y).
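A small Python sketch computing the covariance directly from a joint pmf; the two example distributions are arbitrary illustrations:

```python
def covariance(joint):
    """r_XY = E[(X - m_X)(Y - m_Y)] for a joint pmf given as a dict
    mapping (x, y) -> probability."""
    m_x = sum(x * p for (x, _), p in joint.items())
    m_y = sum(y * p for (_, y), p in joint.items())
    return sum((x - m_x) * (y - m_y) * p for (x, y), p in joint.items())

# Independent, uniform binary variables: the covariance is zero.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Fully correlated binary variables: the covariance is 0.25.
corr = {(0, 0): 0.5, (1, 1): 0.5}
print(covariance(indep), covariance(corr))
```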
Appendix B

Simulation plots

The plots of the simulations described in section 5.1, separated into independent plots. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.
Bibliography

[1] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Prentice-Hall, Inc., second edition, 2001.

[2] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, Wiley & Sons, Inc., second edition, 2001.

[3] Ville Kyrki and Danica Kragić, Tracking rigid objects using integration of model-based and model-free cues, 2005.

[4] Nikolaos D. Doulamis, Anastasios D. Doulamis, and Klimis Ntalianis, Adaptive classification-based articulation and tracking of video objects employing neural network retraining.

[5] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564-575, 2003.

[6] Aishy Amer, Voting-based simultaneous tracking of multiple video objects, 2003, vol. 5022, pp. 500-511, SPIE.

[7] Danica Kragić, Visual Servoing for Manipulation: Robustness and Integration Issues, Ph.D. thesis, Royal Institute of Technology, 2001.
[8] A. Cavallaro, O. Steiger, and T. Ebrahimi, Tracking video objects in cluttered background, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 4, pp. 575-584, 2005.

[9] M. Gastaud, M. Barlaud, and G. Aubert, Tracking video objects using active contours, in MOTION '02: Proceedings of the Workshop on Motion and Video Computing, Washington, DC, USA, 2002, p. 90, IEEE Computer Society.

[10] Håkan Hjalmarsson and Björn Ottersten, Lecture notes in adaptive signal processing, Tech. Rep., Signals, Sensors and Systems, Stockholm, Sweden, 2002.

[11] Michal Irani, Benny Rousso, and Shmuel Peleg, Computing occluding and transparent motions, Tech. Rep., Institute of Computer Science, Jerusalem, Israel, 1994.

[12] Christopher J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.

[13] Yi Liu and Yuan F. Zheng, Video object segmentation and tracking using ψ-learning, IEEE Transactions on Circuits and Systems for Video Technology, 2005.

[14] Anastasios Tefas, Constantine Kotropoulos, and Ioannis Pitas, Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication, IEEE Trans. Pattern Anal. Mach. Intell., 2001.
[15] Daniel J. Sebald and James A. Bucklew, Support vector machine techniques for nonlinear equalization, IEEE Transactions on Signal Processing, 2000.

[16] Edgar Osuna, Robert Freund, and Federico Girosi, Training support vector machines: an application to face detection, IEEE Computer Vision and Pattern Recognition, 1997.

[17] Massimiliano Pontil and Alessandro Verri, Support vector machines for 3D object recognition, IEEE Transactions on Pattern Anal. Mach. Intell., 1998.

[18] Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing Hung Wong, On ψ-learning, Journal of the American Statistical Association, 2003.

[19] Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, A survey on pixel-based skin color detection techniques, Tech. Rep., Graphics and Media Laboratory, Faculty of Computational Mathematics and Cybernetics, Moscow, Russia, 2003.

[20] Monson H. Hayes, Statistical Digital Signal Processing and Modeling, Wiley & Sons, Inc., first edition, 1996.

[21] Arne Leijon, Pattern recognition, Tech. Rep., Signals, Sensors and Systems, Stockholm, Sweden, 2005.