Sona project


Published in: Technology

  1. Image processing and object tracking from single camera. JOHAN SOMMERFELD. Master's Degree Project, Stockholm, Sweden, 2006-12-13
  2. Abstract

     In the last decades, computers have gained the ability to perform huge amounts of calculations and to handle information flows we never thought possible ten years ago. Despite this, a computer can extract only little information from an image compared with human vision. The way the human brain filters out useful information is not fully known, and this skill has not been merged into computer vision science.

     The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should use both fast and advanced algorithms, aiming to achieve a better ratio between accuracy and speed than either fast or advanced algorithms achieve on their own. The system will be tested by trying to follow a person's hand, placed in front of a computer with the web camera mounted on the screen.

     The goal is to achieve a system with the potential to be implemented in a real-time environment. Therefore the system needs to be very fast. The work in this thesis is an initial step and will not be implemented to run in real time. The hardware used is a standard computer and a regular web camera
  3. with a 640x480 resolution at 30 frames per second (fps).

     The system works overall as expected and was able to track a person's hand in numerous configurations. It outperforms advanced algorithms in terms of lower computational power needed, and is more stable than the fast ones. A drawback is that the system parameters were dependent on the object and its surroundings.

     Acknowledgments

     The thesis was written at the Sound and Image Processing Laboratory at the School of Electrical Engineering, the Royal Institute of Technology (KTH), during the school year 2005-2006. I would like to take this opportunity to thank my supervisor M.Sc. Anders Ekman for his patience when things progressed a bit slowly, PhD Disa Sommerfeld for proofreading, and also assistant professor Danica Kragić for pushing me forward.
  4. Contents

     1 Introduction
       1.1 Background
       1.2 Related work
     2 Problem
       2.1 System
       2.2 Hardware
     3 Method
       3.1 Adaptive filters
         3.1.1 The Least Mean Square algorithm
       3.2 Motion detection
       3.3 Pattern recognition
         3.3.1 Parametric algorithms
         3.3.2 Nonparametric algorithms
         3.3.3 Linear discriminant
         3.3.4 Support Vector Machines
         3.3.5 ψ-learning
  5. Contents (continued)

     4 Implementation
       4.1 Initiation
       4.2 Detection
       4.3 Recognition
         4.3.1 Feature Space
         4.3.2 Training
         4.3.3 Detecting
       4.4 Updating
       4.5 Prediction
       4.6 Optimization
     5 Result
       5.1 Simulations
       5.2 Color spaces
       5.3 Tracking
       5.4 Speed
     6 Discussion
       6.1 Future work
     A Mathematical cornerstones
       A.1 Statistical Theory
     B Simulation plots
     Bibliography
  6. Chapter 1 Introduction

     In the last decades, computers have gained the ability to perform huge amounts of calculations and to handle information flows we never thought possible ten years ago. Despite this, a computer can extract only little information from an image compared with human vision. The way the human brain filters out useful information is not fully known, and therefore this skill has not been merged into computer vision science.

     1.1 Background

     Even if we have not been able to teach a computer to process visual input in a complex sense, there is quite a lot a computer can do when it comes to following movement and performing simpler recognition tasks.

     One of the key tasks in a computer vision system is for the computer to extract interesting areas (foreground). Research on this mainly follows two approaches. The first group uses advanced algorithms for pattern recognition
  7. to extract the foreground. These methods often make little use of temporal redundancy, and are slow because of the large amount of computation needed. The second approach is different, often using pixel-by-pixel computations with only a few computations per pixel. In general the latter methods are fast and may be implemented in real-time applications. Their drawback is that, due to the lack of complexity in the algorithms, they are sensitive to noise and often need a static environment to function.

     1.2 Related work

     There are a few simple algorithms for tracking, for example: detection of discontinuities using Laplacian and Gaussian filters, often implemented with a simple kernel [1]; thresholding; and motion detection with a reference image. These algorithms are simple, but sensitive to noise and hard to generalize. A set of more advanced algorithms involves iterations and/or transformations, such as the Hough transform, region-based segmentation and morphological segmentation. These algorithms are generally more robust to noise, although they get slow as pictures and/or frames grow larger [1].

     Other algorithms make use of pattern recognition, such as neural networks, maximum likelihood and support vector machines [2]. First the image has to be translated into something that the pattern recognition algorithms understand: the image is processed into a so-called feature vector. The majority of pattern recognition algorithms require a set of training data to form the decision boundary. The training is often slow, but thereafter the algorithm is fast. The problem is that extracting the feature vector might
  8. be a demanding task for the computer.

     There are a number of interesting approaches to object tracking. Kyrki et al. [3] use both model-based features, such as a wire frame, and model-free features, such as points of interest on a calculated surface. Doulamis et al. [4] use an implementation of neural networks to track objects in a video stream; the neural network is adaptive and changes over time as the object translates. Comaniciu et al. [5] use a kernel-based solution for identifying an object. Amer [6] uses voting-based features and motion to detect objects, tuned for real-time processing. In the PhD thesis by Kragić [7], a multiple-cue algorithm is presented, using features that are fast to compute and relying on the assumption that not all cues fail at the same time. Cavallaro et al. [8] present a hybrid algorithm using information about both objects and regions. Gastaud et al. [9] track objects using active contours.

     Kragić [7] uses multiple cues for better tracking. Instead of using multiple cues from fast algorithms, the approach in the present thesis takes advantage of both the fast and the advanced algorithms in order to achieve a system that outperforms the simple algorithms and operates faster than the advanced ones.
  10. Chapter 2 Problem

      The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should use both fast and advanced algorithms, aiming to achieve a better ratio between accuracy and speed than either fast or advanced algorithms achieve on their own. The system will be tested by trying to follow a person's hand, placed in front of a computer with the web camera mounted on the screen.

      2.1 System

      The goal is to achieve a system with the potential to be implemented in a real-time environment. Therefore the system needs to be very fast. It also needs to achieve a higher accuracy than the simple methods described in section 1.2. The system will use algorithms that need training. The work in this thesis is an initial step and will not be implemented to run in real time.
  11. Figure 2.1: The main blocks of the system.

      At the start, the user is asked to tell the system where the object to track is located. In this thesis the system is implemented in Matlab, and therefore only a proof of concept is possible to achieve.

      More specifically, the system is based on four blocks, see figure 2.1.

      • Detection is responsible for detecting and segmenting the interesting parts of the image. The main algorithm is most often one of the fast algorithms described in section 1.2.
      • Recognition is responsible for classifying the foreground extracted from the image by the detection block.
      • Updating is responsible for updating the representation of the tracked object, using information generated by the recognition block.
      • Prediction is responsible for using all available information to predict where to start the segmentation in the Detection block, in order to minimize the time consumed and the error probability.
  12. 2.2 Hardware

      The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps). The computer is an Apple Power Mac G5 2x2.7 GHz. The camera is an iSight from Apple, and the video stream is in DV format. However, Matlab's Unix version only takes uncompressed video, and therefore the stream is converted to uncompressed video with true color.
  14. Chapter 3 Method

      3.1 Adaptive filters

      An adaptive filter is a filter that changes over time depending on the signal. For a summary of the statistical theory used, see appendix A.1. Assume that you have two non-stationary signals with zero mean and known statistics, that is, known covariance and cross-covariance

      $$r_{yy}(n, m) = E[y(n)\,y(n + m)], \qquad r_{xy}(n, m) = E[x(n)\,y(n + m)].$$

      The problem of estimating $x(n)$ given past $y(n)$ may be written as

      $$\hat{x}(n) = \sum_{k=0}^{N-1} \theta(k)\,y(n - k) = Y^T(n)\,\theta,$$
  15. where $Y(n) = [y(n), \ldots, y(n - N + 1)]$ and $\theta = [\theta(0), \ldots, \theta(N - 1)]^T$. The MSE is then given by

      $$\mathrm{MSE}(n, \theta) = E[(x(n) - \hat{x}(n))^2].$$

      The optimal $\theta$ may be obtained from the orthogonality condition, which states that $Y^T(n)\theta$ is the linear MMSE estimate of $x(n)$ if the estimation error is orthogonal to the observations $Y(n)$:

      $$E[(x(n) - Y^T(n)\theta)\,Y(n)] = 0. \tag{3.1}$$

      If we define the covariance matrices

      $$\Sigma_{Yx}(n) = E[r_{xy}(n, n), \ldots, r_{xy}(n, n - N + 1)]$$

      $$\Sigma_{YY}(n) = E[Y(n)Y^T(n)] = \begin{bmatrix} r_{yy}(0) & r_{yy}(1) & \cdots & r_{yy}(N-1) \\ r_{yy}(1) & r_{yy}(0) & \cdots & r_{yy}(N-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{yy}(N-1) & r_{yy}(N-2) & \cdots & r_{yy}(0) \end{bmatrix}$$

      and insert them in 3.1, we get

      $$\Sigma_{Yx}(n) - \Sigma_{YY}(n)\,\theta = 0,$$

      and from this we get $\theta_{opt}$:

      $$\theta_{opt}(n) = \Sigma_{YY}^{-1}(n)\,\Sigma_{Yx}(n)$$
  16. Here $\theta$ depends on time. An algorithm to update $\theta$ is also needed; a common method is to take a step in the negative gradient direction of $\mathrm{MSE}(n, \theta)$:

      $$\hat{\theta}(n) = \hat{\theta}(n - 1) - \frac{\mu}{2}\,\frac{\partial}{\partial\theta}\mathrm{MSE}(n, \theta)\Big|_{\theta = \hat{\theta}(n-1)} \tag{3.2}$$

      Here $\mu$ is a variable that controls the step size of the algorithm: a large $\mu$ is fast but can be unstable, and a small $\mu$ is slow but generally more stable. The gradient can be written as

      $$\frac{\partial}{\partial\theta}\mathrm{MSE}(n, \theta) = -2\Sigma_{Yx} + 2\Sigma_{YY}\,\theta. \tag{3.3}$$

      Inserting 3.3 in 3.2 gives

      $$\hat{\theta}(n) = \hat{\theta}(n - 1) + \mu\left(\Sigma_{Yx}(n) - \Sigma_{YY}\,\hat{\theta}(n - 1)\right). \tag{3.4}$$

      3.1.1 The Least Mean Square algorithm

      In general, the statistical information about the variables is not available. More likely, the only things available are $y(n)$ and $x(n)$. We will still use the steepest descent algorithm, see equation 3.4, with some modifications.

      Since the statistical information is not available, we are not able to calculate the MSE. Instead we estimate the MSE by relaxing the expression, dropping the expectation operator:

      $$\mathrm{MSE}(n, \theta) = (x(n) - Y^T(n)\theta)^2$$
  17. The gradient is then

      $$\frac{\partial}{\partial\theta}\mathrm{MSE}(n, \theta) = -2\,(x(n) - Y^T(n)\theta)\,Y(n). \tag{3.5}$$

      If we insert 3.5 into 3.2 we get

      $$\hat{\theta}(n) = \hat{\theta}(n - 1) + \mu\,Y(n)\left(x(n) - Y^T(n)\,\hat{\theta}(n - 1)\right). \tag{3.6}$$

      The theory for this section was collected from Hjalmarsson et al. [10], which also supplies more information about adaptive filters.

      3.2 Motion detection

      Motion detection is often built into a larger system and is tweaked to fit the other algorithms. One commonly used algorithm is to threshold a difference image

      $$d(x, y) = \begin{cases} 1 & \text{if } |img(x, y, t) - img(x, y, t - 1)| > T \\ 0 & \text{else} \end{cases}$$

      where $T$ is a threshold variable. Even better is to use a reference image

      $$ref(x, y, t) = \alpha \cdot ref(x, y, t - 1) + (1 - \alpha) \cdot img(x, y, t) \tag{3.7}$$
  18. and then use this reference image to take the threshold

      $$d(x, y) = \begin{cases} 1 & \text{if } |img(x, y, t) - ref(x, y, t)| > T \\ 0 & \text{else} \end{cases} \tag{3.8}$$

      The rate at which the reference image is updated over time is controlled by $\alpha$ [1]. This is a fast algorithm but sensitive to noise.

      Irani et al. [11] have developed a method for robust tracking of motion. In the study they use multiple scales and translations to detect and track motions. Though this is a robust technique, it puts a heavy load on the hardware, especially at the resolutions used in the present thesis.

      3.3 Pattern recognition

      When you use pattern recognition algorithms, you can seldom feed raw data, such as a video or audio stream, into the algorithms. The algorithms need some sort of feature(s). These features span a domain called the feature space. The choice of feature space is essential and in some cases even more critical than the choice of pattern recognition algorithm. This is because you want to keep the dimensionality as low as possible: the higher the dimensionality, the more training data is needed and the heavier the load the algorithms put on the computer. If the dimensionality is too low, however, the ability to separate patterns is reduced. If all statistics are known in advance, it is possible to analytically determine an optimal decision surface. In reality this never happens. Instead, a training set that is supposed to represent the distribution of the signal/pattern is used to tune the chosen algorithm.
  19. There are a number of different algorithms with different approaches to how the training set and the different prior assumptions are used.

      3.3.1 Parametric algorithms

      Parametric algorithms use the training set to fit distributions chosen in advance. When the distributions of the different patterns have been fitted, the decision boundary can be calculated using, for example, maximum likelihood or Bayesian parameter estimation. These algorithms generally have good convergence and performance if they are tuned right. However, quite a lot of tuning is needed to adapt these algorithms to different problems. Another problem is the curse of dimensionality, which appears when the feature space increases in dimensionality [2]. To cope with this problem it is possible to use principal component analysis (PCA). PCA uses eigenvectors to decrease the dimensionality of the feature space [2]. The strength of parametric algorithms is that knowledge about the distributions can be taken into account, making better use of the training data available.

      3.3.2 Nonparametric algorithms

      In the previous section we discussed the idea behind algorithms that use training data to estimate pre-decided distributions. Unfortunately, knowledge about the distribution of the patterns is rarely available. Nonparametric algorithms do not assume any particular distribution; instead they rely on the training data being accurately representative of the patterns.

      One of the best-known nonparametric algorithms is $k_n$ nearest neighbors.
  20. The algorithm uses the training data to find the $k_n$ nearest neighbors of the point in the feature space corresponding to the pattern that is to be classified. The pattern that the majority of the $k_n$ neighbors belong to is assumed to be the pattern connected with that point. The strength of this algorithm is that, with sufficient training data, it is able to represent complex distributions. The drawback is that it puts a heavy load on the computer, and the complexity increases with the dimensionality and the amount of training data.

      3.3.3 Linear discriminant

      The previous sections discussed two techniques with different approaches to using the given training set.

      This third algorithm is more or less in between the two previous ones. We do not define a specific distribution in advance, and we do not keep all the training data as a base for calculations during run time. The training data is used directly to train the classifier, which is a set of linear discriminant functions

      $$g(x) = w^t x + w_0$$

      where $x$ is the point in the feature space that is to be classified, $w$ is the weight vector and $w_0$ is the bias [1]. Depending on the problem to solve, a number of discriminant functions can be trained and used in recognition problems. For instance, if the classifier is supposed to be binary, one discriminant function is sufficient. If there are many patterns to be classified, the discriminant functions can be designed in multiple
  21. ways:

      • One versus all is a training technique where the discriminant function is trained to separate the pattern connected with the discriminant function from all the other patterns.
      • One versus one creates multiple binary discriminants, each separating two patterns from each other.
      • In a linear machine one discriminant is trained for each pattern. A point is classified as the pattern whose discriminant produces the highest value.

      One problem with these algorithms is that there are regions where the classifier is undefined. The linear machine is the one that often produces the least amount of undefined space; there, undefined space only occurs where two or more discriminant functions are equal.

      3.3.4 Support Vector Machines

      Support Vector Machines (SVM) are basically the same as linear discriminants, see section 3.3.3, but with a few features added to enhance the functionality when faced with small training sets, and with the ability to create more advanced hyperplanes.

      The reason for wanting more advanced hyperplanes is that the dimensionality must be high enough to give good separation between the different patterns. To be able to create advanced hyperplanes, the input data is mapped into a higher dimension, which is often done by kernels. Once the
  22. data is mapped into the higher dimension, the new data is processed in the same manner as in a regular linear SVM. The techniques for choosing dimensions and making general kernels are a field of research outside the scope of this thesis [12].

      The linear SVM is similar to the binary linear discriminant. The main difference from the linear discriminant function is that, during training, the SVM algorithm works towards maximizing the distance between the training data and the hyperplane, called margin maximization. This often results in a hyperplane that produces good results even when only small training sets are available.

      The training of the SVM is a minimization of the cost function

      $$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(1 - (y_i f(x_i))_+\right) \tag{3.9}$$

      where $C$ is a tuning parameter that controls the relation between training errors and margin maximization [13]. The $()_+$ function is plotted in figure 3.1. If $y_i f(x_i)$ is larger than 1 there is no penalty, but if $y_i f(x_i)$ is less than 1 there is a linear penalty scaled with the tuning parameter $C$.

      The SVM algorithm has been widely used in pattern recognition, mainly for its good generalization [14-17].

      3.3.5 ψ-learning

      ψ-learning is a variant of the SVM algorithm, modified in order to generally produce better results when faced with sparse, non-linearly separable training sets [18]. The mathematical difference lies in the cost function which,
  23. for ψ-learning, looks like

      $$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(1 - \psi(y_i f(x_i))\right). \tag{3.10}$$

      This cost function is similar to the one for SVM (equation 3.9), but there is a $\psi()$ function instead of a $()_+$ function. The $\psi()$ function is plotted in figure 3.2.

      Figure 3.1: The $()_+$ function used in the cost function, equation 3.9, for SVM training.

      The difference between the above cost functions is that SVM generates a linear cost as soon as $y_i f(x_i) < 1$, meaning a training point that is close to the decision hyperplane. In ψ-learning there is also a linear cost as soon as the training point is close to the decision hyperplane; however, this only holds until the point becomes misclassified. At that point the cost is doubled but static. In practice this means that the algorithm does not care about the magnitude of the misclassification, only about the fact that there is one.

      The reason why this algorithm is more complex than SVM is that the minimization of the cost function, equation 3.10, cannot be directly solved
  24. with quadratic programming, as is the case with SVM [18].

      Figure 3.2: The ψ function used in the cost function, equation 3.10, for ψ-learning training.
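The difference between the two per-sample costs can be made concrete with a small Python sketch. It is not taken from the thesis: the piecewise definitions are read off the prose description (no cost above 1, a linear cost between 0 and 1, and for ψ-learning a constant cost of 2 once a point is misclassified), since figures 3.1 and 3.2 are not reproduced here. The function names and the tiny data set are illustrative assumptions.

```python
def svm_cost(u):
    """Per-sample SVM cost for u = y * f(x): zero when u >= 1, linear below."""
    return max(0.0, 1.0 - u)


def psi_cost(u):
    """Per-sample psi-learning cost as described in the prose (an assumption):
    zero when u >= 1, linear for 0 <= u < 1, constant 2 once misclassified."""
    if u >= 1.0:
        return 0.0
    if u >= 0.0:
        return 1.0 - u
    return 2.0


def total_cost(w, w0, C, samples, cost):
    """0.5 * ||w||^2 + C * sum(cost(y_i * f(x_i))) with f(x) = w . x + w0."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, y in samples:
        f = sum(wi * xi for wi, xi in zip(w, x)) + w0
        penalty += cost(y * f)
    return margin_term + C * penalty


# Hypothetical 1-D data: one well-classified point and one badly misclassified point.
samples = [([2.0], 1), ([-5.0], 1)]
print(total_cost([1.0], 0.0, 1.0, samples, svm_cost))  # the outlier contributes 6
print(total_cost([1.0], 0.0, 1.0, samples, psi_cost))  # the outlier contributes only 2
```

Under these assumptions a single far-away misclassified point dominates the SVM cost, while its contribution to the ψ-learning cost is capped, which matches the remark that ψ-learning ignores the magnitude of a misclassification.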
  26. Chapter 4 Implementation

      The methods in chapter 3 were implemented to create a system able to track a specific object in a video stream.

      4.1 Initiation

      The system needs to train the pattern recognition algorithm, and it requires a point from which it starts tracking. This is done during the initiation phase.

      To initiate the pattern recognition algorithm, some data for training the algorithm is needed. The first frame is presented and the object that is to be tracked is chosen, see figure 4.1. When the training algorithm is finished, the user is prompted to choose a starting position, from which the system will start tracking. The training is further discussed in section 4.3.
  27. Figure 4.1: (a) foreground, (b) background. The user manually chooses which blocks are the foreground/object; everything else is background.

      4.2 Detection

      Since the system is given a start point for the object, it only needs to act when movement occurs. The detection is therefore a motion detection algorithm. The technique is rather simple and the algorithm works in two steps. First the stream is filtered with a high-pass filter, and then a threshold is applied to the output in order to detect motion. Since this algorithm is very simple it is not robust; however, it is very fast. To reduce the impact of noise, we first run a low-pass filter on each frame. This is done with a filter kernel: if the scale of the filter is 5, the filter kernel is a 5x5 kernel and all elements are $1/5^2$. The result is a smoother image, see figure 4.2.

      The high-pass filter is implemented with the help of a reference image, see figure 4.3,

      $$ref_n = \alpha \cdot ref_{n-1} + (1 - \alpha) \cdot img_n \tag{4.1}$$
  28. where $ref_{n-1}$ is the previous reference image, $img_n$ is the current image from the stream, and $\alpha$ is a variable for tuning how fast the reference image should adapt to changes. Subtracting the reference from the current image gives a value that describes the amount of change in color at every pixel:

      $$diff_n = img_n - ref_n. \tag{4.2}$$

      A threshold is applied to the $diff_n$ image, reducing the noise, and at pixels with values $\neq 0$ some kind of motion is assumed, see figure 4.3 [1].

      Figure 4.2: Image at different scales: (a) original, (b) scale = 15, (c) scale = 30, (d) scale = 45.
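The detection step (smoothing, equation 4.1, equation 4.2 and the threshold) can be sketched in a few lines of Python. This is a minimal illustration on grayscale images stored as nested lists, not the Matlab implementation used in the thesis; the function names, the border handling of the mean filter and the use of the absolute difference are assumptions.

```python
def box_filter(img, scale):
    """Mean filter with a scale x scale kernel (odd scale assumed).
    Border pixels average over the part of the kernel inside the image,
    which is an assumption about the border handling."""
    h, w = len(img), len(img[0])
    r = scale // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def detect_motion(ref_prev, img, alpha, thres):
    """One detection step: update the reference image (equation 4.1),
    take the difference (equation 4.2) and threshold its magnitude."""
    ref = [[alpha * r + (1.0 - alpha) * p for r, p in zip(ref_row, img_row)]
           for ref_row, img_row in zip(ref_prev, img)]
    mask = [[1 if abs(p - r) > thres else 0 for r, p in zip(ref_row, img_row)]
            for ref_row, img_row in zip(ref, img)]
    return ref, mask


# Hypothetical 2x2 frame: one pixel changes from 0 to 100.
ref0 = [[0.0, 0.0], [0.0, 0.0]]
frame = [[100.0, 0.0], [0.0, 0.0]]
ref1, mask = detect_motion(ref0, frame, alpha=0.9, thres=14)
print(mask)
```

In a full pipeline each frame would first go through `box_filter` and then through `detect_motion`, with the returned reference image carried over to the next frame.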
  29. Figure 4.3: Results from the detection algorithm: (a) reference image, equation 4.1; (b) difference image, equation 4.2; (c) motion detected. The motion-detected image is binary, with ones where the difference image has a value over a threshold and zeros otherwise.

      4.3 Recognition

      To be able to track a specific object, motion detection is not sufficient, since the detection algorithm does not give any information about what is moving. The Recognition block, see section 2.1, is responsible for recognizing the object that is to be tracked.

      The recognition system in the present thesis is based on the system used for video object segmentation in Liu et al. [13]. The learning algorithm used
  30. is ψ-learning, described in section 3.3.5. The algorithm is trained during the initiation process and is then used throughout the whole simulation.

      4.3.1 Feature Space

      The ψ-learning algorithm does not work directly on the image, so it needs to be provided with some form of feature space. The feature space is calculated on blocks of 9x9 pixels, and the image is therefore divided into such blocks. There is an overlap of 1 pixel between the blocks: the first block spans pixels 0 to 8, the second block pixels 8 to 16, and so on for both x and y coordinates. The feature space is a 24-dimensional space, with 8 dimensions for each of the three color components:

      1. $c(0, 0)$
      2. $\sum_{j=1}^{N-1} c(0, j)^2$
      3. $\sum_{k=1}^{N-1} c(k, 0)^2$
      4. $\sum_{k=1}^{N-1} \sum_{j=1}^{N-1} c(k, j)^2$
      5. $(B_{(-1,-1)} + B_{(-1,0)} + B_{(-1,1)})/3$
      6. $(B_{(-1,1)} + B_{(0,1)} + B_{(1,1)})/3$
      7. $(B_{(1,-1)} + B_{(1,0)} + B_{(1,1)})/3$
      8. $(B_{(-1,-1)} + B_{(0,-1)} + B_{(1,-1)})/3$

      where $c(k, j)$ are the coefficients of the Discrete Cosine Transform (DCT), calculated on the 9x9 blocks; the system uses Matlab's dct2. In this case the first 3 coefficients ($N = 3$) of the DCT are used, to deal with the fact that the high-frequency coefficients tend to be small. The last 4 dimensions
  31. are the average color of the neighboring 9x9 blocks on each side, see figure 4.4. The combination of DCT and neighboring block color values gives good classification of surface as well as grouping information, which reduces the impact of noise [13].

      Figure 4.4: Neighbouring blocks of 9x9 pixels.

          B(−1,−1)   B(−1,0)   B(−1,1)
          B(0,−1)    B(0,0)    B(0,1)
          B(1,−1)    B(1,0)    B(1,1)

      4.3.2 Training

      When the object has been chosen as described in section 4.1, the algorithm needs to be trained using the training data. The blocks that are not chosen are used as background, see figure 4.1. The training is done with Matlab's fminsearch. fminsearch needs a start point in the feature space. This start point is calculated as the minimum squared error solution using the pseudoinverse

      $$w = (A^T A)^{-1} A^T Y$$

      where $w$ is the weight vector, $A$ is a matrix where each row represents a training point, and $Y$ is a matrix containing rows with the corresponding class for each training point.
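Features 1 to 4 of section 4.3.1 can be illustrated with a short Python sketch. It is not the thesis code: it assumes the orthonormal DCT-II normalization, which is intended to follow what Matlab's dct2 documents, and the names `dct_coeff` and `dct_features` are hypothetical. It handles one color component of one square block.

```python
import math


def dct_coeff(block, k, j):
    """Coefficient c(k, j) of the orthonormal 2-D DCT-II of a square block."""
    n = len(block)
    scale_k = math.sqrt((1.0 if k == 0 else 2.0) / n)
    scale_j = math.sqrt((1.0 if j == 0 else 2.0) / n)
    total = 0.0
    for m in range(n):
        for q in range(n):
            total += (block[m][q]
                      * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                      * math.cos(math.pi * (2 * q + 1) * j / (2 * n)))
    return scale_k * scale_j * total


def dct_features(block, n_coeffs=3):
    """Features 1-4: the DC coefficient plus the energy of the first
    horizontal, vertical and mixed low-frequency coefficients."""
    c = [[dct_coeff(block, k, j) for j in range(n_coeffs)] for k in range(n_coeffs)]
    f1 = c[0][0]
    f2 = sum(c[0][j] ** 2 for j in range(1, n_coeffs))
    f3 = sum(c[k][0] ** 2 for k in range(1, n_coeffs))
    f4 = sum(c[k][j] ** 2 for k in range(1, n_coeffs) for j in range(1, n_coeffs))
    return [f1, f2, f3, f4]


# A flat 9x9 block has only a DC component; a ramp along the second index
# puts its energy into feature 2.
flat = [[1.0] * 9 for _ in range(9)]
ramp = [[float(q) for q in range(9)] for _ in range(9)]
print(dct_features(flat))
print(dct_features(ramp))
```

Features 5 to 8 would then be plain averages of the mean colors of three neighboring blocks, and repeating all eight features for each color component gives the 24-dimensional vector.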
  32. Figure 4.5: Classification of an entire frame: (a) classification output, (b) frame. The green dots represent blocks that are classified as foreground/object and the red ones blocks that are classified as background.

      4.3.3 Detecting

      When the training is done, each frame needs to be converted into the feature space. The image is divided into blocks as described in section 4.3.1, and each block is then classified as either foreground or background, see figure 4.5. To handle noise better, at least two blocks need to be connected in order for them to be accepted as part of the object.

      4.4 Updating

      When the detection is finished, a point of interest is calculated, which is used during optimization of the system. The point of interest is computed by finding the block or blocks with the lowest y coordinate ((0, 0) being the upper left corner), and then taking the mean of the x coordinates of those blocks.
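The noise rule of section 4.3.3 and the point of interest of section 4.4 can be sketched as follows. This is an illustrative Python reading of the text, not the thesis implementation: foreground blocks are given as (row, column) indices, and treating diagonal neighbors as connected is an assumption, since the text does not say which connectivity is used.

```python
def keep_connected(blocks):
    """Drop foreground blocks that have no foreground neighbor.
    8-connectivity (diagonals count) is an assumption."""
    kept = set()
    for row, col in blocks:
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                if (d_row, d_col) != (0, 0) and (row + d_row, col + d_col) in blocks:
                    kept.add((row, col))
    return kept


def point_of_interest(blocks):
    """Return (x, y): the topmost row (lowest y, origin in the upper left
    corner) and the mean column of the blocks in that row."""
    if not blocks:
        return None
    top = min(row for row, _ in blocks)
    cols = [col for row, col in blocks if row == top]
    return (sum(cols) / len(cols), top)


# Hypothetical classification result: three connected blocks and one stray block.
foreground = {(2, 3), (2, 5), (3, 4), (9, 9)}
cleaned = keep_connected(foreground)
print(sorted(cleaned))
print(point_of_interest(cleaned))
```

For a raised hand this picks a point near the fingertips, since the topmost foreground blocks decide the result.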
  33. 4.5 Prediction

      An LMS filter, see section 3.1.1, is used to predict the next point of interest, which is used in the optimization of the system.

      The LMS filter is designed to be a one-step-ahead filter [10]: we want to predict the next coordinate using previous observations. Two filters were implemented, one for each coordinate:

      $$x(n + 1) = \sum_{k=0}^{N} \theta_x(k)\,x(n - k)$$

      $$y(n + 1) = \sum_{k=0}^{N} \theta_y(k)\,y(n - k).$$

      During simulations the filter mostly kept the previous 6 ($N = 6$) coordinates, and $\mu$ was set to around $10^{-8}$.

      4.6 Optimization

      To make the system run faster, a number of constraints were added to the system in order to reduce the work load.

      The detection described in section 4.2 is based on a filter which uses earlier images. Therefore it is not suitable to reduce the work load by calculating only parts of the image.

      The task that generated the heaviest load on the computer was the conversion from the pixel blocks to the feature space. In a study by Liu et al. [13], which uses the same feature space, the calculation of the DCT is the major contributor to this load. Therefore two constraints need to be fulfilled
  34. before the conversion is performed. The first constraint is that only a certain number of blocks, $\sigma$, around the previous point of interest are checked. During simulations typical values of $\sigma$ are 5, 7 and 11. On an image with resolution 640 × 480 there are 4524 blocks that the conversion needs to be made on. Having $\sigma = 7$, and therefore only 225 blocks, reduces the number of conversions by a factor of 20. The other constraint is that the conversion of a block is only made if motion is detected, see section 4.2, in a certain percentage, $\gamma$, of the pixels in the block. Typical values of $\gamma$ during simulations are 60-80%. After these constraints were applied, the conversion was no longer the bottleneck of the system.
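The one-step-ahead predictor of section 4.5 can be sketched in Python by applying the LMS update of equation 3.6 to one coordinate. This is an illustration, not the thesis code: the function names, the choice to start from a filter that simply repeats the last coordinate, and the example sequence are assumptions.

```python
def lms_predict(theta, past):
    """Predict the next coordinate as sum(theta(k) * past[k]); past[0] is the newest."""
    return sum(t * p for t, p in zip(theta, past))


def lms_update(theta, past, actual, mu):
    """LMS step (equation 3.6): move theta along past * prediction error."""
    error = actual - lms_predict(theta, past)
    return [t + mu * p * error for t, p in zip(theta, past)]


def run_predictor(coords, n_taps, mu):
    """Predict each coordinate from the n_taps previous ones, then adapt.
    Starting with theta = [1, 0, ..., 0] ("same place as last frame")
    is an assumption, not taken from the thesis."""
    theta = [1.0] + [0.0] * (n_taps - 1)
    predictions = []
    for n in range(n_taps, len(coords)):
        past = coords[n - n_taps:n][::-1]  # newest observation first
        predictions.append(lms_predict(theta, past))
        theta = lms_update(theta, past, coords[n], mu)
    return predictions, theta


# Hypothetical x coordinates of a point of interest drifting to the right.
xs = [100.0 + 2.0 * i for i in range(30)]
predictions, theta = run_predictor(xs, n_taps=6, mu=1e-8)
print(predictions[:3])
```

A second filter of the same form would be run on the y coordinates, and the predicted pair gives the centre of the search window used by the first constraint in section 4.6.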
  36. Chapter 5 Result

      The system was tested by following a person's hand. The camera was mounted on the screen and the person sat down in front of the camera.

      5.1 Simulations

      The simulations were made on a sequence of 91 frames, with 5 different values of $\sigma$. Data on tracking error (Euclidean distance from ground truth) and number of blocks calculated, see section 4.6, was collected. How the system performed with different $\sigma$ is presented in figure 5.1, and table 5.1 shows the average values over all frames. The individual plots are shown separately in appendix B.

      Other than $\sigma$, there are a number of variables that have an impact on the performance of the system. There are 4 variables that control the motion detection: scale, which controls the smoothing of the image, see figure 4.2; $\alpha$, which controls the rate at which the reference image is updated over time; diffThres,
  37. Figure 5.1: Plots of the simulations. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.
  38. which is the variable that tunes at what point the difference should be classed as motion; and $\gamma$, which controls the percentage of the pixels in a block that need to be classified as motion for the block to be evaluated. There are 2 variables that control the prediction: filterLength, which is the length of the LMS filter, and $\mu$, which is the variable that controls the step size of the LMS filter. During the simulations the variables were set to

          scale = 15
          α = 0.9
          diffThres = 14
          γ = 80%
          filterLength = 6
          µ = 13 · 10^−7

      Table 5.1: The average value of the plots in figure 5.1.

          σ     Tracking error   Blocks calculated
          5      43.29           21.90
          7      26.18           31.96
          9      19.12           53.38
          11     19.01           78.56
          13    143.62           59.03

      5.2 Color spaces

      A number of color spaces were evaluated to see if there were any major differences in performance. The error rate on the training set after training,
  39. i.e. the proportion of misclassifications when trying to classify the training set, is presented in table 5.2. The conversion from the RGB image was done either with Matlab's built-in functions or as described in the study by Sazonov [19]. The reason why the background has such a high error rate is that in the example in section 4.1, see figure 4.1, the face is not a part of the object but has similar features to the hand. The NTSC conversion (YIQ color space) is supplied by Matlab and was used most extensively during the tests.

      Table 5.2: Error rates of a number of color spaces.

          color space      foreground error   background error   total error
          RGB              0.76%              10.57%              8.84%
          normalized RGB   1.82%              12.82%             11.26%
          HSV              0.15%              10.57%              9.10%
          TSL              5.77%              11.04%             10.30%
          YCrCb            1.82%              13.52%             11.86%
          NTSC             1.67%               7.17%              6.39%

      5.3 Tracking

      Under favorable conditions, such as sufficient light and little or no disturbance in the background, the tracking worked well. The system still coped when noise, such as back light and/or motion of other objects in the background, was introduced. The filter allowed the system to keep working: even though the tracking failed during short periods of time, it was able to snap on again after a few frames. Due to limitations in the system, the tracking will fail if a block is misclassified as the object, which only occurs if motion is detected in the block. This occurs at frame 38 with $\sigma = 13$. The reason why
  40. 40. the system performs well with σ = 9 and σ = 11 is that the search area is sufficiently large to track well, while still small enough to avoid possible noise in other parts of the image. Fast motion is something a standard DV camera is unable to handle. It introduces motion blur, see figure 5.2, so that the hand blurs into the background and changes in color and texture.

Figure 5.2: Two frames with motion blur.

5.4 Speed

Since the system is implemented in Matlab it is hard to judge whether it could run in real time or not. With the system optimized as described in section 4.6 it runs on the Apple computer at roughly 1.3 fps. This frame rate is reached even though Matlab does not utilize both processors and performs poorly on loops, since it does not optimize them the way programs written in C/C++ would. In addition, the code is not optimal when it comes to minimizing the work load. For each frame there are roughly 20 to 200 blocks, depending on the size of σ, that need to be calculated, also
  41. 41. the detection part consists of pixel-by-pixel computations. Therefore this system could utilize the full power of computers with multiple cores, and perhaps even distributed systems.
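The roles of diffThres and γ from section 5.1 can be illustrated with a minimal Python sketch. It is not the thesis implementation, which was written in Matlab. It assumes that motion is measured as the absolute difference between two consecutive grayscale frames; the block size of 2, the toy frames and the function name moving_blocks are hypothetical choices made for this example.

```python
def moving_blocks(prev, curr, block, diff_thres, gamma):
    """Return (block_row, block_col) indices of blocks with enough motion.

    prev, curr : grayscale frames as equally sized lists of rows
    block      : side length of the square blocks (assumed, not from the thesis)
    diff_thres : a pixel counts as motion if |curr - prev| > diff_thres
    gamma      : fraction of motion pixels a block needs to be evaluated
    """
    rows, cols = len(curr), len(curr[0])
    selected = []
    for r0 in range(0, rows - block + 1, block):
        for c0 in range(0, cols - block + 1, block):
            # Count the pixels in this block whose difference exceeds the threshold.
            moving = sum(
                1
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
                if abs(curr[r][c] - prev[r][c]) > diff_thres
            )
            if moving >= gamma * block * block:
                selected.append((r0 // block, c0 // block))
    return selected


# Toy 4x4 frames: only the top-left 2x2 block changes between the frames.
prev_frame = [[10] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
for r in range(2):
    for c in range(2):
        curr_frame[r][c] = 60

blocks = moving_blocks(prev_frame, curr_frame, block=2, diff_thres=14, gamma=0.8)
```

Each block is evaluated without reference to any other block, which is the property that would let the work be spread over several cores.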
  42. 42. Chapter 6
Discussion

The system works overall as expected. It outperforms advanced algorithms in terms of lower computational power needed, and is more stable than the fast ones. A drawback is that the system parameters were dependent on the object and its surroundings. Much of the failure could probably be compensated with more complex equipment. A more advanced camera could be configured to use a shorter shutter time, reducing the problem with tracking failure during motion blur.

Problems due to limitations in the algorithms of the system are more complex. For example, when the tracker fails because of misclassification and motion, the problem will not be solved with better hardware. Also, if the object is big and has no texture, so that it is registered as a flat surface, the motion algorithm will only detect motion on the contours, giving a false representation of the object.

To improve the system, it might be possible to model the shape of the object and feed that to an adaptive filter, such as the Kalman filter [10, 20].
  43. 43. Introducing the Kalman filter would allow more complex constraints that are also adaptive during runtime. For example, the updating of the point of interest could be forced to be more like the motions of a human hand, and the shape could be forced to change more continuously. The drawback of these constraints is that the system becomes less general and harder to configure.

6.1 Future work

Though not in the scope of this thesis, the performance of the system could probably be improved by implementing it in a low-level language such as C or C++. The code could then be optimized further, making sure no unnecessary computations are made. Not until then will we be able to measure how well the system performs in real time. Stereo vision might make foreground detection easier; however, a stereo system in real time is not trivial. To make the system even faster it could be possible, for a simple object like a hand, to use simpler pattern recognition algorithms.
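As a rough illustration of the kind of adaptive filter discussed in this chapter, the following Python sketch smooths one coordinate of the point of interest with a scalar Kalman filter. It assumes a random-walk state model. The noise variances q and r, the initial values and the function name kalman_1d are hypothetical and are not taken from the thesis; modelling the shape of the object, as suggested above, would require a larger state vector.

```python
def kalman_1d(measurements, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model.

    q  : process noise variance (assumed value)
    r  : measurement noise variance (assumed value)
    x0 : initial state estimate
    p0 : variance of the initial estimate
    Returns the list of filtered estimates, one per measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, its uncertainty grows by q.
        p_pred = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p_pred / (p_pred + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p_pred
        estimates.append(x)
    return estimates


# A constant position observed 200 times: the estimate approaches it.
track = kalman_1d([10.0] * 200)
```

A larger q makes the estimate follow the measurements more quickly, while a larger r makes it smoother; this is the kind of tuning that makes such a system harder to configure.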
  44. 44. Appendix A
Mathematical cornerstones

A.1 Statistical Theory

Many of today's algorithms and systems use different forms of a priori knowledge to enhance the result.

Probabilities

There are a few probabilities that are frequently used when working with pattern recognition and other statistical frameworks. The first is the regular probability

$P_X(x)$,

which is the value that describes how likely it is that variable X will be set to x ($P(x)$ and $P(X = x)$ are different notations for the same thing). Then there is the joint probability

$P_{X,Y}(x, y)$,
  45. 45. which describes how likely it is that X is set to x and Y is set to y ($P(x, y)$ and $P(X = x, Y = y)$ are different notations for the same thing). The conditional probability

$P_{X|Y}(x|y)$

describes how likely it is that X is set to x given that Y is set to y ($P(x|y)$ and $P(X = x|Y = y)$ are different notations for the same thing). The definition is

$P_{X|Y}(x|y) = \frac{P_{X,Y}(x, y)}{P_Y(y)}$.

Bayes formula

If we have the knowledge of both $P_X(x)$ and $P_{Y|X}(y|x)$, we can, from the definition of conditional probability, get

$P_{X,Y}(x, y) = P_{X|Y}(x|y) P_Y(y) = P_{Y|X}(y|x) P_X(x)$,

which can be rewritten to

$P_{Y|X}(y|x) = \frac{P_{X|Y}(x|y) P_Y(y)}{P_X(x)}$.

This is known as Bayes formula [2, 21].
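The relations above can be checked numerically. The following Python sketch uses a small joint distribution whose values are made up for the example; the helper names are hypothetical. It computes both conditional probabilities from the joint probability and confirms that the two sides of Bayes formula agree.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P_{X,Y}(x, y); the numbers are illustrative only.
joint = {
    ("a", 0): F(1, 10), ("a", 1): F(3, 10),
    ("b", 0): F(4, 10), ("b", 1): F(2, 10),
}


def marginal_x(x):
    """P_X(x): sum the joint probability over all y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """P_Y(y): sum the joint probability over all x."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_x_given_y(x, y):
    """P_{X|Y}(x|y) = P_{X,Y}(x, y) / P_Y(y)."""
    return joint[(x, y)] / marginal_y(y)


def cond_y_given_x(y, x):
    """P_{Y|X}(y|x) = P_{X,Y}(x, y) / P_X(x)."""
    return joint[(x, y)] / marginal_x(x)


# Bayes formula: P_{Y|X}(y|x) = P_{X|Y}(x|y) P_Y(y) / P_X(x)
lhs = cond_y_given_x(1, "a")
rhs = cond_x_given_y("a", 1) * marginal_y(1) / marginal_x("a")
```

Exact fractions are used instead of floating point numbers so that the two sides can be compared for equality.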
  46. 46. Expected value

The expected value is the mean value of the stochastic variable or function:

$E[X] = m_X$
$E[f(X)] = m_{f(X)}$.

For a discrete stochastic variable the expected value is calculated as

$E[X] = \sum_{x \in X} x P_X(x)$.

Variance

The expected value gives the mean value of the stochastic variable or function. Variance gives the expected value of the squared distance between the stochastic variable and $m_X$:

$Var[X] = \sigma^2 = E[(X - m_X)^2]$.

The variance can be expressed as

$Var[X] = E[X^2] - (E[X])^2$
$Var[f(X)] = E[f^2(X)] - (E[f(X)])^2$.
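A short Python sketch can confirm that the definition of the variance and the shortcut expression give the same value. A fair six-sided die is used as an assumed example distribution; the function names are hypothetical.

```python
from fractions import Fraction


def expectation(pmf, f=lambda x: x):
    """E[f(X)] for a discrete variable given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())


def variance(pmf, f=lambda x: x):
    """Var[f(X)] from the definition E[(f(X) - m)^2]."""
    mean = expectation(pmf, f)
    return expectation(pmf, lambda x: (f(x) - mean) ** 2)


# Example distribution: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expectation(die)                                      # E[X]
var_definition = variance(die)                               # E[(X - m_X)^2]
var_shortcut = expectation(die, lambda x: x * x) - mean ** 2  # E[X^2] - (E[X])^2
```

For the die, the mean is 7/2 and both variance expressions evaluate to 35/12.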
  47. 47. Covariance

Covariance is defined as

$r_{XY} = Cov[X, Y] = E[(X - m_X)(Y - m_Y)] = \sum_{x \in X} \sum_{y \in Y} (x - m_X)(y - m_Y) P_{X,Y}(x, y)$.
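The double sum can be evaluated directly for a small discrete joint distribution. In the Python sketch below the joint probabilities are invented for the example and the function name covariance is hypothetical.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P_{X,Y}(x, y) of two binary variables.
joint = {
    (0, 0): F(2, 5), (0, 1): F(1, 10),
    (1, 0): F(1, 10), (1, 1): F(2, 5),
}


def covariance(joint):
    """r_XY = sum over x and y of (x - m_X)(y - m_Y) P_{X,Y}(x, y)."""
    m_x = sum(x * p for (x, _), p in joint.items())
    m_y = sum(y * p for (_, y), p in joint.items())
    return sum((x - m_x) * (y - m_y) * p for (x, y), p in joint.items())


r_xy = covariance(joint)
```

With these numbers both means are 1/2 and the covariance is 3/20; it is positive because X and Y tend to take the same value.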
  48. 48. Appendix B
Simulation plots

The plots of the simulations described in section 5.1, separated into independent plots. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.
  54. 54. Bibliography
[1] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Prentice-Hall, Inc., second edition, 2001.
[2] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, Wiley & Sons, Inc., second edition, 2001.
[3] Ville Kyrki and Danica Kragić, Tracking rigid objects using integration of model-based and model-free cues, nyn, 2005.
[4] Nikolaos D. Doulamis, Anastasios D. Doulamis, and Klimis Ntalianis, Adaptive classification-based articulation and tracking of video objects employing neural network retraining.
[5] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564-575, 2003.
[6] Aishy Amer, Voting-based simultaneous tracking of multiple video objects, 2003, vol. 5022, pp. 500-511, SPIE.
[7] Danica Kragić, Visual Servoing for Manipulation: Robustness and Integration Issues, Ph.D. thesis, Royal Institute of Technology, 2001.
  55. 55. [8] A. Cavallaro, O. Steiger, and T. Ebrahimi, Tracking video objects in cluttered background, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 4, pp. 575-584, 2005.
[9] M. Gastaud, M. Barlaud, and G. Aubert, Tracking video objects using active contours, in MOTION '02: Proceedings of the Workshop on Motion and Video Computing, Washington, DC, USA, 2002, p. 90, IEEE Computer Society.
[10] Håkan Hjalmarsson and Björn Ottersten, Lecture notes in adaptive signal processing, Tech. Rep., Signal, Sensors and System, Stockholm, Sweden, 2002.
[11] Michal Irani, Benny Rousso, and Shmuel Peleg, Computing occluding and transparent motions, Tech. Rep., Institute of Computer Science, Jerusalem, Israel, 1994.
[12] Christopher J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[13] Yi Liu and Yuan F. Zheng, Video object segmentation and tracking using ψ-learning, IEEE Transactions on Circuits and Systems for Video Technology, 2005.
[14] Anastasios Tefas, Constantine Kotropoulos, and Ioannis Pitas, Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication, IEEE Trans. Pattern Anal. Mach. Intell., 2001.
  56. 56. [15] Daniel J. Sebald and James A. Bucklew, Support vector machine techniques for nonlinear equalization, IEEE Transactions on Signal Processing, 2000.
[16] Edgar Osuna, Robert Freund, and Federico Girosi, Training support vector machines: an application to face detection, IEEE, Computer Vision and Pattern Recognition, 1997.
[17] Massimiliano Pontil and Alessandro Verri, Support vector machines for 3D object recognition, IEEE Trans. Pattern Anal. Mach. Intell., 1998.
[18] Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing Hung Wong, On ψ-learning, Journal of the American Statistical Association, 2003.
[19] Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, A survey on pixel-based skin color detection techniques, Tech. Rep., Graphics and Media Laboratory, Faculty of Computational Mathematics and Cybernetics, Moscow, Russia, 2003.
[20] Monson H. Hayes, Statistical Digital Signal Processing and Modeling, Wiley & Sons, Inc., first edition, 1996.
[21] Arne Leijon, Pattern recognition, Tech. Rep., Signal, Sensors and System, Stockholm, Sweden, 2005.