IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002

Mean Shift: A Robust Approach Toward Feature Space Analysis

Dorin Comaniciu, Member, IEEE, and Peter Meer, Senior Member, IEEE

Abstract: A general nonparametric technique is proposed for the analysis of a complex multimodal feature space and to delineate arbitrarily shaped clusters in it. The basic computational module of the technique is an old pattern recognition procedure, the mean shift. We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and, thus, its utility in detecting the modes of the density. The relation of the mean shift procedure to the Nadaraya-Watson estimator from kernel regression and the robust M-estimators of location is also established. Algorithms for two low-level vision tasks, discontinuity preserving smoothing and image segmentation, are described as applications. In these algorithms, the only user set parameter is the resolution of the analysis and either gray level or color images are accepted as input. Extensive experimental results illustrate their excellent performance.

Index Terms: Mean shift, clustering, image segmentation, image smoothing, feature space, low-level vision.

D. Comaniciu is with the Imaging and Visualization Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540. E-mail: firstname.lastname@example.org.
P. Meer is with the Electrical and Computer Engineering Department, Rutgers University, 94 Brett Road, Piscataway, NJ 08854-8058. E-mail: email@example.com.
Manuscript received 17 Jan. 2001; revised 16 July 2001; accepted 21 Nov. 2001. Recommended for acceptance by V. Solo.
For information on obtaining reprints of this article, please send e-mail to: firstname.lastname@example.org, and reference IEEECS Log Number 113483.
0162-8828/02/$17.00 © 2002 IEEE

1 INTRODUCTION

Low-level computer vision tasks are misleadingly difficult. Incorrect results can be easily obtained since the employed techniques often rely upon the user correctly guessing the values for the tuning parameters. To improve performance, the execution of low-level tasks should be task driven, i.e., supported by independent high-level information. This approach, however, requires that, first, the low-level stage provides a reliable enough representation of the input and that the feature extraction process be controlled only by very few tuning parameters corresponding to intuitive measures in the input domain.

Feature space-based analysis of images is a paradigm which can achieve the above-stated goals. A feature space is a mapping of the input obtained through the processing of the data in small subsets at a time. For each subset, a parametric representation of the feature of interest is obtained and the result is mapped into a point in the multidimensional space of the parameter. After the entire input is processed, significant features correspond to denser regions in the feature space, i.e., to clusters, and the goal of the analysis is the delineation of these clusters.

The nature of the feature space is application dependent. The subsets employed in the mapping can range from individual pixels, as in the color space representation of an image, to a set of quasi-randomly chosen data points, as in the probabilistic Hough transform. Both the advantage and the disadvantage of the feature space paradigm arise from the global nature of the derived representation of the input. On one hand, all the evidence for the presence of a significant feature is pooled together, providing excellent tolerance to a noise level which may render local decisions unreliable. On the other hand, features with lesser support in the feature space may not be detected in spite of being salient for the task to be executed. This disadvantage, however, can be largely avoided by either augmenting the feature space with additional (spatial) parameters from the input domain or by robust postprocessing of the input domain guided by the results of the feature space analysis.

Analysis of the feature space is application independent. While there are a plethora of published clustering techniques, most of them are not adequate to analyze feature spaces derived from real data. Methods which rely upon a priori knowledge of the number of clusters present (including those which use optimization of a global criterion to find this number), as well as methods which implicitly assume the same shape (most often elliptical) for all the clusters in the space, are not able to handle the complexity of a real feature space. For a recent survey of such methods, see [29, Section 8].

In Fig. 1, a typical example is shown. The color image in Fig. 1a is mapped into the three-dimensional L*u*v* color space (to be discussed in Section 4). There is a continuous transition between the clusters arising from the dominant colors and a decomposition of the space into elliptical tiles will introduce severe artifacts. Enforcing a Gaussian mixture model over such data is doomed to fail, e.g., , and even the use of a robust approach with contaminated Gaussian densities  cannot be satisfactory for such complex cases. Note also that the mixture models require the number of clusters as a parameter, which raises its own challenges. For example, the method described in  proposes several different ways to determine this number.

Arbitrarily structured feature spaces can be analyzed only by nonparametric methods since these methods do not have embedded assumptions. Numerous nonparametric clustering methods were described in the literature and they can be classified into two large classes: hierarchical clustering and density estimation. Hierarchical clustering techniques either aggregate or divide the data based on
some proximity measure. See [28, Section 3.2] for a survey of hierarchical clustering methods. The hierarchical methods tend to be computationally expensive and the definition of a meaningful stopping criterion for the fusion (or division) of the data is not straightforward.

Fig. 1. Example of a feature space. (a) A 400 × 276 color image. (b) Corresponding L*u*v* color space with 110,400 data points.

The rationale behind the density estimation-based nonparametric clustering approach is that the feature space can be regarded as the empirical probability density function (p.d.f.) of the represented parameter. Dense regions in the feature space thus correspond to local maxima of the p.d.f., that is, to the modes of the unknown density. Once the location of a mode is determined, the cluster associated with it is delineated based on the local structure of the feature space , , .

Our approach to mode detection and clustering is based on the mean shift procedure, proposed in 1975 by Fukunaga and Hostetler  and largely forgotten until Cheng's paper  rekindled interest in it. In spite of its excellent qualities, the mean shift procedure does not seem to be known in statistical literature. While the book [54, Section 6.2.2] discusses , the advantages of employing a mean shift type procedure in density estimation were only recently rediscovered .

As will be proven in the sequel, a computational module based on the mean shift procedure is an extremely versatile tool for feature space analysis and can provide reliable solutions for many vision tasks. In Section 2, the mean shift procedure is defined and its properties are analyzed. In Section 3, the procedure is used as the computational module for robust feature space analysis and implementational issues are discussed. In Section 4, the feature space analysis technique is applied to two low-level vision tasks: discontinuity preserving filtering and image segmentation. Both algorithms can have as input either gray level or color images and the only parameter to be tuned by the user is the resolution of the analysis. The applicability of the mean shift procedure is not restricted to the presented examples. In Section 5, other applications are mentioned and the procedure is put into a more general context.

2 THE MEAN SHIFT PROCEDURE

Kernel density estimation (known as the Parzen window technique in pattern recognition literature [17, Section 4.3]) is the most popular density estimation method. Given $n$ data points $\mathbf{x}_i$, $i = 1, \ldots, n$ in the $d$-dimensional space $R^d$, the multivariate kernel density estimator with kernel $K(\mathbf{x})$ and a symmetric positive definite $d \times d$ bandwidth matrix $\mathbf{H}$, computed in the point $\mathbf{x}$ is given by

$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K_{\mathbf{H}}(\mathbf{x} - \mathbf{x}_i), \tag{1}$$

where

$$K_{\mathbf{H}}(\mathbf{x}) = |\mathbf{H}|^{-1/2} K(\mathbf{H}^{-1/2} \mathbf{x}). \tag{2}$$

The $d$-variate kernel $K(\mathbf{x})$ is a bounded function with compact support satisfying [62, p. 95]

$$\int_{R^d} K(\mathbf{x})\,d\mathbf{x} = 1, \qquad \lim_{\|\mathbf{x}\| \to \infty} \|\mathbf{x}\|^d K(\mathbf{x}) = 0,$$
$$\int_{R^d} \mathbf{x} K(\mathbf{x})\,d\mathbf{x} = \mathbf{0}, \qquad \int_{R^d} \mathbf{x}\mathbf{x}^\top K(\mathbf{x})\,d\mathbf{x} = c_K \mathbf{I}, \tag{3}$$

where $c_K$ is a constant. The multivariate kernel can be generated from a symmetric univariate kernel $K_1(x)$ in two different ways

$$K^P(\mathbf{x}) = \prod_{i=1}^{d} K_1(x_i), \qquad K^S(\mathbf{x}) = a_{k,d} K_1(\|\mathbf{x}\|), \tag{4}$$

where $K^P(\mathbf{x})$ is obtained from the product of the univariate kernels and $K^S(\mathbf{x})$ from rotating $K_1(x)$ in $R^d$, i.e., $K^S(\mathbf{x})$ is radially symmetric. The constant $a_{k,d}^{-1} = \int_{R^d} K_1(\|\mathbf{x}\|)\,d\mathbf{x}$ assures that $K^S(\mathbf{x})$ integrates to one, though this condition can be relaxed in our context. Either type of multivariate kernel obeys (3), but, for our purposes, the radially symmetric kernels are often more suitable.

We are interested only in a special class of radially symmetric kernels satisfying

$$K(\mathbf{x}) = c_{k,d}\, k(\|\mathbf{x}\|^2), \tag{5}$$

in which case it suffices to define the function $k(x)$, called the profile of the kernel, only for $x \geq 0$. The normalization constant $c_{k,d}$, which makes $K(\mathbf{x})$ integrate to one, is assumed strictly positive.

Using a fully parameterized $\mathbf{H}$ increases the complexity of the estimation [62, p. 106] and, in practice, the bandwidth matrix $\mathbf{H}$ is chosen either as diagonal $\mathbf{H} = \mathrm{diag}[h_1^2, \ldots, h_d^2]$,
or proportional to the identity matrix $\mathbf{H} = h^2 \mathbf{I}$. The clear advantage of the latter case is that only one bandwidth parameter $h > 0$ must be provided; however, as can be seen from (2), the validity of a Euclidean metric for the feature space should then be confirmed first. Employing only one bandwidth parameter, the kernel density estimator (1) becomes the well-known expression

$$\hat{f}(\mathbf{x}) = \frac{1}{nh^d} \sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right). \tag{6}$$

The quality of a kernel density estimator is measured by the mean of the square error between the density and its estimate, integrated over the domain of definition. In practice, however, only an asymptotic approximation of this measure (denoted as AMISE) can be computed. Under the asymptotics, the number of data points $n \to \infty$, while the bandwidth $h \to 0$ at a rate slower than $n^{-1}$. For both types of multivariate kernels, the AMISE measure is minimized by the Epanechnikov kernel [51, p. 139], [62, p. 104] having the profile

$$k_E(x) = \begin{cases} 1 - x & 0 \leq x \leq 1 \\ 0 & x > 1, \end{cases} \tag{7}$$

which yields the radially symmetric kernel

$$K_E(\mathbf{x}) = \begin{cases} \frac{1}{2} c_d^{-1} (d + 2)(1 - \|\mathbf{x}\|^2) & \|\mathbf{x}\| \leq 1 \\ 0 & \text{otherwise}, \end{cases} \tag{8}$$

where $c_d$ is the volume of the unit $d$-dimensional sphere. Note that the Epanechnikov profile is not differentiable at the boundary. The profile

$$k_N(x) = \exp\!\left(-\frac{1}{2} x\right), \qquad x \geq 0 \tag{9}$$

yields the multivariate normal kernel

$$K_N(\mathbf{x}) = (2\pi)^{-d/2} \exp\!\left(-\frac{1}{2} \|\mathbf{x}\|^2\right) \tag{10}$$

for both types of composition (4). The normal kernel is often symmetrically truncated to have a kernel with finite support.

While these two kernels will suffice for most applications we are interested in, all the results presented below are valid for arbitrary kernels within the conditions to be stated. Employing the profile notation, the density estimator (6) can be rewritten as

$$\hat{f}_{h,K}(\mathbf{x}) = \frac{c_{k,d}}{nh^d} \sum_{i=1}^{n} k\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right). \tag{11}$$

The first step in the analysis of a feature space with the underlying density $f(\mathbf{x})$ is to find the modes of this density. The modes are located among the zeros of the gradient, $\nabla f(\mathbf{x}) = \mathbf{0}$, and the mean shift procedure is an elegant way to locate these zeros without estimating the density.

2.1 Density Gradient Estimation

The density gradient estimator is obtained as the gradient of the density estimator by exploiting the linearity of (11)

$$\hat{\nabla} f_{h,K}(\mathbf{x}) \equiv \nabla \hat{f}_{h,K}(\mathbf{x}) = \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{n} (\mathbf{x} - \mathbf{x}_i)\, k'\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right). \tag{12}$$

We define the function

$$g(x) = -k'(x), \tag{13}$$

assuming that the derivative of the kernel profile $k$ exists for all $x \in [0, \infty)$, except for a finite set of points. Now, using $g(x)$ for profile, the kernel $G(\mathbf{x})$ is defined as

$$G(\mathbf{x}) = c_{g,d}\, g(\|\mathbf{x}\|^2), \tag{14}$$

where $c_{g,d}$ is the corresponding normalization constant. The kernel $K(\mathbf{x})$ was called the shadow of $G(\mathbf{x})$ in  in a slightly different context. Note that the Epanechnikov kernel is the shadow of the uniform kernel, i.e., the $d$-dimensional unit sphere, while the normal kernel and its shadow have the same expression.

Introducing $g(x)$ into (12) yields

$$\hat{\nabla} f_{h,K}(\mathbf{x}) = \frac{2c_{k,d}}{nh^{d+2}} \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{x})\, g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right) = \frac{2c_{k,d}}{nh^{d+2}} \left[\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)\right] \left[\frac{\sum_{i=1}^{n} \mathbf{x}_i\, g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{x}\right], \tag{15}$$

where $\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)$ is assumed to be a positive number. This condition is easy to satisfy for all the profiles met in practice. Both terms of the product in (15) have special significance. From (11), the first term is proportional to the density estimate at $\mathbf{x}$ computed with the kernel $G$

$$\hat{f}_{h,G}(\mathbf{x}) = \frac{c_{g,d}}{nh^d} \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right). \tag{16}$$

The second term is the mean shift

$$\mathbf{m}_{h,G}(\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{x}_i\, g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x} - \mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{x}, \tag{17}$$

i.e., the difference between the weighted mean, using the kernel $G$ for weights, and $\mathbf{x}$, the center of the kernel (window). From (16) and (17), (15) becomes

$$\hat{\nabla} f_{h,K}(\mathbf{x}) = \hat{f}_{h,G}(\mathbf{x})\, \frac{2c_{k,d}}{h^2 c_{g,d}}\, \mathbf{m}_{h,G}(\mathbf{x}), \tag{18}$$

yielding

$$\mathbf{m}_{h,G}(\mathbf{x}) = \frac{1}{2} h^2 c\, \frac{\hat{\nabla} f_{h,K}(\mathbf{x})}{\hat{f}_{h,G}(\mathbf{x})}, \tag{19}$$

where, by (18), $c = c_{g,d}/c_{k,d}$. The expression (19) shows that, at location $\mathbf{x}$, the mean shift vector computed with kernel $G$ is proportional to the normalized density gradient estimate obtained with kernel $K$. The normalization is by the density estimate in $\mathbf{x}$ computed with the kernel $G$. The mean shift vector thus always points toward the direction of maximum increase in the density. This is a more general formulation of the property first remarked by Fukunaga and Hostetler [20, p. 535], , and discussed in . The relation captured in (19) is intuitive: the local mean is shifted toward the region in which the majority of the
points reside. Since the mean shift vector is aligned with the local gradient estimate, it can define a path leading to a stationary point of the estimated density. The modes of the density are such stationary points. The mean shift procedure, obtained by successive

• computation of the mean shift vector $\mathbf{m}_{h,G}(\mathbf{x})$,
• translation of the kernel (window) $G(\mathbf{x})$ by $\mathbf{m}_{h,G}(\mathbf{x})$,

is guaranteed to converge at a nearby point where the estimate (11) has zero gradient, as will be shown in the next section. The presence of the normalization by the density estimate is a desirable feature. The regions of low-density values are of no interest for the feature space analysis and, in such regions, the mean shift steps are large. Similarly, near local maxima the steps are small and the analysis more refined. The mean shift procedure thus is an adaptive gradient ascent method.

2.2 Sufficient Condition for Convergence

Denote by $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ the sequence of successive locations of the kernel $G$, where, from (17),

$$\mathbf{y}_{j+1} = \frac{\sum_{i=1}^{n} \mathbf{x}_i\, g\!\left(\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right)}, \qquad j = 1, 2, \ldots \tag{20}$$

is the weighted mean at $\mathbf{y}_j$ computed with kernel $G$ and $\mathbf{y}_1$ is the center of the initial position of the kernel. The corresponding sequence of density estimates computed with kernel $K$, $\{\hat{f}_{h,K}(j)\}_{j=1,2,\ldots}$, is given by

$$\hat{f}_{h,K}(j) = \hat{f}_{h,K}(\mathbf{y}_j), \qquad j = 1, 2, \ldots. \tag{21}$$

As stated by the following theorem, a kernel $K$ that obeys some mild conditions suffices for the convergence of the sequences $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\ldots}$.

Theorem 1. If the kernel $K$ has a convex and monotonically decreasing profile, the sequences $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\ldots}$ converge and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\ldots}$ is monotonically increasing.

The proof is given in the Appendix. The theorem generalizes the result derived differently in , where $K$ was the Epanechnikov kernel and $G$ the uniform kernel. The theorem remains valid when each data point $\mathbf{x}_i$ is associated with a nonnegative weight $w_i$. An example of nonconvergence when the kernel $K$ is not convex is shown in [10, p. 16].

The convergence property of the mean shift was also discussed in [7, Section iv]. (Note, however, that almost all the discussion there is concerned with the "blurring" process in which the input is recursively modified after each mean shift step.) The convergence of the procedure as defined in this paper was attributed in  to the gradient ascent nature of (19). However, as shown in [4, Section 1.2], moving in the direction of the local gradient guarantees convergence only for infinitesimal steps. The step size of a gradient-based algorithm is crucial for the overall performance. If the step size is too large, the algorithm will diverge, while if the step size is too small, the rate of convergence may be very slow. A number of costly procedures have been developed for step size selection [4, p. 24]. The guaranteed convergence (as shown by Theorem 1) is due to the adaptive magnitude of the mean shift vector, which also eliminates the need for additional procedures to choose adequate step sizes. This is a major advantage over the traditional gradient-based methods.

For discrete data, the number of steps to convergence depends on the employed kernel. When $G$ is the uniform kernel, convergence is achieved in a finite number of steps since the number of locations generating distinct mean values is finite. However, when the kernel $G$ imposes a weighting on the data points (according to the distance from its center), the mean shift procedure is infinitely convergent. The practical way to stop the iterations is to set a lower bound for the magnitude of the mean shift vector.

2.3 Mean Shift-Based Mode Detection

Let us denote by $\mathbf{y}_c$ and $\hat{f}^c_{h,K} = \hat{f}_{h,K}(\mathbf{y}_c)$ the convergence points of the sequences $\{\mathbf{y}_j\}_{j=1,2,\ldots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2,\ldots}$, respectively. The implications of Theorem 1 are the following.

First, the magnitude of the mean shift vector converges to zero. Indeed, from (17) and (20) the $j$th mean shift vector is

$$\mathbf{m}_{h,G}(\mathbf{y}_j) = \mathbf{y}_{j+1} - \mathbf{y}_j \tag{22}$$

and, at the limit, $\mathbf{m}_{h,G}(\mathbf{y}_c) = \mathbf{y}_c - \mathbf{y}_c = \mathbf{0}$. In other words, the gradient of the density estimate (11) computed at $\mathbf{y}_c$ is zero

$$\nabla \hat{f}_{h,K}(\mathbf{y}_c) = \mathbf{0}, \tag{23}$$

due to (19). Hence, $\mathbf{y}_c$ is a stationary point of $\hat{f}_{h,K}$. Second, since $\{\hat{f}_{h,K}(j)\}_{j=1,2,\ldots}$ is monotonically increasing, the mean shift iterations satisfy the conditions required by the Capture Theorem [4, p. 45], which states that the trajectories of such gradient methods are attracted by local maxima if they are unique (within a small neighborhood) stationary points. That is, once $\mathbf{y}_j$ gets sufficiently close to a mode of $\hat{f}_{h,K}$, it converges to it. The set of all locations that converge to the same mode defines the basin of attraction of that mode.

The theoretical observations from above suggest a practical algorithm for mode detection:

• Run the mean shift procedure to find the stationary points of $\hat{f}_{h,K}$,
• Prune these points by retaining only the local maxima.

The local maxima points are defined, according to the Capture Theorem, as unique stationary points within some small open sphere. This property can be tested by perturbing each stationary point by a random vector of small norm and letting the mean shift procedure converge again. Should the point of convergence be unchanged (up to a tolerance), the point is a local maximum.

2.4 Smooth Trajectory Property

The mean shift procedure employing a normal kernel has an interesting property. Its path toward the mode follows a smooth trajectory, the angle between two consecutive mean shift vectors being always less than 90 degrees.

Using the normal kernel (10), whose profile (9) gives weights proportional to $\exp(-\frac{1}{2}\|\cdot\|^2)$, the $j$th mean shift vector is given by

$$\mathbf{m}_{h,N}(\mathbf{y}_j) = \mathbf{y}_{j+1} - \mathbf{y}_j = \frac{\sum_{i=1}^{n} \mathbf{x}_i \exp\!\left(-\frac{1}{2}\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} \exp\!\left(-\frac{1}{2}\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right)} - \mathbf{y}_j. \tag{24}$$

The following theorem holds true for all $j = 1, 2, \ldots$, according to the proof given in the Appendix.
Theorem 2. The cosine of the angle between two consecutive mean shift vectors is strictly positive when a normal kernel is employed, i.e.,

$$\frac{\mathbf{m}_{h,N}(\mathbf{y}_j)^\top \mathbf{m}_{h,N}(\mathbf{y}_{j+1})}{\|\mathbf{m}_{h,N}(\mathbf{y}_j)\|\,\|\mathbf{m}_{h,N}(\mathbf{y}_{j+1})\|} > 0. \tag{25}$$

As a consequence of Theorem 2, the normal kernel appears to be the optimal one for the mean shift procedure. The smooth trajectory of the mean shift procedure is in contrast with the standard steepest ascent method [4, p. 21] (local gradient evaluation followed by line maximization) whose convergence rate on surfaces with deep narrow valleys is slow due to its zigzagging trajectory.

In practice, the convergence of the mean shift procedure based on the normal kernel requires a large number of steps, as was discussed at the end of Section 2.2. Therefore, in most of our experiments, we have used the uniform kernel, for which the convergence is finite, and not the normal kernel. Note, however, that the quality of the results almost always improves when the normal kernel is employed.

2.5 Relation to Kernel Regression

Important insight can be gained when (19) is obtained approaching the problem differently. Considering the univariate case suffices for this purpose.

Kernel regression is a nonparametric method to estimate complex trends from noisy data. See [62, chapter 5] for an introduction to the topic,  for a more in-depth treatment. Let $n$ measured data points be $(X_i, Z_i)$ and assume that the values $X_i$ are the outcomes of a random variable $x$ with probability density function $f(x)$, $x_i = X_i$, $i = 1, \ldots, n$, while the relation between $Z_i$ and $X_i$ is

$$Z_i = m(X_i) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{26}$$

where $m(x)$ is called the regression function and $\epsilon_i$ is an independently distributed, zero-mean error, $E[\epsilon_i] = 0$.

A natural way to estimate the regression function is by locally fitting a degree $p$ polynomial to the data. For a window centered at $x$, the polynomial coefficients then can be obtained by weighted least squares, the weights being computed from a symmetric function $g(x)$. The size of the window is controlled by the parameter $h$, $g_h(x) = h^{-1} g(x/h)$. The simplest case is that of fitting a constant to the data in the window, i.e., $p = 0$. It can be shown, [24, Section 3.1], [62, Section 5.2], that the estimated constant is the value of the Nadaraya-Watson estimator,

$$\hat{m}(x; h) = \frac{\sum_{i=1}^{n} g_h(x - X_i)\, Z_i}{\sum_{i=1}^{n} g_h(x - X_i)}, \tag{27}$$

introduced in the statistical literature 35 years ago. The asymptotic conditional bias of the estimator has the expression [24, p. 109], [62, p. 125],

$$E\left[(\hat{m}(x; h) - m(x)) \mid X_1, \ldots, X_n\right] \approx h^2\, \frac{m''(x) f(x) + 2 m'(x) f'(x)}{2 f(x)}\, \mu_2[g], \tag{28}$$

where $\mu_2[g] = \int u^2 g(u)\,du$. Defining $m(x) = x$ reduces the Nadaraya-Watson estimator to (20) (in the univariate case), while (28) becomes

$$E\left[(\hat{x} - x) \mid X_1, \ldots, X_n\right] \approx h^2\, \frac{f'(x)}{f(x)}\, \mu_2[g], \tag{29}$$

which is similar to (19). The mean shift procedure thus exploits to its advantage the inherent bias of the zero-order kernel regression.

The connection to the kernel regression literature opens many interesting issues; however, most of these are more of a theoretical than practical importance.

2.6 Relation to Location M-Estimators

The M-estimators are a family of robust techniques which can handle data in the presence of severe contaminations, i.e., outliers. See ,  for introductory surveys. In our context, only the problem of location estimation has to be considered. Given the data $\mathbf{x}_i$, $i = 1, \ldots, n$, and the scale $h$, we define $\hat{\boldsymbol{\theta}}$, the location estimator, as

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmin}_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \operatorname*{argmin}_{\boldsymbol{\theta}} \sum_{i=1}^{n} \rho\!\left(\left\|\frac{\boldsymbol{\theta} - \mathbf{x}_i}{h}\right\|^2\right), \tag{30}$$

where $\rho(u)$ is a symmetric, nonnegative valued function, with a unique minimum at the origin and nondecreasing for $u \geq 0$. The estimator is obtained from the normal equations

$$\nabla_{\boldsymbol{\theta}} J(\hat{\boldsymbol{\theta}}) = 2h^{-2} \sum_{i=1}^{n} (\hat{\boldsymbol{\theta}} - \mathbf{x}_i)\, w\!\left(\left\|\frac{\hat{\boldsymbol{\theta}} - \mathbf{x}_i}{h}\right\|^2\right) = \mathbf{0}, \tag{31}$$

where

$$w(u) = \frac{d\rho(u)}{du}.$$

Therefore, the iterations to find the location M-estimate are based on

$$\hat{\boldsymbol{\theta}} = \frac{\sum_{i=1}^{n} \mathbf{x}_i\, w\!\left(\left\|\frac{\hat{\boldsymbol{\theta}} - \mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} w\!\left(\left\|\frac{\hat{\boldsymbol{\theta}} - \mathbf{x}_i}{h}\right\|^2\right)}, \tag{32}$$

which is identical to (20) when $w(u) \equiv g(u)$. Taking into account (13), the minimization (30) becomes

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \sum_{i=1}^{n} k\!\left(\left\|\frac{\boldsymbol{\theta} - \mathbf{x}_i}{h}\right\|^2\right), \tag{33}$$

which can also be interpreted as

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \hat{f}_{h,K}(\boldsymbol{\theta} \mid \mathbf{x}_1, \ldots, \mathbf{x}_n). \tag{34}$$

That is, the location estimator is the mode of the density estimated with the kernel $K$ from the available data. Note that the convexity of the $k(x)$ profile, the sufficient condition for the convergence of the mean shift procedure (Section 2.2), is in accordance with the requirements to be satisfied by the objective function $\rho(u)$.

The relation between location M-estimators and kernel density estimation is not well-investigated in the statistical literature; only  discusses it in the context of an edge preserving smoothing technique.
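The iteration (20), which Section 2.6 identifies with the location M-estimate iteration (32) for $w(u) \equiv g(u)$, can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation: it assumes NumPy, uses weights proportional to the normal profile (9), and the names `mean_shift_path` and `density`, the two-blob data set, and the tolerance are all hypothetical choices made for this example. It records the two quantities that Theorems 1 and 2 speak about, namely the density estimate along the path and the cosine between consecutive mean shift vectors.

```python
import numpy as np

def mean_shift_path(x, y1, h, tol=1e-6, max_iter=500):
    """Iterate (20) with weights g(u) = exp(-u/2), i.e. the normal profile (9)
    up to a constant factor that cancels in the ratio.

    x  : (n, d) array of data points
    y1 : (d,) initial location of the kernel window
    h  : bandwidth
    Returns the array of successive locations y_1, y_2, ...
    """
    path = [np.asarray(y1, dtype=float)]
    for _ in range(max_iter):
        y = path[-1]
        u = np.sum((x - y) ** 2, axis=1) / h ** 2      # ||(y_j - x_i)/h||^2
        w = np.exp(-0.5 * u)                           # weights g(u)
        y_next = (w[:, None] * x).sum(axis=0) / w.sum()  # weighted mean, eq. (20)
        path.append(y_next)
        # Practical stopping rule: lower bound on the mean shift magnitude.
        if np.linalg.norm(y_next - y) < tol:
            break
    return np.array(path)

def density(x, y, h):
    """Density estimate (11) at y with the normal kernel, up to a constant factor."""
    u = np.sum((x - y) ** 2, axis=1) / h ** 2
    return np.exp(-0.5 * u).mean()

# Hypothetical data: two well-separated blobs in the plane.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
               rng.normal(5.0, 0.5, size=(200, 2))])

path = mean_shift_path(x, y1=[1.0, 1.0], h=1.0)
dens = np.array([density(x, y, 1.0) for y in path])   # sequence (21)
steps = np.diff(path, axis=0)                         # mean shift vectors (22)
cosines = [a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
           for a, b in zip(steps[:-1], steps[1:])]    # left-hand side of (25)
```

On such data one would expect `dens` to be nondecreasing (Theorem 1), every entry of `cosines` to be positive (Theorem 2), and `path[-1]` to lie close to the center of the blob in whose basin of attraction the start was placed.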
3 ROBUST ANALYSIS OF FEATURE SPACES

Multimodality and arbitrarily shaped clusters are the defining properties of a real feature space. The quality of the mean shift procedure to move toward the mode (peak) of the hill on which it was initiated makes it the ideal computational module to analyze such spaces. To detect all the significant modes, the basic algorithm given in Section 2.3 should be run multiple times (evolving in principle in parallel) with initializations that cover the entire feature space.

Before the analysis is performed, two important (and somewhat related) issues should be addressed: the metric of the feature space and the shape of the kernel. The mapping from the input domain into a feature space often associates a non-Euclidean metric to the space. The problem of color representation will be discussed in Section 4, but the employed parameterization has to be carefully examined even in a simple case like the Hough space of lines, e.g., , .

The presence of a Mahalanobis metric can be accommodated by an adequate choice of the bandwidth matrix (2). In practice, however, it is preferable to have assured that the metric of the feature space is Euclidean and, thus, the bandwidth matrix is controlled by a single parameter, $\mathbf{H} = h^2 \mathbf{I}$. To be able to use the same kernel size for all the mean shift procedures in the feature space, the necessary condition is that local density variations near a significant mode are not as large as the entire support of a significant mode somewhere else.

The starting points of the mean shift procedures should be chosen to have the entire feature space (except the very sparse regions) tessellated by the kernels (windows). Regular tessellations are not required. As the windows evolve toward the modes, almost all the data points are visited and, thus, all the information captured in the feature space is exploited. Note that the convergence to a given mode may yield slightly different locations due to the threshold that terminates the iterations. Similarly, on flat plateaus, the value of the gradient is close to zero and the mean shift procedure could stop.

These artifacts are easy to eliminate through postprocessing. Mode candidates at a distance less than the kernel bandwidth are fused, the one corresponding to the highest density being chosen. The global structure of the feature space can be confirmed by measuring the significance of the valleys defined along a cut through the density in the direction determined by two modes.

The delineation of the clusters is a natural outcome of the mode seeking process. After convergence, the basin of attraction of a mode, i.e., the data points visited by all the mean shift procedures converging to that mode, automatically delineates a cluster of arbitrary shape. Close to the boundaries, where a data point could have been visited by several diverging procedures, majority logic can be employed. It is important to notice that, in computer vision, most often we are not dealing with an abstract clustering problem. The input domain almost always provides an independent test for the validity of local decisions in the feature space. That is, while it is less likely that one can recover from a severe clustering error, allocation of a few uncertain data points can be reliably supported by input domain information.

The multimodal feature space analysis technique was discussed in detail in . It was shown experimentally that, for a synthetic, bimodal normal distribution, the technique achieves a classification error similar to the optimal Bayesian classifier. The behavior of this feature space analysis technique is illustrated in Fig. 2. A two-dimensional data set of 110,400 points (Fig. 2a) is decomposed into seven clusters represented with different colors in Fig. 2b. A number of 159 mean shift procedures with uniform kernel were employed. Their trajectories are shown in Fig. 2c, overlapped over the density estimate computed with the Epanechnikov kernel. The pruning of the mode candidates produced seven peaks. Observe that some of the trajectories are prematurely stopped by local plateaus.

3.1 Bandwidth Selection

The influence of the bandwidth parameter $h$ was assessed empirically in  through a simple image segmentation task. In a more rigorous approach, however, four different techniques for bandwidth selection can be considered.

• The first one has a statistical motivation. The optimal bandwidth associated with the kernel density estimator (6) is defined as the bandwidth that achieves the best compromise between the bias and variance of the estimator, over all $\mathbf{x} \in R^d$, i.e., minimizes AMISE. In the multivariate case, the resulting bandwidth formula [54, p. 85], [62, p. 99] is of little practical use, since it depends on the Laplacian of the unknown density being estimated, and its performance is not well understood [62, p. 108]. For the univariate case, a reliable method for bandwidth selection is the plug-in rule , which was proven to be superior to least-squares cross-validation and biased cross-validation , [55, p. 46]. Its only assumption is the smoothness of the underlying density.
• The second bandwidth selection technique is related to the stability of the decomposition. The bandwidth is taken as the center of the largest operating range over which the same number of clusters are obtained for the given data [20, p. 541].
• For the third technique, the best bandwidth maximizes an objective function that expresses the quality of the decomposition (i.e., the index of cluster validity). The objective function typically compares the inter- versus intra-cluster variability ,  or evaluates the isolation and connectivity of the delineated clusters .
• Finally, since in most of the cases the decomposition is task dependent, top-down information provided by the user or by an upper-level module can be used to control the kernel bandwidth.

We present in , a detailed analysis of the bandwidth selection problem. To solve the difficulties generated by the narrow peaks and the tails of the underlying density, two locally adaptive solutions are proposed. One is nonparametric, being based on a newly defined adaptive mean shift procedure, which exploits the plug-in rule and the sample point density estimator. The other is semiparametric, imposing a local structure on the data to extract reliable scale information. We show that the local bandwidth should maximize the magnitude of the normalized mean shift vector. The adaptation of the bandwidth provides superior results when compared to the fixed bandwidth procedure. For more details, see .
Fig. 2. Example of a 2D feature space analysis. (a) Two-dimensional data set of 110,400 points representing the first two components of the L*u*v* space shown in Fig. 1b. (b) Decomposition obtained by running 159 mean shift procedures with different initializations. (c) Trajectories of the mean shift procedures drawn over the Epanechnikov density estimate computed for the same data set. The peaks retained for the final classification are marked with red dots.

3.2 Implementation Issues

An efficient computation of the mean shift procedure first requires the resampling of the input data with a regular grid. This is a standard technique in the context of density estimation which leads to a binned estimator [62, Appendix D]. The procedure is similar to defining a histogram where linear interpolation is used to compute the weights associated with the grid points. Further reduction in the computation time is achieved by employing algorithms for multidimensional range searching [52, p. 373] used to find the data points falling in the neighborhood of a given kernel. For the efficient Euclidean distance computation, we used the improved absolute error inequality criterion, derived in .

4 APPLICATIONS

The feature space analysis technique introduced in the previous section is application independent and, thus, can be used to develop vision algorithms for a wide variety of tasks. Two somewhat related applications are discussed in the sequel: discontinuity preserving smoothing and image segmentation. The versatility of the feature space analysis enables the design of algorithms in which the user controls performance through a single parameter, the resolution of the analysis (i.e., bandwidth of the kernel). Since the control parameter has clear physical meaning, the new algorithms can be easily integrated into systems performing more complex tasks. Furthermore, both gray level and color images are processed with the same algorithm; in the former case, the feature space contains two degenerate dimensions that have no effect on the mean shift procedure.

Before proceeding to develop the new algorithms, the issue of the employed color space has to be settled. To obtain a meaningful segmentation, perceived color differences should correspond to Euclidean distances in the color space chosen to represent the features (pixels). A Euclidean metric, however, is not guaranteed for a color space [65, Sections 6.5.2, 8.4]. The spaces L*u*v* and L*a*b* were especially designed to best approximate perceptually uniform color spaces. In both cases, L*, the lightness (relative brightness) coordinate, is defined the same way; the two spaces differ only through the chromaticity coordinates. The dependence of all three coordinates on the traditional RGB color values is nonlinear. See [46, Section 3.5] for a readily accessible source for the conversion formulae. The metric of perceptually uniform color spaces is discussed in
the context of feature representation for image segmentation in . In practice, there is no clear advantage between using L*u*v* or L*a*b*; in the proposed algorithms, we employed L*u*v* motivated by a linear mapping property [65, p.166].

Our first image segmentation algorithm was a straightforward application of the feature space analysis technique to an L*u*v* representation of the color image . The modularity of the segmentation algorithm enabled its integration by other groups to a large variety of applications like image retrieval , face tracking , object-based video coding for MPEG-4 , shape detection and recognition , and texture analysis , to mention only a few. However, since the feature space analysis can be applied unchanged to moderately higher dimensional spaces (see Section 5), we subsequently also incorporated the spatial coordinates of a pixel into its feature space representation. This joint domain representation is employed in the two algorithms described here.

An image is typically represented as a two-dimensional lattice of $p$-dimensional vectors (pixels), where $p = 1$ in the gray-level case, three for color images, and $p > 3$ in the multispectral case. The space of the lattice is known as the spatial domain, while the gray level, color, or spectral information is represented in the range domain. For both domains, Euclidean metric is assumed. When the location and range vectors are concatenated in the joint spatial-range domain of dimension $d = p + 2$, their different nature has to be compensated by proper normalization. Thus, the multivariate kernel is defined as the product of two radially symmetric kernels and the Euclidean metric allows a single bandwidth parameter for each domain

$$K_{h_s,h_r}(\mathbf{x}) = \frac{C}{h_s^2 h_r^p}\, k\!\left(\left\|\frac{\mathbf{x}^s}{h_s}\right\|^2\right) k\!\left(\left\|\frac{\mathbf{x}^r}{h_r}\right\|^2\right), \tag{35}$$

where $\mathbf{x}^s$ is the spatial part, $\mathbf{x}^r$ is the range part of a feature vector, $k(x)$ the common profile used in both two domains, $h_s$ and $h_r$ the employed kernel bandwidths, and $C$ the corresponding normalization constant. In practice, an Epanechnikov or a (truncated) normal kernel always provides satisfactory performance, so the user only has to set the bandwidth parameter $\mathbf{h} = (h_s, h_r)$, which, by controlling the size of the kernel, determines the resolution of the mode detection.

4.1 Discontinuity Preserving Smoothing

Smoothing through replacing the pixel in the center of a window by the (weighted) average of the pixels in the window indiscriminately blurs the image, removing not only the noise but also salient information. Discontinuity preserving smoothing techniques, on the other hand, adaptively reduce the amount of smoothing near abrupt changes in the local structure, i.e., edges.

There are a large variety of approaches to achieve this goal, from adaptive Wiener filtering , to implementing isotropic  and anisotropic  local diffusion processes, a topic which recently received renewed interest , , . The diffusion-based techniques, however, do not have a straightforward stopping criterion and, after a sufficiently large number of iterations, the processed image collapses into a flat surface. The connection between anisotropic diffusion and M-estimators is analyzed in .

A recently proposed noniterative discontinuity preserving smoothing technique is the bilateral filtering . The relation between bilateral filtering and diffusion-based techniques was analyzed in . The bilateral filters also work in the joint spatial-range domain. The data is independently weighted in the two domains and the center pixel is computed as the weighted average of the window. The fundamental difference between the bilateral filtering and the mean shift-based smoothing algorithm is in the use of local information.

4.1.1 Mean Shift Filtering

Let $\mathbf{x}_i$ and $\mathbf{z}_i$, $i = 1, \ldots, n$, be the $d$-dimensional input and filtered image pixels in the joint spatial-range domain. For each pixel,

1. Initialize $j = 1$ and $\mathbf{y}_{i,1} = \mathbf{x}_i$.
2. Compute $\mathbf{y}_{i,j+1}$ according to (20) until convergence, $\mathbf{y} = \mathbf{y}_{i,c}$.
3. Assign $\mathbf{z}_i = (\mathbf{x}_i^s, \mathbf{y}_{i,c}^r)$.

The superscripts $s$ and $r$ denote the spatial and range components of a vector, respectively. The assignment specifies that the filtered data at the spatial location $\mathbf{x}_i^s$ will have the range component of the point of convergence $\mathbf{y}_{i,c}^r$.

The kernel (window) in the mean shift procedure moves in the direction of the maximum increase in the joint density gradient, while the bilateral filtering uses a fixed, static window. In the image smoothed by mean shift filtering, information beyond the individual windows is also taken into account.

An important connection between filtering in the joint domain and robust M-estimation should be mentioned. The improved performance of the generalized M-estimators (GM or bounded-influence estimators) is due to the presence of a second weight function which offsets the influence of leverage points, i.e., outliers in the input domain [32, Section 8E]. A similar (at least in spirit) twofold weighting is employed in the bilateral and mean shift-based filterings, which is the main reason for their excellent smoothing performance.

Mean shift filtering with uniform kernel having $(h_s, h_r) = (8, 4)$ has been applied to the often used 256 × 256 gray-level cameraman image (Fig. 3a), the result being shown in Fig. 3b. The regions containing the grass field have been almost completely smoothed, while details such as the tripod and the buildings in the background were preserved. The processing required fractions of a second on a standard PC (600 Mhz Pentium III) using an optimized C++ implementation of the algorithm. On the average, 3.06 iterations were necessary until the filtered value of a pixel was defined, i.e., its mean shift procedure converged.

To better visualize the filtering process, the 40 × 20 window marked in Fig. 3a is represented in three dimensions in Fig. 4a. Note that the data was reflected over the horizontal axis of the window for a more informative display. In Fig. 4b, the mean shift paths associated with every other pixel (in both coordinates) from the plateau and the line are shown. Note that convergence points (black dots) are situated in the center of the plateau, away from the discontinuities delineating it. Similarly, the mean shift trajectories on the line remain on it. As a result, the filtered data (Fig. 4c) shows clean quasi-homogeneous regions.

The physical interpretation of the mean shift-based filtering is easy to see by examining Fig. 4a, which, in fact, displays the three dimensions of the joint domain of a
gray-level image. Take a pixel on the line. The uniform kernel defines a parallelepiped centered on this pixel and the computation of the mean shift vector takes into account only those pixels which have both their spatial coordinates and gray-level values inside the parallelepiped. Thus, if the parallelepiped is not too large, only pixels on the line are averaged and the new location of the window is guaranteed to remain on it.

Fig. 3. Cameraman image. (a) Original. (b) Mean shift filtered $(h_s, h_r) = (8, 4)$.

A second filtering example is shown in Fig. 5. The 512 × 512 color image baboon was processed with mean shift filters employing normal kernels defined using various spatial and range resolutions, $(h_s, h_r) = (8 \div 32, 4 \div 16)$. While the texture of the fur has been removed, the details of the eyes and the whiskers remained crisp (up to a certain resolution). One can see that the spatial bandwidth has a distinct effect on the output when compared to the range (color) bandwidth. Only features with large spatial support are represented in the filtered image when $h_s$ increases. On the other hand, only features with high color contrast survive when $h_r$ is large. Similar behavior was also reported for the bilateral filter [59, Fig. 3].

4.2 Image Segmentation

Image segmentation, decomposition of a gray level or color image into homogeneous tiles, is arguably the most important low-level vision task. Homogeneity is usually defined as similarity in pixel values, i.e., a piecewise constant model is enforced over the image. From the diversity of image segmentation methods proposed in the literature, we will mention only some whose basic processing relies on the joint domain. In each case, a vector field is defined over the sampling lattice of the image.

Fig. 4. Visualization of mean shift-based filtering and segmentation for gray-level data. (a) Input. (b) Mean shift paths for the pixels on the plateau and on the line. The black dots are the points of convergence. (c) Filtering result $(h_s, h_r) = (8, 4)$. (d) Segmentation result.
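The three steps of the mean shift filtering procedure (Section 4.1.1) can be sketched in a few lines for a gray-level image and a uniform kernel. This is a minimal illustration under stated assumptions, not the authors' optimized C++ code: the function name `mean_shift_filter`, the parameters `max_iter` and `tol`, and the stopping rule are hypothetical, and the search is brute force (no binning or range searching as in Section 3.2).

```python
import numpy as np

def mean_shift_filter(image, hs, hr, max_iter=20, tol=1e-3):
    """Mean shift filtering of a 2D gray-level image with a uniform kernel.

    hs, hr are the spatial and range bandwidths. max_iter and tol are
    assumptions of this sketch, not parameters taken from the paper.
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    out = np.empty_like(image)
    # Patch half-width; the +1 covers rounding of the window center.
    r = int(np.ceil(hs)) + 1
    for i in range(rows):
        for j in range(cols):
            # Step 1: start the window at the pixel itself, y_{i,1} = x_i.
            ys = np.array([i, j], dtype=float)  # spatial part
            yr = image[i, j]                    # range part
            for _ in range(max_iter):
                ci, cj = int(round(ys[0])), int(round(ys[1]))
                i0, i1 = max(ci - r, 0), min(ci + r + 1, rows)
                j0, j1 = max(cj - r, 0), min(cj + r + 1, cols)
                gi, gj = np.mgrid[i0:i1, j0:j1]
                vals = image[i0:i1, j0:j1]
                # Uniform kernel: keep pixels inside both the spatial
                # window (radius hs) and the range window (radius hr).
                inside = ((gi - ys[0]) ** 2 + (gj - ys[1]) ** 2 <= hs ** 2) \
                    & (np.abs(vals - yr) <= hr)
                if not inside.any():
                    break
                # Step 2: move the window to the mean of the selected points.
                new_ys = np.array([gi[inside].mean(), gj[inside].mean()])
                new_yr = vals[inside].mean()
                shift = np.hypot(*(new_ys - ys)) + abs(new_yr - yr)
                ys, yr = new_ys, new_yr
                if shift < tol:
                    break
            # Step 3: keep the pixel location, take the converged range value.
            out[i, j] = yr
    return out
```

On a step edge with small superimposed texture, the range window keeps the two sides from being averaged together, which is the discontinuity preserving behavior described above. For images of realistic size the brute-force window search would have to be replaced by the binned estimator and range searching mentioned in Section 3.2.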
Fig. 5. Baboon image. Original and filtered.

The attraction force field defined in  is computed at each pixel as a vector sum of pairwise affinities between the current pixel and all other pixels, with similarity measured in both spatial and range domains. The region boundaries are then identified as loci where the force vectors diverge. It is interesting to note that, for a given pixel, the magnitude and orientation of the force field are similar to those of the joint domain mean shift vector computed at that pixel and projected into the spatial domain. However, in contrast to , the mean shift procedure moves in the direction of this vector, away from the boundaries.

The edge flow in  is obtained at each location for a given set of directions as the magnitude of the gradient of a smoothed image. The boundaries are detected at image locations which encounter two opposite directions of flow. The quantization of the edge flow direction, however, may introduce artifacts. Recall that the direction of the mean shift is dictated solely by the data.

The mean shift procedure-based image segmentation is a straightforward extension of the discontinuity preserving smoothing algorithm. Each pixel is associated with a significant mode of the joint domain density located in its neighborhood, after nearby modes were pruned as in the generic feature space analysis technique (Section 3).

4.2.1 Mean Shift Segmentation

Let $\mathbf{x}_i$ and $\mathbf{z}_i$, $i = 1, \ldots, n$, be the $d$-dimensional input and filtered image pixels in the joint spatial-range domain and $L_i$ the label of the $i$th pixel in the segmented image.

1. Run the mean shift filtering procedure for the image and store all the information about the $d$-dimensional convergence point in $\mathbf{z}_i$, i.e., $\mathbf{z}_i = \mathbf{y}_{i,c}$.
2. Delineate in the joint domain the clusters $\{C_p\}_{p=1 \ldots m}$ by grouping together all $\mathbf{z}_i$ which are closer than $h_s$ in the spatial domain and $h_r$ in the range domain, i.e., concatenate the basins of attraction of the corresponding convergence points.
3. For each $i = 1, \ldots, n$, assign $L_i = \{p \mid \mathbf{z}_i \in C_p\}$.
4. Optional: Eliminate spatial regions containing less than $M$ pixels.

The cluster delineation step can be refined according to a priori information and, thus, physics-based segmentation algorithms, e.g., , , can be incorporated. Since this process is performed on region adjacency graphs, hierarchical techniques like  can provide significant speed-up. The effect of the cluster delineation step is shown in Fig. 4d. Note the fusion into larger homogeneous regions of the
result of filtering shown in Fig. 4c. The segmentation step does not add a significant overhead to the filtering process.

Fig. 6. MIT image. (a) Original. (b) Segmented $(h_s, h_r, M) = (8, 7, 20)$. (c) Region boundaries.

Fig. 7. Room image. (a) Original. (b) Region boundaries delineated with $(h_s, h_r, M) = (8, 5, 20)$, drawn over the input.

The region representation used by the mean shift segmentation is similar to the blob representation employed in . However, while the blob has a parametric description (multivariate Gaussians in both spatial and color domain), the partition generated by the mean shift is characterized by a nonparametric model. An image region is defined by all the pixels associated with the same mode in the joint domain.

In , a nonparametric clustering method is described in which, after kernel density estimation with a small bandwidth, the clusters are delineated through concatenation of the detected modes neighborhoods. The merging process is based on two intuitive measures capturing the variations in the local density. Being a hierarchical clustering technique, the method is computationally expensive; it takes several minutes in MATLAB to analyze a 2,000 pixel subsample of the feature space. The method is not recommended to be used in the joint domain since the measures employed in the merging process become ineffective. Comparing the results for arbitrarily shaped synthetic data [43, Fig. 6] with a similarly challenging example processed with the mean shift method [12, Fig. 1] shows that the use of a hierarchical approach can be successfully avoided in the nonparametric clustering paradigm.

All the segmentation experiments were performed using uniform kernels. The improvement due to joint space analysis can be seen in Fig. 6 where the 256 × 256 gray-level image MIT was processed with $(h_s, h_r, M) = (8, 7, 20)$. A number of 225 homogeneous regions were identified in fractions of a second, most of them delineating semantically meaningful regions like walls, sky, steps, inscription on the building, etc. Compare the results with the segmentation obtained by one-dimensional clustering of the gray-level values in [11, Fig. 4] or by using a Gibbs random fields-based approach [40, Fig. 7].

The joint domain segmentation of the color 256 × 256 room image presented in Fig. 7 is also satisfactory. Compare this result with the segmentation presented in [38, Figs. 3e and 5c] obtained by recursive thresholding. In both these examples, one can notice that regions in which a small gradient of illumination exists (like the sky in the MIT or the carpet in the room image) were delineated as a single region. Thus, the joint domain mean shift-based segmentation succeeds in overcoming the inherent limitations of methods based only on gray-level or color clustering which typically oversegment small gradient regions.

The segmentation with $(h_s, h_r, M) = (16, 7, 40)$ of the 512 × 512 color image lake is shown in Fig. 8. Compare this result with that of the multiscale approach in [57, Fig. 11]. Finally, one can compare the contours of the color image $(h_s, h_r, M) = (16, 19, 40)$ hand presented in Fig. 9 with those from [66, Fig. 15], obtained through a complex global optimization, and from [41, Fig. 4a], obtained with geodesic active contours.

The segmentation is not very sensitive to the choice of the resolution parameters $h_s$ and $h_r$. Note that all 256 × 256 images used the same $h_s = 8$, corresponding to a
17 × 17 spatial window, while all 512 × 512 images used $h_s = 16$ corresponding to a 31 × 31 window. The range parameter $h_r$ and the smallest significant feature size $M$ control the number of regions in the segmented image. The more an image deviates from the assumed piecewise constant model, larger values have to be used for $h_r$ and $M$ to discard the effect of small local variations in the feature space. For example, the heavily textured background in the hand image is compensated by using $h_r = 19$ and $M = 40$, values which are much larger than those used for the room image $(h_r = 5, M = 20)$ since the latter better obeys the model. As with any low-level vision algorithm, the quality of the segmentation output can be assessed only in the context of the whole vision task and, thus, the resolution parameters should be chosen according to that criterion. An important advantage of mean shift-based segmentation is its modularity which makes the control of segmentation output very simple.

Fig. 8. Lake image. (a) Original. (b) Segmented with $(h_s, h_r, M) = (16, 7, 40)$.

Fig. 9. Hand image. (a) Original. (b) Region boundaries delineated with $(h_s, h_r, M) = (16, 19, 40)$ drawn over the input.

Other segmentation examples in which the original image has the region boundaries superposed are shown in Fig. 10 and in which the original and labeled images are compared in Fig. 11.

As a potential application of the segmentation, we return to the cameraman image. Fig. 12a shows the reconstructed image after the regions corresponding to the sky and grass were manually replaced with white. The mean shift segmentation has been applied with $(h_s, h_r, M) = (8, 4, 10)$. Observe the preservation of the details which suggests that the algorithm can also be used for image editing, as shown in Fig. 12b.

The code for the discontinuity preserving smoothing and image segmentation algorithms integrated into a single system with graphical interface is available at http://www.caip.rutgers.edu/riul/research/code.html.

5 DISCUSSION

The mean shift-based feature space analysis technique introduced in this paper is a general tool which is not restricted to the two applications discussed here. Since the quality of the output is controlled only by the kernel bandwidth, i.e., the resolution of the analysis, the technique should be also easily integrable into complex vision systems where the control is relinquished to a closed loop process. Additional insights on the bandwidth selection can be obtained by testing the stability of the mean shift direction across the different bandwidths, as investigated in  in the case of the force field. The nonparametric toolbox developed in this paper is suitable for a large variety of computer vision tasks where parametric models are less adequate, for example, modeling the background in visual surveillance .

The complete solution toward autonomous image segmentation is to combine a bandwidth selection technique (like the ones discussed in Section 3.1) with top-down task-related high-level information. In this case, each mean shift process is associated with a kernel best suited to the local structure of the joint domain. Several interesting theoretical
issues have to be addressed, though, before the benefits of such a data driven approach can be fully exploited. We are currently investigating these issues.

Fig. 10. Landscape images. All the region boundaries were delineated with $(h_s, h_r, M) = (8, 7, 100)$ and are drawn over the original image.

The ability of the mean shift procedure to be attracted by the modes (local maxima) of an underlying density function, can be exploited in an optimization framework. Cheng  already discusses a simple example. However, by introducing adequate objective functions, the optimization problem can acquire physical meaning in the context of a computer vision task. For example, in , by defining the distance between the distributions of the model and a candidate of the target, nonrigid objects were tracked in an image sequence under severe distortions. The distance was defined at every pixel in the region of interest of the new frame and the mean shift procedure was used to find the mode of this measure nearest to the previous location of the target.

The above-mentioned tracking algorithm can be regarded as an example of computer vision techniques which are based on in situ optimization. Under this paradigm, the solution is obtained by using the input domain to define the optimization problem. The in situ optimization is a very powerful method. In  and , each input data point was associated with a local field (voting kernel) to produce a more dense structure from where the sought information (salient features, the hyperplane representing the fundamental matrix) can be reliably extracted.

The mean shift procedure is not computationally expensive. Careful C++ implementation of the tracking algorithm allowed real time (30 frames/second) processing of the video stream. While it is not clear if the segmentation algorithm described in this paper can be made so fast, given the quality of the region boundaries it provides, it can be used to support edge detection without significant overhead in time.

Kernel density estimation, in particular, and nonparametric techniques, in general, do not scale well with the dimension of the space. This is mostly due to the empty space phenomenon [20, p. 70], [54, p. 93] by which most of the mass in a high-dimensional space is concentrated in a small region of the space. Thus, whenever the feature space has more than (say) six dimensions, the analysis should be approached carefully. Employing projection pursuit, in which the density is analyzed along lower dimensional cuts, e.g., , is a possibility.

To conclude, the mean shift procedure is a valuable computational module whose versatility can make it an important component of any computer vision toolbox.

APPENDIX

Proof of Theorem 1. If the kernel $K$ has a convex and monotonically decreasing profile, the sequences $\{\mathbf{y}_j\}_{j=1,2\ldots}$ and $\{\hat{f}_{h,K}(j)\}_{j=1,2\ldots}$ converge, and $\{\hat{f}_{h,K}(j)\}_{j=1,2\ldots}$ is monotonically increasing.

Since $n$ is finite, the sequence $\hat{f}_{h,K}$ (21) is bounded, therefore, it is sufficient to show that $\hat{f}_{h,K}$ is strictly monotonic increasing, i.e., if $\mathbf{y}_j \neq \mathbf{y}_{j+1}$, then

$$\hat{f}_{h,K}(j) < \hat{f}_{h,K}(j+1),$$

for $j = 1, 2 \ldots$. Without loss of generality, it can be assumed that $\mathbf{y}_j = \mathbf{0}$ and, thus, from (16) and (21)

$$\hat{f}_{h,K}(j+1) - \hat{f}_{h,K}(j) = \frac{c_{k,d}}{n h^d} \sum_{i=1}^{n} \left[ k\!\left(\left\|\frac{\mathbf{y}_{j+1} - \mathbf{x}_i}{h}\right\|^2\right) - k\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) \right]. \tag{A.1}$$

The convexity of the profile $k(x)$ implies that

$$k(x_2) \ge k(x_1) + k'(x_1)(x_2 - x_1) \tag{A.2}$$

for all $x_1, x_2 \in [0, \infty)$, $x_1 \neq x_2$, and since $g(x) = -k'(x)$, (A.2) becomes

$$k(x_2) - k(x_1) \ge g(x_1)(x_1 - x_2). \tag{A.3}$$
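Fig. 11. Some other segmentation examples with $(h_s, h_r, M) = (8, 7, 20)$. Left: original. Right: segmented.

Now, using (A.1) and (A.3), we obtain

$$
\begin{aligned}
\hat{f}_{h,K}(j+1) - \hat{f}_{h,K}(j)
&\ge \frac{c_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) \left[ \|\mathbf{x}_i\|^2 - \|\mathbf{y}_{j+1} - \mathbf{x}_i\|^2 \right] \\
&= \frac{c_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) \left[ 2\mathbf{y}_{j+1}^{\top}\mathbf{x}_i - \|\mathbf{y}_{j+1}\|^2 \right] \\
&= \frac{c_{k,d}}{n h^{d+2}} \left[ 2\mathbf{y}_{j+1}^{\top} \sum_{i=1}^{n} \mathbf{x}_i\, g\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) - \|\mathbf{y}_{j+1}\|^2 \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) \right]
\end{aligned}
\tag{A.4}
$$

and, recalling (20), yields

$$\hat{f}_{h,K}(j+1) - \hat{f}_{h,K}(j) \ge \frac{c_{k,d}}{n h^{d+2}} \|\mathbf{y}_{j+1}\|^2 \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right). \tag{A.5}$$

The profile $k(x)$ being monotonically decreasing for all $x \ge 0$, the sum $\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right)$ is strictly positive. Thus, as long as $\mathbf{y}_{j+1} \neq \mathbf{y}_j = \mathbf{0}$, the right term of (A.5) is strictly positive, i.e., $\hat{f}_{h,K}(j+1) > \hat{f}_{h,K}(j)$. Consequently, the sequence $\{\hat{f}_{h,K}(j)\}_{j=1,2\ldots}$ is convergent.

To prove the convergence of the sequence $\{\mathbf{y}_j\}_{j=1,2\ldots}$, (A.5) is rewritten for an arbitrary kernel location $\mathbf{y}_j \neq \mathbf{0}$. After some algebra, we have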
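$$\hat{f}_{h,K}(j+1) - \hat{f}_{h,K}(j) \ge \frac{c_{k,d}}{n h^{d+2}} \|\mathbf{y}_{j+1} - \mathbf{y}_j\|^2 \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right). \tag{A.6}$$

Now, summing the two terms of (A.6) for indices $j, j+1 \ldots j+m-1$, it results that

$$
\begin{aligned}
\hat{f}_{h,K}(j+m) - \hat{f}_{h,K}(j)
&\ge \frac{c_{k,d}}{n h^{d+2}} \|\mathbf{y}_{j+m} - \mathbf{y}_{j+m-1}\|^2 \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{y}_{j+m-1} - \mathbf{x}_i}{h}\right\|^2\right) + \ldots \\
&\quad + \frac{c_{k,d}}{n h^{d+2}} \|\mathbf{y}_{j+1} - \mathbf{y}_j\|^2 \sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right) \\
&\ge \frac{c_{k,d}}{n h^{d+2}} \left[ \|\mathbf{y}_{j+m} - \mathbf{y}_{j+m-1}\|^2 + \ldots + \|\mathbf{y}_{j+1} - \mathbf{y}_j\|^2 \right] M \\
&\ge \frac{c_{k,d}}{n h^{d+2}} \|\mathbf{y}_{j+m} - \mathbf{y}_j\|^2 M,
\end{aligned}
\tag{A.7}
$$

where $M$ represents the minimum (always strictly positive) of the sum $\sum_{i=1}^{n} g\!\left(\left\|\frac{\mathbf{y}_j - \mathbf{x}_i}{h}\right\|^2\right)$ for all $\{\mathbf{y}_j\}_{j=1,2\ldots}$.

Since $\{\hat{f}_{h,K}(j)\}_{j=1,2\ldots}$ is convergent, it is also a Cauchy sequence. This property in conjunction with (A.7) implies that $\{\mathbf{y}_j\}_{j=1,2\ldots}$ is a Cauchy sequence, hence, it is convergent in the Euclidean space. $\square$

Fig. 12. Cameraman image. (a) Segmentation with $(h_s, h_r, M) = (8, 4, 10)$ and reconstruction after the elimination of regions representing sky and grass. (b) Supervised texture insertion.

Proof of Theorem 2. The cosine of the angle between two consecutive mean shift vectors is strictly positive when a normal kernel is employed.

We can assume, without loss of generality that $\mathbf{y}_j = \mathbf{0}$ and $\mathbf{y}_{j+1} \neq \mathbf{y}_{j+2} \neq \mathbf{0}$ since, otherwise, convergence has already been achieved. Therefore, the mean shift vector $\mathbf{m}_{h,N}(\mathbf{0})$ is

$$\mathbf{m}_{h,N}(\mathbf{0}) = \mathbf{y}_{j+1} = \frac{\sum_{i=1}^{n} \mathbf{x}_i \exp\!\left(-\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} \exp\!\left(-\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right)}. \tag{B.1}$$

We will show first that, when the weights are given by a normal kernel centered at $\mathbf{y}_{j+1}$, the weighted sum of the projections of $(\mathbf{y}_{j+1} - \mathbf{x}_i)$ onto $\mathbf{y}_{j+1}$ is strictly negative, i.e.,

$$\sum_{i=1}^{n} \left( \|\mathbf{y}_{j+1}\|^2 - \mathbf{y}_{j+1}^{\top}\mathbf{x}_i \right) \exp\!\left(-\left\|\frac{\mathbf{y}_{j+1} - \mathbf{x}_i}{h}\right\|^2\right) < 0. \tag{B.2}$$

The space $R^d$ can be decomposed into the following three domains:

$$
\begin{aligned}
D_1 &= \left\{ \mathbf{x} \in R^d \;\middle|\; \mathbf{y}_{j+1}^{\top}\mathbf{x} \le \tfrac{1}{2}\|\mathbf{y}_{j+1}\|^2 \right\} \\
D_2 &= \left\{ \mathbf{x} \in R^d \;\middle|\; \tfrac{1}{2}\|\mathbf{y}_{j+1}\|^2 < \mathbf{y}_{j+1}^{\top}\mathbf{x} \le \|\mathbf{y}_{j+1}\|^2 \right\} \\
D_3 &= \left\{ \mathbf{x} \in R^d \;\middle|\; \|\mathbf{y}_{j+1}\|^2 < \mathbf{y}_{j+1}^{\top}\mathbf{x} \right\}
\end{aligned}
\tag{B.3}
$$

and after some simple manipulations from (B.1), we can derive the equality

$$\sum_{\mathbf{x}_i \in D_2} \left( \|\mathbf{y}_{j+1}\|^2 - \mathbf{y}_{j+1}^{\top}\mathbf{x}_i \right) \exp\!\left(-\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) = \sum_{\mathbf{x}_i \in D_1 \cup D_3} \left( \mathbf{y}_{j+1}^{\top}\mathbf{x}_i - \|\mathbf{y}_{j+1}\|^2 \right) \exp\!\left(-\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right). \tag{B.4}$$

In addition, for $\mathbf{x} \in D_2$, we have $\|\mathbf{y}_{j+1}\|^2 - \mathbf{y}_{j+1}^{\top}\mathbf{x} \ge 0$, which implies

$$\|\mathbf{y}_{j+1} - \mathbf{x}_i\|^2 = \|\mathbf{y}_{j+1}\|^2 + \|\mathbf{x}_i\|^2 - 2\mathbf{y}_{j+1}^{\top}\mathbf{x}_i \ge \|\mathbf{x}_i\|^2 - \|\mathbf{y}_{j+1}\|^2 \tag{B.5}$$

from where

$$\sum_{\mathbf{x}_i \in D_2} \left( \|\mathbf{y}_{j+1}\|^2 - \mathbf{y}_{j+1}^{\top}\mathbf{x}_i \right) \exp\!\left(-\left\|\frac{\mathbf{y}_{j+1} - \mathbf{x}_i}{h}\right\|^2\right) \le \exp\!\left(\left\|\frac{\mathbf{y}_{j+1}}{h}\right\|^2\right) \sum_{\mathbf{x}_i \in D_2} \left( \|\mathbf{y}_{j+1}\|^2 - \mathbf{y}_{j+1}^{\top}\mathbf{x}_i \right) \exp\!\left(-\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right). \tag{B.6}$$

Now, introducing (B.4) in (B.6), we have

$$\sum_{\mathbf{x}_i \in D_2} \left( \|\mathbf{y}_{j+1}\|^2 - \mathbf{y}_{j+1}^{\top}\mathbf{x}_i \right) \exp\!\left(-\left\|\frac{\mathbf{y}_{j+1} - \mathbf{x}_i}{h}\right\|^2\right) \le \exp\!\left(\left\|\frac{\mathbf{y}_{j+1}}{h}\right\|^2\right) \sum_{\mathbf{x}_i \in D_1 \cup D_3} \left( \mathbf{y}_{j+1}^{\top}\mathbf{x}_i - \|\mathbf{y}_{j+1}\|^2 \right) \exp\!\left(-\left\|\frac{\mathbf{x}_i}{h}\right\|^2\right) \tag{B.7}$$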