Similarity distance measures


Published in: Education, Technology

Pattern Recognition 38 (2005) 1857–1874

Clustering of time series data—a survey

T. Warren Liao∗
Industrial & Manufacturing Systems Engineering Department, Louisiana State University, 3128 CEBA, Baton Rouge, LA 70803, USA

Received 16 September 2003; received in revised form 21 June 2004; accepted 7 January 2005

∗ Tel.: +1 225 578 5365; fax: +1 225 578 5109.
0031-3203/$30.00 © 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2005.01.025

Abstract

Time series clustering has been shown to be effective in providing useful information in various domains. There seems to be an increased interest in time series clustering as part of the effort in temporal data mining research. To provide an overview, this paper surveys and summarizes previous work that investigated the clustering of time series data in various application domains. The basics of time series clustering are presented, including general-purpose clustering algorithms commonly used in time series clustering studies, the criteria for evaluating the performance of the clustering results, and the measures to determine the similarity/dissimilarity between two time series being compared, either in the form of raw data, extracted features, or some model parameters. Past research is organized into three groups depending upon whether it works directly with the raw data either in the time or frequency domain, indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The uniqueness and limitations of previous research are discussed and several possible topics for future research are identified. Moreover, the areas that time series clustering has been applied to are also summarized, including the sources of data used. It is hoped that this review will serve as a stepping stone for those interested in advancing this area of research.

Keywords: Time series data; Clustering; Distance measure; Data mining

1. Introduction

The goal of clustering is to identify structure in an unlabeled data set by objectively organizing data into homogeneous groups where the within-group-object dissimilarity is minimized and the between-group-object dissimilarity is maximized. Clustering is necessary when no labeled data are available regardless of whether the data are binary, categorical, numerical, interval, ordinal, relational, textual, spatial, temporal, spatio-temporal, image, multimedia, or mixtures of the above data types. Data are called static if all their feature values do not change with time, or change negligibly. The bulk of clustering analyses has been performed on static data. Most, if not all, clustering programs developed as an independent program or as part of a large suite of data analysis or data mining software to date work only with static data. Han and Kamber [1] classified clustering methods developed for handling various static data into five major categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. A brief description of each category of methods follows.

Given a set of $n$ unlabeled data tuples, a partitioning method constructs $k$ partitions of the data, where each partition represents a cluster containing at least one object and $k \le n$. The partition is crisp if each object belongs to exactly one cluster, or fuzzy if one object is allowed to be in more than one cluster to a different degree. Two renowned heuristic methods for crisp partitions are the k-means algorithm [2], where each cluster is represented by the mean value of the objects in the cluster and the k-medoids
algorithm [3], where each cluster is represented by the most centrally located object in a cluster. Two counterparts for fuzzy partitions are the fuzzy c-means algorithm [4] and the fuzzy c-medoids algorithm [5]. These heuristic algorithms work well for finding spherical-shaped clusters in small to medium data sets. To find clusters with non-spherical or other complex shapes, specially designed algorithms such as Gustafson–Kessel and adaptive fuzzy clustering algorithms [6] or density-based methods to be introduced in the sequel are needed. Most genetic clustering methods implement the spirit of partitioning methods, especially the k-means algorithm [7,8], the k-medoids algorithm [9], and the fuzzy c-means algorithm [10].

A hierarchical clustering method works by grouping data objects into a tree of clusters. There are generally two types of hierarchical clustering methods: agglomerative and divisive. Agglomerative methods start by placing each object in its own cluster and then merge clusters into larger and larger clusters, until all objects are in a single cluster or until certain termination conditions such as the desired number of clusters are satisfied. Divisive methods do just the opposite. A pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. For improving the clustering quality of hierarchical methods, there is a trend to integrate hierarchical clustering with other clustering techniques. Both Chameleon [11] and CURE [12] perform careful analysis of object "linkages" at each hierarchical partitioning whereas BIRCH [13] uses iterative relocation to refine the results obtained by hierarchical agglomeration.

The general idea of density-based methods such as DBSCAN [14] is to continue growing a cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold. Rather than producing a clustering explicitly, OPTICS [15] computes an augmented cluster ordering for automatic and interactive cluster analysis. The ordering contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings, thus overcoming the difficulty of selecting parameter values.

Grid-based methods quantize the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. A typical example of the grid-based approach is STING [16], which uses several levels of rectangular cells corresponding to different levels of resolution. Statistical information regarding the attributes in each cell is pre-computed and stored. A query process usually starts at a relatively high level of the hierarchical structure. For each cell in the current layer, the confidence interval is computed reflecting the cell's relevance to the given query. Irrelevant cells are removed from further consideration. The query process continues to the next lower level for the relevant cells until the bottom layer is reached.

Model-based methods assume a model for each of the clusters and attempt to best fit the data to the assumed model. There are two major approaches of model-based methods: the statistical approach and the neural network approach. An example of the statistical approach is AutoClass [17], which uses Bayesian statistical analysis to estimate the number of clusters. Two prominent methods of the neural network approach to clustering are competitive learning, including ART [18], and self-organizing feature maps [19].

Unlike static data, the time series of a feature comprises values that change with time. Time series data are of interest because of their pervasiveness in various areas ranging from science, engineering, business, finance, economics, and health care to government. Given a set of unlabeled time series, it is often desirable to determine groups of similar time series. These unlabeled time series could be monitoring data collected during different periods from a particular process or from more than one process. The process could be natural, biological, business, or engineered. Work devoted to the cluster analysis of time series is relatively scant compared with that focusing on static data. However, there seems to be a trend of increased activity.

This paper intends to introduce the basics of time series clustering and to provide an overview of the time series clustering work done so far. In the next section, the basics of time series clustering are presented. Details of three major components required to perform time series clustering are given in three subsections: clustering algorithms in Section 2.1, data similarity/distance measurement in Section 2.2, and performance evaluation criteria in Section 2.3. Section 3 categorizes and surveys time series clustering work that has been published in the open literature. Several possible topics for future research are discussed in Section 4 and finally the paper is concluded. In Appendix A, the application areas reported are summarized with pointers to openly available time series data.

2. Basics of time series clustering

Just like static data clustering, time series clustering requires a clustering algorithm or procedure to form clusters given a set of unlabeled data objects, and the choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. As far as time series data are concerned, distinctions can be made as to whether the data are discrete-valued or real-valued, uniformly or non-uniformly sampled, univariate or multivariate, and whether data series are of equal or unequal length. Non-uniformly sampled data must be converted into uniform data before clustering operations can be performed. This can be achieved by a wide range of methods, from simple down sampling based on the roughest sampling interval to a sophisticated modeling and estimation approach.

Various algorithms have been developed to cluster different types of time series data. Putting their differences aside, it is fair to say that in spirit they all try to modify the existing algorithms for clustering static data in such a way that
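As an aside on the conversion of non-uniformly sampled data mentioned in Section 2, the following is a minimal sketch of one simple option, linear interpolation onto a uniform grid. The function name `resample_uniform` and its `step` parameter are hypothetical, introduced here only for illustration; the survey itself does not prescribe a particular method.

```python
from bisect import bisect_right


def resample_uniform(times, values, step):
    """Linearly interpolate an irregularly sampled series onto a uniform grid.

    times  -- strictly increasing sample times
    values -- observations, same length as times
    step   -- spacing of the uniform output grid (illustrative parameter)
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) pairs of equal length")
    if step <= 0:
        raise ValueError("step must be positive")

    grid, out = [], []
    # Number of whole steps that fit between the first and last sample.
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    for s in range(n_steps + 1):
        t = times[0] + s * step
        # Right-hand neighbour of t, clamped so the final sample is handled.
        hi = min(bisect_right(times, t), len(times) - 1)
        lo = hi - 1
        frac = (t - times[lo]) / (times[hi] - times[lo])
        grid.append(t)
        out.append(values[lo] + frac * (values[hi] - values[lo]))
    return grid, out
```

Down sampling to the roughest interval, the other simple option named in the text, would instead pick `step` as the largest gap between consecutive samples.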
time series data can be handled or to convert time series data into the form of static data so that the existing algorithms for clustering static data can be directly used. The former approach usually works directly with raw time series data, thus called the raw-data-based approach, and the major modification lies in replacing the distance/similarity measure for static data with an appropriate one for time series. The latter approach first converts raw time series data either into a feature vector of lower dimension or into a number of model parameters, and then applies a conventional clustering algorithm to the extracted feature vectors or model parameters, thus called the feature-based and model-based approach, respectively. Fig. 1 outlines the three different approaches: raw-data-based, feature-based, and model-based. Note that the left branch of the model-based approach trains the model and uses the model parameters for clustering without the need for another clustering algorithm.

[Fig. 1. Three time series clustering approaches: (a) raw-data-based, (b) feature-based, (c) model-based.]

Three of the five major categories of clustering methods for static data as reviewed in the Introduction, specifically partitioning methods, hierarchical methods, and model-based methods, have been utilized directly or modified for time series clustering. Several commonly used algorithms/procedures are reviewed in more detail in Section 2.1. Almost without exception each of the clustering algorithms/procedures reviewed in Section 2.1 requires a measure to compute the distance or similarity between two time series being compared. Depending upon whether the data are discrete-valued or real-valued and whether time series are of equal or unequal length, a particular measure might be more appropriate than another. Several commonly used distance/similarity measures are reviewed in more detail in Section 2.2. Most clustering algorithms/procedures are iterative in nature. Such algorithms/procedures rely on a criterion to determine when a good clustering is obtained in order to stop the iterative process. Several commonly used evaluation criteria are reviewed in more detail in Section 2.3.

2.1. Clustering algorithms/procedures

In this subsection, we briefly describe some general-purpose clustering algorithms/procedures that have been employed in previous time series clustering studies. Interested readers should refer to the original papers for the details of specially tailored time series clustering algorithms/procedures.

2.1.1. Relocation clustering

The relocation clustering procedure has the following three steps:

Step 1: Start with an initial clustering, denoted by $C$, having the prescribed number $k$ of clusters.

Step 2: For each time point compute the dissimilarity matrix and store all resultant matrices computed for all time points for the calculation of trajectory similarity.

Step 3: Find a clustering $C'$, such that $C'$ is better than $C$ in terms of the generalized Ward criterion function. The clustering $C'$ is obtained from $C$ by relocating one member
from $C_p$ to $C_q$ or by swapping two members between $C_p$ and $C_q$, where $C_p, C_q \in C$, $p, q = 1, 2, \ldots, k$, and $p \ne q$. If no such clustering exists, then stop; else replace $C$ by $C'$ and repeat Step 3.

This procedure works only with time series of equal length because the distance between two time series at some cross sections (time points where one series does not have a value) is ill defined.

2.1.2. Agglomerative hierarchical clustering

A hierarchical clustering method works by grouping data objects (time series here) into a tree of clusters. Two types of hierarchical clustering methods are often distinguished: agglomerative and divisive, depending upon whether a bottom-up or top-down strategy is followed. The agglomerative hierarchical clustering method is more popular than the divisive method. It starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all the objects are in a single cluster or until certain termination conditions are satisfied. The single (complete) linkage algorithm measures the similarity between two clusters as the similarity of the closest (farthest) pair of data points belonging to different clusters, merges the two clusters having the minimum distance, and repeats the merging process until all the objects are eventually merged to form one cluster. Ward's minimum variance algorithm merges the two clusters that will result in the smallest increase in the value of the sum-of-squares variance. At each clustering step, all possible mergers of two clusters are tried. The sum-of-squares variance is computed for each and the one with the smallest value is selected.

The performance of an agglomerative hierarchical clustering method often suffers from its inability to adjust once a merge decision has been executed. The same is true for divisive hierarchical clustering methods. Hierarchical clustering is not restricted to clustering time series of equal length. It is applicable to series of unequal length as well if an appropriate distance measure such as dynamic time warping is used to compute the distance/similarity.

2.1.3. k-Means and fuzzy c-means

The k-means (interchangeably called c-means in this study) was first developed more than three decades ago [2]. The main idea behind it is the minimization of an objective function, which is normally chosen to be the total distance between all patterns and their respective cluster centers. Its solution relies on an iterative scheme, which starts with arbitrarily chosen initial cluster memberships or centers. The distribution of objects among clusters and the updating of cluster centers are the two main steps of the c-means algorithm. The algorithm alternates between these two steps until the value of the objective function cannot be reduced anymore.

Given $n$ patterns $\{x_k \mid k = 1, \ldots, n\}$, c-means determines $c$ cluster centers $\{v_i \mid i = 1, \ldots, c\}$ by minimizing the objective function given as

$$\min J_1(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik} \lVert x_k - v_i \rVert^2 \tag{1}$$

s.t. (1) $u_{ik} \in \{0, 1\}\ \forall i, k$, (2) $\sum_{i=1}^{c} u_{ik} = 1,\ \forall k$. $\lVert \cdot \rVert$ in the above equation is normally the Euclidean distance measure. However, other distance measures could also be used. The iterative solution procedure generally has the following steps:

(1) Choose $c$ ($2 \le c \le n$) and $\varepsilon$ (a small number for stopping the iterative procedure). Set the counter $l = 0$ and the initial cluster centers, $V^{(0)}$, arbitrarily.
(2) Distribute $x_k$, $\forall k$, to determine $U^{(l)}$ such that $J_1$ is minimized. This is achieved normally by reassigning $x_k$ to a new cluster that is closest to it.
(3) Revise the cluster centers $V^{(l)}$.
(4) Stop if the change in $V$ is smaller than $\varepsilon$; otherwise, increment $l$ and repeat Steps 2 and 3.

Dunn [20] first extended the c-means algorithm to allow for fuzzy partition, rather than hard partition, by using the objective function given in Eq. (2) below:

$$\min J_2(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^2 \lVert x_k - v_i \rVert^2. \tag{2}$$

Note that $U = [\mu_{ik}]$ in this and the following equations denotes the matrix of a fuzzy c-partition. The fuzzy c-partition constraints are (1) $\mu_{ik} \in [0, 1]\ \forall i, k$, (2) $\sum_{i=1}^{c} \mu_{ik} = 1,\ \forall k$, and (3) $0 < \sum_{k=1}^{n} \mu_{ik} < n,\ \forall i$. In other words, each $x_k$ could belong to more than one cluster with each belongingness taking a fractional value between 0 and 1. Bezdek [4] generalized $J_2(U, V)$ to an infinite number of objective functions, i.e., $J_m(U, V)$, where $1 \le m \le \infty$. The new objective function subject to the same fuzzy c-partition constraints is

$$\min J_m(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^m \lVert x_k - v_i \rVert^2. \tag{3}$$

By differentiating the objective function with respect to $v_i$ (for fixed $U$) and to $\mu_{ik}$ (for fixed $V$) subject to the conditions, one obtains the following two equations:

$$v_i = \frac{\sum_{k=1}^{n} (\mu_{ik})^m x_k}{\sum_{k=1}^{n} (\mu_{ik})^m}, \quad i = 1, \ldots, c. \tag{4}$$

$$\mu_{ik} = \frac{(1/\lVert x_k - v_i \rVert^2)^{1/(m-1)}}{\sum_{j=1}^{c} (1/\lVert x_k - v_j \rVert^2)^{1/(m-1)}}, \quad i = 1, \ldots, c;\ k = 1, \ldots, n. \tag{5}$$

To solve the fuzzy c-means model, an iterative alternating optimization procedure is required. To run the procedure the
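The two update equations, Eqs. (4) and (5), can be sketched as a pair of plain-Python functions and alternated until the memberships stop changing. This is an illustrative sketch, not the implementation used in any of the surveyed studies. All function names are hypothetical. The guard for a pattern that coincides with a center, and the use of the largest absolute membership change as the stopping norm, are assumptions made here.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def update_centers(X, U, m):
    """Eq. (4): each center is the mu^m-weighted mean of all patterns."""
    dim = len(X[0])
    centers = []
    for row in U:  # one row of memberships per cluster
        w = [mu ** m for mu in row]
        total = sum(w)
        centers.append([sum(wk * x[d] for wk, x in zip(w, X)) / total
                        for d in range(dim)])
    return centers


def update_memberships(X, V, m):
    """Eq. (5), with an assumed guard for a pattern lying on a center."""
    c, n = len(V), len(X)
    U = [[0.0] * n for _ in range(c)]
    for k, x in enumerate(X):
        d2 = [sq_dist(x, v) for v in V]
        zeros = [i for i, d in enumerate(d2) if d == 0.0]
        if zeros:
            # Assumption: full membership in the first coinciding cluster.
            U[zeros[0]][k] = 1.0
            continue
        inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in d2]
        s = sum(inv)
        for i in range(c):
            U[i][k] = inv[i] / s
    return U


def fuzzy_c_means(X, U0, m=2.0, eps=1e-6, max_iter=100):
    """Alternate Eqs. (4) and (5) starting from membership matrix U0."""
    if m <= 1.0:
        raise ValueError("m must be greater than 1")
    U = [row[:] for row in U0]
    V = update_centers(X, U, m)
    for _ in range(max_iter):
        U_new = update_memberships(X, V, m)
        # Assumed stopping norm: largest absolute change in any membership.
        delta = max(abs(a - b)
                    for ra, rb in zip(U_new, U) for a, b in zip(ra, rb))
        U = U_new
        V = update_centers(X, U, m)
        if delta <= eps:
            break
    return U, V
```

Each pattern is a fixed-length list, which reflects the equal-length requirement the text places on this family of algorithms.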
number of clusters, $c$, and the weighting coefficient, $m$, must be specified. The FCM algorithm has the following steps:

(1) Choose $c$ ($2 \le c \le n$), $m$ ($1 < m < \infty$), and $\varepsilon$ (a small number for stopping the iterative procedure). Set the counter $l = 0$ and initialize the membership matrix, $U^{(l)}$.
(2) Calculate the cluster center, $v_i^{(l)}$, by using Eq. (4).
(3) Update the membership matrix $U^{(l+1)}$ by using Eq. (5) if $x_k \ne v_i^{(l)}$. Otherwise, set $\mu_{jk} = 1$ if $j = i$ and $\mu_{jk} = 0$ if $j \ne i$.
(4) Compute $\Delta = \lVert U^{(l+1)} - U^{(l)} \rVert$. If $\Delta > \varepsilon$, increment $l$ and go to Step 2. If $\Delta \le \varepsilon$, stop.

This group of algorithms works better with time series of equal length because the concept of cluster centers becomes unclear when the same cluster contains time series of unequal length.

2.1.4. Self-organizing maps

Self-organizing maps developed by Kohonen [19] are a class of neural networks with neurons arranged in a low-dimensional (often two-dimensional) structure and trained by an iterative unsupervised or self-organizing procedure. The training process is initialized by assigning small random values to the weight vectors $w$ of the neurons in the network. Each training iteration consists of three steps: the presentation of a randomly chosen input vector from the input space, the evaluation of the network, and an update of the weight vectors. After the presentation of a pattern, the Euclidean distance between the input pattern and the weight vector is computed for all neurons in the network. The neuron with the smallest distance is marked as $t$. Depending upon whether a neuron $i$ is within a certain spatial neighborhood $N_t(l)$ around $t$, its weight is updated according to the following updating rule:

$$w_i(l+1) = \begin{cases} w_i(l) + \alpha(l)[x(l) - w_i(l)] & \text{if } i \in N_t(l), \\ w_i(l) & \text{if } i \notin N_t(l). \end{cases} \tag{6}$$

Both the size of the neighborhood $N_t$ and the step size $\alpha$ of weight adaptation shrink monotonically with the iteration. Since the neighboring neurons are updated at each step, there is a tendency that neighboring neurons in the network represent neighboring locations in the feature space. In other words, the topology of the data in the input space is preserved during the mapping. Like the group of k-means and fuzzy c-means algorithms, SOM does not work well with time series of unequal length due to the difficulty involved in defining the dimension of weight vectors.

2.2. Similarity/distance measures

One key component in clustering is the function used to measure the similarity between two data being compared. These data could be in various forms including raw values of equal or unequal length, vectors of feature-value pairs, transition matrices, and so on.

2.2.1. Euclidean distance, root mean square distance, and Minkowski distance

Let $x_i$ and $v_j$ each be a $P$-dimensional vector. The Euclidean distance is computed as

$$d_E = \sqrt{\sum_{k=1}^{P} (x_{ik} - v_{jk})^2}. \tag{7}$$

The root mean square distance (or average geometric distance) is simply

$$d_{rms} = d_E / n. \tag{8}$$

Minkowski distance is a generalization of Euclidean distance, which is defined as

$$d_M = \sqrt[q]{\sum_{k=1}^{P} |x_{ik} - v_{jk}|^q}. \tag{9}$$

In the above equation, $q$ is a positive integer. A normalized version can be defined if the measured values are normalized via division by the maximum value in the sequence.

2.2.2. Pearson's correlation coefficient and related distances

Let $x_i$ and $v_j$ each be a $P$-dimensional vector. Pearson's correlation factor between $x_i$ and $v_j$, $cc$, is defined as

$$cc = \frac{\sum_{k=1}^{P} (x_{ik} - \bar{x}_i)(v_{jk} - \bar{v}_j)}{S_{x_i} S_{v_j}}, \tag{10}$$

where $\bar{x}_i$ and $S_{x_i}$ are, respectively, the mean and scatter of $x_i$, computed as below:

$$\bar{x}_i = \frac{1}{P} \sum_{k=1}^{P} x_{ik} \quad \text{and} \quad S_{x_i} = \left( \sum_{k=1}^{P} (x_{ik} - \bar{x}_i)^2 \right)^{0.5}. \tag{11}$$

Two cross-correlation-based distances used by Golay et al. [21] in the fuzzy c-means algorithm are

$$d_{cc}^{1} = \left( \frac{1 - cc}{1 + cc} \right)^{\beta} \tag{12}$$

and

$$d_{cc}^{2} = 2(1 - cc). \tag{13}$$

In Eq. (12), $\beta$ has a similar function as $m$ in the fuzzy c-means algorithm and takes a value greater than zero.

2.2.3. Short time series distance

Considering each time series as a piecewise linear function, Möller-Levet et al. [22] proposed the STS distance as the sum of the squared differences of the slopes in two time
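The measures of Sections 2.2.1 and 2.2.2 translate directly into short functions, sketched below. The function names are hypothetical. The Minkowski version takes absolute differences, which is the standard definition, and the clamping of the correlation to $[-1, 1]$ is an added safeguard against rounding error rather than part of the definitions.

```python
import math


def euclidean(x, v):
    """Eq. (7): square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v)))


def minkowski(x, v, q):
    """Eq. (9), using absolute differences (standard definition)."""
    return sum(abs(a - b) ** q for a, b in zip(x, v)) ** (1.0 / q)


def pearson_cc(x, v):
    """Eqs. (10)-(11). Undefined (division by zero) for a constant series."""
    P = len(x)
    mx, mv = sum(x) / P, sum(v) / P
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    cc = sum((a - mx) * (b - mv) for a, b in zip(x, v)) / (sx * sv)
    # Safeguard (assumption): clamp tiny floating-point overshoots.
    return max(-1.0, min(1.0, cc))


def d1_cc(x, v, beta=1.0):
    """Eq. (12). Undefined when cc == -1; beta must be greater than zero."""
    cc = pearson_cc(x, v)
    return ((1.0 - cc) / (1.0 + cc)) ** beta


def d2_cc(x, v):
    """Eq. (13)."""
    return 2.0 * (1.0 - pearson_cc(x, v))
```

Both correlation-based distances are zero for perfectly correlated series, and `d2_cc` reaches its maximum of 4 for perfectly anti-correlated ones.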
series being compared. Mathematically, the STS distance between two time series $x_i$ and $v_j$ is defined as

$$d_{STS} = \sqrt{\sum_{k=1}^{P} \left( \frac{v_{j(k+1)} - v_{jk}}{t_{(k+1)} - t_k} - \frac{x_{i(k+1)} - x_{ik}}{t_{(k+1)} - t_k} \right)^2}, \tag{14}$$

where $t_k$ is the time point for data points $x_{ik}$ and $v_{jk}$. To remove the effect of scale, z standardization of the series is recommended.

2.2.4. Dynamic time warping distance

Dynamic time warping (DTW) is a generalization of classical algorithms for comparing discrete sequences to sequences of continuous values. Given two time series, $Q = q_1, q_2, \ldots, q_i, \ldots, q_n$ and $R = r_1, r_2, \ldots, r_j, \ldots, r_m$, DTW aligns the two series so that their difference is minimized. To this end, an $n \times m$ matrix is constructed, where the $(i, j)$ element of the matrix contains the distance $d(q_i, r_j)$ between two points $q_i$ and $r_j$. The Euclidean distance is normally used. A warping path, $W = w_1, w_2, \ldots, w_k, \ldots, w_K$ where $\max(m, n) \le K \le m + n - 1$, is a set of matrix elements that satisfies three constraints: boundary condition, continuity, and monotonicity. The boundary condition constraint requires the warping path to start and finish in diagonally opposite corner cells of the matrix. That is, $w_1 = (1, 1)$ and $w_K = (n, m)$. The continuity constraint restricts the allowable steps to adjacent cells. The monotonicity constraint forces the points in the warping path to be monotonically spaced in time. The warping path that has the minimum distance between the two series is of interest. Mathematically,

$$d_{DTW} = \min \frac{\sum_{k=1}^{K} w_k}{K}. \tag{15}$$

Dynamic programming can be used to effectively find this path by evaluating the following recurrence, which defines the cumulative distance as the sum of the distance of the current element and the minimum of the cumulative distances of the adjacent elements:

$$d_{cum}(i, j) = d(q_i, r_j) + \min\{d_{cum}(i-1, j-1),\ d_{cum}(i-1, j),\ d_{cum}(i, j-1)\}. \tag{16}$$

2.2.5. Probability-based distance function for data with errors

This function was originally developed by Kumar et al. [23] in their study of clustering seasonality patterns. They defined the similarity/distance between two seasonalities, $A_i$ and $A_j$, as the probability of accepting/rejecting the null hypothesis $H_0: A_i \sim A_j$. Assuming $A_i$ and $A_j$ each comprise $T$ independent samples drawn from Gaussian distributions with means $x_{it}$ and $x_{jt}$ and standard deviations $\sigma_{it}$ and $\sigma_{jt}$, respectively, the statistic $\sum_{t=1}^{T} (x_{it} - x_{jt})^2 / (\sigma_{it}^2 + \sigma_{jt}^2)$ follows the chi-square distribution with $T - 1$ degrees of freedom. Consequently,

$$d_{ij} = \chi^2_{T-1} \left( \sum_{t=1}^{T} \frac{(x_{it} - x_{jt})^2}{\sigma_{it}^2 + \sigma_{jt}^2} \right). \tag{17}$$

The null hypothesis $A_i \sim A_j$ denotes $\mu_{it} = \mu_{jt}$ for $t = 1, \ldots, T$.

2.2.6. Kullback–Leibler distance

Let $P_1$ and $P_2$ be matrices of transition probabilities of two Markov chains (MCs) with $s$ probability distributions each, and let $p_{1ij}$ and $p_{2ij}$ be the $i \to j$ transition probability in $P_1$ and $P_2$. The asymmetric Kullback–Leibler distance of two probability distributions is

$$d(p_{1i}, p_{2i}) = \sum_{j=1}^{s} p_{1ij} \log(p_{1ij} / p_{2ij}). \tag{18}$$

The symmetric version of the Kullback–Leibler distance of two probability distributions is

$$D(p_{1i}, p_{2i}) = [d(p_{1i}, p_{2i}) + d(p_{2i}, p_{1i})] / 2. \tag{19}$$

The average distance between $P_1$ and $P_2$ is then $D(P_1, P_2) = \sum_{i=1}^{s} D(p_{1i}, p_{2i}) / s$.

2.2.7. J divergence and symmetric Chernoff information divergence

Let $f_T(\lambda_s)$ and $g_T(\lambda_s)$ be two spectral matrix estimators for two different stationary vector series with $p$ dimensions and $T$ time points, where $\lambda_s = 2\pi s / T$, $s = 1, 2, \ldots, T$. The J divergence and symmetric Chernoff information divergence are computed as [24]

$$J(f_T; g_T) = \frac{1}{2} T^{-1} \sum_{s} \left( \mathrm{tr}\{f_T g_T^{-1}\} + \mathrm{tr}\{g_T f_T^{-1}\} - 2p \right) \tag{20}$$

and

$$JB_{\alpha}(f_T; g_T) = \frac{1}{2} T^{-1} \sum_{s} \left( \log \frac{|\alpha f_T + (1 - \alpha) g_T|}{|g_T|} + \log \frac{|\alpha g_T + (1 - \alpha) f_T|}{|f_T|} \right), \tag{21}$$

where $0 < \alpha < 1$ and $p$ is the size of the spectral matrices. Both divergences are quasi-distance measures because they do not satisfy the triangle inequality property.

There is a locally stationary version of J divergence for measuring the discrepancy between two non-stationary time series. The details can be found in Refs. [25,26].

2.2.8. Dissimilarity index based on the cross-correlation function between two time series

Let $\rho_{i,j}^2(\tau)$ denote the cross-correlation between two time series $x_i$ and $v_j$ with lag $\tau$. One dissimilarity index based
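The recurrence of Eq. (16) can be evaluated with a small dynamic-programming table, as in the sketch below. The function name and the default point-wise distance (absolute difference, which equals the Euclidean distance for scalar points) are assumptions. The function returns the un-normalized cumulative cost of the best path; dividing by the path length $K$, as in Eq. (15), would additionally require tracking the path.

```python
def dtw_cumulative(q, r, dist=lambda a, b: abs(a - b)):
    """Cumulative DTW cost between sequences q and r via Eq. (16)."""
    n, m = len(q), len(r)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cum[i][j] is d_cum for prefixes q[:i] and r[:j]; row and column 0
    # are borders, so the path is forced to start at cell (1, 1).
    cum = [[inf] * (m + 1) for _ in range(n + 1)]
    cum[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cum[i][j] = dist(q[i - 1], r[j - 1]) + min(
                cum[i - 1][j - 1],  # diagonal step
                cum[i - 1][j],      # advance in q only
                cum[i][j - 1],      # advance in r only
            )
    return cum[n][m]
```

Because the table is $n \times m$, the two series need not have the same length, which is why DTW suits the unequal-length case discussed in Section 2.1.2.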
on the cross-correlation function is defined as

$$d_{i,j} = \sqrt{\frac{1 - \rho_{i,j}^2(0)}{\sum_{\tau=1}^{\tau_{\max}} \rho_{i,j}^2(\tau)}}, \tag{22}$$

where $\tau_{\max}$ is the maximum lag. The similarity counterpart of the above index can be defined as

$$s_{i,j} = \exp(-d_{i,j}). \tag{23}$$

2.2.9. Dissimilarity between two spoken words

Let $x_i$ be a pattern representing a replication of one specific spoken word. Each pattern has an inherent duration (e.g., $x_i$ is $n_i$ frames long) and each frame is represented by a vector of LPC coefficients. A symmetric distance between patterns $x_i$ and $x_j$, $d_{SW}(x_i, x_j)$, is defined as

$$d_{SW}(x_i, x_j) = \frac{\delta(x_i, x_j) + \delta(x_j, x_i)}{2} \tag{24}$$

and

$$\delta(x_i, x_j) = \frac{1}{n_i} \sum_{k=1}^{n_i} \log \left( \frac{(a_{w(k)}^{j})' R_k^{i} (a_{w(k)}^{j})}{(a_k^{i})' R_k^{i} (a_k^{i})} \right), \tag{25}$$

where $a_k^{i}$ is the vector of LPC coefficients of the $k$th frame of pattern $i$, $R_k^{i}$ is the matrix of autocorrelation coefficients of the $k$th frame of pattern $i$, and $'$ denotes vector transpose. The function $w(k)$ is the warping function obtained from a dynamic time warp match of pattern $j$ to pattern $i$ which minimizes their distance over a constrained set of possible $w(k)$.

2.3. Clustering results evaluation criteria

The performance of a time series clustering method must be evaluated with some criteria. Two different categories of evaluation criteria can be distinguished: known ground truth and unknown ground truth. The number of clusters is usually known for the former and not known for the latter. We will review only some general criteria below. Readers should refer to the original paper for each criterion specific to a particular clustering method.

2.3.1. Criteria based on known ground truth

Let $G$ and $C$ be the set of $k$ ground truth clusters and those obtained by a clustering method under evaluation, respectively. The cluster similarity measure is defined as

$$\mathrm{Sim}(G, C) = \frac{1}{k} \sum_{i=1}^{k} \max_{1 \le j \le k} \mathrm{Sim}(G_i, C_j), \tag{26}$$

where

$$\mathrm{Sim}(G_i, C_j) = \frac{2 |G_i \cap C_j|}{|G_i| + |C_j|}. \tag{27}$$

$|\cdot|$ in the above equation denotes the cardinality of the elements in the set.

2.3.2. Criteria based on unknown ground truth

Two cases can be further distinguished: one assuming that the number of clusters is known a priori and the other not. The relocation clustering, k-means, and fuzzy c-means algorithms all require the number of clusters to be specified. A number of validation indices have been proposed in the past [27]. Maulik and Bandyopadhyay [28] evaluated four cluster validity indices. No time series clustering studies used any one of the validity indices to determine the appropriate number of clusters for their application.

Let $P_k$ denote the set of all clusterings that partition a set of multivariate time series into a pre-specified number $k$ of clusters. Košmelj and Batagelj [29] determined the best among all possible clusterings by the following criterion function:

$$P(C^*) = \min_{C_j \in C \in P_k} \sum_{j=1}^{k} p(C_j), \tag{28}$$

where

$$p(C_j) = \sum_{t=1}^{T} \alpha_t(C_j)\, p_t(C_j) \tag{29}$$

and

$$p_t(C_j) = \frac{1}{2 w(C_j)} \sum_{X, Y \in C_j} w(X) w(Y) d_t(X, Y). \tag{30}$$

In the above equation, $w(X)$ represents the weight of $X$, $w(C_j) = \sum_{X \in C_j} w(X)$ represents the weight of cluster $C_j$, and $d_t(X, Y)$ the dissimilarity between $X$ and $Y$ at time $t$. By varying $k$, the most appropriate number of clusters is the one with minimum $P(C^*)$.

To determine the number of clusters $g$, Baragona [30] maximizes the following function:

$$\sum_{\nu=1}^{g} \sum_{i,j \in C_\nu,\ i \ne j} s_{i,j}, \tag{31}$$

where $s_{i,j}$ is a similarity index as defined in Eq. (23). Information criteria such as AIC [31], BIC [32], and ICL [33] can be used if the data come from an underlying mixture of Gaussian distributions with equal isotropic covariance matrices. The optimal number of clusters is the one that yields the highest value of the information criterion.

3. Major time series clustering approaches

This paper groups previously developed time series clustering methods into three major categories depending upon whether they work directly with raw data, indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The essence of each study is
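The cluster similarity measure of Eqs. (26) and (27) can be sketched in a few lines, assuming each cluster is given as a collection of hashable object identifiers. The function names and this data layout are illustrative choices, not taken from the survey.

```python
def pair_similarity(g, c):
    """Eq. (27): overlap of one true cluster g with one found cluster c."""
    g, c = set(g), set(c)
    return 2.0 * len(g & c) / (len(g) + len(c))


def cluster_similarity(G, C):
    """Eq. (26): mean, over true clusters, of the best match in C."""
    return sum(max(pair_similarity(g, c) for c in C) for g in G) / len(G)
```

For example, with true clusters `[{1, 2, 3}, {4, 5, 6}]`, a clustering that recovers both groups in any order scores 1.0, while one that moves object 3 into the second group scores lower. Note that the measure is not symmetric in `G` and `C`, since the maximum is taken over the found clusters only.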
summarized in this section. Studies using clustering algorithms, similarity/dissimilarity measures, and evaluation criteria reviewed in Sections 2.1, 2.2, and 2.3, respectively, are italicized.

3.1. Raw-data-based approaches

Methods that work with raw data, either in the time or frequency domain, are placed into this category. The two time series being compared are normally sampled at the same interval, but their length (or number of time points) might or might not be the same.

For clustering multivariate time varying data, Košmelj and Batagelj [29] modified the relocation clustering procedure that was originally developed for static data. For measuring the dissimilarity between trajectories as required by the procedure, they first introduced a cross-sectional approach-based general model that incorporated the time dimension, and then developed a specific model based on the compound interest idea to determine the time-dependent linear weights. The proposed cross-sectional procedure ignores the correlations between the variables over time and works only with time series of equal length. To form a specified number of clusters, the best clustering among all the possible clusterings is the one with the minimum generalized Ward criterion function. Also taking the cross-sectional approach, Liao et al. [34] applied several clustering algorithms including k-means, fuzzy c-means, and genetic clustering to multivariate battle simulation time series data of unequal length with the objective of forming a discrete number of battle states. The original time series data were not evenly sampled and were made uniform by simple linear interpolation.

Golay et al. [21] applied the fuzzy c-means algorithm to functional MRI data (univariate time series of equal length) in order to provide the functional maps of human brain activity on the application of a stimulus. Three different distances, the Euclidean distance and two cross-correlation-based distances, were alternatively used in the algorithm. One of the two cross-correlation-based distances, $d^1_{cc}$, was found to be the best. Several data preprocessing approaches were evaluated, and the effect of the number of clusters was also discussed. However, they proposed no procedure to determine the optimal number of clusters. Instead, they recommended using a large number of clusters as an initial guess, reserving the possibility of reducing this large number to obtain a clear description of the clusters without redundancy or acquisition of insignificant cluster centers.

van Wijk and van Selow [35] performed an agglomerative hierarchical clustering of daily power consumption data based on the root mean square distance. How the clusters were distributed over the week and over the year was also explored with calendar-based visualization.

Kumar et al. [23] proposed a distance function based on the assumed independent Gaussian models of data errors and used a hierarchical clustering method to group seasonality sequences into a desirable number of clusters. The experimental results based on simulated data and retail data showed that the new method outperformed both k-means and Ward's method, which do not consider data errors, in terms of (arithmetic) average estimation error. They assumed that the data used have been preprocessed to remove the effects of non-seasonal factors and normalized to enable comparison of sales of different items on the same scale.

For the analysis of dynamic biomedical image time series data, Wismüller et al. [36] showed that deterministic annealing by the minimal free energy vector quantization (VQ) could be effective. It realizes a hierarchical unsupervised learning procedure to unveil the structure of the data set with gradually increasing clustering resolution. In particular, the method was used (i) to identify activated brain regions in functional MRI studies of visual stimulation experiments, (ii) to unveil regional abnormalities of brain perfusion characterized by differences of signal magnitude and dynamics in contrast-enhanced cerebral perfusion MRI, and (iii) for the analysis of suspicious lesions in patients with breast cancer in dynamic MRI mammography data.

In their study of DNA microarray data, Möller-Levet et al. [22] proposed the short time series (STS) distance to measure the similarity in shape formed by the relative change of amplitude and the corresponding temporal information of uneven sampling intervals. All series are considered sampled at the same time points. By incorporating the STS distance into the standard fuzzy c-means algorithm, they revised the equations for computing the membership matrix and the prototypes (or cluster centers), and thus developed a fuzzy time series clustering algorithm.

To group multivariate vector series of earthquakes and mining explosions, Kakizawa et al. [24] applied hierarchical clustering as well as k-means clustering. They measured the disparity between spectral matrices corresponding to the p×p matrices of autocovariance functions of two zero-mean vector stationary time series with two quasi-distances: the J divergence and symmetric Chernoff information divergence.

Shumway [26] investigated the clustering of non-stationary time series by applying locally stationary versions of Kullback–Leibler discrimination information measures that give optimal time–frequency statistics for measuring the discrepancy between two non-stationary time series. To distinguish earthquakes from explosions, an agglomerative hierarchical cluster analysis was performed until a final set of two clusters was obtained.

Policker and Geva [37] modeled non-stationary time series with a time varying mixture of stationary sources, comparable to the continuous hidden Markov model. The fuzzy clustering procedure developed by Gath and Geva [38] was applied to a series of P data points as a set of unordered observations to compute the membership matrix for a specified number of clusters. After performing the clustering, the series is divided into a set of P/K segments, each including K data points. The temporal value of each segment belonging to each cluster is computed as the average of the membership values of its data points. The optimal number of clusters is
determined by a temporal cluster validation criterion. If the symmetric Kullback–Leibler distance between all the probability function pairs is bigger than a given small threshold, then the number of clusters being tested is set as the optimal one; otherwise, the old optimal value is retained. The resultant membership matrix associated with the determined number of clusters was given an interpretation as the weights in a time varying, mixture probability distribution function.

Liao [39] developed a two-step procedure for clustering multivariate time series of equal or unequal length. The first step applies the k-means or fuzzy c-means clustering algorithm to time stripped data in order to convert multivariate real-valued time series into univariate discrete-valued time series. The converted variable is interpreted as a state variable process. The second step employs the k-means or FCM algorithm again to group the converted univariate time series, expressed as transition probability matrices, into a number of clusters. The traditional Euclidean distance is used in the first step, whereas various distance measures including the symmetric version of the Kullback–Leibler distance are employed in the second step.

Table 1 summarizes the major components used in each raw-data-based clustering algorithm and the type of time series data the algorithm is for.

3.2. Feature-based approaches

Clustering based on raw data implies working with high-dimensional space, especially for data collected at fast sampling rates. It is also not desirable to work directly with raw data that are highly noisy. Several feature-based clustering methods have been proposed to address these concerns. Though most feature extraction methods are generic in nature, the extracted features are usually application dependent. That is, one set of features that works well on one application might not be relevant to another. Some studies even take another feature selection step to further reduce the number of feature dimensions after feature extraction.

With the objectives to develop an automatic clustering algorithm, which could be implemented for any user with a minimal amount of knowledge about clustering procedures, and to provide template sets as accurate as those created by other clustering algorithms, Wilpon and Rabiner [40] modified the standard k-means clustering algorithm for the recognition of isolated words. The modifications address problems such as how to obtain cluster centers, how to split clusters to increase the number of clusters, and how to create the final cluster representations. Each pattern representing a replication of one specific spoken word has an inherent duration (e.g., $n_i$ frames long), and each frame is a vector of linear predictive coding (LPC) coefficients. To measure the distance between two spoken word patterns, a symmetric distance measure was defined based on the Itakura distance for measuring the distance between two frames. The proposed modified k-means (MKM) clustering algorithm was shown to outperform the unsupervised without averaging (UWA) clustering algorithm, which was well established at that time.

Shaw and King [41] clustered time series indirectly by applying two hierarchical clustering algorithms, Ward's minimum variance algorithm and the single linkage algorithm, to normalized spectra (normalized by the amplitude of the largest peak). The spectra were constructed from the original time series with the means adjusted to zero. The principal component analysis (PCA) filtered spectra were also clustered; it was found that using the 14 most significant eigenvectors could achieve comparable results. The Euclidean distance was used.

Goutte et al. [42] clustered fMRI time series (P slices of images) in groups of voxels with similar activations using two algorithms: k-means and Ward's hierarchical clustering. The cross-correlation function between the fMRI activation and the paradigm (or stimulus) was used as the feature space, instead of the raw fMRI time series. For each voxel j in the image, $y_j$ denotes the measured fMRI time series and p is the activation stimulus (assumed to be a square wave, but not limited to one), common to all j. The cross-correlation function is defined as

$$x_j(t) = \frac{1}{P} \sum_{u=1}^{P} y_j(u)\, p(u - t), \qquad -T < t < T, \tag{32}$$

where $p(i) = 0$ for $i < 0$ or $i > P$ and T is of the order of the stimulus period. In a subsequent paper Goutte et al. [43] further illustrated the potential of the feature-based clustering method. First, they used only two features, namely the delay and strength of activation measured on a voxel-by-voxel basis, to show that one could identify the regions with significantly different delays and activations. Using the k-means algorithm, they investigated the performance of three information criteria, AIC [31], BIC [32], and ICL [33], for determining the optimal number of clusters. It was found that ICL was most parsimonious and AIC tended to overestimate. Then, they showed that feature-based clustering could be used as a meta-analysis tool in evaluating the similarities and differences of the results obtained by several individual voxel analyses. In this case, features are results of previous analyses performed on the data.

Fu et al. [44] described the use of self-organizing maps for grouping data sequences segmented from the numerical time series using a continuous sliding window with the aim to discover similar temporal patterns dispersed along the time series. They introduced the perceptually important point (PIP) identification algorithm to reduce the dimension of the input data sequence D in accordance with the query sequence Q. The distance measure between the PIPs found in D and Q was defined as the sum of the mean squared distance along the vertical scale (the magnitude) and that along the horizontal scale (time dimension). To process multiresolution patterns, training patterns from different resolutions are grouped into a set of training samples to which the SOM clustering process is applied only once. Two enhancements
Table 1
Summary of raw-data-based time series clustering algorithms

| Paper | Variable | Length | Distance measure | Clustering algorithm | Evaluation criterion | Application |
|---|---|---|---|---|---|---|
| Golay et al. [21] | Single | Equal | Euclidean and two cross-correlation-based | Fuzzy c-means | Within cluster variance | Functional MRI brain activity mapping |
| Kakizawa et al. [24] | Multiple | Equal | J divergence and symmetric Chernoff information divergence | Agglomerative hierarchical | N/A | Earthquakes and mining explosions |
| Košmelj and Batagelj [29] | Multiple | Equal | Euclidean | Modified relocation clustering procedure | Generalized Ward criterion function | Commercial energy consumption |
| Kumar et al. [23] | Single | Equal | Based on the assumed independent Gaussian models of data errors | Agglomerative hierarchical | N/A | Seasonality pattern in retails |
| Liao [39] | Multiple | Equal & unequal | Euclidean and symmetric version of Kullback–Leibler distance | k-Means and fuzzy c-means | Within cluster variance | Battle simulations |
| Liao et al. [34] | Single | Equal & unequal | DTW | k-Medoids-based genetic clustering | Several different fitness functions | Battle simulations |
| Möller-Levet et al. [22] | Single | Equal | Short time series (STS) distance | Modified fuzzy c-means | Within cluster variance | DNA microarray |
| Policker and Geva [37] | Single | Equal | Euclidean | Fuzzy clustering by Gath and Geva | Symmetric Kullback–Leibler distance between probability function pairs | Sleep EEG signals |
| Shumway [26] | Multiple | Equal | Kullback–Leibler discrimination information measures | Agglomerative hierarchical | N/A | Earthquakes and mining explosions |
| van Wijk and van Selow [35] | Single | Equal | Root mean square | Agglomerative hierarchical | N/A | Daily power consumption |
| Wismüller et al. [36] | Single | Equal | N/A | Neural network clustering performed by a batch EM version of minimal free energy vector quantization | Within cluster variance | Functional MRI brain activity mapping |
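To make the "Distance measure" column of Table 1 more concrete, the following is a minimal Python sketch of three of the listed measures: the Euclidean distance, the root mean square distance, and dynamic time warping (DTW). The function names are hypothetical, and the DTW variant shown (absolute-difference local cost, no warping-window constraint) is an assumption made for illustration; it is not necessarily the formulation used in any of the surveyed papers.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length series."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def root_mean_square(x, y):
    """Euclidean distance scaled by sqrt(n), n being the common length."""
    if not x:
        raise ValueError("series must be non-empty")
    return euclidean(x, y) / math.sqrt(len(x))


def dtw(x, y):
    """Unconstrained DTW distance; the two series may differ in length.

    Assumption: local cost is |x_i - y_j| and the usual three step
    directions (insert, delete, match) are allowed.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # step in x only
                                     cost[i][j - 1],      # step in y only
                                     cost[i - 1][j - 1])  # step in both
    return cost[n][m]
```

Only the DTW function accepts series of unequal length, which is consistent with the "Equal & unequal" entry for the DTW-based row of the table; the other two require a point-by-point correspondence.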
were made to the SOM: filtering out those nodes (patterns) in the output layer that did not participate in the recall process and consolidating the discovered patterns with a relatively more general pattern by a redundancy removal step.

An algorithm called sequence cluster refinement algorithm (SCRA) was developed by Owsley et al. [45] to group machine tool monitoring data into clusters represented as discrete hidden Markov models (HMMs), each reflecting a kind of health status of the tool. The developed algorithm differs from the generalized Lloyd algorithm, a vector quantization algorithm, in representing the clusters as HMMs instead of template vectors. Instead of processing the entire raw data, the transient events in the bulk data signal are first detected by template matching. A high-resolution time–frequency representation of the transient region is then formed. To reduce dimension, they modified the self-organizing feature map algorithm in order to improve its generalization abilities.

Vlachos et al. [46] presented an approach to perform incremental clustering of time series at various resolutions using the Haar wavelet transform. First, the Haar wavelet decomposition is computed for all time series. Then, the k-means clustering algorithm is applied, starting at the coarse level and gradually progressing to finer levels. The final centers at the end of each resolution are reused as the initial centers for the next level of resolution. Since the length of the data reconstructed from the Haar decomposition doubles as we progress to the next level, each coordinate of the centers at the end of level i is doubled to match the dimensionality of the points on level i + 1. The clustering error is computed at the end of each level as the sum of the number of incorrectly clustered objects for each cluster divided by the cardinality of the dataset.

Table 2 summarizes the major components used in each feature-based clustering algorithm. They all can handle series with unequal length because the feature extraction operation takes care of the issue. For a multivariate time series, the features extracted can simply be put together or go through some fusion operation to reduce the dimension and improve the quality of the clustering results, as in classification studies.

3.3. Model-based approaches

This class of approaches considers that each time series is generated by some kind of model or by a mixture of underlying probability distributions. Time series are considered similar when the models characterizing individual series or the remaining residuals after fitting the model are similar.

For clustering or choosing from a set of dynamic structures (specifically the class of ARIMA invertible models), Piccolo [47] introduced the Euclidean distance between their corresponding autoregressive expansions as the metric. The metric satisfies the classical properties of a distance, i.e., non-negativity, symmetry, and triangularity. In addition, six properties of the metric were discussed. The distance matrix between pairs of time series models was then processed by a complete linkage clustering method to construct the dendrogram.

Baragona [30] evaluated three meta-heuristic methods for partitioning a set of time series into clusters in such a way that (i) the cross-correlation maximum absolute value between each pair of time series that belong to the same cluster is greater than some given threshold, and (ii) the k-min cluster criterion is minimized with a specified number of clusters. The cross-correlations are computed from the residuals of the models of the original time series. Among all methods evaluated, Tabu search was found to perform better than single linkage, pure random search, simulated annealing and genetic algorithms based on a simulation experiment on ten sets of artificial time series generated from low-order univariate and vector ARMA models.

Motivated by questions raised in the context of musical performance theory, Beran and Mazzola [48] defined hierarchical smoothing models (or HISMOOTH models) to understand the relationship between the symbolic structure of a music score and its performance, with each represented by a time series. The models are characterized by a hierarchy of bandwidths and a vector of coefficients. Given n performances and a common explanatory time series, the estimated bandwidth values are then used in clustering using the S-plus functions plclust and hclust, which plot the clustering tree structure produced by agglomerative hierarchical clustering.

Maharaj [49] developed an agglomerative hierarchical clustering procedure that is based on the p-value of a test of hypothesis applied to every pair of given stationary time series. Assuming that each stationary time series can be fitted by a linear AR(k) model denoted by a vector of parameters $\pi = [\pi_1, \pi_2, \ldots, \pi_k]$, a chi-square distributed test statistic was derived to test the null hypothesis that there is no difference between the generating processes of two stationary time series, or $H_0: \pi_x = \pi_y$. Two series are grouped together if the associated p-value is greater than the pre-specified significance level. The clustering result is evaluated with a measure of discrepancy, which is defined as the difference between the actual number of clusters and the number of exactly correct clusters generated.

Ramoni et al. [50] presented BCD: a Bayesian algorithm for clustering by dynamics. Given a set S of n univariate discrete-valued time series, BCD transforms each series into a Markov chain (MC) and then clusters similar MCs to discover the most probable set of generating processes. BCD is basically an unsupervised agglomerative clustering method. Considering a partition as a hidden discrete variable C, each state $C_k$ of C represents a cluster of time series, and hence determines a transition matrix. The task of clustering is regarded as a Bayesian model selection problem with the objective to select the model with the maximum posterior probability. Since the same data are used to compare all models and all models are equally likely, the comparison can be based on the marginal
Table 2
Summary of feature-based time series clustering algorithms

| Paper | Variable | Features | Feature selection | Distance measure | Clustering algorithm | Evaluation criterion | Application |
|---|---|---|---|---|---|---|---|
| Fu et al. [44] | Single | Perceptually important points | No | Sum of the mean squared distance along the vertical and horizontal scales | Modified SOM | Expected squared error | Hong Kong stock market |
| Goutte et al. [42] | Single | Cross-correlation function | No | Euclidean | Agglomerative hierarchical and k-means | N/A and Within cluster variance | Functional MRI brain activity mapping |
| Owsley et al. [45] | Single | Time-frequency representation of the transient region | Modified SOM | Euclidean | Modified k-means (Sequence cluster refinement) | Within cluster variance | Tool condition monitoring |
| Shaw and King [41] | Single | Normalized spectra | PCA | Euclidean | Agglomerative hierarchical | N/A | Flow velocity in a wind tunnel |
| Vlachos et al. [46] | Single | Haar wavelet transform | No | Euclidean | Modified k-means (called I-k-means) | Within cluster variance | Non-specific |
| Wilpon and Rabiner [40] | Single | LPC coefficients | No | A symmetric measure based on the Itakura distance | Modified k-means | Within cluster variance | Isolated word recognition |

likelihood p(S|MC), which is a measure of how likely the data are if the model MC is true. The similarity between two estimated transition matrices is measured as an average of the symmetrized Kullback–Leibler distance between corresponding rows in the matrices. The clustering result is evaluated mainly by a measure of the loss of data information induced by clustering, which is specific to the proposed clustering method. They also presented a Bayesian clustering algorithm for multivariate time series [51]. The algorithm searches for the most probable set of clusters given the data using a similarity-based heuristic search method. The measure of similarity is an average of the Kullback–Leibler distances between comparable transition probability tables. The similarity measure is used as a heuristic guide for the search process rather than a grouping criterion. Both the grouping and stopping criteria are based on the posterior probability of the obtained clustering. The objective is to find a maximum posterior probability partition of the set of MCs.

Kalpakis et al. [52] studied the clustering of ARIMA time series, using the Euclidean distance between the Linear Predictive Coding (LPC) cepstra of two time series as their dissimilarity measure. The cepstral coefficients for an AR(p) time series are derived from the auto-regression coefficients. The partition around medoids method [3], which is a k-medoids algorithm, was chosen, with the clustering results evaluated with the cluster similarity measure and Silhouette width. Based on a test of four data sets, they showed that the LPC cepstrum provides higher discriminatory power to tell one time series from another and superior clusterings than other widely used methods such as the Euclidean distance between (the first 10 coefficients of) the DFT, DWT, PCA, and DFT of the auto-correlation function of two time series.

Xiong and Yeung [53] proposed a model-based method for clustering univariate ARIMA series. They assumed that the time series are generated by k different ARMA models, with each model corresponding to one cluster of interest. An expectation-maximization (EM) algorithm was used to learn the mixing coefficients as well as the parameters of the component models that maximize the expectation of the complete-data log-likelihood. In addition, the EM algorithm was improved so that the number of clusters could be determined automatically. The evaluation criterion used is the cluster similarity measure detailed in Section 2.3.1. The proposed method was compared with that of Kalpakis et al. using the same four datasets.

Assuming the Gaussian mixture model for speaker verification, Tran and Wagner [54] proposed a fuzzy c-means clustering-based normalization method to find a better score to be compared with a given threshold for accepting or rejecting a claimed speaker. It overcomes the drawback of assuming equal weight of all the likelihood values of the background speakers in current normalization methods. Let $\lambda_0$ be the claimed speaker model and $\lambda_i$, $i = 1, \ldots, B$, be a model representing another possible speaker model and B is the
total number of "background" speaker models. $P(X|\lambda_0)$ and $P(X|\lambda_i)$ are the likelihood functions of the claimed speaker and an impostor, respectively, for a given input utterance X. The FCM membership score is defined as follows:

$$S(X) = \left[ \sum_{i=1}^{B} \left( \frac{\log P(X|\lambda_0)}{\log P(X|\lambda_i)} \right)^{1/(m-1)} \right]^{-1}. \tag{33}$$

Biernacki et al. [33] proposed an integrated completed likelihood (ICL) criterion for choosing a Gaussian mixture model and a relevant number of clusters. The ICL criterion is essentially the ordinary BIC penalized by the subtraction of the estimated mean entropy. Numerical experiments with simulated and real data showed that the ICL criterion seems to overcome the practical possible tendency of Bayesian information criterion (BIC) to overestimate the number of clusters.

Considering that a set of multivariate, real-valued time series is generated according to hidden Markov models, Oates et al. [55] presented a hybrid clustering method for automatically determining the k number of generating HMMs, and for learning the parameters of those HMMs. A standard hierarchical, agglomerative clustering algorithm was first applied to obtain an initial estimate of k and to form the initial clusters using dynamic time warping to assess the similarity. These initial clusters serve as the input to a process that trains one HMM on each cluster and iteratively moves time series between clusters based on their likelihoods given the various HMMs.

Li and Biswas [56] described a clustering methodology for temporal data using the hidden Markov model representation. The temporal data are assumed to have Markov property, and may be viewed as the result of a probabilistic walk along a fixed set of (not directly observable) states. The proposed continuous HMM clustering method can be summarized in terms of four levels of nested searches. From the outer most to the inner most levels, they are the search for (1) the number of clusters in a partition based on the partition mutual information (PMI) measure, (2) the structure for a given partition size according to the k-means or depth-first binary divisive clustering, (3) the HMM structure for each cluster that gives the highest marginal likelihood based on the BIC and the Cheeseman–Stutz approximation, and (4) the parameters for each HMM structure according to the segmental k-means procedure. For the second search level, the sequence-to-model likelihood distance measure was chosen for object-to-cluster assignments. The HMM refinement procedure for the third-level search starts with an initial model configuration and incrementally grows or shrinks the model through HMM state splitting and merging operations. They generated an artificial data set from three random generative models: one with three states, one with four states, and one with five states, and showed that their method could reconstruct the HMM with the correct model size and near perfect model parameter values. Li et al. [57] presented a Bayesian HMM clustering algorithm that uses BIC as the model selection criterion in levels 1 and 3 and exploits the monotonic characteristics of the BIC function to develop a sequential search strategy. The strategy starts with the simplest model, gradually increases the model size, and stops when the BIC score of the current model is less than that of the previous model. Experimental results using both artificially generated data and ecology data showed the effectiveness of the clustering methodology.

A framework was presented by Wang et al. [58] for tool wear monitoring in a machining process using discrete hidden Markov models. The feature vectors are extracted from the vibration signals measured during turning operations by wavelet analysis. The extracted feature vectors are then converted into a symbol sequence by vector quantization, which in turn is used as input for training the hidden Markov model by the expectation maximization approach.

Table 3 summarizes the major components used in each model-based clustering algorithm. Like feature-based methods, model-based methods are capable of handling series with unequal length as well through the modeling operation. For those methods that use log-likelihood as the distance measure, the model with the highest likelihood is concluded to be the cluster for the data being tested.

4. Discussion

Among all the papers surveyed the studies of Ramoni et al. [50,51] are the only two that assumed discrete-valued time series data. The work of Kumar et al. [23] is the only one that takes data error into account. Most studies address evenly sampled data while Möller-Levet et al. [22] are the only ones who consider unevenly sampled data. Note that some studies such as Maharaj [49] and Baragona [30] are restricted to stationary time series only whereas most others are not. None of the papers included in this survey handle multivariate time series data with different length for each variable.

Several studies including Košmelj and Batagelj [29] and Kumar et al. [23] made the assumption that the T samples of a time series are independent (come from independent distribution), ignoring the correlations in consecutive sample values in time. Modeling a time series by a (first order) Markov chain as done by Ramoni et al. [50,51] assumes that the probability of a variable at time t is dependent upon only the variable values at time t − 1 and independent of the variable values observed prior to time t − 1. The hidden Markov models provide a richer representation of time series, especially for systems where their real states are not observable and the observation is a probability function of the state. Note that both studies of using HMM models for the multidimensional case assumed that temporal features are independent [55,57]. In the case that time series data has a longer memory, higher orders of Markov chain or hidden Markov models should be considered.
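The first-order Markov chain representation discussed above can be illustrated with a short Python sketch. It estimates a transition probability matrix from a discrete-valued series and compares two such matrices by averaging a symmetrized Kullback–Leibler distance over corresponding rows, in the spirit of the similarity measure described for Ramoni et al. [50]. The function names, the pseudo-count smoothing, and the choice of symmetrizing by averaging the two directed divergences are assumptions made here for illustration only; they are not taken from the original papers.

```python
import math


def transition_matrix(series, n_states, pseudo_count=1.0):
    """Estimate first-order transition probabilities from a series of
    integer states in range(n_states).

    Assumption: a positive pseudo-count is added to every cell so that
    no row is all-zero and no probability is exactly zero.
    """
    counts = [[pseudo_count] * n_states for _ in range(n_states)]
    for prev, cur in zip(series, series[1:]):
        counts[prev][cur] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def kl(p, q):
    """Directed Kullback-Leibler divergence between two discrete
    distributions; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetrized_kl(p, q):
    """Assumed symmetrization: mean of the two directed divergences."""
    return 0.5 * (kl(p, q) + kl(q, p))


def markov_chain_distance(a, b):
    """Average symmetrized KL distance over corresponding rows."""
    return sum(symmetrized_kl(ra, rb) for ra, rb in zip(a, b)) / len(a)
```

Because each row of the matrix conditions only on the immediately preceding state, two series with the same one-step dynamics but different longer-range structure would be judged identical by this distance, which is the limitation the paragraph above attributes to the first-order assumption.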
Table 3
Summary of model-based time series clustering algorithms

| Paper | Variable | Model | Model output of interest | Distance measure | Clustering algorithm | Evaluation criterion | Application |
|---|---|---|---|---|---|---|---|
| Baragona [30] | Single and multiple | ARMA | Residuals | Cross-correlation based | Tabu search, GA, and simulated annealing | Specially designed | Non-specific |
| Beran and Mazzola [48] | Single | Hierarchical smoothing models | Coefficients | Unknown (most likely Euclidean) | Agglomerative hierarchical | N/A | Music performance |
| Biernacki et al. [33] | Multiple | Gaussian mixture | Parameters | Log-likelihood | EM algorithm | Log-likelihood | Non-specific |
| Kalpakis et al. [52] | Single | AR | LPC coefficients of AR coefficients | Euclidean | Partition around medoids | Cluster similarity metric and Silhouette width | Public data |
| Li and Biswas [56] | Multiple | Continuous HMM | HMM parameters | Log-likelihood | Four nested levels of search | Partition mutual information | Non-specific |
| Li et al. [57] | Multiple | Continuous HMM | HMM parameters | Log-likelihood | Four nested levels of search | Partition mutual information | Ecology |
| Maharaj [49] | Single | AR | Coefficients | P-value of hypothesis testing | Agglomerative hierarchical | N/A | Number of dwelling units financed |
| Oates et al. [55] | Multiple (discretized by SOM) | Discrete HMM | HMM parameters | Log-likelihood | Initialized by DTW followed by a fixed point operation | Log-likelihood | Robot sensor data |
| Piccolo [47] | Single | AR(∞) | Coefficients | Euclidean | Agglomerative hierarchical | N/A | Industrial production indices |
| Ramoni et al. [50] | Single discrete-valued | Markov chain | Transition probabilities | Symmetrized Kullback–Leibler distance | Agglomerative clustering | N/A | Robot sensor data |
| Ramoni et al. [51] | Multiple discrete-valued | Markov chain | Transition probabilities | Symmetrized Kullback–Leibler to guide search and posterior probability as a grouping criterion | Agglomerative clustering | Marginal likelihood of a partition | Robot sensor data |
| Tran and Wagner [54] | Single | Gaussian mixture | Cepstral coefficients | Log-likelihood | Modified fuzzy c-means | Within cluster variance | Speaker verification |
| Wang et al. [58] | Single (discretized by vector quantization from wavelet coefficients) | Discrete HMM | HMM parameters | Log-likelihood | EM learning | Log-likelihood | Tool condition monitoring |
| Xiong and Yeung [53] | Single | ARMA mixture | Coefficients | Log-likelihood | EM learning | Cluster similarity metric | Public data |
The majority of time series clustering studies are restricted to univariate time series. Among all the univariate time series clustering studies, most work only with the observed data with a few exceptions. One exception assumes that for all the observed time series there is a common explanatory time series [48] while some others assume that the observed time series are generated based on a known stimulus [21,36,42]. The studies that addressed multivariate (or vector) time series include Košmelj and Batagelj [29], Kakizawa et al. [24], Ramoni et al. [51], Oates et al. [55], Li et al. [57], etc. Some multivariate time series clustering studies assume that there is no cross-correlation between variables. Those based on the probability framework simplify the overall joint distribution by assuming conditional independence between variables [51].

In some cases, time series clustering could be complemented with a change-point detection algorithm in order to automatically and correctly identify the start times (or the origins) of the time series before they are matched and compared. This point was brought up by Shumway [26] in their clustering study of earthquakes and mining explosions at regional distances.

All in all, clustering of time series data differs from clustering of static feature data mainly in how to compute the similarity between two data objects. Depending upon the types and characteristics of time series data, different clustering studies provide different ways to compute the similarity/dissimilarity between two time series being compared. Once the similarity/dissimilarity of data objects is determined, many general-purpose clustering algorithms can be used to partition the objects, as reviewed in Section 2.1. Therefore, for any given time series clustering application, the key is to understand the unique characteristics of the subject data and then to design an appropriate similarity/dissimilarity measure accordingly.

We note that to date very few time series clustering studies made use of more recently developed clustering algorithms such as genetic algorithms (the work of Baragona [30] is the only exception). It might be interesting to investigate the performance of GA in time series clustering. There seems to be either no or insufficient justification why a particular approach was taken in the previous studies. We also note that there is a lack of studies to compare different time series clustering approaches. In the case where more than one approach is possible, it is desirable to know which approach is better.

It has been shown beneficial to integrate an unsupervised clustering algorithm with a supervised classification algorithm for static feature data [59]. There is no reason why the same cannot be true for time series data. An investigation on such integration is thus warranted. It also has been reported that an ensemble of classifiers could be more effective than an individual classifier [60,61]. Will an ensemble of clustering algorithms be as effective? It might be interesting to find out as well.

which most likely would create problems in attempting to compile the clustering results obtained by each clustering algorithm together and a solution must be found.

It should be noted that most scaled-up clustering algorithms developed to handle larger data sets consider only static data. Examples are clustering large applications (CLARA) proposed by Kaufman and Rousseeuw [3], clustering large applications based on randomized search (CLARANS) developed by Ng and Han [62], and pattern count-tree based clustering (PCBClu) described by Ananthanarayana et al. [63]. It is definitely desirable to develop scaled-up time series clustering algorithms as well in order to handle large data sets. In this regard, most efforts have been devoted to data representation using so called segmentation algorithms [64]. The work of Fu et al. [44] for finding the PIPs is a segmentation algorithm.

Finally, time series clustering is part of a recent survey of temporal knowledge discovery by Roddick et al. [65]. In that survey, they discussed only a few research works related to time series clustering in less than two pages. Our survey intends to provide a more detailed review of this expanding research area.

5. Concluding remarks

In this paper we surveyed most recent studies on the subject of time series clustering. These studies are organized into three major categories depending upon whether they work directly with the original data (either in the time or frequency domain), indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The basics of time series clustering, including the three key components of time series clustering studies, are highlighted in this survey: the clustering algorithm, the similarity/dissimilarity measure, and the evaluation criterion. The application areas are summarized with a brief description of the data used. The uniqueness and limitation of past studies, and some potential topics for future study are also discussed.

Appendix A. Previous applications and data used

The studies focusing on the development of new methods usually do not have a particular application in mind. For testing a new method and for comparing existing methods, the researchers normally either generate simulated data or rely on publicly accessible time series data depositories such as the UCR time series data mining archive [∼eamonn/TSDMA/index.html].

Other studies set out to investigate issues directly related to a particular application. As can be seen in the below sum-
A problem that an ensemble of classifiers does mary, clustering of time series data is necessary in widelynot have is the random labeling of the clustering results, different applications.