Exploring Application-Level Semantics




IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011, p. 95

Exploring Application-Level Semantics for Data Compression

Hsiao-Ping Tsai, Member, IEEE, De-Nian Yang, and Ming-Syan Chen, Fellow, IEEE

H.-P. Tsai is with the Department of Electrical Engineering (EE), National Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402, Taiwan, ROC. E-mail: hptsai@nchu.edu.tw.
D.-N. Yang is with the Institute of Information Science (IIS) and the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, ROC. E-mail: dnyang@iis.sinica.edu.tw.
M.-S. Chen is with the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, and the Department of Electrical Engineering (EE), Department of Computer Science and Information Engineering (CSIE), and Graduate Institute of Communication Engineering (GICE), National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, ROC. E-mail: mschen@cc.ee.ntu.edu.tw.

Manuscript received 22 Sept. 2008; revised 27 Feb. 2009; accepted 29 July 2009; published online 4 Feb. 2010. Recommended for acceptance by M. Ester. For information on obtaining reprints of this article, please send e-mail to tkde@computer.org, and reference IEEECS Log Number TKDE-2008-09-0497. Digital Object Identifier no. 10.1109/TKDE.2010.30.

Abstract—Natural phenomena show that many creatures form large social groups and move in regular patterns. However, previous works focus on finding the movement patterns of each single object or all objects. In this paper, we first propose an efficient distributed mining algorithm to jointly identify a group of moving objects and discover their movement patterns in wireless sensor networks. Afterward, we propose a compression algorithm, called 2P2D, which exploits the obtained group movement patterns to reduce the amount of delivered data. The compression algorithm includes a sequence merge and an entropy reduction phase. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal solution. Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results show that the proposed compression algorithm leverages the group movement patterns to reduce the amount of delivered data effectively and efficiently.

Index Terms—Data compression, distributed clustering, object tracking.

1 INTRODUCTION

Recent advances in location-acquisition technologies, such as global positioning systems (GPSs) and wireless sensor networks (WSNs), have fostered many novel applications like object tracking, environmental monitoring, and location-dependent services. These applications generate a large amount of location data and thus lead to transmission and storage challenges, especially in resource-constrained environments like WSNs. To reduce the data volume, various algorithms have been proposed for data compression and data aggregation [1], [2], [3], [4], [5], [6]. However, the above works do not address application-level semantics, such as the group relationships and movement patterns, in the location data.

In object tracking applications, many natural phenomena show that objects often exhibit some degree of regularity in their movements. For example, the famous annual wildebeest migration demonstrates that the movements of creatures are temporally and spatially correlated. Biologists have also found that many creatures, such as elephants, zebras, whales, and birds, form large social groups when migrating to find food, or for breeding or wintering. These characteristics indicate that the trajectory data of multiple objects may be correlated for biological applications. Moreover, some research domains, such as the study of animals' social behavior and wildlife migration [7], [8], are more concerned with the movement patterns of groups of animals than with individuals; hence, tracking each object is unnecessary in this case. This raises a new challenge of finding moving animals that belong to the same group and identifying their aggregated group movement patterns. Therefore, under the assumption that objects with similar movement patterns are regarded as a group, we define the moving object clustering problem as follows: given the movement trajectories of objects, partition the objects into nonoverlapped groups such that the number of groups is minimized. Group movement pattern discovery is then to find the most representative movement patterns of each group of objects, which are further utilized to compress location data.

Discovering the group movement patterns is more difficult than finding the patterns of a single object or of all objects, because we need to jointly identify a group of objects and discover their aggregated group movement patterns. The constrained resources of WSNs should also be considered in approaching the moving object clustering problem. However, few existing approaches consider these issues simultaneously. On the one hand, the temporal-and-spatial correlations in the movements of moving objects are modeled as sequential patterns in data mining to discover frequent movement patterns [9], [10], [11], [12]. However, sequential patterns 1) consider the characteristics of all objects, 2) lack information about a frequent pattern's significance regarding individual trajectories, and 3) carry no time information between consecutive items, which makes them unsuitable for location prediction and similarity comparison.
On the other hand, previous works such as [13], [14], [15] measure the similarity among entire trajectory sequences to group moving objects. Since objects may be close together in some types of terrain, such as gorges, and widely distributed in less rugged areas, their group relationships are distinct in some areas and vague in others. Thus, approaches that perform clustering among entire trajectories may not be able to identify the local group relationships. In addition, most of the above works are centralized algorithms [9], [10], [11], [12], [13], [14], [15], which need to collect all data at a server before processing. Thus, unnecessary and redundant data may be delivered, leading to much more power consumption, because data transmission needs more power than data processing in WSNs [5]. In [16], we proposed a clustering algorithm to find the group relationships for query and data aggregation efficiency. The differences between [16] and this work are as follows: First, since the clustering algorithm itself is a centralized algorithm, in this work, we further consider systematically combining multiple local clustering results into a consensus to improve the clustering quality and for use in the update-based tracking network. Second, when delay is tolerable in the tracking application, a new data management approach is required to offer transmission efficiency, which also motivates this study. We thus define the problem of compressing the location data of a group of moving objects as the group data compression problem.

Therefore, in this paper, we first introduce our distributed mining algorithm to approach the moving object clustering problem and discover group movement patterns. Then, based on the discovered group movement patterns, we propose a novel compression algorithm to tackle the group data compression problem. Our distributed mining algorithm comprises a Group Movement Pattern Mining (GMPMine) algorithm and a Cluster Ensembling (CE) algorithm. It avoids transmitting unnecessary and redundant data by transmitting only the local grouping results to a base station (the sink), instead of all of the moving objects' location data. Specifically, the GMPMine algorithm discovers the local group movement patterns by using a novel similarity measure, while the CE algorithm combines the local grouping results to remove inconsistency and improve the grouping quality by using information theory.

Different from previous compression techniques that remove redundancy of data according to the regularity within the data, we devise a novel two-phase and 2D algorithm, called 2P2D, which utilizes the discovered group movement patterns, shared by the transmitting node and the receiving node, to compress data. In addition to removing redundancy according to the correlations within the data of each single object, the 2P2D algorithm further leverages the correlations of multiple objects and their movement patterns to enhance compressibility. Specifically, the 2P2D algorithm comprises a sequence merge phase and an entropy reduction phase. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem to minimize the entropy of the merged data and propose a Replace algorithm to obtain the optimal solution. The Replace algorithm finds the optimal solution of the HIR problem based on Shannon's theorem [17] and guarantees the reduction of entropy, which is conventionally viewed as an optimization bound of compression performance. As a result, our approach reduces the amount of delivered data and, by extension, the energy consumption in WSNs.

Our contributions are threefold:

- Different from previous works, we formulate a moving object clustering problem that jointly identifies a group of objects and discovers their movement patterns. The application-level semantics are useful for various applications, such as data storage and transmission, task scheduling, and network construction.
- To approach the moving object clustering problem, we propose an efficient distributed mining algorithm to minimize the number of groups such that members in each of the discovered groups are highly related by their movement patterns.
- We propose a novel compression algorithm to compress the location data of a group of moving objects with or without loss of information. We formulate the HIR problem to minimize the entropy of location data and exploit Shannon's theorem to solve the HIR problem. We also prove that the proposed compression algorithm obtains the optimal solution of the HIR problem efficiently.

The remainder of the paper is organized as follows: In Section 2, we review related works. In Section 3, we provide an overview of the network, location, and movement models and formulate our problem. In Section 4, we describe the distributed mining algorithm. In Section 5, we formulate the compression problems and propose our compression algorithm. Section 6 details our experimental results. Finally, we summarize our conclusions in Section 7.

2 RELATED WORK

2.1 Movement Pattern Mining

Agrawal and Srikant [18] first defined the sequential pattern mining problem and proposed an Apriori-like algorithm to find frequent sequential patterns. Han et al. considered the pattern projection method in mining sequential patterns and proposed FreeSpan [19], an FP-growth-based algorithm. Yang and Hu [9] developed a new match measure for imprecise trajectory data and proposed TrajPattern to mine sequential patterns. Many variations derived from sequential patterns are used in various applications; e.g., Chen et al. [20] discover path traversal patterns in a Web environment, while Peng and Chen [21] mine user moving patterns incrementally in a mobile computing system. However, sequential patterns and variations like [20], [21] do not provide sufficient information for location prediction or clustering. First, they carry no time information between consecutive items, so they cannot provide accurate information for location prediction when time is concerned. Second, they consider the characteristics of all objects, which makes the meaningful movement characteristics of individual objects or of a group of moving objects inconspicuous and ignored. Third, because a sequential pattern lacks information about its significance regarding each individual trajectory, such patterns are not fully representative of individual trajectories.
To discover significant patterns for location prediction, Morzy mines frequent trajectories whose consecutive items are also adjacent in the original trajectory data [10], [22]. Meanwhile, Giannotti et al. [11] extract T-patterns from spatiotemporal data sets to provide concise descriptions of frequent movements, and Tseng and Lin [12] proposed the TMP-Mine algorithm for discovering temporal movement patterns. However, the above Apriori-like or FP-growth-based algorithms still focus on discovering the frequent patterns of all objects and may suffer from computing efficiency or memory problems, which makes them unsuitable for use in resource-constrained environments.

Fig. 1. (a) The hierarchical- and cluster-based network structure and the data flow of an update-based tracking network. (b) A flat view of a two-layer network structure with 16 clusters.

2.2 Clustering

Recently, clustering based on objects' movement behavior has attracted increasing attention. Wang et al. [14] transform the location sequences into transaction-like data on users, based on which a valid group is obtained, but the proposed AGP and VG-growth are still Apriori-like or FP-growth-based algorithms that suffer from high computing cost and memory demand. Nanni and Pedreschi [15] proposed a density-based clustering algorithm, which makes use of an optimal time interval and the average euclidean distance between each point of two trajectories, to approach the trajectory clustering problem. However, the above works discover global group relationships based on the proportion of time a group of users stays close together relative to the whole time duration, or on the average euclidean distance of the entire trajectories. Thus, they may not be able to reveal the local group relationships, which are required for many applications. In addition, though computing the average euclidean distance of two geometric trajectories is simple and useful, the geometric coordinates are expensive and not always available. Approaches such as EDR, LCSS, and DTW are widely used to compute the similarity of symbolic trajectory sequences [13], but these dynamic programming approaches suffer from a scalability problem [23]. To provide scalability, approximation or summarization techniques are used to represent the original data. Guralnik and Karypis [23] project each sequence into a vector space of sequential patterns and use a vector-based K-means algorithm to cluster objects. However, the importance of a sequential pattern regarding individual sequences can be very different, which is not considered in that work. To cluster sequences, Yang and Wang proposed CLUSEQ [24], which iteratively fits a sequence to a learned model, yet the generated clusters may overlap, which differentiates their problem from ours.

2.3 Data Compression

Data compression can reduce the storage and energy consumption of resource-constrained applications. In [1], distributed source (Slepian-Wolf) coding uses joint entropy to encode two nodes' data individually without sharing any data between them; however, it requires prior knowledge of the cross correlations of the sources. Other works, such as [2], [4], combine data compression with routing by exploiting cross correlations between sensor nodes to reduce the data size. In [5], a tailed LZW has been proposed to address the memory constraint of a sensor device. Summarization of the original data by regression or linear modeling has been proposed for trajectory data compression [3], [6]. However, the above works do not address application-level semantics in the data, such as the correlations of a group of moving objects, which we exploit to enhance compressibility.

3 PRELIMINARIES

3.1 Network and Location Models

Many researchers believe that a hierarchical architecture provides better coverage and scalability, and also extends the network lifetime of WSNs [25], [26]. In a hierarchical WSN, such as that proposed in [27], the energy, computing, and storage capacities of sensor nodes are heterogeneous. A high-end sophisticated sensor node, such as an Intel Stargate [28], is assigned as a cluster head (CH) to perform high-complexity tasks, while a resource-constrained sensor node, such as a Mica2 mote [29], performs the sensing and low-complexity tasks. In this work, we adopt a hierarchical network structure with K layers, as shown in Fig. 1a, where sensor nodes are clustered at each level and collaboratively gather or relay remote information to a base station called a sink. A sensor cluster is a mesh network of n × n sensor nodes handled by a CH, and the nodes communicate with each other by using multihop routing [30]. We assume that each node in a sensor cluster has a locally unique ID and denote the sensor IDs by an alphabet Σ. Fig. 1b shows an example of a two-layer tracking network, where each sensor cluster contains 16 nodes identified by Σ = {a, b, ..., p}.

In this work, an object is defined as a target, such as an animal or a bird, that is recognizable and trackable by the tracking network. To represent the location of an object, geometric models and symbolic models are widely used [31]. A geometric location denotes precise two-dimensional or three-dimensional coordinates, while a symbolic location represents an area, such as the sensing area of a sensor node or a cluster of sensor nodes, defined by the application. Since the accurate geometric location is not easy to obtain, and techniques like the Received Signal Strength (RSS) [32] can simply estimate an object's location based on the ID of the sensor node with the strongest signal, we employ a symbolic model and describe the location of an object by the ID of a nearby sensor node.

Object tracking is defined as the task of detecting a moving object's location and reporting the location data to the sink periodically at a given time interval.
Hence, an observation on an object is defined by the obtained location data. We assume that sensor nodes wake up periodically to detect objects. When a sensor node wakes up on its duty cycle and detects an object of interest, it transmits the location data of the object to its CH. Here, the location data include a timestamp, the ID of the object, and its location. We also assume that the targeted applications are delay-tolerant. Thus, instead of forwarding the data upward immediately, the CH compresses the data accumulated over a batch period and sends it to the CH of the upper layer. The process is repeated until the sink receives the location data. Consequently, the trajectory of a moving object is modeled as a series of observations and expressed by a location sequence, i.e., a sequence of sensor IDs visited by the object. We denote a location sequence by S = s0 s1 ... s_{L-1}, where each item s_i is a symbol in Σ and L is the sequence length. An example of an object's trajectory and the obtained location sequence is shown on the left of Fig. 2. The tracking network tracks moving objects for a period and generates a location sequence data set, based on which we discover the group relationships of the moving objects.

Fig. 2. An example of an object's moving trajectory, the obtained location sequence, and the generated PST T.

3.2 Variable Length Markov Model (VMM) and Probabilistic Suffix Tree (PST)

If the movements of an object are regular, the object's next location can be predicted based on its preceding locations. We model this regularity by using the Variable Length Markov Model (VMM). Under the VMM, an object's movement is expressed by a conditional probability distribution over Σ. Let s denote a pattern, i.e., a subsequence of a location sequence S, and let σ denote a symbol in Σ. The conditional probability P(σ|s) is the probability that σ will follow s in S. Since the length of s is floating, the VMM provides the flexibility to adapt to the variable lengths of movement patterns. Note that when a pattern s occurs more frequently, it carries more information about the movements of the object and is thus more desirable for the purpose of prediction. To find the informative patterns, we define a pattern as a significant movement pattern if its occurrence probability is above a minimal threshold.

To learn the significant movement patterns, we adapt the Probabilistic Suffix Tree (PST) [33], because it has the lowest storage requirement among many VMM implementations [34]. The PST's low complexity, i.e., O(n) in both time and space [35], also makes it attractive, especially for streaming or resource-constrained environments [36]. The PST building algorithm(1) learns from a location sequence and generates a PST whose height is limited by a specified parameter Lmax. Each node of the tree represents a significant movement pattern s whose occurrence probability is above a specified minimal threshold Pmin. It also carries the conditional empirical probabilities P(σ|s) for each σ in Σ, which we use in location prediction. Fig. 2 shows an example of a location sequence and the corresponding PST. Node "f" is one of the children of the root node and represents a significant movement pattern "f" whose occurrence probability is above 0.01. The conditional empirical probabilities are P('b'|"f") = 0.33, P('e'|"f") = 0.67, and P(σ|"f") = 0 for the other σ in Σ.

A PST is frequently used to predict the occurrence probability of a given sequence, which provides important information for similarity comparison. The occurrence probability of a sequence s regarding a PST T, denoted by P^T(s), is the prediction of the occurrence probability of s based on T. For example, the occurrence probability P^T("nokjfb") is computed as follows:

P^T("nokjfb") = P^T("n") × P^T('o'|"n") × P^T('k'|"no") × P^T('j'|"nok") × P^T('f'|"nokj") × P^T('b'|"nokjf")
             = P^T("n") × P^T('o'|"n") × P^T('k'|"o") × P^T('j'|"k") × P^T('f'|"j") × P^T('b'|"okjf")
             = 0.05 × 1 × 1 × 1 × 1 × 0.33 = 0.0165.

The PST is also useful and efficient for predicting the next item of a sequence. For a given sequence s and a PST T, our predict_next algorithm, shown in Fig. 3, outputs the most probable next item, denoted by predict_next(T, s). We demonstrate it with an example: Given s = "nokjf" and the T shown in Fig. 2, the predict_next algorithm traverses the tree to the deepest node node_okjf along the path including node_root, node_f, node_jf, node_kjf, and node_okjf. Finally, the symbol 'e', which has the highest conditional empirical probability in node_okjf, is returned, i.e., predict_next(T, "nokjf") = 'e'. The algorithm's computational overhead is limited by the height of the PST, so it is suitable for sensor nodes.

Fig. 3. The predict_next algorithm.

(1) The PST building algorithm is given in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
3.3 Problem Description

We formulate the problem of this paper as exploring group movement patterns to compress the location sequences of a group of moving objects for transmission efficiency. Consider a set of moving objects O = {o1, o2, ..., on} and their associated location sequence data set S = {S1, S2, ..., Sn}.

Definition 1. Two objects are similar to each other if their movement patterns are similar. Given the similarity measure function simp(2) and a minimal threshold simmin, oi and oj are similar if their similarity score simp(oi, oj) is above simmin, i.e., simp(oi, oj) ≥ simmin. The set of objects that are similar to oi is denoted by so(oi) = {oj | oj ∈ O, simp(oi, oj) ≥ simmin}.

Definition 2. A set of objects is recognized as a group if its members are highly similar to one another. Let g denote a set of objects. g is a group if every object in g is similar to at least a threshold fraction of the objects in g, i.e., for all oi ∈ g, |so(oi) ∩ g| / |g| ≥ θ, where θ has the default value 1/2.(3)

We formally define the moving object clustering problem as follows: Given a set of moving objects O together with their associated location sequence data set S and a minimal similarity threshold simmin, the moving object clustering problem is to partition O into nonoverlapped groups, denoted by G = {g1, g2, ..., gi}, such that the number of groups is minimized, i.e., |G| is minimal. Thereafter, with the discovered group information and the obtained group movement patterns, the group data compression problem is to compress the location sequences of a group of moving objects for transmission efficiency. Specifically, we formulate the group data compression problem as a merge problem and an HIR problem. The merge problem is to combine multiple location sequences to reduce the overall sequence length, while the HIR problem aims to minimize the entropy of a sequence such that the amount of data is reduced with or without loss of information.

(2) simp is defined in Section 4.1.
(3) In this work, we set the threshold θ to 1/2, as suggested in [37], and leave the similarity threshold as the major control parameter, because training the similarity threshold of two objects is easier than training that of a group of objects.
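The HIR objective—lowering entropy to improve compressibility—can be made concrete with a short, self-contained sketch. The following computes the empirical Shannon entropy of a symbolic location sequence; it illustrates the entropy bound on code length per symbol, not the paper's Replace algorithm.

```python
from collections import Counter
from math import log2

def empirical_entropy(sequence):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A regular trajectory has low entropy and hence compresses well;
# an erratic one approaches log2(|alphabet|) bits per symbol.
print(round(empirical_entropy("abababababab"), 3))  # 1.0
print(round(empirical_entropy("abcdefghijkl"), 3))  # 3.585
```

By Shannon's source coding theorem, no lossless code can average fewer bits per symbol than this entropy, which is why replacing hit items to lower the entropy of the merged sequence directly improves the achievable compression ratio.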
4 MINING OF GROUP MOVEMENT PATTERNS

To tackle the moving object clustering problem, we propose a distributed mining algorithm that comprises the GMPMine and CE algorithms. First, the GMPMine algorithm uses a PST to generate an object's significant movement patterns and computes the similarity of two objects by using simp to derive the local grouping results. The merits of simp include its accuracy and efficiency. First, simp considers the significance of each movement pattern regarding individual objects, so it achieves better accuracy in similarity comparison. Since a PST can be used to predict a pattern's occurrence probability, which is viewed as the significance of the pattern regarding the PST, simp includes the movement patterns' predicted occurrence probabilities to provide fine-grained similarity comparison. Second, simp can offer seamless and efficient comparison for applications with evolving similarity relationships, because simp can compare the similarity of two data streams only on the changed mature nodes of emission trees [36], instead of on all nodes.

To combine multiple local grouping results into a consensus, the CE algorithm utilizes the Jaccard similarity coefficient to measure the similarity between a pair of objects, and normalized mutual information (NMI) to derive the final ensembling result. It trades off the grouping quality against the computation cost by adjusting a partition parameter. In contrast to approaches that perform clustering among the entire trajectories, the distributed algorithm discovers the group relationships in a distributed manner on sensor nodes. As a result, we can discover group movement patterns to compress the location data in areas where objects have explicit group relationships. Besides, the distributed design provides the flexibility to take partial local grouping results into the ensembling when only the group relationships of moving objects in a specified subregion are of interest. It is also especially suitable for heterogeneous tracking configurations, which helps reduce the tracking cost; e.g., instead of waking up all sensors at the same frequency, a fine-grained tracking interval is specified for part of the terrain in the migration season to reduce the energy consumption. Rather than deploying the sensors at the same density, they are highly concentrated only in areas of interest to reduce deployment costs.

4.1 The Group Movement Pattern Mining (GMPMine) Algorithm

To provide better discrimination accuracy, we propose a new similarity measure simp to compare the similarity of two objects. For each of their significant movement patterns, the new similarity measure considers not merely two probability distributions but also two weight factors, i.e., the significance of the pattern regarding each PST. The similarity score simp of oi and oj, based on their respective PSTs Ti and Tj, is defined as follows:

    simp(oi, oj) = −log( Σ_{s ∈ S~} d(s) / √(2^(Lmax+2)) ),    (1)

where S~ denotes the union of the significant patterns (node strings) of the two trees, and d(s) is the euclidean distance associated with a pattern s over Ti and Tj:

    d(s) = √( Σ_{σ ∈ Σ} (P^{Ti}(sσ) − P^{Tj}(sσ))² )
         = √( Σ_{σ ∈ Σ} (P^{Ti}(s) × P^{Ti}(σ|s) − P^{Tj}(s) × P^{Tj}(σ|s))² ).

For a pattern s ∈ T, P^T(s) is a significant value because the occurrence probability of s is higher than the minimal support Pmin. If oi and oj share the pattern s, we have s ∈ Ti and s ∈ Tj, so P^{Ti}(s) and P^{Tj}(s) are nonnegligible and meaningful in the similarity comparison. Because the conditional empirical probabilities are also part of a pattern, we consider the conditional empirical probabilities P^T(σ|s) when calculating the distance between two PSTs. Therefore, we sum d(s) over all s ∈ S~ as the distance between two PSTs. Note that the distance between two PSTs is normalized by its maximal value, i.e., √(2^(Lmax+2)). We take the negative log of the normalized distance as the similarity score, so that a larger similarity score implies a stronger similarity relationship, and vice versa.
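Equation (1) can be prototyped directly. In the sketch below, a PST is flattened to a dict that maps each significant pattern s to the pair (P^T(s), {σ: P^T(σ|s)}); this representation, the toy probability values, and the natural-log base are assumptions for illustration, not the authors' implementation.

```python
from math import log, sqrt

def pattern_distance(T1, T2, s):
    """d(s): euclidean distance between the sigma-extensions of s."""
    alphabet = set()
    for T in (T1, T2):
        if s in T:
            alphabet |= set(T[s][1])
    total = 0.0
    for a in alphabet:
        # P^T(s * a) = P^T(s) * P^T(a | s); 0 if s is not significant in T.
        p1 = T1[s][0] * T1[s][1].get(a, 0.0) if s in T1 else 0.0
        p2 = T2[s][0] * T2[s][1].get(a, 0.0) if s in T2 else 0.0
        total += (p1 - p2) ** 2
    return sqrt(total)

def simp(T1, T2, l_max):
    """-log of the summed pattern distances, normalized by the maximal
    value sqrt(2^(l_max + 2)). Identical trees would need special
    handling (distance 0 gives an infinite score)."""
    patterns = set(T1) | set(T2)
    dist = sum(pattern_distance(T1, T2, s) for s in patterns)
    return -log(dist / sqrt(2 ** (l_max + 2)))

# Two toy PSTs sharing one significant pattern "a".
Ti = {"a": (0.5, {"b": 1.0})}
Tj = {"a": (0.5, {"b": 0.5, "c": 0.5})}
Tk = {"a": (0.5, {"b": 0.9, "c": 0.1})}   # closer to Ti than Tj is
print(simp(Ti, Tk, 1) > simp(Ti, Tj, 1))  # True: larger score = more similar
```

The design choice to weight each conditional distribution by the pattern's own occurrence probability is what makes a pattern that is significant in one tree but absent from the other contribute a large distance.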
With this definition of the similarity score, two objects are similar to each other if their score is above a specified similarity threshold.

The GMPMine algorithm includes four steps. First, we extract the movement patterns from the location sequences by learning a PST for each object. Second, the algorithm constructs an undirected, unweighted similarity graph in which similar objects share an edge. We model the density of a group relationship by the connectivity of a subgraph, defined as the minimal cut of the subgraph: when the ratio of the connectivity to the size of the subgraph is higher than a threshold, the objects corresponding to the subgraph are identified as a group. Third, since the optimization of the graph partition problem is intractable in general, we bisect the similarity graph by leveraging the HCS clustering algorithm [37] to partition the graph and derive the local group information. Finally, we select a group PST Tg for each group in order to conserve memory space, by using the formula Tg = argmax_{T ∈ T~} Σ_{s ∈ S~} P^T(s), where S~ denotes the sequences of a group of objects and T~ denotes their PSTs.
Finally, we select a group PST algorithm with an example in Appendix C, which can beTg for each group in order to conserve the memory space by found on the Computer Society Digital Library at http:// P doi.ieeecomputersociety.org/10.1109/TKDE.2010.30. In theusing the formula expressed as T g ¼ argmaxT 2T s2S P T ðsÞ, t.c om next section, we propose our compression algorithm that omwhere S denotes sequences of a group of objects and T leverages the obtained group movement patterns. po t.cdenotes their PSTs. Due to the page limit, we give anillustrative example of the GMPMine algorithm in Appen- gs po lo s .b ogdix B, which can be found on the Computer Society Digital 5 DESIGN OF A COMPRESSION ALGORITHM WITH ts .blLibrary at http://doi.ieeecomputersociety.org/10.1109/ GROUP MOVEMENT PATTERNS ec ts oj c pr ojeTKDE.2010.30. A WSN is composed of a large number of miniature sensor re r lo rep4.2 The Cluster Ensembling (CE) Algorithm nodes that are deployed in a remote area for various xp lo applications, such as environmental monitoring or wildlifeIn the previous section, each CH collects location data and ee xp tracking. These sensor nodes are usually battery-powered .ie eegenerates local group results with the proposed GMPMine and recharging a large number of them is difficult. There- w ealgorithm. Since the objects may pass through only partial w .i fore, energy conservation is paramount among all design w wsensor clusters and have different movement patterns in :// w issues in WSNs [41], [42]. Because the target objects are tp //wdifferent clusters, the local grouping results may be moving, conserving energy in WSNs for tracking moving ht ttp:inconsistent. For example, if objects in a sensor cluster walk objects is more difficult than in WSNs that monitor immobile hclose together across a canyon, it is reasonable to consider phenomena, such as humidity or vibrations. Hence, pre-them a group. 
In contrast, objects scattered in grassland vious works, such as [43], [44], [45], [46], [47], [48], especiallymay not be identified as a group. Furthermore, in the case consider movement characteristics of moving objects in theirwhere a group of objects moves across the margin of a designs to track objects efficiently. On the other hand, sincesensor cluster, it is difficult to find their group relationships. transmission of data is one of the most energy expensiveTherefore, we propose the CE algorithm to combine tasks in WSNs, data compression is utilized to reduce themultiple local grouping results. The algorithm solves the amount of delivered data [1], [2], [3], [4], [5], [6], [49], [50],inconsistency problem and improves the grouping quality. [51], [52]. Nevertheless, few of the above works have The ensembling problem involves finding the partition of addressed the application-level semantics, i.e., the tempor-all moving objects O that contains the most information about al-and-spatial correlations of a group of moving objects.the local grouping results. We utilize NMI [38], [39] to Therefore, to reduce the amount of delivered data, weevaluate the grouping quality. Let C denote the ensemble propose the 2P2D algorithm which leverages the groupof the local grouping results, represented as C ¼ fG0 ; movement patterns derived in Section 4 to compress the location sequences of moving objects elaborately. As shownG1 ; . . . ; GK g, where K denotes the ensemble size. Our goal in Fig. 4, the algorithm includes the sequence merge phaseis to discover the ensembling result G0 that contains the most and the entropy reduction phase to compress location Pinformation about C, i.e., G0 ¼ argmax e K NMIðGi ; GÞ, i¼1 sequences vertically and horizontally. In the sequence G2G ewhere G denotes all possible ensembling results. 
merge phase, we propose the Merge algorithm to compress e However, enumerating every G 2 G in order to find the the location sequences of a group of moving objects. Since 0optimal ensembling result G is impractical, especially in the objects with similar movement patterns are identified as aresource-constrained environments. To overcome this diffi- group, their location sequences are similar. The Mergeculty, the CE algorithm trades off the grouping quality algorithm avoids redundant sending of their locations, andagainst the computation cost by adjusting the partition thus, reduces the overall sequence length. It combines theparameter D, i.e., a set of thresholds with values in the range sequences of a group of moving objects by 1) trimming
Therefore, the algorithm trims and prunes more items when the group size is larger and the group relationships are more distinct. Besides, in the case that only the location center of a group of objects is of interest, our approach can find the aggregated value in this phase, instead of transmitting all location sequences back to the sink for postprocessing.

In the entropy reduction phase, we propose the Replace algorithm, which utilizes the group movement patterns as the prediction model to further compress the merged sequence. The Replace algorithm guarantees the reduction of a sequence's entropy and, consequently, improves compressibility without loss of information. Specifically, we define a new problem of minimizing the entropy of a sequence as the Hit Item Replacement (HIR) problem. To reduce the entropy of a location sequence, we explore Shannon's theorem [17] to derive three replacement rules, based on which the Replace algorithm reduces the entropy efficiently. Also, we prove that the Replace algorithm obtains the optimal solution of the HIR problem in Theorem 1. In addition, since objects may enter and leave a sensor cluster multiple times during a batch period, and a group of objects may enter and leave a cluster at slightly different times, we discuss the segmentation and alignment problems in Section 5.3. Table 1 summarizes the notations.

5.1 Sequence Merge Phase

In the application of tracking wild animals, multiple moving objects may have group relationships and share similar trajectories. In this case, transmitting their location data separately leads to redundancy. Therefore, in this section, we concentrate on the problem of compressing multiple similar sequences of a group of moving objects.

Consider the illustrative example in Fig. 5a, where three location sequences S_0, S_1, and S_2 represent the trajectories of a group of three moving objects. Items with the same index belong to a column, and a column containing identical symbols is called an S-column; otherwise, the column is called a D-column. Since sending the items of an S-column individually causes redundancy, our basic idea of compressing multiple sequences is to trim the items of an S-column into a single symbol. Specifically, given a group of n sequences, the items of an S-column are replaced by a single symbol, whereas the items of a D-column are wrapped between two '/' symbols. Finally, our algorithm generates a merged sequence containing the same information as the original sequences. In decompressing the merged sequence, when the symbol '/' is encountered, the items after it are output until the next '/' symbol; otherwise, each item is repeated n times to regenerate the original sequences. Fig. 5b shows the merged sequence S'', whose length is decreased from 60 items to 48 items, i.e., 12 items are conserved. The example points out that our approach can reduce the amount of data without loss of information. Moreover, when there are more S-columns, our approach brings more benefit.

Fig. 5. An example of the Merge algorithm. (a) Three sequences with high similarity. (b) The merged sequence S''.

When a little loss in accuracy is tolerable, representing the items of a D-column by a qualified symbol to generate more S-columns can improve the compressibility. We regulate the accuracy by an error bound, defined as the maximal hop count between the real and reported locations of an object. For example, replacing items 2 and 6 of S_2 by 'g' and 'p', respectively, creates two more S-columns and thus results in a shorter merged sequence with 42 items. To select a representative symbol for a D-column, we include a selection criterion that minimizes the average deviation between the real locations and the reported locations for a group of objects at each time interval, as follows:

Selection criterion. The maximal distance between the reported location and the real location is below a specified error bound eb, i.e., when the ith column is a D-column, |S''[i] − S_j[i]| ≤ eb must hold for 0 ≤ j < n, where n is the number of sequences.

TABLE 1. Description of the Notations.
Fig. 6. The Merge algorithm.

Fig. 7. An example of the Merge algorithm: the merged sequence with eb = 1.

Therefore, to compress the location sequences for a group of moving objects, we propose the Merge algorithm shown in Fig. 6. The input of the algorithm contains a group of sequences {S_i | 0 ≤ i < n} and an error bound eb, while the output is a merged sequence that represents the group of sequences. Specifically, the Merge algorithm processes the sequences in a columnwise way. For each column, the algorithm first checks whether it is an S-column. For an S-column, it retains the value of the items (Lines 4-5). Otherwise, if an error bound eb > 0 is specified, a representative symbol is selected according to the selection criterion (Line 7). If a qualified symbol exists to represent the column, the algorithm outputs it (Lines 15-18); otherwise, the items of the column are retained and wrapped by a pair of '/' symbols (Lines 9-13). The process repeats until all columns are examined. Afterward, the merged sequence S'' is generated.

Following the example shown in Fig. 5a, with a specified error bound eb of 1, the Merge algorithm generates the solution shown in Fig. 7; the merged sequence contains only 20 items, i.e., 40 items are curtailed.

5.2 Entropy Reduction Phase

In the entropy reduction phase, we propose the Replace algorithm to minimize the entropy of the merged sequence obtained in the sequence merge phase. Since data with lower entropy require fewer bits for storage and transmission [17], we replace some items to reduce the entropy without loss of information. The object movement patterns discovered by our distributed mining algorithm enable us to find the replaceable items and facilitate the selection of items in our compression algorithm. In this section, we first introduce and define the HIR problem, and then explore the properties of Shannon's entropy to solve it. We extend the concentration property for entropy reduction and discuss the benefits of replacing multiple symbols simultaneously. We derive three replacement rules for the HIR problem and prove that the entropy of the obtained solution is minimized.

5.2.1 Three Derived Replacement Rules

Let P = {p_0, p_1, ..., p_{|Σ|−1}} denote the probability distribution of the alphabet Σ corresponding to a location sequence S, where p_i is the occurrence probability of σ_i in Σ, defined as p_i = |{s_j | s_j = σ_i, 0 ≤ j < L}| / L, and L = |S|. According to Shannon's source coding theorem, the optimal code length for symbol σ_i is −log2 p_i, which is also called the information content of σ_i. Information entropy (or Shannon's entropy) is thereby defined as the overall sum of the information content over Σ:

e(S) = e(p_0, p_1, ..., p_{|Σ|−1}) = Σ_{0 ≤ i < |Σ|} −p_i × log2 p_i.   (2)

Shannon's entropy represents the optimal average code length in data compression, where the length of a symbol's codeword is proportional to its information content [17].

A property of Shannon's entropy is that the entropy is maximal when all probabilities are of the same value. Thus, without considering the regularity in the movements of objects, the occurrence probabilities of symbols are assumed to be uniform, such that the entropy of a location sequence, as well as the compression ratio, is fixed and depends on the size of a sensor cluster; e.g., for a location sequence of length D collected in a sensor cluster of 16 nodes, the entropy of the sequence is e(1/16, 1/16, ..., 1/16) = 4. Consequently, 4D bits are needed to represent the location sequence. Nevertheless, since the movements of a moving object exhibit some regularity, the occurrence probabilities of symbols are probably skewed and the entropy is lower. For example, if two of the probabilities become 1/32 and 3/32, the entropy is reduced to 3.97. Thus, instead of encoding the sequence using 4 bits per symbol, only 3.97D bits are required for the same information. This example points out that when a moving object exhibits some degree of regularity in its movements, the skewness of these probabilities lowers the entropy. Seeing that data with lower entropy require fewer bits to represent the same information, reducing the entropy benefits data compression and, by extension, storage and transmission.

Motivated by the above observation, we design the Replace algorithm to reduce the entropy of a location sequence. Our algorithm imposes hit symbols on the location sequence to increase the skewness. Specifically, the algorithm uses the group movement patterns, built in both the transmitter (CH) and the receiver (sink), as the prediction model to decide whether an item of a sequence is predictable. A CH replaces each of the predictable items with a hit symbol to reduce the location sequence's entropy when compressing it. After receiving the compressed sequence, the sink node decompresses it and substitutes every hit symbol with the original symbol by using the identical prediction model, so no information loss occurs.

Here, an item s_i of a sequence S is a predictable item if predict_next(T, S[0..i−1]) has the same value as s_i. A symbol is a predictable symbol once an item of the symbol is predictable. For ease of explanation, we use a taglst⁴ to express whether each item of S is predictable in the following sections.

4. A taglst associated with a sequence S is a sequence of 0s and 1s obtained as follows: for the predictable items in S, the corresponding items in taglst are set to 1; otherwise, their values in taglst are 0.
Fig. 8. An example of the simple method. The first row represents the index, and the second row is the location sequence. The third row is the taglst, which represents the prediction results. The last row is the result of the simple approach. Note that items 1, 2, 7, 9, 11, 12, 15, and 22 are predictable items, such that the set of predictable symbols is ŝ = {k, a, j, f, o}. In the example, items 2, 9, and 10 are items of 'o', such that the number of items of symbol 'o' is n('o') = 3, whereas items 2 and 9 are the predictable items of 'o', and the number of predictable items of symbol 'o' is n_hit('o') = 2.

Fig. 9. Problems of the simple approach.

Consider the illustrative example shown in Fig. 8: the probability distribution of Σ corresponding to S is P = {0.04, 0.16, 0.04, 0, 0, 0.08, 0.04, 0, 0.04, 0.12, 0.24, 0, 0, 0.12, 0.12, 0}, and items 1, 2, 7, 9, 11, 12, 15, and 22 are predictable items. To reduce the entropy of the sequence, a simple approach is to replace each of the predictable items with the hit symbol and obtain an intermediate sequence S' = "n::kfbb:n:o::ij:bbnkkk:gc". Compared with the original sequence S, whose entropy is 3.053, the entropy of S' is reduced to 2.854. Encoding S and S' with the Huffman coding technique, the lengths of the output bit streams are 77 and 73 bits, respectively, i.e., 4 bits are conserved by the simple approach.

However, the above simple approach does not always minimize the entropy. Consider the example shown in Fig. 9a: an intermediate sequence with items 1 and 19 unreplaced has lower entropy than the one generated by the simple approach. For the example shown in Fig. 9b, the simple approach even increases the entropy from 2.883 to 2.963.

We define the above problem as the HIR problem and formulate it as follows:

Definition 3 (HIR problem). Given a sequence S = {s_i | s_i ∈ Σ, 0 ≤ i < L} and a taglst, an intermediate sequence is a generation of S, denoted by S' = {s'_i | 0 ≤ i < L}, where s'_i is equal to s_i if taglst[i] = 0; otherwise, s'_i is equal to s_i or ':'. The HIR problem is to find the intermediate sequence S' such that the entropy of S' is minimal over all possible intermediate sequences.

A brute-force method for the HIR problem is to enumerate all possible intermediate sequences to find the optimal solution. However, this brute-force approach is not scalable, especially when the number of predictable items is large. Therefore, to solve the HIR problem, we explore the properties of Shannon's entropy to derive three replacement rules that our Replace algorithm leverages to obtain the optimal solution. Here, we list the five most relevant properties to explain the replacement rules. The first four properties can be obtained from [17], [53], [54], while the fifth, called the strong concentration property, is derived and proved in this paper.⁵

5. Due to the page limit, the proofs of the strong concentration rule and Lemmas 1-4 are given in Appendices C and D, respectively, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.

Property 1 (Expansibility). Adding a probability with a value of zero does not change the entropy, i.e., e(p_0, p_1, ..., p_{|Σ|−1}, p_{|Σ|}) is identical to e(p_0, p_1, ..., p_{|Σ|−1}) when p_{|Σ|} is zero.

According to Property 1, we add a new symbol ':' to Σ without affecting the entropy and denote its probability as p_16, such that P is {p_0, p_1, ..., p_15, p_16 = 0}.

Property 2 (Symmetry). Any permutation of the probability values does not change the entropy. For example, e(0.1, 0.4, 0.5) is identical to e(0.4, 0.5, 0.1).

Property 3 (Accumulation). Moving all the value from one probability to another, such that the former can be thought of as being eliminated, does not increase the entropy, i.e., e(p_0, p_1, ..., 0, ..., p_i + p_j, ..., p_{|Σ|−1}) is equal to or less than e(p_0, p_1, ..., p_i, ..., p_j, ..., p_{|Σ|−1}).

With Properties 2 and 3, if all the items of a symbol σ are predictable, i.e., n(σ) = n_hit(σ), replacing all the items of σ by ':' will not affect the entropy. If there are multiple symbols having n(σ) = n_hit(σ), replacing all the items of these symbols can reduce the entropy. Thus, we derive the first replacement rule, the accumulation rule: replace all items of every symbol σ in ŝ for which n(σ) = n_hit(σ).

Property 4 (Concentration). For two probabilities, moving a value from the lower probability to the higher probability decreases the entropy, i.e., if p_i < p_j, for 0 < Δp ≤ p_i, e(p_0, p_1, ..., p_i − Δp, ..., p_j + Δp, ..., p_{|Σ|−1}) is less than e(p_0, p_1, ..., p_i, ..., p_j, ..., p_{|Σ|−1}).

Property 5 (Strong concentration). For two probabilities, moving a value that is larger than the difference of the two probabilities from the higher probability to the lower one decreases the entropy, i.e., if p_i > p_j, for p_i − p_j < Δp ≤ p_i,
e(p_0, p_1, ..., p_i − Δp, ..., p_j + Δp, ..., p_{|Σ|−1}) is less than e(p_0, p_1, ..., p_i, ..., p_j, ..., p_{|Σ|−1}).

According to Properties 4 and 5, we conclude that if the difference of two probabilities increases, the entropy decreases. When the conditions conform to Properties 4 and 5, we further prove that the entropy is monotonically reduced as Δp increases, such that replacing all predictable items of a qualified symbol reduces the entropy the most, as shown in Lemmas 1 and 2.

Definition 4 (Concentration function). For a probability distribution P = {p_0, p_1, ..., p_{|Σ|−1}}, we define the concentration function of k probabilities p_{i_1}, p_{i_2}, ..., p_{i_k}, and p_j as f_{i_1 i_2 ... i_k j}(x_1, x_2, ..., x_k) = e(p_0, ..., p_{i_1} − x_1, ..., p_{i_2} − x_2, ..., p_{i_k} − x_k, ..., p_j + Σ_{i=1}^{k} x_i, ..., p_{|Σ|−1}).

Lemma 1. If p_i < p_j, for any x such that 0 ≤ x ≤ p_i, the concentration function of p_i and p_j is monotonically decreasing.

Lemma 2. If p_i > p_j, for any x such that p_i − p_j ≤ x ≤ p_i, the concentration function f_{ij}(x) of p_i and p_j is monotonically decreasing.

Accordingly, we derive the second replacement rule, the concentration rule: replace all predictable items of a symbol σ in ŝ for which n(σ) ≤ n(':') or n_hit(σ) ≥ n(σ) − n(':').

As an extension of the above properties, we also explore the entropy variation when predictable items of multiple symbols are replaced simultaneously. Let ŝ' denote a subset of ŝ, where |ŝ'| ≥ 2. We prove that if replacing the predictable symbols in ŝ' minimizes the entropy corresponding to the symbols in ŝ', the number of hit symbols after the replacement must be larger than the number of the predictable symbols after the replacement, as shown in Lemma 3. To investigate whether the converse statement holds, we conduct an experiment in a brute-force way. However, the experimental results show that even under the condition that n(':') + Σ_{σ' ∈ ŝ'} n_hit(σ') ≥ n(σ) − n_hit(σ) for every σ in ŝ', replacing the predictable items of the symbols in ŝ' does not guarantee a reduction of the entropy. Therefore, we compare the difference of the entropy before and after replacing the symbols in ŝ' as

gain(ŝ') = −Σ_{σ ∈ ŝ'} [p_σ × log2 p_σ − (p_σ − Δp_σ) × log2(p_σ − Δp_σ)] − p_':' × log2 p_':' + (p_':' + Σ_{σ ∈ ŝ'} Δp_σ) × log2(p_':' + Σ_{σ ∈ ŝ'} Δp_σ).

In addition, we also prove that once replacing some of the predictable items of the symbols in ŝ' reduces the entropy, replacing all predictable items of these symbols reduces the entropy the most, since the entropy decreases monotonically, as shown in Lemma 4. We thereby derive the third replacement rule, the multiple symbol rule: replace all of the predictable items of every symbol in ŝ' if gain(ŝ') ≥ 0.

Lemma 3. If there exist 0 ≤ a_n ≤ p_{i_n}, n = 1, ..., k, such that f_{i_1 i_2 ... i_k j}(a_1, a_2, ..., a_k) is minimal, then p_j + Σ_{i=1}^{k} a_i ≥ p_{i_n} − a_n for n = 1, ..., k.

Lemma 4. If there exist x_{min_1}, x_{min_2}, ..., and x_{min_k} such that p_j + Σ_{i=1}^{k} x_{min_i} ≥ p_{i_n} − x_{min_n} for n = 1, 2, ..., k, then f_{i_1 i_2 ... i_k j}(x_1, x_2, ..., x_k) is monotonically decreasing for x_{min_n} ≤ x_n ≤ p_{i_n}, n = 1, ..., k.

5.2.2 The Replace Algorithm

Based on the observations described in the previous section, we propose the Replace algorithm, which leverages the three replacement rules to obtain the optimal solution of the HIR problem. Our algorithm examines the statistics of the predictable symbols, which include the number of items and the number of predictable items of each predictable symbol. The algorithm first replaces the qualified symbols according to the accumulation rule. Afterward, since the concentration rule and the multiple symbol rule are related to n(':'), which increases after every replacement, the algorithm iteratively replaces the qualified symbols according to the two rules until all qualified symbols are replaced. The algorithm thereby replaces qualified symbols and reduces the entropy toward the optimum gradually. Compared with the brute-force method, which enumerates all possible intermediate sequences in exponential complexity, the Replace algorithm leverages the derived rules to obtain the optimal solution in O(L) time⁶ and is thus more scalable and efficient. We prove that the Replace algorithm is guaranteed to reduce the entropy monotonically and obtains the optimal solution of the HIR problem in Theorem 1. Next, we detail the Replace algorithm and demonstrate it with an illustrative example.

Theorem 1.⁷ The Replace algorithm obtains the optimal solution of the HIR problem.

Fig. 10 shows the Replace algorithm. The input includes a location sequence S and a predictor T_g, while the output, denoted by S', is a sequence in which the qualified items are replaced by ':'. Initially, Lines 3-9 of the algorithm find the set of predictable symbols together with their statistics. Then, it examines the statistics of the predictable symbols according to the three replacement rules as follows: First, according to the accumulation rule, it replaces the qualified symbols in one scan of the predictable symbols (Lines 10-14). Next, the algorithm iteratively examines the statistics for the concentration rule and the multiple symbol rule in two loops. The first loop, from Line 16 to Line 22, is for the concentration rule, whereas the second loop, from Line 25 to Line 36, is for the multiple symbol rule. In our design, since finding a combination of predictable symbols that makes gain(ŝ') ≥ 0 hold is more costly, the algorithm is prone to replace symbols with the

6. The complexity analysis is given in Appendix E, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
7. The proof of Theorem 1 is given in Appendix F, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
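A simplified sketch of the rule-driven replacement may help fix ideas. It applies only the accumulation rule and then iterates the concentration rule; the multiple symbol rule and the exact bookkeeping of Fig. 10 are omitted, and the toy sequence and taglst are our own.

```python
import math
from collections import Counter

HIT = ':'

def entropy(seq):
    """e(S): Shannon entropy of the symbol distribution of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def replace(seq, taglst):
    """Replace predictable items with the hit symbol using two of the
    three rules (a sketch, not the full Fig. 10 algorithm).
    taglst[i] == 1 marks item i as predictable by the shared model."""
    n = Counter(seq)                                      # n(sigma)
    n_hit = Counter(s for s, t in zip(seq, taglst) if t)  # n_hit(sigma)
    n_colon = 0                                           # n(':')
    replaced = set()
    # Accumulation rule: every item of the symbol is predictable.
    for sym in list(n_hit):
        if n_hit[sym] == n[sym]:
            replaced.add(sym)
            n_colon += n_hit[sym]
    # Concentration rule, iterated: each replacement grows n(':'),
    # which may qualify further predictable symbols.
    changed = True
    while changed:
        changed = False
        for sym in n_hit:
            if sym in replaced:
                continue
            if n[sym] <= n_colon or n_hit[sym] >= n[sym] - n_colon:
                replaced.add(sym)
                n_colon += n_hit[sym]
                changed = True
    return [HIT if t and s in replaced else s for s, t in zip(seq, taglst)]

seq = list("abacabad")
taglst = [1, 0, 1, 0, 1, 0, 1, 1]   # every 'a' and the single 'd' hit
out = replace(seq, taglst)
```

On this toy input both 'a' and 'd' satisfy n(σ) = n_hit(σ), so the accumulation rule folds their mass onto ':' and the per-symbol entropy drops from about 1.75 to about 1.30 bits, illustrating the paper's remark that replacing multiple fully predictable symbols reduces the entropy.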