Exploring application level semantics

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 1, JANUARY 2011

Exploring Application-Level Semantics for Data Compression

Hsiao-Ping Tsai, Member, IEEE, De-Nian Yang, and Ming-Syan Chen, Fellow, IEEE

Abstract: Natural phenomena show that many creatures form large social groups and move in regular patterns. However, previous works focus on finding the movement patterns of each single object or all objects. In this paper, we first propose an efficient distributed mining algorithm to jointly identify a group of moving objects and discover their movement patterns in wireless sensor networks. Afterward, we propose a compression algorithm, called 2P2D, which exploits the obtained group movement patterns to reduce the amount of delivered data. The compression algorithm includes a sequence merge and an entropy reduction phases. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of moving objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal solution. Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results show that the proposed compression algorithm leverages the group movement patterns to reduce the amount of delivered data effectively and efficiently.

Index Terms: Data compression, distributed clustering, object tracking.

H.-P. Tsai is with the Department of Electrical Engineering (EE), National Chung Hsing University, No. 250, Kuo Kuang Road, Taichung 402, Taiwan, ROC. E-mail: hptsai@nchu.edu.tw.
D.-N. Yang is with the Institute of Information Science (IIS) and the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, ROC. E-mail: dnyang@iis.sinica.edu.tw.
M.-S. Chen is with the Research Center of Information Technology Innovation (CITI), Academia Sinica, No. 128, Academia Road, Sec. 2, Nankang, Taipei 11529, Taiwan, and the Department of Electrical Engineering (EE), Department of Computer Science and Information Engineer (CSIE), and Graduate Institute of Communication Engineering (GICE), National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, ROC. E-mail: mschen@cc.ee.ntu.edu.tw.
Manuscript received 22 Sept. 2008; revised 27 Feb. 2009; accepted 29 July 2009; published online 4 Feb. 2010.
Recommended for acceptance by M. Ester.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-09-0497.
Digital Object Identifier no. 10.1109/TKDE.2010.30.
1041-4347/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

Recent advances in location-acquisition technologies, such as global positioning systems (GPSs) and wireless sensor networks (WSNs), have fostered many novel applications like object tracking, environmental monitoring, and location-dependent service. These applications generate a large amount of location data, and thus, lead to transmission and storage challenges, especially in resource-constrained environments like WSNs. To reduce the data volume, various algorithms have been proposed for data compression and data aggregation [1], [2], [3], [4], [5], [6]. However, the above works do not address application-level semantics, such as the group relationships and movement patterns, in the location data.

In object tracking applications, many natural phenomena show that objects often exhibit some degree of regularity in their movements. For example, the famous annual wildebeest migration demonstrates that the movements of creatures are temporally and spatially correlated. Biologists also have found that many creatures, such as elephants, zebra, whales, and birds, form large social groups when migrating to find food, or for breeding or wintering. These characteristics indicate that the trajectory data of multiple objects may be correlated for biological applications. Moreover, some research domains, such as the study of animals' social behavior and wildlife migration [7], [8], are more concerned with the movement patterns of groups of animals, not individuals; hence, tracking each object is unnecessary in this case. This raises a new challenge of finding moving animals belonging to the same group and identifying their aggregated group movement patterns. Therefore, under the assumption that objects with similar movement patterns are regarded as a group, we define the moving object clustering problem as given the movement trajectories of objects, partitioning the objects into nonoverlapped groups such that the number of groups is minimized. Then, group movement pattern discovery is to find the most representative movement patterns regarding each group of objects, which are further utilized to compress location data.

Discovering the group movement patterns is more difficult than finding the patterns of a single object or all objects, because we need to jointly identify a group of objects and discover their aggregated group movement patterns. The constrained resource of WSNs should also be considered in approaching the moving object clustering problem. However, few of existing approaches consider these issues simultaneously. On the one hand, the temporal-and-spatial correlations in the movements of moving objects are modeled as sequential patterns in data mining to discover the frequent movement patterns [9], [10], [11], [12]. However, sequential patterns 1) consider the characteristics of all objects, 2) lack information about a frequent pattern's significance regarding individual trajectories, and 3) carry no time information between consecutive items, which make them unsuitable for location prediction and similarity comparison. On the other hand, previous works, such as
[13], [14], [15], measure the similarity among these entire trajectory sequences to group moving objects. Since objects may be close together in some types of terrain, such as gorges, and widely distributed in less rugged areas, their group relationships are distinct in some areas and vague in others. Thus, approaches that perform clustering among entire trajectories may not be able to identify the local group relationships. In addition, most of the above works are centralized algorithms [9], [10], [11], [12], [13], [14], [15], which need to collect all data to a server before processing. Thus, unnecessary and redundant data may be delivered, leading to much more power consumption because data transmission needs more power than data processing in WSNs [5]. In [16], we have proposed a clustering algorithm to find the group relationships for query and data aggregation efficiency. The differences of [16] and this work are as follows: First, since the clustering algorithm itself is a centralized algorithm, in this work, we further consider systematically combining multiple local clustering results into a consensus to improve the clustering quality and for use in the update-based tracking network. Second, when a delay is tolerant in the tracking application, a new data management approach is required to offer transmission efficiency, which also motivates this study. We thus define the problem of compressing the location data of a group of moving objects as the group data compression problem.

Therefore, in this paper, we first introduce our distributed mining algorithm to approach the moving object clustering problem and discover group movement patterns. Then, based on the discovered group movement patterns, we propose a novel compression algorithm to tackle the group data compression problem. Our distributed mining algorithm comprises a Group Movement Pattern Mining (GMPMine) and a Cluster Ensembling (CE) algorithms. It avoids transmitting unnecessary and redundant data by transmitting only the local grouping results to a base station (the sink), instead of all of the moving objects' location data. Specifically, the GMPMine algorithm discovers the local group movement patterns by using a novel similarity measure, while the CE algorithm combines the local grouping results to remove inconsistency and improve the grouping quality by using the information theory.

Different from previous compression techniques that remove redundancy of data according to the regularity within the data, we devise a novel two-phase and 2D algorithm, called 2P2D, which utilizes the discovered group movement patterns shared by the transmitting node and the receiving node to compress data. In addition to remove redundancy of data according to the correlations within the data of each single object, the 2P2D algorithm further leverages the correlations of multiple objects and their movement patterns to enhance the compressibility. Specifically, the 2P2D algorithm comprises a sequence merge and an entropy reduction phases. In the sequence merge phase, we propose a Merge algorithm to merge and compress the location data of a group of objects. In the entropy reduction phase, we formulate a Hit Item Replacement (HIR) problem to minimize the entropy of the merged data and propose a Replace algorithm to obtain the optimal solution. The Replace algorithm finds the optimal solution of the HIR problem based on Shannon's theorem [17] and guarantees the reduction of entropy, which is conventionally viewed as an optimization bound of compression performance. As a result, our approach reduces the amount of delivered data and, by extension, the energy consumption in WSNs.

Our contributions are threefold:

  • Different from previous works, we formulate a moving object clustering problem that jointly identifies a group of objects and discovers their movement patterns. The application-level semantics are useful for various applications, such as data storage and transmission, task scheduling, and network construction.
  • To approach the moving object clustering problem, we propose an efficient distributed mining algorithm to minimize the number of groups such that members in each of the discovered groups are highly related by their movement patterns.
  • We propose a novel compression algorithm to compress the location data of a group of moving objects with or without loss of information. We formulate the HIR problem to minimize the entropy of location data and explore the Shannon's theorem to solve the HIR problem. We also prove that the proposed compression algorithm obtains the optimal solution of the HIR problem efficiently.

The remainder of the paper is organized as follows: In Section 2, we review related works. In Section 3, we provide an overview of the network, location, and movement models and formulate our problem. In Section 4, we describe the distributed mining algorithm. In Section 5, we formulate the compression problems and propose our compression algorithm. Section 6 details our experimental results. Finally, we summarize our conclusions in Section 7.

2 RELATED WORK

2.1 Movement Pattern Mining

Agrawal and Srikant [18] first defined the sequential pattern mining problem and proposed an Apriori-like algorithm to find the frequent sequential patterns. Han et al. consider the pattern projection method in mining sequential patterns and proposed FreeSpan [19], which is an FP-growth-based algorithm. Yang and Hu [9] developed a new match measure for imprecise trajectory data and proposed TrajPattern to mine sequential patterns. Many variations derived from sequential patterns are used in various applications, e.g., Chen et al. [20] discover path traversal patterns in a Web environment, while Peng and Chen [21] mine user moving patterns incrementally in a mobile computing system. However, sequential patterns and its variations like [20], [21] do not provide sufficient information for location prediction or clustering. First, they carry no time information between consecutive items, so they cannot provide accurate information for location prediction when time is concerned. Second, they consider the characteristics of all objects, which make the meaningful movement characteristics of individual objects or a group of moving objects inconspicuous and ignored. Third, because
a sequential pattern lacks information about its significance regarding to each individual trajectory, they are not fully representative to individual trajectories. To discover significant patterns for location prediction, Morzy mines frequent trajectories whose consecutive items are also adjacent in the original trajectory data [10], [22]. Meanwhile, Giannotti et al. [11] extract T-patterns from spatiotemporal data sets to provide concise descriptions of frequent movements, and Tseng and Lin [12] proposed the TMP-Mine algorithm for discovering the temporal movement patterns. However, the above Apriori-like or FP-growth-based algorithms still focus on discovering frequent patterns of all objects and may suffer from computing efficiency or memory problems, which make them unsuitable for use in resource-constrained environments.

2.2 Clustering

Recently, clustering based on objects' movement behavior has attracted more attention. Wang et al. [14] transform the location sequences into a transaction-like data on users and based on which to obtain a valid group, but the proposed AGP and VG growth are still Apriori-like or FP-growth-based algorithms that suffer from high computing cost and memory demand. Nanni and Pedreschi [15] proposed a density-based clustering algorithm, which makes use of an optimal time interval and the average euclidean distance between each point of two trajectories, to approach the trajectory clustering problem. However, the above works discover global group relationships based on the proportion of the time a group of users stay close together to the whole time duration or the average euclidean distance of the entire trajectories. Thus, they may not be able to reveal the local group relationships, which are required for many applications. In addition, though computing the average euclidean distance of two geometric trajectories is simple and useful, the geometric coordinates are expensive and not always available. Approaches, such as EDR, LCSS, and DTW, are widely used to compute the similarity of symbolic trajectory sequences [13], but the above dynamic programming approaches suffer from scalability problem [23]. To provide scalability, approximation or summarization techniques are used to represent original data. Guralnik and Karypis [23] project each sequence into a vector space of sequential patterns and use a vector-based K-means algorithm to cluster objects. However, the importance of a sequential pattern regarding individual sequences can be very different, which is not considered in this work. To cluster sequences, Yang and Wang proposed CLUSEQ [24], which iteratively identifies a sequence to a learned model, yet the generated clusters may overlap which differentiates their problem from ours.

2.3 Data Compression

Data compression can reduce the storage and energy consumption for resource-constrained applications. In [1], distributed source (Slepian-Wolf) coding uses joint entropy to encode two nodes' data individually without sharing any data between them; however, it requires prior knowledge of cross correlations of sources. Other works, such as [2], [4], combine data compression with routing by exploiting cross correlations between sensor nodes to reduce the data size. In [5], a tailed LZW has been proposed to address the memory constraint of a sensor device. Summarization of the original data by regression or linear modeling has been proposed for trajectory data compression [3], [6]. However, the above works do not address application-level semantics in data, such as the correlations of a group of moving objects, which we exploit to enhance the compressibility.

3 PRELIMINARIES

3.1 Network and Location Models

Many researchers believe that a hierarchical architecture provides better coverage and scalability, and also extends the network lifetime of WSNs [25], [26]. In a hierarchical WSN, such as that proposed in [27], the energy, computing, and storage capacity of sensor nodes are heterogeneous. A high-end sophisticated sensor node, such as Intel Stargate [28], is assigned as a cluster head (CH) to perform high complexity tasks; while a resource-constrained sensor node, such as Mica2 mote [29], performs the sensing and low complexity tasks. In this work, we adopt a hierarchical network structure with K layers, as shown in Fig. 1a, where sensor nodes are clustered in each level and collaboratively gather or relay remote information to a base station called a sink. A sensor cluster is a mesh network of $n \times n$ sensor nodes handled by a CH and communicate with each other by using multihop routing [30]. We assume that each node in a sensor cluster has a locally unique ID and denote the sensor IDs by an alphabet $\Sigma$. Fig. 1b shows an example of a two-layer tracking network, where each sensor cluster contains 16 nodes identified by $\Sigma = \{a, b, \ldots, p\}$.

Fig. 1. (a) The hierarchical- and cluster-based network structure and the data flow of an update-based tracking network. (b) A flat view of a two-layer network structure with 16 clusters.

In this work, an object is defined as a target, such as an animal or a bird, that is recognizable and trackable by the tracking network. To represent the location of an object, geometric models and symbolic models are widely used [31]. A geometric location denotes precise two-dimension or three-dimension coordinates; while a symbolic location represents an area, such as the sensing area of a sensor node or a cluster of sensor nodes, defined by the application. Since the accurate geometric location is not easy to obtain and techniques like the Received Signal Strength (RSS) [32] can simply estimate an object's location based on the ID of the sensor node with the strongest signal, we employ a symbolic model and describe the location of an object by using the ID of a nearby sensor node.

Object tracking is defined as a task of detecting a moving object's location and reporting the location data to the sink
periodically at a time interval. Hence, an observation on an object is defined by the obtained location data. We assume that sensor nodes wake up periodically to detect objects. When a sensor node wakes up on its duty cycle and detects an object of interest, it transmits the location data of the object to its CH. Here, the location data include a timestamp, the ID of an object, and its location. We also assume that the targeted applications are delay-tolerant. Thus, instead of forwarding the data upward immediately, the CH compresses the data accumulated for a batch period and sends it to the CH of the upper layer. The process is repeated until the sink receives the location data. Consequently, the trajectory of a moving object is thus modeled as a series of observations and expressed by a location sequence, i.e., a sequence of sensor IDs visited by the object. We denote a location sequence by $S = s_0 s_1 \ldots s_{L-1}$, where each item $s_i$ is a symbol in $\Sigma$ and $L$ is the sequence length. An example of an object's trajectory and the obtained location sequence is shown in the left of Fig. 2. The tracking network tracks moving objects for a period and generates a location sequence data set, based on which we discover the group relationships of the moving objects.

3.2 Variable Length Markov Model (VMM) and Probabilistic Suffix Tree (PST)

If the movements of an object are regular, the object's next location can be predicted based on its preceding locations. We model the regularity by using the Variable Length Markov Model (VMM). Under the VMM, an object's movement is expressed by a conditional probability distribution over $\Sigma$. Let $s$ denote a pattern which is a subsequence of a location sequence $S$ and $\sigma$ denote a symbol in $\Sigma$. The conditional probability $P(\sigma \mid s)$ is the occurrence probability that $\sigma$ will follow $s$ in $S$. Since the length of $s$ is floating, the VMM provides flexibility to adapt to the variable length of movement patterns. Note that when a pattern $s$ occurs more frequently, it carries more information about the movements of the object and is thus more desirable for the purpose of prediction. To find the informative patterns, we first define a pattern as a significant movement pattern if its occurrence probability is above a minimal threshold.

To learn the significant movement patterns, we adapt Probabilistic Suffix Tree (PST) [33] for it has the lowest storage requirement among many VMM implementations [34]. PST's low complexity, i.e., $O(n)$ in both time and space [35], also makes it more attractive especially for streaming or resource-constrained environments [36]. The PST building algorithm (see footnote 1) learns from a location sequence and generates a PST whose height is limited by a specified parameter $L_{\max}$. Each node of the tree represents a significant movement pattern $s$ whose occurrence probability is above a specified minimal threshold $P_{\min}$. It also carries the conditional empirical probabilities $P(\sigma \mid s)$ for each $\sigma$ in $\Sigma$ that we use in location prediction. Fig. 2 shows an example of a location sequence and the corresponding PST. $node_f$ is one of the children of the root node and represents a significant movement pattern with "f" whose occurrence probability is above 0.01. The conditional empirical probabilities are $P(\text{'b'} \mid \text{"f"}) = 0.33$, $P(\text{'e'} \mid \text{"f"}) = 0.67$, and $P(\sigma \mid \text{"f"}) = 0$ for the other $\sigma$ in $\Sigma$.

Fig. 2. An example of an object's moving trajectory, the obtained location sequence, and the generated PST T.

1. The PST building algorithm is given in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.

PST is frequently used in predicting the occurrence probability of a given sequence, which provides us important information in similarity comparison. The occurrence probability of a sequence $s$ regarding to a PST $T$, denoted by $P^T(s)$, is the prediction of the occurrence probability of $s$ based on $T$. Each conditional probability is evaluated at the longest suffix of its context that is a node of $T$. For example, the occurrence probability $P^T(\text{"nokjfb"})$ is computed as follows:

\begin{align*}
P^T(\text{"nokjfb"}) &= P^T(\text{"n"}) \times P^T(\text{'o'} \mid \text{"n"}) \times P^T(\text{'k'} \mid \text{"no"}) \times P^T(\text{'j'} \mid \text{"nok"}) \times P^T(\text{'f'} \mid \text{"nokj"}) \times P^T(\text{'b'} \mid \text{"nokjf"}) \\
&= P^T(\text{"n"}) \times P^T(\text{'o'} \mid \text{"n"}) \times P^T(\text{'k'} \mid \text{"o"}) \times P^T(\text{'j'} \mid \text{"k"}) \times P^T(\text{'f'} \mid \text{"j"}) \times P^T(\text{'b'} \mid \text{"okjf"}) \\
&= 0.05 \times 1 \times 1 \times 1 \times 1 \times 0.33 = 0.0165.
\end{align*}

PST is also useful and efficient in predicting the next item of a sequence. For a given sequence $s$ and a PST $T$, our predict_next algorithm as shown in Fig. 3 outputs the most probable next item, denoted by $predict\_next(T, s)$. We demonstrate its efficiency by an example as follows: Given $s = \text{"nokjf"}$ and $T$ shown in Fig. 2, the predict_next algorithm traverses the tree to the deepest node $node_{okjf}$ along the path including $node_{root}$, $node_f$, $node_{jf}$, $node_{kjf}$, and $node_{okjf}$. Finally, symbol 'e', which has the highest conditional empirical probability in $node_{okjf}$, is returned, i.e., $predict\_next(T, \text{"nokjf"}) = \text{'e'}$. The algorithm's computational overhead is limited by the height of a PST so that it is suitable for sensor nodes.

Fig. 3. The predict_next algorithm.
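As an illustration only, the two uses of a PST described in Section 3.2 (scoring a sequence and predicting its next item) can be sketched in Python. The dictionary layout, the names TOY_PST, deepest_context, and sequence_probability, and every probability other than the values quoted in the text (0.05, 0.33, 0.67) are assumptions made for this sketch. It is not the predict_next algorithm of Fig. 3 or the PST building algorithm of Appendix A, only a lookup that follows the same longest-suffix idea.

```python
from math import prod  # available in Python 3.8+

# Hypothetical toy PST: each key is a node string (a significant pattern) and
# each value holds the conditional empirical probabilities P(symbol | pattern).
# Symbols that are not listed are treated as having probability 0.
TOY_PST = {
    "": {"n": 0.05},  # root; only the entry needed for the example is listed
    "n": {"o": 1.0},
    "o": {"k": 1.0},
    "k": {"j": 1.0},
    "j": {"f": 1.0},
    "f": {"b": 0.33, "e": 0.67},
    "jf": {"b": 0.33, "e": 0.67},
    "kjf": {"b": 0.33, "e": 0.67},
    "okjf": {"b": 0.33, "e": 0.67},
}


def deepest_context(pst, s):
    """Return the longest suffix of s that is a node of the PST."""
    for start in range(len(s) + 1):
        if s[start:] in pst:
            return s[start:]
    raise KeyError("the PST has no root node")


def predict_next(pst, s):
    """Return the most probable next symbol after s."""
    distribution = pst[deepest_context(pst, s)]
    return max(distribution, key=distribution.get)


def sequence_probability(pst, s):
    """Multiply P(s[i] | longest known suffix of s[:i]) over all positions i."""
    return prod(
        pst[deepest_context(pst, s[:i])].get(s[i], 0.0) for i in range(len(s))
    )
```

With this toy tree, predict_next(TOY_PST, "nokjf") walks to the node "okjf" and sequence_probability(TOY_PST, "nokjfb") multiplies the same six factors as the worked example. Because the suffix search is bounded by the longest node string, the cost per prediction stays bounded by the tree height, which is the property the text relies on for sensor nodes.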
3.3 Problem Description

We formulate the problem of this paper as exploring the group movement patterns to compress the location sequences of a group of moving objects for transmission efficiency. Consider a set of moving objects $O = \{o_1, o_2, \ldots, o_n\}$ and their associated location sequence data set $S = \{S_1, S_2, \ldots, S_n\}$.

Definition 1. Two objects are similar to each other if their movement patterns are similar. Given the similarity measure function $sim_p$ (see footnote 2) and a minimal threshold $sim_{\min}$, $o_i$ and $o_j$ are similar if their similarity score $sim_p(o_i, o_j)$ is above $sim_{\min}$, i.e., $sim_p(o_i, o_j) \geq sim_{\min}$. The set of objects that are similar to $o_i$ is denoted by $so(o_i) = \{o_j \mid \forall o_j \in O, sim_p(o_i, o_j) \geq sim_{\min}\}$.

Definition 2. A set of objects is recognized as a group if they are highly similar to one another. Let $g$ denote a set of objects. $g$ is a group if every object in $g$ is similar to at least a threshold of objects in $g$, i.e., $\forall o_i \in g$, $\frac{|so(o_i) \cap g|}{|g|} \geq \gamma$, where $\gamma$ is with default value $\frac{1}{2}$ (see footnote 3).

2. $sim_p$ is to be defined in Section 4.1.
3. In this work, we set the threshold to $\frac{1}{2}$, as suggested in [37], and leave the similarity threshold as the major control parameter because training the similarity threshold of two objects is easier than that of a group of objects.

We formally define the moving object clustering problem as follows: Given a set of moving objects $O$ together with their associated location sequence data set $S$ and a minimal similarity threshold $sim_{\min}$, the moving object clustering problem is to partition $O$ into nonoverlapped groups, denoted by $G = \{g_1, g_2, \ldots, g_i\}$, such that the number of groups is minimized, i.e., $|G|$ is minimal. Thereafter, with the discovered group information and the obtained group movement patterns, the group data compression problem is to compress the location sequences of a group of moving objects for transmission efficiency. Specifically, we formulate the group data compression problem as a merge problem and an HIR problem. The merge problem is to combine multiple location sequences to reduce the overall sequence length, while the HIR problem targets to minimize the entropy of a sequence such that the amount of data is reduced with or without loss of information.

4 MINING OF GROUP MOVEMENT PATTERNS

To tackle the moving object clustering problem, we propose a distributed mining algorithm, which comprises the GMPMine and CE algorithms. First, the GMPMine algorithm uses a PST to generate an object's significant movement patterns and computes the similarity of two objects by using $sim_p$ to derive the local grouping results. The merits of $sim_p$ include its accuracy and efficiency: First, $sim_p$ considers the significances of each movement pattern regarding to individual objects so that it achieves better accuracy in similarity comparison. For a PST can be used to predict a pattern's occurrence probability, which is viewed as the significance of the pattern regarding the PST, $sim_p$ thus includes movement patterns' predicted occurrence probabilities to provide fine-grained similarity comparison. Second, $sim_p$ can offer seamless and efficient comparison for the applications with evolving and evolutionary similarity relationships. This is because $sim_p$ can compare the similarity of two data streams only on the changed mature nodes of emission trees [36], instead of all nodes.

To combine multiple local grouping results into a consensus, the CE algorithm utilizes the Jaccard similarity coefficient to measure the similarity between a pair of objects, and normalized mutual information (NMI) to derive the final ensembling result. It trades off the grouping quality against the computation cost by adjusting a partition parameter. In contrast to approaches that perform clustering among the entire trajectories, the distributed algorithm discovers the group relationships in a distributed manner on sensor nodes. As a result, we can discover group movement patterns to compress the location data in the areas where objects have explicit group relationships. Besides, the distributed design provides flexibility to take partial local grouping results into ensembling when the group relationships of moving objects in a specified subregion are interested. Also, it is especially suitable for heterogeneous tracking configurations, which helps reduce the tracking cost, e.g., instead of waking up all sensors at the same frequency, a fine-grained tracking interval is specified for partial terrain in the migration season to reduce the energy consumption. Rather than deploying the sensors in the same density, they are only highly concentrated in areas of interest to reduce deployment costs.

4.1 The Group Movement Pattern Mining (GMPMine) Algorithm

To provide better discrimination accuracy, we propose a new similarity measure $sim_p$ to compare the similarity of two objects. For each of their significant movement patterns, the new similarity measure considers not merely two probability distributions but also two weight factors, i.e., the significance of the pattern regarding to each PST. The similarity score $sim_p$ of $o_i$ and $o_j$ based on their respective PSTs, $T_i$ and $T_j$, is defined as follows:

\[
sim_p(o_i, o_j) = -\log \frac{\sum_{s \in \tilde{S}} \sqrt{\sum_{\sigma \in \Sigma} \left(P^{T_i}(s\sigma) - P^{T_j}(s\sigma)\right)^2}}{2L_{\max} + \sqrt{2}}, \qquad (1)
\]

where $\tilde{S}$ denotes the union of significant patterns (node strings) on the two trees. The similarity score $sim_p$ includes the distance associated with a pattern $s$, defined as

\[
d(s) = \sqrt{\sum_{\sigma \in \Sigma} \left(P^{T_i}(s\sigma) - P^{T_j}(s\sigma)\right)^2}
     = \sqrt{\sum_{\sigma \in \Sigma} \left(P^{T_i}(s) \times P^{T_i}(\sigma \mid s) - P^{T_j}(s) \times P^{T_j}(\sigma \mid s)\right)^2},
\]

where $d(s)$ is the euclidean distance associated with a pattern $s$ over $T_i$ and $T_j$.

For a pattern $s \in T$, $P^T(s)$ is a significant value because the occurrence probability of $s$ is higher than the minimal support $P_{\min}$. If $o_i$ and $o_j$ share the pattern $s$, we have $s \in T_i$ and $s \in T_j$, respectively, such that $P^{T_i}(s)$ and $P^{T_j}(s)$ are non-negligible and meaningful in the similarity comparison. Because the conditional empirical probabilities are also parts of a pattern, we consider the conditional empirical probabilities $P^T(\sigma \mid s)$ when calculating the distance between two PSTs. Therefore, we sum $d(s)$ for all $s \in \tilde{S}$ as the distance between two PSTs. Note that the distance between two PSTs is normalized by its maximal value, i.e., $2L_{\max} + \sqrt{2}$. We take
the negative log of the distance between two PSTs as the similarity score such that a larger value of the similarity score implies a stronger similarity relationship, and vice versa. With the definition of similarity score, two objects are similar to each other if their score is above a specified similarity threshold.

The GMPMine algorithm includes four steps. First, we extract the movement patterns from the location sequences by learning a PST for each object. Second, our algorithm constructs an undirected, unweighted similarity graph where similar objects share an edge between each other. We model the density of group relationship by the connectivity of a subgraph, which is also defined as the minimal cut of the subgraph. When the ratio of the connectivity to the size of the subgraph is higher than a threshold, the objects corresponding to the subgraph are identified as a group. Since the optimization of the graph partition problem is intractable in general, we bisect the similarity graph in the following way. We leverage the HCS cluster algorithm [37] to partition the graph and derive the location group information. Finally, we select a group PST $T^g$ for each group in order to conserve the memory space by using the formula $T^g = \arg\max_{T \in \mathcal{T}} \sum_{s \in \mathcal{S}} P^T(s)$, where $\mathcal{S}$ denotes sequences of a group of objects and $\mathcal{T}$ denotes their PSTs. Due to the page limit, we give an illustrative example of the GMPMine algorithm in Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.

4.2 The Cluster Ensembling (CE) Algorithm

In the previous section, each CH collects location data and generates local group results with the proposed GMPMine algorithm. Since the objects may pass through only some of the sensor clusters and have different movement patterns in different clusters, the local grouping results may be inconsistent. For example, if objects in a sensor cluster walk close together across a canyon, it is reasonable to consider them a group. In contrast, objects scattered in grassland may not be identified as a group. Furthermore, in the case where a group of objects moves across the margin of a sensor cluster, it is difficult to find their group relationships. Therefore, we propose the CE algorithm to combine multiple local grouping results. The algorithm solves the inconsistency problem and improves the grouping quality.

The ensembling problem involves finding the partition of all moving objects $O$ that contains the most information about the local grouping results. We utilize NMI [38], [39] to evaluate the grouping quality. Let $C$ denote the ensemble of the local grouping results, represented as $C = \{G_0, G_1, \ldots, G_K\}$, where $K$ denotes the ensemble size. Our goal is to discover the ensembling result $G'$ that contains the most information about $C$, i.e., $G' = \arg\max_{G \in \tilde{G}} \sum_{i=1}^{K} NMI(G_i, G)$, where $\tilde{G}$ denotes all possible ensembling results.

However, enumerating every $G \in \tilde{G}$ in order to find the optimal ensembling result $G'$ is impractical, especially in resource-constrained environments. To overcome this difficulty, the CE algorithm trades off the grouping quality against the computation cost by adjusting the partition parameter $D$, i.e., a set of thresholds with values in the range $[0, 1]$ such that a finer-grained configuration of $D$ achieves a better grouping quality but at a higher computation cost. Therefore, for a set of thresholds $D$, we rewrite our objective function as $G_{\delta'} = \arg\max_{G_\delta, \delta \in D} \sum_{i=1}^{K} NMI(G_i, G_\delta)$.

The algorithm includes three steps. First, we utilize the Jaccard Similarity Coefficient [40] as the measure of the similarity for each pair of objects. Second, for each $\delta \in D$, we construct a graph where two objects share an edge if their Jaccard Similarity Coefficient is above $\delta$. Our algorithm partitions the objects to generate a partitioning result $G_\delta$. Third, we select the ensembling result $G_{\delta'}$. Because of space limitations, we only demonstrate the CE mining algorithm with an example in Appendix C, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30. In the next section, we propose our compression algorithm that leverages the obtained group movement patterns.

Fig. 4. Design of the two-phase and 2D compression algorithm.

5 DESIGN OF A COMPRESSION ALGORITHM WITH GROUP MOVEMENT PATTERNS

A WSN is composed of a large number of miniature sensor nodes that are deployed in a remote area for various applications, such as environmental monitoring or wildlife tracking. These sensor nodes are usually battery-powered and recharging a large number of them is difficult. Therefore, energy conservation is paramount among all design issues in WSNs [41], [42]. Because the target objects are moving, conserving energy in WSNs for tracking moving objects is more difficult than in WSNs that monitor immobile phenomena, such as humidity or vibrations. Hence, previous works, such as [43], [44], [45], [46], [47], [48], especially consider movement characteristics of moving objects in their designs to track objects efficiently. On the other hand, since transmission of data is one of the most energy-expensive tasks in WSNs, data compression is utilized to reduce the amount of delivered data [1], [2], [3], [4], [5], [6], [49], [50], [51], [52]. Nevertheless, few of the above works have addressed the application-level semantics, i.e., the temporal-and-spatial correlations of a group of moving objects. Therefore, to reduce the amount of delivered data, we propose the 2P2D algorithm, which leverages the group movement patterns derived in Section 4 to compress the location sequences of moving objects. As shown in Fig. 4, the algorithm includes the sequence merge phase and the entropy reduction phase to compress location sequences vertically and horizontally. In the sequence merge phase, we propose the Merge algorithm to compress the location sequences of a group of moving objects. Since objects with similar movement patterns are identified as a group, their location sequences are similar. The Merge algorithm avoids redundant sending of their locations, and thus, reduces the overall sequence length. It combines the sequences of a group of moving objects by 1) trimming
multiple identical symbols at the same time interval into a single symbol or 2) choosing a qualified symbol to represent them when a tolerance of loss of accuracy is specified by the application. Therefore, the algorithm trims and prunes more items when the group size is larger and the group relationships are more distinct. Besides, in the case that only the location center of a group of objects is of interest, our approach can find the aggregated value in this phase, instead of transmitting all location sequences back to the sink for postprocessing.

In the entropy reduction phase, we propose the Replace algorithm that utilizes the group movement patterns as the prediction model to further compress the merged sequence. The Replace algorithm guarantees the reduction of a sequence's entropy, and consequently, improves compressibility without loss of information. Specifically, we define a new problem of minimizing the entropy of a sequence as the HIR problem. To reduce the entropy of a location sequence, we explore Shannon's theorem [17] to derive three replacement rules, based on which the Replace algorithm reduces the entropy efficiently. Also, we prove that the Replace algorithm obtains the optimal solution of the HIR problem as Theorem 1. In addition, since the objects may enter and leave a sensor cluster multiple times during a batch period and a group of objects may enter and leave a cluster at slightly different times, we discuss the segmentation and alignment problems in Section 5.3. Table 1 summarizes the notations.

5.1 Sequence Merge Phase

In the application of tracking wild animals, multiple moving objects may have group relationships and share similar trajectories. In this case, transmitting their location data separately leads to redundancy. Therefore, in this section, we concentrate on the problem of compressing multiple similar sequences of a group of moving objects.

Consider an illustrative example in Fig. 5a, where three location sequences $S_0$, $S_1$, and $S_2$ represent the trajectories of a group of three moving objects. Items with the same index belong to a column, and a column containing identical symbols is called an S-column; otherwise, the column is called a D-column. Since sending the items in an S-column individually causes redundancy, our basic idea of compressing multiple sequences is to trim the items in an S-column into a single symbol. Specifically, given a group of $n$ sequences, the items of an S-column are replaced by a single symbol, whereas the items of a D-column are wrapped up between two '/' symbols. Finally, our algorithm generates a merged sequence containing the same information as the original sequences. In decompressing from the merged sequence, when symbol '/' is encountered, the items after it are output until the next '/' symbol. Otherwise, for each item, we repeat it $n$ times to generate the original sequences. Fig. 5b shows the merged sequence $S''$ whose length is decreased from 60 items to 48 items such that 12 items are conserved. The example shows that our approach can reduce the amount of data without loss of information. Moreover, when there are more S-columns, our approach can bring more benefit.

Fig. 5. An example of the Merge algorithm. (a) Three sequences with high similarity. (b) The merged sequence $S''$.

When a little loss in accuracy is tolerable, representing items in a D-column by a qualified symbol to generate more S-columns can improve the compressibility. We regulate the accuracy by an error bound, defined as the maximal hop count between the real and reported locations of an object. For example, replacing items 2 and 6 of $S_2$ by 'g' and 'p', respectively, creates two more S-columns, and thus, results in a shorter merged sequence with 42 items. To select a representative symbol for a D-column, we include a selection criterion to minimize the average deviation between the real locations and reported locations for a group of objects at each time interval as follows:

Selection criterion. The maximal distance between the reported location and the real location is below a specified error bound $eb$, i.e., when the $i$th column is a D-column, $|S_j[i] - \sigma| \le eb$ must hold for $0 \le j < n$, where $n$ is the number of sequences.

TABLE 1. Description of the Notations
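The column-wise trimming and the selection criterion described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' Merge algorithm from Fig. 6: the function names `merge` and `unmerge`, the caller-supplied hop-distance function `dist`, and the choice to pick the representative only among symbols that already occur in the column are all hypothetical. The sequences are assumed to be aligned and of equal length, and '/' is assumed not to be a location symbol.

```python
from typing import Callable, List, Optional, Sequence

DELIM = "/"  # wraps the items of a D-column; assumed not to be a location symbol


def merge(seqs: Sequence[str], eb: int = 0,
          dist: Optional[Callable[[str, str], int]] = None) -> str:
    """Merge aligned sequences column by column (illustrative sketch)."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    out: List[str] = []
    for i in range(length):
        column = [s[i] for s in seqs]
        if len(set(column)) == 1:
            # S-column: identical symbols are trimmed into a single symbol.
            out.append(column[0])
            continue
        rep = None
        if eb > 0 and dist is not None:
            # Selection criterion: every item must lie within eb hops of the
            # representative. Candidates are limited to the column's own
            # symbols (an assumption made for this sketch).
            ok = [c for c in sorted(set(column))
                  if max(dist(c, x) for x in column) <= eb]
            if ok:
                # Prefer the candidate with the smallest total deviation.
                rep = min(ok, key=lambda c: sum(dist(c, x) for x in column))
        if rep is not None:
            out.append(rep)                               # lossy: one symbol
        else:
            out.append(DELIM + "".join(column) + DELIM)   # D-column kept whole
    return "".join(out)


def unmerge(merged: str, n: int) -> List[str]:
    """Rebuild n sequences from a merged sequence (exact only when eb = 0)."""
    seqs: List[List[str]] = [[] for _ in range(n)]
    i = 0
    while i < len(merged):
        if merged[i] == DELIM:
            items = merged[i + 1:i + 1 + n]   # the n items between two '/'
            for j in range(n):
                seqs[j].append(items[j])
            i += n + 2
        else:
            for j in range(n):                # S-column: repeat the item n times
                seqs[j].append(merged[i])
            i += 1
    return ["".join(s) for s in seqs]


group = ["abcd", "abed", "abcd"]
merged = merge(group)            # "ab/cec/d": three S-columns, one D-column
assert unmerge(merged, 3) == group
```

With `eb = 0` the round trip is lossless. With `eb > 0` a D-column whose items all lie within the bound collapses to one symbol, so `unmerge` then returns the reported rather than the real locations.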
Therefore, to compress the location sequences for a group of moving objects, we propose the Merge algorithm shown in Fig. 6. The input of the algorithm contains a group of sequences $\{S_i \mid 0 \le i < n\}$ and an error bound $eb$, while the output is a merged sequence that represents the group of sequences. Specifically, the Merge algorithm processes the sequences in a columnwise way. For each column, the algorithm first checks whether it is an S-column. For an S-column, it retains the value of the items (Lines 4-5). Otherwise, when an error bound $eb > 0$ is specified, a representative symbol is selected according to the selection criterion (Line 7). If a qualified symbol exists to represent the column, the algorithm outputs it (Lines 15-18). Otherwise, the items in the column are retained and wrapped by a pair of '/' (Lines 9-13). The process repeats until all columns are examined. Afterward, the merged sequence $S''$ is generated.

Fig. 6. The Merge algorithm.

Following the example shown in Fig. 5a, with a specified error bound $eb$ of 1, the Merge algorithm generates the solution shown in Fig. 7; the merged sequence contains only 20 items, i.e., 40 items are curtailed.

Fig. 7. An example of the Merge algorithm: the merged sequence with $eb = 1$.

5.2 Entropy Reduction Phase

In the entropy reduction phase, we propose the Replace algorithm to minimize the entropy of the merged sequence obtained in the sequence merge phase. Since data with lower entropy require fewer bits for storage and transmission [17], we replace some items to reduce the entropy without loss of information. The object movement patterns discovered by our distributed mining algorithm enable us to find the replaceable items and facilitate the selection of items in our compression algorithm. In this section, we first introduce and define the HIR problem, and then, explore the properties of Shannon's entropy to solve the HIR problem. We extend the concentration property for entropy reduction and discuss the benefits of replacing multiple symbols simultaneously. We derive three replacement rules for the HIR problem and prove that the entropy of the obtained solution is minimized.

5.2.1 Three Derived Replacement Rules

Let $P = \{p_0, p_1, \ldots, p_{|\Sigma|-1}\}$ denote the probability distribution of the alphabet $\Sigma$ corresponding to a location sequence $S$, where $p_i$ is the occurrence probability of $\sigma_i$ in $\Sigma$, defined as $p_i = \frac{|\{s_j \mid s_j = \sigma_i, 0 \le j < L\}|}{L}$, and $L = |S|$. According to Shannon's source coding theorem, the optimal code length for symbol $\sigma_i$ is $-\log_2 p_i$, which is also called the information content of $\sigma_i$. Information entropy (or Shannon's entropy) is thereby defined as the overall sum of the information content over $\Sigma$,

$$e(S) = e(p_0, p_1, \ldots, p_{|\Sigma|-1}) = \sum_{0 \le i < |\Sigma|} -p_i \times \log_2 p_i. \tag{2}$$

Shannon's entropy represents the optimal average code length in data compression, where the length of a symbol's codeword is proportional to its information content [17].

A property of Shannon's entropy is that the entropy is at its maximum when all probabilities are of the same value. Thus, in the case without considering the regularity in the movements of objects, the occurrence probabilities of symbols are assumed to be uniform such that the entropy of a location sequence as well as the compression ratio is fixed and dependent on the size of a sensor cluster, e.g., for a location sequence with length $D$ collected in a sensor cluster of 16 nodes, the entropy of the sequence is $e(\frac{1}{16}, \frac{1}{16}, \ldots, \frac{1}{16}) = 4$. Consequently, $4D$ bits are needed to represent the location sequence. Nevertheless, since the movements of a moving object are of some regularity, the occurrence probabilities of symbols are probably skewed and the entropy is lower. For example, if two probabilities become $\frac{1}{32}$ and $\frac{3}{32}$, the entropy is reduced to 3.97. Thus, instead of encoding the sequence using 4 bits per symbol, only $3.97D$ bits are required for the same information. This example shows that when a moving object exhibits some degree of regularity in its movements, the skewness of these probabilities lowers the entropy. Seeing that data with lower entropy require fewer bits to represent the same information, reducing the entropy thereby benefits data compression and, by extension, storage and transmission.

Motivated by the above observation, we design the Replace algorithm to reduce the entropy of a location sequence. Our algorithm imposes the hit symbols on the location sequence to increase the skewness. Specifically, the algorithm uses the group movement patterns built in both the transmitter (CH) and the receiver (sink) as the prediction model to decide whether an item of a sequence is predictable. A CH replaces each of the predictable items with a hit symbol to reduce the location sequence's entropy when compressing it. After receiving the compressed sequence, the sink node decompresses it and substitutes every hit symbol with the original symbol by the identical prediction model, and no information loss occurs.

Here, an item $s_i$ of a sequence $S$ is a predictable item if $predict\_next(T, S[0..i-1])$ is the same value as $s_i$. A symbol is a predictable symbol once an item of the symbol is predictable. For ease of explanation, we use a taglst (see footnote 4) to

4. A taglst associated with a sequence $S$ is a sequence of 0s and 1s obtained as follows: For those predictable items in $S$, the corresponding items in taglst are set as 1. Otherwise, their values in taglst are 0.
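The entropy figures quoted for the 16-node cluster in Section 5.2.1 can be checked with a few lines of Python. This is a minimal sketch of (2), not code from the paper; the helper names `entropy` and `sequence_entropy` are hypothetical, and the skewed distribution assumes that two of the sixteen 1/16 probabilities are changed to 1/32 and 3/32 while the other fourteen stay at 1/16.

```python
import math
from collections import Counter
from typing import Iterable


def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sequence_entropy(seq: str) -> float:
    """Entropy of the empirical symbol distribution of a location sequence."""
    counts = Counter(seq)
    return entropy(c / len(seq) for c in counts.values())


uniform = [1 / 16] * 16                      # no regularity: 4 bits per symbol
skewed = [1 / 16] * 14 + [1 / 32, 3 / 32]    # two probabilities become skewed

print(entropy(uniform))
print(entropy(skewed))
```

Under these assumptions the uniform case gives exactly 4 bits and the skewed case gives about 3.976 bits, which the text quotes as 3.97.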
express whether each item of $S$ is predictable in the following sections. Consider the illustrative example shown in Fig. 8: the probability distribution of $\Sigma$ corresponding to $S$ is $P = \{0.04, 0.16, 0.04, 0, 0, 0.08, 0.04, 0, 0.04, 0.12, 0.24, 0, 0, 0.12, 0.12, 0\}$, and items 1, 2, 7, 9, 11, 12, 15, and 22 are predictable items. To reduce the entropy of the sequence, a simple approach is to replace each of the predictable items with the hit symbol and obtain an intermediate sequence $S'$ = "n..kfbb.n.o..ij.bbnkkk.gc". Compared with the original sequence $S$ with entropy 3.053, the entropy of $S'$ is reduced to 2.854. Encoding $S$ and $S'$ by the Huffman coding technique, the lengths of the output bit streams are 77 and 73 bits, respectively, i.e., 4 bits are conserved by the simple approach.

Fig. 8. An example of the simple method. The first row represents the index, and the second row is the location sequence. The third row is the taglst, which represents the prediction results. The last row is the result of the simple approach. Note that items 1, 2, 7, 9, 11, 12, 15, and 22 are predictable items such that the set of predictable symbols is $\hat{s}$ = {k, a, j, f, o}. In the example, items 1, 9, and 10 are items of 'o' such that the number of items of symbol 'o' is $n('o') = 3$, whereas items 1 and 9 are the predictable items of 'o', and the number of predictable items of symbol 'o' is $n_{hit}('o') = 2$.

However, the above simple approach does not always minimize the entropy. Consider the example shown in Fig. 9a: an intermediate sequence with items 1 and 19 unreplaced has lower entropy than that generated by the simple approach. For the example shown in Fig. 9b, the simple approach even increases the entropy from 2.883 to 2.963.

Fig. 9. Problems of the simple approach.

We define the above problem as the HIR problem and formulate it as follows:

Definition 3 (HIR problem). Given a sequence $S = \{s_i \mid s_i \in \Sigma, 0 \le i < L\}$ and a taglst, an intermediate sequence is a generation of $S$, denoted by $S' = \{s'_i \mid 0 \le i < L\}$, where $s'_i$ is equal to $s_i$ if $taglst[i] = 0$. Otherwise, $s'_i$ is equal to $s_i$ or '.'. The HIR problem is to find the intermediate sequence $S'$ such that the entropy of $S'$ is minimal for all possible intermediate sequences.

A brute-force method for the HIR problem is to enumerate all possible intermediate sequences to find the optimal solution. However, this brute-force approach is not scalable, especially when the number of the predictable items is large. Therefore, to solve the HIR problem, we explore properties of Shannon's entropy to derive three replacement rules that our Replace algorithm leverages to obtain the optimal solution. Here, we list the five most relevant properties to explain the replacement rules. The first four properties can be obtained from [17], [53], [54], while the fifth, called the strong concentration property, is derived and proved in the paper (see footnote 5).

Property 1 (Expansibility). Adding a probability with a value of zero does not change the entropy, i.e., $e(p_0, p_1, \ldots, p_{|\Sigma|-1}, p_{|\Sigma|})$ is identical to $e(p_0, p_1, \ldots, p_{|\Sigma|-1})$ when $p_{|\Sigma|}$ is zero.

According to Property 1, we add a new symbol '.' to $\Sigma$ without affecting the entropy and denote its probability as $p_{16}$ such that $P$ is $\{p_0, p_1, \ldots, p_{15}, p_{16} = 0\}$.

Property 2 (Symmetry). Any permutation of the probability values does not change the entropy. For example, $e(0.1, 0.4, 0.5)$ is identical to $e(0.4, 0.5, 0.1)$.

Property 3 (Accumulation). Moving all the value from one probability to another such that the former can be thought of as being eliminated decreases the entropy, i.e., $e(p_0, p_1, \ldots, 0, \ldots, p_i + p_j, \ldots, p_{|\Sigma|-1})$ is equal to or less than $e(p_0, p_1, \ldots, p_i, \ldots, p_j, \ldots, p_{|\Sigma|-1})$.

With Properties 2 and 3, if all the items of symbol $\sigma$ are predictable, i.e., $n(\sigma) = n_{hit}(\sigma)$, replacing all the items of $\sigma$ by '.' will not affect the entropy. If there are multiple symbols having $n(\sigma) = n_{hit}(\sigma)$, replacing all the items of these symbols can reduce the entropy. Thus, we derive the first replacement rule, the accumulation rule: Replace all items of symbol $\sigma$ in $\hat{s}$, where $n(\sigma) = n_{hit}(\sigma)$.

Property 4 (Concentration). For two probabilities, moving a value from the lower probability to the higher probability decreases the entropy, i.e., if $p_i \le p_j$, for $0 < \Delta p \le p_i$, $e(p_0, p_1, \ldots, p_i - \Delta p, \ldots, p_j + \Delta p, \ldots, p_{|\Sigma|-1})$ is less than $e(p_0, p_1, \ldots, p_i, \ldots, p_j, \ldots, p_{|\Sigma|-1})$.

Property 5 (Strong concentration). For two probabilities, moving a value that is larger than the difference of the two probabilities from the higher probability to the lower one decreases the entropy, i.e., if $p_i > p_j$, for $p_i - p_j < \Delta p \le p_i$,

5. Due to the page limit, the proofs of the strong concentration rule and Lemmas 1-4 are given in Appendices C and D, respectively, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
$e(p_0, p_1, \ldots, p_i - \Delta p, \ldots, p_j + \Delta p, \ldots, p_{|\Sigma|-1})$ is less than $e(p_0, p_1, \ldots, p_i, \ldots, p_j, \ldots, p_{|\Sigma|-1})$.

According to Properties 4 and 5, we conclude that if the difference of two probabilities increases, the entropy decreases. When the conditions conform to Properties 4 and 5, we further prove that the entropy is monotonically reduced as $\Delta p$ increases such that replacing all predictable items of a qualified symbol reduces the entropy the most, as stated in Lemmas 1 and 2.

Definition 4 (Concentration function). For a probability distribution $P = \{p_0, p_1, \ldots, p_{|\Sigma|-1}\}$, we define the concentration function of $k$ probabilities $p_{i_1}, p_{i_2}, \ldots, p_{i_k}$, and $p_j$ as

$$f_{i_1 i_2 \ldots i_k j}(x_1, x_2, \ldots, x_k) = e\left(p_0, \ldots, p_{i_1} - x_1, \ldots, p_{i_2} - x_2, \ldots, p_{i_k} - x_k, \ldots, p_j + \sum_{i=1}^{k} x_i, \ldots, p_{|\Sigma|-1}\right).$$

Lemma 1. If $p_i \le p_j$, for any $x$ such that $0 \le x \le p_i$, the concentration function of $p_i$ and $p_j$ is monotonically decreasing.

Lemma 2. If $p_i > p_j$, for any $x$ such that $p_i - p_j \le x \le p_i$, the concentration function $f_{ij}(x)$ of $p_i$ and $p_j$ is monotonically decreasing.

Accordingly, we derive the second replacement rule, the concentration rule: Replace all predictable items of symbol $\sigma$ in $\hat{s}$, where $n(\sigma) \le n('.')$ or $n_{hit}(\sigma) > n(\sigma) - n('.')$.

As an extension of the above properties, we also explore the entropy variation when predictable items of multiple symbols are replaced simultaneously. Let $\hat{s}'$ denote a subset of $\hat{s}$, where $|\hat{s}'| \ge 2$. We prove that if replacing predictable symbols in $\hat{s}'$ can minimize the entropy corresponding to symbols in $\hat{s}'$, the number of hit symbols after the replacement must be larger than the number of the predictable symbols after the replacement, as stated in Lemma 3. To investigate whether the converse statement holds, we conduct an experiment in a brute-force way. However, the experimental results show that even under the condition that $n('.') + \sum_{\forall \sigma' \in \hat{s}'} n_{hit}(\sigma') \ge n(\sigma) - n_{hit}(\sigma)$ for every $\sigma$ in $\hat{s}'$, replacing predictable items of the symbols in $\hat{s}'$ does not guarantee the reduction of the entropy. Therefore, we compare the difference of the entropy before and after replacing symbols in $\hat{s}'$ as

$$gain(\hat{s}') = -\sum_{\forall \sigma \in \hat{s}'} p_\sigma \times \log_2 \frac{p_\sigma}{p_\sigma - \Delta p_\sigma} - \sum_{\forall \sigma \in \hat{s}'} \Delta p_\sigma \times \log_2 (p_\sigma - \Delta p_\sigma) - p_{'.'} \times \log_2 \frac{p_{'.'}}{p_{'.'} + \sum_{\forall \sigma \in \hat{s}'} \Delta p_\sigma} + \sum_{\forall \sigma \in \hat{s}'} \Delta p_\sigma \times \log_2 \left(p_{'.'} + \sum_{\forall \sigma \in \hat{s}'} \Delta p_\sigma\right).$$

In addition, we also prove that once replacing some of the predictable items of symbols in $\hat{s}'$ reduces entropy, replacing all predictable items of these symbols reduces the entropy the most since the entropy decreases monotonically, as stated in Lemma 4. We thereby derive the third replacement rule, the multiple symbol rule: Replace all of the predictable items of every symbol in $\hat{s}'$ if $gain(\hat{s}') > 0$.

Lemma 3. If there exist $0 \le a_n \le p_{i_n}$, $n = 1, \ldots, k$, such that $f_{i_1 i_2 \ldots i_k j}(a_1, a_2, \ldots, a_k)$ is minimal, we have $p_j + \sum_{i=1}^{k} a_i \ge p_{i_n} - a_n$, $n = 1, \ldots, k$.

Lemma 4. If there exist $x_{min_1}, x_{min_2}, \ldots$, and $x_{min_k}$ such that $p_j + \sum_{i=1}^{k} x_{min_i} > p_{i_n} - x_{min_n}$ for $n = 1, 2, \ldots, k$, $f_{i_1 i_2 \ldots i_k j}(x_1, x_2, \ldots, x_k)$ is monotonically decreasing for $x_{min_n} \le x_n \le p_{i_n}$, $n = 1, \ldots, k$.

5.2.2 The Replace Algorithm

Based on the observations described in the previous section, we propose the Replace algorithm that leverages the three replacement rules to obtain the optimal solution for the HIR problem. Our algorithm examines the predictable symbols on their statistics, which include the number of items and the number of predictable items of each predictable symbol. The algorithm first replaces the qualified symbols according to the accumulation rule. Afterward, since the concentration rule and the multiple symbol rule are related to $n('.')$, which is increased after every replacement, the algorithm iteratively replaces the qualified symbols according to the two rules until all qualified symbols are replaced. The algorithm thereby replaces qualified symbols and reduces the entropy toward the optimum gradually. Compared with the brute-force method that enumerates all possible intermediate sequences for the optimum in exponential complexity, the Replace algorithm, which leverages the derived rules to obtain the optimal solution in $O(L)$ time (see footnote 6), is more scalable and efficient. We prove that the Replace algorithm is guaranteed to reduce the entropy monotonically and obtains the optimal solution of the HIR problem as Theorem 1. Next, we detail the Replace algorithm and demonstrate it with an illustrative example.

Theorem 1 (see footnote 7). The Replace algorithm obtains the optimal solution of the HIR problem.

Fig. 10 shows the Replace algorithm. The input includes a location sequence $S$ and a predictor $T_g$, while the output, denoted by $S'$, is a sequence in which qualified items are replaced by '.'. Initially, Lines 3-9 of the algorithm find the set of predictable symbols together with their statistics. Then, it examines the statistics of the predictable symbols according to the three replacement rules as follows: First, according to the accumulation rule, it replaces qualified symbols in one scan of the predictable symbols (Lines 10-14). Next, the algorithm iteratively checks the concentration and the multiple symbol rules by two loops. The first loop from Line 16 to Line 22 is for the concentration rule, whereas the second loop from Line 25 to Line 36 is for the multiple symbol rule. In our design, since finding a combination of predictable symbols to make $gain(\hat{s}') > 0$ hold is more costly, the algorithm is prone to replace symbols with the

6. The complexity analysis is given in Appendix E, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
7. The proof of Theorem 1 is given in Appendix F, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.30.
concentration rule. Specifically, after a scan of predictable symbols for the second rule (Lines 17-23), the algorithm searches for a combination of symbols in $\hat{s}$ to make the condition of the multiple symbol rule hold (Lines 26-36); it starts with a combination of two symbols, i.e., $m = 2$. Once a combination $\hat{s}'$ makes $gain(\hat{s}') > 0$ hold, the enumeration procedure stops (Line 31) and the algorithm goes back to the first loop. Otherwise, after an exhaustive search for any combination of $m$ symbols, it goes on examining the combinations of $m + 1$ symbols. The process repeats until $\hat{s}'$ contains all of the symbols in $\hat{s}$.

Fig. 10. The Replace algorithm.

In the following, we explain our Replace algorithm with an illustrative example. Given a sequence $S$ = "nokkfbbanookjijfbbnkkkjgc" with entropy 3.053, as shown in Fig. 11, our algorithm generates the statistic table and taglst shown in Fig. 11a. In this example, $\hat{s}$ = {k, j, o, a, f}, and the numbers of items of the symbols are 6, 3, 3, 1, and 2, whereas the numbers of predictable items of the symbols are 2, 2, 2, 1, and 1, respectively. First, according to the accumulation rule, the predictable items of 'a' are replaced due to $n('a')$ being equal to $n_{hit}('a')$ (Lines 10-14). After that, the statistic table is updated, as shown in Fig. 11b. Second, according to the multiple symbol rule, we replace the predictable items of 'j' and 'o' simultaneously such that the entropy of $S'$ is reduced to 2.969. Next, because $n('f')$ is less than $n('.')$, the predictable items of 'f' are replaced according to the concentration rule (Lines 17-23), and the entropy of $S'$ is reduced to 2.893. Then, since $n('k')$ is equal to $n('.')$ and $n_{hit}('k')$ is greater than $n('k')$ minus $n('.')$, the predictable items of symbol 'k' are replaced according to the concentration rule. Finally, no other candidate is available, and our algorithm outputs $S'$ with entropy 2.854. In this example, all predictable items are replaced to minimize the entropy.

Fig. 11. An example of the Replace algorithm.

In addition, for the example shown in Fig. 5, the Replace algorithm reduces the entropies of $S_0$, $S_1$, and $S_2$ from 3.171, 2.933, and 2.871 to 2.458, 2.828, and 2.664, respectively, and encoding $S'_0$, $S'_1$, and $S'_2$ reduces the sum of output bitstreams from 181 to 161 bits. On the other hand, when the specified error bound $eb$ is 0 and 1, by fully utilizing the group movement patterns, the 2P2D algorithm reduces the total data size to 153 and 47 bits, respectively; hence, 15.5 and 74 percent of the data volume are saved, respectively.

5.3 Segmentation, Alignment, and Packaging

In an online update approach, sensor nodes are assigned a tracking task to update the sink with the location of moving objects at every tracking interval. In contrast to the online
approach, the CHs in our batch-based approach accumulate a large volume of location data for a batch period before compressing and transmitting it to the sink, and the location update process repeats from batch to batch. In real-world tracking scenarios, slight irregularities in the movements of a group of moving objects may exist in the microcosmic view. Specifically, a group of objects may enter a sensor cluster at slightly different times and stay in a sensor cluster for slightly different periods, which leads to the alignment problem among the location sequences. Moreover, since the trajectories of moving objects may span multiple sensor clusters, and the objects may enter and leave a cluster multiple times during a batch period, a location sequence may comprise multiple segments, each of which is a trajectory that is continuous in the time domain. To deal with the alignment and segmentation problems, we partition location sequences into segments, and then compress and package them into one update packet.

Consider the group of three sequences shown in Fig. 12a: the segments $E_1$, $E_2$, and $E_3$ are aligned and named G-segments, whereas segments A, B, C, and D are named S-segments. Figs. 12b, 12c, and 12d show an illustrative example of constructing the frame for the three sequences. First, the Merge algorithm combines $E_1$, $E_2$, and $E_3$ to generate an intermediate sequence $S''_E$. Next, $S''_E$ together with A, B, C, and D is viewed as a sequence and processed by the Replace algorithm to generate an intermediate sequence $S'$, which comprises $S'_A$, $S'_B$, $S'_C$, $S'_D$, and $S'_E$. Finally, the intermediate sequence $S'$ is compressed and packed.

Fig. 12. An example of constructing an update packet. (a) Three sequences aligned in time domain. (G-segments: $E_1$, $E_2$, and $E_3$; S-segments: A, B, C, and D.) (b) Combining G-segments by the Merge algorithm. (c) Replacing the predictable items by the Replace algorithm. (d) Compressing and packing to generate the payload of the update packet.

For a batch period of $D$ tracking intervals, the location data of a group of $n$ objects are aggregated in one packet such that $(n \times D - 1)$ packet headers are eliminated. The payload may comprise multiple G-segments or S-segments, each of which includes a beginning time stamp ($a$ bits), a sequence of consecutive locations ($b$ bits for each), an object or group ID ($c$ bits), and a field representing the length of a segment ($l$ bits). Therefore, the payload size is calculated as $\sum_i (a + D_i \times b + c + l)$, where $D_i$ is the length of the $i$th segment. By exploiting the correlations in the location data, we can further compress the location data and reduce the amount of data to $H + \sum_i (a + c + l) + n \times D \times b \times \frac{1}{r}$, where $r$ denotes the compression ratio of our compression algorithm and $H$ denotes the data size of the packet header.

As for the online update approach, when a sensor node detects an object of interest, it sends an update packet upward to the sink. The payload of a packet includes the time stamp, location, and object ID such that the packet size is $H + a + b + c$. Some approaches, such as [55], employ techniques like location prediction to reduce the number of transmitted update packets. For $D$ tracking intervals, the amount of data for tracking $n$ objects is $D \times (H + a + b + c) \times (1 - p) \times n$, where $p$ is the prediction hit rate.

Therefore, the group size, the number of segments, and the compression ratio are important factors that influence the performance of the batch-based approach. In the next section, we conduct experiments to evaluate the performance of our design.

6 EXPERIMENT AND ANALYSIS

We implement an event-driven simulator in C++ with SIM [56] to evaluate the performance of our design. To the best of our knowledge, no research work has been dedicated to discovering application-level semantics for location data compression. We compare our batch-based approach with an online approach for the overall system performance evaluation and study the impact of the group size ($n$), as well as the group dispersion radius (GDR), the batch period ($D$), and the error bound of accuracy (eb). We also compare our Replace algorithm with the Huffman encoding technique to show its effectiveness. Since there is no related work that finds real location data of groups of moving objects, we generate the location data, i.e., the coordinates $(x, y)$, with the Reference Point Group Mobility Model [57] for a group of objects moving in a two-layer tracking network with 256 nodes. A location-dependent mobility model [58] is used to simulate the roaming behavior of a group leader; the other member objects are followers that are uniformly distributed within a specified group dispersion radius (GDR) of the leader, where the GDR is the maximal hop count between followers and the leader. We utilize the GDR to control the dispersion degree of the objects. A smaller GDR implies stronger group relationships, i.e., objects are closer together. The speed of each object is 1 node per time unit, and the tracking interval is 0.5 time unit. In addition, the starting point and the furthest point reached by the leader object are randomly selected, and the movement range of a group of objects is the Euclidean distance between the two points. Note that we take the group leader as a virtual object to control the roaming behavior of a group of moving objects and exclude it in calculating the data traffic.

In the following experiments, the default values of $n$, $D$, $d$, and eb are 5, 1,000, 6, and 0. The data sizes of the object (or group) ID, location ID, time stamp, and packet header are 1, 1, 1, and 4 bytes, respectively. We set the PST parameters $(L_{max}, P_{min}, \alpha, \gamma_{min}, r) = (5, 0.01, 0, 0.0001, 1.2)$ empirically in learning the movement patterns. Moreover, we use the amount of data in kilobytes (KB) and the compression ratio ($r$) as the evaluation metrics, where the compression ratio is
defined as the ratio between the uncompressed data size and the compressed data size, i.e., $r = \frac{\text{uncompressed data size}}{\text{compressed data size}}$.

First, we compare the amount of data of our batch-based approach (batch) with that of an online update approach (online). In addition, some approaches, such as [55], employ techniques like location prediction to reduce the number of transmitted update packets. We use the discovered movement patterns as the prediction model in the online update approach (online+p). Fig. 13a shows that our batch-based approach outperforms the online approach with and without prediction. The amount of data of our batch-based approach is relatively low and stable as the GDR increases. Compared with the online approach, the compression ratios of our batch approach and the online approach with prediction are about 15.0 and 2.5 when GDR = 1.

Next, our compression algorithm utilizes the group relationships to reduce the data size. Fig. 13b shows the impact of the group size. The amount of data per object decreases as the group size increases. Compared with carrying the location data for a single object by an individual packet, our batch-based approach aggregates and compresses packets of multiple objects such that the amount of data decreases as the group size increases. Moreover, our algorithm achieves a high compression ratio in two ways. First, when more sequences that are similar, or sequences that are more similar, are compressed simultaneously, the Merge algorithm achieves a higher compression ratio. Second, with the regularity in the movements of a group of objects, the Replace algorithm minimizes the entropy, which also leads to a higher compression ratio. Note that we use the GDR to control the group dispersion range of the input workload. The leader object's movement path together with the GDR sets up a spacious area where the member objects are randomly distributed. Therefore, a larger GDR implies that the location sequences have higher entropy, which degrades both the prediction hit rate and the compression ratio. Consequently, a larger group size and a smaller GDR result in a higher compression ratio.

Fig. 13. Performance comparison of the batch-based and online update approaches and the impact of the group size. (a) Comparison of the batch-based and online approaches. (b) Impact of group size.

Fig. 14a shows the impact of the batch period ($D$). The amount of data decreases as the batch period increases. Since more packets are aggregated and more data are compressed for a longer batch period, our batch-based approach reduces both the data volume of packet headers and the location data.

Since the accuracy of sensor networks is inherently limited, allowing approximation of sensors' readings or tolerating a loss of accuracy is a compromise between data accuracy and energy conservation. We study the impact of accuracy on the amount of data. Fig. 14b shows that by tolerating a loss of accuracy with eb varying from 1 to 3, the amount of data decreases. When GDR = 1, the compression ratio $r$ for eb = 3 is about 21.2, while the compression ratio $r$ for eb = 0 is about 15.0.

Fig. 14. Impacts of the batch period and accuracy. (a) Impact of batch period. (b) Impact of accuracy.

We study the effectiveness of the Replace algorithm by comparing the compression ratios of the Huffman encoding with and without our Replace algorithm. As GDR varies from 0.1 to 1, Fig. 15a shows the compression ratios of the Huffman encoding with and without our Replace algorithm, while Fig. 15b shows the prediction hit rate. Compared with Huffman, our Replace algorithm achieves a higher compression ratio, e.g., the compression ratio of our approach is about 4, while that of Huffman is about 2.65 when GDR = 0.1. Figs. 15a and 15b show that the compression ratio that the Replace algorithm achieves decreases as the prediction hit rate decreases. When the prediction hit rate is about 0.6, the compression ratio of our design is about 2.7, which is higher than the 2.3 of Huffman.

7 CONCLUSIONS

In this work, we exploit the characteristics of group movements to discover the information about groups of moving objects in tracking applications. We propose a distributed mining algorithm, which consists of a local GMPMine
algorithm and a CE algorithm, to discover group movement patterns. With the discovered information, we devise the 2P2D algorithm, which comprises a sequence merge phase and an entropy reduction phase. In the sequence merge phase, we propose the Merge algorithm to merge the location sequences of a group of moving objects with the goal of reducing the overall sequence length. In the entropy reduction phase, we formulate the HIR problem and propose a Replace algorithm to tackle the HIR problem. In addition, we devise and prove three replacement rules, with which the Replace algorithm obtains the optimal solution of HIR efficiently. Our experimental results show that the proposed compression algorithm effectively reduces the amount of delivered data and enhances compressibility and, by extension, reduces the energy consumption expense for data transmission in WSNs.

Fig. 15. Effectiveness of the Replace algorithm. (a) Compression ratio versus GDR. (b) Prediction hit rate versus GDR.

ACKNOWLEDGMENTS

The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contracts NSC99-2218-E-005-002 and NSC99-2219-E-001-001.

REFERENCES

[1] S.S. Pradhan, J. Kusuma, and K. Ramchandran, “Distributed Compression in a Dense Microsensor Network,” IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 51-60, Mar. 2002.
[2] A. Scaglione and S.D. Servetto, “On the Interdependence of Routing and Data Compression in Multi-Hop Sensor Networks,” Proc. Eighth Ann. Int’l Conf. Mobile Computing and Networking, pp. 140-147, 2002.
[3] N. Meratnia and R.A. de By, “A New Perspective on Trajectory Compression Techniques,” Proc. ISPRS Commission II and IV, WG II/5, II/6, IV/1 and IV/2 Joint Workshop Spatial, Temporal and Multi-Dimensional Data Modelling and Analysis, Oct. 2003.
[4] S. Baek, G. de Veciana, and X. Su, “Minimizing Energy Consumption in Large-Scale Sensor Networks through Distributed Data Compression and Hierarchical Aggregation,” IEEE J. Selected Areas in Comm., vol. 22, no. 6, pp. 1130-1140, Aug. 2004.
[5] C.M. Sadler and M. Martonosi, “Data Compression Algorithms for Energy-Constrained Devices in Delay Tolerant Networks,” Proc. ACM Conf. Embedded Networked Sensor Systems, Nov. 2006.
[6] Y. Xu and W.-C. Lee, “Compressing Moving Object Trajectory in Wireless Sensor Networks,” Int’l J. Distributed Sensor Networks, vol. 3, no. 2, pp. 151-174, Apr. 2007.
[7] G. Shannon, B. Page, K. Duffy, and R. Slotow, “African Elephant Home Range and Habitat Selection in Pongola Game Reserve, South Africa,” African Zoology, vol. 41, no. 1, pp. 37-44, Apr. 2006.
[8] C. Roux and R.T.F. Bernard, “Home Range Size, Spatial Distribution and Habitat Use of Elephants in Two Enclosed Game Reserves in the Eastern Cape Province, South Africa,” African J. Ecology, vol. 47, no. 2, pp. 146-153, June 2009.
[9] J. Yang and M. Hu, “Trajpattern: Mining Sequential Patterns from Imprecise Trajectories of Mobile Objects,” Proc. 10th Int’l Conf. Extending Database Technology, pp. 664-681, Mar. 2006.
[10] M. Morzy, “Mining Frequent Trajectories of Moving Objects for Location Prediction,” Proc. Fifth Int’l Conf. Machine Learning and Data Mining in Pattern Recognition, pp. 667-680, July 2007.
[11] F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi, “Trajectory Pattern Mining,” Proc. ACM SIGKDD, pp. 330-339, 2007.
[12] V.S. Tseng and K.W. Lin, “Energy Efficient Strategies for Object Tracking in Sensor Networks: A Data Mining Approach,” J. Systems and Software, vol. 80, no. 10, pp. 1678-1698, 2007.
[13] L. Chen, M. Tamer Özsu, and V. Oria, “Robust and Fast Similarity Search for Moving Object Trajectories,” Proc. ACM SIGMOD, pp. 491-502, 2005.
[14] Y. Wang, E.-P. Lim, and S.-Y. Hwang, “Efficient Mining of Group Patterns from User Movement Data,” Data Knowledge Eng., vol. 57, no. 3, pp. 240-282, 2006.
[15] M. Nanni and D. Pedreschi, “Time-Focused Clustering of Trajectories of Moving Objects,” J. Intelligent Information Systems, vol. 27, no. 3, pp. 267-289, 2006.
[16] H.-P. Tsai, D.-N. Yang, W.-C. Peng, and M.-S. Chen, “Exploring Group Moving Pattern for an Energy-Constrained Object Tracking Sensor Network,” Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining, May 2007.
[17] C.E. Shannon, “A Mathematical Theory of Communication,” J. Bell System Technical, vol. 27, pp. 379-423, 623-656, 1948.
[18] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 11th Int’l Conf. Data Eng., pp. 3-14, 1995.
[19] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu, “Freespan: Frequent Pattern-Projected Sequential Pattern Mining,” Proc. ACM SIGKDD, pp. 355-359, 2000.
[20] M.-S. Chen, J.S. Park, and P.S. Yu, “Efficient Data Mining for Path Traversal Patterns,” Knowledge and Data Eng., vol. 10, no. 2, pp. 209-221, 1998.
[21] W.-C. Peng and M.-S. Chen, “Developing Data Allocation Schemes by Incremental Mining of User Moving Patterns in a Mobile Computing System,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 1, pp. 70-85, Jan./Feb. 2003.
[22] M. Morzy, “Prediction of Moving Object Location Based on Frequent Trajectories,” Proc. 21st Int’l Symp. Computer and Information Sciences, pp. 583-592, Nov. 2006.
[23] V. Guralnik and G. Karypis, “A Scalable Algorithm for Clustering Sequential Data,” Proc. First IEEE Int’l Conf. Data Mining, pp. 179-186, 2001.
[24] J. Yang and W. Wang, “CLUSEQ: Efficient and Effective Sequence Clustering,” Proc. 19th Int’l Conf. Data Eng., pp. 101-112, Mar. 2003.
[25] J. Tang, B. Hao, and A. Sen, “Relay Node Placement in Large Scale Wireless Sensor Networks,” J. Computer Comm., special issue on sensor networks, vol. 29, no. 4, pp. 490-501, 2006.
[26] M. Younis and K. Akkaya, “Strategies and Techniques for Node Placement in Wireless Sensor Networks: A Survey,” Ad Hoc Networks, vol. 6, no. 4, pp. 621-655, 2008.
[27] S. Pandey, S. Dong, P. Agrawal, and K. Sivalingam, “A Hybrid Approach to Optimize Node Placements in Hierarchical Heterogeneous Networks,” Proc. IEEE Conf. Wireless Comm. and Networking Conf., pp. 3918-3923, Mar. 2007.
[28] “Stargate: A Platform x Project,” http://platformx.sourceforge.net, 2010.
[29] “Mica2 Sensor Board,” http://www.xbow.com, 2010.
[30] J.N. Al-Karaki and A.E. Kamal, “Routing Techniques in Wireless Sensor Networks: A Survey,” IEEE Wireless Comm., vol. 11, no. 6, pp. 6-28, Dec. 2004.
[31] J. Hightower and G. Borriello, “Location Systems for Ubiquitous Computing,” Computer, vol. 34, no. 8, pp. 57-66, Aug. 2001.
[32] D. Li, K.D. Wong, Y.H. Hu, and A.M. Sayeed, “Detection, Classification, and Tracking of Targets,” IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 17-30, Mar. 2002.
[33] D. Ron, Y. Singer, and N. Tishby, “Learning Probabilistic Automata with Variable Memory Length,” Proc. Seventh Ann. Conf. Computational Learning Theory, July 1994.
[34] D. Katsaros and Y. Manolopoulos, “A Suffix Tree Based Prediction Scheme for Pervasive Computing Environments,” Proc. Panhellenic Conf. Informatics, pp. 267-277, Nov. 2005.
[35] A. Apostolico and G. Bejerano, “Optimal Amnesic Probabilistic Automata or How to Learn and Classify Proteins in Linear Time and Space,” Proc. Fourth Ann. Int’l Conf. Computational Molecular Biology, pp. 25-32, 2000.
[36] J. Yang and W. Wang, “Agile: A General Approach to Detect Transitions in Evolving Data Streams,” Proc. Fourth IEEE Int’l Conf. Data Mining, pp. 559-562, 2004.
[37] E. Hartuv and R. Shamir, “A Clustering Algorithm Based on Graph Connectivity,” Information Processing Letters, vol. 76, nos. 4-6, pp. 175-181, 2000.
[38] A. Strehl and J. Ghosh, “Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitionings,” Proc. Conf. Artificial Intelligence, pp. 93-98, July 2002.
[39] A.L.N. Fred and A.K. Jain, “Combining Multiple Clusterings Using Evidence Accumulation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, June 2005.
[40] G. Saporta and G. Youness, “Comparing Two Partitions: Some Proposals and Experiments,” Proc. Computational Statistics, Aug. 2002.
[41] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless Sensor Networks: A Survey,” Computer Networks, vol. 38, no. 4, pp. 393-422, 2002.
[42] D. Culler, D. Estrin, and M. Srivastava, “Overview of Sensor Networks,” Computer, special issue on sensor networks, vol. 37, no. 8, pp. 41-49, Aug. 2004.
[43] H.T. Kung and D. Vlah, “Efficient Location Tracking Using Sensor Networks,” Proc. Conf. IEEE Wireless Comm. and Networking, vol. 3, pp. 1954-1961, Mar. 2003.
[44] Y. Xu, J. Winter, and W.-C. Lee, “Dual Prediction-Based Reporting for Object Tracking Sensor Networks,” Proc. First Ann. Int’l Conf. Mobile and Ubiquitous Systems: Networking and Services, pp. 154-163, Aug. 2004.
[45] W. Zhang and G. Cao, “DCTC: Dynamic Convoy Tree-Based Collaboration for Target Tracking in Sensor Networks,” IEEE Trans. Wireless Comm., vol. 3, no. 5, pp. 1689-1701, Sept. 2004.
[46] J. Yick, B. Mukherjee, and D. Ghosal, “Analysis of a Prediction-Based Mobility Adaptive Tracking Algorithm,” Proc. Second Int’l Conf. Broadband Networks, pp. 753-760, Oct. 2005.
[47] C.-Y. Lin, W.-C. Peng, and Y.-C. Tseng, “Efficient In-Network Moving Object Tracking in Wireless Sensor Networks,” IEEE Trans. Mobile Computing, vol. 5, no. 8, pp. 1044-1056, Aug. 2006.
[48] S. Santini and K. Romer, “An Adaptive Strategy for Quality-Based Data Reduction in Wireless Sensor Networks,” Proc. Third Int’l Conf. Networked Sensing Systems, pp. 29-36, June 2006.
[49] G. Mathur, P. Desnoyers, D. Ganesan, and P. Shenoy, “Ultra-Low Power Data Storage for Sensor Networks,” Proc. Fifth Int’l Conf. Information Processing in Sensor Networks, pp. 374-381, Apr. 2006.
[50] P. Dutta, D. Culler, and S. Shenker, “Procrastination Might Lead to a Longer and More Useful Life,” Proc. Sixth Workshop Hot Topics in Networks, Nov. 2007.
[51] Y. Diao, D. Ganesan, G. Mathur, and P.J. Shenoy, “Rethinking Data Management for Storage-Centric Sensor Networks,” Proc. Third Biennial Conf. Innovative Data Systems Research, pp. 22-31, Nov. 2007.
[52] F. Osterlind and A. Dunkels, “Approaching the Maximum 802.15.4 Multi-Hop Throughput,” Proc. Fifth Workshop Embedded Networked Sensors, 2008.
[53] S. Watanabe, “Pattern Recognition as a Quest for Minimum Entropy,” Pattern Recognition, vol. 13, no. 5, pp. 381-387, 1981.
[54] L. Yuan and H.K. Kesavan, “Minimum Entropy and Information Measurement,” IEEE Trans. System, Man, and Cybernetics, vol. 28, no. 3, pp. 488-491, Aug. 1998.
[55] G. Wang, H. Wang, J. Cao, and M. Guo, “Energy-Efficient Dual Prediction-Based Data Gathering for Environmental Monitoring Applications,” Proc. IEEE Wireless Comm. and Networking Conf., Mar. 2007.
[56] D. Bolier, “SIM: A C++ Library for Discrete Event Simulation,” http://www.cs.vu.nl/eliens/sim, Oct. 1995.
[57] X. Hong, M. Gerla, G. Pei, and C. Chiang, “A Group Mobility Model for Ad Hoc Wireless Networks,” Proc. Ninth ACM/IEEE Int’l Symp. Modeling, Analysis and Simulation of Wireless and Mobile Systems, pp. 53-60, Aug. 1999.
[58] B. Gloss, M. Scharf, and D. Neubauer, “Location-Dependent Parameterization of a Random Direction Mobility Model,” Proc. IEEE 63rd Conf. Vehicular Technology, vol. 3, pp. 1068-1072, 2006.
[59] G. Bejerano and G. Yona, “Variations on Probabilistic Suffix Trees: Statistical Modeling and the Prediction of Protein Families,” Bioinformatics, vol. 17, no. 1, pp. 23-43, 2001.
[60] C. Largeron-Leténo, “Prediction Suffix Trees for Supervised Classification of Sequences,” Pattern Recognition Letters, vol. 24, no. 16, pp. 3153-3164, 2003.

Hsiao-Ping Tsai received the BS and MS degrees in computer science and information engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1996 and 1998, respectively, and the PhD degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in January 2009. She is now an assistant professor in the Department of Electrical Engineering (EE), National Chung Hsing University. Her research interests include data mining, sensor data management, object tracking, mobile computing, and wireless data broadcasting. She is a member of the IEEE.

De-Nian Yang received the BS and PhD degrees from the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, in 1999 and 2004, respectively. From 1999 to 2004, he worked at the Internet Research Lab with advisor Prof. Wanjiun Liao. Afterward, he joined the Network Database Lab with advisor Prof. Ming-Syan Chen as a postdoctoral fellow for the military services. He is now an assistant research fellow in the Institute of Information Science (IIS) and the Research Center of Information Technology Innovation (CITI), Academia Sinica.

Ming-Syan Chen received the BS degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, and the MS and PhD degrees in computer, information and control engineering from The University of Michigan, Ann Arbor, in 1985 and 1988, respectively. He is now a distinguished research fellow and the director of Research Center of Information Technology Innovation (CITI) in the Academia Sinica, Taiwan, and is also a distinguished professor jointly appointed by the EE Department, CSIE Department, and Graduate Institute of Communication Engineering (GICE) at National Taiwan University. He was a research staff member at IBM Thomas J. Watson Research Center, Yorktown Heights, New York, from 1988 to 1996, the director of GICE from 2003 to 2006, and also the president/CEO in the Institute for Information Industry (III), which is one of the largest organizations for information technology in Taiwan, from 2007 to 2008. His research interests include databases, data mining, mobile computing systems, and multimedia networking. He has published more than 300 papers in his research areas. In addition to serving as program chairs/vice chairs and keynote/tutorial speakers in many international conferences, he was an associate editor of the IEEE Transactions on Knowledge and Data Engineering and also the Journal of Information Systems Education, is currently the editor in chief of the International Journal of Electrical Engineering (IJEE), and is on the editorial board of the Very Large Data Base (VLDB) Journal and the Knowledge and Information Systems (KAIS) Journal. He holds, or has applied for, 18 US patents and seven ROC patents in his research areas. He is a recipient of the National Science Council (NSC) Distinguished Research Award, Pan Wen Yuan Distinguished Research Award, Teco Award, Honorary Medal of Information, and K.-T. Li Research Breakthrough Award for his research work, and also the Outstanding Innovation Award from IBM Corporate for his contribution to a major database product. He also received numerous awards for his research, teaching, inventions, and patent applications. He is a fellow of the ACM and the IEEE.
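
As a supplementary illustration of the data-volume accounting in Section 5.3, the following minimal Python sketch evaluates the three expressions given there: the uncompressed payload size, the compressed batch packet size, and the online update volume. It is not the authors' simulator (which is written in C++ with SIM). The function names are hypothetical, and the length-field size l = 8 bits and the single-segment batch in the example are assumptions, since the paper states neither.

```python
def batch_payload_bits(segment_lengths, a, b, c, l):
    """Uncompressed payload size: sum over segments of (a + D_i*b + c + l)."""
    return sum(a + d_i * b + c + l for d_i in segment_lengths)


def batch_packet_bits(num_segments, n, D, H, a, b, c, l, r):
    """Compressed batch packet size: H + sum_i (a + c + l) + n*D*b/r."""
    return H + num_segments * (a + c + l) + n * D * b / r


def online_bits(n, D, H, a, b, c, p=0.0):
    """Online update volume: D * (H + a + b + c) * (1 - p) * n."""
    return D * (H + a + b + c) * (1.0 - p) * n


if __name__ == "__main__":
    # Field sizes from Section 6 (1, 1, 1, and 4 bytes), expressed in bits.
    a = b = c = 8
    H = 32
    l = 8            # assumption: the length-field size is not stated in the paper
    n, D = 5, 1000   # default group size and batch period from Section 6
    r = 15.0         # roughly the compression ratio reported for GDR = 1, eb = 0
    segments = 1     # assumption: the whole batch forms a single G-segment

    batch = batch_packet_bits(segments, n, D, H, a, b, c, l, r)
    online = online_bits(n, D, H, a, b, c)
    print(f"batch: {batch / 8:.0f} bytes, online: {online / 8:.0f} bytes")
```

Varying `r`, `p`, and `num_segments` in this sketch shows why the group size, the number of segments, and the compression ratio are named as the main factors in the batch-based approach: the per-segment overhead and the header are paid once per packet, while the location term shrinks by a factor of r.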