Creating evolving user behavior profiles automatically.bak


Published on

  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Creating evolving user behavior profiles automatically.bak

  1. 1. 854 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 Creating Evolving User Behavior Profiles Automatically Jose Antonio Iglesias, Member, IEEE Computer Society, Plamen Angelov, Senior Member, IEEE Computer Society, Agapito Ledezma, Member, IEEE Computer Society, and Araceli Sanchis, Member, IEEE Computer Society Abstract—Knowledge about computer users is very beneficial for assisting them, predicting their future actions or detecting masqueraders. In this paper, a new approach for creating and recognizing automatically the behavior profile of a computer user is presented. In this case, a computer user behavior is represented as the sequence of the commands she/he types during her/his work. This sequence is transformed into a distribution of relevant subsequences of commands in order to find out a profile that defines its behavior. Also, because a user profile is not necessarily fixed but rather it evolves/changes, we propose an evolving method to keep up to date the created profiles using an Evolving Systems approach. In this paper, we combine the evolving classifier with a trie-based user profiling to obtain a powerful self-learning online scheme. We also develop further the recursive formula of the potential of a data point to become a cluster center using cosine distance, which is provided in the Appendix. The novel approach proposed in this paper can be applicable to any problem of dynamic/evolving user behavior modeling where it can be represented as a sequence of actions or events. It has been evaluated on several real data streams. Index Terms—Evolving fuzzy systems, fuzzy-rule-based (FRB) classifiers, user modeling. Ç1 INTRODUCTIONR ECOGNIZING the behavior of others in real time is a In this paper, we propose an adaptive approach for significant aspect of different human tasks in many creating behavior profiles and recognizing computer users. http://ieeexploreprojects.blogspot.comdifferent environments. When this process is carried out by We call this approach Evolving Agent behavior Classificationsoftware agents or robots, it is known as user modeling. The based on Distributions of relevant events (EVABCD) and it isrecognition of users can be very beneficial for assisting them based on representing the observed behavior of an agentor predicting their future actions. Most existing techniques (computer user) as an adaptive distribution of her/hisfor user recognition assume the availability of handcrafteduser profiles, which encode the a-priori known behavioral relevant atomic behaviors (events). Once the model has beenrepertoire of the observed user. However, the construction created, EVABCD presents an evolving method for updatingof effective user profiles is a difficult problem for different and evolving the user profiles and classifying an observedreasons: human behavior is often erratic, and sometimes user. The approach we present is generalizable to all kinds ofhumans behave differently because of a change in their user behaviors represented by a sequence of events.goals. This last problem makes necessary that the user The UNIX operating system environment is used in thisprofiles we create evolve. research for explaining and evaluating EVABCD. A user There exists several definitions for user profile [1]. It can behavior is represented in this case by a sequence of UNIXbe defined as the description of the user interests, commands typed by a computer user in a command-linecharacteristics, behaviors, and preferences. User profiling interface. Previous research studies in this environment [3],is the practice of gathering, organizing, and interpreting the [4] focus on detecting masquerades (individuals whouser profile information. In recent years, significant workhas been carried out for profiling users, but most of the user impersonate other users on computer networks and sys-profiles do not change according to the environment and tems) from sequences of UNIX commands. However,new goals of the user. An example of how to create these EVABCD creates evolving user profiles and classifies newstatic profiles is proposed in a previous work [2]. users into one of the previously created profiles. Thus, the goal of EVABCD in the UNIX environment can be divided into two phases:. J.A. Iglesias, A. Ledezma, and A. Sanchis are with the CAOS Group, Carlos III University of Madrid, Spain. 1. Creating and updating user profiles from the E-mail: {jiglesia, ledezma, masm} commands the users typed in a UNIX shell.. P. Angelov is with the Department of Communication Systems, InfoLab21, Lancaster University, United Kingdom. 2. Classifying a new sequence of commands into the E-mail: predefined profiles.Manuscript received 29 May 2009; revised 18 Nov. 2009; accepted 2 Oct. Because we use an evolving classifier, it is constantly2010; published online 21 Dec. 2011. learning and adapting the existing classifier structure toRecommended for acceptance by D. Tao.For information on obtaining reprints of this article, please send e-mail to: accommodate the newly observed emerging, and reference IEEECS Log Number TKDE-2009-05-0472. Once a user is classified, relevant actions can be done;Digital Object Identifier no. 10.1109/TKDE.2011.17. however, this task is not addressed in this paper. 1041-4347/12/$31.00 ß 2012 IEEE Published by the IEEE Computer Society
  2. 2. IGLESIAS ET AL.: CREATING EVOLVING USER BEHAVIOR PROFILES AUTOMATICALLY 855 The creation of a user profile from a sequence of UNIX behavior from example streams. Popular approaches tocommands should consider the consecutive order of the such learning include statistical analysis and frequency-commands typed by the user and the influence of his/her based methods. Lane and Brodley [10] present an approachpast experiences. This aspect motivates the idea of auto- based on the basis of instance-based learning (IBL)mated sequence learning for computer user behavior techniques, and several techniques for reducing dataclassification; if we do not know the features that influence storage requirements of the user profile.the behavior of a user, we can consider a sequence of past Although the proposed approach can be applied to anyactions to incorporate some of the historical context of the behavior represented by a sequence of events, we focus inuser. However, it is difficult, or in general, impossible, to this research in a command-line interface a classifier that will have a full description of all Related to this environment, Schonlau et al. [3] investigate apossible behaviors of the user, because these behaviors number of statistical approaches for detecting masquer-evolve with time, they are not static and new patterns may aders. Coull et al. [11] propose an effective algorithm thatemerge as well as an old habit may be forgotten or stopped to uses pairwise sequence alignment to characterize similaritybe used. The descriptions of a particular behavior itself may between sequences of commands. Recently, Angelov andalso evolve, so we assume that each behavior is described by Zhou propose in [12] to use evolving fuzzy classifiers forone or more fuzzy rules. A conventional system do not this detection task.capture the new patterns (behaviors) that could appear in the In [13], Panda and Patra compared the performance ofdata stream once the classifier is built. In addition, the different classification algorithms—Naive Bayesian (NB),information produced by a user is often quite large. There- C4.5 and Iterative Dichotomizer 3 (ID3)—for networkfore, we need to cope with large amounts of data and process intrusion detection. According to the authors, ID3 andthis information in real time, because storing the completedata set and analyzing it in an offline (batch) mode, would be C4.5 are robust in detecting new intrusions, but NBimpractical. In order to take into account these aspects, performs better to overall classification accuracy. Cufogluwe use an evolving fuzzy-rule-based system that allows for et al. [14] evaluated the classification accuracy of NB, IB1,the user behaviors to be dynamic, to evolve. SimpleCART, NBTree, ID3, J48, and Sequential Minimal This paper is organized as follows: Section 2 provides an Optimization (SMO) algorithms with large user profile data.overview of the background and related work. The overall According to the simulation results, NBTree classifierstructure of our approach, EVABCD, is explained in Section 3. performs the best classification on user-related information.Section 4 describes the construction of the user behavior It should be emphasized that all of the above approachesprofile. The evolving UNIX user classifier is detailed in ignore the fact that user behaviors can change and evolve. http://ieeexploreprojects.blogspot.comSection 5. Section 6 describes the experimental setting and the However, this aspect needs to be taken into account in theexperimental results obtained. Finally, Section 7 contains proposed approach. In addition, owing to the characteristicsfuture work and concluding remarks. of the proposed environment, we need to extract some sort of knowledge from a continuous stream of data. Thus, it is necessary that the approach deals with the problem of2 BACKGROUND AND RELATED WORK classification of streaming data. Incremental algorithmsDifferent techniques have been used to find out relevant build and refine the model at different points in time, ininformation related to the human behavior in many different contrast to the traditional algorithms which perform theareas. The literature in this field is vast; Macedo et al. [5] model in a batch manner. It is more efficient to revise anpropose a system (WebMemex) that provides recommended existing hypothesis than it is to generate hypothesis each timeinformation based on the captured history of navigation a new instance is observed. Therefore, one of the solution tofrom a list of known users. Pepyne et al. [6] describe a method the proposed scenario is the incremental classifiers.using queuing theory and logistic regression modeling An incremental learning algorithm can be defined as onemethods for profiling computer users based on simple that meets the following criteria [15]:temporal aspects of their behavior. In this case, the goal isto create profiles for very specialized groups of users, who 1. It should be able to learn additional informationwould be expected to use their computers in a very similar from new data.way. Gody and Amandi [7] present a technique to generate 2. It should not require access to the original data, usedreadable user profiles that accurately capture interests by to train the existing classifier.observing their behavior on the web. 3. It should preserve previously acquired knowledge. There is a lot of work focusing on user profiling in a 4. It should be able to accommodate new classes thatspecific environment, but it is not clear that they can be may be introduced with new data.transferred to other environments. However, the approach Several incremental classifiers have been implementedwe propose in this paper can be used in any domain in using different frameworkswhich a user behavior can be represented as a sequence ofactions or events. Because sequences play a crucial role in . Decision trees. The problem of processing streaminghuman skill learning and reasoning [8], the problem of user data in online has motivated the development ofprofile classification is examined as a problem of sequence many algorithms which were designed to learnclassification. According to this aspect, Horman and decision trees incrementally [16], [17]. Some exam-Kaminka [9] present a learner with unlabeled sequential ples of the algorithms which construct incrementaldata that discover meaningful patterns of sequential decision trees are: ID4 [18], ID5 [19], and ID5R [20].
  3. 3. 856 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 . Artificial neural networks (ANN). Adaptive Reso- [30]. By drift, they refer to a modification of the concept over nance Theory (ART) networks [21] are unsupervised time, and shift usually refers to a sudden and abrupt ANNs proposed by Carpenter that dynamically changes in the streaming data. To capture these changes, it determine the number of clusters based on a is necessary not only tuning parameters of the classifiers, vigilance parameter [22]. In addition, Kasabov but also a change in its structure. A simple incremental proposed another incremental learning neural net- algorithm does not evolve the structure of the classifier. The work architecture, called Evolving Fuzzy Neural interpretation of the results is also an important character- Network (EFuNN) [23]. This architecture does not istic in a classifier, and several incremental classifiers (such require access to previously seen data and can as ANN or SVM) are not good in terms of interpretation of accommodate new classes. A new approach to the results. Finally, the computational efficiency of the incremental learning using evolving neural net- proposed classifier is very important, and some incremental works is proposed by Seipone and Bullinaria [24]. algorithms (such SVM) have to solve quadratic optimiza- This approach uses an evolutionary algorithm to tion problems many times. evolve some MLP parameters. This process aims at Taking all these aspects into account, we propose in this evolving the parameters to produce networks with paper an evolving fuzzy-rule-base system which satisfies all better incremental abilities. of the criteria of the incremental classifiers. However, the . Prototype-based supervised algorithms. Learning approach has also important advantages which make it Vector Quantization (LVQ) is one of the well-known very useful in real environments nearest prototype learning algorithms [25]. LVQ can 1. It can cope with huge amounts and data. be considered to be a supervised clustering algo- 2. Its evolving structure can capture sudden and rithm, in which each weight vector can be inter- abrupt changes in the stream of data. preted as a cluster center. Using this algorithm, the 3. Its structure meaning is very clear, as we propose a number of reference vectors has to be set by the user. rule-based classifier. For this reason, Poirier and Ferrieux proposed a 4. It is noniterative and single pass; therefore, it is method to generate new prototypes dynamically. computationally very efficient and fast. This incremental LVQ is known as Dynamic Vector 5. Its classifier structure is simple and interpretable. Quantization (DVQ) [26]. However, this method lacks the generalizing capability, resulting in the Thus, an approach for creating and recognizing auto- generation of many prototype neurons for applica- matically the behavior profile of a computer user with the tions with noisy data. time evolving in a very effective way. To the best of our http://ieeexploreprojects.blogspot.comfirst publication where user behavior . Bayesian. Bayesian classifier is an effective metho- knowledge, this is the dology for solving classification problems when all is considered, treated, and modeled as a dynamic and features are considered simultaneously. However, evolving phenomenon. This is the most important con- when the features are added one by one in Bayesian tribution of this paper. classifier in batch mode in forward selection method, huge computation is involved. For this reason, 3 THE PROPOSED APPROACH Agrawal and Bala [27] proposed an incremental Bayesian classifier for multivariate normal distribu- This section introduces the proposed approach for auto- tion data sets. Several incremental versions of matic clustering, classifier design, and classification of the Bayesian classifiers are proposed in [28] behavior profiles of users. The novel evolving user behavior . Support Vector Machine (SVM). A Support Vector classifier is based on Evolving Fuzzy Systems and it takes into Machine performs classification by constructing an account the fact that the behavior of any user is not fixed, N-dimensional hyperplane that optimally separates but is rather changing. Although the proposed approach the data into two categories. Training an SVM can be applied to any behavior represented by a sequence of “incrementally” on new data by discarding all events, we detail it using a command-line interface (UNIX previous data except their support vectors, gives commands) environment. only approximate results. Cauwenberghs et al. In order to classify an observed behavior, our approach, as consider incremental learning as an exact online many other agent modeling methods [31], creates a library method to construct the solution recursively, one which contains the different expected behaviors. However, point at a time. In addition, Xiao et al. [29] propose in our case, this library is not a prefixed one, but is evolving, an incremental algorithm which utilizes the proper- learning from the observations of the users real behaviors ties of SV set, and accumulates the distribution and, moreover, it starts to be filled in “from scratch” by knowledge of the sample space through the adjus- assigning temporarily to the library the first observed user as table parameters. a prototype. The library, called Evolving-Profile-Library However, as this research focus in a command-line (EPLib), is continuously changing, evolving influenced byinterface environment, it is necessary an approach able to the changing user behaviors observed in the environment.process streaming data in real time and also cope with huge Thus, the proposed approach includes at each step theamounts of data. Several incremental classifiers do not following two main actions:considered this last aspect. In addition, the structure of theincremental classifiers is assumed to be fixed, and they can 1. Creating and evolving the classifier. This actionnot address the problem of so-called concept drift and shift involves in itself two subactions:
  4. 4. IGLESIAS ET AL.: CREATING EVOLVING USER BEHAVIOR PROFILES AUTOMATICALLY 857Fig. 1. Steps of creating an example trie. a. Creating the user behavior profiles. This sub- the size of the subsequences created. In the remainder of action analyzes the sequences of commands the paper, we will use the term subsequence length to denote typed by different UNIX users online (data the value of this length. This value determines how many stream), and creates the corresponding profiles. commands are considered as dependent. This process is detailed in Section 4. In the proposed sample sequence (fls ! date ! b. Evolving the classifier. This subaction includes ls ! date ! catg), let 3 be the subsequence length, then online learning and update of the classifier, we obtain including the potential of each behavior to be a fLS ! DAT E ! LSg, fdate ! ls ! dateg, fls ! date ! prototype, stored in the EPLib. The whole catg: process is explained in Section 5. 2. User classification. The user profiles created in the 4.2 Storage of the Subsequences in a trie previous action are associated with one of the The subsequences of commands are stored in a trie data prototypes from the EPLib, and they are classified structure. When a new model needs to be constructed, we into one of the classes formed by the prototypes. create an empty trie, and insert each subsequence of events This action is also detailed in Section 5. into it, such that all possible subsequences are accessible and explicitly represented. Every trie node represents an4 CONSTRUCTION OF THE USER BEHAVIOR PROFILE event appearing at the end of a subsequence, and the nodes children represent the events that have appeared followingIn order to construct a user behavior profile in online mode this event. Also, each node keeps track of the number offrom a data stream, we need to extract an ordered sequence times a command has been recorded into it. When a newof recognized events; in this case, UNIX commands. These subsequence is inserted into a trie, the existing nodes arecommands are inherently sequential, and this sequentiality modified and/or new nodes are created. As the dependen-is considered in the modeling process. According to this cies of the commands are relevant in the user profile, theaspect and based on the work done in [2], in order to get the subsequence suffixes (subsequences that extend to the endmost representative set of subsequences from a sequence,we propose the use of a trie data structure [32]. This of the given sequence) are also inserted. Considering the previous example, the first subsequencestructure was also used in [33] to classify differentsequences and in [34], [35] to classify the behavior patterns (fls ! date ! lsg) is added as the first branch of the emptyof a RoboCup soccer simulation team. trie (Fig. 1a). Each node is labeled with the number 1 which The construction of a user profile from a single sequence indicates that the command has been inserted in the nodeof commands is done by a three step process: once (in Fig. 1, this number is enclosed in square brackets). Then, the suffixes of the subsequence (fdate ! lsg and {ls}) 1. Segmentation of the sequence of commands. are also inserted (Fig. 1b). Finally, after inserting the three 2. Storage of the subsequences in a trie. subsequences and its corresponding suffixes, the completed 3. Creation of the user profile. trie is obtained (Fig. 1c). These steps are detailed in the following three sections. 4.3 Creation of the User ProfileFor the sake of simplicity, let us consider the followingsequence of commands as an example: fls ! date ! Once the trie is created, the subsequences that characterizels ! date ! catg. the user profile and its relevance are calculated by traversing the trie. For this purpose, frequency-based4.1 Segmentation of the Sequence of Commands methods are used. In particular, in EVABCD, to evaluateFirst, the sequence is segmented into subsequences of equal the relevance of a subsequence, its relative frequency orlength from the first to the last element. Thus, the sequence support [36] is calculated. In this case, the support of aA ¼ A1 A2 . . . An (where n is the number of commands of the subsequence is defined as the ratio of the number of timessequence) will be segmented in the subsequences described the subsequence has been inserted into the trie and the totalby Ai . . . Aiþlength 8i; i ¼ ½1; n À length þ 1Š, where length is number of subsequences of equal size inserted.
  5. 5. 858 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012Fig. 2. Distribution of subsequences of commands—Example. In this step, the trie can be transformed into a set of according to the different subsequences that are repre-subsequences labeled by its support value. In EVABCD, this sented in it.set of subsequences is represented as a distribution of Fig. 3 explains graphically this novel idea. In this example,relevant subsequences. Thus, we assume that user profiles the distribution of the first user consists of five subsequencesare n-dimensional matrices, where each dimension of the of commands (ls, ls-date, date, cat, and date-cat); therefore, wematrix will represent a particular subsequence of commands. need a five-dimensional data space to represent this distribu- In the previous example, the trie consists of nine nodes; tion because each different subsequence is represented by onetherefore, the corresponding profile consists of nine dimension. If we consider the second user, we can see thatdifferent subsequences which are labeled with its support. three of the five previous subsequences have not been typedFig. 2 shows the distribution of these subsequences. by this user (ls-date, date, and date-cat), so these values are not Once a user behavior profile has been created, it is available. Also, the values of the two new subsequences (cpclassified and used to update the Evolving-Profile-Library, as and ls-cp) need to be represented in the same data space; thus,explained in the next section. it is necessary to increase the dimensionality of the data space from five to seven. To sum up, the dimensions of the data5 EVOLVING UNIX USER CLASSIFIER space represent the different subsequences typed by the usersA classifier is a mapping from the feature space to the class and they will increase according to the different new http://ieeexploreprojects.blogspot.comlabel space. In the proposed classifier, the feature space is subsequences obtained.defined by distributions of subsequences of events. On theother hand, the class label space is represented by the most 5.2 Structure of the EVABCDrepresentative distributions. Thus, a distribution in the class Once the corresponding distribution has been created fromlabel space represents a specific behavior which is one of the stream, it is processed by the classifier. The structure ofthe prototypes of the EPLib. The prototypes are not fixed this classifier includesand evolve taking into account the new informationcollected online from the data stream—this is what makes 1. Classify the new sample in a class represented by athe classifier Evolving. The number of these prototypes is prototype. Section 5.6.not prefixed but it depends on the homogeneity of the 2. Calculate the potential of the new data sample to beobserved behaviors. The following sections describes how a a prototype. Section 5.3.user behavior is represented by the proposed classifier, and 3. Update all the prototypes considering the new datahow this classifier is created in an evolving manner. sample. It is done because the density of the data space surrounding certain data sample changes with the5.1 User Behavior RepresentationEVABCD receives observations in real time from theenvironment to analyze. In our case, these observationsare UNIX commands and they are converted into thecorresponding distribution of subsequences online. In orderto classify a UNIX user behavior, these distributions mustbe represented in a data space. For this reason, eachdistribution is considered as a data vector that defines apoint that can be represented in the data space. The data space in which we can represent these pointsshould consist of n dimensions, where n is the number ofthe different subsequences that could be observed. It meansthat we should know all the different subsequences of theenvironment a priori. However, this value is unknown andthe creation of this data space from the beginning is notefficient. For this reason, in EVABCD, the dimension of the Fig. 3. Distributions of subsequences of events in an evolving systemdata space also evolves, it is incrementally growing approach—Example.
  6. 6. IGLESIAS ET AL.: CREATING EVOLVING USER BEHAVIOR PROFILES AUTOMATICALLY 859 insertion of each new data sample. Insert the new represented as zk . Each sample is represented by a set of data sample as a new prototype if needed. Section 5.4. values; the value of the ith attribute (element) of the zk 4. Remove any prototype if needed. Section 5.5. sample is represented as zi . k Therefore, as we can see, the classifier does not need to As it is detailed in the Appendix, this derivation isbe configured according to the environment where it is used proposed as a novel contribution and the result is as follows:because it can start “from scratch.” Also, the relevant 1information of the obtained samples is necessary to update Pk ðzk Þ ¼  1 Â à  1 ÃÃà 2 þ hðkÀ1Þ À2BK þ h Dkthe library, but, as we will explain in the next section, thereis no need to store all the samples in it. k ¼ 2; 3 . . . ; P1 ðz1 Þ ¼ 1 where: vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi5.3 Calculating the Potential of a Data Sample u À j Á2 X j j j n u zAs in [12], a prototype is a data sample (a behavior Bk ¼ zk bk ; bk ¼ bðkÀ1Þ þ tPn kÀ Á2 j l j¼1 l¼1 zkrepresented by a distribution of subsequences of com- vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u À j Á2 ð3Þmands) that represents several samples which represent a u zcertain class. bj ¼ tPn 1À Á2 ; j ¼ ½1; n þ 1Š 1 l l¼1 z1 The classifier is initialized with the first data sample, X j X p jp jp n n zj zpwhich is stored in EPLib. Then, each data sample is Dk ¼ zk zk dk ; dk ¼ djp þ Pn k k l 2 ðkÀ1Þclassified to one of the prototypes (classes) defined in the j¼1 p¼1 l¼1 ðzk Þclassifier. Finally, based on the potential of the new data zj z1sample to become a prototype [37], it could form a new dj1 ¼ Pn 1 À1 Á2 ; j ¼ ½1; n þ 1Š; 1 l l¼1 z1prototype or replace an existing one. The potential (P) of the kth data sample ðxk Þ is calculated where dij represents this accumulated value for the kth data kby (1) which represents a function of the accumulated sample considering the multiplication of the ith and jthdistance between a sample and all the other k À 1 samples attributes of the data sample. Thus, to get recursively thein the data space [12]. The result of this function represents value of the potential of a sample using (1), it is necessary tothe density of the data that surrounds a certain data sample calculate k  k different accumulated values which store the result of the multiplication of an attribute value of the data 1 P ðxk Þ ¼ PkÀ1 ; ð1Þ sample by all the other attribute values of the data sample. distance2 ðxk ;xi Þ 1 þ i¼1 kÀ1 However, since the number of needed accumulated simplify this expression in order to values is very large, wewhere distance represents the distance between two samples calculate it faster and with less the data space. In [38], the potential is calculated using the euclidean 5.3.2 Simplifying the Potential Expressiondistance and in [12] it is calculated using the cosine In our particular application, the data are represented by adistance. Cosine distance has the advantage that it tolerates set of support values and are thus positive. To simplify thedifferent samples to have different number of attributes; in recursive calculation of the expression (1), we can usethis case, an attribute is the support value of a subsequence simply the distance instead of square of the distance. Forof commands. Also, cosine distance tolerates that the value this reason, we use in this case (4) instead of (1)of several subsequences in a sample can be null (null isdifferent than zero). Therefore, EVABCD uses the cosine 1 Pk ðzk Þ ¼ PkÀ1 : ð4Þdistance (cosDist) to measure the similarity between two 1 þ i¼1 kÀ1 cosDistðxk ;xi Þsamples, as it is described below Pn Using (4), we develop a recursive expression similar to j¼1 xkj xpj the recursive expressions developed in [37], [38] using cosDistðxk ; xp Þ ¼ 1 À qPffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ; Pn ð2Þ n 2 2 euclidean distance and in [12] using cosine distance. This j¼1 xkj j¼1 xpj formula is as follows:where xk and xp represent the two samples to measure its 1distance and n represents the number of different attributes Pk ðzk Þ ¼ 1 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ; k ¼ 2; 3; . . . ; P1 ðz1 Þ ¼ 1in both samples. 2 À kÀ1 P 1 À Á2 Bk n j j¼1 zk5.3.1 Calculating the Potential Recursively where: vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u À j Á2Note that the expression in (1) requires all the accumulated X j j j n u z ð5Þ Bk ¼ zk bk ; bk ¼ bðkÀ1Þ þ tPn kÀ Á2 jdata sample available to be calculated, which contradicts to l j¼1 l¼1 zkthe requirement for real time and online application needed vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u À j Á2in the proposed approach. For this reason, we need to u z bj ¼ tPn 1À Á2 ; j ¼ ½1; n þ 1Š; 1develop a recursive expression of the potential, in which it l l¼1 z1is not needed to store the history of all the data. In (3), the potential is calculated in the input-output Using this expression, it is only necessary to calculatejoint data space, where z ¼ ½x; LabelŠ; therefore, the ðn þ 1Þ values where n is the number of differentkth data sample (xk ) with its corresponding label is obtained subsequences; this value is represented by b,
  7. 7. 860 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012where bj ; j ¼ ½1; nŠ represents the accumulated value for k 5.6 Classification Methodthe kth data sample. In order to classify a new data sample, we compare it with all the prototypes stored in EPLib. This comparison is done5.4 Creating New Prototypes using cosine distance and the smallest distance determinesThe proposed evolving user behavior classifier, EVABCD, the closest similarity. This comparison is shown ascan start “from scratch” (without prototypes in the library) ina similar manner as eClass evolving fuzzy rule-based Classðxz Þ ¼ ClassðP rotà Þ; ð11Þclassifier proposed in [38], used in [39] for robotics and P rotà ¼ MINi¼1 rot ðcosDistðxP rototypei ; xz ÞÞ: NumPfurther developed in [12]. The potential of each new data The time consumed for classifying a new samplesample is calculated recursively and the potential of the depends on the number of prototypes and its number ofother prototypes is updated. Then, the potential of the new attributes. However, we can consider, in general terms, thatsample (zk ) is compared with the potential of the existing both the time consumed and the computational complexityprototypes. A new prototype is created if its value is higher are reduced and acceptable for real-time applications (inthan any other existing prototype, as shown in (6) order of milliseconds per data sample). 9i; i ¼ ½1; NumP rototypesŠ : P ðzk Þ > P ðP roti Þ: 5.7 Supervised and Unsupervised Learning ð6Þ Thus, if the new data sample is not relevant, the overall The proposed classifier has been explained taking intostructure of the classifier is not changed. Otherwise, if theaccount that the observed data samples do not have labels;new data sample has high descriptive power and general- thus, using unsupervised learning. In this case, the classesization potential, the classifier evolves by adding a new are created based on the existing prototypes and, thus, anyprototype which represents a part of the observed data prototype represents a different class (label). Such asamples. technique is used, for example, in eClass0 [12], [38] which is a clustering-based classification.5.5 Removing Existing Prototypes However, the observed data samples can have a labelAfter adding a new prototype, we check whether any of the assigned to them a priori. In this case, using EVABCD, aalready existing prototypes are described well by the newly specific class is represented by several prototypes, becauseadded prototype [12]. By well, we mean that the value of the the previous process is done for each different class. Thus,membership function that describes the closeness to the each class has a number of prototypes that depends on howprototype is a Gaussian bell function chosen due to its heterogeneous are the samples of the same class. In this case, a new data sample is classified in the class whichgeneralization capabilities belongs the closest prototype. This technique is used, for 9i; i ¼ ½1; NumP rototypesŠ : i ðzk Þ eÀ1 : ð7Þ example, in eClass1 [12], [38]. For this reason, we calculate the membership functionbetween a data sample and a prototype which is defined as 6 EXPERIMENTAL SETUP AND RESULTS cosDistðzk ;P roti Þ In order to evaluate EVABCD in the UNIX environment, we À1½ Š i ðzk Þ ¼ e 2 i ; i ¼ ½1; NumP rototypesŠ; ð8Þ use a data set with the UNIX commands typed by 168 real users and labeled in four different groups. Therefore, inwhere cosDist(zk ; P roti Þ represents the cosine distance these experiments, we use supervised learning. The ex-between a data sample (zk ) and the ith prototype (P roti ); plained process is applied for each of the four groupi represents the spread of the membership function, which (classes) and one or more prototypes will be created foralso symbolizes the radius of the zone of influence of the each group. EVABCD is applied in these experimentsprototype. This spread is determined based on the scatter considering the data set as pseudo-online streams. How-[40] of the data. The equation to get the spread of the kth ever, only using data sets in an offline mode, the proposeddata sample is defined as classifier can be compared with other incremental and vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi nonincremental classifiers. u k u1 X i ðkÞ ¼ t cosDistðP roti ; zk Þ; i ð0Þ ¼ 1; ð9Þ 6.1 Data Set k j¼1 In these experiments, we use the command-line datawhere k is the number of data samples inserted in the data collected by Greenberg [41] using UNIX csh commandspace; cosDist(P roti ; zk ) is the cosine distance between the interpreter. This data are identified in four target groups which represent a total of 168 male and female users with anew data sample (zk ) and the ith prototype. However, to calculate the scatter without storing all the wide cross section of computer experience and needs. Salient features of each group are described belowreceived samples, this value can be updated recursively (asshown in [38]) by . Novice programmers. The users of this group had little rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi or no previous exposure to programming, operating 1 systems, or UNIX-like command-based interfaces. i ðkÞ ¼ ½i ðk À 1ފ2 þ ½cosDist2 ðP roti ; zk Þ À ½i ðk À 1ފ2 Š: k These users spent most of their time learning how to ð10Þ program and use the basic system facilities.
  8. 8. IGLESIAS ET AL.: CREATING EVOLVING USER BEHAVIOR PROFILES AUTOMATICALLY 861 TABLE 1 uses a different distribution of training samples. Sample Group Sizes and Command Lines Recorded Combining weak classifiers take advantage of the so- called instability of the weak classifier. This instabil- ity causes the classifier to construct sufficiently different decision surfaces for minor modifications in their training data sets. In these experiments, we use a decision stump as a weak learner. A decision stump is advantageous over other weak learners because it can be calculated very quickly. In addition, there is a one-to-one correspondence between a weak learner and a feature when a . Experienced programmers. These group members decision stump is used [48]. were senior Computer Science undergraduates, . The Support Vector Machine classifier relies on the statistical learning theory. In this case, the algo- expected to have a fair knowledge of the UNIX rithm Sequential Minimal Optimization is used for environment. These users used the system for training support vector machines by breaking a coding, word processing, employing more advanced large quadratic programming (QP) optimization UNIX facilities to fulfill course requirements, and problem into a series of smallest possible QP social and exploratory purposes. problems [49]. In addition, in order to determine . Computer scientist. This group, graduates and re- the best kernel type for this task, we have searchers from the Department of Computer Science, performed several experiments using polynomial had varying experience with UNIX, although all were kernel function with various degrees and radial experts with computers. Tasks performed were less basis function. In this case, the polynomial kernel predictable and more varied than other groups, function results performed best in all experiments. research investigations, social communication, word Therefore, we use polynomial kernel function of processing, maintaining databases, and so on. degree 1 for proposed experiments. . Nonprogrammers. Word processing and document . The Learning Vector Quantization classifier is a preparation was the dominant activity of the supervised version of vector quantization, which members of this group, made up of office staff and defines class boundaries based on prototypes, a members of the Faculty of Environmental Design. nearest-neighbor rule and a winner-takes-it-all para- Knowledge of UNIX was the minimum necessary to digm. The performance of LVQ depends on the get the job done. algorithm implementation. In these experiments, it is The sample sizes (the number of people observed) and used the enhanced version of LVQ1, the OLVQ1the total number of command lines are indicated in Table 1. implementation [50]. In addition, we compare EVABCD with two incremental6.2 Comparison with Other Classifiers classifiers which share a common theoretical basis withIn order to evaluate the performance of EVABCD, it is batch classifiers. However, they can be incrementallycompared with several different types of classifiers. This updated as new training instances arrive; they do not haveexperiment focuses on using the well-established batch to process all the data in one batch. Therefore, most of theclassifiers described as follows: incremental learning algorithms can process data files that . The C5.0 algorithm [42] is a commercial version of are too large to fit in memory. The incremental classifiers the C4.5 [43] decision-tree-based classifier. used in these experiments are detailed as follows: . The Naive Bayes classifier (NB) is a simple . Incremental naive bayes. This classifier uses a probabilistic classifier based on applying Bayes theorem with strong (naive) independence assump- default precision of 0.1 for numeric attributes when tions [44]. In this case, the numeric estimator it is created with zero training instances. precision values are chosen based on analysis of . Incremental kNN. The kNN classifier adopts an the training data. For this reason, the classifier is no- instance-based approach whose conceptual simpli- incremental [45]. city makes its adaptation to an incremental . The K Nearest Neighbor classifier (kNN) is an classifier straightforward. There is an algorithm instance-based learning technique for classifying called incremental kNN as the number of nearest objects based on closest training examples in the neighbors k needs not be known in advance and it feature space [46]. The parameter k affects the is calculated incrementally. However, in this case, performance of the kNN classifier significantly. In the difference between Incremental kNN and these experiments, the value with which better nonincremental kNN is the way all the data are results are obtained is k ¼ 1; thus, a sample is processed. Unlike other incremental algorithms, assigned to the class of its nearest neighbor. kNN stores entire data set internally. In this case, However, using this value, the corresponding the parameter k with which better results are classifier may not be robust enough to the noise in obtained is k ¼ 1. the data sample. In these experiments, the classifiers were trained using a . The AdaBoost (adaptive boosting) classifier [47] feature vector for each of the 168 users. This vector consists combines multiple weak learners in order to form of the support value of all the different subsequences of an efficient and strong classifier. Each weak classifier commands obtained for all the users; thus, there are
  9. 9. 862 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 TABLE 2 TABLE 3 Total Number of Different Subsequences Obtained EvABCD: Number of Prototypes Created per Group Using 10-Fold Cross Validation In the phase of behavior model creation, the length of the subsequences in which the original sequence is segmentedsubsequences which do not have a value because thecorresponding user has not typed those commands. In this (used for creating the trie) is an important parameter: usingcase, in order to be able to use this data for training the long subsequences, the time consumed for creating the trieclassifiers, we consider this value as 0 (although its real and the number of relevant subsequences of the corre-value is null). sponding distribution increase drastically. In the experi- ments presented in this paper, the subsequence length6.3 Experimental Design varies from 2 to 6.In order to measure the performance of the proposed Before showing the results, we should consider that theclassifier using the above data set, the well-established number of subsequences obtained using different streamstechnique of 10-fold cross validation is chosen. Thus, all of data is often very large. To get an idea of how thisthe users (training set) are divided into 10 disjoint subsets number increases, Table 2 shows the number of differentwith equal size. Each of the 10 subsets is left out in turn subsequences obtained using different number of com-for evaluation. It should be emphasized that EVABCD mands for training (100, 500, and 1.000 commands per user)does not need to work in this mode; this is done mainly and subsequence lengths (3 and 5).in order to have comparable results with the established Using EVABCD, the number of prototypes per class isoffline techniques. not fixed, it varies automatically depending on the hetero- The number of UNIX commands typed by a user and geneity of the data. To get an idea about it, Table 3 tabulatesused for creating his/her profile is very relevant in the the number of prototypes created per group in each of the http://ieeexploreprojects.blogspot.comclassification process. When EVABCD is carried out in the 10 runs using 1.000 commands per user as training data setfield, the behavior of a user is classified (and the evolving and a subsequence length of 3.behavior library updated) after she/he types a limitednumber of commands. In order to show the relevance of 6.4 Resultsthis value using the data set already described, we consider Results are listed in Table 4. Each major row corresponds todifferent number of UNIX commands for creating the a training-set size (100, 500, and 1.000 commands) and eachclassifier: 100, 500, and 1.000 commands per user. such row is further subdivided into experiments with TABLE 4 Classification Rate (Percent) of Different Classifiers in the UNIX Users Environments Using Different Size of Training Data Set and Different Subsequence Lengths
  10. 10. IGLESIAS ET AL.: CREATING EVOLVING USER BEHAVIOR PROFILES AUTOMATICALLY 863different subsequence lengths for segmenting the initial constantly. For this purpose, we design a new experi-sequence (from two to six commands). The columns show mental scenario in which the number of users of a class isthe average classification success using the proposed incrementing in the training data in order to detect howapproach (EvABCD) and the other (incremental and the different classifiers recognize the users of a new class.nonincremental) classification algorithms. In this case, EVABCD is compared with three nonincre- According to these data, SVM and Nonincremental mental classifiers and 10-fold cross validation is used. ThisNaive Bayes perform slightly better than the Incremental experiment consists of four parts:NB and C5.0 classifiers in terms of accuracy. The percen-tages of users correctly classified by EVABCD are higher to 1. One of the four classes is chosen to detect howthe results obtained by LVQ and lower than the percentages EVABCD can adapt to it. The chosen class is namedobtained by NB, C5.0, and SVM. Incremental and Non- as new class.incremental kNN classifiers perform much worse than the 2. Initially, the training data set is composed by aothers. Note that the changes in the subsequence length do sample of the new class and all the samples of thenot modify the classification results obtained by AdaBoost. other three classes. The test data set is composed ofThis is due to the fact that the classifier creates the same several samples of the four classes.classification rule (weak hypotheses) although the subse- 3. EVABCD is applied and the classification rate of thequence length varies. samples of the new class is calculated. In general, the difference between EVABCD and the 4. The number of samples of the new class in thealgorithms NB and SVM is considerable for small sub- training data set is increasing one by one and step 3sequence lengths (two or three commands), but this is repeated. Thus, the classification rate of the newdifference decreases when this length is longer (five or six class samples is calculated in each increase.commands). These results show that using an appropriate Fig. 4 shows the four graphic results of this experimentsubsequence length, the proposed classifier can compete considering in each graph one of the four classes as thewell with offline approaches. new class: Nevertheless, the proposed environment needs a classi-fier able to process streaming data in online and in real . x-axis represents the number of users of the new classtime. Only the incremental classifiers satisfy this require- that contains the training data set.ment, but unlike EVABCD, they assume a fixed structure. In . y-axis represents the percentage of users of the newspite of this, taking into account the incremental classifiers, class correctly can be noticed that the difference in the results is not very In the different graphs, we can see how quicklysignificant; besides, EVABCD can cope with huge amounts EVABCD evolves and adapts to the new class. If we http://ieeexploreprojects.blogspot.comof data because its structure is open and the rule-base consider the class Novice Programmers, it is remarkable thatevolves in terms of creating new prototypes as they occur after analyzing three users of this class, the proposedautomatically. In addition, the learning in EVABCD is classifier is able to create a new prototype in which almostperformed in single pass and a significantly smaller 90 percent of the test users are correctly classified.memory is used. Spending too much time for training is However, the other classifiers need a larger number ofclearly not adequate for this purpose. samples for recognizing the users of this new class. Similar In short, EVABCD needs an appropriate subsequence performance has been observed for the other three classes.length to get a classification rate similar to the obtained by Specially, C5.0 needs several samples for creating aother classifiers which use different techniques. However, suitable decision tree. As we can see in the graph whichEVABCD does not need to store the entire data stream in represents the class Computer scientist as the new class, thethe memory and disregards any sample after being used. percentages of users correctly in the 1-NN classifier isEVABCD is one pass (each sample is proceeded once at the always 0 because all the users of this class are classified intime of its arrival), while other offline algorithms require a the Novice programmers class. The increase in the classifica-batch set of training data in the memory and make many tion rate is not perfectly smooth because the new dataiterations. Thus, EVABCD is computationally more simple bring useful information but also noise.and efficient as it is recursive and one pass. Unlike other Taking into account these results, we would like toincremental classifiers, EVABCD does not assume a remark that the proposed approach is able to adapt to aprefixed structure and it changes according to the samples new user behavior extremely quick. In a changing andobtained. In addition, as EVABCD uses a recursive real-time environment, as the proposed in this paper, thisexpression for calculating the potential of a sample, it isalso computationally very efficient. In fact, since the property is essential.number of attributes is very large in the proposedenvironment and it changes frequently, EVABCD is the 7 CONCLUSIONSmost suitable alternative. Finally, the EVABCD structure issimple and interpretable. In this paper, we propose a generic approach, EVABCD, to model and classify user behaviors from a sequence of6.5 Scenario: A New Class Appears events. The underlying assumption in this approach is thatIn addition to the advantage that EVABCD is an online the data collected from the corresponding environment canclassifier, we want to prove the ability of our approach to be transformed into a sequence of events. This sequence isadapt to new data; in this case, new users or a different segmented and stored in a trie and the relevant sub-behavior of a user. This aspect is especially relevant in sequences are evaluated by using a frequency-basedthe proposed environment since a user profile changes method. Then, a distribution of relevant subsequences is
  11. 11. 864 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 5, MAY 2012 http://ieeexploreprojects.blogspot.comFig. 4. Evolution of the classification rate during online learning with a subset of UNIX users data set.created. However, as a user behavior is not fixed but rather to other type of users such as users of e-services, digitalit changes and evolves, the proposed classifier is able to communications, etc.keep up to date the created profiles using an EvolvingSystems approach. EVABCD is one pass, noniterative, APPENDIX Arecursive, and it has the potential to be used in aninteractive mode; therefore, it is computationally very CALCULATION OF THE USING COSINE DISTANCEefficient and fast. In addition, its structure is simple and POTENTIAL RECURSIVELYinterpretable. In this appendix the expression of the potential is trans- The proposed evolving classifier is evaluated in an formed into a recursive expression in which it is calculatedenvironment in which each user behavior is represented asa sequence of UNIX commands. Although EVABCD has using only the current data sample (zk ). For this novelbeen developed to be used online, the experiments have derivation we combine the expression of the potential for abeen performed using a batch data set in order to compare sample data (1) represented by a vector of elements and thethe performance to established (incremental and nonincre- distance cosine expression (2).mental) classifiers. The test results with a data set of168 real UNIX users demonstrates that, using an appro- 1 Pk ðzk Þ ¼ 2 2 32 3 ð12Þpriate subsequence length, EVABCD can perform almost Pn j j PkÀ1 zk zias well as other well-established offline classifiers in terms 1 þ 4kÀ1 1 4 i¼1 1 À qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi5 5 j¼1 Pn À j Á2 Pn À j Á2of correct classification on validation data. However, zk zi j¼1 j¼1taking into account that EVABCD is able to adaptextremely quickly to new data, and that this classifier where zk denotes the kth sample inserted in the datacan cope with huge amounts of data in a real environment space. Each sample is represented by a set of values; thewhich changes rapidly, the proposed approach is the most value of the ith attribute (element) of the zk sample issuitable alternative. represented as zi . k Although, it is not addressed in this paper, EVABCD In order to explain the derivation of the expression stepcan also be used to monitor, analyze, and detectabnormalities based on a time-varying behavior of same by step; firstly, we consider the denominator of the (12)users and to detect masqueraders. It can also be applied which is named as den:P ðzk Þ.
  12. 12. IGLESIAS ET AL.: CREATING EVOLVING USER BEHAVIOR PROFILES AUTOMATICALLY 865 vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiden:Pk ðzk Þ u À j Á2 X j j j n u z 2 2 Pn j j 32 3 Bk ¼ zk bk ; bk ¼ bðkÀ1Þ þ tPn kÀ Á2 j l 6 1 X6 kÀ1 j¼1 zk zi 7 7 j¼1 l¼1 zk ð19Þ ¼1þ4 41 À qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi5 5 sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi À Á2 k À 1 i¼1 Pn j 2 Pn j 2 zj j¼1 ðzk Þ j¼1 ðzi Þ bj ¼ Pn À l Á2 : 1 1 ! # l¼1 z1 1 X kÀ1 fi 2den:Pk ðzk Þ ¼ 1 þ 1À Secondly, considering Dk with the renaming done in k À 1 i¼1 h gi (13), we get:where: 0 12 vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Pn j j #2 uX uX XB kÀ1 zk zi C X kÀ1 1 X j j n X n u n À j Á2 u n À j Á2 Dk ¼ j¼1 @qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiA ¼ Pn À l Á2 zk zi fi ¼ zj zj ; h¼t zk and gi ¼ t zi : ð13Þ Pn À j Á2 k i i¼1 j¼1 zi i¼1 l¼1 zi j¼1 j¼1 j¼1 j¼1 X n X n X kÀ1 1 We can observe that the variables fi and gi depend on the Dk ¼ zj k zp k zij zip Pn À Á2 : k k l j¼1 p¼1 i¼1 l¼1 zisum of all the data samples (all these data samples are ð20Þrepresented by i); but the variable h represents the sum ofthe attribute values of the sample. Therefore, den:Pk ðzk Þ can If we define djp as each attribute of the sample Dk , we kbe simplified further into: get: ! ! ## X kÀ1 1 1 X kÀ1 À2fi fi 2 djp ¼ zij zip Pn À Á2 ð21Þ k k k den:Pk ðzk Þ ¼ 1 þ þ 1À þ ; l ðk À 1Þ i¼1 hgi hgi i¼1 l¼1 zi !## 1 X À2fi kÀ1 fi2 Therefore: den:Pk ðzk Þ ¼ 1 þ ðk À 1Þ þ þ 2 2 : ðk À 1Þ i¼1 hgi h gi X n X n zj zp Dk ¼ zj k zp djp ; djp ¼ djp þ Pn k Àk Á2 ; k k k ðkÀ1Þ ð14Þ j¼1 p¼1 l¼1 zk l ð22Þ zj z1 And finally, into: ¼ Pn 1 À1 Á2 ; j ¼ ½1; n þ 1Š:d1j 1 l l¼1 z1 # ### X fi kÀ1 1 kÀ1 1 X fi 2 Finally:den:Pk ðzk Þ ¼ 2 þ À2 þ : hðk À 1Þ i¼1 gi h i¼1 gi 1 Pk ðzk Þ ¼ 1 1 ð15Þ 2 þ ½hðkÀ1Þ ½½À2BK Š þ ½h Dk ŠŠŠ ð23Þ k ¼ 2; 3 . . . ; P1 ðz1 Þ ¼ 1; In order to obtain an expression for the potential from(15), we rename it as follows: where Bk is obtained as in (19), and Dk is described in (22). Note that to get recursively the value of Bk , it is !!! necessary to calculate n accumulated values (in this case, 1 1 den:Pk ðzk Þ ¼ 2 þ ½À2BK Š þ Dk n is the number of the different subsequences obtained). hðk À 1Þ h X fi kÀ1 Xfi 2 kÀ1 ð16Þ However, to get recursively the value of Dk we need to where: Bk ¼ and Dk ¼ : calculate n  n different accumulated values which store g i¼1 i i¼1 gi the result of multiply a value by all the other different values (these values are represented as dij ). k If we analyze each variable (Bk and Dk ) separately(considering the renaming done in (13)): Firstly, we consider Bk ACKNOWLEDGMENTS vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi This work is partially supported by the Spanish Govern- Pn j j u À j Á2 ment under project TRA2007-67374-C02-02. X kÀ1 j¼1 zk zi X n Xu kÀ1 zi Bk ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ j zk t ð17Þ Pn À j Á2 Pn À l Á2 : i¼1 j¼1 z i j¼1 i¼1 l¼1 zi REFERENCES [1] D. Godoy and A. Amandi, “User Profiling in Personal Information If we define each attribute of the sample Bk by: Agents: A Survey,” Knowledge Eng. Rev., vol. 20, no. 4, pp. 329-361, vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2005. u À j Á2 [2] J.A. Iglesias, A. Ledezma, and A. Sanchis, “Creating User Profiles Xu kÀ1 zi j bk ¼ t ð18Þ from a Command-Line Interface: A Statistical Approach,” Proc. Pn À l Á2 : Int’l Conf. User Modeling, Adaptation, and Personalization (UMAP), i¼1 l¼1 zi pp. 90-101, 2009. [3] M. Schonlau, W. Dumouchel, W.H. Ju, A.F. Karr, and Theus, Thus, the value of Bk can be calculated as a recursive “Computer Intrusion: Detecting Masquerades,” Statistical Science,expression: vol. 16, pp. 58-74, 2001.