
Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events


Slides for our paper: Ting Chen, Lu-An Tang, Yizhou Sun, Zhengzhang Chen, Kai Zhang, "Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events," Proc. 25th Int. Joint Conf. on Artificial Intelligence (IJCAI'16), New York City, USA, Jul. 2016.


  1. Entity Embedding-based Anomaly Detection for Heterogeneous Categorical Events
     Ting Chen (Northeastern University), Lu-An Tang (NEC Labs America), Yizhou Sun (Northeastern University), Zhengzhang Chen (NEC Labs America), Kai Zhang (NEC Labs America)
     {tingchen, yzsun}@ccs.neu.edu, {ltang, zchen, kzhang}@nec-labs.com
     July 14, 2016
     Part of this work was done during the first author's internship at NEC Labs America.
  2. Introduction
     Anomaly detection is important in many applications, for example, detecting anomalous activities in enterprise networks.
  3. Problem statement
     Heterogeneous categorical event: a heterogeneous categorical event e = (a_1, ..., a_m) is a record containing m different categorical attributes, where the i-th attribute value a_i denotes an entity of type A_i.

     Table 1: Example events in a process-to-process interaction network.
     index | day | hour | uid | src. proc. | dst. proc. | src. folder | dst. folder
     0     | 3   | 16   | 8   | init       | python     | /sbin/      | /usr/bin/
     1     | 5   | 3    | 4   | init       | mingetty   | /sbin/      | /sbin/

     Problem (abnormal event detection): given a set of training events D = {e_1, ..., e_n}, and assuming that most events in D are normal, learn a model M such that when a new event e_{n+1} arrives, M can accurately predict whether the event is abnormal.
  4. Traditional anomaly detection by density estimation
     Estimate a probability distribution over the data space, e.g., with kernel density estimation, and flag points with low probability/density as anomalies/outliers (see the sketch below).
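A minimal sketch of this traditional density-based approach using scikit-learn's kernel density estimator. The synthetic data, bandwidth, and 1% threshold are illustrative assumptions, not from the slides; the point is only that low estimated density marks an outlier.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # mostly "normal" events

# Fit a kernel density estimate over the data space.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

X_test = np.array([[0.1, -0.2],   # near the training mass: likely normal
                   [6.0, 6.0]])   # far from the training mass: likely anomalous
log_density = kde.score_samples(X_test)           # log-density of each point

threshold = np.percentile(kde.score_samples(X_train), 1)  # bottom 1% as cutoff
print(log_density < threshold)                    # True flags a detected anomaly
```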
  5. Challenges
     There are several challenges associated with heterogeneous categorical events:
     - Large event space: with m different entity types, we face an O(exp(m)) event space.
     - Lack of an intrinsic distance measure among entities and events: how similar are two users/machines? How similar are two events with different entities?
     - No labeled data.
  6. Motivations for our model
     To overcome the lack of distance measures: use entity embeddings.
     To alleviate the large event space issue:
     - At the model level, consider only pairwise interactions. Such interactions capture anomalies like:
       - A maintenance program is usually triggered at midnight, but suddenly it is triggered during the day.
       - A user usually connects to servers with low privilege, but suddenly tries to access a server with high privilege.
     - At the learning level, propose noise-contrastive estimation (NCE) with a "context-dependent" noise distribution.
  7. APE model
     We model the probability of a single event e = (a_1, ..., a_m) in the event space Ω using the following parametric form (see the sketch below):

         P_θ(e) = exp(S_θ(e)) / Σ_{e' ∈ Ω} exp(S_θ(e'))    (1)

     where

         S_θ(e) = Σ_{i,j: 1 ≤ i < j ≤ m} w_ij (v_{a_i} · v_{a_j})    (2)

     Here w_ij is the weight for the pairwise interaction between entity types A_i and A_j, constrained to be non-negative (∀ i, j: w_ij ≥ 0), and v_{a_i} is the embedding vector of entity a_i.
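A minimal sketch of the scoring function S_θ(e) from Eq. (2): a sum over entity-type pairs (i, j) of w_ij times the dot product of the two entity embeddings. The embedding dimension, random initialization, and example event are illustrative assumptions; the arities follow Table 4's P2P data.

```python
import numpy as np

m, d = 7, 10                                     # entity types, embedding dim
n_entities = [7, 24, 361, 778, 1752, 255, 415]   # arities, as in the P2P data

rng = np.random.default_rng(0)
# One embedding table per entity type: V[i][a] is v_a for entity a of type A_i.
V = [rng.normal(scale=0.1, size=(n, d)) for n in n_entities]
W = np.abs(rng.normal(size=(m, m)))              # pairwise weights, non-negative

def score(event):
    """S_theta(e) = sum_{i<j} w_ij * (v_{a_i} . v_{a_j}) for e = (a_1..a_m)."""
    return sum(W[i, j] * V[i][event[i]].dot(V[j][event[j]])
               for i in range(m) for j in range(i + 1, m))

e = (3, 16, 8, 0, 1, 2, 4)   # an event encoded as indices into each entity type
print(score(e))              # unnormalized log-probability (up to the constant c)
```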
  8. APE model
     Figure 1: The framework of the proposed method (event → embedding lookup table → entity embeddings → pairwise interactions → probability).
  9. APE model
     The maximum likelihood learning objective:

         argmax_θ Σ_{e ∈ D} log P_θ(e)    (3)

     which maximizes the likelihood of each observed event. Recall

         P_θ(e) = exp(S_θ(e)) / Σ_{e' ∈ Ω} exp(S_θ(e'))    (4)

     where the event space Ω can be prohibitively large, so we resort to noise-contrastive estimation (NCE).
  10. Noise-contrastive estimation
      With NCE, we make two main modifications to the original learning objective:
      - Treat the normalization term of P_θ(e) as a learned parameter c:

            P_θ(e) = exp( Σ_{i,j: 1 ≤ i < j ≤ m} w_ij (v_{a_i} · v_{a_j}) + c )    (5)

      - Introduce a noise distribution P_n(e) and construct a binary classification problem that discriminates samples of the data distribution from samples of the known artificial noise distribution:

            J(θ) = E_{e ∼ P_d}[ log ( P_θ(e) / (P_θ(e) + k P_n(e)) ) ] + k E_{e' ∼ P_n}[ log ( k P_n(e') / (P_θ(e') + k P_n(e')) ) ]    (6)
  11. Stochastic gradient descent
      In practice, we can use SGD for fast and online training. For each observed event e, sample k artificial events e' from P_n(e'), and update the parameters according to stochastic gradients of (see the sketch below):

          log σ( log P_θ(e) − log k P_n(e) ) + Σ_{e'} log σ( −log P_θ(e') + log k P_n(e') )    (7)

      Here σ(x) = 1/(1 + exp(−x)) is the sigmoid function. The complexity of the algorithm is O(N k m² d), where N is the total number of observed events it is trained on, m is the number of entity types, and d is the embedding dimension.
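A minimal sketch of one SGD step on Eq. (7). The setup repeats the scoring sketch above; the learning rate and initialization are illustrative assumptions, and the gradients are derived by hand from Eq. (7) for a single event, so this is a sketch rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lr, c = 7, 10, 0.05, 0.0                   # c is the parameter from Eq. (5)
n_entities = [7, 24, 361, 778, 1752, 255, 415]
V = [rng.normal(scale=0.1, size=(n, d)) for n in n_entities]
W = np.triu(np.abs(rng.normal(size=(m, m))), k=1)  # only entries with i < j used

def score(event):
    return sum(W[i, j] * V[i][event[i]].dot(V[j][event[j]])
               for i in range(m) for j in range(i + 1, m))

def nce_update(event, is_data, log_k_pn):
    """One ascent step on Eq. (7) for a single observed event (is_data=True)
    or one of the k noise samples (is_data=False)."""
    global c
    delta = score(event) + c - log_k_pn          # log P_theta(e) - log(k P_n(e))
    sig = 1.0 / (1.0 + np.exp(-delta))
    coeff = (1.0 - sig) if is_data else -sig     # d(objective)/d(delta)
    # Gradient of delta w.r.t. each entity embedding appearing in the event.
    g_v = [sum(W[min(i, j), max(i, j)] * V[j][event[j]]
               for j in range(m) if j != i) for i in range(m)]
    for i in range(m):
        for j in range(i + 1, m):                # gradient w.r.t. w_ij
            W[i, j] += lr * coeff * V[i][event[i]].dot(V[j][event[j]])
        V[i][event[i]] += lr * coeff * g_v[i]
    np.clip(W, 0.0, None, out=W)                 # enforce w_ij >= 0
    c += lr * coeff
```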
  12. Context-independent noise distribution
      A "context-independent" noise distribution draws an artificial event independently of the given event e, according to (see the sketch below)

          P_n^factorized(e) = p_{A_1}(a_1) · · · p_{A_m}(a_m)

      Table 2: Generation of an example artificial event in the process-to-process interaction network under the "context-independent" noise distribution (each attribute is drawn independently, one at a time).

                           day | hour | uid | src. proc. | dst. proc. | src. folder | dst. folder
      observed event e     3   | 16   | 8   | init       | python     | /sbin/      | /usr/bin/
      generated event e'   5   | 3    | 4   | bash       | mingetty   | /           | /sbin/

      + Simple and tractable.
      − Generated samples might be too easy for the classifier.
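A minimal sketch of the factorized sampler: each attribute of the artificial event is drawn independently from its empirical marginal p_{A_i}. The `train_events` array (rows of entity indices) and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_marginals(train_events, n_entities):
    """Empirical marginal p_{A_i}(.) for each entity type A_i."""
    return [np.bincount(train_events[:, i], minlength=n) / len(train_events)
            for i, n in enumerate(n_entities)]

def sample_factorized(marginals):
    """Draw e' = (a_1', ..., a_m') with each a_i' ~ p_{A_i}, independently."""
    return tuple(rng.choice(len(p), p=p) for p in marginals)
```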
  13. Context-dependent noise distribution
      A "context-dependent" noise distribution first samples an observed event e, then randomly samples an entity type A_i, and finally samples a new entity a_i' ∼ p_{A_i}(a_i') to replace a_i and form a negative sample e' (see the sketch below).

      Table 3: Generation of an example artificial event in the process-to-process interaction network under the "context-dependent" noise distribution (here the dst. proc. entity is replaced).

                           day | hour | uid | src. proc. | dst. proc. | src. folder | dst. folder
      observed event e     3   | 16   | 8   | init       | python     | /sbin/      | /usr/bin/
      generated event e'   3   | 16   | 8   | init       | mingetty   | /sbin/      | /usr/bin/

      + Generated samples are harder for the classifier.
      − P_n(e') ≈ P_d(e) p_{A_i}(a_i')/m is intractable.
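A minimal sketch of the context-dependent sampler: start from an observed event and replace one randomly chosen entity with a draw from that type's marginal. `marginals` is as in the previous sketch; names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_context_dependent(event, marginals):
    """Replace one entity of observed e to form the negative sample e'."""
    e_prime = list(event)
    i = rng.integers(len(marginals))                  # random entity type A_i
    e_prime[i] = rng.choice(len(marginals[i]), p=marginals[i])
    return tuple(e_prime), i                          # e' and the replaced type
```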
  14. Context-dependent noise distribution
      We further approximate the "context-dependent" noise term P_n(e') ≈ P_d(e) p_{A_i}(a_i')/m as follows (see the sketch below):
      - P_d(e) is small for most events, so we simply set it to some constant l; then log k P_n(e') ≈ log p_{A_i}(a_i') + z, where z = log(kl/m) is a constant term (simply set to 0).
      - To compute P_n(e) for an observed event e, we use the expectation over all entity types: log k P_n(e) ≈ Σ_i (1/m) log p_{A_i}(a_i) + z, and again the constant z is ignored.
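A minimal sketch of the two approximated log(k P_n(·)) terms above, with the constant z set to 0 as on the slide. `marginals` is as in the earlier sketches; the EPS guard against log(0) is an added assumption.

```python
import numpy as np

EPS = 1e-12

def log_k_pn_noise(e_prime, i, marginals):
    """Noise event e' that replaced type A_i: log k P_n(e') ~ log p_{A_i}(a_i')."""
    return float(np.log(marginals[i][e_prime[i]] + EPS))

def log_k_pn_data(event, marginals):
    """Observed event e: log k P_n(e) ~ (1/m) * sum_i log p_{A_i}(a_i)."""
    m = len(marginals)
    return sum(np.log(marginals[i][event[i]] + EPS) for i in range(m)) / m
```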
  15. Experimental settings
      We use two data sets from an enterprise network:
      - P2P: process-to-process event data set.
      - P2I: process-to-Internet-socket event data set.
      We split the two weeks of data into two one-week halves: events in the first week are used as the training set, and new events that appear only in the second week are used as the test sets.
      To generate artificial anomalies, for each observed event in the test set we create a corresponding anomaly by replacing 1-3 entities of different types in the event according to random sampling (see the sketch below).
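A minimal sketch of this artificial-anomaly generation. Uniform replacement is an assumption on our part; the slide says only "random sampling".

```python
import numpy as np

rng = np.random.default_rng(0)

def make_artificial_anomaly(event, n_entities):
    """Replace 1-3 entities of different types to create a labeled anomaly."""
    e_anom = list(event)
    k = rng.integers(1, 4)                               # how many to replace
    for i in rng.choice(len(n_entities), size=k, replace=False):
        e_anom[i] = rng.integers(n_entities[i])          # random new entity
    return tuple(e_anom)
```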
  16. Data statistics
      Table 4: Entity types in the data sets.
      Data set | Entity types (arity)
      P2P      | day (7), hour (24), uid (361), src proc (778), dst proc (1752), src folder (255), dst folder (415)
      P2I      | day (7), hour (24), src ip (59), dst ip (184), dst port (283), proc (91), proc folder (70), uid (162), connection type (3)

      Table 5: Statistics of the collected two-week events.
      Data | # week 1  | # week 2  | # week 2 new
      P2P  | 95,434    | 107,619   | 53,478 (49.69%)
      P2I  | 1,316,357 | 1,330,376 | 498,029 (37.44%)
  17. Methods for comparison
      We compare the following models in the experiments:
      - Condition: proposed in [Das and Schneider, 2007].
      - CompreX: proposed in [Akoglu et al., 2012].
      - APE: the proposed method. Note that we use the negative of its likelihood output as the abnormality score.
      - APE (no weight): the same as APE, except that instead of learning w_ij we simply set ∀ i, j: w_ij = 1, i.e., APE without automatic weight learning on pairwise interactions; all types of interactions are weighted equally.
  18. Performance comparison for abnormal event detection
      Table 6: Values left of the slash are AUC of ROC; values right of the slash are average precision. The last two rows of each block (marked ∗) are averaged over 3 smaller (1%) test samples due to the long runtime of CompreX.

      P2P
      Model           | c=1             | c=2             | c=3
      Condition       | 0.6296 / 0.6777 | 0.6795 / 0.7321 | 0.7137 / 0.7672
      APE (no weight) | 0.8797 / 0.8404 | 0.9377 / 0.9072 | 0.9688 / 0.9449
      APE             | 0.8995 / 0.8845 | 0.9540 / 0.9378 | 0.9779 / 0.9639
      CompreX∗        | 0.8230 / 0.7683 | 0.8208 / 0.7566 | 0.8390 / 0.7978
      APE∗            | 0.9003 / 0.8892 | 0.9589 / 0.9394 | 0.9732 / 0.9616

      P2I
      Model           | c=1             | c=2             | c=3
      Condition       | 0.7733 / 0.7127 | 0.8300 / 0.7688 | 0.8699 / 0.8165
      APE (no weight) | 0.8912 / 0.8784 | 0.9412 / 0.9398 | 0.9665 / 0.9671
      APE             | 0.9267 / 0.9383 | 0.9669 / 0.9717 | 0.9838 / 0.9861
      CompreX∗        | 0.7749 / 0.8391 | 0.7834 / 0.8525 | 0.7832 / 0.8497
      APE∗            | 0.9291 / 0.9411 | 0.9656 / 0.9729 | 0.9829 / 0.9854
  19. Performance comparison for abnormal event detection
      Figure 2: Receiver operating characteristic (ROC) curves and precision-recall curves for abnormal event detection with Condition, CompreX, APE (no weight), and APE, on (a) P2P and (b) P2I.
  20. Parameter study
      Table 7: Average precision under different choices of noise distribution.
      Noise distribution                             | P2P    | P2I
      context-independent                            | 0.8463 | 0.7534
      context-dependent, log k P_n(e) = 0            | 0.8176 | 0.7868
      context-dependent, approximated log k P_n(e)   | 0.8845 | 0.9383

      Figure 3: Performance (average precision) versus the number of negative samples drawn per entity type (1-5), for P2P and P2I.
  21. Case study
      Some examples of detected anomalies (events with low probability):

      Table 8: Examples of detected abnormal events.
      Data | Abnormal event                                | Abnormal entity
      P2P  | ..., src proc: bash, src folder: /home/, ...  | src proc
      P2P  | ..., uid: 9 (some main user), hour: 1, ...    | uid
      P2I  | ..., proc: ssh, dst port: 80, ...             | dst port
  22. Pairwise weights visualization
      Figure 4: Learned pairwise weights w_ij, shown as heat maps over entity-type pairs for (a) P2P events (day, hour, uid, src proc, dst proc, src folder, dst folder) and (b) P2I events (day, hour, src ip, dst ip, dport, sproc, proc folder, uid, connection type).
  23. Embedding visualization
      Figure 5: 2D visualization of user embeddings. Each color indicates a user type in the system. The left-most and right-most points are Linux and Windows root users, respectively.
  24. Embedding visualization
      Figure 6: 2D visualization of hour embeddings, with working hours and non-working hours labeled.
  25. Q & A
      Thank you!
