• acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
• Count tokens across all records & types (ie ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
• Must train on good labeled data
• Binary classification is most accurate
• Scikit-learn has many options
● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
Real World Classification
1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify?
3. What is not ssh but does classify?
Log Classification Anomalies
● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer