NLP Techniques for Log
Analysis
Jacob Perkins, CTO @ Insight Engines
● Speculative ideas with specific techniques
● Python is great for NLP, ML, simple text processing
Overview
Author of Text Processing with NLTK Cookbook
Contributor to Bad Data Handbook
Blog @ StreamHacker.com
Helped create Seahorse / Gnome Keyring (GPG UI)
CTO @ InsightEngines.com
About me
1. Tokenization
2. Feature Extraction
3. Classification
4. Clustering
5. Anomaly Detection
Topics
• Split text into tokens
• Many options beyond whitespace
• Works on arbitrary text
• NLTK has many tokenizers
Tokenization
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
https://text-processing.com/demo/tokenize/
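NLTK ships many tokenizers (whitespace, regexp, treebank, and more); as a minimal stdlib sketch of the idea, a regexp tokenizer that splits punctuation into its own tokens can be written with `re` (this is an illustration, not NLTK's implementation):

```python
import re

def tokenize(line):
    """Split into word-character runs and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)

line = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
        "39480627 wksh: HANDLING TELNET CALL (User: root, "
        "Branch: ABCDE, Client: 10101) pid=9644")
tokens = tokenize(line)
```

Note how a plain whitespace split would keep `syslog:` and `pid=9644` glued together, while this separates the punctuation.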
• Edit distance (a.k.a. Levenshtein distance)
• The fuzzywuzzy library wraps it for fuzzy string matching
• Can be used to identify similar strings
• Ex: Google vs Go0gle = edit distance 1
Fuzzy Matching
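Fuzzywuzzy builds on this metric; as a pure-stdlib sketch, the underlying dynamic-programming edit distance looks like this:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

distance = edit_distance("Google", "Go0gle")  # one substitution: o -> 0
```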
• Transform text into discrete values
• Use for data analysis, machine learning
• Art, not science
Feature Extraction
• Date parsing with dateutil
• Regex patterns
• Grammars with pyparsing
• Automatic log parsing with Logpai logparser
Parsing
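As a sketch of regex-based parsing against the sample record (the field names in the named groups are illustrative, not from any standard schema):

```python
import re

# Named groups pull structured fields out of the free-text record.
LOG_PATTERN = re.compile(
    r"User: (?P<user>\w+), "
    r"Branch: (?P<branch>\w+), "
    r"Client: (?P<client>\d+)\) "
    r"pid=(?P<pid>\d+)"
)

line = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
        "39480627 wksh: HANDLING TELNET CALL (User: root, "
        "Branch: ABCDE, Client: 10101) pid=9644")
fields = LOG_PATTERN.search(line).groupdict()
```

For anything much more complex than this, a pyparsing grammar or an automatic template miner like Logpai's logparser scales better than hand-maintained regexes.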
● Bigram: (acmepayroll, syslog)
● Trigram: (HANDLING, TELNET, CALL)
● Skipgram: (syslog, HANDLING, CALL)
Ngram Features
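The features above can be generated without any library; a sketch (the skipgram window size `k` is an assumption here, and NLTK's `nltk.util.skipgrams` offers a ready-made version):

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-token windows: bigrams for n=2, trigrams for n=3."""
    return list(zip(*(tokens[i:] for i in range(n))))

def skipgrams(tokens, n, k):
    """n-token subsequences anchored at each position, allowing up to
    k skipped tokens inside the window."""
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i:i + n + k]
        for combo in combinations(range(len(window)), n):
            if combo[0] == 0:  # anchor each gram at the window start
                grams.add(tuple(window[j] for j in combo))
    return grams

tokens = ["acmepayroll", "syslog", "HANDLING", "TELNET", "CALL"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```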
• acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Token Shapes
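A sketch of the shape mapping implied by the examples above, treating each alphanumeric run as lowercase (`aa`), capitalized (`Aa`), uppercase (`AA`), or numeric (`nn`) and leaving punctuation in place (single-character runs are an edge case this sketch glosses over):

```python
import re

def _shape_word(w):
    """Classify one alphanumeric run."""
    if w.isdigit():
        return "nn"
    if w.isupper():
        return "AA"
    if w[0].isupper():
        return "Aa"
    return "aa"

def token_shape(token):
    """Replace every alphanumeric run with its coarse shape code."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: _shape_word(m.group()), token)
```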
Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
• Count tokens across all records & types (e.g. ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
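A minimal sketch of the counting step with `collections.Counter` (the sample records and the rarity threshold are invented for illustration):

```python
from collections import Counter

def rare_tokens(records, threshold=2):
    """Flag tokens whose corpus-wide count falls below threshold."""
    counts = Counter(tok for rec in records for tok in rec.split())
    return {tok for tok, n in counts.items() if n < threshold}

records = [
    "sshd accepted password for root",
    "sshd accepted password for alice",
    "sshd failed password for r00t",   # suspicious variant
]
rare = rare_tokens(records)
```

In practice a TF-IDF-style weighting does the same job with more nuance than a hard count threshold.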
1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
Classification
• Must train on good labeled data
• Binary classification is most accurate
• Scikit-learn has many options
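A toy scikit-learn pipeline covering the three steps above, using bag-of-words features and Naive Bayes (the records and labels are invented; any scikit-learn classifier with `predict_proba` slots into the same pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data: binary classification, ssh vs everything else.
records = [
    "sshd accepted password for root from 10.0.0.1",
    "sshd failed password for admin from 10.0.0.2",
    "sshd session opened for user alice",
    "sshd connection closed by 10.0.0.3",
    "httpd GET /index.html 200",
    "httpd POST /login 403",
    "cron job started for user root",
    "kernel out of memory killing process",
]
labels = ["ssh", "ssh", "ssh", "ssh", "other", "other", "other", "other"]

# Feature extraction + classifier in one pipeline.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(records, labels)

# Step 3: class probabilities for a new record.
probs = clf.predict_proba(["sshd invalid password for bob"])
```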
● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
● Fraud
Real World Classification
1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify as ssh?
3. What is not ssh but does classify as ssh?
Log Classification Anomalies
Features:
● Description
● Rules / thresholds
● Log record features
Labels = priority level (high, medium, low)
Alert Classification
● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
Clustering
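A minimal scikit-learn clustering sketch over invented records: TF-IDF features plus k-means, which should group the ssh-like and httpd-like lines separately:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

records = [
    "sshd accepted password for root",
    "sshd failed password for admin",
    "sshd session opened for alice",
    "httpd GET /index.html 200",
    "httpd GET /about.html 200",
    "httpd POST /login 403",
]
X = TfidfVectorizer().fit_transform(records)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster id per record
```

`km.transform(X)` gives each record's distance to every centroid, which is one way to flag records that don't cluster well.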
1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer
Anomaly Detection
● Tokenization
● Feature extraction
● Classification
● Clustering
● Anomaly detection
Summary
• NLTK
• Scikit-Learn
• Gensim
• Logpai
• Text-processing.com
• Streamhacker.com
References
● Investigator: plain English log search -> multiple
visualizations & recommendations for what to do next
● Analyzer: data health analysis
● InsightEngines.com
About Insight Engines
Thank you!

Editor's Notes

  • #7 Punctuation in weird places
  • #8 NLP example: can’t
  • #9 Trained on WSJ news articles
  • #12 Grammars ~= multi-line regex
  • #13 Bigram & Trigram features can add a lot to classification & clustering accuracy
  • #16 Use token shapes to normalize? Technique based on TF-IDF & search indexing to identify high-information words
  • #19 Sentiment used a lot for marketing analytics
  • #20 One-vs-all classification.
  • #21 Triage, identify false positives or negatives
  • #22 Topic modeling is a different type of clustering