NLP techniques for log analysis

Presented at Silicon Valley Cyber Security Meetup on 9/25/2018

Published in: Technology

  1. NLP Techniques for Log Analysis. Jacob Perkins, CTO @ Insight Engines
  2. Overview: speculative ideas with specific techniques; Python is great for NLP, ML, and simple text processing
  3. About me: author of Text Processing with NLTK Cookbook; contributor to Bad Data Handbook; blog @ StreamHacker.com; helped create Seahorse / Gnome Keyring (GPG UI); CTO @ InsightEngines.com
  4. Topics: (1) Tokenization; (2) Feature Extraction; (3) Classification; (4) Clustering; (5) Anomaly Detection
  5. Tokenization: split text into tokens; many options beyond whitespace; works on any arbitrary text; NLTK has many tokenizers
  6. Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644 (https://text-processing.com/demo/tokenize/)
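
To make "many options beyond whitespace" concrete, here is a minimal sketch that tokenizes the sample record two ways. It uses only the standard-library `re` module as a stand-in for NLTK's tokenizers; the pattern and the `tokenize` name are illustrative assumptions, not something from the talk.

```python
import re

LOG_LINE = (
    "Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
    "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644"
)

# Assumed pattern: a run of word characters is one token, and every other
# non-space character becomes its own single-character token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


# Whitespace splitting keeps "19:18:40" and "syslog:" as single tokens.
whitespace_tokens = LOG_LINE.split()
# The regex tokenizer breaks them apart at punctuation.
regex_tokens = tokenize(LOG_LINE)

print(whitespace_tokens[:5])
print(regex_tokens[:10])
```

Which split is better depends on the features wanted later: keeping `19:18:40` whole helps date parsing, while splitting `pid=9644` exposes the key and the value separately.
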
  9. Fuzzy Matching: edit distance (a.k.a. Levenshtein distance); Fuzzywuzzy; can be used to identify similar strings; e.g. Google vs Go0gle = edit distance 1
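
The edit distance in the Google vs Go0gle example can be checked with a short dynamic-programming sketch. This is a plain standard-library implementation of Levenshtein distance, written for illustration; it is not the Fuzzywuzzy API, and the `edit_distance` name is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the DP table."""
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("Google", "Go0gle"))  # 1: one substitution, o -> 0
```

A small distance relative to string length is a reasonable signal for look-alike hostnames or usernames, though the cut-off is a judgment call.
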
  10. Feature Extraction: transform text into discrete values; use for data analysis and machine learning; art, not science
  11. Parsing: date parsing with dateutil; regex patterns; grammars with pyparsing; automatic log parsing with Logpai logparser
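
As a rough illustration of regex patterns plus date parsing, the sketch below pulls fields out of the sample record. It uses the standard library's `datetime.strptime` in place of dateutil, and it assumes the embedded date is month/day/year. The group names (`sequence`, `program`, and so on) are guesses at what the fields mean, not documented names.

```python
import re
from datetime import datetime

LOG_LINE = (
    "Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
    "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644"
)

# Hypothetical pattern for the application part of the record.
RECORD_PATTERN = re.compile(
    r"(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<sequence>\d+) (?P<program>\w+): (?P<message>[^(]+)"
    r"\((?P<fields>[^)]*)\) pid=(?P<pid>\d+)"
)


def parse_record(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = RECORD_PATTERN.search(line)
    if match is None:
        return None
    record = match.groupdict()
    # Assumption: the date is written month/day/two-digit-year.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%m/%d/%y %H:%M:%S"
    )
    record["message"] = record["message"].strip()
    # "User: root, Branch: ABCDE" -> {"User": "root", "Branch": "ABCDE"}
    record["fields"] = dict(re.findall(r"(\w+): (\w+)", record["fields"]))
    record["pid"] = int(record["pid"])
    return record


parsed = parse_record(LOG_LINE)
print(parsed)
```

Hand-written patterns like this are brittle across log formats, which is the motivation for grammar-based or automatic parsers.
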
  12. Ngram Features: bigram (acmepayroll, syslog); trigram (HANDLING, TELNET, CALL); skipgram (syslog, HANDLING, CALL)
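
The three feature types can be sketched in a few lines of standard-library Python. The token list is a simplified version of the sample record with the date and number tokens dropped, and the skipgram definition used here (a head token plus later tokens, allowing up to `k` skipped tokens in total) is one common convention, assumed for illustration.

```python
from itertools import combinations


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, k):
    """n-token tuples in order, allowing up to k skipped tokens in total."""
    grams = []
    for i in range(len(tokens)):
        # The remaining n-1 tokens are drawn from the next n-1+k tokens.
        window = tokens[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            grams.append((tokens[i],) + rest)
    return grams


# Simplified token list (assumption: dates and numbers already removed).
tokens = ["acmepayroll", "syslog", "HANDLING", "TELNET", "CALL"]

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
skips = skipgrams(tokens, 3, 1)

print(bigrams[0])
print(trigrams[-1])
print(("syslog", "HANDLING", "CALL") in skips)
```
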
  13. Token Shapes: acmepayroll -> aa; User -> Aa; ABCDE -> AA; 10101 -> nn; pid=9644 -> aa=nn
  14. Log -> Token Shapes & Date Parsing. Log: Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644. Result: date aa syslog: date nn wksh: AA AA AA (User: aa, Branch: AA, Client: nn) pid=nn
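
The slides do not spell out the exact shape rule, so the following is one guess that reproduces the five token-shape examples: map each character to a class (lowercase, uppercase, digit), collapse runs of the same class, and double a shape that ends up as a single class. Punctuation is left untouched. All function names are hypothetical.

```python
import re


def char_class(ch):
    """Map one alphanumeric character to its class symbol."""
    if ch.isdigit():
        return "n"
    if ch.isupper():
        return "A"
    return "a"


def chunk_shape(chunk):
    """Shape of one alphanumeric chunk, e.g. 'User' -> 'Aa', 'root' -> 'aa'."""
    shape = []
    for ch in chunk:
        cls = char_class(ch)
        if not shape or shape[-1] != cls:
            shape.append(cls)
    collapsed = "".join(shape)
    # A single-class chunk is written with a doubled symbol (aa, AA, nn).
    return collapsed * 2 if len(collapsed) == 1 else collapsed


def token_shape(token):
    """Replace each alphanumeric chunk with its shape; keep punctuation."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: chunk_shape(m.group()), token)


for token in ["acmepayroll", "User", "ABCDE", "10101", "pid=9644"]:
    print(token, "->", token_shape(token))
```

Shapes discard the literal values, so records that differ only in usernames or IDs collapse to the same feature string.
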
  15. Identifying Rare Tokens: count tokens across all records and record types (e.g. ssh); how uniform are tokens within a record type? Mostly uniform ~= clean data. Does a given record have rare tokens? Rare = anomaly?
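
A minimal sketch of the counting idea, using `collections.Counter` over whitespace tokens. The sample records are made up for illustration, and the `max_count` threshold is an arbitrary assumption that would need tuning on real data.

```python
from collections import Counter


def rare_tokens(records, max_count=1):
    """Map record index -> tokens seen at most max_count times overall."""
    counts = Counter(token for record in records for token in record.split())
    flagged = {}
    for index, record in enumerate(records):
        rare = [token for token in record.split() if counts[token] <= max_count]
        if rare:
            flagged[index] = rare
    return flagged


# Hypothetical ssh-style records; the last one has a look-alike username.
records = [
    "sshd session opened for user root",
    "sshd session opened for user alice",
    "sshd session closed for user root",
    "sshd session closed for user alice",
    "sshd session opened for user r00t",
]

print(rare_tokens(records))
```

On a uniform record type most tokens repeat, so the few that do not are cheap candidates for a closer look.
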
  16. Classification: (1) log record -> feature extraction; (2) features -> classifier; (3) classifier returns class probabilities. Must train on good labeled data; binary classification is most accurate; Scikit-learn has many options
  17. Real World Classification: spam vs ham; sentiment & opinion analysis (positive vs negative); fraud
  18. Log Classification Anomalies: (1) train on record type (ssh vs everything else); (2) what has type ssh but doesn't classify as ssh? (3) what is not ssh but does classify as ssh?
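
The three steps above can be sketched with scikit-learn: train a binary ssh-vs-other classifier, then list records whose stated type disagrees with the predicted class. The training records, the choice of `CountVectorizer` with `MultinomialNB`, and the 0.5 threshold are all assumptions made for illustration; real data would need far more examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training data.
train_records = [
    "sshd Accepted password for root from 10.0.0.1",
    "sshd Failed password for admin from 10.0.0.2",
    "sshd session opened for user bob",
    "cron job started backup",
    "kernel usb device connected",
    "cron job finished backup",
]
train_labels = ["ssh", "ssh", "ssh", "other", "other", "other"]

# Bag-of-words features feeding a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_records, train_labels)


def mismatches(records, labels, threshold=0.5):
    """Records whose stated type disagrees with the classifier."""
    ssh_column = list(model.classes_).index("ssh")
    ssh_probs = model.predict_proba(records)[:, ssh_column]
    flagged = []
    for record, label, prob in zip(records, labels, ssh_probs):
        looks_like_ssh = prob >= threshold
        if (label == "ssh") != looks_like_ssh:
            flagged.append((record, label, round(float(prob), 3)))
    return flagged


new_records = [
    "sshd Failed password for root from 10.0.0.9",
    "cron job started cleanup",
    "sshd Accepted password for bob from 10.0.0.3",
]
new_labels = ["ssh", "ssh", "other"]

flagged = mismatches(new_records, new_labels)
for record, label, prob in flagged:
    print(label, prob, record)
```

Both kinds of mismatch are worth reviewing: a record typed ssh that does not look like ssh, and a record of another type that does.
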
  19. Alert Classification: features are description, rules / thresholds, and log record features; labels = priority level (high, medium, low)
  20. Clustering: no training needed (unsupervised); group by feature similarity / distance; must operate on a large batch of records; Scikit-learn has many options; Gensim for topic modeling
  21. Data Clustering Anomalies: (1) cluster a few different record types; (2) does each type correspond to a single cluster? (3) which records don't cluster well (far from centroid)?
  22. Anomaly Detection: a.k.a. novelty / outlier detection; a.k.a. one-class classification; learn from a good data set; identify new records that don't fit; Scikit-learn has a few options; automated anomaly detection with Logpai loglizer
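
As a deliberately simple illustration of "learn from a good data set, identify new records that don't fit", the sketch below learns only the vocabulary of known-good records and scores a new record by the fraction of its tokens never seen before. This is a toy stand-in, not one of scikit-learn's detectors or Logpai loglizer, and the records are hypothetical.

```python
def fit_vocabulary(good_records):
    """Collect every token that appears in the known-good records."""
    return {token for record in good_records for token in record.split()}


def novelty_score(record, vocabulary):
    """Fraction of the record's tokens that were never seen in good data."""
    tokens = record.split()
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


# Hypothetical known-good records.
good_records = [
    "session opened for user root",
    "session closed for user root",
]
vocabulary = fit_vocabulary(good_records)

for record in [
    "session opened for user root",
    "session opened for user mallory",
    "kernel panic detected",
]:
    print(round(novelty_score(record, vocabulary), 2), record)
```

Scoring token shapes instead of raw tokens would keep ordinary new usernames or IDs from counting as novel.
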
  23. Summary: tokenization; feature extraction; classification; clustering; anomaly detection
  24. References: NLTK; Scikit-Learn; Gensim; Logpai; Text-processing.com; Streamhacker.com
  25. About Insight Engines: Investigator (plain English log search -> multiple visualizations & recommendations on what to do next); Analyzer (data health analysis); InsightEngines.com
  26. Thank you!
