2. ● Speculative ideas with specific techniques
● Python is great for NLP, ML, simple text processing
Overview
3. Author of Text Processing with NLTK Cookbook
Contributor to Bad Data Handbook
Blog @ StreamHacker.com
Helped create Seahorse / Gnome Keyring (GPG UI)
CTO @ InsightEngines.com
About me
9. • Edit distance (a.k.a. Levenshtein distance)
• Fuzzywuzzy
• Can use to identify similar strings
• Ex: Google vs Go0gle = edit distance 1
Fuzzy Matching
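To make the edit distance bullet concrete, here is a minimal pure-Python sketch of Levenshtein distance. The function names `edit_distance` and `similarity` are illustrative and are not Fuzzywuzzy's API; Fuzzywuzzy offers ready-made ratio functions built on the same idea.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


print(edit_distance("Google", "Go0gle"))  # 1
```

A similarity threshold (the cut-off is a judgment call) can then flag near-duplicates such as lookalike domain names.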
10. • Transform text into discrete values
• Use for data analysis, machine learning
• Art, not science
Feature Extraction
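One possible way to turn text into discrete values is a dictionary of simple features. The feature names below (`has(...)`, `token_count`, `has_digits`) are made up for illustration; which features are worth keeping is exactly the "art, not science" part.

```python
import re


def extract_features(text: str) -> dict:
    """Map a line of text to a dict of discrete feature values."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    # One boolean feature per distinct token (bag-of-words presence).
    features = {f"has({token})": True for token in tokens}
    # A couple of hypothetical summary features.
    features["token_count"] = len(tokens)
    features["has_digits"] = any(token.isdigit() for token in tokens)
    return features


print(extract_features("Failed password for root"))
```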
11. • Date parsing with dateutil
• Regex patterns
• Grammars with pyparsing
• Automatic log parsing with Logpai logparser
Parsing
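A small sketch combining two of the options above: `dateutil` for the timestamp and regex patterns for the fields. The sample line and the patterns `EMBEDDED_DATE`, `PAREN_FIELDS`, and `PID` are assumptions written for this one log layout, not a general parser. It requires the third-party `python-dateutil` package.

```python
import re

from dateutil import parser as date_parser

LINE = (
    "Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
    "39480627 wksh: HANDLING TELNET CALL (User: root, "
    "Branch: ABCDE, Client: 10101) pid=9644"
)

# Hypothetical patterns for this specific layout.
EMBEDDED_DATE = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")
PAREN_FIELDS = re.compile(r"\(([^)]*)\)")
PID = re.compile(r"\bpid=(\d+)")


def parse_record(line: str) -> dict:
    """Pull the embedded timestamp and key/value fields out of one line."""
    record = {}
    date_match = EMBEDDED_DATE.search(line)
    if date_match:
        # dateutil reads ambiguous dates month-first unless dayfirst=True.
        record["timestamp"] = date_parser.parse(date_match.group())
    fields_match = PAREN_FIELDS.search(line)
    if fields_match:
        for pair in fields_match.group(1).split(","):
            key, _, value = pair.partition(":")
            record[key.strip()] = value.strip()
    pid_match = PID.search(line)
    if pid_match:
        record["pid"] = int(pid_match.group(1))
    return record


print(parse_record(LINE))
```

Once the patterns multiply, a pyparsing grammar or an automatic template miner is likely easier to maintain than a growing pile of regexes.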
13. • acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Token Shapes
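The examples above are consistent with a simple rule, sketched here as a guess rather than the exact method behind the slide: map each character to a class (`a` lowercase, `A` uppercase, `n` digit), collapse repeated classes, and double the result when only one class remains. The function names are illustrative.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to its shape class."""
    if ch.isdigit():
        return "n"
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    return ch


def word_shape(word: str) -> str:
    """Collapse runs of the same class; double a single-class shape."""
    collapsed = "".join(cls for cls, _ in groupby(char_class(ch) for ch in word))
    return collapsed * 2 if len(collapsed) == 1 else collapsed


def token_shape(token: str) -> str:
    """Shape each alphanumeric run, leaving punctuation such as '=' alone."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: word_shape(m.group()), token)


for token in ["acmepayroll", "User", "ABCDE", "10101", "pid=9644"]:
    print(token, "->", token_shape(token))
```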
14. Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
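A sketch of how the raw line above could be reduced to the shape line. Two rules are inferred from the slide's output and are assumptions: timestamps become the literal token `date`, and labels followed by `:` or `=` (such as `syslog:` or `pid=`) are kept as-is. The two date patterns cover only the formats in this example, and `word_shape` uses a guessed collapsing rule.

```python
import re
from itertools import groupby

TOKEN = re.compile(
    r"""
    (?P<date>
        [A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}    # Sep 19 19:18:40
      | \d{2}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2}          # 04/30/10 12:18:51
    )
    | (?P<label>[A-Za-z]+(?=[:=]))                      # syslog:  pid=
    | (?P<word>[A-Za-z0-9]+)                            # everything else
    """,
    re.VERBOSE,
)


def word_shape(word: str) -> str:
    """a = lowercase, A = uppercase, n = digit; collapse runs, double singles."""
    classes = ("n" if c.isdigit() else "A" if c.isupper() else "a" for c in word)
    collapsed = "".join(cls for cls, _ in groupby(classes))
    return collapsed * 2 if len(collapsed) == 1 else collapsed


def _replace(match: re.Match) -> str:
    if match.lastgroup == "date":
        return "date"
    if match.lastgroup == "label":
        return match.group()
    return word_shape(match.group())


def normalize(line: str) -> str:
    """Replace dates and values with shape tokens, keeping labels."""
    return TOKEN.sub(_replace, line)


LINE = (
    "Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
    "39480627 wksh: HANDLING TELNET CALL (User: root, "
    "Branch: ABCDE, Client: 10101) pid=9644"
)
print(normalize(LINE))
```

Records that share a normalized form can then be grouped and counted as one record type.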
15. • Count tokens across all records & types (e.g. ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
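The counting idea can be sketched with `collections.Counter`. The sample records are invented shape lines for a single record type, and the `max_share` cut-off of 0.25 is an arbitrary assumption. Whether a rare token is truly an anomaly still needs a human look.

```python
from collections import Counter


def document_frequencies(records):
    """Count in how many records each token appears."""
    counts = Counter()
    for record in records:
        counts.update(set(record.split()))
    return counts


def rare_tokens(record, frequencies, total, max_share=0.25):
    """Tokens that appear in at most max_share of all records."""
    return [t for t in record.split() if frequencies[t] / total <= max_share]


# Hypothetical, already-normalized ssh records.
records = [
    "date aa sshd: Accepted password for aa from nn port nn",
    "date aa sshd: Accepted password for aa from nn port nn",
    "date aa sshd: Failed password for aa from nn port nn",
    "date aa sshd: Accepted password for aa from nn port nn AA",
]
frequencies = document_frequencies(records)
for record in records:
    print(rare_tokens(record, frequencies, len(records)), "<-", record)
```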
16. 1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
Classification
• Must train on good labeled data
• Binary classification is most accurate
• Scikit-learn has many options
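The three steps can be sketched with scikit-learn as below. The choice of `CountVectorizer` plus `MultinomialNB` is just one of the many options, and the six training records are invented and far too few for real use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled records: ssh vs everything else.
records = [
    "sshd Accepted password for root from 10.0.0.5 port 22",
    "sshd Failed password for admin from 10.0.0.9 port 22",
    "sshd Accepted publickey for deploy from 10.0.0.7 port 22",
    "cron job started backup script",
    "kernel usb device connected",
    "cron job finished backup script",
]
labels = ["ssh", "ssh", "ssh", "other", "other", "other"]

# Steps 1 and 2: feature extraction feeding a classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(records, labels)

# Step 3: class probabilities for an unseen record.
new_record = "sshd Failed password for root from 10.0.0.8 port 22"
probabilities = dict(zip(model.classes_, model.predict_proba([new_record])[0]))
print(probabilities)
```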
18. ● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
● Fraud
Real World Classification
19. 1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify?
3. What is not ssh but does classify?
Log Classification Anomalies
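A sketch of the comparison in steps 2 and 3: train on records believed to be correctly typed, then compare each incoming record's declared type with the classifier's prediction. All records here are invented, and the Naive Bayes pipeline is an assumed choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Step 1: train on hypothetical clean data, ssh vs everything else.
train_records = [
    "sshd Accepted password for root from 10.0.0.5 port 22",
    "sshd Failed password for admin from 10.0.0.9 port 22",
    "sshd Accepted publickey for deploy from 10.0.0.7 port 22",
    "cron job started backup script",
    "kernel usb device connected",
    "cron job finished backup script",
]
train_labels = ["ssh", "ssh", "ssh", "other", "other", "other"]
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_records, train_labels)

# Incoming batch as (declared type, record) pairs.
batch = [
    ("ssh", "sshd Accepted password for root from 10.0.0.5 port 22"),
    ("ssh", "cron job started backup script"),
    ("other", "sshd Failed password for admin from 10.0.0.9 port 22"),
]
predicted = model.predict([record for _, record in batch])

# Step 2: declared ssh, but the classifier disagrees.
ssh_but_not_classified = [
    record for (declared, record), guess in zip(batch, predicted)
    if declared == "ssh" and guess != "ssh"
]
# Step 3: not declared ssh, but it classifies as ssh.
not_ssh_but_classified = [
    record for (declared, record), guess in zip(batch, predicted)
    if declared != "ssh" and guess == "ssh"
]
print(ssh_but_not_classified)
print(not_ssh_but_classified)
```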
21. ● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
Clustering
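A minimal clustering sketch with scikit-learn, assuming TF-IDF features and k-means with two clusters; both are illustrative choices among many. The records are invented, and a real batch would need to be much larger. Gensim is not used here.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabeled batch containing two record types.
records = [
    "sshd accepted password for root from port 22",
    "sshd failed password for admin from port 22",
    "sshd accepted password for deploy from port 22",
    "cron job started backup script",
    "cron job finished backup script",
    "cron job started cleanup script",
]

# Records become vectors; similar records end up close together.
vectors = TfidfVectorizer().fit_transform(records)

# Group by distance. No labels are supplied at any point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)

for cluster_id, record in zip(cluster_ids, records):
    print(cluster_id, record)
```

The cluster numbers themselves are arbitrary; only which records share a number matters.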
23. 1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
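Step 3 can be sketched by measuring each record's distance to its nearest k-means centroid. The two-dimensional feature vectors below are invented stand-ins for real extracted features, and k-means is an assumed choice of clustering method.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature vectors: two tight groups plus one stray record.
features = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],          # record type A
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],  # record type B
    [14.0, 14.0],                                            # stray record
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# transform() returns the distance from every record to every centroid;
# the minimum per row is the distance to the record's own centroid.
distance_to_centroid = kmeans.transform(features).min(axis=1)

worst = int(np.argmax(distance_to_centroid))
print(worst, features[worst], distance_to_centroid[worst])
```

Ranking records by this distance gives a review queue, with the poorest fits first.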
24. ● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer
Anomaly Detection
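One of the scikit-learn options is `OneClassSVM`, sketched below. The two numeric features per record (token count, digit count) and all values are invented, and the `nu` setting is an arbitrary assumption. Loglizer is not used here.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical features from known-good records: [token_count, digit_count].
good = np.array([
    [10, 2], [11, 2], [10, 3], [12, 2],
    [11, 3], [10, 2], [12, 3], [11, 2],
], dtype=float)

# Learn the region occupied by good data only.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(good)

# New records: one typical, one very unlike the training data.
new = np.array([[11, 2], [40, 25]], dtype=float)
predictions = detector.predict(new)  # +1 = fits, -1 = does not fit

for row, verdict in zip(new, predictions):
    print(row, "fits" if verdict == 1 else "does not fit")
```

`IsolationForest` and `LocalOutlierFactor` expose a similar fit and predict interface if a different model suits the data better.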
28. ● Investigator: plain English log search -> multiple
visualizations & recommendations on what to do next
● Analyzer: data health analysis
● InsightEngines.com
About Insight Engines