SlideShare a Scribd company logo
NLP Techniques for Log
Analysis
Jacob Perkins, CTO @ Insight Engines
● Speculative ideas with specific techniques
● Python is great for NLP, ML, simple text processing
Overview
Author of Text Processing with NLTK Cookbook
Contributor to Bad Data Handbook
Blog @ StreamHacker.com
Helped create Seahorse / Gnome Keyring (GPG UI)
CTO @ InsightEngines.com
About me
1. Tokenization
2. Feature Extraction
3. Classification
4. Clustering
5. Anomaly Detection
Topics
• Split text into tokens
• Many options beyond whitespace
• Works on any arbitrary text
• NLTK has many tokenizers
Tokenization
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
https://text-processing.com/demo/tokenize/
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
https://text-processing.com/demo/tokenize/
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
https://text-processing.com/demo/tokenize/
• Edit distance (a.k.a Levenshtein distance)
• Fuzzywuzzy
• Can use to identify similar strings
• Ex: Google vs Go0gle = edit distance 1
Fuzzy Matching
• Transform text into discrete values
• Use for data analysis, machine learning
• Art, not science
Feature Extraction
• Date parsing with dateutil
• Regex patterns
• Grammars with pyparsing
• Automatic log parsing with Logpai logparser
Parsing
● Bigram: (acmepayroll, syslog)
● Trigram: (HANDLING, TELNET, CALL)
● Skipgram: (syslog, HANDLING, CALL)
Ngram Features
• acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Token Shapes
Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
• Count tokens across all records & types (ie ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
Classification
• Must train on good labeled data
• Binary classification is most accurate
• Scikit-learn has many options
● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
● Fraud
Real World Classification
1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify?
3. What is not ssh but does classify?
Log Classification Anomalies
Features:
● Description
● Rules / thresholds
● Log record features
Labels = priority level (high, medium, low)
Alert Classification
● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
Clustering
1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer
Anomaly Detection
● Tokenization
● Feature extraction
● Classification
● Clustering
● Anomaly detection
Summary
• NLTK
• Scikit-Learn
• Gensim
• Logpai
• Text-processing.com
• Streamhacker.com
References
● Investigator: plain english log search -> multiple
visualizations & recommendations to do next
● Analyzer: data health analysis
● InsightEngines.com
About Insight Engines
Thank you!

More Related Content

What's hot

What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
Henrik Skogström
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using ML
Jorge Cardoso
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Neo4j
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Databricks
 
OpenVINO introduction
OpenVINO introductionOpenVINO introduction
OpenVINO introduction
Yury Gorbachev
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
Databricks
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
Hortonworks
 
Notes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew NgNotes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew Ng
Tess Ferrandez
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Databricks
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
DataWorks Summit/Hadoop Summit
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
Sasha Rosenbaum
 
Well architected ML platforms for Enterprise Data Science
Well architected ML platforms for Enterprise Data ScienceWell architected ML platforms for Enterprise Data Science
Well architected ML platforms for Enterprise Data Science
Leela Krishna Kandrakota
 
Graph Data Science at Scale
Graph Data Science at ScaleGraph Data Science at Scale
Graph Data Science at Scale
Neo4j
 
Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
ananth
 
Building Software Systems at Google and Lessons Learned
Building Software Systems at Google and Lessons LearnedBuilding Software Systems at Google and Lessons Learned
Building Software Systems at Google and Lessons Learned
parallellabs
 
Scikit Learn intro
Scikit Learn introScikit Learn intro
Scikit Learn intro
9xdot
 
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Edureka!
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
SindhuVasireddy1
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
DataWorks Summit
 

What's hot (20)

What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using ML
 
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
Knowledge Graphs & Graph Data Science, More Context, Better Predictions - Neo...
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
OpenVINO introduction
OpenVINO introductionOpenVINO introduction
OpenVINO introduction
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Notes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew NgNotes from Coursera Deep Learning courses by Andrew Ng
Notes from Coursera Deep Learning courses by Andrew Ng
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
 
Reliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at AirbnbReliable and Scalable Data Ingestion at Airbnb
Reliable and Scalable Data Ingestion at Airbnb
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
 
Well architected ML platforms for Enterprise Data Science
Well architected ML platforms for Enterprise Data ScienceWell architected ML platforms for Enterprise Data Science
Well architected ML platforms for Enterprise Data Science
 
Graph Data Science at Scale
Graph Data Science at ScaleGraph Data Science at Scale
Graph Data Science at Scale
 
Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
 
Building Software Systems at Google and Lessons Learned
Building Software Systems at Google and Lessons LearnedBuilding Software Systems at Google and Lessons Learned
Building Software Systems at Google and Lessons Learned
 
Scikit Learn intro
Scikit Learn introScikit Learn intro
Scikit Learn intro
 
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
Keras Tutorial For Beginners | Creating Deep Learning Models Using Keras In P...
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
 
Using LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache ArrowUsing LLVM to accelerate processing of data in Apache Arrow
Using LLVM to accelerate processing of data in Apache Arrow
 

Similar to NLP techniques for log analysis

How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
Milo Yip
 
Taming Text
Taming TextTaming Text
Taming Text
Grant Ingersoll
 
Query and audit logging in cassandra
Query and audit logging in cassandraQuery and audit logging in cassandra
Query and audit logging in cassandra
Vinay Kumar Chella
 
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
Alexandre Moneger
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
Tomas Doran
 
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
ICSM 2011
 
The Newest in Session Types
The Newest in Session TypesThe Newest in Session Types
The Newest in Session Types
Roland Kuhn
 
NSLogger - Cocoaheads Paris Presentation - English
NSLogger - Cocoaheads Paris Presentation - EnglishNSLogger - Cocoaheads Paris Presentation - English
NSLogger - Cocoaheads Paris Presentation - English
Florent Pillet
 
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...
EC-Council
 
Ansible Best Practices - July 30
Ansible Best Practices - July 30Ansible Best Practices - July 30
Ansible Best Practices - July 30
tylerturk
 
Cryptography in PHP: use cases
Cryptography in PHP: use casesCryptography in PHP: use cases
Cryptography in PHP: use cases
Enrico Zimuel
 
Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)
packetloop
 
Paranoia 2018: A Process is No One
Paranoia 2018: A Process is No OneParanoia 2018: A Process is No One
Paranoia 2018: A Process is No One
Jared Atkinson
 
NLP using JavaScript Natural Library
NLP using JavaScript Natural LibraryNLP using JavaScript Natural Library
NLP using JavaScript Natural Library
Aniruddha Chakrabarti
 
NBTC#2 - Why instrumentation is cooler then ice
NBTC#2 - Why instrumentation is cooler then iceNBTC#2 - Why instrumentation is cooler then ice
NBTC#2 - Why instrumentation is cooler then ice
Alexandre Moneger
 
Vulnerability, exploit to metasploit
Vulnerability, exploit to metasploitVulnerability, exploit to metasploit
Vulnerability, exploit to metasploit
Tiago Henriques
 
03 blockchain transactions
03 blockchain transactions03 blockchain transactions
03 blockchain transactions
BastianBlankenburg
 
BSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad GuysBSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad Guys
Joff Thyer
 
OpenAI GPT in Depth - Questions and Misconceptions
OpenAI GPT in Depth - Questions and MisconceptionsOpenAI GPT in Depth - Questions and Misconceptions
OpenAI GPT in Depth - Questions and Misconceptions
Ivo Andreev
 

Similar to NLP techniques for log analysis (20)

How to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the WorldHow to Write the Fastest JSON Parser/Writer in the World
How to Write the Fastest JSON Parser/Writer in the World
 
Taming Text
Taming TextTaming Text
Taming Text
 
rspamd-slides
rspamd-slidesrspamd-slides
rspamd-slides
 
Query and audit logging in cassandra
Query and audit logging in cassandraQuery and audit logging in cassandra
Query and audit logging in cassandra
 
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
BSides LV 2016 - Beyond the tip of the iceberg - fuzzing binary protocols for...
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...Industry - Program analysis and verification - Type-preserving Heap Profiler ...
Industry - Program analysis and verification - Type-preserving Heap Profiler ...
 
The Newest in Session Types
The Newest in Session TypesThe Newest in Session Types
The Newest in Session Types
 
NSLogger - Cocoaheads Paris Presentation - English
NSLogger - Cocoaheads Paris Presentation - EnglishNSLogger - Cocoaheads Paris Presentation - English
NSLogger - Cocoaheads Paris Presentation - English
 
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...
Hacker Halted 2014 - RDP Fuzzing And Why the Microsoft Open Protocol Specific...
 
Ansible Best Practices - July 30
Ansible Best Practices - July 30Ansible Best Practices - July 30
Ansible Best Practices - July 30
 
Cryptography in PHP: use cases
Cryptography in PHP: use casesCryptography in PHP: use cases
Cryptography in PHP: use cases
 
Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)Finding Needles in Haystacks (The Size of Countries)
Finding Needles in Haystacks (The Size of Countries)
 
Paranoia 2018: A Process is No One
Paranoia 2018: A Process is No OneParanoia 2018: A Process is No One
Paranoia 2018: A Process is No One
 
NLP using JavaScript Natural Library
NLP using JavaScript Natural LibraryNLP using JavaScript Natural Library
NLP using JavaScript Natural Library
 
NBTC#2 - Why instrumentation is cooler then ice
NBTC#2 - Why instrumentation is cooler then iceNBTC#2 - Why instrumentation is cooler then ice
NBTC#2 - Why instrumentation is cooler then ice
 
Vulnerability, exploit to metasploit
Vulnerability, exploit to metasploitVulnerability, exploit to metasploit
Vulnerability, exploit to metasploit
 
03 blockchain transactions
03 blockchain transactions03 blockchain transactions
03 blockchain transactions
 
BSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad GuysBSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad Guys
 
OpenAI GPT in Depth - Questions and Misconceptions
OpenAI GPT in Depth - Questions and MisconceptionsOpenAI GPT in Depth - Questions and Misconceptions
OpenAI GPT in Depth - Questions and Misconceptions
 

Recently uploaded

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
Vlad Stirbu
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 

Recently uploaded (20)

Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 

NLP techniques for log analysis

  • 1. NLP Techniques for Log Analysis Jacob Perkins, CTO @ Insight Engines
  • 2. ● Speculative ideas with specific techniques ● Python is great for NLP, ML, simple text processing Overview
  • 3. Author of Text Processing with NLTK Cookbook Contributor to Bad Data Handbook Blog @ StreamHacker.com Helped create Seahorse / Gnome Keyring (GPG UI) CTO @ InsightEngines.com About me
  • 4. 1. Tokenization 2. Feature Extraction 3. Classification 4. Clustering 5. Anomaly Detection Topics
  • 5. • Split text into tokens • Many options beyond whitespace • Works on any arbitrary text • NLTK has many tokenizers Tokenization
  • 6. Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644 https://text-processing.com/demo/tokenize/
  • 7. Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644 https://text-processing.com/demo/tokenize/
  • 8. Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644 https://text-processing.com/demo/tokenize/
  • 9. • Edit distance (a.k.a Levenshtein distance) • Fuzzywuzzy • Can use to identify similar strings • Ex: Google vs Go0gle = edit distance 1 Fuzzy Matching
  • 10. • Transform text into discrete values • Use for data analysis, machine learning • Art, not science Feature Extraction
  • 11. • Date parsing with dateutil • Regex patterns • Grammars with pyparsing • Automatic log parsing with Logpai logparser Parsing
  • 12. ● Bigram: (acmepayroll, syslog) ● Trigram: (HANDLING, TELNET, CALL) ● Skipgram: (syslog, HANDLING, CALL) Ngram Features
  • 13. • acmepayroll -> aa • User -> Aa • ABCDE -> AA • 10101 -> nn • pid=9644 -> aa=nn Token Shapes
  • 14. Log -> Token Shapes & Date Parsing date aa syslog: date nn wksh: AA AA AA (User: aa, Branch: AA, Client: nn) pid=nn Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644
  • 15. • Count tokens across all records & types (ie ssh) • How uniform are tokens within a record type? • Mostly uniform ~= clean data • In a given record, does it have rare tokens? • Rare = anomaly? Identifying Rare Tokens
  • 16. 1. Log record -> feature extraction 2. Features -> Classifier 3. Classifier returns class probabilities Classification • Must train on good labeled data • Binary classification is most accurate • Scikit-learn has many options
  • 17.
  • 18. ● Spam vs Ham ● Sentiment & Opinion analysis: positive vs negative ● Fraud Real World Classification
  • 19. 1. Train on record type (ssh vs everything else) 2. What has type ssh but doesn’t classify? 3. What is not ssh but does classify? Log Classification Anomalies
  • 20. Features: ● Description ● Rules / thresholds ● Log record features Labels = priority level (high, medium, low) Alert Classification
  • 21. ● No training needed (unsupervised) ● Group by feature similarity / distance ● Must operate on large batch of records ● Scikit-learn has many options ● Gensim for topic modeling Clustering
  • 22.
  • 23. 1. Cluster a few different record types 2. Does each type correspond to a single cluster? 3. Which records don’t cluster well? (far from centroid) Data Clustering Anomalies
  • 24. ● A.k.a. Novelty / Outlier detection ● A.k.a. One-class classification ● Learn from good data set ● Identify new records that don’t fit ● Scikit-learn has a few options ● Automated anomaly detection with Logpai loglizer Anomaly Detection
  • 25.
  • 26. ● Tokenization ● Feature extraction ● Classification ● Clustering ● Anomaly detection Summary
  • 27. • NLTK • Scikit-Learn • Gensim • Logpai • Text-processing.com • Streamhacker.com References
  • 28. ● Investigator: plain english log search -> multiple visualizations & recommendations to do next ● Analyzer: data health analysis ● InsightEngines.com About Insight Engines

Editor's Notes

  1. Punctuation in weird places
  2. NLP example: can’t
  3. Trained on WSJ news articles
  4. Grammars ~= multi-line regex
  5. Bigram & Trigram features can add a lot to classification & clustering accuracy
  6. Use token shapes to normalize? Technique based on TF/IDF & search indexing to identify high information words
  7. Sentiment used a lot for marketing analytics
  8. One vs all classification.
  9. Triage, identify false positives or negatives
  10. Topic modeling is different type of clustering