NLP Techniques for Log
Analysis
Jacob Perkins, CTO @ Insight Engines
● Speculative ideas with specific techniques
● Python is great for NLP, ML, simple text processing
Overview
Author of Text Processing with NLTK Cookbook
Contributor to Bad Data Handbook
Blog @ StreamHacker.com
Helped create Seahorse / Gnome Keyring (GPG UI)
CTO @ InsightEngines.com
About me
1. Tokenization
2. Feature Extraction
3. Classification
4. Clustering
5. Anomaly Detection
Topics
• Split text into tokens
• Many options beyond whitespace
• Works on arbitrary text
• NLTK has many tokenizers
Tokenization
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
https://text-processing.com/demo/tokenize/
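NLTK ships many tokenizers (whitespace, regexp, treebank, and more); as a minimal stdlib sketch of the idea, a regexp tokenizer that splits punctuation into its own tokens can be written with `re` (this is an illustration, not NLTK's implementation):

```python
import re

def tokenize(line):
    """Split into word-character runs and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", line)

line = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
        "39480627 wksh: HANDLING TELNET CALL (User: root, "
        "Branch: ABCDE, Client: 10101) pid=9644")
tokens = tokenize(line)
```

Note how a plain whitespace split would keep `syslog:` and `pid=9644` glued together, while this separates the punctuation.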
• Edit distance (a.k.a. Levenshtein distance)
• The fuzzywuzzy library wraps it for fuzzy string matching
• Can be used to identify similar strings
• Ex: Google vs Go0gle = edit distance 1
Fuzzy Matching
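Fuzzywuzzy builds on this metric; as a pure-stdlib sketch, the underlying dynamic-programming edit distance looks like this:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

distance = edit_distance("Google", "Go0gle")  # one substitution: o -> 0
```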
• Transform text into discrete values
• Use for data analysis, machine learning
• Art, not science
Feature Extraction
• Date parsing with dateutil
• Regex patterns
• Grammars with pyparsing
• Automatic log parsing with Logpai logparser
Parsing
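As a sketch of regex-based parsing against the sample record (the field names in the named groups are illustrative, not from any standard schema):

```python
import re

# Named groups pull structured fields out of the free-text record.
LOG_PATTERN = re.compile(
    r"User: (?P<user>\w+), "
    r"Branch: (?P<branch>\w+), "
    r"Client: (?P<client>\d+)\) "
    r"pid=(?P<pid>\d+)"
)

line = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 "
        "39480627 wksh: HANDLING TELNET CALL (User: root, "
        "Branch: ABCDE, Client: 10101) pid=9644")
fields = LOG_PATTERN.search(line).groupdict()
```

For anything much more complex than this, a pyparsing grammar or an automatic template miner like Logpai's logparser scales better than hand-maintained regexes.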
● Bigram: (acmepayroll, syslog)
● Trigram: (HANDLING, TELNET, CALL)
● Skipgram: (syslog, HANDLING, CALL)
Ngram Features
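The features above can be generated without any library; a sketch (the skipgram window size `k` is an assumption here, and NLTK's `nltk.util.skipgrams` offers a ready-made version):

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-token windows: bigrams for n=2, trigrams for n=3."""
    return list(zip(*(tokens[i:] for i in range(n))))

def skipgrams(tokens, n, k):
    """n-token subsequences anchored at each position, allowing up to
    k skipped tokens inside the window."""
    grams = set()
    for i in range(len(tokens)):
        window = tokens[i:i + n + k]
        for combo in combinations(range(len(window)), n):
            if combo[0] == 0:  # anchor each gram at the window start
                grams.add(tuple(window[j] for j in combo))
    return grams

tokens = ["acmepayroll", "syslog", "HANDLING", "TELNET", "CALL"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```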
• acmepayroll -> aa
• User -> Aa
• ABCDE -> AA
• 10101 -> nn
• pid=9644 -> aa=nn
Token Shapes
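A sketch of the shape mapping implied by the examples above, treating each alphanumeric run as lowercase (`aa`), capitalized (`Aa`), uppercase (`AA`), or numeric (`nn`) and leaving punctuation in place (single-character runs are an edge case this sketch glosses over):

```python
import re

def _shape_word(w):
    """Classify one alphanumeric run."""
    if w.isdigit():
        return "nn"
    if w.isupper():
        return "AA"
    if w[0].isupper():
        return "Aa"
    return "aa"

def token_shape(token):
    """Replace every alphanumeric run with its coarse shape code."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: _shape_word(m.group()), token)
```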
Log -> Token Shapes & Date Parsing
date aa syslog: date nn wksh: AA AA AA (User: aa,
Branch: AA, Client: nn) pid=nn
Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51
39480627 wksh: HANDLING TELNET CALL (User: root,
Branch: ABCDE, Client: 10101) pid=9644
• Count tokens across all records & types (e.g. ssh)
• How uniform are tokens within a record type?
• Mostly uniform ~= clean data
• In a given record, does it have rare tokens?
• Rare = anomaly?
Identifying Rare Tokens
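A minimal sketch of the counting step with `collections.Counter` (the sample records and the rarity threshold are invented for illustration):

```python
from collections import Counter

def rare_tokens(records, threshold=2):
    """Flag tokens whose corpus-wide count falls below threshold."""
    counts = Counter(tok for rec in records for tok in rec.split())
    return {tok for tok, n in counts.items() if n < threshold}

records = [
    "sshd accepted password for root",
    "sshd accepted password for alice",
    "sshd failed password for r00t",   # suspicious variant
]
rare = rare_tokens(records)
```

In practice a TF-IDF-style weighting does the same job with more nuance than a hard count threshold.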
1. Log record -> feature extraction
2. Features -> Classifier
3. Classifier returns class probabilities
Classification
• Must train on good labeled data
• Binary classification is most accurate
• Scikit-learn has many options
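A toy scikit-learn pipeline covering the three steps above, using bag-of-words features and Naive Bayes (the records and labels are invented; any scikit-learn classifier with `predict_proba` slots into the same pipeline):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled data: binary classification, ssh vs everything else.
records = [
    "sshd accepted password for root from 10.0.0.1",
    "sshd failed password for admin from 10.0.0.2",
    "sshd session opened for user alice",
    "sshd connection closed by 10.0.0.3",
    "httpd GET /index.html 200",
    "httpd POST /login 403",
    "cron job started for user root",
    "kernel out of memory killing process",
]
labels = ["ssh", "ssh", "ssh", "ssh", "other", "other", "other", "other"]

# Feature extraction + classifier in one pipeline.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(records, labels)

# Step 3: class probabilities for a new record.
probs = clf.predict_proba(["sshd invalid password for bob"])
```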
● Spam vs Ham
● Sentiment & Opinion analysis: positive vs negative
● Fraud
Real World Classification
1. Train on record type (ssh vs everything else)
2. What has type ssh but doesn’t classify as ssh?
3. What is not ssh but does classify as ssh?
Log Classification Anomalies
Features:
● Description
● Rules / thresholds
● Log record features
Labels = priority level (high, medium, low)
Alert Classification
● No training needed (unsupervised)
● Group by feature similarity / distance
● Must operate on large batch of records
● Scikit-learn has many options
● Gensim for topic modeling
Clustering
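A minimal scikit-learn clustering sketch over invented records: TF-IDF features plus k-means, which should group the ssh-like and httpd-like lines separately:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

records = [
    "sshd accepted password for root",
    "sshd failed password for admin",
    "sshd session opened for alice",
    "httpd GET /index.html 200",
    "httpd GET /about.html 200",
    "httpd POST /login 403",
]
X = TfidfVectorizer().fit_transform(records)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster id per record
```

`km.transform(X)` gives each record's distance to every centroid, which is one way to flag records that don't cluster well.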
1. Cluster a few different record types
2. Does each type correspond to a single cluster?
3. Which records don’t cluster well? (far from centroid)
Data Clustering Anomalies
● A.k.a. Novelty / Outlier detection
● A.k.a. One-class classification
● Learn from good data set
● Identify new records that don’t fit
● Scikit-learn has a few options
● Automated anomaly detection with Logpai loglizer
Anomaly Detection
● Tokenization
● Feature extraction
● Classification
● Clustering
● Anomaly detection
Summary
• NLTK
• Scikit-Learn
• Gensim
• Logpai
• Text-processing.com
• Streamhacker.com
References
● Investigator: plain English log search -> multiple
visualizations & recommendations for what to do next
● Analyzer: data health analysis
● InsightEngines.com
About Insight Engines
Thank you!

Editor's Notes

  • #7 Punctuation in weird places
  • #8 NLP example: can’t
  • #9 Trained on WSJ news articles
  • #12 Grammars ~= multi-line regex
  • #13 Bigram & Trigram features can add a lot to classification & clustering accuracy
  • #16 Use token shapes to normalize? Technique based on TF-IDF & search indexing to identify high-information words
  • #19 Sentiment used a lot for marketing analytics
  • #20 One-vs-all classification.
  • #21 Triage, identify false positives or negatives
  • #22 Topic modeling is a different type of clustering