Introduction
• Splunker since 2014
• Sr Sales Engineer, Analytics SME
• Previously worked in operations for a large SaaS company
– 5 years in escalation support before Splunk
– 2 years using Splunk
• Grad Degree in Applied Mathematics
I liked the product so much I joined the company!
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events
or the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results
could differ materially. For important factors that may cause actual results to differ from those contained
in our forward-looking statements, please review our filings with the SEC. The forward-looking
statements made in this presentation are being made as of the time and date of its live presentation.
If reviewed after its live presentation, this presentation may not contain current or accurate information.
We do not assume any obligation to update any forward-looking statements we may make.
In addition, any information about our roadmap outlines our general product direction and is subject to
change at any time without notice. It is for informational purposes only and shall not be incorporated
into any contract or other commitment. Splunk undertakes no obligation either to develop the features
or functionality described or to include any such feature or functionality in a future release.
Agenda
• Machine learning and statistics
• ML Toolkit and Showcase app
• Demo!
• How to acquire and use the app
ML 101: What is it?
• TL;DR - a process for generalizing from examples
Image source: http://phdp.github.io/posts/2013-07-05-dtl.html
“All models are wrong, but some are useful.”
– George E. P. Box
ML 101: Supervised vs Unsupervised
• Supervised Learning: generalizing from labeled data
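A minimal sketch of supervised learning using scikit-learn (which the toolkit ships with); the tiny labeled dataset and field meanings here are invented purely for illustration:

```python
# Supervised learning: generalize from labeled examples.
from sklearn.tree import DecisionTreeClassifier

# Toy labeled data: [hour_of_day, bytes_transferred] -> label (invented)
X = [[9, 120], [10, 150], [11, 130], [2, 9000], [3, 8500]]
y = ["normal", "normal", "normal", "suspect", "suspect"]

# Fit a model on the labeled examples, then predict an unseen one
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[1, 9100]]))  # generalizes beyond the training set
```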
ML 101: Supervised vs Unsupervised
• Unsupervised Learning: generalizing from unlabeled data
Clustering
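Clustering, the canonical unsupervised task, can be sketched in a few lines with scikit-learn's KMeans (one of the toolkit's supported algorithms); the unlabeled points below are made up:

```python
# Unsupervised learning: group unlabeled points by similarity.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data forming two visually obvious groups (invented)
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

# KMeans assigns each point to one of two discovered clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Points in the same group receive the same cluster label
```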
Capacity Planning
1. Log resource utilization (e.g., disk capacity)
2. Build a predictive model based on past values
3. Refine until predictions are accurate
4. Forecast resource saturation or demand
5. Act
Challenge: Unexpected downtime due to insufficient capacity can cost time & money
Solution: Build predictive model to forecast these scenarios and act pre-emptively
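Steps 2-4 can be sketched with a linear trend model, assuming disk usage grows roughly linearly; all numbers here are invented for illustration:

```python
# Capacity planning sketch: fit a trend, forecast when capacity runs out.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical daily disk-usage readings (GB), trending upward ~12 GB/day
days = np.arange(30).reshape(-1, 1)
usage = 500 + 12.0 * days.ravel()

model = LinearRegression().fit(days, usage)

# Forecast the day usage crosses an assumed 1000 GB capacity ceiling
forecast_day = (1000 - model.intercept_) / model.coef_[0]
```

In practice you would refine the model (step 3) until forecasts track observed usage, then alert far enough ahead of the forecast saturation date to act.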
Insider Threat
1. Log cloud storage data transfer
2. Build a predictive model
3. Refine until predictions are accurate
4. Detect large prediction errors
5. Investigate
Challenge: Data theft is a common and costly problem to many organizations
Solution: Build predictive model to identify and alert on anomalous data transfer patterns
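The "detect large prediction errors" step amounts to residual thresholding: fit a model of normal transfer volume and flag points the model badly mispredicts. The synthetic transfer log and injected spike below are assumptions for illustration:

```python
# Anomaly detection via prediction error (residuals).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = np.arange(48).reshape(-1, 1)
# Hypothetical hourly transfer volume with a mild trend plus noise
transfer = 100 + 2.0 * hours.ravel() + rng.normal(0, 1, 48)
transfer[40] += 500  # injected exfiltration-like spike

model = LinearRegression().fit(hours, transfer)
residuals = transfer - model.predict(hours)

# Flag events whose prediction error exceeds 3 standard deviations
threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
```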
Predict Customer Churn
1. Build a model that predicts customer churn
2. Refine until predictions are accurate
3. Predict when customers will churn
4. Inspect the model to see what factors drive churn
5. Act
Challenge: Many factors can contribute to a customer leaving for a competitor. Customer churn = less revenue
Solution: Build a model to identify customers that are likely to move to a competitor. Take action
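The churn showcase uses logistic regression, whose coefficients can be inspected to see which factors drive churn (step 4). The features and the rule generating the labels below are fabricated for illustration:

```python
# Churn prediction sketch: fit, predict, and inspect the driving factor.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
support_calls = rng.integers(0, 10, n)   # support interactions (invented)
tenure_months = rng.integers(1, 60, n)   # account age (invented)
# Synthetic rule: heavy support usage drives churn
churn = (support_calls >= 6).astype(int)

X = np.column_stack([support_calls, tenure_months])
model = LogisticRegression(max_iter=1000).fit(X, churn)

# Inspect the model: the support_calls coefficient dominates,
# recovering the rule that generated the data.
```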
The Process
1. Clean & transform
2. Fit a model
3. Refine the model
4. Apply to make predictions
5. Detect anomalies
6. Alert
7. Act
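The first steps above map naturally onto a scikit-learn pipeline; this is an illustrative analogy, not the toolkit's internals, and the data is invented:

```python
# Steps 1-4 as a scikit-learn pipeline: clean/transform, fit, apply.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("clean", StandardScaler()),      # step 1: clean & transform
    ("model", LinearRegression()),    # step 2: fit a model
])

# Toy training data (invented); step 3 is re-fitting as data accrues
X, y = [[1, 10], [2, 20], [3, 30], [4, 40]], [1.0, 2.0, 3.0, 4.0]
pipe.fit(X, y)

pred = pipe.predict([[5, 50]])[0]     # step 4: apply to make predictions
```

Steps 5-7 (detect anomalies, alert, act) then operate on the prediction errors, as in the insider-threat example.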
ML Toolkit and Showcase App
An app that adds extensible machine learning commands to SPL. The showcases embody best practices for particular analytics.
Preview Release!
ML SPL
• Generic grammar
– Follows the lead of popular ML libraries
– Doesn’t clutter SPL
• fit, apply, summary
ML SPL
• Fit a (persistent) model using training data
• Apply a model to new data to make predictions
• Inspect a summary of the model
fit:
[training data] | fit LinearRegression costly_KPI from metric1 metric2 metric3 into my_model
apply:
[test data] | apply my_model as pred_kpi_value
summary:
| summary my_model
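The fit-into / apply split mirrors scikit-learn's familiar train-then-persist pattern, which the toolkit wraps. A rough Python analogy (not the toolkit's actual persistence code; names are invented):

```python
# fit ... into my_model  ~  train a model and persist it;
# apply my_model         ~  reload it and predict on new data.
import pickle
from sklearn.linear_model import LinearRegression

train_X, train_y = [[1], [2], [3]], [2.0, 4.0, 6.0]
model = LinearRegression().fit(train_X, train_y)  # "fit ... into my_model"
blob = pickle.dumps(model)                        # persisted model artifact

restored = pickle.loads(blob)                     # "apply my_model"
pred = restored.predict([[4]])[0]                 # prediction on new data
```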
Behind the Curtain
• Uses only public interfaces and libraries
• A distribution of the Python data science ecosystem
– scikit-learn, pandas, numpy, scipy, and much more
– On Splunkbase: Python for Scientific Computing
• “Just an app”
• Source code is packaged in the app
Operationalization how-to
(aka Preview Release Caveats)
• Fit model on up to 50k training events
– Can apply model to unlimited events
• Install on standalone 6.3 search head
• 8 currently supported algorithms (and counting)
– Linear Regression, Logistic Regression, PCA, SVM, KMeans, DBSCAN, Birch,
Spectral Clustering
• Community-supported app
– Feedback always welcome!
• Plus all the other caveats you’d expect of a preview release
GA Sneak Peek
All dashboards have examples w/ core Splunk / ITOA datasets
Support for Search Head Clustering
Distribute the workload to indexers
– fit & apply – removes the 50K-event limitation on fit
Gimme! Gimme!
• ML Toolkit and Showcase App
– Preview Release is Free on Splunkbase
• Dependencies
– Splunk 6.3
– Python for Scientific Computing
http://tiny.cc/splunkmlapp
We Want to Hear Your Feedback!
After the Breakout Sessions conclude
Text Splunk to 20691
And be entered for a chance to win a $100 AMEX gift card!
Predict Numeric Fields (Use-Cases)
Predict Service Desk Request/Call volume for password resets
Predict cost of assigning an employee to an opportunity
Predict potential cost of a system outage
Predict Categorical Fields (Use-Cases)
– Predict likely data-center hard-drive failure
– Predict whether an inbound email not flagged by information security controls nevertheless contains malware and should be reviewed/remediated (perhaps via manual dynamic evaluation in a sandbox)
– Predict profitability of offering a specific customer a targeted promotion by using A/B testing data to look
at customer value over time in response to having received the promotion.
– Predict potential employee attrition by looking at badge data and login data. Look for variables that lead to employees leaving, e.g., badge time consistently/increasingly later than the previous X weeks’ average.
Editor's Notes
TODO SVM can’t be inspected
We’re headed to the East Coast!
2 inspired Keynotes – General Session and Security Keynote + Super Sessions with Splunk Leadership in Cloud, IT Ops, Security and Business Analytics!
165+ Breakout sessions addressing all areas and levels of Operational Intelligence – IT, Business Analytics, Mobile, Cloud, IoT, Security…and MORE!
30+ hours of invaluable networking time with industry thought leaders, technologists, and other Splunk Ninjas and Champions waiting to share their business wins with you!
Join the 50%+ of Fortune 100 companies who attended .conf2015 to get hands on with Splunk. You’ll be surrounded by thousands of other like-minded individuals who are ready to share exciting and cutting edge use cases and best practices. You can also deep dive on all things Splunk products together with your favorite Splunkers.
Head back to your company with both practical and inspired new uses for Splunk, ready to unlock the unimaginable power of your data! Arrive in Orlando a Splunk user, leave Orlando a Splunk Ninja!
REGISTRATION OPENS IN MARCH 2016 – STAY TUNED FOR NEWS ON OUR BEST REGISTRATION RATES – COMING SOON!
Predict service desk request volume for password resets (allows for staffing/scheduling to be adjusted on the leading edge of an event) by looking at the past x hours of authentication data (optionally enriching with service desk utilization data via lookup for users failing authentication). Provides estimated call volume in next x hours.
Inspired by an actual customer example where new password expiration and complexity policy roll-out unexpectedly overwhelmed the service desk leading to extensive user downtime across the enterprise. This could help fine-tune staffing levels as well as predict upcoming call/request surges.
Data sources
LDAP (i.e. Active Directory)
success count (i.e. estimate volume of active users overall)
fail count
fail (all reasons)
fail count due to expired passwords
fail count due to expired account
fail count due to disabled account
…
Application Logs
auth failures
(optional, fine tuning) Service Desk platform logs
lookup total number of service desk calls from users with auth failures for password resets
Predict cost of assigning an employee to an opportunity using past expense report data – predict actual budget amount based on the employee/destination incorporating their travel profile/behavior.
Predict potential cost of a system outage using transaction volume, recovery point objective, recovery time objective, past disaster recovery exercise data.
Predict likely data center hard drive failures – (theoretical use case) in a data center hosting many thousands of hard drives, having a predictive model which can mark out disk prone to fail can prevent data loss. Using such information, one could proactively make data copies of vulnerable hard drives.
Hard drive metrics
Hard drive model
In-use timespan
SMART (Self-Monitoring, Analysis and Reporting Technology) disk usage data.
Predict whether an inbound email otherwise not flagged by information security controls contains malware and should be reviewed/remediated for potential undetected malware (perhaps for manual dynamic evaluation in sandbox)
Inspired by an actual customer example where new Upatre/Dyre malware campaigns were being delivered to senior leadership. Static and dynamic automated sandboxing didn’t detect (i.e. see https://threatpost.com/dyre-banking-trojan-jumps-out-of-sandbox/112533/) - Root cause analysis led to identifying key attributes that describe these otherwise undetectable threats: emails with attachments from unknown/low-volume domains, sent to multiple senior leaders. (lots of other variables can be added for additional related use cases).
Data sources
Email security platform logs for the inbound message
Length of message
Country for sender’s IP
# of unique recipients
# of attachments
# of recipients which are a distribution list
lookup # of recipients on watchlists (i.e. Finance leadership)
lookup # of recipients who are admin assistants (i.e. likely to open and process the types of mail that contain this threat)
Lookup number of emails from domain and specific email address in past X days
Breakdown into # of emails flagged and # of emails not flagged for spam/malware
The overview is a map of the types of tasks you can perform with the showcases. Highlights the division of numeric/categorical and prediction/outliers.
Lists the algos for those who care.
Clickable examples that will fill out the showcase for an end-to-end experience.
Build a model that will predict the value of a numeric field (MEDV) given the values of other fields (CHAS, CRIM, etc.). You can use this to fill missing values, e.g.
Note that you can fill these in with your own Search, fields, and other parameters.
You can also save a model to use later (and the two “…in Search” buttons will bring you there).
The showcase focuses on how well the model fits the training data. Two different views of residuals (error) and some related metrics.
Model summary includes the coefficients that constitute the model; you’d see the large values for Charles River adjacency and number of rooms.
Another application of the previous showcase is to then use those residuals to find anomalies.
Prediction error that is an outlier can be considered to come from an anomalous underlying value in the data.
Chaining showcase tasks!
Unordered data, so no sliding window. These predictions were way off.
Show drilldown to specific events that lead to this outlier in prediction error.
For ordered data, use a sliding window so you don’t cheat and look into the future.
Choosing stddev w/ parameter 3 on this dataset will yield unstable outlier bounds.
Logistic regression. Telecom churn data.
Note these fields are straightforward to compute in Splunk SPL given call logs.
Show Apply Model in Search for how one might use a model like this on customers not used in training.
Adjust training/test split and see how metrics change. Bigger is not always better (e.g., overfitting).
Note that, for this dataset, if we predict a customer will churn, we are correct 70+% of the time. This is a pretty simple dataset, however, and in more realistic scenarios churn is about complex sequences of actions and experiences; this is a toy example with data that happens to be real.
Looks at combinations of the values of fields. Works on categorical data or numeric (maybe don’t mention to avoid confusion).
Note probable cause listings. Sort by those fields in the table below and note that these are not the biggest, but still outliers given the values of the other fields.
Choice of methods that will model different aspects of the time series.
We predict two years into the future, capturing both trend and seasonal aspects of the time series.
Holdback lets us test our predictions on data we already have; note the excellent correspondence.
Various clustering algos, matching what is on http://scikit-learn.org/stable/modules/clustering.html
White noise is ambiguous; no single correct answer.
DBSCAN outlier value in blue shows another way to use these showcase analytics to detect anomalies.
Click on Spectral Clustering to show the underlying ML SPL.