Introduction
• Splunker since 2014
• Sr Sales Engineer, Analytics SME
• Previously worked in operations for a large SaaS company
– 5 years in escalation support before Splunk
– 2 years using Splunk
• Grad Degree in Applied Mathematics
I liked the product so much I joined the company!
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events
or the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results
could differ materially. For important factors that may cause actual results to differ from those contained
in our forward-looking statements, please review our filings with the SEC. The forward-looking
statements made in this presentation are being made as of the time and date of its live presentation.
If reviewed after its live presentation, this presentation may not contain current or accurate information.
We do not assume any obligation to update any forward-looking statements we may make.
In addition, any information about our roadmap outlines our general product direction and is subject to
change at any time without notice. It is for informational purposes only and shall not be incorporated
into any contract or other commitment. Splunk undertakes no obligation either to develop the features
or functionality described or to include any such feature or functionality in a future release.
Agenda
• Machine learning and statistics
• ML Toolkit and Showcase app
• Demo!
• How to acquire and use the app
ML 101: What is it?
• TL;DR - a process for generalizing from examples
Image source: http://phdp.github.io/posts/2013-07-05-dtl.html
“All models are wrong, but some are useful.”
– George E. P. Box
ML 101: Supervised vs Unsupervised
• Supervised Learning: generalizing from labeled data
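A minimal sketch of supervised learning using scikit-learn (which the toolkit ships with); the tiny labeled dataset and field meanings here are invented purely for illustration:

```python
# Supervised learning: generalize from labeled examples.
from sklearn.tree import DecisionTreeClassifier

# Toy labeled data: [hour_of_day, bytes_transferred] -> label (invented)
X = [[9, 120], [10, 150], [11, 130], [2, 9000], [3, 8500]]
y = ["normal", "normal", "normal", "suspect", "suspect"]

# Fit a model on the labeled examples, then predict an unseen one
model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[1, 9100]]))  # generalizes beyond the training set
```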
ML 101: Supervised vs Unsupervised
• Unsupervised Learning: generalizing from unlabeled data
Clustering
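Clustering, the canonical unsupervised task, can be sketched in a few lines with scikit-learn's KMeans (one of the toolkit's supported algorithms); the unlabeled points below are made up:

```python
# Unsupervised learning: group unlabeled points by similarity.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data forming two visually obvious groups (invented)
X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]])

# KMeans assigns each point to one of two discovered clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Points in the same group receive the same cluster label
```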
Capacity Planning
1. Log resource utilization (e.g., disk capacity)
2. Build a predictive model based on past values
3. Refine until predictions are accurate
4. Forecast resource saturation or demand
5. Act
Challenge: Unexpected downtime due to insufficient capacity can cost time & money
Solution: Build predictive model to forecast these scenarios and act pre-emptively
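Steps 2-4 can be sketched with a linear trend model, assuming disk usage grows roughly linearly; all numbers here are invented for illustration:

```python
# Capacity planning sketch: fit a trend, forecast when capacity runs out.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical daily disk-usage readings (GB), trending upward ~12 GB/day
days = np.arange(30).reshape(-1, 1)
usage = 500 + 12.0 * days.ravel()

model = LinearRegression().fit(days, usage)

# Forecast the day usage crosses an assumed 1000 GB capacity ceiling
forecast_day = (1000 - model.intercept_) / model.coef_[0]
```

In practice you would refine the model (step 3) until forecasts track observed usage, then alert far enough ahead of the forecast saturation date to act.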
Insider Threat
1. Log cloud storage data transfer
2. Build a predictive model
3. Refine until predictions are accurate
4. Detect large prediction errors
5. Investigate
Challenge: Data theft is a common and costly problem to many organizations
Solution: Build predictive model to identify and alert on anomalous data transfer patterns
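The "detect large prediction errors" step amounts to residual thresholding: fit a model of normal transfer volume and flag points the model badly mispredicts. The synthetic transfer log and injected spike below are assumptions for illustration:

```python
# Anomaly detection via prediction error (residuals).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = np.arange(48).reshape(-1, 1)
# Hypothetical hourly transfer volume with a mild trend plus noise
transfer = 100 + 2.0 * hours.ravel() + rng.normal(0, 1, 48)
transfer[40] += 500  # injected exfiltration-like spike

model = LinearRegression().fit(hours, transfer)
residuals = transfer - model.predict(hours)

# Flag events whose prediction error exceeds 3 standard deviations
threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
```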
Predict Customer Churn
1. Build a model that predicts customer churn
2. Refine until predictions are accurate
3. Predict when customers will churn
4. Inspect the model to see what factors drive churn
5. Act
Challenge: Many factors can contribute to a customer leaving for a competitor. Customer churn = less revenue
Solution: Build a model to identify customers that are likely to move to a competitor. Take action
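The churn showcase uses logistic regression, whose coefficients can be inspected to see which factors drive churn (step 4). The features and the rule generating the labels below are fabricated for illustration:

```python
# Churn prediction sketch: fit, predict, and inspect the driving factor.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
support_calls = rng.integers(0, 10, n)   # support interactions (invented)
tenure_months = rng.integers(1, 60, n)   # account age (invented)
# Synthetic rule: heavy support usage drives churn
churn = (support_calls >= 6).astype(int)

X = np.column_stack([support_calls, tenure_months])
model = LogisticRegression(max_iter=1000).fit(X, churn)

# Inspect the model: the support_calls coefficient dominates,
# recovering the rule that generated the data.
```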
The Process
1. Clean & transform
2. Fit a model
3. Refine the model
4. Apply to make predictions
5. Detect anomalies
6. Alert
7. Act
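The first steps above map naturally onto a scikit-learn pipeline; this is an illustrative analogy, not the toolkit's internals, and the data is invented:

```python
# Steps 1-4 as a scikit-learn pipeline: clean/transform, fit, apply.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ("clean", StandardScaler()),      # step 1: clean & transform
    ("model", LinearRegression()),    # step 2: fit a model
])

# Toy training data (invented); step 3 is re-fitting as data accrues
X, y = [[1, 10], [2, 20], [3, 30], [4, 40]], [1.0, 2.0, 3.0, 4.0]
pipe.fit(X, y)

pred = pipe.predict([[5, 50]])[0]     # step 4: apply to make predictions
```

Steps 5-7 (detect anomalies, alert, act) then operate on the prediction errors, as in the insider-threat example.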
ML Toolkit and Showcase App
An app that adds extensible machine learning commands to SPL. The showcases embody best practices for particular analytics.
Preview Release!
ML SPL
• Generic grammar
– Follows the lead of popular ML libraries
– Doesn’t clutter SPL
• fit, apply, summary
ML SPL
• Fit a (persistent) model using training data
• Apply a model to new data to make predictions
• Inspect a summary of the model
fit:
[training data] | fit LinearRegression costly_KPI from metric1 metric2 metric3 into my_model
apply:
[test data] | apply my_model as pred_kpi_value
summary:
| summary my_model
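The fit-into / apply split mirrors scikit-learn's familiar train-then-persist pattern, which the toolkit wraps. A rough Python analogy (not the toolkit's actual persistence code; names are invented):

```python
# fit ... into my_model  ~  train a model and persist it;
# apply my_model         ~  reload it and predict on new data.
import pickle
from sklearn.linear_model import LinearRegression

train_X, train_y = [[1], [2], [3]], [2.0, 4.0, 6.0]
model = LinearRegression().fit(train_X, train_y)  # "fit ... into my_model"
blob = pickle.dumps(model)                        # persisted model artifact

restored = pickle.loads(blob)                     # "apply my_model"
pred = restored.predict([[4]])[0]                 # prediction on new data
```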
Behind the Curtain
• Uses only public interfaces and libraries
• A distribution of the Python data science ecosystem
– scikit-learn, pandas, numpy, scipy, and much more
– On Splunkbase: Python for Scientific Computing
• “Just an app”
• Source code is packaged in the app
Operationalization how-to
(aka Preview Release Caveats)
• Fit model on up to 50k training events
– Can apply model to unlimited events
• Install on standalone 6.3 search head
• 8 currently supported algorithms (and counting)
– Linear Regression, Logistic Regression, PCA, SVM, KMeans, DBSCAN, Birch,
Spectral Clustering
• Community-supported app
– Feedback always welcome!
• Plus all the other caveats you’d expect of a preview release
GA Sneak Peek
All dashboards have examples w/ core Splunk / ITOA datasets
Support for Search Head Clustering
Distribute the workload to indexers
– fit & apply – removes the 50K-event limitation on fit
Gimme! Gimme!
• ML Toolkit and Showcase App
– Preview Release is Free on Splunkbase
• Dependencies
– Splunk 6.3
– Python for Scientific Computing
http://tiny.cc/splunkmlapp
We Want to Hear Your Feedback!
After the Breakout Sessions conclude
Text Splunk to 20691
And be entered for a chance to win a $100 AMEX gift card!
Predict Numeric Fields (Use-Cases)
Predict Service Desk Request/Call volume for password resets
Predict cost of assigning an employee to an opportunity
Predict potential cost of a system outage
Predict Categorical Fields (Use-Cases)
– Predict likely data-center hard-drive failure
– Predict whether an inbound email not flagged by information security controls nevertheless contains malware and should be reviewed/remediated (perhaps via manual dynamic evaluation in a sandbox)
– Predict profitability of offering a specific customer a targeted promotion by using A/B testing data to look
at customer value over time in response to having received the promotion.
– Predict potential employee attrition by looking at badge data and login data. Look for variables that lead to employees leaving, e.g., badge time consistently/increasingly later than the previous X weeks’ average.
Editor's Notes
TODO SVM can’t be inspected
We’re headed to the East Coast!
2 inspired Keynotes – General Session and Security Keynote + Super Sessions with Splunk Leadership in Cloud, IT Ops, Security and Business Analytics!
165+ Breakout sessions addressing all areas and levels of Operational Intelligence – IT, Business Analytics, Mobile, Cloud, IoT, Security…and MORE!
30+ hours of invaluable networking time with industry thought leaders, technologists, and other Splunk Ninjas and Champions waiting to share their business wins with you!
Join the 50%+ of Fortune 100 companies who attended .conf2015 to get hands on with Splunk. You’ll be surrounded by thousands of other like-minded individuals who are ready to share exciting and cutting edge use cases and best practices. You can also deep dive on all things Splunk products together with your favorite Splunkers.
Head back to your company with both practical and inspired new uses for Splunk, ready to unlock the unimaginable power of your data! Arrive in Orlando a Splunk user, leave Orlando a Splunk Ninja!
REGISTRATION OPENS IN MARCH 2016 – STAY TUNED FOR NEWS ON OUR BEST REGISTRATION RATES – COMING SOON!
Predict service desk request volume for password resets (allows for staffing/scheduling to be adjusted on the leading edge of an event) by looking at the past x hours of authentication data (optionally enriching with service desk utilization data via lookup for users failing authentication). Provides estimated call volume in next x hours.
Inspired by an actual customer example where new password expiration and complexity policy roll-out unexpectedly overwhelmed the service desk leading to extensive user downtime across the enterprise. This could help fine-tune staffing levels as well as predict upcoming call/request surges.
Data sources
LDAP (i.e. Active Directory)
success count (i.e. estimate volume of active users overall)
fail count
fail (all reasons)
fail count due to expired passwords
fail count due to expired account
fail count due to disabled account
…
Application Logs
auth failures
(optional, fine tuning) Service Desk platform logs
lookup total number of service desk calls from users with auth failures for password resets
Predict cost of assigning an employee to an opportunity using past expense report data – predict actual budget amount based on the employee/destination incorporating their travel profile/behavior.
Predict potential cost of a system outage using transaction volume, recovery point objective, recovery time objective, past disaster recovery exercise data.
Predict likely data center hard drive failures – (theoretical use case) in a data center hosting many thousands of hard drives, having a predictive model which can mark out disk prone to fail can prevent data loss. Using such information, one could proactively make data copies of vulnerable hard drives.
Hard drive metrics
Hard drive model
In-use timespan
SMART (Self-Monitoring, Analysis and Reporting Technology) disk usage data.
Predict whether an inbound email otherwise not flagged by information security controls contains malware and should be reviewed/remediated for potential undetected malware (perhaps for manual dynamic evaluation in sandbox)
Inspired by an actual customer example where new Upatre/Dyre malware campaigns were being delivered to senior leadership. Static and dynamic automated sandboxing didn’t detect (i.e. see https://threatpost.com/dyre-banking-trojan-jumps-out-of-sandbox/112533/) - Root cause analysis led to identifying key attributes that describe these otherwise undetectable threats: emails with attachments from unknown/low-volume domains, sent to multiple senior leaders. (lots of other variables can be added for additional related use cases).
Data sources
Email security platform logs for the inbound message
Length of message
Country for sender’s IP
# of unique recipients
# of attachments
# of recipients which are a distribution list
lookup # of recipients on watchlists (i.e. Finance leadership)
lookup # of recipients who are admin assistants (i.e. likely to open and process the types of mail that contain this threat)
Lookup number of emails from domain and specific email address in past X days
Breakdown into # of emails flagged and # of emails not flagged for spam/malware
The overview is a map of the types of tasks you can perform with the showcases. Highlights the division of numeric/categorical and prediction/outliers.
Lists the algos for those who care.
Clickable examples that will fill out the showcase for an end-to-end experience.
Build a model that will predict the value of a numeric field (MEDV) given the values of other fields (CHAS, CRIM, etc.). You can use this to fill missing values, e.g.
Note that you can fill these in with your own Search, fields, and other parameters.
You can also save a model to use later (and the two “…in Search” buttons will bring you there).
The showcase focuses on how well the model fits the training data. Two different views of residuals (error) and some related metrics.
Model summary includes the coefficients that constitute the model; you’d see the large values for Charles River adjacency and number of rooms.
Another application of the previous showcase is to then use those residuals to find anomalies.
Prediction error that is an outlier can be considered to come from an anomalous underlying value in the data.
Chaining showcase tasks!
Unordered data, so no sliding window. These predictions were way off.
Show drilldown to specific events that lead to this outlier in prediction error.
For ordered data, use a sliding window so you don’t cheat and look into the future.
Choosing stddev w/ parameter 3 on this dataset will yield unstable outlier bounds.
Logistic regression. Telecom churn data.
Note these fields are straightforward to compute in Splunk SPL given call logs.
Show Apply Model in Search for how one might use a model like this on customers not used in training.
Adjust training/test split and see how metrics change. Bigger is not always better (e.g., overfitting).
Note that, for this dataset, if we predict a customer will churn, we are correct 70+% of the time. This is a pretty simple dataset, however, and in more realistic scenarios churn is about complex sequences of actions and experiences; this is a toy example with data that happens to be real.
Looks at combinations of the values of fields. Works on categorical data or numeric (maybe don’t mention to avoid confusion).
Note probable cause listings. Sort by those fields in the table below and note that these are not the biggest, but still outliers given the values of the other fields.
Choice of methods that will model different aspects of the time series.
We predict two years into the future, capturing both trend and seasonal aspects of the time series.
Holdback lets us test our predictions on data we already have; note the excellent correspondence.
Various clustering algos, matching what is on http://scikit-learn.org/stable/modules/clustering.html
White noise is ambiguous; no single correct answer.
DBSCAN outlier value in blue shows another way to use these showcase analytics to detect anomalies.
Click on Spectral Clustering to show the underlying ML SPL.