SplunkLive! Customer Presentation – Prelert

From SplunkLive! San Francisco


  • Probability distributions of data come in all shapes and sizes; rarely do they fit a nice bell curve
  • index="invite" | timechart span=1h count as mycount | predict mycount | rename upper95(prediction(mycount)) as ceiling | rename lower95(prediction(mycount)) as floor | eval alarm1=if(mycount > ceiling, "10000", "0") | eval alarm2=if(mycount < floor, "-10000", "0") | table _time,alarm1,alarm2,mycount,ceiling,floor
  • Prelert has users analyzing 100,000+ simultaneous unique metrics, not just 20!

Presentation Transcript

  • Anomaly Detection using Machine Learning Predictive Analytics
    • Prelert: the anomaly detection company
  • Terminology
    • Machine learning: autonomous self-learning without the assistance of humans (unsupervised learning)
    • Predictive analytics: probabilistic prediction of behavior based upon observed past behavior
    • Anomaly detection: what’s “different” or “weird” versus what’s “good” or “bad”
  • Q: What’s Interesting Here?
  • A: Only What’s Behaving Abnormally
  • Anomaly Detection - an Analogy
    • How could I accurately predict how much postal mail is likely to be delivered to your home tomorrow?
    • And how would I know if the amount you received was “abnormal”?
  • A practical methodology would involve…
    • First, determine what’s normal before declaring what’s abnormal
    • Watch your mail delivery volume for a while…
      • 1 day?
      • 1 week?
      • 1 month?
    • Notice that you intuitively feel you’ll gain accuracy in your predictions the more data you see
    • Ideally, use those observations to create a…
  • Probability Distribution Function [chart: pieces of mail per day vs. % likelihood (probability)]
  • Probability Distribution Function [chart: pieces of mail per day vs. % likelihood (probability); curve labeled “Best for my house”]
  • Probability Distribution Function [chart: pieces of mail per day vs. % likelihood (probability); curve labeled “College Student?”]
  • Probability Distribution Function [chart: pieces of mail per day vs. % likelihood (probability); curve labeled “My Mom”]
  • Finding “what’s unexpected”…
    • Your job often involves looking for unexpected change in your environment, either proactively through monitoring or reactively through diagnostics/troubleshooting
  • Using the PDF to Find What is Unexpected [chart: pieces of mail per day vs. % likelihood (probability); callouts: zero pieces of mail? fifteen pieces of mail?]
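The mail analogy can be sketched in a few lines of Python. This is only an illustration of the idea on the slides (observe for a while, build a distribution, then ask how likely a new value is); the sample data, the `floor` value for never-seen counts, and the 1% cutoff are all assumptions made up for the example, and this is not Prelert's modeling technique.

```python
from collections import Counter

def empirical_pmf(daily_counts):
    """Estimate P(pieces of mail per day) from observed daily counts."""
    total = len(daily_counts)
    freq = Counter(daily_counts)
    return {value: n / total for value, n in freq.items()}

def likelihood(pmf, value, floor=1e-6):
    """Probability of a value; never-seen values get a small assumed floor."""
    return pmf.get(value, floor)

# Hypothetical four weeks of observations (pieces of mail per day).
observed = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 5, 4, 3, 3,
            4, 2, 3, 4, 5, 3, 4, 3, 2, 4, 3, 4, 3, 1]
pmf = empirical_pmf(observed)

# Ask the same questions as the slide: zero pieces? fifteen pieces?
for pieces in (3, 0, 15):
    p = likelihood(pmf, pieces)
    label = "unexpected" if p < 0.01 else "normal"  # 1% cutoff is an assumption
    print(f"{pieces:>2} pieces: p={p:.4f} -> {label}")
```

A longer observation window fills in more of the distribution, which is the intuition behind "more data gives better predictions".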
  • Relate back to IT and Security data
    • # pieces of mail = # events of a certain type
      • Number of failed logins
      • Number of errors of different types
      • Number of events with certain status codes
      • Etc.
    • Or, performance metrics
      • Response time
      • Utilization %
    • => Every kind of data will need its own unique “model” (probability distribution function)
  • Do You Know How to Accurately Model?
    • Which one(s) model your data best?
    • You will want to get it right
    • avg +/- 2 stdev assumes a Gaussian (Normal) distribution!
    • Source: “Doing Data Science”, O’Neil & Schutt
  • Gaussian (“Normal”) Distribution
  • Non-Gaussian Data [chart panels: status=503, status=404, CPU load, Memory Utilization, Revenue Transactions]
  • Standard Deviations – Not so Good
    • 33,000+ performance metrics analyzed using +/- 2.5σ
    • [chart: Total # Alerts (0 to 7000) over time, 28 Feb 00:00 to 03 Mar 12:00]
    • Never fewer than 900 alerts per hour
    • Real outage (circled) overshadowed by ~6000 extraneous alerts
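One reason a mean +/- kσ band fits non-Gaussian data poorly can be shown with a small sketch. The hourly counts below are invented for illustration (a mostly-zero series with occasional bursts, loosely in the spirit of the status=503 panel); only the 2.5σ multiplier comes from the slide.

```python
import statistics

def sigma_band(values, k=2.5):
    """Symmetric mean +/- k*sigma band, which implicitly assumes Gaussian data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return mean - k * sd, mean + k * sd

# Hypothetical hourly event counts: mostly zero, with occasional bursts.
counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3, 0,
          1, 0, 0, 40, 0, 1, 0, 2, 0, 0, 1, 35]

low, high = sigma_band(counts)
flagged = [c for c in counts if c < low or c > high]

print(f"mean={statistics.fmean(counts):.2f}, median={statistics.median(counts)}")
print(f"band=[{low:.1f}, {high:.1f}], flagged={flagged}")
```

On this data the lower bound comes out negative, so a count (which can never go below zero) can never breach it, and the bursts themselves inflate σ and widen the band. Both effects follow from forcing a symmetric bell curve onto skewed data.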
  • Don’t worry, we have you covered
    • Prelert uses sophisticated machine-learning techniques to best-fit the right statistical model for your data.
    • Better models = better outlier detection = fewer false alarms
  • DEMO
  • Kinds of Anomalies Detected
    • Deviations in event count vs. time
    • Deviations in values vs. time
    • Rare occurrences of things
    • Population/Peer outliers
  • #1) Deviations in Event Counts/Rates
    • Use Case: Online Commerce Site
      • Cyclical online ordering volume (credit cards, etc.)
      • Service outage on May 10th: orders not being processed, dip in afternoon volume
  • Hard to automatically detect because…
    • Tricky to catch with thresholds because the overall count didn’t dip below the low watermark
    • Output of Splunk “predict”:
  • Prelert finds the anomaly perfectly
    • No extraneous false alarms
    • Despite the inherent challenges of the periodic nature of the data
  • #2) Deviations in Performance Metrics
    • Use Case: Online travel portal
    • Makes web services calls to airlines for fare quotes
    • Each airline responds to fare requests with its own typical response time (20 airlines):
  • Hard to automatically detect because…
    • Tricky to construct unique thresholds for each airline individually
    • Cannot do “avg +/- 2σ” because it is too noisy for this kind of data
    • Splunk’s “predict” doesn’t support splitting out via a by clause (“by airline”)
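To make "a separate baseline per airline" concrete, here is a minimal sketch that keeps one robust baseline (median and median absolute deviation) per key. This is a swapped-in textbook technique chosen for brevity, not Prelert's method; the airline names, sample timings, multiplier `k`, and the 1 ms minimum spread are all hypothetical.

```python
import statistics
from collections import defaultdict

def build_baselines(samples):
    """samples: iterable of (airline, response_ms) -> {airline: (median, mad)}."""
    by_airline = defaultdict(list)
    for airline, ms in samples:
        by_airline[airline].append(ms)
    baselines = {}
    for airline, values in by_airline.items():
        med = statistics.median(values)
        mad = statistics.median([abs(v - med) for v in values])
        baselines[airline] = (med, mad)
    return baselines

def is_anomalous(baselines, airline, ms, k=6.0):
    """Compare a response time against that airline's own baseline only."""
    med, mad = baselines[airline]
    return abs(ms - med) > k * max(mad, 1.0)  # 1 ms minimum spread is an assumption

# Hypothetical history: one fast airline and one slow airline.
history = ([("airline_a", v) for v in (210, 220, 205, 215, 225, 212)] +
           [("airline_b", v) for v in (900, 950, 920, 880, 940, 910)])
baselines = build_baselines(history)

# The same 900 ms response is abnormal for one airline and routine for the other.
print(is_anomalous(baselines, "airline_a", 900))
print(is_anomalous(baselines, "airline_b", 900))
```

Keying the baseline by airline is what a single global threshold cannot do: the value that is normal for one series is an outage signal for another.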
  • Prelert finds the anomaly perfectly
    • Only 1 of the many airlines is having an issue
  • #3) Rare Items as Anomalies
    • Use Case: Security team @ services company
    • Wanted to profile typical processes on each host using netstat
    • Goal was to identify rare processes that “start up and communicate” for each host, individually
  • Hard to automatically detect because…
    • Each host has its own separate “set” of typical processes that are potentially unique
    • i.e. FTP may routinely run on server A, but never runs on server B
    • Maintaining a running list of “typical processes” across hundreds of servers is not practical
    • Splunk “rare” command is not truly a rarity measurement, just “least occurring”
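The difference between "least occurring overall" and "rare for this host" can be illustrated with a per-host profile. The sketch below is an assumption-laden toy (host names, process names, and the rarity formula are invented for the example) and does not represent how Prelert scores rarity.

```python
from collections import Counter, defaultdict

def host_profiles(observations):
    """observations: iterable of (host, process) -> {host: Counter(process)}."""
    profiles = defaultdict(Counter)
    for host, process in observations:
        profiles[host][process] += 1
    return profiles

def rarity(profiles, host, process):
    """1.0 means this host has never been seen running the process."""
    seen = profiles.get(host, Counter())
    total = sum(seen.values())
    if total == 0:
        return 1.0
    return 1.0 - seen[process] / total

# Hypothetical history: FTP is routine on server_a and absent on server_b.
history = ([("server_a", "ftpd")] * 50 + [("server_a", "sshd")] * 50 +
           [("server_b", "sshd")] * 100)
profiles = host_profiles(history)

print(rarity(profiles, "server_a", "ftpd"))  # routine on this host
print(rarity(profiles, "server_b", "ftpd"))  # never seen on this host
```

Because each host is scored against its own history, nobody has to maintain a hand-written list of typical processes per server.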
  • Prelert finds the anomaly perfectly
    • Finds FTP process running for 3 hours on a system that doesn’t normally run FTP
  • #4) Population / Peer Outliers
    • Use Case: Proxy log data
      • Need to determine which users/systems are sending out requests/data much differently than the others
  • Hard to automatically detect because…
    • Peer analysis is impossible without Prelert
  • Prelert finds the anomaly perfectly
    • One particular host sending many requests (20,000/hr) to an IIS webserver
    • This is an attempt to hack the webserver
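Peer (population) analysis compares each member against the rest of the group rather than against its own history. A minimal sketch of that idea follows, using the population median and median absolute deviation; the host names, the peer request rates, and the multiplier `k` are hypothetical, with only the 20,000 requests/hr figure taken from the slide. It is an illustration, not Prelert's algorithm.

```python
import statistics

def peer_outliers(requests_per_hour, k=10.0):
    """Return hosts whose rate is far from the population median."""
    values = list(requests_per_hour.values())
    med = statistics.median(values)
    # Fall back to 1.0 if every host has the same rate (assumed guard).
    mad = statistics.median([abs(v - med) for v in values]) or 1.0
    return sorted(host for host, v in requests_per_hour.items()
                  if abs(v - med) > k * mad)

# Hypothetical proxy-log summary: requests per hour, per host.
population = {
    "host01": 120, "host02": 95, "host03": 140,
    "host04": 110, "host05": 20000, "host06": 105,
}

print(peer_outliers(population))
```

The median-based baseline matters here: a mean would be dragged upward by the very host being hunted.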
  • Anomaly Detective App
    • Free to download and try – 100% native Splunk app
    • Easy to use – “push button anomaly detection”
    • More powerful anomaly detection than Splunk on its own
    • Scalable for big data sets
    • http://goo.gl/KJY9B
  • Bonus – Anomaly Cross-Correlation
    • Use Case: Retail company with flaky POS application (gift card redemption)
      • App occasionally disconnects from DB
      • Team suspects either a DB or a network problem, but hard to find cause
    • Prelert configured to run anomaly detection across 3 data types simultaneously
      • App logs (unstructured) – count by dynamic message type
      • SQL Server performance metrics
      • Network performance metrics
  • Result: Instant Answers
    • Symptom: Sudden influx of DB errors in log
    • Symptom: Drop in SQL Server client connections
    • Cause: Network spike and TCP discards