Splunk live! Customer Presentation – Prelert

From Splunklive! San Francisco

  • Probability distributions of data come in all shapes and sizes – they rarely fit a nice bell curve
  • index="invite" | timechart span=1h count as mycount | predict mycount | rename "upper95(prediction(mycount))" as ceiling | rename "lower95(prediction(mycount))" as floor | eval alarm1=if(mycount > ceiling, "10000", "0") | eval alarm2=if(mycount < floor, "-10000", "0") | table _time,alarm1,alarm2,mycount,ceiling,floor
  • Prelert has users analyzing 100,000+ simultaneous unique metrics, not just 20!

Transcript

  • 1. Anomaly Detection using Machine Learning Predictive Analytics – the anomaly detection company
  • 2. Terminology • Machine learning: autonomous self-learning without human assistance (unsupervised learning) • Predictive analytics: probabilistic prediction of behavior based on observed past behavior • Anomaly detection: finding what’s “different” or “weird”, as opposed to what’s “good” or “bad”
  • 3. Q: What’s Interesting Here?
  • 4. A: Only What’s Behaving Abnormally
  • 5. Anomaly Detection - an Analogy • How could I accurately predict how much postal mail is likely to be delivered to your home tomorrow? • And how would I know if the amount you received was “abnormal”?
  • 6. A practical methodology would involve… • First, determine what’s normal before declaring what’s abnormal • Watch your mail delivery volume for a while: 1 day? 1 week? 1 month? • Notice that, intuitively, your predictions should become more accurate the more data you see • Ideally, use those observations to create a…
  • 7. Probability Distribution Function [chart: x-axis pieces of mail per day, y-axis % likelihood (probability)]
  • 8. Probability Distribution Function: best for my house [chart: x-axis pieces of mail per day, y-axis % likelihood (probability)]
  • 9. Probability Distribution Function: college student? [chart: x-axis pieces of mail per day, y-axis % likelihood (probability)]
  • 10. Probability Distribution Function: my mom [chart: x-axis pieces of mail per day, y-axis % likelihood (probability)]
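The distribution sketched on slides 7 to 10 can be approximated directly from observations. The following is a minimal sketch, not something taken from the deck: the function name `empirical_pmf` and the ten daily counts are hypothetical, and it simply tallies how often each daily mail volume was seen.

```python
from collections import Counter


def empirical_pmf(daily_counts):
    """Return {pieces_of_mail: fraction of observed days with that count}."""
    total = len(daily_counts)
    tally = Counter(daily_counts)
    return {value: n / total for value, n in sorted(tally.items())}


# Hypothetical observations: pieces of mail on 10 consecutive days.
observed = [3, 4, 2, 3, 5, 3, 4, 3, 2, 4]
pmf = empirical_pmf(observed)

for pieces, probability in pmf.items():
    print(f"{pieces} pieces of mail: {probability:.0%} of days")
```

A different household (the college student, the mom) would produce a different table from the same function, which is the point of the four slides: the model has to be learned per data source.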
  • 11. Finding “what’s unexpected”… Your job often involves looking for unexpected change in your environment, either proactively through monitoring or reactively through diagnostics and troubleshooting
  • 12. Using the PDF to Find What Is Unexpected • Zero pieces of mail? • Fifteen pieces of mail? [chart: x-axis pieces of mail per day, y-axis % likelihood (probability)]
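One way to put a number on "how unexpected is zero, or fifteen?" is a tail probability under a fitted model. The sketch below assumes, purely for illustration, that daily mail volume follows a Poisson distribution with a hypothetical mean of 3.3 pieces per day; the deck does not say which model applies, and it stresses that real data often fits no textbook distribution.

```python
import math


def poisson_pmf(k, rate):
    """P(X == k) for a Poisson distribution with the given mean rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def poisson_cdf(k, rate):
    """P(X <= k); returns 0.0 when k is negative."""
    return sum(poisson_pmf(i, rate) for i in range(k + 1))


def tail_probability(k, rate):
    """Smaller of P(X <= k) and P(X >= k): low values mean 'unexpected'."""
    lower = poisson_cdf(k, rate)
    upper = 1.0 - poisson_cdf(k - 1, rate)
    return min(lower, upper)


RATE = 3.3  # hypothetical average pieces of mail per day
for pieces in (0, 3, 15):
    print(pieces, tail_probability(pieces, RATE))
```

With this assumed rate, three pieces is unremarkable, zero is uncommon, and fifteen is extremely unlikely. The same observation can be ranked as an anomaly or not depending on the model, which is why the choice of distribution matters.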
  • 13. Relate back to IT and Security data • # pieces of mail = # events of a certain type: number of failed logins; number of errors of different types; number of events with certain status codes; etc. • Or performance metrics: response time; utilization % • Every kind of data will need its own unique “model” (probability distribution function)
  • 14. Do You Know How to Accurately Model? • Which one(s) model your data best? • You will want to get it right • avg +/- 2 stdev assumes a Gaussian (Normal) distribution! (Source: “Doing Data Science”, O’Neil & Schutt)
  • 15. Gaussian (“Normal”) Distribution
  • 16. Non-Gaussian Data [chart panels: status=503, status=404, CPU load, Memory Utilization, Revenue Transactions]
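To see why "avg +/- 2 stdev" misbehaves on non-Gaussian data, consider a skewed metric such as an error count that is usually small and occasionally spikes. The sample below is hypothetical and the helper `sigma_band` is illustrative, not part of any product.

```python
import statistics


def sigma_band(values, k=2.0):
    """Return (mean - k*stdev, mean + k*stdev) using the population stdev."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return mean - k * stdev, mean + k * stdev


# Hypothetical hourly error counts: mostly 1, with a long right tail.
counts = [1] * 80 + [2] * 10 + [5] * 5 + [20] * 4 + [100]
low, high = sigma_band(counts)

print(f"band: {low:.2f} to {high:.2f}")
print("lowest observed value:", min(counts))
```

The lower bound comes out negative, which a count can never reach, so a drop to zero would never raise an alert. The upper bound is also inflated by the single large spike. Both effects follow from forcing a symmetric band onto an asymmetric distribution.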
  • 17. Standard Deviations – Not so Good • 33,000+ performance metrics analyzed using +/- 2.5σ • Never fewer than 900 alerts per hour • Real outage (circled) overshadowed by ~6000 extraneous alerts [chart: total # alerts (y-axis, 0 to 7000) over time, 28 Feb 00:00 to 03 Mar 12:00]
  • 18. Don’t worry, we have you covered • Prelert uses sophisticated machine-learning techniques to best-fit the right statistical model for your data • Better models = better outlier detection = fewer false alarms
  • 19. DEMO
  • 20. Kinds of Anomalies Detected • Deviations in event count vs. time • Deviations in values vs. time • Rare occurrences of things • Population/peer outliers
  • 21. #1) Deviations in Event Counts/Rates • Use Case: Online commerce site • Cyclical online ordering volume (credit cards, etc.) • Service outage on May 10th: orders not being processed, causing a dip in afternoon volume
  • 22. Hard to automatically detect because… • Tricky to catch with thresholds because the overall count didn’t dip below the low watermark • Output of Splunk “predict”: [chart]
  • 23. Prelert finds the anomaly perfectly • No extraneous false alarms • Despite the inherent challenges of the periodic nature of the data
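The difficulty described on slides 21 to 23 can be illustrated with a simple seasonality-aware check: compare each hour with the same hour on previous days rather than with one static threshold. This is a generic sketch and not Prelert's algorithm; the function name, the 50% tolerance, and the order volumes are all hypothetical.

```python
import statistics


def hour_of_day_anomalies(history, today, tolerance=0.5):
    """Flag hours where today's count differs from the same-hour median
    of previous days by more than `tolerance` (as a fraction)."""
    flagged = []
    for hour, count in enumerate(today):
        baseline = statistics.median([day[hour] for day in history])
        if baseline > 0 and abs(count - baseline) / baseline > tolerance:
            flagged.append(hour)
    return flagged


# Hypothetical daily profile: quiet nights (100/hr), busy days (1000/hr).
profile = [100] * 8 + [1000] * 10 + [100] * 6
history = [[c + offset for c in profile] for offset in (-10, 0, 10)]

today = list(profile)
today[14] = 400  # afternoon dip that stays above the night-time minimum

print(hour_of_day_anomalies(history, today))
```

A static low watermark set near the night-time minimum of 100 would not fire at 400 orders per hour, yet 400 is far from normal for 2 pm. Comparing like with like is what exposes the dip.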
  • 24. #2) Deviations in Performance Metrics • Use Case: Online travel portal • Makes web services calls to airlines for fare quotes • Each airline responds to fare requests with its own typical response time (20 airlines)
  • 25. Hard to automatically detect because… • Tricky to construct unique thresholds for each airline individually • Cannot do “avg +/- 2σ” because it is too noisy for this kind of data • Splunk’s “predict” doesn’t support splitting results with a by clause (“by airline”)
  • 26. Prelert finds the anomaly perfectly • Only 1 of the many airlines is having an issue
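A per-airline baseline of the kind slide 25 calls tricky to maintain by hand can be sketched with a robust statistic per group. This is an illustrative stand-in, not the method Prelert uses: the airline names, response times, and the threshold of 5 median absolute deviations (MAD) are assumptions.

```python
import statistics
from collections import defaultdict


def per_group_outliers(samples, latest, threshold=5.0):
    """Flag groups whose latest value is more than `threshold` MADs
    away from that group's own historical median."""
    history = defaultdict(list)
    for group, value in samples:
        history[group].append(value)

    flagged = []
    for group, value in latest.items():
        past = history[group]
        median = statistics.median(past)
        mad = statistics.median([abs(x - median) for x in past])
        scale = mad or 1.0  # assumed fallback when history has no spread
        if abs(value - median) / scale > threshold:
            flagged.append(group)
    return flagged


# Hypothetical response times in milliseconds.
samples = [("AirA", ms) for ms in (200, 210, 190, 205, 195)]
samples += [("AirB", ms) for ms in (900, 950, 880, 920, 910)]
latest = {"AirA": 900, "AirB": 915}

print(per_group_outliers(samples, latest))
```

Here 900 ms is routine for one airline and a severe slowdown for the other, so a single global threshold cannot serve both.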
  • 27. #3) Rare Items as Anomalies • Use Case: Security team at a services company • Wanted to profile typical processes on each host using netstat • Goal was to identify rare processes that “start up and communicate” for each host, individually
  • 28. Hard to automatically detect because… • Each host has its own separate “set” of typical processes that are potentially unique • e.g. FTP may routinely run on server A, but never run on server B • Maintaining a running list of “typical processes” across hundreds of servers is not practical • Splunk’s “rare” command is not truly a rarity measurement, just “least occurring”
  • 29. Prelert finds the anomaly perfectly • Finds FTP process running for 3 hours on a system that doesn’t normally run FTP
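The per-host notion of rarity from slides 27 to 29 can be sketched by keeping a separate frequency table for each host. This is a simplified illustration rather than Prelert's model; the host names, the observation counts, and the 1% cutoff are hypothetical.

```python
from collections import Counter, defaultdict


def rare_processes(history, current, max_fraction=0.01):
    """Return (host, process) pairs from `current` whose share of that
    host's own history is at or below `max_fraction`."""
    per_host = defaultdict(Counter)
    for host, process in history:
        per_host[host][process] += 1

    flagged = []
    for host, process in current:
        seen = per_host[host]
        total = sum(seen.values())
        fraction = seen[process] / total if total else 0.0
        if fraction <= max_fraction:
            flagged.append((host, process))
    return flagged


# Hypothetical netstat-derived observations.
history = [("serverA", "ftp")] * 50 + [("serverA", "sshd")] * 50
history += [("serverB", "sshd")] * 100
current = [("serverA", "ftp"), ("serverB", "ftp"), ("serverB", "sshd")]

print(rare_processes(history, current))
```

FTP is flagged only on the host where it has never been seen. A global "least occurring" ranking would not make that distinction, because FTP is common overall.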
  • 30. #4) Population / Peer Outliers • Use Case: Proxy log data • Need to determine which users/systems are sending out requests/data much differently than the others
  • 31. Hard to automatically detect because… • Peer analysis is impossible without Prelert
  • 32. Prelert finds the anomaly perfectly • One particular host sending many requests (20,000/hr) to an IIS webserver • This is an attempt to hack the webserver
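The idea of a population or peer outlier can be illustrated by scoring each host against the rest of the population instead of against its own past. The sketch below is a generic median/MAD comparison, not Prelert's implementation; the host names, the peer request rates, and the threshold are hypothetical, while the 20,000 requests per hour figure comes from the slide.

```python
import statistics


def population_outliers(requests_per_hour, threshold=10.0):
    """Return hosts whose rate is more than `threshold` MADs away from
    the median rate of the whole population."""
    rates = list(requests_per_hour.values())
    median = statistics.median(rates)
    mad = statistics.median([abs(r - median) for r in rates]) or 1.0
    return sorted(
        host
        for host, rate in requests_per_hour.items()
        if abs(rate - median) / mad > threshold
    )


# Hypothetical proxy-log summary: requests per hour by host.
rates = {
    "host01": 120,
    "host02": 95,
    "host03": 130,
    "host04": 110,
    "host05": 20000,
}

print(population_outliers(rates))
```

The median and MAD are used because a mean and standard deviation would themselves be dragged upward by the very outlier being looked for.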
  • 33. Anomaly Detective App • Free to download and try – 100% native Splunk app • Easy to use – “push button anomaly detection” • More powerful anomaly detection than Splunk on its own • Scalable for big data sets • http://goo.gl/KJY9B
  • 34. Bonus – Anomaly Cross-Correlation • Use Case: Retail company with flaky POS application (gift card redemption): app occasionally disconnects from DB; team suspects either a DB or a network problem, but the cause is hard to find • Prelert configured to run anomaly detection across 3 data types simultaneously: app logs (unstructured), count by dynamic message type; SQL Server performance metrics; network performance metrics
  • 35. Result: Instant Answers • Symptom: Sudden influx of DB errors in log • Symptom: Drop in SQL Server client connections • Cause: Network spike and TCP discards
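Cross-correlation of this kind can be approximated by lining up anomalies from different sources on a shared timeline. The sketch below is only a rough illustration of that idea and says nothing about how Prelert does it: the timestamps, source labels, and five-minute window are hypothetical, while the three descriptions echo the slide.

```python
from collections import defaultdict


def correlate_anomalies(anomalies, window_seconds=300):
    """Group (epoch_seconds, source, description) anomalies into fixed
    time windows and keep windows that involve more than one source."""
    buckets = defaultdict(list)
    for timestamp, source, description in anomalies:
        buckets[timestamp // window_seconds].append(
            (timestamp, source, description)
        )

    correlated = []
    for bucket in sorted(buckets):
        events = sorted(buckets[bucket])  # earliest anomaly first
        if len({source for _, source, _ in events}) > 1:
            correlated.append(events)
    return correlated


# Hypothetical anomaly records from three data types.
anomalies = [
    (1030, "sql_server", "drop in client connections"),
    (1000, "network", "spike and TCP discards"),
    (1060, "app_log", "influx of DB errors"),
    (5000, "app_log", "unrelated message burst"),
]

groups = correlate_anomalies(anomalies)
for event in groups[0]:
    print(event)
```

Ordering by time only suggests which anomaly came first; it does not prove causation. Fixed windows can also split related events that straddle a boundary, so a production approach would need something more careful.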