
Still Suffering from IT Outages? Accept Failure, Learn from Failure and Get Rid of Failure to Protect your Business


IT operations continues to grow in complexity. There are too many alerts for human operators to process, and little to no visibility into which alerts are business-impacting. And it is only getting worse as organisations add new devices and an ever-growing list of services, slowing detection and resolution times. Why is this a problem? Legacy IT solutions have left organisations in a complex, manual state: with too many siloed tools, IT productivity remains low, and teams often struggle to find the exact root cause.

Published in: Technology


  1. © 2018 SPLUNK INC. Still haven’t got on top of IT outages? Accept failure, learn from failure and get rid of failure to protect your business. Dr. Siyka Andreeva | IT Operations Analytics Specialist. April 2019
  2. Forward-Looking Statements. During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release. Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2017 Splunk Inc. All rights reserved.
  3. Agenda: Why You Need to Stop Being Reactive • Data and Machine Learning: How to Get to a Predictive IT • Case Study with CMC Markets
  4. High availability is everywhere! How many 9’s do you have? 100% 100% 100% 99.999%
  5. Because we live in a (theoretical) SLA world, but surrounded by storms, human errors and trolls. Serial compound availability: App Service (99.95%) -> SQL (99.95%), where the app service, SQL, or both can be down, so the overall “service” availability is lower: 99.90%. Serial and parallel availability: two App Service/SQL legs, A and B (each 99.95% -> 99.95%), behind a Traffic Manager (99.99%); either tier of A or B can be down, the Traffic Manager can be down, or a combination of the above. The overall SLA is 99.98%, and you still have a SPOF.
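The compound-availability arithmetic on this slide can be checked with a short script. The figures come from the slide; the helper names are mine:

```python
def serial(*availabilities):
    # All components in a chain must be up: multiply availabilities.
    p = 1.0
    for a in availabilities:
        p *= a
    return p

def parallel(*availabilities):
    # A redundant group is up unless every leg is down simultaneously.
    down = 1.0
    for a in availabilities:
        down *= (1.0 - a)
    return 1.0 - down

# One leg: App Service -> SQL, each 99.95%
leg = serial(0.9995, 0.9995)
print(f"Serial chain: {leg:.4%}")            # ~99.9000%

# Two legs behind a 99.99% Traffic Manager (the remaining SPOF)
overall = serial(parallel(leg, leg), 0.9999)
print(f"Redundant legs + SPOF: {overall:.4%}")
```

Note how the redundancy makes the two-leg group nearly perfect, so the overall figure is dominated by the single Traffic Manager: the SPOF sets the ceiling.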
  6. And yet there are more outages than ever: 25% (2017) rising to 31% (2018) suffered an outage or period of severe service degradation over the past 12 months; 48% if an on-prem data centre; 80% could have been prevented. Leading causes: human errors, power outages, network and configuration issues. Source: Uptime Institute 2018 (8th annual Data Center Survey).
  7. More outages than ever + higher cost per incident: customer satisfaction, brand reputation and lines of revenue are all at stake. $105,302: the mean business cost of an IT incident (according to “Damage Control: The Impact of Critical IT Incidents”).
  8. Predict and Prevent Operational Issues with AI: the cost of impact shrinks as you move from being reactively alerted on existing events (long MTTR), to proactive monitoring by adding logs and metrics (shorter MTTR), to automated resolution driven by a Splunk ML alert (shortest MTTR).
  9. Predict and Prevent Operational Issues with AI (continued): the predictive stage goes further, predicting issues 30 minutes in advance of the impact. Negative MTTR! Time is returned to the business instead of being spent on recovery.
  10. Multiple data sources + machine learning: online services, networks, security, firewall logs, call detail records, web services, telecoms, web clickstreams, tracing, online shopping carts, smartphones and devices, custom and packaged applications, databases, messaging, energy meters, storage, public and private cloud, containers, on-premises servers, GPS location, RFID, wired/DB/mobile/IoT/API metrics, data lakes, APM traces. Alerts are correlated across the stack, prioritized and presented by service impact; teams are notified of potential issues BEFORE they turn red; the right teams are automatically alerted of incidents to take action; runbooks are automated for known issues.
  11. How to find a needle in multiple haystacks? (choose your tool) Network? Database? Middleware? Hardware? Wrong command? Connection? Apache? VM? Mainframe? Load balancer? Wrong code released? Collect ALL data: collect from all silos, data in original raw format, add open-source apps to ingest data on the fly, schema on the fly, dynamic thresholding, real-time correlation. Clustering & aggregation: real-time event clustering/correlation, reduce alert noise, behavioural analytics, deduplication. Add context: measure/report on indicators that matter, add service/business context, add actionable information to detection. Anomaly detection: catch issues that thresholds cannot, reduce event clutter, deviation from past behaviour, deviation from peers, unusual change in features. Assisted deep-dive investigation: root cause analysis, powerful and easy-to-use search and investigation language. Predictive analytics: predict service health, predict events, trend forecasting, detect influencing entities, early warning of failure. Typical results: 70% to 90% reduction in investigation time, 15% to 45% reduction in high-priority incidents, 67% to 82% reduction in business impact.
  12. How We’re Getting There. Richard Bailey, CMC Markets
  13. Introduction • Not a blueprint • Organic/agile • Our challenges • Multiple use cases • Process • What we collect • DIY anomaly detection • Predicting the predictable • Essential housekeeping
  14. What Does CMC Markets Do? • Online retail financial trading • Spreadbets & CFDs • Leveraged products • Short-term positions • Automated trading • Worldwide product base
  15. Specific Monitoring Situation (that may not apply to all Splunk customers) • Short, sharp, unpredictable load → highly granular stats (e.g. per second, per CPU) • Sub-second performance targets → care about short pauses • External SLAs → financial penalties • Regulatory environment → fast, fair, transparent, evidenced • In-house development → can change logging
  16. Base Splunk @ 1TB/day, on-prem, 2-site clustered, all-flash storage. Used for: Enterprise Security, log management, application performance monitoring, monitoring (everything), IT ops, security (incl. SIEM), business ops, perf testing, surveillance, capacity management, SLA reporting, alert generation.
  17. Dashboards: we have distinct types of dashboard. General (e.g. Splunk’s MC): full picture, multi-use, peace of mind, support. Alert Response: specific alert, reduce MTTR, self-explanatory, supports the runbook, encapsulates expertise, plus alert tuning. Live: rare (we prefer alerts), maximize info, not self-explanatory, human correlation. Business: operational; their only route to data.
  18. Process: a culture of closed-loop continuous improvement. Service monitoring → restore service → investigate → post-incident review, with machine learning supplying incidents, alerts, anomalies, predictions and insights, and lessons learnt feeding back as noise reduction and improvements. Post-incident questions: Could we have prevented this? Could we have seen this coming? Could we have got to root cause faster? Did we have all the data/insights we needed? Can we eliminate any noise? Did we need to write SPL? Solutions: runbooks, dashboards; aim: no SPL.
  19. Monitoring Services. Service internals: application logs, in-memory counters, JMX, GC logs, monitoring API. Load: EUM -> CDN -> TM -> logs, plus upstream services. Performance: logs -> TM -> CDN -> EUM, plus downstream services. State: infrastructure, storage, network, messaging, DBs. Resource utilisation: CPU, IO, memory, network. Correlation across all of the above.
  20. Anomaly Detection: the goals • Detect effects of changes • Early warning (but still value in post-incident info) • Must handle incidents: today’s slowdown must not become tomorrow’s normal, yet responsive to intended service changes, but not ignoring long-term gradual degradation • Control: adjust sensitivity, reduce false alarms • Handle hot/cold nodes and rolling restarts • Relatable (black box vs plain sight) • Traceable (back to real figures) • Actionable (deal with both incidents and false alarms)
  21. Anomaly Detection: our typical pattern. Events → per-minute KPI summary (summary index: time, operation, instance, avg value) → rolled up in bulk into a daily baseline KPI (summary index: operation, time of day, all instances) holding the median KPI and key percentiles → a range, over typically 3 weeks of data. Express the difference between the current KPI value and the baseline median as a multiple of the percentile range, for this operation, for this time of day, and trigger on the value of the multiple (e.g. 2x).
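A minimal sketch of this pattern, in Python rather than the deck’s SPL. The slide doesn’t say which percentiles define the range, so p10–p90 below is an assumption, as are the function names and sample figures:

```python
import statistics

def baseline(history):
    """Build a baseline from ~3 weeks of per-minute KPI values
    for one operation at one time of day (all instances pooled)."""
    median = statistics.median(history)
    deciles = statistics.quantiles(history, n=10)  # 9 cut points
    rng = deciles[-1] - deciles[0]                 # ~p90 - ~p10: outlier-resistant spread
    return median, rng

def anomaly_multiple(current, median, rng):
    # How many "typical ranges" above the baseline median is the current value?
    return (current - median) / rng if rng else 0.0

history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
med, rng = baseline(history)
m = anomaly_multiple(130, med, rng)
if m > 2:  # trigger threshold from the slide, e.g. 2x
    print(f"anomaly: {m:.1f}x the typical range")
```

Using a percentile range instead of a raw min/max is what makes the baseline resistant to a few outlier minutes, which is exactly the lever the next slide describes.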
  22. Anomaly Detection: visualise the time-based baseline.
  23. Levers: building in the control we need. The same pipeline as before (events → per-minute KPI summary → daily baseline of median KPI and key percentiles over typically 3 weeks; trigger on the multiple, e.g. 2x), plus controls: don’t let today’s anomaly be tomorrow’s baseline; the range is not the threshold: use the range to eliminate outliers and the multiple to control the threshold; keep the baseline window as short as possible while still getting a decent spread of data. No data cleansing; backtest; dashboard support; 2-week trial.
  24. Predicting EOD License: will we bust the Splunk license today? • Usage varies over the day • Simple extrapolation would not work • We run close to the license limit, so we need accuracy • Has to handle earlier incidents (high usage) • Uses a typical day as a baseline, which must be recent and must be recalibrated after an incident
  25. EOD License: Baseline. Derive the baseline, using percentiles again to remove outliers • Look at the last 9 days • Use cumulative volume (streamstats) • Find EOD volume (eventstats) • 2 will be weekends • Up to 3 more days could be trading holidays • Allow up to 2 days to have incidents • Don’t want to blend days • Use the 3rd biggest day (exactperc72(EOD)) • Could have >1 day with the same EOD • Could be smarter, but good enough
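The 3rd-biggest-day selection can be sketched in Python. The deck does it in SPL with exactperc72(EOD); for 9 values the 72nd percentile lands on the 3rd-largest, so the two agree. The function name and sample figures below are mine:

```python
def baseline_eod(eod_volumes):
    """Pick the baseline end-of-day volume from the last 9 days of
    EOD totals. Taking the 3rd biggest skips up to two unusually
    high days (e.g. incidents); weekends and trading holidays sit
    at the low end and are ignored automatically."""
    return sorted(eod_volumes, reverse=True)[2]

# Last 9 days of ingest in GB: two incident spikes, a quiet weekend.
days = [950, 2100, 980, 40, 35, 990, 1800, 1000, 970]
print(baseline_eod(days))  # 1000
```

Picking a fixed rank rather than averaging honours the slide’s “don’t want to blend days” rule: the baseline is always one real, whole day.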
  26. Predicting EOD License: predicting EOD based on sensible assumptions. Compare today’s rolling 60-minute usage against the baseline day’s rolling 60 minutes at the same time of day, and combine what has been used today with the baseline day’s rest-of-day volume.
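One plausible reading of this slide’s diagram: scale the baseline day’s remaining volume by the ratio of the two rolling 60-minute windows, on top of what has already been used. The function and numbers below are illustrative assumptions, not the deck’s actual SPL:

```python
def predict_eod(used_today, today_rolling60, baseline_rolling60,
                baseline_rest_of_day):
    """Hypothetical sketch: project the baseline day's rest-of-day
    volume, scaled by how today's last 60 minutes compare to the
    baseline's 60 minutes at the same time of day, on top of what
    has already been used today."""
    rate = today_rolling60 / baseline_rolling60
    return used_today + rate * baseline_rest_of_day

# e.g. midday: 400 GB used, ingesting 10% hotter than baseline,
# and the baseline day had 600 GB still to come after this time.
print(predict_eod(400, 55, 50, 600))  # ~1060 GB predicted at EOD
```

This shape satisfies the slide’s constraints: it tracks intraday variation (via the same-time-of-day comparison) rather than extrapolating a flat rate.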
  27. Dashboard Support: getting to root cause while avoiding SPL • Recent history • Breakdown by sourcetype • Breakdown by index • Historical context • Comparison day • Biggest increases by sourcetype • Biggest increases by index • Comparison by time: sourcetype • Comparison by time: index • Latest prediction • Trendline
  28. Trajectory-based Disk Alert: why wait until a static alert fires?
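The slide doesn’t show its search, but a trajectory-based alert of this kind can be sketched as a least-squares line through recent usage samples, extrapolated to 100% full. The function name, sampling interval and figures are assumptions:

```python
def hours_until_full(samples_pct, interval_hours=1.0):
    """Fit a straight line through recent disk-usage samples (percent)
    and extrapolate to 100%. Returns None if usage is flat or falling."""
    n = len(samples_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_pct) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_pct))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var  # percent gained per sample
    if slope <= 0:
        return None
    return (100 - samples_pct[-1]) / slope * interval_hours

# Hourly samples trending up ~1%/hour from 90% full.
print(hours_until_full([90, 91, 92, 93, 94]))  # 6.0
```

The point of the slide is the alert condition: fire when time-to-full drops below your response window (say, 24 hours), instead of waiting for a static 95% threshold that a fast-filling disk can blow through overnight.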
  29. Live Dashboard: fed by Splunk (but not built in Splunk) • Combines static and dynamic • Services are grouped • Static view shows RAG (red/amber/green) status • Dynamic list shows details • Middle column is single-site • Plus news and changes
  30. Behind the Scenes: housekeeping tasks we do to keep this on the road • Build summary indexes for speed & retention • Handle late-arriving data • Detect increases/decreases in index volume • Detect when events stop (not trivial) • Check assumptions made in searches • Manage lookups • Handle alert exclusions • Handle clock changes • Dashboard/report curation • Manage scheduled report load • Treat it as code: test it, KISS, manage changes • It’s not rocket science!
  31. Thank you