
Still Suffering from IT Outages? Accept Failure, Learn from Failure and Get Rid of Failure to Protect your Business


IT operations continues to grow in complexity. There are too many alerts for human operators to process, and little to no visibility into which alerts are business-impacting. And it is only getting worse as organisations add new devices and an ever-growing list of services, slowing detection and resolution times. Why is this a problem? Legacy IT solutions have left organisations in a complex, manual state: with too many siloed tools, IT productivity remains low, and teams often struggle to find the exact root cause.

Published in: Technology


  1. © 2018 SPLUNK INC. Still haven’t got on top of IT outages? Accept failure, learn from failure and get rid of failure to protect your business. Dr. Siyka Andreeva | IT Operations Analytics Specialist. April 2019
  2. Forward-Looking Statements. During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release. Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2017 Splunk Inc. All rights reserved.
  3. Agenda: Why You Need to Stop Being Reactive • Data and Machine Learning: How to Get to a Predictive IT • Case Study with CMC Markets
  4. High availability is everywhere! How many 9’s do you have? 100% 100% 100% 99.999%
  5. Because we live in a (theoretical) SLA world, but surrounded by storms, human errors and trolls. Serial compound availability: App Service (99.95%) -> SQL (99.95%), where the app service, SQL, or both can be down, so the overall “service” availability is lower: 99.90%. Serial and parallel availability: two App Service/SQL legs, A and B (each 99.95% -> 99.95%), behind a Traffic Manager (99.99%); either tier of A or B can be down, the Traffic Manager can be down, or a combination of the above. The overall SLA is 99.98%, and you still have a SPOF.
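The compound-availability arithmetic on this slide can be checked with a short script. The figures come from the slide; the helper names are mine:

```python
def serial(*availabilities):
    # All components in a chain must be up: multiply availabilities.
    p = 1.0
    for a in availabilities:
        p *= a
    return p

def parallel(*availabilities):
    # A redundant group is up unless every leg is down simultaneously.
    down = 1.0
    for a in availabilities:
        down *= (1.0 - a)
    return 1.0 - down

# One leg: App Service -> SQL, each 99.95%
leg = serial(0.9995, 0.9995)
print(f"Serial chain: {leg:.4%}")            # ~99.9000%

# Two legs behind a 99.99% Traffic Manager (the remaining SPOF)
overall = serial(parallel(leg, leg), 0.9999)
print(f"Redundant legs + SPOF: {overall:.4%}")
```

Note how the redundancy makes the two-leg group nearly perfect, so the overall figure is dominated by the single Traffic Manager: the SPOF sets the ceiling.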
  6. And yet there are more outages than ever: 25% (2017) rising to 31% (2018) suffered an outage or period of severe service degradation over the past 12 months; 48% if an on-prem data centre; 80% could have been prevented. Leading causes: human errors, power outages, network and configuration issues. Source: Uptime Institute 2018 (8th annual Data Center Survey).
  7. More outages than ever + higher cost per incident: customer satisfaction, brand reputation and lines of revenue are all at stake. $105,302: the mean business cost of an IT incident (according to “Damage Control: The Impact of Critical IT Incidents”).
  8. Predict and Prevent Operational Issues with AI: the cost of impact shrinks as you move from being reactively alerted on existing events (long MTTR), to proactive monitoring by adding logs and metrics (shorter MTTR), to automated resolution driven by a Splunk ML alert (shortest MTTR).
  9. Predict and Prevent Operational Issues with AI (continued): the predictive stage goes further, predicting issues 30 minutes in advance of the impact. Negative MTTR! Time is returned to the business instead of being spent on recovery.
  10. Multiple data sources + machine learning: online services, networks, security, firewall logs, call detail records, web services, telecoms, web clickstreams, tracing, online shopping carts, smartphones and devices, custom and packaged applications, databases, messaging, energy meters, storage, public and private cloud, containers, on-premises servers, GPS location, RFID, wired/DB/mobile/IoT/API metrics, data lakes, APM traces. Alerts are correlated across the stack, prioritized and presented by service impact; teams are notified of potential issues BEFORE they turn red; the right teams are automatically alerted of incidents to take action; runbooks are automated for known issues.
  11. How to find a needle in multiple haystacks? (choose your tool) Network? Database? Middleware? Hardware? Wrong command? Connection? Apache? VM? Mainframe? Load balancer? Wrong code released? Collect ALL data: collect from all silos, data in original raw format, add open-source apps to ingest data on the fly, schema on the fly, dynamic thresholding, real-time correlation. Clustering & aggregation: real-time event clustering/correlation, reduce alert noise, behavioural analytics, deduplication. Add context: measure/report on indicators that matter, add service/business context, add actionable information to detection. Anomaly detection: catch issues that thresholds cannot, reduce event clutter, deviation from past behaviour, deviation from peers, unusual change in features. Assisted deep-dive investigation: root cause analysis, powerful and easy-to-use search and investigation language. Predictive analytics: predict service health, predict events, trend forecasting, detect influencing entities, early warning of failure. Typical results: 70% to 90% reduction in investigation time, 15% to 45% reduction in high-priority incidents, 67% to 82% reduction in business impact.
  12. How We’re Getting There. Richard Bailey, CMC Markets
  13. Introduction • Not a blueprint • Organic/agile • Our challenges • Multiple use cases • Process • What we collect • DIY anomaly detection • Predicting the predictable • Essential housekeeping
  14. What Does CMC Markets Do? • Online retail financial trading • Spreadbets & CFDs • Leveraged products • Short-term positions • Automated trading • Worldwide product base
  15. Specific Monitoring Situation (that may not apply to all Splunk customers) • Short, sharp, unpredictable load → highly granular stats (e.g. per second, per CPU) • Sub-second performance targets → care about short pauses • External SLAs → financial penalties • Regulatory environment → fast, fair, transparent, evidenced • In-house development → can change logging
  16. Base Splunk @ 1TB/day, on-prem, 2-site clustered, all-flash storage. Used for: Enterprise Security, log management, application performance monitoring, monitoring (everything), IT ops, security (incl. SIEM), business ops, perf testing, surveillance, capacity management, SLA reporting, alert generation.
  17. Dashboards: we have distinct types of dashboard. General (e.g. Splunk’s MC): full picture, multi-use, peace of mind, support. Alert Response: specific alert, reduce MTTR, self-explanatory, supports the runbook, encapsulates expertise, plus alert tuning. Live: rare (we prefer alerts), maximize info, not self-explanatory, human correlation. Business: operational; their only route to data.
  18. Process: a culture of closed-loop continuous improvement. Service monitoring → restore service → investigate → post-incident review, with machine learning supplying incidents, alerts, anomalies, predictions and insights, and lessons learnt feeding back as noise reduction and improvements. Post-incident questions: Could we have prevented this? Could we have seen this coming? Could we have got to root cause faster? Did we have all the data/insights we needed? Can we eliminate any noise? Did we need to write SPL? Solutions: runbooks, dashboards; aim: no SPL.
  19. Monitoring Services. Service internals: application logs, in-memory counters, JMX, GC logs, monitoring API. Load: EUM -> CDN -> TM -> logs, plus upstream services. Performance: logs -> TM -> CDN -> EUM, plus downstream services. State: infrastructure, storage, network, messaging, DBs. Resource utilisation: CPU, IO, memory, network. Correlation across all of the above.
  20. Anomaly Detection: the goals • Detect effects of changes • Early warning (but still value in post-incident info) • Must handle incidents: today’s slowdown must not become tomorrow’s normal, yet responsive to intended service changes, but not ignoring long-term gradual degradation • Control: adjust sensitivity, reduce false alarms • Handle hot/cold nodes and rolling restarts • Relatable (black box vs plain sight) • Traceable (back to real figures) • Actionable (deal with both incidents and false alarms)
  21. Anomaly Detection: our typical pattern. Events → per-minute KPI summary (summary index: time, operation, instance, avg value) → rolled up in bulk into a daily baseline KPI (summary index: operation, time of day, all instances) holding the median KPI and key percentiles → a range, over typically 3 weeks of data. Express the difference between the current KPI value and the baseline median as a multiple of the percentile range, for this operation, for this time of day, and trigger on the value of the multiple (e.g. 2x).
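A minimal sketch of this pattern, in Python rather than the deck’s SPL. The slide doesn’t say which percentiles define the range, so p10–p90 below is an assumption, as are the function names and sample figures:

```python
import statistics

def baseline(history):
    """Build a baseline from ~3 weeks of per-minute KPI values
    for one operation at one time of day (all instances pooled)."""
    median = statistics.median(history)
    deciles = statistics.quantiles(history, n=10)  # 9 cut points
    rng = deciles[-1] - deciles[0]                 # ~p90 - ~p10: outlier-resistant spread
    return median, rng

def anomaly_multiple(current, median, rng):
    # How many "typical ranges" above the baseline median is the current value?
    return (current - median) / rng if rng else 0.0

history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
med, rng = baseline(history)
m = anomaly_multiple(130, med, rng)
if m > 2:  # trigger threshold from the slide, e.g. 2x
    print(f"anomaly: {m:.1f}x the typical range")
```

Using a percentile range instead of a raw min/max is what makes the baseline resistant to a few outlier minutes, which is exactly the lever the next slide describes.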
  22. Anomaly Detection: visualise the time-based baseline.
  23. Levers: building in the control we need. The same pipeline as before (events → per-minute KPI summary → daily baseline of median KPI and key percentiles over typically 3 weeks; trigger on the multiple, e.g. 2x), plus controls: don’t let today’s anomaly be tomorrow’s baseline; the range is not the threshold: use the range to eliminate outliers and the multiple to control the threshold; keep the baseline window as short as possible while still getting a decent spread of data. No data cleansing; backtest; dashboard support; 2-week trial.
  24. Predicting EOD License: will we bust the Splunk license today? • Usage varies over the day • Simple extrapolation would not work • We run close to the license limit, so we need accuracy • Has to handle earlier incidents (high usage) • Uses a typical day as a baseline, which must be recent and must be recalibrated after an incident
  25. EOD License: Baseline. Derive the baseline, using percentiles again to remove outliers • Look at the last 9 days • Use cumulative volume (streamstats) • Find EOD volume (eventstats) • 2 will be weekends • Up to 3 more days could be trading holidays • Allow up to 2 days to have incidents • Don’t want to blend days • Use the 3rd biggest day (exactperc72(EOD)) • Could have >1 day with the same EOD • Could be smarter, but good enough
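The 3rd-biggest-day selection can be sketched in Python. The deck does it in SPL with exactperc72(EOD); for 9 values the 72nd percentile lands on the 3rd-largest, so the two agree. The function name and sample figures below are mine:

```python
def baseline_eod(eod_volumes):
    """Pick the baseline end-of-day volume from the last 9 days of
    EOD totals. Taking the 3rd biggest skips up to two unusually
    high days (e.g. incidents); weekends and trading holidays sit
    at the low end and are ignored automatically."""
    return sorted(eod_volumes, reverse=True)[2]

# Last 9 days of ingest in GB: two incident spikes, a quiet weekend.
days = [950, 2100, 980, 40, 35, 990, 1800, 1000, 970]
print(baseline_eod(days))  # 1000
```

Picking a fixed rank rather than averaging honours the slide’s “don’t want to blend days” rule: the baseline is always one real, whole day.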
  26. Predicting EOD License: predicting EOD based on sensible assumptions. Compare today’s rolling 60-minute usage against the baseline day’s rolling 60 minutes at the same time of day, and combine what has been used today with the baseline day’s rest-of-day volume.
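One plausible reading of this slide’s diagram: scale the baseline day’s remaining volume by the ratio of the two rolling 60-minute windows, on top of what has already been used. The function and numbers below are illustrative assumptions, not the deck’s actual SPL:

```python
def predict_eod(used_today, today_rolling60, baseline_rolling60,
                baseline_rest_of_day):
    """Hypothetical sketch: project the baseline day's rest-of-day
    volume, scaled by how today's last 60 minutes compare to the
    baseline's 60 minutes at the same time of day, on top of what
    has already been used today."""
    rate = today_rolling60 / baseline_rolling60
    return used_today + rate * baseline_rest_of_day

# e.g. midday: 400 GB used, ingesting 10% hotter than baseline,
# and the baseline day had 600 GB still to come after this time.
print(predict_eod(400, 55, 50, 600))  # ~1060 GB predicted at EOD
```

This shape satisfies the slide’s constraints: it tracks intraday variation (via the same-time-of-day comparison) rather than extrapolating a flat rate.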
  27. Dashboard Support: getting to root cause while avoiding SPL • Recent history • Breakdown by sourcetype • Breakdown by index • Historical context • Comparison day • Biggest increases by sourcetype • Biggest increases by index • Comparison by time: sourcetype • Comparison by time: index • Latest prediction • Trendline
  28. Trajectory-based Disk Alert: why wait until a static alert fires?
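The slide doesn’t show its search, but a trajectory-based alert of this kind can be sketched as a least-squares line through recent usage samples, extrapolated to 100% full. The function name, sampling interval and figures are assumptions:

```python
def hours_until_full(samples_pct, interval_hours=1.0):
    """Fit a straight line through recent disk-usage samples (percent)
    and extrapolate to 100%. Returns None if usage is flat or falling."""
    n = len(samples_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_pct) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples_pct))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var  # percent gained per sample
    if slope <= 0:
        return None
    return (100 - samples_pct[-1]) / slope * interval_hours

# Hourly samples trending up ~1%/hour from 90% full.
print(hours_until_full([90, 91, 92, 93, 94]))  # 6.0
```

The point of the slide is the alert condition: fire when time-to-full drops below your response window (say, 24 hours), instead of waiting for a static 95% threshold that a fast-filling disk can blow through overnight.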
  29. Live Dashboard: fed by Splunk (but not built in Splunk) • Combines static and dynamic • Services are grouped • Static view shows RAG (red/amber/green) status • Dynamic list shows details • Middle column is single-site • Plus news and changes
  30. Behind the Scenes: housekeeping tasks we do to keep this on the road • Build summary indexes for speed & retention • Handle late-arriving data • Detect increases/decreases in index volume • Detect when events stop (not trivial) • Check assumptions made in searches • Manage lookups • Handle alert exclusions • Handle clock changes • Dashboard/report curation • Manage scheduled report load • Treat it as code: test it, KISS, manage changes • It’s not rocket science!
  31. Thank you