That Conference 2017: Refactoring your Monitoring

Jamie Riedesel
DevOps Engineer
@sysadm1138
Route-Planning your Monitoring Stack Climb
@sysadm1138 That Conference 2017

Today’s Climb
Overview
Your monitoring stack
Deciding what to monitor
The monitoring project-plan
Extra: Humane on-call rotations

Your Monitoring Stack
LEARNING THE TERRITORY

This is your stack. Really
Polling Engine
Reporting Engine
User Interface
Aggregation Engine
Alerting Engine
API
Humans
Policy Engine

Scheduled-tasks &
PowerShell
Scheduler runs scripts on a schedule.
Scripts emit email or update a database.

“Full-Stack”
SolarWinds.
Zenoss.
Nagios.

Open-Source Medley
Nagios + Graphite + Grafana
Logstash + InfluxDB + Kibana + Bosun
Graylog + New Relic + Hash.io + DataDog
Nexosis + Go + OpenTSDB + Grafana

Polling Engine
The whatever that fetches data.
● SNMP agents
● WMI endpoints
● Nagios agent
● SolarWinds agent
● PowerShell scripts
● Bash scripts
● Polling Engines in Nagios & SolarWinds
Polling Engine

Aggregation Engine
Turns raw data into useful data.
● Summarizes over time (think RRDTool)
● Does stats (min/max/%-tile) on the incoming stream.
● Summarizes over system/rack/datacenter
No one (except possibly Google) keeps full
granularity monitoring logs forever and ever in
a trivially queryable way. Too expensive, and
you don’t usually care about 2 years ago.
Aggregation Engine
Alerting
Engine
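
A minimal sketch of the "summarize over time" idea, in the spirit of RRDTool-style rollups. The function names, the nearest-rank percentile method, and the choice of statistics are assumptions made for illustration, not anything a particular product does.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: List[Tuple[int, float]], bucket_seconds: int) -> Dict[int, Dict[str, float]]:
    """Group (timestamp, value) samples into fixed-width buckets and keep only summary stats."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Key each bucket by the timestamp at which it starts.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return {
        start: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            "p95": percentile(vals, 95),
        }
        for start, vals in sorted(buckets.items())
    }


# Hypothetical 10-second samples rolled up into 1-minute buckets.
raw = [(0, 1.0), (10, 3.0), (50, 2.0), (60, 10.0)]
rollup = summarize(raw, bucket_seconds=60)
```

Once the rollup exists, the raw samples can be dropped on whatever schedule the policy engine sets, which is where the storage savings come from.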

Alerting Engine
Bothering humans in realtime!
● May do analytics.
● May be threshold-based, or trigger on
very sophisticated conditions.
● Scripts that send email every time.
● Scripts that drop notices in group-chat.
● Night-operator calling the Systems
Engineers
● PagerDuty.
Alerting
Engine

Reporting Engine
Bothering humans on a lag!
● Long-term trends
● Capacity analysis
● Growth tracking
● Full-bore big-data analytics
● SLA pass/fail reporting
● Track user behaviors across features
● BA building reports for executives
Reporting
Engine

API
Programmatic interfaces into your monitoring
system.
● Build feedback systems
● Manage policy-engine details
● Could be your CM system
Good monitoring systems have APIs. It makes
them easier to integrate with. And integration is
usage.
API

User Interface
How humans interface with it.
A monitoring system with a bad user-interface
is a bad monitoring system.
- Jamie Riedesel, lots of times
I’ve seen things.
User Interface
API

User Interface
To access a previous job’s monitoring system:
1. Open a browser.
2. Log in using 2-factor to our SSL-VPN.
3. Connect to RDP using same password as VPN.
4. Open another browser.
5. Hit Monitoring site.
6. Using non SSO-ed password, log in.
7. See what’s going on.
User Interface
API

Policy Engine
This defines the behavior of each stage of the
stack.
Configured as part of the User Interface and
API.
Policy Engine
User Interface
API
Humans

Policy Engine +
Polling engine
● How often are things polled?
○ Every 10s, 1m, 2m, 5m, 1d?
● Does polling get paused for
maintenance-windows?
● What data gets reported to the
Aggregation Engine?
Polling Engine
Policy Engine
User Interface
API
Humans
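
One way the polling questions above could be written down as policy, sketched under assumptions: `PollingPolicy` and its fields are hypothetical names, and times are plain epoch seconds to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PollingPolicy:
    """Hypothetical per-check policy: how often to poll and what maintenance does to polling."""
    interval_seconds: int
    # (start, end) pairs in epoch seconds; end is exclusive.
    maintenance_windows: List[Tuple[int, int]] = field(default_factory=list)
    pause_during_maintenance: bool = True

    def in_maintenance(self, now: int) -> bool:
        return any(start <= now < end for start, end in self.maintenance_windows)

    def should_poll(self, now: int, last_polled: int) -> bool:
        # Skip polling entirely inside a window when the policy says to pause.
        if self.pause_during_maintenance and self.in_maintenance(now):
            return False
        return now - last_polled >= self.interval_seconds


# Poll every minute, except during a maintenance window from t=1000 to t=2000.
policy = PollingPolicy(interval_seconds=60, maintenance_windows=[(1000, 2000)])
```

Pausing the poll is only one answer. Setting `pause_during_maintenance=False` keeps data flowing and leaves it to the alerting policy to stay quiet instead.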

Policy Engine +
Aggregation Engine
● How long do you keep data at all?
● How long do you keep full granularity
data?
● How long do you keep summarized data?
● Where do you keep full granularity data?
● Where do you keep summarized data?
● How do you summarize data?
○ Time? System? Location?
● Do maintenance windows affect any of
Polling Engine
Aggregation Engine
Policy Engine
User Interface
API
Humans

Policy Engine +
Alerting Engine
● Which alarms merit bothering humans?
● Which alarms merit automatic fixing?
● Which alarms can be ignored?
● How do maintenance-windows impact
alarms?
● What escalation policies are in place?
Polling Engine
Aggregation Engine
Alerting
Engine
API
Policy Engine
User Interface
API
Humans
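
The alerting questions above boil down to a routing decision per alarm. The sketch below is one hypothetical shape for that decision. `Action`, `AlarmRule`, `route`, and the 15-minute escalation step are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    PAGE = "bother a human"
    AUTO_FIX = "run automatic remediation"
    IGNORE = "record only"


@dataclass
class AlarmRule:
    name: str
    action: Action
    suppress_in_maintenance: bool = True
    # Who gets paged, in escalation order. Only used for PAGE rules.
    escalation: List[str] = field(default_factory=list)


def route(rule: AlarmRule, in_maintenance: bool, minutes_unacked: int = 0,
          escalate_after: int = 15) -> Tuple[Action, Optional[str]]:
    """Return what to do with a firing alarm and, for pages, who receives it."""
    if in_maintenance and rule.suppress_in_maintenance:
        return Action.IGNORE, None
    if rule.action is not Action.PAGE or not rule.escalation:
        return rule.action, None
    # Move one step up the chain for every `escalate_after` minutes without an ack.
    step = min(minutes_unacked // escalate_after, len(rule.escalation) - 1)
    return Action.PAGE, rule.escalation[step]


disk_full = AlarmRule("db-disk-full", Action.PAGE,
                      escalation=["on-call", "secondary", "manager"])
```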

Policy Engine +
Reporting Engine
● Do reports get automatically generated?
● What reports are viewable on-demand?
● What reports are defined?
● Are ad-hoc reports possible?
● Who gets automatically generated
reports?
● What trends are we looking for?

That’s cleared up!
Polling Engine
Reporting Engine
User Interface
Aggregation Engine
Alerting Engine
API
Humans
Policy Engine

Deciding What To Monitor
PLANNING THE APPROACH

Different Kinds of
Monitoring
Granularity and goals differ
from type to type. Be aware of
these as you build your system.
Performance Monitoring
Operational Monitoring
Capacity Monitoring
SLA Monitoring

Performance Monitoring
Granularity: Very high (10s, 1s, or even sub-second)
Duration: As-needed
Response: Realtime
Tools: Procmon, Wireshark, strace, perf, Performance Monitor, gdb
Typically done as part of debugging, troubleshooting, and profiling activities. Granularity is much
higher than operational monitoring. Typically, results are reviewed in near realtime and not
persisted long.

Operational Monitoring
Granularity: Medium (1m, 2m, 5m, 10m, 1h, etc.)
Duration: Continuous.
Response: Rapid.
Tools: Dell OpenManage, HP Operations Manager, Cisco OpManager, NetApp
What most people think of when you say monitoring (but they’re wrong). This type of monitoring
catches the health of your infrastructure and is not directly related to the services it provides.
Think disk replacements, switch failures, and tornados.

This one is easy
OPERATIONAL MONITORING
1
The SLA for this is: our infrastructure can support the
delivery of our products and services.
● Switch failures.
● Disk failures.
● Blade-chassis failures.
● UPS failures.
● PSU / PDU failures.
● Compliance failures.

Capacity Monitoring
Granularity: Low (1h, 1d, 1w, 1m)
Duration: Continual or occasional
Response: Slow
Tools: Grafana, Kibana, Graphite, Nagios, Excel
Monitoring the capacity of your system to do work. Lead times can be quite long for some
replacements (SAN arrays), and capacity can be budgetary more than hardware. Especially in
cloud contexts.
Capacity Monitoring

How much do I need,
And when do I need it?
CAPACITY MONITORING
2
Every product or service uses consumables. This is
where you track them:
● Disk-space
● Cloud budget
● Overtime allowance
● P1 incident usage
● SmartHands budget

Service Level Agreement Monitoring
Granularity: Medium to Low
Duration: Continual
Response: Rapid and Slow
Tools: Everything
Monitoring to detect whether or not you’re meeting your SLA for a given service or services.
Where most monitoring really exists.
SLA Monitoring

This one is complicated
SERVICE LEVEL AGREEMENT MONITORING
3
How your product or service is supposed to perform. Not
just executives care about SLAs.
SLA: Service Level Agreement
SLO: Service Level Objectives
SLI: Service Level Indicators
We’ll get into these.

What if we don’t have SLAs? That’s like…
commitment. We avoid that around here!
Yes, you have an SLA
No, really. You do.

The service is up when our users need it
to be.
And if it isn’t, they’re allowed to slag us
on Twitter.
In short, 100% uptime or your reputation will be hauled through the meat-grinder.
DE FACTO SERVICE LEVEL AGREEMENT

We promise X availability, on penalty of Y
things, outside of Q maintenance
periods. Planned outages will have no
less than Z days notice...
Less likely to end up as a meme on Twitter. This can be 100% an internal-only document!
DEFINED SERVICE LEVEL AGREEMENT

Service Level Agreement (SLA): An agreement, written in
Human, or sometimes Lawyer. Sets goalposts, defines penalties
(if any), defines terms.
Service Level Objective (SLO): A set of objectives, written in
Engineer. Technical definition of the goalposts in the SLA.
Service Level Indicator (SLI): Something that tells you whether
or not you’re meeting your SLO.
DEFINITIONS

SLA: The service is up 99.99% of the time, not including
scheduled maintenance.
● The settings page renders in under 10 seconds.
● The site returns HTTP-200 from Europe within 2 seconds.
● Branch-office ADC01 can reach the service.
● 98th-percentile end-to-end request time is not more than 3
seconds.
● The SSL certificate is valid and chains to our CA.
● The text, “Welcome to Example Co,” is on the main page.
SLOs - SERVICE LEVEL OBJECTIVES
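
It helps to turn the 99.99% figure into a concrete downtime budget before writing SLOs against it. The helper below is an illustrative sketch. Its name and parameters are assumptions, and scheduled maintenance is subtracted from the period because the SLA above excludes it.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float,
                             maintenance_minutes: float = 0.0) -> float:
    """Minutes of unplanned downtime an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    # Scheduled maintenance is out of scope for the SLA, so it does not count as "time".
    in_scope_minutes = total_minutes - maintenance_minutes
    return in_scope_minutes * (100 - availability_pct) / 100


monthly_budget = allowed_downtime_minutes(99.99, period_days=30)
yearly_budget = allowed_downtime_minutes(99.99, period_days=365)
```

At 99.99% the budget is roughly 4.3 minutes in a 30-day month and about 52.6 minutes in a year, which is worth saying out loud when someone proposes the number.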

SLA: The site is up 99.99% of the time, not including scheduled
maintenance.
SLO:
● Site is reachable.
● The site is showing the right content.
● Scheduled maintenance is tracked.
SLOs - SERVICE LEVEL OBJECTIVES: HasDCDoneSomethingStupidToday.com
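
The three SLOs for this site can be checked from a single probe. This is a sketch under assumptions: `ProbeResult` stands in for whatever the polling engine returns, and the expected text is a made-up example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProbeResult:
    """Hypothetical result of one HTTP check against the site."""
    status_code: Optional[int]  # None when the connection failed outright
    body: str


def evaluate_probe(probe: ProbeResult, expected_text: str,
                   in_maintenance: bool) -> Dict[str, bool]:
    """Map one probe onto the SLOs: reachable, right content, maintenance tracked."""
    reachable = probe.status_code == 200
    checks = {
        "reachable": reachable,
        "right_content": reachable and expected_text in probe.body,
    }
    healthy = all(checks.values())
    # A failure only counts against the SLA outside scheduled maintenance.
    checks["counts_against_sla"] = (not healthy) and (not in_maintenance)
    return checks


good = evaluate_probe(ProbeResult(200, "<h1>Not yet.</h1>"), "Not yet.", in_maintenance=False)
wrong_page = evaluate_probe(ProbeResult(200, "It works!"), "Not yet.", in_maintenance=False)
planned = evaluate_probe(ProbeResult(None, ""), "Not yet.", in_maintenance=True)
```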

SLA: Printing is available in Computer Labs 99.99% of the time,
outside of scheduled closures and maintenance.
SLO:
● Every Computer Lab has at least one working printer with
paper.
● Printers service only the central print queues.
● The swipe-card terminal in Computer Labs must work for
the printers to be considered ‘working’.
● Printers do not work if they can’t talk to the payment
processor.
SLOs - SERVICE LEVEL OBJECTIVES: University Print Services

SLO: The settings page renders in under 10 seconds.
SLI:
● Logins work.
● Page render-time from same data-center.
● Page render-time from Europe.
● Database disk-queue length.
SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!

SLO: 98th-percentile end-to-end request time is not more than 3
seconds.
SLI:
● Time-to-process for all requests.
● Request processing was last seen working no more than 30 seconds ago.
● 10 minute 98th percentile request-time average.
● 10 minute 50th percentile request-time average.
SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!
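
A rough sketch of how these indicators might be computed from recorded request times. It reads "10 minute 98th percentile" as a percentile over a sliding 10-minute window and "functional 30 seconds ago" as a freshness check. Both readings are interpretations, and `RequestTimeSLI` is a hypothetical class.

```python
import math
from collections import deque
from typing import Deque, Optional, Tuple


class RequestTimeSLI:
    """Sliding-window request-time indicators, using plain numeric timestamps in seconds."""

    def __init__(self, window_seconds: float = 600, staleness_seconds: float = 30):
        self.window_seconds = window_seconds
        self.staleness_seconds = staleness_seconds
        self.samples: Deque[Tuple[float, float]] = deque()  # (timestamp, duration)

    def record(self, timestamp: float, duration: float) -> None:
        self.samples.append((timestamp, duration))

    def percentile(self, now: float, pct: float) -> Optional[float]:
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()
        if not self.samples:
            return None
        ordered = sorted(duration for _, duration in self.samples)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))  # nearest-rank method
        return ordered[rank - 1]

    def is_fresh(self, now: float) -> bool:
        """True if a request completed within the staleness limit."""
        return bool(self.samples) and now - self.samples[-1][0] <= self.staleness_seconds

    def meets_slo(self, now: float, limit_seconds: float = 3.0) -> bool:
        p98 = self.percentile(now, 98)
        return self.is_fresh(now) and p98 is not None and p98 <= limit_seconds


sli = RequestTimeSLI()
for second in range(100):
    sli.record(second, 1.0)   # one hundred fast requests
sli.record(100, 5.0)          # a single slow outlier
```

With one slow request out of 101 the 98th percentile stays at 1.0 seconds, so the SLO still holds. The 50th percentile is tracked alongside it to show whether a breach is a tail problem or everyone's problem.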

Alarm: Informing humans of failing SLI/SLOs in realtime.
Report: Eventually informing humans of failing SLI/SLOs.
Which humans do you bother for each SLI/SLO? Only you can
figure that out!
DEFINITIONS

Specific: Must tell me something specific is
wrong.
Alarms that require a human to log in to figure out what is
actually wrong, if anything is, are bad alarms.
FYI alarms lead to high cognitive load and decrease worker
satisfaction.
GOOD ALARMS

Actionable: Must be something I can directly fix
Getting alarmed for things you can’t fix is a great road to
burnout. These are especially great at 3:19 AM.
The failure mode is teaching people that some alarms can be
ignored safely. Eventually, they’ll ignore the wrong one. This is
bad.
GOOD ALARMS

Format Agnostic: Don’t be a dick about format
If a team wants full HTML with links to runbooks and wiki-pages,
let ’em.
If a team wants the entire alert to fit into their iPhone lock-screen,
let ’em.
Better, allow both!
GOOD ALARMS

Specific.
Actionable.
In the format you want.
GOOD ALARMS
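
All three properties can live in one alarm record: say what is wrong, say what to do, and render it however each team likes. The `Alarm` class, its fields, and the example URL are hypothetical.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Alarm:
    what: str          # Specific: the thing that is wrong
    where: str         # Specific: the system it is wrong on
    action: str        # Actionable: the next step for the person paged
    runbook_url: str

    def as_html(self) -> str:
        """Rich format, for teams that want links to runbooks and wiki pages."""
        return (
            f"<h2>{escape(self.what)} on {escape(self.where)}</h2>"
            f"<p>Next step: {escape(self.action)}</p>"
            f'<p><a href="{escape(self.runbook_url)}">Runbook</a></p>'
        )

    def as_short(self, limit: int = 120) -> str:
        """Compact format, for teams that want the whole alert on a lock-screen."""
        text = f"{self.where}: {self.what}. {self.action}"
        return text if len(text) <= limit else text[: limit - 1] + "…"


alarm = Alarm(
    what="Disk 92% full",
    where="db01",
    action="Expand volume or purge old logs",
    runbook_url="https://wiki.example.com/runbooks/disk-full",  # placeholder URL
)
```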

The Monitoring Project-Plan
MAKING THE ASCENT

Get Approval For The Project:
● If it’s just you, that’s easy! Do it.
● A good monitoring product is used by many people
○ Get buy-in from not just IT, but sales, support, etc.
● Pitch to the business-case, not process improvement for your department.
○ We will reduce customer churn by enabling our CSMs.
○ We will improve our reaction time to reputation-impacting events.
○ This will increase buy-in from other departments, enabling our IT
goals
0
PROJECT PLAN

Figure out high-level needs (SLA)
● If you have a written one? Great! Work backwards from that.
● If you have an unwritten one, ask people to see what they think it is.
○ Play 20-questions with higher level execs on impacts of down-time
and service degradations.
○ Point out the de facto SLA, see how they react.
○ Point out we don’t need to publish the SLA to our customers, but can
have one internally.
● If you have microservices, each service will need its own SLA.
1
PROJECT PLAN

Figure out concrete definitions (SLO)
● Now that you have an SLA, or many SLAs, do the analysis to determine
what ‘up’ and ‘responsive’ mean in a concrete way.
● Ask other people to get involved. Involvement keeps the project rolling.
● This is an opportunity for education with business leaders.
2
PROJECT PLAN

Figure out specific monitorables (SLI)
● Take your SLO list and figure out how to monitor for each.
● You may need to monitor new things.
● You may be able to stop monitoring/alarming some other things.
● Magic happens: your first opportunity to turn off existing alarms!
3
PROJECT PLAN

Figure out how to monitor those things
● Some of this may already exist. If so, cool.
● Some may need to be monitored in a different way.
● Some may need to be monitored for the first time.
● This defines how the Polling Engine works.
● Build new engines if you need to.
● Poll direct measurements where you can, try not to use proxy
measurements.
4
PROJECT PLAN
Polling Engine

Decide on your aggregation techniques
● Perhaps you don’t need to keep data as long as you thought.
● Perhaps you need to keep high granularity data longer than you thought.
● Perhaps you need to start tracking things like percentiles and
standard-deviations.
● This defines how the Aggregation Engine works.
5
PROJECT PLAN
Aggregation Engine

Alert Definition (Operational and SLA monitoring)
● Figure out who needs to know what and how fast they need to know it.
● One person shop? Easy!
● Ops team of 80? There will be meetings.
○ Work with each group individually.
○ Be flexible with requirements in each.
○ Don’t force communications-format standards without good cause.
○ Ensure the alarms are specific and actionable.
6
PROJECT PLAN
Alerting
Engine

Report Definition (Capacity and SLA monitoring)
● Figure out how to write the pass/fail report for your SLAs.
● Determine what kind of response-times are needed to address SLA risks.
● Determine what kind of response-times are needed for capacity risks.
● Determine who gets what.
7
PROJECT PLAN
Reporting
Engine
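
A pass/fail SLA report can be as small as the sketch below. It assumes outages and maintenance windows are recorded as (start, end) pairs in seconds, that maintenance windows do not overlap each other, and that outage time inside a maintenance window is excluded. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) in seconds, end exclusive


def overlap(a: Interval, b: Interval) -> int:
    """Seconds shared by two intervals (0 if they do not touch)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def sla_report(period: Interval, outages: List[Interval],
               maintenance: List[Interval], target_pct: float) -> Dict[str, object]:
    """Availability over the period, excluding scheduled maintenance, with a pass/fail verdict."""
    in_scope = (period[1] - period[0]) - sum(overlap(period, m) for m in maintenance)
    if in_scope <= 0:
        raise ValueError("the whole period is scheduled maintenance")
    counted_downtime = 0
    for outage in outages:
        clipped = (max(outage[0], period[0]), min(outage[1], period[1]))
        if clipped[1] <= clipped[0]:
            continue  # outage falls entirely outside the reporting period
        # Only the part of the outage outside maintenance counts against the SLA.
        counted_downtime += (clipped[1] - clipped[0]) - sum(overlap(clipped, m) for m in maintenance)
    availability = 100 * (in_scope - counted_downtime) / in_scope
    return {"availability_pct": availability, "passed": availability >= target_pct}


# Hypothetical reporting period of 100,000 seconds with one 5-second outage.
report = sla_report((0, 100_000), outages=[(1000, 1005)], maintenance=[], target_pct=99.99)
```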

Periodic Review
● Run the system for a while.
● Come back 3 months, 6 months later and ask questions.
○ How are the alarms working for you?
○ What changes do you think need to be made?
○ What new things have shown up?
● Especially important for departments that haven’t been attached to a
monitoring system before.
8
PROJECT PLAN
Humans

Step 0: Get approval
Step 1: Figure out high level needs (Service Level Agreement)
Step 2: Turn that into concrete definitions (Service Level Objectives)
Step 3: Figure out specific monitorables (Service Level Indicators)
Step 4: Decide how to monitor it (Polling Engine)
Step 5: Determine aggregation requirements (Aggregation Engine)
Step 6: Define Alerts (Operational and SLA monitoring)
Step 7: Define Reports (Capacity and SLA monitoring)
Step 8: Periodic Review

Post-Incident Review Questions
1. Did the monitoring system see the problem?
2. Did we react to the monitoring system, or humans?
3. Is it worth our time to catch this problem in the monitoring system?
4. What changes do we need to make, including to alerts, to deal with this in
the future?
9
PROJECT MAINTENANCE

Questions?
STACK CLIMBING
