Jamie Riedesel
DevOps Engineer
@sysadm1138
Route-Planning your Monitoring Stack Climb
@sysadm1138 | ThatConference 2017
Today’s Climb
Overview
Your monitoring stack
Deciding what to monitor
The monitoring project-plan
Extra: Humane on-call rotations
Your Monitoring Stack
LEARNING THE TERRITORY
This is your stack. Really
Polling Engine
Reporting Engine
User Interface
Aggregation Engine
Alerting Engine
API
Humans
PolicyEngine
Scheduled-tasks & PowerShell
Scheduler runs scripts on a schedule.
Scripts emit email or update a database.
“Full-Stack”
SolarWinds.
Zenoss.
Nagios.
Open-Source Medley
Nagios + Graphite + Grafana
Logstash + InfluxDB + Kibana + Bosun
Graylog + New Relic + Hash.io + DataDog
Nexosis + Go + OpenTSDB + Grafana
Polling Engine
The whatever that fetches data.
● SNMP agents
● WMI endpoints
● Nagios agent
● SolarWinds agent
● PowerShell scripts
● Bash scripts
● Polling Engines in Nagios & SolarWinds
Aggregation Engine
Turns raw data into useful data.
● Summarizes over time (think RRDTool)
● Does stats (min/max/%-tile) on incoming
stream.
● Summarizes over system/rack/datacenter
No one (except possibly Google) keeps full
granularity monitoring logs forever and ever in
a trivially queryable way. Too expensive, and
you don’t usually care about 2 years ago.
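A minimal sketch of what "summarizes over time" could look like, assuming raw samples arrive as (timestamp, value) pairs. The function names, the 5-minute bucket width, and the nearest-rank percentile method are illustrative assumptions, not something the slides prescribe.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rollup(samples, bucket_seconds=300, pct=95):
    """Summarize (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            f"p{pct}": percentile(vals, pct),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }
```

Once the summaries exist, the raw samples can be dropped, which is the trade that keeps long-term storage affordable.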
Alerting Engine
Bothering humans in realtime!
● May do analytics.
● May be threshold-based, or trigger on
very sophisticated conditions.
● Scripts that send email every time.
● Scripts that drop notices in group-chat.
● Night-operator calling the Systems
Engineers
● PagerDuty.
Reporting Engine
Bothering humans on a lag!
● Long-term trends
● Capacity analysis
● Growth tracking
● Full-bore big-data analytics
● SLA pass/fail reporting
● Track user behaviors across features
● BA building reports for executives
API
Programmatic interfaces into your monitoring
system.
● Build feedback systems
● Manage policy-engine details
● Could be your CM system
Good monitoring systems have APIs. It makes
them easier to integrate with. And integration is
usage.
User Interface
How humans interface with it.
A monitoring system with a bad user-interface
is a bad monitoring system.
- Jamie Riedesel, lots of times
I’ve seen things.
User Interface
To access a previous job’s monitoring system:
1. Open a browser.
2. Log in using 2-factor to our SSL-VPN.
3. Connect to RDP using same password as VPN.
4. Open another browser.
5. Hit Monitoring site.
6. Using non SSO-ed password, log in.
7. See what’s going on.
Policy Engine
This defines the behavior of each stage of the
stack.
Configured as part of the User Interface and
API.
Policy Engine +
Polling engine
● How often are things polled?
○ Every 10s, 1m, 2m, 5m, 1d?
● Does polling get paused for
maintenance-windows?
● What data gets reported to the
Aggregation Engine?
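The three questions above can be sketched as a small policy object. This is only an illustration: `PollPolicy`, its field names, and the use of epoch seconds are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PollPolicy:
    interval_seconds: int
    # (start, end) epoch-second pairs during which polling is paused.
    maintenance_windows: list = field(default_factory=list)
    # Metric names that get reported to the Aggregation Engine.
    forward_metrics: tuple = ()


def in_maintenance(policy, now):
    return any(start <= now < end for start, end in policy.maintenance_windows)


def is_due(policy, last_polled, now):
    """True when the target should be polled again."""
    if in_maintenance(policy, now):
        return False
    return now - last_polled >= policy.interval_seconds


def filter_for_aggregation(policy, reading):
    """Keep only the metrics the policy says to forward."""
    return {k: v for k, v in reading.items() if k in policy.forward_metrics}
```

Pausing polling during maintenance is one possible answer; another is to keep polling and suppress the alarms instead.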
Policy Engine +
Aggregation Engine
● How long do you keep data at all?
● How long do you keep full granularity
data?
● How long do you keep summarized data?
● Where do you keep full granularity data?
● Where do you keep summarized data?
● How do you summarize data?
○ Time? System? Location?
● Do maintenance windows affect any of
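One way to write down the "how long" answers is a tier table. The tiers and durations below are made-up values for illustration, and `resolution_for_age` is a hypothetical helper.

```python
DAY = 86_400

# (maximum age in seconds, resolution in seconds). All values are hypothetical.
RETENTION_TIERS = [
    (7 * DAY, 10),       # full granularity for a week
    (90 * DAY, 300),     # 5-minute summaries for a quarter
    (730 * DAY, 3600),   # hourly summaries for two years
]


def resolution_for_age(age_seconds, tiers=RETENTION_TIERS):
    """Return the resolution kept for data of this age, or None if it is expired."""
    for max_age, resolution in tiers:
        if age_seconds <= max_age:
            return resolution
    return None  # older than every tier: eligible for deletion
```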
Policy Engine +
Alerting Engine
● Which alarms merit bothering humans?
● Which alarms merit automatic fixing?
● Which alarms can be ignored?
● How do maintenance-windows impact
alarms?
● What escalation policies are in place?
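A rough sketch of how those answers might be encoded. The alarm names, the escalation chain, and the choice to page on unknown alarms are all assumptions for the example, not recommendations from the talk.

```python
from enum import Enum


class Action(Enum):
    PAGE = "page"
    AUTO_FIX = "auto_fix"
    IGNORE = "ignore"
    SUPPRESS = "suppress"


# Hypothetical policy table: alarm name -> what to do about it.
POLICY = {
    "site_unreachable": Action.PAGE,
    "log_disk_full": Action.AUTO_FIX,
    "single_fan_degraded": Action.IGNORE,
}

# Hypothetical escalation chain: (minutes unacknowledged, who is notified).
ESCALATION = [(0, "on-call engineer"), (15, "secondary on-call"), (30, "team lead")]


def decide(alarm_name, in_maintenance_window):
    # Assumption: alarms nobody has classified yet page a human by default.
    action = POLICY.get(alarm_name, Action.PAGE)
    if in_maintenance_window and action is Action.PAGE:
        return Action.SUPPRESS
    return action


def escalation_targets(minutes_unacknowledged):
    """Everyone who should have been notified by now."""
    return [who for after, who in ESCALATION if minutes_unacknowledged >= after]
```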
Policy Engine +
Reporting Engine
● Do reports get automatically generated?
● What reports are viewable on-demand?
● What reports are defined?
● Are ad-hoc reports possible?
● Who gets automatically generated
reports?
● What trends are we looking for?
That’s cleared up!
Deciding What To Monitor
PLANNING THE APPROACH
Different Kinds of
Monitoring
Granularity and goals differ
from type to type. Be aware of
these as you build your system.
Performance Monitoring
Operational Monitoring
Capacity Monitoring
SLA Monitoring
Performance Monitoring
Granularity: Very high (10s, 1s, or even sub-second)
Duration: As-needed
Response: Realtime
Tools: Procmon, Wireshark, strace, perf, Performance Monitor, gdb
Typically done as part of debugging, troubleshooting, and profiling activities. Granularity is much
higher than operational monitoring. Typically, results are reviewed in near realtime and not
persisted long.
Operational Monitoring
Granularity: Medium (1m, 2m, 5m, 10m, 1h, etc)
Duration: Continuous.
Response: Rapid.
Tools: Dell OpenManage, HP Operations Manager, Cisco OpManager, NetApp
What most people think of when you say monitoring (but they’re wrong). This type of monitoring
catches the health of your infrastructure and is not directly related to the services it provides.
Think disk replacements, switch failures, and tornados.
Operational Monitoring
This one is easy
OPERATIONAL MONITORING
1
The SLA for this is: our infrastructure can support the
delivery of our products and services.
● Switch failures.
● Disk failures.
● Blade-chassis failures.
● UPS failures.
● PSU / PDU failures.
● Compliance failures.
Capacity Monitoring
Granularity: Low (1h, 1d, 1w, 1mo)
Duration: Continual or occasional
Response: Slow
Tools: Grafana, Kibana, Graphite, Nagios, Excel
Monitoring the capacity of your system to do work. Lead times can be quite long for some
replacements (SAN arrays), and capacity can be budgetary more than hardware. Especially in
cloud contexts.
Capacity Monitoring
How much do I need,
And when do I need it?
CAPACITY MONITORING
2
Every product or service uses consumables. This is
where you track them:
● Disk-space
● Cloud budget
● Overtime allowance
● P1 incident usage
● SmartHands budget
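"When do I need it?" can be roughed out with a straight-line fit over past usage. This is a sketch assuming roughly linear growth; `days_until_full` is a hypothetical helper, and the same arithmetic works for disk space, cloud budget, or any other consumable counted in units.

```python
def days_until_full(history, capacity):
    """Estimate days left after the last sample, using a least-squares line.

    history is a list of (day_number, units_used) pairs in time order.
    Returns None when there is no usable upward trend.
    """
    n = len(history)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    if sxx == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in history) / sxx
    if slope <= 0:
        return None  # usage is flat or shrinking
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - history[-1][0])
```

Compare the result against the lead time for the replacement: if a SAN array takes 90 days to arrive, the report has to fire well before 90 days remain.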
Service Level Agreement Monitoring
Granularity: Medium to Low
Duration: Continual
Response: Rapid and Slow
Tools: Everything
Monitoring to detect whether or not you’re meeting your SLA for a given service or services.
Where most monitoring really exists.
SLA Monitoring
This one is complicated
SERVICE LEVEL AGREEMENT MONITORING
3
How your product or service is supposed to perform. Not
just executives care about SLAs.
SLA: Service Level Agreement
SLO: Service Level Objectives
SLI: Service Level Indicators
We’ll get into these.
What if we don’t have SLAs? That’s like…
commitment. We avoid that around here!
Yes, you have an SLA
No, really. You do.
The service is up when our users need it
to be.
And if it isn’t, they’re allowed to slag us
on Twitter.
DE FACTO SERVICE LEVEL AGREEMENT
In short, 100% uptime or your reputation will be hauled through the meat-grinder.
We promise X availability, on penalty of Y
things, outside of Q maintenance
periods. Planned outages will have no
less than Z days notice...
Less likely to end up as a meme on Twitter. This can be 100% an internal-only document!
DEFINED SERVICE LEVEL AGREEMENT
Service Level Agreement (SLA): An agreement, written in
Human; or sometimes Lawyer. Sets goalposts, defines penalties
(if any), defines terms.
Service Level Objective (SLO): A set of objectives, written in
Engineer. Technical definition of the goalposts in the SLA.
Service Level Indicator (SLI): Something that tells you whether
or not you’re meeting your SLO.
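The three definitions nest, which a small data model can make concrete. This is an illustrative sketch only; the class and method names are assumptions, and each indicator is reduced to a callable that returns True when healthy.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SLI:
    name: str
    check: Callable[[], bool]  # returns True when the indicator is healthy


@dataclass
class SLO:
    description: str
    indicators: List[SLI] = field(default_factory=list)

    def met(self) -> bool:
        # Note: an SLO with no indicators reports as met, which hides a gap.
        return all(sli.check() for sli in self.indicators)


@dataclass
class SLA:
    statement: str
    objectives: List[SLO] = field(default_factory=list)

    def failing_objectives(self) -> List[str]:
        return [slo.description for slo in self.objectives if not slo.met()]
```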
DEFINITIONS
SLA: The service is up 99.99% of the time, not including
scheduled maintenance.
● The settings page renders in under 10 seconds.
● The site returns HTTP-200 from Europe within 2 seconds.
● Branch-office ADC01 can reach the service.
● 98%-tile end to end request time is not more than 3
seconds.
● The SSL certificate is valid and chains to our CA.
● The text, “Welcome to Example Co,” is on the main page.
SLOs - SERVICE LEVEL OBJECTIVES
SLA: The site is up 99.99% of the time, not including scheduled
maintenance.
SLO:
● Site is reachable.
● The site is showing the right content.
● Scheduled maintenance is tracked.
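Tracking scheduled maintenance matters because it changes the arithmetic. The sketch below assumes outages and maintenance windows are recorded as (start, end) offsets in seconds, and that windows within each list do not overlap one another. Both function names are hypothetical.

```python
def downtime_budget_seconds(period_seconds, target=0.9999):
    """Seconds of unscheduled downtime allowed in the period at this target."""
    return period_seconds * (1 - target)


def availability(period_seconds, outages, maintenance):
    """Fraction of non-maintenance time the site was up."""

    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    maintenance_total = sum(end - start for start, end in maintenance)
    counted_downtime = 0
    for outage in outages:
        # Downtime inside a scheduled window does not count against the SLA.
        excused = sum(overlap(outage, window) for window in maintenance)
        counted_downtime += (outage[1] - outage[0]) - excused
    in_scope = period_seconds - maintenance_total
    if in_scope <= 0:
        raise ValueError("the whole period is scheduled maintenance")
    return (in_scope - counted_downtime) / in_scope
```

At 99.99% over a 30-day month the budget works out to roughly 259 seconds, a little over four minutes.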
SLOs - SERVICE LEVEL OBJECTIVES: HasDCDoneSomethingStupidToday.com
SLA: Printing is available in Computer Labs 99.99% of the time,
outside of scheduled closures and maintenance.
SLO:
● Every Computer Lab has at least one working printer with
paper.
● Printers service only the central print queues.
● The swipe-card terminal in Computer Labs must work for
the printers to be considered ‘working’.
● Printers do not work if they can’t talk to the payment
processor.
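These objectives compose into a single yes/no per lab. The sketch below is one reading of them: the class names and queue names are invented, and treating "services only the central print queues" as a condition for a printer counting as working is an interpretation, not something the slide states.

```python
from dataclasses import dataclass


@dataclass
class Printer:
    online: bool
    has_paper: bool
    queue: str  # name of the queue this printer services


@dataclass
class Lab:
    name: str
    printers: list
    swipe_terminal_ok: bool


CENTRAL_QUEUES = {"central-bw", "central-color"}  # hypothetical queue names


def printer_working(printer, lab, payment_processor_ok):
    return (
        printer.online
        and printer.has_paper
        and printer.queue in CENTRAL_QUEUES
        and lab.swipe_terminal_ok
        and payment_processor_ok
    )


def lab_meets_slo(lab, payment_processor_ok):
    """At least one printer in the lab counts as working."""
    return any(printer_working(p, lab, payment_processor_ok) for p in lab.printers)
```

Every term in `printer_working` is something that needs its own indicator, including things the print team does not own, such as the payment processor.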
SLOs - SERVICE LEVEL OBJECTIVES: University Print Services
SLO: The settings page renders in under 10 seconds.
SLI:
● Logins work.
● Page render-time from same data-center.
● Page render-time from Europe.
● Database disk-queue length.
SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!
SLO: 98%-tile end to end request time is not more than 3
seconds.
SLI:
● Time-to-process for all requests.
● Request processing was functional within the last 30 seconds.
● 10 minute 98th percentile request-time average.
● 10 minute 50th percentile request-time average.
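A sketch of how the 10-minute percentile indicators could be computed, assuming request times are fed in timestamp order. `RollingPercentile` and `slo_met` are hypothetical names, the percentile uses the nearest-rank method, and treating "no data" as a failure is an assumption.

```python
import math
from collections import deque


class RollingPercentile:
    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, request_seconds), oldest first

    def add(self, timestamp, request_seconds):
        self.samples.append((timestamp, request_seconds))
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_seconds:
            self.samples.popleft()

    def percentile(self, pct):
        if not self.samples:
            return None
        ordered = sorted(value for _, value in self.samples)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    def is_fresh(self, now, max_age_seconds=30):
        """True when the newest sample is recent enough to trust."""
        return bool(self.samples) and now - self.samples[-1][0] <= max_age_seconds


def slo_met(window, now, limit_seconds=3.0, pct=98):
    # Assumption: stale or missing data counts as a failure, not a pass.
    if not window.is_fresh(now):
        return False
    return window.percentile(pct) <= limit_seconds
```

Watching the 50th percentile alongside the 98th shows whether a slowdown is hitting everyone or only the tail.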
SLIs - SERVICE LEVEL INDICATORS: Specific monitorables!
Alarm: Informing humans of failing SLI/SLOs in realtime.
Report: Eventually informing humans of failing SLI/SLOs
Which humans do you bother for each SLI/SLO? Only you can
figure that out!
DEFINITIONS
Specific: Must tell me something specific is
wrong.
Alarms that require a human to log in to figure out what is
actually wrong, if anything is, are bad alarms.
FYI alarms lead to high cognitive load and decrease worker
satisfaction.
GOOD ALARMS
Actionable: Must be something I can directly fix
Getting alarmed for things you can’t fix is a great road to
burnout. These are especially great at 3:19 AM.
The failure mode is teaching people that some alarms can be
ignored safely. Eventually, they’ll ignore the wrong one. This is
bad.
GOOD ALARMS
Format Agnostic: Don’t be a dick about format
If a team wants full HTML with links to runbooks and wiki-pages,
let ’em.
If a team wants the entire alert to fit into their iPhone lock-screen,
let ’em.
Better, allow both!
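Allowing both can be as simple as rendering one alarm two ways. This is a sketch: the `Alarm` fields, the renderer names, and the 110-character limit for a lock-screen preview are assumptions chosen for the example.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Alarm:
    service: str
    problem: str      # the specific thing that is wrong
    action: str       # what the recipient can do about it
    runbook_url: str  # hypothetical link target


def render_html(alarm):
    """Full version with a runbook link, for teams that want detail."""
    return (
        f"<h2>{escape(alarm.service)}: {escape(alarm.problem)}</h2>"
        f"<p>{escape(alarm.action)}</p>"
        f'<p><a href="{escape(alarm.runbook_url, quote=True)}">Runbook</a></p>'
    )


def render_short(alarm, limit=110):
    """Single line meant to fit in a phone notification preview."""
    text = f"{alarm.service}: {alarm.problem}. {alarm.action}"
    return text if len(text) <= limit else text[: limit - 3] + "..."
```

Both renderers read the same `problem` and `action` fields, so the alarm stays specific and actionable whichever format a team picks.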
GOOD ALARMS
Specific.
Actionable.
In the format you want.
GOOD ALARMS
The Monitoring Project-Plan
MAKING THE ASCENT
Get Approval For The Project:
● If it’s just you, that’s easy! Do it.
● A good monitoring product is used by many people
○ Get buy-in from not just IT, but sales, support, etc.
● Pitch to the business-case, not process improvement for your department.
○ We will reduce customer churn by enabling our CSMs.
○ We will improve our reaction time to reputation-impacting events.
○ This will increase buy-in from other departments, enabling our IT
goals
0
PROJECT PLAN
Figure out high-level needs (SLA)
● If you have a written one? Great! Work backwards from that.
● If you have an unwritten one, ask people to see what they think it is.
○ Play 20-questions with higher level execs on impacts of down-time
and service degradations.
○ Point out the de facto SLA, see how they react.
○ Point out we don’t need to publish the SLA to our customers, but can
have one internally.
● If you have microservices, each service will need its own SLA.
1
PROJECT PLAN
Figure out concrete definitions (SLO)
● Now that you have an SLA, or many SLAs, do the analysis to determine
what ‘up’ and ‘responsive’ mean in a concrete way.
● Ask other people to get involved. Involvement keeps the project rolling.
● This is an opportunity for education with business leaders.
2
PROJECT PLAN
Figure out specific monitorables (SLI)
● Take your SLO list and figure out how to monitor for each.
● You may need to monitor new things.
● You may be able to stop monitoring/alarming some other things.
● Magic happens: your first opportunity to turn off existing alarms!
3
PROJECT PLAN
Figure out how to monitor those things
● Some of this may already exist. If so, cool.
● Some may need to be monitored in a different way.
● Some may need to be monitored for the first time.
● This defines how the Polling Engine works.
● Build new engines if you need to.
● Poll direct measurements where you can, try not to use proxy
measurements.
4
PROJECT PLAN
Polling Engine
Decide on your aggregation techniques
● Some of this may already exist. If so, cool.
● Perhaps you don’t need to keep data as long as you thought.
● Perhaps you need to keep high granularity data longer than you thought.
● Perhaps you need to start tracking things like percentiles and
standard deviations.
● This defines how the Aggregation Engine works.
5
PROJECT PLAN
Aggregation Engine
Alert Definition (Operational/SLA monitoring)
● Some of this may already exist. If so, cool.
● Figure out who needs to know what and how fast they need to know it.
● One person shop? Easy!
● Ops team of 80? There will be meetings.
○ Work with each group individually.
○ Be flexible with requirements in each.
○ Don’t force communications-format standards without good cause.
○ Ensure the alarms are specific and actionable.
6
PROJECT PLAN
Alerting Engine
Report Definition (Capacity/SLA monitoring)
● Some of this may already exist. If so, cool.
● Figure out how to write the pass/fail report for your SLAs.
● Determine what kind of response-times are needed to address SLA risks.
● Determine what kind of response-times are needed for capacity risks.
● Determine who gets what.
7
PROJECT PLAN
Reporting Engine
Periodic Review
● Run the system for a while.
● Come back 3 months, 6 months later and ask questions.
○ How are the alarms working for you?
○ What changes do you think need to be made?
○ What new things have shown up?
● Especially important for departments that haven’t been attached to a
monitoring system before.
8
PROJECT PLAN
Humans
Step 0: Get approval
Step 1: Figure out high level needs (Service Level Agreement)
Step 2: Turn that into concrete definitions (Service Level Objectives)
Step 3: Figure out specific monitorables (Service Level Indicators)
Step 4: Decide how to monitor it (Polling Engine)
Step 5: Determine aggregation requirements (Aggregation Engine)
Step 6: Define Alerts (Operational and SLA monitoring)
Step 7: Define Reports (Capacity and SLA monitoring)
Step 8: Periodic Review
Post-Incident Review Questions
1. Did the monitoring system see the problem?
2. Did we react to the monitoring system, or humans?
3. Is it worth our time to catch this problem in the monitoring system?
4. What changes do we need to make, including to alerts, to deal with this in
the future?
9
PROJECT MAINTENANCE
Questions?
STACK CLIMBING
That Conference 2017: Refactoring your Monitoring

IBM API Management BPM Systems EngageIBM API Management BPM Systems Engage
IBM API Management BPM Systems Engage
 
Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...Predicting Startup Market Trends based on the news and social media - Albert ...
Predicting Startup Market Trends based on the news and social media - Albert ...
 

Recently uploaded

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 

Recently uploaded (20)

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 

That Conference 2017: Refactoring your Monitoring

  • 1.
  • 2. Jamie Riedesel DevOps Engineer @sysadm1138 Route-Planning your Monitoring Stack Climb @sysadm1138ThatConference 2017
  • 3. Today’s Climb: Overview. Your monitoring stack; Deciding what to monitor; The monitoring project-plan; Extra: Humane on-call rotations
  • 4. Your Monitoring Stack: LEARNING THE TERRITORY
  • 5. This is your stack. Really. [Diagram: Polling Engine, Aggregation Engine, Alerting Engine, Reporting Engine, Policy Engine, API, User Interface, Humans]
  • 6. Scheduled-tasks & PowerShell: Scheduler runs scripts on a schedule. Scripts emit email or update a database.
  • 7. “Full-Stack”: SolarWinds. Zenoss. Nagios.
  • 8. Open-Source Medley: Nagios + Graphite + Grafana; Logstash + InfluxDB + Kibana + Bosun; Graylog + New Relic + Hash.io + DataDog; Nexosis + Go + OpenTSDB + Grafana
  • 9. Polling Engine: The whatever that fetches data. ● SNMP agents ● WMI endpoints ● Nagios agent ● SolarWinds agent ● PowerShell scripts ● Bash scripts ● Polling Engines in Nagios & SolarWinds
  • 10. Aggregation Engine: Turns raw data into useful data. ● Summarizes over time (think RRDTool) ● Does stats (min/max/%-tile) on incoming stream. ● Summarizes over system/rack/datacenter. No one (except possibly Google) keeps full granularity monitoring logs forever and ever in a trivially queryable way. Too expensive, and you don’t usually care about 2 years ago.
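As a rough illustration of what "summarizes over time" and "does stats (min/max/%-tile)" can mean in practice, here is a minimal Python sketch that rolls raw `(timestamp, value)` samples into fixed-width buckets. The function names, the 5-minute bucket width, and the nearest-rank percentile method are assumptions made for this example, not something the talk prescribes.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples, bucket_seconds=300, pct=95):
    """Roll (timestamp, value) samples up into fixed-width time buckets.

    Returns a dict keyed by bucket start time; each entry holds the
    min, max, average, chosen percentile, and sample count for that bucket.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "pct": percentile(values, pct),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical raw samples: seconds since some epoch, and a measured value.
raw = [(0, 10.0), (60, 20.0), (120, 30.0), (300, 5.0), (360, 15.0)]
summary = summarize(raw)
for start, stats in summary.items():
    print(start, stats)
```

Once the summaries exist, the full-granularity samples can be expired on whatever schedule the policy calls for, which is the cost trade-off the slide describes.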
  • 11. Alerting Engine: Bothering humans in realtime! ● May do analytics. ● May be threshold-based, or trigger on very sophisticated conditions. ● Scripts that send email every time. ● Scripts that drop notices in group-chat. ● Night-operator calling the Systems Engineers. ● PagerDuty.
  • 12. Reporting Engine: Bothering humans on a lag! ● Long-term trends ● Capacity analysis ● Growth tracking ● Full-bore big-data analytics ● SLA pass/fail reporting ● Track user behaviors across features ● BA building reports for executives
  • 13. API: Programmatic interfaces into your monitoring system. ● Build feedback systems ● Manage policy-engine details ● Could be your CM system. Good monitoring systems have APIs. It makes them easier to integrate with. And integration is usage.
  • 14. User Interface: How humans interface with it. “A monitoring system with a bad user-interface is a bad monitoring system.” - Jamie Riedesel, lots of times. I’ve seen things.
  • 15. User Interface: To access a previous job’s monitoring system: 1. Open a browser. 2. Log in using 2-factor to our SSL-VPN. 3. Connect to RDP using same password as VPN. 4. Open another browser. 5. Hit Monitoring site. 6. Using non SSO-ed password, log in. 7. See what’s going on.
  • 16. Policy Engine: This defines the behavior of each stage of the stack. Configured as part of the User Interface and API.
  • 17. Policy Engine + Polling Engine ● How often are things polled? ○ Every 10s, 1m, 2m, 5m, 1d? ● Does polling get paused for maintenance-windows? ● What data gets reported to the Aggregation Engine?
  • 18. Policy Engine + Aggregation Engine ● How long do you keep data at all? ● How long do you keep full granularity data? ● How long do you keep summarized data? ● Where do you keep full granularity data? ● Where do you keep summarized data? ● How do you summarize data? ○ Time? System? Location? ● Do maintenance windows affect any of
  • 19. Policy Engine + Alerting Engine ● Which alarms merit bothering humans? ● Which alarms merit automatic fixing? ● Which alarms can be ignored? ● How do maintenance-windows impact alarms? ● What escalation policies are in place?
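The alerting-policy questions above can be written down as data plus a small decision function. The sketch below is one possible shape, assuming hypothetical alarm names, a per-service maintenance window list, and an escalation chain expressed as `(minutes_unacknowledged, target)` pairs; none of these names come from the talk or from a particular product.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmPolicy:
    """Hypothetical policy record: which alarms page, auto-fix, or get ignored."""
    page_humans: set = field(default_factory=set)
    auto_fix: set = field(default_factory=set)
    # Maintenance windows as (start, end, service); times are plain numbers here.
    maintenance: list = field(default_factory=list)

    def in_maintenance(self, service, now):
        return any(start <= now < end and svc == service
                   for start, end, svc in self.maintenance)

    def decide(self, alarm_name, service, now):
        """Return one of 'suppress', 'auto_fix', 'page', or 'ignore'."""
        if self.in_maintenance(service, now):
            return "suppress"
        if alarm_name in self.auto_fix:
            return "auto_fix"
        if alarm_name in self.page_humans:
            return "page"
        return "ignore"


def escalation_target(chain, minutes_unacked):
    """Pick who to notify; `chain` is sorted by delay in minutes."""
    target = None
    for delay, who in chain:
        if minutes_unacked >= delay:
            target = who
    return target


# Illustrative values only.
policy = AlarmPolicy(
    page_humans={"site_unreachable"},
    auto_fix={"disk_tmp_full"},
    maintenance=[(100, 200, "print-service")],
)
chain = [(0, "on-call"), (15, "secondary"), (30, "manager")]

print(policy.decide("site_unreachable", "web", now=150))
print(policy.decide("site_unreachable", "print-service", now=150))
print(escalation_target(chain, 20))
```

Keeping the policy as data like this is what lets it be managed through the User Interface and API rather than edited inside scripts.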
  • 20. Policy Engine + Reporting Engine ● Do reports get automatically generated? ● What reports are viewable on-demand? ● What reports are defined? ● Are ad-hoc reports possible? ● Who gets automatically generated reports? ● What trends are we looking for?
  • 21. That’s cleared up!
  • 22. Deciding What To Monitor: PLANNING THE APPROACH
  • 23. Different Kinds of Monitoring: Granularity and goals differ from type to type. Be aware of these as you build your system. Types: Performance Monitoring, Operational Monitoring, Capacity Monitoring, SLA Monitoring.
  • 24. Performance Monitoring. Granularity: Very high (10s, 1s, or even sub-second). Duration: As-needed. Response: Realtime. Tools: Procmon, wireshark, strace, perf, Performance Monitor, gdb. Typically done as part of debugging, troubleshooting, and profiling activities. Granularity is much higher than operational monitoring. Typically, results are reviewed in near realtime and not persisted long.
  • 25. Operational Monitoring. Granularity: Medium (1m, 2m, 5m, 10m, 1h, etc). Duration: Continuous. Response: Rapid. Tools: Dell OpenManage, HP Operations Manager, Cisco OpManager, NetApp. What most people think of when you say monitoring (but they’re wrong). This type of monitoring catches the health of your infrastructure and is not directly related to the services it provides. Think disk replacements, switch failures, and tornados.
  • 26. OPERATIONAL MONITORING (1): This one is easy. The SLA for this is: our infrastructure can support the delivery of our products and services. ● Switch failures. ● Disk failures. ● Blade-chassis failures. ● UPS failures. ● PSU / PDU failures. ● Compliance failures.
  • 27. Capacity Monitoring. Granularity: Low (1h, 1d, 1w, 1m). Duration: Continual or occasional. Response: Slow. Tools: Grafana, Kibana, Graphite, Nagios, Excel. Monitoring the capacity of your system to do work. Lead times can be quite long for some replacements (SAN arrays), and capacity can be budgetary more than hardware. Especially in cloud contexts.
  • 28. CAPACITY MONITORING (2): How much do I need, and when do I need it? Every product or service uses consumables. This is where you track them: ● Disk-space ● Cloud budget ● Overtime allowance ● P1 incident usage ● SmartHands budget
  • 29. Service Level Agreement Monitoring. Granularity: Medium to Low. Duration: Continual. Response: Rapid and Slow. Tools: Everything. Monitoring to detect whether or not you’re meeting your SLA for a given service or services. Where most monitoring really exists.
  • 30. SERVICE LEVEL AGREEMENT MONITORING (3): This one is complicated. How your product or service is supposed to perform. Not just executives care about SLAs. SLA: Service Level Agreement. SLO: Service Level Objectives. SLI: Service Level Indicators. We’ll get into these.
  • 31. What if we don’t have SLAs? That’s like… commitment. We avoid that around here!
  • 32. What if we don’t have SLAs? That’s like… commitment. We avoid that around here! Yes, you have an SLA. No, really. You do.
  • 33. DE FACTO SERVICE LEVEL AGREEMENT: The service is up when our users need it to be. And if it isn’t, they’re allowed to slag us on Twitter.
  • 34. DE FACTO SERVICE LEVEL AGREEMENT: The service is up when our users need it to be. And if it isn’t, they’re allowed to slag us on Twitter. In short, 100% uptime or your reputation will be hauled through the meat-grinder.
  • 35. DEFINED SERVICE LEVEL AGREEMENT: We promise X availability, on penalty of Y things, outside of Q maintenance periods. Planned outages will have no less than Z days notice... Less likely to end up as a meme on Twitter. This can be 100% an internal-only document!
  • 36. DEFINITIONS. Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms. Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA. Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.
  • 37. SLOs - SERVICE LEVEL OBJECTIVES. SLA: The service is up 99.99% of the time, not including scheduled maintenance.
  • 38. SLOs - SERVICE LEVEL OBJECTIVES. SLA: The service is up 99.99% of the time, not including scheduled maintenance. ● The settings page renders in under 10 seconds. ● The site returns HTTP-200 from Europe within 2 seconds. ● Branch-office ADC01 can reach the service. ● 98%-tile end to end request time is not more than 3 seconds. ● The SSL certificate is valid and chains to our CA. ● The text, “Welcome to Example Co,” is on the main page.
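To make the 99.99% figure concrete, it helps to convert it into a downtime budget. The short Python sketch below does that arithmetic; the function name and the optional `maintenance_minutes` parameter are assumptions for this example. At 99.99%, the budget works out to roughly 4.3 minutes in a 30-day month and about 52.6 minutes in a 365-day year.

```python
def downtime_budget_minutes(availability_pct, period_days, maintenance_minutes=0.0):
    """Minutes of unplanned downtime allowed in a period at a given availability.

    Scheduled maintenance is subtracted from the period first, since the
    SLA wording excludes it.
    """
    period_minutes = period_days * 24 * 60 - maintenance_minutes
    return period_minutes * (1 - availability_pct / 100)


monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(f"30-day budget:  {monthly:.2f} minutes")
print(f"365-day budget: {yearly:.2f} minutes")
```

A budget this small is a useful prompt when writing SLOs: a single slow page-out at 3 AM can consume a month of it.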
  • 39. SLOs - SERVICE LEVEL OBJECTIVES: HasDCDoneSomethingStupidToday.com. SLA: The site is up 99.99% of the time, not including scheduled maintenance. SLO: ● Site is reachable. ● The site is showing the right content. ● Scheduled maintenance is tracked.
  • 40. SLOs - SERVICE LEVEL OBJECTIVES: University Print Services. SLA: Printing is available in Computer Labs 99.99% of the time, outside of scheduled closures and maintenance. SLO: ● Every Computer Lab has at least one working printer with paper. ● Printers service only the central print queues. ● The swipe-card terminal in Computer Labs must work for the printers to be considered ‘working’. ● Printers do not work if they can’t talk to the payment processor.
  • 41. SLIs - SERVICE LEVEL INDICATORS: Specific monitorables! SLO: The settings page renders in under 10 seconds. SLI: ● Logins work. ● Page render-time from same data-center. ● Page render-time from Europe. ● Database disk-queue length.
  • 42. SLIs - SERVICE LEVEL INDICATORS: Specific monitorables! SLO: 98%-tile end to end request time is not more than 3 seconds. SLI: ● Time-to-process for all requests. ● Request processing has been functional within the last 30 seconds. ● 10 minute 98th percentile request-time average. ● 10 minute 50th percentile request-time average.
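One way to turn the request-time SLIs above into something checkable is a rolling window over recent request durations. The following Python sketch is an assumption-laden illustration: the class and function names are invented, samples are assumed to arrive in timestamp order, and the percentile uses the nearest-rank method. It also treats "functional within the last 30 seconds" as "at least one request was processed in the last 30 seconds", which is an interpretation of the slide.

```python
import math
from collections import deque


class RollingPercentile:
    """Keeps request durations from the last `window_seconds` (default 10 minutes)."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, duration_seconds), oldest first

    def add(self, timestamp, duration):
        self.samples.append((timestamp, duration))
        # Drop anything that has aged out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_seconds:
            self.samples.popleft()

    def percentile(self, pct):
        if not self.samples:
            return None
        ordered = sorted(duration for _, duration in self.samples)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    def last_seen(self):
        return self.samples[-1][0] if self.samples else None


def slo_met(window, now, limit_seconds=3.0, max_silence_seconds=30):
    """True when requests are still flowing and the 98th percentile is within the limit."""
    p98 = window.percentile(98)
    last = window.last_seen()
    fresh = last is not None and now - last <= max_silence_seconds
    return fresh and p98 is not None and p98 <= limit_seconds


# Hypothetical traffic: 100 fast requests and a single slow outlier.
window = RollingPercentile()
for second in range(100):
    window.add(second, 0.5)
window.add(100, 5.0)

print(window.percentile(98), window.percentile(50))
print(slo_met(window, now=100))
```

Tracking the 50th percentile alongside the 98th, as the slide suggests, helps tell "a few slow outliers" apart from "everything got slower".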
  • 43. DEFINITIONS. Service Level Agreement (SLA): An agreement, written in Human; or sometimes Lawyer. Sets goalposts, defines penalties (if any), defines terms. Service Level Objective (SLO): A set of objectives, written in Engineer. Technical definition of the goalposts in the SLA. Service Level Indicator (SLI): Something that tells you whether or not you’re meeting your SLO.
  • 44. DEFINITIONS. Alarm: Informing humans of failing SLI/SLOs in realtime. Report: Eventually informing humans of failing SLI/SLOs. Which humans do you bother for each SLI/SLO? Only you can figure that out!
  • 45. GOOD ALARMS. Specific: Must tell me something specific is wrong. Alarms that require a human to log in to figure out what is actually wrong, if anything is, are bad alarms. FYI alarms lead to high cognitive load and decrease worker satisfaction.
  • 46. GOOD ALARMS. Actionable: Must be something I can directly fix. Getting alarmed for things you can’t fix is a great road to burnout. These are especially great at 3:19 AM. The failure mode is teaching people that some alarms can be ignored safely. Eventually, they’ll ignore the wrong one. This is bad.
  • 47. GOOD ALARMS. Format Agnostic: Don’t be a dick about format. If a team wants full HTML with links to runbooks and wiki-pages, let ’em. If a team wants the entire alert to fit into their iPhone lock-screen, let ’em. Better, allow both!
  • 48. GOOD ALARMS. Specific. Actionable. In the format you want.
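"Allow both" formats is easiest when the alarm is stored as structured data and rendered per team at send time. The Python sketch below shows one possible arrangement; the `Alert` fields, the renderer names, the 120-character limit, and the example URL are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Alert:
    service: str
    problem: str      # specific: what exactly is wrong
    action: str       # actionable: what the responder can do about it
    runbook_url: str  # hypothetical link for the long-form rendering


def render_short(alert, limit=120):
    """One line sized to fit on a phone lock-screen (limit is an assumption)."""
    text = f"{alert.service}: {alert.problem}. {alert.action}"
    return text if len(text) <= limit else text[: limit - 3] + "..."


def render_html(alert):
    """Full HTML with a runbook link, for teams that want the detail."""
    return (
        f"<h3>{escape(alert.service)}: {escape(alert.problem)}</h3>"
        f"<p>{escape(alert.action)}</p>"
        f'<p><a href="{escape(alert.runbook_url)}">Runbook</a></p>'
    )


RENDERERS = {"short": render_short, "html": render_html}


def render(alert, fmt):
    """Render the same alert in whichever format the receiving team chose."""
    return RENDERERS[fmt](alert)


# Illustrative alert; every value here is made up.
alert = Alert(
    service="settings-page",
    problem="render time 14s from Europe (limit 10s)",
    action="Check CDN origin health, then restart the render workers",
    runbook_url="https://wiki.example.com/runbooks/settings-page",
)
print(render(alert, "short"))
print(render(alert, "html"))
```

Because the `problem` and `action` fields are required, an alarm that cannot fill them in is a hint that it is not yet specific or actionable.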
  • 49. The Monitoring Project-Plan: MAKING THE ASCENT
  • 50. PROJECT PLAN, Step 0: Get Approval For The Project ● If it’s just you, that’s easy! Do it. ● A good monitoring product is used by many people. ○ Get buy-in from not just IT, but sales, support, etc. ● Pitch to the business-case, not process improvement for your department. ○ We will reduce customer churn by enabling our CSMs. ○ We will improve our reaction time to reputation-impacting events. ○ This will increase buy-in from other departments, enabling our IT goals.
  • 51. PROJECT PLAN, Step 1: Figure out high-level needs (SLA) ● If you have a written one? Great! Work backwards from that. ● If you have an unwritten one, ask people to see what they think it is. ○ Play 20-questions with higher level execs on impacts of down-time and service degradations. ○ Point out the de facto SLA, see how they react. ○ Point out we don’t need to publish the SLA to our customers, but can have one internally. ● If you have microservices, each service will need its own SLA.
  • 52. PROJECT PLAN, Step 2: Figure out concrete definitions (SLO) ● Now that you have an SLA, or many SLAs, do the analysis to determine what ‘up’ and ‘responsive’ mean in a concrete way. ● Ask other people to get involved. Involvement keeps the project rolling. ● This is an opportunity for education with business leaders.
  • 53. PROJECT PLAN, Step 3: Figure out specific monitorables (SLI) ● Take your SLO list and figure out how to monitor for each. ● You may need to monitor new things. ● You may be able to stop monitoring/alarming some other things. ● Magic happens: your first opportunity to turn off existing alarms!
  • 54. PROJECT PLAN, Step 4: Figure out how to monitor those things ● Some of this may already exist. If so, cool. ● Some may need to be monitored in a different way. ● Some may need to be monitored for the first time. ● This defines how the Polling Engine works. ● Build new engines if you need to. ● Poll direct measurements where you can, try not to use proxy measurements.
  • 55. PROJECT PLAN, Step 5: Decide on your aggregation techniques ● Some of this may already exist. If so, cool. ● Perhaps you don’t need to keep data as long as you thought. ● Perhaps you need to keep high granularity data longer than you thought. ● Perhaps you need to start tracking things like percentiles and standard-deviations. ● This defines how the Aggregation Engine works.
  • 56. PROJECT PLAN, Step 6: Alert Definition (Operational and SLA monitoring) ● Some of this may already exist. If so, cool. ● Figure out who needs to know what and how fast they need to know it. ● One person shop? Easy! ● Ops team of 80? There will be meetings. ○ Work with each group individually. ○ Be flexible with requirements in each. ○ Don’t force communications-format standards without good cause. ○ Ensure the alarms are specific and actionable.
  • 57. PROJECT PLAN, Step 7: Report Definition (Capacity and SLA monitoring) ● Some of this may already exist. If so, cool. ● Figure out how to write the pass/fail report for your SLAs. ● Determine what kind of response-times are needed to address SLA risks. ● Determine what kind of response-times are needed for capacity risks. ● Determine who gets what.
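A pass/fail SLA report can be a small calculation over recorded outages and maintenance windows. The Python sketch below is one way to do it under stated assumptions: all times are minutes since the start of the reporting period, maintenance windows do not overlap each other, outage time that falls inside a maintenance window is excused, and the function and field names are invented for this example.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals, or 0 if they do not meet."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def sla_report(period_start, period_end, outages, maintenance, target_pct=99.99):
    """Compute availability for a period, excusing scheduled maintenance.

    `outages` and `maintenance` are lists of (start, end) pairs in the same
    units as the period bounds (minutes in the example below).
    """
    maintenance_minutes = sum(
        overlap(period_start, period_end, start, end) for start, end in maintenance
    )
    in_scope = (period_end - period_start) - maintenance_minutes

    unexcused = 0
    for o_start, o_end in outages:
        # Clip the outage to the reporting period, then remove excused time.
        clipped_start = max(o_start, period_start)
        clipped_end = min(o_end, period_end)
        total = max(0, clipped_end - clipped_start)
        excused = sum(
            overlap(clipped_start, clipped_end, start, end)
            for start, end in maintenance
        )
        unexcused += max(0, total - excused)

    availability = 100 * (in_scope - unexcused) / in_scope
    return {
        "in_scope_minutes": in_scope,
        "unexcused_downtime_minutes": unexcused,
        "availability_pct": availability,
        "passed": availability >= target_pct,
    }


# Illustrative 30-day period: one 3-minute outage, plus one 30-minute outage
# that falls entirely inside a 60-minute maintenance window.
report = sla_report(
    period_start=0,
    period_end=30 * 24 * 60,
    outages=[(1000, 1003), (5000, 5030)],
    maintenance=[(5000, 5060)],
)
print(report)
```

The same numbers feed the slower response-time questions on the slide: how quickly someone needs to see that the remaining budget is running out.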
  • 58. PROJECT PLAN, Step 8: Periodic Review ● Run the system for a while. ● Come back 3 months, 6 months later and ask questions. ○ How are the alarms working for you? ○ What changes do you think need to be made? ○ What new things have shown up? ● Especially important for departments that haven’t been attached to a monitoring system before.
  • 59. Step 0: Get approval. Step 1: Figure out high level needs (Service Level Agreement). Step 2: Turn that into concrete definitions (Service Level Objectives). Step 3: Figure out specific monitorables (Service Level Indicators). Step 4: Decide how to monitor it (Polling Engine). Step 5: Determine aggregation requirements (Aggregation Engine). Step 6: Define Alerts (Operational and SLA monitoring). Step 7: Define Reports (Capacity and SLA monitoring). Step 8: Periodic Review.
  • 60.
  • 61. PROJECT MAINTENANCE, Step 9: Post-Incident Review Questions 1. Did the monitoring system see the problem? 2. Did we react to the monitoring system, or humans? 3. Is it worth our time to catch this problem in the monitoring system? 4. What changes do we need to make, including to alerts, to deal with this in the future?