Just enough web ops for web developers

Just Enough WebOps
for Developers
Alexis Lê-Quôc @alq
http://www.datadoghq.com

Datadog is Monitoring that does not
suck... as a Service

“Metrics made social”

Developer-friendly
Monitoring

Dev: 930,000
Ops: 350,000
2010 US figures from BLS

The New
Development
Equation

Code + + AWS =
3 months 5 minutes
Web
Operations?

Common vocabulary
between Dev & WebOps?

“Come and get it”
“We want root!”

But first an
important
digression

Service = Code + Infrastructure

Provide access
reliable, fast, cheap
24x7
without going crazy

Delivery historically
not the focus

Agile Cycle
WebOps Cycle
Delivery

WebOps Cycle
[Diagram: Dev → Release, then the loop Measure & Log → Monitor → Alert → Fix || Escalate → Investigate → Change]

Measure
Purpose
Collect quantitative metrics
Process
Instrument servers
Instrument code
Instrument SaaS deps
Automate collection
Risks
Imprecise metric definition
Manual collection
“What does it mean?”
Tools
System (ganglia, collectd, munin, nagios, etc.)
Code (metrics, statsd)
SaaS (Datadog et al.)
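
To make "Instrument code" concrete, here is a minimal sketch that emits counters and timers in the StatsD line format (`<name>:<value>|<type>`). The daemon address, the metric names, and the helper names (`format_metric`, `send_metric`, `count`, `timed`) are assumptions for illustration, not something from the talk. Putting the unit in the metric name is one way to pre-empt the “What does it mean?” risk.

```python
import socket
import time
from contextlib import contextmanager

# Assumption: a StatsD-compatible daemon listens on this UDP address.
STATSD_ADDR = ("127.0.0.1", 8125)


def format_metric(name, value, kind):
    """Render one metric in the StatsD line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}"


def send_metric(line, addr=STATSD_ADDR):
    """Fire-and-forget UDP send; instrumentation should never break the app."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(line.encode("ascii"), addr)
    except OSError:
        pass  # dropping a metric is preferable to failing a request


def count(name, n=1):
    """Increment a counter (StatsD type "c")."""
    send_metric(format_metric(name, n, "c"))


@contextmanager
def timed(name):
    """Time the enclosed block and report it in milliseconds (type "ms")."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send_metric(format_metric(name, elapsed_ms, "ms"))
```

Usage would look like `count("signup.completed")` or `with timed("checkout.duration_ms"): ...` around the code path being measured.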

Log
Purpose
Collect meaningful, timestamped events
Process
All the time
In one place
Access for everyone
Discipline
Risks
TiB of garbage
Non-uniform timestamps
Non-uniform formats
Tools
log4j et al.
syslog et al.
logstash, splunk
+ Logging-as-a-Service
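
One way to avoid the "non-uniform timestamps" and "non-uniform formats" risks is to force every logger through a single formatter. The sketch below uses Python's standard `logging` module to write one JSON object per line with a UTC timestamp. The class name `JsonUtcFormatter`, the helper `make_logger`, and the field names are assumptions chosen for illustration.

```python
import json
import logging
import time


class JsonUtcFormatter(logging.Formatter):
    """Format every record as one JSON object with a UTC ISO 8601 timestamp."""

    converter = time.gmtime  # UTC, regardless of the server's time zone

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S") + "Z",
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


def make_logger(name, stream=None):
    """Return a logger that writes uniform JSON lines (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonUtcFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid a second, differently formatted copy
    return logger
```

A call such as `make_logger("checkout").info("order %s paid", order_id)` then produces lines that a central collector can parse without per-application rules.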

Monitor
Purpose
Watch actionable events & metrics
Process
Health of the app?
Which metrics for health?
Compute metrics
Metric domain
Access for everyone
Pretty graphs
Risks
Non-actionable metrics
Tools
graphite, cubism et al.
+ services
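
"Compute metrics" and "Metric domain" can be read as: derive a health metric from raw counts, then state the range in which it counts as healthy. A minimal sketch under that reading follows. The `Monitor` class, the `error_rate` helper, and the 2% threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    low: float   # lower bound of the healthy domain (inclusive)
    high: float  # upper bound of the healthy domain (inclusive)

    def is_healthy(self, value):
        """True when the value falls inside the healthy domain."""
        return self.low <= value <= self.high


def error_rate(errors, requests):
    """Derived metric: fraction of failed requests (0.0 when idle)."""
    return errors / requests if requests else 0.0


# Hypothetical monitor: healthy while at most 2% of requests fail.
checkout_errors = Monitor("checkout.error_rate", low=0.0, high=0.02)
```

A monitor with an explicit domain is actionable by construction: leaving the domain means someone has something to do.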

Alert
Purpose
Bring human in the loop when automated fix does not work
Process
Alert on vital monitors
Add new alerts with new monitors
Compute metrics from alerts
Ruthlessly edit
Risks
Too many alerts
Become desensitized
Ignore alerts
App crashes for real
Pendulum swings back
Tools
nagios
+ services
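
One possible guard against "too many alerts" is to page only after several consecutive unhealthy checks, and to count alerts per monitor so the noisy ones can be edited. This is a sketch of that idea, not a description of how nagios or any service behaves. `SustainedAlert`, `record_alert`, and the default of three checks are assumptions.

```python
from collections import Counter, deque


class SustainedAlert:
    """Fire only after `needed` consecutive unhealthy checks."""

    def __init__(self, monitor_name, needed=3):
        self.monitor_name = monitor_name
        self.recent = deque(maxlen=needed)  # most recent check results
        self.firing = False

    def observe(self, healthy):
        """Record one check; return True only when the alert starts firing."""
        self.recent.append(healthy)
        window_full = len(self.recent) == self.recent.maxlen
        breached = window_full and not any(self.recent)
        started = breached and not self.firing
        self.firing = breached
        return started


# Metrics about alerts themselves, used to decide which alerts to edit.
alert_counts = Counter()


def record_alert(monitor_name):
    alert_counts[monitor_name] += 1
```

`alert_counts.most_common()` then lists the monitors that page most often, which is a reasonable starting point for the "ruthlessly edit" step.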

Fix || Escalate
Purpose
Fix issue or find someone who can
Process
(fix) capture actions as soon as possible (while or shortly after)
(fix) runbooks
(fix) automate fixes
(escalation) on-call rotation
(escalation) agree on rules
Risks
Burn out
Tools
PagerDuty
Bug tracker
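
The escalation items can be written down as data plus two small functions, which makes the agreed rules explicit and reviewable. The sketch below assumes a weekly rotation and a 15-minute acknowledgement window. The names, dates, and timeout are hypothetical, and this is not how PagerDuty is configured.

```python
from datetime import date, timedelta

# Hypothetical team and rules; the real ones are agreed on by everyone on call.
ROTATION = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday; each shift lasts one week
ESCALATE_AFTER = timedelta(minutes=15)


def on_call(day, rotation=ROTATION, start=ROTATION_START):
    """Primary on-call person for the week containing `day`."""
    weeks = (day - start).days // 7
    return rotation[weeks % len(rotation)]


def who_to_page(day, unacknowledged_for):
    """Page the primary first, then the next person once the alert
    has gone unacknowledged for ESCALATE_AFTER or longer."""
    primary = on_call(day)
    if unacknowledged_for < ESCALATE_AFTER:
        return primary
    return ROTATION[(ROTATION.index(primary) + 1) % len(ROTATION)]
```

Spreading shifts evenly and bounding how long one person holds an unacknowledged page are two simple levers against the burn-out risk.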

Investigate
Purpose
Collect evidence
Reconstruct what happened
Process
Start where/when problem 1st detected
Work your way from there
Capture relevant graphs/logs
Risks
Missing the starting point
Lagging events/metrics
Low-level events/metrics
Blame game
Tools
Post-mortems
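
"Start where/when problem 1st detected, work your way from there" can be sketched as building a timeline: gather events from every source in a window around the detection time and sort them. The `timeline` function, the tuple layout, and the window sizes below are assumptions for illustration.

```python
from datetime import datetime, timedelta


def timeline(events, detected_at,
             before=timedelta(minutes=30), after=timedelta(minutes=10)):
    """Return the events near the detection time, oldest first.

    `events` is an iterable of (timestamp, source, message) tuples
    gathered from logs, deploy records, alerts, and so on.
    """
    start, end = detected_at - before, detected_at + after
    window = [e for e in events if start <= e[0] <= end]
    return sorted(window, key=lambda e: e[0])
```

This only works if timestamps from different sources are comparable, which is why uniform timestamps matter at logging time.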

Change
Purpose
Fewer alerts
Better service
Process
Change infrastructure, code
Infrastructure == code
Add/Edit monitors & alerts
Risks
Ad-hoc changes
Tools
...
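
"Infrastructure == code" extends naturally to monitors and alerts: keep their definitions in version control and compute the difference against what is deployed, instead of editing by hand. A minimal sketch, assuming monitors are plain dictionaries keyed by name. `DESIRED`, `plan_changes`, and the field names are hypothetical.

```python
# Hypothetical monitor definitions, kept in version control next to the code.
DESIRED = {
    "checkout.error_rate": {"max": 0.02, "alert": True},
    "signup.latency_ms": {"max": 500, "alert": False},
}


def plan_changes(current, desired):
    """Compare deployed monitors with the versioned definitions."""
    return {
        "add": sorted(desired.keys() - current.keys()),
        "remove": sorted(current.keys() - desired.keys()),
        "update": sorted(
            name for name in desired.keys() & current.keys()
            if desired[name] != current[name]
        ),
    }
```

Reviewing the resulting plan before applying it is one way to keep changes from becoming ad hoc.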

Questions?
Comments?
@alq
