Just Enough WebOps 
for Developers 
Alexis Lê-Quôc @alq 
http://www.datadoghq.com
@alq 
Co-founder 
DATADOG
Datadog is Monitoring that does not 
suck... as a Service
“Metrics 
made 
social”
People-friendly 
Monitoring
Developer-friendly 
Monitoring
Dev: 930,000
Ops: 350,000
2010 US figures from BLS
The New 
Development 
Equation
The New Development Equation 
Code + + AWS = 
3 months 5 minutes
Web 
Operations?
Cargo cult 
Operations
Common vocabulary 
between Dev & WebOps?
Users 
SysAdmin
“Come and get it” 
“We want root!”
Dev 
WebOps
WebOps 
and this is what I do
But first an 
important 
digression
Product Service
Service = Code + Infrastructure
Service = Product + Access
Provide access
reliable, fast, cheap
24x7
without going crazy
24x7 && !crazy
Development 
Models
Delivery historically 
not the focus
Agile Cycle Delivery
Agile Cycle WebOps Cycle
WebOps 
and this is what I do
WebOps Cycle
Dev Release -> Measure & Log -> Monitor -> Alert -> Fix || Escalate -> Investigate -> Change -> (Release)
Measure
Purpose
Collect quantitative metrics
Process
Instrument servers
Instrument code
Instrument SaaS deps
Automate collection
Risks
Imprecise metric definition
Manual collection
“What does it mean?”
Tools
System (ganglia, collectd, munin, nagios, etc.)
Code (metrics, statsd)
SaaS (Datadog et al.)
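"Instrument code" can be as small as one line per event. A minimal sketch, assuming a StatsD-style collector listening on UDP port 8125 on the local host; the metric names and helper functions are hypothetical, not part of any particular client library:

```python
import socket


def statsd_line(name: str, value: float, kind: str) -> str:
    """Format one metric in the plain-text StatsD style: <name>:<value>|<type>."""
    if kind not in {"c", "g", "ms"}:  # counter, gauge, timer (milliseconds)
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric over UDP; the app does not wait for the collector to reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, each with an explicit unit and type.
request_count = statsd_line("web.requests", 1, "c")
render_time = statsd_line("web.render_time", 12.5, "ms")
```

Choosing the name, unit and type once, in code, is one way to avoid the "What does it mean?" risk listed above.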
Log
Purpose
Collect meaningful, timestamped events
Process
All the time
In one place
Access for everyone
Discipline
Risks
TiB of garbage
Non-uniform timestamps
Non-uniform formats
Tools
log4j et al.
syslog et al.
logstash, splunk
+ Logging-as-a-Service
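The two format risks above (non-uniform timestamps, non-uniform formats) can be addressed at the source. A minimal sketch using Python's standard logging module, assuming one JSON object per line with a UTC ISO 8601 timestamp; the field names and the `make_logger` helper are illustrative choices:

```python
import json
import logging
from datetime import datetime, timezone


class JsonUtcFormatter(logging.Formatter):
    """Render every record as one JSON object with a UTC timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "ts": ts.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


def make_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger whose only handler writes uniform JSON lines."""
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonUtcFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage: `make_logger("checkout").info("order %s paid", 42)` writes a single line that a central collector can parse without per-application rules.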
Monitor
Purpose
Watch actionable events & metrics
Process
Health of the app?
Which metrics for health?
Compute metrics
Metric domain
Access for everyone
Pretty graphs
Risks
Non-actionable metrics
Tools
graphite, cubism et al.
+ services
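"Compute metrics" usually means turning raw measurements into a few numbers that say whether the app is healthy. A minimal sketch, assuming error rate and a latency percentile are the chosen health metrics; both the choice of metrics and the function names are assumptions for illustration:

```python
import math


def error_rate(requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

A metric is actionable when someone knows what to do once it crosses a line; a percentile of user-facing latency tends to qualify more often than, say, raw CPU usage.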
Alert
Purpose
Bring a human into the loop when an automated fix does not work
Process
Alert on vital monitors
Add new alerts with new monitors
Compute metrics from alerts
Ruthlessly edit
Risks
Too many alerts
Become desensitized
Ignore alerts
App crashes for real
Pendulum swings back
Tools
nagios
+ services
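One way to keep alert volume down is to page only on sustained breaches, and to count which alerts fire most so they can be edited. A minimal sketch; the `ThresholdAlert` class, the three-in-a-row rule and the `noisiest` helper are illustrative assumptions, not a feature of nagios or any other tool:

```python
from collections import Counter


class ThresholdAlert:
    """Fire once when a value stays above the threshold for N checks in a row."""

    def __init__(self, name: str, threshold: float, consecutive: int = 3):
        self.name = name
        self.threshold = threshold
        self.consecutive = consecutive
        self._streak = 0

    def observe(self, value: float) -> bool:
        """Return True only on the check where the streak first reaches N."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak == self.consecutive


def noisiest(fired: list[str], top: int = 3) -> list[tuple[str, int]]:
    """Metrics from alerts: which alert names fired most often."""
    return Counter(fired).most_common(top)
```

Alerts at the top of the `noisiest` list are the first candidates for "ruthlessly edit": fix the cause, retune the threshold, or delete the alert.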
Fix || Escalate
Purpose
Fix issue or find someone who can
Process
(fix) capture actions as soon as possible (while fixing or shortly after)
(fix) runbooks
(fix) automate fixes
(escalation) on-call rotation
(escalation) agree on rules
Risks
Burnout
Tools
PagerDuty
Bug tracker
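The escalation items above can be written down as rules that everyone agrees on ahead of time. A minimal sketch, assuming weekly shifts and a 15-minute escalation step; the names, durations and function names are hypothetical and do not reflect how PagerDuty is implemented:

```python
from datetime import date


def on_call(rotation: list[str], start: date, today: date, shift_days: int = 7) -> str:
    """Who holds the pager today, given a fixed rotation that began on `start`."""
    shifts_elapsed = (today - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


def escalation_target(chain: list[str], minutes_unacked: int, step_minutes: int = 15) -> str:
    """Move one level up the chain for every step an alert stays unacknowledged."""
    level = min(minutes_unacked // step_minutes, len(chain) - 1)
    return chain[level]
```

A rotation that is computed rather than negotiated each week spreads the load evenly, which is one guard against the burnout risk.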
Investigate
Purpose
Collect evidence
Reconstruct what happened
Process
Start where/when the problem was first detected
Work your way from there
Capture relevant graphs/logs
Risks
Missing the starting point
Lagging events/metrics
Low-level events/metrics
Blame game
Tools
Post-mortems
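Reconstructing what happened often starts with one merged timeline around the moment of detection. A minimal sketch, assuming events from different sources already carry comparable timestamps; the `Event` type and the window sizes are illustrative:

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str  # e.g. "deploy", "alert", "log"
    text: str


def timeline(
    events: list[Event],
    detected_at: datetime,
    before: timedelta = timedelta(minutes=30),
    after: timedelta = timedelta(minutes=10),
) -> list[Event]:
    """Keep events inside a window around the detection time, oldest first."""
    window = [e for e in events if detected_at - before <= e.at <= detected_at + after]
    return sorted(window, key=lambda e: e.at)
```

The window starts before the detection time on purpose: lagging metrics mean the first symptom is rarely the first cause.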
Change 
Purpose 
Fewer alerts 
Better service 
Process 
Change infrastructure, code 
Infrastructure == code 
Add/Edit monitors & alerts 
Risks 
Ad-hoc changes
Tools 
...
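If infrastructure is code, monitors and alerts can be too: defined in a file, reviewed, and checked before release instead of edited ad hoc. A minimal sketch, assuming a hypothetical `MonitorDef` record and two example checks (unique names, a runbook for every monitor); none of this is tied to a specific tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorDef:
    name: str
    metric: str
    threshold: float
    runbook: str  # path or URL of the runbook to follow when this fires


def validate(monitors: list[MonitorDef]) -> list[str]:
    """Return a list of problems; an empty list means the definitions pass."""
    problems = []
    seen = set()
    for m in monitors:
        if m.name in seen:
            problems.append(f"duplicate monitor: {m.name}")
        seen.add(m.name)
        if not m.runbook:
            problems.append(f"{m.name}: missing runbook")
    return problems
```

Running a check like this in the same pipeline as the application's tests makes a monitor change go through the same review as a code change.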
WebOps 
and this is what I do
Questions?
Comments?
@alq
