Datadog is monitoring that does not suck. It's metrics-friendly, people-friendly, and developer-friendly monitoring.
Learn more at https://www.datadoghq.com/
48. Log
Purpose
Collect meaningful, timestamped events
Process
All the time
In one place
Access for everyone
Discipline
Risks
TiB of garbage
Non-uniform timestamps
Non-uniform formats
Tools
log4j et al.
syslog et al.
logstash, splunk
+ Logging-as-a-Service
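
A minimal sketch of one way to avoid the non-uniform timestamp and format risks listed on this slide, using Python's standard `logging` module. The `JsonUtcFormatter` class, the `make_logger` helper, and the field names (`ts`, `level`, `logger`, `msg`) are illustrative assumptions, not something the slides prescribe.

```python
import json
import logging
from datetime import datetime, timezone


class JsonUtcFormatter(logging.Formatter):
    """Render every record as one JSON object with a UTC ISO-8601 timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "ts": ts.isoformat(),          # same timestamp format everywhere
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),    # message with its arguments applied
        })


def make_logger(name: str, stream=None) -> logging.Logger:
    """Hypothetical helper: a logger whose only handler emits uniform JSON lines."""
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(JsonUtcFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

Calling `make_logger("checkout").info("order %s paid", 42)` would emit one JSON line per event, which a central collector can parse without per-application rules.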
50. Monitor
Purpose
Watch actionable events & metrics
Process
Health of the app?
Which metrics for health?
Compute metrics
Metric domain
Access for everyone
Pretty graphs
Risks
Non-actionable metrics
Tools
graphite, cubism et al.
+ services
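
As an illustration of the "Compute metrics" step, here is a rough sketch that turns timestamped request events into one candidate health metric, the server error rate over a recent window. The `Request` type, the `error_rate` function, the 60-second window, and the choice of "status >= 500" as the error definition are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    ts: float    # event time, seconds since the epoch
    status: int  # HTTP status code


def error_rate(requests, now, window_s=60.0):
    """Fraction of requests in the window (now - window_s, now] that returned 5xx."""
    recent = [r for r in requests if now - window_s < r.ts <= now]
    if not recent:
        return 0.0  # no traffic in the window: report 0 rather than divide by zero
    errors = sum(1 for r in recent if r.status >= 500)
    return errors / len(recent)
```

A metric like this is actionable only if someone has agreed what value of it means "unhealthy"; the function computes the number and leaves that threshold to the team.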
52. Alert
Purpose
Bring a human into the loop when the automated fix does not work
Process
Alert on vital monitors
Add new alerts with new monitors
Compute metrics from alerts
Ruthlessly edit
Risks
Too many alerts
Become desensitized
Ignore alerts
App crashes for real
Pendulum swings back
Tools
nagios
+ services
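
One possible reading of "Compute metrics from alerts" and "Ruthlessly edit" is to measure how often each alert fires and how often it actually led to action, then review the noisy ones. The sketch below assumes a hypothetical history of `(alert_name, was_actionable)` pairs; the function name and both thresholds are illustrative.

```python
from collections import Counter


def noisy_alerts(firings, min_fired=5, max_actionable_ratio=0.2):
    """Return names of alerts that fire often but rarely require action.

    firings: iterable of (alert_name, was_actionable) pairs.
    """
    fired = Counter()
    acted = Counter()
    for name, was_actionable in firings:
        fired[name] += 1
        if was_actionable:
            acted[name] += 1
    return sorted(
        name for name, n in fired.items()
        if n >= min_fired and acted[name] / n <= max_actionable_ratio
    )
```

Alerts returned by this function are candidates for deletion or retuning before the team becomes desensitized to them.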
54. Fix || Escalate
Purpose
Fix issue or find someone who can
Process
(fix) capture actions as soon as possible (while or shortly after)
(fix) runbooks
(fix) automate fixes
(escalation) on-call rotation
(escalation) agree on rules
Risks
Burn out
Tools
PagerDuty
Bug tracker
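
The two escalation bullets can be sketched as code: a weekly on-call rotation plus one agreed rule, "if the page is not acknowledged within N minutes, page the next person". The weekly cadence, the 15-minute step, and the function names are assumptions for illustration; a real team would encode whatever rules it has agreed on.

```python
from datetime import date


def on_call_index(rotation_size, day, start):
    """Index of the primary on-call person; the rotation advances every 7 days."""
    weeks = (day - start).days // 7
    return weeks % rotation_size


def who_to_page(rotation, day, start, minutes_unacked, step_min=15):
    """Escalate one person down the rotation per step_min minutes without an ack."""
    primary = on_call_index(len(rotation), day, start)
    level = min(minutes_unacked // step_min, len(rotation) - 1)  # stop after one full pass
    return rotation[(primary + level) % len(rotation)]
```

Capping the escalation level keeps the page from cycling back to the primary forever, and spreading the load across a rotation is one guard against the burn-out risk named on the slide.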
56. Investigate
Purpose
Collect evidence
Reconstruct what happened
Process
Start where/when the problem was first detected
Work your way from there
Capture relevant graphs/logs
Risks
Missing the starting point
Lagging events/metrics
Low-level events/metrics
Blame game
Tools
Post-mortems
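
A small sketch of the investigation process: merge events from several sources into one timeline and keep only those around the moment the problem was first detected. It assumes each source is already sorted by time and yields `(timestamp, source, message)` tuples; the `timeline` function and the window sizes are hypothetical.

```python
import heapq


def timeline(sources, detected_at, before_s=600, after_s=300):
    """Merge time-sorted event streams and keep events near the detection time."""
    merged = heapq.merge(*sources, key=lambda event: event[0])
    lo, hi = detected_at - before_s, detected_at + after_s
    return [event for event in merged if lo <= event[0] <= hi]
```

Widening `before_s` step by step is one way to "work your way from there". Note that lagging events will appear later than they really happened, so source clocks and delivery delays are worth checking before drawing conclusions.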