Bad monitoring, alerting, and logging frustrated me in several of my previous positions. It seems that almost nobody gets this exactly right. This is a 5-minute talk on the most annoying issues I've come across and how to fix them.
10. 2nd deadly sin of monitoring
A single team does the monitoring; everyone else is second tier
11. Solution: direct alerts to the relevant parties
1) Only the person who can fix the problem gets alerted; others get emails.
2) The system needs to be smart enough to make that choice, and it must be fixed when it makes a mistake and wakes up the wrong person.
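The two rules above can be sketched roughly as follows. This is a minimal illustration, not a real paging system: the `Router` class, the `owners` mapping, and the `watchers` list are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # Hypothetical mapping: component name -> the person who can fix it.
    owners: dict
    # Hypothetical list of people who should be kept informed by email.
    watchers: list
    pages: list = field(default_factory=list)   # (person, message) pairs that wake someone up
    emails: list = field(default_factory=list)  # (person, message) pairs sent as email only

    def route(self, component, message):
        """Page only the owner of the component; email everyone else."""
        owner = self.owners.get(component)
        if owner is not None:
            self.pages.append((owner, message))
        for person in self.watchers:
            if person != owner:
                self.emails.append((person, message))
        return owner

    def correct(self, component, right_owner):
        """Fix the routing after the wrong person was woken up."""
        self.owners[component] = right_owner
```

With `Router(owners={"db": "alice"}, watchers=["alice", "bob"])`, an alert for `"db"` pages alice and emails bob. If alice turns out to be the wrong person, `correct("db", "bob")` makes sure the next alert goes to bob instead.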
15. Solution: monitoring needs to be part of the design
The empty error: an error that tells you nothing useful. Classic example: null pointer exceptions in Java.
Make your developers accountable for empty errors.
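One way to avoid empty errors is to wrap low-level failures in an error that says what was being attempted and with which input. The slide's example is Java; the sketch below uses Python instead, and `ConfigError` and `load_timeout` are hypothetical names made up for illustration.

```python
class ConfigError(Exception):
    """Error that always states what failed and for which input."""


def load_timeout(config, service):
    """Read the timeout for a service, never failing with a bare error."""
    try:
        return int(config[service]["timeout"])
    except (KeyError, TypeError, ValueError) as exc:
        # Add the context the low-level error lacks, and keep the
        # original exception chained for the stack trace.
        raise ConfigError(
            f"cannot read timeout for service {service!r}: {exc!r}"
        ) from exc
```

Whoever reads the log then sees which service and which lookup failed, rather than a bare exception with no message.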
18. Solutions:
Self-correcting metrics. If an alert goes off for a metric and we decide it wasn't a real error, a dialog for changing the threshold should pop up.
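A rough sketch of the self-correcting idea, under several assumptions: the alert fires when a positive value exceeds an upper threshold, the `confirm` callback stands in for the dialog, and the 10% `margin` is an arbitrary choice. `ThresholdAlert` and its methods are hypothetical names, not part of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAlert:
    metric: str
    threshold: float
    margin: float = 0.10  # assumed headroom above the false-positive value

    def fires(self, value):
        """The alert goes off when the value exceeds the threshold."""
        return value > self.threshold

    def suggest_threshold(self, value):
        """Propose a threshold that would not have fired for this value."""
        return value * (1 + self.margin)

    def mark_false_positive(self, value, confirm):
        """Offer a new threshold when an alert was not a real error.

        `confirm` plays the role of the dialog: it receives the old and
        the suggested threshold and returns the one to use, or None to
        keep the old one.
        """
        chosen = confirm(self.threshold, self.suggest_threshold(value))
        if chosen is not None:
            self.threshold = chosen
        return self.threshold
```

For example, a CPU alert with threshold 80 that fired at 90 and was judged harmless would suggest a threshold of about 99. The person dismissing the alert decides whether to accept it.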