Creating great software just doesn’t cut it anymore. Nowadays you’re also responsible for running it. When things go wrong, you’re the one poring over all the data to find the fault. In this talk, we’ll take a step back, look at all these silos of data, and gain an understanding of why they came to be. We’ll advise on how to use these systems to understand what’s going on in production.
This presentation was given during the Spaces Summit, an internal IT conference by and for the engineers of bol.com.
21. Golden Signals
Source: Google Site Reliability Engineering
Latency
- Time it takes to service a request, differentiating between failed and successful requests
Traffic
- How much demand is placed on your service
Errors
- Rate of Failures
Saturation
- How “full” is your service
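As a rough illustration of the four signals above, the sketch below summarises one observation window of requests. The `Request` record, the `golden_signals` function, and the `in_use`/`capacity` inputs are hypothetical names chosen for this example; a real service would normally read these values from its metrics system and would likely report latency percentiles rather than a mean.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float  # time taken to service the request
    ok: bool            # True if the request succeeded


def golden_signals(requests, window_s, in_use, capacity):
    """Summarise one observation window into the four golden signals."""
    ok = [r.duration_ms for r in requests if r.ok]
    failed = [r.duration_ms for r in requests if not r.ok]
    total = len(requests)
    return {
        # Latency is reported separately for successful and failed requests.
        "latency_ok_ms": mean(ok) if ok else None,
        "latency_failed_ms": mean(failed) if failed else None,
        # Traffic: demand placed on the service, in requests per second.
        "traffic_rps": total / window_s,
        # Errors: fraction of requests that failed.
        "error_rate": len(failed) / total if total else 0.0,
        # Saturation: how "full" the service is (assumed: used / available).
        "saturation": in_use / capacity,
    }


sample = [
    Request(100.0, True),
    Request(200.0, True),
    Request(20.0, False),
    Request(90.0, True),
]
signals = golden_signals(sample, window_s=2, in_use=30, capacity=40)
```

Keeping failed requests out of the success latency matters because fast failures would otherwise make the service look quicker than it is.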
22. Creating Alerts - Best Practices
Great Alerts Are:
- Simple
- Urgent
- Actionable
- Require human intervention
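One possible reading of these criteria, sketched in code: a single threshold keeps the rule simple, requiring several consecutive breaches filters out blips that would resolve without a human, and a runbook reference keeps it actionable. The `AlertRule` fields, the `should_page` function, and the runbook text are all assumptions made for this example, not the alerting setup described in the talk.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    """Hypothetical alert definition."""
    name: str
    threshold: float        # error rate above which the service is unhealthy
    sustained_samples: int  # consecutive breaches required before paging
    runbook: str            # what the on-call engineer should do (actionable)


def should_page(rule, error_rates):
    """Page only when the most recent N samples all breach the threshold."""
    recent = error_rates[-rule.sustained_samples:]
    if len(recent) < rule.sustained_samples:
        return False  # not enough data yet to call the breach sustained
    return all(rate > rule.threshold for rate in recent)


rule = AlertRule(
    name="high-error-rate",
    threshold=0.05,
    sustained_samples=3,
    runbook="hypothetical runbook: roll back the latest deploy",
)

blip = should_page(rule, [0.01, 0.20, 0.01])             # single spike
sustained = should_page(rule, [0.01, 0.20, 0.30, 0.25])  # three in a row
```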
24. Life After the Alert - Some advice
Meta:
- Mitigate first
- Strict time period before escalation
Specifics:
- Dashboard with SLIs
- Link to other dashboards for drill down
- Logs for the audit trail
- Use $tool for more
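The two "Meta" points above can be expressed as a small decision rule: try to mitigate first, and escalate once a fixed time limit has passed without mitigation. This is only a sketch; the 30-minute limit, the `next_step` function, and the returned labels are assumptions for illustration, since the slides do not specify them.

```python
from datetime import datetime, timedelta

# Assumed limit; the talk only says the period should be strict.
ESCALATE_AFTER = timedelta(minutes=30)


def next_step(alert_started, now, mitigated):
    """Decide what the on-call engineer should do next."""
    if mitigated:
        # Impact is contained, so there is time to dig into dashboards and logs.
        return "investigate"
    if now - alert_started >= ESCALATE_AFTER:
        # The time limit has passed without mitigation: bring in help.
        return "escalate"
    return "mitigate"


started = datetime(2024, 1, 1, 12, 0)
early = next_step(started, started + timedelta(minutes=10), mitigated=False)
late = next_step(started, started + timedelta(minutes=45), mitigated=False)
done = next_step(started, started + timedelta(minutes=45), mitigated=True)
```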
26. Future of Monitoring at bol.com
- Separation of Alerting and Monitoring
- Improve Self Service Monitoring
- Focus on Metric Based Alerting
- Distributed Tracing