2. State of the Infrastructure
• Modern infrastructure – multiple interacting applications/services running inside a
service-orchestration system (e.g. Docker + Kubernetes, but not limited to those)
• This stack matches the paradigm shift happening in software development as an
engineering discipline. The main target is business agility (deliver
functionality as soon as possible). Site reliability isn’t the main
purpose.
• This obviously brings technical debt.
• It is expected that this technical debt will be solved later, but that rarely happens.
3. It is expected that this technical debt will be
solved later, but that rarely happens.
And the infrastructure encourages it!
4. Kubernetes is so awesome that one of our JVM containers
has been periodically running out of memory for more than
a year, and we just recently learned about it.
https://danlebrero.com/2018/11/20/how-to-do-java-jvm-heapdump-in-kubernetes/
e.g.:
• The sales report service was updated to provide updates for the regional sales director
• Each time, the app was loading the whole database instead of a single month of data
• SREs were seeing occasional (once a month) service restarts, and no attention was paid to them
5.
6. What can we do?
Top (10?) problems in microservice-oriented
architecture monitoring
7. No service restart monitoring
• Problem: rare service restarts go unnoticed and aren’t investigated
• Leads to late reaction, especially to OOM issues (sometimes months)
• Monitor/alert on changes in the number of restarts.
• Technical view: number of restarts in Kubernetes/docker swarm/etc.
• Prometheus/AlertManager, DataDog, etc.
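A minimal sketch of such an alert as a Prometheus rule, assuming kube-state-metrics is scraped (it exposes `kube_pod_container_status_restarts_total`); the threshold and window are placeholders to tune:

```yaml
# Sketch: alert whenever a container restarted within the last hour.
groups:
  - name: restarts
    rules:
      - alert: ContainerRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted - investigate (possible OOM)"
```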
8. No app-level error monitoring
• Problem: the app might be running but throwing critical errors to its
logs
• Might mean big issues that developers and the business are unaware of
• Monitor service logs for known error log patterns (ask developers for
templates)
• Technical view: Grep over Elastic, X-Pack
• Bonus for developers: Sentry (developer user friendly)
• Prometheus/AlertManager, Sentry, DataDog, etc.
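A minimal sketch of log-pattern matching, assuming hypothetical error templates collected from developers (real patterns would come from your own services’ log formats); tools like Elastic/X-Pack do this at scale:

```python
import re
from collections import Counter

# Hypothetical error templates; replace with patterns from your developers.
ERROR_PATTERNS = {
    "oom": re.compile(r"java\.lang\.OutOfMemoryError"),
    "db_timeout": re.compile(r"connection timed out", re.IGNORECASE),
    "http_5xx": re.compile(r'" 5\d\d '),
}

def count_error_events(log_lines):
    """Count how many log lines match each known error pattern."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

sample = [
    'GET /report HTTP/1.1" 500 12',
    "Exception in thread main java.lang.OutOfMemoryError: Java heap space",
    'GET /health HTTP/1.1" 200 2',
]
print(count_error_events(sample))
```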
9. No service health checks (or minimal health
checks)
• Problem: the service health check just returns “I’m OK” without checking
anything
• Leads to an OK health check while the service is effectively down
• Work with developers/management to add proper health checks
(this might mean scheduling more development time, which is usually not
planned: people expect health checks to be covered by monitoring
tools)
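A minimal sketch of a “deep” health check: instead of always printing “I’m OK”, it verifies real dependencies and returns 503 when any fail, so the orchestrator’s probe actually reflects service health. The check functions are hypothetical placeholders; swap in real DB/queue/cache probes:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database():
    # placeholder: e.g. run "SELECT 1" against the real connection pool
    return True

def check_message_queue():
    # placeholder: e.g. open and close a channel to the broker
    return True

CHECKS = {"database": check_database, "queue": check_message_queue}

def run_health_checks():
    """Run every dependency check; healthy only if all pass."""
    results = {name: bool(check()) for name, check in CHECKS.items()}
    return all(results.values()), results

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        healthy, results = run_health_checks()
        body = json.dumps(results).encode()
        # 503 makes the orchestrator's liveness/readiness probe fail
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```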
10. No API response-time checks (and no tracing)
• Problem: the service might be working, but too slowly, which makes the
whole app slow
• In SOA/microservices/macroservices we need to check inter-service
communication as well
• Work with developers to add tracing to your app
• There’s a way to include Jaeger metrics in Prometheus:
https://www.youtube.com/watch?v=fjYAU3jayVo — Distributed
Tracing with Jaeger & Prometheus on Kubernetes
• https://github.com/opentracing-contrib
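A minimal in-process sketch of response-time tracking (not full distributed tracing): a decorator records each call’s duration so slow endpoints can be spotted. In production you would export these as histogram metrics via a Prometheus client library or a Jaeger/OpenTracing client instead of keeping them in memory:

```python
import time
from collections import defaultdict

LATENCIES = defaultdict(list)  # endpoint name -> list of durations (seconds)

def timed(endpoint):
    """Decorator that records how long each call to an endpoint takes."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                LATENCIES[endpoint].append(time.perf_counter() - start)
        return wrapper
    return decorator

def p95(endpoint):
    """Rough 95th-percentile latency for one endpoint."""
    samples = sorted(LATENCIES[endpoint])
    return samples[int(0.95 * (len(samples) - 1))]

@timed("get_report")
def get_report():
    time.sleep(0.01)  # stand-in for real work
    return "report"

for _ in range(20):
    get_report()
print(f"p95 latency: {p95('get_report'):.3f}s")
```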
11. Service is still an application and still
consumes CPU and RAM
• Common problem: the application/cluster might be covered well by
monitoring service-wise, but a single service might still consume a lot
of CPU, and that will be left unnoticed
• Just don’t forget to monitor it
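A sketch of a Prometheus rule flagging containers close to their memory limit; the metric names are the usual cAdvisor/kube-state-metrics ones, so adjust to your setup:

```yaml
groups:
  - name: container-resources
    rules:
      - alert: ContainerNearMemoryLimit
        expr: |
          container_memory_working_set_bytes
            / on(namespace, pod, container)
          kube_pod_container_resource_limits{resource="memory"} > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} is above 90% of its memory limit"
```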
12. If someone adds new services, those should be
monitored as well
• Problem: sometimes cross-functional teams have the ability to
add new services, and this might be unknown to the SRE team
• Leads to: an unmonitored service going down unnoticed, a lot of investigation
• Monitor cluster configuration metrics (deployments, number of
namespaces, etc.). Alert on differences.
• Not a PRIO 1 investigation, but when it happens, work with the team
to introduce the right workflow
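A sketch of such a configuration-change alert as a Prometheus rule, assuming kube-state-metrics (which exposes `kube_deployment_created` per Deployment):

```yaml
groups:
  - name: cluster-config
    rules:
      - alert: DeploymentCountChanged
        expr: |
          count(kube_deployment_created)
            != count(kube_deployment_created offset 1h)
        labels:
          severity: info
        annotations:
          summary: "Number of Deployments changed in the last hour - check the new ones are monitored"
```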
13. CI/CD time should be monitored
• Problem: deploy time (docker build, tests) increases, and
developers get used to it
• Leads to developer frustration and time spent just waiting
• Gradual changes to the Dockerfile might gradually increase build time: 3
mins -> 5 mins -> 10 mins -> 20 mins. People get used to it
• Monitor time spent on integration and delivery, investigate the reasons.
CI/CD support is now part of SRE life as well.
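A minimal sketch of spotting this kind of build-time creep: compare the recent average build duration against the longer-term baseline and flag when it grows past a threshold. The durations would come from your CI system’s API; the numbers and threshold here are illustrative:

```python
def build_time_regressed(durations, recent=5, factor=1.5):
    """True if the mean of the last `recent` builds exceeds the mean of
    all earlier builds by more than `factor`."""
    if len(durations) <= recent:
        return False
    baseline = durations[:-recent]
    latest = durations[-recent:]
    return (sum(latest) / recent) > factor * (sum(baseline) / len(baseline))

# Build times in minutes: the slow creep from the slide (3 -> 20 min)
history = [3, 3, 4, 4, 5, 6, 8, 10, 14, 18, 20]
print(build_time_regressed(history))  # the recent builds are far above baseline
```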
14. APM and profiling capabilities
• Problem: the app is slow and no one knows why
• Leads to really long investigations
• Development teams still don’t use APM/profiling in many cases; work
with developers to add APM when possible
15. Security monitoring
• Security is part of SRE life as well
• Might be a good idea to monitor the number of WAF-related events,
investigate jumps, block attackers, and escalate to the security team
• Monitor npm/yarn audit and docker image audits for critical CVEs; alert
when they are present
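A sketch of gating on `npm audit --json` output: count critical vulnerabilities and alert/fail when any are present. The JSON shape follows npm’s audit report (severity counts under `metadata.vulnerabilities`); in CI you would feed in the real command output:

```python
import json

def critical_vulnerabilities(audit_json: str) -> int:
    """Number of critical CVEs in an `npm audit --json` report."""
    report = json.loads(audit_json)
    return report.get("metadata", {}).get("vulnerabilities", {}).get("critical", 0)

# Trimmed example of an audit report
sample = json.dumps({
    "metadata": {
        "vulnerabilities": {"info": 0, "low": 2, "moderate": 1,
                            "high": 0, "critical": 1}
    }
})

count = critical_vulnerabilities(sample)
if count > 0:
    print(f"ALERT: {count} critical CVE(s) found")
```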
16. Summing up
• Alerts on service restarts
• Alerts on app-level errors
• Advanced health checks
• API response time alerts (including external APIs)
• Microservice architecture change notifications
• CI/CD build/delivery time monitoring and notifications
• APM/profiling added
• WAF event monitoring
17. Bonus track: Page response time is not
server response time anymore
• Problem: if your “server” responds in 200 ms but the page is
rendered in the browser in 60 seconds, it’s still 60 seconds for users
• So monitor page rendering time as well
• Pingdom, Site24x7, or any of the many headless browsers available