Michael Kehoe walks you through building a small monitoring utility for cgroup containers to illustrate best practices in container monitoring. You'll explore various cgroup constraints and learn how to specifically monitor for each of them to ensure that your application is behaving as expected. Along the way, Michael shares tricks and tips about monitoring containerized applications.
5. Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Worked on:
• Networks
• Micro-services
• Traffic Engineering
• Databases
6. Production-SRE Team @ LinkedIn
$ WHOAMI
• Disaster Recovery – Planning & Automation
• Incident Response – Process & Automation
• Visibility Engineering – Making use of operational data
• Reliability Principles – Defining best practice & automating it
8. Containers
• cgroups – Limiting the resources that can be used by a process/set of processes
• Namespaces – Isolating filesystem resources
• Copy on Write – Implicit sharing or shadowing
• Linux Security Modules – Locking down container privileges
9. Cgroup
• Abbreviation for ‘Control Groups’
• Provides
• Resource Limiting
• Prioritization
• Accounting
• Control
So I’m part of a team at LinkedIn called Production-SRE.
The key tenets of Production-SRE at LinkedIn are:
Assisting in restoring stability during site-critical issues
Developing applications to reduce MTTD and MTTR
Providing direction and guidelines for site troubleshooting
Building tools for efficient site-issue troubleshooting, issue detection and correlation
As this presentation goes on, you’ll notice how an Event Correlation system fits into these.
Resource limiting – groups can be set to not exceed a configured memory limit, which also includes the file system cache[8][9]
Prioritization – some groups may get a larger share of CPU utilization[10] or disk I/O throughput[11]
Accounting – measures a group's resource usage, which may be used, for example, for billing purposes[12]
Control – freezing groups of processes, their checkpointing and restarting[12]
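To make the resource limiting and accounting points above concrete, here is a minimal sketch of how a monitoring utility might compare a group's memory usage against its configured limit. It is not the tool from the talk. It assumes the kernel's standard memory controller file names (memory.current and memory.max on cgroup v2, memory.usage_in_bytes and memory.limit_in_bytes on cgroup v1), and the function name memory_stats and the example path below are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

# cgroup v2 file names are tried first, then the cgroup v1 equivalents.
USAGE_FILES = ("memory.current", "memory.usage_in_bytes")
LIMIT_FILES = ("memory.max", "memory.limit_in_bytes")


def _read_first(cgroup_dir: Path, names) -> Optional[str]:
    """Return the stripped contents of the first file that exists, else None."""
    for name in names:
        path = cgroup_dir / name
        if path.is_file():
            return path.read_text().strip()
    return None


def memory_stats(cgroup_dir) -> Tuple[int, Optional[int], Optional[float]]:
    """Return (usage_bytes, limit_bytes, fraction_used) for one cgroup directory.

    limit_bytes and fraction_used are None when no limit is configured
    (cgroup v2 reports the literal string "max" in that case).
    """
    cgroup_dir = Path(cgroup_dir)
    usage_raw = _read_first(cgroup_dir, USAGE_FILES)
    if usage_raw is None:
        raise FileNotFoundError(f"no memory usage file found in {cgroup_dir}")
    usage = int(usage_raw)

    limit_raw = _read_first(cgroup_dir, LIMIT_FILES)
    limit = None if limit_raw in (None, "max") else int(limit_raw)

    # Note: cgroup v1 reports "unlimited" as a very large number rather than
    # "max", so the fraction is simply close to zero in that case.
    fraction = usage / limit if limit else None
    return usage, limit, fraction
```

A caller could point this at a group's directory, for example memory_stats("/sys/fs/cgroup/mygroup") (a hypothetical path), and alert when the returned fraction approaches 1.0.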