
Velocity NY 2018: Monitoring Containers Correctly


Michael Kehoe walks you through building a small monitoring utility for cgroup containers to illustrate best practices in container monitoring. You'll explore various cgroup constraints and learn how to specifically monitor for each of them to ensure that your application is behaving as expected. Along the way, Michael shares tricks and tips about monitoring containerized applications.


  1. Monitoring Containers Correctly
     Michael Kehoe, Staff Site Reliability Engineer
     https://github.com/michael-kehoe/container-monitoring-workshop
  2. Getting Started
     • Set up your workshop platform:
       • https://app.strigo.io/event/QXDpmTiRAufQ4LBis
       • Token: F7C7
     • Background slides: https://bit.ly/2NcEBQN
     • Code repo: https://github.com/michael-kehoe/container-monitoring-workshop
     • Please let me know ASAP if you’re
  3. Today’s agenda
     1 Introductions
     2 Container Primitives
     3 What we’ll monitor
     4 Cgroup interface file formats
     5 Exercises
  4. Today’s agenda: Exercises
     100 CPU Basics
     101 CPU Enhanced
     102 CPU Advanced
     200 Memory Basics
     201 Memory Enhanced
     300 IO Basics
     400 PID
  5. $ WHOAMI: Michael Kehoe
     • Staff Site Reliability Engineer @ LinkedIn
     • Production-SRE Team
     • Funny accent = Australian + 4 years American
     • Worked on:
       • Networks
       • Micro-services
       • Traffic Engineering
       • Databases
  6. $ WHOAMI: Production-SRE Team @ LinkedIn
     • Disaster Recovery - Planning & Automation
     • Incident Response – Process & Automation
     • Visibility Engineering – Making use of operational data
     • Reliability Principles – Defining best practice & automating it
  7. Container Primitives
  8. Containers
     • cgroups: limiting the resources that can be used by a process/set of processes
     • Namespaces: isolating filesystem resources
     • Copy on Write: implicit sharing or shadowing
     • Linux Security Modules: locking down container privileges
  9. Cgroup
     • Abbreviation for ‘Control Groups’
     • Provides:
       • Resource Limiting
       • Prioritization
       • Accounting
       • Control
  10. What we’ll monitor
  11. What we’ll monitor: CPU
     • 100: Basic cgroup CPU utilization
     • 101: Enhanced cgroup CPU utilization (with percentiles)
     • 102: cgroup throttles
  12. What we’ll monitor: MEMORY
     • 200: Memory Basics
       • Cgroup utilization
     • 201: Enhanced Memory Metrics
  13. What we’ll monitor: DISK/NETWORK
     • 300: Disk IO Monitoring
  14. What we’ll monitor: PID
     • 400: PID Utilization
  15. Cgroup interface file formats
  16. Cgroup interface file formats
     https://www.kernel.org/doc/Documentation/cgroup-v2.txt
  17. Exercises
  18. 100: CPU Monitoring
  19. 101: Enhanced CPU Monitoring
  20. Enhanced CPU Monitoring
  21. 102: CPU Advanced Monitoring
  22. Advanced CPU Monitoring
  23. 200: Memory Basics
  24. 201: Memory Enhanced
  25. 300: Disk IO Basics
  26. 400: PID Monitoring
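
The exercise slides above carry only titles, so the workshop's own code is not shown here. As a rough illustration of what exercises 100 (CPU utilization) and 102 (throttles) are about, the sketch below assumes a cgroup v2 `cpu.stat` file with the `usage_usec`, `nr_periods` and `nr_throttled` fields (the last two are only present when the cpu controller is enabled). The function names and the sample numbers are hypothetical and are not taken from the workshop repository.

```python
# Hypothetical sketch: derive CPU utilization and throttle ratio from two
# samples of a cgroup v2 cpu.stat file. Sample values below are made up.

SAMPLE_PREV = """usage_usec 1000000
nr_periods 100
nr_throttled 10
"""

SAMPLE_CURR = """usage_usec 1500000
nr_periods 200
nr_throttled 35
"""


def parse_cpu_stat(text):
    """Turn 'KEY VALUE' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def cpu_utilization(prev, curr, interval_sec):
    """CPU used between two samples, in cores (1.0 means one full CPU)."""
    used_usec = curr["usage_usec"] - prev["usage_usec"]
    return used_usec / (interval_sec * 1_000_000)


def throttle_ratio(prev, curr):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = curr["nr_periods"] - prev["nr_periods"]
    throttled = curr["nr_throttled"] - prev["nr_throttled"]
    return throttled / periods if periods else 0.0


prev = parse_cpu_stat(SAMPLE_PREV)
curr = parse_cpu_stat(SAMPLE_CURR)
utilization = cpu_utilization(prev, curr, interval_sec=1.0)
ratio = throttle_ratio(prev, curr)
```

In a real monitor the two samples would come from reading the container's `cpu.stat` file at a fixed interval. Working from deltas between samples rather than raw counters matters because the counters only ever increase.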

Editor's Notes

  • So I’m a part of a team at LinkedIn called Production-SRE
    The key tenets of Production-SRE at LinkedIn are:
    Assist in restoring stability during site-critical issues
    Develop applications to reduce MTTD and MTTR
    Provide direction and guidelines for site troubleshooting
    Build tools for efficient site-issue troubleshooting, issue detection and correlation

    As this presentation goes on, you’ll notice how an Event Correlation system fits into these
  • Cgroups: kernel >= 2.6.24
    Namespaces: kernel >= 2.4.19
    Copy-on-Write
    Linux Security Modules
  • Resource limiting – groups can be set to not exceed a configured memory limit, which also includes the file system cache[8][9]
    Prioritization – some groups may get a larger share of CPU utilization[10] or disk I/O throughput[11]
    Accounting – measures a group's resource usage, which may be used, for example, for billing purposes[12]
    Control – freezing groups of processes, their checkpointing and restarting[12]
  • Nlsv: new-line separated values
    Ssv: space separated values
    Fk: flat keyed
    Nk: nested keyed
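
The four abbreviations in the last note appear to correspond to the interface file formats described in the kernel cgroup-v2 document linked on slide 16. A minimal sketch of a parser for each is given below. The function names are hypothetical, and the files named in the comments (`cgroup.procs`, `cgroup.controllers`, `memory.stat`, `io.stat`) are examples taken from the kernel documentation rather than from the slides. Values are kept as strings because some files can hold the literal `max`.

```python
# Hypothetical parsers for the four cgroup v2 interface file formats.

def parse_nlsv(text):
    """New-line separated values, e.g. cgroup.procs (one PID per line)."""
    return [line for line in text.splitlines() if line]


def parse_ssv(text):
    """Space separated values, e.g. cgroup.controllers."""
    return text.split()


def parse_flat_keyed(text):
    """Flat keyed: one 'KEY VALUE' pair per line, e.g. memory.stat."""
    return dict(
        line.split(None, 1) for line in text.splitlines() if line.strip()
    )


def parse_nested_keyed(text):
    """Nested keyed: 'KEY SUB=VAL SUB=VAL ...' per line, e.g. io.stat."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, *pairs = line.split()
        result[key] = dict(pair.split("=", 1) for pair in pairs)
    return result
```

For example, `parse_nested_keyed("8:0 rbytes=1024 wbytes=0\n")` yields one entry keyed by the device number `8:0`, whose value is a dict of that device's counters.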
