Container Monitoring Best Practices Using AWS and InfluxData by Gunnar Aasen

In this InfluxDays NYC 2019 talk by Gunnar Aasen (Manager of Partner Engineering at InfluxData), you will get an overview of the AWS Container Monitoring Stack as well as how you can use InfluxDB on AWS for container monitoring. This session will include a demo of the solution.

Published in: Internet

  1. Gunnar Aasen / Partner Engineering: Container Monitoring Best Practices Using AWS and InfluxData
  2. Agenda • What is container monitoring • Options for running containers on AWS • Best practices for container monitoring • Run TICK on AWS container services • Demo • Questions
  3. Partner Engineering Manager, InfluxData. InfluxDB expert. Based in San Francisco, Gunnar is a former InfluxData support engineer. He has intimate knowledge of InfluxDB and the rest of the TICK stack. As a partner engineer, he’s focused on integrating InfluxDB into the larger open source and cloud ecosystems to help InfluxData’s partners and customers succeed.
  4. Containers & Monitoring
  5. What is a container?
  6. Process in cgroup? Docker? Kubernetes? Another buzzword like “cloud”?
  7. It’s containers all the way down 🐢 🐢 🐢
  8. Deployment Options on AWS
  9. AWS ECS/Fargate • Elastic Container Service (ECS) – Docker-based container deployment – Essentially AWS’ version of Kubernetes • Terminology is a bit different: tasks vs. services – Exposes the EC2 hosts used underneath – Can use Docker Compose • Fargate – The same as ECS, with no EC2 instances exposed – Pay only for container CPU/memory used
  10. AWS EKS • EKS is AWS’ managed Kubernetes offering – Equivalent to Google’s GKE • Uses EC2 instances underneath – These are exposed to the user • AWS manages the Kubernetes API • Some integration with IAM and load balancers
  11. TICK on AWS
  12. Options for deploying TICK on AWS • CloudFormation module for EC2 • Link: https://github.com/influxdata/amazon-cloud-formation-influxdb-enterprise • ECS/Fargate via Docker Compose • Link: https://github.com/influxdata/sandbox • EKS – Via Helm (on the AWS Marketplace) • Link: https://aws.amazon.com/marketplace/pp/B07KGM885K – Via InfluxDB operator • Link: https://docs.influxdata.com/platform/integrations/kubernetes/
  13. Kubernetes resources • Summary – Link: https://docs.influxdata.com/platform/integrations/kubernetes/ • kube-influxdb project – Makes monitoring Kubernetes with TICK easy on different platforms • Link: https://github.com/influxdata/kube-influxdb – Similar to kube-prometheus – Includes common container and Kubernetes inputs to enable – Includes graphs and dashboards for those metrics – Will include alerts as well
  14. Recommendations for monitoring on AWS
  15. What’s different • Proliferation of containers – Running in AWS… • Enables microservices – Increases the amount of inter-container (inter-process) communication • Minimal environments – Lack of familiar debugging tools and techniques
  16. Observability is the new paradigm • A holistic understanding of reality in a system – Monitoring • Current state of the system – Logging • Actions taken by services in the system – Tracing • Interactions between different services – Graphs/alerting • Translating machine information into human information
  17. Levels of container monitoring • Host/node level monitoring – EC2 node failures • Container monitoring – Lack of resources • Application monitoring – Service does not respond • Cluster monitoring – Is Kubernetes overextended?
  18. Telegraf in Kubernetes • Three options – DaemonSet: monitoring per node (one Telegraf per EC2 instance) • Collect host/node metrics – Deployment: single service for a cluster (Prometheus scraping) • Collect application and cluster metrics – Sidecar: tight coupling with the application • Collect container metrics • DaemonSet or sidecar? Start with DaemonSet • Understand the metrics you’re generating before deploying
  19. Telegraf input plugins for instrumenting nodes • cpu: standard CPU metrics • system: general stats on system load, uptime, and number of users logged in • processes: number of processes, grouped by state • procstat: fine-grained process stats like RSS memory • diskio: metrics about disk traffic and timing • disk: metrics about disk usage • mem: system memory metrics • netstat: network-related metrics • http_response: set up a local ping-style check • filestat: stats about specified files (meta node only)
  20. Telegraf input plugins for instrumenting containers • logs: requires syslog • swap: system swap metrics • internal: Telegraf-related stats • docker: if deployed in containers • kubernetes: kubelet stats like per-node pod metrics • kube_inventory: Kubernetes state metrics • prometheus: Prometheus-style /metrics endpoints • syslog: structured logging
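
As a rough illustration of the node-level plugin list on slide 19, the sketch below renders a minimal `telegraf.conf` from Python. It is not taken from the talk. The InfluxDB URL, database name, and collection interval are placeholder assumptions, and `procstat`, `http_response`, and `filestat` are left out because they need a target (a process, URL, or file) that the slides do not specify.

```python
# Node-level Telegraf inputs from the slide that run without extra options.
NODE_INPUTS = ["cpu", "system", "processes", "diskio", "disk", "mem", "netstat"]


def render_telegraf_conf(inputs, influxdb_url, database, interval="10s"):
    """Render a minimal telegraf.conf as a TOML string.

    influxdb_url, database, and interval are placeholders chosen for this
    sketch; they are not values from the talk.
    """
    lines = ["[agent]", f'  interval = "{interval}"', ""]
    for name in inputs:
        # An empty [[inputs.<name>]] table enables the plugin with defaults.
        lines.append(f"[[inputs.{name}]]")
        lines.append("")
    lines.append("[[outputs.influxdb]]")
    lines.append(f'  urls = ["{influxdb_url}"]')
    lines.append(f'  database = "{database}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_telegraf_conf(NODE_INPUTS, "http://influxdb.example:8086", "telegraf"))
```

The rendered text could be mounted into a Telegraf DaemonSet as a ConfigMap, which matches the "start with DaemonSet" advice on slide 18.
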
  21. Monitoring recommendations • Remember to set up black box testing – Kubernetes may look fine internally but egress may be failing – Always start here for alerting • Node health is still important in Kubernetes – The OOM killer and full disks are still problems – Pay attention to local system disk space • Believe your users’ reports – Most small problems are never reported – Microservices/container scheduling can create many small outages
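
The black box testing recommendation on slide 21 can be sketched as a small external probe that hits a public endpoint and reports the result as InfluxDB line protocol. This is a minimal sketch, not the demo from the talk: the measurement name `blackbox`, the `target` tag, and the field names are assumptions made for illustration, and the measurement name is assumed to contain no spaces or commas.

```python
import time
import urllib.error
import urllib.request


def escape_tag(value):
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value):
    """Format a field value using line-protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record with tags and fields sorted by key."""
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_tag(k)}={format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def probe(url, timeout=5.0):
    """Request url from outside the cluster and summarize the outcome."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses
        status = exc.code
    except (urllib.error.URLError, OSError):  # DNS failure, refusal, timeout
        status = 0
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return {
        "up": 200 <= status < 400,
        "status_code": status,
        "response_time_ms": elapsed_ms,
    }


if __name__ == "__main__":
    target = "https://example.com/"  # placeholder endpoint
    print(to_line_protocol("blackbox", {"target": target}, probe(target), time.time_ns()))
```

Running the probe from a host outside the cluster keeps it in line with slide 22's advice to decouple monitoring from the target infrastructure. Telegraf's `http_response` input covers similar ground without custom code.
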
  22. System recommendations • Decouple the monitoring system from the target infrastructure – SaaS or VMs work well for decoupling • Test the monitoring system – All large environments should have staging metrics • Monitoring should be deployed with your application – Infrastructure as code like CloudFormation or Terraform templates • Always consider how cascading failures will affect monitoring – Monitoring systems tend to go down during other service issues
  23. AWS recommendations • Keep an accessible record of CloudWatch stats – Keep in mind CloudWatch API limits • Always consider AWS limits ahead of time – Available instance classes – Hard to monitor without access to the AWS support API • Kubernetes – Stay up to date for the best experience – Pay attention to IAM roles – Use CloudFormation
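
One common way to stay within API limits like the CloudWatch ones mentioned on slide 23 is to retry throttled calls with exponential backoff and jitter. The sketch below shows that general technique only; it is not from the talk and does not call any AWS API. `ThrottledError` and the `fetch` callable are hypothetical stand-ins for whatever error and request function the real client exposes.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the throttling error a real API client raises."""


def call_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fetch(), retrying on ThrottledError with full-jitter backoff.

    The delay cap doubles on each attempt (base_delay, 2x, 4x, ...) up to
    max_delay, and the actual wait is drawn uniformly from [0, cap].
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_fetch():
        # Simulated request that is throttled twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ThrottledError()
        return "datapoints"

    print(call_with_backoff(flaky_fetch, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. Storing fetched stats in InfluxDB, as the slide suggests, also reduces how often the API has to be queried at all.
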
  24. Future Plans • Next couple of months – Migrating to official Helm charts repo • Deprecating TICK charts and kube-influxdb repos • One well-known place for all charts • This summer – Operator extended for InfluxDB Enterprise – Additional operator functionality for other TICK components – Publish more tools for tracing
  25. 💻 Demo time! 💻
  26. 🙋‍♀️ Questions? 🙋‍♂️
  27. 🎉 Thank You! 🎉
