1. Monitoring in 2017
Challenges in monitoring containers and dynamic
infrastructure.
TIAD
Oct 6, 2017
Charly Fontaine
Software Engineer - Containers team
Datadog
3. Datadog Overview
• SaaS-based infrastructure and app monitoring
• Open Source Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting
• We’re hiring! (www.datadoghq.com/careers/)
4. Monitor Everything
Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches,
Queues and more...
5. $ cat ~/.plan
1. Intro: The Importance of Monitoring
2. The Challenge: Monitoring Dynamic Infrastructure
3. Finding the Signal: How do we know what to monitor?
4. Implementation: Applying it to Containerized Workloads
5. Demo: Monitoring of a containerized web app deployment
9. Collecting data is cheap;
not having it when you
need it can be expensive
22. Open Questions
• Where is my container running?
• What is the capacity of my cluster?
• What’s the total throughput of my app?
• What’s its response time per tag? (app, version, region)
• What’s the distribution of 5xx errors per container?
23. More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/
25. Examples: NGINX - Metrics
Resource Metrics:
• Disk I/O
• Memory
• CPU
• Queue Length
Work Metrics:
• Requests Per Second
• Request Time
• Error Rates (4xx or 5xx)
• Success (2xx)
26. Examples: NGINX - Events
• Configuration Change
• Code Deployment
• Service Started / Stopped
33. Query Based Monitoring
“What’s the average throughput of
application:nginx per version?”
“Alert me when one of my pods from replication
controller:foo is not behaving like the others.”
“Show me the rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2…”
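A minimal sketch of what such a query boils down to: filter data points by tag, then aggregate per value of another tag. The metric name `nginx.requests_per_s`, the point layout, and the helper `average_by` are hypothetical and only illustrate the idea; they are not Datadog's query engine.

```python
from collections import defaultdict
from statistics import mean


def average_by(points, metric, filters, group_by):
    """Average `metric` over points matching every tag in `filters`,
    grouped by the value of the `group_by` tag."""
    groups = defaultdict(list)
    for point in points:
        tags = point["tags"]
        if point["metric"] != metric:
            continue
        # Skip points that miss any of the requested tag values.
        if any(tags.get(key) != value for key, value in filters.items()):
            continue
        groups[tags.get(group_by)].append(point["value"])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical data points, one per container.
POINTS = [
    {"metric": "nginx.requests_per_s", "value": 100.0,
     "tags": {"application": "nginx", "version": "1", "container": "a"}},
    {"metric": "nginx.requests_per_s", "value": 140.0,
     "tags": {"application": "nginx", "version": "1", "container": "b"}},
    {"metric": "nginx.requests_per_s", "value": 90.0,
     "tags": {"application": "nginx", "version": "2", "container": "c"}},
    {"metric": "nginx.requests_per_s", "value": 500.0,
     "tags": {"application": "haproxy", "version": "1", "container": "d"}},
]

if __name__ == "__main__":
    # "What's the average throughput of application:nginx per version?"
    print(average_by(POINTS, "nginx.requests_per_s",
                     {"application": "nginx"}, "version"))
```

Because the query names tags rather than hosts or container IDs, it keeps working as containers come and go.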
36. Container Events
• Starting / Stopping Containers
• Scaling Events for Underlying Instances
• Deploying a new container build
37. Pseudo-files
• Provide visibility into container metrics via the file system.
• Generally under:
/cgroup/<resource>/docker/$CONTAINER_ID/
or
/sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/
38. Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451 # cumulative CPU time spent in user mode by the container's processes, in USER_HZ ticks (typically 1/100 s)
> system 966 # cumulative CPU time spent in kernel mode (system calls), in the same ticks
Pseudo-files: CPU Throttling
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565 # Number of enforcement intervals that have elapsed
> nr_throttled 559 # Number of times the group has been throttled
> throttled_time 12119585961 # Total time that members of the group were throttled, in nanoseconds (12.12 seconds)
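A rough sketch of how a collector might read these pseudo-files, assuming the `/sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/` layout shown above. The helper names (`parse_stat`, `throttling_summary`, `read_cpu_stat`) are illustrative, not part of any agent.

```python
from pathlib import Path

NANOSECONDS_PER_SECOND = 1_000_000_000


def parse_stat(text):
    """Parse 'key value' lines (the cpu.stat / cpuacct.stat layout) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttling_summary(cpu_stat):
    """Derive the share of throttled periods and the throttled time in seconds."""
    periods = cpu_stat.get("nr_periods", 0)
    throttled = cpu_stat.get("nr_throttled", 0)
    return {
        "throttled_ratio": throttled / periods if periods else 0.0,
        "throttled_seconds": cpu_stat.get("throttled_time", 0) / NANOSECONDS_PER_SECOND,
    }


def read_cpu_stat(container_id, root="/sys/fs/cgroup"):
    """Read cpu.stat for one container; assumes the cgroup layout from the slide."""
    path = Path(root) / "cpu" / "docker" / container_id / "cpu.stat"
    return parse_stat(path.read_text())


if __name__ == "__main__":
    # Values copied from the slide, so no running container is needed.
    sample = "nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n"
    print(throttling_summary(parse_stat(sample)))
```

With the slide's numbers, almost every enforcement period was throttled, which usually points to a CPU limit that is too tight for the workload.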
39. Docker API
• Detailed streaming metrics as JSON, served over HTTP on the Docker Unix socket
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
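As a sketch of what can be done with one JSON sample from that stream: CPU usage as a percentage is commonly derived from the deltas between `cpu_stats` and `precpu_stats`. The field names below follow the Docker Engine API stats response as I understand it (`online_cpus` may be absent on older daemons, so the code falls back to 1); the sample payload is made up.

```python
def cpu_percent(stats):
    """Estimate container CPU usage (%) from one Docker stats sample."""
    cpu = stats["cpu_stats"]
    previous = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - previous["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - previous["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    online_cpus = cpu.get("online_cpus") or 1  # assumption: default to 1 when missing
    return cpu_delta / system_delta * online_cpus * 100.0


# Made-up sample with only the fields the function reads.
SAMPLE = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 200_000_000},
        "system_cpu_usage": 2_000_000_000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 100_000_000},
        "system_cpu_usage": 1_000_000_000,
    },
}

if __name__ == "__main__":
    print(f"{cpu_percent(SAMPLE):.1f}% CPU")
```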
41. Service Discovery
[Diagram. Labels: Monitoring Agent (running in a container); Docker API: Containers List & Metadata; Kubernetes: Additional Metadata (Tags, etc); Config Backends: Integration Configurations; Host Level Metrics]
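One way to picture the flow in this diagram, as a hedged sketch: the agent takes the container list, looks up an integration configuration template per image, and fills in the container's address. The `%%host%%` and `%%port%%` placeholders, the template layout, and every helper name here are assumptions made for illustration, not the agent's actual implementation.

```python
def short_image(image):
    """Strip registry path, digest and tag: 'library/nginx:1.13' -> 'nginx'."""
    name = image.rsplit("/", 1)[-1]
    return name.split("@", 1)[0].split(":", 1)[0]


def fill(value, container):
    """Substitute the hypothetical placeholders in string values only."""
    if not isinstance(value, str):
        return value
    return (value.replace("%%host%%", container["ip"])
                 .replace("%%port%%", str(container["port"])))


def resolve_configs(containers, templates):
    """Match running containers to config templates keyed by short image name."""
    configs = []
    for container in containers:
        template = templates.get(short_image(container["image"]))
        if template is None:
            continue  # no integration configured for this image
        instance = {key: fill(value, container)
                    for key, value in template["instance"].items()}
        configs.append({"check": template["check"],
                        "container_id": container["id"],
                        "instance": instance})
    return configs


# Hypothetical inputs: what a container API and a config backend might return.
CONTAINERS = [
    {"id": "28d7a95f468e", "image": "nginx:1.13", "ip": "172.17.0.2", "port": 80},
    {"id": "0f3c9a1b2d4e", "image": "example/worker:2", "ip": "172.17.0.3", "port": 9000},
]
TEMPLATES = {
    "nginx": {"check": "nginx",
              "instance": {"status_url": "http://%%host%%:%%port%%/nginx_status"}},
}

if __name__ == "__main__":
    for config in resolve_configs(CONTAINERS, TEMPLATES):
        print(config)
```

Re-running the resolution whenever a container starts or stops keeps the checks in step with a changing fleet.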
43. Custom Metrics
• Instrument custom applications
• You know your key transactions best.
• Use async protocols like Etsy’s StatsD or
DogStatsD
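A minimal sketch of why these protocols are cheap to use from application code: a metric is a single fire-and-forget UDP datagram. The line format `name:value|type` is the StatsD wire format, and the trailing `|#tag,...` is the DogStatsD tag extension. The metric name, tags, and the default address `127.0.0.1:8125` are assumptions about a local agent; a real application would more likely use an existing client library.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one StatsD line; tags use the DogStatsD '|#' extension."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(tags)
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send the line as a UDP datagram; nothing is awaited or acknowledged."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical key transaction: count completed checkouts, tagged by version.
    send_metric(format_metric("checkout.completed", 1, "c",
                              ["application:shop", "version:2"]))
```

Since UDP does not wait for a reply, instrumenting a hot code path this way adds little latency, at the cost of possibly dropping datagrams when no agent is listening.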
45. Resources
Monitoring 101: Alerting
https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data
https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues
https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics
https://www.datadoghq.com/blog/the-docker-monitoring-problem/
How to Collect Docker Metrics
https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
8 surprising facts about Docker Adoption
https://www.datadoghq.com/docker-adoption/