Container monitoring
challenges
Xavier Vello - Datadog
whois xvello.net
Former Debian Maintainer
(sporadic) KDE contributor
Industrial Engineer for 5 years
Software Engineer at Datadog
Container Monitoring Team
xavier.vello@datadoghq.com
http://lkdin.com/in/xaviervello
Container monitoring challenges – Xavier Vello – Datadog 2 / 37
whois datadoghq.com
Our collection agent
A 7-year-old codebase, 24 458 Python SLOCs
87 integrations (32 274 Python SLOCs) in a different repo
The Agent6 project
A rewrite
With container support from the start
With Autodiscovery at the core
A rewrite in Go
Reduce memory and CPU usage
Better reliability and maintainability
True concurrency
Agenda
Share our challenges and discoveries:
Namespaces & cgroups
DogStatsD traffic: using Unix Domain Sockets
Secure usage of the Docker Socket?
Namespaces & cgroups
"The dead forwarder" case of 2017
Network metrics
"The dead forwarder" case
supervisord supervisord -n -c /etc/dd-agent/supervisor.conf
│
├─python agent.py foreground --use-local-forwarder
│ └─{python}
├─python dogstatsd.py --use-local-forwarder
│ └─{python}
└─python ddagent.py
"The dead forwarder" case
Support case: "metrics not coming through although the
collection works OK"
Collector logs: errors connecting to localhost:17123
Supervisord logs: unexpected exit of the forwarder,
retries 5 times then gives up
Forwarder logs: no error
docker-dd-agent process tree
supervisord supervisord -n -c /etc/dd-agent/supervisor.conf
│
├─python agent.py foreground --use-local-forwarder
│ └─{python}
├─python dogstatsd.py --use-local-forwarder
│ └─{python}
└─(python ddagent.py, the forwarder, has exited)
mem cgroup
The limit sets how much RAM / RAM+swap a cgroup can use
If the limit is exceeded, the OOM killer descends upon your cgroup
It can kill PID 1 (which leads to a container OOM exit)
Or not... which can leave the container stuck in a non-working state
You must pre-emptively monitor docker.mem.in_use and docker.mem.sw_in_use to see if that could happen
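The pre-emptive check can be sketched in Python as a usage-to-limit ratio. This is a minimal sketch, assuming cgroup v1 file names (`memory.usage_in_bytes`, `memory.limit_in_bytes`); the helper names, the 0.9 threshold, and the idea that an "in use" metric is usage divided by limit are assumptions for illustration, not the Agent's code.

```python
import os
import tempfile


def mem_in_use(cgroup_dir):
    """Return usage / limit for a cgroup v1 memory directory."""
    def read_int(name):
        with open(os.path.join(cgroup_dir, name)) as f:
            return int(f.read().strip())

    usage = read_int("memory.usage_in_bytes")
    limit = read_int("memory.limit_in_bytes")
    return usage / limit


def demo():
    # Fake cgroup directory so the sketch runs anywhere. On a real host the
    # directory would live under the memory cgroup mount (assumption).
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "memory.usage_in_bytes"), "w") as f:
            f.write("9500000\n")
        with open(os.path.join(d, "memory.limit_in_bytes"), "w") as f:
            f.write("10000000\n")
        return mem_in_use(d)


ratio = demo()
print("in_use=%.2f close_to_limit=%s" % (ratio, ratio > 0.9))
```

Alerting on the ratio rather than on raw bytes keeps one threshold usable across containers with different limits.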
Let's test it
services:
  memleak-pid1:
    image: alpine:3.6
    command: "ash -c 'for i in `seq 1 10000000`; do true; done'"
    mem_limit: 10000000
    restart: on-failure
  memleak-forked:
    image: alpine:3.6
    command: "ash -c \"ash -c 'for i in `seq 1 10000000`; do true; done' & sleep 20\""
    mem_limit: 10000000
    restart: on-failure
Result
ubuntu@ci-xaviervello:~$ docker-compose -f composes/memlimit.compose up
memleak-pid1_1 | Killed
composes_memleak-pid1_1 exited with code 137
memleak-pid1_1 | Killed
memleak-forked_1 | Killed
...
composes_memleak-forked_1 exited with code 0 ✅
composes_memleak-pid1_1 exited with code 137
What should I do?
Ideally, run only one program per container
Or use a robust supervisor, check exit codes
Have a thorough healthcheck
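"Check exit codes" can be sketched in Python as below. It assumes the common 128+N convention for processes ended by signal N, which matches the 137 seen above (128 + 9, SIGKILL). The function names are hypothetical, and SIGKILL is only a hint of an OOM kill since other things send it too.

```python
import signal


def signal_from_exit_code(code):
    """Return the signal number if the exit code follows the 128+N
    convention, else None."""
    if code > 128:
        return code - 128
    return None


def looks_oom_killed(code):
    # The OOM killer sends SIGKILL, but so does "docker kill",
    # so treat a match as a hint only.
    return signal_from_exit_code(code) == signal.SIGKILL


for code in (137, 0):
    print(code, looks_oom_killed(code))
```

In the forked case the container itself exits with 0, so a supervisor would need to apply such a check to each child process, not only to the container's exit code.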
net namespace
Every container has its own network namespace
Its own virtual eth0, bridged by the host
Allows isolation and routing
Allows us to collect per-container metrics
$ docker exec agent5 cat /host/proc/30828/net/dev
Inter-|
face | bytes packets errs drop fifo frame compressed multicast
lo: 6961740 50619 0 0 0 0 0 0
eth0: 19558170 37932 0 0 0 0 0 0
Getting the host's net stats
$ cat /proc/net/dev
Inter-|
face | bytes packets errs drop fifo frame compressed multicast
enp0s8: 61509004 104768 0 0 0 0 0 0
enp0s3: 523131054 862084 0 0 0 0 0 0
lo: 2952 46 0 0 0 0 0 0
...
$ docker exec agent5 cat /proc/net/dev
Inter-|
face | bytes packets errs drop fifo frame compressed multicast
lo: 6656783 48403 0 0 0 0 0 0
eth0: 18699348 36269 0 0 0 0 0 0
Getting the host's net stats
$ docker exec agent5 cat /host/proc/net/dev
Inter-|
face | bytes packets errs drop fifo frame compressed multicast
lo: 6697512 48702 0 0 0 0 0 0
eth0: 18812503 36489 0 0 0 0 0 0
We need to run with net=host: /proc/net is a symlink to self/net, so it always shows the reader's own namespace
Investigating: monitoring PID 1's net stats instead
$ docker exec agent5 cat /host/proc/1/net/dev
Inter-|
face | bytes packets errs drop fifo frame compressed multicast
enp0s8: 61509280 104771 0 0 0 0 0 0
enp0s3: 523996193 864400 0 0 0 0 0 0
lo: 2952 46 0 0 0 0 0 0
...
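Reading these counters from code can be sketched in Python as below. The parser and the `read_net_dev` helper are hypothetical names, the `/host/proc` mount point is taken from the commands above, and only the first two receive columns are kept.

```python
def parse_net_dev(text):
    """Map interface name -> (rx_bytes, rx_packets) from net/dev content."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Header lines have no "iface:" prefix followed by numeric counters.
        if not sep or len(fields) < 2 or not fields[0].isdigit():
            continue
        stats[name.strip()] = (int(fields[0]), int(fields[1]))
    return stats


def read_net_dev(pid, procfs="/host/proc"):
    # <procfs>/<pid>/net/dev lists the interfaces of that PID's namespace;
    # PID 1 is the host's init when the host's /proc is mounted at procfs.
    with open("%s/%d/net/dev" % (procfs, pid)) as f:
        return parse_net_dev(f.read())


SAMPLE = """Inter-|
 face | bytes packets errs drop fifo frame compressed multicast
enp0s8: 61509280 104771 0 0 0 0 0 0
enp0s3: 523996193 864400 0 0 0 0 0 0
lo: 2952 46 0 0 0 0 0 0
"""

print(parse_net_dev(SAMPLE)["lo"])
```

The same parser would serve for per-container metrics by passing a container's PID instead of 1.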
DogStatsD traffic: using Unix Domain Sockets
What is DogStatsD?
Collection, tagging, aggregation of your custom metrics
Issues with UDP
Poor discoverability of the endpoint
No error handling makes automatic detection impossible
Network overhead (lots of small UDP packets)
Detecting source container per IP is error-prone
So, we're just trying to talk to localhost, right?
Datagram Unix Sockets
Host-local traffic
Same semantics as UDP
Smaller CPU overhead than iptables
Endpoint is a socket file in a docker volume
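A client talking to a datagram Unix socket can be sketched in Python as below. The socket file name and helper names are hypothetical, and the payload follows the usual StatsD-style `name:value|type|#tags` layout. It is a local round trip for illustration, not the Agent's listener.

```python
import os
import socket
import tempfile


def format_metric(name, value, kind, tags=()):
    """Build a StatsD-style datagram: name:value|type|#tag1,tag2"""
    payload = "%s:%s|%s" % (name, value, kind)
    if tags:
        payload += "|#" + ",".join(tags)
    return payload.encode("utf-8")


def roundtrip(payload):
    """Send one datagram through a socket file and return what was received."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "statsd.socket")  # hypothetical file name
        server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            server.bind(path)
            # sendto raises an OSError if nothing is bound to the path,
            # which gives the client an error to react to.
            client.sendto(payload, path)
            return server.recv(8192)
        finally:
            client.close()
            server.close()


print(roundtrip(format_metric("page.views", 1, "c", ["env:dev"])))
```

Because the endpoint is a file, a client can detect a missing listener from the send error, which addresses the discoverability issue listed for UDP.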
Container tagging is easy!
unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_PASSCRED, 1)
Unix credentials in ancillary data, provided by the kernel
type Ucred struct {
    Pid int32
    Uid uint32
    Gid uint32
}
Need to run in pid=host mode
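The same mechanism can be sketched in Python on Linux: set SO_PASSCRED on the receiving socket, then read the SCM_CREDENTIALS ancillary message. The helper names are hypothetical, and mapping the PID to a container by scanning `/proc/<pid>/cgroup` for a 64-hex-digit ID is an assumption for illustration, not a description of the Agent's lookup.

```python
import os
import re
import socket
import struct

UCRED = struct.Struct("iII")  # pid, uid, gid: same layout as Ucred


def recv_with_creds(sock):
    """Receive one datagram plus the sender's (pid, uid, gid), or None."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        8192, socket.CMSG_SPACE(UCRED.size))
    for level, kind, cmsg in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_CREDENTIALS:
            return data, UCRED.unpack(cmsg[:UCRED.size])
    return data, None


def container_id_from_cgroup(text):
    """Pick a 64-hex-digit container ID out of /proc/<pid>/cgroup content."""
    match = re.search(r"[0-9a-f]{64}", text)
    return match.group(0) if match else None


def demo():
    receiver, sender = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # Ask the kernel to attach the sender's credentials to each datagram.
        receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        sender.send(b"page.views:1|c")
        return recv_with_creds(receiver)
    finally:
        receiver.close()
        sender.close()


data, creds = demo()
print(data, creds)
```

The PID is reported as seen from the receiver's PID namespace, which is why the listener has to share the host's PID namespace to resolve senders from other containers.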
Container tagging is easy!
You can slice and compare your custom metrics by:
Availability zone
Host
Chef role
as before, but also now by:
ECS task
Kubernetes deployment
Docker labels
Secure usage of the Docker Socket?
Secure Docker Socket?
Monitoring is intrusive, but we strive to make it secure and seamless
We need to list and inspect containers, images, volumes
Docker does not provide a monitoring interface
It provides monitoring endpoints in its management interface
Plan A: use AuthN and AuthZ
Pros:
Already upstream since 1.11
Solid authentication mechanism
Similar to the Kubernetes AuthN/AuthZ system
Cons:
Need to set up a PKI
Requires a third-party AuthZ plugin
/var/run/docker.sock unusable (it's HTTP-only for now)
Plan B1: HTTPS on UDS
Pros:
Minimal changes in Moby
Cons:
Still need to set up a PKI
Still breaks most orchestrators
Plan B2: use UID for AuthZ
Pros:
Minimal changes
No PKI
Transparent to orchestrators (just whitelist unix:root)
Cons:
Linux-specific logic
Different code path for AuthN
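UID-based identification on a Unix socket can be sketched in Python with SO_PEERCRED on Linux. This illustrates the kernel mechanism only: the allowlist and function names are hypothetical, and nothing here reflects how Moby would implement it.

```python
import os
import socket
import struct

UCRED = struct.Struct("iII")  # pid, uid, gid

ALLOWED_UIDS = {0}  # hypothetical allowlist: root only


def peer_credentials(conn):
    """Return (pid, uid, gid) of the process at the other end of the socket."""
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, UCRED.size)
    return UCRED.unpack(raw)


def is_authorized(conn, allowed_uids=ALLOWED_UIDS):
    """Accept the connection only if the peer's UID is on the allowlist."""
    _pid, uid, _gid = peer_credentials(conn)
    return uid in allowed_uids


server_side, client_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    print(peer_credentials(server_side), is_authorized(server_side))
finally:
    server_side.close()
    client_side.close()
```

The credentials come from the kernel and not from the client, so no certificate is needed to establish who is calling.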
Plan B2: use UID for AuthZ
Great idea!
Let's discuss it on:
https://github.com/moby/moby/issues/35711
RFC: enable AuthZ on the unix socket
Wait!
ECS ships 1.12
GKE ships 1.13
Plan C: use a filtering proxy
services:
  dockerproxy:
    image: "datadog/docker-filter"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - safe-socket:/safe-socket:rw
    network_mode: "none"
  agent5:
    image: "datadog/docker-dd-agent:latest"
    volumes:
      - safe-socket:/var/run:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
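The filtering decision such a proxy makes can be sketched in Python as below. The rules inside the datadog/docker-filter image are not shown here, so the allowlist is a hypothetical example built from read-only Docker Engine API endpoints, and the function name is made up for illustration.

```python
import re

# Hypothetical allowlist of read-only Docker Engine API endpoints.
READ_ONLY_ENDPOINTS = [
    r"/containers/json",
    r"/containers/[0-9a-zA-Z_.-]+/json",
    r"/images/json",
    r"/volumes",
    r"/events",
    r"/version",
    r"/info",
]
_VERSION_PREFIX = re.compile(r"^/v\d+\.\d+")
_ALLOWED = [re.compile("^" + p + "$") for p in READ_ONLY_ENDPOINTS]


def is_allowed(method, url):
    """Allow only GET requests on the allowlisted endpoints."""
    if method.upper() != "GET":
        return False
    path = url.split("?", 1)[0]           # drop the query string
    path = _VERSION_PREFIX.sub("", path)  # drop an optional /v1.xx prefix
    return any(p.match(path) for p in _ALLOWED)


for method, url in [("GET", "/v1.24/containers/json?all=1"),
                    ("POST", "/containers/create"),
                    ("GET", "/containers/abc123/archive")]:
    print(method, url, is_allowed(method, url))
```

An allowlist fails closed: endpoints added to the API later stay blocked until someone reviews them.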
Thanks!
Questions?
xavier.vello@datadoghq.com
We are hiring!
Paris, NYC, Remote
http://jobs.datadoghq.com
LinuxLabs 2017 talk: Container monitoring challenges
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

LinuxLabs 2017 talk: Container monitoring challenges

  • 2. whois xvello.net
      Former Debian Maintainer
      (sporadic) KDE contributor
      Industrial Engineer for 5 years
      Software Engineer at Datadog, Container Monitoring Team
      xavier.vello@datadoghq.com
      http://lkdin.com/in/xaviervello
      Container monitoring challenges – Xavier Vello – Datadog 2 / 37
  • 3. whois datadoghq.com
  • 12. Our collection agent
      A 7-year-old codebase, 24 458 Python SLOCs
      87 integrations (32 274 Python SLOCs) in a different repo
  • 14. The Agent6 project
      A rewrite
        With containers support from the start
        With Autodiscovery at the core
      A rewrite in Go
        Reduce memory and CPU usage
        Better reliability and maintainability
        True concurrency
  • 15. Agenda
      Share our challenges and discoveries:
        Namespaces & cgroups
        DogStatsD traffic: using Unix Domain Sockets
        Secure usage of the Docker Socket?
  • 16. Namespaces & cgroups
      "The dead forwarder" case of 2017
      Network metrics
  • 17. "The dead forwarder" case
      supervisord supervisord -n -c /etc/dd-agent/supervisor.conf
      │
      ├─python agent.py foreground --use-local-forwarder
      │ └─{python}
      ├─python dogstatsd.py --use-local-forwarder
      │ └─{python}
      └─python ddagent.py
  • 21. "The dead forwarder" case
      Support case: "metrics not coming through although the collection works OK"
      Collector logs: errors connecting to localhost:17123
      Supervisord logs: unexpected exit of the forwarder, retries 5 times then gives up
      Forwarder logs: no error
  • 23. docker-dd-agent process tree
      supervisord supervisord -n -c /etc/dd-agent/supervisor.conf
      │
      ├─python agent.py foreground --use-local-forwarder
      │ └─{python}
      ├─python dogstatsd.py --use-local-forwarder
      │ └─{python}
      └─ (python ddagent.py is missing)
  • 25. mem cgroup
      Limit tells how much RAM / RAM+swap a cgroup can use
      If limit exceeded, oomkiller descends upon your cgroup
        Can kill the PID 1 (which leads to a container OOM exit)
        Or not... which can leave the container stuck in a non-working state
      Must pre-emptively monitor docker.mem.in_use and docker.mem.sw_in_use to see if that could happen
  • 26. Let's test it
      services:
        memleak-pid1:
          image: alpine:3.6
          command: "ash -c 'for i in `seq 1 10000000`; do true; done'"
          mem_limit: 10000000
          restart: on-failure
        memleak-forked:
          image: alpine:3.6
          command: "ash -c \"ash -c 'for i in `seq 1 10000000`; do true; done' & sleep 20\""
          mem_limit: 10000000
          restart: on-failure
  • 28. Result
      ubuntu@ci-xaviervello:~$ docker-compose -f composes/memlimit.compose up
      memleak-pid1_1 | Killed
      composes_memleak-pid1_1 exited with code 137
      memleak-pid1_1 | Killed
      memleak-forked_1 | Killed
      ...
      composes_memleak-forked_1 exited with code 0 ✅
      composes_memleak-pid1_1 exited with code 137
      What should I do?
        Ideally, run only one program per container
        Or use a robust supervisor, check exit codes
        Have a thorough healthcheck
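
The "check exit codes" advice can be sketched in Python. Exit code 137 is 128 + 9, the shell-style encoding of a process killed by SIGKILL, which is the signal the OOM killer sends. The helper names below (`describe_exit_code`, `run_and_propagate`) are hypothetical and not part of the Datadog agent or supervisord; this is only a minimal illustration of a wrapper that refuses to hide a killed child behind exit code 0.

```python
import signal
import subprocess


def describe_exit_code(code):
    """Interpret a shell/container-style exit code (0-255)."""
    if code == 0:
        return "clean exit"
    if code > 128:
        # Convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "signal %d" % signum
        return "killed by %s" % name
    return "error exit %d" % code


def run_and_propagate(argv):
    """Run a child process and return a shell-style exit code.

    Python reports death-by-signal as a negative return code (-N);
    it is converted to 128 + N so a kill is never reported as success.
    """
    proc = subprocess.run(argv)
    if proc.returncode < 0:
        return 128 - proc.returncode
    return proc.returncode
```

A supervisor built this way would make the memleak-forked case exit with 137 instead of 0, so the `restart: on-failure` policy could react to it.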
  • 29. net namespace
      Every container has its own network namespace
      Its own virtual eth0, bridged by the host
      Allows isolation and routing
      Allows us to collect per-container metrics
      $ docker exec agent5 cat /host/proc/30828/net/dev
      Inter-|
       face | bytes packets errs drop fifo frame compressed multicast
          lo: 6961740 50619 0 0 0 0 0 0
        eth0: 19558170 37932 0 0 0 0 0 0
  • 30. Getting the host's net stats
      $ cat /proc/net/dev
      Inter-|
       face | bytes packets errs drop fifo frame compressed multicast
      enp0s8: 61509004 104768 0 0 0 0 0 0
      enp0s3: 523131054 862084 0 0 0 0 0 0
          lo: 2952 46 0 0 0 0 0 0
      ...
      $ docker exec agent5 cat /proc/net/dev
      Inter-|
       face | bytes packets errs drop fifo frame compressed multicast
          lo: 6656783 48403 0 0 0 0 0 0
        eth0: 18699348 36269 0 0 0 0 0 0
  • 34. Getting the host's net stats
      $ docker exec agent5 cat /host/proc/net/dev
      Inter-|
       face | bytes packets errs drop fifo frame compressed multicast
          lo: 6697512 48702 0 0 0 0 0 0
        eth0: 18812503 36489 0 0 0 0 0 0
      We need to run with net=host
      Investigating: monitoring PID 1's net stats
      $ docker exec agent5 cat /host/proc/1/net/dev
      Inter-|
       face | bytes packets errs drop fifo frame compressed multicast
      enp0s8: 61509280 104771 0 0 0 0 0 0
      enp0s3: 523996193 864400 0 0 0 0 0 0
          lo: 2952 46 0 0 0 0 0 0
      ...
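
Reading a given PID's `net/dev` file, as in the commands above, can be sketched in Python. The function names and the `/host/proc` default are assumptions for illustration (the mount point matches the compose volumes shown in the slides); this is not the agent's actual collection code. Only the eight receive columns shown on the slides are kept.

```python
from pathlib import Path

# Receive-side columns, in the order they appear in /proc/<pid>/net/dev.
RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")


def parse_net_dev(text):
    """Parse the content of a net/dev file into {interface: {field: int}}."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # header lines contain no colon
        values = [int(v) for v in rest.split()]
        # zip() stops after the receive columns; transmit columns are ignored.
        stats[name.strip()] = dict(zip(RX_FIELDS, values))
    return stats


def read_net_dev(pid=1, procfs="/host/proc"):
    """Read the counters seen from the network namespace of `pid`.

    With the host's /proc mounted at `procfs`, pid 1 gives the host's
    interfaces and a container's PID gives that container's interfaces.
    """
    return parse_net_dev(Path(procfs, str(pid), "net", "dev").read_text())
```

Because `/proc/<pid>/net` reflects the namespace of that process, the same function covers both the per-container and the host case without requiring net=host.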
  • 35. DogStatsD traffic: using Unix Domain Sockets
  • 36. What is DogStatsD?
      Collection, tagging, aggregation of your custom metrics
  • 37. Issues with UDP
      Poor discoverability of the endpoint
      No error handling makes automatic detection impossible
      Network overhead (lots of small UDP packets)
      Detecting source container per IP is error-prone
  • 38. So, we're just trying to talk to localhost, right?
  • 39. Datagram Unix Sockets
      Host-local traffic
      Same semantics as UDP
      Smaller CPU overhead than iptables
      Endpoint is a socket file in a docker volume
  • 40. Container tagging is easy!
      unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_PASSCRED, 1)
      Unix credentials in ancillary data, provided by the kernel:
      type Ucred struct {
          Pid int32
          Uid uint32
          Gid uint32
      }
      Need to run in pid=host mode
  • 41. Container tagging is easy!
      You can slice and compare your custom metrics by:
        Availability zone
        Host
        Chef role
      as before, but also now by:
        ECS task
        Kubernetes deployment
        Docker labels
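
The SO_PASSCRED mechanism from slide 40 can be tried out with a short Python sketch (Linux only). It uses a socketpair instead of a socket file in a volume, and the function names and sample payload are made up for the demonstration; the real agent is written in Go and its code is not shown in the slides.

```python
import os
import socket
import struct

# struct ucred as delivered by the kernel: pid (int32), uid and gid (uint32).
UCRED = struct.Struct("iII")


def recv_with_credentials(sock, bufsize=8192):
    """Receive one datagram plus the sender's (pid, uid, gid), if attached."""
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_SPACE(UCRED.size))
    for level, kind, payload in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_CREDENTIALS:
            return data, UCRED.unpack(payload[:UCRED.size])
    return data, None


def demo():
    """Send one datagram to ourselves and read back our own credentials."""
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # Ask the kernel to attach sender credentials to incoming datagrams.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        client.send(b"page.views:1|c")
        return recv_with_credentials(server)
    finally:
        server.close()
        client.close()
```

The credentials are filled in by the kernel, not by the client, so the sender cannot spoof them. The PID is expressed in the receiver's PID namespace, which is why the slide asks for pid=host: only then does it identify a process the agent can map back to a container.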
  • 42. Secure usage of the Docker Socket?
  • 44. Secure Docker Socket?
      Monitoring is intrusive, but we strive to make it secure and seamless
      We need to list and inspect containers, images, volumes
      Docker does not provide a monitoring interface
      It provides monitoring endpoints in its management interface
  • 47. Plan A: use AuthN and AuthZ
      Pros:
        Already upstream since 1.11
        Solid authentication mechanism
        Similar to the Kubernetes AuthN/AuthZ system
      Cons:
        Need to set up a PKI
        Requires a third-party AuthZ plugin
        /var/run/docker.sock unusable (it's HTTP-only for now)
  • 50. Plan B1: HTTPS on UDS
      Pros:
        Minimal changes in Moby
      Cons:
        Still need to set up a PKI
        Still breaks most orchestrators
  • 53. Plan B2: use UID for AuthZ
      Pros:
        Minimal changes
        No PKI
        Transparent to orchestrators (just whitelist unix:root)
      Cons:
        Linux-specific logic
        Different code path for AuthN
  • 55. Plan B2: use UID for AuthZ
      Great idea! Let's discuss it on:
        https://github.com/moby/moby/issues/35711
        RFC: enable AuthZ on the unix socket
      Wait!
        ECS ships 1.12
        GKE ships 1.13
  • 57. Plan C: use a filtering proxy
      services:
        dockerproxy:
          image: "datadog/docker-filter"
          volumes:
            - /var/run/docker.sock:/var/run/docker.sock:ro
            - safe-socket:/safe-socket:rw
          network_mode: "none"
        agent5:
          image: "datadog/docker-dd-agent:latest"
          volumes:
            - safe-socket:/var/run:ro
            - /proc/:/host/proc/:ro
            - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
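
The filtering decision such a proxy has to make can be sketched in Python. The slides do not describe how the `datadog/docker-filter` image works, so the allowlist below is purely an assumption: it permits only read-only GET requests on a few Docker Engine API paths matching the stated need to list and inspect containers, images and volumes. The name `is_allowed` is hypothetical.

```python
import re

# Hypothetical allowlist: read-only endpoints, with an optional
# API version prefix such as /v1.24.
ALLOWED_PATHS = re.compile(
    r"^(/v\d+\.\d+)?/("
    r"containers/json"                    # list containers
    r"|containers/[0-9A-Za-z_.-]+/json"   # inspect one container
    r"|images/json"                       # list images
    r"|volumes"                           # list volumes
    r"|version"                           # daemon version
    r")$"
)


def is_allowed(method, path):
    """Return True if the request may be forwarded to the real Docker socket."""
    path_only = path.split("?", 1)[0]  # ignore the query string
    return method.upper() == "GET" and ALLOWED_PATHS.match(path_only) is not None
```

Anything that creates, starts, execs into or deletes a container falls outside the allowlist and would be rejected by the proxy, so the agent only ever sees the `safe-socket` volume rather than the full management interface.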
  • 59. We are hiring!
      Paris, NYC, Remote
      http://jobs.datadoghq.com