SlideShare a Scribd company logo
1 of 15
Download to read offline
Usernetes Gen2
Kubernetes in Rootless Docker, with Multiple Nodes
Akihiro Suda (NTT)
akihiro.suda.cz@hco.ntt.co.jp
Container Plumbing Days (Apr 15, 2024)
https://github.com/rootless-containers/usernetes
Usernetes Gen2
Kubernetes in Rootless Docker, with Multiple Nodes
Akihiro Suda (NTT)
akihiro.suda.cz@hco.ntt.co.jp
Container Plumbing Days (Apr 15, 2024)
https://github.com/rootless-containers/usernetes
Podman, and nerdctl
• Puts container runtimes (as well as containers) in a user namespace
– UserNS: Linux kernel’s feature that maps a non-root user to a fake root
(the root privilege is limited inside the namespace)
• Can mitigate potential vulnerabilities of the runtimes
– No access to read/write other users’ files
– No access to modify the kernel (e.g., to inject invisible malware)
– No access to modify the firmware
– No ARP spoofing
– No DNS spoofing
• Also useful for shared hosts (High-performance Computing, etc.)
– Works with GPU too
3
[Introduction] Rootless Containers
e.g., runc breakout
CVE-2024-21626
(2024-01-31)
• Linux kernel’s feature to remap UIDs and GIDs
• UID=1000 gains fake root privileges (UID=0) that are enough to create
containers
• The privileges are limited inside the namespace
• No privilege for setting up vEth pairs with “real” IP addresses;
user mode TCP/IP (e.g., slirp4netns) is used instead
• Also notorious as the culprit of the several kernel CVEs,
but at least it is more secure than just running everything as the root
– Ubuntu 24.04 disables UserNS by default with the allowlist (AppArmor profiles)
4
[Introduction] User namespaces
# /etc/subuid
1000:100000:65536
0 1 65536
0 1000 100000 165535
5
[Introduction] Network namespaces
(vEth)
eth0: 172.17.0.2
(Bridge)
docker0: 172.17.0.1
(TAP)
tap0: 10.0.2.100
(vEth) (vEth)
Network namespaces
(vEth)
eth0: 172.17.0.3
(Physical Ethernet)
eth0: 192.168.0.42
(slirp4netns)
virtual IP:10.0.2.2
Network namespace + User namespace
Ethernet packets
Unprivileged socket
syscalls
• Usernetes: Rootless Kubernetes, since 2018
https://github.com/rootless-containers/usernetes
• As old as Rootless Docker (pre-release at that time) and Rootless
Podman
• The changes to upstream was merged in Kubernetes v1.22 (2021)
– Feature gate: KubeletInUserNamespace (Alpha)
– The feature gate is also used by kind, minikube, k3s, etc.
• The first generation (“Gen1”, 2018-2023) of Usernetes didn’t gain
much popularity due to its complexity (”The Hard Way”)
6
Rootless Kubernetes
7
KubeletInUserNamespace feature gate
• The gate is slightly misnomer; as it requires CRI, OCI, CNI, and
kube-proxy to be in the same UserNS too
• Quite “boring” gate to allow trivial permission errors
https://github.com/search?q=repo%3Akubernetes%2Fkubernetes%20KubeletInUserNamespace&type=code
– dmesg
– sysctl -w vm.overcommit_memory
– etc.
• The UserNS has to be created by an external runtime
– Usernetes Gen1: RootlessKit
– Usernetes Gen2: Rootless Docker/Podman/nerdctl
– LXD/Incus can be used too
Gen 1 (2018-2023) Gen 2 (2023-)
Host dependency RootlessKit Rootless Docker,
Rootless Podman, or
Rootless nerdctl
(contaiNERD CTL)
Supports kubeadm No Yes
Supports multi-node Yes, but practically No,
due to complexity
Yes
Supports hostPath
volumes
Yes Yes, for most paths,
but needs an extra config
8
Usernetes Gen 1 vs Gen 2
”The Hard Way”
Similar to `kind` and minikube,
but supports real multi-node
9
File layout
• Makefile
– Defines targets like make up to wrap docker compose up, etc.
• Dockerfile
– FROM kindest/node (kind’s node image) with a few additional ADD and RUN
• docker-compose.yaml
– Just defines a single node container
– Currently, node ports, etc. have to be statically defined here
• kubeadm-config.yaml
– Configures feature gates, CIDRs, TLS SANs, etc.
Everything is just a plain text file,
for ease of customization
10
Usage
# Bootstrap the first node
make up
make kubeadm-init
make install-flannel
# Enable kubectl
make kubeconfig
export KUBECONFIG=$(pwd)/kubeconfig
kubectl get pods -A
# Multi-node
make join-command
scp join-command another-host:~/usernetes
ssh another-host make -C ~/usernetes up kubeadm-join
make sync-external-ip
11
Multi-node Network
• VXLAN is known to work
– Just kubectl -f kube-flannel.yaml
• “External IP” is used, as the containerized kubelet’s IP is not
accessible from other nodes
– kubelet is launched with --cloud-provider=external
– node.status.addresses is dynamically patched with kubectl patch node
– node is also annotated with
flannel.alpha.coreos.com/public-ip-overwrite
– UDP checksums are recomputed with
ethtool -K flannel.1 tx-checksum-ip-generic off
12
Experimental: network acceleration
• Bypass4netns allows bypassing slirp4netns to eliminate the overhead
caused by the usermode TCP/IP
https://github.com/rootless-containers/bypass4netns
• Captures socket syscalls inside the NetNS, reconstructs the FDs outside the
NetNS, and replaces the FDs inside the NetNS, using seccomp_unotify(2)
• As fast as the host network (e.g., 1.28 Gbps vs 49.9 Gbps)
• Bypass4netns supports both connect(2) and bind(2),
but Usernetes only supports accelerating connect(2) currently
– bind(2) is already fast anyway
• Available for nerdctl
13
Experimental: network acceleration
• Pod-to-Pod communications across multiple nodes are not
accelerated yet
– VXLAN packets are generated by the kernel itself and cannot be intercepted
via seccomp_unotify(2)
– NodePorts can be still accelerated, as it does not incur VXLAN packets
Node
Pod Pod
Node
NodePort Pod
Internet
VXLAN
Fast
Slow
• iperf3 (TCP) benchmark across multiple nodes
14
Experimental: network acceleration
slirp4netns bypass4netns
Pod → Pod (same node) 37.6 Gbps 37.6 Gbps
Pod → Pod (different node) 1.40 Gbps 1.41 Gbps
Pod → NodePort (same node) 1.28 Gbps 49.9 Gbps
Pod → NodePort (different node) 1.47 Gbps 9.53 Gbps
Host → NodePort (same node) 50.2 Gbps 49.4 Gbps
Host → NodePort (different node) 9.53 Gbps 9.52 Gbps
IaaS: Amazon EC2 (m7i.2xlarge)
Versions: Ubuntu 22.04, nerdctl v2.0.0-beta.4, bypass4netns v0.4.1, Usernetes Gen2-v20240410.0 (Kubernetes v1.29)
https://github.com/rootless-containers/usernetes/discussions/329
15
Future works
• Integrate bypass4netns into Docker and Podman too
• Support accelerating Pod-to-Pod communications across different nodes,
perhaps with a sidecar proxy that would forward packets to NodePorts
• Support dynamic port forwarding
– Ports are currently statically defined in docker-compose.yaml
– If docker container update could support modifying port forwards, Usernetes coud
just watch Kubernetes service events and update the Docker ports accordingly
https://github.com/docker/cli/issues/5013
• Help other Kubernetes distributions to support rootless
– k3s has been supporting rootless since 2019, but still lacks support for multi-node setup
– Are Podman folks interested in running OKD inside Rootless Podman?

More Related Content

Similar to 20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Docker, with Multiple Nodes.pdf

4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes
4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes
4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes
Juraj Hantak
 

Similar to 20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Docker, with Multiple Nodes.pdf (20)

Open stackaustinmeetupsept21
Open stackaustinmeetupsept21Open stackaustinmeetupsept21
Open stackaustinmeetupsept21
 
Workshop : 45 minutes pour comprendre Docker avec Jérôme Petazzoni
Workshop : 45 minutes pour comprendre Docker avec Jérôme PetazzoniWorkshop : 45 minutes pour comprendre Docker avec Jérôme Petazzoni
Workshop : 45 minutes pour comprendre Docker avec Jérôme Petazzoni
 
Introduction to Docker, December 2014 "Tour de France" Edition
Introduction to Docker, December 2014 "Tour de France" EditionIntroduction to Docker, December 2014 "Tour de France" Edition
Introduction to Docker, December 2014 "Tour de France" Edition
 
From dev to prod: Kubernetes on AWS (short ver.)
From dev to prod: Kubernetes on AWS (short ver.)From dev to prod: Kubernetes on AWS (short ver.)
From dev to prod: Kubernetes on AWS (short ver.)
 
Rootless Containers
Rootless ContainersRootless Containers
Rootless Containers
 
Docker and kubernetes
Docker and kubernetesDocker and kubernetes
Docker and kubernetes
 
Docker Ecosystem on Azure
Docker Ecosystem on AzureDocker Ecosystem on Azure
Docker Ecosystem on Azure
 
Docker Security Overview
Docker Security OverviewDocker Security Overview
Docker Security Overview
 
Docker Multi Host Networking, Rachit Arora, IBM
Docker Multi Host Networking, Rachit Arora, IBMDocker Multi Host Networking, Rachit Arora, IBM
Docker Multi Host Networking, Rachit Arora, IBM
 
Containerize! Between Docker and Jube.
Containerize! Between Docker and Jube.Containerize! Between Docker and Jube.
Containerize! Between Docker and Jube.
 
Scaling Docker with Kubernetes
Scaling Docker with KubernetesScaling Docker with Kubernetes
Scaling Docker with Kubernetes
 
Network stack personality in Android phone - netdev 2.2
Network stack personality in Android phone - netdev 2.2Network stack personality in Android phone - netdev 2.2
Network stack personality in Android phone - netdev 2.2
 
Podman rootless containers
Podman rootless containersPodman rootless containers
Podman rootless containers
 
Docker 1.11 Presentation
Docker 1.11 PresentationDocker 1.11 Presentation
Docker 1.11 Presentation
 
Introduction to Docker at the Azure Meet-up in New York
Introduction to Docker at the Azure Meet-up in New YorkIntroduction to Docker at the Azure Meet-up in New York
Introduction to Docker at the Azure Meet-up in New York
 
[Draft] Fast Prototyping with DPDK and eBPF in Containernet
[Draft] Fast Prototyping with DPDK and eBPF in Containernet[Draft] Fast Prototyping with DPDK and eBPF in Containernet
[Draft] Fast Prototyping with DPDK and eBPF in Containernet
 
4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes
4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes
4. CNCF kubernetes Comparison of-existing-cni-plugins-for-kubernetes
 
Comparison of existing cni plugins for kubernetes
Comparison of existing cni plugins for kubernetesComparison of existing cni plugins for kubernetes
Comparison of existing cni plugins for kubernetes
 
DockerCon EU 2018 Workshop: Container Networking for Swarm and Kubernetes in ...
DockerCon EU 2018 Workshop: Container Networking for Swarm and Kubernetes in ...DockerCon EU 2018 Workshop: Container Networking for Swarm and Kubernetes in ...
DockerCon EU 2018 Workshop: Container Networking for Swarm and Kubernetes in ...
 
The State of Linux Containers
The State of Linux ContainersThe State of Linux Containers
The State of Linux Containers
 

More from Akihiro Suda

More from Akihiro Suda (20)

20240321 [KubeCon EU Pavilion] Lima.pdf_
20240321 [KubeCon EU Pavilion] Lima.pdf_20240321 [KubeCon EU Pavilion] Lima.pdf_
20240321 [KubeCon EU Pavilion] Lima.pdf_
 
20240320 [KubeCon EU Pavilion] containerd.pdf
20240320 [KubeCon EU Pavilion] containerd.pdf20240320 [KubeCon EU Pavilion] containerd.pdf
20240320 [KubeCon EU Pavilion] containerd.pdf
 
[KubeConNA2023] Lima pavilion
[KubeConNA2023] Lima pavilion[KubeConNA2023] Lima pavilion
[KubeConNA2023] Lima pavilion
 
[KubeConNA2023] containerd pavilion
[KubeConNA2023] containerd pavilion[KubeConNA2023] containerd pavilion
[KubeConNA2023] containerd pavilion
 
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
 
[CNCF TAG-Runtime] Usernetes Gen2
[CNCF TAG-Runtime] Usernetes Gen2[CNCF TAG-Runtime] Usernetes Gen2
[CNCF TAG-Runtime] Usernetes Gen2
 
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
 
The internals and the latest trends of container runtimes
The internals and the latest trends of container runtimesThe internals and the latest trends of container runtimes
The internals and the latest trends of container runtimes
 
[KubeConEU2023] Lima pavilion
[KubeConEU2023] Lima pavilion[KubeConEU2023] Lima pavilion
[KubeConEU2023] Lima pavilion
 
[KubeConEU2023] containerd pavilion
[KubeConEU2023] containerd pavilion[KubeConEU2023] containerd pavilion
[KubeConEU2023] containerd pavilion
 
[Container Plumbing Days 2023] Why was nerdctl made?
[Container Plumbing Days 2023] Why was nerdctl made?[Container Plumbing Days 2023] Why was nerdctl made?
[Container Plumbing Days 2023] Why was nerdctl made?
 
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
[FOSDEM2023] Bit-for-bit reproducible builds with Dockerfile
 
[CNCF TAG-Runtime 2022-10-06] Lima
[CNCF TAG-Runtime 2022-10-06] Lima[CNCF TAG-Runtime 2022-10-06] Lima
[CNCF TAG-Runtime 2022-10-06] Lima
 
[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS[KubeCon EU 2022] Running containerd and k3s on macOS
[KubeCon EU 2022] Running containerd and k3s on macOS
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
[Paris Container Day 2021] nerdctl: yet another Docker & Docker Compose imple...
 
[Docker Tokyo #35] Docker 20.10
[Docker Tokyo #35] Docker 20.10[Docker Tokyo #35] Docker 20.10
[Docker Tokyo #35] Docker 20.10
 
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
 
DockerとPodmanの比較
DockerとPodmanの比較DockerとPodmanの比較
DockerとPodmanの比較
 
[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive
 

Recently uploaded

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 

Recently uploaded (20)

WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
The mythical technical debt. (Brooke, please, forgive me)
The mythical technical debt. (Brooke, please, forgive me)The mythical technical debt. (Brooke, please, forgive me)
The mythical technical debt. (Brooke, please, forgive me)
 
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public AdministrationWSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
WSO2CON 2024 - Designing Event-Driven Enterprises: Stories of Transformation
WSO2CON 2024 - Designing Event-Driven Enterprises: Stories of TransformationWSO2CON 2024 - Designing Event-Driven Enterprises: Stories of Transformation
WSO2CON 2024 - Designing Event-Driven Enterprises: Stories of Transformation
 
WSO2Con2024 - Organization Management: The Revolution in B2B CIAM
WSO2Con2024 - Organization Management: The Revolution in B2B CIAMWSO2Con2024 - Organization Management: The Revolution in B2B CIAM
WSO2Con2024 - Organization Management: The Revolution in B2B CIAM
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2CON 2024 - OSU & WSO2: A Decade Journey in Integration & Innovation
WSO2CON 2024 - OSU & WSO2: A Decade Journey in Integration & InnovationWSO2CON 2024 - OSU & WSO2: A Decade Journey in Integration & Innovation
WSO2CON 2024 - OSU & WSO2: A Decade Journey in Integration & Innovation
 
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
WSO2Con2024 - Navigating the Digital Landscape: Transforming Healthcare with ...
 
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
WSO2Con2024 - Software Delivery in Hybrid Environments
WSO2Con2024 - Software Delivery in Hybrid EnvironmentsWSO2Con2024 - Software Delivery in Hybrid Environments
WSO2Con2024 - Software Delivery in Hybrid Environments
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2CON 2024 - Building a Digital Government in Uganda
WSO2CON 2024 - Building a Digital Government in UgandaWSO2CON 2024 - Building a Digital Government in Uganda
WSO2CON 2024 - Building a Digital Government in Uganda
 

20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Docker, with Multiple Nodes.pdf

  • 1. Usernetes Gen2 Kubernetes in Rootless Docker, with Multiple Nodes Akihiro Suda (NTT) akihiro.suda.cz@hco.ntt.co.jp Container Plumbing Days (Apr 15, 2024) https://github.com/rootless-containers/usernetes
  • 2. Usernetes Gen2 Kubernetes in Rootless Docker, with Multiple Nodes Akihiro Suda (NTT) akihiro.suda.cz@hco.ntt.co.jp Container Plumbing Days (Apr 15, 2024) https://github.com/rootless-containers/usernetes Podman, and nerdctl
  • 3. • Puts container runtimes (as well as containers) in a user namespace – UserNS: Linux kernel’s feature that maps a non-root user to a fake root (the root privilege is limited inside the namespace) • Can mitigate potential vulnerabilities of the runtimes – No access to read/write other users’ files – No access to modify the kernel (e.g., to inject invisible malware) – No access to modify the firmware – No ARP spoofing – No DNS spoofing • Also useful for shared hosts (High-performance Computing, etc.) – Works with GPU too 3 [Introduction] Rootless Containers e.g., runc breakout CVE-2024-21626 (2024-01-31)
  • 4. • Linux kernel’s feature to remap UIDs and GIDs • UID=1000 gains fake root privileges (UID=0) that are enough to create containers • The privileges are limited inside the namespace • No privilege for setting up vEth pairs with “real” IP addresses; user mode TCP/IP (e.g., slirp4netns) is used instead • Also notorious as the culprit of the several kernel CVEs, but at least it is more secure than just running everything as the root – Ubuntu 24.04 disables UserNS by default with the allowlist (AppArmor profiles) 4 [Introduction] User namespaces # /etc/subuid 1000:100000:65536 0 1 65536 0 1000 100000 165535
  • 5. 5 [Introduction] Network namespaces (vEth) eth0: 172.17.0.2 (Bridge) docker0: 172.17.0.1 (TAP) tap0: 10.0.2.100 (vEth) (vEth) Network namespaces (vEth) eth0: 172.17.0.3 (Physical Ethernet) eth0: 192.168.0.42 (slirp4netns) virtual IP:10.0.2.2 Network namespace + User namespace Ethernet packets Unprivileged socket syscalls
  • 6. • Usernetes: Rootless Kubernetes, since 2018 https://github.com/rootless-containers/usernetes • As old as Rootless Docker (pre-release at that time) and Rootless Podman • The changes to upstream was merged in Kubernetes v1.22 (2021) – Feature gate: KubeletInUserNamespace (Alpha) – The feature gate is also used by kind, minikube, k3s, etc. • The first generation (“Gen1”, 2018-2023) of Usernetes didn’t gain much popularity due to its complexity (”The Hard Way”) 6 Rootless Kubernetes
  • 7. 7 KubeletInUserNamespace feature gate • The gate is slightly misnomer; as it requires CRI, OCI, CNI, and kube-proxy to be in the same UserNS too • Quite “boring” gate to allow trivial permission errors https://github.com/search?q=repo%3Akubernetes%2Fkubernetes%20KubeletInUserNamespace&type=code – dmesg – sysctl -w vm.overcommit_memory – etc. • The UserNS has to be created by an external runtime – Usernetes Gen1: RootlessKit – Usernetes Gen2: Rootless Docker/Podman/nerdctl – LXD/Incus can be used too
  • 8. Gen 1 (2018-2023) Gen 2 (2023-) Host dependency RootlessKit Rootless Docker, Rootless Podman, or Rootless nerdctl (contaiNERD CTL) Supports kubeadm No Yes Supports multi-node Yes, but practically No, due to complexity Yes Supports hostPath volumes Yes Yes, for most paths, but needs an extra config 8 Usernetes Gen 1 vs Gen 2 ”The Hard Way” Similar to `kind` and minikube, but supports real multi-node
  • 9. 9 File layout • Makefile – Defines targets like make up to wrap docker compose up, etc. • Dockerfile – FROM kindest/node (kind’s node image) with a few additional ADD and RUN • docker-compose.yaml – Just defines a single node container – Currently, node ports, etc. have to be statically defined here • kubeadm-config.yaml – Configures feature gates, CIDRs, TLS SANs, etc. Everything is just a plain text file, for ease of customization
  • 10. 10 Usage # Bootstrap the first node make up make kubeadm-init make install-flannel # Enable kubectl make kubeconfig export KUBECONFIG=$(pwd)/kubeconfig kubectl get pods -A # Multi-node make join-command scp join-command another-host:~/usernetes ssh another-host make -C ~/usernetes up kubeadm-join make sync-external-ip
  • 11. 11 Multi-node Network • VXLAN is known to work – Just kubectl -f kube-flannel.yaml • “External IP” is used, as the containerized kubelet’s IP is not accessible from other nodes – kubelet is launched with --cloud-provider=external – node.status.addresses is dynamically patched with kubectl patch node – node is also annotated with flannel.alpha.coreos.com/public-ip-overwrite – UDP checksums are recomputed with ethtool -K flannel.1 tx-checksum-ip-generic off
  • 12. 12 Experimental: network acceleration • Bypass4netns allows bypassing slirp4netns to eliminate the overhead caused by the usermode TCP/IP https://github.com/rootless-containers/bypass4netns • Captures socket syscalls inside the NetNS, reconstructs the FDs outside the NetNS, and replaces the FDs inside the NetNS, using seccomp_unotify(2) • As fast as the host network (e.g., 1.28 Gbps vs 49.9 Gbps) • Bypass4netns supports both connect(2) and bind(2), but Usernetes only supports accelerating connect(2) currently – bind(2) is already fast anyway • Available for nerdctl
  • 13. 13 Experimental: network acceleration • Pod-to-Pod communications across multiple nodes are not accelerated yet – VXLAN packets are generated by the kernel itself and cannot be intercepted via seccomp_unotify(2) – NodePorts can be still accelerated, as it does not incur VXLAN packets Node Pod Pod Node NodePort Pod Internet VXLAN Fast Slow
  • 14. • iperf3 (TCP) benchmark across multiple nodes 14 Experimental: network acceleration slirp4netns bypass4netns Pod → Pod (same node) 37.6 Gbps 37.6 Gbps Pod → Pod (different node) 1.40 Gbps 1.41 Gbps Pod → NodePort (same node) 1.28 Gbps 49.9 Gbps Pod → NodePort (different node) 1.47 Gbps 9.53 Gbps Host → NodePort (same node) 50.2 Gbps 49.4 Gbps Host → NodePort (different node) 9.53 Gbps 9.52 Gbps IaaS: Amazon EC2 (m7i.2xlarge) Versions: Ubuntu 22.04, nerdctl v2.0.0-beta.4, bypass4netns v0.4.1, Usernetes Gen2-v20240410.0 (Kubernetes v1.29) https://github.com/rootless-containers/usernetes/discussions/329
  • 15. 15 Future works • Integrate bypass4netns into Docker and Podman too • Support accelerating Pod-to-Pod communications across different nodes, perhaps with a sidecar proxy that would forward packets to NodePorts • Support dynamic port forwarding – Ports are currently statically defined in docker-compose.yaml – If docker container update could support modifying port forwards, Usernetes coud just watch Kubernetes service events and update the Docker ports accordingly https://github.com/docker/cli/issues/5013 • Help other Kubernetes distributions to support rootless – k3s has been supporting rootless since 2019, but still lacks support for multi-node setup – Are Podman folks interested in running OKD inside Rootless Podman?