Cilium: Fast IPv6 Container Networking with BPF and XDP
LinuxCon 2016, Toronto
Thomas Graf (@tgraf__)
Kernel, Cilium & Open vSwitch Team
Noiro Networks (Cisco)
The Cilium Experiment
Scale
– Addressing: IPv6?
– Policy: Linear lists don’t scale. What is the alternative?
Extensibility
– Can we be as extensible as userspace networking while staying in the kernel?
Simplicity
– What is an appropriate abstraction away from traditional networking?
Performance
– Do we sacrifice performance in the process?
Scaling Addressing
Solution:
– IPv6 addresses with a host-scope allocator
Pros:
– Everything is globally addressable
– No NAT
– Path to ILA for mobility of tasks
Cons:
– Legacy IPv4-only endpoints/applications
→ Optional IPv4 addressing (+ NAT)
→ NAT46: expose IPv6-only applications to IPv4-only clients
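The host-scope allocator idea can be made concrete with a small sketch. The prefix layout, node/endpoint split, and function name below are illustrative assumptions, not Cilium's actual scheme: each node derives endpoint addresses locally from a cluster prefix plus its own node ID, so addresses come out globally unique without a central IPAM service or NAT.

```c
/* Hypothetical host-scope IPv6 allocator: every node owns a slice of a
 * cluster prefix and can hand out endpoint addresses on its own.
 * Layout (assumed): beef::/64 | 32-bit node ID | 32-bit endpoint ID. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

static const uint8_t cluster_prefix[8] = { 0xbe, 0xef, 0, 0, 0, 0, 0, 0 };

static void alloc_endpoint_addr(uint32_t node_id, uint32_t endpoint_id,
                                struct in6_addr *addr)
{
	memcpy(addr->s6_addr, cluster_prefix, 8);
	/* bytes 8..11: node ID, bytes 12..15: endpoint ID */
	addr->s6_addr[8]  = node_id >> 24;
	addr->s6_addr[9]  = node_id >> 16;
	addr->s6_addr[10] = node_id >> 8;
	addr->s6_addr[11] = node_id;
	addr->s6_addr[12] = endpoint_id >> 24;
	addr->s6_addr[13] = endpoint_id >> 16;
	addr->s6_addr[14] = endpoint_id >> 8;
	addr->s6_addr[15] = endpoint_id;
}

int main(void)
{
	struct in6_addr addr;
	char buf[INET6_ADDRSTRLEN];

	/* With these (made-up) IDs, the scheme reproduces the style of
	 * address used in the benchmark slides: beef::aa0:18:ee5e */
	alloc_endpoint_addr(0x00000aa0, 0x0018ee5e, &addr);
	inet_ntop(AF_INET6, &addr, buf, sizeof(buf));
	printf("endpoint address: %s\n", buf);
	return 0;
}
```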
IPv6 Status in Kubernetes/Docker
● Kubernetes (CNI): Almost there
– Pods are IPv6-only capable as of k8s 1.3.6 (PR23317, PR26438, PR26439, PR26441)
– kube-proxy (services) not done yet
● Docker (libnetwork): Working on it
– PR826 – “Make IPv6 Great Again”, not merged yet
Scaling Policy
[Diagram: a service built from LB, Frontend, and Backend tiers, each running several container replicas]
Scaling Policy
[Diagram: the same tiers with the traffic paths a policy has to allow between LB, Frontend, and Backend]
Policy: NetworkPolicy – the Kubernetes policy spec as discussed and standardized in the Networking SIG
https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/network-policy.md
Scaling Policy
[Diagram: the LB/Frontend/Backend tiers duplicated per environment, QA and Prod, with policy keeping the two environments separate]
Scaling Policy
[Diagram: the policy expressed over container labels, with “requires” relationships tying the tiers to the QA/Prod environment labels. The “requires” construct is a Cilium extension, not yet part of the Kubernetes spec]
Scaling Policy Enforcement
[Diagram: the label-based policy compiled down to numeric IDs via a distributed label ID table; each distinct label combination (LB/FE/BE × QA/Prod) maps to an ID in the range 10–15]
Policy enforcement cost becomes a single hashtable lookup, regardless of the number of containers or the policy complexity.
The ID is carried in the packet as metadata to provide the security context at the destination host.
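A minimal sketch of what that single lookup can look like inside a tc-loaded BPF program. The map name, sizes, and the struct bpf_elf_map layout are assumptions based on the iproute2 ELF loader, not Cilium's actual code:

```c
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

/* Map description as consumed by tc/iproute2's ELF loader
 * (copied here to keep the sketch self-contained). */
struct bpf_elf_map {
	__u32 type;
	__u32 size_key;
	__u32 size_value;
	__u32 max_elem;
	__u32 flags;
	__u32 id;
	__u32 pinning;
};

/* Keyed by the sender's numeric label ID; presence means "allowed". */
struct bpf_elf_map __section("maps") policy_map = {
	.type       = BPF_MAP_TYPE_HASH,
	.size_key   = sizeof(__u32),
	.size_value = sizeof(__u32),
	.max_elem   = 65536,
};

static void *(*map_lookup_elem)(void *map, const void *key) =
	(void *) BPF_FUNC_map_lookup_elem;

/* src_label is the security context carried in the packet metadata;
 * one O(1) lookup yields the verdict, however many rules exist. */
static inline int policy_allows(__u32 src_label)
{
	return map_lookup_elem(&policy_map, &src_label) != NULL;
}
```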
Extensibility
BPF – Berkeley Packet Filter
[Diagram: the BPF pipeline. In userspace, LLVM/clang compiles restricted C source code to BPF bytecode. In the kernel, the verifier checks the program and the JIT translates it to native instructions (e.g. add eax,edx; shl eax,2). The program attaches to a netdevice at TC ingress and TC egress, between the device and the network stack/sockets]
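To make the pipeline concrete, here is a minimal sketch of a tc-attachable program (illustrative, not Cilium code): restricted C in, BPF bytecode out, verified and JIT-compiled on attach.

```c
/* pass.c – minimal tc BPF program.
 *
 * Compile and attach (clsact requires kernel >= 4.5):
 *   clang -O2 -target bpf -c pass.c -o pass.o
 *   tc qdisc add dev eth0 clsact
 *   tc filter add dev eth0 ingress bpf da obj pass.o sec classifier
 */
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("classifier")
int ingress_main(struct __sk_buff *skb)
{
	/* A real program parses and rewrites the packet here. */
	return 0; /* TC_ACT_OK: hand the packet on unmodified */
}

char __license[] __section("license") = "GPL";
```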
BPF Maps & Perf Ring Buffer
[Diagram: a BPF program in the kernel shares data with userspace processes through BPF maps (hashtable and array), tail-calls into other BPF programs, and streams data to userspace through the perf ring buffer]
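A sketch of the kernel/userspace sharing the diagram shows, with illustrative names: the program bumps a counter in an array map on every packet, and a userspace process can read it at any time through the bpf(2) syscall.

```c
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

/* Map layout expected by tc/iproute2's ELF loader. */
struct bpf_elf_map {
	__u32 type, size_key, size_value, max_elem, flags, id, pinning;
};

struct bpf_elf_map __section("maps") pkt_count = {
	.type       = BPF_MAP_TYPE_ARRAY,
	.size_key   = sizeof(__u32),
	.size_value = sizeof(__u64),
	.max_elem   = 1,		/* a single global counter */
};

static void *(*map_lookup_elem)(void *map, const void *key) =
	(void *) BPF_FUNC_map_lookup_elem;

__section("classifier")
int count_main(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *val = map_lookup_elem(&pkt_count, &key);

	if (val)
		__sync_fetch_and_add(val, 1);	/* atomic add (BPF_XADD) */
	return 0;				/* TC_ACT_OK */
}
```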
BPF Features (as of Aug 2016)
● Efficient data sharing via maps
– Per-CPU/global arrays & hashtables
● Rewrite packet content
● Extend/trim packet size
● Redirect to other net_device
● Attachment of tunnel metadata
● Cgroups integration
● Access to high performance perf ring buffer
● …
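A sketch combining two of the features above, rewriting packet content and redirecting to another net_device, using the 2016-era raw helper-call style. The MAC address and ifindex are hardcoded placeholders for illustration:

```c
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

static int (*skb_store_bytes)(void *ctx, int off, const void *from,
			      int len, int flags) =
	(void *) BPF_FUNC_skb_store_bytes;
static int (*redirect)(int ifindex, __u32 flags) =
	(void *) BPF_FUNC_redirect;

__section("classifier")
int fwd_main(struct __sk_buff *skb)
{
	__u8 dmac[6] = { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 };

	/* Rewrite the destination MAC at offset 0 of the Ethernet header. */
	skb_store_bytes(skb, 0, dmac, sizeof(dmac), 0);

	/* Redirect to the device with ifindex 4 (placeholder); the helper's
	 * return value is the TC action (TC_ACT_REDIRECT). */
	return redirect(4, 0);
}
```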
XDP – eXpress Data Path
[Diagram: the same LLVM/clang → bytecode → verifier + JIT pipeline, but the program runs inside the driver with direct access to the DMA buffer, before the netdevice, the network stack, and sockets]
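A minimal XDP sketch (illustrative): the program sees the raw DMA buffer before any skb is allocated. This one drops everything; returning XDP_PASS instead would hand packets to the regular stack.

```c
/* drop.c – attach with a recent iproute2 (kernel >= 4.8):
 *   clang -O2 -target bpf -c drop.c -o drop.o
 *   ip link set dev eth0 xdp obj drop.o
 */
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

__section("prog")
int xdp_drop_all(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char __license[] __section("license") = "GPL";
```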
Cilium Architecture
[Diagram: orchestration systems talk to the Cilium Daemon through plugins. The daemon generates code from the policy repository and injects BPF bytecode into per-container BPF programs (each combining conntrack and policy) attached at eth0. Events feed the Cilium Monitor, and the Cilium CLI controls the daemon]
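The per-endpoint programs in the diagram pair connection tracking with policy. A hypothetical sketch of what that conntrack state might look like; the layout and names are illustrative, not Cilium's actual structures:

```c
#include <linux/bpf.h>
#include <linux/in6.h>

struct ct_tuple {
	struct in6_addr saddr;
	struct in6_addr daddr;
	__u16 sport;
	__u16 dport;
	__u8  nexthdr;		/* IPPROTO_TCP / IPPROTO_UDP */
	__u8  dir;		/* ingress or egress */
};

struct ct_entry {
	__u64 rx_packets;
	__u64 tx_packets;
	__u32 lifetime;		/* seconds until the entry expires */
	__u32 src_label;	/* security context learned on first packet */
};

/* Map layout expected by tc/iproute2's ELF loader. */
struct bpf_elf_map {
	__u32 type, size_key, size_value, max_elem, flags, id, pinning;
};

struct bpf_elf_map __attribute__((section("maps"), used)) ct_map = {
	.type       = BPF_MAP_TYPE_HASH,
	.size_key   = sizeof(struct ct_tuple),
	.size_value = sizeof(struct ct_entry),
	.max_elem   = 1 << 16,
};
```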
Why is this awesome?
On-the-fly BPF program generation means:
● Extensibility of userspace networking, inside the kernel
● MAC, IP, port number, … all become constants → the compiler can optimize heavily!
● BPF programs can be recompiled and replaced without interrupting the container and its connections
– Features can be compiled in/out at runtime with container granularity
● Access to fast BPF maps and the perf ring buffer to interact with userspace
– Drop monitor in an n*Mpps context
– Use notifications for policy learning, IDS, logging, ...
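A sketch of what that buys in practice (all macro names are illustrative): the daemon emits a tiny header per endpoint and recompiles the endpoint's program against it, so endpoint-specific values become literals the compiler can fold, and whole features compile in or out per container.

```c
#include <linux/bpf.h>

#ifndef __section
# define __section(NAME) __attribute__((section(NAME), used))
#endif

/* --- would be generated per endpoint by the daemon --- */
#define SECLABEL	11	/* this endpoint's security label ID */
#define ENABLE_NAT46	1	/* feature toggled per container */
/* ----------------------------------------------------- */

#ifdef ENABLE_NAT46
static inline void handle_nat46(struct __sk_buff *skb)
{
	/* stub: translation logic would live here */
}
#endif

__section("classifier")
int endpoint_main(struct __sk_buff *skb)
{
	/* SECLABEL is a compile-time constant: this is a single immediate
	 * compare in the JITed code, not a config or map read. */
	if (skb->mark == SECLABEL)
		return 0;

#ifdef ENABLE_NAT46
	/* Recompiling with ENABLE_NAT46 undefined removes this entirely. */
	handle_nat46(skb);
#endif
	return 0;
}
```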
Available Building Blocks
● L3 forwarding (IPv6 & IPv4)
● Host connectivity
● Encapsulation (VXLAN/Geneve/GRE)
● ICMPv6 generation
● NDisc & ARP responder
● Access Control
Currently working on:
● Fragmentation handling
● Mobility
● Port Mapping (TCP/UDP)
● Connection tracking
● L3/L4 Load Balancer
● Statistics
● Events (perf ring buffer)
● Debugging framework
● NAT46
● End-to-end encryption
Networking should be invisible; it is not.
Simplicity
Simplicity
● L3 only (Calico gets this right)
– No L2 scaling issues, no broadcast domains, no L2 vulnerabilities
● No “Networks”
– No need for containers to join multiple networks to access multiple isolation domains; no need for multiple addresses
● Policy definition independent of addressing
– As specified in the Kubernetes Networking SIG
– All policies based on container labels
Performance
[Chart: container-to-container throughput on the local node – Gbit/s (y-axis, 0–600) vs. number of cores (x-axis, 1–22)]
netperf -t TCP_SENDFILE -H beef::aa0:18:ee5e
1 TCP flow per core, 10,000 policies
Intel Xeon 3.5 GHz Sandy Bridge, 24 cores
Performance
[Chart: container-to-container throughput over 10 Gbit NICs – Mbit/s (y-axis, 0–10,000) vs. number of cores (x-axis, 1–22), one series per size (64, 128, 256, 512, 1024, 64000 bytes)]
netperf -t TCP_SENDFILE -H beef::aa0:18:ee5e
1 TCP flow per core, 10,000 policies
Intel Xeon 3.5 GHz Sandy Bridge, 24 cores
<Insert Cool Demo Here>
Q&A
Start hacking with BPF for containers: http://github.com/cilium/cilium
Image Sources:
● Cover (Toronto): Rick Harris (https://www.flickr.com/photos/rickharris/)
● The Invisible Man: Dr. Azzacov (https://www.flickr.com/photos/drazzacov/)
Contact:
Slack: cilium.slack.com
Twitter: @tgraf__ Mail: tgraf@tgraf.ch
Team:
● André Martins
● Daniel Borkmann
● Madhu Challa
● Thomas Graf
