3. Compute Isolation Status
• bhyve(4) is a stable, performant hypervisor
• ZFS zvol passthrough to guests works well
TIP: use multiple zvols per guest and stripe I/O across the devices using lvcreate(8)
• Good CPU isolation - CPUs can be pinned for low-density workloads
• Perfect memory isolation (guest memory is wired)
5. Network Isolation Status
• Network isolation is not core to bhyve(4) today
• Using VNET(9) to manipulate FIBs for tap(4) interfaces is possible, but limited and not performant
7. Incomplete Solution
• bhyve(4) guests run customer workloads
• Cloud providers need a single FIB for the underlay
network
• Guests run in isolated overlay networks
• How do you map guests to their respective overlay
network?
8. Multi-Host Network Isolation
[Diagram: two hosts, each with an em0 interface and guests for Customer A and Customer B; how guests on different hosts are mapped to their respective overlay networks is marked "???".]
9. Possible Solution
• Plug a tap(4) device into a guest
• Plug the tap(4) device into a bridge(4)
• Plug the physical or cloned interface into the underlay bridge(4)
11. Possible Solution++
• Plug a tap(4) device into a guest
• Plug the tap(4) device into a bridge(4)
• Plug a vxlan(4) interface into a per-subnet bridge(4)
• Plug the vxlan(4) into an underlay bridge(4) instance
• Plug the physical or cloned interface into the underlay bridge(4)
12. Fuster Cluck Isolation
[Diagram: Guests 1-3 (Customers A and B) attach via tap50, tap51, and tap52 to bridge(4) instances, which connect through vxlan0 and vxlan1 and a further underlay bridge(4) to the physical em0 interface.]
14. Problems with tap(4)/bridge(4)/vxlan(4)/VNET(9)
• tap(4) is slow
• bridge(4) is slow
• vxlan(4) sends received packets through ip_input()
twice (i.e. "sub-optimal")
• VNET(9) virtualizes underlay networks, not overlay networks
• How do you ARP between VMs in the overlay network?
• How do you perform vxlan(4) encap?
15. uvxbridge(8) POC
The uvxbridge(8) POC provides a high-PPS interface and guest VXLAN encapsulation using netmap(9)/ptnetmap(4):
https://github.com/joyent/uvxbridge
17. Reasonable Performance
• 21Gbps guest-to-guest on the same host
• 15Gbps guest-to-guest across the wire
• Supports AES PSK encryption for cross-DC traffic
• Incomplete, but a successful POC
18. New Approach
• Build a network isolation subsystem aligned with the
capabilities of network hardware
• Reduce the number of times a packet is copied
• Reduce context switches
• Build a bottom-up abstraction rooted in the capabilities of the hardware, not an implementation defined in terms of an administrative policy
21. FreeBSD/VPC Multi-Host
[Diagram: two FreeBSD/VPC hosts, each with an em0 interface, a vpclink0 uplink, vpcsw0/vpcsw1 switches, and vmnic0/vmnic1/vmnic2 interfaces serving guests for Customers A and B; how traffic crosses between the hosts is still marked "???".]
22. VXLAN to the Rescue
• Encapsulates all IP packets as UDP
• Adds a preamble to the IP packet
• Tags packets with a VXLAN ID, known as a VNI (see the header sketch below)
• VXLAN is similar to VLAN tagging, but embeds the tag in the IP header, not in the L2 frame
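For reference, the VXLAN encapsulation header defined by RFC 7348 is a fixed 8-byte preamble carrying the 24-bit VNI. A minimal C sketch of its layout (field names are illustrative, not taken from the FreeBSD sources):

#include <stdint.h>

/*
 * VXLAN header (RFC 7348): 8 bytes inserted between the outer UDP
 * header and the encapsulated inner Ethernet frame.
 */
struct vxlan_header {
	uint8_t  flags;        /* bit 0x08 set => the VNI field is valid */
	uint8_t  reserved1[3]; /* reserved, must be zero */
	uint8_t  vni[3];       /* 24-bit VXLAN Network Identifier */
	uint8_t  reserved2;    /* reserved, must be zero */
};

/* Extract the VNI as a host-order 32-bit value. */
static inline uint32_t
vxlan_vni(const struct vxlan_header *vh)
{
	return (((uint32_t)vh->vni[0] << 16) |
	    ((uint32_t)vh->vni[1] << 8) |
	    (uint32_t)vh->vni[2]);
}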
27. ioctl(2)
• Initially used ioctl(2)
• Unable to completely secure via capsicum(4) - able to specify flags, but not scope the target device
ioctl(2) is the kernel equivalent of an HTTP PUT: the interface for everything without a design.
"Hi ifconfig(8), I'm looking at you."
28. One Device Per Object?
• New device in /dev for every new interface:
/dev/vpcsw0
/dev/vpcsw1
/dev/vpcp0
• Query via reads to individual devices in /dev
• Control devices via writes
• Security via devd(8)?
• Using VFS ACLs to secure network primitives?
Da Vinci mashed up with Jackson Pollock.
Kernel API design hate crime.
32. New System Calls
• vpc_open(2) - Creates a new VPC descriptor
• vpc_ctl(2) - Manipulates VPC descriptors
• Capsicum-like, intended for privilege separation
• Intended for idempotent tooling
• Makes aggressive use of UUIDs as operator handles to
be compatible with Triton
33. vpc_open(2)
int
vpc_open(const vpc_id_t *vpc_id,
         vpc_type_t obj_type,
         vpc_flags_t flags);
• Creates a new "VPC descriptor"
• Similar to open(2)
• Manipulate the descriptor via vpc_ctl(2)
• Responds to close(2)
• Priv-sep native security semantics
• Version aware
• System Call 580
34. vpc_id_t
int
vpc_open(const vpc_id_t *vpc_id,
         vpc_type_t obj_type,
         vpc_flags_t flags);
16 bytes, UUID-like:
type ID struct {
	TimeLow    uint32
	TimeMid    uint16
	TimeHi     uint16
	ClockSeqHi uint8
	ObjType    ObjType
	Node       [6]byte // default MAC address of the vpc(4) interface
}
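The same 16-byte layout can also be pictured as a packed C struct; the struct name and field names below simply mirror the Go definition above and are assumptions, not the kernel's own declaration:

#include <stdint.h>

/* Illustrative C view of the 16-byte, UUID-like vpc_id_t. */
struct vpc_id_sketch {
	uint32_t time_low;
	uint16_t time_mid;
	uint16_t time_hi;
	uint8_t  clock_seq_hi;
	uint8_t  obj_type; /* VPC object type baked into the ID */
	uint8_t  node[6];  /* default MAC address of the vpc(4) interface */
} __attribute__((packed));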
39. vpc_flags_t
int
vpc_open(const vpc_id_t *vpc_id,
         vpc_type_t obj_type,
         vpc_flags_t flags);
#define VPC_F_CREATE (1ULL << 0)
#define VPC_F_OPEN   (1ULL << 1)
#define VPC_F_READ   (1ULL << 2)
#define VPC_F_WRITE  (1ULL << 3)
NOTE: the flags will be revisited to enable a per-VPC-Object-Type capability bit field
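As a usage sketch, creating a brand-new object versus attaching to an existing one is just a matter of which flags are passed. Apart from vpc_open(2) and the VPC_F_* flags above, everything here (the typedefs and the switch object-type constant) is a placeholder standing in for the real vpc(4) headers:

#include <stdint.h>

/* Placeholder typedefs standing in for the real vpc(4) headers. */
typedef struct { uint8_t id[16]; } vpc_id_t;
typedef uint64_t vpc_type_t;
typedef uint64_t vpc_flags_t;

#define VPC_F_CREATE (1ULL << 0)
#define VPC_F_OPEN   (1ULL << 1)
#define VPC_F_READ   (1ULL << 2)
#define VPC_F_WRITE  (1ULL << 3)

#define VPC_OBJ_SWITCH 1 /* hypothetical object-type constant */

int vpc_open(const vpc_id_t *vpc_id, vpc_type_t obj_type, vpc_flags_t flags);

/* Create a new switch object, or attach read-only to one that already exists. */
static int
switch_descriptor(const vpc_id_t *id, int create)
{
	vpc_flags_t flags = create ?
	    (VPC_F_CREATE | VPC_F_READ | VPC_F_WRITE) :
	    (VPC_F_OPEN | VPC_F_READ);

	return (vpc_open(id, VPC_OBJ_SWITCH, flags));
}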
40. vpc_ctl(2)
int
vpc_ctl(int vpcd, vpc_op_t op,
        size_t keylen, const void *key,
        size_t *vallen, void *buf);
• Only interface for manipulating a VPC object (usage sketched below)
• Available operations differ per VPC Object type
• Inspired by ioctl(2), capsicum(4), and HTTP
• System Call 581
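The descriptor returned by vpc_open(2) is then driven entirely through vpc_ctl(2). A hedged sketch of the query pattern follows; the operation constant and value layout are invented for illustration, only the prototype itself comes from the slide:

#include <stddef.h>
#include <stdint.h>

/* Placeholder typedef; the real vpc_op_t comes from the vpc(4) headers. */
typedef uint64_t vpc_op_t;

int vpc_ctl(int vpcd, vpc_op_t op, size_t keylen, const void *key,
    size_t *vallen, void *buf);

/* Hypothetical operation: ask a vpcsw(4) descriptor for its port count. */
#define VPCSW_OP_PORT_COUNT 0x1

static int
switch_port_count(int swd, uint64_t *count)
{
	size_t vallen = sizeof(*count);

	/* No key for this query; the result is written into *count. */
	return (vpc_ctl(swd, VPCSW_OP_PORT_COUNT, 0, NULL, &vallen, count));
}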
44. What's it all mean?
• Network interfaces are first-class objects
• Network interfaces are pluggable
• Configurations can be arbitrarily complex
• All vpc(4) interfaces are iflib(9)-based and can be tcpdump(8)'ed
• Performance is nice
46. Tooling
• Able to be cross-compiled
• Developer-friendly
• Stable ABI
• Idempotent Operations
• No a priori knowledge of the target OS required (i.e. no headers required to cross-compile tooling)
49. ELI5: vpc(4) edition
[Diagram: the two-host FreeBSD/VPC topology again (em0, vpclink0, vpcsw0/vpcsw1, vmnic0/vmnic1/vmnic2, guests for Customers A and B); VXLAN packets carry traffic between the hosts, with Customer A on VNI 123 and Customer B on VNI 987.]
50. ELI5: vpc(4) Assumptions
• Guest is running Ubuntu or CentOS Linux
• Multi-Queue TX/RX in Host
• Multiple network queues available in the guest
• Underlay hosts can pass traffic
• Overlay hosts are in the same subnet
51. ELI5: vpc(4) edition
[Diagram: a single host's vpc(4) plumbing: the guests' vmnic0/vmnic1/vmnic2 interfaces and the ethlink0 uplink attach to vpcsw0 and vpcsw1 through vpcp(4) ports (vpcp0, vpcp1, vpcp3, vpcp4, vpcp5), with em0 as the physical interface.]
62. ELI5: vpc(4) edition
[Diagram: the two-host topology with underlay addresses 10.65.5.161 and 10.65.5.162; VXLAN packets flow between the hosts, labeled VNI 123 plus VLAN/VTAG 456 for Customer A and VNI 987 for Customer B.]
63. ELI5: vpclink(4) Setup
• How do you do ARP or IPv6 Neighbor Discovery?
• Broadcast?
• Multicast?
64. ELI5: vpc(4) edition
[Diagram: the same two-host topology (underlay addresses 10.65.5.161 and 10.65.5.162, VNI 123 plus VLAN/VTAG 456, VNI 987), now with "???" marking how ARP/ND resolution happens across the overlay.]
65. VTEP: VXLAN Tunnel End Point
• vpcsw(4) traps broadcast packets
• If the packet is a broadcast packet and either an IPv4 ARP or an IPv6 ND packet, it is added to a knote (part of kqueue(2)) and passed to all kqueue(2) subscribers using the EVFILT_VPC filter
• The packet payload is stored in a user-allocated buffer pointed to by ext[0]
• A userland utility parses the packet and performs the lookup
• Overlay and underlay forwarding information is written back to the kernel via vpc_ctl(2)
• vpcsw(4) caches the forwarding information for the src/dst MAC tuple (a userland loop is sketched below)
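A userland VTEP agent built on this would look roughly like the loop below. EVFILT_VPC, vpc_ctl(2), and the user-allocated payload buffer are named on the slides; the descriptor setup, the placeholder filter value, the forwarding-entry layout, and the VPCSW_OP_FTE_SET op are assumptions sketched for illustration:

#include <sys/types.h>
#include <sys/event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef EVFILT_VPC
#define EVFILT_VPC (-32) /* stand-in value; the real filter comes from the vpc(4) patches */
#endif

typedef uint64_t vpc_op_t;
int vpc_ctl(int vpcd, vpc_op_t op, size_t keylen, const void *key,
    size_t *vallen, void *buf);

/* Hypothetical op that pushes overlay/underlay forwarding info back to vpcsw(4). */
#define VPCSW_OP_FTE_SET 0x10

/*
 * Wait for trapped ARP/ND packets on a vpcsw(4) descriptor, resolve them
 * (lookup elided), and write the forwarding entry back to the kernel.
 * The trapped payload itself lands in a user-allocated buffer registered
 * ahead of time (registration elided here).
 */
static void
vtep_loop(int kq, int swd)
{
	struct kevent ev;
	struct { uint8_t dst_mac[6]; uint8_t underlay_ip[4]; uint32_t vni; } fte;

	EV_SET(&ev, swd, EVFILT_VPC, EV_ADD | EV_ENABLE, 0, 0, NULL);
	kevent(kq, &ev, 1, NULL, 0, NULL);

	for (;;) {
		if (kevent(kq, NULL, 0, &ev, 1, NULL) < 1)
			continue;
		memset(&fte, 0, sizeof(fte));
		/* ... parse the trapped ARP/ND payload and fill in fte ... */
		vpc_ctl(swd, VPCSW_OP_FTE_SET, sizeof(fte), &fte, NULL, NULL);
	}
}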
66. Ongoing Work
• Firewalling - integrated at vpcp(4)
• Routing
• NAT
• Userland Control Plane (including setup and teardown of bhyve(4) guests via something other than a shell script)
67. vpc(8)
% doas vpc vm create
% doas vpc vm run
• bhyve(4) VM creation and launch tool
• Unifies network isolation with compute isolation
68. Desktop Software + vpc(4)
• vpc(4) + NAT == Desktop Software Hotness
• Ability to sandbox VMs with no prior knowledge of the
host's networking
• Uses:
% vagrant up
% packer build
No more dependency on pf(4) or dnsmasq(8)