Brought to you by
OSv Unikernel
Waldek Kozaczuk
OSv Committer
Optimizing Guest OS to Run Stateless and Serverless Apps in the Cloud
What is OSv?
An open-source, versatile, modular unikernel designed to run a single unmodified
Linux application securely as a microVM on top of a hypervisor, in contrast to
traditional operating systems, which were designed for a vast range of physical
machines. Or simply:
■ An OS designed to run a single application, with no isolation between the
application and the kernel
■ A HIP (Highly Isolated Process) without the ability to make system calls to the host
OS
■ Supports both x86_64 and aarch64 platforms
Components of OSv
Why Stateless and Serverless Workloads?
These workloads can take advantage of OSv's strengths:
■ Fast to boot and restart
■ Low memory utilization
■ Optimized networking stack
They do not need a performant, feature-rich filesystem, just enough to read
code and configuration
■ What about logs?
What and Why to Optimize
■ Short boot time
■ Low memory utilization
● Current minimum is 15 MB, but it can be optimized down to 10 MB
■ Small kernel size
● Directly leads to higher density of guests on the host
■ Optimized networking stack
● Improves throughput in terms of requests per second
● Improves latency
Optimize Boot Time
OSv, with a read-only filesystem and networking off, can boot in as little as ~5 ms on Firecracker and
even faster, around ~3 ms, on QEMU with the microvm machine type. In general, however, the
boot time depends on many factors: the hypervisor (including the settings of individual
para-virtual devices), the filesystem (ZFS, ROFS, RAMFS or Virtio-FS) and some boot
parameters.
For example, the boot time of a ZFS image is currently ~40 ms on Firecracker and ~200 ms on
regular QEMU. Also, newer versions of QEMU (>= 4.0) are typically faster to boot.
Booting on QEMU in PVH/HVM mode (aka direct kernel boot) should always be faster as
OSv is directly invoked in 64-bit long mode.
For more details see https://github.com/cloudius-systems/osv#boot-time
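As a rough sketch of how such a setup can be reproduced, assuming the helper scripts shipped in the OSv repository (exact options may vary between versions):
# Build a small image backed by the read-only filesystem (ROFS),
# using the stock native "hello world" example app
./scripts/build image=native-example fs=rofs
# Boot it on QEMU ...
./scripts/run.py
# ... or on Firecracker
./scripts/firecracker.py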
Optimize Kernel ELF Size: Why?
■ Smaller kernel ELF leads to less memory utilization
■ Fewer symbols (ideally only those needed by a specific app) improve security
The current kernel size is around 6.7 MB and includes subsets of the following libraries.
The experiments described in the following slides help reduce the kernel size to 2.6 MB:
libdl.so.2, ld-linux-x86-64.so.2
libresolv.so.2, libcrypt.so.1, libaio.so.1
libc.so.6, libm.so.6
libpthread.so.0
librt.so.1, libxenstore.so.3.0
libstdc++.so.6
Optimize Kernel ELF Size: Hide STD C++
diff --git a/Makefile b/Makefile
+ --version-script=./version_script_with_public_ABI_symbols_only 
--whole-archive 
- $(libstdc++.a) $(libgcc_eh.a) 
+ $(libgcc_eh.a) 
$(boost-libs) 
- --no-whole-archive $(libgcc.a), 
+ --no-whole-archive $(libstdc++.a) $(libgcc.a), 
LINK kernel.elf)
Hiding the standard C++ library symbols reduces the kernel size to 5.0 MB.
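To verify the effect of this and each following step, it is enough to rebuild and inspect the resulting kernel ELF; a minimal sketch, assuming the default build output path (which may differ between versions and architectures):
# Rebuild OSv and check the size of the kernel ELF artifact
./scripts/build
ls -lh build/release/kernel.elf
# a per-section breakdown can help attribute the savings
size build/release/kernel.elf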
Optimize Kernel ELF Size: Collect Garbage
Enabling garbage collection reduces the kernel size further to 4.3 MB.
diff --git a/Makefile b/Makefile
EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base)
-DOSV_KERNEL_VM_BASE=$(kernel_vm_base) 
- -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift)
+ -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) -ffunction-sections -fdata-sections
- --no-whole-archive $(libstdc++.a) $(libgcc.a), 
+ --no-whole-archive $(libstdc++.a) $(libgcc.a) --gc-sections, 
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
.start32_address : AT(ADDR(.start32_address) - OSV_KERNEL_VM_SHIFT) {
*(.start32_address)
- }
+ KEEP(*(.start32_address)) }
Optimize Kernel ELF Size: Disable ZFS
diff --git a/Makefile b/Makefile
+ifdef zfs-enabled
solaris += bsd/sys/cddl/contrib/opensolaris/uts/common/zmod/zmod_subr.o
solaris += bsd/sys/cddl/contrib/opensolaris/uts/common/zmod/zutil.o
solaris += $(zfs)
+endif
+ifdef zfs-enabled
drivers += drivers/zfs.o
+endif
We do not need ZFS for stateless and serverless workloads.
Disabling ZFS reduces the kernel size to 3.6 MB.
Optimize Kernel ELF Size: Select Platform/Drivers
diff --git a/Makefile b/Makefile
+ifdef xen-enabled
bsd += bsd/sys/xen/xenstore/xenstore.o
bsd += bsd/sys/xen/xenbus/xenbus.o
+endif
+ifdef virtio-enabled
drivers += drivers/virtio-vring.o
drivers += drivers/virtio-blk.o
+endif
For example, disabling all drivers and other platform code except what is needed to
run on Firecracker or the QEMU microvm machine reduces the kernel size to 3.1 MB.
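Because the guards in these experiments are plain make ifdef conditionals, they can be toggled from the command line when invoking make directly; a sketch, assuming the experiment-local variable names from the diffs above (they are not official OSv build options):
# Compile in the virtio drivers; leaving zfs-enabled and xen-enabled
# undefined causes their objects to be skipped
make -j$(nproc) virtio-enabled=1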
Optimize Kernel ELF Size: App Specific Symbols
{
global:
__cxa_finalize;
__libc_start_main;
puts;
local:
*;
};
Eliminating all symbols and related code except what is needed to run a specific app
further reduces the kernel size to 2.6 MB, which is enough to run a native or
Java “hello world” app.
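One possible way to derive such a list (an illustration only, not necessarily the tooling used by OSv; “hello” is a placeholder application binary) is to dump the undefined dynamic symbols the application expects the kernel to provide:
# Emit one "symbol;" line per undefined dynamic symbol of the app,
# ready to be pasted into the global: section of the version script
nm -D --undefined-only hello | awk '{ print $NF ";" }' | sort -u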
Optimize Memory Usage
Apart from shrinking the kernel ELF to minimize memory used, the following
optimizations can be implemented:
■ Lazy stack for application threads (WIP patch available, issue #144)
● The stack needs to be pre-faulted before calling kernel code that cannot be preempted.
■ Refine L1/L2 memory pools logic to dynamically shrink/expand the low
watermark depending on physical memory size
● Currently we pre-allocate 512K for each vCPU, regardless of whether the app needs it.
Optimize Number of Runs on c5n.metal
■ Disk-only boot on Firecracker
■ Almost 1,900 boots per second (total of 629,625 runs)
● 25 boots per second on single host CPU
■ Boot time percentiles:
● P50 = 8.98 ms
● P75 = 12.07 ms
● P90 = 17.15 ms
● P99 = 31.49 ms
■ Hypervisor overhead affects the achievable boots/sec
Optimize Density on c5n.metal: Boot Time
Optimize Density on c5n.metal: Boots/second
Optimize Density on c5n.metal: CPU utilization
Optimize HTTP Requests/Sec
Each test described in the following slides involves a separate test “worker” machine connected to a
test “client” machine over a 1 Gbit network.
■ Setup:
● Test guest VM:
■ Linux guest - Fedora 33 with firewall turned off
■ OSv guest - 0.56
■ QEMU 5.0 with vhost networking bridged to expose the guest interface on the local
Ethernet; the same setup is used for the OSv and Linux guests
● Test “worker” machine - 8-way MacBook Pro i7 2.3 GHz running Ubuntu 20.10
● Linux test “client” machine - 8-way MacBook Pro i7 2.7 GHz running Fedora 33
■ Each test is executed against the guest VM with 1, 2 and 4 vCPUs where it makes sense
■ As a baseline, each test app is also executed and measured on the host, with taskset used to limit the CPU count
■ The load is generated by wrk, with enough load that the host CPUs pinned to the OSv or Linux VM
spike close to 100% utilization (see the example invocation below)
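For reference, a wrk invocation matching the parameters used below might look as follows (the guest IP, port and path are placeholders):
# 8 threads, 100 connections, 5-second run, with latency percentiles reported
wrk -t8 -c100 -d5s --latency http://192.168.122.10:8000/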
Linux Guest vs OSv: Nginx 1.20.1
Each test: best of 3 runs; wrk with 8 threads and 100 connections running for 5 seconds; response payload 774 bytes
Host pinned to single host CPU
■ 51,076.43 requests/sec
■ 49.78 MB/sec
■ P99 latency: 2.28 ms
Linux guest with 1 vCPU
■ 25,736.96 requests/sec
■ 24.80 MB/sec
■ P99 latency: 14.04 ms
OSv with 1 vCPU (same as Linux)
■ 38,333.70 requests/sec (~1.49× the Linux guest)
■ 36.78 MB/sec
■ P99 latency: 1.75 ms (~0.12× the Linux guest)
Linux Guest vs OSv: Node.js 14.17
Each test: best of 3 runs; wrk with 8 threads and 100 connections running for 5 seconds; response payload 33 bytes
Host pinned to single host CPU
■ 23,260.02 requests/sec
■ 3.95 MB/sec
■ P99 latency: 8.05 ms
Linux guest with 1 vCPU
■ 12,351.55 requests/sec
■ 2.37 MB/sec
■ P99 latency: 14.78 ms
OSv with 1 vCPU
■ 17,996.67 requests/sec (~1.46× the Linux guest)
■ 3.45 MB/sec
■ P99 latency: 7.38 ms (~0.5× the Linux guest)
Linux Guest vs OSv: Golang 1.15.13
Each test: best of 3 runs; wrk with 8 threads and 100 connections running for 5 seconds; response payload 42 bytes
Host with CPUs pinned              1 CPU                      2 CPUs                     4 CPUs
Requests/sec / transfer (MB/sec)   48,033.00 / 7.28           94,346.28 / 14.31          106,905.85 / 16.21
P99 latency (ms)                   4.16                       3.28                       2.26
Linux Guest                        1 vCPU                     2 vCPUs                    4 vCPUs
Requests/sec / transfer (MB/sec)   24,124.64 / 3.84           49,856.71 / 7.94           93,544.90 / 14.90
P99 latency (ms)                   8.62                       9.21                       8.11
OSv                                1 vCPU                     2 vCPUs                    4 vCPUs
Requests/sec / transfer (MB/sec)   40,793.25 (1.69×) / 6.03   74,247.88 (1.49×) / 10.98  82,426.27 (0.88×) / 12.18
P99 latency (ms)                   5.35 (0.62×)               15.94 (1.73×)              10.90 (1.34×)
Values in parentheses are ratios relative to the Linux guest.
Linux Guest vs OSv: Rust with Tokio and Hyper
Each test: best of 3 runs; wrk with 8 threads and 200 connections running for 5 seconds; response payload 30 bytes
Host with CPUs pinned              1 CPU                      2 CPUs                     4 CPUs
Requests/sec / transfer (MB/sec)   71,011.59 / 9.96           153,286.61 / 21.49         144,677.22 / 20.28
P99 latency (ms)                   3.29                       1.85                       2.78
Linux Guest                        1 vCPU                     2 vCPUs                    4 vCPUs
Requests/sec / transfer (MB/sec)   28,061.11 / 3.93           68,742.13 / 9.64           132,515.62 / 18.58
P99 latency (ms)                   10.03                      8.18                       5.06
OSv                                1 vCPU                     2 vCPUs                    4 vCPUs
Requests/sec / transfer (MB/sec)   57,886.77 (2.06×) / 8.12   47,312.48 (0.69×) / 6.63   47,073.25 (0.36×) / 6.60
P99 latency (ms)                   7.48 (0.75×)               8.36 (1.02×)               22.22 (4.39×)
Values in parentheses are ratios relative to the Linux guest.
Linux Guest vs OSv: Akka HTTP 2.6 on Java 8
Each test: best of 3 runs; wrk with 8 threads and 100 connections running for 5 seconds; response payload 42 bytes
Host with CPUs pinned              1 CPU                      2 CPUs                     4 CPUs
Requests/sec / transfer (MB/sec)   19,122.27 / 3.15           53,301.46 / 8.79           95,439.53 / 15.75
P99 latency (ms)                   50.64                      33.26                      16.35
Linux Guest                        1 vCPU                     2 vCPUs                    4 vCPUs
Requests/sec / transfer (MB/sec)   10,959.20 / 1.81           27,018.56 / 4.46           51,493.63 / 8.50
P99 latency (ms)                   691.89                     96.63                      40.04
OSv                                1 vCPU                     2 vCPUs                    4 vCPUs
Requests/sec / transfer (MB/sec)   38,666.84 (3.52×) / 6.38   64,532.12 (2.39×) / 10.65  81,930.80 (1.59×) / 13.52
P99 latency (ms)                   91.94 (0.13×)              30.26 (0.31×)              54.80 (1.37×)
Values in parentheses are ratios relative to the Linux guest.
Things to Optimize
■ Implement SO_REUSEPORT to improve the throughput of Rust apps
■ Finish “lazy application stack” support to minimize memory used
■ Reduce lock contention in the futex implementation to improve Golang apps
■ Optimize atomic operations on single vCPU
■ Make L1/L2 memory pool sizes self-configurable depending on physical
memory available
■ Other open issues
● https://github.com/cloudius-systems/osv/labels/performance
● https://github.com/cloudius-systems/osv/labels/optimization
Brought to you by
Waldek Kozaczuk
https://github.com/cloudius-systems/osv
https://groups.google.com/g/osv-dev
@OSv_unikernel