Container-relevant Kernel developments
Tycho Andersen
tycho@docker.com
GH: tych0
IMA
● Integrity Measurement Architecture (“IMA”, “I’ma”)
● In-kernel protection against unauthorized userspace file modification
IMA
open("/foo/bar", O_RDWR)
sha256sum("/foo/bar") == getxattr("/foo/bar", "security.ima")
verify("/foo/bar") == getxattr("/foo/bar", "security.evm")
If either check fails: open("/foo/bar", O_RDWR) = -EPERM
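The stored reference value in this check can be inspected from userspace. A minimal sketch, assuming a Linux system with extended attribute support: read_ima_xattr() and print_ima_xattr() are hypothetical helper names, and only getxattr(2) is a real interface. The raw attribute typically carries a short type header in front of the digest, so it is not expected to match sha256sum output byte for byte.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Read the raw "security.ima" extended attribute of a file into buf.
 * Returns the number of bytes read, or -1 (errno set) if the file does
 * not exist, carries no such attribute, or the filesystem lacks xattrs.
 */
ssize_t read_ima_xattr(const char *path, unsigned char *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    return getxattr(path, "security.ima", buf, len);
}

/* Print the raw attribute as hex; prints an error if it cannot be read. */
void print_ima_xattr(const char *path)
{
    unsigned char buf[1024];
    ssize_t n = read_ima_xattr(path, buf, sizeof(buf));

    if (n < 0) {
        perror("getxattr(security.ima)");
        return;
    }
    for (ssize_t i = 0; i < n; i++)
        printf("%02x", buf[i]);
    printf("\n");
}
```

Calling print_ima_xattr("/foo/bar") on a file that was never labelled just reports the getxattr error.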
IMA
$ tee /sys/kernel/security/ima/policy <<EOF
# PROC_SUPER_MAGIC = 0x9fa0
dont_measure fsmagic=0x9fa0
dont_appraise fsmagic=0x9fa0
# EXT4_SUPER_MAGIC = 0xEF53
appraise fsmagic=0xEF53 fowner=$user
appraise func=MODULE_CHECK
EOF
Kernel command line: ima_appraise={off,enforce,fix,log}
IMA namespacing
● global policy
● which namespace to pin?
● what about unshare()?
● ima: namespacing IMA audit messages
https://lkml.org/lkml/2017/7/20/905
IMA
Audit
struct container *
LSM
Time Namespace
seccomp logging
Landlock
Wireguard
KSPP
XPFO
Audit
type=USER_LOGIN msg=audit(1506873468.459:1814706): pid=27995 uid=0 auid=4294967295
ses=4294967295 msg='op=login acct="root" exe="/usr/sbin/sshd" hostname=?
addr=113.195.145.13 terminal=sshd res=failed'
type=USER_AUTH msg=audit(1506873489.492:1814707): pid=28128 uid=0 auid=4294967295
ses=4294967295 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd"
hostname=113.195.145.13 addr=113.195.145.13 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1506873489.492:1814708): pid=28128 uid=0 auid=4294967295
ses=4294967295 msg='op=login acct="root" exe="/usr/sbin/sshd" hostname=?
addr=113.195.145.13 terminal=sshd res=failed'
type=USER_AUTH msg=audit(1506873491.708:1814709): pid=28128 uid=0 auid=4294967295
ses=4294967295 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd"
hostname=113.195.145.13 addr=113.195.145.13 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1506873491.708:1814710): pid=28128 uid=0 auid=4294967295
ses=4294967295 msg='op=login acct="root" exe="/usr/sbin/sshd" hostname=?
addr=113.195.145.13 terminal=sshd res=failed'
type=USER_AUTH msg=audit(1506873493.864:1814711): pid=28128 uid=0 auid=4294967295
ses=4294967295 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd"
hostname=113.195.145.13 addr=113.195.145.13 terminal=ssh res=failed'
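Records like the ones above are flat key=value lists, which is part of why attributing them to a container is awkward: nothing in the record names one. A minimal sketch of pulling a single field out of such a record, assuming the layout shown on the slide; audit_field() is a hypothetical helper, not part of any audit library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of `key` from a "key=value key=value ..." audit record
 * into out (NUL-terminated, truncated to outlen - 1 bytes).
 * A key only matches at the start of the record or right after a space
 * or a single quote, so "uid" does not match inside "auid".
 * Surrounding double quotes are stripped.
 * Returns 0 on success, -1 if the key is absent.
 */
int audit_field(const char *rec, const char *key, char *out, size_t outlen)
{
    size_t klen;
    const char *p;

    assert(rec != NULL && key != NULL && out != NULL && outlen > 0);
    klen = strlen(key);
    if (klen == 0)
        return -1;

    for (p = strstr(rec, key); p != NULL; p = strstr(p + 1, key)) {
        int at_boundary = (p == rec) || p[-1] == ' ' || p[-1] == '\'';
        const char *v, *end;
        size_t n;

        if (!at_boundary || p[klen] != '=')
            continue;

        v = p + klen + 1;
        if (*v == '"') {
            /* quoted value: runs up to the closing double quote */
            v++;
            end = strchr(v, '"');
            if (end == NULL)
                end = v + strlen(v);
        } else {
            /* bare value: runs up to the next space or single quote */
            end = v + strcspn(v, " '");
        }
        n = (size_t)(end - v);
        if (n >= outlen)
            n = outlen - 1;
        memcpy(out, v, n);
        out[n] = '\0';
        return 0;
    }
    return -1;
}
```

For the first record above, asking for "acct" yields root and asking for "res" yields failed.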
Audit namespacing
● which namespace to pin?
● what about unshare()?
● RFC: Audit Kernel Container IDs
https://lkml.org/lkml/2017/9/13/383
● RFC(v2): Audit Kernel Container IDs
https://lkml.org/lkml/2017/10/12/354
struct container *
int cfd = container_create(const char *name, unsigned int flags);
container_mount(int cfd,
                const char *source,
                const char *target, /* NULL -> root */
                const char *filesystemtype,
                unsigned long mountflags,
                const void *data);
container_chroot(int cfd, const char *path);
mkdirat(int cfd, const char *path, mode_t mode);
mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
struct container *
container_bind_mount_across(int cfd,
                            const char *source,
                            const char *target);
int fd = openat(int cfd, const char *path,
                unsigned int flags, mode_t mode);
int fd = container_socket(int cfd, int domain, int type,
                          int protocol);
fork_into_container(int cfd);
container_wait(int container_fd, int *_wstatus, unsigned int wait,
               struct rusage *rusage);
container_kill(int container_fd, int initonly, int signal);
container_add_key(const char *type, const char *description,
                  const void *payload, size_t plen,
                  int container_fd);
struct container *
● Device restriction
● “supervising” the container
● Make containers kernel objects
https://lkml.org/lkml/2017/5/22/645
LSM
● Linux Security Module
○ SELinux
○ AppArmor
○ Smack
○ Landlock
○ TOMOYO
○ Yama
○ LoadPin
○ SARA
LSM namespacing (stacking, chaining)
● 2004: https://lwn.net/Articles/110432/ Stackable security modules
● 2010: https://lwn.net/Articles/393008/ LSM Stacking (again)
● 2011: https://lwn.net/Articles/426921/ Supporting multiple LSMs
● 2012: https://lwn.net/Articles/518345/ Another LSM stacking approach
● 2013: https://lwn.net/Articles/548314/ LSM: Multiple concurrent LSMs
● 2014: https://lwn.net/Articles/548314/ LSM: Generalize existing module stacking
● 2015: https://lwn.net/Articles/635771/ Progress in security module stacking
● 2016-2017: https://lwn.net/Articles/719731/ Stacking for major security modules
LSM namespacing (stacking, chaining)
Host: AppArmor
Guest: SELinux
Nested: Smack
LSM namespacing (stacking, chaining)
Host: AppArmor
Guest: AppArmor
Nested: AppArmor
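One way to picture host, guest and nested policies applying to the same operation is a chain of hooks where any layer can veto. This is a toy userspace model, not kernel code: file_open_hook, stacked_file_open() and the two example policies are all hypothetical names made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified stand-in for one module's file_open hook. */
typedef int (*file_open_hook)(const char *path);

/*
 * Call each stacked hook in order. 0 means "allow"; the first non-zero
 * (negative errno style) result denies the operation and ends the walk.
 */
int stacked_file_open(const file_open_hook *hooks, size_t nhooks,
                      const char *path)
{
    assert(hooks != NULL || nhooks == 0);
    for (size_t i = 0; i < nhooks; i++) {
        int rc = hooks[i](path);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Toy "host" policy: nothing under /etc/ may be opened. */
int host_policy(const char *path)
{
    return strncmp(path, "/etc/", 5) == 0 ? -EACCES : 0;
}

/* Toy "guest" policy: nothing under /tmp/ may be opened. */
int guest_policy(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0 ? -EPERM : 0;
}
```

With both policies stacked, a path is allowed only if every layer allows it, which is the property a nested container would rely on.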
LSM namespacing (stacking, chaining)
● SELinux in development:
https://marc.info/?l=selinux&m=150696042210126&w=2
unshare(CLONE_NEWTIME)
gettimeofday(); settimeofday();
clock_getres();
clock_gettime(); clock_settime();
time();
unshare(CLONE_NEWTIME)?
gettimeofday(); settimeofday();
clock_getres();
clock_gettime(); clock_settime();
time();
virtual Dynamic Shared Object (vDSO)
● optimization to make frequent syscalls faster
● injected into a task’s address space by the kernel
unshare(CLONE_NEWTIME)?
Task 1   Task 2   Task 3   ...   Task n
kernel: tick_handle_periodic() -> update_vsyscall()
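The difficulty sketched above is that every task reads the same kernel-maintained time data, so a per-namespace view needs an offset applied somewhere. A minimal userspace sketch, assuming a POSIX system: clock_gettime() is the real call (commonly served through the vDSO on Linux), while clock_ns() and namespaced_clock_ns() are hypothetical helpers and the offset is only an illustration of what a time namespace would have to add.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a clock and return its value in nanoseconds. */
int64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    int rc = clock_gettime(id, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical: the value a task in a time namespace would observe if the
 * namespace carried a fixed offset relative to the host clock.
 */
int64_t namespaced_clock_ns(clockid_t id, int64_t ns_offset)
{
    return clock_ns(id) + ns_offset;
}
```

The sketch applies the offset in the caller; the open question on the slide is where the kernel could apply it when all tasks share one set of vDSO data.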
seccomp logging
seccomp can’t dereference pointers
ptr = "/tmp/foo";
open(ptr, O_RDWR);
__secure_computing(...) = 0   /* the filter saw only the pointer value */
ptr = "/etc/passwd";          /* meanwhile, another thread changes the path */
sys_open()
do_sys_open()
do_filp_open()
path_openat()
vfs_open()
do_dentry_open()
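The race above comes from what the filter is allowed to look at. A minimal sketch, modelled loosely on the argument array a seccomp filter receives: struct filter_view and view_of_open() are hypothetical names, and the point is only that the filter's input is a register value, not the string behind it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the syscall data a filter can inspect:
 * the syscall number and six 64-bit argument values, never the memory
 * those values point to.
 */
struct filter_view {
    int nr;
    uint64_t args[6];
};

/* Build the view a filter would get for open(path, flags). */
struct filter_view view_of_open(int nr_open, const char *path, int flags)
{
    struct filter_view v;

    assert(path != NULL);
    memset(&v, 0, sizeof(v));
    v.nr = nr_open;
    v.args[0] = (uint64_t)(uintptr_t)path;  /* the address, not the string */
    v.args[1] = (uint64_t)flags;
    return v;
}
```

If the buffer behind path is rewritten between two calls, both views compare equal even though the file named is different, so a verdict based on args[0] says nothing about which file is eventually opened.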
Landlock
● eBPF-based Linux Security Module http://landlock.io
__secure_computing()
sys_open()
do_sys_open()
do_filp_open()
path_openat()
vfs_open()
do_dentry_open()
security_file_open()
Landlock
int security_file_open(struct file *file,
struct cred *cred);
struct file {
...
struct path f_path;
struct inode *f_inode;
};
WireGuard
● WireGuard is an extremely simple yet fast and modern VPN https://www.wireguard.com/
● Allows for transparent encryption between endpoints
WireGuard
● IPsec: 400k lines
● OpenVPN: 100k lines + SSL
● WireGuard: 4k lines
WireGuard
● Noise protocol: https://noiseprotocol.org
● Curve25519, BLAKE2s, ChaCha20, Poly1305, SipHash2-4
● No cipher agility
Kernel Self Protection Project (KSPP)
● Currently ~12 organizations and ~10 individuals working on ~20 technologies
● KSPP focuses on the kernel protecting the kernel from attack
● More at: https://outflux.net/slides/2017/lss/kspp.pdf
eXclusive Page Frame Ownership (XPFO)
● Introduced in “Rethinking Kernel Isolation” by
Kemerlis, Polychronakis, and Keromytis
● Protects against ret2dir attacks
● 29 files changed, 1013 insertions(+), 57 deletions(-)
● Implementation supports x86 and arm64
mm basics
0x00007fbcd334f000 (user) -> 0x1214b9000 (physical)
0xffff8801214b9000 (kernel) -> 0x1214b9000 (physical)
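The kernel address on this slide is the physical address shifted by a fixed base. A small sketch of that arithmetic, assuming the direct-map base 0xffff880000000000, which is the value that makes the slide's numbers line up; the real base depends on architecture and kernel configuration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base of the kernel's direct mapping of physical memory. */
#define ASSUMED_PAGE_OFFSET 0xffff880000000000ULL

/* Kernel virtual alias of a physical address under the assumed base. */
uint64_t phys_to_direct_map(uint64_t phys)
{
    uint64_t virt = ASSUMED_PAGE_OFFSET + phys;

    assert(virt >= ASSUMED_PAGE_OFFSET);  /* no wrap-around */
    return virt;
}

/* Inverse: physical address behind a direct-map kernel address. */
uint64_t direct_map_to_phys(uint64_t virt)
{
    assert(virt >= ASSUMED_PAGE_OFFSET);
    return virt - ASSUMED_PAGE_OFFSET;
}
```

So any page handed to userspace also has a predictable kernel-side alias, which is the property the attacks later in the deck rely on.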
Classic attack
struct file_operations {
    int (*flush) (...);
};

/* kernel text */
int do_flush(...)
{
    ...
}

/* userspace memory */
int bad_flush(...)
{
    commit_creds(prepare_kernel_cred(0));
}
Classic attack
● PaX UDEREF
● SMEP+SMAP on x86
● PXN on ARM
Updated attack
struct file_operations {
    int (*flush) (...);
};

/* kernel text */
int do_flush(...)
{
    ...
}

/* userspace memory
   0x00007fbcd334f000 */
int bad_flush(...)
{
    commit_creds(prepare_kernel_cred(0));
}
/* userspace alias in kernel
   0xffff8801214b9000 */
Enter XPFO!
● Keep track of who owns page
● Map/unmap accordingly
● Flush TLB as necessary
Get involved
● https://lists.linux-foundation.org/mailman/listinfo/containers
● http://www.openwall.com/lists/kernel-hardening/
● https://sourceforge.net/p/linux-ima/mailman/linux-ima-devel/
THANK YOU :)
Image credits
● Marty Bee for Brain Dump: http://www.martybee.com/
● https://en.wikipedia.org/wiki/White_Rabbit#/media/File:Down_the_Rabbit_Hole.png
● https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Container_ship_Hanjin_Taipei.jpg/1024px-Container_ship_Hanjin_Taipei.jpg
● https://en.wikipedia.org/wiki/Hansel_and_Gretel#/media/File:1903_Ludwig_Richter.jpg
● https://upload.wikimedia.org/wikipedia/commons/8/87/WinonaSavingsBankVault.JPG
● http://www.gizmodo.in/photo/20861051.cms
● https://upload.wikimedia.org/wikipedia/commons/b/be/TPM.svg
● Kyle Spiers (Security Intern at Docker) for Gordon photo
On allocation
[Diagram: allocating 0x00007fbcd334f000 triggers a TLB flush across the CPU cores]
On map/unmap
[Diagram: mapping 0x00007fbcd334f000 triggers a TLB flush across the CPU cores]
x86
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
    ...
    on_each_cpu(do_kernel_range_flush, &info, 1);
}
x86
/*
* Can deadlock when called with interrupts disabled. ...
*/
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress);
Benchmark
● kernbench, run on n/2 to n cores in steps of 2
● tests inter-core interference from excess flushing
2x Xeon E5-2650 v4, 24 cores/48 threads, 2.2 GHz, 30 MB SmartCache
Xeon E3-1240, 4 cores/8 threads, 3.3 GHz, 8 MB SmartCache
Amlogic Cortex-A53, 4 cores (odroid-C2), 1.5 GHz, 32k L1 (I/D), 512k L2
XPFO links
● Original paper:
https://cs.brown.edu/~vpk/papers/ret2dir.sec14.pdf
● v6 posting: https://lkml.org/lkml/2017/9/7/445
