5. Container Security
Containers rely on several security mechanisms:
Control Groups (cgroups)
Namespaces
Capabilities
Seccomp
Linux security mechanisms
The Docker daemon
6. Control Groups (cgroups)
By default, a container has no resource constraints and can use as much of a
given resource as the host’s kernel scheduler will allow…
https://docs.docker.com/engine/admin/resource_constraints/
7. Control Groups (cgroups)
Denial Of Service (cpu, memory, disk)
Fork bomb: :(){ :|:& };:
Human readable:
bomb() {
  bomb | bomb &
}; bomb
# Python equivalent
import os
while 1:
    os.fork()
# Perl equivalent
perl -e "fork while fork" &
8. Control Groups (cgroups)
Limit a container's resources
Docker provides ways to control how much memory, CPU, or block IO a container
can use, by setting runtime configuration flags on the docker run command.
docker run -it -m 500M --kernel-memory 50M --cpu-shares 512 --blkio-weight 400 --name ubuntu1 ubuntu bash
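As a rough sketch of how such a command line could be assembled programmatically, the hypothetical helper below (docker_run_argv is not part of Docker) only builds the argument list from the flags shown on this slide; it does not start a container.

```python
import shlex

def docker_run_argv(image, command, memory=None, kernel_memory=None,
                    cpu_shares=None, blkio_weight=None, name=None):
    """Build (but do not execute) a `docker run` argument list.

    Hypothetical helper for illustration; only flags shown on the slide are used.
    """
    argv = ["docker", "run", "-it"]
    if memory is not None:
        argv += ["-m", memory]                      # hard memory limit
    if kernel_memory is not None:
        argv += ["--kernel-memory", kernel_memory]  # kernel memory limit
    if cpu_shares is not None:
        argv += ["--cpu-shares", str(cpu_shares)]   # relative CPU weight
    if blkio_weight is not None:
        argv += ["--blkio-weight", str(blkio_weight)]  # relative block IO weight
    if name is not None:
        argv += ["--name", name]
    argv.append(image)
    argv += list(command)
    return argv

argv = docker_run_argv("ubuntu", ["bash"], memory="500M", kernel_memory="50M",
                       cpu_shares=512, blkio_weight=400, name="ubuntu1")
print(shlex.join(argv))
```

The list form can be handed to subprocess.run without shell quoting concerns; shlex.join is used here only to display it.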
9. Control Groups (cgroups)
Option  Description
-m or --memory=  The maximum amount of memory the container can use. If you set this option, the minimum allowed value is 4m (4 megabytes).
--memory-swap*  The amount of memory this container is allowed to swap to disk. See --memory-swap details.
--memory-swappiness  By default, the host kernel can swap out a percentage of anonymous pages used by a container. You can set --memory-swappiness to a value between 0 and 100, to tune this percentage. See --memory-swappiness details.
--memory-reservation  Allows you to specify a soft limit smaller than --memory which is activated when Docker detects contention or low memory on the host machine. If you use --memory-reservation, it must be set lower than --memory in order for it to take precedence. Because it is a soft limit, it does not guarantee that the container will not exceed the limit.
--kernel-memory  The maximum amount of kernel memory the container can use. The minimum allowed value is 4m. Because kernel memory cannot be swapped out, a container which is starved of kernel memory may block host machine resources, which can have side effects on the host machine and on other containers. See --kernel-memory details.
--cpus=<value>  Specify how much of the available CPU resources a container can use. For instance, if the host machine has two CPUs and you set --cpus="1.5", the container can use at most one and a half of the CPUs. This is the equivalent of setting --cpu-period="100000" and --cpu-quota="150000". Available in Docker 1.13 and higher.
--cpu-period=<value>  Specify the CPU CFS scheduler period, which is used alongside --cpu-quota. Defaults to 100000 microseconds (100 milliseconds). Most users do not change this from the default. If you use Docker 1.13 or higher, use --cpus instead.
…
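The relationship between --cpus, --cpu-period and --cpu-quota is simple arithmetic: the quota is the CPU count multiplied by the period. A minimal sketch, assuming the 100000 microsecond period quoted in the table (the function name cpus_to_cfs is made up for this example):

```python
def cpus_to_cfs(cpus, period_us=100_000):
    """Convert a --cpus value into an equivalent (period, quota) pair in microseconds."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    quota_us = int(round(cpus * period_us))  # CPU time allowed per scheduling period
    return period_us, quota_us

period, quota = cpus_to_cfs(1.5)
print(f'--cpu-period="{period}" --cpu-quota="{quota}"')
```

With cpus=1.5 this reproduces the period and quota pair given in the table.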
10. Control Groups (cgroups)
Prevent fork bombs:
The PIDs cgroup subsystem limits the number of processes that can be forked
inside a cgroup.
Requires kernel 4.3+ and Docker 1.11+ (--pids-limit)
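The PIDs controller exposes its limit and current count through the pids.max and pids.current files of a cgroup. The sketch below is a simplified model of the check, not kernel code: it parses a pids.max value (the literal string "max" means no limit) and decides whether one more fork would stay within it. The function names are illustrative.

```python
def parse_pids_max(text):
    """Parse the content of a cgroup pids.max file; None means unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)

def fork_allowed(current, limit):
    """Simplified model: a fork succeeds only if the new count stays within the limit."""
    return limit is None or current + 1 <= limit

limit = parse_pids_max("100\n")   # e.g. a container started with --pids-limit 100
print(fork_allowed(99, limit))    # one slot left
print(fork_allowed(100, limit))   # limit reached, a fork bomb stalls here
```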
11. Namespaces
By default, processes in a container run as root (with a reduced set of capabilities)
Without user namespaces, root in container == root outside container (same UID 0)
Never run applications as root inside the container.
12. User Namespaces
Docker introduced support for user namespaces in version 1.10
Run as a non-root user:
--user UID:GID
Need root inside the container:
--userns-remap [uid[:gid]]
The Docker daemon needs to be started with --userns-remap=username/uid:groupname/gid.
Using "default" creates the "dockremap" user (--userns-remap=default)
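With remapping enabled, container IDs are offset into a subordinate ID range listed in /etc/subuid and /etc/subgid, one name:start:count entry per line. A minimal sketch of that translation follows; the entry dockremap:100000:65536 is a made-up example value, and the helper names are illustrative.

```python
def parse_subid_line(line):
    """Parse one /etc/subuid or /etc/subgid entry of the form name:start:count."""
    name, start, count = line.strip().split(":")
    return name, int(start), int(count)

def container_to_host_id(container_id, start, count):
    """Map an ID inside the user namespace to the corresponding host ID."""
    if not 0 <= container_id < count:
        raise ValueError("ID is outside the mapped range")
    return start + container_id

# Hypothetical entry; real ranges depend on the host configuration.
name, start, count = parse_subid_line("dockremap:100000:65536")
print(container_to_host_id(0, start, count))     # root in the container
print(container_to_host_id(1000, start, count))  # an ordinary container user
```

Under this example mapping, root inside the container is an unprivileged UID on the host.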
13. Docker internals
Architecture & Layouts
Capabilities
Capabilities divide system access into logical groups that may be individually granted to,
or removed from, different processes.
Capabilities allow system administrators to fine-tune what a process is allowed to do.
Each thread has five capability sets:
Effective
Permitted
Inheritable
Bounding
Ambient (since Linux 4.3)
Capabilities are not limited to processes: they can also be attached to executable
files (file capabilities).
14. Default Capabilities
Capability Key  Capability Description
SETPCAP  Modify process capabilities.
MKNOD  Create special files using mknod(2).
AUDIT_WRITE  Write records to kernel auditing log.
CHOWN  Make arbitrary changes to file UIDs and GIDs (see chown(2)).
NET_RAW  Use RAW and PACKET sockets.
DAC_OVERRIDE  Bypass file read, write, and execute permission checks.
FOWNER  Bypass permission checks on operations that normally require the file system UID of the process to match the UID of the file.
FSETID  Don’t clear set-user-ID and set-group-ID permission bits when a file is modified.
KILL  Bypass permission checks for sending signals.
SETGID  Make arbitrary manipulations of process GIDs and supplementary GID list.
SETUID  Make arbitrary manipulations of process UIDs.
NET_BIND_SERVICE  Bind a socket to internet domain privileged ports (port numbers less than 1024).
SYS_CHROOT  Use chroot(2), change root directory.
SETFCAP  Set file capabilities.
--cap-add: Add Linux capabilities
--cap-drop: Drop Linux capabilities
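Capability sets are stored as bit masks, visible for example on the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask into capability names using the kernel's bit numbering. The mask 00000000a80425fb is an assumption computed from the 14 default capabilities listed above; it was not read from a running container.

```python
# Capability names indexed by their kernel bit number (0 = CHOWN ... 36 = BLOCK_SUSPEND).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND",
]

def decode_caps(mask_hex):
    """Return the names of the capabilities whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]

# Assumed mask for the default capability set listed above.
DEFAULT_MASK = "00000000a80425fb"
print(decode_caps(DEFAULT_MASK))
```

Decoding the mask before and after a --cap-drop or --cap-add is a quick way to check which bits actually changed.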
15. Capabilities that can be added
Capability Key Capability Description
SYS_MODULE Load and unload kernel modules.
SYS_RAWIO Perform I/O port operations (iopl(2) and ioperm(2)).
SYS_PACCT Use acct(2), switch process accounting on or off.
SYS_ADMIN Perform a range of system administration operations.
SYS_NICE Raise process nice value (nice(2), setpriority(2)) and change the nice value for arbitrary processes.
SYS_RESOURCE Override resource limits.
SYS_TIME Set system clock (settimeofday(2), stime(2), adjtimex(2)); set real-time (hardware) clock.
SYS_TTY_CONFIG Use vhangup(2); employ various privileged ioctl(2) operations on virtual terminals.
AUDIT_CONTROL Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules.
MAC_OVERRIDE Override Mandatory Access Control (MAC). Implemented for the Smack Linux Security Module (LSM).
MAC_ADMIN Allow MAC configuration or state changes. Implemented for the Smack LSM.
NET_ADMIN Perform various network-related operations.
SYSLOG Perform privileged syslog(2) operations.
DAC_READ_SEARCH Bypass file read permission checks and directory read and execute permission checks.
LINUX_IMMUTABLE Set the FS_APPEND_FL and FS_IMMUTABLE_FL i-node flags.
NET_BROADCAST Make socket broadcasts, and listen to multicasts.
IPC_LOCK Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2)).
IPC_OWNER Bypass permission checks for operations on System V IPC objects.
SYS_PTRACE Trace arbitrary processes using ptrace(2).
SYS_BOOT Use reboot(2) and kexec_load(2), reboot and load a new kernel for later execution.
LEASE Establish leases on arbitrary files (see fcntl(2)).
WAKE_ALARM Trigger something that will wake up the system.
BLOCK_SUSPEND Employ features that can block system suspend.
16. Secure computing mode
Seccomp restricts the set of system calls an application can make.
Seccomp is a sandboxing facility in the Linux kernel that acts like a firewall for
system calls (syscalls).
Seccomp is part of the mainline Linux kernel; its filter mode (seccomp-bpf) is also used by Google Chrome to sandbox its processes.
It uses Berkeley Packet Filter (BPF) rules to filter syscalls.
17. Examples of blocked syscalls
Syscall  Description
acct  Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key  Prevent containers from using the kernel keyring, which is not namespaced.
adjtimex  Similar to clock_settime and settimeofday, time/date is not namespaced. Also gated by CAP_SYS_TIME.
bpf  Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN.
clock_adjtime  Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clock_settime  Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clone  Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
create_module  Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE.
delete_module  Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
…
18. larguas@ubuntu:~$ strace -c -f -S name ps 2>&1 1>/dev/null | tail -n +3 | head -n -2 | awk '{print $(NF)}'
access
arch_prctl
brk
close
execve
fstat
futex
write
…
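A syscall list like the one above can be turned into an allowlist seccomp profile in Docker's JSON profile format, where defaultAction applies to everything not explicitly listed. The sketch below is illustrative only: the syscall list is the truncated one from the slide, so the resulting profile is incomplete and a real container would need more syscalls allowed. The function name allowlist_profile is made up.

```python
import json

def allowlist_profile(syscalls):
    """Build a seccomp profile that denies everything except the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",  # any unlisted syscall fails with an error
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"},
        ],
    }

# Syscalls observed with strace above (list truncated on the slide).
observed = ["access", "arch_prctl", "brk", "close", "execve", "fstat", "futex", "write"]
profile = allowlist_profile(observed)
print(json.dumps(profile, indent=2))
```

Saved to a file, such a profile can be passed to docker run with --security-opt seccomp=<file>.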
Seccomp and the no-new-privileges option
Seccomp policies are applied before your container's process is executed, so they must
be less specific (they also have to allow the syscalls needed to start it) unless you use:
--security-opt no-new-privileges
19. To be continued…
7 Linux Security Modules
7.1 AppArmor
7.2 SELinux
8 The Docker daemon
9 Docker Security Best Practices