Control Groups
What do we have?
● cpuset - whole cores and cpu mapping
● cpuacct - cpu cycle accounting
● cpu - less than core granularity
● memory - limits and accounting
● blkio - limits and accounting
● net_cls - network classification
● net_prio - network priority
● Freezer + checkpoint/restore - migration
General structure
● tasks
– attach a task (thread) and show the list of threads
● cgroup.procs
– show list of processes
● cgroup.event_control
– an interface for eventfd()
# mount -t cgroup none /cgroups
# mount -t cgroup -o cpuset cpuset /cg/cpuset
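A minimal sketch of how these files are typically used once a hierarchy is mounted: create a child group with mkdir, then write a PID into its cgroup.procs. The helper name cg_attach and the group name demo are hypothetical, and the sketch assumes a cgroup v1 mount and root privileges.

```shell
# Hypothetical helper: create a child group and move a process into it.
# $1 = mounted hierarchy root, $2 = group name, $3 = PID to move.
cg_attach() {
    # On a real cgroup mount the kernel populates the new directory
    # with tasks, cgroup.procs and the controller files.
    mkdir -p "$1/$2" || return 1
    # Writing a PID to cgroup.procs moves the whole process (all threads);
    # writing a thread ID to "tasks" would move a single thread.
    echo "$3" > "$1/$2/cgroup.procs"
}

# Example (as root; the mount point is the one used above):
#   cg_attach /cgroups demo $$
```

On a hierarchy that includes cpuset, the write is rejected until cpuset.cpus and cpuset.mems of the new group have been populated.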
cpuset
● Physical CPU & Memory limits
– cpuset.cpus - a list of allowed CPUs
– cpuset.mems - a list of allowed memory nodes
– cpuset.cpu_exclusive - 0/1: are the CPUs exclusive to this
group (no other group can use them)?
– cpuset.mem_exclusive or cpuset.mem_hardwall - 0/1: are
the memory nodes exclusive to this group (no other group can
use them)?
– cpuset.sched_load_balance - should the kernel balance the
tasks between the CPUs in the current cpuset
– cpuset.sched_relax_domain_level
Documentation/cgroups/cpusets.txt
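As a rough illustration of cpuset.cpus and cpuset.mems, the sketch below pins a group to a CPU list and a memory node list. The helper name cpuset_pin and the example path are hypothetical. Both files have to be set before any task can be attached to a new cpuset.

```shell
# Hypothetical helper: restrict a cpuset group to given CPUs and memory nodes.
# $1 = group directory, $2 = CPU list (e.g. 0-1), $3 = memory node list (e.g. 0).
cpuset_pin() {
    echo "$2" > "$1/cpuset.cpus" &&
    echo "$3" > "$1/cpuset.mems"
}

# Example (as root; the group path is an assumption):
#   mkdir /cg/cpuset/demo
#   cpuset_pin /cg/cpuset/demo 0-1 0
#   echo $$ > /cg/cpuset/demo/tasks
```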
cpuset
● Physical CPU & Memory limits
– cpuset.sched_relax_domain_level
-1 : no request. use system default or follow request of others.
0 : no search.
1 : search siblings (hyperthreads in a core).
2 : search cores in a package.
3 : search cpus in a node [= system wide on non-NUMA system]
4 : search nodes in a chunk of node [on NUMA systems only]
5 : search system wide [on NUMA systems only]
Documentation/cgroups/cpusets.txt
CPU accounting
● cpuacct.usage - CPU usage combined for all CPUs (in nanoseconds)
● cpuacct.usage_percpu - CPU usage per CPU (in nanoseconds)
● cpuacct.stat - CPU time split into user and system (in USER_HZ)
● Documentation/cgroups/cpuacct.txt
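Because the combined counter (the cpuacct.usage file) is a running total in nanoseconds, utilization can be estimated from two samples. The sketch below shows the arithmetic. The helper names cpu_percent and cpuacct_sample are hypothetical, and the one-second interval is only an approximation of the real elapsed time.

```shell
# Percentage of one CPU used between two cpuacct.usage samples.
# $1, $2 = first and second reading (ns), $3 = wall-clock interval (ns).
cpu_percent() {
    echo $(( ($2 - $1) * 100 / $3 ))
}

# Hypothetical sampler: read cpuacct.usage twice, about one second apart.
# $1 = group directory.
cpuacct_sample() {
    first=$(cat "$1/cpuacct.usage") || return 1
    sleep 1
    second=$(cat "$1/cpuacct.usage") || return 1
    cpu_percent "$first" "$second" 1000000000
}
```

A result above 100 means the group kept more than one CPU busy during the interval.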
CPU
● CPU scheduler limits (CONFIG_CGROUP_SCHED)
– cpu.shares: relative weight of the group when competing for CPU with sibling groups (default 1024)
– cpu.cfs_quota_us: the total available run-time within a period (in
microseconds) (-1 no limit)
– cpu.cfs_period_us: the length of a period (in microseconds) (default
100ms)
– cpu.stat: exports throttling statistics
nr_periods: Number of enforcement intervals that have elapsed.
nr_throttled: Number of times the group has been throttled/limited.
throttled_time: The total time duration (in nanoseconds) for which
entities of the group have been throttled.
● Documentation/scheduler/sched-bwc.txt
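One way to read cpu.stat is as a ratio: how many of the elapsed enforcement periods ended with the group throttled. The sketch below computes that percentage with awk. The helper name throttled_pct is hypothetical, and the file is assumed to contain the three fields listed above, one per line.

```shell
# Print the percentage of enforcement periods in which the group was throttled.
# $1 = path to a cpu.stat file.
throttled_pct() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END { if (p > 0) printf "%d\n", t * 100 / p; else print 0 }' "$1"
}

# Example (the group path is an assumption):
#   throttled_pct /cg/cpu/demo/cpu.stat
```

A consistently high value suggests the quota is too small for the workload.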
CPU examples
1. Limit a group to 1 CPU worth of runtime. If period is 250ms and quota is also
250ms, the group will get 1 CPU worth of runtime every 250ms.
# echo 250000 > cpu.cfs_quota_us     # quota = 250ms
# echo 250000 > cpu.cfs_period_us    # period = 250ms
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine. With 500ms
period and 1000ms quota, the group can get 2 CPUs worth of runtime every 500ms.
# echo 1000000 > cpu.cfs_quota_us    # quota = 1000ms
# echo 500000 > cpu.cfs_period_us    # period = 500ms
The larger period here allows for increased burst capacity.
3. Limit a group to 20% of 1 CPU. With 50ms period, 10ms quota will be equivalent to
20% of 1 CPU.
# echo 10000 > cpu.cfs_quota_us      # quota = 10ms
# echo 50000 > cpu.cfs_period_us     # period = 50ms
By using a small period here we are ensuring a consistent latency response at the
expense of burst capacity.
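All three examples follow the same rule: quota = period x (percent of one CPU) / 100. The sketch below just encodes that arithmetic. The helper name cfs_quota is hypothetical.

```shell
# Compute cpu.cfs_quota_us for a given period and CPU allowance.
# $1 = period in microseconds, $2 = allowance in percent of one CPU
# (100 = one full CPU, 200 = two CPUs, 20 = a fifth of a CPU).
cfs_quota() {
    echo $(( $1 * $2 / 100 ))
}

# Example, equivalent to case 3 above (run inside the group directory):
#   cfs_quota 50000 20 > cpu.cfs_quota_us
#   echo 50000 > cpu.cfs_period_us
```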
memory
Only Memory
● memory.usage_in_bytes - show current res_counter usage for memory
● memory.limit_in_bytes - set/show limit of memory usage
● memory.failcnt - show the number of times memory usage hit the limit
● memory.max_usage_in_bytes - show max memory usage recorded
Memory + Swap
● memory.memsw.usage_in_bytes - show current res_counter usage for memory+swap
● memory.memsw.limit_in_bytes - set/show limit of memory+swap usage
● memory.memsw.failcnt - show the number of times memory+swap usage hit the limit
● memory.memsw.max_usage_in_bytes - show max memory+swap usage recorded
● memory.soft_limit_in_bytes - set/show soft limit of memory usage
● memory.stat - show various statistics
● memory.use_hierarchy - set/show whether hierarchical accounting is enabled
● memory.force_empty - trigger forced move of charges to the parent
● memory.pressure_level - set memory pressure notifications
● memory.swappiness - set/show swappiness parameter of vmscan
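A small sketch of how memory.usage_in_bytes and memory.limit_in_bytes can be combined to see how close a group is to its limit. The helper name mem_usage_pct and the example path are hypothetical. With no limit configured the limit file holds a very large number, so the result is simply 0.

```shell
# Print current memory usage as a percentage of the configured limit.
# $1 = memory cgroup directory.
mem_usage_pct() {
    usage=$(cat "$1/memory.usage_in_bytes") || return 1
    limit=$(cat "$1/memory.limit_in_bytes") || return 1
    echo $(( usage * 100 / limit ))
}

# Example (as root; the group path is an assumption):
#   echo 1073741824 > /cg/memory/demo/memory.limit_in_bytes   # 1 GiB
#   mem_usage_pct /cg/memory/demo
```

If the percentage stays near 100 and memory.failcnt keeps growing, the group is repeatedly hitting its limit.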
memory
● memory.move_charge_at_immigrate - set/show controls of moving charges
● memory.oom_control - set/show OOM controls
● memory.numa_stat - show memory usage per NUMA node
Kernel Memory limits
● memory.kmem.limit_in_bytes - set/show hard limit for kernel memory
● memory.kmem.usage_in_bytes - show current kernel memory allocation
● memory.kmem.failcnt - show the number of times kernel memory usage hit the limit
● memory.kmem.max_usage_in_bytes - show max kernel memory usage recorded
● memory.kmem.tcp.limit_in_bytes - set/show hard limit for TCP buffer memory
● memory.kmem.tcp.usage_in_bytes - show current TCP buffer memory allocation
● memory.kmem.tcp.failcnt - show the number of times TCP buffer memory usage hit the limit
● memory.kmem.tcp.max_usage_in_bytes - show max TCP buffer memory usage recorded
blkio statistics
● blkio.io_wait_time
● blkio.io_merged
● blkio.io_queued
● blkio.avg_queue_size
● blkio.group_wait_time
● blkio.throttle.io_serviced
● blkio.throttle.io_service_bytes
● blkio.sectors
● blkio.io_service_bytes
● blkio.io_serviced
● blkio.io_service_time
● blkio.*_recursive
● blkio.reset_stats
– write an integer to it to reset the statistics
blkio limiting
● blkio.weight - allowed range 10 - 1000
● blkio.weight_device - weight per device
● blkio.leaf_weight[_device] - when competing with
child cgroups
● blkio.time - disk time allocated in milliseconds
● blkio.throttle.read_bps_device
● blkio.throttle.write_bps_device
● blkio.throttle.read_iops_device
● blkio.throttle.write_iops_device
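The throttle files take one rule per write in the form "<major>:<minor> <rate>", where the rate is bytes per second for the bps files. The sketch below builds such a rule from a MiB/s figure. The helper name blkio_bps_rule is hypothetical, and 8:0 in the example is only an assumed device number (check yours with ls -l /dev).

```shell
# Build a "<major>:<minor> <bytes_per_second>" rule for the bps throttle files.
# $1 = device major, $2 = device minor, $3 = limit in MiB/s.
blkio_bps_rule() {
    echo "$1:$2 $(( $3 * 1024 * 1024 ))"
}

# Example (as root; group path and device numbers are assumptions):
#   blkio_bps_rule 8 0 10 > /cg/blkio/demo/blkio.throttle.read_bps_device
```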
Network
● net_cls - tag each cgroup's packets with a class ID so you can
later limit them with tc
– Documentation/cgroups/net_cls.txt
● net_prio - set the priority of network traffic per interface
– Documentation/cgroups/net_prio.txt
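net_cls.txt describes the class ID as a 32-bit value of the form 0xAAAABBBB, where AAAA is the major and BBBB the minor number of the tc handle. The sketch below formats a tc handle such as 10:1 that way. The helper name net_cls_classid and the example path are hypothetical.

```shell
# Format a tc handle (hex major and minor) as a net_cls.classid value.
# $1 = major handle in hex, $2 = minor handle in hex.
net_cls_classid() {
    printf '0x%04x%04x\n' "0x$1" "0x$2"
}

# Example (as root; the group path is an assumption):
#   net_cls_classid 10 1 > /cg/net_cls/demo/net_cls.classid
```

A tc filter using the cgroup classifier can then match packets from tasks in this group against class 10:1.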
Freezer + CRIU
● freezer.state
– THAWED
– FREEZING
– FROZEN
● freezer.self_freezing
– 0 if the group itself is THAWED, 1 otherwise
● freezer.parent_freezing
– 0 if no ancestor is frozen, 1 otherwise
● CRIU - Checkpoint/Restore In Userspace
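Freezing is asynchronous: after FROZEN is written, freezer.state may read FREEZING for a while before it settles. A minimal sketch of a freeze-and-wait helper follows. The name freeze_group, the default retry count and the example path are all hypothetical.

```shell
# Request a freeze and poll until the group reports FROZEN.
# $1 = freezer group directory, $2 = max polls, one second apart (default 10).
freeze_group() {
    echo FROZEN > "$1/freezer.state" || return 1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # The state reads FREEZING until every task has actually stopped.
        [ "$(cat "$1/freezer.state")" = "FROZEN" ] && return 0
        tries=$(( tries - 1 ))
        sleep 1
    done
    return 1
}

# Example (as root; the group path is an assumption):
#   freeze_group /cg/freezer/demo && echo "group is frozen"
#   echo THAWED > /cg/freezer/demo/freezer.state
```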

LSA2 - 02 Control Groups
