Containers @ Google


Slides from our presentation at the SF Bay Area Large Scale Production Engineering meetup on Lightweight Containers.

Published in: Technology

  1. Let Me Contain That For You
     Containers @ Google
     Victor Marmol, Rohit Jnagal
     SF Bay Area Large-Scale Production Engineering: Lightweight Containers Meetup
     February 20, 2014
     Google Confidential and Proprietary
  2. Containers in the Wild
     [Diagram: User 1, User 2, User 3, and User 4 on a shared Linux Kernel]
     ● Used to provide VM-like instances
     ● High density (lower costs) and high performance
     ● Fast to start
     ● Migration is hard, but possible
  3. The Need for Isolation: A Shared Google Machine
     [Diagram labels: I/O:CPU:Mem Sensitive Task, Front End Task, Back End Task, Alloc, BACKGROUND TASKS (System Daemons, Batch workload, Soaker workload)]
  4. Containers @ Google
     [Diagram labels: Task 1, Task 2, Alloc 1, SS1 to SS4, Sub 1 to Sub 4, Linux Kernel]
     ● Container-aware tasks use asymmetric subcontainers
     ● Provide different guarantees of quality of service
     ● Overcommit resources to achieve high utilization
     ● Early users, few namespaces, and near-zero overhead
  5. Asymmetric Isolation
     Isolating only certain resources (e.g., CPU but not memory).
     [Diagram: CPU, Memory, and Net isolation shown for Container 1, Container 2, and Container 3]
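One way to picture asymmetric isolation is a per-container bitmask of the resources it isolates. The sketch below is illustrative only: `Resource`, `ContainerSketch`, and `Isolates` are hypothetical names, not part of lmctfy.

```cpp
#include <cstdint>
#include <string>

// Hypothetical resource flags, mirroring the CPU / Memory / Net columns
// of the slide. These names are assumptions for illustration.
enum Resource : std::uint32_t {
  kCpu = 1u << 0,
  kMemory = 1u << 1,
  kNet = 1u << 2,
};

// A container records which resources it isolates as a bitwise OR of flags.
struct ContainerSketch {
  std::string name;
  std::uint32_t isolated;
};

// Returns true if the container isolates the given resource.
inline bool Isolates(const ContainerSketch& c, Resource r) {
  return (c.isolated & r) != 0;
}
```

A container created with only `kCpu` set would be CPU-isolated but share memory and network with its neighbours, which is the "CPU but not memory" case the slide describes.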
  6. Containers @ Google Today
     ● Historically
       ○ 2004: No isolation
       ○ 2006: Cgroups
       ○ Now: Namespaces
     ● Primarily Linux cgroups + user-space policies and monitoring
     ● We skipped VMs due to high overhead
     ● Used everywhere: SaaS, PaaS, IaaS; Android, Chrome OS
     ● Heterogeneous workloads: Latency, bandwidth, and priority
     ● High task churn
  7. Goals
     ● Isolation
       ○ Tasks do not impact each other
       ○ The behavior of a Task is the same regardless of what else is on the machine
     ● Predictability
       ○ Tasks behave the same each time they run
       ○ Unless they are specifically configured to use "slack"
     ● Quality of Service
       ○ Different tasks get different quality of resources
     ● Overcommitment
       ○ Oversell machine resources within QoS guarantees
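The overcommitment goal can be illustrated with a small feasibility check. This is a sketch under an assumed policy, not the method described in the slides: guaranteed reservations must fit on the machine, while the sum of limits is allowed to exceed capacity. `MemoryRequest`, `Feasible`, and `Overcommitted` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-task memory request, in bytes.
struct MemoryRequest {
  std::uint64_t reservation;  // amount the task is guaranteed
  std::uint64_t limit;        // most the task may ever use
};

// Assumed policy: a set of tasks is feasible if the guaranteed
// reservations fit within machine capacity.
bool Feasible(const std::vector<MemoryRequest>& tasks, std::uint64_t capacity) {
  std::uint64_t reserved = 0;  // invariant: reserved <= capacity
  for (const MemoryRequest& t : tasks) {
    if (t.reservation > capacity - reserved) return false;
    reserved += t.reservation;
  }
  return true;
}

// The machine is overcommitted when the limits add up to more than capacity.
bool Overcommitted(const std::vector<MemoryRequest>& tasks,
                   std::uint64_t capacity) {
  std::uint64_t total = 0;  // invariant: total <= capacity
  for (const MemoryRequest& t : tasks) {
    if (t.limit > capacity - total) return true;
    total += t.limit;
  }
  return false;
}
```

Under this assumption a machine can be both feasible and overcommitted at once: every task keeps its guarantee, and usage above the reservation competes for whatever slack remains.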
  8. lmctfy: Let Me Contain That For You
     Open source containers stack based on Google’s.
     Provides the Container abstraction to higher levels by abstracting away the kernel interfaces.
     Motivation
     ● Existing code, systems, and design around containers
     ● Problems with LXC
       ○ No abstraction (direct knob exposure)
       ○ No easy way to access programmatically
  9. lmctfy: Let Me Contain That For You
     Objectives
     ● Abstract away enforcement: separate policy from enforcement
     ● Scalability and parallel access
     ● Intent-based container specifications
     ● Asymmetric isolation
     ● Subcontainer support
     ● Provides tiers of quality of service
     System Layers
     ● CL1
       ○ Container abstraction and enforcement
       ○ Thin and light layer
       ○ Current lmctfy
     ● CL2
       ○ Sets policy (QoS, overcommitment)
       ○ Higher level logic, monitoring, and control loops
       ○ Stateful entity
  10. lmctfy: Fine-tuned resource isolation
      Current cgroup API is complicated with lots of knobs (each a cgroup file):
      Common: 5+ files
        cgroup.clone_children, cgroup.event_control, cgroup.procs, notify_on_release, release_agent
      CPU: 8+ files
        cpuacct.stat, cpuacct.usage, cpuacct.usage_percpu, cpu.cfs_period_us, cpu.cfs_quota_us, cpu.rt_period_us, cpu.rt_runtime_us, cpu.shares, cpu.stat
      Memory: 12+ files
        memory.failcnt, memory.force_empty, memory.limit_in_bytes, memory.max_usage_in_bytes, memory.move_charge_at_immigrate, memory.numa_stat, memory.oom_control, memory.pressure_level, memory.soft_limit_in_bytes, memory.stat, memory.swappiness, memory.usage_in_bytes, memory.use_hierarchy
      Cpuset: 12+ files
        cpuset.cpu_exclusive, cpuset.cpus, cpuset.mem_exclusive, cpuset.mem_hardwall, cpuset.memory_migrate, cpuset.memory_pressure, cpuset.memory_pressure_enabled, cpuset.memory_spread_page, cpuset.memory_spread_slab, cpuset.mems, cpuset.sched_load_balance, cpuset.sched_relax_domain_level
      +DiskIO +Net +...
  11. Released 0.4.0 (This Week!)
      Initial version of lowest layer
      ● Written entirely in C++
      ● Delivered as a CLI and a C++ library (C and Go bindings soon)
      ● Isolation for CPU, memory, and perf event
      ● Full support for subcontainers
      ● “Stateless” and lightweight
      ● Initial support for namespaces, more to come in the next week.
      Can be augmented with custom kernel patches
      ● CPU latency and accounting
      ● OOM priority
      Supported configurations
      ● Target configuration is well supported
      ● Designed to be flexible, but we test only a limited set of configurations
      ● More target configurations being added
      ● Contributions to add more are welcome
  12. Container Specifications
      message ContainerSpec {
        optional int64 owner = 1;
        optional CpuSpec cpu = 2;
        optional MemorySpec memory = 3;
        optional DiskIoSpec diskio = 4;
        optional NetworkSpec network = 5;
        optional VirtualHost virtualhost = 6;
        ...
      }
      message CpuSpec {
        optional SchedulingLatency scheduling_latency = 1;
        optional uint64 limit = 2;
        optional uint64 max_limit = 3;
        ...
      }
      Create: “cpu:<limit:1000 max_limit:2000> memory:<limit:4096000 reservation:1024000>”
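The Create string above has the shape `resource:<field:value ...>`. As a rough illustration of that shape only, the sketch below splits such a string into nested maps. It is a hand-written stand-in, not lmctfy's parser (a real implementation would presumably rely on the protobuf library), and `ParseSpecSketch`, `FieldMap`, and `SpecMap` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>

using FieldMap = std::map<std::string, std::string>;
using SpecMap = std::map<std::string, FieldMap>;

// Parses "res:<key:value key:value> res:<...>" into nested maps.
// Values are kept as strings so that both numbers and enum-like names fit.
// Returns false on malformed input. Handles only this flat shape.
bool ParseSpecSketch(const std::string& text, SpecMap* out) {
  std::size_t pos = 0;
  while (true) {
    pos = text.find_first_not_of(' ', pos);
    if (pos == std::string::npos) return true;  // all input consumed
    const std::size_t open = text.find(":<", pos);
    if (open == std::string::npos || open == pos) return false;
    const std::size_t close = text.find('>', open);
    if (close == std::string::npos) return false;
    const std::string resource = text.substr(pos, open - pos);
    const std::string body = text.substr(open + 2, close - open - 2);
    FieldMap& fields = (*out)[resource];
    std::size_t p = 0;
    // Walk the space-separated key:value pairs inside the angle brackets.
    while ((p = body.find_first_not_of(' ', p)) != std::string::npos) {
      std::size_t end = body.find(' ', p);
      if (end == std::string::npos) end = body.size();
      const std::string pair = body.substr(p, end - p);
      const std::size_t colon = pair.find(':');
      if (colon == std::string::npos || colon == 0) return false;
      fields[pair.substr(0, colon)] = pair.substr(colon + 1);
      p = end;
    }
    pos = close + 1;
  }
}
```

Parsing the slide's example would yield a `cpu` entry with `limit` and `max_limit`, and a `memory` entry with `limit` and `reservation`. Nested messages and quoted strings are deliberately out of scope here.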
  13. Cgroup Specifications
      Create: “cpu:<limit:1000 max_limit:2000 scheduling_latency:PRIORITY> memory:<limit:4096000 reservation:1024000>”
      equivalent lxc cgroup config:
        lxc.cgroup.cpu.shares = 2048
        lxc.cgroup.cpu.cfs_period_us = 50000
        lxc.cgroup.cpu.cfs_quota_us = 10000 = 25
        .. cpu performance knobs ..
        lxc.cgroup.memory.limit_in_bytes = 4096000
        lxc.cgroup.memory.soft_limit_in_bytes = 1024000
        .. memory performance knobs ..
  14. C++ API
      ::containers::lmctfy::ContainerApi
      ● Create
      ● Get
      ● Destroy
      ● Detect
      ● InitMachine
      ::containers::lmctfy::Container
      ● Update
      ● Run
      ● Notifications
      ● List (threads, PIDs, and subcontainers)
      ● Stats
      ● Pause/Resume
      ● KillAll
      CLI is a thin wrapper around the C++ API
  15. Container Names
      Path-like hierarchy of container names:
        Absolute: /parent/self
        Relative: self when in /parent

      Container Name                    Refers To
      /                                 The root top-level container
      /sys                              The sys top-level container
      /sys/sub                          The sub subcontainer of the sys top-level container
      . or ./                           The current container (current relative to the calling process)
      ..                                The parent container (parent relative to the calling process)
      ./foo_container or foo_container  The foo_container subcontainer of the current container
      /foo_container                    The foo_container top-level container
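The naming rules in the table can be sketched as a small resolver that turns any container name into an absolute one. This is an illustration of the rules as listed, not lmctfy's code: `SplitName` and `ResolveContainerName` are hypothetical helpers, and treating `..` at the root as the root itself is an assumption.

```cpp
#include <string>
#include <vector>

// Splits a path-like name on '/', dropping empty components.
static std::vector<std::string> SplitName(const std::string& name) {
  std::vector<std::string> parts;
  std::string part;
  for (char ch : name) {
    if (ch == '/') {
      if (!part.empty()) parts.push_back(part);
      part.clear();
    } else {
      part += ch;
    }
  }
  if (!part.empty()) parts.push_back(part);
  return parts;
}

// Resolves `name` to an absolute container name. `current` is the absolute
// name of the calling process's container and is used for relative names.
std::string ResolveContainerName(const std::string& current,
                                 const std::string& name) {
  std::vector<std::string> stack;
  // Relative names start from the current container; absolute ones from "/".
  if (name.empty() || name[0] != '/') stack = SplitName(current);
  for (const std::string& part : SplitName(name)) {
    if (part == ".") continue;  // "." refers to the current container
    if (part == "..") {
      if (!stack.empty()) stack.pop_back();  // assumption: root's parent is root
      continue;
    }
    stack.push_back(part);
  }
  if (stack.empty()) return "/";
  std::string result;
  for (const std::string& part : stack) result += "/" + part;
  return result;
}
```

For instance, resolving `self` from `/parent` gives `/parent/self`, and resolving `..` from `/sys/sub` gives `/sys`, matching the Absolute and Relative examples on the slide.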
  16. Roadmap
      Towards Version 1.0
      ● Improve VirtualHost support
      ● Root file systems
      ● Checkpoint restore
      ● Support and target most major distros
      ● Fully compatible with Docker’s use of containers
      Higher Layer
      ● Admission control and feasibility checks
      ● Monitoring, notifications, and statistics
      ● Tiers of quality of service guarantees
      Contributions Welcome!
  17. Questions?
      Repository:
      Mailing list:
      Victor Marmol:
      Rohit Jnagal: