© 2017 Mesosphere, Inc. All Rights Reserved. 1
Stockholm k8s CNCF meetup - May 8th, 2019
Dealing with Kubesprawl - Tetris style !
Taco Scargo
@tacoscargo
❏ Senior Solution Engineer - mostly active in Benelux and Nordics @ Mesosphere
❏ Building stuff with open source software for ~20 years
The trouble with Tribbles ...
Sprawl: verb ( used with object )
1. To stretch out as in sprawling
2. To spread out or distribute in a straggling manner
● +70 ways to deploy K8s *
● ~15 clusters per enterprise **
● 73% of clusters not managed by central IT ***
● 84% of clusters with fewer than 25 nodes ****
* Source: https://github.com/cncf/k8s-conformance
** Typically each project team creates 2 or more clusters: https://blog.newrelic.com/engineering/kubernetes-usage-data/
*** Source: https://kubernetes.io/blog/2017/01/kubernetes-ux-survey-infographic/
**** Source: Kubernetes application survey 2018
One cluster to rule them all ...
[Diagram: nine hosts ( HW or VM ), each running its own OS, sit under one Cluster Orchestrator]
Abstract away all resources into a single pool
One cluster to rule them all ...
● Shared kernel model perceived as soft multi-tenancy
● Disconnect in pace of change between dev and infra
● Scaling issues for monolithic schedulers
Perception is reality ….
One cluster to rule them all ...
[Diagram: six separate clusters of three hosts ( HW or VM, each with its own OS ): Team 1: Dev, Team 1: Prod, Team 2: Dev, Team 2: Prod, Team 3: Dev, Team 3: Prod]
● Tooling proliferation
● Operational overhead
● Biggest problem is COST ...
Static partitioning
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
Apache Mesos: The datacenter kernel
http://mesos.apache.org/
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
Building block of the modern internet
http://mesos.apache.org/documentation/latest/powered-by-mesos/
MESOS KERNEL APPLIES LESSONS FROM EARLY INNOVATORS
Production-proven Web Scale Cluster Managers

Tupperware/Bistro    Borg/Omega     Apache Mesos
Proprietary          Proprietary    Open Source (Apache License)
~2007                ~2001          2010+

● Built at UC Berkeley AMPLab by Ben Hindman (Mesosphere Co-founder)
● Built in collaboration with Google to overcome some Borg challenges
● Production-proven at scale: 80K+ hosts @ Twitter
Solving the Fail Whale
Two level scheduling
Mesos Master and Agents
● Abstract resources into single pool
● Offers and tracks resources
● Guarantees isolation
● Handles workload restart on failure
Mesos Framework
● Consumes resources
● Deploys tasks
● Provides application specific logic for deployment, recovery, upgrade
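The split of responsibilities above can be sketched as a toy model in Python. This is not the Mesos API: the `Agent`, `Framework`, `Master`, `on_offer` and `offer_round` names are invented for illustration, and the offer loop is much simpler than the real allocator. The point is only that the master pools and offers resources, while each framework decides for itself what to launch.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    free_gb: int


@dataclass
class Framework:
    """Second level: decides which offers to accept (hypothetical interface)."""
    name: str
    task_gb: int          # memory each task needs
    wanted: int           # number of tasks still to launch
    launched: list = field(default_factory=list)

    def on_offer(self, agent_name: str, offered_gb: int) -> int:
        """Return how many GB of the offer this framework consumes."""
        if self.wanted == 0 or offered_gb < self.task_gb:
            return 0      # decline: the resources stay available for others
        self.wanted -= 1
        self.launched.append(agent_name)
        return self.task_gb


class Master:
    """First level: pools agent resources and offers them to frameworks."""

    def __init__(self, agents, frameworks):
        self.agents = agents
        self.frameworks = frameworks

    def offer_round(self) -> None:
        for agent in self.agents:
            for fw in self.frameworks:
                used = fw.on_offer(agent.name, agent.free_gb)
                agent.free_gb -= used   # track what is left on the agent


agents = [Agent("agent-1", 8), Agent("agent-2", 4)]
spark = Framework("spark", task_gb=4, wanted=2)
kafka = Framework("kafka", task_gb=2, wanted=1)
master = Master(agents, [spark, kafka])
master.offer_round()
print(spark.launched, kafka.launched, [a.free_gb for a in agents])
```

Note that the master never needs to know what a Spark or Kafka task is; all application-specific logic lives in the framework.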
[Diagram: a Zookeeper Quorum backs one leading Master and two standby Masters; Frameworks such as Spark, Kubernetes and Kafka run their Executors and Tasks on the Agents]
Everything is a container ...
"Mesos is the golden standard for production grade, large scale clusters"
Solomon Hykes, Co-founder Docker
MULTIPLEXING
Typical Deployment: siloed, over-provisioned servers, low utilization
Apache Mesos: automated schedulers, workload multiplexing onto the same machines
[Diagram: Workloads 1 to 5]
DC/OS - Distributed Cloud Operating System
● Service Discovery
● Load Balancing
● Security
● Ease of installation
● Comprehensive tooling for operations
● Built-in frameworks for long running and scheduled jobs
● Catalog of pre-configured apps (including Kubernetes, Apache Spark, Apache Kafka…), browse at http://universe.dcos.io/
● And much more: https://dcos.io/
DC/OS Architecture Overview
● Services & Containers: HDFS, Jenkins, Marathon, Cassandra, Flink, Spark, Docker, Kafka, MongoDB, +30 more...
● DC/OS: Security & Governance, Container Orchestration, Monitoring & Operations, User Interface & Command Line
● ANY INFRASTRUCTURE
Bin packing ..
Schedulers enable full lifecycle management
● Upgrades
● Downgrades
● Scale Up
● Scale Down
● Failure Scenarios
● And more ...
Scheduler plans
Macbook-Pro:kube tscargo$ dcos kubernetes plan list
[
"deploy",
"recovery",
"replace",
"update",
"uninstall"
]
Scheduler plans
Macbook-Pro:kube tscargo$ dcos kubernetes plan status deploy
deploy (serial strategy) (IN_PROGRESS)
├─ etcd (serial strategy) (COMPLETE)
│ ├─ etcd-0:[peer] (COMPLETE)
│ ├─ etcd-1:[peer] (COMPLETE)
│ └─ etcd-2:[peer] (COMPLETE)
├─ apiserver (parallel strategy) (COMPLETE)
│ ├─ kube-apiserver-0:[instance] (COMPLETE)
│ ├─ kube-apiserver-1:[instance] (COMPLETE)
│ └─ kube-apiserver-2:[instance] (COMPLETE)
├─ kubernetes-api-proxy (parallel strategy) (COMPLETE)
│ └─ kubernetes-api-proxy-0:[install] (COMPLETE)
├─ controller-manager (parallel strategy) (COMPLETE)
…………………..
No Forks !
Allocate resources for Pods
[Diagram: on a Mesos Agent, a kube-node runs inside a Mesos Container and holds:]
● Kubelet
● Kube-Proxy
● Container Runtime
Allocate resources for Pods
In Kubernetes :
● --kube-reserved : how much resource is reserved for kubelet, container runtime etc.
● --system-reserved : how much resource is reserved for OS functions
In DC/OS Kubernetes framework :
● Reserved : how much resource is reserved for kubelet, container runtime etc.
● Allocatable : how much resource should be allocated for pods
Allocate resources for Pods
Kubelet reads total resource from /proc/cpuinfo and /proc/meminfo on startup
● Allocatable = total resource - ( system-reserved + kube-reserved )
On DC/OS this would be the total available on the agent, NOT the Mesos allocation
When we start the kubelet we configure :
● System-reserved = ( Total available - agent.reserved ) - agent.allocatable
Allocate resources for Pods
Agent Node: 20GB
Kube-Node: 3GB ( 1GB Reserved + 2GB Allocatable )
System-reserved = ( 20 - 1 ) - 2 = 17GB
Kube-reserved = 1GB
Allocatable = 2GB
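The arithmetic above can be checked with a short Python sketch. The function and parameter names are made up for illustration and are not part of the DC/OS Kubernetes framework; the two formulas are the ones given earlier for System-reserved and for the kubelet's Allocatable.

```python
def kubelet_reservations(total_gb, agent_reserved_gb, agent_allocatable_gb):
    """Derive kubelet settings from the framework's Reserved/Allocatable values."""
    # The framework's "Reserved" maps onto the kubelet's kube-reserved.
    kube_reserved = agent_reserved_gb
    # System-reserved = ( Total available - agent.reserved ) - agent.allocatable
    system_reserved = (total_gb - agent_reserved_gb) - agent_allocatable_gb
    # Allocatable = total resource - ( system-reserved + kube-reserved )
    allocatable = total_gb - (system_reserved + kube_reserved)
    return {
        "system_reserved": system_reserved,
        "kube_reserved": kube_reserved,
        "allocatable": allocatable,
    }


result = kubelet_reservations(total_gb=20, agent_reserved_gb=1, agent_allocatable_gb=2)
print(result)  # {'system_reserved': 17, 'kube_reserved': 1, 'allocatable': 2}
```

Inflating System-reserved this way is what makes the kubelet, which sees all 20GB of the host, advertise only the 2GB it was actually given for pods.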
Allocate resources for Pods
[Diagram sequence: a Mesos Agent with 10GB. Each accepted Mesos resource offer places one more 2GB allocation on the agent, so the memory available drops to 8GB, 6GB, 4GB and then 2GB. Once five 2GB allocations are placed, nothing is left to offer:]
No Resource Offer
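The sequence above amounts to a simple loop, sketched here in Python. The function name is hypothetical and the model ignores CPU, disk and everything else a real Mesos offer carries; it only tracks memory on a single agent.

```python
def place_allocations(agent_gb: int, allocation_gb: int) -> list:
    """Place fixed-size allocations until the agent cannot fit another one.

    Returns the memory still available after each placement.
    """
    available = agent_gb
    history = []
    while available >= allocation_gb:   # an offer this size is still possible
        available -= allocation_gb      # accept the offer, consume the memory
        history.append(available)
    return history                      # loop ends: no resource offer


remaining = place_allocations(agent_gb=10, allocation_gb=2)
print(remaining)  # [8, 6, 4, 2, 0]
```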
Containers
Control Groups ( CGroups )
● Linux kernel feature
○ Limits and Accounting
● Hierarchical - inherits limits from parent
● Kernel will kill processes to maintain limits
● Accessed via the cgroup virtual filesystem
○ /sys/fs/cgroup
Cadvisor
Google code to understand resource usage and
performance characteristics of running containers
Kubelet relies on Cadvisor to get this data
Cadvisor looks at /sys/fs/cgroup hierarchy to get resource usage eg.
/sys/fs/cgroup/memory/memory.usage_in_bytes
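To make the lookup concrete, here is a minimal Python sketch of reading that file. Cadvisor itself is written in Go and does far more; the helper name is invented, and the path assumes the cgroup v1 layout used on these slides.

```python
from pathlib import Path


def memory_usage_bytes(cgroup_root: str = "/sys/fs/cgroup/memory") -> int:
    """Read the current memory usage of the cgroup rooted at cgroup_root."""
    usage_file = Path(cgroup_root) / "memory.usage_in_bytes"
    return int(usage_file.read_text().strip())


if __name__ == "__main__":
    try:
        print(memory_usage_bytes())
    except OSError:
        # For example on hosts that only mount cgroup v2.
        print("cgroup v1 memory controller not found at the default path")
```

Whatever cgroup is mounted at that path determines whose usage is reported, which is why the location of the mount matters for the kubelet.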
Kubelet Eviction
[Diagram: a 10GB node running three 3GB Pods, Docker ( 500MB ) and the Kubelet ( 1GB )]
● Kubelet reads 10GB available from /proc/meminfo
● Cadvisor reports 9.5GB in use
● Kubelet reports memory pressure
● Kubelet evicts pods based on configuration
[Diagram: after eviction the 10GB node runs two 3GB Pods, Docker ( 500MB ) and the Kubelet ( 1GB )]
● Memory pressure recedes
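The eviction flow can be approximated in a few lines of Python. This is a simplification under stated assumptions: `min_available_gb` is a hypothetical eviction threshold (the slides do not give one), largest-first ordering stands in for the kubelet's real ranking, and `overhead_gb` represents the 500MB used by Docker, so that three 3GB pods give the reported 9.5GB in use.

```python
def evict_until_ok(capacity_gb, pods_gb, overhead_gb, min_available_gb):
    """Evict pods, largest first, until available memory clears the threshold.

    Returns (pods still running, pods evicted).
    """
    running = sorted(pods_gb, reverse=True)
    evicted = []
    # Memory pressure: available memory is below the configured threshold.
    while running and capacity_gb - (sum(running) + overhead_gb) < min_available_gb:
        evicted.append(running.pop(0))
    return running, evicted


running, evicted = evict_until_ok(
    capacity_gb=10, pods_gb=[3, 3, 3], overhead_gb=0.5, min_available_gb=1
)
print(running, evicted)  # [3, 3] [3]
```

With 9.5GB in use only 0.5GB is available, so one pod is evicted; at 6.5GB in use the pressure recedes and the loop stops.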
Mesos and Cgroups
For resources Mesos manages, each task is assigned a cgroup under the root Mesos cgroup :
/sys/fs/cgroup/memory/mesos/UUID
UUID is the UUID of the kube-node container
When we run Kubernetes on Mesos, pods are assigned to a cgroup in :
/sys/fs/cgroup/memory/mesos/UUID/kubepods/
Cadvisor reads /sys/fs/cgroup/memory/memory.usage_in_bytes, so it gets the host usage, not the usage of the kube-node container ...
Per Container Cgroups
New code added to Mesos, for all isolators in use ...
Bind mount inside container :
/sys/fs/cgroup/memory/mesos/UUID -> /sys/fs/cgroup/memory
Cadvisor reads /sys/fs/cgroup/memory/memory.usage_in_bytes, and now gets the container usage ...
Under pressure ...
TOTAL RAM 20GB
Mesos allocation 10GB
● Kubelet reads total RAM from /proc/meminfo on startup
○ This is the host RAM in our case
● Kubelet gets RAM usage from cadvisor
● Kubelet thinks it’s running in 20GB of RAM
● Need to avoid OOM of Kubelet or Docker ( likely Kubelet )
[Diagram: Kubelet, Docker, Pods]
Under pressure ...
● No disk cgroup subsystem
● Kubelet gets disk usage from Cadvisor
● Cadvisor calls statfs syscall
○ This returns disk usage on the host
Kubelet-resource-watchdog ...
● Sidecar container
● Monitors pods, container runtime and kubelet resource usage
● Signals to the control plane that node is under pressure
● Impersonates the kubelet to evict pods
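The slides do not show how the watchdog decides that a node is under pressure, so the following Python sketch is purely illustrative: the `Usage` type, the function name and the `headroom_gb` safety margin are all assumptions. It only captures the idea that the combined usage of pods, container runtime and kubelet is compared against the Mesos allocation rather than against the host RAM the kubelet sees.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    pods_gb: float
    runtime_gb: float
    kubelet_gb: float

    def total(self) -> float:
        return self.pods_gb + self.runtime_gb + self.kubelet_gb


def node_under_pressure(usage: Usage, mesos_allocation_gb: float,
                        headroom_gb: float = 0.5) -> bool:
    """True when combined usage comes within headroom_gb of the Mesos allocation."""
    return usage.total() > mesos_allocation_gb - headroom_gb


print(node_under_pressure(Usage(8.5, 0.5, 1.0), mesos_allocation_gb=10))  # True
print(node_under_pressure(Usage(4.0, 0.5, 1.0), mesos_allocation_gb=10))  # False
```

A check like this would fire well before the 20GB the kubelet believes it has is exhausted, which is the gap the watchdog is there to close.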
Networking ..
● Isolation
● No forks
● Able to use off-the-shelf CNI plugin
● Able to use standard functionality
○ Service discovery, Policy, Load balancing, Ingress
● Reasonably Performant
● Dynamic configuration
Connect Pods to host network - using Overlay
Connect Pods directly to host network - Using Routing
Connect Pods to Kube-node - Using Routing
Use two interfaces
Overlay on Overlay
MKE
Try it out !
Questions?
Taco Scargo
Twitter: @tacoscargo
Email: tscargo@mesosphere.com
https://dcos.io
@dcos & @mesosphere
users@dcos.io
/dcos
/dcos/examples
/dcos/demos
chat.dcos.io
