Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Highload++

Henning Jacobs
Henning JacobsSenior Principal at Zalando SE
Optimizing Kubernetes
Resource Requests/Limits
for Cost-Efficiency and Latency
Henning Jacobs (Zalando SE)
@try_except_
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Highload++
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Highload++
4
ZALANDO AT A GLANCE
~ 4.5billion EUR
revenue 2017
> 200
million
visits
per
month
> 15.000
employees in
Europe
> 70%
of visits via
mobile devices
> 24
million
active customers
> 300.000
product choices
~ 2.000
brands
17
countries
5
SCALE
100Clusters
373Accounts
6
KUBERNETES: IT'S ALL ABOUT RESOURCES
Node Node
Pods demand capacity
Nodes offer capacity
Scheduler
7
COMPUTE RESOURCE TYPES
● CPU
● Memory
● Local ephemeral storage (1.12+)
● Extended Resources
○ GPU
○ TPU?
Node
8
KUBERNETES RESOURCES
CPU
○ Base: 1 AWS vCPU (or GCP Core or ..)
○ Example: 100m (0.1 vCPU, "100 Millicores")
Memory
○ Base: 1 Byte
○ Example: 500Mi (500 MiB memory)
9
REQUESTS / LIMITS
Requests
○ Affect Scheduling Decision
○ Priority (CPU, OOM adjust)
Limits
○ Limit maximum container usage
resources:
requests:
cpu: 100m
memory: 300Mi
limits:
cpu: 1
memory: 300Mi
10
CPU
Memory
REQUESTS: POD SCHEDULING
CPU
Memory
CPU
Memory
Node 1
Node 2
Pod 1
Requests
11
POD SCHEDULING
CPU
Memory
CPU
Memory
Node 1
Node 2
Pod 2
12
POD SCHEDULING
CPU
Memory
CPU
Memory
Node 1
Node 2
Pod 3
13
POD SCHEDULING
CPU
Memory
CPU
Memory
Node 1
Node 2
Pod 4
14
POD SCHEDULING: TRY TO FIT
CPU
Memory
CPU
Memory
Node 1
Node 2
15
POD SCHEDULING: NO CAPACITY
CPU
Memory
CPU
Memory
Node 1
Node 2
Pod 4
"PENDING"
16
REQUESTS: CPU SHARES
docker run --requests=cpu=10m/5m ..sha512()..
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod5d5..0d/cpu.shares
10 // relative share of CPU time
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod6e0..0d/cpu.shares
5 // relative share of CPU time
cat /sys/fs/cgroup/cpuacct/kubepods/burstable/pod5d5..0d/cpuacct.usage
/sys/fs/cgroup/cpuacct/kubepods/burstable/pod6e0..0d/cpuacct.usage
13432815283 // total CPU time in nanoseconds
7528759332 // total CPU time in nanoseconds
17
LIMITS: COMPRESSIBLE RESOURCES
Can be taken away quickly,
"only" cause slowness
CPU Throttling
200m CPU limit ⇒
container can use
0.2s of CPU time per second
18
CPU THROTTLING
docker run --cpus CPUS -it python
python -m timeit -s 'import hashlib' -n 10000 -v
'hashlib.sha512().update(b"foo")'
CPUS=1.0 3.8 - 4ms
CPUS=0.5 3.8 - 52ms
CPUS=0.2 6.8 - 88ms
CPUS=0.1 5.7 - 190ms
19
LIMITS: NON-COMPRESSIBLE RESOURCES
Hold state,
are slower to take away.
Killing (OOMKill)
20
MEMORY LIMITS: OUT OF MEMORY
kubectl get pod
NAME READY STATUS RESTARTS AGE
kube-ops-view-7bc-tcwkt 0/1 CrashLoopBackOff 3 2m
kubectl describe pod kube-ops-view-7bc-tcwkt
...
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
21
QOS
Guaranteed: all containers have limits == requests
Burstable: some containers have limits > requests
BestEffort: no requests/limits set
kubectl describe pod …
…
Limits:
memory: 100Mi
Requests:
cpu: 100m
memory: 100Mi
QoS Class: Burstable
22
OVERCOMMIT
Limits > Requests ⇒ Burstable QoS ⇒ Overcommit
Overcommit for CPU: fine, running into completely fair scheduling
Overcommit for memory: fine, as long as demand < node capacity
https://code.fb.com/production-engineering/oomd/
Might run into unpredictable OOM situations
when demand reaches node's memory
capacity (Kernel OOM Killer)
23
LIMITS: CGROUPS
docker run --cpus 1 -m 200m --rm -it busybox
cat /sys/fs/cgroup/cpu/docker/8ab25..1c/cpu.{shares,cfs_*}
1024 // cpu.shares (default value)
100000 // cpu.cfs_period_us (100ms period length)
100000 // cpu.cfs_quota_us (total CPU time in µs consumable per period)
cat /sys/fs/cgroup/memory/docker/8ab25..1c/memory.limit_in_bytes
209715200
24
LIMITS: PROBLEMS
1. CPU CFS Quota: Latency
2. Memory: accounting, OOM behavior
25
PROBLEMS: LATENCY
https://github.com/zalando-incubator/kubernetes-on-aws/pull/923
26
PROBLEMS: HARDCODED PERIOD
27
PROBLEMS: HARDCODED PERIOD
https://github.com/kubernetes/kubernetes/issues/51135
28
WHAT'S A GOOD VALUE FOR CFS PERIOD?
Latencyp95→
29
NOW IN KUBERNETES 1.12
https://github.com/kubernetes/kubernetes/pull/63437
30
OVERLY AGGRESSIVE CFS
31
OVERLY AGGRESSIVE CFS: EXPERIMENT #1
CPU Period: 100ms
CPU Quota: None
Burn 5ms and sleep 100ms
⇒ Quota disabled
⇒ No Throttling expected!
https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
32
EXPERIMENT #1: NO QUOTA, NO THROTTLING
2018/11/03 13:04:02 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 6ms
2018/11/03 13:04:03 [1] burn took 5ms, real time so far: 510ms, cpu time so far: 11ms
2018/11/03 13:04:03 [2] burn took 5ms, real time so far: 1015ms, cpu time so far: 17ms
2018/11/03 13:04:04 [3] burn took 5ms, real time so far: 1520ms, cpu time so far: 23ms
2018/11/03 13:04:04 [4] burn took 5ms, real time so far: 2025ms, cpu time so far: 29ms
2018/11/03 13:04:05 [5] burn took 5ms, real time so far: 2530ms, cpu time so far: 35ms
2018/11/03 13:04:05 [6] burn took 5ms, real time so far: 3036ms, cpu time so far: 40ms
2018/11/03 13:04:06 [7] burn took 5ms, real time so far: 3541ms, cpu time so far: 46ms
2018/11/03 13:04:06 [8] burn took 5ms, real time so far: 4046ms, cpu time so far: 52ms
2018/11/03 13:04:07 [9] burn took 5ms, real time so far: 4551ms, cpu time so far: 58ms
33
OVERLY AGGRESSIVE CFS: EXPERIMENT #2
CPU Period: 100ms
CPU Quota: 20ms
Burn 5ms and sleep 500ms
⇒ No 100ms intervals where possibly 20ms is burned
⇒ No Throttling expected!
34
EXPERIMENT #2: OVERLY AGGRESSIVE CFS
2018/11/03 13:05:05 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 5ms
2018/11/03 13:05:06 [1] burn took 99ms, real time so far: 690ms, cpu time so far: 9ms
2018/11/03 13:05:06 [2] burn took 99ms, real time so far: 1290ms, cpu time so far: 14ms
2018/11/03 13:05:07 [3] burn took 99ms, real time so far: 1890ms, cpu time so far: 18ms
2018/11/03 13:05:07 [4] burn took 5ms, real time so far: 2395ms, cpu time so far: 24ms
2018/11/03 13:05:08 [5] burn took 94ms, real time so far: 2990ms, cpu time so far: 27ms
2018/11/03 13:05:09 [6] burn took 99ms, real time so far: 3590ms, cpu time so far: 32ms
2018/11/03 13:05:09 [7] burn took 5ms, real time so far: 4095ms, cpu time so far: 37ms
2018/11/03 13:05:10 [8] burn took 5ms, real time so far: 4600ms, cpu time so far: 43ms
2018/11/03 13:05:10 [9] burn took 5ms, real time so far: 5105ms, cpu time so far: 49ms
35
OVERLY AGGRESSIVE CFS: EXPERIMENT #3
CPU Period: 10ms
CPU Quota: 2ms
Burn 5ms and sleep 100ms
⇒ Same 20% CPU (200m) limit, but smaller period
⇒ Throttling expected!
36
SMALLER CPU PERIOD ⇒ BETTER LATENCY
2018/11/03 16:31:07 [0] burn took 18ms, real time so far: 18ms, cpu time so far: 6ms
2018/11/03 16:31:07 [1] burn took 9ms, real time so far: 128ms, cpu time so far: 8ms
2018/11/03 16:31:07 [2] burn took 9ms, real time so far: 238ms, cpu time so far: 13ms
2018/11/03 16:31:07 [3] burn took 5ms, real time so far: 343ms, cpu time so far: 18ms
2018/11/03 16:31:07 [4] burn took 30ms, real time so far: 488ms, cpu time so far: 24ms
2018/11/03 16:31:07 [5] burn took 19ms, real time so far: 608ms, cpu time so far: 29ms
2018/11/03 16:31:07 [6] burn took 9ms, real time so far: 718ms, cpu time so far: 34ms
2018/11/03 16:31:08 [7] burn took 5ms, real time so far: 824ms, cpu time so far: 40ms
2018/11/03 16:31:08 [8] burn took 5ms, real time so far: 943ms, cpu time so far: 45ms
2018/11/03 16:31:08 [9] burn took 9ms, real time so far: 1068ms, cpu time so far: 48ms
37
LIMITS: VISIBILITY
docker run --cpus 1 -m 200m --rm -it busybox top
Mem: 7369128K used, 726072K free, 128164K shrd, 303924K buff, 1208132K cached
CPU0: 14.8% usr 8.4% sys 0.2% nic 67.6% idle 8.2% io 0.0% irq 0.6% sirq
CPU1: 8.8% usr 10.3% sys 0.0% nic 75.9% idle 4.4% io 0.0% irq 0.4% sirq
CPU2: 7.3% usr 8.7% sys 0.0% nic 63.2% idle 20.1% io 0.0% irq 0.6% sirq
CPU3: 9.3% usr 9.9% sys 0.0% nic 65.7% idle 14.5% io 0.0% irq 0.4% sirq
38
• Container-aware memory configuration
• JVM MaxHeap
• Container-aware processor configuration
• Thread pools
• GOMAXPROCS
• node.js cluster module
LIMITS: VISIBILITY
39
ZALANDO: DEVELOPERS USING KUBERNETES
40
KUBERNETES RESOURCES
41
ZALANDO: DECISION
• Forbid Memory Overcommit
• Implement mutating admission webhook
• Set requests = limits
• Disable CPU CFS Quota in all clusters
42
ZALANDO: DISABLING CPU LIMITS
43
INGRESS LATENCY IMPROVEMENT
Example: Getting started
with Zalenium & UI Tests
Example: Step by step guide to the first UI test with Zalenium running in the
Continuous Delivery Platform. I was always afraid of UI tests because it looked too
difficult to get started, Zalenium solved this problem for me.
45
CLUSTER AUTOSCALER
Simulates the Kubernetes scheduler internally to find out..
• ..if any of the pods wouldn’t fit on existing nodes
⇒ upscale is needed
• ..if it’s possible to fit some of the pods on existing nodes
⇒ downscale is needed
⇒ Cluster size is determined by resource requests (+ constraints)
github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
46
ALLOCATABLE
Reserve resources for
system components,
Kubelet, and container runtime:
--system-reserved=
cpu=100m,memory=164Mi
--kube-reserved=
cpu=100m,memory=282Mi
47
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
SLACK
CPU
Memory
Node
"Slack"
48
STRANDED RESOURCES
Stranded
CPU
Memory
CPU
Memory
Node 1
Node 2
Some available capacity
can become unusable /
stranded.
⇒ Reschedule, bin packing
49
HOW TO FIND GOOD VALUES FOR REQUESTS?
resources:
limits:
memory: "2Gi"
requests:
cpu: "1.5"
memory: "2Gi"
50
TWO TOOLS
● Kubernetes Resource Report
○ Dashboard showing Kubernetes Resource Slack
● Kubernetes Application Dashboard
○ Grafana Dashboard showing historical CPU and Memory
usage for Kubernetes Applications
51
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
52
KUBERNETES RESOURCE REPORT
"Slack"
53
KUBERNETES APPLICATION DASHBOARD
54
VERTICAL POD AUTOSCALER (VPA)
"Some 2/3 of the Borg users use autopilot."
- Tim Hockin
VPA: Set resource requests
automatically based on usage.
55
VPA FOR PROMETHEUS
apiVersion: poc.autoscaling.k8s.io/v1alpha1
kind: VerticalPodAutoscaler
metadata:
name: prometheus-vpa
namespace: kube-system
spec:
selector:
matchLabels:
application: prometheus
updatePolicy:
updateMode: Auto
CPU/memory
56
AUTOSCALING BUFFER
• Cluster Autoscaler only triggers on Pending Pods
• Node provisioning is slow
⇒ Reserve extra capacity via low priority Pods
"Autoscaling Buffer Pods"
57
AUTOSCALING BUFFER
kubectl describe pod autoscaling-buffer-eu-central-1a-78-rzjq5 -n kube-system
...
Namespace: kube-system
Priority: -1000000
PriorityClassName: autoscaling-buffer
Containers:
pause:
Image: teapot/pause-amd64:3.1
Requests:
cpu: 1600m
memory: 6871947673
Evict if higher
priority (default)
Pod needs
capacity
58
HORIZONTAL POD AUTOSCALER
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: myapp
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 3
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
targetAverageUtilization: 100
target: ~100% of
CPU requests
...
59
HORIZONTAL POD AUTOSCALING (CUSTOM METRICS)
Queue Length
Prometheus Query
Ingress Req/s
ZMON Check
github.com/zalando-incubator/kube-metrics-adapter
60
DOWNSCALING DURING OFF-HOURS
github.com/hjacobs/kube-downscaler
Weekend
61
DOWNSCALING DURING OFF-HOURS
DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET"
annotations:
downscaler/exclude: "true"
github.com/hjacobs/kube-downscaler
62
WHAT WORKED FOR US
● Disable CPU CFS Quota in all clusters
● Admission Controller:
○ Prevent memory overcommit
○ Set default requests
● Kubernetes Resource Report
● Downscaling during off-hours
63
WRAP UP
● Understanding Requests/Limits
● Deciding on quota period (or disable), overcommit
● Autoscaling (cluster-autoscaler, HPA, VPA, downscaler)
● Buffer capacity
● Optimizing slack
64
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
https://github.com/hjacobs/kube-ops-view
66
OTHER TALKS
• Everything You Ever Wanted to Know About Resource Scheduling
• Inside Kubernetes Resource Management (QoS) - KubeCon 2018
• Setting Resource Requests and Limits in Kubernetes (Best Practices)
QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k
1 of 67

Recommended

Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering... by
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...
Ensuring Kubernetes Cost Efficiency across (many) Clusters - DevOps Gathering...Henning Jacobs
4.3K views79 slides
Performance Tuning EC2 Instances by
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 InstancesBrendan Gregg
171.6K views81 slides
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD) by
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)충섭 김
12K views85 slides
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc... by
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Henning Jacobs
2.4K views83 slides
Kubernetes Requests and Limits by
Kubernetes Requests and LimitsKubernetes Requests and Limits
Kubernetes Requests and LimitsAhmed AbouZaid
891 views25 slides
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안 by
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
15.2K views54 slides

More Related Content

What's hot

Introduction to Kafka Cruise Control by
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
25.7K views47 slides
Netflix: From Clouds to Roots by
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to RootsBrendan Gregg
67.8K views97 slides
Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ... by
Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ...Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ...
Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ...AWSKRUG - AWS한국사용자모임
2.7K views49 slides
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... by
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
573 views34 slides
Community Openstack 구축 사례 by
Community Openstack 구축 사례Community Openstack 구축 사례
Community Openstack 구축 사례Open Source Consulting
846 views20 slides
Apache kafka performance(latency)_benchmark_v0.3 by
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3SANG WON PARK
1.8K views13 slides

What's hot(20)

Introduction to Kafka Cruise Control by Jiangjie Qin
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin25.7K views
Netflix: From Clouds to Roots by Brendan Gregg
Netflix: From Clouds to RootsNetflix: From Clouds to Roots
Netflix: From Clouds to Roots
Brendan Gregg67.8K views
Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ... by AWSKRUG - AWS한국사용자모임
Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ...Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ...
Firecracker, 서버리스 컴퓨팅을 위한 오픈소스 microVM 기술 :: 류한진 - AWS ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... by Flink Forward
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward573 views
Apache kafka performance(latency)_benchmark_v0.3 by SANG WON PARK
Apache kafka performance(latency)_benchmark_v0.3Apache kafka performance(latency)_benchmark_v0.3
Apache kafka performance(latency)_benchmark_v0.3
SANG WON PARK1.8K views
Hands-On Introduction to Kubernetes at LISA17 by Ryan Jarvinen
Hands-On Introduction to Kubernetes at LISA17Hands-On Introduction to Kubernetes at LISA17
Hands-On Introduction to Kubernetes at LISA17
Ryan Jarvinen1.1K views
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase by HBaseCon
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon14.6K views
Linux Systems Performance 2016 by Brendan Gregg
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
Brendan Gregg504.5K views
Introduction to Kubernetes RBAC by Kublr
Introduction to Kubernetes RBACIntroduction to Kubernetes RBAC
Introduction to Kubernetes RBAC
Kublr712 views
ksqlDB로 실시간 데이터 변환 및 스트림 처리 by confluent
ksqlDB로 실시간 데이터 변환 및 스트림 처리ksqlDB로 실시간 데이터 변환 및 스트림 처리
ksqlDB로 실시간 데이터 변환 및 스트림 처리
confluent964 views
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon by Jérôme Petazzoni
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxConAnatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Anatomy of a Container: Namespaces, cgroups & Some Filesystem Magic - LinuxCon
Jérôme Petazzoni96.9K views
Introduction and Deep Dive Into Containerd by Kohei Tokunaga
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
Kohei Tokunaga452 views
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB by YugabyteDB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
YugabyteDB2.6K views
Optimizing {Java} Application Performance on Kubernetes by Dinakar Guniguntala
Optimizing {Java} Application Performance on KubernetesOptimizing {Java} Application Performance on Kubernetes
Optimizing {Java} Application Performance on Kubernetes
Docker and Kubernetes 101 workshop by Sathish VJ
Docker and Kubernetes 101 workshopDocker and Kubernetes 101 workshop
Docker and Kubernetes 101 workshop
Sathish VJ385 views
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI by Altinity Ltd
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIData Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Altinity Ltd317 views
Kvm performance optimization for ubuntu by Sim Janghoon
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
Sim Janghoon40.7K views

Similar to Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Highload++

Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency by
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and LatencyOptimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and LatencyHenning Jacobs
84 views83 slides
Mastering java in containers - MadridJUG by
Mastering java in containers - MadridJUGMastering java in containers - MadridJUG
Mastering java in containers - MadridJUGJorge Morales
1.6K views55 slides
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext by
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextTuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextLucidworks
4.5K views41 slides
JITServerTalk.pdf by
JITServerTalk.pdfJITServerTalk.pdf
JITServerTalk.pdfRichHagarty
66 views45 slides
MesosCon 2018 by
MesosCon 2018MesosCon 2018
MesosCon 2018Pablo Delgado
171 views65 slides
JITServerTalk-OSS-2023.pdf by
JITServerTalk-OSS-2023.pdfJITServerTalk-OSS-2023.pdf
JITServerTalk-OSS-2023.pdfRichHagarty
11 views42 slides

Similar to Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Highload++(20)

Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency by Henning Jacobs
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and LatencyOptimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency
Henning Jacobs84 views
Mastering java in containers - MadridJUG by Jorge Morales
Mastering java in containers - MadridJUGMastering java in containers - MadridJUG
Mastering java in containers - MadridJUG
Jorge Morales1.6K views
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext by Lucidworks
Tuning Solr for Logs: Presented by Radu Gheorghe, SematextTuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext
Lucidworks4.5K views
JITServerTalk-OSS-2023.pdf by RichHagarty
JITServerTalk-OSS-2023.pdfJITServerTalk-OSS-2023.pdf
JITServerTalk-OSS-2023.pdf
RichHagarty11 views
Build an High-Performance and High-Durable Block Storage Service Based on Ceph by Rongze Zhu
Build an High-Performance and High-Durable Block Storage Service Based on CephBuild an High-Performance and High-Durable Block Storage Service Based on Ceph
Build an High-Performance and High-Durable Block Storage Service Based on Ceph
Rongze Zhu8.2K views
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final by Vigyan Jain
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalSizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-Final
Vigyan Jain395 views
Speedrunning the Open Street Map osm2pgsql Loader by GregSmith458515
Speedrunning the Open Street Map osm2pgsql LoaderSpeedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql Loader
GregSmith4585151.2K views
KOCOON – KAKAO Automatic K8S Monitoring by issac lim
KOCOON – KAKAO Automatic K8S MonitoringKOCOON – KAKAO Automatic K8S Monitoring
KOCOON – KAKAO Automatic K8S Monitoring
issac lim1.6K views
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ... by Amazon Web Services
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Container Performance Analysis Brendan Gregg, Netflix by Docker, Inc.
Container Performance Analysis Brendan Gregg, NetflixContainer Performance Analysis Brendan Gregg, Netflix
Container Performance Analysis Brendan Gregg, Netflix
Docker, Inc.12.6K views
JVM Garbage Collection Tuning by ihji
JVM Garbage Collection TuningJVM Garbage Collection Tuning
JVM Garbage Collection Tuning
ihji2.7K views
Container Performance Analysis by Brendan Gregg
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
Brendan Gregg448.6K views
Devoxx France 2018 : Mes Applications en Production sur Kubernetes by Michaël Morello
Devoxx France 2018 : Mes Applications en Production sur KubernetesDevoxx France 2018 : Mes Applications en Production sur Kubernetes
Devoxx France 2018 : Mes Applications en Production sur Kubernetes
Michaël Morello291 views
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ... by Amazon Web Services
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
Microservices with Micronaut by QAware GmbH
Microservices with MicronautMicroservices with Micronaut
Microservices with Micronaut
QAware GmbH2.8K views

More from Henning Jacobs

How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent by
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHenning Jacobs
2.7K views77 slides
Open Source at Zalando - OSB Open Source Day 2019 by
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019Henning Jacobs
649 views34 slides
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin by
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinWhy I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinHenning Jacobs
1.1K views91 slides
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise... by
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Henning Jacobs
1.8K views86 slides
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &... by
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Henning Jacobs
2.4K views85 slides
Kubernetes + Python = ❤ - Cloud Native Prague by
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native PragueHenning Jacobs
3.7K views75 slides

More from Henning Jacobs(20)

How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent by Henning Jacobs
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:InventHow Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent
Henning Jacobs2.7K views
Open Source at Zalando - OSB Open Source Day 2019 by Henning Jacobs
Open Source at Zalando - OSB Open Source Day 2019Open Source at Zalando - OSB Open Source Day 2019
Open Source at Zalando - OSB Open Source Day 2019
Henning Jacobs649 views
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin by Henning Jacobs
Why I love Kubernetes Failure Stories and you should too - GOTO BerlinWhy I love Kubernetes Failure Stories and you should too - GOTO Berlin
Why I love Kubernetes Failure Stories and you should too - GOTO Berlin
Henning Jacobs1.1K views
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise... by Henning Jacobs
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise...
Henning Jacobs1.8K views
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &... by Henning Jacobs
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Why Kubernetes? Cloud Native and Developer Experience at Zalando - OWL Tech &...
Henning Jacobs2.4K views
Kubernetes + Python = ❤ - Cloud Native Prague by Henning Jacobs
Kubernetes + Python = ❤ - Cloud Native PragueKubernetes + Python = ❤ - Cloud Native Prague
Kubernetes + Python = ❤ - Cloud Native Prague
Henning Jacobs3.7K views
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ... by Henning Jacobs
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU ...
Henning Jacobs3.5K views
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo... by Henning Jacobs
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - DevOpsCo...
Henning Jacobs1.4K views
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat... by Henning Jacobs
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Why we don’t use the Term DevOps: the Journey to a Product Mindset - Destinat...
Henning Jacobs535 views
Kubernetes Failure Stories - KubeCon Europe Barcelona by Henning Jacobs
Kubernetes Failure Stories - KubeCon Europe BarcelonaKubernetes Failure Stories - KubeCon Europe Barcelona
Kubernetes Failure Stories - KubeCon Europe Barcelona
Henning Jacobs728 views
Developer Experience at Zalando - CNCF End User SIG-DX by Henning Jacobs
Developer Experience at Zalando - CNCF End User SIG-DXDeveloper Experience at Zalando - CNCF End User SIG-DX
Developer Experience at Zalando - CNCF End User SIG-DX
Henning Jacobs1.2K views
Let's talk about Failures with Kubernetes - Hamburg Meetup by Henning Jacobs
Let's talk about Failures with Kubernetes - Hamburg MeetupLet's talk about Failures with Kubernetes - Hamburg Meetup
Let's talk about Failures with Kubernetes - Hamburg Meetup
Henning Jacobs5.7K views
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019 by Henning Jacobs
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Developer Experience at Zalando - Handelsblatt Strategisches IT-Management 2019
Henning Jacobs2.1K views
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevO... by Henning Jacobs
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevO...Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevO...
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevO...
Henning Jacobs18.5K views
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Cont... by Henning Jacobs
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Cont...Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Cont...
Running Kubernetes in Production: A Million Ways to Crash Your Cluster - Cont...
Henning Jacobs3.9K views
API First with Connexion - PyConWeb 2018 by Henning Jacobs
API First with Connexion - PyConWeb 2018API First with Connexion - PyConWeb 2018
API First with Connexion - PyConWeb 2018
Henning Jacobs2.9K views
Developer Journey at Zalando - Idea to Production with Containers in the Clou... by Henning Jacobs
Developer Journey at Zalando - Idea to Production with Containers in the Clou...Developer Journey at Zalando - Idea to Production with Containers in the Clou...
Developer Journey at Zalando - Idea to Production with Containers in the Clou...
Henning Jacobs1.3K views
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW by Henning Jacobs
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRWKubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW
Henning Jacobs3.2K views
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C... by Henning Jacobs
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - C...
Henning Jacobs1.8K views
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup by Henning Jacobs
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes MeetupFrom AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
From AWS/STUPS to Kubernetes on AWS @Zalando - Berlin Kubernetes Meetup
Henning Jacobs1.8K views

Recently uploaded

Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze by
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng TszeDigital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng TszeNUS-ISS
19 views47 slides
Voice Logger - Telephony Integration Solution at Aegis by
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at AegisNirmal Sharma
17 views1 slide
Perth MeetUp November 2023 by
Perth MeetUp November 2023 Perth MeetUp November 2023
Perth MeetUp November 2023 Michael Price
12 views44 slides
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica... by
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...NUS-ISS
15 views28 slides
Empathic Computing: Delivering the Potential of the Metaverse by
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the MetaverseMark Billinghurst
449 views80 slides
DALI Basics Course 2023 by
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023Ivory Egg
14 views12 slides

Recently uploaded(20)

Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze by NUS-ISS
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng TszeDigital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
NUS-ISS19 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma17 views
Perth MeetUp November 2023 by Michael Price
Perth MeetUp November 2023 Perth MeetUp November 2023
Perth MeetUp November 2023
Michael Price12 views
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica... by NUS-ISS
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
NUS-ISS15 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst449 views
DALI Basics Course 2023 by Ivory Egg
DALI Basics Course  2023DALI Basics Course  2023
DALI Basics Course 2023
Ivory Egg14 views
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV by Splunk
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
Splunk86 views
Five Things You SHOULD Know About Postman by Postman
Five Things You SHOULD Know About PostmanFive Things You SHOULD Know About Postman
Five Things You SHOULD Know About Postman
Postman25 views
Spesifikasi Lengkap ASUS Vivobook Go 14 by Dot Semarang
Spesifikasi Lengkap ASUS Vivobook Go 14Spesifikasi Lengkap ASUS Vivobook Go 14
Spesifikasi Lengkap ASUS Vivobook Go 14
Dot Semarang35 views
Understanding GenAI/LLM and What is Google Offering - Felix Goh by NUS-ISS
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix Goh
NUS-ISS39 views
The Importance of Cybersecurity for Digital Transformation by NUS-ISS
The Importance of Cybersecurity for Digital TransformationThe Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital Transformation
NUS-ISS25 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb12 views
Combining Orchestration and Choreography for a Clean Architecture by ThomasHeinrichs1
Combining Orchestration and Choreography for a Clean ArchitectureCombining Orchestration and Choreography for a Clean Architecture
Combining Orchestration and Choreography for a Clean Architecture
ThomasHeinrichs168 views
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum... by NUS-ISS
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
NUS-ISS28 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software91 views
Future of Learning - Yap Aye Wee.pdf by NUS-ISS
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdf
NUS-ISS38 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10165 views

Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Highload++

  • 1. Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency Henning Jacobs (Zalando SE) @try_except_
  • 4. 4 ZALANDO AT A GLANCE ~ 4.5billion EUR revenue 2017 > 200 million visits per month > 15.000 employees in Europe > 70% of visits via mobile devices > 24 million active customers > 300.000 product choices ~ 2.000 brands 17 countries
  • 6. 6 KUBERNETES: IT'S ALL ABOUT RESOURCES Node Node Pods demand capacity Nodes offer capacity Scheduler
  • 7. 7 COMPUTE RESOURCE TYPES ● CPU ● Memory ● Local ephemeral storage (1.12+) ● Extended Resources ○ GPU ○ TPU? Node
  • 8. 8 KUBERNETES RESOURCES CPU ○ Base: 1 AWS vCPU (or GCP Core or ..) ○ Example: 100m (0.1 vCPU, "100 Millicores") Memory ○ Base: 1 Byte ○ Example: 500Mi (500 MiB memory)
  • 9. 9 REQUESTS / LIMITS Requests ○ Affect Scheduling Decision ○ Priority (CPU, OOM adjust) Limits ○ Limit maximum container usage resources: requests: cpu: 100m memory: 300Mi limits: cpu: 1 memory: 300Mi
  • 14. 14 POD SCHEDULING: TRY TO FIT CPU Memory CPU Memory Node 1 Node 2
  • 15. 15 POD SCHEDULING: NO CAPACITY CPU Memory CPU Memory Node 1 Node 2 Pod 4 "PENDING"
  • 16. 16 REQUESTS: CPU SHARES docker run --requests=cpu=10m/5m ..sha512().. cat /sys/fs/cgroup/cpu/kubepods/burstable/pod5d5..0d/cpu.shares 10 // relative share of CPU time cat /sys/fs/cgroup/cpu/kubepods/burstable/pod6e0..0d/cpu.shares 5 // relative share of CPU time cat /sys/fs/cgroup/cpuacct/kubepods/burstable/pod5d5..0d/cpuacct.usage /sys/fs/cgroup/cpuacct/kubepods/burstable/pod6e0..0d/cpuacct.usage 13432815283 // total CPU time in nanoseconds 7528759332 // total CPU time in nanoseconds
  • 17. 17 LIMITS: COMPRESSIBLE RESOURCES Can be taken away quickly, "only" cause slowness CPU Throttling 200m CPU limit ⇒ container can use 0.2s of CPU time per second
  • 18. 18 CPU THROTTLING docker run --cpus CPUS -it python python -m timeit -s 'import hashlib' -n 10000 -v 'hashlib.sha512().update(b"foo")' CPUS=1.0 3.8 - 4ms CPUS=0.5 3.8 - 52ms CPUS=0.2 6.8 - 88ms CPUS=0.1 5.7 - 190ms
  • 19. 19 LIMITS: NON-COMPRESSIBLE RESOURCES Hold state, are slower to take away. Killing (OOMKill)
  • 20. 20 MEMORY LIMITS: OUT OF MEMORY kubectl get pod NAME READY STATUS RESTARTS AGE kube-ops-view-7bc-tcwkt 0/1 CrashLoopBackOff 3 2m kubectl describe pod kube-ops-view-7bc-tcwkt ... Last State: Terminated Reason: OOMKilled Exit Code: 137
  • 21. 21 QOS Guaranteed: all containers have limits == requests Burstable: some containers have limits > requests BestEffort: no requests/limits set kubectl describe pod … … Limits: memory: 100Mi Requests: cpu: 100m memory: 100Mi QoS Class: Burstable
  • 22. 22 OVERCOMMIT Limits > Requests ⇒ Burstable QoS ⇒ Overcommit Overcommit for CPU: fine, running into completely fair scheduling Overcommit for memory: fine, as long as demand < node capacity https://code.fb.com/production-engineering/oomd/ Might run into unpredictable OOM situations when demand reaches node's memory capacity (Kernel OOM Killer)
  • 23. 23 LIMITS: CGROUPS docker run --cpus 1 -m 200m --rm -it busybox cat /sys/fs/cgroup/cpu/docker/8ab25..1c/cpu.{shares,cfs_*} 1024 // cpu.shares (default value) 100000 // cpu.cfs_period_us (100ms period length) 100000 // cpu.cfs_quota_us (total CPU time in µs consumable per period) cat /sys/fs/cgroup/memory/docker/8ab25..1c/memory.limit_in_bytes 209715200
  • 24. 24 LIMITS: PROBLEMS 1. CPU CFS Quota: Latency 2. Memory: accounting, OOM behavior
  • 28. 28 WHAT'S A GOOD VALUE FOR CFS PERIOD? Latencyp95→
  • 29. 29 NOW IN KUBERNETES 1.12 https://github.com/kubernetes/kubernetes/pull/63437
  • 31. 31 OVERLY AGGRESSIVE CFS: EXPERIMENT #1 CPU Period: 100ms CPU Quota: None Burn 5ms and sleep 100ms ⇒ Quota disabled ⇒ No Throttling expected! https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
  • 32. 32 EXPERIMENT #1: NO QUOTA, NO THROTTLING 2018/11/03 13:04:02 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 6ms 2018/11/03 13:04:03 [1] burn took 5ms, real time so far: 510ms, cpu time so far: 11ms 2018/11/03 13:04:03 [2] burn took 5ms, real time so far: 1015ms, cpu time so far: 17ms 2018/11/03 13:04:04 [3] burn took 5ms, real time so far: 1520ms, cpu time so far: 23ms 2018/11/03 13:04:04 [4] burn took 5ms, real time so far: 2025ms, cpu time so far: 29ms 2018/11/03 13:04:05 [5] burn took 5ms, real time so far: 2530ms, cpu time so far: 35ms 2018/11/03 13:04:05 [6] burn took 5ms, real time so far: 3036ms, cpu time so far: 40ms 2018/11/03 13:04:06 [7] burn took 5ms, real time so far: 3541ms, cpu time so far: 46ms 2018/11/03 13:04:06 [8] burn took 5ms, real time so far: 4046ms, cpu time so far: 52ms 2018/11/03 13:04:07 [9] burn took 5ms, real time so far: 4551ms, cpu time so far: 58ms
  • 33. 33 OVERLY AGGRESSIVE CFS: EXPERIMENT #2 CPU Period: 100ms CPU Quota: 20ms Burn 5ms and sleep 500ms ⇒ No 100ms intervals where possibly 20ms is burned ⇒ No Throttling expected!
  • 34. 34 EXPERIMENT #2: OVERLY AGGRESSIVE CFS 2018/11/03 13:05:05 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 5ms 2018/11/03 13:05:06 [1] burn took 99ms, real time so far: 690ms, cpu time so far: 9ms 2018/11/03 13:05:06 [2] burn took 99ms, real time so far: 1290ms, cpu time so far: 14ms 2018/11/03 13:05:07 [3] burn took 99ms, real time so far: 1890ms, cpu time so far: 18ms 2018/11/03 13:05:07 [4] burn took 5ms, real time so far: 2395ms, cpu time so far: 24ms 2018/11/03 13:05:08 [5] burn took 94ms, real time so far: 2990ms, cpu time so far: 27ms 2018/11/03 13:05:09 [6] burn took 99ms, real time so far: 3590ms, cpu time so far: 32ms 2018/11/03 13:05:09 [7] burn took 5ms, real time so far: 4095ms, cpu time so far: 37ms 2018/11/03 13:05:10 [8] burn took 5ms, real time so far: 4600ms, cpu time so far: 43ms 2018/11/03 13:05:10 [9] burn took 5ms, real time so far: 5105ms, cpu time so far: 49ms
  • 35. 35 OVERLY AGGRESSIVE CFS: EXPERIMENT #3 CPU Period: 10ms CPU Quota: 2ms Burn 5ms and sleep 100ms ⇒ Same 20% CPU (200m) limit, but smaller period ⇒ Throttling expected!
  • 36. 36 SMALLER CPU PERIOD ⇒ BETTER LATENCY 2018/11/03 16:31:07 [0] burn took 18ms, real time so far: 18ms, cpu time so far: 6ms 2018/11/03 16:31:07 [1] burn took 9ms, real time so far: 128ms, cpu time so far: 8ms 2018/11/03 16:31:07 [2] burn took 9ms, real time so far: 238ms, cpu time so far: 13ms 2018/11/03 16:31:07 [3] burn took 5ms, real time so far: 343ms, cpu time so far: 18ms 2018/11/03 16:31:07 [4] burn took 30ms, real time so far: 488ms, cpu time so far: 24ms 2018/11/03 16:31:07 [5] burn took 19ms, real time so far: 608ms, cpu time so far: 29ms 2018/11/03 16:31:07 [6] burn took 9ms, real time so far: 718ms, cpu time so far: 34ms 2018/11/03 16:31:08 [7] burn took 5ms, real time so far: 824ms, cpu time so far: 40ms 2018/11/03 16:31:08 [8] burn took 5ms, real time so far: 943ms, cpu time so far: 45ms 2018/11/03 16:31:08 [9] burn took 9ms, real time so far: 1068ms, cpu time so far: 48ms
  • 37. 37 LIMITS: VISIBILITY docker run --cpus 1 -m 200m --rm -it busybox top Mem: 7369128K used, 726072K free, 128164K shrd, 303924K buff, 1208132K cached CPU0: 14.8% usr 8.4% sys 0.2% nic 67.6% idle 8.2% io 0.0% irq 0.6% sirq CPU1: 8.8% usr 10.3% sys 0.0% nic 75.9% idle 4.4% io 0.0% irq 0.4% sirq CPU2: 7.3% usr 8.7% sys 0.0% nic 63.2% idle 20.1% io 0.0% irq 0.6% sirq CPU3: 9.3% usr 9.9% sys 0.0% nic 65.7% idle 14.5% io 0.0% irq 0.4% sirq
  • 38. 38 • Container-aware memory configuration • JVM MaxHeap • Container-aware processor configuration • Thread pools • GOMAXPROCS • node.js cluster module LIMITS: VISIBILITY
  • 41. 41 ZALANDO: DECISION • Forbid Memory Overcommit • Implement mutating admission webhook • Set requests = limits • Disable CPU CFS Quota in all clusters
  • 44. Example: Getting started with Zalenium & UI Tests Example: Step by step guide to the first UI test with Zalenium running in the Continuous Delivery Platform. I was always afraid of UI tests because it looked too difficult to get started, Zalenium solved this problem for me.
  • 45. 45 CLUSTER AUTOSCALER Simulates the Kubernetes scheduler internally to find out.. • ..if any of the pods wouldn’t fit on existing nodes ⇒ upscale is needed • ..if it’s possible to fit some of the pods on existing nodes ⇒ downscale is needed ⇒ Cluster size is determined by resource requests (+ constraints) github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
  • 46. 46 ALLOCATABLE Reserve resources for system components, Kubelet, and container runtime: --system-reserved= cpu=100m,memory=164Mi --kube-reserved= cpu=100m,memory=282Mi
  • 47. 47 CPU/memory requests "block" resources on nodes. Difference between actual usage and requests → Slack SLACK CPU Memory Node "Slack"
  • 48. 48 STRANDED RESOURCES Stranded CPU Memory CPU Memory Node 1 Node 2 Some available capacity can become unusable / stranded. ⇒ Reschedule, bin packing
  • 49. 49 HOW TO FIND GOOD VALUES FOR REQUESTS? resources: limits: memory: "2Gi" requests: cpu: "1.5" memory: "2Gi"
  • 50. 50 TWO TOOLS ● Kubernetes Resource Report ○ Dashboard showing Kubernetes Resource Slack ● Kubernetes Application Dashboard ○ Grafana Dashboard showing historical CPU and Memory usage for Kubernetes Applications
  • 54. 54 VERTICAL POD AUTOSCALER (VPA) "Some 2/3 of the Borg users use autopilot." - Tim Hockin VPA: Set resource requests automatically based on usage.
  • 55. 55 VPA FOR PROMETHEUS apiVersion: poc.autoscaling.k8s.io/v1alpha1 kind: VerticalPodAutoscaler metadata: name: prometheus-vpa namespace: kube-system spec: selector: matchLabels: application: prometheus updatePolicy: updateMode: Auto CPU/memory
  • 56. 56 AUTOSCALING BUFFER • Cluster Autoscaler only triggers on Pending Pods • Node provisioning is slow ⇒ Reserve extra capacity via low priority Pods "Autoscaling Buffer Pods"
  • 57. 57 AUTOSCALING BUFFER kubectl describe pod autoscaling-buffer-eu-central-1a-78-rzjq5 -n kube-system ... Namespace: kube-system Priority: -1000000 PriorityClassName: autoscaling-buffer Containers: pause: Image: teapot/pause-amd64:3.1 Requests: cpu: 1600m memory: 6871947673 Evict if higher priority (default) Pod needs capacity
  • 58. 58 HORIZONTAL POD AUTOSCALER apiVersion: autoscaling/v2beta1 kind: HorizontalPodAutoscaler metadata: name: myapp spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: myapp minReplicas: 3 maxReplicas: 5 metrics: - type: Resource resource: name: cpu targetAverageUtilization: 100 target: ~100% of CPU requests ...
  • 59. 59 HORIZONTAL POD AUTOSCALING (CUSTOM METRICS) Queue Length Prometheus Query Ingress Req/s ZMON Check github.com/zalando-incubator/kube-metrics-adapter
  • 61. 61 DOWNSCALING DURING OFF-HOURS DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET" annotations: downscaler/exclude: "true" github.com/hjacobs/kube-downscaler
  • 62. 62 WHAT WORKED FOR US ● Disable CPU CFS Quota in all clusters ● Admission Controller: ○ Prevent memory overcommit ○ Set default requests ● Kubernetes Resource Report ● Downscaling during off-hours
  • 63. 63 WRAP UP ● Understanding Requests/Limits ● Deciding on quota period (or disable), overcommit ● Autoscaling (cluster-autoscaler, HPA, VPA, downscaler) ● Buffer capacity ● Optimizing slack
  • 64. 64 OPEN SOURCE Kubernetes on AWS github.com/zalando-incubator/kubernetes-on-aws AWS ALB Ingress controller github.com/zalando-incubator/kube-ingress-aws-controller Skipper HTTP Router & Ingress controller github.com/zalando/skipper External DNS github.com/kubernetes-incubator/external-dns Postgres Operator github.com/zalando-incubator/postgres-operator Kubernetes Resource Report github.com/hjacobs/kube-resource-report Kubernetes Downscaler github.com/hjacobs/kube-downscaler
  • 66. 66 OTHER TALKS • Everything You Ever Wanted to Know About Resource Scheduling • Inside Kubernetes Resource Management (QoS) - KubeCon 2018 • Setting Resource Requests and Limits in Kubernetes (Best Practices)
  • 67. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k