Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevOpsCon Munich 2018

Henning Jacobs (@try_except_), Senior Principal at Zalando SE
2018-12-05
ZALANDO AT A GLANCE
~ 4.5 billion EUR revenue 2017
> 200 million visits per month
> 15,000 employees in Europe
> 70% of visits via mobile devices
> 24 million active customers
> 300,000 product choices
~ 2,000 brands
17 countries
Black Friday 2018: > 4,200 orders per minute
SCALE
100 Clusters
373 Accounts
DEVELOPERS USING KUBERNETES
46+ cluster components
INCIDENTS ARE FINE
INCIDENT #1: CUSTOMER IMPACT
INCIDENT #1: IAM RETURNING 404
INCIDENT #1: NUMBER OF PODS
LIFE OF A REQUEST (INGRESS)
[Diagram: ALB in front of Skipper on each Node, with MyApp pods behind it; labels: TLS, HTTP, EC2 network, K8s network]
ROUTES FROM API SERVER
[Diagram: ALB, Skipper, and MyApp pods on two Nodes; Skipper is connected to the API Server]
API SERVER DOWN
[Diagram: ALB, Skipper, and MyApp pods on two Nodes; the API Server is marked OOMKill]
INCIDENT #1: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # per the fixed manifest, the next three fields belong under the CronJob spec, not here
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
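The two manifests differ mainly in where concurrencyPolicy and the history limits sit: inside the pod spec in the first one, directly under the CronJob spec in the fixed one. A minimal sketch of a pre-deployment check for that mistake, assuming the manifest has already been parsed into a Python dict. The function name misplaced_cronjob_fields and the sample dicts are illustrative only and not part of any Zalando tooling.

```python
# Fields that belong directly under the CronJob's top-level `spec`.
CRONJOB_SPEC_FIELDS = {
    "concurrencyPolicy",
    "successfulJobsHistoryLimit",
    "failedJobsHistoryLimit",
}


def misplaced_cronjob_fields(manifest):
    """Return CronJob-level fields that were found in the pod spec instead."""
    pod_spec = (
        manifest.get("spec", {})
        .get("jobTemplate", {})
        .get("spec", {})
        .get("template", {})
        .get("spec", {})
    )
    return sorted(CRONJOB_SPEC_FIELDS & set(pod_spec))


# Hypothetical parsed form of the first manifest (fields nested too deep).
broken = {
    "kind": "CronJob",
    "spec": {
        "schedule": "*/15 9-19 * * Mon-Fri",
        "jobTemplate": {"spec": {"template": {"spec": {
            "restartPolicy": "Never",
            "concurrencyPolicy": "Forbid",
            "successfulJobsHistoryLimit": 1,
            "failedJobsHistoryLimit": 1,
        }}}},
    },
}

# Hypothetical parsed form of the fixed manifest.
fixed = {
    "kind": "CronJob",
    "spec": {
        "schedule": "7 8-18 * * Mon-Fri",
        "concurrencyPolicy": "Forbid",
        "successfulJobsHistoryLimit": 1,
        "failedJobsHistoryLimit": 1,
        "jobTemplate": {"spec": {
            "activeDeadlineSeconds": 120,
            "template": {"spec": {"restartPolicy": "Never"}},
        }},
    },
}

print(misplaced_cronjob_fields(broken))
print(misplaced_cronjob_fields(fixed))
```

A check like this could run in CI before a manifest is applied; it only looks at nesting and does not replace server-side validation.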
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Ingress to stay “healthy” during API server problems
• Fix Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
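The two Ingress lessons (stay "healthy" during API server problems, retain the last known set of routes) can be sketched as a small cache around the route fetch. This is an illustration of the idea only, not Skipper's actual implementation; RouteTable, flaky_fetcher, and the sample route are hypothetical names.

```python
class RouteTable:
    """Serves the last successfully fetched routes when a refresh fails."""

    def __init__(self, fetch_routes):
        # fetch_routes: hypothetical callable returning {host: backend}
        self._fetch_routes = fetch_routes
        self._routes = {}
        self.loaded_once = False
        self.stale = False

    def refresh(self):
        try:
            routes = self._fetch_routes()
        except Exception:
            # API server unreachable: keep serving what we already have.
            self.stale = True
        else:
            self._routes = routes
            self.loaded_once = True
            self.stale = False
        return dict(self._routes)

    def healthy(self):
        # Healthy once any route set was loaded, even if it is now stale.
        return self.loaded_once


def flaky_fetcher(responses):
    """Build a fake fetcher that returns or raises each item in turn."""
    remaining = iter(responses)

    def fetch():
        item = next(remaining)
        if isinstance(item, Exception):
            raise item
        return item

    return fetch


table = RouteTable(flaky_fetcher([
    {"myapp.example.org": "10.2.0.5:8080"},
    ConnectionError("API server down"),
]))
first = table.refresh()   # succeeds
second = table.refresh()  # fails, old routes are kept
print(second == first, table.stale, table.healthy())
```

The design choice here is that a failed refresh never empties the table, so the health check does not flip to "unhealthy" just because the API server is down.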
INCIDENT #2: CLUSTER DOWN
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
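The help line shows that the second positional argument is a range_end, not a flag. A minimal sketch of why that matters, assuming etcd's half-open key range [key, range_end) with plain lexicographic comparison. The key names below are made up for illustration, and the functions only simulate key selection; they do not talk to etcd.

```python
def keys_in_range(keys, start, range_end):
    """Keys selected by a half-open range [start, range_end)."""
    return [k for k in sorted(keys) if start <= k < range_end]


def keys_with_prefix(keys, prefix):
    """Keys selected by a true prefix match."""
    return [k for k in sorted(keys) if k.startswith(prefix)]


# Hypothetical registry contents.
keys = [
    "/registry-kube-1/apiservices/a",
    "/registry-kube-1/certificatesigningrequests/csr-1",
    "/registry-kube-1/configmaps/default/app",
    "/registry-kube-1/pods/default/web-0",
    "/registry-kube-1/secrets/default/token",
]

start = "/registry-kube-1/certificatesigningrequest"

# Intended: delete only the certificate signing requests.
intended = keys_with_prefix(keys, start)

# What a literal range_end of "prefix" selects: every key from `start`
# up to the string "prefix", and "/" sorts before "p".
actual = keys_in_range(keys, start, "prefix")

print(len(intended), len(actual))
for key in actual:
    print(key)
```

In this toy data set the prefix match selects one key while the range selects four, including pods and secrets.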
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
https://www.outcome-eng.com/human-error-never-root-cause/
INCIDENT #2: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots
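"Monitor the snapshots" can be as simple as alerting when the newest backup is too old. A minimal sketch, assuming the snapshot upload timestamps have already been collected elsewhere (for example by listing the backup bucket); the function name and the one-hour threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def latest_snapshot_is_fresh(snapshot_times, now, max_age=timedelta(hours=1)):
    """Return True if the newest snapshot is no older than max_age."""
    if not snapshot_times:
        # No snapshots at all is the worst case and must alert.
        return False
    return now - max(snapshot_times) <= max_age


# Fixed reference time so the example is deterministic.
now = datetime(2018, 12, 5, 12, 0, tzinfo=timezone.utc)
snapshots = [now - timedelta(hours=3), now - timedelta(minutes=20)]

print(latest_snapshot_is_fresh(snapshots, now))
print(latest_snapshot_is_fresh(snapshots[:1], now))
```

Treating "no snapshots" as a failure covers the case where the backup job silently never ran.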
INCIDENT #3: API LATENCY SPIKES
INCIDENT #3: CONNECTION ISSUES
...
Kubernetes worker and master nodes sporadically fail to connect to etcd
causing timeouts in the APIserver and disconnects in the pod network.
...
[Diagram: Master Node with API Server and etcd (etcd-member)]
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
SLEEPTIME=60
while true; do
    echo "sleep for $SLEEPTIME seconds"
    sleep $SLEEPTIME
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
INCIDENT #4: IMPACT
[Chart: Ingress 5XXs]
INCIDENT #4: CLUSTER DOWN?
INCIDENT #4: THE TRIGGER
https://www.outcome-eng.com/human-error-never-root-cause/
CLUSTER UPGRADE FLOW
CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
CLUSTER CHANNELS
github.com/zalando-incubator/kubernetes-on-aws
Channel | Description | Clusters
dev | Development and playground clusters. | 3
alpha | Main infrastructure cluster (important to us). | 1
beta | Product clusters for the rest of the organization (prod/test). | 90+
E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
RUNNING E2E TESTS (BEFORE)
[Diagram: test cluster (control plane, nodes) built from branch: dev]
Create Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
RUNNING E2E TESTS (NOW)
[Diagram: test cluster (control plane, nodes) created from branch: alpha (base), then updated to branch: dev (head)]
Create Cluster → Update Cluster → Run e2e tests → Delete Cluster
Testing dev to alpha upgrade
INCIDENT #4: LESSONS LEARNED
• Automated e2e tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
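The steps above can be sketched as a small driver: bootstrap with the previous configuration, apply the new one, run the tests, and always delete the cluster afterwards. This is a sketch of the flow only, not the Cluster Lifecycle Manager's code; run_upgrade_test, RecordingOps, and the method names are hypothetical, and the fake operations merely record the call order.

```python
def run_upgrade_test(cluster_ops, base_config, head_config):
    """Create from base, update to head, test, and always clean up."""
    cluster = cluster_ops.create(base_config)
    try:
        cluster_ops.update(cluster, head_config)
        return cluster_ops.run_e2e(cluster)
    finally:
        # Delete the test cluster even if the update or the tests fail.
        cluster_ops.delete(cluster)


class RecordingOps:
    """Fake cluster operations that only record the order of calls."""

    def __init__(self):
        self.steps = []

    def create(self, config):
        self.steps.append(f"create:{config}")
        return "test-cluster"

    def update(self, cluster, config):
        self.steps.append(f"update:{config}")

    def run_e2e(self, cluster):
        self.steps.append("e2e")
        return True

    def delete(self, cluster):
        self.steps.append("delete")


ops = RecordingOps()
passed = run_upgrade_test(ops, base_config="alpha", head_config="dev")
print(passed, ops.steps)
```

The try/finally keeps failed runs from leaving test clusters behind, which matters when every pull request creates one.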
INCIDENT #5: IMPACT
[4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 ..
[5:01 PM] Alice: Now it does not start the build step at all
[5:02 PM] John: +1
[5:02 PM] John: Failed to create builder pod: …
[5:02 PM] Pedro: +1
[5:04 PM] Damien: +1
[5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which results in many problems…
...
INCIDENT #5: A VERY INNOCENT PULL REQUEST
INCIDENT #5: WHAT HAPPENED
• Deployment caused rebuild with latest stable Go version
• Library for signature verification was incompatible with Go 1.10, causing all verification checks to fail during runtime.
• Lack of unit/smoke tests and alerting for one component
• "Near miss": outage could have had large impact
INCIDENT #6: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
INCIDENT #6: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
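The log lines above are enough to estimate how fast the backlog grew. A minimal sketch that parses the lines shown on the slide and computes the average growth between the first and last sample; the regular expression and function names are illustrative, and the elided ".." lines are simply left out.

```python
import re
from datetime import datetime

LOG = """\
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
"""

LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \|.*Current queue size: (\d+),")


def parse_samples(log):
    """Extract (timestamp, queue size) pairs from the log text."""
    samples = []
    for line in log.splitlines():
        match = LINE.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            samples.append((when, int(match.group(2))))
    return samples


def growth_per_minute(samples):
    """Average queue growth per minute between first and last sample."""
    (t0, q0), (t1, q1) = samples[0], samples[-1]
    minutes = (t1 - t0).total_seconds() / 60
    return (q1 - q0) / minutes


samples = parse_samples(LOG)
print(round(growth_per_minute(samples)))      # 484 entries per minute overall
print(round(growth_per_minute(samples[:2])))  # 390 in the first minute
```

With the number of active workers pinned at 20 throughout, the queue never drains: over the 106 minutes shown it grows by 51,266 entries.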
INCIDENT #6: CPU THROTTLING
INCIDENT #6: WHAT HAPPENED
Scaled down IAM provider to reduce Slack
+ Number of deployments increased
⇒ Process could not process credentials fast enough
SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
[Diagram: Node with CPU and Memory bars; the requested but unused part is marked "Slack"]
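The definition of Slack above is a plain subtraction per resource. A minimal sketch with made-up numbers, using integer millicores and MiB to avoid rounding noise; the function name and the sample values are assumptions for illustration.

```python
def slack(requested, used):
    """Requested minus actually used, per resource."""
    return {name: requested[name] - used.get(name, 0) for name in requested}


# Hypothetical node: sum of pod requests vs. measured usage.
requested = {"cpu_millicores": 14000, "memory_mib": 49152}
used = {"cpu_millicores": 3500, "memory_mib": 30720}

node_slack = slack(requested, used)
print(node_slack)
```

High slack means the node is paid for but idle; cutting requests too far to reduce it is what starved the IAM provider in incident #6.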
DISABLING CPU THROTTLING
[Announcement] CPU limits will be disabled
⇒ Ingress Latency Improvements
kubelet … --cpu-cfs-quota=false
A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition during network setup
MORE TOPICS
• Graceful Pod shutdown and race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
RACE CONDITIONS..
github.com/zalando-incubator/kubernetes-on-aws
TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
WELCOME TO CLOUD NATIVE!
OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
https://github.com/hjacobs/kube-ops-view
OTHER TALKS
• Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
• Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018
• Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - HighLoad++ 2018
We need more failure talks!
QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf
Powerful Google developer tools for immediate impact! (2023-24) by wesley chun
Powerful Google developer tools for immediate impact! (2023-24)Powerful Google developer tools for immediate impact! (2023-24)
Powerful Google developer tools for immediate impact! (2023-24)
wesley chun10 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb14 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson92 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker40 views
Serverless computing with Google Cloud (2023-24) by wesley chun
Serverless computing with Google Cloud (2023-24)Serverless computing with Google Cloud (2023-24)
Serverless computing with Google Cloud (2023-24)
wesley chun11 views
Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely25 views

Running Kubernetes in Production: A Million Ways to Crash Your Cluster - DevOpsCon Munich 2018

  • 1. Running Kubernetes in Production: A Million Ways to Crash Your Cluster HENNING JACOBS @try_except_ 2018-12-05
  • 4. ZALANDO AT A GLANCE: ~4.5 billion EUR revenue (2017); >200 million visits per month; >15,000 employees in Europe; >70% of visits via mobile devices; >24 million active customers; >300,000 product choices; ~2,000 brands; 17 countries
  • 11. INCIDENT #1: IAM RETURNING 404
  • 13. LIFE OF A REQUEST (INGRESS) [Diagram: ALB → Skipper (on each node) → MyApp pods; labels: TLS, HTTP, EC2 network, K8s network]
  • 14. ROUTES FROM API SERVER [Diagram: same topology, with the API server connected to the Skipper instances]
  • 15. API SERVER DOWN [Diagram: same topology, with the API server marked "OOMKill"]
  • 16. INCIDENT #1: INNOCENT MANIFEST

        apiVersion: batch/v2alpha1
        kind: CronJob
        metadata:
          name: "foobar"
        spec:
          schedule: "*/15 9-19 * * Mon-Fri"
          jobTemplate:
            spec:
              template:
                spec:
                  restartPolicy: Never
                  concurrencyPolicy: Forbid
                  successfulJobsHistoryLimit: 1
                  failedJobsHistoryLimit: 1
                  containers:
                  ...

  • 17. INCIDENT #1: FIXED CRON JOB

        apiVersion: batch/v2alpha1
        kind: CronJob
        metadata:
          name: "foobar"
        spec:
          schedule: "7 8-18 * * Mon-Fri"
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          jobTemplate:
            spec:
              activeDeadlineSeconds: 120
              template:
                spec:
                  restartPolicy: Never
                  containers:
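
Besides moving the policy fields, the fixed manifest also changes the schedule. A minimal sketch, assuming only the simple cron field forms that appear on these two slides (`*`, `*/n`, `a-b`, a single number), counts how often each schedule fires per weekday. The helper names `expand` and `runs_per_day` are hypothetical and not part of any Kubernetes tooling.

```python
def expand(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field; supports only '*', '*/n', 'a-b' and a single number."""
    if field == "*":
        return list(range(lo, hi + 1))
    if field.startswith("*/"):
        return list(range(lo, hi + 1, int(field[2:])))
    if "-" in field:
        start, end = (int(part) for part in field.split("-"))
        return list(range(start, end + 1))
    return [int(field)]


def runs_per_day(schedule: str) -> int:
    """Number of triggers on a day matched by the schedule (minute x hour combinations)."""
    minute, hour, *_ = schedule.split()
    return len(expand(minute, 0, 59)) * len(expand(hour, 0, 23))


old_runs = runs_per_day("*/15 9-19 * * Mon-Fri")  # schedule from slide 16
new_runs = runs_per_day("7 8-18 * * Mon-Fri")     # schedule from slide 17
print(old_runs, new_runs)
```

Under these assumptions the original schedule fires four times an hour across eleven hours, while the fixed one fires once an hour.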
  • 18. INCIDENT #1: LESSONS LEARNED
      • ALB routes traffic to ALL hosts if all hosts report “unhealthy”
      • Fix Ingress to stay “healthy” during API server problems
      • Fix Ingress to retain last known set of routes
      • Use quota for number of pods

        apiVersion: v1
        kind: ResourceQuota
        metadata:
          name: compute-resources
        spec:
          hard:
            pods: "1500"
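
The "retain last known set of routes" lesson can be illustrated with a small sketch. This is not Skipper's actual implementation (Skipper is written in Go); the class `LastKnownRoutes`, its methods, and the example route are hypothetical and only show the idea of keeping the previous routing table and staying "healthy" when the route source fails.

```python
from typing import Callable, Dict

Routes = Dict[str, str]  # hypothetical shape: hostname -> backend address


class LastKnownRoutes:
    """Keeps serving the most recent successfully fetched routing table."""

    def __init__(self) -> None:
        self.routes: Routes = {}
        self.stale = False

    def refresh(self, fetch: Callable[[], Routes]) -> Routes:
        try:
            self.routes = fetch()
            self.stale = False
        except Exception:
            # Route source (e.g. the API server) is unavailable:
            # keep the previous table instead of dropping all routes.
            self.stale = True
        return self.routes

    def healthy(self) -> bool:
        # Stay healthy as long as there is a routing table to serve from,
        # even if it could not be refreshed recently.
        return bool(self.routes)


def api_server_down() -> Routes:
    raise ConnectionError("API server unreachable")


cache = LastKnownRoutes()
cache.refresh(lambda: {"myapp.example.org": "10.2.0.5:8080"})
routes_during_outage = cache.refresh(api_server_down)
```

With this design a health check based on `healthy()` would keep passing during an API server outage, so the load balancer would not see every host as unhealthy at once.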
  • 20. INCIDENT #2: MANUAL OPERATION

        % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix

  • 21. INCIDENT #2: RTFM

        % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
        help: etcdctl del [options] <key> [range_end]
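
The help line suggests that the trailing word `prefix` was read as the positional `range_end` argument rather than as an option. The sketch below does not talk to etcd. It only simulates a delete over the half-open key range `[key, range_end)` on a sorted key space, and the key names other than the one on the slide are made up for illustration.

```python
def range_delete(keys, key, range_end=None):
    """Return the keys that survive deleting `key`, or the range [key, range_end)."""
    if range_end is None:
        return [k for k in keys if k != key]
    # Keys are compared lexicographically, as in a sorted key-value store.
    return [k for k in keys if not (key <= k < range_end)]


# Hypothetical cluster state; only the CSR prefix comes from the slide.
keys = sorted([
    "/registry-kube-1/apiservices/v1",
    "/registry-kube-1/certificatesigningrequest/csr-1",
    "/registry-kube-1/pods/default/myapp",
    "/registry-kube-1/secrets/default/token",
])

survivors = range_delete(
    keys, "/registry-kube-1/certificatesigningrequest", "prefix"
)
print(survivors)
```

Because `/` sorts before `p`, every key at or after the given key falls below the range end `prefix`, so in this simulation far more than the certificate signing requests is removed.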
  • 22. Junior Engineers are Features, not Bugs https://www.youtube.com/watch?v=cQta4G3ge44
  • 24. INCIDENT #2: LESSONS LEARNED
      • Disaster Recovery Plan?
      • Backup etcd to S3
      • Monitor the snapshots
  • 25. INCIDENT #3: API LATENCY SPIKES
  • 26. INCIDENT #3: CONNECTION ISSUES "... Kubernetes worker and master nodes sporadically fail to connect to etcd causing timeouts in the APIserver and disconnects in the pod network. ..." [Diagram labels: Master Node, API Server, etcd, etcd-member]
  • 27. INCIDENT #3: STOP THE BLEEDING

        #!/bin/bash
        SLEEPTIME=60
        while true; do
            echo "sleep for $SLEEPTIME seconds"
            sleep $SLEEPTIME
            timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
            if [ $? -eq 0 ]; then
                echo "all fine, no need to restart etcd member"
                continue
            else
                echo "restarting etcd-member"
                systemctl restart etcd-member
            fi
        done
  • 28. INCIDENT #3: CONFIRMATION FROM AWS "[...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]"
  • 29. INCIDENT #3: LESSONS LEARNED
      • It's never the AWS infrastructure until it is
      • Treat t2 instances with care
      • Kubernetes components are not necessarily "cloud native"
      Cloud Native? Declarative, dynamic, resilient, and scalable
  • 35. CLUSTER LIFECYCLE MANAGER (CLM) github.com/zalando-incubator/cluster-lifecycle-manager
  • 36. CLUSTER CHANNELS github.com/zalando-incubator/kubernetes-on-aws

        Channel   Description                                                      Clusters
        dev       Development and playground clusters.                             3
        alpha     Main infrastructure cluster (important to us).                   1
        beta      Product clusters for the rest of the organization (prod/test).   90+

  • 37. E2E TESTS ON EVERY PR github.com/zalando-incubator/kubernetes-on-aws
  • 38. RUNNING E2E TESTS (BEFORE) [Diagram: one cluster (control plane, nodes) from branch: dev; steps: Create Cluster → Run e2e tests → Delete Cluster; caption: Testing dev to alpha upgrade]
  • 39. RUNNING E2E TESTS (NOW) [Diagram: cluster from branch: alpha (base), then branch: dev (head); steps: Create Cluster → Update Cluster → Run e2e tests → Delete Cluster; caption: Testing dev to alpha upgrade]
  • 40. INCIDENT #4: LESSONS LEARNED
      • Automated e2e tests are pretty good, but not enough
      • Test the diff/migration automatically
      • Bootstrap new cluster with previous configuration
      • Apply new configuration
      • Run end-to-end & conformance tests
  • 41. INCIDENT #5: IMPACT

        [4:59 PM] Marc: There is a error during build - forbidden: image policy webhook backend denied one or more images: X-Trusted header "false" for image pierone../ci/cdp-builder:234 ..
        [5:01 PM] Alice: Now it does not start the build step at all
        [5:02 PM] John: +1
        [5:02 PM] John: Failed to create builder pod: …
        [5:02 PM] Pedro: +1
        [5:04 PM] Damien: +1
        [5:19 PM] Anton: We're currently having issues pulling images from our Docker registry which results in many problems…
        ...

  • 43. INCIDENT #5: A VERY INNOCENT PULL REQUEST
  • 44. INCIDENT #5: WHAT HAPPENED
      • Deployment caused rebuild with latest stable Go version
      • Library for signature verification was incompatible with Go 1.10, causing all verification checks to fail during runtime.
      • Lack of unit/smoke tests and alerting for one component
      • "Near miss": outage could have had large impact
  • 45. INCIDENT #6: IMPACT

        Error during Pod creation:
        MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found

      ⇒ All new Kubernetes deployments fail
  • 46. INCIDENT #6: CREDENTIALS QUEUE

        17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
        17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
        17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
        ..
        17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
        ..
        17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
        ..
        19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
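
The log excerpt shows a queue that only grows while the worker count stays at 20. As a worked example, the following sketch computes the growth rate from the timestamps and queue sizes on the slide. The function name `growth_per_minute` is hypothetical, and the samples are assumed to fall on the same day.

```python
from datetime import datetime

# (timestamp, queue size) pairs copied from the log lines on the slide.
samples = [
    ("17:30:07", 7115),
    ("17:31:07", 7505),
    ("17:32:07", 7886),
    ("17:37:07", 9686),
    ("17:44:07", 11976),
    ("19:16:07", 58381),
]


def growth_per_minute(first, last):
    """Average queue growth in items per minute between two samples."""
    (t0, q0), (t1, q1) = first, last
    elapsed = datetime.strptime(t1, "%H:%M:%S") - datetime.strptime(t0, "%H:%M:%S")
    return (q1 - q0) / (elapsed.total_seconds() / 60)


first_minute = growth_per_minute(samples[0], samples[1])
overall = growth_per_minute(samples[0], samples[-1])
print(f"first minute: {first_minute:.0f}/min, overall: {overall:.0f}/min")
```

Over the 106 minutes covered by the excerpt the backlog grows by roughly 480 items per minute on average, so the workers never catch up.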
  • 47. INCIDENT #6: CPU THROTTLING
  • 48. INCIDENT #6: WHAT HAPPENED Scaled down IAM provider to reduce Slack + Number of deployments increased ⇒ Process could not process credentials fast enough
  • 49. SLACK: CPU/memory requests "block" resources on nodes. Difference between actual usage and requests → Slack [Diagram: CPU and memory bars of a node with the "Slack" portion marked]
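
The definition of slack on this slide is simple arithmetic: requested resources minus actual usage. A minimal sketch with made-up numbers follows; the `Resources` type and the example values are assumptions for illustration, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_cores: float
    memory_gib: float


def slack(requested: Resources, used: Resources) -> Resources:
    """Slack = resources blocked by requests but not actually used."""
    return Resources(
        cpu_cores=requested.cpu_cores - used.cpu_cores,
        memory_gib=requested.memory_gib - used.memory_gib,
    )


# Hypothetical node: the sum of all pod requests versus measured usage.
node_slack = slack(Resources(3.5, 12.0), Resources(1.0, 7.5))
print(node_slack)
```

Lowering requests reduces slack, but as the previous slide shows, a component whose CPU allocation is cut too far may no longer keep up with its workload.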
  • 50. DISABLING CPU THROTTLING [Announcement] CPU limits will be disabled ⇒ Ingress Latency Improvements

        kubelet … --cpu-cfs-quota=false

  • 51. A MILLION WAYS TO CRASH YOUR CLUSTER?
      • Switch to latest Docker to fix issues with Docker daemon freezing
      • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS
      • Disabling CPU throttling (CFS quota) to avoid latency issues
      • Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
      • 502's during cluster updates: race condition during network setup
  • 52. MORE TOPICS
      • Graceful Pod shutdown and race conditions (endpoints, Ingress)
      • Incompatible Kubernetes changes
      • CoreOS ContainerLinux "stable" won't boot
      • Kubernetes EBS volume handling
      • Docker
  • 53. RACE CONDITIONS..
      • Switch to the latest Docker version available to fix the issues with Docker daemon freezing
      • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS
      • Disabling CPU throttling (CFS quota) to avoid latency issues
      • Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts
      • 502's during cluster updates: race condition
      • github.com/zalando-incubator/kubernetes-on-aws
  • 54. TIMEOUTS TO API SERVER.. github.com/zalando-incubator/kubernetes-on-aws
  • 58. OPEN SOURCE
      • Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
      • AWS ALB Ingress controller: github.com/zalando-incubator/kube-ingress-aws-controller
      • Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
      • External DNS: github.com/kubernetes-incubator/external-dns
      • Postgres Operator: github.com/zalando-incubator/postgres-operator
      • Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
      • Kubernetes Downscaler: github.com/hjacobs/kube-downscaler
  • 61. OTHER TALKS
      • Nordstrom: 101 Ways to Crash Your Cluster - KubeCon 2017
      • Monzo: Anatomy of a Production Kubernetes Outage - KubeCon 2018
      • Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - HighLoad++ 2018
      We need more failure talks!
  • 62. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k