
Kubernetes Failure Stories - KubeCon Europe Barcelona

Talk given on 2019-05-21 at KubeCon Barcelona: https://kccnceu19.sched.com/event/MPcM/kubernetes-failure-stories-and-how-to-crash-your-clusters-henning-jacobs-zalando-se

Bootstrapping a Kubernetes cluster is easy, but rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we present our approach to Kubernetes provisioning on AWS, operations, and developer experience for our growing Zalando developer base. We will walk you through our horror stories of operating 100+ clusters and share the insights we gained from incidents, failures, user reports, and general observations. Our failure stories will be sourced from recent and past incidents, so the talk will be up to date with our latest experiences.

Most of our learnings apply to other Kubernetes infrastructures (EKS, GKE, ..) as well. This talk strives to reduce the audience's unknown unknowns about running Kubernetes in production.


Kubernetes Failure Stories - KubeCon Europe Barcelona

  1. 1. HENNING JACOBS @try_except_ Kubernetes Failure Stories
  2. 2. 4 ZALANDO AT A GLANCE ~5.4 billion EUR revenue 2018 > 250 million visits per month > 15,000 employees in Europe > 79% of visits via mobile devices > 26 million active customers > 300,000 product choices ~2,000 brands 17 countries
  3. 3. 5 SCALE 118 Clusters 380 Accounts
  4. 4. 6 DEVELOPERS USING KUBERNETES
  5. 5. 7 47+ cluster components
  6. 6. INCIDENTS ARE FINE
  7. 7. INCIDENT #1
  8. 8. 10 INCIDENT #1: CUSTOMER IMPACT
  9. 9. 11 INCIDENT #1: CUSTOMER IMPACT
  10. 10. 12 INCIDENT #1: INGRESS ERRORS
  11. 11. 13 INCIDENT #1: AWS ALB 502 github.com/zalando/riptide
  12. 12. 14 INCIDENT #1: AWS ALB 502 github.com/zalando/riptide 502 Bad Gateway Server: awselb/2.0 ...
  13. 13. 15 INCIDENT #1: ALB HEALTHY HOST COUNT 3 healthy hosts zero healthy hosts 2xx requests
  14. 14. 16 LIFE OF A REQUEST (INGRESS) Node Node MyApp MyApp MyApp EC2 network K8s network TLS HTTP Skipper Skipper ALB
  15. 15. 17 INCIDENT #1: SKIPPER MEMORY USAGE Memory Limit Memory Usage
  16. 16. 18 INCIDENT #1: SKIPPER OOM Node Node MyApp MyApp MyApp TLS HTTP Skipper Skipper ALB OOMKill
  17. 17. 19 INCIDENT #1: CONTRIBUTING FACTORS • Shared Ingress (per cluster) • High latency of unrelated app (Solr) caused high number of in-flight requests • Skipper creates goroutine per HTTP request. Goroutine costs 2kB memory + http.Request • Memory limit was fixed at 500Mi (4x regular usage) Fix for the memory issue in Skipper: https://opensource.zalando.com/skipper/operation/operation/#scheduler
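      A rough back-of-envelope check shows why the 500Mi limit was exhausted; the 2 kB goroutine cost is from the slide above, the ~48 kB per-request buffer size is an assumption for illustration:
      #!/bin/bash
      # How many slow in-flight requests fit into a 500Mi proxy memory limit?
      LIMIT_BYTES=$((500 * 1024 * 1024))           # fixed 500Mi memory limit
      PER_REQUEST_BYTES=$((2 * 1024 + 48 * 1024))  # ~2 kB goroutine stack + assumed ~48 kB for http.Request, headers, buffers
      echo "in-flight requests until OOM: $((LIMIT_BYTES / PER_REQUEST_BYTES))"
      # => roughly 10,000 — a number a single latency-degraded backend (Solr here)
      #    behind a shared Ingress can accumulate quickly.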
  18. 18. INCIDENT #2
  19. 19. 21 INCIDENT #2: CUSTOMER IMPACT
  20. 20. 22 INCIDENT #2: IAM RETURNING 404
  21. 21. 23 INCIDENT #2: NUMBER OF PODS
  22. 22. 24 LIFE OF A REQUEST (INGRESS) Node Node MyApp MyApp MyApp EC2 network K8s network TLS HTTP Skipper Skipper ALB
  23. 23. 25 ROUTES FROM API SERVER Node Node MyApp MyApp MyApp Skipper ALB API Server Skipper
  24. 24. 26 API SERVER DOWN Node Node MyApp MyApp MyApp Skipper ALB API Server Skipper OOMKill
  25. 25. 27 INCIDENT #2: INNOCENT MANIFEST
      apiVersion: batch/v2alpha1
      kind: CronJob
      metadata:
        name: "foobar"
      spec:
        schedule: "*/15 9-19 * * Mon-Fri"
        jobTemplate:
          spec:
            template:
              spec:
                restartPolicy: Never
                concurrencyPolicy: Forbid       # misplaced: belongs on the CronJob spec (compare the fixed version below)
                successfulJobsHistoryLimit: 1   # misplaced
                failedJobsHistoryLimit: 1       # misplaced
                containers: ...
  26. 26. 28 INCIDENT #2: FIXED CRON JOB
      apiVersion: batch/v2alpha1
      kind: CronJob
      metadata:
        name: "foobar"
      spec:
        schedule: "7 8-18 * * Mon-Fri"
        concurrencyPolicy: Forbid
        successfulJobsHistoryLimit: 1
        failedJobsHistoryLimit: 1
        jobTemplate:
          spec:
            activeDeadlineSeconds: 120
            template:
              spec:
                restartPolicy: Never
                containers:
  27. 27. 29 INCIDENT #2: LESSONS LEARNED
      • Fix Ingress to stay “healthy” during API server problems
      • Fix Ingress to retain last known set of routes
      • Use quota for number of pods:
      apiVersion: v1
      kind: ResourceQuota
      metadata:
        name: compute-resources
      spec:
        hard:
          pods: "1500"
      NOTE: we dropped quotas recently: github.com/zalando-incubator/kubernetes-on-aws/pull/2059
  28. 28. INCIDENT #3
  29. 29. 31 INCIDENT #3: INGRESS ERRORS
  30. 30. 32 INCIDENT #3: COREDNS OOMKILL
      coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
      Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
      oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      (chart: CoreDNS restarts)
  31. 31. 33 STOP THE BLEEDING: INCREASE MEMORY LIMIT 4Gi 2Gi 200Mi
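      One way to apply such a stop-gap limit bump, assuming CoreDNS runs as the usual coredns Deployment in kube-system (names and labels may differ per cluster):
      # Raise the memory limit of the CoreDNS deployment
      kubectl -n kube-system set resources deployment/coredns --limits=memory=2Gi
      # Watch for further OOMKills / restarts afterwards (label may differ per setup)
      kubectl -n kube-system get pods -l k8s-app=kube-dns -w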
  32. 32. 34 SPIKE IN HTTP REQUESTS
  33. 33. 35 SPIKE IN DNS QUERIES
  34. 34. 36 INCREASE IN MEMORY USAGE
  35. 35. 37 INCIDENT #3: CONTRIBUTING FACTORS • HTTP retries • No DNS caching • Kubernetes ndots:5 problem • Short maximum lifetime of HTTP connections • Fixed memory limit for CoreDNS • Monitoring affected by DNS outage github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
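      The ndots:5 issue is visible in any pod's resolver configuration; a quick way to inspect it (pod name is a placeholder):
      kubectl exec -it <some-pod> -- cat /etc/resolv.conf
      # Typically shows something like:
      #   search <namespace>.svc.cluster.local svc.cluster.local cluster.local ...
      #   options ndots:5
      # With ndots:5, an external name like api.example.com is first tried against every
      # search domain before the absolute name, multiplying the DNS query load exactly
      # when HTTP retries are already spiking.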
  36. 36. INCIDENT #4
  37. 37. 39 INCIDENT #4: CLUSTER DOWN
  38. 38. 40 INCIDENT #4: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  39. 39. 41 INCIDENT #4: RTFM
      % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
      help: etcdctl del [options] <key> [range_end]
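      For reference, with etcdctl v3 a prefix deletion would use the --prefix flag; a minimal sketch (with a dry run first), using the key prefix from the slide:
      # Dry run: count the keys a prefix delete would hit
      ETCDCTL_API=3 etcdctl get --prefix --keys-only /registry-kube-1/certificatesigningrequest | wc -l
      # Actual deletion, limited to the intended prefix
      ETCDCTL_API=3 etcdctl del --prefix /registry-kube-1/certificatesigningrequest
      # Passing a bare word like "prefix" instead makes it the [range_end] argument,
      # deleting everything from the key up to that range end.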
  40. 40. 42 Junior Engineers are Features, not Bugs https://www.youtube.com/watch?v=cQta4G3ge44
  41. 41. https://www.outcome-eng.com/human-error-never-root-cause/
  42. 42. 44 INCIDENT #4: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
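      A minimal backup sketch, assuming etcdctl v3 on the etcd node and a hypothetical S3 bucket name:
      #!/bin/bash
      set -euo pipefail
      # Take a point-in-time etcd snapshot and ship it to S3
      SNAPSHOT="/var/tmp/etcd-snapshot-$(date +%Y%m%dT%H%M%S).db"
      ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT}"
      aws s3 cp "${SNAPSHOT}" "s3://my-etcd-backups/$(hostname)/"
      rm -f "${SNAPSHOT}"
      # Monitoring the snapshots (age, size) matters as much as taking them.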
  43. 43. INCIDENT #5
  44. 44. 46 INCIDENT #5: API LATENCY SPIKES
  45. 45. 47 INCIDENT #5: CONNECTION ISSUES ... Kubernetes worker and master nodes sporadically fail to connect to etcd, causing timeouts in the API server and disconnects in the pod network. ... Master Node API Server etcd etcd-member
  46. 46. 48 INCIDENT #5: STOP THE BLEEDING
      #!/bin/bash
      while true; do
        echo "sleep for 60 seconds"
        sleep 60
        timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
        if [ $? -eq 0 ]; then
          echo "all fine, no need to restart etcd member"
          continue
        else
          echo "restarting etcd-member"
          systemctl restart etcd-member
        fi
      done
  47. 47. 49 INCIDENT #5: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted in the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  48. 48. 50 INCIDENT #5: LESSONS LEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
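      One concrete way to keep an eye on t2 burst credits (instance ID and time window are placeholders):
      # CPU credit balance of a t2 instance over the last 3 hours, 5-minute resolution
      aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 --metric-name CPUCreditBalance \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --statistics Average --period 300 \
        --start-time "$(date -u -d '-3 hours' +%Y-%m-%dT%H:%M:%SZ)" \
        --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
      # A balance near zero means the instance is throttled down to its baseline.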
  49. 49. INCIDENT #6
  50. 50. 52 INCIDENT #6: IMPACT Ingress 5XXs
  51. 51. 53 INCIDENT #6: CLUSTER DOWN?
  52. 52. 54 INCIDENT #6: THE TRIGGER
  53. 53. https://www.outcome-eng.com/human-error-never-root-cause/
  54. 54. 56 CLUSTER UPGRADE FLOW
  55. 55. 57 CLUSTER LIFECYCLE MANAGER (CLM) github.com/zalando-incubator/cluster-lifecycle-manager
  56. 56. 58 CLUSTER CHANNELS github.com/zalando-incubator/kubernetes-on-aws
      Channel | Description                                                   | Clusters
      dev     | Development and playground clusters.                          | 3
      alpha   | Main infrastructure clusters (important to us).               | 2
      beta    | Product clusters for the rest of the organization (non-prod). | 57+
      stable  | Product clusters for the rest of the organization (prod).     | 57+
  57. 57. 59 E2E TESTS ON EVERY PR github.com/zalando-incubator/kubernetes-on-aws
  58. 58. 60 RUNNING E2E TESTS (BEFORE) Control plane node node branch: dev Create Cluster Run e2e tests Delete Cluster Testing dev to alpha upgrade Control plane Control plane
  59. 59. 61 RUNNING E2E TESTS (NOW) Control plane node node Control plane node node branch: alpha (base) branch: dev (head) Create Cluster Update Cluster Run e2e tests Delete Cluster Testing dev to alpha upgrade Control plane Control plane
  60. 60. 62 INCIDENT #6: LESSONS LEARNED • Automated e2e tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with previous configuration • Apply new configuration • Run end-to-end & conformance tests github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
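      A sketch of that upgrade-test flow with hypothetical helper scripts (the real pipeline is the CLM-driven e2e setup linked above):
      #!/bin/bash
      set -euo pipefail
      ./create-cluster.sh --config alpha-channel.yaml   # bootstrap with the previous (base) configuration
      ./update-cluster.sh --config dev-channel.yaml     # apply the new (head) configuration, i.e. exercise the actual diff/migration
      ./run-e2e-tests.sh --conformance                  # run end-to-end & conformance tests against the updated cluster
      ./delete-cluster.sh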
  61. 61. INCIDENT #7
  62. 62. 64 INCIDENT #7: KERNEL OOM KILLER ⇒ all containers on this node down
  63. 63. 65 INCIDENT #7: KUBELET MEMORY
  64. 64. 66 UPSTREAM ISSUE REPORTED https://github.com/kubernetes/kubernetes/issues/73587
  65. 65. 67 INCIDENT #7: THE PATCH https://github.com/kubernetes/kubernetes/issues/73587
  66. 66. INCIDENT #8
  67. 67. 69 INCIDENT #8: IMPACT Error during Pod creation: MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found ⇒ All new Kubernetes deployments fail
  68. 68. 70 INCIDENT #8: CREDENTIALS QUEUE
      17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
      17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
      17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
      ..
      17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
      ..
      17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
      ..
      19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
  69. 69. 71 INCIDENT #8: CPU THROTTLING
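      Throttling like this shows up in the container's CFS statistics (cgroup v1 path shown, assuming a shell inside the affected container):
      # nr_throttled / throttled_time keep growing while the CPU limit is being hit
      cat /sys/fs/cgroup/cpu/cpu.stat
      # Cluster-wide, cAdvisor exposes the same signal as
      #   container_cpu_cfs_throttled_periods_total / container_cpu_cfs_throttled_seconds_total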
  70. 70. 72 INCIDENT #8: WHAT HAPPENED Scaled down IAM provider to reduce Slack + number of deployments increased ⇒ the IAM provider could not process credentials fast enough
  71. 71. 73 SLACK CPU/memory requests "block" resources on nodes. Difference between actual usage and requests → Slack (diagram: CPU and memory "Slack" on a node)
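      A quick way to eyeball slack on a node, comparing requested ("blocked") versus actually used resources; kube-resource-report (linked at the end) automates this:
      # What is requested on the node
      kubectl describe node <node-name> | grep -A 8 "Allocated resources"
      # What is actually in use (requires metrics-server)
      kubectl top node <node-name>
      # The gap between the two is the slack.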
  72. 72. 74 DISABLING CPU THROTTLING [Announcement] CPU limits will be disabled ⇒ Ingress Latency Improvements kubelet … --cpu-cfs-quota=false
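      Whether the CFS quota is really gone can be verified on a node (cgroup v1; the container cgroup path is a placeholder):
      # Check the kubelet flag
      pgrep -a kubelet | grep -o -- '--cpu-cfs-quota=[a-z]*'
      # -1 means no CFS quota is applied, i.e. the container is no longer throttled
      cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>/<container-id>/cpu.cfs_quota_us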
  73. 73. 75 A MILLION WAYS TO CRASH YOUR CLUSTER? • Switch to latest Docker to fix issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts • 502's during cluster updates: race condition during network setup
  74. 74. 76 MORE TOPICS • Graceful Pod shutdown and race conditions (endpoints, Ingress) • Incompatible Kubernetes changes • CoreOS ContainerLinux "stable" won't boot • Kubernetes EBS volume handling • Docker
  75. 75. 77 RACE CONDITIONS.. • Switch to the latest Docker version available to fix the issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts • 502's during cluster updates: race condition • github.com/zalando-incubator/kubernetes-on-aws
  76. 76. 78 TIMEOUTS TO API SERVER.. github.com/zalando-incubator/kubernetes-on-aws
  77. 77. 79 MANAGED KUBERNETES?
  78. 78. 80 WILL MANAGED K8S SAVE US? GKE: monthly uptime percentage at 99.95% for regional clusters
  79. 79. 81 WILL MANAGED K8S SAVE US? NO (not really) e.g. AWS EKS uptime SLA is only for API server
  80. 80. 82 PRODUCTION PROOFING AWS EKS List of things you might want to look at for EKS in production: https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
  81. 81. 83 AWS EKS IN PRODUCTION https://kubedex.com/90-days-of-aws-eks-in-production/
  82. 82. 84 DOCKER.. (ON GKE) https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
  83. 83. WELCOME TO CLOUD NATIVE!
  84. 84. 86
  85. 85. 87 KUBERNETES FAILURE STORIES A compiled list of links to public failure stories related to Kubernetes. k8s.af We need more failure talks! Istio? Anyone?
  86. 86. 88 OPEN SOURCE Kubernetes on AWS github.com/zalando-incubator/kubernetes-on-aws AWS ALB Ingress controller github.com/zalando-incubator/kube-ingress-aws-controller Skipper HTTP Router & Ingress controller github.com/zalando/skipper External DNS github.com/kubernetes-incubator/external-dns Postgres Operator github.com/zalando-incubator/postgres-operator Kubernetes Resource Report github.com/hjacobs/kube-resource-report Kubernetes Downscaler github.com/hjacobs/kube-downscaler
  87. 87. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k
