
Kubernetes Failure Stories, or: How to Crash Your Cluster - ContainerDays EU 2019

Bootstrapping a Kubernetes cluster is easy; rolling it out to nearly 200 engineering teams and operating it at scale is a challenge. In this talk, we present our approach to Kubernetes provisioning on AWS, operations, and developer experience for our growing Zalando developer base. We will walk you through our horror stories of operating 100+ clusters and share the insights we gained from incidents, failures, user reports, and general observations. Our failure stories are sourced from recent and past incidents, so the talk is up-to-date with our latest experiences.


  1. 1. Kubernetes Failure Stories CONTAINER DAYS 2019-06-26 HENNING JACOBS @try_except_
  2. 2. 2 ZALANDO AT A GLANCE ~5.4 billion EUR revenue (2018) · >250 million visits per month · >15,000 employees in Europe · >79% of visits via mobile devices · >26 million active customers · >300,000 product choices · ~2,000 brands · 17 countries
  3. 3. 3 SCALE 130 clusters · 396 accounts
  4. 4. 4 DEVELOPERS USING KUBERNETES
  5. 5. 5 47+ cluster components
  6. 6. INCIDENTS ARE FINE
  7. 7. INCIDENT #0
  8. 8. 8 INCIDENT #0: IMPACT
  9. 9. 9 INCIDENT #0: CONTRIBUTING FACTORS ● Pods couldn’t get AWS IAM credentials, timing out and failing ● kube2iam could not get the Pod’s IP address ● kubelet was delayed in updating Pod statuses for multiple minutes ● Default kubelet configuration has a low rate limit for calls to the API server (--kube-api-qps) ● Due to rescaling, only one node was available for builder Pods ● Rapid creation and deletion of Pods caused kubelet to fall behind
  10. 10. 10 INCIDENT #0: FIX github.com/zalando-incubator/kubernetes-on-aws/pull/2247
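The pull request above contains the actual fix; purely as an illustration (the values below are assumptions, not Zalando's settings), raising the kubelet's API client rate limit can be expressed in a KubeletConfiguration like this:

      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      # Defaults (--kube-api-qps=5, --kube-api-burst=10) are easy to exhaust
      # when many Pods are created and deleted on one node; values are illustrative.
      kubeAPIQPS: 50
      kubeAPIBurst: 100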
  11. 11. INCIDENT #1
  12. 12. 12 INCIDENT #1: CUSTOMER IMPACT
  13. 13. 13 INCIDENT #1: CUSTOMER IMPACT
  14. 14. 14 INCIDENT #1: INGRESS ERRORS
  15. 15. 15 INCIDENT #1: AWS ALB 502 github.com/zalando/riptide
  16. 16. 16 INCIDENT #1: AWS ALB 502 github.com/zalando/riptide 502 Bad Gateway Server: awselb/2.0 ...
  17. 17. 17 INCIDENT #1: ALB HEALTHY HOST COUNT [graphs: healthy host count drops from 3 healthy hosts to zero healthy hosts; 2xx requests]
  18. 18. 18 LIFE OF A REQUEST (INGRESS) [diagram: ALB → Skipper on the nodes → MyApp Pods; TLS → HTTP; EC2 network / K8s network]
  19. 19. 19 INCIDENT #1: SKIPPER MEMORY USAGE [graph: memory usage vs. memory limit]
  20. 20. 20 INCIDENT #1: SKIPPER OOM [diagram: ALB → Skipper (TLS) → MyApp Pods (HTTP); Skipper OOMKilled]
  21. 21. 21 INCIDENT #1: CONTRIBUTING FACTORS • Shared Ingress (per cluster) • High latency of unrelated app (Solr) caused high number of in-flight requests • Skipper creates goroutine per HTTP request. Goroutine costs 2kB memory + http.Request • Memory limit was fixed at 500Mi (4x regular usage) Fix for the memory issue in Skipper: https://opensource.zalando.com/skipper/operation/operation/#scheduler
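For illustration only (the request value is an assumption, not the real Skipper deployment), this is the shape of the fixed 500Mi cap that a surge of in-flight requests could blow through before the scheduler fix landed:

      # Illustrative container resources fragment for the ingress proxy
      resources:
        requests:
          memory: 128Mi     # "regular usage" ballpark (assumed)
        limits:
          memory: 500Mi     # fixed cap (~4x regular usage); an in-flight request surge can exceed it and trigger an OOMKill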
  22. 22. INCIDENT #2
  23. 23. 23 INCIDENT #2: CUSTOMER IMPACT
  24. 24. 24 INCIDENT #2: IAM RETURNING 404
  25. 25. 25 INCIDENT #2: NUMBER OF PODS
  26. 26. 26 LIFE OF A REQUEST (INGRESS) [diagram: ALB → Skipper on the nodes → MyApp Pods; TLS → HTTP; EC2 network / K8s network]
  27. 27. 27 ROUTES FROM API SERVER [diagram: API Server → Skipper on the nodes → MyApp Pods; ALB]
  28. 28. 28 API SERVER DOWN [diagram: API Server OOMKilled; ALB → Skipper → MyApp Pods]
  29. 29. 29 INCIDENT #2: INNOCENT MANIFEST
      apiVersion: batch/v2alpha1
      kind: CronJob
      metadata:
        name: "foobar"
      spec:
        schedule: "*/15 9-19 * * Mon-Fri"
        jobTemplate:
          spec:
            template:
              spec:
                restartPolicy: Never
                concurrencyPolicy: Forbid
                successfulJobsHistoryLimit: 1
                failedJobsHistoryLimit: 1
                containers:
                  ...
  30. 30. 30 INCIDENT #2: FIXED CRON JOB
      apiVersion: batch/v2alpha1
      kind: CronJob
      metadata:
        name: "foobar"
      spec:
        schedule: "7 8-18 * * Mon-Fri"
        concurrencyPolicy: Forbid
        successfulJobsHistoryLimit: 1
        failedJobsHistoryLimit: 1
        jobTemplate:
          spec:
            activeDeadlineSeconds: 120
            template:
              spec:
                restartPolicy: Never
                containers:
      (Compared to the "innocent" manifest, concurrencyPolicy and the history limits now sit at the CronJob spec level where they actually take effect, and activeDeadlineSeconds bounds how long each Job may run.)
  31. 31. 31 INCIDENT #2: LESSONS LEARNED
      • Fix Ingress to stay "healthy" during API server problems
      • Fix Ingress to retain last known set of routes
      • Use quota for number of pods:
        apiVersion: v1
        kind: ResourceQuota
        metadata:
          name: compute-resources
        spec:
          hard:
            pods: "1500"
      NOTE: we dropped quotas recently: github.com/zalando-incubator/kubernetes-on-aws/pull/2059
  32. 32. INCIDENT #3
  33. 33. 33 INCIDENT #3: INGRESS ERRORS
  34. 34. 34 INCIDENT #3: COREDNS OOMKILL
      coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
      Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
      oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [graph: CoreDNS container restarts]
  35. 35. 35 STOP THE BLEEDING: INCREASE MEMORY LIMIT [graph: memory limit raised from 200Mi to 2Gi to 4Gi]
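As a hypothetical sketch of that stopgap (values illustrative, not the exact cluster configuration), the CoreDNS container's resources would be bumped roughly like this:

      # coredns container spec fragment, illustrative values only
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          memory: 2Gi    # raised from the original tight limit while the root cause is investigated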
  36. 36. 36 SPIKE IN HTTP REQUESTS
  37. 37. 37 SPIKE IN DNS QUERIES
  38. 38. 38 INCREASE IN MEMORY USAGE
  39. 39. 39 INCIDENT #3: CONTRIBUTING FACTORS • HTTP retries • No DNS caching • Kubernetes ndots:5 problem • Short maximum lifetime of HTTP connections • Fixed memory limit for CoreDNS • Monitoring affected by DNS outage github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
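A common mitigation for the ndots:5 problem, shown here as a generic sketch rather than the change Zalando actually shipped, is to lower ndots per Pod so that external hostnames are resolved as absolute names instead of walking the whole search path:

      # Pod spec fragment: names containing at least one dot are tried as absolute queries first
      dnsConfig:
        options:
        - name: ndots
          value: "1"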
  40. 40. INCIDENT #4
  41. 41. 41 INCIDENT #4: CLUSTER DOWN
  42. 42. 42 INCIDENT #4: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  43. 43. 43 INCIDENT #4: RTFM
      % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
      help: etcdctl del [options] <key> [range_end]
      The literal word "prefix" was parsed as range_end, so the command deleted every key from .../certificatesigningrequest up to "prefix" (effectively most of the registry) instead of only the intended prefix.
  44. 44. 44 Junior Engineers are Features, not Bugs https://www.youtube.com/watch?v=cQta4G3ge44
  45. 45. https://www.outcome-eng.com/human-error-never-root-cause/
  46. 46. 46 INCIDENT #4: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
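A minimal sketch of the backup idea, assuming etcd is reachable from the Pod at the placeholder endpoint, client certificates are already mounted under /etc/etcd, a hypothetical image bundling etcdctl and the AWS CLI, and s3:PutObject permission on the bucket; this is not Zalando's actual setup:

      apiVersion: batch/v1beta1
      kind: CronJob
      metadata:
        name: etcd-backup                  # hypothetical name
      spec:
        schedule: "0 * * * *"              # hourly snapshot
        concurrencyPolicy: Forbid
        jobTemplate:
          spec:
            activeDeadlineSeconds: 300
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: backup
                  image: example.org/etcdctl-awscli:latest   # hypothetical image
                  command:
                  - /bin/sh
                  - -c
                  - |
                    set -e
                    ETCDCTL_API=3 etcdctl --endpoints=https://etcd.example.org:2379 \
                      --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
                      snapshot save /tmp/etcd-snapshot.db
                    aws s3 cp /tmp/etcd-snapshot.db s3://example-etcd-backups/$(date +%Y%m%d-%H%M%S).db
                  # volumeMounts for /etc/etcd omitted for brevity

Monitoring then alerts when the newest snapshot object in the bucket is too old or suspiciously small.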
  47. 47. INCIDENT #5
  48. 48. 48 INCIDENT #5: API LATENCY SPIKES
  49. 49. 49 INCIDENT #5: CONNECTION ISSUES ... Kubernetes worker and master nodes sporadically fail to connect to etcd, causing timeouts in the API server and disconnects in the pod network. ... [diagram: Master Node running API Server and etcd-member; etcd]
  50. 50. 50 INCIDENT #5: STOP THE BLEEDING
      #!/bin/bash
      while true; do
          echo "sleep for 60 seconds"
          sleep 60
          timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
          if [ $? -eq 0 ]; then
              echo "all fine, no need to restart etcd member"
              continue
          else
              echo "restarting etcd-member"
              systemctl restart etcd-member
          fi
      done
  51. 51. 51 INCIDENT #5: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  52. 52. 52 INCIDENT #5: LESSONS LEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
  53. 53. INCIDENT #6
  54. 54. 54 INCIDENT #6: IMPACT Ingress 5XXs
  55. 55. 55 INCIDENT #6: CLUSTER DOWN?
  56. 56. 56 INCIDENT #6: THE TRIGGER
  57. 57. https://www.outcome-eng.com/human-error-never-root-cause/
  58. 58. 58 CLUSTER UPGRADE FLOW
  59. 59. 59 CLUSTER LIFECYCLE MANAGER (CLM) github.com/zalando-incubator/cluster-lifecycle-manager
  60. 60. 60 CLUSTER CHANNELS github.com/zalando-incubator/kubernetes-on-aws
      Channel | Description                                                   | Clusters
      dev     | Development and playground clusters.                          | 3
      alpha   | Main infrastructure clusters (important to us).               | 2
      beta    | Product clusters for the rest of the organization (non-prod). | 57+
      stable  | Product clusters for the rest of the organization (prod).     | 57+
  61. 61. 61 E2E TESTS ON EVERY PR github.com/zalando-incubator/kubernetes-on-aws
  62. 62. 62 RUNNING E2E TESTS (BEFORE) [diagram: Create Cluster (branch: dev) → Run e2e tests → Delete Cluster; caption: Testing dev to alpha upgrade]
  63. 63. 63 RUNNING E2E TESTS (NOW) [diagram: Create Cluster (branch: alpha, base) → Update Cluster (branch: dev, head) → Run e2e tests → Delete Cluster; caption: Testing dev to alpha upgrade]
  64. 64. 64 INCIDENT #6: LESSONS LEARNED • Automated e2e tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with previous configuration • Apply new configuration • Run end-to-end & conformance tests github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
  65. 65. INCIDENT #7
  66. 66. 66 INCIDENT #7: KERNEL OOM KILLER ⇒ all containers on this node down
  67. 67. 67 INCIDENT #7: KUBELET MEMORY
  68. 68. 68 UPSTREAM ISSUE REPORTED https://github.com/kubernetes/kubernetes/issues/73587
  69. 69. 69 INCIDENT #7: THE PATCH https://github.com/kubernetes/kubernetes/issues/73587
  70. 70. INCIDENT #8
  71. 71. 71 INCIDENT #8: IMPACT Error during Pod creation: MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found ⇒ All new Kubernetes deployments fail
  72. 72. 72 INCIDENT #8: CREDENTIALS QUEUE
      17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
      17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
      17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
      ..
      17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
      ..
      17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
      ..
      19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
  73. 73. 73 INCIDENT #8: CPU THROTTLING
  74. 74. 74 INCIDENT #8: WHAT HAPPENED Scaled down IAM provider to reduce Slack + Number of deployments increased ⇒ Process could not process credentials fast enough
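For context, a purely illustrative resources fragment (names and numbers are assumptions): with the CFS quota enabled, any CPU use above the limit is throttled within each 100 ms period even when the node has idle cores, which is the mechanism behind the throttling graph above.

      # Illustrative only: a tight CPU limit causes CFS throttling under bursts
      resources:
        requests:
          cpu: 100m
        limits:
          cpu: 200m    # bursts above 200m are throttled in every 100ms CFS period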
  75. 75. 75 SLACK CPU/memory requests "block" resources on nodes. Difference between actual usage and requests → Slack (e.g. a container requesting 1 CPU but actually using 200m contributes 800m of slack). [diagram: node capacity split into usage and "Slack" for CPU and Memory]
  76. 76. 76 DISABLING CPU THROTTLING [Announcement] CPU limits will be disabled ⇒ Ingress Latency Improvements kubelet … --cpu-cfs-quota=false
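The same setting has a config-file form; a minimal sketch, assuming the kubelet is driven by a KubeletConfiguration file:

      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      cpuCFSQuota: false   # equivalent of --cpu-cfs-quota=false: CPU limits are no longer enforced via CFS quota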
  77. 77. 77 DISABLING CPU THROTTLING
  78. 78. 78 A MILLION WAYS TO CRASH YOUR CLUSTER? • Switch to latest Docker to fix issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts • 502's during cluster updates: race condition during network setup
  79. 79. 79 MORE TOPICS • Graceful Pod shutdown and race conditions (endpoints, Ingress) • Incompatible Kubernetes changes • CoreOS ContainerLinux "stable" won't boot • Kubernetes EBS volume handling • Docker
  80. 80. 80 RACE CONDITIONS.. • Switch to the latest Docker version available to fix the issues with Docker daemon freezing • Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS • Disabling CPU throttling (CFS quota) to avoid latency issues • Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts • 502's during cluster updates: race condition • github.com/zalando-incubator/kubernetes-on-aws
  81. 81. 81 TIMEOUTS TO API SERVER.. github.com/zalando-incubator/kubernetes-on-aws
  82. 82. 82 MANAGED KUBERNETES?
  83. 83. 83 WILL MANAGED K8S SAVE US? GKE: monthly uptime percentage at 99.95% for regional clusters
  84. 84. 84 WILL MANAGED K8S SAVE US? NO (not really): e.g. the AWS EKS uptime SLA covers only the API server
  85. 85. 85 PRODUCTION PROOFING AWS EKS List of things you might want to look at for EKS in production: https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
  86. 86. 86 AWS EKS IN PRODUCTION https://kubedex.com/90-days-of-aws-eks-in-production/
  87. 87. 87 DOCKER.. (ON GKE) https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
  88. 88. WELCOME TO CLOUD NATIVE!
  89. 89. 89
  90. 90. 90 KUBERNETES FAILURE STORIES A compiled list of links to public failure stories related to Kubernetes. k8s.af We need more failure talks! Istio? Anyone?
  91. 91. 91 DISCLAIMER
  92. 92. 92 OPEN SOURCE
      Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
      Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
      External DNS: github.com/kubernetes-incubator/external-dns
      Postgres Operator: github.com/zalando-incubator/postgres-operator
      Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
      Kubernetes Downscaler: github.com/hjacobs/kube-downscaler
      Kubernetes Operator Pythonic Framework (Kopf): github.com/zalando-incubator/kopf
  93. 93. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k
