Advertisement

Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

Senior Principal at Zalando SE
Feb. 22, 2018
Advertisement

More Related Content

Slideshows for you(20)

Similar to Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW(20)

Advertisement
Advertisement

Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

  1. DEVOPS NRW 2018-02-21 HENNING JACOBS @try_except_ Kubernetes on AWS @ZalandoTech
  2. 2 ZALANDO IN NUMBERS > 4.5billion EUR 2017 > 200 million visits per month > 14,000 employees in Europe > 70% of visits via mobile devices > 22 million active customers > 250,000 product choices ~ 2,000 brands 15 countries
  3. 3 OUR FOOTPRINT AROUND EUROPE as of November 2017 1 8 10 11 12 13 BERLIN HEADQUARTERS AND OUTLET BRIESELANG FULFILLMENT CENTER ERFURT FULFILLMENT CENTER AND TECH OFFICE MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE LAHR FULFILLMENT CENTER DORTMUND TECH HUB FRANKFURT OUTLET DUBLIN TECH HUB HELSINKI TECH HUB MILAN (STRADELLA) FULFILLMENT CENTER KÖLN OUTLET PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER SZCZECIN (GRYFINO) FULFILLMENT CENTER HAMBURG ADTECH LAB STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017) 10 9 7 6 5 3 2 1 11 12 13 4 14 15 15 14 9 8 7 6 5 4 3 2 1
  4. 4 OUR FOOTPRINT AROUND EUROPE TECH as of November 2017 1 8 10 11 12 13 BERLIN HEADQUARTERS AND OUTLET BRIESELANG FULFILLMENT CENTER ERFURT FULFILLMENT CENTER AND TECH OFFICE MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE LAHR FULFILLMENT CENTER DORTMUND TECH HUB FRANKFURT OUTLET DUBLIN TECH HUB HELSINKI TECH HUB MILAN (STRADELLA) FULFILLMENT CENTER KÖLN OUTLET PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER SZCZECIN (GRYFINO) FULFILLMENT CENTER HAMBURG ADTECH LAB STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017) 10 9 7 6 5 3 2 1 11 12 13 4 14 15 15 14 9 8 7 6 5 4 3 2 1
  5. INCIDENTS ARE FINE
  6. ON CALL
  7. 7 INCIDENT #1: CUSTOMER IMPACT
  8. 8 INCIDENT #1: IAM RETURNING 404
  9. 9 INCIDENT #1: NUMBER OF PODS
  10. 10 LIFE OF A REQUEST (INGRESS) DNS my-app.example.org ALB aws-1234-lb.eu-central-1.elb.amazonaws.com SERVICE 10.3.0.216 DEPLOYMENT POD 10.2.0.1 POD 10.2.1.1 POD 10.2.2.1 POD 10.2.3.1 SKIPPER 172.31.1.1:9999 SKIPPER 172.31.2.1:9999 SKIPPER 172.31.3.1:9999 SKIPPER 172.31.4.1:9999 ALIAS Record
  11. 11 INCIDENT #1: UNASSUMING MANIFEST apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "*/15 9-19 * * Mon-Fri" jobTemplate: spec: template: metadata: labels: application: "foobar" spec: restartPolicy: Never concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 containers: ...
  12. 12 INCIDENT #1: FIXED CRON JOB apiVersion: batch/v2alpha1 kind: CronJob metadata: name: "foobar" labels: application: "foobar" spec: schedule: "7 8-18 * * Mon-Fri" concurrencyPolicy: Forbid successfulJobsHistoryLimit: 1 failedJobsHistoryLimit: 1 jobTemplate: spec: activeDeadlineSeconds: 120 template: metadata: labels: application: "foobar" spec: restartPolicy: Never containers: ...
  13. 13 INCIDENT #1: LESSONS LEARNED • ALB routes traffic to ALL hosts if all hosts report “unhealthy” • Fix Skipper Ingress to stay “healthy” during API server problems • Fix Skipper Ingress to retain last known set of routes • Use quota for number of pods apiVersion: v1 kind: ResourceQuota metadata: name: compute-resources spec: hard: pods: "1500"
  14. 14 STABILITY: LIMIT RANGE $ kubectl describe limitrange Name: limits Namespace: default Type Resource Min Max Default Req Default Limit Max Limit/Request Ratio ---- -------- --- --- ----------- ------------- ----------------------- Container memory - 64Gi 100Mi 1Gi - Container cpu - 16 100m 3 - http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources ⇒ Mitigate errors on OSI layer 8 ;-)
  15. 15 INCIDENT #2: CLUSTER DOWN
  16. 16 INCIDENT #2: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
  17. 17 INCIDENT #2: RTFM % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix help: etcdctl del [options] <key> [range_end]
  18. https://www.outcome-eng.com/human-error-never-root-cause/
  19. 19 INCIDENT #2: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots
  20. 20 INCIDENT #3: LATENCY SPIKES
  21. 21 INCIDENT #3: STOP THE BLEEDING #!/bin/bash SLEEPTIME=60 while true; do echo "sleep for $SLEEPTIME seconds" sleep $SLEEPTIME timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null if [ $? -eq 0 ]; then echo "all fine, no need to restart etcd member" continue else echo "restarting etcd-member" systemctl restart etcd-member fi done
  22. 22 INCIDENT #3: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...]
  23. 23 INCIDENT #3: LESSONS LEARNED • It's never the AWS infrastructure until it is • Treat t2 instances with care • Kubernetes components are not necessarily "cloud native" Cloud Native? Declarative, dynamic, resilient, and scalable
  24. 24 INCIDENT #4: IMPACT
  25. 25 INCIDENT #4: CLUSTER DOWN?
  26. 26 INCIDENT #4: THE TRIGGER
  27. https://www.outcome-eng.com/human-error-never-root-cause/
  28. 28 CLUSTER UPDATES
  29. 29 CLUSTER LIFECYCLE MANAGER
  30. 30 INCIDENT #4: LESSONS LEARNED • Automated end-to-end tests are pretty good, but not enough • Test the diff/migration automatically • Bootstrap new cluster with the previous configuration • Apply new configuration • Run end-to-end & conformance tests
  31. 31 TRAFFIC SWITCHING Default deployment: rolling update
  32. 32 TRAFFIC SWITCHING
  33. 33 MANUAL TRAFFIC SWITCHING $ zkubectl traffic <ingress-name> SERVICE WEIGHT <service-backend-1> 30% <service-backend-2> 70% $ zkubectl traffic <ingress-name> <service-backend-2> 100 SERVICE WEIGHT <service-backend-1> 0% <service-backend-2> 100%
  34. 34 DOCUMENTATION "Documentation is hard to find" "Documentation is not comprehensive enough" "Remove unnecessary complexity and obstacles." "Get the documentation up to date and prepare use cases" "More and more clear documentation" "More detailed docs, example repos with more complicated deployments."
  35. 35 DOCUMENTATION • Restructure following https://www.divio.com/en/blog/documentation/ • Concepts • How Tos • Tutorials • Reference • Global Search • Weekly Health Check: Support → Documentation
  36. 36 ONBOARDING • Many new concepts to grasp vs. 200+ teams • Kubernetes Training (2h) • Documentation • Recorded Friday Demos • Support Channels (chat, mail)
  37. 37 CLUSTER SCOPE
  38. 38 DOES ANYTHING EVEN WORK? • Kubernetes API as the primary interface • Ingress Controller + External DNS • kube2iam • Cluster lifecycle management • Zalando IAM/OAuth integration via CRD • PostgreSQL Operator
  39. 39 DEPLOYMENT
  40. 40 DEPLOYMENT CONFIGURATION . ├── deploy/apply │ ├── deployment.yaml # K8s Deployment │ ├── credentials.yaml # K8s TPR │ ├── ingress.yaml # K8s Ingress │ └── service.yaml # K8s Service └── delivery.yaml # pipeline config
  41. 41 INGRESS.YAML apiVersion: extensions/v1beta1 kind: Ingress metadata: name: "..." spec: rules: # DNS name your application should be exposed on - host: "myapp.foo.example.org" http: paths: - backend: serviceName: "myapp" servicePort: 80
  42. 42 CONTINUOUS DELIVERY PLATFORM
  43. 43 CDP: APPLY
  44. 44 CDP: OPTIONAL APPROVAL
  45. 45 POSTGRES OPERATOR Application to manage PostgreSQL clusters Observes “postgres” manifests (CRDs) Spawns and modifies new clusters Syncs and provisions roles Handles volume resize, incl. Resize2fs Also responsible for updating Docker images
  46. https://github.com/hjacobs/kube-ops-view
  47. 47 LINKS Running Kubernetes in Production on AWS http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html Kube AWS Ingress Controller https://github.com/zalando-incubator/kube-ingress-aws-controller External DNS https://github.com/kubernetes-incubator/external-dns PostgreSQL Operator https://github.com/zalando-incubator/postgres-operator Zalando Cluster Configuration https://github.com/zalando-incubator/kubernetes-on-aws List of Organizations using Kubernetes on AWS https://github.com/hjacobs/kubernetes-on-aws-users
  48. https://goo.gl/t2zNc8
  49. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k
Advertisement