
Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW


Presentation held at the Düsseldorf DevOps NRW meetup on 2018-02-21: https://www.meetup.com/devops-duesseldorf/events/246645236/

Published in: Technology


  1. DEVOPS NRW, 2018-02-21
      HENNING JACOBS (@try_except_)
      Kubernetes on AWS
      @ZalandoTech
  2. ZALANDO IN NUMBERS
      • > 4.5 billion EUR (2017)
      • > 200 million visits per month
      • > 14,000 employees in Europe
      • > 70% of visits via mobile devices
      • > 22 million active customers
      • > 250,000 product choices
      • ~ 2,000 brands
      • 15 countries
  3. OUR FOOTPRINT AROUND EUROPE (as of November 2017)
      • BERLIN HEADQUARTERS AND OUTLET
      • BRIESELANG FULFILLMENT CENTER
      • ERFURT FULFILLMENT CENTER AND TECH OFFICE
      • MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
      • LAHR FULFILLMENT CENTER
      • DORTMUND TECH HUB
      • FRANKFURT OUTLET
      • DUBLIN TECH HUB
      • HELSINKI TECH HUB
      • MILAN (STRADELLA) FULFILLMENT CENTER
      • KÖLN OUTLET
      • PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
      • SZCZECIN (GRYFINO) FULFILLMENT CENTER
      • HAMBURG ADTECH LAB
      • STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
  4. OUR FOOTPRINT AROUND EUROPE: TECH (as of November 2017)
      (same location list as slide 3)
  5. INCIDENTS ARE FINE
  6. ON CALL
  7. INCIDENT #1: CUSTOMER IMPACT
  8. INCIDENT #1: IAM RETURNING 404
  9. INCIDENT #1: NUMBER OF PODS
  10. LIFE OF A REQUEST (INGRESS)
      DNS         my-app.example.org
      ALB         aws-1234-lb.eu-central-1.elb.amazonaws.com
      SERVICE     10.3.0.216
      DEPLOYMENT
      POD         10.2.0.1
      POD         10.2.1.1
      POD         10.2.2.1
      POD         10.2.3.1
      SKIPPER     172.31.1.1:9999
      SKIPPER     172.31.2.1:9999
      SKIPPER     172.31.3.1:9999
      SKIPPER     172.31.4.1:9999
      ALIAS Record
  11. INCIDENT #1: UNASSUMING MANIFEST

      apiVersion: batch/v2alpha1
      kind: CronJob
      metadata:
        name: "foobar"
        labels:
          application: "foobar"
      spec:
        schedule: "*/15 9-19 * * Mon-Fri"
        jobTemplate:
          spec:
            template:
              metadata:
                labels:
                  application: "foobar"
              spec:
                restartPolicy: Never
                concurrencyPolicy: Forbid
                successfulJobsHistoryLimit: 1
                failedJobsHistoryLimit: 1
                containers: ...

  12. INCIDENT #1: FIXED CRON JOB

      apiVersion: batch/v2alpha1
      kind: CronJob
      metadata:
        name: "foobar"
        labels:
          application: "foobar"
      spec:
        schedule: "7 8-18 * * Mon-Fri"
        concurrencyPolicy: Forbid
        successfulJobsHistoryLimit: 1
        failedJobsHistoryLimit: 1
        jobTemplate:
          spec:
            activeDeadlineSeconds: 120
            template:
              metadata:
                labels:
                  application: "foobar"
              spec:
                restartPolicy: Never
                containers: ...

  13. INCIDENT #1: LESSONS LEARNED
      • ALB routes traffic to ALL hosts if all hosts report “unhealthy”
      • Fix Skipper Ingress to stay “healthy” during API server problems
      • Fix Skipper Ingress to retain last known set of routes
      • Use quota for number of pods

      apiVersion: v1
      kind: ResourceQuota
      metadata:
        name: compute-resources
      spec:
        hard:
          pods: "1500"
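
The two Skipper-related lessons on slide 13 (stay "healthy" during API server problems, retain the last known set of routes) can be illustrated with a minimal Python sketch. This is not Skipper's actual implementation: `RouteTable`, the `fetch` callback, and the host-to-backend mapping are hypothetical names chosen only to show the idea of falling back to cached routes.

```python
from typing import Callable, Dict

Routes = Dict[str, str]  # hypothetical shape: host name -> backend address


class RouteTable:
    """Caches the last successfully fetched routes (illustrative only)."""

    def __init__(self, fetch: Callable[[], Routes]) -> None:
        self._fetch = fetch      # assumed to query the Kubernetes API
        self._routes: Routes = {}
        self._loaded = False     # True once any fetch has succeeded
        self.stale = False       # True while serving cached routes

    def refresh(self) -> Routes:
        try:
            self._routes = dict(self._fetch())
            self._loaded = True
            self.stale = False
        except Exception:
            # API server problem: keep the previous routes instead of
            # dropping them, and only flag the table as stale.
            self.stale = True
        return dict(self._routes)

    def healthy(self) -> bool:
        # Health depends on having a usable route set, not on whether
        # the most recent API call worked.
        return self._loaded
```

With this shape, a failed refresh leaves the router serving the previous routes and still reporting healthy, so a load balancer health check would not flip every host to "unhealthy" at once.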
  14. STABILITY: LIMIT RANGE

      $ kubectl describe limitrange
      Name:       limits
      Namespace:  default
      Type       Resource  Min  Max   Default Req  Default Limit  Max Limit/Request Ratio
      ----       --------  ---  ---   -----------  -------------  -----------------------
      Container  memory    -    64Gi  100Mi        1Gi            -
      Container  cpu       -    16    100m         3              -

      http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
      ⇒ Mitigate errors on OSI layer 8 ;-)
  15. INCIDENT #2: CLUSTER DOWN
  16. INCIDENT #2: MANUAL OPERATION

      % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix

  17. INCIDENT #2: RTFM

      % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
      help: etcdctl del [options] <key> [range_end]
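
The help line on slide 17 shows that the second positional argument of `etcdctl del` is a `range_end`, so the word `prefix` was taken as the upper bound of a key range rather than as a mode. etcd v3 ranges are half-open, `[key, range_end)`, compared bytewise. The Python sketch below simulates that selection on a handful of made-up keys (the key names under `/registry-kube-1/` are assumptions for illustration, not taken from the incident) and contrasts it with a real prefix match, which is what the `--prefix` flag would have requested.

```python
from typing import Iterable, List, Optional

# Hypothetical keys; a real cluster stores many more.
KEYS = [
    b"/registry-kube-1/apiservices/v1.",
    b"/registry-kube-1/certificatesigningrequests/csr-1",
    b"/registry-kube-1/deployments/default/my-app",
    b"/registry-kube-1/pods/default/my-app-1",
    b"/registry-kube-1/secrets/default/token",
]


def keys_in_range(keys: Iterable[bytes], key: bytes,
                  range_end: Optional[bytes] = None) -> List[bytes]:
    """Keys selected by a range request: one key, or [key, range_end)."""
    if range_end is None:
        return [k for k in keys if k == key]
    return sorted(k for k in keys if key <= k < range_end)


def prefix_range_end(prefix: bytes) -> bytes:
    """Smallest range_end covering all keys that start with `prefix`."""
    end = bytearray(prefix)
    for i in reversed(range(len(end))):
        if end[i] < 0xFF:
            end[i] += 1           # bump the last byte that can be bumped
            return bytes(end[: i + 1])
    return b"\x00"                # prefix was all 0xFF: no upper bound


START = b"/registry-kube-1/certificatesigningrequest"

# What the command on the slide asks for: everything from START up to "prefix".
deleted_by_typo = keys_in_range(KEYS, START, b"prefix")

# What a prefix delete would select.
deleted_by_prefix = keys_in_range(KEYS, START, prefix_range_end(START))
```

Because every key here starts with `/`, which sorts before `p`, the range ending at `prefix` covers every key at or after the start key, not only the certificate signing requests.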
  18. https://www.outcome-eng.com/human-error-never-root-cause/
  19. INCIDENT #2: LESSONS LEARNED
      • Disaster Recovery Plan?
      • Backup etcd to S3
      • Monitor the snapshots
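
"Monitor the snapshots" can start as a simple freshness check: alert when the newest backup is older than some threshold, or when there is no backup at all. The sketch below is a minimal Python illustration under stated assumptions: the function names and the one-hour default are hypothetical, and the snapshot timestamps are expected to come from elsewhere (for example, a listing of the backup bucket, which is not shown).

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def newest_snapshot_age(snapshots: Iterable[datetime],
                        now: datetime) -> Optional[timedelta]:
    """Age of the most recent snapshot, or None if there are none."""
    newest = max(snapshots, default=None)
    return None if newest is None else now - newest


def snapshot_alert(snapshots: Iterable[datetime], now: datetime,
                   max_age: timedelta = timedelta(hours=1)) -> bool:
    """True if backups are missing or the newest one is too old."""
    age = newest_snapshot_age(snapshots, now)
    return age is None or age > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recent = [now - timedelta(hours=3), now - timedelta(minutes=20)]
    print("alert:", snapshot_alert(recent, now))
```

Treating "no snapshots found" as an alert matters here: a backup job that silently stopped writing looks the same as an empty listing.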
  20. INCIDENT #3: LATENCY SPIKES
  21. INCIDENT #3: STOP THE BLEEDING

      #!/bin/bash
      SLEEPTIME=60
      while true; do
          echo "sleep for $SLEEPTIME seconds"
          sleep $SLEEPTIME
          timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
          if [ $? -eq 0 ]; then
              echo "all fine, no need to restart etcd member"
              continue
          else
              echo "restarting etcd-member"
              systemctl restart etcd-member
          fi
      done

  22. INCIDENT #3: CONFIRMATION FROM AWS
      [...]
      We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ...
      [...]
      We don’t explicitly recommend against running production services on T2
      [...]
  23. INCIDENT #3: LESSONS LEARNED
      • It's never the AWS infrastructure until it is
      • Treat t2 instances with care
      • Kubernetes components are not necessarily "cloud native"
      Cloud Native? Declarative, dynamic, resilient, and scalable
  24. INCIDENT #4: IMPACT
  25. INCIDENT #4: CLUSTER DOWN?
  26. INCIDENT #4: THE TRIGGER
  27. https://www.outcome-eng.com/human-error-never-root-cause/
  28. CLUSTER UPDATES
  29. CLUSTER LIFECYCLE MANAGER
  30. INCIDENT #4: LESSONS LEARNED
      • Automated end-to-end tests are pretty good, but not enough
      • Test the diff/migration automatically
      • Bootstrap new cluster with the previous configuration
      • Apply new configuration
      • Run end-to-end & conformance tests
  31. TRAFFIC SWITCHING
      Default deployment: rolling update
  32. TRAFFIC SWITCHING
  33. MANUAL TRAFFIC SWITCHING

      $ zkubectl traffic <ingress-name>
      SERVICE              WEIGHT
      <service-backend-1>  30%
      <service-backend-2>  70%

      $ zkubectl traffic <ingress-name> <service-backend-2> 100
      SERVICE              WEIGHT
      <service-backend-1>  0%
      <service-backend-2>  100%
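
The weight table on slide 33 can be modelled with a short Python sketch. It is not the `zkubectl` implementation: `set_traffic` and `pick_backend` are hypothetical helpers, and rescaling the other backends proportionally is an assumption made for illustration. The sketch only shows what it means to move one backend to a given percentage and then route requests according to the resulting weights.

```python
import random
from typing import Dict


def set_traffic(weights: Dict[str, float], backend: str,
                percent: float) -> Dict[str, float]:
    """Give `backend` the requested share; rescale the others to fit."""
    if backend not in weights:
        raise KeyError(backend)
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    others = {n: w for n, w in weights.items() if n != backend}
    remaining = 100.0 - percent
    other_total = sum(others.values())
    result: Dict[str, float] = {}
    for name, weight in weights.items():
        if name == backend:
            result[name] = float(percent)
        elif other_total > 0:
            # keep the relative proportions of the other backends
            result[name] = weight * remaining / other_total
        else:
            # others had no traffic before: split the rest evenly
            result[name] = remaining / len(others)
    return result


def pick_backend(weights: Dict[str, float], rng: random.Random) -> str:
    """Choose a backend for one request, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    current = {"service-backend-1": 30, "service-backend-2": 70}
    switched = set_traffic(current, "service-backend-2", 100)
    print(switched)
    print(pick_backend(switched, random.Random()))
```

Starting from the 30/70 split on the slide, setting the second backend to 100 leaves the first with no traffic, which matches the second table.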
  34. DOCUMENTATION
      "Documentation is hard to find"
      "Documentation is not comprehensive enough"
      "Remove unnecessary complexity and obstacles."
      "Get the documentation up to date and prepare use cases"
      "More and more clear documentation"
      "More detailed docs, example repos with more complicated deployments."
  35. DOCUMENTATION
      • Restructure following https://www.divio.com/en/blog/documentation/
      • Concepts
      • How Tos
      • Tutorials
      • Reference
      • Global Search
      • Weekly Health Check: Support → Documentation
  36. ONBOARDING
      • Many new concepts to grasp vs. 200+ teams
      • Kubernetes Training (2h)
      • Documentation
      • Recorded Friday Demos
      • Support Channels (chat, mail)
  37. CLUSTER SCOPE
  38. DOES ANYTHING EVEN WORK?
      • Kubernetes API as the primary interface
      • Ingress Controller + External DNS
      • kube2iam
      • Cluster lifecycle management
      • Zalando IAM/OAuth integration via CRD
      • PostgreSQL Operator
  39. DEPLOYMENT
  40. DEPLOYMENT CONFIGURATION

      .
      ├── deploy/apply
      │   ├── deployment.yaml   # K8s Deployment
      │   ├── credentials.yaml  # K8s TPR
      │   ├── ingress.yaml      # K8s Ingress
      │   └── service.yaml      # K8s Service
      └── delivery.yaml         # pipeline config

  41. INGRESS.YAML

      apiVersion: extensions/v1beta1
      kind: Ingress
      metadata:
        name: "..."
      spec:
        rules:
          # DNS name your application should be exposed on
          - host: "myapp.foo.example.org"
            http:
              paths:
                - backend:
                    serviceName: "myapp"
                    servicePort: 80
  42. CONTINUOUS DELIVERY PLATFORM
  43. CDP: APPLY
  44. CDP: OPTIONAL APPROVAL
  45. POSTGRES OPERATOR
      Application to manage PostgreSQL clusters
      Observes “postgres” manifests (CRDs)
      Spawns and modifies new clusters
      Syncs and provisions roles
      Handles volume resize, incl. Resize2fs
      Also responsible for updating Docker images
  46. https://github.com/hjacobs/kube-ops-view
  47. LINKS
      Running Kubernetes in Production on AWS
      http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
      Kube AWS Ingress Controller
      https://github.com/zalando-incubator/kube-ingress-aws-controller
      External DNS
      https://github.com/kubernetes-incubator/external-dns
      PostgreSQL Operator
      https://github.com/zalando-incubator/postgres-operator
      Zalando Cluster Configuration
      https://github.com/zalando-incubator/kubernetes-on-aws
      List of Organizations using Kubernetes on AWS
      https://github.com/hjacobs/kubernetes-on-aws-users
  48. https://goo.gl/t2zNc8
  49. QUESTIONS?
      HENNING JACOBS
      HEAD OF DEVELOPER PRODUCTIVITY
      henning@zalando.de
      @try_except_
      Illustrations by @01k
