
Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - AWS Tech Community Days Cologne

Bootstrapping a Kubernetes cluster is easy; rolling it out to nearly 200 engineering teams and operating it at scale is a challenge.
In this talk, we present our approach to Kubernetes provisioning on AWS, operations, and developer experience for our growing Zalando Technology department. In the context of Kubernetes, we highlight AWS service integrations, our IAM/OAuth infrastructure, cluster autoscaling, continuous delivery, and general developer experience. The talk covers our most important lessons learned, and we openly share failure stories.

Presented on 2017-09-28 at AWS Tech Community Days in Cologne.

Published in: Technology

  1. Kubernetes on AWS @ZalandoTech. AWS TECH COMMUNITY DAYS, 2017-09-28. HENNING JACOBS, @try_except_
  2. ZALANDO: 15 markets; 6 fulfillment centers; 21 million active customers; 3.6 billion € net sales (2016); 200 million visits per month; 13,000 employees in Europe
  3. ZALANDO TECHNOLOGY: home-brewed, cutting-edge & scalable technology solutions; 1,800 employees from 77 nations; 6 tech locations + HQs in Berlin; help our brand to WIN ONLINE
  4. ZALANDO TECH’S INFRASTRUCTURE
  5. FOUR ERAS AT ZALANDO TECH
      • PHP: data center, PHP files
      • ZOMCAT (2010): data center, WAR
      • STUPS (2015): AWS, Docker, Cloud Formation; low level (AWS API)
      • KUBERNETES (2016): AWS, Docker, Kubernetes manifest; high abstraction level
  6. LARGE SCALE?
  7. KUBERNETES: ARCHITECTURE
  8. KUBERNETES ON AWS: CONTEXT. 200 engineering teams; 30 prod. clusters; AWS/STUPS; Dockerized apps; no manual operations; reliability; autoscaling; seamless migration
  9. ISOLATED AWS ACCOUNTS (diagram labels: Internet; *.abc.example.org, Product ABC; *.xyz.example.org, Product XYZ; LB, LB; EC2)
  10. KUBERNETES ON AWS
  11. DEPLOYMENT
  12. DEPLOYMENT CONFIGURATION

      .
      ├── deploy/apply
      │   ├── deployment.yaml    # K8s Deployment
      │   ├── credentials.yaml   # K8s TPR
      │   ├── ingress.yaml       # K8s Ingress
      │   └── service.yaml       # K8s Service
      └── delivery.yaml          # pipeline config
  13. INGRESS.YAML

      apiVersion: extensions/v1beta1
      kind: Ingress
      metadata:
        name: "..."
      spec:
        rules:
          # DNS name your application should be exposed on
          - host: "myapp.foo.example.org"
            http:
              paths:
                - backend:
                    serviceName: "myapp"
                    servicePort: 80
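
As a rough illustration, the Ingress manifest on this slide could also be assembled programmatically, for example to template the host per application. The sketch below is Python and is not something shown in the talk: `build_ingress` is a hypothetical helper, and any value passed for `metadata.name` is an assumption, since the slide elides it.

```python
import json


def build_ingress(name: str, host: str, service_name: str, service_port: int) -> dict:
    """Return an Ingress manifest shaped like the one on the slide.

    `name` fills metadata.name, which the slide leaves as "..."; the caller
    must supply it (hypothetical value in any example).
    """
    return {
        "apiVersion": "extensions/v1beta1",  # API version used on the slide
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [
                {
                    # DNS name the application should be exposed on
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "backend": {
                                    "serviceName": service_name,
                                    "servicePort": service_port,
                                }
                            }
                        ]
                    },
                }
            ]
        },
    }


def to_json(manifest: dict) -> str:
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

Calling `to_json(build_ingress("myapp", "myapp.foo.example.org", "myapp", 80))` yields a document with the same structure as the YAML on the slide, with `"myapp"` standing in for the elided name.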
  14. CONTINUOUS DELIVERY PLATFORM
  15. CDP: APPLY
  16. CDP: OPTIONAL APPROVAL
  17. AWS INTEGRATION
  18. CLOUD FORMATION VIA CI/CD

      .
      ├── deploy/apply
      │   ├── deployment.yaml    # K8s Deployment
      │   ├── cf-iam-role.yaml   # AWS IAM Role
      │   ├── cf-rds.yaml        # AWS RDS Database
      │   ├── kube-ingress.yaml  # K8s Ingress
      │   ├── kube-secret.yaml   # K8s Secret
      │   └── kube-service.yaml  # K8s Service
      └── delivery.yaml          # CI/CD config
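
The layout on this slide mixes Cloud Formation templates and Kubernetes manifests in one `deploy/apply` directory. The slide does not say how the pipeline tells them apart, so the following is only a sketch under an assumption: that the `cf-` file name prefix marks a Cloud Formation template and everything else is a Kubernetes manifest. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def classify(path: str) -> str:
    """Classify a file under deploy/apply by name.

    Assumption (not confirmed by the slide): files prefixed with "cf-" are
    Cloud Formation templates; all other files are Kubernetes manifests.
    """
    name = PurePosixPath(path).name
    return "cloudformation" if name.startswith("cf-") else "kubernetes"


def group_by_target(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by the target they would be applied to."""
    groups: dict[str, list[str]] = {"cloudformation": [], "kubernetes": []}
    for path in sorted(paths):
        groups[classify(path)].append(path)
    return groups
```

With the file names from the slide, `cf-iam-role.yaml` and `cf-rds.yaml` land in the `cloudformation` group and the remaining four files in `kubernetes`.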
  19. ASSIGNING AWS IAM ROLE TO POD

      kind: Deployment
      spec:
        template:
          metadata:
            annotations:
              # annotation for kube2iam
              iam.amazonaws.com/role: "app-myapp-role"
          spec:
            containers:
              - name: ...
                ...

      https://github.com/jtblin/kube2iam
      ⇒ AWS SDKs just work as expected
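
To make the shape of the kube2iam annotation concrete, here is a small Python sketch that adds it to the pod template of a Deployment manifest held as a dictionary. `with_iam_role` is a hypothetical helper written for this illustration; only the annotation key and the example role name come from the slide.

```python
import copy

# Annotation key shown on the slide; kube2iam reads it from the pod metadata.
KUBE2IAM_ANNOTATION = "iam.amazonaws.com/role"


def with_iam_role(deployment: dict, role: str) -> dict:
    """Return a copy of `deployment` whose pod template carries the role annotation."""
    result = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    template = result.setdefault("spec", {}).setdefault("template", {})
    annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[KUBE2IAM_ANNOTATION] = role
    return result
```

Note that the annotation belongs on `spec.template.metadata` (the pod), not on the Deployment's own `metadata`, which matches the nesting on the slide.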
  20. OAUTH / IAM INTEGRATION
  21. SERVICE TO SERVICE AUTHNZ (Kubernetes Cluster)

      https://resource-server.example.org/protected

      HTTP/1.1 401 Unauthorized
      { "message": "Authorization required" }
  22. CREDENTIAL PROVIDER
  23. USING THE OAUTH CREDENTIALS

      #!/bin/bash
      secret=$(cat /creds/mytok-token-secret)
      curl -H "Authorization: Bearer $secret" https://resource-server.example.org/protected
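
The same pattern as the shell snippet on this slide, reading the token from the mounted file and sending it as a Bearer header, can be sketched in Python using only the standard library. The function names are hypothetical; the file path and URL are the ones from the slide. Reading the file on every call rather than caching the token is a design choice made here on the assumption that the mounted credentials may be refreshed over time.

```python
import urllib.request
from pathlib import Path


def read_token(path: str) -> str:
    """Read the token from the mounted file, dropping surrounding whitespace."""
    return Path(path).read_text().strip()


def build_request(url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the token as an OAuth Bearer header."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_protected(url: str, token_path: str) -> int:
    """Call the protected resource and return the HTTP status code."""
    request = build_request(url, read_token(token_path))
    with urllib.request.urlopen(request) as response:
        return response.status
```

Inside a pod, `fetch_protected("https://resource-server.example.org/protected", "/creds/mytok-token-secret")` would correspond to the `curl` call on the slide.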
  24. CHALLENGES
  25. CHALLENGES: 1. Getting Started 2. Stability 3. Onboarding 4. User Experience
  26. CHALLENGE 1: GETTING STARTED
  27. GETTING STARTED: https://github.com/hjacobs/kubernetes-on-aws-users
  28. GETTING STARTED: https://github.com/hjacobs/kubernetes-on-aws-users
  29. CLUSTER PROVISIONING
  30. CLUSTER PROVISIONING
      • Two Cloud Formation stacks
      • Master & worker ASGs + etcd
      • Nodes w/ Container Linux
      • K8s manifests applied separately
      • kube-system Deployments
      • DaemonSets
  31. GETTING STARTED. Goal: use Kubernetes API as primary interface for AWS
      • Mate, External DNS
      • Kubernetes Ingress Controller for AWS
      • kube2iam
      ⇒ we wrote new components to achieve our goal
  32. INGRESS CONTROLLER: https://github.com/zalando-incubator/kube-ingress-aws-controller / https://github.com/kubernetes-incubator/external-dns
  33. GETTING STARTED. Other questions we asked ourselves:
      • Single AZ vs. Multi AZ?
      • Federation?
      • Overlay network?
      • Authnz?
  34. GETTING STARTED. Other questions we asked ourselves:
      • Single AZ vs. Multi AZ? ⇒ Multi AZ
      • Federation? ⇒ No, not ready yet
      • Overlay network? ⇒ Flannel, “rock solid”
      • Authnz? ⇒ OAuth, webhook
  35. CHALLENGE 2: STABILITY
  36. CLUSTER UPDATES
  37. STABILITY: AWS RATE LIMITS
      • Ran into the same trap twice (Mate & Ingress Ctrl)
      • Kubernetes core causes many calls (e.g. EBS)
      • Monitoring (ZMON) needs to poll AWS
      ⇒ One of our biggest pain points with AWS (and all workarounds are hard and/or ugly)
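
The slide does not describe which workarounds were used. One generic way clients commonly soften API rate limits is to retry throttled calls with capped exponential backoff and random jitter; the Python sketch below shows that technique only as an illustration, not as the approach taken in the talk. `ThrottledError` and `call_with_backoff` are hypothetical names, and a real client would map its own throttling error onto the exception.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical marker for a call rejected because of a rate limit."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Run `call`, retrying on ThrottledError with capped exponential backoff.

    Each wait is drawn uniformly from [0, min(max_delay, base_delay * 2**attempt)]
    ("full jitter"), so that many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Backoff reduces the pressure a single component puts on the API, but it does not raise the account-wide limit, so several components polling the same account can still starve each other.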
  38. STABILITY: LIMIT RANGE

      kubectl describe limitrange
      Name:       limits
      Namespace:  default
      Type       Resource  Min  Max   Default Req  Default Limit  Max Limit/Request Ratio
      ----       --------  ---  ---   -----------  -------------  -----------------------
      Container  memory    -    64Gi  100Mi        1Gi            -
      Container  cpu       -    16    100m         3              -

      http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
      ⇒ Mitigate errors on OSI layer 8 ;-)
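
To show what the LimitRange values on this slide mean for a container that omits its resource settings, here is a simplified Python model. It is a sketch only: the real admission logic in Kubernetes has more rules (for example minimums and ratios, which are unset here), and `parse_quantity` handles just the few unit suffixes that appear in this table. All function names are hypothetical.

```python
# Values taken from the `kubectl describe limitrange` output on the slide.
LIMIT_RANGE = {
    "memory": {"max": "64Gi", "default_request": "100Mi", "default_limit": "1Gi"},
    "cpu": {"max": "16", "default_request": "100m", "default_limit": "3"},
}

_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(value: str) -> float:
    """Parse the small subset of Kubernetes quantities used in this example."""
    if value.endswith("m"):  # milli-units, e.g. "100m" CPU
        return float(value[:-1]) / 1000
    for suffix, factor in _BINARY_SUFFIXES.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)


def apply_limit_range(resources: dict) -> dict:
    """Fill in missing requests/limits and reject values above the maximum."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))
    for name, rule in LIMIT_RANGE.items():
        if name not in requests:
            # An explicit limit without a request is used as the request;
            # otherwise the default request applies.
            requests[name] = limits.get(name, rule["default_request"])
        limits.setdefault(name, rule["default_limit"])
        if parse_quantity(limits[name]) > parse_quantity(rule["max"]):
            raise ValueError(f"{name} limit {limits[name]} exceeds max {rule['max']}")
        if parse_quantity(requests[name]) > parse_quantity(limits[name]):
            raise ValueError(f"{name} request {requests[name]} exceeds limit {limits[name]}")
    return {"requests": requests, "limits": limits}
```

A container that specifies nothing ends up with requests of 100Mi memory and 100m CPU and limits of 1Gi memory and 3 CPUs, which is the safety net the slide refers to.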
  39. Recommended: The 5 Whys, https://en.wikipedia.org/wiki/5_Whys
  40. ON CALL
  41. CHALLENGE 3: ONBOARDING
  42. ONBOARDING
      • Many new concepts to grasp vs. 200 teams
      • Kubernetes Training (2h)
      • Documentation
      • Recorded Friday Demos
      • Support Channels (chat, mail)
  43. CHALLENGE 4: USER EXPERIENCE
  44. USER EXPERIENCE
      • Continuous Delivery Platform (delivery.yaml)
      • Juggling with K8s and CF YAMLs
      • Inconsistent state, troubleshooting
  45. KUBERNETES VS. AWS ECS
  46. WHY NOT ECS?

      ECS                        ⟺  Kubernetes
      AWS API                    ⟺  Declarative API (fast & no rate limits)
      Tasks, Services            ⟺  High level abstractions (Ingress, CronJob)
      Static AWS API             ⟺  Extensible API (e.g. TPR)
      Blox                       ⟺  Batteries included (DaemonSet, StatefulSet)
      Operating worker nodes     ⟺  Operating etcd, master & worker nodes
      Vendor community/support   ⟺  Huge community
      AWS only                   ⟺  Run anywhere

      disclaimer: incomplete and opinionated ;-)
  47. https://github.com/hjacobs/kube-ops-view
  48. LINKS
      • Running Kubernetes in Production on AWS: http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
      • Kube AWS Ingress Controller: https://github.com/zalando-incubator/kube-ingress-aws-controller
      • External DNS: https://github.com/kubernetes-incubator/external-dns
      • PostgreSQL Operator: https://github.com/zalando-incubator/postgres-operator
      • Zalando Cluster Configuration: https://github.com/zalando-incubator/kubernetes-on-aws
      • List of Organizations using Kubernetes on AWS: https://github.com/hjacobs/kubernetes-on-aws-users
  49. QUESTIONS? HENNING JACOBS, DEDICATED OWNER DEVELOPER PRODUCTIVITY; henning@zalando.de; @try_except_. Illustrations by @01k