
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise Cloud Native Summit


Kubernetes has established itself as the de facto standard for cloud native platforms. But why? What advantages and pitfalls arise in practice? Using Zalando as an example, Henning Jacobs shows how Kubernetes serves as the infrastructure for 1200+ developers, which aspects make Kubernetes unique despite its complexity, and what this means for the developer experience.

Published in: Technology

  1. 1. WHY KUBERNETES? ENTERPRISE CLOUD NATIVE SUMMIT 2019-10-08 HENNING JACOBS @try_except_
  2. 2. 2 ROLLING OUT KUBERNETES? "We are rolling out Kubernetes to production next month and I'm interested to hear from people who made that step already."
  3. 3. 3 DON'T USE IT !!!!!
  4. 4. 4 DON'T USE IT !!!!!
  5. 5. 5
  6. 6. 6 KUBERNETES FAILURE STORIES
  7. 7. 7 ZALANDO AT A GLANCE: ~ 5.4 billion EUR revenue 2018; > 300 million visits per month; ~ 14,000 employees in Europe; > 80% of visits via mobile devices; > 28 million active customers; > 400,000 product choices; > 2,000 brands; 17 countries. As of June 2019.
  8. 8. 8 A BRIEF HISTORY OF ZALANDO TECH
  9. 9. 9 2010 "Sysop-Test" "QA-Test"
  10. 10. 10 2013: SELF SERVICE
  11. 11. 11 2015: RADICAL AGILITY AWS STUPS DOCKER DEPLOY SSH ACCESS AUDIT REPORTS FULL AWS ACCESS Teams have admin access & full responsibility
  12. 12. 12 2015: ISOLATED AWS ACCOUNTS (diagram labels: Internet; *.abc.example.org, Team ABC; *.xyz.example.org, Team XYZ; ELB; EC2)
  13. 13. 13 2019: SCALE 140 Clusters, 396 Accounts
  14. 14. 14 2019: DEVELOPERS USING KUBERNETES
  15. 15. 15 Platform > 1100 developers > 200 development teams
  16. 16. 16 YOU BUILD IT, YOU RUN IT The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. - A Conversation with Werner Vogels, ACM Queue, 2006
  17. 17. 17 ON-CALL: YOU OWN IT, YOU RUN IT When things are broken, we want people with the best context trying to fix things. - Blake Scrivener, Netflix SRE Manager
  18. 18. 18 DEVELOPER JOURNEY Consistent story that models all aspects of SW dev
  19. 19. 19 Developer Journey
  20. 20. 20 Developer Journey Correctness Compliance GDPR Security Cost Efficiency 24x7 On Call Governance Resilience Capacity ...
  21. 21. 21 DEVELOPER PRODUCTIVITY Code Build Test Deploy Operate Setup Cloud Native Application Runtime
  22. 22. 23 PLAN & SETUP
  23. 23. 24 Plan Stories Rules of Play Tech Radar
  24. 24. 26 Setup Application Bootstrapping
  25. 25. 29 BUILD & TEST
  26. 26. 30 CONTINUOUS DELIVERY PLATFORM: BUILD (diagram labels: Git code push; CDP)
  27. 27. 32 DEPLOY
  28. 28. 33 Deploy Kubernetes
  29. 29. 34 DEPLOYMENT CONFIGURATION
      ├── deploy/apply
      │   ├── deployment.yaml
      │   ├── credentials.yaml  # Zalando IAM
      │   ├── ingress.yaml
      │   └── service.yaml
      └── delivery.yaml         # Zalando CI/CD
  30. 30. 35 INGRESS.YAML
      kind: Ingress
      metadata:
        name: "..."
      spec:
        rules:
          # DNS name your application should be exposed on
          - host: "myapp.foo.example.org"
            http:
              paths:
                - backend:
                    serviceName: "myapp"
                    servicePort: 80
  31. 31. 36 TEMPLATING: MUSTACHE
      kind: Ingress
      metadata:
        name: "..."
      spec:
        rules:
          # DNS name your application should be exposed on
          - host: "{{{APPLICATION}}}.example.org"
            http:
              paths:
                - backend:
                    serviceName: "{{{APPLICATION}}}"
                    servicePort: 80
  32. 32. 37 CONTINUOUS DELIVERY PLATFORM
  33. 33. 38 CDP: DEPLOY "glorified kubectl apply"
  34. 34. 39 CDP: OPTIONAL APPROVAL
  35. 35. 40 STACKSET: TRAFFIC SWITCHING github.com/zalando-incubator/stackset-controller
  36. 36. 41 TRAFFIC SWITCHING STEPS IN CDP github.com/zalando-incubator/stackset-controller
  37. 37. 42 Deploy You build it, you run it!
  38. 38. 43 EMERGENCY ACCESS SERVICE
      Emergency access by referencing Incident:
          zkubectl cluster-access request --emergency -i INC REASON
      Privileged production access via 4-eyes:
          zkubectl cluster-access request REASON
          zkubectl cluster-access approve USERNAME
  39. 39. 44 KUBERNETES WEB VIEW kubectl get pods,stacks,deploys,..
  40. 40. 45 SEARCHING ACROSS 140+ CLUSTERS codeberg.org/hjacobs/kube-web-view
  41. 41. codeberg.org/hjacobs/kube-web-view
  42. 42. 47 INTEGRATIONS
  43. 43. 48 CLOUD FORMATION VIA CI/CD
      ├── deploy/apply
      │   ├── deployment.yaml    # Kubernetes
      │   ├── cf-iam-role.yaml   # AWS IAM Role
      │   ├── cf-rds.yaml        # AWS RDS Database
      │   ├── kube-ingress.yaml
      │   ├── kube-secret.yaml
      │   └── kube-service.yaml
      └── delivery.yaml          # CI/CD config
      "Infrastructure as Code"
  44. 44. 49 POSTGRES OPERATOR Application to manage PostgreSQL clusters on Kubernetes >500 clusters running on Kubernetes github.com/zalando/postgres-operator
  45. 45. Elasticsearch in Kubernetes Elasticsearch 2,500 vCPUs 1 TB RAM github.com/zalando-incubator/es-operator/
  46. 46. 51 SUMMARY • Application Bootstrapping • Git as source of truth and UI • 4-eyes principle for master/production • Extensible Kubernetes API as primary interface • OAuth/IAM credentials • PostgreSQL, Elasticsearch • CloudFormation for proprietary AWS services
  47. 47. 52 MONITORING & COST EFFICIENCY
  48. 48. 53 OPENTRACING
  49. 49. 54 KUBERNETES RESOURCE REPORT github.com/hjacobs/kube-resource-report
  50. 50. 55 RESOURCE REPORT: TEAMS Sorting teams by Slack Costs github.com/hjacobs/kube-resource-report
  51. 51. 56 KUBERNETES APPLICATION DASHBOARD
  52. 52. https://github.com/hjacobs/kube-ops-view
  53. 53. 58 VERTICAL POD AUTOSCALER limit/requests adapted by VPA
  54. 54. 59 DOWNSCALING DURING OFF-HOURS github.com/hjacobs/kube-downscaler Weekend
  55. 55. 60 KUBERNETES JANITOR
      ● TTL and expiry date annotations, e.g.
        ○ set time-to-live for your test deployment
      ● Custom rules, e.g.
        ○ delete everything without "app" label after 7 days
      github.com/hjacobs/kube-janitor
  56. 56. 61 EC2 SPOT NODES 72% savings
  57. 57. 62 STABILITY ↔ EFFICIENCY Slack Autoscaling Buffer Disable Overcommit Cluster Overhead Resource Report HPA VPA Downscaler Janitor EC2 Spot
  58. 58. 63 DELIVERY PERFORMANCE METRICS • Lead Time • Release Frequency • Time to Restore Service • Change Fail Rate srcco.de/posts/accelerate-software-delivery-performance.html
  59. 59. 64 CONTAINERS From "Accelerate: The Science of Lean Software and DevOps"
  60. 60. 65 DELIVERY PERFORMANCE METRICS • Lead Time ≙ Commit to Prod • Release Frequency ≙ Deploys/week/dev • Time to Restore Service ≙ MTRS from incidents • Change Fail Rate ≙ n/a
  61. 61. “.. means establishing empathy with internal consumers (read: developers) and collaborating with them on the design. Platform product managers establish roadmaps and ensure the platform delivers value to the business and enhances the developer experience.” - ThoughtWorks Technology Radar
  62. 62. 68 DEVELOPER SATISFACTION
  63. 63. 69 DOCUMENTATION "Documentation is hard to find" "Documentation is not comprehensive enough" "Remove unnecessary complexity and obstacles." "Get the documentation up to date and prepare use cases" "More and more clear documentation" "More detailed docs, example repos with more complicated deployments."
  64. 64. 71 TESTIMONIALS “So, thank you, Team Automata, for listening to our community, taking our upvotes in consideration when developing new solutions and building every day 'the first CI that doesn't suck'.” - a user, October 2018
  65. 65. 72 WHY KUBERNETES?
  66. 66. 73 WHY KUBERNETES? • provides enough abstractions (StatefulSet, CronJob, ..) • provides consistency (API spec/status) • is extensible (annotations, CRDs, API aggreg.) • certain compatibility guarantee (versioning) • widely adopted (all cloud providers) • works across environments and implementations srcco.de/posts/why-kubernetes.html
  67. 67. 74 WHY KUBERNETES? • Efficiency • Common Operational Model • Developer Experience • Cloud Provider Independent • Compliance and Security • Talent (for Zalando)
  68. 68. 75 WHY KUBERNETES? • Efficiency • Common Operational Model • Developer Experience • Cloud Provider Independent • Compliance and Security • Talent (for Zalando)
  69. 69. 76 WHY KUBERNETES? • Efficiency • Common Operational Model • Developer Experience • Cloud Provider Independent • Compliance and Security • Talent (for Zalando)
  70. 70. 77 WHY KUBERNETES? • Efficiency • Common Operational Model • Developer Experience • Cloud Provider Independent • Compliance and Security • Talent (for Zalando)
  71. 71. 78 WHY KUBERNETES? • Efficiency • Common Operational Model • Developer Experience • Cloud Provider Independent • Compliance and Security • Talent (for Zalando)
  72. 72. 79 WHY KUBERNETES? • Efficiency • Common Operational Model • Developer Experience • Cloud Provider Independent • Compliance and Security • Talent (for Zalando)
  73. 73. 80 KUBERNETES FAILURE STORIES • Learning about production pitfalls! • Availability bias? https://k8s.af
  74. 74. 81 FACTFULNESS Things can be both better and bad! What would failure stories for your non-K8s infra look like? https://k8s.af
  75. 75. 82 COMPLEXITY FOR GOOGLE-SCALE INFRA? • Managed DO cluster: 4 minutes • K3s single node: 2 minutes demo.j-serv.de
  76. 76. 83
  77. 77. 84 MAYBE THAT'S GOOD?
  78. 78. 85 OPEN SOURCE & MORE
      Kubernetes Web View: codeberg.org/hjacobs/kube-web-view
      Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
      Kubernetes Janitor: github.com/hjacobs/kube-janitor
      Postgres Operator: github.com/zalando-incubator/postgres-operator
      More Zalando Tech Talks: github.com/zalando/public-presentations
  79. 79. QUESTIONS? HENNING JACOBS SENIOR PRINCIPAL henning@zalando.de @try_except_ Illustrations by @01k
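
Sketch for the "TEMPLATING: MUSTACHE" slide. The slide shows an Ingress manifest with {{{APPLICATION}}} placeholders but not how they are filled in. The following is a minimal illustration of that substitution step, not Zalando's CDP implementation and not a full Mustache engine: it handles only triple-brace placeholders. The function name `render` and the constant `INGRESS_TEMPLATE` are hypothetical.

```python
import re

# Matches triple-brace placeholders such as {{{APPLICATION}}}.
PLACEHOLDER = re.compile(r"\{\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}\}")


def render(template: str, variables: dict) -> str:
    """Replace every {{{NAME}}} with variables[NAME]; fail on unknown names."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)


# Manifest copied from the slide; the elided name ("...") is kept as-is.
INGRESS_TEMPLATE = """\
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "{{{APPLICATION}}}.example.org"
      http:
        paths:
          - backend:
              serviceName: "{{{APPLICATION}}}"
              servicePort: 80
"""

if __name__ == "__main__":
    print(render(INGRESS_TEMPLATE, {"APPLICATION": "myapp"}))
```

Raising on an unknown placeholder, rather than silently leaving it blank, is a design choice of this sketch: a manifest with an empty host name would otherwise reach `kubectl apply` unnoticed.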
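
Sketch for the "KUBERNETES JANITOR" slide. The slide names two clean-up mechanisms: a time-to-live set on a resource, and a custom rule that deletes everything without an "app" label after 7 days. The following is a rough model of that decision logic under stated assumptions. The TTL syntax (an integer plus one unit letter) and the function names are hypothetical and chosen for this example; see github.com/hjacobs/kube-janitor for the real annotation and rule formats.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical TTL syntax for this sketch: an integer followed by one unit.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
TTL_PATTERN = re.compile(r"^(\d+)([smhdw])$")


def parse_ttl(ttl: str) -> timedelta:
    """Turn a string such as '7d' or '30m' into a timedelta."""
    match = TTL_PATTERN.match(ttl)
    if not match:
        raise ValueError(f"unsupported TTL value: {ttl!r}")
    value, unit = match.groups()
    return timedelta(seconds=int(value) * UNIT_SECONDS[unit])


def is_expired(created_at: datetime, ttl: str, now: datetime) -> bool:
    """True once the resource is older than its TTL."""
    return now - created_at > parse_ttl(ttl)


def should_delete(labels: dict, created_at: datetime, now: datetime, ttl=None) -> bool:
    """Apply an explicit TTL if present, else the 'no app label after 7 days' rule."""
    if ttl is not None:
        return is_expired(created_at, ttl, now)
    if "app" not in labels:
        return is_expired(created_at, "7d", now)
    return False


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    created = now - timedelta(days=8)
    print(should_delete({}, created, now))              # unlabeled, 8 days old
    print(should_delete({"app": "shop"}, created, now))  # labeled, no TTL
```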
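
Sketch for the "RESOURCE REPORT: TEAMS" slide, which sorts teams by slack costs. Assuming slack means the gap between what a team's workloads request and what they actually use, the ranking could be computed roughly as below. The prices, field names, and team names are made up for illustration and do not come from the talk or from kube-resource-report.

```python
from dataclasses import dataclass

# Hypothetical unit prices, for illustration only.
CPU_PRICE_PER_CORE_MONTH = 20.0
MEMORY_PRICE_PER_GIB_MONTH = 5.0


@dataclass
class TeamUsage:
    team: str
    cpu_requested: float         # vCPU cores requested by the team's pods
    cpu_used: float              # vCPU cores actually used
    memory_requested_gib: float  # GiB of memory requested
    memory_used_gib: float       # GiB of memory actually used


def slack_cost(usage: TeamUsage) -> float:
    """Monthly cost of requested-but-unused resources (never negative)."""
    cpu_slack = max(usage.cpu_requested - usage.cpu_used, 0.0)
    memory_slack = max(usage.memory_requested_gib - usage.memory_used_gib, 0.0)
    return (cpu_slack * CPU_PRICE_PER_CORE_MONTH
            + memory_slack * MEMORY_PRICE_PER_GIB_MONTH)


def teams_by_slack_cost(usages):
    """Teams with the most wasted spend first."""
    return sorted(usages, key=slack_cost, reverse=True)


if __name__ == "__main__":
    report = [
        TeamUsage("team-b", 4, 3, 8, 8),
        TeamUsage("team-a", 10, 2, 16, 8),
    ]
    for usage in teams_by_slack_cost(report):
        print(usage.team, slack_cost(usage))
```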
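
Sketch for the "DELIVERY PERFORMANCE METRICS" slides, which map lead time to commit-to-prod, release frequency to deploys per week per developer, and time to restore service to MTRS from incidents. The following shows one way those three numbers could be derived from event records. The record types, the choice of a median for lead time, and the sample figures are assumptions of this example, not Zalando's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production


@dataclass
class Incident:
    started_at: datetime
    restored_at: datetime


def median_lead_time(deployments) -> timedelta:
    """Lead time: median duration from commit to production."""
    seconds = [(d.deployed_at - d.committed_at).total_seconds() for d in deployments]
    return timedelta(seconds=median(seconds))


def deploys_per_week_per_dev(deployments, weeks: float, developers: int) -> float:
    """Release frequency normalized by period length and team size."""
    return len(deployments) / weeks / developers


def mean_time_to_restore(incidents) -> timedelta:
    """MTRS: average duration from incident start to restored service."""
    total = sum((i.restored_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    t0 = datetime(2019, 10, 1, 9, 0)
    deployments = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0, t0 + timedelta(hours=4)),
    ]
    incidents = [
        Incident(t0, t0 + timedelta(minutes=30)),
        Incident(t0, t0 + timedelta(minutes=90)),
    ]
    print(median_lead_time(deployments))
    print(deploys_per_week_per_dev(deployments, weeks=1, developers=4))
    print(mean_time_to_restore(incidents))
```

Change fail rate is left out because the slide itself marks it as n/a.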
