Fallacies of distributed computing with Kubernetes on AWS

This is the short story of a bug in one of our Go services that cut off some of the traffic targeting one of our production Kubernetes clusters running on AWS. More than that, it is about how we had made conceptually similar mistakes before, and why thinking about failures and the famous "fallacies of distributed computing" is key to developing infrastructure components.

In this talk, we will walk you through some of those problems, illustrate some interesting details of Kubernetes and AWS, and hopefully help you avoid making the same mistakes.

Published in: Technology

  1. FALLACIES OF DISTRIBUTED COMPUTING WITH KUBERNETES ON AWS. Raffaele Di Fazio, 05.10.2017
  2. WHOAMI
  3. ZALANDO
      • 15 markets
      • 6 fulfillment centers
      • 21 million active customers
      • 3.6 billion € net sales 2016
      • 200 million visits per month
      • 13,000 employees in Europe
  4. ZALANDO TECHNOLOGY
      • HOME-BREWED, CUTTING-EDGE & SCALABLE technology solutions
      • >1,800 employees from 77 nations
      • 6 tech locations + HQs in Berlin
  5. HISTORIA MAGISTRA VITAE (Photo by Jace Grandinetti on Unsplash)
  6. KUBERNETES ARCHITECTURE (diagram labels: kubectl, Master Node, Node, API Server, Scheduler, Controller Manager, etcd, Skipper, Kube2IAM, Kubelet, Logging agent, …)
  7. WHAT HAPPENS WHEN THE API SERVER IS NOT RUNNING? (Photo by Thomas Kvistholt on Unsplash)
  8. “A FEW MORE” 404s THAN USUAL...
  9. Thanks to Ashley McNamara for the picture
  10. WHY? (Photo by Ricardo Gomez Angel on Unsplash)
  11. “A FEW” MORE PODS…
  12. KILLING KUBERNETES’ API SERVER: Lots of pods → Too much memory → API server OOMKilled
  13. SYSTEM VIEW - TRAFFIC FLOW (diagram labels: ALB, Node, Skipper, Service, MyApp, K8S network, EC2 network, TLS, HTTP) https://github.com/zalando/skipper
  14. WHAT REALLY HAPPENED
  15. WHAT REALLY HAPPENED
      • All routes removed:
      • No routes to the applications deployed inside the cluster
      • Healthcheck “unhealthy” because of no connection to API server
      • => All nodes were unhealthy in the ELBv2
  16. WHAT HAPPENS WHEN ALL THE TARGETS IN AN ELBv2 ARE UNHEALTHY? (Photo by Sandro Katalina on Unsplash)
  17. WHAT ABOUT THE ELBv2?
  18. WHAT ABOUT THE ELBv2? If no Availability Zone contains a healthy target, the load balancer nodes route requests to all targets.
  19. KUBERNETES API SERVER AVAILABILITY AND CONTROL LOOPS (Photo by chuttersnap on Unsplash)
  20. MISTAKES WERE MADE
  21. (no text)
  22. WE ARE NOT ALONE
      • … a test that simulated the failure of a single apiserver node disrupted the cluster in a way that negatively impacted the availability of running workloads
      • ... helped us identify that the disruption was likely related to an interaction between the various clients that connect to the Kubernetes apiserver (like calico-agent, kubelet, kube-proxy, and kube-controller-manager) and our internal load balancer’s behavior during an apiserver node failure.
      • Source: Kubernetes at GitHub
  23. HOW WE FIXED IT
  24. HOW WE FIXED IT
      • Do not change the healthcheck in case of API server failures
      • Do not drop the routes in case of API server failures
      • => Delete when you are really sure you want to delete!
  25. FALLACIES OF DISTRIBUTED COMPUTING (Photo by chuttersnap on Unsplash)
  26. 8 FALLACIES OF DISTRIBUTED COMPUTING
      • The network is reliable.
      • Latency is zero.
      • Bandwidth is infinite.
      • The network is secure.
      • Topology doesn't change.
      • There is one administrator.
      • Transport cost is zero.
      • The network is homogeneous.
  27. (no text)
  28. THE FALLACIES OF CLOUD COMPUTING
      • The API call you will make will succeed.
      • The next API call you will make will succeed.
      • Deleting resources is the same as adding new.
      • Your cloud provider will have no outages.
      • The dependencies between your services are clear.
  29. MAKING YOUR SYSTEM RESILIENT (Photo by Aaron Barnaby on Unsplash)
  30. WHEN MAKING API CALLS
      • Every API call can fail
      • Retry (with backoff)
      • Circuit breakers
      • Fallbacks
      • Don’t scale down / delete resources fast!
      • Deal with rate limiting
      • Deal with “weird” values due to a broken cloud provider feature
  31. TEST ALL THE THINGS
      • Continuous integration tests
      • Continuous deployment of cluster updates
      • Load tests
      • Chaos tests
  32. CONTINUOUS INTEGRATION TESTS
      • Test the interactions between components
      • For every configuration change we run extensive e2e tests
  33. CONTINUOUS INTEGRATION TESTS
  34. CONTINUOUS DEPLOYMENT OF CLUSTER UPDATES
  35. LOAD TESTING
      • Lots of requests to the API server
      • Lots of pods running
      • Writes/reads to the data store (etcd)
      • => what matters: observe the impact on running applications
  36. CHAOS TESTING
      • Random shutdown of Kubernetes components
        • https://github.com/linki/chaoskube
        • http://chaostoolkit.org/
        • https://github.com/asobti/kube-monkey
        • http://principlesofchaos.org
      • Random shutdown of nodes (EC2 Instances)
        • https://github.com/Netflix/chaosmonkey
  37. MORE ON CHAOS TESTING
      • Netflix’s principles of Chaos Engineering
      • http://principlesofchaos.org
      • Chaos Engineering free ebook ->
  38. THAT’S NOT ALL
      • You think Kubernetes the hard way is hard
      • The hard part was never only the setup
      • Sometimes you will have to break things to learn
      • … set up a healthy post-mortem culture and learn from mistakes!
  39. THAT’S IT (Photo by Dhruva Reddy on Unsplash)
  40. QUESTIONS? Raffaele Di Fazio @x0rg
