05.10.2017 AWS User Group Meetup - FALLACIES OF DISTRIBUTED COMPUTING WITH KUBERNETES ON AWS

FALLACIES OF
DISTRIBUTED
COMPUTING WITH
KUBERNETES ON
AWS
Raffaele Di Fazio
05.10.2017

ZALANDO
15 markets
6 fulfillment centers
21 million active customers
3.6 billion € net sales 2016
200 million visits per month
13,000 employees in Europe

ZALANDO TECHNOLOGY
HOME-BREWED,
CUTTING-EDGE
& SCALABLE
technology solutions
>1,800 employees from 77 nations
6 tech locations + HQs in Berlin

HISTORIA MAGISTRA
VITAE
(History is the teacher of life)
Photo by Jace Grandinetti on Unsplash

KUBERNETES ARCHITECTURE
[Diagram: kubectl, Master, Nodes; components shown: API Server, Scheduler, Controller Manager, etcd, Kubelet, Skipper, Kube2IAM, Logging agent, ….]

WHAT HAPPENS WHEN
THE API SERVER IS
NOT RUNNING?
Photo by Thomas Kvistholt on Unsplash

“A FEW MORE” 404s THAN USUAL...

Thanks to Ashley McNamara for the picture

WHY?
Photo by Ricardo Gomez Angel on Unsplash

KILLING KUBERNETES’ API SERVER
Lots of pods => Too much memory => API server OOMKilled

SYSTEM VIEW - TRAFFIC FLOW
[Diagram: ALB, Skipper on each Node, Services, MyApp pods; labels: EC2 network, K8S network, TLS, HTTP]
https://github.com/zalando/skipper

WHAT REALLY HAPPENED
• All routes were removed: nothing routed to the applications deployed inside the cluster
• The healthcheck reported “unhealthy” because there was no connection to the API server
• => All nodes were unhealthy in the ELBv2

WHAT HAPPENS WHEN
ALL THE TARGETS IN
AN ELBv2 ARE
UNHEALTHY?
Photo by Sandro Katalina on Unsplash

WHAT ABOUT THE ELBv2?
If no Availability Zone contains a healthy target, the load
balancer nodes route requests to all targets.
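
The fail-open rule quoted above can be pictured with a toy model. This is only a sketch: `Target` and `routable_targets` are hypothetical names, and the model ignores Availability Zones and looks at a single flat list of targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    healthy: bool


def routable_targets(targets: list[Target]) -> list[Target]:
    """Return the targets a fail-open load balancer would send traffic to."""
    healthy = [t for t in targets if t.healthy]
    # Fail open: with no healthy target anywhere, route to every target
    # instead of dropping all traffic.
    return healthy if healthy else list(targets)
```

With every node marked unhealthy, the function returns the full list, which matches the behaviour described in the quote.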

KUBERNETES API
SERVER AVAILABILITY
AND CONTROL LOOPS
Photo by chuttersnap on Unsplash

WE ARE NOT ALONE
• … a test that simulated the failure of a single apiserver node
disrupted the cluster in a way that negatively impacted the
availability of running workloads
• ... helped us identify that the disruption was likely related to an
interaction between the various clients that connect to the
Kubernetes apiserver (like calico-agent, kubelet, kube-proxy, and
kube-controller-manager) and our internal load balancer’s
behavior during an apiserver node failure.
• Source: Kubernetes at GitHub

HOW WE FIXED IT
• Do not change the healthcheck in case of API server failures
• Do not drop the routes in case of API server failures
• => Delete when you are really sure you want to delete!
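
One way to read "do not drop the routes" is a route table that only changes after a successful answer from the API server. The sketch below is an illustration under assumptions, not Skipper's actual code: `RouteTable` and `fetch_routes` are hypothetical, and an unreachable API server is modelled as a `ConnectionError`.

```python
class RouteTable:
    """Keeps the last known good routes when the route source is unreachable."""

    def __init__(self, fetch_routes):
        # fetch_routes is a hypothetical callable that returns a dict of
        # host -> backend, or raises ConnectionError when the API server
        # cannot be reached.
        self._fetch_routes = fetch_routes
        self.routes = {}
        self.stale = False

    def refresh(self):
        try:
            latest = self._fetch_routes()
        except ConnectionError:
            # The API server being unreachable says nothing about the
            # applications: keep serving the routes we already have.
            self.stale = True
            return self.routes
        # Only a successful answer may add or delete routes.
        self.routes = dict(latest)
        self.stale = False
        return self.routes
```

The `stale` flag lets monitoring report the degraded state without turning it into a failed healthcheck.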

FALLACIES OF
DISTRIBUTED
COMPUTING
Photo by chuttersnap on Unsplash

8 FALLACIES OF DISTRIBUTED COMPUTING
• The network is reliable.
• Latency is zero.
• Bandwidth is infinite.
• The network is secure.
• Topology doesn't change.
• There is one administrator.
• Transport cost is zero.
• The network is homogeneous.

THE FALLACIES OF CLOUD COMPUTING
• The API call you will make will succeed.
• The next API call you will make will succeed.
• Deleting resources is the same as adding new ones.
• Your cloud provider will have no outages.
• The dependencies between your services are clear.

MAKING YOUR SYSTEM
RESILIENT
Photo by Aaron Barnaby on Unsplash

WHEN MAKING API CALLS
• Every API call can fail
• Retry (with backoff)
• Circuit breakers
• Fallbacks
• Don’t scale down / delete resources fast!
• Deal with rate limiting
• Deal with “weird” values due to a broken cloud provider feature
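
A minimal sketch of the "retry (with backoff)" item, assuming exponential backoff with random jitter. `call_with_retries` and its parameters are hypothetical names, and catching every `Exception` is a simplification: real code would retry only errors known to be transient, such as timeouts or rate-limit responses.

```python
import random
import time


def call_with_retries(call, attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call `call()` until it succeeds or `attempts` are used up.

    Waits with exponential backoff and jitter between attempts, which
    also gives a rate-limited API time to recover.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller pick a fallback
            # Upper bound doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: sleep a random time up to the bound so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

The final exception is re-raised on purpose, so the caller decides whether to use a fallback or open a circuit breaker.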

TEST ALL THE THINGS
• Continuous integration tests
• Continuous deployment of cluster updates
• Load tests
• Chaos tests

CONTINUOUS INTEGRATION TESTS
• Test the interactions between components
• For every configuration change we run extensive e2e tests

CONTINUOUS
DEPLOYMENT
OF CLUSTER
UPDATES

LOAD TESTING
• Lots of requests to the API server
• Lots of pods running
• Writes/reads to the data storage (etcd)
• => what matters: observe the impact on running applications

CHAOS TESTING
• Random shutdown of Kubernetes components
  • https://github.com/linki/chaoskube
  • http://chaostoolkit.org/
  • https://github.com/asobti/kube-monkey
  • http://principlesofchaos.org
• Random shutdown of nodes (EC2 Instances)
  • https://github.com/Netflix/chaosmonkey
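
The core idea of a random shutdown can be sketched in a few lines. This is an illustration only and does not show how any of the linked tools are implemented: `kill_random_victim` is a hypothetical helper, and `kill` stands for whatever actually performs the shutdown (deleting a pod, terminating an instance).

```python
import random


def kill_random_victim(candidates, kill, protected=(), rng=random):
    """Shut down one randomly chosen candidate and return its name.

    `kill` is a hypothetical callable that performs the actual shutdown.
    Names listed in `protected` are never selected.
    """
    eligible = [c for c in candidates if c not in protected]
    if not eligible:
        return None  # nothing we are allowed to break
    victim = rng.choice(eligible)
    kill(victim)
    return victim
```

Passing a seeded `random.Random` as `rng` makes a chaos run reproducible when a failure needs to be investigated afterwards.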

MORE ON CHAOS TESTING
• Netflix’s principles of Chaos Engineering
• http://principlesofchaos.org
• Chaos Engineering free ebook

THAT’S NOT ALL
• You think Kubernetes the hard way is hard
• The hard part was never only the setup
• Sometimes you will have to break things to learn
• …set up a healthy post mortem culture and learn from
mistakes!

THAT’S IT
Photo by Dhruva Reddy on Unsplash

QUESTIONS?
Raffaele Di Fazio
@x0rg
