Staying out of trouble with K8S on AWS
Adam Hamsik
DevOps/Cloud Engineer

www.pixelfederation.com
TL;DR Summary
1. Know your enemy: deep knowledge of
a. AWS
b. Kubernetes
i. Choose your CNI wisely
ii. Be aware of the scheduler
c. Applications
2. Trust your tools
a. Monitoring
b. ELK
c. Deployment tools

AWS Gotchas
1. Standard AWS HA procedures
2. Cluster Autoscaler
3. EBS volumes
a. EBS volumes don’t work across AZs
b. Kubernetes sometimes can’t find a place for a pod with an EBS volume if all instances in the volume’s AZ are full
4. Choose the right instance type for your application

Kubernetes AWS architecture

K8s on AWS
Cluster Autoscaler
1. CA (Cluster Autoscaler) is not AZ-aware when it scales your cluster
a. Sometimes a pod can run only in a particular zone, but CA will start the new node in another one.
2. Use a PodDisruptionBudget to make sure the required number of pods keeps running
3. Use podAntiAffinity to spread your replicas across multiple AZs and nodes
4. CA and the AWS ASG rebalance policy can work against each other and get the cluster into a failure loop
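
Items 2 and 3 above could be expressed with objects along these lines. This is a minimal sketch in Python that builds them as plain dictionaries and prints JSON, which kubectl accepts as well as YAML. The label my-app, the minAvailable value and the weight are hypothetical. The zone topology key is the well-known label in current Kubernetes releases; older clusters may use a different one.

```python
import json

APP_LABELS = {"app": "my-app"}  # hypothetical label, not from the talk


def pod_disruption_budget(name, min_available):
    """Build a PodDisruptionBudget that keeps min_available pods running."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": APP_LABELS},
        },
    }


def zone_anti_affinity(weight=100):
    """Build an affinity stanza that prefers placing replicas in different AZs."""
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # must be between 1 and 100
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": APP_LABELS},
                        "topologyKey": "topology.kubernetes.io/zone",
                    },
                }
            ]
        }
    }


pdb = pod_disruption_budget("my-app-pdb", min_available=2)
affinity = zone_anti_affinity()
print(json.dumps(pdb, indent=2))
print(json.dumps(affinity, indent=2))
```

The affinity dictionary would go under spec.template.spec.affinity of the Deployment. The preferred (soft) form is used here so that pods can still be scheduled when a zone is full.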

Real Life example
Cluster Autoscaler
1. Create an application Deployment with multiple replicas and EBS volumes, using RollingUpdate as the update strategy
2. Change the version and run the upgrade
3. During the upgrade CA will have to scale your cluster up, based on the maxSurge parameter of RollingUpdate
4. With three AZs there is only about a 1 in 3 chance that the new node will be in the same AZ as the original one.
5. If the new node is in a different AZ, the new pod can’t attach its EBS volume, so the upgrade is blocked and can’t move forward
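
The blocked rollout above can be modelled in a few lines. This is a sketch under simplifying assumptions that are not from the talk: three AZs with hypothetical names, and a new node that lands in a uniformly random AZ, which is not how Cluster Autoscaler or the ASG actually picks zones.

```python
from fractions import Fraction

ZONES = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]  # hypothetical AZ names


def can_run(volume_zone, node_zone):
    """A pod with an EBS volume can only run on a node in the volume's AZ."""
    return volume_zone == node_zone


def chance_new_node_is_usable(volume_zone, zones):
    """Probability that a node added in a uniformly random AZ can host the pod."""
    usable = sum(1 for zone in zones if can_run(volume_zone, zone))
    return Fraction(usable, len(zones))


p_ok = chance_new_node_is_usable("eu-west-1a", ZONES)
print(f"rollout can proceed: {p_ok}, rollout blocked: {1 - p_ok}")
```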

Kubernetes CA Multi AZ setup

K8s Node troubles
1. The K8s scheduler wants to utilize your nodes as much as possible
a. If pod requests understate real usage, it will schedule more pods on a node than its physical resources can manage
2. Use kubelet limits (reserved resources and eviction thresholds) to make sure pods are evicted from a node when it’s utilized too much
3. Node problem detector is a daemon that runs as a DaemonSet on each node and checks whether the node is in a correct state
a. Infrastructure daemon issues: ntp service down
b. Hardware issues: Bad cpu, memory or disk
c. Kernel issues: Kernel deadlock, corrupted file system
d. Container runtime issues: Unresponsive runtime daemon
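
Item 2 above relies on the kubelet setting resources aside and evicting pods before the node is exhausted. The kubelet computes the memory it offers to pods as node capacity minus kube-reserved, system-reserved and the hard eviction threshold (the --kube-reserved, --system-reserved and --eviction-hard settings). A minimal sketch of that arithmetic follows; all numbers are hypothetical.

```python
def allocatable_mib(capacity, kube_reserved, system_reserved, eviction_hard):
    """Memory (MiB) the scheduler may hand out to pods on one node."""
    allocatable = capacity - kube_reserved - system_reserved - eviction_hard
    if allocatable <= 0:
        raise ValueError("reservations exceed node capacity")
    return allocatable


# Hypothetical 16 GiB node; the reservation sizes are illustrative only.
node = allocatable_mib(
    capacity=16384,
    kube_reserved=512,     # kubelet and container runtime
    system_reserved=512,   # sshd, systemd and other OS daemons
    eviction_hard=500,     # memory.available threshold that triggers eviction
)
print(f"allocatable memory: {node} MiB")
```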

Real Life example
K8s Node troubles
1. We created multiple Deployments on our cluster with containers that had no resource requests or limits
2. Without them the Kubernetes scheduler has no idea what resources each pod will need, so it can run all pods on one node.
3. As the resource usage of the pods grows, the node runs out of hardware resources
4. The kernel OOM killer kills various system services and the node becomes unresponsive
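
The failure above can be reproduced with a toy model. This sketch assumes a first-fit scheduler that looks only at memory requests; the real Kubernetes scheduler also scores nodes and tends to spread pods, so this is a deliberate simplification. Pod names, node names and sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    request_mib: int  # what the scheduler sees
    usage_mib: int    # what the pod really consumes


@dataclass
class Node:
    name: str
    capacity_mib: int
    pods: list = field(default_factory=list)

    def requested(self):
        return sum(p.request_mib for p in self.pods)

    def used(self):
        return sum(p.usage_mib for p in self.pods)


def schedule(pods, nodes):
    """First-fit placement that considers requests only, never real usage."""
    for pod in pods:
        for node in nodes:
            if node.requested() + pod.request_mib <= node.capacity_mib:
                node.pods.append(pod)
                break
        else:
            raise RuntimeError(f"no node fits {pod.name}")
    return nodes


nodes = [Node("node-a", 4096), Node("node-b", 4096)]
# No requests set: every pod looks free to the scheduler.
pods = [Pod(f"app-{i}", request_mib=0, usage_mib=1024) for i in range(6)]
schedule(pods, nodes)
for node in nodes:
    state = "OVERCOMMITTED" if node.used() > node.capacity_mib else "ok"
    print(node.name, len(node.pods), "pods,", node.used(), "MiB used:", state)
```

With a request of 1024 MiB per pod, the same function would place four pods on each node at most and the overcommit would not occur.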

Node Size VS POD Size
K8s Node troubles
Not everything has to run in Kubernetes. Some things are better managed in VMs.
If your application pods are almost as big as the servers they run on, it’s better to use VMs.
Plan your InstanceGroups accordingly: there is no need for beefy servers for small pods.

K8s POD troubles
It is essential to understand your workload and how your application behaves under traffic.
1. Pod resource limits and requests
a. Some applications need more RAM/CPU during startup and can work with less later; plan accordingly.
b. Provide the necessary information to the K8s scheduler. Without it the scheduler works on a best-effort basis.
2. If your application goes over its memory limit it will be killed by the kernel and the pod will be restarted. Going over the CPU limit only throttles it.
3. Set limits and requests relatively close together so that the pod is not a prime candidate for eviction when the node needs to free resources.
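
As an illustration of the points above, a container entry with explicit requests and limits might look like the following. This is a sketch only: the image name and all quantities are hypothetical and would have to come from measuring the real application.

```python
import json


def container_spec(name, image, cpu_request, cpu_limit, mem_request, mem_limit):
    """Build a container entry with explicit requests and limits."""
    return {
        "name": name,
        "image": image,
        "resources": {
            "requests": {"cpu": cpu_request, "memory": mem_request},
            "limits": {"cpu": cpu_limit, "memory": mem_limit},
        },
    }


# Hypothetical values: the limit leaves headroom for the startup peak,
# while the request stays relatively close to it.
logstash = container_spec(
    "logstash", "example.org/logstash:latest",
    cpu_request="500m", cpu_limit="1",
    mem_request="1536Mi", mem_limit="2Gi",
)
print(json.dumps(logstash, indent=2))
```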

K8s POD troubles examples
1. The deployed application needs more RAM during startup (Logstash, Elasticsearch)
2. During startup the application exhausts its memory limit
3. The kernel OOM killer kills Logstash because it ran out of memory inside its cgroup
4. The kubelet restarts the application pod
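
The sequence above can be checked against a recorded memory profile before choosing a limit. A minimal sketch, assuming a hypothetical series of memory samples in MiB taken once per second:

```python
def first_oom(samples_mib, limit_mib):
    """Return the index of the first sample above the limit, or None."""
    for second, used in enumerate(samples_mib):
        if used > limit_mib:
            return second
    return None


# Hypothetical profile: high during startup, lower in steady state.
startup_profile = [200, 600, 1100, 1400, 900, 700, 700]

# A limit sized for steady state is exceeded during startup.
print(first_oom(startup_profile, limit_mib=1024))
# A limit sized for the startup peak is never exceeded.
print(first_oom(startup_profile, limit_mib=1536))
```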

K8s POD QoS
When Kubernetes creates a Pod it assigns one of these QoS classes
1. Guaranteed
a. Every Container in the Pod must have a memory/cpu limit and a memory/cpu
request, and they must be the same.
2. Burstable
a. The Pod does not meet the criteria for QoS class Guaranteed, and at least one Container has a memory or CPU request or limit
3. BestEffort
a. For a Pod to be given a QoS class of BestEffort, the Containers in the Pod must not
have any memory or CPU limits or requests.
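
The three rules above can be written down as a small classifier. This is a simplified sketch, not the kubelet's implementation: it compares quantities as literal strings (so "1" and "1000m" count as different) and takes a plain list of container dictionaries as input.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod from its containers' resources (simplified rules)."""
    any_set = False
    guaranteed = True
    for container in containers:
        res = container.get("resources", {})
        limits = res.get("limits", {})
        # Kubernetes defaults a missing request to the limit when a limit is set.
        requests = {**limits, **res.get("requests", {})}
        for name in RESOURCES:
            if name in limits or name in requests:
                any_set = True
            if name not in limits or requests[name] != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"resources": {"requests": {"cpu": "500m", "memory": "1Gi"},
                      "limits": {"cpu": "500m", "memory": "1Gi"}}}
burst = {"resources": {"requests": {"memory": "512Mi"},
                       "limits": {"memory": "1Gi"}}}
bare = {}

print(qos_class([same]), qos_class([same, burst]), qos_class([bare]))
```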

K8s Application troubleshooting
1. If your application is down, start as close to the pod as possible and work outwards from there.
a. Is your application healthy? Does the application pod have many restarts?
2. Can you access your application directly on the pod, and does it work?
a. kubectl port-forward pod/pod-name local_port:remote_port
3. Can you access your application through its service?
a. kubectl port-forward service/service-name local_port:remote_port
4. If everything above works and your ingress still doesn’t, check the ingress manifest.
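
The checklist above can be turned into a list of commands to run in order. This sketch only builds and prints the kubectl command lines and does not execute them; the pod, service and ingress names and the ports are hypothetical.

```python
import shlex


def port_forward(kind, name, local_port, remote_port, namespace="default"):
    """Build a kubectl port-forward command for a pod or a service."""
    return [
        "kubectl", "--namespace", namespace, "port-forward",
        f"{kind}/{name}", f"{local_port}:{remote_port}",
    ]


def restart_counts(pod_name, namespace="default"):
    """Build a command that prints the restart count of each container."""
    return [
        "kubectl", "--namespace", namespace, "get", "pod", pod_name,
        "-o", "jsonpath={.status.containerStatuses[*].restartCount}",
    ]


steps = [
    restart_counts("my-app-12345"),                  # step 1: many restarts?
    port_forward("pod", "my-app-12345", 8080, 80),   # step 2: pod reachable?
    port_forward("service", "my-app", 8080, 80),     # step 3: service reachable?
    ["kubectl", "describe", "ingress", "my-app"],    # step 4: inspect ingress
]
for cmd in steps:
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run as is once the names match a real cluster.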

When something goes wrong
Kubernetes is a distributed application with many moving parts. Be aware that any troubleshooting is a complicated process.
1. Have your monitoring ready
a. Prometheus + Grafana works great
b. Prometheus can dynamically detect new services/pods and based on their
annotations scrape them for metrics.
2. Gather kubernetes events and logs
a. EFK
i. Gather Kubernetes logs from nodes/masters and push them to your own Elasticsearch cluster
b. Gather Kubernetes events and store them in the Elasticsearch cluster
i. https://github.com/haad/event-exporter
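
For the annotation-based discovery in item 1b, a pod's metadata might carry annotations like the ones below. The prometheus.io/* names are a widely used convention rather than something built into Prometheus: they only have an effect if the Prometheus scrape configuration contains matching relabeling rules, which is assumed here. The pod name and port are hypothetical.

```python
def scrape_annotations(port, path="/metrics"):
    """Annotations that a Prometheus relabeling config can use to find a pod."""
    return {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": str(port),  # annotation values must be strings
        "prometheus.io/path": path,
    }


pod_metadata = {
    "name": "my-app",  # hypothetical pod name
    "annotations": scrape_annotations(9102),
}
print(pod_metadata)
```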

Grafana + Prometheus

Questions ?

Thanks !
ahamsik@pixelfederation.com
