Kubernetes on AWS at Zalando: Failures & Learnings - DevOps NRW

DEVOPS NRW
2018-02-21
HENNING JACOBS
@try_except_
Kubernetes on AWS
@ZalandoTech

2
ZALANDO IN NUMBERS
> 4.5 billion EUR (2017)
> 200 million visits per month
> 14,000 employees in Europe
> 70% of visits via mobile devices
> 22 million active customers
> 250,000 product choices
~ 2,000 brands
15 countries

3
OUR FOOTPRINT AROUND EUROPE
as of November 2017
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)

7
INCIDENT #1: CUSTOMER IMPACT

8
INCIDENT #1: IAM RETURNING 404

10
LIFE OF A REQUEST (INGRESS)
DNS: my-app.example.org (ALIAS record)
ALB: aws-1234-lb.eu-central-1.elb.amazonaws.com
SKIPPER: 172.31.1.1:9999, 172.31.2.1:9999, 172.31.3.1:9999, 172.31.4.1:9999
SERVICE: 10.3.0.216
DEPLOYMENT
PODS: 10.2.0.1, 10.2.1.1, 10.2.2.1, 10.2.3.1

11
INCIDENT #1: UNASSUMING MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
  labels:
    application: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...

12
INCIDENT #1: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
  labels:
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        metadata:
          labels:
        spec:
          restartPolicy: Never
          containers:
          ...
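
Comparing the two manifests, the fix moves concurrencyPolicy and the two history limits from the pod spec up to the CronJob spec, which is where the Kubernetes batch API defines them. A plausible reading is that they had no effect in their original position. The check below is only a sketch of how such misplacement could be detected: the manifests are hand-written Python dicts with containers elided, and the helper name misplaced_cronjob_fields is hypothetical.

```python
# Fields that the Kubernetes batch API defines on the CronJob spec itself.
CRONJOB_SPEC_FIELDS = {
    "concurrencyPolicy",
    "successfulJobsHistoryLimit",
    "failedJobsHistoryLimit",
}


def misplaced_cronjob_fields(manifest):
    """Return CronJob-level fields that were put inside the pod spec."""
    pod_spec = (
        manifest.get("spec", {})
        .get("jobTemplate", {})
        .get("spec", {})
        .get("template", {})
        .get("spec", {})
    )
    return sorted(CRONJOB_SPEC_FIELDS & set(pod_spec))


# The "unassuming" manifest from the slide (containers elided).
unassuming = {
    "kind": "CronJob",
    "spec": {
        "schedule": "*/15 9-19 * * Mon-Fri",
        "jobTemplate": {"spec": {"template": {"spec": {
            "restartPolicy": "Never",
            "concurrencyPolicy": "Forbid",
            "successfulJobsHistoryLimit": 1,
            "failedJobsHistoryLimit": 1,
        }}}},
    },
}

# The fixed manifest from the slide (containers elided).
fixed = {
    "kind": "CronJob",
    "spec": {
        "schedule": "7 8-18 * * Mon-Fri",
        "concurrencyPolicy": "Forbid",
        "successfulJobsHistoryLimit": 1,
        "failedJobsHistoryLimit": 1,
        "jobTemplate": {"spec": {
            "activeDeadlineSeconds": 120,
            "template": {"spec": {"restartPolicy": "Never"}},
        }},
    },
}

print(misplaced_cronjob_fields(unassuming))
print(misplaced_cronjob_fields(fixed))
```

A check like this could run in a pipeline before a manifest is applied; it does not replace schema validation by the API server.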

13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to stay “healthy” during API server problems
• Fix Skipper Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
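
The first lesson above (the ALB sends traffic to all hosts once every host reports "unhealthy") can be modelled in a few lines. This is a toy model of the fail-open rule only, not AWS code; the function name routable_targets and the health map are assumptions made for illustration.

```python
def routable_targets(health):
    """Return the targets that receive traffic, given {target: is_healthy}."""
    healthy = [target for target, ok in health.items() if ok]
    # Fail open: with no healthy target left, every target gets traffic again.
    return healthy if healthy else list(health)


# Skipper endpoints from the request-flow slide; the health states are made up.
skippers = {
    "172.31.1.1:9999": True,
    "172.31.2.1:9999": False,
    "172.31.3.1:9999": True,
    "172.31.4.1:9999": False,
}
print(routable_targets(skippers))

# If all Skipper instances fail their health check at once, all are targeted.
all_unhealthy = dict.fromkeys(skippers, False)
print(routable_targets(all_unhealthy))
```

Under this rule, an ingress proxy that reports "unhealthy" during API server problems still receives requests once all of its peers do the same, which is why the second and third lessons ask Skipper to stay healthy and keep its last known routes.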

14
STABILITY: LIMIT RANGE
$ kubectl describe limitrange
Name:       limits
Namespace:  default
Type       Resource  Min  Max   Default Request  Default Limit  Max Limit/Request Ratio
----       --------  ---  ---   ---------------  -------------  -----------------------
Container  memory    -    64Gi  100Mi            1Gi            -
Container  cpu       -    16    100m             3              -
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
⇒ Mitigate errors on OSI layer 8 ;-)
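
To make the effect of the limit range above concrete, here is a simplified sketch of how its defaults fill in a container's missing requests and limits. The default values are taken from the kubectl output; the function name with_defaults is hypothetical, quantities are kept as plain strings, and the Max column (64Gi memory, 16 CPUs) is not enforced here.

```python
# Defaults from the "kubectl describe limitrange" output above.
DEFAULT_REQUEST = {"memory": "100Mi", "cpu": "100m"}
DEFAULT_LIMIT = {"memory": "1Gi", "cpu": "3"}


def with_defaults(resources):
    """Fill in missing requests/limits of one container (simplified model)."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))
    for name in DEFAULT_LIMIT:
        if name not in requests:
            # Kubernetes uses an explicit limit as the request when only the
            # limit is given; otherwise the namespace default request applies.
            requests[name] = limits.get(name, DEFAULT_REQUEST[name])
        limits.setdefault(name, DEFAULT_LIMIT[name])
    return {"requests": requests, "limits": limits}


# A container that declares nothing gets both defaults.
print(with_defaults({}))

# A container that only sets a memory limit keeps it, also as its request.
print(with_defaults({"limits": {"memory": "2Gi"}}))
```

With such defaults in place, a deployment that forgets its resources section still ends up with bounded containers.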

16
INCIDENT #2: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix

17
INCIDENT #2: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
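
Going by the help line, the trailing word "prefix" appears to have been read as range_end rather than as an option. etcd v3 deletes the half-open key range [key, range_end) in byte order, so the sketch below simulates that rule on a handful of keys. The key names are invented for illustration and range_delete is a hypothetical helper, not part of any etcd client.

```python
def range_delete(keys, key, range_end):
    """Split keys into (deleted, kept) for the half-open range [key, range_end)."""
    deleted = sorted(k for k in keys if key <= k < range_end)
    kept = sorted(k for k in keys if not (key <= k < range_end))
    return deleted, kept


# Hypothetical keys; only the registry prefix is taken from the slide.
keys = [
    "/registry-kube-1/apiextensions/example",
    "/registry-kube-1/certificatesigningrequest/csr-1",
    "/registry-kube-1/deployments/default/my-app",
    "/registry-kube-1/pods/default/my-app-1",
]

deleted, kept = range_delete(
    keys, "/registry-kube-1/certificatesigningrequest", "prefix"
)
print("deleted:", deleted)
print("kept:", kept)
```

Because "/" sorts before "p", every key from the certificatesigningrequest entry onwards falls inside the range, so deployments and pods are removed as well; only keys that sort before the start key survive.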

https://www.outcome-eng.com/human-error-never-root-cause/

19
• Disaster Recovery Plan?
• Backup etcd to S3
• Monitor the snapshots

20
INCIDENT #3: LATENCY SPIKES

21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
# Stopgap watchdog: restart the local etcd member whenever a request
# to the local API server fails or times out.
SLEEPTIME=60
while true; do
    echo "sleep for $SLEEPTIME seconds"
    sleep "$SLEEPTIME"
    # Probe the API server; give up after 5 seconds.
    if timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null; then
        echo "all fine, no need to restart etcd member"
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done

22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]

23
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable

30
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration automatically
  • Bootstrap new cluster with the previous configuration
  • Apply new configuration
  • Run end-to-end & conformance tests

31
TRAFFIC SWITCHING
Default deployment: rolling update

33
MANUAL TRAFFIC SWITCHING
$ zkubectl traffic <ingress-name>
SERVICE WEIGHT
<service-backend-1> 30%
$ zkubectl traffic <ingress-name> <service-backend-2> 100
SERVICE WEIGHT
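
The weights shown by zkubectl can be thought of as shares of incoming requests per backend. The sketch below illustrates weighted backend selection in general; it is not Skipper's or zkubectl's implementation. The function pick_backend is hypothetical, and the 70% for the second backend is an assumption, since the output above only shows the first row.

```python
def pick_backend(weights, r):
    """Pick a backend for one request.

    weights: {backend name: weight}, r: a number in [0, 1),
    for example the result of random.random().
    """
    threshold = r * sum(weights.values())
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return name
    raise ValueError("at least one backend needs a positive weight")


# Before the switch (the 70 is assumed, not taken from the slide).
before = {"service-backend-1": 30, "service-backend-2": 70}
print(pick_backend(before, 0.1))
print(pick_backend(before, 0.5))

# After "zkubectl traffic <ingress-name> <service-backend-2> 100".
after = {"service-backend-1": 0, "service-backend-2": 100}
print(pick_backend(after, 0.1))
```

Passing r explicitly keeps the example deterministic; a real router would draw a fresh random value per request.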

34
DOCUMENTATION
"Documentation is hard to find"
"Documentation is not comprehensive enough"
"Remove unnecessary complexity and obstacles."
"Get the documentation up to date and prepare use cases"
"More and more clear documentation"
"More detailed docs, example repos with more complicated deployments."

35
DOCUMENTATION
• Restructure following https://www.divio.com/en/blog/documentation/
  • Concepts
  • How Tos
  • Tutorials
  • Reference
• Global Search
• Weekly Health Check: Support → Documentation

36
ONBOARDING
• Many new concepts to grasp vs. 200+ teams
• Kubernetes Training (2h)
• Documentation
• Recorded Friday Demos
• Support Channels (chat, mail)

38
DOES ANYTHING EVEN WORK?
• Kubernetes API as the primary interface
• Ingress Controller + External DNS
• kube2iam
• Cluster lifecycle management
• Zalando IAM/OAuth integration via CRD
• PostgreSQL Operator

40
DEPLOYMENT CONFIGURATION
.
├── deploy/apply
│   ├── deployment.yaml    # K8s Deployment
│   ├── credentials.yaml   # K8s TPR
│   ├── ingress.yaml       # K8s Ingress
│   └── service.yaml       # K8s Service
└── delivery.yaml          # pipeline config

41
INGRESS.YAML
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "myapp.foo.example.org"
      http:
        paths:
          - backend:
              serviceName: "myapp"
              servicePort: 80

42
CONTINUOUS DELIVERY PLATFORM

45
POSTGRES OPERATOR
Application to manage PostgreSQL clusters
Observes “postgres” manifests (CRDs)
Spawns and modifies new clusters
Syncs and provisions roles
Handles volume resize, incl. resize2fs
Also responsible for updating Docker images

https://github.com/hjacobs/kube-ops-view

47
LINKS
Running Kubernetes in Production on AWS
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
Kube AWS Ingress Controller
https://github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
https://github.com/kubernetes-incubator/external-dns
PostgreSQL Operator
https://github.com/zalando-incubator/postgres-operator
Zalando Cluster Configuration
https://github.com/zalando-incubator/kubernetes-on-aws
List of Organizations using Kubernetes on AWS
https://github.com/hjacobs/kubernetes-on-aws-users

QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k
