2
ZALANDO IN NUMBERS
> 4.5billion EUR
2017
> 200
million
visits
per
month
> 14,000
employees in
Europe
> 70%
of visits via
mobile devices
> 22
million
active customers
> 250,000
product choices
~ 2,000
brands
15
countries
3
OUR FOOTPRINT AROUND EUROPE
as of November 2017
1
8
10
11
12
13
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
10
9
7
6
5
3
2
1
11
12
13
4
14
15
15
14
9
8
7
6
5
4
3
2
1
4
OUR FOOTPRINT AROUND EUROPE
TECH
as of November 2017
1
8
10
11
12
13
BERLIN HEADQUARTERS AND OUTLET
BRIESELANG FULFILLMENT CENTER
ERFURT FULFILLMENT CENTER AND TECH OFFICE
MÖNCHENGLADBACH FULFILLMENT CENTER AND TECH OFFICE
LAHR FULFILLMENT CENTER
DORTMUND TECH HUB
FRANKFURT OUTLET
DUBLIN TECH HUB
HELSINKI TECH HUB
MILAN (STRADELLA) FULFILLMENT CENTER
KÖLN OUTLET
PARIS (MOISSY-CRAMAYEL) FULFILLMENT CENTER
SZCZECIN (GRYFINO) FULFILLMENT CENTER
HAMBURG ADTECH LAB
STOCKHOLM (BRUNNA) FULFILLMENT CENTER (start winter 2017)
10
9
7
6
5
3
2
1
11
12
13
4
14
15
15
14
9
8
7
6
5
4
3
2
1
10
LIFE OF A REQUEST (INGRESS)
DNS
my-app.example.org
ALB
aws-1234-lb.eu-central-1.elb.amazonaws.com
SERVICE
10.3.0.216
DEPLOYMENT
POD
10.2.0.1
POD
10.2.1.1
POD
10.2.2.1
POD
10.2.3.1
SKIPPER
172.31.1.1:9999
SKIPPER
172.31.2.1:9999
SKIPPER
172.31.3.1:9999
SKIPPER
172.31.4.1:9999
ALIAS Record
13
INCIDENT #1: LESSONS LEARNED
• ALB routes traffic to ALL hosts if all hosts report “unhealthy”
• Fix Skipper Ingress to stay “healthy” during API server problems
• Fix Skipper Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
name: compute-resources
spec:
hard:
pods: "1500"
14
STABILITY: LIMIT RANGE
$ kubectl describe limitrange
Name: limits
Namespace: default
Type Resource Min Max Default Req Default Limit Max Limit/Request Ratio
---- -------- --- --- ----------- ------------- -----------------------
Container memory - 64Gi 100Mi 1Gi -
Container cpu - 16 100m 3 -
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
⇒ Mitigate errors on OSI layer 8 ;-)
21
INCIDENT #3: STOP THE BLEEDING
#!/bin/bash
SLEEPTIME=60
while true; do
echo "sleep for $SLEEPTIME seconds"
sleep $SLEEPTIME
timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
if [ $? -eq 0 ]; then
echo "all fine, no need to restart etcd member"
continue
else
echo "restarting etcd-member"
systemctl restart etcd-member
fi
done
22
INCIDENT #3: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
23
INCIDENT #3: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
30
INCIDENT #4: LESSONS LEARNED
• Automated end-to-end tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with the previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
34
DOCUMENTATION
"Documentation is hard to find"
"Documentation is not comprehensive enough"
"Remove unnecessary complexity and obstacles."
"Get the documentation up to date and prepare
use cases"
"More and more clear documentation"
"More detailed docs, example repos with more
complicated deployments."
36
ONBOARDING
• Many new concepts to grasp vs. 200+ teams
• Kubernetes Training (2h)
• Documentation
• Recorded Friday Demos
• Support Channels (chat, mail)
38
DOES ANYTHING EVEN WORK?
• Kubernetes API as the primary interface
• Ingress Controller + External DNS
• kube2iam
• Cluster lifecycle management
• Zalando IAM/OAuth integration via CRD
• PostgreSQL Operator
47
LINKS
Running Kubernetes in Production on AWS
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
Kube AWS Ingress Controller
https://github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
https://github.com/kubernetes-incubator/external-dns
PostgreSQL Operator
https://github.com/zalando-incubator/postgres-operator
Zalando Cluster Configuration
https://github.com/zalando-incubator/kubernetes-on-aws
List of Organizations using Kubernetes on AWS
https://github.com/hjacobs/kubernetes-on-aws-users