CONTAINER DAYS HAMBURG
2017-06-20
HENNING JACOBS
@try_except_
Kubernetes on AWS
@ZalandoTech
2
ZALANDO
15 markets
6 fulfillment centers
20 million active customers
3.6 billion € net sales 2016
165 million visits per month
12,000 employees in Europe
3
ZALANDO TECHNOLOGY
HOME-BREWED, CUTTING-EDGE & SCALABLE technology solutions
>1,700 employees from 77 nations
6 tech locations + HQs in Berlin
help our brand to WIN ONLINE
4
ZALANDO TECH’S
INFRASTRUCTURE
5
FOUR ERAS AT ZALANDO TECH
PHP: Data center, PHP files
ZOMCAT (from 2010): Data center, WAR
STUPS (from 2015): AWS, Docker, Cloud Formation, low level (AWS API)
KUBERNETES (from 2016): AWS, Docker, Kubernetes manifest, high abstraction level
6
LARGE SCALE?
8
KUBERNETES:
ARCHITECTURE
9
KUBERNETES ON AWS: CONTEXT
200 engineering teams
30 prod. clusters
AWS/STUPS
Dockerized apps
No manual operations
Reliability
Autoscaling
Seamless migration
10
ISOLATED AWS ACCOUNTS
[Diagram: the Internet reaches two isolated AWS accounts, Product ABC (*.abc.example.org) and Product XYZ (*.xyz.example.org), each with its own LB and EC2 instances]
11
KUBERNETES ON AWS
12
DEPLOYMENT
13
DEPLOYMENT CONFIGURATION
.
├── apply
│ ├── credentials.yaml # K8s TPR
│ ├── ingress.yaml # K8s Ingress
│ ├── redis-deployment.yaml # K8s Deployment
│ ├── redis-service.yaml # K8s Service
│ └── service.yaml # K8s Service
├── deployment.yaml # K8s Deployment
└── pipeline.yaml # proprietary config
14
INGRESS.YAML
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "myapp.foo.example.org"
      http:
        paths:
          - backend:
              serviceName: "myapp"
              servicePort: 80
15
JENKINS DEPLOY PIPELINE
16
AWS INTEGRATION
17
CLOUD FORMATION VIA CI/CD
.
├── apply
│ ├── cf-iam-role.yaml # AWS IAM Role
│ ├── cf-rds.yaml # AWS RDS Database
│ ├── kube-ingress.yaml # K8s Ingress
│ ├── kube-secret.yaml # K8s Secret
│ └── kube-service.yaml # K8s Service
├── deployment.yaml # K8s Deployment
└── pipeline.yaml # CI/CD config
18
ASSIGNING AWS IAM ROLE TO POD
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        # annotation for kube2iam
        iam.amazonaws.com/role: "app-myapp-role"
    spec:
      containers:
        - name: ...
          ...
https://github.com/jtblin/kube2iam
⇒ AWS SDKs just work as expected
19
CLUSTER
AUTOSCALING
20
CLUSTER AUTOSCALING
Control # of worker nodes in ASG:
• Satisfy all resource requests
• One spare node per AZ
• No manual config “tweaking”
• Scale down, but not too fast
⇒ we want to be “elastic”
https://github.com/hjacobs/kube-aws-autoscaler
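The sizing rule on this slide can be sketched roughly as follows. This is a minimal illustration and not the kube-aws-autoscaler code: the function names, the CPU-only model in millicores, the uniform node capacity, and the one-node-per-step scale-down limit are all assumptions made for the example. It also ignores bin-packing and AZs that currently run no pods.

```python
import math
from collections import defaultdict


def desired_nodes_per_az(pod_requests, node_capacity_millicores, spare_nodes=1):
    """Return {az: node_count} that covers all CPU requests plus spare nodes.

    pod_requests: iterable of (az, cpu_request_in_millicores) tuples
                  (hypothetical input shape).
    node_capacity_millicores: allocatable CPU per worker node, assumed uniform.
    """
    requested = defaultdict(int)
    for az, millicores in pod_requests:
        requested[az] += millicores
    return {
        az: math.ceil(total / node_capacity_millicores) + spare_nodes
        for az, total in requested.items()
    }


def next_size(current, desired, max_scale_down_step=1):
    """Scale up immediately, but shrink by at most max_scale_down_step per run."""
    if desired >= current:
        return desired
    return max(desired, current - max_scale_down_step)
```

Scaling up jumps straight to the desired size so that pending pods get scheduled, while scaling down is damped, which is one simple way to read "scale down, but not too fast".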
21
OAUTH / IAM
INTEGRATION
22
SERVICE TO SERVICE AUTHNZ
Kubernetes Cluster
https://resource-server.example.org/protected
HTTP/1.1 401 Unauthorized
{
  "message": "Authorization required"
}
23
CREDENTIAL PROVIDER
24
USING THE OAUTH CREDENTIALS
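#!/bin/bash
# read the OAuth token from the mounted credentials file
secret=$(cat /creds/mytok-token-secret)
# send it as a Bearer token to the protected resource server
curl -H "Authorization: Bearer $secret" \
    https://resource-server.example.org/protected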
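The same pattern from application code might look like the sketch below. It uses only the Python standard library; the helper name `build_request` is hypothetical, and the token path is simply the one shown on the slide. The token is re-read on every call, on the assumption that the mounted file may be replaced with a fresh token over time.

```python
import urllib.request
from pathlib import Path


def build_request(url, token_path="/creds/mytok-token-secret"):
    """Build an HTTP request carrying the mounted OAuth token as a Bearer token."""
    # Read the token fresh on each call so a rotated file is picked up.
    token = Path(token_path).read_text().strip()
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    request = build_request("https://resource-server.example.org/protected")
    with urllib.request.urlopen(request) as response:
        print(response.status)
```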
25
CHALLENGES
26
CHALLENGES
1. Getting Started
2. Stability
3. Onboarding
4. User Experience
5. Operations
27
CHALLENGE 1:
GETTING STARTED
28
GETTING STARTED
https://github.com/hjacobs/kubernetes-on-aws-users
30
CLUSTER PROVISIONING
31
CLUSTER PROVISIONING
• Two Cloud Formation stacks
• Master & worker ASGs + etcd
• Nodes w/ Container Linux
• K8s manifests applied separately
• kube-system Deployments
• DaemonSets
32
GETTING STARTED
Goal: use Kubernetes API as primary interface for AWS
• Mate, External DNS
• Kubernetes Ingress Controller for AWS
• kube2iam
⇒ we wrote new components
to achieve our goal
33
INGRESS CONTROLLER
https://github.com/zalando-incubator/kube-ingress-aws-controller
https://github.com/kubernetes-incubator/external-dns
35
GETTING STARTED
Other questions we asked ourselves:
• Single AZ vs. Multi AZ? ⇒ Multi AZ
• Federation? ⇒ No, not ready yet
• Overlay network? ⇒ Flannel, “rock solid”
• Authnz? ⇒ OAuth, webhook
36
CHALLENGE 2:
STABILITY
37
STABILITY
• Cluster Updates
• Docker
• AWS Rate Limits
38
CLUSTER
UPDATES
40
STABILITY: AWS RATE LIMITS
• Ran into the same trap twice (Mate & Ingress Ctrl)
• Kubernetes core causes many calls (e.g. EBS)
• Monitoring (ZMON) needs to poll AWS
⇒ One of our biggest pain points with AWS
(and all workarounds are hard and/or ugly)
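One widely used mitigation for throttled API calls is retrying with exponential backoff and jitter. The sketch below is generic and is not taken from Mate or the Ingress controller; `call_with_backoff`, `is_throttled`, and the default delays are hypothetical names and values chosen for illustration. It reduces retry storms but does not remove the underlying account-wide limits.

```python
import random
import time


def call_with_backoff(fn, is_throttled, max_attempts=5,
                      base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying throttled failures with exponential backoff and full jitter.

    is_throttled: predicate deciding whether an exception is a rate-limit error.
    sleep: injectable for testing; defaults to time.sleep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-throttling errors and on the last attempt.
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```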
41
STABILITY: LIMIT RANGE
kubectl describe limitrange
Name:       limits
Namespace:  default
Type       Resource  Min  Max   Default Req  Default Limit  Max Limit/Request Ratio
----       --------  ---  ---   -----------  -------------  -----------------------
Container  memory    -    64Gi  100Mi        1Gi            -
Container  cpu       -    16    100m         3              -
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html#resources
⇒ Mitigate errors on OSI layer 8 ;-)
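The effect of such a LimitRange on a container that declares nothing can be sketched as below. This is a simplified model of the defaulting behaviour, not Kubernetes source code: the function name is hypothetical, the default values are copied from the output above, and Min/Max validation is left out.

```python
# Defaults copied from the LimitRange shown above.
DEFAULT_REQUEST = {"memory": "100Mi", "cpu": "100m"}
DEFAULT_LIMIT = {"memory": "1Gi", "cpu": "3"}


def apply_limit_range_defaults(resources):
    """Return a container's resources with missing requests/limits filled in."""
    requests = dict(resources.get("requests", {}))
    limits = dict(resources.get("limits", {}))
    for name in DEFAULT_LIMIT:
        # A limit without a request: the request defaults to the limit.
        if name in limits and name not in requests:
            requests[name] = limits[name]
        requests.setdefault(name, DEFAULT_REQUEST[name])
        limits.setdefault(name, DEFAULT_LIMIT[name])
    return {"requests": requests, "limits": limits}
```

A container deployed without any resources section therefore still ends up with a 100Mi/100m request and a 1Gi/3 CPU limit, which keeps a forgotten setting from starving its neighbours.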
Recommended: The 5 Whys
https://en.wikipedia.org/wiki/5_Whys
44
CHALLENGE 3:
ONBOARDING
45
ONBOARDING
• Many new concepts to grasp vs. 200 teams
• Kubernetes Training (2h)
• Documentation
• Recorded Friday Demos
• Support Channels (chat, mail)
46
CHALLENGE 4:
USER EXPERIENCE
47
USER EXPERIENCE
• Jenkins deployment only covers “happy case”
• Juggling with YAMLs
• Weighted traffic switching missing
48
UX: WEIGHTED TRAFFIC SWITCHING
• STUPS uses weighted Route53 DNS records
• Allows canary, blue/green, slow ramp up
• Approach: add weights to Ingress backends
https://github.com/zalando/skipper/issues/324
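The idea of weighted backends can be illustrated with a few lines of weighted random selection. This is only a conceptual sketch, not Skipper's implementation or its configuration format; `pick_backend` and the backend names are made up for the example.

```python
import random


def pick_backend(weights, rng=random):
    """Pick a backend name with probability proportional to its weight.

    weights: mapping of backend name to a non-negative weight,
             with at least one weight greater than zero.
    """
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


if __name__ == "__main__":
    # Canary: send roughly 10% of requests to the new version.
    canary = {"myapp-v1": 90, "myapp-v2": 10}
    print(pick_backend(canary))
```

Blue/green switching is the special case of weights 100 and 0, and a slow ramp-up is a sequence of weight changes over time.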
50
CHALLENGE 5:
OPERATIONS
51
OPERATIONS
• Team Autonomy?
• Platform as a Service
• Convergence
• Emergency Operator Access
⇒ Hard challenges..
https://github.com/hjacobs/kube-ops-view
53
LINKS
Running Kubernetes in Production on AWS
http://kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/kubernetes-in-production.html
Kube AWS Ingress Controller
https://github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
https://github.com/kubernetes-incubator/external-dns
PostgreSQL Operator
https://github.com/zalando-incubator/postgres-operator
Zalando Cluster Configuration
https://github.com/zalando-incubator/kubernetes-on-aws
List of Organizations using Kubernetes on AWS
https://github.com/hjacobs/kubernetes-on-aws-users
QUESTIONS?
HENNING JACOBS
TECH INFRASTRUCTURE
CLOUD ENGINEER
henning@zalando.de
@try_except_
Illustrations by @01k

Large Scale Kubernetes on AWS at Europe's Leading Online Fashion Platform - Container Days Hamburg