
Kubernetes from Scratch at Veepee (Sysadmin Days 2019)



How we built Kubernetes from scratch at Veepee to deliver faster and with higher quality.



  1. Kubernetes from Scratch @Veepee
  2. SUMMARY: 1 Study Kubernetes components 2 Control plane deployment 3 Node architecture (network, security, runtime, proxy, ...) 4 Tools & operations (observability, isolation, discovery)
  3. Study Kubernetes components
  4. Components ● Control plane ○ Storage (etcd) ○ API ○ Scheduler ○ Controller-manager ● Nodes ○ Container runtime ○ Node agent (kubelet) ○ Service proxy ○ Network agent
  5. Components: storage ● Key-value store ● Raft-based distributed storage ● Client-to-server & server-to-server TLS support ● Incubating at the CNCF ● Project page: https://etcd.io/
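     To make the TLS support concrete, a minimal sketch of the server- and peer-side etcd TLS flags (certificate paths are hypothetical):

       etcd --name etcd-1 \
         --cert-file=/etc/etcd/pki/server.crt \
         --key-file=/etc/etcd/pki/server.key \
         --client-cert-auth=true \
         --trusted-ca-file=/etc/etcd/pki/ca.crt \
         --peer-cert-file=/etc/etcd/pki/peer.crt \
         --peer-key-file=/etc/etcd/pki/peer.key \
         --peer-client-cert-auth=true \
         --peer-trusted-ca-file=/etc/etcd/pki/ca.crt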
  6. Components: API server ● Stores data in etcd ● Stateless REST API ● HTTP/2 + TLS ● gRPC support: ○ WATCH events over HTTP ○ Reactive, event-based triggers on Kubernetes components
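     The WATCH mechanism is easy to observe from a shell; a sketch (the namespace and API endpoint are hypothetical):

       # List pods, then keep the HTTP/2 stream open to receive change events:
       kubectl get pods --namespace demo --watch
       # The same stream through the raw REST API:
       curl --cacert ca.crt --cert client.crt --key client.key \
         "https://apiserver.example:6443/api/v1/namespaces/demo/pods?watch=true"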
  7. Components: Scheduler ● Connected to the API server only ● Watches for pod objects ● Selects the node to run on based on criteria: ○ Hardware (available CPU, CPU architecture, available memory, disk space) ○ (Anti-)affinity patterns ○ Policy constraints (labels) ● 1 leader at a time (election token in etcd)
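     A hypothetical pod spec illustrating the three criteria families the scheduler weighs:

       apiVersion: v1
       kind: Pod
       metadata:
         name: demo
         labels:
           app: demo
       spec:
         containers:
         - name: demo
           image: registry.example/demo:1.0
           resources:
             requests:            # hardware criteria: CPU & memory the node must have free
               cpu: 500m
               memory: 256Mi
         nodeSelector:            # policy constraint: only nodes carrying this label
           disktype: ssd
         affinity:
           podAntiAffinity:       # anti-affinity: avoid nodes already running app=demo
             requiredDuringSchedulingIgnoredDuringExecution:
             - labelSelector:
                 matchLabels:
                   app: demo
               topologyKey: kubernetes.io/hostname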
  8. Components: Controller manager ● Core controllers: ○ Node: monitors node status responses ○ Replication: ensures the desired pod count on replication controllers ○ Endpoints: maintains Endpoints objects for Services ○ Namespace: creates default ServiceAccounts & tokens ● 1 leader at a time (election token in etcd)
  9. Node components ● Container runtime: runs containers (Docker, containerd...) ● Node agent: connects to the API server to handle containers & volumes ● Service proxy: load-balances service IPs to pod endpoints ● Network agent: connects nodes together (flannel, calico, kube-router...)
  10. Control plane deployment
  11. Datacenter deployment ● 3 Kubernetes clusters per datacenter: ○ Benchmark ○ Staging ○ Production ● No cross-DC cluster: no DC split-brain situation to manage
  12. etcd deployment ● 3 etcd nodes per datacenter ○ TLSv1.2 enabled ○ Authentication through TLSv1.2 ○ Hardware: 4 CPU, 32 GB RAM ○ OS: Debian 10.1 ○ Version 3.4: ■ reduced latency ■ big write-performance improvements ■ reads not affected by commits ■ will be the default version as of K8s 1.17 ■ See: https://kubernetes.io/blog/2019/08/30/announcing-etcd-3-4/
  13. API server deployment ● API version: 1.15.x (old clusters) and 1.16.x (new clusters) ● 2 API servers load-balanced by haproxy (TCP mode) ○ Horizontally scalable ○ Vertically scalable ○ Current setup: 4 CPU, 32 GB RAM ○ OS: Debian 10.1 ● The API servers load-balance etcd themselves ○ We discovered a bug in K8s < 1.16.3 when using TLS; ensure you run at least this version ○ Issue: https://github.com/kubernetes/kubernetes/issues/83028
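     A minimal sketch of what the TCP-mode haproxy front end can look like (addresses and names are hypothetical):

       frontend k8s-api
           bind *:6443
           mode tcp
           default_backend k8s-apiservers
       backend k8s-apiservers
           mode tcp
           balance roundrobin
           server apiserver1 10.0.0.11:6443 check
           server apiserver2 10.0.0.12:6443 check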
  14. API server deployment
  15. API server deployment ● Enabled/enforced features (admission controllers): ○ LimitRanger: resource-limitation validator ○ NodeRestriction: limits kubelet permissions on node/pod objects ○ PodSecurityPolicy: security policies for running pods ○ PodNodeSelector: limits node selection for pods ● See the full list of admission controllers here: ○ https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers ● Enabled extra feature: secret encryption in etcd with AES-256
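     The secret-encryption feature is driven by a file passed to the API server via --encryption-provider-config; a sketch with a placeholder key (aescbc with a 32-byte key gives the AES-256 mentioned above):

       apiVersion: apiserver.config.k8s.io/v1
       kind: EncryptionConfiguration
       resources:
       - resources:
         - secrets
         providers:
         - aescbc:
             keys:
             - name: key1
               secret: <base64-encoded 32-byte key>
         - identity: {}   # fallback: still read secrets written before encryption was enabled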
  16. Controller-manager & scheduler deployment ● 3 nodes per DC ○ Each runs a scheduler ○ Each runs a controller-manager ○ Hardware: 2 CPU, 8 GB RAM ○ OS: Debian 10.1
  17. Controller-manager & scheduler deployment ● Enabled features on the controller-manager: all defaults plus ○ BootstrapSigner: authenticates kubelets on cluster join ○ TokenCleaner: cleans expired tokens ● Supplementary features on the scheduler: ○ NodeRestrictions: restricts pods to some nodes
  18. Control plane global overview
  19. Node architecture: network, security, runtime, proxy, ...
  20. Node architecture: container runtime ● Valid choice: Docker (https://www.docker.com/) ○ The default one ○ Known by “everyone” in the container world ○ Owned by a company ○ Simple to use
  21. Node architecture: container runtime ● Valid choice: containerd (https://containerd.io/) ○ Younger than Docker ○ Extracted from Docker ○ CNCF project ○ Some limitations: ■ No Docker API v1! ■ K8s integration poorly documented
  22. Node architecture: container runtime ● Veepee choice: containerd ○ Supported by the CNCF and the community ○ Used by Docker as its underlying container runtime ○ We use Artifactory, and Docker API v2 is fully supported ○ Smaller footprint, less code, lower latency for the kubelet
  23. 23. Node architecture: system configuration ● Pod DNS configuration ○ clusterDomain: root DNS name for the pods/services ○ clusterDNS: DNS servers configured on pods ■ except if hostNetwork: true and pod DNS policy is default ● Protect system from pods: Ensure node system daemons can run ■ 128Mio memory reserved ■ 0.2 CPU reserved ■ Disk soft & hard limits ● Soft: don’t allow new pods to run if limit reached ● Hard: evict pods if limit reached
  24. Node architecture: service proxy ● Exposes K8s service IPs on nodes to access pods ● Multiple ways ○ iptables ○ IPVS ○ External load balancer (e.g. AWS ELB in layer 4 or layer 7) ● Multiple implementations ○ kube-proxy (iptables, IPVS) ○ kube-router (IPVS) ○ Calico ○ ...
  25. Node architecture: service proxy ● Veepee choice: kube-proxy ○ Stays close to the Kubernetes distribution: don't add more complexity ○ No default need for layer 7 load balancing (service type: LoadBalancer); it can be added as an extra proxy in the future ○ Next challenge: iptables vs IPVS
  26. Node architecture: kube-proxy mode ● kube-proxy: iptables mode ○ Default recommended mode (faster) ○ Works quite well... but: ■ Doesn't integrate with Debian 10 and later (thanks to the Debian iptables-to-nftables migration) => restore the legacy iptables mode (see below) ■ Has locking problems when multiple programs need it ● https://github.com/weaveworks/weave/issues/3351 ● https://github.com/kubernetes/kubernetes/issues/82587 ● https://github.com/kubernetes/kubernetes/issues/46103 ■ We need kube-proxy and Kubernetes Network Policies ■ We must take care of conntrack :(
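     On Debian 10, the workaround is switching the iptables tooling back to the legacy backend so kube-proxy's rules and the host's nftables layer don't diverge:

       update-alternatives --set iptables /usr/sbin/iptables-legacy
       update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy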
  27. Node architecture: kube-proxy mode ● kube-proxy: IPVS mode ○ Works well technically (no locking issues/hacks!) ○ ipvsadm is a much better friend than iptables -t nat ○ IPVS is also chosen by other tools like kube-router ○ Calico's performance comparison convinced us (https://www.projectcalico.org/comparing-kube-proxy-modes-iptables-or-ipvs/)
  28. Node architecture: kube-proxy mode ● Veepee final choice: kube-proxy + IPVS
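     The mode is selected in the kube-proxy configuration; a minimal sketch (the scheduler value is illustrative):

       apiVersion: kubeproxy.config.k8s.io/v1alpha1
       kind: KubeProxyConfiguration
       mode: "ipvs"
       ipvs:
         scheduler: "rr"    # round-robin across pod endpoints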
  29. Node architecture: network layer ● Interconnects nodes ○ Ensures pod-to-pod and pod-to-service communication ○ Can be fully private (our choice) or shared with the regular network ● Various ways to achieve it ○ Static routing ○ Dynamic routing (generally BGP) ○ VXLAN VPN ○ IPIP VPN ● Multiple ways to allocate node CIDRs ○ Statically (enjoy) ○ Dynamically
  30. Node architecture: network layer Warning: reading this slide can make your network engineers crazy ● Allocate two CIDRs for your cluster ○ 1 for nodes and pods ○ 1 for service IPs ● Don't be conservative, give thousands of IPs to K8s; each node requires a /24 ○ CIDR /14 for nodes and pods (up to 1024 nodes) ○ CIDR /16 for services (service IP randomness party)
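     This sizing plan translates into kube-controller-manager flags; a sketch where the CIDR values are illustrative, sized as on the slide:

       # --cluster-cidr: pods (a /14 yields 1024 per-node /24s)
       # --service-cluster-ip-range: service IPs
       kube-controller-manager \
         --allocate-node-cidrs=true \
         --cluster-cidr=10.32.0.0/14 \
         --node-cidr-mask-size=24 \
         --service-cluster-ip-range=10.96.0.0/16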
  31. Node architecture: network layer ● Needs: ○ Each solution must learn the current node's CIDR through the API ○ Network mesh setup should be automagic ● Select the right solution ○ Flannel (the default recommended one): VXLAN, host-gw ○ kube-router: IPIP or BGP ○ Calico: IPIP ○ WeaveNet: VXLAN
  32. Node architecture: network layer First test: flannel in VXLAN mode ● Works quite well ● Very easy setup: kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml ● Yes, it's like curl blah | bash ● No, we didn't install it like this :)
  33. Node architecture: network layer First test: flannel in VXLAN mode (https://github.com/coreos/flannel) ● Before a big sale, we load-tested an app and... saw very bad network performance on nodes ○ iperf showed the outside network was good, around 9.8 Gbps out of 10 Gbps ○ Node-to-pod performance was at maximum too ○ Node-to-node over the regular network is around 9.7 Gbps ○ Node-to-node over VXLAN is around 3.2 Gbps, with very high kernel load ○ Investigating the recommended way to run VXLAN: offload VXLAN to the network card ○ Not possible in our case, as we run libvirt/KVM VMs; discard VXLAN
  34. Node architecture: network layer Second test: kube-router in BGP mode (https://www.kube-router.io/) ● Drops the need for network-card offloading ● Easy setup too: kubectl apply -f https://raw.githubusercontent.com/cloudnativelabs/kube-router/master/daemonset/kube-router-all-service-daemonset.yaml ● Don't forget to read the YAML and ensure you publish to the right cluster :) ● As suspected, using BGP restores the full bandwidth capacity ● Other interesting features: ○ Service proxy (IPVS) ○ Network Policy support ○ Network LB using BGP
  35. Node architecture: network layer ● Our choice: kube-router ○ BGP is very nice ○ We can extend BGP to the fabric if needed in the future ○ We need network-policy isolation for some sensitive apps (example below) ○ One binary for both the network mesh and policies: less maintenance
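     The isolation mentioned above uses standard Kubernetes NetworkPolicy objects that kube-router enforces; a hypothetical example allowing only frontend pods to reach payment pods:

       apiVersion: networking.k8s.io/v1
       kind: NetworkPolicy
       metadata:
         name: allow-frontend-only
         namespace: payments
       spec:
         podSelector:
           matchLabels:
             app: payments
         policyTypes:
         - Ingress
         ingress:
         - from:
           - podSelector:
               matchLabels:
                 app: frontend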
  36. Tools & operations: DNS, metrology, logging, ...
  37. Kubernetes is not magic: tooling. With the previous setup we have: ● API ● Container scheduling ● Network communication. We still have some limits: ● No access from outside ● No DNS resolution ● No metrology/alerting ● Volatile logging on nodes
  38. Tooling: DNS resolution Two methods: ● External, using the host resolv.conf: no DNS for intra-cluster communication; we can use DNS for external resources only ● Internal: in-cluster DNS records, enabling service discovery ○ We need it, go ahead
  39. Tooling: DNS resolution Two main solutions: ● kube-dns: the legacy one, should not be used for new clusters ○ dnsmasq C layer, single-threaded ○ 3 containers for a single daemon? ● CoreDNS: the modern one ○ Multithreaded Golang implementation (goroutines) ○ 1 container only ● Some benchmarks (from the CoreDNS team, so be careful): ○ https://coredns.io/2018/11/27/cluster-dns-coredns-vs-kube-dns/
  40. Tooling: DNS resolution ● CoreDNS is the more reasonable choice ● Our deployment: ○ Deployed as a Kubernetes Deployment ○ Runs on master nodes (3 pods) ○ Configured as the default DNS service on every kubelet
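     For reference, a minimal Corefile sketch close to the stock in-cluster configuration (cluster.local matches the kubelet clusterDomain shown earlier):

       .:53 {
           errors
           health
           kubernetes cluster.local in-addr.arpa ip6.arpa {
               pods insecure
               fallthrough in-addr.arpa ip6.arpa
           }
           prometheus :9153
           forward . /etc/resolv.conf
           cache 30
           loop
           reload
       }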
  41. Tooling: access from outside Ingress: access from outside the cluster. Various choices on the market: ● Nginx (the default one) ● Traefik ● Envoy ● Kong ● Ambassador ● HAProxy ● And more...
  42. Tooling: access from outside We studied five: ● Ambassador: promising but very young (https://www.getambassador.io/) ● Nginx: the OSS model is unclear since F5 bought Nginx, Inc. (http://nginx.org/) ● HAProxy: mature product, but its ingress is very young, and so is its HTTP/2 and gRPC support (http://www.haproxy.org/) ● Kong: built on top of Nginx; not general-purpose, but can be a very nice API gateway (https://konghq.com/kong/) ● Traefik: good licensing, mature, and updated regularly (https://traefik.io/)
  43. Tooling: access from outside Because of risks on some products, we benchmarked Traefik: ● Kubernetes API ready ● HTTP/2 ready ● TLS 1.3 ready (Veepee minimum: TLS 1.2) ● Scalable & reactive configuration deployments ● TLS certificate reconfiguration in less than 10 s ● Raw TCP/UDP balancing (Traefik v2)
  44. Tooling: access from outside Traefik bench: ● Very good performance in the lab: ○ Tested using the k6 and ab tools ○ The test backend was a raw Golang HTTP service ○ HTTP: up to 10 krps with 2 pods on a VM with 1 CPU and 2 GB RAM ○ HTTPS: up to 6.3 krps with 2 pods on a VM with 1 CPU and 2 GB RAM ○ Scaling pods doesn't increase performance; anyway, it's sufficient
  45. Tooling: access from outside Traefik bench: ● Load testing with a real product: ○ More than 1 krps ○ A not-so-recent .NET Core app ○ The app isn't container-aware and suffers from some contention ○ Anyway, the rate is sufficient for the sale: go ahead to prod ○ On a big event sale we sold ~32k concert tickets in 1h40 without problems
  46. Tooling: access from outside Traefik bench: ● Before the production sale: ○ We increased the nodes from 2 to 3 ○ We increased the application from 2 to 10 instances ● Production sale day (starting at 7 am): ○ No incident ○ We sold 32k concert tickets in 1h40
  47. Tooling: metrology/alerting Need: ● Collect metrics on pods to draw nice graphs. Solution: ● One solution to rule them all
  48. Tooling: metrology/alerting Implementation: ● Pods expose a /metrics endpoint through their HTTP listener ● Prometheus scrapes it ● Writing Prometheus scraping configuration by hand is painful ● Thankfully there is https://github.com/coreos/kube-prometheus
  49. Tooling: metrology/alerting ● kube-prometheus implementation: ○ HA Prometheus instances ○ HA Alertmanager instances ○ Grafana for a local metrics view (not reusable for something else) ○ Gathers node metrics ○ ServiceMonitor Kubernetes API extension object (example below)
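     A hypothetical ServiceMonitor: Prometheus scrapes /metrics on every Service matching the selector, so no hand-written scrape configuration is needed (all names are illustrative):

       apiVersion: monitoring.coreos.com/v1
       kind: ServiceMonitor
       metadata:
         name: demo-app
         namespace: monitoring
       spec:
         selector:
           matchLabels:
             app: demo-app
         namespaceSelector:
           matchNames:
           - production
         endpoints:
         - port: http-metrics     # named port on the Service
           path: /metrics
           interval: 30s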
  50. Tooling: metrology/alerting Pod discovery
  51. Tooling: metrology/alerting Veepee ecosystem integration
  52. Tooling: metrology/alerting Pod resource overview
  53. Tooling: metrology/alerting kube-prometheus graphs (+ some custom ones)
  54. Tooling: logging How do we retrieve logs properly? ● Logging is volatile in containers ● On Docker hosts: just mount a volume from the host and write to it ● On K8s: I don't know where my container runs, I don't know the host, the host doesn't want me to write to it, help me doctor!
  55. Tooling: logging ● You can prevent open-heart surgery in production by knowing the rules
  56. Tooling: logging ● Never write logs to disk ○ If you need to, use a sidecar to read them, and don't forget rotation! ● Write to stdout/stderr in a parsable way ○ JSON to the rescue: known by every development language, easy to serialize & implement ● Choose a software to gather container logs and ship them: ○ filebeat ○ fluentd ○ fluentbit ○ logstash
  57. Tooling: logging ● Our choice: fluentd ○ CNCF graduated (https://www.cncf.io/announcement/2019/04/11/cncf-announces-fluentd-graduation/) ○ Some features we need in fluentd are not in fluentbit ○ Already used by many SREs at Veepee ● Our deployment model: a K8s DaemonSet (sketched below) ○ Rolling-upgrade flexibility ○ Ensures logs are gathered on each running node ○ Ensures the configuration is the same everywhere
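     A skeleton of such a log-collector DaemonSet, one fluentd pod per node reading the node's log directory (image tag, namespace, and names are hypothetical):

       apiVersion: apps/v1
       kind: DaemonSet
       metadata:
         name: fluentd
         namespace: logging
       spec:
         selector:
           matchLabels:
             app: fluentd
         updateStrategy:
           type: RollingUpdate    # the rolling-upgrade flexibility above
         template:
           metadata:
             labels:
               app: fluentd
           spec:
             containers:
             - name: fluentd
               image: fluent/fluentd:v1.7
               volumeMounts:
               - name: varlog
                 mountPath: /var/log
                 readOnly: true
             volumes:
             - name: varlog
               hostPath:
                 path: /var/log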
  58. Tooling: logging Fluentd object deployment
  59. Tooling: logging Fluentd log ingestion pipeline
  60. Tooling: client/product isolation Need: ● Ensure a client or product will not steal another's CPU/memory/disk resources. Two work axes: ● Node-level isolation ● Pod-level isolation
  61. Tooling: client/product isolation Work axis: node level ● Ensure a client (tribe) or a product owns the underlying node ● Billing per customer ● Resources per customer, then per SRE team Solution: ● Use an enforced NodeSelector on namespaces: scheduler.alpha.kubernetes.io/node-selector: k8s.veepee.tech/tribe=foundation,k8s.veepee.tech=platform ○ Pods can only be scheduled on a node carrying at least those labels
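     As a manifest, the enforced selector is just a namespace annotation; it relies on the PodNodeSelector admission controller enabled earlier (the namespace name is hypothetical):

       apiVersion: v1
       kind: Namespace
       metadata:
         name: platform
         annotations:
           scheduler.alpha.kubernetes.io/node-selector: k8s.veepee.tech/tribe=foundation,k8s.veepee.tech=platform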
  62. Tooling: client/product isolation Work axis: pod level ● Ensure pods are not stealing other pods' resources ● Ensure scheduling makes the right node choice according to available resources ● Forbid pod allocation if no resources are available (no overcommit) Solution: ● LimitRanges (sketched below)
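     A LimitRange sketch with illustrative values: the defaults are injected into containers that declare nothing, and max is a hard per-container ceiling:

       apiVersion: v1
       kind: LimitRange
       metadata:
         name: container-limits
         namespace: platform
       spec:
         limits:
         - type: Container
           default:             # limit applied when the container declares none
             cpu: 500m
             memory: 512Mi
           defaultRequest:      # request applied when the container declares none
             cpu: 100m
             memory: 128Mi
           max:                 # hard ceiling per container
             cpu: "2"
             memory: 2Gi

     Setting defaultRequest equal to default would be the strict no-overcommit variant, since requests would then always equal limits.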
  63. Tooling: client/product isolation Applied LimitRanges
  66. Questions?
  67. THANK YOU
