1. Kubernetes Day 2 @ ZSE Energia, a.s.
Miro Toma
November 10th, 2021
2. About
Me
• IT nerd since the dawn of time
• 25 years professional experience
• Held various positions covering most functions in IT
stack
• Passionate about tech & new trends
• Stirring the IT pot in the utilities sector since 2014
ZSE Energia, a.s.
• Major energy supplier in Slovakia
• Part of larger ZSE group
• Commercial company (not state managed!)
• Small internal IT unit
• Heavy reliance on vendors (not a dev shop)
3. The (somewhat) accelerated journey
Day 0 & 1 - Now or never
• K8s incepted as a target platform for an ongoing high-profile project
• Severely limited infrastructure support capacities (human) at the time [couldn’t deploy on ‘classic’ VMs]
• Anticipated uptime requirements
Day 2 start – Apr 2019
• ingress
• logs (Fluentd, Elasticsearch, Kibana)
• 1 app namespace
• no native monitoring* (!DON’T!)
* trivial heartbeat monitoring with Zabbix
Later that (2nd) day..
• elasticsearch->opendistro->opensearch
• fluentd->fluent-bit
• vendor namespaces (SaaS model with ‘our’ infrastructure)
• calico (cluster reinstall)
• cert-manager
• prometheus/alert manager/grafana
• real backups (!)
• zookeeper
• kafka
Day 0 to Day 2 in <6 months
4. Backups
• “CI/CD pipeline will take care of the cluster rebuild”
• Until it won’t:
• persistent volumes
• manual tweaks (don’t!)
• ..
• Solutions exist to take whole-cluster backups, including volumes (sketch below)
• Use-case – migrate cluster between cloud subscriptions
• migration supported by the cloud vendor for the majority of resources
• but not Kubernetes (!)
• 4 hours vs. multi-month project
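The talk doesn’t name the backup tool, so what follows is only a minimal sketch, assuming Velero (one common option) with a volume snapshot plugin installed; the schedule, TTL and names are illustrative:

```yaml
# Hypothetical nightly whole-cluster backup, persistent volumes included
# (assumes Velero and a volume snapshotter are already installed in the cluster).
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-full-backup        # illustrative name
  namespace: velero
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  template:
    includedNamespaces:
      - "*"                        # everything, including vendor/app namespaces
    snapshotVolumes: true          # don't forget the persistent volumes
    ttl: 720h                      # keep backups for ~30 days
```

Restoring such a backup into a fresh cluster in another subscription is roughly the “4 hours” path mentioned above.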
5. Don’t Question Your Vendor’s Infrastructure Sizing
• Obscene asks for CPU and memory
• Questioning never led to a significant difference
Example project ask: a two-machine cluster with 4 CPU and 16 GB RAM each. Real-life usage: ~0.1 CPU (10% of a single CPU) and ~1.2 GB RAM.
Deploy and set real quotas afterwards (sketch below)
• real-world usage is a fraction of the original ask (no exceptions yet)
• should things go south, you can tune on the fly
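A minimal sketch of the “set real quotas afterwards” step; the namespace, name and numbers below are illustrative, not the project’s actual figures:

```yaml
# Hypothetical per-namespace quota applied after observing real consumption,
# a fraction of the vendor's original 2 x (4 CPU / 16 GB) ask.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: vendor-app-quota           # illustrative name
  namespace: vendor-app            # illustrative namespace
spec:
  hard:
    requests.cpu: "500m"
    requests.memory: 2Gi
    limits.cpu: "2"                # headroom to tune on the fly if things go south
    limits.memory: 4Gi
```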
6. Budget for Disruptions, Promote ‘Aversion’
• Define disruption budgets (religiously)
• beta since 1.5; GA since 1.21
• your app won’t suddenly disappear during a node drain
• Strive to distribute pods across multiple nodes
• use podAntiAffinity as a rule (sketch below)
• consider using descheduler
• Sample scenario (real life):
1. all ingress pods eventually ended up running on a single node
2. drain the specific node hosting all ingress pods
3. no ingress (i.e. ‘cluster is down’) for a not-insignificant amount of time
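A minimal sketch of both measures applied to an ingress controller like the one in the scenario above; the labels, names and image are illustrative:

```yaml
# Hypothetical disruption budget: keep at least one ingress pod during node drains.
# (policy/v1 requires Kubernetes 1.21+; use policy/v1beta1 on older clusters.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-pdb
  namespace: ingress
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ingress-controller
---
# Hypothetical deployment: required podAntiAffinity spreads the replicas across
# nodes, so draining a single node can never take out all ingress pods at once.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller
  namespace: ingress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: ingress-controller
              topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
        - name: controller
          image: nginx:1.25                         # placeholder for a real ingress controller image
```

With *required* anti-affinity, the node count must at least match the replica count or pods stay Pending; `preferredDuringSchedulingIgnoredDuringExecution` is the softer alternative.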
7. Let Them Die Peacefully
• The default 30-second termination timeout may not suit every workload
• Long running consumer queries
• Lengthy cleanup processes (e.g. to keep PVs consistent)
• Hooks delaying the TERM signal eat into the total budget
• Use a rather generous terminationGracePeriodSeconds (sketch below)
• should the container terminate earlier, the control plane will notice
• Not everyone plays nice with TERM
• Use preStop hooks
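A minimal sketch of a generous grace period plus a preStop hook for a container that doesn’t handle TERM well; the image and cleanup script are hypothetical:

```yaml
# Hypothetical long-running consumer: generous terminationGracePeriodSeconds and
# a preStop hook for cleanup. The preStop hook runs before TERM is delivered,
# and its duration counts against the same grace period budget.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-consumer                     # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: queue-consumer
  template:
    metadata:
      labels:
        app: queue-consumer
    spec:
      terminationGracePeriodSeconds: 300   # generous; if the container exits earlier, the control plane notices
      containers:
        - name: consumer
          image: example.com/queue-consumer:1.0                      # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/drain-and-flush.sh"]   # hypothetical cleanup script
```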
8. Dying Containers Won’t Accept New Work
• Updating Deployments, StatefulSets, kubectl delete pod xxx & co
• ‘Terminating’ a pod:
• containers receive the TERM signal -> stop accepting new requests
• the network (CNI) starts converging endpoints/services in parallel
• until converged, new requests are still routed to the terminating pods, which deny them
• use preStop hooks to delay TERM, giving the network time to converge (sketch below)
• we don’t want, and can’t really enforce, a dependency on isolating a pod from the network before shutting it down (split-brain situations)
• 8 secs worked fine so far (with exceptions)
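A minimal sketch of the TERM-delay trick with the ~8 second value from above; the deployment name and image are illustrative:

```yaml
# Hypothetical web deployment: the preStop sleep keeps the container serving for
# a few seconds while endpoints/services converge and stop routing new requests
# to the terminating pod. Only then does TERM arrive.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend                 # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.25          # illustrative image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 8"]   # delay TERM by ~8 seconds
```

The sleep also counts against terminationGracePeriodSeconds, so size that budget accordingly.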
9. Cluster upgrades
• Started @1.15, now on 1.20
• Upgrading a managed cluster is a breeze – until it isn’t
• fairly complex process: on a managed cluster you don’t get all the knobs and buttons to comfortably identify/fix an issue
• two incidents so far:
• medium 1.16 -> 1.17 (upgrade stopped in the middle; documented fix/workaround)
• huge 1.19 -> 1.20 (internal cluster network went south, node pool ‘failed’)
• Both issues traced to node drain timeouts
• the provider’s upgrade scripts define a (weakly documented) node drain timeout for upgrades
• longer termination periods, multiplied by disruption budgets, prolong node drains
• Current approach:
• upgrade control plane first (separately)
• create new node pool(s) at the upgraded version
• manually drain old nodes
• delete old pools
10. Some Major Roads NOT Taken
Helm
• Initial eval with v2 (might take a different twist now)
• Many charts ‘opinionated’
• Some charts drag in dependencies we didn’t want
Operator frenzy (i.e. operator for everything)
• Many operators undergoing major revisions (would be hard to keep up)
• Many offerings for the same use-case, frequently none matching all of our requirements
• A single manifest modification/deletion may evaporate your service in an instant
Pre-packaged pipelines (e.g. Banzai)
• Very early in development at the time
Note: These decisions were taken based on the situation around 2018/2019. Some will be revisited in due course
11. Some More Takeaways
• Don’t rush Day 2
• Dedicate resources for day 0 & 1
• Day-to-day ops are surprisingly modest
• Adoption by ‘traditional’ IT departments may be a journey on its own…
• Local market uptake for K8s (still) lagging
• pushing & training vendors to adopt K8s
• some vendors still ‘resist’, but some became proponents
• Stay cloud-agnostic
• minimize utilization of cloud-specific services