2. there is always a reason behind...
Challenges of countless projects:
- application resiliency issues
- low resource utilization and cost inefficiency
- operational inefficiency from using a non-unified technology stack for managing different layers and
deploying applications
- low deployment velocity and elasticity
- security and compliance issues (host & app level access and audit)
- operational overhead in managing infrastructure
3. Fundamental Principles
● Cattle, not Pets
● Immutable Infrastructure
● Codified Infrastructure
● Golden Image
● OOB Resiliency
● OOB Telemetry
is there anything we can do in a dramatically
different way?
4. … for a single ultimate goal
let application developers focus on application development and business capabilities
... while somebody else (or something else) takes care of infrastructure maintenance, networking,
capacity planning, resiliency, telemetry, security and compliance, etc…
5. responding to the demand – K8S based
container management PaaS
- best-of-breed container scheduler – K8S
- KOPS and AWS based OSS K8S distribution
- OSS-based addons ecosystem (fluentd, weave scope, heapster, etc.)
- rolling cluster update to answer most of operational challenges
- unified addons, resources, applications and services deployment
(helm)
- 100% OSS, no proprietary closed products
- full CNCF K8S conformance (read as no lock-in, can migrate to other
distributions)
Source: https://redislabs.com/redis-enterprise-documentation/administering/kubernetes/upgrading-redis-enterprise-cluster-kubernetes-deployment-operator/
6. fundamental principles > platform capabilities
- Cattle Hosts → no pet hosts; any node can be killed at any time if it misbehaves. Workloads will be
rescheduled on alternative nodes
- Immutable Infrastructure → rolling cluster update. Through the mechanics of a rolling update, any
compliance, security hardening or configuration management issue is addressed. Lift & shift container
- Golden Image → baked into the cluster definition
- Codified Infrastructure → clusters, addons, resources, applications: all declaratively defined
- Built-in resiliency and telemetry → out-of-the-box open source addons that require little to no effort
on the product team side
Container management PaaS is essentially an integrated family of cloud-native capabilities that lets you
increase speed and reliability, improve security and focus on delivery
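The cattle-host and rolling-update principles on this slide can be illustrated with a small simulation. This is a minimal sketch, not the platform's actual tooling: the `Node` class, the least-loaded placement rule, and the `-r` suffix for replacement nodes are all assumptions made for illustration. It shows the core idea only: nodes are never patched in place, each one is replaced by a node built from the new image, and its workloads are rescheduled elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    image: str                                  # golden image the node was built from
    pods: list = field(default_factory=list)    # workloads currently on the node


def reschedule(pods, nodes):
    """Place each pod on the node that currently runs the fewest pods (assumed policy)."""
    for pod in pods:
        target = min(nodes, key=lambda n: len(n.pods))
        target.pods.append(pod)


def rolling_update(nodes, new_image):
    """Replace outdated nodes one at a time and return the resulting cluster."""
    cluster = list(nodes)
    for old in list(cluster):
        if old.image == new_image:
            continue                            # already built from the target image
        # Bring up the replacement first so capacity does not drop.
        cluster.append(Node(name=old.name + "-r", image=new_image))
        # Terminate the old node and move its workloads to the remaining nodes.
        cluster.remove(old)
        evicted, old.pods = old.pods, []
        reschedule(evicted, cluster)
    return cluster


if __name__ == "__main__":
    before = [Node("a", "v1", ["p1", "p2"]), Node("b", "v1", ["p3"])]
    for node in rolling_update(before, "v2"):
        print(node.name, node.image, node.pods)
```

Because every node ends up built from the same image, a hardening or configuration fix reaches the whole cluster simply by rolling it, which is the point the slide makes about compliance and configuration management.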
8. journey
- Oct 2016 – realized that there is a need for a container scheduler. Chose Rancher for its cloud- and
scheduler-agnostic approach
- June 2017 – realized that Rancher did not meet expectations (health + readiness
checks, granular control over workloads and rolling service updates)
- May – Sep 2017 – OpenShift evaluation
- Nov 2017 – decided on vanilla K8S, started POC
- Feb 2018 – started K8S productionalization
- Sep 2018 – finalizing productionalization
9. today
- Unified cluster operations, 24/7 monitoring with PD, office hours support in place
- 4 clusters in place (2 prod)
- Overall capacity of 50+ nodes
- 10+ products / services hosted
- Unified stack of addons for performance monitoring, DNS management, ingress controller, centralized logging
- 3 engineers + 1 architect in the team, 24/7 support included, to prove the validity of the fundamental
principles (and economy of scale!)
- Product teams are excited!
10. mistakes made
- not implementing a cluster-level DR strategy early enough (etcd backup) – we killed a cluster twice, both times
due to unexpected behavior of tooling (KOPS – split brain, HELM – resource termination during deletion of a
failed deployment) and overconfidence
- toolset overconfidence: chose Traefik as the ingress controller and ended up with 4 ingress controllers for a single
environment, HTTP + TCP, internal + external
- too broad a scope: monitoring and security addons, rich networking capabilities, cluster maintenance, team
support
- no time for OSS contribution – not a sustainable approach
- rolling updates are still hard – regular failures and the need for manual intervention, maintenance window
agreements, etc.
- Seamless Cloud Native and Cloud (AWS) integration is still a challenge:
- environment segregation (VPC or account based)
- provisioning of related services (RDS, RedShift, Lambdas) as part of unified deployment stack
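The first mistake on this slide, the missing etcd backup, is the easiest one to guard against early. The sketch below is one possible shape for a periodic snapshot job, not the team's actual setup: the function names, the file naming scheme and the retention count are assumptions. It relies only on the standard `etcdctl snapshot save` command of the etcd v3 API; TLS flags and shipping the snapshot off the cluster are left out and would be needed in practice.

```python
import os
import subprocess
import time
from pathlib import Path


def snapshot_command(endpoint, target):
    """Build the etcdctl (v3 API) command that saves a snapshot to `target`."""
    return ["etcdctl", "--endpoints", endpoint, "snapshot", "save", str(target)]


def prune(snapshots, keep):
    """Return the snapshot names to delete, keeping the `keep` newest.

    Names embed a sortable UTC timestamp, so lexical order is age order.
    """
    ordered = sorted(snapshots)
    return ordered[:-keep] if keep > 0 else ordered


def backup(endpoint, directory, keep=24):
    """Take one snapshot into `directory`, then drop the oldest ones."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    name = time.strftime("etcd-%Y%m%dT%H%M%S.db", time.gmtime())
    target = directory / name
    env = dict(os.environ, ETCDCTL_API="3")     # force the v3 API on older etcdctl builds
    subprocess.run(snapshot_command(endpoint, target), check=True, env=env)
    for stale in prune([p.name for p in directory.glob("etcd-*.db")], keep):
        (directory / stale).unlink()
    return target
```

Running `backup(...)` from a scheduled job on a machine that can reach an etcd member gives a restore point that does not depend on KOPS or Helm behaving as expected.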