Container Management Platform
[K8S]
PaaS Journey
By Uladzimir Palkhouski
https://www.linkedin.com/in/uladzimirpalkhouski/
there is always a reason behind...
Challenges of countless projects:
- application resiliency issues
- low resource utilization and cost inefficiency
- operational inefficiency from a non-unified technology stack for managing different layers and
deploying applications
- low deployment velocity and elasticity
- security and compliance issues (host & app level access and audit)
- operational overhead in managing infrastructure
Fundamental Principles
● Cattle, not Pets
● Immutable Infrastructure
● Codified Infrastructure
● Golden Image
● OOB Resiliency
● OOB Telemetry
is there anything we can do in a dramatically
different way?
… for a single ultimate goal
let application developers focus on application development and business capabilities
... while somebody else (or something else) takes care of infrastructure maintenance, networking,
capacity planning, resiliency, telemetry, security and compliance, etc…
responding to the demand – K8S based
container management PaaS
- best-of-breed container scheduler – K8S
- KOPS and AWS based OSS K8S distribution
- OSS-based addons ecosystem (fluentd, weave scope, heapster, etc.)
- rolling cluster update to answer most operational challenges
- unified addons, resources, applications and services deployment
(helm)
- 100% OSS, no proprietary closed products
- full CNCF K8S conformance (read: no lock-in, can migrate to other distributions)
Source: https://redislabs.com/redis-enterprise-documentation/administering/kubernetes/upgrading-redis-enterprise-cluster-kubernetes-deployment-operator/
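The helm-based unified deployment mentioned above can be sketched roughly as follows; chart, release, and namespace names are hypothetical illustrations, not from the deck:

```shell
# Hypothetical sketch: addons and applications deployed through the same
# helm workflow. Chart, release, and namespace names are illustrative.

# Deploy a cluster addon as a chart:
helm upgrade --install logging stable/fluentd --namespace kube-system

# Deploy a product team's application the same way, with per-environment values:
helm upgrade --install my-service ./charts/my-service \
  --namespace my-team \
  --values environments/prod/values.yaml
```

`helm upgrade --install` is idempotent (install if absent, upgrade otherwise), which is what makes a single command cover addons, resources, and applications alike.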
fundamental principles > platform capabilities
- Cattle Host → no pet hosts; any misbehaving node can be killed at any time, and its workloads will be
rescheduled on alternative nodes
- Immutable Infrastructure → rolling cluster update. Any compliance, security hardening, or configuration
management issue is addressed through the mechanics of a rolling update; containers are simply lifted and shifted onto fresh nodes
- Golden Image → baked into the cluster definition
- Codified infrastructure → clusters, addons, resources, applications - all declaratively defined
- Built-in resiliency and telemetry → out-of-the-box open source addons that require little to no effort
on the product team side
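A minimal sketch of what "declaratively defined" means for a workload, assuming a hypothetical service and image (not from the deck); the probes are what let the cluster deliver the out-of-the-box resiliency described above:

```shell
# Hypothetical example: a workload is a declarative manifest applied to the
# cluster, not a sequence of imperative provisioning steps.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # illustrative name
spec:
  replicas: 3                 # the scheduler keeps 3 replicas running
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: example.com/my-service:1.0.0  # illustrative image
        readinessProbe:       # gate traffic until the app reports ready
          httpGet:
            path: /ready
            port: 8080
        livenessProbe:        # restart the container if it stops responding
          httpGet:
            path: /healthz
            port: 8080
EOF
```

If a node is killed, the scheduler reconciles the declared replica count on the remaining nodes with no operator action, which is exactly the cattle-host behavior above.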
Container management PaaS is essentially an integrated family of cloud-native capabilities that lets you
increase speed and reliability, improve security and focus on delivery
capabilities overview
journey
- Oct 2016 – realized there is a need for a container scheduler. Chose Rancher for its cloud- and scheduler-
agnostic approach
- June 2017 – realized that Rancher does not deliver in accordance with expectations (health + readiness
checks, granular control over workloads and rolling service updates)
- May – Sep 2017 – OpenShift evaluation
- Nov 2017 – decided on vanilla K8S, started a POC
- Feb 2018 – started K8S productionization
- Sep 2018 – finalizing productionization
today
- Unified cluster operations, 24/7 monitoring with PD, office hours support in place
- 4 clusters in place (2 prod)
- Overall capacity of 50+ nodes
- 10+ products / services hosted
- Unified stack of addons for performance monitoring, DNS management, ingress controller, centralized logging
- 3 engineers + 1 architect in the team - 24/7 support included - to prove validity of fundamental principles
(and economy of scale!)
- Product teams are excited!
mistakes made
- not implementing a cluster-level DR strategy early enough (etcd backup) – we killed a cluster twice, both times
due to unexpected tooling behavior (KOPS – split brain; HELM – resource termination during deletion of a
failed deployment) and overconfidence
- toolset overconfidence: took Traefik as the ingress controller and ended up with 4 ingress controllers for a single
environment (HTTP + TCP, internal + external)
- too broad scope: monitoring and security addons, rich networking capabilities, clusters maintenance, teams
support
- no time for OSS contribution – not sustainable approach
- rolling update is still hard – regular failures and need for manual interventions, maintenance windows
agreements, etc.
- Stitch-free Cloud Native and Cloud (AWS) Integration is still a challenge:
- environment segregation (VPC or account based)
- provisioning of related services (RDS, RedShift, Lambdas) as part of unified deployment stack
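The etcd backup gap named first in the list is usually closed with a scheduled snapshot, which makes even a fully dead cluster recoverable. A hedged sketch, assuming the etcdctl v3 API; the endpoint, certificate paths, and S3 bucket name are illustrative:

```shell
# Illustrative etcd snapshot backup/restore; all paths, the endpoint, and the
# S3 bucket are hypothetical, flags are from the etcdctl v3 API.
export ETCDCTL_API=3

# Take a point-in-time snapshot of the cluster state:
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/client.crt \
  --key=/etc/etcd/client.key \
  snapshot save /var/backups/etcd-$(date +%F).db

# Ship the snapshot off the host so it survives total cluster loss:
aws s3 cp /var/backups/etcd-$(date +%F).db s3://example-etcd-backups/

# On rebuild, restore into a fresh data dir and start etcd from it:
etcdctl snapshot restore /var/backups/etcd-2018-09-01.db \
  --data-dir /var/lib/etcd-restored
```

Run on a schedule (cron or a Kubernetes CronJob), this would have turned both cluster-kill incidents above into restores rather than rebuilds.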
positioning against competitive offerings
what would you like to see next time?
- Automatic cluster provisioning and rolling update mechanics DEMO (Terraform, KOPS, Jenkins) ?
- Rich application deployment capabilities DEMO (HELM, Kubectl, Jenkins) ?
- Routing & Networking techniques DEMO (Ingress Controllers, DNS Management) ?
- Telemetry capabilities DEMO (DataDog, Weave.Scope, Prometheus) ?
