Kubernetes Day 2 @ ZSE Energia, a.s.
Miro Toma
November 10th, 2021
About
Me
• IT nerd since the dawn of time
• 25 years professional experience
• Held various positions covering most functions in the IT stack
• Passionate about tech & new trends
• Stirring the IT pot in the utilities sector since 2014
ZSE Energia, a.s.
• Major energy supplier in Slovakia
• Part of larger ZSE group
• Commercial company (not state managed!)
• Small internal IT unit
• Heavy reliance on vendors (not a dev shop)
The (somewhat) accelerated journey
Day 0 & 1 - Now or never
• K8s chosen as the target platform for an ongoing high-profile project
• Severely limited (human) infrastructure support capacity at the time [couldn’t deploy on ‘classic’ VMs]
• Anticipated uptime requirements
Day 2 start – Apr 2019
• ingress
• logs (Fluentd, Elasticsearch, Kibana)
• 1 app namespace
• no native monitoring* (!DON’T!)
* trivial heartbeat monitoring with Zabbix
Later that (2nd) day..
• elasticsearch->opendistro->opensearch
• fluentd->fluent-bit
• vendor namespaces (SaaS model with ‘our’
infrastructure)
• calico (cluster reinstall)
• cert-manager
• prometheus/alert manager/grafana
• real backups (!)
• zookeeper
• kafka
Day 0 to Day 2 in <6 months
Backups
• “CI/CD pipeline will take care of the cluster rebuild”
• Until it won’t:
• persistent volumes
• manual tweaks (don’t!)
• ..
• Solutions exist to take whole-cluster backups, including volumes
• Use-case – migrate cluster between cloud subscriptions
• migration supported by cloud vendor for majority of resources
• but not Kubernetes (!)
• 4 hours vs. multi-month project
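The slides do not name the backup tool. As a rough sketch only, assuming Velero is the whole-cluster backup solution and is already installed in a `velero` namespace, a nightly backup of all namespaces including volume snapshots might look like this (the schedule name, cron expression and retention are illustrative assumptions):

```yaml
# Hypothetical example: assumes Velero is installed and a backup storage
# location is already configured. Names and values are illustrative.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup   # assumed name
  namespace: velero
spec:
  schedule: "0 1 * * *"          # every night at 01:00
  template:
    includedNamespaces:
      - "*"                      # whole cluster
    snapshotVolumes: true        # include persistent volumes
    ttl: 168h0m0s                # keep each backup for 7 days
```

Restoring such a backup into a freshly created cluster is the kind of operation that can turn a cluster migration into hours rather than a project.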
Don’t Question Your Vendor’s Infrastructure Sizing
• Obscene asks for CPU and memory
• Questioning never led to a significant difference
Example project ask – two-machine cluster with 4 CPUs, 16 GB RAM each. Real life: 0.1 CPU (10% of a single CPU), ~1.2 GB RAM
Deploy and set real quotas afterwards
• real world is a fraction of the original ask (no
exceptions yet)
• should things go south, you can tune on the fly
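A minimal sketch of "set real quotas afterwards", assuming the vendor application runs in its own namespace. The namespace name and the exact numbers are assumptions, sized with headroom above the observed usage of roughly 0.1 CPU and 1.2 GB RAM:

```yaml
# Hypothetical quota for a vendor namespace, sized from observed usage
# rather than from the original ask. All names and values are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: vendor-app-quota
  namespace: vendor-app
spec:
  hard:
    requests.cpu: 500m
    requests.memory: 2Gi
    limits.cpu: "1"
    limits.memory: 3Gi
```

Once a quota covers CPU and memory, pods in that namespace must declare requests and limits (or inherit them from a LimitRange), otherwise they are rejected. Tuning on the fly is then a matter of editing these values.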
Budget for Disruptions, Promote ‘Aversion’
• Define disruption budgets (religiously)
• beta since 1.5; GA (stable) from 1.21
• your app won’t disappear entirely on a node drain
• Strive to distribute pods across multiple nodes
• use podAntiAffinity as a rule
• consider using descheduler
• Sample scenario (real life):
1. all ingress pods eventually ended up running on a single node
2. drain the specific node hosting all ingress pods
3. no ingress (i.e. ‘cluster is down’) for a not insignificant moment
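The ingress scenario above can be guarded against roughly as follows. This is a sketch under assumptions: the `ingress` namespace, the `app: ingress-controller` label and the image are hypothetical, and clusters older than 1.21 would need `policy/v1beta1` instead of `policy/v1`.

```yaml
# Keep at least one ingress pod running during voluntary disruptions
# such as a node drain. Labels and names are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-pdb
  namespace: ingress
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ingress-controller
---
# Never schedule two ingress pods onto the same node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-controller
  namespace: ingress
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: ingress-controller
              topologyKey: kubernetes.io/hostname
      containers:
        - name: controller
          image: example.invalid/ingress-controller:1.0   # placeholder image
```

With the required form of anti-affinity, a replica that cannot find a free node stays Pending; `preferredDuringSchedulingIgnoredDuringExecution` is the softer alternative when there are fewer nodes than replicas.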
Let Them Die Peacefully
• 30 secs default timeout to terminate may not be good for all
• Long running consumer queries
• Lengthy cleanup processes (e.g. to keep PVs consistent)
• Hooks delaying the TERM signal eat into the total budget
• Use rather generous terminationGracePeriodSeconds
• should the container terminate earlier, the control plane will notice
• Not everyone plays nice with TERM
• Use preStop hooks
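A sketch of a generous termination budget combined with a preStop hook for an application that does not react properly to TERM. The pod name, image, script path and the 120-second value are assumptions for illustration:

```yaml
# Hypothetical pod with a long grace period and an explicit shutdown hook.
apiVersion: v1
kind: Pod
metadata:
  name: queue-consumer
spec:
  # Total budget: preStop hook time plus time after TERM, then KILL.
  terminationGracePeriodSeconds: 120
  containers:
    - name: consumer
      image: example.invalid/consumer:1.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Assumed script that asks the app to finish work and exit,
            # for software that ignores or mishandles TERM.
            command: ["/bin/sh", "-c", "/opt/app/graceful-stop.sh"]
```

The grace period is an upper bound, not a fixed wait: a container that exits after 5 seconds is removed after 5 seconds.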
Dying Containers Won’t Accept New Work
• Updating deployments, stateful-sets, kubectl delete pod xxx & co
• ‘Terminating’ a pod:
• containers receive TERM signal -> stop accepting new requests
• network (CNI) – in parallel - starts converging endpoints/services
• until converged, the terminating pods will deny new requests
• preStop hooks to delay TERM, thus giving time for network to converge
• we don’t want to, and also can’t really, make shutdown depend on the pod being isolated first (split-brain situations)
• 8 secs has worked fine so far (with exceptions)
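The delay described above can be sketched as a preStop hook that simply sleeps before TERM is delivered. The pod name and image are placeholders, and the example assumes the image ships a `sleep` binary:

```yaml
# Delay TERM by 8 seconds so endpoints/services can converge first.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend            # assumed name
spec:
  terminationGracePeriodSeconds: 60   # must exceed the 8 s delay plus shutdown time
  containers:
    - name: web
      image: example.invalid/web:1.0  # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "8"]
```

During those 8 seconds the container keeps serving requests that still arrive over not-yet-updated routes; the sleep counts against `terminationGracePeriodSeconds`.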
Cluster upgrades
• Started @1.15, now on 1.20
• Upgrading a managed cluster is a breeze – until it isn’t
• fairly complex process - on a managed cluster you don’t get all the knobs and buttons to comfortably identify/fix an issue
• two incidents so far:
• medium 1.16 -> 1.17 (upgrade stopped in the middle; documented fix/workaround)
• huge 1.19 -> 1.20 (internal cluster network went south, node pool ‘failed’)
• Both issues traced to node drain timeouts
• provider’s upgrade scripts define (weakly documented) node drain timeout for upgrades
• longer termination periods, multiplied by disruption budgets, prolong node drains
• Current approach:
• upgrade control plane first (separately)
• create new node-pool(s) at the upgraded version
• manually drain old nodes
• delete old pools
Some Major Roads NOT Taken
Helm
• Initial eval with v2 (might take different twist now)
• Many charts ‘opinionated’
• Some charts drag-in dependencies we didn’t want
Operator frenzy (i.e. operator for everything)
• Many operators undergoing major revisions (would be hard to keep up)
• Many offerings for the same use-case, frequently none matching all of our requirements
• Single manifest modification/deletion may evaporate your service in an instant
Pre-packaged pipelines (e.g. Banzai)
• Very early in development at the time
Note: These decisions were taken based on the situation around 2018/2019. Some will be revisited in due course.
Some More Takeaways
• Don’t rush Day 2
• Dedicate resources for day 0 & 1
• Day-to-day ops are surprisingly modest
• Adoption by ‘traditional’ IT departments may be a journey on its own…
• Local market uptake for K8s (still) lagging
• pushing & training vendors for adoption of k8s
• some vendors still ‘resist’, but some became proponents
• Stay cloud-agnostic
• minimize utilization of cloud-specific services
Thanks
