Running containers at scale.
From Nomad to Kubernetes
and lessons learned
Viet Huynh 16 Jan 2018
Agenda
1. Chotot k8s system overview
2. Kube-system
3. Platform observability
4. Deployment
5. Application
6. Tips & tricks
7. Team alignment
0. Nomad
- Back to history
- Why Nomad?
- Nomad issues
- Why Kubernetes?
1. Chotot k8s system overview
- 3 etcd, 2 masters + 1 HA, 13 workers
(physical)
- 100+ services, jobs
- 100% backend microservices
- 80% web
- cronJob, internal dashboard, 3rd party
tooling, ephemeral caching
2. Kube-system
- Kube-api
- Kube-dns
- Networking
- Etcd
- Resource quotas
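The speaker notes for this slide mention a default LimitRange of 50m CPU / 128Mi memory for requests and 100m CPU / 128Mi memory for limits. A minimal sketch of such a manifest, emitted by a shell function, might look like the following. The function name and the object name `default-limits` are hypothetical; only the four resource values come from the notes.

```shell
# Print a LimitRange manifest carrying the defaults quoted in the notes.
# The metadata name is a placeholder, not taken from the talk.
default_limitrange() {
  cat <<'EOF'
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - type: Container
    defaultRequest:   # used when a container declares no requests
      cpu: 50m
      memory: 128Mi
    default:          # used when a container declares no limits
      cpu: 100m
      memory: 128Mi
EOF
}

# Against a real cluster this could be piped into kubectl, for example:
#   default_limitrange | kubectl apply -n <namespace> -f -
default_limitrange
```

Because a LimitRange is namespaced, a setup like this would need one object per namespace that should receive the defaults.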
3. Platform observability
- Node problem detector
- Efficient logging
- Efficient metrics collecting
- Performance test
3.1 Node problem detector
Why do we need it?
- Had a couple of kernel panics with no clear cause;
we mostly assumed the servers were under stress.
- Need to know before it happens.
3.2 Efficient logging
- Tailable
- Local log
- Obstructive middleware
3.3 Efficient metrics collecting
- K8s metrics
- Application metrics
- Customization
- Long-term storage
3.4 Performance test
- Early and frequently
- Cloud networking perf test is a beast
- PerfKitBenchmarker
- In-house “benchmarks” tool kit
4. Deployment
- Blue/green deployment strategy
- Helm common template
- Containers security practice (Todo)
5. Application
- Java workload on containers.
- Go
+ RAM is plentiful because the Go services are optimized.
+ Memory fragmentation could be a problem.
- NodeJS
+ The web server needs a lot of CPU at startup
and very little while running.
+ Skewed CPU resource limits to avoid crash loops.
6. Kubectl tips and tricks
# Get all pods, sorted by name
kubectl get po --sort-by=.metadata.name -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
# Get all pods whose first container has restartCount > 0
kubectl get po -o jsonpath='{range .items[?(@.status.containerStatuses[0].restartCount>0)]}{.status.containerStatuses[0].name}{"\n"}{end}'
# Get all non-running pods
kubectl get po -o jsonpath='{range .items[?(@.status.phase != "Running")]}{.metadata.name}{"\n"}{end}'
# Get the pod using the most CPU
kubectl top pods | tail -n +2 | sort -nr -k 2 | awk '{print $1}' | head -n 1
# Get all nodes and their internal IPs
kubectl get no -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
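The "most CPU" pipeline can be tried without a cluster by wrapping it in a function that reads `kubectl top pods` output from stdin. This is only a sketch: the function name and the sample rows are hypothetical, and it assumes the usual NAME / CPU(cores) / MEMORY(bytes) columns with CPU expressed in millicores.

```shell
# Print the pod name with the highest value in the CPU column.
# Input: `kubectl top pods` output on stdin (header line included).
top_cpu_pod() {
  tail -n +2 | sort -k2,2nr | head -n 1 | awk '{print $1}'
}

# Made-up sample output, used only to exercise the function.
sample='NAME        CPU(cores)   MEMORY(bytes)
api-1       120m         200Mi
web-1       850m         310Mi
cron-1      15m          64Mi'

printf '%s\n' "$sample" | top_cpu_pod
```

On a live cluster the same function would be used as `kubectl top pods | top_cpu_pod`. The numeric sort relies on every CPU value carrying the same unit, so mixed units would need normalizing first.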
7. Technical alignment
- DevOps culture: train your software engineers in the
k8s mindset and tooling so you can focus
on improving the platform.
THANKS!
Any questions?
Looking for teammates to build a next-generation cloud
native platform, e.g. hyperconverged infrastructure,
chaos engineering.
Chat with us on Skype: tuan-viet.huynh or email
viethuynh@chotot.vn


Editor's Notes

  1. Open discussion.
     This event is mainly aimed at people who have already tinkered with k8s, so I will not introduce what Docker is, what k8s is, or its components.
     The talk covers hands-on experience and, where possible, how things work as we understand them.
     Chợ Tốt also struggled with dockerization and went through several solutions before arriving at k8s as it is today.
     Quite a few groups are now improving k8s at Chợ Tốt, from different teams: devops, infra, platform, software engineering.
  2. Overview: the scale at which Chợ Tốt runs kube.
     Important kube components:
     + the ones we work with most often
     + the ones most likely to cause problems
     Platform observability:
     + proactively know what is happening inside the system
     + do not wait for users or the business team to tell us
     + not just monitoring
     + standard monitoring: dead/alive; if dead -> alert
     + example: the number of requests per second to an API drops by 20% -> alert the API owner
     Deployment:
     + how Chợ Tốt deploys applications today and plans to in the future
     Application:
     + experience running apps on k8s/containers in different programming languages
     Tips:
     + tips for operating with kubectl
     Team alignment:
     + DevOps culture
     + software engineers need to know how to use k8s
     + how to deploy, monitor and debug with the tools devops provides
     + the devops team can focus on improving the platform
  3. Why K8s?
     + Previously weighed DC/OS against k8s -> tried both
     + Chợ Tốt's peer community in Europe is using k8s -> saw a chance of getting support
     + Pick one and bet
     + Experience with Nomad gave us a better grasp of the Docker orchestration mindset -> moving services to k8s was not too hard
     + The hard part is setting up k8s to handle production load and working with its many components.
  4. Why use CronJob:
     + some jobs only need to run once an hour or once a day
     + each run needs a lot of resources
     + running them persistently on a host would waste resources
     Ephemeral caching: Varnish, Redis for both web and backend API
  5. # KUBE-API
     Resource load is low. Need: availability.
     If kube-api is dead -> cannot deploy apps; more broadly, nothing that is running can be changed.
     A short outage of about 1 hour is not a problem (temporarily).
     Kube-api downtime is not acceptable on a system that has adopted CD and deploys continuously.
     Solution: load balancing with HAProxy
     Upgrade:
     + The API server is backward compatible: we accidentally apt-upgraded a node to a newer version, and the whole cluster kept running without errors.
     + Upgrading the cluster: next topic; beware of different API versioning
     # KUBE-DNS
     + from: Bowei https://github.com/kubernetes/dns
     + issues: some pods could not resolve DNS to a clusterIP -> failed -> root cause not found yet
     + to: Tim Hockin gcr.io/google_containers/kubedns-amd64:1.9
     + result: works well; increased limits, DNS caching
     -> single point of failure -> must be tested thoroughly (how -> open discussion)
     # NETWORKING
     Using Calico in production, based on the KET installation.
     Trial installs used Flannel as per the documentation, so we have not yet seen a clear difference in production.
     Calico: service node mesh -> rules/policies for routing
     Issues:
     + Pod A on Node-02 could connect to all pods on Node-01
     + Pod B on Node-03 could connect to all pods on Node-01
     + Pod B on Node-04 could connect to all pods on Node-01
     + Where does the issue come from? Node-04 or Node-01?
     + Check /etc/calico/confd/config/bird.cfg
     # ETCD
     Extremely important: if etcd goes down, the whole cluster is gone.
     Etcd recommends running a 5-node cluster -> 2 nodes may fail.
     Chợ Tốt runs 3 nodes -> stable.
     The Chợ Tốt production k8s currently has almost no issues with etcd.
     + SSL connection
     + peer key for etcd <-> etcd
     + client key for other k8s components -> etcd
     The etcd database should be backed up in case a disaster requires recovery.
     etcd: Prometheus connects to etcd via SSL to get metrics -> wrote a proxy between Prometheus and etcd for SSL
     prometheus -> etcd-proxy non-SSL -> etcd SSL
     # RESOURCE QUOTAS
     Set a default LimitRange
     + Request: cpu 50m, mem 128Mi
     + Limit: cpu 100m, mem 128Mi
     Purpose: prevent apps deployed without resource limits in their Helm charts from affecting other pods on the same host.
     Issue: 2 workers with unbalanced CPU clock || VM & physical -> Nomad is good at this case
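The etcd sizing remark in the note above (5 members tolerate 2 failures, 3 members tolerate 1) follows from majority quorum. A small arithmetic sketch, with hypothetical function names:

```shell
# Majority quorum for a cluster of N members.
etcd_quorum() { echo $(( $1 / 2 + 1 )); }

# Number of members that can fail while a quorum still exists.
etcd_fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 3 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerated_failures=$(etcd_fault_tolerance "$n")"
done
```

The same formula shows why even member counts add little: 4 members still tolerate only 1 failure.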
  6. Platform observability: more than just monitoring.
     Typical monitoring covers dead/alive and high load -> alert.
     We cannot wait for users or the biz team to report a problem before we know about it.
     Needed:
     + requests per minute suddenly drop 20% compared with the previous day -> alert
     + total requests today drop > 10% compared with the previous day -> alert
     + some metric drops suddenly right after a new app version is deployed -> roll back the release
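The drop-based alerts described in this note reduce to comparing a current value against a baseline. The following is a hedged sketch of that check only; the function name, argument order and sample numbers are assumptions, and in practice the rule would more likely live in the monitoring system than in a shell script.

```shell
# should_alert BASELINE CURRENT THRESHOLD_PERCENT
# Succeeds (exit status 0) when CURRENT has dropped from BASELINE by at
# least THRESHOLD_PERCENT. Integer arithmetic only.
should_alert() {
  baseline=$1; current=$2; threshold=$3
  [ "$baseline" -gt 0 ] || return 1   # no baseline, nothing to compare
  [ $(( (baseline - current) * 100 )) -ge $(( baseline * threshold )) ]
}

# Hypothetical numbers: 1000 req/min yesterday, 780 req/min now (a 22% drop).
if should_alert 1000 780 20; then
  echo "alert: requests per minute dropped by 20% or more"
fi
```

The 10% daily-total rule from the same note would use the same function with a threshold of 10 and daily totals as inputs.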
  7. - Need: tailable, local logs (centralized raw logs are not a priority), obstructive middleware, stdout (standard logging)
     - Standard solution: ELK (recommended). Why not use it: enterprise support (users, alerts); the current setup with Graylog works well
     - Chotot solution: containers -> log file (rotated) <- fluent-bit reads (container_id --> mux by service (central config integrated with kube-api (metadata))) --> ship to Graylog
  8. - kube metrics -> Prometheus, cAdvisor
     - service metrics, including:
     + total reqs;
     + (requests/min, latency, status code) per path and per pod;
     + CPU, mem, network (packets) per pod
     + goroutines per pod
     -> each service exposes its metrics -> Prometheus scrapes them
     - problems:
     + Prometheus is not a good fit for long-term metrics storage
     + k8s produces too many metrics, some of which we do not need yet
     + Specifically: within one metric series, a label can take many values -> pod_name changes every time a service is deployed or restarted -> proactively regroup as required: node_name, service_name (pod_name without xxx)
     - solution:
     + set up 2 Prometheus instances
     + the first scrapes metrics -> aggregates the metrics we need -> groups them by services, nodes, pods -> uses federation to pass them to the second Prometheus
     -> the second Prometheus stores metrics long term, for example 1 year
     -> the scraping instance is set to a short data retention
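The "service_name (pod_name without xxx)" grouping in this note amounts to stripping the generated suffixes from a pod name. A sketch of that mapping follows, assuming Deployment-style names of the form `<service>-<replicaset-hash>-<5-character-suffix>`; the function name and the sample pod name are hypothetical, and in Prometheus itself this would more likely be done with relabeling or recording rules.

```shell
# Derive a stable service name from a pod name by dropping the
# trailing "-<hash>-<5 chars>" that changes on every deploy or restart.
service_name() {
  printf '%s\n' "$1" | sed -E 's/-[a-z0-9]+-[a-z0-9]{5}$//'
}

service_name "ad-listing-7d9f8b6c5d-x2k4q"   # made-up pod name
```

Pods created by other controllers (StatefulSets, bare pods) follow different naming patterns and would need their own rule.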
  9. The template already exists; edit once, apply everywhere -> only the values.yaml file needs changing.
     Avoid running containers as the root user; images are based on Alpine -> want to replace it with Linux built from scratch.
     Network policy (todo)
  10. Java: beware of the default JVM heap setting (the JVM is only aware of host resources, not container resources). Long startup (like Node).
      Python: does not perform well on containers; we still do not know why.
      NodeJS: preferably do not run pm2 -> open discussion.
      Turn off swap on kube-workers: swap is unpredictable; Docker sums mem + swap -> slow when swap is in use.
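One way to act on the Java note is to size the heap explicitly from the container memory limit instead of relying on the JVM default. A rough sketch follows; the function name, the 50% default and the cgroup v1 path in the comment are assumptions rather than anything stated in the talk.

```shell
# jvm_heap_flag LIMIT_BYTES [PERCENT]
# Print an -Xmx flag sized to PERCENT (default 50) of the given limit.
jvm_heap_flag() {
  limit_bytes=$1; percent=${2:-50}
  echo "-Xmx$(( limit_bytes / 1024 / 1024 * percent / 100 ))m"
}

# Inside a container on a cgroup v1 host, the limit could be read from
#   /sys/fs/cgroup/memory/memory.limit_in_bytes
# Here a 512 MiB limit is passed in directly.
jvm_heap_flag 536870912
```

The remaining memory is left for metaspace, thread stacks and other non-heap usage, so the percentage is a tuning knob and not a fixed rule.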