How I Nearly Broke a Production Cluster by Botching a Calico Deployment
2021/03/12
Cloud Native Days Online
WHO AM I?
- Name: Kawabe Katsuya
- Team: CyberAgent Group Infrastructure Unit
- Position: Software, Infra Engineer, 2020 New Graduate
- Hobby: Music, Comic
ABOUT US
@KKawabe108
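TABLE OF CONTENTS
1. What Happened
Discovering the problem: what went wrong as a result of botching the calico deployment
2. How To Resolve
Resolution: explaining the incident to the product teams, reviewing our monitoring, and working to prevent a recurrence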
Figure source: https://docs.projectcalico.org/reference/architecture/overview
CNI: calico
Alerts always come out of the blue
Monitoring on AKE
We monitor multiple clusters with VictoriaMetrics
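What Happened
A flood of alerts reported that the BGP peers between the Ingress and the nodes had gone down
Running kubectl get po -A showed that almost all Pods on the master nodes had been Evicted 😇
After SSHing into a master node, it turned out that disk space had run low and hit the eviction threshold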
What Happened: Ingress implementation
[Figure: Nodes running calico-node, nginx ctrl, and exporter; Big IP VS; BGP Routing]
What Happened: Ingress implementation
[Figure: Nodes running calico-node, nginx ctrl, and exporter; Big IP VS; BGP Routing; "BGP Link is Down"]
Why did the disk fill up?
What Happened: the ipamhandles resource explosion
ipamhandles is the resource calico-ipam uses to tie Pods to IP addresses
Normally calico is supposed to manage it as Pods are created and deleted, but...
What Happened: the ipamhandles resource explosion
😇
Running kubectl get on the resource makes the API server eat up all its memory and die, so we checked with etcdctl instead
For reference, there were only about 30 Pods
API server unusable = cluster inoperable = big trouble
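The etcdctl check mentioned above might look roughly like the following sketch. The key prefix is an assumption (Kubernetes usually stores custom resources under /registry/<group>/<plural>/), and the endpoint and certificate flags a real cluster needs are left out. The helper names are hypothetical.

```python
import shlex
from typing import List

# Assumed key prefix for Calico's ipamhandles custom resources in etcd.
# Adjust it if your cluster uses a different etcd prefix.
IPAMHANDLES_PREFIX = "/registry/crd.projectcalico.org/ipamhandles/"


def etcdctl_list_keys_command(prefix: str) -> List[str]:
    """Build an etcdctl (v3 API) command that lists keys only, never the values."""
    return ["etcdctl", "get", prefix, "--prefix", "--keys-only"]


def count_keys(keys_only_output: str) -> int:
    """Count keys in `etcdctl get --keys-only` output; blank lines are separators."""
    return sum(1 for line in keys_only_output.splitlines() if line.strip())


if __name__ == "__main__":
    # Print the command to run on a master node; its output can be fed to count_keys().
    print(shlex.join(etcdctl_list_keys_command(IPAMHANDLES_PREFIX)))
```

Reading only the keys straight from etcd avoids having the API server load every object into memory, which is what made kubectl get unusable here.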
What Happened: the ipamhandles resource explosion
[Figure: systemd; three 2GB etcd backups in /var/backup/etcd 🧨]
The ipamhandles explosion bloated the etcd backup data, which put pressure on a disk of only 20GB
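A rough back-of-the-envelope sketch shows how quickly 2GB backups can exhaust a 20GB disk. The 20GB and 2GB figures come from the slide. The 10% free-space eviction threshold (the kubelet's default for nodefs.available) and the 8GB of other disk usage are assumptions for illustration, not values from the talk.

```python
def backups_until_eviction(disk_mb: int, backup_mb: int,
                           other_usage_mb: int, eviction_free_pct: int) -> int:
    """Return how many backups fit before free space drops below the eviction threshold."""
    reserved_mb = disk_mb * eviction_free_pct // 100   # space that must stay free
    usable_mb = disk_mb - reserved_mb - other_usage_mb
    if usable_mb <= 0:
        return 0
    return usable_mb // backup_mb


if __name__ == "__main__":
    n = backups_until_eviction(
        disk_mb=20 * 1024,        # 20GB disk (from the slide)
        backup_mb=2 * 1024,       # 2GB per etcd backup (from the slide)
        other_usage_mb=8 * 1024,  # hypothetical: OS, images, logs
        eviction_free_pct=10,     # assumption: kubelet default nodefs.available<10%
    )
    print(f"backups that fit before eviction pressure: {n}")
```

Under these assumptions only a handful of backups fit, so a short retention window is enough to push the node into eviction.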
What Happened: summary
Step 1: calico's ipamhandles explode
Step 2: the etcd backup data grows (2GB)
Step 3: the master's disk fills up, and calico-node and other Pods are Evicted
Step 4: with calico-node down, the BGP peers are disconnected and the alerts fire
Causes and response
How To Resolve
We had not deployed the calico-kube-controllers component
The ClusterRole permissions for calico-node were wrong
Figure source: https://docs.projectcalico.org/reference/architecture/overview
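How To Resolve: lessons learned
Because calico-node had not been granted the delete permission, garbage collection never happened
The mistake crept in because we had been modifying a manifest originally based on 3.8
Had we downloaded the new official manifest and used it as the base, this mistake would not have happened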
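A check like the one below could catch a missing verb during manifest review. It is a minimal sketch: the ClusterRole is represented as a plain dict (as it would look after parsing the YAML), the required verb set is an assumption rather than something stated in the talk, and missing_verbs is a hypothetical helper.

```python
from typing import Set

# Assumption: the verbs calico-node needs on ipamhandles. Verify this against
# the official manifest for your Calico version.
REQUIRED_VERBS = {"get", "list", "create", "update", "delete"}


def missing_verbs(cluster_role: dict, api_group: str, resource: str,
                  required: Set[str]) -> Set[str]:
    """Return the required verbs that no rule in the ClusterRole grants on the resource."""
    granted: Set[str] = set()
    for rule in cluster_role.get("rules", []):
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        group_ok = api_group in groups or "*" in groups
        resource_ok = resource in resources or "*" in resources
        if group_ok and resource_ok:
            granted.update(rule.get("verbs", []))
    if "*" in granted:
        return set()
    return required - granted


if __name__ == "__main__":
    # Hypothetical role resembling the broken one: no delete on ipamhandles.
    role = {"rules": [{"apiGroups": ["crd.projectcalico.org"],
                       "resources": ["ipamhandles"],
                       "verbs": ["get", "list", "create", "update"]}]}
    print(missing_verbs(role, "crd.projectcalico.org", "ipamhandles", REQUIRED_VERBS))
```

Diffing a customized manifest against the official one for the target version serves the same purpose and covers more than RBAC.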
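How To Resolve: response to the product teams
The alert came from a cluster we run for monitoring. In the clusters the products use, Pods were not created and deleted that often, so the growth had not reached the point of affecting the disk
We explained the situation right away and fixed the manifests
We confirmed that the risk of this occurring had been contained, and closed the incident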
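How To Resolve: response on the monitoring platform
The monitoring platform now watches the etcd db size and the object count on every cluster
The disk-size alert threshold was the same as the Eviction Policy, so we are resetting it to a lower value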
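The alerting logic described above can be sketched as follows. All thresholds are hypothetical placeholders, and the inputs are assumed to come from the monitoring system (for example, etcd exposes its database size as the etcd_mvcc_db_total_size_in_bytes metric). The point is only that the disk alert must trip at a lower usage than the eviction threshold.

```python
from typing import List


def firing_alerts(disk_used_pct: int, etcd_db_bytes: int, object_count: int,
                  disk_alert_pct: int = 80,           # hypothetical alert threshold
                  eviction_used_pct: int = 90,        # assumption: eviction at 90% used
                  db_limit_bytes: int = 1 * 1024**3,  # hypothetical: 1 GiB
                  object_limit: int = 10_000) -> List[str]:
    """Return the names of alerts that should fire for one control-plane sample."""
    if disk_alert_pct >= eviction_used_pct:
        # An alert at or above the eviction threshold fires too late to act on.
        raise ValueError("disk alert must fire before the eviction threshold")
    alerts = []
    if disk_used_pct >= disk_alert_pct:
        alerts.append("disk")
    if etcd_db_bytes >= db_limit_bytes:
        alerts.append("etcd_db_size")
    if object_count >= object_limit:
        alerts.append("object_count")
    return alerts


if __name__ == "__main__":
    print(firing_alerts(disk_used_pct=85, etcd_db_bytes=2 * 1024**3, object_count=30))
```

Watching the object count per resource type would have flagged the ipamhandles growth long before the backups filled the disk.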
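How To Resolve: on updating calico
If an update can be avoided, avoid it
We basically use the official manifest as-is, so it should be fine (we only change things like the Pod CIDR and the IPIP enable flag)
The manifest is large, so a triple-check review of any modification is about right (that is how critical the CNI is)
Take the utmost care when updating your CNI!
Thank you for listening!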
