The great fire of 2021
Kris Buytaert
@krisbuytaert
March 10, 2021
● 03:59: incoming phone call
● “Our datacenter is on fire”
● 5 minutes of waking up, going downstairs, trying to realize what was just said to me: partial panic, partial “it can’t be THAT bad”
Fire
Kris Buytaert
● I used to be a Dev,
● Then became an Op
● CTO and Open Source Consultant @inuits.eu
● Everything is a freaking DNS problem
● Evangelizing devops
● Organiser of #devopsdays, #cfgmgmtcamp, #loadays, …
Impact:
● 3.6 million websites across 464,000 distinct domains
● https://news.netcraft.com/archives/2021/03/10/ovh-fire.html
Immediate Assessment?
● What customer-facing production services have we lost?
● What internal production services have we lost?
  – Do we need any of those to recover?
● What are the priorities?
  – Which services
    ● 24/7 money-generating platforms
    ● Office-hours tooling
  – What parts of these services
    ● Basic service, no redundancy
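The triage questions above can be sketched as a simple sort. This is only an illustration: the service names, the `Service` fields and the exact ordering rule (recovery dependencies first, then customer-facing 24/7 platforms, then office-hours tooling) are assumptions for the example, not the actual runbook used that night.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str                          # hypothetical service name
    customer_facing: bool              # customer-facing production vs internal
    always_on: bool                    # 24/7 money-generating vs office hours
    needed_for_recovery: bool = False  # internal tool the restore depends on


def recovery_order(lost):
    """Sort lost services: recovery dependencies first, then customer-facing
    24/7 platforms, then everything else (assumed ordering rule)."""
    def key(service):
        # False sorts before True, so each flag is negated.
        return (
            not service.needed_for_recovery,
            not service.customer_facing,
            not service.always_on,
            service.name,
        )
    return sorted(lost, key=key)


lost = [
    Service("dev-environment", customer_facing=False, always_on=False),
    Service("webshop", customer_facing=True, always_on=True),
    Service("git-server", customer_facing=False, always_on=False,
            needed_for_recovery=True),
    Service("reporting-portal", customer_facing=True, always_on=False),
]

ordered_names = [service.name for service in recovery_order(lost)]
for name in ordered_names:
    print(name)
```

The point of the `needed_for_recovery` flag is the second question on the slide: an internal tool can outrank a customer platform if nothing can be restored without it.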
Our Immediate Visible Impact
● Multiple customers failed over automatically to the secondary DC
● 1 VIP address failed to move
● 1 platform was spread over SBG2 & SBG1
● 1 DR site destroyed
Failing VIP
● 1 VIP address failed to move (API call failed)
● Manual action to use a different public IP for the load balancer
● Updated DNS records
● Except for the 1 domain the customer controlled
● This cost us about 60 minutes before we could reach someone with the right credentials to update the DNS
● Some customer devices had the VIP address hardcoded :(
  – Manual interventions needed :(
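The sequence above (VIP move fails, fall back to a different public IP, update DNS where you can) can be sketched as follows. Everything here is hypothetical: `move_vip` and `update_dns` stand in for whatever provider API and DNS tooling is in use, the domains are placeholders, and the IP comes from a documentation range.

```python
from typing import Callable, Dict, List


def fail_over(
    move_vip: Callable[[], None],
    update_dns: Callable[[str, str], None],
    spare_ip: str,
    dns_owner: Dict[str, str],
) -> List[str]:
    """Try to move the VIP. If that fails, repoint every domain we control
    at spare_ip and return the domains somebody else has to update."""
    try:
        move_vip()
        return []
    except Exception as err:  # the provider API may fail during the incident
        print(f"VIP move failed ({err}); falling back to DNS")

    blocked = []
    for domain, owner in sorted(dns_owner.items()):
        if owner == "us":
            update_dns(domain, spare_ip)
        else:
            blocked.append(domain)  # needs a phone call, not an API call
    return blocked


def broken_move_vip() -> None:
    """Simulates the failed API call from the slide."""
    raise RuntimeError("API call failed")


updated = []
blocked = fail_over(
    broken_move_vip,
    lambda domain, ip: updated.append((domain, ip)),
    "203.0.113.10",
    {"shop.example.com": "us", "portal.example.org": "customer"},
)
print("updated:", updated)
print("escalate to customer:", blocked)
```

Knowing the `blocked` list before the incident is what would have saved the 60 minutes. The hardcoded VIP on customer devices is outside what any such script can fix.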
Customer Impact
● 1 platform was spread over SBG2 & SBG1
● Rebootstrapped the platform on a different ISP, restored the backup
● At 09:00 the platform was ready for use
● We were improving performance of the platform
● Customer wouldn’t have known if we didn’t tell them
Total Loss:
● We lost 13 physical servers
● 135 VMs
Root Cause
● Klaba does not say the UPS is the definite cause. "We don’t have all the answers today," he said. The OVHcloud staff responded to alarms at 11.42 pm on Tuesday, but the affected part of the data center had already filled with smoke: "Two minutes after, they took the decision to leave, because it was too dangerous."
● The firefighter's thermal cameras found UPS7 and UPS8 on fire, but further data will be extracted from on-site cameras: "We have 300 cameras in Strasbourg," said Klaba. "We expect to have all the answers about how it started. We will give you all the information."
● https://www.datacenterdynamics.com/en/news/ovh-fire-octave-klaba-says-ups-systems-were-ablaze/
Our Impact
● NextCloud
● Gitlab
● Lots of development environments
● Lots of test nodes
● Partial K8s clusters
We survived
● 18 hours of work
● 2 engineers at night
● 4-5 during daytime
● Recovery
● Communication
● Restoring backups
At 09:10
● Another customer seemed to have issues accessing some files
● We had overlooked 1 Gluster cluster that seemed healthy but wasn’t
● Switched to a single-node mount point (don’t try this at home)
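A cluster that "seemed healthy but wasn't" is easier to catch when the check compares the members you expect against the members that actually report in, instead of only testing that the mount responds. A minimal sketch, assuming both lists are already collected somewhere; the cluster and node names are made up, and nothing here calls Gluster itself.

```python
def degraded_clusters(expected, connected):
    """Return {cluster: missing nodes} for every cluster that lost members.

    expected:  {cluster name: nodes that should be part of it}
    connected: {cluster name: nodes currently reporting as connected}
    """
    report = {}
    for cluster, nodes in expected.items():
        missing = sorted(set(nodes) - set(connected.get(cluster, ())))
        if missing:
            report[cluster] = missing
    return report


# Hypothetical inventory: files-b still serves files from gl3, so a naive
# "is the mount up?" check passes, yet half of the cluster is gone.
expected = {"files-a": ["gl1", "gl2"], "files-b": ["gl3", "gl4"]}
connected = {"files-a": ["gl1", "gl2"], "files-b": ["gl3"]}

report = degraded_clusters(expected, connected)
print(report)
```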
No data was lost
Phase II
● After 09:00 we started restoring our internal tools from backup
● Verified all the secondary nodes would be making backups
● Verified we had sufficient disk space left
● “Shoemaker always wears the worst shoes”
● Challenge: hardware!
Hardware Landrush
● Everybody needed new hardware
● We had spare hardware ready for a new platform in the wrong DC of a different supplier. Temporarily OK
● We spun up new hardware at a different supplier
Wrong decisions
● On day 2 we wasted at least 6 hours trying to spin up a new supplier
● This was not the right time to figure out their network and redundancy strategies
● We could scale on the existing suppliers
Phase III
● Restoring the rest of the services
● What boxen would we be getting back?
● What boxen were foobar?
● (In the end we only got 3 physical servers back, which were in SBG3; the rest was lost)
● Priority on making sure all pipeline promotions of actively developed platforms were running again
● Hardware availability
The day after
● New DR platform where needed
● Inventory of what besides production was impacted (lots of development boxen, partial clusters)
● Prioritization of what needs to be 100% back first
● What hardware do we have / need additionally?
Your NEW DR Plan
● You have just lost your DR site
● You need to plan again
  – Do we take the risk for now?
  – Do we respin a new DR?
● IaC + data sync makes DR trivial
What saved us?
● Real Infrastructure as Code
● Architecture
● Backups
● Fast escalation
Real Infrastructure as Code
● Desired state => Puppetize all the things
● 100% control, no 3rd parties involved
  – Provision VM
  – Deploy applications
  – Deploy database schemas
● No handovers, 1 person can deploy this, no other teams involved
● Exported resources for load balancers (haproxy), monitoring (icinga), databases (mysql), etc.
  – Heavy automation
● All deployed versions are available in (yum) repositories
● Everything is a pipeline: Puppet, Hiera, DNS, ... (jenkins)
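The exported-resources idea is worth a small illustration: every node publishes a description of itself, and the load balancer configuration is generated from whatever has been published, so a freshly bootstrapped node shows up in haproxy without anyone editing the config by hand. The sketch below mimics that pattern in plain Python; it is not Puppet code, and the service name, hostnames and addresses are invented.

```python
from collections import defaultdict

# Stand-in for the central store that holds exported resources.
exported = []


def export_backend(service, host, ip, port):
    """Run 'on' each application node: publish how to reach this node."""
    exported.append({"service": service, "host": host, "ip": ip, "port": port})


def render_haproxy_backends(resources):
    """Run 'on' the load balancer: collect every published backend and
    render haproxy backend sections from them."""
    by_service = defaultdict(list)
    for res in resources:
        by_service[res["service"]].append(res)

    lines = []
    for service in sorted(by_service):
        lines.append(f"backend {service}")
        for res in sorted(by_service[service], key=lambda r: r["host"]):
            lines.append(
                f"    server {res['host']} {res['ip']}:{res['port']} check"
            )
    return "\n".join(lines)


# Two hypothetical web nodes register themselves after being rebuilt.
export_backend("shop", "web2", "192.0.2.12", 8080)
export_backend("shop", "web1", "192.0.2.11", 8080)

config = render_haproxy_backends(exported)
print(config)
```

After a total loss this is what makes the rebuild a one-person job: rebuild the nodes, let them re-export, regenerate the load balancer and monitoring configuration.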
CloudNative vs Cloudnaive
● We don’t own hardware
● Bare metal on demand at $ < AWS
  – Hetzner / OVH / ...
● Spin up in 120 seconds, decommission when unneeded
● VM definitions are in code
● Ansible to bootstrap, Puppet for desired state
Multi Cloud
● OVH
● Hetzner
● ...
● Workload is spread
● Customer DR is at another supplier
Multi Datacenter
● OVH
  – SBG, GRA, RBX, ...
● Hetzner
  – FSN X, HEL
CloudNative?
● Critical customer services are built redundantly (corosync, haproxy, mysql, elastic)
● Losing a bare-metal server should not have an impact
● We can rebootstrap our nodes (kickstart + puppet + ansible)
● All UGC data we know about is backed up (rsnapshot)
Cloud Agnostic
● A VM is a VM is a VM
● Vendor-specific features
  – e.g. VIP addresses are abstracted in code
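"Abstracted in code" could look roughly like this: the failover logic talks to one interface, and each supplier gets its own implementation behind it. This is a sketch under assumptions. `ProviderA` and `ProviderB` are placeholders that only return a description of what they would do; no real supplier API is called or implied.

```python
from abc import ABC, abstractmethod


class VipProvider(ABC):
    """The only thing the rest of the tooling knows about a supplier's VIPs."""

    @abstractmethod
    def move_vip(self, vip: str, target_host: str) -> str:
        """Point vip at target_host and return a description of the action."""


class ProviderA(VipProvider):
    # Placeholder: a real implementation would call supplier A's API here.
    def move_vip(self, vip: str, target_host: str) -> str:
        return f"provider-a: route {vip} to {target_host}"


class ProviderB(VipProvider):
    # Placeholder: supplier B might handle the same request differently.
    def move_vip(self, vip: str, target_host: str) -> str:
        return f"provider-b: reassign {vip} to {target_host}"


PROVIDERS = {"provider-a": ProviderA(), "provider-b": ProviderB()}


def fail_over_vip(provider_name: str, vip: str, target_host: str) -> str:
    """Vendor-neutral entry point used by the failover tooling."""
    return PROVIDERS[provider_name].move_vip(vip, target_host)


action = fail_over_vip("provider-b", "203.0.113.10", "lb2")
print(action)
```

With this shape, adding capacity at another supplier means writing one more class, not touching the failover logic.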
Dedicated stacks per Customer/Project
● Single-purpose VMs
● Single-purpose database clusters
● Single-purpose storage clusters
● Single-purpose load balancers
=> Reduced impact, reduced complexity
Clusters are Multi DC
● Depending on scale, preference on HA vs actual load balancing
● Pinning of resources within 1 DC
● Lessons learned: we need multi-campus
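The multi-campus lesson can be turned into a check: a cluster spread over SBG1 and SBG2 is multi-DC on paper, but both buildings sit on the same campus. The sketch assumes the naming convention "campus code followed by a building number" (as in SBG1, SBG2, SBG3); the `GRA1` label and the helper names are illustrative.

```python
import re


def campus_of(datacenter: str) -> str:
    """Strip the trailing building number: 'SBG2' -> 'SBG' (assumed naming)."""
    return re.sub(r"\d+$", "", datacenter.upper())


def is_campus_redundant(datacenters) -> bool:
    """True only if the nodes sit on at least two different campuses."""
    return len({campus_of(dc) for dc in datacenters}) >= 2


same_campus = is_campus_redundant(["SBG1", "SBG2"])    # two buildings, one campus
two_campuses = is_campus_redundant(["SBG2", "GRA1"])   # two separate campuses

print("SBG1 + SBG2 redundant:", same_campus)
print("SBG2 + GRA1 redundant:", two_campuses)
```

Run against the cluster inventory, a check like this would have flagged the platform that was spread over SBG2 and SBG1 before the fire did.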
HA Applications
● No local file usage
  – If local files -> {gluster,drbd,...}
● MySQL replication
● Elastic multiple nodes (if we can’t generate the data)
● Multiple instances of rabbitmq
HA Storage
● VMs on local disk
  – 100% puppetized => no differences in config / software deployment
  – “Disposable”
● Storage distributed
  – Gluster, DRBD, (Ceph)
● (No RAID)
Backups
● Are both on site
● And off site
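"Both on site and off site" is easy to claim and easy to verify. A minimal sketch of such a verification, assuming an inventory of (dataset, location) pairs is available from the backup tooling; the dataset names and the `onsite` / `offsite` labels are made up for the example.

```python
REQUIRED_LOCATIONS = {"onsite", "offsite"}


def backup_gaps(backups):
    """Return {dataset: missing locations} for datasets without both copies.

    backups: iterable of (dataset, location) pairs.
    """
    seen = {}
    for dataset, location in backups:
        seen.setdefault(dataset, set()).add(location)

    gaps = {}
    for dataset, locations in seen.items():
        missing = sorted(REQUIRED_LOCATIONS - locations)
        if missing:
            gaps[dataset] = missing
    return gaps


# Hypothetical inventory: the wiki only has an on-site copy.
inventory = [
    ("customer-db", "onsite"),
    ("customer-db", "offsite"),
    ("wiki", "onsite"),
]

gaps = backup_gaps(inventory)
print(gaps)
```

When the site itself burns down, only the datasets with an empty entry here are the ones you can actually restore.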
Documentation
● Mkdocs based
● Git repo
● Locally available
Lessons Learned
● Make sure you can always provision more resources on multiple providers
● A datacenter is not always a datacenter: even with different buildings you need campus redundancy
● Fix the shoemaker problem
● Everything IS a fscking DNS problem
  – You WANT control over the DNS of the production sites you run
● Or an MTU problem
Thnx
Thnx to everyone who helped out during the outage!
March 10, 2021
Contact
Inuits
Essensteenweg 31
Brasschaat
Belgium
891.514.231
+32 475 961221
Kris Buytaert Kris.Buytaert@inuits.eu
Further Reading
@krisbuytaert
http://www.krisbuytaert.be/blog/
https://inuits.eu/

stackconf 2021 | Help, My Datacenter is on Fire

  • 1. The great fire of 2021 Kris Buytaert @krisbuytaert
  • 2. @krisbuytaert March 10, 2021 ● 03:59 Incoming Phone call ● “Our Datacenter is On fire” ● 5 minutes of waking up, going downstairs, trying to realize what was just said to me: partial panic, partial "it can’t be THAT bad"
  • 4. @krisbuytaert Kris Buytaert ● I used to be a Dev, ● Then Became an Op ● CTO and Open Source Consultant @inuits.eu ● Everything is a freaking DNS Problem ● Evangelizing devops ● Organiser of #devopsdays, #cfgmgmtcamp, #loadays, ….
  • 5. @krisbuytaert Impact : ● 3.6 million websites across 464.000 distinct domains ● https://news.netcraft.com/archives/2021/03/10/ovh-fire.html
  • 6. @krisbuytaert Immediate Assessment ? ● What Customer Facing Production services have we lost ? ● What Internal Production services have we lost ? – Do we need any of those to recover ? ● What are the priorities ? – Which services ● 24/7 money generating platforms ● Office hours tooling – What parts of these services ● Basic Service, no redundancy
  • 7. @krisbuytaert Our Immediate visible Impact ● Multiple customers failed over automatically to the 2ndary DC ● 1 VIP address failed to move ● 1 platform was spread over SBG2 & SBG1 ● 1 DR site destroyed
  • 8. @krisbuytaert Failing VIP ● 1 VIP address failed to move (API call failed) ● Manual action to use a different public IP for the loadbalancer ● Updated DNS records ● Except for the 1 domain the customer controlled. ● This cost us about 60 minutes before we could reach someone with the right credentials to update the dns ● Some customer devices had the VIP address hardcoded :( – Manual interventions needed :(
  • 9. @krisbuytaert Customer Impact ● 1 platform was spread over SBG2 & SBG1 ● Rebootstrapped the platform on a different ISP, restored the backup ● At 0900 the platform was ready for use ● We were improving performance of the platform ● Customer wouldn’t have known if we didn’t tell them
  • 10. @krisbuytaert Total Loss: ● We lost 13 physical servers ● 135 VMs
  • 11. @krisbuytaert Root Cause ● Klaba does not say the UPS is the definite cause. "We don’t have all the answers today," he said. The OVHcloud staff responded to alarms at 11.42 pm on Tuesday, but the affected part of the data center had already filled with smoke: "Two minutes after, they took the decision to leave, because it was too dangerous." ● The firefighter's thermal cameras found UPS7 and UPS8 on fire, but further data will be extracted from on-site cameras: "We have 300 cameras in Strasbourg," said Klaba. "We expect to have all the answers about how it started. We will give you all the information." ● https://www.datacenterdynamics.com/en/news/ovh-fire-octave-klaba-says-ups-systems-were-ablaze/
  • 12. @krisbuytaert Our Impact ● NextCloud ● Gitlab ● Lots of development environments ● Lots of Test nodes ● Partial K8s clusters
  • 13. @krisbuytaert We survived ● 18 hours of work ● 2 engineers at night ● 4-5 during daytime ● Recovery ● Communication ● Restoring Backups
  • 14. @krisbuytaert At 0910 ● Another customer seemed to have issues accessing some files ● We had overlooked 1 gluster cluster that seemed healthy but wasn’t ● Switched to a single node mount point (don’t try this at home)
  • 16. @krisbuytaert Phase II ● After 0900 we started restoring our internal tools from backup ● Verified all the secondary nodes would be making backups ● Verified we had sufficient diskspace left ● “Shoemaker always wears the worst shoes” ● Challenge : Hardware !
  • 17. @krisbuytaert Hardware Landrush ● Everybody needed new hardware ● We had spare hardware ready for a new platform in the wrong DC of a different supplier . Temporary OK ● We spun up new hardware at a different supplier
  • 18. @krisbuytaert Wrong decisions ● On day 2 we wasted at least 6 hours trying to spin up a new supplier ● This was not the right time to figure out their network and redundancy strategies ● We could scale on the existing suppliers
  • 19. @krisbuytaert Phase III ● Restoring the rest of the services ● What boxen would we be getting back ? ● What boxen were foobar ? ● (in the end we only got 3 physical servers back which were in SBG3), the rest was lost ● Priority on making sure all pipeline promotions of actively developed platforms were running again ● Hardware availability
  • 20. @krisbuytaert The day after ● New DR platform where needed ● Inventory of what besides production was impacted (lots of development boxen, partial clusters) ● Prioritization of what needs to be 100% back first ● What hardware do we have / need additionally
  • 21. @krisbuytaert Your NEW DR Plan ● You have just lost your DR site ● You need to plan again – Do we take the risk for now ? – Do we respin a new DR ? ● IAC + Datasync makes DR Trivial
  • 22. @krisbuytaert What saved us ? ● Real Infrastructure as Code ● Architecture ● Backups ● Fast escalation
  • 23. @krisbuytaert Real Infrastructure as Code ● Desired State => Puppetize all the things ● 100% control, no 3rd parties involved – Provision vm – Deploy applications – Deploy database schemas ● No handovers, 1 person can deploy this, no other teams involved ● Exported resources for Loadbalancers (haproxy), Monitoring (icinga), Databases (mysql), etc. – Heavy automation ● All deployed versions are available in (yum) repositories ● Everything is a Pipeline: Puppet, Hiera, DNS, ... (jenkins)
  • 24. @krisbuytaert CloudNative vs Cloudnaive ● We don’t own hardware ● Baremetal on demand at $ < AWS – Hetzner / ovh / ... ● Spin up in 120 seconds, decommission when unneeded ● vm definitions are in code ● Ansible to bootstrap, Puppet for Desired state
  • 25. @krisbuytaert Multi Cloud ● OVH ● Hetzner ● ... ● Workload is spread ● Customer DR is in other Supplier
  • 26. @krisbuytaert Multi Datacenter ● OVH – SBG, GRA, RBX, .. ● Hetzner – FSN X, HEL
  • 27. @krisbuytaert CloudNative ? ● Critical Customer services are built redundantly (corosync, haproxy, mysql, elastic) ● Losing a bare metal server should not have an impact ● We can rebootstrap our nodes (kickstart + puppet + ansible) ● All UGC data we know about is backed up (rsnapshot)
  • 28. @krisbuytaert Cloud Agnostic ● A vm is a vm is a vm ● Vendor Specific Features – e.g. VIP addresses are abstracted in code
  • 29. @krisbuytaert Dedicated stacks per Customer/Project ● Single Purpose vm’s ● Single Purpose Database Clusters ● Single Purpose Storage Clusters ● Single Purpose Loadbalancers => Reduced impact, Reduced complexity
  • 30. @krisbuytaert Clusters are Multi DC ● Depending on scale , preference on HA vs actual Loadbalancing ● Pinning of resources within 1 DC ● Lessons Learned: we need Multi Campus
  • 31. @krisbuytaert HA Applications ● No local file usage – If local files -> {gluster,drbd,...} ● MySQL replication ● Elastic multiple nodes (if we can’t generate the data) ● Multiple instances of rabbitmq
  • 32. @krisbuytaert HA Storage ● VM’s on Local Disk – 100% puppetized => no differences in config/ software deployment – “Disposable” ● Storage distributed – Gluster, DRBD, (Ceph) ● (No Raid)
  • 33. @krisbuytaert Backups ● Are both on site ● And off Site
  • 35. @krisbuytaert Lessons Learned ● Make sure you can always provision more resources on multiple providers ● A Datacenter is not always a Datacenter, even with different buildings you need Campus redundancy ● Fix the shoemaker problem ● Everything IS a fscking dns problem – U WANT control over the DNS of the production sites you run ● Or an MTU problem
  • 36. @krisbuytaert Thnx Thnx to everyone who helped out during the outage !
  • 38. @krisbuytaert Contact Inuits Essensteenweg 31 Brasschaat Belgium 891.514.231 +32 475 961221 Kris Buytaert Kris.Buytaert@inuits.eu Further Reading @krisbuytaert http://www.krisbuytaert.be/blog/ https://inuits.eu/
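
Slides 8 and 28 describe vendor-specific VIP addresses being abstracted in code, and a failover where the provider API call to move the VIP failed, so the load balancer was given a different public IP and the DNS records were updated by hand. The following is a minimal Python sketch of that idea, not the tooling used in the talk. Every name in it (`VipProvider`, `DnsUpdater`, `fail_over`, `BrokenProvider`, `InMemoryDns`) is hypothetical, and the addresses and hostnames are documentation placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class VipMoveError(Exception):
    """Raised when a provider API call to move a VIP fails."""


class VipProvider(Protocol):
    """Hypothetical vendor-neutral interface; one implementation per supplier."""

    def move_vip(self, vip: str, target_node: str) -> None: ...


class DnsUpdater(Protocol):
    """Hypothetical interface to whatever DNS zones we control ourselves."""

    def set_a_record(self, name: str, ip: str) -> None: ...


@dataclass
class FailoverResult:
    method: str   # "vip" if the VIP moved, "dns" if we fell back to DNS
    address: str  # the address that now serves traffic


def fail_over(
    vip: str,
    target_node: str,
    fallback_ip: str,
    dns_names: List[str],
    provider: VipProvider,
    dns: DnsUpdater,
) -> FailoverResult:
    """Try to move the VIP; if the API call fails, repoint DNS instead."""
    try:
        provider.move_vip(vip, target_node)
        return FailoverResult(method="vip", address=vip)
    except VipMoveError:
        # Fallback: serve from a different public IP and update the
        # records for every name we are able to change.
        for name in dns_names:
            dns.set_a_record(name, fallback_ip)
        return FailoverResult(method="dns", address=fallback_ip)


class BrokenProvider:
    """Test double that simulates the failed API call."""

    def move_vip(self, vip: str, target_node: str) -> None:
        raise VipMoveError(f"API call to move {vip} to {target_node} failed")


@dataclass
class InMemoryDns:
    """Test double that records A records in a dict."""

    records: Dict[str, str] = field(default_factory=dict)

    def set_a_record(self, name: str, ip: str) -> None:
        self.records[name] = ip


if __name__ == "__main__":
    dns = InMemoryDns()
    result = fail_over(
        vip="192.0.2.10",
        target_node="lb2",
        fallback_ip="198.51.100.20",
        dns_names=["www.example.org", "api.example.org"],
        provider=BrokenProvider(),
        dns=dns,
    )
    print(result.method, result.address, dns.records)
```

A fallback like this only covers names in zones the operator can change. As slide 8 notes, the one customer-controlled domain and the devices with a hardcoded VIP address still needed manual intervention.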