Help , My Datacenter is on fire

Kris Buytaert
Kris BuytaertDevops, Linux and Open Source Expert at Inuits
The great fire of 2021
Kris Buytaert
@krisbuytaert
@krisbuytaert
March 10 , 2021
●
03:59 Incoming Phone call
●
“Our Datacenter is On fire”
●
5 minutes of waking up..
going downstairs trying to
realize what was just said to
me, partial panic , partial .. it
can’t be THAT bad
@krisbuytaert
Fire
@krisbuytaert
Kris Buytaert
●
I used to be a Dev,
●
Then Became an Op
●
CTO and Open Source Consultant @inuits.eu
●
Everything is a freaking DNS Problem
●
Evangelizing devops
●
Organiser of #devopsdays, #cfgmgmtcamp, #loadays, ….
@krisbuytaert
Impact :
●
3.6 million websites
across 464.000
distinct domains
●
https://news.netcraft.com/archives/2021/03/10/ovh-fire.html
@krisbuytaert
Immediate Assesment ?
●
What Customer Facing Production services have we lost ?
●
What Internal Production services have we lost ?
– Do we need any of those to recover ?
●
What are the priorities ?
– Which services
●
24/7 money generating platforms
●
Office hours tooling
– What parts of these services
●
Basic Service, no redundancy
@krisbuytaert
Our Immediate visible Impact
●
Multiple customers failed over automatically to
the 2ndary DC
●
1 VIP address failed to move
●
1 platform was spread over SBG2 & SBG1
●
1 DR site destroyed
@krisbuytaert
Failing VIP
●
1 VIP address failed to move (API call failed)
●
Manual action to use a different public IP for the loadbalancer
●
Updated DNS records
●
Except for the 1 domain the customer controlled.
●
This cost us about 60 minutes before we could reach someone with
the right credentials to update the dns
●
Some customer devices had the VIP address hardcoded :(
– Manual interventions needed :(
@krisbuytaert
Customer Impact
●
1 platform was spread over SBG2 &
SBG1
●
Rebootstrapped the platform on a
different ISP, restored the backup
●
At 0900 am platform was ready for use,
●
We were improving performance of the
platform
●
Customer wouldn’t have known if we
didn’t tell them
@krisbuytaert
Total Loss:
●
We lost 13 physical
servers
●
135 vm’s
@krisbuytaert
Root Cause
●
Klaba does not say the UPS is the definite cause. "We don’t have all the answers today,"
he said. The OVHcloud staff responded to alarms at 11.42 pm on Tuesday, but the
affected part of the data center had already filled with smoke: "Two minutes after, they
took the decision to leave, because it was too dangerous."
●
The firefighter's thermal cameras found UPS7 and UPS8 on fire, but further data will be
extracted from on-site cameras: "We have 300 cameras in Strasbourg," said Klaba. "We
expect to have all the answers about how it started. We will give you all the information."
●
https://www.datacenterdynamics.com/en/news/ovh-fire-octave-klaba-says-ups-systems-
were-ablaze/
@krisbuytaert
Our Impact
NextCloud
●
Gitlab
●
Lots of development
environments
●
Lots of Test nodes
●
Partial K8s clusters
@krisbuytaert
We survived
●
18 hours of work
●
2 engineers at night
●
4-5 during daytime
●
Recovery
●
Communication
●
Restoring Backups
@krisbuytaert
At 0910
●
Another customer seemed to have issues
accessing some files
●
We had overlooked 1 gluster cluster that
seemed healthy but wasn’t
●
Switched to a single node mount point
(don’t try this at home)
@krisbuytaert
No data was lost
@krisbuytaert
Phase II
●
After 0900 we started restoring our internal tools from
backup
●
Verified all the secondary nodes would be making backups
●
Verified we had sufficient diskspace left
●
“Shoemaker always wears the worst shoes”
●
Challenge : Hardware !
@krisbuytaert
Hardware Landrush
●
Everybody needed new
hardware
●
We had spare hardware
ready for a new platform in
the wrong DC of a different
supplier . Temporary OK
●
We spun up new hardware
at a different supplier
@krisbuytaert
Wrong decisions
●
On day 2 we wasted at least 6 hours trying
spinning up a new supplier
●
Figuring out their network and redundancy
strategies was not the right time
●
We could scale on the existing suppliers
@krisbuytaert
Phase III
●
Restoring the rest of the services
●
What boxen would we be getting back ?
●
What boxen were foobar ?
●
(in the end we only got 3 physical servers back which were in SBG3),
the rest was lost
●
Priority on making sure all pipeline promotions of actively
developed platforms were running again
●
Hardware availability
@krisbuytaert
The day after
●
New DR platform where needed
●
Inventory of what besides production was impacted (lots of
development boxen, partial clusters)
●
Prioritization of what needs to be 100% back first
●
What hardware do we have / need additionally
@krisbuytaert
Your NEW DR Plan
●
You have just lost your DR site
●
You need to plan again
– Do we take the risk for now ?
– Do we respin a new DR ?
●
IAC + Datasync makes DR Trivial
@krisbuytaert
What saved us ?
●
Real Infrastructure as Code
●
Architecture
●
Backups
●
Fast escalation
@krisbuytaert
Real Infrastructure as Code
●
Desired State => Puppetize all the things
●
100% control , no 3rd parties involved
– Provision vm
– Deploy applications
– Deploy database schemas
●
No handovers , 1 person can deploy this, no other teams involved
●
Exported resources for Loadbalancers (haproxy), Monitoring (icinga), Databases (mysql)etc
– Heavy automation
●
All deployed versions are available in (yum) repositories
●
Everything is a Pipeline Puppet, Hiera, DNS, ... (jenkins)
@krisbuytaert
CloudNative vs Cloudnaive
●
We don’t own hardware
●
Baremetal on demand at $ < AWS
– Hetzner / ovh / ...
●
Spin up 120 seconds, decomission
when unneeded.
●
vm definitions are in code
●
Ansible to bootstrap, Puppet for
Desired state
@krisbuytaert
Multi Cloud
●
OVH
●
Hetzner
●
...
●
Workload is spread
●
Customer DR is in other
Supplier
@krisbuytaert
Multi Datacenter
●
OVH
– SBG, GRA, RBX, ..
●
Hetzner
– FSN X, HEL
@krisbuytaert
CloudNative ?
●
Critical Customer services are build redundantly,
(corosync, haproxy, mysql, elastic)
●
Losing a bare metal should doesnot have an impact
●
We can rebootstrap our nodes (kickstart+ puppet +
ansible)
●
All UGC data we know about is backed up (rsnapshot)
@krisbuytaert
Cloud Agnostic
●
A vm is a vm is a vm
●
Vendor Specific
Features
– e.g VIP addresses are
abstracted in code
@krisbuytaert
Dedicated stacks per Customer/Project
●
Single Purpose vm’s
●
Single Purpose Database Clusters
●
Single Purpose Storage Clusters
●
Single Purpose Loadbalancers
=> Reduced impact, Reduced complexity
@krisbuytaert
Clusters are Multi DC
●
Depending on scale , preference on HA vs
actual Loadbalancing
●
Pinning of resources within 1 DC
●
Lessons Learned: we need Multi Campus
@krisbuytaert
HA Applications
●
No local file usage
– If local files -> {gluster,drbd,...}
●
MySQL replication
●
Elastic multiple nodes (if we can’t generate the
data)
●
Multiple instances of rabbitmq
@krisbuytaert
HA Storage
●
VM’s on Local Disk
– 100% puppetized => no differences in config/ software
deployment
– “Disposable”
●
Storage distributed
– Gluster, DRBD, (Ceph)
●
(No Raid)
@krisbuytaert
Backups
●
Are both on site
●
And off Site
@krisbuytaert
Documentation
●
Mkdocs based
●
Git repo
●
Locally available
@krisbuytaert
Lessons Learned
●
Make sure you can always provision more resources on multiple
providers
●
A Datacenter is not always Datacenter, even with different buildings
you need Campus redundancy
●
Fix the shoemaker problem
●
Everything IS a fscking dns problem
– U WANT control over the DNS of the production sites you run
●
Or an MTU problem
@krisbuytaert
Thnx
Thnx to everyone who helped
out during the outage !
@krisbuytaert
March 10 , 2021
@krisbuytaert
Contact
Inuits
Inuits
Essensteenweg 31
Essensteenweg 31
Brasschaat
Brasschaat
Belgium
Belgium
891.514.231
891.514.231
+32 475 961221
+32 475 961221
Kris Buytaert Kris.Buytaert@inuits.eu
Kris Buytaert Kris.Buytaert@inuits.eu
Further Reading
Further Reading
@krisbuytaert
@krisbuytaert
http://www.krisbuytaert.be/blog/
http://www.krisbuytaert.be/blog/
https://inuits.eu/
https://inuits.eu/
1 of 38

Recommended

GitOps , done Right by
GitOps , done RightGitOps , done Right
GitOps , done RightKris Buytaert
209 views38 slides
Migrating to Puppet 5 by
Migrating to Puppet 5Migrating to Puppet 5
Migrating to Puppet 5Kris Buytaert
1K views34 slides
Open Source Monitoring in 2019 by
Open Source Monitoring in 2019 Open Source Monitoring in 2019
Open Source Monitoring in 2019 Kris Buytaert
1.5K views56 slides
Pipeline all the Dashboards as Code by
Pipeline all the Dashboards as CodePipeline all the Dashboards as Code
Pipeline all the Dashboards as CodeKris Buytaert
644 views20 slides
DevOps Days Kyiv 2019 -- continuous Infrafirstructure First //Kris buytaert by
DevOps Days Kyiv 2019 -- continuous Infrafirstructure First //Kris buytaertDevOps Days Kyiv 2019 -- continuous Infrafirstructure First //Kris buytaert
DevOps Days Kyiv 2019 -- continuous Infrafirstructure First //Kris buytaertMykola Marzhan
24 views20 slides
10 years of #devopsdays, but what have we really learned ? by
10 years of #devopsdays, but what have we really learned ? 10 years of #devopsdays, but what have we really learned ?
10 years of #devopsdays, but what have we really learned ? Kris Buytaert
594 views42 slides

More Related Content

What's hot

Repositories as Code by
Repositories as CodeRepositories as Code
Repositories as CodeKris Buytaert
642 views34 slides
Deploying your SaaS stack OnPrem by
Deploying your SaaS stack OnPremDeploying your SaaS stack OnPrem
Deploying your SaaS stack OnPremKris Buytaert
681 views38 slides
Continuous Infrastructure First by
Continuous Infrastructure FirstContinuous Infrastructure First
Continuous Infrastructure FirstKris Buytaert
521 views42 slides
Continuous Delivery of (y)our infrastructure. by
Continuous Delivery of (y)our infrastructure.Continuous Delivery of (y)our infrastructure.
Continuous Delivery of (y)our infrastructure.Kris Buytaert
3K views45 slides
Building and Deploying MediaSalsa, an Open Source DAM as Saas platform by
Building and Deploying MediaSalsa, an Open Source DAM as Saas platformBuilding and Deploying MediaSalsa, an Open Source DAM as Saas platform
Building and Deploying MediaSalsa, an Open Source DAM as Saas platformKris Buytaert
3.1K views32 slides
ADDO 2019: Looking back at over 10 years of Devops by
ADDO 2019:    Looking back at over 10 years of DevopsADDO 2019:    Looking back at over 10 years of Devops
ADDO 2019: Looking back at over 10 years of DevopsKris Buytaert
578 views45 slides

What's hot(20)

Deploying your SaaS stack OnPrem by Kris Buytaert
Deploying your SaaS stack OnPremDeploying your SaaS stack OnPrem
Deploying your SaaS stack OnPrem
Kris Buytaert681 views
Continuous Infrastructure First by Kris Buytaert
Continuous Infrastructure FirstContinuous Infrastructure First
Continuous Infrastructure First
Kris Buytaert521 views
Continuous Delivery of (y)our infrastructure. by Kris Buytaert
Continuous Delivery of (y)our infrastructure.Continuous Delivery of (y)our infrastructure.
Continuous Delivery of (y)our infrastructure.
Kris Buytaert3K views
Building and Deploying MediaSalsa, an Open Source DAM as Saas platform by Kris Buytaert
Building and Deploying MediaSalsa, an Open Source DAM as Saas platformBuilding and Deploying MediaSalsa, an Open Source DAM as Saas platform
Building and Deploying MediaSalsa, an Open Source DAM as Saas platform
Kris Buytaert3.1K views
ADDO 2019: Looking back at over 10 years of Devops by Kris Buytaert
ADDO 2019:    Looking back at over 10 years of DevopsADDO 2019:    Looking back at over 10 years of Devops
ADDO 2019: Looking back at over 10 years of Devops
Kris Buytaert578 views
Is there a Future for devops ? by Kris Buytaert
Is there a Future for devops   ? Is there a Future for devops   ?
Is there a Future for devops ?
Kris Buytaert478 views
Pipeline as code for your infrastructure as Code by Kris Buytaert
Pipeline as code for your infrastructure as CodePipeline as code for your infrastructure as Code
Pipeline as code for your infrastructure as Code
Kris Buytaert1.9K views
Dev secops opsec, devsec, devops ? by Kris Buytaert
Dev secops opsec, devsec, devops ?Dev secops opsec, devsec, devops ?
Dev secops opsec, devsec, devops ?
Kris Buytaert17.4K views
From MonitoringSucks to Monitoring Love , 2016 Edition by Kris Buytaert
From MonitoringSucks to Monitoring Love , 2016 EditionFrom MonitoringSucks to Monitoring Love , 2016 Edition
From MonitoringSucks to Monitoring Love , 2016 Edition
Kris Buytaert29.4K views
Moby is killing your devops efforts by Kris Buytaert
Moby is killing your devops effortsMoby is killing your devops efforts
Moby is killing your devops efforts
Kris Buytaert2.7K views
Is there a future for devops ? by Kris Buytaert
Is there a future for devops ?Is there a future for devops ?
Is there a future for devops ?
Kris Buytaert3.5K views
Velocity 2011: Production Begins in Development by dev2ops
Velocity 2011: Production Begins in DevelopmentVelocity 2011: Production Begins in Development
Velocity 2011: Production Begins in Development
dev2ops2.1K views
Automating MySQL operations with Puppet by Kris Buytaert
Automating MySQL operations with PuppetAutomating MySQL operations with Puppet
Automating MySQL operations with Puppet
Kris Buytaert1.8K views
Devops is a Security Requirement by Kris Buytaert
Devops is a Security RequirementDevops is a Security Requirement
Devops is a Security Requirement
Kris Buytaert699 views
Devops is dead, Long Live Devops by Kris Buytaert
Devops is dead, Long Live DevopsDevops is dead, Long Live Devops
Devops is dead, Long Live Devops
Kris Buytaert1.7K views

Similar to Help , My Datacenter is on fire

Building real time Data Pipeline using Spark Streaming by
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
1.4K views33 slides
Task migration using CRIU by
Task migration using CRIUTask migration using CRIU
Task migration using CRIURohit Jnagal
1.7K views35 slides
Scaling Monitoring At Databricks From Prometheus to M3 by
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3LibbySchulze
294 views36 slides
How Klout migrated from CDH3 to CDH4 …and survived to tell about it by
How Klout migrated from CDH3 to CDH4 …and survived to tell about itHow Klout migrated from CDH3 to CDH4 …and survived to tell about it
How Klout migrated from CDH3 to CDH4 …and survived to tell about itIan Kallen
583 views13 slides
Server fleet management using Camunda by Akhil Ahuja by
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahujacamunda services GmbH
349 views36 slides
Cloud arch patterns by
Cloud arch patternsCloud arch patterns
Cloud arch patternsCorey Huinker
3.7K views32 slides

Similar to Help , My Datacenter is on fire(20)

Building real time Data Pipeline using Spark Streaming by datamantra
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra1.4K views
Task migration using CRIU by Rohit Jnagal
Task migration using CRIUTask migration using CRIU
Task migration using CRIU
Rohit Jnagal1.7K views
Scaling Monitoring At Databricks From Prometheus to M3 by LibbySchulze
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
LibbySchulze294 views
How Klout migrated from CDH3 to CDH4 …and survived to tell about it by Ian Kallen
How Klout migrated from CDH3 to CDH4 …and survived to tell about itHow Klout migrated from CDH3 to CDH4 …and survived to tell about it
How Klout migrated from CDH3 to CDH4 …and survived to tell about it
Ian Kallen583 views
Webinar: Building a multi-cloud Kubernetes storage on GitLab by MayaData Inc
Webinar: Building a multi-cloud Kubernetes storage on GitLabWebinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLab
MayaData Inc98 views
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST by Opher Dubrovsky
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEASTTHE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
Opher Dubrovsky100 views
Apache Cassandra at Target - Cassandra Summit 2014 by Dan Cundiff
Apache Cassandra at Target - Cassandra Summit 2014Apache Cassandra at Target - Cassandra Summit 2014
Apache Cassandra at Target - Cassandra Summit 2014
Dan Cundiff4.4K views
Webinar: Choosing the Right Shard Key for High Performance and Scale by MongoDB
Webinar: Choosing the Right Shard Key for High Performance and ScaleWebinar: Choosing the Right Shard Key for High Performance and Scale
Webinar: Choosing the Right Shard Key for High Performance and Scale
MongoDB1.3K views
Piano Media - approach to data gathering and processing by MartinStrycek
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
MartinStrycek1.1K views
Cassandra Day London 2015: The Resilience of Apache Cassandra by DataStax Academy
Cassandra Day London 2015: The Resilience of Apache CassandraCassandra Day London 2015: The Resilience of Apache Cassandra
Cassandra Day London 2015: The Resilience of Apache Cassandra
DataStax Academy1.5K views
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes by SeungYong Oh
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
SeungYong Oh903 views
What makes me to migrate entire VPC JAWS PANKRATION 2021 by Naomi Yamasaki
What makes me to migrate entire VPC JAWS PANKRATION 2021What makes me to migrate entire VPC JAWS PANKRATION 2021
What makes me to migrate entire VPC JAWS PANKRATION 2021
Naomi Yamasaki627 views
Tapjoy OpenStack Summit Paris Breakout Session by Weston Jossey
Tapjoy OpenStack Summit Paris Breakout SessionTapjoy OpenStack Summit Paris Breakout Session
Tapjoy OpenStack Summit Paris Breakout Session
Weston Jossey982 views
NetflixOSS Meetup season 3 episode 1 by Ruslan Meshenberg
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg21.4K views
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber... by StormForge .io
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
StormForge .io200 views
Processing TeraBytes of data every day and sleeping at night by Luciano Mammino
Processing TeraBytes of data every day and sleeping at nightProcessing TeraBytes of data every day and sleeping at night
Processing TeraBytes of data every day and sleeping at night
Luciano Mammino144 views
Big data @ Hootsuite analtyics by Claudiu Coman
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
Claudiu Coman811 views

More from Kris Buytaert

Years of (not) learning , from devops to devoops by
Years of (not) learning , from devops to devoopsYears of (not) learning , from devops to devoops
Years of (not) learning , from devops to devoopsKris Buytaert
65 views44 slides
Observability will not fix your Broken Monitoring ,Ignite by
Observability will not fix your Broken Monitoring ,IgniteObservability will not fix your Broken Monitoring ,Ignite
Observability will not fix your Broken Monitoring ,IgniteKris Buytaert
167 views20 slides
Infrastructure as Code Patterns by
Infrastructure as Code PatternsInfrastructure as Code Patterns
Infrastructure as Code PatternsKris Buytaert
117 views53 slides
From devoops to devops 13 years of (not) learning by
From devoops to devops 13 years of (not) learningFrom devoops to devops 13 years of (not) learning
From devoops to devops 13 years of (not) learningKris Buytaert
185 views40 slides
10 Years of #devopsdays weirdness by
10 Years of #devopsdays weirdness10 Years of #devopsdays weirdness
10 Years of #devopsdays weirdnessKris Buytaert
400 views20 slides
Continuous Infrastructure First Ignite Edition by
Continuous Infrastructure First  Ignite EditionContinuous Infrastructure First  Ignite Edition
Continuous Infrastructure First Ignite EditionKris Buytaert
476 views20 slides

More from Kris Buytaert(10)

Years of (not) learning , from devops to devoops by Kris Buytaert
Years of (not) learning , from devops to devoopsYears of (not) learning , from devops to devoops
Years of (not) learning , from devops to devoops
Kris Buytaert65 views
Observability will not fix your Broken Monitoring ,Ignite by Kris Buytaert
Observability will not fix your Broken Monitoring ,IgniteObservability will not fix your Broken Monitoring ,Ignite
Observability will not fix your Broken Monitoring ,Ignite
Kris Buytaert167 views
Infrastructure as Code Patterns by Kris Buytaert
Infrastructure as Code PatternsInfrastructure as Code Patterns
Infrastructure as Code Patterns
Kris Buytaert117 views
From devoops to devops 13 years of (not) learning by Kris Buytaert
From devoops to devops 13 years of (not) learningFrom devoops to devops 13 years of (not) learning
From devoops to devops 13 years of (not) learning
Kris Buytaert185 views
10 Years of #devopsdays weirdness by Kris Buytaert
10 Years of #devopsdays weirdness10 Years of #devopsdays weirdness
10 Years of #devopsdays weirdness
Kris Buytaert400 views
Continuous Infrastructure First Ignite Edition by Kris Buytaert
Continuous Infrastructure First  Ignite EditionContinuous Infrastructure First  Ignite Edition
Continuous Infrastructure First Ignite Edition
Kris Buytaert476 views
Looking back at 5 years of #cfgmgmtcamp by Kris Buytaert
Looking back at 5 years of #cfgmgmtcampLooking back at 5 years of #cfgmgmtcamp
Looking back at 5 years of #cfgmgmtcamp
Kris Buytaert625 views
The Return of the Dull Stack Engineer by Kris Buytaert
The Return of the Dull Stack EngineerThe Return of the Dull Stack Engineer
The Return of the Dull Stack Engineer
Kris Buytaert2.4K views
Looking back at 7.5 years of Devopsdays , DOd PDX by Kris Buytaert
Looking back at 7.5 years of Devopsdays , DOd PDXLooking back at 7.5 years of Devopsdays , DOd PDX
Looking back at 7.5 years of Devopsdays , DOd PDX
Kris Buytaert463 views
Devopsdays Amsterdam 2017 Keynote, looking back at 5 years of AMS by Kris Buytaert
Devopsdays Amsterdam 2017 Keynote, looking back at 5 years of AMSDevopsdays Amsterdam 2017 Keynote, looking back at 5 years of AMS
Devopsdays Amsterdam 2017 Keynote, looking back at 5 years of AMS
Kris Buytaert772 views

Recently uploaded

PRODUCT PRESENTATION.pptx by
PRODUCT PRESENTATION.pptxPRODUCT PRESENTATION.pptx
PRODUCT PRESENTATION.pptxangelicacueva6
18 views1 slide
Data Integrity for Banking and Financial Services by
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial ServicesPrecisely
29 views26 slides
Melek BEN MAHMOUD.pdf by
Melek BEN MAHMOUD.pdfMelek BEN MAHMOUD.pdf
Melek BEN MAHMOUD.pdfMelekBenMahmoud
17 views1 slide
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...James Anderson
126 views32 slides
MVP and prioritization.pdf by
MVP and prioritization.pdfMVP and prioritization.pdf
MVP and prioritization.pdfrahuldharwal141
37 views8 slides
Mini-Track: Challenges to Network Automation Adoption by
Mini-Track: Challenges to Network Automation AdoptionMini-Track: Challenges to Network Automation Adoption
Mini-Track: Challenges to Network Automation AdoptionNetwork Automation Forum
17 views27 slides

Recently uploaded(20)

Data Integrity for Banking and Financial Services by Precisely
Data Integrity for Banking and Financial ServicesData Integrity for Banking and Financial Services
Data Integrity for Banking and Financial Services
Precisely29 views
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson126 views
Unit 1_Lecture 2_Physical Design of IoT.pdf by StephenTec
Unit 1_Lecture 2_Physical Design of IoT.pdfUnit 1_Lecture 2_Physical Design of IoT.pdf
Unit 1_Lecture 2_Physical Design of IoT.pdf
StephenTec15 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi139 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab23 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker48 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn26 views
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana17 views
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 by IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院

Help , My Datacenter is on fire

  • 1. The great fire of 2021 Kris Buytaert @krisbuytaert
  • 2. @krisbuytaert March 10 , 2021 ● 03:59 Incoming Phone call ● “Our Datacenter is On fire” ● 5 minutes of waking up.. going downstairs trying to realize what was just said to me, partial panic , partial .. it can’t be THAT bad
  • 4. @krisbuytaert Kris Buytaert ● I used to be a Dev, ● Then Became an Op ● CTO and Open Source Consultant @inuits.eu ● Everything is a freaking DNS Problem ● Evangelizing devops ● Organiser of #devopsdays, #cfgmgmtcamp, #loadays, ….
  • 5. @krisbuytaert Impact : ● 3.6 million websites across 464.000 distinct domains ● https://news.netcraft.com/archives/2021/03/10/ovh-fire.html
  • 6. @krisbuytaert Immediate Assesment ? ● What Customer Facing Production services have we lost ? ● What Internal Production services have we lost ? – Do we need any of those to recover ? ● What are the priorities ? – Which services ● 24/7 money generating platforms ● Office hours tooling – What parts of these services ● Basic Service, no redundancy
  • 7. @krisbuytaert Our Immediate visible Impact ● Multiple customers failed over automatically to the 2ndary DC ● 1 VIP address failed to move ● 1 platform was spread over SBG2 & SBG1 ● 1 DR site destroyed
  • 8. @krisbuytaert Failing VIP ● 1 VIP address failed to move (API call failed) ● Manual action to use a different public IP for the loadbalancer ● Updated DNS records ● Except for the 1 domain the customer controlled. ● This cost us about 60 minutes before we could reach someone with the right credentials to update the dns ● Some customer devices had the VIP address hardcoded :( – Manual interventions needed :(
  • 9. @krisbuytaert Customer Impact ● 1 platform was spread over SBG2 & SBG1 ● Rebootstrapped the platform on a different ISP, restored the backup ● At 0900 am platform was ready for use, ● We were improving performance of the platform ● Customer wouldn’t have known if we didn’t tell them
  • 10. @krisbuytaert Total Loss: ● We lost 13 physical servers ● 135 vm’s
  • 11. @krisbuytaert Root Cause ● Klaba does not say the UPS is the definite cause. "We don’t have all the answers today," he said. The OVHcloud staff responded to alarms at 11.42 pm on Tuesday, but the affected part of the data center had already filled with smoke: "Two minutes after, they took the decision to leave, because it was too dangerous." ● The firefighter's thermal cameras found UPS7 and UPS8 on fire, but further data will be extracted from on-site cameras: "We have 300 cameras in Strasbourg," said Klaba. "We expect to have all the answers about how it started. We will give you all the information." ● https://www.datacenterdynamics.com/en/news/ovh-fire-octave-klaba-says-ups-systems- were-ablaze/
  • 12. @krisbuytaert Our Impact NextCloud ● Gitlab ● Lots of development environments ● Lots of Test nodes ● Partial K8s clusters
  • 13. @krisbuytaert We survived ● 18 hours of work ● 2 engineers at night ● 4-5 during daytime ● Recovery ● Communication ● Restoring Backups
  • 14. @krisbuytaert At 0910 ● Another customer seemed to have issues accessing some files ● We had overlooked 1 gluster cluster that seemed healthy but wasn’t ● Switched to a single node mount point (don’t try this at home)
  • 16. @krisbuytaert Phase II ● After 0900 we started restoring our internal tools from backup ● Verified all the secondary nodes would be making backups ● Verified we had sufficient diskspace left ● “Shoemaker always wears the worst shoes” ● Challenge : Hardware !
  • 17. @krisbuytaert Hardware Landrush ● Everybody needed new hardware ● We had spare hardware ready for a new platform in the wrong DC of a different supplier . Temporary OK ● We spun up new hardware at a different supplier
  • 18. @krisbuytaert Wrong decisions ● On day 2 we wasted at least 6 hours trying spinning up a new supplier ● Figuring out their network and redundancy strategies was not the right time ● We could scale on the existing suppliers
  • 19. @krisbuytaert Phase III ● Restoring the rest of the services ● What boxen would we be getting back ? ● What boxen were foobar ? ● (in the end we only got 3 physical servers back which were in SBG3), the rest was lost ● Priority on making sure all pipeline promotions of actively developed platforms were running again ● Hardware availability
  • 20. @krisbuytaert The day after ● New DR platform where needed ● Inventory of what besides production was impacted (lots of development boxen, partial clusters) ● Prioritization of what needs to be 100% back first ● What hardware do we have / need additionally
  • 21. @krisbuytaert Your NEW DR Plan ● You have just lost your DR site ● You need to plan again – Do we take the risk for now ? – Do we respin a new DR ? ● IAC + Datasync makes DR Trivial
  • 22. @krisbuytaert What saved us ? ● Real Infrastructure as Code ● Architecture ● Backups ● Fast escalation
  • 23. @krisbuytaert Real Infrastructure as Code ● Desired State => Puppetize all the things ● 100% control , no 3rd parties involved – Provision vm – Deploy applications – Deploy database schemas ● No handovers , 1 person can deploy this, no other teams involved ● Exported resources for Loadbalancers (haproxy), Monitoring (icinga), Databases (mysql)etc – Heavy automation ● All deployed versions are available in (yum) repositories ● Everything is a Pipeline Puppet, Hiera, DNS, ... (jenkins)
  • 24. @krisbuytaert CloudNative vs Cloudnaive ● We don’t own hardware ● Baremetal on demand at $ < AWS – Hetzner / ovh / ... ● Spin up 120 seconds, decomission when unneeded. ● vm definitions are in code ● Ansible to bootstrap, Puppet for Desired state
  • 25. @krisbuytaert Multi Cloud ● OVH ● Hetzner ● ... ● Workload is spread ● Customer DR is in other Supplier
  • 26. @krisbuytaert Multi Datacenter ● OVH – SBG, GRA, RBX, .. ● Hetzner – FSN X, HEL
  • 27. @krisbuytaert CloudNative ? ● Critical Customer services are build redundantly, (corosync, haproxy, mysql, elastic) ● Losing a bare metal should doesnot have an impact ● We can rebootstrap our nodes (kickstart+ puppet + ansible) ● All UGC data we know about is backed up (rsnapshot)
  • 28. @krisbuytaert Cloud Agnostic ● A vm is a vm is a vm ● Vendor Specific Features – e.g VIP addresses are abstracted in code
  • 29. @krisbuytaert Dedicated stacks per Customer/Project ● Single Purpose vm’s ● Single Purpose Database Clusters ● Single Purpose Storage Clusters ● Single Purpose Loadbalancers => Reduced impact, Reduced complexity
  • 30. @krisbuytaert Clusters are Multi DC ● Depending on scale , preference on HA vs actual Loadbalancing ● Pinning of resources within 1 DC ● Lessons Learned: we need Multi Campus
  • 31. @krisbuytaert HA Applications ● No local file usage – If local files -> {gluster,drbd,...} ● MySQL replication ● Elastic multiple nodes (if we can’t generate the data) ● Multiple instances of rabbitmq
  • 32. @krisbuytaert HA Storage ● VM’s on Local Disk – 100% puppetized => no differences in config/ software deployment – “Disposable” ● Storage distributed – Gluster, DRBD, (Ceph) ● (No Raid)
  • 33. @krisbuytaert Backups ● Are both on site ● And off Site
  • 35. @krisbuytaert Lessons Learned ● Make sure you can always provision more resources on multiple providers ● A Datacenter is not always Datacenter, even with different buildings you need Campus redundancy ● Fix the shoemaker problem ● Everything IS a fscking dns problem – U WANT control over the DNS of the production sites you run ● Or an MTU problem
  • 36. @krisbuytaert Thnx Thnx to everyone who helped out during the outage !
  • 38. @krisbuytaert Contact Inuits Inuits Essensteenweg 31 Essensteenweg 31 Brasschaat Brasschaat Belgium Belgium 891.514.231 891.514.231 +32 475 961221 +32 475 961221 Kris Buytaert Kris.Buytaert@inuits.eu Kris Buytaert Kris.Buytaert@inuits.eu Further Reading Further Reading @krisbuytaert @krisbuytaert http://www.krisbuytaert.be/blog/ http://www.krisbuytaert.be/blog/ https://inuits.eu/ https://inuits.eu/