The hourly network outage
Andras Temesvary | 2023-04-13
Who am I?
- DC automation engineer
- For some time labelled as “network engineer”
- Writing “code” (more so duct taping things together) to solve problems
- Python enthusiast
Terminologies
TOR - Top of Rack
OOB - Out of band
Server role - workload type
Role Owner - team responsible for a given role
[Diagram labels: TOR, OOB, DATA]
The problems of managing network devices at scale (at least, some of them)
- We have thousands of network devices
- Multi vendor environment (2-3)
- Network device lifespan can reach 10+ years (long tail)
- Version differences between different install batches
- Some tools only work with recent software features
- Need to maintain a sufficient level of security / compliance
- We want to use new network features, and constantly run into weird bugs
CONCLUSION #1:
WE NEED TO CONTROL NETWORK SOFTWARE LIFECYCLE
The problems of upgrading network devices
- ISSU / SSU is more of a marketing term
- We have to reboot the actual devices
- No redundancy at TOR layer (unless redundant TOR)
- Vendors are releasing new software several times a year
- Do you really need to upgrade? Likely.
CONCLUSION #2:
UPGRADING TORs WILL BE IMPACTING
CONCLUSION #3:
WE HAVE TO UPGRADE REGULARLY & CONTINUOUSLY
SOLUTION:
Automate the upgrade process!
The (overly) naïve approach for automation
- Built a UI to request consent for all server owners in a rack
- Maintenances had to be scheduled manually by network engineers
- If / when consent was given, maintenance had to be run manually (we had actually automated the upgrade process, but it still required manual execution)
- Lots of toil to schedule maintenances
- Server owners just ignored the emails 🤷
“It's easier to ask forgiveness than to get permission”
Grace Hopper
The assertive approach
- Tell, don’t ask. “Maintenance will happen at @timestamp”
- Automate the end-to-end process - no humans should be involved
- Build in sufficient emergency brakes
- Communicate all details to your customers
- Allow customers to interact with you via APIs
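A minimal sketch of the "tell, don't ask" idea with emergency brakes, in Python. The slides do not describe the real data model or which brakes exist, so everything here is a hypothetical illustration: the `Maintenance` class, the `postponed` flag (standing in for a customer acting through an API), the global brake, and the failure threshold are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Maintenance:
    device: str
    starts_at: datetime
    # Hypothetical: set when a role owner postpones via the customer-facing API.
    postponed: bool = False


def announcement(m: Maintenance) -> str:
    """Tell, don't ask: state when the maintenance will happen."""
    return f"Maintenance on {m.device} will happen at {m.starts_at.isoformat()}"


def may_start(m: Maintenance, global_brake: bool,
              recent_failures: int, max_failures: int = 1) -> bool:
    """Return True only if no (assumed) emergency brake is engaged."""
    if global_brake:      # operator stopped all maintenances
        return False
    if m.postponed:       # customer intervened through the API
        return False
    # Stop the whole pipeline once too many recent upgrades have failed.
    return recent_failures < max_failures
```

The point of the sketch is that no human has to approve a run, but several independent conditions can still stop it.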
The components
- HTTP API / Database
- Scheduler
- Maintenance Execution
- Upgrade Schedule Builder
The execution workflow
1. Pre-flight checks
2. Start
3. Isolate device
4. Wait
5. Upload+reboot
6. Wait
7. Post-flight checks
8. Trigger discovery
9. Finish
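The nine steps of the execution workflow can be sketched as an ordered pipeline that stops at the first failing step. This is only an illustration in Python, not the actual implementation: `run_maintenance`, `StepFailed`, and the placeholder `ok` step are hypothetical names, and real steps would talk to devices and inventory systems.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[str], bool]]


class StepFailed(Exception):
    """Raised when a workflow step reports failure."""


def run_maintenance(device: str, steps: List[Step]) -> List[str]:
    """Run steps in order; abort on the first failure.

    Returns the names of the completed steps.
    """
    completed: List[str] = []
    for name, step in steps:
        if not step(device):
            raise StepFailed(f"{name} failed on {device}; completed: {completed}")
        completed.append(name)
    return completed


def ok(device: str) -> bool:
    """Placeholder step that always succeeds (assumption for the sketch)."""
    return True


# Step names follow the slide; every implementation is a stand-in.
WORKFLOW: List[Step] = [
    ("pre-flight checks", ok),
    ("start", ok),
    ("isolate device", ok),
    ("wait", ok),
    ("upload+reboot", ok),
    ("wait", ok),
    ("post-flight checks", ok),
    ("trigger discovery", ok),
    ("finish", ok),
]
```

Aborting on the first failed step means a device that fails its pre-flight checks is never isolated or rebooted.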
Release V1 - 2018Q1
- Starting small: 2 upgrades per day
- Pre-built static list of maintenances (runs out)
- Lots of safeguards!
- Only PROD TORs
Release V2 - 2019Q1
- 8 upgrades a day (hourly)
- Fully automated scheduler (does not run out)
- Only next 7 days are fixed, the rest is fluid
- Single DC on a given day
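The V2 constraints (8 hourly slots a day, one DC per day, the next 7 days fixed) could be sketched roughly as below. This is an assumption-laden Python illustration, not the real scheduler: the function names, the 09:00 first slot, and the round-robin choice of DC are all made up for the example.

```python
from datetime import date, datetime, time, timedelta
from typing import Dict, List, Tuple

SLOTS_PER_DAY = 8     # from the slide: 8 upgrades a day, hourly
FIRST_SLOT_HOUR = 9   # assumption: the slides do not say when slots start
FIXED_DAYS = 7        # from the slide: only the next 7 days are fixed


def build_schedule(pending: Dict[str, List[str]], start: date,
                   days: int) -> List[Tuple[datetime, str, str]]:
    """Assign pending devices to hourly slots, touching one DC per day."""
    queues = {dc: list(devices) for dc, devices in pending.items()}
    schedule: List[Tuple[datetime, str, str]] = []
    for offset in range(days):
        candidates = [dc for dc in sorted(queues) if queues[dc]]
        if not candidates:
            break
        dc = candidates[offset % len(candidates)]  # assumed round-robin
        day = start + timedelta(days=offset)
        for slot in range(SLOTS_PER_DAY):
            if not queues[dc]:
                break
            when = datetime.combine(day, time(hour=FIRST_SLOT_HOUR + slot))
            schedule.append((when, dc, queues[dc].pop(0)))
    return schedule


def is_fixed(when: datetime, today: date) -> bool:
    """True if the slot falls inside the fixed window and must not move."""
    return today <= when.date() < today + timedelta(days=FIXED_DAYS)
```

Because the schedule is rebuilt from the list of pending devices, it does not run out the way a pre-built static list does; only slots for which `is_fixed` is true would have to be preserved across rebuilds.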
Release V3 - 2020Q2
- Support for OOB environment beyond PROD
- Different environments can run in parallel
- 15 PROD, 31 OOB upgrades a day
- Single availability zone in a given week
Release V4 - 2021Q1
- Refactored schedule generator
- Many new environments (PCI, CORP, etc.)
- Support for non-TOR switches
- Theoretical maximum 76 upgrades per day in total
Release V5 - 2023Q1
- Adding more environments
- Improved pre- and post-flight checks during execution
- At this point we’re just pushing the needle to reach 100% coverage
What contributed to the success
- SRE culture / philosophy reached the company in 2016
- The Google global Chubby planned outage story: “The network is too reliable”
- Outage budget well communicated by leadership - core part of company culture
- Building failure-resistant systems became a core objective in Tech
The future of maintenances
- We’re actively working on applying the same automation framework to any change (not just upgrades)
- Outsourcing execution logic to the teams (it’s their business what and how they run)
- Centralising all network changes into a single system
Questions?
Thank you!

The hourly network outage - Booking.com.pdf
