The hourly network outage - Booking.com.pdf

The hourly network outage
Andras Temesvary | 2023-04-13

Who am I?
- DC automation engineer
- For some time labelled as “network engineer”
- Writing “code” (more so duct-taping things together) to solve problems
- Python enthusiast

Terminologies
TOR - Top of Rack
OOB - Out of band
Server role - workload type
Role Owner - team responsible for a given role

The problems of managing network devices at scale (at least, some of them)
- We have thousands of network devices
- Multi-vendor environment (2-3 vendors)
- Network device lifespan can reach 10+ years (long tail)
- Version differences between different install batches
- Some tools only work with recent software features
- Need to maintain a sufficient level of security / compliance
- We want to use new network features, and constantly run into weird bugs

CONCLUSION #1:
WE NEED TO CONTROL NETWORK SOFTWARE LIFECYCLE

The problems of upgrading network devices
- ISSU / SSU is more of a marketing term
- We have to reboot the actual devices
- No redundancy at TOR layer (unless redundant TOR)
- Vendors are releasing new software several times a year
- Do you really need to upgrade? Likely.

CONCLUSION #2:
UPGRADING TORs WILL BE IMPACTING
CONCLUSION #3:
WE HAVE TO UPGRADE REGULARLY & CONTINUOUSLY

SOLUTION:
Automate the upgrade process!

The (overly) naïve approach for automation
- Built a UI to request consent for all server owners in a rack
- Maintenances had to be scheduled manually by network engineers
- If / when consent was given, maintenance had to be run manually (we had actually automated the upgrade process, but execution still had to be triggered manually)
- Lots of toil to schedule maintenances
- Server owners just ignored the emails 🤷

“It's easier to ask forgiveness than to get permission”
Grace Hopper

The assertive approach
- Tell, don’t ask. “Maintenance will happen at @timestamp”
- Automate the end-to-end process - no humans should be involved
- Build in sufficient emergency brakes
- Communicate all details to your customers
- Allow customers to interact with you via APIs
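
A minimal sketch of two of the points above: an announcement that tells rather than asks, and an emergency brake that is checked before each run. Every name here (`Rack`, `announcement`, `may_proceed`, the `max_failures` threshold) is hypothetical; the slides do not describe the actual message format or brake conditions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Rack:
    tor: str                 # name of the top-of-rack switch
    role_owners: List[str]   # teams with servers in this rack


def announcement(rack: Rack, when: datetime) -> str:
    """Build a 'tell, don't ask' message for the role owners of a rack."""
    return (
        f"Maintenance on {rack.tor} will happen at {when.isoformat()}. "
        "This is a notification, not a request for consent."
    )


def may_proceed(global_stop: bool, failures_today: int, max_failures: int = 1) -> bool:
    """Emergency brake: refuse to run if a stop flag is set or if too
    many maintenances have already failed today (assumed conditions)."""
    return not global_stop and failures_today < max_failures
```

Keeping the brake as a pure function makes it cheap to evaluate right before every maintenance and easy to test on its own.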

The components
- HTTP API / Database
- Scheduler
- Maintenance Execution
- Upgrade Schedule Builder

The execution workflow
1. Pre-flight checks
2. Start
3. Isolate device
4. Wait
5. Upload + reboot
6. Wait
7. Post-flight checks
8. Trigger discovery
9. Finish
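
The nine steps can be sketched as an ordered list of named steps that stops at the first failure. This is an illustration only, not the actual implementation: `Maintenance`, `run_maintenance`, `MaintenanceAborted` and the placeholder step `ok` are hypothetical names, and a real step would talk to the device or to other systems.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class MaintenanceAborted(Exception):
    """Raised when a step fails and the maintenance must stop."""


@dataclass
class Maintenance:
    device: str
    log: List[str] = field(default_factory=list)


# A step takes the maintenance record and returns True on success.
Step = Callable[[Maintenance], bool]


def run_maintenance(m: Maintenance, steps: List[Tuple[str, Step]]) -> List[str]:
    """Run the steps in order and abort at the first one that fails."""
    for name, step in steps:
        if not step(m):
            m.log.append(f"FAILED: {name}")
            raise MaintenanceAborted(f"{m.device}: {name} failed")
        m.log.append(f"ok: {name}")
    return m.log


def ok(_: Maintenance) -> bool:
    # Placeholder that always succeeds.
    return True


WORKFLOW: List[Tuple[str, Step]] = [
    ("pre-flight checks", ok),
    ("start", ok),
    ("isolate device", ok),
    ("wait", ok),
    ("upload + reboot", ok),
    ("wait", ok),
    ("post-flight checks", ok),
    ("trigger discovery", ok),
    ("finish", ok),
]
```

Because the pre-flight checks come first, a failure there aborts the run before the device is isolated or rebooted.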

Release V1 - 2018Q1
- Starting small: 2 upgrades per day
- Pre-built static list of maintenances (runs out)
- Lots of safeguards!
- Only PROD TORs

Release V2 - 2019Q1
- 8 upgrades a day (hourly)
- Fully automated scheduler (does not run out)
- Only the next 7 days are fixed, the rest is fluid
- Single DC on a given day
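
The V2 rules can be sketched as a small schedule builder: hourly slots, at most 8 per day, one DC per day, and a flag marking whether a slot falls in the fixed 7-day window. This is a sketch under stated assumptions, not the real scheduler: `build_schedule`, the round-robin choice of DC, and the 09:00 start of the first slot are all assumptions the slides do not confirm.

```python
from datetime import date, datetime, time, timedelta
from typing import Dict, List, Tuple

SLOTS_PER_DAY = 8     # from the slide: 8 upgrades a day, one per hour
FIXED_DAYS = 7        # from the slide: only the next 7 days are fixed
FIRST_SLOT_HOUR = 9   # assumption: the slide does not say when slots start

Entry = Tuple[datetime, str, str, bool]  # (slot, dc, device, fixed)


def build_schedule(pending: Dict[str, List[str]], start: date, days: int) -> List[Entry]:
    """Assign pending devices to hourly slots, one DC per day.

    `pending` maps a DC name to the devices awaiting an upgrade there.
    """
    queues = {dc: list(devices) for dc, devices in pending.items()}
    schedule: List[Entry] = []
    for offset in range(days):
        # Only DCs that still have work are candidates for this day.
        candidates = sorted(dc for dc, queue in queues.items() if queue)
        if not candidates:
            break
        dc = candidates[offset % len(candidates)]  # assumed round-robin
        day = start + timedelta(days=offset)
        fixed = offset < FIXED_DAYS  # beyond the window the plan may still change
        for hour in range(SLOTS_PER_DAY):
            if not queues[dc]:
                break
            slot = datetime.combine(day, time(FIRST_SLOT_HOUR + hour))
            schedule.append((slot, dc, queues[dc].pop(0), fixed))
    return schedule
```

Rebuilding the schedule from the pending devices on every run is what lets the part beyond the fixed window stay fluid: only entries flagged as fixed would need to be preserved between runs.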

Release V3 - 2020Q2
- Support for the OOB environment beyond PROD
- Different environments can run in parallel
- 15 PROD, 31 OOB upgrades a day
- Single availability zone in a given week

Release V4 - 2021Q1
- Refactored schedule generator
- Many new environments (PCI, CORP, etc.)
- Support for non-TOR switches
- Theoretical maximum of 76 upgrades per day in total

Release V5 - 2023Q1
- Adding more environments
- Improved pre- and post-flight checks during execution
- At this point we’re just pushing the needle to reach 100% coverage

What contributed to the success
- SRE culture / philosophy reached the company in 2016
- The Google global Chubby planned outage story: “The network is too reliable”
- Outage budget well communicated by leadership - core part of company culture
- Building failure-resistant systems became a core objective in Tech

The future of maintenances
- We’re actively working on applying the same automation framework to any change (not just upgrades)
- Outsourcing execution logic to the teams (it’s their business what they run and how)
- Centralising all network changes into a single system
