Presentation by András Temesváry of Booking.com, "The Hourly Network Outage", as presented on 13 April 2023 at the Site Reliability Engineering NL MeetUp.
2. Who am I?
- DC automation engineer
- For some time labelled as “network engineer”
- Writing “code” (more so duct taping things together) to solve problems
- Python enthusiast
3. Terminologies
TOR - Top of Rack
OOB - Out of band
Server role - workload type
Role Owner - team responsible for a given role
5. The problems of managing network devices at scale (at least, some of them)
- We have thousands of network devices
- Multi vendor environment (2-3)
- Network device lifespan can reach 10+ years (long tail)
- Version differences between different install batches
- Some tools only work with recent software features
- Need to maintain a sufficient level of security / compliance
- We want to use new network features, and constantly run into weird bugs
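The version spread across install batches can be made visible with a simple inventory roll-up. The following is a minimal sketch, assuming a hypothetical inventory of hostname / vendor / version records and a hypothetical target release per vendor. None of these names or values come from the talk.

```python
from collections import defaultdict

# Hypothetical inventory records; the talk does not describe the real data model.
INVENTORY = [
    {"hostname": "tor-a1", "vendor": "vendor-x", "version": "4.2.1"},
    {"hostname": "tor-a2", "vendor": "vendor-x", "version": "4.0.3"},
    {"hostname": "tor-b1", "vendor": "vendor-y", "version": "19.4"},
    {"hostname": "tor-b2", "vendor": "vendor-y", "version": "21.2"},
]

# Hypothetical target release per vendor.
TARGET = {"vendor-x": "4.2.1", "vendor-y": "21.2"}


def parse_version(text):
    """Turn '4.2.1' into (4, 2, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def upgrade_candidates(inventory, target):
    """Group devices running something older than the vendor's target release."""
    behind = defaultdict(list)
    for device in inventory:
        wanted = parse_version(target[device["vendor"]])
        if parse_version(device["version"]) < wanted:
            behind[device["vendor"]].append(device["hostname"])
    return dict(behind)
```

Real vendor version strings are rarely this clean, so the parser would likely need per-vendor rules.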
7. The problems of upgrading network devices
- ISSU / SSU is more of a marketing term
- We have to reboot the actual devices
- No redundancy at the TOR layer (unless the rack has redundant TORs)
- Vendors are releasing new software several times a year
- Do you really need to upgrade? Likely.
10. The (overly) naïve approach for automation
- Built a UI to request consent from all server owners in a rack
- Maintenances had to be scheduled manually by network engineers
- If / when consent was given, the maintenance had to be run manually (we had actually automated the upgrade process, but it still required manual execution)
- Lots of toil to schedule maintenances
- Server owners just ignored the emails 🤷
12. “It's easier to ask forgiveness than to get permission”
Grace Hopper
13. The assertive approach
- Tell, don’t ask. “Maintenance will happen at @timestamp”
- Automate the end-to-end process - no humans should be involved
- Build in sufficient emergency brakes
- Communicate all details to your customers
- Allow customers to interact with you via APIs
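The "tell, don't ask" announcement and the emergency brakes can be sketched as follows. This is an illustration only: the `Maintenance` record, the `on_hold` flag, the global stop switch and the 7-day minimum notice are all assumptions, since the talk does not describe the actual brakes or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Maintenance:
    device: str
    starts_at: datetime
    announced_at: datetime
    on_hold: bool = False  # set by a role owner through the (hypothetical) API


def announcement(m):
    """Tell, don't ask: state when the maintenance will happen."""
    return f"Maintenance on {m.device} will happen at {m.starts_at.isoformat()}"


def may_execute(m, now, global_stop=False, min_notice=timedelta(days=7)):
    """Return (allowed, reason); any brake that trips blocks the run."""
    if global_stop:
        return False, "global stop switch is engaged"
    if m.on_hold:
        return False, "role owner placed the maintenance on hold"
    if m.starts_at - m.announced_at < min_notice:
        return False, "customers did not get enough notice"
    if now < m.starts_at:
        return False, "scheduled start not reached yet"
    return True, "ok"
```

Returning a reason alongside the decision makes it easy to communicate to customers why a maintenance did not run.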
14. The components
- HTTP API / Database
- Scheduler
- Maintenance Execution
- Upgrade Schedule Builder
16. Release V1 - 2018Q1
- Starting small: 2 upgrades per day
- Pre-built static list of maintenances (runs out)
- Lots of safeguards!
- Only PROD TORs
17. Release V2 - 2019Q1
- 8 upgrades a day (hourly)
- Fully automated scheduler (does not run out)
- Only the next 7 days are fixed, the rest is fluid
- Single DC on a given day
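The V2 rules (8 hourly slots a day, one DC per day, the next 7 days fixed and everything later fluid) can be sketched as a rebuildable schedule. This is a rough sketch under stated assumptions, not the scheduler from the talk: the 09:00 window start, the 14-day horizon, the "oldest pending device picks the DC" rule and the data shapes are all hypothetical.

```python
from datetime import date, datetime, time, timedelta

SLOTS_PER_DAY = 8   # from the slide: 8 upgrades a day, one per hour
FIRST_HOUR = 9      # assumption: the talk does not say when the window opens
FROZEN_DAYS = 7     # from the slide: only the next 7 days are fixed


def build_schedule(pending, today, existing=None, horizon_days=14):
    """Assign pending devices to hourly slots, one data centre per day.

    pending:  list of (device, dc) tuples still waiting for an upgrade
    existing: previously published {datetime: (device, dc)} schedule
    """
    existing = existing or {}
    frozen_until = today + timedelta(days=FROZEN_DAYS)
    # Keep what was published inside the frozen window; past slots and
    # fluid slots beyond the window are dropped and rebuilt.
    schedule = {slot: entry for slot, entry in existing.items()
                if today <= slot.date() < frozen_until}
    taken = {device for device, _ in schedule.values()}
    queue = [item for item in pending if item[0] not in taken]

    for offset in range(horizon_days):
        day = today + timedelta(days=offset)
        if day < frozen_until and existing:
            continue  # on a rebuild, frozen days stay exactly as published
        if not queue:
            break
        dc = queue[0][1]  # the day belongs to the DC of the oldest pending device
        todays = [item for item in queue if item[1] == dc][:SLOTS_PER_DAY]
        for hour, item in enumerate(todays, start=FIRST_HOUR):
            schedule[datetime.combine(day, time(hour))] = item
            queue.remove(item)
    return schedule
```

Running the builder again each day with the previous output as `existing` lets a fluid day become fixed as soon as it enters the 7-day window.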
18. Release V3 - 2020Q2
- Support for the OOB environment beyond PROD
- Different environments can run in parallel
- 15 PROD, 31 OOB upgrades a day
- Single availability zone in a given week
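Per-environment daily caps combined with a one-zone-per-week rule could look like the sketch below. The caps are taken from the slide; the zone names, the rotation by ISO week number and the tuple layout of `pending` are assumptions made for illustration.

```python
from datetime import date

# From the slide: 15 PROD and 31 OOB upgrades a day, running in parallel.
DAILY_CAP = {"PROD": 15, "OOB": 31}
# Hypothetical availability zone names; the talk does not list them.
ZONES = ["az-1", "az-2", "az-3"]


def zone_for(day):
    """Rotate through the zones by ISO week so a whole week targets one zone."""
    return ZONES[day.isocalendar()[1] % len(ZONES)]


def plan_day(pending, day):
    """Pick the day's upgrades: only the week's zone, capped per environment.

    pending: list of (device, environment, zone) tuples
    """
    zone = zone_for(day)
    plan = {env: [] for env in DAILY_CAP}
    for device, env, device_zone in pending:
        if device_zone == zone and env in plan and len(plan[env]) < DAILY_CAP[env]:
            plan[env].append(device)
    return plan
```

Because each environment has its own cap and its own list, filling the PROD quota never blocks OOB upgrades on the same day.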
19. Release V4 - 2021Q1
- Re-factored schedule generator
- Many new environments (PCI, CORP, etc.)
- Support for non-TOR switches
- Theoretical maximum of 76 upgrades per day in total
20. Release V5 - 2023Q1
- Adding more environments
- Improved pre- and post-flight checks during execution
- At this point we’re just pushing the needle to reach 100% coverage
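Pre- and post-flight checks around an upgrade can be sketched as a small wrapper. The talk does not list the actual checks, so the check functions, the status names and the `fake_upgrade` step below are hypothetical placeholders.

```python
def run_maintenance(device, pre_checks, upgrade, post_checks):
    """Run an upgrade only if all pre-flight checks pass, then verify the result.

    Each check is a callable taking the device name and returning True or False.
    """
    failed = [check.__name__ for check in pre_checks if not check(device)]
    if failed:
        return {"status": "skipped", "failed": failed}  # device left untouched

    upgrade(device)

    failed = [check.__name__ for check in post_checks if not check(device)]
    if failed:
        return {"status": "needs_attention", "failed": failed}
    return {"status": "done", "failed": []}


# Hypothetical checks and upgrade step, for illustration only.
UPGRADED = []


def has_console_access(device):
    return True


def servers_drained(device):
    return device != "tor-busy"


def runs_target_version(device):
    return True


def fake_upgrade(device):
    UPGRADED.append(device)
```

A failed pre-flight check skips the device without touching it, while a failed post-flight check flags it for a human, which keeps the two failure modes clearly apart.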
27. What contributed to the success
- SRE culture / philosophy reached the company in 2016
- The Google global Chubby planned outage story: “The network is too reliable”
- Outage budget well communicated by leadership - core part of company culture
- Building failure-resistant systems became a core objective in Tech
28. The future of maintenances
- We’re actively working on applying the same automation framework to any change (not just upgrades)
- Outsourcing execution logic to the teams (it’s their business what and how they run)
- Centralising all network changes into a single system
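One way to hand execution logic to the teams while keeping a single central system is a handler registry. This is a sketch of that general pattern only; the talk does not say how the framework will be opened up, and the change type names and handlers here are invented.

```python
HANDLERS = {}  # change type -> callable supplied by the owning team


def register(change_type):
    """Decorator a team uses to plug its own execution logic into the framework."""
    def wrap(func):
        HANDLERS[change_type] = func
        return func
    return wrap


def execute(change_type, device):
    """Framework side: look up the team's handler and run it."""
    if change_type not in HANDLERS:
        raise KeyError(f"no execution logic registered for {change_type!r}")
    return HANDLERS[change_type](device)


# Hypothetical handlers owned by two different teams.
@register("os_upgrade")
def os_upgrade(device):
    return f"upgraded {device}"


@register("config_push")
def config_push(device):
    return f"pushed config to {device}"
```

The framework keeps ownership of scheduling, communication and brakes, and only calls `execute` when a slot comes up.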