Real-World Resiliency
in the Face of a Datacenter Disaster
HOSTED BY
Stanislav Komanec
VP of Engineering, Kiwi.com
Presenter
Stanislav Komanec
VP of Engineering Platform
My past:
➔ Backend developer
➔ Technical team lead
➔ Head of Platform
➔ VP of Engineering
Agenda
1. Kiwi.com Intro/Overview
2. OVHcloud Fire
3. Kiwi.com Impact
4. Kiwi.com’s Preparedness
5. Best Practices for Resiliency
6. Choosing/Implementing Resilient Systems
7. Final remarks
8. Q&A
About Kiwi.com
➔ Virtual global supercarrier
➔ Seamless travel experience
➔ Connecting “A” to “B”
➔ Virtual interlining
Kiwi.com History
2012
Skypicker founded
2014
Acquisition of
whichairline.com
2016
Rebranded to Kiwi.com
2019
General Atlantic on board
500+
People in R&D
31
Average age
66+
Nationalities
140+
Dogs
Kiwi.com: The Team
100K+
Seats weekly
50K+
Bookings weekly
5M+
Searches weekly
900+
Partners
Kiwi.com: Business Numbers
➔ Our technology unlocks our key features
➔ Best inventory in the world
➔ Great search
➔ Features like multi-city, nomad, good deals...
Kiwi.com Key Features
➔ Cloud native
◆ Infrastructure as code
➔ Micro-services oriented architecture
➔ 600+ microservices, aligned in specific domains
Kiwi.com under the Hood – Architecture
➔ Main database – Scylla
➔ 400K+ reads/s, 200K+ writes/s; at that rate the whole DB is rewritten roughly once every 10 days
➔ Infrastructure
◆ OVH as the main bare-metal provider
◆ Megaport
◆ GCP as the main cloud provider – web services
Kiwi.com under the Hood – Infrastructure
Geographically Distributed Datacenters
[Map: main database locations, >250 km, >500 km, and >550 km apart]
OVHcloud Fire
Events of 10 Mar 2021
➔ Strasbourg, France
➔ Wednesday, 10 March 2021
➔ Fire breaks out 00:47 CET
OVHcloud Fire
OVHcloud’s Strasbourg SBG2 Datacenter engulfed
in flames.
(Image: SDIS du Bas Rhin)
OVHcloud’s Strasbourg SBG2 Datacenter
the next morning. (Image via Twitter)
➔ Strasbourg datacenter impact
■ SBG2: totally consumed
■ SBG1: 4 of 12 rooms gutted
■ SBG3 & SBG4: proactively taken offline
➔ Internet impact (as per Netcraft)
■ 3.6 million websites
■ 464,000 domains
■ 1 in 50 sites in all of .fr TLD
Damage Assessment
“Websites that went offline during the fire included online
banks, webmail services, news sites, online shops selling PPE
to protect against coronavirus, and several countries’
government websites.”
— Netcraft
Kiwi.com Impact
Response to the Fire
[Map: OVHcloud Strasbourg fire relative to Kiwi.com datacenter locations; distances of 65 km, >500 km, and >550 km shown]
Monitoring the Problem
➔ 10 of 30 servers are suddenly unavailable
➔ Latencies briefly rise until the unavailable servers are taken out of the cluster
➔ Requests per second per server: note how some drop towards zero, then blip out of existence
Timeline of Fire
00:47 CET Fire breaks out in OVHcloud Strasbourg SBG2
01:12 CET Kiwi.com nodes in Strasbourg start falling off the cluster
01:15 CET All 10 Strasbourg nodes offline; traffic diverted to the two other Kiwi.com datacenters (20 servers remaining)
02:23 CET Production operational; some services around the main database need manual tweaks
08:54 CET Tweaks deployed; we are fully operational
➔ Degraded performance on some services
◆ Trying to rebalance load
➔ Moving some affected services to a different location
We were up & running
Kiwi.com Impact
Kiwi.com: Our primary Database...
Kiwi.com Impact (in theory)
What if...
What if... Kiwi.com Customer’s Impact
➔ Customer perspective
■ They could not use the service
■ They could not change bookings
■ We could not process changes in itineraries
● Customers might be at the airports waiting for flights
What if... Kiwi.com Technical Impact
➔ Micro-services – domino effect
➔ Other teams
■ Issues would cascade: to mitigate them, we would need to stop the services in a specific order
➔ Inconsistencies
■ We might end up with a lot of inconsistencies, even for current customers
➔ Customer support overloaded
What if... How to Handle the Situation
➔ Stop services, in the right order
➔ Spin off new cluster
➔ Let it sync
➔ Run data refreshers
➔ Slowly start web services for customers
What if... Estimation
➔ Revenue loss
➔ Reputation loss
■ Customers would buy elsewhere
➔ Inconsistencies
■ A lot of manual work
Kiwi.com Preparedness
Incident response
➔ Choice of technology
■ High availability architecture
■ Data replication for resiliency
➔ Choice of cloud vendor
■ Geographic distribution of datacenters
■ Capability to manage SLAs
➔ Having procedures in place
➔ Right environment
Long Before the Fire Broke Out
➔ Requirements
■ High resiliency – to provide the best value to customers
■ Low latency – to enable products like Nomad, multi-city search...
➔ History
■ PostgreSQL databases - consistency issues
■ Cassandra - performance issues
What experience did we have?
➔ Peer-to-peer leaderless architecture
■ No single point-of-failure
➔ User-controllable replication factor (RF)
■ RF=1; We have 3 data centres
➔ Per-operation tunable consistency levels
■ One, Quorum, All, etc. (see the sketch below)
➔ Automatic multi-datacenter replication
■ Keeps different sites in sync
➔ Rack-aware and datacenter-aware
■ Ensures replicas are physically and
geographically distributed
Scylla’s High Availability Architecture
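The points above can be made concrete with a short, hedged sketch using the Python cassandra-driver (Scylla-compatible). The contact point, datacenter names, keyspace, and table below are illustrative placeholders, and reading the slide's "RF=1" with three data centres as one replica per datacenter is an assumption, not a confirmed detail of Kiwi.com's setup.

```python
# Hedged sketch: multi-datacenter replication and per-operation consistency
# with the Python cassandra-driver (works against Scylla). All names below
# (contact point, datacenters, keyspace, table) are illustrative only.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Prefer replicas in the local datacenter, with token-aware routing to the
# node owning the data (peer-to-peer, no single point of failure).
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc_a")),
    consistency_level=ConsistencyLevel.QUORUM,   # default for this session
)
cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# NetworkTopologyStrategy keeps a copy in every listed datacenter; with one
# replica per site across three sites, losing a whole datacenter still
# leaves two copies of every row.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_a': 1, 'dc_b': 1, 'dc_c': 1
    }
""")

# Consistency is tunable per statement: ONE for latency-sensitive reads,
# QUORUM or ALL where stronger guarantees matter.
fast_read = SimpleStatement(
    "SELECT * FROM demo.bookings WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
```

Under these assumptions, the loss of an entire site (as happened in Strasbourg) leaves every row reachable from the remaining datacenters, which matches the behaviour described in the timeline.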
We Needed a Plan
Beginnings of a Plan
Goals:
3 datacenters
3 cities
geographically
separated
➔ You need to unlock technology advantages via a great team
➔ The best way is to set up culture & procedures
■ Creates the right environment
Team & Process Plan
➔ Proper monitoring in place
➔ Proper alerting (see the sketch below)
➔ Incident management system
➔ Postmortems
Incident Management
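As an illustration of "proper alerting", here is a minimal, hedged sketch of a node-availability check. It assumes a Prometheus server scraping the database nodes; the endpoint URL, job label, and thresholds are hypothetical, not Kiwi.com's actual configuration, and in practice this logic would live in alerting rules and page through the incident management system.

```python
# Hedged sketch of a node-availability check behind "proper alerting".
# Endpoint URL, job label, and thresholds are illustrative only.
import requests

PROMETHEUS = "http://prometheus.internal:9090"   # hypothetical endpoint
EXPECTED_NODES = 30                              # the talk's 30-node cluster
PAGE_BELOW = 20                                  # page if a third of the nodes vanish

def live_node_count() -> int:
    """Count database nodes Prometheus can still scrape ('up' is built in)."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": 'sum(up{job="scylla"})'},   # hypothetical job label
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0

def check_and_page() -> None:
    live = live_node_count()
    if live < PAGE_BELOW:
        # In a real setup this fires through the incident management system
        # (pager, on-call rotation), not stdout.
        print(f"ALERT: only {live}/{EXPECTED_NODES} database nodes reachable")

if __name__ == "__main__":
    check_and_page()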
➔ Learning from each incident
➔ Making our systems more robust
➔ Building the culture
■ Present your mistakes
■ Wheels of misfortune
Blameless Culture
Best Practices for Resiliency
➔ Critical path
■ How to find it?
■ How to measure it?
➔ Implementation phase
➔ Proactive vs Reactive approach
Where to Start – Identification
➔ Engineers’ perspective
■ Love automation
■ Don’t like a lot of manual steps
Where to Start – Proactive vs Reactive
➔ Business perspective
■ Cost efficiency
■ Risk factors
Where to Start – Proactive vs Reactive
➔ Invest where it matters
■ From time to time it’s about overscaling the whole datacenter, not just an instance or two
■ Critical path
Where to Start – Overscale?
Choosing/Implementing
Resilient Systems
➔ Overscale
■ Along the defined critical path
➔ Fallback solutions (e.g. in networking)
➔ Measure
➔ Run the tests
➔ Example: chaos-monkey approach (sketched below)
Proactive Solutions
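To make the chaos-monkey bullet concrete, here is a minimal sketch under stated assumptions: a Kubernetes cluster, the official kubernetes Python client, and a non-production namespace and label selector that are purely illustrative. The talk does not describe Kiwi.com's actual chaos tooling.

```python
# Hedged chaos-monkey-style sketch: delete one random pod in a non-production
# namespace, let the orchestrator recover it, then measure the impact.
# Namespace and label selector are hypothetical placeholders.
import random
from kubernetes import client, config

def kill_random_pod(namespace: str = "staging", selector: str = "app=search") -> None:
    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    if not pods:
        return
    victim = random.choice(pods)
    print(f"Chaos test: deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    # Afterwards, check the measurements the slide calls for: did latency,
    # error rate, or data consistency degrade while the replacement started?

if __name__ == "__main__":
    kill_random_pod()
```

Run regularly against the critical path, this kind of test turns the "measure" and "run the tests" bullets into routine practice rather than a one-off exercise.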
➔ Get the great plan
➔ Keep it well tested and up to date
➔ For example: runbooks
Reactive Solution
The Culture
Get the Great Team
➔ It’s important to build the right
environment
➔ Thank you to all members of the
team
Get the Greatest Team
➔ Find partners who consider your problems their own
■ OVH
■ GCP
■ Scylla
● Initial setup, great support over the years
■ Megaport
■ Cloudflare…
Get the Great Partners
Final remarks
A Good Year (2006)
Uncle Henry:
“It's inevitable to lose now and
again. The trick is not to make a
habit of it.”
Takeaways
➔ Outages are inevitable. It's just up to us to be prepared
➔ Plan for the worst, hope for the best
➔ Get the right balance between proactivity and reactivity
➔ Get the great team & cultivate a blameless culture
■ Drives innovation most effectively
Lessons Learned
Q&A