Real-World Resiliency
in the Face of a Datacenter Disaster
HOSTED BY
Stanislav Komanec
VP of Engineering, Kiwi.com
Presenter
Stanislav Komanec
VP of Engineering Platform
My past:
➔ Backend developer
➔ Technical team lead
➔ Head of Platform
➔ VP of Engineering
Agenda
1. Kiwi.com Intro/Overview
2. OVHcloud Fire
3. Kiwi.com Impact
4. Kiwi.com’s Preparedness
5. Best Practices for Resiliency
6. Choosing/Implementing Resilient Systems
7. Final remarks
8. Q&A
About Kiwi.com
➔ Virtual global supercarrier
➔ Seamless travel experience
➔ Connecting “A” to “B”
➔ Virtual interlining
Kiwi.com History
2012
Skypicker founded
2014
Acquisition of
whichairline.com
2016
Rebranded to Kiwi.com
2019
General Atlantic on board
500+
People in R&D
31
Average age
66+
Nationalities
140+
Dogs
Kiwi.com: The Team
100K+
Seats weekly
50K+
Bookings weekly
5M+
Searches weekly
900+
Partners
Kiwi.com: Business Numbers
➔ Our technology unlocks our key features
➔ Best inventory in the world
➔ Great search
➔ Features like multi-city, nomad, good deals...
Kiwi.com Key Features
➔ Cloud native
◆ Infrastructure as code
➔ Micro-services oriented architecture
➔ 600+ microservices, aligned in specific domains
Kiwi.com under the Hood – Architecture
➔ Main database – Scylla
➔ 400K+ reads/s, 200K+ writes/s; at that rate the whole DB is rewritten roughly once every 10 days
➔ Infrastructure
◆ OVH as the main bare-metal provider
◆ Megaport
◆ GCP as the main cloud provider – web services
Kiwi.com under the Hood – Infrastructure
Geographically Distributed Datacenters
[Map: main database locations, >250 km, >500 km, and >550 km apart]
OVHcloud Fire
Events of 10 Mar 2021
➔ Strasbourg, France
➔ Wednesday, 10 March 2021
➔ Fire breaks out 00:47 CET
OVHcloud Fire
OVHcloud’s Strasbourg SBG2 Datacenter engulfed
in flames.
(Image: SDIS du Bas Rhin)
OVHcloud’s Strasbourg SBG2 Datacenter
the next morning. (Image via Twitter)
➔ Strasbourg datacenter impact
■ SBG2: totally consumed
■ SBG1: 4 of 12 rooms gutted
■ SBG3 & SBG4: proactively taken offline
➔ Internet impact (as per Netcraft)
■ 3.6 million websites
■ 464,000 domains
■ 1 in 50 sites in all of .fr TLD
Damage Assessment
“Websites that went offline during the fire included online
banks, webmail services, news sites, online shops selling PPE
to protect against coronavirus, and several countries’
government websites.”
— Netcraft
Kiwi.com Impact
Response to the Fire
[Map: OVHcloud Strasbourg fire relative to Kiwi.com datacenter locations; distances of 65 km, >500 km, and >550 km shown]
Monitoring the Problem
➔ 10 of 30 servers are suddenly unavailable
➔ Latencies briefly rise until the unavailable servers are taken out of the cluster
➔ Requests per second per server: note how some drop towards zero, then blip out of existence
Timeline of Fire
00:47 CET Fire breaks out in OVHcloud Strasbourg SBG2
01:12 CET Kiwi.com nodes in Strasbourg start falling off the cluster
01:15 CET All 10 Strasbourg nodes offline; traffic diverted to the two other Kiwi.com datacenters (20 servers remaining)
02:23 CET Production operational; some services around the main database need manual tweaks
08:54 CET Tweaks deployed; we are fully operational
➔ Degraded performance on some services
◆ Trying to rebalance load
➔ Moving some affected services to a different location
We were up & running
Kiwi.com Impact
Kiwi.com: Our primary Database...
Kiwi.com Impact (in theory)
What if...
What if... Kiwi.com Customer’s Impact
➔ Customer perspective
■ They could not use the service
■ They could not change bookings
■ We could not process changes in itineraries
● Customers might be at the airports waiting for flights
What if... Kiwi.com Technical Impact
➔ Micro-services – domino effect
➔ Other teams
■ Issues would cascade: to mitigate them, we would need to stop the services in a specific order
➔ Inconsistencies
■ We might end up with a lot of inconsistencies, even for current customers
➔ Customer support overloaded
What if... How to Handle the Situation
➔ Stop services, in the right order
➔ Spin off new cluster
➔ Let it sync
➔ Run data refreshers
➔ Slowly start web services for customers
What if... Estimation
➔ Revenue loss
➔ Reputation loss
■ Customers would buy elsewhere
➔ Inconsistencies
■ A lot of manual work
Kiwi.com Preparedness
Incident response
➔ Choice of technology
■ High availability architecture
■ Data replication for resiliency
➔ Choice of cloud vendor
■ Geographic distribution of datacenters
■ Capability to manage SLAs
➔ Having procedures in place
➔ Right environment
Long Before the Fire Broke Out
➔ Requirements
■ High resiliency – to provide the best value to customers
■ Low latency – to enable products like Nomad, multi-city search...
➔ History
■ PostgreSQL databases - consistency issues
■ Cassandra - performance issues
What experience did we have?
➔ Peer-to-peer leaderless architecture
■ No single point-of-failure
➔ User-controllable replication factor (RF)
■ RF=1; We have 3 data centres
➔ Per-operation tunable consistency levels
■ One, Quorum, All, etc. (see the sketch below)
➔ Automatic multi-datacenter replication
■ Keeps different sites in sync
➔ Rack-aware and datacenter-aware
■ Ensures replicas are physically and
geographically distributed
Scylla’s High Availability Architecture
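The points above can be made concrete with a short, hedged sketch using the Python cassandra-driver (Scylla-compatible). The contact point, datacenter names, keyspace, and table below are illustrative placeholders, and reading the slide's "RF=1" with three data centres as one replica per datacenter is an assumption, not a confirmed detail of Kiwi.com's setup.

```python
# Hedged sketch: multi-datacenter replication and per-operation consistency
# with the Python cassandra-driver (works against Scylla). All names below
# (contact point, datacenters, keyspace, table) are illustrative only.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Prefer replicas in the local datacenter, with token-aware routing to the
# node owning the data (peer-to-peer, no single point of failure).
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc_a")),
    consistency_level=ConsistencyLevel.QUORUM,   # default for this session
)
cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()

# NetworkTopologyStrategy keeps a copy in every listed datacenter; with one
# replica per site across three sites, losing a whole datacenter still
# leaves two copies of every row.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc_a': 1, 'dc_b': 1, 'dc_c': 1
    }
""")

# Consistency is tunable per statement: ONE for latency-sensitive reads,
# QUORUM or ALL where stronger guarantees matter.
fast_read = SimpleStatement(
    "SELECT * FROM demo.bookings WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
```

Under these assumptions, the loss of an entire site (as happened in Strasbourg) leaves every row reachable from the remaining datacenters, which matches the behaviour described in the timeline.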
We Needed a Plan
Beginnings of a Plan
Goals:
3 datacenters
3 cities
geographically
separated
➔ You need to unlock technology advantages via a great team
➔ The best way is to set up culture & procedures
■ Creates the right environment
Team & Process Plan
➔ Proper monitoring in place
➔ Proper alerting (see the sketch below)
➔ Incident management system
➔ Postmortems
Incident Management
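As an illustration of "proper alerting", here is a minimal, hedged sketch of a node-availability check. It assumes a Prometheus server scraping the database nodes; the endpoint URL, job label, and thresholds are hypothetical, not Kiwi.com's actual configuration, and in practice this logic would live in alerting rules and page through the incident management system.

```python
# Hedged sketch of a node-availability check behind "proper alerting".
# Endpoint URL, job label, and thresholds are illustrative only.
import requests

PROMETHEUS = "http://prometheus.internal:9090"   # hypothetical endpoint
EXPECTED_NODES = 30                              # the talk's 30-node cluster
PAGE_BELOW = 20                                  # page if a third of the nodes vanish

def live_node_count() -> int:
    """Count database nodes Prometheus can still scrape ('up' is built in)."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": 'sum(up{job="scylla"})'},   # hypothetical job label
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0

def check_and_page() -> None:
    live = live_node_count()
    if live < PAGE_BELOW:
        # In a real setup this fires through the incident management system
        # (pager, on-call rotation), not stdout.
        print(f"ALERT: only {live}/{EXPECTED_NODES} database nodes reachable")

if __name__ == "__main__":
    check_and_page()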
➔ Learning from each incident
➔ Making our systems more robust
➔ Building the culture
■ Present your mistakes
■ Wheels of misfortune
Blameless Culture
Best Practices for Resiliency
➔ Critical path
■ How to find it?
■ How to measure it?
➔ Implementation phase
➔ Proactive vs Reactive approach
Where to Start – Identification
➔ Engineers’ perspective
■ Love automation
■ Don’t like a lot of manual steps
Where to Start – Proactive vs Reactive
➔ Business perspective
■ Cost efficiency
■ Risk factors
Where to Start – Proactive vs Reactive
➔ Invest where it matters
■ From time to time it’s about overscaling the whole datacenter, not just an instance or two
■ Critical path
Where to Start – Overscale?
Choosing/Implementing
Resilient Systems
➔ Overscale
■ Along the defined critical path
➔ Fallback solutions (e.g. in networking)
➔ Measure
➔ Run the tests
➔ Example: chaos-monkey approach (sketched below)
Proactive Solutions
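To make the chaos-monkey bullet concrete, here is a minimal sketch under stated assumptions: a Kubernetes cluster, the official kubernetes Python client, and a non-production namespace and label selector that are purely illustrative. The talk does not describe Kiwi.com's actual chaos tooling.

```python
# Hedged chaos-monkey-style sketch: delete one random pod in a non-production
# namespace, let the orchestrator recover it, then measure the impact.
# Namespace and label selector are hypothetical placeholders.
import random
from kubernetes import client, config

def kill_random_pod(namespace: str = "staging", selector: str = "app=search") -> None:
    config.load_kube_config()          # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector).items
    if not pods:
        return
    victim = random.choice(pods)
    print(f"Chaos test: deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(victim.metadata.name, namespace)
    # Afterwards, check the measurements the slide calls for: did latency,
    # error rate, or data consistency degrade while the replacement started?

if __name__ == "__main__":
    kill_random_pod()
```

Run regularly against the critical path, this kind of test turns the "measure" and "run the tests" bullets into routine practice rather than a one-off exercise.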
➔ Get the great plan
➔ Keep it well tested and up to date
➔ For example: runbooks
Reactive Solution
The Culture
Get the Great Team
➔ It’s important to build the right
environment
➔ Thank you to all members of the
team
Get the Greatest Team
➔ Find partners who consider your problems their own
■ OVH
■ GCP
■ Scylla
● Initial setup, great support over the years
■ Megaport
■ Cloudflare…
Get the Great Partners
Final remarks
A Good Year (2006)
Uncle Henry:
“It's inevitable to lose now and
again. The trick is not to make a
habit of it.”
Takeaways
➔ Outages are inevitable. It's just up to us to be prepared
➔ Plan for the worst, hope for the best
➔ Get the right balance between proactivity and reactivity
➔ Get the great team & cultivate a blameless culture
■ Drives innovation most effectively
Lessons Learned
Q&A