Zero Downtime Migrations
at Scale
Aysylu Greenberg ∽ April 28, 2018 ∽ Medellín
Photo by Iván Erre Jota / CC BY-SA
Aysylu Greenberg
@aysylu22
Data Migration
Architecture Migration
We will be performing scheduled site
maintenance on Saturday from 3 am to 5 am!
New data storage and architecture
Zero-Downtime Migration for Backups
Considerations Before Migration
● Scale: O(100M) users with 1B+ backups, representing 1T+ objects
● No sharing, no search, no folders
● Mainly write traffic; read traffic is rare and high priority
● Spanner: global strong consistency, SQL
● Backups can get large
So how do you migrate all this data?
copy/paste...
Photo by Richard Evea / CC BY-SA 2.0
Data migration & Architecture migration
Data migration
● Dual writes
Are we storing data correctly?
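The dual-write step for data can be sketched roughly as follows. This is a minimal illustration, not the implementation from the talk: the `DualWriter` class, the dict-like stores, and the `mismatched_keys` check are all hypothetical names, and the old store is assumed to remain the source of truth while the new one is being populated.

```python
_MISSING = object()  # sentinel so a stored None is not confused with "absent"


class DualWriter:
    """Writes every record to the old store, then mirrors it to the new one.

    Both stores are assumed to be dict-like (hypothetical stand-ins for the
    real storage clients).
    """

    def __init__(self, old_store, new_store):
        self.old_store = old_store
        self.new_store = new_store
        self.mirror_failures = []

    def put(self, key, value):
        # The old store is still the source of truth: its errors propagate.
        self.old_store[key] = value
        try:
            self.new_store[key] = value
        except Exception as exc:
            # A failed mirror write is recorded but never fails the request.
            self.mirror_failures.append((key, repr(exc)))

    def mismatched_keys(self):
        """Answers "are we storing data correctly?" by comparing both stores."""
        return sorted(
            key
            for key, value in self.old_store.items()
            if self.new_store.get(key, _MISSING) != value
        )
```

Any key reported by `mismatched_keys()` points at a record that the new storage holds incorrectly or not at all, which is the signal to investigate before moving on.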
Architecture migration
● Dual writes
No effect on latency and error rates!
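One way to mirror traffic to the new stack without touching user-facing latency or error rates is to send the mirrored request off the request path. The sketch below assumes that approach; `ShadowingHandler` and the handler callables are hypothetical, and the work queue is unbounded here, which a real system would want to cap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ShadowingHandler:
    """Serves from the old stack and mirrors each request to the new stack."""

    def __init__(self, old_handler, new_handler, max_workers=4):
        self.old_handler = old_handler
        self.new_handler = new_handler
        self.shadow_errors = 0
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _shadow(self, request):
        try:
            self.new_handler(request)
        except Exception:
            # Errors from the new stack are counted, never shown to the user.
            with self._lock:
                self.shadow_errors += 1

    def handle(self, request):
        # Submitting does not wait for the new stack to finish.
        self._pool.submit(self._shadow, request)
        return self.old_handler(request)

    def close(self):
        # Waits for mirrored requests still in flight.
        self._pool.shutdown(wait=True)
```

The caller only ever sees the old stack's response, so a slow or failing new stack shows up in `shadow_errors` and in monitoring rather than in user-visible behavior.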
Data migration
● Backfill data
Do we understand all client behavior and adapt the data correctly?
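A backfill along these lines might look like the sketch below. The `backfill` function, the dict-like stores, and the `adapt` callback are assumptions made for illustration. The sketch also assumes that records already present in the new store were put there by dual writes in the new format, so they are newer and must not be overwritten.

```python
def backfill(old_store, new_store, adapt, batch_size=2):
    """Copy records missing from new_store, adapting them to the new format.

    Returns the number of records copied. Safe to re-run: records that are
    already present are skipped, so a second pass copies nothing.
    """
    copied = 0
    keys = sorted(old_store)
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            if key in new_store:
                continue  # written by dual writes; do not clobber it
            new_store[key] = adapt(old_store[key])
            copied += 1
    return copied
```

Keeping the backfill idempotent means it can be stopped and restarted at any point, which matters when the job runs for a long time alongside live traffic.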
Architecture migration
● Prove the stack
Is the response from the new system the same as from the old?
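Proving the stack can be approximated by running each request through both systems and recording every difference while still serving the old response. The following is a sketch under that assumption; `compare_responses` and the handler callables are hypothetical names.

```python
def compare_responses(request, old_handler, new_handler, mismatches):
    """Serve the old response and record any disagreement with the new stack.

    `mismatches` is a caller-owned list that collects
    (request, old_response, new_response_or_error) tuples.
    """
    old_response = old_handler(request)
    try:
        new_response = new_handler(request)
    except Exception as exc:
        # A crash in the new stack counts as a mismatch, not a user error.
        mismatches.append((request, old_response, f"error: {exc!r}"))
        return old_response
    if new_response != old_response:
        mismatches.append((request, old_response, new_response))
    return old_response
```

An empty `mismatches` list over a representative share of traffic is the evidence that the new system answers the same way as the old one.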
Data migration
● Learn the new storage
New storage mechanism or schema?
Architecture migration
● Harden the system
How do we get it production-ready to serve full load?
Data migration
● Migrate slowly
Validate, validate, validate
Resource constraints?
Scale migration
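Migrating slowly with validation after every step could be sketched as below. Everything here is an assumption for illustration: `migrate_in_batches`, the dict-like stores, JSON-serializable records, checksum comparison as the validation method, and a fixed pause between batches as a crude answer to resource constraints.

```python
import hashlib
import json
import time


def checksum(record):
    """Stable fingerprint of a JSON-serializable record."""
    encoded = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def migrate_in_batches(user_ids, old_store, new_store, batch_size,
                       pause_seconds=0.0):
    """Copy users batch by batch, validating each batch before continuing.

    Raises ValueError on the first batch that fails validation, so a problem
    stops the migration early instead of spreading to every user.
    """
    migrated = []
    for start in range(0, len(user_ids), batch_size):
        batch = user_ids[start:start + batch_size]
        for user_id in batch:
            new_store[user_id] = old_store[user_id]
        bad = [
            user_id for user_id in batch
            if checksum(new_store[user_id]) != checksum(old_store[user_id])
        ]
        if bad:
            raise ValueError(f"validation failed for {bad}")
        migrated.extend(batch)
        if pause_seconds > 0:
            time.sleep(pause_seconds)  # leave headroom for live traffic
    return migrated
```

Starting with a small `batch_size` and raising it only after several clean batches is one way to scale the migration while keeping the blast radius small.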
Architecture migration
● Roll out slowly
Scale carefully & proactively
So how do you migrate all this data?
Data migration
● Dual writes
● Backfill data
● Learn the new storage
● Migrate slowly
Architecture migration
● Dual writes
● Prove the stack
● Harden the system
● Roll out slowly
Zero-Downtime Migrations at Scale
FOCUS ON INTERMEDIATE STATE
● Prepare to write code for intermediate state
>>> Quality of code corresponds to the expected lifetime
● Migrate backends first
● Invest in visibility into system & migration state
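Visibility into migration state can start with something as small as an explicit per-user state and a summary of how many users sit in each one. The sketch below is illustrative only: the `MigrationState` values and the `summarize` helper are hypothetical and are not the states used in the talk.

```python
from collections import Counter
from enum import Enum


class MigrationState(Enum):
    # Hypothetical intermediate states a user can be in during the migration.
    NOT_STARTED = "not_started"
    DUAL_WRITING = "dual_writing"
    BACKFILLED = "backfilled"
    MIGRATED = "migrated"


def summarize(states):
    """Count users per state; `states` maps user id to MigrationState.

    States with no users are reported as 0 so dashboards keep a stable shape.
    """
    counts = Counter(states.values())
    return {state.value: counts.get(state, 0) for state in MigrationState}
```

Exporting such a summary as a metric makes the intermediate state observable: it shows at a glance how far the migration has progressed and whether any users are stuck.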
ROLL OUT INCREMENTALLY
● Validate scalability while affecting fewest users
● Decouple launch of services
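A common way to roll out incrementally while affecting the fewest users is deterministic percentage bucketing, sketched below. This is an assumed technique chosen for illustration, not necessarily the one used in the talk; `bucket` and `use_new_stack` are hypothetical names.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_stack(user_id: str, rollout_percent: int) -> bool:
    """True when this user falls inside the current rollout percentage.

    Because the bucket is derived from the id, a user who is included at
    10% stays included at 50%, so raising the percentage only adds users.
    """
    return bucket(user_id) < rollout_percent
```

Giving each service its own `rollout_percent` is one way to decouple launches: each service can be ramped up or rolled back without waiting on the others.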
VALIDATE & PRACTICE ROLLOUT
Gratitude
Steve Clark
Thomas Escobar
Matt Welsh
Ranjodh Mathial
Tatiana Marquez