Data Integrity - Patryk Hes

Data Integrity
Patryk Hes
Site Reliability Engineer - Software Engineer
Adapted from slides by Kristina Bennett and Raymond Blum.

Hope is not a strategy
● Reliance on data continues to grow in every dimension
○ Greater volumes of information
○ Growing numbers of platforms and uses
● The importance of preserving that information grows as well
● The way we preserve data has to evolve as fast as the data
○ Challenges of scale, both digital and physical
○ Generations of platforms and storage hardware

● Durability - The data exists
● Integrity - The data is exactly what was written
● Availability - The data can be read on demand
Three aspects of data preservation

Integrity before availability
Availability of 99.99%?
That's a total of less than 53 minutes of downtime
in a year.
Generally, users recover, and life goes on.
Integrity of 99.99%?
For 2 GB, that's 200+ KB corrupted or missing
Consequences are often long lasting, and possibly
unrecoverable.
● A document has lost several pages
● An executable is useless
● A database is corrupt
● A video is garbled

My data is replicated. Job done, right?
That's wonderful!
Redundant instances are indispensable, for so many reasons.
But, no, they're not backups, and they're not enough for data preservation.
● Deletions and corruptions can replicate as fast as legitimate writes.
● Same software stack. Same storage devices. Same vulnerabilities.
● Are all the instances at the same site?

So, we need a backup plan, then?
No.
We need a restore plan.
OK, yes, backups are a part of that.
But, don't let that distract you.

"Nobody cares about backups. What people want are restores!"*
● Are you sure you backed up the right data?
● How will you ﬁnd that backup when you need it?
● How can you recover the data back into the service?
● How long will the recovery take?
● How much extra storage and processing will you need for the recovery?
● Are you sure the recovery process won't break or corrupt any other part of the
service?
None of this was covered in the backup plan!
*Raymond Blum

You need these answers now!
Or, a couple days after the push for that new feature, when you
realize it's been accidentally deleting chunks of critical data,
and you need that data back NOW. You could learn the
answers then.
By the way, it's a holiday weekend, and the only person who
understood the backups quit two months ago.
Your choice.

Consider all of your recovery options
Restores are the goal. Backups are just taxes we pay to fund that goal.
If there are alternate sources of funding, use them!
● Reconstruction from canonical sources
● Stateless services
● Magic...?

Leverage your service structure
● Redundancy and sharding are useful for backups and recovery, too!
○ Take backups from different instances
■ Spread the resource load
■ Protect against problems/outages on a single instance
○ Sharded backups are more manageable
● Know how to perform a selective recovery
○ You might only need to recover a single entity
○ You might need to prioritize some entities over others

Practice, practice, practice
Prediction:
Your restore plan will not work the ﬁrst time you try it.
This is not a surprise.
But, the way it breaks might be.

Automate! Automate! Automate!
If you can automate it, you should.
● No need to relearn the process each time.
● Anybody can run the automation.
○ Not just that one person who understood the process.
The one who quit two months ago.
● No trying to decipher incomplete and out-of-date notes
● No scrambling for answers when the crisis hits

Test early, test often
Now that you've automated, you can run it any time. So, run it all the time!
Your service is growing and evolving. There will be
● New features
● Schema changes
● Platform migrations
● Increasing scale
● Resharding
● The next big thing
Test often. Detect early. Adapt quickly.

Summary
● Small data loss can have big consequences.
● Availability is important, but it's only as good as the data that's available
○ I.e., your data needs durability and integrity
● Plan.
● Practice.
● Automate, test, and adapt.
● Avoid horror, panic, and tragedy.

Further reading
An entire chapter on this topic!

Data Integrity - Patryk Hes

More Related Content

What's hot

Similar to Data Integrity - Patryk Hes

Recently uploaded

Data Integrity - Patryk Hes