Data Integrity
Patryk Hes
Site Reliability Engineer - Software Engineer
Adapted from slides by Kristina Bennett and Raymond Blum.
Hope is not a strategy
● Reliance on data continues to grow in every dimension
○ Greater volumes of information
○ Growing numbers of platforms and uses
● The importance of preserving that information grows as well
● The way we preserve data has to evolve as fast as the data
○ Challenges of scale, both digital and physical
○ Generations of platforms and storage hardware
● Durability - The data exists
● Integrity - The data is exactly what was written
● Availability - The data can be read on demand
Three aspects of data preservation
Integrity before availability
Availability of 99.99%?
That's a total of less than 53 minutes of downtime
in a year.
Generally, users recover, and life goes on.
Integrity of 99.99%?
For 2 GB, that's 200+ KB corrupted or missing
Consequences are often long lasting, and possibly
unrecoverable.
● A document has lost several pages
● An executable is useless
● A database is corrupt
● A video is garbled
My data is replicated. Job done, right?
That's wonderful!
Redundant instances are indispensable, for so many reasons.
But, no, they're not backups, and they're not enough for data preservation.
● Deletions and corruptions can replicate as fast as legitimate writes.
● Same software stack. Same storage devices. Same vulnerabilities.
● Are all the instances at the same site?
So, we need a backup plan, then?
No.
We need a restore plan.
OK, yes, backups are a part of that.
But, don't let that distract you.
"Nobody cares about backups. What people want are restores!"*
● Are you sure you backed up the right data?
● How will you find that backup when you need it?
● How can you recover the data back into the service?
● How long will the recovery take?
● How much extra storage and processing will you need for the recovery?
● Are you sure the recovery process won't break or corrupt any other part of the
service?
None of this was covered in the backup plan!
*Raymond Blum
You need these answers now!
Or, a couple days after the push for that new feature, when you
realize it's been accidentally deleting chunks of critical data,
and you need that data back NOW. You could learn the
answers then.
By the way, it's a holiday weekend, and the only person who
understood the backups quit two months ago.
Your choice.
So, what can I do?
Consider all of your recovery options
Restores are the goal. Backups are just taxes we pay to fund that goal.
If there are alternate sources of funding, use them!
● Reconstruction from canonical sources
● Stateless services
● Magic...?
Leverage your service structure
● Redundancy and sharding are useful for backups and recovery, too!
○ Take backups from different instances
■ Spread the resource load
■ Protect against problems/outages on a single instance
○ Sharded backups are more manageable
● Know how to perform a selective recovery
○ You might only need to recover a single entity
○ You might need to prioritize some entities over others
Practice, practice, practice
Prediction:
Your restore plan will not work the first time you try it.
This is not a surprise.
But, the way it breaks might be.
Automate! Automate! Automate!
If you can automate it, you should.
● No need to relearn the process each time.
● Anybody can run the automation.
○ Not just that one person who understood the process.
The one who quit two months ago.
● No trying to decipher incomplete and out-of-date notes
● No scrambling for answers when the crisis hits
Test early, test often
Now that you've automated, you can run it any time. So, run it all the time!
Your service is growing and evolving. There will be
● New features
● Schema changes
● Platform migrations
● Increasing scale
● Resharding
● The next big thing
Test often. Detect early. Adapt quickly.
Summary
● Small data loss can have big consequences.
● Availability is important, but it's only as good as the data that's available
○ I.e., your data needs durability and integrity
● Plan.
● Practice.
● Automate, test, and adapt.
● Avoid horror, panic, and tragedy.
Further reading
An entire chapter on this topic!

Data Integrity - Patryk Hes

  • 1.
    Data Integrity Patryk Hes SiteReliability Engineer - Software Engineer Adapted from slides by Kristina Bennett and Raymond Blum.
  • 2.
    Hope is nota strategy ● Reliance on data continues to grow in every dimension ○ Greater volumes of information ○ Growing numbers of platforms and uses ● The importance of preserving that information grows as well ● The way we preserve data has to evolve as fast as the data ○ Challenges of scale, both digital and physical ○ Generations of platforms and storage hardware
  • 3.
    ● Durability -The data exists ● Integrity - The data is exactly what was written ● Availability - The data can be read on demand Three aspects of data preservation
  • 4.
    Integrity before availability Availabilityof 99.99%? That's a total of less than 53 minutes of downtime in a year. Generally, users recover, and life goes on. Integrity of 99.99%? For 2 GB, that's 200+ KB corrupted or missing Consequences are often long lasting, and possibly unrecoverable. ● A document has lost several pages ● An executable is useless ● A database is corrupt ● A video is garbled
  • 5.
    My data isreplicated. Job done, right? That's wonderful! Redundant instances are indispensable, for so many reasons. But, no, they're not backups, and they're not enough for data preservation. ● Deletions and corruptions can replicate as fast as legitimate writes. ● Same software stack. Same storage devices. Same vulnerabilities. ● Are all the instances at the same site?
  • 6.
    So, we needa backup plan, then? No. We need a restore plan. OK, yes, backups are a part of that. But, don't let that distract you.
  • 7.
    "Nobody cares aboutbackups. What people want are restores!"* ● Are you sure you backed up the right data? ● How will you find that backup when you need it? ● How can you recover the data back into the service? ● How long will the recovery take? ● How much extra storage and processing will you need for the recovery? ● Are you sure the recovery process won't break or corrupt any other part of the service? None of this was covered in the backup plan! *Raymond Blum
  • 8.
    You need theseanswers now! Or, a couple days after the push for that new feature, when you realize it's been accidentally deleting chunks of critical data, and you need that data back NOW. You could learn the answers then. By the way, it's a holiday weekend, and the only person who understood the backups quit two months ago. Your choice.
  • 9.
  • 10.
    Consider all ofyour recovery options Restores are the goal. Backups are just taxes we pay to fund that goal. If there are alternate sources of funding, use them! ● Reconstruction from canonical sources ● Stateless services ● Magic...?
  • 11.
    Leverage your servicestructure ● Redundancy and sharding are useful for backups and recovery, too! ○ Take backups from different instances ■ Spread the resource load ■ Protect against problems/outages on a single instance ○ Sharded backups are more manageable ● Know how to perform a selective recovery ○ You might only need to recover a single entity ○ You might need to prioritize some entities over others
  • 12.
    Practice, practice, practice Prediction: Yourrestore plan will not work the first time you try it. This is not a surprise. But, the way it breaks might be.
  • 13.
    Automate! Automate! Automate! Ifyou can automate it, you should. ● No need to relearn the process each time. ● Anybody can run the automation. ○ Not just that one person who understood the process. The one who quit two months ago. ● No trying to decipher incomplete and out-of-date notes ● No scrambling for answers when the crisis hits
  • 14.
    Test early, testoften Now that you've automated, you can run it any time. So, run it all the time! Your service is growing and evolving. There will be ● New features ● Schema changes ● Platform migrations ● Increasing scale ● Resharding ● The next big thing Test often. Detect early. Adapt quickly.
  • 15.
    Summary ● Small dataloss can have big consequences. ● Availability is important, but it's only as good as the data that's available ○ I.e., your data needs durability and integrity ● Plan. ● Practice. ● Automate, test, and adapt. ● Avoid horror, panic, and tragedy.
  • 16.
    Further reading An entirechapter on this topic!