Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Orphans, Corruption, Careful Write, and Logging


Published on

"Orphans, Corruption, Careful Write, and Logging", by Ann W. Harrison and Jim Starkey. A frequent question on the Firebird Support list is "Gfix thinks my database is corrupt. How can I fix it?" The best answer may be to fix gfix itself so it reports benign errors differently from actual corruption. The most common errors reported by gfix can simply be ignored. From the beginning of InterBase to now, gfix is the least interesting of the utilities, so it's had little attention. However it requires a great deal of knowledge of database internals, so it's not easily replaced by a separately developed tool. So, from 1984 until now, gfix has reported major corruption and benign errors as if they were all the same.
What's a benign error? Usually, it's an artifact from the sudden death of a Firebird server that leaves obsolete record versions or even whole empty pages orphaned - neither released nor in use. Orphans represent lost space, but have no other damaging effect on the database. Orphaned record versions and pages occur because Firebird carefully orders its page writes to avoid the need for before or after logs. Database management systems that rely on logging for durability and recovery write each piece of data twice - once to a log and later to the database. Careful write requires discipline in programming. In 1984, careful write had performance advantages that more than made up for the need for careful programming.
This talk and paper explore the cause of "orphan" pages and record versions, Firebird's careful write I/O architecture, and the trade-offs between careful write and logging today.

Published in: Software
  • Be the first to comment

Orphans, Corruption, Careful Write, and Logging

  1. 1. Orphans, Corruption, Careful Write, and Logging, or Gfix says my database is CORRUPT or Database Integrity - then, now, future Ann W. Harrison James A. Starkey
  2. 2. A Word of Thanks to our Sponsors
  3. 3. And to Vlad Khorsun Core 4562 Some errors reported by database validation (such as orphan pages and a few others) are not critical for database, i.e. don’t affect query results andor logical consistency of user data. Such defects should not be counted as errors to not scare users. Fixed 28 Sept 2014
  4. 4. Questions?
  5. 5. MVCC – Quick Review Read consistency, undo, and update concurrency provided in one durable mechanism. Data is never overwritten. Update or Delete creates new record version linked to old. Transaction reads the version committed when it started (or at the instant for Read Committed) Each record chain has at most one uncommitted version. Rollback removes uncommitted version. .
  6. 6. What does Gfix do? Reads entire database verifying internal consistency: Of interest now: Allocated pages are in use Unused pages are not allocated Primary record links to Fragments Back versions Before 28 September 2014, any problem was an error
  7. 7. What are Orphans? And what do they have to do with this? (not what you think)
  8. 8. Database Integrity Disasters occur (more often circa 1985) Database System, O/S, Network, Power, Disk Classic Solutions Write Ahead Log Shadow Pages After image Log Firebird Solution Careful write, multi-version records Write once
  9. 9. Disk Failure InterBase V1 Journal After image Abandoned by Borland Shadow Complete copy on separate disk Better done in RAID
  10. 10. Careful Write Order writes to disk (fsync) Database is always consistent on disk Rule: write the object pointed to then the pointer Record examples: record before index, back version before main, fragment before main record Page examples: mark as allocated before using, release before marking free Requires disciplined development
  11. 11. Record Before Index Indexes are always considered “noisy” Start at the first value below desired value Stop at next value above Index will be written before commit completes After crash: New uncommitted records not in index Uncommitted deleted records stay in index Gfix reports index corruption
  12. 12. Back Version Before Record When the back version is on a different page Write the back version first Write the record pointing to the back version next After crash: Old record still exists New back version wastes space Gfix reports orphan back versions
  13. 13. Fragment Before Record Record bigger than page size Write the last page of the record Write the next to last, point to the last Write other pages in reverse order, pointing to prior Write the first bit, pointing to next page After crash: Record fragments are unusable space Gfix reports orphan record fragments
  14. 14. Page Allocation Allocation: Mark page as allocated on PIP Format page Enter page in table, index, or internal structure After crash: Page is unusable Gfix reports orphan page
  15. 15. Page Release Release Remove page from table or index Mark page as unallocated After crash: Page is unusable Gfix reports orphan page
  16. 16. Precedence If index page A points to a record on page B, page B must be written before page A. If the record on page B has a back version on page C, page C must be written before page B. Firebird maintains a complete graph of precedence. If a cache conflict requires writing page A, C and B must be written first. If the graph develops a cycle, all pages must be written.
  17. 17. Downsides of Careful Write Writes are random. Precedence may cause multiple writes. Cycles cause multiple writes.
  18. 18. Design is Balance Performance Recoverability
  19. 19. Disaster Recovery From DBMS crash From OS crash From CPU crash From Network failure From Disk Crash
  20. 20. Antediluvian Technology Long Term Journaling Before and after page images are journalled Required a Tape Drive (now extinct) Recovery Roll forward from dump Rollback from the current disk image Performance bounded by tape speed
  21. 21. JRD’s Across History 1980 1990 2000 2010 2014 Rdb/ELN InterBase Firebird Netfrastructure Falcon Development NuoDB Closed Source Open Source AmorphousDB
  22. 22. Interbase 1.0, 1985 (Actually gds/Galaxy 1.0) MVCC + Careful Write Disk shadowing (raid not invented yet) GLTJ: Long term journal server Dumped database to journal when enabled Journalled page changes (or full page) GLTJ could be shared among databases Rarely, if ever, used Performance constrained by disk speed
  23. 23. Falcon MVCC in memory Disk used as back-fill for memory Serial log for recovery Single log per database Page changes posted to log Log written with non-buffered writes Pages written when convenient Performance constrained by CPU
  24. 24. NuoDB DB layered on distributed objects called Atoms Atoms replicate peer to peer MVCC at Atom level Transaction nodes pump SQL transactions Storage managers persist serialized Atoms Storage managers use serial log for replication messages
  25. 25. NuoDB Transactions DBA has control over commit policy: Commit when transaction node sends commit messages Commit when <n> storage managers acknowledge commit messages Commit when <n> storage managers have written commit messages to serial log
  26. 26. Performance Implications Disk Based MVCC Many disk writes per transaction Batch commit is possible Performance is dozens of transactions per second with forced write Higher transaction rate with buffered writes, but at risk of data loss SSDs are a big win
  27. 27. Performance Implications Serial Log With fine granularity threading and 8 cores, benchmarked at 22,000 TPS Serial log management is critical Requires substantial non-interlocked data structures
  28. 28. Performance Implications NuoDB Bench marked at 3,000,000 TPS running on 40 commodity processors Read only TPS is theoretically infinite
  29. 29. Questions?