More Related Content

Viewers also liked(19)


Orphans, Corruption, Careful Write, and Logging

  1. Orphans, Corruption, Careful Write, and Logging, or Gfix says my database is CORRUPT or Database Integrity - then, now, future Ann W. Harrison James A. Starkey
  2. A Word of Thanks to our Sponsors
  3. And to Vlad Khorsun Core 4562 Some errors reported by database validation (such as orphan pages and a few others) are not critical for database, i.e. don’t affect query results andor logical consistency of user data. Such defects should not be counted as errors to not scare users. Fixed 28 Sept 2014
  4. Questions?
  5. MVCC – Quick Review Read consistency, undo, and update concurrency provided in one durable mechanism. Data is never overwritten. Update or Delete creates new record version linked to old. Transaction reads the version committed when it started (or at the instant for Read Committed) Each record chain has at most one uncommitted version. Rollback removes uncommitted version. .
  6. What does Gfix do? Reads entire database verifying internal consistency: Of interest now: Allocated pages are in use Unused pages are not allocated Primary record links to Fragments Back versions Before 28 September 2014, any problem was an error
  7. What are Orphans? And what do they have to do with this? (not what you think)
  8. Database Integrity Disasters occur (more often circa 1985) Database System, O/S, Network, Power, Disk Classic Solutions Write Ahead Log Shadow Pages After image Log Firebird Solution Careful write, multi-version records Write once
  9. Disk Failure InterBase V1 Journal After image Abandoned by Borland Shadow Complete copy on separate disk Better done in RAID
  10. Careful Write Order writes to disk (fsync) Database is always consistent on disk Rule: write the object pointed to then the pointer Record examples: record before index, back version before main, fragment before main record Page examples: mark as allocated before using, release before marking free Requires disciplined development
  11. Record Before Index Indexes are always considered “noisy” Start at the first value below desired value Stop at next value above Index will be written before commit completes After crash: New uncommitted records not in index Uncommitted deleted records stay in index Gfix reports index corruption
  12. Back Version Before Record When the back version is on a different page Write the back version first Write the record pointing to the back version next After crash: Old record still exists New back version wastes space Gfix reports orphan back versions
  13. Fragment Before Record Record bigger than page size Write the last page of the record Write the next to last, point to the last Write other pages in reverse order, pointing to prior Write the first bit, pointing to next page After crash: Record fragments are unusable space Gfix reports orphan record fragments
  14. Page Allocation Allocation: Mark page as allocated on PIP Format page Enter page in table, index, or internal structure After crash: Page is unusable Gfix reports orphan page
  15. Page Release Release Remove page from table or index Mark page as unallocated After crash: Page is unusable Gfix reports orphan page
  16. Precedence If index page A points to a record on page B, page B must be written before page A. If the record on page B has a back version on page C, page C must be written before page B. Firebird maintains a complete graph of precedence. If a cache conflict requires writing page A, C and B must be written first. If the graph develops a cycle, all pages must be written.
  17. Downsides of Careful Write Writes are random. Precedence may cause multiple writes. Cycles cause multiple writes.
  18. Design is Balance Performance Recoverability
  19. Disaster Recovery From DBMS crash From OS crash From CPU crash From Network failure From Disk Crash
  20. Antediluvian Technology Long Term Journaling Before and after page images are journalled Required a Tape Drive (now extinct) Recovery Roll forward from dump Rollback from the current disk image Performance bounded by tape speed
  21. JRD’s Across History 1980 1990 2000 2010 2014 Rdb/ELN InterBase Firebird Netfrastructure Falcon Development NuoDB Closed Source Open Source AmorphousDB
  22. Interbase 1.0, 1985 (Actually gds/Galaxy 1.0) MVCC + Careful Write Disk shadowing (raid not invented yet) GLTJ: Long term journal server Dumped database to journal when enabled Journalled page changes (or full page) GLTJ could be shared among databases Rarely, if ever, used Performance constrained by disk speed
  23. Falcon MVCC in memory Disk used as back-fill for memory Serial log for recovery Single log per database Page changes posted to log Log written with non-buffered writes Pages written when convenient Performance constrained by CPU
  24. NuoDB DB layered on distributed objects called Atoms Atoms replicate peer to peer MVCC at Atom level Transaction nodes pump SQL transactions Storage managers persist serialized Atoms Storage managers use serial log for replication messages
  25. NuoDB Transactions DBA has control over commit policy: Commit when transaction node sends commit messages Commit when <n> storage managers acknowledge commit messages Commit when <n> storage managers have written commit messages to serial log
  26. Performance Implications Disk Based MVCC Many disk writes per transaction Batch commit is possible Performance is dozens of transactions per second with forced write Higher transaction rate with buffered writes, but at risk of data loss SSDs are a big win
  27. Performance Implications Serial Log With fine granularity threading and 8 cores, benchmarked at 22,000 TPS Serial log management is critical Requires substantial non-interlocked data structures
  28. Performance Implications NuoDB Bench marked at 3,000,000 TPS running on 40 commodity processors Read only TPS is theoretically infinite
  29. Questions?

Editor's Notes

  1. Since gfix will be more clear in V3, there’s not much to talk about…
  2. No, not that easy. First a review of the basic concurrency control mechanism of Firebird. Concurrency covers several areas: atomicity – a transaction succeeds or fails as a unit; isolation – a transaction is not affected by concurrent transactions; avoiding inconsistent reads and dirty writes; all done without locks.
  3. If anyone has missed gfix, it was originally known as Alice = because it handled everything that didn’t fit elsewhere, thus All Else. So it’s a grab bag of functions that run outside the database server. Gfix reads the database directly as if it were a file. The specific function of gfix that I’m talking about is its ability to validate a database’s internal consistency. Because it’s outside the server, it doesn’t know anything about database constraints or data formats. It won’t recoginize invalid data, non-functional referential constraints, duplicates in unique indexes, etc. However, it will recognize problems like index entries that don’t point to records, pages that are either doubly allocated or not allocated at all, records with back version pointers that don’t have back versions.
  4. So, gfix is looking for all those problems, how could they possibly occur in a robust database management system? From time to time, database systems stop cold. Local legend says that InterBase was used in Abrams tanks because the tanks suffered frequent short power failures. Unlike other database systems, InterBase restarted immediately. The cost of an immediate restart was that some