Oracle Berkeley DB - Transactional Data Storage (TDS) Tutorial


  • Designed to follow the Use-Case Based Tutorial for Data Store
  • Ideally want three use cases: a simple one, one that uses ALLDB, and one that uses cds_groupbegin. WW-like site: Per-user database? Food database (secondary points to food). Recipe database (secondary for ingredients to recipes; categories to recipes). Add food: look up food (read), write per-user. Add recipe: look up recipe, iterate over ingredients, write into per-user. New recipe:
  • Review the ACID properties
  • Modifications can be grouped, of course, but reads can be transactional as well to be repeatable. Filesystem operations are things like database remove, rename, create (most relational engines don’t support this, it’s a very useful feature).
  • Talk about deadlocks versus timeouts
  • We’ll introduce a few more things when we get to performance tuning
  • Recovery must be single threaded. (Still true in BDB 4.4?)
  • For BDB 4.4, probably add DB_REGISTER to the list of flags.
  • Use DB_ENV->txn_begin, DB_TXN->commit and DB_TXN->abort, or specify DB_AUTO_COMMIT in the DB->put call.
  • This gives you fully serializable read: a complete consistent snapshot of the database. DB_READ_UNCOMMITTED or DB_READ_COMMITTED should be used for the cursor, if possible, in a real application. (Pre-BDB 4.4, “DB_READ_UNCOMMITTED” was “DB_DIRTY_READ” and “DB_READ_COMMITTED” was “DB_DEGREE_2”. Older names still work ok.)
  • If a parent transaction aborts, all child transactions of that parent also abort, regardless of whether the child transaction was currently aborted, committed, or unresolved. If a child transaction aborts, the parent’s locks are unchanged. This slide show doesn’t mention preparing transactions as part of a federated database.
  • The CDS product should never deadlock. All DBMS systems can deadlock; they usually just handle it in the server and retry automatically. Deadlocks cannot occur in read-only applications (next slide).
  • Condition 1: can’t happen in a read-only situation. Condition 2: avoid this by not blocking, using DB_TXN_NOWAIT (or DB_LOCK_NOWAIT on an explicit lock request). Condition 3: get all the locks up front (don’t know how to do this). Condition 4: called a waits-for graph; find cycles and break them. This is called deadlock detection and is the normal mode of operation in DB.
  • Most applications choose asynchronous detection. A few use synchronous detection. Lock/transaction timeouts are not common.
  • DB_RMW may increase contention (because write locks are held longer than they would otherwise be) but it may decrease the number of deadlocks that occur. See Transaction Tuning, Transaction Throughput and Access Method Tuning portion of the BDB Reference Guide. Faster transactions hold locks for a shorter time.
  • DB_ENV->txn_checkpoint() allows the application to specify a minimum number of kbytes in the log or a minimum number of minutes since the last checkpoint. Checkpointing is I/O intensive, as is recovery. Recovery might take longer than real time, because it may have to rewrite the same pages over and over. This is uncommon but possible. “Checkpoints don’t block operations” needs a caveat: during the write of a page, the page is locked against modifications, so if page contention is a problem, checkpointing could make it worse.
  • Force: checkpoint regardless of activity. Kbytes: amount of log written since the last checkpoint. Time: minutes elapsed since the last checkpoint. Trickle can be run periodically to make sure that at least N% of the cache is clean.
  • “Physical” logging means the physical bytes that changed are written, rather than logging of API calls, of SQL statements, etc. DB_ENV->set_lg_dir must be called before DB_ENV->open. The log filename consists of 10 digits, with a maximum of 2,000,000,000 log files. Consider an application performing 6000 transactions per second for 24 hours a day, logged into 10MB log files, in which each transaction is logging approximately 500 bytes of data. The following calculation: (10 * 2^20 * 2000000000) / (6000 * 500 * 365 * 60 * 60 * 24) indicates that the system will run out of log filenames in roughly 221 years. Because transaction logs are the basis for recovery, a transaction protected database will not allow non-transaction protected operations. If it did, then the transaction log would not have a complete record of all of the database operations and would not be recoverable.
  • Log information is stored in the in-memory log buffer until the storage space fills up or transaction commit forces the information to be flushed to stable storage (in the case of on-disk logs). In the presence of long-running transactions or transactions producing large amounts of data, larger buffer sizes can increase throughput. The log region is used to store filenames and so may need to be increased in size if a large number of files will be opened and registered with the specified Berkeley DB environment's log manager. DB_ENV->set_lg_regionmax must be called before DB_ENV->open.
  • See the Hot Failover and Recovery sections of the Reference Guide. The db_hotbackup utility is available as of BDB 4.4.
  • In step #2 you have to stop all transactional operations (including file-level operations such as database renames). In step #4 it may be easier to simply copy all of the files in DB_ENV->set_data_dir or all of the files stored in the data directories. In step #5 you only have to copy a single log file (the last one) because you did a checkpoint in step #3, which means that only the last log file is interesting; all changes recorded in the previous log files have been flushed to the backing database files.
  • See the “Berkeley DB recoverability” section of the Reference Guide for information on copying databases atomically. To reduce the number of log files that need to be backed up, use db_archive with no options, or DB_ENV->log_archive with no flags, to identify the log files that are no longer in use. Either do not archive them or physically move them to an alternative location before copying the log files. BDB 4.4 has a utility program for this.
  • Single-threading of recovery is automatic in BDB 4.4.
  • Common mistake is application writers don’t run recovery all the time. Instead, they assume that if they don’t get a DB_RUNRECOVERY error on startup, things are OK. Big mistake. You should always run recovery on startup. You may or may not get a DB_RUNRECOVERY error on opening when recovery is needed. DB would have to validate the entire set of databases on startup to check if recovery is needed and that’s too slow. DB_RUNRECOVERY might fail if, for example, memory was corrupted. But beware of having a new program start up while others are running – the new program should NOT run recovery.
  • Berkeley DB 4.x has significant recovery performance improvements
  • To recover in an alternative directory databases must be referenced using relative file names. Log files copied: the idea here is that you can run recovery repeatedly if you have too many log files and not enough disk space -- copy in log files 1-M, run recovery, copy in log files M+1-N, run recovery, and so on.
  • It may be simpler to copy all of the database files in step 1. Should be integrated with log archival process. See Hot Failover section of the Reference Guide. Three directories: active, backup, and failover. An archive directory is optional.
  • No promises that the OS writes file system pages to disk atomically, but it’s kind of dumb if it doesn’t.
  • Putting log files on a different disk than database files helps performance and reliability. The former by reducing disk seeks and the latter by making a single disk failure less likely to lose an unrecoverable amount of data. Using separate partitions on the same disk is mostly pointless and probably hurts performance.
  • See “Architecting Transactional Data Store applications” in the Reference Guide. The DB_REGISTER flag in Berkeley DB 4.4 eases the handling of crashes in threads of control. failchk() in Berkeley DB 4.4 provides an alternate, higher-performance means of handling thread failure.
  • If failchk() returns 0, the application may proceed normally w.r.t. the database. If failchk() returns DB_RUNRECOVERY, a crash occurred while a thread was executing inside the BDB API. We can’t recover, so standard application recovery is required: shut down all threads, run recovery, restart. Transactions that are underway will be aborted.
  • DB_REGISTER flag in Berkeley DB 4.4 makes tracking threads of control easier. DB_REGISTER was implemented specifically to support loosely managed multi-process applications, where individual processes start, do some work and then shut down. In this type of application, having each process start up and open the environment with DB_REGISTER helps to identify and resolve issues left by crashed processes. Create a monitor process that opens and closes the environment every few seconds—the period determines the maximum latency for detecting stale transactions and locks. Without such a process, latency to detect thread/process death and recover from it is longer; might be fine for a given application. DB_REGISTER avoids permission problems on some operating systems w.r.t. processes accessing thread information of other, unrelated processes.
  • If Monitor doesn’t get alerted when threads or processes die, use a timer. If it does get alerted (e.g. a signal is sent on death of a child process), the system will respond faster—but not all operating systems can do that. failchk() is more efficient than DB_REGISTER at getting the system going again after a single thread or process crashes because most times, control is outside the BDB API when the crash occurs so recovery isn’t needed; DB_REGISTER can’t tell that. DB_REGISTER works on processes, not threads. If a thread crashes but the owning process doesn’t, DB_REGISTER doesn’t know.
  • Read operations are given shared locks. Write operations are given exclusive locks.
  • Instructor: there will be a lot of discussion on the second bullet point. Configure deadlock detection with DB_LOCK_EXPIRE if you want timeouts to be the only way a deadlock is broken.

    1. 1. Oracle Berkeley DB Transactional Data Store A Use-Case Based Tutorial
    2. 2. Part III: Transactional Data Store <ul><li>Overview </li></ul><ul><ul><li>What is TDS? </li></ul></ul><ul><ul><li>When is TDS appropriate? </li></ul></ul><ul><li>Case Studies </li></ul><ul><ul><li>Web services </li></ul></ul>
    3. 3. What is TDS? <ul><li>Recoverable data management: recovery from application or system failure. </li></ul><ul><li>Ability to group operations into transactions: all operations in transaction either succeed or fail. </li></ul><ul><li>Concurrency control: multiple readers and writers operate on the database simultaneously. </li></ul>
    4. 4. Transactions Provide ACID <ul><li>Atomic: multiple operations appear as a single operation </li></ul><ul><ul><li>all or none of the operations applied </li></ul></ul><ul><li>Consistent: no access sees a partially completed transaction </li></ul><ul><li>Isolated: system behaves as if there is no concurrency </li></ul><ul><li>Durable: modifications persist after failure </li></ul><ul><li>Applications can relax ACID for performance </li></ul><ul><li>(more on this later) </li></ul>
    5. 5. Transactions <ul><li>All operations can be transactionally protected </li></ul><ul><ul><li>Database create, remove, rename </li></ul></ul><ul><ul><li>Key/Data pair get, put, update </li></ul></ul><ul><li>Why should you care? </li></ul><ul><ul><li>Never lose data after system or application failure (durability) </li></ul></ul><ul><ul><li>Group multiple updates into a single operation (atomicity) </li></ul></ul><ul><ul><li>Roll back changes (abort transaction if something odd happens) </li></ul></ul><ul><ul><li>Hide changes from other users until completed (consistency) </li></ul></ul><ul><ul><li>Changes made in other transactions while your transaction is underway aren’t visible unless you want to see them (isolation) </li></ul></ul>
    6. 6. Transaction Terminology <ul><li>Thread of control </li></ul><ul><ul><li>Process or true thread </li></ul></ul><ul><li>Free-threaded </li></ul><ul><ul><li>Object protected for simultaneous access by multiple threads </li></ul></ul><ul><li>Deadlock </li></ul><ul><ul><li>Two or more threads of control request mutually exclusive locks </li></ul></ul><ul><ul><li>Blocked threads cannot proceed until a thread releases its locks </li></ul></ul><ul><ul><li>They are stuck forever. </li></ul></ul><ul><li>Transaction </li></ul><ul><ul><li>One or more operations grouped into a single unit of work </li></ul></ul>
    7. 7. Transaction Terminology <ul><li>Transaction abort </li></ul><ul><ul><li>The unit of work is backed out/rolled back/undone </li></ul></ul><ul><li>Transaction commit </li></ul><ul><ul><li>The unit of work is permanently done </li></ul></ul><ul><li>System or application failure </li></ul><ul><ul><li>Unexpected application exit, for whatever reason </li></ul></ul><ul><li>Recovery </li></ul><ul><ul><li>Making databases consistent after failure so they can be used again </li></ul></ul><ul><ul><li>Must fix data and metadata </li></ul></ul>
    8. 8. Transactional APIs (1) <ul><li>New flags to DB_ENV->open </li></ul><ul><ul><li>DB_INIT_TXN, DB_INIT_LOCK, DB_INIT_LOG, DB_RECOVER, DB_RECOVER_FATAL </li></ul></ul><ul><li>New configuration options </li></ul><ul><ul><li>DB_ENV->set_tx_max : max number of concurrent transactions </li></ul></ul><ul><ul><li>DB_ENV->set_lk_detect : set deadlock resolution policy </li></ul></ul><ul><ul><li>DB_ENV->set_tx_timestamp : recover to a timestamp </li></ul></ul><ul><ul><li>DB_ENV->set_timeout : transaction timeout </li></ul></ul>
    9. 9. Transactional APIs (2) <ul><li>Transaction calls </li></ul><ul><ul><li>DB_ENV->txn_begin </li></ul></ul><ul><ul><li>DB_TXN->commit </li></ul></ul><ul><ul><li>DB_TXN->abort </li></ul></ul><ul><ul><li>DB_TXN->prepare </li></ul></ul><ul><li>New error returns </li></ul><ul><ul><li>DB_RUNRECOVERY </li></ul></ul><ul><ul><li>DB_LOCK_DEADLOCK </li></ul></ul><ul><li>Utility functions </li></ul><ul><ul><li>DB_ENV->txn_checkpoint </li></ul></ul><ul><ul><li>DB_ENV->lock_detect </li></ul></ul><ul><ul><li>DB_ENV->log_archive </li></ul></ul>
    10. 10. What goes in a transaction? <ul><li>All updates must be transaction-protected. </li></ul><ul><ul><li>DB->put, DBC->c_put </li></ul></ul><ul><ul><li>DB->del, DBC->c_del </li></ul></ul><ul><li>All databases that will be accessed inside transactions must be opened/created in a transaction. </li></ul><ul><li>File system operations </li></ul><ul><ul><li>DB_ENV->dbremove </li></ul></ul><ul><ul><li>DB_ENV->dbrename </li></ul></ul>
    11. 11. What About Reads? <ul><li>Read operations can go in transactions </li></ul><ul><ul><li>May not always have to </li></ul></ul><ul><li>A read that is a part of read-modify-write should be transaction protected. </li></ul><ul><li>Reads that must be consistent should be transaction-protected. </li></ul><ul><li>Other reads might use weaker semantics </li></ul><ul><ul><li>DB_READ_COMMITTED: never see uncommitted data </li></ul></ul><ul><ul><li>DB_READ_UNCOMMITTED: may see uncommitted data </li></ul></ul>
    12. 12. Anatomy of an Application <ul><li>Create/Open/Recover environment </li></ul><ul><li>Open database handles </li></ul><ul><li>Spawn utility threads </li></ul><ul><li>Spawn worker threads </li></ul><ul><ul><li>Begin transaction </li></ul></ul><ul><ul><li>Do database operations (insert/delete/update/etc.) </li></ul></ul><ul><ul><li>Commit or abort transaction </li></ul></ul><ul><ul><li>Do it again if appropriate (main loop) </li></ul></ul><ul><li>Close database handles </li></ul><ul><li>Close environment </li></ul>
    13. 13. Create/Open/Recover Environment
        /* dbenv is a DB_ENV * handle; HOME is the environment directory. */
        if ((ret = db_env_create(&dbenv, 0)) != 0)
            … error_handling…
        /* Configure the environment. */
        flags = DB_CREATE | DB_INIT_LOG | DB_INIT_TXN | DB_INIT_LOCK | DB_INIT_MPOOL | DB_RECOVER;
        if ((ret = dbenv->open(dbenv, HOME, flags, 0)) != 0)
            … error_handling…
    14. 14. Create/Open Database <ul><li>Must specify DB_ENV to database open </li></ul><ul><li>That DB_ENV must have been opened with DB_INIT_TXN </li></ul><ul><li>Database open must be transactional </li></ul><ul><ul><li>Specify a DB_TXN in the open </li></ul></ul><ul><ul><li>Specify DB_AUTO_COMMIT in the open </li></ul></ul><ul><ul><li>Open environment with DB_AUTO_COMMIT </li></ul></ul>
    15. 15. Transaction Operations <ul><li>Begin transaction with DB_ENV->txn_begin() </li></ul><ul><li>DB_TXN->commit() commits the transaction </li></ul><ul><ul><li>Releases all locks </li></ul></ul><ul><ul><li>All log records are written to disk (by default) </li></ul></ul><ul><li>DB_TXN->abort() aborts the transaction </li></ul><ul><ul><li>Releases all locks </li></ul></ul><ul><ul><li>Modifications are rolled back </li></ul></ul><ul><li>Must close all cursors before commit or abort </li></ul>
    16. 16. Transaction Differences <ul><li>Without transactions: </li></ul><ul><ul><li>ret = dbp->put(dbp, NULL, &key, &data, 0); </li></ul></ul><ul><ul><li>… error handling … </li></ul></ul><ul><li>With transactions </li></ul><ul><ul><li>ret = dbenv->txn_begin(dbenv, NULL, &txn, 0); </li></ul></ul><ul><ul><li>… error handling … </li></ul></ul><ul><ul><li>ret = dbp->put(dbp, txn, &key, &data, 0); </li></ul></ul><ul><ul><li>… error handling … </li></ul></ul><ul><ul><li>ret = txn->commit(txn, 0); </li></ul></ul><ul><ul><li>… error handling … </li></ul></ul><ul><li>Note: if the DB handle was opened transactionally, then we will automatically wrap every modification operation in a transaction (DB_AUTO_COMMIT behavior). </li></ul>
    17. 17. Transactional Cursors <ul><li>Specify transaction handle on cursor creation </li></ul><ul><li>Cursor operations are performed inside that transaction </li></ul><ul><li>Cursors must be closed before commit or abort </li></ul><ul><ul><li>DBC *dbc = NULL; </li></ul></ul><ul><ul><li>DB_TXN *txn = NULL; </li></ul></ul><ul><ul><li>ret = dbenv->txn_begin(dbenv, NULL, &txn, 0); </li></ul></ul><ul><ul><li>if (ret != 0) </li></ul></ul><ul><ul><li>… error handling … </li></ul></ul><ul><ul><li>ret = dbp->cursor(dbp, txn, &dbc, 0); </li></ul></ul><ul><ul><li>if (ret != 0) </li></ul></ul><ul><ul><li>… error handling … </li></ul></ul>
    18. 18. Transactional Iteration <ul><li>ret = dbp->cursor(dbp, txn, &dbc, 0); </li></ul><ul><li>… error handling … </li></ul><ul><li>while ((ret = dbc->c_get(dbc, &key, &data, DB_NEXT)) == 0) </li></ul><ul><li>… process record: write new data, delete old data … </li></ul><ul><li>if (ret == DB_NOTFOUND) </li></ul><ul><li>ret = 0; </li></ul><ul><li>if ((temp_ret = dbc->c_close(dbc)) != 0 && ret == 0) </li></ul><ul><li>ret = temp_ret; </li></ul><ul><li>if (ret == 0) </li></ul><ul><li>ret = txn->commit(txn, 0); </li></ul><ul><li>else </li></ul><ul><li>(void) txn->abort(txn); </li></ul><ul><li>if (ret != 0) </li></ul><ul><li>… error handling … </li></ul>
    19. 19. Nested Transactions <ul><li>Split large transactions into smaller ones </li></ul><ul><ul><li>Children can be individually aborted. </li></ul></ul><ul><li>Create nested transaction </li></ul><ul><ul><li>Pass parent DB_TXN handle to DB_ENV->txn_begin() </li></ul></ul><ul><li>Parent transaction with active children can only </li></ul><ul><ul><li>Create more children </li></ul></ul><ul><ul><li>Commit or abort </li></ul></ul><ul><li>If parent commits (aborts), all children commit (abort) </li></ul><ul><li>Child transaction’s locks </li></ul><ul><ul><li>Won’t conflict with the parent’s locks </li></ul></ul><ul><ul><li>Will conflict with other children’s (siblings) locks </li></ul></ul>
    20. 20. Deadlocks <ul><li>What are deadlocks? </li></ul><ul><li>Deadlock resolution </li></ul><ul><li>Deadlock detection </li></ul><ul><li>Dealing with deadlocks </li></ul><ul><li>Deadlock avoidance </li></ul>
    21. 21. Defining Deadlocks <ul><li>Consider two threads </li></ul><ul><ul><li>Thread 1: Write A, Write B </li></ul></ul><ul><ul><li>Thread 2: Write B, Write A </li></ul></ul><ul><li>Let’s say that the sequence of operations is </li></ul><ul><ul><li>T1: Write lock A </li></ul></ul><ul><ul><li>T2: Write lock B </li></ul></ul><ul><ul><li>T1: Request lock on B (blocks) </li></ul></ul><ul><ul><li>T2: Request lock on A (blocks) </li></ul></ul><ul><li>Neither thread can make forward progress </li></ul>
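The two-thread sequence on this slide can be replayed against a toy lock table. This is only a sketch of shared/exclusive lock conflicts, not Berkeley DB's lock manager: the names `lock_table`, `lt_init` and `lt_try_lock` are hypothetical, lockers are plain integers, and lock upgrades and releases are deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define NOBJ 4          /* hypothetical number of lockable objects */
#define NO_OWNER (-1)

typedef enum { LOCK_READ, LOCK_WRITE } lock_mode;

typedef struct {
    int readers;        /* count of shared (read) holders */
    int writer;         /* exclusive (write) holder, or NO_OWNER */
} obj_lock;

typedef struct {
    obj_lock obj[NOBJ];
} lock_table;

void lt_init(lock_table *t) {
    for (int i = 0; i < NOBJ; i++) {
        t->obj[i].readers = 0;
        t->obj[i].writer = NO_OWNER;
    }
}

/* Returns true if the lock is granted, false if the requester would block. */
bool lt_try_lock(lock_table *t, int locker, int obj, lock_mode mode) {
    assert(obj >= 0 && obj < NOBJ);
    obj_lock *l = &t->obj[obj];

    /* An exclusive lock held by another locker conflicts with everything. */
    if (l->writer != NO_OWNER && l->writer != locker)
        return false;

    if (mode == LOCK_READ) {
        l->readers++;   /* shared locks coexist with other shared locks */
        return true;
    }

    /* A write lock needs the object free of readers (upgrades not modeled). */
    if (l->readers > 0)
        return false;
    l->writer = locker;
    return true;
}
```

With lockers 1 and 2 standing in for T1 and T2, and objects 0 and 1 for A and B, the first two write requests succeed and the next two both report "would block", which is the deadlock the slide describes.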
    22. 22. Conditions for Deadlock <ul><li>Exclusive access </li></ul><ul><li>Block when access is unavailable </li></ul><ul><li>May request additional resources while holding resources </li></ul><ul><li>The graph of “who is waiting for whom” has a cycle in it. </li></ul>
    23. 23. Deadlock Resolution <ul><li>Two techniques: </li></ul><ul><ul><li>Assume a sufficiently long wait is a deadlock and self-abort </li></ul></ul><ul><ul><li>Detect and selectively abort </li></ul></ul><ul><li>Berkeley DB supports both </li></ul><ul><li>Timeouts </li></ul><ul><ul><li>Use DB_ENV->set_timeout to specify a maximum length of time that a lock or transaction should block. </li></ul></ul><ul><ul><li>Return DB_LOCK_NOTGRANTED if the timeout expires before a lock is granted </li></ul></ul>
    24. 24. Deadlock Detection <ul><li>Two step process: </li></ul><ul><ul><li>Traverse waits-for graph looking for cycles </li></ul></ul><ul><ul><li>If cycle, select victim to abort </li></ul></ul><ul><li>Looking for cycles </li></ul><ul><ul><li>Synchronously: checked on each blocking lock </li></ul></ul><ul><ul><ul><li>DB_ENV->set_lk_detect() </li></ul></ul></ul><ul><ul><ul><li>+ Immediate detection and notification </li></ul></ul></ul><ul><ul><ul><li>- Higher CPU cost </li></ul></ul></ul><ul><ul><li>Asynchronously: run detector thread </li></ul></ul><ul><ul><ul><li>DB_ENV->lock_detect() or db_deadlock utility program </li></ul></ul></ul><ul><ul><ul><li>- Detected only when thread of control runs </li></ul></ul></ul><ul><ul><ul><li>+ Lower CPU cost </li></ul></ul></ul>
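The "traverse waits-for graph looking for cycles" step can be illustrated with a small depth-first search. This is a minimal sketch under stated assumptions, not the detector Berkeley DB ships: the adjacency-matrix layout, the `MAX_LOCKERS` bound and every function name below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKERS 8   /* hypothetical fixed bound for the sketch */

/* waits_for[i][j] is true when locker i is blocked on a lock held by locker j. */
typedef struct {
    int n;                                      /* lockers in use */
    bool waits_for[MAX_LOCKERS][MAX_LOCKERS];
} wait_graph;

void wg_init(wait_graph *g, int n) {
    assert(n >= 0 && n <= MAX_LOCKERS);
    memset(g, 0, sizeof(*g));
    g->n = n;
}

void wg_add_wait(wait_graph *g, int waiter, int holder) {
    assert(waiter >= 0 && waiter < g->n);
    assert(holder >= 0 && holder < g->n);
    g->waits_for[waiter][holder] = true;
}

/* state: 0 = unvisited, 1 = on the current DFS path, 2 = fully explored. */
bool wg_dfs(const wait_graph *g, int node, int *state) {
    state[node] = 1;
    for (int next = 0; next < g->n; next++) {
        if (!g->waits_for[node][next])
            continue;
        if (state[next] == 1)
            return true;                        /* back edge: a cycle exists */
        if (state[next] == 0 && wg_dfs(g, next, state))
            return true;
    }
    state[node] = 2;
    return false;
}

/* Returns true when some set of lockers is waiting in a cycle. */
bool wg_has_deadlock(const wait_graph *g) {
    int state[MAX_LOCKERS] = {0};
    for (int i = 0; i < g->n; i++)
        if (state[i] == 0 && wg_dfs(g, i, state))
            return true;
    return false;
}
```

A synchronous detector would run a check like `wg_has_deadlock` whenever a lock request blocks; an asynchronous one would run it periodically from a separate thread, which is the CPU-versus-latency trade-off listed on the slide.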
    25. 25. Picking who to abort <ul><li>Ideally want to limit wasted effort. </li></ul><ul><li>Deadlock resolution policy is configurable </li></ul><ul><ul><li>Default is DB_LOCK_RANDOM </li></ul></ul><ul><li>Other options </li></ul><ul><ul><li>DB_LOCK_MINLOCKS </li></ul></ul><ul><ul><li>DB_LOCK_MINWRITE (# of write locks, not write ops) </li></ul></ul><ul><ul><li>DB_LOCK_YOUNGEST </li></ul></ul><ul><ul><li>DB_LOCK_OLDEST </li></ul></ul><ul><li>The victim </li></ul><ul><ul><li>Gets a DB_LOCK_DEADLOCK error return </li></ul></ul><ul><ul><li>Must immediately (close all cursors and) abort </li></ul></ul>
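Once a cycle is found, the configured policy decides which locker in the cycle is rejected. The sketch below mirrors four of the policies named on the slide with hypothetical `POLICY_*` constants and a hypothetical `locker_info` record; it illustrates the selection idea only and makes no claim about how Berkeley DB tracks lockers or breaks ties.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for DB_LOCK_MINLOCKS, DB_LOCK_MINWRITE, etc. */
typedef enum {
    POLICY_MINLOCKS,
    POLICY_MINWRITE,
    POLICY_YOUNGEST,
    POLICY_OLDEST
} victim_policy;

typedef struct {
    int id;                     /* locker identifier */
    unsigned nlocks;            /* total locks held */
    unsigned nwrite_locks;      /* write locks held (locks, not write ops) */
    unsigned long start_seq;    /* larger value = started later = younger */
} locker_info;

/* Returns nonzero when a is a better victim than b under policy p. */
int better_victim(victim_policy p, const locker_info *a, const locker_info *b) {
    switch (p) {
    case POLICY_MINLOCKS: return a->nlocks < b->nlocks;
    case POLICY_MINWRITE: return a->nwrite_locks < b->nwrite_locks;
    case POLICY_YOUNGEST: return a->start_seq > b->start_seq;
    case POLICY_OLDEST:   return a->start_seq < b->start_seq;
    }
    return 0;
}

/* Scans the lockers in a cycle and returns the id of the chosen victim. */
int choose_victim(victim_policy p, const locker_info *cycle, size_t n) {
    assert(cycle != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (better_victim(p, &cycle[i], &cycle[best]))
            best = i;
    return cycle[best].id;
}
```

Policies that pick the locker holding the fewest locks aim at the slide's first bullet: the victim that has done the least work wastes the least effort when it aborts and retries.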
    26. 26. Dealing with Deadlocks <ul><li>Transactional applications must code for deadlocks. </li></ul><ul><li>Typical response to deadlock is to retry </li></ul><ul><li>Repeated retries may indicate a serious problem </li></ul>
    27. 27. Deadlock Example <ul><li>for (fail = 0; fail < MAXIMUM_RETRY; fail++) { </li></ul><ul><li>ret = dbenv->txn_begin(dbenv, NULL, &txn, 0); </li></ul><ul><li>… error_handling … </li></ul><ul><li>ret = db->put(db, txn, &key, &data, 0); </li></ul><ul><li>if (ret == 0) { </li></ul><ul><li>… commit the transaction … </li></ul><ul><li>return 0; /* Success! */ </li></ul><ul><li>} else { /* DB_LOCK_DEADLOCK or something else */ </li></ul><ul><li>… abort the transaction … </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>dbenv->err(dbenv, ret, "Maximum retry limit exceeded"); </li></ul><ul><li>return (ret); /* Retry limit reached; give up. */ </li></ul>
    28. 28. Deadlock Avoidance <ul><li>Good practices </li></ul><ul><ul><li>Keep transactions short </li></ul></ul><ul><ul><li>Read/write databases in the same order in all transactions </li></ul></ul><ul><ul><li>Limit the number of concurrent writers </li></ul></ul><ul><ul><li>Use DB_RMW for read-modify-write operations </li></ul></ul><ul><ul><li>DB_READ_UNCOMMITTED for readers </li></ul></ul><ul><ul><li>DB_READ_COMMITTED for cursors </li></ul></ul><ul><ul><li>DB_REVSPLITOFF for cyclical Btrees (grow/shrink/grow/…) </li></ul></ul><ul><li>Debugging </li></ul><ul><ul><li>db_stat -Co </li></ul></ul><ul><ul><li>db_stat -Cl </li></ul></ul><ul><ul><li>“Deadlock Debugging” section of the Reference Guide </li></ul></ul>
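The slide's loop retries after any failure. A common refinement is to retry only when the failure was a deadlock and to return other errors immediately. The sketch below shows that control flow in isolation, without the Berkeley DB headers: `ERR_DEADLOCK` is a hypothetical stand-in for DB_LOCK_DEADLOCK, and `txn_body_fn` is a hypothetical callback assumed to begin the transaction, do the work, and commit or abort before returning.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_DEADLOCK (-1)   /* hypothetical stand-in for DB_LOCK_DEADLOCK */
#define MAXIMUM_RETRY 5

/* One full attempt: begin, operate, commit or abort; returns 0 or an error. */
typedef int (*txn_body_fn)(void *arg);

/*
 * Runs body until it succeeds, fails with a non-deadlock error, or the
 * retry limit is reached. Stores the number of attempts in *attempts_out
 * when that pointer is not NULL.
 */
int run_with_retry(txn_body_fn body, void *arg, int *attempts_out) {
    int ret = ERR_DEADLOCK;
    int attempts = 0;

    assert(body != NULL);
    while (attempts < MAXIMUM_RETRY) {
        attempts++;
        ret = body(arg);
        if (ret != ERR_DEADLOCK)
            break;          /* success, or an error that retrying won't fix */
    }
    if (attempts_out != NULL)
        *attempts_out = attempts;
    return ret;
}

/* Demo body: reports a deadlock until the counter in arg reaches zero. */
int flaky_body(void *arg) {
    int *remaining = arg;
    if (*remaining > 0) {
        (*remaining)--;
        return ERR_DEADLOCK;
    }
    return 0;
}
```

Returning the attempt count makes it easy to log how often retries happen, which matters because repeated retries may indicate a serious problem.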
    29. 29. Subsystems and Utilities <ul><li>Checkpoint </li></ul><ul><li>Logging </li></ul><ul><li>Backups </li></ul><ul><li>Recovery </li></ul>
    30. 30. Checkpoints <ul><li>Recall that Berkeley DB maintains a cache of database pages. </li></ul><ul><li>Checkpoints write dirty pages from the cache into the databases </li></ul><ul><ul><li>Transaction commit only writes the log </li></ul></ul><ul><li>Checkpoints: </li></ul><ul><ul><li>Permit log file reclamation </li></ul></ul><ul><ul><li>Reduce recovery time </li></ul></ul><ul><ul><li>Block other operations minimally </li></ul></ul><ul><li>Checkpoint is I/O intensive </li></ul><ul><li>Typically, checkpoint in a separate thread </li></ul><ul><ul><li>Command line db_checkpoint </li></ul></ul><ul><ul><li>Use DB_ENV->txn_checkpoint() </li></ul></ul>
    31. 31. Controlling Checkpoints <ul><li>Log file size dictates unit of log reclamation </li></ul><ul><ul><li>Use DB_ENV->set_lg_max </li></ul></ul><ul><li>Checkpoint method and utility controlled by </li></ul><ul><ul><li>Force checkpoint </li></ul></ul><ul><ul><li>More than kbytes of log have been written </li></ul></ul><ul><ul><li>More than min minutes have passed </li></ul></ul><ul><li>Reduce checkpoint impact by keeping mpool clean </li></ul><ul><ul><li>Use DB_ENV->memp_trickle </li></ul></ul>
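The three checkpoint controls (force, kbytes, minutes) combine into a simple decision. The predicate below is a sketch of that decision as described on the slide, not the library's implementation. The `ckp_state` structure and `should_checkpoint` name are hypothetical, and treating "both thresholds zero" as "checkpoint whenever any log has been written" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping a checkpoint thread might keep. */
typedef struct {
    unsigned long kbytes_since_ckp;     /* log written since last checkpoint (KB) */
    unsigned long minutes_since_ckp;    /* minutes since last checkpoint */
} ckp_state;

/*
 * kbyte and min are thresholds; zero means "threshold not set".
 * force corresponds to asking for a checkpoint unconditionally.
 */
bool should_checkpoint(const ckp_state *s, unsigned long kbyte,
                       unsigned long min, bool force) {
    assert(s != NULL);
    if (force)
        return true;
    if (kbyte != 0 && s->kbytes_since_ckp > kbyte)
        return true;                    /* more than kbyte KB of log written */
    if (min != 0 && s->minutes_since_ckp > min)
        return true;                    /* more than min minutes have passed */
    /* Assumption: with no thresholds set, any log activity triggers one. */
    return kbyte == 0 && min == 0 && s->kbytes_since_ckp > 0;
}
```

A dedicated checkpoint thread would evaluate a test like this on a timer and call DB_ENV->txn_checkpoint() when it returns true; in practice the library applies the kbyte and min thresholds itself when they are passed to that call.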
    32. 32. Logging <ul><li>Database modifications are described in log records. </li></ul><ul><li>Log records describe physical changes to pages. </li></ul><ul><li>Log records are indexed by log sequence numbers (LSNs) </li></ul><ul><li>The log is composed of one or more log files. </li></ul><ul><li>Logs are named log.NNNNNNNNNN </li></ul><ul><li>Logs can be maintained on-disk or in-memory </li></ul><ul><ul><li>On-disk logs stored in DB_HOME directory </li></ul></ul><ul><ul><li>Or in location configured by DB_ENV->set_lg_dir() </li></ul></ul><ul><li>Log records flushed by DB_TXN->commit() </li></ul><ul><li>For best performance and recoverability properties, </li></ul><ul><li>place logs and databases on separate disks </li></ul>
    33. 33. Logging Configuration <ul><li>DB_ENV->set_lg_max() </li></ul><ul><ul><li>Set the size of the individual log files </li></ul></ul><ul><ul><li>On-disk logs: default is 10MB </li></ul></ul><ul><ul><li>In-memory logs: default is 256KB </li></ul></ul><ul><li>DB_ENV->set_lg_bsize() </li></ul><ul><ul><li>Set the size of the in-memory log buffer </li></ul></ul><ul><ul><li>On-disk logs: default is 32KB </li></ul></ul><ul><ul><li>In-memory logs: default is 1MB </li></ul></ul><ul><li>DB_ENV->set_lg_regionmax() </li></ul><ul><ul><li>Set the size of the logging subsystem’s shared region (default is 60KB) </li></ul></ul><ul><ul><li>Region holds filenames </li></ul></ul>
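The ten-digit log file name and the "roughly 221 years" figure from the speaker notes can be checked with a few lines of C. This is a sketch: both function names are hypothetical, and the `log.%010lu` format string is inferred from the log.NNNNNNNNNN pattern on the slide rather than taken from the library source.

```c
#include <assert.h>
#include <stdio.h>

/* Formats a log file name with a ten-digit, zero-padded sequence number. */
int log_file_name(char *buf, size_t len, unsigned long filenum) {
    assert(buf != NULL);
    return snprintf(buf, len, "log.%010lu", filenum);
}

/*
 * Years until the supply of log file names runs out:
 * total log capacity in bytes divided by bytes logged per year.
 */
double years_until_log_names_exhausted(double log_file_bytes, double max_files,
                                       double txns_per_sec, double bytes_per_txn) {
    const double seconds_per_year = 365.0 * 24.0 * 60.0 * 60.0;
    assert(txns_per_sec > 0 && bytes_per_txn > 0);
    return (log_file_bytes * max_files) /
           (txns_per_sec * bytes_per_txn * seconds_per_year);
}
```

Plugging in the figures from the notes (10MB log files, 2,000,000,000 file names, 6000 transactions per second, 500 bytes per transaction) gives a value between 221 and 222 years.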
    34. 34. Backups <ul><li>Full backup: Copy database and log files </li></ul><ul><ul><li>Standard </li></ul></ul><ul><ul><ul><li>Pause database operations and perform the backup </li></ul></ul></ul><ul><ul><ul><li>Creates a snapshot at a known point in time </li></ul></ul></ul><ul><ul><li>Hot </li></ul></ul><ul><ul><ul><li>Backup while database(s) are still active </li></ul></ul></ul><ul><ul><ul><li>Creates a snapshot at fuzzy point in time </li></ul></ul></ul><ul><li>Incremental backup </li></ul><ul><ul><li>Copy log files to be replayed/recovered against a full backup </li></ul></ul><ul><ul><li>Hot failover </li></ul></ul>
    35. 35. Standard Backup <ul><li>Commit or abort all on-going transactions </li></ul><ul><li>Pause all database writes </li></ul><ul><li>Force a checkpoint </li></ul><ul><li>Copy all database files to backup location </li></ul><ul><ul><li>Find active database files with DB_ENV->log_archive() and the DB_ARCH_DATA flag </li></ul></ul><ul><ul><li>or with the db_archive program (db_archive -s) </li></ul></ul><ul><li>Copy the last log file to backup location </li></ul><ul><ul><li>DB_ENV->log_archive() with DB_ARCH_LOG, or db_archive -l, identifies all of the log files </li></ul></ul>
    36. 36. Hot Backup <ul><li>Do not stop database operations </li></ul><ul><li>Copy all database files to backup location </li></ul><ul><ul><li>Use DB_ENV->log_archive() with DB_ARCH_DATA or db_archive -s </li></ul></ul><ul><ul><li>Database files may be modified during the backup, so the copy utility must read each database page atomically. </li></ul></ul><ul><li>Copy all log files to backup location </li></ul><ul><li>The order of operations must be preserved. </li></ul><ul><li>Copy databases first and then log files. </li></ul>
    37. 37. Log File Removal <ul><li>Remove unused log files to regain disk space </li></ul><ul><ul><li>db_archive (no options) or </li></ul></ul><ul><ul><li>DB_ENV->log_archive() (no flags) to identify unused log files </li></ul></ul><ul><li>To allow catastrophic recovery </li></ul><ul><ul><li>Move to backup media; do not simply delete them </li></ul></ul><ul><li>Never remove all of the transaction logs </li></ul><ul><li>Do not remove active transaction logs </li></ul>
    38. 38. Recovery <ul><li>Write-ahead logging </li></ul><ul><ul><li>Log records always written to disk before database updates </li></ul></ul><ul><li>Recovery </li></ul><ul><ul><li>Database changes validated against the log </li></ul></ul><ul><ul><ul><li>Redo committed operations not in the database </li></ul></ul></ul><ul><ul><ul><li>Undo aborted operations that are in the database </li></ul></ul></ul><ul><ul><li>Removes and re-initializes the environment files </li></ul></ul><ul><ul><li>Must be single-threaded </li></ul></ul><ul><ul><li>Other threads of control must wait for recovery </li></ul></ul>
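The redo/undo idea can be shown with a toy log over an integer array. This is a simplified sketch, not Berkeley DB's recovery algorithm: the record layout, the `MAX_TXNS` bound and `recover` are hypothetical, every record is applied unconditionally (a real system compares LSNs stored on pages), and logged aborts and checkpoints are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TXNS 16     /* hypothetical bound on transaction ids */

typedef enum { REC_UPDATE, REC_COMMIT } rec_type;

typedef struct {
    rec_type type;
    int txn;            /* transaction id, 0..MAX_TXNS-1 */
    int slot;           /* index into the database array (REC_UPDATE only) */
    int old_value;      /* before-image, used for undo */
    int new_value;      /* after-image, used for redo */
} log_rec;

/* Brings db to a state containing exactly the committed updates. */
void recover(int *db, const log_rec *log, size_t nrecs) {
    bool committed[MAX_TXNS] = { false };

    /* Pass 1: find the transactions that reached their commit record. */
    for (size_t i = 0; i < nrecs; i++) {
        assert(log[i].txn >= 0 && log[i].txn < MAX_TXNS);
        if (log[i].type == REC_COMMIT)
            committed[log[i].txn] = true;
    }

    /* Pass 2, forward: redo updates of committed transactions. */
    for (size_t i = 0; i < nrecs; i++)
        if (log[i].type == REC_UPDATE && committed[log[i].txn])
            db[log[i].slot] = log[i].new_value;

    /* Pass 3, backward: undo updates of transactions that never committed. */
    for (size_t i = nrecs; i-- > 0; )
        if (log[i].type == REC_UPDATE && !committed[log[i].txn])
            db[log[i].slot] = log[i].old_value;
}
```

The sketch only works because of write-ahead logging: every change that might have reached the database array has a log record with both images, so recovery can always move a slot forward or backward.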
    39. 39. When to Run Recovery <ul><li>DB_RUNRECOVERY error </li></ul><ul><ul><li>Subsequent API calls return DB_RUNRECOVERY </li></ul></ul><ul><ul><li>Restart application and run recovery </li></ul></ul><ul><li>Always perform recovery at application startup </li></ul><ul><ul><li>Prevent spreading corruption </li></ul></ul>
    40. 40. Kinds of Recovery <ul><li>Normal recovery assumes no loss of media </li></ul><ul><ul><li>Reviews log records since the last checkpoint </li></ul></ul><ul><ul><li>More frequent checkpoints mean faster recovery </li></ul></ul><ul><ul><li>DB_RECOVER flag when opening the environment or </li></ul></ul><ul><ul><li>db_recover program </li></ul></ul><ul><li>Catastrophic recovery does the same, but: </li></ul><ul><ul><li>Reviews all available log files </li></ul></ul><ul><ul><li>Catastrophic recovery can take a while </li></ul></ul><ul><ul><li>DB_RECOVER_FATAL flag or db_recover -c </li></ul></ul><ul><li>Recovery to a timestamp is available </li></ul>
    41. 41. Recovery Procedures <ul><li>Simplest case </li></ul><ul><ul><li>Re-create the databases, no need for recovery </li></ul></ul><ul><li>Normal recovery </li></ul><ul><ul><li>Up-to-date database and log files are available </li></ul></ul><ul><ul><li>Not used when creating hot backups </li></ul></ul><ul><li>Catastrophic recovery </li></ul><ul><ul><li>Database or log files destroyed or corrupted </li></ul></ul><ul><ul><li>Normal recovery fails for any reason </li></ul></ul><ul><ul><li>Used when creating hot backups </li></ul></ul>
    42. 42. Catastrophic Recovery <ul><li>Often performed in a directory other than where the database files previously lived </li></ul><ul><li>Copy the most recent snapshot of the database and log files to the recovery directory </li></ul><ul><li>Copy any newer log files into the recovery directory </li></ul><ul><li>Log files must be recovered in sequential order </li></ul><ul><li>Run db_recover -c </li></ul><ul><ul><li>Or call DB_ENV->open() with DB_RECOVER_FATAL </li></ul></ul>
    43. 43. Maintaining a Hot Standby <ul><li>db_archive -s to identify all the active database files </li></ul><ul><li>Copy identified database files to the backup/failover directory </li></ul><ul><li>Archive all existing log files from the backup directory </li></ul><ul><li>Use db_archive (no option) in the active environment to identify the inactive logs and move them to the backup directory </li></ul><ul><li>Use db_archive -l in the active environment to identify all active log files and copy these to the hot failover directory </li></ul><ul><li>Run db_recover -c against the hot failover directory to catastrophically recover the environment </li></ul><ul><li>Steps 2-5 can be repeated as often as desired </li></ul><ul><li>If step 1 is repeated, steps 2-5 must follow to ensure consistency </li></ul>
    44. 44. Programming Practices <ul><li>Disk guarantees </li></ul><ul><li>Application structures </li></ul><ul><li>Locking </li></ul><ul><li>Special errors </li></ul>
    45. 45. Disk I/O Integrity Guarantees <ul><li>Disk writes are performed a database page at a time </li></ul><ul><li>Berkeley DB assumes pages are atomically written </li></ul><ul><li>If the OS requires multiple writes per page </li></ul><ul><ul><li>A partial page can be written, possibly corrupting the database </li></ul></ul><ul><li>To guard against this: </li></ul><ul><ul><li>Set database page size equal to file system page size OR </li></ul></ul><ul><ul><li>Configure the database to perform checksums* </li></ul></ul><ul><ul><li>Use DB->set_flags() with the DB_CHKSUM flag </li></ul></ul><ul><ul><li>If corruption is detected, run catastrophic recovery </li></ul></ul><ul><ul><li>* Checksums can be computationally expensive </li></ul></ul>
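As a rough illustration of the first safeguard on this slide, the sketch below asks the file system for its preferred I/O block size and rounds it down to a power of two between 512 bytes and 64 KB, the range Berkeley DB accepts for page sizes. The helper name `suggest_page_size` is hypothetical, and treating POSIX `st_blksize` as the "file system page size" is an assumption; the result would be passed to DB->set_pagesize() before the database is created.

```c
#include <sys/stat.h>

/* Hypothetical helper, not part of Berkeley DB. */
#define SKETCH_MIN_PAGE_SIZE 512u
#define SKETCH_MAX_PAGE_SIZE 65536u

/*
 * Suggest a database page size for files stored under `dir`.
 * Returns the largest power of two in [512, 65536] that does not exceed
 * the block size reported by stat() (512 if the reported size is smaller),
 * or 0 if stat() fails.
 */
unsigned int suggest_page_size(const char *dir)
{
    struct stat sb;
    unsigned long blk;
    unsigned int size = SKETCH_MIN_PAGE_SIZE;

    if (stat(dir, &sb) != 0)
        return 0;
    if (sb.st_blksize <= 0)
        return SKETCH_MIN_PAGE_SIZE;

    blk = (unsigned long)sb.st_blksize;
    while (size < SKETCH_MAX_PAGE_SIZE && (unsigned long)size * 2 <= blk)
        size *= 2;
    return size;
}
```

A caller would typically run this once against the directory that will hold the database files and fall back to checksums if the value cannot be trusted on the target platform.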
    46. 46. Disk I/O Integrity Guarantees <ul><li>Berkeley DB relies on the integrity of the underlying software and hardware interfaces </li></ul><ul><ul><li>Berkeley DB assumes POSIX compliance, especially with regard to disk I/O </li></ul></ul><ul><ul><li>Avoid hardware that performs partial writes or “lies” about writes (acknowledges them when the data hits the disk cache and does not guarantee that the cache can be written under all circumstances). </li></ul></ul><ul><ul><li>Best practice: store database files and log files on physically separate disks </li></ul></ul>
    47. 47. Transactional Applications <ul><li>Must handle recovery gracefully </li></ul><ul><li>Always run recovery, normally on startup, before other threads/processes join the environment </li></ul><ul><li>Use DB_REGISTER to determine whether recovery is necessary. </li></ul><ul><ul><li>All DB_ENV handles must be opened with DB_REGISTER for this to work. </li></ul></ul><ul><ul><li>If DB_REGISTER and DB_RECOVER are both set, recovery will only be run if necessary. </li></ul></ul><ul><li>Use DB_ENV->failchk() after any thread/process failure to determine if it is safe to continue. </li></ul>
    48. 48. Single-process Applications <ul><li>A single process, multi-threaded application </li></ul><ul><ul><li>First thread opens the database environment </li></ul></ul><ul><ul><ul><li>Run recovery at this time </li></ul></ul></ul><ul><ul><li>First thread opens databases </li></ul></ul><ul><ul><li>First thread spawns subsequent threads </li></ul></ul><ul><ul><li>Subsequent threads share DB_ENV and DB handles </li></ul></ul><ul><ul><li>Last thread to exit closes the DB_ENV and DB handles </li></ul></ul><ul><li>If any thread exits abnormally </li></ul><ul><ul><li>Some thread calls DB_ENV->failchk() </li></ul></ul><ul><ul><li>Return value dictates action: continue or run recovery </li></ul></ul>
    49. 49. Multi-process Applications #1 <ul><li>Order of process start up must be controlled </li></ul><ul><ul><li>Recovery must happen before anything else </li></ul></ul><ul><ul><li>Recovery must be single-threaded </li></ul></ul><ul><li>Processes themselves may be threaded </li></ul><ul><li>Processes maintain their own DB_ENV and DB handles </li></ul><ul><li>If a thread of control exits abnormally </li></ul><ul><ul><li>All threads should exit the database environment </li></ul></ul><ul><ul><li>Recovery must be run </li></ul></ul><ul><li>Automate with the DB_REGISTER flag in DB_ENV->open() </li></ul>
    50. 50. Multi-process Applications #2 <ul><li>Often easiest to have a single monitoring process </li></ul><ul><li>Monitor is responsible for running recovery </li></ul><ul><li>Monitor starts other processes after recovery </li></ul><ul><li>Monitor is on a timer or waits for other processes to die </li></ul><ul><ul><li>At which time it uses the failchk() mechanism </li></ul></ul><ul><ul><li>If failchk() returns DB_RUNRECOVERY: </li></ul></ul><ul><ul><ul><li>Monitor runs recovery </li></ul></ul></ul><ul><ul><ul><li>Other processes die when the environment is recreated </li></ul></ul></ul><ul><ul><ul><li>Monitor kills stubborn or slow processes </li></ul></ul></ul><ul><ul><ul><li>Monitor restarts the system </li></ul></ul></ul>
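The monitor's reaction to a dead worker can be sketched as a small decision routine. Everything here is a stand-in: `SKETCH_RUNRECOVERY` plays the role of DB_RUNRECOVERY (its value is arbitrary), and the callbacks in `struct monitor_ops` are hypothetical hooks for DB_ENV->failchk(), for opening the environment with recovery, and for process control. None of them are Berkeley DB calls, and the demo callbacks only record the order in which they run.

```c
#include <string.h>

#define SKETCH_RUNRECOVERY (-1)   /* stand-in for DB_RUNRECOVERY */

/* Hypothetical hooks the monitor would implement. */
struct monitor_ops {
    int  (*failchk)(void *ctx);          /* would wrap DB_ENV->failchk() */
    int  (*run_recovery)(void *ctx);     /* would reopen the environment with recovery */
    void (*kill_stragglers)(void *ctx);  /* stop workers that did not exit on their own */
    void (*start_workers)(void *ctx);    /* bring the system back up */
};

/*
 * Called when a worker process has died.
 * Returns 0 if the environment is still usable, 1 if a full recovery
 * cycle ran, and -1 on an unexpected failchk result or failed recovery.
 */
int monitor_on_worker_exit(const struct monitor_ops *ops, void *ctx)
{
    int ret = ops->failchk(ctx);

    if (ret == 0)
        return 0;
    if (ret != SKETCH_RUNRECOVERY)
        return -1;
    if (ops->run_recovery(ctx) != 0)
        return -1;
    ops->kill_stragglers(ctx);
    ops->start_workers(ctx);
    return 1;
}

/* Demo callbacks: each appends one letter to a log so the order is visible. */
struct trace { char log[16]; int failchk_ret; };

static int demo_failchk(void *c)
{
    struct trace *t = c;
    strcat(t->log, "F");
    return t->failchk_ret;
}
static int  demo_recover(void *c) { strcat(((struct trace *)c)->log, "R"); return 0; }
static void demo_kill(void *c)    { strcat(((struct trace *)c)->log, "K"); }
static void demo_start(void *c)   { strcat(((struct trace *)c)->log, "S"); }

/* Runs the handler with the demo callbacks and returns the recorded order. */
const char *demo_trace(int failchk_ret)
{
    static struct trace t;
    static const struct monitor_ops ops = {
        demo_failchk, demo_recover, demo_kill, demo_start
    };

    memset(&t, 0, sizeof t);
    t.failchk_ret = failchk_ret;
    monitor_on_worker_exit(&ops, &t);
    return t.log;
}
```

With the demo callbacks, `demo_trace(0)` records only the failchk call ("F"), while `demo_trace(SKETCH_RUNRECOVERY)` records failchk, recovery, kill, and restart in that order ("FRKS"), matching the sequence on the slide.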
    51. 51. Locking in Transactions <ul><li>Initialize locking subsystem </li></ul><ul><ul><li>Pass the DB_INIT_LOCK flag to DB_ENV->open() </li></ul></ul><ul><li>Pages or records are read- or write-locked </li></ul><ul><li>Conflicting lock requests wait until the object is available </li></ul><ul><li>Locking within transactions </li></ul><ul><ul><li>Locks are released at the end of the transaction </li></ul></ul><ul><ul><li>Earlier if specified (DB_READ_COMMITTED, DB_READ_UNCOMMITTED). </li></ul></ul><ul><li>Locking without transactions </li></ul><ul><ul><li>Locks are released at the end of the operation </li></ul></ul>
    52. 52. Additional Locking Options <ul><li>Use lock timeouts instead of or in addition to regular deadlock detection </li></ul><ul><ul><li>DB_LOCK_NOTGRANTED </li></ul></ul><ul><ul><ul><li>(optional) is returned when a lock times out </li></ul></ul></ul><ul><ul><li>DB_LOCK_DEADLOCK </li></ul></ul><ul><ul><ul><li>returned when a lock times out </li></ul></ul></ul><ul><ul><ul><li>or the transaction is selected for deadlock resolution </li></ul></ul></ul><ul><li>Accuracy of timeout depends on how often deadlock detection is performed </li></ul>
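Whichever of the two codes an application sees, a common response is to abort the transaction and try it again. The sketch below shows that loop without linking against Berkeley DB: `SKETCH_LOCK_DEADLOCK` and `SKETCH_LOCK_NOTGRANTED` are stand-ins for DB_LOCK_DEADLOCK and DB_LOCK_NOTGRANTED (their values are arbitrary here), and `txn_body_fn` is a hypothetical callback that would begin a transaction, do the work, and commit, aborting and returning the error code on failure.

```c
#define SKETCH_LOCK_DEADLOCK   (-1)  /* stand-in for DB_LOCK_DEADLOCK */
#define SKETCH_LOCK_NOTGRANTED (-2)  /* stand-in for DB_LOCK_NOTGRANTED */

/* Hypothetical: one complete begin/work/commit-or-abort attempt. */
typedef int (*txn_body_fn)(void *ctx);

/*
 * Runs `body` until it returns something other than a deadlock or
 * lock-timeout code, or until `max_attempts` attempts have been made.
 * Returns the result of the last attempt (SKETCH_LOCK_DEADLOCK if
 * max_attempts is not positive and no attempt was made).
 */
int run_with_retry(txn_body_fn body, void *ctx, int max_attempts)
{
    int ret = SKETCH_LOCK_DEADLOCK;
    int attempt;

    for (attempt = 0; attempt < max_attempts; attempt++) {
        ret = body(ctx);
        if (ret != SKETCH_LOCK_DEADLOCK && ret != SKETCH_LOCK_NOTGRANTED)
            break;  /* success, or an error that retrying will not fix */
    }
    return ret;
}

/* Demo body: reports a deadlock until the counter reaches zero, then succeeds. */
int flaky_body(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return SKETCH_LOCK_DEADLOCK;
    }
    return 0;
}
```

Capping the number of attempts keeps a persistently contended transaction from spinning forever; a real application might also pause briefly between attempts.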
    53. 53. Special Errors <ul><li>DB_RUNRECOVERY is a catastrophic error: it should never happen </li></ul><ul><li>Subsequent calls to the API will return the same error </li></ul><ul><li>DB->set_paniccall() or DB_ENV->set_paniccall() </li></ul><ul><ul><li>Callback function invoked when a catastrophic error occurs </li></ul></ul><ul><li>Exit and restart the application </li></ul><ul><ul><li>Run recovery if you are using an environment </li></ul></ul>
    54. 54. End of Part II