CS 542 Database Management Systems Failure Recovery, Concurrency Control J Singh April 4, 2011
Today’s meeting The D in ACID: Durability. The ACI in ACID: Consistency is specified by users in how they define transactions; the database is responsible for Atomicity and Isolation.
Types of Failures Potential sources of failures: power loss, resulting in loss of main-memory state; media failures, resulting in loss of disk state; and software errors, resulting in both. Recovery is based on the concept of transactions.
Transactions and Concurrency Users submit transactions, and think of each transaction as executing by itself. Concurrency is achieved by the DBMS, which interleaves actions (reads/writes of DB objects) of various transactions. Each transaction must leave the database in a consistent state if the DB is consistent when the transaction begins. A transaction can end in two different ways: commit (successful end, all actions completed) or abort (unsuccessful end, only some actions executed). Issues: the effect of interleaving transactions on the database. System failures (today’s lecture). Concurrent transactions (partly today, remainder next week).
Transactions, Logging and Recovery We studied Query Processing in the last two lectures Now, Log Manager and Recovery Manager Second part today, Transaction Manager
Reminder: Buffer Management [Figure: page requests from higher levels are served from the buffer pool in main memory; disk pages are read from the DB on disk into free frames, and the choice of frame is dictated by the replacement policy.] Data must be in RAM for the DBMS to operate on it!
Primitive Buffer Operations Requests from Transactions Read (x,t): Input(x) if necessary Assign value of x in block to local variable t (in buffer) Write (x,t): Input(x) if necessary Assign value of local variable t (in buffer) to x Requests to Disk Input (x): Transfer block containing x from disk to memory (buffer) Output (x): Transfer block containing x from buffer to disk
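The four primitives above can be sketched as a toy buffer manager. This is an illustration only: the class name `BufferManager` is hypothetical, and it assumes each database element occupies its own block, which a real buffer pool does not.

```python
class BufferManager:
    """Toy model: each database element lives in its own block (an assumption)."""

    def __init__(self, disk):
        self.disk = disk      # element name -> value on disk
        self.buffer = {}      # element name -> value in the buffer pool

    def input(self, x):
        # Input(x): transfer the block containing x from disk to the buffer.
        self.buffer[x] = self.disk[x]

    def output(self, x):
        # Output(x): transfer the block containing x from the buffer to disk.
        self.disk[x] = self.buffer[x]

    def read(self, x):
        # Read(x, t): Input(x) if necessary, then hand the value back for t.
        if x not in self.buffer:
            self.input(x)
        return self.buffer[x]

    def write(self, x, t):
        # Write(x, t): Input(x) if necessary, then copy t into x in the buffer.
        if x not in self.buffer:
            self.input(x)
        self.buffer[x] = t


bm = BufferManager({"A": 8, "B": 8})
t = bm.read("A")
bm.write("A", t * 2)
state_before_output = (bm.buffer["A"], bm.disk["A"])  # buffer changed, disk not yet
bm.output("A")
state_after_output = (bm.buffer["A"], bm.disk["A"])   # disk now matches the buffer
```

The point of the sketch is that Write only changes the buffer; the disk copy changes only on Output, which is why a crash between the two matters.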
Failure Recovery Approaches All of the approaches rely on logging – storing a log of changes to the database so it is possible to restore its state. They differ in What information is logged, The timing of when to force that information to stable storage, What the procedure for recovery will be The approaches are named after the recovery procedure Undo Logging The log contains enough information to detect if the transaction was committed and to roll back the state if it was not. When recovering after a failure, walk back through the log and undo the effect of all txns that do not have a COMMIT entry in the log Other approaches described later
Undo Logging When executing transactions Write the log before writing transaction data and force it to disk Make sure to preserve chronological order The log contains enough information to detect if the transaction was committed and to roll back the state if it was not. When restarting, Walk back through the log and undo the effect of all uncommitted txns in the log. Challenge: How far back do we need to look? Answer: Until the last checkpoint Define and implement checkpoints momentarily
An Example Transaction Initially A = 8, B = 8. Transaction T1 doubles both: A ← 2·A, B ← 2·B. Transaction T1: Read(A,t); t ← t × 2; Write(A,t); Read(B,t); t ← t × 2; Write(B,t); Output(A); Output(B). The failure occurs after Output(A) and before Output(B). State at failure point: Memory: A = 16, B = 16. Disk: A = 16, B = 8. Undo log entries: <T1, start>, <T1, A, 8>, <T1, B, 8>. <T1, Commit> would have been written if the transaction had completed. Do we have the info to restore?
Execution with Undo Logging Forces all log records to disk Logging Rule:
If a transaction commits, the commit record must be written to disk after all data records have been written to disk
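As a rough illustration of the two ordering rules (log record before data, data before COMMIT), the following sketch checks a trace of disk-level events. The event names `log`, `output` and `commit` and the function `obeys_undo_rules` are hypothetical, chosen only for this example.

```python
def obeys_undo_rules(trace):
    """Check a list of disk-level events against the two undo-logging rules."""
    logged = set()   # (txn, element) pairs whose update record is on disk
    pending = {}     # txn -> elements logged but not yet output to disk
    for event in trace:
        kind, txn = event[0], event[1]
        if kind == "log":
            logged.add((txn, event[2]))
            pending.setdefault(txn, set()).add(event[2])
        elif kind == "output":
            if (txn, event[2]) not in logged:
                return False   # data reached disk before its log record
            pending[txn].discard(event[2])
        elif kind == "commit":
            if pending.get(txn):
                return False   # COMMIT reached disk before all data did
    return True


good = [("log", "T1", "A"), ("log", "T1", "B"),
        ("output", "T1", "A"), ("output", "T1", "B"), ("commit", "T1")]
early_output = [("output", "T1", "A"), ("log", "T1", "A"), ("commit", "T1")]
early_commit = [("log", "T1", "A"), ("log", "T1", "B"),
                ("output", "T1", "A"), ("commit", "T1"), ("output", "T1", "B")]
```

Here `good` satisfies both rules, while `early_output` and `early_commit` each violate one of them.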
Recovery with Undo Logging Consider all uncommitted transactions, starting with the most recent one and going backward. Undo all actions of these transactions. Why going backward, not forward? Example: T1, T2 and T3 all write A. T1 executed before T2 before T3. T1 committed, T2 and T3 incomplete. Time/log order: T1 write A, T2 write A, T3 write A, T1 commit, system failure.
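A minimal sketch of the backward undo pass, using the T1/T2/T3 example above. The tuple-based log format and the concrete values of A are assumptions made for illustration; appending ABORT records after the pass is one common way to keep a repeated recovery from redoing work.

```python
def undo_recover(log, disk):
    """Undo, newest first, every update of a txn with no COMMIT or ABORT record."""
    finished = {rec[1] for rec in log if rec[0] in ("COMMIT", "ABORT")}
    started = {rec[1] for rec in log if rec[0] == "START"}
    for rec in reversed(log):                 # walk the log backward
        if rec[0] == "UPDATE" and rec[1] not in finished:
            _, txn, element, old_value = rec
            disk[element] = old_value         # restore the logged old value
    # Record that the incomplete transactions have now been rolled back.
    log.extend(("ABORT", txn) for txn in sorted(started - finished))
    return disk


# Update records carry the OLD value, as in undo logging.
log = [
    ("START", "T1"), ("UPDATE", "T1", "A", 1),   # T1 changes A from 1 to 2
    ("START", "T2"), ("UPDATE", "T2", "A", 2),   # T2 changes A from 2 to 3
    ("START", "T3"), ("UPDATE", "T3", "A", 3),   # T3 changes A from 3 to 4
    ("COMMIT", "T1"),
]
disk = {"A": 4}
undo_recover(log, disk)
```

Going backward restores A to 3 and then to 2, the value committed by T1. Going forward would leave A at 3, a value written by the uncommitted T2.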
More on Undo Logging Failure During Recovery Recovery algorithm is idempotent Just do it again! How much of the log file needs to be processed? In principle, we need to examine the entire log. Checkpointing limits the part of the log that needs to be considered during recovery up to a certain point (checkpoint).
Quiescent Checkpointing Simple approach to introduce the concept. Pause the database: stop accepting new transactions; wait until all current transactions commit or abort and have written the corresponding log records; flush the log to disk; write a <CKPT> log record and flush the log; resume accepting new transactions. Once we encounter a checkpoint record, we know that there are no incomplete transactions, so we do not need to go backward beyond the checkpoint and can afford to throw away any part of the log prior to it. Drawback: pausing the database may not be acceptable for business reasons.
Non-quiescent Checkpointing Main idea: Start- and End-Checkpoints to bracket unfinished txns. Write a <START CKPT (T1, T2, … Tk)> record into the log, where T1, T2, … Tk are the unfinished txns. Wait till T1, T2, … Tk commit or abort, but allow other txns to begin. Write an <END CKPT> record into the log. Recovery method: scan the log backwards until a checkpoint record (<START CKPT…> or <END CKPT>) is found. If <END CKPT>, scan backwards to the previous <START CKPT…>; no need to look any further. If <START CKPT…>, then the crash must have occurred during checkpointing; the START record tells us the unfinished txns, so scan back to the beginning of the oldest one of these.
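The rule for how far back recovery must scan can be sketched as follows. The tuple-based record format and the function name `undo_scan_start` are assumptions for illustration, and the sample log is made up.

```python
def undo_scan_start(log):
    """Return the index of the earliest log record that undo recovery must examine."""
    saw_end = False
    for i in range(len(log) - 1, -1, -1):      # scan backward from the end
        rec = log[i]
        if rec[0] == "END_CKPT":
            saw_end = True
        elif rec[0] == "START_CKPT":
            if saw_end:
                return i                       # completed checkpoint: stop here
            # Crash during checkpointing: go back to the start of the
            # oldest transaction listed in the START CKPT record.
            active = set(rec[1])
            starts = [j for j in range(i)
                      if log[j][0] == "START" and log[j][1] in active]
            return min(starts, default=i)
    return 0                                   # no checkpoint: whole log


log = [
    ("START", "T1"), ("UPDATE", "T1", "A", 5),
    ("START", "T2"), ("UPDATE", "T2", "B", 10),
    ("START_CKPT", ("T1", "T2")),
    ("UPDATE", "T2", "C", 15), ("START", "T3"),
    ("UPDATE", "T1", "D", 20), ("COMMIT", "T1"),
    ("UPDATE", "T3", "E", 25), ("COMMIT", "T2"),
    ("END_CKPT",),
    ("UPDATE", "T3", "F", 30),
]
after_complete_ckpt = undo_scan_start(log)        # crash after END CKPT
during_ckpt = undo_scan_start(log[:10])           # crash before END CKPT
```

With the complete checkpoint the scan stops at the START CKPT record (index 4); with the crash during checkpointing it must go back to the start of T1 (index 0).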
Issues with Undo Logging Bottlenecks on I/O: all log records must be forced to disk before any data is written back, and all data records must be forced to disk before the COMMIT record is written. An alternative: Redo Logging. Instead of scanning backward from the end and undoing all transactions that were not completed, it scans the log forward and reapplies the changes of all committed transactions, since those changes may not have reached disk.
Logging with Redo Logs Creation of the redo log: for every action, generate a redo log record. <T, X, v> has a different meaning: v is the new value, not the old one. Flush the log at commit. All log records for the transaction that modified X (including the commit record) must be on disk before X is modified on disk. Write an END log record after the DB modifications have been written to disk. Recovery algorithm: redo the modifications by committed transactions not yet flushed to disk. Let S = set of txns with <Ti, commit> and no <Ti, end> in the log. For each <Ti, X, v> in the log, in forward order (from earliest to latest): if Ti is in S, then Write(X, v) and Output(X). Finally, for each Ti in S, write <Ti, END>.
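The redo recovery algorithm above can be sketched in a few lines. The tuple-based log format is an assumption for illustration, and writing directly into `disk` stands in for Write(X, v) followed by Output(X).

```python
def redo_recover(log, disk):
    """Reapply, oldest first, the updates of committed txns that have no END record."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    ended = {rec[1] for rec in log if rec[0] == "END"}
    s = committed - ended                      # the set S from the slide
    for rec in log:                            # forward order
        if rec[0] == "UPDATE" and rec[1] in s:
            _, txn, element, new_value = rec
            disk[element] = new_value          # Write(X, v); Output(X)
    log.extend(("END", txn) for txn in sorted(s))
    return disk


# Update records carry the NEW value, as in redo logging.
log = [
    ("START", "T1"), ("UPDATE", "T1", "A", 16), ("UPDATE", "T1", "B", 16),
    ("COMMIT", "T1"),
    ("START", "T2"), ("UPDATE", "T2", "C", 99),   # T2 never committed
]
disk = {"A": 8, "B": 8, "C": 1}
redo_recover(log, disk)
```

T1's changes are reapplied; T2 is ignored, which is safe because redo logging never lets an uncommitted change reach disk.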
Comments on Redo Logging Checkpoint algorithms similar to those for Undo Logging Quiescent as well as Non-quiescent algorithms Issues with Redo Logging Writing data back to disk is not allowed until transaction logs have been written out Results in a large requirement for memory for buffer pool A flaw in the checkpointing algorithms (textbook, p869) Both undo and redo logs may put contradictory requirements on how buffers are handled during a checkpoint, unless the database elements are complete blocks or sets of blocks. For instance, if a buffer contains one database element A that was changed by a committed transaction and another database element B that was changed in the same buffer by a transaction that has not yet had its COMMIT record written to disk, then we are required to copy the buffer to disk because of A but also forbidden to do so, because rule R1 applies to B.
Undo/Redo Logging (p1) Undo logging requires modifications to be written to disk as soon as the transaction finishes (before its COMMIT record), leading to an unnecessarily large number of I/Os. Redo logging requires all modified blocks to be kept in the buffer until the transaction commits and the log records have been flushed, increasing the buffer size requirement. Undo/redo logging combines undo and redo logging. It provides more flexibility in flushing modified blocks, at the expense of maintaining more information in the log.
Undo/Redo Logging (p2) Main idea: The log can be used to reconstruct the data Update records <T, X, new, old> record new and old value of X. The only undo/redo logging rule is: Log record must be flushed before corresponding modified block Also known as write ahead logging. Block of X can be flushed before or after T commits, i.e. before or after the COMMIT log record. Flush the log at commit.
Undo/Redo Logging (p3) Because of the flexibility of flushing X before or after the COMMIT record, we can have uncommitted transactions with modifications on disk and committed transactions with modifications not yet on disk. The undo/redo recovery policy is as follows: Redo committed transactions. Undo uncommitted transactions.
Undo/Redo Logging Recovery More details on the recovery procedure. Backward pass: from the end of the log back to the latest valid checkpoint, construct the set S of committed transactions and undo the actions of transactions not in S. Forward pass: from the latest checkpoint forward to the end of the log (or from the beginning of the log, if there are no checkpoints), redo the actions of transactions in S. Alternatively, the redos can be performed before the undos.
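A minimal sketch of the two passes, assuming a log without checkpoints (so both passes cover the whole log) and update records of the form <T, X, new, old> as on the earlier slide. The tuple layout is an assumption for illustration.

```python
def undo_redo_recover(log, disk):
    """Undo uncommitted txns (backward pass), then redo committed ones (forward pass)."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}   # the set S
    for rec in reversed(log):                  # backward pass: undo
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, txn, element, new_value, old_value = rec
            disk[element] = old_value
    for rec in log:                            # forward pass: redo
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, txn, element, new_value, old_value = rec
            disk[element] = new_value
    return disk


log = [
    ("START", "T1"), ("UPDATE", "T1", "A", 16, 8),
    ("START", "T2"), ("UPDATE", "T2", "B", 20, 10),
    ("COMMIT", "T1"),
]
# A was NOT flushed although T1 committed; B WAS flushed although T2 did not commit.
disk = {"A": 8, "B": 20}
undo_redo_recover(log, disk)
```

The example shows both cases that the flexible flushing rule allows: a committed change missing from disk (redone) and an uncommitted change present on disk (undone).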
Undo/Redo Checkpointing Write "start checkpoint", listing all active transactions, to the log. Flush the log to disk. Write to disk all dirty buffers (those containing a changed DB element), whether or not the transaction has committed. This implies nothing should be written (not even to memory buffers) until we are sure the transaction will not abort, and that some log records may need to be written to disk first (WAL). Write "end checkpoint" to the log. Flush the log to disk. [Figure: log timeline with a "start ckpt" record listing the active T's (T1, T2, ...) followed later by an "end ckpt" record.]
Protecting Against Media Failure Logging protects from local loss of main memory and disk content, but not against global loss of secondary storage content (media failure). To protect against media failures, employ archiving: maintaining a copy of the database on a separate, secure storage device. Log also needs to be archived in the same manner. Two levels of archiving: full dump vs. incremental dump.
Protecting Against Media Failure Typically, database cannot be shut down for the period of time needed to make a backup copy (dump). Need to perform nonquiescent archiving, i.e., create a dump while the DBMS continues to process transactions. Goal is to make copy of database at time when the dump began, but transactions may change database content during the dumping. Logging continues during the dumping, and discrepancies can be corrected from the log.
Protecting Against Media Failure We assume undo/redo (or redo) logging. The archiving procedure is as follows: Write a log record <START DUMP>. Perform a checkpoint for the log. Perform a (full / incremental) dump on the secure storage device. Make sure that enough of the log has been copied to the secure storage device so that at least the log up to the check point will survive media failure. Write a log record <END DUMP>.
Protecting Against Media Failure After a media failure, we can restore the DB from the archived DB and archived log as follows: Copy latest full dump (archive) back to DB. Starting with the earliest ones, make the modifications recorded in the incremental dump(s) in increasing order of time. Further modify DB using the archived log. Use the recovery method corresponding to the chosen type of logging.
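The restore steps can be sketched as below. This assumes redo-style log records that carry new values, incremental dumps tagged with a timestamp, and hypothetical names throughout; it is not a description of any particular system's backup format.

```python
def restore(full_dump, incremental_dumps, archived_log):
    """Rebuild the DB from a full dump, incremental dumps, and the archived log."""
    db = dict(full_dump)                                   # 1. latest full dump
    for _, changes in sorted(incremental_dumps, key=lambda dump: dump[0]):
        db.update(changes)                                 # 2. incrementals, earliest first
    committed = {rec[1] for rec in archived_log if rec[0] == "COMMIT"}
    for rec in archived_log:                               # 3. redo from the archived log
        if rec[0] == "UPDATE" and rec[1] in committed:
            db[rec[2]] = rec[3]
    return db


full = {"A": 1, "B": 2, "C": 3}
incrementals = [(2, {"B": 22}), (1, {"A": 10, "B": 20})]   # (timestamp, changed elements)
archived_log = [("UPDATE", "T9", "C", 30), ("COMMIT", "T9"),
                ("UPDATE", "T10", "A", 99)]                # T10 never committed
restored = restore(full, incrementals, archived_log)
```

Applying the incremental dumps in increasing order of time matters: B ends up as 22, the later value, even though the dumps were supplied out of order.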
Summary Logging is an effective way to prepare for system failure. Transactions provide a useful building block on which to base log entries. Three types of logs: Undo logs, Redo logs, Undo/Redo logs. Only Undo/Redo logs are used in practice. Why? Periodic checkpoints are necessary for keeping recovery times under control. Why? Database dumps (archives) protect against media failure. Great for making a “point in time” copy of the database.
On the NoSQL Front… Google Datastore Recently (1/2011) added a “High Replication” option. Replicates the datastore synchronously across multiple data centers Does not use an append-only log Has performance and size impact CouchDB Append-only log that’s actually a b-tree No provision for deleting part of the log Provision for ‘compacting the log’ MongoDB Recently (12/2010) added a --journal option Has performance impact, no measurements available Common thread, tradeoff between performance and durability!
CS 542 Database Management Systems Concurrency Control J Singh April 4, 2011
Concurrency Control Goal: Preserving Data Integrity Challenge: enforce ACID rules (while maintaining maximum traffic through the system) Committed transactions leave the system in a consistent state Rolled-back transactions behave as if they never happened! Historical Note Based on The Transaction Concept: Virtues and Limitations by Jim Gray, Tandem Computers, 1981 ACM Turing Award, 1998
Transactions Concurrent execution of user programs is essential for good DBMS performance. Because disk accesses are frequent, and relatively slow, it is important to keep the cpu humming by working on several user programs concurrently. A user’s program may carry out many operations on the data retrieved from the database, but the DBMS is only concerned about what data is read/written from/to the database. A transaction is the DBMS’s abstract view of a user program: a sequence of reads and writes. Referred to as a Schedule Implemented by a Transaction Scheduler
Scheduler Scheduler takes read/write requests from transactions Either executes them in buffers or delays them Scheduler must avoid Isolation Anomalies
Isolation Anomalies (p1) READ UNCOMMITTED Dirty Read – data of an uncommitted transaction visible to others Sometimes called WR Conflict UNREPEATABLE READ Non-repeatable Read – some previously read data changes due to another transaction committing Sometimes called RW Conflict T1: R(A), W(A), R(B), W(B), C T2: R(A), W(A), R(B), W(B), C T1: R(A), W(A), C T2: R(A), W(A), C
Isolation Anomalies (p2) Overwriting Uncommitted Data Sometimes called WW Conflicts We need a set of rules to prohibit such isolation anomalies The rules place constraints on the actions of concurrent transactions T1: W(A), W(B), C T2: W(A), W(B), C
Serial Schedules Definition: A schedule is a list of actions (i.e. reading, writing, aborting, committing) from a set of transactions. A schedule is serial if its transactions are not interleaved. Serial schedules observe ACI properties. Schedule D is the set of 3 transactions T1, T2, T3. T1 reads and writes to object X, then T2 reads and writes to object Y, then T3 reads and writes to object Z. D is an example of a serial schedule, because the 3 txns are not interleaved. Shorthand: D = R1(X); W1(X); R2(Y); W2(Y); R3(Z); W3(Z);
Serializable Schedules A serializable schedule is one that is equivalent to a serial schedule. The Transaction Manager should defer some transactions if the current schedule is not serializable. The order of transactions in E is not the same as in D, but E gives the same result. Shorthand: E = R1(X); R2(Y); R3(Z); W1(X); W2(Y); W3(Z);
Serializability Is G serializable? It is equivalent to the serial schedule <T1,T2>, but not to <T2,T1>. G is conflict-serializable. Conflict equivalence: The schedules S1 and S2 are conflict-equivalent if the following conditions are satisfied: both schedules S1 and S2 involve the same set of transactions (including the ordering of actions within each transaction), and the order of each pair of conflicting actions is the same in S1 and S2. Conflict-serializability: A schedule is conflict-serializable when the schedule is conflict-equivalent to one or more serial schedules.
Serializability of Schedule G T1: R(A) W(B); T2: R(A) W(A). Precedence graph: a node for each transaction, and an arc from Ti to Tj if an action in Ti precedes and conflicts with an action in Tj. [Figure: precedence graph with nodes T1 and T2.] Two actions conflict if: the actions belong to different transactions; at least one of the actions is a write operation; the actions access the same object (read or write). T1 before T2? R1(A) W1(B) R2(A) W2(A): no conflicts. T2 before T1? R2(A) W2(A) R1(A) W1(B): conflicts. Theorem: A schedule is conflict serializable if and only if its precedence graph is acyclic.
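The theorem suggests a direct test: build the precedence graph from the conflict definition and check it for cycles. In this sketch the interleaving chosen for G, R1(A) R2(A) W1(B) W2(A), is an assumption, since the slide lists only each transaction's own actions; the function names are hypothetical.

```python
def precedence_graph(schedule):
    """schedule: list of (txn, action, obj) with action 'R' or 'W'. Returns a set of arcs."""
    arcs = set()
    for i, (ti, ai, oi) in enumerate(schedule):
        for tj, aj, oj in schedule[i + 1:]:
            # Conflict: different txns, same object, at least one write.
            if ti != tj and oi == oj and "W" in (ai, aj):
                arcs.add((ti, tj))
    return arcs


def is_conflict_serializable(schedule):
    """True if the precedence graph is acyclic (checked by removing source nodes)."""
    arcs = precedence_graph(schedule)
    nodes = {txn for txn, _, _ in schedule}
    while nodes:
        sources = {n for n in nodes
                   if not any(v == n and u in nodes for u, v in arcs)}
        if not sources:
            return False        # every remaining node has an incoming arc: cycle
        nodes -= sources
    return True


# Assumed interleaving of schedule G.
G = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "B"), ("T2", "W", "A")]
# A schedule with arcs in both directions between T1 and T2.
H = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]
```

For the assumed G the only arc is T1 to T2, matching the slide's claim that G is equivalent to <T1,T2> but not <T2,T1>.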
Enforcing Serializable Schedules Prevent cycles in the Precedence Graph, P(S), from occurring. Locking primitives: Lock (exclusive): li(A); Unlock: ui(A). Make transactions consistent: Ti: pi(A) becomes Ti: li(A) pi(A) ui(A), where pi(A) is either a READ or a WRITE. Allow only one transaction to hold a lock on A at any time. Two-phase locking for transactions: Ti: li(A) … pi(A) … ui(A), with no unlocks in the first (locking) phase and no locks in the second (unlocking) phase.
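The two-phase condition on a single transaction is simple to check: once any unlock has happened, no further lock may follow. The action encoding below is an assumption made for the example.

```python
def is_two_phase(actions):
    """actions: one transaction's sequence of (operation, object) pairs."""
    unlocked = False
    for op, _ in actions:
        if op == "unlock":
            unlocked = True          # the unlocking phase has begun
        elif op == "lock" and unlocked:
            return False             # a lock after an unlock breaks 2PL
    return True


two_phase = [("lock", "A"), ("read", "A"), ("lock", "B"),
             ("write", "B"), ("unlock", "A"), ("unlock", "B")]
not_two_phase = [("lock", "A"), ("read", "A"), ("unlock", "A"),
                 ("lock", "B"), ("write", "B"), ("unlock", "B")]
```

Both sequences are consistent in the slide's sense (every action is bracketed by a lock and an unlock), but only the first is two-phase.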
Locking Protocols for Serializable Schedules Strict Two-phase Locking (Strict 2PL) Protocol: Each transaction must obtain an S (shared) lock on an object before reading, and an X (exclusive) lock on an object before writing. All locks held by a transaction are released when the transaction completes. Strict 2PL allows only serializable schedules. Additionally, it simplifies transaction aborts. (Non-strict) 2PL Variant: Release locks anytime, but cannot acquire locks after releasing any lock. If a txn holds an X lock on an object, no other txn can get a lock (S or X) on that object. (Non-strict) 2PL also allows only serializable schedules, but involves more complex abort processing. Why is “acquiring after releasing” disallowed? Because a schedule in which a txn locks again after unlocking may not be serializable. Holding all locks until completion (Strict 2PL) additionally avoids cascading aborts. More in a minute.
Executing Locking Protocols Begin with a Serialized Schedule We know it won’t deadlock How do we know this? Beyond this simple 2PL protocol, it is all a matter of improving performance and allowing more concurrency…. Shared locks Increment locks Multiple granularity Other types of concurrency control mechanisms
Lock Management Lock and unlock requests are handled by the lock manager Lock Table Entry Number of transactions currently holding a lock Type of lock held (shared or exclusive) Pointer to queue of lock requests Locking and unlocking operations Atomic Support upgrade: transaction that holds a shared lock can be upgraded to hold an exclusive lock Any level of granularity can be locked Database, table, block, tuple Why is this necessary?
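A lock table entry as described above (holders, lock type, queue of requests) can be sketched as follows. This is a single-threaded toy: the class and method names are hypothetical, atomicity of the operations is taken for granted, and placing upgrade requests at the front of the queue is a design choice made here to avoid an upgrade waiting behind a request that waits on it.

```python
from collections import deque


class LockManager:
    def __init__(self):
        self.table = {}   # object -> {"mode": "S"/"X"/None, "holders": set, "queue": deque}

    def _entry(self, obj):
        return self.table.setdefault(
            obj, {"mode": None, "holders": set(), "queue": deque()})

    @staticmethod
    def _compatible(entry, txn, mode):
        others = entry["holders"] - {txn}
        return not others or (mode == "S" and entry["mode"] == "S")

    @staticmethod
    def _grant(entry, txn, mode):
        entry["holders"].add(txn)
        if mode == "X" or entry["mode"] is None:
            entry["mode"] = mode

    def acquire(self, txn, obj, mode):
        """Return True if granted now, False if the request was queued."""
        entry = self._entry(obj)
        is_holder = txn in entry["holders"]
        if (is_holder or not entry["queue"]) and self._compatible(entry, txn, mode):
            self._grant(entry, txn, mode)       # includes the S -> X upgrade case
            return True
        if is_holder:
            entry["queue"].appendleft((txn, mode))   # upgrades go first
        else:
            entry["queue"].append((txn, mode))
        return False

    def release(self, txn, obj):
        """Release txn's lock and return the waiters that were granted as a result."""
        entry = self._entry(obj)
        entry["holders"].discard(txn)
        if not entry["holders"]:
            entry["mode"] = None
        granted = []
        while entry["queue"] and self._compatible(entry, *entry["queue"][0]):
            waiter, mode = entry["queue"].popleft()
            self._grant(entry, waiter, mode)
            granted.append(waiter)
        return granted


lm = LockManager()
results = [lm.acquire("T1", "A", "S"), lm.acquire("T2", "A", "S"),
           lm.acquire("T1", "A", "X"), lm.acquire("T3", "A", "X")]
after_t2 = lm.release("T2", "A")      # T1's upgrade can now proceed
mode_after_upgrade = lm.table["A"]["mode"]
after_t1 = lm.release("T1", "A")      # T3 finally gets its exclusive lock
```

The number of transactions holding the lock is `len(entry["holders"])`; a real lock manager would also need deadlock handling, which this sketch omits.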
Multiple-Granularity Locks If a transaction needs to scan all records in a table, we don’t really want to have a lock on all tuples individually: that is significant locking overhead! Put a single lock on the table. Hierarchy: Database contains Tables, which contain Pages, which contain Tuples. A lock on a node implicitly locks all descendants.
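The implicit-lock rule can be illustrated by walking up the containment hierarchy. The node names and the `PARENT` map are made up for this example.

```python
# Hypothetical containment hierarchy: child -> parent.
PARENT = {
    "db": None,
    "table:emp": "db",
    "page:emp/1": "table:emp",
    "tuple:emp/1/7": "page:emp/1",
}


def covering_lock(node, explicit_locks, parent=PARENT):
    """Return the nearest node (self or ancestor) that is explicitly locked, else None."""
    while node is not None:
        if node in explicit_locks:
            return node
        node = parent[node]
    return None


explicit_locks = {"table:emp": "T1"}   # T1 holds one lock on the whole table
covered_by = covering_lock("tuple:emp/1/7", explicit_locks)
db_cover = covering_lock("db", explicit_locks)
```

One table lock covers every page and tuple below it, so a full scan needs a single lock table entry instead of one per tuple.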
Aborting a Transaction If a transaction Ti is aborted, all its actions have to be undone. If Tj reads an object last written by Ti, Tj must be aborted as well! Most systems avoid such cascading aborts by releasing a transaction’s locks only at commit time. If Ti writes an object, Tj can read this only after Ti commits. In order to undo the actions of an aborted transaction, the DBMS maintains a log in which every write is recorded. The same mechanism is used to recover from system crashes; all active txns at the time of the crash are aborted when the system recovers.
Performance Considerations (Again!) 2PL Protocol allows transactions to proceed with maximum parallelism Locking algorithm only delays actions that would cause conflicts But the locks are still a bottleneck Need to ensure lowest-possible level of locking granularity Classic memory-performance trade-off Conflict-serialization is too conservative But other methods of serialization are too complex A use case that occurs quite often, should be optimized Besides scanning through the table, if we need to modify a few tuples, what kind of lock to put on the table? Have to be X (if we only have S or X). But, blocks all other read requests! Concurrency control is pessimistic and acquires/releases locks Optimistic Concurrency Control
Next Week Intention Locks Optimistic Concurrency Control Distributed Commit Please Read ahead of time The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007 OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008