CS 542 -- Failure Recovery, Concurrency Control




  1. 1. CS 542 Database Management Systems<br />Failure Recovery, Concurrency Control<br />J Singh <br />April 4, 2011<br />
  2. 2. Today’s meeting<br />The D in ACID: Durability<br />The ACI in ACID<br />Consistency is specified by users in how they define transactions<br />The Database is responsible for Atomicity and Isolation<br />
  3. 3. Types of Failures<br />Potential sources of failures:<br />Power loss, resulting in loss of main-memory state,<br />Media failures, resulting in loss of disk state and<br />Software errors, resulting in both<br />Recovery is based on the concept of transactions.<br />
  4. 4. Transactions and Concurrency<br />Users submit transactions, and think of each transaction as executing by itself.<br />Concurrency is achieved by the DBMS, which interleaves actions (reads/writes of DB objects) of various transactions.<br />Each transaction must leave the database in a consistent state if the DB is consistent when the transaction begins.<br />A transaction can end in two different ways:<br />commit: successful end, all actions completed,<br />abort: unsuccessful end, only some actions executed.<br />Issues: effect of interleaving transactions on the database<br />System failures (today’s lecture)<br />Concurrent transactions (partly today, remainder next week)<br />
  5. 5. Transactions, Logging and Recovery<br />We studied Query Processing in the last two lectures<br />Now, Log Manager and Recovery Manager<br />Second part today, Transaction Manager<br />
  6. 6. Reminder: Buffer Management<br />[Figure: page requests from higher levels arrive at the buffer pool in main memory, whose frames either hold a disk page or are free; the DB resides on disk]<br />Choice of frame is dictated by the replacement policy<br />Data must be in RAM for DBMS to operate on it!<br />
  7. 7. Primitive Buffer Operations<br />Requests from Transactions<br />Read (x,t): <br />Input(x) if necessary<br />Assign value of x in block to local variable t (in buffer)<br />Write (x,t): <br />Input(x) if necessary<br />Assign value of local variable t (in buffer) to x<br />Requests to Disk<br />Input (x):<br />Transfer block containing x from disk to memory (buffer)<br />Output (x):<br />Transfer block containing x from buffer to disk<br />
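The four primitives above can be sketched in a few lines of Python. This is a toy model, not the lecture's implementation: the class name `BufferManager` is hypothetical, and it assumes each database element occupies its own block, so "the block containing x" is just x.

```python
class BufferManager:
    """Toy model: every database element lives in its own block (assumption)."""

    def __init__(self, disk):
        self.disk = disk      # element name -> value on disk
        self.buffer = {}      # element name -> value in main memory

    def input(self, x):
        # Input(x): transfer the block containing x from disk to the buffer.
        self.buffer[x] = self.disk[x]

    def output(self, x):
        # Output(x): transfer the block containing x from the buffer to disk.
        self.disk[x] = self.buffer[x]

    def read(self, x):
        # Read(x, t): Input(x) if necessary, then hand the value to the caller,
        # who keeps it in a local variable t.
        if x not in self.buffer:
            self.input(x)
        return self.buffer[x]

    def write(self, x, t):
        # Write(x, t): Input(x) if necessary, then assign t to x in the buffer.
        if x not in self.buffer:
            self.input(x)
        self.buffer[x] = t


disk = {"A": 8, "B": 8}
bm = BufferManager(disk)
bm.write("A", bm.read("A") * 2)      # A is doubled in memory only
disk_before_output = dict(disk)      # disk still holds the old value of A
bm.output("A")                       # now the change reaches disk
```

The point of the separation is that `write` changes only the in-memory copy; nothing is durable until `output` runs, which is exactly the window that logging has to cover.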
  8. 8. Failure Recovery Approaches<br />All of the approaches rely on logging – storing a log of changes to the database so it is possible to restore its state. They differ in<br />What information is logged,<br />The timing of when to force that information to stable storage,<br />What the procedure for recovery will be<br />The approaches are named after the recovery procedure<br />Undo Logging<br />The log contains enough information to detect if the transaction was committed and to roll back the state if it was not.<br />When recovering after a failure, walk back through the log and undo the effect of all txns that do not have a COMMIT entry in the log<br />Other approaches described later<br />
  9. 9. Undo Logging<br />When executing transactions<br />Write the log before writing transaction data and force it to disk<br />Make sure to preserve chronological order<br />The log contains enough information to detect if the transaction was committed and to roll back the state if it was not.<br />When restarting, <br />Walk back through the log and undo the effect of all uncommitted txns in the log.<br />Challenge: How far back do we need to look?<br />Answer: Until the last checkpoint<br />Define and implement checkpoints momentarily<br />
  10. 10. An Example Transaction<br />Initially<br />A = 8<br />B = 8<br />Transaction T1<br />A ← 2 × A<br />B ← 2 × B<br />Transaction T1:<br />Read (A,t); t ← t × 2<br />Write (A,t);<br />Read (B,t); t ← t × 2<br />Write (B,t);<br />Output (A);<br />failure! (after Output (A), before Output (B) completes)<br />Output (B);<br />State at Failure Point:<br />Memory:<br />A = 16<br />B = 16<br />Disk:<br />A = 16<br />B = 8<br />Undo Log Entries<br /><T1, start><br /><T1, A, 8><br /><T1, B, 8><br /><T1, Commit> (would have been written if the transaction had completed)<br />Do we have the info to restore?<br />
  11. 11. Execution with Undo Logging<br />Logging Rules:<br />All log records for a change must be forced to disk before the changed data is written to disk<br />If a transaction commits, the commit record must be written to disk after all data records have been written to disk<br />Recovery with Undo Logging<br />Consider all uncommitted transactions, starting with the most recent one and going backward. <br />Undo all actions of these transactions.<br />Why going backward, not forward?<br />Example: T1, T2 and T3 all write A<br />T1 executed before T2 before T3<br />T1 committed, T2 and T3 incomplete<br />Log, in time order: T1 write A, T2 write A, T3 write A, T1 commit, system failure<br />
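A minimal sketch of undo recovery, under stated assumptions: log records are modelled as Python tuples (`("START", T)`, `("UPDATE", T, X, old_value)`, `("COMMIT", T)`), the disk is a dictionary, and the function name `undo_recover` is hypothetical. Checkpoints are ignored, so the whole log is scanned.

```python
def undo_recover(log, disk):
    """Scan the log from newest to oldest and restore the old value of every
    element written by a transaction that has no COMMIT record."""
    committed = set()
    for record in reversed(log):
        if record[0] == "COMMIT":
            committed.add(record[1])
        elif record[0] == "UPDATE":
            _, txn, element, old_value = record
            if txn not in committed:
                disk[element] = old_value
    return disk


# The example transaction: T1 doubled A and B, but only A reached disk.
crash_log = [("START", "T1"), ("UPDATE", "T1", "A", 8), ("UPDATE", "T1", "B", 8)]
restored = undo_recover(crash_log, {"A": 16, "B": 8})

# T1, T2 and T3 all write A (illustrative values 1, 2, 3; A was 0 before T1).
three_writers = [
    ("START", "T1"), ("UPDATE", "T1", "A", 0),
    ("START", "T2"), ("UPDATE", "T2", "A", 1),
    ("START", "T3"), ("UPDATE", "T3", "A", 2),
    ("COMMIT", "T1"),
]
after_backward_scan = undo_recover(three_writers, {"A": 3})
```

In the second example the backward scan first puts back the value T3 overwrote (2), then the value T2 overwrote (1), leaving A at T1's committed value. A forward scan would apply those two restores in the opposite order and finish with T2's uncommitted value. Because the function only ever assigns logged old values, running it a second time changes nothing, which is the idempotence needed if recovery itself is interrupted.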
  12. 12. More on Undo Logging<br />Failure During Recovery<br />Recovery algorithm is idempotent<br />Just do it again!<br />How much of the log file needs to be processed?<br />In principle, we need to examine the entire log.<br />Checkpointing limits the part of the log that needs to be considered during recovery up to a certain point (checkpoint).<br />
  13. 13. Quiescent Checkpointing<br />Simple approach to introduce the concept<br />Pause the database<br />stop accepting new transactions,<br />wait until all current transactions commit or abort and have written the corresponding log records,<br />flush the log to disk,<br />write a <CKPT> log record and flush the log,<br />resume accepting new transactions.<br />Once we encounter a checkpoint record, we know that there are no incomplete transactions.<br />Do not need to go backward beyond checkpoint. <br />Can afford to throw away any part of the log prior to the checkpoint<br />Pausing the database may not be warranted for business reasons<br />
  14. 14. Non-quiescent Checkpointing<br />Main idea: Start- and End-Checkpoints to bracket unfinished txns<br />Write a <START CKPT (T1, T2, … Tk)> record into the log<br />T1, T2, … Tk are the unfinished txns<br />Wait till T1, T2, … Tk commit or abort, but allow other txns to begin<br />Write an <END CKPT> record into the log<br />Recovery method: scan the log backwards until a <CKPT> record is found<br />If <END…>, scan backwards to the previous <START…><br />No need to look any further<br />If <START…>, then crash must have occurred during checkpointing. <br />The START record tells us unfinished txns and <br />Scan back to the beginning of the oldest one of these.<br />
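The recovery rule above decides how far back the undo scan has to go. The sketch below computes that position for a log held as a Python list. The tuple shapes (`("START_CKPT", (T1, ..., Tk))`, `("END_CKPT",)`, `("START", T)`) and the function name are assumptions made for illustration.

```python
def undo_scan_start(log):
    """Return the index of the oldest log record that undo recovery must examine."""
    for i in range(len(log) - 1, -1, -1):
        kind = log[i][0]
        if kind == "END_CKPT":
            # Checkpoint completed: stop at the matching <START CKPT ...>.
            for j in range(i - 1, -1, -1):
                if log[j][0] == "START_CKPT":
                    return j
            raise ValueError("END_CKPT without a preceding START_CKPT")
        if kind == "START_CKPT":
            # Crash during checkpointing: go back to the oldest START record
            # among the transactions listed as unfinished.
            unfinished = set(log[i][1])
            starts = [j for j in range(i)
                      if log[j][0] == "START" and log[j][1] in unfinished]
            return min(starts, default=i)
    return 0  # no checkpoint at all: examine the entire log


complete = [
    ("START", "T1"), ("UPDATE", "T1", "A", 8), ("START", "T2"),
    ("START_CKPT", ("T1", "T2")), ("START", "T3"),
    ("COMMIT", "T1"), ("COMMIT", "T2"), ("END_CKPT",),
    ("UPDATE", "T3", "B", 8),
]
interrupted = complete[:6]   # crash before <END CKPT> was written

scan_from_complete = undo_scan_start(complete)
scan_from_interrupted = undo_scan_start(interrupted)
```

With the completed checkpoint the scan stops at the `START_CKPT` record (index 3). With the interrupted one it has to reach the start of T1 (index 0), the oldest transaction named in the `START_CKPT` record.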
  15. 15. Issues with Undo Logging<br />Bottlenecks on I/O<br />All log records must be forced to disk before any data is written back<br />All data records must be forced to disk before the COMMIT record is written back<br />An alternative: Redo Logging<br />Instead of scanning backward from the end<br />Undoing all transactions that were not completed<br />Scans the log forward<br />Reapplies the changes of all committed transactions that may not have reached disk<br />
  16. 16. Logging with Redo Logs<br />Creation of the Redo log<br />For every action, generate redo log record.<br /><T, X, v> has different meaning: v is the new value, not old<br />Flush log at commit.<br />All log records for the transaction that modified X (including commit) must be on disk before X is modified on disk<br />Write END log record after DB modifications have been written to disk.<br />Recovery algorithm. <br />Redo the modifications by committed transactions not yet flushed to disk.<br />S = set of txns with <Ti commit> and no <Ti end> in log<br />For each <Ti, X, v> in log, in forward order (from earliest to latest) do:<br />if Ti in S then<br />Write(X, v) <br />Output(X)<br />Write <Ti END><br />
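The recovery algorithm above translates almost line for line into Python. As before, this is a sketch under assumptions: records are tuples (`("UPDATE", T, X, new_value)`, `("COMMIT", T)`, `("END", T)`), the disk is a dictionary, and `redo_recover` is a hypothetical name. Checkpoints are not modelled.

```python
def redo_recover(log, disk):
    """Reapply, oldest first, every update of a transaction that has a COMMIT
    record but no END record, then log an END record for each of them."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    ended = {r[1] for r in log if r[0] == "END"}
    to_redo = committed - ended                 # the set S
    for record in log:                          # forward order
        if record[0] == "UPDATE" and record[1] in to_redo:
            _, _, element, new_value = record
            disk[element] = new_value           # Write(X, v) then Output(X)
    for txn in sorted(to_redo):
        log.append(("END", txn))
    return disk


redo_log = [
    ("START", "T1"), ("UPDATE", "T1", "A", 16), ("UPDATE", "T1", "B", 16),
    ("COMMIT", "T1"),
    ("START", "T2"), ("UPDATE", "T2", "C", 5),  # T2 never committed
]
# Under redo logging nothing reaches disk before commit, so the disk is stale.
recovered = redo_recover(redo_log, {"A": 8, "B": 8, "C": 1})
```

T2's update is simply ignored: the redo rule guarantees that none of its changes reached disk, so there is nothing to undo.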
  17. 17. Logging with Redo Logs<br />
  18. 18. Comments on Redo Logging<br />Checkpoint algorithms similar to those for Undo Logging<br />Quiescent as well as Non-quiescent algorithms<br />Issues with Redo Logging<br />Writing data back to disk is not allowed until transaction logs have been written out<br />Results in a large requirement for memory for buffer pool<br />A flaw in the checkpointing algorithms (textbook, p869)<br />Both undo and redo logs may put contradictory requirements on how buffers are handled during a checkpoint, unless the database elements are complete blocks or sets of blocks. <br />For instance, if a buffer contains one database element A that was changed by a committed transaction and another database element B that was changed in the same buffer by a transaction that has not yet had its COMMIT record written to disk, then we are required to copy the buffer to disk because of A but also forbidden to do so, because rule R1 applies to B.<br />
  19. 19. Undo/Redo Logging (p1)<br />Undo logging requires modifications to be written to disk as soon as a transaction finishes (before its COMMIT record is logged), leading to an unnecessarily large number of I/Os.<br />Redo logging requires all modified blocks to be kept in the buffer until the transaction commits and the log records have been flushed, increasing the buffer size requirement.<br />Undo/redo logging combines undo and redo logging. <br />It provides more flexibility in flushing modified blocks at the expense of maintaining more information in the log.<br />
  20. 20. Undo/Redo Logging (p2)<br />Main idea: The log can be used to reconstruct the data<br />Update records <T, X, new, old> record new and old value of X.<br />The only undo/redo logging rule is: <br />Log record must be flushed before corresponding modified block<br />Also known as write ahead logging.<br />Block of X can be flushed before or after T commits, i.e. before or after the COMMIT log record.<br />Flush the log at commit.<br />
  21. 21. Undo/Redo Logging (p3)<br />Because of the flexibility of flushing X before or after the COMMIT record, we can have uncommitted transactions with modifications on disk and committed transactions with modifications not yet on disk.<br />The undo/redo recovery policy is as follows:<br />Redo committed transactions.<br />Undo uncommitted transactions.<br />
  22. 22. Undo/Redo Logging Recovery<br />More details on the recovery procedure:<br />Backward pass <br />From end of log back to latest valid checkpoint, construct set S of committed transactions.<br />Undo actions of transactions not in S.<br />Forward pass<br />From latest checkpoint forward to end of log,<br />Or from the beginning of time, if there are no checkpoints<br />redo actions of transactions in S.<br />Alternatively, can also perform the redos before the undos. <br />
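A sketch of the two passes, assuming update records are tuples laid out like the slide's `<T, X, new, old>` and that there is no checkpoint, so both passes cover the whole log. The function name is hypothetical, and the set S is computed up front rather than during the backward pass, which gives the same result.

```python
def undo_redo_recover(log, disk):
    committed = {r[1] for r in log if r[0] == "COMMIT"}   # the set S

    # Backward pass: undo transactions not in S, newest update first.
    for record in reversed(log):
        if record[0] == "UPDATE" and record[1] not in committed:
            _, _, element, _new, old = record
            disk[element] = old

    # Forward pass: redo transactions in S, oldest update first.
    for record in log:
        if record[0] == "UPDATE" and record[1] in committed:
            _, _, element, new, _old = record
            disk[element] = new
    return disk


ur_log = [
    ("START", "T1"), ("UPDATE", "T1", "A", 16, 8),
    ("START", "T2"), ("UPDATE", "T2", "B", 20, 8),
    ("COMMIT", "T1"),
]
# At the crash, committed T1's change to A had not been flushed,
# while uncommitted T2's change to B had.
ur_recovered = undo_redo_recover(ur_log, {"A": 8, "B": 20})
```

The example hits both cases the flexible flushing rule allows: a committed change missing from disk (redone) and an uncommitted change present on disk (undone).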
  23. 23. Undo/Redo Checkpointing<br />Write "start checkpoint" listing all active transactions to log<br />Flush log to disk<br />Write to disk all dirty buffers (contain a changed DB element), whether or not transaction has committed<br />Implies nothing should be written (not even to memory buffers) until we are sure the transaction will not abort<br />Implies some log records may need to be written to disk (WAL)<br />Write "end checkpoint" to log<br />Flush log to disk<br />[Figure: log timeline with a start-checkpoint record listing the active transactions T1, T2, ... followed later by an end-checkpoint record]<br />
  24. 24. Protecting Against Media Failure<br />Logging protects from local loss of main memory and disk content, but not against global loss of secondary storage content (media failure).<br />To protect against media failures, employ archiving: maintaining a copy of the database on a separate, secure storage device.<br />Log also needs to be archived in the same manner.<br />Two levels of archiving:<br />full dump vs. incremental dump.<br />
  25. 25. Protecting Against Media Failure<br />Typically, database cannot be shut down for the period of time needed to make a backup copy (dump).<br />Need to perform nonquiescent archiving, i.e., create a dump while the DBMS continues to process transactions.<br />Goal is to make copy of database at time when the dump began, but transactions may change database content during the dumping.<br />Logging continues during the dumping, and discrepancies can be corrected from the log.<br />
  26. 26. Protecting Against Media Failure<br />We assume undo/redo (or redo) logging.<br />The archiving procedure is as follows: <br />Write a log record <START DUMP>. <br />Perform a checkpoint for the log. <br />Perform a (full / incremental) dump on the secure storage device. <br />Make sure that enough of the log has been copied to the secure storage device so that at least the log up to the checkpoint will survive media failure.<br />Write a log record <END DUMP>.<br />
  27. 27. Protecting Against Media Failure<br />After a media failure, we can restore the DB from the archived DB and archived log as follows: <br />Copy latest full dump (archive) back to DB. <br />Starting with the earliest ones, make the modifications recorded in the incremental dump(s) in increasing order of time. <br />Further modify DB using the archived log. <br />Use the recovery method corresponding to the chosen type of logging.<br />
  28. 28. Summary<br />Logging is an effective way to prepare for system failure<br />Transactions provide a useful building block on which to base log entries<br />Three types of logs<br />Undo Logs<br />Redo Logs<br />Undo/Redo logs<br />Only Undo/Redo logs are used in practice. Why?<br />Periodic checkpoints are necessary for keeping recovery times under control. Why?<br />Database Dumps (archives) protect against media failure<br />Great for making a “point in time” copy of the database.<br />
  29. 29. On the NoSQL Front…<br />Google Datastore<br />Recently (1/2011) added a “High Replication” option.<br />Replicates the datastore synchronously across multiple data centers<br />Does not use an append-only log<br />Has performance and size impact<br />CouchDB<br />Append-only log that’s actually a b-tree<br />No provision for deleting part of the log<br />Provision for ‘compacting the log’<br />MongoDB<br />Recently (12/2010) added a --journal option<br />Has performance impact, no measurements available<br />Common thread, tradeoff between performance and durability!<br />
  30. 30. CS 542 Database Management Systems<br />Concurrency Control<br />J Singh <br />April 4, 2011<br />
  31. 31. Concurrency Control<br />Goal: Preserving Data Integrity<br />Challenge: enforce ACID rules (while maintaining maximum traffic through the system)<br />Committed transactions leave the system in a consistent state<br />Rolled-back transactions behave as if they never happened!<br />Historical Note<br />Based on The Transaction Concept: Virtues and Limitations by Jim Gray, Tandem Computers, 1981<br />ACM Turing Award, 1998<br />
  32. 32. Transactions<br />Concurrent execution of user programs is essential for good DBMS performance.<br />Because disk accesses are frequent, and relatively slow, it is important to keep the cpu humming by working on several user programs concurrently.<br />A user’s program may carry out many operations on the data retrieved from the database, but the DBMS is only concerned about what data is read/written from/to the database.<br />A transaction is the DBMS’s abstract view of a user program: a sequence of reads and writes.<br />Referred to as a Schedule<br />Implemented by a Transaction Scheduler<br />
  33. 33. Scheduler<br />Scheduler takes read/write requests from transactions<br />Either executes them in buffers or delays them<br />Scheduler must avoid Isolation Anomalies<br />
  34. 34. Isolation Anomalies (p1)<br />READ UNCOMMITTED<br />Dirty Read – data of an uncommitted transaction visible to others<br />Sometimes called WR Conflict<br />UNREPEATABLE READ<br />Non-repeatable Read – some previously read data changes due to another transaction committing<br />Sometimes called RW Conflict<br />T1: R(A), W(A), R(B), W(B), C<br />T2: R(A), W(A), R(B), W(B), C<br />T1: R(A), W(A), C<br />T2: R(A), W(A), C<br />
  35. 35. Isolation Anomalies (p2)<br />Overwriting Uncommitted Data<br />Sometimes called WW Conflicts<br />We need a set of rules to prohibit such isolation anomalies<br />The rules place constraints on the actions of concurrent transactions<br />T1: W(A), W(B), C<br />T2: W(A), W(B), C<br />
  36. 36. Serial Schedules<br />Definition: A schedule is a list of actions, (i.e. reading, writing, aborting, committing), from a set of transactions.<br />A schedule is serial if its transactions are not interleaved<br />Serial schedules observe ACI properties<br />Schedule D is the set of 3 transactions T1, T2, T3. <br />T1 Reads and writes to object X<br />Then T2 Reads and writes to object Y<br />Then T3 Reads and writes to object Z. <br />D is an example of a serial schedule, because the 3 txns are not interleaved.<br />Shorthand:<br />R1(X), W1(X), R2(Y), W2(Y), R3(Z), W3(Z)<br />
  37. 37. Serializable Schedules<br />A serializable schedule is one that is equivalent to a serial schedule.<br />The Transaction Manager should defer some transactions if the current schedule is not serializable<br />The order of transactions in E is not the same as in D, <br />But E gives the same result.<br />Shorthand: <br />E = R1(X); R2(Y); R3(Z); W1(X); W2(Y); W3(Z);<br />
  38. 38. Serializability<br />Is G serializable?<br />Equivalent to the serial schedule <T1,T2><br />But not <T2,T1><br />G is conflict-serializable<br />Conflict equivalence: The schedules S1 and S2 are conflict-equivalent if the following conditions are satisfied:<br />Both schedules S1 and S2 involve the same set of transactions (including ordering of actions within each transaction).<br />The order of each pair of conflicting actions in S1 and S2 is the same<br />Conflict-serializability: A schedule is conflict-serializable when the schedule is conflict-equivalent to one or more serial schedules.<br />
  39. 39. Serializability of Schedule G<br />T1: R(A) W(B)<br />T2: R(A) W(A)<br />Precedence graph:<br />a node for each transaction<br />an arc from Ti to Tj if an action in Ti precedes and conflicts with an action in Tj.<br />[Figure: precedence graph with nodes T1 and T2]<br />T1 T2? R1(A) W1(B) R2(A) W2(A)? No conflicts<br />T2 T1? R2(A) W2(A) R1(A) W1(B)? Conflicts<br />Two actions conflict if<br />The actions belong to different transactions. <br />At least one of the actions is a write operation. <br />The actions access the same object (read or write). <br />Theorem: A schedule is conflict serializable if and only if its precedence graph is acyclic<br />
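The theorem gives a mechanical test: build the precedence graph and check it for cycles. The sketch below does that for schedules written as lists of `(transaction, action, object)` triples. The function names are hypothetical. The exact interleaving of G did not survive in this transcript, so `G` below is one assumed interleaving of the two transactions shown.

```python
def precedence_graph(schedule):
    """Arc (Ti, Tj) whenever an action of Ti precedes and conflicts with an
    action of Tj: different transactions, same object, at least one write."""
    arcs = set()
    for i, (ti, act_i, obj_i) in enumerate(schedule):
        for tj, act_j, obj_j in schedule[i + 1:]:
            if ti != tj and obj_i == obj_j and "W" in (act_i, act_j):
                arcs.add((ti, tj))
    return arcs


def is_conflict_serializable(schedule):
    arcs = precedence_graph(schedule)
    remaining = {t for t, _, _ in schedule}
    # Repeatedly peel off transactions with no incoming arc from the rest;
    # if none can be peeled while some remain, the graph has a cycle.
    while remaining:
        sources = {n for n in remaining
                   if not any(v == n and u in remaining for u, v in arcs)}
        if not sources:
            return False
        remaining -= sources
    return True


# Assumed interleaving of T1: R(A) W(B) and T2: R(A) W(A).
G = [("T1", "R", "A"), ("T2", "R", "A"), ("T2", "W", "A"), ("T1", "W", "B")]
# A classic lost-update interleaving, added for contrast (not from the slides).
lost_update = [("T1", "R", "A"), ("T2", "R", "A"),
               ("T1", "W", "A"), ("T2", "W", "A")]

g_arcs = precedence_graph(G)
g_ok = is_conflict_serializable(G)
lost_update_ok = is_conflict_serializable(lost_update)
```

For `G` the only arc is T1 to T2 (R1(A) precedes W2(A)), so the graph is acyclic and the only equivalent serial order is T1 then T2. The contrast schedule produces arcs in both directions and is rejected.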
  40. 40. Enforcing Serializable Schedules<br />Prevent cycles in the Precedence Graph, P(S), from occurring<br />Locking primitives:<br />Lock (exclusive): li(A)<br />Unlock: ui(A)<br />Make transactions consistent<br />Ti: pi(A) becomes Ti: li(A) pi(A) ui(A)<br />pi(A) is either a READ or a WRITE<br />Allow only one transaction to hold a lock on A at any time<br />Two-phase locking for transactions<br />Ti: li(A) … pi(A) … ui(A)<br />No unlocks in the first (locking) phase; no locks in the second (unlocking) phase<br />
  41. 41. Legal Schedules?<br />S1 = l1(A) l1(B) r1(A) w1(B) l2(B) u1(A) u1(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)<br />S2 = l1(A) r1(A) w1(B) u1(A) u1(B) l2(B) r2(B) w2(B) l3(B) r3(B) u3(B)<br />S3 = l1(A) r1(A) u1(A) l1(B) w1(B) u1(B) l2(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)<br />
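The slide poses these three schedules as an exercise and gives no answers. One way to work through them is to encode the three rules from the previous slide (consistent transactions, one lock holder at a time, two-phase locking) and run each schedule through them. The checker below is a sketch with hypothetical names, and its verdicts follow from those rules as encoded here, not from the source.

```python
import re
from collections import defaultdict


def check(schedule_text):
    """Return a list of (problem, transaction, object) tuples for a schedule
    written in the slide's shorthand, e.g. 'l1(A) r1(A) u1(A)'."""
    held = defaultdict(set)   # object -> transactions holding its lock
    has_unlocked = set()      # transactions that have released some lock
    problems = []
    for op, txn, obj in re.findall(r"([lurw])(\d+)\((\w+)\)", schedule_text):
        if op == "l":
            if held[obj] - {txn}:
                problems.append(("illegal", txn, obj))        # lock already held
            if txn in has_unlocked:
                problems.append(("not-2PL", txn, obj))        # lock after unlock
            held[obj].add(txn)
        elif op == "u":
            held[obj].discard(txn)
            has_unlocked.add(txn)
        elif txn not in held[obj]:
            problems.append(("inconsistent", txn, obj))       # access without lock
    for obj, holders in held.items():
        for txn in sorted(holders):
            problems.append(("no-unlock", txn, obj))          # lock never released
    return problems


S1 = "l1(A) l1(B) r1(A) w1(B) l2(B) u1(A) u1(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)"
S2 = "l1(A) r1(A) w1(B) u1(A) u1(B) l2(B) r2(B) w2(B) l3(B) r3(B) u3(B)"
S3 = "l1(A) r1(A) u1(A) l1(B) w1(B) u1(B) l2(B) r2(B) w2(B) u2(B) l3(B) r3(B) u3(B)"

findings = {name: check(text) for name, text in (("S1", S1), ("S2", S2), ("S3", S3))}
```

Under these rules S1 is illegal because T2 locks B while T1 still holds it. S2 has T1 writing B without a lock, T2 never unlocking B, and T3 locking B while T2 holds it. S3 is a legal schedule of consistent transactions, but T1 is not two-phase because it locks B after unlocking A.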
  42. 42. Locking Protocols for Serializable Schedules<br />Strict Two-phase Locking (Strict 2PL) Protocol:<br />Each transaction must obtain a S (shared) lock on object before reading, and an X (exclusive) lock on object before writing.<br />All locks held by a transaction are released when the transaction completes<br />Strict 2PL allows only serializable schedules<br />Additionally, it simplifies transaction aborts<br />(Non-strict) 2PL Variant: Release locks anytime, but cannot acquire locks after releasing any lock.<br />If a txn holds an X lock on an object, no other txn can get a lock (S or X) on that object.<br />(Non-strict) 2PL also allows only serializable schedules, but involves more complex abort processing<br />Why is “acquiring after releasing” disallowed? To guarantee serializability; holding locks until the transaction completes (Strict 2PL) is what avoids cascading aborts<br />More in a minute<br />
  43. 43. Executing Locking Protocols<br />Begin with a Serialized Schedule<br />We know it won’t deadlock<br />How do we know this?<br />Beyond this simple 2PL protocol, it is all a matter of improving performance and allowing more concurrency….<br />Shared locks<br />Increment locks<br />Multiple granularity<br />Other types of concurrency control mechanisms<br />
  44. 44. Lock Management<br />Lock and unlock requests are handled by the lock manager<br />Lock Table Entry<br />Number of transactions currently holding a lock<br />Type of lock held (shared or exclusive)<br />Pointer to queue of lock requests<br />Locking and unlocking operations<br />Atomic<br />Support upgrade: transaction that holds a shared lock can be upgraded to hold an exclusive lock<br />Any level of granularity can be locked<br />Database, table, block, tuple<br />Why is this necessary?<br />
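A lock table with the three fields listed above can be sketched as follows. This is a single-threaded illustration with hypothetical class and method names: a real lock manager must make `lock` and `unlock` atomic (for example by guarding the table with a mutex) and would block the caller rather than return `False`.

```python
from collections import deque


class LockTableEntry:
    def __init__(self):
        self.holders = set()   # transactions currently holding the lock
        self.mode = None       # "S" (shared), "X" (exclusive), None when free
        self.queue = deque()   # waiting requests as (txn, mode), oldest first


class LockManager:
    def __init__(self):
        self.table = {}        # object name -> LockTableEntry

    @staticmethod
    def _grantable(entry, txn, mode):
        others = entry.holders - {txn}
        if not others:
            return True        # lock is free, or txn is the sole holder (upgrade)
        return mode == "S" and entry.mode == "S"

    @staticmethod
    def _grant(entry, txn, mode):
        entry.holders.add(txn)
        entry.mode = "X" if "X" in (mode, entry.mode) else "S"

    def lock(self, txn, obj, mode):
        """Grant the lock and return True, or queue the request and return False."""
        entry = self.table.setdefault(obj, LockTableEntry())
        if not entry.queue and self._grantable(entry, txn, mode):
            self._grant(entry, txn, mode)
            return True
        entry.queue.append((txn, mode))   # wait behind earlier requests
        return False

    def unlock(self, txn, obj):
        """Release txn's lock and return the waiting transactions granted next."""
        entry = self.table[obj]
        entry.holders.discard(txn)
        if not entry.holders:
            entry.mode = None
        granted = []
        while entry.queue and self._grantable(entry, *entry.queue[0]):
            waiter, mode = entry.queue.popleft()
            self._grant(entry, waiter, mode)
            granted.append(waiter)
        return granted
```

A short usage example: two readers share A, a writer queues behind them, and a later reader queues behind the writer instead of overtaking it.

```python
lm = LockManager()
lm.lock("T1", "A", "S")      # granted
lm.lock("T2", "A", "S")      # granted, shared with T1
lm.lock("T3", "A", "X")      # queued
lm.lock("T4", "A", "S")      # queued behind T3
lm.unlock("T1", "A")         # T2 still holds A, nobody is granted
lm.unlock("T2", "A")         # T3 gets the exclusive lock
```

Queuing T4 behind T3 is a design choice made here to keep writers from starving. The slide does not say which policy the lock manager uses.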
  45. 45. Multiple-Granularity Locks<br />If a transaction needs to scan all records in a table, we don’t really want to have a lock on all tuples individually – significant locking overhead! <br />Put a single lock on the table<br />Hierarchy: Database contains Tables, which contain Pages, which contain Tuples<br />A lock on a node implicitly locks all descendants.<br />
  46. 46. Aborting a Transaction<br />If a transaction Ti is aborted, all its actions have to be undone. <br />If Tj reads an object last written by Ti, Tj must be aborted as well!<br />Most systems avoid such cascading aborts by releasing a transaction’s locks only at commit time.<br />If Ti writes an object, Tj can read this only after Ti commits.<br />In order to undo the actions of an aborted transaction, the DBMS maintains a log in which every write is recorded. <br />The same mechanism is used to recover from system crashes; all active txns at the time of the crash are aborted when the system recovers<br />
  47. 47. Performance Considerations (Again!)<br />2PL Protocol allows transactions to proceed with maximum parallelism<br />Locking algorithm only delays actions that would cause conflicts<br />But the locks are still a bottleneck<br />Need to ensure lowest-possible level of locking granularity<br />Classic memory-performance trade-off<br />Conflict-serialization is too conservative<br />But other methods of serialization are too complex<br />A use case that occurs quite often, should be optimized<br />Besides scanning through the table, if we need to modify a few tuples, what kind of lock to put on the table?<br />Have to be X (if we only have S or X).<br />But, blocks all other read requests!<br />Concurrency control is pessimistic and acquires/releases locks<br />Optimistic Concurrency Control<br />
  48. 48. Next Week<br />Intention Locks<br />Optimistic Concurrency Control<br />Distributed Commit<br />Please Read ahead of time<br />The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007<br />OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008<br />