CS 542 -- Concurrency Control, Distributed Commit
1. CS 542 Database Management Systems Concurrency Control Commit in Distributed Systems J Singh April 11, 2011
2. Today’s Meeting Concurrency Control: Intention Locks, Index Locking, Optimistic CC (Validation, Timestamp Ordering), Multi-version CC. Commit in Distributed Databases: Two Phase Commit, Paxos Algorithm. Concluding thoughts. References (aside from textbook): Concurrency Control and Recovery in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research. Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March 1998. Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004. OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc. ACM SIGMOD, 2008. The End of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007.
3. Scheduler Architecture for CC Scheduler has two parts Accepts read/write requests from transactions Assures serialization Keeps track of active and pending transactions Controls commit, abort, delay Today’s lecture discusses Part 2 functionality
4. The Lock Table A relation that associates database elements with locking information about that element Implemented as a hash table Size is proportional to the number of lock elements, not to the size of the entire database DB element A Lock information for A
5. Scheduler Priority Logic When a transaction releases a lock that other transactions are waiting for, what policy to use? First-Come-First-Served: Grant the lock to the longest waiting request. No starvation (waiting forever for lock) Priority to Shared Locks: Grant all S locks waiting, then one X lock. Grant X lock if no others waiting Priority to Upgrading: If there is a U lock waiting to upgrade to an X lock, grant that first. Each has its advantages and disadvantages Configurable for a database instance
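The lock table and the first-come-first-served policy can be sketched together. This is a minimal illustration, not a production design: the names `LockTable` and `LockEntry` are hypothetical, only S and X modes are modeled, and upgrades and re-entrant requests are ignored. A Python dict plays the role of the hash table, and entries exist only for elements that are currently locked.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LockEntry:
    """Lock information for one database element (hypothetical layout)."""
    mode: str                                      # mode currently granted: "S" or "X"
    holders: set = field(default_factory=set)      # transactions holding the lock
    waiting: deque = field(default_factory=deque)  # (txn, mode) requests, oldest first


class LockTable:
    def __init__(self):
        self._entries = {}  # hash table: element -> LockEntry

    def acquire(self, txn, element, mode):
        """Return True if the lock is granted now, False if txn must wait."""
        entry = self._entries.get(element)
        if entry is None:
            self._entries[element] = LockEntry(mode, {txn})
            return True
        # Share only if nobody is already queued, so waiters cannot starve.
        if mode == "S" and entry.mode == "S" and not entry.waiting:
            entry.holders.add(txn)
            return True
        entry.waiting.append((txn, mode))
        return False

    def release(self, txn, element):
        entry = self._entries[element]
        entry.holders.discard(txn)
        if entry.holders:
            return
        if not entry.waiting:
            del self._entries[element]  # no entry is kept for unlocked elements
            return
        # First-come-first-served: grant to the longest-waiting request,
        # plus any S requests queued directly behind an S request.
        nxt, mode = entry.waiting.popleft()
        entry.mode, entry.holders = mode, {nxt}
        while mode == "S" and entry.waiting and entry.waiting[0][1] == "S":
            entry.holders.add(entry.waiting.popleft()[0])

    def __len__(self):
        return len(self._entries)
```

The other two policies on the slide would change only the grant step inside `release`.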
7. Motivation for intention locks Suppose that, besides scanning through the table, we need to modify a few tuples. What kind of lock should we put on the table? It has to be X (if we only have S or X). But that blocks all other read requests!
8. Intention Locks Allow intention locks IS, IX. Before S locking an item, must IS lock the root. Before X locking an item, must IX lock the root. Should make sure: If Ti S locks a node, no Tj can X lock an ancestor. Achieved if S conflicts with IX. If Tj X locks a node, no Ti can S or X lock an ancestor. Achieved if X conflicts with IS and IX.
9. Allowed Lock Sharings (rows: lock holder; columns: lock requester; ✓ = the two locks can be held at the same time)
        IS   IX   S    SIX  X
  IS    ✓    ✓    ✓    ✓    -
  IX    ✓    ✓    -    -    -
  S     ✓    -    ✓    -    -
  SIX   ✓    -    -    -    -
  X     -    -    -    -    -
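The sharing rules can be written down as a lookup table. The sketch below assumes the standard multiple-granularity compatibility matrix; `COMPATIBLE` and `can_grant` are illustrative names, not part of any particular DBMS.

```python
# For each mode already held on a node, the modes another transaction
# may be granted on the same node at the same time.
COMPATIBLE = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}


def can_grant(requested, held_modes):
    """True if `requested` is compatible with every mode held by other txns."""
    return all(requested in COMPATIBLE[held] for held in held_modes)
```

For example, `can_grant("S", ["IX"])` is false, which is exactly the "S conflicts with IX" requirement from the intention-lock slide.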
10. Multiple Granularity Lock Protocol Each txn starts from the root of the hierarchy. To get a lock on any node, must hold an intention lock on its parent node! E.g. to get S lock on a node, must hold IS or IX on parent. E.g. to get X lock on a node, must hold IX or SIX on parent. Full table of rules: to get S or IS on a node, must hold IS or IX on its parent; to get X, IX or SIX on a node, must hold IX or SIX on its parent. Must release locks in bottom-up order.
11. Example 1 T1 needs a shared lock on t2. T2 needs a shared lock on R1. [Figure: relation R1 with tuples t1, t2, t3, t4; T1 holds IS on R1 and S on t2; T2 holds S on R1.]
13. Examples 3, 4, 5 T1 scans R, and updates a few tuples: T1 gets an SIX lock on R, and occasionally upgrades to X on the tuples. T2 uses an index to read only part of R: T2 gets an IS lock on R, and repeatedly gets an S lock on tuples of R. T3 reads all of R: T3 gets an S lock on R. OR, T3 could behave like T2; can use lock escalation as it goes.
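The protocol in these examples amounts to locking top-down along a path in the hierarchy. The sketch below is one way to express that, assuming the usual parent-lock requirements; `REQUIRED_ON_PARENT`, `may_lock` and `plan_locks` are hypothetical helper names.

```python
# Modes that must be held on the parent before a mode can be taken on a child.
REQUIRED_ON_PARENT = {
    "IS":  {"IS", "IX"},
    "S":   {"IS", "IX"},
    "IX":  {"IX", "SIX"},
    "SIX": {"IX", "SIX"},
    "X":   {"IX", "SIX"},
}


def may_lock(node_mode, parent_mode):
    """parent_mode is None when the node is the root of the hierarchy."""
    return parent_mode is None or parent_mode in REQUIRED_ON_PARENT[node_mode]


def plan_locks(path, target_mode):
    """Return (node, mode) pairs, root first, for locking the last node of `path`."""
    intention = "IS" if target_mode in ("S", "IS") else "IX"
    return [(node, intention) for node in path[:-1]] + [(path[-1], target_mode)]
```

With this, T2 in the example above would request `plan_locks(["R", "t"], "S")`, that is IS on R followed by S on the tuple. Release happens in the reverse order of the returned list.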
20. Did Insert/Delete expose a flaw in 2PL? The flaw was with the assumption that by locking all tuples, T1 had locked the set! We needed to lock the set. Would we bottleneck on the relation if the workload were insert- and delete-heavy? There is another way to solve the problem: Lock at the index (if one exists). Since B+ trees are not 100% full, we can maintain multiple locks in different sections of the tree. [Figure: an index annotated "Put a lock here" at r=1.]
21. Index Locking (p1) Higher levels of the tree only direct searches for leaf pages. For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course), only if a split can propagate up to it from the modified leaf. (Similar point holds w.r.t. deletes.) We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.
22. Index Locking (p2) Search: Start at root and go down; repeatedly, S lock child then unlock parent. Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe: If child is safe, release all locks on ancestors. Safe node: Node such that changes will not propagate up beyond this node. Inserts: Node is not full. Deletes: Node is not half-empty.
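The insert/delete rule above can be traced with a short sketch. It assumes each node is a dict with a `name` and a list of `keys`, that every node has the same `capacity`, and that "half-empty" means `capacity // 2` keys; these representations are illustrative only.

```python
def is_safe(node, op, capacity):
    """A node is safe if the change cannot propagate above it."""
    n = len(node["keys"])
    if op == "insert":
        return n < capacity       # not full, so no split can propagate up
    return n > capacity // 2      # delete: above half, so no merge can propagate up


def locks_held_for_update(path, op, capacity):
    """Walk root-to-leaf taking X locks; return the names still locked at the leaf."""
    held = []
    for node in path:
        held.append(node["name"])         # X-lock the child
        if is_safe(node, op, capacity):
            held = [node["name"]]         # child is safe: release all ancestors
    return held
```

If the leaf itself is safe, only the leaf stays locked; ancestors remain locked only as far up as a split or merge could reach.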
23. Example Where to lock? 1) Delete 38* 2) Insert 45* 3) Insert 25* [Figure: B+ tree with root A and nodes B through I; keys shown in the internal nodes: 20, 35, 38, 44, 23; leaf entries: 20*, 22*, 23*, 24*, 35*, 36*, 38*, 41*, 44*.]
25. Optimistic CC Locking is a conservative approach in which conflicts are prevented. Disadvantages: Lock management overhead. Deadlock detection/resolution. Not discussed in CS-542 lectures, expecting that you are familiar with it If conflicts are rare, we may be able to gain performance by not locking, and instead checking for conflicts before txns commit. Two approaches Kung-Robinson Model Divides every transaction into three phases: read, validate, write Makes commit/abort decision based on what’s being read and written Timestamp Ordering Algorithms Clever use of timestamps to determine which operations are conflict-free and which must be aborted
26. Kung-Robinson Model Key idea: Let transactions work in isolation. Validate reads and writes when ready to commit. Make Validation Atomic. Validated ≡ Committed. Transactions have three phases: READ: txns read from the database, make changes to private copies of objects. VALIDATE: Check if schedule so far is serializable. WRITE: Make local copies of changes public. [Figure: an old ROOT and a new ROOT, the new one pointing to the modified objects.]
27. Validation Test conditions that are sufficient to ensure that no conflict occurred. Each txn is assigned a numeric id. Just use a timestamp. Transaction ids assigned at end of READ phase, just before validation begins. ReadSet(Ti): Set of objects read by txn Ti. WriteSet(Ti): Set of objects modified by Ti. Validation is atomic Done in a critical section
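28. Validation Tests For every Ti that validated before Tj, at least one of the following tests must hold:
Test 1: FIN(Ti) < START(Tj). Situation: Ti completes all three phases (R, V, W) before Tj starts.
Test 2: FIN(Ti) < VAL(Tj) AND WriteSet(Ti) ∩ ReadSet(Tj) is empty. Situation: Ti finishes its write phase before Tj validates.
Test 3: VAL(Ti) < VAL(Tj) AND WriteSet(Ti) ∩ ReadSet(Tj) is empty AND WriteSet(Ti) ∩ WriteSet(Tj) is empty. Situation: Ti validates before Tj, and the rest of their phases may overlap.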
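The three tests translate almost directly into code. The sketch below follows the conditions exactly as written on this slide; the `Txn` record and the function names are hypothetical, and times are plain integers.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    start: int                          # START(T)
    val: int                            # VAL(T): when validation begins
    fin: float = float("inf")           # FIN(T); infinity if not finished yet
    read_set: frozenset = frozenset()
    write_set: frozenset = frozenset()


def passes(ti, tj):
    """True if the pair (Ti, Tj) satisfies at least one validation test."""
    if ti.fin < tj.start:                                       # Test 1
        return True
    if ti.fin < tj.val and not (ti.write_set & tj.read_set):    # Test 2
        return True
    return (ti.val < tj.val                                     # Test 3
            and not (ti.write_set & tj.read_set)
            and not (ti.write_set & tj.write_set))


def validate(tj, earlier):
    """Tj may commit only if it passes against every earlier-validated txn."""
    return all(passes(ti, tj) for ti in earlier)
```

In a real scheduler `validate` would run inside the critical section mentioned on the previous slide.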
29. Overheads in Kung-Robinson CC Must record read/write activity in ReadSet and WriteSet per txn. Must create and destroy these sets as needed. Must check for conflicts during validation, and must make validated writes “global”. Critical section can reduce concurrency. Scheme for making writes global can reduce clustering of objects. Optimistic CC restarts transactions that fail validation. Work done so far is wasted; requires clean-up.
31. Timestamp Ordering CC Main idea: Put a timestamp on the last read and write action on every object. Use this timestamp to detect if a transaction attempts an illegal operation. Abort the offending transaction if it does. Algorithm: Give each object a read-timestamp (RTS) and a write-timestamp (WTS). Give each txn a timestamp (TS) when it begins. If action ai of txn Ti conflicts with action aj of txn Tj, and TS(Ti) < TS(Tj), then ai must occur before aj. Otherwise, restart the violating txn.
32. Rules for Timestamps-Based Scheduling Algorithm setup: RT(X): the read time of X, the highest timestamp of a transaction that has read X. WT(X): the write time of X, the highest timestamp of a transaction that has written X. C(X): the commit bit for X, which is true if and only if the most recent transaction to write X has already committed. Scheduler receives a request from T to operate on X. The request is realizable under some conditions and not under others.
33. Physically Unrealizable: Read too late A transaction U that started after transaction T wrote a value for X before T reads X. In other words, if TS(T) < WT(X), then the read is physically unrealizable, and T must be rolled back. [Timeline: T start, U start, U writes X, T reads X.]
34. Physically Unrealizable: Write too late A transaction U that started after T read X before T got a chance to write X. In other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back. [Timeline: T start, U start, U reads X, T writes X.]
35. Dirty Read After T reads the value of X written by U, U could abort. In other words, if TS(T) ≥ WT(X), then the read is physically realizable, but the value in X may not be committed yet. If C(X) is true, then the previous writer of X is committed, all is good. If C(X) is false, we must delay T. [Timeline: U start, T start, U writes X, T reads X, U aborts.]
36. Write after Write T tries to write X after a later transaction (U) has written it. In other words, if TS(T) ≥ RT(X) but TS(T) < WT(X), then the write is physically realizable, but there is already a later value in X. OK to ignore the write by T because it will get overwritten anyway. Except if U aborts: then the new value of T is lost forever. Solve this problem by introducing the concept of a “tentative write”: if C(X) is false, delay T until U commits or aborts. [Timeline: T start, U start, U writes X, T writes X, T commit, U abort.]
37. Rules for Timestamps-based Scheduling Scheduler receives a request to commit T. It must find all the database elements X written by T and set C(X)=true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed. Scheduler receives a request to abort T or decides to rollback T, Any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
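The read, write and commit rules can be collected into one small sketch. It assumes the usual textbook formulation (read too late when TS(T) < WT(X), write too late when TS(T) < RT(X)); the `Element` record and the string results are illustrative, and abort handling is left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    rt: int = 0              # RT(X): highest timestamp that has read X
    wt: int = 0              # WT(X): highest timestamp that has written X
    committed: bool = True   # C(X): has the most recent writer committed?


def read(ts, x):
    if ts < x.wt:
        return "rollback"                  # read too late
    if not x.committed and ts != x.wt:
        return "delay"                     # would be a dirty read
    x.rt = max(x.rt, ts)
    return "grant"


def write(ts, x):
    if ts < x.rt:
        return "rollback"                  # write too late
    if ts < x.wt:
        # A later value is already in X: skip the write if it is committed,
        # otherwise wait to see whether the later writer aborts.
        return "ignore" if x.committed else "delay"
    x.wt, x.committed = ts, False          # tentative write
    return "grant"


def commit(x):
    x.committed = True                     # waiting transactions may now proceed
```

A delayed request is simply retried after the writer commits or aborts, which matches the "repeat its attempt" rule on this slide.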
39. Multiversion Timestamps Multiversion schemes keep old versions of data item to increase concurrency. Each successful write results in the creation of a new version of the data item written. Use timestamps to label versions. When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.
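Version selection on read can be sketched as a sorted list of write timestamps searched with `bisect`. This is a simplified illustration: `VersionedItem` is a hypothetical name, timestamps are assumed to be non-negative integers, and the checks that would reject a late write are omitted.

```python
import bisect


class VersionedItem:
    def __init__(self, initial):
        self._ts = [0]            # write timestamps, kept sorted
        self._vals = [initial]    # value of each version, same order

    def write(self, ts, value):
        """Each successful write creates a new version labeled with ts."""
        i = bisect.bisect_right(self._ts, ts)
        self._ts.insert(i, ts)
        self._vals.insert(i, value)

    def read(self, ts):
        """Return the version with the largest write timestamp <= ts."""
        i = bisect.bisect_right(self._ts, ts) - 1
        return self._vals[i]
```

A reader with timestamp 15 sees the version written at 10 even after a version at 20 exists, which is why readers never have to wait.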
40. Timestamps vs Locking Generally, timestamping performs better than locking in situations where: Most transactions are read-only. It is rare that concurrent transaction will try to read and write the same element. This is generally the case for Web Applications In high-conflict situation, locking performs better than timestamps
41. Practical Use 2-Phase Locks (or variants): used by most relational databases. Multi-level granularity: support for table, page and tuple-level locks; used by most relational databases. Multi-version concurrency control: Oracle 8 forward divides transactions into read-only and read-write; read-only transactions use multi-version concurrency and never wait; read-write transactions use 2PL. Postgres, others as well, offer some level of MVCC.
43. Distributed Commit Motivation FruitCo has Its main Sales office in Oregon Farms and Warehouse are in Washington Finance is in Utah All three sites have local data centers with their own systems When an order is placed, the Sales system must send the billing information to Utah and shipping information to Washington. When an order is placed, all three databases must be updated, or none should be.
45. Two-Phase Commit (2PC) Phase 1 : The TM gets the RMs ready to write the results into the database Phase 2 : Everybody writes the results into the database TM :The process at the site where the transaction originates and which controls the execution RM :The process at the other sites that participate in executing the transaction Global Commit Rule: The TM aborts a transaction if and only if at least one RM votes to abort it. The TM commits a transaction if and only if all of the RMs vote to commit it.
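46. Centralized 2PC [Figure: a coordinator (C) and participants (P). Phase 1: C asks each P "ready?" and each P answers yes/no. Phase 2: C sends commit/abort to each P and each P replies committed/aborted.]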
48. When TM Fails… Timeout in INITIAL: who cares. Timeout in WAIT: cannot unilaterally commit; can unilaterally abort. Timeout in ABORT or COMMIT: stay blocked and wait for the acks. [Figure: TM state diagram. INITIAL: on the commit command, send Prepare and go to WAIT. WAIT: on a Vote-abort, send Global-abort and go to ABORT; on Vote-commit, send Global-commit and go to COMMIT.]
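49. When an RM Fails… Timeout in INITIAL: TM must have failed in INITIAL state; unilaterally abort. Timeout in READY: stay blocked. [Figure: RM state diagram. INITIAL: on Prepare, either send Vote-abort and go to ABORT, or send Vote-commit and go to READY. READY: on Global-abort, send Ack and go to ABORT; on Global-commit, send Ack and go to COMMIT.]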
50. When TM Recovers… Failure in INITIAL: start the commit process upon recovery. Failure in WAIT: restart the commit process upon recovery. Failure in ABORT or COMMIT: nothing special if all the acks have been received; otherwise the termination protocol is involved. [Figure: TM state diagram with states INITIAL, WAIT, ABORT, COMMIT.]
51. When an RM Recovers… Failure in INITIAL: unilaterally abort upon recovery. Failure in READY: the TM has been informed about the local decision; treat as timeout in READY state and invoke the termination protocol. Failure in ABORT or COMMIT: nothing special needs to be done. [Figure: RM state diagram with states INITIAL, READY, ABORT, COMMIT.]
52. 2PC Protocol Actions
TM, INITIAL: write begin_commit in log; send PREPARE; go to WAIT.
RM, INITIAL: on PREPARE, ready to commit? No: write abort in log; send VOTE-ABORT; go to ABORT. Yes: write ready in log; send VOTE-COMMIT; go to READY.
TM, WAIT: any No vote? Yes: write abort in log; send GLOBAL-ABORT; go to ABORT. No: write commit in log; send GLOBAL-COMMIT; go to COMMIT.
RM, READY: type of msg? Abort: write abort in log; send ACK; go to ABORT. Commit: write commit in log; send ACK; go to COMMIT.
TM, ABORT or COMMIT: on ACK, write end_of_transaction in log.
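The failure-free path through these actions can be simulated in a few lines. The sketch below is an illustration only: `two_phase_commit` is a hypothetical function, messages are modeled as plain function steps, and no timeouts, failures or recovery are handled.

```python
def two_phase_commit(votes):
    """Run one failure-free 2PC round.

    votes maps an RM name to True (VOTE-COMMIT) or False (VOTE-ABORT).
    Returns (decision, tm_log, rm_logs).
    """
    tm_log = ["begin_commit"]                 # TM: INITIAL -> WAIT, PREPARE sent
    rm_logs = {}

    # Phase 1: each RM logs its vote before replying.
    for rm, ready in votes.items():
        rm_logs[rm] = ["ready"] if ready else ["abort"]

    # Global commit rule: commit only if every RM voted to commit.
    decision = "commit" if all(votes.values()) else "abort"
    tm_log.append(decision)

    # Phase 2: RMs in READY log the global decision and acknowledge.
    # RMs that voted abort have already aborted on their own.
    for rm, ready in votes.items():
        if ready:
            rm_logs[rm].append(decision)

    tm_log.append("end_of_transaction")       # all ACKs received
    return decision, tm_log, rm_logs
```

For the FruitCo example, `two_phase_commit({"utah": True, "washington": False})` yields an abort, so none of the three databases is updated.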
53. Two-phase commit commentary Two-phase commit protocol limitation: it is a blocking protocol. The failure of the TM can cause the protocol to block until the TM is repaired. If the TM fails right after every RM has sent a Prepared message, then the other RMs have no way of knowing whether the TM committed or aborted. RMs will block resource processes while waiting for a message from the TM. A TM will also block resources while waiting for replies from RMs. A TM can also block indefinitely if no acknowledgement is received from the RM. “Federated” two-phase commit protocols, aka three-phase protocols, have been proposed but are still unproven. Paxos Consensus Algorithm. Consensus on Transaction Commit, Jim Gray and Leslie Lamport, Microsoft Research, 2005, MSR-TR-2003-96
55. Fault-Tolerant Two Phase Commit If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit). [Figure: message flow among client, TMs and RMs: RequestCommit, Prepare, Prepared.]
56. Fault-Tolerant Two Phase Commit If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit). The complexity is a mess. But… What if…? [Figure: message flow among client, several TMs and RMs, in which one TM sends commit while another sends abort. Inconsistent! Now what?]
57. Fault Tolerant 2PC Several workarounds proposed in database community: Often called "3-phase" or "non-blocking" commit. None with complete algorithm and correctness proof.
58. Consensus: collects proposed values; picks one proposed value; remembers it forever. [Figure: clients send "Propose X" and "Propose W" to a consensus box; every client is told "W Chosen".]
59. Consensus for Commit – The Obvious Approach Get consensus on TM’s decision. TM just learns consensus value. TM is “stateless”. [Figure: message flow among client, TMs, RMs and one consensus box: RequestCommit, Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.]
60. Consensus for Commit – The Paxos Commit Approach Get consensus on each RM’s choice. TM just combines consensus values. TM is “stateless”. [Figure: message flow among client, TMs, RMs and one consensus box per RM: RequestCommit, Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit.]
62. Consensus in Action The normal (failure-free) case: two message delays; can optimize. [Figure: an RM, a consensus box of acceptors, and TMs exchanging the messages "Propose RM Prepared", "Vote RM Prepared" and "RM Prepared Chosen".]
63. Consensus in Action TM can always learn what was chosen, or get Aborted chosen if nothing has been chosen yet, provided a majority of acceptors is working. [Figure: an RM, a consensus box of acceptors, and TMs.]
64. The Complete Algorithm Subtle. More weird cases than most people imagine. Proved correct.
65. PaxosCommit in a Nutshell N RMs; 2F+1 acceptors (~2F+1 TMs). If F+1 acceptors see all RMs prepared, then the transaction is committed. Cost: 2F(N+1) + 3N + 1 messages, 5 message delays, 2 stable write delays. [Figure: message flow among Client, TM, RM1…N and Acceptors 0…2F: request commit, prepare, prepared, all prepared, commit.]
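The commit condition and the message count quoted on this slide can be checked with a small sketch. It is not the Paxos Commit algorithm itself: ballots, leader election and the abort path are omitted, and `paxos_commit_decided` and `message_count` are hypothetical names. Each acceptor is modeled only by the set of RMs it has seen prepared, with `None` standing for a failed acceptor.

```python
def paxos_commit_decided(acceptor_views, n_rms, f):
    """True if at least F+1 of the 2F+1 acceptors have seen every RM prepared."""
    if len(acceptor_views) != 2 * f + 1:
        raise ValueError("expected 2F+1 acceptors")
    all_rms = set(range(1, n_rms + 1))
    complete = sum(1 for view in acceptor_views
                   if view is not None and view >= all_rms)
    return complete >= f + 1


def message_count(n_rms, f):
    """Message count as stated on the slide: 2F(N+1) + 3N + 1."""
    return 2 * f * (n_rms + 1) + 3 * n_rms + 1
```

With F = 1 and two RMs, the transaction commits even if one of the three acceptors has failed, as long as the other two have seen both RMs prepared.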
69. OLTP Through the Looking Glass (p2) Areas for research that may yield dividends. Concurrency Control: look for applications where it can be turned off; some sort of optimistic concurrency control. Multi-core Support: latching (inter-thread communication) remains a significant bottleneck; cache-conscious B-Trees. Replication Management: loss of transactional consistency if log shipping; recovery is not instantaneous; maintaining transactional consistency. Weak Consistency: Starbucks doesn’t need two phase commit; how to achieve eventual consistency without transactional consistency.
71. What’s so fun about databases? From our January 13 Lecture… Traditional database courses talked about Employee records Bank records Now we talk about Web search Data mining The collective intelligence of tweets Scientific and medical databases From a personal viewpoint, I have enjoyed learning this material with you Thank you.
72. About CS 542 CS 542 will Build on database concepts you already know Provide you tools for separating hype from reality Help you develop skills in evaluating the tradeoffs involved in using and/or creating a database CS 542 may Train you to read technical journals and apply them CS 542 will not Cover the intricacies of SQL programming Spend much effort in Dynamic SQL Stored Procedures Interfaces with application programming languages Connectors, e.g., JDBC, ODBC From our January 13 Lecture…
73. Thanks Contact Information: President, Early Stage IT – a cloud-based consulting firm Email: J [dot] Singh [at] EarlyStageIT [dot] com Phone: 978-760-2055 Co-chair of Software and Services SIG at TiE-Boston Founder, SQLnix.org, a local resource for NoSQL databases My WPI email will be good through the summer.