RM<br />Consensus box<br />acceptor<br />TM<br />acceptor<br />TM<br />TM<br />acceptor<br />Consensus in Action<br />TM c...
The Complete Algorithm<br />Subtle.<br />More weird cases than most people imagine.<br />Proved correct.<br />
PaxosCommit in a Nutshell<br />Acceptors<br />0…2F<br />Client<br />   TM<br />RM1…N<br />request<br />commit<br />prepare...
Paxos Commit Evaluation<br />Two-Phase Commit<br />3N+1 messages<br />N+1 stable writes<br />4 message delays<br />2 stabl...
Today’s Meeting<br />Concurrency Control<br />Intention Locks<br />Index Locking<br />Optimistic CC<br />Validation<br />T...
OLTP Through the Looking Glass (p1)<br />Workload<br />TPC-C Benchmark<br />Quote:<br />Overall, we identify overheads and...
OLTP Through the Looking Glass (p2)<br />Concurrency Control<br />Look for applications where it can be turned off<br />So...
End of an Era?<br />The Relational Model is not necessarily the answer<br />It was excellent for data processing<br />Not ...
What’s so fun about databases?<br />From our January 13 Lecture…<br />Traditional database courses talked about<br />Emplo...
About CS 542<br />CS 542 will<br />Build on database concepts you already know<br />Provide you tools for separating hype ...
Thanks<br />Contact Information:<br />President, Early Stage IT – a cloud-based consulting firm<br />Email: J [dot] Singh ...
CS 542 -- Concurrency Control, Distributed Commit
  1. CS 542 Database Management Systems
     Concurrency Control
     Commit in Distributed Systems
     J Singh
     April 11, 2011
  2. Today’s Meeting
     Concurrency Control
       Intention Locks
       Index Locking
       Optimistic CC
         Validation
         Timestamp Ordering
       Multi-version CC
     Commit in Distributed Databases
       Two Phase Commit
       Paxos Algorithm
     Concluding thoughts
     References (aside from textbook):
       Concurrency Control and Recovery in Database Systems, Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman, Microsoft Research.
       Concurrency Control: Methods, Performance, and Analysis, Alexander Thomasian, ACM Computing Surveys, March 1998.
       Paxos Commit, Gray & Lamport, Microsoft Research TechFest, 2004.
       OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc. ACM SIGMOD, 2008.
       The End of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007.
  3. Motivation for intention locks
     Suppose that, besides scanning the table, we need to modify a few tuples. What kind of lock should we put on the table?
     It has to be X (if we only have S and X).
     But that blocks all other read requests!
  4. Intention Locks
     Allow intention locks IS and IX.
     Before S locking an item, must IS lock the root.
     Before X locking an item, must IX lock the root.
     Should make sure:
       If Ti S locks a node, no Tj can X lock an ancestor.
         Achieved if S conflicts with IX.
       If Tj X locks a node, no Ti can S or X lock an ancestor.
         Achieved if X conflicts with IS and IX.
  5. Allowed Lock Sharings
     Rows: lock holder. Columns: lock requester. "yes" means the two locks can be held together.

              IS    IX    S     SIX   X
       IS     yes   yes   yes   yes   no
       IX     yes   yes   no    no    no
       S      yes   no    yes   no    no
       SIX    yes   no    no    no    no
       X      no    no    no    no    no
  6. Multiple Granularity Lock Protocol
     Each txn starts from the root of the hierarchy.
     To get a lock on any node, must hold an intention lock on its parent node!
       E.g. to get S lock on a node, must hold IS or IX on parent.
       E.g. to get X lock on a node, must hold IX or SIX on parent.
     Full table of rules: (table not captured in this transcript)
     Must release locks in bottom-up order.
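The compatibility matrix and the parent-lock rule can be sketched as two lookup tables. This is a minimal illustration, not any particular lock manager: the names COMPATIBLE, REQUIRED_ON_PARENT, can_grant and parent_ok are made up here, and the parent requirements for IS, IX and SIX are the usual textbook ones, which the slide does not spell out.

```python
# Which requested modes are compatible with a lock already HELD in a given mode.
COMPATIBLE = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

# Mode that must be held on the parent before requesting a mode on a child.
# The S and X entries come from the slide; the others are assumed textbook rules.
REQUIRED_ON_PARENT = {
    "IS": {"IS", "IX"},
    "S": {"IS", "IX"},
    "IX": {"IX", "SIX"},
    "SIX": {"IX", "SIX"},
    "X": {"IX", "SIX"},
}


def can_grant(requested, held_by_others):
    """True if `requested` is compatible with every mode other txns hold on the node."""
    return all(requested in COMPATIBLE[held] for held in held_by_others)


def parent_ok(requested, parent_mode):
    """True if holding `parent_mode` on the parent permits `requested` on a child."""
    return parent_mode in REQUIRED_ON_PARENT[requested]


print(can_grant("IS", ["IX", "S"]))  # IS is compatible with both IX and S
print(can_grant("S", ["IX"]))        # S conflicts with IX
print(parent_ok("X", "IS"))          # X on a child needs IX or SIX on the parent
```

Keeping the matrix as data rather than as branching code makes it easy to check against the table on the slide row by row.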
  7. Example 1
     T1 needs a shared lock on t2.
     T2 needs a shared lock on R1.
     Locks on the hierarchy (R1 with tuples t1, t2, t3, t4):
       R1: T1(IS), T2(S)
       t2: T1(S)
  8. Example 2
     T1 needs a shared lock on t2.
     T2 needs an exclusive lock on t4.
     Locks on the hierarchy (R1 with tuples t1, t2, t3, t4):
       R1: T1(IS), T2(IX)
       t2: T1(S)
       t4: T2(X)
     No conflict.
  9. Examples 3, 4, 5
     T1 scans R, and updates a few tuples:
       T1 gets an SIX lock on R, and occasionally upgrades to X on the tuples.
     T2 uses an index to read only part of R:
       T2 gets an IS lock on R, and repeatedly gets an S lock on tuples of R.
     T3 reads all of R:
       T3 gets an S lock on R.
       OR, T3 could behave like T2; can use lock escalation as it goes.
     Lock compatibility (rows: lock holder, columns: lock requester):

              IS    IX    S     SIX   X
       IS     yes   yes   yes   yes   no
       IX     yes   yes   no    no    no
       S      yes   no    yes   no    no
       SIX    yes   no    no    no    no
       X      no    no    no    no    no
  10. Insert and Delete
      Transactions
        T1:
          SELECT MAX(Price) WHERE Rating = 1;
          SELECT MAX(Price) WHERE Rating = 2;
        T2:
          INSERT <Apple, Arkansas Black, 1, 96>;
          DELETE WHERE Rating = 2
            AND Price = (SELECT MAX(Price) WHERE Rating = 2);
      Execution
        T1 locks all records w/Rating=1 and gets 80.
        T2 inserts <Arkansas Black, 96>
        T2 deletes <Fuji, 75>
        T1 locks all records w/Rating=2 and gets 65.
      Result:
        From T1: 80, 65
        Actual: 96, 65
        T1 then T2: 80, 75
        T2 then T1: 96, 65
  14. Insert and Delete Rules
      When T1 inserts t1 into R,
        Give X lock on t1 to T1
      When T2 deletes t2 from R,
        It must obtain an X lock on t2
        This will fix the Fuji delete problem (how so?)
      But there is still a problem: Phantom Reads.
        Seen with Arkansas Black in the example
      Solution: use multiple granularity tree
        Before inserting Q, obtain an X lock for parent(Q)
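The anomaly in the Insert and Delete example can be replayed with a few lines of Python. This is only a sketch: rows are (variety, rating, price) tuples, and the varieties "Gala" and "Rome" are hypothetical stand-ins chosen so that the maxima match the numbers on the slide. No locking is modelled; the point is just to compare the interleaved result with the two serial orders.

```python
def max_price(rows, rating):
    """SELECT MAX(Price) WHERE Rating = rating."""
    return max(price for (_variety, r, price) in rows if r == rating)


def run_t2(rows):
    """T2: insert Arkansas Black, then delete the top-priced Rating=2 row."""
    rows = rows + [("Arkansas Black", 1, 96)]
    top = max_price(rows, 2)
    return [row for row in rows if not (row[1] == 2 and row[2] == top)]


# "Gala" and "Rome" are hypothetical rows; Fuji at 75 is from the slide.
initial = [("Gala", 1, 80), ("Fuji", 2, 75), ("Rome", 2, 65)]
after_t2 = run_t2(initial)

# Serial schedules.
t1_then_t2 = (max_price(initial, 1), max_price(initial, 2))
t2_then_t1 = (max_price(after_t2, 1), max_price(after_t2, 2))

# Interleaving from the slide: T1 reads Rating=1, T2 runs, T1 reads Rating=2.
interleaved = (max_price(initial, 1), max_price(after_t2, 2))

print(t1_then_t2, t2_then_t1, interleaved)
```

The interleaved pair matches neither serial order, which is why locking only the tuples that existed when T1 started is not enough.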
  16. Did Insert/Delete expose a flaw in 2PL?
      The flaw was with the assumption that by locking all tuples, T1 had locked the set!
        We needed to lock the set
        Would we bottleneck on the relation if the workload were insert- and delete-heavy?
      There is another way to solve the problem:
        Lock at the index (if one exists)
        Since B+ trees are not 100% full, we can maintain multiple locks in different sections of the tree.
      Figure: an index, with the lock placed on the index entry for r=1.
  17. Index Locking (p1)
      Higher levels of the tree only direct searches for leaf pages.
      For inserts, a node on a path from root to modified leaf must be locked (in X mode, of course) only if a split can propagate up to it from the modified leaf. (A similar point holds w.r.t. deletes.)
      We can exploit these observations to design efficient locking protocols that guarantee serializability even though they violate 2PL.
  18. Index Locking (p2)
      Search: Start at root and go down; repeatedly, S lock child then unlock parent.
      Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe:
        If child is safe, release all locks on ancestors.
      Safe node: Node such that changes will not propagate up beyond this node.
        Inserts: Node is not full.
        Deletes: Node is not half-empty, i.e. it is more than half full, so removing an entry cannot cause an underflow.
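The insert/delete descent can be sketched as follows. This is an illustrative model only: Node, is_safe and locks_held_after_descent are hypothetical names, nodes carry just an entry count and a capacity, and the minimum occupancy is assumed to be capacity // 2. It shows which X locks are still held once the leaf is reached.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    keys: int      # entries currently stored in the node
    capacity: int  # maximum number of entries


def is_safe(node, op):
    """A node is safe if the operation cannot propagate a change above it."""
    if op == "insert":
        return node.keys < node.capacity        # not full: no split
    if op == "delete":
        return node.keys > node.capacity // 2   # more than half full: no underflow
    raise ValueError(f"unknown operation: {op}")


def locks_held_after_descent(path, op):
    """Walk root-to-leaf, X-locking each node; drop ancestor locks at safe nodes."""
    held = []
    for node in path:
        held.append(node.name)       # X-lock the child
        if is_safe(node, op):
            held = [node.name]       # child is safe: release all ancestors
    return held


# Hypothetical root-to-leaf path; C and the leaf D are full.
path = [Node("A", 2, 4), Node("B", 3, 4), Node("C", 4, 4), Node("D", 4, 4)]
print(locks_held_after_descent(path, "insert"))
print(locks_held_after_descent(path, "delete"))
```

For the insert, B is the deepest safe node, so locks on B, C and D are kept because a split of D could propagate up to B. For the delete, every node from B down is safe and only the leaf stays locked.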
  19. Example
      Where to lock?
        1) Delete 38*
        2) Insert 45*
        3) Insert 25*
      Figure: B+ tree rooted at A (key 20), with inner nodes B (key 35), C and F (keys 38, 44 and 23) and leaf nodes D, E, G, H, I holding the entries 20*, 22*, 23*, 24*, 35*, 36*, 38*, 41*, 44*.
  21. Optimistic CC
      Locking is a conservative approach in which conflicts are prevented. Disadvantages:
        Lock management overhead.
        Deadlock detection/resolution.
          Not discussed in CS-542 lectures, expecting that you are familiar with it
      If conflicts are rare, we may be able to gain performance by not locking, and instead checking for conflicts before txns commit.
      Two approaches
        Kung-Robinson Model
          Divides every transaction into three phases: read, validate, write
          Makes commit/abort decision based on what is being read and written
        Timestamp Ordering Algorithms
          Clever use of timestamps to determine which operations are conflict-free and which must be aborted
  22. Kung-Robinson Model
      Key idea:
        Let transactions work in isolation
        Validate reads and writes when ready to commit
        Make Validation Atomic
        Validated ≡ Committed
      Transactions have three phases:
        READ:
          txns read from the database,
          make changes to private copies of objects.
        VALIDATE:
          Check if schedule so far is serializable.
        WRITE:
          Make local copies of changes public.
      Figure: ROOT pointer switching from the old objects to the new (modified) objects.
  23. Validation
      Test conditions that are sufficient to ensure that no conflict occurred.
      Each txn is assigned a numeric id.
        Just use a timestamp.
      Transaction ids assigned at end of READ phase, just before validation begins.
      ReadSet(Ti): Set of objects read by txn Ti.
      WriteSet(Ti): Set of objects modified by Ti.
      Validation is atomic
        Done in a critical section
  24. Validation Tests
      Each test is paired with the situation it covers (R, V, W are the read, validate and write phases).
      Test 1: FIN(Ti) < START(Tj)
        Situation: Ti completes its R, V and W phases before Tj starts.
      Test 2: FIN(Ti) < VAL(Tj) AND WriteSet(Ti) ∩ ReadSet(Tj) is empty.
        Situation: Ti overlaps the R phase of Tj but finishes before Tj validates.
      Test 3: VAL(Ti) < VAL(Tj) AND WriteSet(Ti) ∩ ReadSet(Tj) is empty AND WriteSet(Ti) ∩ WriteSet(Tj) is empty.
        Situation: Ti validates before Tj, but its W phase may overlap the V and W phases of Tj.
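The three tests translate almost directly into set operations. The sketch below is a hedged illustration: Txn and validates_against are hypothetical names, the START/VAL/FIN timestamps are plain integers, and Tj is taken to pass against an earlier Ti when any one of the three tests holds.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    start: int   # START(T)
    val: int     # VAL(T): time validation begins
    fin: int     # FIN(T): time the write phase ends
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def validates_against(ti, tj):
    """True if Tj passes at least one validation test against an earlier Ti."""
    no_wr = not (ti.write_set & tj.read_set)    # WriteSet(Ti) ∩ ReadSet(Tj) empty
    no_ww = not (ti.write_set & tj.write_set)   # WriteSet(Ti) ∩ WriteSet(Tj) empty
    test1 = ti.fin < tj.start
    test2 = ti.fin < tj.val and no_wr
    test3 = ti.val < tj.val and no_wr and no_ww
    return test1 or test2 or test3


ti = Txn(start=1, val=4, fin=6, read_set={"A"}, write_set={"B"})
reads_b = Txn(start=3, val=8, fin=10, read_set={"B"}, write_set={"C"})
reads_a = Txn(start=3, val=8, fin=10, read_set={"A"}, write_set={"B"})
early_writer = Txn(start=3, val=5, fin=9, read_set={"A"}, write_set={"B"})

print(validates_against(ti, reads_b))       # read B while Ti was writing it
print(validates_against(ti, reads_a))       # passes test 2
print(validates_against(ti, early_writer))  # write sets overlap, Ti not finished
```

A real validator would run this check against every transaction that validated earlier and has not finished before Tj started, inside the critical section mentioned on the previous slide.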
  25. Overheads in Kung-Robinson CC
      Must record read/write activity in ReadSet and WriteSet per txn.
        Must create and destroy these sets as needed.
      Must check for conflicts during validation, and must make validated writes “global”.
        Critical section can reduce concurrency.
        Scheme for making writes global can reduce clustering of objects.
      Optimistic CC restarts transactions that fail validation.
        Work done so far is wasted; requires clean-up.
  27. Timestamp Ordering CC
      Main idea:
        Put a timestamp on the last read and write action on every object
        Use this timestamp to detect if a transaction attempts an illegal operation
        Abort the offending transaction if it does
      Algorithm:
        Give each object a read-timestamp (RTS) and a write-timestamp (WTS),
        Give each txn a timestamp (TS) when it begins
        If action ai of txn Ti conflicts with action aj of txn Tj, and TS(Ti) < TS(Tj), then ai must occur before aj.
        Otherwise, restart the violating txn.
  28. Rules for Timestamps-Based scheduling
      Algorithm setup
        RT(X)
          The read time of X: the highest timestamp of a transaction that has read X.
        WT(X)
          The write time of X: the highest timestamp of a transaction that has written X.
        C(X)
          The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.
      Scheduler receives a request from T to operate on X
        The request is realizable under some conditions and not under others
  29. Physically Unrealizable
      Read too late
        A transaction U that started after transaction T wrote a value for X before T reads X.
        In other words, if TS(T) < WT(X), then the read is physically unrealizable, and T must be rolled back.
      Timeline: T start, U start, U writes X, T reads X
  30. Physically Unrealizable
      Write too late
        A transaction U that started after T read X before T got a chance to write X.
        In other words, if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.
      Timeline: T start, U start, U reads X, T writes X
  31. Dirty Read
      After T reads the value of X written by U, U could abort
      In other words, if TS(T) >= WT(X), then the read is physically realizable, but the value in X may not have been committed yet.
        If C(X) is true, then the previous writer of X is committed, all is good.
        If C(X) is false, we must delay T until the writer commits or aborts.
      Timeline: U start, T start, U writes X, T reads X, U aborts
  32. Write after Write
      T tries to write X after a later transaction (U) has written it
        OK to ignore the write by T because it will get overwritten anyway
        Except if U aborts
          Then the value written by T is lost forever
      Solve this problem by introducing the concept of a “tentative write”
      Timeline: T start, U start, U writes X, T writes X, T commit, U abort
  33. Rules for Timestamps-based Scheduling
      Scheduler receives a request to commit T.
        It must find all the database elements X written by T and set C(X)=true.
        If any transactions are waiting for X to be committed, these transactions are allowed to proceed.
      Scheduler receives a request to abort T or decides to roll back T.
        Any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
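The read and write decisions from the preceding slides can be collected into two small functions. This is a sketch under simplifying assumptions: Element, on_read and on_write are hypothetical names, timestamps are integers, and queues of delayed transactions, commit handling and restoring old values on abort are left out. Each function only returns the scheduler's decision and updates RT(X), WT(X) and C(X).

```python
from dataclasses import dataclass


@dataclass
class Element:
    rt: int = 0             # RT(X)
    wt: int = 0             # WT(X)
    committed: bool = True  # C(X)


def on_read(ts, x):
    """Decision for a read of X by a transaction with timestamp ts."""
    if ts < x.wt:
        return "rollback"   # read too late: a later txn already wrote X
    if not x.committed:
        return "delay"      # possible dirty read: wait for the writer
    x.rt = max(x.rt, ts)
    return "grant"


def on_write(ts, x):
    """Decision for a write of X by a transaction with timestamp ts."""
    if ts < x.rt:
        return "rollback"   # write too late: a later txn already read X
    if ts >= x.wt:
        x.wt, x.committed = ts, False   # tentative write until commit
        return "grant"
    # A later txn has already written X.
    return "ignore" if x.committed else "delay"


x = Element()
print(on_write(10, x))   # tentative write by the txn with timestamp 10
print(on_read(5, x))     # an older txn reads too late
print(on_read(12, x))    # a younger txn must wait for the commit bit
```

Setting x.committed = True afterwards plays the role of the commit rule on the slide: the delayed reader can then retry and will be granted the read.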
  35. Multiversion Timestamps
      Multiversion schemes keep old versions of data items to increase concurrency.
      Each successful write results in the creation of a new version of the data item written.
      Use timestamps to label versions.
      When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.
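Version selection can be sketched with a sorted list of write timestamps. This is an illustration only: MultiVersionItem is a hypothetical class, "appropriate version" is taken to mean the version with the largest write timestamp not exceeding the reader's timestamp, timestamps are assumed to be non-negative integers, and the rules for rejecting late writes and for garbage-collecting old versions are omitted.

```python
import bisect


class MultiVersionItem:
    """One data item with all of its versions, ordered by write timestamp."""

    def __init__(self, initial_value):
        self._stamps = [0]              # write timestamps, kept sorted
        self._values = [initial_value]  # value written at each timestamp

    def write(self, ts, value):
        """Add a new version labelled with the writer's timestamp."""
        i = bisect.bisect_right(self._stamps, ts)
        self._stamps.insert(i, ts)
        self._values.insert(i, value)

    def read(self, ts):
        """Return the latest version written at or before timestamp ts."""
        i = bisect.bisect_right(self._stamps, ts) - 1
        return self._values[i]


x = MultiVersionItem("v0")
x.write(10, "v10")
x.write(20, "v20")
print(x.read(15))  # a reader with timestamp 15 sees the version written at 10
print(x.read(25))  # a later reader sees the newest version
```

Because a reader is always served some existing version, it never has to wait for or abort because of a later writer, which is where the extra concurrency comes from.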
  36. Timestamps vs Locking
      Generally, timestamping performs better than locking in situations where:
        Most transactions are read-only.
        It is rare that concurrent transactions will try to read and write the same element.
        This is generally the case for Web Applications
      In high-conflict situations, locking performs better than timestamps
  37. Practical Use
      2-Phase Locks (or variants)
        Used by most relational databases
      Multi-level granularity
        Support for table, page and tuple-level locks
        Used by most relational databases
      Multi-version concurrency control
        Oracle 8 forward: Divide transactions into read-only and read-write
          Read-only transactions use multi-version concurrency and never wait
          Read-write transactions use 2PL
        Postgres, and others as well, offer some level of MVCC
  39. Distributed Commit Motivation
      FruitCo has
        Its main Sales office in Oregon
        Farms and Warehouse are in Washington
        Finance is in Utah
      All three sites have local data centers with their own systems
      When an order is placed, the Sales system must send the billing information to Utah and shipping information to Washington.
      When an order is placed, all three databases must be updated, or none should be.
  40. Two Phase Commit
      The Basic Idea
  41. Two-Phase Commit (2PC)
      Phase 1: The TM gets the RMs ready to write the results into the database
      Phase 2: Everybody writes the results into the database
        TM: The process at the site where the transaction originates and which controls the execution
        RM: The process at the other sites that participate in executing the transaction
      Global Commit Rule:
        The TM aborts a transaction if and only if at least one RM votes to abort it.
        The TM commits a transaction if and only if all of the RMs vote to commit it.
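The two phases and the global commit rule can be sketched in a single process. This is a failure-free toy, not a real protocol implementation: Participant and two_phase_commit are hypothetical names, messages are ordinary method calls, and logging, timeouts and recovery are all omitted. The site names follow the FruitCo example.

```python
class Participant:
    """A resource manager (RM) that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "initial"

    def prepare(self):
        """Phase 1: answer the TM's Prepare with a vote."""
        self.state = "ready" if self.can_commit else "abort"
        return "vote-commit" if self.can_commit else "vote-abort"

    def finish(self, decision):
        """Phase 2: apply the TM's global decision."""
        self.state = decision


def two_phase_commit(participants):
    """TM side: collect votes, apply the global commit rule, broadcast the outcome."""
    votes = [p.prepare() for p in participants]                               # phase 1
    decision = "commit" if all(v == "vote-commit" for v in votes) else "abort"
    for p in participants:                                                    # phase 2
        p.finish(decision)
    return decision


sites = [Participant("Oregon"), Participant("Washington"), Participant("Utah")]
print(two_phase_commit(sites))

sites = [Participant("Oregon"), Participant("Washington", can_commit=False), Participant("Utah")]
print(two_phase_commit(sites))
```

A single vote-abort is enough to abort everywhere, while commit needs a unanimous vote, which is exactly the all-or-nothing property the order example requires.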
  42. Centralized 2PC
      Figure: one coordinator (C) exchanging messages with the participants (P).
        Phase 1: C asks each P "ready?"; each P answers "yes/no".
        Phase 2: C sends each P "commit/abort?"; each P answers "committed/aborted".
  43. State Transitions in 2PC
      Each transition is written as: state --(message received / message sent)--> next state.
      TM:
        INITIAL --(Commit command / Prepare)--> WAIT
        WAIT --(Vote-abort / Global-abort)--> ABORT
        WAIT --(Vote-commit (all) / Global-commit)--> COMMIT
      RMs:
        INITIAL --(Prepare / Vote-commit)--> READY
        INITIAL --(Prepare / Vote-abort)--> ABORT
        READY --(Global-abort / Ack)--> ABORT
        READY --(Global-commit / Ack)--> COMMIT
  44. When TM Fails…
      Timeout in INITIAL
        Who cares
      Timeout in WAIT
        Cannot unilaterally commit
        Can unilaterally abort
      Timeout in ABORT or COMMIT
        Stay blocked and wait for the acks
      (TM state diagram as on slide 43: INITIAL, WAIT, ABORT, COMMIT.)
  45. When an RM Fails…
      Timeout in INITIAL
        TM must have failed in INITIAL state
        Unilaterally abort
      Timeout in READY
        Stay blocked
      (RM state diagram as on slide 43: INITIAL, READY, ABORT, COMMIT.)
  46. When TM Recovers…
      Failure in INITIAL
        Start the commit process upon recovery
      Failure in WAIT
        Restart the commit process upon recovery
      Failure in ABORT or COMMIT
        Nothing special if all the acks have been received
        Otherwise the termination protocol is invoked
      (TM state diagram as on slide 43.)
  47. When an RM Recovers…
      Failure in INITIAL
        Unilaterally abort upon recovery
      Failure in READY
        The TM has been informed about the local decision
        Treat as timeout in READY state and invoke the termination protocol
      Failure in ABORT or COMMIT
        Nothing special needs to be done
      (RM state diagram as on slide 43.)
  48. 2PC Protocol Actions
      Flowchart, read as a sequence of steps:
        TM (INITIAL): write begin_commit in log; send PREPARE; enter WAIT.
        RM (INITIAL): on PREPARE, ready to commit?
          No: write abort in log; send VOTE-ABORT; enter ABORT.
          Yes: write ready in log; send VOTE-COMMIT; enter READY.
        TM (WAIT): any No votes?
          Yes: write abort in log; send GLOBAL-ABORT; enter ABORT.
          No: write commit in log; send GLOBAL-COMMIT; enter COMMIT.
        RM (READY): type of msg?
          Abort: write abort in log; send ACK; enter ABORT.
          Commit: write commit in log; send ACK; enter COMMIT.
        TM (ABORT or COMMIT): on ACK, write end_of_transaction in log.
  49. Two-phase commit commentary
      Limitation of the two-phase commit protocol: it is a blocking protocol.
        The failure of the TM can cause the protocol to block until the TM is repaired.
        If the TM fails right after every RM has sent a Prepared message, the RMs have no way of knowing whether the TM committed or aborted.
        RMs keep their resources blocked while waiting for a message from the TM.
        A TM also blocks resources while waiting for replies from RMs, and can block indefinitely if no acknowledgement is received from an RM.
      “Federated” two-phase commit protocols, aka three-phase protocols, have been proposed but are still unproven.
      Paxos Consensus Algorithm.
        Consensus on Transaction Commit, Jim Gray and Leslie Lamport, Microsoft Research, 2005, MSR-TR-2003-96
  51. Fault-Tolerant Two Phase Commit
      Figure: message flow between client, TM and RMs (RequestCommit, Prepare, Prepared).
      If the 2PC Transaction Manager (TM) fails, the transaction blocks.
      Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit)
  52. Fault-Tolerant Two Phase Commit
      Figure: message flow with a spare TM (RequestCommit, Prepare, Prepared); one TM sends commit while the other sends abort. Inconsistent! Now what?
      If the 2PC Transaction Manager (TM) fails, the transaction blocks.
      Solution: Add a “spare” transaction manager (non-blocking commit, 3-phase commit)
      The complexity is a mess.
      But… What if…?
  53. Fault Tolerant 2PC
      Several workarounds proposed in database community:
        Often called "3-phase" or "non-blocking" commit.
        None with complete algorithm and correctness proof.
  54. 54. Consensus<br />[Diagram: clients send Propose X and Propose W to a consensus box; every client receives W Chosen]<br />collects proposed values<br />Picks one proposed value<br />remembers it forever<br />
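The three properties of the consensus box on this slide (collect proposals, pick one, remember it forever) can be sketched as a tiny class. This is a single-process stand-in under the assumption that the first proposal wins; a real consensus service spreads this state across several acceptors, and the class name `ConsensusBox` is hypothetical.

```python
class ConsensusBox:
    """Illustrative single-process stand-in for a consensus service."""

    _UNSET = object()  # sentinel so that None can still be a proposed value

    def __init__(self):
        self._chosen = ConsensusBox._UNSET

    def propose(self, value):
        # The first proposal received is picked; later ones cannot change it.
        if self._chosen is ConsensusBox._UNSET:
            self._chosen = value
        return self._chosen
```

Every caller of `propose` gets the same answer back, whatever it proposed, which matches the diagram in which all clients learn the same chosen value.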
  55. 55. Consensus for Commit – The Obvious Approach<br />[Diagram: message flow among client, TMs, RMs, and a consensus box: Request Commit, Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit]<br />Get consensus on TM’s decision.<br />TM just learns consensus value.<br />TM is “stateless”<br />
  56. 56. Consensus for Commit – The Paxos Commit Approach<br />[Diagram: message flow among client, TMs, RMs, and one consensus box per RM: Request Commit, Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit]<br />Get consensus on each RM’s choice.<br />TM just combines consensus values.<br />TM is “stateless”<br />
  57. 57. The Obvious Approach vs. Paxos Commit<br />The Obvious Approach: Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit<br />Paxos Commit: Prepare, Propose RM1 Prepared / Propose RM2 Prepared, RM1 Prepared Chosen / RM2 Prepared Chosen, Commit<br />Paxos Commit has one fewer message delay<br />
  58. 58. Consensus in Action<br />[Diagram: an RM, a consensus box of three acceptors, and TMs; messages: Propose RM Prepared (to each acceptor), Vote RM Prepared (from each acceptor), RM Prepared Chosen]<br />The normal (failure-free) case<br />Two message delays<br />Can optimize<br />
  59. 59. Consensus in Action<br />[Diagram: an RM, a consensus box of three acceptors, and TMs]<br />If a majority of acceptors are working, a TM can always learn what was chosen,<br />or get Aborted chosen if nothing has been chosen yet.<br />
  60. 60. The Complete Algorithm<br />Subtle.<br />More weird cases than most people imagine.<br />Proved correct.<br />
  61. 61. Paxos Commit in a Nutshell<br />[Diagram: message flow among Client, TM, RMs 1…N, and Acceptors 0…2F: request commit, prepare, prepared, all prepared, commit]<br />N RMs<br />2F+1 acceptors (~2F+1 TMs)<br />If F+1 acceptors see all RMs prepared, then transaction committed.<br />2F(N+1) + 3N + 1 messages<br />5 message delays<br />2 stable-write delays<br />
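The commit rule stated on this slide (F+1 of the 2F+1 acceptors see all RMs prepared) can be checked with a short function. This sketch encodes only that rule as written here, not the full Paxos Commit algorithm with ballots and recovery; the function name and the set-based representation of what each acceptor has seen are assumptions made for illustration.

```python
def paxos_commit_outcome(acceptor_views, rm_names, f):
    """Apply the slide's rule to a snapshot of acceptor state.

    acceptor_views: one set per acceptor, holding the names of the RMs
                    that acceptor has recorded as prepared (hypothetical model).
    rm_names:       names of all N RMs in the transaction.
    f:              number of acceptor failures to tolerate.
    """
    if len(acceptor_views) != 2 * f + 1:
        raise ValueError("expected 2F+1 acceptors")
    needed = set(rm_names)
    # Count the acceptors that have seen every RM prepared.
    witnesses = sum(1 for seen in acceptor_views if needed <= seen)
    return "committed" if witnesses >= f + 1 else "undecided"
```

With F = 1 and three acceptors, two complete views are enough, so the transaction stays committed even if the third acceptor is slow or down.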
  62. 62. Paxos Commit Evaluation<br />Two-Phase Commit<br />3N+1 messages<br />N+1 stable writes<br />4 message delays<br />2 stable-write delays<br />Availability is compromised<br />Paxos Commit<br />3N + 2F(N+1) + 1 messages<br />N+2F+1 stable writes<br />5 message delays<br />2 stable-write delays<br />Tolerates F Faults<br />Paxos ≣ 2PC for F = 0<br /><ul><li>Paxos Algorithm is the basis of Google’s Global Distributed Lock Manager</li></ul>Chubby has F=2 (5 Acceptors)<br />
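The cost formulas on this slide are easy to tabulate. The sketch below simply evaluates them as written; the function names and the dictionary layout are hypothetical.

```python
def two_phase_commit_cost(n):
    """Costs of 2PC for n RMs, using the figures on the slide."""
    return {
        "messages": 3 * n + 1,
        "stable_writes": n + 1,
        "message_delays": 4,
        "stable_write_delays": 2,
    }


def paxos_commit_cost(n, f):
    """Costs of Paxos Commit for n RMs tolerating f faults, per the slide."""
    return {
        "messages": 3 * n + 2 * f * (n + 1) + 1,
        "stable_writes": n + 2 * f + 1,
        "message_delays": 5,
        "stable_write_delays": 2,
        "acceptors": 2 * f + 1,
    }
```

Setting `f = 0` makes the message and stable-write counts of the two protocols equal, which is the sense of "Paxos ≣ 2PC for F = 0", and `f = 2` gives the 5 acceptors quoted for Chubby.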
  64. 64. OLTP Through the Looking Glass (p1)<br />Workload<br />TPC-C Benchmark<br />Quote:<br />Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. … <br />Substantial time is spent in logging, latching, locking, B-tree, and buffer management.<br /><ul><li>OLTP Through the Looking Glass, and What We Found There, Harizopoulos et al, Proc ACM SIGMOD, 2008</li></ul>Took out components of a DBMS one at a time and measured the performance impact of each<br />
  65. 65. OLTP Through the Looking Glass (p2)<br />Concurrency Control<br />Look for applications where it can be turned off<br />Some sort of optimistic concurrency control<br />Multi-core Support<br />Latching (inter-thread communication) remains a significant bottleneck<br />Cache-conscious B-Trees<br />Replication Management<br />Loss of transactional consistency if log shipping<br />Recovery is not instantaneous<br />Maintaining transactional consistency<br />Weak Consistency<br />Starbucks doesn’t need two phase commit<br />How to achieve eventual consistency without transactional consistency<br />Areas for Research that may yield dividends<br />
  66. 66. End of an Era?<br />The Relational Model is not necessarily the answer<br />It was excellent for data processing<br />Not a natural fit for<br />Data Warehouses<br />Web-oriented search<br />Real-time analytics, and<br />Semi-structured data<br />i.e., Semantic Web<br />SQL is not the answer<br />Coupling between modern programming languages and SQL is “ugly beyond belief”<br />Programming languages have evolved while SQL has remained static<br />Pascal<br />C/C++<br />Java<br />The little languages: Python, Perl, PHP, Ruby<br /><ul><li>The end of an Architectural Era, Stonebraker et al, Proc. VLDB, 2007</li></ul>A critique of the “one size fits all” assumption in DBMS<br />
  67. 67. What’s so fun about databases?<br />From our January 13 Lecture…<br />Traditional database courses talked about<br />Employee records<br />Bank records<br />Now we talk about<br />Web search<br />Data mining<br />The collective intelligence of tweets<br />Scientific and medical databases<br />From a personal viewpoint,<br />I have enjoyed learning this material with you<br />Thank you.<br />
  68. 68. About CS 542<br />CS 542 will<br />Build on database concepts you already know<br />Provide you tools for separating hype from reality<br />Help you develop skills in evaluating the tradeoffs involved in using and/or creating a database<br />CS 542 may<br />Train you to read technical journals and apply them<br />CS 542 will not<br />Cover the intricacies of SQL programming<br />Spend much effort in<br />Dynamic SQL<br />Stored Procedures<br />Interfaces with application programming languages<br />Connectors, e.g., JDBC, ODBC<br />From our January 13 Lecture…<br />
  69. 69. Thanks<br />Contact Information:<br />President, Early Stage IT – a cloud-based consulting firm<br />Email: J [dot] Singh [at] EarlyStageIT [dot] com<br />Phone: 978-760-2055<br />Co-chair of Software and Services SIG at TiE-Boston<br />Founder, SQLnix.org, a local resource for NoSQL databases<br />My WPI email will be good through the summer.<br />