From Object Replication to
  Database Replication

       Fernando Pedone
      University of Lugano
          Switzerland
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •...
Motivation

     •From object replication...
         ‣ Replication in the distributed systems
           community
      ...
Motivation

     •Object replication is important
         ‣ Fault tolerance
                              File server
   ...
Motivation

     • Database replication is important
         ‣ Fault tolerance
                                          ...
Motivation

     •Object vs. database replication (recap)
         ‣ Different goals
              -   Fault tolerance vs....
Motivation

     •Group communication
         ‣ Messages addressed to a group of nodes
         ‣ Initially used for obje...
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •...
Replication model

     •Object model
         ‣ Clients: c1, c2,...
         ‣ Servers: s1, ..., sn
         ‣ Operations...
Replication model

     •Object consistency
         ‣ Consistency criteria
              -   Defines the behavior of objec...
Replication model

         ‣ Consistency criteria
              -   Simple case: single client, single server, no failure...
Replication model

         ‣ Consistency criteria
              -   Multiple clients, multiple servers

          write(x...
Replication model

         ‣ Consistency criteria
              -
              Ideally, defines system behavior regardles...
Replication model

         ‣ Consistency criteria
              -   Two consistency criteria for objects
                ...
Replication model

     •Linearizability
         ‣ A concurrent execution is linearizable if
           there is a sequen...
Replication model

     •Linearizability
         write(x,1)   ack(x)             read(x)   return(1)

      c1
          ...
Replication model

     •Linearizability
         write(x,1)   ack(x)             read(x)   return(1)

      c1
          ...
Replication model

     •Linearizability
         write(x,1)             ack(x)              read(x)          return(1)

 ...
Replication model

     •Sequential consistency
         ‣ A concurrent execution is sequentially
           consistent if...
Replication model

     •Sequential consistency
         write(x,1)   ack(x)             read(x)   return(1)

      c1
   ...
Replication model

     •Sequential consistency
         write(x,1)   ack(x)             read(x)   return(0)

      c1
   ...
Replication model

     •Sequential consistency
         write(x,1)   ack(x)             read(x)   return(0)

      c1
   ...
Replication model

     •Database model
         ‣ Clients: c1, c2,...
         ‣ Servers: s1, ..., sn
         ‣ Operatio...
Replication model

     •Transaction properties
         ‣ Atomicity: Either all transaction operations
           are exe...
Replication model

     •Serializability (1-copy SR)
         ‣ A concurrent execution is serializable if it is
          ...
Replication model

     •Serializability
         write(x,1)   ack(x)             read(x)   return(1)

      c1
          ...
Replication model

     •Serializability
         write(x,1)   ack(x) read(x)   return(1)

      c1
                      ...
Replication model

     •Serializability
         write(x,1)   ack(x) read(x)   return(1) read(x)   return(0)

      c1

 ...
Replication model

     •Serializability
         write(x,1)        ack(x)          read(y)    return(0)

      c1
       ...
Replication model

     •Object and database consistency criteria
         ‣ Assume single-operation transactions

       ...
Replication model

     •Serializable, not session-serializable

              write(x,1)   ack(x)   read(x)   return(0)

...
Replication model

     •Session-serializable, not strongly
        serializable
              write(x,1)   ack(x)        ...
Replication model

     •Strongly serializable

              write(x,1)   ack(x)                    read(x)   return(1)

...
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •...
From objects to databases

     •Fundamentals
         ‣ Model considerations
         ‣ Generic functional model
     •Ob...
From objects to databases

     •Model considerations
         ‣ Client and server processes
         ‣ Communication by m...
Replication model

     •Generic functional model
              Phase 1:   Phase 2:       Phase 3:    Phase 4:       Phase...
From objects to databases

     •Passive replication
         ‣ aka Primary-backup replication
         ‣ Fail-stop failur...
From objects to databases

         ‣ Passive replication algorithm

                   Phase 1:   Phase 2:       Phase 3:...
From objects to databases

     •Passive replication (Linearizability)
         ‣ Perfect failure detection ensures that a...
From objects to databases

     •Active replication
         ‣ aka State-machine replication
         ‣ Crash failure mode...
From objects to databases

         ‣ Atomic broadcast
              -   Group communication abstraction
              -  ...
From objects to databases

         ‣ Atomic broadcast      Application

                m             broadcast(m) delive...
From objects to databases

         ‣ Active replication algorithm

                Phase 1:   Phase 2:       Phase 3:    ...
From objects to databases

         ‣ Why it works
           write(x,1)    ack(x)               read(x)      return(1)

 ...
From objects to databases

     •Active and passive replication
         ‣ Target fault tolerance only
         ‣ Not good...
From objects to databases

     •Multi-primary passive replication
         ‣ Targets both fault tolerance and performance...
From objects to databases

     •Multi-primary passive replication
         ‣ Read requests
              -   Broadcast to...
From objects to databases

     •Multi-primary passive replication
         ‣ Write requests
              -   Executed by...
From objects to databases

     •Multi-primary passive replication
                write(x,1)      ack(x)

      c1
      ...
From objects to databases

     •Multi-primary passive replication
               write(x,1)                              ...
From objects to databases

     •Multi-primary passive replication
                inc(x,1)                               ...
From objects to databases

     •Multi-primary passive replication
         ‣ Write requests
              -   Executed by...
From objects to databases

     •Multi-primary passive replication
              -   Compatible states                    ...
From objects to databases
              -    Compatible states

              x=1    write(x,8)   x=8
 N            y=2
  ...
From objects to databases
              -    Compatible states

              x=1    write(x,8)     x=8
 N            y=2
...
From objects to databases

     •Multi-primary passive replication
                inc(x,1)                               ...
From objects to databases

     •Multi-primary passive replication
         ‣ Optimistic approach
         ‣ May or not re...
From objects to databases

     •Multi-primary passive replication
         ‣ Provides both fault tolerance and
          ...
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •...
Deferred update replication

     •General idea (I)
              Phase 1:   Phase 2:       Phase 3:    Phase 4:       Pha...
Deferred update replication

     •General idea (II)

              Phase 1:   Phase 3:    Phase 5:         Phase 1:   Pha...
Deferred update replication

     •The protocol in detail
         ‣ Transactions (read-only and update) are
           su...
Deferred update replication

     •Read-only transactions
              read(X) return(1)   read(Y) return(2)   commit   o...
Deferred update replication

     •Update transactions
                                                                   ...
Deferred update replication

     •Termination based on Atomic Broadcast
         ‣ Similar to multi-primary passive repli...
Deferred update replication

         ‣ Transaction states
              -   Executing(t,s): transitory state
            ...
Deferred update replication

         ‣ Transaction states
                                                    Certify
   ...
Deferred update replication

         ‣ Transaction states

                commit(t1)      ok
Client 1

Server 1

Server ...
therefore, all servers reach the same outcome, committed or aborted, without

  Deferred update replication
     further c...
Deferred update replication

     •Atomic broadcast-based certification
         ‣ Example
              -   Transaction t:...
Deferred update replication

     •Reordering-based certification
         ‣ Let t and t’ be two concurrent transactions
  ...
Deferred update replication

         ‣ Serialization order of transactions
              -   Without reordering         -...
Deferred update replication

     •Reordering-based certification
                                          serialization
 ...
Deferred update replication

     •Reordering-based certification
                                          serialization
 ...
mitted transactions that have not been seen by transactions in execution since
       their relative order may change. The...
Deferred update replication

     •Generalized reordering
                   Reorder List               Reorder List

    ...
Deferred update replication

     •Termination based on Generic Broadcast
         ‣ Generic broadcast
              -   C...
Deferred update replication

         ‣ Generic broadcast
              -   Motivation
                  • Ordering messag...
Deferred update replication

         ‣ Conflict relation between transactions
              -   Assume transactions t1 and...
We show now that the ordering imposed by atomic broadcast is not always needed.
         Deferred update replication
Consi...
Deferred update replication

         ‣ Read-only transactions
              -   Local execution without certification is n...
Deferred update replication

         ‣ Read-only transactions
              -   Optimistic solution
                  • B...
Outline

     •Motivation
     •Replication model
     •From objects to databases
     •Deferred update replication
     •...
Final remarks

     •Object and database replication
         ‣ Different goals
              -   Fault tolerance   fault ...
Final remarks

     •Consistency models
         ‣ Relationship between L/SC and SR
     •Replication algorithms
         ...
Final remarks

     •Replication and group communication
         ‣ A happy union
         ‣ Group communication -- atomic...
References

     •Chapter 1:
        Consistency Models for
        Replicated Data
        A. Fekete and K. Ramamritham
 ...
Upcoming SlideShare
Loading in …5
×

20100522 from object_to_database_replication_pedone_lecture01-02

605 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
605
On SlideShare
0
From Embeds
0
Number of Embeds
133
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

20100522 from object_to_database_replication_pedone_lecture01-02

  1. 1. From Object Replication to Database Replication Fernando Pedone University of Lugano Switzerland
  2. 2. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 2
  3. 3. Motivation •From object replication... ‣ Replication in the distributed systems community ‣ Processes, non-transactional objects •...to database replication ‣ Replication in the database community ‣ Transactions, databases © F. Pedone 3
  4. 4. Motivation •Object replication is important ‣ Fault tolerance File server Replica #1 Client 1 File server File server Replica #2 Client 2 File server Replica #3 © F. Pedone 4
  5. 5. Motivation • Database replication is important ‣ Fault tolerance Client 3 ‣ Performance Database server Replica #1 Client 4 Client 1 Database server Replica #2 Client 2 Client 5 Database server Replica #3 © F. Pedone Client 6 5
  6. 6. Motivation •Object vs. database replication (recap) ‣ Different goals - Fault tolerance vs. Performance & Fault tolerance ‣ Different models - Objects, non-transactional vs. transactions ‣ Different algorithms - to a certain extent... © F. Pedone 6
  7. 7. Motivation •Group communication ‣ Messages addressed to a group of nodes ‣ Initially used for object replication ‣ Later used for database replication node 1 node 2 node 3 point-to-point group © F. Pedone communication communication 7
  8. 8. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 8
  9. 9. Replication model •Object model ‣ Clients: c1, c2,... ‣ Servers: s1, ..., sn ‣ Operations: read and write - read(x) returns the value of object x - write(x,v) updates the value of x with v; returns acknowledgement © F. Pedone 9
  10. 10. Replication model •Object consistency ‣ Consistency criteria - Defines the behavior of object operations - Simple for single-client-single-server case - But more complex in the general case • Multiple clients • Multiple servers • Failures © F. Pedone 10
  11. 11. Replication model ‣ Consistency criteria - Simple case: single client, single server, no failures write(x,1) ack(x) read(x) return(1) client server x=0 x=1 © F. Pedone 11
  12. 12. Replication model ‣ Consistency criteria - Multiple clients, multiple servers write(x,1) ack(x) c1 read(x) return(0) read(x) return(1) c2 s1 x=0 x=1 s2 © F. Pedone x=0 x=1 12
  13. 13. Replication model ‣ Consistency criteria - Ideally, defines system behavior regardless of implementation details and the operation semantics - Is the ack(x) write(x,1) following execution intuitive to clients? c1 read(x) return(0) read(x) return(1) c2 s1 s2 © F. Pedone 13
  14. 14. Replication model ‣ Consistency criteria - Two consistency criteria for objects (among several others) • Linearizability • Sequential consistency © F. Pedone 14
  15. 15. Replication model •Linearizability ‣ A concurrent execution is linearizable if there is a sequential way to reorder the client operations such that: (1) it respects the semantics of the objects, as determined in their sequential specs (2) it respects the order of non-overlapping operations among all clients © F. Pedone 15
  16. 16. Replication model •Linearizability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Problem: Does not respect © F. Pedone the semantics of the object 16
  17. 17. Replication model •Linearizability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Problem: Does not respect the order © F. Pedone or non-overlapping operations 17
  18. 18. Replication model •Linearizability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 overlapping read(x) return(0) read(x) return(1) sequential execution write(x,1) ack(x) Linearizable execution! © F. Pedone 18
  19. 19. Replication model •Sequential consistency ‣ A concurrent execution is sequentially consistent if there is a sequential way to reorder the client operations such that: (1) it respects the semantics of the objects, as determined in their sequential specs (2) it respects the order of operations at the client that issued the operations © F. Pedone 19
  20. 20. Replication model •Sequential consistency write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Sequentially consistent! © F. Pedone 20
  21. 21. Replication model •Sequential consistency write(x,1) ack(x) read(x) return(0) c1 read(x) return(0) c2 sequential execution Problem: Does not respect the order © F. Pedone or non-overlapping operations 21
  22. 22. Replication model •Sequential consistency write(x,1) ack(x) read(x) return(0) c1 read(x) return(0) c2 sequential execution Problem: Does not respect the order © F. Pedone or operations at c1 22
  23. 23. Replication model •Database model ‣ Clients: c1, c2,... ‣ Servers: s1, ..., sn ‣ Operations: read, write, commit and abort ‣ Transaction: group of read and/or write operations followed by commit or abort - ACID properties (see next) © F. Pedone 23
  24. 24. Replication model •Transaction properties ‣ Atomicity: Either all transaction operations are executed or none are executed ‣ Consistency: A transaction is a correct transformation of the state ‣ Isolation: Serializability (next slide) ‣ Durability: Once a transaction commits, its changes to the state survive failures © F. Pedone 24
  25. 25. Replication model •Serializability (1-copy SR) ‣ A concurrent execution is serializable if it is equivalent to a serial execution with the same transactions ‣ Two executions are (view) equivalent if: - Their transactions preserve the same “reads-from” relationships (t reads x from t’ in both executions) - They have the same final writes © F. Pedone 25
  26. 26. Replication model •Serializability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Serializable! © F. Pedone 26
  27. 27. Replication model •Serializability write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) c2 sequential execution Serializable! © F. Pedone 27
  28. 28. Replication model •Serializability write(x,1) ack(x) read(x) return(1) read(x) return(0) c1 c2 sequential execution Serializable!!! © F. Pedone 28
  29. 29. Replication model •Serializability write(x,1) ack(x) read(y) return(0) c1 write(y,1) ack(y) read(x) return(0) c2 sequential execution Non serializable! © F. Pedone 29
  30. 30. Replication model •Object and database consistency criteria ‣ Assume single-operation transactions Objects Database Strong Linearizability Serializability equivalent Sequential Session if transactions have only one Consistency Serializability operation ? Serializability equivalent if transactions have only one operation © F. Pedone 30
  31. 31. Replication model •Serializable, not session-serializable write(x,1) ack(x) read(x) return(0) c1 Transaction 1 Transaction 2 c2 sequential execution © F. Pedone 31
  32. 32. Replication model •Session-serializable, not strongly serializable write(x,1) ack(x) read(x) return(1) c1 Transaction 1 read(x) return(0) Transaction 2 c2 Transaction 3 sequential execution © F. Pedone 32
  33. 33. Replication model •Strongly serializable write(x,1) ack(x) read(x) return(1) c1 Transaction 1 read(x) return(1) Transaction 2 c2 Transaction 3 sequential execution © F. Pedone 33
  34. 34. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 34
  35. 35. From objects to databases •Fundamentals ‣ Model considerations ‣ Generic functional model •Object replication ‣ Passive replication (primary-backup) ‣ Active replication (state-machine replication) ‣ Multi-primary passive replication © F. Pedone 35
  36. 36. From objects to databases •Model considerations ‣ Client and server processes ‣ Communication by message passing ‣ Crash failures only - No Byzantine failures © F. Pedone 36
  37. 37. Replication model •Generic functional model Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client Server Execution Agreement Client Request Coordination Coordination Response Client Server 1 Server 2 Server 3 © F. Pedone 37
  38. 38. From objects to databases •Passive replication ‣ aka Primary-backup replication ‣ Fail-stop failure model - a process follows its spec until it crashes - a crash is detected by every correct process - no process is suspected of having crashed until after it actually crashes ‣ Algorithm ensures linearizability © F. Pedone 38
  39. 39. From objects to databases ‣ Passive replication algorithm Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client ServerOnly Execution Agreement Client Request Coordination Coordination Response primary executes Client requests Primary Server 1 Backups Backup 1 apply new Server 2 state Backup 2 Server 3 © F. Pedone 39
  40. 40. From objects to databases •Passive replication (Linearizability) ‣ Perfect failure detection ensures that at most one primary exists at all times ‣ A failed primary is detected by the backups ‣ Eventually one backup replaces failed primary © F. Pedone 40
  41. 41. From objects to databases •Active replication ‣ aka State-machine replication ‣ Crash failure model - a process follows its spec until it crashes - a crash is detected by every correct process - correct processes may be erroneously suspected ‣ Algorithm ensures linearizability © F. Pedone 41
  42. 42. From objects to databases ‣ Atomic broadcast - Group communication abstraction - Primitives: broadcast(m) and deliver(m) - Properties • Agreement: Either all servers deliver m or no server delivers m • Total order: Any two servers deliver messages m and m’ in the same order © F. Pedone 42
  43. 43. From objects to databases ‣ Atomic broadcast Application m broadcast(m) deliver(m) Group Client 1 Communication m’ send(m) receive(m) Client 2 Network Server 1 Architectural Server 2 View Server 3 © F. Pedone 43
  44. 44. From objects to databases ‣ Active replication algorithm Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client Server Execution Agreement Client Request Coordination Coordination Response Deterministic execution Client Server 1 All servers execute all Server 2 requests Server 3 © F. Pedone 44
  45. 45. From objects to databases ‣ Why it works write(x,1) ack(x) read(x) return(1) c1 read(x) return(0) return(1) broadcast c 2 operation “write(x,1)” s1 x=0 x=1 s2 x=0 x=1 s3 © F. Pedone x=0 x=1 45
  46. 46. From objects to databases •Active and passive replication ‣ Target fault tolerance only ‣ Not good for performance - Active replication: All servers execute all requests - Passive replication: Only one server executes requests © F. Pedone 46
  47. 47. From objects to databases •Multi-primary passive replication ‣ Targets both fault tolerance and performance ‣ Does not require perfect failure detection ‣ Same resilience as active replication, but... ‣ Better performance - Distinguishes read from write operations - Only one server executes each request © F. Pedone 47
  48. 48. From objects to databases •Multi-primary passive replication ‣ Read requests - Broadcast to all servers - Executed by only one server - Response is sent to the client © F. Pedone 48
  49. 49. From objects to databases •Multi-primary passive replication ‣ Write requests - Executed by one server - State changes (“diff”) are broadcast to all servers - If the changes are “compatible” with the previous installed states, then a new state is installed, and the result of the request is returned to the client - Otherwise the request is re-executed © F. Pedone 49
  50. 50. From objects to databases •Multi-primary passive replication write(x,1) ack(x) c1 read(x) return(1) c2 broadcast state change “x=1” s1 x=0 x=1 s2 x=0 x=1 s3 © F. Pedone 50 x=0 x=1
  51. 51. From objects to databases •Multi-primary passive replication write(x,1) ack(x) c1 write(x,2) ack(x) state c2 change “x=2” state s1 change x=0 “x=1” x=2 x=1 s2 x=0 x=2 x=1 s3 © F. Pedone 51 x=0 x=2 x=1
  52. 52. From objects to databases •Multi-primary passive replication inc(x,1) ack(x) c1 inc(x,2) ack(x) state c2 change “x=2” state s1 change x=0 “x=1” x=2 x=1 s2 x=0 x=2 x=1 s3 © F. Pedone 52 x=0 x=2 x=1
  53. 53. From objects to databases •Multi-primary passive replication ‣ Write requests - Executed by one server - State changes (“diff”) are broadcast to all servers - If the changes are “compatible” with the previous installed states, then a new state is installed, and the result of the request is returned to the client - Otherwise the request is re-executed © F. Pedone 53
  54. 54. From objects to databases •Multi-primary passive replication - Compatible states Si α Si’ • Let Si be the state at server N before it executes operation α, and Si’ the state after α • Let h=S0°...°Sj be the sequence of installed states at a server N’ when it tries to install the new state Si’ • Si’ is compatible with h if - (a) Sj = Si or - (b) α does not read any variables modified in Si+1, Si+2, ..., Sj © F. Pedone 54
  55. 55. From objects to databases - Compatible states x=1 write(x,8) x=8 N y=2 z=3 y=2 z=3 Si Si’ x=1 x=8 (a) N’ y=2 y=2 z=3 z=3 Si Si+1 © F. Pedone 55
  56. 56. From objects to databases - Compatible states x=1 write(x,8) x=8 N y=2 z=3 y=2 z=3 Si Si’ x=1 write(y,3) x=1 write(z,5) x=1 x=8 (b) N’ y=2 y=3 y=3 y=3 z=3 z=3 z=5 z=5 Si Si+1 Sj Sj+1 © F. Pedone 56
  57. 57. From objects to databases •Multi-primary passive replication inc(x,1) ack(x) c1 inc(x,2) ack(x) state c2 change “x=2” state s1 change x=0 “x=1” x=2 x=3 s2 x=0 x=2 x=3 s3 © F. Pedone 57 x=0 x=2 x=3
  58. 58. From objects to databases •Multi-primary passive replication ‣ Optimistic approach ‣ May or not rely on deterministic execution - It depends on how re-executions are dealt with ‣ Installing a new state is cheaper than executing the request (that creates the state) © F. Pedone 58
  59. 59. From objects to databases •Multi-primary passive replication ‣ Provides both fault tolerance and performance ‣ Suitable for database replication! © F. Pedone 59
  60. 60. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 60
  61. 61. Deferred update replication •General idea (I) Phase 1: Phase 2: Phase 3: Phase 4: Phase 5: Client Server Execution Agreement Client Request Coordination Coordination Response Client Server 1 Server 2 Server 3 © F. Pedone 61
  62. 62. Deferred update replication •General idea (II) Phase 1: Phase 3: Phase 5: Phase 1: Phase 3: Phase 4: Phase 5: Client Execution Client Client Execution Agreement Client Request Response Request Coordination Response Certification ... & Commit/ Client Abort Server 1 Server 2 Communication Server 3 Protocol © F. Pedone 62
  63. 63. Deferred update replication •The protocol in detail ‣ Transactions (read-only and update) are submitted and executed by one database server allowed by serializability ‣ At commit time: - Read-only transactions are committed immediately - Update transactions are propagated to the other replicas for certification, and possibly commit © F. Pedone 63
  64. 64. Deferred update replication •Read-only transactions read(X) return(1) read(Y) return(2) commit ok Client 1 Server 1 Server 2 Server 3 Client 2 read(Y) return(2) read(X) return(1) commit ok © F. Pedone 64
  65. 65. Deferred update replication •Update transactions ok / abort read(X) return(1) write(Y,3) ack(Y) commit Client 1 Server 1 Server 2 Server 3 If commit, Client 2 ... writes are read(Y) return(2) write(X,3) ack(X) applied © F. Pedone 65
  66. 66. Deferred update replication •Termination based on Atomic Broadcast ‣ Similar to multi-primary passive replication ‣ Deterministic certification test ‣ Lower abort rate than atomic commit based technique © F. Pedone 66
  67. 67. Deferred update replication ‣ Transaction states - Executing(t,s): transitory state - Committing(t,s): transitory state - Committed(t,s): final state - Aborted(t,s): final state Committed(t,s) Executing(t,s) Committing(t,s) Aborted(t,s) © F. Pedone 77
  68. 68. Deferred update replication ‣ Transaction states Certify transaction write(y,3) ack(y) commit(t) ok Client 1 Atomic Broadcast Commit(t,s1) Server 1 Executing(t,s1) Committing(t,s1) Server 2 Server 3 Transaction termination Client 2 © F. Pedone 68
  69. 69. Deferred update replication ‣ Transaction states commit(t1) ok Client 1 Server 1 Server 2 Server 3 Client 2 ok commit(t2) © F. Pedone 69
  70. 70. therefore, all servers reach the same outcome, committed or aborted, without Deferred update replication further communication [1, 8]. When a server s delivers t’s readset, writeset, and updates, Committing(t, s) holds. Server s certifies transactions in the order they are delivered. To certify t, server s checks whether each transaction it knows about in the committed state either precedes t or does not have a writeset that intersects t’s readset. •If Atomic conditions holds, s locally commits t. We define the condition one of these broadcast-based certification for the transition from the committing state to the committed state more formally as follows: ∀t ∀s : Committing(t, s) ; Committed(t, s) ≡   ∀t s.t. Committed(t , s) :   (1.3) t → t ∨ W S(t ) ∩ RS(t) = ∅ If transaction t passes to the committed state at s, its updates should be t’→t, t’ precedes t: applied to the database following the order imposed by the t’ does not atomic broadcast changes made by t’ update any item primitive, that is, if t is delivered before t and both pass the reads that t certification could be seen by t test, then t’s updates should be applied to the database before those of t . (i.e,. read by t) Intuitively, condition 1.3 checks whether transactions can be serialized following their delivery order. Let t and t be two transactions such that t © F. Pedone 70 is delivered and committed before t. There are two cases in which t will pass certification: (i) t committed before t started its execution, in which case any
  71. 71. Deferred update replication •Atomic broadcast-based certification ‣ Example - Transaction t: read(x); write(y,-) - Transaction t’: read(y); write(x,-) - t and t’ are concurrent (i.e., neither t→t’ nor t’→t) - All servers validate t and t’ in the same order - Assume servers validate t and then t’ - What transaction(s) commit/abort? © F. Pedone 71
  72. 72. Deferred update replication •Reordering-based certification ‣ Let t and t’ be two concurrent transactions - t executes read(x); write(y,-) and - t’ executes read(y); write(z,-) ‣ (a) t is delivered and certified before t’ - t passes certification, but - t’ does not: RS(t’) ∩ WS(t) = { y } ≠ ∅ ‣ (b) t’ is delivered and certified before t - t’ passes certification and - t passes certification(!): RS(t) ∩ WS(t’) = ∅ © F. Pedone 72
  73. 73. Deferred update replication ‣ Serialization order of transactions - Without reordering - With reordering • Order given by abcast • Either t1 followed by t2 • Ex.: t1 followed by t2 or t2 followed by t1 t1 t2 Server 1 Server 2 Server 3 © F. Pedone 73
  74. 74. Deferred update replication •Reordering-based certification serialization order Reorder List (committed transactions) new delivered transaction t1 t2 t3 t4 t Condition for committing t (simplified): RS(t) ∩ (WS(t1) ∪ WS(t2) ∪ WS(t3) ∪ WS(t4)) = ∅ © F. Pedone 74
  75. 75. Deferred update replication •Reordering-based certification serialization order Reorder List (committed transactions) new delivered transaction t1 t2 t3 t4 t Condition for committing t (simplified): RS(t) ∩ (WS(t1) ∪ WS(t2)) = ∅ and WS(t) ∩ (RS(t3) ∪ RS(t4)) = ∅ © F. Pedone 75
  76. 76. mitted transactions that have not been seen by transactions in execution since their relative order may change. The number of transactions in the Reorder List is limited by a predetermined threshold, the R . Whenever Deferred update replication the Reorder Factor is reached, one transaction in the Reorder List is removed and its updates are applied to the database. Let RLs = t0 ; t1 ; . . . ; tcount−1 be the Reorder List at server s containing count transactions and pos(x) be a function that returns the position of transaction x in RLs . We represent the condition for the state transition of •Reordering-based certification transaction t from the committing state to the committed state more formally as follows: ∀t∀s : Committing(t, s) ; Committed(t, s) ≡   ∃i, 0 ≤ i < count, s.t. ∀t ∈ RLs :      pos(t ) < i ⇒ t → t ∨ W S(t ) ∩ RS(t) = ∅ ∧        ∧  (1.4)   (t → t ∨ W S(t ) ∩ RS(t) = ∅)     pos(t ) ≥ i ⇒  ∧  W S(t) ∩ RS(t ) = ∅ If t passes the certification test, it is included in RLs at position i, which means that transactions in positions in [i .. count−1] are rightshifted. If more than one position satisfies the condition in equation 1.4, then to ensure that © F. Pedone servers apply transactions to the database in the same order, a deter- 76 all ministic procedure has to be used to choose the insertion position (e.g., the
  77. 77. Deferred update replication •Generalized reordering Reorder List Reorder List t1 t2 t3 t4 t t1 t t2 t3 t4 Reorder List Reorder List t1 t2 t3 t t4 t t1 t2 t3 t4 Reorder List Reorder List t1 t2 t t3 t4 t4 Abort t? t t2 t1 t3 © F. Pedone 77
  78. 78. Deferred update replication •Termination based on Generic Broadcast ‣ Generic broadcast - Conflict relation ∼ between messages - Properties • Agreement: Either all servers deliver m or no server delivers m • Total order: If messages m and m’ conflict, then any two servers deliver them in the same order © F. Pedone 78
  79. 79. Deferred update replication ‣ Generic broadcast - Motivation • Ordering messages is more expensive than not ordering messages • Messages should only be ordered when needed (as defined by the application) - Atomic broadcast is a special case of generic broadcast where all messages must be ordered © F. Pedone 79
  80. 80. Deferred update replication ‣ Conflict relation between transactions - Assume transactions t1 and t2 don’t conflict t1 t2 Server 1 t1 t2 Server 2 t1 Server 3 t2 © F. Pedone 80
  81. 81. We show now that the ordering imposed by atomic broadcast is not always needed. Deferred update replication Consider the termination of two transactions t and t with non-intersecting readsets and writesets. In such a case, both t and t will pass the certification test and be committed, regardless of the order in which they are delivered. This shows that atomic broadcast is sometimes too strong, and can be favorably replaced by generic broadcast (see Chapter 3). Generic broadcast is similar to atomic broadcast, with the ‣ Conflict relation between transactions exception that applications can define order constraints on the delivery of messages: - m:t means message m where transaction t two messages are ordered only if they conflict,relays the conflict relation is defined by the application semantics. - if t conflicts with t’ (i.e., m:t ∼ m’:t’), then they are Based on the example above, one could definethe same order delivered and certified in the following conflict relation ∼ among messages, where m : t means that message m relays transaction t:   RS(t) ∩W S(t ) = 0 /  ∨     W S(t) ∩ RS(t ) = 0  m :t ∼ m :t ≡  /  (11.5)  ∨  W S(t) ∩W S(t ) = 0 / Notice that the conflict relation ∼ should account for write-write conflicts to make sure thatF.transactions that update common data items are ordered, preventing the © Pedone 81 case in which two servers end up in different states after applying such transactions
  82. 82. Deferred update replication ‣ Read-only transactions - Local execution without certification is not permitted - Different states of the database could be observed x=1 write(x,1) t1 t3 y=0 t2 x=0 x=1 Server 1 y=0 y=2 t1 t2 x=0 x=1 Server 2 y=0 y=2 x=0 write(y,2) t1 x=1 Server 3 y=0 y=2 t2 t4 x=0 y=2 read(x);read(y) © F. Pedone 82
  83. 83. Deferred update replication ‣ Read-only transactions - Optimistic solution • Broadcast and certify read-only transactions - Pessimistic solution • Pre-declare items to be read (readset) • Broadcast transaction before execution • Executed by one server only • Never aborted © F. Pedone 83
  84. 84. Outline •Motivation •Replication model •From objects to databases •Deferred update replication •Final remarks © F. Pedone 84
  85. 85. Final remarks •Object and database replication ‣ Different goals - Fault tolerance fault tolerance & performance ‣ Different consistency models - Linearizability & sequential consistency serializability ‣ Different algorithms - Primary-backup & active replication deferred update replication © F. Pedone 85
  86. 86. Final remarks •Consistency models ‣ Relationship between L/SC and SR •Replication algorithms ‣ Unifying framework for object and database replication ‣ Multi-primary passive replication © F. Pedone 86
  87. 87. Final remarks •Replication and group communication ‣ A happy union ‣ Group communication -- atomic broadcast -- leads to modular and efficient protocols (e.g., fewer aborts than atomic commit) ‣ Replication has motivated more powerful and efficient group communication protocols (e.g., generic and optimistic primitives) © F. Pedone 87
  88. 88. References •Chapter 1: Consistency Models for Replicated Data A. Fekete and K. Ramamritham •Chapter 2: Replication Techniques for Availability R. van Renesse and R. Guerraoui •Chapter 11: From Object Replication to Database Replication F. Pedone and A. Schiper © F. Pedone 88

×