2/29/2012 1
10. Replication
CSEP 545 Transaction Processing
Philip A. Bernstein
Sameh Elnikety
Copyright ©2012 Philip A. Bernstein
2/29/2012 2
Outline
1. Introduction
2. Primary-Copy Replication
3. Multi-Master Replication
4. Other Approaches
5. Products
2/29/2012 3
1. Introduction
• Replication - using multiple copies of a server or
resource for better availability and performance.
– Replica and Copy are synonyms
• If you’re not careful, replication can lead to
– worse performance - updates must be applied to all
replicas and synchronized
– worse availability - some algorithms require multiple
replicas to be operational for any of them to be used
2/29/2012 4
Read-only Database
[Figure: a single Database Server alongside Database Servers 1, 2 and 3]
• T1 = { r[x] }
2/29/2012 5
Update-only Database
[Figure: a single Database Server alongside Database Servers 1, 2 and 3]
• T1 = { w[x=1] }
• T2 = { w[x=2] }
2/29/2012 6
Update-only Database
[Figure: a single Database Server alongside Database Servers 1, 2 and 3]
• T1 = { w[x=1] }
• T2 = { w[y=1] }
2/29/2012 7
Replicated Database
[Figure: Database Servers 1, 2 and 3]
• Objective
– Availability
– Performance
• Transparency
– 1 copy serializability
• Challenge
– Propagating and synchronizing updates
2/29/2012 8
Replicated Server
• Can replicate servers on a common resource
– Data sharing - DB servers communicate with shared disk
[Figure: a Client, Server Replica 1 and Server Replica 2, and one shared Resource]
• Helps availability for process (not resource) failure
• Requires a replica cache coherence mechanism, so
this helps performance only if
– little conflict between transactions at different servers or
– loose coherence guarantees (e.g. read committed)
2/29/2012 9
Replicated Resource
• To get more improvement in availability,
replicate the resources (too)
• Also increases potential throughput
• This is what’s usually meant by replication
• It’s the scenario we’ll focus on
[Figure: two Clients, Server Replica 1 and Server Replica 2, and two Resource replicas]
2/29/2012 10
Synchronous Replication
• Replicas function just like a non-replicated resource
– Txn writes data item x. System writes all replicas of x.
– Synchronous – replicas are written within the update txn
– Asynchronous – One replica is updated immediately.
Other replicas are updated later
• Problems with synchronous replication
– Too expensive for most applications, due to heavy
distributed transaction load (2-phase commit)
– Can’t control when updates are applied to replicas
[Figure: a transaction (Start, Write(x), Commit) whose Write(x) is executed as Write(x1), Write(x2), Write(x3) on replicas x1, x2, x3]
2/29/2012 11
Synchronous Replication - Issues
• DBMS products support it only in special situations
• If you just use transactions, availability suffers.
• For high-availability, the algorithms are complex and
expensive, because they require heavy-duty
synchronization of failures.
• … of failures? How do you synchronize failures?
• Assume replicas xA, xB of x and yC, yD of y
    T1: r1[xA]   (yD fails)   w1[yC]
    T2: r2[yD]   (xA fails)   w2[xB]
– Not equivalent to a one-copy execution,
even if xA and yD never recover!
2/29/2012 12
Atomicity & Isolation Goal
• One-copy serializability (abbr. 1SR)
– An execution of transactions on the replicated database has
the same effect as a serial execution on a one-copy database.
• Readset (resp. writeset) - the set of data items (not
copies) that a transaction reads (resp. writes).
• 1SR Intuition: the execution is SR and in an equivalent
serial execution, for each txn T and each data item x in
readset(T), T reads from the most recent txn that wrote
into any copy of x.
• To check for 1SR, first check for SR (using SG), then
see if there’s an equivalent serial history with the above
property
2/29/2012 13
Atomicity & Isolation (cont’d)
• Previous example was not 1SR. It is equivalent to
– r1[xA] w1[yC] r2[yD] w2[xB] and
– r2[yD] w2[xB] r1[xA] w1[yC]
– but in both cases, the second transaction does not read its
input from the previous transaction that wrote that input.
• These are 1SR
– r1[xA] w1[yD] r2[yD] w2[xB]
– r1[xA] w1[yC] w1[yD] r2[yD] w2[xA] w2[xB]
• The previous history is the one you would expect
– Each transaction reads one copy of its readset and
writes into all copies of its writeset
• But it may not always be feasible, because some copies
may be unavailable.
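The second step of the 1SR check can be sketched in Python. This is a minimal sketch, not part of the course material: the function name `is_one_copy_serial` and the `(kind, item, copy)` encoding of operations are assumptions made for illustration. It takes a history that is already serial and tests whether every read sees the most recent earlier write to any copy of that data item. It does not check serializability itself.

```python
def is_one_copy_serial(serial_history):
    """Check the 1SR reads-from property on a SERIAL history.

    serial_history: list of (txn_name, ops); each op is (kind, item, copy)
    with kind "r" or "w", e.g. ("r", "x", "A") stands for r[xA].
    """
    last_writer_of_item = {}  # item -> last txn that wrote ANY copy of it
    last_writer_of_copy = {}  # (item, copy) -> last txn that wrote THAT copy
    for txn, ops in serial_history:
        for kind, item, copy in ops:
            if kind == "w":
                last_writer_of_item[item] = txn
                last_writer_of_copy[(item, copy)] = txn
            elif last_writer_of_copy.get((item, copy)) != last_writer_of_item.get(item):
                # The copy that was read is stale, so a one-copy
                # database would have returned a different value.
                return False
    return True


# r1[xA] w1[yC] r2[yD] w2[xB]: T2 reads yD, but T1 wrote only yC.
not_1sr = [("T1", [("r", "x", "A"), ("w", "y", "C")]),
           ("T2", [("r", "y", "D"), ("w", "x", "B")])]
# r1[xA] w1[yD] r2[yD] w2[xB]: T2 reads the copy that T1 wrote.
ok_1sr = [("T1", [("r", "x", "A"), ("w", "y", "D")]),
          ("T2", [("r", "y", "D"), ("w", "x", "B")])]

print(is_one_copy_serial(not_1sr))  # False
print(is_one_copy_serial(ok_1sr))   # True
```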
2/29/2012 14
Asynchronous Replication
• Asynchronous replication
– Each transaction updates one replica.
– Updates are propagated later to other replicas.
• Primary copy: Each data item has a primary copy
– All transactions update the primary copy
– Other copies are for queries and failure handling
• Multi-master: Transactions update different copies
– Useful for disconnected operation, partitioned network
• Both approaches ensure that
– Updates propagate to all replicas
– If new updates stop, replicas converge to the same state
• Primary copy ensures serializability, and often 1SR
– Multi-master does not.
2/29/2012 15
2. Primary-Copy Replication
• Designate one replica as the primary copy (publisher)
• Transactions may update only the primary copy
• Updates to the primary are sent later to secondary replicas
(subscribers) in the order they were applied to the primary
[Figure: transactions T1, T2, ..., Tn (each: Start … Write(x1) ... Commit) update the primary copy x1; the updates then flow to the secondaries x2, ..., xm]
2/29/2012 16
Update Propagation
• Collect updates at the primary using triggers or
by post-processing the log
– Triggers: on every update at the primary, a trigger fires
to store the update in the update propagation table.
– Log post-processing: “sniff” the log to generate update
propagations
• Log post-processing (log sniffing)
– Saves triggered update overhead during on-line txn.
– But R/W log synchronization has a (small) cost
• Optionally identify updated fields to compress log
• Most DB systems support this today.
2/29/2012 17
Update Processing 1/2
• At the replica, for each txn T in the propagation stream,
execute a refresh txn that applies T’s updates to the replica.
• Process the stream serially
– Otherwise, conflicting transactions may run in a
different order at the replica than at the primary.
– Suppose log contains w1[x] c1 w2[x] c2.
Obviously, T1 must run before T2 at the replica.
– So the execution of update transactions is serial.
• Optimizations
– Batching: {w(x)} {w(y)} -> {w(x), w(y)}
– “Concurrent” execution
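Serial refresh with batching can be sketched as follows. This is a minimal sketch under stated assumptions: the `Replica` class, its `refresh` method, and the `(log_id, writes)` stream format are hypothetical names chosen for illustration, and a dictionary update stands in for one atomic refresh transaction. The stream is processed in primary commit order, and consecutive transactions are merged so that a later write to the same item overrides an earlier one.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.last_applied = 0  # id of the last log record applied here

    def _apply(self, writes, log_id):
        # Stands in for one atomic refresh transaction.
        self.data.update(writes)
        self.last_applied = log_id

    def refresh(self, stream, batch_size=1):
        """Apply the propagation stream serially, in primary commit order.

        stream: list of (log_id, {item: value}) sorted by log_id.
        batch_size: how many txns to merge into one refresh txn.
        """
        batch, batch_end, n = {}, self.last_applied, 0
        for log_id, writes in stream:
            if log_id <= self.last_applied:
                continue  # already applied, e.g. a re-sent record
            batch.update(writes)  # later txn overrides earlier writes
            batch_end = log_id
            n += 1
            if n == batch_size:
                self._apply(batch, batch_end)
                batch, n = {}, 0
        if batch:
            self._apply(batch, batch_end)


replica = Replica()
# Log at the primary: w1[x] c1 w2[x] c2 w3[y] c3
replica.refresh([(1, {"x": 1}), (2, {"x": 2}), (3, {"y": 5})], batch_size=2)
print(replica.data, replica.last_applied)  # {'x': 2, 'y': 5} 3
```

Tracking `last_applied` also lets a recovering replica skip records it has already processed.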
2/29/2012 18
Update Processing 2/2
• To get a 1SR execution at the replica
– Refresh transactions and read-only queries use an
atomic and isolated mechanism (e.g., 2PL)
• Why this works
– The execution is serializable
– Each state in the serial execution is one that
occurred at the primary copy
– Each query reads one of those states
• Client view
– Session consistency
2/29/2012 19
Request Propagation
• An alternative to propagating updates is to propagate
requests, i.e. procedure calls (e.g., txn-bracketed DB stored
procedure calls).
• Requirements
– Must ensure same order at primary and replicas
– Determinism
• This is often a txn middleware (not DB) feature.
[Figure: Call(SP1) is replicated to DB-A and DB-B; at each database, SP1 runs Write(x) and Write(y), producing w[x] and w[y] on that database's copies of x and y]
2/29/2012 20
Failure & Recovery Handling 1/3
• Secondary failure - nothing to do till it recovers
– At recovery, apply the updates it missed while down
– Needs to determine which updates it missed,
just like non-replicated log-based recovery
– If down for too long, may be faster to get a whole copy
• Primary failure
– Normally, secondaries wait till the primary recovers
– Can get higher availability by electing a new primary
– A secondary that detects primary’s failure starts a new
election by broadcasting its unique replica identifier
– Other secondaries reply with their replica identifier
– The largest replica identifier wins
2/29/2012 21
Failure & Recovery Handling 2/3
• Primary failure (cont’d)
– All replicas must now check that they have the
same updates from the failed primary
– During the election, each replica reports the id of the
last log record it received from the primary
– The most up-to-date replica sends its latest updates to
(at least) the new primary.
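The election steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: the name `elect_primary` and the `reports` dictionary are hypothetical, message exchange is omitted, and breaking ties on the log record id by replica identifier is an assumption, not something the slides specify.

```python
def elect_primary(reports):
    """Pick the new primary and the replica that must forward its updates.

    reports: {replica_id: id of last log record received from the old
    primary}, one entry per replica that took part in the election.
    Returns (new_primary, most_up_to_date).
    """
    # The largest replica identifier wins the election.
    new_primary = max(reports)
    # The most up-to-date replica is the one with the highest log record
    # id; ties are broken by replica id (an assumption of this sketch).
    most_up_to_date = max(reports, key=lambda r: (reports[r], r))
    return new_primary, most_up_to_date


reports = {3: 17, 5: 15, 8: 12}
print(elect_primary(reports))  # (8, 3)
```

Here replica 8 becomes the primary, and replica 3 must send it log records 13 through 17 before it starts accepting updates.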
2/29/2012 22
Failure & Recovery Handling 3/3
• Primary failure (cont’d)
– Lost updates
– Could still lose an update that committed at the primary
and wasn’t forwarded before the primary failed …
but solving it requires synchronous replication
(2-phase commit to propagate updates to replicas)
– One primary and one backup
• There is always a window for lost updates.
2/29/2012 23
Communications Failures
• Secondaries can’t distinguish a primary failure from a
communication failure that partitions the network.
• If the secondaries elect a new primary and the old primary
is still running, there will be a reconciliation problem
when they’re reunited. This is multi-master.
• To avoid this, one partition must know it’s the only one
that can operate. It can’t communicate with other
partitions to figure this out.
• Could make a static decision.
E.g., the partition that has the primary wins.
• Dynamic solutions are based on Majority Consensus
2/29/2012 24
Majority Consensus
• Whenever a set of communicating replicas detects a
replica failure or recovery, they test if they have a
majority (more than half) of the replicas.
• If so, they can elect a primary
• Only one set of replicas can have a majority.
• Doesn’t work well with an even number of copies: an
even split leaves no set with a majority.
– Useless with 2 copies (both must be up)
• Quorum consensus
– Give a weight to each replica
– The replica set that has a majority of the weight wins
– E.g. 2 replicas, one has weight 1, the other weight 2
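The majority test, with and without weights, can be sketched as follows. This is a minimal sketch: the function name `has_quorum` and the replica names are assumptions made for illustration. Plain majority consensus is the special case where every weight is 1.

```python
def has_quorum(weights, reachable):
    """True if the reachable replicas hold a strict majority of the weight.

    weights: {replica_name: weight}; reachable: iterable of replica names
    in this set of communicating replicas.
    """
    total = sum(weights.values())
    mine = sum(weights[r] for r in reachable)
    return 2 * mine > total  # strictly more than half


# Two copies with equal weight: neither can operate alone.
print(has_quorum({"A": 1, "B": 1}, ["A"]))  # False
# Two copies with weights 1 and 2: B alone holds 2 of 3.
print(has_quorum({"A": 1, "B": 2}, ["B"]))  # True
print(has_quorum({"A": 1, "B": 2}, ["A"]))  # False
```

Because the test requires strictly more than half of the total weight, two disjoint sets of replicas can never both pass it.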
2/29/2012 25
3. Multi-Master Replication
• Some systems must operate when partitioned.
– Requires many updatable copies, not just one primary
– Conflicting updates on different copies are detected late
• Classic example - salesperson’s disconnected laptop
– Customer table (rarely updated)
– Orders table (insert mostly)
– Customer log table (append only)
– So conflicting updates from different salespeople are rare
• Use primary-copy algorithm, with multiple masters
– Each master exchanges updates (“gossips”) with other replicas
when it reconnects to the network
– Conflicting updates require reconciliation (i.e. merging)
• In Lotus Notes, Access, SQL Server, Oracle, …
2/29/2012 26
Example of Conflicting Updates
• Assume all updates propagate via the primary
• Replicas end up in different states

time  Replica 1          Primary            Replica 2
 |    Initially x=0      Initially x=0      Initially x=0
 |    T1: X=1                               T2: X=2
 |    Send (X=1) →       X=1
 |                       Send (X=1) →       X=1
 |                       X=2                ← Send (X=2)
 v    X=2                ← Send (X=2)
      Final: x=2         Final: x=2         Final: x=1
2/29/2012 27
Thomas’ Write Rule
• To ensure replicas end up in the same state
– Tag each data item with a timestamp
– A transaction updates the value and timestamp of data
items (timestamps monotonically increase)
– An update to a replica is applied only if the update’s
timestamp is greater than the data item’s timestamp
– You only need timestamps of data items that were
recently updated (where an older update could still be
floating around the system)
• All multi-master products use some variation of this
• Robert Thomas, ACM TODS, June ’79
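Thomas' write rule can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `apply_update` and the `item -> (value, timestamp)` layout are hypothetical, timestamps are assumed to be unique and non-negative, and an item that has never been written is treated as having timestamp -1.

```python
def apply_update(replica, item, value, ts):
    """Apply an update only if it is newer than what the replica holds.

    replica: dict mapping item -> (value, timestamp).
    Returns True if the update was applied, False if it was discarded.
    """
    _, current_ts = replica.get(item, (None, -1))
    if ts > current_ts:
        replica[item] = (value, ts)
        return True
    return False  # an older update arrived late; ignore it


# Two replicas receive the same updates in opposite orders.
r1, r2 = {}, {}
apply_update(r1, "x", 1, ts=1)
apply_update(r1, "x", 2, ts=2)
apply_update(r2, "x", 2, ts=2)
apply_update(r2, "x", 1, ts=1)  # discarded: ts=1 is older than ts=2
print(r1 == r2, r1)  # True {'x': (2, 2)}
```

The replicas converge because the outcome depends only on the largest timestamp seen, not on the arrival order.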
2/29/2012 28
Thomas’ Write Rule ⇏ Serializability
• Replicas end in the same state, but neither T1 nor T2 reads
the other’s output, so the execution isn’t serializable.
• This requires reconciliation

Replica 1             Primary               Replica 2
                      Initially x=0, TS=0
T1: read x=0 (TS=0)                         T2: read x=0 (TS=0)
T1: X=1, TS=1                               T2: X=2, TS=2
Send (X=1, TS=1) →    X=1, TS=1
                      Send (X=1, TS=1) →    X=1, TS=1 discarded; stays X=2, TS=2
                      X=2, TS=2             ← Send (X=2, TS=2)
X=2, TS=2             ← Send (X=2, TS=2)
2/29/2012 29
Multi-Master Performance
• The longer a replica is disconnected and
performing updates, the more likely it will
need reconciliation
• The amount of propagation activity increases
with more replicas
– If each replica is performing updates,
the effect is quadratic in the number of
replicas
2/29/2012 30
Making Multi-Master Work
• Transactions
– T1: x++ {x=1} at replica 1
– T2: x++ {x=1} at replica 2
– T3: y++ {y=1} at replica 3
– Replica 2 and 3 already exchanged updates
• On replica 1
– Current state { x=1, y=0 }
– Receive update from replica 2 {x=1, y=1}
– Receive update from replica 3 {x=1, y=1}
2/29/2012 31
Making Multi-Master Work
• Time in a distributed system
– Emulate global clock
– Use local clock
– Logical clock
– Vector clock
• Dependency tracking metadata
– Per data item
– Per replica
– This could be bigger than the data
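Of the options listed above, a vector clock is the one that can tell a later update from a concurrent one. The following is a minimal sketch: the function name `compare` and the dictionary encoding (replica id to update counter) are assumptions made for illustration.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts replica_id -> counter.

    Returns "equal", "before" (a happened before b), "after",
    or "concurrent" (neither saw the other, so reconciliation is needed).
    """
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


t1 = {"replica1": 1}  # x++ at replica 1
t2 = {"replica2": 1}  # x++ at replica 2, without having seen T1
print(compare(t1, t2))  # concurrent
# A later update at replica 2, made after it received T1:
t4 = {"replica1": 1, "replica2": 2}
print(compare(t1, t4))  # before
```

With only the values {x=1} and {x=1}, replica 1 cannot tell whether replica 2's update is the same increment or a second one; the clocks show that T1 and T2 are concurrent, so both increments must be merged.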
2/29/2012 32
Microsoft Access and SQL Server
• Each row R of a table has 4 additional columns
– Globally unique id (GUID)
– Generation number, to determine which updates from
other replicas have been applied
– Version num = the number of updates to R
– Array of [replica, version num] pairs, identifying the
largest version num it got for R from every other replica
• Uses Thomas’ write rule, based on version nums
– Access uses replica id to break ties.
– SQL Server 7 uses subscriber priority or custom conflict
resolution.
2/29/2012 33
4. Other Approaches (1/2)
• Non-transactional replication using timestamped
updates and variations of Thomas’ write rule
– Directory services are managed this way
• Quorum consensus per-transaction
– Read and write a quorum of copies
– Each data item has a version number and timestamp
– Each read chooses a replica with largest version
number
– Each write sets the version number to one greater
than any it has seen
– No special work needed for a failure or recovery
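Per-transaction quorum consensus can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the list-of-dicts layout, and the choice of which copies form each quorum are hypothetical, and only the version number is used (the timestamp is omitted). It relies on the standard requirement that every read quorum overlaps every write quorum (r + w > n) and that any two write quorums overlap (2w > n), which the slide does not spell out.

```python
def quorum_read(copies, quorum):
    """Read from a quorum and return the value with the largest version.

    copies: list of {"version": int, "value": object}, one per replica.
    quorum: indexes of the replicas contacted for this read.
    """
    newest = max((copies[i] for i in quorum), key=lambda c: c["version"])
    return newest["value"], newest["version"]


def quorum_write(copies, quorum, value):
    """Write to a quorum with a version one greater than any seen there."""
    next_version = max(copies[i]["version"] for i in quorum) + 1
    for i in quorum:
        copies[i] = {"version": next_version, "value": value}
    return next_version


# n = 3 copies with r = w = 2, so r + w > n and 2w > n.
copies = [{"version": 0, "value": None} for _ in range(3)]
quorum_write(copies, [0, 1], "a")    # copy 2 is unreachable and stays stale
print(quorum_read(copies, [1, 2]))   # ('a', 1)
quorum_write(copies, [1, 2], "b")    # copy 0 is now the stale one
print(quorum_read(copies, [0, 1]))   # ('b', 2)
```

A stale copy needs no special recovery step: any quorum that includes it also includes a newer copy, which wins on version number.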
2/29/2012 34
Other Approaches 2/2
• Read-one replica, write-all-available replicas
– Requires careful management of failures and
recoveries
• E.g., Virtual partition algorithm
– Each node knows the nodes it can communicate
with, called its view
– Txn T can execute if its home node has a view
including a quorum of T’s readset and writeset
– If a node fails or recovers, run a view formation
protocol (much like an election protocol)
– For each data item with a read quorum, read the
latest version and update the others with smaller
version #.
2/29/2012 35
Summary
• State-of-the-art products have rich functionality.
– It’s a complicated world for app designers
– Lots of options to choose from
• Most failover stories are weak
– Fine for data warehousing
– For 24×7 TP, need better integration with
cluster node failover

Embracing GenAI - A Strategic Imperative
Peter Windle
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 

Recently uploaded (20)

2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 

10 replication

  • 1. 2/29/2012 1 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright ©2012 Philip A. Bernstein
  • 2. Outline
    1. Introduction
    2. Primary-Copy Replication
    3. Multi-Master Replication
    4. Other Approaches
    5. Products
  • 3. 1. Introduction
    • Replication - using multiple copies of a server or resource for better availability and performance.
      – Replica and Copy are synonyms
    • If you’re not careful, replication can lead to
      – worse performance - updates must be applied to all replicas and synchronized
      – worse availability - some algorithms require multiple replicas to be operational for any of them to be used
  • 4. Read-only Database
    [Diagram: one Database Server replicated as Database Servers 1, 2, 3]
    • T1 = { r[x] }
  • 5. Update-only Database
    [Diagram: one Database Server replicated as Database Servers 1, 2, 3]
    • T1 = { w[x=1] }
    • T2 = { w[x=2] }
  • 6. Update-only Database
    [Diagram: one Database Server replicated as Database Servers 1, 2, 3]
    • T1 = { w[x=1] }
    • T2 = { w[y=1] }
  • 7. Replicated Database
    [Diagram: Database Servers 1, 2, 3]
    • Objective
      – Availability
      – Performance
    • Transparency
      – 1 copy serializability
    • Challenge
      – Propagating and synchronizing updates
  • 8. Replicated Server
    • Can replicate servers on a common resource
      – Data sharing - DB servers communicate with shared disk
    [Diagram: Client; Server Replica 1 and Server Replica 2 sharing one Resource]
    • Helps availability for process (not resource) failure
    • Requires a replica cache coherence mechanism, so this helps performance only if
      – little conflict between transactions at different servers or
      – loose coherence guarantees (e.g. read committed)
  • 9. Replicated Resource
    • To get more improvement in availability, replicate the resources (too)
    • Also increases potential throughput
    • This is what’s usually meant by replication
    • It’s the scenario we’ll focus on
    [Diagram: two Clients; Server Replica 1 and Server Replica 2, each with its own Resource replica]
  • 10. Synchronous Replication
    • Replicas function just like a non-replicated resource
      – Txn writes data item x. System writes all replicas of x.
      – Synchronous – replicas are written within the update txn
      – Asynchronous – One replica is updated immediately. Other replicas are updated later
    • Problems with synchronous replication
      – Too expensive for most applications, due to heavy distributed transaction load (2-phase commit)
      – Can’t control when updates are applied to replicas
    [Diagram: Start, Write(x), Commit; Write(x) fans out to Write(x1), Write(x2), Write(x3) on replicas x1, x2, x3]
  • 11. Synchronous Replication - Issues
    • Assume replicas xA, xB of x and yC, yD of y
    [Diagram: history r1[xA] r2[yD] w2[xB] w1[yC], annotated "yD fails" and "xA fails"]
    • Not equivalent to a one-copy execution, even if xA and yD never recover!
    • DBMS products support it only in special situations
    • If you just use transactions, availability suffers.
    • For high availability, the algorithms are complex and expensive, because they require heavy-duty synchronization of failures.
    • … of failures? How do you synchronize failures?
  • 12. Atomicity & Isolation Goal
    • One-copy serializability (abbr. 1SR)
      – An execution of transactions on the replicated database has the same effect as a serial execution on a one-copy database.
    • Readset (resp. writeset) - the set of data items (not copies) that a transaction reads (resp. writes).
    • 1SR Intuition: the execution is SR and in an equivalent serial execution, for each txn T and each data item x in readset(T), T reads from the most recent txn that wrote into any copy of x.
    • To check for 1SR, first check for SR (using the serialization graph, SG), then see if there’s an equivalent serial history with the above property
  • 13. Atomicity & Isolation (cont’d)
    • Previous example was not 1SR. It is equivalent to
      – r1[xA] w1[yC] r2[yD] w2[xB] and
      – r2[yD] w2[xB] r1[xA] w1[yC]
      – but in both cases, the second transaction does not read its input from the previous transaction that wrote that input.
    • These are 1SR
      – r1[xA] w1[yD] r2[yD] w2[xB]
      – r1[xA] w1[yC] w1[yD] r2[yD] w2[xA] w2[xB]
    • The previous history is the one you would expect
      – Each transaction reads one copy of its readset and writes into all copies of its writeset
    • But it may not always be feasible, because some copies may be unavailable.
  • 14. Asynchronous Replication
    • Asynchronous replication
      – Each transaction updates one replica.
      – Updates are propagated later to other replicas.
    • Primary copy: Each data item has a primary copy
      – All transactions update the primary copy
      – Other copies are for queries and failure handling
    • Multi-master: Transactions update different copies
      – Useful for disconnected operation, partitioned network
    • Both approaches ensure that
      – Updates propagate to all replicas
      – If new updates stop, replicas converge to the same state
    • Primary copy ensures serializability, and often 1SR
      – Multi-master does not.
  • 15. 2. Primary-Copy Replication
    • Designate one replica as the primary copy (publisher)
    • Transactions may update only the primary copy
    • Updates to the primary are sent later to secondary replicas (subscribers) in the order they were applied to the primary
    [Diagram: transactions T1 (Start … Write(x1) ... Commit), T2, ..., Tn update the primary copy x1; updates flow to the secondaries x2, ..., xm]
  • 16. Update Propagation
    • Collect updates at the primary using triggers or by post-processing the log
      – Triggers: on every update at the primary, a trigger fires to store the update in the update propagation table.
      – Log post-processing: “sniff” the log to generate update propagations
    • Log post-processing (log sniffing)
      – Saves triggered update overhead during on-line txn.
      – But R/W log synchronization has a (small) cost
    • Optionally identify updated fields to compress log
    • Most DB systems support this today.
  • 17. Update Processing 1/2
    • At the replica, for each txn T in the propagation stream, execute a refresh txn that applies T’s updates to the replica.
    • Process the stream serially
      – Otherwise, conflicting transactions may run in a different order at the replica than at the primary.
      – Suppose log contains w1[x] c1 w2[x] c2. Obviously, T1 must run before T2 at the replica.
      – So the execution of update transactions is serial.
    • Optimizations
      – Batching: {w(x)} {w(y)} -> {w(x), w(y)}
      – “Concurrent” execution
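The serial refresh processing on slide 17 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any product's implementation: `RefreshTxn`, `SecondaryReplica`, `apply_stream`, and `batch` are hypothetical names, and the sketch assumes the primary tags each transaction with an id that increases in commit order.

```python
from dataclasses import dataclass, field


@dataclass
class RefreshTxn:
    """Updates of one committed primary transaction (hypothetical structure)."""
    txn_id: int   # assumed to increase in primary commit order
    writes: dict  # data item name -> new value


@dataclass
class SecondaryReplica:
    state: dict = field(default_factory=dict)
    last_applied: int = 0  # id of the last refresh txn applied

    def apply_stream(self, stream):
        # Process the stream serially, in the order the primary committed.
        for txn in stream:
            if txn.txn_id <= self.last_applied:
                continue  # already applied, e.g. resent after a restart
            self.state.update(txn.writes)  # stands in for one atomic refresh txn
            self.last_applied = txn.txn_id


def batch(stream):
    """Batching optimization: fold consecutive txns into one refresh txn."""
    merged = {}
    last_id = 0
    for txn in stream:
        merged.update(txn.writes)  # later writes overwrite earlier ones
        last_id = txn.txn_id
    return RefreshTxn(last_id, merged)
```

Batching yields the same final state as applying the transactions one by one, but queries at the replica no longer see the intermediate states.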
  • 18. Update Processing 2/2
    • To get a 1SR execution at the replica
      – Refresh transactions and read-only queries use an atomic and isolated mechanism (e.g., 2PL)
    • Why this works
      – The execution is serializable
      – Each state in the serial execution is one that occurred at the primary copy
      – Each query reads one of those states
    • Client view
      – Session consistency
  • 19. Request Propagation
    • An alternative to propagating updates is to propagate procedure calls (e.g., a DB stored procedure call).
    • Or propagate requests (e.g. txn-bracketed stored proc calls)
    • Requirements
      – Must ensure same order at primary and replicas
      – Determinism
    • This is often a txn middleware (not DB) feature.
    [Diagram: Call(SP1) is replicated to DB-A and DB-B; at each, SP1 runs Write(x) Write(y) and produces w[x] w[y] on x, y]
  • 20. Failure & Recovery Handling 1/3
    • Secondary failure - nothing to do till it recovers
      – At recovery, apply the updates it missed while down
      – Needs to determine which updates it missed, just like non-replicated log-based recovery
      – If down for too long, may be faster to get a whole copy
    • Primary failure
      – Normally, secondaries wait till the primary recovers
      – Can get higher availability by electing a new primary
      – A secondary that detects primary’s failure starts a new election by broadcasting its unique replica identifier
      – Other secondaries reply with their replica identifier
      – The largest replica identifier wins
  • 21. Failure & Recovery Handling 2/3
    • Primary failure (cont’d)
      – All replicas must now check that they have the same updates from the failed primary
      – During the election, each replica reports the id of the last log record it received from the primary
      – The most up-to-date replica sends its latest updates to (at least) the new primary.
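The election outcome described on slides 20 and 21 can be illustrated with a small sketch. It only computes the result once all replies are in; the broadcast itself, timeouts, and message loss are left out. The function name `elect_primary` and the shape of `replies` are assumptions made for illustration.

```python
def elect_primary(replies):
    """Pick the new primary and the replica that should supply missing updates.

    replies maps each responding replica's identifier to the id of the last
    log record it received from the failed primary (hypothetical format).
    """
    if not replies:
        raise ValueError("no replicas responded")
    new_primary = max(replies)                    # largest replica identifier wins
    most_current = max(replies, key=replies.get)  # holds the latest log record
    return new_primary, most_current


# Replica 3 wins the election, but replica 2 has seen more of the log,
# so replica 2 forwards its latest updates to replica 3.
new_primary, most_current = elect_primary({1: 40, 2: 42, 3: 41})
```

The two roles are separate on purpose: the winner is chosen by identifier, so it may still have to catch up before it can act as primary.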
  • 22. Failure & Recovery Handling 3/3
    • Primary failure (cont’d)
      – Lost updates
      – Could still lose an update that committed at the primary and wasn’t forwarded before the primary failed … but solving it requires synchronous replication (2-phase commit to propagate updates to replicas)
      – One primary and one backup
    • There is always a window for lost updates.
  • 23. Communications Failures
    • Secondaries can’t distinguish a primary failure from a communication failure that partitions the network.
    • If the secondaries elect a new primary and the old primary is still running, there will be a reconciliation problem when they’re reunited. This is multi-master.
    • To avoid this, one partition must know it’s the only one that can operate. It can’t communicate with other partitions to figure this out.
    • Could make a static decision. E.g., the partition that has the primary wins.
    • Dynamic solutions are based on Majority Consensus
  • 24. Majority Consensus
    • Whenever a set of communicating replicas detects a replica failure or recovery, they test if they have a majority (more than half) of the replicas.
    • If so, they can elect a primary
    • Only one set of replicas can have a majority.
    • Doesn’t work with an even number of copies.
      – Useless with 2 copies
    • Quorum consensus
      – Give a weight to each replica
      – The replica set that has a majority of the weight wins
      – E.g. 2 replicas, one has weight 1, the other weight 2
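The majority test and its weighted (quorum consensus) variant on slide 24 amount to one comparison. The sketch below is illustrative only; `has_quorum` and the replica names are hypothetical.

```python
def has_quorum(reachable, weights):
    """True if the reachable replicas hold more than half of the total weight.

    weights maps replica name -> weight; plain majority consensus is the
    special case where every weight is 1.
    """
    total = sum(weights.values())
    held = sum(weights[r] for r in reachable)
    return 2 * held > total  # strict majority, so two partitions cannot both win


three = {"A": 1, "B": 1, "C": 1}
two_equal = {"A": 1, "B": 1}
two_weighted = {"A": 1, "B": 2}
```

With `two_equal`, neither replica alone passes the test, which is why plain majority is useless with 2 copies. With `two_weighted`, the partition containing B can continue on its own.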
  • 25. 3. Multi-Master Replication
    • Some systems must operate when partitioned.
      – Requires many updatable copies, not just one primary
      – Conflicting updates on different copies are detected late
    • Classic example - salesperson’s disconnected laptop
      – Customer table (rarely updated)
      – Orders table (insert mostly)
      – Customer log table (append only)
      – So conflicting updates from different salespeople are rare
    • Use primary-copy algorithm, with multiple masters
      – Each master exchanges updates (“gossips”) with other replicas when it reconnects to the network
      – Conflicting updates require reconciliation (i.e. merging)
    • In Lotus Notes, Access, SQL Server, Oracle, …
  • 26. Example of Conflicting Updates
    • Assume all updates propagate via the primary
    [Timeline diagram, one column per site, time flowing downward]
      – Replica 1: Initially x=0; T1: X=1; Send (X=1); later receives X=2
      – Primary: Initially x=0; receives X=1; Send (X=1); receives X=2; Send (X=2)
      – Replica 2: Initially x=0; T2: X=2; Send (X=2); later receives X=1
    • Replicas end up in different states
  • 27. Thomas’ Write Rule
    • To ensure replicas end up in the same state
      – Tag each data item with a timestamp
      – A transaction updates the value and timestamp of data items (timestamps monotonically increase)
      – An update to a replica is applied only if the update’s timestamp is greater than the data item’s timestamp
      – You only need timestamps of data items that were recently updated (where an older update could still be floating around the system)
    • All multi-master products use some variation of this
    • Robert Thomas, ACM TODS, June ’79
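A minimal sketch of the rule on slide 27 follows. The class name `TimestampedReplica` is hypothetical, and the sketch assumes timestamps are globally unique and totally ordered; here they are modeled as `(counter, replica_id)` tuples, which is one common way to get that property but is not stated in the slides.

```python
class TimestampedReplica:
    """Keeps, per data item, the value and the timestamp of its latest update."""

    def __init__(self):
        self.items = {}  # item name -> (value, timestamp)

    def apply(self, name, value, ts):
        """Apply an update only if it is newer than what the replica holds."""
        current = self.items.get(name)
        if current is None or ts > current[1]:
            self.items[name] = (value, ts)
            return True
        return False  # stale update: discard it


# The same two updates arrive in opposite orders at two replicas.
r1, r2 = TimestampedReplica(), TimestampedReplica()
r1.apply("x", 1, (1, 1))
r1.apply("x", 2, (2, 2))
r2.apply("x", 2, (2, 2))
late = r2.apply("x", 1, (1, 1))  # rejected, since (1, 1) < (2, 2)
```

Both replicas end with x=2, so they converge regardless of delivery order. The update x=1 is silently overwritten, which is why the rule alone does not give serializability.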
  • 28. Thomas’ Write Rule Does Not Imply Serializability
    [Timeline diagram, one column per site, time flowing downward]
      – Replica 1: T1: read x=0 (TS=0); T1: X=1, TS=1; Send (X=1, TS=1); later receives X=2, TS=2
      – Primary: Initially x=0, TS=0; receives X=1, TS=1; Send (X=1, TS=1); receives X=2, TS=2; Send (X=2, TS=2)
      – Replica 2: T2: read x=0 (TS=0); T2: X=2, TS=2; Send (X=2, TS=2); later receives X=1, TS=1 and discards it because its timestamp is older
    • Replicas end in the same state, but neither T1 nor T2 reads the other’s output, so the execution isn’t serializable.
    • This requires reconciliation
  • 29. Multi-Master Performance
    • The longer a replica is disconnected and performing updates, the more likely it will need reconciliation
    • The amount of propagation activity increases with more replicas
      – If each replica is performing updates, the effect is quadratic in the number of replicas
  • 30. Making Multi-Master Work
    • Transactions
      – T1: x++ {x=1} at replica 1
      – T2: x++ {x=1} at replica 2
      – T3: y++ {y=1} at replica 3
      – Replica 2 and 3 already exchanged updates
    • On replica 1
      – Current state { x=1, y=0 }
      – Receive update from replica 2 {x=1, y=1}
      – Receive update from replica 3 {x=1, y=1}
  • 31. Making Multi-Master Work
    • Time in a distributed system
      – Emulate global clock
      – Use local clock
      – Logical clock
      – Vector clock
    • Dependency tracking metadata
      – Per data item
      – Per replica
      – This could be bigger than the data
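Of the options listed on slide 31, vector clocks are the one that lets a replica tell whether two updates are ordered or concurrent. The sketch below shows the standard comparison and merge, with a vector clock modeled as a dict from replica name to counter; the function names `compare` and `merge` are hypothetical.

```python
def compare(a, b):
    """Relate two vector clocks: 'equal', 'before', 'after', or 'concurrent'."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither saw the other: a potential conflict


def merge(a, b):
    """Entry-wise maximum, taken when a replica receives another's clock."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}
```

Two updates whose clocks compare as "concurrent" were made without knowledge of each other, so they are the ones that may need reconciliation. The metadata cost is one counter per replica, which is how it can end up bigger than the data.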
  • 32. Microsoft Access and SQL Server
    • Each row R of a table has 4 additional columns
      – Globally unique id (GUID)
      – Generation number, to determine which updates from other replicas have been applied
      – Version num = the number of updates to R
      – Array of [replica, version num] pairs, identifying the largest version num it got for R from every other replica
    • Uses Thomas’ write rule, based on version nums
      – Access uses replica id to break ties.
      – SQL Server 7 uses subscriber priority or custom conflict resolution.
  • 33. 4. Other Approaches (1/2)
    • Non-transactional replication using timestamped updates and variations of Thomas’ write rule
      – Directory services are managed this way
    • Quorum consensus per-transaction
      – Read and write a quorum of copies
      – Each data item has a version number and timestamp
      – Each read chooses a replica with largest version number
      – Each write sets the version number to one greater than any it has seen
      – No special work needed for a failure or recovery
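The per-transaction quorum scheme on slide 33 could be sketched as below. The names `Copy`, `quorum_read`, and `quorum_write` are hypothetical, and the sketch assumes the usual quorum sizing for n copies (read quorum + write quorum > n, and 2 × write quorum > n) so that every read quorum overlaps every write quorum; the slide does not spell out these sizes. Locking and timestamps are omitted.

```python
class Copy:
    """One replica of a single data item."""

    def __init__(self):
        self.value = None
        self.version = 0


def quorum_read(reachable, read_quorum):
    """Read a quorum of copies and return the value with the largest version."""
    if len(reachable) < read_quorum:
        raise RuntimeError("read quorum not available")
    newest = max(reachable[:read_quorum], key=lambda c: c.version)
    return newest.value


def quorum_write(reachable, write_quorum, value):
    """Write a quorum of copies with a version one greater than any seen."""
    if len(reachable) < write_quorum:
        raise RuntimeError("write quorum not available")
    targets = reachable[:write_quorum]
    new_version = max(c.version for c in targets) + 1
    for c in targets:
        c.value, c.version = value, new_version


# Three copies, read quorum 2, write quorum 2.
copies = [Copy(), Copy(), Copy()]
quorum_write([copies[0], copies[1]], 2, "a")      # copy 2 is unreachable
first = quorum_read([copies[1], copies[2]], 2)    # overlaps the write at copy 1
quorum_write([copies[2], copies[1]], 2, "b")      # copy 0 is unreachable
second = quorum_read([copies[0], copies[2]], 2)   # copy 0 is stale, copy 2 is not
```

A stale copy is never repaired explicitly here; it is simply outvoted by a higher version number, which matches the slide's point that no special work is needed at failure or recovery.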
  • 34. Other Approaches 2/2
    • Read-one replica, write-all-available replicas
      – Requires careful management of failures and recoveries
    • E.g., Virtual partition algorithm
      – Each node knows the nodes it can communicate with, called its view
      – Txn T can execute if its home node has a view including a quorum of T’s readset and writeset
      – If a node fails or recovers, run a view formation protocol (much like an election protocol)
      – For each data item with a read quorum, read the latest version and update the others with smaller version #.
  • 35. Summary
    • State-of-the-art products have rich functionality.
      – It’s a complicated world for app designers
      – Lots of options to choose from
    • Most failover stories are weak
      – Fine for data warehousing
      – For 24×7 TP, need better integration with cluster node failover