This document provides an overview of advanced transaction management techniques. It discusses mixing heterogeneous transaction managers, high availability commit and transfer of commit protocols, optimizing commit processes, and disaster protection through data and application replication. Specific topics covered include system pairing, logical logging, session takeover, and using replication for fault tolerance and high availability.
3.
Mixing Transaction Managers
Four standards:
LU 6.2 ~ APPC ~ CPIC ~ CICS: The de facto TP standard.
X/Open + OSI/TP : The de jure TP standard.
OTS: The CORBA standard
TIP: De facto interoperability standard
Almost everyone interoperates with LU6.2
LU6.2 has evolved to have presumed abort, no reuse of
aborted trids, and other fixes.
LU6.2 is "open" two phase commit, documented
interface, reconnection / resolve is documented.
Internally, everyone uses private protocols with many
tricks.
4.
Mixing "OLD" Transaction Managers
Many old TP monitors are not open:
Do not expose 2PC (prepare() and commit())
=> insist on being root commit coordinator.
All will become X/Open-compliant eventually and thus
be open TP monitors.
If stuck with a "closed" TM:
Can still get atomicity if:
1. Only one closed TM is involved.
2. The TM is direct, not queued.
5.
Mixing with a Closed Transaction Manager
All "open" TMs and RMs prepare first; the closed TM then does the "rump" (the remaining work).
deferred_update(int trid, complex_type list_of_updates) /* rump logic */
{ Begin_Work(); /* start a new transaction */
  select count(*) into :n from done where trid = :trid; /* test if work was done */
  if (n == 0) { /* if not done */
    do list_of_updates; /* then do the list of updates */
    insert into done values (:trid); /* and flag transaction done */
  }
  Commit_Work(); /* commit updates and flag */
  acknowledge; /* reply success to caller in both cases */
}
Status_Transaction(TRID trid) /* was the work for this trid done? */
{ select count(*) into :ans from done where trid = :trid; return ans; }
[Figure: the transaction gateway and the closed transaction manager, which holds the done table.
Gateway: do transaction; while not acknowledged: send trid + data, wait.
Closed TM: if not a duplicate: do transaction, insert trid in done table, commit; then acknowledge.]
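The rump logic above can be sketched as a runnable example. This is a minimal illustration, not the deck's implementation: it assumes SQLite as the closed system's store, and the `account` table, `make_demo_db`, and the shape of `updates` (a list of SQL statement and parameter pairs) are hypothetical names chosen for the demo.

```python
import sqlite3

def deferred_update(conn, trid, updates):
    """Apply `updates` at most once for `trid`; acknowledge in both cases."""
    with conn:  # one local transaction: commit on success, roll back on error
        already_done = conn.execute(
            "SELECT 1 FROM done WHERE trid = ?", (trid,)
        ).fetchone()
        if already_done is None:
            for sql, params in updates:      # do the list of updates
                conn.execute(sql, params)
            # flag the transaction as done, atomically with the updates
            conn.execute("INSERT INTO done (trid) VALUES (?)", (trid,))
    return "ack"

def status_transaction(conn, trid):
    """Return 1 if the work for `trid` was committed, else 0."""
    row = conn.execute(
        "SELECT COUNT(*) FROM done WHERE trid = ?", (trid,)
    ).fetchone()
    return row[0]

def make_demo_db():
    """Hypothetical demo schema: a done table plus one account row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE done (trid INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO account VALUES (1, 100)")
    conn.commit()
    return conn
```

Because the done flag commits in the same local transaction as the updates, the gateway can resend the same trid until it sees an acknowledgment without the work being applied twice.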
6.
Mixing Open Transaction Managers
Gateway translates between external and internal TRID.
Gateway translates between external and internal protocols
Participates in transaction resolution (is a TM in both worlds)
[Figure: the transaction gateway sits between "our" transaction manager (local protocol) and the "foreign" transaction managers (OSI protocol stack). Its trid map table pairs each foreign trid ("his trid") with a local one ("our trid").]
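The trid map table can be pictured as a two-way dictionary. The sketch below is only an illustration under assumed names (`TridMap`, `local_for`, `foreign_for`, and the `local-N` identifier format are all hypothetical); a real gateway would also keep this table durable so that it survives restart and can take part in transaction resolution.

```python
import itertools

class TridMap:
    """Two-way map between foreign ("his") and local ("our") trids."""

    def __init__(self):
        self._to_local = {}
        self._to_foreign = {}
        self._next = itertools.count(1)

    def local_for(self, foreign_trid):
        # Reuse the existing local trid if this foreign trid was seen before,
        # so repeated arrivals at this gateway join the same local transaction.
        if foreign_trid not in self._to_local:
            local = f"local-{next(self._next)}"
            self._to_local[foreign_trid] = local
            self._to_foreign[local] = foreign_trid
        return self._to_local[foreign_trid]

    def foreign_for(self, local_trid):
        # Used when the gateway must speak to the foreign TM about our trid.
        return self._to_foreign[local_trid]
```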
7.
Mixing Open Transaction Managers
Multiple entry problem:
The same TRID enters the system twice, by two different paths.
"Works", but looks like two separate transactions.
The commit dependency is external to the system.
Fancy option problem:
External/internal TM has an option the other does not.
Gateway fakes (or turns off) optimizations/options not supported
by one side or the other.
9.
Non-Blocking Commit
The problem: what if the coordinator fails?
Solutions: 1. wait
2. appoint a new coordinator
Appointment can be thought of as a process pair (n-plex)
Works great in a cluster (no communications failures).
[Figure: message flow among the Primary and Backup coordinator (a process pair), the Participants, and the Log.]
Primary to Backup: Prepare (+ list of participants and sessions); Backup acks.
Primary to Participants: Prepare; Participants reply Prepared.
Primary to Backup: Commit; Backup acks.
Primary to Participants: Commit; Participants reply Committed.
Primary to Backup: Complete; Backup acks.
Log: the primary writes the Commit log record, and later the "Complete" log record.
10.
Non-Blocking Commit in a WAN:
3ϕ or Heuristic or Operator Command
Wide area net can partition
Process pairs cannot reliably decide to take over.
Solution(s):
1. Three phase protocol
Broadcast participant list and decision as part of phase 1.5;
let a majority of participants decide if the coordinator fails.
2. Heuristic decisions
Default to commit/abort.
Announce Heuristic Mismatch at reconnect if the guess was wrong.
3. Human decision
Announce Operator Mismatch at reconnect if the guess was wrong.
11.
Transfer of Commit
What if a participant
is more secure than the coordinator?
is more reliable than the coordinator?
is faster than the coordinator?
Then transfer commit authority to that participant.
[Figure: two copies of the chain Gas Pump, LA Bank, Visa, SF Bank, illustrating transfer of commit authority.]
12.
Transfer of Commit
Is also an optimization:
saves messages if done as part of commit.
called nested commit protocol
or last resource manager optimization
2 messages vs 5 messages (plus one lazy msg)
[Figure: two message-flow panels, "No Transfer of Commit" and "Transfer of Commit". Labels in both panels: work request, Begin, Dequeue, doit, Enqueue, Commit_Work(), Prepare, Commit, Phase 2 Commit, complete. In the transfer panel the work request carries "+ You are Root!".]
13.
Transfer of Commit: More Complex Case
More complex if the root has more than one branch:
Need to set up new sessions among "trusted" nodes
root sends new root name to all participants at phase 1
[Figure: commit tree spanning nodes in Libya, the US, and Deutschland.]
15.
Optimizing Commit
Can optimize:
Delay: milliseconds/commit
Message cost: number, size, urgency of messages
IO cost: number, size, or urgency of IO
CPU cost: cycles used
Throughput: maximum commit rate.
16.
Commit: the General Case
Prepare(): 1 rpc or message pair per RM
and one per non-root TM
1 forced IO per RM (prepare record)
1 forced IO per TM(commit record)
Commit(): The same.
Summary of 2PC cost:
IO: 2(RM+TM)
RPCs: 2(RM+(TM-1))
Messages: 4(RM+(TM-1)) (equivalent to RPCs)
Delay: 2 IOs ~ 50 ms ~ 10K instructions
4 messages ~ 20 ms ~ 50K instructions
Total: 50ms*(RM+TM) + 20ms*(RM+TM-1)
These are the error-free counts (i.e. the minimum values)
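The cost summary above is easy to tabulate. The function below is a small sketch that simply evaluates the slide's formulas; the name `two_phase_commit_cost` and the dictionary keys are illustrative, and the 50 ms and 20 ms constants are the slide's own rough figures.

```python
def two_phase_commit_cost(rm, tm):
    """Error-free 2PC cost for `rm` resource managers and `tm` transaction managers."""
    peers = rm + tm - 1                 # everyone the root TM must talk to
    return {
        "forced_ios": 2 * (rm + tm),    # IO: 2(RM+TM)
        "rpcs": 2 * peers,              # RPCs: 2(RM+(TM-1))
        "messages": 4 * peers,          # each RPC is a message pair
        "delay_ms": 50 * (rm + tm) + 20 * peers,
    }
```

For example, `two_phase_commit_cost(2, 1)` gives 6 forced IOs, 4 RPCs, 8 messages, and a serial delay of 190 ms, which is the baseline the optimizations on the following slides reduce.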
17.
Commit: Simple Optimizations
Presumed abort saves a TM IO (implicit in protocol above)
Do phase 1, phase2 in parallel (saves delay)
Common log (saves RM log forces)
IO: 2(TM)
Messages: 4(RM+TM-1) (equivalent to RPCs)
Delay: 2*IO*TM + 4*M*(RM+TM-1)
~50ms*TM+40ms*(RM+TM-1)
Use Local RPC (10x faster)
~50ms*TM + RM+40ms*(TM-1)
Use WADS for low IO latency(3ms vs 25ms)
~ 6ms*TM + RM + 40ms*(TM-1)
Simple case of 1 TM 2 RM:
~ 8ms delay for a commit.
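As a check on the arithmetic, the last delay formula can be evaluated directly. This sketch assumes the bare "RM" term means roughly 1 ms per local RM call, which is the reading that reproduces the 8 ms figure; the function and parameter names are illustrative.

```python
def optimized_commit_delay_ms(rm, tm, io_ms=6, local_rpc_ms=1, remote_ms=40):
    """Evaluate io_ms*TM + local_rpc_ms*RM + remote_ms*(TM-1).

    io_ms=6 is the WADS figure; pass io_ms=50 for the ordinary-disk formula.
    local_rpc_ms=1 is an assumption inferred from the 8 ms result.
    """
    return io_ms * tm + local_rpc_ms * rm + remote_ms * (tm - 1)

# Simple case from the slide: 1 TM, 2 RMs.
simple_case = optimized_commit_delay_ms(rm=2, tm=1)
```

With one TM there are no remote messages, so the delay is 6 + 2 + 0 = 8 ms; each additional TM adds another log force and a 40 ms message exchange.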
18.
Group Commit Optimization
Amortizes IO and messages across several transactions
Adds delay
If N transactions in a group:
IO, Message cost per transaction is ~ 1/N
Small extra delay if one slow step in original path.
As the system heats up (commit rate rises) to 25 tps,
start to install group commit with a 30 ms threshold
(at 100 tps: about 3 transactions per group).
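The group size follows from the arrival rate and the threshold. The sketch below assumes a simple model in which a group is everything that arrives during one threshold interval, with at least one transaction per group; `group_commit_stats` is an illustrative name.

```python
def group_commit_stats(tps, threshold_ms):
    """Return (transactions per group, relative IO/message cost per transaction)."""
    # Transactions arriving within one threshold window share one log force.
    group_size = max(1.0, tps * threshold_ms / 1000.0)
    return group_size, 1.0 / group_size
```

Under this model, 100 tps with a 30 ms threshold gives about 3 transactions per group, so each transaction pays roughly a third of the IO and message cost, while at 25 tps the window rarely holds more than one transaction and grouping buys nothing yet.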
19.
Simple Commit Optimizations
Read-only: RM just gets the phase 1 call and releases its locks.
Saves messages (no phase 2) and IO (no RM IO).
Note: may violate ACID if any locks are acquired during phase 1;
in that case read locks should be released at phase 2.
A true read-only transaction must prepare at phase 1
and unlock at phase 2.
Unjoin: RM does no work at commit/abort.
Lazy: user-requested group commit. Piggybacks on others.
no extra IO or messages.
20.
Transaction Commit Trees
[Figure: four commit tree shapes built from TM/RM nodes, each labeled with its applicable optimizations.
one node: share log, LRPC
deep: transfer commit
bush: parallel transfer
general case: parallel transfer]
21.
Transfer of COMMIT: Linear COMMIT
Parent and other sub-trees prepare
then transfer commit authority to remaining child.
Last in chain becomes commit coordinator.
More delay, fewer messages
For N=2, Same delay, 3 vs 4 messages.
Always use it.
[Figure: chains of TM/RM nodes illustrating linear commit.]
23.
Disaster Recovery at a Remote Site
Replicate data, applications, and network connections
at two (or more) sites.
Symmetric design:
Either site can process transactions
Asymmetric design:
One site is master of each data item.
Allows: Caching
Batching of updates at backup
So far, asymmetric design is most popular.
To get symmetry, have each node master 1/2 of the db/net.
24.
System Pairs
Basic idea of asymmetric design:
send log from primary to backup
backup applies log to its copy
backup is in constant media recovery
backup processes/sessions/data ready to take over
[Figure: four system pair configurations.
Basic idea: clients hold sessions to a system pair; the primary sends its log to the backup.
Symmetric: two system pairs, each site primary for one and backup for the other.
Hub: a central site backs up several primaries.
Vault: the backup stores the log and archive dumps.]
25.
System Pair Design Issues
Need some way to decide failure.
Easy in a cluster
Hard in a WAN (partition possible)
Solutions: Extra wires
Wires on demand (dialup)
Human (operator)
Quorum device.
Kind of log?
Logical log is best:
loose coupling (allows backup to be a different TM/RM)
failure independence (unlike a physiological log)
26.
Takeover Logic
/* initialization */
Tell primary I'm here
Set up all RMs and application processes
Open all initial sessions to clients
/* the main backup loop */
While (not primary) {redo log}
/* takeover */
Redo rest of log
Resend most recent message on each session
Abort any incomplete transactions
/* become primary */
Tell application processes to start accepting requests
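The takeover steps can be sketched as a small class. This is an illustration under stated assumptions, not the deck's design: log records are hypothetical tuples (`update`, `commit`, `abort`, `send`), and the backup buffers each transaction's updates until its commit record arrives, so aborting an incomplete transaction just discards the buffer.

```python
class BackupNode:
    """Backup that redoes a logical log and can take over as primary."""

    def __init__(self):
        self.db = {}           # backup's copy of the data
        self.pending = {}      # trid -> [(key, value)] not yet committed
        self.last_msg = {}     # session -> most recent message sent to client
        self.is_primary = False

    def redo(self, rec):
        kind = rec[0]
        if kind == "update":
            _, trid, key, value = rec
            self.pending.setdefault(trid, []).append((key, value))
        elif kind == "commit":
            for key, value in self.pending.pop(rec[1], []):
                self.db[key] = value
        elif kind == "abort":
            self.pending.pop(rec[1], None)
        elif kind == "send":
            _, session, msg = rec
            self.last_msg[session] = msg

    def takeover(self, rest_of_log):
        for rec in rest_of_log:                 # redo rest of log
            self.redo(rec)
        resent = sorted(self.last_msg.items())  # resend last message per session
        aborted = sorted(self.pending)          # abort incomplete transactions
        self.pending.clear()
        self.is_primary = True                  # start accepting requests
        return resent, aborted
```

The main backup loop is then just repeated calls to `redo` while the primary is alive, followed by one call to `takeover` when a failure is declared.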
27.
Session Takeover
Just like process pairs.
Resending the most recent message at takeover gives at-least-once delivery.
Session sequence numbers eliminate the duplicates.
[Figure: two configurations. In one, clients reach the Primary and Backup through network switches; in the other, through front ends and a switch. Both run over OSI, SNA, TCP/IP, X.25, etc.]
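Duplicate elimination with session sequence numbers can be sketched from the receiver's side. The class below is a hypothetical illustration (`SessionReceiver` and its fields are not from the deck): it accepts only the next expected sequence number, so a message resent at takeover is recognized and dropped.

```python
class SessionReceiver:
    """Receiver end of a session that delivers each message once, in order."""

    def __init__(self):
        self.expected = 1      # next sequence number to accept
        self.delivered = []    # messages handed to the application

    def receive(self, seq, msg):
        if seq != self.expected:
            # seq < expected: duplicate from a resend, already delivered.
            # seq > expected: a gap; ignore and let the sender resend.
            return False
        self.delivered.append(msg)
        self.expected += 1
        return True
```

The new primary can therefore resend blindly after takeover: if the old primary's message had already arrived, the second copy carries an old sequence number and is discarded.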
28.
Catch-up After Failure
Failed node at restart executes normal restart
Then enters backup logic.
If both fail, an outside observer must say which node is best.
The backup then has to match its log to the new primary.
Design issue: are nodes bit-for-bit identical?
If so, backup must “trim” log to match primary.
29.
How Safe?
1-SAFE: no extra delay, risks lost transactions
2-SAFE: extra delay (if backup up),
single fault tolerant, high availability
VERY-SAFE: extra delay, no lost transactions
low availability
[Figure: commit message flows among client, primary, and backup for 1-Safe, 2-Safe, and Very Safe, each shown for two cases: Both Up, and Primary Up, Backup Down. With the backup down, the Very Safe client is told "out of service" instead of "ok".]
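The three safety levels differ only in whether the primary waits for the backup and what it does when the backup is down. The function below is a minimal sketch of that decision; `commit_policy` and the mode strings are illustrative names.

```python
def commit_policy(mode, backup_up):
    """Return (reply to client, whether commit waits for the backup)."""
    if mode == "1-safe":
        # Reply at once; the log reaches the backup later, so a primary
        # failure in between can lose the transaction.
        return ("ok", False)
    if mode == "2-safe":
        # Wait for the backup only when it is up; otherwise carry on alone.
        return ("ok", backup_up)
    if mode == "very-safe":
        # Never commit on one site only: refuse service if the backup is down.
        return ("ok", True) if backup_up else ("out of service", False)
    raise ValueError(f"unknown mode: {mode}")
```

This makes the trade-off visible: 2-safe degrades to single-site operation to stay available, while very-safe gives up availability to guarantee that no committed transaction is ever lost.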
30.
System Pairs vs Replicated Data
System pairs replicate the application
DB
application processes
sessions
Data replicators only replicate data.
Other aspects left as an exercise for the
application designer.
31.
System Pair Benefits
Tolerates faults
Hardware
Environment
Operations
Heisenbugs
Can replace software/hardware online
Can move backup to new building or...
Allows design diversity: backup can be completely different
Step 1: Both systems are running version V1. (Primary V1, Backup V1)
Step 2: Backup is cold-loaded as version V2. (Primary V1, Backup V2)
Step 3: SWITCH to Backup. (Backup V1, Primary V2)
Step 4: Backup is cold-loaded as version V2. (Backup V2, Primary V2)