MySQL Replication Consistency with Reduced Durability
1. Consistency between Engine and Binlog under Reduced Durability
Yoshinori Matsunobu
Production Engineer, Facebook
Jan/2020
2. What we want to do
▪ When slave or master instances fail and recover, we want to make
them rejoin the replication chain (replica set), instead of dropping and
rebuilding
▪ Imagine a 10-minute network outage in one Availability Zone; we want to
recover the MySQL instances in that AZ
3. Agenda
▪ When binlog and storage engine consistency gets broken
▪ What can go wrong on restarting replica
▪ What can go wrong on restarting master
▪ Challenges to support multiple transactional storage engines
4. Consistency between binlog and engine
▪ MySQL separates Replication logs (Binary Logs) and Transactional Storage Engine logs
(InnoDB/MyRocks/NDB)
▪ Internally handles XA
▪ Commit ordering:
▪ Binlog Prepare (doing nothing)
▪ Engine Prepare (in parallel)
▪ Binlog Commit (ordered)
▪ Engine Commit (ordered, if binlog_order_commits==ON)
▪ If MySQL instance or host dies in between, Engine and Binlog might become inconsistent
▪ The possibility of inconsistency is greater when operating with reduced durability (sync_binlog != 1 and
innodb_flush_log_at_trx_commit != 1)
▪ The binlog may lose some events whose transactions were persisted in the engine
▪ The engine may lose some transactions that were persisted in the binlog
▪ This talk is about how to address consistency issues under reduced durability
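In option terms, "reduced durability" means relaxing the per-commit fsyncs. A minimal sketch of such a my.cnf fragment (values are illustrative, not a recommendation):

```ini
# Reduced durability: fewer fsyncs per commit; the tail of the
# binlog / redo log can be lost on power loss or OS crash.
sync_binlog                    = 0   # binlog flushed by the OS, not per commit
innodb_flush_log_at_trx_commit = 2   # redo written per commit, fsynced ~once a second
binlog_order_commits           = ON  # keep engine commit order aligned with binlog order
```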
5. 5.6 Single Threaded Slave, Binlog < Engine
▪ An unplanned OS reboot on a slave may leave the Binlog GTID sets and the
Engine state inconsistent
▪ The big question is whether the slave can continue replication via START
SLAVE, without replacing it entirely
▪ Transactional Storage Engines (both InnoDB and MyRocks) store
last committed GTID, and it is visible from
mysql.slave_relay_log_info table. This table is updated for each
commit to the engine
▪ With Single Threaded Slave, you don’t have to think about out of
order execution
▪ Run with relay_log_recovery=1
▪ The slave discards its relay logs and restarts replication from the master
at the engine's max GTID position
▪ Skips execution in engine if GTID < slave_relay_log_info
▪ Skips writing binlog events if binlog GTID overlaps
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 99
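The engine-side GTID described above can be read with a plain query. Note that exposing the last committed GTID in mysql.slave_relay_log_info is behavior of Facebook's 5.6 branch (upstream's table holds relay-log coordinates), so the exact column layout is an assumption here:

```sql
-- Replica: last GTID committed to the engine, updated on each commit
-- (FB 5.6 branch behavior; columns may differ in other builds)
SELECT * FROM mysql.slave_relay_log_info;
```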
6. 5.6 Single Threaded Slave, Binlog > Engine
▪ Replication will continue from GTID 95 or earlier
▪ Engine GTIDs 96-98 are re-executed, but their binlog events are not
rewritten (the binlog already has them)
▪ Normal replication flow continues from GTID 99 onward
Replica: Binlog GTID 1-98, Engine Max GTID 95
Master: GTID 1-100
7. Multi Threaded Slave
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 95
▪ mysql.slave_relay_log_info stores only the max executed GTID on the
instance
▪ Under parallel execution, MySQL has no idea whether GTID 94 is in the
engine or not
▪ The execution order might be 91 -> 92 -> 95
▪ In upstream 5.6, consistency can't be guaranteed
8. 5.7 gtid_executed table
Replica: Binlog GTID 1-98, gtid_executed table 1-93, 95-98
Master: GTID 1-100
▪ The 5.7 gtid_executed table stores GTID sets in InnoDB
(crash safe)
▪ However, executed GTIDs are not updated on each
commit
▪ The table is updated on binlog rotation
▪ If it were updated on each commit, you could tell whether
GTID 94 is in the engine or not (you can't right now)
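In 5.7 the persisted set can be inspected directly; a gap such as the missing GTID 94 above shows up as split intervals for the same source UUID. A sketch:

```sql
-- 5.7+: GTID sets persisted crash-safely in InnoDB.
-- A missing transaction (94) appears as two intervals,
-- e.g. (uuid, 1, 93) and (uuid, 95, 98).
SELECT source_uuid, interval_start, interval_end
FROM mysql.gtid_executed
ORDER BY source_uuid, interval_start;
```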
9. FB Extension: Slave Idempotent Recovery
▪ Starts replication from an old enough binlog GTID
▪ Re-executes binlog events against the engine, ignoring
all duplicate key / row not found errors during
catch-up
▪ Eventual consistency
▪ Must use RBR, and tables must have primary keys
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine GTID state empty
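Slave Idempotent Recovery itself is an FB extension, but upstream MySQL exposes the same error-swallowing behavior for the applier through slave_exec_mode. A sketch of the upstream analogue (not the FB feature):

```sql
-- Upstream analogue: ignore duplicate-key and row-not-found errors
-- while applying row-based events (requires RBR).
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';
START SLAVE SQL_THREAD;
```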
10. What can go wrong when restarting master
▪ The master may go down unexpectedly for various reasons
▪ Segfault (SIG=11), assertion (SIG=6), forced kill (SIG=9), out of
memory
▪ Kernel panic
▪ Power outage, then restarted after a while
▪ Nowadays dead master promotion kicks in (Orchestrator, MHA)
▪ The question is whether the failed master can restart replication from the new master
▪ The dead master may come back before dead master promotion completes
▪ If the master lost transactions that were already replicated, replicas may
not be able to continue replication
11. Master Promotion happening, Binlog < Engine
▪ “Loss-Less Semi-Synchronous Replication” guarantees semisync tailer gets binlog events before master engine commit (so Engine on
orig master <= Binlog/Engine on new master)
▪ You need to start replication from the last GTID in the engine
▪ In this case, GTID Executed Sets in master is 1-98, but replication should start after 99
▪ Master’s engine execution order is serialized (with binlog-order-commit=1) so its’ guaranteed 1~99 are in engine
▪ However, this information is not visible from MySQL commands (only printed in err log)
▪ Feature Request to Oracle: InnoDB should add information_schema to print current committed last GTID, binlog file and position
▪ With Slave Idempotent Recovery, fetching last committed GTID can be skipped so automation can be more simplified.
Before failover:
Instance 1 (Master): Binlog 1-98, Engine 1-99
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After failover:
Instance 1 (Dead): Binlog 1-98, Engine 1-99
Instance 2 (Replica): Binlog 1-98
Instance 3 (Master): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
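Once the engine's last GTID (1-99 here) has been read from the error log, rejoining the failed master under the new master can be sketched as follows; the host, user, and UUID are placeholders:

```sql
-- Recovered instance 1: declare 1-99 as already applied, then
-- resume with GTID auto-positioning from the new master.
RESET MASTER;
SET GLOBAL gtid_purged = '<server_uuid>:1-99';   -- placeholder UUID
CHANGE MASTER TO
  MASTER_HOST = 'instance3.example.com',         -- placeholder host
  MASTER_USER = 'repl',                          -- placeholder user
  MASTER_AUTO_POSITION = 1;
START SLAVE;
```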
12. Master Promotion happening, Binlog > Engine
Before failover:
Instance 1 (Master): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After failover:
Instance 1 (Dead): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Master): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
“100” should be discarded before replicating from the new master (instance 3)
InnoDB: Last binlog file position 79143, file name binlog.000005
InnoDB: Last MySQL Gtid UUID:98
▪ Binlog GTID 100 exists on instance 1 only, and was not acked to the client (with loss-less semisync)
▪ If the original master (instance 1) applies Binlog 100, it can't rejoin as a replica
▪ We need a way not to apply GTID 100 during recovery
13. FB Extension: Server Side Binlog Truncation
▪ At instance startup, truncate binlog events that don't exist in the storage
engine
▪ The end of the binlog position becomes the same as or smaller than the engine's last committed GTID
▪ The original binlog file is retained as a backup
▪ All prepared transactions in the storage engines are rolled back
14. Master Promotion not happening
Before recovery:
Instance 1 (Master): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After recovery:
Instance 1 (Recovered): Binlog 1-98, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
▪ An unplanned reboot on the master may lose transactions that were already replicated to slaves
▪ Instance 1 should not serve write requests until it catches up to Binlog GTID 99 from instance 3
15. Common Replica errors
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from
binary log: 'Slave has more GTIDs than the master has, using the
master's SERVER_UUID. This may indicate that the end of the binary log
was truncated or that the last binary log file was lost, e.g., after a
power or disk failure when sync_binlog != 1. The master may or may not
have rolled back transactions that were already replica’
▪ Set read_only=1 by default
▪ Find the most advanced slave, catch up from there, then start serving write requests
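Finding the most advanced slave can be scripted with the standard GTID set functions; a sketch, where the quoted sets are placeholders collected from each instance:

```sql
-- Run on every replica and compare the results:
SELECT @@global.gtid_executed;

-- GTIDs a candidate has that the recovered master is missing:
SELECT GTID_SUBTRACT('<candidate_gtid_executed>',
                     '<recovered_master_gtid_executed>');
```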
16. Dual Engine Consistency
▪ Binlog GTID Sets
▪ InnoDB
▪ MyRocks
▪ Binlog, InnoDB and MyRocks (or NDB) need to be consistent
▪ Binlog: GTID 1-200, InnoDB: GTID 190, MyRocks: GTID 197
▪ It is unclear if 191-196 are committed
▪ Roll back all prepared transactions (server side binlog truncation)
▪ Idempotent recovery
▪ Recover from binlogs on semi-sync replica
17. Dual Engine consistency without binlog
▪ 8.0 DDL is transactional
▪ Table metadata info is stored in InnoDB
▪ It is common to run DDL outside of replication
▪ FB OSC changes schema without binlog
▪ MyRocks table changes without binlog may end up inconsistent
▪ There is no binlog to fix inconsistency
▪ DDL validation is our current workaround
18. Summary
▪ MySQL needs to be aware of executed engine GTID sets
▪ With low update costs
▪ Upstream MySQL doesn't have this yet; it would be a nice feature
▪ We worked around it with Slave Idempotent Recovery
▪ Binlog Truncation during recovery, so that an old master can rejoin
as a replica