
Consistency between Engine and Binlog under Reduced Durability


Challenges in keeping binary logs and storage engines consistent in MySQL



  1. Consistency between Engine and Binlog under Reduced Durability
     Yoshinori Matsunobu, Production Engineer, Facebook, Jan/2020
  2. What we want to do
     ▪ When slave or master instances fail and recover, we want them to rejoin the replication chain (replica set) instead of being dropped and rebuilt
     ▪ Imagine a 10-minute network outage in one Availability Zone; we want to recover the MySQL instances in that AZ
  3. Agenda
     ▪ When binlog and storage engine consistency gets broken
     ▪ What can go wrong when restarting a replica
     ▪ What can go wrong when restarting a master
     ▪ Challenges in supporting multiple transactional storage engines
  4. Consistency between binlog and engine
     ▪ MySQL separates replication logs (binary logs) and transactional storage engine logs (InnoDB/MyRocks/NDB)
     ▪ Handled internally via XA (two-phase commit between binlog and engine)
     ▪ Commit ordering:
       ▪ Binlog prepare (does nothing)
       ▪ Engine prepare (in parallel)
       ▪ Binlog commit (ordered)
       ▪ Engine commit (ordered, if binlog_order_commits == ON)
     ▪ If the MySQL instance or host dies in between, engine and binlog may become inconsistent
     ▪ The possibility of inconsistency is greater when operating with reduced durability (sync_binlog != 1 and innodb_flush_log_at_trx_commit != 1)
       ▪ Binlog may lose some events for transactions that were persisted in the engine
       ▪ Engine may lose some transactions that were persisted in the binlog
     ▪ This talk is about how to address consistency issues under reduced durability
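
     A minimal sketch of the "reduced durability" settings this slide refers to, using standard MySQL system variables (the exact values used in production are not stated in the slides):

       -- The binlog is not fsync'ed on every commit, and InnoDB flushes its redo log
       -- to disk roughly once per second instead of at every commit.
       SET GLOBAL sync_binlog = 0;                      -- any value != 1 reduces durability
       SET GLOBAL innodb_flush_log_at_trx_commit = 2;   -- 0 or 2, i.e. != 1
       -- Keep commit ordering between binlog and engines (default is ON):
       SET GLOBAL binlog_order_commits = ON;
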
  5. 5.6 Single Threaded Slave, Binlog < Engine
     ▪ An unplanned OS reboot on a slave may end up in an inconsistent state between binlog GTID sets and engine state
     ▪ The big question is whether the slave can continue replication via START SLAVE, without being entirely replaced
     ▪ Transactional storage engines (both InnoDB and MyRocks) store the last committed GTID, visible in the mysql.slave_relay_log_info table; this table is updated on each commit to the engine
     ▪ With a single threaded slave, you don't have to think about out-of-order execution
     ▪ Run with relay_log_recovery=1
       ▪ The slave discards its relay logs and restarts replication from the engine's max GTID position on the master
       ▪ Skips execution in the engine if GTID < slave_relay_log_info
       ▪ Skips writing binlog events if the binlog GTID overlaps
     ▪ Example state: Master GTID: 1-100; Replica Binlog GTID: 1-98; Engine Max GTID: 99
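
     A minimal sketch of inspecting the recovery-related state described on this slide. Note that exposing the engine's last committed GTID through mysql.slave_relay_log_info is a Facebook extension; upstream stores only relay log / master log coordinates there:

       -- relay_log_recovery is a startup option (e.g. --relay-log-recovery=1 or my.cnf);
       -- it cannot be changed at runtime. Verify it is enabled:
       SELECT @@GLOBAL.relay_log_recovery;

       -- Per-commit position info kept by the slave (the GTID column exists only in the FB branch):
       SELECT * FROM mysql.slave_relay_log_info;

       -- Resume replication after the restart:
       START SLAVE;
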
  6. 5.6 Single Threaded Slave, Binlog > Engine
     ▪ Replication will continue from GTID 95 or earlier
     ▪ Executes GTID 96-98 in the engine but does not write binlog events for them
     ▪ Continues the normal replication flow from GTID 99 onward
     ▪ Example state: Master GTID: 1-100; Replica Binlog GTID: 1-98; Engine Max GTID: 95
  7. Multi Threaded Slave
     ▪ Example state: Master GTID: 1-100; Replica Binlog GTID: 1-98; Engine Max GTID: 95
     ▪ mysql.slave_relay_log_info stores only the max executed GTID on the instance
     ▪ Under parallel execution, MySQL has no idea whether GTID 94 is in the engine or not
       ▪ Execution order might be 91 -> 92 -> 95
     ▪ In upstream 5.6, you can't guarantee consistency
  8. 5.7 gtid_executed table
     ▪ Example state: Master GTID: 1-100; Replica Binlog GTID: 1-98; gtid_executed table: 1-93, 95-98
     ▪ The 5.7 gtid_executed table stores GTID sets in InnoDB (crash safe)
     ▪ However, executed GTIDs are not updated on each commit; the table is updated on binlog rotation
     ▪ If it were updated on each commit, you could figure out whether GTID 94 is there or not (you can't right now)
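
     For reference, the crash-safe GTID state in 5.7+ can be inspected with standard SQL; the GTID set below uses a placeholder UUID to show how a gap such as the missing 94 would surface:

       -- GTID ranges persisted in InnoDB (updated on binlog rotation, not per commit):
       SELECT source_uuid, interval_start, interval_end FROM mysql.gtid_executed;

       -- Compute which GTIDs are not covered by the table, e.g. the questionable 94:
       SELECT GTID_SUBTRACT('11111111-1111-1111-1111-111111111111:1-98',
                            '11111111-1111-1111-1111-111111111111:1-93:95-98') AS possibly_missing;
       -- returns 11111111-1111-1111-1111-111111111111:94
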
  9. FB Extension: Slave Idempotent Recovery
     ▪ Start replication from an old enough binlog GTID
     ▪ Re-execute binlog events against the engine, ignoring all duplicate-key / row-not-found errors during catch-up
     ▪ Eventual consistency
     ▪ Must use RBR, and tables must have primary keys
     ▪ Example state: Master GTID: 1-100; Replica Binlog GTID: 1-98; Engine GTID state: empty
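
     The closest upstream analog to this extension is slave_exec_mode=IDEMPOTENT, which suppresses duplicate-key and key-not-found errors under row-based replication. A rough sketch follows; the FB feature additionally automates choosing an old enough starting GTID, which is not shown here:

       -- Requires RBR and primary keys on every table, as noted above.
       SET GLOBAL slave_exec_mode = 'IDEMPOTENT';  -- ignore duplicate-key / row-not-found errors
       START SLAVE;
       -- After the replica has caught up past the questionable range:
       -- SET GLOBAL slave_exec_mode = 'STRICT';
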
  10. What can go wrong when restarting a master
     ▪ A master may go down unexpectedly for various reasons
       ▪ Hitting segfaults (SIG=11), assertions (SIG=6), forced kill (SIG=9), out of memory
       ▪ Kernel panic
       ▪ Power outage, then restarted after a while
     ▪ Nowadays dead master promotion kicks in (Orchestrator, MHA)
       ▪ One question is whether the failed master can restart replication from the new master
     ▪ The dead master may come back before dead master promotion
       ▪ If the master lost some transactions that were already replicated, replicas may not be able to continue replication
  11. Master Promotion happening, Binlog < Engine
     ▪ "Loss-Less Semi-Synchronous Replication" guarantees that a semisync tailer gets binlog events before the master's engine commit (so Engine on the original master <= Binlog/Engine on the new master)
     ▪ You need to start replication from the last GTID in the engine
       ▪ In this case, the GTID executed set in the master's binlog is 1-98, but replication should start after 99
       ▪ The master's engine execution order is serialized (with binlog_order_commits=1), so it's guaranteed that 1-99 are in the engine
     ▪ However, this information is not visible from MySQL commands (it is only printed in the error log)
       ▪ Feature request to Oracle: InnoDB should add an information_schema table exposing the last committed GTID, binlog file and position
     ▪ With Slave Idempotent Recovery, fetching the last committed GTID can be skipped, so automation can be simplified further
     ▪ Before failover: Instance 1 (Master) Binlog: 1-98, Engine: 1-99; Instance 2 (Replica) Binlog: 1-98; Instance 3 (Replica) Binlog: 1-99; Instance 4 (Replica) Binlog: 1-98
     ▪ After failover: Instance 1 (Dead) Binlog: 1-98, Engine: 1-99; Instance 2 (Replica) Binlog: 1-98; Instance 3 (Master) Binlog: 1-99; Instance 4 (Replica) Binlog: 1-98
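
     A hedged sketch of re-pointing the recovered old master once the engine's last committed GTID (1-99 here) has been obtained from the error log as described above. The UUID and hostname are placeholders, and this is only one possible procedure built from standard commands:

       -- On the recovered instance 1: make the GTID state match what the engine
       -- actually contains (1-99), since its binlog only recorded up to 98.
       RESET MASTER;     -- clears binary logs and gtid_executed
       SET GLOBAL gtid_purged = '11111111-1111-1111-1111-111111111111:1-99';

       -- Replicate from the newly promoted master (instance 3) using auto positioning:
       CHANGE MASTER TO
         MASTER_HOST = 'instance3.example.com',
         MASTER_AUTO_POSITION = 1;
       START SLAVE;
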
  12. Master Promotion happening, Binlog > Engine
     ▪ Before failover: Instance 1 (Master) Binlog: 1-100, Engine: 1-98; Instance 2 (Replica) Binlog: 1-98; Instance 3 (Replica) Binlog: 1-99; Instance 4 (Replica) Binlog: 1-98
     ▪ After failover: Instance 1 (Dead) Binlog: 1-100, Engine: 1-98; Instance 2 (Replica) Binlog: 1-98; Instance 3 (Master) Binlog: 1-99; Instance 4 (Replica) Binlog: 1-98
     ▪ "100" should be discarded before replicating from the new master (instance 3)
     ▪ Error log at recovery:
       InnoDB: Last binlog file position 79143, file name binlog.000005
       InnoDB: Last MySQL Gtid UUID:98
     ▪ Binlog GTID 100 is on instance 1 only, and is not acked to the client (with loss-less semisync)
     ▪ If the original master (instance 1) applies binlog 100, it can't join as a replica
     ▪ We need some way not to apply GTID 100 during recovery
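
     A sketch of detecting the extra transaction before the old master rejoins, using the standard GTID_SUBTRACT function (the UUID is a placeholder; the new master's set would be fetched from instance 3 with SELECT @@GLOBAL.gtid_executed):

       SET @new_master_gtids = '11111111-1111-1111-1111-111111111111:1-99';
       -- GTIDs present on this (old) master but absent on the new master:
       SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed, @new_master_gtids) AS errant_gtids;
       -- A non-empty result (here ...:100) must be discarded or skipped before rejoining as a replica
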
  13. FB Extension: Server Side Binlog Truncation
     ▪ At instance startup, truncate binlog events that don't exist in the storage engine
       ▪ The end-of-binlog position becomes the same as or smaller than the engine's last committed GTID
     ▪ The original binlog file is retained as a backup
     ▪ All transactions in prepared state in the storage engines are rolled back
  14. Master Promotion not happening
     ▪ Before restart: Instance 1 (Master) Binlog: 1-100, Engine: 1-98; Instance 2 (Replica) Binlog: 1-98; Instance 3 (Replica) Binlog: 1-99; Instance 4 (Replica) Binlog: 1-98
     ▪ After restart: Instance 1 (Recovered) Binlog: 1-98, Engine: 1-98; Instance 2 (Replica) Binlog: 1-98; Instance 3 (Replica) Binlog: 1-99; Instance 4 (Replica) Binlog: 1-98
     ▪ An unplanned reboot on the master may end up losing transactions that were already replicated to slaves
     ▪ Instance 1 should not serve write requests until it has caught up to binlog GTID 99 from instance 3
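
     A small sketch of the "do not serve writes until caught up" guard, using standard variables and functions (super_read_only since 5.7.8, WAIT_FOR_EXECUTED_GTID_SET since 5.7.5; the UUID is a placeholder):

       -- On recovered instance 1: block all writes, including from SUPER accounts:
       SET GLOBAL super_read_only = ON;

       -- Wait (up to 600s) until GTID 99 has been replicated and applied locally:
       SELECT WAIT_FOR_EXECUTED_GTID_SET('11111111-1111-1111-1111-111111111111:1-99', 600);
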
  15. Common Replica errors
     Last_IO_Errno: 1236
     Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Slave has more GTIDs than the master has, using the master's SERVER_UUID. This may indicate that the end of the binary log was truncated or that the last binary log file was lost, e.g., after a power or disk failure when sync_binlog != 1. The master may or may not have rolled back transactions that were already replica'
     ▪ Set read_only=1 by default
     ▪ Find the most advanced slave, catch up from there, then start serving write requests
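
     A sketch of "find the most advanced slave": collect gtid_executed from each replica and compare the sets with GTID_SUBTRACT (values below are placeholders taken from the earlier example):

       -- From each replica: SELECT @@GLOBAL.gtid_executed;
       SET @instance2 = '11111111-1111-1111-1111-111111111111:1-98';
       SET @instance3 = '11111111-1111-1111-1111-111111111111:1-99';
       -- An empty result means the first argument contains nothing the second lacks:
       SELECT GTID_SUBTRACT(@instance2, @instance3) AS only_on_instance2;  -- empty
       SELECT GTID_SUBTRACT(@instance3, @instance2) AS only_on_instance3;  -- ...:99, so instance 3 is most advanced
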
  16. Dual Engine Consistency
     ▪ Binlog GTID sets, InnoDB, and MyRocks (or NDB) all need to be consistent with each other
     ▪ Example: Binlog: GTID 1-200, InnoDB: GTID 190, MyRocks: GTID 197
       ▪ It is unclear whether 191-196 are committed
     ▪ Ways to recover:
       ▪ Roll back all prepared transactions (server side binlog truncation)
       ▪ Idempotent recovery
       ▪ Recover from binlogs on a semi-sync replica
  17. Dual Engine consistency without binlog
     ▪ 8.0 DDL is transactional
       ▪ Table metadata is stored in InnoDB
     ▪ It is common to run DDL outside of replication
       ▪ FB OSC changes schema without the binlog
     ▪ MyRocks table changes without the binlog may end up inconsistent
       ▪ There is no binlog to fix the inconsistency
     ▪ DDL validation is our current workaround
  18. Summary
     ▪ MySQL needs to be aware of executed engine GTID sets, with low update costs
       ▪ Upstream MySQL doesn't have this yet; it would be a nice feature
     ▪ We worked around it with Slave Idempotent Recovery
     ▪ Binlog truncation during recovery, so that an old master can rejoin as a replica
