Consistency between Engine and Binlog
under Reduced Durability
Yoshinori Matsunobu
Production Engineer, Facebook
Jan/2020
What we want to do
▪ When slave or master instances fail and recover, we want them to
rejoin the replication chain (replica set) instead of being dropped and
rebuilt
▪ Imagine a 10-minute network outage in one Availability Zone, after which
we want to recover the MySQL instances in that AZ
Agenda
▪ When binlog and storage engine consistency gets broken
▪ What can go wrong when restarting a replica
▪ What can go wrong when restarting a master
▪ Challenges in supporting multiple transactional storage engines
Consistency between binlog and engine
▪ MySQL keeps replication logs (binary logs) separate from transactional storage engine logs
(InnoDB/MyRocks/NDB)
▪ The two are coordinated internally via XA
▪ Commit ordering:
  ▪ Binlog Prepare (does nothing)
  ▪ Engine Prepare (in parallel)
  ▪ Binlog Commit (ordered)
  ▪ Engine Commit (ordered, if binlog_order_commits==ON)
▪ If the MySQL instance or host dies in between, engine and binlog might become inconsistent
▪ The chance of inconsistency is higher when operating with reduced durability (sync_binlog != 1 and
innodb_flush_log_at_trx_commit != 1)
  ▪ Some binlog events that were persisted in the engine may be lost
  ▪ The engine may lose some transactions that were persisted in the binlog
▪ This talk is about how to address consistency issues under reduced durability
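As a rough illustration of the two settings named above, the following sketch classifies which side can lose recent transactions when the host crashes. The `DurabilityRisk` type and `assess` function are hypothetical names introduced here, not part of MySQL; the only facts assumed are that `sync_binlog=1` and `innodb_flush_log_at_trx_commit=1` are the fully durable values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurabilityRisk:
    """Which side may lose its most recent writes after a host crash."""
    binlog_may_lose_events: bool
    engine_may_lose_commits: bool


def assess(sync_binlog: int, innodb_flush_log_at_trx_commit: int) -> DurabilityRisk:
    # Any value other than 1 means the log is not synced to disk on every commit,
    # so the tail of that log can disappear if the OS or host goes down.
    return DurabilityRisk(
        binlog_may_lose_events=(sync_binlog != 1),
        engine_may_lose_commits=(innodb_flush_log_at_trx_commit != 1),
    )


if __name__ == "__main__":
    print(assess(sync_binlog=1, innodb_flush_log_at_trx_commit=1))
    print(assess(sync_binlog=0, innodb_flush_log_at_trx_commit=2))
```

When both flags are true, either "Binlog < Engine" or "Binlog > Engine" can be observed after a restart, which is the situation the rest of the talk deals with.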
5.6 Single Threaded Slave, Binlog < Engine
▪ An unplanned OS reboot on a slave may leave the Binlog GTID sets and
the engine state inconsistent
▪ The big question is whether the slave can continue replication with START
SLAVE, without being replaced entirely
▪ Transactional storage engines (both InnoDB and MyRocks) store the
last committed GTID, which is visible in the
mysql.slave_relay_log_info table. This table is updated on each
commit to the engine
▪ With a single-threaded slave, you don’t have to think about out-of-order
execution
▪ Run with relay_log_recovery=1
▪ The slave discards its relay logs and restarts replication from the master at the engine’s max GTID
position
▪ Skips execution in the engine if GTID < slave_relay_log_info
▪ Skips writing binlog events if the binlog GTID overlaps
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 99
5.6 Single Threaded Slave, Binlog > Engine
▪ Replication will continue from GTID 95 or less
▪ GTIDs 96-98 are executed in the engine, but their binlog events are not
written again
▪ Normal replication flow resumes from GTID 99 onward
Replica: Binlog GTID 1-98, Engine Max GTID 95
Master: GTID 1-100
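The skip rules from the two single-threaded cases above can be sketched as a small planning function. This is a toy model, not the server's implementation: GTIDs are simplified to plain integers from a single source, `plan_catchup` and `Action` are hypothetical names, and the engine comparison is treated as inclusive on the assumption that the GTID recorded by the engine is itself already committed.

```python
from typing import List, NamedTuple


class Action(NamedTuple):
    gtid: int
    apply_to_engine: bool
    write_to_binlog: bool


def plan_catchup(start: int, master_max: int,
                 binlog_max: int, engine_max: int) -> List[Action]:
    """Decide, per re-fetched GTID, whether to execute it and whether to log it."""
    actions = []
    for gtid in range(start, master_max + 1):
        actions.append(Action(
            gtid=gtid,
            # Already committed in the engine: do not execute again.
            apply_to_engine=gtid > engine_max,
            # Already present in the replica's binlog: do not write it again.
            write_to_binlog=gtid > binlog_max,
        ))
    return actions


if __name__ == "__main__":
    # Binlog < Engine: binlog 1-98, engine max 99
    for action in plan_catchup(99, 100, binlog_max=98, engine_max=99):
        print(action)
    # Binlog > Engine: binlog 1-98, engine max 95
    for action in plan_catchup(95, 100, binlog_max=98, engine_max=95):
        print(action)
```

In the first case only GTID 99 is written to the binlog without being executed; in the second case GTIDs 96-98 are executed without being logged, and 99 onward follow the normal path.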
Multi Threaded Slave
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine Max GTID 95
▪ mysql.slave_relay_log_info stores only the max
executed GTID of the instance
▪ Under parallel (per-database) execution, MySQL has no
idea whether GTID 94 is in the engine or not
▪ Execution order might be 91 -> 92 -> 95
▪ In upstream 5.6, you can’t guarantee consistency
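A small sketch of why a single max GTID is not enough under parallel apply: two different engine states can report the same maximum. GTIDs are simplified to integers, and `missing_below_max` is a hypothetical helper used only for this illustration.

```python
from typing import List, Set


def missing_below_max(committed: Set[int], known_complete_up_to: int) -> List[int]:
    """GTIDs below the max that are absent from the engine."""
    top = max(committed)
    return [g for g in range(known_complete_up_to + 1, top) if g not in committed]


# Two possible engine states after a crash; both would record max GTID 95.
in_order = {91, 92, 93, 94, 95}
out_of_order = {91, 92, 95}  # execution order 91 -> 92 -> 95

if __name__ == "__main__":
    print(max(in_order) == max(out_of_order))
    print(missing_below_max(in_order, 90))
    print(missing_below_max(out_of_order, 90))
```

Because the recorded maximum is identical in both states, the replica cannot tell from that value alone whether 93 and 94 still need to be applied.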
5.7 gtid_executed table
Replica: Binlog GTID 1-98, gtid_executed table: 1-93, 95-98
Master: GTID 1-100
▪ The 5.7 gtid_executed table stores GTID sets in InnoDB
(crash safe)
▪ However, executed GTIDs are not updated on each
commit
▪ The table is updated on binlog rotation
▪ If it were updated on each commit, you could tell whether
GTID 94 is there or not (you can’t right now)
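To show what a per-commit GTID set would make possible, here is a minimal sketch that parses the interval notation used on the slide ("1-93, 95-98") and answers membership questions. It deliberately ignores the source UUID prefix of real GTID sets and assumes the intervals do not overlap; the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def parse_intervals(text: str) -> List[Interval]:
    """Parse '1-93, 95-98' into [(1, 93), (95, 98)]; a bare '5' means (5, 5)."""
    intervals = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        intervals.append((int(lo), int(hi or lo)))
    return sorted(intervals)


def contains(intervals: List[Interval], gtid: int) -> bool:
    return any(lo <= gtid <= hi for lo, hi in intervals)


def gaps(intervals: List[Interval]) -> List[int]:
    """GTIDs that fall between consecutive intervals."""
    missing = []
    for (_, prev_hi), (next_lo, _) in zip(intervals, intervals[1:]):
        missing.extend(range(prev_hi + 1, next_lo))
    return missing


if __name__ == "__main__":
    executed = parse_intervals("1-93, 95-98")
    print(contains(executed, 94))
    print(gaps(executed))
```

With an interval set like this kept current in the engine, the question "is GTID 94 there or not" has a direct answer, which a single max value cannot give.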
FB Extension: Slave Idempotent Recovery
- Start replication from an old enough binlog GTID
- Re-execute binlog events against the engine, ignoring
all duplicate key / row not found errors during
catchup
- Eventual consistency
- Must use RBR, and tables must have primary keys
Master: GTID 1-100
Replica: Binlog GTID 1-98, Engine GTID state: empty
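The idea of re-executing row events while ignoring duplicate key and row not found errors can be illustrated with a toy in-memory table keyed by primary key. This is only a model of the principle, not the Facebook implementation: the event shape and `apply_idempotent` are hypothetical, and each event is assumed to carry the full row image, as row-based replication with primary keys allows.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]
# (kind, primary key, row image after the change; None for deletes)
Event = Tuple[str, int, Optional[Row]]


def apply_idempotent(table: Dict[int, Row], events: List[Event]) -> None:
    """Replay row events in order, tolerating rows that were already applied."""
    for kind, pk, after in events:
        if kind == "insert":
            if pk in table:
                continue  # duplicate key: ignore
            table[pk] = dict(after)
        elif kind == "update":
            if pk not in table:
                continue  # row not found: ignore
            table[pk] = dict(after)
        elif kind == "delete":
            table.pop(pk, None)  # row not found: ignore
        else:
            raise ValueError(f"unknown event kind: {kind}")


EVENTS: List[Event] = [
    ("insert", 1, {"v": 1}),
    ("update", 1, {"v": 2}),
    ("insert", 2, {"v": 10}),
    ("delete", 2, None),
]

if __name__ == "__main__":
    clean: Dict[int, Row] = {}
    apply_idempotent(clean, EVENTS)

    # Engine state after a crash: events 1 and 3 were applied, 2 and 4 were not.
    crashed: Dict[int, Row] = {1: {"v": 1}, 2: {"v": 10}}
    apply_idempotent(crashed, EVENTS)
    print(clean == crashed)
```

The intermediate states may be wrong during catchup, but once the replay reaches the end of the stream both tables hold the same rows, which is the eventual consistency the slide refers to.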
What can go wrong when restarting master
▪ The master may go down unexpectedly for various reasons
▪ Hitting segfaults (SIG=11), assertions (SIG=6), forced kills (SIG=9), out of
memory
▪ Kernel panic
▪ Power outages, then restarted after a while
▪ Nowadays dead master promotion kicks in (Orchestrator, MHA)
▪ The question is whether the failed master can restart replication from the new master
▪ The dead master may come back before dead master promotion
▪ If the master lost some transactions that were already replicated, replicas may
not be able to continue replication
Master Promotion happening, Binlog < Engine
▪ “Loss-Less Semi-Synchronous Replication” guarantees that the semisync tailer gets binlog events before the master’s engine commit (so Engine on
orig master <= Binlog/Engine on new master)
▪ You need to start replication from the last GTID in the engine
▪ In this case, the GTID executed set on the master is 1-98, but replication should start after 99
▪ The master’s engine execution order is serialized (with binlog_order_commits=1), so it is guaranteed that 1-99 are in the engine
▪ However, this information is not visible from MySQL commands (it is only printed in the error log)
▪ Feature request to Oracle: InnoDB should expose the last committed GTID, binlog file and position through information_schema
▪ With Slave Idempotent Recovery, fetching the last committed GTID can be skipped, which simplifies automation
Before promotion:
Instance 1 (Master): Binlog 1-98, Engine 1-99
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After promotion:
Instance 1 (Dead): Binlog 1-98, Engine 1-99
Instance 2 (Replica): Binlog 1-98
Instance 3 (Master): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
Master Promotion happening, Binlog > Engine
Before promotion:
Instance 1 (Master): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After promotion:
Instance 1 (Dead): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Master): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
“100” should be discarded before
replicating from new master (instance 3)
InnoDB: Last binlog file position 79143, file name binlog.000005
InnoDB: Last MySQL Gtid UUID:98
▪ Binlog GTID 100 exists on instance 1 only, and was not acked to the client (with loss-less semisync)
▪ If the original master (instance 1) applies Binlog 100, it can’t join as a replica
▪ We need some way to avoid applying GTID 100 during recovery
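Since the engine's last committed position is only printed in the error log, automation has to scrape it. The sketch below parses the two log lines exactly as they appear on this slide; the line format is taken from the slide and may differ between builds, and `parse_engine_state` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class EngineBinlogState(NamedTuple):
    binlog_file: str
    binlog_pos: int
    last_gtid: str


# Patterns match the log lines as shown on the slide (an assumption about format).
_POS_RE = re.compile(r"InnoDB: Last binlog file position (\d+), file name (\S+)")
_GTID_RE = re.compile(r"InnoDB: Last MySQL Gtid (\S+)")


def parse_engine_state(log_text: str) -> Optional[EngineBinlogState]:
    """Return the most recent engine position found in the log, if any."""
    positions = _POS_RE.findall(log_text)
    gtids = _GTID_RE.findall(log_text)
    if not positions or not gtids:
        return None
    pos, name = positions[-1]
    return EngineBinlogState(binlog_file=name, binlog_pos=int(pos), last_gtid=gtids[-1])


if __name__ == "__main__":
    sample = (
        "InnoDB: Last binlog file position 79143, file name binlog.000005\n"
        "InnoDB: Last MySQL Gtid UUID:98\n"
    )
    print(parse_engine_state(sample))
```

Taking the last match matters because an error log usually covers several restarts, and only the most recent recovery reflects the current engine state.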
FB Extension: Server Side Binlog Truncation
▪ At instance startup, truncate binlog events that don’t exist in the storage
engine
▪ The end of the binlog is then at the same position as, or earlier than, the engine’s last committed GTID
▪ The original binlog file is retained as a backup
▪ All transactions in prepared state in the storage engines are rolled back
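The truncation decision described above can be sketched as follows. This models only the choice of truncation point, not file handling: binlog transactions are represented as hypothetical `(gtid, end_pos)` records in binlog order, GTIDs are simplified to integers, and `truncate_plan` is a name introduced for this example.

```python
from itertools import takewhile
from typing import List, NamedTuple, Optional, Tuple


class BinlogEvent(NamedTuple):
    gtid: int
    end_pos: int  # byte offset where this transaction ends in the binlog file


def truncate_plan(events: List[BinlogEvent], engine_last_gtid: int
                  ) -> Tuple[List[BinlogEvent], List[BinlogEvent], Optional[int]]:
    """Split binlog transactions into kept and discarded, and pick the cut offset."""
    # Keep the leading run of transactions the engine has already committed.
    kept = list(takewhile(lambda e: e.gtid <= engine_last_gtid, events))
    discarded = events[len(kept):]
    truncate_at = kept[-1].end_pos if kept else None
    return kept, discarded, truncate_at


if __name__ == "__main__":
    binlog = [BinlogEvent(97, 700), BinlogEvent(98, 800),
              BinlogEvent(99, 900), BinlogEvent(100, 1000)]
    kept, discarded, truncate_at = truncate_plan(binlog, engine_last_gtid=98)
    print([e.gtid for e in kept], [e.gtid for e in discarded], truncate_at)
```

A real implementation would copy the original file aside before cutting it, matching the "retained as a backup" point above.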
Master Promotion not happening
Before reboot:
Instance 1 (Master): Binlog 1-100, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
After reboot:
Instance 1 (Recovered): Binlog 1-98, Engine 1-98
Instance 2 (Replica): Binlog 1-98
Instance 3 (Replica): Binlog 1-99
Instance 4 (Replica): Binlog 1-98
▪ An unplanned reboot of the master may lose transactions that were already replicated to slaves
▪ Instance 1 should not serve write requests until it has caught up on Binlog GTID 99 from instance 3
Common Replica errors
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from
binary log: 'Slave has more GTIDs than the master has, using the
master's SERVER_UUID. This may indicate that the end of the binary log
was truncated or that the last binary log file was lost, e.g., after a
power or disk failure when sync_binlog != 1. The master may or may not
have rolled back transactions that were already replica’
▪ Set read_only=1 by default
▪ Find the most advanced slave, catch up from there, then start serving write requests
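The "find the most advanced slave, catch up from there" step can be sketched with the numbers from the previous slide. This assumes each instance holds a contiguous GTID range starting at 1, so a single integer describes it; real GTID sets would need set subtraction instead. The function names are hypothetical.

```python
from typing import Dict, List


def most_advanced(replica_max_gtid: Dict[str, int]) -> str:
    """Name of the replica holding the highest GTID."""
    return max(replica_max_gtid, key=replica_max_gtid.get)


def gtids_to_fetch(recovered_max: int, source_max: int) -> List[int]:
    """GTIDs the recovered master must apply before it may accept writes."""
    return list(range(recovered_max + 1, source_max + 1))


if __name__ == "__main__":
    replicas = {"instance2": 98, "instance3": 99, "instance4": 98}
    source = most_advanced(replicas)
    missing = gtids_to_fetch(recovered_max=98, source_max=replicas[source])
    print(source, missing)
    # Keep read_only=1 on the recovered master until this list is empty.
```

An empty result means the recovered master is not behind any replica and read_only can be lifted.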
Dual Engine Consistency
▪ Binlog GTID Sets
▪ InnoDB
▪ MyRocks
▪ Binlog, InnoDB and MyRocks (or NDB) need to be consistent
▪ Binlog: GTID 1-200, InnoDB: GTID 190, MyRocks: GTID 197
▪ It is unclear if 191-196 are committed
▪ Roll back all prepared transactions (server side binlog truncation)
▪ Idempotent recovery
▪ Recover from binlogs on semi-sync replica
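The ambiguity in the example above can be made explicit with a small calculation. It assumes each engine reports only its own last committed GTID, GTIDs are plain integers, and `classify_ranges` is a hypothetical helper; the interpretation of the three ranges follows the slide's example rather than any server behavior.

```python
from typing import Dict, List


def classify_ranges(binlog_max: int, engine_last: Dict[str, int]) -> Dict[str, List[int]]:
    """Split the binlog's GTIDs by what the engines' last-commit markers tell us."""
    low = min(engine_last.values())
    high = max(engine_last.values())
    return {
        # At or below every engine's marker.
        "covered_by_all_engines": list(range(1, low + 1)),
        # Between the markers: may belong to the lagging engine and be missing.
        "unclear": list(range(low + 1, high)),
        # Past every engine's marker: present in the binlog only.
        "beyond_all_engines": list(range(high + 1, binlog_max + 1)),
    }


if __name__ == "__main__":
    result = classify_ranges(200, {"InnoDB": 190, "MyRocks": 197})
    print(result["unclear"])
    print(result["beyond_all_engines"])
```

The three remedies listed on the slide all exist to resolve the "unclear" range, since the markers alone cannot.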
Dual Engine consistency without binlog
▪ 8.0 DDL is transactional
▪ Table metadata info is stored in InnoDB
▪ It is common to run DDL outside of replication
▪ FB OSC changes schema without binlog
▪ MyRocks table changes made without binlog may end up inconsistent
▪ There is no binlog to fix the inconsistency
▪ DDL validation is our current workaround
Summary
▪ MySQL needs to be aware of executed engine GTID sets
▪ With low update costs
▪ Upstream MySQL doesn’t have this yet; it would be a nice feature
▪ We worked around it with Slave Idempotent Recovery
▪ Binlog Truncation during recovery, so that an old master can rejoin
as a replica
