This presentation was given at Percona Live! Amsterdam 2015.

  1. Running GTID replication in production
     Balazs Pocze, DBA@Gawker
     Peter Boros, Principal Architect@Percona
  2. Who we are
     • Gawker Media: World’s largest independent media company
     • We get ~67M monthly uniques in the US, ~107M uniques worldwide
     • We run on the Kinja platform (http://kinja.com)
     • Our blogs are: gawker.com, jalopnik.com, jezebel.com, lifehacker.com, gizmodo.com, deadspin.com, io9.com, kotaku.com
  3. GTID replication in a nutshell
  4. Traditional replication
     ● The MySQL server writes the data to be replicated to the binary log
     ● A replica needs to know at least the following:
       • Binary log file name
       • Position in the binary log
  5. Traditional replication caveats
     ● If a replica has to be moved to a different server, you have to ensure that all of the servers are in the same state.
     ● You have to stop all writes (set global read_only & set global super_read_only), then reposition the replica on the other master using its coordinates (show master status …). After the replica is moved, writes can be resumed.
  6. GTID in a nutshell
     ● Behind the scenes it is the same as traditional replication
       • filename, position
     ● GTID replication
       • uuid:seqno
  7. GTID in a nutshell
     ● UUID identifies the server
     ● Seqno identifies the Nth transaction from that server
       • This is harder to find than just seeking to a byte offset
       • The binlog containing the GTID needs to be scanned
     ● Nodes can be repositioned easily (anywhere in the replication hierarchy)
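To make the uuid:seqno notation concrete, here is a minimal Python sketch that splits a GTID set string into per-server sequence ranges. The function name `parse_gtid_set` is made up for this illustration and is not part of MySQL or any Percona tool; it assumes well-formed input.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}.

    Ranges are inclusive. Hypothetical helper, assumes well-formed input.
    """
    result = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        # A UUID contains hyphens but no colons, so ':' separates the
        # server UUID from its list of intervals.
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A bare number such as '7' is the single-transaction range 7-7.
            ranges.append((int(start), int(end or start)))
    return result


gtids = parse_gtid_set(
    "b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29, "
    "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9"
)
```

Each key is a server UUID and each tuple is a contiguous run of transactions that originated on that server.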
  8. Gawker’s environment
  9. Gawker’s environment
     ● 2 datacenter operation
     ● One of the DCs is ‘more equal’ than the other: it hosts the ‘active master’
       ○ All the writes happen there, and it replicates to the other master, as well as to the secondaries
     ● The replicas have to be movable between the masters and the secondaries ...
  10. Gawker’s environment
      http://bit.ly/kinjadb
  11. Gawker’s environment
      ● We don’t fix broken MySQL instances; when there is a problem, we drop them and recreate them as fresh clones
      ● Backup and slave creation use the same codebase
      ● All the operations are defined as Ansible playbooks; they are called from Python wrappers and can be managed from a Jenkins instance
        ○ Backup
  12. Pre-GTID master maintenance
      ● Put the site into read-only mode (time!)
      ● Put the database into read-only mode (SET GLOBAL READ_ONLY=1)
      ● On the secondary master (SHOW MASTER STATUS - 2x)
      ● On replicas: (CHANGE MASTER TO … )
      ● Fail over applications to write to the new master
  13. GTID master maintenance
      ● On replicas: (CHANGE MASTER TO … MASTER_AUTO_POSITION=1)
      ● Fail over applications to write to the new master*
      *At this point we still have to disable writes, but only for about 30 seconds
  14. Caveats
  15. Common failures with GTID: Errant transactions
      ● The executed GTID set of the node about to be promoted is not a subset of the current master’s executed GTID set
        • Use GTID_SUBSET() in pre-flight checks
        • mysql> select gtid_subset(
                   'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29, 8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
                   'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30, 8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as slave_is_subset;
          +-----------------+
          | slave_is_subset |
          +-----------------+
          |               0 |
          +-----------------+
  16. Common failures with GTID: Errant transactions
      ● select gtid_subtract(
          'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29, 8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
          'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as errant_transactions;
        +------------------------------------------+
        | errant_transactions                      |
        +------------------------------------------+
        | 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9 |
        +------------------------------------------+
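The subset and subtract checks above run inside MySQL. For tooling that already holds both executed GTID sets as strings, the same comparison can be sketched in Python. This is an illustrative re-implementation under the assumption of well-formed input, not MySQL's own code; the names `parse_gtid_set`, `subtract_ranges`, `errant_transactions` and `is_subset` are made up for this example.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]} (inclusive)."""
    result = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def subtract_ranges(a, b):
    """Return the parts of the inclusive ranges in `a` not covered by `b`."""
    out = []
    for start, end in sorted(a):
        cur = start
        for b_start, b_end in sorted(b):
            if b_end < cur or b_start > end:
                continue  # no overlap with what is left of this range
            if b_start > cur:
                out.append((cur, b_start - 1))
            cur = max(cur, b_end + 1)
            if cur > end:
                break
        if cur <= end:
            out.append((cur, end))
    return out


def errant_transactions(candidate, master):
    """GTIDs executed on `candidate` but not on `master`, as a GTID set string."""
    cand, mast = parse_gtid_set(candidate), parse_gtid_set(master)
    parts = []
    for uuid in sorted(cand):
        left = subtract_ranges(cand[uuid], mast.get(uuid, []))
        if left:
            spans = ":".join(f"{s}-{e}" if e > s else str(s) for s, e in left)
            parts.append(f"{uuid}:{spans}")
    return ",".join(parts)


def is_subset(candidate, master):
    """True when `candidate` has executed nothing that `master` has not."""
    return errant_transactions(candidate, master) == ""


slave = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29, "
         "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9")
master = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30, "
          "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7")
print(is_subset(slave, master))            # False
print(errant_transactions(slave, master))  # 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9
```

Working on ranges rather than expanding every sequence number keeps the check cheap even for sets with billions of transactions.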
  18. Common failures with GTID: you still have to tell where to start replication after rebuild
      ● CHANGE MASTER TO … MASTER_AUTO_POSITION=1 is not enough.
      ● First you have to set global gtid_purged=... (xtrabackup gives this information) (because xtrabackup is our friend!)
      [root@slave01.xyz /var/lib/mysql]# cat xtrabackup_slave_info
      SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934, cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732, e907792a-8417-11e3-a037-b4b52f51dbf8:1-25858698180';
      CHANGE MASTER TO MASTER_AUTO_POSITION=1
  19. Common failures with GTID: Server UUID changes after rebuild
      ● Xtrabackup doesn’t back up auto.cnf
        • This is good for rebuilding slaves
        • Can come as a surprise if the master is restored
      ● A write on a freshly rebuilt node introduces a new UUID
        • The workaround is to restore the UUID in auto.cnf as well
  20. Common failures with GTID: There are transactions where you don’t expect them
      ● Crashed table repairs will appear as new transactions
        • STOP SLAVE
        • RESET MASTER
        • SET GLOBAL gtid_purged=<...>
        • START SLAVE
  21. The not so obvious failure
      ● We had the opposite of errant transactions: transactions missing from the slave, which we called GTID holes
        • http://bit.ly/gtidholefill
      ● The corresponding event on the master was empty
      ● Slaves with such holes couldn’t be repositioned to other masters: they wanted a transaction which was no longer in the binlogs
  22. The not so obvious failure: takeaways
      ● The issue was present with non-GTID replication as well
        • But was never spotted
      ● Most of the transactions were on frequently updated data, so they meant only occasional inconsistency
      ● A good example that it’s harder to lose transactions with GTID replication
      ● Fix: in mf_iocache2.c:
  23. GTID holes by default: MTS
      ● The aforementioned GTID holes can be intermittently present at any time if a multi-threaded slave (MTS) is used
      ● Having holes in a GTID sequence can be a valid state with MTS and STOP SLAVE
      ● We reverted to a non-multithreaded slave during debugging to see whether the holes left were a side effect of MTS, but they weren’t
      ● (In MySQL 5.6, MTS doesn’t really solve anything if the schemas are not written equally, which is true in our case)
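A hole shows up as a gap between two intervals for the same server UUID in an executed GTID set. The following Python sketch lists such gaps; the function name `find_holes` and the example set are made up for illustration, and the sketch only reports gaps between intervals, not anything missing before the first one.

```python
def find_holes(executed):
    """Return {uuid: [(first_missing, last_missing), ...]} for an executed GTID set.

    Hypothetical helper: assumes well-formed input such as 'uuid:1-5:7-10'.
    """
    holes = {}
    for part in executed.split(","):
        uuid, *intervals = part.strip().split(":")
        ranges = sorted(
            (int(start), int(end or start))
            for start, _, end in (i.partition("-") for i in intervals)
        )
        # Compare each interval with the next one; anything in between is missing.
        gaps = [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:])
            if next_start > prev_end + 1
        ]
        if gaps:
            holes[uuid] = gaps
    return holes


# Made-up executed set: transactions 6 and 11 are missing.
example = "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5:7-10:12"
print(find_holes(example))
```

As the slide notes, a hole seen while a multi-threaded slave is running or just after STOP SLAVE can be a valid intermediate state, so a check like this is only meaningful once the applier has caught up.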
  24. Practical things
  25. Skipping replication event
      ● Injecting empty transactions
      ● In SQL
        • set gtid_next='<gtid>'; begin; commit; set gtid_next='automatic';
        • http://bit.ly/gtidholefill
      ● Some tools exist for this (pt-slave-restart)
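When more than one GTID has to be skipped or filled, the statements above can be generated rather than typed. A minimal Python sketch, assuming the UUID and the sequence range are already known; `empty_transaction_sql` is a made-up helper name, and the sketch only builds the statements, it does not connect to a server.

```python
def empty_transaction_sql(uuid, first, last):
    """Build the statements that inject one empty transaction per GTID.

    Covers uuid:first through uuid:last inclusive. Hypothetical helper.
    """
    statements = []
    for seqno in range(first, last + 1):
        statements.append(f"SET gtid_next='{uuid}:{seqno}'")
        statements.append("BEGIN")
        statements.append("COMMIT")
    # Hand GTID assignment back to the server once the range is covered.
    statements.append("SET gtid_next='AUTOMATIC'")
    return statements


for statement in empty_transaction_sql("8e3648e4-bc14-11e3-8d4c-0800272864ba", 8, 9):
    print(statement + ";")
```

The statements have to run in order on a single connection, because gtid_next is set per session.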
  26. Consistency checks: pt-table-checksum
      ● Checksumming multiple levels of replication with pt-table-checksum
      ● ROW based replication limitation
      ● Workaround is checking on individual levels
      ● That introduces extra writes on second level slaves
        • Luckily, GTID is not as error prone regarding consistency issues
  27. Thanks!
