Running GTID replication in production
Balazs Pocze, DBA@Gawker
Peter Boros, Principal Architect@Percona

Who we are
• Gawker Media: World’s largest independent media company
• We get about 67M monthly uniques in the US and about 107M uniques worldwide.
• We run on the Kinja platform (http://kinja.com)
• Our blogs are: gawker.com, jalopnik.com, jezebel.com, lifehacker.com,
gizmodo.com, deadspin.com, io9.com, kotaku.com

GTID replication in a nutshell

Traditional replication
● The MySQL server writes the data to be replicated to the binary log
● A replica needs to know at least the following:
• Binary log file name
• Position in binary log

Traditional replication caveats
● If a replica has to be moved to a different master, you have to
ensure that all of the servers are in the same state.
● You have to stop all writes (SET GLOBAL read_only and SET GLOBAL
super_read_only), then reposition the replica on the new master
using that master’s coordinates (SHOW MASTER STATUS …). After
the replica is moved, the writes can be resumed.

GTID in a nutshell
● Behind the scenes it is the same as traditional replication
• filename, position
● GTID replication
• uuid:seqno

GTID in a nutshell
● UUID identifies the server
● Seqno identifies the Nth transaction from that server
• This is harder to find than just seeking to a byte offset
• The binlog containing the GTID needs to be scanned
● Nodes can be repositioned easily (anywhere in replication hierarchy)
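To illustrate why a GTID is harder to locate than a byte offset, here is a minimal sketch in Python. It is a toy model, not how MySQL reads binlogs: the `Event` type, the offsets, and the helper names are hypothetical, and the UUID is only a sample value.

```python
from typing import List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    offset: int  # byte offset of the event in the (toy) binlog file
    gtid: str    # "uuid:seqno" of the transaction


# Toy binlog with made-up offsets; the UUID is just a sample value.
UUID = "b0bb2e56-6121-11e5-9e7c-d73eafb37531"
BINLOG: List[Event] = [Event(4 + 200 * i, f"{UUID}:{i + 1}") for i in range(5)]


def parse_gtid(gtid: str) -> Tuple[str, int]:
    """Split 'uuid:seqno' into the server UUID and the sequence number."""
    uuid, seqno = gtid.rsplit(":", 1)
    return uuid, int(seqno)


def find_offset(events: List[Event], gtid: str) -> Optional[int]:
    """Scan the events one by one until the GTID is found.

    With file name and position the replica can seek directly;
    with a GTID the log has to be scanned like this.
    """
    for event in events:
        if event.gtid == gtid:
            return event.offset
    return None
```

The scan is linear in the number of events, which is the cost the slide refers to; in exchange, the same GTID means the same transaction on every server in the hierarchy.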

Gawker’s environment
● Two-datacenter operation
● One of the DCs is ‘more equal’ than the other: it hosts the ‘active
master’
○ All the writes happen there, and it replicates to the other master, as
well as to the secondaries
● The replicas have to be movable between the masters and the
secondaries ...

http://bit.ly/kinjadb

● We don’t fix broken MySQL instances; when there is a problem we drop
and recreate them as fresh clones
● Backup and slave creation use the same codebase
● All operations are defined as Ansible playbooks; they are called from
Python wrappers and can be managed from a Jenkins instance
○ Backup

Pre-GTID master maintenance
● Put site to read-only mode (time!)
● Put the database to read-only mode (SET GLOBAL READ_ONLY=1)
● On secondary master (SHOW MASTER STATUS - 2x)
● On replicas: (CHANGE MASTER TO … )
● Failover applications to write to the new master

GTID master maintenance
● On replicas: (CHANGE MASTER TO … MASTER_AUTO_POSITION=1)
● Failover applications to write to the new master*
*At this moment we still have to disable writes, but only for about 30 seconds
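The difference between the two procedures can be sketched as a small Python helper that builds the repositioning statement. This is an illustration only: the function name and the host name are hypothetical, and connection options such as user and password are left out.

```python
from typing import Optional


def change_master_sql(host: str,
                      log_file: Optional[str] = None,
                      log_pos: Optional[int] = None) -> str:
    """Build a CHANGE MASTER TO statement for a replica.

    Without coordinates the GTID form is used; with coordinates the
    traditional form is used, which needs values read from the new
    master (SHOW MASTER STATUS) while writes are stopped.
    """
    if log_file is None and log_pos is None:
        return f"CHANGE MASTER TO MASTER_HOST='{host}', MASTER_AUTO_POSITION=1"
    if log_file is None or log_pos is None:
        raise ValueError("binlog file name and position must be given together")
    return (f"CHANGE MASTER TO MASTER_HOST='{host}', "
            f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos}")


# Hypothetical host name and coordinates, for illustration only.
print(change_master_sql("db2.example"))
print(change_master_sql("db2.example", "mysql-bin.000042", 120))
```

With GTID the statement is the same for every replica, so nothing has to be looked up per master before the move.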

Common failures with GTID: Errant transactions
● The executed GTID set of the node about to be promoted is not a subset
of the current master’s executed GTID set
• Use GTID_SUBSET() in pre-flight checks
• mysql> select gtid_subset(
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7')
as slave_is_subset;
+-----------------+
| slave_is_subset |
+-----------------+
|               0 |
+-----------------+

● select gtid_subtract(
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7')
as errant_transactions;
+------------------------------------------+
| errant_transactions                      |
+------------------------------------------+
| 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9 |
+------------------------------------------+
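The set logic behind these two checks can be sketched in plain Python. This is not MySQL's implementation, only a model of what GTID_SUBSET() and GTID_SUBTRACT() compute; the helper names are hypothetical, and expanding ranges into integer sets is only practical for small examples like the one on the slide.

```python
from typing import Dict, Set

GtidSet = Dict[str, Set[int]]


def parse_gtid_set(text: str) -> GtidSet:
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: {sequence numbers}}."""
    result: GtidSet = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        seqnos = result.setdefault(uuid, set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            seqnos.update(range(int(first), int(last or first) + 1))
    return result


def format_gtid_set(gtids: GtidSet) -> str:
    """Turn {uuid: {sequence numbers}} back into 'uuid:a-b:c' notation."""
    parts = []
    for uuid in sorted(gtids):
        seqnos = sorted(gtids[uuid])
        if not seqnos:
            continue
        ranges, start, prev = [], seqnos[0], seqnos[0]
        for n in seqnos[1:]:
            if n != prev + 1:
                ranges.append((start, prev))
                start = n
            prev = n
        ranges.append((start, prev))
        body = ":".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
        parts.append(f"{uuid}:{body}")
    return ",".join(parts)


def gtid_subset(subset: str, superset: str) -> bool:
    """True if every GTID in `subset` is also in `superset`."""
    a, b = parse_gtid_set(subset), parse_gtid_set(superset)
    return all(seqnos <= b.get(uuid, set()) for uuid, seqnos in a.items())


def gtid_subtract(left: str, right: str) -> str:
    """GTIDs that are in `left` but not in `right`."""
    a, b = parse_gtid_set(left), parse_gtid_set(right)
    return format_gtid_set({u: s - b.get(u, set()) for u, s in a.items()})


candidate = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,"
             "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9")
master = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,"
          "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7")
print(gtid_subset(candidate, master))    # False
print(gtid_subtract(candidate, master))  # 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9
```

A candidate is safe to promote only when the subset check passes; whatever the subtraction returns are the errant transactions that exist on the candidate but not on the master.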

Common failures with GTID: you still have to tell
where to start replication after rebuild
● CHANGE MASTER TO … MASTER_AUTO_POSITION=1 is not enough.
● First you have to set global gtid_purged=... (xtrabackup gives this
information) (because xtrabackup is our friend!)
[root@slave01.xyz /var/lib/mysql]# cat xtrabackup_slave_info
SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934,
cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732,
e907792a-8417-11e3-a037-b4b52f51dbf8:1-25858698180';
CHANGE MASTER TO MASTER_AUTO_POSITION=1
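As a rough sketch of how a rebuild script might consume this file, the Python below extracts the gtid_purged value and returns the two statements in the order they must run. The function name is hypothetical, the file layout is assumed to match the output shown above, and connection options for CHANGE MASTER TO are omitted.

```python
import re
from typing import List


def statements_from_slave_info(text: str) -> List[str]:
    """Return the statements to run on a rebuilt slave, in order.

    gtid_purged has to be set before replication is pointed at the
    master with MASTER_AUTO_POSITION=1.
    """
    match = re.search(r"gtid_purged\s*=\s*'([^']*)'", text, re.IGNORECASE)
    if match is None:
        raise ValueError("no gtid_purged assignment found")
    # The GTID set may be wrapped over several lines; drop all whitespace.
    gtid_set = re.sub(r"\s+", "", match.group(1))
    return [
        f"SET GLOBAL gtid_purged='{gtid_set}'",
        "CHANGE MASTER TO MASTER_AUTO_POSITION=1",
    ]


# Sample input modelled on the file contents shown above.
SAMPLE = (
    "SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934,\n"
    "cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732';\n"
    "CHANGE MASTER TO MASTER_AUTO_POSITION=1\n"
)
for statement in statements_from_slave_info(SAMPLE):
    print(statement)
```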

Common failures with GTID: Server UUID changes
after rebuild
● Xtrabackup doesn’t back up auto.cnf
• This is good for rebuilding slaves
• Can come as a surprise if the master is restored
● A write on a freshly rebuilt node introduces a new UUID
• The workaround is to restore the UUID in auto.cnf as well

Common failures with GTID: There are transactions
where you don’t expect them
● Crashed table repairs will appear as new transactions
• STOP SLAVE
• RESET MASTER
• SET GLOBAL gtid_purged=<...>
• START SLAVE

The not so obvious failure
● We had the opposite of errant transactions: transactions missing from
the slave; we called them GTID holes
• http://bit.ly/gtidholefill
● The corresponding event on the master was empty
● Slaves with such holes couldn’t be repositioned to other masters: they
wanted a transaction which is no longer in the binlogs

The not so obvious failure: takeaways
● The issue was present with non-GTID replication as well
• But was never spotted
● Most of the missing transactions were on frequently updated data, so they
meant occasional inconsistency
● A good example that it’s harder to lose transactions with GTID replication
● Fix: in mf_iocache2.c:

GTID holes by default: MTS
● The aforementioned GTID holes can be present intermittently at any time
if a multi-threaded slave (MTS) is used
● Having holes in a GTID sequence can be a valid state with MTS and STOP
SLAVE
● We reverted to a non-multithreaded slave during debugging to check whether
the remaining holes were a side effect of MTS, but they weren’t
● (In MySQL 5.6, MTS doesn’t really solve anything if the schemas are
not written equally, which is true in our case)

Skipping replication event
● Injecting empty transactions
● In SQL
• set gtid_next='<gtid>'; begin; commit; set gtid_next='automatic';
• http://bit.ly/gtidholefill
● Some tools exist for this (pt-slave-restart)
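Hole filling along these lines can be sketched in Python: find the gaps in one server's executed intervals and emit an empty transaction for each missing GTID. The helper names are hypothetical and this is not the linked tool. Note that an injected empty transaction only marks the GTID as executed on the slave; the original change is not applied.

```python
from typing import List, Tuple


def find_holes(intervals: str) -> List[int]:
    """Return sequence numbers missing between the intervals of one UUID.

    '1-5:7-10' -> [6]. Numbers below the first interval are not
    reported, since they may simply have been purged.
    """
    ranges: List[Tuple[int, int]] = []
    for interval in intervals.split(":"):
        first, _, last = interval.partition("-")
        ranges.append((int(first), int(last or first)))
    ranges.sort()
    holes: List[int] = []
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        holes.extend(range(prev_end + 1, next_start))
    return holes


def empty_transaction_sql(uuid: str, seqno: int) -> str:
    """Statement sequence that commits an empty transaction for one GTID."""
    return (f"SET gtid_next='{uuid}:{seqno}'; BEGIN; COMMIT; "
            "SET gtid_next='AUTOMATIC';")


# Sample UUID and intervals, for illustration only.
uuid = "8e3648e4-bc14-11e3-8d4c-0800272864ba"
for seqno in find_holes("1-5:7-10:13"):
    print(empty_transaction_sql(uuid, seqno))
```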

Consistency checks: pt-table-checksum
● Checksumming multiple levels of replication with pt-table-checksum
● ROW based replication limitation
● Workaround is checking on individual levels
● That introduces extra writes on second level slaves
• Luckily, GTID is not as error prone regarding consistency issues
