2. Who we are
• Gawker Media: World’s largest independent media company
• About 67M monthly unique visitors in the US, about 107M worldwide
• We run on the Kinja platform (http://kinja.com)
• Our blogs: gawker.com, jalopnik.com, jezebel.com, lifehacker.com,
gizmodo.com, deadspin.com, io9.com, kotaku.com
4. Traditional replication
● The MySQL server writes the data to be replicated to the binary log
● A replica needs to know at least the following:
• Binary log file name
• Position in binary log
5. Traditional replication caveats
● If a replica has to be moved to a different master, you have to
ensure that all of the servers are in the same state.
● You have to stop all writes (SET GLOBAL read_only and SET GLOBAL
super_read_only), then reposition the slave against the new database
master using its coordinates (SHOW MASTER STATUS …). After the
replica is moved, writes can be resumed.
6. GTID in a nutshell
● Behind the scenes it is the same as traditional replication
• filename, position
● GTID replication
• uuid:seqno
7. GTID in a nutshell
● UUID identifies the server
● Seqno identifies the Nth transaction from that server
• This is harder to find than just seeking to a byte offset
• The binlog containing the GTID needs to be scanned
● Nodes can be repositioned easily (anywhere in replication hierarchy)
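To make the uuid:seqno notation concrete: a GTID set, as MySQL prints it, is a comma-separated list of uuid:interval[:interval] members, where an interval is either a single seqno or a first-last range. A minimal sketch of turning that text into a Python structure follows; parse_gtid_set is a hypothetical helper written for this illustration, not part of MySQL or of any tool mentioned in this deck.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7, uuid2:3' into {uuid: [(first, last), ...]}.

    Ranges are inclusive, exactly as they appear in the text form.
    Hypothetical helper for illustration only.
    """
    parsed = {}
    for member in text.split(","):
        uuid, *ranges = member.strip().split(":")
        if not uuid:
            continue  # tolerate empty members such as a trailing comma
        for r in ranges:
            first, _, last = r.partition("-")
            # A bare seqno such as '7' is the one-element range 7-7.
            parsed.setdefault(uuid.lower(), []).append(
                (int(first), int(last or first))
            )
    return {uuid: sorted(intervals) for uuid, intervals in parsed.items()}


executed = parse_gtid_set(
    "b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29, "
    "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9"
)
```

Here executed maps each server UUID to the list of transaction ranges that originated on that server.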
9. Gawker’s environment
● Two-datacenter operation
● One of the DCs is ‘more equal’ than the other: it hosts the ‘active
master’
○ All the writes happen there, and it replicates to the other master, as
well as secondaries
● The replicas have to be movable between the masters and the
secondaries ...
11. Gawker’s environment
● We don’t fix broken MySQL instances; when there is a problem, we drop
them and recreate them as fresh clones
● Backup and slave creation use the same codebase
● All operations are defined as Ansible playbooks, which are called from
Python wrappers and can be managed from a Jenkins instance
○ Backup
12. Pre-GTID master maintenance
● Put the site into read-only mode (time!)
● Put the database into read-only mode (SET GLOBAL READ_ONLY=1)
● On the secondary master: SHOW MASTER STATUS (2x)
● On replicas: (CHANGE MASTER TO … )
● Fail over applications to write to the new master
13. GTID master maintenance
● On replicas: (CHANGE MASTER TO … MASTER_AUTO_POSITION=1)
● Fail over applications to write to the new master*
*At this point we still have to disable writes, but only for about 30 seconds
15. Common failures with GTID: Errant transactions
● The executed GTID set of the node about to be promoted is not a subset
of the current master’s executed GTID set
• Use GTID_SUBSET() in pre-flight checks
• mysql> select gtid_subset(
    'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,
     8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
    'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,
     8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as slave_is_subset;
+-----------------+
| slave_is_subset |
+-----------------+
|               0 |
+-----------------+
16. Common failures with GTID: Errant transactions
● select gtid_subtract(
    'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,
     8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9',
    'b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,
     8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as errant_transactions;
+------------------------------------------+
| errant_transactions                      |
+------------------------------------------+
| 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9 |
+------------------------------------------+
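For pre-flight checks that run outside the database (for example from a Python wrapper), the same set arithmetic can be approximated in plain Python. The sketch below is an assumption-laden re-implementation, not MySQL's code: the function names merely mirror GTID_SUBTRACT() and GTID_SUBSET(), and it assumes both inputs are well-formed GTID sets as printed by the server.

```python
def parse_gtid_set(text):
    """Return {uuid: [(first, last), ...]} with inclusive seqno ranges."""
    parsed = {}
    for member in text.split(","):
        uuid, *ranges = member.strip().split(":")
        if not uuid:
            continue
        for r in ranges:
            first, _, last = r.partition("-")
            parsed.setdefault(uuid.lower(), []).append(
                (int(first), int(last or first))
            )
    return {uuid: sorted(iv) for uuid, iv in parsed.items()}


def subtract_intervals(a, b):
    """Parts of the sorted inclusive intervals in a that b does not cover."""
    remaining = []
    for first, last in a:
        cursor = first
        for b_first, b_last in b:
            if b_last < cursor or b_first > last:
                continue  # no overlap with what is left of this interval
            if b_first > cursor:
                remaining.append((cursor, b_first - 1))
            cursor = b_last + 1
            if cursor > last:
                break
        if cursor <= last:
            remaining.append((cursor, last))
    return remaining


def gtid_subtract(set1, set2):
    """GTIDs present in set1 but not in set2, in GTID-set text form."""
    a, b = parse_gtid_set(set1), parse_gtid_set(set2)
    members = []
    for uuid in sorted(a):
        left = subtract_intervals(a[uuid], b.get(uuid, []))
        if left:
            ranges = ":".join(
                str(f) if f == l else f"{f}-{l}" for f, l in left
            )
            members.append(f"{uuid}:{ranges}")
    return ",".join(members)


def gtid_subset(set1, set2):
    """True if every GTID in set1 is also in set2."""
    return gtid_subtract(set1, set2) == ""


candidate = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-29,"
             "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9")
master = ("b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30,"
          "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7")
errant = gtid_subtract(candidate, master)
```

With the sets from the slides, gtid_subset(candidate, master) is False and errant holds the same 8e3648e4-…:8-9 range that MySQL reports.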
18. Common failures with GTID: you still have to tell it
where to start replication after a rebuild
● CHANGE MASTER TO … MASTER_AUTO_POSITION=1 is not enough.
● First you have to SET GLOBAL gtid_purged=... (xtrabackup gives you
this information, because xtrabackup is our friend!)
[root@slave01.xyz /var/lib/mysql]# cat xtrabackup_slave_info
SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934,
cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732,
e907792a-8417-11e3-a037-b4b52f51dbf8:1-25858698180';
CHANGE MASTER TO MASTER_AUTO_POSITION=1
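Since the rebuild is automated, the GTID set has to be picked out of that file by a script. A minimal sketch of how a wrapper might do this, assuming the file looks like the listing above; read_gtid_purged is a hypothetical name for this illustration, not a function from xtrabackup or from our codebase.

```python
import re


def read_gtid_purged(slave_info_text):
    """Extract the GTID set from the text of xtrabackup_slave_info.

    Hypothetical helper: assumes a single SET GLOBAL gtid_purged='...'
    statement, possibly wrapped over several lines.
    """
    match = re.search(
        r"SET\s+GLOBAL\s+gtid_purged\s*=\s*'([^']*)'",
        slave_info_text,
        re.IGNORECASE,
    )
    if match is None:
        raise ValueError("no gtid_purged statement found")
    # Drop the line breaks and spaces between the members of the set.
    return re.sub(r"\s+", "", match.group(1))


sample = """SET GLOBAL gtid_purged='075d81d6-8d7f-11e3-9d88-b4b52f517ce4:1-1075752934,
cd56d1ad-1e82-11e5-947b-b4b52f51dbf8:1-1030088732,
e907792a-8417-11e3-a037-b4b52f51dbf8:1-25858698180';
CHANGE MASTER TO MASTER_AUTO_POSITION=1
"""
purged = read_gtid_purged(sample)
```

The returned string can then be used in the SET GLOBAL gtid_purged statement on the rebuilt node before replication is started.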
19. Common failures with GTID: Server UUID changes
after rebuild
● Xtrabackup doesn’t back up auto.cnf
• This is good for rebuilding slaves
• Can come as a surprise if the master is restored
● A write on a freshly rebuilt node introduces a new UUID
• The workaround is to restore the UUID in auto.cnf as well
20. Common failures with GTID: There are transactions
where you don’t expect them
● Crashed table repairs will appear as new transactions
• STOP SLAVE
• RESET MASTER
• SET GLOBAL gtid_purged=<...>
• START SLAVE
21. The not so obvious failure
● We had the opposite of errant transactions: transactions missing from
the slave. We called them GTID holes
• http://bit.ly/gtidholefill
● The corresponding event on the master was empty
● Slaves with such holes couldn’t be repositioned to other masters: they
wanted a transaction which is no longer in the binlogs
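A hole shows up in a slave's executed GTID set as more than one interval for the same UUID, for example uuid:1-5:7-10 when transaction 6 is missing. A rough sketch of spotting such gaps from the text of the set follows; find_gtid_holes is a hypothetical helper for this illustration. It only reports gaps between intervals; a set that starts above 1 is not treated as a hole.

```python
def find_gtid_holes(gtid_set):
    """Return {uuid: [(first_missing, last_missing), ...]} for each UUID
    whose executed intervals have gaps between them.

    Hypothetical helper; expects the text form MySQL prints, e.g. the
    value of gtid_executed.
    """
    holes = {}
    for member in gtid_set.split(","):
        uuid, *ranges = member.strip().split(":")
        if not uuid:
            continue
        intervals = sorted(
            (int(first), int(last or first))
            for first, _, last in (r.partition("-") for r in ranges)
        )
        gaps = [
            (prev_last + 1, next_first - 1)
            for (_, prev_last), (next_first, _) in zip(intervals, intervals[1:])
            if next_first > prev_last + 1
        ]
        if gaps:
            holes[uuid] = gaps
    return holes


holes = find_gtid_holes(
    "8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5:7-10:12,"
    "b0bb2e56-6121-11e5-9e7c-d73eafb37531:1-30"
)
```

In this made-up example, holes contains only the first UUID, with seqnos 6 and 11 reported as missing.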
22. The not so obvious failure: takeaways
● The issue was present with non-GTID replication as well
• But was never spotted
● Most of the affected transactions touched frequently updated data, so
they caused only occasional inconsistency
● A good example of how it’s harder to lose transactions with GTID
replication
● Fix: in mf_iocache2.c:
23. GTID holes by default: MTS
● The aforementioned GTID holes can be intermittently present at any
time if a multi-threaded slave (MTS) is used
● Having holes in a GTID sequence can be a valid state with MTS and STOP
SLAVE
● We reverted to a non-multithreaded slave during debugging to see
whether the holes were a side effect of MTS, but they weren’t
● (In MySQL 5.6, MTS doesn’t really solve anything if the schemas are
not written equally, which is true in our case)
25. Skipping replication event
● Injecting empty transactions
● In SQL
• set gtid_next='<gtid>'; begin; commit; set gtid_next='automatic';
• http://bit.ly/gtidholefill
● Some tools exist for this (pt-slave-restart)
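When more than one GTID has to be skipped or filled, the statements above are easy to generate. A minimal sketch, assuming the missing seqnos for one UUID are already known; empty_transaction_sql is a hypothetical helper that only builds the SQL text and does not connect to any server.

```python
def empty_transaction_sql(uuid, first, last=None):
    """Build the statements that commit an empty transaction for each
    GTID from uuid:first to uuid:last (inclusive).

    Hypothetical helper: returns a list of SQL strings meant to be run
    in order within a single session on the slave.
    """
    last = first if last is None else last
    statements = []
    for seqno in range(first, last + 1):
        statements += [
            f"SET GTID_NEXT='{uuid}:{seqno}';",
            "BEGIN;",
            "COMMIT;",
        ]
    # Hand GTID assignment back to the server afterwards.
    statements.append("SET GTID_NEXT='AUTOMATIC';")
    return statements


fill = empty_transaction_sql("8e3648e4-bc14-11e3-8d4c-0800272864ba", 8, 9)
```

Here fill holds seven statements: one SET GTID_NEXT / BEGIN / COMMIT triple for each of the two GTIDs, followed by the switch back to AUTOMATIC.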
26. Consistency checks: pt-table-checksum
● Checksumming multiple levels of replication with pt-table-checksum
● Limitation with row-based replication
● The workaround is to check each replication level individually
● That introduces extra writes on second-level slaves
• Luckily, GTID replication is not as prone to consistency issues