pg_chameleon is a lightweight replication system written in Python. The tool connects to the MySQL replication protocol and replicates the data in PostgreSQL.
The talk covers the history, the logic and the future of the tool.
The ninja elephant, scaling the analytics database in Transferwise - Federico Campoli
Business intelligence and analytics are at the core of any great company, and Transferwise is no exception.
The talk will start with a brief history of the legacy analytics implemented with MySQL and how we scaled up the performance using PostgreSQL. In order to get fresh data from the core MySQL databases in real time we used a modified version of pg_chameleon which also obfuscated the PII data.
The talk will also cover the challenges and the lessons learned by the developers and analysts when bridging MySQL with PostgreSQL.
pg_chameleon MySQL to PostgreSQL replica made easy - Federico Campoli
pg_chameleon is a lightweight replication system written in Python. The tool can connect to the MySQL replication protocol and replicate the data changes in PostgreSQL.
Whether the user needs to set up a permanent replica between MySQL and PostgreSQL or perform an engine migration, pg_chameleon is the perfect tool for the job.
The talk will cover the history, the current implementation and the future releases.
The audience will learn how to set up a replica from MySQL to PostgreSQL in a few easy steps. It will also cover the lessons learned during the tool’s development cycle.
pg_chameleon is a lightweight replication system written in Python. The tool connects to the MySQL replication protocol and replicates the data in PostgreSQL.
The author will talk about the history, the logic behind the available functions, and give an interactive usage example.
PostgreSQL is one of the finest database systems available.
The talk will cover the history, the basic concepts of PostgreSQL's architecture, and how the community behind "the most advanced open source database" works.
The document discusses backup and recovery strategies in PostgreSQL. It describes logical backups using pg_dump, which takes a snapshot of the database and outputs SQL scripts or custom files. It also describes physical backups using write-ahead logging (WAL) archiving and point-in-time recovery (PITR). With WAL archiving enabled, PostgreSQL archives WAL files, allowing recovery to any point between backups by restoring the backup files and replaying the WAL logs. The document provides steps for performing PITR backups, including starting the backup, copying files, stopping the backup, and recovery by restoring files and using a recovery.conf file.
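As a rough sketch of the PITR workflow summarised above (the paths are placeholders, and the recovery.conf mechanism applies to PostgreSQL releases older than 12, the era this document covers):

-- postgresql.conf: enable WAL archiving before taking the base backup
-- archive_mode = on
-- archive_command = 'cp %p /mnt/backup/wal/%f'
SELECT pg_start_backup('base_backup');
-- copy the data area with a file-level tool, e.g. rsync -a $PGDATA/ /mnt/backup/base/
SELECT pg_stop_backup();
-- to recover: restore the base backup into $PGDATA, then create recovery.conf with
-- restore_command = 'cp /mnt/backup/wal/%f %p'
-- recovery_target_time = '2017-05-30 12:00:00'  -- optional point-in-time target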
PostgreSQL - backup and recovery with large databases - Federico Campoli
Life on a rollercoaster, backup and recovery with large databases
Dealing with large databases is always a challenge.
The backup and HA procedures evolve as the database installation grows over time.
The talk will cover the problems solved by the DBA over four years of working with large databases, whose size increased from a 1.7 TB single cluster up to 40 TB in a multi-shard environment.
The talk will cover both disaster recovery with pg_dump and high availability with log shipping/streaming replication.
The presentation is based on a real story. The names have been changed to protect the innocent.
The document discusses PostgreSQL's internal architecture and components. It describes the data area, which stores data files on disk, and key directories like pg_xlog for write-ahead logs. It explains the buffer cache and clock sweep algorithm for managing memory, and covers the multi-version concurrency control (MVCC) which allows simultaneous transactions. TOAST storage is also summarized, which stores large data values externally.
The document discusses PostgreSQL and its capabilities. It describes how PostgreSQL was created in 1982 and became open source in 1996. It discusses PostgreSQL's support for large databases, high-performance transactions using MVCC, ACID compliance, and its ability to run on most operating systems. The document also covers PostgreSQL's JSON and NoSQL capabilities and provides performance comparisons of JSON, JSONB and text fields.
Slides from the Brighton PostgreSQL meetup presentation. An all around PostgreSQL exploration. The rocky physical layer, the treacherous MVCC’s swamp and the buffer manager’s garden.
The document discusses PostgreSQL's physical storage structure. It describes the various directories within the PGDATA directory that stores the database, including the global directory containing shared objects and the critical pg_control file, the base directory containing numeric files for each database, the pg_tblspc directory containing symbolic links to tablespaces, and the pg_xlog directory which contains write-ahead log (WAL) segments that are critical for database writes and recovery. It notes that tablespaces allow spreading database objects across different storage devices to optimize performance.
This document is an introduction to PostgreSQL presented by Federico Campoli to the Brighton PostgreSQL Users Group. It covers the history and development of PostgreSQL, its features including data types, JSON/JSONB support, and performance comparisons. The presentation includes sections on the history of PostgreSQL, its features and capabilities, NOSQL support using JSON/JSONB, and concludes with a wrap up on PostgreSQL and related projects.
This document discusses JPQL join fetch queries in JPA. It explains that join fetch queries can result in a cartesian product between tables if not written properly. It provides examples of order, member, and delivery entities to illustrate this issue. It also discusses using the distinct keyword in JPQL queries, clarifying that distinct applies to root entities, not rows in the SQL query. Finally, it mentions that fetch joins can be optimized using batch fetching.
- Covenant Ko is a founder and maintainer of the Github organization 'brave-people' and has a technical blog that has received over 410,000 visits
- He explains Anderson's Formula for calculating expected value, which takes into account probability of success (P) and payoff for success (S)
- Some tips he provides include making the common case fast, using static fields for caches but not databases, and that StringBuffer is thread-safe while StringBuilder is not
JPA Week3 Entity Mapping / Hexagonal Architecture - Covenant Ko
The document discusses Hexagonal Architecture and its principles. It explains that the core domain layer should not depend on other layers like the data layer. It provides examples of package structures for Hexagonal Architecture and sample code that separates ports and adapters. Case studies are presented on how companies have implemented Hexagonal Architecture for microservices and APIs.
Uber’s blog post about its migration from PostgreSQL to MySQL made a lot of buzz in the PostgreSQL community. Many developers in the PostgreSQL community realized the shortcomings of our table engine (which is still the only one). As a result, many patches were developed in order to overcome the shortcomings mentioned by Uber. Some of those patches overlap, and some even contradict each other. Those patches include: indirect indexes (indexes which reference the primary key value), WARM (write-amplification reduction method), and RDS (recently dead storage). There are also discussions about pluggable table engines and an undo log.
In this talk I’ll consider the points of Uber’s blog post from a PostgreSQL developer’s point of view. I’ll explain which points I agree with, which I disagree with, and which I partially agree with. I’ll also consider the developments of the PostgreSQL community, and how they can overcome the mentioned shortcomings from my point of view.
1. The document discusses writing test code and outlines a multi-week plan for learning testing techniques like JPA, ORM, TDD, DDD, and MVC.
2. It recommends practices like using Mockito to mock dependencies, avoiding unnecessary mocking, and following FIRST principles of tests being fast, independent, repeatable, self-validating, and timely.
3. The document also provides resources on functional decomposition, abstract data types, object-oriented design, and troubleshooting services.
In-memory OLTP storage with persistence and transaction support - Alexander Korotkov
Nowadays it is becoming evident that a single storage engine can't be "one size fits all". The PostgreSQL community is starting its movement towards pluggable storages. A significant restriction imposed by the current approach is compatibility: we consider pluggable storages to be compatible with (at least some) existing index access methods. That means we have a long way to go, because we have to extend our index AMs before we can add the corresponding features to the pluggable storages themselves.
In this talk we would like to look at this problem from another angle, and see what we can achieve if we try to build a storage completely from scratch (using the FDW interface for prototyping). Thus, we will show you a prototype of an in-memory OLTP storage with transaction support and snapshot isolation. Internally it is implemented as an index-organized table (B-tree) with an undo log and optional persistence. That means it is quite different from what we have in PostgreSQL now.
The advantages of this in-memory storage, proven by benchmarks, are: better multicore scalability (thanks to having no buffer manager), reduced bloat (thanks to the undo log) and optimized I/O (thanks to logical WAL logging).
Not Less, Not More: Exactly Once, Large-Scale Stream Processing in Action - Paris Carbone
Large-scale data stream processing has come a long way to where it is today. It combines all the essential requirements of modern data analytics: subsecond latency, high throughput and, impressively, strong consistency. Apache Flink is a system that serves as a proof of concept of these characteristics, and it is mainly well known for its lightweight fault tolerance. Data engineers and analysts can now let the system handle terabytes of computational state without worrying about failures that can potentially occur.
This presentation describes all the fundamental challenges behind exactly-once processing guarantees in large-scale streaming in a simple and intuitive way. Furthermore, it demonstrates the basic and extended versions of Flink's state-of-the-art snapshotting algorithm, tailored to the needs of a dataflow graph.
This document contains information about Covenant Ko, including his background and links to his Github and blog. It discusses ORM and JPA, with JPA standing for Java Persistence API. It includes a pop quiz question asking about the relationships between ORM, JPA, and JPA providers like Hibernate. The document also references a case study on using Hazelcast for Hibernate second-level caching to improve performance.
- Beans in Spring are objects that are managed by the Spring IoC container. They form the backbone of a Spring application.
- Beans and their dependencies are configured in metadata that is used by the container to manage the complete lifecycle of a bean - instantiating, assembling, and configuring beans.
- There are two main ways to configure beans - using XML configuration or using annotations in Java code. Common annotations used include @Component, @Controller, @Service and @Repository.
The document discusses Covenant Ko and chapter 5. It mentions Covenant Ko's name and company (11번가) and links to his Github and tech blog. It then covers the GRASP pattern for software design, including information expert, controller and other patterns. It discusses the need for responsibility-driven design and applying the GRASP patterns.
The document discusses the life cycle of Spring beans. It begins with an overview of how beans are defined using XML, stereotype annotations, and configuration. It then covers the key stages in the bean life cycle: 1) Context loading where the configuration is merged and validated, 2) Dependency injection where dependencies are injected either through constructors, setters or fields, and 3) A pop quiz about the final bean definition and how it can be modified through a BeanFactoryPostProcessor.
This document provides an overview of how to run, debug, and tune Apache Flink applications. It discusses:
- Writing and testing Flink jobs locally and submitting them to a cluster for execution
- Debugging techniques like logs, accumulators, and remote debugging
- Tuning jobs by configuring parallelism, memory settings, and I/O directories
- Common issues like OutOfMemoryErrors and how to resolve them
JPA Study Week2 - Object Relational Mapping - Covenant Ko
This document discusses Object-Relational Mapping (ORM) and JPA. It begins with an introduction to the author and their background and credentials. It then poses some questions about ORM and defines it as mapping between object-oriented programming languages and relational databases. It discusses some of the challenges with ORM including differences in granularity, inheritance, identity, associations, and data navigation between objects and relational databases. It also covers ORM patterns like Active Record and Data Mapper and compares JPA, Hibernate, and Spring Data JPA. Finally, it provides a case study example of applying ORM and architectural patterns to an online ordering system.
pg_chameleon, MySQL to PostgreSQL replica made easy - Federico Campoli
Federico Campoli developed pg_chameleon to replicate data from MySQL to PostgreSQL. He has been passionate about IT since 1982 and loves PostgreSQL. pg_chameleon version 2.0 allows replication of multiple MySQL schemas to a PostgreSQL database. It uses two subprocesses to concurrently read the data changes from MySQL and replay them into PostgreSQL. The presentation covered pg_chameleon's history, how it works as a replica, setup instructions, and a demo. Future development plans include parallelizing the initial load to speed it up and adding logical replication from PostgreSQL.
EuroPython 2011 High Performance Python - Ian Ozsvald
I ran this as a 4 hour tutorial at EuroPython 2011 to teach High Performance Python coding.
Techniques covered include bottleneck analysis by profiling, bytecode analysis, converting to C using Cython and ShedSkin, use of the numerical numpy library and numexpr, multi-core and multi-machine parallelisation and using CUDA GPUs.
Write-up with 49 page PDF report: http://ianozsvald.com/2011/06/29/high-performance-python-tutorial-v0-1-from-my-4-hour-tutorial-at-europython-2011/
MySQL InnoDB Cluster and Group Replication in a Nutshell - Frederic Descamps
This document outlines the agenda and steps for a hands-on tutorial on MySQL InnoDB Cluster and Group Replication. The agenda includes preparing the workstation by setting up virtual machines, an overview of MySQL InnoDB Cluster and Group Replication, migrating from a master-slave topology to Group Replication, monitoring Group Replication, and application interaction with Group Replication. The first lab demonstrates the current master-slave setup. The migration plan involves installing MySQL InnoDB Cluster on a new server, restoring a backup, setting up asynchronous replication on the new server, adding it to the Group Replication group, pointing the application to a new node, and stopping asynchronous replication after catch up.
MySQL Group Replication in a nutshell - MySQL InnoDB Cluster - Frederic Descamps
Group Replication is a plugin that provides multi-master replication for MySQL. It allows transactions to be executed on any node and replicated in a synchronous manner to all other nodes. The changes are delivered in total order to each node using GTIDs to ensure strong consistency across the cluster. Certification and application of the changes occurs asynchronously on each node after the writeset has been synchronously delivered.
The document discusses whether the PyPy implementation of Python is ready for production use. It provides an overview of PyPy, benchmarks various workloads against CPython, and evaluates PyPy based on common criteria for determining if a software project is production-ready. While some workloads are slower on PyPy and it fails with some Python modules, it meets most criteria and provides performance improvements for CPU-bound tasks. Overall, the document concludes PyPy could be considered for production use, especially given its advantages in scalability and upcoming improvements to its just-in-time compiler and Python 3 support.
The PHP mysqlnd plugin talk - plugins an alternative to MySQL Proxy - Ulf Wendel
The document discusses PHP mysqlnd plugins as an alternative to MySQL Proxy for extending the functionality of the MySQL native driver (mysqlnd) in PHP. It describes how mysqlnd plugins can hook into and replace mysqlnd C API calls to add capabilities like load balancing, read/write splitting, and query logging, without needing additional software like MySQL Proxy. The speaker explains that mysqlnd plugins are written in C or PHP and work by overriding mysqlnd method functions at initialization to intercept and modify behavior.
Percona Live 2022 - PBM - The Backup Open Source Tool for MongoDB - Jean Da Silva
Backup and restore are two of the most important things for databases. We don't often use backups, but in a disaster situation it is crucial that they work.
In this session, we will discuss Percona Backup for MongoDB (PBM for short).
We will walk through the process of taking backups and executing restores. We will also introduce the newest backup method that PBM offers, the physical backup in addition to the logical backup. After the introduction of the backup methods, we will evaluate the backup and restore times, and how to store the backup on remote backup storage.
OSDC 2017 | Mgmt Config: Autonomous systems by James Shubin - NETWAYS
Mgmt is a next gen config management tool that takes a fresh look at existing automation problems. Three of the main design features of the tool include:
* Parallel execution
* Event driven mechanism
* Distributed architecture
This presentation will briefly introduce the tool and spend most of the time presenting and demoing some of the newer features in the project. We'll present some of the new resources (virt, password, etc) new features (libified mgmt, send/recv, DSL) and how these can be used to build autonomous systems. Finally we'll talk about some of the future designs we're planning and make it easy for new users to get involved and help shape the project.
This document contains summaries of various projects and analyses posted on GitHub by Takayoshi Iitsuka. It describes analyzing the intermediate layers of VAE models, approximating the distribution of MNIST data in high-dimensional spaces, using preemptible VMs on GCP, applying k-means clustering to MNIST, creating executable Python tutorial scripts, scraping templates, developing a "Pico-OS" for MicroPython, and speeding up recalculations in large Excel sheets. Links to the relevant GitHub repositories are provided.
OpenNebula, the foreman and CentOS play nice, too - inovex GmbH
This document discusses setting up a private cloud using OpenNebula and the Foreman. It begins with an introduction and agenda. It then covers installing CentOS, setting up a local YUM repository using Pulp, installing the Foreman for bare metal provisioning, and using Puppet modules. It demonstrates deploying OpenNebula nodes using the Foreman and provides an overview of accessing the new cloud. It notes there are some rough edges to address but the modules are minor. It concludes by thanking the audience and providing contact information.
This document provides instructions for setting up MySQL for Python (MySQLdb) on Mac OS X. It describes downloading MySQL from mysql.com and installing it, then downloading and building MySQLdb from sourceforge.net. It notes potential issues like missing header files and explains how to fix them by installing additional developer packages or changing symbolic links. Comments provide corrections and additional troubleshooting tips for issues users encountered.
Jakob Lorberblatt is an open source database consultant who loves to talk about software and MySQL. The document discusses the confusion around MySQL versions, potential issues when upgrading versions like deprecated parameters or syntax, and strategies for upgrading versions safely such as backing up data, testing on a clone, and using tools like Percona Toolkit to analyze differences. It also covers techniques for gradually moving to a newer version like using ProxySQL for real-time mirroring or black hole relays for multi-version replication.
This document discusses various profiling tools that can be used to analyze MySQL performance, including Oprofile, perf, pt-pmp, and the MySQL Performance Schema. It provides examples of how these tools have been used to identify and resolve specific MySQL performance bugs. While the Performance Schema is useful, it does not always provide sufficient detail and other system-wide profilers like Oprofile and perf are still needed in some cases to pinpoint performance issues.
2018 data engineering for ml asset management for features and modelsGe Org
This document discusses asset management challenges for machine learning models. It notes that ML can increase productivity by 50x but achieving this requires addressing hidden complexities, such as data and model dependencies. The document reviews several open source solutions for managing ML workflows, models, code, data and deployments. These include Seldon.io for serving ML, Vespa for low latency serving with data, DVC for versioning code and data, and PipelineAI for reproducible model pipelines and serving. However, the document notes there is no single solution and integrating components remains challenging due to the interconnected nature of ML systems.
This document discusses Android custom kernel and ROM design. It provides information on the speaker's custom kernel projects for the Nexus 4 and Nexus 5 devices, including the features and modifications made. It also covers the process of developing a custom kernel, including cloning the source code, adding features via patching or cherry-picking, and compiling the kernel. The document briefly discusses custom ROMs and the process for syncing ROM sources.
The document discusses using mRuby and lightweight APIs for microservices. It introduces mRuby as a lightweight Ruby implementation that can be embedded into applications. It then demonstrates using mRuby with Nginx through the ngx_mruby module to build a simple microservice for handling API requests. Benchmark tests show the mRuby implementation serving requests faster than a standard Rack implementation in Ruby. However, some downsides of mRuby are also noted, such as needing to recompile when adding dependencies and lack of features like require that make code less dry.
MySQL Group Replication - Hands-On Tutorial - Kenny Gryp
Group Replication is a plugin for MySQL that provides multi-master replication. It works by having each node send write transactions to other nodes through a group communication system. The writes are certified locally in an asynchronous manner to ensure total order of transactions across all nodes. Group Replication uses optimistic locking where local locks are released right after commit, and conflict detection happens during certification rather than at the start of transactions.
MySQL Document Store - when SQL & NoSQL live together... in peace! - Frederic Descamps
Frédéric Descamps gave a demonstration of MySQL Document Store, showing how it allows both SQL and NoSQL functionality. He migrated sample data from MongoDB to MySQL Document Store and performed queries and CRUD operations. The conclusion is that MySQL Document Store provides the best of both worlds by combining schemaless and flexible data with ACID compliance, SQL capabilities, and data integrity.
Similar to pg_chameleon a MySQL to PostgreSQL replica (20)
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides from me and Rik Marselis at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, and then measured continuously. Test environments can be used less, at a smaller scale and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs - Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
pg_chameleon a MySQL to PostgreSQL replica
1. pg_chameleon
MySQL to PostgreSQL replica
Federico Campoli
Transferwise
30 May 2017
2. Few words about the speaker
Born in 1972
Passionate about IT since 1982
3. Few words about the speaker
Born in 1972
Passionate about IT since 1982
mostly because of the TRON movie
Joined the Oracle DBA secret society in 2004
In love with PostgreSQL since 2006
Currently runs the Brighton PostgreSQL User group
Works at Transferwise as Data Engineer
4. Table of contents
1 Some history
2 MySQL Replica in a nutshell
3 A chameleon in the middle
4 Replica in action
5 Lessons learned
6 Wrap up
5. Table of contents
1 Some history
2 MySQL Replica in a nutshell
3 A chameleon in the middle
4 Replica in action
5 Lessons learned
6 Wrap up
6. The beginnings
Years 2006/2012
neo_my2pg.py
I wrote the script because of a struggling phpBB on MySQL
The database migration was successful
However phpBB didn’t work very well with PostgreSQL
7. The beginnings
Years 2006/2012
neo_my2pg.py
I wrote the script because of a struggling phpBB on MySQL
The database migration was successful
However phpBB didn’t work very well with PostgreSQL
The script is in Python 2.6
It’s a monolithic script
And it’s slow, very slow
8. The beginnings
Years 2006/2012
neo_my2pg.py
I wrote the script because of a struggling phpBB on MySQL
The database migration was successful
However phpBB didn’t work very well with PostgreSQL
The script is in Python 2.6
It’s a monolithic script
And it’s slow, very slow
It’s a good checklist for things to avoid when coding
https://github.com/the4thdoctor/neo_my2pg
9. I’m not scared of using the ORMs
Years 2013/2015
First attempt of pg_chameleon
Developed in Python 2.7
Used SQLAlchemy for extracting MySQL’s metadata
Proof of concept only
10. I’m not scared of using the ORMs
Years 2013/2015
First attempt of pg_chameleon
Developed in Python 2.7
Used SQLAlchemy for extracting MySQL’s metadata
Proof of concept only
It was built during the years of the roller coaster
Therefore it was just a way to discharge frustration
Abandoned after a while
11. I’m not scared of using the ORMs
Years 2013/2015
First attempt of pg_chameleon
Developed in Python 2.7
Used SQLAlchemy for extracting MySQL’s metadata
Proof of concept only
It was built during the years of the roller coaster
Therefore it was just a way to discharge frustration
Abandoned after a while
SQLAlchemy’s limitations were frustrating as well
And there was already pgloader doing the same job
12. pg_chameleon reborn
Year 2016
I revamped the project because I needed to replicate the data from MySQL to PostgreSQL.
And the library python-mysql-replication looked very promising for reading the MySQL replica protocol.
Trying won’t harm they said.
13. pg_chameleon v1
Compatible with CPython 2.7/3.3+
Removed SQLAlchemy
Replaced the mysqldb driver with PyMySQL
Added a command line helper
Installs in virtualenv and system wide
Shipped via pypi for easy installation
14. Table of contents
1 Some history
2 MySQL Replica in a nutshell
3 A chameleon in the middle
4 Replica in action
5 Lessons learned
6 Wrap up
15. MySQL Replica
MySQL replica is logical
When configured, the data changes are stored in the master’s binary log files
The slave gets the data changes from the master
The data changes are saved in the slave’s relay logs
The relay logs are used to replay the data on the slave
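For instance, both ends of this flow can be inspected from the MySQL client (standard commands, assuming replication is already configured):

-- on the master: the binary log file and position currently being written
SHOW MASTER STATUS;
-- on the slave: relay log coordinates and replication lag
SHOW SLAVE STATUS\G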
16. Log formats
STATEMENT: It logs the statements which are replayed on the slave.
It’s the best solution for performance; however, when replaying statements with non-deterministic functions this format generates different values on the slave (e.g. an insert with the uuid function).
ROW: It’s deterministic. This format logs the row image and the DDL queries.
This format is compulsory for using pg_chameleon.
MIXED: takes the best of both worlds. The master logs the statements unless a non-deterministic function is used. In that case it logs the row image.
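A minimal illustration of the non-deterministic problem (a hypothetical table, run with binlog_format=STATEMENT):

CREATE TABLE t_uuid (id VARCHAR(36));
-- only the statement text reaches the binary log; the slave re-executes uuid()
-- and stores a value different from the master's
INSERT INTO t_uuid VALUES (uuid());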
19. pg_chameleon
pg_chameleon mimics a MySQL slave’s behaviour
Reads the replica stream
Stores the decoded rows into a PostgreSQL table
PostgreSQL acts as relay log and replication slave
A PL/pgSQL function decodes the rows and replays the changes
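A minimal sketch of the reading side, following the documented usage of the python-mysql-replication library; pg_chameleon stores the decoded rows into PostgreSQL instead of printing them:

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (DeleteRowsEvent,
                                          UpdateRowsEvent,
                                          WriteRowsEvent)

mysql_settings = {'host': 'derpy', 'port': 3306,
                  'user': 'usr_replica', 'passwd': 'replica'}

# server_id must be unique among the replicas of this master
stream = BinLogStreamReader(connection_settings=mysql_settings,
                            server_id=100,
                            only_events=[DeleteRowsEvent,
                                         UpdateRowsEvent,
                                         WriteRowsEvent],
                            blocking=True)
try:
    for binlogevent in stream:  # loops forever waiting for new events
        for row in binlogevent.rows:
            # pg_chameleon would store the decoded row here
            print(binlogevent.schema, binlogevent.table, row)
finally:
    stream.close()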
20. MySQL replica + pg_chameleon
21. Features
Reads the schema and data from MySQL and restores them into a target
PostgreSQL schema
Sets up PostgreSQL to act as a MySQL slave
Basic DDL support (CREATE/DROP/ALTER TABLE, DROP PRIMARY KEY, TRUNCATE)
Handles the rubbish data coming from the replica stream and saves the
problematic rows in sch_chameleon.t_discarded_rows
Supports multiple MySQL sources for the replica
Provides basic replica monitoring
Can detach the replica from MySQL, leaving PostgreSQL ready to work as a
standalone server
23. MySQL configuration
The MySQL configuration file is usually stored in /etc/mysql/my.cnf.
To enable the binary logging, find the section [mysqld] and check that the
following parameters are set:
binlog_format= ROW
log-bin = mysql-bin
server-id = 1
binlog-row-image = FULL
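A quick, hypothetical way to double check that the settings are active, again sketched with PyMySQL (connection parameters are placeholders):

import pymysql

conn = pymysql.connect(host='derpy', port=3306,
                       user='root', password='secret')
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL VARIABLES WHERE Variable_name IN "
                "('binlog_format', 'log_bin', 'binlog_row_image', 'server_id')")
    for name, value in cur.fetchall():
        print('%s = %s' % (name, value))
conn.close()
# expected: binlog_format = ROW, log_bin = ON, binlog_row_image = FULL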
24. MySQL user for replica
Setup a replication user on MySQL
CREATE USER usr_replica;
SET PASSWORD FOR usr_replica=PASSWORD('replica');
GRANT ALL ON sakila.* TO 'usr_replica';
GRANT RELOAD ON *.* TO 'usr_replica';
GRANT REPLICATION CLIENT ON *.* TO 'usr_replica';
GRANT REPLICATION SLAVE ON *.* TO 'usr_replica';
FLUSH PRIVILEGES;
In our example we are using the sakila test database.
https://dev.mysql.com/doc/sakila/en/
25. PostgreSQL setup
Add a user on PostgreSQL capable of creating schemas and relations in the
destination database
CREATE USER usr_replica WITH PASSWORD 'replica';
CREATE DATABASE db_replica WITH OWNER usr_replica;
26. Install pg_chameleon
The simplest way to install pg_chameleon is within a virtual environment.
However, if you have root access on your system, the installation can be
system wide.
It’s important to upgrade pip before installing the package.
python3 -m venv venv
source venv/bin/activate
pip install pip --upgrade
pip install pg_chameleon
Execute chameleon.py to create the configuration directory
$HOME/.pg_chameleon/
chameleon.py
27. Replica setup
cd into $HOME/.pg_chameleon/ and copy config-yaml.example to default.yaml, then
edit the file adding the connection settings and the source and destination
schemas.
my_database: sakila
pg_database: db_replica
dest_schema: 'my_schema'

mysql_conn:
    host: derpy
    port: 3306
    user: usr_replica
    passwd: replica

pg_conn:
    host: derpy
    port: 5432
    user: usr_replica
    password: replica
28. Init replica
Activate the virtualenv and run
chameleon.py create_schema --config default
chameleon.py add_source --config default
chameleon.py init_replica --config default
Wait for init_replica to complete. If the database is large, consider running
init_replica in a screen or tmux session.
29. Start replica
Start the replica with
chameleon.py start_replica --config default
32. chameleon.py
Command line wrapper
Uses argparse to parse and execute the commands
Supports several commands
After the installation, executing chameleon.py creates the configuration
directory $HOME/.pg_chameleon/ with three subdirectories:
pid
logs
config
33. chameleon.py
Commands
drop_schema: drops the service schema sch_chameleon with the cascade option.
create_schema: creates the service schema sch_chameleon.
upgrade_schema: upgrades an existing sch_chameleon schema to a newer
version.
init_replica: creates the table structure from MySQL in PostgreSQL. The
MySQL tables are locked in read-only mode and the data is copied into the
PostgreSQL database. The master’s coordinates are stored in the
PostgreSQL replica catalogue.
start_replica: starts the replication from MySQL to PostgreSQL using the
master data stored in sch_chameleon.t_replica_batch. The master’s position is
updated when a new batch is processed.
list_config: lists the available configurations and their status ('ready',
'initialising', 'initialised', 'stopped', 'running', 'error').
34. chameleon.py
add_source: registers a new configuration file as a source.
drop_source: removes the configuration from the registered sources.
stop_replica: ends the replica process gracefully.
disable_replica: ends the replica process and disables the restart.
enable_replica: enables the replica process.
sync_replica: syncs the data between MySQL and PostgreSQL without dropping
the tables.
show_status: displays the replication status for each source, with the lag in
seconds and the last received event.
detach_replica: stops the replica stream, discards the replica setup and resets
the sequences in PostgreSQL to work as a standalone db.
35. global_lib.py
class global_config: loads the configuration parameters into the class
attributes.
class replica_engine: wraps the classes mysql_engine and pgsql_engine and
sets up the logging. The global_config instance is used to track the
configuration settings.
36. mysql_lib.py
class mysql_connection: connects to MySQL using the parameters provided by
replica_engine.
class mysql_engine: does all the magic for the replication setup and execution.
37. pg_lib.py
class pg_encoder: extends the JSON encoder and adds some special handling for
types like decimal and datetime.
class pgsql_connection: connects to the PostgreSQL database.
class pgsql_engine: does all the magic for rebuilding the data structure,
loading the data and migrating the schema.
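A minimal sketch of the pg_encoder idea, not the tool’s actual code: subclass json.JSONEncoder and convert the problematic types before serialisation.

import json
from datetime import date, datetime
from decimal import Decimal

class pg_encoder(json.JSONEncoder):
    def default(self, obj):
        # turn the types json can't handle into strings
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return str(obj)
        return json.JSONEncoder.default(self, obj)

row = {'amount': Decimal('9.99'), 'rental_date': datetime(2017, 5, 30)}
print(json.dumps(row, cls=pg_encoder))
# {"amount": "9.99", "rental_date": "2017-05-30T00:00:00"}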
39. sql_util.py
Consists of just one class, sql_token, which tokenises the MySQL queries using
regular expressions.
Yes, I have two problems now!
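For flavour, a toy regex-based tokenisation in the spirit of sql_token (illustrative only, not the actual patterns used by the tool):

import re

# a single, simplified pattern matching a CREATE TABLE statement
m_create_table = re.compile(
    r'\s*CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?`?(\w+)`?',
    re.IGNORECASE)

statement = 'CREATE TABLE IF NOT EXISTS `actor` (actor_id SMALLINT)'
token = m_create_table.match(statement)
if token:
    print('CREATE TABLE detected for table: %s' % token.group(1))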
40. Limitations
Tables require primary keys in order to be replicated
No daemonisation
Binary data is hexified to avoid issues with PostgreSQL
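A sketch of the hexify idea (illustrative, not the tool’s exact code): the binary value round-trips through its hex representation, sidestepping byte sequences that would upset PostgreSQL.

import binascii

blob = b'\x00\xff binary payload'        # raw bytes from the replica stream
hexified = binascii.hexlify(blob)        # safe to hand over to PostgreSQL
restored = binascii.unhexlify(hexified)  # later analysis can reverse it
assert restored == blob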
41. The future
pg_chameleon v2 development has already started. The first alpha will come out
soon.
The new version is a reorganisation of version 1 with several improvements.
Reorganised configuration files
Background copy with parallel processes
Separate daemon for read and replay
Improved monitoring
Python 3 only
43. init_replica tuning
The replica initialisation required several improvements.
The OOM killer is always happy to kill processes using large amounts of memory
Using a single general slice size doesn’t work well, because with large rows the
process crashes
Estimating the total rows for the user’s feedback is faster, but the output can
be odd
Using unbuffered cursors improves both the speed and the memory usage; see the
sketch below
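A sketch of the unbuffered approach using PyMySQL’s server-side cursor (connection parameters are placeholders): the rows are streamed from the server one slice at a time instead of being buffered in memory.

import pymysql
import pymysql.cursors

conn = pymysql.connect(host='derpy', port=3306,
                       user='usr_replica', password='replica',
                       database='sakila',
                       cursorclass=pymysql.cursors.SSCursor)
with conn.cursor() as cur:
    cur.execute('SELECT * FROM rental')
    while True:
        slice_of_rows = cur.fetchmany(10000)  # one slice at a time
        if not slice_of_rows:
            break
        # copy the slice into PostgreSQL here
conn.close()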
44. Strictness is an illusion. MySQL doubly so
MySQL’s lack of strictness is not a mystery.
The funny way MySQL manages defaults combined with NOT NULL can break the
replica.
Therefore any fields with NOT NULL added after the initialisation are always
created as NULLable in PostgreSQL.
45. I feel your lack of constraint disturbing
Rubbish data can be stored in MySQL without the DBMS raising any error.
When this happens the replicator traps the error when the change is replayed on
PostgreSQL and discards the problematic row.
The value is stored hexified in the table t_discarded_rows for later analysis.
47. Igor, the green little guy
The chameleon logo was created by Elena Toma, a talented Italian lady.
https://www.facebook.com/Tonkipapperoart/
The name Igor is inspired by Marty Feldman’s Igor, portrayed in the movie
Young Frankenstein.
48. Some numbers
Lines of code
global_lib.py 327
mysql_lib.py 401
pg_lib.py 670
sql_util.py 228
chameleon.py 58
Total lines of code 1684
49. pg_chameleon’s license
2 clause BSD License
Copyright (c) 2016,2017 Federico Campoli
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
50. Feedback please!
Please report any issue on github!
https://github.com/the4thdoctor/pg_chameleon
51. Boring legal stuff
MySQL Image source WikiCommons
Hard Disk image source WikiCommons
Tron image source Tron Wikia
52. Did you say hire?
WE ARE HIRING!
https://transferwise.com/jobs/
54. Contacts and license
Twitter: 4thdoctor_scarf
Blog: http://www.pgdba.co.uk
Brighton PostgreSQL Meetup:
http://www.meetup.com/Brighton-PostgreSQL-Meetup/
This document is distributed under the terms of the Creative Commons
55. pg_chameleon
MySQL to PostgreSQL replica
Federico Campoli
Transferwise
30 May 2017