The ninja elephant
Scaling the analytics database in Transferwise
Federico Campoli
3rd February 2017
Federico Campoli (Transferwise) The ninja elephant 3rd February 2017 1 / 56
First rule about talks, don’t talk about the speaker
Born in 1972
Passionate about IT since 1982 mostly because of the TRON movie
Joined the Oracle DBA secret society in 2004
In love with PostgreSQL since 2006
Currently runs the Brighton PostgreSQL User group
Works at Transferwise as Data Engineer
Table of contents
1 We have an appointment, and we are late!
2 The eye of the storm
3 MySQL Replica in a nutshell
4 How we did it
5 Maximum effort
6 Lessons learned
7 Wrap up
We have an appointment, and we are late!
The Gordian Knot of analytics db
I started the data engineer job in July 2016
I was involved in a task not customer facing
However the task was very critical to the business
I had to fix the performance issues on the MySQL analytics database,
which performed badly despite the considerable resources assigned to the VM
Tactical assessment
The existing database had the following configuration
MySQL 5.6
InnoDB buffer pool size 60 GB
20 CPUs
Database size 600 GB
Looker and Tableau for running the analytic queries
The main live database replicated into the analytics database
Several schemas from the service database imported on a regular basis
One schema used for obfuscating PII and denormalising the heavy queries
The frog effect
If you drop a frog in a pot of boiling water, it will of course frantically try to
clamber out. But if you place it gently in a pot of tepid water and turn the heat
on low, it will be slowly boiled to death.
The performance issues worsened over a two-year span
The obfuscation was done via custom views
The data size on the MySQL master increased over time,
causing the optimiser to switch to materialisation when accessing the views
The analytics tools struggled even under normal load
In busy periods the database became almost unusable
Analysts were busy tuning existing queries rather than writing new ones
A new solution was needed
The eye of the storm
One size doesn’t fit all
It was clear that MySQL was no longer a good fit.
However, the new solution had to meet some specific requirements.
Data updated in almost real time from the live database
PII obfuscated for the analysts
PII available in clear for the power users
The system should be able to scale out for several years
Modern SQL for better analytics queries
May the best database win
The analysts’ team shortlisted a few solutions.
Each solution partially covered the requirements.
Google BigQuery
Amazon RedShift
Snowflake
PostgreSQL
Google BigQuery and Amazon RedShift did not satisfy the analytics requirements
and were removed from the list.
Both PostgreSQL and Snowflake offered very good performance and modern SQL.
Neither of them offered a replication system from the MySQL system.
Straight into the cloud
Snowflake is a cloud-based data warehouse service. It’s based on Amazon S3 and
comes in different sizes.
Their pricing system is very appealing and the preliminary tests showed Snowflake
outperforming PostgreSQL [1]
[1] PostgreSQL on a single machine vs cloud-based parallel processing
Streaming copy
Using FiveTran, an impressive multi technology data pipeline, the data would flow
in real time from our production server to Snowflake.
Unfortunately there was just one little catch.
There was no support for obfuscation.
Customer comes first
At Transferwise we really care about the security of our customers’ data.
Our policy for PII data is that any personal information moving outside our
perimeter shall be obfuscated.
To be compliant, the database accessible by FiveTran would hold only
obfuscated data.
Proactive development
My DBA sense tingled. I foresaw the requirement and in my spare time I built
a proof of concept based on the replica tool pg_chameleon.
The tool, using a Python library, can replicate a MySQL database into
PostgreSQL.
The initial tests on a reduced dataset were successful.
It was simple to add the obfuscation in real time with minimal changes.
And the winner is...
The initial idea was to use PostgreSQL to obfuscate the data used by FiveTran.
However, because the performance of PostgreSQL was quite good and the
system had a good margin for scaling up, the decision was to keep the
analytics data behind our perimeter.
MySQL Replica in a nutshell
A quick look at the replication system
Let’s have a quick overview of how the MySQL replica works and how the
replicator interacts with it.
The following slides explain how pg_chameleon works, because the custom
obfuscator tool shares most concepts and code with pg_chameleon.
MySQL Replica
The MySQL replica protocol is logical
When MySQL is configured properly, the RDBMS saves the changed data
into binary log files
The slave connects to the master and gets the replication data
The replication data is saved into the slave’s local relay logs
The local relay logs are replayed into the slave
A chameleon in the middle
pg_chameleon mimics a MySQL slave’s behaviour
It connects to the master and reads the data changes
It stores the row images in a PostgreSQL table using the jsonb format
A PL/pgSQL function decodes the rows and replays the changes
PostgreSQL acts as both relay log and replication slave
With an extra cool feature:
it initialises the PostgreSQL replica schema in just one command
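The relay-log idea described on this slide can be sketched in a few lines of Python. This is a toy model and not pg_chameleon's code: the event layout (action, table, key, row) and the function names are assumptions made for illustration, JSON strings stand in for jsonb, and a dictionary stands in for the PostgreSQL tables.

```python
import json

def store_event(log, action, table, key, row=None):
    """Append a row image to the log as a JSON string (standing in for jsonb)."""
    log.append(json.dumps({"action": action, "table": table, "key": key, "row": row}))

def replay(log, tables):
    """Decode each stored row image in order and apply it to the target tables."""
    for raw in log:
        event = json.loads(raw)
        rows = tables.setdefault(event["table"], {})
        if event["action"] in ("insert", "update"):
            rows[event["key"]] = event["row"]
        elif event["action"] == "delete":
            rows.pop(event["key"], None)
    log.clear()  # replayed events are no longer needed
    return tables

relay_log = []
store_event(relay_log, "insert", "user", 1, {"id": 1, "email": "a@example.com"})
store_event(relay_log, "update", "user", 1, {"id": 1, "email": "b@example.com"})
store_event(relay_log, "insert", "user", 2, {"id": 2, "email": "c@example.com"})
store_event(relay_log, "delete", "user", 2)
replica = replay(relay_log, {})
print(replica)
```

Storing the changes first and replaying them later is what lets a single database play both roles, relay log and slave.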
Log formats
MySQL supports different formats for the binary logs.
The STATEMENT format logs the statements, which are replayed on the slave.
It seems the best solution for performance.
However, replaying queries with non-deterministic elements generates
inconsistent slaves (e.g. an insert with uuid).
The ROW format is deterministic. It logs the row image and the DDL queries.
This is the format required for pg_chameleon to work.
MIXED takes the best of both worlds. The master logs the statements unless
a non-deterministic element is used. In that case it logs the row image.
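Why a statement containing uuid breaks a statement-based replica can be shown with a small simulation. This is only a sketch: Python lists stand in for the tables and uuid.uuid4() stands in for MySQL's uuid function; none of the names below come from MySQL or pg_chameleon.

```python
import uuid

def run_statement(table):
    """Simulate 'INSERT ... VALUES (uuid())': the value is generated at execution time."""
    row = {"id": str(uuid.uuid4())}
    table.append(row)
    return row

master, slave_statement, slave_row = [], [], []

# The master executes the statement and produces a row image.
row_image = run_statement(master)

# STATEMENT format: the slave re-executes the statement and generates its own uuid.
run_statement(slave_statement)

# ROW format: the slave applies the row image produced by the master.
slave_row.append(dict(row_image))

print(master == slave_statement)  # the uuids differ, so the slave is inconsistent
print(master == slave_row)        # the row image keeps master and slave identical
```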
How we did it
Replica and obfuscation
I built a minimum viable product for pg_chameleon.
The project was forked into a Transferwise-owned repository for the customisation.
The obfuscation capabilities were added, along with other specific procedures like
the daily data aggregation.
Mighty morphing power elephant
The replica initialisation locks the MySQL tables in read-only mode.
To avoid locking the main database for several hours, a secondary MySQL
replica is set up with the local query logging enabled.
The cascading replica also made it possible to use the ROW binlog format, as the master
uses MIXED for performance reasons.
This is what awesome looks like!
A MySQL master is replicated into a MySQL slave
The slave logs the row changes locally in ROW format
PostgreSQL reads the slave’s replica and obfuscates the data in real time!
Replica initialisation
The replica initialisation follows the same rules as any MySQL replica setup
Flush the tables with read lock
Get the master’s coordinates
Copy the data
Release the locks
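The four steps above can be sketched as a single function. The SQL statements are standard MySQL; everything else (the function name, the copy_data callback and the FakeCursor used to run the example without a server) is an assumption for illustration, and the fake cursor returns only the first two fields of the master status.

```python
def init_replica(cursor, copy_data):
    """Run the four initialisation steps in order and return the master's coordinates."""
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        cursor.execute("SHOW MASTER STATUS")
        coordinates = cursor.fetchone()
        copy_data()
    finally:
        # Release the locks even if the copy fails.
        cursor.execute("UNLOCK TABLES")
    return coordinates


class FakeCursor:
    """Stand-in for a MySQL cursor that records the statements it receives."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        # Hypothetical binlog file name and position.
        return ("mysql-bin.000227", 1543)


cursor = FakeCursor()
copied = []
coordinates = init_replica(cursor, lambda: copied.append("all tables"))
print(cursor.statements)
print(coordinates)
```

Reading the coordinates while the tables are locked is what guarantees that the copied data and the replication start position match.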
The procedure pulls the data out of MySQL in CSV format for a fast load
into PostgreSQL with the COPY command.
This approach requires a tricky SQL statement.
First generate the select list
SELECT
    CASE
        WHEN data_type="enum"
        THEN SUBSTRING(column_type, 5)  -- assumed: this branch is missing from the slide
    END AS enum_list,
    CASE
        WHEN data_type IN ('"""+"','".join(self.hexify)+"""')
        THEN concat('hex(',column_name,')')
        WHEN data_type IN ('bit')
        THEN concat('cast(`',column_name,'` AS unsigned)')
        ELSE concat('`',column_name,'`')
    END AS column_csv
FROM information_schema.COLUMNS
WHERE table_schema=%s
    AND table_name=%s
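The same CASE logic can be written in plain Python to see what the generated select list looks like. This is a sketch only: the function names and the default content of the hexify tuple are assumptions, since the slide does not show which data types the tool hex-encodes.

```python
def column_csv(column_name, data_type, hexify=("blob", "binary")):
    """Return the SELECT expression used to export one column."""
    if data_type in hexify:
        return "hex(%s)" % column_name
    if data_type == "bit":
        return "cast(`%s` AS unsigned)" % column_name
    return "`%s`" % column_name

def select_list(columns, hexify=("blob", "binary")):
    """Join the per-column expressions into a single select list."""
    return ", ".join(column_csv(name, dtype, hexify) for name, dtype in columns)

columns = [("id", "int"), ("avatar", "blob"), ("is_active", "bit")]
print(select_list(columns))
# → `id`, hex(avatar), cast(`is_active` AS unsigned)
```

Binary columns are exported as hexadecimal text and bit columns as integers, so every value survives the trip through a CSV file.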
Then use it in the MySQL query
sql_out = "SELECT " + columns_csv + " as data FROM " + table_name + ";"
try:
    self.logger.debug("Executing query for table %s" % (table_name, ))
    cursor.execute(sql_out)  # assumed call on a hypothetical cursor; the execute line is not on the slide
except Exception:
    self.logger.debug("an error occurred when pulling out the data from the table %s - sql executed: %s" % (table_name, sql_out))
Fallback on failure
The CSV data is pulled out in slices to avoid memory overload.
The file is then pushed into PostgreSQL using the COPY command.
COPY is fast but runs in a single transaction
One failure and the entire batch is rolled back
If this happens the procedure loads the same data using INSERT statements
Which can be very slow
But at least discards only the problematic rows
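The fallback strategy can be sketched independently of any database. In this minimal example bulk_load plays the role of COPY (all or nothing) and insert_row the role of a single INSERT; both functions, the target list and the validation rule are hypothetical stand-ins.

```python
def load_batch(rows, bulk_load, insert_row):
    """Try the fast bulk load; on failure fall back to row-by-row inserts.

    Returns the rows that had to be discarded.
    """
    try:
        bulk_load(rows)  # all or nothing, like COPY
        return []
    except Exception:
        discarded = []
        for row in rows:
            try:
                insert_row(row)
            except Exception:
                discarded.append(row)
        return discarded


target = []

def insert_row(row):
    if row["id"] is None:
        raise ValueError("id cannot be null")
    target.append(row)

def bulk_load(rows):
    staged = []
    for row in rows:
        if row["id"] is None:
            raise ValueError("id cannot be null")
        staged.append(row)
    target.extend(staged)  # reached only if every row is valid

batch = [{"id": 1}, {"id": None}, {"id": 3}]
discarded = load_batch(batch, bulk_load, insert_row)
print(target)
print(discarded)
```

The slow path runs only for the slices that actually fail, so a clean dataset pays no penalty.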
Obfuscation setup
A simple YAML file lists the tables, the columns and the obfuscation strategy for each column
user_details:
    last_name:
        mode: normal
        nonhash_start: 0
        nonhash_length: 0
    phone_number:
        mode: normal
        nonhash_start: 1
        nonhash_length: 2
    date_of_birth:
        mode: date
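How such a strategy might be applied to a row is sketched below. The semantics are assumptions, not the tool's documented behaviour: here the normal mode keeps nonhash_length characters in clear starting at the 1-based position nonhash_start and appends the sha256 digest of the whole value, while the date mode keeps only the year.

```python
import hashlib

def obfuscate(value, mode, nonhash_start=0, nonhash_length=0):
    """Obfuscate one value according to its strategy (assumed semantics)."""
    if value is None:
        return None
    if mode == "date":
        # Assumption: keep the year only, resetting month and day.
        return value[:4] + "-01-01"
    # 'normal' mode: optional clear-text slice followed by the sha256 hex digest.
    start = max(nonhash_start - 1, 0)  # nonhash_start is treated as 1-based
    clear = value[start:start + nonhash_length]
    return clear + hashlib.sha256(value.encode("utf-8")).hexdigest()

strategy = {
    "last_name": {"mode": "normal", "nonhash_start": 0, "nonhash_length": 0},
    "phone_number": {"mode": "normal", "nonhash_start": 1, "nonhash_length": 2},
    "date_of_birth": {"mode": "date"},
}
row = {"last_name": "Smith", "phone_number": "447700900123", "date_of_birth": "1980-05-17"}
obfuscated = {column: obfuscate(row[column], **strategy[column]) for column in row}
print(obfuscated)
```

Keeping a short prefix in clear, as for the phone number, preserves something useful for analysis (for example the country code) without exposing the full value.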
Obfuscation when initialising
The obfuscation process is quite simple and uses the pgcrypto extension for hashing
in sha256.
When the replica is initialised the data is copied into the schema in clear
The table locks are released
The tables with PII are copied and obfuscated in a separate schema
The process builds the indices on the schemas with data in clear and
obfuscated
The tables without PII data are exposed to the normal users using simple
views
All the varchar fields in the obfuscated schema are converted to text fields
Obfuscation on the fly
The obfuscation is also applied when the data is replicated.
The approach is very simple.
When a row image is captured the process checks if the table contains PII
In that case the process generates a second jsonb element with the PII data obfuscated
'binlog': u'mysql-bin.000227',
'logpos': 1543,
'action': 'update',
'batch_id': 2L,
'table': u'user',
'log_table': 't_log_replica_2',
'schema': 'sch_clear'
u'email': u''

'binlog': u'mysql-bin.000227',
'logpos': 1543,
'action': 'update',
'batch_id': 2L,
'table': u'user',
'log_table': 't_log_replica_2',
'schema': 'sch_obf'
u'email': u'2bc5aa7720b6a3462cdf8c1ae25ed8dc45b1d9e1b0cd960aa15ac72acfe20433'
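Generating the second element can be sketched as follows. The PII_COLUMNS mapping, the event layout and the function name are hypothetical; only the schema names sch_clear and sch_obf and the use of sha256 come from the slides.

```python
import copy
import hashlib

PII_COLUMNS = {"user": ["email"]}  # hypothetical mapping: table -> PII columns

def row_images(event):
    """Return the events to store: the clear one plus, for PII tables, an obfuscated copy."""
    events = [event]
    pii = PII_COLUMNS.get(event["table"])
    if pii:
        obf = copy.deepcopy(event)  # leave the clear event untouched
        obf["schema"] = "sch_obf"
        for column in pii:
            value = obf["row"].get(column)
            if value is not None:
                obf["row"][column] = hashlib.sha256(value.encode("utf-8")).hexdigest()
        events.append(obf)
    return events

event = {"action": "update", "table": "user", "schema": "sch_clear",
         "row": {"id": 1, "email": "someone@example.com"}}
images = row_images(event)
for image in images:
    print(image["schema"], image["row"]["email"])
```

Tables without PII produce a single event, so the extra work is paid only where it is needed.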
The DDL. A real pain in the back
The DDL replica is possible with a little trick.
Even in ROW format, MySQL emits the DDL as statements
A regular expression traps the DDL like CREATE/DROP TABLE or ALTER
The mysql library gets the table’s metadata from the information schema
The metadata is used to build the DDL in the PostgreSQL dialect
This approach may not be elegant but is quite robust.
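Trapping the DDL with a regular expression might look like the sketch below. The pattern is an illustration written for this example, not the one used by the tool, and it deliberately ignores cases such as schema-qualified table names.

```python
import re

# Hypothetical pattern: matches CREATE/DROP/ALTER TABLE and captures the table name.
DDL_PATTERN = re.compile(
    r"^\s*(?P<command>CREATE|DROP|ALTER)\s+TABLE\s+"
    r"(?:IF\s+(?:NOT\s+)?EXISTS\s+)?`?(?P<table>\w+)`?",
    re.IGNORECASE,
)

def trap_ddl(statement):
    """Return (command, table) for a trapped DDL statement, or None for anything else."""
    match = DDL_PATTERN.match(statement)
    if match is None:
        return None
    return match.group("command").upper(), match.group("table")

print(trap_ddl("ALTER TABLE `user` ADD COLUMN nickname varchar(30)"))
print(trap_ddl("create table if not exists payment (id int)"))
print(trap_ddl("BEGIN"))
```

Once the table name is known, the statement text itself is no longer needed: the column definitions come from the information schema, which sidesteps parsing MySQL's DDL dialect.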
Maximum effort
Query                          MySQL            PostgreSQL  PostgreSQL cached
Master procedure               20 hours         4 hours     N/A
Extracting sharing ibans [2]   didn’t complete  3 minutes   1 minute
Adyen notification [3]         6 minutes        2 minutes   6 seconds
[2] small table with complex aggregations
[3] big table scan with simple filters
Resource comparison
Resource         MySQL   PostgreSQL
Storage Size     940 GB  664 GB
Server CPUs      18      8
Server Memory    68 GB   48 GB
Shared Memory    50 GB   5 GB
Max connections  500     100
Advantages using PostgreSQL
Stronger security model
Better resource optimisation (See previous slide)
No invalid views
No performance issues with views
Complex analytics functions
Partitioning (thanks pg_pathman!)
BRIN indices
"Some code was optimised inside, but actually very little - maybe 10-20% was
improved. We’ll do more of that in the future, but not yet. The good thing is that
the performance gains we have can mostly be attributed just to PG vs MySQL. So
there’s a lot of scope to improve further."
Jeff McClelland - Growth Analyst, data guru
Lessons learned
init_replica tuning
The replica initialisation required several improvements.
The first init_replica implementation didn’t complete.
The OOM killer killed the process when the memory usage was too high.
In order to speed up the replica, some large tables not required in the
analytics db were excluded from the init_replica.
Some tables required a custom slice size because the row length triggered
the OOM killer again.
Estimating the total rows for user’s feedback is faster but the output can be
Using unbuffered cursors improves the speed and the memory usage.
However, even after fixing the memory issues the initial copy took 6 days.
Tuning the copy with the unbuffered cursors and the row number estimates
improved the initial copy speed, which now completes in 30 hours,
including the time required for the index build.
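Reading the data in slices keeps the memory usage bounded regardless of the table size. The sketch below uses the standard DB-API fetchmany call with sqlite3 standing in for MySQL, so it runs without a server; the table, the function name and the slice size are made up for the example. With MySQL the same loop would be run on an unbuffered cursor (PyMySQL's SSCursor is one such cursor), so the client never holds the full result set.

```python
import csv
import io
import sqlite3

def copy_in_slices(cursor, slice_size):
    """Yield CSV chunks of at most slice_size rows, so memory use stays bounded."""
    while True:
        rows = cursor.fetchmany(slice_size)
        if not rows:
            break
        buffer = io.StringIO()
        csv.writer(buffer).writerows(rows)
        yield buffer.getvalue()

# sqlite3 stands in for the MySQL source.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE payment (id INTEGER, amount INTEGER)")
connection.executemany("INSERT INTO payment VALUES (?, ?)", [(i, i * 10) for i in range(1, 8)])
cursor = connection.execute("SELECT id, amount FROM payment ORDER BY id")

chunks = list(copy_in_slices(cursor, slice_size=3))
print(len(chunks))  # 7 rows in slices of 3 give 3 chunks
```

A smaller slice size for tables with very long rows, as described above, is just a different slice_size argument per table.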
Strictness is an illusion. MySQL doubly so
MySQL’s lack of strictness is no mystery.
The replica broke down several times because of the funny way NOT NULL is
managed by MySQL.
To prevent any further replica breakdown, the fields with NOT NULL added with
ALTER TABLE are always created as NULLable in PostgreSQL.
MySQL automatically truncates character strings at the varchar size. This
is a problem if the field is obfuscated on PostgreSQL because the hashed string
might not fit into the corresponding varchar field. Therefore all the character
varying fields in the obfuscated schema are converted to text.
I feel your lack of constraint disturbing
Rubbish data can be stored in MySQL without any error being raised by the DBMS.
When this happens the replicator traps the error when the change is replayed on
PostgreSQL and discards the problematic row.
The value is logged in the replica’s log, available for further actions.
Wrap up
Did you say hire?
That’s all folks!
Contacts and license
Twitter: 4thdoctor_scarf
This document is distributed under the terms of the Creative Commons
Boring legal stuff
The 4th doctor meme - source
The eye, phantom playground, light end tunnel - Copyright Federico Campoli
The dolphin picture - Copyright artnoose
It could work. Young Frankenstein - source quickmeme
Deadpool Clap - source memegenerator
Deadpool Maximum Effort - source Deadpool Zoeiro
Graph Stream Processing : spinning fast, large scale, complex analytics
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
System design basics - Part 1
System design basics - Part 1System design basics - Part 1
System design basics - Part 1
2010 Sopac Cosugi
2010 Sopac Cosugi2010 Sopac Cosugi
2010 Sopac Cosugi
WTF is Modeling, Anyway!?
WTF is Modeling, Anyway!?WTF is Modeling, Anyway!?
WTF is Modeling, Anyway!?
Spanner : Google' s Globally Distributed Database
Spanner : Google' s Globally Distributed DatabaseSpanner : Google' s Globally Distributed Database
Spanner : Google' s Globally Distributed Database
Staying lean with application logs
Staying lean with application logsStaying lean with application logs
Staying lean with application logs
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
A Front-Row Seat to Ticketmaster’s Use of MongoDB
A Front-Row Seat to Ticketmaster’s Use of MongoDBA Front-Row Seat to Ticketmaster’s Use of MongoDB
A Front-Row Seat to Ticketmaster’s Use of MongoDB
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relational
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining Meetup
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best Practices
ALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch CouncilALM Search Presentation for the VSS Arch Council
ALM Search Presentation for the VSS Arch Council
Productionize spark structured streaming
Productionize spark structured streamingProductionize spark structured streaming
Productionize spark structured streaming
Algolia's Fury Road to a Worldwide API
Algolia's Fury Road to a Worldwide APIAlgolia's Fury Road to a Worldwide API
Algolia's Fury Road to a Worldwide API

Recently uploaded

The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...

Recently uploaded (20)

The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...

The ninja elephant, scaling the analytics database in Transferwise

  • 1. The ninja elephant: scaling the analytics database in Transferwise. Federico Campoli, Transferwise, 3rd February 2017
  • 2. First rule about talks, don’t talk about the speaker
      • Born in 1972
      • Passionate about IT since 1982, mostly because of the TRON movie
      • Joined the Oracle DBA secret society in 2004
      • In love with PostgreSQL since 2006
      • Currently runs the Brighton PostgreSQL User group
      • Works at Transferwise as Data Engineer
  • 3. Table of contents
      1 We have an appointment, and we are late!
      2 The eye of the storm
      3 MySQL Replica in a nutshell
      4 How we did it
      5 Maximum effort
      6 Lessons learned
      7 Wrap up
  • 5. We have an appointment, and we are late!
  • 7. The Gordian Knot of analytics db
      • I started the data engineer job in July 2016
      • I was involved in a task that was not customer facing
      • However, the task was critical to the business
      • I had to fix the performance issues on the MySQL analytics database
      • Which performed badly, despite the considerable resources assigned to the VM
  • 8. Tactical assessment
      The existing database had the following configuration:
      • MySQL 5.6
      • InnoDB buffer size 60 GB
      • 70 GB RAM
      • 20 CPU
      • Database size 600 GB
      • Looker and Tableau for running the analytic queries
      • The main live database replicated into the analytics database
      • Several schemas from the service database imported on a regular basis
      • One schema used for obfuscating PII and denormalising the heavy queries
  • 10. The frog effect
      If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn the heat up, it will be slowly boiled to death.
      • The performance issues worsened over a two-year span
      • The obfuscation was made via custom views
      • The data size on the MySQL master increased over time
      • Causing the optimiser to switch to materialisation when accessing the views
      • The analytics tools struggled even under normal load
      • In busy periods the database became almost unusable
      • Analysts were busy tuning existing queries rather than writing new ones
      • A new solution was needed
  • 12. The eye of the storm
  • 13. One size doesn’t fit all
      It was clear that MySQL was no longer a good fit. However, the new solution had to meet some specific requirements.
      • Data updated in almost real time from the live database
      • PII obfuscated for the analysts
      • PII available in clear for the power users
      • The system should be able to scale out for several years
      • Modern SQL for better analytics queries
  • 15. May the best database win
      The analysts team shortlisted a few solutions. Each solution partially covered the requirements.
      • Google BigQuery
      • Amazon Redshift
      • Snowflake
      • PostgreSQL
      Google BigQuery and Amazon Redshift did not satisfy the analytics requirements and were removed from the list. Both PostgreSQL and Snowflake offered very good performance and modern SQL. Neither of them offered a replication system from the MySQL system.
  • 16. Straight into the cloud
      Snowflake is a cloud based data warehouse service. It’s based on Amazon S3 and comes in different sizes. Their pricing system is very appealing and the preliminary tests showed Snowflake outperforming PostgreSQL (1).
      (1) PostgreSQL single machine vs cloud based parallel processing
  • 18. Streaming copy
      Using Fivetran, an impressive multi technology data pipeline, the data would flow in real time from our production server to Snowflake. Unfortunately there was just one little catch: there was no support for obfuscation.
  • 19. Customer comes first
      In Transferwise we really care about the customer’s data security. Our policy for PII data is that any personal information moving outside our perimeter shall be obfuscated. In order to be compliant, the database accessible by Fivetran would have only obfuscated data.
  • 20. Proactive development
      The DBA sense tingled. I foresaw the requirement and in my spare time I built a proof of concept based on the replica tool pg_chameleon, which uses a Python library to replicate a MySQL database into PostgreSQL. The initial tests on a reduced dataset were successful. It was simple to add the obfuscation in real time with minimal changes.
  • 22. And the winner is...
      The initial idea was to use PostgreSQL to obfuscate the data used by Fivetran. However, because the performance of PostgreSQL was quite good and the system has a good margin for scaling up, the decision was to keep the analytics data behind our perimeter.
  • 24. MySQL Replica in a nutshell
  • 25. A quick look at the replication system
      Let’s have a quick overview of how the MySQL replica works and how the replicator interacts with it. The following slides explain how pg_chameleon works, because the custom obfuscator tool shares most of its concepts and code with pg_chameleon.
  • 26. MySQL Replica
      • The MySQL replica protocol is logical
      • When MySQL is configured properly the RDBMS saves the changed data into binary log files
      • The slave connects to the master and gets the replication data
      • The replication data is saved into the slave’s local relay logs
      • The local relay logs are replayed into the slave
  • 27. MySQL Replica
  • 30. A chameleon in the middle
      • pg_chameleon mimics a MySQL slave’s behaviour
      • Connects to the master and reads data changes
      • It stores the row images into a PostgreSQL table using the jsonb format
      • A PL/pgSQL function decodes the rows and replays the changes
      • PostgreSQL acts as relay log and replication slave
      • With an extra cool feature: it initialises the PostgreSQL replica schema in just one command
  • 31. MySQL replica + pg_chameleon
  • 32. Log formats
      MySQL supports different formats for the binary logs.
      • The STATEMENT format logs the statements, which are replayed on the slave. It seems the best solution for performance. However, replaying queries with non-deterministic elements generates inconsistent slaves (e.g. an insert with a uuid).
      • The ROW format is deterministic. It logs the row image and the DDL queries. This is the format required for pg_chameleon to work.
      • MIXED takes the best of both worlds. The master logs the statements unless a non-deterministic element is used. In that case it logs the row image.
  • 34. How we did it
  • 35. Replica and obfuscation
      I built a minimum viable product for pg_chameleon. The project was forked into a Transferwise owned repository for the customisation. The obfuscation capabilities were added, along with other specific procedures such as the daily data aggregation.
  • 36. Mighty morphing power elephant
      The replica initialisation locks the MySQL tables in read only mode. To avoid locking the main database for several hours, a secondary MySQL replica is set up with local query logging enabled. The cascading replica also made it possible to use the ROW binlog format, as the master uses MIXED for performance reasons.
  • 39. This is what awesome looks like!
      • A MySQL master is replicated into a MySQL slave
      • The slave logs the row changes locally in ROW format
      • PostgreSQL reads the slave’s replica and obfuscates the data in real time!
  • 40. Replica initialisation
      The replica initialisation follows the same rules as any MySQL replica setup:
      • Flush the tables with read lock
      • Get the master’s coordinates
      • Copy the data
      • Release the locks
      The procedure pulls the data out of MySQL in CSV format for a fast load into PostgreSQL with the COPY command. This approach requires a tricky SQL statement.
  • 41. First generate the select list
      SELECT
          CASE
              WHEN data_type="enum"
              THEN SUBSTRING(COLUMN_TYPE,5)
          END AS enum_list,
          CASE
              WHEN data_type IN ('"""+"','".join(self.hexify)+"""')
              THEN concat('hex(',column_name,')')
              WHEN data_type IN ('bit')
              THEN concat('cast(`',column_name,'` AS unsigned)')
              ELSE concat('`',column_name,'`')
          END AS column_csv
      FROM information_schema.COLUMNS
      WHERE table_schema=%s AND table_name=%s
      ORDER BY ordinal_position;
  • 42. Then use it in the MySQL query
      csv_data=""
      sql_out="SELECT "+columns_csv+" as data FROM "+table_name+";"
      self.mysql_con.connect_db_ubf()
      try:
          self.logger.debug("Executing query for table %s" % (table_name, ))
          self.mysql_con.my_cursor_ubf.execute(sql_out)
      except:
          self.logger.debug("an error occurred when pulling out the data from the table %s - sql executed: %s" % (table_name, sql_out))
  • 43. Fallback on failure
      The CSV data is pulled out in slices in order to avoid memory overload. The file is then pushed into PostgreSQL using the COPY command. However...
      • COPY is fast but runs in a single transaction
      • One failure and the entire batch is rolled back
      • If this happens the procedure loads the same data using INSERT statements
      • Which can be very slow
      • But at least discards only the problematic rows
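The fallback on failure can be sketched in a few lines of Python. This is only an illustration and not the tool's actual code: `load_slice`, `bulk_copy` and `insert_row` are hypothetical names, and the two callables stand in for a PostgreSQL COPY of a whole slice and for a single-row INSERT.

```python
def load_slice(rows, bulk_copy, insert_row):
    """Load one slice of rows, preferring the fast all-or-nothing path.

    bulk_copy(rows) is assumed to load every row in a single transaction
    and to raise an exception if any row is rejected (COPY-like behaviour).
    insert_row(row) is assumed to load a single row and to raise on failure.
    Returns the list of rows that could not be loaded.
    """
    try:
        bulk_copy(rows)
        return []
    except Exception:
        # The whole batch was rolled back: retry row by row so that
        # only the problematic rows are discarded.
        rejected = []
        for row in rows:
            try:
                insert_row(row)
            except Exception:
                rejected.append(row)
        return rejected
```

Returning the rejected rows, rather than only logging them, would let the caller report which rows were discarded during the slow path.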
  • 44. Obfuscation setup
      A simple yaml file is used to list table, column and obfuscation strategy
      user_details:
          last_name:
              mode: normal
              nonhash_start: 0
              nonhash_length: 0
          phone_number:
              mode: normal
              nonhash_start: 1
              nonhash_length: 2
          date_of_birth:
              mode: date
  • 45. Obfuscation when initialising
      The obfuscation process is quite simple and uses the pgcrypto extension for hashing in sha256.
      • When the replica is initialised the data is copied into the schema in clear
      • The table locks are released
      • The tables with PII are copied and obfuscated in a separate schema
      • The process builds the indices on the schemas with data in clear and obfuscated
      • The tables without PII data are exposed to the normal users using simple views
      • All the varchar fields in the obfuscated schema are converted into text fields
  • 46. Obfuscation on the fly
      The obfuscation is also applied when the data is replicated. The approach is very simple.
      • When a row image is captured the process checks if the table contains PII data
      • In that case the process generates a second jsonb element with the PII data obfuscated
  • 47. Obfuscation on the fly
      {'global_data': {'binlog': u'mysql-bin.000227', 'logpos': 1543, 'action': 'update', 'batch_id': 2L, 'table': u'user', 'log_table': 't_log_replica_2', 'schema': 'sch_clear'},
       'event_data': {u'email': u''}}
      {'global_data': {'binlog': u'mysql-bin.000227', 'logpos': 1543, 'action': 'update', 'batch_id': 2L, 'table': u'user', 'log_table': 't_log_replica_2', 'schema': 'sch_obf'},
       'event_data': {u'email': u'2bc5aa7720b6a3462cdf8c1ae25ed8dc45b1d9e1b0cd960aa15ac72acfe20433'}}
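The generation of the second, obfuscated row image might look roughly like the sketch below. The function names are hypothetical, and the meaning of the nonhash start and length settings is an assumption: here the start is a 1-based position (as in SQL `substr`), the length is the number of characters kept in clear, and the clear part is followed by the SHA-256 hex digest of the whole value. Only the `normal` mode is covered; the `date` mode is left out because the slides do not describe it.

```python
import hashlib


def obfuscate_normal(value, nonhash_start=0, nonhash_length=0):
    """Hash a value with SHA-256, optionally keeping a slice in clear.

    Assumption: nonhash_start is 1-based and nonhash_length counts the
    characters kept in clear; zero for either means "hash everything".
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    if nonhash_start <= 0 or nonhash_length <= 0:
        return digest
    begin = nonhash_start - 1
    return value[begin:begin + nonhash_length] + digest


def obfuscate_row(row, rules):
    """Return a copy of the row image with the configured columns obfuscated."""
    hidden = dict(row)
    for column, rule in rules.items():
        value = hidden.get(column)
        # Columns missing from the row, NULL values and other modes are left alone.
        if value is None or rule.get("mode") != "normal":
            continue
        hidden[column] = obfuscate_normal(
            str(value),
            rule.get("nonhash_start", 0),
            rule.get("nonhash_length", 0),
        )
    return hidden
```

The original row is left untouched, so the clear and the obfuscated images can be stored side by side, one per schema.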
  • 48. The DDL. A real pain in the back
      The DDL replica is possible with a little trick.
      • MySQL, even in ROW format, emits the DDL as statements
      • A regular expression traps DDL such as CREATE/DROP TABLE or ALTER TABLE
      • The MySQL library gets the table’s metadata from the information schema
      • The metadata is used to build the DDL in the PostgreSQL dialect
      This approach may not be elegant but is quite robust.
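As an illustration of the trapping step, a regular expression along these lines could pick the DDL statements out of the replication stream. The pattern and the `trap_ddl` helper are assumptions made for this sketch, not the expression used by the tool, and schema-qualified table names are deliberately not handled.

```python
import re

# Hypothetical pattern: matches CREATE TABLE, DROP TABLE and ALTER TABLE
# statements and captures the (optionally backtick-quoted) table name.
DDL_PATTERN = re.compile(
    r"^\s*(CREATE|DROP|ALTER)\s+TABLE\s+(?:IF\s+(?:NOT\s+)?EXISTS\s+)?`?(\w+)`?",
    re.IGNORECASE,
)


def trap_ddl(statement):
    """Return (action, table_name) for a trapped DDL, or None otherwise."""
    match = DDL_PATTERN.match(statement)
    if match is None:
        return None
    return match.group(1).upper(), match.group(2)
```

Once the table name is known, the metadata lookup in the information schema and the rebuild of the statement in the PostgreSQL dialect can proceed without parsing the rest of the MySQL statement.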
  • 50. Maximum effort
  • 51. Timing
      Query                        | MySQL           | PostgreSQL | PostgreSQL cached
      Master procedure             | 20 hours        | 4 hours    | N/A
      Extracting sharing ibans (2) | didn’t complete | 3 minutes  | 1 minute
      Adyen notification (3)       | 6 minutes       | 2 minutes  | 6 seconds
      (2) small table with complex aggregations
      (3) big table scan with simple filters
  • 52. Resource comparison
    Resource | MySQL | PostgreSQL
    Storage size | 940 GB | 664 GB
    Server CPUs | 18 | 8
    Server memory | 68 GB | 48 GB
    Shared memory | 50 GB | 5 GB
    Max connections | 500 | 100
  • 54. Advantages using PostgreSQL
    • Stronger security model
    • Better resource optimisation (see previous slide)
    • No invalid views
    • No performance issues with views
    • Complex analytics functions
    • Partitioning (thanks pg_pathman!)
    • BRIN indices
    "some code was optimised inside, but actually very little - maybe 10-20% was improved. We’ll do more of that in the future, but not yet. The good thing is that the performance gains we have can mostly be attributed just to PG vs MySQL. So there’s a lot of scope to improve further."
    Jeff McClelland - Growth Analyst, data guru
  • 56. Lessons learned
  • 58. init replica tune
    The replica initialisation required several improvements. The first init replica implementation didn’t complete: the OOM killer killed the process when the memory usage grew too high. To speed up the replica, some large tables not required in the analytics db were excluded from the init replica. Some tables required a custom slice size because their row length triggered the OOM killer again. Estimating the total rows for the user’s feedback is faster, but the output can look odd. Using unbuffered cursors improves both the speed and the memory usage.
    However, even after fixing the memory issues the initial copy took 6 days. Tuning the copy with the unbuffered cursors and the row number estimates sped it up: the initial copy now completes in 30 hours, including the time required for the index build.
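The slice-based copy can be sketched as follows. This is only an illustration under stated assumptions: `sqlite3` stands in for the MySQL source so the example runs anywhere, and `iter_slices`, the table `t_big` and the slice size of 10 are hypothetical. With a MySQL driver the same loop would be driven by an unbuffered (server-side) cursor, so that only one slice of rows is held in memory at a time.

```python
import sqlite3


def iter_slices(cursor, slice_size):
    """Yield lists of at most slice_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(slice_size)
        if not rows:
            break
        yield rows


# Stand-in source table; the real source in the talk is MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_big (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO t_big (payload) VALUES (?)",
    [("row %d" % i,) for i in range(25)],
)

cursor = conn.execute("SELECT id, payload FROM t_big ORDER BY id")
# Each slice would be written to the target before the next one is fetched.
slice_sizes = [len(rows) for rows in iter_slices(cursor, slice_size=10)]
```

A per-table slice size, as mentioned on the slide, would simply be a different `slice_size` argument for the tables with very long rows.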
  • 59. Strictness is an illusion. MySQL doubly so
    MySQL’s lack of strictness is not a mystery. The replica broke down several times because of the funny way MySQL manages NOT NULL. To prevent any further breakdown, fields whose NOT NULL is added with ALTER TABLE are always created as NULLable in PostgreSQL. MySQL automatically truncates character strings at the varchar size. This is a problem if the field is obfuscated on PostgreSQL, because the hashed string may not fit into the corresponding varchar field. Therefore all the character varying fields in the obfuscated schema are converted to text.
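A small Python sketch of why the conversion to text is needed. It assumes the hash is a SHA-256 hex digest, which the slides do not state, and `obfuscated_column_type` is a hypothetical helper rather than the replicator's code.

```python
import hashlib
import re

# Matches varchar(n) / character varying(n) type declarations.
VARCHAR = re.compile(
    r"^(?:character varying|varchar)\s*\(\s*\d+\s*\)$", re.IGNORECASE
)


def obfuscated_column_type(column_type):
    """Map any sized varchar to text; leave every other type unchanged."""
    if VARCHAR.match(column_type.strip()):
        return "text"
    return column_type


# A SHA-256 hex digest is always 64 characters long, whatever the input,
# so it cannot fit into a column declared as, say, varchar(30).
digest_length = len(hashlib.sha256(b"short value").hexdigest())
```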
  • 60. I feel your lack of constraint disturbing
    MySQL can store rubbish data without the DBMS raising any error. When this happens, the replicator traps the error when the change is replayed on PostgreSQL and discards the problematic row. The value is written to the replica’s log, where it is available for further action.
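The trap-and-discard behaviour can be sketched like this. The example is an assumption-laden illustration: `sqlite3` stands in for the PostgreSQL target so the code is runnable as is, and `replay_rows`, the table `t_target` and the logger name are hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger("replica")


def replay_rows(conn, rows):
    """Insert rows one by one; log and skip the ones the target rejects."""
    discarded = []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO t_target (id, amount) VALUES (?, ?)", row
                )
        except sqlite3.IntegrityError as error:
            # The rejected value is logged so it can be inspected later.
            logger.warning("discarded row %r: %s", row, error)
            discarded.append(row)
    return discarded


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_target (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
)
discarded = replay_rows(conn, [(1, 10), (2, None), (3, 30)])
stored = [r[0] for r in conn.execute("SELECT id FROM t_target ORDER BY id")]
```

Replaying each row in its own transaction keeps one bad row from blocking the valid rows around it, at the cost of more round trips.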
  • 62. Wrap up
  • 63. Did you say hire? WE ARE HIRING!
  • 64. That’s all folks! QUESTIONS?
  • 65. Contacts and license
    Twitter: 4thdoctor_scarf
    Transferwise:
    Blog:
    Meetup:
    This document is distributed under the terms of the Creative Commons
  • 66. Boring legal stuff
    • The 4th doctor meme - source
    • The eye, phantom playground, light end tunnel - Copyright Federico Campoli
    • The dolphin picture - Copyright artnoose
    • It could work. Young Frankenstein - source quickmeme
    • Deadpool Clap - source memegenerator
    • Deadpool Maximum Effort - source Deadpool Zoeiro
  • 67. The ninja elephant Scaling the analytics database in Transferwise Federico Campoli Transferwise 3rd February 2017