Replicating in Real-time from MySQL to Amazon Redshift

©Continuent 2014
Replicating in Real-Time from
MySQL to Amazon Redshift
Featuring Continuent Tungsten
Robert Hodges, CEO
Jeff Mace, Director of Services

Presentation Topics
• Introductions
• Tungsten Replicator Overview
• Real-Time Data Warehouse Loading
• Adding Redshift to the Mix
• Replication from Tungsten Clusters to
Redshift
• Wrap-up
©Continuent 2014
2

Introducing Continuent
• The leading provider of clustering and
replication for open source DBMS
• Our Product: Continuent Tungsten
• Clustering - Commercial-grade HA, performance
©Continuent 2014
scaling and data management for MySQL
• Replication - Flexible, high-performance data
3
movement

Continuent Tungsten Customers
©Continuent 2014
4
1

Quick Continuent Facts
• Largest Tungsten installation by data volume
processes over 800 million transactions per
day on 225 terabytes of relational data
• Largest installation by transaction volume
handles up to 8 billion transactions daily
• Wide variety of topologies including MySQL,
Oracle, Vertica, and Hadoop are in production
now
• Amazon Redshift is the newest data
©Continuent 2014
warehouse target
5

Tungsten Replicator
Overview
©Continuent 2014 6

What is Tungsten Replicator?
Download from https://code.google.com/p/tungsten-replicator/
Annual support subscription available from Continuent
©Continuent 2014
A real-time,
high-performance,
open source database
replication engine
!
GPL V2 license - 100% open source
“Golden Gate® without the Price Tag”
7

Tungsten Replicator Overview
©Continuent 2014
Master
Replicator THL
8
(Transactions + Metadata)
Slave
THL
DBMS
Logs
Replicator
(Transactions + Metadata)
Extract
transactions
from log
Apply

Replication Services and Parallel
Apply
©Continuent 2014
Stage
9
Stage
Extract Filter Apply
Stage
Pipeline
Master
DBMS
Transaction
History Log
Parallel
Queue
Slave
DBMS

©Continuent 2014
Heterogeneous
MySQL Oracle Oracle MySQL MySQL
star
master-slave
Oracle MySQL Oracle
fan-in slave all-masters

Real-Time Data Warehouse Loading
©Continuent 2014
11

Why MySQL Needs A Data Warehouse
©Continuent 2014
id cust_id prod_id ...
1 335301 532 ...
2 2378 6235 ...
3 ... ... ...
12
Sales Table
Product Table
id sku type
532 C00135 consumer
533 S09957 specialty
... ...
Prod_ID Index
prod_id id
532 1
6235 2
... ...
Row format
makes table
scans very
slow
Indexes slow
OLTP
Low/no data
compression
Limited
index
types
Limited
join
types
Single-threaded query — no parallelization

Current Data Warehouse Options
MySQL Master
©Continuent 2014
13
OLTP Data
Scalable row store
with star schema
and materialized
view support
CSV files stored
on HDFS with
access via Hive/
MapReduce
Column storage
with compression
and built-in time
series support

Traditional ETL Approach
©Continuent 2014
Batch-oriented = not timely
Scan for changes = performance hit
14
MySQL
Sales
Table
Extract Transfer Load
Date columns = intrusive
Sales
Table
Data Warehouse

Real-Time Data Replication
©Continuent 2014
No SQL changes = transparent
15
MySQL
Sales
Table
Data Warehouse
Fast propagation = timely
Automatic change capture = low impact
DBMS
Logs
Data
Replication
Sales
Table

Loading to an Oracle Data Warehouse
©Continuent 2014
16
MySQL Tungsten Master
Replicator
oracle
MySQLExtractor
Special Filters
* Transform enum to string
binlog_format=row
Tungsten Slave
Replicator
oracle
Special Filters
* Ignore extra tables
* Map names to upper case
* Optimize updates to
MySQL
remove unchanged columns
Binlog
Data stored
in rows like
MySQL

How about Loading to Hadoop?
©Continuent 2014
Replication
17
CSV
Files
CSV
Files
Buffered
Transactions
Binlog
Provisioning
Data stored as
append-only files

Typical Raw Hadoop Data Layout
!
!
!
!
556,MALTESE HOPE,4.99,127n
557,MANCHURIAN CURTAIN,3.99,177n
558,MANNEQUIN WORST,2.99,71n
559,MARRIED GO,2.99,114n
©Continuent 2014
18
(CSV file)
field separator
file partitioning
record separator
compression type conversions

Loading to a Hadoop Data Warehouse
©Continuent 2014
19
Replicator
hadoop Transactions
from master
CSV
Files
CSV
Files
CSV
Files
Staging
Tables
Staging
Tables
BaBsea sTea bTalebsles Materialized Views
Staging
“Tables”
Javascript load
script
e.g. hadoop.js
Write data
to CSV
Merge using
external
MapReduce job
(Generate
Hive table
definitions)
(Generate
Hive table
definitions)
Load using
hadoop
command

Hadoop Materialized View Generation
Sort SHUFFLE by key(s), transaction order
©Continuent 2014
Transaction logs Snapshot
UNION ALL
Emit last row per key if not a delete
20
MAP
REDUCE
Materialized view
including all updates

How about Loading to Vertica?
©Continuent 2014
Replication
21
CSV
Files
CSV
Files
Buffered
Transactions
Binlog
Provisioning?
Data stored in
column format

Data Layout in a Column Store
©Continuent 2014
Product Table
22
Sales Table
cust_id
335301
2378
...
prod_id
532
6235
...
Fast scans
on columns
quantity
1
3
...
Updates to single
rows are hideously
slow
Every column
is an index
id
1
2
3
Good
compression
id
532
533
...
sku
C00135
S09957
...
type
consumer
specialty
...
Fast joins
with parallel
query

Loading to a Vertica Data Warhouse
©Continuent 2014
23
Replicator
vertica
MySQLExtractor
Special Filters
* pkey - Fill in pkey info
* colnames - Fill in names
* replicate - Ignore tables
binlog_format=row
Tungsten Slave
Replicator
vertica
MySQL
Binlog
CSV
Files
CSV
Files
CSV
Files
CSV
Files
CSV
Files
Large transaction
batches to leverage
load parallelization

Vertica Batch Loading--The Details
©Continuent 2014
COPY to
stage tables SELECT to
24
Replicator
vertica Transactions
from master
CSV
Files
CSV
Files
CSV
Files
Staging
Tables
Staging
Tables
Staging
Tables
Base
Tables
Base
Tables
Base
Tables
Merge
Script
(or)
COPY
directly to
base tables
base tables
No external
merge required!

Provisioning Using a Sandbox Server
©Continuent 2014
25
OLTP
Server
Temporary
Sandbox Server
Vertica
Cluster
1. Restore
logical
backup
2. Replicate
restored
transactions
3. Replicate
normally after
restore loads

Adding Amazon Redshift to
the Mix
©Continuent 2014 26

Tungsten Support for Amazon Redshift
• Tungsten Replicator 3.0 supports real-time
replication to Amazon Redshift
• Loading builds on data warehouse support
for Vertica and Hadoop
• Beta program in progress now
• Production ready in September 2014
©Continuent 2014
27

Redshift = Cloud Data Warehouse
©Continuent 2014
28

Redshift Capabilities
• On-demand column store in Amazon Web
Services
• Looks like a PostgreSQL 8.x DBMS
• $1,000 per terabyte per year or $0.25/hour
• Automatic backup and cluster management
• Integrated into Amazon VPC
• Data loading from Amazon S3 using SQL
©Continuent 2014
COPY command
29

Connecting to Amazon Redshift
$ psql -h redshift-webinar.cub79gczb9kq.us-east-
1.redshift.amazonaws.com -U tungsten -d dev -p 5439
psql (9.2.6, server 8.0.2)
WARNING: psql version 9.2, server version 8.0.
©Continuent 2014
Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.
!
dev=# select * from test.foo;
id | data
----+--------------
2 | salutations!
3 | bye!
1 | hello!
(3 rows)
!
dev=# q
30

Tungsten Loading to Redshift
©Continuent 2014
31
Tungsten Slave
redshift
Transactions
from Tungsten
Master
CSV
Files
CSV
Files
CSV
Files
BaBsea sTeRa bTealdebSslheisft
Base Tables
Staging
Tables
Staging
Tables RedShift
Staging Tables
Javascript load
script
e.g. hadoop.js
Write data
to CSV
3. Merge using
SQL to base
2. COPY to
1. Load to S3 Redshift
using s3cmd
S3
Storage

Prerequisites for Redshift Loading
©Continuent 2014
32
Replicator
redshift
binlog_format=row
Tungsten Slave
Replicator
redshift
MySQL
Binlog
Allow access from
Tungsten host
Java 6+,
Ruby 1.8.5+
Amazon
Java 6+,
Redshift
Ruby 1.8.5+
s3cmd 1.0+
S3 credentials
S3 Storage
Bucket for CSV uploads

Application Prerequisites
• Primary keys on all tables
• UTF-8 character set--or at least be consistent
• Use GMT timezone--or be very consistent
©Continuent 2014
33
about dates
Avoid “Cowboy” schema changes
to MySQL master(s)

Generate Schema Using ddlscan
©Continuent 2014
ddlscan
34
•Data types?
•Column lengths?
•Naming conventions?
•Staging tables?
MySQL Tables
Amazon
Redshift

Staging Table Schema
$ bin/ddlscan -db test -template ddl-mysql-redshift-staging.vm!
...!
!
CREATE SCHEMA test;!
!
DROP TABLE test.stage_xxx_foo;!
CREATE TABLE test.stage_xxx_foo!
(!
tungsten_opcode CHAR(2),!
tungsten_seqno INT,!
tungsten_row_id INT,!
tungsten_commit_timestamp TIMESTAMP,!
id INT,!
data VARCHAR(120) /* VARCHAR(30) */,!
PRIMARY KEY (tungsten_opcode, tungsten_seqno, tungsten_row_id)!
);
©Continuent 2014
35

Base Table Schema
$ bin/ddlscan -db test -template ddl-mysql-redshift.vm!
...!
!
CREATE SCHEMA test;!
!
DROP TABLE test.foo;!
CREATE TABLE test.foo!
(!
id INT,!
data VARCHAR(120) /* VARCHAR(30) */,!
PRIMARY KEY (id)!
);
©Continuent 2014
36

Sandbox Provisioning for Redshift
©Continuent 2014
37
OLTP
Server
Temporary
Sandbox Server
Redshift
Cluster
1. Restore
logical
backup
2. Replicate
restored
transactions
3. Replicate
normally after
restore loads
Amazon
Redshift

Replicating from Tungsten
Clusters to Redshift

60 Second Intro to Tungsten Clusters
Tungsten clusters combine off-the-
servers into data services with:
!
• 24x7 data access
• Scaling of load on replicas
• Simple management commands
!
...without app changes or data
migration
©Continuent 2014
shelf open source MySQL
40
Amazon
US West
GonzoPortal.com
apache
/php
Connector Connector

Replicating from a Cluster to Redshift
©Continuent 2014
41
Tungsten Cluster1
master
slave
Tungsten 3.0
Replicator
cluster1 Amazon
Redshift
Filters and Special Configuration
* pkey - Fill in pkey info
* colnames - Fill in names
* fixmysqlstrings - Convert to UT8
* Replicate from both nodes for
higher availability

Fan-in from Multiple Tungsten Clusters
©Continuent 2014
42
Redshift
Cluster
cluster1
cluster2
Tungsten
Replicator
Amazon
Redshift

Getting Started with MySQL
to Redshift Loading

Where Is Everything?
• Tungsten Replicator 3.0 builds are available on
code.google.com
http://code.google.com/p/tungsten-replicator/
(Visit the nightly builds page for latest features)
• Replicator documentation is available on Continuent
©Continuent 2014
website
https://docs.continuent.com/tungsten-replicator-3.0/
deployment-redshift.html
Contact Continuent for support
44

In Conclusion…
• Tungsten Replicator 3.0 now supports realtime
loading from MySQL to Amazon Redshift
• You can start using it yourself or join our beta
program
• Software will be production-ready in September
2014
• Continuent has a wealth of data loading features on
©Continuent 2014
tap so stay tuned!
45

560 S. Winchester Blvd., Suite 500
San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: sales@continuent.com
©Continuent 2014
Our Blogs:
http://scale-out-blog.blogspot.com
http://datacharmer.org/blog
http://www.continuent.com/news/blogs
http://flyingclusters.blogspot.com/
www.continuent.com
Follow us on Twitter @continuent
!
Tungsten Replicator:
http://code.google.com/p/tungsten-replicator

Replicating in Real-time from MySQL to Amazon Redshift

More Related Content

What's hot

Similar to Replicating in Real-time from MySQL to Amazon Redshift

More from Continuent

Recently uploaded

Replicating in Real-time from MySQL to Amazon Redshift