©Continuent 2013
Tungsten University: Load a Vertica Data Warehouse with MySQL Data
Robert Hodges
CEO, Continuent

Introducing Continuent
• The leading provider of clustering and replication for open source DBMS
• Our Product: Continuent Tungsten
• Clustering - Commercial-grade HA, performance scaling and data management for MySQL
• Replication - Flexible, high-performance data movement

OLTP and Data Warehouse Fundamentals

The Contenders
• MySQL: Popular open source RDBMS for transaction processing
• Vertica: Popular closed source RDBMS for analytics

Storage Layout in MySQL
Sales Table
id  cust_id  prod_id  ...
1   335301   532      ...
2   2378     6235     ...
3   ...      ...      ...
Product Table
id   sku     type
532  C00135  consumer
533  S09957  specialty
...  ...
Prod_ID Index
prod_id  id
532      1
6235     2
...      ...
• Row format makes table scans very slow
• Indexes slow OLTP
• Low/no data compression
• Limited index types
• Limited join types

Storage Layout in Vertica
Sales Table (one list per column)
id: 1, 2, 3
cust_id: 335301, 2378, ...
prod_id: 532, 6235, ...
quantity: 1, 3, ...
Product Table (one list per column)
id: 532, 533, ...
sku: C00135, S09957, ...
type: consumer, specialty, ...
• Fast scans on columns
• Updates to single rows are hideously slow
• Every column is an index
• Good compression
• Fast joins with parallel query
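The trade-off between the two storage layouts can be illustrated with a toy sketch. This is plain Python for illustration only, not how either engine stores data; the third Sales row and the helper names are made up for the example.

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# Values mirror the Sales table on the slides; row 3 is invented.
rows = [
    {"id": 1, "cust_id": 335301, "prod_id": 532, "quantity": 1},
    {"id": 2, "cust_id": 2378, "prod_id": 6235, "quantity": 3},
    {"id": 3, "cust_id": 2378, "prod_id": 532, "quantity": 2},
]

# Column layout: one list per column, in the same row order.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_quantity_row_store(table):
    # Walks whole rows to pick out one field, as a row store must.
    return sum(r["quantity"] for r in table)


def total_quantity_column_store(cols):
    # Reads only the one column the query needs.
    return sum(cols["quantity"])


def update_row_column_store(cols, position, new_values):
    # A single-row update has to visit each affected column separately.
    for name, value in new_values.items():
        cols[name][position] = value


print(total_quantity_row_store(rows))
print(total_quantity_column_store(columns))
```

Both functions return the same total; the difference is how much data each layout has to touch to get there, and how many places a single-row update must reach.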

Traditional ETL Problems
MySQL Sales Table -> Extract -> Transfer -> Load -> Sales Table
• Date columns = intrusive
• Batch-oriented = not timely
• Scan for changes = performance hit

Questions for Real-Time Loading
• Do I need to transform data and if so how?
• Do I need to clean up bad information?
• Do I need to process UPDATE/DELETE too?
• Do I need to load from multiple sources?
• How timely do loads need to be?
• What if something fails?

Tungsten Replicator Basics

Real-Time Data Replication
MySQL Sales Table -> DBMS Logs -> Data Replication -> Sales Table
• Fast propagation = timely
• No SQL changes = transparent
• Automatic change capture = low impact

Tungsten Master/Slave in Action
Master: DBMS Logs -> Replicator -> THL (Transactions + Metadata)
Slave: Replicator -> THL (Transactions + Metadata)
• Download transactions via network
• Apply using JDBC

Pipelines with Parallel Apply
Pipeline: Remote Master -> Stage -> Transaction History Log -> Stage -> Parallel Queue -> Stage -> Slave DBMS
• Each stage runs Extract, Filter, Apply
• The filter step ahead of the Parallel Queue assigns a shard ID
• The last stage runs several Extract/Filter/Apply tasks in parallel
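A minimal sketch of the shard assignment idea, assuming events are sharded by schema name and routed to a fixed number of apply channels by hash. The event fields, the function names, and the choice of CRC32 are illustrative assumptions, not the replicator's actual implementation.

```python
import zlib
from collections import defaultdict


def assign_shard_id(event):
    # Assumption: shard by schema so one schema's events stay in order.
    return event["schema"]


def route(events, channels):
    """Distribute events over `channels` queues, keeping each shard together."""
    queues = defaultdict(list)
    for event in events:
        shard_id = assign_shard_id(event)
        # Deterministic hash so a shard always lands on the same channel.
        channel = zlib.crc32(shard_id.encode("utf-8")) % channels
        queues[channel].append(event)
    return dict(queues)


# Hypothetical events in commit order.
events = [
    {"seqno": 1, "schema": "db01"},
    {"seqno": 2, "schema": "db02"},
    {"seqno": 3, "schema": "db01"},
    {"seqno": 4, "schema": "db02"},
]
queues = route(events, 3)
```

Because every event of a shard goes to the same queue, each apply task sees that shard's changes in their original order even though different shards are applied concurrently.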

Real-Time Batch Loading
MySQL (binlog_format=row) -> MySQL Binlog -> Tungsten Master Replicator -> Tungsten Slave Replicator -> CSV Files
Tungsten Master Replicator: Service my2vr
• MySQLExtractor
• Special Filters
  * pkey - Fill in pkey info
  * colnames - Fill in names
  * replicate - Ignore tables
Tungsten Slave Replicator: Service my2vr
• Single transactions from OLTP operations
• Large transaction batches to leverage load parallelization

Batch Loading--The Gory Details
Replicator (Service my2vr): Transactions from master -> CSV Files
Either: CSV Files -> COPY to stage tables -> Staging Tables -> Merge Script (SELECT to base tables) -> Base Tables
Or: CSV Files -> COPY directly to base tables -> Base Tables
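The staging-table path can be sketched as a small in-memory merge. This assumes each staged row carries an operation code ("I" or "D"), a sequence number, and a row ID ahead of the key and data; that layout and every name below are assumptions for illustration, not the shipped merge script.

```python
def merge_batch(base, staged):
    """Apply a batch of staged changes to `base`, a dict keyed by primary key.

    Each staged row is (opcode, seqno, row_id, pkey, values).
    """
    # Keep only the final operation per key, in commit order.
    last_op = {}
    for opcode, seqno, row_id, pkey, values in sorted(
        staged, key=lambda s: (s[1], s[2])
    ):
        last_op[pkey] = (opcode, values)

    # Step 1: delete every key the batch touched.
    for pkey in last_op:
        base.pop(pkey, None)

    # Step 2: insert rows whose final operation is an insert.
    for pkey, (opcode, values) in last_op.items():
        if opcode == "I":
            base[pkey] = values
    return base


base_table = {1: ("a",), 2: ("b",)}
staged_rows = [
    ("D", 10, 1, 1, None),      # update of key 1 = delete ...
    ("I", 10, 2, 1, ("a2",)),   # ... followed by insert
    ("D", 11, 1, 2, None),      # plain delete
    ("I", 12, 1, 3, ("c",)),    # plain insert
]
merge_batch(base_table, staged_rows)
```

Treating an update as a delete followed by an insert keeps the merge to two set-oriented steps, which suits a column store far better than row-by-row updates.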

Setting Up MySQL to Vertica Replication

DEMO
MySQL to Vertica replication with some bells and a whistle
MySQL (sysbench load on each schema) -> Vertica
• db01 -> db01
• db02 -> renamed02
• db03 -> X (not replicated)

Get the Code
wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.1.0-285.tar.gz
tar -xf tungsten-replicator-2.1.0-285.tar.gz
cd tungsten-replicator-2.1.0-285

Installing MySQL Master
tools/tungsten-installer --master-slave -a \
  --service-name=mysql2vertica \
  --master-host=mysql1 \
  --cluster-hosts=mysql1 \
  --datasource-user=tungsten \
  --datasource-password=secret \
  --home-directory=/opt/continuent \
  --buffer-size=100 \
  --java-file-encoding=UTF8 \
  --java-user-timezone=GMT \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=replicate,colnames,pkey \
  --property=replicator.filter.pkey.addPkeyToInserts=true \
  --property=replicator.filter.pkey.addColumnsToDeletes=true \
  --property=replicator.filter.replicate.do=db01.*,db02.* \
  --start-and-report
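To make the effect of replicator.filter.replicate.do concrete, here is a minimal sketch of wildcard matching on schema.table names. It assumes shell-style "*" wildcards and a comma-separated pattern list; the function name is hypothetical and the real filter may differ in detail.

```python
from fnmatch import fnmatchcase


def is_replicated(schema, table, do_patterns):
    """Return True if schema.table matches any comma-separated pattern."""
    name = f"{schema}.{table}"
    return any(fnmatchcase(name, p) for p in do_patterns.split(","))


DO = "db01.*,db02.*"
print(is_replicated("db01", "sbtest", DO))
print(is_replicated("db03", "sbtest", DO))
```

With the pattern list from the install command, tables in db01 and db02 pass the filter and everything in db03 is dropped before it reaches the slave.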

Installing Vertica Slave
tools/tungsten-installer --master-slave -a \
  --service-name=mysql2vertica \
  --home-directory=/opt/continuent \
  --cluster-hosts=vertica1 \
  --master-host=mysql1 \
  --datasource-type=vertica \
  --datasource-user=dbadmin \
  --datasource-password=secret \
  --datasource-port=5433 \
  --batch-enabled=true \
  --batch-load-template=vertica6 \
  --vertica-dbname=bigdata \
  --java-user-timezone=GMT \
  --java-file-encoding=UTF8 \
  --svc-applier-filters=dbtransform \
  --property=replicator.filter.dbtransform.from_regex1=db02 \
  --property=replicator.filter.dbtransform.to_regex1=renamed02 \
  --property=replicator.stage.q-to-dbms.blockCommitRowCount=25000 \
  --start-and-report
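The dbtransform settings above rename a schema on its way to Vertica. A minimal sketch of that idea, assuming the rename applies only when the pattern matches the whole schema name (the exact matching rule of the real filter is not stated on the slide, and the function name is hypothetical):

```python
import re


def transform_schema(schema, from_regex, to_regex):
    # Assumption: rename only when the pattern matches the entire name.
    if re.fullmatch(from_regex, schema):
        return re.sub(from_regex, to_regex, schema)
    return schema


for name in ("db01", "db02"):
    print(name, "->", transform_schema(name, "db02", "renamed02"))
```

This reproduces the demo mapping: db01 keeps its name and db02 arrives as renamed02.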

Generate Schema Using ddlscan
MySQL Tables -> ddlscan
• Data types?
• Column lengths?
• Naming conventions?
• Staging tables?

Tungsten ddlscan Utility
cd /opt/continuent/tungsten/tungsten-replicator/bin
# Base table generation.
./ddlscan -template ddl-mysql-vertica.vm \
  -db db01 -user tungsten -pass secret >> ddl.sql
# Staging table generation.
./ddlscan -template ddl-mysql-vertica-staging.vm \
  -db db01 -user tungsten -pass secret >> ddl.sql
# Load into Vertica.
vsql -Udbadmin -wsecret < ddl.sql
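To show the kind of work a schema generator has to do, here is a rough sketch that emits base and staging DDL from a column list. The type mapping, the stage_xxx_ prefix, and the tungsten_* bookkeeping columns are assumptions for illustration; the actual output comes from the ddlscan templates and may differ.

```python
# Hypothetical MySQL -> Vertica type mapping, for illustration only.
TYPE_MAP = {
    "int": "INT",
    "bigint": "INT",
    "varchar": "VARCHAR",
    "datetime": "TIMESTAMP",
    "double": "FLOAT",
}


def vertica_type(mysql_type, length=None):
    base = TYPE_MAP[mysql_type]
    if base == "VARCHAR" and length:
        return f"VARCHAR({length})"
    return base


def create_table(schema, table, columns, staging=False):
    """Build a CREATE TABLE statement; columns are (name, type, length)."""
    name = f"{schema}.stage_xxx_{table}" if staging else f"{schema}.{table}"
    lines = []
    if staging:
        # Assumed bookkeeping columns that precede the data columns.
        lines += [
            "tungsten_opcode CHAR(1)",
            "tungsten_seqno INT",
            "tungsten_row_id INT",
        ]
    lines += [f"{col} {vertica_type(t, n)}" for col, t, n in columns]
    body = ",\n  ".join(lines)
    return f"CREATE TABLE {name} (\n  {body}\n);"


product = [("id", "int", None), ("sku", "varchar", 10), ("type", "varchar", 20)]
print(create_table("db01", "product", product))
print(create_table("db01", "product", product, staging=True))
```

The staging variant differs from the base table only in its name and the extra leading columns, which is why both can be generated from the same MySQL metadata.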

Checking Status
# Checking status on master
trepctl -host logos1 heartbeat
trepctl -host logos1 status
# Checking status on slave
trepctl -host vertica1 status
# Checking detailed performance of apply task.
trepctl -host vertica1 status -name tasks

Application Tips and Tricks

Application Design Practices
• Primary keys on all tables
• (Tungsten requires single column keys)
• Clean schema design *really* helps
• UTF-8 character set--or at least be consistent
• Use GMT timezone--or be very consistent about dates
• Use row replication on MySQL master

Transforming Data -- Replicator Filters
• Tables to ignore/include?
• Schema/table/column renaming?
• Map names to upper/lower case?
• Drop data?
tungsten-installer --master-slave -a \
  --service-name=mysql2vertica \
  ...
  --svc-extractor-filters=pkey,colnames,replicate \
  --property=replicator.filter.replicate.do=db01.*,db02.* \
  ...

List of Commonly Used Filters
• CDC -- Transform log to record of changes
• colnames -- Add column names
• dbtransform -- Change db name only
• enumtostring -- Make MySQL enums a string
• pkey -- Add primary key metadata
• rename -- Rename db/table/column
• replicate -- Replicate/don’t replicate tables
• zerodate2null -- Make MySQL ‘0’ dates null
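Filters run as a chain, each one rewriting a row change before the next sees it. The sketch below imitates two filters from the list in plain Python; the row representation and function signatures are assumptions for illustration, not the replicator's filter API.

```python
ZERO_DATES = {"0000-00-00", "0000-00-00 00:00:00"}


def zerodate2null(row, date_columns):
    # Replace MySQL zero dates with None (NULL) in the named columns.
    return {
        k: (None if k in date_columns and v in ZERO_DATES else v)
        for k, v in row.items()
    }


def enumtostring(row, enum_labels):
    # MySQL stores enums as 1-based indexes into the label list.
    out = dict(row)
    for col, labels in enum_labels.items():
        idx = out[col]
        if isinstance(idx, int) and idx >= 1:
            out[col] = labels[idx - 1]
    return out


def apply_filters(row, filters):
    for f in filters:
        row = f(row)
    return row


chain = [
    lambda r: zerodate2null(r, {"created"}),
    lambda r: enumtostring(r, {"type": ["consumer", "specialty"]}),
]
row = {"id": 532, "type": 1, "created": "0000-00-00 00:00:00"}
print(apply_filters(row, chain))
```

Keeping each transformation in its own small filter makes it easy to enable only the ones a given target needs.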

Transforming Data -- Staging Server(s)
OLTP Servers -> Staging Server with Triggers/SQL -> Vertica Cluster

Transforming Data -- Merge Script Hacks
# Hacked load script for Vertica--deletes always precede inserts, so
# inserts can load directly.
# Extract deleted data keys and put in temp CSV file for deletes.
!egrep '^"D",' %%CSV_FILE%% |cut -d, -f4 > %%CSV_FILE%%.delete
COPY %%STAGE_TABLE_FQN%% FROM '%%CSV_FILE%%.delete'
DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'
# Delete rows using an IN clause. You could also set a column value to
# mark deleted rows.
DELETE FROM %%BASE_TABLE%% WHERE %%BASE_PKEY%% IN
(SELECT %%STAGE_PKEY%% FROM %%STAGE_TABLE_FQN%%)
# Load inserts directly into base table from a separate CSV file.
!egrep '^"I",' %%CSV_FILE%% |cut -d, -f4- > %%CSV_FILE%%.insert
COPY %%BASE_TABLE%% FROM '%%CSV_FILE%%.insert'
DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'
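The egrep/cut steps in the script split on every comma, so a quoted value that itself contains a comma would be cut in the wrong place. As a hedged alternative sketch, the same split can be done with Python's csv module, assuming (as the cut commands imply) that field 1 is the operation code and field 4 is the key; the sample data is made up.

```python
import csv
import io


def split_batch(csv_text):
    """Split a batch CSV into delete keys and insert rows."""
    delete_keys, insert_rows = [], []
    for record in csv.reader(io.StringIO(csv_text)):
        opcode = record[0]
        if opcode == "D":
            # Same as: cut -d, -f4
            delete_keys.append(record[3])
        elif opcode == "I":
            # Same as: cut -d, -f4-
            insert_rows.append(record[3:])
    return delete_keys, insert_rows


sample = (
    '"D","11","1","532","null","null"\n'
    '"I","11","2","532","C00135","consumer, retail"\n'
)
deletes, inserts = split_batch(sample)
print(deletes)
print(inserts)
```

The quoted value "consumer, retail" stays in one field here, which is the case the cut-based version would mishandle.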

Provisioning -- Using CSV
mysql> SELECT * FROM sales INTO OUTFILE 'sales.csv';
...
(Fix up data if necessary)
...
vsql> COPY sales FROM 'sales.csv'
DIRECT NULL 'null'
DELIMITER ',' ENCLOSED BY '"';
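One likely fix-up: with no FIELDS clause, MySQL writes tab-separated fields and \N for NULL, which does not match the COPY options shown. A minimal conversion sketch follows, assuming no field contains a tab, double quote, or backslash; the function name is hypothetical and real data may need more careful escaping.

```python
def fix_up_line(line):
    """Convert one default-format OUTFILE line to quoted, comma-separated form."""
    out = []
    for field in line.rstrip("\n").split("\t"):
        if field == "\\N":
            # MySQL's NULL marker becomes the literal the COPY expects.
            out.append("null")
        elif '"' in field or "\\" in field:
            # Out of scope for this sketch: escaping rules differ by target.
            raise ValueError(f"field needs escaping: {field!r}")
        else:
            out.append(f'"{field}"')
    return ",".join(out)


print(fix_up_line("1\t335301\t\\N\n"))
```

Alternatively, the export itself can be told to write commas and quotes with a FIELDS clause, leaving only the NULL marker to translate.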

Provisioning Using a Sandbox Server
OLTP Server -> Temporary Sandbox Server -> Vertica Cluster
1. Restore logical backup
2. Replicate restored transactions
3. Replicate normally after restore loads

Parallel Provisioning from Sandbox
OLTP Server -> Temporary Sandbox Server -> Vertica Cluster
1. Restore logical backup
2. Replicate restored data in parallel
3. Replicate normally after restore loads

Complex Topologies: Fan-In
logos1 (Master) -> Vertica Cluster
logos2 (Master) -> Vertica Cluster
Vertica Cluster slave services: logos1, logos2

Wrapping Up

Tungsten University Sessions
• Load a Vertica Data Warehouse with MySQL Data (May 30 10am PDT and June 4, 4pm CEST)
Send feedback to: tu@continuent.com
©Continuent 2012.
Continuent Web Page:
http://www.continuent.com
Tungsten Replicator 2.0:
http://code.google.com/p/tungsten-replicator
Our Blogs:
http://scale-out-blog.blogspot.com
http://flyingclusters.blogspot.com
http://datacharmer.org/blog
http://www.continuent.com/news/blogs
560 S. Winchester Blvd., Suite 500
San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: sales@continuent.com