© 2014 VMware Inc. All rights reserved.
Real-time Data Loading from Oracle and MySQL to Data Warehouses, and Oracle to Analytics
MC Brown, Senior Product Line Manager
Continuent Quick Introduction

History
•  2004 – Continuent established in USA
•  2009 – 3rd Generation Continuent Tungsten (aka VMware Continuent) ships
•  2014 – 100+ customers running business-critical applications
•  Oct 2014 – Acquisition by VMware: now part of the vCloud Air Business Unit
•  Oct 2015 – Continuent solutions available through VMware sales

Products
Industry-leading clustering and replication for open source DBMS
•  Clustering – Commercial-grade HA, performance scaling, and data management for MySQL
•  Replication – Flexible, high-performance data movement
Continuent Facts

Business-Critical Deployment Examples
•  High Availability for MySQL – Largest cluster deployment performs 800M+ transactions/day on 275 TB of relational data
•  Business Continuity – Cross-site cluster topologies widely deployed, including primary/DR and multi-master
•  High Performance Replication – Largest installations transfer billions of transactions daily using high-speed, parallel replication
•  Heterogeneous Integration – Customers replicate from MySQL to Oracle, Hadoop, Redshift, Vertica, and others
•  Real-time Analytics – Optimized data loading for data warehouses, with deployments of up to 200 MySQL masters feeding into Hadoop
Select Continuent Customers
Data Warehouse Integration is Changing
•  Traditional data warehouse usage was based on a dump from the transactional store, loaded into the data warehouse
•  Data warehousing and analytics were done off the loaded historical data
•  Data warehouses often use merged data from multiple sources, which was hard to handle
•  Data warehouses are now frequently sources as well as targets for data, i.e.:
–  Export data to the data warehouse
–  Analyze the data
–  Feed summary data back to the application to display stats to users
Modern Data Warehouse Sequences
How do we cope with that model?
•  Traditional Extract-Transform-Load (ETL) methods take too long
•  Data needs to be replicated into the data warehouse in real time
•  Continuous stream of information
•  Replicate everything
•  Use the data warehouse to provide joins and analytics
Data Warehouse Choices
•  Hadoop
–  General-purpose storage platform
–  MapReduce for data processing
–  Front-end interfaces for interaction, both SQL-like (Hive, HBase, Impala) and non-SQL (Pig, native, Spark)
–  JDBC/ODBC interfaces improving
•  Vertica
–  Massive cluster-based column store
–  SQL and ODBC/JDBC interfaces
•  Amazon Redshift
–  Highly flexible column store
–  Easy to deploy
VMware Continuent for Replication/Data Warehouses
VMware Continuent (the software formerly known as Tungsten Replicator) is a fast, open source database replication engine, designed for speed and flexibility.
•  GPL V2 license
•  100% open source
•  Annual support subscription available
Continuent Master/Slave in Action (MySQL)
[Diagram: the master-side Replicator extracts transactions and metadata from the MySQL binlog into the THL; the slave-side Replicator downloads the THL transactions over the network and applies them via JDBC.]
Continuent Master/Slave in Action (Oracle)
[Diagram: the same pipeline, with the master-side Replicator extracting transactions and metadata via Oracle CDC into the THL; the slave-side Replicator downloads the THL over the network and applies via JDBC.]
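Either replicator's state and the captured THL can be inspected from the command line with the stock tools; a minimal sketch (the sequence number is illustrative):

  # Check the state of the local replication service
  trepctl status

  # Show one THL transaction, including its metadata
  thl list -seqno 42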
The Data Warehouse Impedance Mismatch
[Diagram: the transactional store can be dumped/provisioned into the data warehouse in batches, but its stream of individual transactions cannot be applied directly.]
Transactional and Data Warehouse Metadata
•  Replicating data is not just about the data
•  Table structures must be replicated too
•  ddlscan handles the translation (see the sketch below)
–  Migrates an existing MySQL or Oracle schema into the target schema
–  Template based
–  Handles underlying datatype matching
–  Needs to be executed before replication starts
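A minimal sketch of using ddlscan to generate Hive staging and base DDL from an existing MySQL schema (host, credentials, and schema name are illustrative; the template names follow those shipped with the replicator):

  # Base-table DDL for Hive
  ddlscan -user tungsten -pass secret \
      -url jdbc:mysql:thin://dbhost:3306/sales \
      -template ddl-mysql-hive-0.10.vm -db sales > sales-base.sql

  # Matching staging-table DDL
  ddlscan -user tungsten -pass secret \
      -url jdbc:mysql:thin://dbhost:3306/sales \
      -template ddl-mysql-hive-0.10-staging.vm -db sales > sales-staging.sql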
Replicating into Vertica
[Diagram: the master Replicator streams the THL to the slave Replicator, whose JS batch applier writes CSV files and loads them via JDBC/cpimport into staging tables, which are then merged into the base tables.]
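Conceptually, the merge step behaves like the following sketch (table and column names are illustrative rather than the applier's actual generated SQL, and the handling of multiple changes to the same row within one batch is elided):

  # Load the batch CSV into the staging table, then merge into the base table
  vsql -U dbadmin -w secret -c "COPY sales.stage_xxx_msg
      FROM '/tmp/staging/msg.csv' DELIMITER ',' NULL 'null' DIRECT;"
  vsql -U dbadmin -w secret -c "DELETE FROM sales.msg
      WHERE id IN (SELECT id FROM sales.stage_xxx_msg);"
  vsql -U dbadmin -w secret -c "INSERT INTO sales.msg
      SELECT id, msg FROM sales.stage_xxx_msg WHERE tungsten_opcode != 'D';"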
Vertica Demo
Replicating into Redshift
[Diagram: the slave Replicator's JS batch applier writes CSV files, uploads them to S3 with s3cmd, and issues a Redshift COPY over JDBC into staging tables, which are then merged into the base tables.]
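The load path can be approximated by hand as follows (a hedged sketch; bucket, cluster, and table names are illustrative, and credentials are placeholders):

  # Push the batch CSV to S3
  s3cmd put /tmp/staging/msg.csv s3://tungsten-staging/msg.csv

  # Have Redshift ingest it into the staging table
  psql -h mycluster.redshift.amazonaws.com -U awsuser -d sales -c "
      COPY stage_xxx_msg FROM 's3://tungsten-staging/msg.csv'
      CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
      CSV;"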
Replicating into Hadoop
[Diagram: the slave Replicator's JS batch applier writes CSV files and loads them into HDFS with hadoop fs.]
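The HDFS load is nothing more exotic than hadoop fs; a minimal sketch (paths are illustrative):

  # Push a staged CSV batch into the table's staging directory in HDFS
  hadoop fs -mkdir -p /user/tungsten/staging/sales/msg
  hadoop fs -put /tmp/staging/msg-42.csv /user/tungsten/staging/sales/msg/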
Initial Materialization within Hadoop
[Diagram: load-reduce-check migrates the staging/base DDL into Hive, loads the replicated CSV into the staging table, and materializes the base table.]
Hadoop Demo 1
Ongoing Materialization within Hadoop
[Diagram: the materialize tool merges newly replicated CSV in the staging table into the base table through Hive.]
Hadoop Demo 2
Provisioning Options (all data warehouses)
•  MySQL
–  Traditional CSV export and import (see the sketch after this list)
–  Dump and load through Blackhole engine
–  Use tungsten_provision_thl
•  Oracle
–  Traditional CSV export and import
–  Use parallel extractor
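For the traditional CSV route, the MySQL side can be as simple as the following sketch (schema, table, and path are illustrative):

  # Export a table to CSV on the MySQL server
  mysql -u tungsten -p -e "SELECT * FROM sales.msg
      INTO OUTFILE '/tmp/provision/msg.csv'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
      LINES TERMINATED BY '\n';"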
Provisioning Options (Hadoop)
•  MySQL
–  Traditional CSV export and import
–  Dump and load through Blackhole engine
–  Use tungsten_provision_thl
–  Use Sqoop (see the sketch after this list)
•  Oracle
–  Traditional CSV export and import
–  Use parallel extractor
–  Use Sqoop
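A minimal Sqoop provisioning sketch (connection details and paths are illustrative):

  # One-time bulk import of a table into HDFS
  sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username tungsten --password secret \
      --table msg \
      --target-dir /user/tungsten/provision/sales/msg \
      --fields-terminated-by ','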
Comparing Loading Methods for Hadoop

                          Manual via CSV              Sqoop                         Tungsten Replicator
  Process                 Manual/Scripted             Manual/Scripted               Fully automated
  Incremental loading     Possible with DDL changes   Requires DDL changes          Fully supported
  Latency                 Full load                   Intermittent                  Real-time
  Extraction requirements Full table scan             Full and partial table scans  Low-impact CDC/binlog scan
Hadoop Demo 3
Sqoop and Materialization within Hadoop
[Diagram: Sqoop provisions the base table directly, while the Replicator feeds ongoing changes as CSV into the staging table; Hive materialization then merges the staging rows into the base table.]
How the Materialization Works

Staging table (raw change log):

  Op  Seqno  ID  Msg
  I   1      1   Hello World!
  I   2      2   Meet MC
  D   3      1
  I   3      1   Goodbye World

Materialized base table (latest surviving row per ID):

  Op  Seqno  ID  Msg
  I   2      2   Meet MC
  I   3      1   Goodbye World
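In Hive terms, the base table holds the most recent non-delete staging row per key; a hedged sketch of the materialization query (table names are illustrative, while the tungsten_* columns follow the staging-table conventions; requires a Hive version with window functions):

  hive -e "
  INSERT OVERWRITE TABLE sales.msg
  SELECT id, msg FROM (
      SELECT id, msg, tungsten_opcode,
             ROW_NUMBER() OVER (PARTITION BY id
                 ORDER BY tungsten_seqno DESC, tungsten_row_id DESC) AS rn
      FROM sales.stage_xxx_msg) t
  WHERE rn = 1 AND tungsten_opcode <> 'D';"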
Replication support (Hadoop specific)
•  Extract from MySQL or Oracle
•  Hadoop support
–  Cloudera (Certified), HortonWorks, MapR, Pivotal, Amazon EMR, IBM (Certified), Apache
•  Provision using Sqoop or parallel extraction
•  Schema generation for Hive
•  Tools for generating materialized views
•  Parallel CSV file loading
•  Partition loaded data by commit time
•  Schema change notification
Data Warehouse Possibilities: Point in Time Tables
[Figure: a 45-day timeline labeled Monday, Wednesday, and Friday, illustrating point-in-time table snapshots taken at intervals.]
Data Warehouse Possibilities: Time Series Generation

Staging table (raw change log):

  Op  Seqno  ID  Date    Msg
  I   1      1   1/6/14  Hello World!
  I   2      2   2/6/14  Meet MC
  I   3      1   2/6/14  Goodbye World
  I   4      1   3/6/14  Hello Tuesday
  I   4      2   3/6/14  Ruby Wednesday
  I   5      1   4/6/14  Final Count

Generated time series for ID 1:

  ID  Date    Msg
  1   1/6/14  Hello World!
  1   2/6/14  Goodbye World
  1   3/6/14  Hello Tuesday
  1   4/6/14  Final Count
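The per-day history for a key can be derived from the same staging log; a hedged sketch (assumes an illustrative commit_date column carried on each staging row, with the same tungsten_* conventions as above):

  hive -e "
  SELECT id, commit_date, msg FROM (
      SELECT id, commit_date, msg, tungsten_opcode,
             ROW_NUMBER() OVER (PARTITION BY id, commit_date
                 ORDER BY tungsten_seqno DESC) AS rn
      FROM sales.stage_xxx_msg) t
  WHERE rn = 1 AND tungsten_opcode <> 'D' AND id = 1
  ORDER BY commit_date;"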
Getting Started!
•  Tungsten Replicator builds are available on code.google.com
   http://code.google.com/p/tungsten-replicator/
•  Replicator documentation is available on the Continuent website
   http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
•  Tungsten Hadoop tools are available on GitHub
   https://github.com/continuent/continuent-tools-hadoop
•  Contact Continuent for support
For more information, contact us:

Robert Noyes, Alliance Manager, USA & Canada
rnoyes@vmware.com, +1 (650) 575-0958

Philippe Bernard, Alliance Manager, EMEA & APAC
pbernard@vmware.com, +41 79 347 1385

MC Brown, Senior Product Line Manager
mcb@vmware.com

Eero Teerikorpi, Sr. Director, Strategic Alliances
eteerikorpi@vmware.com, +1 (408) 431-3305

