© 2014 VMware Inc. All rights reserved.
Real-time Data Loading from Oracle and MySQL to Data Warehouses, and Oracle to Analytics
MC Brown, Senior Product Line Manager
Continuent Quick Introduction

History
•  2004 – Continuent established in USA
•  2009 – 3rd Generation Continuent Tungsten (aka VMware Continuent) ships
•  2014 – 100+ customers running business-critical applications
•  Oct 2014 – Acquisition by VMware: now part of the vCloud Air Business Unit
•  Oct 2015 – Continuent solutions available through VMware sales

Products
Industry-leading clustering and replication for open source DBMS
•  Clustering – Commercial-grade HA, performance scaling, and data management for MySQL
•  Replication – Flexible, high-performance data movement
Continuent Facts

Business-Critical Deployment Examples
•  High Availability for MySQL – Largest cluster deployment performs 800M+ transactions/day on 275 TB of relational data
•  Business Continuity – Cross-site cluster topologies widely deployed, including primary/DR and multi-master
•  High Performance Replication – Largest installations transfer billions of transactions daily using high-speed, parallel replication
•  Heterogeneous Integration – Customers replicate from MySQL to Oracle, Hadoop, Redshift, Vertica, and others
•  Real-time Analytics – Optimized data loading for data warehouses, with deployments of up to 200 MySQL masters feeding into Hadoop
Select Continuent Customers
Data Warehouse Integration is Changing
•  Traditional data warehouse usage was based on a dump from the transactional store, loaded into the data warehouse
•  Data warehousing and analytics were done off the loaded historical data
•  Data warehouses often use merged data from multiple sources, which was hard to handle
•  Data warehouses are now frequently sources as well as targets for data, i.e.:
–  Export data to the data warehouse
–  Analyze the data
–  Feed summary data back to the application to display stats to users
Modern Data Warehouse Sequences
How do we cope with that model?
•  Traditional Extract-Transform-Load (ETL) methods take too long
•  Data needs to be replicated into the data warehouse in real time
•  Continuous stream of information
•  Replicate everything
•  Use the data warehouse to provide joins and analytics
Data Warehouse Choices
•  Hadoop
–  General-purpose storage platform
–  MapReduce for data processing
–  Front-end interfaces for interaction, both SQL-like (Hive, HBase, Impala) and non-SQL (Pig, native, Spark)
–  JDBC/ODBC interfaces improving
•  Vertica
–  Massive cluster-based column store
–  SQL and ODBC/JDBC interfaces
•  Amazon Redshift
–  Highly flexible column store
–  Easy to deploy
VMware Continuent for Replication/Data Warehouses
VMware Continuent (the software formerly known as Tungsten Replicator) is a fast, open source database replication engine, designed for speed and flexibility.
•  GPL V2 license
•  100% open source
•  Annual support subscription available
Continuent Master/Slave in Action (MySQL)
[Diagram: the master-side Replicator extracts transactions and metadata from the MySQL binlog into the THL; the slave-side Replicator downloads the THL transactions over the network and applies them via JDBC.]
Continuent Master/Slave in Action (Oracle)
[Diagram: the same pipeline, with the master-side Replicator extracting transactions and metadata via Oracle CDC into the THL; the slave-side Replicator downloads the THL over the network and applies via JDBC.]
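Either replicator's state and the captured THL can be inspected from the command line with the stock tools; a minimal sketch (the sequence number is illustrative):

  # Check the state of the local replication service
  trepctl status

  # Show one THL transaction, including its metadata
  thl list -seqno 42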
The Data Warehouse Impedance Mismatch
[Diagram: the transactional store can be dumped/provisioned into the data warehouse in batches, but its stream of individual transactions cannot be applied directly.]
Transactional and Data Warehouse Metadata
•  Replicating data is not just about the data
•  Table structures must be replicated too
•  ddlscan handles the translation (see the sketch below)
–  Migrates an existing MySQL or Oracle schema into the target schema
–  Template based
–  Handles underlying datatype matching
–  Needs to be executed before replication starts
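A minimal sketch of using ddlscan to generate Hive staging and base DDL from an existing MySQL schema (host, credentials, and schema name are illustrative; the template names follow those shipped with the replicator):

  # Base-table DDL for Hive
  ddlscan -user tungsten -pass secret \
      -url jdbc:mysql:thin://dbhost:3306/sales \
      -template ddl-mysql-hive-0.10.vm -db sales > sales-base.sql

  # Matching staging-table DDL
  ddlscan -user tungsten -pass secret \
      -url jdbc:mysql:thin://dbhost:3306/sales \
      -template ddl-mysql-hive-0.10-staging.vm -db sales > sales-staging.sql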
Replicating into Vertica
[Diagram: the master Replicator streams the THL to the slave Replicator, whose JS batch applier writes CSV files and loads them via JDBC/cpimport into staging tables, which are then merged into the base tables.]
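Conceptually, the merge step behaves like the following sketch (table and column names are illustrative rather than the applier's actual generated SQL, and the handling of multiple changes to the same row within one batch is elided):

  # Load the batch CSV into the staging table, then merge into the base table
  vsql -U dbadmin -w secret -c "COPY sales.stage_xxx_msg
      FROM '/tmp/staging/msg.csv' DELIMITER ',' NULL 'null' DIRECT;"
  vsql -U dbadmin -w secret -c "DELETE FROM sales.msg
      WHERE id IN (SELECT id FROM sales.stage_xxx_msg);"
  vsql -U dbadmin -w secret -c "INSERT INTO sales.msg
      SELECT id, msg FROM sales.stage_xxx_msg WHERE tungsten_opcode != 'D';"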
Vertica Demo
Replicating into Redshift
[Diagram: the slave Replicator's JS batch applier writes CSV files, uploads them to S3 with s3cmd, and issues a Redshift COPY over JDBC into staging tables, which are then merged into the base tables.]
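The load path can be approximated by hand as follows (a hedged sketch; bucket, cluster, and table names are illustrative, and credentials are placeholders):

  # Push the batch CSV to S3
  s3cmd put /tmp/staging/msg.csv s3://tungsten-staging/msg.csv

  # Have Redshift ingest it into the staging table
  psql -h mycluster.redshift.amazonaws.com -U awsuser -d sales -c "
      COPY stage_xxx_msg FROM 's3://tungsten-staging/msg.csv'
      CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
      CSV;"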
Replicating into Hadoop
[Diagram: the slave Replicator's JS batch applier writes CSV files and loads them into HDFS with hadoop fs.]
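The HDFS load is nothing more exotic than hadoop fs; a minimal sketch (paths are illustrative):

  # Push a staged CSV batch into the table's staging directory in HDFS
  hadoop fs -mkdir -p /user/tungsten/staging/sales/msg
  hadoop fs -put /tmp/staging/msg-42.csv /user/tungsten/staging/sales/msg/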
Initial Materialization within Hadoop
[Diagram: load-reduce-check migrates the staging/base DDL into Hive, loads the replicated CSV into the staging table, and materializes the base table.]
Hadoop Demo 1
Ongoing Materialization within Hadoop
[Diagram: the materialize tool merges newly replicated CSV in the staging table into the base table through Hive.]
Hadoop Demo 2
Provisioning Options (all data warehouses)
•  MySQL
–  Traditional CSV export and import (see the sketch after this list)
–  Dump and load through Blackhole engine
–  Use tungsten_provision_thl
•  Oracle
–  Traditional CSV export and import
–  Use parallel extractor
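For the traditional CSV route, the MySQL side can be as simple as the following sketch (schema, table, and path are illustrative):

  # Export a table to CSV on the MySQL server
  mysql -u tungsten -p -e "SELECT * FROM sales.msg
      INTO OUTFILE '/tmp/provision/msg.csv'
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
      LINES TERMINATED BY '\n';"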
Provisioning Options (Hadoop)
•  MySQL
–  Traditional CSV export and import
–  Dump and load through Blackhole engine
–  Use tungsten_provision_thl
–  Use Sqoop (see the sketch after this list)
•  Oracle
–  Traditional CSV export and import
–  Use parallel extractor
–  Use Sqoop
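A minimal Sqoop provisioning sketch (connection details and paths are illustrative):

  # One-time bulk import of a table into HDFS
  sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username tungsten --password secret \
      --table msg \
      --target-dir /user/tungsten/provision/sales/msg \
      --fields-terminated-by ','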
Comparing Loading Methods for Hadoop

                          Manual via CSV              Sqoop                         Tungsten Replicator
  Process                 Manual/Scripted             Manual/Scripted               Fully automated
  Incremental loading     Possible with DDL changes   Requires DDL changes          Fully supported
  Latency                 Full load                   Intermittent                  Real-time
  Extraction requirements Full table scan             Full and partial table scans  Low-impact CDC/binlog scan
Hadoop Demo 3
Sqoop and Materialization within Hadoop
[Diagram: Sqoop provisions the base table directly, while the Replicator feeds ongoing changes as CSV into the staging table; Hive materialization then merges the staging rows into the base table.]
How the Materialization Works

Staging table (raw change log):

  Op  Seqno  ID  Msg
  I   1      1   Hello World!
  I   2      2   Meet MC
  D   3      1
  I   3      1   Goodbye World

Materialized base table (latest surviving row per ID):

  Op  Seqno  ID  Msg
  I   2      2   Meet MC
  I   3      1   Goodbye World
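In Hive terms, the base table holds the most recent non-delete staging row per key; a hedged sketch of the materialization query (table names are illustrative, while the tungsten_* columns follow the staging-table conventions; requires a Hive version with window functions):

  hive -e "
  INSERT OVERWRITE TABLE sales.msg
  SELECT id, msg FROM (
      SELECT id, msg, tungsten_opcode,
             ROW_NUMBER() OVER (PARTITION BY id
                 ORDER BY tungsten_seqno DESC, tungsten_row_id DESC) AS rn
      FROM sales.stage_xxx_msg) t
  WHERE rn = 1 AND tungsten_opcode <> 'D';"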
Replication support (Hadoop specific)
•  Extract from MySQL or Oracle
•  Hadoop support
–  Cloudera (Certified), HortonWorks, MapR, Pivotal, Amazon EMR, IBM (Certified), Apache
•  Provision using Sqoop or parallel extraction
•  Schema generation for Hive
•  Tools for generating materialized views
•  Parallel CSV file loading
•  Partition loaded data by commit time
•  Schema change notification
Data Warehouse Possibilities: Point in Time Tables
[Figure: a 45-day timeline labeled Monday, Wednesday, and Friday, illustrating point-in-time table snapshots taken at intervals.]
Data Warehouse Possibilities: Time Series Generation

Staging table (raw change log):

  Op  Seqno  ID  Date    Msg
  I   1      1   1/6/14  Hello World!
  I   2      2   2/6/14  Meet MC
  I   3      1   2/6/14  Goodbye World
  I   4      1   3/6/14  Hello Tuesday
  I   4      2   3/6/14  Ruby Wednesday
  I   5      1   4/6/14  Final Count

Generated time series for ID 1:

  ID  Date    Msg
  1   1/6/14  Hello World!
  1   2/6/14  Goodbye World
  1   3/6/14  Hello Tuesday
  1   4/6/14  Final Count
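The per-day history for a key can be derived from the same staging log; a hedged sketch (assumes an illustrative commit_date column carried on each staging row, with the same tungsten_* conventions as above):

  hive -e "
  SELECT id, commit_date, msg FROM (
      SELECT id, commit_date, msg, tungsten_opcode,
             ROW_NUMBER() OVER (PARTITION BY id, commit_date
                 ORDER BY tungsten_seqno DESC) AS rn
      FROM sales.stage_xxx_msg) t
  WHERE rn = 1 AND tungsten_opcode <> 'D' AND id = 1
  ORDER BY commit_date;"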
Getting Started!
•  Tungsten Replicator builds are available on code.google.com
   http://code.google.com/p/tungsten-replicator/
•  Replicator documentation is available on the Continuent website
   http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
•  Tungsten Hadoop tools are available on GitHub
   https://github.com/continuent/continuent-tools-hadoop
•  Contact Continuent for support
For more information, contact us:

Robert Noyes, Alliance Manager, USA & Canada
rnoyes@vmware.com, +1 (650) 575-0958

Philippe Bernard, Alliance Manager, EMEA & APAC
pbernard@vmware.com, +41 79 347 1385

MC Brown, Senior Product Line Manager
mcb@vmware.com

Eero Teerikorpi, Sr. Director, Strategic Alliances
eteerikorpi@vmware.com, +1 (408) 431-3305

