Replication in real-time from Oracle and MySQL into data warehouses and analytics



Analyzing transactional data is becoming increasingly common, especially as data sizes and complexity increase and transactional stores are no longer able to keep pace with ever-increasing storage demands. Although there are many techniques available for loading data, getting data into your data warehouse in real time is a more difficult problem. VMware Continuent provides capabilities for continuous, real-time data warehouse loading. Join us for practical tips and a live demo of how to get your data warehouse loading projects off the ground quickly and efficiently when replicating from MySQL and Oracle into Amazon Redshift, HP Vertica and Hadoop.

Published in: Software


  1. © 2014 VMware Inc. All rights reserved. Real-time Data Loading from Oracle and MySQL to Data Warehouses and Analytics. MC Brown, Senior Product Line Manager
  2. Continuent Quick Introduction
History:
•  2004: Continuent established in USA
•  2009: 3rd Generation Continuent Tungsten (aka VMware Continuent) ships
•  2014: 100+ customers running business-critical applications
•  Oct 2014: Acquisition by VMware; now part of the vCloud Air Business Unit
•  Oct 2015: Continuent solutions available through VMware sales
Products: Industry-leading clustering and replication for open source DBMS
•  Clustering: commercial-grade HA, performance scaling, and data management for MySQL
•  Replication: flexible, high-performance data movement
  3. Business-Critical Deployment Examples: Continuent Facts
•  High Availability for MySQL: largest cluster deployment performs 800M+ transactions/day on 275 TB of relational data
•  Business Continuity: cross-site cluster topologies widely deployed, including primary/DR and multi-master
•  High Performance Replication: largest installations transfer billions of transactions daily using high-speed, parallel replication
•  Heterogeneous Integration: customers replicate from MySQL to Oracle, Hadoop, Redshift, Vertica, and others
•  Real-time Analytics: optimized data loading for data warehouses, with deployments of up to 200 MySQL masters feeding into Hadoop
  4. Select Continuent Customers
  5. Data Warehouse Integration is Changing
•  Traditional data warehouse usage was based on dumps from the transactional store, loaded into the data warehouse
•  Data warehousing and analytics were done on the historical data loaded
•  Data warehouses often use merged data from multiple sources, which was hard to handle
•  Data warehouses are now frequently sources as well as targets for data, i.e.:
–  Export data to the data warehouse
–  Analyze data
–  Feed summary data back to the application to display stats to users
  6. Modern Data Warehouse Sequences
  7. How do we cope with that model?
•  Traditional Extract-Transform-Load (ETL) methods take too long
•  Data needs to be replicated into a data warehouse in real time
•  Continuous stream of information
•  Replicate everything
•  Use the data warehouse to provide joins and analytics
  8. Data Warehouse Choices
•  Hadoop
–  General-purpose storage platform
–  MapReduce for data processing
–  Front-end interfaces for interaction, both SQL-like (Hive, HBase, Impala) and non-SQL (Pig, native, Spark)
–  JDBC/ODBC interfaces improving
•  Vertica
–  Massive cluster-based column store
–  SQL and ODBC/JDBC interfaces
•  Amazon Redshift
–  Highly flexible column store
–  Easy to deploy
  9. VMware Continuent for Replication/Data Warehouses (software formerly known as Tungsten Replicator) is a fast, open source database replication engine
•  Designed for speed and flexibility
•  GPL v2 license, 100% open source
•  Annual support subscription available
  10. Continuent Master/Slave in Action (MySQL)
Binlog → master Replicator (THL: transactions + metadata) → download transactions via network → slave Replicator (THL: transactions + metadata) → apply to the slave using JDBC
  11. Continuent Master/Slave in Action (Oracle)
CDC → master Replicator (THL: transactions + metadata) → download transactions via network → slave Replicator (THL: transactions + metadata) → apply to the slave using JDBC
  12. The Data Warehouse Impedance Mismatch
Transactional store → data warehouse: dump/provision works, but replaying individual transactions does not; changes must be applied in batches
  13. Transactional and Data Warehouse Metadata
•  Replicating data is not just about the data
•  Table structures must be replicated too
•  ddlscan handles the translation
–  Migrates an existing MySQL or Oracle schema into the target schema
–  Template based
–  Handles underlying datatype matches
–  Needs to be executed before replication starts
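The ddlscan translation is template based; the sketch below illustrates the kind of datatype mapping such a template performs when migrating a MySQL schema to a Hive staging schema. The mapping table and helper functions here are illustrative assumptions for demonstration, not ddlscan's actual templates or output.

```python
# Illustrative MySQL -> Hive datatype translation, in the spirit of
# what ddlscan's templates do. The mapping is an assumption, not
# ddlscan's real template output.
MYSQL_TO_HIVE = {
    "int": "INT",
    "bigint": "BIGINT",
    "varchar": "STRING",
    "text": "STRING",
    "datetime": "TIMESTAMP",
    "decimal": "DOUBLE",
}

def translate_column(name, mysql_type):
    """Map a MySQL column definition to a Hive column definition."""
    base = mysql_type.split("(")[0].lower()  # strip lengths, e.g. varchar(255)
    return f"{name} {MYSQL_TO_HIVE.get(base, 'STRING')}"

def hive_ddl(table, columns):
    """Emit a CREATE TABLE statement for the Hive staging schema."""
    cols = ",\n  ".join(translate_column(n, t) for n, t in columns)
    return f"CREATE TABLE {table} (\n  {cols}\n);"

print(hive_ddl("msgs", [("id", "int"), ("msg", "varchar(255)")]))
```

As the slide notes, this schema migration must run before replication starts, so the staging tables exist when the first CSV batches arrive.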
  14. Replicating into Vertica
Replicator → CSV → JS batch script → cpimport/JDBC → staging table → merge → base table
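The CSV/staging/merge flow used for Vertica (and, with different loaders, for Redshift and Hadoop) can be sketched in miniature: row changes are staged as CSV, then merged into the base table in seqno order. This is a simplified simulation of the pattern under an assumed (op, seqno, id, msg) row format; it is not the replicator's actual JS batch scripts or cpimport.

```python
import csv
import io

def stage_batch_csv(batch):
    """Serialize a batch of row changes (op, seqno, id, msg) to CSV,
    as the batch applier does before a bulk loader such as cpimport
    picks the file up."""
    buf = io.StringIO()
    csv.writer(buf).writerows(batch)
    return buf.getvalue()

def merge_into_base(base, batch):
    """Merge staged changes into the base table in seqno order:
    inserts overwrite the keyed row, deletes remove it."""
    for op, seqno, key, msg in sorted(batch, key=lambda r: r[1]):
        if op == "D":
            base.pop(key, None)
        else:
            base[key] = msg
    return base

batch = [("I", 2, 2, "Meet MC"), ("D", 3, 1, ""), ("I", 4, 1, "Goodbye World")]
base = merge_into_base({1: "Hello World!"}, batch)
print(base)
```

The same staging/merge shape appears on the Redshift and Hadoop slides that follow; only the transport (cpimport, s3cmd + COPY, hadoop fs) changes.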
  15. Vertica Demo
  16. Replicating into Redshift
Replicator → CSV → JS batch script → s3cmd upload → COPY → staging table → merge (JDBC) → base table
  17. Replicating into Hadoop
Replicator → CSV → JS batch script → hadoop fs (copy into HDFS)
  18. Initial Materialization within Hadoop
load-reduce-check: migrate the staging/base DDL, load the CSV into the staging table, then run the Hive materialization into the base table
  19. Hadoop Demo 1
  20. Ongoing Materialization within Hadoop
materialize: CSV → staging table → Hive materialization → base table
  21. Hadoop Demo 2
  22. Provisioning Options (all data warehouses)
•  MySQL
–  Traditional CSV export and import
–  Dump and load through the Blackhole engine
–  Use tungsten_provision_thl
•  Oracle
–  Traditional CSV export and import
–  Use the parallel extractor
  23. Provisioning Options (Hadoop)
•  MySQL
–  Traditional CSV export and import
–  Dump and load through the Blackhole engine
–  Use tungsten_provision_thl
–  Use Sqoop
•  Oracle
–  Traditional CSV export and import
–  Use the parallel extractor
–  Use Sqoop
  24. Comparing Loading Methods for Hadoop

                         | Manual via CSV            | Sqoop                        | Tungsten Replicator
Process                  | Manual/scripted           | Manual/scripted              | Fully automated
Incremental loading      | Possible with DDL changes | Requires DDL changes         | Fully supported
Latency                  | Full load                 | Intermittent                 | Real time
Extraction requirements  | Full table scan           | Full and partial table scans | Low-impact CDC/binlog scan
  25. Hadoop Demo 3
  26. Sqoop and Materialization within Hadoop
Sqoop load and replicated CSV → staging table → Hive materialization → base table
  27. How the Materialization Works

Change log (staging):
Op  Seqno  ID  Msg
I   1      1   Hello World!
I   2      2   Meet MC
D   3      1
I   3      1   Goodbye World

Materialized view (base):
Op  Seqno  ID  Msg
I   2      2   Meet MC
I   3      1   Goodbye World
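The materialization on this slide can be expressed as a small reduce over the change log: the last entry per ID wins, and an ID whose final operation is a delete is dropped. A minimal Python simulation, using the slide's own example rows (this mirrors the Hive materialization step in spirit, not in code):

```python
def materialize(log):
    """Reduce a change log of (op, seqno, id, msg) rows to the
    current view: the last entry per ID wins; IDs whose final
    operation is a delete are dropped."""
    last = {}
    for entry in log:          # log is in commit (seqno) order
        last[entry[2]] = entry
    return sorted((e for e in last.values() if e[0] != "D"),
                  key=lambda e: e[1])

log = [
    ("I", 1, 1, "Hello World!"),
    ("I", 2, 2, "Meet MC"),
    ("D", 3, 1, ""),
    ("I", 3, 1, "Goodbye World"),
]
print(materialize(log))
```

Run against the slide's log, this yields exactly the two surviving rows shown in the base table: Meet MC (seqno 2) and Goodbye World (seqno 3).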
  28. Replication Support (Hadoop specific)
•  Extract from MySQL or Oracle
•  Hadoop support: Cloudera (Certified), Hortonworks, MapR, Pivotal, Amazon EMR, IBM (Certified), Apache
•  Provision using Sqoop or parallel extraction
•  Schema generation for Hive
•  Tools for generating materialized views
•  Parallel CSV file loading
•  Partition loaded data by commit time
•  Schema change notification
  29. Data Warehouse Possibilities: Point-in-Time Tables
[Timeline of days 1–45, with Monday, Wednesday, and Friday marked as snapshot points]
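Because the full change log is retained in the warehouse, a table's state at any snapshot point on the timeline can be rebuilt by replaying the log up to that point. A minimal sketch of the idea, using seqnos as the point-in-time marker (the slide's timeline uses dates; the row format is the same assumed (op, seqno, id, msg) shape as earlier examples):

```python
def table_as_of(log, seqno_limit):
    """Rebuild a table's state as of a point in time by replaying
    the change log up to and including the given seqno."""
    state = {}
    for op, seqno, rid, msg in log:  # log is in seqno order
        if seqno > seqno_limit:
            break
        if op == "D":
            state.pop(rid, None)
        else:
            state[rid] = msg
    return state

log = [
    ("I", 1, 1, "Hello World!"),
    ("I", 2, 2, "Meet MC"),
    ("D", 3, 1, ""),
    ("I", 3, 1, "Goodbye World"),
]
print(table_as_of(log, 2))  # state after seqno 2
print(table_as_of(log, 3))  # state after seqno 3
```

Each Monday/Wednesday/Friday snapshot on the slide is simply this replay cut off at a different commit point.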
  30. Data Warehouse Possibilities: Time Series Generation

Change log:
Op  Seqno  ID  Date    Msg
I   1      1   1/6/14  Hello World!
I   2      2   2/6/14  Meet MC
I   3      1   2/6/14  Goodbye World
I   4      1   3/6/14  Hello Tuesday
I   4      2   3/6/14  Ruby Wednesday
I   5      1   4/6/14  Final Count

Time series for ID 1:
ID  Date    Msg
1   1/6/14  Hello World!
1   2/6/14  Goodbye World
1   3/6/14  Hello Tuesday
1   4/6/14  Final Count
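Time series generation can likewise be sketched as a single pass over the dated change log: for the chosen ID, keep the last value written on each date. A minimal simulation using the slide's rows (a sketch of the idea, not the warehouse query the replicator tooling generates):

```python
def time_series(log, target_id):
    """Collapse a change log of (op, seqno, id, date, msg) rows to
    one value per date for a single ID: the last insert on each
    date wins, giving that row's state as of that day."""
    series = {}
    for op, seqno, rid, date, msg in log:  # log is in commit order
        if rid == target_id and op == "I":
            series[date] = msg
    return series

log = [
    ("I", 1, 1, "1/6/14", "Hello World!"),
    ("I", 2, 2, "2/6/14", "Meet MC"),
    ("I", 3, 1, "2/6/14", "Goodbye World"),
    ("I", 4, 1, "3/6/14", "Hello Tuesday"),
    ("I", 4, 2, "3/6/14", "Ruby Wednesday"),
    ("I", 5, 1, "4/6/14", "Final Count"),
]
print(time_series(log, 1))
```

For ID 1 this reproduces the four-row series on the slide, one message per day from 1/6/14 through 4/6/14.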
  31. Getting Started!
•  Tungsten Replicator builds are available on code.google.com: http://code.google.com/p/tungsten-replicator/
•  Replicator documentation is available on the Continuent website: http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
•  Tungsten Hadoop tools are available on GitHub: https://github.com/continuent/continuent-tools-hadoop
Contact Continuent for support
  32. For more information, contact us:
Robert Noyes, Alliance Manager, USA & Canada: rnoyes@vmware.com, +1 (650) 575-0958
Philippe Bernard, Alliance Manager, EMEA & APAC: pbernard@vmware.com, +41 79 347 1385
MC Brown, Senior Product Line Manager: mcb@vmware.com
Eero Teerikorpi, Sr. Director, Strategic Alliances: eteerikorpi@vmware.com, +1 (408) 431-3305

