
Set Up & Operate Real-Time Data Loading into Hadoop

Getting data into Hadoop is not difficult, but it becomes complex if you want to load 'live' or semi-live data into your Hadoop cluster from your Oracle and MySQL databases. There are plenty of solutions available, from manual dumping and loading to tools like Sqoop, each with good and bad sides. Neither approach is easy, and both are prone to lag between the moment you perform the dump and the load into Hadoop.

Replicating into Hadoop with Tungsten Replicator enables you to stream replication data from your Oracle and MySQL servers straight into Hadoop. The Hadoop applier uses the replication service built into Tungsten Replicator and supports all of its topology and reliability features.

In this course, we look at the existing methods of loading data into Hadoop, review how the Hadoop replicator works, and give a live demo of replicating data from MySQL into Hadoop.

Published in: Technology

Transcript of "Set Up & Operate Real-Time Data Loading into Hadoop"

  1. Real-Time Loading from MySQL to Hadoop, Featuring Continuent Tungsten. MC Brown, Senior Information Architect. ©Continuent 2014
  2. Introducing Continuent
  3. Introducing Continuent
     • The leading provider of clustering and replication for open source DBMS
     • Our Product: Continuent Tungsten
     • Clustering - Commercial-grade HA, performance scaling and data management for MySQL
     • Replication - Flexible, high-performance data movement
  4. Quick Continuent Facts
     • Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data
     • Tungsten Replicator was application of the year at the 2011 MySQL User Conference
     • Wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now
     • MySQL to Hadoop deployments are now in progress with multiple customers
  5. Continuent Tungsten Customers
  6. Five Minute Hadoop Introduction
  7. What Is Hadoop, Exactly?
     a. A distributed file system
     b. A method of processing massive quantities of data in parallel
     c. The Cutting family’s stuffed elephant
     d. All of the above
  8. Hadoop Distributed File System
     Clients (Java client, Hive, Pig, hadoop command) find a file through the NameNode (directory) and read block(s) from the DataNodes (replicated data).
  9. Map/Reduce
     Input 1: Acme,2013,4.75; Spitze,2013,25.00; Acme,2013,55.25; Excelsior,2013,1.00; Spitze,2013,5.00
     Input 2: Spitze,2014,60.00; Spitze,2014,9.50; Acme,2014,1.00; Acme,2014,4.00; Excelsior,2014,1.00; Excelsior,2014,9.00
     MAP output 1: Acme,60.00; Excelsior,1.00; Spitze,30.00
     MAP output 2: Acme,5.00; Excelsior,10.00; Spitze,69.50
     REDUCE output: Acme,65.00; Excelsior,11.00; Spitze,99.50
  10. Typical MySQL to Hadoop Use Case
      Diagram: Transaction Processing, Hadoop Cluster, Hive (Analytics).
      Questions raised: Initial load? Latency? App changes? Materialized views? Changes? App load?
  11. Options for Loading Data
      • Manual Loading (CSV Files)
      • Sqoop
      • Tungsten Replicator
  12. Comparing Methods in Detail

                                   Manual via CSV             Sqoop                         Tungsten Replicator
          Process                  Manual/Scripted            Manual/Scripted               Fully automated
          Incremental Loading      Possible with DDL changes  Requires DDL changes          Fully supported
          Latency                  Full-load                  Intermittent                  Real-time
          Extraction Requirements  Full table scan            Full and partial table scans  Low-impact binlog scan

  13. Replicating MySQL Data to Hadoop using Tungsten Replicator
  14. What is Tungsten Replicator?
      A real-time, high-performance, open source database replication engine
      GPLV2 license - 100% open source
      Download from https://code.google.com/p/tungsten-replicator/
      Annual support subscription available from Continuent
      “GoldenGate without the Price Tag”®
  15. Tungsten Replicator Overview
      • Master: the Replicator extracts transactions from the DBMS logs into the THL (transactions + metadata)
      • Slave: the Replicator reads the THL (transactions + metadata) and applies it
  16. Tungsten Replicator 3.0 & Hadoop
      • Extract from MySQL or Oracle
      • Base Hadoop support
      • Platforms: Cloudera, HortonWorks, MapR, Amazon EMR, IBM InfoSphere BigInsights
      • Provision using Sqoop or parallel extraction
      • Automatic replication of incremental changes
      • Transformation to preferred HDFS formats
      • Schema generation for Hive
      • Tools for generating materialized views
  17. Hadoop Support (columns: Hadoop, Hadoop-BaseFS)
      • Apache Hadoop: Yes, Yes
      • Cloudera: Yes (Certified), Yes (Certified)
      • MapR: Yes
      • HortonWorks: Yes (Awaiting Certification)
      • IBM InfoSphere BigInsights: Yes
      • Amazon EMR: Yes
  18. Basic MySQL to Hadoop Replication
      Components: MySQL (binlog_format=row), MySQL Binlog, Tungsten Master Replicator (hadoop), Tungsten Slave Replicator (hadoop), CSV Files, Hadoop Cluster
      Master-Side Filtering:
      * pkey - Fill in pkey info
      * colnames - Fill in names
      * cdc - Add update type and schema/table info
      * source - Add source DBMS
      * replicate - Subset tables to be replicated
      Steps: Extract from MySQL binlog; Load raw CSV to HDFS (e.g., via LOAD DATA to Hive); Access via Hive
  19. Hadoop Data Loading - Gory Details
      • Replicator (hadoop) receives transactions from master
      • Write data to CSV: CSV Files
      • Load using hadoop command (Javascript load script, e.g. hadoop.js): Staging “Tables”
      • (Run Map/Reduce): Base Tables, Materialized Views
      • (Generate Table Definitions) for the staging tables and for the base tables
  20. Demo #1: Replicating sysbench data
  21. Viewing MySQL Data in Hadoop
  22. Generating Staging Table Schema

          $ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
              -user tungsten -pass secret \
              -url jdbc:mysql:thin://logos1:3306/db01 -db db01
          ...
          DROP TABLE IF EXISTS db01.stage_xxx_sbtest;

          CREATE EXTERNAL TABLE db01.stage_xxx_sbtest
          (
            tungsten_opcode STRING ,
            tungsten_seqno INT ,
            tungsten_row_id INT ,
            id INT ,
            k INT ,
            c STRING ,
            pad STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
          LINES TERMINATED BY '\n'
          STORED AS TEXTFILE LOCATION '/user/tungsten/staging/db01/sbtest';

  23. Generating Base Table Schema

          $ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
              -pass secret -url jdbc:mysql:thin://logos1:3306/db01 -db db01
          ...
          DROP TABLE IF EXISTS db01.sbtest;

          CREATE TABLE db01.sbtest
          (
            id INT ,
            k INT ,
            c STRING ,
            pad STRING )
          ;

  24. Creating a Materialized View in Theory
      • Input: Log #1, Log #2, ... Log #N
      • MAP: Sort by key(s), transaction order
      • REDUCE: Emit last row per key if not a delete
  25. Creating a Materialized View in Hive
      Slide labels: MAP (the DISTRIBUTE BY / SORT BY subquery), REDUCE (the TRANSFORM through tungsten-reduce)

          $ hive
          ...
          hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
          hive> FROM (
                  SELECT sbx.*
                  FROM db01.stage_xxx_sbtest sbx
                  DISTRIBUTE BY id
                  SORT BY id,tungsten_seqno,tungsten_row_id
                ) map1
                INSERT OVERWRITE TABLE db01.sbtest
                SELECT TRANSFORM(
                  tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad)
                USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad'
                AS id INT,k INT,c STRING,pad STRING;
          ...

  26. Comparing MySQL and Hadoop Data

          $ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
          ...
          $ /opt/continuent/tungsten/bristlecone/bin/dc \
              -url1 jdbc:mysql:thin://logos1:3306/db01 \
              -user1 tungsten -password1 secret \
              -url2 jdbc:hive2://localhost:10000 \
              -user2 'tungsten' -password2 'secret' -schema db01 \
              -table sbtest -verbose -keys id \
              -driver org.apache.hive.jdbc.HiveDriver
          22:33:08,093 INFO DC - Data comparison utility
          ...
          22:33:24,526 INFO Tables compare OK

  27. Doing it all at once

          $ git clone https://github.com/continuent/continuent-tools-hadoop.git

          $ cd continuent-tools-hadoop

          $ bin/load-reduce-check \
              -U jdbc:mysql:thin://logos1:3306/db01 \
              -s db01 --verbose

  28. Demo #2: Constructing and Checking a Materialized View
  29. Scaling It Up!
  30. MySQL to Hadoop Fan-In Architecture
      • Masters: Replicators for m1 (master), m2 (master), m3 (master), each sending RBR (row-based replication) events
      • Slaves: Replicator with m1 (slave), m2 (slave), m3 (slave)
      • Target: Hadoop Cluster (many nodes)
  31. Integration with Provisioning
      Components: MySQL (binlog_format=row), MySQL Binlog, Tungsten Master (hadoop), Tungsten Slave (hadoop), CSV Files, Hadoop Cluster
      Sqoop/ETL (Initial provisioning run); Access via Hive
  32. On-Demand Provisioning via Parallel Extract
      Components: MySQL (binlog_format=row), MySQL Binlog, Tungsten Master Replicator (hadoop), Tungsten Slave Replicator (hadoop), CSV Files, Hadoop Cluster
      Master-Side Filtering:
      * pkey - Fill in pkey info
      * colnames - Fill in names
      * cdc - Add update type and schema/table info
      * source - Add source DBMS
      * replicate - Subset tables to be replicated
      (other filters as needed)
      Steps: Extract from MySQL tables; Load raw CSV to HDFS (e.g., via LOAD DATA to Hive); Access via Hive
  33. Tungsten Replicator Roadmap
      • Parallel CSV file loading (supported)
      • Partition loaded data by commit time (supported)
      • Expanded Data format support (CSV, JSON)
      • Replication out of Hadoop
  34. Continuent Hadoop Tools Roadmap
      • HBase Data Support & Materialization
      • Impala Data Support & Materialization
      • Integration with emerging real-time analytics (e.g. Storm, Spark, Shark, Stinger, …)
      • Point-in Time Table Generation
      • Time-Series Generation
      • Rolling and Managed Materialization
      • Replicator driven data manipulation (e.g. denormalisation, combining, …)
  35. Getting Started with Continuent Tungsten
  36. Where Is Everything?
      • Tungsten Replicator 3.0 builds are now available on code.google.com: http://code.google.com/p/tungsten-replicator/
      • Replicator 3.0 documentation is available on Continuent website: http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
      • Tungsten Hadoop tools are available on GitHub: https://github.com/continuent/continuent-tools-hadoop
      Contact Continuent for support
  37. Commercial Terms
      • Replicator features are open source (GPL V2)
      • Investment Elements
        • POC / Development (Walk Away Option)
        • Production Deployment
        • Annual Support Subscription
      • Governing Principles
        • Annual Subscription Required
        • More Upfront Investment -> Less Annual Subscription
  38. We Do Clustering Too!
      Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:
      • 24x7 data access
      • Scaling of load on replicas
      • Simple management commands
      ...without app changes or data migration
      Diagram labels: Amazon US West, apache/php, GonzoPortal.com, Connector, Connector
  39. In Conclusion: Tungsten Offers...
      • Fully automated, real-time replication from MySQL into Hadoop
      • Support for automatic transformation to HDFS data formats and creation of full materialized views
      • Positions users to take advantage of evolving real-time features in Hadoop
  40. Contact
      Continuent Web Page: http://www.continuent.com
      Tungsten Replicator: http://code.google.com/p/tungsten-replicator
      Our Blogs: http://scale-out-blog.blogspot.com, http://mcslp.wordpress.com, http://www.continuent.com/news/blogs
      560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
      Tel +1 (866) 998-3642, Fax +1 (408) 668-1009
      e-mail: sales@continuent.com
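
The arithmetic on the Map/Reduce slide (slide 9) can be reproduced with a minimal Python sketch. This is not Hadoop code: the names `map_phase` and `reduce_phase` are hypothetical, and summing per input inside the map step is an assumption made only so that the intermediate values match the ones shown on the slide.

```python
from collections import defaultdict

# Records copied from the Map/Reduce slide: company,year,amount
INPUT_1 = [
    "Acme,2013,4.75", "Spitze,2013,25.00", "Acme,2013,55.25",
    "Excelsior,2013,1.00", "Spitze,2013,5.00",
]
INPUT_2 = [
    "Spitze,2014,60.00", "Spitze,2014,9.50", "Acme,2014,1.00",
    "Acme,2014,4.00", "Excelsior,2014,1.00", "Excelsior,2014,9.00",
]


def map_phase(lines):
    """Hypothetical map step: total the amounts per company for one input."""
    totals = defaultdict(float)
    for line in lines:
        company, _year, amount = line.split(",")
        totals[company] += float(amount)
    return dict(totals)


def reduce_phase(partials):
    """Hypothetical reduce step: merge the per-input totals by company."""
    totals = defaultdict(float)
    for partial in partials:
        for company, amount in partial.items():
            totals[company] += amount
    return dict(totals)


partials = [map_phase(INPUT_1), map_phase(INPUT_2)]
result = reduce_phase(partials)
for company in sorted(result):
    print(f"{company},{result[company]:.2f}")
```

Each map call works on its own input independently, which is what lets Hadoop run the map tasks in parallel before a reducer combines the partial results.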
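
The rule on the "Creating a Materialized View in Theory" slide (sort by key and transaction order, then emit the last row per key unless it is a delete) can be sketched in Python as well. The sample rows below are invented, the function name `materialize` is hypothetical, and the opcode values "I" (insert) and "D" (delete) are assumptions; this illustrates the idea only and is not the tungsten-reduce script.

```python
from itertools import groupby

# Invented staging rows in the column order of the staging table:
# (tungsten_opcode, tungsten_seqno, tungsten_row_id, id, k, c, pad)
# Assumption: "I" marks an inserted row image, "D" marks a delete.
STAGING = [
    ("I", 1, 1, 1, 10, "a", "x"),
    ("I", 2, 1, 2, 20, "b", "y"),
    ("D", 3, 1, 1, 10, "a", "x"),   # an update to id=1 arrives as delete...
    ("I", 3, 2, 1, 11, "a2", "x"),  # ...followed by insert of the new image
    ("D", 4, 1, 2, 20, "b", "y"),   # id=2 is deleted and never re-inserted
]


def materialize(rows):
    """Return the current base-table rows (id, k, c, pad) from staging rows."""
    # MAP side: order by key, then by transaction order (seqno, row_id).
    ordered = sorted(rows, key=lambda r: (r[3], r[1], r[2]))
    view = []
    # REDUCE side: keep the last row per key unless that row is a delete.
    for _key, group in groupby(ordered, key=lambda r: r[3]):
        last = list(group)[-1]
        if last[0] != "D":
            view.append(last[3:])
    return view


print(materialize(STAGING))
```

The sort key mirrors the `SORT BY id,tungsten_seqno,tungsten_row_id` clause in the Hive query on slide 25, so the last row in each group is the most recent change for that key.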