
Real-Time Data Loading from MySQL to Hadoop with New Tungsten Replicator 3.0


Hadoop is an increasingly popular means of analyzing transaction data from single or multiple MySQL servers. Until now, mechanisms for moving data between MySQL and Hadoop have been rather limited. The new Continuent Tungsten Replicator 3.0 provides enterprise-quality replication from MySQL into Hadoop. Tungsten Replicator 3.0 is 100% open source, released under the GPL V2 license, and available for download at https://code.google.com/p/tungsten-replicator/.

Continuent Tungsten handles MySQL transaction types including INSERT, UPDATE, and DELETE operations, and can materialize binlogs as well as mirror-image data copies in Hadoop. It also has the high performance necessary to load data from busy source MySQL systems into Hadoop clusters with minimal load on the source systems, while retaining control over how the data is viewed and accessed within Hadoop.

We will discuss:
- How Hadoop works and why it's useful for processing transaction data from MySQL
- Setting up Continuent Tungsten replication from MySQL to Hadoop
- Understanding how Hadoop makes data analytics efficient and easy to execute
- Tuning replication to maximize performance


Transcript

  • 1. ©Continuent 2014 Real-Time Loading from MySQL to Hadoop Featuring Continuent Tungsten MC Brown, Director of Documentation
  • 2. ©Continuent 2014 2 Introducing Continuent
  • 3. ©Continuent 2014 Introducing Continuent 3 • The leading provider of clustering and replication for open source DBMS • Our Product: Continuent Tungsten • Clustering - Commercial-grade HA, performance scaling and data management for MySQL • Replication - Flexible, high-performance data movement
  • 4. ©Continuent 2014 Quick Continuent Facts • Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data • Tungsten Replicator was named Application of the Year at the 2011 MySQL User Conference • A wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now • MySQL to Hadoop deployments are now in progress with multiple customers 4
  • 5. ©Continuent 2014 Selected Continuent Customers 5 [Customer logos]
  • 6. ©Continuent 2014 6 Five Minute Hadoop Introduction
  • 7. ©Continuent 2014 What Is Hadoop, Exactly? 7 a. A distributed file system b. A method of processing massive quantities of data in parallel c. The Cutting family's stuffed elephant d. All of the above
  • 8. ©Continuent 2014 Hadoop Distributed File System 8 [Diagram: Java clients, Hive, Pig, and the hadoop command ask the NameNode (directory) to find a file, then read block(s) from the DataNodes (replicated data)]
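The find-file/read-blocks flow on this slide can be sketched as a toy in-memory model. All class and method names below are illustrative stand-ins, not the real HDFS API; only the division of labor (NameNode holds the directory, DataNodes hold replicated blocks) comes from the slide.

```javascript
// Toy model of the HDFS read path: the client asks the NameNode
// (directory service) where a file's blocks live, then fetches each
// block from one of the DataNodes holding a replica.
class NameNode {
  constructor() { this.directory = {}; }      // path -> [{blockId, replicas}]
  addFile(path, blocks) { this.directory[path] = blocks; }
  findFile(path) { return this.directory[path]; }
}

class DataNode {
  constructor(id) { this.id = id; this.blocks = {}; }
  store(blockId, data) { this.blocks[blockId] = data; }
  readBlock(blockId) { return this.blocks[blockId]; }
}

function readFile(nameNode, dataNodes, path) {
  // 1. "Find file": ask the NameNode for the block list.
  const blocks = nameNode.findFile(path);
  // 2. "Read block(s)": fetch each block from its first replica.
  return blocks.map(b => dataNodes[b.replicas[0]].readBlock(b.blockId)).join('');
}

// Wire up a 3-node cluster holding one two-block file, each block replicated twice.
const nn = new NameNode();
const dns = { d1: new DataNode('d1'), d2: new DataNode('d2'), d3: new DataNode('d3') };
dns.d1.store('blk_1', 'hello '); dns.d2.store('blk_1', 'hello ');
dns.d2.store('blk_2', 'hadoop'); dns.d3.store('blk_2', 'hadoop');
nn.addFile('/user/tungsten/demo.txt', [
  { blockId: 'blk_1', replicas: ['d1', 'd2'] },
  { blockId: 'blk_2', replicas: ['d2', 'd3'] },
]);
console.log(readFile(nn, dns, '/user/tungsten/demo.txt')); // -> "hello hadoop"
```

Because every block is replicated, losing any single DataNode here would still leave the file readable, which is the point of the replicated-data design.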
  • 9. ©Continuent 2014 Typical MySQL to Hadoop Use Case 9 [Diagram: transaction processing in MySQL feeding a Hadoop cluster running Hive analytics, raising the questions: Initial load? Latency? App changes? Materialized views? Changes? App load?]
  • 10. ©Continuent 2014 Traditional Hadoop Deployments • Data Analytics • Single databases • Collective databases • Databases and external information • Non-structured data • Long-term datastores and archiving 10 [Diagram: client and back-office data sources feeding Hadoop]
  • 11. ©Continuent 2014 Future Hadoop Deployments • Online Analytics • Real-time queries and caching • Fully heterogeneous deployments 11 [Diagram: client and back-office data sources feeding Hadoop]
  • 12. ©Continuent 2014 Options for Loading Data 12 [Diagram: manual loading of CSV files, Sqoop, and Tungsten Replicator]
  • 13. ©Continuent 2014 Comparing Methods in Detail 13
    Manual via CSV - Process: Manual/Scripted; Incremental Loading: Possible with DDL changes; Latency: Full-load; Extraction Requirements: Full table scan
    Sqoop - Process: Manual/Scripted; Incremental Loading: Requires DDL changes; Latency: Intermittent; Extraction Requirements: Full and partial table scans
    Tungsten Replicator - Process: Fully automated; Incremental Loading: Fully supported; Latency: Real-time; Extraction Requirements: Low-impact binlog scan
  • 14. ©Continuent 2014 14 Replicating MySQL Data to Hadoop using Tungsten Replicator
  • 15. ©Continuent 2014 What is Tungsten Replicator? 15 A real-time, high-performance, open source database replication engine • GPL V2 license - 100% open source • Download from https://code.google.com/p/tungsten-replicator/ • Annual support subscription available from Continuent • “Golden Gate without the Price Tag”®
  • 16. ©Continuent 2014 Tungsten Replicator Overview 16 [Diagram: a master-side replicator extracts transactions and metadata from the DBMS logs into the THL; the slave-side replicator reads the THL (transactions + metadata) and applies the transactions]
  • 17. ©Continuent 2014 Tungsten Replicator 3.0 & Hadoop 17 • Extract from MySQL or Oracle • Base Hadoop plus commercial distributions: Cloudera, HortonWorks, Amazon EMR, IBM • Provision using Sqoop or parallel extraction • Automatic replication of incremental changes • Transformation to preferred HDFS formats • Schema generation for Hive • Tools for generating materialized views
  • 18. ©Continuent 2014 Basic MySQL to Hadoop Replication 18 [Diagram: with binlog_format=row, the Tungsten master replicator extracts from the MySQL binlog and applies master-side filtering (pkey - fill in pkey info; colnames - fill in names; cdc - add update type and schema/table info; source - add source DBMS; replicate - subset tables to be replicated); the Tungsten slave replicator writes CSV files and loads the raw CSV into HDFS (e.g., via LOAD DATA to Hive) on the Hadoop cluster, where it is accessed via Hive]
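A minimal sketch of the CSV staging step above. The event shape and helper name are hypothetical; only the tungsten_opcode/tungsten_seqno/tungsten_row_id prefix columns and the ^A field delimiter come from the staging-table schema shown later in the deck (slide 23).

```javascript
// Sketch (assumed row-event shape): turn a row-change event into the
// delimited staging-file rows Hive reads, prefixed with the opcode,
// sequence number, and row-id columns used by the staging schema.
const DELIM = '\u0001'; // Hive's default ^A field delimiter

function toStagingRows(event) {
  // event: { seqno, op: 'I' | 'D', rows: [[col1, col2, ...], ...] }
  return event.rows.map((row, i) =>
    [event.op, event.seqno, i + 1, ...row].join(DELIM));
}

// One INSERT event carrying a single row of the sales table.
const insert = { seqno: 42, op: 'I', rows: [[101, 'mc', 'earth', 9.99]] };
console.log(toStagingRows(insert)[0].split(DELIM));
// -> [ 'I', '42', '1', '101', 'mc', 'earth', '9.99' ]
```

Keeping the opcode and ordering columns in the staged rows is what later lets a reduce step reconstruct the final state of each row.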
  • 19. ©Continuent 2014 Hadoop Data Loading - Gory Details 19 [Diagram: transactions from the master pass through the replicator's JavaScript load script (e.g. hadoop.js), which writes data to CSV files and loads them into staging "tables" using the hadoop command; MapReduce runs and generated table definitions then produce base tables and materialized views]
  • 20. ©Continuent 2014 20 Demo #1 ! Replicating data into Hadoop
  • 21. ©Continuent 2014 JavaScript Batch Loader • Simple, flexible batch loader • prepare() - when we go online • begin() - start of batch • apply() - write the events • commit() - commit the events • release() - when we go offline 21
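A runnable sketch of that callback lifecycle, using an in-memory stand-in for the real loader (the state object and CSV strings here are invented for illustration; per slide 19, a real script such as hadoop.js loads the CSV via the hadoop command inside the commit step).

```javascript
// Sketch of the batch-loader lifecycle: the replicator calls these
// functions in order as it moves batches of events into Hadoop.
const state = { online: false, batch: [], committed: [] };

function prepare() { state.online = true; }            // when we go online
function begin()   { state.batch = []; }               // start of batch
function apply(csvLine) { state.batch.push(csvLine); } // write the events
function commit()  {                                   // commit the events
  // A real script would push the CSV file into HDFS here.
  state.committed.push(...state.batch);
  state.batch = [];
}
function release() { state.online = false; }           // when we go offline

// One replication batch end-to-end:
prepare();
begin();
apply('I\u000142\u00011\u0001101\u0001mc');
apply('I\u000142\u00012\u0001102\u0001linas');
commit();
release();
console.log(state.committed.length); // -> 2
```

The split between apply() and commit() is what makes the loader batch-oriented: rows accumulate cheaply and the expensive HDFS write happens once per batch.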
  • 22. ©Continuent 2014 22 Viewing MySQL Data in Hadoop
  • 23. ©Continuent 2014 Generating Staging Table Schema 23
    $ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
        -user tungsten -pass secret \
        -url jdbc:mysql:thin://logos1:3306/sales -db sales
    ...
    DROP TABLE IF EXISTS sales.stage_xxx_sales;

    CREATE EXTERNAL TABLE sales.stage_xxx_sales
    (
      tungsten_opcode STRING,
      tungsten_seqno INT,
      tungsten_row_id INT,
      id INT,
      salesman STRING,
      planet STRING,
      value DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE LOCATION '/user/tungsten/staging/sales/sales';
  • 24. ©Continuent 2014 Generating Base Table Schema 24
    $ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
        -pass secret -url jdbc:mysql:thin://logos1:3306/sales -db sales
    ...
    DROP TABLE IF EXISTS sales.sales;

    CREATE TABLE sales.sales
    (
      id INT,
      salesman STRING,
      planet STRING,
      value DOUBLE
    );
  • 25. ©Continuent 2014 Creating a Materialized View in Theory 25 [Diagram: logs #1..#N flow into MAP (sort by key(s), transaction order) and then REDUCE (emit last row per key if not a delete)]
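The reduce rule ("emit last row per key if not a delete") can be sketched directly. The row shape is hypothetical; the sort order (sequence number, then row id) mirrors the SORT BY clause used in the Hive example a few slides later.

```javascript
// Reduce step for a materialized view: given all staged rows for one
// key, sort them into transaction order, keep only the last, and emit
// nothing if the last operation on that key was a delete.
function reduceLastRow(rows) {
  // rows: [{ op: 'I' | 'D', seqno, rowId, key, data }] for a single key
  const sorted = [...rows].sort(
    (a, b) => a.seqno - b.seqno || a.rowId - b.rowId);
  const last = sorted[sorted.length - 1];
  return last.op === 'D' ? null : last;
}

// A key that was inserted, deleted, then re-inserted with new data:
const history = [
  { op: 'I', seqno: 1, rowId: 1, key: 101, data: 'v1' },
  { op: 'D', seqno: 2, rowId: 1, key: 101, data: 'v1' },
  { op: 'I', seqno: 3, rowId: 1, key: 101, data: 'v2' },
];
console.log(reduceLastRow(history).data);        // -> "v2"
console.log(reduceLastRow(history.slice(0, 2))); // -> null (row was deleted)
```

Because every change is appended to the staging data, the same reducer can be re-run at any time to rebuild the view to the latest replicated state.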
  • 26. ©Continuent 2014 MapReduce 26
    Input logs:
      Acme,2013,4.75  Spitze,2013,25.00  Acme,2013,55.25  Excelsior,2013,1.00  Spitze,2013,5.00
      Spitze,2014,60.00  Spitze,2014,9.50  Acme,2014,1.00  Acme,2014,4.00  Excelsior,2014,1.00  Excelsior,2014,9.00
    MAP:
      Acme,(4.75,55.25)  Spitze,(25.00,5.00)  Excelsior,(1.00)
      Spitze,(60.00,9.50)  Acme,(1.00,4.00)  Excelsior,(1.00,9.00)
    REDUCE:
      Acme,65.00  Excelsior,11.00  Spitze,99.50
    Equivalent to: SELECT COMPANY, SUM(VALUE) FROM ... WHERE ... GROUP BY COMPANY
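The slide's numbers can be reproduced with a miniature map/reduce, with plain JavaScript standing in for the Hadoop framework: map groups values by company, reduce sums each group.

```javascript
// Miniature map/reduce over the slide's two input logs.
function mapPhase(lines) {
  // Group the numeric values by company (the map output on the slide).
  const groups = {};
  for (const line of lines) {
    const [company, , value] = line.split(',');
    (groups[company] = groups[company] || []).push(parseFloat(value));
  }
  return groups;
}

function reducePhase(groups) {
  // Sum each company's values (the reduce output on the slide).
  const totals = {};
  for (const [company, values] of Object.entries(groups))
    totals[company] = values.reduce((a, b) => a + b, 0);
  return totals;
}

const log1 = ['Acme,2013,4.75', 'Spitze,2013,25.00', 'Acme,2013,55.25',
              'Excelsior,2013,1.00', 'Spitze,2013,5.00'];
const log2 = ['Spitze,2014,60.00', 'Spitze,2014,9.50', 'Acme,2014,1.00',
              'Acme,2014,4.00', 'Excelsior,2014,1.00', 'Excelsior,2014,9.00'];
const totals = reducePhase(mapPhase([...log1, ...log2]));
console.log(totals); // -> { Acme: 65, Spitze: 99.5, Excelsior: 11 }
```

In real Hadoop the two logs would be mapped on different nodes and the framework, not the code, would handle grouping by key between the phases.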
  • 27. ©Continuent 2014 Creating a Materialized View in Hive 27
    $ hive
    ...
    hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
    hive> FROM (
        SELECT sales.*
        FROM sales.stage_xxx_sales sales
        DISTRIBUTE BY id
        SORT BY id,tungsten_seqno,tungsten_row_id
      ) map1
      INSERT OVERWRITE TABLE sales.sales
      SELECT TRANSFORM(
        tungsten_opcode,tungsten_seqno,tungsten_row_id,id,
        salesman,planet,value)
      USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,salesman,planet,value'
      AS id INT,salesman STRING,planet STRING,value DOUBLE;
    (The DISTRIBUTE BY/SORT BY subquery is the MAP step; the TRANSFORM ... USING tungsten-reduce clause is the REDUCE step.)
  • 28. ©Continuent 2014 Comparing MySQL and Hadoop Data 28
    $ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
    ...
    $ /opt/continuent/tungsten/bristlecone/bin/dc \
        -url1 jdbc:mysql:thin://logos1:3306/sales \
        -user1 tungsten -password1 secret \
        -url2 jdbc:hive2://localhost:10000 \
        -user2 'tungsten' -password2 'secret' -schema sales \
        -table sales -verbose -keys id \
        -driver org.apache.hive.jdbc.HiveDriver
    22:33:08,093 INFO DC - Data comparison utility
    ...
    22:33:24,526 INFO Tables compare OK
  • 29. ©Continuent 2014 Doing it all at once 29
    $ git clone https://github.com/continuent/continuent-tools-hadoop.git
    $ cd continuent-tools-hadoop
    $ bin/load-reduce-check \
        -U jdbc:mysql:thin://logos1:3306/sales \
        -s sales --verbose
  • 30. ©Continuent 2014 30 Demo #2 ! Constructing and Checking a Materialized View
  • 31. ©Continuent 2014 31 Scaling It Up!
  • 32. ©Continuent 2014 MySQL to Hadoop Fan-In Architecture 32 [Diagram: three MySQL masters (m1, m2, m3) feed row-based replication (RBR) through master-side Tungsten replicators into slave-side replicators, which fan in to a single Hadoop cluster (many nodes)]
  • 33. ©Continuent 2014 Integration with Provisioning 33 [Diagram: the same MySQL-to-Hadoop pipeline as slide 18 (binlog_format=row, Tungsten master and slave, CSV files into the Hadoop cluster, access via Hive), with Sqoop/ETL performing the initial provisioning run]
  • 34. ©Continuent 2014 On-Demand Provisioning via Parallel Extract 34 [Diagram: as in slide 18, with master-side filtering (pkey, colnames, cdc, source, replicate, and other filters as needed), but the Tungsten master replicator extracts directly from the MySQL tables rather than the binlog; the slave replicator writes CSV files loaded into the Hadoop cluster and accessed via Hive]
  • 35. ©Continuent 2014 Tungsten Replicator Roadmap 35 • Parallel CSV file loading • Partition loaded data by commit time • Data formats and tools to support additional Hadoop clients as well as HBase • Replication out of Hadoop • Integration with emerging real-time analytics based on HDFS (Impala, Spark/Shark, Stinger,...)
  • 36. ©Continuent 2014 36 Getting Started with Continuent Tungsten
  • 37. ©Continuent 2014 Where Is Everything? 37 • Tungsten Replicator 3.0 builds are now available on code.google.com http://code.google.com/p/tungsten-replicator/ • Replicator 3.0 documentation is available on the Continuent website http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html • Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop • Contact Continuent for support
  • 38. ©Continuent 2014 Commercial Terms • Replicator features are open source (GPL V2) • Investment Elements • POC / Development (Walk Away Option) • Production Deployment • Annual Support Subscription • Governing Principles • Annual Subscription Required • More Upfront Investment -> Less Annual Subscription 38
  • 39. ©Continuent 2014 We Do Clustering Too! 39 Tungsten clusters combine off-the-shelf open source MySQL servers into data services with: • 24x7 data access • Scaling of load on replicas • Simple management commands ...without app changes or data migration [Diagram: GonzoPortal.com Apache/PHP clients in Amazon US West connecting through Tungsten Connectors]
  • 40. ©Continuent 2014 In Conclusion: Tungsten Offers... • Fully automated, real-time replication from MySQL into Hadoop • Support for automatic transformation to HDFS data formats and creation of full materialized views • Positions users to take advantage of evolving real-time features in Hadoop 40
  • 41. ©Continuent 2014 Continuent Web Page: http://www.continuent.com • Tungsten Replicator 3.0: http://code.google.com/p/tungsten-replicator • Our Blogs: http://scale-out-blog.blogspot.com http://mcslp.wordpress.com http://www.continuent.com/news/blogs • 560 S. Winchester Blvd., Suite 500, San Jose, CA 95128 • Tel +1 (866) 998-3642 • Fax +1 (408) 668-1009 • e-mail: sales@continuent.com