
Real-Time Data Loading from MySQL to Hadoop


Hadoop is an increasingly popular means of analyzing transaction data from MySQL. Until now, mechanisms for moving data between MySQL and Hadoop have been rather limited. Continuent Tungsten Replicator provides enterprise-quality replication from MySQL to Hadoop under a GPL V2 license. Continuent Tungsten handles MySQL transaction types including INSERT/UPDATE/DELETE operations and can materialize binlogs as well as maintain mirror-image data copies in Hadoop. Continuent Tungsten also has the high performance necessary to load data from busy MySQL source systems into Hadoop clusters with minimal load on the source systems as well as on Hadoop itself.

This webinar covers the following topics:

- How Hadoop works and why it's useful for processing transaction data from MySQL
- Setting up Continuent Tungsten replication from MySQL to Hadoop
- Transforming MySQL data within Hadoop to enable efficient analytics
- Tuning replication to maximize performance

You do not need to be an expert in Hadoop or MySQL to benefit from this webinar. By the end listeners will have enough background knowledge to start setting up replication between MySQL and Hadoop using Continuent Tungsten. The software we are discussing is 100% open source and available from the Tungsten Replicator website at code.google.com.


Transcript

  • 1. Real-Time Loading from MySQL to Hadoop Featuring Continuent Tungsten Robert Hodges, CEO ©Continuent 2014
  • 2. Introducing Continuent ©Continuent 2014 2
  • 3. Introducing Continuent • The leading provider of clustering and replication for open source DBMS • Our Product: Continuent Tungsten • Clustering - Commercial-grade HA, performance scaling and data management for MySQL • Replication - Flexible, high-performance data movement ©Continuent 2014 3
  • 4. Quick Continuent Facts • Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data • Tungsten Replicator was application of the year at the 2011 MySQL User Conference • Wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now • MySQL to Hadoop deployments are now in progress with multiple customers ©Continuent 2014 4
  • 5. Selected Continuent Customers ©Continuent 2014 5
  • 6. Five Minute Hadoop Introduction ©Continuent 2014 6
  • 7. What Is Hadoop, Exactly? a. A distributed file system b. A method of processing massive quantities of data in parallel c. The Cutting family’s stuffed elephant d. All of the above ©Continuent 2014 7
  • 8. Hadoop Distributed File System (diagram: clients — the hadoop command, Java clients, Hive, Pig — find files via the NameNode, which holds the directory, then read blocks from DataNodes holding replicated data) ©Continuent 2014 8
  • 9. Map/Reduce (diagram: two input splits of company,year,amount records — Acme,2013,4.75; Spitze,2013,25.00; Acme,2013,55.25; Excelsior,2013,1.00; Spitze,2013,5.00 and Spitze,2014,60.00; Spitze,2014,9.50; Acme,2014,1.00; Acme,2014,4.00; Excelsior,2014,1.00; Excelsior,2014,9.00 — are MAPped to per-split sums Acme,60.00; Excelsior,1.00; Spitze,30.00 and Acme,5.00; Excelsior,10.00; Spitze,69.50, then REDUCEd to grand totals Acme,65.00; Excelsior,11.00; Spitze,99.50) ©Continuent 2014 9
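The flow on this slide — per-split maps producing partial sums, then a reduce merging them into grand totals — can be sketched in Python using the slide's own records (a conceptual simulation, not Hadoop code):

```python
from collections import defaultdict

# Input splits, as on the slide: "company,year,amount" records per year.
split_2013 = ["Acme,2013,4.75", "Spitze,2013,25.00", "Acme,2013,55.25",
              "Excelsior,2013,1.00", "Spitze,2013,5.00"]
split_2014 = ["Spitze,2014,60.00", "Spitze,2014,9.50", "Acme,2014,1.00",
              "Acme,2014,4.00", "Excelsior,2014,1.00", "Excelsior,2014,9.00"]

def map_phase(records):
    """Emit (company, amount) pairs and sum them locally per split."""
    totals = defaultdict(float)
    for rec in records:
        company, _year, amount = rec.split(",")
        totals[company] += float(amount)
    return dict(totals)

def reduce_phase(partials):
    """Merge the per-split partial sums into grand totals."""
    totals = defaultdict(float)
    for partial in partials:
        for company, amount in partial.items():
            totals[company] += amount
    return dict(totals)

grand = reduce_phase([map_phase(split_2013), map_phase(split_2014)])
# Matches the slide: Acme 65.00, Excelsior 11.00, Spitze 99.50
```

In real Hadoop the map and reduce tasks run in parallel across DataNodes; this sketch only shows the data movement.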
  • 10. Typical MySQL to Hadoop Use Case (diagram: a transaction processing system on MySQL feeds a Hadoop cluster queried through Hive for analytics; open questions: initial load? changes? materialized views? app changes? app load? latency?) ©Continuent 2014 10
  • 11. Options for Loading Data (diagram: manual loading via CSV files, Sqoop, or Tungsten Replicator) ©Continuent 2014 11
  • 12. Comparing Methods in Detail
    |                         | Manual via CSV            | Sqoop                       | Tungsten Replicator    |
    | Process                 | Manual/scripted           | Manual/scripted             | Fully automated        |
    | Incremental loading     | Possible with DDL changes | Requires DDL changes        | Fully supported        |
    | Latency                 | Full-load                 | Intermittent                | Real-time              |
    | Extraction requirements | Full table scan           | Full and partial table scans | Low-impact binlog scan |
    ©Continuent 2014 12
  • 13. Replicating MySQL Data to Hadoop using Tungsten Replicator ©Continuent 2014 13
  • 14. What is Tungsten Replicator? A real-time, high-performance, open source database replication engine. GPL V2 license - 100% open source. Download from https://code.google.com/p/tungsten-replicator/ Annual support subscription available from Continuent. “Golden Gate® without the Price Tag” ©Continuent 2014 14
  • 15. Tungsten Replicator Overview (diagram: the master replicator extracts transactions from the DBMS logs into its THL — transactions + metadata; the slave replicator pulls the THL and applies the transactions to the target) ©Continuent 2014 15
  • 16. Tungsten Replicator 3.0 & Hadoop
    • Extract from MySQL or Oracle
    • Base Hadoop plus commercial distributions: Cloudera and HortonWorks
    • Automatic replication of incremental changes
    • Transformation to preferred HDFS formats
    • Schema generation for Hive
    • Tools for generating materialized views
    • Provision using Sqoop or parallel extraction
    ©Continuent 2014 16
  • 17. Basic MySQL to Hadoop Replication
    (diagram: MySQL with binlog_format=row → the Tungsten master replicator extracts from the MySQL binlog and applies master-side filtering → CSV files → the Tungsten slave replicator loads raw CSV to HDFS, e.g. via LOAD DATA to Hive → Hadoop cluster, with access via Hive)
    Master-side filtering:
    * pkey - fill in pkey info
    * colnames - fill in names
    * cdc - add update type and schema/table info
    * source - add source DBMS
    * replicate - subset tables to be replicated
    ©Continuent 2014 17
  • 18. Hadoop Data Loading - Gory Details
    (diagram: the replicator receives transactions from the master and writes data to CSV files; a Javascript load script, e.g. hadoop.js, loads the CSV into staging “tables” using the hadoop command; table definitions are generated for the staging and base tables; map/reduce jobs then build the base tables and materialized views)
    ©Continuent 2014 18
  • 19. Demo #1: Replicating sysbench data ©Continuent 2014 19
  • 20. Viewing MySQL Data in Hadoop ©Continuent 2014 20
  • 21. Generating Staging Table Schema
    $ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
        -user tungsten -pass secret \
        -url jdbc:mysql:thin://logos1:3306/db01 -db db01
    ...
    DROP TABLE IF EXISTS db01.stage_xxx_sbtest;
    CREATE EXTERNAL TABLE db01.stage_xxx_sbtest
    (
      tungsten_opcode STRING,
      tungsten_seqno INT,
      tungsten_row_id INT,
      id INT,
      k INT,
      c STRING,
      pad STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE LOCATION '/user/tungsten/staging/db01/sbtest';
    ©Continuent 2014 21
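To make the staging layout concrete, here is a small Python sketch of one staging row: three Tungsten bookkeeping columns (opcode, seqno, row_id) followed by the table's own columns, joined with Hive's default ^A (\001) field delimiter. The example values are made up for illustration:

```python
# Hive's default field delimiter for delimited text tables is ^A (\001).
FIELD_SEP = "\u0001"

def staging_row(opcode, seqno, row_id, columns):
    """Serialize one change row in the layout the staging table expects:
    tungsten_opcode, tungsten_seqno, tungsten_row_id, then the base columns."""
    fields = [opcode, str(seqno), str(row_id)] + [str(c) for c in columns]
    return FIELD_SEP.join(fields)

# Hypothetical insert against sbtest (id, k, c, pad):
line = staging_row("I", 42, 1, [101, 7, "c-value", "pad-value"])
fields = line.split(FIELD_SEP)
# fields[0] is the opcode "I"; fields[3] is the id column "101"
```

The opcode records the change type (insert/update/delete), and (tungsten_seqno, tungsten_row_id) preserve transaction order, which is what makes the later materialized-view compaction possible.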
  • 22. Generating Base Table Schema
    $ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
        -pass secret -url jdbc:mysql:thin://logos1:3306/db01 -db db01
    ...
    DROP TABLE IF EXISTS db01.sbtest;
    CREATE TABLE db01.sbtest
    (
      id INT,
      k INT,
      c STRING,
      pad STRING);
    ©Continuent 2014 22
  • 23. Creating a Materialized View in Theory Log #1 Log #2 ... Log #N MAP Sort by key(s), transaction order REDUCE Emit last row per key if not a delete ©Continuent 2014 23
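The rule above — sort by key and transaction order, then emit the last row per key unless it is a delete — can be sketched in Python. The rows below are hypothetical staging tuples of (opcode, seqno, row_id, id, k, c, pad):

```python
from itertools import groupby

# Change rows already sorted by id, tungsten_seqno, tungsten_row_id,
# as the Hive DISTRIBUTE BY / SORT BY step guarantees.
rows = [
    ("I", 10, 1, 1, 5, "a",  "p"),
    ("U", 12, 1, 1, 6, "a2", "p"),  # later update wins for id=1
    ("I", 10, 2, 2, 7, "b",  "p"),
    ("D", 15, 1, 2, 7, "b",  "p"),  # id=2 ends with a delete: emit nothing
    ("I", 11, 1, 3, 8, "c",  "p"),
]

def materialize(sorted_rows):
    """Keep the last change per key; drop the key if that change is a delete."""
    out = []
    for _key, group in groupby(sorted_rows, key=lambda r: r[3]):
        last = list(group)[-1]
        if last[0] != "D":
            out.append(last[3:])  # strip the tungsten_* bookkeeping columns
    return out

view = materialize(rows)
# → [(1, 6, 'a2', 'p'), (3, 8, 'c', 'p')]
```

This is the same compaction the perl tungsten-reduce script performs on the next slide, with Hive supplying the distribute/sort (map) phase.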
  • 24. Creating a Materialized View in Hive
    $ hive
    ...
    hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
    hive> FROM (
            SELECT sbx.*
            FROM db01.stage_xxx_sbtest sbx
            DISTRIBUTE BY id
            SORT BY id, tungsten_seqno, tungsten_row_id
          ) map1                                    -- MAP
          INSERT OVERWRITE TABLE db01.sbtest
          SELECT TRANSFORM(
            tungsten_opcode, tungsten_seqno, tungsten_row_id, id, k, c, pad)
          USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad'
          AS id INT, k INT, c STRING, pad STRING;   -- REDUCE
    ...
    ©Continuent 2014 24
  • 25. Comparing MySQL and Hadoop Data
    $ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
    ...
    $ /opt/continuent/tungsten/bristlecone/bin/dc \
        -url1 jdbc:mysql:thin://logos1:3306/db01 \
        -user1 tungsten -password1 secret \
        -url2 jdbc:hive2://localhost:10000 \
        -user2 'tungsten' -password2 'secret' -schema db01 \
        -table sbtest -verbose -keys id \
        -driver org.apache.hive.jdbc.HiveDriver
    22:33:08,093 INFO DC - Data comparison utility
    ...
    22:33:24,526 INFO Tables compare OK
    ©Continuent 2014 25
  • 26. Doing it all at once
    $ git clone https://github.com/continuent/continuent-tools-hadoop.git
    $ cd continuent-tools-hadoop
    $ bin/load-reduce-check \
        -U jdbc:mysql:thin://logos1:3306/db01 \
        -s db01 --verbose
    ©Continuent 2014 26
  • 27. Demo #2: Constructing and Checking a Materialized View ©Continuent 2014 27
  • 28. Scaling It Up! ©Continuent 2014 28
  • 29. MySQL to Hadoop Fan-In Architecture (diagram: masters m1, m2, m3 each feed a master replicator via row-based replication (RBR); slave replicators m1, m2, m3 fan the streams into a single Hadoop cluster of many nodes) ©Continuent 2014 29
  • 30. Integration with Provisioning (diagram: an initial provisioning run loads MySQL data into the Hadoop cluster via Sqoop/ETL; the Tungsten master and slave replicators then stream changes from the MySQL binlog — binlog_format=row — as CSV files into Hadoop; access via Hive) ©Continuent 2014 30
  • 31. On-Demand Provisioning via Parallel Extract
    (diagram: as in the basic topology — MySQL with binlog_format=row, Tungsten master and slave replicators, CSV files loaded to HDFS e.g. via LOAD DATA to Hive, access via Hive — but the master replicator extracts directly from MySQL tables rather than the binlog)
    Master-side filtering:
    * pkey - fill in pkey info
    * colnames - fill in names
    * cdc - add update type and schema/table info
    * source - add source DBMS
    * replicate - subset tables to be replicated
    (other filters as needed)
    ©Continuent 2014 31
  • 32. Tungsten Replicator Roadmap
    • Parallel CSV file loading
    • Partition loaded data by commit time
    • Data formats and tools to support additional Hadoop clients as well as HBase
    • Integration with emerging real-time analytics based on HDFS (Impala, Spark/Shark, Stinger, ...)
    • Replication out of Hadoop
    ©Continuent 2014 32
  • 33. Getting Started with Continuent Tungsten ©Continuent 2014 33
  • 34. Where Is Everything?
    • Tungsten Replicator 3.0 builds are now available on code.google.com http://code.google.com/p/tungsten-replicator/
    • Replicator 3.0 documentation is available on the Continuent website http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
    • Tungsten Hadoop tools are available on GitHub https://github.com/continuent/continuent-tools-hadoop
    Contact Continuent for support
    ©Continuent 2014 34
  • 35. Commercial Terms
    • Replicator features are open source (GPL V2)
    • Investment elements: POC / development (walk-away option), production deployment, annual support subscription
    • Governing principles: annual subscription required; more upfront investment -> less annual subscription
    ©Continuent 2014 35
  • 36. We Do Clustering Too! Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:
    • 24x7 data access
    • Scaling of load on replicas
    • Simple management commands
    ...without app changes or data migration
    (diagram: GonzoPortal.com apache/php clients connect through Connectors to a cluster in Amazon US West)
    ©Continuent 2014 36
  • 37. In Conclusion: Tungsten Offers... • Fully automated, real-time replication from MySQL into Hadoop • Support for automatic transformation to HDFS data formats and creation of full materialized views • Positions users to take advantage of evolving real-time features in Hadoop ©Continuent 2014 37
  • 38. 560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: sales@continuent.com Our Blogs: http://scale-out-blog.blogspot.com http://mcslp.wordpress.com http://www.continuent.com/news/blogs Continuent Web Page: http://www.continuent.com ! Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator ©Continuent 2014