MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS
This presentation from MySQL Connect gives a brief introduction to Big Data and the tooling used to gain insights into your data. It also introduces an experimental prototype of the MySQL Applier for Hadoop, which can be used to incorporate changes from MySQL into HDFS using the replication protocol.

Presentation Transcript

  • MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS
    Mats Kindahl, Neha Kumari, Shubhangi Garg
    2013-09-21
    Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
  • Safe Harbor Statement
    The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • Presentation Outline
    ● Why Big Data?
    ● Working with Big Data
    ● MySQL Applier for Hadoop
    ● Road map
  • Why Big Data?
    Traditional Approach:
    ● Reporting
    ● Predefined data
    ● Viewing history, past occurrences
    ● Using sales data, typically in a database
    Big Data:
    ● Analytics
    ● Data mining
    ● Predicting future trends
    ● Using all available data: sales, click stream, likes/tweets
  • Why Big Data?
    ● Web recommendations
    ● Sentiment analysis
    ● Marketing campaign analysis
    ● Customer churn modeling
    ● Fraud detection
    ● Research and development
    ● Risk modeling
    ● Machine learning
    90% with pilot projects at end of 2012 · Poor data costs 35% in annual revenues · 10% improvement in data usability drives $2bn in revenue
    Source: http://wikibon.org/blog/big-data-statistics/
  • Why Hadoop?
    ● Scales to thousands of nodes
    ● Combines data from multiple sources
    ● Handles unstructured data
    ● Run queries against all of the data
    ● Runs on commodity servers: easy to set up, affordable
    ● Fault-tolerant: file block replication, self-healing
    ● Map/Reduce: distributed processing model, good for large data sets
  • Example Use-Case: On-Line Retail
    [Diagram: browsing activity, preferences, brands “liked”, web logs, page views, comments, and customer purchase history feed into recommendations]
  • Big Data Lifecycle
    [Diagram: lifecycle stages Acquire → Organize → Analyze → Decide, with the Applier in the Acquire stage]
  • Hadoop Tools: In the Lifecycle
    [Diagram mapping tools to lifecycle stages: Apache Sqoop, MySQL Applier for Hadoop, Apache Flume, Apache Drill, Apache Hive, Apache Pig]
  • Hadoop Tools: Apache Sqoop
    ● Apache top-level project, part of the Hadoop project, developed by Cloudera
    ● Bulk data import and export between Hadoop HDFS and external data stores (a command-line sketch follows the diagram below)
    ● Supports a JDBC connector architecture with plug-ins for specific functionality
    ● “Fast-path” connector for MySQL
  • Hadoop Tools: Apache Sqoop
    [Diagram: several parallel Sqoop jobs moving data into a Hadoop cluster]
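    As a concrete illustration (not from the slides), a bulk import of one MySQL table into HDFS with Sqoop might look like the following sketch; the host, database, table, and target directory are placeholders, and --direct selects the MySQL “fast-path” connector mentioned above:

      # Bulk-import the MySQL table test.tbl into HDFS as delimited text files
      sqoop import \
        --connect jdbc:mysql://example.com/test \
        --username root \
        --table tbl \
        --target-dir /user/data/tbl \
        --direct   # MySQL fast-path connector (uses mysqldump under the hood)

    Each such job re-reads the full table, which is what motivates the incremental applier discussed later.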
  • Hadoop Tools: Apache Flume
    ● Apache top-level project, part of the Hadoop project
    ● Collecting log data from various sources: Avro, Thrift, Syslog, Netcat (a configuration sketch follows the diagram below)
    ● Can aggregate and consolidate data
    ● Data typically sent to HDFS; can store data in other “sinks” as well
  • Hadoop Tools: Apache Flume
    [Diagram: Source → Channel → Sink → HDFS]
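    As an illustration (not from the slides), a minimal Flume agent that forwards syslog traffic into HDFS could be configured as follows; the agent name, port, and HDFS path are placeholder assumptions:

      # agent.conf — minimal Flume agent: syslog TCP source -> memory channel -> HDFS sink
      a1.sources  = r1
      a1.channels = c1
      a1.sinks    = k1
      a1.sources.r1.type     = syslogtcp
      a1.sources.r1.host     = 0.0.0.0
      a1.sources.r1.port     = 5140
      a1.sources.r1.channels = c1
      a1.channels.c1.type = memory
      a1.sinks.k1.type      = hdfs
      a1.sinks.k1.hdfs.path = hdfs://example.com:9000/flume/events
      a1.sinks.k1.channel   = c1

    The agent is then started with:

      flume-ng agent --conf ./conf --conf-file agent.conf --name a1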
  • New Tool: MySQL Applier for Hadoop
    ● Proof of concept: replication from MySQL to HDFS
    ● Exploits the replication protocol: reads the server binary log and fetches changes from MySQL using the Binary Log API
    ● Uses row-based replication (caveat: DDL is not handled)
    ● Stores changes into HDFS, consumable by other tools
    ● Caveat: only row inserts for now; update/delete are being considered
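    Because the applier consumes row-based binary log events, the MySQL server feeding it must have row-based binary logging enabled. A minimal my.cnf sketch (the values are illustrative assumptions, not from the slides):

      # my.cnf on the MySQL server feeding the applier
      [mysqld]
      server-id     = 1          # required whenever the binary log feeds a replication client
      log-bin       = mysql-bin  # enable the binary log
      binlog-format = ROW        # the applier decodes row events, not statements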
  • New Tool: MySQL Applier for Hadoop
    [Diagram: binary log events flow from MySQL through the Binlog API into the applier, which decodes each row into timestamp, primary key, and data, and writes it to HDFS via libhdfs]
  • MySQL Applier for Hadoop: Requirements
    ● MySQL 5.6 or later: http://dev.mysql.com/downloads/mysql
    ● MySQL Applier for Hadoop: http://labs.mysql.com
    ● Apache Hadoop 1.0.4 or later: http://hadoop.apache.org/releases.html
    ● Apache Hive or another Hadoop tool for analysis: http://hive.apache.org/releases.html
  • MySQL Applier for Hadoop: Mapping Rows
    ● A timestamp column, taken from the binary log, is added as the first column of the table
    MySQL:
      INSERT INTO test.tbl VALUES
        (23456,'Sanjai','Feldhoffer'),
        (23457,'Manohar','Kakkar'),
        (23458,'Christ','Kalefeld'),
        (23459,'Gretta','Varker'),
        (23460,'Masato','Steinauer'),
        (23461,'Baruch','Uchoa');
    HDFS:
      1379361681,23456,Sanjai,Feldhoffer
      1379361685,23457,Manohar,Kakkar
      1379361692,23458,Christ,Kalefeld
      1379361693,23459,Gretta,Varker
      1379361699,23460,Masato,Steinauer
      1379361703,23461,Baruch,Uchoa
  • MySQL Applier for Hadoop: Using Hive
    ● The applier does not handle DDL, so create the Hive table manually to match the MySQL table, as below
    ● The applier's field and row delimiters can be controlled with --field-delimiter and --row-delimiter
    MySQL:
      CREATE TABLE tbl (
        user_id INT PRIMARY KEY,
        first CHAR(60),
        last CHAR(60)
      );
    Hive:
      CREATE TABLE tbl (
        ts INT,
        user_id INT,
        first STRING,
        last STRING
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
  • MySQL Applier for Hadoop
    ● Start the MySQL Applier for Hadoop:
      happlier --field-delimiter=, mysql://root@example.com hdfs://example.com:9000
    ● Inserts are written to files in the warehouse directory (default: /user/hive/warehouse)
    ● MySQL table test.tbl maps to HDFS file /user/hive/warehouse/test.db/tbl/datafile1.txt
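    To check what the applier wrote, the data file can be inspected directly in HDFS; a small sketch using the default warehouse path shown above:

      # List the table's directory and print the delimited rows the applier appended
      hadoop fs -ls /user/hive/warehouse/test.db/tbl
      hadoop fs -cat /user/hive/warehouse/test.db/tbl/datafile1.txt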
  • MySQL Applier for Hadoop: Update and Delete?
    ● Batch import using Sqoop transfers all data each time
    ● If changes are small, bandwidth is wasted
    [Diagram: Sqoop re-importing the full data set into the Hadoop rack]
  • MySQL Applier for Hadoop: Update and Delete?
    ● Batch import using Sqoop: all data is transferred each time; if changes are small, bandwidth is wasted
    ● Incremental import using the Applier: only changes are imported, so bandwidth is used efficiently
    ● … but what about updates and deletes?
    [Diagram: the Applier streaming only the changes into the Hadoop rack]
  • MySQL Applier for Hadoop: Update and Delete?
    ● Problem: HDFS is append-only; inserted rows are appended to a file, so how can rows be updated or deleted?
    ● Idea:
      ● Updated rows are also appended to the file
      ● Each row has a primary key and contains the after-image and the timestamp of the update
      ● For each primary key, pick the row with the latest timestamp
  • MySQL Applier for Hadoop: Update and Delete?
    ● Timestamped rows go to HDFS: after-images for updates, flagged rows for deletes
    ● Customized HiveQL queries pick the live version of each row; the slide sketches it as:
      SELECT … FROM tbl WHERE ts = MAX(ts) GROUP BY key
    [Diagram: the Applier appending timestamped rows into the Hadoop rack]
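    The query on the slide is a sketch rather than valid HiveQL (an aggregate such as MAX cannot appear in a WHERE clause). One runnable formulation, assuming the ts and user_id columns from the Hive table defined earlier, joins each row against the latest timestamp for its key:

      -- For each primary key, keep only the row with the newest timestamp
      SELECT t.*
      FROM tbl t
      JOIN (SELECT user_id, MAX(ts) AS max_ts
            FROM tbl
            GROUP BY user_id) latest
        ON t.user_id = latest.user_id
       AND t.ts = latest.max_ts;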
  • MySQL Applier for Hadoop: Update and Delete?
    ● Timestamped rows go to HDFS: after-images for updates, flagged rows for deletes
    ● A special “cleaning” job reads the dirty files and writes clean files
    ● Moving data inside the rack uses bandwidth efficiently
    [Diagram: the Applier writes dirty files; a cleaning job rewrites them as clean files within the Hadoop rack]
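    A sketch of what such a cleaning job could look like in HiveQL, assuming a hypothetical target table tbl_clean with the same schema and a hypothetical deleted flag column written by the applier for delete events:

      -- Rewrite the dirty data, keeping only the newest non-deleted version of each row
      INSERT OVERWRITE TABLE tbl_clean
      SELECT t.ts, t.user_id, t.first, t.last
      FROM tbl t
      JOIN (SELECT user_id, MAX(ts) AS max_ts
            FROM tbl
            GROUP BY user_id) latest
        ON t.user_id = latest.user_id
       AND t.ts = latest.max_ts
      WHERE t.deleted = 0;   -- the 'deleted' flag column is an assumption, not in the slides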
  • MySQL and Hadoop: Resources and Information
    ● MySQL and Hadoop: Guide to Big Data Integration
      http://www.mysql.com/why-mysql/white-papers/mysql-and-hadoop-guide-to-big-data-integration
    ● MySQL Applier for Hadoop
      http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html
    ● Developer Blogs
      ● Mats Kindahl: http://mysqlmusings.blogspot.com
      ● Shubhangi Garg: http://innovating-technology.blogspot.in
      ● Neha Kumari: http://nehakumari19.blogspot.in
  • Thank you!