Keynote: Getting Serious about MySQL and Hadoop at Continuent

Lean, mean MySQL and hulking Hadoop clusters may seem like an odd couple, but tying them together is now priority #1 for many MySQL users. This keynote talk, given on the first day of the Percona Live MySQL Conference & Expo 2014, explores the data management trends spurring integration, how the MySQL community is stepping up, and where the integration may go in the future. Robert Hodges, CEO at Continuent, outlines how work at Continuent fits into this picture and how we are contributing to the MySQL community response to Hadoop.

Published in: Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
486
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
11
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Keynote: Getting Serious about MySQL and Hadoop at Continuent

  1. Getting Serious about MySQL and Hadoop at Continuent. Robert Hodges, CEO. ©Continuent 2014
  2. Why should MySQL users care about Hadoop?
  3. What is a Hadoop? Hadoop Distributed File System (HDFS), MapReduce, Spark, Hive, Storm, Pig, Shark, Mahout, HBase, Oozie, Avro, HCatalog, Scalding, Stinger, Impala, Sqoop, Ambari, Cassandra, Zookeeper
  4. With this much funding it must be good (sources: ZDNet, jaxenter.com, forbes.com, 451 Group)
  5. Hadoop analyzes any type of data: server logs, social media feeds, geolocation data, clickstreams, sensor readings, business transactions, analytic reports
  6. Hadoop data loading is simple

         mysql> select * into
             -> outfile '/tmp/sakila.rental.csv'
             -> fields terminated by ','
             -> lines terminated by '\n'
             -> from sakila.rental;
         Query OK, 16044 rows affected (0.03 sec)

         mysql> quit
         Bye
         $ hadoop fs -put /tmp/sakila.rental.csv
  7. Hadoop exploits downward cost of storing and processing data. [Chart: Disk Storage, Average Cost Per Gigabyte, 1990 to 2014, log scale from $0.01 to $10,000.00] (Source: John McCallum, http://www.jcmit.com)
  8. Hadoop is shifting from batch to real-time analytics. [Chart: Cycle time for different iterative algorithms (Page Rank, K-Means Clustering, Logistic Regression), Core Hadoop vs. Spark; data labels: 0.96, 4.1, 14, 110, 155, 80] (Source: Pat McDonough, http://spark-summit.org/2013)
  9. Hadoop is becoming the way that users "see" data
  10. What does it mean to integrate with Hadoop?
  11. Three integration problems: (1) Continuous, high-performance loading; (2) Meaningful analytics on Hadoop; (3) Optimized operation for large-scale deployment
  12. Thesis: Snapshots. [Diagram: Dump/load] Data volumes? System load? Latency? Change history?
  13. MySQL does not do it that way... [Diagram: Binlog, Replication]
  14. Antithesis: Real-time replication. [Diagram: Binlog, Replication] Raw files? Overwrite/append?
  15. Synthesis: Snapshots + real-time replication. [Diagram labels: Dump/load, Binlog, Replication, Buffered Transactions, CSV Files]
  16. We can implement that! [Diagram labels: MySQL, binlog_format=row, MySQL Binlog, Tungsten 3.0 Master, hadoop, Tungsten 3.0 Slave, hadoop, Apache Sqoop/ETL, Buffered CSV, CSV Files] Callouts: Fast data filtering; Programmable load scripts; Parallel apply; Parallel table dumps; Low impact replication from the binlog
  17. How do you like your data? (Your data stored in MySQL)

          +---------+--------------------+-------------+--------+
          | film_id | title              | rental_rate | length |
          +---------+--------------------+-------------+--------+
          |     556 | MALTESE HOPE       |        4.99 |    127 |
          |     557 | MANCHURIAN CURTAIN |        2.99 |    177 |
          |     558 | MANNEQUIN WORST    |        2.99 |     71 |
          |     559 | MARRIED GO         |        2.99 |    114 |
          +---------+--------------------+-------------+--------+
  18. Does it really look better like this? (Your data stored in Hadoop)

          556,MALTESE HOPE,4.99,127\n
          557,MANCHURIAN CURTAIN,3.99,177\n
          558,MANNEQUIN WORST,2.99,71\n
          559,MARRIED GO,2.99,114\n

      Callouts: field separator, file partitioning, record separator, compression, type conversions
  19. Or this?

          (INSERT)
          I,57,556,2014-03-27 21:04:24.000,556,MALTESE HOPE,4.99,127\n

          (UPDATE)
          D,57,557,2014-03-27 21:04:24.000,557,\N,\N,\N\n
          I,57,558,2014-03-27 21:04:24.000,557,MANCHURIAN CURTAIN,2.99,177\n

          (DELETE)
          D,57,559,2014-03-27 21:04:24.000,558,\N,\N,\N\n
  20. One more thing to replicate... [Diagram labels: Dump/load, Replication, CSV Files, Buffered Transactions, Binlog] Table metadata
  21. A more civilized view of data (Your data viewed through Hive)

          556  MALTESE HOPE        4.99  127
          557  MANCHURIAN CURTAIN  3.99  177
          558  MANNEQUIN WORST     2.99   71
          559  MARRIED GO          2.99  114
  22. Are we done yet? [Diagram: Transaction logs, Snapshot, ????]
  23. Introducing a useful MapReduce trick... Inputs: Transaction logs, Snapshot. MAP: UNION ALL. SHUFFLE: Sort by key(s), transaction order. REDUCE: Emit last row per key if not a delete. Output: Materialized view including all updates
  24. ...With some amazing properties. [Diagram labels: Apache Sqoop, Tungsten Replication, Buffered CSV Files, CSV Files, Table metadata] No replication failures due to consistency; Reconstruct consistent views at will; No locks; No transactions; No need to pause processing; Reprovision any table at will
  25. We can implement that too!! Continuent Hadoop Tools (https://github.com/continuent/continuent-tools-hadoop): Schema creation; Materialized view generation; Data comparison; Apache 2.0 licensing
  26. Optimizing large scale deployments. [Diagram labels: Replicator; m1 (slave), m2 (slave), m3 (slave); Replicator; m1 (master), m2 (master), m3 (master); Replicator; Replicator; RBR, RBR, RBR]
  27. Where we want to be. [Diagram labels: Single path loading, Binlog, Buffered Transactions, CSV Files]
  28. Where we want to be. [Diagram labels: Single path loading, Binlog, Buffered Transactions, CSV Files]
  29. Tungsten 3.0 Roadmap for Hadoop (Q1 2014, Q2 2014). Features: Parallel extractor; Polished MapReduce tools; Improved schema change handling; Binary data conversion; HortonWorks 2.0. Features: Scripted load; Better block commit; Hive CSV format; Hive DDL generation; Partitioned files; Auto-recovery; Parallel batch apply; Sqoop integration; Cloudera 4.x/5.0
  30. How can we prepare for Hadoop integration?
  31. Users can prepare... Use Unicode/UTF8; Standardize on UTC for time; Enable row replication; Cluster your data in a way that supports restarts
  32. MySQL can prepare... By being MySQL
  33. The MySQL community can prepare... Fast heterogeneous replication and loading; Innovative projects to make relational data easy to consume on Hadoop; Competing solutions that improve life for users
  34. Conclusion: Hadoop is for real and the MySQL community needs to adapt; The challenge is to move data to Hadoop and make it easy to integrate into analytics; MySQL can be *the* preferred RDBMS to use with Hadoop
  35. Thanks to our many customers
  36. Wed 2:20pm Ballroom B - Hadoop for MySQL People. Thurs 1pm Ballroom D - From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication. We're Hiring! http://www.continuent.com
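
The MapReduce trick on slide 23 (UNION ALL the transaction logs with the snapshot, sort by key and transaction order, emit the last row per key unless it is a delete) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Continuent tooling: the function name `materialize`, the tuple layouts, and the reading of the leading change-log fields on slide 19 as opcode, sequence number, and row id are all hypothetical. The sample rows are taken from slides 18 and 19.

```python
from itertools import groupby
from operator import itemgetter


def materialize(snapshot_rows, change_rows):
    """Merge a snapshot with a change log into a current view.

    snapshot_rows: iterable of (key, columns)
    change_rows:   iterable of (opcode, seqno, row_id, key, columns),
                   where opcode is "I" (insert) or "D" (delete).
    These layouts are assumptions made for this sketch.
    """
    # MAP: UNION ALL. Snapshot rows are treated as inserts that
    # precede every logged change (order (0, 0)).
    tagged = [(key, (0, 0), "I", cols) for key, cols in snapshot_rows]
    tagged += [
        (key, (seqno, row_id), op, cols)
        for op, seqno, row_id, key, cols in change_rows
    ]

    # SHUFFLE: sort by key, then by transaction order.
    tagged.sort(key=itemgetter(0, 1))

    # REDUCE: emit the last row per key if it is not a delete.
    view = {}
    for key, group in groupby(tagged, key=itemgetter(0)):
        _, _, op, cols = list(group)[-1]
        if op != "D":
            view[key] = cols
    return view


# Snapshot as shown on slide 18 (film_id is the key).
snapshot = [
    (556, ("MALTESE HOPE", "4.99", "127")),
    (557, ("MANCHURIAN CURTAIN", "3.99", "177")),
    (558, ("MANNEQUIN WORST", "2.99", "71")),
    (559, ("MARRIED GO", "2.99", "114")),
]

# Change log as shown on slide 19: an update is a delete plus an insert.
changes = [
    ("I", 57, 556, 556, ("MALTESE HOPE", "4.99", "127")),
    ("D", 57, 557, 557, None),
    ("I", 57, 558, 557, ("MANCHURIAN CURTAIN", "2.99", "177")),
    ("D", 57, 559, 558, None),
]

if __name__ == "__main__":
    for film_id, cols in sorted(materialize(snapshot, changes).items()):
        print(film_id, *cols)
```

Because the reduce step only looks at the last row per key, replaying the same log twice or reloading a snapshot underneath it yields the same view, which is the property slide 24 relies on for reprovisioning tables without locks.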