Hadoop and mysql by Chris Schneider

MySQL and Hadoop
MySQL SF Meetup 2012
Chris Schneider

About Me
 Chris Schneider, Data Architect @ Ning.com (a
Glam Media Company)

 Spent the last ~2 years working with Hadoop
(CDH)

 Spent the last 10 years building MySQL
architecture for multiple companies

 chriss@glam.com

What we'll cover
 Hadoop

 CDH

 Use cases for Hadoop

 Map Reduce

 Sqoop

 Hive

 Impala

What is Hadoop?
 An open-source framework for storing and
processing data on a cluster of servers

 Based on Google's white papers on the Google
File System (GFS) and MapReduce

 Scales linearly

 Designed for batch processing

 Optimized for streaming reads

The Hadoop Distribution
 Cloudera
 One of several distributions of Apache Hadoop (CDH)

 What Cloudera Does
 Cloudera Manager
 Enterprise Training
 Hadoop Admin
 Hadoop Development
 HBase
 Hive and Pig
 Enterprise Support

Why Hadoop
 Volume
 Use Hadoop when you cannot or should not use
a traditional RDBMS

 Velocity
 Can ingest terabytes of data per day

 Variety
 You can have structured or unstructured data

Use cases for Hadoop
 Recommendation engine
 Netflix recommends movies

 Ad targeting, log processing, search optimization
 eBay, Orbitz

 Machine learning and classification
 Yahoo Mail's spam detection
 Financial: Identity theft and credit risk

 Social Graph
 Facebook, LinkedIn and eHarmony connections

 Predicting election outcomes before election day:
50 out of 50 states called correctly, thanks to Nate Silver!

Some Details about Hadoop
 Two Main Pieces of Hadoop

 Hadoop Distributed File System (HDFS)
 Distributed and redundant data storage using many
nodes
 Hardware will inevitably fail

 Read and process data with MapReduce
 Processing is sent to the data
 Many “map” tasks each work on a slice of the data
 Failed tasks are automatically restarted on another
node or replica

MapReduce Word Count
 The key and value together represent a row of
data where the key is the byte offset and the
value is the line

map (key, value)
    foreach (word in value)
        output (word, 1)

Map is used for searching

Map input (key, value):
64, big data is totally cool and big

Intermediate output (on local disk):
big, 1
data, 1
is, 1
totally, 1
cool, 1
and, 1
big, 1

Reduce is used to aggregate
Hadoop aggregates the keys and calls a reduce for each
unique key, e.g. GROUP BY, ORDER BY

reduce (key, list)
    sum the list
    output (key, sum)

Reduce input (key, list):
big, (1,1)
data, (1)
is, (1)
totally, (1)
cool, (1)
and, (1)

Reduce output (key, sum):
big, 2
data, 1
is, 1
totally, 1
cool, 1
and, 1
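
The word count flow above can be sketched in plain Python. This is a single-process illustration only, not Hadoop code: the function names (map_line, shuffle, reduce_counts, word_count) are made up for this example, and the in-memory grouping stands in for the shuffle step that Hadoop performs between map and reduce.

```python
from collections import defaultdict


def map_line(offset, line):
    """Map step: emit (word, 1) per word. The byte-offset key is not used."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the list of ones collected for a single word."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce over (offset, line) records."""
    intermediate = [
        pair for offset, line in records for pair in map_line(offset, line)
    ]
    grouped = shuffle(intermediate)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = word_count([(64, "big data is totally cool and big")])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold each slice of the data, and the intermediate pairs are written to local disk before being sent to the reducers.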

Where does Hadoop fit in?
 Think of Hadoop as an augmentation of your
traditional RDBMS system

 You want to store years of data

 You need to aggregate all of the data over
many years

 You want/need ALL your data stored and
accessible, not forgotten or deleted

 You need this to be free software running on
commodity hardware

Where does Hadoop fit in?

[Architecture diagram: HTTP front ends sit on top of MySQL servers. Sqoop or a custom ETL moves data from MySQL into a Hadoop (CDH4) cluster made up of a NameNode, a Secondary NameNode, a JobTracker and multiple DataNodes. Hive, Pig and Tableau (business analytics) read from Hadoop, and Sqoop moves results back to MySQL.]

Data Flow
 MySQL is used for OLTP data processing
 ETL process moves data from MySQL to Hadoop
 Cron job – Sqoop
OR
 Cron job – Custom ETL

 Use MapReduce to transform data, run batch
analysis, join data, etc…
 Export transformed results to OLAP or back to
OLTP, for example, a dashboard of aggregated
data or report

                          MySQL                 Hadoop
Data capacity             Depends (TB+)         PB+
Data per query/MR         Depends (MB -> GB)    PB+
Read/write                Random read/write     Sequential scans, append-only
Query language            SQL                   MapReduce, scripted Streaming, HiveQL, Pig Latin
Transactions              Yes                   No
Indexes                   Yes                   No
Latency                   Sub-second            Minutes to hours
Data structure            Relational            Both structured and unstructured
Enterprise and
community support         Yes                   Yes

About Sqoop
 Open source; the name stands for SQL-to-Hadoop

 Parallel import and export between Hadoop and
various RDBMS

 Default implementation is JDBC (generic, not tuned for performance)

 Optimized path available for MySQL (--direct mode)

 Connectors available for
Oracle, Netezza and Teradata (not open source)

Sqoop Data Into Hadoop
$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'

 This command will submit a Hadoop job that
queries your MySQL server and reads all the rows
from world.City

 The resulting TSV file(s) will be stored in HDFS

Sqoop Features
 You can choose specific tables (--table), columns
(--columns) or rows (--where) to import

 Controlled parallelism
 Parallel mappers/connections (--num-mappers)
 Specify the column to split on (--split-by)

 Incremental loads

 Integration with Hive and HBase
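
To illustrate controlled parallelism, here is a rough sketch of how a numeric --split-by column could be divided among --num-mappers parallel connections. It assumes the general approach of taking the minimum and maximum of the split column and cutting that range into contiguous slices. Sqoop's own splitter may differ in details, and the helper names (split_ranges, where_clauses) and the ID column are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Cut the inclusive range [min_id, max_id] into contiguous slices.

    Returns at most num_mappers (low, high) tuples of near-equal size.
    """
    if max_id < min_id or num_mappers < 1:
        raise ValueError("need max_id >= min_id and num_mappers >= 1")
    total = max_id - min_id + 1
    slices = min(num_mappers, total)
    base, extra = divmod(total, slices)
    ranges = []
    start = min_id
    for i in range(slices):
        # The first `extra` slices take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def where_clauses(column, ranges):
    """Build one bounding condition per mapper for the split column."""
    return [f"{column} >= {low} AND {column} <= {high}" for low, high in ranges]


ranges = split_ranges(1, 100, 4)
for clause in where_clauses("ID", ranges):
    print(clause)
```

A split column with evenly distributed values keeps the slices balanced. A skewed column leaves some mappers with far more rows than others.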

Sqoop Export
$ sqoop export --connect jdbc:mysql://example.com/world \
    --table City \
    --export-dir /hdfs_path/City_data

 The City table must already exist in MySQL

 Expects comma-separated (CSV) input by default

 Can use staging table (--staging-table)

About Hive
 Offers a way around the complexities of
MapReduce/Java

 Hive is an open-source project managed by the
Apache Software Foundation

 Facebook uses Hadoop and wanted employees who
do not write Java to be able to access data
 Language based on SQL
 Easy to learn and use
 Data is available to many more people

 Hive translates SQL SELECT statements into
MapReduce jobs
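
As a conceptual sketch of that translation, a query such as SELECT CountryCode, COUNT(*) FROM City GROUP BY CountryCode can be thought of as a map step that emits the grouping key and a reduce step that counts per key. The Python below only illustrates the idea and is not what Hive generates. The rows are made-up sample data, the column names are assumed from the MySQL world sample database, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows standing in for the City table.
rows = [
    {"Name": "Amsterdam", "CountryCode": "NLD"},
    {"Name": "Rotterdam", "CountryCode": "NLD"},
    {"Name": "Kabul", "CountryCode": "AFG"},
]


def map_row(row):
    """Map step: emit the GROUP BY key with a count of 1."""
    return row["CountryCode"], 1


def reduce_group(key, values):
    """Reduce step: COUNT(*) is the sum of the ones emitted for this key."""
    return key, sum(values)


def group_by_count(table):
    """Sorting by key stands in for the shuffle/sort between map and reduce."""
    mapped = sorted(map(map_row, table), key=itemgetter(0))
    return [
        reduce_group(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


result = group_by_count(rows)
print(result)
```

Because every query becomes one or more batch jobs like this, even a simple aggregate pays the job start-up cost.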

More About Hive
 Hive is NOT a replacement for an RDBMS
 Not all SQL works

 Hive is only an interpreter that converts HiveQL to
MapReduce

 HiveQL queries can take many seconds or
minutes to produce a result set

RDBMS vs Hive
               RDBMS                         Hive
Language       SQL                           Subset of SQL along with Hive extensions
Transactions   Yes                           No
ACID           Yes                           No
Latency        Sub-second (indexed data)     Many seconds to minutes (non-indexed data)
Updates?       Yes: INSERT [IGNORE],         INSERT OVERWRITE
               UPDATE, DELETE, REPLACE

Sqoop and Hive
$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --hive-import

 Alternatively, you can create table(s) within the
Hive CLI and run an “fs -put” with an exported
CSV file on the local file system

Impala
 It's new, it's fast

 Allows real-time analytics on very large data sets

 Runs on top of Hive

 Based on Google's Dremel
 http://research.google.com/pubs/pub36632.html

 Cloudera VM for Impala
 https://ccp.cloudera.com/display/SUPPORT/Downloads

Thanks Everyone
 Questions?

 Good References
 Cloudera.com
 http://infolab.stanford.edu/~ragho/hive-icde2010.pdf

 VM downloads
 https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4
