2. About Me
Chris Schneider, Data Architect @ Ning.com (a
Glam Media Company)
Spent the last ~2 years working with Hadoop
(CDH)
Spent the last 10 years building MySQL
architecture for multiple companies
chriss@glam.com
3. What we'll cover
Hadoop
CDH
Use cases for Hadoop
Map Reduce
Sqoop
Hive
Impala
4. What is Hadoop?
An open-source framework for storing and
processing data on a cluster of servers
Based on Google's whitepapers on the Google
File System (GFS) and MapReduce
Scales linearly
Designed for batch processing
Optimized for streaming reads
5. The Hadoop Distribution
Cloudera
A widely used distribution of Apache Hadoop (CDH)
What Cloudera Does
Cloudera Manager
Enterprise Training
Hadoop Admin
Hadoop Development
HBase
Hive and Pig
Enterprise Support
6. Why Hadoop
Volume
Use Hadoop when you cannot or should not use
traditional RDBMS
Velocity
Can ingest terabytes of data per day
Variety
You can have structured or unstructured data
7. Use cases for Hadoop
Recommendation engine
Netflix recommends movies
Ad targeting, log processing, search optimization
eBay, Orbitz
Machine learning and classification
Yahoo Mail's spam detection
Financial: Identity theft and credit risk
Social Graph
Facebook, Linkedin and eHarmony connections
Predicting the outcome of an election before the
election: 50 out of 50 states called correctly, thanks to Nate Silver!
8. Some Details about Hadoop
Two Main Pieces of Hadoop
Hadoop Distributed File System (HDFS)
Distributed and redundant data storage using many
nodes
Hardware will inevitably fail
Read and process data with MapReduce
Processing is sent to the data
Many “map” tasks each work on a slice of the data
Failed tasks are automatically restarted on another
node or replica
10. MapReduce Word Count
The key and value together represent a row of
data: the key is the line's byte offset and the
value is the line's text
map (key, value)
  foreach (word in value)
    output (word, 1)
11. Map is used for Searching
Input record: (64, "big data is totally cool and big")
The map task emits (word, 1) for each word.
Intermediate output (on local disk):
big, 1
data, 1
is, 1
totally, 1
cool, 1
and, 1
big, 1
12. Reduce is used to aggregate
Hadoop groups the intermediate pairs by key and calls reduce once for
each unique key, e.g. GROUP BY, ORDER BY
reduce (key, list) receives:
big, (1,1)
data, (1)
is, (1)
totally, (1)
cool, (1)
and, (1)
Sum the list and output (key, sum):
big, 2
data, 1
is, 1
totally, 1
cool, 1
and, 1
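The word-count flow on these two slides can be sketched in plain Python. This is a minimal in-memory simulation of map, shuffle, and reduce, not Hadoop's actual Java API:

```python
from collections import defaultdict

def map_phase(key, value):
    # key: byte offset of the line; value: the line's text
    for word in value.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Hadoop calls reduce once per unique key with all of its values
    return (key, sum(values))

# Simulate the shuffle/sort: group intermediate (word, 1) pairs by key
line = "big data is totally cool and big"
grouped = defaultdict(list)
for word, one in map_phase(64, line):
    grouped[word].append(one)

results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # "big" sums to 2, matching the slide
```

In a real job the intermediate pairs are written to local disk and shuffled across the network to the reducer nodes; here the defaultdict stands in for that step.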
13. Where does Hadoop fit in?
Think of Hadoop as an augmentation of your
traditional RDBMS system
You want to store years of data
You need to aggregate all of the data over
many years' time
You want/need ALL your data stored and
accessible not forgotten or deleted
You need this to be free software running on
commodity hardware
14. Where does Hadoop fit in?
[Architecture diagram]
HTTP front ends write to a tier of MySQL servers (OLTP).
Sqoop or a custom ETL process loads that data into the Hadoop (CDH4)
cluster: a NameNode (plus a Secondary NameNode), a JobTracker, and
many DataNodes.
Hive and Pig run against the cluster, with Tableau on top for
business analytics, and Sqoop exports results back out.
15. Data Flow
MySQL is used for OLTP data processing
ETL process moves data from MySQL to Hadoop
Cron job – Sqoop
OR
Cron job – Custom ETL
Use MapReduce to transform data, run batch
analysis, join data, etc…
Export transformed results to OLAP or back to
OLTP, for example, a dashboard of aggregated
data or report
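The transform step above can be pictured with a small sketch. The rows and column names here are hypothetical, and a real pipeline would run this as a MapReduce job over files in HDFS rather than an in-memory list:

```python
from collections import Counter

# Hypothetical rows exported from MySQL by Sqoop: (date, page, user_id)
oltp_rows = [
    ("2013-01-01", "/home", 1),
    ("2013-01-01", "/home", 2),
    ("2013-01-01", "/about", 1),
    ("2013-01-02", "/home", 3),
]

# Batch aggregation: page views per day, the kind of summary you would
# export back to an OLAP store or a reporting dashboard
daily_views = Counter(date for date, _page, _user in oltp_rows)
dashboard = [{"date": d, "views": n} for d, n in sorted(daily_views.items())]
print(dashboard)
```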
16. MySQL vs. Hadoop

                          MySQL                    Hadoop
Data capacity             Depends, TB+             PB+
Data per query/MR job     Depends, MB to GB        PB+
Read/write                Random read/write        Sequential scans,
                                                   append-only
Query language            SQL                      MapReduce, scripted
                                                   streaming, HiveQL,
                                                   Pig Latin
Transactions              Yes                      No
Indexes                   Yes                      No
Latency                   Sub-second               Minutes to hours
Data structure            Relational               Structured and
                                                   unstructured
Enterprise and            Yes                      Yes
community support
17. About Sqoop
Open Source and stands for SQL-to-Hadoop
Parallel import and export between Hadoop and
various RDBMS
Default implementation is JDBC
Has MySQL-specific support, but is not optimized for performance
Integrated with connectors for
Oracle, Netezza, Teradata (Not Open Source)
18. Sqoop Data Into Hadoop
$ sqoop import --connect jdbc:mysql://example.com/world
--table City
--fields-terminated-by '\t'
--lines-terminated-by '\n'
This command will submit a Hadoop job that
queries your MySQL server and reads all the rows
from world.City
The resulting TSV file(s) will be stored in HDFS
19. Sqoop Features
You can choose specific tables (--table), columns
(--columns), or rows (--where) to import
Controlled parallelism
Parallel mappers/connections (--num-mappers)
Specify the column to split on (--split-by)
Incremental loads
Integration with Hive and HBase
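The parallelism above can be pictured like this: Sqoop takes the MIN and MAX of the --split-by column and divides that range among the mappers. This is a simplified sketch; Sqoop's actual splitters handle column types, skew, and edge cases differently:

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide [min_val, max_val] into contiguous ranges, one per mapper."""
    step = (max_val - min_val + 1) / num_mappers
    ranges, lo = [], min_val
    for i in range(num_mappers):
        # last mapper takes everything up to max_val
        hi = max_val if i == num_mappers - 1 else min_val + round(step * (i + 1)) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# e.g. a --split-by id column spanning 1..100 with --num-mappers 4
print(split_ranges(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each range becomes one mapper's WHERE clause, so four mappers open four parallel connections to MySQL.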
20. Sqoop Export
$ sqoop export --connect jdbc:mysql://example.com/world
--table City
--export-dir /hdfs_path/City_data
The City table needs to exist
Input data is expected to be CSV-formatted by default
Can use staging table (--staging-table)
21. About Hive
Offers a way around the complexities of
MapReduce/Java
Hive is an open-source project managed by the
Apache Software Foundation
Facebook uses Hadoop and wanted non-Java
employees to be able to access data
Language based on SQL
Easy to learn and use
Data is available to many more people
Hive is a SQL SELECT statement to MapReduce
translator
22. More About Hive
Hive is NOT a replacement for RDBMS
Not all SQL works
Hive is only an interpreter that converts HiveQL to
MapReduce
HiveQL queries can take many seconds or
minutes to produce a result set
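A hedged sketch of what "HiveQL to MapReduce translator" means: a query like SELECT country_code, COUNT(*) FROM City GROUP BY country_code becomes a map phase that emits the grouping key plus a reduce phase that aggregates. The rows below are hypothetical, and Hive's real query planner is far more involved:

```python
from collections import defaultdict

# Hypothetical rows of a City table: (name, country_code)
rows = [("Berlin", "DEU"), ("Munich", "DEU"), ("Paris", "FRA")]

# Map phase: emit (grouping key, 1) for every row
intermediate = defaultdict(list)
for _name, country_code in rows:
    intermediate[country_code].append(1)

# Reduce phase: COUNT(*) becomes a sum over each key's values
counts = {code: sum(ones) for code, ones in intermediate.items()}
print(counts)
```

The shuffle between the two phases is what makes the GROUP BY work, and it is also why even a simple HiveQL query pays the latency cost of a full MapReduce job.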
23. RDBMS vs Hive

               RDBMS                        Hive
Language       SQL                          Subset of SQL plus Hive
                                            extensions
Transactions   Yes                          No
ACID           Yes                          No
Latency        Sub-second                   Many seconds to minutes
               (indexed data)               (non-indexed data)
Updates?       Yes: INSERT [IGNORE],        INSERT OVERWRITE
               UPDATE, DELETE, REPLACE
24. Sqoop and Hive
$ sqoop import --connect jdbc:mysql://example.com/world
--table City
--hive-import
Alternatively, you can create table(s) within the
Hive CLI and run a "hadoop fs -put" with an exported
CSV file from the local file system
25. Impala
It's new, it's fast
Allows real time analytics on very large data sets
Shares the Hive metastore and HiveQL syntax
Based on Google's Dremel
http://research.google.com/pubs/pub36632.html
Cloudera VM for Impala
https://ccp.cloudera.com/display/SUPPORT/Downloads
26. Thanks Everyone
Questions?
Good References
Cloudera.com
http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
VM downloads
https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM+for+CDH4