If you are interested in Hadoop and its capabilities but are not sure where to begin, this is the session for you. Learn the basics of Hadoop, see how to spin up a development cluster in the cloud or on-premises, and start exploring ETL processing with SQL and other familiar tools.
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
1. From Zero to Hadoop
March 19, 2013
2. Agenda
• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager
4. Hadoop Ecosystem
[Architecture diagram: the CDH stack across the INGEST, STORE, EXPLORE, PROCESS, ANALYZE, and SERVE stages. Core storage and compute (required): HDFS, HBase, MapReduce, MapReduce2, YARN, ZooKeeper. Integration and access: Sqoop, Flume, Hue, Oozie, Whirr, Hive, Pig, Mahout, DataFu, Impala, FUSE-DFS, WebHDFS / HttpFS, ODBC / JDBC. Management software and connectors: Cloudera Manager and Cloudera Navigator (v1.0, with audit and lineage), plus technical support and subscription options for BI, ETL, and RDBMS integration.]
5. Sqoop
Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver.
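For example, a single table can be pulled into HDFS from the command line. A minimal sketch, assuming a MySQL source; the connection string, credentials, table, and target directory are all illustrative:

    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /user/etl/orders \
      --num-mappers 4

Sqoop runs the transfer as parallel map tasks (four here), each copying a slice of the table into HDFS.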
6. FlumeNG
A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: multiple Clients sending events to Flume Agents, which aggregate the streams.]
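Flume NG agents are wired together in a properties file. A minimal sketch, assuming a hypothetical agent a1 that listens for Syslog events over TCP and lands them in HDFS (the port and paths are illustrative):

    # components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1
    # source: receive syslog messages over TCP
    a1.sources.r1.type = syslogtcp
    a1.sources.r1.host = 0.0.0.0
    a1.sources.r1.port = 5140
    a1.sources.r1.channels = c1
    # channel: buffer events in memory between source and sink
    a1.channels.c1.type = memory
    # sink: write events into an HDFS directory
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/syslog
    a1.sinks.k1.channel = c1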
7. HBase
• A low-latency, distributed, non-SQL database built on HDFS
• A “Columnar Database”
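Tables are created with one or more column families, and cells are addressed by row key, column, and timestamp. A minimal sketch in the HBase shell; the table and column names are illustrative:

    create 'users', 'info'                      # table with one column family
    put 'users', 'row1', 'info:name', 'Alice'   # write one cell
    get 'users', 'row1'                         # read back the row
    scan 'users'                                # scan the whole table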
8. Hive
• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s
JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
9. Pig
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
10. Oozie
A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
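Workflows are defined as XML documents that chain actions together. A minimal sketch of a one-action workflow running a Pig script; the workflow name, node names, schema version, and script are illustrative:

    <workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="pig-node"/>
      <action name="pig-node">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>filter_rich.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Pig action failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>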
12. Mahout
A machine learning library with algorithms for:
• Recommendation, based on users' behavior
• Clustering, which groups related documents
• Classification, which learns from existing categorized documents
• Frequent item-set mining (e.g., shopping cart contents)
13. Hadoop Security
• Authentication is secured by MIT Kerberos v5 and integrated with LDAP
• Provides identity, authentication, and authorization
• Useful for multi-tenant or secure environments
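On a secured cluster, a user authenticates with Kerberos before running Hadoop commands. A minimal sketch; the principal and path are illustrative:

    kinit alice@EXAMPLE.COM      # obtain a Kerberos ticket for the user
    hadoop fs -ls /user/alice    # Hadoop commands now authenticate with that ticket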
15. Components of HDFS
• NameNode: holds all metadata for HDFS
  • Needs to be a highly reliable machine
  • RAID drives, typically RAID 10
  • Dual power supplies
  • Dual network cards, bonded
  • The more memory the better; typically 36GB to 64GB
• Secondary NameNode: provides checkpointing for the NameNode. The same hardware as the NameNode should be used
16. Components of HDFS – Contd.
• DataNodes: hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio (see the sizing example below) is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
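Scaling that ratio up: a DataNode with 12 data drives would be balanced by roughly 24 cores and 48GB of RAM. Treat this as a rule of thumb rather than a hard requirement; I/O-heavy or compute-heavy workloads shift the balance.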
19. MapReduce – Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: a Map Task emits (key, values) pairs; the Shuffle Phase groups the intermediate values by key; a Reduce Task turns each group into final (key, values) output.]
20. MapReduce – Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key
[Diagram: the same Map / Shuffle / Reduce flow as on slide 19.]
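To make the two phases concrete, here is a minimal sketch of the classic word-count example against the org.apache.hadoop.mapreduce API (the class names are illustrative): map() emits (word, 1) for every token, and reduce() sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // map(): one input record (byte offset, line) in,
      // one intermediate (word, 1) pair out per token
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);   // intermediate (key, value)
            }
          }
        }
      }

      // reduce(): all intermediate values for one key arrive together; sum them
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));   // final (key, value)
        }
      }
    }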
22. Hadoop In the Enterprise
How It Works In The Real World
23. Networking
• One of the most important things to consider when setting up a Hadoop cluster
• Typically each rack has a top-of-rack switch, uplinked to a core switch
• Be careful about oversubscribing the switch backplane!
24. Hadoop Typical Data Pipeline
[Diagram: data sources feed Hadoop through Flume and Sqoop; original source data lands in HDFS; MapReduce, Pig, and Hive jobs, orchestrated by Oozie, produce result or calculated data; Sqoop exports the results to the data warehouse and data marts.]
25. Hadoop Use Cases
Advanced Analytics Use Case     Industry         Data Processing Use Case
Social Network Analysis         Web              Clickstream Sessionization
Content Optimization            Media            Clickstream Sessionization
Network Analytics               Telco            Mediation
Loyalty & Promotions Analysis   Retail           Data Factory
Fraud Analysis                  Financial        Trade Reconciliation
Entity Analysis                 Federal          SIGINT
Sequencing Analysis             Bioinformatics   Genome Mapping
26. Hadoop in the Enterprise
[Diagram: operators, engineers, analysts, and business users reach the platform through management tools, IDEs, BI / analytics tools, and enterprise reporting; Hadoop sits alongside the enterprise data warehouse and web applications, pulling in logs, files, web data, and relational databases, and ultimately serving customers.]
27. Cloudera Manager
End-to-End Administration for CDH
1. Manage: easily deploy, configure & optimize clusters
2. Monitor: maintain a central view of all activity
3. Diagnose: easily identify and resolve issues
4. Integrate: use Cloudera Manager with existing tools
28. Install A Cluster In 3 Simple Steps
Cloudera Manager Key Features
1. Find Nodes: enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
2. Install Components: Cloudera Manager automatically installs the CDH components on the hosts you specified.
3. Assign Roles: verify the roles of the nodes within your cluster. Make changes as necessary.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate DataNodes. A typical Hadoop node is eight cores with 16GB of RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
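Both settings live in hdfs-site.xml. A minimal sketch, assuming Hadoop 2.x property names:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>         <!-- each block stored on 3 DataNodes -->
      </property>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value> <!-- 128MB instead of the 64MB default -->
      </property>
    </configuration>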
Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. Hadoop brings data that you're already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together web and application logs, unstructured files, web data, and relational data. Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.