Hadoop, Taming elephants
      JaxLUG, 2013

      Ovidiu Dimulescu
About @odimulescu
• Working on the Web since 1997
• Into startup and engineering cultures
• Speaker at user groups, code camps
• Founder and organizer for JaxMUG.com
• Organizer for Jax Big Data meetup
Agenda
  •   Background
  •   Architecture v1.0 & 2.0
  •   Ecosystem
  •   Installation
  •   Security
  •   Monitoring
  •   Demo
  •   Q & A
What is Hadoop?
• Apache Hadoop is an open source Java software
  framework for running data-intensive applications on
  large clusters of commodity hardware

• Created by Doug Cutting (Lucene & Nutch creator)

• Named after Doug’s son’s toy elephant
What is it solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model



[Diagram: data distributed across many CPU nodes, with computation running locally on each node]
Why does it matter?
• Volume, Velocity, Variety and Value

• Datasets do not fit on local HDDs let alone RAM

• Scaling up

   ‣ Is expensive (licensing, hardware, etc.)
   ‣ Has a ceiling (physical, technical, etc.)
Why does it matter?

           Data types *

           Complex Data (~80%): Images, Video, Logs, Documents,
           Call records, Sensor data, Mail archives

           Structured Data (~20%): User Profiles, CRM, HR Records

* Chart Source: IDC White Paper
Why does it matter?

• Scanning 10TB at a sustained transfer rate of 75MB/s takes

   ~2 days on 1 node

   ~5 hrs on a 10-node cluster

• Low $/TB for commodity drives

• Low-end servers are multicore capable
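The scan-time figures above can be checked with back-of-the-envelope arithmetic (a sketch using the slide's own numbers; the slide rounds more generously):

```python
# Time to scan 10 TB at a sustained 75 MB/s, on 1 node vs. a 10-node cluster.
TB = 10**12          # decimal terabyte, in bytes
MB = 10**6           # decimal megabyte, in bytes

data = 10 * TB
rate = 75 * MB       # bytes per second per node

one_node_hours = data / rate / 3600
ten_node_hours = one_node_hours / 10   # a full scan parallelizes evenly

print(f"1 node:   ~{one_node_hours / 24:.1f} days")
print(f"10 nodes: ~{ten_node_hours:.1f} hours")
```

This lands at roughly a day and a half on one node and under four hours on ten, the same order of magnitude as the slide's rounded figures.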
Use cases

• ETL - Extract Transform Load

• Pattern Recognition

• Recommendation Engines

• Prediction Models

• Log Processing

• Data “sandbox”
Who uses it?
Who supports it?
What is Hadoop not?

• Not a database replacement

• Not a data warehouse (it complements one)

• Not for interactive reporting

• Not a general purpose storage mechanism

• Not for problems that are not parallelizable in a
  shared-nothing fashion
Architecture – Core Components

HDFS

Distributed filesystem designed for low cost storage
and high bandwidth access across the cluster.


Map-Reduce

Programming model for processing and generating
large data sets.
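The Map-Reduce model can be illustrated with word count, its canonical example (a minimal in-memory sketch of the map/shuffle/reduce phases, not Hadoop's Java API):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle: group intermediate pairs by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["the elephant", "the yellow elephant"]))
```

In Hadoop the same mapper and reducer run on many nodes, with the framework handling partitioning, shuffling, and fault tolerance.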
HDFS - Design

•   Files are stored as blocks (64MB default size)

•   Configurable data replication (3x, Rack Aware*)

•   Fault Tolerant, Expects HW failures

•   HUGE files, Expects Streaming not Low Latency

•   Mostly WORM

•   Not POSIX compliant

•   Not mountable OOTB*
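The block size and replication defaults above translate into simple storage arithmetic (a sketch, not an HDFS API; names are illustrative):

```python
import math

BLOCK_SIZE = 64 * 1024**2   # 64 MB default block size
REPLICATION = 3             # default replication factor

def hdfs_footprint(file_bytes):
    """Blocks a file occupies, and raw bytes stored cluster-wide."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    return blocks, file_bytes * REPLICATION

blocks, raw = hdfs_footprint(1 * 1024**3)   # a 1 GB file
print(blocks, raw)   # 16 blocks, 3 GB of raw storage
```

The large block size is why HDFS favors huge streamed files: many small files would bloat the NameNode's in-memory metadata, one entry per block.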
HDFS - Architecture


                                                 Namenode (NN)

  1. Client asks the NN for a file
  2. NN returns the DNs that host it
  3. Client asks the DNs for the data

                                Datanode 1         Datanode 2          Datanode N

Namenode - Master

•     Filesystem metadata
•     Controls read/write access to files
•     Manages block replication

Datanode - Slaves

•     Reads/writes blocks to/from clients
•     Replicates blocks at the master's request
•     Notifies the master about block IDs

                                Single Namespace
                                Single Block Pool
HDFS - Fault tolerance

•   DataNode

         Uses CRC32 checksums to detect corruption
         Data is replicated on other nodes (3x)*

•   NameNode

         fsimage - last snapshot
         edits - changes log since last snapshot
         Checkpoint Node
         Backup NameNode
         Failover is manual*
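The DataNode's checksumming can be sketched with Python's `zlib.crc32` (the same CRC-32 polynomial; the store/verify wrapper here is invented for illustration — HDFS actually checksums fixed-size chunks of each block):

```python
import zlib

def store(block: bytes):
    """Keep a block together with its CRC32, as a DataNode does on write."""
    return block, zlib.crc32(block)

def verify(block: bytes, checksum: int) -> bool:
    """On read, recompute the CRC and compare; a mismatch means corruption."""
    return zlib.crc32(block) == checksum

block, crc = store(b"hadoop block data")
assert verify(block, crc)                      # intact block passes
assert not verify(b"hadoop block dat4", crc)   # a flipped byte is caught
```

When a DataNode detects a corrupt block, the client can fall back to one of the other replicas, and the NameNode schedules re-replication.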
MapReduce - Architecture

Client launches a job via the Jobs API              JobTracker (JT)
  - Configuration
  - Mapper
  - Reducer
  - Input
  - Output
                              TaskTracker 1    TaskTracker 2     TaskTracker N

JobTracker - Master

• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors tasks and TaskTracker status,
  re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves

• Runs Map and Reduce tasks received from the JobTracker
• Manages storage and transmission of intermediate output
Hadoop - Core Architecture


[Diagram: clients submit jobs via the Jobs API to the JobTracker, which dispatches tasks to TaskTrackers; each TaskTracker is co-located with a DataNode, and the NameNode coordinates HDFS underneath]

* Mini OS: Filesystem & Scheduler
Hadoop 2.0 - HDFS Architecture




• Distributed Namespace
• Multiple Block Pools
Hadoop 2.0 - YARN Architecture
MapReduce - Clients

Java - Native
 hadoop jar jar_path main_class input_path output_path


C++ - Pipes framework
 hadoop pipes -input path_in -output path_out -program exec_program


Any – Streaming
 hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog \
   -input path_in -output path_out


Pig Latin, Hive HQL, C via JNI
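A Streaming mapper or reducer can be any executable that reads lines on stdin and writes tab-separated key/value lines on stdout; word count again, as two Python functions (a sketch — in a real job each would be its own script wired to `sys.stdin`/`sys.stdout`, and the script names passed to `-mapper`/`-reducer` are up to you):

```python
def map_words(lines):
    """Streaming mapper: emit one 'word<TAB>1' line per input word."""
    return [f"{word}\t1" for line in lines for word in line.split()]

def reduce_counts(lines):
    """Streaming reducer: input arrives sorted by key, so each key's
    occurrences form one contiguous run that we sum as we go."""
    out, current, total = [], None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                out.append(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.append(f"{current}\t{total}")
    return out

print(map_words(["hadoop taming hadoop"]))
print(reduce_counts(["hadoop\t1", "hadoop\t1", "taming\t1"]))
```

The sorted-runs assumption in the reducer is exactly what the framework's shuffle/sort phase guarantees between the two stages.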
Hadoop - Ecosystem

                    Management

 ZooKeeper      Chukwa          Ambari          HUE

                     Data Access

  Pig        Hive       Flume         Impala    Sqoop

                    Data Processing
 MapReduce     Giraph     Hama        Mahout     MPI

                      Storage
        HDFS                           HBase
Installation - Platforms

Production
    Linux – Official

Development
   Linux
   OSX
   Windows via Cygwin
   *Nix
Installation - Versions

Public Numbering

 1.0.x - current stable version
 1.1.x - current beta version for 1.x branch
 2.x - current alpha version

Development Numbering

 0.20.x aka 1.x - CDH 3 & HDP 1
 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
Installation - For toying

Option 1 - Official project releases
     hadoop.apache.org/common/releases.html

Option 2 - Demo VM from vendors
     •   Cloudera
     •   Hortonworks
     •   Greenplum
     •   MapR

Option 3 - Cloud
     • Amazon’s EMR
     • Hadoop on Azure
Installation - For real

Vendor distributions
   •   Cloudera CDH
   •   Hortonworks HDP
   •   Greenplum GPHD
   •   MapR M3, M5 or M7

Hosted solutions

   •   AWS EMR
   •   Hadoop on Azure

Use Virtualization - VMware Serengeti *
Security - Simple Mode

• Use in a trusted environment
  ‣   Identity comes from euid of the client process
  ‣   MapReduce tasks run as the TaskTracker user
  ‣   User that starts the NameNode is super-user

• Reasonable protection for accidental misuse
• Simple to set up
Security - Secure Mode

• Kerberos based
• Use for tight granular access
    ‣   Identity comes from Kerberos Principal
    ‣   MapReduce tasks run as Kerberos Principal

•   Use a dedicated MIT KDC

•   Hook it to your primary KDC (AD, etc.)

•   Significant setup effort (users, groups and Kerberos keys
    on all nodes, etc.)
Monitoring

Built-in

  • JMX
  • REST
  • No SNMP support
Other

  Cloudera Manager (Free up to 50 nodes)
  Ambari - Free, RPM based systems (RH, CentOS)
Demo
Questions ?
References
Hadoop Operations, by Eric Sammer
Hadoop Security, by Hortonworks Blog

HDFS Federation, by Suresh Srinivas

Hadoop 2.0 New Features, by VertiCloud Inc

MapReduce in Simple Terms, by Saliya Ekanayake

Hadoop Architecture, by Phillipe Julio
