1. Hadoop – Taming Big Data
Jax ArcSig, June 2012
Ovidiu Dimulescu
2. About @odimulescu
• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com
4. What is Hadoop?
• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (creator of Lucene & Nutch)
• Named after Doug's son's toy elephant
5. What is it solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model
6. Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs, let alone in RAM
• Data grows at a tremendous pace
• Data is heterogeneous
• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
• Scaling up has a ceiling (physical, technical, etc.)
7. Why does it matter? Data types
• Complex Data (~80%): images, video, logs, documents, call records, sensor data, mail archives
• Structured Data (~20%): user profiles, complex structured records, CRM, HR records
* Chart source: IDC White Paper
8. Why does it matter?
• Need to process a 10TB dataset
• Assume a sustained transfer rate of 75MB/s
• On 1 node, scanning the data takes ~2 days
• On a 10-node cluster, scanning the data takes ~5 hrs
• Low $/TB for commodity drives
• Low-end servers are multicore-capable
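A quick back-of-the-envelope check of these figures (a sketch using only the slide's assumptions of 10TB at 75MB/s; it ignores seeks, replication, and job overhead, which is why the slide's rounded numbers are a bit higher):

```python
DATASET_BYTES = 10 * 10**12   # 10 TB
TRANSFER_RATE = 75 * 10**6    # 75 MB/s sustained

def scan_hours(nodes: int) -> float:
    """Hours to scan the full dataset with `nodes` disks reading in parallel."""
    return DATASET_BYTES / (TRANSFER_RATE * nodes) / 3600

print(round(scan_hours(1), 1))   # ~37 hours, i.e. on the order of 2 days
print(round(scan_hours(10), 1))  # ~3.7 hours, in the ballpark of the quoted 5 hrs
```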
9. Use Cases
• ETL - Extract, Transform, Load
• Recommendation Engines
• Customer Churn Analysis
• Ad Targeting
• Data "sandbox"
10. Use Cases - Typical ETL
Diagram: the Live DB and Logs feed ETL 1, which loads the Data Warehouse; ETL 2 loads the Reporting DB, which serves the BI Applications.
11. Use Cases - Hadoop ETL
Diagram: the Live DB and Logs are loaded into Hadoop, which feeds the Data Warehouse and the Reporting DB serving the BI Applications.
12. Use Cases – Analysis methods
• Pattern recognition
• Index building
• Text mining
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Graph creation and traversal
15. Why use Hadoop?
• Makes it practical to do things that previously were not:
ü Shorter execution time
ü Costs less
ü Simpler programming model
• Open system with greater flexibility
• Large and growing ecosystem
16. Hadoop – Silver bullet?
• Not a database replacement
• Not a data warehouse (it complements one)
• Not for interactive reporting
• Not a general-purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion
17. Architecture – Design Axioms
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible
18. Architecture – Core Components
HDFS - Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce - Programming model for processing and generating large data sets.
21. HDFS - Design
• Based on Google's GFS
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x, rack-aware)
• Fault-tolerant; expects hardware failures
• HUGE files; expects streaming access, not low latency
• Mostly WORM (write once, read many)
22. HDFS - Architecture
Diagram: the client asks the NameNode (NN) for a file; the NN returns the DataNodes (DNs) that host it; the client then asks those DNs (DataNode 1..N) directly for the data.
NameNode - Master
• Filesystem metadata
• Controls read/write access to files
• Manages block replication
• Applies the transaction log on startup
DataNode - Slaves
• Reads/writes blocks to/from clients
• Replicates blocks at the master's request
23. HDFS – Fault tolerance
• DataNode
§ Uses CRC checksums to avoid corruption
§ Data is replicated on other nodes (3x)
• NameNode
§ Checkpoint NameNode
§ Backup NameNode
§ Failover is manual
24. MapReduce - Design
• Based on Google's MapReduce paper
• Borrows from functional programming
• Simpler programming model:
§ map (in_key, in_value) -> (out_key, intermediate_value) list
§ reduce (out_key, intermediate_value list) -> out_value list
• No user-level synchronization or coordination
Input -> Map -> Reduce -> Output
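The signatures above can be illustrated with the canonical word-count example. This is a plain-Python sketch of the semantics only, not Hadoop API code; the sort-and-group step between the phases is what the framework's shuffle does for you:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # map(in_key, in_value) -> (out_key, intermediate_value) list
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # reduce(out_key, intermediate_value list) -> out_value list
    return [(key, sum(values))]

def run_job(records):
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))             # the "shuffle/sort"
    output = []
    for k, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k, [v for _, v in group]))
    return output

print(run_job([(1, "big data big clusters")]))
# [('big', 2), ('clusters', 1), ('data', 1)]
```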
25. MapReduce - Architecture
Diagram: the client launches a job through the API to the JobTracker (JT), supplying the configuration, mapper, reducer, input, and output; the JT hands tasks to TaskTracker 1..N.
JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, data-locality aware
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker - Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output
26. Hadoop - Core Architecture
Diagram: the JobTracker (jobs API) coordinates TaskTracker 1..N, each co-located with DataNode 1..N, all backed by the NameNode (HDFS). Together they act like a mini OS:
• File system
• Scheduler
27. MapReduce – Head First Style
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
28. MapReduce – Mapper Types
One-to-One: map(k, v) = emit (k, transform(v))
Exploder: map(k, v) = foreach p in v: emit (k, p)
Filter: map(k, v) = if cond(v) then emit (k, v)
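The three patterns can be sketched as plain Python generators (the function names are illustrative, not Hadoop API):

```python
def one_to_one(k, v, transform):
    # One-to-One: emit exactly one transformed pair per input pair.
    yield (k, transform(v))

def exploder(k, v):
    # Exploder: emit one pair per element of the value.
    for p in v:
        yield (k, p)

def filter_map(k, v, cond):
    # Filter: emit the pair only when the condition holds.
    if cond(v):
        yield (k, v)

assert list(one_to_one("k", 2, lambda x: x * 10)) == [("k", 20)]
assert list(exploder("k", [1, 2])) == [("k", 1), ("k", 2)]
assert list(filter_map("k", 5, lambda x: x > 3)) == [("k", 5)]
assert list(filter_map("k", 1, lambda x: x > 3)) == []
```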
29. MapReduce – Reducer Types
Sum Reducer:
reduce(k, vals) =
  sum = 0
  foreach v in vals: sum += v
  emit (k, sum)
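The same reducer, written out as a small Python sketch of the pseudocode above:

```python
def sum_reducer(k, vals):
    # Accumulate all intermediate values for one key, emit a single total.
    total = 0
    for v in vals:
        total += v
    yield (k, total)

print(list(sum_reducer("errors", [1, 3, 2])))  # [('errors', 6)]
```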
32. MapReduce – Combiner Phase
• Optional
• Runs on mapper nodes after the map phase
• "Mini-reduce," only on local map output
• Used to save bandwidth before sending data to the full reducer
• The Reducer can be used as the Combiner if:
1. Output keys and values are the same types as the input keys and values
2. The operation is commutative and associative (SUM, MAX ok, but AVG not)
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
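The commutative/associative requirement is easy to demonstrate: applying SUM per mapper and again at the reducer gives the same answer as summing everything once, while averaging partial averages does not. A sketch, with two hypothetical per-mapper partitions of one key's values:

```python
def avg(vals):
    return sum(vals) / len(vals)

# Values for a single key, split across two mappers:
mapper1, mapper2 = [2, 4, 6], [10]

# SUM as combiner: the sum of partial sums equals the global sum.
assert sum([sum(mapper1), sum(mapper2)]) == sum(mapper1 + mapper2)  # 22 == 22

# AVG as combiner: the average of partial averages is NOT the true average.
print(avg([avg(mapper1), avg(mapper2)]))  # 7.0
print(avg(mapper1 + mapper2))             # 5.5
```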
33. Installation
1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html
2. Download a demo VM: Cloudera, Hortonworks
3. Use a hosted environment (Amazon's EMR, Azure)
34. Installation – Platform Notes
Production: Linux – official
Development: Linux, OS X, *nix, Windows via Cygwin
35. MapReduce – Client Languages
Java, any JVM language - native:
hadoop jar jar_path main_class input_path output_path
C++ - Pipes framework, socket IO:
hadoop pipes -input path_in -output path_out -program exec_program
Any language - Streaming, stdin/stdout:
hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
Also: Pig Latin, Hive HQL, C via JNI
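With streaming, the mapper and reducer are just programs that read stdin and write tab-separated key/value lines to stdout. A minimal word-count pair in Python (a sketch of what map_prog and reduce_prog above could be; streaming delivers the reducer's input already sorted by key):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input arrives sorted by key, so groupby sees each key contiguously.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if role == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```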
36. MapReduce – Client Anatomy
• Main Program (aka Driver): configures the Job, initiates the Job
• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
44. Summary
Hadoop is an economical, scalable, distributed data-processing system which enables data:
ü Consolidation (structured or not)
ü Query flexibility (any language)
ü Agility (evolving schemas)
46. References
Hadoop at Yahoo!, by the Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Philippe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah
Anatomy of a MapReduce Job Run, by Tom White
MapReduce Jobs in Hadoop