Hadoop operations

Marc Cluet – Lynx Consultants

How Hadoop Works

What we’ll cover
• Understand Hadoop in detail
• See how Hadoop works operationally
• Be able to start asking the right questions of your data

Lynx Consultants © 2013

Hadoop Distributions
• Cloudera CDH
• Hortonworks
• MapR

Hadoop Components
• HDFS
• HBase
• MapRed
• YARN

Hadoop Components
• HDFS
  ◦ Hadoop Distributed File System
  ◦ Everything sits on top of it
  ◦ Keeps 3 copies of every block by default
• HBase
• MapRed
• YARN
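
As a rough illustration of what three copies per block means for storage, the sketch below estimates how many blocks and how much raw disk a single file consumes. The 128 MiB block size is an assumption (it is configurable and differs between Hadoop versions), and `hdfs_footprint` is a hypothetical helper, not part of Hadoop.

```python
import math


def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for one file.

    block_bytes is an assumed block size; replication defaults to the
    3 copies mentioned on the slide.
    """
    blocks = math.ceil(file_bytes / block_bytes) if file_bytes else 0
    # The last block is not padded, so raw usage is simply size * copies.
    return blocks, blocks * replication, file_bytes * replication


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1024**3)  # a 1 GiB file
    print(blocks, replicas, raw)
```

With these assumed values a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and occupies 3 GiB of raw disk across the cluster.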

Hadoop Components
• HDFS
• HBase
  ◦ Hadoop schemaless database
  ◦ Key-value store
  ◦ Sits on top of HDFS
• MapRed
• YARN

Hadoop Components
• HDFS
• HBase
• MapRed
  ◦ Hadoop Map/Reduce
  ◦ Non-pluggable, archaic
  ◦ Requires HDFS for temp storage
• YARN

Hadoop Components
• HDFS
• HBase
• MapRed
• YARN
  ◦ Resource management layer of Hadoop Map/Reduce version 2.0
  ◦ Pluggable: you can add your own frameworks
  ◦ Fast and less memory-hungry

Hadoop Component Breakdown
• All these components divide themselves into
  ◦ client/server
  ◦ master/slave scenarios
• We will now look at the breakdown of each individual component

Hadoop Components Breakdown
• HDFS
  ◦ Master Namenode
    ▪ Keeps track of all file allocation on Datanodes
    ▪ Re-replicates data if one of the datanodes goes down
    ▪ Is rack aware
  ◦ Secondary Namenode
    ▪ Performs housekeeping (checkpointing) for the namenode
    ▪ Does not necessarily need to run on a different server from the namenode
  ◦ Datanode
    ▪ Stores the data
    ▪ Non-RAID disks are preferable, for extra I/O speed
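
The namenode's bookkeeping described above (tracking which datanodes hold each block and restoring copies when a node is lost) can be pictured with a toy model. This is only a sketch under strong assumptions: `re_replicate` and the node names are hypothetical, and the real placement policy, including rack awareness, is far more involved.

```python
def re_replicate(block_map, live_nodes, dead_node, replication=3):
    """Drop dead_node from every block and top the replicas back up.

    block_map: dict of block id -> set of datanode names (toy namenode state).
    Returns the list of (block, new_node) copies that were scheduled.
    """
    scheduled = []
    candidates = sorted(n for n in live_nodes if n != dead_node)
    for block, holders in sorted(block_map.items()):
        holders.discard(dead_node)
        for node in candidates:
            if len(holders) >= replication:
                break
            if node not in holders:
                holders.add(node)  # pretend the block was copied here
                scheduled.append((block, node))
    return scheduled


if __name__ == "__main__":
    state = {
        "blk_1": {"dn1", "dn2", "dn3"},
        "blk_2": {"dn2", "dn3", "dn4"},
    }
    print(re_replicate(state, {"dn1", "dn2", "dn3", "dn4"}, "dn3"))
```

In this toy run, losing `dn3` leaves both blocks with two copies, so one new copy of each is scheduled on a surviving node that does not already hold it.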

• HDFS
  ◦ How to access
    ▪ A client can connect with the hadoop client to hdfs://namenode:8020
    ▪ Supports most basic Unix-style file commands
  ◦ Configuration files
    ▪ /etc/hadoop/conf/core-site.xml
      Defines core settings such as the HDFS namenode address and default parameters
    ▪ /etc/hadoop/conf/hdfs-site.xml
      Defines namenode- and datanode-specific settings, such as file locations
    ▪ /etc/hadoop/conf/slaves
      Defines the list of servers that are available in this cluster
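
The `*-site.xml` files share one simple layout: a `configuration` element holding `property` entries, each with a `name` and a `value`. The sketch below renders such a file from a dictionary. `hadoop_site_xml` is a hypothetical helper, and the property shown (`fs.defaultFS`, the Hadoop 2 name; older releases use `fs.default.name`) is given only as an example of what core-site.xml might carry.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of settings in the Hadoop *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The namenode URI is the one used on the slide; adjust to your cluster.
    print(hadoop_site_xml({"fs.defaultFS": "hdfs://namenode:8020"}))
```

The same helper could produce hdfs-site.xml content, for example with `dfs.replication` set to the desired number of copies.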

• HBase
  ◦ Master
    ▪ Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
  ◦ Regionserver
    ▪ HBase node; stores part of the data in its regions, roughly equivalent to sharding
  ◦ Thrift / REST
    ▪ Interface to connect to HBase
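
The sharding analogy can be made concrete: each region covers a contiguous range of row keys, and a lookup finds the region whose range contains the key. The sketch below is a simplified illustration; the region boundaries and regionserver names are invented, and `locate` is a hypothetical helper rather than the HBase client API.

```python
import bisect

# (start_key, regionserver) pairs sorted by start key; hypothetical layout.
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]


def locate(row_key, regions=REGIONS):
    """Return the regionserver whose key range contains row_key."""
    starts = [start for start, _ in regions]
    # The owning region is the last one whose start key is <= row_key.
    index = bisect.bisect_right(starts, row_key) - 1
    return regions[index][1]


if __name__ == "__main__":
    for key in ("apple", "hadoop", "zebra"):
        print(key, "->", locate(key))
```

Because rows are assigned by key range, a badly chosen row key (for example a monotonically increasing timestamp) sends all writes to a single region.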

• HBase
  ◦ How to access
    ▪ Through the HBase client (using Thrift)
    ▪ Through the RESTful API
  ◦ Configuration files
    ▪ /etc/hbase/conf/hbase-site.xml
      Defines all the basic configuration for accessing HBase
    ▪ /etc/hbase/conf/hbase-policy.xml
      Defines the security (ACL) policy; memory tweaks are normally set in hbase-env.sh and hbase-site.xml
    ▪ /etc/hbase/conf/regionservers
      Lists all the regionservers available to this cluster

• MapRed
  ◦ JobTracker
    ▪ Creates the Map/Reduce jobs
    ▪ Stores all the intermediate data
    ▪ Keeps track of all the previous results through the HistoryServer
  ◦ TaskTracker
    ▪ Executes tasks related to the Map/Reduce job
    ▪ Very CPU and memory intensive
    ▪ Stores intermediate results, which are then pushed to the JobTracker
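
To make the job/task split more tangible, here is a single-process sketch of what a Map/Reduce job computes, using the classic word count. It only illustrates the map, shuffle and reduce phases; nothing here uses Hadoop, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce task: collapse all values for one key into a total."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "hadoop processes data"]))
```

On a cluster, the map and reduce calls would run as separate tasks on many TaskTrackers, with the intermediate pairs moved between nodes.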

• MapRed
  ◦ How to access
    ▪ Through the Hadoop client
    ▪ Through any MapRed client like Pig or Hive
    ▪ Through your own Java code
  ◦ Configuration files
    ▪ /etc/hadoop/conf/mapred-site.xml
      Defines how to contact this MapRed cluster
    ▪ /etc/hadoop/conf/mapred-queue-acls.xml
      Defines the ACL structure for accessing MapRed; normally not necessary
    ▪ /etc/hadoop/conf/slaves
      Defines the list of TaskTrackers in this cluster
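
The slaves file is a plain list of hostnames, one per line. A minimal sketch of reading it from a script might look like the following; `parse_host_list` and the hostnames are hypothetical, and the handling of `#` comments and blank lines is an assumption about how the file is commonly written.

```python
def parse_host_list(text):
    """Return the hostnames from a slaves-style file, one host per line.

    Blank lines and anything after '#' are ignored (assumed convention).
    """
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    sample = "# worker nodes\ndatanode01\ndatanode02  # rack 1\n\ndatanode03\n"
    print(parse_host_list(sample))
```

The same parser would work for the HBase regionservers file, which follows the same layout.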

• YARN
  ◦ Same master/slave structure as MapRed (Map/Reduce version 2.0 runs on top of it)
  ◦ Configuration files
    ▪ /etc/hadoop/conf/yarn-site.xml
      All required configuration for YARN

Hadoop Cluster Breakdown
• Namenode Server
  ◦ HDFS Namenode
  ◦ HBase Master
• Secondary Namenode Server
  ◦ HDFS Secondary Namenode
• JobTracker Server
  ◦ MapRed JobTracker
  ◦ MapRed History Server

Hadoop Cluster Breakdown
• Datanode Server
  ◦ HDFS Datanode
  ◦ HBase RegionServer
  ◦ MapRed TaskTracker

Hadoop Hardware Requirements
• Namenode Server
  ◦ Redundant power supplies
  ◦ RAID1 drives
  ◦ Enough memory (16 GB)
• Secondary Namenode Server
  ◦ Almost none

Hadoop Hardware Requirements
• JobTracker Server
  ◦ Redundant power supplies
  ◦ RAID1 drives
  ◦ Enough memory (16 GB)
• Datanode Server
  ◦ Lots of cheap disk (no RAID)
  ◦ Lots of memory (32 GB)
  ◦ Lots of CPU
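
When sizing the datanodes, a back-of-the-envelope calculation helps turn "lots of cheap disk" into a number. The sketch below is one possible estimate, not a sizing rule: `usable_capacity_tb` is a hypothetical helper, the 25% reserve for temporary and non-HDFS data is an assumed figure, and the node counts in the example are invented.

```python
def usable_capacity_tb(nodes, disks_per_node, disk_tb, replication=3, reserve=0.25):
    """Estimate usable HDFS capacity in TB for a set of datanodes.

    reserve is the assumed fraction of raw disk kept free for temporary
    and non-HDFS data; replication is the number of copies per block.
    """
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1 - reserve) / replication


if __name__ == "__main__":
    # Hypothetical cluster: 10 datanodes with 12 x 2 TB disks each.
    print(usable_capacity_tb(10, 12, 2.0))
```

With these assumed inputs, 240 TB of raw disk yields roughly 60 TB of usable space, which shows how quickly replication consumes capacity.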

Hadoop Default Ports
• HDFS
  ◦ 8020: HDFS Namenode
  ◦ 50010: HDFS Datanode FS transfer
• MapRed
  ◦ No defaults
• HBase
  ◦ 60010: Master (web UI)
  ◦ 60020: Regionserver
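
A quick way to verify that these ports are reachable from a client machine is a plain TCP connection test. The sketch below takes its port numbers from the list above; the service labels, the `endpoints` and `port_open` helpers, and the `namenode` hostname are hypothetical, and a successful connection only shows that something is listening, not that the service is healthy.

```python
import socket

# Port numbers as listed on the slide; the labels are illustrative only.
DEFAULT_PORTS = {
    "hdfs-namenode": 8020,
    "hdfs-datanode-transfer": 50010,
    "hbase-master": 60010,
    "hbase-regionserver": 60020,
}


def endpoints(host, ports=DEFAULT_PORTS):
    """Return (service, host, port) tuples to check, sorted by service name."""
    return [(name, host, port) for name, port in sorted(ports.items())]


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, host, port in endpoints("namenode"):
        print(name, port, "open" if port_open(host, port) else "closed")
```

In practice each service runs on a different host, so the check would be run per role rather than against a single machine.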

Flume
• Transports streams of data from point A to point B
• Source
  ◦ Where the data is read from
• Channel
  ◦ How the data is buffered
• Sink
  ◦ Where the data is written
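
The source, channel and sink roles can be pictured with a few lines of code. This is a toy, single-threaded model and not how Flume is implemented: the `Channel` class and `run_agent` function are hypothetical, and a real agent runs sources and sinks concurrently with transactional hand-offs.

```python
from collections import deque


class Channel:
    """Bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        """Accept an event, or return False when the buffer is full."""
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def take(self):
        """Return the oldest buffered event, or None when empty."""
        return self.events.popleft() if self.events else None


def run_agent(source_lines, channel, sink):
    """Move every line from the source through the channel into the sink."""
    for line in source_lines:
        while not channel.put(line):
            sink.append(channel.take())  # drain one event to free space
    while (event := channel.take()) is not None:
        sink.append(event)
    return sink


if __name__ == "__main__":
    print(run_agent(["a", "b", "c"], Channel(capacity=2), []))
```

The channel decouples the two ends: the source can keep accepting data while the sink is slow, up to the channel's capacity.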

Flume
• Flume is fault tolerant
• Sources keep a pointer to their position
  ◦ With some exceptions, most sources are in a known state
• Channels can be fault tolerant
  ◦ A channel written to disk can recover from where it left off
• Sinks can be redundant
  ◦ More than one sink for the same data
  ◦ Data is serialised and deduplicated using Avro

Flume
• Configuration files
  ◦ /etc/flume-ng/conf/flume.conf
    ▪ Defines the agent configuration with source, channel, sink
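
An agent definition in flume.conf names its sources, channels and sinks, then sets properties on each and wires them together. The sketch below renders a minimal single-agent file; `render_flume_conf` is a hypothetical helper, and the component names and the netcat/memory/logger types are illustrative choices, not a recommended production setup.

```python
def render_flume_conf(agent="agent", source="r1", channel="c1", sink="k1"):
    """Render a minimal flume.conf wiring one source, channel and sink."""
    lines = [
        f"{agent}.sources = {source}",
        f"{agent}.channels = {channel}",
        f"{agent}.sinks = {sink}",
        # Source: listens for lines of text on a local TCP port.
        f"{agent}.sources.{source}.type = netcat",
        f"{agent}.sources.{source}.bind = localhost",
        f"{agent}.sources.{source}.port = 44444",
        f"{agent}.sources.{source}.channels = {channel}",
        # Channel: in-memory buffer (not durable across restarts).
        f"{agent}.channels.{channel}.type = memory",
        # Sink: writes events to the agent's log.
        f"{agent}.sinks.{sink}.type = logger",
        f"{agent}.sinks.{sink}.channel = {channel}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_flume_conf(), end="")
```

Note the asymmetry in the wiring: a source lists its `channels` (plural) because it may feed several, while a sink reads from exactly one `channel`.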

Hadoop References
• Hadoop
  ◦ http://hadoop.apache.org/docs/stable/cluster_setup.html
  ◦ http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
  ◦ http://pig.apache.org/docs/r0.7.0/setup.html
  ◦ http://wiki.apache.org/hadoop/NameNodeFailover
• HBase
  ◦ http://hbase.apache.org/book/book.html
• Flume
  ◦ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
