4. What is HDFS?
• Hadoop Distributed FileSystem
• Good for:
– Large files
– Streaming data access
• NOT for:
– Lots of small files
– Random access
– Low-latency access
5. Design of HDFS
• GFS-like
– http://research.google.com/archive/gfs.html
• Master-slave design
– Master
• A single NameNode managing the FS metadata
– Slaves
• Multiple DataNodes storing the data
– One more:
• A SecondaryNameNode for checkpointing
7. HDFS Storage
• HDFS files are broken into Blocks
– The basic unit of reading/writing, like a disk block
– Defaults to 64 MB; may be larger in production environments
– Makes HDFS good for large files & high throughput
• A Block may have multiple Replicas
– One block stored at multiple locations
– Makes HDFS storage fault-tolerant
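The block math above can be sketched in a few lines. This is an illustrative Python model, not HDFS source; `split_into_blocks` is a hypothetical helper:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default dfs.block.size: 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file. Only the
    last block may be shorter: it occupies just the space it needs."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 150 MB file maps to two full 64 MB blocks plus one 22 MB block.
print(split_into_blocks(150 * 1024 * 1024))
```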
15. Load FSImage
• Name Directory
– dfs.name.dir: can be multiple dirs
– Check consistency of all name dirs
– Load fsimage file
– Load edit logs
– Save namespace
• Mainly sets up dirs & files properly
16. Check Safemode
• Safemode
– FSImage is loaded, but the locations of blocks are not known yet!
– Exit when the minimal replication condition is met
• dfs.safemode.threshold.pct
• dfs.replication.min
• Default case: 99.9% of blocks have at least 1 replica
– Starts SafeModeMonitor to periodically check whether safe mode can be left
– Leave safe mode manually:
• hadoop dfsadmin -safemode leave
• (or enter it / get its status with: hadoop dfsadmin -safemode enter/get)
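The exit condition boils down to a one-line check. A minimal sketch (`can_leave_safemode` is an illustrative name; the 0.999 default mirrors dfs.safemode.threshold.pct):

```python
def can_leave_safemode(safe_blocks, total_blocks, threshold_pct=0.999):
    """'Safe' blocks have at least dfs.replication.min replicas; the NN
    may leave safe mode once their fraction reaches the threshold."""
    if total_blocks == 0:
        return True  # empty namespace: nothing to wait for
    return safe_blocks / total_blocks >= threshold_pct

print(can_leave_safemode(999, 1000))  # True: exactly 99.9%
print(can_leave_safemode(998, 1000))  # False: still waiting
```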
17. Start Daemons
• HeartbeatMonitor
– Check lost DN & schedule necessary replication
• LeaseManager
– Check lost lease
• ReplicationMonitor
– computeReplicationWork
– computeInvalidateWork
– dfs.replication.interval, default 3 secs
• DecommissionManager
– Check and set node decommissioned
18. Trash Emptier
• /user/{user.name}/.Trash
– fs.trash.interval > 0 to enable
– On delete, files are moved to .Trash
• Trash.Emptier
– Runs every fs.trash.interval minutes
– Deletes checkpoints older than fs.trash.interval minutes
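The Emptier's age check can be sketched as below; a simplified model assuming epoch-second timestamps (`expired_checkpoints` is an illustrative name, not the Hadoop API):

```python
def expired_checkpoints(checkpoint_times, now, trash_interval_min):
    """Checkpoints older than fs.trash.interval minutes are removed
    on the Emptier's next run."""
    cutoff = now - trash_interval_min * 60
    return [t for t in checkpoint_times if t <= cutoff]

# With a 60-minute interval, anything checkpointed more than an hour
# ago is eligible for deletion.
print(expired_checkpoints([1000, 6400, 9000], now=10000, trash_interval_min=60))
```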
20. SecondaryNameNode
• Not Standby/Backup NameNode
– Only for checkpointing
– Though it has a non-realtime copy of the FSImage
• Need as much memory as NN to do the
checkpointing
– Estimation: 1GB for every one million blocks
21. SecondaryNameNode
• Does the checkpointing
– Copy the NN’s fsimage & edit logs
– Merge them into a new fsimage
– Replace the NN’s fsimage with the new one & clean the edit logs
• Timing
– Size of edit log > fs.checkpoint.size (polled every 5 min)
– Every fs.checkpoint.period secs
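The two triggers combine into a simple either-or test. A sketch assuming the usual 1-hour and 64 MB defaults for fs.checkpoint.period and fs.checkpoint.size:

```python
CHECKPOINT_PERIOD_SECS = 3600         # fs.checkpoint.period (1 hour)
CHECKPOINT_SIZE_BYTES = 64 * 1024**2  # fs.checkpoint.size (64 MB)

def should_checkpoint(editlog_bytes, secs_since_last_checkpoint):
    """The SNN checkpoints when either threshold is crossed:
    the edit log grew too large, or too much time has passed."""
    return (editlog_bytes >= CHECKPOINT_SIZE_BYTES
            or secs_since_last_checkpoint >= CHECKPOINT_PERIOD_SECS)

print(should_checkpoint(65 * 1024**2, 10))  # True: edit log too big
print(should_checkpoint(1024, 10))          # False: neither threshold hit
```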
23. DataNode
• Stores data blocks
– Has no knowledge of the FS namespace
• Receives blocks from Clients
• Receives blocks from DataNode peers
– Replication
– Pipeline writing
• Receives delete commands from the NameNode
24. Block Placement Policy
On the cluster level
• replication = 3
– First replica on the node local to the Client
– Second & third replicas on two nodes of the same remote rack
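The default placement can be sketched as a tiny selection routine. This is a simplified model (`choose_targets` is an illustrative name); real HDFS also handles clients running outside the cluster and checks node capacity:

```python
import random

def choose_targets(client_node, client_rack, racks):
    """racks: {rack_name: [nodes]}. Returns three targets:
    the client's own node, then two distinct nodes on one
    randomly chosen remote rack."""
    remote = random.choice([r for r in racks if r != client_rack])
    second, third = random.sample(racks[remote], 2)
    return [client_node, second, third]

racks = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
print(choose_targets("n1", "r1", racks))  # e.g. ['n1', 'n4', 'n3']
```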
25. Block Placement Policy
On one single node
• Write to each disk in turn
– No balancing is considered!
• Skip a disk when it is almost full or has failed
• The DataNode may go offline when disks fail
– dfs.datanode.failed.volumes.tolerated
26. DataNode Startup
• On DN Startup:
– Load data dirs
– Register itself to NameNode
– Start IPC Server
– Start DataXceiverServer
• Transfer blocks
– Run the main loop …
• Start BlockScanner
• Send heartbeats
• Process command from NN
• Send block report
31. Write File
• DFSClient.create
– NameNode.create
• Check existence
• Check permission
• Check and get Lease
• Add new INode to rootDir
32. Write File
• outputStream.write
– Get the DNs to write to from the NN
– Break the bytes into packets
– Write the packets to the first DataNode’s DataXceiver
– Each DN mirrors the packets to the downstream DNs (pipeline)
– When complete, confirm to the NN via blockReceived
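The packetizing and pipeline steps above can be sketched as follows. An illustrative model, not the DFSClient code; the 64 KB packet size is a simplification:

```python
PACKET_SIZE = 64 * 1024  # bytes per packet (simplified)

def packetize(data, packet_size=PACKET_SIZE):
    """The client splits a block's bytes into fixed-size packets."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]

def pipeline_write(packets, pipeline):
    """Each DN stores a packet, then mirrors it to the next DN
    downstream (DN1 -> DN2 -> DN3)."""
    stored = {dn: [] for dn in pipeline}
    for pkt in packets:
        for dn in pipeline:
            stored[dn].append(pkt)
    return stored

pkts = packetize(b"x" * 150_000)
print([len(p) for p in pkts])  # [65536, 65536, 18928]
```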
35. Lease
• What is a lease?
– A write lock for file modification
– No lease is needed for reading files
• Avoids concurrent writes to the same file
– Concurrent writes would cause inconsistent & undefined behavior
36. Lease
• LeaseManager
– Leases are managed in the NN
– When a file is created (or appended), a lease is added
• DFSClient.LeaseChecker
– The client starts a thread to renew its lease periodically
37. Lease Expiration
• Soft Limit
– No renewal for 1 min
– Other clients may compete for the lease
• Hard Limit
– No renewal for 60 min (60 × softLimit)
– No competition for the lease
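The two limits can be sketched as a small state check; `lease_state` is an illustrative name, not the NN's API:

```python
SOFT_LIMIT_SECS = 60                    # 1 minute without renewal
HARD_LIMIT_SECS = 60 * SOFT_LIMIT_SECS  # 60 minutes

def lease_state(secs_since_renewal):
    """Classify a lease by how long it has gone unrenewed."""
    if secs_since_renewal >= HARD_LIMIT_SECS:
        return "hard-expired"  # the NN recovers the lease itself
    if secs_since_renewal >= SOFT_LIMIT_SECS:
        return "soft-expired"  # another client may claim it
    return "valid"

print(lease_state(30))    # valid
print(lease_state(120))   # soft-expired
print(lease_state(3600))  # hard-expired
```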
40. Read File
• DFSClient.open
– Create FSDataInputStream
• Get block locations of file from NN
• FSDataInputStream.read
– Read data from DNs block by block
• Read the data
• Verify the checksum
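HDFS stores a CRC-32 for every 512-byte chunk of a block (io.bytes.per.checksum). A minimal sketch of the verification a reader performs; the function names are illustrative:

```python
import zlib

BYTES_PER_CHECKSUM = 512  # io.bytes.per.checksum default

def crc_chunks(data):
    """One CRC-32 per 512-byte chunk, as stored alongside each block."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

def verify(data, expected_crcs):
    """A reader recomputes the CRCs and compares against the stored ones."""
    return crc_chunks(data) == expected_crcs

data = b"hdfs" * 300
crcs = crc_chunks(data)
print(verify(data, crcs))                  # True: data intact
print(verify(data[:-1] + b"!", crcs))      # False: last chunk corrupted
```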
41. Desc Repl
• Code Sample
DFSClient dfsclient = …;
dfsclient.setReplication(…, 2);
• Or use the CLI
hadoop fs -setrep -w 2 /path/to/file
43. Desc Repl
• Change the FSName replication factor
• Choose the excess replicas
– The number of racks must not decrease
– Prefer the replica on the node with the least available disk space
• Add them to invalidateSets (the to-be-deleted block set)
• ReplicationMonitor computes the blocks to be deleted for each DN
• On the DN’s next heartbeat, the NN sends it a delete-blocks command
• The DN deletes the specified blocks
• blocksMap is updated when the DN sends its next blockReport
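The excess-replica choice can be sketched as a sort by free space. A deliberately simplified model: it applies only the least-free-space preference and ignores the rack-count constraint the real NN also enforces:

```python
def choose_excess_replicas(replica_free_space, target_replication):
    """replica_free_space: {node: free_bytes}. Drop replicas from the
    nodes with the least available disk space first."""
    excess = len(replica_free_space) - target_replication
    ordered = sorted(replica_free_space, key=replica_free_space.get)
    return ordered[:max(0, excess)]

# Going from 3 replicas down to 2: the fullest node loses its copy.
print(choose_excess_replicas({"dnA": 10, "dnB": 50, "dnC": 30}, 2))  # ['dnA']
```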
44. One DN down
• The DataNode stops sending heartbeats
• NameNode
– The HeartbeatMonitor finds the DN dead during its heartbeat check
– Removes all blocks belonging to that DN
– Updates neededReplications (the set of blocks needing one or more replications)
– The ReplicationMonitor computes the blocks to be replicated by each DN
– On the next DN heartbeat, the NameNode sends a replicate-blocks command
• DataNode
– Replicates the blocks
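The neededReplications update can be sketched as a scan over the block map. An illustrative model (`blocks_needing_replication` is a made-up name; the real NN updates incrementally, not by full scan):

```python
def blocks_needing_replication(blocks_map, dead_node, replication=3):
    """blocks_map: {block_id: set(nodes)}. After a DN dies, any block
    whose live replica count drops below the target needs replication."""
    needed = []
    for blk, nodes in blocks_map.items():
        live = nodes - {dead_node}
        if len(live) < replication:
            needed.append(blk)
    return needed

bm = {"b1": {"d1", "d2", "d3"}, "b2": {"d2", "d3", "d4"}}
print(blocks_needing_replication(bm, "d1"))  # ['b1']
```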
46. High Availability
• NameNode SPOF
– The NameNode holds all the metadata
– If the NN crashes, the whole cluster is unavailable
• Though the fsimage can be recovered from the SNN
– It is not an up-to-date fsimage
• HA solutions are needed
48. HA - DRBD
• DRBD (http://www.drbd.org)
– Block devices designed as a building block to form high-availability (HA) clusters
– Like a network-based RAID-1
• Use DRBD to back up the NN’s fsimage & edit logs
– A cold backup for the NN
– Restarting the NN takes no more than 10 minutes
49. HA - DRBD
• Mirror one of the NN’s name dirs to a remote node
– All name dirs have identical content
• When the NN fails
– Copy the mirrored name dir to all name dirs
– Restart the NN
– All of this takes no more than 20 mins
51. HA - AvatarNode
• Complete Hot Standby
– NFS for storage of fsimage and editlogs
– The Standby node continuously consumes transactions from the edit logs on NFS
– DataNodes send messages to both the primary and the standby node
• Fast Switchover
– Less than a minute
52. HA - AvatarNode
• Active-Standby Pair
– Coordinated via ZooKeeper
– Failover in a few seconds
– The Client retrieves block locations from the Primary
– A wrapper over a NameNode (Active or Standby)
• Active AvatarNode
– Writes the transaction log to an NFS filer
• Standby AvatarNode
– Reads/consumes transactions from the NFS filer
– Processes all block-location messages from the DataNodes
– Keeps the latest metadata in memory
53. HA - AvatarNode
• Four steps to failover
– Wipe ZooKeeper entry. Clients will know the failover
is in progress. (0 seconds)
– Stop the primary NameNode. Last bits of data will be
flushed to Transaction Log and it will die. (Seconds)
– Switch Standby to Primary. It will consume the rest of
the Transaction log and get out of SafeMode ready to
serve traffic. (Seconds)
– Update the entry in ZooKeeper. All the clients waiting
for failover will pick up the new connection (0 seconds)
• After: Start the first node in the Standby Mode
– Takes a while, but the cluster is up and running
56. HA - BackupNode
• The NN synchronously streams its transaction log to the BackupNode
• The BackupNode applies the transaction log to its in-memory and on-disk image
• The BN always commits to disk before acknowledging success to the NN
• If the BN restarts, it has to catch up with the NN
• Clients still retrieve block locations from the NN; DataNodes send their block-location messages to the NN
58. Tools - Balancer
• Re-balancing is needed
– When a new node is added to the cluster
• bin/start-balancer.sh
– Moves blocks from over-utilized nodes to under-utilized nodes
• dfs.balance.bandwidthPerSec
– Controls the impact on business traffic
• -t <threshold>
– Default 10%
– Stops when each node’s difference from the average utilization is less than the threshold
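The threshold test can be sketched as a comparison against the cluster average. An illustrative model (`classify_utilization` is a made-up name; the real Balancer iterates, moving blocks until every node is within the threshold):

```python
def classify_utilization(used_pct, threshold=10.0):
    """used_pct: {node: disk utilization in %}. Nodes more than
    `threshold` percentage points above the cluster average are
    over-utilized; more than `threshold` below, under-utilized."""
    avg = sum(used_pct.values()) / len(used_pct)
    over = sorted(n for n, u in used_pct.items() if u > avg + threshold)
    under = sorted(n for n, u in used_pct.items() if u < avg - threshold)
    return over, under

# Average is 50%: node a (90%) is over, node c (10%) is under.
print(classify_utilization({"a": 90, "b": 50, "c": 10}))
```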
60. Tools - Distcp
• Inter-cluster copy
– hadoop distcp -i -pp -log /logdir hdfs://srcip/srcpath/ /destpath
– Uses map-reduce (actually maps only) to run the copy in a distributed fashion
• Also a fast copy within the same cluster
62. Hadoop Future
• Short-circuit local reads
– dfs.client.read.shortcircuit = true
– Available in hadoop-1.x or cdh3u4
• Native checksums (HDFS-2080)
• BlockReader keepalive to DN (HDFS-941)
• “Zero-copy read” support (HDFS-3051)
• NN HA (HDFS-3042)
• HDFS Federation
• HDFS RAID
63. References
• Tom White, Hadoop: The Definitive Guide
• http://hadoop.apache.org/hdfs/
• Hadoop WiKi – HDFS
– http://wiki.apache.org/hadoop/HDFS
• Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, The
Google File System
– http://research.google.com/archive/gfs.html
• Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert
Chansler , The Hadoop Distributed File System
– http://storageconference.org/2010/Papers/MSST/Shvachko.pdf