4. What is HDFS?
• Hadoop Distributed FileSystem
• Good for:
– Large files
– Streaming data access
• NOT for:
– Lots of small files
– Random access
– Low-latency access
5. Design of HDFS
• GFS-like
– http://research.google.com/archive/gfs.html
• Master-slave design
– Master
• A single NameNode managing the FS metadata
– Slaves
• Multiple DataNodes storing the data
– One more:
• A SecondaryNameNode for checkpointing
7. HDFS Storage
• HDFS files are broken into Blocks
– The basic unit of reading/writing, like a disk block
– Defaults to 64 MB; may be larger in production environments
– Makes HDFS good for large files & high throughput
• A Block may have multiple Replicas
– One block stored at multiple locations
– Makes HDFS storage fault-tolerant
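The block math above can be sketched in a few lines. This is an illustrative Python model, not HDFS source; `split_into_blocks` is a hypothetical helper:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default dfs.block.size: 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file. Only the
    last block may be shorter: it occupies just the space it needs."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 150 MB file maps to two full 64 MB blocks plus one 22 MB block.
print(split_into_blocks(150 * 1024 * 1024))
```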
15. Load FSImage
• Name Directory
– dfs.name.dir: can be multiple dirs
– Check consistency of all name dirs
– Load fsimage file
– Load edit logs
– Save namespace
• Mainly sets up dirs & files properly
16. Check Safemode
• Safemode
– FSImage is loaded, but the locations of blocks are not known yet!
– Exit when the minimal replication condition is met
• dfs.safemode.threshold.pct
• dfs.replication.min
• Default case: 99.9% of blocks have at least 1 replica
– Starts SafeModeMonitor to periodically check whether safe mode can be left
– Leave safe mode manually:
• hadoop dfsadmin -safemode leave
• (or enter it / get its status with: hadoop dfsadmin -safemode enter/get)
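The exit condition boils down to a one-line check. A minimal sketch (`can_leave_safemode` is an illustrative name; the 0.999 default mirrors dfs.safemode.threshold.pct):

```python
def can_leave_safemode(safe_blocks, total_blocks, threshold_pct=0.999):
    """'Safe' blocks have at least dfs.replication.min replicas; the NN
    may leave safe mode once their fraction reaches the threshold."""
    if total_blocks == 0:
        return True  # empty namespace: nothing to wait for
    return safe_blocks / total_blocks >= threshold_pct

print(can_leave_safemode(999, 1000))  # True: exactly 99.9%
print(can_leave_safemode(998, 1000))  # False: still waiting
```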
17. Start Daemons
• HeartbeatMonitor
– Check lost DN & schedule necessary replication
• LeaseManager
– Check lost lease
• ReplicationMonitor
– computeReplicationWork
– computeInvalidateWork
– dfs.replication.interval, default 3 secs
• DecommissionManager
– Check and set node decommissioned
18. Trash Emptier
• /user/{user.name}/.Trash
– fs.trash.interval > 0 to enable
– On delete, files are moved to .Trash
• Trash.Emptier
– Runs every fs.trash.interval minutes
– Deletes checkpoints older than fs.trash.interval minutes
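The Emptier's age check can be sketched as below; a simplified model assuming epoch-second timestamps (`expired_checkpoints` is an illustrative name, not the Hadoop API):

```python
def expired_checkpoints(checkpoint_times, now, trash_interval_min):
    """Checkpoints older than fs.trash.interval minutes are removed
    on the Emptier's next run."""
    cutoff = now - trash_interval_min * 60
    return [t for t in checkpoint_times if t <= cutoff]

# With a 60-minute interval, anything checkpointed more than an hour
# ago is eligible for deletion.
print(expired_checkpoints([1000, 6400, 9000], now=10000, trash_interval_min=60))
```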
20. SecondaryNameNode
• Not Standby/Backup NameNode
– Only for checkpointing
– Though it has a non-realtime copy of the FSImage
• Need as much memory as NN to do the
checkpointing
– Estimation: 1GB for every one million blocks
21. SecondaryNameNode
• Does the checkpointing
– Copy the NN’s fsimage & edit logs
– Merge them into a new fsimage
– Replace the NN’s fsimage with the new one & clean the edit logs
• Timing
– Size of edit log > fs.checkpoint.size (polled every 5 min)
– Every fs.checkpoint.period secs
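The two triggers combine into a simple either-or test. A sketch assuming the usual 1-hour and 64 MB defaults for fs.checkpoint.period and fs.checkpoint.size:

```python
CHECKPOINT_PERIOD_SECS = 3600         # fs.checkpoint.period (1 hour)
CHECKPOINT_SIZE_BYTES = 64 * 1024**2  # fs.checkpoint.size (64 MB)

def should_checkpoint(editlog_bytes, secs_since_last_checkpoint):
    """The SNN checkpoints when either threshold is crossed:
    the edit log grew too large, or too much time has passed."""
    return (editlog_bytes >= CHECKPOINT_SIZE_BYTES
            or secs_since_last_checkpoint >= CHECKPOINT_PERIOD_SECS)

print(should_checkpoint(65 * 1024**2, 10))  # True: edit log too big
print(should_checkpoint(1024, 10))          # False: neither threshold hit
```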
23. DataNode
• Stores data blocks
– Has no knowledge of the FS namespace
• Receives blocks from Clients
• Receives blocks from DataNode peers
– Replication
– Pipeline writing
• Receives delete commands from the NameNode
24. Block Placement Policy
On the cluster level
• replication = 3
– First replica on the node local to the Client
– Second & third replicas on two nodes of the same remote rack
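The default placement can be sketched as a tiny selection routine. This is a simplified model (`choose_targets` is an illustrative name); real HDFS also handles clients running outside the cluster and checks node capacity:

```python
import random

def choose_targets(client_node, client_rack, racks):
    """racks: {rack_name: [nodes]}. Returns three targets:
    the client's own node, then two distinct nodes on one
    randomly chosen remote rack."""
    remote = random.choice([r for r in racks if r != client_rack])
    second, third = random.sample(racks[remote], 2)
    return [client_node, second, third]

racks = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
print(choose_targets("n1", "r1", racks))  # e.g. ['n1', 'n4', 'n3']
```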
25. Block Placement Policy
On one single node
• Write to each disk in turn
– No balancing is considered!
• Skip a disk when it is almost full or has failed
• The DataNode may go offline when disks fail
– dfs.datanode.failed.volumes.tolerated
26. DataNode Startup
• On DN Startup:
– Load data dirs
– Register itself to NameNode
– Start IPC Server
– Start DataXceiverServer
• Transfer blocks
– Run the main loop …
• Start BlockScanner
• Send heartbeats
• Process command from NN
• Send block report
31. Write File
• DFSClient.create
– NameNode.create
• Check existence
• Check permission
• Check and get Lease
• Add new INode to rootDir
32. Write File
• outputStream.write
– Get the DNs to write to from the NN
– Break the bytes into packets
– Write the packets to the first DataNode’s DataXceiver
– Each DN mirrors the packets to the downstream DNs (pipeline)
– When complete, confirm to the NN via blockReceived
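The packetizing and pipeline steps above can be sketched as follows. An illustrative model, not the DFSClient code; the 64 KB packet size is a simplification:

```python
PACKET_SIZE = 64 * 1024  # bytes per packet (simplified)

def packetize(data, packet_size=PACKET_SIZE):
    """The client splits a block's bytes into fixed-size packets."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]

def pipeline_write(packets, pipeline):
    """Each DN stores a packet, then mirrors it to the next DN
    downstream (DN1 -> DN2 -> DN3)."""
    stored = {dn: [] for dn in pipeline}
    for pkt in packets:
        for dn in pipeline:
            stored[dn].append(pkt)
    return stored

pkts = packetize(b"x" * 150_000)
print([len(p) for p in pkts])  # [65536, 65536, 18928]
```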
35. Lease
• What is a lease?
– A write lock for file modification
– No lease is needed for reading files
• Avoids concurrent writes to the same file
– Concurrent writes would cause inconsistent & undefined behavior
36. Lease
• LeaseManager
– Leases are managed in the NN
– When a file is created (or appended), a lease is added
• DFSClient.LeaseChecker
– The client starts a thread to renew its lease periodically
37. Lease Expiration
• Soft Limit
– No renewal for 1 min
– Other clients may compete for the lease
• Hard Limit
– No renewal for 60 min (60 × softLimit)
– No competition for the lease
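The two limits can be sketched as a small state check; `lease_state` is an illustrative name, not the NN's API:

```python
SOFT_LIMIT_SECS = 60                    # 1 minute without renewal
HARD_LIMIT_SECS = 60 * SOFT_LIMIT_SECS  # 60 minutes

def lease_state(secs_since_renewal):
    """Classify a lease by how long it has gone unrenewed."""
    if secs_since_renewal >= HARD_LIMIT_SECS:
        return "hard-expired"  # the NN recovers the lease itself
    if secs_since_renewal >= SOFT_LIMIT_SECS:
        return "soft-expired"  # another client may claim it
    return "valid"

print(lease_state(30))    # valid
print(lease_state(120))   # soft-expired
print(lease_state(3600))  # hard-expired
```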
40. Read File
• DFSClient.open
– Create FSDataInputStream
• Get block locations of file from NN
• FSDataInputStream.read
– Read data from DNs block by block
• Read the data
• Verify the checksum
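HDFS stores a CRC-32 for every 512-byte chunk of a block (io.bytes.per.checksum). A minimal sketch of the verification a reader performs; the function names are illustrative:

```python
import zlib

BYTES_PER_CHECKSUM = 512  # io.bytes.per.checksum default

def crc_chunks(data):
    """One CRC-32 per 512-byte chunk, as stored alongside each block."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]

def verify(data, expected_crcs):
    """A reader recomputes the CRCs and compares against the stored ones."""
    return crc_chunks(data) == expected_crcs

data = b"hdfs" * 300
crcs = crc_chunks(data)
print(verify(data, crcs))                  # True: data intact
print(verify(data[:-1] + b"!", crcs))      # False: last chunk corrupted
```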
41. Desc Repl
• Code Sample
DFSClient dfsclient = …;
dfsclient.setReplication(…, 2);
• Or use the CLI
hadoop fs -setrep -w 2 /path/to/file
43. Desc Repl
• Change the FSName replication factor
• Choose the excess replicas
– The number of racks must not decrease
– Prefer the replica on the node with the least available disk space
• Add them to invalidateSets (the to-be-deleted block set)
• ReplicationMonitor computes the blocks to be deleted for each DN
• On the DN’s next heartbeat, the NN sends it a delete-blocks command
• The DN deletes the specified blocks
• blocksMap is updated when the DN sends its next blockReport
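The excess-replica choice can be sketched as a sort by free space. A deliberately simplified model: it applies only the least-free-space preference and ignores the rack-count constraint the real NN also enforces:

```python
def choose_excess_replicas(replica_free_space, target_replication):
    """replica_free_space: {node: free_bytes}. Drop replicas from the
    nodes with the least available disk space first."""
    excess = len(replica_free_space) - target_replication
    ordered = sorted(replica_free_space, key=replica_free_space.get)
    return ordered[:max(0, excess)]

# Going from 3 replicas down to 2: the fullest node loses its copy.
print(choose_excess_replicas({"dnA": 10, "dnB": 50, "dnC": 30}, 2))  # ['dnA']
```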
44. One DN down
• The DataNode stops sending heartbeats
• NameNode
– The HeartbeatMonitor finds the DN dead during its heartbeat check
– Removes all blocks belonging to that DN
– Updates neededReplications (the set of blocks needing one or more replications)
– The ReplicationMonitor computes the blocks to be replicated by each DN
– On the next DN heartbeat, the NameNode sends a replicate-blocks command
• DataNode
– Replicates the blocks
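The neededReplications update can be sketched as a scan over the block map. An illustrative model (`blocks_needing_replication` is a made-up name; the real NN updates incrementally, not by full scan):

```python
def blocks_needing_replication(blocks_map, dead_node, replication=3):
    """blocks_map: {block_id: set(nodes)}. After a DN dies, any block
    whose live replica count drops below the target needs replication."""
    needed = []
    for blk, nodes in blocks_map.items():
        live = nodes - {dead_node}
        if len(live) < replication:
            needed.append(blk)
    return needed

bm = {"b1": {"d1", "d2", "d3"}, "b2": {"d2", "d3", "d4"}}
print(blocks_needing_replication(bm, "d1"))  # ['b1']
```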
46. High Availability
• NameNode SPOF
– The NameNode holds all the metadata
– If the NN crashes, the whole cluster is unavailable
• Though the fsimage can be recovered from the SNN
– It is not an up-to-date fsimage
• HA solutions are needed
48. HA - DRBD
• DRBD (http://www.drbd.org)
– Block devices designed as a building block to form high-availability (HA) clusters
– Like a network-based RAID-1
• Use DRBD to back up the NN’s fsimage & edit logs
– A cold backup for the NN
– Restarting the NN takes no more than 10 minutes
49. HA - DRBD
• Mirror one of the NN’s name dirs to a remote node
– All name dirs have identical content
• When the NN fails
– Copy the mirrored name dir to all name dirs
– Restart the NN
– All of this takes no more than 20 mins
51. HA - AvatarNode
• Complete Hot Standby
– NFS for storage of fsimage and editlogs
– The Standby node continuously consumes transactions from the edit logs on NFS
– DataNodes send messages to both the primary and the standby node
• Fast Switchover
– Less than a minute
52. HA - AvatarNode
• Active-Standby Pair
– Coordinated via ZooKeeper
– Failover in a few seconds
– The Client retrieves block locations from the Primary
– A wrapper over a NameNode (Active or Standby)
• Active AvatarNode
– Writes the transaction log to an NFS filer
• Standby AvatarNode
– Reads/consumes transactions from the NFS filer
– Processes all block-location messages from the DataNodes
– Keeps the latest metadata in memory
53. HA - AvatarNode
• Four steps to failover
– Wipe ZooKeeper entry. Clients will know the failover
is in progress. (0 seconds)
– Stop the primary NameNode. Last bits of data will be
flushed to Transaction Log and it will die. (Seconds)
– Switch Standby to Primary. It will consume the rest of
the Transaction log and get out of SafeMode ready to
serve traffic. (Seconds)
– Update the entry in ZooKeeper. All the clients waiting
for failover will pick up the new connection (0 seconds)
• After: Start the first node in the Standby Mode
– Takes a while, but the cluster is up and running
56. HA - BackupNode
• The NN synchronously streams its transaction log to the BackupNode
• The BackupNode applies the transaction log to its in-memory and on-disk image
• The BN always commits to disk before acknowledging success to the NN
• If the BN restarts, it has to catch up with the NN
• Clients still retrieve block locations from the NN; DataNodes send their block-location messages to the NN
58. Tools - Balancer
• Re-balancing is needed
– When a new node is added to the cluster
• bin/start-balancer.sh
– Moves blocks from over-utilized nodes to under-utilized nodes
• dfs.balance.bandwidthPerSec
– Controls the impact on business traffic
• -t <threshold>
– Default 10%
– Stops when each node’s difference from the average utilization is less than the threshold
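The threshold test can be sketched as a comparison against the cluster average. An illustrative model (`classify_utilization` is a made-up name; the real Balancer iterates, moving blocks until every node is within the threshold):

```python
def classify_utilization(used_pct, threshold=10.0):
    """used_pct: {node: disk utilization in %}. Nodes more than
    `threshold` percentage points above the cluster average are
    over-utilized; more than `threshold` below, under-utilized."""
    avg = sum(used_pct.values()) / len(used_pct)
    over = sorted(n for n, u in used_pct.items() if u > avg + threshold)
    under = sorted(n for n, u in used_pct.items() if u < avg - threshold)
    return over, under

# Average is 50%: node a (90%) is over, node c (10%) is under.
print(classify_utilization({"a": 90, "b": 50, "c": 10}))
```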
60. Tools - Distcp
• Inter-cluster copy
– hadoop distcp -i -pp -log /logdir hdfs://srcip/srcpath/ /destpath
– Uses map-reduce (actually maps only) to run the copy in a distributed fashion
• Also a fast copy within the same cluster
62. Hadoop Future
• Short-circuit local reads
– dfs.client.read.shortcircuit = true
– Available in hadoop-1.x or cdh3u4
• Native checksums (HDFS-2080)
• BlockReader keepalive to DN (HDFS-941)
• “Zero-copy read” support (HDFS-3051)
• NN HA (HDFS-3042)
• HDFS Federation
• HDFS RAID
63. References
• Tom White, Hadoop: The Definitive Guide
• http://hadoop.apache.org/hdfs/
• Hadoop WiKi – HDFS
– http://wiki.apache.org/hadoop/HDFS
• Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, The
Google File System
– http://research.google.com/archive/gfs.html
• Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert
Chansler , The Hadoop Distributed File System
– http://storageconference.org/2010/Papers/MSST/Shvachko.pdf