1. HDFS – Hadoop Distributed File System
HDFS Design and Goal
HDFS components: Namenode, Datanode, Secondary Namenode
HDFS blocks and replication.
Anatomy of a File Read/Write in HDFS
Designed to reliably store very large files across machines in a
large cluster
Data is organized into files and directories
In the storage layer, files are divided into uniform-sized blocks (64MB,
128MB, 256MB) and distributed across cluster nodes
Blocks are replicated to handle hardware failure
The file system keeps checksums of data for corruption detection
HDFS exposes block placement so that computations can be
migrated to the data
HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.
HDFS is a file system rather than a storage device.
HDFS follows many POSIX file system conventions:
- File, directory and sub-directory structure
- Access permissions (owner, group, other) and the concept of a superuser
Hadoop provides many interfaces to its filesystems, and it
generally uses the URI scheme to pick the correct
filesystem instance to communicate with.
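For illustration (a minimal sketch, not from the original slides; the
namenode host and port are placeholders), Java code that picks a
filesystem instance by URI scheme:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FsBySchemeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The "hdfs://" scheme selects the HDFS implementation;
            // "file:///" would select the local filesystem instead.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020/"), conf);
            System.out.println("Connected to: " + fs.getUri());
            fs.close();
        }
    }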
Fault tolerant, scalable, distributed storage system
Very Large Distributed File System
10K nodes, 100 million files, 10 PB: no problem!
Computation moved to data
Data locations exposed so that computations can move to where the data resides
Streaming Data access
Assumes Commodity Hardware
Files are replicated to handle hardware failure
Detect failures and recovers from them
Optimized for Batch Processing
Provides very high aggregate bandwidth
High throughput of data access; streaming access to data
Large files: a typical file is gigabytes to terabytes in size
Support for tens of millions of files
Simple coherency: write-once-read-many access model
Highly fault-tolerant; runs on commodity HW, which can fail frequently
HDFS Design Goals
HDFS Master “Namenode”
Manages the file system namespace
Controls read/write access to files
Manages block replication
Checkpoints the namespace and journals namespace changes for durability and recovery
HDFS Workers “Datanodes”
Serve read/write requests from clients
Perform replication tasks upon instruction by Namenode.
Report blocks and system state.
HDFS Namespace Backup “Secondary Namenode”
Single Namespace for entire cluster
Files are broken up into a sequence of blocks
– For a file all blocks except the last are of the same size.
– Each block replicated on multiple DataNodes
– Write-once-read-many access model
– Clients can only add/append to files; random modification is not allowed
– Clients can find the locations of a file's blocks (see the sketch below)
– Clients access data directly from DataNodes [Q: How is this possible?]
--- User data never flows through the NameNode
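A minimal sketch of how a client finds block locations through the
public FileSystem API (the namenode address and file path are
placeholders):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020/"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt"));
            // One BlockLocation per block: the datanodes hosting its replicas.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(loc);
            }
            fs.close();
        }
    }

With this information the client reads each block directly from a
datanode; the namenode serves only metadata.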
Distributed file system
HDFS has Large block size
Typical 128MB, 256MB, 512MB…
Normal file system blocks are a few kilobytes.
Unlike a file system for a single disk, a file in HDFS that is
smaller than a single block does not occupy a full block's worth
of storage: if the data in a block is only 10MB, it needs only
10MB of space on the local drive, not the full block size.
A file is stored in blocks on various nodes in hadoop cluster.
Provides a complete storage abstraction to the client.
HDFS block placement
HDFS creates several replicas of each data block.
Each and every data block is replicated to multiple nodes across the cluster.
A BlockReport contains all the blocks on a Datanode.
HDFS Blocks continues...
Default is 3 replicas, but settable
Blocks are placed (writes are pipelined):
(Rack-awareness is covered on a later slide)
– First replica on the same node as the client
– Second replica on a node in a different (remote) rack
– Third replica on a different node in the same remote rack
Clients read from closest replica.
If the replication for a block drops below target, it is re-replicated.
Pluggable policy for placing block replicas.
Why are blocks in HDFS so large?
Minimizes the cost of seeks.
Makes the time to transfer data from disk significantly longer than
the time to seek to the start of the block; for example, with a 10 ms
seek time and a 100 MB/s transfer rate, keeping seeks to about 1% of
transfer time calls for blocks of roughly 100 MB.
Designed so that blocks can be processed independently.
Benefit of Block abstraction
A file can be larger than any single disk in the network.
Simplify the storage subsystem.
Providing fault tolerance and availability.
Independent processing, failover and distribution of the
computation. [Q: How is it independent?]
Data blocks are checked with CRC32
Clients compute a checksum per 512 bytes
DataNode stores the checksum
Client retrieves the data and checksum from DataNode
If validation fails, the client reports the corruption and tries other replicas
Corrupted blocks are reported and tolerated.
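For illustration only (this is not HDFS's internal code), a Java sketch
of per-chunk CRC32 checksumming with one checksum per 512 bytes of
data, mirroring the scheme described above:

    import java.util.zip.CRC32;

    public class ChunkChecksumExample {
        static final int BYTES_PER_CHECKSUM = 512;

        // Compute one CRC32 value per 512-byte chunk of the block data.
        static long[] checksums(byte[] data) {
            int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
            long[] sums = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int off = i * BYTES_PER_CHECKSUM;
                int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }
    }

A reader recomputes these values and compares them with the stored
checksums; any mismatch marks that replica as corrupt.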
Each block replica on a DataNode is represented by two files in
the local native filesystem. The first file contains the data itself
and the second file records the block's metadata including
checksums for the data and the generation stamp.
Data integrity is maintained at the block level. [Q: Why at the block
level and not the file level?]
The client receives data together with its checksums, computes the
checksum of every block, and verifies that the corresponding
checksums match. If they do not match, the client can retrieve the
block from another replica; the corrupt replica is reported and deleted.
Checksums are verified after each operation. What about blocks that are
not accessed for a long time? Undetected corruption could accumulate,
so replicas are also checked periodically.
An HDFS cluster consists of a single Namenode, a
master server that manages the file system namespace
and regulates access to files by clients. The HDFS file system
namespace is stored and maintained by the Namenode. The
Namenode maintains its metadata in two binary files in the
namenode's storage directory: fsimage and edits.
The Namenode maintains a 'namespace ID' that is persistently
stored on all nodes of the cluster. The namespace ID is
assigned to the filesystem instance when it is formatted.
The HDFS namespace is a hierarchy of files and directories.
Files and directories are represented on the NameNode by
inodes. Inodes record attributes like permissions, modification
and access times, namespace and disk space quotas.
Meta-data in Memory
– The entire metadata is in main memory for fast access
– No demand paging and I/O wait for meta-data
– Hierarchical file system with directories and files
– List of Blocks for each file
– File attributes, e.g., access time, replication factor
For improved durability, redundant copies of the checkpoint and journal
are typically stored on multiple independent local volumes and at
remote NFS servers.
The inode structures and the lists of blocks that define the
metadata are stored in the file: fsimage
The 'fsimage' file is a persistent checkpoint of the filesystem
metadata. It is loaded at Namenode start-up.
The NameNode records changes to HDFS in a write-ahead
log called the 'Transaction Log', kept in its local native
filesystem in the file: edits
Each transaction is recorded in the Transaction Log, and the journal
file is flushed and synced before the acknowledgment is sent
to the client.
The locations of block replicas are not part of the persistent
checkpoint; the Namenode learns them from DataNode block reports.
The checkpoint file is never changed by the NameNode;
(Will look checkpoint and secondary namenode in later topics)
NameNode Metadata continues…
[what, how and where ]
The NameNode maintains the namespace tree
Maps each datanode to its list of blocks
Receives heartbeats and monitors datanode health
Replicates missing blocks
Records file system changes
Handles authorization and authentication
Name node functions
DataNode is slave daemon to perform the grunt work of the
distributed filesystem—reading and writing HDFS blocks to
actual files on the local filesystem.
The ‘Slave’ stores data in files in its local file system.
The Datanode has no knowledge of the HDFS filesystem namespace.
It stores each block of HDFS data in a separate file.
Clients access the blocks directly from data nodes after
communication with namenode.
Blocks are stored as files in the underlying OS's filesystem. The
Datanode does not create all files in the same directory; it uses a
heuristic for the optimal number of files per directory and creates
subdirectories as needed.
[Read, Write, Report ]
During startup each DataNode connects to the NameNode and
performs a handshake. The purpose of the handshake is to
verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode, the
DataNode automatically shuts down.
Serves read, write requests, performs block creation, deletion,
and replication upon instruction from Namenode
Periodically send heartbeats and block reports to Namenode
A DataNode identifies block replicas in its possession to the
NameNode by sending a block report.
A block report contains the block ID, the generation stamp and the
length for each block replica the server hosts
Data Node functions
During normal operation DataNodes send heartbeats to the NameNode to
confirm that the DataNode is operating and the block replicas it hosts are available.
Heartbeats from a DataNode also carry information about total
storage capacity, fraction of storage in use, and the number of data
transfers currently in progress etc...
These statistics are used for the NameNode's block allocation and load balancing decisions.
The NameNode does not directly send requests to DataNodes. It uses replies
to heartbeats to send instructions to the DataNodes. The instructions include
commands to replicate blocks to other nodes, remove local block replicas,
re-register and send an immediate block report, and shut down the node.
To maintain overall system integrity, it is critical to keep heartbeats
frequent even on big clusters. The NameNode can process thousands of
heartbeats per second without affecting other NameNode operations.
Data node heartbeats
The Secondary NameNode (SNN) is an assistant daemon that
monitors and backs up the state of the NameNode, taking
snapshots of the HDFS metadata at intervals defined by the
cluster configuration.
The NameNode is a single point of failure for a Hadoop
cluster, and the SNN snapshots help minimize the
downtime and loss of data. Nevertheless, a NameNode
failure requires human intervention to reconfigure the
cluster to use the SNN as the primary NameNode.
Secondary Name Node
Secondary Namenode interaction
Will see the check point and recovery in later topics
The SNN periodically merges the namespace image with the edit log to
prevent the edit log from becoming too large.
Anatomy of a File Read in HDFS
One important aspect of this design is that the client contacts datanodes
directly to retrieve data and is guided by the namenode to the best datanode
for each block; there is a direct connection between client and datanode.
On failure: move to the next 'closest' node holding the block.
1. Client connects to the NameNode with file name.
2. The namenode performs various checks to make sure the file
exists, the client has the right permissions, etc.
3. The namenode returns the addresses of the datanodes that have a
copy of each block (locality is considered).
4. For each block, the datanodes are sorted by proximity to the
client; we'll assume the replication level is three, so there are
three candidate nodes per block.
5. The client connects to the first (closest) datanode for the first
block in the file and reads. It then finds the best datanode for the
next block, and so on, until it has read the whole file.
6. The client verifies checksums for each block of data transferred
to it from a datanode.
7. During reading, if the client encounters an error while
communicating with a datanode, or finds a corrupted block, it will
try the next closest datanode for that block.
Anatomy of a File Read in HDFS continue…
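A minimal client-side sketch of the read path using the FileSystem API
(the namenode host and file path are placeholders). The open() call
consults the namenode; the bytes then stream directly from datanodes:

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020/"), conf);
            InputStream in = null;
            try {
                // open() asks the namenode for the file's block locations
                in = fs.open(new Path("/user/demo/input.txt"));
                // data is streamed from the closest datanode for each block
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }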
1. Client connects to the NameNode with file name.
2. The namenode performs various checks to make sure the file doesn't
already exist, the client has the right permissions, etc.
3. NameNode places an entry for the file in its metadata, returns the
block name and list of DataNodes to the client.
4. The list of datanodes forms a pipeline—we’ll assume the replication
level is three, so there are three nodes in the pipeline.
5. The client connects to the first DataNode and starts sending data.
As data is received by the first DataNode, it connects to the second
and starts forwarding the data; similarly, the second datanode stores
each packet and forwards it to the third datanode in the pipeline.
6. A packet is removed from the ack queue only when it has been
acknowledged by all the datanodes in the pipeline.
7. If a datanode fails while data is being written to it, the partial
block on the failed datanode will be deleted if the failed datanode
recovers later on.
8. Client reports to the NameNode when the block is written.
Anatomy of a File Write in HDFS continue…
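A matching client-side sketch of the write path (host, path,
replication factor and block size below are illustrative). create()
registers the file with the namenode; the data then flows through the
datanode pipeline:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020/"), conf);
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"),
                    true, 4096, (short) 3, 128L * 1024 * 1024);
            // data is split into packets and pipelined to three datanodes
            out.writeUTF("hello hdfs");
            // close() flushes and waits for pipeline acknowledgments
            out.close();
        }
    }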
Replication and Rack-awareness
Replication in Hadoop is at the block level .
Default Replication factor is 3 and configurable.
Blocks are replicated for fault tolerance.
A file's replication factor is configurable and can be changed dynamically (see the sketch after this list)
Rack-aware replica placement- Goal: improve reliability, availability and
network bandwidth utilization
Many racks; communication between racks goes through switches.
Network bandwidth between machines on the same rack is greater than
between machines in different racks.
Namenode determines the rack id for each DataNode.
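As a sketch of the dynamic change mentioned above (namenode host and
path are placeholders), the FileSystem API exposes setReplication; the
shell equivalent is hadoop fs -setrep:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode.example.com:8020/"), new Configuration());
            // Raise this file's replication factor to 5; the Namenode
            // schedules the additional copies asynchronously.
            fs.setReplication(new Path("/user/demo/input.txt"), (short) 5);
            fs.close();
        }
    }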
Replication and Rack-awareness continues…
Replicas are placed: one on a node in the local rack, a second on a
node in a different (remote) rack, and a third on a different node
in that same remote rack.
1/3 of replicas are on one node, 2/3 of replicas are on one rack, and
the other 1/3 are distributed evenly across the remaining racks.
Replica selection for READ operation: HDFS tries to
minimize the bandwidth consumption and latency.
Selection of blocks to process in a MapReduce job takes
advantage of rack-awareness.
Rack-awareness is NOT automatic, and needs to be
configured. By default, all nodes are assumed to be in the same rack.
The necessity for re-replication may arise due to:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor on the block may be increased.
Block under-replication & over-replication is detected by the Namenode.
The Balancer application rebalances blocks to even out datanode utilization.
(Will look at the balancer in a later topic)
HDFS Worst fit with
Low-latency data access
Lots of small files
Transactional access and updates
Multiple writers, arbitrary file modifications
Data is not visible to readers while a file is still being copied in
HDFS assumes write-once, read-many access
Applying HDFS in applications
Sorting: Standard method to test cluster
TeraGen: Generate dummy data
TeraValidate: Validate sort result
hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41
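For context, the full workflow generates input with teragen before
sorting and checks the result with teravalidate. Companion commands,
assuming the same example jar and paths as above (the report directory
name is a placeholder; teragen writes 100-byte rows, so 100,000,000
rows is roughly 10GB):

    hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
    teragen 100000000 hdfs://ubuntu/10GdataUnsorted
    hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
    teravalidate /10GDataSorted41 /10GValidateReport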
Reference: Hadoop: The Definitive Guide, Third Edition, by Tom White (O'Reilly).