The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is that it partitions data and computation across many (thousands of) hosts and executes application computations in parallel, close to their data.
Files and directories are represented on the NameNode by inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas. The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes (the physical location of file data). The inode data and the list of blocks belonging to each file make up the metadata of the namesystem, called the image. The persistent record of the image stored in the local host’s native file system is called a checkpoint. The NameNode also stores the modification log of the image, called the journal, in the local host’s native file system.
During startup, each DataNode connects to the NameNode and performs the following steps:
1. HANDSHAKE - The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. The namespace ID is assigned when the file system is formatted. It is stored on every node in the cluster, and all nodes in a cluster share the same ID. A newly initialized DataNode without a namespace ID is permitted to join the cluster and receives the cluster’s namespace ID on startup.
2. REGISTRATION - DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
3. BLOCK REPORT - A block report contains the block ID, the generation stamp, and the length of each block replica the DataNode hosts. The first block report is sent immediately after the DataNode registration. Subsequent block reports are sent every hour.
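The handshake and registration rules can be sketched in a few lines of Python. This is a toy model, not HDFS code: the class names, the fields, and the use of a random UUID as the storage ID are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataNode:
    software_version: str
    namespace_id: Optional[str] = None   # None models a newly initialized node
    storage_id: Optional[str] = None     # assigned once, at first registration


@dataclass
class NameNode:
    namespace_id: str
    software_version: str
    registered: set = field(default_factory=set)

    def handshake(self, dn: DataNode) -> bool:
        """Return True if the DataNode may join the cluster."""
        if dn.software_version != self.software_version:
            return False
        if dn.namespace_id is None:
            # A fresh node adopts the cluster's namespace ID.
            dn.namespace_id = self.namespace_id
            return True
        return dn.namespace_id == self.namespace_id

    def register(self, dn: DataNode) -> str:
        """Assign a storage ID on first registration and never change it."""
        if dn.storage_id is None:
            dn.storage_id = uuid.uuid4().hex   # hypothetical ID format
        self.registered.add(dn.storage_id)
        return dn.storage_id
```

A node with a mismatched namespace ID or software version is rejected, while a fresh node joins and keeps the same storage ID across later registrations.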
Heartbeats carry the total storage capacity, the fraction of storage in use, and the number of data transfers in progress.
During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes.
The NameNode does not directly call DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to:
• replicate blocks to other nodes
• remove local block replicas
• re-register or shut down the node
• send an immediate block report
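The heartbeat mechanism described above can be modeled roughly as follows. The class and method names are hypothetical, time is passed in as plain seconds, and the command strings are placeholders; only the three-second interval and ten-minute timeout come from the text.

```python
HEARTBEAT_INTERVAL_S = 3        # default heartbeat interval from the text
DEAD_AFTER_S = 10 * 60          # ten minutes without a heartbeat


class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}      # node id -> time of last heartbeat (seconds)
        self.pending = {}        # node id -> commands queued for the next reply

    def queue_command(self, node_id, command):
        """The NameNode never calls a DataNode; it queues work for the reply."""
        self.pending.setdefault(node_id, []).append(command)

    def heartbeat(self, node_id, now):
        """Record a heartbeat and return the commands piggybacked on the reply."""
        self.last_seen[node_id] = now
        return self.pending.pop(node_id, [])

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is at least ten minutes old."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t >= DEAD_AFTER_S)
```

In this sketch a queued command such as an immediate block report is delivered only when the target node next reports in, which mirrors the reply-driven design.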
User applications access the file system using the HDFS client, a code library that exports the HDFS file system interface. HDFS provides an API that exposes the locations of a file’s blocks. This allows applications like the MapReduce framework to schedule a task where the data is located, thus improving read performance. It also allows an application to set the replication factor of a file.
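To illustrate why exposing block locations helps, here is a toy scheduler that prefers a node already holding the block. It does not use the real HDFS client API; the function name and the node names are invented for this sketch.

```python
def pick_node(block_hosts, free_nodes):
    """Prefer a free node that already stores the block; else any free node.

    block_hosts: set of node names that hold a replica of the block.
    free_nodes: list of node names with spare task capacity.
    Returns (node, is_data_local).
    """
    for node in free_nodes:
        if node in block_hosts:
            return node, True          # the task reads its input locally
    if free_nodes:
        return free_nodes[0], False    # the task must read over the network
    return None, False
```

For example, `pick_node({"dn2", "dn3"}, ["dn1", "dn3"])` selects `dn3` as a data-local choice.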
The journal is a write-ahead commit log for changes to the file system that must be persistent. For each client-initiated transaction, the change is recorded in the journal, and the journal file is flushed and synced before the change is acknowledged to the HDFS client.
The checkpoint file is never changed by the NameNode; it is replaced in its entirety when a new checkpoint is created during restart. During startup, the NameNode initializes the namespace image from the checkpoint and then replays changes from the journal until the image is up to date with the last state of the file system. A new checkpoint and an empty journal are written back to the storage directories before the NameNode starts serving clients.
The NameNode is a multithreaded system and processes requests from multiple clients simultaneously. Saving a transaction to disk becomes a bottleneck, since all other threads need to wait until the synchronous flush-and-sync procedure initiated by one of them is complete. To optimize this process, the NameNode batches multiple transactions initiated by different clients. When one of the NameNode’s threads initiates a flush-and-sync operation, all transactions batched at that time are committed together. The remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation.
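The batching idea can be shown with a simplified, single-threaded model. The `Journal` class and its fields are assumptions for illustration; a real implementation would need locking and an actual flush and fsync where this sketch only increments a counter.

```python
class Journal:
    def __init__(self):
        self.buffer = []        # transactions logged but not yet on disk
        self.synced_txid = 0    # highest transaction id known to be durable
        self.next_txid = 1
        self.sync_calls = 0     # counts the expensive flush-and-sync operations

    def log(self, change):
        """Append a change to the in-memory buffer and return its transaction id."""
        txid = self.next_txid
        self.next_txid += 1
        self.buffer.append((txid, change))
        return txid

    def sync(self, txid):
        """Make txid durable; skip the disk write if an earlier sync covered it."""
        if txid <= self.synced_txid:
            return False        # another caller's sync already saved this one
        self.sync_calls += 1    # stands in for a real flush + fsync
        self.synced_txid = self.buffer[-1][0]   # the whole batch goes out together
        self.buffer.clear()
        return True
```

If three transactions are logged and one caller syncs, the other two callers find their transactions already durable and return without touching the disk.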
The Checkpoint Node periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. Creating periodic checkpoints is one way to protect the file system metadata. Creating a checkpoint also lets the NameNode truncate the tail of the journal when the new checkpoint is uploaded to the NameNode.
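Combining a checkpoint with a journal amounts to replaying the journal records over a copy of the checkpoint. The sketch below assumes a made-up record format of `(operation, path)` pairs and a namespace stored as a plain dictionary.

```python
def make_checkpoint(checkpoint, journal):
    """Fold journal records into a copy of the checkpoint.

    checkpoint: dict mapping path -> entry type (hypothetical format).
    journal: list of (operation, path) records, oldest first.
    Returns (new_checkpoint, empty_journal).
    """
    image = dict(checkpoint)          # the old checkpoint is never modified
    for kind, path in journal:
        if kind == "mkdir":
            image[path] = "dir"
        elif kind == "delete":
            image.pop(path, None)
    return image, []
```

The old checkpoint is left untouched, matching the rule that a checkpoint is replaced in its entirety rather than edited in place.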
Like a Checkpoint Node, the Backup Node is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. If the NameNode fails, the Backup Node’s image in memory and the checkpoint on disk are a record of the latest namespace state. The Backup Node can be viewed as a read-only NameNode.
Suppose a client has to write a block of data. It asks the NameNode where to write the block. Based on the placement and replication policy, the NameNode determines the list of DataNodes that will hold the data and its replicas. The list is ordered so that the pipeline set up between the DataNodes is as short as possible.
Once the acknowledgement of the pipeline setup is received, the client pushes the first packet of data to the first node in the pipeline. Once the packet is written to the first node, it is forwarded along the pipeline to the remaining nodes. When the packet has been written to all the nodes, an acknowledgement is sent back to the HDFS client. The client does not wait for that acknowledgement before writing the next packet; it keeps sending as long as there is room in the outstanding window. Each outgoing packet reduces the room left in the outstanding window, and each incoming acknowledgement increases it.
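The effect of the outstanding window can be seen in a small tick-based simulation. The model is an assumption made for illustration: the client sends at most one packet per tick, and every packet is acknowledged by the whole pipeline a fixed number of ticks after it is sent.

```python
from collections import deque


def send_packets(num_packets, window_size, ack_delay):
    """Simulate a client writing packets with a bounded outstanding window.

    ack_delay is the number of ticks a packet takes to be acknowledged by
    the whole pipeline (a made-up stand-in for network latency).
    Returns the tick at which the last acknowledgement arrives.
    """
    in_flight = deque()   # ack-arrival ticks of unacknowledged packets
    sent = 0
    tick = 0
    while sent < num_packets or in_flight:
        # Incoming acknowledgements free room in the window.
        while in_flight and in_flight[0] <= tick:
            in_flight.popleft()
        # Send one packet per tick while the window has room.
        if sent < num_packets and len(in_flight) < window_size:
            in_flight.append(tick + ack_delay)
            sent += 1
        # Advance time only if there is still work outstanding.
        if sent < num_packets or in_flight:
            tick += 1
    return tick
```

With four packets and an acknowledgement delay of three ticks, a window of one packet finishes at tick 12, while a window of four finishes at tick 6, because the client no longer stalls waiting for each acknowledgement.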
Let us look more closely at how read and write operations are handled at the DataNode level.
The write operation follows a single-writer, multiple-reader model: while one client is writing to a file, no other client is allowed to write to it, but other clients may still read it. A client that opens a file for writing is granted a lease on that file by HDFS. The lease has a soft limit and a hard limit. The client keeps renewing the lease as it writes. If the soft limit expires and the client has not renewed the lease, another client can preempt it. If the hard limit expires, HDFS reclaims the lease and can assign it to another client.
The DataNodes that host the replicas form a pipeline, ordered so as to minimize the total network distance from the client to the last DataNode. A data block is written to the pipeline as a sequence of packets. Packets are first buffered at the client; once a packet buffer is full, it is pushed to the pipeline.
HDFS does not guarantee that data is visible to other clients until the file being written is closed. To make the data visible earlier, the client can invoke the hflush operation, which pushes the current packet to the pipeline immediately without waiting for the buffer to fill.
To ensure data integrity, checksums are calculated and stored on the DataNode in a separate metadata file. When a client creates an HDFS file, it computes the checksum sequence for each block and sends it along with the data to the DataNodes. When a client reads a block from a DataNode, it recomputes the checksums and compares them with the checksums stored on the DataNode. If there is a mismatch, the integrity check fails and the client reads the data from another DataNode.
When a client opens a file for reading, it obtains the list of DataNodes that hold replicas of the file’s blocks, ordered by their distance from the client. It first tries to read from the closest replica. If that fails for any reason (a failed integrity check or a node that is down), it tries the next replica. The next part describes how the NameNode identifies the node closest to the client.
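The read path with checksum verification and failover can be sketched as below. This is a simplified model: `zlib.crc32` stands in for the checksum algorithm, one checksum covers a whole block, and a replica of `None` represents a node that is down. The function names are hypothetical.

```python
import zlib


def write_replica(data):
    """Store the data together with its checksum, as a DataNode would."""
    return {"data": data, "checksum": zlib.crc32(data)}


def read_block(replicas):
    """Try replicas in order (closest first) and skip any that fail.

    replicas: list of (node_name, replica_or_None) pairs.
    Returns (node_name, data) for the first replica that verifies.
    """
    for name, replica in replicas:
        if replica is None:
            continue                      # node is down, try the next one
        if zlib.crc32(replica["data"]) == replica["checksum"]:
            return name, replica["data"]
        # Checksum mismatch: the replica is corrupt, fall through.
    raise IOError("no valid replica found")
```

A corrupted closest replica and an unreachable second node are both skipped, and the read succeeds from the third replica.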
In general, clusters do not have a flat topology. Nodes are spread across multiple racks. All the nodes in a rack are connected through a switch, and the racks are connected to each other through another switch, so the network forms a tree hierarchy. Given the address of a DataNode, the NameNode can identify the rack to which it belongs. The distance between two nodes is the sum of their distances to their closest common ancestor.
Replica placement policy matters for both reliability and read/write performance, and it is configurable. Under the default policy, HDFS tries to place the first replica on the rack where the writer is located, the second and third replicas on a rack different from the writer’s, and any further replicas on random nodes. The policy ensures that no DataNode holds more than one replica of a block and that no rack holds more than two replicas of the same block, provided there are enough racks and nodes.
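The distance rule and the placement constraints can be expressed compactly. The sketch assumes each node is named by a path such as `/rack1/dn1`, where every path component is one level of the tree; this naming scheme and the function names are illustrative assumptions.

```python
from collections import Counter


def distance(a, b):
    """Hops between two nodes: sum of their distances to the common ancestor."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_by_distance(client, replicas):
    """Order replica locations from closest to farthest from the client."""
    return sorted(replicas, key=lambda r: distance(client, r))


def placement_ok(replicas):
    """Check: at most one replica per node and at most two per rack."""
    racks = Counter(r.strip("/").split("/")[0] for r in replicas)
    one_per_node = len(set(replicas)) == len(replicas)
    return one_per_node and max(racks.values(), default=0) <= 2
```

Under this model a node is at distance 0 from itself, 2 from another node in the same rack, and 4 from a node in a different rack.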
A block can become over-replicated or under-replicated, for example because a node fails and later recovers. If a block becomes over-replicated, the NameNode chooses the DataNode from which to remove a replica. In doing so, it makes sure that removing the replica does not reduce the number of racks that hold the block. When a block becomes under-replicated, the NameNode places it in a priority queue of blocks waiting for replication. The queue is prioritized by how far each block is from its replication factor: blocks with only one replica have the highest priority, and blocks with more than two thirds of their replication factor have the lowest priority. A background thread repeatedly scans the head of the queue to decide where to place new replicas. New replicas are placed according to the same policy as new blocks.
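The replication queue can be sketched with a heap. The three priority levels below follow the description in the text; the numeric values, the tie-breaking by block ID, and the input tuple format are assumptions for illustration.

```python
import heapq


def priority(live_replicas, replication_factor):
    """Lower number means more urgent."""
    if live_replicas <= 1:
        return 0          # only one replica left: highest priority
    if live_replicas * 3 > replication_factor * 2:
        return 2          # more than two thirds of the target: least urgent
    return 1


def replication_order(blocks):
    """Return under-replicated block IDs in the order they would be handled.

    blocks: iterable of (block_id, live_replicas, replication_factor).
    """
    heap = [(priority(live, rf), block_id)
            for block_id, live, rf in blocks if live < rf]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Blocks that already meet their replication factor never enter the queue, and a block with a single surviving replica is always handled first.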
So far, the block placement policy has not taken into account the disk usage on individual DataNodes. As a result, some DataNodes can become heavily utilized while others remain underutilized. To correct this, a balancer runs in the background and computes the utilization ratio of each node and of the entire cluster. If the utilization of a DataNode exceeds the utilization of the cluster by more than a threshold value, the balancer moves blocks from that node to other nodes in the cluster. For each move it consults the NameNode for the new location of the block. To maximize throughput, it applies additional policies that limit inter-rack data transfer.
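The balancer's first step, deciding which nodes are out of balance, can be sketched as follows. The 10% default threshold and the symmetric treatment of underutilized nodes as move targets are assumptions of this sketch, as are the function name and input format.

```python
def plan_moves(nodes, threshold=0.10):
    """Classify nodes against the cluster-wide utilization.

    nodes: dict mapping node name -> (used_bytes, capacity_bytes).
    Returns (cluster_utilization, over_utilized, under_utilized).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    cluster_util = total_used / total_cap
    over = sorted(n for n, (u, c) in nodes.items()
                  if u / c - cluster_util > threshold)
    under = sorted(n for n, (u, c) in nodes.items()
                   if cluster_util - u / c > threshold)
    return cluster_util, over, under
```

Blocks would then be moved from nodes in the first list toward nodes in the second until every node is within the threshold of the cluster average.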
Hadoop Distributed File System (HDFS): Behind the Scenes
HDFS ARCHITECTURE
• Name Node
• Data Node
• Task Tracker
• Job Tracker
• Image and Journal
• HDFS Client
• Checkpoint Node
• Backup Node
[Diagram: HDFS architecture. Labels: Backup Node, Image, Journal, Name Node, Job Tracker, Checkpoint, HDFS Client, Task Trackers, DataNode 1, DataNode 2, ..., DataNode N]
[Diagram: NAME NODE. Labels: JobTracker, Journal, Inode, Image, Checkpoint]
Inode - Represents a file or directory on the NameNode and records attributes such as permissions, modification and access times, and namespace and disk space quotas.
Image - The inode data and the list of blocks belonging to each file.
Checkpoint - The persistent record of the image stored in the local host’s native file system.
Journal - Write-ahead commit log for changes to the file system that must be persistent.
FILE I/O OPERATIONS: Single Writer, Multiple Reader
[Diagram: DATA WRITE OPERATION. Labels: Client, Name Node, pipeline setup, DN1, DN2, DN3, DN4, packet1 to packet5, close]
DATA WRITE/READ OPERATION
• Single Writer Multiple Reader Model
• Lease Management (Soft Limit and Hard Limit)
• Pipelining, Buffering and Hflush
• Checksum for data integrity
• Choosing nodes for read operation