Introduction to HDFS
Hadoop Distributed File System
What is HDFS?
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications. It employs a Master and Slave
architecture to implement a distributed file system that provides high-
performance access to data across highly scalable Hadoop clusters.
Architecture of HDFS
Components in HDFS Architecture.
★ Client.
★ Datanode.
★ Namenode.
★ Backup Node.
★ Replication Management.
★ Rack awareness.
★ Read and Write operation.
What is NameNode?
NameNode in HDFS Architecture is also known as the Master node.
The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas and other
details. The NameNode maintains and manages the slave nodes and assigns tasks to
them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Tasks of the NameNode :
● Manages the file system namespace.
● Regulates clients’ access to files.
● Executes file system operations such as naming, opening and closing
files/directories.
● Receives a heartbeat and block report from every DataNode in the
Hadoop cluster.
● Takes care of the replication factor of all the blocks.
What is DataNode?
DataNode in HDFS Architecture is also known as the Slave node.
In the Hadoop HDFS Architecture, DataNodes store the actual data in HDFS. They
perform read and write operations as per the client’s requests. DataNodes
can be deployed on commodity hardware.
Tasks of the DataNode :
● Creates, deletes and replicates block replicas according to the
instructions of the NameNode.
● Manages the data storage of the system.
● Sends heartbeats to the NameNode to report the health of
HDFS.
What is Backup Node?
In Hadoop, the Backup node keeps an in-memory, up-to-date
copy of the file system namespace.
It is always synchronized with the active NameNode state.
What is Replication Management?
Block replication provides fault tolerance.
If one copy is inaccessible or corrupted, we can read the
data from another copy. The number of copies or replicas of
each block of a file in HDFS Architecture is called the
replication factor.
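As an aside (not from the original slides), the replication factor of a file can be inspected or changed from the command line; the path below is hypothetical:
$ hadoop fs -setrep -w 2 /user/hadoop/sample.csv # change a file’s replication factor and wait for it to take effect
$ hadoop fs -ls /user/hadoop/sample.csv # the second column of the listing shows the current replication factor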
What is Rack Awareness?
Rack Awareness in Hadoop is the concept of choosing
DataNodes based on rack information.
The NameNode obtains rack information by maintaining the
rack IDs of each DataNode.
HDFS Read and Write operation.
Write Operation : When a client wants to write a file to
HDFS, it contacts the NameNode for metadata. The
NameNode responds with the number of blocks, their
locations, replicas and other details.
Read Operation : To read from HDFS, the client first
contacts the NameNode for metadata. The NameNode
returns the names and locations of the file’s blocks to the client.
HDFS Design Concept
Important components in HDFS
1. NameNode.
2. DataNode.
3. Blocks
NameNode
Name Node is the single point of contact for accessing files in HDFS, and it determines
the block IDs and locations for data access. So, Name Node plays the Master role in the
Master/Slaves Architecture, whereas Data Nodes act as slaves. File System metadata
is stored on the Name Node.
File System Metadata mainly contains file names, file permissions and the locations of
each block of the files. Thus, the metadata is relatively small in size and fits into the main
memory of a computer machine. So, it is stored in the main memory of the Name Node to
allow fast access.
Important components of the Name Node :
FsImage : a file on the Name Node’s Local File System containing the entire HDFS file
system namespace (including the mapping of blocks to files and file system properties).
Editlog : a transaction log residing on the Name Node’s Local File System that
contains a record/entry for every change that occurs to the File System Metadata.
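As a side note (not part of the original slides), Hadoop ships offline viewers that can dump both structures to XML for inspection; the file names below are placeholders:
$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml # Offline Image Viewer: dump the FsImage
$ hdfs oev -p XML -i edits_0000000000000000001-0000000000000000042 -o edits.xml # Offline Edits Viewer: dump an edit log segment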
DataNode
Data Nodes are the slave part of the Master/Slaves Architecture, and they are where the
actual HDFS files are stored, in the form of fixed-size chunks of data called
blocks.
Data Nodes serve read and write requests from clients on HDFS files and also perform
block creation, replication and deletion.
Data Nodes Failure Recovery
Each data node on a cluster periodically sends a heartbeat message to the name
node which is used by the name node to discover the data node failures based on
missing heartbeats.
The name node marks data nodes without recent heartbeats as dead, and does not
dispatch any new I/O requests to them. Because data located at a dead data node is
no longer available to HDFS, data node death may cause the replication factor of
some blocks to fall below their specified values. The name node constantly tracks
which blocks must be re-replicated, and initiates replication whenever necessary.
Thus, all the blocks on a dead data node are re-replicated on other live data nodes,
and the replication factor returns to normal.
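For example, the following administrative command (an illustration, not from the original slides) reports which data nodes the name node currently considers live or dead:
$ hdfs dfsadmin -report # prints cluster capacity plus the list of live and dead datanodes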
HDFS Blocks
HDFS is a block-structured file system. Each HDFS file is broken into blocks of
fixed size, usually 128 MB, which are stored across various data nodes on the
cluster. Each of these blocks is stored as a separate file on the local file system of the
data nodes (commodity machines on the cluster).
Thus, to access a file on HDFS, multiple data nodes need to be referenced, and
the list of the data nodes which need to be accessed is determined by the file
system metadata stored on the Name Node.
So, any HDFS client trying to access/read an HDFS file will get the block information
from the Name Node first, and then, based on the block IDs and locations, the data will
be read from the corresponding data nodes/computer machines on the cluster.
HDFS’s fsck command is useful for getting the file and block details of the file
system :
$ hadoop fsck / -files -blocks
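The block size itself is controlled by the dfs.blocksize property in hdfs-site.xml; the snippet below is an illustrative sketch (134217728 bytes = 128 MB):
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>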
Command Line interface in
HDFS
The command line is one of the simplest interfaces to the Hadoop Distributed File
System. The basic HDFS File System Commands are similar to
UNIX file system commands; a representative sample is shown below.
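A few common commands (a small illustrative sample; the paths are hypothetical):
$ hadoop fs -ls /user/hadoop # list a directory
$ hadoop fs -mkdir /user/hadoop/dir1 # create a directory
$ hadoop fs -put localfile.txt /user/hadoop/ # copy a local file into HDFS
$ hadoop fs -get /user/hadoop/localfile.txt . # copy a file from HDFS to the local file system
$ hadoop fs -cat /user/hadoop/localfile.txt # print a file’s contents
$ hadoop fs -rm /user/hadoop/localfile.txt # delete a file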
Hadoop File System
The Hadoop Distributed File System (HDFS) is the
primary data storage system used by Hadoop
applications. It employs a NameNode and DataNode
architecture to implement a distributed file system that
provides high-performance access to data across highly
scalable Hadoop clusters.
HDFS exposes a file system namespace and allows user
data to be stored in files. Internally, a file is split into one
or more blocks and these blocks are stored in a set of
DataNodes.
The Hadoop distributed file system (HDFS) is a distributed,
scalable, and portable file-system written in Java for the Hadoop
framework. A Hadoop cluster nominally has a single NameNode
plus a cluster of DataNodes. Each DataNode serves up blocks of
data over the network using a block protocol specific to HDFS. The
basic HDFS architecture is represented in the following image.
● The NameNode keeps all metadata information about where the data is stored, the location of the data files, how
the files are split across DataNodes, etc. HDFS stores large files (typically in the range of gigabytes to
terabytes) across multiple machines, called DataNodes. The files are split into blocks (usually 64 MB or 128 MB)
and stored on a series of DataNodes (which DataNodes are used for each file is managed by the NameNode). The
blocks of data are also replicated (usually three times) on other DataNodes so that in case of hardware failures,
clients can find the data on another server. The information about the location of the data blocks and the replicas
is also managed by the NameNode. Data Nodes can talk to each other to rebalance data, to move copies
around, and to keep the replication of data high.
● The HDFS file system also includes a secondary NameNode, a name which is misleading, as it might be
interpreted as a back-up for the NameNode. In fact, this secondary NameNode connects on a regular
basis to the primary NameNode and builds snapshots of all the directory information managed by the latter, which the
system then saves to local or remote directories. These snapshots can later be used to restart a failed primary
NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-
to-date directory structure.
● One of the issues related to the HDFS architecture is the fact that the NameNode is the single point for storage and
management of metadata, and so it can become a bottleneck when dealing with a huge number of files,
especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain
extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some known issues in
HDFS, namely, the small file issue, the scalability problem, the Single Point of Failure (SPoF), and the bottleneck of
huge metadata requests.
Java Interface with
HDFS
● Interfaces are derived from real-world
scenarios, with the main purpose of using an
object according to strict rules.
● Java interfaces have the same behaviour: they set
strict rules on how to interact with objects.
● Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.
● The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in
Hadoop, and there are several concrete implementations.
● Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API.
● The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide
filesystem operations.
● By exposing its file system interface as a Java API, Hadoop makes it awkward for non-Java applications to
access HDFS.
● The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact
with HDFS.
● Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data
transfers if possible.
● There are two ways of accessing HDFS over HTTP : directly, where the HDFS daemons serve HTTP
requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual
DistributedFileSystem API.
● In the first case, the embedded web servers in the NameNode and DataNodes act as WebHDFS
endpoints.
● File metadata operations are handled by the namenode, while file read (and write) operations are sent first
to the namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data.
● The second case of accessing HDFS over HTTP relies on one or more standalone proxy servers.
● All traffic to the cluster passes through the proxy, so the client never accesses the namenode or datanode
directly.
● This allows for stricter firewall and bandwidth-limiting policies to be put in place.
● The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can access
both using webhdfs (or swebhdfs) URIs.
● The HttpFS proxy is started independently of the namenode and datanode daemons, using the httpfs.sh
script, and by default listens on a different port (14000); a couple of illustrative requests are shown below.
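As an illustration (not part of the original slides), WebHDFS requests against a NameNode web port of 50070 could look like the following; the host and path are placeholders:
$ curl -i "http://namenode:50070/webhdfs/v1/user/hadoop/sample.csv?op=GETFILESTATUS" # file metadata, handled by the namenode
$ curl -i -L "http://namenode:50070/webhdfs/v1/user/hadoop/sample.csv?op=OPEN" # read a file; -L follows the redirect to a datanode
$ curl -i "http://httpfs-host:14000/webhdfs/v1/user/hadoop?op=LISTSTATUS&user.name=hadoop" # the same API via an HttpFS proxy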
There are two ways to use the Java API with HDFS :
1. Reading Data Using the FileSystem API.
2. Writing Data Using the FileSystem API.
Reading Data Using the FileSystem API
● A file in a Hadoop filesystem is represented by a
Hadoop Path object. FileSystem is a general filesystem
API, so the first step is to retrieve an instance for the
filesystem we want to use—HDFS, in this case. There
are several static factory methods for getting a
FileSystem instance.
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
public static LocalFileSystem getLocal(Configuration conf) throws IOException
● A Configuration object encapsulates a client or server
configuration, which is set using configuration files read
from the classpath, such as core-site.xml.
● With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file. The first
method uses a default buffer size of 4 KB. The second one gives the user an option to specify the buffer
size.
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
FSDataInputStream
● The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io
class. This class is a specialization of java.io.DataInputStream with support for random access, so you
can read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable {
}
public interface PositionedReadable {
public int read(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
}
● The read() method reads up to length bytes from the given position in the file into the buffer at the given offset in the buffer.
● The return value is the number of bytes actually read; callers should check this value, as it may be less than length.
● The readFully() methods will read length bytes into the buffer, unless the end of the file is reached, in which case an EOFException is
thrown.
● Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly.
● You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing
a large number of seeks.
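Putting the pieces above together, here is a minimal sketch (the class name and paths are only examples) that opens an HDFS file, streams it to standard output, then seeks back to the start and streams it again:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://namenode/user/hadoop/sample.csv
    Configuration conf = new Configuration(); // picks up core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(URI.create(uri), conf); // one of the static factory methods shown earlier
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri)); // default 4 KB buffer
      IOUtils.copyBytes(in, System.out, 4096, false); // first pass
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false); // second pass, demonstrating random access
    } finally {
      IOUtils.closeStream(in);
    }
  }
}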
Writing Data Using the FileSystem API
● The FileSystem class has a number of methods for
creating a file. The simplest is the method that takes a
Path object for the file to be created and returns an
output stream to write to.
public FSDataOutputStream create(Path f) throws IOException
● There are overloaded versions of this method that
allow you to specify whether to forcibly overwrite
existing files, the replication factor of the file, the buffer
size to use when writing the file, the block size for the
file, and file permissions.
FSDataOutputStream
● The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has
a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
}
}
● FileSystem also provides a mkdirs() method, which creates all of the necessary parent directories if they
don’t already exist and returns true if it is successful.
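As a companion sketch (again, the class name and paths are only examples), copying a local file into HDFS with create() and IOUtils:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class FileCopyToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0]; // local file to copy
    String dst = args[1]; // e.g. hdfs://namenode/user/hadoop/sample.csv
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst)); // returns an FSDataOutputStream
    IOUtils.copyBytes(in, out, 4096, true); // 4 KB buffer; true closes both streams when done
  }
}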
DataFlow in HDFS
There are three types of DataFlow in HDFS.
1. Anatomy of File Read in Hadoop
2. Anatomy of a File Write in Hadoop
3. Coherency Model
Anatomy of File Read in Hadoop
● Consider a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1. Each rack has 4
nodes, and they are uniquely identified as R1N1, R1N2 and so on. The replication factor is set to 3 and the HDFS block size
is set to 64 MB (128 MB in Hadoop v2) by default.
Background :
Name node stores the HDFS metadata like file location, permissions, etc. in files called FsImage and edit logs. Files are
stored in HDFS as blocks. The block locations are not saved in any file. Instead, they are gathered from the data nodes every time
the cluster is started, and this information is held in the namenode’s memory.
Replication: Assuming the replication factor is 3: when a file is written from a data node (say R1N1), Hadoop attempts to save the
first replica on the same data node (R1N1). The second replica is written to a node (R2N2) in a different rack (R2).
The third replica is written to another node (R2N1) in the same rack (R2) where the second replica was saved.
Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is
the sum of their distances to their closest common ancestor. The levels can be like; “Data Center” > “Rack” > “Node”. Example;
‘/d1/r1/n1’ is a representation for a node named n1 on rack r1 in data center d1. Distance calculation has 4 possible
scenarios as;
distance(/d1/r1/n1, /d1/r1/n1) = 0 [Processes on same node]
distance(/d1/r1/n1, /d1/r1/n2) = 2 [different node in same rack]
distance(/d1/r1/n1, /d1/r2/n3) = 4 [node in different rack in same data center]
distance(/d1/r1/n1, /d2/r3/n4) = 6 [node in different data center]
Anatomy :
● Consider a sample.csv file of size 192 MB to be saved into the cluster. The file is divided into 3 blocks of 64 MB each (B1, B2, B3),
and the blocks are stored on different data nodes as shown above. Along with the data, a checksum is stored in each block to ensure that
data reads are done without any errors.
When the cluster is started, the metadata in the name node will look as shown in the fig below.
Anatomy of File Write in
Hadoop
Consider writing a file sample.csv by HDFS client program running on R1N1’s JVM.
First the HDFS client program calls the method create() on a Java class DistributedFileSystem (subclass of FileSystem).
DistributedFileSystem makes an RPC call to the name node to create a new file in the file system namespace. No blocks are associated with the file at this
stage. The name node performs various checks: it ensures that the file doesn’t already exist, and it also checks whether the user has the right
permissions to create the file.
Coherency Model
● A coherency model for a filesystem
describes the data visibility of reads
and writes for a file.
● HDFS trades off some POSIX requirements for performance, so
some operations may behave differently than you expect them to.
● After creating a file, it is visible in the filesystem namespace, as
expected.
● However, any content written to the file is not guaranteed to be
visible, even if the stream is flushed; the file may appear to have a
length of zero.
● Once more than a block’s worth of data has been written, the
first block will be visible to new readers.
● This is true of subsequent blocks, too: it is always the current
block being written that is not visible to other readers.
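A small illustrative sketch (the path is hypothetical) of how hflush() interacts with this coherency model:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CoherencyDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/coherency-demo.txt"));
    out.write("Hello, HDFS".getBytes("UTF-8"));
    // Before the flush, a concurrent reader may see the file with a length of zero.
    out.hflush(); // after hflush(), the bytes written so far are visible to new readers
    // out.hsync(); // hsync() additionally forces the data to disk on the datanodes
    out.close(); // close() performs an implicit hflush() and publishes the final length
  }
}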
Parallel Copying with distcp
Hadoop comes with a useful
program called distcp for
copying data to and from
Hadoop file systems in
parallel.
● The HDFS access patterns that we have seen so far focus on single-threaded access. It’s possible to act on a collection of files — by
specifying file globs, for example — but for efficient parallel processing of these files, you would have to write a program yourself.
Hadoop comes with a useful program called distcp for copying data to and from Hadoop file systems in parallel.
● One use for distcp is as an efficient replacement for hadoop fs -cp. For example, you can copy one file to another with:
○ % hadoop distcp file1 file2
● You can also copy directories:
○ % hadoop distcp dir1 dir2
● If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there.
● You can specify multiple source paths, and all will be copied to the destination.
● If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/dir1.
● If this isn’t what you want, you can supply the -overwrite option to keep the same directory structure and force files to be
overwritten.
● You can also update only the files that have changed using the -update option. This is best shown with an example.
● If we changed a file in the dir1 subtree, we could synchronize the change with dir2 by running:
● % hadoop distcp -update dir1 dir2
● distcp is implemented as a MapReduce job where the work of copying is done by the maps that
run in parallel across the cluster.
● There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the
same amount of data by bucketing files into roughly equal allocations.
● By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp. A very
common use case for distcp is for transferring data between two HDFS clusters.
● For example, the following creates a backup of the first cluster’s /foo directory on the second:
% hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo
● The -delete flag causes distcp to delete any files or directories from the destination that are not present in the
source, and -p means that file status attributes like permissions, block size, and replication are preserved.
● You can run distcp with no arguments to see precise usage instructions.
● If the two clusters are running incompatible versions of HDFS, then you can use the webhdfs protocol to distcp
between them:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo
Hadoop Archives
● Hadoop archives are special format archives. A Hadoop archive maps to a file system directory. A Hadoop archive always has a
*.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files.
The _index file contains the name of the files that are part of the archive and the location within the part files.
● Hadoop Archives (HAR) offers an effective way to deal with the small files problem.
1. The Problem with small files.
2. What is HAR?
3. Limitation of HAR.
The Problem with Small Files.
Hadoop works best with big files and small files are handled inefficiently in HDFS.
As we know, Namenode holds the metadata information in memory for all the files stored in HDFS.
Let’s say we have a file in HDFS which is 1 GB in size; the Namenode will store metadata information about the file – like the file name,
creator, creation time stamp, blocks, permissions, etc.
Now assume we decide to split this 1 GB file into 1000 pieces and store all 1000 “small” files in HDFS.
Now the Namenode has to store metadata information for 1000 small files in memory.
This is not very efficient – first, it takes up a lot of memory, and second, the Namenode will soon become a bottleneck as it tries
to manage a lot of metadata.
What is HAR?
Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks
efficiently, and hence HAR can be used to tackle the small files problem in Hadoop.
HAR is created from a collection of files and the archiving tool (a simple command)
will run a MapReduce job to process the input files in parallel and create an archive file.
HAR Command
hadoop archive -archiveName myhar.har /input/location /output/location
● Once a .har file is created, you can do a listing on the .har file and you will see it is made up of index files and part files.
● Part files are nothing but the original files concatenated together into a big file.
● Index files are lookup files which are used to look up the individual small files inside the big part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-0
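As an aside, the files inside the archive can still be listed or read through the har:// filesystem scheme (the path below continues the example above):
hadoop fs -ls -R har:///output/location/myhar.har # recursively list the original files stored inside the archive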
Limitation of HAR Files.
● Once an archive file is created, you cannot update it to add or remove files. In other words, .har files are immutable.
● The archive file will have a copy of all the original files, so once a .har is created it will take up as much space as the original files.
Don’t mistake .har files for compressed files.
● When a .har file is given as input to a MapReduce job, the small files inside the .har file will be processed individually by
separate mappers, which is inefficient.
Any Question?
Thank You.

More Related Content

What's hot

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systemsViet-Trung TRAN
 

What's hot (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
 

Similar to Introduction to HDFS

big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing datapreetik9044
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxsunithachphd
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemsrikanthhadoop
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemNilaNila16
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiUnmesh Baile
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxSakthiVinoth78
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 

Similar to Introduction to HDFS (20)

module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptx
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop
HadoopHadoop
Hadoop
 
Hdfs
HdfsHdfs
Hdfs
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hdfs
HdfsHdfs
Hdfs
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbai
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptx
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Unit 1
Unit 1Unit 1
Unit 1
 

Recently uploaded

WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxAnnaArtyushina1
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 

Recently uploaded (20)

WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 

Introduction to HDFS

  • 1. Introduction to HDFS Hadoop Distributed File System
  • 2. What is HDFS? The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a Master and Slave architecture to implement a distributed file system that provides high- performance access to data across highly scalable Hadoop clusters.
  • 3. Architecture of HDFS Components in HDFS Architecture. ★ Client. ★ Datanode. ★ Namenode. ★ Backup Node. ★ Replication Management. ★ Rack awareness. ★ Read and Write operation.
  • 4. What is NameNode? NameNode in HDFS Architecture is also known as Master node. HDFS Namenode stores metadata i.e. number of data blocks, replicas and other details. NameNode maintains and manages the slave nodes, and assigns tasks to them. It should deploy on reliable hardware as it is the centerpiece of HDFS. Task of Namenode : ● Manage file system namespace. ● Regulates client’s access to files. ● It also executes file system execution such as naming, closing, opening files/directories. ● All DataNode sends a Heartbeat and block report to the NameNode in the Hadoop cluster. ● NameNode is also responsible for taking care of the Replication Factor of all the blocks.
  • 5. What is DataNode? DataNode in HDFS Architecture is also known as Slave node. In Hadoop HDFS Architecture, DataNode stores actual data in HDFS. It performs read and write operation as per the request of the client. DataNodes can deploy on commodity hardware. Task of DataNode : ● Block replica creation, deletion and replication according to the instruction of NameNode. ● DataNode manages data storage of the system. ● DataNode send heartbeat to the NameNode to report the health of HDFS.
  • 6. What is Backup Node? In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system namespace. It is always synchronized with the active NameNode state. What is Replication Management? Block replication provides fault tolerance. If one copy is not accessible and corrupted, we can read data from other copy. The number of copies or replicas of each block of a file in HDFS Architecture is replication factor. Backup Node
  • 7. What is Rack Awareness Rack Awareness in Hadoop is the concept that chooses DataNodes based on the rack information. NameNode achieves rack information by maintaining the rack ids of each DataNode. HDFS Read and Write operation. Write Operation : When a client wants to write a file to HDFS, it communicates to namenode for metadata. The Namenode responds with a number of blocks, their location, replicas and other details. Read Operation : To read from HDFS, the first client communicates to namenode for metadata. A client comes out of NameNode with the name of files and its location. Rack Awareness R/W
  • 8. HDFS Design Concept Important components in HDFS 1. NameNode. 2. DataNode. 3. Blocks
  • 9. NameNode Name Node is the single point of contact for accessing files in HDFS and it determines the block ids and locations for data access. So, NameNode plays a Master role in Master/Slaves Architecture whereas Data Nodes acts as slaves. File System metadata is stored on Name Node. File System Metadata contains majorly, File names, File Permissions and locations of each block of files. Thus, Metadata is relatively small in size and fits into Main Memory of a computer machine. So, it is stored in Main Memory of Namenode to allow fast access. Important components of name node. FsImage : It is a file on Name Node’s Local File System containing entire HDFS file system namespace (including mapping of blocks to files and file system properties) Editlog : It is a Transaction Log residing on Name Node’s Local File System and contains a record/entry for every change that occurs to File System Metadata.
  • 10. DataNode Data Nodes are the slaves part of Master/Slaves Architecture and on which actual HDFS files are stored in the form of fixed size chunks of data which are called blocks. Data Nodes serve read and write requests of clients on HDFS files and also perform block creation, replication and deletions. Data Nodes Failure Recovery Each data node on a cluster periodically sends a heartbeat message to the name node which is used by the name node to discover the data node failures based on missing heartbeats. The name node marks data nodes without recent heartbeats as dead, and does not dispatch any new I/O requests to them. Because data located at a dead data node is no longer available to HDFS, data node death may cause the replication factor of some blocks to fall below their specified values. The name node constantly tracks which blocks must be re-replicated, and initiates replication whenever necessary. Thus all the blocks on a dead data node are re-replicated on other live data nodes and replication factor remains normal.
  • 11. HDFS Blocks HDFS is a block structured file system. Each HDFS file is broken into blocks of fixed size usually 128 MB which are stored across various data nodes on the cluster. Each of these blocks is stored as a separate file on local file system on data nodes (Commodity machines on cluster). Thus to access a file on HDFS, multiple data nodes need to be referenced and the list of the data nodes which need to be accessed is determined by the file system metadata stored on Name Node. So, any HDFS client trying to access/read a HDFS file, will get block information from Name Node first, and then based on the block id’s and locations, data will be read from corresponding data nodes/computer machines on cluster. HDFS’s fsck command is a useful to get the files and blocks details of file system. - $ hadoop fsck / -files -blocks
  • 12. Command Line interface in HDFS Command Line is one of the simplest interface to Hadoop Distributed File System. Below are the basic HDFS File System Commands which are similar to UNIX file system commands
  • 13. Hadoop File System The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. A Hadoop cluster has nominally a single NameNode plus a cluster of DataNodes,. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The basic HDFS architecture is represented in the following image.
  • 14. ● The NameNode keeps all metadata information about where the data is stored, the location of the data files, how the files are splitted across DataNodes, etc.. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines, called DataNodes. The files are splitted in blocks (usually 64 MB or 128 MB) and stored on a serie of DataNodes (what DataNodes are used for each file, is managed by the NameNode). The blocks of data are also replicated (usually three times) on other DataNodes so that in case hardware failures, clients can find the data on another server. The information about the location of the data blocks and the replicas is also managed by the NameNode. Data Nodes can talk to each other to re balance data, to move copies around, and to keep the replication of data high. ● The HDFS file system also includes a secondary NameNode, a name which is misleading, as it might be interpreted as a back-up for the NameNode. In fact, this secondary NameNode regularly connects on the regular basis to the primary NameNode and builds snapshots of all directory information managed by the latter, which the system then saves to local or remote directories. These snapshots can later be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions, then to edit the log to create an up- to-date directory structure. ● What of the issues related to HDFS architecture is the fact that the NameNode is the single point for storage and management of metadata, and so it can become a bottleneck when dealing with a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some issues in HDFS, namely, small file issue, scalability problem, Single Point of Failure (SPoF), and bottleneck in huge metadata request.
  • 15. Java Interface with HDFS ● Interfaces are derived from real-world scenarios with the main purpose to use an object by strict rules. ● Java interfaces have same behaviour: they set strict rules on how to interact with objects.
  • 16. ● Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. ● The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations. ● Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. ● The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. ● By exposing its file system interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. ● The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. ● Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible. ● There are two ways of accessing HDFS over HTTP : directly, where the HDFS daemons serve HTTP requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual DistributedFileSystem API. ● In the First Case, the embedded web servers in the NameNode and DataNodes act as WebHDFS endpoints. ● File metadata operations are handled by the namenode, while file read (and write) operations are sent first to the namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data.
  • 17. ● In Second Case of accessing HDFS over HTTP relies on one or more standalone proxy servers. ● All traffic to the cluster passes through the proxy, so the client never accesses the namenode or datanode directly. ● This allows for stricter firewall and bandwidth-limiting policies to be put in place. ● The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can access both using webhdfs (or swebhdfs) URIs. ● The HttpFS proxy is started independently of the namenode and datanode daemons, using the httpfs.sh script, and by default listens on a different port number 14000. There are two ways to use JAVA API in HDFS : 1. Reading Data Using the FileSystem API. 2. Writing Data Using the FileSystem API.
  • 18. Reading Data Using the FileSystem API ● A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use—HDFS, in this case. There are several static factory methods for getting a FileSystem instance. public static FileSystem get(Configuration conf) throws IOException public static FileSystem get(URI uri, Configuration conf) throws IOException public static FileSystem get(URI uri, Configuration conf, String user) throws IOException public static LocalFileSystem getLocal(Configuration conf) throws IOException ● A Configuration object encapsulates a client or server configuration, which is set using configuration files read from the classpath, such as core-site.xml.
  • 19. ● With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file.The first method uses a default buffer size of 4 KB.The second one gives an option to user to specify the buffer size. public FSDataInputStream open(Path f) throws IOException public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException FSDataInputStream ● The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream: package org.apache.hadoop.fs; public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable { }
  • 20. public interface PositionedReadable { public int read(long position, byte[] buffer, int offset, int length) throws IOException; public void readFully(long position, byte[] buffer, int offset, int length) throws IOException; public void readFully(long position, byte[] buffer) throws IOException; } ● The read() method reads up to length bytes from the given position in the file into the buffer at the given offset in the buffer. ● The return value is the number of bytes actually read; callers should check this value, as it may be less than length. ● The readFully() methods will read length bytes into the buffer, unless the end of the file is reached, in which case an EOFException is thrown. ● Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. ● You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.
  • 21. Writing Data Using the FileSystem API ● The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to. public FSDataOutputStream create(Path f) throws IOException ● There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.
• 22. FSDataOutputStream ● The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file: package org.apache.hadoop.fs; public class FSDataOutputStream extends DataOutputStream implements Syncable { public long getPos() throws IOException { // implementation elided } } ● The create() methods also create any parent directories of the file to be written that don't already exist. ● The related mkdirs() method creates all of the necessary parent directories if they don't already exist and returns true if it is successful.
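A minimal write-path sketch (the local source path and destination URI are placeholders): it creates a file in HDFS with create() and streams a local file into it.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileCopyToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];                             // e.g. /tmp/sample.csv
    String dst = args[1];                                  // e.g. hdfs://namenode:8020/user/hdfs/sample.csv
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    FSDataOutputStream out = fs.create(new Path(dst));     // overwrites by default; missing parent dirs are created
    IOUtils.copyBytes(in, out, 4096, true);                // true: close both streams when the copy finishes
  }
}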
• 23. Data Flow in HDFS There are three aspects of data flow in HDFS. 1. Anatomy of a File Read in Hadoop 2. Anatomy of a File Write in Hadoop 3. Coherency Model
• 24. Anatomy of File Read in Hadoop ● Consider a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1. Each rack has 4 nodes, uniquely identified as R1N1, R1N2 and so on. The replication factor is set to 3 and the default HDFS block size is 64 MB (128 MB in Hadoop v2).
• 25. Background: The name node stores HDFS metadata such as file names, permissions and block information in the FSImage and edit log files. Files are stored in HDFS as blocks. The block-to-datanode mapping, however, is not saved in any file; instead, it is rebuilt from the block reports sent by the data nodes every time the cluster is started, and it is held in the namenode's memory. Replication: Assuming the replication factor is 3, when a file is written from a data node (say R1N1), Hadoop attempts to save the first replica on the same data node (R1N1). The second replica is written to a node (R2N2) in a different rack (R2). The third replica is written to another node (R2N1) in the same rack (R2) where the second replica was saved. Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. The levels are: Data Center > Rack > Node. For example, '/d1/r1/n1' represents a node named n1 on rack r1 in data center d1. Distance calculation has 4 possible scenarios, as sketched in the example below: distance(/d1/r1/n1, /d1/r1/n1) = 0 [processes on the same node] distance(/d1/r1/n1, /d1/r1/n2) = 2 [different node on the same rack] distance(/d1/r1/n1, /d1/r2/n3) = 4 [node on a different rack in the same data center] distance(/d1/r1/n1, /d2/r3/n4) = 6 [node in a different data center]
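As a small illustration of this distance formula (not part of the Hadoop API; the class and method names are just for this sketch, and the paths follow the /datacenter/rack/node convention above):

public class NetworkDistance {
  // distance = (levels from a up to the closest common ancestor) + (levels from b up to it)
  static int distance(String a, String b) {
    String[] pa = a.split("/");
    String[] pb = b.split("/");
    int i = 0;
    while (i < pa.length && i < pb.length && pa[i].equals(pb[i])) {
      i++;                                                // walk down the tree while the path components match
    }
    return (pa.length - i) + (pb.length - i);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6
  }
}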
• 26. Anatomy: ● Consider a sample.csv file of size 192 MB to be saved into the cluster. The file is divided into 3 blocks of 64 MB each (B1, B2, B3), and the blocks are stored on different data nodes. Along with the data, a checksum is stored with each block to ensure that reads complete without errors. When the cluster is started, this block metadata is built up in the name node's memory.
• 27. Anatomy of File Write in Hadoop Consider writing a file sample.csv from an HDFS client program running on R1N1's JVM. First the HDFS client program calls the method create() on the Java class DistributedFileSystem (a subclass of FileSystem). DistributedFileSystem makes an RPC call to the name node to create a new file in the filesystem namespace. No blocks are associated with the file at this stage. The name node performs various checks: it ensures that the file doesn't already exist and that the user has the right permissions to create the file.
  • 28. Coherency Model ● A coherency model for a filesystem describes the data visibility of reads and writes for a file.
• 29. ● HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to. ● After creating a file, it is visible in the filesystem namespace, as expected. ● However, any content written to the file is not guaranteed to be visible, even if the stream is flushed, so the file can appear to have a length of zero. ● Once more than a block's worth of data has been written, the first block will be visible to new readers. ● This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers.
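A minimal sketch of this behaviour (the path is a placeholder, and the commented output assumes the write is far smaller than a block):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherencyDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/coherency-demo");              // placeholder path

    FSDataOutputStream out = fs.create(p);
    System.out.println(fs.getFileStatus(p).getLen());      // 0: the file is visible in the namespace, but empty
    out.write("some content".getBytes("UTF-8"));
    out.flush();
    System.out.println(fs.getFileStatus(p).getLen());      // still 0: flushed data is not guaranteed to be visible
    out.close();
    System.out.println(fs.getFileStatus(p).getLen());      // after close, the written bytes are visible to readers
  }
}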
• 30. Parallel Copying with distcp Hadoop comes with a useful program called distcp for copying data to and from Hadoop filesystems in parallel.
• 31. ● The HDFS access patterns that we have seen so far focus on single-threaded access. It's possible to act on a collection of files (by specifying file globs, for example), but for efficient parallel processing of these files you would have to write a program yourself. Hadoop comes with a useful program called distcp for copying data to and from Hadoop filesystems in parallel. ● One use for distcp is as an efficient replacement for hadoop fs -cp. For example, you can copy one file to another with: ○ % hadoop distcp file1 file2 ● You can also copy directories: ○ % hadoop distcp dir1 dir2 ● If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there. ● You can specify multiple source paths, and all will be copied to the destination. ● If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/dir1. ● If this isn't what you want, you can supply the -overwrite option to keep the same directory structure and force files to be overwritten. ● You can also update only the files that have changed using the -update option. This is best shown with an example: if we changed a file in the dir1 subtree, we could synchronize the change with dir2 by running: ○ % hadoop distcp -update dir1 dir2 ● distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster.
• 32. ● There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. ● By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp, as sketched below. ● A very common use case for distcp is transferring data between two HDFS clusters. ● For example, the following creates a backup of the first cluster's /foo directory on the second: % hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo ● The -delete flag causes distcp to delete any files or directories from the destination that are not present in the source, and -p means that file status attributes like permissions, block size, and replication are preserved. ● You can run distcp with no arguments to see precise usage instructions. ● If the two clusters are running incompatible versions of HDFS, then you can use the webhdfs protocol to distcp between them: % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo
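For instance, a sketch of limiting the copy to 40 maps with -m (the host names are the same placeholders used in the commands above):

% hadoop distcp -m 40 -update hdfs://namenode1/foo hdfs://namenode2/foo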
• 34. ● Hadoop archives are special-format archives. A Hadoop archive maps to a filesystem directory and always has a *.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex files) and data (part-*) files. The _index file contains the names of the files that are part of the archive and their locations within the part files. ● Hadoop Archives (HAR) offer an effective way to deal with the small files problem. 1. The problem with small files. 2. What is HAR? 3. Limitations of HAR. The Problem with Small Files: Hadoop works best with big files; small files are handled inefficiently in HDFS. As we know, the Namenode holds the metadata information in memory for all the files stored in HDFS. Let's say we have a file in HDFS which is 1 GB in size; the Namenode will store metadata information about the file, like file name, creator, creation timestamp, blocks, permissions, etc.
• 35. Now assume we decide to split this 1 GB file into 1000 pieces and store all 1000 "small" files in HDFS. Now the Namenode has to store metadata information for 1000 small files in memory. This is not very efficient: first, it takes up a lot of memory, and second, the Namenode will soon become a bottleneck as it tries to manage so much metadata. What is HAR? Hadoop Archives, or HAR, is an archiving facility that packs files into HDFS blocks efficiently, and hence HAR can be used to tackle the small files problem in Hadoop. A HAR is created from a collection of files, and the archiving tool (a simple command) runs a MapReduce job to process the input files in parallel and create the archive file.
• 36. HAR Command hadoop archive -archiveName myhar.har /input/location /output/location ● Once a .har file is created, you can do a listing on the .har file and you will see it is made up of index files and part files. ● Part files are nothing but the original files concatenated together into a big file. ● Index files are lookup files used to locate the individual small files inside the big part files. hadoop fs -ls /output/location/myhar.har /output/location/myhar.har/_index /output/location/myhar.har/_masterindex /output/location/myhar.har/part-0
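To see the archive as a filesystem rather than as index and part files, the har URI scheme can be used; a sketch, reusing the output path from the command above:

% hadoop fs -ls har:///output/location/myhar.har

This lists the original files stored inside the archive instead of the _index, _masterindex and part-* files.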
• 37. Limitations of HAR Files ● Once an archive file is created, you cannot update it to add or remove files. In other words, .har files are immutable. ● An archive file holds a copy of all the original files, so once a .har is created it takes as much space as the original files. Don't mistake .har files for compressed files. ● When a .har file is given as input to a MapReduce job, the small files inside the .har file will still be processed individually by separate mappers, which is inefficient.