Introduction to HDFS
Hadoop Distributed File System
What is HDFS?
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications. It employs a Master and Slave
architecture to implement a distributed file system that provides high-
performance access to data across highly scalable Hadoop clusters.
Architecture of HDFS
Components in HDFS Architecture.
★ Client.
★ Datanode.
★ Namenode.
★ Backup Node.
★ Replication Management.
★ Rack awareness.
★ Read and Write operation.
What is NameNode?
NameNode in HDFS Architecture is also known as the Master node.
The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas and other
details. The NameNode maintains and manages the slave nodes and assigns tasks to
them. It should be deployed on reliable hardware, as it is the centerpiece of HDFS.
Tasks of the NameNode :
● Manages the file system namespace.
● Regulates clients’ access to files.
● Executes file system operations such as naming, opening and closing
files/directories.
● Receives a heartbeat and block report from every DataNode in the
Hadoop cluster.
● Takes care of the replication factor of all the blocks.
What is DataNode?
DataNode in HDFS Architecture is also known as the Slave node.
In the Hadoop HDFS Architecture, DataNodes store the actual data in HDFS. They
perform read and write operations as per the client’s requests. DataNodes
can be deployed on commodity hardware.
Tasks of the DataNode :
● Creates, deletes and replicates block replicas according to the
instructions of the NameNode.
● Manages the data storage of the system.
● Sends heartbeats to the NameNode to report the health of
HDFS.
What is Backup Node?
In Hadoop, the Backup node keeps an in-memory, up-to-date
copy of the file system namespace.
It is always synchronized with the active NameNode state.
What is Replication Management?
Block replication provides fault tolerance.
If one copy is inaccessible or corrupted, we can read the
data from another copy. The number of copies or replicas of
each block of a file in HDFS Architecture is called the
replication factor.
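As an aside (not from the original slides), the replication factor of a file can be inspected or changed from the command line; the path below is hypothetical:
$ hadoop fs -setrep -w 2 /user/hadoop/sample.csv # change a file’s replication factor and wait for it to take effect
$ hadoop fs -ls /user/hadoop/sample.csv # the second column of the listing shows the current replication factor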
What is Rack Awareness?
Rack Awareness in Hadoop is the concept of choosing
DataNodes based on rack information.
The NameNode obtains rack information by maintaining the
rack IDs of each DataNode.
HDFS Read and Write operation.
Write Operation : When a client wants to write a file to
HDFS, it contacts the NameNode for metadata. The
NameNode responds with the number of blocks, their
locations, replicas and other details.
Read Operation : To read from HDFS, the client first
contacts the NameNode for metadata. The NameNode
returns the names and locations of the file’s blocks to the client.
HDFS Design Concept
Important components in HDFS
1. NameNode.
2. DataNode.
3. Blocks
NameNode
Name Node is the single point of contact for accessing files in HDFS, and it determines
the block IDs and locations for data access. So, Name Node plays the Master role in the
Master/Slaves Architecture, whereas Data Nodes act as slaves. File System metadata
is stored on the Name Node.
File System Metadata mainly contains file names, file permissions and the locations of
each block of the files. Thus, the metadata is relatively small in size and fits into the main
memory of a computer machine. So, it is stored in the main memory of the Name Node to
allow fast access.
Important components of the Name Node :
FsImage : a file on the Name Node’s Local File System containing the entire HDFS file
system namespace (including the mapping of blocks to files and file system properties).
Editlog : a transaction log residing on the Name Node’s Local File System that
contains a record/entry for every change that occurs to the File System Metadata.
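As a side note (not part of the original slides), Hadoop ships offline viewers that can dump both structures to XML for inspection; the file names below are placeholders:
$ hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml # Offline Image Viewer: dump the FsImage
$ hdfs oev -p XML -i edits_0000000000000000001-0000000000000000042 -o edits.xml # Offline Edits Viewer: dump an edit log segment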
DataNode
Data Nodes are the slave part of the Master/Slaves Architecture, and they are where the
actual HDFS files are stored, in the form of fixed-size chunks of data called
blocks.
Data Nodes serve read and write requests from clients on HDFS files and also perform
block creation, replication and deletion.
Data Nodes Failure Recovery
Each data node on a cluster periodically sends a heartbeat message to the name
node which is used by the name node to discover the data node failures based on
missing heartbeats.
The name node marks data nodes without recent heartbeats as dead, and does not
dispatch any new I/O requests to them. Because data located at a dead data node is
no longer available to HDFS, data node death may cause the replication factor of
some blocks to fall below their specified values. The name node constantly tracks
which blocks must be re-replicated, and initiates replication whenever necessary.
Thus, all the blocks on a dead data node are re-replicated on other live data nodes,
and the replication factor returns to normal.
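For example, the following administrative command (an illustration, not from the original slides) reports which data nodes the name node currently considers live or dead:
$ hdfs dfsadmin -report # prints cluster capacity plus the list of live and dead datanodes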
HDFS Blocks
HDFS is a block-structured file system. Each HDFS file is broken into blocks of
fixed size, usually 128 MB, which are stored across various data nodes on the
cluster. Each of these blocks is stored as a separate file on the local file system of the
data nodes (commodity machines on the cluster).
Thus, to access a file on HDFS, multiple data nodes need to be referenced, and
the list of the data nodes which need to be accessed is determined by the file
system metadata stored on the Name Node.
So, any HDFS client trying to access/read an HDFS file will get the block information
from the Name Node first, and then, based on the block IDs and locations, the data will
be read from the corresponding data nodes/computer machines on the cluster.
HDFS’s fsck command is useful for getting the file and block details of the file
system :
$ hadoop fsck / -files -blocks
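The block size itself is controlled by the dfs.blocksize property in hdfs-site.xml; the snippet below is an illustrative sketch (134217728 bytes = 128 MB):
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>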
Command Line interface in
HDFS
The command line is one of the simplest interfaces to the Hadoop Distributed File
System. The basic HDFS File System Commands are similar to
UNIX file system commands; a representative sample is shown below.
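A few common commands (a small illustrative sample; the paths are hypothetical):
$ hadoop fs -ls /user/hadoop # list a directory
$ hadoop fs -mkdir /user/hadoop/dir1 # create a directory
$ hadoop fs -put localfile.txt /user/hadoop/ # copy a local file into HDFS
$ hadoop fs -get /user/hadoop/localfile.txt . # copy a file from HDFS to the local file system
$ hadoop fs -cat /user/hadoop/localfile.txt # print a file’s contents
$ hadoop fs -rm /user/hadoop/localfile.txt # delete a file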
Hadoop File System
The Hadoop Distributed File System (HDFS) is the
primary data storage system used by Hadoop
applications. It employs a NameNode and DataNode
architecture to implement a distributed file system that
provides high-performance access to data across highly
scalable Hadoop clusters.
HDFS exposes a file system namespace and allows user
data to be stored in files. Internally, a file is split into one
or more blocks and these blocks are stored in a set of
DataNodes.
The Hadoop distributed file system (HDFS) is a distributed,
scalable, and portable file-system written in Java for the Hadoop
framework. A Hadoop cluster nominally has a single NameNode
plus a cluster of DataNodes. Each DataNode serves up blocks of
data over the network using a block protocol specific to HDFS. The
basic HDFS architecture is represented in the following image.
● The NameNode keeps all metadata information about where the data is stored, the location of the data files, how
the files are split across DataNodes, etc. HDFS stores large files (typically in the range of gigabytes to
terabytes) across multiple machines, called DataNodes. The files are split into blocks (usually 64 MB or 128 MB)
and stored on a series of DataNodes (which DataNodes are used for each file is managed by the NameNode). The
blocks of data are also replicated (usually three times) on other DataNodes so that in case of hardware failures,
clients can find the data on another server. The information about the location of the data blocks and the replicas
is also managed by the NameNode. Data Nodes can talk to each other to rebalance data, to move copies
around, and to keep the replication of data high.
● The HDFS file system also includes a secondary NameNode, a name which is misleading, as it might be
interpreted as a back-up for the NameNode. In fact, this secondary NameNode connects on a regular
basis to the primary NameNode and builds snapshots of all the directory information managed by the latter, which the
system then saves to local or remote directories. These snapshots can later be used to restart a failed primary
NameNode without having to replay the entire journal of file-system actions and then edit the log to create an up-
to-date directory structure.
● One of the issues related to the HDFS architecture is the fact that the NameNode is the single point for storage and
management of metadata, and so it can become a bottleneck when dealing with a huge number of files,
especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain
extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some known issues in
HDFS, namely, the small file issue, the scalability problem, the Single Point of Failure (SPoF), and the bottleneck of
huge metadata requests.
Java Interface with
HDFS
● Interfaces are derived from real-world
scenarios, with the main purpose of using an
object according to strict rules.
● Java interfaces have the same behaviour: they set
strict rules on how to interact with objects.
● Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation.
● The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in
Hadoop, and there are several concrete implementations.
● Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API.
● The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide
filesystem operations.
● By exposing its file system interface as a Java API, Hadoop makes it awkward for non-Java applications to
access HDFS.
● The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact
with HDFS.
● Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data
transfers if possible.
● There are two ways of accessing HDFS over HTTP : directly, where the HDFS daemons serve HTTP
requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual
DistributedFileSystem API.
● In the first case, the embedded web servers in the NameNode and DataNodes act as WebHDFS
endpoints.
● File metadata operations are handled by the namenode, while file read (and write) operations are sent first
to the namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data.
● The second case of accessing HDFS over HTTP relies on one or more standalone proxy servers.
● All traffic to the cluster passes through the proxy, so the client never accesses the namenode or datanode
directly.
● This allows for stricter firewall and bandwidth-limiting policies to be put in place.
● The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can access
both using webhdfs (or swebhdfs) URIs.
● The HttpFS proxy is started independently of the namenode and datanode daemons, using the httpfs.sh
script, and by default listens on a different port (14000); a couple of illustrative requests are shown below.
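As an illustration (not part of the original slides), WebHDFS requests against a NameNode web port of 50070 could look like the following; the host and path are placeholders:
$ curl -i "http://namenode:50070/webhdfs/v1/user/hadoop/sample.csv?op=GETFILESTATUS" # file metadata, handled by the namenode
$ curl -i -L "http://namenode:50070/webhdfs/v1/user/hadoop/sample.csv?op=OPEN" # read a file; -L follows the redirect to a datanode
$ curl -i "http://httpfs-host:14000/webhdfs/v1/user/hadoop?op=LISTSTATUS&user.name=hadoop" # the same API via an HttpFS proxy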
There are two ways to use the Java API with HDFS :
1. Reading Data Using the FileSystem API.
2. Writing Data Using the FileSystem API.
Reading Data Using the FileSystem API
● A file in a Hadoop filesystem is represented by a
Hadoop Path object. FileSystem is a general filesystem
API, so the first step is to retrieve an instance for the
filesystem we want to use—HDFS, in this case. There
are several static factory methods for getting a
FileSystem instance.
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
public static LocalFileSystem getLocal(Configuration conf) throws IOException
● A Configuration object encapsulates a client or server
configuration, which is set using configuration files read
from the classpath, such as core-site.xml.
● With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file. The first
method uses a default buffer size of 4 KB. The second one gives the user an option to specify the buffer
size.
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
FSDataInputStream
● The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io
class. This class is a specialization of java.io.DataInputStream with support for random access, so you
can read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable {
}
public interface PositionedReadable {
public int read(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
}
● The read() method reads up to length bytes from the given position in the file into the buffer at the given offset in the buffer.
● The return value is the number of bytes actually read; callers should check this value, as it may be less than length.
● The readFully() methods will read length bytes into the buffer, unless the end of the file is reached, in which case an EOFException is
thrown.
● Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly.
● You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing
a large number of seeks.
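Putting the pieces above together, here is a minimal sketch (the class name and paths are only examples) that opens an HDFS file, streams it to standard output, then seeks back to the start and streams it again:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class FileSystemDoubleCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://namenode/user/hadoop/sample.csv
    Configuration conf = new Configuration(); // picks up core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(URI.create(uri), conf); // one of the static factory methods shown earlier
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri)); // default 4 KB buffer
      IOUtils.copyBytes(in, System.out, 4096, false); // first pass
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false); // second pass, demonstrating random access
    } finally {
      IOUtils.closeStream(in);
    }
  }
}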
Writing Data Using the FileSystem API
● The FileSystem class has a number of methods for
creating a file. The simplest is the method that takes a
Path object for the file to be created and returns an
output stream to write to.
public FSDataOutputStream create(Path f) throws IOException
● There are overloaded versions of this method that
allow you to specify whether to forcibly overwrite
existing files, the replication factor of the file, the buffer
size to use when writing the file, the block size for the
file, and file permissions.
FSDataOutputStream
● The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has
a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
}
}
● FileSystem also provides a mkdirs() method, which creates all of the necessary parent directories if they
don’t already exist and returns true if it is successful.
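As a companion sketch (again, the class name and paths are only examples), copying a local file into HDFS with create() and IOUtils:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class FileCopyToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0]; // local file to copy
    String dst = args[1]; // e.g. hdfs://namenode/user/hadoop/sample.csv
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst)); // returns an FSDataOutputStream
    IOUtils.copyBytes(in, out, 4096, true); // 4 KB buffer; true closes both streams when done
  }
}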
DataFlow in HDFS
There are three types of DataFlow in HDFS.
1. Anatomy of File Read in Hadoop
2. Anatomy of a File Write in Hadoop
3. Coherency Model
Anatomy of File Read in Hadoop
● Consider a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1. Each rack has 4
nodes, and they are uniquely identified as R1N1, R1N2 and so on. The replication factor is set to 3 and the HDFS block size
is set to 64 MB (128 MB in Hadoop v2) by default.
Background :
Name node stores the HDFS metadata like file location, permissions, etc. in files called FsImage and edit logs. Files are
stored in HDFS as blocks. The block locations are not saved in any file. Instead, they are gathered from the data nodes every time
the cluster is started, and this information is held in the namenode’s memory.
Replication: Assuming the replication factor is 3: when a file is written from a data node (say R1N1), Hadoop attempts to save the
first replica on the same data node (R1N1). The second replica is written to a node (R2N2) in a different rack (R2).
The third replica is written to another node (R2N1) in the same rack (R2) where the second replica was saved.
Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is
the sum of their distances to their closest common ancestor. The levels can be like; “Data Center” > “Rack” > “Node”. Example;
‘/d1/r1/n1’ is a representation for a node named n1 on rack r1 in data center d1. Distance calculation has 4 possible
scenarios as;
distance(/d1/r1/n1, /d1/r1/n1) = 0 [Processes on same node]
distance(/d1/r1/n1, /d1/r1/n2) = 2 [different node in same rack]
distance(/d1/r1/n1, /d1/r2/n3) = 4 [node in different rack in same data center]
distance(/d1/r1/n1, /d2/r3/n4) = 6 [node in different data center]
Anatomy :
● Consider a sample.csv file of size 192 MB to be saved into the cluster. The file is divided into 3 blocks of 64 MB each (B1, B2, B3),
and the blocks are stored on different data nodes as shown above. Along with the data, a checksum is stored in each block to ensure that
data reads are done without any errors.
When the cluster is started, the metadata in the name node will look as shown in the fig below.
Anatomy of File Write in
Hadoop
Consider writing a file sample.csv by HDFS client program running on R1N1’s JVM.
First the HDFS client program calls the method create() on a Java class DistributedFileSystem (subclass of FileSystem).
DistributedFileSystem makes an RPC call to the name node to create a new file in the file system namespace. No blocks are associated with the file at this
stage. The name node performs various checks: it ensures that the file doesn’t already exist, and it also checks whether the user has the right
permissions to create the file.
Coherency Model
● A coherency model for a filesystem
describes the data visibility of reads
and writes for a file.
● HDFS trades off some POSIX requirements for performance, so
some operations may behave differently than you expect them to.
● After creating a file, it is visible in the filesystem namespace, as
expected.
● However, any content written to the file is not guaranteed to be
visible, even if the stream is flushed; the file may appear to have a
length of zero.
● Once more than a block’s worth of data has been written, the
first block will be visible to new readers.
● This is true of subsequent blocks, too: it is always the current
block being written that is not visible to other readers.
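A small illustrative sketch (the path is hypothetical) of how hflush() interacts with this coherency model:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CoherencyDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/coherency-demo.txt"));
    out.write("Hello, HDFS".getBytes("UTF-8"));
    // Before the flush, a concurrent reader may see the file with a length of zero.
    out.hflush(); // after hflush(), the bytes written so far are visible to new readers
    // out.hsync(); // hsync() additionally forces the data to disk on the datanodes
    out.close(); // close() performs an implicit hflush() and publishes the final length
  }
}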
Parallel Copying with distcp
Hadoop comes with a useful
program called distcp for
copying data to and from
Hadoop file systems in
parallel.
● The HDFS access patterns that we have seen so far focus on single-threaded access. It’s possible to act on a collection of files — by
specifying file globs, for example — but for efficient parallel processing of these files, you would have to write a program yourself.
Hadoop comes with a useful program called distcp for copying data to and from Hadoop file systems in parallel.
● One use for distcp is as an efficient replacement for hadoop fs -cp. For example, you can copy one file to another with:
○ % hadoop distcp file1 file2
● You can also copy directories:
○ % hadoop distcp dir1 dir2
● If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there.
● You can specify multiple source paths, and all will be copied to the destination.
● If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/dir1.
● If this isn’t what you want, you can supply the -overwrite option to keep the same directory structure and force files to be
overwritten.
● You can also update only the files that have changed using the -update option. This is best shown with an example.
● If we changed a file in the dir1 subtree, we could synchronize the change with dir2 by running:
● % hadoop distcp -update dir1 dir2
● distcp is implemented as a MapReduce job where the work of copying is done by the maps that
run in parallel across the cluster.
● There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the
same amount of data by bucketing files into roughly equal allocations.
● By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp. A very
common use case for distcp is for transferring data between two HDFS clusters.
● For example, the following creates a backup of the first cluster’s /foo directory on the second:
% hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo
● The -delete flag causes distcp to delete any files or directories from the destination that are not present in the
source, and -p means that file status attributes like permissions, block size, and replication are preserved.
● You can run distcp with no arguments to see precise usage instructions.
● If the two clusters are running incompatible versions of HDFS, then you can use the webhdfs protocol to distcp
between them:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo
Hadoop Archives
● Hadoop archives are special format archives. A Hadoop archive maps to a file system directory. A Hadoop archive always has a
*.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files.
The _index file contains the name of the files that are part of the archive and the location within the part files.
● Hadoop Archives (HAR) offers an effective way to deal with the small files problem.
1. The Problem with small files.
2. What is HAR?
3. Limitation of HAR.
The Problem with Small Files.
Hadoop works best with big files and small files are handled inefficiently in HDFS.
As we know, Namenode holds the metadata information in memory for all the files stored in HDFS.
Let’s say we have a file in HDFS which is 1 GB in size; the Namenode will store metadata information about the file – like the file name,
creator, creation time stamp, blocks, permissions, etc.
Now assume we decide to split this 1 GB file into 1000 pieces and store all 1000 “small” files in HDFS.
Now the Namenode has to store metadata information for 1000 small files in memory.
This is not very efficient – first, it takes up a lot of memory, and second, the Namenode will soon become a bottleneck as it tries
to manage a lot of metadata.
What is HAR?
Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks
efficiently, and hence HAR can be used to tackle the small files problem in Hadoop.
HAR is created from a collection of files and the archiving tool (a simple command)
will run a MapReduce job to process the input files in parallel and create an archive file.
HAR Command
hadoop archive -archiveName myhar.har /input/location /output/location
● Once a .har file is created, you can do a listing on the .har file and you will see it is made up of index files and part files.
● Part files are nothing but the original files concatenated together into a big file.
● Index files are lookup files which are used to look up the individual small files inside the big part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-0
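As an aside, the files inside the archive can still be listed or read through the har:// filesystem scheme (the path below continues the example above):
hadoop fs -ls -R har:///output/location/myhar.har # recursively list the original files stored inside the archive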
Limitation of HAR Files.
● Once an archive file is created, you cannot update it to add or remove files. In other words, .har files are immutable.
● The archive file will have a copy of all the original files, so once a .har is created it will take up as much space as the original files.
Don’t mistake .har files for compressed files.
● When a .har file is given as input to a MapReduce job, the small files inside the .har file will be processed individually by
separate mappers, which is inefficient.
Any Question?
Thank You.

More Related Content

What's hot

Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDr. C.V. Suresh Babu
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and UsesSuvradeep Rudra
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systemsViet-Trung TRAN
 

What's hot (20)

Hadoop
HadoopHadoop
Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Design of Hadoop Distributed File System
Design of Hadoop Distributed File SystemDesign of Hadoop Distributed File System
Design of Hadoop Distributed File System
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Introduction to distributed file systems
Introduction to distributed file systemsIntroduction to distributed file systems
Introduction to distributed file systems
 

Similar to Introduction to HDFS

big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing datapreetik9044
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Designsudhakara st
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxsunithachphd
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file systemsrikanthhadoop
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemNilaNila16
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiUnmesh Baile
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxSakthiVinoth78
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answersKalyan Hadoop
 

Similar to Introduction to HDFS (20)

module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
big data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing databig data hadoop technonolgy for storing and processing data
big data hadoop technonolgy for storing and processing data
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptx
 
Hadoop distributed file system
Hadoop distributed file systemHadoop distributed file system
Hadoop distributed file system
 
Hadoop
HadoopHadoop
Hadoop
 
Hdfs
HdfsHdfs
Hdfs
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hdfs
HdfsHdfs
Hdfs
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Hadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbaiHadoop professional-software-development-course-in-mumbai
Hadoop professional-software-development-course-in-mumbai
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 
Introduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptxIntroduction to Hadoop Distributed File System(HDFS).pptx
Introduction to Hadoop Distributed File System(HDFS).pptx
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Unit 1
Unit 1Unit 1
Unit 1
 

Recently uploaded

WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Bert Jan Schrijver
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxAnnaArtyushina1
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 

Recently uploaded (20)

WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 

Introduction to HDFS

  • 1. Introduction to HDFS Hadoop Distributed File System
  • 2. What is HDFS? The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a Master and Slave architecture to implement a distributed file system that provides high- performance access to data across highly scalable Hadoop clusters.
  • 3. Architecture of HDFS Components in HDFS Architecture. ★ Client. ★ Datanode. ★ Namenode. ★ Backup Node. ★ Replication Management. ★ Rack awareness. ★ Read and Write operation.
  • 4. What is NameNode? NameNode in HDFS Architecture is also known as Master node. HDFS Namenode stores metadata i.e. number of data blocks, replicas and other details. NameNode maintains and manages the slave nodes, and assigns tasks to them. It should deploy on reliable hardware as it is the centerpiece of HDFS. Task of Namenode : ● Manage file system namespace. ● Regulates client’s access to files. ● It also executes file system execution such as naming, closing, opening files/directories. ● All DataNode sends a Heartbeat and block report to the NameNode in the Hadoop cluster. ● NameNode is also responsible for taking care of the Replication Factor of all the blocks.
  • 5. What is DataNode? DataNode in HDFS Architecture is also known as Slave node. In Hadoop HDFS Architecture, DataNode stores actual data in HDFS. It performs read and write operation as per the request of the client. DataNodes can deploy on commodity hardware. Task of DataNode : ● Block replica creation, deletion and replication according to the instruction of NameNode. ● DataNode manages data storage of the system. ● DataNode send heartbeat to the NameNode to report the health of HDFS.
  • 6. What is Backup Node? In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system namespace. It is always synchronized with the active NameNode state. What is Replication Management? Block replication provides fault tolerance. If one copy is not accessible and corrupted, we can read data from other copy. The number of copies or replicas of each block of a file in HDFS Architecture is replication factor. Backup Node
  • 7. What is Rack Awareness Rack Awareness in Hadoop is the concept that chooses DataNodes based on the rack information. NameNode achieves rack information by maintaining the rack ids of each DataNode. HDFS Read and Write operation. Write Operation : When a client wants to write a file to HDFS, it communicates to namenode for metadata. The Namenode responds with a number of blocks, their location, replicas and other details. Read Operation : To read from HDFS, the first client communicates to namenode for metadata. A client comes out of NameNode with the name of files and its location. Rack Awareness R/W
  • 8. HDFS Design Concept Important components in HDFS 1. NameNode. 2. DataNode. 3. Blocks
  • 9. NameNode Name Node is the single point of contact for accessing files in HDFS and it determines the block ids and locations for data access. So, NameNode plays a Master role in Master/Slaves Architecture whereas Data Nodes acts as slaves. File System metadata is stored on Name Node. File System Metadata contains majorly, File names, File Permissions and locations of each block of files. Thus, Metadata is relatively small in size and fits into Main Memory of a computer machine. So, it is stored in Main Memory of Namenode to allow fast access. Important components of name node. FsImage : It is a file on Name Node’s Local File System containing entire HDFS file system namespace (including mapping of blocks to files and file system properties) Editlog : It is a Transaction Log residing on Name Node’s Local File System and contains a record/entry for every change that occurs to File System Metadata.
  • 10. DataNode Data Nodes are the slaves part of Master/Slaves Architecture and on which actual HDFS files are stored in the form of fixed size chunks of data which are called blocks. Data Nodes serve read and write requests of clients on HDFS files and also perform block creation, replication and deletions. Data Nodes Failure Recovery Each data node on a cluster periodically sends a heartbeat message to the name node which is used by the name node to discover the data node failures based on missing heartbeats. The name node marks data nodes without recent heartbeats as dead, and does not dispatch any new I/O requests to them. Because data located at a dead data node is no longer available to HDFS, data node death may cause the replication factor of some blocks to fall below their specified values. The name node constantly tracks which blocks must be re-replicated, and initiates replication whenever necessary. Thus all the blocks on a dead data node are re-replicated on other live data nodes and replication factor remains normal.
  • 11. HDFS Blocks HDFS is a block structured file system. Each HDFS file is broken into blocks of fixed size usually 128 MB which are stored across various data nodes on the cluster. Each of these blocks is stored as a separate file on local file system on data nodes (Commodity machines on cluster). Thus to access a file on HDFS, multiple data nodes need to be referenced and the list of the data nodes which need to be accessed is determined by the file system metadata stored on Name Node. So, any HDFS client trying to access/read a HDFS file, will get block information from Name Node first, and then based on the block id’s and locations, data will be read from corresponding data nodes/computer machines on cluster. HDFS’s fsck command is a useful to get the files and blocks details of file system. - $ hadoop fsck / -files -blocks
  • 12. Command Line interface in HDFS Command Line is one of the simplest interface to Hadoop Distributed File System. Below are the basic HDFS File System Commands which are similar to UNIX file system commands
  • 13. Hadoop File System The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. A Hadoop cluster has nominally a single NameNode plus a cluster of DataNodes,. Each DataNode serves up blocks of data over the network using a block protocol specific to HDFS. The basic HDFS architecture is represented in the following image.
  • 14. ● The NameNode keeps all metadata information about where the data is stored, the location of the data files, how the files are splitted across DataNodes, etc.. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines, called DataNodes. The files are splitted in blocks (usually 64 MB or 128 MB) and stored on a serie of DataNodes (what DataNodes are used for each file, is managed by the NameNode). The blocks of data are also replicated (usually three times) on other DataNodes so that in case hardware failures, clients can find the data on another server. The information about the location of the data blocks and the replicas is also managed by the NameNode. Data Nodes can talk to each other to re balance data, to move copies around, and to keep the replication of data high. ● The HDFS file system also includes a secondary NameNode, a name which is misleading, as it might be interpreted as a back-up for the NameNode. In fact, this secondary NameNode regularly connects on the regular basis to the primary NameNode and builds snapshots of all directory information managed by the latter, which the system then saves to local or remote directories. These snapshots can later be used to restart a failed primary NameNode without having to replay the entire journal of file-system actions, then to edit the log to create an up- to-date directory structure. ● What of the issues related to HDFS architecture is the fact that the NameNode is the single point for storage and management of metadata, and so it can become a bottleneck when dealing with a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some issues in HDFS, namely, small file issue, scalability problem, Single Point of Failure (SPoF), and bottleneck in huge metadata request.
  • 15. Java Interface with HDFS ● Interfaces are derived from real-world scenarios with the main purpose to use an object by strict rules. ● Java interfaces have same behaviour: they set strict rules on how to interact with objects.
  • 16. ● Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. ● The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations. ● Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. ● The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. ● By exposing its file system interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. ● The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. ● Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible. ● There are two ways of accessing HDFS over HTTP : directly, where the HDFS daemons serve HTTP requests to clients and via a proxy, which accesses HDFS on the client’s behalf using the usual DistributedFileSystem API. ● In the First Case, the embedded web servers in the NameNode and DataNodes act as WebHDFS endpoints. ● File metadata operations are handled by the namenode, while file read (and write) operations are sent first to the namenode, which sends an HTTP redirect to the client indicating the datanode to stream file data.
  • 17. ● In Second Case of accessing HDFS over HTTP relies on one or more standalone proxy servers. ● All traffic to the cluster passes through the proxy, so the client never accesses the namenode or datanode directly. ● This allows for stricter firewall and bandwidth-limiting policies to be put in place. ● The HttpFS proxy exposes the same HTTP (and HTTPS) interface as WebHDFS, so clients can access both using webhdfs (or swebhdfs) URIs. ● The HttpFS proxy is started independently of the namenode and datanode daemons, using the httpfs.sh script, and by default listens on a different port number 14000. There are two ways to use JAVA API in HDFS : 1. Reading Data Using the FileSystem API. 2. Writing Data Using the FileSystem API.
  • 18. Reading Data Using the FileSystem API ● A file in a Hadoop filesystem is represented by a Hadoop Path object. FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use—HDFS, in this case. There are several static factory methods for getting a FileSystem instance. public static FileSystem get(Configuration conf) throws IOException public static FileSystem get(URI uri, Configuration conf) throws IOException public static FileSystem get(URI uri, Configuration conf, String user) throws IOException public static LocalFileSystem getLocal(Configuration conf) throws IOException ● A Configuration object encapsulates a client or server configuration, which is set using configuration files read from the classpath, such as core-site.xml.
  • 19. ● With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file.The first method uses a default buffer size of 4 KB.The second one gives an option to user to specify the buffer size. public FSDataInputStream open(Path f) throws IOException public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException FSDataInputStream ● The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream: package org.apache.hadoop.fs; public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable { }
  • 20. public interface PositionedReadable { public int read(long position, byte[] buffer, int offset, int length) throws IOException; public void readFully(long position, byte[] buffer, int offset, int length) throws IOException; public void readFully(long position, byte[] buffer) throws IOException; } ● The read() method reads up to length bytes from the given position in the file into the buffer at the given offset in the buffer. ● The return value is the number of bytes actually read; callers should check this value, as it may be less than length. ● The readFully() methods will read length bytes into the buffer, unless the end of the file is reached, in which case an EOFException is thrown. ● Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. ● You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.
  • 21. Writing Data Using the FileSystem API ● The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to. public FSDataOutputStream create(Path f) throws IOException ● There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.
• 22. FSDataOutputStream ● The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file: package org.apache.hadoop.fs; public class FSDataOutputStream extends DataOutputStream implements Syncable { public long getPos() throws IOException { // implementation elided } } ● The create() methods also create any parent directories of the file to be written that don't already exist. ● The related mkdirs() method creates all of the necessary parent directories if they don't already exist and returns true if it is successful.
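A minimal write-path sketch (the local source path and destination URI are placeholders): it creates a file in HDFS with create() and streams a local file into it.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileCopyToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];                             // e.g. /tmp/sample.csv
    String dst = args[1];                                  // e.g. hdfs://namenode:8020/user/hdfs/sample.csv
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    FSDataOutputStream out = fs.create(new Path(dst));     // overwrites by default; missing parent dirs are created
    IOUtils.copyBytes(in, out, 4096, true);                // true: close both streams when the copy finishes
  }
}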
• 23. Data Flow in HDFS There are three aspects of data flow in HDFS. 1. Anatomy of a File Read in Hadoop 2. Anatomy of a File Write in Hadoop 3. Coherency Model
• 24. Anatomy of File Read in Hadoop ● Consider a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1. Each rack has 4 nodes, uniquely identified as R1N1, R1N2 and so on. The replication factor is set to 3 and the default HDFS block size is 64 MB (128 MB in Hadoop v2).
• 25. Background: The name node stores HDFS metadata such as file names, permissions and block information in the FSImage and edit log files. Files are stored in HDFS as blocks. The block-to-datanode mapping, however, is not saved in any file; instead, it is rebuilt from the block reports sent by the data nodes every time the cluster is started, and it is held in the namenode's memory. Replication: Assuming the replication factor is 3, when a file is written from a data node (say R1N1), Hadoop attempts to save the first replica on the same data node (R1N1). The second replica is written to a node (R2N2) in a different rack (R2). The third replica is written to another node (R2N1) in the same rack (R2) where the second replica was saved. Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. The levels are: Data Center > Rack > Node. For example, '/d1/r1/n1' represents a node named n1 on rack r1 in data center d1. Distance calculation has 4 possible scenarios, as sketched in the example below: distance(/d1/r1/n1, /d1/r1/n1) = 0 [processes on the same node] distance(/d1/r1/n1, /d1/r1/n2) = 2 [different node on the same rack] distance(/d1/r1/n1, /d1/r2/n3) = 4 [node on a different rack in the same data center] distance(/d1/r1/n1, /d2/r3/n4) = 6 [node in a different data center]
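As a small illustration of this distance formula (not part of the Hadoop API; the class and method names are just for this sketch, and the paths follow the /datacenter/rack/node convention above):

public class NetworkDistance {
  // distance = (levels from a up to the closest common ancestor) + (levels from b up to it)
  static int distance(String a, String b) {
    String[] pa = a.split("/");
    String[] pb = b.split("/");
    int i = 0;
    while (i < pa.length && i < pb.length && pa[i].equals(pb[i])) {
      i++;                                                // walk down the tree while the path components match
    }
    return (pa.length - i) + (pb.length - i);
  }

  public static void main(String[] args) {
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
    System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
    System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
    System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6
  }
}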
• 26. Anatomy: ● Consider a sample.csv file of size 192 MB to be saved into the cluster. The file is divided into 3 blocks of 64 MB each (B1, B2, B3), and the blocks are stored on different data nodes. Along with the data, a checksum is stored with each block to ensure that reads complete without errors. When the cluster is started, this block metadata is built up in the name node's memory.
• 27. Anatomy of File Write in Hadoop Consider writing a file sample.csv from an HDFS client program running on R1N1's JVM. First the HDFS client program calls the method create() on the Java class DistributedFileSystem (a subclass of FileSystem). DistributedFileSystem makes an RPC call to the name node to create a new file in the filesystem namespace. No blocks are associated with the file at this stage. The name node performs various checks: it ensures that the file doesn't already exist and that the user has the right permissions to create the file.
  • 28. Coherency Model ● A coherency model for a filesystem describes the data visibility of reads and writes for a file.
• 29. ● HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to. ● After creating a file, it is visible in the filesystem namespace, as expected. ● However, any content written to the file is not guaranteed to be visible, even if the stream is flushed, so the file can appear to have a length of zero. ● Once more than a block's worth of data has been written, the first block will be visible to new readers. ● This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers.
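A minimal sketch of this behaviour (the path is a placeholder, and the commented output assumes the write is far smaller than a block):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherencyDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/coherency-demo");              // placeholder path

    FSDataOutputStream out = fs.create(p);
    System.out.println(fs.getFileStatus(p).getLen());      // 0: the file is visible in the namespace, but empty
    out.write("some content".getBytes("UTF-8"));
    out.flush();
    System.out.println(fs.getFileStatus(p).getLen());      // still 0: flushed data is not guaranteed to be visible
    out.close();
    System.out.println(fs.getFileStatus(p).getLen());      // after close, the written bytes are visible to readers
  }
}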
• 30. Parallel Copying with distcp Hadoop comes with a useful program called distcp for copying data to and from Hadoop filesystems in parallel.
• 31. ● The HDFS access patterns that we have seen so far focus on single-threaded access. It's possible to act on a collection of files (by specifying file globs, for example), but for efficient parallel processing of these files you would have to write a program yourself. Hadoop comes with a useful program called distcp for copying data to and from Hadoop filesystems in parallel. ● One use for distcp is as an efficient replacement for hadoop fs -cp. For example, you can copy one file to another with: ○ % hadoop distcp file1 file2 ● You can also copy directories: ○ % hadoop distcp dir1 dir2 ● If dir2 does not exist, it will be created, and the contents of the dir1 directory will be copied there. ● You can specify multiple source paths, and all will be copied to the destination. ● If dir2 already exists, then dir1 will be copied under it, creating the directory structure dir2/dir1. ● If this isn't what you want, you can supply the -overwrite option to keep the same directory structure and force files to be overwritten. ● You can also update only the files that have changed using the -update option. This is best shown with an example: if we changed a file in the dir1 subtree, we could synchronize the change with dir2 by running: ○ % hadoop distcp -update dir1 dir2 ● distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster.
• 32. ● There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. ● By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp, as sketched below. ● A very common use case for distcp is transferring data between two HDFS clusters. ● For example, the following creates a backup of the first cluster's /foo directory on the second: % hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo ● The -delete flag causes distcp to delete any files or directories from the destination that are not present in the source, and -p means that file status attributes like permissions, block size, and replication are preserved. ● You can run distcp with no arguments to see precise usage instructions. ● If the two clusters are running incompatible versions of HDFS, then you can use the webhdfs protocol to distcp between them: % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo
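For instance, a sketch of limiting the copy to 40 maps with -m (the host names are the same placeholders used in the commands above):

% hadoop distcp -m 40 -update hdfs://namenode1/foo hdfs://namenode2/foo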
• 34. ● Hadoop archives are special-format archives. A Hadoop archive maps to a filesystem directory and always has a *.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex files) and data (part-*) files. The _index file contains the names of the files that are part of the archive and their locations within the part files. ● Hadoop Archives (HAR) offer an effective way to deal with the small files problem. 1. The problem with small files. 2. What is HAR? 3. Limitations of HAR. The Problem with Small Files: Hadoop works best with big files; small files are handled inefficiently in HDFS. As we know, the Namenode holds the metadata information in memory for all the files stored in HDFS. Let's say we have a file in HDFS which is 1 GB in size; the Namenode will store metadata information about the file, like file name, creator, creation timestamp, blocks, permissions, etc.
• 35. Now assume we decide to split this 1 GB file into 1000 pieces and store all 1000 "small" files in HDFS. Now the Namenode has to store metadata information for 1000 small files in memory. This is not very efficient: first, it takes up a lot of memory, and second, the Namenode will soon become a bottleneck as it tries to manage so much metadata. What is HAR? Hadoop Archives, or HAR, is an archiving facility that packs files into HDFS blocks efficiently, and hence HAR can be used to tackle the small files problem in Hadoop. A HAR is created from a collection of files, and the archiving tool (a simple command) runs a MapReduce job to process the input files in parallel and create the archive file.
• 36. HAR Command hadoop archive -archiveName myhar.har /input/location /output/location ● Once a .har file is created, you can do a listing on the .har file and you will see it is made up of index files and part files. ● Part files are nothing but the original files concatenated together into a big file. ● Index files are lookup files used to locate the individual small files inside the big part files. hadoop fs -ls /output/location/myhar.har /output/location/myhar.har/_index /output/location/myhar.har/_masterindex /output/location/myhar.har/part-0
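To see the archive as a filesystem rather than as index and part files, the har URI scheme can be used; a sketch, reusing the output path from the command above:

% hadoop fs -ls har:///output/location/myhar.har

This lists the original files stored inside the archive instead of the _index, _masterindex and part-* files.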
• 37. Limitations of HAR Files ● Once an archive file is created, you cannot update it to add or remove files. In other words, .har files are immutable. ● An archive file holds a copy of all the original files, so once a .har is created it takes as much space as the original files. Don't mistake .har files for compressed files. ● When a .har file is given as input to a MapReduce job, the small files inside the .har file will still be processed individually by separate mappers, which is inefficient.