Hadoop Distributed File System
Big Data Analytics
Nadar Saraswathi College of Arts & Science
Submitted By
N. Nagapandiyammal
M.Sc Computer Science
Hadoop Distributed File System
 The Hadoop Distributed File System (HDFS) is the primary
data storage system used by Hadoop applications.
 It employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-
performance access to data across highly scalable Hadoop
clusters.
 HDFS is a key part of many Hadoop ecosystem
technologies, as it provides a reliable means of managing
pools of big data and supporting related big data
analytics applications.
 HDFS is a distributed, scalable, and portable file system
written in Java for the Hadoop framework. The short example
below shows how an application reads and writes through it.
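
As a rough illustration of how an application talks to HDFS, here is a
minimal Java sketch that writes and then reads a file through Hadoop's
standard FileSystem API. The path is hypothetical, and it assumes a
running cluster is set as fs.defaultFS (for example hdfs://localhost:9000):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // hypothetical path

        // Write a small file: the Name Node records the metadata,
        // the Data Nodes store the actual blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}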
HDFS has five services
 1. Name Node
 2. Secondary Name Node
 3. Job Tracker
 4. Data Node
 5. Task Tracker
Name Node
 HDFS has a single Name Node, also called the Master
Node, which tracks the files, manages the file system
namespace and holds the metadata; the data itself lives
on the Data Nodes.
 In particular, the Name Node records the number of blocks
for each file, which Data Nodes hold the data and where the
replicas are stored, among other details. The sketch below
queries this metadata.
 Because there is only one Name Node, it is a single point
of failure. It communicates directly with the client.
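
To make that metadata concrete, here is a hedged Java sketch that asks
the Name Node which Data Nodes hold the blocks of a file, using the
standard getFileBlockLocations call; the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/hello.txt")); // hypothetical

        // Ask the Name Node which Data Nodes hold each block of the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
    }
}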
Data Node
 A Data Node stores the data as blocks. It is also known
as a slave node; it holds the actual data in HDFS and serves
the clients' read and write requests.
 Data Nodes are slave daemons. Every Data Node sends a
Heartbeat message to the Name Node every 3 seconds to
convey that it is alive.
 If the Name Node does not receive a heartbeat from a
Data Node within the configured timeout (about 10 minutes
by default), it marks that Data Node as dead and starts
replicating its blocks onto other Data Nodes. A toy sketch
of this bookkeeping follows.
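
The following is only a toy Java model of the heartbeat bookkeeping
described above; the class and method names are invented for
illustration, and the real NameNode logic is far more involved:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
    static final long HEARTBEAT_INTERVAL_MS = 3_000; // Data Nodes report every 3 s
    static final long DEAD_TIMEOUT_MS = 10 * 60_000; // illustrative dead-node timeout

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a Data Node's heartbeat arrives.
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodically scan for Data Nodes that have gone silent.
    void checkLiveness() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((node, last) -> {
            if (now - last > DEAD_TIMEOUT_MS) {
                System.out.println(node + " is dead; re-replicating its blocks");
                // Real HDFS would now schedule new replicas of the dead
                // node's blocks on other Data Nodes.
            }
        });
    }
}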
Secondary Name Node
 The Secondary Name Node only takes care of checkpointing
the file system metadata held by the Name Node.
 It is also known as the Checkpoint Node. It is a helper
node for the Name Node, not a hot standby. The toy sketch
below shows the checkpointing idea.
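
As a hedged, in-memory toy model of that idea: the namespace "image"
plus a log of edits, with the checkpoint step folding the edits back
into the image so the log does not grow without bound. Real HDFS
persists fsimage and edit-log files on disk; all names here are invented:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    private final Map<String, String> fsImage = new HashMap<>(); // path -> metadata
    private final List<String[]> editLog = new ArrayList<>();    // pending edits

    void recordEdit(String path, String metadata) {
        editLog.add(new String[] {path, metadata}); // cheap append, like the edit log
    }

    // What the Secondary Name Node does on the Name Node's behalf:
    // fold the accumulated edits into the image and clear the log.
    void checkpoint() {
        for (String[] edit : editLog) {
            fsImage.put(edit[0], edit[1]);
        }
        editLog.clear();
    }
}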
Job Tracker
 The Job Tracker handles the processing of the data. It
receives requests for MapReduce execution from the client.
 The Job Tracker talks to the Name Node to learn where
the data to be processed is located.
 The Name Node responds with the metadata, which the Job
Tracker uses to schedule tasks close to the data.
Task Tracker
 The Task Tracker is the slave node of the Job Tracker: it
takes tasks from the Job Tracker and also receives from it
the code to run.
 The Task Tracker applies that code to its local portion of
the file. The code applied to the file in this way is known
as the Mapper; a classic example follows.
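
For instance, the word-count Mapper below is the kind of code a Job
Tracker would ship to Task Trackers, each of which applies it to the
blocks stored on its own Data Node. It uses the standard
org.apache.hadoop.mapreduce API:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line; the
        // reduce phase later sums the counts per word.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}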
Other file systems
 HDFS: Hadoop's own rack-aware file system. This is designed
to scale to tens of petabytes of storage and runs on top of the
file systems of the underlying operating systems.
 FTP file system: This stores all its data on remotely accessible
FTP servers.
 Amazon S3 (Simple Storage Service) object storage: This is
targeted at clusters hosted on the Amazon Elastic Compute
Cloud server-on-demand infrastructure. There is no rack-
awareness in this file system, as it is all remote.
 Windows Azure Storage Blobs (WASB) file system: This is an
extension of HDFS that allows distributions of Hadoop to
access data in Azure blob stores without moving the data
permanently into the cluster. The sketch after this list shows
how the URI scheme selects among these file systems.
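
As a hedged sketch: in the Java FileSystem API, the scheme of the URI
picks the file system implementation. The endpoints below are
hypothetical, and the s3a and wasb schemes additionally require the
hadoop-aws and hadoop-azure connector jars plus credentials at runtime:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The URI scheme selects the implementation; all addresses
        // below are placeholders, not real endpoints.
        FileSystem hdfs = FileSystem.get(
                URI.create("hdfs://namenode:9000/"), conf);
        FileSystem s3 = FileSystem.get(
                URI.create("s3a://my-bucket/"), conf);  // needs hadoop-aws
        FileSystem wasb = FileSystem.get(
                URI.create("wasb://container@account.blob.core.windows.net/"),
                conf);                                  // needs hadoop-azure

        System.out.println(hdfs.getScheme() + ", "
                + s3.getScheme() + ", " + wasb.getScheme());
    }
}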
Why use HDFS?
 The Hadoop Distributed File System arose at Yahoo as a
part of that company's ad serving and search engine
requirements. Like other web-oriented companies, Yahoo
found itself juggling a variety of applications that were
accessed by a growing number of users, who were creating
more and more data.
 Facebook, eBay, LinkedIn and Twitter are among the web
companies that used HDFS to underpin big data analytics to
address these same requirements.
 HDFS was used by The New York Times as part of large-
scale image conversions, Media6Degrees for log processing
and machine learning, LiveBet for log storage and odds
analysis, Joost for session analysis and Fox Audience
Network for log analysis and data mining.
 HDFS is also at the core of many open source data
warehouse alternatives, sometimes called data lakes.
HDFS and Hadoop history
 In 2006, Hadoop's originators ceded their work on HDFS and
MapReduce to the Apache Software Foundation project. In 2012,
HDFS and Hadoop became available in Version 1.0. The basic HDFS
standard has been continuously updated since its inception.
 With Version 2.0 of Hadoop in 2013, a general-purpose YARN
resource manager was added, and MapReduce and HDFS were
effectively decoupled. Thereafter, diverse data processing frameworks
and file systems were supported by Hadoop.
 While MapReduce was often replaced by Apache Spark, HDFS
continued to be the prevalent file system for Hadoop. After four alpha
releases and one beta, Apache Hadoop 3.0.0 became generally
available in December 2017, with HDFS enhancements supporting
additional NameNodes, erasure coding facilities and greater data
compression.
 At the same time, advances in HDFS tooling, such as LinkedIn's open
source Dr. Elephant and Dynamometer performance testing tools, have
expanded to enable development of ever larger HDFS
implementations.
Thank You
