SlideShare a Scribd company logo
Introduction to Hadoop
Distributed File System(HDFS)
HDFS(Hadoop Distributed File System)
With growing data velocity the data size easily outgrows the storage limit of a
machine. A solution would be to store the data across a network of machines. Such
filesystems are called distributed filesystems. Since data is stored across a network all the
complications of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems.
HDFS (Hadoop Distributed File System) is a unique design that provides storage for
extremely large files with streaming data access pattern and it runs on commodity
hardware.
TERMS FOR HDFS
 Extremely large files: Here we are talking about the data in range of
petabytes(1000 TB).
 Streaming Data Access Pattern: HDFS is designed on principle of write-once and
read-many-times. Once data is written large portions of dataset can be processed
any number times.
 Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of feature which specially distinguishes HDFS from other file
system.
1. NameNode(MasterNode):
 Manages all the slave nodes and assign work to them.
 It executes filesystem namespace operations like opening, closing, renaming files
and directories.
 It should be deployed on reliable hardware which has the high config. not on
commodity hardware.
2. DataNode(SlaveNode):
 Actual worker nodes, who do the actual work like reading, writing, processing etc.
 They also perform creation, deletion, and replication upon instruction from the
master.
 They can be deployed on commodity hardware.
HDFS daemons: Daemons are the
processes running in background.
Namenodes:
•Run on the master node.
•Store metadata (data about data) like file path, the number of blocks, block Ids. etc.
•Require high amount of RAM.
•Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a persistent copy of it is kept
on disk.
•
DataNodes:
•Run on slave nodes.
•Require high memory as data is actually stored here.
Data storage in HDFS: The data is stored in a distributed manner.
Lets assume that 100TB file is inserted, then masternode(namenode)
will first divide the file into blocks of 10TB (default size is 128 MB in
Hadoop 2.x and above). Then these blocks are stored across different
datanodes(slavenode). Datanodes(slavenode)replicate the blocks
among themselves and the information of what blocks they contain is
sent to the master. Default replication factor is 3 means for each block
3 replicas are created (including itself). In hdfs.site.xml we can
increase or decrease the replication factor i.e we can edit its
configuration here.
Why divide the file into blocks?
 Answer: Let’s assume that we don’t divide, now it’s very difficult to
store a 100 TB file on a single machine. Even if we store, then each
read and write operation on that whole file is going to take very
high seek time. But if we have multiple blocks of size 128MB then
its become easy to perform various read and write operations on it
compared to doing it on a whole file at once. So we divide the file
to have faster data access i.e. reduce seek time.
Why replicate the blocks in data nodes
while storing?
 Answer: Let’s assume we don’t replicate and only one yellow block
is present on datanode D1. Now if the data node D1 crashes we will
lose the block and which will make the overall data inconsistent
and faulty. So we replicate the blocks to achieve fault-tolerance.
Terms related to HDFS:
 HeartBeat : It is the signal that datanode continuously sends to
namenode. If namenode doesn’t receive heartbeat from a datanode
then it will consider it dead.
 Balancing : If a datanode is crashed the blocks present on it will be
gone too and the blocks will be under-replicated compared to the
remaining blocks. Here master node(namenode) will give a signal
to datanodes containing replicas of those lost blocks to replicate so
that overall distribution of blocks is balanced.
 Replication:: It is done by datanode.
Features:
 Distributed data storage.
 Blocks reduce seek time.
 The data is highly available as the same block is present
at multiple datanodes.
 Even if multiple datanodes are down we can still do our
work, thus making it highly reliable.
 High fault tolerance.
Limitations:
 Though HDFS provide many features there are some areas where it
doesn’t work well.
 Low latency data access: Applications that require low-latency
access to data i.e in the range of milliseconds will not work well
with HDFS, because HDFS is designed keeping in mind that we
need high-throughput of data even at the cost of latency.
 Small file problem: Having lots of small files will result in lots of
seeks and lots of movement from one datanode to another datanode
to retrieve each small file, this whole process is a very inefficient
data access pattern.
Thank you

More Related Content

Similar to Introduction to Hadoop Distributed File System(HDFS).pptx

Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Delhi/NCR HUG
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hari Shankar Sreekumar
 
Data Analytics presentation.pptx
Data Analytics presentation.pptxData Analytics presentation.pptx
Data Analytics presentation.pptx
SwarnaSLcse
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
Kalyan Hadoop
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
Adam Kawa
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
RajatTripathi34
 
Hadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsHadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data Analytics
DrPDShebaKeziaMalarc
 
lec4_ref.pdf
lec4_ref.pdflec4_ref.pdf
lec4_ref.pdf
vishal choudhary
 
Hadoop
HadoopHadoop
HDFS+basics.pptx
HDFS+basics.pptxHDFS+basics.pptx
HDFS+basics.pptx
Ayush .
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
Jazan University
 
Hadoop
HadoopHadoop
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
AnkitChauhan817826
 
Hadoop at a glance
Hadoop at a glanceHadoop at a glance
Hadoop at a glance
Tan Tran
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
MindsMapped Consulting
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basicssaili mane
 
Hadoop and HDFS
Hadoop and HDFSHadoop and HDFS
Hadoop and HDFS
SatyaHadoop
 

Similar to Introduction to Hadoop Distributed File System(HDFS).pptx (20)

Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
Data Analytics presentation.pptx
Data Analytics presentation.pptxData Analytics presentation.pptx
Data Analytics presentation.pptx
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Seminar ppt
Seminar pptSeminar ppt
Seminar ppt
 
Hadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data AnalyticsHadoop Distributed File System for Big Data Analytics
Hadoop Distributed File System for Big Data Analytics
 
Hadoop
HadoopHadoop
Hadoop
 
lec4_ref.pdf
lec4_ref.pdflec4_ref.pdf
lec4_ref.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
HDFS+basics.pptx
HDFS+basics.pptxHDFS+basics.pptx
HDFS+basics.pptx
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Hadoop at a glance
Hadoop at a glanceHadoop at a glance
Hadoop at a glance
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 
Hadoop and HDFS
Hadoop and HDFSHadoop and HDFS
Hadoop and HDFS
 

More from SakthiVinoth78

Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
SakthiVinoth78
 
ditial Marketing.ppt
ditial Marketing.pptditial Marketing.ppt
ditial Marketing.ppt
SakthiVinoth78
 
chapter - 1.ppt
chapter - 1.pptchapter - 1.ppt
chapter - 1.ppt
SakthiVinoth78
 
FLASH PROGRAM.pptx
FLASH PROGRAM.pptxFLASH PROGRAM.pptx
FLASH PROGRAM.pptx
SakthiVinoth78
 
Unguided media
Unguided mediaUnguided media
Unguided media
SakthiVinoth78
 
Network software
Network softwareNetwork software
Network software
SakthiVinoth78
 
Mobile telephone systems
Mobile telephone systemsMobile telephone systems
Mobile telephone systems
SakthiVinoth78
 
data link layer - Chapter 3
data link layer - Chapter 3data link layer - Chapter 3
data link layer - Chapter 3
SakthiVinoth78
 
computer network - Chapter 1 hardware
computer network - Chapter 1 hardwarecomputer network - Chapter 1 hardware
computer network - Chapter 1 hardware
SakthiVinoth78
 
Application layer
Application layerApplication layer
Application layer
SakthiVinoth78
 
Physical layer
Physical layerPhysical layer
Physical layer
SakthiVinoth78
 
Physical layer
Physical layerPhysical layer
Physical layer
SakthiVinoth78
 

More from SakthiVinoth78 (12)

Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
ditial Marketing.ppt
ditial Marketing.pptditial Marketing.ppt
ditial Marketing.ppt
 
chapter - 1.ppt
chapter - 1.pptchapter - 1.ppt
chapter - 1.ppt
 
FLASH PROGRAM.pptx
FLASH PROGRAM.pptxFLASH PROGRAM.pptx
FLASH PROGRAM.pptx
 
Unguided media
Unguided mediaUnguided media
Unguided media
 
Network software
Network softwareNetwork software
Network software
 
Mobile telephone systems
Mobile telephone systemsMobile telephone systems
Mobile telephone systems
 
data link layer - Chapter 3
data link layer - Chapter 3data link layer - Chapter 3
data link layer - Chapter 3
 
computer network - Chapter 1 hardware
computer network - Chapter 1 hardwarecomputer network - Chapter 1 hardware
computer network - Chapter 1 hardware
 
Application layer
Application layerApplication layer
Application layer
 
Physical layer
Physical layerPhysical layer
Physical layer
 
Physical layer
Physical layerPhysical layer
Physical layer
 

Recently uploaded

Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 

Recently uploaded (20)

Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 

Introduction to Hadoop Distributed File System(HDFS).pptx

  • 2. HDFS(Hadoop Distributed File System) With growing data velocity the data size easily outgrows the storage limit of a machine. A solution would be to store the data across a network of machines. Such filesystems are called distributed filesystems. Since data is stored across a network all the complications of a network come in. This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large files with streaming data access pattern and it runs on commodity hardware.
  • 3. TERMS FOR HDFS  Extremely large files: Here we are talking about the data in range of petabytes(1000 TB).  Streaming Data Access Pattern: HDFS is designed on principle of write-once and read-many-times. Once data is written large portions of dataset can be processed any number times.  Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of feature which specially distinguishes HDFS from other file system.
  • 4. 1. NameNode(MasterNode):  Manages all the slave nodes and assign work to them.  It executes filesystem namespace operations like opening, closing, renaming files and directories.  It should be deployed on reliable hardware which has the high config. not on commodity hardware.
  • 5. 2. DataNode(SlaveNode):  Actual worker nodes, who do the actual work like reading, writing, processing etc.  They also perform creation, deletion, and replication upon instruction from the master.  They can be deployed on commodity hardware.
  • 6. HDFS daemons: Daemons are the processes running in background. Namenodes: •Run on the master node. •Store metadata (data about data) like file path, the number of blocks, block Ids. etc. •Require high amount of RAM. •Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a persistent copy of it is kept on disk. •
  • 7. DataNodes: •Run on slave nodes. •Require high memory as data is actually stored here. Data storage in HDFS: The data is stored in a distributed manner.
  • 8. Lets assume that 100TB file is inserted, then masternode(namenode) will first divide the file into blocks of 10TB (default size is 128 MB in Hadoop 2.x and above). Then these blocks are stored across different datanodes(slavenode). Datanodes(slavenode)replicate the blocks among themselves and the information of what blocks they contain is sent to the master. Default replication factor is 3 means for each block 3 replicas are created (including itself). In hdfs.site.xml we can increase or decrease the replication factor i.e we can edit its configuration here.
  • 9. Why divide the file into blocks?  Answer: Let’s assume that we don’t divide, now it’s very difficult to store a 100 TB file on a single machine. Even if we store, then each read and write operation on that whole file is going to take very high seek time. But if we have multiple blocks of size 128MB then its become easy to perform various read and write operations on it compared to doing it on a whole file at once. So we divide the file to have faster data access i.e. reduce seek time.
  • 10. Why replicate the blocks in data nodes while storing?  Answer: Let’s assume we don’t replicate and only one yellow block is present on datanode D1. Now if the data node D1 crashes we will lose the block and which will make the overall data inconsistent and faulty. So we replicate the blocks to achieve fault-tolerance.
  • 11. Terms related to HDFS:  HeartBeat : It is the signal that datanode continuously sends to namenode. If namenode doesn’t receive heartbeat from a datanode then it will consider it dead.  Balancing : If a datanode is crashed the blocks present on it will be gone too and the blocks will be under-replicated compared to the remaining blocks. Here master node(namenode) will give a signal to datanodes containing replicas of those lost blocks to replicate so that overall distribution of blocks is balanced.  Replication:: It is done by datanode.
  • 12. Features:  Distributed data storage.  Blocks reduce seek time.  The data is highly available as the same block is present at multiple datanodes.  Even if multiple datanodes are down we can still do our work, thus making it highly reliable.  High fault tolerance.
  • 13. Limitations:  Though HDFS provide many features there are some areas where it doesn’t work well.  Low latency data access: Applications that require low-latency access to data i.e in the range of milliseconds will not work well with HDFS, because HDFS is designed keeping in mind that we need high-throughput of data even at the cost of latency.  Small file problem: Having lots of small files will result in lots of seeks and lots of movement from one datanode to another datanode to retrieve each small file, this whole process is a very inefficient data access pattern.