SlideShare a Scribd company logo
1 of 17
HDFS BASICS
Pre-requisites
 Basic understanding of what file system means
 Practical working knowledge of UNIX file system basics
HDFS : Hadoop distributed file system
(BASIC HDFS)
Architecture of HDFS
How does HDFS store the file internally
Failure handling and recovery mechanism
Rack awareness
Role of name node and secondary name node
When to use HDFS and when not to use it
Agenda
HDFS – The Storage Layer in Hadoop
Splitting of file into blocks in HDFS
File Storage in HDFS
7
Multilayer switches/routers interconnect the switches on each
rack
Hadoop cluster deployed in a production
environment into multiple racks
DN4
DN3
DN2
DN5
DN1
DN6
DN7
DN8
DN9
DN10
DN11
DN12
Q)What happens in the event of a data node
Failure ? (eg : DN 10 fails)
A)Data saved on that node will be lost
To avoid loss of data, copies of the
Data blocks on data node is stored on multiple
data nodes. This is called data replication.
Failure of a data node
 How many copies of each block to save?
Its decided by REPLICATION FACTOR (by
default its 3, i.e. every block of data on each
data node is saved on 2 more machines so that
there is total 3 copies of the same data block on
different machines)
This replication factor can be set on per file
basis while the file is being written to HDFS for
the first time.
9
Replication of data blocks
Design and Architecture Overview
DN4
DN3
DN2
DN5
DN1
DN6
DN7
DN8
DN9
DN10
DN11
DN12
Replication factor =2
DN10 fails
Replication factor for block N2 is now 1 !
Data replication on failure and failed node recovers
DN4
DN3
DN2
DN5
DN1
DN6
DN7
DN8
DN9
DN10
DN11
DN12
Replication factor =2
DN10 fails
The data on the failed node would be copied
to some other node in the cluster
automatically. In this case its copied to
DN12 from its nearest neighbour DN11
Data replication on failure and failed node recovers Pt2
DN4
DN3
DN2
DN5
DN1
DN6
DN7
DN8
DN9
DN10
DN11
DN12
Replication factor = 3 for block N2
HDFS will delete one extra copy of N2 from
any of the 3 nodes (DN10, DN11 or DN12)
This ensures that the replication count is
maintained all the time
DN10 is up again after some time …
Q. How does namenode choose which datanodes to store replicas on?
 Replica Placements are rack aware. Namenode uses the network
location when determining where to place block replicas.
 Tradeoff: Reliability v/s read/write bandwidth e.g.
– If all replica is on single node - lowest write bandwidth but no redundancy if nodes fails
– If replica is off-rack - real redundancy but high read bandwidth (more time)
– If replica is off datacenter – best redundancy at the cost of huge bandwidth
 Hadoop’s default strategy:
– 1st replica on same node as client
– 2nd replica on off rack any random node
– 3rd replica is same rack as 2nd but other node
 Clients always read from the nearest node
 Once the replica locations is chosen a
pipeline is built taking network topology
into account
Replica Placement Strategy
15
I know where
the file blocks are..
I shall back up the data of
name node
BEWARE !
I do not work in HOT STANDBY
mode in the event of name node
failure…..
In Hadoop 1.0, there is no active standby secondary name node.
(HA : Highly available is another term used for HOT/ACTIVE STANDBY )
If the name node fails, the entire cluster goes down ! We need to manually restart
The name node and the contents of the secondary name node has to be copied to it.
Name node & secondary name node in Hadoop 1.0
 HDFS is Good for…
 Storing large files
 Terabytes, Petabytes, etc.
 millions rather billions of files (less number of large files)
 Each file typically 100MB or more
 Streaming data
 WORM - write once read many times patterns
 Optimized for batch/streaming reads rather than random reads
 Append operation added to Hadoop 0.21
 Cheap commodity hardware
 HDFS is not so Good for…
 Large amount of small files
 Better for less no of large files instead of more small files
 Low latency reads
 Many writes: write once, no random writes, append mode write at end of file
When to/not to use HDFS?
17
Introduction to HDFS
Replication factor
Rack awareness
When not to use HDFS
17
Summary

More Related Content

Similar to HDFS+basics.pptx

Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersMindsMapped Consulting
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreducesenthil0809
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hari Shankar Sreekumar
 
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay RadiaApache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay RadiaYahoo Developer Network
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impalaNAVER D2
 

Similar to HDFS+basics.pptx (20)

Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Hdfs
HdfsHdfs
Hdfs
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hadoop -HDFS.ppt
Hadoop -HDFS.pptHadoop -HDFS.ppt
Hadoop -HDFS.ppt
 
13160296.ppt
13160296.ppt13160296.ppt
13160296.ppt
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
HDFS.ppt
HDFS.pptHDFS.ppt
HDFS.ppt
 
Hdfs
HdfsHdfs
Hdfs
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Anatomy of file read in hadoop
Anatomy of file read in hadoopAnatomy of file read in hadoop
Anatomy of file read in hadoop
 
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay RadiaApache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
 
Unit 1
Unit 1Unit 1
Unit 1
 
HDFS.ppt
HDFS.pptHDFS.ppt
HDFS.ppt
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
(Aaron myers) hdfs impala
(Aaron myers)   hdfs impala(Aaron myers)   hdfs impala
(Aaron myers) hdfs impala
 

Recently uploaded

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 

Recently uploaded (20)

Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 

HDFS+basics.pptx

  • 2. Pre-requisites  Basic understanding of what file system means  Practical working knowledge of UNIX file system basics HDFS : Hadoop distributed file system (BASIC HDFS)
  • 3. Architecture of HDFS How does HDFS store the file internally Failure handling and recovery mechanism Rack awareness Role of name node and secondary name node When to use HDFS and when not to use it Agenda
  • 4. HDFS – The Storage Layer in Hadoop
  • 5. Splitting of file into blocks in HDFS
  • 7. 7 Multilayer switches/routers interconnect the switches on each rack Hadoop cluster deployed in a production environment into multiple racks
  • 8. DN4 DN3 DN2 DN5 DN1 DN6 DN7 DN8 DN9 DN10 DN11 DN12 Q)What happens in the event of a data node Failure ? (eg : DN 10 fails) A)Data saved on that node will be lost To avoid loss of data, copies of the Data blocks on data node is stored on multiple data nodes. This is called data replication. Failure of a data node
  • 9.  How many copies of each block to save? Its decided by REPLICATION FACTOR (by default its 3, i.e. every block of data on each data node is saved on 2 more machines so that there is total 3 copies of the same data block on different machines) This replication factor can be set on per file basis while the file is being written to HDFS for the first time. 9 Replication of data blocks
  • 11. DN4 DN3 DN2 DN5 DN1 DN6 DN7 DN8 DN9 DN10 DN11 DN12 Replication factor =2 DN10 fails Replication factor for block N2 is now 1 ! Data replication on failure and failed node recovers
  • 12. DN4 DN3 DN2 DN5 DN1 DN6 DN7 DN8 DN9 DN10 DN11 DN12 Replication factor =2 DN10 fails The data on the failed node would be copied to some other node in the cluster automatically. In this case its copied to DN12 from its nearest neighbour DN11 Data replication on failure and failed node recovers Pt2
  • 13. DN4 DN3 DN2 DN5 DN1 DN6 DN7 DN8 DN9 DN10 DN11 DN12 Replication factor = 3 for block N2 HDFS will delete one extra copy of N2 from any of the 3 nodes (DN10, DN11 or DN12) This ensures that the replication count is maintained all the time DN10 is up again after some time …
  • 14. Q. How does namenode choose which datanodes to store replicas on?  Replica Placements are rack aware. Namenode uses the network location when determining where to place block replicas.  Tradeoff: Reliability v/s read/write bandwidth e.g. – If all replica is on single node - lowest write bandwidth but no redundancy if nodes fails – If replica is off-rack - real redundancy but high read bandwidth (more time) – If replica is off datacenter – best redundancy at the cost of huge bandwidth  Hadoop’s default strategy: – 1st replica on same node as client – 2nd replica on off rack any random node – 3rd replica is same rack as 2nd but other node  Clients always read from the nearest node  Once the replica locations is chosen a pipeline is built taking network topology into account Replica Placement Strategy
  • 15. 15 I know where the file blocks are.. I shall back up the data of name node BEWARE ! I do not work in HOT STANDBY mode in the event of name node failure….. In Hadoop 1.0, there is no active standby secondary name node. (HA : Highly available is another term used for HOT/ACTIVE STANDBY ) If the name node fails, the entire cluster goes down ! We need to manually restart The name node and the contents of the secondary name node has to be copied to it. Name node & secondary name node in Hadoop 1.0
  • 16.  HDFS is Good for…  Storing large files  Terabytes, Petabytes, etc.  millions rather billions of files (less number of large files)  Each file typically 100MB or more  Streaming data  WORM - write once read many times patterns  Optimized for batch/streaming reads rather than random reads  Append operation added to Hadoop 0.21  Cheap commodity hardware  HDFS is not so Good for…  Large amount of small files  Better for less no of large files instead of more small files  Low latency reads  Many writes: write once, no random writes, append mode write at end of file When to/not to use HDFS?
  • 17. 17 Introduction to HDFS Replication factor Rack awareness When not to use HDFS 17 Summary