2. HADOOP
• Need to process huge datasets on large clusters of computers
• Very expensive to build reliability into each application
• Nodes fail every day
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
• Need a common infrastructure
Efficient, reliable, easy to use
Open Source, Apache Licence
3. Who uses Hadoop?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• …. many more
4. Commodity Hardware
• Typically a 2-level architecture
Nodes are commodity PCs
30-40 nodes/rack
Uplink from rack is 3-4 gigabit
Rack-internal is 1 gigabit
(Diagram: an aggregation switch connecting the per-rack switches)
5. Goals of HDFS
• Very Large Distributed File System
10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
• Optimized for Batch Processing
Data locations exposed so that computations can move to where data resides
Provides very high aggregate bandwidth
6. Distributed File System
• Single Namespace for entire cluster
• Data Coherency
Write-once-read-many access model
Client can only append to existing files
• Files are broken up into blocks
Typically 64MB block size
Each block replicated on multiple DataNodes
• Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode
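The block model above can be sketched in a few lines. This is a minimal illustration, not Hadoop's actual client code; the function name and tuple layout are made up for clarity.

```python
# Illustrative sketch: how a file is carved into fixed-size blocks.
# 64 MB is the default block size mentioned above.
BLOCK_SIZE = 64 * 1024 * 1024

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, offset, length) tuples covering the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), offset, length))
        offset += length
    return blocks

# A 150 MB file occupies three blocks: 64 MB, 64 MB, and a 22 MB tail.
blocks = split_into_blocks(150 * 1024 * 1024)
```

Each of these blocks would then be replicated independently across DataNodes.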
8. Functions of Name Node
• Manages File System Namespace
• Maps a file name to a set of blocks
• Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
9. Name Node Metadata
• Metadata in Memory
The entire metadata is in main memory
No demand paging of metadata
• Types of metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor
• A Transaction Log
Records file creations, file deletions, etc.
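The metadata types listed above map naturally onto a few in-memory tables. The class below is a hypothetical simplification for illustration; the names do not correspond to real Hadoop classes.

```python
# Sketch of the in-memory state a NameNode maintains (simplified).
class NameNodeMetadata:
    def __init__(self):
        self.file_to_blocks = {}      # file name -> ordered list of block IDs
        self.block_to_datanodes = {}  # block ID -> set of DataNode addresses
        self.file_attrs = {}          # file name -> attributes (replication, ...)
        self.transaction_log = []     # records namespace mutations, not reads

    def create_file(self, name, replication=3):
        self.file_to_blocks[name] = []
        self.file_attrs[name] = {"replication": replication}
        self.transaction_log.append(("create", name))

    def add_block(self, name, block_id, datanodes):
        self.file_to_blocks[name].append(block_id)
        self.block_to_datanodes[block_id] = set(datanodes)
```

Because all of this lives in main memory, every lookup (file to blocks, block to DataNodes) is a fast dictionary access, which is why the slide stresses that there is no demand paging of metadata.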
10. Data Node
• A Block Server
Stores data in the local file system (e.g. ext3)
Stores metadata of a block (e.g. CRC)
Serves data and metadata to Clients
• Block Report
Periodically sends a report of all existing blocks to the Name Node
• Facilitates Pipelining of Data
Forwards data to other specified Data Nodes
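A block report is essentially a list of the block IDs a DataNode currently holds. The message shape below is invented for illustration; the real report format is internal to Hadoop.

```python
# Sketch of the periodic block report a DataNode sends to the Name Node.
def build_block_report(datanode_id, local_blocks):
    """local_blocks: dict of block_id -> length on this DataNode's disk."""
    return {
        "datanode": datanode_id,
        "blocks": sorted(local_blocks),  # the block IDs this node stores
    }
```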
11. Block Placement
• Current Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
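The current strategy above can be sketched as a small placement function. This is a simplified model, assuming a topology given as a rack-to-nodes map; rack and node names are made up, and real HDFS applies extra constraints (disk usage, load) that are omitted here.

```python
import random

# Sketch of the default placement policy: local node, then two nodes
# on one remote rack, then random placement for extra replicas.
def place_replicas(local_node, topology, replication=3):
    """topology: dict of rack -> list of nodes. Returns chosen nodes."""
    local_rack = next(r for r, nodes in topology.items() if local_node in nodes)
    replicas = [local_node]                     # 1st replica: local node
    remote_rack = random.choice([r for r in topology if r != local_rack])
    remote_nodes = [n for n in topology[remote_rack] if n != local_node]
    replicas.append(remote_nodes[0])            # 2nd replica: remote rack
    replicas.append(remote_nodes[1])            # 3rd replica: same remote rack
    # Replicas beyond the third would be placed randomly (not shown).
    return replicas
```

Placing the second and third replicas on one remote rack survives a whole-rack failure while keeping cross-rack write traffic to a single transfer.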
12. Heartbeats
• Data Nodes send heartbeats to the Name Node
Once every 3 seconds
• Name Node uses heartbeats to detect Data Node failure
• Replication Engine
• Name Node detects Data Node failures
Chooses new Data Nodes for new replicas
Balances disk usage
Balances communication traffic to Data Nodes
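Failure detection from heartbeats reduces to checking how stale each DataNode's last heartbeat is. The 3-second interval is from the slide; the 10-minute expiry threshold is an assumption based on Hadoop's commonly cited default, simplified here.

```python
HEARTBEAT_INTERVAL = 3  # seconds between DataNode heartbeats (from the slide)

# Sketch: a DataNode is considered dead once its last heartbeat is
# older than the timeout (10 minutes is an assumed default).
def detect_failed(last_heartbeat, now, timeout=10 * 60):
    """last_heartbeat: dict of DataNode -> timestamp of last heartbeat."""
    return [dn for dn, t in last_heartbeat.items() if now - t > timeout]
```

Once a node lands on this list, the replication engine would pick new DataNodes to bring every affected block back to its target replication factor.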
13. Data Correctness
• Use Checksums to validate data
Use CRC32
• File Creation
Client computes checksum per 512 bytes
DataNode stores the checksum
• File access
Client retrieves the data and checksum from DataNode
If Validation fails, Client tries other replicas
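The per-512-byte CRC32 scheme above can be sketched directly. Here `zlib.crc32` stands in for Hadoop's own CRC implementation; the function names are illustrative.

```python
import zlib

CHUNK = 512  # checksum granularity from the slide: one CRC per 512 bytes

def compute_checksums(data):
    """Return a CRC32 checksum for every 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def validate(data, checksums):
    """True if the data matches the stored checksums. On a mismatch,
    a real client would retry the read against another replica."""
    return compute_checksums(data) == checksums
```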
14. Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next node in the Pipeline
• When all replicas are written, the Client moves on to write the next block in file
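The pipeline in the steps above can be simulated in-process: each DataNode stores the block and then forwards it to the next node in the chain. The in-memory `stores` dict is a stand-in for each DataNode's local disk.

```python
# Sketch of the write pipeline: the client hands the block to the first
# DataNode, which stores it and forwards it down the chain.
def pipeline_write(block, datanodes, stores):
    """datanodes: ordered pipeline; stores: DataNode -> list of stored blocks."""
    if not datanodes:
        return
    first, rest = datanodes[0], datanodes[1:]
    stores[first].append(block)          # store locally
    pipeline_write(block, rest, stores)  # forward to the next DataNode
```

Pipelining means the client only pushes each block over the network once; the DataNodes fan it out among themselves.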
15. User Interface
• Commands for HDFS User:
hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename
• Web Interface
http://host:port/dfshealth.jsp