Hadoop HDFS Architecture and Design

HDFS – Hadoop Distributed File System

Agenda
1. HDFS – Hadoop Distributed File System
- HDFS
- HDFS design and goals
- HDFS components: Namenode, Datanode, Secondary Namenode
- HDFS blocks and replication
- Anatomy of a file read/write in HDFS

- Designed to reliably store very large files across machines in a large cluster
- Data model
  - Data is organized into files and directories
  - In the storage layer, files are divided into uniform-sized blocks (64 MB, 128 MB, 256 MB) and distributed across cluster nodes
  - Blocks are replicated to handle hardware failure
  - The file system keeps checksums of data for corruption detection and recovery
  - HDFS exposes block placement so that computation can be migrated to the data
HDFS
HDFS is a file system designed for storing very large files with
streaming data access patterns, running on clusters of commodity
hardware.

- HDFS is a file system rather than just a storage layer.
  It follows many POSIX file system conventions, while relaxing some POSIX requirements to allow streaming access:
  - File, directory and sub-directory structure
  - Permissions (rwx)
  - Access classes (owner, group, other) and the concept of a superuser
- Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with.
- Fault-tolerant, scalable, distributed storage system
HDFS continues…

- Very large distributed file system
  - 10K nodes, 100 million files and 10 PB are all feasible
- Computation moved to data
  - Data locations are exposed so that computations can move to where the data resides
- Streaming data access
- Assumes commodity hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
- Optimized for batch processing
  - Provides very high aggregate bandwidth
- High throughput of data access: streaming access to data
- Large files: a typical file is gigabytes to terabytes in size; support for tens of millions of files
- Simple coherency: write-once-read-many access model
- Highly fault-tolerant: runs on commodity hardware, which can fail frequently
HDFS Design Goals

Master-slave architecture
- HDFS master: "Namenode"
  - Manages the file system namespace
  - Controls read/write access to files
  - Manages block replication
  - Checkpoints the namespace and journals namespace changes for reliability
- HDFS workers: "Datanodes"
  - Serve read/write requests from clients
  - Perform replication tasks upon instruction by the Namenode
  - Report blocks and system state
- HDFS namespace backup: "Secondary Namenode"
HDFS Components

- Single namespace for the entire cluster
- Files are broken up into a sequence of blocks
  - For a file, all blocks except the last are the same size
  - Each block is replicated on multiple DataNodes
- Data coherency
  - Write-once-read-many access model
  - Clients can only append to files; random in-place changes are not supported
- Intelligent client
  - The client can find the location of blocks
  - The client accesses data directly from the DataNode [Q: How is this possible?]
  - User data never flows through the NameNode
Distributed file system

Distributed file system continue…

HDFS Blocks
HDFS has a large block size
- Default: 64 MB
- Typical: 128 MB, 256 MB, 512 MB…
Blocks in an ordinary filesystem are a few kilobytes.
Unlike in a file system for a single disk, a file in HDFS that is
smaller than a single block does not occupy a full block. If a
file is 10 MB, it needs only 10 MB of space on the local drive,
not the space of a full block.
A file is stored as blocks on various nodes in the Hadoop cluster.
HDFS provides a complete abstraction of this layout to the client.
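As a rough illustration of the block model above, the following sketch computes the block lengths for a file of a given size. The function name split_into_blocks is hypothetical and not part of any Hadoop API; it only shows that every block except the last has the same size, and that a small file takes up only its own size.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the list of block lengths (in bytes) for a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block holds only the leftover bytes
    return blocks

# A 200 MB file with 64 MB blocks: three full blocks plus one 8 MB block.
large = split_into_blocks(200 * MB, 64 * MB)
# A 10 MB file with 64 MB blocks: a single 10 MB block.
small = split_into_blocks(10 * MB, 64 * MB)
print([b // MB for b in large], [b // MB for b in small])
```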

HDFS block placement
HDFS creates several replicas of each data block.
Every data block is replicated to multiple nodes across the
cluster.
A block report lists all the blocks on a Datanode.
HDFS Blocks continues...

- Default is 3 replicas, but configurable
- Blocks are placed as follows (writes are pipelined; rack awareness is covered in a later slide):
  - First replica on the same node as the writer
  - Second replica on a node in a different rack
  - Third replica on a different node in the same rack as the second
- Clients read from the closest replica.
- If the replication for a block drops below target, it is automatically re-replicated.
- Pluggable policy for placing block replicas.
Block Placements

Why are blocks in HDFS so large?
Minimize the cost of seeks
- Make the time to transfer a block much larger than the time to seek to its start
Designed so that blocks can be processed independently.
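A back-of-the-envelope calculation shows why large blocks keep seek overhead small. The figures used here (10 ms seek time, 100 MB/s transfer rate, seeks limited to 1% of transfer time) are illustrative assumptions, not measurements from any particular cluster.

```python
def block_size_for_seek_fraction(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size in MB at which seek time is seek_fraction of transfer time."""
    # If seek_time = seek_fraction * transfer_time, then:
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_rate_mb_s * transfer_time_s

# Assumed hardware figures, chosen only for illustration.
size_mb = block_size_for_seek_fraction(
    seek_time_s=0.010, transfer_rate_mb_s=100, seek_fraction=0.01
)
print(f"block size of about {size_mb:.0f} MB")
```

Under these assumptions the block has to be on the order of 100 MB, which is the same order of magnitude as the 64 MB to 256 MB block sizes listed earlier.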

Benefits of the block abstraction
A file can be larger than any single disk in the network.
Simplifies the storage subsystem.
Provides fault tolerance and availability.
Independent processing, failover and distribution of the
computation. [Q: How is it independent?]

- Data blocks are checked with CRC32
- File creation
  - The client computes a checksum per 512 bytes
  - The DataNode stores the checksums
- File access
  - The client retrieves the data and checksums from the DataNode
  - If validation fails, the client reports the error and tries other replicas
- Corrupted blocks are reported and tolerated.
- Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata, including checksums for the data and the generation stamp.
Data Correctness
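The per-512-byte checksum scheme described above can be sketched as follows. This is a simplified stand-in, assuming Python's zlib.crc32 in place of HDFS's own checksum code and on-disk metadata format; the helper names are hypothetical.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size taken from the slide

def compute_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return one CRC32 value per chunk of the data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]

def first_corrupt_chunk(data, stored_checksums, chunk_size=BYTES_PER_CHECKSUM):
    """Return the index of the first chunk whose checksum mismatches, or None."""
    actual = compute_checksums(data, chunk_size)
    if len(actual) != len(stored_checksums):
        return min(len(actual), len(stored_checksums))
    for index, (got, expected) in enumerate(zip(actual, stored_checksums)):
        if got != expected:
            return index
    return None

# Writer side: compute checksums when the data is created.
block = bytes(range(256)) * 6          # 1536 bytes, i.e. three chunks
stored = compute_checksums(block)

# Reader side: a flipped byte at offset 700 falls in chunk 1.
damaged = bytearray(block)
damaged[700] ^= 0xFF
print(first_corrupt_chunk(block, stored), first_corrupt_chunk(bytes(damaged), stored))
```

Checking small chunks rather than the whole block means a reader can detect corruption in just the range it reads.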

- Data integrity is maintained at the block level. [Q: Why at the block level and not the file level?]
- The client reads data along with its checksums, computes the checksum of every block it receives, and verifies that the corresponding checksums match. If they do not match, the client can retrieve the block from another replica. The corrupt replica is deleted and a new replica is created.
- Checksums are verified on each read. A block that is not accessed for a long time could become corrupt unnoticed, so blocks are also checked periodically.
Data Integrity

- An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. The HDFS namespace is stored and maintained by the Namenode. The Namenode keeps its metadata in two binary files in its storage directory:
  - edits
  - fsimage
- The Namenode maintains a 'namespace ID', which is persistently stored on all nodes of the cluster. The namespace ID is assigned to the filesystem instance when it is formatted.
Name Node

- The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas.
- Metadata in memory
  - The entire metadata is held in main memory for fast access
  - No demand paging and no I/O wait for metadata
- Metadata content
  - Hierarchical file system with directories and files
  - List of blocks for each file
  - File attributes, e.g. access time, replication factor
- For improved durability, redundant copies of the checkpoint and journal are typically stored on multiple independent local volumes and on remote NFS servers.
NameNode Metadata

- The inode structure and the list of blocks that define the metadata are stored in the file fsimage.
- The fsimage file is a persistent checkpoint of the filesystem metadata. It is loaded at Namenode start-up.
- The NameNode records changes to HDFS in a write-ahead log called the transaction log, kept in its local native filesystem in the file edits.
- Each transaction is recorded in the transaction log, and the journal file is flushed and synced before the acknowledgment is sent to the client.
- The locations of block replicas are not part of the persistent checkpoint.
- The checkpoint file is never changed by the NameNode (checkpoints and the Secondary Namenode are covered in later topics).
NameNode Metadata continues…
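The write-ahead rule above (journal the change, flush and sync it, and only then acknowledge) can be sketched in a few lines. This is a toy model under stated assumptions: the class TinyNameNode, the JSON record format and the single "create" operation are all hypothetical, and the real edits file is a binary format.

```python
import json
import os
import tempfile

class TinyNameNode:
    """Toy namespace that journals each change before acknowledging it."""

    def __init__(self, edits_path):
        self.edits_path = edits_path
        self.namespace = {}  # in-memory metadata: path -> list of block IDs

    def create_file(self, path):
        record = {"op": "create", "path": path}
        with open(self.edits_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()              # push Python's buffer to the OS
            os.fsync(log.fileno())   # ask the OS to persist it to disk
        self.namespace[path] = []    # apply in memory once the journal is durable
        return "ack"

def replay(edits_path):
    """Rebuild the namespace from the journal, as a restart would."""
    namespace = {}
    with open(edits_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            if record["op"] == "create":
                namespace[record["path"]] = []
    return namespace

with tempfile.TemporaryDirectory() as tmp:
    edits = os.path.join(tmp, "edits")
    nn = TinyNameNode(edits)
    nn.create_file("/user/a.txt")
    nn.create_file("/user/b.txt")
    recovered = replay(edits)

print(recovered == nn.namespace)
```

Because every acknowledged change is already on disk, replaying the journal after a crash recovers everything a client was told had succeeded.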

[What, how and where]
- Maintains the namespace tree
- Maintains the mapping of Datanodes to lists of blocks
- Receives heartbeats and monitors Datanode health
- Replicates missing blocks
- Records file system changes
- Authorization and authentication
Name node functions

The DataNode is a slave daemon that performs the grunt work of the
distributed filesystem: reading and writing HDFS blocks to
actual files on the local filesystem.
The slave stores data in files in its local file system.
A Datanode has no knowledge of HDFS files.
It stores each block of HDFS data in a separate file.
Clients access the blocks directly from Datanodes after
communicating with the Namenode.
Blocks are stored as files of the underlying OS. The Datanode does
not create all files in the same directory; it works out a suitable
number of files per directory and creates subdirectories
appropriately.
Data Node

[Read, write, report]
- During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.
- Serves read and write requests, and performs block creation, deletion and replication upon instruction from the Namenode
- Periodically sends heartbeats and block reports to the Namenode
- A DataNode identifies the block replicas in its possession to the NameNode by sending a block report.
- A block report contains the block ID, the generation stamp and the length for each block replica the server hosts.
Data Node functions

- During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.
- Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress.
- These statistics are used for the NameNode's block allocation and load balancing decisions.
- The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node.
- To maintain overall system integrity it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
Data node heartbeats
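The heartbeat exchange above can be modelled as a small bookkeeping class: commands wait in a queue until the target DataNode's next heartbeat, and a node that stays silent too long is treated as dead. This is only a sketch; HeartbeatMonitor, its method names and the 600-second timeout are assumptions for illustration, not Hadoop's actual classes or configuration.

```python
class HeartbeatMonitor:
    """Toy NameNode-side view of DataNode heartbeats."""

    def __init__(self, timeout_s=600.0):
        self.timeout_s = timeout_s   # assumed silence threshold
        self.last_seen = {}          # node -> time of last heartbeat
        self.pending = {}            # node -> commands awaiting delivery

    def queue_command(self, node, command):
        """Store an instruction until the node next reports in."""
        self.pending.setdefault(node, []).append(command)

    def heartbeat(self, node, now):
        """Record a heartbeat and return queued commands as the reply."""
        self.last_seen[node] = now
        return self.pending.pop(node, [])

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

monitor = HeartbeatMonitor()
monitor.queue_command("dn1", "replicate blk_1 to dn2")
reply_dn1 = monitor.heartbeat("dn1", now=0.0)    # carries the queued command
reply_dn2 = monitor.heartbeat("dn2", now=0.0)    # nothing queued for dn2
monitor.heartbeat("dn1", now=500.0)              # dn1 keeps reporting, dn2 goes silent
dead = monitor.dead_nodes(now=700.0)
print(reply_dn1, reply_dn2, dead)
```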

- The Secondary NameNode (SNN) is an assistant daemon for monitoring and backing up the state of the cluster's HDFS metadata.
- The SNN communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
- The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize downtime and loss of data. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
Secondary Name Node

Secondary Namenode interaction with Namenode
(Checkpointing and recovery are covered in later topics.)
The SNN periodically merges the namespace image with the edit log to
prevent the edit log from becoming too large.
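The merge described above amounts to replaying the edit log onto the last namespace image and starting a fresh, empty log. The sketch below models both as plain Python structures; the operation names (create, add_block, delete) and the dictionary layout are hypothetical and do not reflect the real fsimage or edits formats.

```python
def apply_edit(namespace, edit):
    """Apply one journal record to a namespace (path -> list of block IDs)."""
    op = edit["op"]
    if op == "create":
        namespace[edit["path"]] = []
    elif op == "add_block":
        namespace[edit["path"]].append(edit["block"])
    elif op == "delete":
        namespace.pop(edit["path"], None)
    else:
        raise ValueError(f"unknown operation: {op}")

def checkpoint(fsimage, edits):
    """Return (new_fsimage, new_edits) after merging the log into the image."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edits:
        apply_edit(merged, edit)
    return merged, []  # the new edit log starts out empty

old_image = {"/a": ["blk_1"]}
edit_log = [
    {"op": "create", "path": "/b"},
    {"op": "add_block", "path": "/b", "block": "blk_2"},
    {"op": "delete", "path": "/a"},
]
new_image, new_log = checkpoint(old_image, edit_log)
print(new_image, new_log)
```

The old image is copied rather than modified, which mirrors the point that an existing checkpoint file is never changed in place.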

Anatomy of a File Read in HDFS
One important aspect of this design is that the client contacts datanodes
directly to retrieve data and is guided by the namenode to the best datanode for
each block. There is a direct connection between client and datanode.
On failure: move to the next closest node that holds the block.

1. The client connects to the NameNode with the file name.
2. The namenode performs various checks to make sure the file
exists, the client has the right permissions, etc.
3. For each block, the namenode returns the addresses of the datanodes
that have a copy of that block (locality is considered).
4. The datanodes for each block are sorted by their proximity to the
client. Unlike a write, a read does not form a pipeline: the client
reads each block from a single datanode.
5. The client connects to the first (closest) datanode for the first
block in the file and reads it. It then finds the best datanode for the
next block, and so on until the whole file has been read.
6. The client verifies the checksums of the data transferred to it
from the datanode for each block.
7. During reading, if the client encounters an error while
communicating with a datanode, or finds a corrupted block, it
tries the next closest datanode for that block.
Anatomy of a File Read in HDFS continue…
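The failure handling in the read path (try the closest replica first, fall back to the next one on a connection error or checksum mismatch) can be sketched as below. This is a simplification under stated assumptions: read_block is a hypothetical helper, each replica is represented by a plain callable, and a single whole-block CRC32 from zlib stands in for HDFS's per-chunk checksums.

```python
import zlib

def read_block(replicas, expected_crc):
    """Try replicas closest-first; return (node, data, failed_nodes)."""
    failed = []
    for node, fetch in replicas:
        try:
            data = fetch()
        except OSError:              # unreachable datanode
            failed.append(node)
            continue
        if zlib.crc32(data) == expected_crc:
            return node, data, failed
        failed.append(node)          # corrupt replica, try the next one
    raise OSError("no readable replica for this block")

good = b"hello hdfs"
crc = zlib.crc32(good)

def unreachable():
    raise ConnectionError("datanode down")

# Replicas ordered by assumed distance from the client.
replicas = [
    ("dn1", unreachable),            # closest, but down
    ("dn2", lambda: b"hellx hdfs"),  # reachable, but corrupt
    ("dn3", lambda: good),           # healthy
]
node, data, failed = read_block(replicas, crc)
print(node, failed)
```

The list of failed nodes is returned so that a caller could report bad replicas, in the spirit of the reporting step described in the slides.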

Anatomy of a File Write in HDFS

1. The client connects to the NameNode with the file name.
2. The namenode performs various checks to make sure the file doesn't
already exist, the client has the right permissions, etc.
3. The NameNode places an entry for the file in its metadata and returns the
block name and a list of DataNodes to the client.
4. The list of datanodes forms a pipeline. We'll assume the replication
level is three, so there are three nodes in the pipeline.
5. The client connects to the first DataNode and starts sending data. As
data is received by the first DataNode, it stores each packet and forwards
it to the second DataNode. Similarly, the second DataNode
stores the packet and forwards it to the third DataNode in the
pipeline.
6. A packet is removed from the ack queue only when it has been
acknowledged by all the datanodes in the pipeline.
7. If a datanode fails while data is being written to it, the partial block on the
failed datanode is deleted if that datanode recovers later on.
8. The client reports to the NameNode when the block is written.
Anatomy of a File Write in HDFS continue…
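The data queue, ack queue and pipeline from the write steps can be modelled roughly as follows. It is a sequential toy, not the real streaming protocol: the DataNode class, the fail_on hook and the recovery rule (drop the failed node and resend unacknowledged packets to the remaining nodes) are simplifying assumptions made for this sketch.

```python
from collections import deque

class DataNode:
    """Toy datanode that stores packets and can simulate a crash."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on       # sequence number that triggers a failure
        self.stored = []

    def receive(self, packet):
        seq, _payload = packet
        if seq == self.fail_on:
            raise ConnectionError(f"{self.name} failed")
        if packet not in self.stored:    # ignore resent duplicates
            self.stored.append(packet)

def write_block(packets, pipeline):
    """Send packets down the pipeline; return the surviving pipeline."""
    data_queue = deque(packets)
    ack_queue = deque()
    pipeline = list(pipeline)
    while data_queue:
        if not pipeline:
            raise OSError("all datanodes in the pipeline failed")
        packet = data_queue.popleft()
        ack_queue.append(packet)         # sent but not yet acknowledged
        try:
            for node in pipeline:        # each node stores, then forwards
                node.receive(packet)
        except ConnectionError:
            pipeline.remove(node)        # drop the failed node
            data_queue.extendleft(reversed(ack_queue))  # resend unacked packets
            ack_queue.clear()
            continue
        ack_queue.popleft()              # acknowledged by every node
    return pipeline

dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2", fail_on=1), DataNode("dn3")
packets = [(0, b"a"), (1, b"b"), (2, b"c")]
survivors = write_block(packets, [dn1, dn2, dn3])
print([n.name for n in survivors], len(dn1.stored), len(dn2.stored), len(dn3.stored))
```

After the simulated failure, dn2 is left holding a partial block, which corresponds to the partial replica that is deleted if the failed node later recovers.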

Replication and Rack-awareness
- Replication in Hadoop is at the block level.
- The default replication factor is 3 and is configurable.
- Blocks are replicated for fault tolerance.
- A file's replication factor is configurable per file and can be changed dynamically.
- Rack-aware replica placement. Goal: improve reliability, availability and network bandwidth utilization
- There are many racks; communication between racks goes through switches.
- Network bandwidth between machines on the same rack is greater than between machines in different racks.

- The Namenode determines the rack ID for each DataNode.
Replication and Rack-awareness continues…
Replicas are placed: one on a node in the local rack, one on a
node in a different (remote) rack and one on a different node in
that same remote rack.
As a result, one third of the replicas are on one node, two thirds
are on one rack, and any remaining replicas are distributed evenly
across the remaining racks.
Replica selection for READ operations: HDFS tries to
minimize bandwidth consumption and latency.
Selection of blocks to process in a MapReduce job takes
advantage of rack-awareness.
Rack-awareness is NOT automatic and needs to be
configured. By default, all nodes are assumed to be in the
same rack.
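A rack-aware placement for three replicas can be sketched as below. The sketch assumes one specific policy (first replica on the writer's node, second on a node in a different rack, third on another node in that second rack) and a hypothetical topology dictionary; it is not Hadoop's placement code, which is pluggable and handles many more cases.

```python
import random

def place_replicas(writer, topology, rng=random):
    """Pick three nodes for a block written from `writer`.

    topology maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # Candidate remote racks need two nodes to hold replicas two and three.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]

# Hypothetical three-rack cluster.
topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas("n1", topology, random.Random(0))
print(replicas)
```

Losing any single node or any single rack still leaves at least one replica, while only one of the two forwarding hops in the write pipeline crosses racks.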

Block Re-replication
The necessity for re-replication may arise due to:
 A Datanode may become unavailable,
 A replica may become corrupted,
 A hard disk on a Datanode may fail, or
 The replication factor on the block may be increased.
Under-replication and over-replication of blocks are detected by the
Namenode.
The balancer application moves blocks around to even out Datanode
utilization.
(The balancer is covered in a later topic.)
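Detecting under- and over-replicated blocks comes down to comparing the number of live replicas of each block with the target replication factor. The sketch below assumes the NameNode's view is available as a simple dictionary from block ID to node names; replication_report and its arguments are hypothetical names.

```python
def replication_report(block_locations, target, live_nodes):
    """Return (under, over): block -> number of replicas missing or surplus."""
    live_nodes = set(live_nodes)
    under, over = {}, {}
    for block, nodes in block_locations.items():
        live = len(set(nodes) & live_nodes)   # replicas on nodes still alive
        if live < target:
            under[block] = target - live
        elif live > target:
            over[block] = live - target
    return under, over

block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn4"],                  # dn4 is down in this example
    "blk_3": ["dn1", "dn2", "dn3", "dn5"],
}
live = {"dn1", "dn2", "dn3", "dn5"}
under, over = replication_report(block_locations, target=3, live_nodes=live)
print(under, over)
```

Blocks in the first dictionary would be scheduled for re-replication, and surplus replicas of blocks in the second could be removed.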

HDFS is a poor fit for
Low-latency data access
Lots of small files
Transactional access and updates
Multiple writers, arbitrary file modifications

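Coherency Model
Data being written is not guaranteed to be visible to readers while a copy is in progress
Use sync() to force written data to become visible
Write once, read many
Apply this model in applications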

Command Line
Similar to *nix
 hadoop fs -ls /
 hadoop fs -mkdir /test
 hadoop fs -rmr /test
 hadoop fs -cp /1 /2
 hadoop fs -copyFromLocal /3 hdfs://localhost/
Namenode-specific:
 hadoop namenode -format
 start-all.sh

Command Line
Sorting: Standard method to test cluster
 TeraGen: Generate dummy data
 TeraSort: Sort
 TeraValidate: Validate sort result
Command Line:
 hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar
terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41

