HDFS Architecture and Design


Hadoop Distributed File System

Understand the basic design of HDFS and how it relates to basic distributed file system concepts



    HDFS Architecture and Design: Presentation Transcript

    • HDFS – Hadoop Distributed File System
    • Agenda
      • HDFS overview
      • HDFS design and goals
      • HDFS components: NameNode, DataNode, Secondary NameNode
      • HDFS blocks and replication
      • Anatomy of a file read/write in HDFS
    • HDFS
      • HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
      • Designed to reliably store very large files across machines in a large cluster.
      • Data model: data is organized into files and directories.
      • In the storage layer, files are divided into uniform-sized blocks (64 MB, 128 MB, 256 MB) and distributed across cluster nodes.
      • Blocks are replicated to handle hardware failure.
      • The file system keeps checksums of data for corruption detection and recovery.
      • HDFS exposes block placement so that computation can be migrated to the data.
    • HDFS (continued)
      • HDFS is a file system rather than just a storage layer. It follows many POSIX file system conventions, although it relaxes some POSIX requirements in favour of streaming access:
        • File, directory and sub-directory structure
        • Permissions (rwx)
        • Access classes (owner, group, other) and the concept of a superuser
      • Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with.
      • Fault-tolerant, scalable, distributed storage system.
    • HDFS Design Goals
      • Very large distributed file system: 10K nodes, 100 million files and 10 PB are feasible.
      • Computation moved to data: data locations are exposed so that computations can move to where the data resides.
      • Streaming data access.
      • Assumes commodity hardware: files are replicated to handle hardware failure, and HDFS detects failures and recovers from them.
      • Optimized for batch processing: provides very high aggregate bandwidth.
      • High throughput of data access: streaming access to data.
      • Large files: a typical file is gigabytes to terabytes in size, with support for tens of millions of files.
      • Simple coherency: write-once-read-many access model.
      • Highly fault-tolerant: runs on commodity hardware, which can fail frequently.
    • HDFS Components: master-slave architecture
      • HDFS master, the "NameNode"
        • Manages the file system namespace
        • Controls read/write access to files
        • Manages block replication
        • Checkpoints the namespace and journals namespace changes for reliability
      • HDFS workers, the "DataNodes"
        • Serve read/write requests from clients
        • Perform replication tasks upon instruction by the NameNode
        • Report blocks and system state
      • HDFS namespace backup, the "Secondary NameNode"
    • HDFS components interaction
    • HDFS Components relation
    • Distributed file system
      • Single namespace for the entire cluster
      • Files are broken up into a sequence of blocks
        • For a file, all blocks except the last are the same size
        • Each block is replicated on multiple DataNodes
      • Data coherency
        • Write-once-read-many access model
        • Clients can only append to files; random modification of existing data is not allowed
      • Intelligent client
        • The client can find the location of blocks
        • The client accesses data directly from the DataNode [Q: How is this possible?]
        • User data never flows through the NameNode
    • Distributed file system (continued)
    • HDFS Blocks HDFS has Large block size  Default 64MB  Typical 128MB, 256MB, 512MB… Normal Filesystem blocks are few kilobytes. Unlike a file system for a single disk. A file in HDFS that is smaller than a single block does not occupy a full block. if a block is 10MB it needs only 10MB of the space of the full block on the local drive. A file is stored in blocks on various nodes in hadoop cluster. Provides complte abstrction view to client.
    • HDFS Blocks (continued): block placement
      • HDFS creates several replicas of each data block.
      • Every data block is replicated to multiple nodes across the cluster.
      • A BlockReport lists all the blocks on a DataNode.
    • Block Placement
      • Default is 3 replicas, but this is configurable.
      • Blocks are placed as follows (writes are pipelined; rack-awareness is covered in a later slide):
        • On the same node
        • On a different rack
        • On the other rack
      • Clients read from the closest replica.
      • If the replication for a block drops below target, it is automatically re-replicated.
      • Pluggable policy for placing block replicas.
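    • Why are blocks in HDFS so large?
      • To minimize the cost of seeks.
      • To make the time to transfer a block much larger than the time to seek to its start, so that reads run close to the disk transfer rate.
      • Designed so that blocks can be processed independently.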
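A short worked calculation shows why a large block amortizes the seek. The 10 ms seek time and 100 MB/s transfer rate below are illustrative assumptions, not figures from the slides, and `seek_overhead` is a hypothetical helper.

```python
def seek_overhead(block_size_mb, seek_time_ms=10.0, transfer_rate_mb_s=100.0):
    """Fraction of the total time to read one block that is spent seeking.

    seek_time_ms and transfer_rate_mb_s are assumed, illustrative disk figures.
    """
    transfer_time_ms = block_size_mb / transfer_rate_mb_s * 1000.0
    return seek_time_ms / (seek_time_ms + transfer_time_ms)


# Compare a few-kilobyte block with typical HDFS block sizes.
for size_mb in (0.004, 64, 128):
    print(f"{size_mb} MB block: {seek_overhead(size_mb):.1%} of read time is seek")
```

With these assumed figures, a few-kilobyte block spends almost all of its read time seeking, while a 64 MB block spends under 2% of it.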
    • Benefits of the block abstraction
      • A file can be larger than any single disk in the network.
      • Simplifies the storage subsystem.
      • Provides fault tolerance and availability.
      • Independent processing, failover and distribution of the computation. [Q: How is it independent?]
    • Data Correctness
      • Data blocks are checked with CRC32.
      • File creation
        • The client computes a checksum per 512 bytes
        • The DataNode stores the checksums
      • File access
        • The client retrieves the data and checksums from the DataNode
        • If validation fails, the client reports it and tries other replicas
      • Corrupted blocks are reported and tolerated.
      • Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata, including checksums for the data and the generation stamp.
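The per-chunk checksum scheme can be sketched as follows. This is a minimal illustration using Python's `zlib.crc32`; the helper names are hypothetical, and the real client and DataNode code differs in its details.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size quoted on the slide


def compute_checksums(data, chunk_size=BYTES_PER_CHECKSUM):
    """Return one CRC32 value per `chunk_size`-byte chunk of `data`."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data, expected, chunk_size=BYTES_PER_CHECKSUM):
    """Return the indices of chunks whose CRC32 differs from the stored value."""
    actual = compute_checksums(data, chunk_size)
    if len(actual) != len(expected):
        raise ValueError("checksum count does not match data length")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block = bytes(range(256)) * 6        # 1536 bytes, i.e. three 512-byte chunks
stored = compute_checksums(block)    # what the DataNode would keep as metadata
damaged = bytearray(block)
damaged[700] ^= 0xFF                 # corrupt one byte inside the second chunk
print(find_corrupt_chunks(bytes(damaged), stored))
```

A reader that gets a non-empty result would report the replica and fetch the block from another DataNode, as the slide describes.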
    • Data Integrity
      • Data integrity is maintained at the block level. [Q: Why at the block level and not the file level?]
      • The client reads data along with its checksums, computes the checksum of every block it receives and verifies that the corresponding checksums match. If they do not match, the client can retrieve the block from another replica. The corrupt replica is deleted and a new replica is created.
      • Checksums are verified on each operation. A block that is not accessed for a long time could become corrupted unnoticed, so blocks are also checked periodically.
    • NameNode
      • An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
      • The HDFS file system namespace is stored and maintained by the NameNode.
      • The NameNode maintains metadata in two binary files in its storage directory:
        • edits
        • fsimage
      • The NameNode maintains a "namespace ID", which is persistently stored on all nodes of the cluster. The namespace ID is assigned to the filesystem instance when it is formatted.
    • NameNode Metadata
      • The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, and namespace and disk space quotas.
      • Metadata in memory
        • The entire metadata is in main memory for fast access
        • No demand paging and no I/O wait for metadata
      • Metadata content
        • Hierarchical file system with directories and files
        • List of blocks for each file
        • File attributes, e.g. access time, replication factor
      • For improved durability, redundant copies of the checkpoint and journal are typically stored on multiple independent local volumes and on remote NFS servers.
    • NameNode Metadata (continued)
      • The inode structure and the list of blocks that define the metadata are stored in the file fsimage.
      • The fsimage file is a persistent checkpoint of the filesystem metadata. It is loaded at NameNode start-up.
      • The NameNode records changes to HDFS in a write-ahead log called the "transaction log", kept in its local native filesystem in the file edits.
      • Each transaction is recorded in the transaction log, and the journal file is flushed and synced before the acknowledgment is sent to the client.
      • The locations of block replicas are not part of the persistent checkpoint.
      • The checkpoint file is never changed by the NameNode. (Checkpoints and the Secondary NameNode are covered in later topics.)
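The relationship between the checkpoint and the journal can be illustrated with a toy replay. This is a sketch under loud assumptions: the real fsimage and edits files are binary, and the dictionary layout, operation names and helper functions here are all hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one journaled operation to an in-memory namespace (path -> block list)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edits):
    """Rebuild the namespace at start-up: load the checkpoint, then replay the journal."""
    # Copy the checkpoint so that replaying edits never modifies it.
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edits:
        apply_edit(namespace, edit)
    return namespace


fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}   # toy checkpoint
edits = [                                       # toy transaction log
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_3"),
    ("delete", "/logs/a.txt"),
]
print(load_namespace(fsimage, edits))
```

The checkpoint is only read, never modified, which mirrors the statement that the NameNode never changes the checkpoint file.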
    • NameNode functions [what, how and where]
      • Maintains the namespace tree
      • Maintains the mapping of DataNodes to lists of blocks
      • Receives heartbeats and monitors DataNode health
      • Replicates missing blocks
      • Records file system changes
      • Authorization and authentication
    • DataNode
      • The DataNode is a slave daemon that performs the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local filesystem.
      • The slave stores data in files in its local file system. The DataNode has no knowledge of the HDFS filesystem; it stores each block of HDFS data in a separate file.
      • Clients access blocks directly from DataNodes after communicating with the NameNode.
      • Blocks are stored as files of the underlying OS. The DataNode does not create all files in the same directory; it keeps a suitable number of files per directory and creates subdirectories as needed.
    • DataNode functions [read, write, report]
      • During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.
      • Serves read and write requests, and performs block creation, deletion and replication upon instruction from the NameNode.
      • Periodically sends heartbeats and block reports to the NameNode.
      • A DataNode identifies the block replicas in its possession to the NameNode by sending a block report.
      • A block report contains the block ID, the generation stamp and the length for each block replica the server hosts.
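Because replica locations are learned from block reports rather than stored in the checkpoint, the NameNode has to assemble a block-to-DataNode map from them. A minimal sketch of that bookkeeping follows; `ReplicaInfo` and `build_block_map` are hypothetical names, and only the three fields listed on the slide are modelled.

```python
from collections import defaultdict
from typing import NamedTuple


class ReplicaInfo(NamedTuple):
    """One entry of a block report, with the fields listed on the slide."""
    block_id: int
    generation_stamp: int
    length: int


def build_block_map(block_reports):
    """Map each block ID to the sorted list of DataNodes that reported a replica.

    `block_reports` maps a DataNode name to the list of ReplicaInfo it sent.
    """
    locations = defaultdict(set)
    for datanode, replicas in block_reports.items():
        for replica in replicas:
            locations[replica.block_id].add(datanode)
    return {block_id: sorted(nodes) for block_id, nodes in locations.items()}


reports = {
    "dn1": [ReplicaInfo(1, 1001, 64), ReplicaInfo(2, 1001, 10)],
    "dn2": [ReplicaInfo(1, 1001, 64)],
    "dn3": [ReplicaInfo(2, 1001, 10), ReplicaInfo(1, 1001, 64)],
}
print(build_block_map(reports))
```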
    • DataNode heartbeats
      • During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and the block replicas it hosts are available.
      • Heartbeats from a DataNode also carry information about total storage capacity, fraction of storage in use, and the number of data transfers currently in progress.
      • These statistics are used for the NameNode's block allocation and load balancing decisions.
      • The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node.
      • To maintain overall system integrity it is critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands of heartbeats per second without affecting other NameNode operations.
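The two ideas on this slide, liveness tracking and instructions carried in heartbeat replies, can be sketched together. `HeartbeatMonitor` is a hypothetical class, and the 600-second timeout is an illustrative assumption rather than a value from the slides.

```python
class HeartbeatMonitor:
    """Toy NameNode-side bookkeeping for DataNode heartbeats."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}          # DataNode name -> time of last heartbeat
        self.pending_commands = {}   # DataNode name -> queued instructions

    def queue_command(self, datanode, command):
        """Queue an instruction; it is delivered with the next heartbeat reply."""
        self.pending_commands.setdefault(datanode, []).append(command)

    def receive_heartbeat(self, datanode, now):
        """Record the heartbeat and return queued instructions as the reply."""
        self.last_seen[datanode] = now
        return self.pending_commands.pop(datanode, [])

    def dead_nodes(self, now):
        """Return DataNodes whose last heartbeat is older than the timeout."""
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=600)  # assumed timeout
monitor.receive_heartbeat("dn1", now=0)
monitor.receive_heartbeat("dn2", now=0)
monitor.queue_command("dn1", "replicate blk_7 to dn3")
print(monitor.receive_heartbeat("dn1", now=3))   # the reply carries the command
print(monitor.dead_nodes(now=601))               # dn2 has been silent too long
```

Passing `now` explicitly keeps the sketch deterministic; a real implementation would read a clock.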
    • Secondary NameNode
      • The Secondary NameNode (SNN) is an assistant daemon for monitoring and backing up the state of the cluster's HDFS.
      • The SNN communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
      • The NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize downtime and loss of data. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
    • Secondary NameNode interaction with NameNode
      • The SNN periodically merges the namespace image with the edit log to prevent the edit log from becoming too large.
      • Checkpointing and recovery are covered in later topics.
    • Anatomy of a File Read in HDFS One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block Direct connection between client and datanode. Failure : Move to next 'closest' node with the block.
    • Anatomy of a File Read in HDFS (continued)
      1. The client connects to the NameNode with the file name.
      2. The NameNode performs various checks to make sure the file exists, the client has the right permissions, etc.
      3. For each block, the NameNode returns the addresses of the DataNodes that have a copy of that block (locality is considered).
      4. The returned DataNodes are ordered by proximity to the client; we'll assume the replication level is three, so there are three candidate nodes per block.
      5. The client connects to the first (closest) DataNode for the first block in the file and reads it. It then finds the best DataNode for the next block, and so on until the whole file is read.
      6. The client verifies checksums for the data transferred to it from each DataNode.
      7. During reading, if the client encounters an error while communicating with a DataNode, or finds a corrupted block, it tries the next closest DataNode for that block.
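The read steps above can be sketched as a loop over blocks that tries replicas closest-first. This is a minimal simulation, not the HDFS client: `read_block`, `read_file`, `fake_fetch` and the in-memory `store` are all hypothetical.

```python
def read_block(block_id, replicas, fetch):
    """Try replicas in order (closest first) and return the first successful read.

    `fetch(datanode, block_id)` is a hypothetical callable that returns bytes,
    or raises IOError on a connection failure or checksum mismatch.
    """
    errors = []
    for datanode in replicas:
        try:
            return fetch(datanode, block_id)
        except IOError as exc:
            errors.append((datanode, str(exc)))  # remember it, try the next one
    raise IOError(f"all replicas failed for {block_id}: {errors}")


def read_file(block_locations, fetch):
    """Read blocks in file order; `block_locations` is a list of (block_id, replicas)."""
    return b"".join(read_block(block_id, replicas, fetch)
                    for block_id, replicas in block_locations)


store = {
    ("dn1", "blk_1"): b"hello ", ("dn2", "blk_1"): b"hello ",
    ("dn2", "blk_2"): b"world", ("dn3", "blk_2"): b"world",
}


def fake_fetch(datanode, block_id):
    """Simulated DataNode read in which dn2 is unreachable."""
    if datanode == "dn2":
        raise IOError("dn2 unreachable")
    return store[(datanode, block_id)]


locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]
print(read_file(locations, fake_fetch))
```

The second block is served by `dn3` after `dn2` fails, which is the failover behaviour described in step 7.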
    • Anatomy of a File Read
    • Anatomy of a File Write in HDFS
    • Anatomy of a File Write in HDFS (continued)
      1. The client connects to the NameNode with the file name.
      2. The NameNode performs various checks to make sure the file doesn't already exist, the client has the right permissions, etc.
      3. The NameNode places an entry for the file in its metadata and returns the block name and list of DataNodes to the client.
      4. The list of DataNodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline.
      5. The client connects to the first DataNode and starts sending data. As data is received by the first DataNode, it connects to the second and starts forwarding data. Similarly, the second DataNode stores each packet and forwards it to the third DataNode in the pipeline.
      6. A packet is removed from the ack queue only when it has been acknowledged by all the DataNodes in the pipeline.
      7. If a DataNode fails while data is being written to it, the partial block on the failed DataNode will be deleted if that DataNode recovers later on.
      8. The client reports to the NameNode when the block is written.
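The pipeline and ack-queue behaviour in steps 4 to 6 can be simulated synchronously. This is a deliberately simplified sketch: the real pipeline streams packets asynchronously, and `write_through_pipeline` and its `failed` parameter are hypothetical.

```python
from collections import deque


def write_through_pipeline(packets, pipeline, failed=frozenset()):
    """Push packets through a DataNode pipeline.

    Returns (stored, unacked): what each node stored, and the sequence numbers
    still waiting in the ack queue. A packet leaves the ack queue only once
    every node in the pipeline has acknowledged it.
    """
    stored = {node: [] for node in pipeline}
    ack_queue = deque()
    for seq, packet in enumerate(packets):
        ack_queue.append(seq)
        acked_by = []
        for node in pipeline:
            if node in failed:
                break  # the packet never reaches this node or those after it
            stored[node].append(packet)  # store, then forward to the next node
            acked_by.append(node)
        if len(acked_by) == len(pipeline):
            ack_queue.remove(seq)
    return stored, list(ack_queue)


packets = [b"p0", b"p1"]
print(write_through_pipeline(packets, ["dn1", "dn2", "dn3"]))
print(write_through_pipeline(packets, ["dn1", "dn2", "dn3"], failed={"dn3"}))
```

In the second call the packets stay in the ack queue because `dn3` never acknowledges them; a real client would use such unacknowledged packets to recover the write.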
    • Replication and Rack-awareness
      • Replication in Hadoop is at the block level.
      • The default replication factor is 3 and is configurable.
      • Blocks are replicated for fault tolerance.
      • A file's replication factor is configurable per file and can be changed dynamically.
      • Rack-aware replica placement. Goal: improve reliability, availability and network bandwidth utilization.
      • A cluster has many racks; communication between racks goes through switches.
      • Network bandwidth between machines on the same rack is greater than between machines on different racks.
    • Replication and Rack-awareness (continued)
      • The NameNode determines the rack ID for each DataNode.
    • Replication and Rack-awareness (continued)
      • Replicas are placed: one on a node in the local rack, one on a different node in the local rack and one on a node in a different rack.
      • One third of the replicas are on one node, two thirds are on one rack, and the other third is distributed evenly across the remaining racks.
      • Replica selection for read operations: HDFS tries to minimize bandwidth consumption and latency.
      • Selection of blocks to process in a MapReduce job takes advantage of rack-awareness.
      • Rack-awareness is NOT automatic and needs to be configured. By default, all nodes are assumed to be in the same rack.
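The placement rule stated on this slide can be sketched as a small function. This follows the slide's wording only: the deck notes that the placement policy is pluggable, so an actual cluster may place replicas differently. `place_replicas`, the topology layout and the deterministic "first candidate" choice are all assumptions made for illustration.

```python
def place_replicas(writer, topology):
    """Choose three DataNodes for a block, following the slide's description.

    `topology` maps a rack ID to its DataNode names; `writer` is the DataNode
    the client writes from. Candidates are picked in sorted order so the
    result is deterministic.
    """
    local_rack = next((rack for rack, nodes in topology.items()
                       if writer in nodes), None)
    if local_rack is None:
        raise ValueError(f"{writer} is not in the topology")
    same_rack = [n for n in sorted(topology[local_rack]) if n != writer]
    other_racks = [n for rack in sorted(topology) if rack != local_rack
                   for n in sorted(topology[rack])]
    if not same_rack or not other_racks:
        raise ValueError("need a second local-rack node and a node on another rack")
    # One node in the local rack, a different node in the local rack,
    # and one node in a different rack.
    return [writer, same_rack[0], other_racks[0]]


topology = {"/rack1": ["dn1", "dn2"], "/rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))
```

Losing a whole rack still leaves at least one replica under this rule, which is the reliability goal the slides give for rack-aware placement.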
    • Block Re-replication
      • The need for re-replication may arise because:
        • A DataNode may become unavailable,
        • A replica may become corrupted,
        • A hard disk on a DataNode may fail, or
        • The replication factor of the block may be increased.
      • Block under-replication and over-replication are detected by the NameNode.
      • The Balancer application rebalances blocks to even out DataNode utilization. (The balancer is covered in a later topic.)
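Detecting under- and over-replicated blocks amounts to comparing live replica counts with the target. A minimal sketch, in which `replication_actions` and its data layout are hypothetical:

```python
def replication_actions(block_locations, target, live_nodes):
    """Classify blocks by comparing their live replica count with `target`.

    Returns (under, over): block ID -> number of replicas to add, and
    block ID -> number of replicas to remove.
    """
    under, over = {}, {}
    for block_id, nodes in block_locations.items():
        live = [node for node in nodes if node in live_nodes]
        if len(live) < target:
            under[block_id] = target - len(live)
        elif len(live) > target:
            over[block_id] = len(live) - target
    return under, over


block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],          # dn4 is down in this example
    "blk_3": ["dn1", "dn2", "dn3", "dn5"],   # one replica too many
}
live_nodes = {"dn1", "dn2", "dn3", "dn5"}
print(replication_actions(block_locations, target=3, live_nodes=live_nodes))
```

Raising `target` for a block makes it show up as under-replicated, which corresponds to the last cause in the slide's list.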
    • HDFS is a poor fit for
      • Low-latency data access
      • Lots of small files
      • Transactional access and update
      • Multiple writers, arbitrary file modifications
    • Coherency Model
      • Data being written is not guaranteed to be visible to readers while it is being copied; use sync() to force visibility
      • Write once, read many
      • Apply in applications
    • Command Line Similar to *nix  hadoop fs -ls /  hadoop fs -mkdir /test  hadoop fs -rmr /test  hadoop fs -cp /1 /2  hadoop fs -copyFromLocal /3 hdfs://localhost/ Namedone-specific:  hadoop namenode -format  start-all.sh
    • Command Line Sorting: Standard method to test cluster  TeraGen: Generate dummy data  TeraSort: Sort  TeraValidate: Validate sort result Command Line:  hadoop jar /usr/share/hadoop/hadoop-examples-1.0.3.jar terasort hdfs://ubuntu/10GdataUnsorted /10GDataSorted41