The Hadoop Distributed File System
The Hadoop Distributed File System Presentation Transcript

  • 1. 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) Authors: K. Shvachko et al., Yahoo! Reported by: Tzu-Li Tai, NCKU
  • 2. The Hadoop Distributed File System (HDFS)  Designed to store very large data sets reliably (fault tolerance).  Streams those data sets at high bandwidth to user applications (performance).  In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.  By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size (scalability). This paper describes the architecture of HDFS and reports on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
  • 3. I. INTRODUCTION AND RELATED WORK II. ARCHITECTURE III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT IV. PRACTICE AT YAHOO! V. FUTURE WORK
  • 4. I. INTRODUCTION AND RELATED WORK II. ARCHITECTURE III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT IV. PRACTICE AT YAHOO! V. FUTURE WORK
  • 5. Hadoop project components (Component: Description / Developer):
    HDFS: Distributed file system (subject of this paper!) / Yahoo!
    MapReduce: Distributed computation framework / Yahoo!
    HBase: Column-oriented table service / Powerset & Microsoft
    Pig: Dataflow language and parallel execution framework / Yahoo!
    Hive: Data warehouse infrastructure / Facebook
    ZooKeeper: Distributed coordination service / Yahoo!
    Chukwa: System for collecting management data / Yahoo!
    Avro: Data serialization system / Yahoo! & Cloudera
  • 6.  HDFS stores system metadata and application data separately: metadata on a dedicated server called the NameNode, and application data on other servers called DataNodes.  All servers are fully connected and communicate using TCP-based protocols.  File content is replicated on multiple DataNodes for reliability; this also multiplies the data transfer bandwidth and creates more opportunities for locating computation near the needed data.
  • 7. I. INTRODUCTION AND RELATED WORK II. ARCHITECTURE III.FILE I/O OPERATIONS AND REPLICA MANAGEMENT IV.PRACTICE AT YAHOO! V. FUTURE WORK
  • 8. A. NameNode B. DataNodes C. HDFS Client D. Image and Journal E. CheckpointNode F. BackupNode G. Upgrades, File System and Snapshots
  • 9.  A single NameNode maintains the metadata of the entire cluster. • Inode Data: The HDFS namespace is a hierarchy of files and directories, represented on the NameNode by inodes. • Block Management: The NameNode maintains the mapping of file blocks to DataNodes (the physical locations of file data).
  • 10.  File content is split into large blocks (typically 128 MB, user-selectable per file), each independently replicated at multiple DataNodes (typically three replicas); a creation sketch follows below.  Each cluster can have up to thousands of DataNodes, and each DataNode may execute multiple application tasks concurrently.
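As an illustration of these per-file settings, here is a minimal sketch using the public Hadoop FileSystem.create overload that accepts an explicit replication factor and block size. The path and buffer size are hypothetical; the 128 MB / 3-replica values simply mirror the defaults described above.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // the configured (HDFS) file system

        long blockSize = 128L * 1024 * 1024;        // 128 MB blocks (illustrative)
        short replication = 3;                      // three replicas per block (illustrative)

        Path file = new Path("/tmp/example.dat");   // hypothetical path
        try (FSDataOutputStream out =
                 fs.create(file, true /* overwrite */, 4096 /* buffer */, replication, blockSize)) {
            out.write("hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```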
  • 11. Block Replica Representation Block replicas are represented by two files in the local host's native file system.  First file – contains the data itself.  Second file – the block's metadata, including checksums and the generation stamp. Startup Process and Registration During startup, each DataNode connects to the NameNode and performs a handshake to verify the namespace ID and software version. After the handshake, the DataNode registers with the NameNode. DataNodes registering for the first time are assigned a permanent storage ID.
  • 12. Block Reports and Heartbeats  Block Report: contains the block ID, generation stamp, and length of each block replica the DataNode hosts.  Heartbeats: confirm normal operation and the availability of block replicas. The default heartbeat interval is 3 seconds. Heartbeats also carry information about total storage capacity, the fraction of storage in use, and the number of data transfers in progress (used for the NameNode's space allocation and load-balancing decisions). Commands via Heartbeat Replies The NameNode gives commands to DataNodes by replying to their heartbeats.
  • 13. HDFS Client – a code library that exports the HDFS file system interface. Read  The HDFS client asks the NameNode for the list of DataNodes that host the file's block replicas.  The client then contacts those DataNodes directly for data transfer (see the read sketch below). Write  The client asks the NameNode to choose DataNodes to host the replicas of each block, then streams the data through a pipeline of those DataNodes (slide 25).
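A minimal read sketch against the public org.apache.hadoop.fs.FileSystem client API (the write path is sketched after slide 10 above). The file path is hypothetical, and the Configuration is assumed to point at an HDFS cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the block locations of the file;
        // the data itself is then streamed directly from the chosen DataNodes.
        Path file = new Path("/tmp/example.dat");   // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```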
  • 14. HDFS provides an API that exposes the locations of a file's blocks.  This allows applications (e.g., the MapReduce framework) to schedule a task to where the data are located, improving read performance by executing the task near the data (see the sketch below).
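A hedged sketch of querying block locations through the public FileSystem.getFileBlockLocations call; the path is hypothetical, and a scheduler such as MapReduce would use the returned host names to place tasks near the data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/tmp/example.dat");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block in the requested byte range of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
    }
}
```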
  • 15. For Durability and Reliability - image, checkpoint, journal - image: the inode data plus the list of file blocks, comprising the metadata. - checkpoint: a persistent record of the image; it is replaced in its entirety and never changed in place by the NameNode. - journal: a write-ahead commit log of modifications to the image.
  • 16. Startup Process  The NameNode initializes the image from the checkpoint.  It then replays the journal until the image is up to date.  A new checkpoint and an empty journal are written back to storage before the NameNode starts serving clients. Recommended practice for protecting against corruption or loss  Redundant copies of the checkpoint and journal can be made at other servers for enhanced reliability.
  • 17. Batching multiple transactions to deal with a bottleneck  For each client-initiated transaction, the change is recorded in the journal, and the journal file is flushed and synced before the change is committed.  The NameNode is a multithreaded system, so saving transactions to disk can become a bottleneck: all threads need to wait until the flush-and-sync initiated by one of them completes.  To optimize this, multiple transactions are batched and committed together.
  • 18.  The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.  Creating periodic checkpoints is one way to protect the file system metadata.  Creating a checkpoint also lets the NameNode truncate the tail of the journal once the new checkpoint is uploaded to the NameNode.  Authors' experience: for a large cluster, it takes about an hour to restart a NameNode with a week-long journal, so daily checkpoints are a good practice.
  • 19.  Maintains an in-memory, up-to-date, synchronized image of the file system namespace.  Accepts the journal stream of transactions from the NameNode, saves them to the BackupNode's own storage, and applies the transactions to its own in-memory image.  The BackupNode can be viewed as a read-only NameNode: it contains all file system metadata except block locations.  Using a BackupNode also lets the NameNode run more efficiently, since the responsibility of persisting the namespace state can be delegated to the BackupNode.
  • 20. The HDFS Snapshot Mechanism  Snapshots (only one can exist at a time) are requested by administrators and save the current state of the file system.  Namespace metadata snapshot: when requested, the NameNode reads the checkpoint and journal files and merges them in memory, then writes the new checkpoint and an empty journal to a new location.  DataNode block file data snapshot: each DataNode creates a copy of its storage directory and hard-links the existing block files into it. Removing a block afterwards deletes only the hard link.
  • 21. System Restoration Using Snapshots  The cluster administrator can choose to roll back HDFS to the snapshot state when restarting the system.  Alternatively, the snapshot can be abandoned, freeing the storage it occupies. Upgrade/Conversion of Data Representation Formats (layout version)  The new and old layout versions are compared during startup.  Converting to a new layout version requires creating a snapshot first.  The NameNode and DataNode layout versions are not independent: the NameNode will not recognize blocks reported by DataNodes with a different layout version.
  • 22. I. INTRODUCTION AND RELATED WORK II. ARCHITECTURE III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT IV. PRACTICE AT YAHOO! V. FUTURE WORK
  • 23. A. File Read and Write B. Block Placement C. Replication management D. Balancer E. Block Scanner F. Decommissioning G. Inter-Cluster Data Copy
  • 24. Lease Mechanism & Management  HDFS implements a single-writer, multiple-reader model.  The HDFS client that opens a file for writing is granted a lease for the file, and the writing client sends periodic heartbeats to the NameNode to renew the lease.  The lease duration is bounded by a soft limit and a hard limit. After the soft limit expires: another client can preempt the lease. After the hard limit expires: the lease is automatically recovered and the file is closed on the writer's behalf.
  • 25. Data Write Operation (diagram): the client obtains a pipeline of DataNodes (DN1, DN2, DN3) from the NameNode, sets up the pipeline, streams the data as packets (packet1 through packet5), and finally closes the pipeline.
  • 26. hflush Operation (diagram): calling hflush guarantees that the data written so far is visible to readers, even before the file is closed (see the sketch below).
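A minimal sketch of using the flush operation from the client side. In the Hadoop releases contemporary with this paper the call was named sync(); later FSDataOutputStream versions expose it as hflush(), which is what this sketch (with a hypothetical path) uses.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path log = new Path("/tmp/append.log");      // hypothetical path
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.write("event 1\n".getBytes(StandardCharsets.UTF_8));

            // Push the buffered data through the write pipeline so that it
            // becomes visible to new readers while the file is still open.
            out.hflush();

            out.write("event 2\n".getBytes(StandardCharsets.UTF_8));
        } // close() makes any remaining data visible and finalizes the file
    }
}
```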
  • 27. Read Operation (diagram): the client requests the list of block replicas for a file; the NameNode returns, for each block, its replicas ordered by distance from the client (nearest to farthest); the client then reads each block from the nearest available replica.
  • 28. Data Block Checksum Management  When a client creates a block of data, it computes checksums for the block and sends them along with the data to the DataNodes, which store them in a metadata file separate from the data file.  When a client reads a file, it uses the checksums to verify the block's data; if the checksums do not match, the client notifies the NameNode of the corrupt replica. (An illustration of per-chunk checksumming follows below.)
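HDFS's checksums are CRC32s computed over fixed-size chunks of each block (512 bytes by default). The following is a self-contained illustration of that idea in plain Java, not the actual HDFS code path; the helper names and chunk size are chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkChecksums {
    // Compute one CRC32 checksum per fixed-size chunk of the block data.
    static List<Long> checksumChunks(byte[] blockData, int bytesPerChecksum) {
        List<Long> checksums = new ArrayList<>();
        for (int offset = 0; offset < blockData.length; offset += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, blockData.length - offset);
            CRC32 crc = new CRC32();
            crc.update(blockData, offset, len);
            checksums.add(crc.getValue());
        }
        return checksums;
    }

    // On read, recompute and compare; a mismatch means the replica is corrupt
    // and the NameNode should be told so that a good replica can be copied.
    static boolean verify(byte[] blockData, List<Long> expected, int bytesPerChecksum) {
        return checksumChunks(blockData, bytesPerChecksum).equals(expected);
    }

    public static void main(String[] args) {
        byte[] data = new byte[2048];                 // stand-in for a block's content
        List<Long> sums = checksumChunks(data, 512);
        System.out.println("chunks=" + sums.size() + " ok=" + verify(data, sums, 512));
    }
}
```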
  • 29. • Nodes are spread across multiple racks. The nodes of a rack share a switch, and rack switches are connected by one or more core switches. • Shorter network distance means greater available bandwidth. • HDFS lets the administrator configure a script that decides which rack a node belongs to (see the configuration sketch below). (Diagram: DataNodes DN1–DN15 spread across three racks.)
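The rack-mapping script is wired in through a Hadoop configuration property; a minimal, hedged sketch is below. The property name is net.topology.script.file.name in Hadoop 2.x (topology.script.file.name in earlier releases), and the script path here is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Point Hadoop at an admin-provided executable that maps a node's
        // address to its rack (e.g. it prints "/rack1"). The path is
        // hypothetical; in practice this is usually set in the cluster's
        // XML configuration rather than in code.
        conf.set("net.topology.script.file.name", "/etc/hadoop/rack-topology.sh");
    }
}
```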
  • 30. Default HDFS Block Placement Policy (diagram: DataNodes across three racks) • 1st replica: placed on the node where the writer is located. • 2nd and 3rd replicas: placed on two different nodes on a different rack. • Further replicas are placed on random nodes, with two restrictions (a toy check follows below): 1. No DataNode contains more than one replica of any block. 2. No rack contains more than two replicas of the same block.
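A toy check of the two restrictions, written against invented data structures purely for illustration; this is not the NameNode's actual placement code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PlacementConstraints {
    // Toy check of the two restrictions: no DataNode holds more than one
    // replica of a block, and no rack holds more than two replicas of it.
    // nodeToRack maps a DataNode name to its rack; chosenNodes are the
    // candidate replica holders for one block.
    static boolean satisfiesPolicy(List<String> chosenNodes, Map<String, String> nodeToRack) {
        Map<String, Integer> perNode = new HashMap<>();
        Map<String, Integer> perRack = new HashMap<>();
        for (String node : chosenNodes) {
            perNode.merge(node, 1, Integer::sum);
            perRack.merge(nodeToRack.get(node), 1, Integer::sum);
        }
        boolean oneReplicaPerNode = perNode.values().stream().allMatch(c -> c <= 1);
        boolean atMostTwoPerRack = perRack.values().stream().allMatch(c -> c <= 2);
        return oneReplicaPerNode && atMostTwoPerRack;
    }

    public static void main(String[] args) {
        Map<String, String> racks = Map.of("dn1", "/rack1", "dn2", "/rack2", "dn3", "/rack2");
        // Default layout for 3 replicas: writer's node, plus two nodes on another rack.
        System.out.println(satisfiesPolicy(List.of("dn1", "dn2", "dn3"), racks)); // true
    }
}
```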
  • 31. Tradeoff: write cost vs. data reliability and availability  Scenario: placing the 2nd and 3rd replicas closer to the 1st (i.e., in the same rack) shortens the write pipeline, but at the cost of data reliability, availability, and read bandwidth.
  • 32. (Diagram: over-replicated and under-replicated blocks across three racks.) • Over-replication policy: preserve the number of racks that hold replicas, and remove a replica from the DataNode with the least available space. • Under-replication policy: similar to new block placement. • If all replicas of a block end up on a single rack: the block is first treated as under-replicated and a replica is created on another rack; once that replication completes, the block is treated as over-replicated and a replica is removed following the over-replication policy.
  • 33. • An administrative application program that balances cluster disk-space utilization. • 1st parameter: a threshold value in the range (0, 1). A cluster is balanced if, for every DataNode DNi, |U(DNi) - U(whole cluster)| <= threshold, where U() is the storage utilization ratio (space used / capacity). • 2nd parameter: a bandwidth limit. The higher the allowed bandwidth, the faster the cluster can be balanced, but at the cost of greater competition with application processes. • The balancer optimizes the balancing process by minimizing inter-rack data copying. (A toy version of the balance check follows below.)
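The balance condition can be written as a short predicate; the sketch below simply restates the slide's formula and is not the balancer's implementation.

```java
public class BalancerCheck {
    // "Balanced" here means every DataNode's storage utilization ratio is
    // within the threshold of the whole-cluster utilization ratio.
    // Utilizations are fractions in [0, 1]; the threshold is a fraction in (0, 1).
    static boolean isBalanced(double[] nodeUtilization, double clusterUtilization, double threshold) {
        for (double u : nodeUtilization) {
            if (Math.abs(u - clusterUtilization) > threshold) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] nodes = {0.62, 0.70, 0.55};
        System.out.println(isBalanced(nodes, 0.63, 0.10)); // true: all within 10%
    }
}
```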
  • 34. • Each DataNode periodically runs a block scanner that verifies the checksums of the block replicas it stores. • If a client reads a complete block and the checksum verification succeeds, it informs the DataNode, and the read is treated as a verification of the replica. • Whenever a read client or a block scanner detects a corrupt block, it notifies the NameNode. The NameNode schedules replication from good copies until the intended replica count is reached; only then is the corrupt copy deleted.
  • 35. • Decommissioning is the process of removing a node from the cluster without jeopardizing data availability. • A node that is about to be removed is first marked as "decommissioning". A "decommissioning" node will not be selected as a target for replica placement, but it is still allowed to serve read requests. • While the node is in the "decommissioning" state, the NameNode schedules the blocks it hosts to be replicated to other DataNodes. Once replication completes, the node is labeled "decommissioned" and is safe to remove.
  • 36. • DistCp: a tool for large inter-/intra-cluster parallel copying. • It is implemented as a MapReduce job: each map task copies a portion of the source data to the destination cluster, and the MapReduce framework automatically handles parallel task scheduling, error detection, and recovery.
  • 37. I. INTRODUCTION AND RELATED WORK II. ARCHITECTURE III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT IV. PRACTICE AT YAHOO! V. FUTURE WORK
  • 38. A typical cluster node: 2 quad-core Xeon processors at 2.5 GHz, Red Hat Enterprise Linux Server Release 5.1, Sun Java JDK 1.6.0_13-b03, 4 SATA drives (1 TB each), 16 GB RAM (64 GB on the NameNode and BackupNode), 1-gigabit Ethernet.  70% of disk space is allocated to HDFS; the rest is for the OS, logs, and MapReduce intermediate data.  9.8 PB of raw storage is available on a 3500-node cluster  a net 3.3 PB of user application storage (with three-fold replication)  roughly 1 PB of application storage per 1000 nodes.
  • 39.  60 million files on a 3500-node cluster  about 54,000 block replicas per DataNode  0.8 percent of nodes fail each month, i.e. about 1 or 2 nodes are lost each day (0.8% of 3500 nodes is roughly 28 per month). However, the 54,000 to 108,000 replicas lost with them can be re-created in about 2 minutes, since re-replication is a parallel problem that scales with cluster size.  The probability of losing a whole block is about 0.005% per year.
  • 40. I. INTRODUCTION AND RELATED WORK II. ARCHITECTURE III. FILE I/O OPERATIONS AND REPLICA MANAGEMENT IV. PRACTICE AT YAHOO! V. FUTURE WORK
  • 41.  Correlated failures are failures of rack or core switches, or a total loss of electrical power, which can lead to the loss of whole blocks of data.  Restoring power is of limited use, since about 1.5 percent of nodes will not survive a full power-on restart.  Yahoo!'s strategy for the future is to deliberately restart nodes one at a time over a period of weeks, to identify the nodes that would not survive a restart.
  • 42.  The NameNode keeps all of the namespace and block locations in memory, which limits the number of files and the number of blocks that can be addressed.  New applications for HDFS require the storage of a large number of small files.  Encouraging fewer, larger files is not a solution, since changing application behavior is hard.  Near-term solution: allow multiple namespaces (and NameNodes) to share the physical storage of a cluster, i.e. the block pool model (next slide).
  • 43.  The design of HDFS I/O is particularly optimized for batch-processing systems, like MapReduce, which require high throughput for sequential reads and writes.  What if near-future applications need to read and write large volumes of data in real time?  Data generation rates can be far higher than the data can be analyzed or stored affordably (for example, analyzing internet network flows in real time).  Main topic and challenge: real-time big data processing.