A Seminar Report OnHADOOP By Varun Narang MA 399 Seminar IIT Guwahati Roll Number: 09012332
Index of Topics:1. Abstract2. Introduction3. What is MapReduce?4. HDFSAssumptionsDesignConceptsThe Communication ProtocolsRobustnessCluster RebalancingData IntegrityMetadata disk failureSnapshots5. Data Organisation Data Blocks Staging Replication Pipelining6. Accessibility7. Space Reclaimation • File Deletes and Undeletes • Decrease Replication Factor • Hadoop Filesystems7. Hadoop Archives
Bibliography1)Hadoop- The Definitive Guide, O’Reilly 2009, Yahoo! Press2)MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat3)Ranking and Semi-supervised Classification on Large Scale Graphs Using Map- Reduce, Delip Rao, David Yarowsky, Dept. of Computer Science, Johns Hopkins University4)Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, University of California, Berkeley5)MapReduce in a Week By Hannah Tang, Albert Wong, Aaron Kimball Winter 2007
AbstractProblem Statement:The amount total digital data in the world has exploded in recent years. This hashappened primarily due to information (or data) generated by various enterprises all over
the globe. In 2006, the universal data was estimated to be 0.18 zettabytes in 2006, and isforecasting a tenfold growth by 2011 to 1.8 zettabytes.1 zettabyte = 1021 bytes The problem is that while the storage capacities of hard drives have increasedmassively over the years, access speeds—the rate at which data can be read from driveshave not kept up. One typical drive from 1990 could store 1370 MB of data and had atransfer speed of 4.4 MB/s, so we could read all the data from a full drive in around 300seconds. In 2010, 1 Tbdrives are the standard hard disk size, but the transfer speed isaround 100 MB/s, so it takes more than two and a half hours to read all the data off thedisk.Solution Proposed:Parallelisation:A very obvious solution to solving this problem is parallelisation. The input data isusually large and the computations have to be distributed across hundreds or thousands ofmachines in order to finish in a reasonable amount of time.Reading 1 Tb from a single hard drive may take a long time, but on parallelizing this over100 different machines can solve the problem in 2 minutes.The key issues involved in this Solution: • Hardware failure • Combine the data after analysis (i.e reading)Apache Hadoop is a framework for running applications on large cluster built ofcommodity hardware. The Hadoop framework transparently provides applications bothreliability and data motion.It solves the problem of Hardware Failure through replication. Redundant copies of thedata are kept by the system so that in the event of failure, there is another copy available.(Hadoop Distributed File System)The second problem is solved by a simple programming model- Mapreduce. Thisprogramming paradigm abstracts the problem from data read/write to computation over aseries of keys.Even though HDFS and MapReduce are the most significant features of Hadoop, othersubprojects provide complementary services:The various subprojects of hadoop includes:- • Core • Avro • Pig • HBase • Zoo Keeper • Hive • Chukwa
Introduction Hadoop is designed to efficiently process large volumes of information by connecting manycommodity computers together to work in parallel. A 1000 CPU single machine (i.e a
supercomputer with a vast memory storage) would cost a lot. Thus Hadoop parallelizes thecomputation by tying smaller and more reasonably priced machines together into a single cost-effective compute cluster.The features of hadoop that stand out are its simplified programming model and its efficient,automatic distribution of data and work across machines.Now we take a deeper look into these two main features of Hadoop and list their importantcharacteristics and description.1. Data Distribution: In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. TheHadoop Distributed File System (HDFS) will split large data files into chunks which are managedby different nodes in the cluster. In addition to this each chunk is replicated across severalmachines, so that a single machine failure does not result in any data being unavailable. In case ofa system failure, the data is re-replicated which can result in partial storage. Even though the filechunks are replicated and distributed across several machines, they form a single namespace, sotheir contents are universally accessible. Data is conceptually record-oriented in the Hadoop programmingframework. Individual input files are broken into segments and each segmentis processed upon by a node. The Hadoop framework schedules theprocesses to be run in proximity to the location of data/records usingknowledge from the distributed file system. Each computation processrunning on a node operates on a subset of the data. Which data operated onby which node is decide based on its proximity to the node: i.e:Most data is read from the local disk straight into the CPU, alleviating strainon network bandwidth and preventing unnecessary network transfers. Thisstrategy of moving computation to the data, instead of moving the data to thecomputation allows Hadoop to achieve high data locality which in turn results in highperformance.
2. MapReduce: Isolated Processes Hadoop limits the amount of communication which can be performed by the processes, as eachindividual record is processed by a task in isolation from one another. It makes the wholeframework much more reliable. Programs must be written to conform to a particularprogramming model, named "MapReduce."MapReduce is composed of two chief elements: Mappers and Reducers. 1. Data segments or records are processed in isolation by tasks called Mappers. 2. The output from the Mappers is then brought together by Reducers, where results from different mappers are merged together. Separate nodes in a Hadoop cluster communicate implicitly. Pieces of data can be tagged withkey names which inform Hadoop how to send related bits of information to a common destinationnode. Hadoop internally manages all of the data transfer and cluster topology issues. By restricting the communication between nodes, Hadoop makes the distributed system muchmore reliable. Individual node failures can be worked around by restarting tasks on othermachines. The other workers continue to operate as though nothing went wrong, leaving thechallenging aspects of partially restarting the program. What is MapReduce?MapReduce is a programming model for processing and generating large data sets. Usersspecify a map function that processes a key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate values associated with the sameintermediate key.Programs written in this functional style are automatically parallelized and executed on a largecluster of commodity machines. The run-time system takes care of the details of partitioning theinput data, scheduling the programs execution across a set of machines, handling machinefailures, and managing the required inter-machine communication (i.e this procedure is abstractedor hidden from the user who can focus on the computational problem)Note: This abstraction was inspired by the map and reduces primitives present in Lispand many other functional languages.The Programming Model:The computation takes a set of input key/value pairs, and produces a set of outputkey/value pairs. The user of the MapReduce library expresses the computation as twofunctions: Map and Reduce.Map, written by the user, takes an input pair and produces a set of intermediatekey/value pairs. The MapReduce library groups together all intermediate valuesassociated with the same intermediate key Iand passes them to the Reduce function.The Reduce function, also written by the user, accepts an intermediate key Iand aset of values for that key. It merges together these values to form a possibly smallerset of values. Typically just zero or one output value is produced per Reduceinvocation. The intermediate valuesare supplied to the users reduce function via an iterator.
Map and Reduce (Associated Types):The input keys and values are drawn from a different domain than the output keysand values. Also, the intermediate keys and values are from the same domain as theoutput keys and values. Hadoop Map-Reduce is a software framework for easily writing applications whichprocess vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.Analyzing the Data with Hadoop MapReduce:MapReduce works by breaking the processing into two phases: the map phase and thereduce phase. It splits the input data-set into independent chunks which are processed by the maptasks in a completely parallel manner. The framework sorts the outputs of the maps, which arethen input to the reduce tasks.Both the input and the output of the job are stored in a file-system. The framework takes care ofscheduling tasks, monitoring them and re-executes the failed tasks.Data Locality Optimisation: Typically the compute nodes and the storage nodes are the same. The Map-Reduce frameworkand the Distributed File System run on the same set of nodes. This configuration allows theframework to effectively schedule tasks on the nodes where data is already present, resulting invery high aggregate bandwidth across the cluster.There are two types of nodes that control the job execution process: 1. jobtrackers 2. tasktrackersThe jobtracker coordinates all the jobs run on the system by scheduling tasks to run ontasktrackers.Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of theoverall progress of each job.If a tasks fails, the jobtracker can reschedule it on a different tasktracker.
Input splits: Hadoop divides the input to a MapReduce job into fixed-size pieces called inputsplits, or just splits. Hadoop creates one map task for each split, which runs the userdefined mapfunction for each record in the split.The quality of the load balancing increases as the splits become more fine-grained. On the otherhand, if splits are too small, then the overhead of managing the splits and of map task creationbegins to dominate the total job execution time. For most jobs, a good split size tends to be thesize of a HDFS block, 64 MB by default.WHY?Map tasks write their output to local disk, not to HDFS. Map output is intermediate output: it’sprocessed by reduce tasks to produce the final output, and once the job is complete the mapoutput can be thrown away. So storing it in HDFS, with replication, would be a waste of time. Itis also possible that the node running the map task fails before the map output has been consumedby the reduce task.Reduce tasks don’t have the advantage of data locality—the input to a single reduce task isnormally the output from all mappers. In case we have a single reduce task that is fed by all of the map tasks: The sorted map outputshave to be transferred across the network to the node where the reduce task is running, where theyare merged and then passed to the user-defined reduce function.The output of the reducer is normally stored in HDFS for reliability. For each HDFS block of thereduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes.
MapReduce data flow with a single reduce taskWhen there are multiple reducers: The map tasks partition their output, each creating one partition for each reduce task. There canbe many keys (and their associated values) in each partition, but the records for every key are allin a single partition. MapReduce data flow with multiple reduce task. It is also possible to have zero reduce tasks as illustrated in the figure below.
MapReduce data flow with no reduce tasksCombiner FunctionsMany MapReduce jobs are limited by the bandwidth available on the cluster. In order tominimize the data transferred between the map and reduce tasks, combiner functions areintroduced. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Combiner finctions canhelp cut down the amount of data shuffled between the maps and the reduces.Hadoop Streaming: Hadoop provides an API to MapReduce that allows you to write your map and reduce functionsin languages other than Java. Hadoop Streaming uses Unix standard streams as the interfacebetween Hadoop and your program, so you can use any language that can read standard input andwrite to standard output to write your MapReduce program.Hadoop Pipes:Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, whichuses standard input and output to communicate with the map and reduce code, Pipes uses sockets
as the channel over which the tasktracker communicates with the process running the C++ map orreduce function. JNI is not used.
HADOOP DISTRIBUTED FILESYSTEM (HDFS) Filesystems that manage the storage across a network of machines are called distributedfilesystems. They are network-based, and thus all the complications of network programming arealso present in distributed file system. Hadoop comes with a distributed filesystem called HDFS,which stands for Hadoop Distributed Filesystem. HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold verylarge amounts of data (terabytes or even petabytes), and provide high-throughput access to thisinformation. ASSUMPTIONS AND GOALS: 1. Hardware Failure An HDFS instance may consist of hundreds or thousands of server machines, each storing partof the file system’s data. In case of such a large number of nodes, the probability of one of themfailing becomes substantial. 2. Streaming Data Access Applications that run on HDFS need streaming access to their data sets. HDFS is designed morefor batch processing rather than interactive use by users. The emphasis is on high throughput ofdata access rather than low latency of data access. 3. Large Data Sets Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes toterabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregatedata bandwidth and scale to hundreds of nodes in a single cluster. 4. Simple Coherency Model HDFS applications need a write-once-read-many access model for files. A file once created,written, and closed need not be changed. This assumption simplifies data coherency issues andenables high throughput data access. A Map/Reduce application or a web crawler application fitsperfectly with this model. There is a plan to support appending-writes to files in the future. 5. “Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed near the datait operates on. This is especially true when the size of the data set is huge. This minimizesnetwork congestion and increases the overall throughput of the system. HDFS provides interfacesfor applications to move themselves closer to where the data is located. 6. Portability Across Heterogeneous Hardware and Software Platforms HDFS has been designed to be easily portable from one platform to another. This facilitateswidespread adoption of HDFS as a platform of choice for a large set of applications.
The Design of HDFS: • Very large files: “Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data. • Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record. • Commodity hardware: Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware for which the chance of node failure across the cluster is high for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. The use of commodity hardware restricts the effectiveness of Hadoop in some applications. These applications have the following common characteristics: • Low-latency data access Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. • Lots of small files: Since the master node (or namenode) holds file-system metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. • Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
A few Important Concepts of Hadoop Distributed File System:1. Blocks: A disk has a block size, which is the minimum amount of data that it can read or write.Filesystems for a single disk build on this by dealing with data in blocks, which are an integralmultiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while diskblocks are normally 512 bytesHDFS too has the concept of a block, but it is a much larger unit—64 MB by default. Like in afilesystem for a single disk, files in HDFS are broken into block-sized chunks, which are storedas independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than asingle block does not occupy a full block’s worth of underlying storage.HDFS blocks are large compared to disk blocks. Having a block abstraction for a distributed filesystem brings several benefits: A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block rather than a file simplifies the storage subsystem. Blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). 2. Namenodes and Datanodes: A HDFS cluster has two types of node operating in a master-worker pattern: a namenode (themaster) and a number of datanodes (workers).The namenode has two chief functions: • To manage the filesystem namespace. • To maintains the filesystem tree and the metadata for all the files and directories in the tree.
• . The namenode also knows the datanodes on which all the blocks for a given file are located This information is stored persistently on the local disk in the form of two files: the namespaceimage and the edit log. The namenode however does not store block locations persistently, sincethis information is reconstructed from datanodes when the system starts.A client accesses the filesystem on behalf of the user by communicating with the namenode anddatanodes. Datanodes are the work horses of the filesystem. They store and retrieve blocks when they aretold to (by clients or the namenode), and they report back to the namenode periodically with listsof blocks that they are storing. Without the namenode, the filesystem cannot be used. In fact, ifthe machine running the namenode were obliterated, all the files on the filesystem would be lostsince there would be no way of knowing how to reconstruct the files from the blocks on thedatanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoopprovides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem metadata.Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems.These writes are synchronous and atomic. The usual configuration Choice is to write to local diskas well as a remote NFS mount. Another approach is to run a secondary namenode. It does not act as a namenode. Its main roleis to periodically merge the namespace image with the edit log to prevent the edit log frombecoming too large. The secondary namenode usually runs on a separate physical machine, sinceit requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps acopy of the merged namespace image, which can be used in the event of the namenode failing.However, the state of the secondary namenode lags that of the primary, so in the event of totalfailure of the primary data, loss is almost guaranteed.3. The File System Namespace: HDFS supports a traditional hierarchical file organization. A user or an application can createdirectories and store files inside these directories. The file system namespace hierarchy is similarto most other existing file systems; one can create and remove files, move a file from onedirectory to another, or rename a file. HDFS does not yet implement user quotas or accesspermissions. The Namenode maintains the file system namespace. Any change to the file system namespaceor its properties is recorded by the Namenode.4. Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It storeseach file as a sequence of blocks; all blocks in a file except the last block are the same size. Theblocks of a file are replicated for fault tolerance. The block size and replication factor areconfigurable per file. An application can specify the number of replicas of a file.
The NameNode makes all decisions regarding replication of blocks. It periodically receives aHeartbeat and a Blockreport from each of the Datanodes in the cluster. Receipt of a Heartbeatimplies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on aDatanode.5. Replica Placement: Optimizing replica placement distinguishes HDFS from most other distributed file systems.. Thepurpose of a rack-aware replica placement policy is to improve data reliability, availability, andnetwork bandwidth utilization. Large HDFS instances run on a cluster of computers that commonly spread across many racks.Communication between two nodes in different racks has to go through switches. In most cases,network bandwidth between machines in the same rack is greater than network bandwidthbetween machines in different racks. For the common case, when the replication factor is three, HDFS’s placement policy is to putone replica on one node in the local rack, another on a different node in the local rack, and the laston a different node in a different rack. This policy cuts the inter-rack write traffic which generallyimproves write performance.6. Replica Selection: To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a readrequest from a replica that is closest to the reader.7. Safemode: On startup, the NameNode enters a special state called Safemode. Replication of data blocksdoes not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeatand Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks thata DataNode is hosting. Each block has a specified minimum number of replicas. A block is
considered safely replicated when the minimum number of replicas of that data block has checkedin with the NameNode. After a configurable percentage of safely replicated data blocks checks inwith the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. Itthen determines the list of data blocks (if any) that still have fewer than the specified number ofreplicas. The NameNode then replicates these blocks to other DataNodes.8. The Persistence of File System Metadata: The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log calledthe EditLog to persistently record every change that occurs to file system metadata. TheNameNode uses a file in its local host OS file system to store the EditLog. The entire file systemnamespace, including the mapping of blocks to files and file system properties, is stored in a filecalled the FsImage. The FsImage is stored as a file in the NameNode’s local file system too. The NameNode keeps an image of the entire file system namespace and file Blockmap inmemory. This key metadata item is designed to be compact, such that a NameNode with 4 GB ofRAM is plenty to support a huge number of files and directories The DataNode stores HDFS data in files in its local file system. The DataNode has noknowledge about HDFS files. It stores each block of HDFS data in a separate file in its local filesystem. The DataNode does not create all files in the same directory. Instead, it uses a heuristic todetermine the optimal number of files per directory and creates subdirectories appropriately. It isnot optimal to create all local files in the same directory because the local file system might notbe able to efficiently support a huge number of files in a single directory. When a DataNode startsup, it scans through its local file system, generates a list of all HDFS data blocks that correspondto each of these local files and sends this report to the NameNode: this is the Blockreport
The Communication Protocols: All HDFS communication protocols are layered on top of the TCP/IP protocol. A clientestablishes a connection to a configurable TCP port on the NameNode machine. It talks theClientProtocol with the NameNode. The DataNodes talk to the NameNode using the DataNodeProtocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and theDataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only respondsto RPC requests issued by DataNodes or clients.
Robustness: The primary objective of HDFS is to store data reliably even in the presence of failures. Thethree common types of failures are NameNode failures, DataNode failures and networkpartitions. Data Disk Failure, Heartbeats and Re-Replication Each DataNode sends a Heartbeat message to the NameNode periodically. A network partitioncan cause a subset of DataNodes to lose connectivity with the NameNode. The NameNodedetects this condition by the absence of a Heartbeat message. The NameNode marks DataNodeswithout recent Heartbeats as dead and does not forward any new IO requests to them. Any datathat was registered to a dead DataNode is not available to HDFS any more. DataNode death maycause the replication factor of some blocks to fall below their specified value. The NameNodeconstantly tracks which blocks need to be replicated and initiates replication whenever necessary.The necessity for re-replication may arise due to many reasons: a DataNode may becomeunavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or thereplication factor of a file may be increased.
Cluster Rebalancing The HDFS architecture is compatible with data rebalancing schemes. A scheme mightautomatically move data from one DataNode to another if the free space on a DataNode fallsbelow a certain threshold. In the event of a sudden high demand for a particular file, a schememight dynamically create additional replicas and rebalance other data in the cluster. These typesof data rebalancing schemes are not yet implemented. Data Integrity It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption canoccur because of faults in a storage device, network faults, or buggy software. The HDFS clientsoftware implements checksum checking on the contents of HDFS files. When a client creates anHDFS file, it computes a checksum of each block of the file and stores these checksums in aseparate hidden file in the same HDFS namespace. When a client retrieves file contents it verifiesthat the data it received from each DataNode matches the checksum stored in the associatedchecksum file. If not, then the client can opt to retrieve that block from another DataNode that hasa replica of that block. Metadata Disk FailureThe FsImage and the EditLog are central data structures of HDFS. A corruption of these files cancause the HDFS instance to be non-functional. For this reason, the NameNode can be configuredto support maintaining multiple copies of the FsImage and EditLog. Any update to either theFsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously.This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rateof namespace transactions per second that a NameNode can support. However, this degradation isacceptable because even though HDFS applications are very data intensive in nature, they are notmetadata intensive. When a NameNode restarts, it selects the latest consistent FsImage andEditLog to use.The NameNode machine is a single point of failure for an HDFS cluster. If the NameNodemachine fails, manual intervention is necessary. Currently, automatic restart and failover of theNameNode software to another machine is not supported. SnapshotsSnapshots support storing a copy of data at a particular instant of time. One usage of the snapshotfeature may be to roll back a corrupted HDFS instance to a previously known good point in time.HDFS does not currently support snapshots but will in a future release.
Data Organization Data Blocks HDFS is designed to support very large files. Applications that are compatible with HDFS arethose that deal with large data sets. These applications write their data only once but they read itone or more times and require these reads to be satisfied at streaming speeds. HDFS supportswrite-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, anHDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on adifferent DataNode. Staging A client request to create a file does not reach the NameNode immediately. In fact, initially theHDFS client caches the file data into a temporary local file. Application writes are transparentlyredirected to this temporary local file. When the local file accumulates data worth over one HDFSblock size, the client contacts the NameNode. The NameNode inserts the file name into the filesystem hierarchy and allocates a data block for it. The NameNode responds to the client requestwith the identity of the DataNode and the destination data block. Then the client flushes the blockof data from the local temporary file to the specified DataNode. When a file is closed, theremaining un-flushed data in the temporary local file is transferred to the DataNode. The clientthen tells the NameNode that the file is closed. At this point, the NameNode commits the filecreation operation into a persistent store. If the NameNode dies before the file is closed, the file islost. The above approach has been adopted after careful consideration of target applications that runon HDFS. These applications need streaming writes to files. If a client writes to a remote filedirectly without any client side buffering, the network speed and the congestion in the networkimpacts throughput considerably. This approach is not without precedent. Earlier distributed filesystems, e.g. AFS, have used client side caching to improve performance. A POSIX requirementhas been relaxed to achieve higher performance of data uploads. Replication Pipelining When a client is writing data to an HDFS file, its data is first written to a local file as explainedin the previous section. Suppose the HDFS file has a replication factor of three. When the localfile accumulates a full block of user data, the client retrieves a list of DataNodes from theNameNode. This list contains the DataNodes that will host a replica of that block. The client thenflushes the data block to the first DataNode. The first DataNode starts receiving the data in smallportions (4 KB), writes each portion to its local repository and transfers that portion to the secondDataNode in the list. The second DataNode, in turn starts receiving each portion of the datablock, writes that portion to its repository and then flushes that portion to the third DataNode.Finally, the third DataNode writes the data to its local repository. Thus, a DataNode can bereceiving data from the previous one in the pipeline and at the same time forwarding data to thenext one in the pipeline. Thus, the data is pipelined from one DataNode to the next.۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩۩
Accessibility: HDFS can be accessed from applications in many different ways. Natively, HDFS provides ajava API for applications to use. A C language wrapper for this Java API is also available. Inaddition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is inprogress to expose HDFS through the WebDAV protocol. Space Reclamation: 1. File Deletes and Undeletes When a file is deleted by a user or an application, it is not immediately removed from HDFS.Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly aslong as it remains in /trash. A file remains in /trash for a configurable amount of time. After theexpiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletionof a file causes the blocks associated with the file to be freed. Note that there could be anappreciable time delay between the time a file is deleted by a user and the time of thecorresponding increase in free space in HDFS. A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a userwants to undelete a file that he/she has deleted, he/she can navigate the /trash directory andretrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The/trash directory is just like any other directory with one special feature: HDFS applies specifiedpolicies to automatically delete files from this directory. The current default policy is to deletefiles from /trash that are more than 6 hours old. In the future, this policy will be configurablethrough a well defined interface.2. Decrease Replication Factor When the replication factor of a file is reduced, the NameNode selects excess replicas that canbe deleted. The next Heartbeat transfers this information to the DataNode. The DataNode thenremoves the corresponding blocks and the corresponding free space appears in the cluster. Onceagain, there might be a time delay between the completion of the setReplication API call and theappearance of free space in the cluster.3. Hadoop FilesystemsHadoop has an abstract notion of filesystem, of which HDFS is just one implementation. TheJava abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and thereare several concrete implementations, which are described in following table. A filesystem for a locally connected
disk with client-side checksums.Local file Use RawLocalFileSys fs.LocalFileSystem tem for a local filesystem with no checksums. Hadoop’s distributed filesystem. HDFS is designed to work efficientlyHDFS hdfs hdfs.DistributedFileSystem in conjunction with Map- Reduce. A filesystem providing read-only access to HDFS over HTTP. (DespiteHFTP hftp its name, HFTP has no connection hdfs.HftpFileSystem with FTP.) Often used with distcp (“Parallel Copying with A filesystem providing read-only access to HDFS over HTTPS. (Again,HSFTP hsftp Hdfs.HsftpFileSystem this has no connection with FTP.) A filesystem layered on another filesystem for archiving files.HAR har Fs.HarFileSystem Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode’s memory usage. CloudStore (formerly Kosmos filesystem)KFS(Clou Kfs fs.kfs.KosmosFileSystem is a distributed filesystemd Store) like HDFS or Google’s GFS, written in C++. A filesystem backed by an FTPFTP ftp fs.ftp.FtpFileSystem server. A filesystem backed by AmazonS3(Nativ s3n fs.s3native.NativeS3FileSyste S3.e) m A filesystem backed by Amazon S3, which stores files in blocksS3(Block S3 fs.s3.S3FileSystem A (much like HDFS) to overcomeBased) S3’s 5 GB file size limit.
Hadoop Archives: HDFS stores small files inefficiently, since each file is stored in a block, and block metadata isheld in memory by the namenode. Thus, a large number of small files can eat up a lot of memoryon the namenode. (Note, however, that small files do not take up any more disk space than isrequired to store the raw contents of the file. For example, a 1 MB file stored with a block size of128 MB uses 1 MB of disk space, not 128 MB.) Hadoop Archives, or HAR files, are a filearchiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenodememory usage while still allowing transparent access to files. In particular, Hadoop Archives canbe used as input to MapReduce.Using Hadoop Archives A Hadoop Archive is created from a collection of files using the archive tool. The tool runs aMapReduce job to process the input files in parallel, so to run it, you need a MapReduce clusterrunning to use it.Limitations There are a few limitations to be aware of with HAR files. Creating an archive creates a copy ofthe original files, so you need as much disk space as the files you are archiving to create thearchive (although you can delete the originals once you have created the archive). There iscurrently no support for archive compression, although the files that go into the archive can becompressed (HAR files are like tar files in this respect). Archives are immutable once they havebeen created. To add or remove files, you must recreate the archive. In practice, this is not aproblem for files that don’t change after being written, since they can be archived in batches on aregular basis, such as daily or weekly. As noted earlier, HAR files can be used as input toMapReduce. However, there is no archive-aware InputFormat that can pack multiple files into asingle MapReduce split, so processing lots of small files, even in a HAR file, can still beinefficient.