
Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622, Vol. 3, Issue 1, January-February 2013, pp. 1293-1298

Review of Distributed File Systems: Case Studies

Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni
Department of Computer Technology, Veermata Jijabai Technological Institute

Abstract
Distributed systems have enabled the sharing of data across networks. In this paper four distributed file system architectures, Andrew, Sun Network, Google and Hadoop, are reviewed along with their implemented architectures, file system implementations, and the replication and concurrency control techniques they employ. A comparative study is presented for a better understanding of these file systems.

Keywords: Distributed File Systems, Andrew File System, Sun Network File System, Google File System, Hadoop File System.

I. INTRODUCTION
A file system is a subsystem of an operating system whose purpose is to organize, retrieve, store and allow sharing of data files. A Distributed File System (DFS) is a distributed implementation of the classical time-sharing model of a file system, where multiple users who are geographically dispersed share files and storage resources. Accordingly, the file service activity in a distributed system has to be carried out across the network, and instead of a single centralized data repository there are multiple independent storage devices.

The DFS can also be defined in terms of the abstract notion of a file. Permanent storage is a fundamental abstraction in computing. It consists of a named set of objects that come into existence by explicit creation, are immune to temporary failures of the system, and persist until explicitly destroyed. A file system is the refinement of such an abstraction. A DFS is a file system that supports the sharing of files in the form of persistent storage over a set of network-connected nodes. A DFS has to satisfy three important requirements: transparency, fault tolerance and scalability.

II. CASE STUDY 1: ANDREW FILE SYSTEM
Andrew is a distributed computing environment developed in a joint project by Carnegie Mellon University and IBM. One of the major components of Andrew is a distributed file system. The goal of the Andrew File System is to support growth up to at least 7000 workstations (one for each student, faculty member, and staff member at Carnegie Mellon) while providing users, application programs, and system administrators with the amenities of a shared file system.

Architecture

Figure 1: Andrew File System Architecture

File System
The general goal of widespread accessibility of computational and informational facilities, coupled with the choice of UNIX, led to the decision to provide an integrated, campus-wide file system with functional characteristics as close to those of UNIX as possible. The first design choice was to make the file system compatible with UNIX at the system call level.

The second design decision was to use whole files as the basic unit of data movement and storage, rather than some smaller unit such as physical or logical records. This is undoubtedly the most controversial and interesting aspect of the Andrew File System. It means that before a workstation can use a file, it must copy the entire file to its local disk, and it must write modified files back to the file system in their entirety. This in turn requires a local disk to hold recently used files. On the other hand, it provides significant benefits in performance and, to some degree, in availability. Once a workstation has a copy of a file it can use it independently of the central file system. This dramatically reduces network traffic and file server loads compared to record-based distributed file systems. Furthermore, it is possible to cache and reuse files on the local disk, resulting in further reductions in server loads and in additional workstation autonomy.

Two functional issues with the whole-file strategy are often raised. The first concerns file sizes: only files small enough to fit on the local disks can be handled. Where this matters in the environment, large files have to be broken into smaller parts which fit. The second has to do with updates. Modified files are returned to the central system only when they are closed, thus rendering record-level updates impossible.
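The whole-file strategy described above can be made concrete with a small sketch: open copies the entire file from the server to local disk, all I/O then hits the local copy, and close ships the whole modified file back. Everything here (the SERVER dictionary standing in for the central file servers, the function names, the cache directory) is hypothetical naming for illustration, not actual AFS code.

```python
# Illustrative sketch of AFS-style whole-file caching (assumed names, not AFS code).
import os
import tempfile

SERVER = {"/u/alice/notes.txt": b"hello"}      # stand-in for the central file servers
CACHE_DIR = tempfile.mkdtemp(prefix="afs_cache_")

def _local_path(path):
    return os.path.join(CACHE_DIR, path.strip("/").replace("/", "_"))

def afs_open(path):
    """Before a workstation can use a file, copy the ENTIRE file to local disk."""
    local = _local_path(path)
    if not os.path.exists(local):              # cache miss: fetch the whole file
        with open(local, "wb") as f:
            f.write(SERVER[path])
    return open(local, "r+b")                  # all subsequent I/O is purely local

def afs_close(path, handle):
    """On close, write the modified file back to the server in its entirety."""
    handle.close()
    with open(_local_path(path), "rb") as f:
        SERVER[path] = f.read()
```

Note how record-level updates are impossible in this model: the server only ever sees whole-file replacements at close time, which is exactly the trade-off the text discusses.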
This is a fundamental property of the design. However, it is not a serious problem in the university computing environment. The main application for record-level updates is databases. Serious multi-user databases have many other requirements (such as record- or field-granularity authorization, physical disk write ordering controls, and update serialization) which are not satisfied by UNIX file system semantics, even in a non-distributed environment.

The third and last key design decision in the Andrew File System was to implement it with many relatively small servers rather than a single large machine. This decision was based on the desire to support growth gracefully, to enhance availability (since if any single server fails, the others should continue), and to simplify the development process by using the same hardware and operating system as the workstations. At present, an Andrew file server consists of a workstation with three to six 400-megabyte disks attached. A price/performance goal of supporting at least 50 active workstations per file server is achieved, so that the centralized costs of the file system remain reasonable. In a large configuration like the one at Carnegie Mellon, a separate "system control machine" is used to broadcast global information (such as where specific users' files are to be found) to the file servers. In a small configuration the system control machine is combined with a server machine.

III. CASE STUDY 2: SUN NETWORK FILE SYSTEM
NFS views a set of interconnected workstations as a set of independent machines with independent file systems. The goal is to allow some degree of sharing among these file systems in a transparent manner. Sharing is based on a client-server relationship. A machine may be, and often is, both a client and a server. Sharing is allowed between any pair of machines, not only with dedicated server machines. Consistent with the independence of a machine is the critical observation that NFS sharing of a remote file system affects only the client machine and no other machine. Therefore, there is no notion of a globally shared file system as in Locus, Sprite, UNIX United, and Andrew.

To make a remote directory accessible in a transparent manner from a client machine, a user of that machine first has to carry out a mount operation. Actually, only a superuser can invoke the mount operation. Specifying the remote directory as an argument for the mount operation is done in a nontransparent manner; the location (i.e., hostname) of the remote directory has to be provided. From then on, users on the client machine can access files in the remote directory in a totally transparent manner, as if the directory were local. Since each machine is free to configure its own name space, it is not guaranteed that all machines have a common view of the shared space. The convention is to configure the system to have a uniform name space. By mounting a shared file system over the user home directories on all the machines, a user can log in to any workstation and get his or her home environment. Thus, user mobility can be provided, although again only by convention.

Subject to access rights accreditation, potentially any file system, or a directory within a file system, can be remotely mounted on top of any local directory. In the latest NFS version, diskless workstations can even mount their own roots from servers (Version 4.0, May 1988, described in Sun Microsystems Inc. [1988]). In previous NFS versions, a diskless workstation depended on the Network Disk (ND) protocol, which provides raw block I/O service from remote disks; the server disk was partitioned and no sharing of root file systems was allowed.

One of the design goals of NFS is to provide file services in a heterogeneous environment of different machines, operating systems, and network architectures. The NFS specification is independent of these media and thus encourages other implementations. This independence is achieved through the use of RPC primitives built on top of an External Data Representation (XDR) protocol, two implementation-independent interfaces [Sun Microsystems Inc. 1988]. Hence, if the system consists of heterogeneous machines and file systems that are properly interfaced to NFS, file systems of different types can be mounted both locally and remotely.

Architecture
In general, Sun's implementation of NFS is integrated with the SunOS kernel for reasons of efficiency (although such integration is not strictly necessary).

Figure 2: Schematic View of NFS Architecture

The NFS architecture is schematically depicted in Figure 2. The user interface is the UNIX system call interface based on the Open, Read, Write and Close calls, and on file descriptors. This interface sits on top of a middle layer called the Virtual File System (VFS) layer. The bottom layer is the one that implements the NFS protocol and is called the NFS layer.
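The per-machine mount table that underlies this design can be sketched as follows. Each machine keeps its own table mapping a local mount point to a (server, remote directory) pair, and path resolution picks the longest matching mount point; because every machine owns its table, name spaces need not agree across machines, which is exactly the point made above. The names (MOUNT_TABLE, mount, resolve) are assumptions for illustration, not the SunOS implementation.

```python
# Illustrative sketch of an NFS-style per-machine mount table (assumed names).
MOUNT_TABLE = {}   # local mount point -> (server host, remote directory)

def mount(server, remote_dir, local_dir):
    """Superuser operation: graft a remote directory onto a local directory."""
    MOUNT_TABLE[local_dir] = (server, remote_dir)

def resolve(path):
    """Map a local path to (server, server-side path); server None means local."""
    best = None
    for mp in MOUNT_TABLE:
        if path == mp or path.startswith(mp + "/"):
            if best is None or len(mp) > len(best):
                best = mp                      # longest matching mount point wins
    if best is None:
        return None, path                      # purely local access
    server, remote_dir = MOUNT_TABLE[best]
    return server, remote_dir + path[len(best):]
```

For example, after mount("fileserver1", "/export/home", "/home"), a reference to /home/alice/a.txt resolves to the remote directory on fileserver1, while any unmounted path stays local; no other machine's name space is affected.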
These layers comprise the NFS software architecture. The figure also shows the RPC/XDR software layer, the local file systems, and the network, and thus can serve to illustrate the integration of a DFS with all these components. The VFS serves two important functions: it separates file system generic operations from their implementation by defining a clean interface, and several implementations of the VFS interface may coexist on the same machine, allowing transparent access to a variety of types of file systems mounted locally (e.g., 4.2 BSD or MS-DOS).

The VFS is based on a file representation structure called a vnode, which contains a numerical designator for a file that is networkwide unique. (Recall that UNIX i-nodes are unique only within a single file system.) The kernel maintains one vnode structure for each active node (file or directory). Essentially, for every file the vnode structures, complemented by the mount table, provide a pointer to its parent file system, as well as to the file system over which it is mounted. Thus, the VFS distinguishes local files from remote ones, and local files are further distinguished according to their file system types. The VFS activates file-system-specific operations to handle local requests according to their file system types, and calls the NFS protocol procedures for remote requests. File handles are constructed from the relevant vnodes and passed as arguments to these procedures.

As an illustration of the architecture, let us trace how an operation on an already open remote file is handled (following Figure 2). The client initiates the operation by a regular system call. The operating system layer maps this call to a VFS operation on the appropriate vnode. The VFS layer identifies the file as a remote one and invokes the appropriate NFS procedure. An RPC call is made to the NFS service layer at the remote server. This call is reinjected into the VFS layer of the server, which finds that it is local and invokes the appropriate file system operation. This path is retraced to return the result. An advantage of this architecture is that the client and the server are identical; thus, it is possible for a machine to be a client, a server, or both. The actual service on each server is performed by several kernel processes, which provide a temporary substitute for an LWP (lightweight process) facility.

IV. CASE STUDY 3: GOOGLE FILE SYSTEM
The Google File System (GFS) is a proprietary DFS developed by Google. It is designed to provide efficient, reliable access to data using large clusters of commodity hardware. The files are huge and are divided into chunks of 64 megabytes. Most files are mutated by appending new data rather than overwriting existing data: once written, the files are only read, and often only sequentially. This DFS is best suited for scenarios in which many large files are created once but read many times. The GFS is optimized to run on computing clusters whose nodes are cheap commodity computers. Hence, precautions are needed against the high failure rate of individual nodes and the resulting data loss.

Motivation for the GFS Design
The GFS was developed based on the following assumptions:
a. Systems are prone to failure. Hence there is a need for self-monitoring and self-recovery from failure.
b. The file system stores a modest number of large files, where the file size is greater than 100 MB.
c. There are two types of reads: large streaming reads of 1 MB or more, which read a contiguous region of a file and come from the same client, and small random reads of a few KBs.
d. Many large sequential writes are performed. The writes are performed by appending data to files, and once written the files are seldom modified.
e. There is a need to support multiple clients concurrently appending to the same file.

Architecture
Master, Chunkservers and Clients
A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files, and read or write chunk data specified by a chunk handle and byte range.

For reliability, each chunk is replicated on multiple chunkservers. By default three replicas are stored, though users can designate different replication levels for different regions of the file namespace. The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state. Neither the client nor the chunkserver caches file data. The GFS architecture diagram is shown below:
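The master's metadata described above reduces to two maps: file name to list of chunk handles, and chunk handle to replica locations, with (file name, byte offset) translated to a chunk index via the fixed 64 MB chunk size. The following sketch shows those structures under assumed names (file_to_chunks, chunk_locations, create_chunk, lookup); it is a data-structure illustration, not Google's implementation.

```python
# Illustrative sketch of the GFS master's metadata maps (assumed names).
import itertools

CHUNK_SIZE = 64 * 1024 * 1024          # fixed 64 MB chunks
_handles = itertools.count(1)          # stand-in for unique 64-bit handle assignment

file_to_chunks = {}    # file name -> list of chunk handles (namespace + mapping)
chunk_locations = {}   # chunk handle -> chunkservers holding a replica

def create_chunk(fname, chunkservers, replication=3):
    """The master assigns an immutable, globally unique handle at chunk creation."""
    handle = next(_handles)
    file_to_chunks.setdefault(fname, []).append(handle)
    chunk_locations[handle] = list(chunkservers)[:replication]   # default: 3 replicas
    return handle

def lookup(fname, offset):
    """Translate (file name, byte offset) -> (chunk handle, replica locations)."""
    index = offset // CHUNK_SIZE       # chunk index from the fixed chunk size
    handle = file_to_chunks[fname][index]
    return handle, chunk_locations[handle]
```

Because all of this fits in the master's memory, a client needs only one lookup per chunk; the actual data path then bypasses the master entirely and goes straight to the chunkservers.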
Figure 3: Google File System Architecture

Chunk Size
The GFS uses a large chunk size of 64 MB. This has the following advantages:
a. It reduces the clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master for chunk location information.
b. It reduces network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
c. It reduces the size of the metadata stored on the master, which allows the metadata to be kept in the master's memory.

File Access Method
File Read
A simple file read is performed as follows:
a. The client translates the file name and byte offset specified by the application into a chunk index within the file, using the fixed chunk size.
b. It sends the master a request containing the file name and chunk index.
c. The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key.
d. The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk.
e. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.

This file read sequence is illustrated below:

Figure 4: File Read Sequence

File Write
The control flow of a write is given below as numbered steps:
1. The client translates the file name and byte offset specified by the application into a chunk index within the file, using the fixed chunk size. It sends the master a request containing the file name and chunk index.
2. The master replies with the corresponding chunk handle and the locations of the replicas.
3. The client pushes the data to all the replicas. The data is stored in an internal buffer at each chunkserver.
4. The client sends a write request to the primary. The primary assigns serial numbers to all the write requests it receives and performs the writes on the data it stores in serial number order.
5. The primary forwards the write request to all secondary replicas.
6. The secondaries all reply to the primary on completion of the write.
7. The primary replies to the client.

Figure 5: A File Write Sequence

Replication and Consistency
Consistency
The GFS applies mutations to a chunk in the same order on all its replicas. A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append operation. The GFS uses chunk version numbers to detect any replica that has become stale due to missed mutations while its chunkserver was down. The chance of a client reading from a stale replica stored in its cache is small, because the cache entry uses a timeout mechanism and all chunk information for a file is purged on the next file open.

Replication
The GFS employs both chunk replication and master replication for added reliability.

System Availability
The GFS supports fast recovery to ensure availability. Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated.
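The key ordering property of the write path above, steps 4 through 7, is that the primary alone assigns serial numbers, so every replica applies buffered data in the same order. The sketch below shows that mechanism with hypothetical Replica/Primary classes (assumed names, not Google's code); data is pushed to every replica's buffer first (step 3), then ordered by the primary.

```python
# Illustrative sketch of GFS primary-driven write ordering (assumed names).
class Replica:
    def __init__(self):
        self.buffer = {}        # data id -> bytes pushed by the client (step 3)
        self.log = []           # writes applied, in serial-number order

    def push(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        self.log.append((serial, self.buffer.pop(data_id)))

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        """Steps 4-7: assign a serial number, apply locally, forward the same
        order to all secondaries, and acknowledge the client once all reply."""
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:          # step 5: forward in the same order
            s.apply(serial, data_id)
        return "ok"                          # step 7: reply to the client
```

Even if two clients push data in different orders, the replicas' logs end up identical because only the primary's serial numbers determine application order, which is the consistency guarantee described in the next subsection.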
Normal and abnormal termination are not distinguished; any fault results in the same recovery process as a successful termination.

Data Integrity
The GFS employs a checksum mechanism to ensure the integrity of data being read and written. A 32-bit checksum is kept for every 64 KB block of a chunk. For reads, the chunkserver verifies the checksums of the data blocks that overlap the read range before returning any data to the requester. For writes (appends to the end of a chunk), it incrementally updates the checksum for the last partial checksum block and computes new checksums for any brand-new checksum blocks filled by the append.

Limitations of the GFS
1. There is no standard API, such as POSIX, for programming against it.
2. An application or client may get a stale chunk replica, though this probability is low.
3. Some of the performance issues depend on the client and application implementation.
4. If a write by the application is large or straddles a chunk boundary, fragments from other clients may be interleaved with it.

V. CASE STUDY 4: HADOOP FILE SYSTEM
The Hadoop Distributed File System (HDFS) is a distributed, parallel, fault-tolerant file system inspired by the Google File System. It was designed to reliably store very large files across machines in a large cluster. Each file is stored as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. The files are "write once" and have strictly one writer at any time. This DFS has been used by Facebook and Yahoo.

Architecture
The file system metadata and the application data are stored separately. The metadata is stored on a dedicated server called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols. File content is split into large blocks (typically 128 MB) and each block of the file is independently replicated at multiple DataNodes for reliability.

NameNode
Files and directories are represented by inodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas. The metadata comprising the inode data and the list of blocks belonging to each file is called the Image. Checkpoints are the persistent record of the image, stored in the local host's native file system. The modification log of the image, also stored in the local host's native file system, is referred to as the Journal. During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal.

DataNode
Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself, and the second file holds the block's metadata, including checksums for the block data and the block's generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size, as in traditional file systems. Thus, if a block is half full it needs only half of the space of a full block on the local drive. During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down. A DataNode identifies the block replicas in its possession to the NameNode by sending a block report. A block report contains the block id, the generation stamp and the length of each block replica the server hosts. The first block report is sent immediately after DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located on the cluster.

File Access Method
File Read
When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and requests the transfer of the desired block.

File Write
When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node to node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Each choice of DataNodes is likely to be different.

Synchronization
The Hadoop DFS implements a single-writer, multiple-reader model. The Hadoop client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease. When the file is closed, the lease is revoked. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.
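The single-writer, multiple-reader lease protocol can be sketched as a small state machine: grant a lease only if no other client holds an unexpired one, renew it on request, revoke it on close, and never block readers. The LeaseManager class and the 60-second duration are assumptions for illustration, not Hadoop's actual implementation.

```python
# Illustrative sketch of HDFS-style single-writer leases (assumed names/values).
import time

LEASE_DURATION = 60.0          # seconds before an unrenewed lease expires (assumed)

class LeaseManager:
    def __init__(self):
        self.leases = {}       # path -> (client id, expiry time)

    def open_for_write(self, path, client, now=None):
        """Grant the lease only if no other client holds an unexpired one."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(path)
        if holder and holder[0] != client and holder[1] > now:
            return False                       # another writer holds the lease
        self.leases[path] = (client, now + LEASE_DURATION)
        return True

    def renew(self, path, client, now=None):
        """The writing client periodically renews its lease."""
        now = time.monotonic() if now is None else now
        if self.leases.get(path, (None, 0))[0] == client:
            self.leases[path] = (client, now + LEASE_DURATION)

    def close(self, path, client):
        """Closing the file revokes the writer's lease."""
        if self.leases.get(path, (None, 0))[0] == client:
            del self.leases[path]

    def can_read(self, path):
        return True            # a lease never blocks concurrent readers
```

The expiry check also captures failure handling: if a writer crashes and stops renewing, its lease lapses and another client can eventually acquire the file.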
The interactions involved are shown in the figure below:

Figure 6: File Access in Hadoop

Replication Management
Blocks are replicated for reliability. Hadoop has a method to identify and overcome under-replication and over-replication. The default number of replicas for each block is 3. The NameNode detects that a block has become under- or over-replicated based on the DataNodes' block reports. If a block is over-replicated, the NameNode chooses a replica to remove; preference is given to removing the replica from the DataNode with the least amount of available disk space. If a block is under-replicated, it is put in the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas greater than two thirds of its replication factor has the lowest priority. The NameNode also ensures that not all replicas of a block are located at the same physical location.

Consistency
The Hadoop DFS uses checksums with each data block to maintain data consistency. The checksums are verified by the client while reading, to help detect corruption caused by the client, the DataNodes, or the network. When a client creates an HDFS file, it computes the checksum sequence for each block and sends it to a DataNode along with the data. The DataNode stores the checksums in a metadata file separate from the block's data file. When HDFS reads a file, each block's data and checksums are returned to the client.

Limitations of the Hadoop DFS
1. Centralization: The Hadoop system uses a centralized master server, so the Hadoop cluster is effectively unavailable when its NameNode is down. Restarting the NameNode has been a satisfactory recovery method so far, and steps are being taken towards automated recovery.
2. Scalability: Since the NameNode keeps all the namespace and block locations in memory, the size of the NameNode heap limits the number of files and blocks addressable. One solution is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster.

VI. COMPARISON OF FILE SYSTEMS

Table 1: Comparison of AFS, NFS, GFS and Hadoop

Architecture: AFS: symmetric-based; NFS: symmetric-based; GFS: clustered, asymmetric, parallel, object-based; Hadoop: clustered, asymmetric, parallel, object-based.
Processes: AFS: stateless; NFS: stateless; GFS: stateful; Hadoop: stateful.
Communication: AFS: RPC/TCP; NFS: RPC/TCP or UDP; GFS: RPC/TCP & UDP; Hadoop: RPC/TCP & UDP.
Naming: AFS: none; NFS: none; GFS: central metadata server; Hadoop: central metadata server.
Synchronization: AFS: callback promise; NFS: read-ahead, delayed write; GFS: write-once-read-many, multiple producer/single consumer, gives locks on objects to clients using leases; Hadoop: write-once-read-many, gives locks on objects to clients using leases.
Consistency and Replication: AFS: callback mechanism; NFS: one-copy semantics, only read-only file stores can be replicated; GFS: server-side replication, asynchronous replication, checksums, relaxed consistency; Hadoop: server-side replication, asynchronous replication, checksums among replicas of data objects.
Fault Tolerance: AFS: failure as norm; NFS: failure as norm; GFS: failure as norm; Hadoop: failure as norm.

VII. CONCLUSION
In this paper the distributed file system implementations of Andrew, Sun Network, Google and Hadoop were reviewed, covering their architectures, file system implementations, and replication and concurrency control techniques. The file systems were also compared on the basis of their implementations.

REFERENCES
[1] Howard J., "An Overview of the Andrew File System".
[2] Sandberg R., Goldberg D., Kleiman S., Walsh D., Lyon B., "Design and Implementation of the Sun Network Filesystem".
[3] Ghemawat S., Gobioff H., Leung S., "The Google File System".
[4] Shvachko K., Kuang H., Radia S., Chansler R., "The Hadoop Distributed File System", IEEE 2010.
[5] Dean J., Ghemawat S., "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004.
[6] Keerthivasan M., "Review of Distributed File Systems: Concepts and Case Studies", ECE 677 Distributed Computing Systems.