The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google∗
ABSTRACT

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Categories and Subject Descriptors
D [4]: 3—Distributed file systems

General Terms
Design, reliability, performance, measurement

Keywords
Fault tolerance, scalability, data storage, clustered storage

∗ The authors can be reached at the following addresses: {sanjay,hgobioff,shuntak}@google.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.

1. INTRODUCTION

We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google's data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility.
For example, we have relaxed GFS's consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.

Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.

• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.

• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.

• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.

• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.

• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.

2.2 Interface

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.

Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
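For illustration only, the interface above can be summarized as the following Python sketch. The paper names these operations but does not publish the client library, so the class name, signatures, and return conventions here are assumptions, not the actual GFS API.

    # Hypothetical sketch of the client-facing operations described in Section 2.2.
    class GFSClient:
        def create(self, path): ...                   # create a new file
        def delete(self, path): ...                   # delete a file
        def open(self, path): ...                     # open an existing file, returning a handle
        def close(self, handle): ...                  # close a handle
        def read(self, handle, offset, length): ...   # read `length` bytes at `offset`
        def write(self, handle, offset, data): ...    # write at an application-chosen offset

        # GFS-specific operations:
        def snapshot(self, src_path, dst_path): ...   # low-cost copy of a file or directory tree
        def record_append(self, handle, record):
            """Append `record` atomically at least once; GFS chooses and returns the offset."""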
2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.
Figure 1: GFS Architecture. (The diagram shows an application linked against the GFS client sending (file name, chunk index) requests to the GFS master, which holds the file namespace, e.g. /foo/bar mapping to chunk handles such as 2ef0, and replies with (chunk handle, chunk locations). The client then sends (chunk handle, byte range) requests directly to GFS chunkservers, which store chunks in their local Linux file systems and return chunk data. The master exchanges instructions and chunkserver state with the chunkservers; the legend distinguishes control messages from data messages.)
2.4 Single Master

Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.

The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
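The read path just described can be sketched as follows, assuming the 64 MB chunk size of Section 2.5 and a read that does not cross a chunk boundary. The helpers ask_master and read_from_chunkserver and the dictionary cache are illustrative assumptions, not the actual client code.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size (Section 2.5)

    location_cache = {}  # (file_name, chunk_index) -> (chunk_handle, replica_locations)

    def read(file_name, offset, length, ask_master, read_from_chunkserver):
        """Sketch of the simple read interaction of Section 2.4 (single-chunk case)."""
        # 1. Translate the application-specified byte offset into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        # 2. Ask the master only when the (handle, locations) pair is not cached.
        if key not in location_cache:
            location_cache[key] = ask_master(file_name, chunk_index)
        chunk_handle, replicas = location_cache[key]
        # 3. Read the byte range directly from one replica (e.g. the closest one);
        #    no file data ever flows through the master.
        start = offset % CHUNK_SIZE
        return read_from_chunkserver(replicas[0], chunk_handle, (start, start + length))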
2.5 Chunk Size

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.

A large chunk size offers several important advantages. First, it reduces clients' need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.

However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata

The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
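The three metadata types can be pictured as three in-memory maps. The representation below is an illustrative assumption for exposition (the paper specifies prefix-compressed names and, later, a B-tree-like checkpoint form, not these containers); only the first two maps are made persistent through the operation log.

    # Illustrative sketch of the master's three metadata types (Section 2.6).
    namespace = {}        # full pathname -> file attributes (persistent via the operation log)
    file_to_chunks = {}   # full pathname -> list of 64-bit chunk handles (persistent via the log)
    chunk_locations = {}  # chunk handle -> list of chunkservers holding a replica
                          # (NOT persistent: rebuilt by polling chunkservers at startup and
                          #  kept current through HeartBeat messages)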
2.6.1 In-Memory Data Structures

Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.

If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
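To make the "not a serious limitation" point concrete, a rough back-of-the-envelope calculation (ours, not the paper's) under the stated bound of less than 64 bytes per 64 MB chunk:

    chunk_size      = 64 * 2**20            # 64 MB per chunk
    bytes_per_chunk = 64                    # stated upper bound on per-chunk metadata

    data_stored   = 2**50                   # suppose 1 PiB of (mostly full) chunks
    num_chunks    = data_stored // chunk_size        # 16,777,216 chunks
    metadata_size = num_chunks * bytes_per_chunk     # 1,073,741,824 bytes, about 1 GiB

    print(num_chunks, metadata_size)        # a petabyte of chunk data needs on the order
                                            # of a gigabyte of chunk metadata in memory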
2.6.2 Chunk Locations

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.

We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.

Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.

2.6.3 Operation Log

The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.

Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.

The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.

Because building a checkpoint can take a while, the master's internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.

Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
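A highly simplified sketch of the recovery rule just described (latest complete checkpoint plus the log records after it). The in-memory shapes, the apply_mutation callback, and the completeness flag are illustrative assumptions, not GFS's actual on-disk format.

    def recover_master_state(checkpoints, log_records, apply_mutation):
        """Rebuild master state per Section 2.6.3.

        `checkpoints` is assumed to be a list of (logical_time, state, is_complete)
        tuples and `log_records` a time-ordered list of (logical_time, mutation) pairs.
        """
        # A crash during checkpointing leaves an incomplete checkpoint, which the
        # recovery code detects and skips.
        complete = [c for c in checkpoints if c[2]]
        base_time, state, _ = max(complete, key=lambda c: c[0])
        # Replay only the limited number of log records written after that checkpoint.
        for logical_time, mutation in log_records:
            if logical_time > base_time:
                state = apply_mutation(state, mutation)
        return state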
regions. The applications do not need to further distinguish        file data that is still incomplete from the application’s per-
between different kinds of undefined regions.                         spective.
   Data mutations may be writes or record appends. A write             In the other typical use, many writers concurrently ap-
causes data to be written at an application-specified file            pend to a file for merged results or as a producer-consumer
offset. A record append causes data (the “record”) to be             queue. Record append’s append-at-least-once semantics pre-
appended atomically at least once even in the presence of           serves each writer’s output. Readers deal with the occa-
concurrent mutations, but at an offset of GFS’s choosing             sional padding and duplicates as follows. Each record pre-
(Section 3.3). (In contrast, a “regular” append is merely a         pared by the writer contains extra information like check-
write at an offset that the client believes to be the current        sums so that its validity can be verified. A reader can
end of file.) The offset is returned to the client and marks          identify and discard extra padding and record fragments
the beginning of a defined region that contains the record.          using the checksums. If it cannot tolerate the occasional
In addition, GFS may insert padding or record duplicates in         duplicates (e.g., if they would trigger non-idempotent op-
between. They occupy regions considered to be inconsistent          erations), it can filter them out using unique identifiers in
and are typically dwarfed by the amount of user data.               the records, which are often needed anyway to name corre-
   After a sequence of successful mutations, the mutated file        sponding application entities such as web documents. These
region is guaranteed to be defined and contain the data writ-        functionalities for record I/O (except duplicate removal) are
ten by the last mutation. GFS achieves this by (a) applying         in library code shared by our applications and applicable to
mutations to a chunk in the same order on all its replicas          other file interface implementations at Google. With that,
(Section 3.1), and (b) using chunk version numbers to detect        the same sequence of records, plus rare duplicates, is always
any replica that has become stale because it has missed mu-         delivered to the record reader.
tations while its chunkserver was down (Section 4.5). Stale
replicas will never be involved in a mutation or given to           3. SYSTEM INTERACTIONS
clients asking the master for chunk locations. They are
garbage collected at the earliest opportunity.                        We designed the system to minimize the master’s involve-
   Since clients cache chunk locations, they may read from a        ment in all operations. With that background, we now de-
stale replica before that information is refreshed. This win-       scribe how the client, master, and chunkservers interact to
dow is limited by the cache entry’s timeout and the next            implement data mutations, atomic record append, and snap-
open of the file, which purges from the cache all chunk in-          shot.
formation for that file. Moreover, as most of our files are           3.1 Leases and Mutation Order
append-only, a stale replica usually returns a premature
end of chunk rather than outdated data. When a reader                  A mutation is an operation that changes the contents or
retries and contacts the master, it will immediately get cur-       metadata of a chunk such as a write or an append opera-
rent chunk locations.                                               tion. Each mutation is performed at all the chunk’s replicas.
   Long after a successful mutation, component failures can         We use leases to maintain a consistent mutation order across
of course still corrupt or destroy data. GFS identifies failed       replicas. The master grants a chunk lease to one of the repli-
chunkservers by regular handshakes between master and all           cas, which we call the primary. The primary picks a serial
chunkservers and detects data corruption by checksumming            order for all mutations to the chunk. All replicas follow this
(Section 5.2). Once a problem surfaces, the data is restored        order when applying mutations. Thus, the global mutation
from valid replicas as soon as possible (Section 4.3). A chunk      order is defined first by the lease grant order chosen by the
is lost irreversibly only if all its replicas are lost before GFS   master, and within a lease by the serial numbers assigned
can react, typically within minutes. Even in this case, it be-      by the primary.
comes unavailable, not corrupted: applications receive clear           The lease mechanism is designed to minimize manage-
errors rather than corrupt data.                                    ment overhead at the master. A lease has an initial timeout
                                                                    of 60 seconds. However, as long as the chunk is being mu-
                                                                    tated, the primary can request and typically receive exten-
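One way to picture the reader-side handling of padding and duplicates is the checksum-framed record format sketched below. The framing (4-byte length, 4-byte CRC32, 8-byte identifier, then payload) is purely our assumption for illustration; the paper does not specify the shared record I/O library's format.

    import zlib

    def scan_records(blob, seen_ids):
        """Sketch of a reader per Section 2.7.2: skip padding and record fragments via
        checksums, and drop occasional duplicates via per-record unique identifiers."""
        offset = 0
        while offset + 16 <= len(blob):
            length = int.from_bytes(blob[offset:offset + 4], "big")
            crc = int.from_bytes(blob[offset + 4:offset + 8], "big")
            rec_id = int.from_bytes(blob[offset + 8:offset + 16], "big")
            payload = blob[offset + 16:offset + 16 + length]
            if len(payload) == length and zlib.crc32(payload) == crc:
                if rec_id not in seen_ids:       # filter duplicates from retried appends
                    seen_ids.add(rec_id)
                    yield payload
                offset += 16 + length            # advance past a valid record
            else:
                offset += 1                      # padding or a fragment: resynchronize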
3. SYSTEM INTERACTIONS

We designed the system to minimize the master's involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order

A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. Each mutation is performed at all the chunk's replicas. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.
Figure 2: Write Control and Data Flow. (The diagram shows the client, the master, the primary replica, and secondary replicas A and B, annotated with the numbered steps below; control messages flow between the client, the master, and the primary, while data is pushed along the replica chain.)

1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. Section 3.2 discusses this further.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order.

5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary indicating that they have completed the operation.

7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. (If it had failed at the primary, it would not have been assigned a serial number and forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. Our client code handles such errors by retrying the failed mutation. It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. This leaves the file region in a consistent but undefined state as noted in Section 2.7.
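The seven steps above can be condensed into the client-side sketch below. The RPC helpers (find_lease_holder, push_data, write) and their return shapes are hypothetical, and backoff and data-chain selection details are omitted.

    def write_chunk(master, chunk_handle, data, max_attempts=3):
        """Client-side sketch of the write control flow of Section 3.1 (steps 1-7)."""
        # Steps 1-2: learn which replica holds the lease (the primary) and where the
        # other replicas are; this answer is cached for future mutations.
        primary, secondaries = master.find_lease_holder(chunk_handle)
        replicas = [primary] + secondaries
        for attempt in range(max_attempts):
            # Step 3: push the data to all replicas, in any order; each chunkserver
            # buffers it until it is used or aged out.
            data_id = None
            for replica in replicas:
                data_id = replica.push_data(chunk_handle, data)
            # Step 4: once all replicas have the data, send the write request, which
            # identifies the pushed data, to the primary; it assigns a serial number
            # and applies the mutation locally. Steps 5-7: the primary forwards the
            # request to the secondaries, collects their replies, and answers the client.
            if primary.write(chunk_handle, data_id).ok:
                return True
            # On error the modified region may be inconsistent; retry steps 3-7.
        return False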
3.2 Data Flow

We decouple the flow of data from the flow of control to use the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each machine's network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.

To fully utilize each machine's network bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). Thus, each machine's full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.

To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the "closest" machine in the network topology that has not received it. Suppose the client is pushing data to chunkservers S1 through S4. It sends the data to the closest chunkserver, say S1. S1 forwards it to whichever of S2 through S4 is closest to S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. Our network topology is simple enough that "distances" can be accurately estimated from IP addresses.

Finally, we minimize latency by pipelining the data transfer over TCP connections. Once a chunkserver receives some data, it starts forwarding immediately. Pipelining is especially helpful to us because we use a switched network with full-duplex links. Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL where T is the network throughput and L is the latency to transfer bytes between two machines. Our network links are typically 100 Mbps (T), and L is far below 1 ms. Therefore, 1 MB can ideally be distributed in about 80 ms.
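Plugging the numbers from the text into B/T + RL (our arithmetic, using the stated 100 Mbps links and 1 ms as a generous upper bound on the per-hop latency that the paper says is "far below 1 ms"):

    B = 1 * 10**6 * 8      # 1 MB expressed in bits
    T = 100 * 10**6        # 100 Mbps network links
    R = 3                  # replicas
    L = 0.001              # 1 ms per-hop latency, an assumed upper bound

    ideal_seconds = B / T + R * L
    print(ideal_seconds)   # 0.083 -> about 80 ms, matching the text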
3.3 Atomic Record Appends

GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which data is to be written. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS's choosing and returns that offset to the client. This is similar to writing to a file opened in O_APPEND mode in Unix without the race conditions when multiple writers do so concurrently.

Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently. Clients would need additional complicated and expensive synchronization, for example through a distributed lock manager, if they do so with traditional writes. In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.

Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. The client pushes the data to all replicas of the last chunk of the file. Then, it sends its request to the primary. The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. (Record append is restricted to be at most one-fourth of the maximum chunk size to keep worst-case fragmentation at an acceptable level.) If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.

If a record append fails at any replica, the client retries the operation. As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit. This property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunk even if a different replica later becomes the primary. In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). Our applications can deal with inconsistent regions as we discussed in Section 2.7.2.
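The primary's extra logic for record append might be sketched as follows. This is a simplification under assumed helper names (pad_to_end, append_at, the chunk attributes), not the actual chunkserver code, and it omits failure handling on the replication calls.

    CHUNK_SIZE = 64 * 1024 * 1024
    MAX_RECORD = CHUNK_SIZE // 4   # record append limited to 1/4 of the chunk size

    def primary_record_append(chunk, record, secondaries):
        """Sketch of the primary-side check in Section 3.3; `chunk` is assumed to expose
        `used` (bytes written so far), `handle`, `pad_to_end()` and `append_at()`."""
        assert len(record) <= MAX_RECORD
        if chunk.used + len(record) > CHUNK_SIZE:
            # Would overflow this chunk: pad it everywhere and ask the client to
            # retry on the next chunk.
            chunk.pad_to_end()
            for s in secondaries:
                s.pad_to_end(chunk.handle)
            return "RETRY_NEXT_CHUNK", None
        offset = chunk.used
        chunk.append_at(offset, record)                  # apply locally at the chosen offset
        for s in secondaries:
            s.append_at(chunk.handle, offset, record)    # secondaries use the same offset
        return "OK", offset                              # the offset is returned to the client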
3.4 Snapshot

The snapshot operation makes a copy of a file or a directory tree (the "source") almost instantaneously, while minimizing any interruptions of ongoing mutations. Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.

Like AFS [5], we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. This will give the master an opportunity to create a new copy of the chunk first.

After the leases have been revoked or have expired, the master logs the operation to disk. It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.

The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. The master notices that the reference count for chunk C is greater than one. It defers replying to the client request and instead picks a new chunk handle C'. It then asks each chunkserver that has a current replica of C to create a new chunk called C'. By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network (our disks are about three times as fast as our 100 Mb Ethernet links). From this point, request handling is no different from that for any chunk: the master grants one of the replicas a lease on the new chunk C' and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.
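A minimal sketch of the copy-on-write bookkeeping described above, under assumed structures (a per-handle reference count and a file-to-chunk map) and with lease revocation and logging omitted; the clone_chunk helper is hypothetical.

    def snapshot(src, dst, file_to_chunks, refcount):
        """Duplicate metadata only; both names now point at the same chunks (Section 3.4)."""
        file_to_chunks[dst] = list(file_to_chunks[src])
        for handle in file_to_chunks[dst]:
            refcount[handle] = refcount.get(handle, 1) + 1

    def before_write(handle, refcount, clone_chunk):
        """On the first write after a snapshot, the master notices refcount > 1 and has
        the chunkservers that already hold the chunk copy it locally."""
        if refcount.get(handle, 1) > 1:
            new_handle = clone_chunk(handle)   # hypothetical helper: local copy, new handle
            refcount[handle] -= 1
            refcount[new_handle] = 1
            return new_handle                  # subsequent writes go to the new chunk
        return handle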
4. MASTER OPERATION

The master executes all namespace operations. In addition, it manages chunk replicas throughout the system: it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. We now discuss each of these topics.

4.1 Namespace Management and Locking

Many master operations can take a long time: for example, a snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot. We do not want to delay other master operations while they are running. Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.

Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. With prefix compression, this table can be efficiently represented in memory. Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.

Each master operation acquires a set of locks before it runs. Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. Note that leaf may be a file or directory depending on the operation.

We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. File creation does not require a write lock on the parent directory because there is no "directory", or inode-like, data structure to be protected from modification. The read lock on the name is sufficient to protect the parent directory from deletion.

One nice property of this locking scheme is that it allows concurrent mutations in the same directory. For example, multiple file creations can be executed concurrently in the same directory: each acquires a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. The write locks on file names serialize attempts to create a file with the same name twice.

Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.
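The lock-set rule and the deadlock-avoiding acquisition order can be sketched as follows; the function name and the (path, mode) pair representation are assumptions for illustration.

    def locks_needed(pathname, write_on_leaf):
        """Per Section 4.1: read locks on every ancestor name, plus a read or write
        lock on the full pathname itself, ordered by tree level then lexicographically."""
        parts = pathname.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        locks = [(p, "r") for p in ancestors]
        locks.append((pathname, "w" if write_on_leaf else "r"))
        # Consistent total order prevents deadlock between concurrent operations.
        return sorted(locks, key=lambda pm: (pm[0].count("/"), pm[0]))

    # Example from the text: snapshotting /home/user to /save/user write-locks
    # /home/user, so creating /home/user/foo (which needs a read lock on /home/user)
    # is properly serialized against it.
    print(locks_needed("/home/user/foo", write_on_leaf=True))
    # [('/home', 'r'), ('/home/user', 'r'), ('/home/user/foo', 'w')]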
                                                                  master limits the numbers of active clone operations both
4.2 Replica Placement                                             for the cluster and for each chunkserver. Additionally, each
   A GFS cluster is highly distributed at more levels than        chunkserver limits the amount of bandwidth it spends on
one. It typically has hundreds of chunkservers spread across      each clone operation by throttling its read requests to the
many machine racks. These chunkservers in turn may be             source chunkserver.
accessed from hundreds of clients from the same or different          Finally, the master rebalances replicas periodically: it ex-
4.2 Replica Placement
   A GFS cluster is highly distributed at more levels than one. It typically
has hundreds of chunkservers spread across many machine racks. These
chunkservers in turn may be accessed from hundreds of clients from the same
or different racks. Communication between two machines on different racks may
cross one or more network switches. Additionally, bandwidth into or out of a
rack may be less than the aggregate bandwidth of all the machines within the
rack. Multi-level distribution presents a unique challenge to distribute data
for scalability, reliability, and availability.
   The chunk replica placement policy serves two purposes: maximize data
reliability and availability, and maximize network bandwidth utilization. For
both, it is not enough to spread replicas across machines, which only guards
against disk or machine failures and fully utilizes each machine's network
bandwidth. We must also spread chunk replicas across racks. This ensures that
some replicas of a chunk will survive and remain available even if an entire
rack is damaged or offline (for example, due to failure of a shared resource
like a network switch or power circuit). It also means that traffic,
especially reads, for a chunk can exploit the aggregate bandwidth of multiple
racks. On the other hand, write traffic has to flow through multiple racks, a
tradeoff we make willingly.
4.3 Creation, Re-replication, Rebalancing
   Chunk replicas are created for three reasons: chunk creation,
re-replication, and rebalancing.
   When the master creates a chunk, it chooses where to place the initially
empty replicas. It considers several factors. (1) We want to place new
replicas on chunkservers with below-average disk space utilization. Over time
this will equalize disk utilization across chunkservers. (2) We want to limit
the number of "recent" creations on each chunkserver. Although creation
itself is cheap, it reliably predicts imminent heavy write traffic because
chunks are created when demanded by writes, and in our append-once-read-many
workload they typically become practically read-only once they have been
completely written. (3) As discussed above, we want to spread replicas of a
chunk across racks.
   The master re-replicates a chunk as soon as the number of available
replicas falls below a user-specified goal. This could happen for various
reasons: a chunkserver becomes unavailable, it reports that its replica may
be corrupted, one of its disks is disabled because of errors, or the
replication goal is increased. Each chunk that needs to be re-replicated is
prioritized based on several factors. One is how far it is from its
replication goal. For example, we give higher priority to a chunk that has
lost two replicas than to a chunk that has lost only one. In addition, we
prefer to first re-replicate chunks for live files as opposed to chunks that
belong to recently deleted files (see Section 4.4). Finally, to minimize the
impact of failures on running applications, we boost the priority of any
chunk that is blocking client progress.
   The master picks the highest priority chunk and "clones" it by instructing
some chunkserver to copy the chunk data directly from an existing valid
replica. The new replica is placed with goals similar to those for creation:
equalizing disk space utilization, limiting active clone operations on any
single chunkserver, and spreading replicas across racks. To keep cloning
traffic from overwhelming client traffic, the master limits the numbers of
active clone operations both for the cluster and for each chunkserver.
Additionally, each chunkserver limits the amount of bandwidth it spends on
each clone operation by throttling its read requests to the source
chunkserver.
   Finally, the master rebalances replicas periodically: it examines the
current replica distribution and moves replicas for better disk space and
load balancing. Also through this process, the master gradually fills up a
new chunkserver rather than instantly swamps it with new chunks and the heavy
write traffic that comes with them. The placement criteria for the new
replica are similar to those discussed above. In addition, the master must
also choose which existing replica to remove. In general, it prefers to
remove those on chunkservers with below-average free space so as to equalize
disk space usage.
4.4 Garbage Collection
   After a file is deleted, GFS does not immediately reclaim the available
physical storage. It does so only lazily during regular garbage collection at
both the file and chunk levels. We find that this approach makes the system
much simpler and more reliable.

4.4.1 Mechanism
   When a file is deleted by the application, the master logs the deletion
immediately just like other changes. However, instead of reclaiming resources
immediately, the file is just renamed to a hidden name that includes the
deletion timestamp. During the master's regular scan of the file system
namespace, it removes any such hidden files if they have existed for more
than three days (the interval is configurable). Until then, the file can
still be read under the new, special name and can be undeleted by renaming it
back to normal. When the hidden file is removed from the namespace, its
in-memory metadata is erased. This effectively severs its links to all its
chunks.
   In a similar regular scan of the chunk namespace, the master identifies
orphaned chunks (i.e., those not reachable from any file) and erases the
metadata for those chunks. In a HeartBeat message regularly exchanged with
the master, each chunkserver reports a subset of the chunks it has, and the
master replies with the identity of all chunks that are no longer present in
the master's metadata. The chunkserver is free to delete its replicas of such
chunks.
4.4.2 Discussion
   Although distributed garbage collection is a hard problem that demands
complicated solutions in the context of programming languages, it is quite
simple in our case. We can easily identify all references to chunks: they are
in the file-to-chunk mappings maintained exclusively by the master. We can
also easily identify all the chunk replicas: they are Linux files under
designated directories on each chunkserver. Any such replica not known to the
master is "garbage."
   The garbage collection approach to storage reclamation offers several
advantages over eager deletion. First, it is simple and reliable in a
large-scale distributed system where component failures are common. Chunk
creation may succeed on some chunkservers but not others, leaving replicas
that the master does not know exist. Replica deletion messages may be lost,
and the master has to remember to resend them across failures, both its own
and the chunkserver's. Garbage collection provides a uniform and dependable
way to clean up any replicas not known to be useful. Second, it merges
storage reclamation into the regular background activities of the master,
such as the regular scans of namespaces and handshakes with chunkservers.
Thus, it is done in batches and the cost is amortized. Moreover, it is done
only when the master is relatively free. The master can respond more promptly
to client requests that demand timely attention. Third, the delay in
reclaiming storage provides a safety net against accidental, irreversible
deletion.
   In our experience, the main disadvantage is that the delay sometimes
hinders user effort to fine tune usage when storage is tight. Applications
that repeatedly create and delete temporary files may not be able to reuse
the storage right away. We address these issues by expediting storage
reclamation if a deleted file is explicitly deleted again. We also allow
users to apply different replication and reclamation policies to different
parts of the namespace. For example, users can specify that all the chunks in
the files within some directory tree are to be stored without replication,
and any deleted files are immediately and irrevocably removed from the file
system state.
4.5 Stale Replica Detection
   Chunk replicas may become stale if a chunkserver fails and misses
mutations to the chunk while it is down. For each chunk, the master maintains
a chunk version number to distinguish between up-to-date and stale replicas.
   Whenever the master grants a new lease on a chunk, it increases the chunk
version number and informs the up-to-date replicas. The master and these
replicas all record the new version number in their persistent state. This
occurs before any client is notified and therefore before it can start
writing to the chunk. If another replica is currently unavailable, its chunk
version number will not be advanced. The master will detect that this
chunkserver has a stale replica when the chunkserver restarts and reports its
set of chunks and their associated version numbers. If the master sees a
version number greater than the one in its records, the master assumes that
it failed when granting the lease and so takes the higher version to be
up-to-date.
   The master removes stale replicas in its regular garbage collection.
Before that, it effectively considers a stale replica not to exist at all
when it replies to client requests for chunk information. As another
safeguard, the master includes the chunk version number when it informs
clients which chunkserver holds a lease on a chunk or when it instructs a
chunkserver to read the chunk from another chunkserver in a cloning
operation. The client or the chunkserver verifies the version number when it
performs the operation so that it is always accessing up-to-date data.
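A minimal sketch of this version-number bookkeeping (hypothetical code with
invented names, not the master's implementation):

master_versions = {}     # chunk handle -> latest version known to the master

def grant_lease(chunk):
    master_versions[chunk] = master_versions.get(chunk, 0) + 1
    return master_versions[chunk]        # up-to-date replicas record this value

def classify_replica(chunk, reported_version):
    current = master_versions.get(chunk, 0)
    if reported_version < current:
        return "stale"                   # missed mutations; removed in regular GC
    if reported_version > current:
        # The master failed after bumping but before recording the grant, so it
        # adopts the higher reported version as up-to-date.
        master_versions[chunk] = reported_version
    return "up-to-date"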
5. FAULT TOLERANCE AND DIAGNOSIS
   One of our greatest challenges in designing the system is dealing with
frequent component failures. The quality and quantity of components together
make these problems more the norm than the exception: we cannot completely
trust the machines, nor can we completely trust the disks. Component failures
can result in an unavailable system or, worse, corrupted data. We discuss how
we meet these challenges and the tools we have built into the system to
diagnose problems when they inevitably occur.

5.1 High Availability
   Among hundreds of servers in a GFS cluster, some are bound to be
unavailable at any given time. We keep the overall system highly available
with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery
   Both the master and the chunkserver are designed to restore their state
and start in seconds no matter how they terminated. In fact, we do not
distinguish between normal and abnormal termination; servers are routinely
shut down just by killing the process. Clients and other servers experience a
minor hiccup as they time out on their outstanding requests, reconnect to the
restarted server, and retry. Section 6.2.2 reports observed startup times.

5.1.2 Chunk Replication
   As discussed earlier, each chunk is replicated on multiple chunkservers on
different racks. Users can specify different replication levels for different
parts of the file namespace. The default is three. The master clones existing
replicas as needed to keep each chunk fully replicated as chunkservers go
offline or detect corrupted replicas through checksum verification (see
Section 5.2). Although replication has served us well, we are exploring other
forms of cross-server redundancy such as parity or erasure codes for our
increasing read-only storage requirements. We expect that it is challenging
but manageable to implement these more complicated redundancy schemes in our
very loosely coupled system because our traffic is dominated by appends and
reads rather than small random writes.

5.1.3 Master Replication
   The master state is replicated for reliability. Its operation log and
checkpoints are replicated on multiple machines. A mutation to the state is
considered committed only after its log record has been flushed to disk
locally and on all master replicas. For simplicity, one master process
remains in charge of all mutations as well as background activities such as
garbage collection that change the system internally. When it fails, it can
restart almost instantly. If its machine or disk fails, monitoring
infrastructure outside GFS starts a new master process elsewhere with the
replicated operation log. Clients use only the canonical name of the master
(e.g. gfs-test), which is a DNS alias that can be changed if the master is
relocated to another machine.
   Moreover, "shadow" masters provide read-only access to the file system
even when the primary master is down. They are shadows, not mirrors, in that
they may lag the primary slightly, typically fractions of a second. They
enhance read availability for files that are not being actively mutated or
applications that do not mind getting slightly stale results. In fact, since
file content is read from chunkservers, applications do not observe stale
file content. What could be
stale within short windows is file metadata, like directory contents or
access control information.
   To keep itself informed, a shadow master reads a replica of the growing
operation log and applies the same sequence of changes to its data structures
exactly as the primary does. Like the primary, it polls chunkservers at
startup (and infrequently thereafter) to locate chunk replicas and exchanges
frequent handshake messages with them to monitor their status. It depends on
the primary master only for replica location updates resulting from the
primary's decisions to create and delete replicas.
5.2 Data Integrity
   Each chunkserver uses checksumming to detect corruption of stored data.
Given that a GFS cluster often has thousands of disks on hundreds of
machines, it regularly experiences disk failures that cause data corruption
or loss on both the read and write paths. (See Section 7 for one cause.) We
can recover from corruption using other chunk replicas, but it would be
impractical to detect corruption by comparing replicas across chunkservers.
Moreover, divergent replicas may be legal: the semantics of GFS mutations, in
particular atomic record append as discussed earlier, does not guarantee
identical replicas. Therefore, each chunkserver must independently verify the
integrity of its own copy by maintaining checksums.
   A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit
checksum. Like other metadata, checksums are kept in memory and stored
persistently with logging, separate from user data.
   For reads, the chunkserver verifies the checksum of data blocks that
overlap the read range before returning any data to the requester, whether a
client or another chunkserver. Therefore chunkservers will not propagate
corruptions to other machines. If a block does not match the recorded
checksum, the chunkserver returns an error to the requester and reports the
mismatch to the master. In response, the requester will read from other
replicas, while the master will clone the chunk from another replica. After a
valid new replica is in place, the master instructs the chunkserver that
reported the mismatch to delete its replica.
   Checksumming has little effect on read performance for several reasons.
Since most of our reads span at least a few blocks, we need to read and
checksum only a relatively small amount of extra data for verification. GFS
client code further reduces this overhead by trying to align reads at
checksum block boundaries. Moreover, checksum lookups and comparison on the
chunkserver are done without any I/O, and checksum calculation can often be
overlapped with I/Os.
   Checksum computation is heavily optimized for writes that append to the
end of a chunk (as opposed to writes that overwrite existing data) because
they are dominant in our workloads. We just incrementally update the checksum
for the last partial checksum block, and compute new checksums for any brand
new checksum blocks filled by the append. Even if the last partial checksum
block is already corrupted and we fail to detect it now, the new checksum
value will not match the stored data, and the corruption will be detected as
usual when the block is next read.
   In contrast, if a write overwrites an existing range of the chunk, we must
read and verify the first and last blocks of the range being overwritten,
then perform the write, and finally compute and record the new checksums. If
we do not verify the first and last blocks before overwriting them partially,
the new checksums may hide corruption that exists in the regions not being
overwritten.
   During idle periods, chunkservers can scan and verify the contents of
inactive chunks. This allows us to detect corruption in chunks that are
rarely read. Once the corruption is detected, the master can create a new
uncorrupted replica and delete the corrupted replica. This prevents an
inactive but corrupted chunk replica from fooling the master into thinking
that it has enough valid replicas of a chunk.
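The sketch below illustrates this scheme with CRC32 over 64 KB blocks. It is
a simplified stand-in for the chunkserver logic, not the actual code; for
example, it recomputes the last partial block on append rather than updating
its checksum incrementally, and all function names are invented.

import zlib

BLOCK = 64 * 1024                       # 64 KB checksum blocks

def block_checksums(data):
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify_read(data, checksums, offset, length):
    """Verify every block overlapping [offset, offset+length) before serving it."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(data[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")  # report to master
    return data[offset:offset + length]

def append(data, checksums, record):
    """Append a record, refreshing only the last (possibly partial) block and
    computing checksums for any brand-new blocks filled by the append."""
    data += record
    first_dirty = len(checksums) - 1 if checksums else 0
    checksums[first_dirty:] = block_checksums(data[first_dirty * BLOCK:])
    return data, checksums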
5.3 Diagnostic Tools
   Extensive and detailed diagnostic logging has helped immeasurably in
problem isolation, debugging, and performance analysis, while incurring only
a minimal cost. Without logs, it is hard to understand transient,
non-repeatable interactions between machines. GFS servers generate diagnostic
logs that record many significant events (such as chunkservers going up and
down) and all RPC requests and replies. These diagnostic logs can be freely
deleted without affecting the correctness of the system. However, we try to
keep these logs around as far as space permits.
   The RPC logs include the exact requests and responses sent on the wire,
except for the file data being read or written. By matching requests with
replies and collating RPC records on different machines, we can reconstruct
the entire interaction history to diagnose a problem. The logs also serve as
traces for load testing and performance analysis.
   The performance impact of logging is minimal (and far outweighed by the
benefits) because these logs are written sequentially and asynchronously. The
most recent events are also kept in memory and available for continuous
online monitoring.

6. MEASUREMENTS
   In this section we present a few micro-benchmarks to illustrate the
bottlenecks inherent in the GFS architecture and implementation, and also
some numbers from real clusters in use at Google.

6.1 Micro-benchmarks
   We measured performance on a GFS cluster consisting of one master, two
master replicas, 16 chunkservers, and 16 clients. Note that this
configuration was set up for ease of testing. Typical clusters have hundreds
of chunkservers and hundreds of clients.
   All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of
memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet
connection to an HP 2524 switch. All 19 GFS server machines are connected to
one switch, and all 16 client machines to the other. The two switches are
connected with a 1 Gbps link.

6.1.1 Reads
   N clients read simultaneously from the file system. Each client reads a
randomly selected 4 MB region from a 320 GB file set. This is repeated 256
times so that each client ends up reading 1 GB of data. The chunkservers
taken together have only 32 GB of memory, so we expect at most a 10% hit rate
in the Linux buffer cache. Our results should be close to cold cache results.
   Figure 3(a) shows the aggregate read rate for N clients and its
theoretical limit. The limit peaks at an aggregate of 125 MB/s when the
1 Gbps link between the two switches is saturated, or 12.5 MB/s per client
when its 100 Mbps network interface gets saturated, whichever applies. The
observed read rate is 10 MB/s, or 80% of the per-client limit, when just one
client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the
125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency
drops from 80% to 75% because as the number of readers increases, so does the
probability that multiple readers simultaneously read from the same
chunkserver.

6.1.2 Writes
   N clients write simultaneously to N distinct files. Each client writes
1 GB of data to a new file in a series of 1 MB writes. The aggregate write
rate and its theoretical limit are shown in Figure 3(b). The limit plateaus
at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers,
each with a 12.5 MB/s input connection.
   The write rate for one client is 6.3 MB/s, about half of the limit. The
main culprit for this is our network stack. It does not interact very well
with the pipelining scheme we use for pushing data to chunk replicas. Delays
in propagating data from one replica to another reduce the overall write
rate.
   Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per
client), about half the theoretical limit. As in the case of reads, it
becomes more likely that multiple clients write concurrently to the same
chunkserver as the number of clients increases. Moreover, collision is more
likely for 16 writers than for 16 readers because each write involves three
different replicas.
   Writes are slower than we would like. In practice this has not been a
major problem because even though it increases the latencies as seen by
individual clients, it does not significantly affect the aggregate write
bandwidth delivered by the system to a large number of clients.
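The theoretical limits quoted above follow directly from the link speeds; the
short back-of-the-envelope restatement below is only a reading aid, not an
additional measurement.

# How the theoretical curves in Figure 3(a) and 3(b) are derived.
LINK_1GBPS = 1000 / 8        # inter-switch link: 1 Gbps   -> 125 MB/s
NIC_100MBPS = 100 / 8        # per-machine NIC:   100 Mbps -> 12.5 MB/s

read_limit_aggregate = LINK_1GBPS            # 125 MB/s when the link saturates
read_limit_per_client = NIC_100MBPS          # 12.5 MB/s when one NIC saturates

# Each written byte goes to 3 of the 16 chunkservers, each with a 12.5 MB/s
# inbound connection, so aggregate writes plateau around 67 MB/s.
write_limit_aggregate = 16 * NIC_100MBPS / 3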
6.1.3 Record Appends
   Figure 3(c) shows record append performance. N clients append
simultaneously to a single file. Performance is limited by the network
bandwidth of the chunkservers that store the last chunk of the file,
independent of the number of clients. It starts at 6.0 MB/s for one client
and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances
in network transfer rates seen by different clients.
   Our applications tend to produce multiple such files concurrently. In
other words, N clients append to M shared files simultaneously where both N
and M are in the dozens or hundreds. Therefore, the chunkserver network
congestion in our experiment is not a significant issue in practice because a
client can make progress on writing one file while the chunkservers for
another file are busy.

6.2 Real World Clusters
   We now examine two clusters in use within Google that are representative
of several others like them. Cluster A is used regularly for research and
development by over a hundred engineers. A typical task is initiated by a
human user and runs up to several hours. It reads through a few MBs to a few
TBs of data, transforms or analyzes the data, and writes the results back to
the cluster. Cluster B is primarily used for production data processing. The
tasks last much longer and continuously generate and process multi-TB data
sets with only occasional human intervention. In both cases, a single "task"
consists of many processes on many machines reading and writing many files
simultaneously.

   Cluster                        A          B
   Chunkservers                   342        227
   Available disk space           72 TB      180 TB
   Used disk space                55 TB      155 TB
   Number of Files                735 k      737 k
   Number of Dead files           22 k       232 k
   Number of Chunks               992 k      1550 k
   Metadata at chunkservers       13 GB      21 GB
   Metadata at master             48 MB      60 MB

   Table 2: Characteristics of two GFS clusters

6.2.1 Storage
   As shown by the first five entries in the table, both clusters have
hundreds of chunkservers, support many TBs of disk space, and are fairly but
not completely full. "Used space" includes all chunk replicas. Virtually all
files are replicated three times. Therefore, the clusters store 18 TB and
52 TB of file data respectively.
   The two clusters have similar numbers of files, though B has a larger
proportion of dead files, namely files which were deleted or replaced by a
new version but whose storage has not yet been reclaimed. It also has more
chunks because its files tend to be larger.

6.2.2 Metadata
   The chunkservers in aggregate store tens of GBs of metadata, mostly the
checksums for 64 KB blocks of user data. The only other metadata kept at the
chunkservers is the chunk version number discussed in Section 4.5.
   The metadata kept at the master is much smaller, only tens of MBs, or
about 100 bytes per file on average. This agrees with our assumption that the
size of the master's memory does not limit the system's capacity in practice.
Most of the per-file metadata is the file names stored in a prefix-compressed
form. Other metadata includes file ownership and permissions, mapping from
files to chunks, and each chunk's current version. In addition, for each
chunk we store the current replica locations and a reference count for
implementing copy-on-write.
   Each individual server, both chunkservers and the master, has only 50 to
100 MB of metadata. Therefore recovery is fast: it takes only a few seconds
to read this metadata from disk before the server is able to answer queries.
However, the master is somewhat hobbled for a period (typically 30 to 60
seconds) until it has fetched chunk location information from all
chunkservers.

6.2.3 Read and Write Rates
   Table 3 shows read and write rates for various time periods. Both clusters
had been up for about one week when these measurements were taken. (The
clusters had been restarted recently to upgrade to a new version of GFS.)
   The average write rate was less than 30 MB/s since the restart. When we
took these measurements, B was in the middle of a burst of write activity
generating about 100 MB/s of data, which produced a 300 MB/s network load
because writes are propagated to three replicas.
Figure 3: Aggregate Throughputs. Top curves show theoretical limits imposed
by our network topology. Bottom curves show measured throughputs. They have
error bars that show 95% confidence intervals, which are illegible in some
cases because of low variance in measurements. (Three panels plot rate in
MB/s against the number of clients N: (a) Reads, (b) Writes, and (c) Record
appends.)

   Cluster                        A            B
   Read rate (last minute)        583 MB/s     380 MB/s
   Read rate (last hour)          562 MB/s     384 MB/s
   Read rate (since restart)      589 MB/s     49 MB/s
   Write rate (last minute)       1 MB/s       101 MB/s
   Write rate (last hour)         2 MB/s       117 MB/s
   Write rate (since restart)     25 MB/s      13 MB/s
   Master ops (last minute)       325 Ops/s    533 Ops/s
   Master ops (last hour)         381 Ops/s    518 Ops/s
   Master ops (since restart)     202 Ops/s    347 Ops/s

   Table 3: Performance Metrics for Two GFS Clusters

   The read rates were much higher than the write rates. The total workload
consists of more reads than writes as we have assumed. Both clusters were in
the middle of heavy read activity. In particular, A had been sustaining a
read rate of 580 MB/s for the preceding week. Its network configuration can
support 750 MB/s, so it was using its resources efficiently. Cluster B can
support peak read rates of 1300 MB/s, but its applications were using just
380 MB/s.
6.2.4 Master Load
   Table 3 also shows that the rate of operations sent to the master was
around 200 to 500 operations per second. The master can easily keep up with
this rate, and therefore is not a bottleneck for these workloads.
   In an earlier version of GFS, the master was occasionally a bottleneck for
some workloads. It spent most of its time sequentially scanning through large
directories (which contained hundreds of thousands of files) looking for
particular files. We have since changed the master data structures to allow
efficient binary searches through the namespace. It can now easily support
many thousands of file accesses per second. If necessary, we could speed it
up further by placing name lookup caches in front of the namespace data
structures.

6.2.5 Recovery Time
   After a chunkserver fails, some chunks will become under-replicated and
must be cloned to restore their replication levels. The time it takes to
restore all such chunks depends on the amount of resources. In one
experiment, we killed a single chunkserver in cluster B. The chunkserver had
about 15,000 chunks containing 600 GB of data. To limit the impact on running
applications and provide leeway for scheduling decisions, our default
parameters limit this cluster to 91 concurrent clonings (40% of the number of
chunkservers) where each clone operation is allowed to consume at most
6.25 MB/s (50 Mbps). All chunks were restored in 23.2 minutes, at an
effective replication rate of 440 MB/s.
   In another experiment, we killed two chunkservers each with roughly 16,000
chunks and 660 GB of data. This double failure reduced 266 chunks to having a
single replica. These 266 chunks were cloned at a higher priority, and were
all restored to at least 2x replication within 2 minutes, thus putting the
cluster in a state where it could tolerate another chunkserver failure
without data loss.
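As a back-of-the-envelope check of these figures (using the cluster B size
from Table 2), the limits and the effective rate work out as follows; this is
only a restatement of the numbers above, not an additional experiment.

chunkservers = 227                               # cluster B, from Table 2
concurrent_clones = round(0.40 * chunkservers)   # 40% of chunkservers -> 91
per_clone_limit_mb_s = 50 / 8                    # 50 Mbps -> 6.25 MB/s

data_gb = 600
minutes = 23.2
effective_rate_mb_s = data_gb * 1024 / (minutes * 60)   # roughly 440 MB/s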
6.3 Workload Breakdown
   In this section, we present a detailed breakdown of the workloads on two
GFS clusters comparable but not identical to those in Section 6.2. Cluster X
is for research and development while cluster Y is for production data
processing.

6.3.1 Methodology and Caveats
   These results include only client originated requests so that they reflect
the workload generated by our applications for the file system as a whole.
They do not include inter-server requests to carry out client requests or
internal background activities, such as forwarded writes or rebalancing.
   Statistics on I/O operations are based on information heuristically
reconstructed from actual RPC requests logged by GFS servers. For example,
GFS client code may break a read into multiple RPCs to increase parallelism,
from which we infer the original read. Since our access patterns are highly
stylized, we expect any error to be in the noise. Explicit logging by
applications might have provided slightly more accurate data, but it is
logistically impossible to recompile and restart thousands of running clients
to do so and cumbersome to collect the results from as many machines.
   One should be careful not to overly generalize from our workload. Since
Google completely controls both GFS and its applications, the applications
tend to be tuned for GFS, and conversely GFS is designed for these
applications. Such mutual influence may also exist between general
applications
gfs-sosp2003
gfs-sosp2003
gfs-sosp2003

More Related Content

What's hot

Veritas Storage Foundation
Veritas Storage FoundationVeritas Storage Foundation
Veritas Storage FoundationSymantec
 
4 Ways To Save Big Money in Your Data Center and Private Cloud
4 Ways To Save Big Money in Your Data Center and Private Cloud4 Ways To Save Big Money in Your Data Center and Private Cloud
4 Ways To Save Big Money in Your Data Center and Private Cloudtervela
 
Symantec Netbackup Appliance Family
Symantec Netbackup Appliance FamilySymantec Netbackup Appliance Family
Symantec Netbackup Appliance FamilySymantec
 
2 25008 domain_ten11.29.12_v2_opt
2 25008 domain_ten11.29.12_v2_opt2 25008 domain_ten11.29.12_v2_opt
2 25008 domain_ten11.29.12_v2_optEdda Kang
 
Storage Management and High Availability 6.0 Launch
Storage Management and High Availability 6.0 LaunchStorage Management and High Availability 6.0 Launch
Storage Management and High Availability 6.0 LaunchSymantec
 
TECHNICAL BRIEF▶ NetBackup 7.6 Deduplication Technology
TECHNICAL BRIEF▶ NetBackup 7.6 Deduplication TechnologyTECHNICAL BRIEF▶ NetBackup 7.6 Deduplication Technology
TECHNICAL BRIEF▶ NetBackup 7.6 Deduplication TechnologySymantec
 
Flevy.com - Feasibility Study for an eMail Archiving solution
Flevy.com - Feasibility Study for an eMail Archiving solutionFlevy.com - Feasibility Study for an eMail Archiving solution
Flevy.com - Feasibility Study for an eMail Archiving solutionDavid Tracy
 
TECHNICAL WHITE PAPER: NetBackup Appliances WAN Optimization
TECHNICAL WHITE PAPER: NetBackup Appliances WAN OptimizationTECHNICAL WHITE PAPER: NetBackup Appliances WAN Optimization
TECHNICAL WHITE PAPER: NetBackup Appliances WAN OptimizationSymantec
 
Oracle Distributed Document Capture
Oracle Distributed Document CaptureOracle Distributed Document Capture
Oracle Distributed Document Captureguest035a27
 
Novell File Management Suite Use Cases
Novell File Management Suite Use CasesNovell File Management Suite Use Cases
Novell File Management Suite Use CasesNovell
 
Control the Chaos with Novell File Reporter
Control the Chaos with Novell File ReporterControl the Chaos with Novell File Reporter
Control the Chaos with Novell File ReporterNovell
 
Disaster Recovery for the Real-Time Data Warehouses
Disaster Recovery for the Real-Time Data WarehousesDisaster Recovery for the Real-Time Data Warehouses
Disaster Recovery for the Real-Time Data Warehousestervela
 
Attaching cloud storage to a campus grid using parrot, chirp, and hadoop
Attaching cloud storage to a campus grid using parrot, chirp, and hadoopAttaching cloud storage to a campus grid using parrot, chirp, and hadoop
Attaching cloud storage to a campus grid using parrot, chirp, and hadoopJoão Gabriel Lima
 

What's hot (15)

Veritas Storage Foundation
Veritas Storage FoundationVeritas Storage Foundation
Veritas Storage Foundation
 
635 642
635 642635 642
635 642
 
4 Ways To Save Big Money in Your Data Center and Private Cloud
4 Ways To Save Big Money in Your Data Center and Private Cloud4 Ways To Save Big Money in Your Data Center and Private Cloud
4 Ways To Save Big Money in Your Data Center and Private Cloud
 
Symantec Netbackup Appliance Family
Symantec Netbackup Appliance FamilySymantec Netbackup Appliance Family
Symantec Netbackup Appliance Family
 
2 25008 domain_ten11.29.12_v2_opt
2 25008 domain_ten11.29.12_v2_opt2 25008 domain_ten11.29.12_v2_opt
2 25008 domain_ten11.29.12_v2_opt
 
Storage Management and High Availability 6.0 Launch
Storage Management and High Availability 6.0 LaunchStorage Management and High Availability 6.0 Launch
Storage Management and High Availability 6.0 Launch
 
TECHNICAL BRIEF▶ NetBackup 7.6 Deduplication Technology
TECHNICAL BRIEF▶ NetBackup 7.6 Deduplication TechnologyTECHNICAL BRIEF▶ NetBackup 7.6 Deduplication Technology
TECHNICAL BRIEF▶ NetBackup 7.6 Deduplication Technology
 
Flevy.com - Feasibility Study for an eMail Archiving solution
Flevy.com - Feasibility Study for an eMail Archiving solutionFlevy.com - Feasibility Study for an eMail Archiving solution
Flevy.com - Feasibility Study for an eMail Archiving solution
 
TECHNICAL WHITE PAPER: NetBackup Appliances WAN Optimization
TECHNICAL WHITE PAPER: NetBackup Appliances WAN OptimizationTECHNICAL WHITE PAPER: NetBackup Appliances WAN Optimization
TECHNICAL WHITE PAPER: NetBackup Appliances WAN Optimization
 
Oracle Distributed Document Capture
Oracle Distributed Document CaptureOracle Distributed Document Capture
Oracle Distributed Document Capture
 
Novell File Management Suite Use Cases
Novell File Management Suite Use CasesNovell File Management Suite Use Cases
Novell File Management Suite Use Cases
 
University of Siegen
University of Siegen University of Siegen
University of Siegen
 
Control the Chaos with Novell File Reporter
Control the Chaos with Novell File ReporterControl the Chaos with Novell File Reporter
Control the Chaos with Novell File Reporter
 
Disaster Recovery for the Real-Time Data Warehouses
Disaster Recovery for the Real-Time Data WarehousesDisaster Recovery for the Real-Time Data Warehouses
Disaster Recovery for the Real-Time Data Warehouses
 
Attaching cloud storage to a campus grid using parrot, chirp, and hadoop
Attaching cloud storage to a campus grid using parrot, chirp, and hadoopAttaching cloud storage to a campus grid using parrot, chirp, and hadoop
Attaching cloud storage to a campus grid using parrot, chirp, and hadoop
 

Viewers also liked

http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdf
http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdfhttp://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdf
http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdfHiroshi Ono
 

Viewers also liked (6)

vldb08a
vldb08avldb08a
vldb08a
 
http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdf
http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdfhttp://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdf
http://www.logos.ic.i.u-tokyo.ac.jp/~kay/papers/ccgrid2008_stable_broadcast.pdf
 
qpipe
qpipeqpipe
qpipe
 
chubby_osdi06
chubby_osdi06chubby_osdi06
chubby_osdi06
 
thrift-20070401
thrift-20070401thrift-20070401
thrift-20070401
 
100420
100420100420
100420
 

Similar to gfs-sosp2003

Google File System
Google File SystemGoogle File System
Google File Systemvivatechijri
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...IOSR Journals
 
Megastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive servicesMegastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive servicesJoão Gabriel Lima
 
Cloud computing overview
Cloud computing overviewCloud computing overview
Cloud computing overviewKHANSAFEE
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDATAVERSITY
 
A Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-DuplicationA Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-DuplicationEditor IJMTER
 
Privacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storagePrivacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storagedbpublications
 
M0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptxM0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptxviveknagle4
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 
Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)
Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)
Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)David Rosenblum
 
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...IRJET Journal
 
THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...
THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...
THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...IRJET Journal
 
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdfHOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdfAgaram Technologies
 

Similar to gfs-sosp2003 (20)

Gfs论文
Gfs论文Gfs论文
Gfs论文
 
The google file system
The google file systemThe google file system
The google file system
 
Gfs sosp2003
Gfs sosp2003Gfs sosp2003
Gfs sosp2003
 
Gfs
GfsGfs
Gfs
 
Google File System
Google File SystemGoogle File System
Google File System
 
191
191191
191
 
H017144148
H017144148H017144148
H017144148
 
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
 
Megastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive servicesMegastore providing scalable, highly available storage for interactive services
Megastore providing scalable, highly available storage for interactive services
 
Cloud computing overview
Cloud computing overviewCloud computing overview
Cloud computing overview
 
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled ArchitectureDM Radio Webinar: Adopting a Streaming-Enabled Architecture
DM Radio Webinar: Adopting a Streaming-Enabled Architecture
 
A Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-DuplicationA Hybrid Cloud Approach for Secure Authorized De-Duplication
A Hybrid Cloud Approach for Secure Authorized De-Duplication
 
Privacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storagePrivacy preserving public auditing for secured cloud storage
Privacy preserving public auditing for secured cloud storage
 
M0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptxM0339_v1_6977127809 (1).pptx
M0339_v1_6977127809 (1).pptx
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)
Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)
Scalability: What It Is and How to Analyze It (keynote talk at SBES 2007)
 
Facade
FacadeFacade
Facade
 
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...IRJET-  	  A Study of Comparatively Analysis for HDFS and Google File System ...
IRJET- A Study of Comparatively Analysis for HDFS and Google File System ...
 
THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...
THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...
THE SURVEY ON REFERENCE MODEL FOR OPEN STORAGE SYSTEMS INTERCONNECTION MASS S...
 
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdfHOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
HOW-CLOUD-IMPLEMENTATION-CAN-ENSURE-MAXIMUM-ROI.pdf
 

More from Hiroshi Ono

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipediaHiroshi Ono
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説Hiroshi Ono
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitectureHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfHiroshi Ono
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdfHiroshi Ono
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdfHiroshi Ono
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfHiroshi Ono
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdfHiroshi Ono
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdfHiroshi Ono
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdfHiroshi Ono
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfHiroshi Ono
 

More from Hiroshi Ono (20)

Voltdb - wikipedia
Voltdb - wikipediaVoltdb - wikipedia
Voltdb - wikipedia
 
Gamecenter概説
Gamecenter概説Gamecenter概説
Gamecenter概説
 
EventDrivenArchitecture
EventDrivenArchitectureEventDrivenArchitecture
EventDrivenArchitecture
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfpragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdf
 
downey08semaphores.pdf
downey08semaphores.pdfdowney08semaphores.pdf
downey08semaphores.pdf
 
BOF1-Scala02.pdf
BOF1-Scala02.pdfBOF1-Scala02.pdf
BOF1-Scala02.pdf
 
TwitterOct2008.pdf
TwitterOct2008.pdfTwitterOct2008.pdf
TwitterOct2008.pdf
 
camel-scala.pdf
camel-scala.pdfcamel-scala.pdf
camel-scala.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
SACSIS2009_TCP.pdf
SACSIS2009_TCP.pdfSACSIS2009_TCP.pdf
SACSIS2009_TCP.pdf
 
scalaliftoff2009.pdf
scalaliftoff2009.pdfscalaliftoff2009.pdf
scalaliftoff2009.pdf
 
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfstateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdf
 
program_draft3.pdf
program_draft3.pdfprogram_draft3.pdf
program_draft3.pdf
 
nodalities_issue7.pdf
nodalities_issue7.pdfnodalities_issue7.pdf
nodalities_issue7.pdf
 
genpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdfgenpaxospublic-090703114743-phpapp01.pdf
genpaxospublic-090703114743-phpapp01.pdf
 
kademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdfkademlia-1227143905867010-8.pdf
kademlia-1227143905867010-8.pdf
 

  • 2. For example, we have relaxed GFS’s consistency model to 2.2 Interface vastly simplify the file system without imposing an onerous GFS provides a familiar file system interface, though it burden on the applications. We have also introduced an does not implement a standard API such as POSIX. Files are atomic append operation so that multiple clients can append organized hierarchically in directories and identified by path- concurrently to a file without extra synchronization between names. We support the usual operations to create, delete, them. These will be discussed in more details later in the open, close, read, and write files. paper. Moreover, GFS has snapshot and record append opera- Multiple GFS clusters are currently deployed for different tions. Snapshot creates a copy of a file or a directory tree purposes. The largest ones have over 1000 storage nodes, at low cost. Record append allows multiple clients to ap- over 300 TB of disk storage, and are heavily accessed by pend data to the same file concurrently while guaranteeing hundreds of clients on distinct machines on a continuous the atomicity of each individual client’s append. It is use- basis. ful for implementing multi-way merge results and producer- consumer queues that many clients can simultaneously ap- 2. DESIGN OVERVIEW pend to without additional locking. We have found these types of files to be invaluable in building large distributed 2.1 Assumptions applications. Snapshot and record append are discussed fur- In designing a file system for our needs, we have been ther in Sections 3.4 and 3.3 respectively. guided by assumptions that offer both challenges and op- portunities. We alluded to some key observations earlier 2.3 Architecture and now lay out our assumptions in more details. A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown • The system is built from many inexpensive commodity in Figure 1. Each of these is typically a commodity Linux components that often fail. It must constantly monitor machine running a user-level server process. It is easy to run itself and detect, tolerate, and recover promptly from both a chunkserver and a client on the same machine, as long component failures on a routine basis. as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable. • The system stores a modest number of large files. We Files are divided into fixed-size chunks. Each chunk is expect a few million files, each typically 100 MB or identified by an immutable and globally unique 64 bit chunk larger in size. Multi-GB files are the common case handle assigned by the master at the time of chunk creation. and should be managed efficiently. Small files must be Chunkservers store chunks on local disks as Linux files and supported, but we need not optimize for them. read or write chunk data specified by a chunk handle and • The workloads primarily consist of two kinds of reads: byte range. For reliability, each chunk is replicated on multi- large streaming reads and small random reads. In ple chunkservers. By default, we store three replicas, though large streaming reads, individual operations typically users can designate different replication levels for different read hundreds of KBs, more commonly 1 MB or more. regions of the file namespace. Successive operations from the same client often read The master maintains all file system metadata. This in- through a contiguous region of a file. 
A small ran- cludes the namespace, access control information, the map- dom read typically reads a few KBs at some arbitrary ping from files to chunks, and the current locations of chunks. offset. Performance-conscious applications often batch It also controls system-wide activities such as chunk lease and sort their small reads to advance steadily through management, garbage collection of orphaned chunks, and the file rather than go back and forth. chunk migration between chunkservers. The master peri- odically communicates with each chunkserver in HeartBeat • The workloads also have many large, sequential writes messages to give it instructions and collect its state. that append data to files. Typical operation sizes are GFS client code linked into each application implements similar to those for reads. Once written, files are sel- the file system API and communicates with the master and dom modified again. Small writes at arbitrary posi- chunkservers to read or write data on behalf of the applica- tions in a file are supported but do not have to be tion. Clients interact with the master for metadata opera- efficient. tions, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and • The system must efficiently implement well-defined se- therefore need not hook into the Linux vnode layer. mantics for multiple clients that concurrently append Neither the client nor the chunkserver caches file data. to the same file. Our files are often used as producer- Client caches offer little benefit because most applications consumer queues or for many-way merging. Hundreds stream through huge files or have working sets too large of producers, running one per machine, will concur- to be cached. Not having them simplifies the client and rently append to a file. Atomicity with minimal syn- the overall system by eliminating cache coherence issues. chronization overhead is essential. The file may be (Clients do cache metadata, however.) Chunkservers need read later, or a consumer may be reading through the not cache file data because chunks are stored as local files file simultaneously. and so Linux’s buffer cache already keeps frequently accessed • High sustained bandwidth is more important than low data in memory. latency. Most of our target applications place a pre- mium on processing data in bulk at a high rate, while 2.4 Single Master few have stringent response time requirements for an Having a single master vastly simplifies our design and individual read or write. enables the master to make sophisticated chunk placement
  • 3. Application GFS master /foo/bar (file name, chunk index) GFS client File namespace chunk 2ef0 (chunk handle, chunk locations) Legend: Data messages Instructions to chunkserver Control messages Chunkserver state (chunk handle, byte range) GFS chunkserver GFS chunkserver chunk data Linux file system Linux file system Figure 1: GFS Architecture and replication decisions using global knowledge. However, tent TCP connection to the chunkserver over an extended we must minimize its involvement in reads and writes so period of time. Third, it reduces the size of the metadata that it does not become a bottleneck. Clients never read stored on the master. This allows us to keep the metadata and write file data through the master. Instead, a client asks in memory, which in turn brings other advantages that we the master which chunkservers it should contact. It caches will discuss in Section 2.6.1. this information for a limited time and interacts with the On the other hand, a large chunk size, even with lazy space chunkservers directly for many subsequent operations. allocation, has its disadvantages. A small file consists of a Let us explain the interactions for a simple read with refer- small number of chunks, perhaps just one. The chunkservers ence to Figure 1. First, using the fixed chunk size, the client storing those chunks may become hot spots if many clients translates the file name and byte offset specified by the ap- are accessing the same file. In practice, hot spots have not plication into a chunk index within the file. Then, it sends been a major issue because our applications mostly read the master a request containing the file name and chunk large multi-chunk files sequentially. index. The master replies with the corresponding chunk However, hot spots did develop when GFS was first used handle and locations of the replicas. The client caches this by a batch-queue system: an executable was written to GFS information using the file name and chunk index as the key. as a single-chunk file and then started on hundreds of ma- The client then sends a request to one of the replicas, chines at the same time. The few chunkservers storing this most likely the closest one. The request specifies the chunk executable were overloaded by hundreds of simultaneous re- handle and a byte range within that chunk. Further reads quests. We fixed this problem by storing such executables of the same chunk require no more client-master interaction with a higher replication factor and by making the batch- until the cached information expires or the file is reopened. queue system stagger application start times. A potential In fact, the client typically asks for multiple chunks in the long-term solution is to allow clients to read data from other same request and the master can also include the informa- clients in such situations. tion for chunks immediately following those requested. This extra information sidesteps several future client-master in- 2.6 Metadata teractions at practically no extra cost. The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, 2.5 Chunk Size and the locations of each chunk’s replicas. All metadata is Chunk size is one of the key design parameters. We have kept in the master’s memory. The first two types (names- chosen 64 MB, which is much larger than typical file sys- paces and file-to-chunk mapping) are also kept persistent by tem block sizes. 
Each chunk replica is stored as a plain logging mutations to an operation log stored on the mas- Linux file on a chunkserver and is extended only as needed. ter’s local disk and replicated on remote machines. Using Lazy space allocation avoids wasting space due to internal a log allows us to update the master state simply, reliably, fragmentation, perhaps the greatest objection against such and without risking inconsistencies in the event of a master a large chunk size. crash. The master does not store chunk location informa- A large chunk size offers several important advantages. tion persistently. Instead, it asks each chunkserver about its First, it reduces clients’ need to interact with the master chunks at master startup and whenever a chunkserver joins because reads and writes on the same chunk require only the cluster. one initial request to the master for chunk location informa- tion. The reduction is especially significant for our work- 2.6.1 In-Memory Data Structures loads because applications mostly read and write large files Since metadata is stored in memory, master operations are sequentially. Even for small random reads, the client can fast. Furthermore, it is easy and efficient for the master to comfortably cache all the chunk location information for a periodically scan through its entire state in the background. multi-TB working set. Second, since on a large chunk, a This periodic scanning is used to implement chunk garbage client is more likely to perform many operations on a given collection, re-replication in the presence of chunkserver fail- chunk, it can reduce network overhead by keeping a persis- ures, and chunk migration to balance load and disk space
  • 4. usage across chunkservers. Sections 4.3 and 4.4 will discuss Write Record Append these activities further. Serial defined defined success interspersed with One potential concern for this memory-only approach is Concurrent consistent inconsistent that the number of chunks and hence the capacity of the successes but undefined whole system is limited by how much memory the master Failure inconsistent has. This is not a serious limitation in practice. The mas- ter maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many Table 1: File Region State After Mutation chunks, only the last of which may be partially filled. Sim- ilarly, the file namespace data typically requires less then 64 bytes per file because it stores file names compactly us- limited number of log records after that. The checkpoint is ing prefix compression. in a compact B-tree like form that can be directly mapped If necessary to support even larger file systems, the cost into memory and used for namespace lookup without ex- of adding extra memory to the master is a small price to pay tra parsing. This further speeds up recovery and improves for the simplicity, reliability, performance, and flexibility we availability. gain by storing the metadata in memory. Because building a checkpoint can take a while, the mas- ter’s internal state is structured in such a way that a new 2.6.2 Chunk Locations checkpoint can be created without delaying incoming muta- The master does not keep a persistent record of which tions. The master switches to a new log file and creates the chunkservers have a replica of a given chunk. It simply polls new checkpoint in a separate thread. The new checkpoint chunkservers for that information at startup. The master includes all mutations before the switch. It can be created can keep itself up-to-date thereafter because it controls all in a minute or so for a cluster with a few million files. When chunk placement and monitors chunkserver status with reg- completed, it is written to disk both locally and remotely. ular HeartBeat messages. Recovery needs only the latest complete checkpoint and We initially attempted to keep chunk location information subsequent log files. Older checkpoints and log files can persistently at the master, but we decided that it was much be freely deleted, though we keep a few around to guard simpler to request the data from chunkservers at startup, against catastrophes. A failure during checkpointing does and periodically thereafter. This eliminated the problem of not affect correctness because the recovery code detects and keeping the master and chunkservers in sync as chunkservers skips incomplete checkpoints. join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often. 2.7 Consistency Model Another way to understand this design decision is to real- GFS has a relaxed consistency model that supports our ize that a chunkserver has the final word over what chunks highly distributed applications well but remains relatively it does or does not have on its own disks. There is no point simple and efficient to implement. We now discuss GFS’s in trying to maintain a consistent view of this information guarantees and what they mean to applications. 
We also on the master because errors on a chunkserver may cause highlight how GFS maintains these guarantees but leave the chunks to vanish spontaneously (e.g., a disk may go bad details to other parts of the paper. and be disabled) or an operator may rename a chunkserver. 2.7.1 Guarantees by GFS 2.6.3 Operation Log File namespace mutations (e.g., file creation) are atomic. The operation log contains a historical record of critical They are handled exclusively by the master: namespace metadata changes. It is central to GFS. Not only is it the locking guarantees atomicity and correctness (Section 4.1); only persistent record of metadata, but it also serves as a the master’s operation log defines a global total order of logical time line that defines the order of concurrent op- these operations (Section 2.6.3). erations. Files and chunks, as well as their versions (see The state of a file region after a data mutation depends Section 4.5), are all uniquely and eternally identified by the on the type of mutation, whether it succeeds or fails, and logical times at which they were created. whether there are concurrent mutations. Table 1 summa- Since the operation log is critical, we must store it reli- rizes the result. A file region is consistent if all clients will ably and not make changes visible to clients until metadata always see the same data, regardless of which replicas they changes are made persistent. Otherwise, we effectively lose read from. A region is defined after a file data mutation if it the whole file system or recent client operations even if the is consistent and clients will see what the mutation writes in chunks themselves survive. Therefore, we replicate it on its entirety. When a mutation succeeds without interference multiple remote machines and respond to a client opera- from concurrent writers, the affected region is defined (and tion only after flushing the corresponding log record to disk by implication consistent): all clients will always see what both locally and remotely. The master batches several log the mutation has written. Concurrent successful mutations records together before flushing thereby reducing the impact leave the region undefined but consistent: all clients see the of flushing and replication on overall system throughput. same data, but it may not reflect what any one mutation The master recovers its file system state by replaying the has written. Typically, it consists of mingled fragments from operation log. To minimize startup time, we must keep the multiple mutations. A failed mutation makes the region in- log small. The master checkpoints its state whenever the log consistent (hence also undefined): different clients may see grows beyond a certain size so that it can recover by loading different data at different times. We describe below how our the latest checkpoint from local disk and replaying only the applications can distinguish defined regions from undefined
  • 5. regions. The applications do not need to further distinguish file data that is still incomplete from the application’s per- between different kinds of undefined regions. spective. Data mutations may be writes or record appends. A write In the other typical use, many writers concurrently ap- causes data to be written at an application-specified file pend to a file for merged results or as a producer-consumer offset. A record append causes data (the “record”) to be queue. Record append’s append-at-least-once semantics pre- appended atomically at least once even in the presence of serves each writer’s output. Readers deal with the occa- concurrent mutations, but at an offset of GFS’s choosing sional padding and duplicates as follows. Each record pre- (Section 3.3). (In contrast, a “regular” append is merely a pared by the writer contains extra information like check- write at an offset that the client believes to be the current sums so that its validity can be verified. A reader can end of file.) The offset is returned to the client and marks identify and discard extra padding and record fragments the beginning of a defined region that contains the record. using the checksums. If it cannot tolerate the occasional In addition, GFS may insert padding or record duplicates in duplicates (e.g., if they would trigger non-idempotent op- between. They occupy regions considered to be inconsistent erations), it can filter them out using unique identifiers in and are typically dwarfed by the amount of user data. the records, which are often needed anyway to name corre- After a sequence of successful mutations, the mutated file sponding application entities such as web documents. These region is guaranteed to be defined and contain the data writ- functionalities for record I/O (except duplicate removal) are ten by the last mutation. GFS achieves this by (a) applying in library code shared by our applications and applicable to mutations to a chunk in the same order on all its replicas other file interface implementations at Google. With that, (Section 3.1), and (b) using chunk version numbers to detect the same sequence of records, plus rare duplicates, is always any replica that has become stale because it has missed mu- delivered to the record reader. tations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to 3. SYSTEM INTERACTIONS clients asking the master for chunk locations. They are garbage collected at the earliest opportunity. We designed the system to minimize the master’s involve- Since clients cache chunk locations, they may read from a ment in all operations. With that background, we now de- stale replica before that information is refreshed. This win- scribe how the client, master, and chunkservers interact to dow is limited by the cache entry’s timeout and the next implement data mutations, atomic record append, and snap- open of the file, which purges from the cache all chunk in- shot. formation for that file. Moreover, as most of our files are 3.1 Leases and Mutation Order append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader A mutation is an operation that changes the contents or retries and contacts the master, it will immediately get cur- metadata of a chunk such as a write or an append opera- rent chunk locations. tion. Each mutation is performed at all the chunk’s replicas. 
Long after a successful mutation, component failures can We use leases to maintain a consistent mutation order across of course still corrupt or destroy data. GFS identifies failed replicas. The master grants a chunk lease to one of the repli- chunkservers by regular handshakes between master and all cas, which we call the primary. The primary picks a serial chunkservers and detects data corruption by checksumming order for all mutations to the chunk. All replicas follow this (Section 5.2). Once a problem surfaces, the data is restored order when applying mutations. Thus, the global mutation from valid replicas as soon as possible (Section 4.3). A chunk order is defined first by the lease grant order chosen by the is lost irreversibly only if all its replicas are lost before GFS master, and within a lease by the serial numbers assigned can react, typically within minutes. Even in this case, it be- by the primary. comes unavailable, not corrupted: applications receive clear The lease mechanism is designed to minimize manage- errors rather than corrupt data. ment overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mu- tated, the primary can request and typically receive exten- 2.7.2 Implications for Applications sions from the master indefinitely. These extension requests GFS applications can accommodate the relaxed consis- and grants are piggybacked on the HeartBeat messages reg- tency model with a few simple techniques already needed for ularly exchanged between the master and all chunkservers. other purposes: relying on appends rather than overwrites, The master may sometimes try to revoke a lease before it checkpointing, and writing self-validating, self-identifying expires (e.g., when the master wants to disable mutations records. on a file that is being renamed). Even if the master loses Practically all our applications mutate files by appending communication with a primary, it can safely grant a new rather than overwriting. In one typical use, a writer gener- lease to another replica after the old lease expires. ates a file from beginning to end. It atomically renames the In Figure 2, we illustrate this process by following the file to a permanent name after writing all the data, or pe- control flow of a write through these numbered steps. riodically checkpoints how much has been successfully writ- ten. Checkpoints may also include application-level check- 1. The client asks the master which chunkserver holds sums. Readers verify and process only the file region up the current lease for the chunk and the locations of to the last checkpoint, which is known to be in the defined the other replicas. If no one has a lease, the master state. Regardless of consistency and concurrency issues, this grants one to a replica it chooses (not shown). approach has served us well. Appending is far more effi- 2. The master replies with the identity of the primary and cient and more resilient to application failures than random the locations of the other (secondary) replicas. The writes. Checkpointing allows writers to restart incremen- client caches this data for future mutations. It needs tally and keeps readers from processing successfully written to contact the master again only when the primary
  • 6. 4 step 1 file region may end up containing fragments from different Client Master clients, although the replicas will be identical because the in- 2 dividual operations are completed successfully in the same 3 order on all replicas. This leaves the file region in consistent but undefined state as noted in Section 2.7. Secondary Replica A 6 3.2 Data Flow We decouple the flow of data from the flow of control to 7 use the network efficiently. While control flows from the Primary 5 client to the primary and then to all secondaries, data is Replica Legend: pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. Our goals are to fully utilize each Control machine’s network bandwidth, avoid network bottlenecks 6 Secondary Data and high-latency links, and minimize the latency to push Replica B through all the data. To fully utilize each machine’s network bandwidth, the data is pushed linearly along a chain of chunkservers rather Figure 2: Write Control and Data Flow than distributed in some other topology (e.g., tree). Thus, each machine’s full outbound bandwidth is used to trans- fer the data as fast as possible rather than divided among becomes unreachable or replies that it no longer holds multiple recipients. a lease. To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each 3. The client pushes the data to all the replicas. A client machine forwards the data to the “closest” machine in the can do so in any order. Each chunkserver will store network topology that has not received it. Suppose the the data in an internal LRU buffer cache until the client is pushing data to chunkservers S1 through S4. It data is used or aged out. By decoupling the data flow sends the data to the closest chunkserver, say S1. S1 for- from the control flow, we can improve performance by wards it to the closest chunkserver S2 through S4 closest to scheduling the expensive data flow based on the net- S1, say S2. Similarly, S2 forwards it to S3 or S4, whichever work topology regardless of which chunkserver is the is closer to S2, and so on. Our network topology is simple primary. Section 3.2 discusses this further. enough that “distances” can be accurately estimated from 4. Once all the replicas have acknowledged receiving the IP addresses. data, the client sends a write request to the primary. Finally, we minimize latency by pipelining the data trans- The request identifies the data pushed earlier to all of fer over TCP connections. Once a chunkserver receives some the replicas. The primary assigns consecutive serial data, it starts forwarding immediately. Pipelining is espe- numbers to all the mutations it receives, possibly from cially helpful to us because we use a switched network with multiple clients, which provides the necessary serial- full-duplex links. Sending the data immediately does not ization. It applies the mutation to its own local state reduce the receive rate. Without network congestion, the in serial number order. ideal elapsed time for transferring B bytes to R replicas is 5. The primary forwards the write request to all sec- B/T + RL where T is the network throughput and L is la- ondary replicas. Each secondary replica applies mu- tency to transfer bytes between two machines. Our network tations in the same serial number order assigned by links are typically 100 Mbps (T ), and L is far below 1 ms. the primary. Therefore, 1 MB can ideally be distributed in about 80 ms. 6. 
The secondaries all reply to the primary indicating that they have completed the operation. 7. The primary replies to the client. Any errors encoun- 3.3 Atomic Record Appends tered at any of the replicas are reported to the client. GFS provides an atomic append operation called record In case of errors, the write may have succeeded at the append. In a traditional write, the client specifies the off- primary and an arbitrary subset of the secondary repli- set at which data is to be written. Concurrent writes to cas. (If it had failed at the primary, it would not the same region are not serializable: the region may end up have been assigned a serial number and forwarded.) containing data fragments from multiple clients. In a record The client request is considered to have failed, and the append, however, the client specifies only the data. GFS modified region is left in an inconsistent state. Our appends it to the file at least once atomically (i.e., as one client code handles such errors by retrying the failed continuous sequence of bytes) at an offset of GFS’s choosing mutation. It will make a few attempts at steps (3) and returns that offset to the client. This is similar to writ- through (7) before falling back to a retry from the be- ing to a file opened in O APPEND mode in Unix without the ginning of the write. race conditions when multiple writers do so concurrently. Record append is heavily used by our distributed applica- If a write by the application is large or straddles a chunk tions in which many clients on different machines append boundary, GFS client code breaks it down into multiple to the same file concurrently. Clients would need addi- write operations. They all follow the control flow described tional complicated and expensive synchronization, for ex- above but may be interleaved with and overwritten by con- ample through a distributed lock manager, if they do so current operations from other clients. Therefore, the shared with traditional writes. In our workloads, such files often
  • 7. serve as multiple-producer/single-consumer queues or con- handle C’. It then asks each chunkserver that has a current tain merged results from many different clients. replica of C to create a new chunk called C’. By creating Record append is a kind of mutation and follows the con- the new chunk on the same chunkservers as the original, we trol flow in Section 3.1 with only a little extra logic at the ensure that the data can be copied locally, not over the net- primary. The client pushes the data to all replicas of the work (our disks are about three times as fast as our 100 Mb last chunk of the file Then, it sends its request to the pri- Ethernet links). From this point, request handling is no dif- mary. The primary checks to see if appending the record ferent from that for any chunk: the master grants one of the to the current chunk would cause the chunk to exceed the replicas a lease on the new chunk C’ and replies to the client, maximum size (64 MB). If so, it pads the chunk to the max- which can write the chunk normally, not knowing that it has imum size, tells secondaries to do the same, and replies to just been created from an existing chunk. the client indicating that the operation should be retried on the next chunk. (Record append is restricted to be at most one-fourth of the maximum chunk size to keep worst- 4. MASTER OPERATION case fragmentation at an acceptable level.) If the record The master executes all namespace operations. In addi- fits within the maximum size, which is the common case, tion, it manages chunk replicas throughout the system: it the primary appends the data to its replica, tells the secon- makes placement decisions, creates new chunks and hence daries to write the data at the exact offset where it has, and replicas, and coordinates various system-wide activities to finally replies success to the client. keep chunks fully replicated, to balance load across all the If a record append fails at any replica, the client retries the chunkservers, and to reclaim unused storage. We now dis- operation. As a result, replicas of the same chunk may con- cuss each of these topics. tain different data possibly including duplicates of the same record in whole or in part. GFS does not guarantee that all 4.1 Namespace Management and Locking replicas are bytewise identical. It only guarantees that the Many master operations can take a long time: for exam- data is written at least once as an atomic unit. This prop- ple, a snapshot operation has to revoke chunkserver leases on erty follows readily from the simple observation that for the all chunks covered by the snapshot. We do not want to delay operation to report success, the data must have been written other master operations while they are running. Therefore, at the same offset on all replicas of some chunk. Further- we allow multiple operations to be active and use locks over more, after this, all replicas are at least as long as the end regions of the namespace to ensure proper serialization. of record and therefore any future record will be assigned a Unlike many traditional file systems, GFS does not have higher offset or a different chunk even if a different replica a per-directory data structure that lists all the files in that later becomes the primary. In terms of our consistency guar- directory. Nor does it support aliases for the same file or antees, the regions in which successful record append opera- directory (i.e, hard or symbolic links in Unix terms). 
GFS tions have written their data are defined (hence consistent), logically represents its namespace as a lookup table mapping whereas intervening regions are inconsistent (hence unde- full pathnames to metadata. With prefix compression, this fined). Our applications can deal with inconsistent regions table can be efficiently represented in memory. Each node as we discussed in Section 2.7.2. in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock. 3.4 Snapshot Each master operation acquires a set of locks before it The snapshot operation makes a copy of a file or a direc- runs. Typically, if it involves /d1/d2/.../dn/leaf, it will tory tree (the “source”) almost instantaneously, while min- acquire read-locks on the directory names /d1, /d1/d2, ..., imizing any interruptions of ongoing mutations. Our users /d1/d2/.../dn, and either a read lock or a write lock on the use it to quickly create branch copies of huge data sets (and full pathname /d1/d2/.../dn/leaf. Note that leaf may be often copies of those copies, recursively), or to checkpoint a file or directory depending on the operation. the current state before experimenting with changes that We now illustrate how this locking mechanism can prevent can later be committed or rolled back easily. a file /home/user/foo from being created while /home/user Like AFS [5], we use standard copy-on-write techniques to is being snapshotted to /save/user. The snapshot oper- implement snapshots. When the master receives a snapshot ation acquires read locks on /home and /save, and write request, it first revokes any outstanding leases on the chunks locks on /home/user and /save/user. The file creation ac- in the files it is about to snapshot. This ensures that any quires read locks on /home and /home/user, and a write subsequent writes to these chunks will require an interaction lock on /home/user/foo. The two operations will be seri- with the master to find the lease holder. This will give the alized properly because they try to obtain conflicting locks master an opportunity to create a new copy of the chunk on /home/user. File creation does not require a write lock first. on the parent directory because there is no “directory”, or After the leases have been revoked or have expired, the inode-like, data structure to be protected from modification. master logs the operation to disk. It then applies this log The read lock on the name is sufficient to protect the parent record to its in-memory state by duplicating the metadata directory from deletion. for the source file or directory tree. The newly created snap- One nice property of this locking scheme is that it allows shot files point to the same chunks as the source files. concurrent mutations in the same directory. For example, The first time a client wants to write to a chunk C after multiple file creations can be executed concurrently in the the snapshot operation, it sends a request to the master to same directory: each acquires a read lock on the directory find the current lease holder. The master notices that the name and a write lock on the file name. The read lock on reference count for chunk C is greater than one. It defers the directory name suffices to prevent the directory from replying to the client request and instead picks a new chunk being deleted, renamed, or snapshotted. The write locks on
  • 8. file names serialize attempts to create a file with the same The master picks the highest priority chunk and “clones” name twice. it by instructing some chunkserver to copy the chunk data Since the namespace can have many nodes, read-write lock directly from an existing valid replica. The new replica is objects are allocated lazily and deleted once they are not in placed with goals similar to those for creation: equalizing use. Also, locks are acquired in a consistent total order disk space utilization, limiting active clone operations on to prevent deadlock: they are first ordered by level in the any single chunkserver, and spreading replicas across racks. namespace tree and lexicographically within the same level. To keep cloning traffic from overwhelming client traffic, the master limits the numbers of active clone operations both 4.2 Replica Placement for the cluster and for each chunkserver. Additionally, each A GFS cluster is highly distributed at more levels than chunkserver limits the amount of bandwidth it spends on one. It typically has hundreds of chunkservers spread across each clone operation by throttling its read requests to the many machine racks. These chunkservers in turn may be source chunkserver. accessed from hundreds of clients from the same or different Finally, the master rebalances replicas periodically: it ex- racks. Communication between two machines on different amines the current replica distribution and moves replicas racks may cross one or more network switches. Addition- for better disk space and load balancing. Also through this ally, bandwidth into or out of a rack may be less than the process, the master gradually fills up a new chunkserver aggregate bandwidth of all the machines within the rack. rather than instantly swamps it with new chunks and the Multi-level distribution presents a unique challenge to dis- heavy write traffic that comes with them. The placement tribute data for scalability, reliability, and availability. criteria for the new replica are similar to those discussed The chunk replica placement policy serves two purposes: above. In addition, the master must also choose which ex- maximize data reliability and availability, and maximize net- isting replica to remove. In general, it prefers to remove work bandwidth utilization. For both, it is not enough to those on chunkservers with below-average free space so as spread replicas across machines, which only guards against to equalize disk space usage. disk or machine failures and fully utilizes each machine’s net- work bandwidth. We must also spread chunk replicas across 4.4 Garbage Collection racks. This ensures that some replicas of a chunk will sur- After a file is deleted, GFS does not immediately reclaim vive and remain available even if an entire rack is damaged the available physical storage. It does so only lazily during or offline (for example, due to failure of a shared resource regular garbage collection at both the file and chunk levels. like a network switch or power circuit). It also means that We find that this approach makes the system much simpler traffic, especially reads, for a chunk can exploit the aggre- and more reliable. gate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly. 4.4.1 Mechanism When a file is deleted by the application, the master logs 4.3 Creation, Re-replication, Rebalancing the deletion immediately just like other changes. 
However instead of reclaiming resources immediately, the file is just Chunk replicas are created for three reasons: chunk cre- renamed to a hidden name that includes the deletion times- ation, re-replication, and rebalancing. tamp. During the master’s regular scan of the file system When the master creates a chunk, it chooses where to namespace, it removes any such hidden files if they have ex- place the initially empty replicas. It considers several fac- isted for more than three days (the interval is configurable). tors. (1) We want to place new replicas on chunkservers with Until then, the file can still be read under the new, special below-average disk space utilization. Over time this will name and can be undeleted by renaming it back to normal. equalize disk utilization across chunkservers. (2) We want to When the hidden file is removed from the namespace, its in- limit the number of “recent” creations on each chunkserver. memory metadata is erased. This effectively severs its links Although creation itself is cheap, it reliably predicts immi- to all its chunks. nent heavy write traffic because chunks are created when de- In a similar regular scan of the chunk namespace, the manded by writes, and in our append-once-read-many work- master identifies orphaned chunks (i.e., those not reachable load they typically become practically read-only once they from any file) and erases the metadata for those chunks. In have been completely written. (3) As discussed above, we a HeartBeat message regularly exchanged with the master, want to spread replicas of a chunk across racks. each chunkserver reports a subset of the chunks it has, and The master re-replicates a chunk as soon as the number the master replies with the identity of all chunks that are no of available replicas falls below a user-specified goal. This longer present in the master’s metadata. The chunkserver could happen for various reasons: a chunkserver becomes is free to delete its replicas of such chunks. unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. Each chunk that needs to be re-replicated 4.4.2 Discussion is prioritized based on several factors. One is how far it is Although distributed garbage collection is a hard problem from its replication goal. For example, we give higher prior- that demands complicated solutions in the context of pro- ity to a chunk that has lost two replicas than to a chunk that gramming languages, it is quite simple in our case. We can has lost only one. In addition, we prefer to first re-replicate easily identify all references to chunks: they are in the file- chunks for live files as opposed to chunks that belong to re- to-chunk mappings maintained exclusively by the master. cently deleted files (see Section 4.4). Finally, to minimize We can also easily identify all the chunk replicas: they are the impact of failures on running applications, we boost the Linux files under designated directories on each chunkserver. priority of any chunk that is blocking client progress. Any such replica not known to the master is “garbage.”
The garbage collection approach to storage reclamation offers several advantages over eager deletion. First, it is simple and reliable in a large-scale distributed system where component failures are common. Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. Replica deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver’s. Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. Thus, it is done in batches and the cost is amortized. Moreover, it is done only when the master is relatively free. The master can respond more promptly to client requests that demand timely attention. Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.

In our experience, the main disadvantage is that the delay sometimes hinders users’ efforts to fine-tune usage when storage is tight. Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. We address these issues by expediting storage reclamation if a deleted file is explicitly deleted again. We also allow users to apply different replication and reclamation policies to different parts of the namespace. For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file system state.

4.5 Stale Replica Detection
Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.

Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. The master and these replicas all record the new version number in their persistent state. This occurs before any client is notified and therefore before it can start writing to the chunk. If another replica is currently unavailable, its chunk version number will not be advanced. The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks and their associated version numbers. If the master sees a version number greater than the one in its records, the master assumes that it failed when granting the lease and so takes the higher version to be up-to-date.

The master removes stale replicas in its regular garbage collection. Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. The client or the chunkserver verifies the version number when it performs the operation so that it is always accessing up-to-date data.
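A minimal sketch of the version-number bookkeeping described above, using hypothetical structures; it is not GFS code, only the rule that a lease grant bumps the version and that a lower reported version marks a replica stale.

    # Illustrative sketch of chunk version bookkeeping (hypothetical names).

    class MasterChunkRecord:
        def __init__(self):
            self.version = 0
            self.up_to_date_replicas = set()   # chunkservers known to be current

        def grant_lease(self, reachable_replicas):
            # Bump the version before any client can write, and record it on the
            # master and on the replicas that are currently reachable.
            self.version += 1
            self.up_to_date_replicas = set(reachable_replicas)
            return self.version

        def on_chunkserver_report(self, server, reported_version):
            if reported_version < self.version:
                return "stale"        # missed a version bump while down; removed in GC
            if reported_version > self.version:
                # The master must have failed after granting a lease; adopt the
                # higher version as up to date.
                self.version = reported_version
            self.up_to_date_replicas.add(server)
            return "current"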
5. FAULT TOLERANCE AND DIAGNOSIS
One of our greatest challenges in designing the system is dealing with frequent component failures. The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. Component failures can result in an unavailable system or, worse, corrupted data. We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.

5.1 High Availability
Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery
Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. In fact, we do not distinguish between normal and abnormal termination; servers are routinely shut down just by killing the process. Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. Section 6.2.2 reports observed startup times.

5.1.2 Chunk Replication
As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. The default is three. The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing read-only storage requirements. We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.

5.1.3 Master Replication
The master state is replicated for reliability. Its operation log and checkpoints are replicated on multiple machines. A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas. For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally. When it fails, it can restart almost instantly. If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.
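The commit rule for master mutations can be stated compactly. The sketch below assumes hypothetical log objects with an append-and-flush interface; it only illustrates that acknowledgement waits for durability on the local disk and on every master replica, not how GFS actually implements it.

    # Sketch of the commit rule for master mutations (hypothetical log API):
    # a change becomes visible only after its log record is durable locally
    # and on every master replica.

    def commit_mutation(record, local_log, replica_logs):
        local_log.append(record)
        local_log.flush_to_disk()
        for replica in replica_logs:           # all replicas, not a quorum
            replica.append(record)
            replica.flush_to_disk()
        return True    # only now may the mutation be applied and acknowledged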
Moreover, “shadow” masters provide read-only access to the file system even when the primary master is down. They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results. In fact, since file content is read from chunkservers, applications do not observe stale file content. What could be stale within short windows is file metadata, like directory contents or access control information.

To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.

5.2 Data Integrity
Each chunkserver uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, do not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.

A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. Therefore chunkservers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.

Checksumming has little effect on read performance for several reasons. Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.

Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand-new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read.

In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.

During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
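The read-path verification and the append-time optimization can be sketched as follows. We use zlib’s CRC-32 purely as a stand-in 32-bit checksum; the paper does not say which checksum function GFS uses, and the structures here are hypothetical.

    import zlib

    BLOCK = 64 * 1024          # checksum block size

    def verify_read(chunk_data, checksums, offset, length):
        # Verify every 64 KB block that overlaps [offset, offset + length)
        # before returning any data to the requester.
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        for b in range(first, last + 1):
            block = chunk_data[b * BLOCK:(b + 1) * BLOCK]
            if zlib.crc32(block) != checksums[b]:
                raise IOError(f"checksum mismatch in block {b}")  # reported to master
        return chunk_data[offset:offset + length]

    def append_and_update(chunk_data, checksums, payload):
        # Incrementally extend the checksum of the last partial block using
        # CRC-32's streaming form, and add checksums for brand-new blocks.
        old_len = len(chunk_data)
        chunk_data = chunk_data + payload
        pos = old_len
        while pos < len(chunk_data):
            b = pos // BLOCK
            block_end = min((b + 1) * BLOCK, len(chunk_data))
            piece = chunk_data[pos:block_end]
            if b < len(checksums) and pos % BLOCK != 0:
                checksums[b] = zlib.crc32(piece, checksums[b])   # extend partial block
            else:
                checksums.append(zlib.crc32(piece))              # brand-new block
            pos = block_end
        return chunk_data

Note how the append path never rereads earlier bytes of the last block; if that block was already corrupted, the extended checksum simply will not match the stored data on the next read, which is the behavior described above.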
5.3 Diagnostic Tools
Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. Without logs, it is hard to understand transient, non-repeatable interactions between machines. GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. These diagnostic logs can be freely deleted without affecting the correctness of the system. However, we try to keep these logs around as long as space permits.

The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. The logs also serve as traces for load testing and performance analysis.

The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. The most recent events are also kept in memory and available for continuous online monitoring.
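Reconstructing an interaction history from such logs is essentially a matter of grouping records by RPC identifier and ordering them in time. The record format below is invented for illustration; the paper does not describe the actual log schema.

    from collections import defaultdict

    def reconstruct_interactions(log_records):
        # log_records: iterable of dicts such as
        #   {"time": 12.345, "machine": "cs-17", "rpc_id": 42,
        #    "kind": "request" or "reply", "op": "Read"}
        by_rpc = defaultdict(list)
        for rec in log_records:
            by_rpc[rec["rpc_id"]].append(rec)
        exchanges = []
        for rpc_id, recs in by_rpc.items():
            recs.sort(key=lambda r: r["time"])      # request precedes its reply
            exchanges.append((recs[0]["time"], rpc_id, recs))
        exchanges.sort(key=lambda e: (e[0], e[1]))  # order exchanges in time
        return [recs for _, _, recs in exchanges]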
6. MEASUREMENTS
In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.

6.1 Micro-benchmarks
We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunkservers and hundreds of clients.

All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.

[Figure 3: Aggregate Throughputs. Panels: (a) Reads, (b) Writes, (c) Record appends; each panel plots rate in MB/s against the number of clients N. Top curves show theoretical limits imposed by our network topology. Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.]

6.1.1 Reads
N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold cache results.

Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunkserver.

6.1.2 Writes
N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input connection.

The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate.

Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunkserver as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.

Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
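The theoretical limits quoted above follow directly from the test network: a 100 Mbps NIC moves at most 12.5 MB/s, the inter-switch link at most 125 MB/s, and each written byte must enter three of the 16 chunkservers. A small sketch of that arithmetic (our own restatement, not GFS code):

    NIC_MBPS = 12.5          # 100 Mbps full-duplex interface
    LINK_MBPS = 125.0        # 1 Gbps link between the two switches
    CHUNKSERVERS = 16
    REPLICAS = 3

    def read_limit(n_clients):
        # Readers are capped by their own NICs or by the inter-switch link.
        return min(n_clients * NIC_MBPS, LINK_MBPS)

    def write_limit(n_clients):
        # Every byte written is received by 3 of the 16 chunkservers, each with
        # a 12.5 MB/s input connection, so aggregate writes plateau near 67 MB/s.
        return min(n_clients * NIC_MBPS, CHUNKSERVERS * NIC_MBPS / REPLICAS)

    # read_limit(1) == 12.5, read_limit(16) == 125.0
    # write_limit(16) == 66.66...  (the 67 MB/s plateau quoted above)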
6.1.3 Record Appends
Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients.

Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunkservers for another file are busy.

6.2 Real World Clusters
We now examine two clusters in use within Google that are representative of several others like them. Cluster A is used regularly for research and development by over a hundred engineers. A typical task is initiated by a human user and runs up to several hours. It reads through a few MBs to a few TBs of data, transforms or analyzes the data, and writes the results back to the cluster. Cluster B is primarily used for production data processing. The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. In both cases, a single “task” consists of many processes on many machines reading and writing many files simultaneously.

  Cluster                     A        B
  Chunkservers                342      227
  Available disk space        72 TB    180 TB
  Used disk space             55 TB    155 TB
  Number of Files             735 k    737 k
  Number of Dead files        22 k     232 k
  Number of Chunks            992 k    1550 k
  Metadata at chunkservers    13 GB    21 GB
  Metadata at master          48 MB    60 MB

  Table 2: Characteristics of two GFS clusters

6.2.1 Storage
As shown by the first five entries in the table, both clusters have hundreds of chunkservers, support many TBs of disk space, and are fairly but not completely full. “Used space” includes all chunk replicas. Virtually all files are replicated three times. Therefore, the clusters store 18 TB and 52 TB of file data respectively.

The two clusters have similar numbers of files, though B has a larger proportion of dead files, namely files which were deleted or replaced by a new version but whose storage has not yet been reclaimed. It also has more chunks because its files tend to be larger.

6.2.2 Metadata
The chunkservers in aggregate store tens of GBs of metadata, mostly the checksums for 64 KB blocks of user data. The only other metadata kept at the chunkservers is the chunk version number discussed in Section 4.5.

The metadata kept at the master is much smaller, only tens of MBs, or about 100 bytes per file on average. This agrees with our assumption that the size of the master’s memory does not limit the system’s capacity in practice. Most of the per-file metadata is the file names stored in a prefix-compressed form. Other metadata includes file ownership and permissions, mapping from files to chunks, and each chunk’s current version. In addition, for each chunk we store the current replica locations and a reference count for implementing copy-on-write.

Each individual server, both chunkservers and the master, has only 50 to 100 MB of metadata. Therefore recovery is fast: it takes only a few seconds to read this metadata from disk before the server is able to answer queries. However, the master is somewhat hobbled for a period – typically 30 to 60 seconds – until it has fetched chunk location information from all chunkservers.

6.2.3 Read and Write Rates
Table 3 shows read and write rates for various time periods. Both clusters had been up for about one week when these measurements were taken. (The clusters had been restarted recently to upgrade to a new version of GFS.)

  Cluster                        A           B
  Read rate (last minute)        583 MB/s    380 MB/s
  Read rate (last hour)          562 MB/s    384 MB/s
  Read rate (since restart)      589 MB/s    49 MB/s
  Write rate (last minute)       1 MB/s      101 MB/s
  Write rate (last hour)         2 MB/s      117 MB/s
  Write rate (since restart)     25 MB/s     13 MB/s
  Master ops (last minute)       325 Ops/s   533 Ops/s
  Master ops (last hour)         381 Ops/s   518 Ops/s
  Master ops (since restart)     202 Ops/s   347 Ops/s

  Table 3: Performance Metrics for Two GFS Clusters

The average write rate was less than 30 MB/s since the restart. When we took these measurements, B was in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.
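The factor of three appears in both the storage and the network figures above; a quick back-of-the-envelope restatement of our own, not text from the paper:

    # With 3-way replication, raw disk usage and network write load are roughly
    # three times the logical file data.

    REPLICAS = 3

    def file_data_from_used(used_tb):
        return used_tb / REPLICAS

    def network_load_from_writes(write_mb_per_s):
        return write_mb_per_s * REPLICAS

    # file_data_from_used(55)  -> ~18 TB (cluster A)
    # file_data_from_used(155) -> ~52 TB (cluster B)
    # network_load_from_writes(100) -> 300 MB/s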
The read rates were much higher than the write rates. The total workload consists of more reads than writes as we have assumed. Both clusters were in the middle of heavy read activity. In particular, A had been sustaining a read rate of 580 MB/s for the preceding week. Its network configuration can support 750 MB/s, so it was using its resources efficiently. Cluster B can support peak read rates of 1300 MB/s, but its applications were using just 380 MB/s.

6.2.4 Master Load
Table 3 also shows that the rate of operations sent to the master was around 200 to 500 operations per second. The master can easily keep up with this rate, and therefore is not a bottleneck for these workloads.

In an earlier version of GFS, the master was occasionally a bottleneck for some workloads. It spent most of its time sequentially scanning through large directories (which contained hundreds of thousands of files) looking for particular files. We have since changed the master data structures to allow efficient binary searches through the namespace. It can now easily support many thousands of file accesses per second. If necessary, we could speed it up further by placing name lookup caches in front of the namespace data structures.

6.2.5 Recovery Time
After a chunkserver fails, some chunks will become under-replicated and must be cloned to restore their replication levels. The time it takes to restore all such chunks depends on the amount of resources. In one experiment, we killed a single chunkserver in cluster B. The chunkserver had about 15,000 chunks containing 600 GB of data. To limit the impact on running applications and provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers) where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.

In another experiment, we killed two chunkservers each with roughly 16,000 chunks and 660 GB of data. This double failure reduced 266 chunks to having a single replica. These 266 chunks were cloned at a higher priority, and were all restored to at least 2x replication within 2 minutes, thus putting the cluster in a state where it could tolerate another chunkserver failure without data loss.
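As a quick sanity check of these numbers (our own arithmetic, using only the figures quoted above):

    # Upper bound from the cloning limits vs. the observed effective rate.
    clone_cap_mb_s = 91 * 6.25                    # 568.75 MB/s ceiling
    observed_mb_s = 600 * 1024 / (23.2 * 60)      # ~441 MB/s for 600 GB in 23.2 min

    # The observed ~441 MB/s is consistent with the 440 MB/s reported above and
    # sits comfortably under the 91-clone, 6.25 MB/s-per-clone ceiling.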
6.3 Workload Breakdown
In this section, we present a detailed breakdown of the workloads on two GFS clusters comparable but not identical to those in Section 6.2. Cluster X is for research and development while cluster Y is for production data processing.

6.3.1 Methodology and Caveats
These results include only client originated requests so that they reflect the workload generated by our applications for the file system as a whole. They do not include inter-server requests to carry out client requests or internal background activities, such as forwarded writes or rebalancing.

Statistics on I/O operations are based on information heuristically reconstructed from actual RPC requests logged by GFS servers. For example, GFS client code may break a read into multiple RPCs to increase parallelism, from which we infer the original read. Since our access patterns are highly stylized, we expect any error to be in the noise. Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart thousands of running clients to do so and cumbersome to collect the results from as many machines.

One should be careful not to overly generalize from our workload. Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. Such mutual influence may also exist between general applications