The Google File System
Based on
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
2003
I am Sergio Shevchenko
You can find me on Telegram at @sshevchenko
Hello!
2
Agenda
◉ Why GFS?
◉ System Design
◉ System Interactions
◉ Master operation
◉ Fault tolerance and diagnosis
◉ Measurements
◉ Experiences and Conclusions
3
Why GFS?
What prompted Google to build GFS
1
4
Historical context
◉ August 1996: Larry Page and Sergey Brin launch Google
◉ ...
◉ December 1998: Index of about 60 million pages
◉ ...
◉ February 2003: Index 3+ billion pages
◉ ...
5
Historical context
6
Historical context
7
System design
2
8
System design: Assumptions
◉ COTS x86 servers
◉ Few million files, typically 100MB or larger
◉ Large streaming reads, small random reads
◉ Sequential writes, append data
◉ Multiple concurrent clients that append to the same file
◉ High sustained bandwidth more important than low latency
9
System design: Interface
◉ Not POSIX-compliant
◉ Usual file operations:
open, close, create, delete, write, read
◉ Enhanced file operations:
snapshot, record append
10
System design: Architecture
11
System design: Architecture
◉ Single Master
○ Design simplification (its state is replicated for recovery)
○ Global knowledge permits sophisticated chunk placement and
replication decisions
○ No data-plane bottleneck: GFS clients communicate with
■ the master only for metadata (cached with TTL expiration)
■ chunkservers directly for chunk reads/writes
◉ Chunk size
○ 64 MB, each chunk stored as a plain Linux file (see the offset-to-chunk sketch below)
○ Space extended only as needed (lazy allocation)
12
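Because chunks are a fixed 64 MB, a client can turn any (file name, byte offset) into a (file name, chunk index) pair on its own before contacting the master. A minimal Python sketch of that translation follows; the function name and return shape are illustrative, not the real GFS client library:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunk size

def locate(file_name: str, byte_offset: int) -> tuple[str, int, int]:
    """Translate a byte offset into (file, chunk index, offset within chunk)."""
    return file_name, byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Example: byte 200,000,000 of /logs/web.log lands in chunk index 2.
print(locate("/logs/web.log", 200_000_000))   # -> ('/logs/web.log', 2, 65782272)
```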
System design: Metadata
◉ 3 types of metadata (see the sketch after this slide)
○ File and chunk namespaces
○ Mappings from files to chunks
○ Locations of each chunk’s replicas
◉ All metadata kept in the master’s RAM, making periodic scans cheap for:
○ Re-replication
○ Chunk migration
○ GC
◉ Operation log
○ Defines a serial timeline for concurrent operations
○ Chunk locations are not persisted (the master polls chunkservers at startup)
○ Stored on local disk and replicated on remote machines
13
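A rough Python sketch of the three in-memory tables listed on this slide. All names and types are illustrative assumptions (GFS itself is written in C++); only the chunk replica locations are left unpersisted, as in the paper:

```python
from dataclasses import dataclass, field

# Illustrative shapes for the master's in-memory tables (not the real data structures).

@dataclass
class ChunkInfo:
    version: int = 1                                     # detects stale replicas
    replicas: list[str] = field(default_factory=list)    # chunkserver addresses
                                                         # (not persisted; polled at startup)

@dataclass
class MasterMetadata:
    namespace: set[str] = field(default_factory=set)                    # file and chunk namespaces
    file_to_chunks: dict[str, list[int]] = field(default_factory=dict)  # path -> chunk handles
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)          # handle -> replica info

meta = MasterMetadata()
meta.namespace.add("/logs/web.log")
meta.file_to_chunks["/logs/web.log"] = [42]
meta.chunks[42] = ChunkInfo(version=3, replicas=["cs-1", "cs-7", "cs-9"])
```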
Consistency model
◉ File namespace mutations are atomic
○ A file region is consistent if all clients
will always see the same data,
regardless of which replicas they
read from.
○ A region is defined after a file data
mutation if it is consistent and clients
will see what the mutation writes in
its entirety.
◉ GFS does not guard against duplicate records (applications must tolerate them)
14
15
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer
scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide
more than two of the three guarantees: consistency, availability, and partition tolerance.
System interactions
2
16
Leases and mutation order
◉ Each mutation (write or append) is performed at all the chunk’s replicas
◉ Leases used to maintain a consistent mutation order across replicas
○ Master grants a chunk lease for 60 seconds to one of the replicas: the
primary replica (sketch below)
○ The primary replica picks a serial order for all mutations to the chunk
○ All replicas follow this order when applying mutations
17
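A toy sketch of the master-side lease bookkeeping described above: a 60-second lease is granted to one replica (the primary) and simply re-confirmed while still valid. Class and method names are made up for illustration:

```python
import time

LEASE_SECONDS = 60          # the master grants a chunk lease for 60 seconds

class LeaseTable:
    """Toy master-side lease bookkeeping (illustrative names, not GFS code)."""

    def __init__(self):
        self._leases = {}   # chunk handle -> (primary address, expiry time)

    def grant(self, chunk: int, replicas: list[str]) -> str:
        primary, expiry = self._leases.get(chunk, (None, 0.0))
        if primary in replicas and time.time() < expiry:
            return primary                       # current lease is still valid
        primary = replicas[0]                    # pick a replica as primary
        self._leases[chunk] = (primary, time.time() + LEASE_SECONDS)
        return primary

table = LeaseTable()
print(table.grant(42, ["cs-1", "cs-7", "cs-9"]))   # -> cs-1
```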
Leases and mutation order
18
1. The client asks the master for the primary and secondary replicas of a chunk
2. The master replies with the replica identities (the client caches this info with
a timeout)
3. The client pushes the data to all replicas in a pipelined fashion; each chunkserver
stores the data in an internal LRU buffer cache until it is used or aged out
4. Once all replicas have acknowledged receiving the data, the client sends a
write request to the primary replica
5. The primary forwards the write request to all secondary replicas, which apply
mutations in the same serial-number order assigned by the primary
6. The secondary replicas acknowledge to the primary that they have completed
the operation
7. The primary replica replies to the client. Any errors encountered at any of
the replicas are reported to the client. (A toy sketch of steps 3-7 follows below.)
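The single-process Python sketch below mirrors steps 3-7 of the write path above: data is pushed to every replica first, then the primary assigns serial numbers and the secondaries apply mutations in that order. All class and method names are illustrative, not the actual GFS interfaces:

```python
# Toy, single-process model of steps 3-7 above (illustrative names only).

class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.buffer = {}      # data id -> bytes (an LRU cache in real GFS)
        self.applied = []     # (serial number, data) in apply order

    def push(self, data_id, data):          # step 3: data flow
        self.buffer[data_id] = data

    def apply(self, serial, data_id):       # steps 5-6: control flow
        self.applied.append((serial, self.buffer.pop(data_id)))
        return "ok"

class Primary(Chunkserver):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):               # step 4: client write request
        self.next_serial += 1               # primary picks the serial order
        serial = self.next_serial
        self.apply(serial, data_id)
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        return "ok" if all(a == "ok" for a in acks) else "error"   # step 7

secondaries = [Chunkserver("cs-2"), Chunkserver("cs-3")]
primary = Primary("cs-1", secondaries)
for replica in (primary, *secondaries):     # step 3: push data everywhere
    replica.push("d1", b"some record")
print(primary.write("d1"))                  # -> ok
```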
System interactions:
Dataflow
◉ Data flow is decoupled from control flow to use the network efficiently
◉ Data is pushed linearly along a carefully picked chain of chunkservers, pipelined
over TCP connections
◉ Each machine forwards the data to the closest machine in the network topology
that has not yet received it (sketch below)
19
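A small sketch of the "forward to the closest machine that has not received the data yet" rule. The distance function here is a made-up stand-in; GFS estimates distances from IP addresses and the rack topology:

```python
# Order the replicas into a linear push chain, always hopping from the
# current holder to the nearest machine that still needs the data.

def forwarding_chain(origin: str, replicas: list[str], dist) -> list[str]:
    chain, current, pending = [], origin, list(replicas)
    while pending:
        nxt = min(pending, key=lambda r: dist(current, r))
        chain.append(nxt)
        pending.remove(nxt)
        current = nxt
    return chain

# Toy topology: machines sharing a rack prefix are "close".
dist = lambda a, b: 1 if a[0] == b[0] else 10
print(forwarding_chain("client", ["a1", "b1", "a2"], dist))   # -> ['a1', 'a2', 'b1']
```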
Atomic record appends
◉ GFS picks the offset, not the client
◉ Maximum size of an append record is 16 MB (one quarter of a chunk)
◉ The client pushes the data to all replicas of the last chunk of the file (as for a write)
◉ The client then sends its request to the primary replica
○ The primary checks whether the record would make the chunk exceed the
64 MB maximum; if so, it pads the chunk and the client retries on the next chunk (sketch below)
◉ Record appends are heavily used at Google for multiple-producer/
single-consumer queues, e.g. in MapReduce jobs
20
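A sketch of the primary's decision for record append under the limits above: if the record would overflow the 64 MB chunk, the chunk is padded and the client retries on a new chunk; otherwise GFS picks the offset. The function shape and return values are illustrative only:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunk
MAX_RECORD = 16 * 1024 * 1024   # append records are capped at 16 MB

def record_append(chunk_used: int, record: bytes):
    """Primary-side decision: returns (action, offset, new_used)."""
    assert len(record) <= MAX_RECORD
    if chunk_used + len(record) > CHUNK_SIZE:
        # Pad the current chunk; the client retries on a freshly allocated chunk.
        return "retry_on_new_chunk", None, CHUNK_SIZE
    return "ok", chunk_used, chunk_used + len(record)   # GFS picks the offset

print(record_append(60 * 1024 * 1024, b"x" * (8 * 1024 * 1024)))
# -> ('retry_on_new_chunk', None, 67108864)
```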
Snapshot
◉ Makes an instantaneous copy of a file or a directory tree using standard
copy-on-write techniques
○ The master receives a snapshot request
○ The master revokes any outstanding leases on the chunks of the files to
snapshot
○ The master waits for the leases to be revoked or to expire
○ The master logs the snapshot operation to disk
○ The master duplicates the metadata for the source file or directory tree: the newly
created snapshot files point to the same chunks as the source files
○ When a client first writes to a chunk created before the snapshot, each replica
clones the chunk locally (sketch below)
21
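A minimal copy-on-write sketch of the snapshot behaviour described above: the snapshot only copies metadata and bumps reference counts, and a shared chunk is cloned lazily on the first post-snapshot write. The data structures and names are assumptions for illustration:

```python
# Snapshot = metadata copy + reference counts; chunks are cloned lazily.
file_to_chunks = {"/db/table": [1, 2]}    # path -> chunk handles
refcount = {1: 1, 2: 1}                   # chunk handle -> reference count
next_handle = 3

def snapshot(src: str, dst: str):
    file_to_chunks[dst] = list(file_to_chunks[src])   # copy metadata only
    for handle in file_to_chunks[dst]:
        refcount[handle] += 1

def write(path: str, index: int) -> int:
    """First write after a snapshot clones the shared chunk (copy-on-write)."""
    global next_handle
    handle = file_to_chunks[path][index]
    if refcount[handle] > 1:              # chunk is shared with a snapshot
        refcount[handle] -= 1
        handle, next_handle = next_handle, next_handle + 1
        refcount[handle] = 1
        file_to_chunks[path][index] = handle   # replicas clone the data locally
    return handle

snapshot("/db/table", "/db/table.snap")
print(write("/db/table", 0))              # -> 3, a freshly cloned chunk handle
```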
Master operation
3
22
Replica placement
◉ Serves two purposes:
○ maximize data reliability and availability
○ maximize network bandwidth utilization
◉ Spread chunks across different racks
○ Ensures chunks survive a whole-rack failure (e.g. a switch failure)
○ Aggregates the bandwidth of different racks
○ Write traffic has to flow through multiple racks
■ A tradeoff made willingly
23
Creation, Re-replication and Balancing
◉ Creation (see the placement sketch after this slide)
○ Place new replicas on chunkservers with below-average disk space
utilization
○ Limit the number of "recent" creations on each chunkserver
○ Spread replicas of a chunk across racks
◉ Re-replication
○ The master re-replicates a chunk as soon as the number of available replicas
falls below a user-specified goal
○ Chunks are prioritized, e.g. by how far they are from their replication goal
and whether they are blocking client progress
○ Re-replication placement follows the same rules as "creation"
○ The master limits the number of active clone operations, both cluster-wide
and per chunkserver
24
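One possible reading of the creation-placement heuristics as code: prefer chunkservers with below-average disk utilization and few recent creations, and spread the replicas across racks. The scoring and field names are illustrative, not GFS's exact policy:

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    rack: str
    disk_used: float          # fraction of disk space in use, 0..1
    recent_creations: int     # chunks created here "recently"

def place(servers: list[Server], copies: int = 3) -> list[Server]:
    """Greedy, rack-aware pick; may return fewer than `copies` servers
    if there are not enough distinct racks below average utilization."""
    avg = sum(s.disk_used for s in servers) / len(servers)
    chosen, racks = [], set()
    for s in sorted(servers, key=lambda s: (s.disk_used, s.recent_creations)):
        if len(chosen) == copies:
            break
        if s.disk_used > avg or s.rack in racks:
            continue                    # below-average disks only, one per rack
        chosen.append(s)
        racks.add(s.rack)
    return chosen

servers = [Server("cs1", "rackA", 0.60, 1), Server("cs2", "rackA", 0.20, 0),
           Server("cs3", "rackB", 0.25, 2), Server("cs4", "rackC", 0.30, 0),
           Server("cs5", "rackC", 0.90, 3)]
print([s.name for s in place(servers)])   # -> ['cs2', 'cs3', 'cs4']
```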
Creation, Re-replication and Balancing
◉ Balancing
○ The master rebalances replicas periodically for better disk space and load
balancing
○ The master gradually fills up a new chunkserver rather than instantly swamping it
with new chunks (and the heavy write traffic that comes with them!)
○ Rebalancing placement follows the same rules as “creation”
25
Garbage collection and Stale replica detection
◉ Files are not deleted instantly but lazily (sketch below)
○ The master logs deletions
○ The file is renamed to a hidden name that includes the deletion timestamp
○ After 3 days, hidden files are removed during regular namespace scans
○ The same procedure handles orphaned chunks
○ Chunkservers report their chunks to the master in heartbeat messages; chunks
not present in the master’s metadata can be freely removed by the chunkserver
◉ Stale replica detection
○ For each chunk the master maintains a version number; replicas with an older
version are stale and are garbage collected
26
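A toy sketch of lazy deletion: the file is renamed to a hidden name carrying the deletion time, and a periodic scan reclaims it after roughly three days. The hidden-name format and data structures are assumptions; the paper does not specify them:

```python
import time

HIDDEN_TTL = 3 * 24 * 3600   # hidden files are reclaimed after ~3 days

deleted = {}                 # hidden name -> deletion timestamp

def delete(namespace: dict, path: str):
    """Lazy delete: rename to a hidden name carrying the deletion time."""
    hidden = f"{path}.__deleted__{int(time.time())}"   # illustrative naming
    namespace[hidden] = namespace.pop(path)
    deleted[hidden] = time.time()

def gc_scan(namespace: dict, now: float):
    """Periodic scan: drop hidden files older than the TTL; their chunks
    become orphans and are later dropped via chunkserver heartbeats."""
    for hidden, ts in list(deleted.items()):
        if now - ts > HIDDEN_TTL:
            namespace.pop(hidden, None)
            deleted.pop(hidden)

ns = {"/logs/old": [7, 8]}
delete(ns, "/logs/old")
gc_scan(ns, now=time.time() + HIDDEN_TTL + 1)
print(ns)   # -> {}
```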
Fault tolerance and diagnosis
HA and Data Integrity in GFS
4
27
High availability in GFS
◉ Fast recovery
◉ Chunk replication
◉ Master replication
28
Data integrity
Each chunkserver uses checksumming to detect corruption of stored chunks: for each
64 KB block it keeps a 32-bit checksum (sketch below).
◉ Reads: before returning a block, the chunkserver verifies its checksum; on a
mismatch it returns an error to the requester and reports the corruption to the master
◉ Writes: the chunkserver verifies the checksums of the first and last blocks that
overlap the write range before performing the write, then computes and records
the new checksums
◉ Appends: no verification of the last block; the checksum of the last partial
block is just updated incrementally
29
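A sketch of that scheme: one 32-bit checksum per 64 KB block, verified for every block a read overlaps. CRC-32 is an assumption here; the paper only says the checksum is 32 bits:

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: one checksum per 64 KB block

def checksums(chunk: bytes) -> list[int]:
    """One 32-bit checksum per block (CRC-32 used here as an assumption)."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, sums: list[int], offset: int, length: int) -> bytes:
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):        # verify every block the read overlaps
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"block {b} corrupt: return error, report to master")
    return chunk[offset:offset + length]

data = bytes(200_000)
sums = checksums(data)
print(len(verified_read(data, sums, 70_000, 10_000)))   # -> 10000
```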
Measurements
Micro and Macro benchmarks
5
30
Micro benchmarks
1 master, 2 master replicas, 16
chunkservers with 16 clients.
All the machines are configured
with dual 1.4 GHz PIII processors,
2 GB of memory, two 80 GB
5400 rpm disks, and a 100 Mbps
full-duplex Ethernet connection
to an HP 2524 switch.
31
Micro benchmarks - READS
32
Each client reads a randomly selected 4 MB region 256 times (= 1 GB of data) from a 320 GB file set.
The chunkservers taken together have only 32 GB of memory, so at most a 10% hit rate in the
Linux buffer cache is expected.
75-80% efficiency
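A back-of-the-envelope for the 75-80% figure, using the setup numbers from the paper (100 Mbps client NICs, a 1 Gbps link between the two switches, and the ~94 MB/s aggregate read rate reported for 16 readers):

```python
# Read-throughput limits from the benchmark setup.
per_client_limit = 100e6 / 8 / 1e6     # 12.5 MB/s per client NIC
aggregate_limit = 1e9 / 8 / 1e6        # 125.0 MB/s shared by the 16 clients
observed = 94                          # MB/s for 16 readers, as reported in the paper
print(per_client_limit, aggregate_limit, f"{observed / aggregate_limit:.0%}")
# -> 12.5 125.0 75%
```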
Micro benchmarks - WRITES
33
Each client writes 1 GB of data to a new file in a series of 1 MB writes.
50% efficiency
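The write limit is lower because each byte must reach 3 of the 16 chunkservers, each of which can only ingest 12.5 MB/s; the ~35 MB/s aggregate reported in the paper is roughly half of that limit:

```python
# Each byte of a write goes to 3 of the 16 chunkservers, and each chunkserver
# can only take in 12.5 MB/s over its 100 Mbps NIC.
write_limit = 16 * 12.5 / 3            # ~67 MB/s aggregate write ceiling
observed = 35                          # MB/s for 16 writers, as reported in the paper
print(f"limit ~ {write_limit:.0f} MB/s; observed {observed} MB/s is about half")
```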
Micro benchmarks - APPENDS
34
Each client appends simultaneously to a single file.
Performance is limited by the network bandwidth of the 3 chunkservers that store the last chunk of the
file, independent of the number of clients.
35-40% efficiency
Real World clusters
35
Cluster A is used regularly for R&D by 100+ engineers
Cluster B is used for production data processing
Fault tolerance
◉ Kill 1 chunkserver in cluster B
○ 15,000 chunks on it (= 600 GB of data)
○ The cluster limited cloning to 91 concurrent operations (= 40% of its 227 chunkservers)
○ Each clone operation consumes at most 50 Mbps
■ 15,000 chunks restored in 23.2 minutes, an effective replication rate of
~440 MB/s (see the arithmetic below)
◉ Kill 2 chunkservers in cluster B
○ 16,000 chunks on each (= 660 GB of data)
○ This double failure reduced 266 chunks to having a single replica
■ These 266 chunks were cloned at higher priority and all restored to at
least 2x replication within 2 minutes
36
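A quick sanity check of the quoted restore rate (600 GB re-replicated in 23.2 minutes):

```python
# Restore rate: 600 GB of chunk data re-replicated in 23.2 minutes.
data_mb = 600 * 1024                   # 600 GB expressed in MB
print(data_mb / (23.2 * 60))           # ~441 MB/s, matching the ~440 MB/s quoted
```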
Experiences and Conclusions
6
37
Experiences
◉ Operational issues
○ The original design targeted a backend file system for production systems, not
R&D workloads (lack of support for permissions and quotas…)
◉ Technical issues
○ IDE protocol version mismatches that corrupted data silently: this problem
motivated the use of checksums
○ fsync() in Linux 2.2 kernels had a cost proportional to file size, which hurt the
large operation log files
38
Conclusions
◉ GFS is:
○ Supports large-scale data processing workloads on COTS x86 servers
○ Treats component failures as the norm rather than the exception
○ Is optimized for huge files that are mostly appended to and then read sequentially
○ Achieves fault tolerance by constant monitoring, replication of crucial data, and fast,
automatic recovery (plus checksums to detect data corruption)
○ Delivers high aggregate throughput to many concurrent readers and writers
39
Conclusions
◉ GFS now:
○ Source code was never released
○ Superseded by Colossus in the 2010s
■ Smaller chunks (1 MB)
■ Eliminated the master as a single point of failure
■ Built for “real-time” workloads, not only big batch operations
40
Any questions ?
You can find me at
◉ @sshevchenko
◉ s.shevchenko@studenti.unisa.it
Thanks!
41
