The Google File System
Based on
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
2003
I am Sergio Shevchenko
You can find me on Telegram at @sshevchenko
Hello!
2
Agenda
◉ Why GFS?
◉ System Design
◉ System Interactions
◉ Master operation
◉ Fault tolerance and diagnosis
◉ Measurements
◉ Experiences and Conclusions
3
Why GFS?
What prompted Google to build GFS
1
4
Historical context
◉ August 1996: Larry Page and Sergey Brin launch Google
◉ ...
◉ December 1998: Index of about 60 million pages
◉ ...
◉ February 2003: Index 3+ billion pages
◉ ...
5
Historical context
6
Historical context
7
System design
2
8
System design: Assumptions
◉ COTS x86 servers
◉ Few million files, typically 100MB or larger
◉ Large streaming reads, small random reads
◉ Sequential writes, append data
◉ Multiple concurrent clients that append to the same file
◉ High sustained bandwidth more important than low latency
9
System design: Interface
◉ Not POSIX-compliant
◉ Usual file operations:
open, close, create, delete, write, read
◉ Enhanced file operations:
snapshot, record append
10
System design: Architecture
11
System design: Architecture
◉ Single Master
○ Design simplification (its state is replicated for recovery)
○ Global knowledge permits sophisticated chunk placement and
replication decisions
○ No data-plane bottleneck: GFS clients communicate with
■ the master only for metadata (cached with TTL expiration)
■ chunkservers directly for chunk reads/writes
◉ Chunk size
○ 64 MB, each chunk stored as a plain Linux file (see the offset-to-chunk sketch below)
○ Space extended only as needed (lazy allocation)
12
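Because chunks are a fixed 64 MB, a client can turn any (file name, byte offset) into a (file name, chunk index) pair on its own before contacting the master. A minimal Python sketch of that translation follows; the function name and return shape are illustrative, not the real GFS client library:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunk size

def locate(file_name: str, byte_offset: int) -> tuple[str, int, int]:
    """Translate a byte offset into (file, chunk index, offset within chunk)."""
    return file_name, byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Example: byte 200,000,000 of /logs/web.log lands in chunk index 2.
print(locate("/logs/web.log", 200_000_000))   # -> ('/logs/web.log', 2, 65782272)
```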
System design: Metadata
◉ 3 types of metadata (see the sketch after this slide)
○ File and chunk namespaces
○ Mappings from files to chunks
○ Locations of each chunk’s replicas
◉ All metadata kept in the master’s RAM, making periodic scans cheap for:
○ Re-replication
○ Chunk migration
○ GC
◉ Operation log
○ Defines a serial timeline for concurrent operations
○ Chunk locations are not persisted (the master polls chunkservers at startup)
○ Stored on local disk and replicated on remote machines
13
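A rough Python sketch of the three in-memory tables listed on this slide. All names and types are illustrative assumptions (GFS itself is written in C++); only the chunk replica locations are left unpersisted, as in the paper:

```python
from dataclasses import dataclass, field

# Illustrative shapes for the master's in-memory tables (not the real data structures).

@dataclass
class ChunkInfo:
    version: int = 1                                     # detects stale replicas
    replicas: list[str] = field(default_factory=list)    # chunkserver addresses
                                                         # (not persisted; polled at startup)

@dataclass
class MasterMetadata:
    namespace: set[str] = field(default_factory=set)                    # file and chunk namespaces
    file_to_chunks: dict[str, list[int]] = field(default_factory=dict)  # path -> chunk handles
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)          # handle -> replica info

meta = MasterMetadata()
meta.namespace.add("/logs/web.log")
meta.file_to_chunks["/logs/web.log"] = [42]
meta.chunks[42] = ChunkInfo(version=3, replicas=["cs-1", "cs-7", "cs-9"])
```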
Consistency model
◉ File namespace mutations are atomic
○ A file region is consistent if all clients
will always see the same data,
regardless of which replicas they
read from.
○ A region is defined after a file data
mutation if it is consistent and clients
will see what the mutation writes in
its entirety.
◉ GFS does not guard against duplicate records (applications must tolerate them)
14
15
In theoretical computer science, the CAP theorem, also named Brewer's theorem after computer
scientist Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide
more than two of the three guarantees: consistency, availability, and partition tolerance.
System interactions
2
16
Leases and mutation order
◉ Each mutation (write or append) is performed at all the chunk’s replicas
◉ Leases used to maintain a consistent mutation order across replicas
○ Master grants a chunk lease for 60 seconds to one of the replicas: the
primary replica (sketch below)
○ The primary replica picks a serial order for all mutations to the chunk
○ All replicas follow this order when applying mutations
17
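A toy sketch of the master-side lease bookkeeping described above: a 60-second lease is granted to one replica (the primary) and simply re-confirmed while still valid. Class and method names are made up for illustration:

```python
import time

LEASE_SECONDS = 60          # the master grants a chunk lease for 60 seconds

class LeaseTable:
    """Toy master-side lease bookkeeping (illustrative names, not GFS code)."""

    def __init__(self):
        self._leases = {}   # chunk handle -> (primary address, expiry time)

    def grant(self, chunk: int, replicas: list[str]) -> str:
        primary, expiry = self._leases.get(chunk, (None, 0.0))
        if primary in replicas and time.time() < expiry:
            return primary                       # current lease is still valid
        primary = replicas[0]                    # pick a replica as primary
        self._leases[chunk] = (primary, time.time() + LEASE_SECONDS)
        return primary

table = LeaseTable()
print(table.grant(42, ["cs-1", "cs-7", "cs-9"]))   # -> cs-1
```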
Leases and mutation order
18
1. The client asks the master for the primary and secondary replicas of a chunk
2. The master replies with the replica identities (the client caches this info with
a timeout)
3. The client pushes the data to all replicas in a pipelined fashion; each chunkserver
stores the data in an internal LRU buffer cache until it is used or aged out
4. Once all replicas have acknowledged receiving the data, the client sends a
write request to the primary replica
5. The primary forwards the write request to all secondary replicas, which apply
mutations in the same serial-number order assigned by the primary
6. The secondary replicas acknowledge to the primary that they have completed
the operation
7. The primary replica replies to the client. Any errors encountered at any of
the replicas are reported to the client. (A toy sketch of steps 3-7 follows below.)
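The single-process Python sketch below mirrors steps 3-7 of the write path above: data is pushed to every replica first, then the primary assigns serial numbers and the secondaries apply mutations in that order. All class and method names are illustrative, not the actual GFS interfaces:

```python
# Toy, single-process model of steps 3-7 above (illustrative names only).

class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.buffer = {}      # data id -> bytes (an LRU cache in real GFS)
        self.applied = []     # (serial number, data) in apply order

    def push(self, data_id, data):          # step 3: data flow
        self.buffer[data_id] = data

    def apply(self, serial, data_id):       # steps 5-6: control flow
        self.applied.append((serial, self.buffer.pop(data_id)))
        return "ok"

class Primary(Chunkserver):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):               # step 4: client write request
        self.next_serial += 1               # primary picks the serial order
        serial = self.next_serial
        self.apply(serial, data_id)
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        return "ok" if all(a == "ok" for a in acks) else "error"   # step 7

secondaries = [Chunkserver("cs-2"), Chunkserver("cs-3")]
primary = Primary("cs-1", secondaries)
for replica in (primary, *secondaries):     # step 3: push data everywhere
    replica.push("d1", b"some record")
print(primary.write("d1"))                  # -> ok
```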
System interactions:
Dataflow
◉ Data flow is decoupled from control flow to use the network efficiently
◉ Data is pushed linearly along a carefully picked chain of chunkservers, pipelined
over TCP connections
◉ Each machine forwards the data to the closest machine in the network topology
that has not yet received it (sketch below)
19
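A small sketch of the "forward to the closest machine that has not received the data yet" rule. The distance function here is a made-up stand-in; GFS estimates distances from IP addresses and the rack topology:

```python
# Order the replicas into a linear push chain, always hopping from the
# current holder to the nearest machine that still needs the data.

def forwarding_chain(origin: str, replicas: list[str], dist) -> list[str]:
    chain, current, pending = [], origin, list(replicas)
    while pending:
        nxt = min(pending, key=lambda r: dist(current, r))
        chain.append(nxt)
        pending.remove(nxt)
        current = nxt
    return chain

# Toy topology: machines sharing a rack prefix are "close".
dist = lambda a, b: 1 if a[0] == b[0] else 10
print(forwarding_chain("client", ["a1", "b1", "a2"], dist))   # -> ['a1', 'a2', 'b1']
```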
Atomic record appends
◉ GFS picks the offset, not the client
◉ Maximum size of an append record is 16 MB (one quarter of a chunk)
◉ The client pushes the data to all replicas of the last chunk of the file (as for a write)
◉ The client then sends its request to the primary replica
○ The primary checks whether the record would make the chunk exceed the
64 MB maximum; if so, it pads the chunk and the client retries on the next chunk (sketch below)
◉ Record appends are heavily used at Google for multiple-producer/
single-consumer queues, e.g. in MapReduce jobs
20
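A sketch of the primary's decision for record append under the limits above: if the record would overflow the 64 MB chunk, the chunk is padded and the client retries on a new chunk; otherwise GFS picks the offset. The function shape and return values are illustrative only:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunk
MAX_RECORD = 16 * 1024 * 1024   # append records are capped at 16 MB

def record_append(chunk_used: int, record: bytes):
    """Primary-side decision: returns (action, offset, new_used)."""
    assert len(record) <= MAX_RECORD
    if chunk_used + len(record) > CHUNK_SIZE:
        # Pad the current chunk; the client retries on a freshly allocated chunk.
        return "retry_on_new_chunk", None, CHUNK_SIZE
    return "ok", chunk_used, chunk_used + len(record)   # GFS picks the offset

print(record_append(60 * 1024 * 1024, b"x" * (8 * 1024 * 1024)))
# -> ('retry_on_new_chunk', None, 67108864)
```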
Snapshot
◉ Makes an instantaneous copy of a file or a directory tree using standard
copy-on-write techniques
○ The master receives a snapshot request
○ The master revokes any outstanding leases on the chunks of the files to
snapshot
○ The master waits for the leases to be revoked or to expire
○ The master logs the snapshot operation to disk
○ The master duplicates the metadata for the source file or directory tree: the newly
created snapshot files point to the same chunks as the source files
○ When a client first writes to a chunk created before the snapshot, each replica
clones the chunk locally (sketch below)
21
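A minimal copy-on-write sketch of the snapshot behaviour described above: the snapshot only copies metadata and bumps reference counts, and a shared chunk is cloned lazily on the first post-snapshot write. The data structures and names are assumptions for illustration:

```python
# Snapshot = metadata copy + reference counts; chunks are cloned lazily.
file_to_chunks = {"/db/table": [1, 2]}    # path -> chunk handles
refcount = {1: 1, 2: 1}                   # chunk handle -> reference count
next_handle = 3

def snapshot(src: str, dst: str):
    file_to_chunks[dst] = list(file_to_chunks[src])   # copy metadata only
    for handle in file_to_chunks[dst]:
        refcount[handle] += 1

def write(path: str, index: int) -> int:
    """First write after a snapshot clones the shared chunk (copy-on-write)."""
    global next_handle
    handle = file_to_chunks[path][index]
    if refcount[handle] > 1:              # chunk is shared with a snapshot
        refcount[handle] -= 1
        handle, next_handle = next_handle, next_handle + 1
        refcount[handle] = 1
        file_to_chunks[path][index] = handle   # replicas clone the data locally
    return handle

snapshot("/db/table", "/db/table.snap")
print(write("/db/table", 0))              # -> 3, a freshly cloned chunk handle
```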
Master operation
3
22
Replica placement
◉ Serves two purposes:
○ maximize data reliability and availability
○ maximize network bandwidth utilization
◉ Spread chunks across different racks
○ Ensures chunks survive a whole-rack failure (e.g. a switch failure)
○ Aggregates the bandwidth of different racks
○ Write traffic has to flow through multiple racks
■ A tradeoff made willingly
23
Creation, Re-replication and Balancing
◉ Creation (see the placement sketch after this slide)
○ Place new replicas on chunkservers with below-average disk space
utilization
○ Limit the number of "recent" creations on each chunkserver
○ Spread replicas of a chunk across racks
◉ Re-replication
○ The master re-replicates a chunk as soon as the number of available replicas
falls below a user-specified goal
○ Chunks are prioritized, e.g. by how far they are from their replication goal
and whether they are blocking client progress
○ Re-replication placement follows the same rules as "creation"
○ The master limits the number of active clone operations, both cluster-wide
and per chunkserver
24
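One possible reading of the creation-placement heuristics as code: prefer chunkservers with below-average disk utilization and few recent creations, and spread the replicas across racks. The scoring and field names are illustrative, not GFS's exact policy:

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    rack: str
    disk_used: float          # fraction of disk space in use, 0..1
    recent_creations: int     # chunks created here "recently"

def place(servers: list[Server], copies: int = 3) -> list[Server]:
    """Greedy, rack-aware pick; may return fewer than `copies` servers
    if there are not enough distinct racks below average utilization."""
    avg = sum(s.disk_used for s in servers) / len(servers)
    chosen, racks = [], set()
    for s in sorted(servers, key=lambda s: (s.disk_used, s.recent_creations)):
        if len(chosen) == copies:
            break
        if s.disk_used > avg or s.rack in racks:
            continue                    # below-average disks only, one per rack
        chosen.append(s)
        racks.add(s.rack)
    return chosen

servers = [Server("cs1", "rackA", 0.60, 1), Server("cs2", "rackA", 0.20, 0),
           Server("cs3", "rackB", 0.25, 2), Server("cs4", "rackC", 0.30, 0),
           Server("cs5", "rackC", 0.90, 3)]
print([s.name for s in place(servers)])   # -> ['cs2', 'cs3', 'cs4']
```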
Creation, Re-replication and Balancing
◉ Balancing
○ The master rebalances replicas periodically for better disk space and load
balancing
○ The master gradually fills up a new chunkserver rather than instantly swamping it
with new chunks (and the heavy write traffic that comes with them!)
○ Rebalancing placement follows the same rules as “creation”
25
Garbage collection and Stale replica detection
◉ Files are not deleted instantly but lazily (sketch below)
○ The master logs deletions
○ The file is renamed to a hidden name that includes the deletion timestamp
○ After 3 days, hidden files are removed during regular namespace scans
○ The same procedure handles orphaned chunks
○ Chunkservers report their chunks to the master in heartbeat messages; chunks
not present in the master’s metadata can be freely removed by the chunkserver
◉ Stale replica detection
○ For each chunk the master maintains a version number; replicas with an older
version are stale and are garbage collected
26
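A toy sketch of lazy deletion: the file is renamed to a hidden name carrying the deletion time, and a periodic scan reclaims it after roughly three days. The hidden-name format and data structures are assumptions; the paper does not specify them:

```python
import time

HIDDEN_TTL = 3 * 24 * 3600   # hidden files are reclaimed after ~3 days

deleted = {}                 # hidden name -> deletion timestamp

def delete(namespace: dict, path: str):
    """Lazy delete: rename to a hidden name carrying the deletion time."""
    hidden = f"{path}.__deleted__{int(time.time())}"   # illustrative naming
    namespace[hidden] = namespace.pop(path)
    deleted[hidden] = time.time()

def gc_scan(namespace: dict, now: float):
    """Periodic scan: drop hidden files older than the TTL; their chunks
    become orphans and are later dropped via chunkserver heartbeats."""
    for hidden, ts in list(deleted.items()):
        if now - ts > HIDDEN_TTL:
            namespace.pop(hidden, None)
            deleted.pop(hidden)

ns = {"/logs/old": [7, 8]}
delete(ns, "/logs/old")
gc_scan(ns, now=time.time() + HIDDEN_TTL + 1)
print(ns)   # -> {}
```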
Fault tolerance and diagnosis
HA and Data Integrity in GFS
4
27
High availability in GFS
◉ Fast recovery
◉ Chunk replication
◉ Master replication
28
Data integrity
Each chunkserver uses checksumming to detect corruption of stored chunks: for each
64 KB block it keeps a 32-bit checksum (sketch below).
◉ Reads: before returning a block, the chunkserver verifies its checksum; on a
mismatch it returns an error to the requester and reports the corruption to the master
◉ Writes: the chunkserver verifies the checksums of the first and last blocks that
overlap the write range before performing the write, then computes and records
the new checksums
◉ Appends: no verification of the last block; the checksum of the last partial
block is just updated incrementally
29
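A sketch of that scheme: one 32-bit checksum per 64 KB block, verified for every block a read overlaps. CRC-32 is an assumption here; the paper only says the checksum is 32 bits:

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: one checksum per 64 KB block

def checksums(chunk: bytes) -> list[int]:
    """One 32-bit checksum per block (CRC-32 used here as an assumption)."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, sums: list[int], offset: int, length: int) -> bytes:
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):        # verify every block the read overlaps
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != sums[b]:
            raise IOError(f"block {b} corrupt: return error, report to master")
    return chunk[offset:offset + length]

data = bytes(200_000)
sums = checksums(data)
print(len(verified_read(data, sums, 70_000, 10_000)))   # -> 10000
```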
Measurements
Micro and Macro benchmarks
5
30
Micro benchmarks
1 master, 2 master replicas, 16
chunkservers with 16 clients.
All the machines are configured
with dual 1.4 GHz PIII processors,
2 GB of memory, two 80 GB
5400 rpm disks, and a 100 Mbps
full-duplex Ethernet connection
to an HP 2524 switch.
31
Micro benchmarks - READS
32
Each client reads a randomly selected 4 MB region 256 times (= 1 GB of data) from a 320 GB file set.
The chunkservers taken together have only 32 GB of memory, so at most a 10% hit rate in the
Linux buffer cache is expected.
75-80% efficiency
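A back-of-the-envelope for the 75-80% figure, using the setup numbers from the paper (100 Mbps client NICs, a 1 Gbps link between the two switches, and the ~94 MB/s aggregate read rate reported for 16 readers):

```python
# Read-throughput limits from the benchmark setup.
per_client_limit = 100e6 / 8 / 1e6     # 12.5 MB/s per client NIC
aggregate_limit = 1e9 / 8 / 1e6        # 125.0 MB/s shared by the 16 clients
observed = 94                          # MB/s for 16 readers, as reported in the paper
print(per_client_limit, aggregate_limit, f"{observed / aggregate_limit:.0%}")
# -> 12.5 125.0 75%
```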
Micro benchmarks - WRITES
33
Each client writes 1 GB of data to a new file in a series of 1 MB writes.
50% efficiency
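The write limit is lower because each byte must reach 3 of the 16 chunkservers, each of which can only ingest 12.5 MB/s; the ~35 MB/s aggregate reported in the paper is roughly half of that limit:

```python
# Each byte of a write goes to 3 of the 16 chunkservers, and each chunkserver
# can only take in 12.5 MB/s over its 100 Mbps NIC.
write_limit = 16 * 12.5 / 3            # ~67 MB/s aggregate write ceiling
observed = 35                          # MB/s for 16 writers, as reported in the paper
print(f"limit ~ {write_limit:.0f} MB/s; observed {observed} MB/s is about half")
```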
Micro benchmarks - APPENDS
34
Each client appends simultaneously to a single file.
Performance is limited by the network bandwidth of the 3 chunkservers that store the last chunk of the
file, independent of the number of clients.
35-40% efficiency
Real World clusters
35
Cluster A is used regularly for R&D by 100+ engineers
Cluster B is used for production data processing
Fault tolerance
◉ Kill 1 chunkserver in cluster B
○ 15,000 chunks on it (= 600 GB of data)
○ The cluster limited cloning to 91 concurrent operations (= 40% of its 227 chunkservers)
○ Each clone operation consumes at most 50 Mbps
■ 15,000 chunks restored in 23.2 minutes, an effective replication rate of
~440 MB/s (see the arithmetic below)
◉ Kill 2 chunkservers in cluster B
○ 16,000 chunks on each (= 660 GB of data)
○ This double failure reduced 266 chunks to having a single replica
■ These 266 chunks were cloned at higher priority and all restored to at
least 2x replication within 2 minutes
36
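A quick sanity check of the quoted restore rate (600 GB re-replicated in 23.2 minutes):

```python
# Restore rate: 600 GB of chunk data re-replicated in 23.2 minutes.
data_mb = 600 * 1024                   # 600 GB expressed in MB
print(data_mb / (23.2 * 60))           # ~441 MB/s, matching the ~440 MB/s quoted
```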
Experiences and Conclusions
6
37
Experiences
◉ Operational issues
○ The original design targeted a backend file system for production systems, not
R&D workloads (lack of support for permissions and quotas…)
◉ Technical issues
○ IDE protocol version mismatches that corrupted data silently: this problem
motivated the use of checksums
○ fsync() in Linux 2.2 kernels had a cost proportional to file size, which hurt the
large operation log files
38
Conclusions
◉ GFS is:
○ Supports large-scale data processing workloads on COTS x86 servers
○ Treats component failures as the norm rather than the exception
○ Is optimized for huge files that are mostly appended to and then read sequentially
○ Achieves fault tolerance by constant monitoring, replication of crucial data, and fast,
automatic recovery (plus checksums to detect data corruption)
○ Delivers high aggregate throughput to many concurrent readers and writers
39
Conclusions
◉ GFS now:
○ Source code was never released
○ Superseded by Colossus in the 2010s
■ Smaller chunks (1 MB)
■ Eliminated the master as a single point of failure
■ Built for “real-time” workloads, not only big batch operations
40
Any questions ?
You can find me at
◉ @sshevchenko
◉ s.shevchenko@studenti.unisa.it
Thanks!
41
