THE GOOGLE FILE SYSTEM
S. GHEMAWAT, H. GOBIOFF AND S. LEUNG
APRIL 7, 2015
CSI5311: Distributed Databases and Transaction Processing
Winter 2015
Prof. Iluju Kiringa
University of Ottawa
Presented By:
Ajaydeep Grewal
Roopesh Jhurani
1
AGENDA
• Introduction
• Design Overview
• System Interactions
• Master Operations
• Fault Tolerance and Diagnosis
• Measurements
• Conclusion
• References
2
Introduction
 Google File System (GFS) is a distributed file
system developed by Google for its own use.
 It is a scalable file system for large distributed
data-intensive applications.
 It is widely used within Google as a storage
platform for generation and processing of data.
3
Inspirational factors
 Multiple clusters distributed worldwide.
 Thousands of queries served per second.
 A single query can read hundreds of megabytes of data.
 Google stores dozens of copies of the entire Web.
Conclusion
 Need a large, distributed, highly fault-tolerant file system.
 Large-scale data processing needs performance, reliability,
scalability, and availability.
4
Design Assumptions
 Component Failures
File System consists of hundreds of machines made from
commodity parts.
The quantity and quality of the machines virtually guarantee that
some nodes are non-functional at any given time.
 Huge File Sizes
 Workload
Large streaming reads.
Small random reads.
Large, sequential writes that append data to files.
 Applications & API are co-designed
Increases flexibility.
Goal: a simple file system that places a light burden on applications. 5
GFS Architecture
Master
Chunk Servers
GFS Client API
6
GFS Architecture
Master
Contains the system metadata like:
• Namespaces
• Access Control Information
• Mappings from files to chunks
• Current location of chunks
Also helps in:
◦ Garbage collection
◦ Syncing state with Chunk Servers (heartbeat messages)
7
GFS Architecture
Chunk Servers
 Machines containing physical files divided into chunks.
 Each Master server can have a number of associated chunk
servers.
 For reliability, each chunk is replicated on multiple chunk
servers.
Chunk Handle
 An immutable, globally unique 64-bit chunk handle assigned by the
master at the time of chunk creation.
8
GFS Architecture
GFS Client code
 Code linked into each client application that interacts with GFS.
 Interacts with the master for metadata operations.
 Interacts with Chunk Servers for all Read-Write operations.
9
GFS Architecture
1. The GFS client code requests a particular file (file name and chunk index) from the master.
2. The master replies with the chunk handle and the locations of the chunk servers holding it.
3. The client caches this information and interacts directly with the chunk server.
4. Changes are periodically replicated across all the replicas.
10
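The read path on this slide can be summarized in a short sketch. This is only a hedged illustration: the `Client`, `lookup`, and `read_chunk` names are invented, and the real GFS client library is not published at this level of detail.

```python
# Minimal sketch of the GFS client read path described above (hypothetical API).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Client:
    def __init__(self, master):
        self.master = master
        self.location_cache = {}   # (filename, chunk_index) -> (handle, replicas)

    def read(self, filename, offset, length):
        # Assumes the read does not span a chunk boundary.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.location_cache:
            # Steps 1-2: ask the master once per chunk, then cache the answer.
            handle, replicas = self.master.lookup(filename, chunk_index)
            self.location_cache[key] = (handle, replicas)
        handle, replicas = self.location_cache[key]
        # Step 3: talk to a chunk server directly; the master stays off the data path.
        chunkserver = replicas[0]
        return chunkserver.read_chunk(handle, offset % CHUNK_SIZE, length)
```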
Chunk Size
Having a large uniform chunk size of 64 MB has the
following advantages:
 Reduced client-master interaction.
 Reduced network overhead.
 Reduced size of the metadata stored on the master (see the sketch below).
11
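A quick back-of-the-envelope calculation shows why the 64 MB chunk size keeps master metadata small; the 1 TB file below is a made-up example, and the paper notes that the master keeps less than 64 bytes of metadata per chunk.

```python
# Why large chunks shrink metadata: count chunk entries for a 1 TB file.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
file_size = 1 * 1024**4                # hypothetical 1 TB file

chunks_64mb = file_size // CHUNK_SIZE  # 16,384 chunk entries
chunks_4kb = file_size // (4 * 1024)   # ~268 million entries at a 4 KB block size
print(chunks_64mb, chunks_4kb)
```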
Metadata
 The file and chunk namespaces.
 The mappings from files to chunks.
 Locations of each chunk's replicas.
The first two are kept persistently in the operation log to
ensure reliability and recoverability.
Chunk locations are not persisted by the master; they are held by the chunk servers.
The master polls the chunk servers at start-up and
periodically thereafter (via heartbeats).
12
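The three kinds of metadata can be pictured as three maps on the master. The structure below is purely an illustrative sketch with invented handles and server names, not the real data layout:

```python
# Hypothetical sketch of the three kinds of master metadata listed above.
master_metadata = {
    # 1. File and chunk namespaces (kept persistently via the operation log).
    "namespace": {"/dir1/fileA", "/dir1/dir2/fileB"},
    # 2. File -> ordered list of chunk handles (also logged persistently).
    "file_to_chunks": {"/dir1/fileA": [0x1A2B, 0x1A2C]},
    # 3. Chunk handle -> current replica locations (NOT persisted; rebuilt by
    #    polling chunk servers at start-up and refreshed via heartbeats).
    "chunk_locations": {0x1A2B: ["cs-07", "cs-12", "cs-31"]},
}
print(master_metadata["chunk_locations"][0x1A2B])
```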
Operation Logs
 The operation log contains a historical record of critical
metadata changes.
 Metadata updates are recorded, e.g., as (old value, new value) pairs.
 Because the operation log is critical, it is replicated on
multiple remote machines.
 Global snapshots (checkpoints)
 A checkpoint is stored in a compact, B-tree-like form that can be
mapped directly into memory.
 The master creates a new checkpoint when the log grows beyond a
certain size, so recovery replays only the records after the latest checkpoint.
13
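A minimal sketch of the operation-log discipline described above, under invented class and method names; a real implementation batches and flushes log records to disk, which is omitted here.

```python
# Hedged sketch: log (old, new) records, replicate them, checkpoint when large.
class OperationLog:
    def __init__(self, remote_replicas, checkpoint_threshold=10_000):
        self.records = []                       # in-memory tail of the log
        self.remote_replicas = remote_replicas  # e.g. lists standing in for remote logs
        self.checkpoint_threshold = checkpoint_threshold

    def record_mutation(self, key, old_value, new_value):
        entry = (key, old_value, new_value)
        self.records.append(entry)
        # Replicate the record to remote machines before the change is
        # considered durable.
        for replica in self.remote_replicas:
            replica.append(entry)
        if len(self.records) >= self.checkpoint_threshold:
            self.checkpoint()

    def checkpoint(self):
        # A real checkpoint is a compact, B-tree-like snapshot that can be
        # mapped into memory; here we simply treat the logged tail as truncatable.
        self.records.clear()

log = OperationLog(remote_replicas=[[], []])
log.record_mutation("/dir1/fileA#chunks", None, [0x1A2B])
```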
System Interactions
 Mutation
A mutation is an operation that changes the contents or
metadata of a chunk such as a write or an append operation.
 Lease mechanism
Leases are used to maintain a consistent mutation order across
replicas.
◦ First, the master grants a chunk lease to one of the replicas,
which becomes the primary.
◦ The primary picks a serial order for all mutations to the chunk;
the other replicas follow this order.
14
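A sketch of the lease bookkeeping on the master side. The 60-second lease duration comes from the paper; the class and method names, and the rule for picking a primary, are invented for illustration.

```python
# Hypothetical lease table: one primary per chunk, leases expire and are renewed.
import time

LEASE_SECONDS = 60   # the paper's initial lease timeout

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk_handle -> (primary_replica, expiry_time)

    def grant_or_refresh(self, chunk_handle, replicas):
        primary, expiry = self.leases.get(chunk_handle, (None, 0))
        if primary is None or time.time() > expiry:
            primary = replicas[0]            # pick one replica as primary
        self.leases[chunk_handle] = (primary, time.time() + LEASE_SECONDS)
        return primary                       # the primary decides mutation order

table = LeaseTable()
print(table.grant_or_refresh(0x1A2B, ["cs-07", "cs-12", "cs-31"]))  # 'cs-07'
```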
Write Control and Data Flow
15
1. The client asks the master which chunk server holds the lease (the primary) and where the replicas are.
2. The master replies with the locations of the primary and the secondary replicas.
3. The client caches this information and pushes the write data to all the replicas.
4. The primary and secondaries buffer the data and send a confirmation.
5. The primary assigns a serial order to the mutation and forwards this order to all the secondaries.
6. The secondaries apply the mutations in that order and send a confirmation to the primary.
7. The primary sends a confirmation to the client.
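The seven steps above, seen from the client, might look like the sketch below. The object and method names are invented, and for simplicity the data push is shown going from the client to each replica directly, whereas GFS actually pipelines the data along a chain of chunk servers.

```python
# Hedged sketch of the write protocol above from the client's point of view.
def write_chunk(master, chunk_handle, offset, data):
    # Steps 1-2: learn which replica holds the lease (primary) and where the
    # secondaries are; this answer is cacheable until the lease changes.
    primary, secondaries = master.find_lease_holder(chunk_handle)

    # Steps 3-4: push the data to every replica, where it sits in a buffer
    # until applied; each replica acknowledges receipt.
    for replica in [primary] + secondaries:
        replica.push_data(chunk_handle, data)

    # Step 5: send the write request to the primary, which assigns a serial
    # number to the mutation and forwards that order to all secondaries.
    # Step 6: secondaries apply the mutation in that order and ack the primary.
    status = primary.apply_write(chunk_handle, offset, secondaries)

    # Step 7: the primary reports success (or any replica errors) to the client.
    return status
```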
Consistency
 Consistent: all clients see the same data, regardless of which
replica they read from.
 Inconsistent: a failed mutation leaves the region
inconsistent, i.e., different clients may see different
data.
16
Master Operations
1. Namespace Management and Locking
2. Replica Placement
3. Creation, Re-replication and Rebalancing
4. Garbage Collection
5. Stale Replica Detection
17
Master Operations
Namespace Management and Locking
 Locks over regions of the namespace ensure:
 Proper serialization of conflicting operations.
 Multiple operations can run concurrently at the master, avoiding delays.
 Each master operation acquires a set of locks before it runs.
 An operation on /dir1/dir2/dir3/leaf requires:
 Read-locks on /dir1, /dir1/dir2, /dir1/dir2/dir3
 A read-lock or write-lock on /dir1/dir2/dir3/leaf
 File creation doesn’t require a write-lock on the parent directory: a read-lock is
enough to protect it from being deleted, renamed, or snapshotted.
 Write-locks on file names serialize attempts to create a file with the same name twice.
18
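The locking rule can be expressed compactly: read-lock every ancestor directory, then take a read- or write-lock on the leaf. The helper below is only an illustrative sketch, not the master's actual implementation.

```python
# Sketch of the locking rule above: read-locks on ancestors, leaf lock last.
def locks_for(path, write_leaf):
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf = "/" + "/".join(parts)
    return ([(d, "read") for d in ancestors] +
            [(leaf, "write" if write_leaf else "read")])

# An operation that mutates /dir1/dir2/dir3/leaf:
print(locks_for("/dir1/dir2/dir3/leaf", write_leaf=True))
# [('/dir1', 'read'), ('/dir1/dir2', 'read'), ('/dir1/dir2/dir3', 'read'),
#  ('/dir1/dir2/dir3/leaf', 'write')]
```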
Master Operations
Locking Mechanism
 Example: /home/user is being snapshotted to /save/user while /home/user/foo is being created.
 The snapshot acquires:
 Read-locks on /home and /save
 Write-locks on /home/user and /save/user
 The file creation acquires:
 Read-locks on /home and /home/user
 A write-lock on /home/user/foo
 The two operations conflict on /home/user and are therefore serialized.
19
Master Operations
Replica Placement
 Serves two purposes:
 Maximize data reliability and availability
 Maximize Network Bandwidth utilization
 Spread Chunk replicas across racks:
 To ensure chunk survivability
 To exploit aggregate read bandwidth of multiple racks
 Trade-off: write traffic has to flow through multiple racks.
20
Master Operations
Creation, re-replication and rebalancing
 Creation: Master considers several factors
 Place new replicas on chunk servers with below-average disk utilization.
 Limit the number of “recent” creations on each chunk server.
 Spread replicas of a chunk across racks.
 Re-replication:
 The master re-replicates a chunk when the number of replicas falls below its goal level.
 Chunks to re-replicate are prioritized based on several factors.
 The master limits the number of active clone operations, both for the cluster and
for each chunk server.
 Each chunk server limits the bandwidth it spends on each clone operation.
 Balancing:
 The master rebalances replicas periodically for better disk utilization and load balancing.
 The master fills up a new chunk server gradually rather than swamping it instantly with
new chunks.
21
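The creation heuristics above can be approximated by ranking chunk servers and skipping racks that already hold a replica. The scoring below is an invented simplification for illustration, not the actual GFS policy.

```python
# Hypothetical placement sketch: prefer low disk utilization, few recent
# creations, and new racks; pick num_replicas servers.
def choose_servers(servers, num_replicas=3):
    # Sort so that low disk utilization and few recent creations come first.
    ranked = sorted(servers, key=lambda s: (s["disk_util"], s["recent_creations"]))
    chosen, used_racks = [], set()
    for s in ranked:
        if s["rack"] in used_racks:          # spread replicas across racks
            continue
        chosen.append(s["name"])
        used_racks.add(s["rack"])
        if len(chosen) == num_replicas:
            break
    return chosen

servers = [
    {"name": "cs-01", "rack": "r1", "disk_util": 0.40, "recent_creations": 2},
    {"name": "cs-02", "rack": "r1", "disk_util": 0.55, "recent_creations": 0},
    {"name": "cs-03", "rack": "r2", "disk_util": 0.48, "recent_creations": 1},
    {"name": "cs-04", "rack": "r3", "disk_util": 0.70, "recent_creations": 5},
]
print(choose_servers(servers))   # ['cs-01', 'cs-03', 'cs-04']
```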
Master Operations
Garbage Collection
 GFS reclaims the storage of deleted files lazily.
 Mechanism:
 The master logs the deletion like other changes.
 The file is renamed to a hidden name that includes the deletion timestamp.
 During its regular namespace scan, the master removes hidden files
older than a grace period, erasing their in-memory metadata.
 A similar scan of the chunk namespace identifies orphaned
chunks and erases their metadata.
 In the regular heartbeat exchange, each chunk server learns which of its
chunks no longer appear in the master's metadata and is free to delete them (see the sketch below).
22
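A sketch of the lazy-deletion mechanism, using an invented naming convention for hidden files; the three-day grace period is the paper's default.

```python
# Hypothetical lazy deletion: rename to a hidden name now, reclaim later.
import time

HIDDEN_PREFIX = ".deleted."
GRACE_PERIOD = 3 * 24 * 3600     # the paper's default grace period is 3 days

def delete_file(namespace, path):
    # Only log/rename: the file gets a hidden name carrying the deletion
    # timestamp, and no chunk is reclaimed yet.
    hidden = f"{HIDDEN_PREFIX}{int(time.time())}.{path.strip('/')}"
    namespace[hidden] = namespace.pop(path)

def namespace_scan(namespace, now=None):
    now = now or time.time()
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            ts = int(name.split(".")[2])
            if now - ts > GRACE_PERIOD:
                del namespace[name]      # in-memory metadata erased; orphaned
                                         # chunks are garbage-collected later

ns = {"/dir1/fileA": [0x1A2B]}
delete_file(ns, "/dir1/fileA")
namespace_scan(ns)                       # nothing removed yet: grace period not over
```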
Master Operations
Stale Replica Detection
 Problem: a chunk replica may become stale if its chunk server fails and
misses mutations.
 Solution: for each chunk, the master maintains a version number.
 Whenever the master grants a new lease on a chunk, it increments the
chunk version number and informs the up-to-date replicas (the version
number is stored persistently on the master and the associated chunk servers).
 The master detects that a chunk server has a stale replica when the chunk
server restarts and reports its set of chunks and their version
numbers.
 The master removes stale replicas during its regular garbage collection.
 The master includes the chunk version number when it tells clients
which chunk server holds the lease on a chunk, or when it instructs a
chunk server to read a chunk from another chunk server in a
cloning operation.
23
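The version-number scheme can be sketched as follows; the structures are hypothetical and persistence and error handling are omitted.

```python
# Sketch of chunk version bookkeeping for stale-replica detection.
class ChunkVersions:
    def __init__(self):
        self.current = {}   # chunk_handle -> latest version known to the master

    def grant_lease(self, handle, up_to_date_replicas):
        # Bump the version on every new lease; the master and the up-to-date
        # replicas both record the new number.
        self.current[handle] = self.current.get(handle, 0) + 1
        for replica_versions in up_to_date_replicas:
            replica_versions[handle] = self.current[handle]

    def stale_replicas(self, reported):
        # Called when a chunk server restarts and reports {handle: version}.
        return [h for h, v in reported.items() if v < self.current.get(h, 0)]

versions = ChunkVersions()
cs_a, cs_b = {}, {}                       # per-chunk-server version tables
versions.grant_lease(0x1A2B, [cs_a, cs_b])
versions.grant_lease(0x1A2B, [cs_a])      # cs_b was down and missed this lease
print(versions.stale_replicas(cs_b))      # [6699] (0x1A2B) -> stale on cs_b
```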
Fault Tolerance and Diagnosis
High Availability
 Strategies: Fast recovery and Replication.
 Fast Recovery:
 Master and chunk servers are designed to restore their state in seconds.
 There is no distinction between normal and abnormal
termination (servers are routinely shut down simply by killing the process).
 Clients and other servers experience a minor timeout on outstanding requests, reconnect to
the restarted server, and retry.
 Chunk Replication:
 Each chunk is replicated on multiple chunk servers on different racks (different parts of the
file namespace can have different replication levels).
 The master clones existing replicas as needed when chunk servers go offline or report corrupted
replicas (detected by checksum verification).
 Master Replication
 “Shadow” masters provide read-only access to the file system even when the primary master is
down.
 Master operation logs and checkpoints are replicated on multiple machines for
reliability.
24
Fault Tolerance and Diagnosis
Data Integrity
 Each chunk server uses checksumming to detect corruption of stored
chunks.
 Each chunk is broken into 64 KB blocks, each with an associated 32-bit checksum.
 Checksums are metadata: they are kept in memory and stored persistently with
logging, separate from user data.
 For reads: the chunk server verifies the checksums of the data blocks that
overlap the read range before returning any data.
 For writes that overwrite existing data: the chunk server verifies the checksums
of the first and last blocks that overlap the write range before performing the
write, then computes and records the new checksums (see the sketch below).
25
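A sketch of per-block checksumming on the read path. The paper specifies 64 KB blocks with 32-bit checksums but does not name the checksum function; CRC32 is used here purely for illustration.

```python
# Illustrative per-block checksumming: 64 KB blocks, 32-bit checksums (CRC32).
import zlib

BLOCK_SIZE = 64 * 1024

def block_checksums(chunk_data):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_range(chunk_data, checksums, offset, length):
    # Verify every 64 KB block overlapping [offset, offset + length)
    # before returning any data, as on the read path above.
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]

data = bytes(200 * 1024)                 # a 200 KB chunk => 4 blocks
sums = block_checksums(data)
verify_range(data, sums, offset=70_000, length=10_000)
```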
Measurements
Micro-benchmarks: GFS cluster
 One master, two master replicas, and 16 chunk servers, with 16 clients.
 Each machine: dual 1.4 GHz PIII processors, 2 GB RAM, two 80 GB 5400 RPM
disks, and a 100 Mbps full-duplex Fast Ethernet NIC connected to an HP 2524
switch (10/100 ports with a 1 Gbps link between the two switches).
26
Measurements
Micro-benchmarks: READS
 Each client reads a randomly selected 4 MB region 256 times (= 1 GB of
data) from a 320 GB file set.
 Aggregate chunk server memory is 32 GB, so at most a ~10% hit rate in the Linux
buffer cache is expected (see the calculation below).
27
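Where these numbers come from: the 320 GB file set, the 100 Mbps client NICs, and the 1 Gbps inter-switch link are all taken from the speaker notes at the end of this deck.

```python
# Back-of-the-envelope limits behind the read benchmark.
file_set = 320               # GB read in aggregate by the 16 clients
chunkserver_ram = 32         # GB of combined chunk server memory
print(chunkserver_ram / file_set)    # 0.1 -> at most a ~10% buffer-cache hit rate

client_nic = 100 / 8         # 12.5 MB/s per client over 100 Mbps Fast Ethernet
switch_uplink = 1000 / 8     # 125 MB/s over the 1 Gbps inter-switch link
print(client_nic, switch_uplink)     # per-client and aggregate read limits
```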
Measurements
Micro-benchmarks: WRITE
 Each client writes 1 GB of data to a new file in a series of 1 MB writes.
 The network stack does not interact very well with the pipelining scheme
used for pushing data to the chunk replicas; network congestion is
more likely for 16 writers than for 16 readers because each write
involves three different replicas.
28
Measurements
Micro-benchmarks: RECORD APPENDS
 All clients append simultaneously to a single file.
 Performance is limited by the network bandwidth of the three chunk
servers that store the last chunk of the file, independent of the number
of clients.
29
Conclusion
Google File System
 Supports large-scale data-processing workloads on commodity (COTS) x86 servers.
 Component failures are the norm rather than the exception.
 Optimized for huge files that are mostly appended to and then read sequentially.
 Fault tolerance through constant monitoring, replication of crucial data, and
fast, automatic recovery.
 Delivers high aggregate throughput to many concurrent readers and
writers.
Future Improvements
 Networking stack limit: write throughput is currently constrained by the
network stack and can be improved in the future.
30
References
1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google
File System. In Proceedings of the 19th ACM Symposium on Operating
Systems Principles, pages 29–43, Bolton Landing, NY, USA, October 2003.
2. Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee.
Frangipani: A scalable distributed file system. In Proceedings of the
16th ACM Symposium on Operating System Principles, pages 224–
237, Saint-Malo, France, October 1997.
3. http://en.wikipedia.org/wiki/Google_File_System
4. http://computer.howstuffworks.com/internet/basics/google-file-
system.htm
5. http://en.wikiversity.org/wiki/Big_Data/Google_File_System
6. http://storagemojo.com/google-file-system-eval-part-i/
7. https://www.youtube.com/watch?v=d2SWUIP40Nw
31
Thank You!!
32
Editor's Notes
  1. Each chunk server uses checksumming to detect corruption of stored data. Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 of the paper for one cause.) We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunk servers. Moreover, divergent replicas may be legal: the semantics of GFS mutations, in particular atomic record append as discussed earlier, do not guarantee identical replicas. Therefore, each chunk server must independently verify the integrity of its own copy by maintaining checksums. A chunk is broken up into 64 KB blocks. Each has a corresponding 32-bit checksum. Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data. For reads, the chunk server verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunk server. Therefore chunk servers will not propagate corruptions to other machines. If a block does not match the recorded checksum, the chunk server returns an error to the requester and reports the mismatch to the master. In response, the requester will read from other replicas, while the master will clone the chunk from another replica. After a valid new replica is in place, the master instructs the chunk server that reported the mismatch to delete its replica. Checksumming has little effect on read performance for several reasons. Since most reads span at least a few blocks, only a relatively small amount of extra data needs to be read and checksummed for verification. GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. Moreover, checksum lookups and comparison on the chunk server are done without any I/O, and checksum calculation can often be overlapped with I/O. Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand-new checksum blocks filled by the append. Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read. In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten. During idle periods, chunk servers can scan and verify the contents of inactive chunks. This allows corruption to be detected in chunks that are rarely read. Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.
  2. In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google. We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunk servers, and 16 clients. Note that this configuration was set up for ease of testing. Typical clusters have hundreds of chunk servers and hundreds of clients. All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. The two switches are connected with a 1 Gbps link.
  3. N clients read simultaneously from the file system. Each client reads a randomly selected 4 MB region from a 320 GB file set. This is repeated 256 times so that each client ends up reading 1 GB of data. The chunk servers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. Our results should be close to cold-cache results. Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunk server.
  4. N clients write simultaneously to N distinct files. Each client writes 1 GB of data to a new file in a series of 1 MB writes. The aggregate write rate and its theoretical limit are shown in Figure 3(b). The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunk servers, each with a 12.5 MB/s input connection. The write rate for one client is 6.3 MB/s, about half of the limit. The main culprit for this is our network stack. It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. Delays in propagating data from one replica to another reduce the overall write rate. The aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunk server as the number of clients increases. Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas. Writes are slower than we would like. In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
  5. Figure 3(c) shows record append performance. N clients append simultaneously to a single file. Performance is limited by the network bandwidth of the chunk servers that store the last chunk of the file, independent of the number of clients. It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients. Our applications tend to produce multiple such files concurrently. In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. Therefore, the chunk server network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunk servers for another file are busy.