Slideshare.net (beta)

 
Post: 
Myspace Hi5 Friendster Xanga LiveJournal Facebook Blogger Tagged Typepad Freewebs BlackPlanet gigya icons

All comments

Add a comment on Slide 1

If you have a SlideShare account, login to comment; else you can comment as a guest


Showing 1-50 of 7 (more)

GFS - Google File System

From tutchiio, 2 years ago

4834 views  |  1 comment  |  7 favorites  |  7 embeds (Stats)
Download not available ?
 

Groups / Events

 

 
Embed
options

More Info

This slideshow is Public
Total Views: 4834
on Slideshare: 4762
from embeds: 72

Slideshow transcript

Slide 1: The Google File System Tut Chi Io

Slide 2: Design Overview – Assumption • Inexpensive commodity hardware • Large files: Multi-GB • Workloads – Large streaming reads – Small random reads – Large, sequential appends • Concurrent append to the same file • High Throughput > Low Latency

Slide 3: Design Overview – Interface • Create • Delete • Open • Close • Read • Write • Snapshot • Record Append

Slide 4: Design Overview – Architecture • Single master, multiple chunk servers, multiple clients – User-level process running on commodity Linux machine – GFS client code linked into each client application to communicate • File -> 64MB chunks -> Linux files – on local disks of chunk servers – replicated on multiple chunk servers (3r) • Cache metadata but not chunk on clients

Slide 5: Design Overview – Single Master • Why centralization? Simplicity! • Global knowledge is needed for – Chunk placement – Replication decisions

Slide 6: Design Overview – Chunk Size • 64MB – Much Larger than ordinary, why? – Advantages • Reduce client-master interaction • Reduce network overhead • Reduce the size of the metadata – Disadvantages • Internal fragmentation – Solution: lazy space allocation • Hot Spots – many clients accessing a 1-chunk file, e.g. executables – Solution: – Higher replication factor – Stagger application start times – Client-to-client communication

Slide 7: Design Overview – Metadata • File & chunk namespaces – In master’s memory – In master’s and chunk servers’ storage • File-chunk mapping – In master’s memory – In master’s and chunk servers’ storage • Location of chunk replicas – In master’s memory – Ask chunk servers when • Master starts • Chunk server joins the cluster – If persistent, master and chunk servers must be in sync

Slide 8: Design Overview – Metadata – In-memory DS • Why in-memory data structure for the master? – Fast! For GC and LB • Will it pose a limit on the number of chunks -> total capacity? – No, a 64MB chunk needs less than 64B metadata (640TB needs less than 640MB) • Most chunks are full • Prefix compression on file names

Slide 9: Design Overview – Metadata – Log • The only persistent record of metadata • Defines the order of concurrent operations • Critical – Replicated on multiple remote machines – Respond to client only when log locally and remotely • Fast recovery by using checkpoints – Use a compact B-tree like form directly mapping into memory – Switch to a new log, Create new checkpoints in a separate threads

Slide 10: Design Overview – Consistency Model • Consistent – All clients will see the same data, regardless of which replicas they read from • Defined – Consistent, and clients will see what the mutation writes in its entirety

Slide 11: Design Overview – Consistency Model • After a sequence of success, a region is guaranteed to be defined – Same order on all replicas – Chunk version number to detect stale replicas • Client cache stale chunk locations? – Limited by cache entry’s timeout – Most files are append-only • A Stale replica return a premature end of chunk

Slide 12: System Interactions – Lease • Minimized management overhead • Granted by the master to one of the replicas to become the primary • Primary picks a serial order of mutation and all replicas follow • 60 seconds timeout, can be extended • Can be revoked

Slide 13: System Interactions – Mutation Order Current lease holder? Write request 3a. data identity of primary location of replicas (cached by client) Operation completed Operation completed or Error report 3b. data Primary assign s/n to mutations Applies it Forward write request 3c. data Operation completed

Slide 14: System Interactions – Data Flow • Decouple data flow and control flow • Control flow – Master -> Primary -> Secondaries • Data flow – Carefully picked chain of chunk servers • Forward to the closest first • Distances estimated from IP addresses – Linear (not tree), to fully utilize outbound bandwidth (not divided among recipients) – Pipelining, to exploit full-duplex links • Time to transfer B bytes to R replicas = B/T + RL • T: network throughput, L: latency

Slide 15: System Interactions – Atomic Record Append • Concurrent appends are serializable – Client specifies only data – GFS appends at least once atomically – Return the offset to the client – Heavily used by Google to use files as • multiple-producer/single-consumer queues • Merged results from many different clients – On failures, the client retries the operation – Data are defined, intervening regions are inconsistent • A Reader can identify and discard extra padding and record fragments using the checksums

Slide 16: System Interactions – Snapshot • Makes a copy of a file or a directory tree almost instantaneously • Use copy-on-write • Steps – Revokes lease – Logs operations to disk – Duplicates metadata, pointing to the same chunks • Create real duplicate locally – Disks are 3 times as fast as 100 Mb Ethernet links

Slide 17: Master Operation – Namespace Management • No per-directory data structure • No support for alias • Lock over regions of namespace to ensure serialization • Lookup table mapping full pathnames to metadata – Prefix compression -> In-Memory

Slide 18: Master Operation – Namespace Locking • Each node (file/directory) has a read- write lock • Scenario: prevent /home/user/foo from being created while /home/user is being snapshotted to /save/user – Snapshot • Read locks on /home, /save • Write locks on /home/user, /save/user – Create • Read locks on /home, /home/user • Write lock on /home/user/foo

Slide 19: Master Operation – Policies • New chunks creation policy – New replicas on below-average disk utilization – Limit # of “recent” creations on each chun server – Spread replicas of a chunk across racks • Re-replication priority – Far from replication goal first – Chunk that is blocking client first – Live files first (rather than deleted) • Rebalance replicas periodically

Slide 20: Master Operation – Garbage Collection • Lazy reclamation – Logs deletion immediately – Rename to a hidden name • Remove 3 days later • Undelete by renaming back • Regular scan for orphaned chunks – Not garbage: • All references to chunks: file-chunk mapping • All chunk replicas: Linux files under designated directory on each chunk server – Erase metadata – HeartBeat message to tell chunk servers to delete chunks

Slide 21: Master Operation – Garbage Collection • Advantages – Simple & reliable • Chunk creation may failed • Deletion messages may be lost – Uniform and dependable way to clean up unuseful replicas – Done in batches and the cost is amortized – Done when the master is relatively free – Safety net against accidental, irreversible deletion

Slide 22: Master Operation – Garbage Collection • Disadvantage – Hard to fine tune when storage is tight • Solution – Delete twice explicitly -> expedite storage reclamation – Different policies for different parts of the namespace • Stale Replica Detection – Master maintains a chunk version number

Slide 23: Fault Tolerance – High Availability • Fast Recovery – Restore state and start in seconds – Do not distinguish normal and abnormal termination • Chunk Replication – Different replication levels for different parts of the file namespace – Keep each chunk fully replicated as chunk servers go offline or detect corrupted replicas through checksum verification

Slide 24: Fault Tolerance – High Availability • Master Replication – Log & checkpoints are replicated – Master failures? • Monitoring infrastructure outside GFS starts a new master process – “Shadow” masters • Read-only access to the file system when the primary master is down • Enhance read availability • Reads a replica of the growing operation log

Slide 25: Fault Tolerance – Data Integrity • Use checksums to detect data corruption • A chunk(64MB) is broken up into 64KB blocks with 32-bit checksum • Chunk server verifies the checksum before returning, no error propagation • Record append – Incrementally update the checksum for the last block, error will be detected when read • Random write – Read and verify the first and last block first – Perform write, compute new checksums

Slide 26: Conclusion • GFS supports large-scale data processing using commodity hardware • Reexamine traditional file system assumption – based on application workload and technological environment – Treat component failures as the norm rather than the exception – Optimize for huge files that are mostly appended – Relax the stand file system interface

Slide 27: Conclusion • Fault tolerance – Constant monitoring – Replicating crucial data – Fast and automatic recovery – Checksumming to detect data corruption at the disk or IDE subsystem level • High aggregate throughput – Decouple control and data transfer – Minimize operations by large chunk size and by chunk lease

Slide 28: Reference • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System”