GFS - Google File System
Transcript

  • 1. The Google File System (Tut Chi Io)
  • 2. Design Overview – Assumptions
    - Inexpensive commodity hardware
    - Large files: multi-GB
    - Workloads
      - large streaming reads
      - small random reads
      - large, sequential appends
    - Concurrent appends to the same file
    - High sustained throughput is valued over low latency
  • 3. Design Overview – Interface
    - Create
    - Delete
    - Open
    - Close
    - Read
    - Write
    - Snapshot
    - Record append
  • 4. Design Overview – Architecture
    - Single master, multiple chunk servers, multiple clients
      - each is a user-level process running on a commodity Linux machine
      - GFS client code is linked into each application to communicate with the master and chunk servers
    - Files are divided into 64 MB chunks, each stored as a Linux file
      - on the local disks of chunk servers
      - replicated on multiple chunk servers (3 replicas by default)
    - Clients cache metadata but not chunk data
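Because the chunk size is fixed, a client can translate a byte offset in a file into a chunk index before asking the master for chunk locations. A minimal sketch of that translation, assuming the 64 MB chunk size from the slide (the function names are illustrative, not GFS client API):

    # Minimal sketch: translating a byte offset into a chunk index, as a client
    # would before asking the master for chunk locations. Not the actual GFS code.
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

    def chunk_index(byte_offset: int) -> int:
        """Return the index of the chunk that contains byte_offset."""
        return byte_offset // CHUNK_SIZE

    def chunk_range(byte_offset: int, length: int) -> range:
        """Return the range of chunk indices touched by a read of `length` bytes."""
        first = chunk_index(byte_offset)
        last = chunk_index(byte_offset + length - 1)
        return range(first, last + 1)

    if __name__ == "__main__":
        # A 1 MB read starting at 130 MB touches only chunk 2.
        print(list(chunk_range(130 * 1024 * 1024, 1 * 1024 * 1024)))  # [2]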
  • 5. Design Overview – Single Master
    - Why centralization? Simplicity
    - Global knowledge is needed for
      - chunk placement
      - replication decisions
  • 6. Design Overview – Chunk Size
    - 64 MB: much larger than typical file system block sizes. Why?
    - Advantages
      - reduces client-master interaction
      - reduces network overhead
      - reduces the size of the metadata
    - Disadvantages
      - internal fragmentation
        - solution: lazy space allocation
      - hot spots: many clients accessing a one-chunk file, e.g. an executable
        - solutions:
          - higher replication factor
          - stagger application start times
          - client-to-client communication
  • 7. Design Overview – Metadata
    - File and chunk namespaces
      - in the master's memory
      - persisted in the operation log on the master's disk and on remote replicas
    - File-to-chunk mapping
      - in the master's memory
      - persisted in the operation log on the master's disk and on remote replicas
    - Locations of chunk replicas
      - in the master's memory only
      - the master asks chunk servers when
        - the master starts
        - a chunk server joins the cluster
      - if this were persistent, the master and chunk servers would have to be kept in sync
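A hedged sketch of how those three kinds of metadata might be laid out in the master's memory; the class and field names are assumptions for illustration, not the actual GFS data structures:

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class ChunkInfo:
        version: int                                          # for stale-replica detection
        locations: List[str] = field(default_factory=list)    # chunk servers; not persisted

    @dataclass
    class MasterMetadata:
        # namespace and file-to-chunk mapping: full pathname -> chunk handles (persisted via the log)
        files: Dict[str, List[int]] = field(default_factory=dict)
        # chunk handle -> chunk info (locations rebuilt from chunk-server reports)
        chunks: Dict[int, ChunkInfo] = field(default_factory=dict)

        def report_chunk(self, handle: int, server: str, version: int) -> None:
            """Called when a chunk server reports a replica at master startup or on join."""
            info = self.chunks.setdefault(handle, ChunkInfo(version=version))
            if version >= info.version and server not in info.locations:
                info.locations.append(server)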
  • 8. Design Overview – Metadata – In-Memory Data Structures
    - Why an in-memory data structure on the master?
      - fast periodic scans, used for garbage collection and load balancing
    - Does it limit the number of chunks, and hence total capacity?
      - no: a 64 MB chunk needs less than 64 bytes of metadata, so 640 TB of data needs less than 640 MB of metadata
        - most chunks are full
        - prefix compression is used on file names
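A quick check of the slide's arithmetic, using the round numbers given there (64 MB chunks, at most 64 bytes of metadata per chunk, 640 TB of data):

    TB = 1024 ** 4
    MB = 1024 ** 2

    data_bytes = 640 * TB
    chunk_size = 64 * MB
    metadata_per_chunk = 64                        # bytes, the slide's upper bound

    num_chunks = data_bytes // chunk_size          # 10,485,760 chunks
    metadata_bytes = num_chunks * metadata_per_chunk

    print(num_chunks)                              # 10485760
    print(metadata_bytes // MB)                    # 640 (MB), matching the slide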
  • 9. Design Overview – Metadata – Operation Log
    - The only persistent record of metadata
    - Defines the order of concurrent operations
    - Critical, so it is
      - replicated on multiple remote machines
      - the master responds to a client only after the log record has been flushed both locally and remotely
    - Fast recovery by using checkpoints
      - a compact B-tree-like form that maps directly into memory
      - the master switches to a new log file and creates the new checkpoint in a separate thread
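A minimal sketch of the "log locally and remotely before replying" rule; the replication transport and function names are assumptions, not GFS's actual RPCs:

    from typing import Callable, List

    class OperationLog:
        def __init__(self, remote_flushers: List[Callable[[bytes], None]]):
            self.local_log: List[bytes] = []
            self.remote_flushers = remote_flushers   # e.g. one per remote log replica

        def append_and_flush(self, record: bytes) -> None:
            """Durably record a metadata mutation before acknowledging the client."""
            self.local_log.append(record)            # stand-in for a local fsync'd write
            for flush in self.remote_flushers:       # replicate to remote machines
                flush(record)

        def handle_mutation(self, record: bytes, reply: Callable[[str], None]) -> None:
            self.append_and_flush(record)            # only now is it safe to respond
            reply("OK")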
  • 10. Design Overview – Consistency Model
    - Consistent
      - all clients see the same data, regardless of which replica they read from
    - Defined
      - consistent, and clients see what the mutation wrote in its entirety
  • 11. Design Overview – Consistency Model
    - After a sequence of successful mutations, the mutated region is guaranteed to be defined
      - mutations are applied in the same order on all replicas
      - chunk version numbers are used to detect stale replicas
    - Can clients cache stale chunk locations?
      - the window is limited by the cache entry's timeout
      - most files are append-only, so
        - a stale replica usually returns a premature end of chunk rather than stale data
  • 12. System Interactions – Leases
    - Minimize management overhead
    - Granted by the master to one of the replicas, which becomes the primary
    - The primary picks a serial order for mutations and all replicas follow it
    - 60-second timeout, can be extended
    - Can be revoked
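A hedged sketch of lease bookkeeping on the master. The 60-second timeout is from the slide; the class names and extension API are assumptions for illustration:

    import time
    from dataclasses import dataclass
    from typing import Dict

    LEASE_SECONDS = 60

    @dataclass
    class Lease:
        primary: str          # chunk server currently holding the lease
        expires_at: float

    class LeaseTable:
        def __init__(self):
            self.leases: Dict[int, Lease] = {}   # chunk handle -> lease

        def grant_or_get(self, chunk: int, candidate_replica: str) -> str:
            """Return the current primary, granting a fresh lease if none is live."""
            lease = self.leases.get(chunk)
            if lease is None or lease.expires_at <= time.time():
                lease = Lease(primary=candidate_replica,
                              expires_at=time.time() + LEASE_SECONDS)
                self.leases[chunk] = lease
            return lease.primary

        def extend(self, chunk: int, primary: str) -> bool:
            """Extend a live lease held by the primary (e.g. piggybacked on HeartBeats)."""
            lease = self.leases.get(chunk)
            if lease and lease.primary == primary and lease.expires_at > time.time():
                lease.expires_at = time.time() + LEASE_SECONDS
                return True
            return False

        def revoke(self, chunk: int) -> None:
            self.leases.pop(chunk, None)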
  • 13. System Interactions – Mutation Order
    1. Client asks the master which chunk server holds the current lease and for the locations of the other replicas
    2. Master replies with the identity of the primary and the locations of the replicas; the client caches this
    3. Client pushes the data to all replicas (3a, 3b, 3c)
    4. Client sends the write request to the primary; the primary assigns serial numbers to the mutations and applies them locally
    5. Primary forwards the write request to the secondaries
    6. Secondaries reply that the operation completed
    7. Primary replies to the client: operation completed, or an error report
  • 14. System Interactions – Data Flow
    - Decouple data flow from control flow
    - Control flow
      - master -> primary -> secondaries
    - Data flow
      - a carefully picked chain of chunk servers
        - forward to the closest server first
        - distances estimated from IP addresses
      - linear chain (not a tree), so each machine's full outbound bandwidth is used rather than divided among recipients
      - pipelining, to exploit full-duplex links
        - time to transfer B bytes to R replicas = B/T + RL
        - T: network throughput, L: latency between machines
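An illustrative evaluation of the formula on the slide, time = B/T + R*L. The 100 Mbps link speed and 1 ms per-hop latency are example numbers chosen here, not figures from the slide:

    def transfer_time(bytes_to_send: float, throughput_bps: float,
                      replicas: int, latency_s: float) -> float:
        """Ideal time to push B bytes through a pipelined chain of R replicas."""
        return bytes_to_send * 8 / throughput_bps + replicas * latency_s

    B = 1 * 1024 * 1024          # 1 MB payload
    T = 100e6                    # 100 Mbps links
    R = 3                        # replication factor
    L = 1e-3                     # 1 ms per-hop latency

    print(f"{transfer_time(B, T, R, L) * 1000:.1f} ms")   # ~86.9 ms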
  • 15. System Interactions – Atomic Record Append
    - Concurrent appends are serializable
      - the client specifies only the data, not the offset
      - GFS appends the data at least once atomically
      - the chosen offset is returned to the client
    - Heavily used by Google to implement files as
      - multiple-producer / single-consumer queues
      - merged results from many different clients
    - On failure, the client retries the operation
      - records are defined, intervening regions are inconsistent
      - a reader can identify and discard padding and record fragments using checksums
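A hedged sketch of the reader-side convention the slide alludes to: records carry their own length and checksum so a reader can validate each one and skip padding or fragments. The framing format here is an assumption for illustration, not the actual GFS record layout:

    import struct
    import zlib
    from typing import Iterator

    def frame(payload: bytes) -> bytes:
        """length + CRC32 + payload, so a reader can validate each record."""
        return struct.pack("!II", len(payload), zlib.crc32(payload)) + payload

    def scan_records(blob: bytes) -> Iterator[bytes]:
        """Yield valid records, skipping anything that fails the checksum."""
        pos = 0
        while pos + 8 <= len(blob):
            length, crc = struct.unpack_from("!II", blob, pos)
            payload = blob[pos + 8 : pos + 8 + length]
            if 0 < length and len(payload) == length and zlib.crc32(payload) == crc:
                yield payload
                pos += 8 + length
            else:
                pos += 1   # corrupt byte or padding: resynchronize

    if __name__ == "__main__":
        data = frame(b"alpha") + b"\x00" * 16 + frame(b"beta")   # padding in between
        print(list(scan_records(data)))                          # [b'alpha', b'beta']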
  • 16. System Interactions – Snapshot
    - Makes a copy of a file or a directory tree almost instantaneously
    - Uses copy-on-write
    - Steps
      - revoke outstanding leases on the affected chunks
      - log the operation to disk
      - duplicate the metadata, pointing to the same chunks
    - Real chunk copies are made locally on the same chunk server when first written
      - disks are about three times as fast as 100 Mb Ethernet links, so copying locally beats copying over the network
  • 17. Master Operation – Namespace Management
    - No per-directory data structure
    - No support for aliases (hard or symbolic links)
    - Locks over regions of the namespace ensure proper serialization
    - A lookup table maps full pathnames to metadata
      - prefix compression keeps it in memory
  • 18. Master Operation – Namespace Locking
    - Each node (file or directory pathname) has an associated read-write lock
    - Scenario: prevent /home/user/foo from being created while /home/user is being snapshotted to /save/user (see the sketch after this slide)
      - Snapshot acquires
        - read locks on /home and /save
        - write locks on /home/user and /save/user
      - Create acquires
        - read locks on /home and /home/user
        - a write lock on /home/user/foo
      - the conflicting locks on /home/user (write vs. read) serialize the two operations
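A hedged sketch of the pathname lock sets described above; the acquisition order and helper names are illustrative assumptions, not the master's actual code:

    from typing import List, Tuple

    def ancestors(path: str) -> List[str]:
        """/home/user/foo -> ['/home', '/home/user'] (all proper ancestors)."""
        parts = path.strip("/").split("/")
        return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    def lock_set_for_create(path: str) -> List[Tuple[str, str]]:
        return [(p, "read") for p in ancestors(path)] + [(path, "write")]

    def lock_set_for_snapshot(src: str, dst: str) -> List[Tuple[str, str]]:
        reads = set(ancestors(src) + ancestors(dst))
        return sorted((p, "read") for p in reads) + [(src, "write"), (dst, "write")]

    if __name__ == "__main__":
        print(lock_set_for_create("/home/user/foo"))
        # [('/home', 'read'), ('/home/user', 'read'), ('/home/user/foo', 'write')]
        print(lock_set_for_snapshot("/home/user", "/save/user"))
        # [('/home', 'read'), ('/save', 'read'), ('/home/user', 'write'), ('/save/user', 'write')]
        # The write lock on /home/user conflicts with create's read lock, so the two serialize.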
  • 19. Master Operation – Policies
    - Chunk creation policy
      - place new replicas on chunk servers with below-average disk utilization
      - limit the number of "recent" creations on each chunk server
      - spread replicas of a chunk across racks
    - Re-replication priority
      - chunks furthest from their replication goal first
      - chunks that are blocking client progress first
      - chunks of live files first (rather than recently deleted ones)
    - Replicas are also rebalanced periodically
  • 20. Master Operation – Garbage Collection
    - Lazy reclamation
      - the deletion is logged immediately
      - the file is renamed to a hidden name
        - removed 3 days later
        - can be undeleted by renaming it back
    - Regular scan for orphaned chunks
      - not garbage:
        - chunks referenced by the file-to-chunk mapping (the master knows all references)
        - chunk replicas are Linux files under a designated directory on each chunk server
      - the master erases the metadata of orphaned chunks
      - HeartBeat messages tell chunk servers which of their chunks can be deleted
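A hedged sketch of lazy reclamation: deletion renames the file to a hidden name, and a periodic scan reclaims hidden files older than three days. The hidden-name convention (".deleted." prefix plus timestamp) is an assumption for illustration:

    import time
    from typing import Dict, List

    GRACE_SECONDS = 3 * 24 * 3600   # "removed 3 days later"

    class Namespace:
        def __init__(self):
            self.files: Dict[str, List[int]] = {}   # pathname -> chunk handles

        def delete(self, path: str) -> None:
            """Log-and-rename instead of reclaiming storage immediately."""
            hidden = f"{path}.deleted.{int(time.time())}"
            self.files[hidden] = self.files.pop(path)

        def undelete(self, hidden: str) -> None:
            original = hidden.split(".deleted.")[0]
            self.files[original] = self.files.pop(hidden)

        def scan(self, now: float) -> List[int]:
            """Drop hidden files past the grace period; return orphaned chunk handles."""
            orphans: List[int] = []
            for name in list(self.files):
                if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_SECONDS:
                    orphans.extend(self.files.pop(name))
            return orphans   # the master erases these and tells chunk servers via HeartBeat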
  • 21. Master Operation – Garbage Collection
    - Advantages
      - simple and reliable
        - chunk creation may fail, leaving replicas the master does not know about
        - deletion messages may be lost
      - a uniform and dependable way to clean up useless replicas
      - done in batches, so the cost is amortized
      - done when the master is relatively free
      - a safety net against accidental, irreversible deletion
  • 22. Master Operation – Garbage Collection
    - Disadvantage
      - hard to fine-tune when storage is tight
    - Solutions
      - deleting a file twice explicitly expedites storage reclamation
      - different policies for different parts of the namespace
    - Stale replica detection
      - the master maintains a chunk version number to distinguish up-to-date replicas from stale ones
  • 23. Fault Tolerance – High Availability
    - Fast recovery
      - the master and chunk servers restore their state and start in seconds
      - normal and abnormal termination are not distinguished
    - Chunk replication
      - different replication levels for different parts of the file namespace
      - each chunk is kept fully replicated as chunk servers go offline or corrupted replicas are detected through checksum verification
  • 24. Fault Tolerance – High Availability
    - Master replication
      - the operation log and checkpoints are replicated
      - master failure?
        - monitoring infrastructure outside GFS starts a new master process
      - "shadow" masters
        - provide read-only access to the file system when the primary master is down
        - enhance read availability
        - each reads a replica of the growing operation log
  • 25. Fault Tolerance – Data Integrity
    - Checksums are used to detect data corruption
    - Each 64 MB chunk is broken into 64 KB blocks, each with a 32-bit checksum
    - A chunk server verifies the checksum before returning data, so errors do not propagate
    - Record append
      - incrementally update the checksum for the last partial block; any error is detected when the block is next read
    - Random write
      - read and verify the first and last blocks of the range being overwritten
      - perform the write, then compute and record the new checksums
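A hedged sketch of per-block checksumming as described on this slide: 64 KB blocks, each with a 32-bit checksum, verified before data is returned. The in-memory layout and the choice of CRC32 are assumptions for illustration:

    import zlib
    from typing import List

    BLOCK_SIZE = 64 * 1024   # 64 KB checksum blocks within a chunk

    def compute_checksums(chunk_data: bytes) -> List[int]:
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verified_read(chunk_data: bytes, checksums: List[int],
                      offset: int, length: int) -> bytes:
        """Verify every block overlapping [offset, offset+length) before returning."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for b in range(first, last + 1):
            block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != checksums[b]:
                raise IOError(f"checksum mismatch in block {b}")   # report, never return bad data
        return chunk_data[offset:offset + length]

    if __name__ == "__main__":
        data = bytes(200 * 1024)                                      # 200 KB stand-in for chunk data
        sums = compute_checksums(data)
        print(len(verified_read(data, sums, 70 * 1024, 10 * 1024)))   # 10240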
  • 26. Conclusion
    - GFS supports large-scale data processing on commodity hardware
    - It re-examines traditional file system assumptions
      - based on application workloads and the technological environment
      - treats component failures as the norm rather than the exception
      - optimizes for huge files that are mostly appended to
      - relaxes the standard file system interface
  • 27. Conclusion
    - Fault tolerance
      - constant monitoring
      - replication of crucial data
      - fast and automatic recovery
      - checksumming to detect corruption at the disk or IDE subsystem level
    - High aggregate throughput
      - control flow and data transfer are decoupled
      - master involvement is minimized by the large chunk size and by chunk leases
  • 28. Reference
    - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", SOSP 2003.
