GFS - Google File System
1. The Google File System – Tut Chi Io
2. Design Overview – Assumptions
   - Inexpensive commodity hardware
   - Large files: multi-GB
   - Workloads
     - Large streaming reads
     - Small random reads
     - Large, sequential appends
   - Concurrent appends to the same file
   - High sustained throughput matters more than low latency
3. Design Overview – Interface
   - Create
   - Delete
   - Open
   - Close
   - Read
   - Write
   - Snapshot
   - Record append
4. Design Overview – Architecture
   - Single master, multiple chunk servers, multiple clients
     - Each is a user-level process running on a commodity Linux machine
     - GFS client code is linked into each application to handle communication
   - A file is divided into fixed-size 64 MB chunks, stored as Linux files (offset-to-chunk translation sketched below)
     - On the local disks of chunk servers
     - Replicated on multiple chunk servers (3 replicas by default)
   - Clients cache metadata but not chunk data
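A minimal sketch of the client-side translation from a file byte offset to a chunk index, which the client performs before asking the master for that chunk's handle and replica locations. The helper names are hypothetical, not the real GFS client API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def chunk_index(byte_offset: int) -> int:
    """Which chunk of the file holds this byte."""
    return byte_offset // CHUNK_SIZE

def offset_within_chunk(byte_offset: int) -> int:
    """Where the byte sits inside that chunk."""
    return byte_offset % CHUNK_SIZE

# Example: byte 200_000_000 of a file is in chunk 2, at offset 65_782_272.
print(chunk_index(200_000_000), offset_within_chunk(200_000_000))
```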
5. Design Overview – Single Master
   - Why centralization? Simplicity!
   - Global knowledge is needed for
     - Chunk placement
     - Replication decisions
6. Design Overview – Chunk Size
   - 64 MB – much larger than typical file system block sizes. Why?
   - Advantages
     - Reduces client–master interaction
     - Reduces network overhead
     - Reduces the size of the metadata
   - Disadvantages
     - Internal fragmentation
       - Solution: lazy space allocation
     - Hot spots – many clients accessing a one-chunk file, e.g. executables
       - Solutions:
         - Higher replication factor
         - Stagger application start times
         - Client-to-client communication
7. Design Overview – Metadata
   - File and chunk namespaces
     - In the master's memory
     - Persisted via the operation log on the master's disk and on remote replicas
   - File-to-chunk mapping
     - In the master's memory
     - Persisted via the operation log on the master's disk and on remote replicas
   - Locations of chunk replicas
     - In the master's memory only
     - The master asks the chunk servers when
       - The master starts
       - A chunk server joins the cluster
     - If this were persistent, the master and chunk servers would have to be kept in sync
8. Design Overview – Metadata – In-Memory Data Structures
   - Why in-memory data structures for the master?
     - Speed: fast periodic scans for garbage collection and load balancing
   - Does this limit the number of chunks, and hence total capacity?
     - No: a 64 MB chunk needs less than 64 B of metadata, so 640 TB needs less than 640 MB (arithmetic checked below)
       - Most chunks are full
       - Prefix compression on file names
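A quick back-of-the-envelope check of that claim, assuming every chunk is full and at most 64 bytes of metadata per chunk:

```python
CHUNK_SIZE = 64 * 1024 * 1024           # 64 MB per chunk
METADATA_PER_CHUNK = 64                 # bytes, upper bound

data_bytes = 640 * 1024**4              # 640 TB of file data
num_chunks = data_bytes // CHUNK_SIZE   # 10,485,760 chunks
metadata_bytes = num_chunks * METADATA_PER_CHUNK

print(num_chunks, metadata_bytes / 1024**2)   # 10485760 chunks, 640.0 MB
```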
9. Design Overview – Metadata – Operation Log
   - The only persistent record of metadata
   - Defines the order of concurrent operations
   - Critical
     - Replicated on multiple remote machines
     - Respond to the client only after the log record is flushed locally and remotely (sketched below)
   - Fast recovery by using checkpoints
     - A compact B-tree-like form that maps directly into memory
     - Switch to a new log file and create the checkpoint in a separate thread
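A minimal sketch of the "flush locally and remotely before replying" rule; the log objects and their append/flush methods are assumptions, not the master's real interface:

```python
def commit_metadata_mutation(record, local_log, remote_logs):
    local_log.append(record)
    local_log.flush()               # durable on the master's local disk
    for remote in remote_logs:      # replicas of the operation log
        remote.append(record)
        remote.flush()
    return "OK"                     # only now acknowledge the client
```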
10. Design Overview – Consistency Model
   - Consistent
     - All clients see the same data, regardless of which replica they read from
   - Defined
     - Consistent, and clients see what the mutation wrote in its entirety
11. Design Overview – Consistency Model
   - After a sequence of successful mutations, a region is guaranteed to be defined
     - Mutations are applied in the same order on all replicas
     - Chunk version numbers detect stale replicas
   - Can clients cache stale chunk locations?
     - The window is limited by the cache entry's timeout
     - Most files are append-only
       - A stale replica usually returns a premature end of chunk rather than stale data
12. System Interactions – Lease
   - Minimizes management overhead at the master
   - Granted by the master to one of the replicas, which becomes the primary
   - The primary picks a serial order for all mutations, and all replicas follow it
   - 60-second timeout, can be extended
   - Can be revoked
13. System Interactions – Mutation Order (write control flow; the primary's role is sketched below)
   1. Client asks the master which chunk server holds the current lease (the primary) and where the other replicas are
   2. Master replies with the identity of the primary and the locations of the replicas (cached by the client)
   3. Client pushes the data to all replicas (3a, 3b, 3c)
   4. Client sends the write request to the primary
   5. The primary assigns a serial number to the mutation and applies it
   6. The primary forwards the write request to the secondaries
   7. The secondaries reply that the operation completed
   8. The primary replies to the client: operation completed, or an error report
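A sketch of the primary's part of this flow (steps 5–8), with assumed replica objects; the point is that one serial order is chosen at the primary and pushed to every secondary:

```python
import itertools

class Primary:
    """Sketch of a primary replica holding the lease (assumed interfaces)."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serials = itertools.count(1)

    def handle_write(self, mutation):
        serial = next(self.serials)          # step 5: pick the serial number
        self.apply(serial, mutation)         # apply to local state
        errors = []
        for sec in self.secondaries:         # step 6: forward, same order
            try:
                sec.apply(serial, mutation)  # step 7: secondary applies and acks
            except Exception as exc:
                errors.append(exc)
        # step 8: report success only if every replica applied the mutation
        return "completed" if not errors else ("error", errors)

    def apply(self, serial, mutation):
        pass                                 # write the data at the chosen offset
```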
14. System Interactions – Data Flow
   - Decouple data flow from control flow
   - Control flow
     - Master -> Primary -> Secondaries
   - Data flow
     - Pushed along a carefully picked chain of chunk servers
       - Each machine forwards to the closest machine that has not received the data yet
       - Distances are estimated from IP addresses
     - Linear chain (not a tree), so each machine's full outbound bandwidth is used rather than divided among recipients
     - Pipelined over full-duplex links
       - Time to transfer B bytes to R replicas ≈ B/T + RL (worked example below)
       - T: network throughput, L: latency between machines
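A worked example of the B/T + RL estimate with illustrative numbers; the link speed and latency here are assumptions, not measurements from the paper:

```python
# Pipelined push of one 64 MB chunk through a chain of 3 replicas.
B = 64 * 1024 * 1024      # bytes to transfer
T = 100e6 / 8             # ~100 Mbps link, expressed in bytes per second
R = 3                     # replicas in the chain
L = 1e-3                  # ~1 ms to start forwarding between two machines

elapsed = B / T + R * L   # ≈ 5.37 s; dominated by B/T, barely affected by R
print(round(elapsed, 2))
```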
15. System Interactions – Atomic Record Append
   - Concurrent appends are serializable
     - The client specifies only the data; GFS picks the offset
     - GFS appends the record at least once atomically
     - The offset is returned to the client
   - Heavily used at Google for files that serve as
     - Multiple-producer/single-consumer queues
     - Merged results from many different clients
   - On failure, the client retries the operation (retry loop sketched below)
     - The regions where the data was written are defined; intervening regions are inconsistent
     - A reader can identify and discard padding and record fragments using checksums
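A sketch of the client-side retry loop behind "at least once"; `try_append` is a hypothetical client-library call, not a real GFS API name:

```python
def record_append(gfs_file, record, max_retries=5):
    """Append `record` somewhere in the file; GFS, not the caller, picks the offset."""
    for _ in range(max_retries):
        ok, offset = gfs_file.try_append(record)   # hypothetical call
        if ok:
            return offset        # the record now exists at this offset on all replicas
        # A failed attempt may leave padding or partial duplicates on some
        # replicas; simply retry, and let readers discard padding and
        # duplicates using checksums (and record-level IDs if needed).
    raise IOError("record append failed after retries")
```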
16. System Interactions – Snapshot
   - Makes a copy of a file or a directory tree almost instantaneously
   - Uses copy-on-write (bookkeeping sketched below)
   - Steps
     - Revoke outstanding leases on the affected chunks
     - Log the operation to disk
     - Duplicate the metadata, pointing at the same chunks
   - The real chunk copy is made locally on the first write to a shared chunk
     - Disks are about three times as fast as 100 Mb Ethernet links, so copying locally beats copying over the network
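A sketch of the copy-on-write bookkeeping, using assumed in-memory structures (a file-to-chunk map and per-chunk reference counts):

```python
def snapshot(file_chunks, refcount, src, dst):
    """Snapshot only duplicates metadata and bumps chunk reference counts."""
    file_chunks[dst] = list(file_chunks[src])
    for c in file_chunks[src]:
        refcount[c] += 1                     # the chunk is now shared

def chunk_for_write(file_chunks, refcount, path, i, copy_chunk_locally):
    """On the first write to a shared chunk, make a real copy first."""
    c = file_chunks[path][i]
    if refcount[c] > 1:                      # shared with a snapshot
        new_c = copy_chunk_locally(c)        # chunk servers copy on local disk
        refcount[c] -= 1
        refcount[new_c] = 1
        file_chunks[path][i] = new_c
        c = new_c
    return c
```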
17. Master Operation – Namespace Management
   - No per-directory data structures
   - No support for aliases (hard or symbolic links)
   - Locks over regions of the namespace ensure proper serialization
   - A lookup table maps full pathnames to metadata
     - Prefix compression keeps it in memory
18. Master Operation – Namespace Locking
   - Each node (file or directory pathname) has a read-write lock
   - Scenario: prevent /home/user/foo from being created while /home/user is being snapshotted to /save/user
     - Snapshot acquires
       - Read locks on /home and /save
       - Write locks on /home/user and /save/user
     - Create acquires
       - Read locks on /home and /home/user
       - Write lock on /home/user/foo
     - The two conflict on /home/user, so the operations are serialized (see the sketch below)
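A sketch of the lock-acquisition rule, with an assumed read-write lock type and per-pathname lock table (not the master's real code): take read locks on every ancestor pathname and a write lock on the leaf.

```python
def create_file(namespace, locks, path):
    """Create `path`, holding read locks on its ancestors and a write lock on it."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    held = []
    try:
        for a in ancestors:                  # e.g. /home, then /home/user
            locks[a].acquire_read()
            held.append((locks[a], "read"))
        locks[path].acquire_write()          # e.g. /home/user/foo
        held.append((locks[path], "write"))

        namespace[path] = {}                 # the actual metadata mutation
    finally:
        for lock, mode in reversed(held):
            (lock.release_read if mode == "read" else lock.release_write)()
```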
19. Master Operation – Policies
   - New chunk creation policy
     - Place new replicas on chunk servers with below-average disk utilization
     - Limit the number of "recent" creations on each chunk server
     - Spread replicas of a chunk across racks
   - Re-replication priority
     - Chunks farthest from their replication goal first
     - Chunks that are blocking client progress first
     - Chunks of live files first (rather than recently deleted ones)
   - Rebalance replicas periodically
20. Master Operation – Garbage Collection
   - Lazy reclamation
     - The deletion is logged immediately
     - The file is renamed to a hidden name (sketched below)
       - Removed 3 days later
       - Can be undeleted by renaming it back
   - Regular scans find orphaned chunks
     - Not garbage:
       - Chunks referenced by the file-to-chunk mapping
       - Chunk replicas that exist as Linux files under a designated directory on each chunk server
     - Metadata for orphaned chunks is erased
     - HeartBeat messages tell chunk servers which of their chunks to delete
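A sketch of lazy deletion via a hidden rename; the hidden-name format and the in-memory namespace dict here are assumptions for illustration:

```python
import time

HIDDEN_PREFIX = ".deleted."
GRACE_SECONDS = 3 * 24 * 3600               # "removed 3 days later"

def delete(namespace, path):
    """Log-and-rename: nothing is reclaimed yet, and renaming back undeletes."""
    hidden = f"{HIDDEN_PREFIX}{int(time.time())}.{path}"
    namespace[hidden] = namespace.pop(path)

def scan_and_reclaim(namespace, now=None):
    """Regular namespace scan: drop hidden files older than the grace period."""
    now = time.time() if now is None else now
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            deleted_at = int(name[len(HIDDEN_PREFIX):].split(".", 1)[0])
            if now - deleted_at > GRACE_SECONDS:
                del namespace[name]          # its chunks become orphaned and are GC'd later
```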
21. Master Operation – Garbage Collection
   - Advantages
     - Simple and reliable
       - Chunk creation may fail, leaving replicas the master does not know about
       - Deletion messages may be lost
     - A uniform, dependable way to clean up unneeded replicas
     - Done in batches, so the cost is amortized
     - Done when the master is relatively free
     - A safety net against accidental, irreversible deletion
22. Master Operation – Garbage Collection
   - Disadvantage
     - Hard to fine-tune when storage is tight
   - Solutions
     - Deleting an already-deleted file again expedites storage reclamation
     - Different reclamation policies for different parts of the namespace
   - Stale replica detection
     - The master maintains a chunk version number for each chunk
23. Fault Tolerance – High Availability
   - Fast recovery
     - The master and chunk servers restore their state and start in seconds
     - No distinction between normal and abnormal termination
   - Chunk replication
     - Different replication levels for different parts of the file namespace
     - Each chunk is kept fully replicated as chunk servers go offline or corrupted replicas are detected through checksum verification
24. Fault Tolerance – High Availability
   - Master replication
     - The operation log and checkpoints are replicated
     - Master failure?
       - Monitoring infrastructure outside GFS starts a new master process
     - "Shadow" masters
       - Provide read-only access to the file system when the primary master is down
       - Enhance read availability
       - Each reads a replica of the growing operation log
25. Fault Tolerance – Data Integrity
   - Checksums detect data corruption
   - A 64 MB chunk is broken into 64 KB blocks, each with a 32-bit checksum (verification sketched below)
   - A chunk server verifies the checksum before returning data, so corruption does not propagate
   - Record append
     - Incrementally update the checksum for the last partial block; any error is detected when that block is later read
   - Random write
     - Read and verify the first and last blocks of the range being overwritten
     - Perform the write, then compute and record the new checksums
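A sketch of per-block checksumming, using zlib.crc32 as a stand-in 32-bit checksum (the slides do not specify the checksum function):

```python
import zlib

BLOCK_SIZE = 64 * 1024                      # 64 KB blocks within a 64 MB chunk

def block_checksums(chunk_data: bytes) -> list:
    """One 32-bit checksum per 64 KB block of the chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data: bytes, stored: list, block_index: int) -> bytes:
    """Verify the block's checksum before returning it, so corruption never propagates."""
    block = chunk_data[block_index * BLOCK_SIZE:(block_index + 1) * BLOCK_SIZE]
    if zlib.crc32(block) != stored[block_index]:
        # A real chunk server reports the mismatch to the master and the
        # reader is redirected to another replica; here we just raise.
        raise IOError("checksum mismatch in block %d" % block_index)
    return block
```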
26. Conclusion
   - GFS supports large-scale data processing on commodity hardware
   - Reexamines traditional file system assumptions
     - Based on the application workload and technological environment
     - Treats component failures as the norm rather than the exception
     - Optimizes for huge files that are mostly appended to
     - Relaxes the standard file system interface
27. Conclusion
   - Fault tolerance
     - Constant monitoring
     - Replication of crucial data
     - Fast, automatic recovery
     - Checksumming to detect data corruption at the disk or IDE subsystem level
   - High aggregate throughput
     - Control and data transfer are decoupled
     - Master involvement is minimized by the large chunk size and by chunk leases
28. Reference
   - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," SOSP 2003.
