Topic 11: Google Filesystem
Cloud Computing Workshop 2013, ITU

Presentation Transcript

11: Google Filesystem
Zubair Nabi (zubair.nabi@itu.edu.pk)
April 20, 2013

Outline
1. Introduction
2. Google Filesystem
3. Hadoop Distributed Filesystem

Introduction

Filesystem
The purpose of a filesystem is to:
1. Organize and store data
2. Support sharing of data among users and applications
3. Ensure persistence of data after a reboot
Examples include FAT, NTFS, ext3, ext4, etc.

Distributed Filesystem
- Self-explanatory: the filesystem is distributed across many machines
- The DFS provides a common abstraction over the dispersed files
- Each DFS has an associated API that provides a service to clients: normal file operations, such as create, read, write, etc.
- Maintains a namespace which maps logical names to physical names
- Simplifies replication and migration
- Examples include the Network Filesystem (NFS), the Andrew Filesystem (AFS), etc.

Google Filesystem

Introduction
- Designed by Google to meet its massive storage needs
- Shares many goals with previous distributed filesystems, such as performance, scalability, reliability, and availability
- At the same time, the design is driven by key observations of Google's workload and infrastructure, both current and future

Design Goals
1. Failure is the norm rather than the exception: GFS must constantly introspect and automatically recover from failure
2. The system stores a fair number of large files: optimize for large files, on the order of GBs, but still support small files
3. Applications prefer to do large streaming reads of contiguous regions: optimize for this case

Design Goals (2)
4. Most applications perform large, sequential writes that are mostly append operations: support small writes but do not optimize for them
5. Most operations are producer-consumer queues or many-way merging: support concurrent reads or writes by hundreds of clients simultaneously
6. Applications process data in bulk at a high rate: favour throughput over latency

Interface
- The interface is similar to traditional filesystems, but there is no support for a standard POSIX-like API
- Files are organized hierarchically into directories with pathnames
- Support for create, delete, open, close, read, and write operations

Architecture
- Consists of a single master and multiple chunkservers
- The system can be accessed by multiple clients
- Both the master and chunkservers run as user-space server processes on commodity Linux machines

Files
- Files are sliced into fixed-size chunks
- Each chunk is identifiable by an immutable and globally unique 64-bit handle
- Chunks are stored by chunkservers as local Linux files
- Reads and writes to a chunk are specified by a handle and a byte range (sketched below)
- Each chunk is replicated on multiple chunkservers (3 by default)

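As a rough illustration of chunk addressing (hypothetical Java names; GFS itself is unpublished C++ code), a client-side read can be modelled as a 64-bit chunk handle plus a byte range, derived from the file offset and the fixed chunk size (64 MB in the GFS paper):

    // Hypothetical sketch of how a client library might address data in GFS:
    // a 64-bit chunk handle plus a byte range within that chunk.
    public final class ChunkReadRequest {
        // Chunks are fixed-size; 64 MB is the size used in the GFS paper.
        static final long CHUNK_SIZE = 64L * 1024 * 1024;

        final long chunkHandle;   // immutable, globally unique 64-bit identifier
        final long offset;        // byte offset within the chunk
        final int length;         // number of bytes to read

        ChunkReadRequest(long chunkHandle, long offset, int length) {
            this.chunkHandle = chunkHandle;
            this.offset = offset;
            this.length = length;
        }

        // The client turns a file-level offset into (chunk index, in-chunk offset)
        // and asks the master which handle and replicas correspond to that index.
        static long chunkIndex(long fileOffset)    { return fileOffset / CHUNK_SIZE; }
        static long inChunkOffset(long fileOffset) { return fileOffset % CHUNK_SIZE; }
    }
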
Master
- In charge of all filesystem metadata: the namespace, access control information, the mapping between files and chunks, and the current locations of chunks
- Holds this information in memory and regularly syncs it with a log file
- Also in charge of chunk leasing, garbage collection, and chunk migration
- Periodically sends each chunkserver a heartbeat signal to check its state and send it instructions
- Clients interact with it to access metadata, but all data-bearing communication goes directly to the relevant chunkservers
- As a result, the master does not become a performance bottleneck
(the metadata tables are sketched below)

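A minimal sketch of the master's in-memory tables, assuming hypothetical names (per the GFS paper, the namespace and file-to-chunk mapping are persisted via the log, while chunk locations are rebuilt from chunkserver heartbeats):

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the master's in-memory metadata tables.
    public final class MasterMetadata {
        // Namespace: full pathname -> ordered list of chunk handles for that file.
        // Persisted through the log file.
        final Map<String, List<Long>> fileToChunks = new ConcurrentHashMap<>();

        // Chunk handle -> chunkservers currently holding a replica.
        // Not persisted: rebuilt from chunkserver heartbeats after a restart.
        final Map<Long, List<String>> chunkLocations = new ConcurrentHashMap<>();

        // Chunk handle -> current version number (used for stale-replica detection).
        final Map<Long, Long> chunkVersions = new ConcurrentHashMap<>();
    }
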
Consistency Model: Master
- All namespace mutations (such as file creation) are atomic, as they are handled exclusively by the master
- Namespace locking guarantees atomicity and correctness (sketched below)
- The operation log maintained by the master defines a global total order of these operations

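A hedged sketch of why namespace mutations are atomic: the master serializes them with per-path read/write locks, so a creation either appears in the namespace completely or not at all. Names are hypothetical, and the real scheme locks every ancestor directory, not just the parent:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical sketch of master-side namespace locking for an atomic create.
    public final class Namespace {
        private final Map<String, Boolean> files = new ConcurrentHashMap<>();
        private final Map<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

        private ReentrantReadWriteLock lockFor(String path) {
            return locks.computeIfAbsent(path, p -> new ReentrantReadWriteLock());
        }

        // Creating "/dir/name": read-lock the parent directory, write-lock the
        // new leaf, then mutate -- no client can observe a half-finished creation.
        public boolean create(String parentDir, String name) {
            String path = parentDir + "/" + name;
            ReentrantReadWriteLock parent = lockFor(parentDir);
            ReentrantReadWriteLock leaf = lockFor(path);
            parent.readLock().lock();
            leaf.writeLock().lock();
            try {
                return files.putIfAbsent(path, Boolean.TRUE) == null;
            } finally {
                leaf.writeLock().unlock();
                parent.readLock().unlock();
            }
        }
    }
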
Consistency Model: Data
The state after a mutation depends on:
- the mutation type: write or append
- whether it succeeds or fails
- whether there are other concurrent mutations
A file region is consistent if all clients see the same data, regardless of the replica they read from
A region is defined after a mutation if it is still consistent and clients see the mutation in its entirety

Consistency Model: Data (2)
- If there are no other concurrent writers, the region is defined and consistent
- Concurrent successful mutations leave the region undefined but consistent: it may contain mingled fragments from multiple mutations
- A failed mutation makes the region both inconsistent and undefined

Mutation Operations
- Each chunk has many replicas
- The primary replica holds a lease from the master
- The primary decides the order of all mutations for all replicas

Write Operation
1. The client obtains the locations of the replicas and the identity of the primary replica from the master
2. It then pushes the data to all replica nodes
3. The client issues an update request to the primary
4. The primary forwards the write request to all replicas
5. The primary waits for a reply from all replicas before returning to the client
(this flow is sketched below)

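A hedged sketch of this write path, with Java interfaces standing in for the RPCs between client, master, and chunkservers (all names are hypothetical):

    import java.util.List;

    // Hypothetical sketch of the write path; Java interfaces stand in for RPCs.
    public final class WriteFlow {
        interface Master {
            List<Chunkserver> replicasFor(long chunkHandle);
            Chunkserver primaryFor(long chunkHandle);   // replica holding the lease
        }

        interface Chunkserver {
            void pushData(long chunkHandle, byte[] data);  // data staged, not yet applied
            // Primary only: order the mutation, apply it locally, forward the
            // request to every secondary, and reply once all of them ack.
            boolean write(long chunkHandle, byte[] data);
        }

        static boolean write(Master master, long chunkHandle, byte[] data) {
            // 1. Obtain the replica locations and the identity of the primary.
            List<Chunkserver> replicas = master.replicasFor(chunkHandle);
            Chunkserver primary = master.primaryFor(chunkHandle);

            // 2. Push the data to all replicas (data flow decoupled from control flow).
            for (Chunkserver cs : replicas) {
                cs.pushData(chunkHandle, data);
            }

            // 3. Issue the write request to the primary; a false return means some
            //    replica failed and the region is left inconsistent.
            return primary.write(chunkHandle, data);
        }
    }
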
Record Append Operation
- Performed atomically
- The append location is chosen by GFS and communicated to the client
- The primary forwards the append request to all replicas and waits for a reply from all of them before returning to the client
- If the record fits in the current chunk, it is written and the offset is communicated to the client
- If it does not fit, the chunk is padded and the client is told to try the next chunk
(the primary's decision is sketched below)

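A hedged sketch of the decision the primary makes for each record append (hypothetical names; the 64 MB chunk size is the value from the GFS paper):

    // Hypothetical sketch of the primary's record-append decision.
    public final class RecordAppend {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;   // value from the GFS paper

        // Either the offset GFS chose for the record, or a signal that the
        // client should retry the append on the next chunk.
        record Result(boolean retryOnNextChunk, long offset) {}

        static Result append(long bytesUsedInChunk, int recordLength) {
            if (bytesUsedInChunk + recordLength <= CHUNK_SIZE) {
                // The record fits: it is written at this offset on every replica
                // and the offset is returned to the client.
                return new Result(false, bytesUsedInChunk);
            }
            // The record does not fit: the chunk is padded to its full size on
            // all replicas and the client is told to retry on the next chunk.
            return new Result(true, -1);
        }
    }
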
Application Safeguards
- Use record append rather than write
- Insert checksums in record headers to detect fragments
- Insert sequence numbers to detect duplicates
(an example record format is sketched below)

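These safeguards live in application code rather than in GFS itself. A hedged sketch of a self-describing record format an application might use (hypothetical layout, not a Google format):

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    // Hypothetical record framing on top of record append: a sequence number to
    // drop duplicates and a checksum to drop fragments and padding.
    public final class FramedRecord {
        static byte[] frame(long sequenceNumber, byte[] payload) {
            CRC32 crc = new CRC32();
            crc.update(payload);
            return ByteBuffer.allocate(8 + 8 + 4 + payload.length)
                    .putLong(sequenceNumber)
                    .putLong(crc.getValue())
                    .putInt(payload.length)
                    .put(payload)
                    .array();
        }

        // Returns the payload if the record is intact, or null for a fragment
        // (padding or a partially written record fails the checksum).
        static byte[] unframe(byte[] record) {
            ByteBuffer buf = ByteBuffer.wrap(record);
            long sequenceNumber = buf.getLong();  // caller drops already-seen numbers
            long storedCrc = buf.getLong();
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining()) return null;
            byte[] payload = new byte[len];
            buf.get(payload);
            CRC32 crc = new CRC32();
            crc.update(payload);
            return crc.getValue() == storedCrc ? payload : null;
        }
    }
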
Chunk Placement
- Place new chunks on chunkservers with below-average disk space usage
- Limit the number of "recent" creations on a chunkserver, to ensure that it does not experience a traffic spike due to its fresh data
- For reliability, spread replicas across racks
(a placement heuristic along these lines is sketched below)

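A hedged sketch of such a placement heuristic (hypothetical names; the limit on recent creations is an assumed constant, not a figure from the paper):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical sketch of the master's chunk-placement heuristic.
    public final class ChunkPlacement {
        record Server(String name, String rack, double diskUsage, int recentCreations) {}

        static final int MAX_RECENT_CREATIONS = 10;   // assumed limit, not from the paper

        static List<Server> pickReplicas(List<Server> servers, int replicas) {
            double avgUsage = servers.stream().mapToDouble(Server::diskUsage).average().orElse(0);
            List<Server> chosen = new ArrayList<>();
            Set<String> racksUsed = new HashSet<>();
            for (Server s : servers) {
                // Prefer chunkservers with below-average disk usage that have not
                // created too many chunks recently (a burst of fresh, hot chunks
                // would otherwise concentrate traffic on one machine) ...
                if (s.diskUsage() > avgUsage) continue;
                if (s.recentCreations() >= MAX_RECENT_CREATIONS) continue;
                // ... and spread replicas across racks for reliability.
                if (!racksUsed.add(s.rack())) continue;
                chosen.add(s);
                if (chosen.size() == replicas) break;
            }
            return chosen;
        }
    }
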
Garbage Collection
- Chunks become garbage when they are orphaned
- A lazy reclamation strategy is used: chunks are not reclaimed at delete time
- Each chunkserver reports a subset of its current chunks to the master in its heartbeat message
- The master pinpoints the chunks that have been orphaned
- The chunkserver then reclaims that space
(the exchange is sketched below)

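A hedged sketch of the heartbeat-driven reclamation exchange, master side only (hypothetical names):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Hypothetical sketch of lazy garbage collection via heartbeats.
    public final class GarbageCollection {
        // Chunks still reachable from some file in the master's namespace.
        private final Set<Long> liveChunks = new HashSet<>();

        public GarbageCollection(Set<Long> liveChunks) {
            this.liveChunks.addAll(liveChunks);
        }

        // The chunkserver reports (a subset of) the chunks it currently holds in
        // its heartbeat; the master replies with the ones that are orphaned, and
        // the chunkserver is then free to reclaim their space.
        public List<Long> orphansIn(List<Long> reportedChunks) {
            return reportedChunks.stream()
                    .filter(handle -> !liveChunks.contains(handle))
                    .collect(Collectors.toList());
        }
    }
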
Stale Replica Detection
- Each chunk is assigned a version number
- Each time a new lease is granted, the version number is incremented
- Stale replicas will have outdated version numbers
- They are simply garbage collected
(sketched below)

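A minimal sketch of the version bookkeeping, assuming hypothetical names:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of stale-replica detection via chunk version numbers.
    public final class ChunkVersions {
        private final Map<Long, Long> versionByChunk = new ConcurrentHashMap<>();

        // Each time the master grants a new lease on a chunk, it increments the
        // chunk's version number and informs the up-to-date replicas.
        public long grantLease(long chunkHandle) {
            return versionByChunk.merge(chunkHandle, 1L, Long::sum);
        }

        // A replica that missed mutations (e.g. it was down) still reports the
        // old version; the master treats it as stale and lets the garbage
        // collector reclaim it.
        public boolean isStale(long chunkHandle, long reportedVersion) {
            return reportedVersion < versionByChunk.getOrDefault(chunkHandle, 0L);
        }
    }
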
Hadoop Distributed Filesystem

Introduction
- Open-source clone of GFS
- Comes packaged with Hadoop
- The master is called the NameNode and chunkservers are called DataNodes
- Chunks are known as blocks
- Exposes a Java API and a command-line interface (a short Java example follows)

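The Java API is exposed through org.apache.hadoop.fs.FileSystem. A minimal usage sketch, assuming the Hadoop client libraries are on the classpath and fs.defaultFS points at a reachable NameNode (the path below is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal HDFS client sketch: write a file, then read it back.
    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/user/example/hello.txt");   // placeholder path

            // Write: the NameNode allocates blocks, DataNodes store the bytes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello, HDFS");
            }

            // Read the file back.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }

            fs.close();
        }
    }

The file written here can then be listed or displayed with the shell commands on the following slide.
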
Command-line API
- Accessible through: bin/hdfs dfs -command args
- Useful commands: cat, copyFromLocal, copyToLocal, cp, ls, mkdir, moveFromLocal, moveToLocal, mv, rm, etc. [1]

[1] http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html

References
1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). ACM, New York, NY, USA, 29-43.