Lecture 2 – Distributed Filesystems
922EU3870 – Cloud Computing and Mobile Platforms, Autumn 2009
2009/9/21, Ping Yeh (葉平), Google, Inc.
Outline
- Get to know the numbers
- Filesystems overview
- Distributed file systems
  - Basic (example: NFS)
  - Shared storage (example: Global FS)
  - Wide-area (example: AFS)
  - Fault-tolerant (example: Coda)
  - Parallel (example: Lustre)
  - Fault-tolerant and parallel (example: dCache)
- The Google File System
- Homework
Numbers real-world engineers should know
  L1 cache reference                                          0.5 ns
  Branch mispredict                                             5 ns
  L2 cache reference                                            7 ns
  Mutex lock/unlock                                           100 ns
  Main memory reference                                       100 ns
  Compress 1 KB with Zippy                                 10,000 ns
  Send 2 KB through a 1 Gbps network                       20,000 ns
  Read 1 MB sequentially from memory                      250,000 ns
  Round trip within the same data center                  500,000 ns
  Disk seek                                            10,000,000 ns
  Read 1 MB sequentially from the network              10,000,000 ns
  Read 1 MB sequentially from disk                     30,000,000 ns
  Round trip between California and the Netherlands   150,000,000 ns
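These numbers are easiest to internalize by doing the arithmetic. A minimal back-of-envelope sketch in Python, with the values copied from the table above (the constant names are just labels chosen here):

```python
# Back-of-envelope arithmetic with the numbers above (all values in nanoseconds).
READ_1MB_MEMORY_NS = 250_000
READ_1MB_DISK_NS = 30_000_000
DISK_SEEK_NS = 10_000_000

def seconds(ns):
    return ns / 1e9

# Reading 1 GB sequentially:
print("1 GB from memory: %.2f s" % seconds(1024 * READ_1MB_MEMORY_NS))   # ~0.26 s
print("1 GB from disk:   %.2f s" % seconds(1024 * READ_1MB_DISK_NS))     # ~30.7 s

# Reading the same 1 GB as 4 KB random reads (one seek each) is far worse:
random_reads = (1024 * 1024) // 4
print("1 GB in 4 KB random reads: %.0f s" % seconds(random_reads * DISK_SEEK_NS))  # ~2600 s
```

The takeaway is the one the rest of the lecture builds on: sequential disk I/O is tolerable, random disk I/O and wide-area round trips are not.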
The Joys of Real Hardware
Typical first year for a new cluster:
- ~0.5 overheating events (power down most machines in < 5 mins, ~1-2 days to recover)
- ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
- ~1 rack move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
- ~1 network rewiring (rolling ~5% of machines down over a 2-day span)
- ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
- ~5 racks go wonky (40-80 machines see 50% packet loss)
- ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
- ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
- ~3 router failures (have to immediately pull traffic for an hour)
- ~dozens of minor 30-second blips for DNS
- ~1000 individual machine failures
- ~thousands of hard drive failures
- plus slow disks, bad memory, misconfigured machines, flaky machines, etc.
File Systems Overview
- A system that permanently stores data
- Usually layered on top of a lower-level physical storage medium
- Divided into logical units called "files"
  - addressable by a filename ("foo.txt")
- Files are often organized into directories
  - usually supports hierarchical nesting (directories inside directories)
  - a path joins directories and a filename into a unique "full name" for a file
- Directories may further belong to a volume
- The set of valid paths forms the namespace of the file system
What Gets Stored
- User data itself is the bulk of the file system's contents
- The file system also stores meta-data, on a volume-wide and a per-file basis:
  - volume-wide: available space, formatting info, character set, ...
  - per-file: name, owner, modification date, physical layout, ...
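The per-file metadata is exactly what a stat() call returns. A minimal sketch in Python ("/etc/hostname" is just an example path):

```python
import os
import stat
import time

# Inspect the per-file metadata the filesystem keeps for one file.
info = os.stat("/etc/hostname")

print("owner uid/gid:    ", info.st_uid, info.st_gid)
print("mode:             ", stat.filemode(info.st_mode))   # e.g. -rw-r--r--
print("size in bytes:    ", info.st_size)
print("modification date:", time.ctime(info.st_mtime))
print("inode number:     ", info.st_ino)                   # where the metadata lives (Un*x)
```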
High-Level Organization
- Files are typically organized in a "tree" structure made of nested directories
- One directory acts as the "root"
- "Links" (symlinks, shortcuts, etc.) provide a simple way to give one file multiple access paths
- Other file systems can be "mounted" and dropped in as sub-hierarchies (other drives, network shares)
- Typical operations on a file: create, delete, rename, open, close, read, write, append (see the sketch below)
  - also lock, for multi-user systems
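For concreteness, here is what that lifecycle looks like through a POSIX-style API; a minimal Python sketch with made-up filenames:

```python
import os

# The usual lifecycle of a file (paths are examples).
with open("foo.txt", "w") as f:      # create + open + write
    f.write("hello\n")

with open("foo.txt", "a") as f:      # append
    f.write("world\n")

os.rename("foo.txt", "bar.txt")      # rename

with open("bar.txt") as f:           # open + read + close
    print(f.read())

os.remove("bar.txt")                 # delete
```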
Low-Level Organization (1/2)
- File data and meta-data are stored separately
- File descriptors and meta-data are stored in inodes (Un*x)
  - a large tree or table at a designated location on disk
  - tells how to look up the file's contents
- Meta-data may be replicated to increase system reliability
Low-Level Organization (2/2)
- The "standard" read-write medium is a hard drive (other media: CD-ROM, tape, ...)
- Viewed as a sequential array of blocks
- Usually addressed in ~1 KB chunks at a time
- The tree structure is "flattened" into blocks
- Overlapping writes/deletes can cause fragmentation: files are often not stored with a linear layout
  - inodes store all block numbers related to a file (see the lookup sketch below)
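To make the inode/block idea concrete, here is a toy lookup in Python: the block numbers are invented, and real inodes use direct and indirect pointers rather than a flat list, but the offset-to-block arithmetic is the same:

```python
BLOCK_SIZE = 1024  # bytes; "~1 KB chunk at a time"

# Toy inode: an ordered list of the disk block numbers that hold the file's data.
inode_blocks = [907, 912, 344, 345]   # made-up block numbers, not contiguous

def locate(file_offset):
    """Map a byte offset in the file to (disk block number, offset within block)."""
    index, within = divmod(file_offset, BLOCK_SIZE)
    return inode_blocks[index], within

print(locate(0))      # (907, 0)
print(locate(2500))   # (344, 452)
```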
Fragmentation
Filesystem Design Considerations
- Namespace: physical or logical
- Consistency: what to do when more than one user reads/writes the same file?
- Security: who can do what to a file? Authentication / ACLs
- Reliability: can files survive a power outage or other hardware failures?
Local Filesystems on Unix-like Systems
- Many different designs
- Namespace: the root directory "/", followed by directories and files
- Consistency: "sequential consistency"; newly written data are immediately visible to open reads (if ...)
- Security:
  - uid/gid and mode of files
  - Kerberos: tickets
- Reliability: superblocks, journaling, snapshots
  - a more reliable filesystem on top of existing filesystems: RAID
Namespace
- Physical mapping: a directory and all of its subdirectories are stored on the same physical medium
  - /mnt/cdrom
  - /mnt/disk1, /mnt/disk2, ... when you have multiple disks
- Logical volume: a logical namespace that can span multiple physical media or a partition of one medium
  - still mounted like /mnt/vol1
  - dynamic resizing by adding/removing disks without a reboot
  - splitting/merging volumes, as long as no data spans the split
Journaling
- Changes to the filesystem are logged in a journal before they are committed
  - useful when a supposedly atomic action needs two or more writes, e.g., appending to a file (update metadata + allocate space + write the data)
  - the journal can be played back to recover quickly after a hardware failure (remember the old-fashioned "scan the whole volume" fsck?)
- What to log?
  - changes to file content: heavy overhead
  - changes to metadata only: fast, but data corruption may occur
- Implementations: ext3, ReiserFS, IBM's JFS, etc. (a minimal sketch of the idea follows)
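A minimal sketch of the write-ahead idea in Python (file names and record format are invented; a real journal also records commit/checkpoint markers so that replay is idempotent):

```python
import json
import os

JOURNAL = "journal.log"   # example path

def journaled_append(path, data):
    """Log the intended change durably, then apply it."""
    record = {"op": "append", "path": path, "data": data}
    with open(JOURNAL, "a") as j:
        j.write(json.dumps(record) + "\n")
        j.flush()
        os.fsync(j.fileno())          # journal entry hits stable storage first
    with open(path, "a") as f:        # now perform the actual (multi-step) update
        f.write(data)

def replay():
    """After a crash, re-apply journaled operations instead of scanning the volume."""
    if not os.path.exists(JOURNAL):
        return
    with open(JOURNAL) as j:
        for line in j:
            rec = json.loads(line)
            if rec["op"] == "append":
                with open(rec["path"], "a") as f:
                    f.write(rec["data"])
```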
RAID
- Redundant Array of Inexpensive Disks
- Different RAID levels:
  - RAID 0 (striped disks): data is distributed among the disks for performance
  - RAID 1 (mirror): data is mirrored 1:1 on another disk for reliability
  - RAID 5 (striped disks with distributed parity): similar to RAID 0, but with a parity block for data recovery in case one disk is lost; the parity is rotated among the disks to maximize throughput (see the parity sketch after this list)
  - RAID 6 (striped disks with dual distributed parity): similar to RAID 5 but with two parity blocks; can lose two disks without losing data
  - RAID 10 (1+0, striped mirrors)
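The RAID 5 recovery trick is just XOR. A minimal sketch in Python with four-byte "blocks":

```python
# RAID 5 in miniature: the parity block is the XOR of the data blocks in a stripe,
# so any single lost block can be rebuilt from the survivors.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks in one stripe
parity = xor_blocks([d0, d1, d2])        # stored on a (rotating) fourth disk

# The disk holding d1 dies; reconstruct it from the rest of the stripe:
rebuilt = xor_blocks([d0, d2, parity])
assert rebuilt == d1
```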
Snapshot
- A snapshot is a copy of a set of files and directories at a point in time
  - read-only snapshots and read-write snapshots
  - usually done by the filesystem itself, sometimes by an LVM
  - backups can be taken from a read-only snapshot without worrying about consistency
- Copy-on-write is a simple and fast way to create snapshots (see the sketch below)
  - the current data is the snapshot
  - a request to write to a file creates a new copy, and work proceeds from there afterwards
- Implementations: UFS, Sun's ZFS, etc.
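A toy copy-on-write model in Python (block granularity and class names are invented) showing why taking a snapshot is cheap and why writes pay the copying cost:

```python
# Copy-on-write in miniature: the snapshot and the live view share blocks
# until something is written; only then does the live view get new data.

class CowFS:
    def __init__(self, blocks):
        self.blocks = dict(blocks)          # block id -> data (the live view)

    def snapshot(self):
        return dict(self.blocks)            # cheap: shares the same data objects

    def write(self, block_id, data):
        self.blocks[block_id] = data        # live view points at the new copy;
                                            # the snapshot still sees the old data

fs = CowFS({0: b"old contents"})
snap = fs.snapshot()
fs.write(0, b"new contents")
print(snap[0])        # b'old contents'  -- snapshot unaffected
print(fs.blocks[0])   # b'new contents'
```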
- Should the file system be faster or more reliable?
- But faster at what: large files? small files? lots of reading? frequent writers with occasional readers?
- Block size
  - a smaller block size reduces the amount of wasted space
  - a larger block size increases the speed of sequential reads (it may not help random access; see the sketch below)
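The wasted-space side of the block-size trade-off is easy to quantify. A small Python sketch with example file sizes:

```python
import math

def wasted_bytes(file_sizes, block_size):
    """Internal fragmentation: space allocated in whole blocks but never used."""
    allocated = sum(math.ceil(size / block_size) * block_size for size in file_sizes)
    return allocated - sum(file_sizes)

files = [100, 3_000, 50_000, 1_000_000]      # example file sizes in bytes
for bs in (1024, 4096, 65536):
    print(f"block size {bs:>6}: {wasted_bytes(files, bs):>7} bytes wasted")
```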
Distributed File Systems
Distributed Filesystems
- Support access to files on remote servers
- Must support concurrency
  - make varying guarantees about locking, who "wins" with concurrent writes, etc.
  - must gracefully handle dropped connections
- Can offer support for replication and local caching
- Different implementations sit in different places on the complexity/feature scale
DFS is useful when...
- Multiple users want to share files
- Users move from one computer to another
- Users want a uniform namespace for shared files on every computer
- Users want a centralized storage system for backup and management
- Note that a "user" of a DFS may actually be a program
Features of Distributed File Systems
Different systems have different designs and behaviors on the following features:
- Interface: file system, block I/O, or custom made
- Security: various authentication/authorization schemes
- Reliability (fault tolerance): can continue to function when some hardware fails (disks, nodes, power, etc.)
- Namespace (virtualization): can provide a logical namespace that spans physical boundaries
- Consistency: all clients get the same data all the time; related to locking, caching, and synchronization
- Parallelism: multiple clients can access multiple disks at the same time
- Scope: local area network vs. wide area network
NFS
- First developed in the 1980s by Sun
- The most widely known distributed filesystem
- Presented through the standard POSIX FS interface
- Network drives are mounted into the local directory hierarchy
  - your home directory at CINC is probably NFS-driven
  - type 'mount' some time at the prompt if curious
(figure: several clients talking to one NFS server)
NFS Protocol
- Initially completely stateless
  - operated over UDP; did not use TCP streams
  - file locking, etc., implemented in higher-level protocols
  - implication: every client request carries enough information to complete the operation
    - example: the client specifies the offset in the file to write to (see the sketch below)
    - one advantage: server state does not grow with more clients
- Modern implementations use TCP/IP and stateful protocols
- RFCs:
  - RFC 1094 for v2 (3/1989)
  - RFC 1813 for v3 (6/1995)
  - RFC 3530 for v4 (4/2003)
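The "request carries the offset" idea is the same as positioned I/O in POSIX. A minimal local analogy in Python, not the NFS wire protocol ("example.dat" is a made-up file; os.pread/os.pwrite are Unix-only):

```python
import os

# A positioned write names the file, the offset, and the data explicitly,
# instead of relying on shared state such as a current file position.
fd = os.open("example.dat", os.O_RDWR | os.O_CREAT)
os.pwrite(fd, b"payload", 4096)     # write at byte offset 4096, no seek state
print(os.pread(fd, 7, 4096))        # b'payload'
os.close(fd)
```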
NFS Overview
- Remote Procedure Calls (RPC) for communication between client and server
- Client implementation
  - provides transparent access to the NFS file system
    - UNIX contains a Virtual File System layer (VFS)
    - vnode: the interface for procedures on an individual file
  - translates vnode operations into NFS RPCs
- Server implementation
  - stateless: must not keep anything only in memory
  - implication: all modified data is written to stable storage before control returns to the client
    - servers often add NVRAM to improve performance
Server-side Implementation
- NFS defines a virtual file system
  - it does not actually manage the local disk layout on the server
- The server instantiates an NFS volume on top of a local file system
  - local hard drives are managed by concrete file systems (ext*, ReiserFS, ...)
  - local disks are exported to clients
NFS Locking
- NFS v4 supports stateful locking of files (see the sketch after this list)
  - clients inform the server of their intent to lock
  - the server can notify clients of outstanding lock requests
  - locking is lease-based: clients must continually renew locks before a timeout
  - loss of contact with the server abandons the locks
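From a client program's point of view, the lock request goes through the ordinary POSIX advisory-locking API; on an NFSv4 mount the kernel forwards it to the server's lease-based lock service. A minimal Python sketch ("shared.dat" is a made-up path; the fcntl module is Unix-only):

```python
import fcntl

with open("shared.dat", "w") as f:
    fcntl.lockf(f, fcntl.LOCK_EX)           # ask for an exclusive lock (may block)
    f.write("only one writer at a time\n")
    fcntl.lockf(f, fcntl.LOCK_UN)           # release; otherwise released at close
```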
NFS Client Caching
- NFS clients are allowed to cache copies of remote files for subsequent accesses
- Supports close-to-open cache consistency (see the sketch after this list)
  - when client A closes a file, its contents are synchronized with the master and the timestamp is changed
  - when client B opens the file, it checks that its local timestamp agrees with the server's timestamp; if not, it discards the local copy
  - concurrent readers/writers must use flags to disable caching
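A toy model of close-to-open revalidation in Python: cache file contents keyed by modification time and recheck on every open (os.stat stands in for the attribute check against the server; write-back on close is glossed over):

```python
import os

_cache = {}   # path -> (mtime, data)

def cached_open_read(path):
    server_mtime = os.stat(path).st_mtime          # "does my timestamp still agree?"
    if path in _cache and _cache[path][0] == server_mtime:
        return _cache[path][1]                     # timestamps agree: use the cache
    with open(path, "rb") as f:                    # stale or missing: refetch
        data = f.read()
    _cache[path] = (server_mtime, data)
    return data
```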
Cache Consistency
- Problem: consistency across multiple copies (the server and multiple clients)
- How to keep data consistent between client and server?
  - if a file is changed on the server, will the client see the update?
  - determining factor: the read policy on clients
- How to keep data consistent across clients?
  - if a file is written on client A and read on client B, will B see the update?
  - determining factor: the write and read policies on clients
NFS: Summary
- Very popular
- The full POSIX interface means it "drops in" very easily
- No cluster-wide uniform namespace: each client decides how to name a volume mounted from an NFS server
- No location transparency: data movement means a remount
- An NFS volume is managed by a single server
  - higher load on the central server, a bottleneck for scalability
  - but this simplifies the coherency protocol
- Challenges: fault tolerance, scalable performance, and consistency
AFS (The Andrew File System)
- Developed at Carnegie Mellon University since 1983
- Strong security: Kerberos authentication
  - richer set of access control bits than UNIX: separate "administer"/"delete" bits, application-specific bits
- Higher scalability than other distributed file systems of the time; supports 50,000+ clients at enterprise level
- Global namespace: /afs/cmu.edu/vol1/..., WAN scope
- Location transparency: moving files just works
- Heavy use of client-side caching
- Supports read-only replication
(figure: clients talking to a server and a read-only replica)
Morgan Stanley's comparison: AFS vs. NFS
- Better client/server ratio
  - NFS (circa 1993) topped out at 25:1
  - AFS: hundreds
- Robust volume replication
  - NFS servers go down, and take their clients with them
  - AFS servers go down, and nobody notices (OK, for read-only data only)
- WAN file sharing
  - NFS couldn't do it reliably
  - AFS worked like a charm
- Security was never an issue (Kerberos v4)
Source: http://www-conf.slac.stanford.edu/AFSBestPractices/Slides/MorganStanley.pdf
Local Caching
- File reads/writes operate on a locally cached copy
- The modified local copy is sent back to the server when the file is closed
- Open local copies are notified of external updates through callbacks (see the sketch after this list)
  - but they are not updated automatically
- Trade-offs:
  - shared database files do not work well on this system
  - it does not support write-through to a shared medium
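A toy model of callbacks in Python (class and method names are invented): the server remembers who cached a file and breaks the callback when the file changes; clients only invalidate, and do not refetch until the next open:

```python
class Server:
    def __init__(self):
        self.callbacks = {}                      # path -> set of clients with a cached copy

    def fetch(self, path, client):
        self.callbacks.setdefault(path, set()).add(client)
        return "contents of " + path             # placeholder data

    def store(self, path):
        for client in self.callbacks.pop(path, set()):
            client.break_callback(path)          # "your cached copy is now stale"

class Client:
    def __init__(self):
        self.cache = {}

    def open(self, path, server):
        if path not in self.cache:
            self.cache[path] = server.fetch(path, self)
        return self.cache[path]

    def break_callback(self, path):
        self.cache.pop(path, None)               # invalidate only; refetch on next open
```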
Replication
- AFS allows read-only copies of filesystem volumes
- Copies are guaranteed to be atomic checkpoints of the entire FS at the time the read-only copy is generated (a "snapshot")
- Modifying data requires access to the sole read/write volume
  - changes do not propagate to existing read-only copies
AFS: Summary
- Not quite POSIX
  - stronger security/permissions
  - no file write-through
- High availability through replicas and local caching
- Scalability by reducing the load on servers
- Not appropriate for all file types
Reference: ACM Trans. on Computer Systems, 6(1):51-81, Feb 1988.
Global File System
Lustre Overview
- Developed by Peter Braam at Carnegie Mellon since 1999
  - then at Cluster File Systems from 2001; acquired by Sun in 2007
- Goal: remove bottlenecks to achieve scalability
- Object-based: separates metadata from file data
  - metadata server(s): MDS (namespace, ACLs, attributes)
  - object storage server(s): OSS
  - clients read/write directly with the OSSes
- Consistency: the Lustre distributed lock manager
- Performance: data can be striped
- POSIX interface
(figure: clients, metadata servers (MDS), and object storage servers (OSS) with their storage targets (OST))
Key Design Issue: Scalability
- I/O throughput
  - how to avoid bottlenecks
- Metadata scalability
  - how can tens of thousands of nodes work on files in the same folder?
- Cluster recovery
  - if something fails, how can recovery happen transparently?
- Management
  - adding, removing, and replacing systems; data migration and backup
The Lustre Architecture
Metadata Service (MDS)
- All access to a file is governed by the MDS, which directly or indirectly authorizes access
- Controls the namespace and manages inodes
- A load-balanced cluster service for scalability (a well-balanced API, a stackable framework for logical MDSes, replicated MDSes)
- Journaled, batched metadata updates
Object Storage Targets (OST)
- Keep the file data objects
- File I/O service: access to the objects
- Block allocation for data objects happens on the OST, which keeps allocation distributed and scalable
- OST software modules:
  - OBD server, lock server
  - object storage driver, OBD filter
  - Portal API
Distributed Lock Manager
- Provides a generic and rich lock service
- Lock resources live in a resource database
  - resources are organized in trees
- High performance
  - the node that acquires a resource manages its tree
- Intent locking: utilization-dependent locking modes
  - write-back locks for low-utilization files
  - no locks for high-contention files; writes go to the OSS via the DLM
Metadata
File I/O
Striping
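To see how striping spreads a file across OSTs, here is a toy offset-to-object mapping in Python; the stripe size and stripe count are example values:

```python
STRIPE_SIZE = 1 << 20     # 1 MB stripe units (example value)
STRIPE_COUNT = 4          # file striped across 4 OSTs (example value)

def stripe_location(file_offset):
    """Map a byte offset in a striped file to (OST index, offset in that OST's object)."""
    unit, within = divmod(file_offset, STRIPE_SIZE)
    ost = unit % STRIPE_COUNT                          # round-robin across OSTs
    obj_offset = (unit // STRIPE_COUNT) * STRIPE_SIZE + within
    return ost, obj_offset

print(stripe_location(0))              # (0, 0)
print(stripe_location(5 * (1 << 20)))  # (1, 1048576): 6th stripe unit, 2nd pass over the OSTs
```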
Lustre Summary
- Good scalability and high performance
  - Lawrence Livermore National Lab's BlueGene/L has 200,000 client processes accessing a 2.5 PB Lustre filesystem through 1,024 I/O nodes over a 1 Gbit Ethernet network at 35 GB/s
- Reliability
  - journaled metadata in the metadata cluster
  - OSTs are responsible for their own data
- Consistency
Other file systems worth mentioning
- Coda: post-AFS, pre-Lustre, from CMU
- Global File System: a shared-disk file system
- GPFS by IBM
- ZFS by Sun