Distributed File Systems
Distributed file systems lecture I gave for Andrew Grimshaw's Distributed systems course in the Spring of 2009

Published in: Technology

  • File is a named collection of related information that is recorded on some permanent storage
  • Access transparency – Use same mechanism to access file whether it is local or remote, i.e., map remote files into local file system name space
  • Harder to mark failure. Performance issues – caching etc.? Can’t really do
  • COTS. Legacy apps. Does it really matter where it is? One copy versus many copies, etc.
  • Lampson – hints for push model
  • A server exports one or more of its directories to remote clients. Clients access exported directories by mounting them. The contents are then accessed as if they were local.
  • Pros: server is stateless, i.e. no state about open files. Cons: locking is difficult, no concurrency control.
  • No consistency semantics – things marked dirty are flushed within 30 seconds. Checks non-dirty items every 5 seconds. Bad performance with heavy load, etc.
  • User process wants to open a file with a pathname P. The kernel resolves that it’s a Vice file and passes it to Venus on that workstation. One of the LWPs uses the cache to examine each directory component D of P… When processing a pathname component, Venus identifies the server to be contacted by examining the volume field of the Fid. Authentication, protection checking, and network failures complicate the matter considerably.
  • Transcript

    • 1. 03-10-09 Some slides are taken from Professor Grimshaw, Ranveer Chandra, Krasimira Kapitnova, etc
    • 2.
      • 3rd-year graduate student working with Professor Grimshaw
      • Interests lie in Operating Systems, Distributed Systems, and more recently Cloud Computing
      • Also
        • Trumpet
        • Sporty things
        • Hardware Junkie
      I like tacos … a lot
    • 3.
      • File System refresher
      • Basic Issues
        • Naming / Transparency
        • Caching
        • Coherence
        • Security
        • Performance
      • Case Studies
        • NFS v3 - v4
        • Lustre
        • AFS 2.0
    • 4.
      • What is a file system?
      • Why have a file system?
      Mmmm, refreshing File Systems
    • 5.
        • Must have
          • Name e.g. “/home/sosa/DFSSlides.ppt”
          • Data – some structured sequence of bytes
        • Tend to also have
          • Size
          • Protection Information
          • Non-symbolic identifier
          • Location
          • Times, etc
    • 6.
      • A container abstraction to help organize files
        • Generally hierarchical (tree) structure
        • Often a special type of file
      • Directories have a
        • Name
        • Files and directories (if hierarchical) within them
      A large container for tourists
    • 7.
      • Two approaches to sharing files
      • Copy-based
        • Application explicitly copies files between machines
        • Examples: UUCP, FTP, gridFTP, {.*} FTP, Rcp, Scp, etc.
      • Access transparency – i.e. Distributed File Systems
      Sharing is caring
    • 8.
      • Basic idea
        • Find a copy
          • naming is based on machine name of source, user id, and path
        • Transfer the file to the local file system
          • scp .
        • Read/write
        • Copy back if modified
      • Pros and Cons?
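The copy-based steps above can be sketched in a few lines. This is only an illustration: the "remote" machine is simulated by a second local directory, and `edit_by_copy` and `append_line` are hypothetical helper names, not part of any tool named on the slide.

```python
import shutil
import tempfile
from pathlib import Path


def edit_by_copy(remote: Path, workdir: Path, edit) -> bool:
    """Copy `remote` into `workdir`, edit the local copy, copy back if changed."""
    local = workdir / remote.name
    shutil.copyfile(remote, local)        # transfer the file to the local file system
    before = local.read_bytes()
    edit(local)                           # ordinary local read/write
    modified = local.read_bytes() != before
    if modified:
        shutil.copyfile(local, remote)    # copy back only if modified
    return modified


def append_line(path: Path) -> None:
    """Example edit: append one line to the local copy."""
    with path.open("a") as f:
        f.write("edited locally\n")


def demo() -> str:
    """Run the whole cycle against a simulated remote directory."""
    with tempfile.TemporaryDirectory() as tmp:
        remote_dir = Path(tmp) / "remote"
        local_dir = Path(tmp) / "local"
        remote_dir.mkdir()
        local_dir.mkdir()
        remote = remote_dir / "notes.txt"
        remote.write_text("original\n")
        edit_by_copy(remote, local_dir, append_line)
        return remote.read_text()
```

Nothing here stops a second user from copying the same file in the meantime, which is where the inconsistent copies come from.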
    • 9.
      • Pros
        • Semantics are clear
        • No OS/library modification
      • Cons?
        • Deal with model
        • Have to copy whole file
        • Inconsistencies
        • Inconsistent copies all over the place
        • Others?
    • 10.
      • Mechanism to access remote files is the same as for local files (i.e. through the file system hierarchy)
      • Why is this better?
      • … enter Distributed File Systems
    • 11.
      • A Distributed File System is a file system that may have files on more than one machine
      • Distributed File Systems take many forms
        • Network File Systems
        • Parallel File Systems
        • Access Transparent Distributed File Systems
      • Why distribute?
    • 12.
      • Sharing files with other users
        • Others can access your files
        • You can have access to files you wouldn’t regularly have access to
      • Keeping files available for yourself on more than one computer
      • Small amount of local resources
      • High failure rate of local resources
      • Can eliminate version problems (same file copied around with local edits)
    • 13.
      • Naming
      • Performance
      • Caching
      • Consistency Semantics
      • Fault Tolerance
      • Scalability
    • 14.
      • What does a DFS look like to the user?
        • Mount-like protocol, e.g. /../mntPointToBobsSharedFolder/file.txt
        • Unified namespace. Everything looks like it is in the same namespace
      • Pros and Cons?
    • 15.
      • Location transparency
        • Name does not hint at physical location
        • Mount points are not transparent
      • Location Independence
        • File name does not need to be changed when the file’s physical storage location changes
      • Independence without transparency?
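One way to picture location independence is a lookup service between names and servers, so that moving a file changes the mapping and never the name. The sketch below assumes a hypothetical in-memory `LocationService`; the server names are made up.

```python
class LocationService:
    """Maps location-independent file names to the server currently storing them."""

    def __init__(self):
        self._where = {}  # file name -> server name

    def place(self, name, server):
        self._where[name] = server

    def migrate(self, name, new_server):
        """Move a file to another server; its name stays the same."""
        if name not in self._where:
            raise KeyError(name)
        self._where[name] = new_server

    def resolve(self, name):
        return self._where[name]
```

A mount-point path such as `/../mntPointToBobsSharedFolder/file.txt` would instead have to change whenever the file moved to a different export.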
    • 16.
      • Generally trade off the benefits of a DFS against some performance hits
        • How much depends on workload
        • Always look at workload to figure out what mechanisms to use
      • What are some ways to improve performance?
    • 17.
      • Single architectural feature that contributes most to performance in a DFS!!!
      • Single greatest cause of heartache for programmers of DFS’s
        • Maintaining consistency semantics more difficult
        • Has a large potential impact on scalability
    • 18.
      • Size of the cached units of data
        • Larger sizes make more efficient use of the network – spatial locality, latency
        • Whole files simplify semantics but can’t store very large files locally
        • Small files
      • Who does what
        • Push vs Pull
        • Important for maintaining consistency
    • 19.
      • Different DFS’s have different consistency semantics
        • UNIX semantics
        • On Close semantics
        • Timeout semantics (at least x-second up-to date)
      • Pros / Cons?
    • 20.
      • Can replicate
        • Fault Tolerance
        • Performance
      • Replication is inherently location-opaque i.e. we need location independence in naming
      • Different forms of replication mechanisms, different consistency semantics
        • Tradeoffs, tradeoffs, tradeoffs
    • 21.
      • Mount-based DFS
        • NFS version 3
        • Others include SMB, CIFS, NFS version 4
      • Parallel DFS
        • Lustre
        • Others include HDFS, Google File System, etc
      • Non-Parallel Unified Namespace DFS’s
        • Sprite
        • AFS version 2.0 (basis for many other DFS’s)
          • Coda
          • AFS 3.0
    • 22.
    • 23.
      • Most commonly used DFS ever!
      • Goals
        • Machine & OS Independent
        • Crash Recovery
        • Transparent Access
        • “Reasonable” Performance
      • Design
        • All machines can be both clients and servers
        • RPC (originally on top of UDP; later versions also run over TCP)
          • Open Network Computing Remote Procedure Call
          • External Data Representation (XDR)
        • Stateless Protocol
    • 24.
    • 25.
      • Client sends path name to server with request to mount
      • If path is legal and exported, server returns file handle
        • Contains FS type, disk, i-node number of directory, security info
        • Subsequent accesses use file handle
      • Mount can be either at boot or automount
        • Automount: Directories mounted on-use
        • Why helpful?
      • Mount only affects client view
    • 26.
      • Mounting (part of) a remote file system in NFS.
    • 27.
      • Mounting nested directories from multiple servers in NFS.
    • 28.
      • Supports directory and file access via remote procedure calls (RPCs)
      • All UNIX system calls supported other than open & close
      • Open and close are intentionally not supported
        • For a read, client sends lookup message to server
        • Lookup returns file handle but does not copy info in internal system tables
        • Subsequently, read contains file handle, offset and num bytes
        • Each message is self-contained – flexible, but?
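A minimal sketch of the stateless read path described above: lookup returns a handle, and every read carries the handle, offset, and byte count, so the server keeps nothing per client. The `StatelessServer` class and its `fh-N` handle strings are hypothetical; real NFS handles are opaque to the client.

```python
class StatelessServer:
    """Toy server with no open-file table: each request is self-contained."""

    def __init__(self, files):
        self._handle_of = {}   # path -> handle
        self._data_of = {}     # handle -> file contents
        for i, path in enumerate(sorted(files)):
            handle = f"fh-{i}"
            self._handle_of[path] = handle
            self._data_of[handle] = files[path]

    def lookup(self, path):
        """Return a handle; nothing is recorded about the caller."""
        return self._handle_of[path]

    def read(self, handle, offset, count):
        """Return up to `count` bytes starting at `offset`."""
        return self._data_of[handle][offset:offset + count]


def read_all(server, path, chunk=4):
    """Client side: the client, not the server, tracks the current offset."""
    handle = server.lookup(path)
    offset, parts = 0, []
    while True:
        data = server.read(handle, offset, chunk)
        if not data:
            break
        parts.append(data)
        offset += len(data)
    return b"".join(parts)
```

Because each request names the file and position explicitly, a restarted server can answer the next read with no recovery step. The cost is that the server cannot track opens, which is why locking is hard.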
    • 29.
      • Reading data from a file in NFS version 3.
      • Reading data using a compound procedure in version 4.
    • 30.
      • Some general mandatory file attributes in NFS.
      Attribute   Description
      TYPE        The type of the file (regular, directory, symbolic link)
      SIZE        The length of the file in bytes
      CHANGE      Indicator for a client to see if and/or when the file has changed
      FSID        Server-unique identifier of the file's file system
    • 31.
      • Some general recommended file attributes.
      Attribute      Description
      ACL            An access control list associated with the file
      FILEHANDLE     The server-provided file handle of this file
      FILEID         A file-system unique identifier for this file
      FS_LOCATIONS   Locations in the network where this file system may be found
      OWNER          The character-string name of the file's owner
      TIME_ACCESS    Time when the file data were last accessed
      TIME_MODIFY    Time when the file data were last modified
      TIME_CREATE    Time when the file was created
    • 32.
      • All communication done in the clear
      • Client sends user id and group id of the request to the NFS server
      • Discuss
    • 33.
      • Consistency semantics are dirty
        • Checks non-dirty items every 5 seconds
        • Things marked dirty flushed within 30 seconds
      • Performance under load is horrible, why?
      • Cross-mount hell - paths to files different on different machines
      • ID mismatch between domains
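The two timers above can be illustrated with a toy client cache. This is a sketch under assumptions, not the NFS client algorithm: `TimeoutCache` is hypothetical, the server is a plain dict of `(version, data)` pairs, and the clock is injected so the behaviour can be stepped through by hand.

```python
class TimeoutCache:
    CHECK_INTERVAL = 5.0    # revalidate clean entries this often (seconds)
    FLUSH_INTERVAL = 30.0   # write dirty entries back after this long (seconds)

    def __init__(self, server, clock):
        self.server = server   # stand-in server: name -> (version, data)
        self.clock = clock     # callable returning the current time in seconds
        self.entries = {}      # name -> {"data", "version", "checked_at", "dirty_since"}

    def read(self, name):
        now = self.clock()
        entry = self.entries.get(name)
        stale = (entry is not None and entry["dirty_since"] is None
                 and now - entry["checked_at"] >= self.CHECK_INTERVAL)
        if entry is None or stale:
            version, data = self.server[name]
            if entry is None or entry["version"] != version:
                entry = {"data": data, "version": version, "dirty_since": None}
                self.entries[name] = entry
            entry["checked_at"] = now
        return entry["data"]

    def write(self, name, data):
        now = self.clock()
        entry = self.entries.setdefault(
            name, {"data": b"", "version": 0, "checked_at": now, "dirty_since": None})
        entry["data"] = data
        if entry["dirty_since"] is None:
            entry["dirty_since"] = now

    def flush_due(self):
        """Write back every entry that has been dirty for FLUSH_INTERVAL or longer."""
        now = self.clock()
        for name, entry in self.entries.items():
            dirty_since = entry["dirty_since"]
            if dirty_since is not None and now - dirty_since >= self.FLUSH_INTERVAL:
                version = self.server.get(name, (0, b""))[0] + 1
                self.server[name] = (version, entry["data"])
                entry.update(version=version, checked_at=now, dirty_since=None)
```

Between checks a reader can see data up to five seconds old, and another client's write can sit unflushed for up to thirty seconds. That window is what makes the semantics hard to reason about.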
    • 34.
      • Goals
        • Improved Access and good performance on the Internet
        • Better Scalability
        • Strong Security
        • Cross-platform interoperability and ease of extension
    • 35.
      • Stateful Protocol (Open + Close)
      • Compound Operations (Fully utilize bandwidth)
      • Lease-based Locks (Locking built-in)
      • “Delegation” to clients (Less work for the server)
      • Close-to-Open Cache Consistency (Timeouts still for attributes and directories)
      • Better security
    • 36.
      • Borrowed model from CIFS (Common Internet File System) see MS
      • Open/Close
        • Opens do lookup, create, and lock all in one (what a deal)!
        • Locks / delegation (explained later) released on file close
        • Always a notion of a “current file handle” i.e. see pwd
    • 37.
      • Problem: Normal filesystem semantics have too many RPC’s (boo)
      • Solution: Group many calls into one call (yay)
      • Semantics
        • Run sequentially
        • Fails on first failure
        • Returns status of each individual RPC in the compound response (up to and including the first failure, or all of them on success)
      Compound Kitty
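The compound semantics listed above (run sequentially, stop on first failure, report a status per operation) can be sketched as a small evaluator. `run_compound` is a hypothetical helper and the operation names in the usage note are only labels. This is not the NFSv4 wire format.

```python
def run_compound(ops):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns one (name, status, value) tuple per operation that was attempted.
    """
    results = []
    for name, fn in ops:
        try:
            results.append((name, "OK", fn()))
        except Exception as exc:
            results.append((name, f"ERR: {exc}", None))
            break   # later operations are never attempted
    return results
```

For example, a compound of `PUTROOTFH`, `LOOKUP`, `READ` whose lookup fails returns two results, and the read is never executed.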
    • 38.
      • Both byte-range and file locks
      • Heartbeats keep locks alive (renew lock)
      • If server fails, waits at least the agreed upon lease time (constant) before accepting any other lock requests
      • If client fails, locks are released by server at the end of lease period
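A rough sketch of lease-based locking as described above, with an injected clock. `LeaseLockServer` and its `recovering` flag are hypothetical. The sketch simply refuses all requests during the post-restart wait, which is a simplification.

```python
class LeaseLockServer:
    """Toy lock server: a lock is valid only until its lease expires."""

    def __init__(self, lease_seconds, clock, recovering=False):
        self.lease = lease_seconds
        self.clock = clock
        self.locks = {}   # resource -> (client, expiry time)
        # After a restart the server has forgotten its locks, so it refuses
        # requests for one lease period to let any old leases run out.
        self.grace_until = clock() + lease_seconds if recovering else clock()

    def acquire(self, resource, client):
        now = self.clock()
        if now < self.grace_until:
            return False
        holder = self.locks.get(resource)
        if holder is not None and holder[0] != client and now < holder[1]:
            return False   # someone else holds a live lease
        self.locks[resource] = (client, now + self.lease)
        return True

    def renew(self, resource, client):
        """Heartbeat: extend the lease if the client still holds a live lock."""
        now = self.clock()
        holder = self.locks.get(resource)
        if holder is None or holder[0] != client or now >= holder[1]:
            return False
        self.locks[resource] = (client, now + self.lease)
        return True
```

A client that crashes simply stops renewing. Once its lease runs out, the next `acquire` from another client succeeds without any explicit cleanup message.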
    • 39.
      • Tells client no one else has the file
      • Client exposes callbacks
    • 40.
      • Any open that happens after a close finishes sees data consistent with that last close
      • Last close wins the competition
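As an illustration of these semantics, the sketch below fetches on open and writes back on close, so the last close determines what later opens see. `CloseToOpenClient` is hypothetical and the server is a plain dict. Real clients revalidate attributes on open rather than refetching whole files.

```python
class CloseToOpenClient:
    """Toy client: fetch on open, write back on close if modified."""

    def __init__(self, server):
        self.server = server    # stand-in server: name -> bytes
        self.open_files = {}    # name -> [data, dirty flag]

    def open(self, name):
        self.open_files[name] = [self.server.get(name, b""), False]

    def read(self, name):
        return self.open_files[name][0]

    def write(self, name, data):
        self.open_files[name] = [data, True]   # stays local until close

    def close(self, name):
        data, dirty = self.open_files.pop(name)
        if dirty:
            self.server[name] = data           # the last close wins
```

Two clients that write the same file concurrently do not see each other's changes until they reopen, and whichever closes last overwrites the other.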
    • 41.
      • Uses the GSS-API framework
      • All id’s are formed with
        • [email_address]
        • [email_address]
      • Every implementation must have Kerberos v5
      • Every implementation must have LIPKey
    • 42.
      • Replication / Migration mechanism added
        • Special error messages to indicate migration
        • Special attribute for both replication and migration that gives the other / new location
        • May have read-only replicas
    • 43.
    • 44.
      • People don’t like to move
      • Requires Kerberos (the death of many good distributed file systems)
      • Looks just like V3 to end-user and V3 is good enough 
    • 45.
    • 46.
      • Need for a file system for large clusters that has the following attributes
        • Highly scalable > 10,000 nodes
        • Provide petabytes of storage
        • High throughput (100 GB/sec)
      • Datacenters have different needs so we need a general-purpose back-end file system
    • 47.
      • Open-source object-based cluster file system
      • Fully compliant with POSIX
      • Features (i.e. what I will discuss)
        • Object Protocols
        • Intent-based Locking
        • Adaptive Locking Policies
        • Aggressive Caching
    • 48.
    • 49.
    • 50.
    • 51.
    • 52.
    • 53.
      • Policy depends on context
      • Mode 1: Performing operations on something that mostly only they use (e.g. /home/username)
      • Mode 2: Performing operations on a highly contentious Resource (e.g. /tmp)
      • DLM capable of granting locks on an entire subtree and whole files
    • 54.
      • POSIX
      • Keeps local journal of updates for locked files
        • One per file operation
        • Hard linked files get special treatment with subtree locks
      • Lock revoked -> updates flushed and replayed
      • Use subtree change times to validate cache entries
      • Additionally features collaborative caching -> referrals to other dedicated cache service
    • 55. Security
      • Supports GSS-API
        • Supports (does not require) Kerberos
        • Supports PKI mechanisms
      • Did not want to be tied down to one mechanism
    • 56.
    • 57.
      • Named after Andrew Carnegie and Andrew Mellon
        • Transarc Corp. and then IBM took over development of AFS
        • In 2000 IBM made OpenAFS available as open source
      • Goals
        • Large scale (thousands of servers and clients)
        • User mobility
        • Scalability
        • Heterogeneity
        • Security
        • Location transparency
        • Availability
    • 58.
      • Features:
        • Uniform name space
        • Location independent file sharing
        • Client side caching with cache consistency
        • Secure authentication via Kerberos
        • High availability through automatic switchover of replicas
        • Scalability to span 5000 workstations
    • 59.
      • Based on the upload/download model
        • Clients download and cache files
        • Server keeps track of clients that cache the file
        • Clients upload files at end of session
      • Whole file caching is key
        • Later amended to block operations (v3)
        • Simple and effective
      • Kerberos for Security
      • AFS servers are stateful
        • Keep track of clients that have cached files
        • Recall files that have been modified
    • 60.
      • Clients have partitioned name space:
        • Local name space and shared name space
        • Cluster of dedicated servers (Vice) present shared name space
        • Clients run Virtue protocol to communicate with Vice
    • 61.
    • 62.
      • AFS’s storage is arranged in volumes
        • Usually associated with files of a particular client
      • AFS dir entry maps Vice files/dirs to a 96-bit fid
        • Volume number
        • Vnode number: index into i-node array of a volume
        • Uniquifier: allows reuse of vnode numbers
      • Fids are location transparent
        • File movements do not invalidate fids
      • Location information kept in volume-location database
        • Volumes migrated to balance available disk space, utilization
        • Volume movement is atomic; operation aborted on server crash
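The fid layout above can be sketched as three fixed-width integers. The sketch assumes each field is 32 bits (which adds up to the stated 96) and big-endian packing. The `locate` helper and the dict standing in for the volume-location database are hypothetical.

```python
import struct
from typing import NamedTuple


class Fid(NamedTuple):
    volume: int       # which volume the file lives in
    vnode: int        # index into the volume's i-node array
    uniquifier: int   # distinguishes successive reuses of a vnode number

    def pack(self) -> bytes:
        """Encode as 3 x 32-bit unsigned integers = 96 bits (assumed layout)."""
        return struct.pack(">III", *self)

    @classmethod
    def unpack(cls, raw: bytes) -> "Fid":
        return cls(*struct.unpack(">III", raw))


def locate(fid: Fid, volume_location_db: dict) -> str:
    """Find the server through the volume field; the fid itself names no server."""
    return volume_location_db[fid.volume]
```

Moving a volume only updates the database entry, so every fid that refers to it stays valid.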
    • 63.
      • User process -> open file F
      • The kernel resolves that it’s a Vice file -> passes it to Venus
      • D is in the cache & has callback -> use it without any network communication
      • D is in cache but has no callback -> contact the appropriate server for a new copy; establish callback
      • D is not in cache -> fetch it from the server; establish callback
      • File F is identified -> create a current cache copy
      • Venus returns to the kernel, which opens F and returns its handle to the process
    • 64.
      • AFS caches entire files from servers
        • Client interacts with servers only during open and close
      • OS on client intercepts calls, passes to Venus
        • Venus is a client process that caches files from servers
        • Venus contacts Vice only on open and close
        • Reads and writes bypass Venus
      • Works due to callback:
        • Server updates state to record caching
        • Server notifies client before allowing another client to modify
        • Clients lose their callback when someone writes the file
      • Venus caches dirs and symbolic links for path translation
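A minimal sketch of the callback mechanism, assuming whole-file caching and in-process calls in place of RPCs. The class names borrow Vice and Venus from the slides, but every method name here is hypothetical.

```python
class Vice:
    """Toy server: remembers which clients cache each file (stateful)."""

    def __init__(self):
        self.files = {}       # name -> bytes
        self.callbacks = {}   # name -> set of clients holding a callback

    def fetch(self, name, client):
        self.callbacks.setdefault(name, set()).add(client)   # establish callback
        return self.files[name]

    def store(self, name, data, writer):
        # Notify every other caching client before accepting the new contents.
        for client in self.callbacks.pop(name, set()):
            if client is not writer:
                client.break_callback(name)
        self.files[name] = data
        self.callbacks[name] = {writer}


class Venus:
    """Toy client cache manager: contacts the server only on open and close."""

    def __init__(self, vice):
        self.vice = vice
        self.cache = {}
        self.valid = set()   # names whose callback is still intact
        self.fetches = 0

    def open(self, name):
        if name not in self.valid:           # not cached, or callback lost
            self.cache[name] = self.vice.fetch(name, self)
            self.valid.add(name)
            self.fetches += 1
        return self.cache[name]

    def close(self, name, data=None):
        if data is not None:                 # file was modified during the session
            self.cache[name] = data
            self.vice.store(name, data, self)
            self.valid.add(name)

    def break_callback(self, name):
        self.valid.discard(name)
```

Repeated opens of an unchanged file cost no server traffic. The price is that the server must track callback state for every caching client.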
    • 65.
      • The use of local copies when opening a session in Coda.
    • 66.
      • A descendant of AFS v2 (AFS v3 went another way with large chunk caching)
      • Goals
        • More resilient to server and network failures
        • Constant Data Availability
        • Portable computing
    • 67.
      • Keeps whole file caching, callbacks, end-to-end encryption
      • Adds full server replication
      • General Update Protocol
        • Known as Coda Optimistic Protocol
        • COP1 (first phase) performs actual semantic operation to servers (using multicast if available)
        • COP2 sends a data structure called an update set which summarizes the client’s knowledge. These messages are piggybacked on later COP1’s
    • 68.
      • Disconnected Operation (KEY)
        • Hoarding
          • Periodically reevaluates which objects merit retention in the cache (hoard walking)
          • Relies on both implicit and a lot of explicit info (profiles etc)
        • Emulating i.e. maintaining a replay log
        • Reintegration – replay the replay log against the servers
      • Conflict Resolution
        • Gives repair tool
        • Log to give to user to manually fix issue
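The hoard / emulate / reintegrate cycle can be sketched as follows. This is a simplification under stated assumptions: `DisconnectedClient` is hypothetical, the server is a dict of `(version, data)` pairs, the replay log keeps one entry per file, and a conflict is any file whose server version changed while the client was away.

```python
class DisconnectedClient:
    def __init__(self, server):
        self.server = server    # stand-in server: name -> (version, data)
        self.cache = {}         # hoarded copies: name -> (version, data)
        self.replay_log = {}    # name -> (base version, latest data)
        self.connected = True

    def hoard(self, names):
        """Hoarding: fetch the objects worth keeping before disconnecting."""
        for name in names:
            self.cache[name] = self.server[name]

    def write(self, name, data):
        version = self.cache[name][0]    # KeyError means a miss on a non-hoarded file
        if self.connected:
            self.server[name] = self.cache[name] = (version + 1, data)
        else:
            # Emulation: serve from the cache and record the update for later.
            base = self.replay_log.get(name, (version, None))[0]
            self.replay_log[name] = (base, data)
            self.cache[name] = (version, data)

    def reintegrate(self):
        """Replay logged updates; return those that conflict with server changes."""
        self.connected = True
        conflicts = []
        for name, (base, data) in self.replay_log.items():
            if self.server[name][0] == base:
                self.server[name] = self.cache[name] = (base + 1, data)
            else:
                conflicts.append((name, data))   # left for manual repair
        self.replay_log = {}
        return conflicts
```

The returned conflict list plays the role of the log handed to the user for manual repair.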
    • 69.
      • The state-transition diagram of a Coda client with respect to a volume.
    • 70.
      • AFS deployments in academia and government (100’s)
      • Security model required Kerberos
        • Many organizations not willing to make the costly switch
      • AFS (but not Coda) was not integrated into Unix FS. Separate “ls”, different – though similar – API
      • Session semantics not appropriate for many applications
    • 71.
      • Goals
        • Efficient use of large main memories
        • Support for multiprocessor workstations
        • Efficient network communication
        • Diskless Operation
        • Exact Emulation of UNIX FS semantics
      • Location transparent UNIX FS
    • 72.
      • Naming
        • Local prefix table which maps path-name prefixes to servers
        • Cached locations
        • Otherwise there is location embedded in remote stubs in the tree hierarchy
      • Caching
        • Needs sequential consistency
        • If one client wants to write, disables caching on all open clients. Assumes this isn’t very bad since this doesn’t happen often
      • No security between kernels. All over trusted network
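The prefix-table lookup under Naming can be sketched as a longest-prefix match. The `resolve` function and the table contents are hypothetical. How Sprite fills in missing entries is not shown.

```python
def resolve(prefix_table, path):
    """Return (server, remaining path) using the longest matching prefix."""
    best = None
    for prefix in prefix_table:
        base = prefix.rstrip("/")
        if path == prefix or path.startswith(base + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise LookupError(path)
    remainder = path[len(best.rstrip("/")):] or "/"
    return prefix_table[best], remainder
```

With a table such as `{"/": "srvA", "/users": "srvB"}`, a path under `/users` goes to `srvB` and everything else falls through to `srvA`.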
    • 73.
      • The best way to implement something depends very highly on the goals you want to achieve
      • Always start with goals before deciding on consistency semantics