Distributed File Systems

Distributed file systems lecture I gave for Andrew Grimshaw's Distributed systems course in the Spring of 2009

  • File is a named collection of related information that is recorded on some permanent storage
  • Access transparency – Use same mechanism to access file whether it is local or remote, i.e., map remote files into local file system name space
  • Harder to mark failure. Performance issues (caching, etc.)? Can’t really do.
  • COTS. Legacy apps. Does it really matter where it is? One copy versus many copies, etc.
  • Lampson –hints for push model
  • A server exports one or more of its directories to remote clients. Clients access exported directories by mounting them. The contents are then accessed as if they were local.
  • Pros: server is stateless, i.e. no state about open files. Cons: locking is difficult, no concurrency control.
  • No consistency semantics: things marked dirty are flushed within 30 seconds; non-dirty items are checked every 5 seconds. Bad performance under heavy load, etc.
  • User process wants to open a file with a pathname P. The kernel resolves that it’s a Vice file and passes it to Venus on that workstation. One of the LWPs uses the cache to examine each directory component D of P… When processing a pathname component, Venus identifies the server to be contacted by examining the volume field of the Fid. Authentication, protection checking, and network failures complicate the matter considerably.


  • 1. 03-10-09 Some slides are taken from Professor Grimshaw, Ranveer Chandra, Krasimira Kapitnova, etc
  • 2.
    • 3rd-year graduate student working with Professor Grimshaw
    • Interests lie in Operating Systems, Distributed Systems, and more recently Cloud Computing
    • Also
      • Trumpet
      • Sporty things
      • Hardware Junkie
    I like tacos … a lot
  • 3.
    • File System refresher
    • Basic Issues
      • Naming / Transparency
      • Caching
      • Coherence
      • Security
      • Performance
    • Case Studies
      • NFS v3 - v4
      • Lustre
      • AFS 2.0
  • 4.
    • What is a file system?
    • Why have a file system?
    Mmmm, refreshing File Systems
  • 5.
      • Must have
        • Name e.g. “/home/sosa/DFSSlides.ppt”
        • Data – some structured sequence of bytes
      • Tend to also have
        • Size
        • Protection Information
        • Non-symbolic identifier
        • Location
        • Times, etc
  • 6.
    • A container abstraction to help organize files
      • Generally hierarchical (tree) structure
      • Often a special type of file
    • Directories have a
      • Name
      • Files and directories (if hierarchical) within them
    A large container for tourists
  • 7.
    • Two approaches to sharing files
    • Copy-based
      • Application explicitly copies files between machines
      • Examples: UUCP, FTP, gridFTP, {.*} FTP, Rcp, Scp, etc.
    • Access transparency – i.e. Distributed File Systems
    Sharing is caring
  • 8.
    • Basic idea
      • Find a copy
        • naming is based on machine name of source (viper.cs.virginia.edu), user id, and path
      • Transfer the file to the local file system
        • scp grimshaw@viper.cs.virginia.edu:fred.txt .
      • Read/write
      • Copy back if modified
    • Pros and Cons?
  • 9.
    • Pros
      • Semantics are clear
      • No OS/library modification
    • Cons?
      • Deal with model
      • Have to copy whole file
      • Inconsistencies
      • Inconsistent copies all over the place
      • Others?
  • 10.
    • Mechanism to access remote files is the same as for local files (i.e. through the file system hierarchy)
    • Why is this better?
    • … enter Distributed File Systems
  • 11.
    • A Distributed File System is a file system that may have files on more than one machine
    • Distributed File Systems take many forms
      • Network File Systems
      • Parallel File Systems
      • Access Transparent Distributed File Systems
    • Why distribute?
  • 12.
    • Sharing files with other users
      • Others can access your files
      • You can have access to files you wouldn’t regularly have access to
    • Keeping files available for yourself on more than one computer
    • Small amount of local resources
    • High failure rate of local resources
    • Can eliminate version problems (same file copied around with local edits)
  • 13.
    • Naming
    • Performance
    • Caching
    • Consistency Semantics
    • Fault Tolerance
    • Scalability
  • 14.
    • What does a DFS look like to the user?
      • Mount-like protocol, e.g. /../mntPointToBobsSharedFolder/file.txt
      • Unified namespace: everything looks like it is in the same namespace
    • Pros and Cons?
  • 15.
    • Location transparency
      • Name does not hint at physical location
      • Mount points are not transparent
    • Location Independence
      • File name does not need to be changed when the file’s physical storage location changes
    • Independence without transparency?
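Location independence can be pictured as a level of indirection between the name and the storage location. The following is a minimal sketch, not any particular DFS's design; the `NameService` class, the server names, and the paths are all hypothetical.

```python
class NameService:
    """Maps location-free file names to (server, local path) pairs."""

    def __init__(self):
        self._locations = {}

    def register(self, name, server, local_path):
        self._locations[name] = (server, local_path)

    def migrate(self, name, new_server, new_local_path):
        # Only the mapping changes; the user-visible name stays the same.
        if name not in self._locations:
            raise KeyError(name)
        self._locations[name] = (new_server, new_local_path)

    def resolve(self, name):
        return self._locations[name]


# Hypothetical names and servers, for illustration only.
ns = NameService()
ns.register("/projects/dfs/slides.ppt", "serverA", "/disk1/slides.ppt")
before = ns.resolve("/projects/dfs/slides.ppt")
ns.migrate("/projects/dfs/slides.ppt", "serverB", "/disk7/slides.ppt")
after = ns.resolve("/projects/dfs/slides.ppt")
```

The name never mentions a server (transparency), and it survives the move from serverA to serverB unchanged (independence).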
  • 16.
    • Generally trade off the benefits of a DFS against some performance hits
      • How much depends on workload
      • Always look at workload to figure out what mechanisms to use
    • What are some ways to improve performance?
  • 17.
    • Single architectural feature that contributes most to performance in a DFS!!!
    • Single greatest cause of heartache for programmers of DFS’s
      • Maintaining consistency semantics more difficult
      • Has a large potential impact on scalability
  • 18.
    • Size of the cached units of data
      • Larger sizes make more efficient use of the network (spatial locality, latency)
      • Whole files simplify semantics but can’t store very large files locally
      • Small files
    • Who does what
      • Push vs Pull
      • Important for maintaining consistency
  • 19.
    • Different DFS’s have different consistency semantics
      • UNIX semantics
      • On Close semantics
      • Timeout semantics (at least x-second up-to date)
    • Pros / Cons?
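Timeout semantics are the easiest of the three to sketch: a cached copy is trusted until it is x seconds old, then re-fetched. The sketch below assumes a hypothetical `TimeoutCache` class and an injected clock so the behavior can be shown without sleeping; it is not taken from any real client.

```python
class TimeoutCache:
    """Client cache whose entries are trusted for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock):
        self._fetch = fetch          # function: name -> data (the "server")
        self._ttl = ttl_seconds
        self._clock = clock          # function returning the current time
        self._entries = {}           # name -> (data, time_fetched)

    def read(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]          # possibly stale, but under ttl seconds old
        data = self._fetch(name)
        self._entries[name] = (data, now)
        return data


server = {"notes.txt": "v1"}
now = [0.0]
cache = TimeoutCache(server.__getitem__, ttl_seconds=5, clock=lambda: now[0])

first = cache.read("notes.txt")   # miss: fetched from the server
server["notes.txt"] = "v2"        # another client updates the file
now[0] = 3.0
stale = cache.read("notes.txt")   # still inside the timeout window
now[0] = 6.0
fresh = cache.read("notes.txt")   # window expired: re-fetched
```

The read at time 3 returns old data even though the server has changed, which is exactly the con of timeout semantics; the pro is that no server-side state or callbacks are needed.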
  • 20.
    • Can replicate
      • Fault Tolerance
      • Performance
    • Replication is inherently location-opaque i.e. we need location independence in naming
    • Different forms of replication mechanisms, different consistency semantics
      • Tradeoffs, tradeoffs, tradeoffs
  • 21.
    • Mount-based DFS
      • NFS version 3
      • Others include SMB, CIFS, NFS version 4
    • Parallel DFS
      • Lustre
      • Others include HDFS, Google File System, etc
    • Non-Parallel Unified Namespace DFS’s
      • Sprite
      • AFS version 2.0 (basis for many other DFS’s)
        • Coda
        • AFS 3.0
  • 22.
  • 23.
    • Most commonly used DFS ever!
    • Goals
      • Machine & OS Independent
      • Crash Recovery
      • Transparent Access
      • “ Reasonable” Performance
    • Design
      • All machines can be both clients and servers
      • RPC (over UDP in v2; TCP also supported from v3)
        • Open Network Computing Remote Procedure Call
        • External Data Representation (XDR)
      • Stateless Protocol
  • 24.
  • 25.
    • Client sends path name to server with request to mount
    • If path is legal and exported, server returns file handle
      • Contains FS type, disk, i-node number of directory, security info
      • Subsequent accesses use file handle
    • Mount can be either at boot or automount
      • Automount: Directories mounted on-use
      • Why helpful?
    • Mount only affects client view
  • 26.
    • Mounting (part of) a remote file system in NFS.
  • 27.
    • Mounting nested directories from multiple servers in NFS.
  • 28.
    • Supports directory and file access via remote procedure calls ( RPC s)
    • All UNIX system calls supported other than open & close
    • Open and close are intentionally not supported
      • For a read , client sends lookup message to server
      • Lookup returns file handle but does not copy info in internal system tables
      • Subsequently, read contains file handle, offset and num bytes
      • Each message is self-contained – flexible, but?
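A stateless read can be sketched as follows: the server keeps no per-client open-file table, so every read carries the file handle, offset, and byte count, and the client tracks its own position. The `StatelessServer` class and the integer handles are hypothetical simplifications; real NFS handles are opaque byte strings.

```python
class StatelessServer:
    """Keeps no per-client state: every request is self-contained."""

    def __init__(self, files):
        self._files = dict(files)    # path -> bytes
        # Hypothetical handles: small integers instead of opaque blobs.
        self._handles = {p: i for i, p in enumerate(sorted(self._files))}
        self._by_handle = {h: p for p, h in self._handles.items()}

    def lookup(self, path):
        return self._handles[path]

    def read(self, handle, offset, count):
        data = self._files[self._by_handle[handle]]
        return data[offset:offset + count]


server = StatelessServer({"/export/a.txt": b"hello distributed world"})
fh = server.lookup("/export/a.txt")

# The client, not the server, remembers the current offset.
offset = 0
chunks = []
while True:
    chunk = server.read(fh, offset, 8)
    if not chunk:
        break
    chunks.append(chunk)
    offset += len(chunk)
```

Because each request is complete on its own, a server that crashes and restarts can answer the next read without any recovery protocol; the cost is that the server cannot help with locking or consistency.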
  • 29.
    • Reading data from a file in NFS version 3.
    • Reading data using a compound procedure in version 4.
  • 30.
    • Some general mandatory file attributes in NFS.
    • Attribute: Description
      • TYPE: The type of the file (regular, directory, symbolic link)
      • SIZE: The length of the file in bytes
      • CHANGE: Indicator for a client to see if and/or when the file has changed
      • FSID: Server-unique identifier of the file's file system
  • 31.
    • Some general recommended file attributes.
    • Attribute: Description
      • ACL: An access control list associated with the file
      • FILEHANDLE: The server-provided file handle of this file
      • FILEID: A file-system unique identifier for this file
      • FS_LOCATIONS: Locations in the network where this file system may be found
      • OWNER: The character-string name of the file's owner
      • TIME_ACCESS: Time when the file data were last accessed
      • TIME_MODIFY: Time when the file data were last modified
      • TIME_CREATE: Time when the file was created
  • 32.
    • All communication done in the clear
    • Client sends the user id and group id of the request to the NFS server
    • Discuss
  • 33.
    • Consistency semantics are dirty
      • Checks non-dirty items every 5 seconds
      • Things marked dirty flushed within 30 seconds
    • Performance under load is horrible, why?
    • Cross-mount hell - paths to files different on different machines
    • ID mismatch between domains
  • 34.
    • Goals
      • Improved Access and good performance on the Internet
      • Better Scalability
      • Strong Security
      • Cross-platform interoperability and ease to extend
  • 35.
    • Stateful Protocol (Open + Close)
    • Compound Operations (Fully utilize bandwidth)
    • Lease-based Locks (Locking built-in)
    • “ Delegation” to clients (Less work for the server)
    • Close-Open Cache Consistency (Timeouts still for attributes and directories)
    • Better security
  • 36.
    • Borrowed model from CIFS (Common Internet File System) see MS
    • Open/Close
      • Opens do lookup, create, and lock all in one (what a deal)!
      • Locks / delegation (explained later) released on file close
      • Always a notion of a “current file handle” i.e. see pwd
  • 37.
    • Problem: Normal filesystem semantics have too many RPC’s (boo)
    • Solution: Group many calls into one call (yay)
    • Semantics
      • Run sequentially
      • Fails on first failure
      • Returns status of each individual RPC in the compound response (up to and including the first failure, or all of them on success)
    Compound Kitty
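The compound semantics above can be sketched in a few lines: run the operations in order, stop at the first failure, and report a status for every operation that was attempted. `run_compound` is a hypothetical helper; the operation labels are only strings here and the callables stand in for the real work.

```python
def run_compound(ops):
    """Run (name, callable) pairs in order; stop at the first failure.

    Returns a list of (name, status) for every operation attempted.
    """
    results = []
    for name, fn in ops:
        try:
            fn()
        except Exception as exc:
            results.append((name, "ERR: " + type(exc).__name__))
            break                    # later operations are never run
        results.append((name, "OK"))
    return results


def fail():
    raise PermissionError("denied")


# Illustrative labels only; each callable is a stand-in for one RPC.
statuses = run_compound([
    ("PUTFH", lambda: None),
    ("LOOKUP", lambda: None),
    ("OPEN", fail),
    ("READ", lambda: None),
])
```

Here the READ never runs and gets no status, so the client learns exactly how far the request got in a single round trip.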
  • 38.
    • Both byte-range and file locks
    • Heartbeats keep locks alive (renew lock)
    • If server fails, waits at least the agreed upon lease time (constant) before accepting any other lock requests
    • If client fails, locks are released by server at the end of lease period
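A lease-based lock can be sketched as a lock with an expiry time that the holder must keep renewing. This is a minimal single-server sketch under stated assumptions: the `LeaseLockServer` class is hypothetical, the clock is injected for demonstration, and server restart handling (the grace period) is left out.

```python
class LeaseLockServer:
    """Grants locks that expire unless the owner renews them."""

    def __init__(self, lease_seconds, clock):
        self._lease = lease_seconds
        self._clock = clock
        self._locks = {}             # resource -> (owner, expiry_time)

    def _expire(self, resource):
        held = self._locks.get(resource)
        if held is not None and self._clock() >= held[1]:
            del self._locks[resource]    # owner stopped renewing

    def acquire(self, resource, owner):
        self._expire(resource)
        held = self._locks.get(resource)
        if held is not None and held[0] != owner:
            return False
        self._locks[resource] = (owner, self._clock() + self._lease)
        return True

    def renew(self, resource, owner):
        self._expire(resource)
        held = self._locks.get(resource)
        if held is None or held[0] != owner:
            return False
        self._locks[resource] = (owner, self._clock() + self._lease)
        return True


now = [0]
srv = LeaseLockServer(lease_seconds=10, clock=lambda: now[0])
a_got = srv.acquire("f", "A")        # A holds the lock until t=10
b_denied = srv.acquire("f", "B")
now[0] = 8
renewed = srv.renew("f", "A")        # heartbeat extends the lease to t=18
now[0] = 15
b_still_denied = srv.acquire("f", "B")
now[0] = 30                          # A has crashed and stopped renewing
b_got = srv.acquire("f", "B")
a_late = srv.renew("f", "A")
```

The server never needs to detect a client crash explicitly; a dead client simply fails to renew and its lock lapses at the end of the lease period.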
  • 39.
    • Tells client no one else has the file
    • Client exposes callbacks
  • 40.
    • Any opens that happen after a close finishes are consistent with the data as of the last close
    • Last close wins the competition
  • 41.
    • Uses the GSS-API framework
    • All id’s are formed with
      • [email_address]
      • [email_address]
    • Every implementation must have Kerberos v5
    • Every implementation must have LIPKey
  • 42.
    • Replication / Migration mechanism added
      • Special error messages to indicate migration
      • Special attribute for both replication and migration that gives the location of the other / new location
      • May have read-only replicas
  • 43.
  • 44.
    • People don’t like to move
    • Requires Kerberos (the death of many good distributed file systems)
    • Looks just like V3 to end-user and V3 is good enough 
  • 45.
  • 46.
    • Need for a file system for large clusters that has the following attributes
      • Highly scalable > 10,000 nodes
      • Provide petabytes of storage
      • High throughput (100 GB/sec)
    • Datacenters have different needs so we need a general-purpose back-end file system
  • 47.
    • Open-source object-based cluster file system
    • Fully compliant with POSIX
    • Features (i.e. what I will discuss)
      • Object Protocols
      • Intent-based Locking
      • Adaptive Locking Policies
      • Aggressive Caching
  • 48.
  • 49.
  • 50.
  • 51.
  • 52.
  • 53.
    • Policy depends on context
    • Mode 1: Performing operations on something used mostly by a single client (e.g. /home/username)
    • Mode 2: Performing operations on a highly contentious resource (e.g. /tmp)
    • DLM capable of granting locks on an entire subtree and whole files
  • 54.
    • POSIX
    • Keeps local journal of updates for locked files
      • One per file operation
      • Hard linked files get special treatment with subtree locks
    • Lock revoked -> updates flushed and replayed
    • Use subtree change times to validate cache entries
    • Additionally features collaborative caching -> referrals to other dedicated cache service
  • 55. Security
    • Supports GSS-API
      • Supports (does not require) Kerberos
      • Supports PKI mechanisms
    • Did not want to be tied down to one mechanism
  • 56.
  • 57.
    • Named after Andrew Carnegie and Andrew Mellon
      • Transarc Corp. and then IBM took over development of AFS
      • In 2000 IBM made OpenAFS available as open source
    • Goals
      • Large scale (thousands of servers and clients)
      • User mobility
      • Scalability
      • Heterogeneity
      • Security
      • Location transparency
      • Availability
  • 58.
    • Features:
      • Uniform name space
      • Location independent file sharing
      • Client side caching with cache consistency
      • Secure authentication via Kerberos
      • High availability through automatic switchover of replicas
      • Scalability to span 5000 workstations
  • 59.
    • Based on the upload/download model
      • Clients download and cache files
      • Server keeps track of clients that cache the file
      • Clients upload files at end of session
    • Whole file caching is key
      • Later amended to block operations (v3)
      • Simple and effective
    • Kerberos for Security
    • AFS servers are stateful
      • Keep track of clients that have cached files
      • Recall files that have been modified
  • 60.
    • Clients have partitioned name space:
      • Local name space and shared name space
      • Cluster of dedicated servers (Vice) present shared name space
      • Clients run Virtue protocol to communicate with Vice
  • 61.
  • 62.
    • AFS’s storage is arranged in volumes
      • Usually associated with files of a particular client
    • AFS dir entry maps vice files/dirs to a 96-bit fid
      • Volume number
      • Vnode number: index into i-node array of a volume
      • Uniquifier: allows reuse of vnode numbers
    • Fids are location transparent
      • File movements do not invalidate fids
    • Location information kept in volume-location database
      • Volumes migrated to balance available disk space, utilization
      • Volume movement is atomic; operation aborted on server crash
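The fid layout and the volume-location lookup can be sketched together. The sketch assumes the 96 bits are split into three unsigned 32-bit fields in big-endian order; the byte order, the helper names, and the server names are assumptions made for illustration.

```python
import struct

# Assumed layout: volume, vnode, uniquifier as three unsigned 32-bit ints.
FID_FORMAT = ">III"


def pack_fid(volume, vnode, uniquifier):
    return struct.pack(FID_FORMAT, volume, vnode, uniquifier)


def unpack_fid(raw):
    return struct.unpack(FID_FORMAT, raw)


# Hypothetical volume-location database: volume number -> server.
volume_location_db = {7: "server-a"}


def server_for(raw_fid):
    volume, _vnode, _uniquifier = unpack_fid(raw_fid)
    return volume_location_db[volume]


fid = pack_fid(7, 42, 1)
before = server_for(fid)
volume_location_db[7] = "server-b"   # the volume is migrated
after = server_for(fid)              # same fid, new location
```

The fid itself holds no server address, so moving a volume only requires updating the database entry; every fid already handed out stays valid.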
  • 63.
    • User process -> open file F
    • The kernel resolves that it’s a Vice file -> passes it to Venus
    • D is in the cache & has callback -> use it without any network communication
    • D is in cache but has no callback -> contact the appropriate server for a new copy; establish callback
    • D is not in cache -> fetch it from the server; establish callback
    • File F is identified -> create a current cache copy
    • Venus returns to the kernel, which opens F and returns its handle to the process
  • 64.
    • AFS caches entire files from servers
      • Client interacts with servers only during open and close
    • OS on client intercepts calls, passes to Venus
      • Venus is a client process that caches files from servers
      • Venus contacts Vice only on open and close
      • Reads and writes bypass Venus
    • Works due to callback :
      • Server updates state to record caching
      • Server notifies client before allowing another client to modify
      • Clients lose their callback when someone writes the file
    • Venus caches dirs and symbolic links for path translation
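The callback mechanism can be sketched with two small classes. This is a simplified model, not the AFS implementation: `ViceServer` and `VenusClient` are hypothetical names, files are plain strings, and network calls are ordinary method calls.

```python
class ViceServer:
    def __init__(self):
        self.files = {}
        self.callbacks = {}          # name -> set of clients with a callback

    def fetch(self, name, client):
        # Record that this client now caches the file.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store(self, name, data, writer):
        # Break everyone else's callback before accepting the new data.
        for client in self.callbacks.get(name, set()) - {writer}:
            client.break_callback(name)
        self.callbacks[name] = {writer}
        self.files[name] = data


class VenusClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}              # name -> data, valid while callback held
        self.fetches = 0

    def open(self, name):
        if name not in self.cache:   # no valid copy: contact the server
            self.cache[name] = self.server.fetch(name, self)
            self.fetches += 1
        return self.cache[name]

    def close(self, name, data):
        self.cache[name] = data
        self.server.store(name, data, self)

    def break_callback(self, name):
        self.cache.pop(name, None)


server = ViceServer()
server.files["f"] = "v1"
a, b = VenusClient(server), VenusClient(server)

a.open("f")
a.open("f")                          # served from cache, no server contact
fetches_before_write = a.fetches
b.open("f")
b.close("f", "v2")                   # server breaks a's callback
a_sees = a.open("f")                 # a must fetch again
```

While a callback is held, repeated opens cost no network traffic at all, which is where the scalability comes from; the price is the per-file state the server has to keep.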
  • 65.
    • The use of local copies when opening a session in Coda.
  • 66.
    • A descendant of AFS v2 (AFS v3 went another way with large chunk caching)
    • Goals
      • More resilient to server and network failures
      • Constant Data Availability
      • Portable computing
  • 67.
    • Keeps whole file caching, callbacks, end-to-end encryption
    • Adds full server replication
    • General Update Protocol
      • Known as Coda Optimistic Protocol
      • COP1 (first phase) performs actual semantic operation to servers (using multicast if available)
      • COP2 sends a data structure called an update set which summarizes the client’s knowledge. These messages are piggybacked on later COP1’s
  • 68.
    • Disconnected Operation (KEY)
      • Hoarding
        • Periodically reevaluates which objects merit retention in the cache (hoard walking)
        • Relies on both implicit and a lot of explicit info (profiles etc)
      • Emulating i.e. maintaining a replay log
      • Reintegration – re-play replay log
    • Conflict Resolution
      • Gives repair tool
      • Log to give to user to manually fix issue
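Emulation and reintegration can be sketched with a replay log keyed by file name. This is a rough model under stated assumptions, not Coda's actual algorithm: versions are plain integers, repeated writes to one file are coalesced, and a conflict is simply reported for manual repair. All names are hypothetical.

```python
class DisconnectedClient:
    def __init__(self, hoarded):
        self.cache = dict(hoarded)   # name -> (version, data)
        self.replay_log = {}         # name -> (base_version, latest_data)

    def write(self, name, data):
        version, _ = self.cache[name]
        # Coalesce repeated writes: only the latest data is replayed.
        self.replay_log[name] = (version, data)
        self.cache[name] = (version, data)


def reintegrate(client, server_files):
    """Replay logged writes; return the names that conflicted."""
    conflicts = []
    for name, (base_version, data) in client.replay_log.items():
        server_version, _ = server_files[name]
        if server_version != base_version:
            conflicts.append(name)   # changed on the server meanwhile
            continue
        server_files[name] = (server_version + 1, data)
        client.cache[name] = server_files[name]
    client.replay_log.clear()
    return conflicts


server = {"a.txt": (1, "a1"), "b.txt": (1, "b1")}
client = DisconnectedClient(server)          # hoard both files

client.write("a.txt", "a2-offline")
client.write("a.txt", "a3-offline")
client.write("b.txt", "b2-offline")
server["b.txt"] = (2, "b2-other")            # someone else updates b.txt

conflicts = reintegrate(client, server)
```

The update to a.txt replays cleanly, while b.txt is flagged because its server version no longer matches the version the client started from; that is the case handed to the repair tool.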
  • 69.
    • The state-transition diagram of a Coda client with respect to a volume.
  • 70.
    • AFS deployments in academia and government (100’s)
    • Security model required Kerberos
      • Many organizations not willing to make the costly switch
    • AFS (but not Coda) was not integrated into Unix FS. Separate “ls”, different – though similar – API
    • Session semantics not appropriate for many applications
  • 71.
    • Goals
      • Efficient use of large main memories
      • Support for multiprocessor workstations
      • Efficient network communication
      • Diskless Operation
      • Exact Emulation of UNIX FS semantics
    • Location transparent UNIX FS
  • 72.
    • Naming
      • Local prefix table which maps path-name prefixes to servers
      • Cached locations
      • Otherwise there is location embedded in remote stubs in the tree hierarchy
    • Caching
      • Needs sequential consistency
      • If one client wants to write, disables caching on all open clients. Assumes this isn’t very bad since this doesn’t happen often
    • No security between kernels. All over trusted network
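A prefix table lookup can be sketched as a longest-prefix match over path-name prefixes. The `resolve` function, the table contents, and the server names below are hypothetical, and paths are assumed to be absolute.

```python
def resolve(prefix_table, path):
    """Return (server, remainder) for the longest matching prefix."""
    best = None
    for prefix in prefix_table:
        # Match whole path components only, so "/users" != "/usersX".
        is_match = (prefix == "/" or path == prefix
                    or path.startswith(prefix + "/"))
        if is_match and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise LookupError(path)
    remainder = path[len(best):].lstrip("/")
    return prefix_table[best], remainder


# Hypothetical table: prefix -> server that stores that subtree.
table = {"/": "srvRoot", "/users": "srvUsers", "/users/sosa": "srvSosa"}

r1 = resolve(table, "/users/sosa/slides.ppt")
r2 = resolve(table, "/users/bob/a.txt")
r3 = resolve(table, "/usersX/file")
```

The client sends only the remainder to the chosen server, so most lookups go straight to the right machine without walking the tree from the root.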
  • 73.
    • The best way to implement something depends very highly on the goals you want to achieve
    • Always start with goals before deciding on consistency semantics