Hard State Revisited: Network Filesystems
Jeff Chase
CPS 212, Fall 2000
Network File System (NFS)
[Figure: NFS architecture. On the client, user programs enter the syscall layer,
which calls through VFS to the NFS client (alongside local *FS modules); the
NFS client speaks RPC over UDP or TCP to the NFS server, which calls through
the server's VFS into its local *FS, beneath the server's own syscall layer.]
NFS Vnodes
[Figure: NFS client path. The syscall layer calls through VFS to nfs_vnodeops;
each remote file's vnode points to an nfsnode, and the NFS client stubs carry
operations by RPC across the network to the NFS server and its local *FS.]
The nfsnode holds client state
needed to interact with the server
to operate on the file.
struct nfsnode* np = VTONFS(vp);
The NFS protocol has an operation type for (almost) every
vnode operation, with similar arguments/results.
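As a rough illustration, the client state an nfsnode might hold, and the
VTONFS macro, could be sketched as follows (field names and types are
illustrative, not the actual BSD definitions):

    /* Illustrative sketch of client-side NFS state; not the real BSD structs. */
    struct nfsfh {
        unsigned char fh_bytes[32];   /* opaque file handle issued by the server */
    };

    struct nfsnode {
        struct nfsfh n_fh;            /* names the file in every RPC request */
        long         n_mtime;         /* cached modify time, for cache validation */
        long         n_size;          /* cached file size */
        long         n_attrstamp;     /* when the cached attributes were fetched */
    };

    struct vnode {                    /* minimal stand-in for the generic vnode */
        void *v_data;                 /* points at the FS-specific state */
    };

    #define VTONFS(vp) ((struct nfsnode *)((vp)->v_data))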
File Handles
Question: how does the client tell the server which file or
directory the operation applies to?
• Similarly, how does the server return the result of a lookup?
More generally, how do we pass a pointer or an object reference as an
argument/result of an RPC call?
In NFS, the reference is a file handle or fhandle, a token/ticket
whose value is determined by the server.
• Includes all information needed to identify the file/object on
the server, and find it quickly.
An fhandle typically encodes: volume ID, inode #, generation #.
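A sketch of the fields such an fhandle might carry (the client treats the
handle as an opaque token, so this layout is illustrative):

    /* Illustrative fhandle layout; only the server interprets these fields. */
    struct fhandle {
        unsigned int fh_volume; /* volume (filesystem) ID on the server */
        unsigned int fh_inode;  /* inode number within that volume */
        unsigned int fh_gen;    /* generation #: detects inode reuse */
    };

The generation number guards against stale handles: if a file is removed and
its inode later reused, old handles carry the old generation number and no
longer match.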
NFS: From Concept to Implementation
Now that we understand the basics, how do we make it fast?
• caching
data blocks
file attributes
lookup cache (dnlc): name->fhandle mappings
directory contents?
• read-ahead and write-behind
file I/O at wire speed
And of course we want the full range of other desirable “*ility”
properties....
NFS as a “Stateless” Service
A classical NFS server maintains no in-memory hard state.
The only hard state is the stable file system image on disk.
• no record of clients or open files
• no implicit arguments to requests
E.g., no server-maintained file offsets: read and write requests
must explicitly transmit the byte offset for each operation (see the sketch below).
• no write-back caching on the server
• no record of recently processed requests
• etc., etc....
Statelessness makes failure recovery simple and efficient.
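To make the "no implicit arguments" point concrete, a self-describing read
request might look like this (a sketch, not the real NFS XDR wire format):

    /* Sketch: every argument is explicit, so the server keeps no per-client
     * cursor or open-file state. (Illustrative, not the NFS wire format.) */
    struct nfs_read_args {
        unsigned char ra_fh[32];  /* file handle: which file */
        unsigned long ra_offset;  /* explicit byte offset: no server file cursor */
        unsigned long ra_count;   /* number of bytes requested */
    };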
Recovery in Stateless NFS
If the server fails and restarts, there is no need to rebuild in-memory
state on the server.
• Client reestablishes contact (e.g., TCP connection).
• Client retransmits pending requests.
Classical NFS uses a connectionless transport (UDP).
• Server failure is transparent to the client; no connection to
break or reestablish.
A crashed server is indistinguishable from a slow server.
• Sun/ONC RPC masks network errors by retransmitting a
request after an adaptive timeout.
A dropped packet is indistinguishable from a crashed server.
Drawbacks of a Stateless Service
The stateless nature of classical NFS has compelling design
advantages (simplicity), but also some key drawbacks:
• Recovery-by-retransmission constrains the server interface.
ONC RPC/UDP has execute-at-least-once semantics (“send and
pray”), which compromises performance and correctness.
• Update operations are disk-limited.
Updates must commit synchronously at the server.
• NFS cannot (quite) preserve local single-copy semantics.
Files may be removed while they are open on the client.
Server cannot help in client cache consistency.
Let’s explore these problems and their solutions...
Problem 1: Retransmissions and Idempotency
For a connectionless RPC transport, retransmissions can saturate
an overloaded server.
Clients “kick ‘em while they’re down”, causing a steep hockey-stick response-time curve.
Execute-at-least-once constrains the server interface.
• Service operations should/must be idempotent.
Multiple executions should/must have the same effect.
• Idempotent operations cannot capture the full semantics we
expect from our file system.
remove, append-mode writes, exclusive create
Solutions to the Retransmission Problem
1. Hope for the best and smooth over non-idempotent requests.
E.g., map ENOENT and EEXIST to ESUCCESS.
2. Use TCP or some other transport protocol that produces
reliable, in-order delivery.
higher overhead...and we still need sessions.
3. Implement an execute-at-most-once RPC transport.
TCP-like features (sequence numbers)...and sessions.
4. Keep a retransmission cache on the server [Juszczak90].
Remember the most recent request IDs and their results, and just
resend the result....does this violate statelessness?
DAFS persistent session cache.
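A minimal sketch of such a retransmission (duplicate-request) cache, keyed by
RPC transaction ID (all names here are hypothetical):

    /* Hypothetical duplicate-request cache: remember recent request IDs and
     * their results, and replay the saved reply on a retransmission instead
     * of re-executing a non-idempotent operation. */
    #define RCACHE_SIZE 256

    struct rcache_entry {
        unsigned long xid;     /* RPC transaction ID (0 = empty slot) */
        int           status;  /* saved result code */
    };

    static struct rcache_entry rcache[RCACHE_SIZE];

    int rcache_lookup(unsigned long xid, int *status)
    {
        struct rcache_entry *e = &rcache[xid % RCACHE_SIZE];
        if (xid != 0 && e->xid == xid) {   /* seen before: replay the result */
            *status = e->status;
            return 1;
        }
        return 0;                          /* new request: execute it */
    }

    void rcache_insert(unsigned long xid, int status)
    {
        struct rcache_entry *e = &rcache[xid % RCACHE_SIZE];
        e->xid = xid;                      /* may evict an older entry: best-effort */
        e->status = status;
    }

Note the cache is soft state: losing it in a crash merely re-exposes the
idempotency problem, so recovery stays simple.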
Problem 2: Synchronous Writes
Stateless NFS servers must commit each operation to stable
storage before responding to the client.
• Interferes with FS optimizations, e.g., clustering, LFS, and
disk write ordering (seek scheduling).
Damages bandwidth and scalability.
• Imposes disk access latency for each request.
Not so bad for a logged write; much worse for a complex
operation like an FFS file write.
The synchronous update problem occurs for any storage
service with reliable update (commit).
Speeding Up Synchronous NFS Writes
Interesting solutions to the synchronous write problem, used
in high-performance NFS servers:
• Delay the response until convenient for the server.
E.g., NFS write-gathering optimizations for clustered writes
(similar to group commit in databases).
Relies on write-behind from NFS I/O daemons (iods).
• Throw hardware at it: non-volatile memory (NVRAM)
Battery-backed RAM or UPS (uninterruptible power supply).
Use as an operation log (Network Appliance WAFL)...
...or as a non-volatile disk write buffer (Legato).
• Replicate server and buffer in memory (e.g., MIT Harp).
NFS V3 Asynchronous Writes
NFS V3 sidesteps the synchronous write problem by adding a
new asynchronous write operation.
• Server may reply to client as soon as it accepts the write,
before executing/committing it.
If the server fails, it may discard any subset of the accepted but
uncommitted writes.
• Client holds asynchronously written data in its cache, and
reissues the writes if the server fails and restarts.
When is it safe for the client to discard its buffered writes?
How can the client tell if the server has failed?
NFS V3 Commit
NFS V3 adds a new commit operation to go with async-write.
• Client may issue a commit for a file byte range at any time.
• Server must execute all covered uncommitted writes before
replying to the commit.
• When the client receives the reply, it may safely discard any
buffered writes covered by the commit.
• Server returns a verifier with every reply to an async write or
commit request.
The verifier is just an integer that is guaranteed to change if the
server restarts, and to never change back.
• What if the client crashes?
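Client-side handling of the verifier might look like this (a sketch;
reissue_buffered_writes is a hypothetical helper):

    /* Sketch: a changed verifier means the server restarted and may have
     * discarded uncommitted async writes, so reissue everything buffered. */
    static unsigned long long last_verf;
    static int have_verf;

    void reissue_buffered_writes(void);   /* hypothetical helper */

    void note_verifier(unsigned long long verf)
    {
        if (have_verf && verf != last_verf)
            reissue_buffered_writes();    /* server restart detected */
        last_verf = verf;
        have_verf = 1;
    }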
Problem 3: File Cache Consistency
Problem: Concurrent write sharing of files.
Contrast with read sharing or sequential write sharing.
Solutions:
• Timestamp invalidation (NFS).
Timestamp each cache entry, and periodically query the server:
“has this file changed since time t?”; invalidate cache if stale.
• Callback invalidation (AFS, Sprite, Spritely NFS).
Request notification (callback) from the server if the file
changes; invalidate cache and/or disable caching on callback.
• Leases (NQ-NFS) [Gray&Cheriton89,Macklem93,NFS V4]
• Later: distributed shared memory
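As an example of the first approach, a timestamp-invalidation check might be
sketched as follows (reusing the illustrative nfsnode above; real clients
also bound how often they query the server with an attribute-cache timeout):

    /* Sketch of timestamp invalidation: compare the server's modify time
     * against the cached one; if it moved, the cached data is stale. */
    int cache_valid(struct nfsnode *np, long server_mtime)
    {
        if (server_mtime != np->n_mtime) {
            np->n_mtime = server_mtime;   /* remember the new version */
            return 0;                     /* stale: invalidate cached blocks */
        }
        return 1;                         /* unchanged since time t */
    }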
File Cache Example: NQ-NFS Leases
In NQ-NFS, a client obtains a lease on the file that permits
the client’s desired read/write activity.
“A lease is a ticket permitting an activity; the lease is valid until
some expiration time.”
• A read-caching lease allows the client to cache clean data.
Guarantee: no other client is modifying the file.
• A write-caching lease allows the client to buffer modified
data for the file.
Guarantee: no other client has the file cached.
Allows delayed writes: client may delay issuing writes to
improve write performance (i.e., client has a writeback cache).
Using NQ-NFS Leases
1. Client NFS piggybacks lease requests for a given file on
I/O operation requests (e.g., read/write).
NQ-NFS leases are implicit and distinct from file locking.
2. The server determines whether it can safely grant the request, i.e.,
whether it conflicts with a lease held by another client.
read leases may be granted simultaneously to multiple clients
write leases are granted exclusively to a single client
3. If a conflict exists, the server may send an eviction notice
to the holder of the conflicting lease.
If a client is evicted from a write lease, it must write back.
Grace period: server grants extensions while the client writes.
Client sends vacated notice when all writes are complete.
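The server's conflict check might be sketched like this (hypothetical names;
the actual NQ-NFS implementation differs):

    /* Hypothetical lease-compatibility check: many readers OR one writer. */
    enum lease_type { LEASE_NONE, LEASE_READ, LEASE_WRITE };

    struct lease {
        enum lease_type type;    /* what the current holder may do */
        int             holder;  /* client ID of the holder */
        long            expires; /* expiration time (an expired lease is free) */
    };

    /* May 'want' be granted to 'client' given an existing valid lease 'have'?
     * Read leases share; write leases are exclusive. */
    int lease_compatible(struct lease *have, enum lease_type want, int client)
    {
        if (have->type == LEASE_NONE || have->holder == client)
            return 1;            /* unheld, or no conflict with oneself */
        return have->type == LEASE_READ && want == LEASE_READ;
    }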
NQ-NFS Lease Recovery
Key point: the bounded lease term simplifies recovery.
• Before a lease expires, the client must renew the lease.
• What if a client fails while holding a lease?
Server waits until the lease expires, then unilaterally reclaims the
lease; client forgets all about it.
If a client fails while writing back on an eviction, the server waits for a
write slack time before granting a conflicting lease.
• What if the server fails while there are outstanding leases?
Wait for lease period + clock skew before issuing new leases.
• Recovering server must absorb lease renewal requests and/or
writes for vacated leases.
NQ-NFS Leases and Cache Consistency
• Every lease contains a file version number.
Invalidate the cache iff the version number has changed.
• Clients may disable client caching when there is concurrent
write sharing.
no-caching lease
• What consistency guarantees do NQ-NFS leases provide?
Does the server eventually receive/accept all writes?
Does the server accept the writes in order?
Are groups of related writes atomic?
How are write errors reported?
What is the relationship to NFS V3 commit?
The Distributed Lock Lab
The lock implementation is similar to DSM systems, with
reliability features similar to distributed file caches.
• use Java RMI
• lock token caching with callbacks
lock tokens are passed through the server, not peer-to-peer as in DSM
• synchronizes multiple threads on same client
• state bit for pending callback on client
• server must reissue callback each lease interval (or use RMI
timeouts to detect a failed client)
• client must renew token each lease interval
Background: Unix Filesystem Internals
A Typical Unix File Tree
[Figure: a Unix file tree rooted at /, with entries such as /tmp, /usr, /etc,
/bin (ls, sh), and /vmunix; a separate volume whose root holds project, users,
and packages (tex, emacs) is grafted into the tree at a mount point.]
File trees are built by grafting volumes from different devices
or from network servers.
Each volume is a set of directories and files; a host’s file tree is the set of
directories and files visible to processes on a given host.
In Unix, the graft operation is
the privileged mount system call,
and each volume is a filesystem.
mount (coveredDir, volume)
coveredDir: directory pathname
volume: device specifier or network volume
The volume root’s contents become visible in the tree at the mount point,
pathname coveredDir.
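Schematic examples in this mount(coveredDir, volume) notation (the real Unix
mount syscall takes filesystem-specific arguments, so these are illustrative):

    mount("/usr", "/dev/sd0g");            /* graft a local disk volume at /usr */
    mount("/usr/project", "server:/vol0"); /* graft an NFS volume from a server */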
Filesystems
Each file volume (filesystem) has a type, determined by its
disk layout or the network protocol used to access it.
ufs (ffs), lfs, nfs, rfs, cdfs, etc.
Filesystems are administered independently.
Modern systems also include “logical” pseudo-filesystems in
the naming tree, accessible through the file syscalls.
procfs: the /proc filesystem allows access to process internals.
mfs: the memory file system is a memory-based scratch store.
Processes access filesystems through common system calls.
VFS: the Filesystem Switch
[Figure: VFS layering. The syscall layer (file, uio, etc.) sits just below
user space, above the Virtual File System (VFS); VFS dispatches to NFS (which
uses the TCP/IP network protocol stack), FFS, LFS, and other *FS modules,
which sit above the device drivers.]
Sun Microsystems introduced the virtual file system interface
in 1985 to accommodate diverse filesystem types cleanly.
VFS allows diverse specific file systems to coexist in a file tree,
isolating all FS-dependencies in pluggable filesystem modules.
VFS was an internal kernel restructuring
with no effect on the syscall interface.
Incorporates object-oriented concepts:
a generic procedural interface with
multiple implementations.
Based on abstract objects with dynamic
method binding by type...in C.
Other abstract interfaces in the kernel: device drivers,
file objects, executable files, memory objects.
Vnodes
In the VFS framework, every file or directory in active use is
represented by a vnode object in kernel memory.
[Figure: the syscall layer holds references to active NFS and UFS vnodes;
unreferenced vnodes sit on a free list.]
Each vnode has a standard file attributes struct.
Vnode operations are macros that vector to filesystem-specific procedures.
The generic vnode points at a filesystem-specific struct (e.g., inode,
rnode), seen only by that filesystem.
Each specific file system maintains a cache of its resident vnodes.
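The “macros that vector to filesystem-specific procedures” idea can be
sketched in C as a per-filesystem table of function pointers (illustrative,
not the real BSD interface):

    /* Sketch of VFS dynamic dispatch in C: each filesystem supplies a table
     * of entry points, and macros hide the indirection from generic code. */
    struct vnode;                 /* forward declaration */

    struct vnodeops {
        int (*vop_lookup)(struct vnode *dvp, struct vnode **vpp, const char *name);
        int (*vop_fsync)(struct vnode *vp);
        /* ...one entry per vnode operation... */
    };

    struct vnode {
        struct vnodeops *v_op;    /* binds the vnode to its filesystem’s code */
        void            *v_data;  /* filesystem-specific struct (inode, nfsnode) */
    };

    /* Generic code calls through the table, unaware of the specific FS. */
    #define VOP_LOOKUP(dvp, vpp, name)  ((dvp)->v_op->vop_lookup(dvp, vpp, name))
    #define VOP_FSYNC(vp)               ((vp)->v_op->vop_fsync(vp))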
Vnode Operations and Attributes
directories only
vop_lookup (OUT vpp, name)
vop_create (OUT vpp, name, vattr)
vop_remove (vp, name)
vop_link (vp, name)
vop_rename (vp, name, tdvp, tvp, name)
vop_mkdir (OUT vpp, name, vattr)
vop_rmdir (vp, name)
vop_symlink (OUT vpp, name, vattr, contents)
vop_readdir (uio, cookie)
vop_readlink (uio)
files only
vop_getpages (page**, count, offset)
vop_putpages (page**, count, sync, offset)
vop_fsync ()
vnode attributes (vattr)
type (VREG, VDIR, VLNK, etc.)
mode (9+ bits of permissions)
nlink (hard link count)
owner user ID
owner group ID
filesystem ID
unique file ID
file size (bytes and blocks)
access time
modify time
generation number
generic operations
vop_getattr (vattr)
vop_setattr (vattr)
vhold()
vholdrele()
V/Inode Cache
[Figure: the v/inode cache. Vnodes are hashed by HASH(fsid, fileid); VFS
maintains a free list.]
Active vnodes are reference-counted by the structures that hold pointers to
them.
- system open file table
- process current directory
- file system mount points
- etc.
Each specific file system maintains its
own hash of vnodes (BSD).
- specific FS handles initialization
- free list is maintained by VFS
vget(vp): reclaim cached inactive vnode from VFS free list
vref(vp): increment reference count on an active vnode
vrele(vp): release reference count on a vnode
vgone(vp): vnode is no longer valid (file is removed)
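Reference counting might be sketched as follows (hypothetical, pared-down
names; the real vref/vrele also handle locking and the interlock with vget):

    /* Sketch of vnode reference counting: the last release puts the vnode on
     * the VFS free list, where vget() may later reclaim it. */
    struct xvnode { int v_usecount; };         /* hypothetical, pared down */

    void put_on_free_list(struct xvnode *vp);  /* hypothetical helper */

    void vref(struct xvnode *vp)
    {
        vp->v_usecount++;         /* new holder: open file table, cwd, mount... */
    }

    void vrele(struct xvnode *vp)
    {
        if (--vp->v_usecount == 0)
            put_on_free_list(vp); /* inactive: cache it on the VFS free list */
    }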
Pathname Traversal
When a pathname is passed as an argument to a system call,
the syscall layer must “convert it to a vnode”.
Pathname traversal is a sequence of vop_lookup calls to descend
the tree to the named file or directory.
open(“/tmp/zot”)
vp = get vnode for / (rootdir)
vp->vop_lookup(&cvp, “tmp”);
vp = cvp;
vp->vop_lookup(&cvp, “zot”);
Issues:
1. crossing mount points
2. obtaining root vnode (or current dir)
3. finding resident vnodes in memory
4. caching name->vnode translations
5. symbolic (soft) links
6. disk implementation of directories
7. locking/referencing to handle races
with name create and delete operations
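A simplified traversal loop, using the VOP_LOOKUP dispatch sketched earlier
and gliding over the issues above (mount points, symlinks, locking):

    #include <string.h>   /* strtok */

    /* Sketch of pathname traversal: one lookup call per component.
     * Ignores mount-point crossing, symlinks, "..", and races;
     * strtok modifies path in place, so it must be writable. */
    struct vnode *namei(struct vnode *rootvp, char *path)
    {
        struct vnode *vp = rootvp;              /* start at / (or the cwd) */
        char *name;

        for (name = strtok(path, "/"); name != NULL; name = strtok(NULL, "/")) {
            struct vnode *cvp;
            if (VOP_LOOKUP(vp, &cvp, name) != 0)
                return NULL;                    /* component not found */
            vp = cvp;                           /* descend one level */
        }
        return vp;                              /* vnode for the named object */
    }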
NFS Protocol
NFS is a network protocol layered above TCP/IP.
• Original implementations (and most today) use UDP
datagram transport for low overhead.
Maximum IP datagram size was increased to match FS block
size, to allow send/receive of entire file blocks.
Some implementations use TCP as a transport.
• The NFS protocol is a set of message formats and types.
Client issues a request message for a service operation.
Server performs requested operation and returns a reply message
with status and (perhaps) requested data.
Network Block Storage
One approach to scalable storage is to attach raw block
storage to a network.
• abstraction: OS addresses storage by <volume, sector>.
iSCSI, Petal, FC: access through souped-up device driver
• dedicated Storage Area Network or general-purpose network
FibreChannel vs. Ethernet
• shared access with scalable bandwidth and capacity
• volume-based administrative tools
backup, volume replication, remote sharing
• Called “raw” or “block”, “storage volumes” or just “SAN”.
“NAS vs. SAN”
In the commercial sector there is a raging debate today about
“NAS vs. SAN”.
• Network-Attached Storage has been the dominant approach
to shared storage since NFS.
NAS == NFS or CIFS: named files over Ethernet/Internet.
Network Appliance “filers”
• Proponents of FibreChannel SANs market them as a
fundamentally faster way to access shared storage.
no “indirection through a file server” (“SAD”)
lower overhead on clients
network is better/faster (if not cheaper) and dedicated/trusted
Brocade, HP, Emulex are some big players.
NAS vs. SAN: Cutting through the BS
• FibreChannel is a high-end technology incorporating NIC
enhancements to reduce host overhead....
...but bogged down in interoperability problems.
• Ethernet is getting faster faster than FibreChannel.
gigabit, 10-gigabit, + smarter NICs, + smarter/faster switches
• Future battleground is Ethernet vs. Infiniband.
• The choice of network is fundamentally orthogonal to
storage service design.
Well, almost: flow control, RDMA, user-level access (DAFS/VI)
• The fundamental questions are really about abstractions.
shared raw volume vs. shared file volume vs. private disks
Storage Abstractions
• relational database (IBM and Oracle)
tables, transactions, query language
• file system
hierarchical name space with ACLs
• block storage
SAN, Petal, RAID-in-a-box (e.g., EMC)
• object storage
object == file, with a flat name space: NASD, DDS
• persistent objects
pointer structures, requires transactions: OODB, ObjectStore
Storage Architecture
Any of these abstractions can be built using any, some, or all
of the others.
Use the “right” abstraction for your application.
The fundamental questions are:
• What is the best way to build the abstraction you want?
division of function between device, network, server, and client
• What level of the system should implement the features and
properties you want?
How does Frangipani answer them?
Cluster File Systems
shared block storage service (FC/SAN, Petal, NASD)
xFS [Dahlin95]
Petal/Frangipani [Lee/Thekkath]
GFS
Veritas
EMC Celerra
[Figure: storage clients, each running a cluster FS, share the block storage
service.]
issues
trust
compatibility with NAS protocols
sharing, coordination, recovery
Sharing and Coordination
[Figure: *FS clients speak NAS protocols to *FS service instances, which
coordinate through a shared storage service plus lock manager over the “SAN”.]
block allocation and layout
locking/leases, granularity
shared access
separate lock service
logging and recovery
network partitions
reconfiguration
What does Frangipani need from Petal?
How does Petal contribute to Frangipani’s *ility?
Could we build Frangipani without Petal?