File-System Implementation (Galvin Notes, 9th Ed.)
Chapter 12: File-System Implementation
 FILE-SYSTEM STRUCTURE
 FILE-SYSTEM IMPLEMENTATION
 Overview
 Partitions and Mounting
 Virtual File Systems
 DIRECTORY IMPLEMENTATION
 Linear List
 Hash Table
 ALLOCATION METHODS
 Contiguous Allocation
 Linked Allocation
 Indexed Allocation
 Performance
 FREE-SPACE MANAGEMENT
 Bit Vector
 Linked List
 Grouping
 Counting
 Space Maps
 EFFICIENCY AND PERFORMANCE
 Efficiency
 Performance
 RECOVERY
 Consistency Checking
 Log-Structured File Systems
 Other Solutions
 Backup and Restore
SKIPPED CONTENT
 NFS (Optional--SKIPPED)
 Overview
 The Mount Protocol
 The NFS Protocol
 Path-Name Translation
 Remote Operations
 EXAMPLE: THE WAFL FILE SYSTEM (Optional--SKIPPED)
Content
FILE-SYSTEM STRUCTURE
 The file system resides permanently on secondary storage. This chapter is primarily concerned with issues surrounding file storage and
access on the most common secondary-storage medium, the disk.
 Hard disks have two important properties that make them suitable for secondary storage of files in file systems: (1) blocks of data can be
rewritten in place; it is possible to read a block from the disk, modify the block, and write it back into the same place, and (2) they are direct
access, allowing any block of data to be accessed with only (relatively) minor movements of the disk heads and rotational latency. (Disks are
usually accessed in physical blocks, one or more sectors, rather than a byte at a time. Block sizes may range from 512 bytes to 4K or
larger.)
 To provide efficient and convenient access to the disk, the OS imposes one or more file systems to allow the data to be stored, located, and
retrieved easily. One of the design problems a file system poses is creating algorithms and data structures to map the logical file system
onto the physical secondary-storage devices.

 The file system itself is generally composed of many different levels. The structure shown in Figure 11.1 is an example of a layered design,
where each level in the design uses the features of lower levels to create new features for use by higher levels.
 File systems organize storage on disk drives, and can be viewed as a layered design:
o At the lowest layer are the physical devices, consisting of the magnetic media, motors & controls, and the electronics connected to
them and controlling them. Modern disks put more and more of the electronic controls directly on the disk drive itself, leaving relatively
little work for the disk controller card to perform.
o I/O Control consists of device drivers, special software programs (often written in assembly) which communicate with the devices by
reading and writing special codes directly to and from memory addresses corresponding to the controller card's registers. Each
controller card (device) on a system has a different set of addresses (registers, a.k.a. ports) that it listens to, and a unique set of
command codes and result codes that it understands. (Book: The I/O control is the lowest level and consists of device drivers and
interrupt handlers to transfer information between the main memory and the disk system. A device driver can be thought of as a
translator. Its input consists of high-level commands such as "retrieve block 123". Its output consists of low-level, hardware-specific
instructions that are used by the hardware controller, which interfaces the I/O device to the rest of the system. The device driver
usually writes specific bit patterns to special locations in the I/O controller's memory to tell the controller which device location to act
on and what actions to take.)
o The basic file system level works directly with the device drivers in terms of retrieving and storing raw blocks of data, without any
consideration for what is in each block. Depending on the system, blocks may be referred to with a single block number (e.g. block #
234234), or with head-sector-cylinder combinations. (Book: The basic file system needs only to issue
generic commands to the appropriate device driver to read and write physical blocks on the disk. Each
physical block is identified by its numeric disk address (for example, drive 1, cylinder 73, track 2, sector 10).)
o The file organization module knows about files and their logical blocks, and how they map to physical
blocks on the disk. In addition to translating from logical to physical blocks, the file organization module
also maintains the list of free blocks, and allocates free blocks to files as needed. (Book: The
file-organization module knows about files and their logical blocks, as well as physical blocks. By knowing the
type of file allocation used and the location of the file, the file-organization module can translate logical
block addresses to physical block addresses for the basic file system to transfer. Each file's logical blocks
are numbered from 0 (or 1) through N. Since the physical blocks containing the data usually do not match
the logical numbers, a translation is needed to locate each block. The file-organization module also
includes the free-space manager, which tracks unallocated blocks and provides these blocks to the
file-allocation module when requested.)
o The logical file system deals with all of the metadata associated with a file (UID, GID, mode, dates, etc.),
i.e. everything about the file except the data itself. This level manages the directory structure and the
mapping of file names to file control blocks, FCBs, which contain all of the metadata as well as block
number information for finding the data on the disk. (IBM Knowledge Center: The logical file system is the
level of the file system at which users can request file operations by system call. This level of the file
system provides the kernel with a consistent view of what might be multiple physical file systems and multiple file system
implementations. As far as the logical file system is concerned, file system types, whether local, remote, or strictly logical, and
regardless of implementation, are indistinguishable.) (Book: The logical file system manages metadata information. Metadata includes
all of the file-system structure except the actual data (or contents of the files). The logical file system manages the directory structure
to provide the file-organization module with the information the latter needs, given a symbolic file name. It maintains file structure via
FCBs. An FCB contains information about the file, including ownership, permissions, and location of the file contents. The logical file
system is also responsible for protection and security.)
 The layered approach to file systems means that much of the code can be used uniformly for a wide variety of different file systems, and
only certain layers need to be filesystem specific. (Book: When a layered structure is used for file-system implementation, duplication of
code is minimized. The I/O control and sometimes the basic file-system code can be used by multiple file systems. Each file system can then
have its own logical file system and file-organization modules.)
 Most operating systems support more than one file system. In addition to removable-media file systems, each OS has one disk-based file
system (or more). UNIX uses the UNIX file system (UFS), which is based on the Berkeley Fast File System (FFS). Windows NT, 2000, and XP
support disk file-system formats of FAT, FAT32, and NTFS (or Windows NT File System), as well as CD-ROM, DVD and floppy-disk file-system
formats. Although Linux supports over 40 different file systems, the standard Linux file system is known as the extended file system, with
the most common versions being ext2 and ext3.
FILE SYSTEM IMPLEMENTATION
As was described in Section 10.1.2, operating systems implement open() and close() system calls for processes to request access to file contents. In
this section, we delve into the structures and operations used to implement file-system operations.
Overview
Several on-disk and in-memory structures are used to implement a file system. These structures vary depending on the OS and the file system, but
some general principles apply.
On disk, the file system may contain information about how to boot an operating system stored there, the total number of blocks, the number
and location of free blocks, the directory structure, and individual files. Many of these structures are detailed throughout the remainder of this
chapter; here we describe them briefly.
 File systems store several important data structures on the disk (the Illinois part is erroneous in places; refer to the Book parts):
o A boot-control block (per volume), a.k.a. the boot block in UNIX or the partition boot sector in Windows, contains information
about how to boot the system off of this disk. This will generally be the first sector of the volume if there is a bootable system
loaded on that volume, or the block will be left vacant otherwise. (Book: A boot control block (per volume) can contain
information needed by the system to boot an operating system from that volume. If the disk does not contain an operating
system, this block can be empty. It is typically the first block of a volume. In UFS, this is called the boot block; in NTFS, it is the
partition boot sector.)
o A volume control block (per volume), a.k.a. the superblock in UNIX or the master file table in Windows, which contains
information such as the partition table, number of blocks on each filesystem, and pointers to free blocks and free FCB blocks.
(Book: A volume control block (per volume) contains volume (or partition) details, such as the number of blocks in the partition,
size of the blocks, free-block count and free-block pointers, and free FCB count and FCB pointers. In UFS, this is called a
superblock; in NTFS, it is stored in the master file table.)
o A directory structure (per file system), containing file names and pointers to corresponding FCBs. UNIX uses inode numbers, and
NTFS uses a master file table. (Book: A directory structure per file system is used to organize the files. In UFS, this includes file
names and associated inode numbers. In NTFS, it is stored in the master file table.)
o The File Control Block, FCB (per file), containing details about ownership, size, permissions, dates, etc. UNIX stores this
information in inodes, and NTFS in the master file table as a relational database structure. (Book: A per-file FCB contains many
details about the file, including file permissions, ownership, size, and location of the data blocks. In UFS, this is called the inode. In
NTFS, this information is actually stored within the master file table, which uses a relational database structure, with a row per
file.)
 There are also several key data structures stored in memory (Book: The in-memory information is used for both file-system management
and performance improvement via caching. The data are loaded at mount time and discarded at dismount. The structures may include the
ones described below):
o An in-memory mount table contains information about each mounted volume.
o An in-memory directory-structure cache holds the directory information of recently accessed directories. (For directories at which
volumes are mounted, it can contain a pointer to the volume table.)
o The system-wide open-file table contains a copy of the FCB of each open file, as well as other information.
o A per-process open-file table, containing a pointer to the system open-file table as well as some other information. (For example,
the current file position pointer may be either here or in the system file table, depending on the implementation and whether the
file is being shared or not.) (Book: The per-process open-file table contains a pointer to the appropriate entry in the system-wide
open-file table, as well as other information.)
 Interactions of file system components when files are created and/or used:
To create a new file, an application program calls the logical file system, which
knows the format of the directory structures and allocates a
new FCB. (Alternatively, if the file-system implementation creates all FCBs at
file-system creation time, an FCB is allocated from the set of free FCBs.) The system
then reads the appropriate directory into memory, updates it with the new file
name and FCB, and writes it back to the disk. A typical FCB is shown in Figure 11.2.
Some operating systems, including UNIX, treat a directory exactly the same as a
file, one with a type field indicating that it is a directory. Other operating systems,
including Windows NT, implement separate system calls for files and directories
and treat directories as entities separate from files. Whatever the larger structural
issues, the logical file system can call the file-organization module to map the
directory I/O into disk-block numbers, which are passed on to the basic file system
and I/O control system.
Now that a file has been created, it can be used for I/O. First, though, it must be opened. The open() call passes a file name to the file
system. The open() system call first searches the system-wide open-file table to see if the file is already in use by another process. If it is, a
per-process open-file table entry is created pointing to the existing system-wide open-file table entry. This algorithm can save substantial
overhead. If the file is not already open, the directory structure is searched for the given file name. Parts of the directory structure are usually
cached in memory to speed directory operations. Once the file is found, the FCB is copied into a system-wide open-file table in memory.
This table not only stores the FCB but also tracks the number of processes that have the file open.
Next, an entry is made in the per-process open-file table, with a pointer to the entry in the system-wide open-file table and some other
fields. These other fields can include a pointer to the current location in the file (for the next read() or write() operation) and the access
mode in which the file is open. The open() call returns a pointer to the appropriate entry in the per-process file-system table. All file
operations are then performed via this pointer. The file name may not be part of the open-file table, as the system has no use for it once
the appropriate FCB is located on disk. It could be cached, though, to save time on subsequent opens of the same file. The name given to
the entry varies. UNIX systems refer to it as a file descriptor; Windows refers to it as a file handle. Consequently, as long as the file is not
closed, all file operations are done on the open-file table.
When a process closes the file, the per-process table entry is removed, and the system-wide entry's open count is decremented. When
all users that have opened the file close it, any updated metadata is copied back to the disk-based directory structure, and the system-wide
open-file table entry is removed.
Some systems complicate this scheme further by using the file system as an interface to other system aspects, such as networking. For
example, in UFS, the system-wide open-file table holds the inodes and other information for files and directories. It also holds similar
information for network connections and devices. In this way, one mechanism is used for multiple purposes.
The caching aspects of file-system structures should not be overlooked. Most systems keep all information about an open file, except for
its actual data blocks, in memory. The BSD UNIX system is typical in its use of caches wherever disk I/O can be saved. Its average cache hit
rate of 85% shows that these techniques are well worth implementing.
The operating structures of a file-system implementation are summarized in Figure 11.3.
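The open() and close() bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (FCB, SystemEntry, OpenFileTables) are hypothetical, the "disk" directory is a plain dictionary, and a file descriptor is simply the lowest unused integer in the per-process table.

```python
import copy
import itertools
from dataclasses import dataclass, field


@dataclass
class FCB:
    """Hypothetical file control block: metadata plus data-block numbers."""
    name: str
    owner: str
    size: int = 0
    blocks: list = field(default_factory=list)


@dataclass
class SystemEntry:
    """System-wide open-file table entry: an FCB copy and an open count."""
    fcb: FCB
    open_count: int = 0


class OpenFileTables:
    def __init__(self, directory):
        self.directory = directory  # file name -> FCB "on disk" (assumption)
        self.system_table = {}      # file name -> SystemEntry
        self.per_process = {}       # pid -> {descriptor: [SystemEntry, offset]}

    def open(self, pid, name):
        entry = self.system_table.get(name)
        if entry is None:
            # Not open anywhere: search the directory, copy the FCB into memory.
            entry = SystemEntry(fcb=copy.copy(self.directory[name]))
            self.system_table[name] = entry
        entry.open_count += 1
        table = self.per_process.setdefault(pid, {})
        fd = next(i for i in itertools.count() if i not in table)
        table[fd] = [entry, 0]  # pointer to system-wide entry, current offset
        return fd

    def close(self, pid, fd):
        entry, _offset = self.per_process[pid].pop(fd)
        entry.open_count -= 1
        if entry.open_count == 0:
            # Last close: copy metadata back and drop the system-wide entry.
            self.directory[entry.fcb.name] = entry.fcb
            del self.system_table[entry.fcb.name]
```

Two processes opening the same file share one system-wide entry whose open count is 2; the entry disappears only after both have closed it.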

 Before moving on to the next section, go to the reference material on MBT, MFT, VBR and FCB in the “Assorted Content” section.
Partitions and Mounting
 Partitions can either be used as raw devices (with no structure imposed upon them), or they can be formatted to hold a filesystem (i.e.
populated with FCBs and initial directory structures as appropriate). Raw partitions are generally used for swap space, and may also be used
for certain programs such as databases that choose to manage their own disk storage system. Partitions containing filesystems can
generally only be accessed using the file system structure by ordinary users, but can often be accessed as a raw device also by root.
 The boot block is accessed as part of a raw partition, by the boot program prior to any operating system being loaded. Modern boot
programs understand multiple OSes and filesystem formats, and can give the user a choice of which of several available systems to boot.
 The root partition contains the OS kernel and at least the key portions of the OS needed to complete the boot process. At boot time the
root partition is mounted, and control is transferred from the boot program to the kernel found there. (Older systems required that the
root partition lie completely within the first 1024 cylinders of the disk, because that was as far as the boot program could reach. Once the
kernel had control, then it could access partitions beyond the 1024 cylinder boundary.)
 Continuing with the boot process, additional filesystems get mounted, adding their information into the appropriate mount table structure.
As a part of the mounting process the file systems may be checked for errors or inconsistencies, either because they are flagged as not
having been closed properly the last time they were used, or just on
general principle. Filesystems may be mounted either automatically or
manually. In UNIX a mount point is indicated by setting a flag in the
in-memory copy of the inode, so all future references to that inode get
redirected to the root directory of the mounted filesystem.
Virtual File Systems: Virtual File Systems, VFS, provide a common interface to
multiple different filesystem types. In addition, the VFS provides for a unique
identifier (vnode) for files across the entire space, including across all
filesystems of different types. (UNIX inodes are unique only across a single
filesystem, and certainly do not carry across networked file systems.) The VFS
in Linux is based upon four key object types: (a) The inode object, representing
an individual file. (b) The file object, representing an open file. (c) The
superblock object, representing a filesystem. (d) The dentry object,
representing a directory entry.
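The idea of one common interface over several filesystem types can be illustrated with a small Python sketch. It is not how the Linux VFS is written; the class names (FileSystem, LocalFS, RemoteFS, VFS) and the single read() operation are assumptions made for illustration, and RemoteFS performs no real network I/O.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Interface every concrete file system must implement (hypothetical)."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class LocalFS(FileSystem):
    """Stand-in for a disk-based file system backed by a dict."""

    def __init__(self, files):
        self.files = dict(files)

    def read(self, path):
        return self.files[path]


class RemoteFS(FileSystem):
    """Stand-in for a networked file system; no real network I/O happens."""

    def read(self, path):
        return ("remote:" + path).encode()


class VFS:
    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point] = fs

    def read(self, path):
        # Pick the longest mount point that is a prefix of the path.
        matches = [m for m in self.mounts
                   if path == m or path.startswith(m.rstrip("/") + "/")]
        best = max(matches, key=len)
        relative = path[len(best.rstrip("/")):] or "/"
        return self.mounts[best].read(relative)


vfs = VFS()
vfs.mount("/", LocalFS({"/etc/hosts": b"127.0.0.1 localhost"}))
vfs.mount("/mnt/net", RemoteFS())
print(vfs.read("/etc/hosts"))     # served by LocalFS
print(vfs.read("/mnt/net/data"))  # served by RemoteFS
```

The caller uses the same read() call for both paths; only the mount table decides which implementation handles the request.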
DIRECTORY IMPLEMENTATION
The selection of directory-allocation and directory-management algorithms significantly affects the efficiency, performance and reliability of the
file system. In this section, we discuss the trade-offs involved in choosing one of these algorithms. (Directories need to be fast to search, insert, and
delete, with a minimum of wasted disk space.)
 Linear List: The simplest method of implementing a directory is to use a linear list of file names with pointers to the data blocks. This
method is simple to program but time-consuming to execute. To create a new file, we must first search the directory to be sure that no
existing file has the same name. Then, we add a new entry at the end of the directory. To delete a file, we search the directory for the
named file, then release the space allocated to it. To reuse the directory entry, we can do one of several things. We can mark the entry as
unused (by assigning it a special name, such as an all-blank name, or with a used-unused bit in each entry), or we can attach it to a list of
free directory entries. A third alternative is to copy the last entry in the directory into the freed location and to decrease the length of the
directory. A linked list can also be used to decrease the time required to delete a file
(there is an overhead for the links).
The real disadvantage of a linear list of directory entries is that finding a file
requires a linear search. Directory information is used frequently, and users will notice
if access to it is slow.
A sorted list allows a binary search and decreases the average search time.
However, the requirement that the list be kept sorted may complicate creating and
deleting files, since we may have to move substantial amounts of directory
information to maintain a sorted directory. A more sophisticated tree data
structure, such as a B-tree, might help here. An advantage of the sorted list is that a
sorted directory listing can be produced without a separate sort step.
 Hash table: Another data structure for a file directory is a hash table. With this
method, a linear list stores the directory entries, but a hash data structure is also
used. The hash table takes a value computed from the file name and returns a pointer
to the file name in the linear list. Therefore it can greatly decrease the directory search time.
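The combination of a linear list with a hash index can be sketched as follows. This is a minimal illustration rather than any particular file system's format: the Directory class and its fcb_number values are hypothetical, Python's dict stands in for the hash table, and deletion uses the "copy the last entry into the freed location" alternative mentioned for linear lists.

```python
class Directory:
    """Linear list of entries plus a hash index from name to list position."""

    def __init__(self):
        self.entries = []  # linear list of (name, fcb_number) tuples
        self.index = {}    # hash table: name -> position in self.entries

    def create(self, name, fcb_number):
        if name in self.index:  # hash lookup instead of a linear search
            raise FileExistsError(name)
        self.entries.append((name, fcb_number))
        self.index[name] = len(self.entries) - 1

    def lookup(self, name):
        return self.entries[self.index[name]][1]

    def delete(self, name):
        # Move the last entry into the freed slot and shrink the list by one.
        pos = self.index.pop(name)
        last = self.entries.pop()
        if pos < len(self.entries):
            self.entries[pos] = last
            self.index[last[0]] = pos


d = Directory()
d.create("a.txt", 10)
d.create("b.txt", 11)
d.create("c.txt", 12)
d.delete("a.txt")
print(d.entries)  # [('c.txt', 12), ('b.txt', 11)]
```

Lookups, duplicate checks and deletions avoid scanning the list; the cost is keeping the index consistent whenever an entry moves.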
ALLOCATION METHODS
Here we discuss how to allocate space to files so that disk space is utilized effectively and files can be accessed quickly. Three major methods of
allocating disk space are in wide use: contiguous, linked and indexed. Some systems (such as Data General's RDOS for its Nova line of computers)
support all three. More commonly, a system uses one method for all files within a file system type.
Contiguous Allocation: Contiguous allocation requires that all blocks of a file be kept together contiguously. Performance is very fast, because reading successive
blocks of the same file generally requires no movement of the disk heads, or at most one small step to the next adjacent cylinder.
 Storage allocation involves the same issues discussed earlier for the allocation of contiguous
blocks of memory (first fit, best fit, fragmentation problems, etc.). The distinction is that the
high time penalty required for moving the disk heads from spot to spot may now justify the
benefits of keeping files contiguous when possible. (Even file systems that do not by
default store files contiguously can benefit from certain utilities that compact the disk and
make all files contiguous in the process.)
 Problems can arise when files grow, or if the exact size of a file is unknown at creation time:
Over-estimation of the file's final size increases external fragmentation and wastes disk
space. Under-estimation may require that a file be moved or a process aborted if the file
grows beyond its originally allocated space. If a file grows slowly over a long time period and
the total final space must be allocated initially, then a lot of space becomes unusable before
the file fills the space.
 To minimize these drawbacks, some operating systems use a modified contiguous-allocation
scheme. Here, a contiguous chunk of space is allocated initially; and then, if that amount
proves not to be large enough, another chunk of contiguous space, known as an extent, is
added. The location of the file's blocks is then recorded as a location and a block count, plus
a link to the first block of the next extent (used by the Veritas file system).
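First-fit allocation of contiguous runs, with a second extent added when a file outgrows its first chunk, might look like the sketch below. The free map is modelled as a list of booleans and each extent as a (start, count) pair; both representations, and the function names, are assumptions made for illustration.

```python
def first_fit(free_map, length):
    """Return the start of the first run of `length` free blocks, or None."""
    run_start, run_len = None, 0
    for block, is_free in enumerate(free_map):
        if is_free:
            if run_len == 0:
                run_start = block
            run_len += 1
            if run_len == length:
                return run_start
        else:
            run_len = 0
    return None


def allocate_extent(free_map, length):
    """Mark a contiguous run as used and return it as (start, count)."""
    start = first_fit(free_map, length)
    if start is None:
        raise OSError("no contiguous run of %d free blocks" % length)
    for block in range(start, start + length):
        free_map[block] = False
    return (start, length)


# Blocks 0-2 free, 3-4 in use, 5-9 free.
free_map = [True] * 3 + [False] * 2 + [True] * 5
extents = [allocate_extent(free_map, 4)]     # initial chunk of 4 blocks
extents.append(allocate_extent(free_map, 2)) # file grew: add an extent
print(extents)  # [(5, 4), (0, 2)]
```

The initial request skips the 3-block hole at the start of the disk because it is too small, which is exactly the external-fragmentation effect the text describes.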
Linked Allocation: Linked allocation solves all problems of contiguous allocation. With linked allocation, each file is a linked list of disk blocks; the
disk blocks may be scattered anywhere on the disk. The directory contains a pointer to the first and last blocks of the file (each block contains a
pointer to the next block). These pointers are not made available to the user. Thus, if each block is 512 bytes in size, and a disk address (the
pointer) requires 4 bytes, then the user sees blocks of 508 bytes.
 To create a new file, we simply create a new entry in the directory. With linked allocation, each directory entry has a pointer to the first
disk block of the file. This pointer is initialized to nil (the end-of-list pointer value) to
signify an empty file. The size field is also set to 0. A write to the file causes the
free-space management system to find a free block, and this new block is written to and is
linked to the end of the file. To read a file, we simply read blocks by following the
pointers from block to block. There is no external fragmentation with linked
allocation, and any free block on the free-space list can be used to satisfy a request.
The size of a file need not be declared when that file is created. A file can continue to
grow as long as free blocks are available. Consequently, it is never necessary to
compact disk space.
 Linked allocation does have disadvantages, however. The major problem is that it can
be used effectively only for sequential-access files. To find the ith block of a file, we
must start at the beginning of that file and follow the pointers until we get to the ith
block. Each access to a pointer requires a disk read, and some require a disk seek.
Consequently, it is inefficient to support a direct-access capability for linked-allocation
files. (Another disadvantage is the space required for the pointers.)
 The usual solution to this problem is to collect blocks into multiples, called clusters,
and to allocate clusters rather than blocks. For instance, the file system may define a
cluster as four blocks and operate on the disk only in cluster units. Pointers then use a
much smaller percentage of the file's disk space. The cost of this
approach is an increase in internal fragmentation, because more
space is wasted when a cluster is partially full than when a block is
partially full. Clusters can be used to improve the disk-access time
for many other algorithms as well, so they are used in most file
systems.
 Another problem of linked allocation is reliability. The files are
linked together by pointers scattered all over the disk, so consider
what would happen if a pointer were lost or damaged. One partial
solution is to use doubly-linked lists, and another is to store the file
name and relative block number in each block; however, these
schemes require even more overhead for each file.
 An important variation on linked allocation is the use of a file-allocation
table (FAT). This simple but efficient method of disk-space
allocation is used by the MS-DOS and OS/2 operating systems. A
section of disk at the beginning of each volume is set aside to
contain the table. The table has one entry for each disk block and is indexed by block number. The FAT is used in much the same way as a
linked list. The directory entry contains the block number of the first block of the file. The table entry indexed by that block number
contains the block number of the next block in the file. This chain continues until the last block, which has a special end-of-file value as
the table entry. Unused blocks are indicated by a 0 table value. Allocating a new block to a file is a simple matter of finding the first
0-valued table entry and replacing the previous end-of-file value with the address of the new block. The 0 is then replaced with the end-of-file
value. An illustrative example is the FAT structure shown in Figure 11.7 for a file consisting of disk blocks 217, 618, and 339.
The FAT allocation scheme can result in a significant number of disk head seeks, unless the FAT is cached. The disk head must move to
the start of the volume to read the FAT and find the location of the block in question, then move to the location of the block itself. In the
worst case, both moves occur for each of the blocks. A benefit is that random-access time is improved, because the disk head can find
the location of any block by reading the information in the FAT.
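The FAT chain for the file made of blocks 217, 618 and 339 can be traced with a short sketch. The table is modelled as a Python list indexed by block number; the EOF marker value (-1) and the decision to treat block 0 as reserved are assumptions of this sketch, not details of the real on-disk FAT format.

```python
FREE = 0   # table value marking an unused block
EOF = -1   # hypothetical end-of-file marker; real FATs reserve special values


def file_blocks(fat, first_block):
    """Follow the chain from the directory entry's first block to EOF."""
    blocks = []
    block = first_block
    while block != EOF:
        blocks.append(block)
        block = fat[block]
    return blocks


def append_block(fat, first_block):
    """Link the first free block onto the end of the file's chain."""
    # Block 0 is treated as reserved so that 0 can unambiguously mean "free".
    new_block = next(i for i in range(1, len(fat)) if fat[i] == FREE)
    last_block = file_blocks(fat, first_block)[-1]
    fat[last_block] = new_block  # old end-of-file entry now points onward
    fat[new_block] = EOF         # the 0 is replaced with end-of-file
    return new_block


fat = [FREE] * 1024
fat[217], fat[618], fat[339] = 618, 339, EOF
print(file_blocks(fat, 217))  # [217, 618, 339]
```

Finding the ith block still means walking i table entries, but every step reads the (cacheable) table rather than a data block on disk.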
Indexed Allocation: Linked allocation solves the external-fragmentation and size-declaration problems of contiguous allocation. However, in the
absence of a FAT, linked allocation cannot support efficient direct access, since the pointers to the blocks are scattered with the blocks themselves
all over the disk and must be retrieved in order. Indexed allocation solves this problem by bringing all the pointers together into one location: the
index block.
 Each file has its own index block, which is an array of disk-block addresses. The ith entry in the index block points to the ith block of the
file. The directory contains the address of the index block (Figure 11.8). To find and read the ith block, we use the pointer in the ith
index-block entry. This scheme is similar to the paging scheme described in Section 8.4.
 When the file is created, all pointers in the index block are set to nil. When the ith block is first written, a block is obtained from the
free-space manager, and its address is put in the ith index-block entry.
 Indexed allocation supports direct access, without suffering from external
fragmentation, because any free block on the disk can satisfy a request for
more space. Indexed allocation does suffer from wasted space, however. The
pointer overhead of the index block is generally greater than the pointer
overhead of linked allocation. Consider a common case in which we have a
file of only one or two blocks. With linked allocation, we lose the space of
only one pointer per block. With indexed allocation, an entire index block
must be allocated, even if only one or two pointers will be non-nil.
 This point raises the question of how large the index block should be. Every
file must have an index block, so we want the index block to be as small as
possible. If the index block is too small, however, it will not be able to hold
enough pointers for a large file, and a mechanism will have to be available to
deal with the issue. Mechanisms for this purpose include the following:
o Linked scheme: An index block is normally one disk block. Thus, it
can be read and written directly by itself. To allow for large files,
we can link together several index blocks. For example, an index
block might contain a small header giving the name of the file and a
set of the first 100 disk-block addresses. The next address (the last word in the index block) is nil (for a small file) or is a pointer
to another index block (for a large file).
o Multilevel index: A variant of the linked representation is to use a
first-level index block to point to a set of second-level index blocks, which in
turn point to the file blocks. To access a block, the OS uses the
first-level index to find a second-level index block and then uses that block
to find the desired data block. This approach could be continued to a
third or fourth level, depending on the desired maximum file size.
With 4096-byte blocks, we could store 1,024 4-byte pointers in an
index block. Two levels of indexes allow 1,048,576 data blocks and a
file size of up to 4 GB.
o Combined scheme: Another alternative, used in the UFS, is to keep
the first, say, 15 pointers of the index block in the file's inode. The
first 12 of these pointers point to direct blocks; that is, they contain
addresses of blocks that contain data of the file. Thus, the data for
small files (of no more than 12 blocks) do not need a separate index
block. If the block size is 4 KB, then up to 48 KB of data can be
accessed directly. The next three pointers point to indirect blocks. The first points to a single indirect block, which is an index
block containing not data but the addresses of blocks that do contain data. The second points to a double indirect block, which
contains the address of a block that contains the addresses of blocks that contain pointers to the actual data blocks. The last
pointer contains the address of a triple indirect block. Under this method, the number of blocks that can be allocated to a file
exceeds the amount of space addressable by the 4-byte file pointers used by many OSes. A 32-bit file pointer reaches only 2^32
bytes, or 4 GB. Many UNIX implementations, including Solaris and IBM's AIX, now support up to 64-bit file pointers. Pointers of
this size allow files and file systems to be terabytes in size. A UNIX inode is shown in Figure 11.9.
 Indexed-allocation schemes suffer from some of the same performance problems as does linked allocation. Specifically, the index blocks
can be cached in memory, but the data blocks may be spread all over a volume.
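The arithmetic behind the multilevel and combined schemes can be checked with a few lines of Python, using the figures from the text (4096-byte blocks, 4-byte pointers, 12 direct pointers followed by single, double and triple indirect pointers). The function names are hypothetical.

```python
BLOCK_SIZE = 4096
POINTER_SIZE = 4
POINTERS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 1,024 pointers per index block
DIRECT = 12


def level_for(logical_block):
    """Say which kind of inode pointer reaches a given logical block number."""
    n = logical_block
    if n < DIRECT:
        return "direct"
    n -= DIRECT
    for depth, name in enumerate(("single indirect", "double indirect",
                                  "triple indirect"), start=1):
        capacity = POINTERS_PER_BLOCK ** depth
        if n < capacity:
            return name
        n -= capacity
    raise ValueError("block number beyond the triple indirect range")


def max_file_bytes():
    """Largest file the combined scheme can map, ignoring pointer-width limits."""
    blocks = DIRECT + sum(POINTERS_PER_BLOCK ** d for d in (1, 2, 3))
    return blocks * BLOCK_SIZE


print(POINTERS_PER_BLOCK ** 2)               # 1048576 data blocks with two levels
print(POINTERS_PER_BLOCK ** 2 * BLOCK_SIZE)  # 4294967296 bytes, i.e. 4 GB
print(level_for(11), "/", level_for(12))     # direct / single indirect
```

Since max_file_bytes() is far larger than 2^32, the mapping structure can describe more data than a 32-bit file pointer can address, which is the point the text makes about moving to 64-bit file pointers.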

Performance: The optimal allocation method is different for sequential-access files than for random-access files, and is also different for small files
than for large files. Some systems support more than one allocation method, which may require specifying how the file is to be used (sequential or
random access) at the time it is allocated. Such systems also provide conversion utilities. Some
systems have been known to use contiguous allocation for small files, and automatically switch to an
indexed scheme when file sizes surpass a certain threshold. And of course some systems adjust their
allocation schemes (e.g. block sizes) to best match the characteristics of the hardware for optimum
performance.
FREE-SPACE MANAGEMENT
Another important aspect of disk management is keeping track of and allocating free space.
 Bit Vector: One simple approach is to use a bit vector, in which each bit represents a disk
block, set to 1 if free or 0 if allocated. Fast algorithms exist for quickly finding contiguous
blocks of a given size. The down side is the space the bitmap itself takes: for example, a 40 GB disk
with 1 KB blocks requires about 5 MB just to store the bitmap.
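A minimal sketch of the bit-vector idea follows. The bit layout (block i stored in bit i % 8 of byte i // 8) and both function names are assumptions made for illustration; real systems typically scan a whole word at a time rather than one bit at a time.

```python
def bitmap_bytes(disk_bytes, block_bytes):
    """Size in bytes of a one-bit-per-block free map."""
    blocks = disk_bytes // block_bytes
    return (blocks + 7) // 8


def first_free_run(bitmap, n_blocks, run_length):
    """Return the first block that starts `run_length` free blocks, or -1.

    A bit value of 1 means free and 0 means allocated.
    """
    run = 0
    for block in range(n_blocks):
        if (bitmap[block // 8] >> (block % 8)) & 1:
            run += 1
            if run == run_length:
                return block - run_length + 1
        else:
            run = 0   # an allocated block breaks the run
    return -1


# Blocks 6, 7, 8 and 9 are free; everything else is allocated.
free_map = bytearray([0b11000000, 0b00000011])
print(first_free_run(free_map, 16, 4))
print(bitmap_bytes(40 * 2 ** 30, 2 ** 10))   # 40 GB disk, 1 KB blocks
```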
 Linked List: A linked list can also be used to keep track of all free blocks. Traversing the list
and/or finding a contiguous block of a given size are not easy, but fortunately are not
frequently needed operations. Generally the system just adds and removes single blocks
from the beginning of the list. The FAT table keeps track of the free list as just one more
linked list on the table.
 Grouping: A variation on linked-list free lists is to use links of blocks of indices of free blocks. If a block holds up to N addresses, then the
first block in the linked list contains up to N-1 addresses of free blocks and a pointer to the next block of free addresses.
 Counting: When there are multiple contiguous blocks of free space, the system can keep track of the starting address of the group
and the number of contiguous free blocks. As long as the average length of a contiguous group of free blocks is greater than two, this
offers a savings in space needed for the free list. (Similar to compression techniques used for graphics images when a group of pixels all
the same color is encountered.)
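The counting technique amounts to run-length encoding the free list. The sketch below is illustrative only; the function name to_extents and the sample block numbers are made up.

```python
def to_extents(free_blocks):
    """Collapse free block numbers into (start, count) runs."""
    extents = []
    for block in sorted(set(free_blocks)):
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, count = extents[-1]
            extents[-1] = (start, count + 1)   # extend the current run
        else:
            extents.append((block, 1))         # start a new run
    return extents


free = [2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 17, 18, 25, 26, 27]
extents = to_extents(free)
print(extents)
print(len(free), "addresses versus", 2 * len(extents), "numbers")
```

Each run costs two numbers (start and count), so the encoding only pays off when runs average more than two blocks, which matches the condition stated above.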
 Space Maps: Sun's ZFS file system was designed for HUGE numbers and sizes of files, directories, and even file systems. The resulting
data structures could be VERY inefficient if not implemented carefully. For example, freeing up a 1 GB file on a 1 TB file system could
involve updating thousands of blocks of free-list bit maps if the file was spread across the disk. ZFS uses a combination of techniques,
starting with dividing the disk up into (hundreds of) metaslabs of a manageable size, each having its own space map. Free blocks are
managed using the counting technique, but rather than write the information to a table, it is recorded in a log-structured transaction
record. Adjacent free blocks are also coalesced into a larger single free block. An in-memory space map is constructed using a balanced
tree data structure, constructed from the log data. The combination of the in-memory tree and the on-disk log provides for very fast and
efficient management of these very large files and free blocks.
EFFICIENCY AND PERFORMANCE
 Efficiency: The efficient use of disk space depends heavily on the disk allocation and directory algorithms in use. For instance, UNIX
pre-allocates inodes, which occupy space even before any files are created. UNIX also distributes inodes across the disk, and tries to store
data files near their inode, to reduce the distance of disk seeks between the inodes and the data. Some systems use variable size clusters
depending on the file size. The more data that is stored in a directory (e.g., information like last access time), the more often the
directory blocks have to be re-written. As technology advances, addressing schemes have had to grow as well. Sun's ZFS file system uses
128-bit pointers, which should theoretically never need to be expanded. (The mass required to store 2^128 bytes with atomic storage
would be at least 272 trillion kilograms!) Kernel table sizes used to be fixed, and could only be changed by rebuilding the kernels. Modern
tables are dynamically allocated, but that requires more complicated algorithms for accessing them.
 Performance: Even after the basic file-system algorithms have been selected, we can still improve performance in several ways. Disk
controllers generally include on-board caching. When a seek is requested, the heads are moved into place, and then an entire track is
read, starting from whatever sector is currently under the
heads (reducing latency). The requested sector is returned
and the unrequested portion of the track is cached in the
disk's electronics. Some OSes cache disk blocks they expect
to need again in a buffer cache. A page cache connected to
the virtual memory system is actually more efficient, as
memory addresses do not need to be converted to disk
block addresses and back again. Some systems (Solaris,
Linux, Windows 2000, NT, XP) use page caching for both
process pages and file data in a unified virtual memory.
Figures 11.11 and 11.12 show the advantages of the unified
buffer cache found in some versions of UNIX and Linux:
data does not need to be stored twice, and problems of
inconsistent buffer information are avoided. (Book: Some
systems maintain a separate section of main memory for a
buffer cache, where blocks are kept under the assumption
that they will be used again shortly. Other systems cache file data using a page cache. The page cache uses virtual memory techniques to
cache file data as pages rather than as file-system-oriented blocks. Caching file data using virtual addresses is far more efficient than
caching through physical blocks, as accesses interface with virtual memory rather than the file system. Several systems, including
Solaris, Linux, and Windows NT/XP, use page caching to cache both process pages and file data. This is known as unified virtual memory.)
(Book: Some versions of UNIX and Linux provide a unified buffer cache. To illustrate the benefits of the unified buffer cache, consider
the two alternatives for opening and accessing a file. One approach is to use memory mapping (section 9.7); the second is to use the
standard system calls read() and write(). Without a unified buffer cache, we have a situation similar to Figure 11.11. Here, read() and
write() system calls go through the buffer cache. The memory-mapping call, however, requires using two caches - the page cache and the
buffer cache. A memory mapping proceeds by reading in disk blocks from the file system and storing them in the buffer cache. Because
the virtual memory system does not interface with the buffer cache, the contents of the file in the buffer cache must be copied into the page
cache. This situation is known as double caching and requires caching file-system data twice. Not only does it waste memory but it also
wastes significant CPU and I/O cycles due to the extra data movement within system memory. In addition, inconsistencies between the
two caches can result in corrupt files. In contrast, when a unified buffer cache is provided, both memory mapping and the read() and
write() system calls use the same page cache. This has the benefit of avoiding double caching, and it allows the virtual memory system to
manage file-system data. The unified buffer cache is shown in Figure 11.12.)
o Page replacement strategies can be complicated with a unified cache, as one needs to decide whether to replace process or file
pages, and how many pages to guarantee to each category of pages. Solaris, for example, has gone through many variations,
resulting in priority paging giving process pages priority over file I/O pages, and setting limits so that neither can knock the
other completely out of memory.
o Another issue affecting performance is the question of whether to implement synchronous writes or asynchronous writes.
Synchronous writes occur in the order in which the disk subsystem receives them, without caching; asynchronous writes are
cached, allowing the disk subsystem to schedule writes in a more efficient order (see Chapter 12). Metadata writes are often
done synchronously. Some systems support flags to the open call requiring that writes be synchronous, for example for the
benefit of database systems that require their writes be performed in a required order.
o The type of file access can also have an impact on optimal page replacement policies. For example, LRU is not necessarily a
good policy for sequential access files. For these types of files progression normally goes in a forward direction only, and the
most recently used page will not be needed again until after the file has been rewound and re-read from the beginning (if it is
ever needed at all). On the other hand, we can expect to need the next page in the file fairly soon. For this reason sequential
access files often take advantage of two special policies:
 Free-behind frees up a page as soon as the next page in the file is requested, with the assumption that we are now
done with the old page and won't need it again for a long time.
 Read-ahead reads the requested page and several subsequent pages at the same time, with the assumption that
those pages will be needed in the near future. This is similar to the track caching that is already performed by the disk
controller, except it saves the future latency of transferring data from the disk controller memory into motherboard
main memory.
o The caching system and asynchronous writes speed up disk writes considerably, because the disk subsystem can schedule
physical writes to the disk to minimize head movement and disk seek times (see Chapter 12). Reads, on the other hand, must
be done more synchronously in spite of the caching system, with the result that disk writes can counter-intuitively be much
faster on average than disk reads.
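The free-behind and read-ahead policies described above can be sketched as a toy cache for a single sequentially read file. The class SequentialCache and its fields are hypothetical, and the sketch ignores end-of-file handling and memory limits.

```python
class SequentialCache:
    """Toy page cache for one sequentially read file (hypothetical)."""

    def __init__(self, read_page, read_ahead=2):
        self.read_page = read_page      # callable: page number -> data
        self.read_ahead = read_ahead    # extra pages fetched on each miss
        self.pages = {}                 # page number -> cached data
        self.disk_requests = 0          # one batched request per miss

    def get(self, page_no):
        if page_no not in self.pages:
            # Read-ahead: fetch the requested page plus the next few.
            for n in range(page_no, page_no + self.read_ahead + 1):
                self.pages[n] = self.read_page(n)
            self.disk_requests += 1
        # Free-behind: the previous page is assumed to be finished with.
        self.pages.pop(page_no - 1, None)
        return self.pages[page_no]


cache = SequentialCache(lambda n: "page-%d" % n, read_ahead=2)
data = [cache.get(n) for n in range(6)]
print(cache.disk_requests, sorted(cache.pages))
```

Reading six pages in order costs two batched requests instead of six, and only the page currently in use stays cached once its predecessor has been freed.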
RECOVERY
Files and directories are kept both in main memory and on disk, and care must be taken to ensure that system failure does not result in loss of
data or in data inconsistency. We deal with these issues in the following sections.
 Consistency Checking: The storing of certain data structures (e.g. directories and inodes) in memory and the caching of disk operations
can speed up performance, but what happens in the event of a system crash? All volatile memory structures are lost, and the
information stored on the hard drive may be left in an inconsistent state. A Consistency Checker (fsck in UNIX, chkdsk or scandisk in
Windows) is often run at boot time or mount time, particularly if a filesystem was not closed down properly. Some of the problems that
these tools look for include:
o Disk blocks allocated to files and also listed on the free list.
o Disk blocks neither allocated to files nor on the free list.
o Disk blocks allocated to more than one file.
o The number of disk blocks allocated to a file inconsistent with the file's stated size.
o Properly allocated files / inodes which do not appear in any directory entry.
o Link counts for an inode not matching the number of references to that inode in the directory structure.
o Two or more identical file names in the same directory.
o Illegally linked directories, e.g. cyclical relationships where those are not allowed, or files/directories that are not accessible
from the root of the directory tree.
o Consistency checkers will often collect questionable disk blocks into new files with names such as chk00001.dat. These files
may contain valuable information that would otherwise be lost, but in most cases they can be safely deleted (returning those
disk blocks to the free list).
UNIX caches directory information for reads, but any changes that affect space allocation or metadata changes are written
synchronously, before any of the corresponding data blocks are written to.
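The first three block-level checks in the list above can be sketched in a few lines. This is not how fsck or chkdsk is implemented; the function check_blocks and its in-memory inputs are assumptions made for illustration, whereas a real checker reads these structures from disk.

```python
from collections import Counter


def check_blocks(total_blocks, files, free_list):
    """Cross-check file block lists against the free list.

    files maps a file name to its list of block numbers;
    free_list is an iterable of free block numbers.
    """
    owners = Counter(b for blocks in files.values() for b in blocks)
    free = set(free_list)
    return {
        "allocated_and_free": sorted(b for b in owners if b in free),
        "neither": sorted(b for b in range(total_blocks)
                          if b not in owners and b not in free),
        "multiply_allocated": sorted(b for b, n in owners.items() if n > 1),
    }


report = check_blocks(
    total_blocks=8,
    files={"a": [0, 1, 2], "b": [2, 3]},
    free_list=[3, 4, 5],
)
print(report)
```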
 Log-Structured File Systems: Log-based transaction-oriented (a.k.a. journaling) filesystems borrow techniques developed for databases,
guaranteeing that any given transaction either completes successfully or can be rolled back to a safe state before the transaction
commenced:
o All metadata changes are written sequentially to a log.
o A set of changes for performing a specific task (e.g. moving a file) is a transaction.
o As changes are written to the log they are said to be committed, allowing the system to return to its work.
o In the meantime, the changes from the log are carried out on the actual filesystem, and a pointer keeps track of which changes
in the log have been completed and which have not yet been completed.
o When all changes corresponding to a particular transaction have been completed, that transaction can be safely removed from
the log.
o At any given time, the log will contain information pertaining to uncompleted transactions only, i.e. actions that were
committed but for which the entire transaction has not yet been completed.
 From the log, the remaining transactions can be completed,
 or if the transaction was aborted, then the partially completed changes can be undone.
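The log-then-apply cycle above can be sketched with a toy journal. The Journal class is hypothetical and keeps everything in memory; a real journaling file system writes both the log and the completion pointer to disk so that they survive a crash.

```python
class Journal:
    """Toy metadata journal (hypothetical): log first, apply later."""

    def __init__(self):
        self.log = []        # committed transactions not yet fully applied
        self.metadata = {}   # stands in for the on-disk structures

    def commit(self, changes):
        """Append one transaction, a list of (key, value) changes."""
        self.log.append({"changes": list(changes), "done": 0})

    def apply(self, max_steps=None):
        """Carry out logged changes in order; stop early to mimic a crash."""
        steps = 0
        while self.log:
            txn = self.log[0]
            while txn["done"] < len(txn["changes"]):
                if max_steps is not None and steps >= max_steps:
                    return
                key, value = txn["changes"][txn["done"]]
                self.metadata[key] = value
                txn["done"] += 1   # pointer to the next unapplied change
                steps += 1
            self.log.pop(0)        # whole transaction applied: drop it


journal = Journal()
# Moving a file: add the new directory entry, then clear the old one.
journal.commit([("dir_b/report", 7), ("dir_a/report", None)])
journal.apply(max_steps=1)   # interrupted after the first change
journal.apply()              # recovery finishes the transaction from the log
print(journal.metadata, journal.log)
```

The sketch only shows the "complete the remaining changes" path; undoing an aborted transaction would additionally need the old values recorded in the log.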
 Backup and Restore: A full backup copies every file on a filesystem. Incremental backups copy only files which have changed since some
previous time. A combination of full and incremental backups can offer a compromise between full recoverability, the number and size of
backup tapes needed, and the number of tapes that need to be used
to do a full restore. For example, one strategy might be: At the
beginning of the month do a full backup. At the end of the first and
again at the end of the second week, back up all files which have
changed since the beginning of the month. At the end of the third
week, back up all files that have changed since the end of the second
week. Every day of the month not listed above, do an incremental
backup of all files that have changed since the most recent of the
weekly backups described above.
 Other Solutions: Sun's ZFS and Network Appliance's WAFL file
systems take a different approach to file system consistency. No
blocks of data are ever over-written in place. Rather the new data is
written into fresh new blocks, and after the transaction is complete,
the metadata (data block pointers) is updated to point to the new
blocks. The old blocks can then be freed up for future use.
Alternatively, if the old blocks and old metadata are saved, then a
snapshot of the system in its original state is preserved. This approach is taken by WAFL. ZFS combines this with check-summing of all
metadata and data blocks, and RAID, to ensure that no inconsistencies are possible, and therefore ZFS does not incorporate a
consistency checker.
NFS (OPTIONAL)
The NFS protocol is implemented as a set of remote procedure calls (RPCs): searching for a file in a directory, reading a set of directory entries,
manipulating links and directories, accessing file attributes, and reading and writing files. For remote operations, buffering and caching improve
performance, but can cause a disparity in local versus remote views of the same file(s).
(In addition to Figure 12.15, you can also review the preceding figures illustrating NFS file-system mounting if you have forgotten them.)
Assorted Content
 Master Boot Record (MBR: Wiki): A master boot record (MBR) is a special type of boot sector at the very beginning of partitioned
computer mass storage devices like fixed disks or removable drives intended for use with IBM PC-compatible systems and beyond. The
MBR holds the information on how the logical partitions, containing file systems, are organized on that medium. The MBR also contains
executable code to function as a loader for the installed operating system, usually by passing control over to the loader's second stage,
or in conjunction with each partition's volume boot record (VBR). This MBR code is usually referred to as a boot loader. MBRs are not
present on non-partitioned media such as floppies, super floppies or other storage devices configured to behave as such.
The MBR is not located in a partition; it is located in the first sector of the device (physical offset 0), preceding the first partition. (The
boot sector present on a non-partitioned device or within an individual partition is called a volume boot record instead.)
The organization of the partition table in the MBR limits the maximum addressable storage space of a disk to 2 TiB (2^32 × 512 bytes).
Approaches to slightly raise this limit assuming 33-bit arithmetic or 4096-byte sectors are not officially supported as they fatally break
compatibility with existing boot loaders and most MBR-compliant operating systems and system tools, and can cause serious data
corruption when used outside of narrowly controlled system environments. Therefore, the MBR-based partitioning scheme is in the
process of being superseded by the GUID Partition Table (GPT) scheme in new computers. A GPT can coexist with an MBR in order to
provide some limited form of backward compatibility for older systems.
The MBR consists of 512 or more bytes located in the first sector of the drive. It may contain one or more of: (A) A partition table
describing the partitions of a storage device. In this context the boot sector may also be called a partition sector. (B) Bootstrap code:
Instructions to identify the configured bootable partition, then load and execute its volume boot record (VBR) as a chain loader. (C)
Optional 32-bit disk timestamp. (D) Optional 32-bit disk signature.
By convention, there are exactly four primary partition table entries in the MBR partition table scheme.
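As a rough illustration of the four-entry table, the sketch below parses a classic 512-byte MBR sector. It assumes the generic layout: four 16-byte entries starting at byte offset 446, with the boot signature 0x55 0xAA in the last two bytes. The function name parse_mbr and the sample partition values are made up, and the CHS address fields are skipped.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446               # 0x1BE: first of four 16-byte entries
BOOT_SIGNATURE = b"\x55\xaa"     # last two bytes of the sector


def parse_mbr(sector):
    """Return the non-empty primary partition entries of an MBR sector."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("not an MBR sector")
    partitions = []
    for i in range(4):
        entry = sector[TABLE_OFFSET + 16 * i:TABLE_OFFSET + 16 * (i + 1)]
        # status, (3 CHS bytes skipped), type, (3 CHS bytes skipped),
        # first sector as a 32-bit LBA, sector count as a 32-bit value
        status, ptype, first_lba, count = struct.unpack("<B3xB3xII", entry)
        if ptype != 0:           # type 0 marks an unused entry
            partitions.append({"bootable": status == 0x80, "type": ptype,
                               "first_lba": first_lba, "sectors": count})
    return partitions


# Build a sample sector in memory instead of reading a real disk.
sector = bytearray(SECTOR_SIZE)
sector[446:462] = struct.pack("<B3xB3xII", 0x80, 0x83, 2048, 204800)
sector[510:512] = BOOT_SIGNATURE
print(parse_mbr(sector))
print(2 ** 32 * 512 == 2 * 2 ** 40)   # 32-bit sector counts give the 2 TiB limit
```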
 Second-stage boot loader: Second-stage boot loaders, such as GNU GRUB, BOOTMGR, Syslinux, NTLDR or BootX, are not themselves
operating systems, but are able to load an operating system properly and transfer execution to it; the operating system subsequently
initializes itself and may load extra device drivers. The second-stage boot loader does not need drivers for its own operation, but may
instead use generic storage access methods provided by system firmware such as the BIOS or Open Firmware, though typically with
restricted hardware functionality and lower performance.
 Volume Boot Record (VBR): A Volume Boot Record (VBR) (also known as a volume boot sector, a partition boot record or a partition
boot sector) is a type of boot sector introduced by the IBM Personal Computer. It may be found on a partitioned data storage device
such as a hard disk, or an unpartitioned device such as a floppy disk, and contains machine code for bootstrapping programs (usually, but
not necessarily, operating systems) stored in other parts of the device. On non-partitioned storage devices, it is the first sector of the
device. On partitioned devices, it is the first sector of an individual partition on the device, with the first sector of the entire device being
a Master Boot Record (MBR) containing the partition table. The code in volume boot records is invoked either directly by the machine's
firmware or indirectly by code in the master boot record or a boot manager. Code in the MBR and VBR is in essence loaded the same
way. Invoking a VBR via a boot manager is known as chain loading.
 Master File Table (MFT): The NTFS file system contains a file called the master file table, or MFT. There is at least one entry in the MFT
for every file on an NTFS file system volume, including the MFT itself. All information about a file, including its size, time and date stamps,
permissions, and data content, is stored either in MFT entries, or in space outside the MFT that is described by MFT entries. As files are
added to an NTFS file system volume, more entries are added to the MFT and the MFT increases in size. When files are deleted from an
NTFS file system volume, their MFT entries are marked as free and may be reused. However, disk space that has been allocated for these
entries is not reallocated, and the size of the MFT does not decrease. (The master file table (MFT) is a database in which information
about every file and directory on an NT
File System (NTFS) volume is stored. There
is at least one record for every file and
directory on the NTFS logical volume. Each
record contains attributes that tell the
operating system (OS) how to deal with
the file or directory associated with the
record.)
 File Control Block (FCB): A File Control
Block (FCB) is a file system structure in
which the state of an open file is maintained. An FCB is managed by the operating system, but it resides in the memory of the program
that uses the file, not in operating system memory. This allows a process to have as many files open at one time as it wants to, provided
it can spare enough memory for an FCB per file. A full FCB is 36 bytes long; in early versions of CP/M, it was 33 bytes. This fixed size,
which could not be increased without breaking application compatibility, led to the FCB's eventual demise as the standard method of
accessing files. The meanings of several of the fields in the FCB differ between CP/M and DOS, and also depending on what operation is
being performed. The following fields have consistent meanings:
SUMMARY
 The file system resides permanently on secondary storage, which is designed to hold a large amount of data permanently. The most
common secondary-storage medium is the disk.
 Physical disks may be segmented into partitions to control media use and to allow multiple, possibly varying, file systems on a single
spindle. These file systems are mounted onto a logical file system architecture to make them available for use. File systems are often
implemented in a layered or modular structure. The lower levels deal with the physical properties of storage devices. Upper levels deal
with symbolic file names and logical properties of files. Intermediate levels map the logical file concepts into physical device properties.
 Any file-system type can have different structures and algorithms. A VFS layer allows the upper layers to deal with each file-system type
uniformly. Even remote file systems can be integrated into the system’s directory structure and acted on by standard system calls via the
VFS interface.
 The various files can be allocated space on the disk in three ways: through contiguous, linked, or indexed allocation. Contiguous
allocation can suffer from external fragmentation. Direct access is very inefficient with linked allocation. Indexed allocation may require
substantial overhead for its index block. These algorithms can be optimized in many ways. Contiguous space can be enlarged through
extents to increase flexibility and to decrease external fragmentation. Indexed allocation can be done in clusters of multiple blocks to
increase throughput and to reduce the number of index entries needed. Indexing in large clusters is similar to contiguous allocation with
extents.
 Free-space allocation methods also influence the efficiency of disk-space use, the performance of the file system, and the reliability of
secondary storage. The methods used include bit vectors and linked lists. Optimizations include grouping, counting, and the FAT, which
places the linked list in one contiguous area.
 Directory-management routines must consider efficiency, performance, and reliability. A hash table is a commonly used method, as it is
fast and efficient. Unfortunately, damage to the table or a system crash can result in inconsistency between the directory information
and the disk’s contents. A consistency checker can be used to repair the damage. Operating-system backup tools allow disk data to be
copied to tape, enabling the user to recover from data or even disk loss due to hardware failure, operating system bug, or user error.
 Network file systems, such as NFS, use client–server methodology to allow users to access files and directories from remote machines as
if they were on local file systems. System calls on the client are translated into network protocols and retranslated into file-system
operations on the server. Networking and multiple-client access create challenges in the areas of data consistency and performance.
 Due to the fundamental role that file systems play in system operation, their performance and reliability are crucial. Techniques such as
log structures and caching help improve performance, while log structures and RAID improve reliability. The WAFL file system is an
example of optimization of performance to match a specific I/O load.
