File Systems (1)




       CS 167                         1




Copyright © 1999 Thomas W. Doeppner. All rights reserved.




File Systems

                  • System 5 File System (S5FS)
                  • UNIX File System (UFS)
                  • Buffer Cache








  In this lecture we cover the "traditional" UNIX file systems: S5FS and UFS. We
then look at how kernel-supported buffering is used to speed file-system operations
and the trouble this sometimes causes.




S5FS: Inode
(Figure: contents of an inode — Device, Inode Number, Mode, Link Count, Owner, Group, Size, and the Disk Map)




  In both S5FS and UFS, a data structure known as the inode (for index node) is
used to represent a file. Inodes are the focus of all file activity: every access to a
file must make use of information from its inode. Every file has an inode on
permanent (disk) storage; while a file is active (e.g., it is open), its inode is
brought into primary storage.




Disk Map
(Figure: the inode's disk map — 13 block pointers, indexed 0 through 12)




   The first file system we discuss is known as the S5 File System -- it is based on the
original UNIX file system, developed in the early seventies. The name S5 comes from
the fact that this was the only file system supported in early versions of what's
known as UNIX System V.
   The purpose of the disk-map portion of the inode is to map block numbers relative
to the beginning of a file into block numbers relative to the beginning of the file
system. Each block is 1024 bytes (1 KB) long.
   The disk map consists of 13 pointers to disk blocks, the first 10 of which point to
the first 10 blocks of the file. Thus the first 10 KB of a file are accessed directly. If the
file is larger than 10 KB, then pointer number 10 points to a disk block called an
indirect block. This block contains up to 256 (4-byte) pointers to data blocks (i.e.,
256 KB of data). If the file is bigger than this (256 KB + 10 KB = 266 KB), then pointer
number 11 points to a double indirect block containing 256 pointers to indirect
blocks, each of which contains 256 pointers to data blocks (64 MB of data). If the file
is bigger than this (64 MB + 256 KB + 10 KB), then pointer number 12 points to a
triple indirect block containing up to 256 pointers to double indirect blocks, each of
which contains up to 256 pointers to single indirect blocks, each of which contains
up to 256 pointers to data blocks (potentially 16 GB, although the real limit is 2 GB,
since the file size, a signed number of bytes, must fit in a 32-bit word).
   This data structure allows the efficient representation of sparse files, i.e., files
whose content is mostly zeros. Consider, for example, the effect of creating an empty
file and then writing one byte at location 2,000,000,000. Only four disk blocks are
allocated to represent this file: a triple indirect block, a double indirect block, a
single indirect block, and a data block. All pointers in the disk map, except for the
last one, are zero, and all bytes up to the last one read as zero. This is because a
zero pointer is treated as if it pointed to a block containing all zeros: a zero pointer
to an indirect block is treated as if it pointed to an indirect block filled with zero
pointers, each of which is treated as if it pointed to a data block filled with zeros.
However, one must be careful about copying such a file, since commands such as
cp and tar actually attempt to write all the zero blocks! (The dump command, on the
other hand, copes with sparse files properly.)
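   The disk-map arithmetic above can be sketched in Python. This is a hypothetical helper, not kernel code; it assumes exactly the S5FS parameters just described (10 direct pointers, 256 pointers per indirect block, 1 KB blocks):

```python
# Sketch of S5FS disk-map interpretation (illustrative helper, not kernel code).
NDIRECT = 10     # direct pointers in the inode
NINDIRECT = 256  # 1024-byte block / 4-byte pointers

def disk_map_path(block_no):
    """Return (pointer_index, [offsets...]) describing how logical block
    block_no of a file is reached through the inode's 13-pointer disk map."""
    if block_no < NDIRECT:
        return (block_no, [])                        # direct block
    block_no -= NDIRECT
    if block_no < NINDIRECT:
        return (10, [block_no])                      # single indirect
    block_no -= NINDIRECT
    if block_no < NINDIRECT ** 2:
        return (11, [block_no // NINDIRECT, block_no % NINDIRECT])
    block_no -= NINDIRECT ** 2                       # triple indirect
    return (12, [block_no // NINDIRECT ** 2,
                 (block_no // NINDIRECT) % NINDIRECT,
                 block_no % NINDIRECT])

# Byte 2,000,000,000 lies in logical block 2_000_000_000 // 1024, which is
# reached through the triple indirect pointer (index 12):
print(disk_map_path(2_000_000_000 // 1024))
```

For instance, logical block 266 (the first byte past 266 KB) is the first block reached through the double indirect pointer.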
S5FS Layout



(Figure: S5FS on-disk layout, bottom to top — Boot block, Superblock, I-list, Data Region)




  The file system is organized on disk as four separate regions:
        • Boot block
                 – used on some systems to contain a bootstrap program
        • Superblock
                 – describes the file system:
                         · total size
                         · size of inode list (I-list)
                         · header of free-block list
                         · list of free inodes
                         · modified flag
                         · read-only flag
                         · number of free blocks and free inodes
                 – resides in a buffer borrowed from the buffer cache while the
                   file system is mounted
        • I-list
                 – area for allocating inodes
        • Data region
                 – remainder of file system is for data blocks and indirect blocks
  A problem with this organization is the separation of the I-list and the data region.
Since one must always fetch the inode before reading or writing the blocks of a file,
the disk head is constantly moving back and forth between the I-list and the data
region.




S5FS Free List




(Figure: the superblock caches the addresses of up to 100 free blocks; the last of these blocks holds pointers to the next 100 free blocks, and so on down the chain)




  Free disk blocks are organized as shown in the figure. The superblock contains
the addresses of up to 100 free disk blocks. The last of these disk blocks contains
100 pointers to additional free disk blocks; the last of these pointers points to yet
another block of up to 100 free-block pointers, and so on until all free disk blocks
are represented. Thus most requests for a free block can be satisfied merely by
taking an address from the superblock. When the last block referenced by the
superblock is consumed, however, a disk read must be done to fetch the addresses
of up to 100 more free disk blocks. Freeing a disk block results in reconstructing the
list structure.
  This organization, though very simple, scatters the blocks of files all over the
surface of the disk. When allocating a block for a file, one must always use the next
block from the free list; there is no way to request a block at a specific location. No
matter how carefully the free list is ordered when the file system is initialized, it
becomes fairly well randomized after the file system has been used for a while.
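   The free-list behavior just described can be modeled in a few lines of Python. This is an illustrative toy, not the actual kernel code: the disk dictionary stands in for disk reads and writes, and the last cached address doubles as the link to the next block of addresses.

```python
# Toy model of the S5FS free-block list (illustrative only, not kernel code).
class FreeList:
    CACHE_SIZE = 100   # free-block addresses cached in the superblock

    def __init__(self, superblock_cache, disk):
        self.cache = superblock_cache  # list of free-block addresses
        self.disk = disk               # block address -> list of addresses

    def alloc(self):
        """Hand out a free block; usually no disk I/O is needed."""
        if len(self.cache) == 1:
            # Last cached address: the block it names holds the next batch
            # of free-block addresses. One disk read refills the cache, and
            # the link block itself is handed out.
            link = self.cache.pop()
            self.cache = list(self.disk[link])
            return link
        return self.cache.pop()

    def free(self, addr):
        """Return a block, rebuilding the chained list structure as needed."""
        if len(self.cache) == self.CACHE_SIZE:
            self.disk[addr] = self.cache   # dump full cache into freed block
            self.cache = [addr]            # freed block becomes the new link
        else:
            self.cache.append(addr)
```

Note that alloc returns whatever address happens to be next in the chain; there is no way to ask for a block near a particular location, which is exactly the randomization problem described above.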




S5FS Free Inode List
(Figure: the superblock caches the indices of free inodes — inodes whose mode bits are zeroed in the I-list)




   Inodes are allocated from the I-list. Free inodes are represented simply by zeroing
their mode bits. The superblock contains a cache of indices of free inodes. When a
free inode is needed (i.e., to represent a new file), its index is taken from this cache.
If the cache is empty, then the I-list is scanned sequentially until enough free inodes
are found to refill the cache.
   To speed this search somewhat, the cache contains a reference to the inode with
the smallest index that is known to be free. When an inode is freed, it is added to the
cache if there is room, and its mode bits are zeroed on disk.




UFS

                  • The goal is to lay out files on disk so that they
                    can be accessed as quickly as possible and
                    so that as little disk space as possible is
                    wasted
                  • Component names of directories can be
                    much longer than in the original (S5) file
                    system








  UFS was developed at the University of California at Berkeley as part of the version
of UNIX known as 4.2 BSD (BSD stands for Berkeley Software Distribution). It was
designed to be much faster than the S5 file system and to eliminate some of its
restrictions, such as the length of components within directory path names.
  This material is covered in The Design and Implementation of the 4.3BSD UNIX
Operating System, by Leffler et al.




UFS Directory Format
(Figure: a 512-byte directory block holding three entries, each recording inode number, record length, string length, and name — (117, 16, 4, "unix"), (4, 12, 3, "etc"), (18, 484, 3, "usr") — where the last entry's record length is extended to absorb the block's free space)




   UFS allows component names of directory path names to be up to 255 characters
long, thereby necessitating a variable-length field for components. Directories are
composed of 512-byte blocks and entries must not cross block boundaries. This
design adds a degree of atomicity to directory updates: it should take exactly one
disk write to update a directory entry (512 bytes was chosen as the smallest
conceivable disk sector size). If two disk writes were necessary to modify a directory
entry, then, as Murphy's law would have it, the disk would surely crash between the
two!
   Like the S5 directory entry, the UFS directory entry contains the inode number
and the component name. Since the component name is of variable length, there is
also a string length field (the component name includes a null byte at the end; the
string length does not include the null byte). In addition to the string length, there is
also a record length, which is the length of the entire entry (and must be a multiple
of four to ensure that each entry starts on a four-byte boundary). The purpose of the
record length field is to represent free space within a directory block. Any free space
is considered a part of the entry that precedes it, and thus a record length longer
than necessary indicates that free space follows. If a directory entry is free, then its
record length is added to that of the preceding entry. However, if the first entry in a
directory block is free, then this free space is represented by setting the inode
number to zero and leaving the record length as is.
   Compressing directories is considered to be too difficult. Free space within a
directory is made available for representing new entries, but is not returned to the
file system. However, if there is free space at the end of the directory, the directory
may be truncated to a directory block boundary.
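   The entry layout described above can be illustrated with Python's struct module. The field widths and byte order here are assumptions for illustration, not the exact on-disk format: a 4-byte inode number, a 2-byte record length, and a 2-byte string length, followed by the null-terminated name.

```python
# Parsing a UFS-style directory block (assumed layout: 4-byte inode number,
# 2-byte record length, 2-byte string length, then the name).
import struct

def make_entry(ino, reclen, name):
    """Build one directory entry, padding it out to its record length."""
    hdr = struct.pack(">IHH", ino, reclen, len(name))
    return hdr + name.encode() + b"\0" * (reclen - 8 - len(name))

def parse_dir_block(block):
    entries, off = [], 0
    while off < len(block):
        ino, reclen, namlen = struct.unpack_from(">IHH", block, off)
        if ino != 0:                     # inode 0 marks a free leading entry
            name = block[off + 8 : off + 8 + namlen].decode()
            entries.append((ino, name))
        off += reclen                    # reclen absorbs trailing free space
    return entries

# The block from the figure: "usr"'s record length (484) runs to the end of
# the 512-byte block, representing the free space that follows it.
block = (make_entry(117, 16, "unix")
         + make_entry(4, 12, "etc")
         + make_entry(18, 484, "usr"))
print(parse_dir_block(block))
```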




Doing File-System I/O Quickly

                  • Transfer as much as possible with each I/O
                    request
                  • Minimize seek time (i.e. reduce head
                    movement)
                  • Minimize latency time








  The UFS file system uses three techniques to improve I/O performance. The first
technique, which has perhaps the greatest payoff, maximizes the amount of data
transferred with each I/O request by using a relatively large block size. UFS block
sizes may be either 4K bytes or 8K bytes (the size is fixed for each individual file
system). A disadvantage of using a large block size is the wastage due to internal
fragmentation: on the average, half of a disk block is wasted for each file. To alleviate
this problem, blocks may be shared among files under certain circumstances.
  The second technique to improve performance is to minimize seek time by
attempting to locate the blocks of a file near to one another.
  Finally, UFS attempts to minimize latency time, i.e. to reduce the amount of time
spent waiting for the disk to rotate to bring the desired block underneath the desired
disk head (many modern disk controllers make it either impossible or unnecessary
to apply this technique).




UFS Layout
(Figure: UFS on-disk layout — the disk is divided into cylinder groups cg 0 through cg n-1; cg 0 begins with the boot block, and each cylinder group contains a copy of the superblock, the cg block, inodes, and data; the cg summary is allocated from the data section of cg 0)




  The UFS file system is laid out on disk as follows:
        • Superblock
               – incore while the file system is mounted
               – contains the parameters describing the layout of the file system
               – for paranoia's sake, one copy is kept in each cylinder group, at a
                 rotating track position
        • Cylinder group summary (one for each cylinder group)
               – incore while the file system is mounted
               – contains a summary of the available storage in each cylinder group
               – allocated from the data section of cylinder group 0
        • Cylinder group block
               – contains free block map and all other allocation information
  Note: the superblock contains two sorts of information, static and dynamic. The
static information describes the layout of the entire file system and is essential to
make sense of the file system. The dynamic information describes the file system's
current state and can be computed from redundant information in the file system. If
the static portion of the superblock is lost, then the file system cannot be used. To
guard against this, each cylinder group contains a copy of the superblock (just the
static information needs to be copied).
  A possible (though unlikely) failure condition might be that the entire contents of
one surface are lost, but the remainder of the disk is usable. However, if this surface
contains all copies of the superblock, then the rest of the disk would be effectively
unusable. To guard against this, the copy of the superblock is placed on a different
surface in each cylinder group. Of course, the system must keep track of where
these copies are. This information is kept in the disk label (along with information
describing how the physical disk is partitioned).



Minimizing Fragmentation
                               Costs
                  • A file system block may be split into
                    fragments that can be independently
                    assigned to files
                     – fragments assigned to a file must be
                       contiguous and in order
                  • The number of fragments per block (1, 2, 4, or
                    8) is fixed for each file system
                  • Allocation in fragments may only be done on
                    what would be the last block of a file, and
                    only if the file does not contain indirect
                    blocks






  Fragmentation is what Berkeley calls its technique for reducing the disk space
wasted when file sizes are not an integral multiple of the block size. Files are
normally allocated in units of blocks, since this allows the system to transfer data in
relatively large, block-size units. But this causes space problems if we have lots of
small files, where the average amount of space wasted per file (half the block size) is
an appreciable fraction of the size of the file (the wastage can be far greater than the
size of the file for very small files). The ideal solution might be to reduce the block
size for small files, but this could cause other problems; e.g., small files might grow
to be large files. The solution used in UFS is to have all of the blocks of a file be the
standard size, except for perhaps the last block of a file. This block may actually be
some number of contiguous, in-order fragments of a standard block.




Use of Fragments (1)




(Figure: a block of eight fragments shared by two files — file A's last two fragments at the top, two free fragments, then file B's last four fragments; growth is downward)




  This example illustrates a difficulty associated with the use of fragments. The file
system must preserve the invariant that fragments assigned to a file must be
contiguous and in order, and that allocation of fragments may be done only on what
would be the last block of the file. In the picture, the direction of growth is
downwards. Thus file A may easily grow by up to two fragments, but file B cannot
easily grow within this block.
  In the picture, file A is 18 fragments in length, file B is 12 fragments in length.




Use of Fragments (2)




(Figure: the same block after file A has grown by one fragment, leaving one free fragment between files A and B)




File A grows by one fragment.




Use of Fragments (3)




(Figure: file A's fragments have been copied into a newly allocated block, where the file has room to grow; file B's fragments remain in the original block)




   File A grows by two more fragments, but since there is no space for it, the file
system allocates another block and copies file A's fragments into it. How much space
should be available in the newly allocated block? If the newly allocated block is
entirely free, i.e., none of its fragments are used by other files, then further growth
by file A will be very cheap. However, if the file system uses this approach all the
time, then we do not get the space-saving benefits of fragmentation. An alternative
approach is to use a "best-fit" policy: find a block that contains exactly the number
of free fragments needed by file A, or if such a block is not available, find a block
containing the smallest number of contiguous free fragments that will satisfy file A's
needs.
   Which approach is taken depends upon the degree to which the file system is
fragmented. If disk space is relatively unfragmented, then the first approach is taken
("optimize for time"). Otherwise, i.e., when disk space is fragmented, the file system
takes the second approach ("optimize for space").
   The points at which the system switches between the two policies are parameterized
in the superblock: a certain percentage of the disk space, by default 10%, is reserved
for the superuser. (Disk-allocation techniques need a reasonable chance of finding free
disk space in each cylinder group in order to optimize the layout of files.) If the total
amount of fragmented free disk space (i.e., the total amount of free disk space not
counting the portion consisting of whole blocks) increases to 8% of the size of the
file system (or, more generally, to 2% less than the reserve), then further allocation
is done using the best-fit approach. Once this approach is in use, if the total
amount of fragmented free disk space drops below 5% (half of the reserve), then
further allocation reverts to the whole-block technique.
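   The hysteresis between the two policies can be sketched as follows — a minimal model of the thresholds just described, using the default percentages:

```python
# Sketch of the UFS time/space allocation-policy switch. The reserve is 10%
# by default; best-fit ("optimize for space") kicks in when fragmented free
# space reaches reserve - 2%, and whole-block allocation ("optimize for
# time") resumes once it falls below reserve / 2.
def allocation_policy(frag_free_pct, current, reserve_pct=10):
    """frag_free_pct: fragmented free space as a % of file-system size;
    current: 'time' or 'space', the policy currently in force (the switch
    exhibits hysteresis, so the current policy matters)."""
    if current == "time" and frag_free_pct >= reserve_pct - 2:
        return "space"   # heavily fragmented: switch to best-fit
    if current == "space" and frag_free_pct < reserve_pct / 2:
        return "time"    # fragmentation under control: whole blocks again
    return current
```

The two different thresholds are what prevent the system from flapping between policies when fragmentation hovers near a single cutoff.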




Minimizing Seek Time

            • The principle:
                  – keep related information as close together as possible
                  – distribute information sufficiently to make the above
                    possible
            • The practice:
                  – attempt to put new inodes in the same cylinder group
                    as their directories
                  – put inodes for new directories in cylinder groups with
                    "lots" of free space
                  – put the beginning of a file (direct blocks) in the inode's
                    cylinder group
                  – put additional portions of the file (each 2MB) in
                    cylinder groups with "lots" of free space




   One of the major components (in terms of time) of a disk I/O operation is the
positioning of the disk head. In the S5 file system we didn't worry about this, but in
the UFS file system we would like to lay out files on disk so as to minimize the time
required to position the disk head. If we know exactly what the contents of an entire
file system will be when we create it, then, in principle, we could lay files out
optimally. But we don't have this sort of knowledge, so, in the UFS file system, a
reasonable effort is made to lay files out "pretty well."




Minimizing Latency (1)



(Figure: eight consecutive blocks of a file laid out in consecutive positions around a track)




  Latency is the time spent waiting for the disk platter to rotate, bringing the desired
sector underneath the disk head. In most disks, latency time is dominated by seek
time, but if we've done a good job improving seek time, perhaps we can do
something useful with latency time.
  A naive way of laying out consecutive blocks of the file on a track would be to put
them in consecutive locations. The problem with this is that some amount of time
passes between the completion of one disk request and the start of the next. During
this time, the disk rotates a certain distance, probably far enough that the disk head
is positioned after the next block. Thus it is necessary to wait for the disk to rotate
almost a complete revolution for it to bring the beginning of the next block
underneath the disk head. This delay could cause a significant slowdown.




Minimizing Latency (2)



(Figure: four blocks spaced around the track so that, given the gap between requests, the next block arrives under the head just in time)




  A better technique is not to lay out the blocks on the track consecutively, but to
leave enough space between them that the disk rotates no further than to the
position of the next block during the time between disk requests.
  It may be that when a new block is allocated for a file, the optimal position for the
next block is already occupied. If so, one may be able to find a block that is just as
good. If the disk has multiple surfaces (and multiple heads), then we can make the
reasonable assumption that the blocks underneath each head can be accessed
equally quickly. Thus the stack of blocks underneath the disk heads at one instant
are said to be rotationally equivalent. If all of these blocks are occupied, then the
next stack of rotationally equivalent blocks in the opposite direction of disk rotation
is almost as good as the first. If all of these blocks are taken, then the third stack is
almost as good, and so forth all the way around the cylinder. If all of these are taken,
then any block within the cylinder group is chosen.
  This technique is perhaps not as useful today as in the past, since many disk
controllers buffer entire tracks and hide the relevant disk geometry.
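   A back-of-the-envelope version of this spacing calculation: given the rotation speed, the number of blocks per track, and the gap between requests, how many block positions must consecutive blocks be separated by? All the numbers in the example are illustrative assumptions, not measurements of any particular disk.

```python
# How many block positions does the disk rotate past during the pause
# between the completion of one request and the start of the next?
import math

def blocks_to_skip(rpm, blocks_per_track, gap_ms):
    ms_per_rev = 60_000 / rpm                   # one full revolution
    ms_per_block = ms_per_rev / blocks_per_track
    return math.ceil(gap_ms / ms_per_block)     # round up to be safe

# e.g., 3600 RPM and 16 blocks per track gives ~16.7 ms per revolution,
# ~1.04 ms per block; with 2 ms between requests, skip 2 positions.
print(blocks_to_skip(3600, 16, 2))
```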




The Buffer Cache


(Figure: a user process's I/O goes through a buffer in the kernel's buffer cache rather than directly to the disk)




  File I/O in UNIX is not done directly to the disk drive, but through an
intermediary, the buffer cache.
  The buffer cache has two primary functions. The first, and most important, is to
make possible concurrent I/O and computation within a UNIX process. The second
is to insulate the user from physical block boundaries.
  From a user thread's point of view, I/O is synchronous. By this we mean that when
the I/O system call returns, the system no longer needs the user-supplied buffer.
For example, after a write system call, the data in the user buffer has either been
transmitted to the device or copied to a kernel buffer -- the user can now scribble
over the buffer without affecting the data transfer. Because of this synchronization,
from a user thread's point of view, no more than one I/O operation can be in
progress at a time. Thus user-implemented multibuffered I/O is not possible (in a
single-threaded process).
  The buffer cache provides a kernel implementation of multibuffering I/O, and thus
concurrent I/O and computation are possible even for single-threaded processes.




Multi-Buffered I/O


(Figure: a process read()s the current block i; because the previous block fetched was i-1, the probable next block i+1 is fetched as well)




  The use of read-aheads and write-behinds makes concurrent I/O and computation
possible: if the block currently being fetched is block i and the previous block
fetched was block i-1, then block i+1 is also fetched. Modified blocks are normally
not written out synchronously; instead they are written asynchronously, sometime
after they were modified.
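   The read-ahead heuristic can be modeled as below. This is an illustrative sketch, not the kernel interface: fetch stands in for a disk read, and in a real kernel the read-ahead would be issued asynchronously rather than inline.

```python
# Sequential read-ahead: when the block being fetched immediately follows
# the previously fetched block, also fetch the next one into the cache.
def fetch(file, i):
    return f"{file} block {i}"   # stand-in for an actual disk read

def read_block(file, i, cache, last_read):
    """cache: dict (file, block) -> data; last_read: dict file -> block."""
    if (file, i) not in cache:
        cache[(file, i)] = fetch(file, i)          # caller waits for this
    if last_read.get(file) == i - 1 and (file, i + 1) not in cache:
        cache[(file, i + 1)] = fetch(file, i + 1)  # read-ahead of block i+1
    last_read[file] = i
    return cache[(file, i)]
```

Only a sequential access pattern triggers the prefetch; a random read leaves the cache untouched beyond the requested block.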




Maintaining the Cache

(Figure: buffer requests are satisfied first from the aged list of probably free buffers, then from the LRU list of probably active buffers, oldest first; no-longer-active buffers are returned to the aged list, active buffers to the youngest end of the LRU list)




   Active buffers are maintained in least-recently-used (LRU) order in the system-wide
LRU list. Thus after a buffer has been used (as part of a read or write system call), it
is returned to the end of the LRU list. The system also maintains a separate list of
"free" buffers called the aged list. Included in this list are buffers holding no-longer-
needed blocks, such as blocks from files that have been deleted.
   Fresh buffers are taken from the aged list. If this list is empty, then a buffer is
obtained from the LRU list as follows. If the first buffer (least recently used) in this
list is clean (i.e., contains a block that is identical to its copy on disk), then this
buffer is taken. Otherwise (i.e., if the buffer is dirty), it is written out to disk
asynchronously and, when written, is placed at the end of the aged list. The search
for a fresh buffer continues on to the next buffer in the LRU list, etc.
   When a file is deleted, any buffers containing its blocks are placed at the head of
the aged list. Also, when I/O into a buffer results in an I/O error, the buffer is
placed at the head of the aged list.
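   The buffer-acquisition algorithm just described can be sketched as follows. This is an illustrative model: write_async stands in for scheduling an asynchronous write, and the model pretends the write completes immediately, at which point the buffer joins the aged list.

```python
# Finding a fresh buffer: take from the aged list if possible; otherwise
# walk the LRU list from least recently used, taking the first clean
# buffer and scheduling writes for dirty ones along the way.
from dataclasses import dataclass

@dataclass
class Buffer:
    block: int
    clean: bool = True   # identical to its copy on disk?

def get_fresh_buffer(aged, lru, write_async):
    if aged:
        return aged.pop(0)
    while lru:
        buf = lru.pop(0)           # least recently used first
        if buf.clean:
            return buf
        write_async(buf)           # dirty: write it out; when the write
        # completes, the buffer is placed at the end of the aged list
    raise RuntimeError("no buffers available")
```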




File-System Consistency (1)




(Figure: a linked list of nodes 1, 2, and 3, to which two new nodes are carefully appended so that the list structure is well-formed at every step)




  In the event of a crash, the contents of the file system may well be inconsistent
with any view of it the user might have. For example, a programmer may have
carefully added a node to the end of the list, so that at all times the list structure is
well-formed.




File-System Consistency (2)




(Figure: the crash strikes before the two new nodes reach the disk — the on-disk list of nodes 1 through 5 now contains pointers to nodes that are not on disk)




  But, if the new node and the old node are stored on separate disk blocks, the
modifications to the block containing the old node might be written out first; the
system might well crash before the second block is written out.




Keeping It Consistent

(Figure: 1) first write the new node to disk synchronously; 2) only then write the block containing the pointer to it, asynchronously)




  To deal with this problem, one must make certain that the target of a pointer is
safely on disk before the pointer is set to point to it. This ordering is enforced for
certain system data structures (e.g., directory entries, inodes, indirect blocks, etc.).
  No such synchronization is done for user data structures: not enough is known
about the semantics of user operations to make it possible. However, a user
process called update executes a sync system call every 30 seconds, which initiates
the writing of all dirty buffers to disk. Alternatively, the user can open a file with
the synchronous option, so that each write blocks until the data is on disk; i.e., the
buffer cache acts as a write-through cache (N.B.: this is expensive!).
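
  The ordering rule above can be illustrated with a toy model. This is a hypothetical
Python sketch (all names invented), in which the "disk" is a dictionary from block
number to block contents and a "crash" simply stops further writes. It shows that
writing the pointer's target first leaves the on-disk list well-formed even if the
pointer update is lost.

```python
def append_node_safely(disk, tail_blk, new_blk, data, crash_after_first=False):
    """Target first: write the new node before any pointer refers to it."""
    # 1) Synchronous write: the new node is safely on disk ...
    disk[new_blk] = {"data": data, "next": None}
    if crash_after_first:
        return  # crash! the pointer update below never reaches the disk
    # 2) ... before the old tail's "next" pointer is updated (this write
    #    may be asynchronous: a crash merely loses the new node).
    disk[tail_blk]["next"] = new_blk

def append_node_unsafely(disk, tail_blk, new_blk, data, crash_after_first=False):
    """Wrong order: the pointer goes out before its target."""
    disk[tail_blk]["next"] = new_blk
    if crash_after_first:
        return  # crash! the on-disk pointer now refers to garbage
    disk[new_blk] = {"data": data, "next": None}

def list_is_well_formed(disk, head_blk):
    """Every 'next' pointer on disk must refer to a block that is on disk."""
    blk = head_blk
    while blk is not None:
        if blk not in disk:
            return False  # dangling pointer
        blk = disk[blk]["next"]
    return True
```

With the safe ordering, a crash between the two writes leaves only an orphaned (but
harmless) block, of the sort that a consistency checker such as fsck can reclaim; with
the unsafe ordering, the same crash leaves a dangling pointer in the list itself.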




                                                                                             XVII-24

Unixfs

  • 1. File Systems (1) CS 167 1 Copyright © 1999 Thomas W. Doeppner. All rights reserved. XVII-1
  • 2. File Systems • System 5 File System (S5FS) • UNIX File System (UFS) • Buffer Cache CS 167 2 In this lecture we cover the "traditional" UNIX file systems: S5FS and UFS. We then look at how kernel-supported buffering is used to speed file-system operations and the trouble this sometimes causes. XVII-2
  • 3. S5FS: Inode Device Inode Number Mode Link Count Owner, Group Size Disk Map CS 167 3 In both S5FS and UFS, a data structure known as the inode (for index node) is used to represent a file. These inodes are the focus of all file activity, i.e., every access to a file must make use of information from the inode. Every file has a inode on permanent (disk) storage. While a file is active (e.g. it is open), its inode is brought into primary storage. XVII-3
  • 4. Disk Map 0 1 2 3 4 5 6 7 8 9 10 11 12 CS 167 4 The first file system we discuss is known as the S5 File System -- it is based on the original UNIX file system, developed in the early seventies. The name S5 comes from the fact that this was the only file system supported in early versions of what's known as UNIX System V. The purpose of the disk-map portion of the inode is to map block numbers relative to the beginning of a file into block numbers relative to the beginning of the file system. Each block is 1024 (1K) bytes long. The disk map consists of 13 pointers to disk blocks, the first 10 of which point to the first 10 blocks of the file. Thus the first 10Kb of a file are accessed directly. If the file is larger than 10Kb, then pointer number 10 points to a disk block called an indirect block. This block contains up to 256 (4-byte) pointers to data blocks (i.e., 256KB of data). If the file is bigger than this (256K +10K = 266K), then pointer number 11 points to a double indirect block containing 256 pointers to indirect blocks, each of which contains 256 pointers to data blocks (64Mb of data). If the file is bigger than this (64MB + 256KB + 10KB), then pointer number 12 points to a triple indirect block containing up to 256 pointers to double indirect blocks, each of which contains up to 256 pointers pointing to single indirect blocks, each of which contains up to 256 pointers pointing to data blocks (potentially 16GB, although the real limit is 2GB, since the file size, a signed number of bytes, must fit in a 32-bit word). This data structure allows the efficient representation of sparse files, i.e., files whose content is mainly zeros. Consider, for example, the effect of creating an empty file and then writing one byte at location 2,000,000,000. Only four disk blocks are allocated to represent this file: a triple indirect block, a double indirect block, a single indirect block, and a data block. 
All pointers in the disk map, except for the last one, are zero. All bytes up to the last one read as zero. This is because a zero pointer is treated as if it points to a block containing all zeros: a zero pointer to an indirect block is treated as if it pointed to an indirect block filled with zero pointers, each of which is treated as if it pointed to a data block filled with zeros. However, one must be careful about copying such a file, since commands such as cp and tar actually attempt to write all the zero blocks! (The dump command, on the other XVII-4 hand, copes with sparse files properly.)
  • 5. S5FS Layout Data Region I-list Superblock Boot block CS 167 5 The file system is organized on disk as four separate regions: • Boot block – used on some systems to contain a bootstrap program • Superblock – describes the file system: · total size · size of inode list (I-list) · header of free-block list · list of free inodes · modified flag · read-only flag · number of free blocks and free inodes · resides in a buffer borrowed from the buffer cache while the file system is mounted • I-list – area for allocating inodes • Data region – remainder of file system is for data blocks and indirect blocks A problem with this organization is the separation of the I-list and the data region. Since one must always fetch the inode before reading or writing the blocks of a file, the disk head is constantly moving back and forth between the I-list and the data region. XVII-5
  • 6. S5FS Free List 99 98 97 99 98 0 97 Super Block 0 CS 167 6 Free disk blocks are organized as shown in the picture. The superblock contains the address of up to 100 free disk blocks. The last of these disk blocks contains 100 pointers to additional free disk blocks. The last of these pointers points to another block containing up to n free disk blocks, etc., until all free disk blocks are represented. Thus most requests for a free block can be satisfied by merely getting an address from the superblock. When the last block reference by the superblock is consumed, however, a disk read must be done to fetch the addresses of up to 100 more free disk blocks. Freeing a disk block results in reconstructing the list structure. This organization, though very simple, scatters the blocks of files all over the surface of the disk. When allocating a block for a file, one must always use the next block from the free list; there is no way to request a block at a specific location. No matter how carefully the free list is ordered when the file system is initialized, it becomes fairly well randomized after the file system has been used for a while. XVII-6
  • 7. S5FS Free Inode List 16 0 15 14 13 0 12 0 11 0 13 10 9 0 8 11 7 6 6 0 12 5 4 4 0 3 Super Block 2 1 I-list CS 167 7 Inodes are allocated from the I-list. Free inodes are represented simply by zeroing their mode bits. The superblock contains a cache of indices of free inodes. When a free inode is needed (i.e., to represent a new file), its index is taken from this cache. If the cache is empty, then the I-list is scanned sequentially until enough free inodes are found to refill the cache. To speed this search somewhat, the cache contains a reference to the inode with the smallest index that is known to be free. When an inode is free, it is added to the cache if there is room, and its mode bits are zeroed on disk. XVII-7
  • 8. UFS • The goal is to lay out files on disk so that they can be accessed as quickly as possible and so that as little disk space as possible is wasted • Component names of directories can be much longer than in the original (S5) file system CS 167 8 UFS was developed at the University of California at Berkeley as part of the version of UNIX known as 4.2 BSD (BSD stands for Berkeley Software Distribution). It was designed to be much faster than the S5 file system and to eliminate some of its restrictions, such as the length of components within directory path names. This material is covered in The Design and Implementation of the 4.3BSD UNIX Operating System, by Leffler et al. XVII-8
  • 9. UFS Directory Format 117 16 4 u n i x 0 4 12 3 e t c 0 18 484 3 u s r 0 Free Space Directory Block CS 167 9 UFS allows component names of directory path names to be up to 255 characters long, thereby necessitating a variable-length field for components. Directories are composed of 512-byte blocks and entries must not cross block boundaries. This design adds a degree of atomicity to directory updates. It should take exactly one disk write to update a directory entry (512 bytes was chosen as the smallest conceivable disk sector size). If two disk writes are necessary to modify a directory entry, then clearly the disk will crash between the two! Like the S5 directory entry, the UFS directory entry contains the inode number and the component name. Since the component name is of variable length, there is also a string length field (the component name includes a null byte at the end; the string length does not include the null byte). In addition to the string length, there is also a record length, which is the length of the entire entry (and must be a multiple of four to ensure that each entry starts on a four-byte boundary). The purpose of the record length field is to represent free space within a directory block. Any free space is considered a part of the entry that precedes it, and thus a record length longer than necessary indicates that free space follows. If a directory entry is free, then its record length is added to that of the preceding entry. However, if the first entry in a directory block is free, then this free space is represented by setting the inode number to zero and leaving the record length as is. Compressing directories is considered to be too difficult. Free space within a directory is made available for representing new entries, but is not returned to the file system. However, if there is free space at the end of the directory, the directory may be truncated to a directory block boundary. XVII-9
  • 10. Doing File-System I/O Quickly • Transfer as much as possible with each I/O request • Minimize seek time (i.e. reduce head movement) • Minimize latency time CS 167 10 The UFS file system uses three techniques to improve I/O performance. The first technique, which has perhaps the greatest payoff, maximizes the amount of data transferred with each I/O request by using a relatively large block size. UFS block sizes may be either 4K bytes or 8K bytes (the size is fixed for each individual file system). A disadvantage of using a large block size is the wastage due to internal fragmentation: on the average, half of a disk block is wasted for each file. To alleviate this problem, blocks may be shared among files under certain circumstances. The second technique to improve performance is to minimize seek time by attempting to locate the blocks of a file near to one another. Finally, UFS attempts to minimize latency time, i.e. to reduce the amount of time spent waiting for the disk to rotate to bring the desired block underneath the desired disk head (many modern disk controllers make it either impossible or unnecessary to apply this technique). XVII-10
  • 11. UFS Layout data cg n-1 inodes cg block super block cg i data data cg 1 cg summary inodes cg block cg 0 super block boot block CS 167 11 The UFS file system is laid out on disk as follows: • Superblock – incore while the file system is mounted – contains the parameters describing the layout of the file system – for paranoia's sake, one copy is kept in each cylinder group, at a rotating track position • Cylinder group summary (one for each cylinder group) – incore while the file system is mounted – contains a summary of the available storage in each cylinder group – allocated from the data section of cylinder group 0 • Cylinder group block – contains free block map and all other allocation information Note: the superblock contains two sorts of information, static and dynamic. The static information describes the layout of the entire file system and is essential to make sense of the file system. The dynamic information describes the file system's current state and can be computed from redundant information in the file system. If the static portion of the superblock is lost, then the file system cannot be used. To guard against this, each cylinder group contains a copy of the superblock (just the static information needs to be copied). A possible (though unlikely) failure condition might be that the entire contents of one surface are lost, but the remainder of the disk is usable. However, if this surface contains all copies of the superblock, then the rest of the disk would be effectively unusable. To guard against this, the copy of the superblock is placed on a different surface in each cylinder group. Of course, the system must keep track of where these copies are. This information is kept in the disk label (along with information describing how the physical disk is partitioned). XVII-11
  • 12. Minimizing Fragmentation Costs • A file system block may be split into fragments that can be independently assigned to files – fragments assigned to a file must be contiguous and in order • The number of fragments per block (1, 2, 4, or 8) is fixed for each file system • Allocation in fragments may only be done on what would be the last block of a file, and only if the file does not contain indirect blocks CS 167 12 Fragmentation is what Berkeley calls their technique for reducing disk space wastage due to file sizes not being an integral multiple of the block size. Files are normally allocated in units of blocks, since this allows the system to transfer data in relatively large, block-size units. But this causes space problems if we have lots of small files, where the average amount of space wasted per file (half the block size) is an appreciable fraction of the size of the file (the wastage can be far greater than the size of the file for very small files). The ideal solution might be to reduce the block size for small files, but this could cause other problems; e.g., small files might grow to be large files. The solution used in UFS is to have all of the blocks of a file be the standard size, except for perhaps the last block of a file. This block may actually be some number of contiguous, in-order fragments of a standard block. XVII-12
• 13. Use of Fragments (1)

[Diagram: one block shared by two files; file A's last fragments occupy the top of the block, file B's the bottom, with two free fragments between them]

CS 167 13

This example illustrates a difficulty associated with the use of fragments. The file system must preserve the invariant that fragments assigned to a file must be contiguous and in order, and that allocation of fragments may be done only on what would be the last block of the file. In the picture, the direction of growth is downwards. Thus file A may easily grow by up to two fragments, but file B cannot easily grow within this block. In the picture, file A is 18 fragments in length, file B is 12 fragments in length.

XVII-13
• 14. Use of Fragments (2)

[Diagram: the same shared block; file A has taken one of the free fragments between the two files]

CS 167 14

File A grows by one fragment.

XVII-14
• 15. Use of Fragments (3)

[Diagram: file A's fragments have been copied into a newly allocated block, leaving file B alone in the old block]

CS 167 15

File A grows by two more fragments, but since there is no space for them, the file system allocates another block and copies file A's fragments into it. How much space should be available in the newly allocated block? If the newly allocated block is entirely free, i.e., none of its fragments are used by other files, then further growth by file A will be very cheap. However, if the file system uses this approach all the time, then we do not get the space-saving benefits of fragmentation. An alternative approach is to use a "best-fit" policy: find a block that contains exactly the number of free fragments needed by file A, or, if such a block is not available, find a block containing the smallest number of contiguous free fragments that will satisfy file A's needs. Which approach is taken depends upon the degree to which the file system is fragmented. If disk space is relatively unfragmented, then the first approach is taken ("optimize for time"). Otherwise, i.e., when disk space is fragmented, the file system takes the second approach ("optimize for space"). The points at which the system switches between the two policies are parameterized in the superblock: a certain percentage of the disk space, by default 10%, is reserved for the superuser. (Disk-allocation techniques need a reasonable chance of finding free disk space in each cylinder group in order to optimize the layout of files.) If the total amount of fragmented free disk space (i.e., the total amount of free disk space not counting that portion consisting of whole blocks) increases to 8% of the size of the file system (or, more generally, increases to 2% less than the reserve), then further allocation is done using the best-fit approach. Once this approach is being used, if the total amount of fragmented free disk space drops below 5% (or half of the reserve), then further allocation is done using the whole-block technique.

XVII-15
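The two-policy switch with hysteresis described in the notes can be sketched as a tiny state machine. The 10%/8%/5% figures follow the defaults quoted above; the class and its names are purely illustrative:

```python
RESERVE = 0.10                  # fraction of disk reserved for the superuser
BEST_FIT_ON = RESERVE - 0.02    # switch to best-fit at 8% fragmented free space
BEST_FIT_OFF = RESERVE / 2      # switch back to whole-block below 5%

class FragPolicy:
    """Tracks which fragment-allocation policy is currently in effect."""
    def __init__(self):
        self.best_fit = False   # start out optimizing for time

    def update(self, fragmented_free_fraction):
        if fragmented_free_fraction >= BEST_FIT_ON:
            self.best_fit = True    # space is fragmented: optimize for space
        elif fragmented_free_fraction < BEST_FIT_OFF:
            self.best_fit = False   # space is plentiful: optimize for time
        return self.best_fit

p = FragPolicy()
print(p.update(0.06))  # False: still whole-block allocation
print(p.update(0.09))  # True: best-fit kicks in above 8%
print(p.update(0.06))  # True: hysteresis keeps best-fit until below 5%
print(p.update(0.04))  # False: back to whole-block
```

The gap between the two thresholds prevents the allocator from flapping between policies when the fragmentation level hovers near a single cutoff.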
• 16. Minimizing Seek Time

• The principle:
  – keep related information as close together as possible
  – distribute information sufficiently to make the above possible
• The practice:
  – attempt to put new inodes in the same cylinder group as their directories
  – put inodes for new directories in cylinder groups with "lots" of free space
  – put the beginning of a file (direct blocks) in the inode's cylinder group
  – put additional portions of the file (each 2MB) in cylinder groups with "lots" of free space

CS 167 16

One of the major components (in terms of time) of a disk I/O operation is the positioning of the disk head. In the S5 file system we didn't worry about this, but in the UFS file system we would like to lay out files on disk so as to minimize the time required to position the disk head. If we know exactly what the contents of an entire file system will be when we create it, then, in principle, we could lay files out optimally. But we don't have this sort of knowledge, so, in the UFS file system, a reasonable effort is made to lay files out "pretty well."

XVII-16
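A toy version of the placement rules above might look like this. The function names and the "pick the emptiest group" rule are illustrative simplifications; the real BSD allocator weighs more factors (such as how many directories each group already holds):

```python
def pick_cg(is_dir, parent_cg, cg_free):
    """Choose a cylinder group for a new inode.

    cg_free: list of free-block counts, one entry per cylinder group.
    """
    if is_dir:
        # new directory: a cylinder group with "lots" of free space
        return max(range(len(cg_free)), key=lambda i: cg_free[i])
    # regular file: keep the inode in its directory's cylinder group
    return parent_cg

def cg_for_file_block(offset, inode_cg, cg_free, chunk=2 * 1024 * 1024):
    """Direct blocks stay in the inode's group; each further 2 MB
    portion moves to a group with lots of free space."""
    if offset < chunk:
        return inode_cg
    return max(range(len(cg_free)), key=lambda i: cg_free[i])

free = [120, 900, 430]
print(pick_cg(True, 0, free))                        # 1: roomiest group for a new directory
print(pick_cg(False, 2, free))                       # 2: file inode stays with its directory
print(cg_for_file_block(0, 2, free))                 # 2: direct blocks near the inode
print(cg_for_file_block(3 * 1024 * 1024, 2, free))   # 1: later portions spread out
```

Spreading the later 2 MB portions around is what keeps any single cylinder group from filling up, which in turn is what makes the "keep related things together" rule workable for everyone else.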
• 17. Minimizing Latency (1)

[Diagram: a disk track with blocks 1–8 placed in consecutive positions around it]

CS 167 17

Latency is the time spent waiting for the disk platter to rotate, bringing the desired sector underneath the disk head. In most disks, latency time is dominated by seek time, but if we've done a good job improving seek time, perhaps we can do something useful with latency time. A naive way of laying out consecutive blocks of the file on a track would be to put them in consecutive locations. The problem with this is that some amount of time passes between the completion of one disk request and the start of the next. During this time, the disk rotates a certain distance, probably far enough that the disk head is positioned after the next block. Thus it is necessary to wait for the disk to rotate almost a complete revolution for it to bring the beginning of the next block underneath the disk head. This delay could cause a significant slowdown.

XVII-17
• 18. Minimizing Latency (2)

[Diagram: blocks 1–4 spaced around the track, leaving gaps so the platter rotates no further than the next block between requests]

CS 167 18

A better technique is not to lay out the blocks on the track consecutively, but to leave enough space between them that the disk rotates no further than to the position of the next block during the time between disk requests. It may be that when a new block is allocated for a file, the optimal position for the next block is already occupied. If so, one may be able to find a block that is just as good. If the disk has multiple surfaces (and multiple heads), then we can make the reasonable assumption that the blocks underneath each head can be accessed equally quickly. Thus the stack of blocks underneath the disk heads at one instant are said to be rotationally equivalent. If all of these blocks are occupied, then the next stack of rotationally equivalent blocks in the opposite direction of disk rotation is almost as good as the first. If all of these blocks are taken, then the third stack is almost as good, and so forth all the way around the cylinder. If all of these are taken, then any block within the cylinder group is chosen. This technique is perhaps not as useful today as in the past, since many disk controllers buffer entire tracks and hide the relevant disk geometry.

XVII-18
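The spacing needed between consecutively numbered blocks can be estimated from the rotation speed and the host's turnaround time between requests. All three numbers below are illustrative assumptions, not figures from the slides:

```python
import math

RPM = 7200                # assumed rotation speed
SECTORS_PER_TRACK = 64    # assumed track geometry
GAP_US = 500              # assumed turnaround time between requests, microseconds

rev_us = 60_000_000 / RPM              # one revolution: ~8333 us at 7200 RPM
us_per_sector = rev_us / SECTORS_PER_TRACK

# Sectors that pass under the head during the turnaround gap, rounded up:
skip = math.ceil(GAP_US / us_per_sector)
print(skip)   # 4: place the next block at least 4 sectors further along
```

With these numbers, laying blocks out strictly consecutively would cost nearly a full revolution per block; skipping 4 sectors between blocks costs only the 4-sector rotation.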
• 19. The Buffer Cache

[Diagram: a user process's buffer exchanging data with the kernel's buffer cache, which in turn talks to the disk]

CS 167 19

File I/O in UNIX is not done directly to the disk drive, but through an intermediary, the buffer cache. The buffer cache has two primary functions. The first, and most important, is to make possible concurrent I/O and computation within a UNIX process. The second is to insulate the user from physical block boundaries. From a user thread's point of view, I/O is synchronous. By this we mean that when the I/O system call returns, the system no longer needs the user-supplied buffer. For example, after a write system call, the data in the user buffer has either been transmitted to the device or copied to a kernel buffer -- the user can now scribble over the buffer without affecting the data transfer. Because of this synchronization, from a user thread's point of view, no more than one I/O operation can be in progress at a time. Thus user-implemented multibuffered I/O is not possible (in a single-threaded process). The buffer cache provides a kernel implementation of multibuffered I/O, and thus concurrent I/O and computation are possible even for single-threaded processes.

XVII-19
• 20. Multi-Buffered I/O

[Diagram: a process calling read(); block i-1 (previous block), block i (current block), block i+1 (probable next block)]

CS 167 20

The use of read-aheads and write-behinds makes possible concurrent I/O and computation: if the block currently being fetched is block i and the previous block fetched was block i-1, then block i+1 is also fetched. Modified blocks are normally not written out synchronously; instead they are written out asynchronously, sometime after they were modified.

XVII-20
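The sequential-access test for read-ahead can be sketched in a few lines. This ignores the cache itself (a block already fetched by a previous read-ahead would not be fetched again in the real kernel), and the class name is illustrative:

```python
class ReadAhead:
    """Issue a read-ahead for block i+1 when access looks sequential."""
    def __init__(self):
        self.last = None   # block number of the previous fetch, if any

    def blocks_to_fetch(self, i):
        sequential = self.last is not None and i == self.last + 1
        self.last = i
        # Sequential access: fetch the requested block plus the probable next one.
        return [i, i + 1] if sequential else [i]

ra = ReadAhead()
print(ra.blocks_to_fetch(3))  # [3]: first access, no pattern yet
print(ra.blocks_to_fetch(4))  # [4, 5]: sequential, so read ahead
print(ra.blocks_to_fetch(9))  # [9]: random access, no read-ahead
```

While the process computes on block i, the disk can already be fetching block i+1 into a kernel buffer, which is exactly the overlap of I/O and computation the slide describes.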
• 21. Maintaining the Cache

[Diagram: buffer requests are satisfied from the aged list (probably free buffers) and from the oldest end of the LRU list (probably active buffers); returns of no-longer-active buffers go to the aged list, returns of active buffers to the youngest end of the LRU list]

CS 167 21

Active buffers are maintained in least-recently-used (LRU) order in the system-wide LRU list. Thus after a buffer has been used (as part of a read or write system call), it is returned to the end of the LRU list. The system also maintains a separate list of "free" buffers called the aged list. Included in this list are buffers holding no-longer-needed blocks, such as blocks from files that have been deleted. Fresh buffers are taken from the aged list. If this list is empty, then a buffer is obtained from the LRU list as follows. If the first buffer (least recently used) in this list is clean (i.e., contains a block that is identical to its copy on disk), then this buffer is taken. Otherwise (i.e., if the buffer is dirty), it is written out to disk asynchronously and, when written, is placed at the end of the aged list. The search for a fresh buffer continues on to the next buffer in the LRU list, etc. When a file is deleted, any buffers containing its blocks are placed at the head of the aged list. Also, when I/O into a buffer results in an I/O error, the buffer is placed at the head of the aged list.

XVII-21
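A minimal sketch of the two lists, assuming dicts as stand-ins for buffer headers and a callback in place of the kernel's asynchronous write-out (which here completes immediately rather than overlapping the search):

```python
from collections import deque

class BufferPool:
    """Sketch of the LRU list / aged list management described above."""
    def __init__(self):
        self.lru = deque()    # least recently used at the left
        self.aged = deque()   # probably-free buffers

    def release(self, buf, still_useful=True):
        if still_useful:
            self.lru.append(buf)        # active buffer: tail of the LRU list
        else:
            self.aged.appendleft(buf)   # deleted file or I/O error: head of aged list

    def get_fresh(self, write_out):
        if self.aged:
            return self.aged.popleft()
        while self.lru:
            buf = self.lru.popleft()
            if not buf["dirty"]:
                return buf              # clean: take it immediately
            write_out(buf)              # dirty: write it out (async in the real kernel)
            buf["dirty"] = False
            self.aged.append(buf)       # once written, it joins the aged list
        # Every LRU buffer was dirty; take one that has now been written out.
        return self.aged.popleft() if self.aged else None

pool = BufferPool()
pool.release({"blk": 1, "dirty": True})
pool.release({"blk": 2, "dirty": False})
b = pool.get_fresh(write_out=lambda buf: None)
print(b["blk"])   # 2: the dirty LRU head is written and aged; the clean buffer is taken
```

The design point the sketch shows: dirty buffers are never handed out directly, and writing them out is a side effect of the search for a fresh buffer rather than a separate pass.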
• 22. File-System Consistency (1)

[Diagram: a linked list of nodes 1, 2, 3 with a new node being appended at the end]

CS 167 22

In the event of a crash, the contents of the file system may well be inconsistent with any view of it the user might have. For example, a programmer may have carefully added a node to the end of the list, so that at all times the list structure is well-formed.

XVII-22
• 23. File-System Consistency (2)

[Diagram: CRASH!!! — the old node's updated pointer made it to disk, but the new node it points to is not on disk, leaving a dangling pointer]

CS 167 23

But, if the new node and the old node are stored on separate disk blocks, the modifications to the block containing the old node might be written out first; the system might well crash before the second block is written out.

XVII-23
• 24. Keeping It Consistent

[Diagram: 1) write the new node synchronously; 2) then write the block containing the pointer to it asynchronously]

CS 167 24

To deal with this problem, one must make certain that the target of a pointer is safely on disk before the pointer is set to point to it. This is done for certain system data structures (e.g., directory entries, inodes, indirect blocks, etc.). No such synchronization is done for user data structures: not enough is known about the semantics of user operations to make this possible. However, a user process called update executes a sync system call every 30 seconds, which initiates the writing out to disk of all dirty buffers. Alternatively, the user can open a file with the synchronous option so that all writes are waited for; i.e., the buffer cache acts as a write-through cache (N.B.: this is expensive!).

XVII-24
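The write-ordering rule above can be made concrete with a toy model. The Disk class and its methods are invented for illustration; they simply record the order in which blocks would reach the device:

```python
class Disk:
    """Records the order in which blocks reach the device."""
    def __init__(self):
        self.log = []
    def write_sync(self, block):
        self.log.append(("sync", block))    # waited for before returning
    def write_async(self, block):
        self.log.append(("async", block))   # queued; completes sometime later

def append_node(disk, new_node_block, tail_block):
    """Crash-safe pointer update: the pointer's target reaches disk
    before the block containing the pointer is even queued."""
    disk.write_sync(new_node_block)   # 1) new node, synchronously
    disk.write_async(tail_block)      # 2) then the updated pointer, asynchronously

d = Disk()
append_node(d, "new node", "old tail with updated pointer")
print(d.log[0][0])   # sync: the target was on disk first
```

If a crash happens between the two writes, the worst case is an orphaned new node (wasted space, recoverable by fsck), never a pointer to garbage.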