Fsck Sx


An approach to FSCK Performance Enhancement

  1. FSCK SX: An approach to FSCK Performance Enhancement

     Gaurav Naigaonkar, Sanjyot Tipnis, Ajay Mandvekar, Moksh Matta
     {gnaigaonkar, sanj312}@gmail.com, {ajay30_sam, mokshmatta1004}@hotmail.com
     Pune Institute of Computer Technology, Pune

     Abstract

     File System Check Program (fsck) is an interactive file system check and repair program. Fsck uses the redundant structural information in the UNIX file system to perform several consistency checks. Unfortunately, disk capacity is growing faster than disk bandwidth, seek times are hardly budging, and the overall chance of an I/O error occurring somewhere on the disk is increasing. The result: the traditional file system check and repair cycle will be not only longer, but also more frequent, with disastrous consequences for data availability. Data reliability will also decline with frequency of corruption.

     We have implemented techniques for reducing the average fsck time on ext3 filesystems. First, we improve the performance by parallelizing the two major operations of fsck: fetching metadata and checking it. This multithreaded operation, along with an intelligent issuing of IO requests, helps to greatly reduce the overall seek time. We have also implemented 'Metaclustering', wherein we store the indirect blocks in clusters on a per group basis instead of spreading them out along with the data blocks. This makes fsck even faster since it can now read and verify all indirect blocks without much seeking.

     Keywords: Fsck, Parallelism, Metadata clustering, Readahead

     1. Introduction

     File system repair is usually an afterthought for file system designers. One reason is that repair is difficult and annoying to reason about. It's neither possible nor worthwhile to fix all error modes so we must focus our efforts on the ones that commonly occur, yet we do not know what they are until we encounter them in the wild. In practice, most file system repair code is written in response to an observed corruption mode. File system repair is annoying because, by definition, something went wrong and we must think outside the state space of our beautifully designed system. In the end, designing a file system is more fun than designing a file system checker.

     For many years, we could brush off the importance of making file system repair fast and reliable with the following chain of reasoning: File system corruption is a rare event, and when it does occur, repairing it takes only a few minutes or maybe a few hours of downtime, and if repair is too difficult or time-consuming, "That's what backups are for." Unfortunately, if this reasoning was ever valid, it is being eroded by some inconvenient truths about disk hardware trends.

                          2006    2009    2013    Change
     Capacity (GB)         500    2000    8000    16x
     Bandwidth (Mb/s)     1000    2000    5000    5x
     Seek Time (ms)          8     7.2     6.5    1.2x

     Table 1: Projected disk hardware trends
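The rows of Table 1 can be turned into rough scaling factors. The sketch below uses only the 2006 and 2013 columns of Table 1; the names and the two workload models (a scan that is purely bandwidth-bound and one that is purely seek-bound, each growing in proportion to capacity) are simplifying assumptions made for illustration, not figures from the paper.

```python
# Values copied from Table 1 (2006 and 2013 columns).
TABLE_1 = {
    "capacity_gb": (500, 8000),
    "bandwidth_mb_s": (1000, 5000),
    "seek_time_ms": (8.0, 6.5),
}


def growth(metric):
    """Return the 2013 / 2006 ratio for one row of Table 1."""
    old, new = TABLE_1[metric]
    return new / old


capacity_growth = growth("capacity_gb")      # 16x more data to cover
bandwidth_growth = growth("bandwidth_mb_s")  # 5x faster streaming
seek_speedup = 1 / growth("seek_time_ms")    # seeks get only slightly faster

# Assumption: a sequential scan reads an amount of data proportional to capacity.
sequential_slowdown = capacity_growth / bandwidth_growth
# Assumption: a seek-bound scan performs a number of seeks proportional to capacity.
seek_bound_slowdown = capacity_growth / seek_speedup

print(f"bandwidth-bound scan: {sequential_slowdown:.1f}x longer")
print(f"seek-bound scan:      {seek_bound_slowdown:.1f}x longer")
```

Under these assumptions a real fsck run, which mixes streaming reads with many seeks, would be expected to land somewhere between the two factors.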
  2. As Table 1 shows, Seagate projects that during the same time that disk capacity increases by 16 times, disk bandwidth will increase by only 5 times, and seek times will remain nearly unchanged. This is good news for many common workloads: we can store more data and read and write more of it at once. But it is terrible news for any workload that is (a) proportional to the size of the disk, (b) throughput-intensive, and (c) seek-intensive.

     One workload that fits this profile is file system check and repair. It has been calculated that file system check and repair time will increase by approximately a factor of 10 between 2006 and 2013 with today's file system formats.

     At the same time that capacity is increasing, the per-bit error rate is improving. However, for an overall improvement in the error rate for operations that read data proportional to the file system size (such as fsck), the per-bit error rate must improve as fast as capacity grows, which seems unlikely. We conclude that the frequency of file system corruption and necessary check and repair or restore is more likely to increase than decrease. This combination of increasing fsck time and increasing fsck frequency is what we call the fsck time crunch.

     2. The fsck program

     Cutting down crash recovery time for an ext3 file system depends on understanding how the file system checker program, fsck, works. After Linux has finished booting the kernel, the root file system is mounted read-only and the kernel executes the init program. As part of normal system initialization, fsck is run on the root file system before it is remounted read-write and on other file systems before they are mounted. Repair of the file system is necessary before it can be safely written. When fsck runs, it checks to see if the ext3 file system was cleanly unmounted by reading the state field in the file system superblock. If the state is set as VALID, the file system is already consistent and does not need recovery; fsck exits without further ado. If the state is INVALID, fsck does a full check of the file system integrity, repairing any inconsistencies it finds. In order to check the correctness of allocation bitmaps, file nlinks, directory entries, etc., fsck reads every inode in the system, every indirect block referenced by an inode, and every directory entry. Using this information, it builds up a new set of inode and block allocation bitmaps, calculates the correct number of links of every inode, and removes directory entries to unreferenced inodes. It does many other things as well, such as sanity checking inode fields, but these three activities fundamentally require reading every inode in the file system. Otherwise, there is no way to find out whether, for example, a particular block is referenced by a file but is marked as unallocated on the block allocation bitmap. In summary, there are no back pointers from a data block to the indirect block that points to it or from a file to the directories that point to it, so the only way to reconstruct reference counts is to start at the top level and build a complete picture of the file system metadata.

     Unsurprisingly, it takes fsck quite some time to rebuild the entirety of the file system metadata, approximately O(total file system size + data stored). The average laptop takes several minutes to fsck an ext2 file system; large file servers can sometimes take hours or, on occasion, days!

     Straightforward tactical performance optimizations such as requesting reads of needed blocks in sequential order and readahead requests can only improve the situation so much, given that the whole operation will still
  3. take time proportional to the entire file system. What we want is file system recovery time that is O(writes in progress), as is the case for journal replay in journaling file systems.

     3. Motivation

     The fundamental limiting factors in the performance of fsck are the amount of data read, the number of separate I/Os, how scattered the data is on disk, the number of dependent reads, and the CPU time required to check and repair the data read. The amount of memory available is a factor as well, though most fsck programs operate on an all-or-nothing basis: either there is enough memory to fit all the needed metadata for a particular checking pass or the checker simply aborts.

     The time required to read the file system metadata is partially constrained by the bandwidth of the disk. Depending on the file system, some kinds of file system data, such as blocks of statically allocated inodes or block group summaries, are located in contiguous chunks at known locations. Reading this data off disk is relatively swift.

     Other kinds of file system data are dynamically allocated, such as directory entries, indirect blocks, and extents, and hence are scattered all over the disk. Many modern file systems allocate nearly all their metadata dynamically. The location of much of this kind of metadata is not known until the block of metadata pointing to it is read off the disk, introducing many levels of dependent reads. This portion of the file system check is usually the most time consuming, as we must issue a set of small scattered reads, wait for them to complete, read the address of the next block, then issue another set of reads.

     Finally, we need CPU time and sufficient memory to actually compute the consistency checks on the data we have read off disk. This takes a relatively small amount of time compared to the time spent doing what are effectively random 4 KB or similar-sized reads, although more complex file systems may burn more CPU time in computing checksums or similar tasks.

     In summary, the ways to reduce fsck time, in rough order of effectiveness, are to reduce seeks, reduce dependent reads, reduce the amount of metadata that needs to be read (either by reducing the overall quantity or the amount that needs to be read), and to reduce the complexity of the consistency checks themselves.

     4. Our Approach

     In order to discover and correct filesystem errors, fsck must read all the metadata in the entire file system. Hence, the basic idea of our project is to introduce parallelism in the operation of Fsck by prefetching these metadata blocks (which include inodes, bitmaps, directory entries, indirect blocks, block group summaries, etc.) and simultaneously performing consistency checks on this prefetched data.

     Originally, Fsck consists of a single thread of operation. To enhance Fsck performance, we have added an extra thread to read ahead the indirect blocks. Thus, the project involves two threads, namely a 'main thread' and a 'prefetch thread'. The main thread is responsible for the actual data checking while the prefetch thread fetches the metadata (indirect blocks) for the main thread. This modification ensures that when the main thread begins its checking operation, the metadata that it requires has already been brought into the system cache by the prefetch thread. Hence, there is a reduction in the overall time taken by Fsck to complete its operations.

     While actually fetching the data from the disk, the prefetch thread has to go to the disk many times and each time the data
  4. brought into the cache will be minimal. As a solution to this, we have designed a strategy to reduce the number of disk seeks for indirect blocks and also to increase the amount of data brought in each time we go to the disk. This strategy involves queuing the block numbers to be brought in until we reach the end of a block group and, once the end is reached, issuing these queued IOs to fetch the blocks from disk. Also, by merging the IO requests, we ensure that during each disk seek the maximum amount of data is prefetched into the cache instead of issuing single IOs.

     Another aspect of the project is metadata clustering. Metaclustering refers to storing indirect blocks in clusters on a per-group basis instead of spreading them out along with the data blocks. This makes fsck faster since it can now read and verify all indirect blocks without much seeking.

     Fsck involves five passes. Pass 1 is responsible for checking inodes, blocks and sizes, and pass 2 for checking directory structure. Implementing the above-mentioned features helps us reduce the times for these two passes, which take the most time compared to the other passes. Hence, by our modifications and additions to the original Fsck utility we can improve the overall performance of Fsck.

     5. Implementation

     5.1 Parallelization Operation

     Figure 1: Multithreaded Fsck

     As mentioned earlier, when the main fsck thread begins its operation, it needs to fetch the metadata from the disk. This data is then checked for consistency. In other words, while the metadata is being fetched no checking is performed and the CPU remains idle. This is a major bottleneck which leads to fsck taking an enormous amount of time to check and repair the file system. As a solution to this we have implemented a multi-threaded model in which we create a new thread to perform the fetching of metadata. This thread, which we call the 'pre-fetch' thread, performs the task of pre-fetching metadata from disk and making it available to the main fsck thread for performing the usual consistency checks. In this way, we can ensure maximum CPU utilization by coordinating the operation of the main thread and the pre-fetch thread. Since the prefetch thread reads in all the metadata that the main thread requires, the main thread is absolved of the fetching work. As a result, the checking and fetching of data can take place simultaneously, which reduces the time taken for the overall working of fsck.

     5.2 Working of Pre-fetch thread

     As the name suggests, the prefetch thread has been introduced to pre-fetch or read ahead the metadata for the main thread. We have added two new queues, namely a 'Select Queue' and an 'IO Queue', which form an integral part of the prefetch procedure. The working of the prefetch thread and the queues can be better understood from Figure 2 and can be summarised as follows:

     1. Initially the inode table location on disk, i.e. the block number holding the inode table, is read into the IO queue.
     2. This inode table block is then actually fetched from disk into the buffer cache.
  5. Figure 2: Pre-fetch thread working

     3. From this table, individual inodes are picked up to perform various consistency checks.
     4. For each inode in the table, the prefetch thread fetches the indirect block numbers associated with it into the select queue. Thus, the select queue holds the indirect block numbers of every inode in the current inode table.
     5. Once the end of the block group is reached, the select queue is sorted by block number and merging is performed to club together contiguous block numbers. Then all those block numbers that lie within the current block group are transferred into the IO queue. Thus, the IO queue holds all those block numbers that are to be currently fetched from disk.
     6. Finally, the indirect blocks indicated by the block numbers present in the IO queue are fetched from disk into the buffer cache. Thus, the required metadata blocks become available to the main thread.

     The above implementation provides the following benefits:

     1. The main fsck thread performs only checking of metadata and does not have to go to the disk to fetch any blocks, as all the blocks required by the main thread have been pre-fetched into the system cache by the pre-fetch thread.
     2. The select queue that holds the indirect block numbers is sorted. This helps us achieve a nearly sequential sweep of the read-write head over the disk.
     3. By merging the requests in the select queue, we reduce the number of times the prefetch thread needs to go to the disk to fetch blocks. Thus we ensure minimal in-ordered seeks and also minimal overall fetches from disk.
     4. Only those block numbers that lie within the current block group are held by the IO queue. These blocks are then fetched from disk. Thus, we limit our fetching to the current block group while delaying fetching those indirect blocks that lie in other block groups. This further helps in reducing random
  6. movement of read-write head over disk.

     5.3 Metadata Clustering

     Every block group has at its end a semi-reserved region called the 'Metacluster' (MC). This region is mostly used for allocating indirect blocks. Under normal circumstances, the metacluster is used only for allocating indirect blocks, which are allocated in decreasing order of block numbers. The non-metacluster (non-MC) region is used for data block allocations, which are allocated in increasing order of block numbers. However, when the MC runs out of space, indirect blocks can be allocated in the non-MC region along with the data blocks in the forward direction. Similarly, when the non-MC region runs out of space, new data blocks are allocated in the MC, but in the forward direction.

     Figure 3: Metaclustering

     The steps involved in metadata clustering are:

     1. Read the inode table into the buffer cache.
     2. Scan through each inode present in the table.
     3. Find the number of indirect blocks indicated by the inodes.
     4. Find the amount of contiguous free space (metacluster region) required to cluster or group together these indirect blocks.
     5. Transfer the indirect blocks into the metacluster region.
     6. Update the metadata as required to reflect the above changes to the file system.

     Such clustering helps to club together the indirect blocks and hence reduces seeks to fetch these blocks, which otherwise are spread out across the filesystem.

     6. Performance Evaluation

     6.1 Test Environment

     • Processor: Intel Core 2 Duo (2.6 GHz)
     • Memory: 512 MB
     • Operating System: Kubuntu Gutsy Gibbon 7.10
     • Kernel: 2.6.23
     • Base File System Checker Code: e2fsprogs-1.40.4

     Number of disks used       2+1
     Type of disks              IDE
     Size of each disk          40 GB
     Partition size             80 GB (RAID 0)
     Avg. Seek Time             9 ms
     Avg. Rotational Latency    6 ms

     Table 2: Experiment Disk Characteristics (1 GB = 10^9 bytes)

     6.2 Time vs Percentage of file system under use

     We measured the time taken to run fsck on our test machine with a gradual increase in the percentage of file system used. We observed that by using the readahead concept to fetch the data, we get about 30 - 35% decrease in the time taken to complete an fsck run.

     % filled    Original (s)    Modified (s)
     15             99.62           66.26
     25            198.52          132.69
     35            303.1           215.65
     45            331.88          231.46
     55            412.32          290.87
     65            475.72          365.14
     75            526.39          397.67

     Table 3: Time taken to run fsck
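The per-row improvement implied by Table 3 can be checked with a few lines of arithmetic. The sketch below copies the values from Table 3 and assumes the times are in seconds (consistent with the 527 s and 398 s quoted in Section 6.4 for the 75% case); the variable and function names are illustrative only.

```python
# (percent filled, original fsck time, modified fsck time) from Table 3.
TABLE_3 = [
    (15, 99.62, 66.26),
    (25, 198.52, 132.69),
    (35, 303.10, 215.65),
    (45, 331.88, 231.46),
    (55, 412.32, 290.87),
    (65, 475.72, 365.14),
    (75, 526.39, 397.67),
]


def reduction_percent(original, modified):
    """Percentage by which the modified run is shorter than the original run."""
    return 100.0 * (original - modified) / original


reductions = {fill: reduction_percent(o, m) for fill, o, m in TABLE_3}

for fill, pct in reductions.items():
    print(f"{fill:>3}% filled: {pct:5.1f}% less time")

print(f"mean reduction: {sum(reductions.values()) / len(reductions):.1f}%")
```

With the Table 3 values, the reduction is about 33% for the two emptiest configurations and falls to roughly 23 to 24% for the two fullest ones.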
  7. Graph 1: Time vs. FS usage (%)

     6.3 Finding sequential order break with respect to the file system usage

     We also measured the number of times fsck needs to fetch blocks which are in-ordered, i.e. which break the sequential movement of the read-write head, and found that in the original fsck the number of in-ordered reads is quite high. This count reduces considerably with the modified fsck. The actual counts are as follows:

     File system used (GB)    Original    Modified
     10                          29           5
     20                          47          26
     30                          89          40
     40                         107          49
     50                         136          61

     Table 4: Number of in-ordered reads

     6.4 Finding seek per number of IOs

     To get a rough estimate of how a 16x capacity increase, a 5x bandwidth increase, and almost no change in seek time will affect fsck time in 2013, we ran fsck on the /dev/md0 partition (ext3 formatted) of a desktop machine (see Table 2 for details on the disk used). We measured the elapsed time and CPU time using the time command, and the number of individual I/O operations and the total data read using iostat. Using this data and projected changes in disk hardware, we made a rough estimate of the time needed to complete a file system check on a moderately sized desktop (RAID 0) file system in 2013.

     • First, we estimated how the elapsed time was divided among using the CPU, reading data off the disk, and head travel between blocks (seek time plus rotational latency).

     To check an 80 GB file system with 60 GB of data:

     • The total elapsed time of original fsck is 527 seconds, 21 seconds of which are spent in CPU time. That leaves 506 seconds in I/O.
     • The total elapsed time of modified fsck is 398 seconds, 16 seconds of which are spent in CPU time. That leaves 382 seconds in I/O.
     • We measured 1.5 GB of data read. We estimated the amount of time to read 1.5 GB of data off the disk by using dd to transfer that amount of data from the partition into /dev/null, which took 37 seconds at optimal read size.
     • The remaining time, 490 seconds (for original fsck) and 345 seconds (for modified fsck), we assume is spent seeking between tracks and waiting for the data to rotate under the head.
     • We measured 233,440 separate I/O requests.
     • The average seek time for this disk is 9 ms, and the average rotational latency is 6 ms. We estimate that original fsck required about 32000 seeks (about one seek per every 35-38 I/Os) while the modified fsck required about 22000 seeks (about one seek per every 58-60 I/Os).
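The seek estimate above can be reproduced as a short calculation. This is a minimal sketch under the stated model (elapsed time = CPU time + sequential read time + head travel); the function and variable names are assumptions made for illustration, and the inputs are the measurements quoted in the bullets.

```python
def estimate_seeks(elapsed_s, cpu_s, sequential_read_s, seek_ms, rotational_ms):
    """Estimate the number of seeks from coarse timing measurements."""
    io_s = elapsed_s - cpu_s                       # time not spent on the CPU
    head_travel_s = io_s - sequential_read_s       # time not spent streaming data
    access_s = (seek_ms + rotational_ms) / 1000.0  # one seek plus rotational latency
    return head_travel_s / access_s


READ_S = 37              # dd time for 1.5 GB of data
SEEK_MS, ROT_MS = 9, 6   # disk characteristics from Table 2

original_seeks = estimate_seeks(527, 21, READ_S, SEEK_MS, ROT_MS)
modified_seeks = estimate_seeks(398, 16, READ_S, SEEK_MS, ROT_MS)

print(f"original fsck: about {original_seeks:,.0f} seeks")
print(f"modified fsck: about {modified_seeks:,.0f} seeks")
```

Applied literally, the subtraction gives 469 s of head travel for the original fsck; the 490 s quoted above appears to correspond to subtracting the dd time from the total elapsed time rather than from the I/O time. Either way the resulting seek counts are of the same order as the rounded 32000 and 22000 in the text.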
  8. Steps to find IOs/seek:

     1. Elapsed time = CPU time + time to read data off the disk + head travel between the blocks (seek time + rotational latency).
     2. Calculate the total elapsed time for e2fsck.
     3. Subtract the CPU time from it to get the input/output time.
        Note: CPU time = user time + system time (time command used).
     4. Calculate the time required to read a certain amount of metadata (approximately the current metadata for the test).
        Note: use the dd operation to calculate this.
     5. Subtract the above time from the I/O time to get the time spent in head travel.
     6. Divide this time by the disk access time to get the number of seeks.
        Note: access time = seek time + rotational latency.
     7. Calculate the number of I/O requests for the test.
     8. Divide this by the number of seeks calculated in step 6 to get the number of I/Os per seek.

     7. Conclusion

     This paper describes the design and implementation of a multithreaded filesystem checker (fsck), an improvement over the current single-threaded version. It also describes the extensions implemented in the current fsck to enable clustering of metadata to further improve performance. The sample tests performed using FSCK-SX have shown its capability to achieve nearly a 30% enhancement over current performance. FSCK-SX thus provides a framework for an optimized filesystem checker on the ext3 filesystem, the concept of which can be extended to other file systems.

     8. References

     [1] Valerie Henson, Open Source Technology Centre, Intel Corporation. Repair-Driven File System Design.
     [2] Val Henson, Zach Brown, Theodore Ts'o, Arjan van de Ven. Reducing Fsck Time for Ext2 File Systems. In Ottawa Linux Symposium 2006, 2006.
     [3] Val Henson, Arjan van de Ven, Amit Gud, and Zach Brown. Chunkfs: Using divide-and-conquer to improve file system reliability and repair. In Hot Topics in System Dependability, 2006.
     [4] Design and Implementation of the Second Extended File System. http://e2fsprogs.sourceforge.net/ext2intro.html
     [5] T. J. Kowalski and Marshall K. McKusick. Fsck - the UNIX file system check program. Technical report, Bell Laboratories, 1978.
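As a supplement to Section 5.2, the select-queue handling (steps 4 to 6 of the pre-fetch procedure) can be sketched in a few lines. This is a minimal illustration and not the e2fsprogs code: the function names, the (start, count) request format, and the block-group bounds used in the example are all assumptions. It filters the current group's blocks first and then sorts and merges them, which yields the same IO queue as sorting and merging before the transfer.

```python
def merge_contiguous(block_numbers):
    """Sort block numbers and merge contiguous runs into (start, count) requests."""
    requests = []
    for block in sorted(set(block_numbers)):
        if requests and block == requests[-1][0] + requests[-1][1]:
            start, count = requests[-1]
            requests[-1] = (start, count + 1)  # extend the current run
        else:
            requests.append((block, 1))        # start a new run
    return requests


def drain_select_queue(select_queue, group_start, group_end):
    """Build the IO queue for one block group and defer all other blocks.

    group_start is inclusive and group_end is exclusive (hypothetical convention).
    """
    in_group = [b for b in select_queue if group_start <= b < group_end]
    deferred = [b for b in select_queue if not group_start <= b < group_end]
    return merge_contiguous(in_group), deferred


# Indirect block numbers collected while scanning one inode table (made-up values).
select_queue = [1205, 1201, 9000, 1200, 1202, 1300, 1301]
io_queue, deferred = drain_select_queue(select_queue, 1024, 8192)

print(io_queue)   # merged read requests for the current block group
print(deferred)   # block numbers held back for a later block group
```

In this example six scattered block numbers inside the group collapse into three ascending read requests, which is the effect the sorting and merging steps aim for; block 9000 stays queued until its own block group is processed.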