
Minimizing User Perceived Latency in Flash Memory File System via Uniformity Improving Paging Allocation

Seungjae Baek+, Jongmoo Choi+, Donghee Lee*, Sam H. Noh**

+ Division of Information and Computer Science, Dankook University, Korea
  San-8, Hannam-Dong, Yongsan-Gu, Seoul, 140-714 Korea, +82-2-799-1367, {ibanez1383,choijm}
* Department of Computer Science, University of Seoul, Korea
  Siripdae-1 gil, 12-1, Jeonnong-Dong, Dongdaemun-Gu, Seoul, 130-743 Korea, +82-2-2210-5615
** School of Computer and Information Engineering, Hongik University, Korea
  72-1, Sangsu-Dong, Mapo-Gu, Seoul, 121-791 Korea, +82-2-320-1470

Abstract

Flash memory is a storage medium that is becoming more and more popular. Though not yet fully embraced in traditional computing systems, Flash memory is prevalent in embedded systems, materialized as commodity appliances such as the digital cameras and MP3 players that we enjoy in our everyday lives. This paper considers an issue in file systems that use Flash memory as a storage medium and makes the following two contributions. First, we identify the cost of block cleaning as the key performance bottleneck for Flash memory, analogous to the seek time in disk storage. We derive and define three performance parameters, namely, utilization, invalidity, and uniformity, from characteristics of Flash memory and present a formula for block cleaning cost based on these parameters. We show that, of these parameters, uniformity, which is a file system controllable parameter, most strongly influences the cost of cleaning. This leads us to our second contribution, the design of the modification-aware (MODA) page allocation scheme and the analysis of how enhanced uniformity affects the block cleaning cost for various workloads.
Real implementation experiments conducted on an embedded system show that the MODA scheme can reduce block cleaning time by up to 43 seconds (with an average of 10.2 seconds) compared to the traditional sequential allocation scheme that is used in YAFFS.

1. Introduction

Characteristics of storage media have been the key driving force behind the development of file systems. The Fast File System's (FFS) introduction of cylinder groups, and its rule of thumb of keeping 10% of the disk as a free space reserve for effective layout, was, essentially, meant to reduce seek time, which is the key bottleneck for user perceived disk performance [1]. Likewise, development of the Log-structured File System (LFS) was similarly motivated by the desire to make large sequential writes so that the head movement of the disk would be minimized and the limited bandwidth that is available would be fully utilized [2]. Various other optimization techniques that take into consideration the mechanical movement of the disk head have been proposed, both at the file system level and the device level [3].

Similar developments have occurred for MEMS-based storage media. Various scheduling algorithms that consider physical characteristics of MEMS devices, such as the disparity of seek distances in the x and y dimensions, have been suggested [4,5].

Recent developments in Flash memory technology have brought about numerous products that make use of Flash memory. In this paper, we explore and identify the characteristics of Flash memory and analyze how they influence the latency of data access. We identify the cost of block cleaning as the key characteristic that influences latency. A performance model for analyzing the cost of block cleaning is presented based on three parameters that we derive, namely, utilization, invalidity, and uniformity, which we define clearly later.

The model reveals that the cost of block cleaning is strongly influenced by uniformity, just as seek time is a strong influence for disk based storage. We observe that most previous algorithms that try to improve block cleaning time in Flash memory are, essentially, trying to maintain high uniformity of Flash memory. Our model gives the upper bound on the performance gain expected from developing a new algorithm. To validate the model and to analyze our
observations in real environments, we design a simple modification-aware (MODA) page allocation scheme that strives to maintain high uniformity by grouping files based on their update frequencies.

We implement the MODA page allocation scheme on an embedded system that has 64MB of NAND Flash memory running the Linux kernel 2.4.19. The NAND Flash memory is managed by YAFFS (Yet Another Flash File System) [7] supported in Linux. We modify the page allocation scheme in YAFFS to MODA and compare its performance with the original scheme. Experimental results show that, by enhancing uniformity, the MODA scheme can reduce block cleaning time by up to 43 seconds, with an average of 10.2 seconds, for the benchmarks considered. As the utilization of Flash memory increases, the performance enhancements become even larger. Performance is also compared with the DAC (Dynamic dAta Clustering) scheme that was previously proposed [8]. Results show that both the MODA and the DAC schemes perform better than the sequential allocation scheme used in YAFFS, though there are some subtle differences causing performance gaps between them.

The rest of the paper is organized as follows. In Section 2, we elaborate on the characteristics of Flash memory and explain the need for cleaning, which is the key issue that we deal with. We then review previous works that have dealt with this issue in Section 3. Then, we present a model for analyzing the cost of block cleaning in Section 4. In Section 5, we present the new page allocation scheme, which we refer to as MODA, in detail. The implementation details and the performance evaluation results are discussed in Section 6. We conclude the paper with a summary and directions for future work in Section 7.

2. Flash Memory and Block Cleaning

In this section, we explain the internal structure and the characteristics of Flash memory, focusing on the differences from disk storage. Then, we describe why block cleaning is required and how it affects the performance of Flash memory file systems.

2.1 General Characteristics of Flash Memory

Flash memory that is most widely used today is either of the NOR type or the NAND type. One key difference between NOR and NAND Flash memory is the access granularity. NOR Flash memory supports word-level random accesses, while NAND Flash memory supports page-level random accesses. Hence, in embedded systems, NOR Flash memory is usually used to store code, while NAND Flash memory is used as storage for the file system. NOR and NAND Flash memory also differ in density, operational time, bad block marking mechanisms, etc., and we refer the audience to Sharma [9] for detailed comparisons. From here on we limit our focus to NAND Flash memory, though the ideas proposed in the paper are also applicable to NOR Flash memory.

Figure 1. Structure of Flash memory

Figure 1 shows the general structure of NAND Flash memory. Flash memory consists of a set of blocks, each block consisting of a set of pages. According to the block size, NAND Flash memory is further divided into two classes, that is, small block NAND and large block NAND [10]. In small block NAND Flash memory, each block has 32 pages, where the page size is 528 bytes. A 512-byte portion of these bytes is the data area used to store data, while the remaining 16-byte portion is called the spare area, which is generally used to store ECC and/or bookkeeping information. In large block NAND Flash memory, each block has 64 pages of 2112 bytes (2048 bytes for the data area and 64 bytes for the spare area).

Flash memory as a storage medium has characteristics that are different from traditional disk storage. These characteristics can be summarized as follows [9].

- Access time in Flash memory is location independent, similar to RAM. There is no "seek time" involved.

- Overwrite is not possible in Flash memory. Flash memory is a form of EEPROM (Electrically Erasable Programmable Read Only Memory), so it needs to be erased before new data can be overwritten.
- Execution time for the basic operations in Flash memory is asymmetric. Traditionally, three basic operations, namely, read, write, and erase, are supported. An erase operation is used to clean used pages so that the pages may be written to again. In general, a write operation takes an order of magnitude longer than a read operation, while an erase operation takes another order of magnitude or more longer than a write operation. Table 1 shows the execution times for these operations compared to DRAM and Disk [21].

- The unit of operation is also different for the basic operations. While the erase operation is performed in block units, read/write operations are performed in page units.

- The number of erasures possible on each block is limited, typically, to 100,000 or 1,000,000 times.

These characteristics make designing software for Flash memory challenging and interesting [11].

Table 1. Operation execution times for different (storage) media

  Media      | Read          | Write         | Erase
  DRAM       | 2.56us (512B) | 2.56ns (512B) | N/A
  NOR Flash  | 14.4us (512B) | 3.53ms (512B) | 1.2s (128KB)
  NAND Flash | 35.9us (512B) | 226us (512B)  | 2ms (16KB)
  Disk       | 12.4ms (512B) | 12.4ms (512B) | N/A

Note from the above discussion that a page can be in three different states. A page can be holding legitimate data, making it a valid page; we will say that a page is in a valid state if the page is valid. If the page no longer holds valid data, that is, it was invalidated by being deleted or by being updated, then the page is a dead/invalid page, and the page is in an invalid state. Note that a page in this state cannot be written to until the block it resides in is first erased. Finally, if the page has not been written to in the first place, or the block in which the page resides has just been erased, then the page is clean. In this case, we will say that the page is in a clean state. Note that only pages that are in a clean state may be written to.

Figure 2. Page state transition diagram

Figure 2 shows the state transition diagram of pages in Flash memory.
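The clean/valid/invalid life cycle above can be sketched as a small state machine. This is an illustrative model only, not YAFFS code; the Page and Block classes, their method names, and the small-block default geometry are our own choices.

```python
# Illustrative model of the page states described above:
# clean -> valid on a write, valid -> invalid on an update or delete,
# and back to clean only via a block-level erase.

class Page:
    def __init__(self):
        self.state = "clean"          # pages start out clean

    def write(self):
        if self.state != "clean":     # only clean pages may be written
            raise ValueError("page must be erased before rewriting")
        self.state = "valid"

    def invalidate(self):             # non-in-place update or delete
        if self.state != "valid":
            raise ValueError("only valid pages can be invalidated")
        self.state = "invalid"

class Block:
    def __init__(self, pages_per_block=32):   # small block NAND geometry
        self.pages = [Page() for _ in range(pages_per_block)]

    def erase(self):
        # Erase works on the whole block; every page becomes clean again.
        for page in self.pages:
            page.state = "clean"

block = Block()
block.pages[0].write()                # clean -> valid
block.pages[0].invalidate()           # valid -> invalid (after an update)
block.erase()                         # whole block returns to clean
```

Note how the model captures the asymmetry discussed in the text: writes and invalidations act on single pages, while the transition out of the invalid state exists only at block granularity.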
Recall that in disks, there is no notion of an invalid state, as in-place overwrites to sectors are always possible.

2.2 The Need for Block Cleaning

Reading data or writing totally new data into Flash memory is much like reading/writing to disk. A page in Flash is referenced/allocated for the data, and data is read/written from/to that particular page. The distinction from a disk is that all reads/writes take a (much shorter) constant amount of time (though writes take longer than reads).

However, for updates of existing data, the story is totally different. As overwriting updated pages is not possible, various mechanisms for non-in-place update have been developed [7,12,13,14]. Though specific details differ, the basic mechanism is to allocate a new page, write the updated data onto the new page, and then invalidate the original page that holds the (now obsolete) original data. The original page now becomes a dead or invalid page.

Note from the tri-state characteristics that the number of clean pages diminishes not only as new data is written, but also as existing data is updated. In order to store more data, and even to make updates to existing data, it is imperative that invalid pages be continually cleaned. Since cleaning is done via the erase operation, which works in block units, valid pages in the block to be erased must be copied to a clean block. This exacerbates the already large overhead incurred by the erase operation needed for cleaning a block.

3. Related Work

The issue of segment/block cleaning is not new and has been dealt with in both the disk and Flash memory realms. In this section, we discuss previous research in this area, especially in relation to the work presented in this paper.

Conceptually, the need for block cleaning in Flash memory is identical to the need for segment cleaning in
the Log-structured File System (LFS). LFS writes data to a clean segment and performs segment cleaning to reclaim the space occupied by obsolete data, just as invalid pages are cleaned in Flash memory [2]. Segment cleaning is a vital issue in LFS, as it greatly affects performance [15,16,17,18]. Blackwell et al. use the terms 'on-demand cleaning' and 'background cleaning' separately and try to reduce user-perceived latency by applying heuristics to remove on-demand cleaning [15]. Matthews et al. propose a scheme that adaptively incorporates hole-plugging into cleaning according to the changes in disk utilization [16]. They also consider how to take advantage of cached data to reduce the cost of cleaning.

Some of the studies on LFS are in line with an aspect of our study, that is, exploiting modification characteristics. Wang and Hu present a scheme that gathers modified data into segment buffers, sorts the data according to modification frequency, and writes two segments of data to the disk at one time [17]. This scheme writes hot- and cold-modified data into different segments. Wang et al. describe a scheme that applies non-in-place update for hot-modified data and in-place update for cold-modified data [18]. These two schemes, however, are disk-based approaches and, hence, are not appropriate for Flash memory. For example, the scheme presented by Wang and Hu [17] writes one segment at a time, as the seek time of disks is a critical performance bottleneck. In Flash memory, this limitation is unnecessary because of the random access nature of Flash memory. Hence, several blocks may be accessed simultaneously, copying the necessary pages one by one to any block as needed. Also, the scheme by Wang et al. [18] cannot be used in Flash memory due to the overwrite limitation.

In the Flash memory arena, improvements to block cleaning have been suggested in many studies [8,11,12,19,20]. Kawaguchi et al. propose using two separate segments for cleaning: one for newly written data and the other for data to be copied during cleaning [12]. Wu and Zwaenepoel present a hybrid cleaning scheme that combines the FIFO algorithm for uniform access and a locality gathering algorithm for highly skewed access distributions [19]. These approaches differ from ours in that their classification is mainly based on whether data is newly written or copied.

Chiang et al. propose the CAT (Cost Age Time) and DAC (Dynamic dAta Clustering) schemes [8]. CAT chooses blocks to be reclaimed by taking into account the cleaning cost, the age of the data in blocks, and the number of times the block has been erased. DAC partitions Flash memory into several regions and places data into different regions according to their update frequencies. In actuality, DAC is very similar to our MODA scheme in that it classifies hot/cold-modified data and allocates them to different blocks. Later, we compare the performance of DAC, which we implemented, with the MODA scheme we propose, and explain the differences between the two schemes. Another key difference between this work and ours is that we identify the parameters that influence the cost of block cleaning in Flash memory and provide a model for the cost based on these parameters. The model tells us what fundamentals we need to focus on and how much gain we can expect when we develop a new algorithm for Flash memory, such as a page allocation, block selection for cleaning, or background cleaning scheme.

There are also some notable studies regarding Flash memory. Kim et al. describe the difference between block level mapping and page level mapping [21]. They propose a hybrid scheme that not only handles small writes efficiently, but also reduces the resources required for mapping information. Gal and Toledo present a recoverable Flash file system for embedded systems [13]. They conjecture that a better allocation policy and a better reclamation policy would improve the endurance and performance of Flash memory. The same authors also present a survey in which they provide a comprehensive discussion of algorithms and data structures regarding Flash memory [11]. Chang et al. discuss a block cleaning scheme that considers deadlines in real-time environments [20]. Ben-Aroya analyzes wear-leveling problems mathematically and suggests separating the allocation and cleaning policies from the wear-leveling problem [24].

4. Analysis of Block Cleaning Cost

In this section, we identify the parameters that affect the cost of block cleaning in Flash memory. We formulate the cost based on these parameters and analyze their effects on the cost.

4.1. Performance Parameters

Two types of block cleaning are possible in Flash memory. The first is when valid and invalid pages coexist within the block that is to be cleaned. Here, the valid pages must first be copied to a clean page in a different block before the erase operation on the block can happen. We shall refer to this type of cleaning as 'copy-erase cleaning'. The other kind of cleaning is where there are no valid pages in the block to be erased.
All pages in such a block are either invalid or clean. This cleaning imposes only a single erase operation, and we shall refer to this type of cleaning as 'erase-only cleaning'. Note that for the same number of cleaning attempts, more erase-only cleaning will result in lower cleaning cost.

Observe that for copy-erase cleaning, the number of valid pages in the block to be cleaned is a key factor that affects the cost of cleaning. The more valid pages there are in the block to be cleaned, the more copy operations need to be performed. Also observe that for more erase-only cleaning to happen, the way in which the invalid pages are distributed plays a key role. A distribution of the invalid pages skewed towards particular blocks will increase the chance of having only invalid pages in blocks, increasing the possibility of erase-only cleaning.

From these observations, we identify three parameters that affect the cost of block cleaning. They are defined as follows:

- Utilization (u): the fraction of valid pages in Flash memory.

- Invalidity (i): the fraction of invalid pages in Flash memory.

- Uniformity (p): the fraction of blocks that are uniform in Flash memory, where a uniform block is a block that does not contain both valid and invalid pages simultaneously.

Figure 3 shows three page allocation situations where utilization and invalidity are the same, but uniformity is different. Since there are eight valid pages and eight invalid pages among the 20 pages in all three cases, utilization and invalidity are both 0.4. However, there are one, three, and five uniform blocks in Figure 3(a), (b), and (c), respectively; hence, uniformity is 0.2, 0.6, and 1, respectively.

Utilization determines, on average, the number of valid pages that need to be copied for copy-erase cleaning. Invalidity determines the number of blocks that are candidates for cleaning. Finally, uniformity refers to the fraction of blocks that are uniform blocks. A uniform block is a block with zero or more clean pages in which the remainder of the pages are uniformly valid or uniformly invalid. Another definition of uniformity would be "1 - the fraction of blocks that have both valid and invalid pages." Of all the uniform blocks, only those blocks containing invalid pages are candidates for erase-only cleaning.

From these observations, we can formulate the cost of block cleaning as follows:

  Cost of Block Cleaning
    = Cost of copy-erase cleaning + Cost of erase-only cleaning
    = (# of non-uniform blocks * et)
      + (# of valid pages in non-uniform blocks * (rt + wt))
      + (# of uniform blocks with invalid pages * et)

where
  rt: read operation execution time
  wt: write operation execution time
  et: erase operation execution time

Each term is elaborated as follows:

  # of non-uniform blocks = (1 - p) * B
  # of valid pages in non-uniform blocks
    = # of non-uniform blocks * # of valid pages in a block
    = (1 - p) * B * (P/B) * (u / (u + i))
  # of uniform blocks with invalid pages
    = (# of invalid pages - # of invalid pages in non-uniform blocks) / (# of pages in a block)
    = ((i * P) - ((1 - p) * B * (P/B) * (i / (u + i)))) / (P/B)

where
  u: utilization (0 ≤ u ≤ 1)
  i: invalidity (0 ≤ i ≤ 1 - u)
  p: uniformity (0 ≤ p ≤ 1)
  P: # of pages in Flash memory (P = capacity / page size)
  B: # of blocks in Flash memory (P/B: # of pages in a block)

In the formula, (P/B) * (u / (u + i)) denotes the number of pages in a non-uniform block that are valid and, hence, need to be copied. Each copy is associated with a read and a write operation, denoted by rt and wt, respectively. Note that instead of using (rt + wt) as the copy overhead, as some Flash memory controllers support the copy-back operation, that is, copying of a page to another page internally [10], the copy-back operation execution time could be used.
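The cost formula above can be exercised directly as a small calculator. The sketch below is ours, not code from the paper; it uses the (rt + wt) copy overhead, the small block NAND timings used later for the Figure 4 analysis (15us read, 200us write, 2000us erase), and an illustrative 10-block, 320-page device.

```python
def block_cleaning_cost(u, i, p, P, B, rt, wt, et):
    """Cost of cleaning all candidate blocks, per the three-parameter model.

    u: utilization, i: invalidity, p: uniformity,
    P: total pages, B: total blocks, rt/wt/et: read/write/erase times.
    """
    pages_per_block = P / B
    non_uniform_blocks = (1 - p) * B
    # Valid pages in non-uniform blocks; each one costs a read plus a write.
    valid_in_non_uniform = non_uniform_blocks * pages_per_block * (u / (u + i))
    # Uniform blocks holding invalid pages are erase-only cleaning candidates.
    invalid_in_non_uniform = non_uniform_blocks * pages_per_block * (i / (u + i))
    uniform_with_invalid = (i * P - invalid_in_non_uniform) / pages_per_block
    return (non_uniform_blocks * et
            + valid_in_non_uniform * (rt + wt)
            + uniform_with_invalid * et)

# Small block NAND timings in microseconds; 32 pages per block.
rt, wt, et = 15, 200, 2000
P, B = 320, 10  # illustrative device: 10 blocks of 32 pages

low  = block_cleaning_cost(u=0.5, i=0.1, p=1.0, P=P, B=B, rt=rt, wt=wt, et=et)
high = block_cleaning_cost(u=0.5, i=0.1, p=0.4, P=P, B=B, rt=rt, wt=wt, et=et)
# For these parameter values, perfect uniformity leaves only erase-only
# cleaning, and the cost grows quickly as p drops.
```

With p = 1 the first two terms vanish and only erase-only cleaning remains, which is exactly the behavior the model predicts for a perfectly uniform layout.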
Figure 3. Situation where utilization (u=0.4) and invalidity (i=0.4) remain unchanged, while uniformity (p) changes: (a) p = 0.2 (b) p = 0.6 (c) p = 1

In general, the copy-back execution time is similar to the write operation execution time. The number of invalid pages in uniform blocks, represented as (i * P) - (1 - p) * B * (P/B) * (i / (u + i)), contributes to the cost of erase-only cleaning. The first term is the total number of invalid pages, while the second
term is the number of invalid pages in non-uniform blocks.

4.2. Validation of the Model

As we need to explain the experimental setup used to validate the model, we defer the validation and the experiments that were performed to Section 6.

4.3. Implication of the Parameters on Block Cleaning Costs

Figure 4. Block cleaning costs

Figure 4 shows the analytic results for the cost of block cleaning based on the derived model. In the figure, the x-axis is utilization, the y-axis is uniformity, and the z-axis is the cost of block cleaning. For this graph, we set invalidity to 0.1 and use the raw data of a small block 64MB NAND Flash memory [10]. The execution times for the read, write, and erase operations are 15us, 200us, and 2000us, respectively.

The main observation from Figure 4 is that the cost of cleaning increases dramatically when utilization is high and uniformity is low. We also conduct the analysis with different values of invalidity and with the raw data of a large block 1GB NAND Flash memory [10], which shows trends similar to those observed in Figure 4.

We now discuss the implication of each parameter in detail. Utilization is determined by the amount of useful data that the user stores. From the file system's point of view, this is an almost uncontrollable parameter, though a compression technique with a hardware accelerator [25] can be used to maintain low utilization.

Invalidity is determined by the data update frequency and the cleaning frequency. If updates (including deletions of stored data) are frequent, invalidity will rise. However, since invalidity decreases with each block cleaning, frequent cleaning will keep invalidity low. Update frequency is controllable by delaying updates using the buffer cache. However, mobile appliances such as cellular phones, MP3 players, and digital cameras that use Flash memory as their storage device are prone to sudden power failures. Hence, delayed writes can be applied only if some form of power failure recovery mechanism is present. Cleaning frequency is also controllable, but we need to consider carefully how to balance the benefit of decreased invalidity against the cost of frequent cleaning.

Finally, uniformity is a parameter that may be controlled by the file system, as the file system is in charge of allocating particular pages upon a write request. The redistribution policy that decides where the valid data in reclaimed blocks are written out also controls this parameter. By judiciously allocating and reorganizing pages, we may be able to maintain high uniformity.

Figure 5 depicts how each parameter affects the cost of block cleaning. For this analysis, we use the same small block NAND Flash memory that was used for Figure 4. In each figure, the initial values of the three parameters are all set to 0.5. Then, we decrease utilization in Figure 5(a), decrease invalidity in Figure 5(b), and increase uniformity in Figure 5(c).

Figure 5. How block cleaning cost is affected by (a) utilization, (b) invalidity, and (c) uniformity as the other two parameters are kept constant at 0.5

From these figures,
we find that the impact of utilization and uniformity on block cleaning cost is higher than that of invalidity. Since utilization is almost uncontrollable, this implies that, to keep cleaning cost down, keeping uniformity high may be a better approach than trying to keep invalidity low through frequent cleaning.

5. Page Allocation Scheme that Strives for Uniformity

In the previous section, we found that, among the three parameters, uniformity is the parameter that most strongly influences the cost of block cleaning. In this section, we present a new page allocation scheme that strives to maintain high uniformity so that the cost of block cleaning may be kept low.

5.1. Modification-Aware Allocation Scheme

When pages are requested in file systems, pages are, in general, allocated sequentially [7,12]. Flash file systems tend to follow this approach and simply allocate the next available clean page when a page is requested, not taking into account any of the characteristics of the storage media.

We propose an allocation scheme that takes into account file modification characteristics such that uniformity may be maximized. The allocation scheme is modification-aware (MODA), as it distinguishes data that are hot-modified, that is, modified frequently, from data that are cold-modified, that is, modified infrequently. Allocation of pages for the distinct types of data is done from distinct blocks.

The motivation behind this scheme is that, by classifying hot-modified pages and allocating them to the same block, we will eventually turn the block into a uniform block filled with invalid pages. Likewise, by classifying cold-modified pages and allocating them together, we will turn such a block into a uniform block filled with valid pages. Pages that are neither hot-modified nor cold-modified are sequestered so that they may not corrupt the uniformity of blocks that hold hot- and cold-modified pages.

In fact, this kind of file segregation based on update frequency is not our own invention. It is a well-known approach exploited in much research, including the IRM (Independent Reference Model) [3], buffer cache management schemes [26], segment cleaning schemes in LFS [17, 18], and data clustering in Flash memory [8, 11]. Note that the main goal of our scheme is not inventing a thoroughly new page allocation scheme. Rather, we want to design a simple scheme that provides different uniformity compared with sequential allocation at minimal overhead, and to analyze how the difference affects the performance of Flash memory in accordance with the model described in Section 4.

Figure 6. Two level (static and dynamic property) classification used in the MODA scheme

Figure 6 shows the two levels of modification-aware classification used in MODA to classify hot/cold-modified data. At the first level, we make use of static properties, while dynamic properties are used at the second level. This classification is based on the skewness in page modification distributions [3,8,18].

As a static property, we distinguish system data from user data, as the modification characteristics of the two are quite different. The superblock and inodes are examples of system data, while data written by users are examples of user data. We know that inodes are modified more intensively than user data, since any change to the user data in a file causes changes to its associated inode.

User data is further classified at the second level, where its dynamic property is used. In particular, we make use of the modification count. Keeping a modification count for each page, however, may incur considerable overhead. Therefore, we choose to monitor at a much larger granularity and keep a modification count for each file, which is updated when the modification time is updated.

For classification with the modification count, we adopt the MQ (Multi Queue) algorithm [26]. Specifically, it uses multiple LRU queues numbered Q0, Q1, ..., Qm-1. Each file stays in a queue for a given lifetime. When a file is first written (created), it is inserted into Q0. If a file is modified within its lifetime, it is promoted from Qi to Qi+1. On the other hand, if a file is not modified within its lifetime, it is demoted from Qi to Qi-1. Then, we classify a file promoted from Qm-1 as hot-modified data, while a file demoted from Q0 is classified as cold-modified data. Files within the queues are defined as unclassified data. In our experiments, we set m to 2 and the lifetime to 100 (time is virtual and ticks at each modification request). In other words, a file modified more than 2 times is classified as hot, while a file that has not been modified within the recent 100 modification requests is classified as cold. We find that MODA with different values of m = 3 and/or lifetime = 500 shows similar behavior.
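The MQ-based classification just described can be sketched as follows. This is an illustrative reimplementation using the experimental settings (m = 2, lifetime = 100 virtual ticks), not the MODA-YAFFS source; the class and method names are our own.

```python
# Sketch of MODA's MQ-based hot/cold classifier: files are promoted when
# modified within their lifetime and demoted when the lifetime expires.

class ModaClassifier:
    def __init__(self, m=2, lifetime=100):
        self.m = m
        self.lifetime = lifetime
        self.queue = {}       # file -> queue index (0 .. m-1)
        self.deadline = {}    # file -> virtual time at which lifetime expires
        self.label = {}       # file -> "hot" / "cold" / "unclassified"
        self.clock = 0        # virtual time: ticks at each modification request

    def modify(self, name):
        """Record one modification request for file `name`."""
        self.clock += 1
        self._expire()
        if name not in self.queue:                 # first write: enter Q0
            self.queue[name] = 0
            self.label[name] = "unclassified"
        elif self.queue[name] == self.m - 1:       # promoted past Qm-1: hot
            self.label[name] = "hot"
        else:
            self.queue[name] += 1                  # promote Qi -> Qi+1
        self.deadline[name] = self.clock + self.lifetime

    def _expire(self):
        # Demote files whose lifetime has passed; demotion below Q0 means cold.
        for name, due in list(self.deadline.items()):
            if self.clock > due:
                if self.queue[name] == 0:
                    self.label[name] = "cold"
                else:
                    self.queue[name] -= 1          # demote Qi -> Qi-1
                self.deadline[name] = self.clock + self.lifetime
```

With m = 2, a file reaches the "hot" label on its third modification (two promotions past Q0 and Q1), matching the "modified more than 2 times" rule above, while a file untouched for 100 modification requests falls out of Q0 and becomes "cold".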
Figure 7. Managing mechanism for MODA

Figure 7 shows the structure of the MODA scheme. There are 4 managers, named the hot block, cold block, unclassified block, and clean block managers. The first three managers govern hot-classified files, cold-classified files, and unclassified files, respectively. When blocks are reclaimed, they are given to the clean block manager, while clean blocks are requested by the other three managers dynamically for serving write requests.

6. Implementation and Performance Evaluation

In this section, we describe the implementation of the MODA scheme on an embedded platform. Using this real implementation, we perform various experiments and analyze the effects of MODA from the viewpoint of the proposed model.

6.1. Platform and Implementation

We have implemented the MODA scheme on an embedded system. Hardware components of the system include a 400MHz XScale PXA CPU, 64MB SDRAM, 64MB NAND Flash memory, 0.5MB NOR Flash memory, and embedded controllers. Table 2 summarizes the hardware components and their specifications [22]. The same NAND Flash memory that was used to analyze the cost of block cleaning in Figure 4 is used here.

Table 2. Hardware components and specifications

  Hardware Components | Specification
  CPU                 | 400MHz XScale PXA 255
  RAM                 | 64MB SDRAM
  Flash               | 64MB NAND, 0.5MB NOR
  Interface           | CS8900, USB, RS232, JTAG

The Linux kernel 2.4.19 was ported to the hardware platform, and YAFFS is used to manage the NAND Flash memory [7]. We modify the page allocation scheme in YAFFS and compare the performance with the native YAFFS. We omit a detailed discussion of YAFFS and only describe the relevant parts below; interested readers are directed to [6,7] for details of YAFFS.

The default page allocation scheme in YAFFS is the sequential allocation scheme. We implemented the MODA scheme in YAFFS and will refer to this version of YAFFS as MODA-YAFFS, or simply MODA. In MODA-YAFFS, we modified functions such as yaffs_WriteChunkDataToObject(), yaffs_FlushFile(), and yaffs_UpdateObjectHeader() in yaffs_guts.c, and init_yaffs_fs() and exit_yaffs_fs() in yaffs_fs.c.

The block cleaning scheme in YAFFS is invoked at each write request. There are two modes of cleaning: normal and aggressive. In normal mode, YAFFS chooses the block that has the largest number of invalid pages among a predefined number of blocks (the default setting is 200). If the chosen block has fewer than 3 valid pages, YAFFS reclaims the block; otherwise, it gives up on the reclaiming. If the number of clean blocks falls below a predefined number (the default setting is 6), the mode is converted to aggressive. In aggressive mode, YAFFS chooses a block that has invalid pages and reclaims the block without checking the number of valid pages in it. The block cleaning scheme in MODA-YAFFS is exactly the same. We do add a new interface for block cleaning that may be invoked at the user level, for ease of measurement.

6.2. The Workload

To gather sufficient workloads for evaluating our scheme, we survey Flash memory related papers as
many as possible and choose the following 7 benchmarks for our implementation experiments.

Camera benchmark: It repeatedly executes transactions, where each transaction consists of three steps: creating 10 files whose sizes are 200~1000KB, the common size of a photo in JPEG format; reading some of these files; and deleting some of these files randomly. Such actions of taking, browsing, and erasing pictures are common behaviors of digital camera users, as observed in [21, 27].

Movie player benchmark: This benchmark simulates the workload of Portable Media Players [27]. It sequentially writes large files whose sizes vary from 15 to 30MB and deletes some of them.

Phone benchmark: It simulates the behavior of a cellular phone [13]. Each daily activity consists of sending/receiving 3 SMS messages, receiving/dialing 10 calls, missing 5 calls, and adding/deleting 5 appointments. Six files are updated for these activities: three files for recording call information (15 bytes), two files for saving SMS messages (80 bytes), and one file for managing appointments (variable length).

Recorder benchmark: It models the behavior of an event recorder such as an automotive black box or a remote sensor [13]. It handles two files, one recording every event (32 bytes) and the other recording every 10th event (also 32 bytes).

Fax machine benchmark: It creates two files, a parameter file and a phonebook file, which are never touched after creation. It also creates four new files (51300 bytes each) and updates a history file (200 bytes) whenever it receives a fax. This behavior can be observed in fax machines, answering machines, and music players [13].

Postmark: This benchmark creates a large number of randomly sized files. It then executes read, write, delete, and append transactions on these files. We create 500 files (the default number) and perform 500 transactions (again, the default value) for one benchmark run [23]. It models Internet appliance workloads, namely e-mail, e-commerce, and news transactions, and is widely used for testing the performance of embedded systems [6].

Andrew benchmark: It was originally developed for testing disk-based file systems, not Flash memory, but many Flash memory studies use it for performance evaluation [12, 21]. The Andrew benchmark consists of 5 phases: directory creation, file copying, file status checking, file contents checking, and compiling. In our experiments, we execute only the first two phases.

These benchmarks can be roughly grouped into three categories: sequential read/write intensive workloads, update intensive workloads, and multiple-file intensive workloads. The first group includes the Camera and Movie benchmarks, which access large files sequentially. The Phone and Recorder benchmarks are typical examples of update intensive workloads, which manipulate a small number of files and update them intensively. The Fax, Postmark, and Andrew benchmarks fall into the final category, which accesses multiple files with different access probabilities.

6.3. Performance Evaluation

In the first part of this subsection, we report the performance evaluation results. Then we discuss the validation of our proposed model and the performance characteristics with respect to uniformity, utilization, and periodic block cleaning.

6.3.1 Performance results

Table 3 shows performance results both measured by executing the benchmarks and estimated by the model. Before each execution, the utilization of Flash memory is set to 0; that is, the Flash memory is reset completely. Then, we execute each benchmark on YAFFS and MODA-YAFFS and measure its execution time, reported in the 'Benchmark Running Time' column. Note that the only difference between YAFFS and MODA-YAFFS is the page allocation scheme. Also, after executing each benchmark, we measure the performance parameters, namely the utilization, invalidity, and uniformity of the Flash memory, denoted as 'U', 'I', and 'P' in Table 3.
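For concreteness, the three measured parameters can be derived from per-block page-state counts. The sketch below is illustrative only: it assumes that utilization is the fraction of valid pages, invalidity the fraction of invalid pages, and uniformity the fraction of blocks that do not mix valid and invalid pages (the precise definitions are those of Section 4, not repeated here), and the structure names are ours rather than YAFFS's.

```c
#include <assert.h>

#define PAGES_PER_BLOCK 32   /* pages per erase block; device dependent */

struct block_info {
    int valid;               /* number of valid pages in this block   */
    int invalid;             /* number of invalid pages in this block */
};

struct fs_params { double u, i, p; };  /* utilization, invalidity, uniformity */

/* Scan the per-block counters and derive U, I, and P. */
static struct fs_params measure_params(const struct block_info *b, int nblocks)
{
    int valid = 0, invalid = 0, uniform = 0;

    for (int k = 0; k < nblocks; k++) {
        valid   += b[k].valid;
        invalid += b[k].invalid;
        if (b[k].valid == 0 || b[k].invalid == 0)
            uniform++;       /* holds only one kind of page (or is clean) */
    }

    double pages = (double)nblocks * PAGES_PER_BLOCK;
    struct fs_params r = { valid / pages, invalid / pages,
                           (double)uniform / nblocks };
    return r;
}
```

Under this reading, sequentially writing a large file keeps P near 1, while interleaving hot and cold pages in the same block lowers P, which is exactly the quantity MODA tries to keep high.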
Table 3. Performance comparison of YAFFS and MODA-YAFFS for the benchmarks when utilization at the start of execution is 0 (time measurements are in seconds; U = utilization, I = invalidity, P = uniformity)

  Benchmark    Scheme  Running | Performance Parameters | Estimated Results           | Measured Results
                       Time    |   U      I       P     | #Erase  #Copy  Clean. Time  | #Erase  #Copy  Clean. Time
  Camera       YAFFS     37    |  0.3    0.0006  0.98   |    62    1916     2.46      |    60    1504     9.97
  Camera       MODA      37    |  0.3    0.0006  0.99   |     7     159     0.20      |     5      62     7.58
  Movie        YAFFS    564    |  0.99   0.0001  0.99   |    10     319     0.41      |    10      17    24.64
  Movie        MODA     564    |  0.99   0.0001  0.99   |     1      31     0.04      |     1       3    24.54
  Phone        YAFFS     88    |  0.05   0.32    0.62   |  1398    6047     9.87      |  1398    6047    13.40
  Phone        MODA      88    |  0.05   0.22    0.72   |  1010    6052     9.20      |  1011    6063    12.08
  Recorder     YAFFS     31    |  0.005  0.16    0.83   |   626     692     1.94      |   626     699     2.17
  Recorder     MODA      31    |  0.005  0.08    0.90   |   233     692     1.44      |   344     690     1.91
  Fax machine  YAFFS     97    |  0.86   0.0087  0.73   |  1023   31710    40.78      |  1001   30996    67.50
  Fax machine  MODA      97    |  0.86   0.0087  0.97   |   107    2407     3.14      |    76    1383    23.47
  Postmark     YAFFS     16    |  0.08   0.0107  0.90   |   357   10158    13.11      |   357   10147    18.39
  Postmark     MODA      16    |  0.08   0.0107  0.93   |   260    7057     9.13      |   248    6652    12.84
  Andrew       YAFFS     33    |  0.09   0.0008  0.90   |   348   10174    13.12      |   345   10060    18.51
  Andrew       MODA      32    |  0.09   0.0008  0.98   |    87    1828     2.40      |    62    1004     3.86

Using the measured performance parameters and the model proposed in Section 4.1, we can estimate the number of erase and copy operations required to reclaim all the invalid pages. The model also gives us the expected cleaning time. These estimated results are reported in the 'Estimated Results' column. Finally, we actually measure the number of erase and copy operations and the cleaning times needed to reclaim all invalid pages, reported in the 'Measured Results' column. The measured results are averages of three executions for each case unless otherwise stated.

6.3.2 Model Validation

Table 3 shows that the numbers of erase and copy operations estimated by the model are similar to those measured on the real implementation, though the model tends to overestimate the copy operations when invalidity is low. These similarities imply that the model is fairly effective at predicting how many operations are required for a given Flash memory state.

However, there are noticeable differences between the measured and estimated block cleaning times. From a sensitivity analysis, we find two main reasons for these differences. The first reason is that the model requires the read, write, and erase times to estimate the block cleaning time. The simplest way to determine these values is to use the data sheet provided by the Flash memory chip vendor. However, through experiments we observe that the values reported in the datasheet and the actual times seen at various levels of the system differ considerably. Figure 8 shows these results. While the datasheet reports read, write, and erase times of 0.01ms, 0.2ms, and 2ms, respectively, for the Flash memory used in our experiments, the times observed when directly accessing Flash memory at the device driver level are 0.19ms, 0.3ms, and 1.7ms, respectively. Furthermore, when observed just above the MTD layer, the read, write, and erase times are 0.2ms, 1.03ms, and 1.74ms, respectively, with large variances. These variances drastically influence the accuracy of the model.

Figure 8. Execution time at each level

Since YAFFS runs on the basis of the
MTD layer, we use the times observed above the MTD layer for the estimated results reported in Table 3.

The second reason is that block cleaning incurs not only copy and erase overheads but also software manipulation overheads. Specifically, YAFFS does not keep block and page status information in main memory, in order to minimize its RAM footprint. Hence, it needs to read Flash memory to detect the blocks and pages to be cleaned. Due to this overhead, there are gaps between the measured and estimated cleaning times. The overhead also explains why the gap widens as utilization increases. Nevertheless, the trends derived from the model and from measurement are the same, which implies that the model is a good indicator of the performance characteristics of Flash memory.

6.3.3 Effects of Uniformity

By comparing the results of YAFFS with those of MODA, we make the following observations.

- The benchmark execution time is the same for YAFFS and MODA. This implies that the additional computation incurred for data classification imposes minimal overhead.

- The MODA scheme maintains high uniformity, which reduces the block cleaning time by up to 43 seconds (for the Fax machine benchmark), with an average reduction of 10.2 seconds over the benchmarks considered.

- The performance gains of MODA for the Movie and Camera benchmarks are trivial. Our model reveals that the original sequential page allocation scheme used in YAFFS also keeps uniformity high for these workloads, making it difficult to obtain considerable gains.

- The gains of MODA for the Phone and Recorder benchmarks are also trivial, even though there is still room for enhancing uniformity. Careful analysis shows that MODA classifies hot/cold data at file-level granularity, but these benchmarks access a small number of files: six files for Phone and two files for Recorder. These experiments disclose a limitation of the MODA scheme and suggest that page-level classification may be effective for such benchmarks.

We also run experiments with two or more benchmarks executing together, such as 'Camera + Phone', simulating a recent cellular phone with digital camera capabilities, and 'Movie + Recorder + Postmark', simulating a PMP that uses an embedded DB to maintain movie titles, an actor library, and digital rights. Experiments show that the trends of multiple-benchmark executions are similar to those of Postmark reported in Table 3. For example, for the combination 'Movie + Recorder + Postmark', the block cleaning times of YAFFS and MODA are 34.82 and 22.58 seconds, while their uniformity values are 0.73 and 0.84, respectively. We also find that interference among benchmarks drives the uniformity of Flash memory low, even for large sequential multimedia files.

6.3.4 Effects of Utilization

In real life, the utilization of Flash memory will rarely be 0. Hence, we conduct measurements similar to those for Table 3, but vary the initial utilization value. Figure 9 shows the results of executing Postmark under various initial utilization values. Utilization was artificially increased by executing the Andrew benchmark before each of the measurements. Exactly the same experiments were conducted with utilization adjusted by pre-executing the Postmark benchmark; the resulting trend is similar, so we report only one set of results.

Figure 9. Results reported under various initial utilization values

For the moment, ignore the results reported when utilization is 0.9, which show somewhat different behavior; we come back to these results shortly. The results in Figure 9 show that block cleaning time increases as utilization increases, confirming what we had observed in Figure 5(a). Our model also predicts, in Figure 4, that at high utilization an enhancement of uniformity yields a larger reduction in cleaning time. Figure 9 reinforces this expectation by showing that the difference in block cleaning time between YAFFS and MODA-YAFFS increases as utilization increases.
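The utilization effect falls out of the structure of the cleaning cost: every block that is reclaimed must first have its remaining valid pages copied elsewhere, and higher utilization means more valid pages per victim block. The following sketch is a simplified stand-in for the model of Section 4.1, not the exact formula; it uses the per-operation times we measured just above the MTD layer, and the data structures are illustrative.

```c
#include <assert.h>

/* Per-operation times (seconds) measured just above the MTD layer. */
#define T_READ  0.20e-3
#define T_WRITE 1.03e-3
#define T_ERASE 1.74e-3

struct block_info { int valid, invalid; };

struct clean_estimate { long copies, erases; double seconds; };

/* Estimate the work needed to reclaim every invalid page: each block
 * holding invalid pages has its valid pages copied out (one read plus
 * one write per page) and is then erased. */
static struct clean_estimate cleaning_cost(const struct block_info *b,
                                           int nblocks)
{
    struct clean_estimate e = { 0, 0, 0.0 };

    for (int k = 0; k < nblocks; k++) {
        if (b[k].invalid == 0)
            continue;               /* nothing to reclaim in this block */
        e.copies += b[k].valid;     /* valid pages must be relocated */
        e.erases += 1;              /* then the block is erased */
    }
    e.seconds = e.copies * (T_READ + T_WRITE) + e.erases * T_ERASE;
    return e;
}
```

At high utilization almost every victim block is full of valid pages, so the copy count, and with it the cleaning time, grows steeply; high uniformity keeps valid and invalid pages in separate blocks and drives the copy count toward zero, matching the widening YAFFS/MODA gap in Figure 9.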
Let us now discuss the results reported when the initial utilization is 0.9. Observe that these results differ from those with lower utilization values in that the benchmark running time is much higher, more so for YAFFS. This is because YAFFS is confronted with a lack of clean blocks during execution and hence turns the block cleaning mode to aggressive. As a result, on-demand block cleaning occurs frequently, increasing the benchmark running time to 72 seconds, four times the running time when utilization is lower. Note that in MODA-YAFFS the running time does increase, but not as significantly. This is because the MODA allocation scheme allows more blocks to be kept uniform, and hence aggressive on-demand block cleaning is invoked less.

6.3.5 Effects of Periodic Block Cleaning

In YAFFS, block cleaning is invoked at each write request and attempts to reclaim at most one block per trial. In other file systems, block cleaning is invoked when free space becomes smaller than a predefined lower threshold and attempts to reclaim blocks until free space exceeds an upper threshold [2, 12]. Alternatively, block cleaning can happen when the system is idle [15]. Ideally, if this is possible, all block cleaning costs may be hidden from the user. Whether this is possible will depend on many factors, including the burstiness of request arrivals, the idle state detection mechanism, and so on.

To see how periodic block cleaning affects performance, we conduct the following sequence of executions. Starting from utilization zero, that is, a clean Flash memory state, we repeatedly execute Postmark until the benchmark can no longer complete because the Flash memory has completely filled up. During this iteration, two different actions are taken. For Figures 10(a) and (c), nothing is done between executions; that is, no form of explicit cleaning is performed. For Figures 10(b) and (d), block cleaning is performed between benchmark executions, representing a case where block cleaning occurs periodically. Figures 10(a) and (b) are the measurement results for YAFFS, while Figures 10(c) and (d) are the results for MODA. The utilization values reported on the x-axis are the values before each benchmark execution.

Figure 10. Execution time for Postmark for YAFFS (a) without periodic cleaning, (b) with periodic cleaning, and for MODA (c) without periodic cleaning, (d) with periodic cleaning

The results here verify what we would expect from the observations of Section 6.3.4. While utilization is kept under some value, YAFFS and MODA perform the same (with or without periodic cleaning). Without periodic cleaning, once past that threshold, 0.87 for Postmark, the benchmark execution time abruptly increases for YAFFS, while for MODA it increases only slightly. This is because enough clean blocks are maintained in MODA. When block cleaning is invoked periodically, the benchmark execution time may be maintained at a minimum. However, the
problem with this approach is that periodic cleaning itself incurs overhead, and the issue becomes whether this overhead can be hidden from the user or not. As this issue is beyond the scope of this paper, we leave the matter for future work. Note that the periodic cleaning times are smaller in MODA than in YAFFS, implying that periodic cleaning also benefits from keeping uniformity high.

6.4. MODA vs DAC

In this subsection, we compare MODA with the DAC scheme proposed by Chiang et al. [8]. In the DAC scheme, Flash memory is partitioned into several regions, and data are moved toward the top/bottom region as their update frequencies increase/decrease. Both MODA and DAC try to cluster data not only at block cleaning time but also at data update time. We implemented the DAC scheme in YAFFS and set the number of regions to 4, as this is reported to show the best performance [8].

Figure 11. Performance comparison between MODA and DAC

Figure 11 shows the results for the MODA and DAC schemes. (The results for YAFFS are shown for comparison's sake.) We execute each benchmark under the same conditions described for Table 3. The results show that the MODA scheme performs better than the DAC scheme.

Detailed examination reveals that the performance gap between the MODA and DAC schemes comes from two sources. One is that the DAC scheme clusters data into the cold region if their update frequency is low. This may cause truly cold-modified data and newly written data (which may actually be hot-modified data) to coexist in the same block, which reduces uniformity. The MODA scheme, in contrast, groups newly written data into separate blocks managed by the unclassified manager shown in Figure 7, segregating them from the cold-modified data.

The second source is that the MODA scheme uses a two-level classification scheme, distinguishing system data from user data (a static property) at the first level and then further classifying user data based on their modification counts (a dynamic property) at the second level. By considering only the dynamic property of the data, the DAC scheme cannot gather enough information to make a timely distinction between the two types. When we apply static-property-based classification to the DAC scheme, its performance comes close to that of MODA. These results imply that static properties provide an opportunity to enhance uniformity early enough. Also, from the performance parameters measured in these experiments, we find that the performance benefits obtained by MODA and DAC essentially come from enhancing the uniformity of Flash memory.

7. Conclusion

Two contributions are made in this paper. First, we identify the cost of block cleaning as the key performance bottleneck for Flash memory, analogous to the seek time in disk storage. We derive three performance parameters from features of Flash memory and present a formula for block cleaning cost based on these parameters. We show that, of these parameters, uniformity is the key controllable parameter that strongly influences the cost. This leads us to our second contribution, a new modification-aware (MODA) page allocation scheme that strives to maintain high uniformity. Using the MODA scheme, we validate our model and evaluate performance characteristics from the viewpoints of uniformity, utilization, and periodic cleaning.

We are considering three research directions for future work. One direction is enhancing the proposed MODA scheme so that it can keep uniformity high for benchmarks that manipulate a small number of files. Another direction is the development of an efficient block cleaning scheme. Uniformity is influenced not only by the page allocation scheme but also by the block cleaning scheme. We need to investigate issues such as defining and detecting idle time at which to initiate block cleaning, and deciding which blocks, and how many of them, should be reclaimed once reclaiming is initiated. In relation to block cleaning, we also plan to extend the model to capture the burstiness of the workload, which determines how much overhead we can hide from user-perceived latency. Finally, we neglected the issue of wear-leveling in this paper. Though we project that our scheme will help reduce wear as cleaning is reduced, we have not quantified this effect. We need
to investigate the effects of our proposed scheme on wear-leveling in a more quantitative manner.

References

[1] M. McKusick, W. Joy, S. Leffler, and R. Fabry, "A Fast File System for UNIX", ACM Transactions on Computer Systems, vol. 2, no. 3, pp. 181-197, Aug. 1984.
[2] M. Rosenblum and J. K. Ousterhout, "The Design and Implementation of a Log-Structured File System", ACM Transactions on Computer Systems, vol. 10, no. 1, pp. 26-52, 1992.
[3] W. Stallings, "Operating Systems: Internals and Design Principles", 5th Edition, Pearson Prentice Hall, 2004.
[4] H. Yu, D. Agrawal, and A. E. Abbadi, "Towards Optimal I/O Scheduling for MEMS-based Storage", Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), 2003.
[5] S. W. Schlosser and G. R. Ganger, "MEMS-based Storage Devices and Standard Disk Interfaces: A Square Peg in a Round Hole?", Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST'04), 2004.
[6] S. Lim and K. Park, "An Efficient NAND Flash File System for Flash Memory Storage", IEEE Transactions on Computers, vol. 55, no. 7, July 2006.
[7] Aleph One, "YAFFS: Yet Another Flash File System".
[8] M.-L. Chiang, P. C. H. Lee, and R.-C. Chang, "Using Data Clustering to Improve Cleaning Performance for Flash Memory", Software: Practice and Experience, vol. 29, no. 3, pp. 267-290, 1999.
[9] A. K. Sharma, "Advanced Semiconductor Memories: Architectures, Designs, and Applications", Wiley Interscience, 2003.
[10] Samsung Electronics, "NAND Flash Data Sheet", NDFlash.
[11] E. Gal and S. Toledo, "Algorithms and Data Structures for Flash Memories", ACM Computing Surveys, vol. 37, no. 2, pp. 138-163, 2005.
[12] A. Kawaguchi, S. Nishioka, and H. Motoda, "A Flash-Memory Based File System", Proceedings of the 1995 USENIX Annual Technical Conference, pp. 155-164, 1995.
[13] E. Gal and S. Toledo, "A Transactional Flash File System for Microcontrollers", Proceedings of the 2005 USENIX Annual Technical Conference, pp. 89-104, 2005.
[14] D. Woodhouse, "JFFS: The Journalling Flash File System", Ottawa Linux Symposium, 2001.
[15] T. Blackwell, J. Harris, and M. Seltzer, "Heuristic Cleaning Algorithms in Log-Structured File Systems", Proceedings of the 1995 USENIX Annual Technical Conference, pp. 277-288, 1995.
[16] J. Matthews, D. Roselli, A. Costello, R. Wang, and T. Anderson, "Improving the Performance of Log-Structured File Systems with Adaptive Methods", ACM Symposium on Operating Systems Principles (SOSP), pp. 238-251, 1997.
[17] J. Wang and Y. Hu, "WOLF - A Novel Reordering Write Buffer to Boost the Performance of Log-Structured File Systems", Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pp. 46-60, 2002.
[18] W. Wang, Y. Zhao, and R. Bunt, "HyLog: A High Performance Approach to Managing Disk Layout", Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pp. 145-158, 2004.
[19] M. Wu and W. Zwaenepoel, "eNVy: A Non-Volatile, Main Memory Storage System", Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 86-97, 1994.
[20] L. P. Chang, T. W. Kuo, and S. W. Lo, "Real-Time Garbage Collection for Flash Memory Storage Systems of Real-Time Embedded Systems", ACM Transactions on Embedded Computing Systems, vol. 3, no. 4, pp. 837-863, 2004.
[21] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, "A Space-Efficient Flash Translation Layer for CompactFlash Systems", IEEE Transactions on Consumer Electronics, vol. 48, no. 2, pp. 366-375, 2002.
[22] EZ-X5.
[23] J. Katcher, "PostMark: A New File System Benchmark", Technical Report TR3022, Network Appliance Inc., 1997.
[24] A. Ben-Aroya, "Competitive Analysis of Flash-Memory Algorithms", M.Sc. paper, Apr. 2006, msc.pdf.
[25] K. Yim, H. Bahn, and K. Koh, "A Flash Compression Layer for SmartMedia Card Systems", IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 192-197, Feb. 2004.
[26] Y. Zhou, P. M. Chen, and K. Li, "The Multi-Queue Replacement Algorithm for Second-Level Buffer Caches", Proceedings of the 2001 USENIX Annual Technical Conference, June 2001.
[27] H. Jo, J. Kang, S. Park, J. Kim, and J. Lee, "FAB: Flash-Aware Buffer Management Policy for Portable Media Players", IEEE Transactions on Consumer Electronics, vol. 52, no. 2, May 2006.