Minimizing User Perceived Latency in Flash Memory File System via
Uniformity Improving Paging Allocation
Seungjae Baek+, Jongmoo Choi+, Donghee Lee*, Sam H. Noh**
+Division of Information and Computer Science, Dankook University, Korea
San-8, Hannam-Dong, Yongsan-Gu, Seoul, 140-714 Korea, +82-2-799-1367
*Department of Computer Science, University of Seoul, Korea
Siripdae-1 gil, 12-1, Jeonnong-Dong, Dongdaemun-Gu, Seoul, 130-743 Korea, +82-2-2210-5615
**School of Computer and Information Engineering, Hongik University, Korea
72-1, Sangsu-Dong, Mapo-Gu, Seoul, 121-791 Korea, +82-2-320-1470
Abstract

Flash memory is a storage medium that is becoming more and more popular. Though not yet fully embraced in traditional computing systems, Flash memory is prevalent in embedded systems, materialized as commodity appliances such as the digital cameras and MP3 players that we enjoy in our everyday lives. This paper considers an issue in file systems that use Flash memory as a storage medium and makes the following two contributions. First, we identify the cost of block cleaning as the key performance bottleneck for Flash memory, analogous to seek time in disk storage. We derive and define three performance parameters, namely, utilization, invalidity, and uniformity, from characteristics of Flash memory and present a formula for block cleaning cost based on these parameters. We show that, of these parameters, uniformity, which is a file system controllable parameter, most strongly influences the cost of cleaning. This leads us to our second contribution, the design of the modification-aware (MODA) page allocation scheme and the analysis of how enhanced uniformity affects the block cleaning cost under various workloads. Real implementation experiments conducted on an embedded system show that the MODA scheme can reduce block cleaning time by up to 43 seconds (with an average of 10.2 seconds) compared to the traditional sequential allocation scheme that is used in YAFFS.
1. Introduction

Characteristics of storage media have been the key driving force behind the development of file systems. The Fast File System's (FFS) introduction of cylinder groups and its rule of thumb of keeping 10% of the disk as a free space reserve for effective layout was, essentially, meant to reduce seek time, which is the key bottleneck for user perceived disk performance. Likewise, development of the Log-structured File System (LFS) was motivated by the desire to make large sequential writes so that the head movement of the disk would be minimized and the limited available bandwidth would be fully utilized. Various other optimization techniques that take the mechanical movement of the disk head into consideration have been proposed, both at the file system level and at the device level.

Similar developments have occurred for MEMS-based storage media, for which various scheduling algorithms that consider physical characteristics of MEMS devices, such as the disparity of seek distances in the x and y dimensions, have been suggested [4,5].

Recent developments in Flash memory technology have brought about numerous products that make use of Flash memory. In this paper, we explore and identify the characteristics of Flash memory and analyze how they influence the latency of data access. We identify the cost of block cleaning as the key characteristic that influences latency. A performance model for analyzing the cost of block cleaning is presented based on three parameters that we derive, namely, utilization, invalidity, and uniformity, which we define precisely later.

The model reveals that the cost of block cleaning is strongly influenced by uniformity, just as seek time strongly influences the performance of disk based storage. We observe that most previous algorithms that try to improve block cleaning time in Flash memory are, essentially, trying to maintain high uniformity of Flash memory. Our model gives an upper bound on the performance gain that can be expected from developing a new algorithm. To validate the model and to analyze our observations in real environments, we design a new, simple modification-aware (MODA) page allocation scheme that strives to maintain high uniformity by grouping files based on their update frequencies.

We implement the MODA page allocation scheme on an embedded system that has 64MB of NAND Flash memory running the Linux kernel 2.4.19. The NAND Flash memory is managed by YAFFS (Yet Another Flash File System) supported in Linux. We modify the page allocation scheme in YAFFS to MODA and compare its performance with the original scheme. Experimental results show that, by enhancing uniformity, the MODA scheme can reduce block cleaning time by up to 43 seconds, with an average of 10.2 seconds, for the benchmarks considered. As the utilization of Flash memory increases, the performance enhancement becomes even larger. Performance is also compared with the previously proposed DAC (Dynamic dAta Clustering) scheme. Results show that both the MODA and DAC schemes perform better than the sequential allocation scheme used in YAFFS, though there are some subtle differences causing performance gaps between them.

The rest of the paper is organized as follows. In Section 2, we elaborate on the characteristics of Flash memory and explain the need for cleaning, which is the key issue that we deal with. We then review previous works that have dealt with this issue in Section 3. Then, we present a model for analyzing the cost of block cleaning in Section 4. In Section 5, we present the new page allocation scheme, which we refer to as MODA, in detail. The implementation details and the performance evaluation results are discussed in Section 6. We conclude the paper with a summary and directions for future works in Section 7.

2. Flash Memory and Block Cleaning

In this section, we explain the internal structure and the characteristics of Flash memory, focusing on the differences from disk storage. Then, we describe why block cleaning is required and how it affects the performance of Flash memory file systems.

2.1 General Characteristics of Flash Memory

Flash memory that is most widely used today is either of the NOR type or the NAND type. One key difference between NOR and NAND Flash memory is the access granularity. NOR Flash memory supports word-level random accesses, while NAND Flash memory supports page-level random accesses. Hence, in embedded systems, NOR Flash memory is usually used to store code, while NAND Flash memory is used as storage for the file system. NOR and NAND Flash memory also differ in density, operational time, and bad block marking mechanisms, among others, and we refer the audience to Sharma for detailed comparisons. From here on we limit our focus to NAND Flash memory, though the ideas proposed in the paper are also applicable to NOR Flash memory.

Figure 1. Structure of Flash memory

Figure 1 shows the general structure of NAND Flash memory. Flash memory consists of a set of blocks, each block consisting of a set of pages. According to the block size, NAND Flash memory is further divided into two classes, that is, small block NAND and large block NAND. In small block NAND Flash memory, each block has 32 pages, where the page size is 528 bytes. A 512-byte portion of these bytes is the data area used to store data, while the remaining 16-byte portion is called the spare area, which is generally used to store ECC and/or bookkeeping information. In large block NAND Flash memory, each block has 64 pages of 2112 bytes (2048 bytes for the data area and 64 bytes for the spare area).

Flash memory as a storage medium has characteristics that are different from traditional disk storage. These characteristics can be summarized as follows.

- Access time in Flash memory is location independent, similar to RAM. There is no "seek time" involved.

- Overwrite is not possible in Flash memory. Flash memory is a form of EEPROM (Electrically Erasable Programmable Read Only Memory), so it needs to be erased before new data can be written over old data.

- Execution time for the basic operations in Flash memory is asymmetric. Traditionally, three basic operations, namely, read, write, and erase, are supported. An erase operation is used to clean used pages so that the pages may be written to again. In general, a write operation takes an order of magnitude longer than a read operation, while an erase operation takes another order of magnitude or more longer than a write operation. Table 1 shows the execution times for these operations compared to DRAM and disk.

- The unit of operation is also different for the basic operations. While the erase operation is performed in block units, read/write operations are performed in page units.

- The number of erasures possible on each block is limited, typically to 100,000 or 1,000,000 times.

These characteristics make designing software for Flash memory challenging and interesting.

Table 1. Operation execution times for different storage media

                Read             Write            Erase
  DRAM          2.56us (512B)    2.56ns (512B)    N/A
  NOR Flash     14.4us (512B)    3.53ms (512B)    1.2s (128KB)
  NAND Flash    35.9us (512B)    226us (512B)     2ms (16KB)
  Disk          12.4ms (512B)    12.4ms (512B)    N/A

Note from the above discussion that a page can be in three different states. A page can be holding legitimate data, making it a valid page; we will say that a page is in a valid state if the page is valid. If the page no longer holds valid data, that is, it was invalidated by being deleted or by being updated, then the page is a dead/invalid page, and the page is in an invalid state. Note that a page in this state cannot be written to until the block it resides in is first erased. Finally, if the page has not been written to in the first place, or the block in which the page resides has just been erased, then the page is clean. In this case, we will say that the page is in a clean state. Note that only pages that are in a clean state may be written to.

Figure 2. Page state transition diagram

Figure 2 shows the state transition diagram of pages in Flash memory. Recall that in disks, there is no notion of an invalid state, as in-place overwrites to sectors are always possible.
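The page lifecycle described above can be made concrete with a short sketch. This is our own illustration, not code from the paper; the class names and the 32-page small-block size are illustrative assumptions.

```python
# Sketch (ours) of the tri-state page lifecycle: CLEAN -> VALID on write,
# VALID -> INVALID on update/delete, INVALID -> CLEAN only via block erase.
from enum import Enum

class PageState(Enum):
    CLEAN = "clean"      # erased; the only state that may be written
    VALID = "valid"      # holds live data
    INVALID = "invalid"  # holds obsolete data; unusable until block erase

class Block:
    def __init__(self, pages_per_block=32):  # small block NAND has 32 pages
        self.pages = [PageState.CLEAN] * pages_per_block

    def write(self, page_no):
        # No in-place overwrite: only a clean page may be written.
        if self.pages[page_no] is not PageState.CLEAN:
            raise ValueError("page must be erased before rewriting")
        self.pages[page_no] = PageState.VALID

    def invalidate(self, page_no):
        # An update or delete marks the old copy dead/invalid.
        self.pages[page_no] = PageState.INVALID

    def erase(self):
        # Erase works only at block granularity, reclaiming every page.
        self.pages = [PageState.CLEAN] * len(self.pages)
```

For example, after `write(0)` followed by `invalidate(0)`, a second `write(0)` fails until the whole block is erased, mirroring the transitions of Figure 2.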
2.2 The Need for Block Cleaning

Reading data or writing totally new data to Flash memory is much like reading from or writing to disk. A page in Flash is referenced/allocated for the data, and data is read/written from/to that particular page. The distinction from a disk is that all reads/writes take a (much shorter) constant amount of time (though writes take longer than reads).

However, for updates of existing data, the story is totally different. As overwriting updated pages is not possible, various mechanisms for non-in-place update have been developed [7,12,13,14]. Though specific details differ, the basic mechanism is to allocate a new page, write the updated data onto the new page, and then invalidate the original page that holds the (now obsolete) original data. The original page now becomes a dead or invalid page.

Note from the tri-state characteristic that the number of clean pages diminishes not only as new data is written, but also as existing data is updated. In order to store more data, and even to make updates to existing data, it is imperative that invalid pages be continually cleaned. Since cleaning is done via the erase operation, which operates in block units, valid pages in the block to be erased must first be copied to a clean block. This exacerbates the already large overhead incurred by the erase operation needed for cleaning a block.

3. Related Work

The issue of segment/block cleaning is not new and has been dealt with in both the disk and Flash memory realms. In this section, we discuss previous research in this area, especially in relation to the work presented in this paper.

Conceptually, the need for block cleaning in Flash memory is identical to the need for segment cleaning in
the Log-structured File System (LFS). LFS writes data to a clean segment and performs segment cleaning to reclaim space occupied by obsolete data, just as invalid pages are cleaned in Flash memory. Segment cleaning is a vital issue in LFS, as it greatly affects performance [15,16,17,18]. Blackwell et al. use the terms 'on-demand cleaning' and 'background cleaning' separately and try to reduce user-perceived latency by applying heuristics to remove on-demand cleaning. Matthews et al. propose a scheme that adaptively incorporates hole-plugging into cleaning according to changes in disk utilization. They also consider how to take advantage of cached data to reduce the cost of cleaning.

Some of the studies on LFS are in line with an aspect of our study, that is, exploiting modification characteristics. Wang and Hu present a scheme that gathers modified data into segment buffers, sorts them according to modification frequency, and writes two segments of data to the disk at one time. This scheme writes hot- and cold-modified data into different segments. Wang et al. describe a scheme that applies non-in-place updates for hot-modified data and in-place updates for cold-modified data. These two schemes, however, are disk-based approaches, and hence are not appropriate for Flash memory. For example, the scheme presented by Wang and Hu writes one segment at a time, as the seek time of disks is a critical performance bottleneck. In Flash memory, this limitation is unnecessary because of the random access nature of Flash memory; several blocks may be accessed simultaneously, copying the necessary pages one by one to any block as needed. Also, the scheme by Wang et al. cannot be used in Flash memory due to the overwrite limitation.

In the Flash memory arena, schemes for improving block cleaning have been suggested in many studies [8,11,12,19,20]. Kawaguchi et al. propose using two separate segments for cleaning: one for newly written data and the other for data to be copied during cleaning. Wu and Zwaenepoel present a hybrid cleaning scheme that combines the FIFO algorithm for uniform access with a locality gathering algorithm for highly skewed access distributions. These approaches differ from ours in that their classification is mainly based on whether data is newly written or copied.

Chiang et al. propose the CAT (Cost Age Time) and DAC (Dynamic dAta Clustering) schemes. CAT chooses blocks to be reclaimed by taking into account the cleaning cost, the age of data in blocks, and the number of times the block has been erased. DAC partitions Flash memory into several regions and places data into different regions according to their update frequencies. In actuality, DAC is very similar to our MODA scheme in that it classifies hot/cold-modified data and allocates them to different blocks. Later, we compare the performance of DAC, which we implemented, with the MODA scheme we propose, and explain the differences between the two schemes. Another key difference between this work and ours is that we identify the parameters that influence the cost of block cleaning in Flash memory and provide a model for the cost based on these parameters. The model tells us what fundamentals we need to focus on and how much gain we can expect when we develop a new algorithm for Flash memory, such as a page allocation, block selection for cleaning, or background cleaning scheme.

There are also some noticeable studies regarding Flash memory. Kim et al. describe the difference between block level mapping and page level mapping. They propose a hybrid scheme that not only handles small writes efficiently, but also reduces the resources required for mapping information. Gal and Toledo present a recoverable Flash file system for embedded systems. They conjecture that a better allocation policy and a better reclamation policy would improve the endurance and performance of Flash memory. The same authors also present a survey, where they provide a comprehensive discussion of algorithms and data structures regarding Flash memory. Chang et al. discuss a block cleaning scheme that considers deadlines in real-time environments. Ben-Aroya analyzes wear-leveling problems mathematically and suggests separating the allocation and cleaning policies from the wear-leveling problems.

4. Analysis of Block Cleaning Cost

In this section, we identify the parameters that affect the cost of block cleaning in Flash memory. We formulate the cost based on these parameters and analyze their effects on the cost.

4.1. Performance Parameters

Two types of block cleaning are possible in Flash memory. The first is when valid and invalid pages coexist within the block that is to be cleaned. Here, the valid pages must first be copied to a clean page in a different block before the erase operation on the block can happen. We shall refer to this type of cleaning as 'copy-erase cleaning'. The other kind of cleaning is where there are no valid pages in the block to be erased.
All pages in this block are either invalid or clean. This cleaning imposes only a single erase operation, and we shall refer to this type of cleaning as 'erase-only cleaning'. Note that for the same number of cleaning attempts, more erase-only cleaning will result in lower cleaning cost.

Observe that for copy-erase cleaning, the number of valid pages in the block to be cleaned is a key factor that affects the cost of cleaning. The more valid pages there are in the block to be cleaned, the more copy operations need to be performed. Also observe that for more erase-only cleaning to happen, the way in which the invalid pages are distributed plays a key role. A skewed distribution of the invalid pages towards particular blocks will increase the chance of having only invalid pages in blocks, increasing the possibility of erase-only cleaning.

From these observations, we identify three parameters that affect the cost of block cleaning. They are defined as follows:

- Utilization (u): the fraction of valid pages in Flash memory
- Invalidity (i): the fraction of invalid pages in Flash memory
- Uniformity (p): the fraction of blocks that are uniform in Flash memory, where a uniform block is a block that does not contain both valid and invalid pages simultaneously

Figure 3 shows three page allocation situations where utilization and invalidity are the same, but uniformity is different. Since there are eight valid pages and eight invalid pages among the 20 pages in all three cases, utilization and invalidity are both 0.4. However, there are one, three, and five uniform blocks in Figures 3(a), (b), and (c), respectively; hence uniformity is 0.2, 0.6, and 1, respectively.

Figure 3. Situation where utilization (u=0.4) and invalidity (i=0.4) remain unchanged, while uniformity (p) changes: (a) p = 0.2 (b) p = 0.6 (c) p = 1

Utilization determines, on average, the number of valid pages that need to be copied for copy-erase cleaning. Invalidity determines the number of blocks that are candidates for cleaning. Finally, uniformity refers to the fraction of blocks that are uniform blocks. A uniform block is a block with zero or more clean pages, where the remainder of the pages in the block are uniformly valid or uniformly invalid. An equivalent definition of uniformity is "1 minus the fraction of blocks that have both valid and invalid pages." Of all the uniform blocks, only those containing invalid pages are candidates for erase-only cleaning.

From these observations, we can formulate the cost of block cleaning as follows:

Cost of Block Cleaning
  = Cost of copy-erase cleaning + Cost of erase-only cleaning
  = (# of non-uniform blocks * et)
    + (# of valid pages in non-uniform blocks * (rt + wt))
    + (# of uniform blocks with invalid pages * et)

where
  rt: read operation execution time
  wt: write operation execution time
  et: erase operation execution time

Each term is elaborated as follows:

  # of non-uniform blocks = (1 - p) * B
  # of valid pages in non-uniform blocks
    = # of non-uniform blocks * # of valid pages in a block
    = (1 - p) * B * (P/B) * (u / (u + i))
  # of uniform blocks with invalid pages
    = (# of invalid pages - # of invalid pages in non-uniform blocks) / (# of pages in a block)
    = ((i * P) - ((1 - p) * B * (P/B) * (i / (u + i)))) / (P/B)

where
  u: utilization (0 ≤ u ≤ 1)
  i: invalidity (0 ≤ i ≤ 1 - u)
  p: uniformity (0 ≤ p ≤ 1)
  P: # of pages in Flash memory (P = capacity / page size)
  B: # of blocks in Flash memory (P/B: # of pages in a block)

In the formula, (P/B) * (u / (u + i)) denotes the number of pages in a block that are valid, and hence need to be copied. Each copy is associated with a read and a write operation, whose execution times are denoted by rt and wt, respectively. Note that since some Flash memory controllers support the copy-back operation, that is, copying a page to another page internally, the copy-back operation execution time could be used instead of (rt + wt) as the copy overhead. In general, the copy-back execution time is similar to the write operation execution time. The number of invalid pages in uniform blocks, represented as (i * P) - (1 - p) * B * (P/B) * (i / (u + i)), is what contributes to the cost of erase-only cleaning. The first term is the total number of invalid pages, while the second term is the number of invalid pages in non-uniform blocks.
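The cost formula above translates directly into code. The sketch below is our own transcription, not the authors' implementation; the function name is ours, and the example uses the small block 64MB NAND configuration (512-byte pages, 32 pages per block) with the 15us/200us/2000us read/write/erase latencies quoted in Section 4.3.

```python
# Sketch (ours): block cleaning cost per the formula above.
# u, i, p, P, B follow the definitions in the text; rt, wt, et are the
# read, write, and erase latencies (here in microseconds).
def cleaning_cost(u, i, p, P, B, rt, wt, et):
    pages_per_block = P / B
    non_uniform = (1 - p) * B                    # blocks mixing valid and invalid pages
    valid_in_nu = non_uniform * pages_per_block * (u / (u + i))
    uniform_invalid = (i * P - non_uniform * pages_per_block * (i / (u + i))) / pages_per_block
    copy_erase = non_uniform * et + valid_in_nu * (rt + wt)  # (rt + wt) could be a
    erase_only = uniform_invalid * et                        # copy-back latency instead
    return copy_erase + erase_only

# Small block 64MB NAND: 512B pages, 32 pages per block.
P = (64 * 2**20) // 512   # 131072 pages
B = P // 32               # 4096 blocks
low = cleaning_cost(u=0.4, i=0.4, p=1.0, P=P, B=B, rt=15, wt=200, et=2000)
high = cleaning_cost(u=0.4, i=0.4, p=0.2, P=P, B=B, rt=15, wt=200, et=2000)
```

With the Figure 3 setting of u = i = 0.4, a fully uniform Flash (p = 1) incurs only erase-only cleanings (cost i * B * et), while p = 0.2 is roughly five times costlier, since copy operations and extra erases dominate.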
4.2. Validation of Model

As we need to explain the experimental setup used to validate the model, we defer the validation and the experiments that were performed to Section 6.

4.3. Implication of the Parameters on Block Cleaning Costs

Figure 4. Block cleaning costs

Figure 4 shows the analytic results of the cost of block cleaning based on the derived model. In the figure, the x-axis is utilization, the y-axis is uniformity, and the z-axis is the cost of block cleaning. For this graph, we set invalidity to 0.1 and use the raw data of a small block 64MB NAND Flash memory. The execution times for read, write, and erase operations are 15us, 200us, and 2000us, respectively.

The main observation from Figure 4 is that the cost of cleaning increases dramatically when utilization is high and uniformity is low. We also conducted the analysis with different values of invalidity and with the raw data of a large block 1GB NAND Flash memory, which shows trends similar to those observed in Figure 4.

We now discuss the implication of each parameter in detail. Utilization is determined by the amount of useful data that the user stores. From the file system's point of view, this is almost an uncontrollable parameter, though a compression technique with a hardware accelerator can be used to maintain low utilization.

Invalidity is determined by the data update frequency and the cleaning frequency. If updates (including deletions of stored data) are frequent, invalidity will rise. However, since invalidity decreases with each block cleaning, frequent cleaning will keep invalidity low. Update frequency is controllable by delaying updates using a buffer cache. However, mobile appliances such as cellular phones, MP3 players, and digital cameras that use Flash memory as their storage device are prone to sudden power failures. Hence, delayed writes can be applied only if some form of power failure recovery mechanism is present. Cleaning frequency is also controllable, but we need to consider carefully how to balance the benefit of decreasing invalidity against the cost of frequent cleaning.

Finally, uniformity is a parameter that may be controlled by the file system, as the file system is in charge of allocating particular pages upon a write request. The redistribution policy that decides where the valid data in reclaimed blocks are written out also controls this parameter. By judiciously allocating and reorganizing pages, we may be able to maintain high uniformity.

Figure 5 depicts how each parameter affects the cost of block cleaning. For this analysis, we use the same small block NAND Flash memory that was used for Figure 4. In each figure, the initial values of the three parameters are all set to 0.5. Then, we decrease utilization in Figure 5(a), decrease invalidity in Figure 5(b), and increase uniformity in Figure 5(c).

Figure 5. How block cleaning cost is affected by (a) utilization, (b) invalidity, and (c) uniformity as the other two parameters are kept constant at 0.5

From these figures, we find that the impact of utilization and uniformity on block cleaning cost is higher than that of invalidity. Since utilization is almost uncontrollable, this implies that, to keep cleaning cost down, keeping uniformity high may be a better approach than trying to keep invalidity low through frequent cleaning.
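This sensitivity can be checked numerically. The sketch below is ours, not the authors' analysis script: it re-evaluates the cost formula of Section 4.1 while varying one parameter and pinning the other two at 0.5, mirroring the setup of Figure 5, with the small block NAND constants used above.

```python
# Sketch (ours): sensitivity of block cleaning cost to each parameter,
# mirroring Figure 5. Constants: small block 64MB NAND (P=131072 pages,
# B=4096 blocks) with rt=15us, wt=200us, et=2000us.
def cost(u, i, p, P=131072, B=4096, rt=15, wt=200, et=2000):
    ppb = P / B
    nu = (1 - p) * B
    valid_nu = nu * ppb * (u / (u + i))
    uni_inv = (i * P - nu * ppb * (i / (u + i))) / ppb
    return nu * et + valid_nu * (rt + wt) + uni_inv * et

base = cost(0.5, 0.5, 0.5)
# Perturb each parameter by 0.4 while the other two stay at 0.5.
gain_u = base - cost(0.1, 0.5, 0.5)  # decrease utilization
gain_i = base - cost(0.5, 0.1, 0.5)  # decrease invalidity
gain_p = base - cost(0.5, 0.5, 0.9)  # increase uniformity
```

In this model, lowering utilization or raising uniformity moves the cost by several million microseconds, while perturbing invalidity by the same amount shifts it noticeably less, consistent with the observation above that uniformity and utilization dominate.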
5. Page Allocation Scheme that Strives for Uniformity

In the previous section, we found that among the three parameters, uniformity is the parameter that most strongly influences the cost of block cleaning. In this section, we present a new page allocation scheme that strives to maintain high uniformity so that the cost of block cleaning may be kept low.

5.1. Modification-Aware Allocation Scheme

When pages are requested in file systems, in general, pages are allocated sequentially [7,12]. Flash file systems tend to follow this approach and simply allocate the next available clean page when a page is requested, not taking into account any of the characteristics of the storage media.

We propose an allocation scheme that takes into account file modification characteristics such that uniformity may be maximized. The allocation scheme is modification-aware (MODA), as it distinguishes data that are hot-modified, that is, modified frequently, from data that are cold-modified, that is, modified infrequently. Allocation of pages for each distinct type of data is done from distinct blocks.

The motivation behind this scheme is that by classifying hot-modified pages and allocating them to the same block, we will eventually turn the block into a uniform block filled with invalid pages. Likewise, by classifying cold-modified pages and allocating them together, we will turn this block into a uniform block filled with valid pages. Pages that are neither hot-modified nor cold-modified are sequestered so that they may not corrupt the uniformity of the blocks that hold hot- and cold-modified pages.

In fact, this kind of file segregation based on update frequency is not our own invention. It is a well-known approach exploited in much research, including the IRM (Independent Reference Model), buffer cache management schemes, segment cleaning schemes in LFS [17, 18], and data clustering in Flash memory [8, 11]. Note that the main goal of our scheme is not inventing a thoroughly new page allocation scheme. Rather, we want to design a simple scheme that provides different uniformity compared with sequential allocation with minimal overheads, and to analyze how the difference affects the performance of Flash memory in accordance with the model described in Section 4.

Figure 6. Two level (static and dynamic property) classification used in the MODA scheme

Figure 6 shows the two levels of modification-aware classification used in MODA to classify hot/cold-modified data. At the first level, we make use of static properties, while dynamic properties are used at the second level. This classification is based on the skewness in page modification distributions [3,8,18].

As the static property, we distinguish system data from user data, as the modification characteristics of the two are quite different. The superblock and inodes are examples of system data, while data written by users are examples of user data. We know that inodes are modified more intensively than user data, since any change to the user data in a file causes changes to its associated inode.

User data is further classified at the second level, where its dynamic property is used. In particular, we make use of the modification count. Keeping a modification count for each page, however, may incur considerable overhead. Therefore, we choose to monitor at a much larger granularity and keep a modification count for each file, which is updated when the modification time changes.

For classification with the modification count, we adopt the MQ (Multi Queue) algorithm. Specifically, it uses multiple LRU queues numbered Q0, Q1, ..., Qm-1. Each file stays in a queue for a given lifetime. When a file is first written (created), it is inserted into Q0. If a file is modified within its lifetime, it is promoted from
Qi to Qi+1. On the other hand, if a file is not modified within its lifetime, it is demoted from Qi to Qi-1. We then classify a file promoted out of Qm-1 as hot-modified data, and a file demoted out of Q0 as cold-modified data. Files within the queues are regarded as unclassified data. In our experiments, we set m to 2 and the lifetime to 100 (time is virtual and ticks at each modification request). In other words, a file modified more than 2 times is classified as hot, while a file that has not been modified within the recent 100 modification requests is classified as cold. We find that MODA with different values of m = 3 and/or lifetime = 500 shows similar behavior.

Figure 7. Managing mechanism for MODA

Figure 7 shows the structure of the MODA scheme. There are 4 managers, named the hot block, cold block, unclassified block, and clean block managers. The first three managers govern hot-classified files, cold-classified files, and unclassified files, respectively. When blocks are reclaimed, they are given to the clean block manager, while clean blocks are requested by the other three managers dynamically for serving write requests.

6. Implementation and Performance Evaluation

In this section, we describe the implementation of the MODA scheme on an embedded platform. Using this real implementation, we perform various experiments and analyze the effects of MODA in view of the proposed model.

6.1. Platform and Implementation

We have implemented the MODA scheme on an embedded system. Hardware components of the system include a 400MHz XScale PXA CPU, 64MB SDRAM, 64MB NAND Flash memory, 0.5MB NOR Flash memory, and embedded controllers. Table 2 summarizes the hardware components and their specifications. The same NAND Flash memory that was used to analyze the cost of block cleaning in Figure 4 is used here.

Table 2. Hardware components and specifications

  Hardware Components    Specification
  CPU                    400MHz XScale PXA 255
  RAM                    64MB SDRAM
  Flash                  64MB NAND, 0.5MB NOR
  Interface              CS8900, USB, RS232, JTAG

The Linux kernel 2.4.19 was ported to the hardware platform, and YAFFS is used to manage the NAND Flash memory. We modify the page allocation scheme in YAFFS and compare the performance with the native YAFFS. We omit a detailed discussion of YAFFS and only describe the relevant parts below. Interested readers are directed to [6,7] for details of YAFFS.

The default page allocation scheme in YAFFS is the sequential allocation scheme. We implemented the MODA scheme in YAFFS and will refer to this version of YAFFS as MODA-YAFFS, or simply MODA. In MODA-YAFFS, we modified functions such as yaffs_WriteChunkDataToObject(), yaffs_FlushFile(), and yaffs_UpdateObjectHeader() in yaffs_guts.c, and init_yaffs_fs() and exit_yaffs_fs() in yaffs_fs.c.

The block cleaning scheme in YAFFS is invoked at each write request. There are two modes of cleaning: normal and aggressive. In normal mode, YAFFS chooses the block that has the largest number of invalid pages among a predefined number of blocks (the default setting is 200). If the chosen block has fewer than 3 valid pages, it reclaims the block; otherwise, it gives up on the reclaiming. If the number of clean blocks is lower than a predefined number (the default setting is 6), the mode is converted to aggressive. In aggressive mode, YAFFS chooses a block that has invalid pages and reclaims the block without checking the number of valid pages in it. The block cleaning scheme in MODA-YAFFS is exactly the same. We do add a new interface for block cleaning that may be invoked at the user level for ease of measurement.

6.2. The Workload

To gather sufficient workloads for evaluating our scheme, we survey Flash memory related papers as
many as possible and choose the following 7 number) and perform 500 transactions (again, the
benchmarks for our implementation experiments. default value) for one benchmark . It models
Camera benchmark : It executes a number of Internet appliance, namely e-mail, e-commerce,
transactions repeatedly where each transaction and news transactions and used widely for testing
consists of three steps; creating 10 files whose size performance of embedded system .
is 200~1000KB, which is the common size of a Andrew benchmark : It was originally developed
photo in JPEG format, reading some of these files, for testing disk based file systems, not for Flash
and deleting some of these files randomly. Such memory. But, many Flash memory studies use it
actions of taking, browsing, and erasing pictures for performance evaluation [12, 21]. The Andrew
benchmark consists of 5 phases: directory creation,
are common behaviors of digital camera users,
file copying, file status checking, file contents
observed in [21, 27]. checking and compiling. In our experiments, out
Movie player benchmark : This benchmark of the five phases, we only execute the first two
simulates the workload of Portable Media Players
. It writes large files sequentially whose sizes These benchmarks can be roughly grouped into three
vary from 15 to 30MB and deletes some of them. categories: sequential read/write intensive workloads,
update intensive workloads, and multiple files intensive
Phone benchmark : It simulates the behavior of a
workloads. The first group includes Camera and Movie
cellular phone . Each daily activity consists of benchmark that access large files sequentially. Phone
sending/receiving 3 SMS messages, receiving/ and Recorder benchmark are typical examples of the
dialing 10 calls, missing 5 calls, adding/deleting 5 update intensive workloads which are manipulating
appointments. Six files are updated for the small number of files and update them intensively. The
activities, three files for recording call information Fax, Postmark and Andrew benchmark are embraced as
the final category that accesses multiple files with
(15 bytes), two files for saving SMS message (80
different access probabilities.
bytes), and one file for managing appointments
(variable length). 6.3. Performance Evaluation
Recorder benchmark : It models the behavior of
In the first part of this subsection, we report the
event recorder such as automotive black box and performance evaluation results. Then we discuss the
remote sensor . It handles two files, one for validation of our proposed model and performance
recording every event (32 bytes) and the other for characteristics according to the variety of uniformity,
recording every 10th event (also, 32 bytes). utilization, and periodic block cleaning.
Fax machine benchmark : It creates two files 6.3.1 Performance results
named a parameter file and a phonebook file,
which are never touched after creating. Also, it Table 3 shows performance results both measured by
creates four new files (51300 bytes per each file) executing benchmarks and estimated by the model.
Before each execution the utilization of Flash memory
and updates a history file (200 bytes), when it
is set to 0, that is, the Flash memory is reset completely.
receives a fax. This behavior can be observed in Then, we execute each benchmark on YAFFS and
fax machines, answering machines, and music MODA-YAFFS and measure its execution time
player . reported in the ‘Benchmark Running Time’ column.
Note that the only difference between YAFFS and
Postmark : This benchmark creates a large MODA-YAFFS is the page allocation scheme. Also,
number of randomly sized files. It then executes after executing the benchmark, we measure the
read, write, delete, and append transactions on performance parameters, namely utilization, invalidity,
these files. We create 500 files (the default and uniformity of Flash memory denoted as ‘U’, ‘I’, ‘P’
in the Table 3.
Table 3. Performance comparison of YAFFS and MODA-YAFFS for the benchmarks when utilization at start of execution is 0 (the unit of time measurement is seconds)

                                                  |      Estimated Results      |      Measured Results
Benchmark     Scheme  Running   U      I      P   | # Erase  # Copy   Cleaning | # Erase  # Copy   Cleaning
                      Time                        |                   Time     |                   Time
Camera        YAFFS      37    0.3    0.0006 0.98 |     62     1916     2.46   |     60     1504     9.97
              MODA       37    0.3    0.0006 0.99 |      7      159     0.20   |      5       62     7.58
Movie player  YAFFS     564    0.99   0.0001 0.99 |     10      319     0.41   |     10       17    24.64
              MODA      564    0.99   0.0001 0.99 |      1       31     0.04   |      1        3    24.54
Phone         YAFFS      88    0.05   0.32   0.62 |   1398     6047     9.87   |   1398     6047    13.40
              MODA       88    0.05   0.22   0.72 |   1010     6052     9.20   |   1011     6063    12.08
Recorder      YAFFS      31    0.005  0.16   0.83 |    626      692     1.94   |    626      699     2.17
              MODA       31    0.005  0.08   0.90 |    233      692     1.44   |    344      690     1.91
Fax machine   YAFFS      97    0.86   0.0087 0.73 |   1023    31710    40.78   |   1001    30996    67.50
              MODA       97    0.86   0.0087 0.97 |    107     2407     3.14   |     76     1383    23.47
Postmark      YAFFS      16    0.08   0.0107 0.90 |    357    10158    13.11   |    357    10147    18.39
              MODA       16    0.08   0.0107 0.93 |    260     7057     9.13   |    248     6652    12.84
Andrew        YAFFS      33    0.09   0.0008 0.90 |    348    10174    13.12   |    345    10060    18.51
              MODA       32    0.09   0.0008 0.98 |     87     1828     2.40   |     62     1004     3.86
Using the measured performance parameters and the model proposed in Section 4.1, we can estimate the number of erase and copy operations required to reclaim all the invalid pages. The model also gives us the expected cleaning time. These estimated results are reported in the 'Estimated Results' column. Finally, we actually measure the number of erase and copy operations and the cleaning time required to reclaim all invalid pages, which are reported in the 'Measured Results' column. The measured results reported are averages of three executions for each case unless otherwise stated.

6.3.2 Model Validation

Table 3 shows that the numbers of erase and copy operations estimated by the model are similar to those measured in the real implementation, though the model tends to overestimate the copy operations when invalidity is low. These similarities imply that the model is fairly effective at predicting how many operations are required for a given state of Flash memory.

However, there are noticeable differences between the measured and estimated block cleaning times. From a sensitivity analysis, we find two main reasons for these differences. The first reason is that the model requires the read, write, and erase times to estimate the block cleaning time. The simplest way to determine these values is to use the data sheet provided by the Flash memory chip vendor. However, through experiments we observe that the values reported in the datasheet and the actual times seen at various levels of the system differ considerably. Figure 8 shows these results. While the datasheet reports read, write, and erase times of 0.01ms, 0.2ms, and 2ms, respectively, for the Flash memory used in our experiments, the times observed when directly accessing Flash memory at the device driver level are 0.19ms, 0.3ms, and 1.7ms, respectively. Furthermore, when observed just above the MTD layer, the read, write, and erase times are 0.2ms, 1.03ms, and 1.74ms, respectively, with large variances. These variances drastically influence the accuracy of the model. Since YAFFS runs on top of the MTD layer, we use the times observed above the MTD layer for the estimated results reported in Table 3.

Figure 8. Execution time at each level

The second reason is that block cleaning incurs not only the copy and erase overheads but also software manipulation overheads. Specifically, YAFFS does not manage block and page status information in main memory, so as to minimize its RAM footprint. Hence, it needs to read Flash memory to detect the blocks and pages to be cleaned. Due to this overhead, there are gaps between the measured and estimated cleaning times. This overhead also explains why the gap increases as utilization increases. Nevertheless, the trends derived from the model and the measurements are quite similar, which implies that the model is a good indicator of the performance characteristics of Flash memory.

6.3.3 Effects of Uniformity

By comparing the results of YAFFS with those of MODA, we make the following observations.

- The benchmark execution time is the same for YAFFS and MODA. This implies that there is minimal overhead for the additional computation that may be incurred for data classification.

- The MODA scheme maintains high uniformity, which enables it to reduce the block cleaning time by up to 43 seconds (for the Fax machine benchmark), with an average of 10.2 seconds across the benchmarks.

- The performance gains of MODA for the Movie and Camera benchmarks are trivial. Our model reveals that the original sequential page allocation scheme used in YAFFS also keeps uniformity high for these workloads, making it difficult to obtain considerable gains.

- The gains of MODA for the Phone and Recorder benchmarks are also trivial, even though there is still room for enhancing uniformity. Careful analysis shows that MODA classifies hot/cold data at file-level granularity, but these benchmarks access a small number of files: six files for Phone and two files for Recorder. These experiments disclose a limitation of the MODA scheme and suggest that page-level classification may be used effectively for such benchmarks.

We also perform experiments running two or more benchmarks in combination, such as 'Camera + Phone', simulating a recent cellular phone with digital camera capabilities, and 'Movie + Recorder + Postmark', simulating a PMP player that uses an embedded DB to maintain movie titles, an actor library, and digital rights. Experiments show that the trends of multiple benchmark executions are similar to those of Postmark, reported in Table 3. For example, for the combination of 'Movie + Recorder + Postmark', the block cleaning times of YAFFS and MODA are 34.82 and 22.58 seconds, while the uniformity values are 0.73 and 0.84, respectively. We also find that the interference among benchmarks drives the uniformity of Flash memory low, even for the large sequential multimedia files.

6.3.4 Effects of Utilization

In real life, the utilization of Flash memory will rarely be 0. Hence, we conduct measurements similar to those for Table 3, but varying the initial utilization value. Figure 9 shows the results of executing Postmark under various initial utilization values. Utilization was artificially increased by executing the Andrew benchmark before each of the measurements. The exact same experiments were conducted with utilization adjusted by pre-executing the Postmark benchmark instead, but the result trend is similar, so we only report one set of these results.

Figure 9. Results reported under various initial utilization values

For the moment, ignore the results reported when utilization is 0.9, which show somewhat different behavior; we come back to discuss these results shortly. The results in Figure 9 show that block cleaning time increases as utilization increases, confirming what we had observed in Figure 5(a). Our model also predicts, in Figure 4, that at high utilization an enhancement of uniformity yields a larger reduction of cleaning time. Figure 9 reinforces this expectation by showing that the difference in block cleaning time between YAFFS and MODA-YAFFS increases as utilization increases.
Let us now discuss the results reported when the initial utilization is 0.9. Observe that the results differ from those with lower utilization values, in that the benchmark running time is much higher, more so for YAFFS. This is because YAFFS is confronted with a lack of clean blocks during execution and hence turns the block cleaning mode to aggressive. As a result, on-demand block cleaning occurs frequently, increasing the benchmark running time to 72 seconds, four times the running time observed at lower utilization. Note that in MODA-YAFFS, the running time does increase, but not as significantly. This is because the MODA allocation scheme allows more blocks to be kept uniform, and hence aggressive on-demand block cleaning is invoked less often.

6.3.5 Effects of Periodic Block Cleaning

In YAFFS, block cleaning is invoked at each write request and attempts to reclaim at most one block per trial. In other file systems, block cleaning is invoked when free space becomes smaller than a predefined lower threshold and attempts to reclaim blocks until free space becomes larger than an upper threshold [2, 12]. Alternatively, block cleaning can happen when the system is idle. Ideally, if this is possible, then all block cleaning costs may be hidden from the user. Whether this is possible or not will depend on many factors, including the burstiness of request arrivals, the idle state detection mechanism, and so on.

To see how periodic block cleaning affects performance, we conduct the following sequence of executions. Starting from utilization zero, that is, a clean Flash memory state, we repeatedly execute Postmark until the benchmark can no longer complete as Flash memory completely fills up. During this iteration, two different actions are taken. For Figures 10(a) and (c), nothing is done between executions; that is, no form of explicit cleaning is performed. For Figures 10(b) and (d), block cleaning is performed between benchmark executions. This represents a case where block cleaning occurs periodically. Figures 10(a) and (b) are the measurement results for YAFFS, while Figures 10(c) and (d) are the results for MODA. The utilization values reported on the x-axis are the values before each benchmark execution.

Figure 10. Execution time for Postmark for YAFFS (a) without periodic cleaning, (b) with periodic cleaning, and for MODA (c) without periodic cleaning, (d) with periodic cleaning

The results here verify what we would expect based on the observations from Section 6.3.4. As long as utilization is kept under some value, YAFFS and MODA perform the same (with or without periodic cleaning). Without periodic cleaning, once past that threshold, 0.87 in Postmark, the benchmark execution time abruptly increases for YAFFS, while for MODA it increases only slightly. This is because enough clean blocks are maintained in MODA. When block cleaning is invoked periodically, the benchmark execution time may be maintained at a minimum. However, the problem with this approach is that periodic cleaning itself incurs overhead, and the issue becomes whether this overhead can be hidden from the user or not. As this issue is beyond the scope of this paper, we leave this matter for future work. Note that the periodic cleaning times are smaller in MODA than in YAFFS, implying that periodic cleaning also benefits from keeping uniformity high.

6.4. MODA vs DAC

In this subsection, we compare MODA with the DAC scheme proposed by Chiang et al. In the DAC scheme, Flash memory is partitioned into several regions, and data are moved toward the top/bottom region as their update frequencies increase/decrease. Both MODA and DAC try to cluster data not only at block cleaning time, but also at data update time. We implemented the DAC scheme in YAFFS and set the number of regions to 4, as this is reported to have shown the best performance.

Figure 11. Performance Comparison between MODA and DAC

Figure 11 shows the results for the MODA and DAC schemes. (The results for YAFFS are shown for comparison's sake.) We execute each benchmark under the same conditions described in Table 3. The results show that the MODA scheme performs better than the DAC scheme.

Detailed examination reveals that the performance gap between the MODA and DAC schemes comes from two sources. One is that the DAC scheme clusters data into the cold region if their update frequency is low. This may cause truly cold-modified data and newly written data (which may actually be hot-modified data) to coexist on the same block, which reduces uniformity. In contrast, the MODA scheme groups newly written data into separate blocks managed by the unclassified manager shown in Figure 7, segregating them from the cold-modified data.

The second source is that the MODA scheme uses a two-level classification scheme, distinguishing system data from user data (a static property) at the first level and then further classifying user data based on their modification counts (a dynamic property) at the second level. By considering only the dynamic property of the data, the DAC scheme cannot gather enough information to make a timely distinction between the two types. When we apply static-property-based classification to the DAC scheme, its performance comes close to that of MODA. These results imply that static properties provide chances to enhance uniformity early enough. Also, from the performance parameters measured in these experiments, we find that the performance benefits obtained by MODA and DAC essentially come from enhancing the uniformity of Flash memory.

7. Conclusion

Two contributions are made in this paper. First, we identify the cost of block cleaning as the key performance bottleneck for Flash memory, analogous to the seek time in disk storage. We derive three performance parameters from the features of Flash memory and present a formula for the block cleaning cost based on these parameters. We show that, of these parameters, uniformity is the key controllable parameter that has a strong influence on the cost. This leads us to our second contribution, a new modification-aware (MODA) page allocation scheme that strives to maintain high uniformity. Using the MODA scheme, we validate our model and evaluate performance characteristics from the viewpoints of uniformity, utilization, and periodic cleaning.

We are considering three research directions for future work. One direction is enhancing the proposed MODA scheme so that it can keep uniformity high for benchmarks that manipulate a small number of files. Another direction is the development of an efficient block cleaning scheme. Uniformity is influenced not only by the page allocation scheme, but also by the block cleaning scheme. We need to investigate issues such as defining and finding idle time to initiate block cleaning, and deciding which blocks, and how many of them, should be reclaimed once reclaiming is initiated. In relation to block cleaning, we plan to extend the model to capture the burstiness of the workload, which determines how much overhead can be hidden from user-perceived latency. Finally, we neglected the issue of wear-leveling in this paper. Though we project that our scheme will help wear-leveling as cleaning is reduced, we have not quantified this effect. We need to investigate the effects of our proposed scheme on wear-leveling in a more quantitative manner.

References

- M. McKusick, W. Joy, S. Leffler, and R. Fabry, "A Fast File System for UNIX", ACM Transactions on Computer Systems, vol. 2, no. 3, pp. 181-197, Aug. 1984.
- M. Rosenblum and J. K. Ousterhout, "The Design and Implementation of a Log-Structured File System", ACM Transactions on Computer Systems, vol. 10, no. 1, pp. 26-52, 1992.
- William Stallings, "Operating Systems: Internals and Design Principles", 5th Edition, Pearson Prentice Hall, 2004.
- H. Yu, D. Agrawal, and A. E. Abbadi, "Towards Optimal I/O Scheduling for MEMS-based Storage", Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), 2003.
- S. W. Schlosser and G. R. Ganger, "MEMS-based Storage Devices and Standard Disk Interfaces: A Square Peg in a Round Hole?", Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST'04), 2004.
- S. Lim and K. Park, "An Efficient NAND Flash File System for Flash Memory Storage", IEEE Transactions on Computers, vol. 55, no. 7, July 2006.
- Aleph One, "YAFFS: Yet Another Flash File System", www.aleph1.co.uk/yaffs/.
- M.-L. Chiang, P. C. H. Lee, and R.-C. Chang, "Using Data Clustering to Improve Cleaning Performance for Flash Memory", Software: Practice and Experience, vol. 29, no. 3, pp. 267-290, 1999.
- Ashok K. Sharma, "Advanced Semiconductor Memories: Architectures, Designs, and Applications", Wiley Interscience, 2003.
- Samsung Electronics, "NAND Flash Data Sheet".
- E. Gal and S. Toledo, "Algorithms and Data Structures for Flash Memories", ACM Computing Surveys, vol. 37, no. 2, pp. 138-163, 2005.
- A. Kawaguchi, S. Nishioka, and H. Motoda, "A Flash-Memory Based File System", Proceedings of the 1995 USENIX Annual Technical Conference, pp. 155-164, 1995.
- K. Yim, H. Bahn, and K. Koh, "A Flash Compression Layer for SmartMedia Card Systems", IEEE Transactions on Consumer Electronics, vol. 50, no. 1, pp. 192-197, Feb. 2004.
- Y. Zhou, P. M. Chen, and K. Li, "The Multi-Queue Replacement Algorithm for Second-Level Buffer Caches", Proceedings of the 2001 USENIX Annual Technical Conference, June 2001.
- H. Jo, J. Kang, S. Park, J. Kim, and J. Lee, "FAB: Flash-Aware Buffer Management Policy for Portable Media Players", IEEE Transactions on Consumer Electronics, vol. 52, no. 2, May 2006.
- E. Gal and S. Toledo, "A Transactional Flash File System for Microcontrollers", Proceedings of the 2005 USENIX Annual Technical Conference, pp. 89-104, 2005.
- D. Woodhouse, "JFFS: The Journalling Flash File System", Ottawa Linux Symposium, 2001, http://source.redhat.com/jffs2/jffs2.pdf.
- T. Blackwell, J. Harris, and M. Seltzer, "Heuristic Cleaning Algorithms in Log-Structured File Systems", Proceedings of the 1995 USENIX Annual Technical Conference, pp. 277-288, 1995.
- J. Matthews, D. Roselli, A. Costello, R. Wang, and T. Anderson, "Improving the Performance of Log-Structured File Systems with Adaptive Methods", ACM Symposium on Operating Systems Principles (SOSP), pp. 238-251, 1997.
- J. Wang and Y. Hu, "WOLF - A Novel Reordering Write Buffer to Boost the Performance of Log-Structured File Systems", Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pp. 46-60, 2002.
- W. Wang, Y. Zhao, and R. Bunt, "HyLog: A High Performance Approach to Managing Disk Layout", Proceedings of the USENIX Conference on File and Storage Technologies (FAST), pp. 145-158, 2004.
- M. Wu and W. Zwaenepoel, "eNVy: A Non-Volatile, Main Memory Storage System", Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 86-97, 1994.
- L. P. Chang, T. W. Kuo, and S. W. Lo, "Real-Time Garbage Collection for Flash Memory Storage Systems of Real-Time Embedded Systems", ACM Transactions on Embedded Computing Systems, vol. 3, no. 4, pp. 837-863, 2004.
- J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, "A Space-Efficient Flash Translation Layer for CompactFlash Systems", IEEE Transactions on Consumer Electronics, vol. 48, no. 2, pp. 366-375, 2002.
- EZ-X5, www.falinux.com/zproducts.
- J. Katcher, "PostMark: A New File System Benchmark", Technical Report TR3022, Network Appliance Inc., 1997.
- A. Ben-Aroya, "Competitive Analysis of Flash-Memory Algorithms", M.Sc. paper, Apr. 2006, www.cs.tau.ac.il/~stoledo/Pubs/flash-abrhambe-msc.pdf.