Real-Time Garbage Collection for Flash-Memory Storage Systems ...

  • 2,073 views
Uploaded on

 

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
2,073
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
2
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded Systems Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo {d6526009,ktw,d89015}@csie.ntu.edu.tw National Taiwan University Flash memory technology is becoming critical in building embedded systems applications because of its shock-resistant, power economic, and non-volatile nature. With the recent technology break- throughs in both capacity and reliability, flash-memory storage systems are now very popular in many types of embedded systems. However, because flash memory is a write-once and bulk-erase medium, we need a translation layer and a garbage collection mechanism to provide applications a transparent storage service. In the past work, various techniques were introduced to improve the garbage collection mechanism. These techniques aimed at both performance and endurance issues, but they all failed in providing applications a guaranteed performance. In this paper, we propose a real-time garbage collection mechanism, which provides a guaranteed performance, for hard real-time systems. On the other hand, the proposed mechanism supports non-real-time tasks so that the potential bandwidth of the storage system can be fully utilized. A wear-levelling method, which is executed as a non-real-time service, is presented to resolve the endurance prob- lem of flash memory. The capability of the proposed mechanism is demonstrated by a series of experiments over our system prototype. Categories and Subject Descriptors: B.3.2 [Design Styles]: Mass storage; D.4.2 [Storage Man- agement]: Garbage collection; D.4.7 [Organization and Design]: Real-time systems and em- bedded systems General Terms: Algorithms,Design Additional Key Words and Phrases: Real-time system, embedded systems, storage systems, flash memory, garbage collection 1. INTRODUCTION Flash memory is not only shock-resistant and power economic but also non-volatile. With the recent technology breakthroughs in both capacity and reliability, more and more (embedded) system applications now deploy flash memory for their storage systems. For example, the manufacturing systems in factories must be able to tolerate severe vibration, which may cause damage to hard-disks. As a result, flash memory is a good choice for such systems. Flash memory is also suitable to portable devices, which have limited energy source from batteries, to lengthen their This paper is an extended version of a paper appeared in the 8th International Conference on Real-Time Computing Systems and Applications, 2002. Author’s Address: Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan, 106 Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. c 2004 ACM 1529-3785/2004/0700-0001 $5.00 ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004, Pages 1–26.
  • 2. 2 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo operating time. A complicated management method is needed for flash-memory storage systems. There are two major issues for the system implementation: the nature of flash memory in (1) write-once and bulk-erasing, and (2) the endurance problem. Be- cause flash memory is write-once, the flash memory management software could not overwrite existing data. Instead, the newer version of data will be written to available space elsewhere. The old version of the data is then invalidated and con- sidered as “dead”. The latest version of the data is considered as “live”. Flash memory Translation Layer (FTL) is introduced to emulate a block device for flash memory so that applications could have transparent access to data which might dynamically move around different locations. Bulk erasing, which involves signifi- cant live data copying, could be initiated when flash-memory storage systems have a large number of live and dead data mixed together. That is so called garbage collection with the intension in recycling the space occupied by dead data. Garbage collection directly over flash memory could result in performance overheads and, potentially, unpredictable run-time latency. It is the price paid for robustness considerations. Note that many industrial applications, such as industrial control systems and distributed automation systems, do require reliable storage systems for control data and system configurations that can survive over various kinds of power failures and operate in extreme environments (e.g., under severe vibration consid- erations). No matter what garbage collection or data-writing policies are adopted, the flash-memory storage system should consider the limit of possible erasings on each erasable unit of flash memory (the endurance problem). In the past work, various techniques were proposed to improve the performance of garbage collection for flash memory, e.g., [Kawaguchi et al. 1995; Kim and Lee 1999; Wu and Zwaenepoel 1994; Chiang et al. 1997]. In particular, Kawaguchi, et al. proposed the cost-benefit policy [Kawaguchi et al. 1995], which uses a value-driven heuristic function based on the cost and the benefit of recycling a specific block. 2u The policy picks up the block which has the largest value of (a ∗ 1−u ) to recycle, where u is the capacity utilization (percentage of fullness) of the block, and a is the time elapsed so far since the last data invalidation on the block. The cost-benefit policy avoids recycling a block which contains recently invalidated data because the policy surmises that more live data on the block might be invalidated soon. Chiang, et al. [Chiang et al. 1997] refined the above work by considering a fine-grained hot- cold identification mechanism. They proposed to keep track of the hotness of each LBA, where LBA stands for the Logical Block Address of a block device. The hotness of an LBA denotes how often the LBA is written. They proposed to avoid recycle a block which contains many live and hot data, because any copying of the live and hot data is usually considered inefficient. Chang and Kuo [Chang and Kuo 2002; 2004] investigated garbage collection issues over a novel striping architecture and large-scale flash-memory storage systems, in which the challenges faced are to maximize the parallelism and to reduce the management overheads, respectively. Wu et al. [Wu et al. 2003] proposed a native spatial index implementation over flash memory. Kim and Lee [Kim and Lee 1999] proposed to periodically move live data among blocks so that blocks have more even life times. Although researchers have proposed excellent garbage-collection policies, there ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 3. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 3 is little work done in providing a deterministic performance guarantee for flash- memory storage systems. For a time-critical system, such as manufacturing sys- tems, it is highly important to predict the number of free pages reclaimed after each block recycling so that the system will never be blocked for an unpredictable duration of time because of garbage collection. It has been shown that garbage col- lection could impose almost 40 seconds of blocking time on real-time tasks without proper management [Malik 2001b; 2001a]. This paper is motivated by the needs of a predictable block-recycling policy to provide real-time performance to many time-critical systems. We shall not only propose a block-recycling policy with guar- anteed performance but also resolve the endurance problem. A free-page replenish- ment mechanism is proposed for garbage collection to control the consumption rate of free pages so that bulk erasings only occur whenever necessary. Because of the real-time garbage collection support, each time-critical task could be guaranteed with a specified number of reads and/or writes within its period. Non-real-time tasks are also serviced with the objective to fully utilize the potential bandwidth of the flash-memory storage system. The design of the proposed mechanism is in- dependent of the implementation of flash memory management software and the adoption of real-time scheduling algorithms. We demonstrate the performance of the system in terms of a system prototype. The rest of this paper is organized as follows: Section 2 introduces the system architecture of a real-time flash memory storage system. Section 3 presents our real-time block-recycling policy, free-page replenishment mechanism, and the sup- ports for non-real-time tasks. We provide the admission control strategy and the justification of the proposed mechanism in Section 4. Section 5 summarizes the experimental results. Section 6 is the conclusion and future research. 2. SYSTEM ARCHITECTURE This section proposes the system architecture of a real-time flash-memory storage system, as shown in Figure 1. The system architecture of a typical flash-memory storage system is similar to that in Figure 1, except that a typical flash-memory storage system does not have or consider real-time tasks, a real-time scheduler, and real-time garbage collectors. We selected NAND flash to realize our storage system. There are two major architectures in flash memory design: NOR flash and NAND flash. NOR flash is a kind of EEPROM, and NAND flash is designed for data storage. We study NAND flash in this paper because it has a better price/capacity ratio, compared to NOR flash. A NAND flash memory chip is partitioned into blocks, where each block has a fixed number of pages, and each page is of a fixed size byte array. Due to the hardware architecture, data on a flash are written in a unit of one page, and the erase is performed in a unit of one block. A page can be either programmable or un-programmable, and any page initially is programmable. Programmable page are called “free pages”. A programmable page becomes un-programmable once it is written (programmed). A written page that contains live data is called a “live page”; it is called “dead page” if it contains dead (invalidated) data. A block erase will erase all of its pages and reset their status back to programmable (free). Furthermore, each block has an individual limit (theoretically equivalent ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 4. 4 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo NRT NRT task task RT RT RT RT RT Garbage Garbage Task Task Task Collector Collector Time-Sharing Scheduler Real-Time Scheduler fread, fwrite ioctl_copy File System ioctl_erase sector reads and writes Block Device Emulation (FTL) page reads, page writes, and block erases Flash Memory Driver control signals Flash Memory Fig. 1. System architecture. in the beginning) on the number of erasings. A worn-out block will suffer from frequent write errors. The block size of a typical NAND flash is 16KB, and the page size is 512B. The endurance of each block is usually 1,000,000 under the current technology. Over a real-time flash-memory storage system, user tasks might read or/and write the FTL-emulated block device through the file system1 , where the FTL (Flash Memory translation Layer) is to provide transparent accesses to flash mem- ory through block-device emulation. We propose to support real-time and reason- able services to real-time and non-real-time tasks through a real-time scheduler and a time-sharing scheduler, respectively. A real-time garbage collector is initiated for each real-time task which might write data to flash memory to reclaim free pages for the task. Real-time garbage collectors interact directly with the FTL though the special designed control services. The control services includes ioctl erase which performs the block erase, and ioctl copy which performs atomic copying. Each atomic copying, that consists of a page read then a page write, is designed to real- ize the live-page copying in garbage collection. In order to ensure the integrity of each atomic copying, its read-then-write operation is non-preemptible to prevent the possibility of any race condition. The garbage collection for non-real-time tasks 1 Note that there are several different approaches in managing flash memory for storage systems. In this paper, we focused on the block-device emulation approach. There are also alternative approaches which build native flash-memory file systems, we refer interested readers to [b:J ; ] for more details. Note that one of the major advantages of log-structured file systems [b:J ; ] is on robustness. Although the approach is very different from the block-device emulation approach for flash memory, it does provide robustness to storage systems to survive over power failures. Garbage collection might be needed to remove unnecessary logs (that correspond to dead pages for the block-device emulation approach). ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 5. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 5 Page Read Page Write Block Erase 512 bytes 512 bytes 16K bytes Performance (µs) 348 919 1,881 Symbol tr tw te Table I. Performance of nand flash memory. are handled inside the FTL. FTL must reclaim an enough number of free pages for non-real-time tasks, similar to typical flash-memory storage systems, so that rea- sonable performance is provided. Note that a write by any task invocation does not necessarily make the data written by previous invocations of the same task stale. Unless the same data are updated, data written by previous invocations remain valid. 3. REAL-TIME GARBAGE COLLECTION MECHANISM 3.1 Characteristics of Flash Memory Operations A typical flash memory chip supports three kinds of operations: Page read, page write, and block erase. The performance of the three operations measured on a real prototype is listed in Table I. Block erases take a much longer time, compared to others. Because the garbage collection facility in FTL might impose unbounded blocking time on tasks, we propose to adopt two FTL-supported control services to process real-time garbage collection: the block erase service (ioctl erase) and the atomic copying service (ioctl copy). The atomic copying operation copies data from one page to another by the specified page addresses, and the copy operation is non-preemptible to avoid any possibility of race conditions. 2 The main idea in this paper is to have a real-time garbage collector created for each real-time task to handle block erase requests (through ioctl erase) and live-page copyings (through ioctl copy) for real-time garbage collection. We shall illustrate the idea in more detailed in later sections. Flash memory is a programmed-I/O device. In other words, flash memory op- erations are very CPU-consuming. For example: To handle a page write request, the CPU first downloads the data to the flash memory, issues the address and the command, and then monitors the status of the flash memory until the command is completed. Page reads and block erases are handled in a similar fashion. As a result, the CPU is fully occupied during the execution of flash memory opera- tions. On the other hand, flash memory operations are non-preemptible since the operations can not be interleaved with one another. We can treat flash memory operations as non-preemptible portions in tasks. As shown in Table I, the longest non-preemptible operation among them is a block erase, which takes 1,881 µs to complete. 3.2 Real-Time Task Model Each periodic real-time task Ti is defined as a triple (cTi , pTi , wTi ), where cTi , pTi , and wTi denote the CPU requirements, the period, and the maximum number of 2 Some advanced NAND flash memory had a native support for the atomic copying. The overhead of the atomic copying is significantly reduced since the operation can be done internally in flash memory without the redundant data transfer between RAM and flash memory. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 6. 6 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo t t + pi W R W R W W Legend: A preempible CPU computation region. R A non-preemptible read request to flash memory. W A non-preemptible write request to flash memory. Fig. 2. A task Ti which reads and writes flash memory. its page writes per period, respectively. The CPU requirements ci consists of the CPU computation time and the flash memory operating time: Suppose that task T i wishes to use CPU for ccpu µs, read i pages from flash memory, and write j pages Ti to flash memory in each period. The CPU requirements cTi can be calculated as ccpu + i ∗ tr + j ∗ tw (tr and tw can be found in Table I). If Ti does not wish to write Ti to flash memory, it may set wTi = 0. Figure 2 illustrates that a periodic real-time task Ti issues one page read and two page writes in each period (note that the order of the read and writes does not need to be fixed in this paper). Any file-system access over a block device might not only access the corresponding files but also touch their meta-data at the same time, where the meta-data of a file contains the essential information of the file, e.g., the file owner, the access permissions, and some block pointers. For example, file accesses over an FAT file system might follow cluster chains (i.e., parts of the corresponding meta-data) in the FAT table, where file accesses over an UNIX file system are done over direct/indirect block pointers of the corresponding meta-data. Meta-data accesses could result in unpredictable latencies for their corresponding file-system accesses because the amount of the meta-data accessed for each file-system access sometimes could not be predicted precisely, such as those activities on the housekeeping of the entire device space. Molano, et al. [Molano et al. 1998] proposed a meta-data pre-fetch scheme which loads necessary meta-data into RAM to provide a more predictable behavior on meta-data accesses. In this paper, we assume that each real-time task has a predictable behavior in the accessing of the corresponding meta-data so that we can focus on the real-time support issue for flash-memory storage systems. For example, in a manufacturing system, a task T1 might periodically fetch the control codes from files, drive the mechanical gadgets, sample the reading from the sensors, and then update the status to some specified files. Suppose that c cpu = 1ms T1 and pT1 = 20ms. Let T1 wish to read a 2K-sized fragment from a data file, write 256 bytes of machine status back to a status file. Assume that the status file already exists, and we have one-page ancillary file-system meta-data to read and one-page meta-data to write. According to these information, task T1 can be defined as 512 256 256 (1000 + (1 + 2048 ) ∗ tr + (1 + 512 ) ∗ tw , 20 ∗ 1000, 1 + 512 ) = (4578, 20000, 2). Note that the time granularity is 1µs. As shown in the example, it is very straight- forward to describe a real-time task in our system. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 7. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 7 - 30 -3 -3 -3 -3 -3 T1 T2 wT1=30 wT2=3 +16 G1 +16 +16 G2 (a) (b) Fig. 3. Creation of the corresponding real-time garbage collectors. 3.3 Real-Time Garbage Collection This section is meant to propose a real-time garbage collection mechanism to pre- vent each real-time task from being blocked because of an insufficient number of free pages. We shall first propose the idea of real-time garbage collectors and the free-page replenishment mechanism for garbage collection. We will then present a block-recycling policy for the free-page replenishment mechanism to choose appro- priate blocks to recycle. 3.3.1 Real-Time Garbage Collectors. For each real-time task Ti which may write to flash memory (wTi > 0), we propose to create a corresponding real-time garbage collector Gi . Gi should reclaim and supply Ti with enough free pages. Let a constant α denote a lower-bound on the number of free pages that can be reclaimed for each block recycling (we will show that in Section 4.1). Note that the number of reclaimed free pages for a block recycling is identical to the number of dead pages on the block before the block recycling. Let π denote the number of pages per block. Given a real-time task Ti = (cTi , pTi , wTi ) with wTi > 0, the corresponding real-time garbage collector is created as follows: cGi = (π − α) ∗ (tr + tw ) + te + ccpu Gi w Ti . (1) pT i / α , if wTi > α p Gi = pT i ∗ α , otherwise w Ti Equation 1 is based on the assumption that Gi can reclaim at least α pages for Ti for every period pGi . The CPU demand cGi consists of at most (π − α) live-page copyings, a block erase, and computation requirements ccpu . All real-time garbage Gi collectors have the same worst-case CPU requirements since α is a constant lower- bound. Obviously, the estimation is based on a worst-case analysis, and G i might not consume the entire CPU requirements in each period. The period pGi is set under the guideline in supplying Ti with enough free pages. The length of its period depends on how fast Ti consumes free pages. We let Gi and Ti arrive the system at the same time. Figure 3 provides two examples for real-time garbage collectors, with the system parameter α = 16. In Figure 3.(a), because wT1 = 30, pG1 is set as one half of pT1 so that G1 can reclaim 32 free pages for T1 in each pT1 . As astute readers may point out, more free pages may be unnecessarily reclaimed. We will address this issue in the next section. In Figure 3.(b), because wT2 = 3, pG2 is set to five-times ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 8. 8 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo Symbol Description Value Λ the total number of live pages currently on flash ∆ the total number of dead pages currently on flash Φ the total number of free pages currently on flash π the number of pages in each block 32 Θ the total number of pages on flash Λ+∆+Φ the block size 32KB the page size 512B α the pre-defined lower bound of the number of reclaimed free pages after each block recycling ρ the total number of tokens in system ρf ree the number of unallocated tokens in system ρT i the number of tokens given to Ti ρ Gi the number of tokens given to Gi ρnr the number of tokens given to non-real-time tasks Table II. Symbol definitions. of pT2 so that 16 pages are reclaimed for T2 in each pG2 . The reclaimed free pages are enough for T1 and T2 to consume in each pT1 and pG2 , respectively. We define the meta-period σi of Ti and Gi as pTi if pTi ≥ pGi ; otherwise, σi is equal to pGi . In the above examples, σ1 = pT1 and σ2 = pG2 . The meta-period will be used in the later sections. 3.3.2 Free-Page Replenishment Mechanism. The free-page replenishment mech- anism proposed in this section is to resolve the over-reclaiming issue for free pages during real-time garbage collection. The over-reclaiming occurs because the real- time garbage collection is based on the worst-case assumption of the number of re- claimed free-pages per erasing. Furthermore, in many cases, real-time tasks might consume free pages slower than what they declared. A coordinating mechanism should be adopted to manage the free pages and their reclaiming more efficiently and flexibly. In this section, we propose a token-based mechanism to coordinate the free-page usage and reclaiming. In order not to let a real-time garbage collector reclaim too many unnecessary free pages, here a token-based “free-page replenishment mecha- nism” is presented: Consider a real-time task Ti = (cTi , pTi , wTi ) with wTi > 0. Initially, Ti is given (wi ∗ σi /pTi ) tokens, and one token is good for executing one page-write (Please see Section 3.3.1 for the definition of σi ). We require that each (real-time) task could not write any page if it does not own any token. Note that a token does not correspond to any specific free page in the system. Several counters of tokens are maintained: ρinit denotes the number of available tokens when system starts. ρ denotes the total number of tokens that are currently in system (regardless of whether they are allocated and not). ρf ree denotes the number of unallocated tokens currently in system, ρTi and ρGi denote the numbers of tokens currently given to task Ti and Gi , respectively. (Gi also needs tokens for live-page copyings, and it will be addressed later.) The symbols for token counters are summarized in Table II. Initially, Ti and Gi are given (wTi ∗ σi /pTi ) and (π − α) tokens, respectively. It ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 9. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 9 Give up extra tokens - 30 T1 Supply T1 with Supply T1 with 16 tokens 16 tokens G1 +16 +16 Create tokens Create tokens Give up extra Give up extra tokens tokens Fig. 4. The free-page replenishment mechanism. CW¡ P ¥ £ ¦¤¢  ¥£ ¡ § § ( Φ − ρ ) ≥α ¥¥¦£bV ¡ ¢¢ `¨ "# ¡ # ¡ ©@ ©¡   a   Y X 3 # 3  £ ¨ ©¡ £ ¨ ¥ § § %&#$¢¢©$©TA $¡ 1  !     2   0  4 H S 3  ©$)(¢ ¨ &$¢"¢¢  0  ' % #  !         "¢©5 ¨ ©# ¡ % ¡ 1 %  3  4   3  2   ( wTi *σ i ) ρ Gi = ρ Gi + α ; ρ = ρ + α ; x = ρTi − ; 6 pTi ¢  % 7 ρ Ti = ρ Ti − x ; ρ = ρ − x ; § 6 ¡ P ¨ fI¢¢ ¢¢c @ e d c c c c c c  ¢d ©A¢¢¢1 !  7 @   7  9   8  6 ¢C£ ¢"¢¢"¢¢ D ¥ !  7 B  7  9   Φ ,ρ ,ρGi ¢©$¢&F   0    E  ¢©$©$  3 #  0 6 %&#$¢!R ¡&Q5¡ PA97&4¢4¢HIG 0 α ρTi = ρTi +α ;ρGi = ρGi −α ;   ©"¢©H ¡ "¢U"T1 ¢  % #  !  7  V %    0  4 H S ¡ z = ρGi −( π −α );ρGi = ρGi − z;ρ = ρ − z; 6 Fig. 5. The algorithm of the free-page replenishment mechanism. is to prevent Ti and Gi from being blocked in their first meta-period. During their executions, Ti and Gi are more like a pair of consumer and producer for tokens. Gi creates and provides tokens to Ti in each of its period pGi because of its reclaimed free pages in garbage collection. The replenishment of tokens are enough for T i to write pages in each of its meta-period σi . When Ti consumes a token, both ρTi and ρ are decreased by one. When Gi reclaim a free page and, thus, create a token, ρGi and ρ are both increased by one. Basically, by the end of each period (of Gi ), Gi provides Ti the created tokens in the period. When Ti terminates, its tokens (and those of Gi ) must be returned to the system. The above replenishment mechanism has some problems: First, real-time tasks might consume free pages slower than what they declared. Secondly, the real-time garbage collectors might reclaim too many free pages than what we estimate in the worst case. Since Gi always replenish Ti with all its created tokens, tokens would gradually accumulate at Ti . The other problem is that Gi might unnecessarily reclaim free pages even though there are already sufficient free pages in the system. Here we propose to refine the basic replenishment mechanism as follows: ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 10. 10 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo In the beginning of each meta-period σi , Ti gives up ρTi − (wTi ∗ σi /pTi ) tokens and decreases ρTi and ρ by the same number because those tokens are beyond the needs of Ti . In the beginning of each period of Gi , Gi also checks up if (Φ − ρ) ≥ α, where a variable Φ denotes the total number of free pages currently in the system. If the condition holds, then Gi takes α free pages from the system, and ρGi and ρ is incremented by α, instead of actually performing a block recycling. Otherwise (i.e., (Φ − ρ) < α), Gi initiates a block recycling to reclaim free pages and create tokens. Suppose that Gi now has y tokens (regardless of whether they are done by a block erasing or a gift from the system) and then gives α tokens to Ti . Gi might give up y − α − (π − α) = y − π tokens because they are beyond its needs, where (π − α) is the number of pages needed for live-page copyings (done by Gi ). Figure 4 illustrates the consuming and supplying of tokens between Ti and Gi (based on the example in Figure 3.(a)), and Figure 5 provides the algorithm of the free page replenishment mechanism. We shall justify that Gi and Ti always have free pages to write or perform live-page-copyings in Section 4.4. All symbols used in this section is summarized in Table II. 3.3.3 A Block-Recycling Policy. A block-recycling policy should make a deci- sion on which blocks should be erased during garbage collection. The previous two sections propose the idea of real-time garbage collectors and the free-page replen- ishment mechanism for garbage collection. The only thing missing is a policy for the free-page replenishment mechanism to choose appropriate blocks for recycling. The block-recycling policies proposed in the past work [Kawaguchi et al. 1995; Kim and Lee 1999; Wu and Zwaenepoel 1994; Chiang et al. 1997] usually adopted sophisticated heuristics with objectives in low garbage collection overheads and a longer overall flash memory lifetime. With unpredictable performance on garbage collection, these block-recycling policies could not be used for time-critical appli- cations. The purpose of this section is to propose a greedy policy that delivers a predictable performance on garbage collection: We propose to recycle a block which has the largest number of dead pages with the objective in predicting the worst-case performance. Obviously the worst case in the number of free-page reclaiming happens when all dead pages are evenly distributed among all blocks in the system. The number of free pages reclaimed in the worst case after a block recycling can be denoted as: ∆ π∗ , (2) Θ where π, ∆, and Θ denote the number of pages per block, the total number of dead pages on flash, and the total number of pages on flash, respectively. We refer readers to Table II for the definitions of all symbols used in this paper. Formula 2 denotes the worst case performance of the greedy policy, which is proportional to the ratio of the numbers of dead pages and pages on flash mem- ory. That is an interesting observation because we can not obtain a high garbage collection performance by simply increasing the flash memory capacity. We must emphasize that π and Θ are constant in a system. A proper greedy policy which properly manages ∆ would result in a better low bound on the number of free pages reclaimed in a block recycling. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 11. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 11 Because Θ = ∆ + Φ + Λ. Equation 2 could be re-written as follows: Θ−Φ−Λ π∗ , (3) Θ where π and Θ are constants, and Φ and Λ denote the numbers of free pages and live pages, respectively, when the block-recycling policy is activated. As shown in Equation 3, the worst case performance of the block-recycling policy can be controlled by bounding Φ and Λ: Because it is not necessary to perform garbage collection if there are already sufficient free pages in system. The block-recycling policy could be activated only when the number of free pages is less than a threshold value (a bound for Φ). For the rest of this section, we shall show the relationship between the utilization of flash memory and the minimum number of free pages obtained for each recycle of a block based on Equation 3: Suppose that we have a 64MB NAND flash memory, and live data occupy 48MB. The rest 16MB are occupied by dead or free pages. Let each block consist of 32 pages. The worst-case performance of the greedy policy is 32 ∗ 131,072−100−98,304 = 8, where 131, 072 and 98, 304 are the numbers of 512B 131,072 pages for 48MB and 64MB, respectively. As shown in Equation 3, the worst case performance of the block-recycling policy can be controlled by bounding Φ and Λ. The major challenge in guaranteeing the worst-case performance is how to prop- erly set a bound for Φ for each block recycling since the number of free pages in system might grow and shrink from time to time3 . We shall further discuss this issue in Section 4.1. As astute readers may point out, the above greedy policy does not consider wear-levelling because the consideration of wear-levelling could result in an unpredictable behavior in block recycling. We should address this issue in the next section. 3.4 Supports for Non-Real-Time Tasks The objective of the system design aims at the simultaneous supports for both real-time and non-real-time tasks. In this section, we shall extend the token-based free-page replenishment mechanism in the previous section to supply free pages for non-real-time tasks. A non-real-time wear-leveller will then be proposed to resolve the endurance problem. 3.4.1 Free-Page Replenishment for Non-Real-Time Tasks. We propose to ex- tend the free-page replenishment mechanism to support non-real-time tasks as fol- lows: Let all non-real-time tasks be first scheduled by a time-sharing scheduler. When a non-real-time task is scheduled by the time-sharing scheduler, it will exe- cute as a background task under the real-time scheduler (Please see Figure 1). Different from real-time tasks, let all non-real-time tasks share a collection of tokens, denoted by ρnr . π tokens are given to non-real-time tasks (ρnr = π) initially because FTL might need up to π tokens to recycle a block full of live pages (e.g., due to wear-levelling considerations). Note that non-real-time tasks depends on FTL for garbage collection. Before a non-real-time task issues a page write, it should check up if ρnr > π. If the condition holds, the page write is executed, 3 The capacity of a single NAND flash memory chip had grown to 256MB when we wrote this paper. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 12. 12 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo and one token is consumed (ρnr and ρ are decreased by one). Otherwise (i.e., ρnr ≤ π), the system must replenish itself with tokens for non-real-time tasks. The token creation for non-real-time tasks is similar to the strategy adopted by real-time garbage collectors: If Φ ≤ ρ, a block recycling is initiated to reclaim free pages, and tokens are created. If Φ > ρ, then there might not be any needs for any block recycling. 3.4.2 A Non-Real-Time Wear-Leveller. In the past work, researchers tend to re- solve the wear-levelling issue in the block-recycling policy. A typical wear-levelling- aware block-recycling policy might sometimes recycle the block which has the least number of erasing, regardless of how many free pages can be reclaimed. This approach is not suitable to a real-time application, where predictability is an im- portant issue. We propose to use non-real-time tasks for wear-levelling and separate the wear- levelling policy from the block-recycling policy. We could create a non-real-time wear-leveller, which sequentially scans a given number of blocks to see if any block has a relatively small erase count, where the erase count of a block denotes the number of erases that has been performed on the block so far. We say that a block has a relatively small erase count if the count is less than the average by a given number, e.g., 2. When the wear-leveller finds a block with a relatively small erase count, the wear-leveller first copy live pages from the block to some other blocks and then sleeps for a while. As the wear-leveller repeats the scanning and live- page copying, dead pages on the blocks with relatively small erase counts would gradually increase. As a result, those blocks will be selected by the block-recycling policy sooner or later. Note that one of the main reasons why a block has a relatively small erase count is because the block contains many live-cold data (which had not been invali- dated/updated for a long period of time). The wear-levelling is done by ”defrosting” blocks with relatively small erase counts by moving cold data away. The capability of the proposed wear-leveller is evaluated in Section 5.3. 4. SYSTEM ANALYSIS In the previous sections, we had proposed the idea of the real-time garbage col- lectors, the free page replenishment mechanism, and the greedy block-recycling policy. The purpose of this section is to provide a sufficient condition to guarantee the worst-case performance of the block-recycling policy. As a result, we can justify the performance of the free-page replenishment mechanism. An admission control strategy is also introduced. 4.1 The Performance of the Block-Recycling Policy The performance guarantees of real-time garbage collectors and the free-page re- plenishment mechanism are based on a constant α, i.e., a lower-bound on the num- ber of free pages that can be reclaimed for each block recycling. As shown in Section 3.3.3, it is not trivial in guaranteeing the performance of the block-recycling policy since the number of free pages on flash, i.e., Φ, may grow or shrink from time to time. In this section, we will derive the relationship between α and Λ (the number of live pages on the flash) as a sufficient condition for engineers to guarantee the ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 13. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 13 wT 1 = 30, ρ T 1 = 30 w xv h ig ρ G 1 = 16 ¢ir t s &Wr € y h qp wT 2 = 3, ρ T 2 = 15 w Wv w xv w Wv w xv w xv u ¤g ρ G 2 = 16 &Wr € y u p  Fig. 6. An instant of having the maximum number of tokens in system (based on the example in Figure 3.) performance of the block-recycling policy for a specified value of α. As indicated by Equation 2, the system must satisfy the following condition: ∆ ∗ π ≥ α. (4) Θ Since ∆ = (Θ − Λ − Φ), and ( x ≥ y) implies (x > y − 1) (y is an integer), Equation 4 can be re-written as follows: (α − 1) Φ < (1 − ) ∗ Θ − Λ. (5) π The formula shows that the number of free pages, i.e., Φ, must be controlled under Equation 5 if the lower-bound α on the number of free pages reclaimed for each block recycling has to be guaranteed (under the utilization of the flash, i.e., Λ). Because real-time garbage collectors would initiate block recyclings only if Φ − ρ < α (please see Section 3.3.2), the largest possible value of Φ on each block recycling is (α − 1) + ρ. Let Φ = (α − 1) + ρ, Equation 5 can be again re-written as follows: (α − 1) ρ < (1 − ) ∗ Θ − Λ − α + 1. (6) π Note that given a specified bound α and the (even conservatively estimated) utilization Λ of the flash, the total number of tokens ρ in system should be controlled under Equation 6. In the next paragraphs, we will first derive ρmax which is the largest possible number of tokens in the system under the proposed garbage collection mechanism. Because ρmax should also satisfy Equation 6, we will derive the relationship between α and Λ (the number of live pages on the flash) as a sufficient condition for engineers to guarantee the performance of the block-recycling policy for a specified value of α. Note that ρ may go up and down, as shown in Section 3.3.1. We shall consider the case which has the maximum number of tokens accu- mulated: The case occurs when non-real-time garbage collection recycled a block ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 14. 14 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo without any live-page copying. As a result, non-real-time tasks could hold up to 2 ∗ π tokens, where π tokens are reclaimed by the block recycling, and the other π tokens are reserved for live-page copying. Now consider real-time garbage col- lection: Within a meta-period σi of Ti (and Gi ), assume that Ti consumes none of its reserved (wTi ∗ σi /pTi ) tokens. On the other hand, Gi replenishes Ti with (α ∗ σi /pGi ) tokens. In the best case, Gi recycles a block without any live-page copying in the last period of Gi within the meta-period. As a result, Ti could hold up to (wTi ∗σi /pTi )+(α∗σi /pGi ) tokens, and Gi could hold up to 2∗(π −α) tokens, where (π − α) tokens are the extra tokens created, and the other (π − α) tokens are reserved for live page copying. The system will have the maximum number of tokens accumulated when all real-time tasks Ti and their corresponding real-time garbage collectors Gi behave in the same way as described above, and all of the meta-periods end at the same time point. Figure 6 illustrates the above example based on the task set in Figure 3. The largest potential number of tokens in system ρmax could be as follows, where ρf ree is the number of un-allocated tokens n wT i ∗ σi α ∗ σi ρmax = ρf ree + 2π + + + 2(π − α) . (7) i=1 pT i p Gi When ρ = ρmax , we have the following equation by combining Equation 6 and Equation 7: n wT i ∗ σi α ∗ σi (α − 1) ρf ree +2π + + + 2(π − α) < (1− )∗Θ−Λ−α+1 (8) i=1 pT i p Gi π Equation 8 shows the relationship between α and Λ. We must emphasize that this equation serves as as a sufficient condition for engineers to guarantee the perfor- mance of the block-recycling policy for a specified value of α, provided the number of live pages in the system, i.e., the utilization of the flash Λ. We shall address the admission control of real-time tasks in the next section, based on α. 4.2 Initial System Configuration In the previous section, we derived a sufficient condition to verify whether the promised block-recycling policy performance (i.e., α) could be delivered. How- ever, the setting of α highly depends on the physical parameters of the selected flash memory and the target capacity utilization. In this section, we shall provide guidelines for the initial system configuration of the proposed flash-memory storage systems: the values of α and ρinit (i.e., the number of initially available tokens). According to Equation 4, the value of α could be set in-between (0, π ∗ ∆ ]. Θ The objective is to set α as large as possible, because a large value for α means a better block-recycling performance. With a better block-recycling performance, the system could accept tasks which are highly demanding on page writes. For example, if the capacity utilization is 50%, we could set the value of α as 32 ∗ 0.5 = 16. ρinit could be set as follows: A large value for ρinit could help the system in accepting more tasks because tokens must be provided for real-time tasks and real-time garbage collectors on their arrivals. Equation 6 in Section 4.1 suggests an upper-bound of ρ, which is the total number of tokens in the current system. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 15. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 15 Obviously, ρ = ρinit when there is no task in system, and ρinit must satisfy Equation 6. We have shown that ρmax (which is derived from Equation 7) must satisfy Equation 6 because the value of ρ might grow up to ρmax . It becomes interesting if we could find a relationship between ρinit and ρmax , so that we could derive an upper-bound for ρinit based on Equation 6. First, Equation 7 could be re-written as follows: n wT i ∗ σi α ∗ σi ρmax = ρf ree + 2π + ( + (π − α)) + ( + (π − α)) . (9) i=1 pT i p Gi Obviously the largest possible value of ρmax occurs when all available tokens are given to real-time tasks and real-time garbage collectors. In other words, we w ∗σ only need to consider when ρf ree = 0. Additionally, we have ρTi = ( TiT i ), p i w ρGi = (π−α), and ( α∗σi ) ≥ ( TiT i ) by the creation rules in Equation 1. Therefore, ∗σ p Gi p i Equation 9 can be further re-written as follows: n ρmax ≥ 2π + 2 (ρTi + ρGi ). (10) i=1 n Since we let ρf ree = 0, we have ρinit = i=1 (ρTi + ρGi ). The relationship between ρmax and ρinit could be represented as follows: ρmax ≥ 2π + 2 ∗ ρinit . (11) This equation reveals the relationship between ρmax and ρinit , and the equation is irrelevant to the characteristics (e.g., their periods) of real-time tasks and real-time garbage collectors in system. In other words, we could bound the value of ρ max by managing the value of ρinit . When we consider both Equation 6 and Equation 11, we have the following equation: (α−1) ((1 − π ) ∗ Θ − Λ − α + 1 − 2π) ρinit < . (12) 2 Equation 12 suggests an upper-bound for ρinit . If ρinit is always under the bound, then the total number of tokens in system will be bounded by Equation 6 (because ρmax is also bounded). As a result, we don’t need to verify Equation 8 for every arrival of any new real-time tasks and real-time garbage collectors. We only need to ensure that Equation 12 holds when the system starts. That further simplifies the admission control procedure to be proposed in the next section. 4.3 Admission Control The admission control of a real-time task must consider its resource requirements and the impacts on other tasks. The purpose of this section is to provide the procedures for admission control: Given a set of real-time tasks {T1 , T2 , ..., Tn } and their corresponding real-time garbage collectors {G1 , G2 , ..., Gn }, let cTi , pTi , and wTi denote the CPU require- ments, the period, and the maximum number of page writes per period of task T i , ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 16. 16 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo respectively. Suppose that cGi = pGi = 0 when wTi = 0 (it is because no real-time garbage collector is needed for Ti ). Since the focus of this paper is on flash-memory storage systems, the admission control of the entire real-time task set must consider whether the system could give enough tokens to all tasks initially. The verification could be done by evaluating the following formula: n wT i ∗ σi + (π − α) ≤ ρf ree . (13) i=1 pT i Note that ρf ree = ρinit when the system starts (and there are no tasks in system) , where ρinit is the number of initially reserved tokens. The evaluation of the above test could be done in a linear time. Beside the test, engineers are supposed to verify the relationship between α and Λ by means of Equation 8, which serves as a sufficient condition to guarantee α. However, in the previous section, we have shown that if the value of ρinit is initially configured to satisfy Equation 12, the relation in Equation 8 holds automatically. As a result, it is not necessary to verify Equation 8 at each arrival of any real-time task. Other than the above tests, engineers should verify the schedulability of real- time tasks in terms of CPU utilization. Suppose that the earliest deadline first algorithm (EDF) [Liu and Layland 1973] is adopted to schedule all real-time tasks and their corresponding real-time garbage collectors. Since all flash memory opera- tions are non-preemptive, and block erasing is the most time-consuming operation, the schedulability of the real-time task set can be verified by the following formula, provided other system overheads could be ignorable: n te cTi cG + + i ≤ 1, (14) min p() i=1 pT i p Gi where min p() denotes the minimum period of all Ti and Gi , and te is the duration of a block erase. Equation 14 is correct because [Liu and Layland 1973] shows that all independent tasks scheduled by EDF are schedulable if their total CPU n c c t utilization (i.e., i=1 pTi + pGi ) is not over 100%, and mine p() denotes the ratio Ti Gi of CPU utilization sacrificed due to the blocking of non-preemptive operations [Baker 1990]. It is because a real-time task might be blocked by a block erase in the worst case. The maximum blocking factor is contributed by the task which has t the shortest period, i.e., mine p() . Equation 14 could be derived in a similar way as that for Theorem 2 in [Baker 1990]. We must emphasize that the above formula only intends to deliver the idea of blocking time, due to flash-memory operations. Note that there exists a relationship between the utilization of flash memory and the minimum number of free pages obtained for each recycle of a block. If we want to guarantee the minimum number of available pages recovered from the recycle of a block, then we could set a bound on the utilization of flash memory. The minimum number of available pages recovered from the recycle of a block could have an impact on the admission control of real-time tasks (i.e., Equations 13 and 14) because a smaller number of available pages obtained for each block recycling ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 17. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 17 would result in a larger number of block recycles in order to obtain enough free pages for tasks. 4.4 The Performance Justification of the Free-Page Replenishment Mechanism The purpose of this section is to justify that all real-time tasks always have enough tokens to run under the free-page replenishment mechanism, and no real-time garbage collector will be blocked because of an insufficient number of free pages. Because each real-time task Ti consumes no more than (wTi ∗ σi /pTi ) tokens in a meta-period, the tokens initially reserved are enough for its first meta-period. Because the corresponding real-time garbage collector Gi replenishes Ti with (α ∗ σi /pGi ) tokens in each subsequent meta-period, the number of replenished tokens is larger than or equal to the number that Ti needs, according to Equation 1. We conclude that all real-time tasks always have enough tokens to run under the free-page replenishment mechanism. Because real-time garbage collectors also need to consume tokens for live-page copying, we shall justify that no real-time garbage collector will be blocked forever because of an insufficient number of free pages. Initially, (π − α) tokens are given to each real-time garbage collector Gi . It is enough for the first block recycling of Gi . Suppose that Gi decides to erase a block which has x dead pages (note that x ≥ α). The rest (π − x) pages in the block could be either live pages and/or free pages. For each non-dead page in the block, Gi might need to consume a token to handle the page, regardless of whether the page is live or free. After the copying of live pages, a block erase is executed to wipe out all pages on the block, and π tokens (free pages) are created. The π created tokens could be used by Gi as follows: Gi replenishes itself with (π − x) tokens, replenishes Ti with α tokens, and finally gives up the residual tokens. Because x ≥ α, we have (π − x) + α ≤ π so Gi will not be blocked because of an insufficient number of free pages. 5. PERFORMANCE EVALUATION 5.1 Simulation Workloads 5.1.1 Overview. The real-time garbage collection mechanism proposed in Sec- tion 3 provides a deterministic performance guarantee to real-time tasks. The mechanism also targets at simultaneous services to both real-time and non-real- time tasks. The purpose of this section is to show the usefulness and effectiveness of the proposed mechanism. We compared the behaviors of a system prototype with or without the proposed real-time garbage collection mechanism. We also show the effectiveness of the proposed mechanism in the wear-levelling service, which is ex- ecuted as a non-real-time service. A series of experiments were done over a real system prototype. A set of tasks was created to emulate the workload of a manufacturing system, and all files were Task ccpu page page c p c/p reads writes (w) T1 3ms 4 2 6,354 µs 20ms 0.3177 T2 5ms 2 5 8,738 µs 200ms 0.0437 Table III. The basic simulation workload ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 18. 18 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo stored on flash memory to avoid vibration from damaging the system: The basic workload consisted of two real-time tasks T1 and T2 and one non-real-time task. T1 and T2 sequentially read the control files, did some computations, and updated their own (small) status files. The non-real-time task emulated a file downloader, which downloaded the control files continuously over a local-area-network and then wrote the downloaded file contents onto flash memory. The flash memory used in the experiments was a 16MB NAND flash [b:N ]. The block size was 16KB, and the page size was 512B (π = 32). The traces of T1 , T2 , and the non-real-time file downloader were synthesized to emulate the necessary requests for the file system. The basic simulation workload is detailed in Table III. We varied the capacity utilization of flash memory in some experiments to observe the system behavior under different capacity utilizations. For example, a 50% capacity utilization stand for the case that one half of the total pages were live pages. An 8MB block device was emulated by a 16MB flash memory to generate a 50% capacity utilization. Before each experiment started, the emulated block device was written so that most of pages on the flash memory were live pages. The duration of each experiment was 20 minutes. 5.1.2 System Configuration and Admission Control for Simulation Experiments. In this section, we used the basic simulation workload to explain how to configure the system and how to perform the admission control: In order to configure the system, system engineers need to first decide the ex- pected block-recycling performance α. According to Equation 4, the value of α could be set as a value between (0, π ∗ ∆ ]. Under a 50% capacity utilization, Θ the value of α could be set as 32 ∗ 0.5 = 16, since a large value is preferred. Secondly, we use Equation 12 to configure the number of tokens initially available in system (i.e., ρinit ). In this example, we have α = 16, π = 32, Λ = 16384, and Θ = 32768. As a result, we have ρinit < 472.5. Intuitively, ρinit should be set as large as possible since we must give tokens to tasks at their arrival, and a larger value of ρinit might help system to accept more tasks. However, if we know the token requirements of all tasks in a priori, a smaller ρinit could help in reducing the possibility to recycle a block which still have free pages inside. In this example, we set ρinit = 256, which is enough for the basic simulation workload. Once α and ρinit are determined, the corresponding real-time garbage collectors G1 and G2 could be created under the creation rules (Equation 1). The CPU requirements, the flash memory requirements, and the periods of G1 and G2 were shown in Table IV. The admission control procedure is as follows: First we verify if the system has enough tokens for the real-time garbage collection mechanism under Equation 13: The token requirements of T1 , T2 , G1 , G2 , and the non-real-time garbage collection are 16, 15, 16, 16, and 32, respectively. Because ρinit = 256, the verification can be verified by the formula (15+16+16+16+32) < 256 (Equation 13). Since the system satisfies Equation 12, it implies the condition in Equation 8 also holds, and we need not to check it again. Secondly, the CPU utilization is verified by Equation 14: The CPU utilization of each real-time task is shown in Table IV, and the verification is verified by the formula (0.3115 + 0.0514 + 0.1267 + 0.0338 + 1.881 ) = 0.6176 ≤ 1 20 (Equation 14). ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 19. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 19 5.2 The Evaluation of the Real-time Garbage Collection Mechanism This section is meant to evaluate the proposed real-time garbage collection mech- anism. We shall show that a non-real-time garbage collection mechanism might impose an unpredictable blocking time on time-critical applications, and the pro- posed real-time garbage collection mechanism could prevent real-time tasks from being blocked due to an insufficient number of tokens. We evaluated a system which adopted a non-real-time garbage collection mech- anism [Kawaguchi et al. 1995] under the basic workload and a 50% flash memory capacity utilization. We compared our real-time garbage collection mechanism with the well-known cost − benef it block-recycling policy [Kawaguchi et al. 1995] in the non-real-time garbage collection mechanism. The non-real-time garbage collection mechanism was configured as follows: A page write could only be processed if the total number of free pages on flash was more than 100. Note that any number larger than 32, i.e., the number of pages in a block of the system prototype (as described in Section 3.4.1), was acceptable. 100 was chosen because it was reason- able for the experiments. If there were not enough free pages, garbage collection was activated to reclaim free pages. The cost-benefit block-recycling policy picked up a block which had the largest value of a ∗ 1−u , as summarized in Section 1. 2u But in order to perform wear-levelling, the block-recycling policy might recycle a block which had an erase count less than the average erase count by 2, regardless how many free pages could be reclaimed. Since garbage collection was activated on demand, the garbage collection activities directly imposed blocking time on page writes. We defined that an instance of garbage collection consisted of all of the activities which started when the garbage collection was invoked and ended when the garbage collection returned control to the blocked page write. An instance of garbage collection might consist of recycling several blocks consecutively because the non-real-time garbage collection policy might recycle a block without dead pages, due to wear-levelling. We measured the blocking time imposed on a page write by each instance of garbage collection, because a non-real-time garbage collection instance blocked the whole system until the garbage collection instance finished. Figure 7 shows the blocking times of page writes, where the X-axis denotes the number of garbage collection instances invoked so far in a unit of 1,000, and the Y-axis denotes the blocking time in ms. The results show that the non-real-time garbage collection activities indeed imposed a lengthy and unpredictable blocking time on page writes. The blocking time could even reach 21.209 seconds in the worst case. The lengthy blocking time was mainly caused by wear-levelling. The block-recycling policy might recycle a block which had a low erase count, regardless how many dead pages were on the block. As a result, the block-recycling policy might consecutively recycle many blocks until at least one free page was reclaimed. Although the maximum blocking time could be reduced by decreasing the wear-leveller activities or performing garbage collection in an opportunistic fashion, the non-real-time garbage collection did fail in providing a deterministic service to time-critical tasks. The next experiment intended to observe if the proposed real-time garbage col- lection mechanism could prevent each real-time task from being blocked, due to an insufficient number of tokens. To compare with the non-real-time garbage collec- ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 20. 20 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo 100,000 bxe©– $I– g f d™ ˜ — 10,000 ”• 1,000 ‘’ “ h ‘‰  ‡ 20 h 100 ˆ †‡ … ‚ƒ„ 10 10 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 0 1 2 3 Number 5 Garbage Collection Instances Invoked10 Far (K) 4 of 6 7 8 9 So 11 12 13 14 15 Number of garbage collect ion inst ances invoked so far (K) Fig. 7. The blocking time imposed on each page write by a non-real-time garbage collection instance. tion mechanism, a system adopted the real-time garbage collection mechanism was evaluated under the same basic configurations. Additionally, according to the basic workload, the real-time garbage collection mechanism created two (corresponding) real-time garbage collectors and a non-real-time wear-leveller. In particular, the non-real-time wear-leveller performed live-page copyings whenever a block had an erase count less than the average erase count by 2. The wear-leveller slept for 50ms between every two consecutive live-page copyings. The workload is summarized in Table IV. Note that the garbage collection activities (real-time and non-real-time) didn’t directly block a real-time task’s page write request. A page write request of a real-time task would be blocked only if its corresponding real-time garbage collec- tor did not supply it with enough tokens. Distinct from the non-real-time garbage collection experiment, we measured the blocking time of each real-time task’s page write request to observe if any blocking time ever existed. The results are shown in Figure 8, where the X-axis denotes the number of page writes processed so far in a unit of 1,000, and the Y-axis denotes the blocking time imposed on each page write (of real-time tasks) by waiting for tokens. Note that T1 and T2 generated 12,000 and 3,000 page writes respectively in the 20-minutes experiment. Figure 8 shows that the proposed real-time garbage collection mechanism successfully prevented T1 and T2 from waiting for tokens, as expected. Task ccpu page page c p c/p reads writes (w) T1 3ms 4 2 6,23 µs 20ms 0.3115 T2 5ms 2 5 10,291 µs 200ms 0.0514 G1 10 µs 16 16 20,282 µs 160ms 0.1267 G2 10 µs 16 16 20,282 µs 600ms 0.0338 Table IV. The simulation workload for evaluating the real-time garbage collection mechanism . (α = 16, capacity utilization = 50%) ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 21. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 21 1.00 Blocking time (ms) T1 T2 0.00 0 20 40 60 80 100 120 Page writes processed so far (K) Fig. 8. The blocking time imposed on each page write of a real-time task by waiting for tokens (under the real-time garbage collection mechanism). 5.3 Effectiveness of the Wear-Levelling Method The second part of experiments evaluated the effectiveness of the proposed wear- levelling policy. A good wear-levelling policy should keep an even distribution of erase cycle counts for blocks. It was because when some blocks of flash memory was worn out, flash memory would start to malfunction. The effectiveness of a wear-levelling policy could be evaluated in terms of the standard deviation of the erase counts of all blocks and the earliest appearance time of worn-out blocks. The non-real-time wear-leveller was configured as follows: The non-real-time wear-leveller performed live-page copyings whenever a block had an erase count less than the average erase count by 2. The wear-leveller slept for 50ms between every two consecutive live-page copyings. Since the past work showed that the over- heads of garbage collection highly depend on the flash-memory capacity utilization [Kawaguchi et al. 1995; Wu and Zwaenepoel 1994; Chiang et al. 1997; Douglis et al. 1994], we evaluated the experiments under different capacity utilizations: 50%, 60% and 70%. System parameters under each capacity utilization are summarized in Table V. Additionally, we disabled the wear-leveller to observe the usage of blocks without wear-levelling. We also compared the effectiveness of the proposed wear- levelling method with the non-real-time garbage collection mechanism. The results are shown in Figure 10, where the X-axis denotes that the number of erases had been performed so far in a unit of 1,000, and the Y-axis denotes the value of the standard deviation. Figure 10(a) and 10(b) show the effectiveness of the proposed wear-levelling method and the wear-levelling method in the non-real-time 50% 55% 60% 65% 70% 75% ρinit 256 256 256 256 256 256 α 16 15 13 12 10 8 ρT 1 16 14 12 12 10 8 ρT 2 15 15 10 10 10 5 ρ G1 16 17 19 20 22 24 ρ G2 16 17 19 20 22 24 ρnr 32 32 32 32 32 32 ρmax 320 322 326 328 332 336 Table V. System parameters under different capacity utilizations. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 22. 22 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo 100% Percentage of worn-out blocks 90% 80% 70% 60% 50% 40% 30% RTGC, 50% 20% NRTGC, 50% 10% RTGC, 50%, no wear-levelling 0% 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 Elapsed time (minutes) Fig. 9. The percentage of worn-out blocks. garbage collection mechanism. In Figure 10(a), the proposed wear-levelling method gradually levelled the erase count of each block. The results also pointed out that the effectiveness of wear-levelling was better when the capacity utilization was low, since the system was not heavily loaded by the overheads of garbage collection. When the system was heavily loaded because of the frequent activities of garbage collection, the real-time tasks and the real-time garbage collectors would utilize most of the available bandwidth, so that the effectiveness of the non-real-time wear- levelling degraded accordingly. The experiment which disabled the wear-leveller was evaluated under a 50% capacity utilization. The results showed that the standard deviation increased very rapidly: Some blocks were constantly erased while some blocks were not ever erased. The frequently erased blocks would be worn-out very quickly, thus the overall life-time of flash memory was severely reduced. In Figure 10(b), we observed that the standard deviation under the non-real-time garbage collection mechanism was stringently controlled, since the non-real-time garbage collection mechanism performed wear-levelling without the consideration of timing constraint. Figure 9 shows the percentage of worn-out blocks in the experiments. We set the limit on wear-levelling as 200 in the experiments to compare best-effort and real-time memory management methods. We did not use one million as the wear- levelling limit because it could take years to complete the experiments. Based on our experiments, 200 seemed a reasonable number for the wear-levelling limit because similar results were observed for larger numbers. The X-axis of Figure 9 denotes the amount of time elapsed in the experiments, and the Y-axis denotes the percentage of blocks that reached the wear-levelling limit. It was observed that over 50% of blocks quickly reached the wear-levelling limit under the real-time garbage collection method without a wear-levelling policy. However, when a wear-levelling policy was adopted, the performance of the real-time garbage collection method on wear-levelling control was close to that of the best effort method (labelled as NRTGC). Note that a good wear-levelling policy tends to keep an even distribution of erase cycle counts for blocks. When some blocks of flash memory was worn out, flash memory would start to malfunction. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 23. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 23 25 25 50% 50% 20 60% 20 60% Standard deviation of erase Standard deviation of erase 70% 70% counts of all blocks counts of all blocks 15 50%, no wear-levelling 15 10 10 5 5 0 0 0 2 4 6 8 10 12 14 16 18 20 22 24 0 2 4 6 8 10 12 14 16 18 20 22 24 Number of erases processed so far (K) Number of erases processed so far (K) (a) The real-time garbage collection mechanism (b) The non-real-time garbage collection mechanism Fig. 10. Effectiveness of the wear-levelling method. 450 Trace-Driven (RT) 400 Random (RT) 350 Write throughput (KB/S) 300 Sequential (RT) 250 Trace-Driven (RT) 200 (Hot-Cold) 150 Trace-Driven (NRT) 100 Random (NRT) 50 0 Sequential (NRT) 50% 55% 60% 65% 70% 75% Capacity utilization Fig. 11. The overall write throughput under different access patterns and capacity utilizations. 5.4 Remark: Overheads A good real-time garbage collection mechanism should deliver a reasonably good performance without significant overheads. This section is to compare the over- heads of the proposed mechanism with that of the typical non-real-time approach (described in previous sections) in terms of their performance degradation (under garbage collection). The performance metric was the overall write throughput, which was contributed by the real-time tasks and the non-real-time task. The basic simulation workload in Table III was used in this part of experiments. We shall not consider the writes generated internally by garbage collection and wear-levelling in the performance metric because they were considered as overheads, instead of parts of the perfor- mance index. In this part of experiments, we varied the capacity utilization from 50% to 75%. Additionally, we evaluated two more access patterns: sequential access patterns and random access patterns. The flash-memory block device was sequentially (/ randomly) accessed by the real-time tasks T1 and T2 and the non-real-time task ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 24. 24 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo 100% 50%, Trace-Driven 80% Percentage of deadline violations 50%, Sequential 60% 40% 20% 0% ) ) 6) 6) ) ) 3) 6) ) 8) 4) ) ) 2) 17 39 81 27 72 10 56 91 96 10 .83 .96 .27 .82 .6 .7 .1 .2 1.8 2.0 2.0 1. 1. 2. ,0 ,0 ,0 ,0 ,1 ,1 ,1 ,1 0, 1, 2, 3, 4, 5, (2 (3 (4 (5 (6 (7 (8 (9 (1 (1 (1 (1 (1 (1 (Workload, CPU utilization) Fig. 12. The percentage of deadline violations under different overloaded workloads. write under the sequential access patterns (/ the random access patterns). The results are shown in Figure 11, where the X-axis denotes the capacity utilization of the flash memory, and the Y-axis denotes the overall write throughput in KB/s. Based on the results shown in Figure 11 we could see that the proposed mechanism delivered a comparable performance with respect to the typical non-real-time ap- proach. In other words, the overheads of the proposed real-time garbage collection mechanism was affordable. Moreover, we could have two observations: Firstly, a high capacity utilization would impose a significant performance degradation on the write throughput. Secondly, the system performed better under the sequential access patterns, because the live data were sequentially updated and invalidated. As astute readers may notice, the system under the trace-driven access patterns had the worst performance in all experiments although people usually considered a random access pattern as the worst access patterns. Such anomaly was also observed in [Kawaguchi et al. 1995; Chiang et al. 1997]. Figure 11(a) shows that the performance of the system under the trace-driven access patterns was significantly boosted if hot-data and cold-data were properly identified and separated. For a better explanation of hot-cold identification and separation mechanisms, we refer interested readers to [Kawaguchi et al. 1995; Chiang et al. 1997]. Figure 12 shows the percentages of deadline violations when the system was over- loaded. It was to evaluate how conservative the proposed admission control policy was. Note that the experiments did not reject any workload because we turned off admission control in this part of experiments. (Otherwise, the proposed admis- sion control policy should start to reject additional workloads when the number of writes was larger than 5. That was when the CPU utilization was over 100%.) The number of page writes issued by T1 in each period was increased from 2 (the original setup) to 15 with a 50% capacity utilization of flash memory. The X-axis of Figure 12 denotes the number of writes per period and the corresponding CPU utilizations, and the Y-axis denotes the percentage of deadline violations under dif- ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 25. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded System · 25 ferent workloads. It was observed that the proposed admission control was good under trace-driven access patterns because deadline violations were observed right after admission control failed. Note that trace-driven access patterns introduced a lot of live data copyings since trace-driven access patterns resulted in a smaller probability in recycling blocks with a less number of live pages. When sequential access patterns were experimented, admission control seemed very conservative. No deadline violations were observed until the total CPU utilization approached 187.2%. It was because sequential access had very light overheads on live data copying. 6. CONCLUSION This paper is motivated by the needs of a predictable block-recycling policy to pro- vide real-time performance to many time-critical systems. We propose a real-time garbage collection methodology with a guaranteed performance. The endurance problem is also resolved by the proposing of a wear-levelling method. Because of the real-time garbage collection support, each time-critical task could be guaran- teed with a specified number of reads and/or writes within its period. Non-real-time tasks are also serviced with the objective to fully utilize the potential bandwidth of the flash-memory storage system. The design of the proposed mechanism is in- dependent of the implementation of flash memory management software and the adoption of real-time scheduling algorithms. We demonstrate the performance of the proposed methodology in terms of a system prototype. There are several promising directions for future research. We shall extend the concept of the real-time garbage collection to the (disk) log-structured file systems (LFS) [Rosenblum and Ousterhout 1992]. LFS perform out-of-place update like a flash memory storage system. Note that time-critical applications might wish to use LFS because of its reliability in data consistency and its capability in rapid error recovering. We shall also extend the research results of this paper to power management for portable devices. The goal is to combine the dynamic-voltage- adjustment technique [Chang et al. 2001] with the real-time garbage collection mechanism to reduce the energy dissipation of a flash memory storage system. We believe that more research in these directions may prove being very rewarded. REFERENCES Journaling flash file system. http://sources.redhat.com/jffs2/jffs2-html/ . K9f2808u0b 16mb*8 nand flash memory data sheet. Samsung Electronics Company. Yet another flash-filing system (yaffs). Aleph One Company.. Baker, T. P. 1990. A stack-based resource allocation policy for real-time process. In Proceedings of the 11th IEEE Real-Time System Symposium. Chang, L. P. and Kuo, T. W. 2002. An adaptive striping architecture for flash memory storage systems of embedded systems. In Proceedings of the The 8th IEEE Real-Time and Embedded Technology and Applications Symposium. Chang, L. P. and Kuo, T. W. 2004. An efficient management scheme for large-scale flash-memory storage systems. In Proceedings of the ACM Symposium on Applied Computing. Chang, L. P., Kuo, T. W., and Lo, S. W. 2001. A dynamic-voltage-adjustment mechanism in reducing the power comsumption in flash memory storage systems for portable devices. In Proceedings of the IEEE International Conference on Consumer Electronics. Chiang, M. L., Paul, C. H., and Chang, R. C. 1997. Manage flash memory in personal com- municate devices. In Proceedings of IEEE International Symposium on Consumer Electronics. ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.
  • 26. 26 · Li-Pin Chang, Tei-Wei Kuo, and Shi-Wu Lo Douglis, F., Caceres, R., Kaashoek, F., Li, K., Marsh, B., and Tauber, J. 1994. Storage alternatives for mobile computers. In Proceedings of the USENIX Operating System Design and Implementation. Kawaguchi, A., Nishioka, S., and Motoda, H. 1995. A flash memory based file system. In Proceedings of the USENIX Technical Conference. Kim, H. J. and Lee, S. G. 1999. A new flash memory management for flash storage system. In Proceedings of the Computer Software and Applications Conference. Liu, C. L. and Layland, J. W. 1973. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM. 7, 3 (July). Malik, V. 2001a. Jffs-a practical guide. In http://www.embeddedlinuxworks.com/articles/jffs guide.html. Malik, V. 2001b. Jffs2 is broken. In Mailing List of Memory Technology Device (MTD) Subsys- tem for Linux. Molano, A., Rajkumar, R., and Juvva, K. 1998. Dynamic disk bandwidth management and metadata pre-fetching in a real-time filesystem. In Proceedings of the 10th Euromicro Workshop on Real-Time Systems. Rosenblum, M. and Ousterhout, J. K. 1992. The design and implementation of a log-structured file system. In ACM Transactions on Computer Systems. Vol. 10. Wu, C. H., Chang, L. P., and Kuo, T. W. 2003. An efficient r-tree implementation over flash-memory storage systems. In Proceedings of the ACM 11th International Symposium on Advances on Geographic Information Systems. Wu, M. and Zwaenepoel, W. 1994. envy: A non-volatile, main memory storage system. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. ... ACM Transactions on Embedded Computing Systems, Vol. V, No. N, June 2004.