A NOR Emulation Strategy over NAND Flash Memory


Jian-Hong Lin†  Yuan-Hao Chang†  Jen-Wei Hsieh§  Tei-Wei Kuo†  Cheng-Chih Yang‡*

† Graduate Institute of Networking and Multimedia, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan 106, R.O.C. {r94944003, d93944006, ktw}@csie.ntu.edu.tw
§ Department of Computer Science and Information Engineering, National Chiayi University, Chiayi, Taiwan 60004, R.O.C. jenwei@mail.ncyu.edu.tw
‡ Product Development Firmware Engineering Group, Genesys Logic, Inc., Taipei, Taiwan 231, R.O.C. Mikey.Yang@genesyslogic.com.tw

Abstract

This work is motivated by a strong market demand for the replacement of NOR flash memory with NAND flash memory to cut down the cost of many embedded-system designs, such as mobile phones. Different from LRU-related caching or buffering studies, we are interested in prediction-based prefetching based on given execution traces of applications. An implementation strategy is proposed for the storage of the prefetching information with limited SRAM and run-time overheads. An efficient prediction procedure is presented based on information extracted from application executions to reduce the performance gap between NAND flash memory and NOR flash memory in reads. With the behavior of a target application extracted from a set of collected traces, we show that data access to NOR flash memory can be served effectively over the proposed implementation.

Keywords: NAND, NOR, flash memory, data caching

1 INTRODUCTION

While NAND flash memory (referred to as NAND for short) has become a popular alternative in the implementation of storage systems, NOR flash memory (referred to as NOR for short) is widely adopted in embedded system designs to store and run programs. Compared to NAND, NOR has good read performance and supports XIP (eXecute-In-Place) to run programs directly. Since NAND is much less expensive and performs better in writes, NAND is widely adopted in storage system implementations [19, 20, 24, 27]. Because of the cost difference between NAND and NOR, it becomes increasingly attractive to replace NOR with NAND in embedded system designs, such as in mobile phones. The demand keeps increasing significantly because the difference might become even wider in the near future (the cost of 8Gb NOR is already 5 times more than that of 8Gb NAND) [12]. Such an observation underlines the motivation of this research.

The management of flash memory is carried out by either software on a host system (as a raw medium) or hardware circuits/firmware inside its embedded devices (as block-oriented devices). In the past decade, excellent research and implementation designs have been proposed for the management of flash-memory storage systems, e.g., [2, 3, 5, 6, 8, 9, 14, 28, 30, 31]. In particular, some researchers have exploited efficient management schemes for large-scale storage systems and/or considered different system architecture designs, e.g., [8, 9, 14, 28, 30, 31]. In the industry, several vendors, such as Intel and Microsoft, have also started exploring the advantages of flash memory in their product designs, e.g., the flash-memory cache of hard disks (known as the Robson solution) and the fast booting of Windows Vista [1, 4, 7, 29]. Besides, flash memory has also become a layer in the traditional memory hierarchy, such as NAND in a demand-paging mechanism (with compiler assistance): the source code of applications and specific compilers are used to lay out compiled code at fixed memory locations in such studies [17, 18]. Among the approaches that try to improve the performance of NAND with an SRAM cache [13, 15, 16, 22, 23], OneNAND by Samsung presented a simple but effective hardware architecture to replace NOR with NAND and an SRAM cache [13, 22, 23]. Although the idea is intuitive and useful, little past work has reported on how to manage the system performance of NAND with an SRAM cache, and the resources, such as source code and specific compilers, involved in some research [17, 18] are not always available during development. We must point out that the success of the replacement of NOR with NAND depends seriously on an intelligent way of managing the SRAM cache in the product domains.

Different from popular caching ideas adopted in the memory hierarchy and OneNAND-related work [13, 15, 16, 22, 23], we are interested in application-oriented caching. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. In this paper, we propose an efficient prediction mechanism with limited SRAM-space requirements and an efficient implementation. The idea of prediction graphs is presented based on the working-set concept [10, 11], and an implementation strategy is proposed to reduce run-time and space overheads. A prefetch procedure is then proposed to prefetch pages from NAND based on the trace analysis of application executions. A series of experiments is conducted based on realistic traces of computer games with different characteristics: "Age of Empire II (AOE II)", "The Typing of the Death (TTD)", and "Raiden". The experimental results are very encouraging: we show that the average read performance of NAND with the proposed prediction mechanism could be even better than that of NOR by 24%, 216%, and 298% in AOE II, TTD, and Raiden, respectively. Furthermore, the cache miss rates were 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively.

The rest of this paper is organized as follows: Section 2 describes the characteristics of flash memory and the research motivation. In Section 3, an efficient prediction mechanism is proposed. Section 4 summarizes the experimental results on read performance, cache miss rate, and extra overheads. Section 5 is the conclusion.

* Supported by the National Science Council of Taiwan, R.O.C., under Grant NSC-95R0062-AE00-07 and NSC-95-2221-E-002-094-MY3.

2 Flash-Memory Characteristics and Research Motivation

There are two types of flash memory: NAND and NOR. Each NAND flash memory chip consists of many blocks, and each block has a fixed number of pages. A block is the smallest unit for erase operations, while reads and writes are done in pages. A page contains a user area and a spare area, where the user area is for the data storage of a logical block, and the spare area stores ECC and other house-keeping information (e.g., the LBA). Because flash memory is write-once, data are not overwritten on each update. Instead, data are written to free space, and the old versions of data are invalidated (or considered dead). This update strategy is called "out-place update". In other words, any existing data on flash memory cannot be over-written (updated) unless its corresponding block is erased. The pages that store live data and dead data are called "valid pages" and "invalid pages", respectively.

Depending on the designs, blocks have different bounds on the number of erases over a block. For example, the typical erase bounds of SLC and MLC×2 NAND flash memory are 10,000 and 1,000, respectively.¹ Each page of small-block(/large-block) SLC NAND can store 512B(/2KB) of data, and there are 32(/64) pages per block. The spare area of a small-block(/large-block) SLC NAND page is 16B(/64B). On the other hand, each page of MLC×2 NAND can store 2KB, and there are 128 pages per block. Different from NAND flash memory, a byte is the unit for reads and writes over NOR flash memory.

¹ There are two major NAND flash memory designs: SLC (Single Level Cell) flash memory and MLC (Multiple Level Cell) flash memory. Each cell of SLC flash memory contains one bit of information, while each cell of MLC×n flash memory contains n bits of information.

NAND has been widely adopted in the implementation of storage systems because of its advantages in cost and write throughput (for block-oriented access), compared to NOR. 1GB of NOR typically costs US$34.65 in the market, compared to US$6.79 per GB for NAND, and the price gap between NAND and NOR will get even wider in the coming future. However, due to the high read performance of NOR, as shown in Table 1, and its eXecute-In-Place (XIP) characteristics, NOR is adopted in various embedded-system designs, such as mobile phones and Personal Multimedia Players (PMP). The characteristics of NAND and NOR are summarized in Table 1.

                                    SLC NOR [25]   SLC NAND [21]
                                                   (large-block, 2KB-page)
Price (US$/GB) [12]                 34.65          6.79
Read (random access of 8 bits)      40ns           25µs
Write (random access of 8 bits)     14µs           300µs
Read (sequential access)            23.842MB/s     15.33MB/s
Write (sequential access)           0.068MB/s      4.57MB/s
Erase                               0.217MB/s      6.25MB/s

Table 1. The Typical Characteristics of NOR and NAND.

This research is motivated by a strong market demand for the replacement of NOR with NAND in many embedded-system designs. In order to fill the performance gap between NAND and NOR, SRAM is a natural choice for data caching, such as in the simple but effective hardware architecture adopted by OneNAND [13, 22, 23] (please see Figure 1). However, the most critical technical problem behind the success of the replacement of NOR with NAND is the prediction scheme and its implementation design. Such an observation underlines the objective of this research: the design and implementation of an effective prediction mechanism for applications, with consideration of flash-memory characteristics. Because of the stringent resource support of embedded systems, the proposed mechanism must also face challenges in restricted SRAM usage and limited computing power.

[Figure 1. An Architecture for the Performance Improvement of NAND Flash Memory: the Host Interface connects the address and data buses to the Control Logic, which contains the Converter (byte access) and the Prefetch Procedure (512-byte access) and mediates between the SRAM cache and the NAND flash memory.]

3 An Efficient Prediction Mechanism

3.1 Overview

In order to fill the performance gap between NAND and NOR, SRAM can serve as a cache layer for data access over NAND. As shown in Figure 1, the Host Interface is responsible for the communication with the host system via address and data buses. The Control Logic manages the caching activity and provides the service emulation of NOR with NAND and SRAM. The Control Logic should have an intelligent prediction mechanism implemented to improve the system performance. Different from popular caching ideas adopted in the memory hierarchy, this research aims at an application-oriented caching mechanism. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. The design and implementation should also consider the resource constraints of a controller in SRAM capacity and computing power.

There are two major components in the Control Logic: the Converter emulates NOR access over NAND with an SRAM cache, where address translation must be done from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one and four LBA's, respectively [26]. The Prefetch Procedure tries to prefetch data from NAND into SRAM so that the hit rate of NOR access over SRAM is high. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the access patterns extracted from the collected traces, the procedure generates prediction information, referred to as a prediction graph. In Section 3.2, we shall define a prediction graph and present its implementation strategy over NAND. An algorithm design for the Prefetch Procedure will then be presented in Section 3.3.

3.2 A Prediction Graph and Implementation

The access pattern of an application execution over NOR (or NAND) consists of a sequence of LBA's, where some LBA's are for instructions and the others are for data. As an application runs multiple times, a "virtually" complete picture of the possible access patterns of the application execution might appear, as shown in Figure 2. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA following a given LBA, where each LBA corresponds to one node in the graph. A node with more than one subsequent LBA is called a branch node (such as the shaded nodes in Figure 2), and the other nodes are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the patterns. If pages in NAND could be prefetched in an on-time fashion, and there were enough SRAM space for caching, then all data accesses could be done over SRAM.

[Figure 2. An example of a prediction graph.]

The technical problems are how to save the prediction graph over flash memory with minimized overheads and how to prefetch pages based on the graph in a simple but effective way. We propose to save the subsequent LBA information of each regular node in the spare area of the corresponding page. This is because the spare area of a page in current implementations and the specification has unused space, and the reading of a page usually comes with the reading of its data and spare areas simultaneously. In such a way, accessing the subsequent LBA information of a regular node comes at no extra cost. Since a branch node has more than one subsequent LBA, the spare area of the corresponding page might not have enough free space to store the information. We propose to maintain a branch table to save the subsequent LBA information of all branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved in the spare area of the corresponding page, as shown in Figure 3(a). The starting entry records the number of subsequent LBA's of the branch node, and the subsequent LBA's are stored in the entries following the starting entry (please see Figure 3(b)). The branch table can be saved on flash memory. During run time, the entire table can be loaded into SRAM for better performance. If there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.

[Figure 3. The Storage of a Prediction Graph: (a) prediction information, where regular nodes store their subsequent LBA in the spare area and a branch node stores a pointer into the branch table; (b) a branch table, e.g., a starting entry with count 3 followed by addr(b1), addr(b2), and addr(b3).]

3.3 A Prefetch Procedure

The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to efficiently look up a selected page in the cache, we propose to adopt a cyclic buffer in the cache management and let two indices, current and next, denote the pages currently accessed and prefetched, respectively. When current = next, the caching buffer is empty. When current = (next + 1) mod SIZE, the buffer is full, where SIZE is the number of buffers for page caching. Consider the prediction graph shown in Figure 2: the page that corresponds to Node 2 is currently accessed, and the page that corresponds to Node 6 has just been prefetched (please see Figure 4).

[Figure 4. A Snapshot of the Cache: a cyclic buffer holding the pages of Nodes 1, 2, 3, 4, 5, 1, and 6, with current at the page of Node 2 and next at the page of Node 6.]

The prefetch procedure works in a greedy way. Let P1 be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to the subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure should prefetch pages by following all possible next LBA links on an equal basis and in an alternating way. That is, the prefetch procedure can follow each LBA link alternately. For example, the pages corresponding to Nodes 4 and 5 are prefetched after the page that corresponds to Node 3 is prefetched, as shown in Figure 4. The next pages to be prefetched are the pages corresponding to Nodes 1 and 6. In order to properly manage the prefetching cost, the prefetch procedure stops following an LBA link when next reaches a branch node again along a link, or when next and current might point to the same page (both referred to as Stop Conditions). When the caching buffer is full (also referred to as a Stop Condition), the prefetch procedure should also stop temporarily. Take the prediction graph shown in Figure 2 as an example: the prefetch procedure should not prefetch the pages corresponding to Nodes 8 and 9 when the page corresponding to Node 7 is prefetched. When current reaches a page that corresponds to a branch node, the next page to be accessed (referred to as the target page) will determine which branch the application execution will follow. The prefetch procedure should then start prefetching the page that corresponds to the subsequent LBA of the target page (or the pages that correspond to the subsequent LBA's of the target page if the target page corresponds to a branch node). The above prefetch procedure repeats if it stops tentatively because of any Stop Condition. Note that all pages cached in the SRAM cache between current and next stay in the cache after the target page (in the following of a branch) is accessed, because some of the cached pages might be accessed shortly, even though the access of the target page has determined which branch the application execution will follow. Note that cache misses are still possible, e.g., when current = next. In such a case, data are accessed from NAND and loaded into the SRAM cache in an on-demand fashion.

Algorithm 1: Prefetch Procedure
Input: stop, next, current, lba, idx_bch, N_bch, lba_bch[], and start_bch
Output: null
 1  if stop = TRUE then return;
 2  while start_bch = FALSE and (next + 1) mod SIZE ≠ current do
 3      if ChkNxLBA(lba) = cache(current) then
 4          stop ← TRUE;
 5          return;
 6      end
 7      next ← (next + 1) mod SIZE;
 8      lba ← GetNxLBA(lba);
 9      Read(next, lba);
10      start_bch ← IsBchStart();
11      if start_bch = TRUE then
12          LdBchTable(GetNxLBA(lba));
13          idx_bch ← 0;
14          N_bch ← GetBchNum();
15          for i = 0; i < N_bch; i = i + 1 do
16              lba_bch[i] ← GetBchLBA(i);
17          end
18      end
19  end
20  while start_bch = TRUE and (next + 1) mod SIZE ≠ current do
21      if IsBchCplt(idx_bch) = FALSE then
22          if ChkNxLBA(lba_bch[idx_bch]) = cache(current) then
23              stop ← TRUE;
24              return;
25          end
26          next ← (next + 1) mod SIZE;
27          lba_bch[idx_bch] ← GetNxLBA(lba_bch[idx_bch]);
28          Read(next, lba_bch[idx_bch]);
29      end
30      idx_bch ← (idx_bch + 1) mod N_bch;
31      if IsBchStop() = TRUE then
32          stop ← TRUE;
33          start_bch ← FALSE;
34          return;
35      end
36  end

The pseudo code of the prefetch procedure is shown in Algorithm 1. Two flags, stop and start_bch, are used to track the prefetching state: stop and start_bch denote the satisfaction of any Stop Condition and the reaching of a branch node, respectively. Initially, stop and start_bch are set to FALSE. If any Stop Condition is satisfied when the procedure is invoked, the procedure simply returns (Step 1). The procedure prefetches one page in each iteration (Steps 2-19) until the cache is full (i.e., a Stop Condition), or until we reach a branch node for the first time. First, if next would point to the same page as current does, the prefetch procedure stops and returns (Steps 3-6). Otherwise, in each iteration, the procedure advances next, i.e., the location of the next free cache buffer (Step 7). The LBA is obtained by looking up the latest prefetched LBA (Step 8), and then the page of that LBA is prefetched (Step 9). After prefetching a page, the procedure checks whether the prefetched page corresponds to a branch node (Steps 10-11). If so, the procedure loads the corresponding branch table entries (Step 12) and saves the subsequent LBA of each branch of the branch node (Steps 12-17). Because the prefetched page corresponds to a branch node, the procedure should then start prefetching pages by following each branch alternately (Steps 20-36). The loop stops when the cache is full (Step 20), when every next LBA link of a branch node reaches the next branch node (Steps 31-35), or when next and current might point to the same page (Steps 22-25). In each iteration of the loop, if the LBA link indexed by idx_bch has not yet reached the next branch node (Step 21), the next LBA following the link is prefetched (Steps 26-28). Pages are prefetched by following all possible next LBA links on an equal basis and in an alternating way (Step 30).

Note that stop should be set to FALSE when the cache is no longer full or when next and current do not point to the same page.² Moreover, stop and start_bch should both be reset to FALSE when current passes a branch node and meets the target page, or when a cache miss occurs (i.e., current = next). Once stop is set to FALSE, the prefetch procedure is invoked. When start_bch is FALSE in such an invocation, the prefetch procedure starts prefetching from the first loop between Steps 2 and 19. Otherwise, the prefetch procedure continues its previous prefetching job by following the next LBA links of a visited branch node alternately (Steps 20-36).

² Performance enhancement is possible by deploying more complicated condition settings and actions.
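To make the data structures above concrete, the following Python sketch models the prediction graph and the greedy prefetching loop in software. It is an illustrative simplification, not the paper's firmware: the function names are assumptions, the cyclic buffer is modeled as a FIFO window of LBA's, and prefetching simply stops at branch nodes instead of alternately following every branch link as Algorithm 1 does.

```python
# Illustrative sketch of the Section 3.3 prefetching idea (assumed names,
# simplified model). The prediction graph maps each LBA to its list of
# observed successor LBA's; regular nodes have one successor, branch
# nodes more than one.

def build_graph(traces):
    """Merge collected LBA traces into a prediction graph."""
    graph = {}
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            succ = graph.setdefault(a, [])
            if b not in succ:
                succ.append(b)
    return graph

def replay(trace, graph, size=4):
    """Replay one execution trace against a prefetch buffer of `size` pages.

    Unlike Algorithm 1, which alternates over every link of a branch node,
    this sketch stops prefetching at branch nodes (a Stop Condition).
    Returns (hits, misses).
    """
    buf = []            # prefetched-but-unread LBA's, oldest first
    frontier = None     # last prefetched LBA (the paper's `next` pointer)
    hits = misses = 0
    for lba in trace:
        if lba in buf:
            hits += 1
            buf = buf[buf.index(lba) + 1:]   # pages up to lba are consumed
        else:
            misses += 1                      # on-demand load from NAND
            buf, frontier = [], lba          # restart prefetching at lba
        while len(buf) < size:               # a full buffer is a Stop Condition
            succ = graph.get(frontier, [])
            if len(succ) != 1:               # branch node or leaf: stop
                break
            frontier = succ[0]
            buf.append(frontier)
    return hits, misses
```

For a purely sequential pattern, the only miss is the cold one: `replay([1, 2, 3, 4], build_graph([[1, 2, 3, 4]]))` returns `(3, 1)`, while a trace that diverges at a branch node pays an extra miss at the divergence point in this simplified model.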
4 PERFORMANCE EVALUATION

4.1 Performance Metrics and Experiment Setup

The purpose of this section is to evaluate the capability of the proposed prefetch procedure and implementation in terms of read performance (Section 4.2) and prefetching overhead (Section 4.3). The read performance was evaluated against the number of game traces considered for the creation of a prediction graph. The prefetching overhead was evaluated based on the percentage of redundant data that were prefetched unnecessarily.

The performance of the proposed prediction mechanism was evaluated over a trace-driven simulation. The experimental traces were collected over a mobile PC in units of a sector (512B), and this unit was consistent with the unit of data prefetching. Since NOR is mainly used to store programs, we conducted a series of experiments by running benchmark applications, such as game software. Three games with different characteristics were considered in the experiments, and their execution traces were collected: Age of Empires II (referred to as AOE II), The Typing of the Death (referred to as TTD), and Raiden. Each game was played ten times in the trace collection. The characteristics of the games are summarized in Table 2. AOE II is a real-time strategy game, where all players conduct their game actions simultaneously. Compared to conventional turn-based strategy games, real-time strategy games progress in real time rather than turn-by-turn. In general, the access pattern of each execution is diversified and hard to predict. TTD is a game for English typing. A player can pick any stage to play. Once a player clears a stage, some animation of that stage is displayed. Compared to AOE II, the program size over NOR is large, but its executions are more predictable. Raiden is a 3D vertical-shooter game, in which players clear stages one by one, and each enemy appears at a specific time and place. Once a stage is cleared, the data of the next stage are completely loaded to run. The execution of this game has good predictability, but data are loaded in bursts.

                                    AOE II       TTD          Raiden
Size                                small        large        small
                                    (438 MB)     (812 MB)     (467 MB)
Average number of branches          high         medium       low
Burst in reads                      low          medium       high
Temporal locality in data access    low          low          low
Randomness in data access           high         medium       low
Branch table size                   large        large        small
                                    (35.14 KB)   (39.83 KB)   (0.43 KB)

Table 2. The characteristics of the games under investigation.

We considered large-block NAND and NOR in the experiments, where there were 64 pages per block and 2KB per page for large-block NAND. The response time of a per-page read over NAND was 100µs, and the response time of a per-byte read over NOR was 40ns. The set-up time of NAND to read data from a page was 25µs, where the set-up time was for transferring data from the page cells to the internal page buffer. There was no set-up time for NOR. In the experiments, SRAM was used to store the branch table and to serve as the cache space. The response time of a per-byte SRAM read was set to 10ns. We assume that the branch table was originally stored over NAND. The table was loaded into SRAM in an on-demand fashion so that the branch table could always fit in SRAM.

In the experiments, the proposed prediction mechanism was evaluated against the number of traces. We must point out that the larger the number of traces considered, the larger the average number of branches per branch node, because more branches were observed when more traces were analyzed (even though the average number of branches per branch node might saturate when enough representative traces were analyzed). How the average number of branches per branch node grew with the given set of traces also depended on the characteristics of the games under consideration. As shown in Figure 5, the average number of branches per branch node was less than four and grew slowly, except for AOE II, because the data accesses of AOE II were more randomized compared to the others.

[Figure 5. Increment of the average branch number.]

4.2 Read Performance

Figure 6 shows the read performance of the proposed approach for the three games with respect to different numbers of traces, where the cache size was 4KB. We found that a 4KB cache was sufficient for the games under consideration because the read performance became saturated when the cache size was no less than 4KB. The read performance of each game was better than that of NOR even when only two traces were used to generate a prediction graph. For example, the improvement ratios over AOE II, TTD, and Raiden were 24%, 216%, and 298%, respectively, when the number of traces for each game was 10 and the cache size was 4KB. When there were more than two traces, the read performance of Raiden showed almost no further improvement because its cache miss rate was almost zero. For AOE II, the read performance improved slowly as the number of collected traces increased, because the access pattern of AOE II was highly random; increasing the number of collected traces for the prediction graph could not reduce the cache miss rate significantly. For TTD, good improvement was observed with the inclusion of two more traces, because the last two traces were, in fact, collected while players advanced in the game by clearing more stages. Furthermore, we summarize the read performance of the proposed scheme and other existing products in Table 3. It shows that the read performance of some specific applications with regular access patterns is even better than that of OneNAND. On the other hand, in the worst case, i.e., a 100% miss rate, the desired data have to be read from NAND flash memory on each read request. Thus, it is impractical to use NAND to replace NOR without any prediction mechanism, because the read performance gap between the emulated NOR and NOR would be too large.

[Figure 6. The read performance with different numbers of traces (4KB cache).]

               AOE II   TTD     Raiden   Worst case   NOR     OneNAND
Read (MB/s)    94.44    75.24   29.57    8.76         23.84   68

Table 3. Comparison of the read performance (10 traces and a 4KB cache in our approach).

4.3 Cache Pollution Rate

The cache pollution rate is the rate of data that are prefetched but not referenced during the program execution. The prefetching of unnecessary data represents overhead and might even decrease the read performance, because the prefetching of unnecessary data might delay the prefetching of useful data. In addition, unnecessary data transfer leads to extra power consumption, which is critical to the design of embedded systems. Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate is defined as follows:

    Cache pollution rate = 1 − (N_SRAM2host / N_flash2SRAM)

As shown in Figure 7, the cache pollution rate increased as the number of traces for each game increased. That was because more traces led to a larger number of branches per branch node, and only one of the LBA links that follow a given branch node was actually referenced by the program. In summary, there is a trade-off between prefetching accuracy and prefetching overhead, even though the cache pollution rates were still lower than 10% in most cases.

[Figure 7. The cache pollution rate (4KB cache).]
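The pollution-rate definition is simple enough to check numerically. The following minimal sketch uses an assumed function name and made-up figures (not measurements from the paper):

```python
# Numeric check of the cache pollution rate from Section 4.3.

def cache_pollution_rate(n_sram2host, n_flash2sram):
    """1 - N_SRAM2host / N_flash2SRAM: the fraction of data prefetched from
    NAND into SRAM that the host never actually reads."""
    if n_flash2sram == 0:
        return 0.0          # nothing prefetched, so nothing wasted
    return 1.0 - n_sram2host / n_flash2sram

# Example: if 1000 sectors were prefetched but the host consumed only 900,
# 10% of the prefetch traffic was pollution.
print(round(cache_pollution_rate(900, 1000), 3))   # -> 0.1
```

A rate of 0 means every prefetched sector was consumed; a rate approaching 1 means the prefetcher mostly moved data the host never requested, wasting bandwidth and power.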
5 Conclusions

This research proposes an application-oriented approach for the replacement of NOR with NAND. It is strongly motivated by a market demand for cutting down the cost of embedded systems in the storing and running of applications over NOR. Different from previous work on caching and buffering and from OneNAND-related work [13, 15, 16, 22, 23], we consider the designs of embedded systems with a limited set of applications. We propose an efficient prediction mechanism with limited SRAM-space requirements and an efficient implementation. A prefetch procedure is proposed to prefetch pages from NAND based on the trace analysis of application executions. A series of experiments was conducted based on realistic traces of computer games with different characteristics: "Age of Empire II (AOE II)", "The Typing of the Death (TTD)", and "Raiden". The experimental results are very encouraging: we show that the average read performance of NAND with the proposed prediction mechanism could be even better than that of NOR by 24%, 216%, and 298% for the three games, respectively. Their cache miss rates were 35.27%, 4.21%, and 0.06%, respectively. The percentage of unnecessarily prefetched data was lower than 10% in most cases.

For future research, we shall further extend the proposed mechanism to explore on-line incremental mechanisms that adapt to dynamic changes in programs' access patterns. We also plan to incorporate the research results into the designs of adaptors in storage system designs. More research will be conducted to analyze the execution traces of different user applications.

References

[1] Flash Cache Memory Puts Robson in the Middle. Intel Corporation.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel Corporation.
[5] Flash-memory Translation Layer for NAND flash (NFTL). M-Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification, http://developer.intel.com/. Technical report, Intel Corporation, Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives, http://www.microsoft.com/whdc/device/storage/hybrid.mspx. Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 187–196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862–868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323–333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set Model. Communications of the ACM, 15(3):191–198, 1972.
[12] DRAMeXchange. NAND Flash Contract Price, http://www.dramexchange.com/, 03 2007.
[13] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and N. Chang. Demand Paging for OneNAND Flash eXecute-In-Place. CODES+ISSS, October 2006.
[14] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155–164, Jan 1995.
[15] J.-H. Lee, G.-H. Park, and S.-D. Kim. A new NAND-type flash memory package with smart buffer system for spatial and temporal localities. Journal of Systems Architecture, 51:111–123, 2004.
[16] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-aware demand paging on NAND flash-based embedded storages. ISLPED, August 2004.
[17] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-assisted demand paging for embedded systems with flash memory. EMSOFT, September 2004.
[18] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-efficient memory architecture design of NAND flash memory embedded systems. ICCD, 2003.
[19] Z. Paz. Alternatives to Using NAND Flash White Paper. Technical report, M-Systems, August 2003.
[20] R. A. Quinnell. Meet Different Needs with NAND and NOR. Technical report, TOSHIBA, September 2005.
[21] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash Memory Data Sheet, 2003.
[22] Samsung Electronics. OneNAND Features and Performance, 11 2005.
[23] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit OneNAND Flash Memory Data Sheet, 09 2006.
[24] M. Santarini. NAND versus NOR. Technical report, EDN, October 2005.
[25] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST Flash Memory Data Sheet, 2005.
[26] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND Flash Memory, 2005.
[27] A. Tal. Two Technologies Compared: NOR vs. NAND White Paper. Technical report, M-Systems, July 2003.
[28] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management for the Flash Translation Layer in Embedded Systems. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2006.
[29] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 86–97, 1994.
[30] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), pages 146–156, Apr 2003.
[31] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer for SmartMedia Card Systems. IEEE Transactions on Consumer Electronics, 50(1):192–197, February 2004.