
Microsoft Word - IWSSPS09_min_revised



Efficient Flash Memory Read Request Handling Based on Split Transactions

Bryan Kim, Eyee Hyun Nam, Yoon Jae Seong, Hang Jun Min, and Sang Lyul Min
School of Computer Science and Engineering, Seoul National University, Seoul, Korea
{sjkim, ehnam, yjsung, hjmin, symin}

ABSTRACT
Flash memory is a storage class memory widely used in mobile computing systems due to its small size, low power consumption, fast access time, and high shock and vibration resistance. Flash memory based storage systems exploit chip-level parallelism and hide the latency of flash memory operations through request scheduling. However, conventional scheduling techniques are inadequate for handling series of small random requests, degrading the overall performance of the system. This paper presents a flash memory request handling technique based on split transactions to improve the read performance – in particular, the performance of multiple small read requests. This technique decouples the first and second phases of each flash memory read request and services them in different manners. The first phase of the operation is issued at the earliest time possible, while the second phase is delayed until the operation latency expires or the next read request to the same chip arrives. This request handling technique based on split transactions has been incorporated into a prototype flash memory based storage system on an FPGA development board. Experimental results from this prototype show that for small random read requests, this technique improves read performance by up to 46%.

Categories and Subject Descriptors
B.3 Memory Structures; C.4 Performance of Systems

General Terms
Management, Measurement, Performance, Design

Keywords
Flash memory, Split transactions, Request scheduling

1. INTRODUCTION
In recent years, mobile computing devices such as MP3 players, PMPs (portable media players), and PDAs (personal digital assistants) have gained much popularity in consumer electronics. Flash memory is preferred over hard disk drives (HDDs) in such mobile computing devices due to its small size, fast access speed, low power consumption, absolute silence, and high resistance to shock and vibration [1].

The density of NAND flash memory has been doubling every year, a rate exceeding Moore's Law [6]. With this trend, NAND flash memory becomes more affordable and its price more competitive against HDDs, accelerating the replacement of HDDs by NAND flash memory based storage systems [3, 4]. For example, flash memory solid-state disks (SSDs) that provide the same interface as HDDs are replacing HDDs in mobile and general-purpose computers.

NAND flash memory, however, has peculiar characteristics in terms of architecture and reliability that make a straightforward replacement of HDDs difficult [3, 7]. Architecturally, NAND flash memory prohibits in-place update of data. Instead, writing is performed by a program-page operation, which must be preceded by an erase-block operation that sets all the bits in the target physical block to 1. A block, the unit of erase operations, contains a set of pages, the unit of read and program operations. In terms of reliability, NAND flash memory allows some blocks to be 'bad' at manufacture time. Even good blocks may turn bad after a certain number of program-erase cycles. This physical endurance limit is typically about 10,000 ~ 100,000 cycles, depending on the type of NAND flash cells. NAND flash memory is also subject to bit-flipping errors, which cause one or more bits in a block to be reversed and necessitate the use of external error correction logic.

In order to hide the peculiarities of NAND flash memory, a flash memory based storage system employs a software layer known as the flash translation layer (FTL) [7, 8]. The first and foremost responsibility of the FTL is mapping table management, which results from the fact that flash memory does not allow in-place update. Upon receiving a host write request, the FTL writes to a pre-erased area and remaps the logical host address to the physical flash memory address. Upon receiving a host read request, the mapping table is looked up for the corresponding physical flash memory address and the data is subsequently read from flash. Other FTL responsibilities include bad block management and wear-leveling.

On the hardware side, flash memory request scheduling employed by hardware acceleration modules also plays an important role in flash memory based storage systems [4]. Flash memory operations such as erase-block, program-page, and read-page have operation latencies during which the chip is busy and other operations cannot be accepted. These latencies typically range from a few tens of microseconds to a few milliseconds. In order to hide the latency of flash memory operations, multiple chips are used concurrently by flash memory based storage systems through request scheduling. Because the efficiency of request scheduling directly impacts the overall performance of the system, many research groups have explored and studied this area [2, 5, 9, 10, 11]. However, most previous scheduling techniques focus on optimizing a single (large sequential) host request, and are not very effective in handling series of multiple (small random) host requests, although this type of workload is common in multi-tasking environments such as desktops and servers.
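The FTL's remap-on-write behavior described earlier can be sketched in a few lines. The following is an illustrative model only: the class names, the dictionary-based mapping table, and the pre-erased free list are assumptions made for the sketch, not structures prescribed by the paper.

```python
# Illustrative sketch of FTL mapping table management (hypothetical names).
class SimpleFTL:
    def __init__(self, num_physical_pages):
        self.mapping = {}                                   # logical page -> physical page
        self.free_pages = list(range(num_physical_pages))   # pre-erased pool

    def write(self, logical_page, data, flash):
        # Out-of-place update: program a pre-erased page, then remap.
        physical_page = self.free_pages.pop(0)
        flash.program_page(physical_page, data)
        self.mapping[logical_page] = physical_page

    def read(self, logical_page, flash):
        # Look up the mapping table, then read the physical page.
        physical_page = self.mapping[logical_page]
        return flash.read_page(physical_page)

class DictFlash:
    # Trivial stand-in for a NAND chip: pages are dictionary entries.
    def __init__(self):
        self.pages = {}
    def program_page(self, addr, data):
        self.pages[addr] = data
    def read_page(self, addr):
        return self.pages[addr]

ftl, chip = SimpleFTL(num_physical_pages=8), DictFlash()
ftl.write(0, b"v1", chip)
ftl.write(0, b"v2", chip)       # the update goes to a new physical page
assert ftl.read(0, chip) == b"v2"
assert len(chip.pages) == 2     # the stale copy remains until reclaimed
```

Note how the second write to logical page 0 consumes a fresh physical page and leaves the old copy behind; reclaiming such stale pages is what makes garbage collection and wear-leveling necessary in a real FTL.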
This paper presents a flash memory request scheduling technique based on split transactions that allows a stream of multiple requests to be reordered and optimized. By decoupling the first phase and second phase of a single flash memory operation, this technique allows overlapping of multiple requests over multiple chips. The proposed technique exploits not only intra (host) request parallelism but also inter (host) request parallelism, and thus it cures the inefficiency of previous techniques in handling multiple small random host read requests. We implemented a prototype flash storage system that incorporates the request handling technique based on split transactions. Performance evaluation using this prototype shows that for multiple small random read requests, this technique improves read performance by up to 46%.

The remainder of the paper is organized as follows. Section 2 gives background on flash memory organization and supported operations. Section 3 explains the conventional architecture of flash memory based storage systems and how host read requests are handled. Section 4 explores flash memory read request handling based on split transactions. Section 5 describes the implementation and evaluation environment, and presents performance measurements. Finally, Section 6 concludes and suggests future research directions.

2. BACKGROUND
2.1 NAND flash memory internal organization
A NAND flash memory chip is internally organized as follows: (1) I/O and control logic that accept commands and addresses, (2) a data buffer that temporarily stores data read or to be written, and (3) the NAND flash memory array, the non-volatile memory inside the chip.

The NAND flash memory array is further composed of blocks, each of which in turn is composed of pages [4]. Typically 64 or 128 pages make up a single block. A page is conceptually divided into a data area and a spare area. The size of a data area is a multiple of 512 bytes, typically 2KB or 4KB. The spare area is typically 16B per 512B of data area, but some recent chips have a spare area of 24B or 27.5B per 512B of data area. Although there is no difference between the data area and the spare area at the flash memory cell level, flash memory based storage systems typically use the spare area for ECC or meta-data. More importantly, manufacture-time bad blocks are marked in the spare area, so it is imperative to read this area to construct a bad block management structure upon system initialization.

2.2 NAND flash memory operations
This section briefly explains the basic flash memory operations that access non-volatile flash memory data: erase-block, program-page, and read-page. For specific timing values, the Samsung K9WBG08U1M [12] chip is assumed.

[Figure 2.1 Flash memory operations]

The erase-block operation sets all the flash cell values in the selected block to 1. As shown in Figure 2.1(a), a flash memory block erase requires no data transfer, but has a long latency, typically 1.5ms. The first phase of erasing a block requires the erase command and the physical address of the block. After the erase latency, a status check on the chip is required to check the pass or failure of the erase operation. A failure in erasing a block is an indication of a bad block, and thus it is imperative to check the status.

The program-page operation writes the data transferred to the data buffer to the selected page. As shown in Figure 2.1(b), a flash memory page program requires a data input to the chip, typically at 40MB/s (1B per 25ns), followed by the program latency, typically 200μs. The first phase of programming a page requires the physical address of the page, followed by the data transfer, and then the program command. Like the flash memory erase, the program operation may fail, and thus a status check on the chip is required after the latency. Again, a program failure is an indication of a bad block.

The read-page operation moves the data in the flash memory array to the data buffer. As shown in Figure 2.1(c), the first phase of reading a page requires the read command and the physical address of the page, followed by the read latency, typically 25μs, but up to 200μs in some recent chips. After the read latency, the data in the data buffer can be transferred out, typically at 40MB/s (1B per 25ns). Unlike erase and program, the read operation does not require a status check on the chip, but since read data may suffer from bit-flip errors introduced during programming or simply over time, it is imperative to have ECC (error correction code) encoding and decoding logic in the flash memory controller.

3. FLASH READ REQUEST SCHEDULING
As explored in the previous section, a single NAND flash memory chip has limited bandwidth for program and read operations. Thus, flash memory based storage systems utilize multiple NAND flash memory chips to overlap multiple operations over multiple chips [2, 4]. Because of the long program latency and the erase-before-program requirement of NAND flash memory, previous research has focused on improving the write performance – in particular, small random write performance [7, 13]. This paper, on the other hand, focuses on improving the read performance because of the blocking nature of read requests from the host – while write requests may be buffered and their materialization delayed by the storage system, reads may not. The following sections explore how the conventional design hides operation latencies for read requests, and further investigate how it falls short in handling series of small read requests.

3.1 Flash memory based storage system
The basic architecture of a typical flash memory based storage system is illustrated in Figure 3.1. The processor core executes the software functionalities of the FTL, and the flash memory controller enhances the effectiveness of the FTL as a hardware accelerator. SRAM provides the code execution environment, and DRAM serves as a volatile buffer for read caching, pre-fetching, and write buffering.
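The timing figures quoted in Section 2.2 (25μs read latency, 200μs program latency, 1B per 25ns bus transfer) can be folded into a quick per-operation cost model. This is a back-of-the-envelope sketch with illustrative function names, not a cycle-accurate model of the chip.

```python
# Back-of-the-envelope timing model using the K9WBG08U1M-style figures
# quoted in Section 2.2 (function names are illustrative).
US_PER_BYTE = 0.025          # 1 B per 25 ns, i.e., roughly 40 MB/s

def read_page_time_us(page_bytes, read_latency_us=25.0):
    # Phase 1: read command + address, then the read latency;
    # phase 2: data output over the flash bus.
    return read_latency_us + page_bytes * US_PER_BYTE

def program_page_time_us(page_bytes, program_latency_us=200.0):
    # Data input to the chip's buffer, then the program latency.
    return page_bytes * US_PER_BYTE + program_latency_us

print(read_page_time_us(2048))      # 2KB page: about 25 + 51.2 = 76.2 us
print(program_page_time_us(2048))   # 2KB page: about 51.2 + 200 = 251.2 us
```

The model makes the imbalance concrete: for a 2KB page, the 25μs read latency is about a third of the total read service time, which is exactly the portion that scheduling across chips can hide.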
[Figure 3.1 Basic flash memory based storage system architecture]

The host interface and the flash memory interfacer allow a modular design approach for the storage system by decoupling interface protocols from the rest of the storage system functionalities. More specifically, the host interface allows multiple implementations of interface protocols such as Serial ATA (SATA) and Serial Attached SCSI (SAS), and the flash interfacer supports diverse interfaces of NAND flash chips such as the traditional NAND [12], OneNAND [14], and Hyper-Link NAND [15].

Previously, the flash memory controller had not been emphasized. However, the flash memory controller plays a central part in the overall performance of the system, as it is responsible for handling flash memory requests such that multiple operations can be performed in parallel over multiple chips. Moreover, the flash memory controller abstracts flash memory operations and allows the FTL to execute oblivious of the operation latencies. The details of the flash memory controller will be explored in the next section.

3.2 Conventional read request handling
Figure 3.2 describes the operation of a basic flash memory controller that accepts host read requests from the host interface and translates them into series of flash memory requests. For simplicity, only one flash bus's traffic is depicted in the examples to emphasize the performance degradation caused by the latencies of flash memory operations, but the discussion can be easily generalized to multiple buses. Furthermore, for clarity, this paper assumes that the mapping table look-up can be accelerated by hardware components in the flash memory controller, and that host read requests are directly forwarded to the flash memory controller without software intervention.

[Figure 3.2 Stream of unscheduled read requests for a single host read]

Upon receiving a host read request, the flash memory controller looks up the mapping table and locates the physical flash memory addresses. To read data out of flash memory, the basic flash memory controller issues a read page request, and then a data output request, to the flash memory interfacer. Other pages to be read are accessed in the same manner – a read page request followed by a data output request. Although this sequential access of each page is simple, it results in poor read performance, as the unscheduled stream of read requests incurs operation latencies between the read request and the data output request.

In order to improve the read performance, many research groups have studied interleaving techniques that hide operation latencies by overlapping operations to multiple chips [2, 5, 9, 10, 11, 16]. Figure 3.3 is an example of such an intelligent flash memory controller that utilizes interleaving techniques to translate host read requests into a stream of scheduled flash memory read requests. Instead of sequentially accessing chips, a flash memory controller that interleaves requests issues read page requests to more than one chip, and overlaps the data output traffic of one chip to hide the latency of another.

[Figure 3.3 Stream of scheduled read requests for a single host read]

However, the efficiency of conventional interleaving techniques relies heavily on the size and sequentiality of the data to be read, because they usually adopt a static data layout scheme such as the super-block [11, 16], in which large sequential data is placed over multiple flash memory chips. If the data to be read is too small to take advantage of the super-block scheme, or if the data to be read for a host request is clustered on only one or a few chips, the interleaving technique becomes less effective. Furthermore, conventional interleaving techniques only consider parallelism within host requests and do not exploit parallelism across multiple host requests.

Figure 3.4 exemplifies the limitations of conventional flash memory controllers. Upon receiving the first host read request, the flash memory controller issues a stream of scheduled flash memory read requests based on the conventional technique. However, because it is unknown to the flash memory controller (and even to the flash memory based storage system) when the subsequent host read request will arrive, the flash memory controller is unable to optimize across multiple host read requests. This naturally induces the controller to exhibit a one-at-a-time behavior in servicing host requests, failing to hide latencies between multiple read requests. This performance vulnerability is amplified if each host read request is small – in such a case, the performance of a flash memory controller using an interleaving technique may be degraded to that of a basic controller that does not schedule requests at all.
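The super-block style static layout discussed above can be sketched as simple striping of consecutive logical pages across chips. The helper below is hypothetical and only illustrates why small or clustered requests gain little chip-level parallelism under such a layout.

```python
# Hypothetical sketch of a super-block style static layout: consecutive
# logical pages are striped round-robin across chips, so a large
# sequential request spans many chips while a small one touches few.
NUM_CHIPS = 4

def chips_touched(start_page, num_pages, num_chips=NUM_CHIPS):
    # Which chips a request covering [start_page, start_page + num_pages)
    # would access under round-robin striping.
    return {(start_page + i) % num_chips for i in range(num_pages)}

print(chips_touched(0, 8))   # large sequential request: hits all 4 chips
print(chips_touched(5, 1))   # small request: hits a single chip only
```

A single-page request leaves three of the four chips idle, so an interleaving scheduler has nothing to overlap within that request; only parallelism across host requests, which conventional controllers do not exploit, could fill those chips.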
[Figure 3.4 Stream of scheduled read requests for multiple host reads]

4. REQUEST HANDLING BASED ON SPLIT TRANSACTIONS
This section introduces a request handling technique based on split transactions and a hardware module that implements this technique. The design goals are two-fold: first, to solve the performance problem of multiple small read requests, and second, to make the technique compatible with the existing flash memory based storage system architecture.

The basic principle behind the request handling based on split transactions is decoupling the two phases of flash memory read operations. More specifically, for a flash memory page read, it loosens the coupling between the first read page request and the subsequent data output request so that they can be serviced under different policies. The details of how the two phases are handled are covered in the next sub-section.

Figure 4.1 shows a hardware module called the "request reordering unit" between the flash memory controller and the flash memory interfacer. The request reordering unit implements the request handling technique based on split transactions to improve the read performance of the system while maintaining the same interface semantics between the flash memory controller and the flash memory interfacer. Because the interface semantics are maintained, the request reordering unit is pluggable, and can be easily integrated into existing flash memory based storage systems.

[Figure 4.1 Request reordering unit]

4.1 Design of request reordering unit
The design and implementation of the request reordering unit are simplified for two reasons: a straightforward correctness constraint and an architectural assumption. There is only one dependency in ordering flash memory requests: that between a read request and the following data output request. A data output request may not be issued from the request reordering unit without the matching read request being issued first, and a read request may not be issued without the data output request of the previous read request to the same chip being issued. This constraint must be met in order to guarantee the correctness of the flash memory based storage system. In the flash memory based storage system in which the request reordering unit is incorporated, the architectural assumption constrains the total ordering of data outputs to be in-order. Because of this restriction, the request reordering unit needs only one first-in, first-out buffer to delay the issue of data output requests.

Figure 4.2 explains the operational details of the request reordering unit. Figure 4.2(a) shows the initial state, with requests queued at the request reordering unit. The queued requests are, in the order placed: read to chip 0, data output to chip 0, read to chip 1, data output to chip 1, read to chip 0, and data output to chip 0.

[Figure 4.2 Request reordering unit examples (a)-(e); D# denotes a data output to chip #, R# a read-page to chip #]

The read request to chip 0 is passed to the flash memory interfacer upon receipt. The data output request, on the other hand, is queued, as shown in Figure 4.2(b). By delaying the data output request, the request reordering unit allows subsequent read requests, in this case the read request to chip 1, to be issued. This effectively reduces the latency penalty and provides a window of parallelism. When requests are queued, an internal data structure keeps track of how many requests have been queued for each chip.
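The policy described above – pass reads through immediately, delay data outputs in a single FIFO, and drain the queued data outputs for a chip before a second read to that chip may issue – can be sketched behaviorally as follows. This is an illustrative software model of the hardware unit; the class name and the (kind, chip) request encoding are assumptions made for the sketch.

```python
from collections import deque

# Behavioral sketch of the request reordering unit ("R" = read-page,
# "D" = data output). Names and encoding are illustrative.
class RequestReorderingUnit:
    def __init__(self):
        self.fifo = deque()          # delayed data outputs, kept in-order
        self.queued_per_chip = {}    # chip -> number of queued data outputs
        self.issued = []             # stream sent to the flash interfacer

    def _issue(self, req):
        kind, chip = req
        if kind == "D":
            self.queued_per_chip[chip] -= 1
        self.issued.append(req)

    def accept(self, req):
        kind, chip = req
        if kind == "R":
            # Correctness constraint: drain the FIFO (in order) until no
            # data output for this chip remains queued, then issue the read.
            while self.queued_per_chip.get(chip, 0) > 0:
                self._issue(self.fifo.popleft())
            self._issue(req)
        else:
            # Delay the data output to open a window of parallelism.
            self.queued_per_chip[chip] = self.queued_per_chip.get(chip, 0) + 1
            self.fifo.append(req)

    def flush(self):
        # Timer expiry: issue everything still queued, in FIFO order.
        while self.fifo:
            self._issue(self.fifo.popleft())

# The queued sequence of Figure 4.2(a): R0 D0 R1 D1 R0 D0.
unit = RequestReorderingUnit()
for req in [("R", 0), ("D", 0), ("R", 1), ("D", 1), ("R", 0), ("D", 0)]:
    unit.accept(req)
unit.flush()
print(unit.issued)  # [('R', 0), ('R', 1), ('D', 0), ('R', 0), ('D', 1), ('D', 0)]
```

Feeding the Figure 4.2 sequence through this model yields the reordered stream R0, R1, D0, R0, D1, D0: the read to chip 1 is issued during the read latency of chip 0, which is exactly the overlap the split-transaction handling is meant to create.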
Figure 4.2(c) shows the state of the request reordering unit in which the read requests to chip 0 and chip 1 have been issued, and the data output requests to chip 0 and chip 1 are queued. Upon receiving the second read request to chip 0, all queued data output requests to chip 0 must be issued before this read request can be issued. The internal data structure that keeps track of how many requests are queued is used to determine whether all the data output requests have been sent. As shown in Figure 4.2(d), once the data output request to chip 0 has been sent, the second read request to chip 0 can be issued, and the subsequent data output to chip 0 is queued.

In order to issue queued data output requests, an internal timer is used to specify the size of the window for exploiting parallelism. Once there are no more requests arriving at the request reordering unit and there exist requests queued in the internal FIFO, the timer increments until it reaches the specified window size. Once the timer expires, all the queued requests are issued, as shown in Figure 4.2(e).

The reordered stream of requests effectively hides the read latency by delaying data output requests until necessary for correctness, or until the time window for exploiting parallelism expires. As shown in Figure 4.3(a) and Figure 4.3(b), the reordered stream of requests has a shorter completion time than the original stream of requests.

[Figure 4.3 Comparison between the original sequence and the reordered sequence of read requests]

5. EVALUATION
In order to evaluate the performance of the request handling based on split transactions, the request reordering unit described in the previous section has been incorporated into a flash memory controller and prototyped on an FPGA based development platform.

5.1 Implementation and test environment
Synthetic workloads are used for performance evaluation to eliminate the effects of the host interface processing overhead and to isolate the effect of the proposed technique. The synthetic workload is comprised of two independent dimensions – randomness and request size. Although workloads such as "large random" and "small sequential" are rarely found in real workloads, they are included nonetheless in the synthetic workload. The evaluation framework using the synthetic workload is shown in Figure 5.1.

[Figure 5.1 Evaluation framework]

The experimental set-up is as follows. One block is randomly selected for each chip, and the block is programmed with random data stored in the volatile buffer. Then up to 32 host read requests are generated and also stored in the volatile buffer. These requests are later fetched and decoded by the flash memory controller. The timer for performance measurement starts as the flash memory controller begins fetching host requests. When all data has been read from flash memory to the volatile buffer, the timer stops, and the programmed data and read data are compared for correctness.

The flash memory controller with the request reordering unit has been prototyped using an in-house development platform. Figure 5.2 shows the major components of the development platform: SDRAM, FPGA, and DIMM sockets. The 64MB SDRAM on the prototype board serves as data and request storage for the flash memory controller. The Xilinx Virtex4 FPGA implements all the hardware components of the storage system, including the flash memory controller and the request reordering unit. In addition, the PowerPC405 processor embedded in the FPGA executes the test firmware that emulates the host interface and runs the FTL. The DIMM sockets allow diverse NAND flash chip modules to be plugged in – for this specific evaluation, 64 Samsung K9WBG08U1M [12] chips (8 chips/bus x 8 buses) were used.

[Figure 5.2 FPGA development platform]

5.2 Performance measurement
5.2.1 Random workload
Figure 5.3 shows the measured bandwidth of the flash memory based storage system for the synthetic random workload. The x-axis represents the size of each request, ranging from 512B (1 sector) to 64KB (128 sectors). For all request sizes, the storage system with the request reordering unit enabled (represented by light bars) outperforms that with it disabled (represented by dark bars).
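The scale of such an improvement can be illustrated with a back-of-the-envelope completion-time model built from the Section 2.2 timings. The numbers below are a sketch under simplifying assumptions (reads target distinct chips sharing one flash bus), not the paper's measured values.

```python
# Illustrative completion-time model contrasting one-at-a-time handling
# with split-transaction overlap, using the Section 2.2 figures
# (25 us read latency; 2 KB page at 1 B / 25 ns => 51.2 us data output).
READ_LATENCY_US = 25.0
DATA_OUT_US = 2048 * 0.025   # 51.2 us per 2 KB page on the flash bus

def sequential_us(num_pages):
    # One-at-a-time: the read latency is exposed before every data output.
    return num_pages * (READ_LATENCY_US + DATA_OUT_US)

def split_transaction_us(num_pages):
    # Reads to distinct chips issued up front: only the first read latency
    # is exposed, and the data outputs then stream back-to-back on the bus.
    return READ_LATENCY_US + num_pages * DATA_OUT_US

print(sequential_us(4))         # about 4 * 76.2 = 304.8 us
print(split_transaction_us(4))  # about 25 + 4 * 51.2 = 229.8 us
```

In this toy setting, overlapping four single-page reads hides three of the four read latencies, a gain of roughly 25%; the measured 46% gain at 4KB reflects the prototype's eight buses and deeper request stream rather than this simplified single-bus model.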
[Figure 5.3 Request size vs. bandwidth for synthetic random workload]

In particular, the measurements for 4KB random read requests show an approximately 46% increase in bandwidth. Even for 64KB random read requests, the performance improves by almost 10%. The relative performance gap between the request reordering unit disabled and enabled is greatest when the request size is 4KB (one sector from each flash bus). As the request size increases beyond 4KB, the chip-level interleaving technique employed by the flash memory controller allows the performance with the request reordering unit disabled to catch up to that with it enabled. On the other hand, as the request size decreases below 4KB, idle flash buses allow overlapping of host requests, reducing the performance gap between the two.

Figure 5.4(a) and Figure 5.4(b) show the distribution of response times for 4KB and 64KB request sizes, respectively. For the random workload with 4KB requests, nearly half of the measurements fall between 720μs and 780μs when the request reordering unit is enabled; when it is disabled, more than half of the measurements are between 1060μs and 1100μs. These response time measurements are consistent with the bandwidth measurements in Figure 5.3. Similarly, the response time measurements for the 64KB request size in Figure 5.4(b) are consistent with the bandwidth measurements: the average response time when enabled is 9034μs, while when disabled it is 9596μs.

[Figure 5.4 Response time histograms for random read requests: (a) 4KB requests, (b) 64KB requests; 32 requests, random data distribution, request reordering unit disabled/enabled]

5.2.2 Sequential workload
Figure 5.5 shows the measured bandwidth for the synthetic sequential workload. The results show that although the performance improvement for the sequential workload is much less than that for the random workload, the request reordering unit nevertheless does not degrade performance.

[Figure 5.5 Request size vs. bandwidth for synthetic sequential workload]

6. CONCLUSION
This paper explores the limitations of conventional request scheduling in flash memory based storage systems, and presents a request handling technique based on split transactions. This technique decouples the first phase and second phase of a flash memory read operation to facilitate parallel processing on multiple chips. The proposed technique is particularly effective for multiple small random requests as compared against conventional techniques.

The proposed request handling technique based on split transactions has been incorporated into a hardware module that handles host read requests to improve the read performance of a flash memory based storage system. Performance evaluation on an FPGA based development board shows that the proposed technique improves the read performance by up to 46% for a series of small random read requests.

For future research, we plan to investigate a request handling technique based on more general out-of-order execution, a concept used by most modern processors. We expect that this relaxation of constraints using the out-of-order execution model will significantly improve the performance of flash memory based storage systems by allowing more overlapping of flash operations to multiple chips.
7. REFERENCES
[1] F. Douglis, R. Caceres, F. Kaashoek, K. Li, and J. A. Tauber, "Storage alternatives for mobile computers," in Proceedings of the 1st Symposium on Operating Systems Design and Implementation, 1994, pp. 25-37.
[2] J. U. Kang, J. S. Kim, C. Park, H. Park, and J. Lee, "A multi-channel architecture for high-performance NAND flash-based storage system," Journal of Systems Architecture, Vol. 53, No. 9, pp. 644-658, January 2007.
[3] J. H. Um, B. Kim, S. G. Lee, E. H. Nam, and S. Y. Min, "Flash memory-based development platform for homecare devices," presented at the IEEE International Conference on Systems, Man, and Cybernetics, Singapore, 2008.
[4] S. L. Min and E. H. Nam, "Current trends in flash memory technology," in Proceedings of the Conference on Asia South Pacific Design Automation, 2006, pp. 332-333.
[5] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. Manasse, and R. Panigrahy, "Design tradeoffs for SSD performance," in Proceedings of the USENIX Annual Technical Conference, 2008, pp. 57-70.
[6] C. Hwang, "Nanotechnology enables a new memory growth model," Proceedings of the IEEE, Vol. 91, No. 11, pp. 1765-1771, Nov. 2003.
[7] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, "A space-efficient flash translation layer for CompactFlash systems," IEEE Transactions on Consumer Electronics, Vol. 48, No. 2, pp. 366-375, May 2002.
[8] E. Gal and S. Toledo, "Algorithms and data structures for flash memories," ACM Computing Surveys, Vol. 37, No. 2, pp. 138-163, June 2005.
[9] L. P. Chang and T. W. Kuo, "An adaptive striping architecture for flash memory storage systems of embedded systems," in Proceedings of the 8th IEEE Real-Time and Embedded Technology and Applications Symposium, 2002, pp. 187-196.
[10] S. Kim, C. Park, and S. Ha, "Architecture exploration of NAND flash-based multimedia card," in Proceedings of the Conference on Design, Automation, and Test in Europe, 2008, pp. 218-223.
[11] C. Dirik and B. Jacob, "The performance of PC solid-state disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization," in Proceedings of the International Symposium on Computer Architecture, 2009.
[12] Samsung Electronics, K9WBG08U1M NAND flash memory data sheet, 2007.
[13] J. H. Yoon, E. H. Nam, Y. J. Seong, H. Kim, B. Kim, S. L. Min, and Y. Cho, "Chameleon: A high performance flash/FRAM hybrid solid state disk architecture," IEEE Computer Architecture Letters, Vol. 7, No. 1, pp. 17-20, Jan. 2008.
[14] B. Kim, S. Cho, Y. Choi, and Y. Choi, "OneNAND(TM): A high performance and low power memory solution for code and data storage," in Proceedings of the 20th Non-Volatile Semiconductor Workshop, 2004.
[15] MOSAID Technologies Incorporated, "Unleashing the next generation flash memory architecture: HyperLink NAND (HLNAND(TM)) Flash," 2007.
[16] Y. J. Seong, E. H. Nam, J. H. Yoon, H. Kim, J. Choi, S. Lee, Y. H. Bae, J. Lee, Y. Cho, and S. L. Min, "Hydra: A block-mapped parallel flash memory solid-state disk architecture," accepted for publication in IEEE Transactions on Computers, 2009.