Architecture Exploration of NAND Flash-based Multimedia Card


Sungchan Kim, School of EECS, Seoul National University, Korea
Chanik Park, Memory Division, Semiconductor Business, Samsung Electronics Co., Korea
Soonhoi Ha, School of EECS, Seoul National University, Korea

978-3-9810801-3-1/DATE08 © 2008 EDAA

Abstract

In this paper, we present an architecture exploration methodology for low-end embedded systems where cost reduction is a primary design concern. Exploring the architecture of such systems means covering a wide design space spanned by detailed architecture parameters, using cycle-accurate performance estimation. For fast exploration, the proposed methodology is based on an efficient evolutionary algorithm, called QEA, and on trace-driven simulation to evaluate architecture candidates quickly. We applied the proposed methodology to a NAND flash-based Multimedia Card as a case study, considering the following design parameters: buffer size, flash memory configuration, clock, communication architecture, and memory allocation. The experimental results validate the proposed methodology by showing the optimal architecture configurations under varying performance constraints and design parameters.

1. Introduction

Previous work on design space exploration of communication architectures has mainly focused on high-performance embedded systems of increasing complexity [1][2][3][4]. Since performance is the primary concern there, architecture exploration has been performed with abstracted models of the communication and memory system, based on approximate cost models. On the other hand, reducing the material cost is the primary concern in the design of low-end embedded systems, since a saving of a few cents is multiplied by a huge volume of products. Consequently, an architecture exploration technique for low-end embedded systems should take into account more detailed architecture parameters, such as buffer size, memory configuration (interleaving or multi-channel), or bus clock frequency. Accurate performance estimation during exploration is also necessary to evaluate minor modifications of such detailed architecture parameters.

A NAND flash-based memory card is a popular low-end embedded system. It is widely used as a storage device due to its prominent physical characteristics such as fast access time, low power consumption, resistance to shock and temperature, small size, light weight, and noiselessness [5]. It is predicted that the adoption of NAND flash-based memory cards will continue to grow, reaching a market of about 4.6 billion dollars in 2007 [6].

Memory cards typically consist of only a few bus masters and a small number of memory components (usually less than tens of kilobytes). Furthermore, a memory card should operate at as low a clock frequency as possible to reduce energy consumption while satisfying performance constraints. For the architecture exploration of a NAND flash-based memory card, we consider the following parameters: bus architecture, on-chip memory allocation, off-chip NAND flash memory connection, buffer size, and clock frequency. Since the design space is huge and time-consuming software development depends on the architecture, a fast exploration technique is sought in the early design stage [11].

The current practice of architecture exploration for NAND flash-based storage systems relies on iterations of manual decision and cycle-accurate simulation with system-level architectural capture [7][8]. A TLM simulation technique is not fast enough for exploring the huge design space. Within the limited development time budget, it usually results in a sub-optimal design.

To solve this problem, this paper presents a flexible and systematic design methodology to explore the large design space of memory card architectures complying with the MMC specification [15]. The proposed semi-automated methodology consists of two steps. The first step generates memory traces by TLM simulation, exploring the architecture parameters that affect the behavior of memory traces. The memory traces are then fed to the second step, the main architecture exploration of the remaining design parameters. Since many design parameters are considered at the same time, we use an evolutionary algorithm, called Quantum-inspired Evolutionary Algorithm (QEA) [14], that has good properties such as fast convergence to viable solutions and simplicity of implementation. In each evolution step of QEA, we use trace-driven simulation for cycle-accurate performance estimation of each architecture candidate to accelerate the exploration. It is much faster than TLM simulation while keeping accuracy close to it.

The rest of this paper is organized as follows. In the next section, the related work and our contributions are summarized. In Section 3, we present the basics of the NAND flash-based memory card. Sections 4, 5, and 6 give the details of the proposed exploration methodology. The experimental results are provided in Section 7. Lastly, Section 8 draws the conclusion and outlines future work.

2. Related Work

Previous work on the exploration of hierarchical shared buses has shown that the design space of on-chip communication architecture is extremely large, with various parameters such as bus topology, priority assignment, and memory allocation, even at a high level of abstraction [1][2]. Research on bus matrices has tried to find optimal bus matrix architectures considering the number of buses and arbitration schemes [3][4]. However, these works on on-chip communication architectures have paid no attention to the low-level details of the memory subsystem considered in this paper, and they usually did not use accurate performance estimators during exploration.

As NAND flash memory is used as code and data storage for mobile devices, several architectural optimization techniques have been proposed. Joo et al. proposed a demand paging method based on OneNAND flash [9]. Kang et al. explored various hardware architectures to maximize the parallelism of NAND flash memories: multi-way interleaving and multi-channel architectures [10]. They built a hardware prototyping platform for accurate performance analysis of multi-way interleaving and multi-channel architectures. In [11], Min introduced an advanced controller architecture for high-performance flash disks. For systematic performance optimization, a virtual prototyping environment for NAND flash applications, ViP (Virtual Platform), was proposed in [8]. They focused on software optimization based on cycle-accurate simulation models of a new hard disk system including
NAND flash as a cache memory. We use an environment similar to ViP for memory trace generation in the proposed design flow.

Our contributions can be summarized as follows. First, the proposed methodology provides an automated design flow to explore detailed architectural decisions for low-end embedded systems. It considers memory and buffer architecture as well as bus architecture, while previous work focused only on communication architectures.

Second, we propose a two-step exploration technique to minimize the number of TLM simulations during exploration. We use TLM simulation only for memory trace generation and use trace-driven simulation for fast but accurate performance estimation in the innermost exploration loop.

Finally, by using the Quantum-inspired Evolutionary Algorithm (QEA), we make the proposed methodology flexible and extensible. The proposed exploration framework can therefore be tailored to other low-end embedded systems such as Solid-State Disks (SSD).

3. Background on MMC Architecture

In this section, the architecture of an MMC memory card is introduced to reveal the important architectural parameters to be explored. Figure 1(a) shows the overall architecture of the memory card. There are three bus masters, depicted with a thick border in the figure: ARM7, DMA, and Host interface. A NOR flash is dedicated to the local memory of ARM7. Since ARM7 is a cacheless core, it performs all local memory accesses on SRAM with single-cycle latency, only after an initialization from the NOR flash memory.

The grey-colored parts of Figure 1(a) include the parameters under consideration in the proposed methodology. As for the bus architecture, we use a bus matrix where each of the bus masters and slaves can be connected with a separate data path called a 'bus segment', providing concurrent memory accesses for different masters to achieve high performance. Figure 1(b) shows the parameters that configure a bus matrix architecture. There can be multiple SRAMs that reside on different bus segments to exploit concurrency.

In order to maximize the read and write bandwidth of NAND flash memory, two parallel architectures are considered. In multi-way interleaving, while the first flash memory is being programmed, the next flash memory is loaded with the incoming data, as shown in Figure 1(c). As a result, the program latency of the first flash memory can be overlapped with the data loading time of the second flash memory, which improves write throughput. In the multi-channel architecture, on the other hand, an independent bus can be connected to each flash memory. The incoming data are loaded simultaneously into multiple flash memories, as shown in Figure 1(d). Even though the multi-channel architecture shows better performance than the interleaving architecture, it requires additional pins and DMA control logic.

Data is transferred to or from the memory card when a host issues a read or write command [15]. For example, when a host wants to write a data stream to a memory card, it first sends a write command to the card and then transfers the data stream to buffers managed by the Host interface inside the card, breaking it into small pieces of predefined length, so-called 'blocks'. Double buffering is usually used to allow the Host interface and DMA to access blocks simultaneously. DMA may read a block from one buffer to fill the FIFO between DMA and flash memory while the Host interface writes a new block to the other buffer. The controller inside the NAND flash module draws data from the FIFO and stores them in flash memories. The block size is a design parameter, which varies among 512 bytes, 1 KB, and 2 KB in this paper.

Figure 1. (a) The overall architecture of a memory card, (b) bus matrix architecture, and write performance of flash memory (c) with multi-way interleaving capability and (d) with multi-channel.
(Labels in (a): ARM7, DMA, Host interface, FIFO, NAND flash memories, NOR flash, SRAM, double buffer 0, double buffer 1, bus architecture, clock frequency, double buffer size 512B/1KB/2KB. Labels in (b): masters M0 to M2, slaves S0 and S1, priority assignment at each slave port, internal bus connection of bus matrix, memory-to-bus matrix assignment. Labels in (c) and (d): CPU, Flash 0, Flash 1, 8-bit bus, 8-bit buses ×2, chip package, SW execution, data loading time, page program time.)

While read and write commands are performed, memory accesses from the three masters may occur simultaneously, which might result in significant performance degradation in a single-bus architecture. Since ARM7 without a cache never stops accessing local memory, it would cause frequent access conflicts. A bus matrix architecture, where local memory and double buffers are split over different bus segments, allows simultaneous accesses at the higher cost of additional bus segments. Multiple masters competing for the same memory are arbitrated with a fixed-priority policy.

4. Overview of the Proposed Methodology

This section explains the details of the proposed exploration method. The overall flow of the proposed methodology is shown in Figure 2(a). The first step is to collect memory traces for a given test scenario. It is important to identify the design parameters, if any, that can affect the behavior of the traces. In our case, the size of the double buffer and the input test scenarios fall into this category. We use test scenarios transferring blocks of various lengths, which produce memory traces of different lengths and patterns. A different double buffer size requires a slight modification of the firmware running on the card.

Once these parameters are chosen, a TLM simulation is run to obtain the memory traces. Throughput constraints are given to this step to annotate the bandwidth requirement of each access from a master to a logical memory block (LMB). An LMB is a memory segment to be mapped onto an on-chip SRAM through the exploration. We use the read and write bandwidth as the performance constraint. Note that communication overhead is excluded in this step of memory trace generation. The communication overhead is evaluated in the next step of exploration.

We then generate a communication description graph (CDG) as shown in Figure 2(b). CDG=G(V, E) is a directed graph where each vertex v∈V represents a component of a system that can be a
bus master or a logical memory block (LMB), and each edge e∈E corresponds to communication between components. The weight of each edge is the minimum average bandwidth, in MB/s, to be sustained to satisfy the throughput constraint. If ARM7 requires 90 MB/s to access the LMB 'ARM7 local' for a certain throughput constraint, as shown in Figure 2(b), it should perform 32-bit (4-byte) memory accesses 22.5×10^6 times per second (90 MB/s divided by 4 bytes per access).

Figure 2. (a) The proposed exploration flow and (b) communication description graph (CDG).
(Flow in (a): choose a parameter among those affecting memory traces, namely double buffer sizes (512B / 1KB / 2KB) and test scenarios (read/write commands); run TLM simulation to generate memory traces, producing the CDG and memory traces; perform architecture exploration of bus matrix and memory under the throughput constraint; record the best solution; if not all parameters have been considered, repeat; otherwise choose the best one among the architecture candidates and end the exploration. Graph in (b): masters m1 ARM7, m2 DMA, m3 Host I/F; logical memory blocks LMB1 ARM7 local, LMB2 Double buffer0, LMB3 Double buffer1, LMB4 CPU ↔ DMA, LMB5 APB register, LMB6 AHB register; edge weights in MB/s: 90, 0.6, 0.6, 21, 6.7, 0.6, 0.6, 5.2, 0.1.)

After the memory traces are obtained, the bus matrix and memory architecture exploration is performed considering the design parameters in Figure 1, to find the bus matrix architecture with the optimal trade-off between the number of bus segments in the bus matrix, the clock frequency, and the flash memory architecture. The resulting architecture is then recorded. Afterwards, the double buffer size and/or the test scenario are changed to generate a new set of memory traces for the next exploration. These procedures are repeated until we have traversed all available combinations of the parameters that affect memory trace generation. In the final step, we choose the best of the architectures recorded so far.

In the following sections, we describe the two essential steps in more detail: the memory trace generation and the architecture exploration of bus matrix and memory.

5. Memory Trace Generation

The generation of memory traces consists of the following sub-steps.

1) TLM simulation to generate initial traces: To obtain memory traces whose timing information does not include communication overhead, we model all LMBs as multi-port memories. Multiple masters can then access the same LMB without suffering any delay due to access conflicts. We use SoC Designer™ [7] (previously known as MaxSim) as the virtual prototyping environment for the card system in this step.

2) Trace format translation and control dependency annotation: Initial traces are generated in VCD (Value Change Dump) format, a popular ASCII-based dump file format in logic simulation tools. We need to tailor it for our exploration step. A single VCD file is divided into one trace file per master (ARM7, DMA, Host interface) with the predefined trace format shown in Figure 3(a).

Figure 3. (a) Trace format translation and (b) control dependency annotation.
(In (a), the initial VCD file from simulation is split into a trace file for ARM7, a trace file for DMA, and a trace file for the host interface. Example trace record: 1325812 W N 4 0x00000000 0x00000073, whose fields are time stamp, read/write, non-sequential/sequential, logical memory block index, address, and data. Panel (b) shows the trace for ARM7 and the trace for DMA.)

During the trace translation, control dependency between masters should be considered to enforce the correct execution order of masters in trace-driven simulation. Although the TLM simulation guarantees the correct order of memory accesses from masters when a dependency exists, the trace-driven simulation will not be aware of it if the behavior of masters is simply abstracted as memory traces to boost simulation speed.

Thus we annotate the translated traces with special traces marked by a predefined keyword. A handshaking protocol between ARM7 and DMA, shown in Figure 3(b), is a good example. ARM7 initiates DMA to transfer data, so the beginning of the DMA transfer appearing in the trace at time 1,369,479 is put off until ARM7 sets a register to make DMA operate at time 1,369,472. Similarly, ARM7 must wait for the completion of the DMA transfer by polling a register that DMA sets at the end of the transfer, as shown on the trace of DMA at time 1,377,667.

6. Architecture Exploration of Bus Matrix and Memory

After we collect memory traces, the design space of bus matrix and memory architecture is explored to satisfy a given throughput constraint. Since the combination of many design parameters determines the performance of an architecture, this is a typical combinatorial optimization problem. Among many heuristics such as Simulated Annealing (SA) [12], Genetic Algorithm (GA) [13], and Quantum-inspired Evolutionary Algorithm (QEA) [14], we adopt QEA for two properties. First, its simple principle of operation allows an easy implementation. Second, it nevertheless converges fast even with a small population, resulting in superior solutions compared to
conventional GA [14]. Since each candidate is evaluated by simulation, fast convergence is important to avoid excessive exploration time in the proposed methodology.

6.1 Quantum-inspired evolutionary algorithm

In QEA, a Q-bit is defined as a primitive unit that represents information with two numbers (α, β), where |α|^2 + |β|^2 = 1. This is a probabilistic representation, the concept of which is borrowed from quantum computation. |α|^2 is the probability that the Q-bit is in the state '0', and |β|^2 the probability that it is in the state '1'. A Q-bit represents the linear superposition of the states '1' and '0' according to the value of (α, β). A Q-bit individual is a sequence of m Q-bits. For example, a Q-bit individual with 3 Q-bits can represent 8 values according to the state of each Q-bit.

Figure 4. The overall procedure of QEA.
(Elements shown: Q(t-1), observation of Q-bits, P(t), P(t) evaluation, selection, B(t), B(t) evaluation, Q-gate update of Q-bits, local migration, global migration and termination, best solution b; Steps 1 to 5.)

Compared to GA, QEA is similar in that a solution for a given problem is represented as a series of primitives (Q-bits in QEA and chromosomes in GA). On the other hand, a Q-bit individual contains the probability of a certain state for each Q-bit, while a chromosome in GA represents a solution itself. Therefore, to evaluate Q-bit individuals, we need to obtain a binary solution by observing them. The observation is performed for each Q-bit in an individual, selecting '1' or '0' probabilistically for the associated bit field in a binary solution by comparing |α|^2 or |β|^2 with a random number within [0,1]. QEA does not require a crossover operation between multiple binary solutions, which makes the implementation of QEA simpler than that of GA.

The overall procedure of QEA at generation t is shown in Figure 4. Step 1 generates a set of binary solutions P(t) by observing each Q-bit individual in a set of individuals Q(t-1). Note that multiple solutions can be observed from a Q-bit individual at a time. Generated solutions might be repaired to be valid for a given application. In step 2, the observed binary solutions are evaluated to obtain their fitness. The best solution from the same Q-bit individual is stored in B(t). Then, in step 3, based on the best solution, the Q-bits of the associated Q-bit individual in Q(t) are updated. This changes (α, β) of each Q-bit using a Q-gate operation to make the exploration converge to a better solution sub-space. Step 4 updates b with the highest-fitness solution among the best solutions in B(t) to keep the global solution during exploration. When the termination conditions are met, the exploration procedure finishes as in step 5, producing the best solution b. We use two criteria to judge the end of exploration: the average convergence of Q-bits in all individuals, or the number of generations evolved so far. We omit the detailed description of each step; refer to [14] for further details.

6.2 Working with QEA

In this section, we describe the QEA used in the proposed methodology, focusing on the Q-bit representation of an architecture and the fitness evaluation of a solution.

Q-bit representation: Figure 5(a) shows the representation of an architecture using Q-bits in a Q-bit individual. It consists of the following four orthogonal parts, each of which is a set of Q-bit strings.

Figure 5. (a) The Q-bit representation of architecture and (b) an observed solution example.
(Fields in (a): master to bus matrix mapping (m_1, m_2, …, m_M, each holding bm_{i,1}, bm_{i,2}, …, bm_{i,L}); logical memory block to physical memory mapping (pmem_1, pmem_2, …, pmem_L); clock frequency; flash memory configuration (# of mems, interleaving/multi-channel). Binary solution in (b): ARM7: 1 -1 -1 1 1 1; DMA: -1 -1 -1 1 -1 -1; Host interface: -1 1 1 -1 -1 -1; logical mem. to physical mem.: 1 2 2 2 1 1; clock frequency: 25; flash mem. configuration: 2 0. The observed architecture in (b) contains NOR flash, ARM7, DMA, Host I/F, bus matrix, SRAM1, SRAM2, NAND Ctrl, FIFO, and the NAND module, with clock frequency 25 MHz.)

- Mapping between a master and an LMB through a bus matrix: With M masters and L LMBs, each access of master m_i to LMB_j passes through a bus matrix bm_{i,j}, as shown in Figure 5(a), where i=1,…,M and j=1,…,L. If bm_{i,j}=-1, m_i does not access LMB_j.
- Mapping of logical memory to physical memory (on-chip SRAM): This part determines to which physical memory each LMB is mapped. Note that the NOR flash is connected to the local bus of ARM7, since no master other than ARM7 accesses it.
- Operation clock frequency: The system runs with a single clock. The range within which the clock varies is given by the designer.
- NAND flash memory configuration: This part selects the flash memory architecture (interleaving/multi-channel) and the number of flash memories to use. If the field 'interleaving/multi-channel' in Figure 5(a) is '0', the flash memories are connected in the interleaved manner; if it is '1', they are connected in the multi-channel manner. A single Q-bit is enough to make this decision.

An example of architecture representation by a Q-bit individual is depicted in Figure 5(b), according to the CDG in Figure 2(b). Since, according to the CDG, the Host interface accesses LMB2 and LMB3 only, the associated fields are set to '1' in the binary solution. As for the mapping of LMBs to physical memories, the three LMBs 'ARM7 local', 'AHB register', and 'APB register' are assigned to SRAM1 and the others to SRAM2. The bus segment connection of each bus matrix depends on the mapping of LMBs to physical memories. A global clock of 25 MHz is chosen. Finally, bank interleaving with two flash memories is selected as the flash memory architecture.

Fitness evaluation: We evaluate the observed binary solutions (representing architectures) using trace-driven simulation to obtain the number of cycles for the given traces on each architecture. To reduce the number of simulations in this step, we first verify whether the architecture under evaluation supports the minimum bandwidth to be sustained. Suppose that we have an architecture where all LMBs are assigned to a single 32-bit SRAM running at 25 MHz. The maximum bandwidth of this architecture is 100 MB/s (4 bytes per cycle at 25 MHz), while the required bandwidth, the sum of the edge weights of the CDG in Figure 2(b), is 125.4 MB/s. Therefore we can see that this architecture cannot meet the throughput constraint without running trace-driven simulation, which reduces the exploration time.

The fitness F of a binary solution (or architecture) is defined as follows:

$$F = A \cdot (G_{clk})^{\alpha} \cdot (G_{on\_chip\_bus})^{\beta} \cdot (G_{NAND\_bus})^{\gamma} \qquad (1)$$

A is 1 if the current solution can support the required bandwidth described in the CDG, and 0 otherwise. G_clk is the gain from using a low clock frequency and is defined as the ratio between the maximum clock available, clk_max, and the clock of the current solution, clk_current, i.e., clk_max / clk_current. G_on_chip_bus is related to the reduction of the number of bus segments in a bus matrix. We take the maximum number of bus segments, bus_max, to be the product of the number of masters and the number of LMBs in the system. If we denote the number of bus segments in the current architecture by bus_current, G_on_chip_bus is defined as bus_max / bus_current. The last term, G_NAND_bus, is used to prefer a flash memory architecture with a smaller total bus width, because an increase in pin count means a larger chip package and, in turn, higher cost. The total bus width of a flash memory configuration depends on both the architecture (interleaving/multi-channel) and the number of flash memories. For example, interleaving with 2 memories has an 8-bit bus for the controller and 2×8 bits for the memories, and consequently 24 bits in total. On the other hand, multi-channel with 4 flash memories has 4×8-bit buses for the controller and 4×8 bits for the memories, i.e., 64 bits in total, which is the maximum total bus width. Thus G_NAND_bus can be calculated similarly to G_clk and G_on_chip_bus.

α, β, and γ are the coefficients for G_clk, G_on_chip_bus, and G_NAND_bus respectively. These terms control the search direction of QEA to bias it toward specific parameters.

7. Experiments

Based on the proposed methodology of Figure 2(a), we have developed an architecture exploration framework which consists of two sub-programs: the QEA algorithm written in C++ and the trace-driven simulator using the SystemC library [16]. The trace-driven simulator models the bus matrix and flash memory architecture at the cycle-accurate level. These programs communicate with each other via files in a predefined format during exploration.

To collect memory traces by TLM simulation with SoC Designer™, we have implemented the simulation models of DMA and Host interface at the cycle-accurate level. We have also implemented the firmware running on ARM7, which consists of a command handler and an FTL (Flash Translation Layer) to manage data transfers from/to flash memories.

As test scenarios for the exploration, we used data transfers writing/reading a 64 KB stream to/from a memory card using the 'WRITE_MULTIPLE_BLOCK'/'READ_MULTIPLE_BLOCK' commands in the MMC specification. The parameters α, β, and γ in equation (1) were set to 4.0, 2.0, and 1.0 respectively to prioritize lower clock frequency. All experiments were conducted on a Linux workstation with a 3.0-GHz Xeon processor and 4.0 GB of main memory.

The first set of experiments obtains the optimal architectures from the exploration framework for each combination of {throughput constraint, double buffer size}. For the 'READ_MULTIPLE_BLOCK' command, we varied the throughput constraint from 4 to 6 MB/s. The left graph in Figure 6 shows the clock frequency of the architectures, and the right graph shows the number of bus segments in a bus matrix and the number of SRAMs.

Figure 6. The architecture configurations from the exploration: the read-command transferring 64 KB.
(Left graph: clock (MHz). Right graph: number of bus segments and number of SRAMs. Horizontal axes: throughput 4, 5, 6 MB/s for each buffer size 512 B, 1 KB, 2 KB.)

It is observed that the double buffer size plays a critical role in reducing the clock speed. For 4 MB/s throughput, the required clock speed for the architecture with a 512-byte double buffer is almost twice that of the architecture with a 2 KB double buffer. Furthermore, the clock frequency increases faster to meet a higher throughput constraint as the double buffer becomes smaller.

The result shows that we need two or three on-chip SRAM memories to satisfy a given throughput constraint. Although we could not draw the internal structure of the bus matrix and SRAM configuration due to space limitations, LMB1 (ARM7 local memory) and LMB4 (shared memory between ARM7 and DMA) are assigned to different SRAMs in all cases. This is because most memory accesses come from the local memory access of ARM7 and the data stream transfer by DMA, which occur simultaneously.

For the flash memory configuration, a single flash memory was selected because there is no need to hide the page program latency of the NAND flash memory during read operations. The benefit of multiple banks is not significant for read operations.

Figure 7. The architecture configurations from the exploration: the write-command transferring 64 KB.
(Left graph: clock (MHz). Right graph: number of bus segments and number of SRAMs. Horizontal axes: throughput 2, 3, 4 MB/s for each buffer size 512 B, 1 KB, 2 KB.)

For 'WRITE_MULTIPLE_BLOCK' commands, we lowered the throughput constraint to 2, 3, and 4 MB/s while keeping the other parameters the same. Figure 7 displays the exploration results. Again the size of the double buffer affects the system performance significantly. Note that we could not obtain an architecture satisfying 4 MB/s throughput with a 512-byte double buffer at any clock frequency up to 200 MHz.

When the write command is performed, the performance is not sensitive to the separation of the local memory of ARM7 and the memory shared with DMA, because the longer write latency of NAND flash memory reduces the bandwidth requirement of DMA. On the other hand, the flash memory configuration affects the performance significantly. Since we gave more weight to clock frequency in the experiments, the multi-channel configuration is preferred to interleaving. The number of banks depends on the double buffer size and the throughput constraint: 4 banks are needed in most cases, as shown in Table 1.
Table 1. The obtained flash memory configurations.

Double buffer size             | 512 bytes | 1 KB      | 2 KB
Throughput constraint (MB/s)   | 2   3   4 | 2   3   4 | 2   3   4
Flash mem. arch.               | Multi-channel
Num. of flash mem.             | 4   4   4 | 4   4   4 | 2   2   4

We can obtain the Pareto-optimal solution set with various parameters. Figure 8 illustrates how the flash memory configuration and on-chip bus architecture affect the clock frequency needed to meet the performance requirement. The X-axis and Y-axis of each graph correspond to the flash memory configuration and the number of bus segments respectively, while the Z-axis corresponds to the clock frequency required to meet the constraints. Configurations with zero clock frequency mean that no solution exists.

Figure 8. The pareto-optimal solution space of clock frequency.
(X-axis: flash memory architecture, labeled by total bus width: 16: single flash memory, 24: 2-way interleaving, 32: 2-channel, 40: 4-way interleaving, 64: 4-channel. Y-axis: number of bus segments. Z-axis: clock (MHz). Panels: buffer size 512 B with throughput constraint 4 MB/s; buffer size 1 KB with throughput constraint 2 MB/s; buffer size 2 KB with throughput constraint 4 MB/s.)

We observed that, with the same number of flash memories, the multi-channel architecture achieves a lower clock than interleaving at the cost of more bus wires. The effects of the on-chip bus architecture differ from case to case. For example, with a 1 KB double buffer and 2 MB/s throughput, the clock frequency is not sensitive to the number of bus segments. In contrast, the clock frequency decreases as the number of bus segments increases in the case of a 2 KB buffer and a 4 MB/s throughput constraint. This implies that there is no deterministic rule for the effects of the parameters on performance. Thus the proposed exploration methodology is valuable for architecture optimization with many design parameters.

It took less than an hour to explore the wide design space for a given set of memory traces. The speed of trace-driven simulation was about 450 Kcycles/sec.

8. Conclusion

In this paper, we have presented an architecture exploration methodology for low-end embedded systems. Since cost-effective implementation is critical in low-end embedded systems, the proposed methodology considers low-level architecture details and uses trace-driven simulation for fast cycle-accurate performance estimation. It consists of two steps: memory trace generation and architecture exploration using QEA. We applied the proposed methodology to a NAND flash-based memory card as a case study. The experiments show the optimal architecture configurations with varying performance constraints and double buffer sizes. We plan to apply the proposed methodology to other low-end embedded systems such as SSD (Solid-State Disk). We will consider power as another optimization factor in the future.

Acknowledgments

The authors would like to thank Hae-woo Park and Jinwoo Kim for their help with the experiments. This work was supported by the BK21 project, Samsung Electronics, and the Creative Research Initiative sponsored by the KOSEF research program (R17-2007-086-01001-0). The ICT and ISRC at Seoul National University and IDEC provided research facilities for this study.

References

[1] K. Lahiri, A. Raghunathan, and S. Dey, "Design space exploration for optimizing on-chip communication architectures," IEEE TCAD, vol. 23, no. 6, Jun. 2004.
[2] S. Kim and S. Ha, "Efficient exploration of bus-based System-on-Chip architectures," IEEE TVLSI, vol. 14, no. 7, pp. 681-692, Jul. 2006.
[3] S. Pasricha, N. Dutt, and M. Ben-Romdhane, "Constraint-driven bus matrix synthesis for MPSoC," in Proc. ASP-DAC, pp. 30-35, Jan. 2006.
[4] S. Murali, L. Benini, and G. De Micheli, "An application-specific design methodology for on-chip crossbar generation," IEEE TCAD, vol. 26, no. 7, pp. 1283-1296, Jul. 2007.
[5] F. Douglis, R. Caceres, F. Kaashoek, K. Li, B. Marsh, and J.A. Tauber, "Storage alternatives for mobile computers," in Proc. OSDI, pp. 25-37, Nov. 1994.
[6] Gartner Dataquest,
[7] RealView SoC Designer, ARM Inc.,
[8] S. Hong, S. Yoo, S. Lee, S Lee, H. J. Nam, B.-S. Yoo, J. Hwang, D. Song, J. Kim, J. Kim, H. Jin, K.-M. Choi, J.-T. Kong, and S. Eo, "Creation and utilization of a virtual platform for embedded software optimization: An industrial case study," in Proc. CODES+ISSS, pp. 235-240, Oct. 2006.
[9] Y. Joo, Y. Choi, C. Park, S. Chung, E. Chung, and N. Chang, "Demand paging for OneNAND™ Flash eXecute-In-Place," in Proc. CODES+ISSS, pp. 229-234, Oct. 2006.
[10] J.-U. Kang, J.-S. Kim, C. Park, H. Park, and J. Lee, "A multi-channel architecture for high-performance NAND flash-based storage system," in Journal of System Architecture, vol. 53, no. 9, pp. 644-658, Feb. 2007.
[11] S. Min and E. Nam, "Current trends in flash memory technology," in Proc. ASP-DAC, pp. 332-333, Jan. 2006.
[12] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, May 1983.
[13] A. S. Fraser, "Simulation of genetic systems by automatic digital computers," Australian Journal of Biological Sciences, vol. 10, pp. 484-491, 1957.
[14] K. Han and J. Kim, "Quantum-inspired evolutionary algorithms with a new termination criterion, Hε Gate, and two phase scheme," IEEE TEVC, vol. 8, no. 2, pp. 156-169, Apr. 2004.
[15] Multimedia Card Association and System Specification,
[16] SystemC Language Reference Manual, ver 2.1. (2005, May).