Layered Spiral Algorithm for Memory-Aware Mapping and Scheduling on Network-on-Chip

Shuo Li, Fahimeh Jafari, Ahmed Hemani
Department of Electronic Systems, School of Information and Communication Technology, Royal Institute of Technology, Stockholm, Sweden
Email: {shuol, fjafari, hemani}

Shashi Kumar
School of Engineering, Jönköping University, Jönköping, Sweden

978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: This paper proposes the Layered Spiral Algorithm (LSA) for memory-aware application mapping and scheduling onto Network-on-Chip (NoC) based Multi-Processor Systems-on-Chip (MPSoC). LSA optimizes energy consumption while keeping task-level parallelism high. The experimental evaluation indicates that memory overflows may occur if memory-awareness is not considered during mapping and scheduling. The underlying problem is also modeled as a Mixed Integer Linear Programming (MILP) problem and solved with an efficient branch-and-bound algorithm, so that optimal solutions can be compared with the results achieved by LSA. Compared to the MILP solutions, the LSA results show only about 20% and 12% higher total communication cost for a small and a medium-sized synthetic problem, respectively, while LSA is orders of magnitude faster. LSA therefore finds an acceptable total communication cost with low run-time complexity, enabling quick exploration of large design spaces for which exhaustive search is infeasible.

I. INTRODUCTION

Even though Network-on-Chip (NoC) was introduced a decade ago [1], programming NoC platforms remains arduous because the application mapping and scheduling problem is NP-hard [2]. Existing compilation tools are not suitable in the NoC context [3]. In addition, since memory is critical in the NoC design process [4], memory requirements and availability should not be ignored. This memory-awareness increases the complexity of the application mapping and scheduling problem even further. Because the solution space is enormous, problem-specific algorithms are preferable to exhaustive search for obtaining acceptable solutions in a reasonable time. In this paper, we describe (1) the Memory-Aware Communication Task Graph (MACTG) to model applications, (2) the Platform Architecture Graph (PAG) to model NoC-based Multi-Processor System-on-Chip (MPSoC) platforms, (3) a problem-specific heuristic, the Layered Spiral Algorithm (LSA), to quickly map and schedule an application characterized by a MACTG onto an architecture characterized by a PAG, and (4) a Mixed Integer Linear Programming (MILP) formulation of the underlying problem, which we solve with a branch-and-bound algorithm. The efficacy of the proposed algorithm is then established by comparing the MILP solutions with the LSA results.

The rest of this paper is organized as follows. Section II introduces related work. Section III formulates the application mapping and scheduling problem based on the application and platform models. Section IV describes LSA. We model the underlying problem as a MILP in Section V and evaluate our approach in Section VI. Section VII gives conclusions and future work.

II. RELATED WORK

Several mapping algorithms [5][6] have been proposed without memory-awareness and without covering scheduling. In [7], scheduling is covered together with mapping, but without memory-awareness. As claimed in [4], memory is critical in the NoC design process and should therefore be considered when mapping and scheduling practical applications onto practical platforms. In [8], memory-awareness is covered but scheduling is not considered.

In this paper, we propose a very fast algorithm called the Layered Spiral Algorithm (LSA) to solve the memory-aware application mapping and scheduling problem. The objective of LSA is to minimize total energy consumption while keeping task-level parallelism high. The proposed algorithm is based on the spiral algorithm [9], a very fast mapping algorithm without memory-awareness. We extend the spiral algorithm with memory-aware concepts and task layers to cover both memory-awareness and scheduling. The paper demonstrates that LSA is able to solve large-scale problems with acceptable accuracy. Although dynamic mapping outperforms static mapping in terms of platform utilization [10], it requires extra control logic. We therefore consider static application mapping and scheduling in this paper for simplicity; a dynamic extension is planned as future work.

III. PROBLEM FORMULATION

This section models the underlying application and platform. The objective function of the memory-aware application mapping and scheduling problem is then formulated.

A. Application Model

We use a variant of the communication task graph, called the Memory-Aware Communication Task Graph (MACTG), to model an application. Task execution is modeled as an indivisible process with three non-overlapping phases: an input phase for collecting input data, a computing phase for computing output data, and an output phase for sending out output data. The upper part of Table I lists the notations of a MACTG and their descriptions. The data memory requirement is the data memory required in the computing phase. Fig. 1(a) shows an example of a MACTG.
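A MACTG of this kind can be held in a small data structure. The sketch below is illustrative only: the names `Task`, `MACTG`, `io_volume`, and `build_example` are hypothetical and do not come from the paper. The numbers are the execution times, data memory requirements, and communication volumes that the paper lists for its Fig. 1(a) example in Tables II and III, assuming (as Section III-C states) that the memory requirement of a memory access task equals the communication volume of the edge it replaces.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """A normal task with its execution time and data memory requirement."""
    et_us: int    # execution time ET_i in microseconds
    dmr_kb: int   # data memory requirement DMR_i in KB (computing phase)


@dataclass
class MACTG:
    """Tasks plus directed communication edges weighted by data volume."""
    tasks: dict = field(default_factory=dict)   # task id -> Task
    edges: dict = field(default_factory=dict)   # (producer, consumer) -> KB

    def add_edge(self, src, dst, volume_kb):
        if src not in self.tasks or dst not in self.tasks:
            raise KeyError("both endpoints must be known tasks")
        self.edges[(src, dst)] = volume_kb

    def io_volume(self, task_id):
        """Total input plus output data volume of one task, in KB."""
        return sum(v for (s, d), v in self.edges.items() if task_id in (s, d))


def build_example():
    """Build the ten-task example from the values in Tables II and III."""
    g = MACTG()
    et_dmr = {0: (1, 2), 1: (1, 2), 2: (9, 16), 3: (2, 2), 4: (2, 4),
              5: (2, 4), 6: (2, 4), 7: (1, 2), 8: (1, 2), 9: (4, 16)}
    for tid, (et, dmr) in et_dmr.items():
        g.tasks[tid] = Task(et, dmr)
    volumes = {(0, 1): 1, (0, 2): 1, (1, 3): 2, (1, 4): 3, (2, 4): 3,
               (2, 5): 1, (3, 6): 2, (4, 9): 1, (5, 7): 2, (5, 8): 2,
               (6, 9): 2, (7, 9): 3, (8, 9): 2}
    for (src, dst), kb in volumes.items():
        g.add_edge(src, dst, kb)
    return g


g = build_example()
print(len(g.tasks), len(g.edges))        # number of tasks and edges
print(sorted([3, 4, 5], key=g.io_volume, reverse=True))
```

Sorting tasks t3, t4, and t5 by descending `io_volume` gives the order 4, 5, 3, which agrees with the priority rule for normal tasks in Section IV-B and with the corresponding entry of Table IV.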
TABLE I. MACTG AND PAG NOTATIONS

Notation            Description
$T = \{t_i\}$       Set of tasks
$ET_i$              Execution time of $t_i$
$DMR_i$             Data memory requirement of $t_i$
$E = \{e_{ij}\}$    Set of communication edges
$N = \{n_j\}$       Set of PM nodes
$MS_j$              Total data memory size on $n_j$
$SMax_j$            Maximum distributed shared memory size on $n_j$
$B$                 Communication channel bandwidth
$CH = \{ch_{ij}\}$  Entire communication channel set

[Fig. 1. MAG example: (a) MACTG with normal tasks t0 to t9 connected by communication edges annotated with data volumes; (b) PAG with PM nodes n0 to n3, each containing a processor and a data memory, connected by communication channels; (c) MAG with normal tasks and memory access tasks.]

B. Platform Model

We model the target 2D homogeneous mesh NoC-based MPSoC by a Platform Architecture Graph (PAG). The platform consists of Processor Memory (PM) nodes and communication channels. Each PM node consists of one single-threaded processor and one data memory. The data memory is divided into two parts: local memory for the local processor and distributed shared memory for remote processors. The sizes of both parts are tunable during application execution, provided that the total size stays constant and the distributed shared memory stays below its maximum size, which is smaller than the total memory size. Introducing distributed shared memory makes efficient memory partitioning a challenge. Cache partitioning addresses a related problem from a slightly different angle: it aims to minimize cache misses, whereas memory partitioning aims to minimize access time and energy. We model PM nodes as vertices and communication channels as arcs in a PAG. The lower part of Table I lists the notations of a PAG and their descriptions, and Fig. 1(b) illustrates a sample PAG.

C. Memory Access Graph

The Memory Access Graph (MAG) describes data communication. We define a type of dummy task, called a memory access task, with zero execution time and a memory requirement equal to the communication data volume between the two corresponding normal tasks. Memory access task $t_{ij}$ represents the data communication from $t_i$ to $t_j$. Mapping $t_{ij}$ means locating the output data of $t_i$ on one or multiple PM nodes. This data will be used by $t_j$ as input data. In effect, each memory access task holds the input/output data storage information of two directly related normal tasks. For example, t4 requires data from t1 and t2. The corresponding memory access tasks are t14 and t24, as illustrated in the MAG example in Fig. 1(c). With memory access tasks, the inputs and outputs from and to multiple tasks are modeled separately, so we do not have to state explicitly where the input/output data is located and which tasks fetch or produce it. Hence, in the rest of the paper, we consider two subsets $T_N$ and $T_A$ containing the normal and memory access tasks, respectively, such that $T = T_N \cup T_A$. A MACTG is converted to a MAG by replacing each communication edge with a memory access task and two new communication edges.

Since each task in a MAG can be mapped onto several PM nodes, and the input/output data of each task has to be stored on one or multiple PM nodes, we introduce two new annotations for each task in a MAG. The first is a list of PM nodes. For a normal task, the first PM node in the list stores the program of the task and performs the computation during the computing phase, while the other PM nodes in the list only provide remote data memory for temporary and/or intermediate data during the computing phase. For a memory access task, it is the list of PM nodes that store the data. The second annotation is the list of data memory usages corresponding to the PM node list. For example, suppose normal task $t_i$ uses three PM nodes $n_x$, $n_y$, and $n_z$, and $n_y$ is used for computation. The PM node list is then {y, x, z}. Assume $t_i$ uses 8 KB in $n_y$ as local data memory and 4 KB each in $n_x$ and $n_z$ as remote distributed data memory. The memory usage list is {8 KB, 4 KB, 4 KB}, or {8, 4, 4}: if all entries share the same unit, the unit can be omitted from the list.

D. Constraint and Objective Function

In this subsection, we address the constraints and the objective function used in our problem formulation.

Constraints: Besides the constraints related to memory size, a deadline $time_d$ on the execution time of the application is considered. This constraint is a criterion for validating solutions. Let $time_0$ be the starting time of the application and $time_L$ be the finishing time of the last task in the application. The time constraint is then $time_L - time_0 \le time_d$.

Objective function: Our objective is to minimize the application's energy consumption while meeting the constraints above. For simplicity, we assume that the leakage current is negligible [11], which means that the total execution energy is fixed; to minimize the total energy consumption it is therefore enough to minimize the communication energy consumption. The system's communications comprise inter-task and intra-task communications. Inter-task communication is the data communication during the input and output phases, while intra-task communication is the data communication during the computing phase.

To illustrate inter-task and intra-task communication, consider the example of Fig. 1(c). Task t1 needs data produced by t0, which is modeled by t01, and t1 produces data for some other tasks. Suppose the application mapping and scheduling algorithm produces a solution in which the PM node list of
t0 is {0}, t01 is {1}, and t1 is {2, 3}. In this case, the inter-task communications are the data transfer from n0 (on which task t0 runs) to n1 (where the output data of t0, which is the input data of t1, is located) and the transfer from n1 to n2 (on which task t1 is executed). The intra-task communication is the data transfer between n2 (on which task t1 is executed) and n3 (which allocates remote shared data memory to task t1). We assume that the length of the communication path between any two PM nodes $i$ and $j$, $l_{ij}$, is proportional to the length of the shortest path between them, $h_{ij}$. For all communications we can therefore write:

$$ l_{ij} = w_R \cdot h_{ij} \qquad (1) $$

where $w_R$ is a constant routing factor.

We also assume that the energy consumption of a communication between two nodes $i$ and $j$, $En_{ij}$, is proportional to $CV_{ij} \cdot l_{ij}$, where $CV_{ij}$ is the communication data volume between these two nodes. Considering Eq. (1), we have $En_{ij} \propto CV_{ij} \cdot h_{ij}$. We call $CV_{ij} \cdot h_{ij}$ the communication cost between the two PM nodes $i$ and $j$. To minimize the total energy consumption, we can therefore minimize the sum of all individual intra-task and inter-task communication costs of the application executed on the given platform.

Before that, we need to introduce several other notations. The data memory sizes allocated on node $k$ to normal task $i$ and to memory access task $j$ are denoted by $x^N_{ik}$ and $x^A_{jk}$, respectively. The dependency task set of a task contains all tasks on which that task directly depends. The dependency task set of $t_i$ is denoted by $T^d_i$. For example, in Fig. 1(c), $T^d_1 = \{t_{01}\}$ and $T^d_{01} = \{t_0\}$.

As described before, the mapping of each memory access task $t_{ij}$ specifies where and how much memory is allocated to the output data of normal task $t_i$, which is the input data of normal task $t_j$. For inter-task communication, the communication data is therefore just the input/output data needed or produced by normal tasks. The total inter-task communication cost accordingly consists of the total communication cost for writing output data, $C^{out}$, and for reading input data, $C^{in}$. In the example above, the communication cost for writing the output data modeled by memory access task t01 is $C^{out}_{01} = CV_{0,1} \cdot h_{0,1} \cdot w_R$. Since the memory size required by t01 in node n1 equals $CV_{0,1}$, we can write $C^{out}_{01} = x^A_{01,1} \cdot h_{0,1} \cdot w_R$. To calculate $C^{out}$ for an application, it is therefore enough to compute $C^{out}_i$ for all $i \in T_A$. Assuming task $i \in T_A$ is mapped on node $n^m_i$ and task $s \in T^d_i$ is mapped and executed on node $n^r_s$, the total communication cost for writing output data is:

$$ C^{out} = \sum_{i \in T_A} \sum_{s \in T^d_i} x^A_{i n^m_i} \cdot h_{n^m_i n^r_s} \cdot w_R \qquad (2) $$

In the same example, normal task t1 needs data from normal task t0, which is modeled by memory access task t01. The communication cost for reading the input data of t1 is therefore $C^{in}_1 = CV_{1,2} \cdot h_{1,2} \cdot w_R$. Likewise, since $CV_{1,2} = x^A_{01,1}$, we can write $C^{in}_1 = x^A_{01,1} \cdot h_{1,2} \cdot w_R$. To calculate $C^{in}$ for an application, it is enough to compute $C^{in}_i$ for all $i \in T_N$. Assuming task $i \in T_N$ is mapped and executed on node $n^r_i$ and task $s \in T^d_i$ is mapped on node $n^m_s$, the total communication cost for reading input data is:

$$ C^{in} = \sum_{i \in T_N} \sum_{s \in T^d_i} x^A_{s n^m_s} \cdot h_{n^r_i n^m_s} \cdot w_R \qquad (3) $$

Hence, the total inter-task communication cost, $C^{inter}$, is equal to $C^{out} + C^{in}$.

Assume normal task $t_i$ is mapped on the set of PM nodes $N^T_i$ and runs on node $n^r_i \in N^T_i$. It then has intra-task communication with the pieces of remote shared data memory on each node $k \in N^T_i$, $k \ne n^r_i$. The total intra-task communication cost for normal task $t_i$, $C^{intra}_i$, is:

$$ C^{intra}_i = \sum_{k \in N^T_i} CV_{n^r_i,k} \cdot h_{k n^r_i} \cdot w_R \cdot w_a \qquad (4) $$

where $w_a$ is a weight factor that models in a simple way how many times, on average, a normal task accesses its remote data memory, since it may access that memory more than once. Clearly, $CV_{n^r_i,k}$ is equal to the data memory allocated to task $t_i$ on node $k$, $x^N_{ik}$. To compute $C^{intra}$ for an application, $C^{intra}_i$ has to be calculated for all $i \in T_N$. The total intra-task communication cost of the application can therefore be written as:

$$ C^{intra} = \sum_{i \in T_N} \sum_{k \in N^T_i} x^N_{ik} \cdot h_{k n^r_i} \cdot w_R \cdot w_a \qquad (5) $$

The total communication cost of the application is thus $C = C^{inter} + C^{intra}$, and our objective is to minimize it.

IV. ALGORITHM DESCRIPTION

A. Spiral Algorithm

Since LSA is an extension of the existing spiral core mapping algorithm [9], we briefly describe the spiral algorithm in this subsection. In this algorithm, IP cores are mapped onto the tiles of the mesh platform based on a Task Priority List (TPL), which lists the tasks in the order of priority in which they should be mapped. In a mesh topology, the inner switches have a higher connection degree than the boundary switches. This provides more connectivity to the neighboring switches and forms a Platform Priority List (PPL), which starts from the center of the mesh platform and ends at a boundary switch in a spiral fashion. (In LSA we call it the Node Priority List (NPL) to reflect the task-node mapping.) The priority assignment policy is expressed in the following rules.

a) Tasks that have larger data transfer sizes should be placed as close as possible to each other to satisfy the bandwidth constraint.
b) Tasks that are tightly related to each other should have the least possible Manhattan distance on the mesh platform.
c) Tasks that have high connection degrees should not be placed on the boundaries. For these tasks, the central area of the mesh is the best candidate.

All IP cores are mapped onto tiles of the mesh platform based on the TPL, the PPL, and the above rules, starting from the IP core of the task with the highest priority in the TPL.

B. LSA Algorithm Description

The inputs of LSA are an application modeled by a $MACTG = G_m(T, E)$ and a platform modeled by a $PAG = G_p(N, CH)$. The output of LSA is the application mapping and scheduling.

The intermediate results of the algorithm are:

1) Task layer: a set of independent tasks. Each task in task layer $k$ has its dependency tasks in task layers $k-1, \ldots, 0$. There are two different kinds of task layers, normal and memory
access layers. The former contain normal tasks and the latter contain memory access tasks. No task layer contains both normal and memory access tasks. The set of normal layers is denoted by $NL$ and the set of memory access layers by $AL$.

2) Task priority list (TPL): Each task layer has a TPL, which is an ordered set of the tasks of that layer. Tasks are ordered by priority, so a task with higher priority is placed in front of a task with lower priority. The task priorities are specified as follows:

• For normal tasks: a task with a larger total input and output data volume has higher priority.
• For memory access tasks: the priority of a memory access task depends on the priorities of its dependency (parent) task and its child task, which are both normal tasks. To compare two memory access tasks $t_{ix}$ and $t_{jy}$, we first check the priorities of $t_x$ and $t_y$. If $t_x$ has higher priority than $t_y$, the memory access task priority list is {$t_{ix}$, $t_{jy}$}. If $x = y$, we check the priorities of $t_i$ and $t_j$. If the priority of $t_i$ is higher than that of $t_j$, the memory access task priority list is {$t_{ix}$, $t_{jy}$}. If the child task of a memory access task is not in the layer immediately following the current layer, this memory access task has lower priority than the other memory access tasks in the same layer.

3) Node priority list (NPL): an ordered set of all nodes in which nodes located closer to the center of the mesh have higher priority. If several nodes are equally close to the center of the mesh, the priority list is formed in a spiral style starting from the bottom-right corner.

4) Dependency node set: The dependency node set of a task contains all nodes assigned to the tasks on which that task directly depends. The dependency node set of $t_i$ is denoted by $N^d_i$.

We also introduce the terms execution-available and storage-available for PM nodes, meaning that a PM node is ready to execute a task and that it has data memory available for remote or local access, respectively. Note that an execution-available PM node is also a storage-available PM node. LSA works as shown in Algorithm 1.

Algorithm 1 Layered spiral algorithm pseudo-code
 1: G_m(T, E) → MAG = G_l(T, E);
 2: L_t = task layers(G_l(T, E));
 3: for all task layer l_i in L_t do
 4:     l_i^P = task priority list(l_i);
 5: end for
 6: L_n = node priority list(G_p(N, CH));
 7: Map(l_0);
 8: for task layer 1 to last layer do
 9:     for all t_i in l_i^P do
10:         if count(N_i^d) = 1 then
11:             single dependency mapping(t_i);
12:         else
13:             multiple dependencies mapping(t_i);
14:         end if
15:     end for
16: end for

Line 1 of the algorithm converts the input MACTG $G_m(T, E)$ to the MAG $G_l(T, E)$, and line 2 extracts the task layers. In lines 3-5, the TPL of each task layer is generated. After that, the node priority list of the PAG is built in line 6. Line 7 maps the first task layer onto the platform using the spiral algorithm. Lines 8-16 map the remaining task layers onto the platform as follows.

First, the dependency node set $N^d_i$ of each task $t_i$ in the task priority list is computed. If $N^d_i$ contains only one node $n_d$, LSA tries to map $t_i$ onto $n_d$. If $n_d$ is not an execution-available node, LSA tries to map $t_i$ onto the execution-available PM node with the highest priority among all execution-available PM nodes. If no single PM node has enough available memory for $t_i$, LSA tries to map the execution of $t_i$ onto $n_d$ and its remote data memory onto storage-available PM nodes with as high a priority as possible. If this attempt fails, LSA tries to map the execution of $t_i$ onto the execution-available PM node with the highest priority among all execution-available nodes, and the remote data memory of $t_i$ onto storage-available PM nodes with as high a priority as possible. If this mapping also fails, LSA fails. This procedure is called single dependency mapping in Algorithm 1. If $N^d_i$ contains multiple nodes, LSA follows the above procedure for each node $n_d \in N^d_i$ and collects the cost of each resulting mapping of $t_i$. It then compares the costs and picks the solution with the minimal cost. This procedure is called multiple dependencies mapping in Algorithm 1. Finally, the mapping and scheduling results are recorded. If the total time fulfills the timing constraint, the algorithm succeeds; otherwise the algorithm cannot solve the problem.

C. Application Mapping and Scheduling Example

Consider a simple example of mapping the application shown in Fig. 1(a) onto the platform shown in Fig. 1(b). The numbers next to the arrows are the communication volumes between two tasks, and the task annotations are listed in Table II.

We make the following assumptions about the platform: the bandwidth is 1 GB/sec, the data memory size is 8 KB, and at most 6 KB of data memory can be shared.

The data memory requirements in KB of the memory access tasks are listed in Table III. The TPL of each layer is built by sorting the tasks such that tasks with a higher total communication volume have higher priority. Table IV lists the TPL of each task layer. Since all nodes have the same connectivity, the NPL is {n3, n2, n0, n1}, which lists all nodes in a spiral style.

TABLE II. ANNOTATIONS FOR THE MACTG EXAMPLE SHOWN IN FIG. 1(A)

Task   ET (µs)   DMR (KB)      Task   ET (µs)   DMR (KB)
0      1         2             5      2         4
1      1         2             6      2         4
2      9         16            7      1         2
3      2         2             8      1         2
4      2         4             9      4         16

TABLE III. DATA MEMORY REQUIREMENTS (KB) FOR MEMORY ACCESS TASKS

Task   DMR      Task   DMR      Task   DMR      Task   DMR
01     1        24     3        57     2        89     2
02     1        25     1        58     2
13     2        36     2        69     2
14     3        49     1        79     3

Then we map task layer 0, which contains only t0. This task is mapped onto n3 and uses 2 KB of data memory in
n3. Now n3 has 6 KB of data memory available. Then we map task layer 1, which contains {t01, t02}. t01 occupies 1 KB and t02 also occupies 1 KB. Both t01 and t02 are mapped onto n3. The memory assigned to t0 is still held in n3, so n3 now has 4 KB available. Then we map task layer 2, which contains {t1, t2}. t1 is mapped onto n3 (2 KB) and t2 is mapped onto n2 (8 KB), n0 (4 KB), and n3 (4 KB). The memory assigned to t0 in n3 is freed. Now n3 and n2 have no data memory available and n0 has 4 KB available. In the same way, we map all task layers and obtain the results in the LSA part of Table V. The time parameter in the table is the LSA computation time. With $w_a = 2$ and $w_R = 1$, the total communication cost is $C = 65$ µs and the total application execution time is $time_L - time_0 = 74$ µs.

TABLE IV. TASK LAYERS

Layer   Priority List         Layer   Priority List         Layer   Priority List
0       {0}                   3       {14, 24, 25, 13}      6       {7, 6, 8}
1       {01, 02}              4       {4, 5, 3}             7       {79, 69, 89}
2       {1, 2}                5       {57, 36, 58, 49}      8       {9}

V. MILP FORMULATION

To evaluate the capability of our method, we formulate the underlying problem as a MILP problem that maps tasks onto a generic regular NoC architecture and, based on the task schedule, allocates the required data memory of each task on the PM nodes so as to minimize the total communication cost while satisfying the constraints of the network. In addition to the notations introduced earlier, we need the following. $TaskL(l)$ is the set of independent tasks in layer $l$. $InTaskL(l)$ is the set of dependency tasks of all tasks of layer $l$. For instance, $InTaskL(2)$ in the example of Section IV-C is {01, 02}. The cost minimization MILP problem, Minimize-Cost-MILP, can then be formulated as follows.

Given a $MAG = G_l(T, E)$; the hop count matrix of the shortest paths between any two nodes, $H = \{h_{ij} \mid i, j \in N\}$; the memory size and the maximum shared memory size that each node $j$ can allocate to tasks in the network, denoted by $MS_j$ and $SMax_j$, respectively; and the data memory requirement of each normal task $i$, $DMR^N_i$, and of each memory access task $i$, $DMR^A_i$. Find the mapping matrix $Y = \{y_{ij} \mid y_{ij} = 0 \text{ or } 1;\ \forall i \in T_N,\ \forall j \in N\}$ and the size of the data memory allocated on each node $j$ to every normal task $i$ and every memory access task $i$, $x^N_{ij}$ for $\forall i \in T_N,\ \forall j \in N$ and $x^A_{ij}$ for $\forall i \in T_A,\ \forall j \in N$, respectively, such that

$$ \min_{y_{ij},\, x^N_{ij},\, x^A_{ij}} \Big[ \sum_{i \in T_N} \sum_{j \in N} \sum_{k \in N} x^N_{ij} h_{jk} y_{ik} w_R w_a + \sum_{i \in T_N} \sum_{s \in T^d_i} \sum_{j \in N} \sum_{k \in N} x^A_{sk} h_{jk} y_{ij} w_R + \sum_{i \in T_A} \sum_{s \in T^d_i} \sum_{j \in N} \sum_{k \in N} x^A_{ij} h_{jk} y_{sk} w_R \Big] \qquad (6) $$

subject to:

$$ \sum_{i \in TaskL(l)} x^N_{ij} (1 - y_{ij}) \le SMax_j \qquad \forall j \in N,\ \forall l \in NL \qquad (7) $$

$$ \sum_{i \in TaskL(l)} x^N_{ij} + \sum_{s \in InTaskL(l)} x^A_{sj} \le MS_j \qquad \forall j \in N,\ \forall l \in NL \qquad (8) $$

$$ \sum_{i \in TaskL(l)} x^A_{ij} + \sum_{s \in InTaskL(l)} x^N_{sj} \le MS_j \qquad \forall j \in N,\ \forall l \in AL \qquad (9) $$

$$ \sum_{i \in TaskL(l)} y_{ij} \le 1 \qquad \forall j \in N,\ \forall l \in NL \qquad (10) $$

$$ \sum_{j \in N} y_{ij} = 1 \qquad \forall i \in T_N \qquad (11) $$

$$ \sum_{j \in N} x^N_{ij} y_{ij} > 0 \qquad \forall i \in T_N \qquad (12) $$

$$ \sum_{j \in N} x^N_{ij} = DMR^N_i \qquad \forall i \in T_N \qquad (13) $$

$$ \sum_{j \in N} x^A_{ij} = DMR^A_i \qquad \forall i \in T_A \qquad (14) $$

where $y_{ij}$, $x^N_{ij}$, and $x^A_{ij}$ are the optimization variables.

Eq. (6) is the objective function of the optimization problem, which minimizes the total communication cost. Constraint (7) states that the data memory allocated on node $j$ to the tasks of a normal layer $l$ that do not run on $j$ cannot exceed the shared memory size of that node. Constraints (8) and (9) state that the data memory allocated on each node $j$ to the tasks of a layer (in $NL$ and $AL$, respectively) and to their dependency tasks cannot exceed the memory size of that node. At most one task of each normal layer $l$ can be executed on a node $j$, and each normal task $i$ can run on only one node; these constraints are given in (10) and (11), respectively. Constraint (12) states that each normal task $i$ is allocated part or all of its data memory requirement on the node on which it runs. Constraints (13) and (14) state that the data memory allocated to each task $i \in T_N$ or $T_A$ over all nodes equals its data memory requirement.

Note that the problem above is a quadratic problem [12]. However, since each product multiplies an integer variable by a binary variable, the Big M technique [13] can be used to convert it into a mixed integer linear program. We therefore solve the proposed Minimize-Cost-MILP problem using an efficient branch-and-bound algorithm.

VI. EXPERIMENTAL RESULT

Since no memory-aware task graph benchmarks exist, we use the tgff tool [14] to generate synthetic MACTGs. Five task graphs are randomly generated to test our LSA implementation, which is written in C#. The program is run on a PC with a 2.8 GHz Intel i7 CPU and 8 GB of main memory. For all case studies, we assume that $w_R$ and $w_a$ are equal to 1 and 2, respectively, and that the bandwidth is 1 GB/sec.

A. Comparing with MILP Problem

To evaluate the capability of our method, we solve the MILP problem of minimizing the total communication cost subject to the constraints above. We apply LSA and the Minimize-Cost-MILP problem to two synthetic task graphs, which are mapped to a 4 × 4 2D mesh network.

Table V shows the results of LSA and of the Minimize-Cost-MILP problem for the case study of Section IV-C. Compared with the Minimize-Cost-MILP problem, LSA yields a cost that is about 20% larger with a run time of only 7.92 ms. For a larger problem that maps and schedules a MACTG with 26 tasks, LSA runs for 8.56 ms and yields a cost of 3167 µs, while the
Minimize-Cost-MILP problem runs for about 3.5 hours and yields a cost of 2821 µs. LSA can therefore find an acceptable total communication cost in a very short time.

As the exploration space grows exponentially, the Minimize-Cost-MILP problem cannot be solved for larger problems within reasonable time and physical memory. In the rest of this section, we therefore obtain results for larger problems with the proposed LSA only.

TABLE V. COMPARISON BETWEEN LSA AND Minimize-Cost-MILP PROBLEM

        LSA                                 Minimize-Cost-MILP Problem
Task    PM Node List   Memory Usage List    PM Node List   Memory Usage List
0       {3}            {2}                  {1}            {2}
01      {3}            {1}                  {3}            {1}
02      {3}            {1}                  {1}            {1}
1       {3}            {2}                  {3}            {2}
13      {1}            {2}                  {3}            {2}
14      {0}            {3}                  {3}            {3}
2       {2, 0, 3}      {8, 4, 4}            {0, 1, 2}      {8, 7, 1}
24      {1}            {3}                  {2}            {3}
25      {3}            {1}                  {1}            {1}
3       {1}            {2}                  {3}            {2}
36      {1}            {2}                  {3}            {2}
4       {0}            {4}                  {2}            {4}
49      {0}            {1}                  {2}            {1}
5       {3}            {4}                  {1}            {4}
57      {3}            {2}                  {1}            {2}
58      {3}            {2}                  {1}            {2}
6       {1}            {4}                  {3}            {4}
69      {1}            {2}                  {3}            {2}
7       {3}            {2}                  {0}            {2}
79      {3}            {3}                  {0}            {3}
8       {2}            {2}                  {1}            {1}
89      {2}            {2}                  {1}            {2}
9       {2, 0, 3}      {6, 5, 5}            {1, 0, 3}      {6, 5, 5}
C       65 µs                               54 µs
time    7.92 ms                             10.91 s

B. Comparing with Non-memory-aware LSA

For the larger problems listed in Table VI, we solve them using LSA and compare the results with those of a non-memory-aware version of LSA (nLSA). nLSA is LSA under the assumption that the platform has infinite memory. In Table VI, we test LSA and nLSA with five test cases (Middle, Hard, Hard0, Hard1, and Hard2). The number of tasks of each test case is listed in Table VI. The target platform of each test case has 4 × 4 PM nodes, except for test cases Hard0 and Hard2, whose platforms have 8 × 8 PM nodes.

TABLE VI. EXPERIMENTAL RESULT

                          LSA                           nLSA
Test Case   Number of     Program Run Time   Cost       Program Run Time   Cost
            tasks
Middle      26            8.5654 ms          3167       9.3606 ms          1615
Hard        81            13.4606 ms         13788      11.2038 ms         6819
Hard0       147           18.9718 ms         27872      18.7562 ms         12713
Hard1       32            8.5231 ms          4679       10.4146 ms         1990
Hard2       815           246.6145 ms        165486     274.7816 ms        43494

The costs of the memory-aware and non-memory-aware results differ. The reason is that nLSA uses some nodes that do not have enough available memory, which leads to a smaller cost but an invalid solution. For example, in test case Middle, task t2 has a data memory requirement of 880 KB and each PM node in the target platform has only 800 KB of memory. LSA therefore maps t2 onto two PM nodes (n9 and n10), while nLSA maps it onto the single PM node n10.

VII. CONCLUSION AND FUTURE WORKS

LSA can find solutions with acceptable energy consumption in a very short time. As the exploration space grows exponentially with the number of tasks, large problems cannot be solved by a MILP solver within reasonable time and physical memory, so solutions can only be obtained with the proposed LSA. As shown in Table VI, the LSA run time for mapping and scheduling an application consisting of 815 tasks onto an 8 × 8 platform is still under 0.25 sec. By comparing memory-aware and non-memory-aware solutions, we conclude that memory requirements and availability are critical and should not be ignored when modeling applications and platforms.

In future work, we expect the concept of LSA to fit dynamic application mapping and scheduling problems as well. Another direction is to add more tunable options to LSA so that it can be optimized more toward parallelism or toward energy consumption. Implementation of task layer partitioning is also planned as future work.

REFERENCES

[1] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Öberg, M. Millberg, and D. Lindqvist, “Network on a chip: An architecture for billion transistor era,” in Proc. NORCHIP 2000, Turku, Finland, Nov. 2000.
[2] R. Pop and S. Kumar, “A survey of techniques for mapping and scheduling applications to network on chip systems,” School of Engineering, Jönköping University, Jönköping, Sweden, Tech. Rep. 04:4, Apr. 2004.
[3] G. Chen, F. Li, S. W. Son, and M. Kandemir, “Application mapping for chip multiprocessors,” in Proc. DAC ’08, Anaheim, California, USA, Jun. 2008, pp. 620–625.
[4] N. Dutt, “Memory-aware noc exploration and design,” in Proc. of Design, Automation and Test in Europe, 2008. DATE ’08, Munich, Germany, Apr. 2008, pp. 1128–1129.
[5] T. Lei and S. Kumar, “A two-step genetic algorithm for mapping task graphs to a network on chip architecture,” in Proc. Euromicro Symposium in Digital Systems Design (DSD), Belek, Turkey, Sep. 2003.
[6] S. Yang, L. Li, M. Gao, and Y. Zhang, “An energy- and delay-aware mapping method of noc,” Acta Electronica Sinica, vol. 36, no. 5, pp. 937–942, May 2008.
[7] H. Yu, Y. Ha, and B. Veeravalli, “Communication-aware application mapping and scheduling for noc-based mpsocs,” in Proc. of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), Paris, France, May/Jun. 2010, pp. 3232–
[8] M. Monchiero, G. Palermo, C. Silvano, and O. Villa, “Exploration of distributed shared memory architectures for noc-based multiprocessors,” in Proc. of International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 2006. IC-SAMOS 2006, Samos, Greece, Jul. 2006, pp. 144–151.
[9] A. Mehran, S. Saeidi, A. Khademzadeh, and A. Afzali-Kusha, “Spiral: A heuristic mapping algorithm for network on chip,” IEICE Electron. Express, vol. 4, no. 15, pp. 478–484, 2007.
[10] E. Carvalho, C. Marcon, N. Calazans, and F. Moraes, “Evaluation of static and dynamic task mapping algorithms in noc-based mpsocs,” in Proc. of International Symposium on System-on-Chip, 2009, Tampere, Finland, Oct. 2009, pp. 87–90.
[11] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, “Leakage current: Moore’s law meets static power,” Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.
[12] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[13] R. Rafeh, M. J. G. de la Banda, K. Marriott, and M. Wallace, “From zinc to design model,” in Proc. of 9th International Symposium of Practical Aspects of Declarative Languages (PADL), Nice, France, Jan. 2007, pp. 215–229.
[14] R. Dick, D. Rhodes, and W. Wolf, “Tgff: task graphs for free,” in Proc. of 6th International Workshop on Hardware/Software Codesign, 1998, CODES/CASHE ’98, Seattle, WA, USA, Mar. 1998, pp. 97–101.
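As a supplementary illustration of Sections III-C and IV-B, the conversion of a MACTG into a MAG and the extraction of task layers can be sketched in a few lines. This is not the authors' C# implementation: the function names `to_mag` and `task_layers` are hypothetical, and the layering rule used here (a task is placed one layer after its deepest direct dependency) is an assumption that happens to reproduce the layer contents of Table IV for the Fig. 1(a) example. Normal tasks are represented by integers and memory access task t_ij by the tuple (i, j).

```python
# Communication edges of the Fig. 1(a) example: (producer, consumer) -> KB,
# with volumes taken from Table III.
EDGES = {(0, 1): 1, (0, 2): 1, (1, 3): 2, (1, 4): 3, (2, 4): 3,
         (2, 5): 1, (3, 6): 2, (4, 9): 1, (5, 7): 2, (5, 8): 2,
         (6, 9): 2, (7, 9): 3, (8, 9): 2}


def to_mag(edges):
    """Replace each edge (i, j) by memory access task (i, j) and two edges.

    Returns a dict mapping every MAG task to the set of its direct
    dependency (parent) tasks.
    """
    parents = {}
    for (i, j) in edges:
        access = (i, j)                             # memory access task t_ij
        parents.setdefault(i, set())                # make sure t_i is listed
        parents.setdefault(access, set()).add(i)    # edge t_i  -> t_ij
        parents.setdefault(j, set()).add(access)    # edge t_ij -> t_j
    return parents


def task_layers(parents):
    """Group tasks into layers: layer 0 holds tasks without dependencies,
    every other task sits one layer after its deepest direct dependency."""
    layer_of = {}

    def depth(task):
        if task not in layer_of:
            layer_of[task] = 1 + max(
                (depth(p) for p in parents[task]), default=-1)
        return layer_of[task]

    for task in parents:
        depth(task)
    layers = [set() for _ in range(max(layer_of.values()) + 1)]
    for task, k in layer_of.items():
        layers[k].add(task)
    return layers


layers = task_layers(to_mag(EDGES))
for k, members in enumerate(layers):
    kind = "normal" if k % 2 == 0 else "memory access"
    print(k, kind, sorted(members))
```

In this example, normal layers and memory access layers alternate, and no layer mixes the two kinds of task, as the definition of task layers requires. Ordering the tasks inside each layer into a TPL would still require the priority rules of Section IV-B, which this sketch does not implement.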