Application Behavior-aware Flow Control
in Network-on-Chip
Advisor: Chung-Ta King
Student: Huan-Yu Liu
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan 30013
R.O.C.
July, 2010
Abstract
Multicore might be the only solution to the performance and power issues of
future chip processor architectures. As the number of cores on a chip keeps
increasing, traditional bus-based architectures are incapable of offering the
required communication bandwidth on the chip, so the Network-on-chip (NoC)
has become the main paradigm for on-chip interconnection. NoCs not only
offer significant bandwidth advantages but also provide outstanding flexibility.
However, the performance of NoCs can degrade significantly if the network
flow is not controlled properly. Most previous solutions try to detect network
congestion by monitoring the hardware status of the network switches or links.
Changes in the local hardware status may indicate possible congestion in the
network, and packet injection into the network is then throttled in reaction.
The problem with these solutions is that congestion detection is based only on
local status without global information. Actual congestion may occur elsewhere
and can only be detected through backpressure, which may be too passive and
too slow for taking corrective measures in time.
This work takes a proactive approach to congestion detection. The idea is to
predict the changes in the global, end-to-end network traffic patterns of the
running application and take proactive flow control actions to avoid possible
congestion. Traffic prediction is based on our recent paper [1], which uses a
table-driven predictor for predicting application communication patterns. In
this thesis, we discuss how to use the prediction results for effective scheduling
of packet injection to avoid network congestion and improve throughput. The
proposed scheme is evaluated using simulation based on a SPLASH-2 benchmark
as well as synthetic traffic. The results show substantial performance improvement
with negligible execution overhead.
摘要 (Chinese Abstract)
When considering the performance and power issues of future chip processor architectures, multicore may be the only solution. As the number of cores on a chip keeps increasing, traditional bus-based architectures can no longer satisfy the communication bandwidth required on the chip, and the Network-on-Chip (NoC) has become the mainstream for on-chip interconnection. NoCs not only offer the advantage of considerable bandwidth but also exhibit outstanding flexibility. However, if the network traffic is not properly controlled, the network performance degrades greatly. Most previous solutions try to detect network congestion by monitoring the hardware status of the switches and links in the network. Changes in the hardware status at local endpoints can indicate possible congestion in the network, and the congestion status is then used to control packet injection into the network. These congestion-detection methods look only at local hardware status without considering the situation of the whole network; in reality, congestion does not occur only locally but may arise elsewhere in the network. Moreover, detecting congestion through hardware status is a backpressure mechanism, which is too passive and too slow to react to real network congestion in time.
This thesis adopts a more proactive approach to detecting network congestion. The idea is to predict the network traffic pattern of the running application from a global, end-to-end transmission perspective, and to use this prediction to perform flow control that avoids network congestion. The traffic prediction is based on a recent paper whose approach uses tables to record the application's transmission patterns for prediction. In this thesis, we discuss how to use the prediction results to schedule packet injection effectively, so as to avoid network congestion and also improve the overall throughput. The proposed scheme is evaluated with simulations using SPLASH-2, and synthetic network traffic is also used in the experiments. The experimental results show that we greatly improve the overall performance, and the total execution time is also slightly reduced.
誌謝 (Acknowledgements)
If asked whether my two years of master's study or my four years of undergraduate life were more fulfilling, I would choose the former without hesitation. Undergraduate life is mostly about attending classes, studying, and taking exams. Graduate school is different: besides coursework and exams, one is also busy with projects and with finding a thesis topic. Finding my thesis topic was truly a bumpy process; I changed it again and again and could never find a suitable one. When I finally settled on a topic, I lived under the panic and pressure of possibly not finishing on time. In the end, however, I completed the work and produced a result. Although the research process was full of frustration, I also learned a great deal, both academically and personally.
The first person I want to thank is my advisor, Professor Chung-Ta King, who gave my thesis many valuable suggestions and pointed out my blind spots so that my research could proceed smoothly. I am deeply grateful to him.
I also want to thank everyone in the Multi-core group, especially 有希 and 布拉胖, who gave my thesis a lot of concrete help; 有希 in particular patiently discussed the work with me again and again and was a major force behind the completion of this thesis. Thanks also to kaven, whose ability I can only admire from afar; I often discussed my thesis with you and gained a lot, and your banter always brought us joy. The two 雙翔 juniors are a comic duo who convinced me that our Multi-core group really is made up of remarkable characters.
Thanks to every PhD senior, peer, and junior in PADS. Although our research areas differ, PADS is always full of joy because of you, and even under the pressure of graduation my mood stayed rather cheerful, so thank you as well.
Finally, I thank my family and my close friends; you are my greatest motivation.
Contents
1 Introduction 1
2 Motivating Example 6
3 Related Work 10
4 Problem Formulation 12
4.1 Application-Driven Predictor . . . . . . . . . . . . . . . . . . . . . . 14
5 Traffic Control Algorithm 19
5.1 Traffic Control Algorithm and Implementation Overhead . . . . . . 19
5.2 Data Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.3 Area Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Experimental Results 25
6.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Real Application Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 Synthetic Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7 Conclusion and Future Works 32
List of Tables
6.1 Simulation Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2 Our proposed flow control algorithm leads to a huge reduction in
latency with slight execution time overhead. . . . . . . . . . . . . . 27
6.3 Our proposed flow control algorithm for synthetic traffic leads
to a huge reduction in the average and maximum latency and a
slight reduction in the execution time. . . . . . . . . . . . . . . . . 31
List of Figures
2.1 The tile arrangement and interconnection topology used for the
experiment on the TILE64 platform . . . . . . . . . . . . . . . . . . 7
2.2 The traffic of router 4 is tracked. The first diagram is all the traffic
input/output from router 4. The second to the fourth diagrams
show the decomposed traffic. Note that the traffic relayed by
router 4 is omitted. The last one is the output traffic from router
4 to 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1 The structure of a router . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 An example of an L1-table. The columns G4 : G0 record the quantized
transmitted data size of the last 5 time intervals. . . . . . . . . . . . 14
4.3 An example of an L2-table, which is indexed by the transmission
history pattern G4 : G0. The corresponding data size level Gp is
the value predicted to be transmitted in the next time interval . . . 15
4.4 A table which records the delayed transmissions . . . . . . . . . . . 15
5.1 The diagram of the flow control algorithm . . . . . . . . . . . . . . . 21
5.2 The diagram of flow control . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Histograms of the packet latencies without (a) and with (b) the
proposed flow control; in (b) the latencies drop drastically. . . . . . 28
6.2 The maximum workload of links in the network without (a) and
with (b) the proposed flow control. . . . . . . . . . . . . . . . . . . . 29
Chapter 1
Introduction
The number of transistors on a chip has increased exponentially over the decades
according to Moore's Law. At the same time, applications have grown in
complexity and therefore require huge amounts of computation. These factors
are further coupled with the increasing need for power saving as the clock
frequency of a core increases. The best practice at this point is to adopt
multicore architectures and application parallelization. However, the
communication overhead becomes a critical bottleneck if we cannot offer
substantial bandwidth among the cores. Traditional bus-based architectures suffer
from increased packet latencies as the number of on-chip cores increases and are
incapable of providing performance guarantees, especially for real-time applica-
tions. As a result, the Network-on-chip (NoC) has become the de facto solution
to this critical problem.
NoCs not only offer significant bandwidth but also provide outstanding flexibility
and scalability. There are already multi- and many-core processors on the market
that adopt a NoC as their communication fabric. For example,
Tilera's TILE64 [2], introduced in 2007, uses a 2-D mesh-based network to inter-
connect 64 tiles and 4 memory controllers. Indeed, NoCs are becoming the main
communication and design fabric for chip-level multiprocessors.
Since the cores are connected by a network, flow control and congestion
control in NoCs are certainly important issues. If a core transmits too many
packets to another core, the intermediate routers need to buffer many packet
flits, causing the network to become congested. Without an effective flow control
mechanism, the performance of NoCs may degrade sharply due to congestion.
According to [3], the accepted traffic increases linearly with the applied load
until a saturation point is reached. After the saturation point, the accepted
traffic decreases considerably.
Many solutions already exist for handling congestion in off-chip
networks [4–6]. However, most of them are not suitable for on-chip
networks. In off-chip environments, dropping packets is usually used as a
means of flow control when congestion happens, and this kind of control
requires an acknowledgment mechanism. On-chip networks, on the other hand,
possess reliable on-chip wires and more effective link-level
flow control, which make NoCs almost lossless. As a result, there is no
need to implement complicated protocols, such as acknowledgments, solely for flow
control. This difference gives us the chance to come up with a new solution.
To the best of our knowledge, there are very few research works discussing the
congestion control problem in NoCs. In [7], the switches exchange their load
information with neighboring switches to avoid hot spots through which most
packets pass. In [8, 9], a predictive closed-loop flow control mechanism is
proposed based on a router model, which is used to predict how many flits the
router can accept in the next k time steps. However, it ignores the flits injected
by neighboring routers during the prediction period. In [10, 11], a centralized,
end-to-end flow control mechanism is proposed. However, it requires a special
network, called the control NoC, to transfer OS-control messages, and it relies
only on locally blocked messages to decide when a processing element is allowed
to send messages to the network.
Most of the works mentioned above detect network congestion by monitoring
the hardware status, such as buffer fill levels, link utilization, and the number of
blocked messages. However, these statuses are bounded by the hardware. For
example, the size and number of buffers are limited, so without adding new
hardware the detection may be very inaccurate. In particular, if a bursty workload
exceeds the hardware limits, the congestion might not be detected immediately.
In addition, congestion detection based on hardware status is a reactive technique.
It relies on backpressure to detect network congestion, so traffic sources cannot
throttle their injection rates before the network becomes severely congested.
Furthermore, previous work on flow control in NoCs does not take global
information into consideration when making flow control decisions. Even
if a certain core determines that the network is free of congestion and decides to
inject packets, some links or buffers at other cores might still be congested,
causing even more severe congestion.
In this thesis, we propose a proactive congestion and flow control mechanism.
The core idea is to predict the future global traffic in the NoC according to
the data transmission behaviors of the running applications. According to the
prediction, we can control network injection before congestion occurs. Notice
that most applications show repetitive communication patterns because they
likely execute similar code in a time interval, such as a loop in the program.
These patterns may reflect the network states more accurately since applications
are the sources of the traffic in the network. Once the application patterns can
be predicted accurately, the future traffic of every link can be estimated based on
this information. The injection rate of each node can thus be controlled before
the network goes into congestion. However, predicting the traffic in a network
with high accuracy is a challenge. In this thesis, the data transmission behavior
of the running application is tracked and then used as clues for predicting the
future traffic by a specialized table-driven predictor. This technique is inspired
by the branch predictor and works well for the end-to-end traffic of the network
[1].
The main contributions of this thesis are as follows. First, we predict
congestion according to the data transmission behaviors of applications rather
than the hardware status, since the data transmissions of applications are the direct
source of NoC congestion. Second, we modify the table-driven
predictor proposed in [1] so that it not only captures and predicts the data
transmission behaviors of the application at run time but also makes the
injection rate control decisions. Third, the implementation details of the traffic
control algorithm are presented. By taking advantage of the many-core
architecture, we can dedicate a core to making packet injection decisions and
thereby improve global performance.
This thesis is organized as follows. In Chapter 2, a motivating example is given
to show the repetitive data transmission behavior in applications. In Chapter 3,
related work is discussed. Next, we give a formal definition of the flow control
problem in Chapter 4. In Chapter 5, we present the details of the traffic control
algorithm. Evaluations are shown in Chapter 6. Finally, conclusions are given
in Chapter 7.
Chapter 2
Motivating Example
In this chapter, we show that the data transmission behavior exhibits
repetitive patterns in parallel programs, taking the LU decomposition kernel
of the SPLASH-2 benchmark as an example. The LU decomposition kernel
is ported to the TILE64 platform and run on a 4 × 4 tile array, as Figure 2.1 shows.
The detailed experimental setup is described in Chapter 6. We used 16 tiles for
the application, and the routing algorithm is X-Y dimension-order routing. In the
following discussion, we use the form (source, destination) to describe
transmission pairs.
Figure 2.1: The tile arrangement and interconnection topology used for the experiment on the TILE64 platform (tiles 0 to 15 in a 4 × 4 mesh).

Figure 2.2 shows the transmission trace of router 4. In the first diagram, the
traffic is mixed when viewed from the East port. The mixed traffic is somewhat
messy and hard to predict. In previous works, traffic prediction is made mainly
by checking the hardware status, such as the fullness of buffers, the utilization
of links, and so on. The hardware status is affected by the mixed traffic, as the
first diagram shows. Irregular traffic makes the hardware status unsuitable for
predicting the network workload.
However, when we extract the traffic of the pairs (5,4), (6,4), and (7,4), as the
second to fourth diagrams show, and the output traffic (4,5) shown in the last
diagram, the traces become much more regular and predictable. Each separated
transmission trace is recorded from the viewpoint of the end-to-end data
transmissions issued by the running application. The end-to-end data transmission
exhibits repetitive patterns since the application executes similar operations
across time intervals.
By exploiting this repetitive characteristic of application execution, we can
predict the end-to-end data transmissions accurately by recording their history.
The workload prediction for a given link in the network can then be derived by
summing all the predicted end-to-end data transmissions that pass through this
link. Since we can predict the NoC traffic in the next time interval, we can
regulate the traffic sources ahead of packet injection and thus realize
congestion avoidance.
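As an illustration of this summation (not code from the thesis: the node numbering node = row × 4 + col, the helper names, and the example transmission sizes are assumptions), the following sketch accumulates predicted end-to-end transmission sizes along X-Y routes to obtain a per-link workload prediction for the next interval:

```python
# Minimal sketch: per-link workload prediction on a 4x4 mesh with X-Y routing.
# The node numbering (node = row * MESH + col) is an assumption for illustration.
from collections import defaultdict

MESH = 4

def coords(node):
    return divmod(node, MESH)               # (row, col)

def xy_route(src, dst):
    """Directed links (hop pairs) visited by X-Y routing: X first, then Y."""
    (sr, sc), (dr, dc) = coords(src), coords(dst)
    links, r, c = [], sr, sc
    while c != dc:                           # move along the X dimension first
        nc = c + (1 if dc > c else -1)
        links.append((r * MESH + c, r * MESH + nc))
        c = nc
    while r != dr:                           # then along the Y dimension
        nr = r + (1 if dr > r else -1)
        links.append((r * MESH + c, nr * MESH + c))
        r = nr
    return links

def predicted_link_workload(predictions):
    """predictions: iterable of (src, dst, predicted_bytes) for the next interval."""
    workload = defaultdict(int)
    for src, dst, size in predictions:
        for link in xy_route(src, dst):
            workload[link] += size           # every link on the route carries the flow
    return workload

# Example with hypothetical sizes for the pairs of Figure 2.2.
if __name__ == "__main__":
    preds = [(5, 4, 4000), (6, 4, 2000), (7, 4, 1500), (4, 5, 3000)]
    for link, load in sorted(predicted_link_workload(preds).items()):
        print(link, load)
```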
Figure 2.2: The traffic of router 4 over time. The first diagram shows all the traffic input/output through the East port of router 4; the second to fourth diagrams show the decomposed end-to-end traffic (5 to 4), (6 to 4), and (7 to 4); the last one shows the output traffic from router 4 to 5. Traffic relayed by router 4 is omitted.
Chapter 3
Related Work
In [7], the load information of a switch is sent to neighboring switches to decide
routing paths that avoid congestion. The control information is exchanged only
locally and cannot reflect the status of the whole network. In [8, 9], the authors
predict network congestion based on their proposed traffic source and router
model. Using this model, each router predicts the availability of its buffers ahead
of time, i.e., how many flits the router can accept. A traffic source cannot inject
packets until the availability is greater than zero. They predict traffic from the
switch perspective, whereas our predictions are made from the perspective of
the applications.
In [12–15], congestion control is modeled as a utility maximization problem,
and these works propose iterative algorithms to solve it.
The authors in [10] make use of the operating system (OS) and let the system
software control the resource usage. In [11], the authors detail a NoC com-
munication management scheme based on a centralized, end-to-end flow control
mechanism that monitors the hardware status. Both works need a dedicated
control NoC to transfer OS-control messages and a data NoC responsible for
delivering data packets. The OS refers to the blocked messages of the local
processing element to limit when the element is allowed to send messages. In
[16], almost the same network architecture is assumed, except that some extra
hardware is added to support a distributed HW/SW congestion control technique.
Model Predictive Control (MPC) is used for on-chip congestion control in
[17]. In that work, the link utilization of a router is used as the indicator for
congestion measurement. In contrast, our work makes predictions at the
application layer rather than the link layer in order to obtain the transmission
behaviors of the running applications. We argue that these behaviors are the
main cause of network congestion.
Chapter 4
Problem Formulation
We have seen that congestion can degrade network performance considerably,
so congestion in the network should be avoided as much as possible. In [18],
the queueing delay is used as a metric for congestion detection. In [17], the
authors use link utilization as the congestion measure. Since there is no
universally accepted definition of network congestion [19], we take link
utilization as the congestion measure in this thesis. The utilization of a link
$e_i$ at the $t$-th time interval is defined as

$$\mathrm{Util}_i(t) = \frac{D_i(t)}{T \times W}, \qquad 0 \le \mathrm{Util}_i(t) \le 1,$$

where $D_i(t)$ denotes the total data size transmitted over $e_i$ in the $t$-th time
interval, $T$ is the length of a time interval in seconds, and $W$ is the maximum
bandwidth of a communication link. Thus $T \times W$ denotes the maximum possible
data size transmitted in one time interval.
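As a concrete reading of this definition, the short sketch below (an illustration only; the interval length, bandwidth, and threshold values are assumed example numbers) computes the utilization of a link from the bytes it carried in one interval and flags it as congested when the utilization exceeds a threshold Th:

```python
# Minimal sketch: link utilization and the congestion test used in this chapter.
# T (interval length), W (link bandwidth) and TH are assumed example values.

T = 1e-6          # time interval in seconds (assumed)
W = 3.2e9         # link bandwidth in bytes per second (assumed)
TH = 0.8          # congestion threshold; [17] reports ~80% as a reasonable limit

def utilization(bytes_transmitted: float) -> float:
    """Util_i(t) = D_i(t) / (T * W), clipped to [0, 1]."""
    return min(bytes_transmitted / (T * W), 1.0)

def is_congested(bytes_transmitted: float, threshold: float = TH) -> bool:
    """A link is considered congested when its utilization exceeds the threshold."""
    return utilization(bytes_transmitted) > threshold

# Example: 2,800 bytes sent on the link in one interval -> utilization 0.875.
print(utilization(2800.0), is_congested(2800.0))
```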
We assume that if the utilization of a given link in the network exceeds a
properly selected threshold Th, this link is congested. Experimental results
in [17] assert that 80% link utilization still yields reasonable latencies before
the congestion limit. However, the selected threshold value should take hardware
configurations such as the buffer size and the link bandwidth into consideration.
We aim to prevent the network from becoming congested before it happens.
This is achieved by predicting the possible traffic in the t-th time interval and
preventing several traffic sources from injecting packets concurrently. By
scheduling packet injection effectively, we can avoid network congestion and
thus improve the average packet latency. Latency is a commonly used perfor-
mance metric and can be interpreted in different ways [3]. We define latency
here as the time elapsed from when the message header is injected into the
network at the source node until the tail of the packet is received at the
destination node.
Assume that λ is the average packet latency and t_exec is the total execution
time without any flow control, and let λ′ and t′_exec be the corresponding average
packet latency and total execution time with our proposed flow control. Our goal
is to maximize λ − λ′ and t_exec − t′_exec. However, the execution time is affected
by the communication dependencies between traffic sources [20]; a full treatment
would require further discussion of dependencies in the program, which is beyond
the scope of this thesis.
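Stated compactly (a restatement of the goal above in the primed notation; treating congestion avoidance as keeping every link's utilization at or below Th is our reading of the earlier definition, not an explicit formula in the thesis):

```latex
\max\ (\lambda - \lambda'), \qquad \max\ (t_{\mathrm{exec}} - t'_{\mathrm{exec}}),
\quad \text{subject to} \quad
\mathrm{Util}_i(t) = \frac{D_i(t)}{T \times W} \le Th
\quad \text{for every link } e_i \text{ and interval } t.
```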
Figure 4.1: The structure of a router: a 5 × 5 crossbar connecting five input ports and five output ports (to/from the N, E, S, W routers and the local processor).
Dest.  LRU  Data Size  Transmission history (G4 G3 G2 G1 G0)
5      0    256        5 3 1 2 4
8      2    128        3 3 0 3 3
10     1    512        2 2 2 2 2
13     3    64         5 4 3 5 4

Figure 4.2: An example of an L1-table. The columns G4 : G0 record the quantized transmitted
data size of the last 5 time intervals.
4.1 Application-Driven Predictor
In this section, we show how to predict traffic using a table-driven
network traffic predictor and how to make traffic control decisions with an extra
table that records the delayed transmissions. The original prediction method
was proposed in [1]; however, that work only discusses how to monitor and
predict the traffic without interfering with it. In this thesis, the future transmissions
G4 G3 G2 G1 G0  LRU  Gp
5  3  1  2  4   31   2
4  4  0  4  4   13   4
5  4  2  5  3   5    0
3  1  2  6  3   12   2
…
(indexed by the L1-table)

Figure 4.3: An example of an L2-table, which is indexed by the transmission history pattern
G4 : G0. The corresponding data size level Gp is the value predicted to be transmitted in the
next time interval.
Src.  Dest.  Data size  Priority
9     10     256        3
4     3      64         2
3     12     32         0
5     6      16         0
…

Figure 4.4: A table which records the delayed transmissions.
are directly controlled by our extended design. To simplify the following
discussion, we assume a 2D mesh network of size N × N as the underlying
topology. Note that our approach is independent of the topology and the size of
the network, so it can easily be extended to other network topologies and
arbitrary network sizes. Each tile consists of a processor core, a memory module,
and a router. We assume that the router has 5 input ports, 5 output ports, and a
5 × 5 crossbar. The structure of a router is shown in Figure 4.1. Each crossbar
connects five directions: east, north, west, south, and the local processor. Each
connection consists of two uni-directional communication links for sending and
receiving data, respectively. A deterministic routing algorithm is assumed, so the
path between a source and a destination is determined in advance. This is the
most common type of routing algorithm in current NoC implementations.
A table-driven predictor is employed to record the past traffic history, which
is then used to predict the data size and destination of the outgoing traffic from
each router in the next time interval. Each router maintains two hierarchical
tables for tracking and predicting the data transmissions. The first-level table
(L1-table), as shown in Figure 4.2, tracks all output data transmissions. Each
router uses only four entries to record transmission destinations since a core
may only communicate with a subset of
all the cores [1]. Destination entries are replaced using the LRU replacement
policy to keep the table small. In order to map the recorded patterns to a guess
of the following transmission, a second-level table (L2-table) is required. At
the beginning of the t-th time interval, the transmission history recorded in the
L1-table is used to index the L2-table to get the predicted level of the trans-
mission data size for the t-th time interval. During the t-th time interval,
whenever an output transmission is issued by the processor core, the destination
and data size are recorded in the L1-table; the data size is quantized and
accumulated toward G0. The columns G0 to Gn record the quantized transmitted
data sizes of the last n + 1 time intervals. The two tables are updated at the end
of each predefined time interval. After checking the prediction, the value of the
data size counter in the L1-table is quantized and shifted into G0. Finally, the
updated transmission history in the L1-table is used to index the L2-table and
retrieve the predicted data size level to be transmitted in the next time interval.
If the transmission history cannot be found in the L2-table, the system either
creates a new entry or replaces an existing entry by LRU, and uses the last value
(G0) as the predicted transmission data size level. The recorded transmitted data
size levels in the L1-table are used to check the accuracy of the prediction made
in the previous time interval. If the prediction was wrong, the value of Gp in the
L2-table for the corresponding transmission history pattern is modified to the
data size level recorded in the L1-table.
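The following sketch mirrors the L1-/L2-table mechanism described above (an illustration only; the class and field names are assumptions, and the quantization and LRU policies are simplified): per-destination history levels G4 : G0 in the L1-table index an L2-table entry whose Gp field supplies the next-interval prediction, and a misprediction overwrites Gp with the observed level.

```python
# Minimal sketch of the table-driven predictor (names and sizes are assumptions).
from collections import OrderedDict

HISTORY = 5          # G4..G0
LEVELS = 8           # number of quantized data-size levels (assumed)

def quantize(nbytes, step=64):
    """Map a byte count to a data-size level (simplified quantization)."""
    return min(nbytes // step, LEVELS - 1)

class Predictor:
    def __init__(self, l1_entries=4, l2_entries=64):
        self.l1 = OrderedDict()   # dest -> {"count": bytes, "hist": [G4..G0]}
        self.l2 = OrderedDict()   # history tuple -> predicted level Gp
        self.l1_entries, self.l2_entries = l1_entries, l2_entries

    def record(self, dest, nbytes):
        """Called whenever the core issues a transmission during the interval."""
        entry = self.l1.setdefault(dest, {"count": 0, "hist": [0] * HISTORY})
        entry["count"] += nbytes
        self.l1.move_to_end(dest)                 # LRU: most recently used last
        if len(self.l1) > self.l1_entries:
            self.l1.popitem(last=False)           # evict the LRU destination

    def end_of_interval(self, dest):
        """Update the tables at the end of an interval; return the new prediction."""
        entry = self.l1.get(dest)
        if entry is None:
            return 0
        hist = tuple(entry["hist"])
        observed = quantize(entry["count"])
        if hist in self.l2 and self.l2[hist] != observed:
            self.l2[hist] = observed              # correct a wrong prediction
        entry["hist"] = entry["hist"][1:] + [observed]   # shift observed level into G0
        entry["count"] = 0
        new_hist = tuple(entry["hist"])
        if new_hist not in self.l2:
            if len(self.l2) >= self.l2_entries:
                self.l2.popitem(last=False)       # LRU replacement in the L2-table
            self.l2[new_hist] = observed          # default: last value (G0)
        self.l2.move_to_end(new_hist)
        return self.l2[new_hist]                  # predicted level for next interval
```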
Besides the traffic predictor, we maintain another table to record the delayed
transmissions, as shown in Figure 4.4. When the traffic control algorithm decides
to delay a transmission, we record its source, destination, and data size. To avoid
starvation, a priority column is added: each time a transmission is delayed for
another interval, its priority value is increased.
Chapter 5
Traffic Control Algorithm
In this chapter, we present a heuristic algorithm for NoC traffic management.
Then, we give some possible solutions to aggregate the prediction data.
5.1 Traffic Control Algorithm and Implementation
Overhead
The algorithm detailed in Algorithm 1 is for the central control system, and the
algorithm detailed in Algorithm 2 is for each node. The control system maintains
two tables: one records the transmissions that are delayed and the other records
the transmissions that are predicted. Given these transmissions, the control
system decides which should be delayed and which should be injected. The
control message inject, sent from the control system to each node, indicates
whether source node i may inject traffic toward destination node j in the next
time interval. Note that Algorithm 1 is executed at the beginning of each time
interval, and Algorithm 2 is executed during the time interval.
Because this flow control algorithm operates at the end-to-end layer, we use inject
to indicate whether source i can send packets to destination j. Figure 5.1 is a
simple flow chart explaining our flow control algorithm.
At the beginning, we assume that each source can send traffic to each desti-
nation (line 3). Then, the algorithm decides which transmissions in the delay
table may inject (lines 5 - 22). Each transmission has its own priority to avoid
starvation (line 6); the transmission with the highest priority is the one that has
been delayed the longest. The workload of a link (line 10) includes the workload
that has not finished processing yet and the workload that may be injected in
the next time interval. If the workload of any link on the path exceeds the
threshold value, the control signal is set to false (line 11). The threshold value
depends on the architecture. After deciding which transmissions in the delay
table may inject, the remaining transmissions update their priorities (line 23).
Finally, the control system collects the transmissions predicted to inject in the
next time interval from the predictor and decides whether the corresponding
control signals should be true or false.
Figure 5.1: The diagram of the flow control algorithm

Algorithm 2 is executed in each source node during the time interval. Every
source node receives the control message from the control system and makes its
decisions (line 1). When there is a transmission from source i to destination j
and the control message value is true, the source node is allowed to inject the
traffic onto the network; otherwise, the source node must not inject any traffic
and adds this transmission to the centralized delay table.
It is worth mentioning that the algorithms presented here are just one example
of flow control once NoC traffic can be predicted; other algorithms may also be
applicable to the flow control problem.
5.2 Data Aggregation
Figure 5.2 illustrates the basic idea of our proposed method. The control system
is responsible for Algorithm 1 and each node is responsible for Algorithm 2.
Algorithm 1 Algorithm for the central control system
1: // Initialization. inject[src][dest] is a control message that decides whether to inject.
2: for all source-to-destination transmission pairs do
3:   inject[src][dest] = true;
4: end for
5: for all transmissions in the delay table do
6:   Select the transmission Tdelay i,j with the highest priority;
7:   Let path be the routing path of Tdelay i,j;
8:   if inject[i][j] == true then
9:     for all link ∈ path do
10:      if link.workload > threshold then
11:        inject[i][j] = false;
12:        break;
13:      end if
14:    end for
15:    // send the control message to the nodes
16:    if inject[i][j] == true then
17:      Send an injection notification to node i to inject Tdelay i,j;
18:      Update link.workload along path;
19:      Delete Tdelay i,j from the delay table;
20:    end if
21:  end if
22: end for
23: Update the priorities of the transmissions remaining in the delay table;
24: Collect predicted transmissions from the application-driven predictor;
25: for all predicted transmissions do
26:   Select the transmission Tpredict i,j with the highest priority;
27:   Let path be the routing path of Tpredict i,j;
28:   if inject[i][j] == true then
29:     for all link ∈ path do
30:       if link.workload > threshold then
31:         inject[i][j] = false;
32:         break;
33:       end if
34:     end for
35:     if inject[i][j] == true then
36:       Update link.workload along path;
37:       Delete Tpredict i,j from the predicted transmissions;
38:     end if
39:   end if
40: end for
Algorithm 2 Algorithm for each node i
1: Receive the control message;
2: if there is a transmission to destination j then
3:   if inject[i][j] == true then
4:     Inject the transmission;
5:   else
6:     Add the transmission to the centralized delay table;
7:   end if
8: end if
9: Update the application-driven predictor;
10: Update link.workload;
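For concreteness, the sketch below follows the same structure as Algorithm 1 (it is an illustrative reimplementation, not the thesis code; the class names, the workload bookkeeping, and the X-Y routing helper are assumptions): delayed transmissions are considered first in priority order, then the predicted transmissions, and a transmission is admitted only if no link on its route already exceeds the workload threshold.

```python
# Minimal sketch of the central control decision (cf. Algorithm 1); names are assumed.
from collections import defaultdict

THRESHOLD = 40            # link workload threshold in flits (cf. Section 6.2)

def xy_route(src, dst, mesh=4):
    """Directed links visited by X-Y routing (same helper as the Chapter 2 sketch)."""
    (sr, sc), (dr, dc) = divmod(src, mesh), divmod(dst, mesh)
    links, r, c = [], sr, sc
    while c != dc:
        nc = c + (1 if dc > c else -1)
        links.append((r * mesh + c, r * mesh + nc)); c = nc
    while r != dr:
        nr = r + (1 if dr > r else -1)
        links.append((r * mesh + c, nr * mesh + c)); r = nr
    return links

class ControlSystem:
    def __init__(self):
        self.delay_table = []                 # entries: [src, dst, size, priority]
        self.link_workload = defaultdict(int)

    def admissible(self, src, dst):
        """True if no link on the route already exceeds the threshold."""
        return all(self.link_workload[l] <= THRESHOLD for l in xy_route(src, dst))

    def commit(self, src, dst, size):
        for l in xy_route(src, dst):
            self.link_workload[l] += size

    def begin_interval(self, predicted):
        """predicted: list of (src, dst, size). Returns the inject[][] decisions."""
        inject = defaultdict(lambda: True)
        # 1) delayed transmissions, highest priority (longest delayed) first
        for entry in sorted(self.delay_table, key=lambda e: -e[3]):
            src, dst, size, _ = entry
            if self.admissible(src, dst):
                self.commit(src, dst, size)
                self.delay_table.remove(entry)   # notify node src to inject
            else:
                inject[(src, dst)] = False
                entry[3] += 1                    # age the priority to avoid starvation
        # 2) transmissions predicted for the next interval
        for src, dst, size in predicted:
            if inject[(src, dst)] and self.admissible(src, dst):
                self.commit(src, dst, size)
            else:
                inject[(src, dst)] = False
        return inject
```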
The control system sends control signals to each node via the control network,
and each node sends information to the control system via the control network
to help it make decisions. The nodes communicate with each other through the
data network. In [10], the authors argue that the operating system is capable of
network traffic management. Our method can therefore be adopted on the
architecture platform described in [10], with the control system acting as the
operating system. However, this approach may be too cumbersome, so we
propose an alternative: since there are many cores available, we can use a
dedicated core to handle the flow control decisions. This dedicated core acts as
the control system in Figure 5.2.
5.3 Area Occupancy
We now analyze the area overhead of the NTPT. In this subsection, we use the
transistor counts of real many-core designs: the UC Davis AsAP has 55M
transistors, and Tilera's TILE64 has 615M transistors. Assuming
Figure 5.2: The diagram of flow control. The control system receives update information from the application-driven traffic predictors and sends control signals to the cores via the control network, while the cores communicate with each other over the data network.
that each bit needs 6 transistors, the application-driven predictor in our design
needs 0.69M transistors when the number of cores is 64. In addition, we need to
maintain another table, called the control table, to record the delayed
transmissions; assuming 128 entries, it needs about 0.02M transistors. The
predictor and control table together occupy 1.29% and 0.12% of the transistor
budget in AsAP and TILE64, respectively, which is a small and tolerable area
overhead. In contrast, [21] reports that increasing the data path width by 138%
results in an area penalty of 64% in Xpipes, a NoC architecture, which is a
considerable overhead, while the average packet latency only changes from 49
cycles to 39 cycles as the link bandwidth grows from 2.2 GB/s to 3.2 GB/s. In
short, enlarging the link bandwidth improves the average packet latency only
slightly at a huge area cost. This observation motivates us to perform injection
rate flow control, since increasing the link bandwidth is not economical.
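As a quick check of the percentages quoted above (a worked calculation only, starting from the transistor totals given in the text; the bit count behind the 0.69M figure is not restated here):

```python
# Worked check of the area-overhead percentages quoted above.
predictor = 0.69e6      # transistors for the application-driven predictor (64 cores)
control_table = 0.02e6  # transistors for the 128-entry control (delay) table
total = predictor + control_table

asap = 55e6             # UC Davis AsAP transistor count
tile64 = 615e6          # Tilera TILE64 transistor count

print(f"AsAP:   {total / asap:.2%}")    # ~1.29%
print(f"TILE64: {total / tile64:.2%}")  # ~0.12%
```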
Chapter 6
Experimental Results
In this chapter, we present the experimental results that evaluate our
proposed flow control algorithm. We adopt both real application traffic and
synthetic traffic in our experiments.
6.1 Simulation Setup
The PoPNet network simulator [22] is used for our simulations, with data
transmission traces as its input. Each trace records the packet injection time,
the source router address, the destination router address, and the packet size.
The detailed simulation configuration is provided in Table 6.1. The original data
transmission traces are altered by our flow control algorithm, so some
transmissions are delayed for a period to avoid congestion. The experimental
results presented below show that our algorithm provides significant performance
improvement.
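For illustration, a trace record and the way the flow control shifts injection times might look as follows (a hypothetical format: PoPNet's actual trace syntax is not reproduced here, and the field layout and interval length are assumptions):

```python
# Hypothetical trace record: (injection_cycle, src_router, dst_router, packet_size).
trace = [
    (100, 5, 4, 256),
    (100, 6, 4, 256),
    (105, 7, 4, 128),
]

def apply_delays(trace, delayed, interval=100):
    """Shift delayed (src, dst) transmissions to the next interval boundary."""
    out = []
    for cycle, src, dst, size in trace:
        if (src, dst) in delayed:
            cycle = (cycle // interval + 1) * interval   # postpone one interval
        out.append((cycle, src, dst, size))
    return sorted(out)

# Example: the control system decided to delay (6, 4) for this interval.
print(apply_delays(trace, delayed={(6, 4)}))
```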
Table 6.1: Simulation Configuration

Network Topology    4 × 4 mesh
Virtual Channels    3
Buffer Size         12
Routing Algorithm   X-Y routing
Bandwidth           32 bytes
6.2 Real Application Traffic
The Tilera’s TILE64 platform is used to run the benchmark programs and collect
the data transmission traces. We use SPLASH-2 blocked LU decomposition
as our benchmark program. The total workload is 3991 packets. As shown
in Table 6.2, the average packet latency drops from 2410.79 cycles to 771.858
cycles and the maximum packet latency drops from 5332 cycles to 3242 cycles.
The significant performance improvement origins from that we predict traffic
workload in the next interval and delay some packet injection to avoid congestion.
As depicted in Figure 6.1 (a), the packet latencies without flow control range
between 0 cycles and 5500 cycles. However, with our proposed flow control
algorithm, the packet latencies range between 0 cycles and 3300 cycles. These
packet latencies have decreased violently so that the histogram shifts to the left
side. To bear up our conviction, Figure 6.2 demonstrates more details about
                   Original         Pattern-oriented   Reduction (×)
Ave. latency       2410.79 cycles   771.858 cycles     3.12
Max. latency       5332 cycles      3242 cycles        1.64
Simulation cycles  5600 cycles      6100 cycles        0.92

Table 6.2: Our proposed flow control algorithm leads to a huge reduction in latency
with a slight execution time overhead.
the network congestion. We set the congestion threshold to 40 flits. The line in
Figure 6.2(b) occasionally goes above the threshold because of wrong predictions
of the network traffic. However, the impact of misprediction is slight, so the
result is within an acceptable range. In Figure 6.2(a), without flow control,
the maximum workload is far above the threshold and consequently causes
severe network congestion.
6.3 Synthetic Traffic
Besides the real application traffic, we also extend our algorithm for synthetic
traffic. In [20], the authors state that injected network traffic possesses self-
similar temporal properties. They use a single parameter, the Hurst exponent H
to capture temporal burstiness characteristic of NoC traffic. Based on this traffic
model, we synthesize our traffic traces. In Table 6.3, we give some instances
Figure 6.1: Histograms of the packet latencies (number of packets vs. packet latency in cycles) without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically.
Figure 6.2: The maximum workload of the links in the network (flits) over time without (a, original) and with (b, pattern-oriented) the proposed flow control.
with different values of H and make some comparisons. These values are chosen
based on [20]: Table 1 in [20] lists several values of the Hurst exponent H, and
we choose some of them for convenience. The average packet latency and the
maximum latency both drop significantly. Moreover, the execution time with our
proposed flow control is slightly shorter than that without flow control.
Relatively large H values indicate highly self-similar traffic and allow a higher
traffic prediction accuracy, but because the average packet size also increases
with H, the reduction does not grow linearly with H.
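For readers unfamiliar with this traffic model, one standard way to obtain approximately self-similar injection traces is to aggregate ON/OFF sources with heavy-tailed (Pareto) period lengths, for which the Hurst exponent relates to the Pareto shape as H ≈ (3 − α)/2. The sketch below uses this generic construction; it is not necessarily the synthesis procedure of [20], and the source count, interval count, and per-ON injection rate are assumptions.

```python
# Illustrative generation of an (approximately) self-similar injection trace by
# aggregating ON/OFF sources with Pareto-distributed periods (H ~= (3 - alpha) / 2).
# This is a generic construction, not necessarily the synthesis method of [20].
import random

def pareto(alpha, xm=1.0):
    """Heavy-tailed Pareto sample with shape alpha and minimum xm."""
    return xm / ((1.0 - random.random()) ** (1.0 / alpha))

def self_similar_trace(hurst=0.768, sources=32, intervals=256, flits_per_on=4):
    alpha = 3.0 - 2.0 * hurst          # target Hurst exponent -> Pareto shape
    load = [0] * intervals             # injected flits per time interval
    for _ in range(sources):
        t, on = 0.0, random.random() < 0.5
        while t < intervals:
            length = pareto(alpha)
            if on:                      # each source injects during its ON periods
                for i in range(int(t), min(int(t + length), intervals)):
                    load[i] += flits_per_on
            t, on = t + length, not on
    return load

if __name__ == "__main__":
    print(self_similar_trace(hurst=0.768)[:16])
```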
H                                0.576     0.661     0.768     0.855     0.978
Original Ave. latency (cycles)   3553.14   3596.45   3649.21   3665.53   3614.56
Improved Ave. latency (cycles)   482.512   467.787   387.716   412.983   417.577
Reduction of Ave. latency (×)    7.364     7.688     9.412     8.876     8.656
Original Max. latency (cycles)   7623      7623      7710      7658      7714
Improved Max. latency (cycles)   1591      1532      1016      1054      1037
Reduction of Max. latency (×)    4.791     4.976     7.589     7.266     7.438
Original Simulation Cycle        8580      8510      8550      8480      8450
Improved Simulation Cycle        8280      8260      7690      7781      7731

Table 6.3: Our proposed flow control algorithm for synthetic traffic leads to a huge reduction
in the average and maximum latency and a slight reduction in the execution time.
Chapter 7
Conclusion and Future Works
This thesis proposes an application-oriented flow control for packet-switched
networks-on-chip. By tracking and predicting the end-to-end transmission be-
havior of the running applications, we can limit traffic injection when the
network is heavily loaded. By delaying some transmissions judiciously, the aver-
age packet latency can be decreased significantly, so the performance improves
noticeably. In our experiments, we adopt real application traffic traces as well
as synthetic traffic traces. The experimental results show that our proposed flow
control not only decreases the average and maximum packet latency, but under
some conditions can even shorten the execution time.
Future work will focus on improving the accuracy of the application-oriented
traffic prediction. The simulation configuration should also be explored further,
and determining optimal parameters and tuning the flow control algorithm are
important as well. In addition, we currently ignore the communication
dependencies between the traffic traces because of the difficulty of modeling them.
Bibliography
[1] Y. S.-C. Huang, C.-K. Chou, C.-T. King, and S.-Y. Tseng, "NTPT: On the
end-to-end traffic prediction in the on-chip networks", in Proc. 47th ACM/
IEEE Design Automation Conference, 2010.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay,
M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey,
D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montene-
gro, J. Stickney, and J. Zook, "TILE64 processor: A 64-core SoC with mesh
interconnect", in Proc. IEEE International Solid-State Circuits Conference
(ISSCC), Feb. 2008, pp. 88–598.
[3] Jose Duato, Sudhakar Yalamanchili, and Lionel Ni, "Interconnection net-
works", 2002, pp. 428–431.
[4] S. Mascolo, “Classical control theory for congestion avoidance in high-speed
internet”, in Proc. Decision and Control Conference, 1999.
[5] Cui-Qing Yang, “A taxonomy for congestion control algorithms in packet
switching networks”, in IEEE Network, 1995.
[6] Yongru Gu, Hua O. Wang, and Yiguang Hong, "A predictive conges-
tion control algorithm for high speed communication networks", in Proc.
American Control Conference, 2001.
[7] Erland Nilsson, Mikael Millberg, Johnny Öberg, and Axel Jantsch, "Load
distribution with the proximity congestion awareness in a network on chip",
in Proc. Design, Automation, and Test in Europe, 2003, p. 11126.
[8] U. Y. Ogras and R. Marculescu, “Prediction-based flow control for network-
on-chip traffic”, in Proc. 43rd ACM IEEE Design Automation Conference,
2006, pp. 839–844.
[9] U. Y. Ogras and R. Marculescu, “Analysis and optimization of prediction-
based flow control in networks-on-chip”, in ACM Transactions on Design
Automation of Electronic Systems, 2008.
[10] Vincent Nollet, Théodore Marescaux, and Diederik Verkest, "Operating-
system controlled network on chip", in Proc. 41st ACM IEEE Design
Automation Conference, 2004.
[11] P. Avasare, J-Y. Nollet, D. Verkest, and H. Corporaal, "Centralized end-
to-end flow control in a best-effort network-on-chip", in Proc. 5th ACM
international conference on Embedded software, 2005.
[12] Mohammad S. Talebi, Fahimeh Jafari, and Ahmad Khonsari, “A novel
congestion control scheme for elastic flows in network-on-chip based on sum-
rate optimization”, in ICCSA, 2007.
[13] M. S. Talebi, F. Jafari, and A. Khonsari, “A novel flow control scheme
for best effort traffic in noc based on source rate utility maximization”, in
MASCOTs, 2007.
[14] Mohammad S. Talebi, Fahimeh Jafari, Ahmad Khonsari, and Mohammad H.
Yaghmaee, "Best effort flow control in network-on-chip", in CSICC, 2008.
[15] Fahimeh Jafari, Mohammad S. Talebi, Mohammad H. Yaghmaee, Ahmad
Khonsari, and Mohamed Ould-Khaoua, “Throughput-fairness tradeoff in
best effort flow control for on-chip architectures”, in Proc. 2009 IEEE
International Symposium on Parallel and Distributed Processing, 2009.
[16] T. Marescaux, A. Rångevall, V. Nollet, A. Bartic, and H. Corporaal, "Dis-
tributed congestion control for packet switched networks on chip", in
ParCo, 2005.
[17] J.W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, “Congestion-
controlled best-effort communication for networks-on-chip”, in Proc. De-
sign, Automation, and Test in Europe, 2007.
[18] Yuho Jin, Ki Hwan Yum, and Eun Jung Kim, "Adaptive data compression
for high-performance low-power on-chip networks", in Proc. 41st annual
IEEE/ACM International Symposium on Microarchitecture, 2008.
[19] Srinivasan Keshav, "Congestion control in computer networks", 1991.
[20] Vassos Soteriou, Hangsheng Wang, and Li-Shiuan Peh, “A statistical traffic
model for on-chip interconnection networks”, in Proc. 14th IEEE Interna-
tional Symposium on Modeling, Analysis, and Simulation, 2006.
[21] Anthony Leroy, “Optimizing the on-chip communication architecture of low
power systems-on-chip in deep sub-micron technology”, 2006.
[22] N. Agarwal, T. Krishna, L. Peh, and N. Jha, “Garnet: A detailed on-chip
network model inside a full-system simulator”, in Proceedings of Inter-
national Symposium on Performance Analysis of Systems and Software,
2009.
37

More Related Content

What's hot

Improving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-PImproving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-P
IDES Editor
 
Cisco discovery drs ent module 7 - v.4 in english.
Cisco discovery   drs ent module 7 - v.4 in english.Cisco discovery   drs ent module 7 - v.4 in english.
Cisco discovery drs ent module 7 - v.4 in english.
igede tirtanata
 
Ijarcet vol-2-issue-7-2311-2318
Ijarcet vol-2-issue-7-2311-2318Ijarcet vol-2-issue-7-2311-2318
Ijarcet vol-2-issue-7-2311-2318
Editor IJARCET
 
Performance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdf
Performance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdfPerformance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdf
Performance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdf
IUA
 

What's hot (16)

Improving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-PImproving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-P
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
 
Studying_the_TCP_Flow_and_Congestion_Con.pdf
Studying_the_TCP_Flow_and_Congestion_Con.pdfStudying_the_TCP_Flow_and_Congestion_Con.pdf
Studying_the_TCP_Flow_and_Congestion_Con.pdf
 
IRJET- Throughput Performance Improvement for Unbalanced Slotted Aloha Relay ...
IRJET- Throughput Performance Improvement for Unbalanced Slotted Aloha Relay ...IRJET- Throughput Performance Improvement for Unbalanced Slotted Aloha Relay ...
IRJET- Throughput Performance Improvement for Unbalanced Slotted Aloha Relay ...
 
Performance evaluation of tcp sack1 in wimax network asymmetry
Performance evaluation of tcp sack1 in wimax network asymmetryPerformance evaluation of tcp sack1 in wimax network asymmetry
Performance evaluation of tcp sack1 in wimax network asymmetry
 
1
11
1
 
A046020112
A046020112A046020112
A046020112
 
Mcseminar
McseminarMcseminar
Mcseminar
 
TCP Fairness for Uplink and Downlink Flows in WLANs
TCP Fairness for Uplink and Downlink Flows in WLANsTCP Fairness for Uplink and Downlink Flows in WLANs
TCP Fairness for Uplink and Downlink Flows in WLANs
 
Cisco discovery drs ent module 7 - v.4 in english.
Cisco discovery   drs ent module 7 - v.4 in english.Cisco discovery   drs ent module 7 - v.4 in english.
Cisco discovery drs ent module 7 - v.4 in english.
 
Delay jitter control for real time communication
Delay jitter control for real time communicationDelay jitter control for real time communication
Delay jitter control for real time communication
 
Ijarcet vol-2-issue-7-2311-2318
Ijarcet vol-2-issue-7-2311-2318Ijarcet vol-2-issue-7-2311-2318
Ijarcet vol-2-issue-7-2311-2318
 
Redesigning MPTCP in Edge clouds
Redesigning MPTCP in Edge cloudsRedesigning MPTCP in Edge clouds
Redesigning MPTCP in Edge clouds
 
Performance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdf
Performance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdfPerformance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdf
Performance-Evaluation-of-RPL-Routes-and-DODAG-Construction-for-IoTs .pdf
 
Implementation_and_Analysis_of_the_6LoWPAN.pdf
Implementation_and_Analysis_of_the_6LoWPAN.pdfImplementation_and_Analysis_of_the_6LoWPAN.pdf
Implementation_and_Analysis_of_the_6LoWPAN.pdf
 
An Implementation and Analysis of RTS/CTS Mechanism for Data Transfer in Wire...
An Implementation and Analysis of RTS/CTS Mechanism for Data Transfer in Wire...An Implementation and Analysis of RTS/CTS Mechanism for Data Transfer in Wire...
An Implementation and Analysis of RTS/CTS Mechanism for Data Transfer in Wire...
 

Similar to Application Behavior-Aware Flow Control in Network-on-Chip

Application Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-ChipApplication Aware Topology Generation for Surface Wave Networks-on-Chip
Application Aware Topology Generation for Surface Wave Networks-on-Chip
zhao fu
 
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ijgca
 
Analysis of Link State Resource Reservation Protocol for Congestion Managemen...
Analysis of Link State Resource Reservation Protocol for Congestion Managemen...Analysis of Link State Resource Reservation Protocol for Congestion Managemen...
Analysis of Link State Resource Reservation Protocol for Congestion Managemen...
ijgca
 
Application-Driven Flow Control in Network-on-Chip for Many-Core Architectures
Application-Driven Flow Control in Network-on-Chip for Many-Core ArchitecturesApplication-Driven Flow Control in Network-on-Chip for Many-Core Architectures
Application-Driven Flow Control in Network-on-Chip for Many-Core Architectures
Ivonne Liu
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
ijaceeejournal
 

Similar to Application Behavior-Aware Flow Control in Network-on-Chip (20)

Area-Efficient Design of Scheduler for Routing Node of Network-On-Chip
Area-Efficient Design of Scheduler for Routing Node of Network-On-ChipArea-Efficient Design of Scheduler for Routing Node of Network-On-Chip
Area-Efficient Design of Scheduler for Routing Node of Network-On-Chip
 
AREA-EFFICIENT DESIGN OF SCHEDULER FOR ROUTING NODE OF NETWORK-ON-CHIP
AREA-EFFICIENT DESIGN OF SCHEDULER FOR ROUTING NODE OF NETWORK-ON-CHIPAREA-EFFICIENT DESIGN OF SCHEDULER FOR ROUTING NODE OF NETWORK-ON-CHIP
AREA-EFFICIENT DESIGN OF SCHEDULER FOR ROUTING NODE OF NETWORK-ON-CHIP
 
Network on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyNetwork on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A survey
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A survey
 
Contents

1 Introduction
2 Motivating Example
3 Related Work
4 Problem Formulation
  4.1 Application-Driven Predictor
5 Traffic Control Algorithm
  5.1 Traffic Control Algorithm and Implementation Overhead
  5.2 Data Aggregation
  5.3 Area Occupancy
6 Experimental Results
  6.1 Simulation Setup
  6.2 Real Application Traffic
  6.3 Synthetic Traffic
7 Conclusion and Future Works
List of Tables

6.1 Simulation Configuration
6.2 Our proposed flow control algorithm leads to a large reduction in latency with a slight execution time overhead.
6.3 Our proposed flow control algorithm for synthetic traffic leads to a large reduction in the average and maximum latency and a slight reduction in the execution time.
List of Figures

2.1 The tile arrangement and interconnection topology used for the experiment on the TILE64 platform
2.2 The traffic of router 4 is tracked. The first diagram shows all the traffic input/output at router 4. The second to fourth diagrams show the decomposed traffic. Note that the traffic relayed by router 4 is omitted. The last diagram shows the output traffic from router 4 to router 5.
4.1 The structure of a router
4.2 An example of an L1-table. The columns G4:G0 record the quantized transmitted data size of the last 5 time intervals.
4.3 An example of an L2-table, which is indexed by the transmission history pattern G4:G0. The corresponding data size level Gp is the value predicted to be transmitted in the next time interval.
4.4 A table which records the delayed transmissions
5.1 The diagram of the flow control algorithm
5.2 The diagram of flow control
6.1 Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically.
6.2 The maximum workload of links in the network without (a) and with (b) the proposed flow control
Chapter 1
Introduction

The number of transistors on a chip has increased exponentially over the decades according to Moore's Law. At the same time, applications have also grown in complexity and therefore require huge amounts of computation. These factors are further coupled with the increasing need for power saving as the clock frequency of a core increases. The best practice at this point is to adopt multicore architectures and application parallelization. However, communication overhead becomes a critical bottleneck if we cannot offer substantial bandwidth among the cores. Traditional bus-based architectures suffer from increased packet latencies as the number of cores on a chip increases and are incapable of providing performance guarantees, especially for real-time applications. As a result, Network-on-chip (NoC) has become the de facto solution to this critical problem.

NoCs not only offer significant bandwidth but also provide outstanding flexibility and scalability. There are already multi- and many-core processors on the market that adopt an NoC as their communication fabric. For example, Tilera's TILE64 [2], introduced in 2007, uses a 2-D mesh-based network to interconnect 64 tiles and 4 memory controllers. Indeed, NoCs are becoming the main communication and design fabric for chip-level multiprocessors.

Since the cores are connected by a network, flow control and congestion control in NoCs are certainly important issues. If a core transmits too many packets to another core, the intermediate routers need to buffer many packet flits, causing network congestion. Without an effective flow control mechanism, the performance of NoCs may degrade sharply due to congestion. According to [3], the accepted traffic increases linearly with the applied load until a saturation point is reached. After the saturation point, the accepted traffic decreases considerably.

There are already many solutions to congestion in off-chip networks [4–6]. However, most of them are not suitable for on-chip networks. In off-chip environments, dropping packets is usually used as a means of flow control when congestion happens, and the environment must then provide an acknowledgment mechanism. On the other hand, an on-chip network possesses reliable on-chip wires and more effective link-level flow control, which makes on-chip NoCs almost lossless. As a result, there is no need to implement complicated protocols, such as acknowledgment, only for flow control. This difference gives us the chance to come up with new solutions.

To the best of our knowledge, there are very few research works discussing the congestion control problem in NoCs. In [7], the switches exchange their load information with neighboring switches to avoid hot spots where most packets will pass through. In [8, 9], a predictive closed-loop flow control mechanism is proposed based on a router model, which is used to predict how many flits the router can accept in the next k time steps. However, it ignores the flits injected by neighboring routers during the prediction period. In [10, 11], a centralized, end-to-end flow control mechanism is proposed. However, it needs a special network called a control NoC to transfer OS-control messages, and it relies only on locally blocked messages to decide when a processing element is able to send messages to the network.

Most of the works mentioned above detect network congestion by monitoring hardware status, such as buffer fillings, link utilization, and the amount of blocked messages. However, these statuses are bounded by hardware limitations. For example, the size and the number of buffers are limited, so without adding any new hardware, the detection may be very inaccurate. In particular, if a bursty workload exceeds the limits of the hardware, the congestion might not be detected immediately. In addition, congestion detection based on hardware status is a reactive technique. It relies on backpressure to detect network congestion, and thus the traffic sources cannot throttle their injection rates before the network is severely congested. Furthermore, previous work on flow control of NoCs does not take global information into consideration when making flow control decisions. Even if a certain core determines that the network is free of congestion and decides to inject packets onto the network, some links or buffers of other cores might still be congested, causing more severe congestion.

In this thesis, we propose a proactive congestion and flow control mechanism. The core idea is to predict the future global traffic in the NoC according to the data transmission behaviors of the running applications. Based on the prediction, we can control network injection before congestion occurs. Notice that most applications show repetitive communication patterns because they likely execute similar code within a time interval, such as a loop in the program. These patterns may reflect the network state more accurately since applications are the sources of the traffic in the network. Once the application patterns can be predicted accurately, the future traffic of every link can be estimated based on this information, and the injection rate of each node can be controlled before the network goes into congestion. However, predicting the traffic in a network with high accuracy is a challenge. In this thesis, the data transmission behavior of the running application is tracked and then used as the clue for predicting the future traffic with a specialized table-driven predictor. This technique is inspired by the branch predictor and works well for the end-to-end traffic of the network [1].

The main contributions of this thesis are as follows. First, we predict congestion according to the data transmission behaviors of applications rather than the hardware statuses, since the data transmissions of applications are the direct source of NoC congestion. Second, we modify the table-driven predictor proposed in [1] to not only capture and predict the data transmission behaviors of the application at run time, but also make decisions for injection rate control. Third, the implementation details of this traffic control algorithm are presented. By taking advantage of the many-core architecture, we can dedicate a core to making decisions on packet injection and achieving global performance.

This thesis is organized as follows. In Chapter 2, a motivating example is given to show the repetitive data transmission behavior in applications. In Chapter 3, related works are discussed. Next, we give a formal definition of the flow control problem in Chapter 4. In Chapter 5, we present the details of the traffic control algorithm. Evaluations are shown in Chapter 6. Finally, conclusions are given in Chapter 7.
Chapter 2
Motivating Example

In this chapter, we show that the data transmission behavior exhibits repetitive patterns in parallel programs, taking the LU decomposition kernel of the SPLASH-2 benchmark as an example. The LU decomposition kernel is ported to the TILE64 platform and run on a 4 x 4 tile array as Figure 2.1 shows. The detailed experiment setup is described in Chapter 6. We used 16 tiles for porting the applications, and the routing algorithm is X-Y dimensional routing. In the following discussion, we use the form (source, destination) to describe the transmission pairs.

 0  4  8 12
 1  5  9 13
 2  6 10 14
 3  7 11 15
Figure 2.1: The tile arrangement and interconnection topology used for the experiment on the TILE64 platform

Figure 2.2 shows the transmission trace of router 4. In the first diagram, the traffic is mixed from the viewpoint of the East port. The mixed traffic is somewhat messy and hard to predict. In previous works, traffic prediction is made mainly by checking the hardware status, such as the fullness of buffers, the utilization of links, and so on. The hardware status is affected by the mixed traffic, as the first diagram shows, and such irregular traffic makes hardware status unsuitable for predicting the network workload. However, when we extract the traffic between the pairs (5,4), (6,4) and (7,4), as the second to fourth diagrams show, and the output traffic (4,5) in the last diagram, the flows are much more regular and predictable. Each separated transmission trace is recorded from the viewpoint of the end-to-end data transmission issued by the running application. The end-to-end data transmission follows repetitive patterns since the application executes similar operations in successive time intervals.

By utilizing this repetitive characteristic of application execution, we can predict the end-to-end data transmission accurately by recording the history. The workload prediction for a given link in the network can be derived by summing all the predicted end-to-end data transmissions that pass through this link. Once we can predict the NoC traffic in the next time interval, we can control the sources of the traffic and regulate them ahead of packet injection, so that congestion avoidance can be realized.
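As an illustration of this per-link aggregation, the following sketch is my own, not code from the thesis. It assumes X-Y dimension-ordered routing on the 4 x 4 mesh of Figure 2.1 (tile t sits at column t // 4, row t % 4) and uses illustrative flow sizes; it sums the predicted end-to-end transmissions over the links of each routing path to obtain a per-link workload prediction for the next time interval.

# Minimal sketch: derive per-link workload predictions from predicted
# end-to-end transmissions, assuming X-Y routing on an N x N mesh.
N = 4

def xy_path(src, dst):
    # Return the list of directed links (hop pairs) visited by X-Y routing.
    sx, sy = src // N, src % N      # assumed column/row mapping for Figure 2.1
    dx, dy = dst // N, dst % N
    path, x, y = [], sx, sy
    while x != dx:                  # route along X first
        nx = x + (1 if dx > x else -1)
        path.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                  # then along Y
        ny = y + (1 if dy > y else -1)
        path.append(((x, y), (x, ny)))
        y = ny
    return path

def predict_link_workload(predicted_flows):
    # predicted_flows: iterable of (src, dst, size) predicted for the next interval.
    workload = {}
    for src, dst, size in predicted_flows:
        for link in xy_path(src, dst):
            workload[link] = workload.get(link, 0) + size
    return workload

# Example with illustrative sizes: the flows of Figure 2.2 toward router 4
# plus its output toward router 5.
flows = [(5, 4, 4096), (6, 4, 4096), (7, 4, 2048), (4, 5, 4096)]
print(predict_link_workload(flows))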
Figure 2.2: The traffic of router 4 is tracked. The first diagram shows all the traffic input/output at router 4 (East output and East input). The second to fourth diagrams show the decomposed traffic (5 to 4, 6 to 4, 7 to 4). Note that the traffic relayed by router 4 is omitted. The last diagram shows the output traffic from router 4 to router 5 (4 to 5).
Chapter 3
Related Work

In [7], information about a switch is sent to other switches to decide the routing path and avoid congestion. The control information is exchanged locally and cannot reflect the status of the whole network. The authors of [8, 9] predict network congestion based on their proposed traffic source and router model. Using this model, each router predicts the availability of its buffers ahead of time, i.e., how many flits the router can currently accept, and the traffic source cannot inject packets until the availability is greater than zero. They predict traffic from the switch perspective, whereas our predictions are made from the perspective of the applications. In [12–15], congestion control is modeled as a utility maximization problem, and these works propose iterative algorithms as solutions to the maximization problem.

The authors in [10] make use of the operating system (OS) and let the system software control the resource usage. In [11] the authors detail a NoC communication management scheme based on a centralized, end-to-end flow control mechanism that monitors hardware statuses. All the works above need a dedicated control NoC to transfer OS-control messages and a data NoC that is responsible for delivering data packets. The OS refers to the blocked messages of the local processing element to limit the time during which the element is able to send messages. In [16], almost the same network architecture is assumed, except that extra hardware is added to support a distributed HW/SW congestion control technique. Model Predictive Control (MPC) is used for on-chip congestion control in [17]; in that work, the link utilization of a router is used as the indicator for congestion measurement. In contrast, our work makes predictions at the application layer rather than the link layer in order to capture the transmission behaviors of the running applications. We argue that these behaviors are the main cause of network congestion.
Chapter 4
Problem Formulation

We have already seen that congestion can degrade network performance considerably, so congestion in the network should be avoided as far as possible. In [18], the queueing delay is used as one metric for congestion detection. In [17], the authors use link utilization as the congestion measure. Since there is no universally accepted definition of network congestion [19], we take link utilization as the congestion measure in this thesis. The utilization of a link e_i at the t-th time interval is defined as

    Util_i(t) = D_i(t) / (T x W),    0 <= Util_i(t) <= 1,

where D_i(t) denotes the total data size transmitted over e_i during the t-th time interval, T is the length of a time interval in seconds, and W is the maximum bandwidth of a communication link. Thus T x W is the maximum possible data size transmitted in one time interval.

We assume that if the link utilization of a given link in the network exceeds a properly selected threshold Th, this link is congested. Experimental results in [17] assert that 80% link utilization results in reasonable latencies before the congestion limit. However, the selected threshold value should take hardware configurations, such as the buffer size and the link bandwidth, into consideration.

We hope to prevent the network from becoming congested before it happens. This is achieved by predicting the possible traffic at the t-th time interval and preventing several traffic sources from injecting packets concurrently. By scheduling the packet injection effectively, we can avoid network congestion and improve the average packet latency. Latency is a commonly used performance metric and can be interpreted in different ways [3]. We define latency here as the time elapsed from when the message header is injected into the network at the source node until the tail of the packet is received at the destination node. Let λ and t_exec be the average packet latency and the total execution time without any flow control, and let λ' and t'_exec be the corresponding values with our proposed flow control. Our goal is to maximize λ − λ' and t_exec − t'_exec. However, the execution time is affected by the communication dependencies between traffic sources [20]; this would require further discussion of dependencies in the program, which is beyond the scope of this thesis.
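To make the definition concrete, here is a small sketch of my own (not from the thesis) that computes Util_i(t) from a measured per-link byte count and flags a link as congested when the utilization exceeds a threshold. The 80% threshold follows [17]; the 32-byte link width matches Table 6.1 later in the thesis; the interval length is an assumed value.

# Sketch: per-link utilization and congestion test, following
# Util_i(t) = D_i(t) / (T * W) with threshold Th (illustrative constants).
INTERVAL_CYCLES = 1000          # T, assumed length of one time interval in cycles
LINK_BW_BYTES_PER_CYCLE = 32    # W, link bandwidth (32 bytes per cycle)
TH = 0.8                        # congestion threshold suggested by [17]

def link_utilization(bytes_sent_in_interval):
    # D_i(t) / (T * W), clipped to [0, 1].
    util = bytes_sent_in_interval / (INTERVAL_CYCLES * LINK_BW_BYTES_PER_CYCLE)
    return min(util, 1.0)

def is_congested(bytes_sent_in_interval, threshold=TH):
    return link_utilization(bytes_sent_in_interval) > threshold

# Example: a link that carried 28 KB in one interval is close to saturation.
print(link_utilization(28 * 1024), is_congested(28 * 1024))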
Figure 4.1: The structure of a router (a 5 x 5 crossbar connecting five input and five output ports: one pair to/from the local processor and one pair to/from each of the N., W., S. and E. routers)

  Dest. | LRU | Data Size | G4 | G3 | G2 | G1 | G0
   5    |  0  |    256    |  5 |  3 |  1 |  2 |  4
   8    |  2  |    128    |  3 |  3 |  0 |  3 |  3
  10    |  1  |    512    |  2 |  2 |  2 |  2 |  2
  13    |  3  |     64    |  5 |  4 |  3 |  5 |  4
Figure 4.2: An example of an L1-table. The columns G4:G0 (the transmission history) record the quantized transmitted data size of the last 5 time intervals.

4.1 Application-Driven Predictor

In this subsection, we show how to predict traffic using a table-driven network traffic predictor and how to make traffic control decisions with an extra table that records the delayed transmissions. The original prediction method was proposed in [1]; however, that work only discusses how to monitor and predict the traffic without interfering with it. In this thesis, the future transmissions are deeply controlled by our extended design.
  G4 | G3 | G2 | G1 | G0 | LRU | Gp
   5 |  3 |  1 |  2 |  4 |  31 |  2
   4 |  4 |  0 |  4 |  4 |  13 |  4
   5 |  4 |  2 |  5 |  3 |   5 |  0
   3 |  1 |  2 |  6 |  3 |  12 |  2
  ...
Figure 4.3: An example of an L2-table, which is indexed by the transmission history pattern G4:G0 recorded in the L1-table. The corresponding data size level Gp is the value predicted to be transmitted in the next time interval.

  Src. | Dest. | Data size | Priority
   9   |  10   |    256    |    3
   4   |   3   |     64    |    2
   3   |  12   |     32    |    0
   5   |   6   |     16    |    0
  ...
Figure 4.4: A table which records the delayed transmissions
To simplify the following discussion, we assume a 2D mesh of size N x N as the underlying topology. Note that our approach is independent of the topology and the size of the network, so it can easily be extended to other network topologies and to networks of arbitrary size. Each tile consists of a processor core, a memory module and a router. We assume that the router has 5 input and 5 output ports and a 5 x 5 crossbar; the structure of a router is shown in Figure 4.1. Each crossbar provides five connections: east, north, west, south and the local processor. Each connection consists of two uni-directional communication links for sending and receiving data, respectively. A deterministic routing algorithm is assumed, so that the path between a source and a destination is determined in advance. This is the most common type of routing algorithm in current NoC implementations.

A table-driven predictor is employed to record the traffic history, which is then used to predict the data size and the destination of the outgoing traffic from each router in the next time interval. Each router maintains two hierarchical tables for tracking and predicting the data transmission. The first-level table (L1-table), shown in Figure 4.2, tracks all output data transmissions. Each router uses only four entries to record transmission destinations, since a core may only communicate with a subset of all the cores [1]; a destination entry can be replaced using the LRU replacement policy to keep the table small. In order to map the observed patterns to a guess of the following transmission, a second-level table (L2-table) is required. At the beginning of the t-th time interval, the transmission history recorded in the L1-table is used to index the L2-table and obtain the predicted level of the transmission data size for the t-th time interval. During the t-th time interval, when an output transmission is issued by the processor core, the destination and data size are recorded in the L1-table; the data size is quantized and recorded toward G0. The columns G0 to Gn record the quantized transmitted data size of the last n + 1 time intervals. The two tables are updated at the end of each predefined time interval: after the prediction is checked, the value of the data size counter in the L1-table is quantized and shifted into G0. Finally, the updated transmission history in the L1-table is used to index the L2-table and retrieve the predicted data size level that will be transmitted in the next time interval. If the transmission history cannot be found in the L2-table, the system either creates a new entry or replaces an existing entry by LRU, and uses the last value (G0) as the predicted transmission data size level. The recorded transmitted data size levels in the L1-table are used to check the accuracy of the prediction made in the previous time interval; if the prediction was wrong, the value of Gp in the L2-table for the corresponding transmission history pattern is modified to the data size level recorded in the L1-table.

Besides the traffic predictor, we maintain another table to record the delayed transmissions, as shown in Figure 4.4. When the traffic control algorithm decides to delay a transmission, we record its source, destination and traffic size. In order to avoid starvation, we also add a priority column: each time a transmission is delayed for another interval, the value in its priority column is increased.
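The following sketch is my own illustration of the two-level lookup and update, not code from the thesis. Python dictionaries stand in for the hardware tables, a log2-style quantization is assumed (the thesis does not specify the quantization function), and no LRU eviction is modeled.

# Sketch of the table-driven predictor: an L1 history per destination and an
# L2 table mapping a history pattern to a predicted data-size level (Gp).
import math
from collections import deque

HISTORY_LEN = 5                      # G4..G0, as in Figure 4.2

def quantize(data_bytes):
    # Map a byte count to a small data-size level (assumed log2 bucketing).
    return 0 if data_bytes == 0 else min(7, int(math.log2(data_bytes)) // 2)

l1 = {}                              # dest -> deque of the last HISTORY_LEN levels
l2 = {}                              # (dest, history tuple) -> predicted level Gp

def predict(dest):
    # At the start of an interval: use the history to index the L2-table.
    hist = tuple(l1.get(dest, deque([0] * HISTORY_LEN, maxlen=HISTORY_LEN)))
    # Fall back to the most recent level (G0) when the pattern is unknown.
    return l2.get((dest, hist), hist[-1])

def update(dest, bytes_sent_this_interval, predicted_level):
    # At the end of an interval: correct the L2 entry and shift the history.
    hist = l1.setdefault(dest, deque([0] * HISTORY_LEN, maxlen=HISTORY_LEN))
    actual = quantize(bytes_sent_this_interval)
    if predicted_level != actual:            # mispredicted: overwrite Gp
        l2[(dest, tuple(hist))] = actual
    hist.append(actual)                      # shift the new level into G0

# Example: a repeated pattern toward destination 5 becomes predictable.
for size in [256, 1024, 256, 1024, 256, 1024, 256]:
    p = predict(5)
    update(5, size, p)
print(predict(5))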
Chapter 5
Traffic Control Algorithm

In this chapter, we present a heuristic algorithm for NoC traffic management. We then give some possible ways to aggregate the prediction data.

5.1 Traffic Control Algorithm and Implementation Overhead

The algorithm detailed in Algorithm 1 runs on a central control system, and the algorithm detailed in Algorithm 2 runs on each node. The control system maintains two tables: one records the transmissions that have been delayed and the other records the transmissions that are predicted. Upon receiving these transmissions, the control system has to decide which transmissions should be delayed and which should be injected. The control message inject, sent from the control system to each node, decides whether source node i may inject traffic destined for node j in the next time interval. Note that Algorithm 1 is executed at the beginning of each time interval and Algorithm 2 is executed during the time interval. Because this flow control operates at the end-to-end layer, we use inject to indicate whether source i can send packets to destination j. Figure 5.1 is a simple flow chart of our flow control algorithm.

Figure 5.1: The diagram of the flow control algorithm

At the beginning, we assume that each source can send traffic to each destination (line 3). The algorithm then decides which transmissions in the delay table may inject (lines 5-22). Each transmission has its own priority to avoid starvation (line 6); the transmission with the highest priority is the one that has been delayed the longest. The workload of a link (line 10) includes the workload that has not finished processing before and the workload that may be injected in the next time interval. If the workload of any link on the path exceeds the threshold value, the control signal is set to false (line 11); the threshold value depends on the architecture. After deciding which transmissions in the delay table may inject, the remaining transmissions update their priorities (line 23). The control system then collects the transmissions that are predicted to inject in the next time interval from the predictor and decides whether the corresponding control signal should be true or false.

Algorithm 2 is executed in each source node during a time interval. Every source node receives the control message from the control system and acts on it (line 1). When there is a transmission from source i to destination j and the control message value is true, the source node is allowed to inject traffic onto the network; otherwise, the source node must not inject any traffic and instead adds this transmission to the centralized delay table. It should be mentioned that the algorithms presented here are just one example of flow control once the NoC traffic can be predicted; other algorithms could be used to solve the flow control problem.

5.2 Data Aggregation

Figure 5.2 shows the basic idea of our proposed method. The control system is responsible for Algorithm 1 and each node is responsible for Algorithm 2.
Algorithm 1 Algorithm for the central control system
1: // Initialization. inject[src][dest] is a control message to decide injecting or not.
2: for all source-to-destination transmission pairs do
3:   inject[src][dest] = true;
4: end for
5: for all transmissions in the delay table do
6:   Select the transmission Tdelay i,j with the highest priority;
7:   Let path be the routing path of Tdelay i,j;
8:   if inject[i][j] == true then
9:     for all link in path do
10:      if link.workload > threshold then
11:        inject[i][j] = false;
12:        break;
13:      end if
14:    end for
15:    // send the control message to the node
16:    if inject[i][j] == true then
17:      Send an injecting notification to node i to inject Tdelay i,j;
18:      Update link.workload for all links on path;
19:      Delete Tdelay i,j from the delay table;
20:    end if
21:  end if
22: end for
23: Update the delay table priorities;
24: Collect predicted transmissions from the application-driven predictor;
25: for all predicted transmissions do
26:   Select the transmission Tpredict i,j with the highest priority;
27:   if inject[i][j] == true then
28:     for all link in path do
29:       if link.workload > threshold then
30:         inject[i][j] = false;
31:         break;
32:       end if
33:     end for
34:     if inject[i][j] == true then
35:       Update link.workload for all links on path;
36:       Delete Tpredict i,j from the predicted transmissions;
37:     end if
38:   end if
39: end for
Algorithm 2 Algorithm for each node i
1: Receive the control message;
2: if there is a transmission to destination j then
3:   if inject[i][j] == true then
4:     Inject the transmission;
5:   else
6:     Add the transmission to the centralized delay table;
7:   end if
8: end if
9: Update the application-driven predictor;
10: Update link.workload;

The control system sends the control signal to each node via the control network, and each node sends some information back to the control system via the control network to help the control system make its decision. The nodes communicate with each other over the data network. In [10], the authors argue that the operating system is capable of network traffic management. For this reason, our method can be adopted on the architecture platform mentioned in [10], with the control system acting as the operating system. However, this approach may be too cumbersome, so we propose an alternative: since there are many cores available, we can use a dedicated core to handle the flow control decisions. This dedicated core stands for the control system in Figure 5.2.

Figure 5.2: The diagram of flow control. The control system receives update information from the application-driven traffic predictor of each core via the control network and sends control signals back via the control network, while the cores exchange packets over the data network.

5.3 Area Occupancy

Next, we analyze the area overhead of the NTPT. In this subsection, we use the transistor counts of real many-core designs: the UC Davis AsAP has 55M transistors, and Tilera's TILE64 has 615M transistors. Assuming that each bit needs 6 transistors, the application-driven predictor in our design needs 0.69M transistors when the number of cores is 64. Because we also need to maintain another table, named the control table, to record the delayed transmissions, and assuming 128 entries, about 0.02M additional transistors are needed. The application-driven predictor thus occupies 1.29% and 0.12% of the chip in AsAP and TILE64, respectively, which is a small and tolerable area overhead. In contrast, [21] reports that increasing the data path width by 138% results in an area penalty of 64% in Xpipes, a NoC architecture, which is an extremely considerable area overhead, while the average packet latency only changes from 49 cycles to 39 cycles as the link bandwidth grows from 2.2 GB/s to 3.2 GB/s. In short, the average packet latency improves only slightly as the link bandwidth enlarges, at a huge area cost. This observation motivates us to perform injection rate flow control, since increasing the link bandwidth is not economical.
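As a quick sanity check of these area numbers, the following back-of-the-envelope calculation is my own and uses only the figures quoted above; it reproduces the fraction of the chip occupied by the predictor plus the 128-entry control table.

# Back-of-the-envelope check of the area-overhead percentages quoted above.
predictor_transistors = 0.69e6        # application-driven predictor, 64 cores
control_table_transistors = 0.02e6    # 128-entry delay (control) table
total = predictor_transistors + control_table_transistors

for chip, chip_transistors in [("AsAP", 55e6), ("TILE64", 615e6)]:
    print(chip, round(100 * total / chip_transistors, 2), "%")
# Prints roughly 1.29 % for AsAP and 0.12 % for TILE64, matching the text.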
Chapter 6
Experimental Results

In this chapter, we present the experimental results used to evaluate our proposed flow control algorithm. We adopt both real application traffic and synthetic traffic in our experiments.

6.1 Simulation Setup

The PoPNet network simulator [22] is used for our simulations, and data transmission traces are used as the input of the simulator. Each trace record contains the packet injection time, the address of the source router, the address of the destination router and the packet size. The detailed simulation configuration is given in Table 6.1. The original data transmission traces are altered by our flow control algorithm, which delays some transmissions for a period of time so as to avoid congestion. The experimental results presented below show that our algorithm achieves a large performance improvement.
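For illustration, a minimal reader for such traces might look as follows; the whitespace-separated field order is an assumption made for this sketch, not the documented trace format of the simulator.

# Sketch: read trace records assumed to be of the form
#   <injection_time> <src_x> <src_y> <dst_x> <dst_y> <packet_size>
from typing import NamedTuple, List

class TraceRecord(NamedTuple):
    time: int
    src: tuple
    dst: tuple
    size: int

def load_trace(path: str) -> List[TraceRecord]:
    records = []
    with open(path) as f:
        for line in f:
            t, sx, sy, dx, dy, size = (int(v) for v in line.split())
            records.append(TraceRecord(t, (sx, sy), (dx, dy), size))
    return records

# A delayed transmission can be modeled by shifting its injection time.
def delay(record: TraceRecord, intervals: int, interval_cycles: int = 1000) -> TraceRecord:
    return record._replace(time=record.time + intervals * interval_cycles)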
Table 6.1: Simulation Configuration

  Network Topology   | 4x4 mesh
  Virtual Channels   | 3
  Buffer Size        | 12
  Routing Algorithm  | x-y routing
  Bandwidth          | 32 bytes

6.2 Real Application Traffic

The Tilera TILE64 platform is used to run the benchmark programs and collect the data transmission traces. We use the SPLASH-2 blocked LU decomposition as our benchmark program. The total workload is 3991 packets. As shown in Table 6.2, the average packet latency drops from 2410.79 cycles to 771.858 cycles and the maximum packet latency drops from 5332 cycles to 3242 cycles. The significant performance improvement stems from the fact that we predict the traffic workload in the next interval and delay some packet injections to avoid congestion.

Table 6.2: Our proposed flow control algorithm leads to a large reduction in latency with a slight execution time overhead.

                     | Original       | Pattern-oriented | Reduction
  Ave. latency       | 2410.79 cycles | 771.858 cycles   | 3.12
  Max. latency       | 5332 cycles    | 3242 cycles      | 1.64
  Simulation Cycle   | 5600 cycles    | 6100 cycles      | 0.92

As depicted in Figure 6.1(a), the packet latencies without flow control range between 0 and 5500 cycles. With our proposed flow control algorithm, the packet latencies range between 0 and 3300 cycles; the latencies decrease so sharply that the histogram shifts to the left side. To further support this claim, Figure 6.2 shows more details about the network congestion. We set the congestion threshold to 40 flits. The line in Figure 6.2(b) occasionally rises above the threshold because of wrong predictions of the network traffic; however, the impact of misprediction is slight, so the result remains within an acceptable range. In Figure 6.2(a), without flow control, the maximum workload is far above the threshold and consequently causes severe network congestion.

6.3 Synthetic Traffic

Besides the real application traffic, we also apply our algorithm to synthetic traffic. In [20], the authors state that injected network traffic possesses self-similar temporal properties. They use a single parameter, the Hurst exponent H, to capture the temporal burstiness characteristic of NoC traffic. Based on this traffic model, we synthesize our traffic traces. In Table 6.3, we give some instances with different values of H and make comparisons.
Figure 6.1: Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically. (Both histograms plot the number of packets against the packet latency in cycles.)
These parameters are chosen based on [20]: Table 1 in [20] lists several values of the Hurst exponent H, and we choose some of them for convenience. The average packet latency and the maximum latency both drop significantly. Moreover, the execution time with our proposed flow control is slightly shorter than that without flow control. Relatively large H values indicate highly self-similar traffic and a higher traffic prediction accuracy. However, because the average packet size also increases with H, the reduction does not grow linearly with H.
Table 6.3: Our proposed flow control algorithm for synthetic traffic leads to a large reduction in the average and maximum latency and a slight reduction in the execution time.

  H                                | 0.576   | 0.661   | 0.768   | 0.855   | 0.978
  Original Ave. latency (cycles)   | 3553.14 | 3596.45 | 3649.21 | 3665.53 | 3614.56
  Improved Ave. latency (cycles)   | 482.512 | 467.787 | 387.716 | 412.983 | 417.577
  Reduction of Ave. latency        | 7.364   | 7.688   | 9.412   | 8.876   | 8.656
  Original Max. latency (cycles)   | 7623    | 7623    | 7710    | 7658    | 7714
  Improved Max. latency (cycles)   | 1591    | 1532    | 1016    | 1054    | 1037
  Reduction of Max. latency        | 4.791   | 4.976   | 7.589   | 7.266   | 7.438
  Original Simulation Cycle        | 8580    | 8510    | 8550    | 8480    | 8450
  Improved Simulation Cycle        | 8280    | 8260    | 7690    | 7781    | 7731
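The Reduction rows in Tables 6.2 and 6.3 appear to be the ratio of the original value to the improved value; the quick check below is my own arithmetic, not taken from the thesis, and reproduces the reported numbers.

# Reproduce the "Reduction" entries as original / improved ratios.
pairs = {
    "Table 6.2 Ave. latency": (2410.79, 771.858),            # reported 3.12
    "Table 6.2 Max. latency": (5332, 3242),                   # reported 1.64
    "Table 6.2 Simulation Cycle": (5600, 6100),               # reported 0.92
    "Table 6.3 Ave. latency (H=0.576)": (3553.14, 482.512),   # reported 7.364
    "Table 6.3 Max. latency (H=0.768)": (7710, 1016),         # reported 7.589
}
for name, (original, improved) in pairs.items():
    print(name, round(original / improved, 3))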
Chapter 7
Conclusion and Future Works

This thesis proposes an application-oriented flow control mechanism for packet-switched networks-on-chip. By tracking and predicting the end-to-end transmission behavior of the running applications, we can limit traffic injection when the network is heavily loaded. By delaying some transmissions judiciously, the average packet latency can be decreased significantly, so that the overall performance improves noticeably. In our experiments, we adopt real application traffic traces as well as synthetic traffic traces. The experimental results show that our proposed flow control not only decreases the average and maximum packet latency, but under some conditions can even shorten the execution time.

Future work will focus on improving the accuracy of the application-oriented traffic prediction. The simulation configuration should also be studied further; determining the optimal parameters and adjusting the flow control algorithm accordingly are also important. In addition, we currently ignore the communication dependencies between the traffic traces because of the difficulty of modeling them.
Bibliography

[1] Y. S.-C. Huang, C.-K. Chou, C.-T. King, and S.-Y. Tseng, "NTPT: On the end-to-end traffic prediction in the on-chip networks", in Proc. 47th ACM/IEEE Design Automation Conference, 2010.

[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "TILE64 processor: A 64-core SoC with mesh interconnect", in Proc. Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC 2008), Feb. 3–7, 2008, pp. 88–598.

[3] Jose Duato, Sudhakar Yalmanchili, and Lionel Ni, "Interconnection networks", 2002, pp. 428–431.

[4] S. Mascolo, "Classical control theory for congestion avoidance in high-speed internet", in Proc. Decision and Control Conference, 1999.

[5] Cui-Qing Yang, "A taxonomy for congestion control algorithms in packet switching networks", in IEEE Network, 1995.

[6] Hua Yongru Gu, Wang Hua O., and Hong Yiguang, "A predictive congestion control algorithm for high speed communication networks", in Proc. American Control Conference, 2001.

[7] Erland Nilsson, Mikael Millberg, Johnny Öberg, and Axel Jantsch, "Load distribution with the proximity congestion awareness in a network on chip", in Proc. Design, Automation, and Test in Europe, 2003, p. 11126.

[8] U. Y. Ogras and R. Marculescu, "Prediction-based flow control for network-on-chip traffic", in Proc. 43rd ACM/IEEE Design Automation Conference, 2006, pp. 839–844.

[9] U. Y. Ogras and R. Marculescu, "Analysis and optimization of prediction-based flow control in networks-on-chip", in ACM Transactions on Design Automation of Electronic Systems, 2008.

[10] Vincent Nollet, Théodore Marescaux, and Diederik Verkest, "Operating-system controlled network on chip", in Proc. 41st ACM/IEEE Design Automation Conference, 2004.

[11] P. Avasare, J-Y. Nollet, D. Verkest, and H. Corporaal, "Centralized end-to-end flow control in a best-effort network-on-chip", in Proc. 5th ACM International Conference on Embedded Software, 2005.

[12] Mohammad S. Talebi, Fahimeh Jafari, and Ahmad Khonsari, "A novel congestion control scheme for elastic flows in network-on-chip based on sum-rate optimization", in ICCSA, 2007.

[13] M. S. Talebi, F. Jafari, and A. Khonsari, "A novel flow control scheme for best effort traffic in NoC based on source rate utility maximization", in MASCOTS, 2007.

[14] Mohammad S. Talebi, Fahimeh Jafari, Ahmad Khonsari, and Mohammad H. Yaghmaee, "Best effort flow control in network-on-chip", in CSICC, 2008.

[15] Fahimeh Jafari, Mohammad S. Talebi, Mohammad H. Yaghmaee, Ahmad Khonsari, and Mohamed Ould-Khaoua, "Throughput-fairness tradeoff in best effort flow control for on-chip architectures", in Proc. 2009 IEEE International Symposium on Parallel and Distributed Processing, 2009.

[16] T. Marescaux, A. Rångevall, V. Nollet, A. Bartic, and H. Corporaal, "Distributed congestion control for packet switched networks on chip", in ParCo, 2005.

[17] J. W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, "Congestion-controlled best-effort communication for networks-on-chip", in Proc. Design, Automation, and Test in Europe, 2007.

[18] Jin Yuho, Yum Ki Hwan, and Kim Eun Jung, "Adaptive data compression for high-performance low-power on-chip networks", in Proc. 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008.

[19] Keshav Srinivasan, "Congestion control in computer networks", 1991.

[20] Vassos Soteriou, Hangsheng Wang, and Li-Shiuan Peh, "A statistical traffic model for on-chip interconnection networks", in Proc. 14th IEEE International Symposium on Modeling, Analysis, and Simulation, 2006.

[21] Anthony Leroy, "Optimizing the on-chip communication architecture of low power systems-on-chip in deep sub-micron technology", 2006.

[22] N. Agarwal, T. Krishna, L. Peh, and N. Jha, "Garnet: A detailed on-chip network model inside a full-system simulator", in Proceedings of the International Symposium on Performance Analysis of Systems and Software, 2009.