This work was done as an assignment and as part of the exam for the Distributed Systems course, while attending the Master's Degree in Computer Engineering at the University of Padua.
If you find anything wrong or unclear, or if you disagree with the work done or with the grades of the assessment, please let me know.
1. Traffic-aware Frequency Scaling for
Balanced On-Chip Networks on GPGPUs
Luca Sinico
23/03/2016
Review of the paper:
Chiao-Yun Tu, Yuan-Ying Chang, Chung-Ta King, Chien-Ting Chen, Tai-Yuan Wang
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan (2014)
Università degli Studi di Padova
Dipartimento di Ingegneria dell’Informazione
2. Outline
I. Introduction
II. Problem description
III. The proposed solution
IV. Characterization of the GPGPUs’ traffic patterns
V. Traffic-Aware DFS design
VI. Evaluation of the DFS performance
VII. Overview on related works
VIII. Paper assessment
3. Terminology
NoC = On-Chip Network
Perfect NoC = NoC with zero latency and infinite bandwidth
Shader cores = “computational cores” of the GPUs
MC = Memory Controller
Request Network = network from shader cores to MCs
Reply Network = network from MCs to shader cores
Flits = FLow control digITs
(large network packets are broken into small pieces,
called flits)
4. I. Introduction(1)
o GPGPU: what & why?
• General-Purpose computing on Graphics Processing Units
• over the past decades:
from a fixed-function drawing device …
… to a general-purpose, programmable computing engine
• can offer ten to a hundred times the computing power of
general-purpose processors (CPUs) for highly parallel
applications
5. I. Introduction(2)
o GPGPU: issues…
• more sensitive to bandwidth than to latency
• the many-to-few-to-many traffic pattern creates a bottleneck
(think of a tollbooth)
• the NoC must provide sufficient bandwidth to guarantee the full
performance of GPGPUs
6. II. Problem description
o A less obvious issue:
the imbalance between request and reply traffic, caused by
the nature of the read/write instructions
• (e.g., think of a “strange tollbooth”:
for each car entering, i.e. a read request made of an 8-byte address,
a bus exits, i.e. a read reply carrying a 128-byte data packet).
o Typical GPGPU applications will have different mixes of
memory reads and writes at different stages of the
computation ( = different traffic patterns over the execution)
7. III. The proposed solution
o The proposed solution:
scaling the network frequency dynamically to balance the
throughput of the request and reply networks, by means of a
Traffic-Aware Dynamic Frequency Scaling approach
o Two steps to design a traffic-aware NoC:
detecting the traffic pattern of the application;
adjusting the bandwidth of different parts of the NoC
accordingly.
8. IV. Characterization of the traffic
patterns(1)
Three types of traffic pattern:
o Type A: few memory operations (either read or write)
light memory traffic into the request network
the reply traffic is also light
the resources of NoC are under-utilized
there are opportunities to throttle the components with low
utilization
9. IV. Characterization of the traffic
patterns(2)
o Type B: large amount of memory operations, mainly
read operations
many small read request packets in the request network; the
same number of large data packets in the reply network
each single MC has to inject many more packets during the
interval than a shader core does
the routers connected to the MCs in the reply network are
easily congested, causing the MCs to stall
the reply network requires more bandwidth
10. IV. Characterization of the traffic
patterns(3)
o Type C: large amount of memory operations, mainly
write operations
many large write request packets in the request network;
the same number of small acknowledgement packets in the
reply network
due to the many-to-few-to-many architecture, the traffic on
the reply network is also heavy
both the request and reply networks require more
bandwidth
11. V. Traffic-Aware DFS design(1)
o Two cardinal ratios (the average injection rate and the read/write ratio),
computed every x cycles.
o The “algorithm” that
realizes the Traffic-Aware
DFS mechanism :
12. V. Traffic-Aware DFS design(2)
o Three threshold values:
• Threshold_low
• R_Threshold_high
• W_Threshold_high
o The algorithm that
dynamically controls
the frequency:
13. V. Traffic-Aware DFS design(3)
o First section:
if average_injection_rate falls below Threshold_low, the traffic is of
Type A: the DFS controller decreases the frequency of both the
request and the reply networks;
else…
14. V. Traffic-Aware DFS design(4)
o Second section:
else…
Different thresholds are then used to decide whether
average_injection_rate is so high that it may cause network
congestion in the next period
• if so, then for the Type B pattern the frequency of the reply
network is increased and the frequency of the request
network may be reduced …
15. V. Traffic-Aware DFS design(5)
o Third section:
else…
• if so, then for the Type C pattern the frequency of both
networks is increased
• else, i.e. if average_injection_rate is below the
thresholds, for both Type B and Type C the network can
handle the traffic load and the network frequency is
kept at F_base
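The three sections above can be sketched as a small decision function. The threshold names (Threshold_low, R_Threshold_high, W_Threshold_high) and the frequency levels come from the slides; the concrete threshold values, the read_ratio input, and the exact decision structure are illustrative assumptions, not the paper's verbatim algorithm.

```python
# Hedged sketch of the traffic-aware DFS decision logic (slides 13-15).
# Frequency levels follow slide 18: F_low = F_base/2, F_high = F_base*1.3.

F_BASE = 1.0           # baseline network frequency (normalized)
F_LOW = F_BASE / 2     # throttled frequency
F_HIGH = F_BASE * 1.3  # boosted frequency

# Assumed threshold values, chosen only for illustration.
THRESHOLD_LOW = 0.1
R_THRESHOLD_HIGH = 0.5
W_THRESHOLD_HIGH = 0.5

def dfs_decision(avg_injection_rate, read_ratio):
    """Return (f_request, f_reply) for the next DFS period."""
    if avg_injection_rate < THRESHOLD_LOW:
        # Type A: light traffic, throttle both networks
        return F_LOW, F_LOW
    if read_ratio >= 0.5:
        # Mostly reads: large data packets travel on the reply network
        if avg_injection_rate > R_THRESHOLD_HIGH:
            # Type B: boost the reply network, throttle the request network
            return F_LOW, F_HIGH
    else:
        # Mostly writes: large packets on the request network, heavy replies too
        if avg_injection_rate > W_THRESHOLD_HIGH:
            # Type C: boost both networks
            return F_HIGH, F_HIGH
    # Load manageable: keep both networks at the baseline frequency
    return F_BASE, F_BASE
```

The final fall-through return corresponds to the "else" case of the third section, where the frequency is kept at F_base.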
16. V. Traffic-Aware DFS design(6)
o What about the Frequency Scaling Policy to use?
• Ideal Frequency Tuning
Ideally, the frequency of the two networks should be scaled
proportional to the traffic pattern by means of a balance ratio (B_ratio).
• Practical Frequency Tuning
In practice, there is a limited range over which the frequency of a network
can scale under a fixed voltage.
The proposed DFS scheme must choose the networks' frequencies from a
fixed set.
17. V. Traffic-Aware DFS design(7)
1) Ideal Frequency Tuning
• The doubling strategy attempts to strike a balance between
responsiveness and stability of frequency scaling.
• Why study Ideal Frequency Tuning?
To understand the upper bound of the performance achievable
by a DFS mechanism.
[Table: ideal F_req and F_reply settings per traffic pattern (Types A, B, C); cell values not recovered]
18. V. Traffic-Aware DFS design(8)
2) Practical Frequency Tuning
• Since the difference in traffic load between the request and reply
networks may be as high as tenfold, the best that can be done is to
adjust the frequency between the extremes.
• There is a transition overhead when the frequency is changed.
However, with the current technology, the frequency transition
overhead may be negligible depending on how often the frequency is
changed.
With:
F_low = F_base / 2
F_high = F_base * 1.3

Traffic Pattern   F_req    F_reply
Type A            F_low    F_low
Type B            F_low    F_high
Type C            F_high   F_high
19. V. Traffic-Aware DFS design(9)
o The required hardware:
• 4 x 13-bit counters in each monitored shader core:
one to frame the time interval (5,000 cycles);
three for the flit count, the read request count and the write request count.
• a control network to deliver the data to the
central DFS controller;
• a 15-bit adder and a divider inside the DFS controller, to
calculate the ratios;
• a floating-point comparator inside the DFS controller, for
comparing the ratios against the threshold values.
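The monitoring hardware above can be sketched in software. The counter set and the 5,000-cycle interval come from the slides; the class layout, method names, and the exact ratio formulas are illustrative assumptions (the slides only specify that an adder and a divider compute the ratios).

```python
# Sketch of the per-core monitoring counters described on slide 19.
# Each monitored shader core keeps a cycle counter, a flit counter,
# and read/write request counters over a 5,000-cycle interval.

INTERVAL = 5_000  # DFS period in cycles (slide 22)

class CoreMonitor:
    """One monitored shader core's four counters (layout assumed)."""
    def __init__(self):
        self.cycles = 0   # frames the time interval
        self.flits = 0    # flits injected into the request network
        self.reads = 0    # read requests issued
        self.writes = 0   # write requests issued

    def tick(self, flits_injected=0, is_read=False, is_write=False):
        """Advance one cycle, recording this cycle's activity."""
        self.cycles += 1
        self.flits += flits_injected
        self.reads += int(is_read)
        self.writes += int(is_write)

    def interval_done(self):
        return self.cycles >= INTERVAL

def ratios(monitors):
    """Aggregate counters from the monitored cores into the two ratios
    consumed by the DFS controller (formulas are assumptions)."""
    flits = sum(m.flits for m in monitors)
    cycles = sum(m.cycles for m in monitors)
    reads = sum(m.reads for m in monitors)
    writes = sum(m.writes for m in monitors)
    avg_injection_rate = flits / cycles if cycles else 0.0
    read_ratio = reads / (reads + writes) if reads + writes else 0.0
    return avg_injection_rate, read_ratio
```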
20. VI. Evaluation of the DFS performance(1)
1. Simulation Setup
GPGPU-Sim is used to model a modern GPU
(with micro-architecture parameters similar to the NVIDIA GeForce GTX 480)
ORION 2.0 is used for network power consumption estimation
15 benchmark applications
a two-letter classification scheme is used to classify the applications:
first letter: whether the speedup with a perfect NoC is high or low
(exceeds 30% or not);
second letter: whether the benchmark injects heavy or light traffic
into the NoC;
the 15 applications are divided into three groups:
5 of them are LL, 3 are LH, and 7 are HH.
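The two-letter scheme above can be written as a small helper. The function name and input representation are assumptions; the 30% cutoff and the letter meanings come from the slide.

```python
# Sketch of the two-letter benchmark classification from slide 20.

def classify(perfect_noc_speedup, heavy_traffic):
    """Return the two-letter class of a benchmark application.

    perfect_noc_speedup: fractional speedup obtained with a perfect NoC
    heavy_traffic: whether the benchmark injects heavy traffic into the NoC
    """
    # First letter: H if the perfect-NoC speedup exceeds 30%
    first = 'H' if perfect_noc_speedup > 0.30 else 'L'
    # Second letter: heavy (H) or light (L) NoC traffic
    second = 'H' if heavy_traffic else 'L'
    return first + second
```

Under this scheme, the 15 benchmarks split into 5 LL, 3 LH and 7 HH applications.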
21. VI. Evaluation of the DFS performance(2)
2. Evaluation Results
a) Number of shader cores to monitor
How many shader cores are needed for an accurate identification?
• Since GPGPU applications are typically data parallel, shader cores
will have similar memory access behaviors.
• By monitoring only 1 or 2 shader cores, the correlation coefficient
stays under 0.6 (too low), while with 4, 8 and 14 monitored cores it is
greater than 0.7. The authors therefore chose to monitor 8 shader
cores.
22. VI. Evaluation of the DFS performance(3)
2. Evaluation Results
b) DFS period
How long should the time period used to evaluate the frequency setting
of the networks be?
• The frequency transition latency between different frequency
levels is about 3 ns.
• The quantum of time for frequency scaling should be two to three
orders of magnitude larger than the transition overhead, therefore
a time interval smaller than 1,000 cycles is not appropriate.
• Among 1,000, 5,000 and 25,000 cycles, 5,000 cycles turns out to be
the most suitable interval, both in terms of scaling overhead and of
execution-time reduction.
23. VI. Evaluation of the DFS performance(4)
2. Evaluation Results
c) Speedup comparison (execution time reduction)
Comparison of the practical DFS scheme with a perfect NoC and a static
1:4 scheme in terms of overall execution time normalized to the baseline
setting.
24. VI. Evaluation of the DFS performance(5)
• LL benchmarks: the practical DFS scheme improves performance by only 0.1%.
However, the perfect NoC does not perform well either, with just a 1.8%
improvement.
• LH benchmarks: the DFS scheme improves performance by 2.5%.
Note that for some programs, such as RAY and LIB (mainly Type C), the DFS
mechanism does not degrade performance, unlike the static 1:4
method.
• For the HH benchmarks:
- The perfect NoC can achieve an overall average of 88% speedup.
- The practical DFS improves by 14% on average.
(For some workloads, such as BFS and RD, the practical DFS
mechanism can achieve a performance improvement up to 27%).
- The static 1:4 method shows a 36% average speedup, but it cannot
adapt to the Type C benchmarks, such as FWT and TRA, which lose
14% and 20% respectively.
25. VI. Evaluation of the DFS performance(6)
2. Evaluation Results
d) Power Consumption
• Static power:
Since the static power under the same voltage is the same,
the consumption is proportional to the total execution time.
- For LL and LH benchmarks: reduction of 0.6% and 4.62%.
- For HH benchmarks: reduction of 14.05%.
- Overall: 7.5% energy saving across the 15 applications.
• Dynamic power:
On average, dynamic power consumption is slightly higher than the
baseline.
More energy could be saved, particularly for the LL benchmarks, if the
routers could be put to sleep.
26. VI. Evaluation of the DFS performance(7)
2. Evaluation Results
e) Performance Gain by the Ideal DFS mechanism
It is assumed that the frequency range is infinite and that the frequency
can be adjusted without constraints.
27. VI. Evaluation of the DFS performance(8)
• For the LL and LH benchmarks, the Ideal DFS mechanism provides 1% and 6%
speedup respectively.
• For the HH benchmarks, the Ideal DFS mechanism achieves a 55% improvement
on average.
For some of them, such as BFS and RD, the Ideal DFS mechanism yields
large performance improvements, of up to 74% and 158% respectively.
The reason is that the B_ratio succeeds in balancing the seriously
imbalanced injection rates of shader cores and MCs for these two
benchmarks, which consist of a large amount of read requests (mainly Type B).
• Overall, the Ideal DFS provides 26% performance benefits across the 15
applications, on average.
28. VII. Overview on related works
- NoCs for GPUs have been explored by Bakhoda et al. [6], who evaluate the
impact of different network parameters (2009).
- In [2], Bakhoda et al. point out that GPU applications are bandwidth
sensitive and latency insensitive, while the many-to-few-to-many
architecture can create a bottleneck in the NoC. They propose a
throughput-effective NoC design, adopting a checkerboard organization and
multi-port routers to solve the bottleneck (2010).
- In [12], Kim et al. propose a heterogeneous network in which the reply
network is a direct all-to-all network overlaid on the mesh to remove
contention (2012).
- Several works have employed DVFS policies to manage power
consumption [13], [14], [15] (2002, 2008, 2009).
- In [16], Lee et al. optimize the throughput of a GPU under a specified power
constraint by dynamically scaling the number of cores and the
voltage/frequency of cores and on-chip interconnects/caches (2011).
29. VIII. Paper assessment(1)
o Over a [1-5] range of assessment:
• ORIGINALITY: 3
• TECHNICAL IMPACT: 4
• CLARITY: 4
• SCIENTIFIC SIGNIFICANCE: 3
30. VIII. Paper assessment(2)
Originality: 3
this work is on “the same wavelength” as other works from the same
years;
it relies heavily on the notation and results of Bakhoda et al. [2];
it focuses on a problem not treated by other works;
it exploits solutions previously adopted for other problematic aspects
(like DVFS for the management of power consumption);
it proposes a simple but effective solution.
31. VIII. Paper assessment(3)
Technical Impact: 4
it directly provides the hardware and software implementation of the
proposed solution;
the solution can be easily implemented on GPGPUs at low cost;
the proposed solution brings useful performance gains;
there is no performance degradation (unlike the static 1:4 scheme for
some of the applications);
however, the performance gains (7.4% on average, and up to 27%) are
in line with the improvements obtained by other works' solutions
(15-20%); that is, it does not provide a disruptive improvement in
performance compared with other results.
32. VIII. Paper assessment(4)
Clarity: 4
the path from the analyzed problem to the solution is well traced;
charts and diagrams are shown and quite well commented;
the numeric results are quite well reported;
however, some parts are explained a little too quickly (e.g.
power consumption);
some numeric results could be reported in a clearer, more schematic
way.
33. VIII. Paper assessment(5)
Scientific Significance: 3
the proposed solution is fully implementable with currently available
technology;
it is quite generic and not restricted to only certain types of
GPGPUs;
it is independent of the topology of the NoC;
it remains viable as the density of integrated chips keeps
increasing;
some minor improvements are cited and left for future work, so there
is room to enhance performance further;
however, the proposed solution is an incremental improvement, not a
disruptive one;
the average speedup is about 7.4%.
34. Thanks for your attention
Luca Sinico
Attending the Master’s Degree in Computer
Engineering at the University of Padua (Italy)
Review made for the “Distributed Systems” course
luca.sinico@gmail.com