This work was done as an assignment and as part of the exam for the Distributed Systems course, while attending the Master's Degree in Computer Engineering at the University of Padua.
If you find anything wrong or unclear, or if you disagree with the work done or with the grades of the assessment, please let me know.
1. Traffic-aware Frequency Scaling for
Balanced On-Chip Networks on GPGPUs
Luca Sinico
23/03/2016
Review of the paper:
Chiao-Yun Tu, Yuan-Ying Chang, Chung-Ta King, Chien-Ting Chen, Tai-Yuan Wang
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan (2014)
Università degli Studi di Padova
Dipartimento di Ingegneria dell’Informazione
2. Outline
I. Introduction
II. Problem description
III. The proposed solution
IV. Characterization of the GPGPUs’ traffic patterns
V. Traffic-Aware DFS design
VI. Evaluation of the DFS performance
VII. Overview on related works
VIII. Paper assessment
3. Terminology
NoC = On-Chip Network
Perfect NoC = NoC with zero latency and infinite bandwidth
Shader cores = “computational cores” of the GPUs
MC = Memory Controller
Request Network = network from shader cores to MCs
Reply Network = network from MCs to shader cores
Flits = FLow control digITs
(large network packets are broken into small pieces,
called flits)
4. I. Introduction(1)
o GPGPU: what & why?
• General-Purpose computing on Graphics Processing Units
• over the past decades:
from a fixed-function drawing device …
… to a general-purpose, programmable computing engine
• can offer ten to a hundred times the computing power of
general-purpose processors (CPUs) for highly parallel
applications
5. I. Introduction(2)
o GPGPU: issues…
• more sensitive to bandwidth than to latency
• the many-to-few-to-many traffic pattern creates a bottleneck
(think of a tollbooth)
• the NoC must provide sufficient bandwidth to guarantee the full
performance of GPGPUs
6. II. Problem description
o A less obvious issue:
the imbalance between request and reply traffic, caused by
the nature of the read/write instructions
• (e.g., think of a “strange tollbooth”:
for each car entering, i.e. a read request made of an 8-byte address,
a bus exits, i.e. a read reply carrying a 128-byte data packet).
o Typical GPGPU applications will have different mixes of
memory reads and writes at different stages of the
computation ( = different traffic patterns over the execution)
7. III. The proposed solution
o The proposed solution:
scaling the network frequency dynamically to balance the
throughput of the request and reply networks, by means of a
Traffic-Aware Dynamic Frequency Scaling approach
o Two steps to design a traffic-aware NoC:
detecting the traffic pattern of the application;
adjusting the bandwidth of different parts of the NoC
accordingly.
8. IV. Characterization of the traffic
patterns(1)
Three types of traffic pattern:
o Type A: few memory operations (either read or write)
light memory traffic into the request network
the reply traffic is also light
the resources of NoC are under-utilized
there are opportunities to throttle the components with low
utilization
9. IV. Characterization of the traffic
patterns(2)
o Type B: large amount of memory operations, mainly
read operations
many small read request packets in the request network; the
same number of large data packets in the reply network
each single MC has to inject many more packets during the
interval than a shader core does
the routers connected to the MCs in the reply network are
easily congested, causing the MCs to stall
the reply network requires more bandwidth
10. IV. Characterization of the traffic
patterns(3)
o Type C: large amount of memory operations, mainly
write operations
many large write request packets in the request network;
the same number of small acknowledgement packets in the
reply network
due to the many-to-few-to-many architecture, the traffic on
the reply network is also heavy
both the request and reply networks require more
bandwidth
11. V. Traffic-Aware DFS design(1)
o Two cardinal ratios (the average injection rate and the read/write ratio),
computed every x cycles.
o The “algorithm” that
realizes the Traffic-Aware
DFS mechanism :
12. V. Traffic-Aware DFS design(2)
o Three threshold values:
• Threshold_low
• R_Threshold_high
• W_Threshold_high
o The algorithm that
dynamically controls
the frequency:
13. V. Traffic-Aware DFS design(3)
o First section:
if average_injection_rate falls below Threshold_low, the traffic is of
Type A: the DFS controller decreases the frequency of both the
request and the reply networks;
else…
14. V. Traffic-Aware DFS design(4)
o Second section:
else…
Different thresholds are then used to decide whether
average_injection_rate is so high that it may cause network
congestion in the next period
• if so, then for the Type B pattern the frequency of the reply
network is increased and the frequency of the request
network may be reduced …
15. V. Traffic-Aware DFS design(5)
o Third section:
else…
• if so, then for the Type C pattern the frequency of both
networks is increased
• else, i.e. if average_injection_rate is below the
thresholds, for both Type B and Type C the network can
handle the traffic load and the network frequency is
kept at F_base
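The three sections above can be sketched as a small decision function. The threshold names (Threshold_low, R_Threshold_high, W_Threshold_high) and the frequency levels come from the slides; the concrete threshold values, the read_ratio input, and the exact decision structure are illustrative assumptions, not the paper's verbatim algorithm.

```python
# Hedged sketch of the traffic-aware DFS decision logic (slides 13-15).
# Frequency levels follow slide 18: F_low = F_base/2, F_high = F_base*1.3.

F_BASE = 1.0           # baseline network frequency (normalized)
F_LOW = F_BASE / 2     # throttled frequency
F_HIGH = F_BASE * 1.3  # boosted frequency

# Assumed threshold values, chosen only for illustration.
THRESHOLD_LOW = 0.1
R_THRESHOLD_HIGH = 0.5
W_THRESHOLD_HIGH = 0.5

def dfs_decision(avg_injection_rate, read_ratio):
    """Return (f_request, f_reply) for the next DFS period."""
    if avg_injection_rate < THRESHOLD_LOW:
        # Type A: light traffic, throttle both networks
        return F_LOW, F_LOW
    if read_ratio >= 0.5:
        # Mostly reads: large data packets travel on the reply network
        if avg_injection_rate > R_THRESHOLD_HIGH:
            # Type B: boost the reply network, throttle the request network
            return F_LOW, F_HIGH
    else:
        # Mostly writes: large packets on the request network, heavy replies too
        if avg_injection_rate > W_THRESHOLD_HIGH:
            # Type C: boost both networks
            return F_HIGH, F_HIGH
    # Load manageable: keep both networks at the baseline frequency
    return F_BASE, F_BASE
```

The final fall-through return corresponds to the "else" case of the third section, where the frequency is kept at F_base.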
16. V. Traffic-Aware DFS design(6)
o What about the Frequency Scaling Policy to use?
• Ideal Frequency Tuning
Ideally, the frequency of the two networks should be scaled
proportional to the traffic pattern by means of a balance ratio (B_ratio).
• Practical Frequency Tuning
In practice, there is a limited range over which the frequency of a network
can scale under a fixed voltage.
The proposed DFS scheme must choose the networks' frequencies from a
fixed set.
17. V. Traffic-Aware DFS design(7)
1) Ideal Frequency Tuning
• The doubling strategy attempts to strike a balance between
responsiveness and stability of frequency scaling.
• Why study Ideal Frequency Tuning?
To understand the upper bound of the performance achievable
by a DFS mechanism.
[Table: ideal F_req and F_reply settings per traffic pattern (Types A, B, C); cell values not recovered]
18. V. Traffic-Aware DFS design(8)
2) Practical Frequency Tuning
• Since the difference in traffic load between the request and reply
networks may be as high as tenfold, the best that can be done is to
adjust the frequency between the extremes.
• There is a transition overhead when the frequency is changed.
However, with the current technology, the frequency transition
overhead may be negligible depending on how often the frequency is
changed.
With:
F_low = F_base / 2
F_high = F_base * 1.3

Traffic Pattern   F_req    F_reply
Type A            F_low    F_low
Type B            F_low    F_high
Type C            F_high   F_high
19. V. Traffic-Aware DFS design(9)
o The required hardware:
• 4 x 13-bit counters in each monitored shader core:
one to frame the time interval (5,000 cycles);
three for the flit count, the read request count and the write request count.
• a control network to deliver the data to the
central DFS controller;
• a 15-bit adder and a divider inside the DFS controller, to
calculate the ratios;
• a floating-point comparator inside the DFS controller, for
comparing the ratios against the threshold values.
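The monitoring hardware above can be sketched in software. The counter set and the 5,000-cycle interval come from the slides; the class layout, method names, and the exact ratio formulas are illustrative assumptions (the slides only specify that an adder and a divider compute the ratios).

```python
# Sketch of the per-core monitoring counters described on slide 19.
# Each monitored shader core keeps a cycle counter, a flit counter,
# and read/write request counters over a 5,000-cycle interval.

INTERVAL = 5_000  # DFS period in cycles (slide 22)

class CoreMonitor:
    """One monitored shader core's four counters (layout assumed)."""
    def __init__(self):
        self.cycles = 0   # frames the time interval
        self.flits = 0    # flits injected into the request network
        self.reads = 0    # read requests issued
        self.writes = 0   # write requests issued

    def tick(self, flits_injected=0, is_read=False, is_write=False):
        """Advance one cycle, recording this cycle's activity."""
        self.cycles += 1
        self.flits += flits_injected
        self.reads += int(is_read)
        self.writes += int(is_write)

    def interval_done(self):
        return self.cycles >= INTERVAL

def ratios(monitors):
    """Aggregate counters from the monitored cores into the two ratios
    consumed by the DFS controller (formulas are assumptions)."""
    flits = sum(m.flits for m in monitors)
    cycles = sum(m.cycles for m in monitors)
    reads = sum(m.reads for m in monitors)
    writes = sum(m.writes for m in monitors)
    avg_injection_rate = flits / cycles if cycles else 0.0
    read_ratio = reads / (reads + writes) if reads + writes else 0.0
    return avg_injection_rate, read_ratio
```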
20. VI. Evaluation of the DFS performance(1)
1. Simulation Setup
GPGPU-Sim is used to model a modern GPU
(with micro-architecture parameters similar to the NVIDIA GeForce GTX 480)
ORION 2.0 is used for network power consumption estimation
15 benchmark applications
a two-letter classification scheme is used to classify the applications:
first letter: whether the speedup with a perfect NoC is high or low
(exceeds 30% or not);
second letter: whether the benchmark injects heavy or light traffic
into the NoC;
the 15 applications are divided into three groups:
5 of them are LL, 3 are LH, and 7 are HH.
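The two-letter scheme above can be written as a small helper. The function name and input representation are assumptions; the 30% cutoff and the letter meanings come from the slide.

```python
# Sketch of the two-letter benchmark classification from slide 20.

def classify(perfect_noc_speedup, heavy_traffic):
    """Return the two-letter class of a benchmark application.

    perfect_noc_speedup: fractional speedup obtained with a perfect NoC
    heavy_traffic: whether the benchmark injects heavy traffic into the NoC
    """
    # First letter: H if the perfect-NoC speedup exceeds 30%
    first = 'H' if perfect_noc_speedup > 0.30 else 'L'
    # Second letter: heavy (H) or light (L) NoC traffic
    second = 'H' if heavy_traffic else 'L'
    return first + second
```

Under this scheme, the 15 benchmarks split into 5 LL, 3 LH and 7 HH applications.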
21. VI. Evaluation of the DFS performance(2)
2. Evaluation Results
a) Number of shader cores to monitor
How many shader cores are needed for an accurate identification?
• Since GPGPU applications are typically data parallel, shader cores
will have similar memory access behaviors.
• By monitoring only 1 or 2 shader cores, the correlation coefficient
stays under 0.6 (too low), while with 4, 8 and 14 monitored cores it is
greater than 0.7. The authors therefore chose to monitor 8 shader
cores.
22. VI. Evaluation of the DFS performance(3)
2. Evaluation Results
b) DFS period
How long should the time period used to evaluate the frequency setting
of the networks be?
• The frequency transition latency between different frequency
levels is about 3 ns.
• The quantum of time for frequency scaling should be two to three
orders of magnitude larger than the transition overhead, therefore
a time interval smaller than 1,000 cycles is not appropriate.
• Among 1,000, 5,000 and 25,000 cycles, 5,000 cycles turns out to be
the most suitable interval, both in terms of scaling overhead and of
execution-time reduction.
23. VI. Evaluation of the DFS performance(4)
2. Evaluation Results
c) Speedup comparison (execution time reduction)
Comparison of the practical DFS scheme with a perfect NoC and a static
1:4 scheme in terms of overall execution time normalized to the baseline
setting.
24. VI. Evaluation of the DFS performance(5)
• LL benchmarks: the practical DFS scheme improves performance by only 0.1%.
However, the perfect NoC does not perform well either, with just a 1.8%
improvement.
• LH benchmarks: the DFS scheme improves performance by 2.5%.
Note that for some programs, such as RAY and LIB (mainly Type C), the DFS
mechanism does not degrade performance, unlike the static 1:4
method.
• For the HH benchmarks:
- The perfect NoC can achieve an overall average of 88% speedup.
- The practical DFS improves by 14% on average.
(For some workloads, such as BFS and RD, the practical DFS
mechanism can achieve a performance improvement up to 27%).
- The static 1:4 method shows a 36% average speedup, but it cannot
adapt to the Type C benchmarks, such as FWT and TRA, which lose
14% and 20% respectively.
25. VI. Evaluation of the DFS performance(6)
2. Evaluation Results
d) Power Consumption
• Static power:
Since the static power under the same voltage is the same,
the consumption is proportional to the total execution time.
- For LL and LH benchmarks: reduction of 0.6% and 4.62%.
- For HH benchmarks: reduction of 14.05%.
- Overall: 7.5% energy saving across the 15 applications.
• Dynamic power:
On average, dynamic power consumption is slightly higher than the
baseline.
More energy could be saved, particularly for the LL benchmarks, if the
routers could be put to sleep.
26. VI. Evaluation of the DFS performance(7)
2. Evaluation Results
e) Performance Gain by the Ideal DFS mechanism
It is assumed that the frequency range is infinite and that the frequency
can be adjusted without constraints.
27. VI. Evaluation of the DFS performance(8)
• For the LL and LH benchmarks, the Ideal DFS mechanism provides 1% and 6%
speedup respectively.
• For the HH benchmarks, the Ideal DFS mechanism achieves a 55% improvement
on average.
For some of them, such as BFS and RD, the Ideal DFS mechanism yields
large performance improvements, of up to 74% and 158% respectively.
The reason is that the B_ratio succeeds in balancing the seriously
imbalanced injection rates of shader cores and MCs for these two
benchmarks, which consist of a large amount of read requests (mainly Type B).
• Overall, the Ideal DFS provides 26% performance benefits across the 15
applications, on average.
28. VII. Overview on related works
- NoCs for GPUs have been explored by Bakhoda et al. [6], who evaluate the
impact of different network parameters (2009).
- In [2], Bakhoda et al. point out that GPU applications are bandwidth
sensitive and latency insensitive, while the many-to-few-to-many
architecture can create a bottleneck in the NoC. They propose a
throughput-effective NoC design, adopting a checkerboard organization and
multi-port routers to solve the bottleneck (2010).
- In [12], Kim et al. propose a heterogeneous network in which the reply
network is a direct all-to-all network overlaid on the mesh to remove
contention (2012).
- Several works have employed DVFS policies to manage power
consumption [13], [14], [15] (2002, 2008, 2009).
- In [16], Lee et al. optimize the throughput of a GPU under a specified power
constraint by dynamically scaling the number of cores and the
voltage/frequency of cores and on-chip interconnects/caches (2011).
29. VIII. Paper assessment(1)
o Over a [1-5] range of assessment:
• ORIGINALITY: 3
• TECHNICAL IMPACT: 4
• CLARITY: 4
• SCIENTIFIC SIGNIFICANCE: 3
30. VIII. Paper assessment(2)
Originality: 3
this work is on “the same wavelength” as other works from the same
years;
it relies heavily on the notation and results of Bakhoda et al. [2];
it focuses on a problem not treated by other works;
it exploits solutions previously adopted for other problematic aspects
(like DVFS for the management of power consumption);
it proposes a simple but effective solution.
31. VIII. Paper assessment(3)
Technical Impact: 4
it directly provides the hardware and software implementation of the
proposed solution;
the solution can be easily implemented on GPGPUs at low cost;
the proposed solution brings useful performance gains;
there is no performance degradation (unlike the static 1:4 scheme for
some of the applications);
however, the performance gains (7.4% on average, and up to 27%) are
in line with the improvements obtained by other works' solutions
(15-20%); that is, it does not provide a disruptive improvement in
performance compared with other results.
32. VIII. Paper assessment(4)
Clarity: 4
the path from the analyzed problem to the solution is well traced;
charts and diagrams are shown and quite well commented;
the numeric results are quite well reported;
however, some parts are explained a little too quickly (e.g.
power consumption);
some numeric results could be reported in a clearer, more schematic
way.
33. VIII. Paper assessment(5)
Scientific Significance: 3
the proposed solution is fully implementable with currently available
technology;
it is quite generic and not restricted to only certain types of
GPGPUs;
it is independent of the topology of the NoC;
it remains viable as the density of integrated chips keeps
increasing;
some minor improvements are cited and left for future work, so there
is room to enhance performance further;
however, the proposed solution is an incremental improvement, not a
disruptive one;
the average speedup is about 7.4%.
34. Thanks for your attention
Luca Sinico
Attending the Master’s Degree in Computer
Engineering at the University of Padua (Italy)
Review made for the “Distributed Systems” course
luca.sinico@gmail.com