Traffic-aware Frequency Scaling for
Balanced On-Chip Networks on GPGPUs
Luca Sinico
23/03/2016
Review of the paper:
Chiao-Yun Tu, Yuan-Ying Chang, Chung-Ta King, Chien-Ting Chen, Tai-Yuan Wang
Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan (2014)
Università degli Studi di Padova
Dipartimento di Ingegneria dell’Informazione
Outline
I. Introduction
II. Problem description
III. The proposed solution
IV. Characterization of the GPGPUs’ traffic patterns
V. Traffic-Aware DFS design
VI. Evaluation of the DFS’ performances
VII. Overview on related works
VIII. Paper assessment
Terminology
- NoC = Network-on-Chip (on-chip network)
- Perfect NoC = NoC with zero latency and infinite bandwidth
- Shader cores = the “computational cores” of a GPU
- MC = Memory Controller
- Request network = network from shader cores to MCs
- Reply network = network from MCs to shader cores
- Flits = FLow control digITs (large network packets are broken into small pieces, called flits)
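As a minimal sketch of how a packet breaks into flits, the function below counts the flits needed for a packet of a given size. The 16-byte flit size is an assumption chosen only for illustration; the deck does not state the flit width. The 8-byte and 128-byte packet sizes are the read request and read reply sizes quoted later in the deck.

```python
import math

# Assumed flit width in bytes (hypothetical; not given in the slides).
ASSUMED_FLIT_BYTES = 16

def flits_per_packet(packet_bytes: int, flit_bytes: int = ASSUMED_FLIT_BYTES) -> int:
    """Return how many flits are needed to carry a packet of packet_bytes bytes."""
    if packet_bytes <= 0 or flit_bytes <= 0:
        raise ValueError("sizes must be positive")
    # A partially filled last flit still occupies a whole flit.
    return math.ceil(packet_bytes / flit_bytes)

print(flits_per_packet(8))    # small read request
print(flits_per_packet(128))  # large read reply
```

Under this assumed flit width, a read reply occupies several times more flits than the request that triggered it.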
I. Introduction(1)
o GPGPU: what & why?
• General-Purpose computing on Graphics Processing Units
• over the past decades:
from a fixed-function drawing device …
… to a general-purpose, programmable computing engine
• can offer ten to a hundred times the computing power of
general-purpose processors (CPUs) for highly parallel
applications
I. Introduction(2)
o GPGPU: issues…
• more sensitive to bandwidth than to latency
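• the many-to-few-to-many traffic pattern creates a bottleneck,
like a tollbooth: many shader cores send requests to a few MCs,
which send replies back to the many shader cores
• the NoC must provide sufficient bandwidth to sustain the full
performance of GPGPUs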
II. Problem description
o A less obvious issue:
the imbalance between request and reply traffic, caused by
the nature of the read/write instructions
• (e.g. think about a “strange tollbooth”:
for each car entering, i.e. a read request carrying an 8-byte address,
a bus exits, i.e. a read reply carrying a 128-byte data packet).
o Typical GPGPU applications will have different mixes of
memory reads and writes at different stages of the
computation ( = different traffic patterns over the execution)
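The imbalance can be made concrete with a short calculation. This is only a sketch: the read sizes (8-byte request, 128-byte reply) come from the slide above, while the write request and acknowledgement sizes are assumptions introduced here for illustration.

```python
# Packet sizes in bytes. Read sizes are taken from the slide;
# write sizes are ASSUMPTIONS made only for this sketch.
READ_REQUEST = 8      # address only
READ_REPLY = 128      # data packet
WRITE_REQUEST = 136   # assumed: 8-byte address plus 128 bytes of data
WRITE_REPLY = 8       # assumed: small acknowledgement

def traffic_bytes(reads: int, writes: int) -> tuple[int, int]:
    """Return (request_bytes, reply_bytes) for a mix of reads and writes."""
    request = reads * READ_REQUEST + writes * WRITE_REQUEST
    reply = reads * READ_REPLY + writes * WRITE_REPLY
    return request, reply

for reads, writes in [(100, 0), (50, 50), (0, 100)]:
    req, rep = traffic_bytes(reads, writes)
    print(f"{reads} reads, {writes} writes: request={req} B, reply={rep} B")
```

With reads only, the reply network carries 16 times the bytes of the request network; as the share of writes grows, the load shifts toward the request network.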
III. The proposed solution
o The proposed solution:
scaling the network frequency dynamically to balance the
throughput of the request and reply networks, by means of a
Traffic-Aware Dynamic Frequency Scaling approach
o Two steps to design a traffic-aware NoC:
- detecting the traffic pattern of the application;
- adjusting the bandwidth of different parts of the NoC accordingly.
IV. Characterization of the traffic patterns(1)
Three types of traffic pattern:
o Type A: few memory operations (either read or write)
- light memory traffic into the request network
- the reply traffic is also light
- the NoC resources are under-utilized
- there are opportunities to throttle the components with low utilization
IV. Characterization of the traffic patterns(2)
o Type B: a large number of memory operations, mainly
read operations
- many small read request packets in the request network; the
same number of large data packets in the reply network
- each single MC has to inject many more packets during the
interval than a shader core does
- the routers connected to the MCs in the reply network are
easily congested, causing the MCs to stall
- the reply network requires more bandwidth
IV. Characterization of the traffic patterns(3)
o Type C: a large number of memory operations, mainly
write operations
- many large write request packets in the request network;
the same number of small acknowledgement packets in the
reply network
- due to the many-to-few-to-many architecture, the traffic on
the reply network is also heavy
- both the request and reply networks require more
bandwidth
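The effect of the many-to-few-to-many architecture on each MC can be sketched as follows. The core and MC counts and the request total are hypothetical values chosen for illustration, and the sketch assumes one reply packet per request, spread evenly across nodes.

```python
def packets_per_node(total_requests: int, n_cores: int, n_mcs: int) -> tuple[float, float]:
    """Average packets injected per shader core (requests) and per MC (replies).

    Assumes every request produces exactly one reply and that traffic is
    spread evenly over the cores and over the MCs.
    """
    per_core = total_requests / n_cores
    per_mc = total_requests / n_mcs
    return per_core, per_mc

# Hypothetical configuration: 16 shader cores sharing 4 MCs.
per_core, per_mc = packets_per_node(total_requests=4800, n_cores=16, n_mcs=4)
print(per_core, per_mc, per_mc / per_core)
```

Each MC injects n_cores / n_mcs times as many packets as a single shader core, which is why the MC routers on the reply network are under pressure even when the reply packets are small acknowledgements.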
V. Traffic-Aware DFS design(1)
o Two cardinal ratios are computed every x cycles.
[Equations: definitions of the two ratios]
o The “algorithm” that realizes the Traffic-Aware DFS mechanism:
[Figure: the Traffic-Aware DFS algorithm]
V. Traffic-Aware DFS design(2)
o Three threshold values:
• Threshold_low
• R_Threshold_high
• W_Threshold_high
o The algorithm that
dynamically controls
the frequency:
V. Traffic-Aware DFS design(3)
o First section:
- if average_injection_rate goes below Threshold_low: Type A.
The DFS controller decreases the frequency of both the
request and reply networks
- else…
V. Traffic-Aware DFS design(4)
o Second section:
- else…
Different thresholds are then used to decide whether
average_injection_rate is high enough to cause network
congestion in the next period
• if so, then for the Type B pattern the frequency of the reply
network is increased and the frequency of the request
network may be reduced …
V. Traffic-Aware DFS design(5)
o Third section:
- else…
• if so, then for the Type C pattern the frequency of both
networks is increased
• else, that is, if average_injection_rate is below the
thresholds for both Type B and Type C, the network can
handle the traffic load and the network frequency is
maintained at F_base
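The three sections of the algorithm can be sketched roughly as below. This is not the authors' code. The definition of average_injection_rate (flits per cycle) and the read-versus-write test used to tell Type B from Type C are assumptions, since the ratio formulas are not reproduced in this transcript. The threshold values in the usage lines are hypothetical. The frequency levels are the practical ones given in the deck (F_low = F_base / 2, F_high = F_base * 1.3).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    low: float     # Threshold_low
    r_high: float  # R_Threshold_high
    w_high: float  # W_Threshold_high

def choose_frequencies(flits: int, reads: int, writes: int, cycles: int,
                       th: Thresholds, f_base: float = 1.0) -> tuple[float, float]:
    """Return (F_req, F_reply) for the next period."""
    f_low, f_high = f_base / 2, f_base * 1.3
    # ASSUMED definition: flits injected per cycle over the period.
    average_injection_rate = flits / cycles
    if average_injection_rate < th.low:
        return f_low, f_low                      # Type A: throttle both
    # ASSUMED classification: more reads than writes means read-dominated.
    read_dominated = reads >= writes
    if read_dominated and average_injection_rate > th.r_high:
        return f_low, f_high                     # Type B: boost reply network
    if not read_dominated and average_injection_rate > th.w_high:
        return f_high, f_high                    # Type C: boost both networks
    return f_base, f_base                        # load is manageable

th = Thresholds(low=0.05, r_high=0.3, w_high=0.3)  # hypothetical values
print(choose_frequencies(100, 10, 5, 5000, th))
print(choose_frequencies(2000, 300, 10, 5000, th))
print(choose_frequencies(2000, 10, 300, 5000, th))
```

The function is evaluated once per period with the counters collected during that period, so the chosen frequencies apply to the following period.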
V. Traffic-Aware DFS design(6)
o Which Frequency Scaling Policy should be used?
• Ideal Frequency Tuning
Ideally, the frequency of the two networks should be scaled
in proportion to the traffic pattern by means of a balance ratio (B_ratio).
• Practical Frequency Tuning
In practice, there is a limited range over which the frequency of a
network can scale under a fixed voltage.
The proposed DFS scheme must choose one from a fixed set of
frequencies to set the networks.
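The gap between the two policies can be illustrated by snapping an ideal frequency to the nearest available level. This is a sketch under assumptions: the deck does not say how the practical scheme picks its level, so nearest-value selection is a stand-in, and the ideal values passed in are hypothetical. The level set uses the practical frequencies listed in the deck, normalized to F_base = 1.0.

```python
def nearest_level(f_ideal: float, levels: list[float]) -> float:
    """Pick the available frequency closest to the ideal one (assumed policy)."""
    if not levels:
        raise ValueError("at least one frequency level is required")
    return min(levels, key=lambda f: abs(f - f_ideal))

F_BASE = 1.0
LEVELS = [F_BASE / 2, F_BASE, F_BASE * 1.3]  # F_low, F_base, F_high

for ideal in (0.6, 1.1, 2.0):  # hypothetical ideal frequencies
    print(ideal, "->", nearest_level(ideal, LEVELS))
```

An ideal frequency far above F_high is clipped to F_high, which is one way to see why the practical scheme cannot reach the gains of the ideal one.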
V. Traffic-Aware DFS design(7)
1) Ideal Frequency Tuning
• The doubling strategy attempts to strike a balance between
responsiveness and stability of frequency scaling.
• Why study Ideal Frequency Tuning?
To understand the upper bound on the performance
achievable by a DFS mechanism.
[Table: F_req and F_reply for traffic patterns Type A, Type B and Type C]
V. Traffic-Aware DFS design(8)
2) Practical Frequency Tuning
• Since the difference in traffic load between the request and reply
networks may be as high as tenfold, the best that can be done is to
adjust the frequency between the extremes.
• There is a transition overhead when the frequency is changed.
However, with current technology, the frequency transition
overhead may be negligible, depending on how often the frequency is
changed.
With:
F_low = F_base / 2
F_high = F_base * 1.3

Traffic Pattern   F_req    F_reply
Type A            F_low    F_low
Type B            F_low    F_high
Type C            F_high   F_high
V. Traffic-Aware DFS design(9)
o The required hardware:
• 4 x 13-bit counters in each monitored shader core:
- one to frame the time interval (5,000 cycles);
- three for the flit count, read request count and write request count.
• a control network to deliver the collected data to the
central DFS controller
• a 15-bit adder and divider inside the DFS controller, to
calculate the ratios
• a floating point comparator inside the DFS controller, for
comparing the computed ratios against the threshold values
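A quick check shows why 13 bits suffice for the interval counter. The helper name is hypothetical; the only input taken from the deck is the 5,000-cycle interval.

```python
def counter_bits(max_count: int) -> int:
    """Smallest counter width, in bits, that can hold max_count."""
    if max_count < 0:
        raise ValueError("max_count must be non-negative")
    return max(1, max_count.bit_length())

interval_cycles = 5000
bits = counter_bits(interval_cycles)
print(bits, 2**bits - 1)  # width and largest value that width can hold
```

A 13-bit counter holds values up to 8,191, so it covers the 5,000-cycle interval, while 12 bits (up to 4,095) would overflow.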
VI. Evaluation of the DFS’ performances(1)
1. Simulation Setup
- GPGPU-Sim is used to model a modern GPU
(with micro-architecture parameters similar to the NVIDIA GeForce GTX480)
- ORION 2.0 is used for network power consumption estimation
- 15 benchmark applications
- a two-letter classification scheme is used to classify the applications:
  • first letter: high (H) or low (L) speedup with a perfect NoC,
  i.e. whether the speedup exceeds 30% or not;
  • second letter: whether the benchmark injects heavy (H) or light (L)
  traffic into the NoC;
- the 15 applications are divided into three groups:
5 of them are LL, 3 are LH, and 7 are HH.
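The two-letter scheme can be written as a small function. The 30% cut-off for the first letter comes from the slide. How heavy traffic is decided is not stated there, so the sketch simply takes it as a boolean input. The example speedups are hypothetical.

```python
def classify(perfect_noc_speedup: float, heavy_traffic: bool) -> str:
    """Return the two-letter class (LL, LH, HL or HH) of a benchmark.

    perfect_noc_speedup is a fraction, e.g. 0.35 for a 35% speedup.
    heavy_traffic is supplied by the caller because the slide gives no
    numeric criterion for it.
    """
    first = "H" if perfect_noc_speedup > 0.30 else "L"
    second = "H" if heavy_traffic else "L"
    return first + second

print(classify(0.02, False))
print(classify(0.05, True))
print(classify(0.90, True))
```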
VI. Evaluation of the DFS’ performances(2)
2. Evaluation Results
a) Number of shader cores to monitor
How many shader cores are needed for an accurate identification?
• Since GPGPU applications are typically data parallel, shader cores
will have similar memory access behaviors.
• When only 1 or 2 shader cores are monitored, the correlation coefficient is
under 0.6 (too low). The correlation coefficients for 4, 8 and 14
shader cores are greater than 0.7. As a result, 8 shader cores were
chosen for monitoring.
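The correlation coefficient in question can be computed as below. This is a generic Pearson correlation, offered as a sketch: the slide does not say exactly which two series are correlated, so the assumption here is that one series is measured on the monitored subset of cores and the other on all cores, period by period. The sample values are made up for illustration and are not results from the paper.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Illustrative, made-up per-period injection rates.
subset_rate = [0.10, 0.25, 0.40, 0.30, 0.15]
all_cores_rate = [0.12, 0.22, 0.41, 0.33, 0.14]
print(round(pearson(subset_rate, all_cores_rate), 3))
```

A value close to 1 would indicate that the monitored subset tracks the behavior of all cores well.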
VI. Evaluation of the DFS’ performances(3)
2. Evaluation Results
b) DFS period
How long should the period be between re-evaluations of the frequency
setting of the networks?
• The frequency transition latency between different frequency
levels is about 3 ns.
• The quantum of time for frequency scaling should be two to three
orders of magnitude larger than the transition overhead; therefore
a time interval smaller than 1,000 cycles is not appropriate.
• Among the intervals compared (1,000, 5,000 and 25,000 cycles),
5,000 cycles is a suitable choice, both in terms of scaling overhead
and execution time reduction.
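The reasoning about the period can be checked with a little arithmetic. The 3 ns transition latency and the "two to three orders of magnitude" rule come from the slide. The 1 GHz network clock is an assumption made for this sketch, since the clock frequency is not given in the deck.

```python
def period_bounds_cycles(transition_ns: float, clock_ghz: float) -> tuple[float, float]:
    """Return the period, in cycles, that is 100x and 1000x the transition latency."""
    transition_cycles = transition_ns * clock_ghz  # ns * (cycles per ns)
    return transition_cycles * 10**2, transition_cycles * 10**3

ASSUMED_CLOCK_GHZ = 1.0  # hypothetical network clock
low, high = period_bounds_cycles(transition_ns=3.0, clock_ghz=ASSUMED_CLOCK_GHZ)
print(low, high)
```

Under the assumed clock, a 1,000-cycle period falls short of three orders of magnitude above the transition latency, while 5,000 cycles exceeds it.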
VI. Evaluation of the DFS’ performances(4)
2. Evaluation Results
c) Speedup comparison (execution time reduction)
Comparison of the practical DFS scheme with a perfect NoC and a static
1:4 scheme in terms of overall execution time normalized to the baseline
setting.
VI. Evaluation of the DFS’ performances(5)
• LL benchmarks: the practical DFS scheme improves performance by only 0.1%.
However, even the perfect NoC gains little here: just 1.8%.
• LH benchmarks: the DFS scheme improves performance by 2.5%.
- Note that for some programs, such as RAY and LIB (mainly Type C), the DFS
mechanism does not degrade performance, whereas the static 1:4 method does.
• HH benchmarks:
- The perfect NoC achieves an overall average speedup of 88%.
- The practical DFS improves performance by 14% on average.
(For some workloads, such as BFS and RD, the practical DFS
mechanism achieves a performance improvement of up to 27%.)
- The static 1:4 method shows a 36% average speedup, but it cannot
adapt to the Type C benchmarks, such as FWT and TRA, which lose
14% and 20% respectively.
VI. Evaluation of the DFS’ performances(6)
2. Evaluation Results
d) Power Consumption
• Static power:
Since the static power under the same voltage is the same,
the static energy consumption is proportional to the total execution time.
- For LL and LH benchmarks: reduction of 0.6% and 4.62% respectively.
- For HH benchmarks: reduction of 14.05%.
- Overall: 7.5% energy saving across the 15 applications.
• Dynamic power:
On average, dynamic power consumption is higher than the
baseline, although only by a small amount.
- More energy could be saved, particularly for LL benchmarks, if the
routers could be put to sleep.
VI. Evaluation of the DFS’ performances(7)
2. Evaluation Results
e) Performance Gain by the Ideal DFS mechanism
It is assumed that the frequency range is infinite and that it is possible to
adjust the frequency without constraints
VI. Evaluation of the DFS’ performances(8)
• For LL and LH benchmarks, the Ideal DFS mechanism provides 1% and 6%
speedup respectively.
• For HH benchmarks, the Ideal DFS mechanism achieves a 55% improvement on
average.
For some of them, such as BFS and RD, the Ideal DFS mechanism results
in a large performance improvement, of up to 74% and 158% respectively.
The reason is that the B_ratio succeeds in balancing the seriously
imbalanced injection rates of shader cores and MCs for these two
benchmarks, which consist of a large number of read requests (mainly Type B).
• Overall, the Ideal DFS provides a 26% performance benefit across the 15
applications, on average.
VII. Overview on related works
- NoCs for GPUs have been explored by Bakhoda et al. [6], where the impact
of different network parameters is evaluated (2009).
- In [2], Bakhoda et al. point out that GPU applications are bandwidth
sensitive and latency insensitive, while the many-to-few-to-many
architecture can create a bottleneck in the NoC. They propose a
throughput-effective NoC design that adopts a checkerboard organization and
multiport routers to solve the bottleneck (2010).
- In [12], Kim et al. propose a heterogeneous network in which the reply
network is a direct all-to-all network overlaid on the mesh to remove
contention (2012).
- Several research works have employed the DVFS policy to manage power
consumption [13], [14], [15] (2002, 2008, 2009).
- In [16], Lee et al. optimize the throughput of the GPU under a specified power
constraint by dynamically scaling the number of cores and the
voltage/frequency of cores and on-chip interconnects/caches (2011).
VIII. Paper assessment(1)
o Over a [1-5] range of assessment:
• ORIGINALITY: 3
• TECHNICAL IMPACT: 4
• CLEARNESS: 4
• SCIENTIFIC SIGNIFICANCE: 3
VIII. Paper assessment(2)
o Originality: 3
- this work is “on the same wavelength” as the other works of the
same years;
- it relies heavily on the notation and results of [2], Bakhoda et al.;
- it focuses on a problem not treated by other works;
- it reuses solutions previously adopted for other problems
(like DVFS for the management of power consumption);
- it proposes a simple but effective solution.
VIII. Paper assessment(3)
o Technical Impact: 4
- it directly provides the hardware and software implementation of the
proposed solution;
- the solution can be easily implemented on GPGPUs at low cost;
- the proposed solution brings some useful performance gains;
- no performance degradation (unlike the static 1:4 scheme for
some of the applications);
- however, the performance gains (7.4% on average, and up to 27%) are
“in line” with the improvements obtained by other works’ solutions
(15-20%); that is, it does not provide a disruptive improvement in
performance compared with others’ results.
VIII. Paper assessment(4)
o Clearness: 4
- the path from the analyzed problem to the solution is well traced;
- charts and diagrams are shown and reasonably well commented;
- the numeric results are reasonably well reported;
- however, some parts are explained a little too quickly (e.g.
power consumption);
- some numeric results could be reported in a more schematic and
clearer way.
VIII. Paper assessment(5)
o Scientific Significance: 3
- the proposed solution is fully implementable with currently
available technologies;
- the proposed solution is quite generic and not restricted to
certain types of GPGPUs;
- the proposed solution is independent of the topology of the NoC;
- the proposed solution remains applicable as technological
progress increases the density of integrated chips;
- some minor improvements have been mentioned and left for future work,
so there is room to enhance the performance;
- however, the proposed solution is an incremental improvement, not a
disruptive one;
- the average speedup is about 7.4%.
Thanks for your attention
Luca Sinico
Attending the Master’s Degree in Computer
Engineering at University of Padua (Italy)
Review made for the “Distributed Systems” course
luca.sinico@gmail.com

More Related Content

What's hot

KALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLERKALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLERijdpsjournal
 
S URVEY OF L TE D OWNLINK S CHEDULERS A LGORITHMS IN O PEN A CCESS S IM...
S URVEY OF  L TE  D OWNLINK  S CHEDULERS A LGORITHMS IN  O PEN  A CCESS  S IM...S URVEY OF  L TE  D OWNLINK  S CHEDULERS A LGORITHMS IN  O PEN  A CCESS  S IM...
S URVEY OF L TE D OWNLINK S CHEDULERS A LGORITHMS IN O PEN A CCESS S IM...ijwmn
 
Scheduling Algorithms in LTE and Future Cellular Networks
Scheduling Algorithms in LTE and Future Cellular NetworksScheduling Algorithms in LTE and Future Cellular Networks
Scheduling Algorithms in LTE and Future Cellular NetworksINDIAN NAVY
 
Performance Analysis and Optimization of Next Generation Wireless Networks (P...
Performance Analysis and Optimization of Next Generation Wireless Networks (P...Performance Analysis and Optimization of Next Generation Wireless Networks (P...
Performance Analysis and Optimization of Next Generation Wireless Networks (P...University of Piraeus
 
M phil-computer-science-mobile-computing-projects
M phil-computer-science-mobile-computing-projectsM phil-computer-science-mobile-computing-projects
M phil-computer-science-mobile-computing-projectsVijay Karan
 
EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...
EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...
EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...ijcsit
 
Simulation based Evaluation of a Simple Channel Distribution Scheme for MANETs
Simulation based Evaluation of a Simple Channel Distribution Scheme for MANETsSimulation based Evaluation of a Simple Channel Distribution Scheme for MANETs
Simulation based Evaluation of a Simple Channel Distribution Scheme for MANETsIOSR Journals
 
LTE Physical Layer Transmission Mode Selection Over MIMO Scattering Channels
LTE Physical Layer Transmission Mode Selection Over MIMO Scattering ChannelsLTE Physical Layer Transmission Mode Selection Over MIMO Scattering Channels
LTE Physical Layer Transmission Mode Selection Over MIMO Scattering ChannelsIllaKolani1
 
E04124030034
E04124030034E04124030034
E04124030034IOSR-JEN
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...IDES Editor
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
Performance analysis of fls, exp, log and
Performance analysis of fls, exp, log andPerformance analysis of fls, exp, log and
Performance analysis of fls, exp, log andijwmn
 
Differentiated Classes of Service and Flow Management using An Hybrid Broker1
Differentiated Classes of Service and Flow Management using An Hybrid Broker1Differentiated Classes of Service and Flow Management using An Hybrid Broker1
Differentiated Classes of Service and Flow Management using An Hybrid Broker1IDES Editor
 
Improving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-PImproving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-PIDES Editor
 
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...ijwmn
 
Fault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidanceFault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidancegraphhoc
 
Schedule and Contention based MAC protocols
Schedule and Contention based MAC protocolsSchedule and Contention based MAC protocols
Schedule and Contention based MAC protocolsDarwin Nesakumar
 
Mac protocols sensor_20071105_slideshare
Mac protocols sensor_20071105_slideshareMac protocols sensor_20071105_slideshare
Mac protocols sensor_20071105_slideshareChih-Yu Lin
 
5G NR Coverage Analysis for 700 MHz
5G NR Coverage Analysis for 700 MHz 5G NR Coverage Analysis for 700 MHz
5G NR Coverage Analysis for 700 MHz Eiko Seidel
 

What's hot (19)

KALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLERKALMAN FILTER BASED CONGESTION CONTROLLER
KALMAN FILTER BASED CONGESTION CONTROLLER
 
S URVEY OF L TE D OWNLINK S CHEDULERS A LGORITHMS IN O PEN A CCESS S IM...
S URVEY OF  L TE  D OWNLINK  S CHEDULERS A LGORITHMS IN  O PEN  A CCESS  S IM...S URVEY OF  L TE  D OWNLINK  S CHEDULERS A LGORITHMS IN  O PEN  A CCESS  S IM...
S URVEY OF L TE D OWNLINK S CHEDULERS A LGORITHMS IN O PEN A CCESS S IM...
 
Scheduling Algorithms in LTE and Future Cellular Networks
Scheduling Algorithms in LTE and Future Cellular NetworksScheduling Algorithms in LTE and Future Cellular Networks
Scheduling Algorithms in LTE and Future Cellular Networks
 
Performance Analysis and Optimization of Next Generation Wireless Networks (P...
Performance Analysis and Optimization of Next Generation Wireless Networks (P...Performance Analysis and Optimization of Next Generation Wireless Networks (P...
Performance Analysis and Optimization of Next Generation Wireless Networks (P...
 
M phil-computer-science-mobile-computing-projects
M phil-computer-science-mobile-computing-projectsM phil-computer-science-mobile-computing-projects
M phil-computer-science-mobile-computing-projects
 
EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...
EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...
EFFICIENT ADAPTATION OF FUZZY CONTROLLER FOR SMOOTH SENDING RATE TO AVOID CON...
 
Simulation based Evaluation of a Simple Channel Distribution Scheme for MANETs
Simulation based Evaluation of a Simple Channel Distribution Scheme for MANETsSimulation based Evaluation of a Simple Channel Distribution Scheme for MANETs
Simulation based Evaluation of a Simple Channel Distribution Scheme for MANETs
 
LTE Physical Layer Transmission Mode Selection Over MIMO Scattering Channels
LTE Physical Layer Transmission Mode Selection Over MIMO Scattering ChannelsLTE Physical Layer Transmission Mode Selection Over MIMO Scattering Channels
LTE Physical Layer Transmission Mode Selection Over MIMO Scattering Channels
 
E04124030034
E04124030034E04124030034
E04124030034
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
Performance analysis of fls, exp, log and
Performance analysis of fls, exp, log andPerformance analysis of fls, exp, log and
Performance analysis of fls, exp, log and
 
Differentiated Classes of Service and Flow Management using An Hybrid Broker1
Differentiated Classes of Service and Flow Management using An Hybrid Broker1Differentiated Classes of Service and Flow Management using An Hybrid Broker1
Differentiated Classes of Service and Flow Management using An Hybrid Broker1
 
Improving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-PImproving Performance of TCP in Wireless Environment using TCP-P
Improving Performance of TCP in Wireless Environment using TCP-P
 
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ...
 
Fault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidanceFault tolerant wireless sensor mac protocol for efficient collision avoidance
Fault tolerant wireless sensor mac protocol for efficient collision avoidance
 
Schedule and Contention based MAC protocols
Schedule and Contention based MAC protocolsSchedule and Contention based MAC protocols
Schedule and Contention based MAC protocols
 
Mac protocols sensor_20071105_slideshare
Mac protocols sensor_20071105_slideshareMac protocols sensor_20071105_slideshare
Mac protocols sensor_20071105_slideshare
 
5G NR Coverage Analysis for 700 MHz
5G NR Coverage Analysis for 700 MHz 5G NR Coverage Analysis for 700 MHz
5G NR Coverage Analysis for 700 MHz
 

Viewers also liked

Juvenile Justice-Final Exam
Juvenile Justice-Final ExamJuvenile Justice-Final Exam
Juvenile Justice-Final ExamMarx Cadet
 
Seminário TGS G9
Seminário TGS G9Seminário TGS G9
Seminário TGS G9Psouzes
 
Internship-Final Project
Internship-Final ProjectInternship-Final Project
Internship-Final ProjectMarx Cadet
 
Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...
Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...
Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...molesm
 
Jasper Thorley Amazon Kindle Stand
Jasper Thorley Amazon Kindle StandJasper Thorley Amazon Kindle Stand
Jasper Thorley Amazon Kindle StandJasper Thorley
 
Pemotongan dan penempelan DNA
Pemotongan dan penempelan DNAPemotongan dan penempelan DNA
Pemotongan dan penempelan DNAAMusdalifah123
 
Magazine flat plan
Magazine flat plan Magazine flat plan
Magazine flat plan pregnaul99
 
Masthead Font Styles
Masthead Font Styles Masthead Font Styles
Masthead Font Styles pregnaul99
 
Preliminary task and planning and research
Preliminary task and planning and research Preliminary task and planning and research
Preliminary task and planning and research pregnaul99
 

Viewers also liked (11)

Juvenile Justice-Final Exam
Juvenile Justice-Final ExamJuvenile Justice-Final Exam
Juvenile Justice-Final Exam
 
Le Ngoc Trinh-CV
Le Ngoc Trinh-CVLe Ngoc Trinh-CV
Le Ngoc Trinh-CV
 
Seminário TGS G9
Seminário TGS G9Seminário TGS G9
Seminário TGS G9
 
Internship-Final Project
Internship-Final ProjectInternship-Final Project
Internship-Final Project
 
Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...
Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...
Keynote Speaker at 2016 uOPRA CONFERENCE - Marilou Moles - Twenty York Street...
 
Jasper Thorley Amazon Kindle Stand
Jasper Thorley Amazon Kindle StandJasper Thorley Amazon Kindle Stand
Jasper Thorley Amazon Kindle Stand
 
Pemotongan dan penempelan DNA
Pemotongan dan penempelan DNAPemotongan dan penempelan DNA
Pemotongan dan penempelan DNA
 
Teh, kopi, coklat
Teh, kopi, coklatTeh, kopi, coklat
Teh, kopi, coklat
 
Magazine flat plan
Magazine flat plan Magazine flat plan
Magazine flat plan
 
Masthead Font Styles
Masthead Font Styles Masthead Font Styles
Masthead Font Styles
 
Preliminary task and planning and research
Preliminary task and planning and research Preliminary task and planning and research
Preliminary task and planning and research
 

Similar to Traffic-aware Frequency Scaling Balances GPGPU On-Chip Networks

PERFORMANCE ANALYSIS OF RESOURCE SCHEDULING IN LTE FEMTOCELLS NETWORKS
PERFORMANCE ANALYSIS OF RESOURCE SCHEDULING IN LTE FEMTOCELLS NETWORKSPERFORMANCE ANALYSIS OF RESOURCE SCHEDULING IN LTE FEMTOCELLS NETWORKS
PERFORMANCE ANALYSIS OF RESOURCE SCHEDULING IN LTE FEMTOCELLS NETWORKScscpconf
 
Performance analysis of resource
Performance analysis of resourcePerformance analysis of resource
Performance analysis of resourcecsandit
 
performanceandtrafficmanagement-160328180107.pdf
performanceandtrafficmanagement-160328180107.pdfperformanceandtrafficmanagement-160328180107.pdf
performanceandtrafficmanagement-160328180107.pdfABYTHOMAS46
 
LREProxy module for Kamailio Presenation
LREProxy module for Kamailio PresenationLREProxy module for Kamailio Presenation
LREProxy module for Kamailio PresenationMojtaba Esfandiari
 
Unit 5-Performance and Trafficmanagement.pptx
Unit 5-Performance and Trafficmanagement.pptxUnit 5-Performance and Trafficmanagement.pptx
Unit 5-Performance and Trafficmanagement.pptxABYTHOMAS46
 
A modified approach for secure routing and power aware in mobile ad hoc network
A modified approach for secure routing and power aware in mobile ad hoc networkA modified approach for secure routing and power aware in mobile ad hoc network
A modified approach for secure routing and power aware in mobile ad hoc networkDiksha Katiyar
 
Ccna 4 final exam answer v5
Ccna 4 final exam answer v5Ccna 4 final exam answer v5
Ccna 4 final exam answer v5friv4schoolgames
 
An efficient vertical handoff mechanism for future mobile network
An efficient vertical handoff mechanism for  future mobile networkAn efficient vertical handoff mechanism for  future mobile network
An efficient vertical handoff mechanism for future mobile networkBasil John
 
M3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
M3AT: Monitoring Agents Assignment Model for the Data-Intensive ApplicationsM3AT: Monitoring Agents Assignment Model for the Data-Intensive Applications
M3AT: Monitoring Agents Assignment Model for the Data-Intensive ApplicationsVladislavKashansky
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Morph : a novel accelerator
Morph : a novel acceleratorMorph : a novel accelerator
Morph : a novel acceleratorBaharJV
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPijaceeejournal
 
A Platform for Data Intensive Services Enabled by Next Generation Dynamic Opt...
A Platform for Data Intensive Services Enabled by Next Generation Dynamic Opt...A Platform for Data Intensive Services Enabled by Next Generation Dynamic Opt...
Traffic-aware Frequency Scaling Balances GPGPU On-Chip Networks

  • 1. Traffic-aware Frequency Scaling for Balanced On-Chip Networks on GPGPUs. Luca Sinico, 23/03/2016. Review of the paper by Chiao-Yun Tu, Yuan-Ying Chang, Chung-Ta King, Chien-Ting Chen, Tai-Yuan Wang, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan (2014). Università degli Studi di Padova, Dipartimento di Ingegneria dell’Informazione
  • 2. Outline I. Introduction II. Problem description III. The proposed solution IV. Characterization of the GPGPUs’ traffic patterns V. Traffic-Aware DFS design VI. Evaluation of the DFS’ performances VII. Overview on related works VIII. Paper assessment
  • 3. Terminology ▪ NoC = Network-on-Chip (on-chip network) ▪ Perfect NoC = NoC with zero latency and infinite bandwidth ▪ Shader cores = “computational cores” of the GPUs ▪ MC = Memory Controller ▪ Request Network = network from shader cores to MCs ▪ Reply Network = network from MCs to shader cores ▪ Flits = FLow control digITs (large network packets are broken into small pieces, called flits)
  • 4. I. Introduction(1) o GPGPU: what & why? • General-Purpose computing on Graphics Processing Units • over the past decades: from a fixed-function drawing device … to a general-purpose, programmable computing engine • can offer ten to a hundred times the computing power of general-purpose processors (CPUs) for highly parallel applications
  • 5. I. Introduction(2) o GPGPU: issues… • more sensitive to bandwidth than to latency • the many-to-few-to-many traffic pattern creates a bottleneck (like a tollbooth) • the NoC must provide sufficient bandwidth to sustain the full performance of GPGPUs
  • 6. II. Problem description o A less obvious issue: the imbalance between request and reply traffic, caused by the nature of the read/write instructions • (e.g. think of a “strange tollbooth”: for each car entering, i.e. a read request consisting of an 8-byte address, a bus exits, i.e. a read reply consisting of a 128-byte data packet). o Typical GPGPU applications have different mixes of memory reads and writes at different stages of the computation (= different traffic patterns over the execution)
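The imbalance on slide 6 can be illustrated with a little arithmetic. The sketch below uses the 8-byte read request and the 128-byte read reply from the tollbooth example; the write-request and acknowledgement sizes are assumptions chosen for illustration only, since the slide gives only the read sizes.

```python
from typing import Tuple

READ_REQUEST_BYTES = 8      # address only (from the slide)
READ_REPLY_BYTES = 128      # data packet (from the slide)
WRITE_REQUEST_BYTES = 136   # assumption: 8-byte address plus 128 bytes of data
WRITE_ACK_BYTES = 8         # assumption: small acknowledgement packet


def network_bytes(reads: int, writes: int) -> Tuple[int, int]:
    """Return (request_bytes, reply_bytes) for a given mix of reads and writes."""
    request = reads * READ_REQUEST_BYTES + writes * WRITE_REQUEST_BYTES
    reply = reads * READ_REPLY_BYTES + writes * WRITE_ACK_BYTES
    return request, reply


if __name__ == "__main__":
    # Read-only, write-only, and mixed phases of a hypothetical application.
    for reads, writes in [(1000, 0), (0, 1000), (500, 500)]:
        req, rep = network_bytes(reads, writes)
        print(f"reads={reads} writes={writes}: request={req} B, reply={rep} B")
```

With a read-only mix the reply network carries 16 times the bytes of the request network (128 / 8), which is the “car in, bus out” effect of the example.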
  • 7. III. The proposed solution o The proposed solution: scaling the network frequency dynamically to balance the throughput of the request and reply networks, by means of a Traffic-Aware Dynamic Frequency Scaling approach o Two steps to design a traffic-aware NoC: ▪ detecting the traffic pattern of the application; ▪ adjusting the bandwidth of different parts of the NoC accordingly.
  • 8. IV. Characterization of the traffic patterns(1) Three types of traffic pattern: o Type A: few memory operations (either read or write) ▪ light memory traffic into the request network ▪ the reply traffic is also light ▪ the resources of the NoC are under-utilized ➢ there are opportunities to throttle the components with low utilization
  • 9. IV. Characterization of the traffic patterns(2) o Type B: large amount of memory operations, mainly read operations ▪ many small read request packets in the request network; the same number of large data packets in the reply network ▪ each single MC has to inject many more packets during the interval than a shader core does ▪ the routers connected to the MCs in the reply network are easily congested, causing the MCs to stall ➢ the reply network requires more bandwidth
  • 10. IV. Characterization of the traffic patterns(3) o Type C: large amount of memory operations, mainly write operations ▪ many large write request packets in the request network; the same number of small acknowledgement packets in the reply network ▪ due to the many-to-few-to-many architecture, the traffic on the reply network is also heavy ➢ both the request and reply networks require more bandwidth
  • 11. V. Traffic-Aware DFS design(1) o Two cardinal ratios, computed every x cycles [formulas not reproduced in the transcript]. o The “algorithm” that realizes the Traffic-Aware DFS mechanism [figure not reproduced in the transcript]
  • 12. V. Traffic-Aware DFS design(2) o Three threshold values: • Threshold_low • R_Threshold_high • W_Threshold_high o The algorithm that dynamically controls the frequency [figure not reproduced in the transcript]
  • 13. V. Traffic-Aware DFS design(3) o First section: ▪ if average_injection_rate goes below Threshold_low: Type A. The DFS controller decreases the frequency of both the request and reply networks ▪ else…
  • 14. V. Traffic-Aware DFS design(4) o Second section: ▪ else… Different thresholds are then used to decide whether average_injection_rate is high enough to cause network congestion in the next period • if so, then for a Type B pattern the frequency of the reply network is increased and the frequency of the request network may be reduced …
  • 15. V. Traffic-Aware DFS design(5) o Third section: ▪ else… • if so, then for a Type C pattern, the frequency of both networks is increased • else, that is, if average_injection_rate is below the thresholds, for both Type B and Type C the network can handle the traffic load and the network frequency is maintained at F_base
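The three sections of the control algorithm (slides 13 to 15) can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation: the slides do not show how a read-dominated interval is told apart from a write-dominated one, so the `read_fraction` argument and the 0.5 cut-off are assumptions, as are the function name, the `Action` enum, and the example threshold values.

```python
from enum import Enum


class Action(Enum):
    DECREASE = "decrease"
    KEEP_BASE = "keep F_base"
    INCREASE = "increase"


def dfs_decision(avg_injection_rate, read_fraction,
                 threshold_low, r_threshold_high, w_threshold_high):
    """Return (request_action, reply_action) for one monitoring interval.

    read_fraction is a hypothetical stand-in for whatever ratio the
    controller uses to separate read-heavy from write-heavy traffic.
    """
    # First section: Type A, light traffic, so both networks are slowed down.
    if avg_injection_rate < threshold_low:
        return Action.DECREASE, Action.DECREASE

    read_dominated = read_fraction >= 0.5  # assumption: simple majority test

    # Second section: Type B, heavy read traffic. The reply network is sped
    # up; the request network "may be reduced" (reduced here for simplicity).
    if read_dominated and avg_injection_rate > r_threshold_high:
        return Action.DECREASE, Action.INCREASE

    # Third section: Type C, heavy write traffic, so both networks are sped up.
    if not read_dominated and avg_injection_rate > w_threshold_high:
        return Action.INCREASE, Action.INCREASE

    # Below the high thresholds the network copes with the load at F_base.
    return Action.KEEP_BASE, Action.KEEP_BASE


if __name__ == "__main__":
    # Threshold values are made up for the demonstration.
    thresholds = dict(threshold_low=0.05, r_threshold_high=0.30, w_threshold_high=0.40)
    for rate, reads in [(0.01, 0.9), (0.50, 0.9), (0.50, 0.1), (0.20, 0.9)]:
        print(rate, reads, dfs_decision(rate, reads, **thresholds))
```

Using separate high thresholds for reads and writes mirrors the R_Threshold_high and W_Threshold_high values named on slide 12.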
  • 16. V. Traffic-Aware DFS design(6) o Which Frequency Scaling Policy should be used? • Ideal Frequency Tuning: ideally, the frequency of the two networks should be scaled in proportion to the traffic pattern by means of a balance ratio (B_ratio). • Practical Frequency Tuning: in practice, there is a limited range over which the frequency of a network can scale under a fixed voltage. The proposed DFS scheme must choose one of a fixed set of frequencies for the networks.
  • 17. V. Traffic-Aware DFS design(7) 1) Ideal Frequency Tuning • The doubling strategy attempts to strike a balance between responsiveness and stability of frequency scaling. • Why study Ideal Frequency Tuning? To understand the upper bound of the performance achievable by a DFS mechanism. Table (Traffic Pattern | F_req | F_reply) with rows Type A, Type B, Type C [frequency values not reproduced in the transcript]
  • 18. V. Traffic-Aware DFS design(8) 2) Practical Frequency Tuning • Since the difference in traffic load between the request and reply networks may be as high as tenfold, the best that can be done is to adjust the frequency between the extremes. • There is a transition overhead when the frequency is changed. However, with current technology, the frequency transition overhead may be negligible, depending on how often the frequency is changed. With F_low = F_base / 2 and F_high = F_base * 1.3: Traffic Pattern | F_req | F_reply: Type A | F_low | F_low; Type B | F_low | F_high; Type C | F_high | F_high
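The practical tuning table on slide 18 amounts to a small lookup. The sketch below encodes it with F_low = F_base / 2 and F_high = F_base * 1.3 as given on the slide; the function name and the 700 MHz base frequency in the usage example are assumptions for illustration.

```python
def practical_frequencies(pattern, f_base):
    """Return (F_req, F_reply) for traffic pattern 'A', 'B' or 'C'."""
    f_low = f_base / 2      # from the slide
    f_high = f_base * 1.3   # from the slide
    table = {
        "A": (f_low, f_low),    # light traffic: both networks throttled
        "B": (f_low, f_high),   # read-heavy: reply network sped up
        "C": (f_high, f_high),  # write-heavy: both networks sped up
    }
    return table[pattern]


if __name__ == "__main__":
    f_base_mhz = 700.0  # hypothetical base frequency, not taken from the slides
    for pattern in "ABC":
        f_req, f_reply = practical_frequencies(pattern, f_base_mhz)
        print(f"Type {pattern}: F_req={f_req:.0f} MHz, F_reply={f_reply:.0f} MHz")
```

Note that with these two levels the reply network can run at most 2.6 times faster than the request network (1.3 / 0.5), far short of the tenfold load difference mentioned on the slide, which is why the practical scheme only adjusts "between the extremes".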
  • 19. V. Traffic-Aware DFS design(9) o The required hardware: • 4 x 13-bit counters in each monitored shader core: ▪ one to frame the time interval (5,000 cycles); ▪ three for flit count, read request count and write request count. • a control network to deliver the collected data to the central DFS controller • a 15-bit adder and divider inside the DFS controller, to calculate the ratios • a floating point comparator inside the DFS controller, for the comparisons against the threshold values
  • 20. VI. Evaluation of the DFS’ performances(1) 1. Simulation Setup ▪ GPGPU-Sim is used to model a modern GPU (with micro-architecture parameters similar to the NVIDIA GeForce GTX480) ▪ ORION 2.0 is used for network power consumption estimation ▪ 15 benchmark applications ▪ a two-letter classification scheme is used to classify the applications: ▪ first letter: high or low speedup with a perfect NoC (whether or not it exceeds 30%); ▪ second letter: whether the benchmark injects heavy or light traffic into the NoC; ▪ the 15 applications are divided into three groups: 5 of them are LL, 3 are LH, and 7 are HH.
  • 21. VI. Evaluation of the DFS’ performances(2) 2. Evaluation Results a) Number of shader cores to monitor: how many shader cores are needed for an accurate identification? • Since GPGPU applications are typically data parallel, shader cores will have similar memory access behaviors. • When monitoring only 1 or 2 shader cores, the correlation coefficient is under 0.6 (too low). The correlation coefficients for 4, 8 and 14 shader cores are greater than 0.7. As a result, 8 shader cores are monitored.
  • 22. VI. Evaluation of the DFS’ performances(3) 2. Evaluation Results b) DFS period: how long should the period be over which the frequency setting of the networks is evaluated? • The frequency transition latency between different frequency levels is about 3 ns. • The quantum of time for frequency scaling should be two to three orders of magnitude larger than the transition overhead, therefore a time interval smaller than 1,000 cycles is not appropriate. • Among the candidate intervals of 1,000, 5,000 and 25,000 cycles, 5,000 cycles is a suitable choice, both in terms of scaling overhead and execution time reduction.
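To relate the candidate intervals on slide 22 to the 3 ns transition latency, one can convert cycles to time and take the ratio. This is only a back-of-the-envelope sketch: the slides do not state the NoC clock frequency, so the 1 GHz value below is an assumption, and the function name is hypothetical.

```python
TRANSITION_LATENCY_NS = 3.0  # frequency transition latency quoted on the slide


def interval_to_latency_ratio(interval_cycles, clock_hz):
    """How many times longer the DFS interval is than one frequency transition."""
    interval_ns = interval_cycles * 1e9 / clock_hz
    return interval_ns / TRANSITION_LATENCY_NS


if __name__ == "__main__":
    assumed_clock_hz = 1e9  # assumption: 1 GHz network clock, not from the slides
    for cycles in (1000, 5000, 25000):
        ratio = interval_to_latency_ratio(cycles, assumed_clock_hz)
        print(f"{cycles} cycles is about {ratio:.0f}x the transition latency")
```

Under the assumed clock, an interval of 1,000 cycles is already a few hundred times the transition latency, and longer intervals amortize the overhead further at the price of reacting more slowly to changes in the traffic pattern.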
  • 23. VI. Evaluation of the DFS’ performances(4) 2. Evaluation Results c) Speedup comparison (execution time reduction) Comparison of the practical DFS scheme with a perfect NoC and a static 1:4 scheme in terms of overall execution time normalized to the baseline setting.
  • 24. VI. Evaluation of the DFS’ performances(5) • LL benchmarks: the practical DFS scheme improves performance by only 0.1%. However, the perfect NoC does not perform well either, with an improvement of just 1.8%. • LH benchmarks: the DFS scheme improves the performance by 2.5%. ➢ Note that for some programs, such as RAY and LIB (mainly Type C), the DFS mechanism, unlike the static 1:4 method, does not degrade performance. • HH benchmarks: - The perfect NoC can achieve an overall average speedup of 88%. - The practical DFS improves performance by 14% on average (for some workloads, such as BFS and RD, the practical DFS mechanism achieves a performance improvement of up to 27%). - The static 1:4 method shows a 36% average speedup, but it cannot adapt to the Type C benchmarks, such as FWT and TRA, which lose 14% and 20% respectively.
  • 25. VI. Evaluation of the DFS’ performances(6) 2. Evaluation Results d) Power Consumption • Static power: since the static power under the same voltage is the same, the consumption is proportional to the total execution time. - For LL and LH benchmarks: reductions of 0.6% and 4.62%. - For HH benchmarks: reduction of 14.05%. - Overall: 7.5% energy saving across the 15 applications. • Dynamic power: on average the dynamic power consumption is higher than the baseline, although only by a small amount. ➢ More energy may be saved, particularly for LL benchmarks, if the routers can be put to sleep.
  • 26. VI. Evaluation of the DFS’ performances(7) 2. Evaluation Results e) Performance Gain by the Ideal DFS mechanism It is assumed that the frequency range is infinite and that it is possible to adjust the frequency without constraints
  • 27. VI. Evaluation of the DFS’ performances(8) • For LL and LH benchmarks, the Ideal DFS mechanism provides 1% and 6% speedup respectively. • For HH benchmarks, the Ideal DFS mechanism achieves a 55% improvement on average. For some of them, such as BFS and RD, the Ideal DFS mechanism yields large performance improvements of up to 74% and 158%. The reason is that the B_ratio succeeds in balancing the severely imbalanced injection rates of shader cores and MCs for these two benchmarks, which consist of a large number of read requests (mainly Type B). • Overall, the Ideal DFS provides a 26% performance benefit on average across the 15 applications.
  • 28. VII. Overview on related works - NoCs for GPUs have been explored by Bakhoda et al. [6], where the impacts of different network parameters are evaluated (2009). - In [2], Bakhoda et al. point out that GPU applications are bandwidth sensitive and latency insensitive, while the many-to-few-to-many architecture can create a bottleneck in the NoC. They propose a throughput-effective NoC design that adopts a checkerboard organization and multiport routers to remove the bottleneck (2010). - In [12], Kim et al. propose a heterogeneous network in which the reply network is a direct all-to-all network overlaid on the mesh to remove contention (2012). - Several research works have employed the DVFS policy to manage power consumption [13], [14], [15] (2002, 2008, 2009). - In [16], Lee et al. optimize the throughput of a GPU under a specified power constraint by dynamically scaling the number of cores and the voltage/frequency of cores and on-chip interconnects/caches (2011).
  • 29. VIII. Paper assessment(1) o On an assessment scale of [1-5]: • ORIGINALITY: 3 • TECHNICAL IMPACT: 4 • CLARITY: 4 • SCIENTIFIC SIGNIFICANCE: 3
  • 30. VIII. Paper assessment(2) ➢ Originality: 3 ▪ this work is “on the same wavelength” as the other works of the same years; ▪ it relies heavily on the notation and results of [2] Bakhoda et al.; ▪ it focuses on a problem not treated by other works; ▪ it reuses solutions previously adopted for other problems (like DVFS for the management of power consumption); ▪ it proposes a simple but effective solution.
  • 31. VIII. Paper assessment(3) ➢ Technical Impact: 4 ▪ it directly provides the hardware and software implementation of the proposed solution; ▪ the solution can be easily implemented on GPGPUs at low cost; ▪ the proposed solution brings some useful performance gains; ▪ no performance degradation (unlike the static 1:4 scheme for some of the applications); ▪ however, the performance gains (7.4% on average, and up to 27%) are “in line” with the improvements obtained by other works’ solutions (15-20%); that is, it does not provide a disruptive improvement in performance compared with other results.
  • 32. VIII. Paper assessment(4) ➢ Clarity: 4 ▪ the path from the analyzed problem to the solution is well traced; ▪ charts and diagrams are shown and quite well commented; ▪ the numeric results are quite well reported; ▪ however, some parts are probably explained a little too quickly (e.g. power consumption); ▪ some numeric results could be reported in a more schematic and clearer way.
  • 33. VIII. Paper assessment(5) ➢ Scientific Significance: 3 ▪ the proposed solution is fully implementable with currently available technologies; ▪ the proposed solution is quite generic and not restricted to certain types of GPGPUs; ▪ the proposed solution is independent of the topology of the NoC; ▪ the proposed solution remains applicable as technological progress increases the density of integrated chips; ▪ some minor improvements have been mentioned and left for future work, so there is room to enhance the performance; ▪ however, the proposed solution is an incremental improvement, not a disruptive one; ▪ the average speedup is about 7.4%.
  • 34. Thanks for your attention. Luca Sinico, attending the Master’s Degree in Computer Engineering at the University of Padua (Italy). Review made for the “Distributed Systems” course. luca.sinico@gmail.com