Techniques for Efficient RTL Clock and Memory Gating Takedown of Next Generation High-performance Microprocessor Designs
Arun Joseph, Spandana Rachamalla, Rahul Rao, Shashidhar Reddy
IBM Systems, Contact: arujosep@in.ibm.com
2. Over the last decade or so, several techniques were proposed for enabling RTL analysis. But with the
advent of FinFET based designs [1], there is renewed focus on dynamic power analysis and mitigation [2],
especially early in the design flow [3] using techniques like clock gating [4] and memory activity takedown.
Additionally, high performance microprocessor design blocks are getting larger, with an increasing number of
clock gating domains and with notable differences in activity across these domains and workloads [5].
The turnaround time for performing RTL analysis using quick synthesis followed by netlist based tool
engines [6,7,8] is too long for rapid clock and memory activity exploration. Formal techniques [9,10],
though comprehensive, do not accurately capture the dependency of clock and memory activity on the
workloads.
The techniques in [3] for early activity analysis require the development of dedicated software for test-bench
creation and activity analysis. Also, even though the analysis is thorough and presented in a graphically rich
manner, further debug and activity reduction of highly active blocks is non-intuitive to the designer.
In this paper, we present a new platform for enabling rapid clock and memory activity takedown and its
application to the design of a next generation industry class high performance microprocessor [1]. To the
best of our knowledge, this is the first such platform that brings together the principles of designer level logic
verification, logic debug and light-weight netlist creation, while specifically catering to the requirements of
rapid RTL activity takedown.
Slide 2
Motivation
3. Slide 3
Main Idea
Bring together principles of designer level logic
verification trace import and simulation replay [11],
logic debug [12, 14] and virtual logic netlist based
RTL analysis [13], to specifically cater to the
requirements of rapid RTL activity takedown.
Fig1 shows the key EDA building blocks of the
platform, and Fig2 shows how they were “tied
together” to create the EinsCG+ platform for
enabling rapid clock and memory activity takedown.
(Details in footnotes)
Figure 1. Clock & memory gating platform
– pieces of the puzzle
Figure 2. EinsCG+ software architecture
Enables logic designers with a pre-configured (yet
familiar) platform for meeting clock and memory
gating targets.
Enables rapid exploration of power saving
opportunities across IP blocks, use-scenarios,
workloads and workload windows. (Input block RTL to
next set of pin-pointed opportunities in ~3 minutes)
Low platform development cost (~1 month). Except for
building block 9, the rest are available in modern
industry class EDA suites like [14].
Key Idea Key Benefits
4. Fig3 shows the generic use model of EinsCG+ to rapidly take down
clock and memory activity. The quick analyses illustrated in Fig4
aid quick decision making and tracking.
One key benefit of EinsCG+ is that, if activity needs to be further
reduced, it provides the RTL designer with a familiar logic debug
environment specifically preconfigured for clock gating (Fig5).
The wave view is automatically preloaded with pin-pointed clock
gating opportunities in the design, sorted by return on investment.
The compiled version of the design is also preloaded to enable
structure-assisted debug such as “why” and “trace-back” analysis.
Seamless per simulation cycle clock gating debug across the wave
window, RTL source, hierarchy and logic browser is also enabled.
Slide 4
EinsCG+ Iteration
Figure 3. EinsCG+ iteration: Generic use-model
Figure 4. EinsCG+ quick analyses
(a) Tracking across releases (b) Per clock gating domain multi-workload clock and
data activity report (c) Multi workload memory activity report
Figure 5. EinsCG+ advanced clock gating configured debug view
Figure 5 panels: Source View; Wave View (preloaded with sorted and pin-pointed clock gating opportunities); Hierarchy View; Logic View
5. Use-case1: EinsCG+ helped identify sub-design blocks of a design under test (DUT) that did not meet clock and/or memory gating
criteria, and allowed designers to iterate independently on those blocks to close on targets before redoing the analysis on the DUT. The capability of
EinsCG+ to perform on-the-fly re-simulation from an existing DUT trace eliminated the need for higher level DUT analysis on
each RTL update iteration, while allowing different designers to evaluate individual block level clock gating updates
in parallel.
Use-case2: Use of vendor IP in microprocessor designs is becoming increasingly common in the era of Open Compute [15].
Such IP blocks are often used in different modes across the design. While the vendor IP blocks may be designed efficiently for
power, incorrect mode configuration can result in high activity and power. Independent, iterative EinsCG+ analysis on specific
vendor IP instances in the design helped ensure the correct mode configuration.
Use-case3: EinsCG+ analysis on the design was used to identify activity peaks and the corresponding simulation windows. To
take down these activity peaks, EinsCG+ was used iteratively on simulation windows of interest, especially for larger workloads.
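The peak identification step in Use-case3 can be illustrated with a minimal Python sketch. It assumes a per-cycle activity count is already available as a list and finds the fixed-width simulation window with the highest total activity. The function name `peak_window`, the window width and the sample data are hypothetical and are not part of EinsCG+.

```python
def peak_window(activity, width):
    """Return (start, end, total) of the width-cycle window with the highest summed activity.

    activity holds one activity count per simulation cycle (assumed input layout).
    """
    if width <= 0 or width > len(activity):
        raise ValueError("width must be between 1 and the trace length")
    total = sum(activity[:width])
    best_start, best_total = 0, total
    # Slide the window one cycle at a time, updating the running sum.
    for start in range(1, len(activity) - width + 1):
        total += activity[start + width - 1] - activity[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_start + width - 1, best_total


# Hypothetical per-cycle activity counts with a peak in the middle.
activity = [2, 3, 2, 9, 11, 10, 3, 2, 1, 2]
print(peak_window(activity, 3))
```

The returned cycle range could then serve as the simulation window of interest for the next iteration.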
Slide 5
Experimental Evaluation
Figure 6. Use-case1: EinsCG+ iterations on sub-designs
Figure 7. Use-case2: Vendor IP mode configuration
Figure 8. Use-case3: EinsCG+ workload analysis
6. Slide 6
Summary
We introduced what is, to the best of our knowledge, the first platform that brings together the principles of designer level logic
verification trace import and simulation replay, logic debug and virtual logic netlist based
RTL analysis to specifically cater to the requirements of rapid RTL clock and memory activity
takedown.
We demonstrated how the platform was developed in ~1 month using existing EDA building
blocks used in an industry context.
We presented the application of the platform for the design of a next generation industry
class microprocessor, across a range of use-cases. The platform enabled the path from an
input block RTL to the next iteration of pin-pointed opportunities in ~3 minutes.
We believe the techniques described in the paper are generic and advocate application of
the same techniques to enable rapid activity takedown.
Editor's Notes
[1] Thompto, et al. “POWER9 Processor for the Cognitive Era,” HotChips 2016
[2] http://www.edn.com/electronics-blogs/eda-power-up/4438874/FinFET-impact-on-dynamic-power
[3] Putting On the Dynamic Power Glasses: A FinFET-Aware Approach for Early Realistic Block Activity Analysis and Exploration, DAC’16
[4] Jacobson et al. "Stretching the limits of clock-gating efficiency in server-class processors." In High-Performance Computer Architecture, 2005. HPCA-11.
[5] Efficient Techniques for Per Clock Gating Domain Contributor based Power Abstraction of IP Blocks for Hierarchical Power Analysis, DAC’16
[6] Guy D et al., “VDHL/Verilog expertise and gate synthesis automation system”, Patent US 6,289,498 B1.
[7] P. Hurst. Automatic synthesis of clock gating logic with controlled netlist perturbation. In Proc. of DAC ’08, pages 654–657, 2008.
[8] Sundaresan, K. et al., “A Tool for Exploring Advanced RTL Clock Gating Opportunities in Microprocessor Design”
[9] Arbel, E.; Eisner, C.; Rokhlenko, O., "Resurrecting infeasible clock-gating functions," Design Automation Conference, 2009. DAC '09.
[10] Y. Kuo, S. Weng, and S. Chang. A novel sequential circuit optimization with clock gating logic. In Proc. of ICCAD ’08, pages 230–233, 2008.
Memory activity takedown: techniques that reduce memory activity, for example the number of reads and writes per cycle to arrays and other memory elements.
[11] S. Bergman et al., "Designer-level verification — An industrial experience story," DATE 2015
[12] B. Wile, J. C. Goss, and W. Roesner, Comprehensive Functional Verification: The Complete Industry Cycle. Elsevier, 2005
[13] “Virtual logic netlist: Enabling efficient RTL analysis”, Sixteenth International Symposium on Quality Electronic Design, 2015
[14] Darringer et al., "EDA in IBM: past, present, and future," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Dec 2000.
Details of the flow:
As shown in Figure 2, the input RTL is first compiled, and structural analysis is then performed to generate structural data and simulation directives. Simulation directives determine which signals in the simulation model need to be monitored to obtain clock and data switching activity. These are often the signals at the inputs and outputs of latches and design macros, and at the read/write ports of arrays and other memory elements. Additional structural information, such as per clock domain information (how many clock gating domains a macro has and which latches fall into each of them), is also extracted from the RTL model. This is later used for capturing and tracking the clock activity and data activity of macros on a per clock gating domain basis.
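A minimal Python sketch of the structural step described above: given a list of latch records, it groups latches by clock gating domain and collects the signals that would need to be monitored. The `Latch` record, the `build_directives` function and all signal names are assumptions made for illustration; the actual structural data and directive formats used by EinsCG+ are not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Latch:
    name: str         # hierarchical latch name (hypothetical)
    data_in: str      # signal driving the latch data input
    data_out: str     # latch output signal
    gate_domain: str  # clock gating domain, identified here by its enable signal


def build_directives(latches):
    """Group latches by clock gating domain and list the signals to monitor."""
    domains = defaultdict(list)
    monitor = set()
    for latch in latches:
        domains[latch.gate_domain].append(latch.name)
        # Monitor latch input, latch output and the domain's enable signal.
        monitor.update((latch.data_in, latch.data_out, latch.gate_domain))
    return dict(domains), sorted(monitor)


# Hypothetical macro with three latches in two clock gating domains.
latches = [
    Latch("u0.q0", "u0.d0", "u0.q0_out", "u0.en_a"),
    Latch("u0.q1", "u0.d1", "u0.q1_out", "u0.en_a"),
    Latch("u0.q2", "u0.d2", "u0.q2_out", "u0.en_b"),
]
domains, monitor = build_directives(latches)
print(domains)
print(monitor)
```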
Next, from any higher level simulation trace (system, chip, core or unit level, taken for example from pre-silicon logic simulators or hardware-accelerated simulators), a scenario is generated for the simulation window of interest using a designer level verification tool [11]. The simulation window can be the full simulation run, a subset of it where high clock or memory activity is seen, or any other window of interest. For example, from a higher level simulation spanning cycles 0-3000, the EinsCG+ simulation window can be cycles 1200-1300, where an activity peak is seen. Logic simulation is then performed using this scenario as a test case, along with the simulation directives and a simulation model (generated from the input RTL), to generate switching activity data.
Light-weight netlist creators like VLN [13] are used to enable the reuse of some backend power analysis engines for performing workload aware clock and memory gating analyses. A virtual logic netlist (VLN) is an incomplete yet logical netlist graph of the design. VLN enables rapid RTL analysis using backend tool engines without the need for time-intensive synthesis techniques.
The outputs of EinsCG+ are various activity analyses and a logic debug environment, specially preconfigured with pin-pointed opportunities for further clock gating debug. Multiple activity analyses, such as per clock gating domain clock and data analysis, memory gating analysis and redundancy removal estimation, are performed and reports are generated.
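To make the notion of switching activity over a simulation window concrete, the following Python sketch counts value changes per monitored signal inside a chosen cycle range. The trace layout (one sampled value per cycle per signal), the function name `toggles_in_window` and the signal names are assumptions for illustration only; they do not reflect the trace format of any particular simulator.

```python
def toggles_in_window(trace, start, end):
    """Count value changes per signal between cycles start and end (inclusive).

    trace maps a signal name to a list with one sampled value per cycle.
    """
    counts = {}
    for signal, values in trace.items():
        window = values[start:end + 1]
        # A toggle is any cycle whose value differs from the previous cycle.
        counts[signal] = sum(1 for prev, cur in zip(window, window[1:]) if prev != cur)
    return counts


# Hypothetical 8-cycle trace for two monitored signals.
trace = {
    "clk_en": [0, 1, 1, 1, 0, 0, 1, 1],
    "data":   [0, 0, 1, 1, 1, 1, 1, 0],
}
print(toggles_in_window(trace, 2, 6))
```

The same function applied to the full trace gives the whole-run counts, so a window can be compared against the overall activity.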
Benefits:
When evaluated across different macros used in the design of [1], the time from an input block RTL to the next set of pin-pointed gating opportunities for the same block is within ~3 minutes. This was evaluated across a broad range of core and uncore macros (blocks) of a next generation high performance microprocessor design [1], and also across different workload traces. The turnaround time for smaller macros was ~1 minute or less. For very large macros it was within ~8 minutes, while for most macros it was within ~3 minutes. Very large macros are those with much more complex logic and more latches and clock gating domains (>~1000).
The development cost for the overall platform was ~1 month.
This effort was primarily related to tying together the different building blocks 1-9 and also for overall user-experience improvements.
Details:
Figure 4: In addition to the reporting and visualization techniques in [3], EinsCG+ analysis also enables reporting (and visualization) of both clock and data activity of macros on a per clock gating domain basis (as shown in Fig4) across multiple workloads. This is critical for next generation high performance microprocessor designs like [1], where macros are getting larger with every generation, with an increasing number of clock gating domains (in some cases, more than 1000 clock gating domains in a single macro) and with notable differences in activity across these clock gating domains and across workloads [5]. Additionally, EinsCG+ enables analysis of memory activity events such as readspercycle and writespercycle to help focus on the reduction of memory activity.
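A minimal sketch of a per clock gating domain, multi-workload activity report, in Python. It assumes that clock activity is the fraction of cycles in which the domain's clock enable is high, and that data activity is the fraction of latch output transitions that actually change value. Both definitions, the function name `domain_activity` and the sample data are assumptions for illustration; the exact metrics used by EinsCG+ are not specified here.

```python
def domain_activity(enable, latch_values):
    """Return (clock_activity, data_activity) for one clock gating domain.

    enable: one 0/1 clock-enable sample per cycle.
    latch_values: one list of per-cycle output values per latch in the domain.
    """
    cycles = len(enable)
    if cycles == 0:
        raise ValueError("empty trace")
    clock_activity = sum(enable) / cycles
    changes = sum(
        1
        for values in latch_values
        for prev, cur in zip(values, values[1:])
        if prev != cur
    )
    transitions = len(latch_values) * (cycles - 1)
    data_activity = changes / transitions if transitions else 0.0
    return clock_activity, data_activity


# Hypothetical traces: one domain with two latches, seen under two workloads.
workloads = {
    "workload_a": {"dom0": ([1, 1, 1, 1], [[0, 1, 0, 1], [0, 0, 0, 0]])},
    "workload_b": {"dom0": ([1, 0, 0, 1], [[0, 1, 1, 1], [0, 0, 0, 0]])},
}
report = {
    (wl, dom): domain_activity(en, vals)
    for wl, doms in workloads.items()
    for dom, (en, vals) in doms.items()
}
for key, (clk, data) in report.items():
    print(key, f"clock={clk:.2f}", f"data={data:.2f}")
```

A domain whose clock activity is much higher than its data activity under a given workload is clocked often while its latches rarely change, which is the kind of difference such a report is meant to expose.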
Figure 5: To further reduce activity, the logic designer can simply focus on debugging the pin-pointed gating opportunities presented in the familiar logic debug frontend, in the order listed in the wave viewer (Fig5). Once some actions to reduce activity have been taken, the logic designer can simply iterate (as shown in Fig3) to get to the next set of opportunities in ~3 minutes.
Once the activity targets have been met, further opportunities are not presented.
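The ordering and stopping behaviour described above can be sketched in Python as follows. The scoring used here (latch count multiplied by the gap between clock activity and data activity) is only an assumed proxy for return on investment, and the latch-weighted clock activity check is an assumed form of the activity target; neither is taken from EinsCG+. All names and numbers are hypothetical.

```python
def rank_opportunities(domains, clock_target):
    """Sort clock gating domains by an assumed return-on-investment score.

    domains: list of dicts with 'name', 'latches', 'clock_activity', 'data_activity'.
    Returns an empty list once the latch-weighted clock activity meets clock_target.
    """
    total_latches = sum(d["latches"] for d in domains)
    if total_latches == 0:
        return []
    weighted_clock = sum(d["latches"] * d["clock_activity"] for d in domains) / total_latches
    if weighted_clock <= clock_target:
        return []  # target met: no further opportunities are presented

    def score(d):
        # Assumed proxy: latches clocked without their data changing.
        return d["latches"] * max(d["clock_activity"] - d["data_activity"], 0.0)

    ranked = sorted(domains, key=score, reverse=True)
    return [d["name"] for d in ranked if score(d) > 0]


# Hypothetical per-domain activity figures for one macro.
domains = [
    {"name": "dom_a", "latches": 64, "clock_activity": 0.9, "data_activity": 0.1},
    {"name": "dom_b", "latches": 512, "clock_activity": 0.5, "data_activity": 0.25},
    {"name": "dom_c", "latches": 16, "clock_activity": 0.25, "data_activity": 0.25},
]
print(rank_opportunities(domains, clock_target=0.3))
print(rank_opportunities(domains, clock_target=0.6))
```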
Three specific use-cases of EinsCG+ during the design of next generation microprocessors like [1] are presented.
[15] Open Compute Project: http://www.opencompute.org/