13. Hardware vs AI Trend
Source: Ark Investments
[Figure: time-to-market comparison. Training a Transformer (213M params) with NAS takes days to months, while the hardware flow (Design → RTL/Physical → Manufacture → Test/Pkg) takes months to years.]
14. Hardware vs AI Trend
n HW improvements reduce the cost of training by 37% per year
n Model size has grown at a pace of 10X per year
n AI training cost continues to climb quickly
Source: Ark Investments
16. Carbon Footprint of AI and DL
n Eye-opening study:
q The energy of training the model for one day was measured
q This was scaled by the paper's reported number of GPU-days needed for training
q The cost was computed from the average energy cost in the US
q This is the result for a single training run
Source: https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/
BERT carbon footprint ≈ 1,400 lb of CO2, close to a round-trip trans-America flight for one person; the heaviest model in the study equals that flight for 416 people.
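The methodology above can be turned into a back-of-the-envelope sketch: measure one GPU's daily draw, scale by GPU-days, then price the energy. Every number in this example (GPU-days, power draw, energy and CO2 rates) is an illustrative assumption, not a value from the study.

```python
# Hypothetical footprint estimate following the slide's methodology.
# All constants below are illustrative assumptions, not measured values.

def training_footprint(gpu_days, watts_per_gpu, usd_per_kwh, lb_co2_per_kwh):
    """Scale one GPU's power draw to the whole run, then price it."""
    kwh = gpu_days * 24 * watts_per_gpu / 1000.0   # energy for the full run
    return {"kwh": kwh,
            "usd": kwh * usd_per_kwh,
            "lb_co2": kwh * lb_co2_per_kwh}

# Example: 1,000 GPU-days at 300 W, with assumed average US rates.
est = training_footprint(gpu_days=1000, watts_per_gpu=300,
                         usd_per_kwh=0.12, lb_co2_per_kwh=0.95)
print(round(est["kwh"]), round(est["usd"]), round(est["lb_co2"]))  # → 7200 864 6840
```

Note this covers a single run; hyperparameter search multiplies the total by the number of runs.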
17. Growth in AI Energy Requirements
n Models are becoming larger
q GPT-2 → 1.5B parameters and a few petaflop-days to train
q GPT-3 → 175B parameters
q PaLM → 540B parameters
q GPT-4 → 100s of billions of parameters!
q What is next?
Source: Compute Trends Across Three Eras of Machine Learning, Sevilla et al., arXiv 2022
18. Growth in AI Energy Requirements
n Datasets are becoming significantly larger
q 3B words (training set) → training of BERT
q 32B words → XLNet
q 40B words → GPT-2
q 500B words → GPT-3
Source: https://epochai.org/blog/trends-in-training-dataset-sizes
27. Modern VLSI Layout
n Designs are getting larger, with billions of transistors on chip
n Design flows are getting increasingly complicated
Transistor counts:
• IBM Power10: 18B
• Apple A15: 15B
• NVIDIA Ampere GA100: 54B
• Cerebras Mega: 1.2T
29. Learning Assisted Computer Aided Design
n Physical design is a very time-consuming process
q Iterative and incremental (5-18 months in industry)
q Heuristic optimization algorithms
q Human experts
n Applied learning can help to:
q Reduce the reliance on human experts
q Optimize beyond heuristics
q Reduce the design time
[Figure: design maturity vs. design time; the learning-assisted outcome reaches the goal sooner than the conventional design outcome. The heuristic search space covers only part of the design space, finding a local best rather than the global best.]
30. Problem-Solution Opportunity Matrix
n Some problems do not get old, but our solutions do!
q New Problem + New Solution = Invention: Big Potential! High Risk!
q New Problem + Old Solution = Experimentation: Some Potential! Lower Risk!
q Old Problem + New Solution = Innovation: Big Money! Moderate Risk!
q Old Problem + Old Solution = Maintenance: Waste of Time!
31. Physical Design
n Each physical design step involves two required processes:
q Optimization (usually semi-heuristic): placement, CTS, routing, chip finishing
q Quality of Result (QoR) analysis: e.g., power, area, timing, DRC violations, IR, EM, SI
n For example, during placement:
q The positions of movable objects (cells) are determined (optimization)
q The quality of the placement is measured (QoR) in terms of:
n Power, area, timing, potential routing congestion, etc.
n The (semi-heuristic) optimization solutions are designed to improve the QoR metrics.
n In ML-assisted physical design, we therefore need to develop ML solutions for both optimization and QoR analysis.
n Given the approximate nature of ML, signoff-level analysis is not possible. The next best thing is prediction (which can then be verified by EDA tools).
q If a prediction is highly accurate and far faster, it provides an advantage → it can be used in the optimization loop (a full STA may take days; an ML-based prediction may take seconds!)
n ML also allows us to forecast the outcome of future steps.
q Hence, ML can be used for both prediction and forecasting.
33. ML Framework could be used for speedup
n ML Framework for speedup: rewrite heuristic algorithms using learning framework,
formulate the optimization as a training problem, enjoy GPU scaling.
[LAPD diagram: ML Framework Speedup]
34. Example: DREAMPlace
n VLSI Placement
[Diagram: VLSI Placement takes a gate-level netlist, standard-cell library, floorplan, and constraints, and produces a legal placement]
Challenges of nonlinear placement:
• Low efficiency: >3h for a 10M-cell design, and today we target much larger placements!
• Limited acceleration and speedup
References: Lin, Yibo, et al. "DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement." Proceedings of the 56th Annual Design Automation Conference, 2019.
35. Example: DREAMPlace
n Interestingly, the objective of placement and training are very similar
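The analogy can be made concrete with a toy sketch: cell coordinates play the role of trainable parameters and a differentiable (here quadratic) wirelength plays the role of the loss, minimized by the same gradient-descent update an ML framework's optimizer applies. The 1-D netlist below is hypothetical.

```python
# Minimal sketch of the DREAMPlace analogy: treat cell coordinates as
# trainable parameters and smoothed wirelength as the loss, then run the
# plain gradient-descent "training" loop. Toy 1-D example, made-up nets.

def wirelength(x, nets):
    # Quadratic net model: sum of squared pin-to-pin spans (differentiable).
    return sum((x[a] - x[b]) ** 2 for a, b in nets)

def grad(x, nets):
    g = [0.0] * len(x)
    for a, b in nets:
        d = 2.0 * (x[a] - x[b])
        g[a] += d
        g[b] -= d
    return g

x = [0.0, 5.0, 9.0]          # movable cell positions (the "parameters")
nets = [(0, 1), (1, 2)]      # two-pin nets (the "dataset")
for _ in range(200):         # the "training" loop
    g = grad(x, nets)
    x = [xi - 0.05 * gi for xi, gi in zip(x, g)]

print(round(wirelength(x, nets), 6))  # → 0.0 (wirelength driven to zero)
```

In DREAMPlace the same formulation lets the full placement objective run on GPUs through a deep-learning toolkit's autograd and optimizers.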
37. Example: DREAMPlace Results
n Significant speedup!
n Areas for improvement:
q It is not congestion-aware (congestion is handled only indirectly by a density constraint)
q It is not timing-aware (average timing is optimized, not the worst case); good for power, though!
38. ML Framework could be used for speedup
n ML Framework for speedup: rewrite heuristic algorithms using learning framework,
formulate the optimization as a training problem, enjoy GPU scaling.
[LAPD diagram: ML Framework Speedup]
39. ML Could Be Used for QoR Prediction/Forecast
n ML framework for speedup
n Develop ML for QoR prediction
q Reduce QoR analysis time
q See many PVT corners early in design time
q Predict QoR of future steps (e.g., routing congestion at placement time)
[LAPD diagram: ML Framework Speedup + ML for QoR Prediction]
Examples:
• PBA prediction using GBA timing analysis
• MCMM STA prediction using STA runs in a limited number (e.g., 3) of corners
• IR drop prediction
• Routing congestion prediction at synthesis
• Routing congestion prediction at placement
• DRC prediction
• Yield prediction
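The first example above can be sketched minimally, assuming (purely for illustration) that PBA slack differs from GBA slack by a learnable linear correction; real predictors use many more features and stronger models.

```python
# Sketch of the QoR-prediction idea: learn a cheap model mapping a fast,
# pessimistic analysis (GBA slack) to the expensive one (PBA slack).
# The training pairs and the linear model are illustrative stand-ins.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((xv - mx) * (yv - my) for xv, yv in zip(xs, ys)) / \
        sum((xv - mx) ** 2 for xv in xs)
    return a, my - a * mx

# Hypothetical data: PBA recovers a fixed 15 ps of GBA pessimism.
gba = [-50.0, -20.0, 0.0, 30.0, 80.0]   # ps
pba = [g + 15.0 for g in gba]

a, b = fit_line(gba, pba)

def predict(g):
    # Seconds of model evaluation instead of a full STA run.
    return a * g + b

print(round(predict(10.0), 3))  # → 25.0
```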
40. ML Could Be Used for QoR Prediction/Forecast
n ML framework for speedup
n Develop ML for QoR prediction
n Use representation learning to automate feature engineering
q Learn better features → improved accuracy!
q Remove the need for cross-domain ML and CAD experts → ease of development
q Lower adoption bar → widespread use!
[LAPD diagram: ML Framework Speedup + ML for QoR Prediction + Auto Feature Engineering]
We will cover a case study from our group on MCMM PBA prediction using automated feature engineering
41. ML Could Be Used for Optimization
n ML framework for speedup
n Develop ML for QoR prediction
n Use representation learning to automate feature engineering
n ML (e.g., RL) for optimization
q Optimize beyond heuristic models.
[LAPD diagram: ML Framework Speedup + ML for QoR Prediction + Auto Feature Engineering + ML for Optimization]
Examples:
• RL, CNN, GAN for macro placement
• RL for CTS
• RL, CNN, GAN for routing
• …
We will cover a case study from our group on CTS using reinforcement learning!
42. ML-Guided ML-Optimization (ML Loop)
n ML framework for speedup
n Develop ML for QoR prediction
n Use representation learning to automate feature engineering
n ML (e.g., RL) for optimization
n The ML for QoR prediction and the optimization (ML- or framework-based) can work together in a loop.
[LAPD diagram: ML Framework Speedup + ML for QoR Prediction + Auto Feature Engineering + ML for Optimization]
43. ML-Guided ML-Optimization (ML Loop)
n ML Framework for speedup
n Develop ML for QoR prediction
n Use Representation Learning to automate feature engineering
n ML (e.g., RL) for optimization
n The ML for QoR prediction and the optimization (ML- or framework-based) can work together in a loop.
[LAPD diagram: ML Framework Speedup + ML for QoR Prediction + Auto Feature Engineering + ML for Optimization]
Two possibilities:
1. QoR prediction nested within optimization → ML-guided ML optimization
• Reduced number of iterations
• Better optimization decisions
2. Optimization nested within QoR analysis → genetic-based optimization
• Analogous to ML replacing the physical designer: analyzing the result and re-running the flow!
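Possibility 1 can be sketched as follows. The `predicted_qor` stand-in and the random move generator are hypothetical; the point is the shape of the loop, where a fast ML predictor scores every candidate move instead of a full analysis run.

```python
# Sketch of QoR prediction nested inside the optimization loop: each
# candidate move is scored by a fast predictor rather than a full run.
# The predictor and the move generator below are illustrative stand-ins.

import random

def predicted_qor(x):
    # Stand-in ML predictor: lower is better, with a hypothetical optimum at 3.
    return (x - 3.0) ** 2

def ml_guided_optimize(x, steps=100, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Propose a batch of candidate moves around the current solution.
        candidates = [x + rng.uniform(-1.0, 1.0) for _ in range(8)]
        best = min(candidates, key=predicted_qor)   # predictor guides the move
        if predicted_qor(best) < predicted_qor(x):  # accept only improvements
            x = best
    return x

print(round(ml_guided_optimize(10.0), 2))
```

Because the predictor answers in microseconds, far more candidates can be evaluated per iteration than with a signoff-grade analysis in the loop.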
44. ML-Guided ML-Optimization (ML Loop)
n Using multiple ML-based QoR predictions in the optimization loop results in:
q Multi-objective optimization
q Predictive optimization
n Pros:
q Optimize for current and future QoR
q Prevent doomed runs
q Prevent QoR estimation pessimism from lowering design quality
q Faster signoff
q Reduce time to market (TTM)
q Reduce tool and licensing cost
q Reduce engineering cost
q …
[LAPD diagram: ML Framework Speedup + ML for QoR Prediction + Auto Feature Engineering + ML for Optimization]
ML-Guided ML-Optimization (ML Loop)
46. RL for Peak Current Reduction
n Problem: reduce peak current → IR drop
n Method: maximize the skew in the design to spread the clock arrival times
[Figure: demanded current vs. time]
Reference: "A Reinforced Learning Solution for Clock Skew Engineering to Reduce Peak Current and IR Drop." Proceedings of the 2021 on GLSVLSI 2021
48. RL for Peak Current Reduction
n Problem: reduce peak current → IR drop
n Method: maximize the skew in the design to spread the clock arrival times
n Solution: reinforcement learning
49. Reinforcement Learning
n Reinforcement learning (RL): how an intelligent agent should take actions and interact with its environment in order to maximize its reward
n RL combines exploitation (maximizing known rewards) with exploration (taking risks) to learn about possible future rewards
n Applicable to problems of sequential decision making
50. RL for Peak Current Reduction
n Problem: reduce peak current → IR drop
n Method: maximize the skew in the design to spread the clock arrival times
n Solution: reinforcement learning
[Figure: the RL agent]
51. RL for Peak Current Reduction
n Problem: reduce peak current → IR drop
n Method: maximize the skew in the design to spread the clock arrival times
n Solution: reinforcement learning
[Figure: the environment performs a timing check on each move]
Setup: T + t_ck-c - t_ck-l ≥ t_c-q + t_p,logic + t_su
Hold: t_hold + δ ≤ t_cd,logic + t_cd,reg
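The two timing checks above can be written as an executable sketch; the function and variable names mirror the slide's symbols, and the demo numbers (in ps) are made up.

```python
# Hedged sketch of the setup and hold checks: a skew move is legal only
# if both slacks stay non-negative. Demo values are illustrative.

def setup_slack(T, t_ck_c, t_ck_l, t_c_q, t_logic, t_su):
    # Setup: T + t_ck-capture - t_ck-launch >= t_c-q + t_p,logic + t_su
    return (T + t_ck_c - t_ck_l) - (t_c_q + t_logic + t_su)

def hold_slack(t_cd_logic, t_cd_reg, t_hold, skew):
    # Hold: t_hold + skew <= t_cd,logic + t_cd,reg
    return (t_cd_logic + t_cd_reg) - (t_hold + skew)

def move_is_legal(**kw):
    return setup_slack(kw["T"], kw["t_ck_c"], kw["t_ck_l"],
                       kw["t_c_q"], kw["t_logic"], kw["t_su"]) >= 0 and \
           hold_slack(kw["t_cd_logic"], kw["t_cd_reg"],
                      kw["t_hold"], kw["skew"]) >= 0

# Hypothetical numbers in ps: 100 ps of launch skew, 1 ns clock period.
print(move_is_legal(T=1000, t_ck_c=0, t_ck_l=100, t_c_q=80,
                    t_logic=700, t_su=50, t_cd_logic=120,
                    t_cd_reg=30, t_hold=40, skew=100))  # → True
```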
52. RL for Peak Current Reduction
n Problem: reduce peak current → IR drop
n Method: maximize the skew in the design to spread the clock arrival times
n Solution: reinforcement learning
q Positive reward if an action widens the overall CAT distribution
q Large negative reward if an action generates a timing violation
q Allow aggressive exploration with a large discount factor (allowing search for future rewards)
[Figure: RL loop with the timing engine]
Action A: insert or remove a clock buffer, and where to move it
Reward R: + delta skew; - timing violation
State S: the updated design
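A toy version of this formulation, with tabular Q-learning standing in for the paper's agent: the action inserts or removes a clock buffer on one sink, the reward is the change in clock-arrival-time (CAT) spread, and a large negative reward fires past a hypothetical timing limit. All constants are illustrative.

```python
# Toy sketch of the A/R/S formulation above. Tabular Q-learning is a
# stand-in for the paper's agent; every number here is illustrative.

import random

BUF_DELAY, MAX_SKEW = 10, 40          # ps per buffer; violation threshold
ACTIONS = (1, -1)                     # insert / remove one buffer

def step(buffers, action):
    nxt = max(0, buffers + action)
    skew = nxt * BUF_DELAY            # CAT spread grows with inserted buffers
    if skew > MAX_SKEW:
        return buffers, -100.0        # timing violation: reject, big penalty
    return nxt, (nxt - buffers) * BUF_DELAY   # reward = + delta skew

def train(episodes=300, seed=1):
    rng, Q = random.Random(seed), {}
    for _ in range(episodes):
        s = 0
        for _ in range(10):
            if rng.random() < 0.2:                 # epsilon-greedy exploration
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q.get((s, act), 0.0))
            s2, r = step(s, a)
            best2 = max(Q.get((s2, b), 0.0) for b in ACTIONS)
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + 0.5 * (r + 0.9 * best2 - old)   # discount 0.9
            s = s2
    return Q

Q = train()
# Inserting a buffer from the initial state earns a positive learned value.
print(Q.get((0, 1), 0.0) > 0.0)  # → True
```

The large discount factor (0.9) makes the agent value skew gains several moves ahead, matching the slide's intent.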
53. RL for Peak Current Reduction
n Results on the Ethernet benchmark
q Wider distribution of CAT with RL
n Measurements from Ansys RedHawk
q 36.2% reduction in peak current
q 41.4% improvement in IR drop
[Figure: skew histograms (count vs. skew) for heuristic CTS and for reinforcement learning]
55. Representation Learning for STA Prediction
n Problem:
q Path-based static timing analysis (PBA) is very expensive
q Timing checks across many corners are very expensive
q Designers resort to graph-based timing analysis (GBA) on one or a few corners to carry the physical design
q Near design maturity, designers switch to PBA mode and check the remaining corners
q Signoff requires PBA timing checks in all corners
[Figure: design maturity vs. design time; timing checks use GBA in 1 PVT corner while the design matures, then PBA in 100s of (P,V,T) corners near the goal]
Reference: "RAPTA: A hierarchical representation learning solution for real-time prediction of path-based static timing analysis." GLSVLSI 2022
56. Representation Learning for STA
n But at this point, the damage is already done!
q Design cycle: GBA is pessimistic → the tool will over-fix → increased design iterations
q PPA penalty: GBA is pessimistic → PPA is traded for timing → lost optimization opportunities
q Corner blindness: the design is tracked in only a few corners → the other corners may surprise!
57. Representation Learning for STA
n Objective: prediction of static timing analysis results (PBA prediction)
n Constraint: no manual feature engineering
n Approach: representation learning + MLP
[Figure: a learning model, trained on earlier design outcomes with PBA, predicts PBA outcomes in corners 1 through 185 from the current design outcome with GBA]
58. Comparison to Prior Art
[Diagram: prior art extracts hand-crafted features from the timing engine (STA) for training; RAPTA instead feeds timing-engine data through representation learning before training]
59. Representation Learning for STA
[Figure: each stage of a timing path is a bigram of a gate and its driven net. Gate XYZ carries ~500 properties and net YUZ ~600, so bigram XYZ-YUZ carries ~1,100 property values. The sequence of bigrams along the data path feeds a chain of 3-unit LSTM cells.]
60. Representation Learning for STA
[Figure: three representation-learning blocks, each an LSTM chain, separately encode the data path, capture path, and launch path]
62. Representation Learning for STA
[Figure: model architecture. Launch-path, capture-path, and data-path features feed three representation-learning blocks; their outputs pass through fully connected layers with dropout to the PBA slack prediction (label). Per-path sub-label predictions (launch, capture, and data delay) are used only in the training phase and removed in the test phase.]
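The architecture above can be summarized structurally in a short sketch. A toy tanh-RNN stands in for the paper's LSTM blocks and all weights are fixed illustrative values, so this shows only the shape of the model (three path encoders feeding an MLP), not trained behavior.

```python
# Structural sketch of a RAPTA-style model: three sequence encoders (one
# per launch/capture/data path) compress per-stage bigram features into
# fixed-size representations, which a small MLP maps to a predicted PBA
# slack. The RNN and all weights below are illustrative stand-ins.

import math

def rnn_encode(seq, w_in=0.5, w_rec=0.3):
    """Fold a sequence of per-stage feature values into one hidden state."""
    h = 0.0
    for xval in seq:
        h = math.tanh(w_in * xval + w_rec * h)
    return h

def mlp(inputs, w=(1.0, -1.0, 0.5), bias=0.1):
    """Fully connected layer over the concatenated path representations."""
    return sum(wi * xi for wi, xi in zip(w, inputs)) + bias

def predict_pba_slack(launch_path, capture_path, data_path):
    reps = [rnn_encode(p) for p in (launch_path, capture_path, data_path)]
    return mlp(reps)

# Hypothetical per-stage features (e.g., normalized stage delays).
slack = predict_pba_slack([0.2, 0.1], [0.3], [0.5, 0.4, 0.6])
print(isinstance(slack, float))  # → True
```

In the paper, the per-path delay sub-labels give each encoder its own training signal before the final slack head; that auxiliary supervision is omitted here.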
63. Comparison to Prior Art
[Diagram: prior art extracts hand-crafted features from the timing engine (STA) for training; RAPTA instead feeds timing-engine data through representation learning before training]
64. Results (Route PBA from GBA)
[Figure: standard deviation (ps) of RAPTA's PBA prediction across voltage corners from 0.78 V to 1.05 V, for the Ethernet and S38417 benchmarks]
RAPTA's average train and test times on GPU. The reported test time is the time needed to generate PBA predictions for 10K timing paths. Training is done only once during the design cycle.