Challenges in the Static Timing Analysis of FPGAs
Tom Spyrou
TAU 2015
3/2015
Programmability: Where do FPGAs fit?
[Figure: device spectrum, in order: Intel CPU, TI DSP, MultiCore, ManyCore, GPU, FPGA, ASSP, ASIC. Axes: Flexibility, Programming Abstraction; Performance, Area and Power Efficiency.]
CPU:
• Market-agnostic
• Accessible to many programmers (C++)
• Flexible, portable
ASIC
• Market-specific
• Fewer programmers
• Rigid, less programmable
• Hard to build (physical)
FPGA:
• Somewhat Restricted Market
• Harder to Program (Verilog)
• More efficient than SW
• More expensive than ASIC
FPGA End Markets
• Digital Consumer
  • Entertainment: Broadband, Audio/video, Video display
  • Broadcast: Studio, Satellite, Broadcasting
• Communications
  • Wireless: Cellular, Basestations, Wireless LAN
  • Networking: Switches, Routers
  • Wireline: Optical, Metro, Access
• Computer and Storage
  • Computer: Servers, Mainframe
  • Storage: RAID, SAN
  • Office Automation: Copiers, Printers, MFP
• Industrial
  • Instrumentation: Medical, Test equipment, Manufacturing
  • Security/Energy Mgmt.: Card readers, Control systems, ATM
  • Auto: Navigation, Entertainment
  • Military: Secure comm., Radar, Guidance and control
FPGA User Programming Model
• User writes Verilog (or VHDL, or schematic)
• Quartus compiles the Verilog to a bitstream
  • Synthesis: Verilog -> Gates
  • Tech-Mapping: Gates -> Device-specific LUTs & FFs
  • Clustering: LUTs + FFs -> LAB clusters
  • Placement: LABs -> placed LABs with an (x,y) position
  • Routing: Abstract connections -> exact routing
  • STA: Timing evaluated vs. constraints
  • Assembly: Routing converted to bitstream
  • Programming: Bitstream downloaded onto FPGA
(More on this in the Software Flow Section)
FPGA CAD
• Map to LABs, not standard cells
• Routing is setting mux select line bits
// Begin: Write Control
always @ (posedge wrbusy_int)
begin
    write0 <= 1'b1;
    write1 <= 1'b0;
    writex <= 1'b0;
end
always @ (negedge wrbusy_int)
begin
    write0 <= 1'b0;
end
always @ (posedge write0_done)
begin
    write1 <= 1'b1;
    // (fragment is truncated on the slide at this point)
end
[Figure: Quartus II tool flow around the Quartus II Database (device features and timing information): Synthesis (3rd-party or Altera), Merge, Placement & Routing, Timing Analysis, Power, Assembler, Programmer, EDA, Simulator (3rd-party or Altera).]
What is FPGA Fabric – Logic Array Block
[Figure: Logic Array Block (LAB: X 20) with input muxing, logic cell, optional DFF, output muxing and secondary signals (CE, SLOAD, …), shown alongside hard blocks and the routing fabric.]
Bottom line: Quartus generates a configuration bitstream that sets the logic functions and the routing steering to instantiate one hardware design in the device.
FPGA Interconnect Model
[Figure: a routed connection between LAB (4,6) and LAB (12,9) over wires V4, H3 and H3, through routing muxes (VDIM, HDIM, LIM, LEIM) into LUT inputs A, B, C, D. CRAM programming values shown: 0xab0f, 0, 0x81, 0xf0, 0x14, 0x44, 0x24.]
• Wires are point-to-point
• Individual bits, not groups or word-wise
• Statically programmed by SW to establish the necessary connection
• No bus, protocol, etc. routing (unless built on top)
Unique Challenges in STA of FPGAs
• Fixed device with programmable LUTs, routing and various IP
• I would like to break down the challenges into categories
  • Verification of the un-programmed device
    • Many possible modes due to programmability
    • Delay calculation of non-CMOS structures like pass-gate muxes
  • Verification of a user’s compiled design
    • CRPR analysis can be very expensive
    • Large clock latency and skew, tree used versus mesh
    • Long combinational paths with lots of re-convergent logic
    • Slow logic that is still much faster than software on a CPU
  • Incremental moves affect function, not just the delay of instances
    • CRAM configuration constant changes
    • Mode changes
Unique Challenges in STA of FPGAs
• Periphery and core have different challenges
  • Programmable core logic implementing functions via look-up tables
  • Peripheral IP blocks performing programmable but less flexible tasks
    • SerDes, DSP, RAM, Arm core, etc.
  • Periphery blocks are often implemented with ASIC-style flows
  • Core is full custom with pass gates
    • Delay modelling and parasitic reduction are challenges
  • Both have challenges due to configurability
• I cannot cover all the challenges and will focus on 3
  • LUT modelling
  • Mode explosion: flat implementation with hierarchical modelling
  • Modelling pass-gate-based multiplexors
LUT Overview
• For the purposes of this tutorial, let’s assume we have a 3-LUT, i.e. 3 inputs on the select lines to select one of 8 bits driven by the CRAM.
• This 3-LUT can implement any logic function of 3 inputs by assigning appropriate values to the CRAM.
• We call the 8-bit value b[7:0] the LUTMASK.
[Figure: 3-LUT. Select inputs A, B, C; CRAM bits b0 to b7 feed the mux tree; output Y.]
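The 3-LUT can be sketched in a few lines of Python. The bit ordering is an assumption inferred from the examples in this deck: A is the least significant select bit, so the LUT outputs bit b[A + 2B + 4C] of the LUTMASK. The function name `lut3_eval` is illustrative only.

```python
def lut3_eval(lutmask: int, a: int, b: int, c: int) -> int:
    """Return output Y of a 3-LUT for select inputs a, b, c (each 0 or 1).

    Assumed ordering: A is the least significant select bit, so the
    selected CRAM bit is b[a + 2*b + 4*c].
    """
    index = a | (b << 1) | (c << 2)
    return (lutmask >> index) & 1


# LUTMASK values written as b[7:0], matching the examples in this deck.
AND_AB = 0b10001000   # Y = A & B
BUF_A = 0b10101010    # Y = A

if __name__ == "__main__":
    for c in (0, 1):
        for b in (0, 1):
            for a in (0, 1):
                print(a, b, c, lut3_eval(AND_AB, a, b, c))
```

Any of the 256 possible 3-input functions is obtained by changing only the LUTMASK constant; the evaluation code stays the same.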
Timing Arcs Dependency on LUTMASK
• The existence and delays of the timing arcs A=>Y, B=>Y and C=>Y depend on the LUTMASK.
• For example, if the bits are all 0s, then Y = 0 and there are no arcs from any of the inputs to the output. This is a degenerate case.
Timing Arcs Dependency on LUTMASK
• The existence and delays of the timing arcs A=>Y, B=>Y and C=>Y depend on the LUTMASK.
• For example, if the LUTMASK b[7:0] is 10001000 (as shown in the diagram), there is no arc for C=>Y. [This LUTMASK implements the logic function Y=A&B.]
• Unateness is a function of the LUTMASK
  • This configuration should have positive-unate arcs
  • Ignoring unateness will hurt fmax, but is not necessarily critical for early Quartus development
[Figure: 3-LUT with CRAM bits b0 to b7 = 0, 0, 0, 1, 0, 0, 0, 1.]
Timing Arcs Dependency on LUTMASK
• The existence and delays of the timing arcs A=>Y, B=>Y and C=>Y depend on the LUTMASK.
• Another example: if the LUTMASK b[7:0] is 10101010 (as shown in the diagram), there is no arc for B=>Y or C=>Y. [This LUTMASK implements the logic function Y=A.]
[Figure: 3-LUT with CRAM bits b0 to b7 = 0, 1, 0, 1, 0, 1, 0, 1.]
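The two examples can be checked mechanically. The sketch below is one way to do it, not the Quartus implementation. It assumes A is the least significant select bit, and it treats an input as having an arc when toggling that input alone changes the selected LUTMASK bit for at least one setting of the other inputs. The names `has_arc` and `arcs` are illustrative.

```python
def has_arc(lutmask: int, pin: int, n_inputs: int = 3) -> bool:
    """True if toggling input `pin` can change Y for some setting of the others."""
    step = 1 << pin
    for index in range(1 << n_inputs):
        if index & step:
            continue  # only visit indices where this input is 0
        y_when_0 = (lutmask >> index) & 1
        y_when_1 = (lutmask >> (index | step)) & 1
        if y_when_0 != y_when_1:
            return True
    return False


def arcs(lutmask: int, names: str = "ABC") -> list:
    """Names of the inputs that have a timing arc to Y."""
    return [n for i, n in enumerate(names) if has_arc(lutmask, i, len(names))]


if __name__ == "__main__":
    print(arcs(0b10001000))  # Y = A & B
    print(arcs(0b10101010))  # Y = A
    print(arcs(0b00000000))  # constant 0, degenerate case
```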
Enumerating Timing Arc Dependencies
• One method to identify all the arcs as a function of the LUTMASK is to enumerate all 256 LUTMASK possibilities along with their arc dependencies.
• This becomes infeasible with a 6-LUT, where there are 64 bits driven by CRAM, resulting in 2^64 enumerations.
• An alternative method is to notice the pattern of dependencies.
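For a 3-LUT the brute-force table is small enough to build directly. The sketch below enumerates all 256 LUTMASK values and records which inputs each one depends on, again assuming A is the least significant select bit. The helper name `support` and the string encoding of the result are illustrative choices.

```python
from collections import Counter


def support(lutmask: int, n_inputs: int = 3, names: str = "ABC") -> str:
    """Return the inputs Y depends on, e.g. 'AB' for Y = A & B."""
    pins = []
    for pin in range(n_inputs):
        step = 1 << pin
        for index in range(1 << n_inputs):
            if index & step:
                continue
            # Compare the two mask bits that differ only in this input.
            if ((lutmask >> index) ^ (lutmask >> (index | step))) & 1:
                pins.append(names[pin])
                break
    return "".join(pins)


# Full enumeration: feasible for 2**8 masks, hopeless for 2**64.
table = {mask: support(mask) for mask in range(256)}
counts = Counter(table.values())

if __name__ == "__main__":
    for deps, n in sorted(counts.items()):
        print(repr(deps), n)
    print("6-LUT masks:", 2 ** 64)
```

The per-mask check itself only compares 2^(n-1) pairs of bits per input, which is 32 pairs per input for a 6-LUT. That is why testing the dependency pattern on the one LUTMASK at hand scales where a precomputed table does not.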
Enumerating Timing Arc Dependencies
• A positive-unate arc A=>Y exists if, for any of the first-level muxes, the first data bit is 0 and the second data bit of the same mux is 1.
  • Formally: (!b0 && b1) || (!b2 && b3) || (!b4 && b5) || (!b6 && b7)
• A negative-unate arc A=>Y exists if, for any of the first-level muxes, the first data bit is 1 and the second data bit of the same mux is 0.
  • Formally: (b0 && !b1) || (b2 && !b3) || (b4 && !b5) || (b6 && !b7)
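The two conditions translate directly into code. The sketch below first writes out the A=>Y expressions exactly as given, then generalizes the same pairing idea to any input by comparing mask bits whose indices differ only in that input's select bit. The generalization and the names `a_unateness`, `pin_unateness` and `classify` are assumptions for illustration, not taken from Quartus.

```python
def classify(positive: bool, negative: bool) -> str:
    """Combine the two conditions into one arc description."""
    if positive and negative:
        return "non-unate"   # both senses occur, e.g. XOR
    if positive:
        return "positive"
    if negative:
        return "negative"
    return "none"            # no arc at all


def a_unateness(lutmask: int) -> str:
    """Unateness of A=>Y using the expressions from the slide."""
    b = [(lutmask >> i) & 1 for i in range(8)]
    positive = ((not b[0] and b[1]) or (not b[2] and b[3]) or
                (not b[4] and b[5]) or (not b[6] and b[7]))
    negative = ((b[0] and not b[1]) or (b[2] and not b[3]) or
                (b[4] and not b[5]) or (b[6] and not b[7]))
    return classify(bool(positive), bool(negative))


def pin_unateness(lutmask: int, pin: int, n_inputs: int = 3) -> str:
    """Same check for any input: pin 0 is A, pin 1 is B, pin 2 is C."""
    step = 1 << pin
    positive = negative = False
    for index in range(1 << n_inputs):
        if index & step:
            continue
        lo = (lutmask >> index) & 1            # Y with this input at 0
        hi = (lutmask >> (index | step)) & 1   # Y with this input at 1
        positive = positive or (lo == 0 and hi == 1)
        negative = negative or (lo == 1 and hi == 0)
    return classify(positive, negative)


if __name__ == "__main__":
    print(a_unateness(0b10001000))       # Y = A & B
    print(pin_unateness(0b01100110, 0))  # Y = A ^ B
```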
LUT timing is an instance of case analysis
• In ASIC-style STA, case analysis can be slow
  • It happens once and is not revisited during incremental timing
  • Symbolic simulation has acceptable runtime
• In FPGA timing, especially incremental timing, the evaluation has to be done on every netlist modification that affects logic
Modes can explode for complex blocks
• Imagine a large block with many modes
  • Mode-dependent timing is used to gain accuracy in STA
• This block is used by a parent block that has many cell modes of its own, and this continues up multiple levels
• The number of possible modes can explode, especially if automatic tools such as PrimeTime’s extract_model command are used to enumerate them
• Design teams want to do physical design at the highest possible level
• Timing modelling, which needs to avoid an explosion of modes, wants to build models at a lower level
• It is not uncommon for a complex block like a DSP to have 10K modes
• We have no problem building these models, but they can be slow in STA even when the STA has been tuned to handle many more modes than in ASIC flows.
• PrimeTime and other commercial tools build the graph for, and run delay calculation on, all modes simultaneously. No commercial tool can load and link Altera’s full chip.
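A back-of-the-envelope sketch shows why the level at which models are built matters. It assumes the worst case, in which every mode of a child block can combine with every mode of each enclosing level, so the counts multiply. The 10K figure comes from the text above; the parent-level counts of 8 and 4 are hypothetical.

```python
from math import prod


def worst_case_modes(modes_per_level):
    """Upper bound on combined modes if every level's modes are independent."""
    return prod(modes_per_level)


if __name__ == "__main__":
    dsp_only = worst_case_modes([10_000])
    # Hypothetical: a parent with 8 modes and a grandparent with 4 modes.
    three_levels = worst_case_modes([10_000, 8, 4])
    print(dsp_only, three_levels)
```

Real mode sets are usually constrained, so the true count is lower, but the multiplicative trend is what pushes model building toward a lower level of the hierarchy.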
Two possible approaches
• Goal is to build models one level lower in the Verilog hierarchy and provide a netlist of models to Quartus and PrimeTime
• Perform place and route hierarchically
  • More work for design teams
  • Less optimal results
• Use ICC’s hierarchical Verilog + flat SPEF to build a timing model below the top level
Hierarchical Place and Route / Extraction
• Perform place and route hierarchically
• Pros
  • SPEF is divided naturally by hierarchical P&R and extraction
  • Manual floorplan of the top level may improve QoR over automatic P&R
  • Run time of P&R for lower-level blocks will be dramatically faster, allowing more time for manual inspection and improvement of results
• Cons
  • Design engineer must manually floorplan the top level
  • Multiple runs of P&R and extraction to manage
  • Possible QoR degradation if the floorplan is poorly done
Model extraction one level lower
• Use ICC’s hierarchical Verilog + an extracted flat SPEF to build a timing model below the top level
• Pros
  • No change to construction flow
• Cons
  • Some loss of accuracy on boundary RC delay calculation
  • RC tree of boundary nets is turned into lumped R and lumped C
  • On the order of 5% of the delay of the final gate in the path to the output/input of the model
• Approach
  • Read hierarchical Verilog in PT + flat SPEF + SDC for top level
  • write_parasitics -format spef -nets [get_nets -hier sub_instance_name/*] for sub block
  • write_parasitics -format spef -nets [get_nets *] for top
  • characterize_context -environment -timing sub_instance_name
  • Avoid boundary nets in context with -no_boundary_annotations or no boundary nets in SPEF
  • Post-process SPEF to remove the prepended sub_instance_name from all names in the map
  • Restart pt_shell with current_design as sub_module
  • Load SPEF and environment context
  • extract_model
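The SPEF post-processing step can be sketched as a small script. This is a minimal illustration rather than the script used in the flow described here. It assumes the standard SPEF layout, where a `*NAME_MAP` section holds `*<index> <name>` entries and ends at the next section keyword, and it assumes `/` as the hierarchy divider. The function name and the instance name `u_sub` are hypothetical, and escaped divider characters are not handled.

```python
import re


def strip_instance_prefix(spef_text: str, instance: str, divider: str = "/") -> str:
    """Remove a leading '<instance><divider>' from every *NAME_MAP entry."""
    prefix = instance + divider
    out = []
    in_name_map = False
    for line in spef_text.splitlines():
        stripped = line.strip()
        if stripped == "*NAME_MAP":
            in_name_map = True
        elif in_name_map and stripped.startswith("*") and not re.match(r"\*\d+\s", stripped):
            in_name_map = False  # reached the next section, e.g. *PORTS or *D_NET
        elif in_name_map and stripped:
            key, name = stripped.split(None, 1)
            if name.startswith(prefix):
                name = name[len(prefix):]
            line = f"{key} {name}"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "*DIVIDER /",
        "*NAME_MAP",
        "*1 u_sub/n1",
        "*2 u_sub/u_a/n2",
        "*3 top_net",
        "",
        "*D_NET *1 0.5",
        "*END",
    ])
    print(strip_instance_prefix(sample, "u_sub"))
```

Only the name map is rewritten; net sections refer to names by index, so they can stay untouched.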
FIHM – Model Validation Flow
• Model comparison between:
  • Flat model (golden)
  • Hierarchical model (consuming molecule timing Liberty model)
• Parasitics for hierarchical model validation are generated by hacking the flat SPEF file:
  • Rename standard cells’ leaf pins to the molecule’s boundary pins.
  • Zero R and C for nets connected within a molecule block.
  • Parasitics are extracted from the flat SPEF only for nets connected to top-level elements or output ports.
Correlation Results
• Testcase: mm_core_digital
• Total timing paths = 1444
• 60 timing paths are pessimistic by more than 20 ps compared to the flat model (4% of distribution)
• 14 timing paths are optimistic by more than 20 ps compared to the flat model (1% of distribution)
• 95% of total paths agree within ±20 ps
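The percentages follow from the path counts, as the short check below shows. The `classify` helper illustrates how a per-path comparison might be binned; its sign convention (a hierarchical-model slack lower than the flat slack counts as pessimistic) is an assumption, since the slides do not define it.

```python
def classify(flat_slack_ps: float, model_slack_ps: float, tol_ps: float = 20.0) -> str:
    """Bin one path by how the hierarchical model compares with the flat run."""
    diff = model_slack_ps - flat_slack_ps
    if diff < -tol_ps:
        return "pessimistic"   # model reports less slack than flat
    if diff > tol_ps:
        return "optimistic"    # model reports more slack than flat
    return "within"


def correlation_summary(total: int, pessimistic: int, optimistic: int) -> dict:
    """Percentages of paths outside and inside the tolerance band."""
    within = total - pessimistic - optimistic
    return {
        "within": within,
        "pessimistic_pct": 100.0 * pessimistic / total,
        "optimistic_pct": 100.0 * optimistic / total,
        "within_pct": 100.0 * within / total,
    }


if __name__ == "__main__":
    # Counts taken from the slide: 1444 paths, 60 pessimistic, 14 optimistic.
    for key, value in correlation_summary(1444, 60, 14).items():
        print(key, round(value, 1))
```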
N-MOS gate multi-stage Multiplexors
• Multiplexors are pervasive in an FPGA
• They are designed using NMOS pass gates to save area
• This causes a timing model challenge
  • The input pin capacitance changes with each select line configuration
  • Think of the mux as a set of switches
  • The output load is seen on the input
• Usual use of Liberty assumes a fixed input capacitance or fixed receiver model
• Quartus compiler uses fast SPICE but we want a model for PrimeTime as well
N-MOS gate multi-stage Multiplexors
• Select line enabled by CRAM
• 2-stage one-hot mux
• Input cap varies depending on the path taken and on whether other side loads’ select lines are on or off
• Each possible path through the multi-stage mux requires its own pin cap
  • Arc-specific receiver model
  • This is part of the CCS noise model
• It would be nice if there were a more natural way to support arc- and mode-specific pin caps in Liberty
[Figure: 2-stage one-hot pass-gate mux; label: Other NMOS inputs.]
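A toy model makes the configuration dependence concrete. It treats each NMOS pass gate as an ideal switch, so whatever hangs behind a closed switch is added to the capacitance seen at the input, and anything behind an open switch is hidden. The class, the function and every capacitance value are hypothetical and chosen only for illustration; a real model would come from SPICE characterization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStageMux:
    """Toy 2-stage one-hot pass-gate mux (all capacitances hypothetical, in fF)."""
    c_junction: float  # cap of one pass-transistor terminal
    c_internal: float  # wiring cap of the node between the two stages
    fanin1: int        # inputs per first-stage group
    fanin2: int        # number of first-stage groups feeding stage 2


def input_pin_cap(mux: TwoStageMux, stage1_on: bool, stage2_on: bool, c_load: float) -> float:
    """Capacitance seen at one data input for a given select configuration."""
    cap = mux.c_junction                 # the input's own transistor terminal
    if not stage1_on:
        return cap                       # open switch hides everything behind it
    # Internal node: far terminals of all stage-1 switches, wiring,
    # and the near terminal of the stage-2 switch.
    cap += mux.fanin1 * mux.c_junction + mux.c_internal + mux.c_junction
    if not stage2_on:
        return cap
    # Output node: far terminals of all stage-2 switches plus the external load.
    cap += mux.fanin2 * mux.c_junction + c_load
    return cap


if __name__ == "__main__":
    mux = TwoStageMux(c_junction=1.0, c_internal=2.0, fanin1=4, fanin2=4)
    for s1, s2 in [(False, False), (True, False), (True, True)]:
        print(s1, s2, input_pin_cap(mux, s1, s2, c_load=10.0))
```

Even in this simplified form the same pin presents three different capacitances, which is the behaviour a single fixed Liberty pin capacitance cannot express.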
Incentive for EDA Companies to help
• As each process generation becomes more complex, the number of unique chip starts decreases.
  • Already down from 12K to fewer than 3K per year
• Each chip that is designed will be increasingly hyper-optimized.
  • Custom tricks that need to be modeled at the gate level
• FPGA use is increasing as the ability to run 1 GHz designs at reasonable power approaches
• FPGA compilers will not be able to model every effect
  • ICC needs PrimeTime
  • Encounter needs ETS
• Eventually FPGA compilers may need to output their programmed CRAM bits as constants and do a Super-Signoff in commercial STA tools
  • This could be a good possibility for market growth

More Related Content

What's hot

Clock mesh sizing slides
Clock mesh sizing slidesClock mesh sizing slides
Clock mesh sizing slidesRajesh M
 
11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond MooreRCCSRENKEI
 
An open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresAn open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresRISC-V International
 
Fpga video capturing
Fpga video capturingFpga video capturing
Fpga video capturingshehryar88
 
Digital standard cell library Design flow
Digital standard cell library Design flowDigital standard cell library Design flow
Digital standard cell library Design flowijsrd.com
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
Resume analog
Resume analogResume analog
Resume analogtarora1
 
Resume mixed signal
Resume mixed signalResume mixed signal
Resume mixed signaltarora1
 
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGALOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGAVLSICS Design
 
Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...
Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...
Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...VLSI SYSTEM Design
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...Shinya Takamaeda-Y
 
Design and analysis of optimized CORDIC based GMSK system on FPGA platform
Design and analysis of optimized CORDIC based  GMSK system on FPGA platform Design and analysis of optimized CORDIC based  GMSK system on FPGA platform
Design and analysis of optimized CORDIC based GMSK system on FPGA platform IJECEIAES
 
Bharat gargi final project report
Bharat gargi final project reportBharat gargi final project report
Bharat gargi final project reportBharat Biyani
 

What's hot (20)

3D-DRESD FT
3D-DRESD FT3D-DRESD FT
3D-DRESD FT
 
Clock mesh sizing slides
Clock mesh sizing slidesClock mesh sizing slides
Clock mesh sizing slides
 
11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore11 Synchoricity as the basis for going Beyond Moore
11 Synchoricity as the basis for going Beyond Moore
 
Thesis Donato Slides EN
Thesis Donato Slides ENThesis Donato Slides EN
Thesis Donato Slides EN
 
GPU Design on FPGA
GPU Design on FPGAGPU Design on FPGA
GPU Design on FPGA
 
An open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V coresAn open flow for dn ns on ultra low-power RISC-V cores
An open flow for dn ns on ultra low-power RISC-V cores
 
Fpga video capturing
Fpga video capturingFpga video capturing
Fpga video capturing
 
DSP Processors versus ASICs
DSP Processors versus ASICsDSP Processors versus ASICs
DSP Processors versus ASICs
 
Digital standard cell library Design flow
Digital standard cell library Design flowDigital standard cell library Design flow
Digital standard cell library Design flow
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
Resume analog
Resume analogResume analog
Resume analog
 
Resume mixed signal
Resume mixed signalResume mixed signal
Resume mixed signal
 
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGALOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
LOGIC OPTIMIZATION USING TECHNOLOGY INDEPENDENT MUX BASED ADDERS IN FPGA
 
PPU_PNSS-1_ICS-2014
PPU_PNSS-1_ICS-2014PPU_PNSS-1_ICS-2014
PPU_PNSS-1_ICS-2014
 
Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...
Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...
Define Width and Height of Core and Die (http://www.vlsisystemdesign.com/PD-F...
 
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
A High Performance Heterogeneous FPGA-based Accelerator with PyCoRAM (Runner ...
 
ate_full_paper
ate_full_paperate_full_paper
ate_full_paper
 
Design and analysis of optimized CORDIC based GMSK system on FPGA platform
Design and analysis of optimized CORDIC based  GMSK system on FPGA platform Design and analysis of optimized CORDIC based  GMSK system on FPGA platform
Design and analysis of optimized CORDIC based GMSK system on FPGA platform
 
Bharat gargi final project report
Bharat gargi final project reportBharat gargi final project report
Bharat gargi final project report
 
Lc3519051910
Lc3519051910Lc3519051910
Lc3519051910
 

Viewers also liked

Today's FPGA Ecosystem - Neeraj Varma, Xilinx
Today's FPGA Ecosystem - Neeraj Varma, XilinxToday's FPGA Ecosystem - Neeraj Varma, Xilinx
Today's FPGA Ecosystem - Neeraj Varma, XilinxFPGA Central
 
Static Timing Analysis
Static Timing AnalysisStatic Timing Analysis
Static Timing Analysisshobhan pujari
 
Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...
Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...
Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...IMTC
 
Roman Korkikyan - Timing analysis workshop Part 2 Practice
Roman Korkikyan - Timing analysis workshop Part 2 PracticeRoman Korkikyan - Timing analysis workshop Part 2 Practice
Roman Korkikyan - Timing analysis workshop Part 2 PracticeDefconRussia
 
timing-analysis
 timing-analysis timing-analysis
timing-analysisVimal Raj
 
Intro to Bitcoin
Intro to BitcoinIntro to Bitcoin
Intro to BitcoinRon Gross
 
Timing Analysis
Timing AnalysisTiming Analysis
Timing Analysisrchovatiya
 
Using Xeon + FPGA for Accelerating HPC Workloads
Using Xeon + FPGA for Accelerating HPC WorkloadsUsing Xeon + FPGA for Accelerating HPC Workloads
Using Xeon + FPGA for Accelerating HPC Workloadsinside-BigData.com
 
Announcing Amazon EC2 F1 Instances with Custom FPGAs
Announcing Amazon EC2 F1 Instances with Custom FPGAsAnnouncing Amazon EC2 F1 Instances with Custom FPGAs
Announcing Amazon EC2 F1 Instances with Custom FPGAsAmazon Web Services
 
Physical design-complete
Physical design-completePhysical design-complete
Physical design-completeMurali Rai
 
Top 10 verification engineer interview questions and answers
Top 10 verification engineer interview questions and answersTop 10 verification engineer interview questions and answers
Top 10 verification engineer interview questions and answerstonychoper2706
 
Physical design
Physical design Physical design
Physical design Mantra VLSI
 
Vlsi physical design-notes
Vlsi physical design-notesVlsi physical design-notes
Vlsi physical design-notesDr.YNM
 

Viewers also liked (20)

Today's FPGA Ecosystem - Neeraj Varma, Xilinx
Today's FPGA Ecosystem - Neeraj Varma, XilinxToday's FPGA Ecosystem - Neeraj Varma, Xilinx
Today's FPGA Ecosystem - Neeraj Varma, Xilinx
 
Static Timing Analysis
Static Timing AnalysisStatic Timing Analysis
Static Timing Analysis
 
ASIC vs SOC vs FPGA
ASIC  vs SOC  vs FPGAASIC  vs SOC  vs FPGA
ASIC vs SOC vs FPGA
 
Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...
Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...
Development of a 4K Main 10 Profile HEVC Encoder for Great Improvements in Co...
 
Roman Korkikyan - Timing analysis workshop Part 2 Practice
Roman Korkikyan - Timing analysis workshop Part 2 PracticeRoman Korkikyan - Timing analysis workshop Part 2 Practice
Roman Korkikyan - Timing analysis workshop Part 2 Practice
 
Major project iii 3
Major project  iii  3Major project  iii  3
Major project iii 3
 
Zynq architecture
Zynq architectureZynq architecture
Zynq architecture
 
timing-analysis
 timing-analysis timing-analysis
timing-analysis
 
Clock Distribution
Clock DistributionClock Distribution
Clock Distribution
 
Intro to Bitcoin
Intro to BitcoinIntro to Bitcoin
Intro to Bitcoin
 
Timing Analysis
Timing AnalysisTiming Analysis
Timing Analysis
 
Using Xeon + FPGA for Accelerating HPC Workloads
Using Xeon + FPGA for Accelerating HPC WorkloadsUsing Xeon + FPGA for Accelerating HPC Workloads
Using Xeon + FPGA for Accelerating HPC Workloads
 
FIFOPt
FIFOPtFIFOPt
FIFOPt
 
Synthesis
SynthesisSynthesis
Synthesis
 
Announcing Amazon EC2 F1 Instances with Custom FPGAs
Announcing Amazon EC2 F1 Instances with Custom FPGAsAnnouncing Amazon EC2 F1 Instances with Custom FPGAs
Announcing Amazon EC2 F1 Instances with Custom FPGAs
 
Physical design-complete
Physical design-completePhysical design-complete
Physical design-complete
 
Top 10 verification engineer interview questions and answers
Top 10 verification engineer interview questions and answersTop 10 verification engineer interview questions and answers
Top 10 verification engineer interview questions and answers
 
Physical design
Physical design Physical design
Physical design
 
Vlsi physical design-notes
Vlsi physical design-notesVlsi physical design-notes
Vlsi physical design-notes
 
Introduction Bitcoin
Introduction BitcoinIntroduction Bitcoin
Introduction Bitcoin
 

Similar to tau 2015 spyrou fpga timing

Short.course.introduction.to.vhdl
Short.course.introduction.to.vhdlShort.course.introduction.to.vhdl
Short.course.introduction.to.vhdlRavi Sony
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track fAlona Gradman
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensAlona Gradman
 
ece260project.doc
ece260project.docece260project.doc
ece260project.docFanyu Yang
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Ravi Sony
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdfFrangoCamila
 
Leakage power optimization for ripple carry adder
Leakage power optimization for ripple carry adder Leakage power optimization for ripple carry adder
Leakage power optimization for ripple carry adder NAVEEN TOKAS
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
 
Grasp the Critical Issues for a Functioning JESD204B Interface
Grasp the Critical Issues for a Functioning JESD204B InterfaceGrasp the Critical Issues for a Functioning JESD204B Interface
Grasp the Critical Issues for a Functioning JESD204B InterfaceAnalog Devices, Inc.
 
unit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxunit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxKandavelEee
 
computer architecture
computer architecture computer architecture
computer architecture Dr.Umadevi V
 
Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...
Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...
Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...abdenour boussioud
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armPrashant Ahire
 
IEEE Paper A SystemC AMS Model of an I2C Bus Controller
IEEE Paper A SystemC AMS Model  of an I2C Bus ControllerIEEE Paper A SystemC AMS Model  of an I2C Bus Controller
IEEE Paper A SystemC AMS Model of an I2C Bus ControllerDweapons Art
 
24-02-18 Rejender pratap.pdf
24-02-18 Rejender pratap.pdf24-02-18 Rejender pratap.pdf
24-02-18 Rejender pratap.pdfFrangoCamila
 

Similar to tau 2015 spyrou fpga timing (20)

Short.course.introduction.to.vhdl
Short.course.introduction.to.vhdlShort.course.introduction.to.vhdl
Short.course.introduction.to.vhdl
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track f
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert Goossens
 
ece260project.doc
ece260project.docece260project.doc
ece260project.doc
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdf
 
Introduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSPIntroduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSP
 
Lec02
Lec02Lec02
Lec02
 
Leakage power optimization for ripple carry adder
Leakage power optimization for ripple carry adder Leakage power optimization for ripple carry adder
Leakage power optimization for ripple carry adder
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
 
Verilog
VerilogVerilog
Verilog
 
Grasp the Critical Issues for a Functioning JESD204B Interface
Grasp the Critical Issues for a Functioning JESD204B InterfaceGrasp the Critical Issues for a Functioning JESD204B Interface
Grasp the Critical Issues for a Functioning JESD204B Interface
 
unit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptxunit 1ARM INTRODUCTION.pptx
unit 1ARM INTRODUCTION.pptx
 
computer architecture
computer architecture computer architecture
computer architecture
 
Vlsi
VlsiVlsi
Vlsi
 
Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...
Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...
Best practices for catalyst 4500 4000, 5500-5000, and 6500-6000 series switch...
 
Onnc intro
Onnc introOnnc intro
Onnc intro
 
Arm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_armArm cortex-m3 by-joe_bungo_arm
Arm cortex-m3 by-joe_bungo_arm
 
IEEE Paper A SystemC AMS Model of an I2C Bus Controller
IEEE Paper A SystemC AMS Model  of an I2C Bus ControllerIEEE Paper A SystemC AMS Model  of an I2C Bus Controller
IEEE Paper A SystemC AMS Model of an I2C Bus Controller
 
24-02-18 Rejender pratap.pdf
24-02-18 Rejender pratap.pdf24-02-18 Rejender pratap.pdf
24-02-18 Rejender pratap.pdf
 

tau 2015 spyrou fpga timing

  • 1. Challenges in the Static Timing Analysis of FPGA’s Tom Spyrou TAU 2015 3/2015
  • 2. Programmability: Where do FPGA’s fit? 2 Intel CPU TI DSP MultiCore ManyCore GPU FPGA ASSP ASIC Flexibility, Programming Abstraction Performance, Area and Power Efficiency CPU: • Market-agnostic • Accessible to many programmers (C++) • Flexible, portable ASIC • Market-specific • Fewer programmers • Rigid, less programmable • Hard to build (physical) FPGA: • Somewhat Restricted Market • Harder to Program (Verilog) • More efficient than SW • More expensive than ASIC
  • 3. 3 / 61 FPGA End Markets Entertainment Broadcast Broadband Audio/video Video display Studio Satellite Broadcasting Wireless Networking Wireline Cellular Basestations Wireless LAN Switches Routers Optical Metro Access Computer Storage Office Automation Servers Mainframe RAID SAN Copiers Printers MFP Instrumentation Security/ Energy Mgmt. Auto Medical Test equipment Manufacturing Card readers Control systems ATM Navigation Entertainment Military Secure comm. Radar Guidance and control Computer and Storage Communications IndustrialDigital Consumer
  • 4. FPGA User Programming Model  User writes Verilog (or VHDL, or schematic)  Quartus compiles the Verilog to a bitstream  Synthesis: Verilog -> Gates  Tech-Mapping: Gates -> Device-specific LUTs & FF  Clustering: LUTs+FF -> LAB clusters  Placement: LABs –> placed LABS with an (x,y) position  Routing: Abstract connections -> exact routing  STA: Timing evaluated vs. constraints  Assembly: Routing converted to bitstream  Programming: Bitstream downloaded onto FPGA (More on this in the Software Flow Section) 4
  • 5. 5 FPGA CAD Map to LAB’s not standard cells Routing is setting mux select line bits // Begin: Write Control always @ (posedge wrbusy_int) begin write0 <= 1'b1; write1 <= 1'b0; writex <= 1'b0; end always @ (negedge wrbusy_int) begin write0 <= 1'b0; end always @ (posedge write0_done) begin write1 <= 1'b1; // Begin: Write Control always @ (posedge wrbusy_int) begin write0 <= 1'b1; write1 <= 1'b0; writex <= 1'b0; end always @ (negedge wrbusy_int) begin write0 <= 1'b0; end always @ (posedge write0_done) begin write1 <= 1'b1; // Begin: Write Control always @ (posedge wrbusy_int) begin write0 <= 1'b1; write1 <= 1'b0; writex <= 1'b0; end always @ (negedge wrbusy_int) begin write0 <= 1'b0; end always @ (posedge write0_done) begin write1 <= 1'b1; Quartus II Database Device features and timing information Merge Programmer Timing Analysis Placement & Routing Power Assembler Simulator 3-rd Party or Altera EDA Synthesis 3-rd Party or Altera
  • 6. What is FPGA Fabric – Logic Array Block 6 Input Muxing Logic Cell Optional DFF Output Muxing Bottom line: Quartus generates a configuration bitstream which sets the logic functions, and routing steering to instantiate one hardware design into the device. LAB: X 20 ® ® ® ® ® Hard-Blocks Routing Fabric
  • 8. FPGA Interconnect Model 8 0xab0f 0 VDIM HDIM HDIM LIM LEIM A B C D 0x81 0xf0 0x14 0x44 0x24 CRAM Programming LAB (4,6) LAB (12,9) V4 H3 H3  Wires are point-to-point  Individual bits, not groups or word-wise  Statically programmed by SW to establish the necessary connection  No bus, protocol, etc. routing (unless built on top)
  • 9. Unique Challenges in STA of FPGA’s  Fixed device with programmable LUTS, Routing and various IP  I would like to break down the challenges into categories  Verification of the un-programmed device  Many possible modes due to programmability  Delay Calculation of non-CMOS structures like pass gate muxes  Verification of a user’s compiled design  CRPR analysis can be very expensive  Large clock latency and skew, tree used versus mesh  Long combinational paths with lots of re-convergent logic  Slow logic that is still much faster than software on a CPU  Incremental moves affect function not just delay of instances  CRAM configuration constant changes  Mode changes 9
  • 10. Unique Challenges in STA of FPGAs
      - Periphery and core have different challenges
          - Programmable core logic implementing functions via look-up tables
          - Peripheral IP blocks performing programmable but less flexible tasks
              - SerDes, DSP, RAM, ARM core, etc.
          - Periphery blocks often implemented with ASIC-style flows
          - Core is full custom with pass gates
              - Delay modelling and parasitic reduction are challenges
          - Both have challenges due to configurability
      - I cannot cover all the challenges and will focus on 3
          - LUT modelling
          - Mode explosion: flat implementation with hierarchical modelling
          - Modelling pass-gate-based multiplexors
  • 11. LUT Overview
      - For the purposes of this tutorial, let's assume we have a 3-LUT, i.e. 3 inputs on the select lines that select one of 8 bits driven by the CRAM.
      - This 3-LUT can implement any logic function of 3 inputs by assigning appropriate values to the CRAM.
      - We call the 8-bit value b[7:0] the LUTMASK.
      - [Figure: 3-LUT with select inputs A, B, C, output Y, and CRAM bits b0 to b7]
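A minimal sketch of the 3-LUT described on this slide, written in Python for illustration. The function names `lut_eval` and `make_lutmask` are hypothetical, and the choice of A as the least significant select bit is an assumption (it matches the later slides, where A selects between b0 and b1).

```python
from typing import Callable


def lut_eval(lutmask: int, a: int, b: int, c: int) -> int:
    """Return Y for a 3-LUT: the inputs select one of the 8 CRAM bits b[7:0]."""
    # Assumption: A is the least significant select bit, C the most significant.
    index = (c << 2) | (b << 1) | a
    return (lutmask >> index) & 1


def make_lutmask(fn: Callable[[int, int, int], int]) -> int:
    """Build the LUTMASK that makes the 3-LUT implement fn(a, b, c)."""
    mask = 0
    for index in range(8):
        a, b, c = index & 1, (index >> 1) & 1, (index >> 2) & 1
        if fn(a, b, c):
            mask |= 1 << index
    return mask


# Example: a 3-input majority function needs b3, b5, b6 and b7 set.
majority = make_lutmask(lambda a, b, c: int(a + b + c >= 2))
print(f"{majority:08b}")  # prints 11101000
```

Any 3-input function can be mapped this way, which is why the LUTMASK alone decides what the cell does.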
  • 12. Timing Arcs Dependency on LUTMASK
      - The existence and delays of the timing arcs A=>Y, B=>Y, and C=>Y depend on the LUTMASK.
      - For example, if the bits are all 0s, then Y = 0 and there are no arcs from any of the inputs to the output. This is a degenerate case.
      - [Figure: 3-LUT with select inputs A, B, C, output Y, and CRAM bits b0 to b7]
  • 13. Timing Arcs Dependency on LUTMASK
      - The existence and delays of the timing arcs A=>Y, B=>Y, and C=>Y depend on the LUTMASK.
      - For example, if b[7:0] = 10001000 (as shown in the diagram), there is no arc for C=>Y. [This LUTMASK implements the logic function Y=A&B.]
      - Unateness is a function of the LUTMASK
          - This configuration should have positive-unate arcs
          - Ignoring unateness will hurt fmax, but is not necessarily critical for early Quartus development
      - [Figure: 3-LUT with CRAM bits b0 to b7 = 0 0 0 1 0 0 0 1]
  • 14. Timing Arcs Dependency on LUTMASK
      - The existence and delays of the timing arcs A=>Y, B=>Y, and C=>Y depend on the LUTMASK.
      - Another example: if b[7:0] = 10101010 (as shown in the diagram), there is no arc for B=>Y or C=>Y. [This LUTMASK implements the logic function Y=A.]
      - [Figure: 3-LUT with CRAM bits b0 to b7 = 0 1 0 1 0 1 0 1]
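The three LUTMASK examples above (all 0s, 10001000, 10101010) can be checked with a short Python sketch. The helper `has_arc` is a hypothetical name; it assumes input A is select bit 0, B is bit 1 and C is bit 2, and it treats an arc as existing whenever toggling that input can change Y for some setting of the other inputs.

```python
def has_arc(lutmask: int, input_pos: int, k: int = 3) -> bool:
    """True if toggling the given input can change Y for some other-input state."""
    bit = 1 << input_pos
    for index in range(1 << k):
        if index & bit:
            continue  # visit each (input=0, input=1) pair once
        y_low = (lutmask >> index) & 1
        y_high = (lutmask >> (index | bit)) & 1
        if y_low != y_high:
            return True
    return False


def arcs(lutmask: int) -> str:
    """List the inputs (A, B, C) that have a timing arc to Y."""
    return "".join(name for pos, name in enumerate("ABC") if has_arc(lutmask, pos))


print(repr(arcs(0b00000000)))  # prints ''   (Y = 0, degenerate case)
print(repr(arcs(0b10001000)))  # prints 'AB' (Y = A & B, no C=>Y arc)
print(repr(arcs(0b10101010)))  # prints 'A'  (Y = A, no B=>Y or C=>Y arc)
```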
  • 15. Enumerating Timing Arc Dependencies
      - One method to identify all the arcs as a function of the LUTMASK is to enumerate all 256 LUTMASK possibilities along with the arc dependencies.
      - This becomes infeasible with a 6-LUT, where there are 64 bits driven by CRAM, resulting in 2^64 enumerations.
      - The alternative is to notice the pattern of dependencies.
      - [Figure: 3-LUT with CRAM bits b0 to b7 = 0 1 0 1 0 1 0 1]
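The counts quoted on this slide follow from a k-input LUT having 2^k CRAM bits, and therefore 2^(2^k) possible LUTMASKs. A small Python check (the function name `lutmask_count` is hypothetical):

```python
def lutmask_count(k: int) -> int:
    """Number of distinct LUTMASKs for a k-input LUT with 2**k CRAM bits."""
    return 2 ** (2 ** k)


print(lutmask_count(3))  # prints 256
print(lutmask_count(6))  # prints 18446744073709551616 (2^64)
```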
  • 16. Enumerating Timing Arc Dependencies
      - A positive-unate arc for A=>Y exists if, for any of the first-level muxes, the first data bit is 0 and the second data bit of the same mux is 1.
          - Formally, it may be written as: (!b0 && b1) || (!b2 && b3) || (!b4 && b5) || (!b6 && b7)
      - A negative-unate arc for A=>Y exists if, for any of the first-level muxes, the first data bit is 1 and the second data bit of the same mux is 0.
          - Formally, it may be written as: (b0 && !b1) || (b2 && !b3) || (b4 && !b5) || (b6 && !b7)
      - [Figure: 3-LUT with CRAM bits b0 to b7 = 0 1 0 1 0 1 0 1]
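The two formulas for A=>Y generalize to any input by pairing each CRAM bit where that input is 0 with the bit where it is 1. The Python sketch below is one way to write that; `arc_senses` is a hypothetical name, and the select-bit ordering (A = bit 0) is an assumption consistent with the formulas on this slide. For a 6-LUT this checks 32 bit pairs per input instead of enumerating LUTMASKs.

```python
def arc_senses(lutmask: int, input_pos: int, k: int = 3) -> tuple[bool, bool]:
    """Return (positive_unate_arc, negative_unate_arc) for one LUT input."""
    bit = 1 << input_pos
    positive = negative = False
    for index in range(1 << k):
        if index & bit:
            continue  # each pair is visited once, from its input=0 side
        low = (lutmask >> index) & 1            # CRAM bit selected when input = 0
        high = (lutmask >> (index | bit)) & 1   # CRAM bit selected when input = 1
        if not low and high:
            positive = True   # rising input can cause rising Y
        if low and not high:
            negative = True   # rising input can cause falling Y
    return positive, negative


print(arc_senses(0b10101010, 0))  # Y = A:     prints (True, False)
print(arc_senses(0b10001000, 1))  # Y = A & B: prints (True, False) for B
print(arc_senses(0b01100110, 0))  # Y = A ^ B: prints (True, True), non-unate
```

When both flags are true the arc is non-unate, and when both are false there is no arc from that input at all.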
  • 17. LUT timing is an instance of case analysis
      - In ASIC-style STA, case analysis can be slow
          - Happens once and is not revisited during incremental timing
          - Symbolic simulation has acceptable runtime
      - In FPGA timing, especially incremental timing, the evaluation has to be done on every netlist modification that affects logic
  • 18. Modes can explode for complex blocks
      - Imagine a large block with many modes
          - Mode-dependent timing is used to gain accuracy in STA
      - This block is used by a parent block with many cell modes, continuing up multiple levels
      - The number of possible modes can explode, especially if automatic tools such as PrimeTime's extract_model command are used to enumerate them
      - Design teams want to do physical design at the highest possible level
      - Timing modelling, which needs to avoid an explosion of modes, wants to build models at a lower level
      - It is not uncommon to have a complex block like a DSP with 10K modes
      - We have no problem building these models, but they can be slow in STA even when the STA has been tuned to handle many more modes than in ASIC flows.
      - PrimeTime and other commercial tools simultaneously build the graph for, and delay-calculate, all modes. No commercial tool can load and link Altera's full chip.
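To see why building models higher in the hierarchy makes the mode count explode, here is a rough Python sketch. It assumes every child instance's mode can be chosen independently of the others, which gives an upper bound; the block structure and all mode counts below are hypothetical, not taken from the talk.

```python
from math import prod


def combined_modes(own_modes: int, child_mode_counts: list[int]) -> int:
    """Upper bound on modes of a block, assuming independent child modes."""
    return own_modes * prod(child_mode_counts)


# Hypothetical three-level hierarchy.
leaf = 10                                # modes of a leaf cell
mid = combined_modes(2, [leaf] * 3)      # 2 own modes, 3 leaf instances
top = combined_modes(5, [mid] * 2)       # 5 own modes, 2 mid-level instances

print(leaf, mid, top)  # prints 10 2000 20000000
```

Modelling at the leaf level keeps each model at 10 modes, while a single model of the top block would in the worst case have to cover 20 million combinations.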
  • 19. Two possible approaches
      - Goal is to build models one level lower in the Verilog hierarchy and provide a netlist of models to Quartus and PrimeTime
      - Perform place and route hierarchically
          - More work for design teams
          - Less optimal results
      - Use ICC's hierarchical Verilog + flat SPEF to build a timing model below the top level
  • 20. Hierarchical Place and Route / Extraction
      - Perform place and route hierarchically
      - Pros
          - SPEF is divided naturally by hierarchical P&R and extraction
          - Manual floorplan of the top level may improve QoR over automatic P&R
          - Run time of P&R for lower-level blocks will be dramatically shorter, allowing more time for manual inspection and improvement of results
      - Cons
          - Design engineer must manually floorplan the top level
          - Multiple runs of P&R and extraction to manage
          - Possible QoR degradation if the floorplan is poorly done
  • 21. Model extraction one level lower
      - Use ICC's hierarchical Verilog + an extracted flat SPEF to build a timing model below the top level
      - Pros
          - No change to construction flow
      - Cons
          - Some loss of accuracy on boundary RC delay calculation
          - RC tree of boundary nets turned into lumped R and lumped C
          - On the order of 5% of the final gate delay in the path to the output/input of the model
      - Approach
          - Read hierarchical Verilog in PT + flat SPEF + SDC for top level
          - write_parasitics -format spef -nets [get_nets -hier sub_instance_name/*] for sub block
          - write_parasitics -format spef -nets [get_nets *] for top
          - characterize_context -environment -timing sub_instance_name
          - Avoid boundary nets in context with -no_boundary_annotations or no boundary nets in SPEF
          - Post-process SPEF to remove prepended sub_instance_name from all names in map
          - Restart pt_shell with current_design as sub_module
          - Load SPEF and environment context
          - extract_model
  • 22. FIHM – Model Validation Flow
      - Model comparison between:
          - Flat model (golden)
          - Hierarchical model (consuming molecule timing Liberty model)
      - Parasitics for hierarchical model validation generated by hacking the flat SPEF file:
          - Rename standard cells' leaf pins to the molecule's boundary pins.
          - Zero R and C for nets connected within a molecule block.
          - Parasitics only extracted from the flat SPEF for the nets connected to top-level elements or output ports.
  • 23. Correlation Results
      - Testcase: mm_core_digital
      - Total timing paths = 1444
      - 60 timing paths are pessimistic by more than 20 ps compared to the flat model (4% of distribution)
      - 14 timing paths are optimistic by more than 20 ps compared to the flat model (1% of distribution)
      - 95% of total paths agree within ±20 ps
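The percentages on this slide can be reproduced from the path counts. The Python check below assumes the pessimistic and optimistic sets are disjoint and that every remaining path falls within ±20 ps; the variable names are for illustration only.

```python
total_paths = 1444
pessimistic = 60   # more than 20 ps pessimistic vs. the flat model
optimistic = 14    # more than 20 ps optimistic vs. the flat model

within_20ps = total_paths - pessimistic - optimistic

pct_pessimistic = 100 * pessimistic / total_paths
pct_optimistic = 100 * optimistic / total_paths
pct_within = 100 * within_20ps / total_paths

print(within_20ps)  # prints 1370
print(f"{pct_pessimistic:.1f}% {pct_optimistic:.1f}% {pct_within:.1f}%")
```

The three shares come out at roughly 4%, 1% and 95%, matching the rounded figures quoted above.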
  • 25. N-MOS gate multi-stage Multiplexors
      - Multiplexors are pervasive in an FPGA
      - They are designed using NMOS pass gates to save area
      - This causes a timing model challenge
          - The input pin capacitance changes with each select line configuration
          - Think of the mux as a set of switches
          - The output load is seen on the input
          - Usual use of Liberty assumes a fixed input capacitance or fixed receiver model
      - The Quartus compiler uses fast SPICE, but we want a model for PrimeTime as well
  • 26. N-MOS gate multi-stage Multiplexors
      - Select line enabled by CRAM
      - 2-stage one-hot mux
      - Input cap varies depending on the path taken and on whether other side loads' select lines are on or off
      - Each possible path through the multi-stage mux requires its own pin cap
      - Arc-specific receiver model
          - This is part of the CCS noise model
      - It would be nice if there were a more natural way to support arc- and mode-specific pin caps in Liberty
      - [Figure label: Other NMOS inputs]
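As a toy illustration of why a single fixed pin capacitance cannot describe a pass-gate mux input, the Python sketch below treats each pass gate as an ideal switch and simply sums the lumped capacitance that becomes visible from the input. This is a deliberately crude assumption (no switch resistance, no shielding, no side loads), and every capacitance value is hypothetical; it is not the model used in Quartus or PrimeTime.

```python
from itertools import product

# Hypothetical lumped capacitances in fF (illustrative values only).
C_JUNCTION = 0.2      # source/drain cap of the input's own pass gate, always seen
C_INTERNAL = 0.5      # internal node between stage 1 and stage 2
C_OUTPUT_LOAD = 2.0   # load on the mux output


def input_cap(stage1_on: bool, stage2_on: bool) -> float:
    """Capacitance seen at one input of a 2-stage pass-gate mux (ideal switches)."""
    cap = C_JUNCTION
    if stage1_on:
        cap += C_INTERNAL          # stage-1 switch closed: internal node visible
        if stage2_on:
            cap += C_OUTPUT_LOAD   # both switches closed: output load visible
    return cap


caps = {cfg: input_cap(*cfg) for cfg in product([False, True], repeat=2)}
for (s1, s2), cap in caps.items():
    print(f"stage1_on={s1!s:5} stage2_on={s2!s:5} -> {cap:.1f} fF")
```

Even in this simplified picture the same pin shows three different capacitances depending on the CRAM-driven select state, which is the behaviour an arc- or mode-specific receiver model has to capture.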
  • 27. Incentive for EDA Companies to help
      - As each process generation becomes more complex, the number of unique chip starts decreases.
          - Already down from 12K to fewer than 3K per year
      - Each chip that is designed will be increasingly hyper-optimized.
          - Custom tricks that need to be modeled at the gate level
      - FPGA use is increasing as the ability to run 1 GHz designs at reasonable power approaches
      - FPGA compilers will not be able to model every effect
          - ICC needs PrimeTime
          - Encounter needs ETS
      - Eventually FPGA compilers may need to output their programmed CRAM bits as constants and do a "super-signoff" in commercial STA tools
      - This could be a good possibility for market growth