January 18, 2024
FPGAs for Pre-silicon Emulation of
Large-Scale RISC-V based Processors
Behzad Salami
HiPEAC’2024 - RAPIDO Workshop
RISC-V based Chip Design Flow
FPGA
Emulation
Design
Verification
Higher-accuracy, Reduced execution time, Increased cost
• Develop FPGA HW/SW tools
• Large-size RISC-V designs
• Different Skills needed
• …..
FPGAs for Pre-Silicon Prototyping:
• Examples from Industry
Makinote: BSC’s FPGA Platform for RISC-V Prototyping
• Hardware Platform
• Software Toolset
Open Discussions
Typical Large-Scale FPGA-based Emulation Platforms
• Rich set of software tools (management, design partitioning, verification of different parameters, etc.)
• State-of-the-art HW technology (scalable from M to B design sizes, SOTA FPGA technology, customizable peripherals, etc.)
• Technical support
• Limited accessibility for users (based on payment, contract, etc.)
• Closed-source SW stack
• Not off-the-shelf FPGAs
Makinote
BSC’s FPGA-based Platform for RISC-V Emulation
Makinote Hardware
FPGA Cluster
FPGA Power (Total)
#FPGAs (Alveo U55c): 96
HBM2 capacity: 1.5 TB
#LUTs: 125M
Max ASIC size (#cells): 750M
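The cluster totals can be cross-checked against per-card figures. The following is a minimal sketch, assuming 16 GB of HBM2 per Alveo U55c (a per-card value that is not stated on the slide); all other numbers are the slide's totals.

```python
# Cross-check of the cluster totals listed on the FPGA Cluster slide.
NUM_FPGAS = 96            # from the slide
HBM_PER_CARD_GB = 16      # assumption: HBM2 capacity of one Alveo U55c
TOTAL_LUTS = 125e6        # from the slide
MAX_ASIC_CELLS = 750e6    # from the slide

total_hbm_tb = NUM_FPGAS * HBM_PER_CARD_GB / 1024   # GB -> TB (binary units)
luts_per_fpga = TOTAL_LUTS / NUM_FPGAS              # LUTs available per card
cells_per_fpga = MAX_ASIC_CELLS / NUM_FPGAS         # ASIC cells mapped per card
cells_per_lut = MAX_ASIC_CELLS / TOTAL_LUTS         # implied cells-per-LUT ratio

print(f"Total HBM2:      {total_hbm_tb:.1f} TB")
print(f"LUTs per FPGA:   {luts_per_fpga / 1e6:.2f} M")
print(f"Cells per FPGA:  {cells_per_fpga / 1e6:.1f} M")
print(f"Cells per LUT:   {cells_per_lut:.0f}")
```

Under these assumptions each FPGA holds about 7.8M cells, the same figure quoted for small designs under Cluster Configurations.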
Interconnection Network
[Figure: four FPGAs (FPGA0 to FPGA3), each with a PCIe Gen4 link; Aurora links over QSFP0 between FPGAs; Ethernet links over QSFP1 to a 100 Gb Ethernet switch]
User Access Methodology
Cluster Configurations
A) Single (small) design, single FPGA
B) Multiple (small) designs, multiple FPGAs
• Ethernet-based
• MPI-style applications
• Small designs (up to 7.8M cells)
• Current designs: 50 MHz (up to 400 MHz achievable)
C) Single (large) design, multiple FPGAs
• Design partitioning methodology
• Compiler for automatic partitioning and bitstream generation on multiple FPGAs
• Large designs (up to 750M cells)
• Objective: support up to 50 MHz
[Figure legend: Host (Linux), FPGA card, RTL design]
Roadmap for Makinote
[Figure: roadmap chart; axis labels: Performance; 7.8M, 750M, 1B, +10B. Annotation: no performance loss, linear increase at fixed clock speed]
• Hardware Improvement:
• Goal: Larger, faster, better-connected FPGAs
• Potential Technologies:
• Xilinx ACAP technology
• AMD roadmap for FPGA emulation (Versal VP1902 Adaptive SoC)
• Chiplet technology:
• Better chip-to-chip connectivity
• Larger number of GTY transceivers
• More QSFP ports
• Larger number of LUTs
• Supporting large designs
• 600 Gb Ethernet vs. 100 Gb Aurora
• No performance loss for large designs
• Software Toolset Developments
• Design Partitioning Toolset
• Design Verification Toolset
• Performance, Power, Energy Emulation Toolset
• Hardware Acceleration for AI and LLM
Makinote Toolset
1) FPGA Shell
• Goal: Simplify and automate the generation of FPGA-based designs
• Features:
• SW support (drivers (ONIC), tools (images))
• FPGA support (FPGA Shell generation)
• Scripts & TCLs (e.g., timing policies)
• FPGA flow (GitLab CI/CD)
• Backend based on Xilinx IPs
[Figure: FPGA Shell block diagram around the RTL; labels: ACME, HBM Ctrl, Ethernet Ctrl, F2F Ctrl, SM HBM Ctrl, F2F Link Ctrl, Remote F2F Ctrl]
Open-source repo: MEEPproject/fpga_shell (github.com), the MEEP FPGA Shell project, currently supporting Alveo U280 and U55c
IPs                      Features
DRAM                     DDR4, HBM2 (DMA)
PCIe                     Gen4 (QDMA)
Ethernet                 10/100 GbE (over QSFP port)
Point-2-Point MAC-Layer  Aurora (over QSFP port)
UART                     Debug, bitstream uploading
Others                   JTAG, InfoROM
Rapid FPGA Prototyping using FPGA Shell
• Inputs: MEEP Shell scripts (TCL), e.g., shell_aurora, shell_hbm, ….
• Config file (accelerator_def.csv)
• Make: Vivado project (automatically generated)
• FPGA Shell: PCIe (QDMA), DRAM (HBM, DDR), Ethernet (10/100 GbE), UART, Aurora-DMA, InfoROM
• Supported hardware: Alveo U55c, Alveo U280
• Output: bitstream for the FPGA cluster
• FPGA tools:
./build_pci_drivers.sh
./load_bitstream.sh
./load_fedora.sh
./fpga_test.sh
./boot_acme.sh
./boot_ariane.sh
….
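The shell is configured through comma-separated rows, presumably of the kind kept in accelerator_def.csv. A minimal sketch of reading such rows follows; the sample rows are copied from the interface slides that follow, and because the talk does not document the column meanings, only two assumptions are made: field 0 is the interface name and field 1 is an enable flag. The helper name parse_interfaces is hypothetical and not part of the MEEP tooling.

```python
import csv
import io

# Sample rows copied from the interface slides. Column meanings are not
# documented in the talk; only the first two fields are interpreted here
# (assumption: field 0 = interface name, field 1 = "yes"/"no" enable flag).
SAMPLE = """\
PCIE, yes, pci_axi, , PCIE_CLK, pcie_clk, pcie_rstn, dma, 0, , 02
DDR, no, mem_axi
HBM, yes, mem_axi, AXI3-256, CLK0, 0*0, mem_calib_complete, 00
Ethernet, yes, eth_axi, AXI4LITE-64, CLK0, 100Gb, eth_irq, qsfp1, hbm, 29
Aurora,yes,eth_axi,AXI4LITE-64,CLK0,dma,eth_irq,qsfp1,hbm,13
"""


def parse_interfaces(text):
    """Return a list of (name, enabled, remaining_fields) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        fields = [f.strip() for f in fields]
        name = fields[0]
        enabled = fields[1].lower() == "yes"
        rows.append((name, enabled, fields[2:]))  # keep the rest uninterpreted
    return rows


interfaces = parse_interfaces(SAMPLE)
enabled_names = [name for name, enabled, _ in interfaces if enabled]
print(enabled_names)
```

Keeping the remaining fields as an uninterpreted list avoids guessing at semantics the slides do not spell out.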
PCIe (QDMA)
PCIE, yes, pci_axi, , PCIE_CLK, pcie_clk, pcie_rstn, dma, 0, , 02
• PCIe Gen4: 16 lanes running at 16.0 GT/s
• Bandwidth: 16 GT/s per lane ≈ 16 Gb/s = 2 GB/s; 16 lanes x 2 GB/s = 32 GB/s
• Hardware/software configuration:
1. Setting up the host: compile the drivers and install them on the host.
2. Setting up the PCIE4C hard block inside the FPGA using Vivado:
   1. BAR options
   2. Number of lanes
   3. Interfaces (AXI4 for memory-mapped accesses, AXI-Lite for registers, etc.)
• QDMA: wraps the PCIE4C block with a DMA engine
DRAM (HBM and DDR4)
DDR, no, mem_axi
HBM, yes, mem_axi, AXI3-256, CLK0, 0*0, mem_calib_complete, 00
HBM, yes, mem_axi, AXI3-256, CLK0, 0*0, mem_calib_complete, 01
HBM, yes, mem_axi, AXI3-256, CLK0, 0*0, mem_calib_complete, 02
...
HBM, yes, mem_axi, AXI3-256, CLK0, 0*0, mem_calib_complete, 31
• HBM:
• 32 configurable AXI pseudo-channels
• Size: 32 x 256 MB
• Bandwidth: (256 bits per AXI port) x (2 ports per memory controller) x (8 channels) x 450 MHz x 2 = 460 GB/s
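The two bandwidth figures on this slide can be reproduced with a few lines of arithmetic. This is a sketch under two stated assumptions: the PCIe figure of 32 GB/s ignores the 128b/130b line coding used by PCIe Gen3 and later, so the coded value is shown alongside it, and the trailing "x 2" in the HBM formula is read here as two HBM stacks.

```python
# PCIe Gen4 x16: per-direction bandwidth, before and after line coding.
GT_PER_S_PER_LANE = 16.0   # from the slide
PCIE_LANES = 16            # from the slide
pcie_raw_gbs = GT_PER_S_PER_LANE * PCIE_LANES / 8   # GB/s, coding ignored
pcie_coded_gbs = pcie_raw_gbs * 128 / 130           # GB/s after 128b/130b coding

# HBM: the slide's formula, term by term.
AXI_WIDTH_BITS = 256       # bits per AXI port
PORTS_PER_MC = 2           # ports per memory controller
CHANNELS = 8               # channels (memory controllers)
CLOCK_HZ = 450e6           # AXI clock
STACKS = 2                 # assumption: the final "x 2" means two HBM stacks
hbm_gbs = (AXI_WIDTH_BITS / 8) * PORTS_PER_MC * CHANNELS * CLOCK_HZ * STACKS / 1e9

print(f"PCIe raw:   {pcie_raw_gbs:.1f} GB/s")
print(f"PCIe coded: {pcie_coded_gbs:.1f} GB/s")
print(f"HBM peak:   {hbm_gbs:.1f} GB/s")
```

The product of ports, channels, and stacks is 32, which agrees with the 32 AXI pseudo-channels listed for HBM.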
Ethernet (10/100 GbE)
Ethernet, yes, eth_axi, AXI4LITE-64, CLK0, 100Gb, eth_irq, qsfp1, hbm, 29
• Ethernet solution over QSFP and GTY transceivers:
• A pair of board-level QSFP+ optical units:
• Bandwidth of up to (28 Gb/s x 4 lanes) = 112 Gb/s
• A set (4 lanes) of FPGA-level UltraScale+ GTY transceivers
• Bandwidth of up to 32.75 Gb/s per external differential pin pair
• Integrated with DMA engine
• Used for the multiple-design, multiple-FPGA scenario
Aurora (64B/66B)
Aurora,yes,eth_axi,AXI4LITE-64,CLK0,dma,eth_irq,qsfp1,hbm,13
• For direct FPGA-to-FPGA (F2F) communications
• Aurora 64B/66B is a scalable, lightweight, link-layer protocol for high-speed serial GTY communication.
• Physical layer:
• Line rate: up to 25 Gb/s
• Lanes: 4
• Low resource cost with 3% transmission overhead
• Latency: ~50 cycles
• Clock: 400 MHz
• Link layer:
• Dataflow mode: duplex or simplex operation
• Interface: streaming
• 32-bit CRC for user data
• Potentially for the design partitioning scenario
• Integrated with DMA engine
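The Aurora figures can be checked with a short calculation. This is a rough sketch: the 3% overhead follows from the 64B/66B coding itself (2 sync bits per 66-bit block), while converting the ~50-cycle latency to time assumes those cycles are counted at the quoted 400 MHz clock, which the slide does not state explicitly.

```python
# Aurora 64B/66B link: coding overhead, payload rate, and latency estimate.
LANES = 4                 # from the slide
LINE_RATE_GBPS = 25.0     # per-lane line rate, from the slide
LATENCY_CYCLES = 50       # approximate, from the slide
CLOCK_HZ = 400e6          # assumption: latency cycles are at this clock

overhead = 2 / 66                          # 2 sync bits per 66-bit block
raw_gbps = LANES * LINE_RATE_GBPS          # aggregate line rate
payload_gbps = raw_gbps * 64 / 66          # rate left for 64-bit payloads
latency_ns = LATENCY_CYCLES / CLOCK_HZ * 1e9

print(f"Coding overhead: {overhead * 100:.2f} %")
print(f"Raw rate:        {raw_gbps:.0f} Gb/s")
print(f"Payload rate:    {payload_gbps:.2f} Gb/s")
print(f"Latency:         {latency_ns:.0f} ns")
```

The payload rate excludes any further framing or CRC cost, so it should be read as an upper bound on user throughput.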
2) Integrated with OpenPiton (an Open Source NOC for Multi-core Systems)
[Figure: Lagarto cores integrated with FPGA Shell interfaces (PCIe, ETH, HBM, UART)]
3) Design Partitioning onto Multiple FPGAs (Supporting Large Designs)
[Figure: flow from NoC-based RTL (chipset, tiles 0 to n) and FPGA Shell (Aurora, Eth, PCIe, HBM/DDR4) through the Design Partitioning Compiler to per-FPGA P&R runs]
• Compiler Development:
• OpenPiton tiles partitioning
• SerDes
• FPGA Shell adaptation
• Hardware Development:
• Improve the P2P interconnectivity through GTY demultiplexing (new cabling)
• Optical circuit switch
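To make the tile-partitioning idea concrete, here is a toy sketch that assigns tiles to FPGAs with a first-fit-decreasing heuristic under a per-FPGA cell budget. It is not the Makinote compiler: the function name, the tile sizes, and the choice of heuristic are all hypothetical, and the sketch ignores inter-FPGA link cost, which a real partitioner would need to weigh.

```python
def partition_tiles(tile_cells, fpga_capacity):
    """Assign tiles to FPGAs with first-fit decreasing.

    tile_cells: list of cell counts, one per tile (hypothetical sizes).
    fpga_capacity: cell budget of a single FPGA.
    Returns a list of FPGAs, each a list of tile indices.
    """
    # Place the largest tiles first; ties keep their original order.
    order = sorted(range(len(tile_cells)), key=lambda i: tile_cells[i], reverse=True)
    fpgas = []  # each entry: {"free": remaining cells, "tiles": [indices]}
    for i in order:
        size = tile_cells[i]
        if size > fpga_capacity:
            raise ValueError(f"tile {i} does not fit on a single FPGA")
        for fpga in fpgas:
            if fpga["free"] >= size:      # first FPGA with enough room
                fpga["free"] -= size
                fpga["tiles"].append(i)
                break
        else:                             # no existing FPGA fits: open a new one
            fpgas.append({"free": fpga_capacity - size, "tiles": [i]})
    return [fpga["tiles"] for fpga in fpgas]


# Hypothetical tile sizes (cells) against the 7.8M-cell per-FPGA limit.
tiles = [3_000_000, 3_000_000, 3_000_000, 3_000_000, 1_500_000]
placement = partition_tiles(tiles, 7_800_000)
print(placement)
```

A NoC-based design suits this kind of split because tile boundaries are already well-defined network interfaces, which is where the SerDes item in the list above would come in.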
4) Design Verification @ FPGA
Goals:
• Speed up the UVM verification process (~3 orders of magnitude vs. simulation-level)
• Increase the coverage by running more random tests
• Increase the accuracy of verification (on real silicon vs. simulation-level)
• Raise the verification from signal level to transaction level (vs ILA@FPGA), still cycle-accurate
• Easier tracing and debugging and faster verification
Features:
• Tracer (extract traces at FPGA to be compared with Spike at the host):
• RISC-V Instruction traces (at every pipeline stage, e.g., fetched, decoded, committed, etc)
• Cache (L1, L2, L3) and memory traces
• ALU, FPU, VPU, accelerator traces
• …..
• Profiling and visualization tools
• Synthesizable Printf and Assertion
• Breakpoint, watchpoint, pause and resume
• Supporting Multi-FPGA
Constraints:
• Limited hardware resources at FPGA
• FPGA development efforts
[Figure: co-simulation setup between host and FPGA; labels: test bench, DUT wrapper (SW), co-sim interface (HW), DUT]
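The tracer's comparison against Spike on the host amounts to finding the first point where the DUT trace and the reference trace disagree. A minimal sketch follows, assuming each trace entry is a (pc, instruction word) tuple; that format, the function name, and the sample values are hypothetical, since the talk does not describe the actual trace layout.

```python
from itertools import zip_longest


def first_divergence(dut_trace, ref_trace):
    """Return (index, dut_entry, ref_entry) at the first mismatch, else None.

    A trace that ends early shows up as a mismatch against None.
    """
    for i, (dut, ref) in enumerate(zip_longest(dut_trace, ref_trace)):
        if dut != ref:
            return i, dut, ref
    return None


# Hypothetical committed-instruction traces as (pc, instruction word) pairs.
ref = [(0x80000000, 0x00000297), (0x80000004, 0x02028593), (0x80000008, 0xF1402573)]
dut = [(0x80000000, 0x00000297), (0x80000004, 0x02028593), (0x8000000C, 0xF1402573)]

mismatch = first_divergence(dut, ref)
if mismatch is not None:
    index, dut_entry, ref_entry = mismatch
    print(f"Diverged at entry {index}: DUT pc={dut_entry[0]:#x}, ref pc={ref_entry[0]:#x}")
```

Stopping at the first divergence keeps the report small, which matters when traces come from long random-test runs on the FPGA.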
Open Discussions
• Vs. industrial platforms:
• Makinote is built for research purposes
• Easy access
• A platform for collaboration
• Makinote is built based on COTS FPGAs
• Scalable
• Makinote SW stack is/will be open-source
• A platform for collaboration
Thank you!
(behzad.salami@bsc.es)
