Micro-architecture of a 64-bit, 4-wide Out-of-Order Processor Core
• RV64IMAFDCSU instruction-set support
• 9-stage pipeline (Fetch, Predecode, Checking, Decode,
Rename, Dispatch, Issue, Execute, and Retire)
• Decode and Issue up to 4 instructions per cycle
• Out-of-Order execution and In-Order commit using ROB
• Six execution units, each containing split functional units
• Branch prediction using a BTB and a 2-bit direction predictor
with the G-Share algorithm; RAS for return-address prediction
• Register renaming to remove false dependencies (64 integer and 48
floating-point physical registers), mapped via Register Alias Tables
(RATs); branch-misprediction recovery by restoring stored
RAT states
• 16 KiB 4-Way set-associative VIPT I-Cache & PIPT D-Cache
• Fully Associative I-TLB & D-TLB with 3-level page table walk
• Supervisor and user mode implementation
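The G-Share direction predictor mentioned above can be sketched in software: the global history register is XORed with branch-PC bits to index a table of 2-bit saturating counters. The 10-bit history and 1024-entry table here are illustrative assumptions, not the core's actual parameters.

```python
# Minimal G-Share branch predictor sketch.
# Assumed parameters: 10-bit global history, 1024-entry table
# of 2-bit saturating counters (not the core's real sizing).
class GShare:
    def __init__(self, bits=10):
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.ghr = 0                      # global history register
        self.table = [1] * (1 << bits)    # counters init to weakly not-taken

    def _index(self, pc):
        # XOR of branch-PC bits and global history selects the counter
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating increment/decrement, then shift outcome into history
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

After a short warm-up (while the history register fills), a strongly biased branch is predicted correctly on almost every access.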
SoC Design around Dual-Core Processor
• 64 KiB internal SRAM
• 16 KiB BootROM hard-coded with a Zeroth-Stage BootLoader (ZSBL) supporting loading
of a binary image via Xmodem
• Memory-mapped peripherals such as GPIO, UART, and I2C using standard Xilinx IPs
• Implemented on an HTG-K800 board using a Kintex UltraScale FPGA
• A slight reduction in performance was observed compared to the previous design,
which is expected due to arbitration between the read and write requests
from the processor
Results: Throughput and Benchmarking of Multi-Core-Compatible Single Core
• The dual-core processor is 1.66 and 1.54 times faster than the single-core
processor for the matrix-multiplication and quicksort applications, respectively.
Results: Performance Improvement in Dual-Core
Processor
RISC-V for AI Applications
Challenges:
• Support for the RVV ISA.
• Hardware support for executing vector
instructions.
• An interface unit for data flow between the
scalar and vector units.
• A stall-generation unit to maintain in-order
issue.
Approach:
• A vector decoder unit supports the RVV ISA
and works in parallel with the scalar
decoder.
• A vector unit executes the vector
operations.
• A native-bus interface unit reduces
latency.
• To preserve in-order issue, only the vector unit
or the scalar core is active at any given time.
Requirements of Hardware:
• A FIFO to store control signals from the
vector decoder.
• A read FSM controller to read control signals
from the FIFO.
• The RVV ISA defines 32 vector registers, each
of length VLEN.
• VLEN = number of lanes x ELEN.
• A vector register file to support the ISA.
• A vector execution unit (ALU and LSU) for
computation.
• A vector memory unit for data transfer between
registers and memory.
• A vector write-back unit for writing back to the
register file.
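The VLEN relation above determines the vector register file capacity. A short sketch, assuming ELEN = 64 bits and the 32 architectural registers the RVV spec mandates (the lane count is a free parameter across the 4/8/16/32-lane configurations):

```python
# Sketch of vector register file (VRF) sizing from VLEN = lanes x ELEN.
# Assumptions: ELEN = 64 bits, 32 architectural vector registers (per RVV).
def vrf_size_bits(lanes, elen=64, num_regs=32):
    vlen = lanes * elen            # bits per vector register
    total = num_regs * vlen        # total VRF capacity in bits
    return vlen, total

# 8 lanes with 64-bit ELEN gives VLEN = 512 bits and a 16 Kib VRF
vlen, total = vrf_size_bits(lanes=8)
```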
Single Lane Vector Unit
Cycle 1: Decode signals are read from the FIFO.
Cycle 2: Read indexes are generated.
Cycle 3 (RD stage): Data is read from the VRF.
Cycle 4 (EX stage): Address generation and operations are performed.
Cycle 5 (MEM stage): Used for memory-type instructions.
Cycle 6 (WB stage): Data is written back into the VRF.
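The six-cycle flow above can be modeled as a toy stage schedule. The assumption that non-memory instructions bypass the MEM stage follows from the note that cycle 5 applies to memory-type instructions; the exact bypass behavior in hardware is not stated.

```python
# Toy model of one instruction's path through the single-lane vector unit.
# Stage names follow the six-cycle description; the MEM bypass for
# non-memory instructions is an assumption for illustration.
STAGES = ["FIFO-Read", "Index-Gen", "RD", "EX", "MEM", "WB"]

def schedule(is_memory_op):
    # Non-memory instructions skip the MEM stage
    return [s for s in STAGES if s != "MEM" or is_memory_op]

mem_path = schedule(True)      # full six-stage path for loads/stores
alu_path = schedule(False)     # five-stage path for arithmetic ops
```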
Design of Reduction Unit
Challenges:
• In most instructions, the elements within a single
register are independent of each other.
• For reduction instructions, the elements are
dependent on each other.
• Example: the sum of the elements in an array.
Solution:
• Elements are passed through reduction
logic that reduces them to one output.
• Latency is reduced when multiple units are
used; three clock cycles are consumed.
• Stalling is handled in the vector unit.
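The reduction logic can be sketched as a pairwise adder tree: with enough adders per level, N elements reduce in log2(N) levels. For eight inputs that is three levels, consistent with the three clock cycles reported here (the 8-input depth is an illustrative assumption, not a stated design parameter).

```python
# Tree-reduction sketch: pairwise adders collapse N elements to one
# result, one tree level per clock cycle when enough units exist.
def tree_reduce(vals):
    levels = 0
    while len(vals) > 1:
        if len(vals) % 2:
            vals = vals + [0]          # pad odd counts with the identity
        # One "hardware level": sum adjacent pairs in parallel
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        levels += 1
    return vals[0], levels

# Summing 8 elements takes log2(8) = 3 levels
total, depth = tree_reduce(list(range(8)))
```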
Implementation Results
Timing Analysis
Processor   Worst Negative Slack   Worst Hold Slack
4-Lane      0.13 ns                0.044 ns
8-Lane      0.021 ns               0.022 ns
16-Lane     0.031 ns               0.01 ns
32-Lane     0.02 ns                0.027 ns
Resource Utilization
Power Estimation
Processor   Static Power   Dynamic Power   Total Power
8-Lane      270 mW         429 mW          699 mW
Vectored power estimation using the Xilinx Power
Analyzer while running a CNN on the CPU.
Parameter        Specification
CPU              Single-core, single-issue, in-order, 5-stage pipeline
Frequency        50 MHz
Memory           I-Cache: 8 KB, 2-way set-associative; D-Cache: 8 KB, 2-way set-associative; Main memory: 1 MB
Peripheral       UART
ISA support      RV32GV
Accelerator      4-, 8-, 16-, or 32-lane vector unit; vector memory: 256 KB/512 KB scratchpad memory
Implemented on   Xilinx Virtex-7 FPGA, xc7vx485tffg1761-2
[Block diagram: RISC-V Vector CPU with main memory and peripherals; the core contains a scalar pipeline (I-Cache, D-Cache, TLB, ALU, FPU) and a vector pipeline (VRF, execution unit, vector memory).]
Processor              CONV8_3X3   CONV16_3X3   CONV32_3X3   CONV32_5X5   MATMUL64X64
8-Lane RISC-V Vector   2.1 us      9.7 us       43.16 us     115.59 us    1.742 ms
Klessydra              9.91 us     21.18 us     59.54 us     113 us       2.741 ms
CE32V40                46.46 us    165.07 us    252.42 us    1969.37 us   14.88 ms
ZeroRI5Cy              69.2 us     623.85 us    970.93 us    2720.99 us   2.53 ms
Comparison with other Data-parallel Processors
• The RISC-V Vector CPU performs better than the Klessydra vector processor, but for convolution with
a 5x5 filter the performance is in the range of Klessydra.
• The speedup comes from denser data packing, which utilizes the vector unit more completely.
• 512 elements are provided as
input to the sorter.
Hardware Accelerator for Short Read Alignment
• Performance comparisons for a number of test cases
implemented on the 4-parallel dual-rate merge tree are
given below.
• The proposed hardware achieves around 2.5 times higher
sorting performance compared to an
Intel Core i7-10700 CPU operating at 2.90 GHz.
Number of Input Elements   Hardware Execution Time (μs)   Software Execution Time (μs)
512                        5                              13-14
4,096                      54.7                           134-137
32,768                     558.6                          1290-1320
• The functionality of the traditional merge tree and the
dual-rate merge tree has been tested on
hardware.
• A parallelized merge tree has been implemented
and tested on hardware.
• Performance comparison of the traditional merge
tree, dual-rate merge tree, and 4-parallel merge
tree for 512 random input elements:
Traditional merge tree   Dual-rate merge tree   4-Parallel dual-rate merge tree
14.28 μs                 9.1 μs                 5 μs
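A software sketch of a two-level merge tree for four sorted runs follows, using Python's `heapq.merge` as a stand-in for each hardware merger. The dual-rate and parallel variants differ only in how many elements move through each merger per cycle, not in this dataflow.

```python
# Merge-tree sketch: four sorted runs merge pairwise at level 1,
# then the two intermediate runs merge into one sorted output at
# level 2, mirroring the streaming hardware mergers.
import heapq

def merge_pair(a, b):
    # Stand-in for one hardware merger node
    return list(heapq.merge(a, b))

def merge_tree_4(runs):
    left = merge_pair(runs[0], runs[1])    # level-1 merger A
    right = merge_pair(runs[2], runs[3])   # level-1 merger B
    return merge_pair(left, right)         # level-2 (root) merger

runs = [[1, 5], [2, 6], [3, 7], [0, 4]]
result = merge_tree_4(runs)
```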
Remote Lab Control and Monitoring Platform
By:
Animesh Jain
Ch Kalyan Kumar Prusty
Guided by:
Kuruvilla Varghese
Prof. L Umanand
Haresh Dagale
Remote FPGA programming and monitoring workflow (the user system connects to the digital DUT over the Internet):
1. Write HDL code.
2. Generate the output file (.bit) for the project.
3. Send the output .bit file and control bits.
4. Program or control the DUT using the output file.
5. Sample the output generated by the DUT with a logic analyzer.
6. Send the sampled output to the user.
7. Display the waveform on the user system.
Digital outputs drive the DUT's digital input configuration; digital inputs capture the DUT's digital output configuration.
TCAM Block Diagram
• All N input sub-words of size
w bits are applied simultaneously
to every layer.
• From the layer outputs, a priority
encoder selects the highest-
priority matched address.
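Ternary matching with priority encoding can be sketched as follows. Each stored entry is a (value, care-mask) pair; a search key matches when all "care" bits agree, and the encoder returns the lowest matching address (lower address = higher priority is an assumption for illustration).

```python
# TCAM lookup sketch: ternary entries are (value, care_mask) pairs;
# a masked-out bit position is a don't-care. The first match wins,
# modeling a priority encoder over the per-entry match lines.
def tcam_lookup(entries, key):
    for addr, (value, care) in enumerate(entries):
        if (key & care) == (value & care):
            return addr        # lowest matching address (highest priority)
    return None                # no entry matched

entries = [
    (0b1010, 0b1111),   # exact match on 1010
    (0b1000, 0b1100),   # 10** : low two bits are don't-care
]
```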
Resource Efficient Implementation of Ternary
Content Addressable Memories
Pipelined layer
architecture
• For this design, the data path was
divided appropriately to
introduce pipelining.
• Pipelining increases the design's
operating frequency.
FPGA Implementation
Results
▪ The resource-efficient TCAM is
implemented with a size of 64 x 32, with
L=4 and N=4.
▪ Design parameters improved on all
fronts compared to existing designs:
speed by 10.52%, power by 7.62%,
and resource utilization by 50%.
Implementation   BRAMs (18K)   FFs   LUTs   Speed (MHz)   Power (mW)
Z-TCAM           32            198   447    190           35.69
Efficient TCAM   16            164   233    210           33.00
ASIC implementation
Results
• An ASIC design of the same
architecture was implemented
using Cadence tools with the GPDK
library on a 45-nm CMOS
technology node.
• In the ASIC implementation, design
parameters improved on all fronts
compared to existing designs:
speed by 121.10%, power by
70.12%, and area reduced by
18.7 times.
Implementation   Area (µm2)   Speed (MHz)   Power (W)
Z-TCAM           18707925     226           0.376
Efficient TCAM   1000000      493.82        0.111