Andes andes clarity for risc-v vector processor

AndesClarity for RISC-V
Vector Processor
Chuan-Hua Chang, Ph.D.
Associate VP, Architecture
Andes Technology
RISC-V Summit, 12/2020

Agenda
1. Overview of AndesClarity
2. AndesCore™ NX27V Pipeline
3. Program Analysis Example
4. Concluding Remarks

Overview of AndesClarity
• A pipeline visualizer/analyzer for Andes V5 processors.
– Reports performance statistics and execution bottlenecks.
– Ideal for complex pipelines, esp. NX27V vector processor.
• Integrated into AndeSight™ IDE as a plugin
– Using execution log from Andes core simulator (AndeSim).
– Easily linked to source code from RISC-V instructions.
• Graphical representation of performance information
– High-level “Instructions Per Cycle” information
– Instruction execution pipelining flow
– Data dependencies & resource usage

Usage scenarios for AndesClarity
• Algorithm tuning:
– Identify bottleneck from pipeline stalls or resource usage.
– Experiment on different enhancements of the same task.
• Compiler enhancement and flag tuning:
– Identify issues of compiler generated code.
– Compare performance with different compiler options.
• Architecture exploration:
– Explore different SoC architecture and processor configurations.
– Discover potential processor micro-architecture improvements.

AndesClarity Main Interfaces
• Performance viewer
– IPC (Instructions Per Cycle)
– OPC (Operations Per Cycle), designed for vector instructions.
• Pipeline stage viewer
– Instruction-centric view
– Resource-centric view

Performance Viewer
• Timeline view of “Instructions per Cycle” or “Operations per Cycle”.
• A vector instruction counts as VL operations, instead of 1.
• User can zoom in to find more details.
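The difference between the two metrics can be sketched as follows. This is a minimal illustration, not AndesClarity's actual implementation: the trace format (a list of retire cycle and operation count pairs), the function name `per_window_rates`, and the window size are all assumptions made for the example.

```python
from collections import defaultdict

def per_window_rates(retired, window):
    """Compute (IPC, OPC) per fixed-size cycle window.

    retired: list of (retire_cycle, operations) pairs, where operations
             is 1 for a scalar instruction and VL for a vector instruction.
    Returns a dict mapping window start cycle -> (ipc, opc).
    """
    insns = defaultdict(int)
    ops = defaultdict(int)
    for cycle, operations in retired:
        start = (cycle // window) * window
        insns[start] += 1           # every instruction counts once for IPC
        ops[start] += operations    # a vector instruction counts VL times for OPC
    return {s: (insns[s] / window, ops[s] / window) for s in sorted(insns)}

# Hypothetical trace: two scalar instructions and two vector instructions with VL = 8.
trace = [(0, 1), (1, 8), (2, 1), (5, 8)]
rates = per_window_rates(trace, window=4)
for start, (ipc, opc) in rates.items():
    print(f"cycles {start}-{start + 3}: IPC={ipc:.2f} OPC={opc:.2f}")
```

In this toy trace the second window has a low IPC but a high OPC, which is the situation the OPC metric is meant to expose for vector code.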

Instruction-Centric Pipeline Viewer
• Instruction sequence vs instruction pipeline stage flow, along
with the resources utilized.
• Focused instruction can be highlighted.

Resource-Centric Pipeline Viewer
• Pipeline stages/resources vs instruction occupancy.
• Instruction footprint on multiple paths & resources can be
examined.

Display Dependency & Stall Reason
• Dependent instructions (producer & consumer) can be
highlighted.
• Display stall reason to help identify performance issues.
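As a rough idea of what producer/consumer highlighting involves, the sketch below finds read-after-write dependencies in a short instruction list. It is an assumption-laden illustration: the tuple format and the function `find_raw_dependencies` are hypothetical, and the tool itself works from the AndeSim execution log, whose format is not shown in this deck. The sample program is the four-instruction FDCT sequence from the analysis example.

```python
def find_raw_dependencies(program):
    """Return (producer_index, consumer_index, register) triples.

    program: list of (mnemonic, dest_register_or_None, source_registers).
    Only read-after-write dependencies on the most recent writer are reported.
    """
    last_writer = {}
    deps = []
    for idx, (_mnemonic, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], idx, reg))
        if dest is not None:
            last_writer[dest] = idx
    return deps

program = [
    ("vmul.vv",    "v2", ["v10", "v20"]),
    ("vredsum.vs", "v3", ["v2", "v1"]),
    ("vmv.x.s",    "a4", ["v3"]),
    ("sw",         None, ["a4", "sp"]),   # a store writes no register
]
deps = find_raw_dependencies(program)
for producer, consumer, reg in deps:
    print(f"{program[consumer][0]} waits on {reg} from {program[producer][0]}")
```

Each instruction in the sequence depends on the one before it, which is why a stall on any of them propagates down the chain.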

Example Vector Processor Pipeline

AndesCore™ NX27V
[Block diagram of the NX27V pipeline. Scalar stages: Fetch, Decode, Execute, Memory, Retire. Labeled blocks: IFU, GPR, Integer Execution Unit, Data Cache, Exception Handling, VPU with VIQ, Custom Coprocessor (ACE pipeline), Streaming Ports. Labeled signals: V/F/D insn, Vector, scalar, command, data.]

Vector Program Analysis Example

Vectorizing FDCT: Initial Development
• The code contains the following repeated sequence:
vmul.vv v2, v10, v20
vredsum.vs v3, v2, v1
vmv.x.s a4, v3
sw a4, 136(sp)
• Average performance per iteration: 16 cycles.
• Discovery:
– “sw” waits for a4 to be ready and blocks later vector
instructions from entering the vector pipeline.
– Frequent interaction between the scalar and vector pipelines hurts performance.
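What one iteration of the repeated sequence computes can be modeled in a few lines. This is a simplified reference model, assuming all elements are active and ignoring element width and integer wraparound; the function names and the dictionary standing in for memory are illustrative only.

```python
def vmul_vv(vs2, vs1):
    """Element-wise product, as in vmul.vv."""
    return [x * y for x, y in zip(vs2, vs1)]

def vredsum_vs(vs2, vs1):
    """Sum of all vs2 elements plus element 0 of vs1, as in vredsum.vs.
    The hardware writes this value to element 0 of the destination."""
    return sum(vs2) + vs1[0]

def iteration(v10, v20, v1, memory, addr):
    v2 = vmul_vv(v10, v20)     # vmul.vv    v2, v10, v20
    a4 = vredsum_vs(v2, v1)    # vredsum.vs v3, v2, v1 ; vmv.x.s a4, v3
    memory[addr] = a4          # sw         a4, 136(sp)  (addr stands in for 136(sp))
    return a4

memory = {}
result = iteration([1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 0], memory, 136)
print(result, memory)
```

Each iteration is a dot product whose single scalar result crosses from the vector register file to a scalar GPR before being stored, which is the scalar/vector interaction the discovery above points to.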

AndesClarity for FDCT Initial Opt.
32 cycles / 2 iterations
Vector instruction queue is not as full.

Vectorizing FDCT: 2nd Optimization
• Do not move data to a scalar GPR; use a masked vector store
instead.
vsw.v v3, (s10), v0.t
addi s10, sp, imm
• Average performance per iteration: 8.3 cycles.
• Discovery:
– Vector instruction queue is full. This is good. However, …
– Storing just one element with a vector store is inefficient.

AndesClarity for FDCT 2nd Opt.
Vector instruction queue is full now.

Vectorizing FDCT: 3rd Optimization
• Use vslideup to gather data into vector registers.
vslideup.vi v24, v3, const, v0.t
• Discovery:
– “vmul” of the next iteration cannot enter the VIQ.
– Too many dependencies; overlap between functional units is low.
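The gathering step can be illustrated with a small software model of a masked slide-up. This is a sketch under stated assumptions: masked-off and below-offset destination elements are left unchanged, the vector length is 4, and the sample values and the loop that builds a one-hot mask are hypothetical rather than taken from the FDCT code.

```python
def vslideup_vi(vd, vs2, offset, mask):
    """Model of a masked vslideup.vi: vd[i] = vs2[i - offset] for active i >= offset.

    Elements below the offset and masked-off elements keep their old value
    (assumed undisturbed policy).
    """
    out = list(vd)
    for i in range(offset, len(vd)):
        if mask[i]:
            out[i] = vs2[i - offset]
    return out

VL = 4
v24 = [0] * VL
for k, total in enumerate([70, 11, 22, 33]):   # hypothetical reduction results
    v3 = [total] + [0] * (VL - 1)              # vredsum leaves its result in element 0
    mask = [i == k for i in range(VL)]         # enable only destination element k
    v24 = vslideup_vi(v24, v3, k, mask)
print(v24)
```

After VL iterations the destination register holds one result per element, so a single full-width vector store can replace VL one-element stores.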

AndesClarity for FDCT 3rd Opt.

Vectorizing FDCT: 4th Optimization
• Interleave three iterations to reduce dependencies.
• The code contains the following repeated sequence:
vmul.vv v2,…; vmul.vv v4,…; vmul.vv v6,…
vredsum.vs v3, v2,…; vredsum.vs v5, v4,…; vredsum.vs v7, v6,…
vslideup.vi v24, v3,…; vslideup.vi v25, v5,…; vslideup.vi v26, v7,…
• Discovery:
– Utilization of functional units increases.
– Iteration latency is dominated by the non-pipelined “vredsum”.
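The reordering itself is a stage-major interleave of independent iterations, which can be sketched as below. The helper names are hypothetical, the elided operands are kept as “…”, and the sketch assumes each slide-up consumes the reduction result of its own iteration, with each iteration using its own registers.

```python
def one_iteration(prod, total, dest):
    """Instruction strings for one iteration; elided operands stay elided."""
    return [
        f"vmul.vv {prod},…",
        f"vredsum.vs {total}, {prod},…",
        f"vslideup.vi {dest}, {total},…",   # assumption: consumes this iteration's sum
    ]

def interleave(iterations):
    """Emit stage 0 of every iteration, then stage 1 of every iteration, and so on."""
    return [insn for stage in zip(*iterations) for insn in stage]

iterations = [
    one_iteration("v2", "v3", "v24"),
    one_iteration("v4", "v5", "v25"),
    one_iteration("v6", "v7", "v26"),
]
order = interleave(iterations)
for insn in order:
    print(insn)
```

Placing three independent multiplies back to back separates each producer from its consumer by two unrelated instructions, which is what lets the functional units overlap.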

AndesClarity for FDCT 4th Opt.

What Users Can Learn
• Processor micro-architecture characteristics.
• Execution latencies of instructions.
• Interaction between scalar pipeline and vector pipeline.
– Vector load/store on data cache.
– Data movement between scalar and vector GPRs.
• Reasons for lost performance.
• Resource utilization and data bandwidth under different code
sequences.

Concluding Remarks
• AndesClarity can help a user to quickly
– understand the bottlenecks of an application's code sequences.
– learn the complex pipeline/micro-architecture characteristics of a
powerful vector processor.
– discover enhancements that improve application performance.
• AndesClarity is a powerful tool for vector processors, e.g.,
AndesCore™ NX27V Vector processor.
• AndesClarity is integrated in AndeSight™ development
environment as a plugin for easy application profiling and
debugging.
