4. Taking RISC-V® Mainstream 4
Overview of AndesClarity
• A pipeline visualizer/analyzer for Andes V5 processors.
– Reports performance statistics and execution bottlenecks.
– Ideal for complex pipelines, esp. NX27V vector processor.
• Integrated into AndeSight™ IDE as a plugin
– Uses the execution log from the Andes core simulator (AndeSim).
– Links RISC-V instructions back to source code.
• Presents performance information graphically:
– High-level “Instructions Per Cycle” information
– Instruction execution pipelining flow
– Data dependencies & resource usage
Usage scenarios for AndesClarity
• Algorithm tuning:
– Identify bottlenecks from pipeline stalls or resource usage.
– Experiment with different enhancements to the same task.
• Compiler enhancement and flag tuning:
– Identify issues in compiler-generated code.
– Compare performance with different compiler options.
• Architecture exploration:
– Explore different SoC architectures and processor configurations.
– Discover potential processor micro-architecture improvements.
AndesClarity Main Interfaces
• Performance viewer
– IPC (Instructions Per Cycle)
– OPC (Operations Per Cycle), designed for vector instructions.
• Pipeline stage viewer
– Instruction-centric view
– Resource-centric view
Performance Viewer
• Timeline view of “Instructions per Cycle” or “Operations per Cycle”.
• A vector instruction counts as VL operations rather than 1.
• The user can zoom in to see more detail.
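The difference between the two metrics can be illustrated with a small calculation. The sketch below is not AndesClarity's implementation; the `Retired` record, the `ipc_opc` helper, and the sample trace are hypothetical names and data made up for illustration. It only assumes what the slide states: a vector instruction contributes VL operations and any other instruction contributes 1.

```python
from dataclasses import dataclass


@dataclass
class Retired:
    """One retired instruction (hypothetical record, not the AndeSim log format)."""
    is_vector: bool
    vl: int = 1  # vector length; ignored for scalar instructions


def ipc_opc(trace, total_cycles):
    """Return (IPC, OPC) over a window of total_cycles cycles."""
    instructions = len(trace)
    # A vector instruction counts as VL operations, a scalar one as 1.
    operations = sum(r.vl if r.is_vector else 1 for r in trace)
    return instructions / total_cycles, operations / total_cycles


# Made-up window: two vector instructions with VL = 8 and two scalar ones.
trace = [
    Retired(is_vector=True, vl=8),
    Retired(is_vector=True, vl=8),
    Retired(is_vector=False),
    Retired(is_vector=False),
]
ipc, opc = ipc_opc(trace, total_cycles=8)
print(f"IPC = {ipc}, OPC = {opc}")
```

For vector-heavy code, IPC can look low while OPC shows that much more work is being done per cycle, which is why OPC is the more useful timeline for vector instructions.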
Instruction-Centric Pipeline Viewer
• Plots the instruction sequence against its pipeline-stage flow, along with the resources used.
• The focused instruction can be highlighted.
Resource-Centric Pipeline Viewer
• Plots pipeline stages and resources against instruction occupancy.
• An instruction's footprint across multiple paths and resources can be examined.
Display Dependency & Stall Reason
• Dependent instructions (producer and consumer) can be highlighted.
• Stall reasons are displayed to help identify performance issues.
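To make the producer/consumer idea concrete, the sketch below finds read-after-write dependencies in a short instruction list. It is an illustration only and says nothing about how AndesClarity derives its dependency display; the `raw_dependencies` helper and the tuple encoding of instructions are assumptions. The sample program is the four-instruction sequence from the FDCT initial-development slide.

```python
def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) triples.

    Each instruction is (mnemonic, destination_or_None, [source registers]).
    """
    last_writer = {}  # register name -> index of the latest instruction writing it
    deps = []
    for idx, (_mnemonic, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], idx, reg))
        if dest is not None:
            last_writer[dest] = idx
    return deps


program = [
    ("vmul.vv", "v2", ["v10", "v20"]),
    ("vredsum.vs", "v3", ["v2", "v1"]),
    ("vmv.x.s", "a4", ["v3"]),
    ("sw", None, ["a4", "sp"]),
]
deps = raw_dependencies(program)
for producer, consumer, reg in deps:
    print(f"{program[producer][0]} -> {program[consumer][0]} via {reg}")
```

Every instruction in this sequence consumes the result of the one before it, so the chain is fully serial.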
AndesCore™ NX27V
[Figure: NX27V block diagram. Pipeline stages: Fetch, Decode, Execute, Memory, Retire. Blocks shown: IFU, GPR, Integer Execution Unit, Data Cache, Exception Handling, VPU fed through the VIQ (vector and scalar paths, V/F/D insn), ACE pipeline, Custom Coprocessor (command and data links), Streaming Ports.]
Vectorizing FDCT: Initial Development
• The following code sequence is repeated:
vmul.vv v2, v10, v20
vredsum.vs v3, v2, v1
vmv.x.s a4, v3
sw a4, 136(sp)
• Average performance per iteration: 16 cycles.
• Findings:
– “sw” waits for a4 to be ready and blocks later vector instructions from entering the vector pipeline.
– Frequent interaction between the scalar and vector pipelines hurts performance.
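As a functional reference for what one iteration of this sequence computes, here is a minimal Python model. It is a sketch under simplifying assumptions: element widths and integer wraparound are ignored, registers are plain lists, memory is a dictionary keyed by byte address, and the helper names are made up. It models results only, not timing.

```python
def vmul_vv(vs2, vs1):
    """Element-wise multiply, as in vmul.vv."""
    return [x * y for x, y in zip(vs2, vs1)]


def vredsum_vs(vs2, vs1):
    """Sum reduction, as in vredsum.vs: element 0 of vs1 plus all elements of vs2."""
    return vs1[0] + sum(vs2)


def one_iteration(v10, v20, v1, memory, address):
    v2 = vmul_vv(v10, v20)     # vmul.vv    v2, v10, v20
    v3_0 = vredsum_vs(v2, v1)  # vredsum.vs v3, v2, v1  (result lands in v3[0])
    a4 = v3_0                  # vmv.x.s    a4, v3      (vector -> scalar GPR)
    memory[address] = a4       # sw         a4, 136(sp) (sp taken as 0 here)
    return a4


memory = {}
result = one_iteration([1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 0], memory, 136)
print(result, memory)
```

Each iteration is a dot product whose single scalar result crosses from the vector register file to a scalar GPR before being stored, which is the scalar/vector interaction identified above.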
AndesClarity for FDCT Initial Opt.
• Measured: 32 cycles per 2 iterations.
• The vector instruction queue is not kept full.
Vectorizing FDCT: 2nd Optimization
• Do not move data to a scalar GPR; use a masked vector store instead.
• The following code sequence is repeated:
vmul.vv v2, v10, v20
vredsum.vs v3, v2, v1
vsw.v v3, (s10), v0.t
addi s10, sp, imm
• Average performance per iteration: 8.3 cycles.
• Findings:
– The vector instruction queue is now full, which is good. However, …
– Storing just one element with a vector store is inefficient.
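The inefficiency of storing a single element can be seen in a small model of a masked unit-stride word store. This is a sketch, not the hardware behavior in detail: the `masked_word_store` helper is hypothetical, VL = 8 is an assumed value, and the mask is assumed to enable only element 0, which is where the reduction result sits.

```python
def masked_word_store(memory, base, values, mask, elem_bytes=4):
    """Store values[i] at base + i * elem_bytes for every element with mask[i] set.

    Returns the number of elements actually written.
    """
    written = 0
    for i, (value, active) in enumerate(zip(values, mask)):
        if active:
            memory[base + i * elem_bytes] = value
            written += 1
    return written


VL = 8                            # assumed vector length for illustration
memory = {}
v3 = [70] + [0] * (VL - 1)        # reduction result in element 0
v0 = [True] + [False] * (VL - 1)  # mask enabling only element 0
written = masked_word_store(memory, base=136, values=v3, mask=v0)
utilization = written / VL
print(memory, f"store utilization = {utilization:.3f}")
```

Under these assumptions only one of eight element slots in each vector store carries useful data.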
AndesClarity for FDCT 2nd Opt.
• Measured: 25 cycles per 3 iterations.
• The vector instruction queue is now full.
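The per-iteration figures follow directly from the cycle counts read off the viewer. The short calculation below uses only the numbers reported on these slides (32 cycles for 2 iterations initially, 25 cycles for 3 iterations after the second optimization); the variable names are illustrative.

```python
from fractions import Fraction

# (cycles, iterations) as read from the pipeline viewer on the slides
measurements = {
    "initial": (32, 2),
    "2nd optimization": (25, 3),
}

per_iteration = {
    name: Fraction(cycles, iterations)
    for name, (cycles, iterations) in measurements.items()
}
speedup = per_iteration["initial"] / per_iteration["2nd optimization"]

for name, value in per_iteration.items():
    print(f"{name}: {float(value):.2f} cycles per iteration")
print(f"speedup: {float(speedup):.2f}x")
```

This gives 16 and about 8.3 cycles per iteration, a speedup of 1.92x from removing the vector-to-scalar move.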
Vectorizing FDCT: 3rd Optimization
• Use vslideup to gather data into vector registers.
• The following code sequence is repeated:
vmul.vv v2, v10, v20
vredsum.vs v3, v2, v1
vslideup.vi v24, v3, const, v0.t
• Average performance per iteration: 6.75 cycles.
• Findings:
– The “vmul” of the next iteration cannot enter the VIQ.
– Too many dependencies; functional-unit overlap is low.
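The gathering step can be modeled in a few lines. This sketch assumes that the slide-up offset (`const` on the slide) equals the iteration index and that the mask enables only the destination element for that iteration; the slides do not state either detail, so both are assumptions, as are the helper name and sample values.

```python
def vslideup(vd, vs2, offset, mask):
    """Model of a masked slide-up: vd[i] = vs2[i - offset] for active i >= offset.

    Elements below offset, and inactive elements, keep their old vd value.
    """
    result = list(vd)
    for i in range(offset, len(vd)):
        if mask[i]:
            result[i] = vs2[i - offset]
    return result


VL = 4                                 # assumed vector length for illustration
v24 = [0] * VL                         # accumulates one result per iteration
reduction_results = [70, 11, 22, 33]   # made-up vredsum results, one per iteration

for k, value in enumerate(reduction_results):
    v3 = [value] + [0] * (VL - 1)          # vredsum.vs leaves its result in v3[0]
    one_hot = [i == k for i in range(VL)]  # assumed mask: only element k active
    v24 = vslideup(v24, v3, offset=k, mask=one_hot)

print(v24)
```

After VL iterations the gathered register holds one result per element and can be written with a single full-width vector store. Because every slide-up reads and writes the same destination register, the iterations remain chained through it, which is consistent with the dependency finding above.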
What Users Can Learn
• Processor micro-architecture characteristics.
• Execution latencies of instructions.
• Interaction between scalar pipeline and vector pipeline.
– Vector load/store on data cache.
– Data movement between scalar and vector GPRs.
• Reasons for lost performance.
• Resource utilization and data bandwidth under different code
sequences.
Concluding Remarks
• AndesClarity can help a user quickly:
– understand the bottlenecks in an application's code sequences.
– learn the complex pipeline/micro-architecture characteristics of a
powerful vector processor.
– discover enhancements to improve the application performance.
• AndesClarity is a powerful tool for vector processors such as the AndesCore™ NX27V vector processor.
• AndesClarity is integrated into the AndeSight™ development environment as a plugin for easy application profiling and debugging.