Performance evaluation with Arm HPC tools for SVE

Gem5 simulator
RIKEN AICS
(Advanced Institute for Computational Science)
2017/12/13
Y. Kodama
ARM HPC workshop 2017

Gem5 simulator
 Processor simulator
 Supports multiple ISAs: Alpha, SPARC, x86, ARM
 CPU model
• Atomic: instruction level simulation
• O3: Out of Order pipeline simulation
• Can estimate execution cycles
 Development of “gem5-sve”
 The atomic mode for SVE was developed by ARM.
 Gem5 with SVE support (atomic and o3) will be upstreamed to the
mainline by ARM soon.
 RIKEN also developed its own o3 mode for SVE, based on
ARM's atomic model of SVE.
http://gem5.org

Gem5 | CPU model
• Atomic: instruction level simulation
○ Number of dynamic executed instructions
○ Instruction mix (ratio of arithmetic vs. memory instructions, ratio of
vectorization, etc.)
× Execution cycles
× Cache hit ratio (some instructions are divided into
micro-operations)
○ Simulation speed is several millions of insts/sec
• O3: Out of Order pipeline simulation
○ Execution cycles
○ Cache hit ratio, L1/L2/Memory bandwidth/latency
× Simulation speed is less than 1/10 of atomic mode.

Gem5 | O3 pipeline
 Based on the Alpha 21264
 7-stage pipeline: Fetch, Decode, Rename, Issue, Execute,
Write Back, Commit
 Parameter file
 Several parameters can be specified, as shown on the next slide.
 These are based on O3_ARM_v7a.py, a preset parameter file
in gem5.
 Instruction latencies for SVE are added with reference to those of NEON

Gem5 | architecture parameters
 Based on O3_ARM_v7a.py, a preset parameter file in gem5.
Hardware parameters
  Clock frequency                 2.0 GHz
  # of cores                      1
  L1 Dcache, Icache size          32 kB
  L2 cache size                   2 MB
  # of integer pipelines          2
  Load/Store units                1/1
  # of floating-point pipelines   2
  Fetch width                     3
OoO resource parameters
  IQ (Reservation Station)        64 (←32)
  ROB (Re-order Buffer)           64 (←48)
  LQ (Load Queue)                 16
  SQ (Store Queue)                16
  Physical Vector Registers       96 (new)
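As a rough illustration, the OoO resource values above could be kept as a set of overrides applied on top of a preset CPU object. The attribute names (fetchWidth, numIQEntries, numROBEntries, LQEntries, SQEntries, numPhysVecRegs) are assumed to match the upstream gem5 O3 CPU model and may differ in the gem5-sve tree. apply_overrides and the stand-in cpu object are hypothetical helpers so that the sketch runs without gem5.

```python
from types import SimpleNamespace

# OoO resource values from the table above. The attribute names are assumed
# to match the upstream gem5 O3 CPU model; the gem5-sve tree may differ.
O3_OVERRIDES = {
    "fetchWidth": 3,
    "numIQEntries": 64,    # IQ (Reservation Station)
    "numROBEntries": 64,   # ROB (Re-order Buffer)
    "LQEntries": 16,       # LQ (Load Queue)
    "SQEntries": 16,       # SQ (Store Queue)
    "numPhysVecRegs": 96,  # Physical Vector Register (new)
}


def apply_overrides(cpu, overrides):
    """Set each override on cpu and return a dict of the values it replaced."""
    previous = {}
    for name, value in overrides.items():
        previous[name] = getattr(cpu, name, None)
        setattr(cpu, name, value)
    return previous


# Hypothetical stand-in for a preset CPU object, seeded with the two
# pre-change values the table reports (IQ 32, ROB 48).
cpu = SimpleNamespace(numIQEntries=32, numROBEntries=48)
replaced = apply_overrides(cpu, O3_OVERRIDES)
print(cpu.numIQEntries, cpu.numROBEntries, replaced)
```

Returning the replaced values keeps a record of what was changed relative to the preset, which helps when parameters have to be reported alongside results.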

Gem5 | statistics (atomic)
sim_seconds 0.000620 # Number of seconds simulated
host_inst_rate 714344 # Simulator instruction rate (inst/s)
host_seconds 1.73 # Real time elapsed on the host
sim_insts 1239041 # Number of instructions simulated
system.mem_ctrls.bytes_read::cpu.data 71168 # Number of bytes read from this memory
system.cpu.vector_ext_num_insts 1055097 # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts 768768 # Number of vector memory instructions executed
system.cpu.Branches 30653 # Number of branches fetched
system.cpu.op_class::IntAlu 180739 14.56% 14.56% # Class of executed instruction
system.cpu.op_class::MemRead 1978 0.16% 14.76%
system.cpu.op_class::MemWrite 2000 0.16% 14.92%
system.cpu.op_class::VectorExtVFp 256768 20.69% 35.71%
system.cpu.op_class::VectorExtMread 512000 41.26% 79.31%
system.cpu.op_class::VectorExtMwrite 256768 20.69% 100.00%
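The instruction mix mentioned for atomic mode can be derived from these counters. The following is a minimal sketch, assuming the statistics are available as plain text with one "name value" pair per line. parse_stats and instruction_mix are hypothetical helper names, not part of gem5.

```python
# A few counters copied from the atomic-mode statistics above.
ATOMIC_STATS = """\
sim_insts 1239041
# Number of instructions simulated
system.cpu.vector_ext_num_insts 1055097
# Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts 768768
# Number of vector memory instructions executed
"""


def parse_stats(text):
    """Map each stat name to its first numeric value, ignoring '#' comments."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line or comment-only line
        try:
            stats[fields[0]] = float(fields[1])
        except ValueError:
            continue  # value column is not numeric
    return stats


def instruction_mix(stats):
    """Return vector and vector-memory instruction counts as fractions of the total."""
    total = stats["sim_insts"]
    return {
        "vector_ratio": stats["system.cpu.vector_ext_num_insts"] / total,
        "vector_mem_ratio": stats["system.cpu.vector_ext_num_mem_insts"] / total,
    }


mix = instruction_mix(parse_stats(ATOMIC_STATS))
for name, value in mix.items():
    print(f"{name}: {value:.1%}")
```

For this run, roughly 85% of the executed instructions are vector instructions and roughly 62% are vector memory instructions.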

Gem5 | statistics (o3)
sim_seconds 0.000319 # Number of seconds simulated
host_inst_rate 200521 # Simulator instruction rate (inst/s)
host_seconds 6.18 # Real time elapsed on the host
sim_insts 1239028 # Number of instructions simulated
system.mem_ctrls.bw_total::total 665233479 # Total bandwidth to/from this memory (bytes/s)
system.cpu.rename.ROBFullEvents 12 # Number of times rename has blocked due to ROB full
system.cpu.rename.IQFullEvents 1 # Number of times rename has blocked due to IQ full
system.cpu.rename.LQFullEvents 5979 # Number of times rename has blocked due to LQ full
system.cpu.rename.SQFullEvents 180574 # Number of times rename has blocked due to SQ full
system.cpu.rename.FullRegisterEvents 79 # Number of times there has been no free registers
system.cpu.ipc 1.944260 # IPC: Instructions Per Cycle
system.cpu.dcache.ReadReq_miss_rate::total 0.000204 # miss rate for ReadReq accesses
system.cpu.dcache.WriteReq_miss_rate::total 0.002887 # miss rate for WriteReq accesses
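As a sanity check, the reported IPC can be cross-checked against sim_seconds and sim_insts. This sketch assumes the run used the 2.0 GHz clock from the architecture parameters slide. The function names are hypothetical.

```python
CLOCK_HZ = 2.0e9         # assumed: clock frequency from the architecture parameters
SIM_SECONDS = 0.000319   # sim_seconds (o3)
SIM_INSTS = 1239028      # sim_insts (o3)
REPORTED_IPC = 1.944260  # system.cpu.ipc


def estimated_cycles(sim_seconds, clock_hz):
    """Simulated cycles implied by the simulated time and the clock frequency."""
    return sim_seconds * clock_hz


def estimated_ipc(sim_insts, sim_seconds, clock_hz):
    """Instructions per cycle implied by the instruction count and simulated time."""
    return sim_insts / estimated_cycles(sim_seconds, clock_hz)


cycles = estimated_cycles(SIM_SECONDS, CLOCK_HZ)
ipc = estimated_ipc(SIM_INSTS, SIM_SECONDS, CLOCK_HZ)
print(f"estimated cycles: {cycles:,.0f}")
print(f"estimated IPC: {ipc:.3f} (reported: {REPORTED_IPC})")
```

The estimate differs slightly from the reported value because sim_seconds is printed with only three significant digits.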

Discussion
 Gem5 o3 can simulate a program precisely, but
 it takes a long time. In the example on the previous slide, 300 us of
execution takes 6 seconds to simulate, i.e. about 20,000 times longer.
 Multithreaded programs can be simulated, but this takes several
times as long as single-core simulation.
 -> Simulating a whole application program is impractical,
so we must extract kernels from the application. Tool chains
are required.
 Gem5 o3 allows pipeline parameters to be set flexibly, but
 in other words, we must specify these parameters for the target
processor, and processor vendors will not disclose the
parameters.
 Which parameters were used is a big problem, especially when
comparing performance with others.
 -> We want base parameters for HPC performance
comparison that anyone can share.
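The slowdown figure quoted above follows directly from the o3 statistics. This sketch recomputes it and uses it to estimate host time for a longer run. The 1 ms kernel is a hypothetical example, and the function names are illustrative only.

```python
def slowdown(host_seconds, sim_seconds):
    """Ratio of host wall-clock time to simulated target time."""
    return host_seconds / sim_seconds


def host_time_for(target_seconds, factor):
    """Estimated host time to simulate target_seconds at a given slowdown."""
    return target_seconds * factor


# host_seconds and sim_seconds from the o3 statistics slide.
factor = slowdown(6.18, 0.000319)

# Hypothetical example: a kernel that runs for 1 ms on the target.
kernel_host_seconds = host_time_for(1e-3, factor)

print(f"slowdown: {factor:,.0f}x")
print(f"1 ms of target time: {kernel_host_seconds:.1f} s on the host")
```

The estimate assumes the slowdown stays constant. In practice it depends on the instruction mix and memory behaviour of the kernel being simulated.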
