Gem5 simulator
RIKEN AICS
(Advanced Institute for Computational Science)
2017/12/13
Y. Kodama
ARM HPC workshop 2017
Gem5 simulator
■ Processor simulator
  ■ Supports multiple ISAs: Alpha, SPARC, x86, ARM
  ■ CPU models
    • Atomic: instruction-level simulation
    • O3: out-of-order pipeline simulation
      • Can estimate execution cycles
■ Development of "gem5-sve"
  ■ The atomic mode for SVE was developed by ARM.
  ■ Gem5 with SVE support (atomic and O3) will be uploaded to the mainline by ARM soon.
  ■ RIKEN also independently developed an O3 mode for SVE, based on ARM's atomic model of SVE.
http://gem5.org
Gem5 | CPU model
• Atomic: instruction-level simulation
  ○ Number of dynamically executed instructions
  ○ Instruction mix (ratio of arithmetic to memory instructions, vectorization ratio, etc.)
  × Execution cycles
  × Cache hit ratio (some instructions are split into micro-operations)
  ○ Simulation speed is several million instructions per second
• O3: out-of-order pipeline simulation
  ○ Execution cycles
  ○ Cache hit ratio, L1/L2/memory bandwidth and latency
  × Simulation speed is less than 1/10 of atomic mode
Gem5 | O3 pipeline
■ Based on the Alpha 21264
■ 7-stage pipeline: Fetch, Decode, Rename, Issue, Execute, Write Back, Commit
■ Parameter file
  ■ Several parameters can be specified, as shown on the next slide.
  ■ These are based on O3_ARM_v7a.py, a preset parameter file in gem5.
  ■ Instruction latencies for SVE were added with reference to those of NEON.
Gem5 | architecture parameters
■ Based on O3_ARM_v7a.py, a preset parameter file in gem5.
Hardware parameters
  Clock frequency: 2.0 GHz
  Number of cores: 1
  L1 D-cache, I-cache size: 32 kB
  L2 cache size: 2 MB
  Number of integer pipelines: 2
  Load/store units: 1/1
  Number of floating-point pipelines: 2
  Fetch width: 3
OoO resource parameters
  IQ (reservation station): 64 (←32)
  ROB (re-order buffer): 64 (←48)
  LQ (load queue): 16
  SQ (store queue): 16
  Physical vector registers: 96 (new)
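
Since gem5 configuration files are written in Python, the out-of-order resource parameters listed here could be captured as a small set of overrides. The following is a minimal sketch, not the authors' actual parameter file: the attribute names (`fetchWidth`, `numIQEntries`, `numROBEntries`, `LQEntries`, `SQEntries`, `numPhysVecRegs`) are assumed to correspond to gem5's O3 CPU parameters, the helper `apply_overrides` is hypothetical, and a `SimpleNamespace` stands in for the CPU object so the snippet runs without gem5 installed.

```python
from types import SimpleNamespace

# Values are taken from the slide. The attribute names are an assumption
# (gem5 O3 CPU parameter names), not the authors' configuration.
O3_OVERRIDES = {
    "fetchWidth": 3,
    "numIQEntries": 64,    # slide: 64 (←32)
    "numROBEntries": 64,   # slide: 64 (←48)
    "LQEntries": 16,
    "SQEntries": 16,
    "numPhysVecRegs": 96,  # slide: 96 (new)
}


def apply_overrides(cpu, overrides):
    """Set each override as an attribute on the CPU object and return it."""
    for name, value in overrides.items():
        setattr(cpu, name, value)
    return cpu


# Stand-in for a gem5 CPU object (hypothetical), so the sketch runs without gem5.
cpu = apply_overrides(SimpleNamespace(), O3_OVERRIDES)
print(cpu.numIQEntries, cpu.numROBEntries, cpu.numPhysVecRegs)
```

Keeping the overrides in one dictionary makes it easy to publish the exact values used alongside any reported results.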
Gem5 | statistics (atomic)
sim_seconds 0.000620 # Number of seconds simulated
host_inst_rate 714344 # Simulator instruction rate (inst/s)
host_seconds 1.73 # Real time elapsed on the host
sim_insts 1239041 # Number of instructions simulated
system.mem_ctrls.bytes_read::cpu.data 71168 # Number of bytes read from this memory
system.cpu.vector_ext_num_insts 1055097 # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts 768768 # Number of vector memory instructions executed
system.cpu.Branches 30653 # Number of branches fetched
system.cpu.op_class::IntAlu 180739 14.56% 14.56% # Class of executed instruction
system.cpu.op_class::MemRead 1978 0.16% 14.76% # Class of executed instruction
system.cpu.op_class::MemWrite 2000 0.16% 14.92% # Class of executed instruction
system.cpu.op_class::VectorExtVFp 256768 20.69% 35.71% # Class of executed instruction
system.cpu.op_class::VectorExtMread 512000 41.26% 79.31% # Class of executed instruction
system.cpu.op_class::VectorExtMwrite 256768 20.69% 100.00% # Class of executed instruction
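
The instruction mix can be derived directly from these atomic-mode counters. The sketch below is illustrative only: `parse_stats` is a hypothetical helper that reads `name value` pairs, and the ratios are computed against `sim_insts`, which is an assumption (the percentages printed by gem5 for `op_class` appear to use a slightly different denominator).

```python
# A subset of the atomic-mode statistics shown on this slide.
ATOMIC_STATS = """\
sim_insts 1239041 # Number of instructions simulated
system.cpu.vector_ext_num_insts 1055097 # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts 768768 # Number of vector memory instructions executed
system.cpu.op_class::VectorExtMread 512000 41.26% 79.31% # Class of executed instruction
system.cpu.op_class::VectorExtMwrite 256768 20.69% 100.00% # Class of executed instruction
"""


def parse_stats(text):
    """Map each stat name to its first numeric value, ignoring '#' comments."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            stats[fields[0]] = float(fields[1])
    return stats


stats = parse_stats(ATOMIC_STATS)
total = stats["sim_insts"]

# Fraction of all simulated instructions that are vector instructions.
vector_ratio = stats["system.cpu.vector_ext_num_insts"] / total
# Fraction of all simulated instructions that are vector loads or stores.
vector_mem_ratio = stats["system.cpu.vector_ext_num_mem_insts"] / total
# Vector loads per vector store.
loads_per_store = (stats["system.cpu.op_class::VectorExtMread"]
                   / stats["system.cpu.op_class::VectorExtMwrite"])

print(f"vectorization ratio: {vector_ratio:.1%}")
print(f"vector memory ratio: {vector_mem_ratio:.1%}")
print(f"vector loads per store: {loads_per_store:.2f}")
```

For this run, roughly 85% of the instructions are vector instructions and about 62% are vector memory accesses, with close to two vector loads per vector store.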
Gem5 | statistics (o3)
sim_seconds 0.000319 # Number of seconds simulated
host_inst_rate 200521 # Simulator instruction rate (inst/s)
host_seconds 6.18 # Real time elapsed on the host
sim_insts 1239028 # Number of instructions simulated
system.mem_ctrls.bw_total::total 665233479 # Total bandwidth to/from this memory (bytes/s)
system.cpu.rename.ROBFullEvents 12 # Number of times rename has blocked due to ROB full
system.cpu.rename.IQFullEvents 1 # Number of times rename has blocked due to IQ full
system.cpu.rename.LQFullEvents 5979 # Number of times rename has blocked due to LQ full
system.cpu.rename.SQFullEvents 180574 # Number of times rename has blocked due to SQ full
system.cpu.rename.FullRegisterEvents 79 # Number of times there has been no free registers
system.cpu.ipc 1.944260 # IPC: Instructions Per Cycle
system.cpu.dcache.ReadReq_miss_rate::total 0.000204 # miss rate for ReadReq accesses
system.cpu.dcache.WriteReq_miss_rate::total 0.002887 # miss rate for WriteReq accesses
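
As a sanity check, the O3 statistics can be cross-checked against each other. This is a rough sketch that assumes the 2.0 GHz clock from the architecture parameters slide applies to this run; the variable names are chosen here for illustration. Because `sim_seconds` is printed with only three significant digits, the recomputed IPC agrees with the reported value only approximately.

```python
CLOCK_HZ = 2.0e9          # assumed: clock frequency from the parameters slide

# Values copied from the O3 statistics on this slide.
sim_seconds = 0.000319
sim_insts = 1239028
reported_ipc = 1.944260
bw_total = 665233479      # bytes/s to/from memory

# Simulated cycles implied by the simulated time and the clock.
cycles = sim_seconds * CLOCK_HZ
# IPC recomputed from instruction count and cycles.
ipc = sim_insts / cycles
# Average memory traffic per cycle, and total bytes moved during the run.
bytes_per_cycle = bw_total / CLOCK_HZ
bytes_moved = bw_total * sim_seconds

print(f"cycles          ~ {cycles:.0f}")
print(f"recomputed IPC  ~ {ipc:.3f} (reported {reported_ipc:.3f})")
print(f"memory traffic  ~ {bytes_per_cycle:.2f} bytes/cycle, "
      f"{bytes_moved / 1024:.0f} KiB in total")
```

The SQ-full count (180574) is large relative to the implied cycle count, which suggests that the 16-entry store queue is the main rename bottleneck for this kernel.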
Discussion
■ Gem5 O3 can simulate a program precisely, but
  ■ it takes a long time. In the example on the previous slide, about 300 us of execution takes about 6 seconds to simulate, i.e. roughly 20,000 times longer.
  ■ Multithreaded programs can be simulated, but this takes several times longer than a single-core simulation.
  -> Simulating a whole application program is impossible in practice, so we must extract kernels from the application. Tool chains for this are required.
■ Gem5 O3 allows pipeline parameters to be set flexibly, but
  ■ in other words, we must specify these parameters for the target processor, and processor vendors will not disclose them.
  ■ Which parameters were used is a big problem, especially when comparing performance with others.
  -> We want base parameters for HPC performance comparison that anyone can share.
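
The slowdown figure quoted in the discussion follows from the `host_seconds` and `sim_seconds` values on the statistics slides. The short calculation below is only a back-of-the-envelope sketch; the function name `slowdown` is chosen here for illustration, and the extrapolation to one second of target time assumes the slowdown stays constant, which may not hold for other workloads.

```python
def slowdown(host_seconds, sim_seconds):
    """Host time needed per unit of simulated target time."""
    return host_seconds / sim_seconds


# Values from the atomic and O3 statistics slides.
atomic = slowdown(host_seconds=1.73, sim_seconds=0.000620)
o3 = slowdown(host_seconds=6.18, sim_seconds=0.000319)

# Host time to simulate one second of target execution in O3 mode,
# assuming the same slowdown (an extrapolation, not a measurement).
o3_hours_per_target_second = o3 / 3600

print(f"atomic slowdown ~ {atomic:,.0f}x")
print(f"O3 slowdown     ~ {o3:,.0f}x")
print(f"O3: ~ {o3_hours_per_target_second:.1f} host hours per simulated second")
```

At this rate, one second of target execution would need more than five hours of host time in O3 mode, which is why kernel extraction is needed.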

Performance evaluation with Arm HPC tools for SVE
