Performance evaluation with Arm HPC tools for SVE



by: Miwako Tsuji (RIKEN), Yuetsu Kodama (RIKEN)
"Co-design" is a bi-directional approach in which a system is designed to meet the demands of its applications, and the applications are in turn optimized for that system. Estimating and evaluating application performance is therefore important for co-design. In this talk, we focus on performance evaluation with Arm HPC tools for SVE.

Miwako Tsuji received her master's and PhD degrees in Information Science and Technology from Hokkaido University. From 2007 to 2013, she worked at Hokkaido University, the University of Tokyo, the University of Tsukuba, and Universite de Versailles Saint-Quentin-en-Yvelines. She has been a research scientist at the RIKEN Advanced Institute for Computational Science since 2013, and a member of the architecture development team of the Flagship 2020 project (the post-K computer project) since the project started in 2014. She is a co-author of work that received the ACM Gordon Bell Prize in 2011.

Published in: Technology


  1. Gem5 simulator. RIKEN AICS (Advanced Institute for Computational Science), 2017/12/13, Y. Kodama, ARM HPC workshop 2017
  2. Gem5 simulator
     • Processor simulator
     • Supports multiple ISAs: Alpha, SPARC, x86, ARM
     • CPU model
       • Atomic: instruction-level simulation
       • O3: out-of-order pipeline simulation
       • Can estimate execution cycles
     • Development of "gem5-sve"
       • The atomic mode for SVE was developed by ARM.
       • Gem5 with SVE support (atomic and O3) will be uploaded to the mainline by ARM soon.
       • RIKEN also independently developed an O3 mode for SVE, based on ARM's atomic model of SVE.
  3. Gem5 | CPU model (○: available or favorable, ×: unavailable or unfavorable)
     • Atomic: instruction-level simulation
       ○ Number of dynamically executed instructions
       ○ Instruction mix (ratio of arithmetic to memory instructions, ratio of vectorization, etc.)
       × Execution cycles
       × Cache hit ratio (some instructions are divided into micro-operations)
       ○ Simulation speed is several million instructions per second
     • O3: out-of-order pipeline simulation
       ○ Execution cycles
       ○ Cache hit ratio, L1/L2/memory bandwidth and latency
       × Simulation speed is less than 1/10 of atomic mode
  4. Gem5 | O3 pipeline
     • Based on the Alpha 21264
     • 7-stage pipeline: Fetch, Decode, Rename, Issue, Execute, Write Back, Commit
     • Parameter file
       • Several parameters can be specified, as listed on the next slide.
       • These are based on the preset parameters in gem5.
       • Instruction latencies for SVE were added with reference to NEON.
  5. Gem5 | architecture parameters
     • Based on the preset parameters in gem5.
     • Hardware parameters
       • Clock frequency: 2.0 GHz
       • # of cores: 1
       • L1 Dcache, Icache size: 32 kB
       • L2 cache size: 2 MB
       • # of integer pipelines: 2
       • Load/Store units: 1/1
       • # of floating-point pipelines: 2
       • Fetch width: 3
     • OoO resource parameters
       • IQ (Reservation Station): 64 (←32)
       • ROB (Re-order Buffer): 64 (←48)
       • LQ (Load Queue): 16
       • SQ (Store Queue): 16
       • Physical Vector Register: 96 (new)
  6. Gem5 | statistics (atomic)
       sim_seconds 0.000620 # Number of seconds simulated
       host_inst_rate 714344 # Simulator instruction rate (inst/s)
       host_seconds 1.73 # Real time elapsed on the host
       sim_insts 1239041 # Number of instructions simulated
       71168 # Number of bytes read from this memory
       system.cpu.vector_ext_num_insts 1055097 # Number of vector instructions executed
       system.cpu.vector_ext_num_mem_insts 768768 # Number of vector memory instructions executed
       system.cpu.Branches 30653 # Number of branches fetched
       system.cpu.op_class::IntAlu 180739 14.56% 14.56% # Class of executed instruction
       system.cpu.op_class::MemRead 1978 0.16% 14.76% # Class of executed instruction
       system.cpu.op_class::MemWrite 2000 0.16% 14.92% # Class of executed instruction
       system.cpu.op_class::VectorExtVFp 256768 20.69% 35.71% # Class of executed instruction
       system.cpu.op_class::VectorExtMread 512000 41.26% 79.31% # Class of executed instruction
       system.cpu.op_class::VectorExtMwrite 256768 20.69% 100.00% # Class of executed instruction
  7. Gem5 | statistics (o3)
       sim_seconds 0.000319 # Number of seconds simulated
       host_inst_rate 200521 # Simulator instruction rate (inst/s)
       host_seconds 6.18 # Real time elapsed on the host
       sim_insts 1239028 # Number of instructions simulated
       system.mem_ctrls.bw_total::total 665233479 # Total bandwidth to/from this memory (bytes/s)
       system.cpu.rename.ROBFullEvents 12 # Number of times rename has blocked due to ROB full
       system.cpu.rename.IQFullEvents 1 # Number of times rename has blocked due to IQ full
       system.cpu.rename.LQFullEvents 5979 # Number of times rename has blocked due to LQ full
       system.cpu.rename.SQFullEvents 180574 # Number of times rename has blocked due to SQ full
       system.cpu.rename.FullRegisterEvents 79 # Number of times there has been no free registers
       system.cpu.ipc 1.944260 # IPC: Instructions Per Cycle
       system.cpu.dcache.ReadReq_miss_rate::total 0.000204 # miss rate for ReadReq accesses
       system.cpu.dcache.WriteReq_miss_rate::total 0.002887 # miss rate for WriteReq accesses
  8. Discussion
     • Gem5 O3 can simulate a program precisely, but it takes a long time.
       • In the example on the previous slide, about 300 µs of simulated execution takes about 6 seconds on the host, i.e. roughly 20,000 times longer.
       • Multithreaded programs can be simulated, but this takes several times longer than a single-core simulation.
       • -> Simulating a whole application program is impossible, so we must extract kernels from the application. Tool chains are required.
     • Gem5 O3 allows the pipeline parameters to be set flexibly, but this means we must specify those parameters for the target processor, and processor vendors will not disclose them.
       • Which parameters were used is a big problem, especially when comparing performance with others.
       • -> We want base parameters for HPC performance comparison that anyone can share.
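
The figures quoted in the statistics and discussion slides can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the original talk: the statistic lines are copied from the atomic and o3 slides, the 2.0 GHz clock comes from the architecture-parameter slide, and the helper names `parse_stats` and `derived_metrics` are hypothetical. It assumes each statistic line has the form `name value # comment`.

```python
# Sketch only: values are copied from the slides; helper names are hypothetical.

ATOMIC_STATS = """\
sim_seconds 0.000620 # Number of seconds simulated
host_seconds 1.73 # Real time elapsed on the host
sim_insts 1239041 # Number of instructions simulated
system.cpu.vector_ext_num_insts 1055097 # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts 768768 # Number of vector memory instructions executed
"""

O3_STATS = """\
sim_seconds 0.000319 # Number of seconds simulated
host_seconds 6.18 # Real time elapsed on the host
sim_insts 1239028 # Number of instructions simulated
system.cpu.ipc 1.944260 # IPC: Instructions Per Cycle
"""

CLOCK_HZ = 2.0e9  # clock frequency from the architecture-parameter slide


def parse_stats(text):
    """Parse 'name value # comment' lines into a {name: float} dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) >= 2:
            stats[fields[0]] = float(fields[1])
    return stats


def derived_metrics(stats, clock_hz=CLOCK_HZ):
    """Derive cycle count, IPC, and host/simulated time ratio from raw stats."""
    cycles = stats["sim_seconds"] * clock_hz
    return {
        "cycles": cycles,
        "ipc": stats["sim_insts"] / cycles,
        "slowdown": stats["host_seconds"] / stats["sim_seconds"],
    }


atomic = parse_stats(ATOMIC_STATS)
o3 = parse_stats(O3_STATS)
atomic_m = derived_metrics(atomic)
o3_m = derived_metrics(o3)

# Share of all simulated instructions that are vector instructions (atomic run).
vector_ratio = atomic["system.cpu.vector_ext_num_insts"] / atomic["sim_insts"]

print(f"vector instruction ratio: {vector_ratio:.3f}")
print(f"o3 IPC from sim_seconds:  {o3_m['ipc']:.3f} (reported {o3['system.cpu.ipc']:.3f})")
print(f"o3 slowdown:              {o3_m['slowdown']:.0f}x")
print(f"atomic slowdown:          {atomic_m['slowdown']:.0f}x")
```

Under these assumptions, the IPC recomputed from `sim_seconds` and the clock comes out at about 1.94, close to the reported `system.cpu.ipc`; the small gap is presumably due to the rounding of `sim_seconds` in the printed statistics. The o3 slowdown is a little under 20,000, in line with the "20,000 times" quoted in the discussion. The same calculation on the atomic run gives an IPC of almost exactly 1, which fits the earlier point that atomic mode does not provide a meaningful estimate of execution cycles.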