Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

HCQC : HPC Compiler Quality Checker

1,007 views

Published on

By Masaki Arai, Fujitsu Laboratories Ltd.

For numerical calculation programs on supercomputers, the kernel part occupies 80% or more of the execution time in many cases. Therefore, the quality of the code generated by the compiler for these kernel parts is significant. We created a tool, which is called HCQC, to aid in the investigation of the quality of the code generated by the compiler for the kernel part. In this presentation, we report the details of HCQC and the results of evaluating the quality of GCC and LLVM when compiling the kernel part of benchmark programs using HCQC.

Masaki Arai Bio
In 1992, He joined Fujitsu Laboratories Ltd. His research interests are in the area of compiler optimizations and computer architectures. He joined Linaro as member engineer in 2017.

Email
itaru.kitayama@riken.jp

For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/

Published in: Technology
  • Be the first to comment

  • Be the first to like this

HCQC : HPC Compiler Quality Checker

  1. 1. HCQC - HPC compiler quality checker Masaki Arai masaki.arai@linaro.org LEG HPC-SIG arai.masaki@jp.fujitsu.com FUJITSU LABORATORIES LTD. 1
  2. 2. Background and Purpose • The quality of the kernel part is important in HPC applications(number crunching on supercomputers). • Make it easy to check the quality of compiler optimizations and acquire data to improve them HCQC:HPC compiler quality checker 2
  3. 3. Subject of Quality Check • Configuration file defines the subject of quality check. • Main items:  Compiler  Compiler version  Optimization flags { “DISTRIBUTION" : "OpenSUSE Tumbleweed", "ARCH" : “aarch64", "CPU" : "AMD Opteron A1100 Cortex A57", "LANGUAGE" : "C", "COMPILER" : “GCC", "COMMAND" : "/usr/bin/gcc", "VERSION" : “7.1.1", "OPT_FLAGS" : ["-O2"], "ASM_FLAGS" : ["-S“, “-fverbose-asm”], “FLAG_DB” : [[“?DEBUG_FLAG", “-g”], [“?C99_STANDARD", “-std=c99”]] } 3 Example of configuration file
  4. 4. Metrics for Quality Evaluation • HCQC has the following metrics:  op : # of mnemonics in an assembly code  kind : The kind of mnemonics in an assembly code(memory, branch, other)  regalloc : The quality of register allocation(# of spill in/out instructions)  height : The height of instruction dependence graph  ilp :Instruction level parallelism by instruction scheduler  vectorize : Vectrization/SIMDization situation  swpl : # of initiation interval by software pipelining 4 These data are basically static data at compile time.
  5. 5. Investigation Result 5 ARCH : aarch64 CPU : AMD Opteron A1100 Cortex A57 LANGUAGE : C COMPILER : ClangLLVM COMMAND : /usr/bin/clang VERSION :4.0.1 OPT_FLAGS : -O2 TEST_PROGRAM: sample KERNEL_FUNCTION_NAME : kernel DATE: 2017/11/07 ilp swpl memorybranch other spill in spill out IPC II kind mem arith other BB0 cond LBB0_11 0 3 1 3 0 0 0.5 0 0 0 BB1 0 0 0 4 0 0 0.5 0 0 0 LBB0_2 1 3 0 5 0 1 0.7 0 0 0 LBB0_3 cond LBB0_5 2 2 1 1 2 0 0.9 SLP 0 1 0 BB4 LBB0_7 LBB0_9 2 1 2 5 0 0 0.9 SLP 0 1 1 LBB0_5 cond LBB0_9 2 3 1 5 0 0 1.3 SLP 0 1 1 LBB0_7 2 3 0 5 0 3 1.7 SLP 0 1 2 LBB0_8 cond LBB0_8 3 7 1 5 0 0 2.5 5 LOOP,SLP 2 2 2 LBB0_9 cond LBB0_3 2 2 1 3 0 2 1.3 SLP 0 1 1 BB10 cond LBB0_2 1 1 1 2 0 0 0.8 0 0 0 LBB0_11 0 0 0 1 0 0 0.2 0 0 0 *SUMMARY* 25 8 39 2 6 2 7 7 vectorizekind regalloc CFG DEPTH
  6. 6. Quality Evaluation by Comparison • One investigation result has little meaning. • Typical comparison examples: GCC vs. LLVM(on AArch64) LLVM 4.0.0 vs. LLVM 5.0.0(on AArch64) LLVM with –O2 vs. LLVM with –O3(on AArch64) LLVM on AArch64 vs. LLVM on x86_64 Missing optimizations on AArch64 LLVM on AArch64 vs. ICC on x86_84 Optimization hints for SVE from AVX codes 6
  7. 7. Example of Comparison(1) • HimenoBMT-dynamic(regalloc)  GCC is better than Clang/LLVM. 7 CFG DEPTH spill in spill out CFG DEPTH spill in spill out jacobi cond .L18 0 0 7 jacobi cond .LBB0_32 0 0 12 0 0 5 0 0 6 .L3 cond .L16 1 1 0 .LBB0_2 cond .LBB0_31 1 1 0 1 3 2 1 0 2 .L8 cond .L28 2 1 0 .LBB0_4 cond .LBB0_11 2 0 1 2 4 6 2 0 1 .L11 cond .L7 3 0 1 .LBB0_6 cond .LBB0_10 4 0 0 3 10 0 3 13 7 .L4 cond .L4 4 0 0 .LBB0_8 cond .LBB0_8 4 2 0 .L7 cond .L11 3 2 0 cond .LBB0_6 3 3 0 .L5 cond .L8 2 5 1 goto .LBB0_11 2 0 0 1 5 0 .LBB0_10 cond .LBB0_6 3 0 0 .L9 cond .L13 2 0 0 .LBB0_11 cond .LBB0_4 2 3 2 2 0 0 cond .LBB0_30 1 1 0 .L17 cond .L15 3 0 0 1 3 0 3 0 0 .LBB0_14 cond .LBB0_29 2 0 0 .L12 cond .L12 4 0 0 goto .LBB0_19 2 0 0 .L15 cond .L17 3 0 0 .LBB0_16 3 0 0 .L13 cond .L9 2 0 0 .LBB0_17 cond .LBB0_17 4 0 0 .L16 cond .L3 1 2 1 cond .LBB0_28 3 0 0 end 0 0 0 goto .LBB0_26 3 0 0 .L28 goto .L5 2 1 2 .LBB0_19 cond .LBB0_28 3 0 0 .L18 end 0 0 0 cond .LBB0_22 3 1 0 *SUMMARY* - 34 25 goto .LBB0_26 3 0 0 .LBB0_22 cond .LBB0_25 3 0 0 cond .LBB0_16 3 0 0 cond .LBB0_16 3 0 0 .LBB0_25 3 0 0 .LBB0_26 3 0 0 .LBB0_27 cond .LBB0_27 4 0 0 .LBB0_28 cond .LBB0_19 3 0 0 .LBB0_29 cond .LBB0_14 2 1 0 goto .LBB0_31 1 0 0 .LBB0_30 1 1 0 .LBB0_31 cond .LBB0_2 1 0 0 goto .LBB0_33 0 0 0 .LBB0_32 0 0 0 .LBB0_33 end 0 8 0 *SUMMARY* - 37 31 LLVMGCC LLVMGCC
  8. 8. Example of Comparison(2) • mVMC-mini-calculateNewPfMTwo_child(op) – GCC generates codes with better addressing modes. 8 CFG DEPTH lsl calculateNewPfMTwo_child cond .LBB0_3 0 0 0 0 .LBB0_2 cond .LBB0_2 1 1 .LBB0_3 cond .LBB0_11 0 0 0 0 .LBB0_5 cond .LBB0_5 1 0 cond .LBB0_12 0 0 0 1 .LBB0_8 1 0 .LBB0_9 cond .LBB0_9 2 0 cond .LBB0_8 1 0 goto .LBB0_13 0 0 .LBB0_11 0 0 .LBB0_12 0 0 .LBB0_13 end 0 0 *SUMMARY* - 2 CFG DEPTH lsl calculateNewPfMTwo_child cond .L2 0 0 .L3 cond .L3 1 .L2 cond .L8 0 0 .L5 cond .L5 1 0 .L7 1 .L6 cond .L6 2 cond .L7 1 0 .L4 end 0 .L8 goto .L4 0 *SUMMARY* - LLVM GCC .LBB0_2: ldr w1, [x5, x16, lsl #2] sdiv w2, w16, w13 lsl x3, x16, #3 add x16, x16, #1 madd w1, w17, w2, w1 sbfiz x1, x1, #3, #32 ldr x2, [x18, x1] cmp x11, x16 str x2, [x10, x3] ldr x1, [x0, x1] str x1, [x9, x3] b.ne .LBB0_2 // x3 is dead LLVM
  9. 9. Example of Comparison(3) • HimenoBMT-static(height)  Clang/LLVM is not good for register usage. 9 CFG DEPTH # height .LBB1_8 cond .LBB1_8 4 71 25 CFG DEPTH # height .L19 cond .L19 4 62 17 LLVM GCC 71/25 = 2.84 62/17 = 3.647 .LBB1_8: add x17, x4, x14 mov v6.16b, v17.16b ldr s17, [x17, #8844] add x16, x3, x14 mov v7.16b, v16.16b ldr s16, [x16, #8844] add x17, x25, x14 ldr s21, [x17, x7] fmul s0, s17, s0 ldr s17, [x17, x21] add x16, x5, x14 mov v5.16b, v18.16b ldr s18, [x16, #8844] …… LLVM The metric `height’ is not committed on GitHub yet.
  10. 10. Workflow of HCQC ① Compile one test program ② Run the executable file and verify its result by comparing output and answer data ③ Generate the assembly code file ④ Make the control flow graph of the kernel part from the assembly code ⑤ Get result data using metric programs ⑥ Make the report file from data 10 % hcqc config test metric+ % hcqc-report config test metric+
  11. 11. Workflow of HCQC 11 COMMAND:/usr/bin/clang OPT_FLAGS:-O2 Configuration FileTest Program Executable File out.data resut.data in.data Assembly Code File cfg.py CFG+DEPTH hcqc-report diff check DISTRIBUTION : OpenSUSE Tumbleweed ARCH : aarch64 CPU : AMD Opteron A1100 Cortex A57 LANGUAGE : C COMPILER : ClangLLVM COMMAND : /usr/bin/clang VERSION :4.0.1 OPT_FLAGS : -O2 TEST_PROGRAM: sample KERNEL=FUNCTION=NAME : kernel DATE: 2017/11/07 ilp swpl mem branch other spill in spill out IPC II kind mem arith other BB0 cond LBB0_11 0 3 1 3 0 0 0.5 0 0 0 BB1 0 0 0 4 0 0 0.5 0 0 0 LBB0_2 1 3 0 5 0 1 0.7 0 0 0 LBB0_3 cond LBB0_5 2 2 1 1 2 0 0.9 SLP 0 1 0 BB4 LBB0_7 LBB0_9 2 1 2 5 0 0 0.9 SLP 0 1 1 LBB0_5 cond LBB0_9 2 3 1 5 0 0 1.3 SLP 0 1 1 LBB0_7 2 3 0 5 0 3 1.7 SLP 0 1 2 LBB0_8 cond LBB0_8 3 7 1 5 0 0 2.5 5 LOOP,SLP 2 2 2 LBB0_9 cond LBB0_3 2 2 1 3 0 2 1.3 SLP 0 1 1 BB10 cond LBB0_2 1 1 1 2 0 0 0.8 0 0 0 LBB0_11 0 0 0 1 0 0 0.2 0 0 0 *SUMMARY* 25 8 39 2 6 2 7 7 vectorizekind regalloc CFG DEPTH Report file(csv) Result Data Result Data Result Data Metric Program Metric Program Metric Program Test Program Info File ① ② ③ ④ ⑤ ⑥ JSON format file Debug Information?
  12. 12. Test Programs for HCQC • Generate from programs that were problematic in Fujitsu's production compilers in the past • Extract kernel parts and modify them to use under HCQC – Extract hot spots – If it is Fortran program, then convert them to C language(for comparison between GCC and Clang/LLVM) – Prepare the data to run and check those kernel parts 12
  13. 13. Test Programs for HCQC • All original benchmarks are publically available. • I/O data for HCQC is being prepared. – Some data file sizes are very large. – For different architectures or different optimization levels, error tolerance is required. 13 benchark name kernel name HimenoBMT-dynamic jacobi HimenoBMT-static jacobi hpcg-3.0 ComputeSYMGS_ref ccs-qcd bicgstab_hmc ccs-qcd clover ffb-mini CALAX3 ffb-mini FLD3X2 ffb-mini GRAD3X ffvc-mini poi_residual ffvc-mini psor2sma_core mVMC-mini calculateNewPfMTwo_child mVMC-mini updateMAllTwo_child mVMC-mini updateMAll_child ngsa-mini bwt_match_exact_alt ngsa-mini bwt_match_gap nicam-dc-mini vi_path2
  14. 14. Future Work • Add supports for SVE(if available in GCC or LLVM) • Implement metric programs: – vectorization(vectorize) – software pipelining(swpl) – instruction level parallelism(ilp) • Add features for comparing with x86_64(SVE vs. AVX) • Add tools for automatic and intelligent comparison 14
  15. 15. 15 URL https://github.com/Linaro/hcqc Thank you very much! Any comments or suggestions are welcome.

×