Benchmarking at TCWG - SFO17-312


Session ID: SFO17-312
Session Name: Benchmarking at TCWG - SFO17-312
Speakers: Maxim Kuvyrkov, Christophe Lyon
Track: Toolchain

★ Session Summary ★
This session is a walk-through of the TCWG benchmarking infrastructure and a discussion of the design choices we made while building it. It is aimed at developers interested in performance benchmarking in general, and toolchain benchmarking in particular.

★ Event Details ★
Linaro Connect San Francisco 2017 (SFO17)
25-29 September 2017
Hyatt Regency San Francisco Airport


  1. Benchmarking @TCWG
     Maxim Kuvyrkov, Christophe Lyon
  2. ENGINEERS AND DEVICES WORKING TOGETHER
     DEMO!
     Compiler battle: GCC vs LLVM
     ● AArch64: gcc-aarch64 vs llvm-aarch64
     ● ARMv7: gcc-armv7 vs llvm-armv7
     OR
  3. DEMO!
     Compiler battle: GCC vs LLVM
     ● AArch64: gcc-aarch64 vs llvm-aarch64
     ● ARMv7: gcc-armv7 vs llvm-armv7
     OR
     ISA battle: AArch64 vs AArch32
     ● GCC: gcc-aarch64 vs gcc-armv8
     ● LLVM: llvm-aarch64 vs llvm-armv7
  4. What do we benchmark?
     ● We are ToolChain WG, so we benchmark toolchains
     ● We compare toolchains, to check performance changes:
       ○ From release to release
       ○ Our patches
     ● On the same hardware
     ● Using SPEC CPU2006
     ● Planning to add SPEC CPU2017, CoreMark, EEMBC
     What we don’t do
     ● We do NOT benchmark hardware
     ● We do NOT compare absolute performance results
  5. Hardware configuration
     ● Cortex-A57: AArch64 and AArch32
     ● Cortex-A15: ARMv7
     ● Several identical boards in parallel to reduce latency
     ● 4 cores per board
       ○ 1 core running benchmark, 3 cores idle / system
       ○ 3 cores running benchmarks, 1 core idle / system
     ● Up to 3 benchmarks in parallel with low noise
     ● Up to 9 benchmarks in parallel with high noise
  6. Jenkins
     ● Provides UI to start benchmark jobs
     ● Schedules benchmark runs
     ● [Optionally] reboots benchmarking board
     ● Calls harness scripts
     ● Jenkins has no direct connection to benchmarking boards
       ○ Stub Jenkins nodes used for scheduling to avoid Jenkins clients on the boards
     ● 1-1 mapping between stub nodes and benchmarking boards
     ● Stub node connects to benchmarking board via ssh
  7. Harness
     ● Runs on the benchmarking board
     ● Installs benchmark
     ● Configures benchmark
     ● Builds benchmark [in parallel]
       ○ Remote and cross-compilation is supported
     ● Configures board
     ● Runs benchmark [in parallel]
     ● Collects results
     ● Uploads results to dev server for later analysis
  8. Board configuration
     To improve the consistency of benchmarking results, some configuration changes are made before the session is started.
     ● cpufreq governor is set to ‘performance’
     ● some system services are disabled
       ○ currently docker, ntpd
     ● various sysctls set
       ○ enable ptrace, perf etc
     ● hardware-specific workarounds enabled
       ○ TK1 has an old kernel, needs hacks to use perf
     ● script records every change it makes, and restores the original configuration after benchmarking is completed
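The record-and-restore behaviour described on the Board configuration slide can be sketched as a small journal of changes. This is an illustration only, not the TCWG script: the class name `BoardConfig` and its methods are hypothetical, and the sketch assumes every setting is a plain text file (as sysfs and procfs entries are) that the caller is allowed to write.

```python
from pathlib import Path


class BoardConfig:
    """Hypothetical helper: change file-backed settings and undo them on exit."""

    def __init__(self):
        # (path, original_text) pairs, in the order the changes were made.
        self._journal = []

    def set(self, path, value):
        """Record the current contents of `path`, then write the new value."""
        path = Path(path)
        self._journal.append((path, path.read_text()))
        path.write_text(f"{value}\n")

    def restore(self):
        """Undo every recorded change, newest first."""
        # Reverse order means a file changed twice ends up with its
        # very first value, not an intermediate one.
        while self._journal:
            path, original = self._journal.pop()
            path.write_text(original)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.restore()
        return False  # do not swallow an exception raised by the benchmark
```

Used as `with BoardConfig() as cfg: cfg.set(".../cpufreq/scaling_governor", "performance")`, the original governor is written back when the block ends, even if the benchmark run raises. Stopping services such as docker or ntpd would need a different mechanism and is not covered here.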
  9. Reducing noise
     ● Out-of-order cores imply a fairly high level of noise in the results
     ● Several iterations
       ○ Keep either mean or peak performance
     ● Reboot boards before running the benchmarks
     ● At least 1 CPU reserved for system / harness tasks
     ● 1-3 CPUs used for benchmarking
     ● Benchmarking on only 1 CPU gives better results
     ● Fix CPU frequency at 90% of the max on the benchmarking cores
     ● [If possible] Use minimal CPU frequency for idle cores
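Two of the noise-reduction points above lend themselves to a short sketch: collapsing several iterations into one number, and choosing a fixed frequency near 90% of the maximum. Both function names are hypothetical. The sketch assumes results are run times in seconds (so "peak performance" is the shortest time) and that the frequency is picked from a list of discrete steps, such as the one cpufreq exposes in `scaling_available_frequencies` on drivers that provide it. Rounding down to the nearest step is an assumption, not something the slides state.

```python
from statistics import mean


def summarize_runs(seconds, keep="mean"):
    """Collapse the run times of several iterations into one number."""
    if not seconds:
        raise ValueError("need at least one iteration")
    if keep == "mean":
        return mean(seconds)
    if keep == "peak":
        # Peak performance is the fastest run, i.e. the shortest time.
        return min(seconds)
    raise ValueError(f"unknown policy: {keep!r}")


def pick_benchmark_frequency(available_khz, fraction=0.9):
    """Return the highest available step that is not above fraction * max."""
    limit = max(available_khz) * fraction
    candidates = [f for f in available_khz if f <= limit]
    # If every step is above the limit (e.g. a single-step table),
    # fall back to the lowest step rather than failing.
    return max(candidates) if candidates else min(available_khz)
```

Keeping the peak discards slow outliers caused by system activity, while the mean reflects typical behaviour. Which one is more appropriate depends on how noisy the board is.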
  10. Results
     ● Benchmarks run under “perf record”
       ○ perf process runs on the system core
     ● files are raw results
     ● 10 perf cycle samples == 1 second
       ○ Use perf’s sample period setting, do NOT use perf’s sample frequency setting
     ● Record everything happening on the core
       ○ Benchmark executable
       ○ Dynamic libraries
       ○ Kernel
     ● files from AArch64 / ARMv7l can be analyzed using x86_64 perf.
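One possible reading of "10 perf cycle samples == 1 second" is that the sample period is chosen so that a core running at a fixed frequency produces about 10 cycle samples per second. Under that assumption, the period in cycles is the core frequency divided by 10. The helper below is a hypothetical sketch that only builds the command line. It uses `perf record` options that exist upstream (`-e` for the event, `-c` for the sample period, `-o` for the output file), and `perf.data` is perf's default output name.

```python
def perf_record_argv(benchmark_argv, cpu_hz, samples_per_second=10,
                     output="perf.data"):
    """Build a `perf record` command line that samples by a fixed period."""
    if cpu_hz <= 0 or samples_per_second <= 0:
        raise ValueError("cpu_hz and samples_per_second must be positive")
    # One sample every `period` cycles. With the core frequency fixed,
    # this yields roughly `samples_per_second` samples per second.
    period = cpu_hz // samples_per_second
    return [
        "perf", "record",
        "-e", "cycles",      # sample on the CPU cycles event
        "-c", str(period),   # fixed sample period (not -F, sample frequency)
        "-o", output,
        "--", *benchmark_argv,
    ]
```

With a fixed period, every sample stands for the same number of cycles, so sample counts can be compared directly between runs. With the frequency setting, perf adjusts the period on the fly to hit a target rate, which makes sample counts track elapsed time instead.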