Cache profiling on ARM Linux


Explains how to measure cache performance for Linux applications and the kernel using peemuperf.

Published in: Technology

  1. 1. peemuperf: Cache monitoring on ARM Linux, 2012
  2. 2. What is PMU?
      • Cortex-A series processors contain event counting hardware which can be used to profile and benchmark code, including generation of cycle and instruction count figures and to derive figures for cache misses and so forth. The performance counter block contains a cycle counter which can count processor cycles, or be configured to count every 64 cycles. There are also a number of configurable 32-bit wide event counters which can be set to count instances of events from a wide-ranging list (for example, instructions executed, or MMU TLB misses). These counters can be accessed through debug tools, or by software running on the processor, through the CP15 Performance Monitoring Unit (PMU) registers. They provide a non-invasive debug feature and do not change the behavior of the processor. CP15 also provides a number of controls for enabling and resetting the counters and to indicate overflows (there is an option to generate an interrupt on a counter overflow). The cycle counter can be enabled independently of the event counters.
      • From ARM Architecture Reference Manual
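Because the event counters are only 32 bits wide, a profiler that samples them periodically has to allow for wraparound between two reads. The following is a minimal sketch of that arithmetic, not code taken from peemuperf. It assumes the cycle counter is also read as a 32-bit value and that at most one wrap occurs between samples; the function names are hypothetical.

```python
COUNTER_BITS = 32                      # width of the PMU counters
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(start, end):
    """Ticks between two raw counter reads, tolerating one wraparound."""
    return (end - start) & COUNTER_MASK


def elapsed_cycles(start, end, divide_by_64=False):
    """Convert cycle-counter ticks to processor cycles.

    When the cycle counter is configured to count every 64 cycles,
    each tick stands for 64 processor cycles.
    """
    ticks = counter_delta(start, end)
    return ticks * 64 if divide_by_64 else ticks
```

If more than one wrap can happen within a sampling interval, this delta is ambiguous, which is one reason to keep the interval short or to watch the overflow flag.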
  3. 3. Profiling alternatives
      • Oprofile
          – Supported in mainline kernel (drivers/oprofile)
          – ARM support enabled
          – Relies on interrupts from the HW unit when event counters overflow
          – Falls back to a timer when no HW event monitors are available
      • Unfortunately, various errata in current ARM A8/A9 devices make interrupt-based monitoring unreliable
          – To be fixed in later ARM cores
      • Because of this, oprofile supports only timer-based CPU cycle measurement on the majority of ARM cores, at least up to the 3.2 kernel
  4. 4. Latest status
        Convert OMAP2/3 devices to use HWMOD for creating a PMU device. To support PMU
        on OMAP2/3 devices we only need to use MPU sub-system and so we can simply use
        the MPU HWMOD to create the PMU device. The MPU HWMOD for OMAP2/3 devices is
        currently missing the PMU interrupt and so add the PMU interrupt to the MPU
        HWMOD for these devices.

        This change also moves the PMU code out of the mach-omap2/devices.c files into
        its own pmu.c file as suggested by Kevin Hilman to de-clutter devices.c.

        Cc: Ming Lei <ming.lei at>
        Cc: Will Deacon <will.deacon at>
        Cc: Benoit Cousson <b-cousson at>
        Cc: Paul Walmsley <paul at>
        Cc: Kevin Hilman <khilman at>
        Signed-off-by: Jon Hunter <jon-hunter at>
        ---
         arch/arm/mach-omap2/Makefile                       |  1 +
         arch/arm/mach-omap2/devices.c                      | 33 -----------
         arch/arm/mach-omap2/omap_hwmod_2xxx_ipblock_data.c |  6 ++
         arch/arm/mach-omap2/omap_hwmod_3xxx_data.c         |  6 ++
         arch/arm/mach-omap2/pmu.c                          | 59 ++++++++++++++++++++
         arch/arm/plat-omap/include/plat/irqs.h             |  1 +
         6 files changed, 73 insertions(+), 33 deletions(-)
         create mode 100644 arch/arm/mach-omap2/pmu.c
  5. 5. Patch status
      • The patch set mentioned in the earlier slide is in various stages of integration into different SoC architectures
      • Beagle/OMAP35x is supported
      • AM335x is not supported as of 2012; support is expected in mainline by 2013
      • In the interim, what is the option?
  6. 6. What is the need?
      • Cache usage monitoring is key to measuring the aspects of performance that depend on external memory bandwidth
      • Current oprofile does not support this on many SoCs
  7. 7. peemuperf
      • A tool to measure overall Linux performance using the PMU HW of ARM: CPU cycles, cache misses at L1 and L2 level, stalls, NEON, etc.
      • Consists of a kernel module that does the heavy lifting and exposes all profile information to userspace via a proc entry
  8. 8. Configurable parameters
      • evdelay=500 evlist=1,68,3,4 evdebug=1
      • evdelay – Sampling interval (milliseconds)
      • evlist – Comma-separated array of event IDs (refer to section 3.2.49, c9 Event Selection Register, of the Cortex-A8 TRM)
      • evdebug – Controls debug output messages
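As an illustration of how these three parameters might be assembled when loading the module, here is a small sketch. The module file name `peemuperf.ko`, the use of `insmod`, and both helper names are assumptions made for the example; only the parameter names and sample values come from the slide.

```python
def module_args(evdelay=500, evlist=(1, 68, 3, 4), evdebug=1):
    """Build the parameter strings shown on the slide."""
    if evdelay <= 0:
        raise ValueError("evdelay is a sampling interval in ms and must be positive")
    if not evlist:
        raise ValueError("evlist needs at least one event ID")
    return [
        "evdelay=%d" % evdelay,
        "evlist=" + ",".join(str(event) for event in evlist),
        "evdebug=%d" % evdebug,
    ]


def insmod_command(module_path="peemuperf.ko", **params):
    """Return an argv list for loading the module (path is hypothetical)."""
    return ["insmod", module_path] + module_args(**params)
```

The returned list can be passed to `subprocess.run` on the target board; building an argv list instead of a shell string avoids quoting problems.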
  9. 9. Userspace access
      • Proc entry is
          – /proc/peemuperf
      • Displays in the format below
          – <COUNTER #> : <COUNTER VALUE>
          – Counter[0] : 48,
          – Counter[1] :77448,
          – Counter[2]: 13,
          – Counter[3]: 115058
          – Overflow flag: = 0, Cycle Count: = 5739253
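A userspace script can turn this proc output into numbers. The sketch below parses the sample shown on the slide. It assumes every value is printed in decimal and that the labels match the sample exactly, which the real module may not guarantee; the function names are hypothetical.

```python
import re

# Tolerates the uneven spacing around ':' seen in the sample output.
COUNTER_RE = re.compile(r"Counter\[(\d+)\]\s*:\s*(\d+)")
FIELD_RE = re.compile(r"(Overflow flag|Cycle Count)\s*:\s*=\s*(\d+)")

SAMPLE = """Counter[0] : 48,
Counter[1] :77448,
Counter[2]: 13,
Counter[3]: 115058
Overflow flag: = 0, Cycle Count: = 5739253
"""


def parse_peemuperf(text):
    """Extract counter values, overflow flag and cycle count from proc text."""
    counters = {int(idx): int(val) for idx, val in COUNTER_RE.findall(text)}
    fields = {name: int(val) for name, val in FIELD_RE.findall(text)}
    return {
        "counters": [counters[i] for i in sorted(counters)],
        "overflow": fields.get("Overflow flag"),
        "cycles": fields.get("Cycle Count"),
    }


def read_peemuperf(path="/proc/peemuperf"):
    """Read and parse the proc entry on a board with the module loaded."""
    with open(path) as proc_file:
        return parse_peemuperf(proc_file.read())
```

Calling `parse_peemuperf(SAMPLE)` exercises the parser without the hardware; `read_peemuperf()` only works where the module is loaded.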
  10. 10. A8 vs A9
      • A8 has 4 performance counters
      • A9 has 6
      • peemuperf configures itself dynamically based on a run-time query
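The slides do not say how the run-time query is done. One plausible route on ARMv7 is the N field (bits [15:11]) of the PMU control register, which reports how many event counters are implemented. The sketch below decodes that field from an already-read register value and trims an event list to fit; it illustrates the idea and is not peemuperf's actual code, and the function names are hypothetical.

```python
def event_counter_count(pmcr):
    """Decode the N field, bits [15:11], of an ARMv7 PMU control register value."""
    return (pmcr >> 11) & 0x1F


def fit_event_list(evlist, available):
    """Keep only as many event IDs as there are hardware counters."""
    return list(evlist[:available])


# Constructed register values for illustration: N=4 (as on A8), N=6 (as on A9).
a8_like = 4 << 11
a9_like = 6 << 11
requested = [1, 68, 3, 4, 5, 6]
print(fit_event_list(requested, event_counter_count(a8_like)))
print(fit_event_list(requested, event_counter_count(a9_like)))
```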
  11. 11. Default events monitored
      • 1 ==> Instruction fetch that causes a refill at the lowest level of instruction or unified cache
      • 68 ==> Any cacheable miss in the L2 cache
      • 3 ==> Data read or write operation that causes a refill at the lowest level of data or unified cache
      • 4 ==> Data read or write operation that causes a cache access at the lowest level of data or unified cache
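With the default events, events 3 and 4 together give an L1 data cache miss ratio (refills divided by accesses), and the other two can be normalised by the cycle count. The sketch below assumes that Counter[i] in the proc output corresponds to the i-th entry of evlist, which the slides do not state explicitly; the function and key names are hypothetical.

```python
DEFAULT_EVENTS = (1, 68, 3, 4)  # evlist order from the slides


def summarize(counters, cycles, evlist=DEFAULT_EVENTS):
    """Derive simple cache metrics from one sample.

    Assumes counters[i] holds the count for evlist[i].
    """
    by_event = dict(zip(evlist, counters))
    accesses = by_event.get(4, 0)   # L1 data cache accesses
    refills = by_event.get(3, 0)    # L1 data cache refills (misses)
    kcycles = cycles / 1000.0
    return {
        "l1d_miss_ratio": refills / accesses if accesses else None,
        "l1i_refills_per_kcycle": by_event.get(1, 0) / kcycles if cycles else None,
        "l2_misses_per_kcycle": by_event.get(68, 0) / kcycles if cycles else None,
    }


# Values from the sample proc output shown earlier in the deck.
metrics = summarize([48, 77448, 13, 115058], 5739253)
print(metrics)
```

Ratios are more useful than raw counts here, since the counts scale with the sampling interval set by evdelay.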
  12. 12. Source•