What is PMU?
• Cortex-A series processors contain event counting hardware which can be used to profile and benchmark code, including generation of cycle and instruction count figures and to derive figures for cache misses and so forth. The performance counter block contains a cycle counter which can count processor cycles, or be configured to count every 64 cycles. There are also a number of configurable 32-bit wide event counters which can be set to count instances of events from a wide-ranging list (for example, instructions executed, or MMU TLB misses). These counters can be accessed through debug tools, or by software running on the processor, through the CP15 Performance Monitoring Unit (PMU) registers. They provide a non-invasive debug feature and do not change the behavior of the processor. CP15 also provides a number of controls for enabling and resetting the counters and to indicate overflows (there is an option to generate an interrupt on a counter overflow). The cycle counter can be enabled independently of the event counters.
• From ARM Architecture Reference Manual
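The enable, reset, and divide-by-64 controls described above live in the ARMv7 Performance Monitor Control Register (PMCR). The following is a minimal sketch in C of how such a control word might be assembled. The bit positions follow the ARMv7-A architecture; the helper names `pmu_control_word` and `pmu_write_pmcr` are made up for illustration. The register write is guarded so that only 32-bit ARM builds emit the CP15 instruction, and it must run in privileged mode unless user access has been enabled.

```c
#include <stdint.h>

/* PMCR bit positions as defined by the ARMv7-A architecture. */
#define PMCR_E (1u << 0) /* enable all counters */
#define PMCR_P (1u << 1) /* reset all event counters */
#define PMCR_C (1u << 2) /* reset the cycle counter */
#define PMCR_D (1u << 3) /* cycle counter ticks once every 64 cycles */

/* Hypothetical helper: build a PMCR control word from the requested options. */
static uint32_t pmu_control_word(int enable, int reset_all, int div64)
{
    uint32_t v = 0;

    if (enable)
        v |= PMCR_E;
    if (reset_all)
        v |= PMCR_P | PMCR_C;
    if (div64)
        v |= PMCR_D;
    return v;
}

#if defined(__arm__)
/* Hypothetical helper: write the PMCR through CP15 (privileged code only). */
static inline void pmu_write_pmcr(uint32_t v)
{
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(v));
}
#endif
```

For example, `pmu_control_word(1, 1, 0)` sets the E, P, and C bits to enable the counters and clear them in one write. Each counter is additionally gated by the count enable set register, where bit 31 selects the cycle counter; that is what allows it to be enabled independently of the event counters. Production code would normally read the PMCR first and modify only the bits it needs.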
Profiling alternatives
• Oprofile
  – Supported in the mainline kernel (drivers/oprofile)
  – ARM support enabled
  – Relies on interrupts from the HW unit when event counters overflow
  – Timer fallback when no HW event monitors are available
• Unfortunately, various errata in current ARM Cortex-A8/A9 devices make interrupt-based monitoring unreliable
  – To be fixed in later ARM cores
• Because of this, oprofile supports only timer-based CPU cycle measurement on the majority of ARM cores, at least up to the 3.2 kernel
Latest status
• http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/103189.html

Convert OMAP2/3 devices to use HWMOD for creating a PMU device. To support PMU
on OMAP2/3 devices we only need to use MPU sub-system and so we can simply use
the MPU HWMOD to create the PMU device. The MPU HWMOD for OMAP2/3 devices is
currently missing the PMU interrupt and so add the PMU interrupt to the MPU
HWMOD for these devices.

This change also moves the PMU code out of the mach-omap2/devices.c files into
its own pmu.c file as suggested by Kevin Hilman to de-clutter devices.c.

Cc: Ming Lei <ming.lei at canonical.com>
Cc: Will Deacon <will.deacon at arm.com>
Cc: Benoit Cousson <b-cousson at ti.com>
Cc: Paul Walmsley <paul at pwsan.com>
Cc: Kevin Hilman <khilman at ti.com>
Signed-off-by: Jon Hunter <jon-hunter at ti.com>
---
 arch/arm/mach-omap2/Makefile                       |    1 +
 arch/arm/mach-omap2/devices.c                      |   33 -----------
 arch/arm/mach-omap2/omap_hwmod_2xxx_ipblock_data.c |    6 ++
 arch/arm/mach-omap2/omap_hwmod_3xxx_data.c         |    6 ++
 arch/arm/mach-omap2/pmu.c                          |   59 ++++++++++++++++++++
 arch/arm/plat-omap/include/plat/irqs.h             |    1 +
 6 files changed, 73 insertions(+), 33 deletions(-)
 create mode 100644 arch/arm/mach-omap2/pmu.c
Patch status
• The patch set mentioned on the previous slide is at various stages of integration for different SoC architectures
• Beagle/OMAP35x is supported
• AM335x is not supported as of 2012; support is expected in mainline by 2013
• In the interim, what are the options?
What is the need?
• Cache usage monitoring is key to measuring the aspects of performance that depend on external memory bandwidth
• Current oprofile does not support this on various SoCs
peemuperf
• A tool to measure overall Linux performance using the ARM PMU hardware: ARM CPU cycles, cache misses at L1 and L2 level, stalls, NEON, and so on
• Consists of a kernel module that does the heavy lifting and exposes all profile information to userspace via a proc entry
A8 vs A9
• Cortex-A8 has 4 performance counters
• Cortex-A9 has 6
• peemuperf configures itself dynamically based on a run-time query
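One way such a run-time query could work on ARMv7 is to read the N field, bits [15:11] of the PMCR, which reports how many event counters are implemented. The sketch below illustrates that approach in C; it is an assumption about the technique, not peemuperf's actual code, and `pmu_num_event_counters` and `pmu_read_pmcr` are hypothetical names.

```c
#include <stdint.h>

/* The N field of the ARMv7 PMCR occupies bits [15:11]. */
#define PMCR_N_SHIFT 11
#define PMCR_N_MASK  0x1fu

/* Hypothetical helper: extract the number of event counters from a PMCR value. */
static unsigned pmu_num_event_counters(uint32_t pmcr)
{
    return (pmcr >> PMCR_N_SHIFT) & PMCR_N_MASK;
}

#if defined(__arm__)
/* Hypothetical helper: read the PMCR through CP15 (privileged code only). */
static inline uint32_t pmu_read_pmcr(void)
{
    uint32_t v;

    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));
    return v;
}
#endif
```

On a part with four event counters, `pmu_num_event_counters(pmu_read_pmcr())` would be expected to return 4, and a driver could size its event table from that value instead of hard-coding the core type.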
Default events monitored
• 1 ==> Instruction fetch that causes a refill at the lowest level of instruction or unified cache
• 68 ==> Any cacheable miss in the L2 cache
• 3 ==> Data read or write operation that causes a refill at the lowest level of data or unified cache
• 4 ==> Data read or write operation that causes a cache access at the lowest level of data or unified cache
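Events 3 and 4 together give an L1 data cache miss ratio: refills divided by accesses over the same interval. The following C sketch shows that arithmetic, assuming the 32-bit counters are sampled at the start and end of the interval and wrap at most once in between. The function names are illustrative only and do not come from peemuperf.

```c
#include <stdint.h>

/* Difference between two samples of a 32-bit counter. Unsigned
 * arithmetic gives the right answer across a single wrap-around. */
static uint32_t counter_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Refills (event 3) divided by accesses (event 4).
 * Returns 0.0 when no accesses were counted. */
static double l1d_miss_ratio(uint32_t refills, uint32_t accesses)
{
    if (accesses == 0)
        return 0.0;
    return (double)refills / (double)accesses;
}
```

A caller would compute `counter_delta` for each event over the measured interval and pass the two deltas to `l1d_miss_ratio`. If a counter can wrap more than once between samples, the overflow flag or interrupt has to be used as well.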