LCA14: LCA14-401: BoF - Networking - Debug/tracing/counter

Resource: LCA14
Name: LCA14-401: BoF - Networking - Debug/tracing/counter
Date: 06-03-2014
Speaker: Santosh Shukla, Mike Holmes

Presentation Transcript

  • Thu 6 March, 10:05am, Santosh Shukla, Mike Holmes LCA14-401: BoF, Networking - Debug/tracing/counter
  • Fast access to perf counters
      • Introduction
      • Use case
      • libperf and in-kernel perf API
      • Test analysis: direct user access vs syscall-based perf counter access
      • Design issues and next step
      • QA
  • Introduction
      • Access to perf counters is not fast enough for the embedded networking space.
      • We think we need:
          • The fastest possible access from user space (see the use case).
          • Shared access when read-only (no locking overhead).
          • A stable API (based on libperf).
          • An easy way to access SoC-specific counters.
  • Use Case
      • In the fast path (which could be ODP in future), there will be a method to analyze an ODP crash dump based on statistics.
      • Because the crash dump statistics are based on the perf hw counters, counter access must have very low overhead and should give close to CPU or bus clock cycle precision.
      • For example, if the per-packet budget in the fast path is 1000 CPU cycles, a measurement cannot take 3000 CPU cycles, as it does today with syscall-based perf counter access in Linux.
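To make the budget argument concrete, here is a small C sketch of the arithmetic. The 1000-cycle budget and 3000-cycle read cost are the figures from the slide; the helper names `overhead_percent` and `fits_budget` are illustrative and not part of perf or libperf.

```c
#include <stdint.h>

/* Hypothetical helper: cost of one counter read as a percentage of the
 * per-packet cycle budget (integer arithmetic, rounded down).
 * Precondition: budget_cycles must be non-zero. */
static inline uint32_t overhead_percent(uint32_t read_cost_cycles,
                                        uint32_t budget_cycles)
{
    return (uint32_t)(((uint64_t)read_cost_cycles * 100u) / budget_cycles);
}

/* Hypothetical helper: 1 if a single read leaves any cycles for the
 * actual packet processing, 0 otherwise. */
static inline int fits_budget(uint32_t read_cost_cycles,
                              uint32_t budget_cycles)
{
    return read_cost_cycles < budget_cycles;
}
```

With the slide's numbers, `overhead_percent(3000, 1000)` evaluates to 300: one syscall-based read costs three times the whole per-packet budget, so `fits_budget(3000, 1000)` is 0.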
  • Perf
    Perf provides a syscall that opens a perf file descriptor, so that a user space application can access the counters and attach events to them.
    perf_event_open (originally named sys_perf_counter_open), the syscall, takes:
      - event type attributes for monitoring/sampling
      - target pid
      - target cpu
      - group_fd
      - flags
    Event types:
      - PERF_TYPE_HARDWARE
      - PERF_TYPE_SOFTWARE
      - PERF_TYPE_TRACEPOINT
      - PERF_TYPE_HW_CACHE
      - PERF_TYPE_RAW (raw, implementation-specific hardware event codes)
  • Perf continued
    attr.sample_type (bitmask):
      - PERF_SAMPLE_IP
      - PERF_SAMPLE_TID
      - PERF_SAMPLE_TIME
      - PERF_SAMPLE_CALLCHAIN
      - PERF_SAMPLE_ID
      - PERF_SAMPLE_CPU
    attr configuration bitfield:
      - disabled: off by default (the counter starts disabled)
      - inherit: children inherit it
      - exclude_{user,kernel,hv,idle}: don't count these
      - mmap: include mmap data
      - comm: include comm data
      - inherit_stat: per-task counts
      - enable_on_exec: the next exec enables it
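As a rough illustration of how the sample_type bitmask and the configuration bits listed above are set from C, the sketch below fills in a `struct perf_event_attr` from `<linux/perf_event.h>`. The helper name `setup_cycle_sampling_attr` and the sample period are assumptions made for this example; they do not come from the slides.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Illustrative helper (not from the slides): describe a CPU-cycle
 * sampling event that starts disabled and is enabled on the next exec. */
static inline void setup_cycle_sampling_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);            /* lets the kernel check the ABI size */
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_period = 100000;          /* arbitrary example value */

    /* sample_type is a bitmask: OR together the fields wanted per sample. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                        PERF_SAMPLE_TIME | PERF_SAMPLE_CPU;

    /* The configuration bits are one-bit fields of the same struct. */
    attr->disabled = 1;        /* counter starts off */
    attr->inherit = 1;         /* children inherit it */
    attr->exclude_kernel = 1;  /* don't count kernel mode */
    attr->exclude_hv = 1;      /* don't count hypervisor mode */
    attr->enable_on_exec = 1;  /* next exec enables it */
}
```

The filled-in struct would then be passed as the first argument of the perf_event_open syscall.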
  • Libperf
    Libperf creates a set of file descriptors for a group of perf events by calling the perf_event_open syscall, and performs enable/disable/read operations on them. The current API has:
      - libperf_initialize: sets up a set of fds for profiling code to read from
      - libperf_finalize: reads from the fds, prints, and closes all perf fds
      - libperf_readcounter: reads a perf counter
      - libperf_enablecounter: enables a perf counter
      - libperf_disablecounter: disables a perf counter
      - libperf_close: closes an fd
  • Design issues, next step investigation
      • Raw proposal:
          • Mmapping hw counters to user space could be a way forward for fast access, removing the overhead of the current kernel implementation.
          • Adding a scalable framework in user space (this could be libperf) to read CPU-specific counters, counters on offload blocks, and other variants of counters.
      • Current mmap-based perf support in the kernel:
          • In-kernel perf supports an mmap-based persistent ring-buffer implementation for user space.
          • This implementation is limited in performance: the hw counter is mappable and stored in the ring buffer, but user space access carries a lot of synchronisation overhead, i.e. an rmb for every perf counter read, locking, and an async wake-up event for user space to read statistics.
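The per-read synchronisation cost mentioned above can be pictured with a generic sequence-counter ("seqlock"-style) read written with C11 atomics. This is a sketch of the general pattern only, not the kernel's perf mmap code: the struct `shared_counter_page` and both functions are hypothetical, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared page: one writer publishes a counter value,
 * any number of readers consume it without taking a lock. */
struct shared_counter_page {
    _Atomic uint32_t seq;    /* odd while an update is in progress */
    _Atomic uint64_t value;
};

/* Writer side (single writer assumed). */
static inline void write_shared_counter(struct shared_counter_page *p,
                                        uint64_t v)
{
    uint32_t s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->value, v, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader side: every read pays for ordered loads plus a possible retry,
 * which is the kind of overhead the slide refers to. */
static inline uint64_t read_shared_counter(struct shared_counter_page *p)
{
    uint32_t before, after;
    uint64_t v;

    do {
        before = atomic_load_explicit(&p->seq, memory_order_acquire);
        v = atomic_load_explicit(&p->value, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while (before != after || (before & 1u));

    return v;
}
```

A reader never blocks the writer, but each call still executes read barriers and may loop, so the cost per read is not just a single load.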
  • Next step continued
      • But:
          • The current kernel mappable events are exclusive and not shareable, and they won't fall back to sysfs perf event mode. Therefore the approach is not scalable.
          • The current kernel counter overhead is still significant, so the current implementation won't meet the 1000-cycle requirement of the fast path model, for example the ODP crash dump statistics requirement mentioned in a previous slide [4].
  • Next step continued
      • Effort to investigate and evaluate these issues:
          • Focus on an exclusive fast access approach.
          • HW counter pinned to a specific core and a specific task.
          • Avoid sync primitives in kernel space while reading the hw counter; let the user space application handle this job.
          • Teach libperf to handle sync primitives and decide on a locking policy.
          • The design should be flexible enough to fall back to syscall-based perf mode.
          • Respect SMP policy as much as possible.
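One way to picture the "fall back to syscall-based perf mode" requirement is a table of backends that is probed in order of preference. Everything in this sketch is hypothetical (the struct, `select_backend`, and the stub probe/read functions); in a real design the stubs would be replaced by a direct PMU reader and a perf syscall reader.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend description: probe() returns 0 when the backend
 * is usable on this system, read() returns 0 and stores the count. */
struct counter_backend {
    const char *name;
    int (*probe)(void);
    int (*read)(uint64_t *out);
};

/* Pick the first usable backend; list the fast path first and the
 * syscall-based path last so that it acts as the fallback. */
static inline const struct counter_backend *
select_backend(const struct counter_backend *candidates, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (candidates[i].probe() == 0)
            return &candidates[i];
    }
    return NULL;
}

/* Placeholder stubs standing in for real probes and readers. */
static inline int probe_unavailable(void) { return -1; }
static inline int probe_available(void)   { return 0; }
static inline int read_stub(uint64_t *out) { *out = 0; return 0; }
```

With a candidate list of `{"direct", probe_unavailable, read_stub}` followed by `{"syscall", probe_available, read_stub}`, `select_backend` returns the second entry, which is the fallback behaviour the slide asks for.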
  • [Diagram: userspace fast access flow control, showing the application, the SoC, the ARM processor core, and event extensions]
  • Benchmarking current & proposed access
    Custom user space application detail:
      • Ran a test application on an Arndale board to demonstrate the delta between user space and kernel space perf counter access. The result shows close to a 9x improvement.
      • A tiny test kernel module enables the perf counter for user mode and disables counter overflow interrupts:
            /* enable: write 1 to PMUSERENR (c9, c14, 0) to allow user-mode access */
            asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));
            /* disable: write 0x8000000f to PMINTENCLR (c9, c14, 2) to turn off
               counter overflow interrupts */
            asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));
      • The user app uses an x86-style timer API (named after rdtsc) to read the cycle counter:
            static inline uint32_t rdtsc32(void)
            {
            #if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
                uint32_t r = 0;
                /* read the cycle counter register PMCCNTR (c9, c13, 0) */
                asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r));
                return r;
            #else
            #error Unsupported architecture/compiler!
            #endif
            }
  • Benchmarking cont.
    Libperf application using the perf syscall:
      • Creates a perf event fd using the perf_event_open syscall.
      • Reads the perf counter event from the file descriptor.
            static int fddev;

            static void init(void)
            {
                static struct perf_event_attr attr;  /* static, so zero-initialized */
                attr.size = sizeof(attr);
                attr.type = PERF_TYPE_HARDWARE;
                attr.config = PERF_COUNT_HW_CPU_CYCLES;
                /* pid = 0 (calling thread), cpu = -1 (any cpu), group_fd = -1, flags = 0 */
                fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            }
      • Both applications run in a tight loop for some duration, and their deltas are recorded for comparison.
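The slide shows only the open step of the syscall-based path. The following is a minimal sketch of open plus read, assuming a Linux system with `<linux/perf_event.h>`. The function names `perf_event_open_cycles` and `perf_read_count` are made up for this example, and the open can legitimately fail (returning -1) on systems without a PMU or with restrictive perf permissions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for the syscall(2) declaration */
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a CPU-cycle counter for the calling thread on any CPU.
 * glibc has no wrapper for perf_event_open, hence syscall(2).
 * Returns a file descriptor, or -1 with errno set. */
static inline int perf_event_open_cycles(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1;   /* user-mode cycles only */
    attr.exclude_hv = 1;

    /* pid = 0 (calling thread), cpu = -1 (any), group_fd = -1, flags = 0 */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* With the default read_format (0), read(2) returns one 64-bit count.
 * Returns 0 on success, -1 on failure. */
static inline int perf_read_count(int fd, uint64_t *count)
{
    ssize_t n = read(fd, count, sizeof(*count));
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

Each call to `perf_read_count` is a full `read` system call, which is the per-measurement cost that the direct user space access shown earlier avoids.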
  • Benchmarking cont.
      • Direct user space PMU access (once enabled) vs the perf-syscall-based application.
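A tight-loop comparison of the kind described can be sketched as a small harness that times any reader function. This is not the authors' benchmark: `avg_ns_per_read`, the `read_fn` type, and `dummy_read` are assumptions for illustration, and wall-clock time from `clock_gettime` is used because it is portable, whereas the slides compare cycle counts.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical reader signature: returns the current counter value. */
typedef uint64_t (*read_fn)(void);

/* Call fn() iters times in a tight loop and return the average
 * wall-clock cost of one call in nanoseconds. iters must be non-zero. */
static inline double avg_ns_per_read(read_fn fn, unsigned long iters)
{
    struct timespec t0, t1;
    volatile uint64_t sink = 0;   /* keeps the calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < iters; i++)
        sink += fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                        (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (double)iters;
}

/* Placeholder reader; a real comparison would plug in a direct PMU
 * read and a perf file descriptor read here. */
static inline uint64_t dummy_read(void) { return 1; }
```

Running the harness once per reader and dividing the two averages gives a ratio comparable in spirit to the improvement reported in the benchmark.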
  • Reference links
    [1] ARM A15 performance counter registers: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438c/BIIFDEEJ.html
    [2] LNG card: https://cards.linaro.org/browse/LNG-260
    [3] Perf on A15: https://perf.wiki.kernel.org/index.php/Tutorial
    [4] http://neocontra.blogspot.com/2013/05/user-mode-performance-counters-for.html
    [5] https://github.com/thoughtpolice/enable_arm_pmu
    [6] Libperf: https://github.com/theonewolf/libperf
    [7] http://www.linux-kongress.org/2010/slides/lk2010-perf-acme.pdf
  • QA
  • More about Linaro Connect: http://connect.linaro.org
    More about Linaro: http://www.linaro.org/about/
    More about Linaro engineering: http://www.linaro.org/engineering/
    Linaro members: www.linaro.org/members