The engineering challenges of designing for low-latency execution include tightly controlling the time it takes to detect the onset of a latency excursion and to diagnose its most likely cause. In modern X-as-a-service (XaaS) forms of distributed applications, the points at which latency is experienced by a service consumer are separated by many layers of modular abstraction from the underlying system hardware. This separation makes it difficult to pinpoint the causes of latency pushouts and to apply corrective actions in a timely manner. The classic performance methodology of profiling ‘cycles’ of work may be broadly successful in exposing elevated overall latency, but it is not very effective in determining the causes of short-duration latency surges; to determine those, it is frequently necessary to:
• trace execution
• pinpoint when a significant latency stretch-out occurs
• establish its correlation with a nearby precursor or a set of precursor events
Each of these steps can incur significant overhead; further, one has to be concerned that even modest tracing overheads risk contributing to tail latencies. Not just the detection of the onset of a latency excursion, but also the identification of why it occurs, must be completed quickly, so that if a corrective action is possible, it can be taken promptly. Similarly, if no recourse to curb the latency of a slice of computation is available at some point in time, then it is ideal that steps to minimize the impact of the exception be put into effect as early as possible.
In our talk, we present an approach that complements the very low overhead software tracing provided by KUtrace. It uses eBPF to trigger collection of additional data, at very low overhead, from the hardware performance monitoring unit (PMU), so that latency excursions within a span of execution can be examined in a timely manner. We will describe the use of PMU capabilities like precise event-based sampling (PEBS) and timed last branch records (timed LBRs) in close proximity to events of interest to extract critical clues. We will further discuss planned future work to integrate in-band network telemetry (INT) into these tracing flows.
3. Investigating Tail Latency is About Probing for the Atypical
■ Execution samples that land in the tail have been slowed down for some reason that differs from the average case; either:
● they differ in the amount (or type) of work performed, or
● they were affected by some uncommon event or interference, or
● they encountered more waiting (experienced resource starvation longer than average cases).
■ In particular, it is not generally true that tail-latency samples and the remaining execution samples exhibit comparable software, hardware, or scheduling histories.
■ Standard approaches for exposing throughput limiters do not generally expose causes of peak latency.
7. Illustrative Scenario: 1 -- Why?
[Chart: latency vs. load under the default sleep state (C9) and the minimum sleep state (C1), marking where the SLA violation occurs]
■ The default power setting results in a latency SLA violation at half the throughput.
■ Ironically: low utilization results in earlier onset of tail latency because of power-management interference (not in the application's control); a sketch of one mitigation follows below.
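A minimal sketch of the mitigation, assuming a Linux host that exposes the PM QoS interface /dev/cpu_dma_latency (the chosen bound and the pause() placeholder are illustrative, not from the talk): holding the file open with a small wakeup-latency bound keeps cores out of deep sleep states.

/* Sketch: cap CPU sleep states via the Linux PM QoS interface.
 * Writing a small wakeup-latency bound (in microseconds) to
 * /dev/cpu_dma_latency keeps cores out of deep C-states for as
 * long as the file descriptor stays open. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t max_wakeup_us = 1;   /* roughly C1-like wakeup latency */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    if (write(fd, &max_wakeup_us, sizeof(max_wakeup_us)) < 0) {
        perror("write");
        return 1;
    }
    pause();   /* the constraint holds only while fd remains open */
    return 0;
}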
8. Illustrative Scenario: 2
■ Reducing latency to a minimum frequently entails choosing between conflicting options for scheduling resources like CPU time:
● Reduce overheads, vs.
● Preempt sooner, vs.
● Prioritize critical sections, vs.
● Do not thrash the cache, vs.
● Be fair at a fine-grained scale: do not starve tasks (or activities like garbage collection), etc.
9. Illustrative Scenario: 2
■ Online video gaming at high scale:
● High FPS (frames per second).
● Hotness in L1, L2.
● Streamlined execution: do not preempt too soon!
● Deadline priorities: Schedule quickly upon waking up!
10. Illustrative Scenario: 2
■ For a first-order impact on reducing frame drops: make task migration immediate.
Tunable                               CFS default    New value
kernel.sched_migration_cost_ns        500000         0
kernel.sched_min_granularity_ns       3000000        1000000
kernel.sched_wakeup_granularity_ns    4000000        0
Scheduler tuning for dramatically reducing frame drops
kernel/sched_fair.c
. . .
gran = sysctl_sched_wakeup_granularity;
. . .
/* If the virtual runtime of the currently running task (se) exceeds that
 * of the wakeup target (pse) by more than the safety margin gran, force
 * the current task to yield to the wakeup target. */
if (pse->vruntime + gran < se->vruntime)
        resched_task(curr);
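A minimal sketch of applying the table's values, assuming a kernel that still exposes these knobs under /proc/sys/kernel (on newer kernels some of them moved under debugfs); the paths and values mirror the table above.

/* Sketch: apply the scheduler tunings from the table by writing their
 * /proc/sys equivalents. Run as root; error handling kept minimal. */
#include <stdio.h>

static void set_tunable(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return;
    }
    fputs(value, f);
    fclose(f);
}

int main(void)
{
    set_tunable("/proc/sys/kernel/sched_migration_cost_ns", "0");
    set_tunable("/proc/sys/kernel/sched_min_granularity_ns", "1000000");
    set_tunable("/proc/sys/kernel/sched_wakeup_granularity_ns", "0");
    return 0;
}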
11. Vital to Understand Transitions at Sub-ms Grain
■ Small time ranges contain the vital signals for analyzing tail latency growth.
● Sampling and averaging of CPU and platform software telemetry helps only a little; it is too blunt to tie short-range causes to effects that persist.
● For example: the scheduling of a large I/O may take only a few microseconds but cause lingering effects in shared caches.
■ Tuning of schedulers (as in Scenario 2) requires the ability to observe intra- and inter-process effects more or less continuously.
● Interestingly, scheduler tuning can also free up CPU headroom, creating an opportunity to handle load spikes within a latency budget.
13. perf-sched: a Dump and Post-process Approach for Capturing and Analyzing Scheduler Events
■ perf sched record
■ perf sched timehist
■ perf sched map
■ …
https://www.brendangregg.com/blog/2017-03-16/perf-sched.html
From: perf sched for Linux CPU scheduler analysis,
by Brendan Gregg, 2017
The goal is to understand: How far is the maximum scheduling delay from the average? When does wait time blow up, and what contemporaneous activities are in play when the maximum delays occur? Do the delayed samples show similar effects, or is their execution itself something unusual?
14. perf-sched timehist to Understand and Tune Various Scheduling Parameters
https://www.brendangregg.com/blog/2017-03-16/perf-sched.html
From: perf sched for Linux CPU scheduler analysis, by Brendan Gregg, 2017
15. KUtrace by Richard Sites
■ All kernel-user and user-kernel transitions collected at very low space and time overhead.
■ Does one thing, and that one thing well, for production coverage.
■ Further, it captures the retired-instruction counter from the PMU at transition points, to understand IPC, all at ~40 ns overhead.
From: KUTrace: Where have all the nanoseconds gone?, Richard
Sites, Tracing Summit, 2017.
16. Hardware Role in Monitoring
■ Modern CPUs build in a significant ability to monitor many events at very low overhead.
■ PMU registers get multiplexed over the event space (under software control) at a moderate granularity.
■ A very large subset of these events supports event-based sampling, so that correlated ratios can be collected and time-aligned for dissecting the likely causes of IPC shifts; a sketch of programming one such event follows.
https://perfmon-events.intel.com/
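As an illustration, a minimal sketch of programming one PMU event for precise (PEBS) event-based sampling via perf_event_open(2). The raw event encoding EVENT_CODE is a placeholder, to be looked up for a specific CPU at the URL above.

/* Sketch: open one precise (PEBS) sampling event on a target pid.
 * EVENT_CODE is a placeholder; real event/umask encodings come from
 * the CPU's event list (https://perfmon-events.intel.com/). */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define EVENT_CODE 0x01cd   /* placeholder event/umask encoding */

int open_pebs_event(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = EVENT_CODE;
    attr.sample_period = 10007;          /* sample every N occurrences */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                       PERF_SAMPLE_TIME | PERF_SAMPLE_WEIGHT;
    attr.precise_ip = 2;                 /* request PEBS: low-skid IP */
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
    if (fd >= 0)
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    return fd;                           /* mmap the fd to read samples */
}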
18. PMU Based Extraction of Long-latency Paths
■ A more detailed view (different example case)
19. Identifying Cache Miss Hotspots (Data, Code)
■ Similarly, a data heatmap can be generated without requiring software instrumentation or tracing (e.g., with valgrind, Pin, etc.), using the load-latency events:
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_<threshold> (threshold is a power-of-two cycle count, e.g., 4 through 512)
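A sketch of the load-latency variant of the earlier perf_event_open() example, assuming (as Intel's PEBS load-latency facility is exposed by perf) that the cycle threshold goes in config1 and the data address is requested with PERF_SAMPLE_ADDR; the event encoding remains a placeholder.

/* Sketch: sample only loads whose latency exceeds a threshold, and
 * capture the data address of each sampled load for a heatmap. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int open_load_latency_event(pid_t pid, uint64_t threshold_cycles)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x01cd;             /* placeholder: load-latency event */
    attr.config1 = threshold_cycles;  /* "ldlat": report loads slower than this */
    attr.sample_period = 2003;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                       PERF_SAMPLE_WEIGHT | PERF_SAMPLE_TIME;
    attr.precise_ip = 2;              /* PEBS is required for this event */
    attr.disabled = 1;

    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}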
20. From Sampling to Tracing
■ Profiling tools today (e.g., perf, Intel® VTune, etc.) are mostly based on the idea of statistical hotspot collection: sampling and averaging; they therefore lose short-interval transitions.
■ Tracing today (e.g., with insertion of tracepoints) requires a software developer to anticipate where to instrument code. This is not generally easy, unless a lot of engineering has already gone into preselecting the points (e.g., KUTrace).
● Instrumenting everything, or over-collecting, incurs too much CPU penalty
● And memory and cache pollution
■ Challenges beyond trace collection:
● Much effort and data pruning are needed before tail latencies can be linked to likely causes
● This is usually pushed to offline analysis.
■ Understanding (and remediating) tail latencies itself needs to be a low latency endeavor.
21. eBPF
■ eBPF provides for programmable triggering and conditional collection
● at low overhead
● in user or kernel context
■ Thus, for example, one can use event-based sampling of the longest-latency accesses as a trigger, and snapshot the timed LBR buffer from eBPF, as in the sketch below.
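A hypothetical libbpf-style sketch of that flow: the program is attached to a PEBS-precise long-latency sampling event opened with branch-stack sampling enabled (attr.sample_type |= PERF_SAMPLE_BRANCH_STACK); on each sample it snapshots the hardware (timed) LBR records to user space via a ring buffer. Map sizes and the snapshot layout are assumptions.

/* Sketch: on each PEBS sample, snapshot the timed-LBR branch records. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

#define MAX_LBR 32

struct lbr_snapshot {
    __s64 nr_bytes;                            /* valid bytes in entries[] */
    struct perf_branch_entry entries[MAX_LBR];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} snapshots SEC(".maps");

SEC("perf_event")
int on_long_latency_sample(struct bpf_perf_event_data *ctx)
{
    struct lbr_snapshot *s;

    s = bpf_ringbuf_reserve(&snapshots, sizeof(*s), 0);
    if (!s)
        return 0;

    /* Copy the branch records captured with this PEBS sample. */
    s->nr_bytes = bpf_read_branch_records(ctx, s->entries,
                                          sizeof(s->entries), 0);
    bpf_ringbuf_submit(s, 0);
    return 0;
}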
22. eBPF and KUTrace/perf-sched and . . .
■ KUtrace is intentionally austere
● So it can be deployed in production, at scale, and be available and running continuously.
● (Adding bells and whistles to it is not a good idea, and is discouraged!)
+ eBPF's in-kernel filtering
■ For deep insights: KUTrace provides all user<->kernel transitions
23. eBPF and KUTrace/perf-sched and . . .
■ With eBPF-controlled tracing, we could invoke traces only in areas of interest: e.g., trace all gRPC requests with packet size > 64K (maybe that is the only case in which you see high tail latencies); a sketch follows below.
■ (Or start first with eBPF and perf-sched latency co-monitoring and triggering)
● (While connecting eBPF-based probes for latency monitoring in select higher stack layers)
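A minimal sketch of such conditional collection, with assumptions flagged loudly: the probed symbol handle_request and its argument layout are hypothetical stand-ins, not a real gRPC API; the loader would attach the uprobe to the actual handler in the target binary.

/* Sketch: in-kernel filtering with eBPF. An event is emitted only when
 * the request size exceeds 64 KB, so the common case costs almost nothing. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct big_req {
    __u64 ts_ns;
    __u64 size;
    __u32 pid;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 64 * 1024);
} events SEC(".maps");

SEC("uprobe")   /* loader attaches this to the target binary's handler */
int BPF_KPROBE(on_request, void *req, __u64 size)   /* hypothetical args */
{
    struct big_req *e;

    if (size <= 64 * 1024)          /* filter in the kernel: skip the common case */
        return 0;

    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    e->ts_ns = bpf_ktime_get_ns();
    e->size = size;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_ringbuf_submit(e, 0);
    return 0;
}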
24. Summing It Up
■ Tail latency control is crucial, as real-time complex event processing penetrates virtually all sectors.
■ Low-overhead, agile monitoring of latency excursions is needed.
■ Equally, to unveil the causes, contributing factors need to be collected at low overhead and
in a timely manner – ideally, through conditional collection and filtering.
■ Hardware performance monitoring capabilities are rich, and can collect a wide variety of events at very low overhead.
■ Linking eBPF-based, latency-focused monitoring with hardware events (e.g., timed LBRs and long-latency cache misses) is one direction.
■ Another is triggering eBPF-based collection of hardware event rates, time-aligned with scheduler events filtered for high scheduling delays (wait signaling and post-wait dispatching).