The engineering challenges of designing for low-latency execution include tightly controlling the time it takes to detect the onset of a latency excursion and to diagnose its most likely cause. In modern X-as-a-service (XaaS) forms of distributed applications, the points at which latency is experienced by a service consumer are separated by many layers of modular abstraction from the underlying system hardware. This separation makes it difficult to pinpoint the causes of latency pushouts and to apply corrective actions in a timely manner. The classic performance methodology of profiling ‘cycles’ of work may be broadly successful in exposing elevated overall latency, but it is not very effective in determining the causes of short-duration latency surges; to determine those, it is frequently necessary to:
• trace execution
• pinpoint when a significant latency stretch-out occurs
• establish its correlation with a nearby precursor or a set of precursor events
Each of these steps can incur significant overhead; further, one has to be concerned that even modest tracing overheads risk contributing to tail latencies. Not just the detection of the onset of a latency excursion, but also the identification of why it occurs, must be completed quickly, so that if a corrective action is possible, it can be taken promptly. Similarly, if no recourse to curb the latency of a slice of computation is available at some point in time, then it is ideal that steps to minimize the impact of the exception be put into effect as early as possible.
In our talk, we present an approach that complements the very low overhead software tracing provided by KUtrace. It uses eBPF to trigger collection of additional data, at very low overhead, from the hardware performance monitoring unit (PMU), so that latency excursions within a span of execution can be examined in a timely manner. We will describe the use of PMU capabilities like precise event-based sampling (PEBS) and timed last branch records (timed LBRs) in close proximity to events of interest to extract critical clues. We will further discuss planned future work to integrate in-band network telemetry (INT) into these tracing flows.
3. Investigating Tail Latency is About Probing for the Atypical
■ Execution samples that land in the tail have been slowed down for some reason that differs from the average case; either:
● they differ in the amount (or type) of work performed, or
● they were affected by some uncommon event or interference, or
● they encountered more waiting (experienced resource starvation longer than average cases).
■ In particular, it is not generally true that tail-latency samples and the remaining execution samples exhibit comparable software, hardware, or scheduling histories.
■ Standard approaches for exposing throughput limiters do not generally expose causes of peak latency.
7. Illustrative Scenario: 1 -- Why?
[Chart: latency vs. load under the default sleep state (C9) and the minimum sleep state (C1), marking where the SLA violation occurs]
■ The default power setting results in a latency SLA violation at half the throughput.
■ Ironically: low utilization results in earlier onset of tail latency because of power-management interference (not in the application's control); a sketch of one mitigation follows below.
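A minimal sketch of the mitigation, assuming a Linux host that exposes the PM QoS interface /dev/cpu_dma_latency (the chosen bound and the pause() placeholder are illustrative, not from the talk): holding the file open with a small wakeup-latency bound keeps cores out of deep sleep states.

/* Sketch: cap CPU sleep states via the Linux PM QoS interface.
 * Writing a small wakeup-latency bound (in microseconds) to
 * /dev/cpu_dma_latency keeps cores out of deep C-states for as
 * long as the file descriptor stays open. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t max_wakeup_us = 1;   /* roughly C1-like wakeup latency */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    if (write(fd, &max_wakeup_us, sizeof(max_wakeup_us)) < 0) {
        perror("write");
        return 1;
    }
    pause();   /* the constraint holds only while fd remains open */
    return 0;
}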
8. Illustrative Scenario: 2
■ Reducing latency to a minimum frequently entails choosing between conflicting options for scheduling resources like CPU time:
● Reduce overheads, vs.
● Preempt sooner, vs.
● Prioritize critical sections, vs.
● Do not thrash the cache, vs.
● Be fair at a fine-grained scale: do not starve tasks (or activities like garbage collection), etc.
9. Illustrative Scenario: 2
■ Online video gaming at high scale:
● High FPS (frames per second).
● Hotness in L1, L2.
● Streamlined execution: do not preempt too soon!
● Deadline priorities: Schedule quickly upon waking up!
10. Illustrative Scenario: 2
■ For a first-order impact on reducing frame drops: make task migration immediate.
Tunable                               CFS default    New value
kernel.sched_migration_cost_ns        500000         0
kernel.sched_min_granularity_ns       3000000        1000000
kernel.sched_wakeup_granularity_ns    4000000        0
Scheduler tuning for dramatically reducing frame drops
kernel/sched_fair.c
. . .
gran = sysctl_sched_wakeup_granularity;
. . .
/* If the virtual runtime of the currently running task (se) exceeds that
 * of the wakeup target (pse) by more than the safety margin gran, force
 * the current task to yield to the wakeup target. */
if (pse->vruntime + gran < se->vruntime)
        resched_task(curr);
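A minimal sketch of applying the table's values, assuming a kernel that still exposes these knobs under /proc/sys/kernel (on newer kernels some of them moved under debugfs); the paths and values mirror the table above.

/* Sketch: apply the scheduler tunings from the table by writing their
 * /proc/sys equivalents. Run as root; error handling kept minimal. */
#include <stdio.h>

static void set_tunable(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return;
    }
    fputs(value, f);
    fclose(f);
}

int main(void)
{
    set_tunable("/proc/sys/kernel/sched_migration_cost_ns", "0");
    set_tunable("/proc/sys/kernel/sched_min_granularity_ns", "1000000");
    set_tunable("/proc/sys/kernel/sched_wakeup_granularity_ns", "0");
    return 0;
}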
11. Vital to Understand Transitions at Sub-ms Grain
■ Small time ranges contain the vital signals for analyzing tail latency growth.
● Sampling and averaging of CPU and platform software telemetry helps only a little; it is too blunt to tie short-range causes to effects that persist.
● For example: the scheduling of a large I/O may take only a few microseconds but cause lingering effects in shared caches.
■ Tuning of schedulers (as in Scenario 2) requires the ability to observe intra- and inter-process effects more or less continuously.
● Interestingly, scheduler tuning can also free up CPU headroom, creating an opportunity to handle load spikes within a latency budget.
13. perf-sched: a Dump and Post-process Approach for Capturing and Analyzing Scheduler Events
■ perf sched record
■ perf sched timehist
■ perf sched map
■ …
https://www.brendangregg.com/blog/2017-03-16/perf-sched.html
From: perf sched for Linux CPU scheduler analysis,
by Brendan Gregg, 2017
The goal is to understand: How far is the maximum scheduling delay from the average? When does wait time blow up, and what contemporaneous activities are in play when the maximum delays occur? Do the delayed samples show similar effects, or is their execution itself something unusual?
14. perf-sched timehist to Understand and Tune Various Scheduling Parameters
https://www.brendangregg.com/blog/2017-03-16/perf-sched.html
From: perf sched for Linux CPU scheduler analysis, by Brendan Gregg, 2017
15. KUtrace by Richard Sites
■ All kernel-user and user-kernel transitions collected at very low space and time overhead.
■ Does one thing, and that one thing well, for production coverage.
■ Further, it captures the retired-instruction counter from the PMU at transition points, to understand IPC, all at ~40 ns overhead.
From: KUTrace: Where have all the nanoseconds gone?, Richard
Sites, Tracing Summit, 2017.
16. Hardware Role in Monitoring
■ Modern CPUs build in a significant ability to monitor many events at very low overhead.
■ PMU registers get multiplexed over the event space (under software control) at a moderate granularity.
■ A very large subset of these events supports event-based sampling, so that correlated ratios can be collected and time-aligned for dissecting the likely causes of IPC shifts; a sketch of programming one such event follows.
https://perfmon-events.intel.com/
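As an illustration, a minimal sketch of programming one PMU event for precise (PEBS) event-based sampling via perf_event_open(2). The raw event encoding EVENT_CODE is a placeholder, to be looked up for a specific CPU at the URL above.

/* Sketch: open one precise (PEBS) sampling event on a target pid.
 * EVENT_CODE is a placeholder; real event/umask encodings come from
 * the CPU's event list (https://perfmon-events.intel.com/). */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define EVENT_CODE 0x01cd   /* placeholder event/umask encoding */

int open_pebs_event(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = EVENT_CODE;
    attr.sample_period = 10007;          /* sample every N occurrences */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                       PERF_SAMPLE_TIME | PERF_SAMPLE_WEIGHT;
    attr.precise_ip = 2;                 /* request PEBS: low-skid IP */
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
    if (fd >= 0)
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    return fd;                           /* mmap the fd to read samples */
}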
18. PMU Based Extraction of Long-latency Paths
■ A more detailed view (different example case)
19. Identifying Cache Miss Hotspots (Data, Code)
■ Similarly, a data heatmap can be generated without requiring software instrumentation or tracing (e.g., with valgrind, Pin, etc.), using the load-latency events:
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_<threshold> (threshold is a power-of-two cycle count, e.g., 4 through 512)
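A sketch of the load-latency variant of the earlier perf_event_open() example, assuming (as Intel's PEBS load-latency facility is exposed by perf) that the cycle threshold goes in config1 and the data address is requested with PERF_SAMPLE_ADDR; the event encoding remains a placeholder.

/* Sketch: sample only loads whose latency exceeds a threshold, and
 * capture the data address of each sampled load for a heatmap. */
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int open_load_latency_event(pid_t pid, uint64_t threshold_cycles)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x01cd;             /* placeholder: load-latency event */
    attr.config1 = threshold_cycles;  /* "ldlat": report loads slower than this */
    attr.sample_period = 2003;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                       PERF_SAMPLE_WEIGHT | PERF_SAMPLE_TIME;
    attr.precise_ip = 2;              /* PEBS is required for this event */
    attr.disabled = 1;

    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}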
20. From Sampling to Tracing
■ Profiling tools today (e.g., perf, Intel® VTune, etc.) are mostly based on the idea of statistical hotspot collection: sampling and averaging; they therefore lose short-interval transitions.
■ Tracing today (e.g., with insertion of tracepoints) requires a software developer to anticipate where to instrument code. This is not generally easy, unless a lot of engineering has already gone into preselecting the points (e.g., KUTrace).
● Instrumenting everything, or over-collecting, incurs too much CPU penalty
● And memory and cache pollution
■ Challenges beyond trace collection:
● Much effort and data pruning are needed before tail latencies can be linked to likely causes
● This is usually pushed to offline analysis.
■ Understanding (and remediating) tail latencies itself needs to be a low latency endeavor.
21. eBPF
■ eBPF provides for programmable triggering and conditional collection
● at low overhead
● in user or kernel context
■ Thus, for example, one can use event-based sampling of the longest-latency accesses as a trigger, and snapshot the timed LBR buffer from eBPF, as in the sketch below.
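A hypothetical libbpf-style sketch of that flow: the program is attached to a PEBS-precise long-latency sampling event opened with branch-stack sampling enabled (attr.sample_type |= PERF_SAMPLE_BRANCH_STACK); on each sample it snapshots the hardware (timed) LBR records to user space via a ring buffer. Map sizes and the snapshot layout are assumptions.

/* Sketch: on each PEBS sample, snapshot the timed-LBR branch records. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

#define MAX_LBR 32

struct lbr_snapshot {
    __s64 nr_bytes;                            /* valid bytes in entries[] */
    struct perf_branch_entry entries[MAX_LBR];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} snapshots SEC(".maps");

SEC("perf_event")
int on_long_latency_sample(struct bpf_perf_event_data *ctx)
{
    struct lbr_snapshot *s;

    s = bpf_ringbuf_reserve(&snapshots, sizeof(*s), 0);
    if (!s)
        return 0;

    /* Copy the branch records captured with this PEBS sample. */
    s->nr_bytes = bpf_read_branch_records(ctx, s->entries,
                                          sizeof(s->entries), 0);
    bpf_ringbuf_submit(s, 0);
    return 0;
}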
22. eBPF and KUTrace/perf-sched and . . .
■ KUtrace is intentionally austere
● So it can be deployed in production, at scale, and be available and running continuously.
● (Adding bells and whistles to it is not a good idea, and is discouraged!)
+ eBPF's in-kernel filtering
■ For deep insights: KUTrace provides all user<->kernel transitions
23. eBPF and KUTrace/perf-sched and . . .
■ With eBPF-controlled tracing, we could invoke traces only in areas of interest: e.g., trace all gRPC requests with packet size > 64K (maybe that is the only case in which you see high tail latencies); a sketch follows below.
■ (Or start first with eBPF and perf-sched latency co-monitoring and triggering)
● (While connecting eBPF-based probes for latency monitoring in select higher stack layers)
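A minimal sketch of such conditional collection, with assumptions flagged loudly: the probed symbol handle_request and its argument layout are hypothetical stand-ins, not a real gRPC API; the loader would attach the uprobe to the actual handler in the target binary.

/* Sketch: in-kernel filtering with eBPF. An event is emitted only when
 * the request size exceeds 64 KB, so the common case costs almost nothing. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

struct big_req {
    __u64 ts_ns;
    __u64 size;
    __u32 pid;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 64 * 1024);
} events SEC(".maps");

SEC("uprobe")   /* loader attaches this to the target binary's handler */
int BPF_KPROBE(on_request, void *req, __u64 size)   /* hypothetical args */
{
    struct big_req *e;

    if (size <= 64 * 1024)          /* filter in the kernel: skip the common case */
        return 0;

    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;
    e->ts_ns = bpf_ktime_get_ns();
    e->size = size;
    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_ringbuf_submit(e, 0);
    return 0;
}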
24. Summing It Up
■ Tail latency control is crucial, as real-time complex event processing penetrates virtually all sectors.
■ Low-overhead, agile monitoring of latency excursions is needed.
■ Equally, to unveil the causes, contributing factors need to be collected at low overhead and
in a timely manner – ideally, through conditional collection and filtering.
■ Hardware performance monitoring capabilities are rich, and can collect a wide variety of events at very low overhead.
■ Linking eBPF-based, latency-focused monitoring with hardware events (e.g., timed LBRs and long-latency cache misses) is one direction.
■ Another is triggering eBPF-based collection of hardware event rates, time-aligned with scheduler events filtered for high scheduling delays (wait signaling and post-wait dispatching).