False sharing and power management transitions can trigger wide latency spreads, but are neither directly observable nor easily traced to their causes. This talk describes how to diagnose the problems quickly, and outlines several remedies.
1. Automating the Hunt for Non-Obvious Sources of Latency Spreads
Kshitij Doshi, Sr. Principal Engineer at Intel
Harshad S. Sane, Principal Engineer at Intel
2. Kshitij Doshi
■ Ph.D., Rice Univ – Communication-efficient parallel algorithms
■ Performance of Systems, DB, Cloud-native apps
■ Research interests in storage, memory, distributed systems
■ 20 years at Intel; previously, 13 years at Unix Systems Labs & Novell.
3. Harshad Sane
■ Harshad Sane is a Principal Engineer in Intel's Data Center
and AI group
■ Deep technical expertise in system software, memory, and
CPU architectures.
■ Specializes in performance engineering, with extensive
expertise in telemetry, observability, monitoring, and
software optimization.
4. ■ Section 1 - About tail latency spreads
■ Section 2 - Two non-obvious causes of latency escapes
■ Section 3 - How to decide if either of them is hurting your application
■ Section 4 - Mitigations, if they are hurting your application
■ Section 5 - Summary
Agenda
5. Hurdles are not always predictable.
Courtesy: bing.com/images
6. ScyllaDB is engineered for use cases needing high
throughput and predictable, low latencies...
https://resources.scylladb.com/videos/build-low-latency-applications-in-rust-on-scylladb
[Diagram: ScyllaDB I/O path – query, commitlog, and compaction traffic feed separate queues (0.5 msec) into the userspace I/O scheduler and on to disk.]
8. Frequently there is some issue that intersects in an unpredictable manner
with the execution of normal hotspots.
When Small Performance Fluctuations Magnify
Into Sudden, Large Spikes in Response Times…
9. repeating over and over, with minor perturbations in end-to-end
latencies for each iteration
Consider a Streamlined Flow of Execution
10. Such a hiccup . . . propagates and throws both timing and resource usage out of
balance for some period of time.
But this period of non-streamlined flow can feed on itself and produce secondary
spikes in end-to-end latencies, even as overall flow throughput evens out.
Where Something Goes Out of Balance Momentarily
and Causes a Hiccup.
12. [Diagram: thread T0 waits 100 ms, then sets a random seed S; threads T1–T5 each use S. Separately, a producer puts X at the tail of queue L while a consumer gets Y from the head of queue L.]
Producer frequently modifies tail of queue L, while
consumer frequently modifies head of queue L.
A first module in an application
A second module in the application
First issue:
13. [Diagram repeated from slide 12: T0 periodically sets seed S used by T1–T5, while a producer and consumer modify the tail and head of queue L.]
The threads working in the two modules, which have no logical intersection, do however get cross-coupled if the
variable S ends up on a cacheline that is also used for storing either or both of the head / tail pointers of queue L.
Not a significant problem unless updates of queue L become frequent.
First issue: false sharing
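To make the hazard concrete, here is a minimal C sketch of the layout described above; the names (seed_S, queue_L_head, queue_L_tail) are hypothetical stand-ins, and whether the fields actually share a line depends on how the toolchain lays them out:

/* Hypothetical globals illustrating the layout hazard: if the linker
 * places these adjacently, all three can land on one 64-byte cacheline. */
#include <stdint.h>

struct queue_node;                  /* node type of queue L (details omitted) */

uint64_t seed_S;                    /* written by T0 every 100 ms, read by T1..T5 */
struct queue_node *queue_L_head;    /* modified frequently by the consumer */
struct queue_node *queue_L_tail;    /* modified frequently by the producer */

/* Each producer/consumer update of head/tail invalidates the shared line,
 * so T1..T5 incur coherence misses on every read of seed_S even though
 * no data is logically shared between the two modules. */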
14. [Diagram: three CPU operating states]
• High CPU frequency: high power consumption, more instructions/time
• Low CPU frequency: low power consumption, fewer instructions/time
• Active idle: one of several sleep states, very low power consumption
Operating system and hardware algorithms together with system configuration parameters
determine conditions under which CPUs transition among different states of operation
Second Issue: CPU Active (P-states) and
Sleep (C-states) States
15. [Diagram repeated from slide 14]
Transitions out of deeper sleep states into active execution take many microseconds up to the
point of normal instruction execution, and also experience the transient effects of colder caches.
Second Issue: CPU Power Management
Transitions
16. [Diagram repeated from slide 14]
Transitions out of deeper sleep states into active execution take many microseconds up to the point of
normal instruction execution, and also experience the transient effects of colder caches.
Transitions from low-power states to normal execution go through a series of frequency step-ups,
causing dependency-chained software actions to stall on inter-thread or inter-process data/event waits.
Second issue: CPU Power (and Sleep) State
Transitions
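One low-overhead way to observe these frequency step-ups on Linux is to poll the cpufreq sysfs interface. A minimal sketch, assuming the scaling driver exposes scaling_cur_freq for cpu0:

/* Sketch: sample cpu0's current frequency (kHz) ten times, 100 ms apart.
 * Assumes a Linux cpufreq scaling driver is active. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 10; i++) {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
        if (!f) { perror("fopen"); return 1; }
        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu0: %ld kHz\n", khz);
        fclose(f);
        usleep(100 * 1000);            /* 100 ms between samples */
    }
    return 0;
}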
19. [Diagram: a CPU transitioning out of a deep sleep state delays the trigger for ThrA; ThrA runs slower as a result, which delays the trigger for ThrB; ThrC also runs slower, and possibly another CPU has entered a deeper sleep as a result.]
Cascading Sleep -> Wakeup -> Sleep transients can take time to
fade out, and cause high peak latencies …
… even though the impact on average latency gets amortized.
20. Detecting and untangling the causes of these intersecting issues is very
challenging, particularly when heavyweight instrumentation such as tracing
or logging interferes with and distorts the effects.
These effects are not easily noticed through lightweight sampling or
counting of events.
21. Collecting traces
Security concerns, plus collection overheads at the CPU, caches, and
bandwidth to memory and storage/network.
Analyzing traces
Like searching for a needle in multiple haystacks
without knowing if a needle is to be found at all.
Scheduling collection and analysis
Like figuring out when a crime is going to occur in order
to launch crime-scene analysis.
Challenges
22. Indicating power transitions:
• Turbostat
• Powertop
• CoreFreq - https://github.com/cyring/CoreFreq
• Runqueue lengths
Indicating cache coherence issues:
• Sharp IPC drop with concurrency
• No obvious memory/disk data bottleneck
• High utilization, low runqueue lengths
Good but circumstantial clues
Correlated Events That Can Be Collected
at Low Overhead
24. To see what is available:
sudo perf list | grep cstate
Example output:
cstate_core/c3-residency/
cstate_core/c6-residency/
cstate_core/c7-residency/
cstate_pkg/c2-residency/
…
Capturing C-State Transitions
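A lower-overhead complement to the perf events above is the cpuidle sysfs tree, which exposes cumulative per-state residency counters; sampling it twice and differencing gives residency over a window. A sketch, assuming Linux cpuidle is enabled:

/* Sketch: print each of cpu0's C-states with its cumulative residency (usec).
 * Assumes the Linux cpuidle sysfs interface is present. */
#include <stdio.h>

int main(void) {
    char path[128], name[32];
    for (int s = 0; ; s++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", s);
        FILE *f = fopen(path, "r");
        if (!f) break;                          /* no more states */
        if (fscanf(f, "%31s", name) != 1) name[0] = '\0';
        fclose(f);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/time", s);
        long long usec = 0;
        if ((f = fopen(path, "r"))) {
            if (fscanf(f, "%lld", &usec) != 1) usec = -1;
            fclose(f);
        }
        printf("state%d (%s): %lld usec\n", s, name, usec);
    }
    return 0;
}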
25. sudo perf timechart record
sudo perf timechart
Monitoring sleep, wait, and
run times per CPU.
[Chart: 4 threads in intermittent sleeps; 4 threads with variable I/O wait durations]
Timecharting
26. Visualizing scheduling events by time
1. perf sched timehist (-Mw: migrations and wakeups)
Tracks scheduler latency by event, including time to wakeup and
latency from wakeup to run (sched delay).
2. perf sched map
Per-CPU timeline of processes as they are context switched.
27. P-states (≡ frequencies)
‒ BIOS controlled
‒ OS controlled via scaling drivers
‒ HW controlled P states
‒ Turbo
Monitoring and controlling P-states
Credits: https://images.anandtech.com/doci/9582/43.jpg
To monitor the P-states:
‒ Turbostat, CoreFreq
‒ Profiling tools (Perf, Vtune, etc.)
To control P-states: OS-controlled P-states can be manually
configured through scaling drivers, using tools such as
‒ Cpupower
‒ CoreFreq (https://github.com/cyring/CoreFreq)
to control the performance governor for P-states.
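Under the hood, these tools write cpufreq sysfs files; as a hedged sketch of what setting the performance governor for a single CPU amounts to (requires root, and assumes the scaling driver exposes scaling_governor):

/* Sketch: set cpu0's scaling governor to "performance" (run as root).
 * Roughly what `cpupower frequency-set -g performance` does per CPU. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "w");
    if (!f) { perror("fopen"); return 1; }
    fputs("performance\n", f);
    fclose(f);
    return 0;
}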
28. [Flowchart: Response time R, monitored at the application level, feeds an
exp-weighted moving-window average and a short-range average to detect an
upward heave. In parallel, C-state and P-state monitoring count transitions
Tn over 250 ms windows and test Tn > threshold. When the two detections
overlap, snapshot runqlat and timechart activity from the last ‘n’ secs.]
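A minimal sketch of the heave-detection step in this flow, with illustrative (not tuned) constants: a long-range exponentially-weighted average of response time R is compared against a short-range windowed average, and an upward heave is reported when the latter rises well above the former:

/* Sketch of the upward-heave detector; window sizes and the 1.5x factor
 * are illustrative assumptions, not recommended values. */
#include <stdbool.h>

#define HEAVE_FACTOR 1.5           /* short avg > 1.5x long avg => heave */
#define SHORT_WINDOW 32            /* samples per short-range window */

static double ewma;                /* long-range, exp-weighted average */
static double short_sum;           /* short-range accumulator */
static int    short_n;

/* Feed one response-time sample; returns true when an upward heave is
 * detected. On a heave that overlaps with Tn > threshold from the C/P-state
 * transition counters, the caller snapshots runqlat/timechart activity. */
bool observe_response_time(double r_msec) {
    ewma = (ewma == 0.0) ? r_msec : 0.99 * ewma + 0.01 * r_msec;
    short_sum += r_msec;
    if (++short_n < SHORT_WINDOW)
        return false;
    double short_avg = short_sum / short_n;
    short_sum = 0.0;
    short_n = 0;
    return short_avg > HEAVE_FACTOR * ewma;
}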
29. (a) Likelihood: higher concurrency + low scaling + higher response times, with low IPC
despite low LLC misses/instruction
(b) Sensitivity: insensitive to runqueue lengths (not sensitive to CPU subscription)
(c) Clues: higher number of coherence misses in L1 and/or L2 (PMC snoop events
S2I, M2I);
increased inter-socket link utilization in a multi-socket system
Step 1: Establish whether sufficient clues exist to suspect false sharing
Clues for False-Sharing (With Low
Overhead)
30. Drilling down for concrete evidence of false sharing
perf c2c
Sampling-based detection of cachelines where false sharing was likely, based on
the HITM event (see below).
These are read or write accesses for which a different core’s cache reports a “hit” in
“modified” state (HITM).
Provides insights into data addresses, code addresses, processes and threads that
generate sharing conflicts.
Conditional on step 1 indicating possible false sharing:
Step 2: Collect perf c2c profiles identifying data and code addresses producing the contention
33. When power management actions are suspected to provoke high tail latencies:
1. Choose less extreme power-performance settings
(power-save, energy-efficient, etc.)
2. Explore changes in scheduler tunings, such as –
a. Quicker preemption (reducing wakeup -> onproc time)
b. Smaller time-slices
c. Different (usually lower) migration thresholds
Solution Space
34. When false-sharing is suspected to provoke high tail latencies:
1. Some data structure layout possibilities:
a. Data structure / global variable padding, if possible (see the sketch after this slide)
b. Changing the affected data structure to better separate (quasi-)immutable from
mutable cachelines
c. Splitting the data structures in question into sub-structures
2. Possible computation strategy changes:
a. Rate-limiting writers to cachelines that are accessed frequently by readers
b. Colocating the contending threads on the same socket or sub-NUMA cluster
c. Making the code bimodal: normal computation until a monitor signals a rise in
coherence events, switching to (2a) or (2b) afterward.
Solution Space
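As a sketch of layout possibility 1a, C11 alignas padding can give each frequently-written field its own 64-byte cacheline, separating it from read-mostly data; field names follow the earlier hypothetical example:

/* Sketch of mitigation 1a: align hot fields so each owns a cacheline. */
#include <stdalign.h>
#include <stdint.h>

struct queue_node;

alignas(64) uint64_t seed_S;                   /* read-mostly */
alignas(64) struct queue_node *queue_L_head;   /* consumer-written */
alignas(64) struct queue_node *queue_L_tail;   /* producer-written */

/* Updates to head/tail now stay on their own lines, so readers of seed_S
 * no longer take coherence misses caused by queue traffic. */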
35. • Latency instrumentation needs to be made as close to real time as possible.
• Tracing needs to be combined with sampling over short intervals, and triggered by good
precursors, so that overhead is kept to a minimum.
• We outlined two issues—
• False sharing
• Power management transitions
that may not arise frequently, but can have measurable, hard-to-detect effects on tail
latencies.
In this presentation we have shown the role these two issues play in application
performance, their detectability, and possible solutions.
Summary
36. Thank You
Stay in Touch
kshitij.a.doshi@intel.com
harshad.s.sane@intel.com