David Lo, Dragos Sbirlea, Rohit Jnagal
Managing Memory Bandwidth Antagonism @ Scale
David, Dragos & Rohit
Borg Model
● Large clusters with multi-tenant hosts.
● Run a mix of :
○ high and low priority workloads.
○ latency-sensitive and batch workloads.
● Isolation through bare-metal containers
(cgroups/namespaces)
○ Cgroups and perf to monitor host and job
performance.
○ Cgroups and h/w controls to manage
on-node performance.
○ Cluster scheduling and balancing manages
service performance.
[Diagram: Efficiency / Availability / Performance]
3
The Memory Bandwidth Problem
● Large variation in performance
on multi-tenant hosts.
● On average, saturation events are few, but:
○ they periodically cause significant cluster-wide performance degradation.
● Some workloads are much more
seriously affected than others.
○ Does not necessarily correlate
with victim’s memory bandwidth
use.
[Chart: latency over time, spiking when an antagonist task starts.]
4
Note : This talk is focused on the membw problem for general servers and does not cover GPUs and other special devices. Similar techniques apply there too.
Memory BW Saturation is Increasing Over Time
[Chart: fraction of machines that experienced mem BW saturation, Jan 2018 – Nov 2018, increasing over time.]
5
Why It Is a (Bigger) Problem Now
● Large machines need to pack more jobs to maintain utilization, resulting in more “noisy neighbor” problems.
● ML workloads are memory BW intensive
6
● Track per-socket local and remote memory bandwidth use
● Identify per-platform thresholds for performance dips (saturation)
● Characterize saturation by platform and clusters
Understanding the Scope : Socket-Level Monitoring
[Diagram: Socket 0 and Socket 1, each with local and remote memory read/write traffic; see the perf sketch after this slide.]
7
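Not the production pipeline described in the talk, but a rough stand-in using stock Linux perf: on many Intel servers the uncore IMC PMUs expose CAS-count events that approximate per-socket DRAM traffic. PMU and event names vary by CPU generation, so treat the names below as placeholders and verify them with `perf list`.
$ # Per-socket DRAM traffic for 1 second via the integrated memory controller PMUs.
$ # Repeat for each uncore_imc_N present on the machine. perf usually scales CAS
$ # counts to MiB; otherwise each CAS transfers one 64-byte cache line.
$ perf stat -a --per-socket \
      -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ \
      -- sleep 1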
Platform and Cluster Variation
Saturation behavior varies with platform and cluster, due to
● hardware differences (membw/core ratio)
● workload (large CPU consumers run on bigger platforms)
[Charts: saturation broken down by platform and by cluster.]
8
● Socket-level information gives the magnitude of the
problem and hot-spots
● Need task-level information to identify:
○ Abusers : tasks using a disproportionate amount of bandwidth
○ Victims : tasks seeing performance drop
● New platforms provide task-level memory bandwidth
monitoring, but:
○ RDT cgroup was on its way out
○ Have no data on older platforms
For our purposes, a rough attribution of memory bandwidth
was good enough
Monitoring Sockets ↣ Monitoring Tasks
[Charts: total memory bandwidth against the saturation threshold, and the per-task memory BW breakdown.]
9
● Summary of requirements:
○ Local and remote bandwidth breakdown
○ Compatible with the cgroup model
● What's available in hardware?
○ Uncore counters (IMC, CHA)
■ Difficult to attribute to HyperThread => cgroup
○ CPU PMU counters
■ Counters are HyperThread local
■ Works with cgroup profiling mode
[Diagram: DDR memory behind the IMC; two CPU cores, each with a CHA and hyperthreads HT0/HT1.]
Per-task Memory Bandwidth Estimation
10
● OFFCORE_RESPONSE for Intel CPUs
● Programmable filter to specify events of interest (e.g., local DRAM and remote DRAM)
● Captures both demand load and HW prefetcher traffic
● Online documentation of the meaning of bits, per CPU (download.01.org)
● How to interpret: cache lines/sec × 64 bytes/cache line = BW (see the sketch after this slide)
Intel SDM Vol 3
Which CPU Perfmon to Use?
11
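A hedged illustration using stock perf rather than the internal tooling: perf's cgroup mode (-G) can attribute OFFCORE_RESPONSE counts to a container's cgroup. The event names below are placeholders that differ per CPU model; look up the right ones with `perf list offcore*` or on download.01.org. The cgroup name is resolved relative to the perf_event cgroup hierarchy.
$ # Count local- and remote-DRAM cache-line fills for cgroup "myjob" over 1 second.
$ # -G takes one cgroup per event, hence "myjob,myjob".
$ perf stat -a -G myjob,myjob \
      -e offcore_response.all_data_rd.l3_miss.local_dram \
      -e offcore_response.all_data_rd.l3_miss.remote_dram \
      -- sleep 1
$ # Estimated BW = (event count / 1 s) x 64 bytes per cache line.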
Abuser insights
● A large percentage of the time, a single consumer uses up most of the bandwidth.
● That consumer’s share of CPU is much lower than its share of membw.
Victim insight
● Many jobs are sensitive to membw saturation.
● Jobs are sensitive even though they are not big users of membw.
Guidance on enforcement options
● How much saturation would we avoid if we do X?
● Which jobs would get caught in the crossfire?
Insights from Task Measurement
[Charts: number of jobs vs. CPI degradation on saturation (as a fraction); combinations of jobs (by CPU requirements) during saturation.]
12
Enforcement : Controlling Different Workloads
[Matrix: action by priority (Low / Medium / High) and memory BW usage (Moderate / Heavy): Isolate, Throttle, Reactive rescheduling, Disable. Low-priority heavy users are throttled; heavy antagonists that cannot be throttled or redistributed are disabled.]
13
What Can We Do ? Node and Cluster Level Actuators
Node
Memory Bandwidth Allocation in hardware
Use HW QoS to apply max limits to tasks
overusing memory bandwidth.
CPU throttling for indirect control
Limit CPU access of over-using tasks to
indirectly limit the memory bandwidth used.
Cluster
Reactive evictions & re-scheduling
Hosts experiencing memory BW saturation signal the scheduler to redistribute bigger memory bandwidth users to lightly-loaded machines.
Disabling heavy antagonist workloads
Tasks that saturate a socket by themselves cannot be effectively redistributed. If slowing them down is not an option, de-schedule them.
14
Node : CPU Throttling
[Diagram: Socket 0 (saturated) vs. Socket 1, with the CPUs running memBW over-users highlighted.]
+ Very effective in reducing saturation
+ Works on all platforms
- Too coarse in granularity
- Interacts poorly with Autoscaling & Load-balancing
15
Throttling - Enforcement Algorithm
1. Every x seconds, the socket memory BW saturation detector reads the socket perf counters.
2. If socket BW > saturation threshold, the cgroup memory BW estimator profiles potentially eligible tasks from socket and cgroup perf counters.
3. A policy filter selects eligible tasks for throttling, and the memory BW enforcer restricts their CPU runnable mask.
4. If socket BW < unthrottle threshold, unthrottle tasks.
(A minimal sketch of this loop follows.)
16
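This is not the production enforcer, only a shell sketch of the loop: socket_bw and throttle_candidates are hypothetical helpers standing in for the perf-counter-based detector and the policy filter, and "throttling" here shrinks the runnable-CPU mask via cgroup v2 cpuset. Thresholds, intervals, and CPU ranges are illustrative.
#!/bin/bash
SAT=$((80 * 1024**3))        # assumed saturation threshold, bytes/s
CLEAR=$((60 * 1024**3))      # assumed unthrottle threshold, bytes/s
THROTTLED=""                 # cgroups currently throttled
while sleep 10; do                                     # "every x seconds"
  bw=$(socket_bw 0)                                    # hypothetical: bytes/s on socket 0
  if (( bw > SAT )); then
    for cg in $(throttle_candidates 0); do             # hypothetical policy filter
      echo "0-3" > "/sys/fs/cgroup/${cg}/cpuset.cpus"  # shrink the runnable mask
      THROTTLED="$THROTTLED $cg"
    done
  elif (( bw < CLEAR )); then
    for cg in $THROTTLED; do
      echo "0-63" > "/sys/fs/cgroup/${cg}/cpuset.cpus" # restore the full mask
    done
    THROTTLED=""
  fi
done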
Node : Memory Bandwidth Allocation
Intel RDT
Memory Bandwidth Allocation
+ Reduces bandwidth without lowering CPU utilization.
+ Somewhat finer-grained than CPU-level controls.
- Newer platforms only.
- Can’t isolate well between hyperthreads.
Supported through resctrl in the kernel (example below; more on resctrl later)
17
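A hedged example of driving MBA through resctrl; the group name and percentages are illustrative, and the info files come from the upstream resctrl documentation:
$ cat /sys/fs/resctrl/info/MB/min_bandwidth       # smallest throttle value the HW supports
$ cat /sys/fs/resctrl/info/MB/bandwidth_gran      # throttling granularity
$ mkdir /sys/fs/resctrl/bw_throttled
$ echo "MB:0=20;1=20" > /sys/fs/resctrl/bw_throttled/schemata   # cap the group to ~20% on both sockets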
In many cases, there are:
● A low-percentage of saturated sockets in cluster, and
● Multiple tasks contributing to saturation.
Re-scheduling the tasks to less loaded machines can avoid
slow-downs.
Does not help with large antagonists that can saturate any socket they run on.
Cluster : Reactive Re-Scheduling
[Diagram: the observer on saturated host A calls the scheduler for help (1. Call for help); the scheduler evicts tasks (2. Evict) and reschedules them (3. Reschedule) onto hosts B, C, or D.]
18
Low priority jobs can be dealt with at node level through throttling.
If SLOs do not permit throttling and the antagonists cannot be redistributed :
● Disable (kick out of the cluster)
● Users can then reconfigure their service to use a different product.
● Area of continual work.
Alternative :
● Colocate multiple antagonists (that’s just working around SLOs)
Handling Cluster-Wide Saturation
[Charts: a cluster membw distribution amenable to rescheduling vs. one amenable to job disabling, each with the saturation threshold marked.]
19
Results : CPU Throttling + Rescheduling
20
Results : Rebalancing
21
● New, unified interface: resctrl
● resctrl is a big improvement over the previous non-standard cgroup interface
● Uniform way of monitoring/controlling HW QoS across vendors/architectures
○ AMD, ARM, Intel
● (Non-exhaustive) list of HW features supported:
○ Memory BW monitoring
○ Memory BW throttling
○ L3 cache usage monitoring
○ L3 cache partitioning
resctrl : HW QoS Support in Kernel
22
● The terminology below is x86-specific
● CLass of Service ID (CLOSID): maps to a QoS configuration. Typically O(10) unique
ones in HW.
● Resource Monitoring ID (RMID): used to tag workloads and their used resources to
aggregate their resource usage. Typically O(100) unique ones in HW.
Intro to HW QoS Terms and Concepts
[Diagram: Hi priority (CLOSID 0): 100% L3 cache, 100% mem BW; Low priority (CLOSID 1): 50% L3 cache, 20% mem BW. RMID0–RMID4 tag Workloads A, B, and C.]
(The CLOSID/RMID counts can be queried as shown after this slide.)
23
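On a machine with resctrl mounted, the hardware limits can be read directly; the file paths come from the upstream resctrl documentation, and the reported numbers vary by CPU:
$ cat /sys/fs/resctrl/info/L3/num_closids       # unique QoS configurations (CLOSIDs)
$ cat /sys/fs/resctrl/info/L3_MON/num_rmids     # unique monitoring tags (RMIDs)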
resctrl/
|- groupA/
| |- mon_groups/
| | |- monA/
| | | |- mon_data/
| | | |- tasks
| | | |- ...
| | |- monB/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks
| |- ...
|- groupB/
|- ...
Overview of resctrl Filesystem
Documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
groupA/, groupB/ : resource control groups; each represents one unique HW CLOSID.
mon_groups/monA/, monB/ : monitoring groups; each represents one unique HW RMID.
schemata : QoS configuration for the resource control group.
tasks (in a resource control group) : TIDs in that resource control group.
tasks (in a monitoring group) : TIDs in that monitoring group.
mon_data (under a resource control group) : resource usage data for the entire resource control group.
mon_data (under a monitoring group) : resource usage data for the monitoring group.
(A hands-on example of building this hierarchy follows this slide.)
24
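Building the hierarchy above by hand looks roughly like this; group names are illustrative and $TID stands for a thread of the workload to track:
$ mount -t resctrl resctrl /sys/fs/resctrl
$ mkdir /sys/fs/resctrl/groupA                               # new control group => allocates a CLOSID
$ mkdir /sys/fs/resctrl/groupA/mon_groups/monA               # new monitoring group => allocates an RMID
$ echo $TID > /sys/fs/resctrl/groupA/tasks                   # bind the thread to the control group
$ echo $TID > /sys/fs/resctrl/groupA/mon_groups/monA/tasks   # and to the monitoring group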
Example Usage of resctrl Interfaces
$ cat groupA/schemata
L3:0=ff;1=ff
MB:0=90;1=90
$ READING0=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ sleep 1
$ READING1=$(cat groupA/mon_data/mon_L3_00/mbm_total_bytes)
$ echo $((READING1-READING0))
1816234126
Allowed to use 8 cache ways for L3 on both sockets.
Per-core memory BW constrained to 90% on both sockets.
Compute memory BW by taking a rate; in this case, BW ≈ 1.8 GB/s.
25
Reconciling resctrl and cgroups: First Try
resctrl/
|- no_throttle/
| |- mon_groups/
| | |- cgroupX/
| | | |- mon_data/
| | | |- tasks
| | | |- ...
| | |- monB/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks
| |- ...
|- bw_throttled/
|- ...
<< #1
<< #1
<< #1
<< #3
<< #5 ↻
<< #6 ↻
Use case: dynamically apply memory BW throttling if
machine is in trouble
1. Node SW creates 2 resctrl groups: no_throttle
and bw_throttled
2. On cgroup creation, logically assign cgroupX to
no_throttle
3. Create a mongroup for cgroupX in
no_throttle
4. Start cgroupX
5. Move TIDs into no_throttle/tasks
6. Move TIDs into
no_throttle/mon_groups/cgroupX/tasks
7. Move TIDs of high BW user into bw_throttled (shell sketch below)
26
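A rough shell rendering of steps 5–7; cgroup paths are illustrative, cgroup v2 lists threads in cgroup.threads (v1 in tasks), and resctrl accepts one TID per write:
# Steps 5-6: move every thread of cgroupX into the resctrl group and its mon group.
for tid in $(cat /sys/fs/cgroup/cgroupX/cgroup.threads); do
  echo "$tid" > /sys/fs/resctrl/no_throttle/tasks
  echo "$tid" > /sys/fs/resctrl/no_throttle/mon_groups/cgroupX/tasks
done
# Step 7: later, if cgroupX turns out to be a heavy BW user, move it again.
for tid in $(cat /sys/fs/cgroup/cgroupX/cgroup.threads); do
  echo "$tid" > /sys/fs/resctrl/bw_throttled/tasks
done
# Any thread created while these loops run is missed -- the race called out on the next slide.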
Use case: dynamically apply memory BW throttling if
machine is in trouble
1. Node SW creates 2 resctrl groups: no_throttle
and bw_throttled
2. On cgroup creation, logically assign cgroupX to
no_throttle
3. Create a mongroup for cgroupX in
no_throttle
4. Start cgroupX
5. Move TIDs into no_throttle/tasks
6. Move TIDs into
no_throttle/mon_groups/cgroupX/tasks
7. Move TIDs of high BW user into bw_throttled
Challenges with Naive Approach
Racy if the cgroup is creating threads while TIDs are moved; expensive when there are many TIDs and when handling the race.
Desynchronization of L3 cache
occupancy data, since existing
data is tagged with an old RMID.
27
● What if we could map cgroups 1:1 to resctrl groups?
○ To change QoS configs, just rewrite schemata
○ More efficient: removes the need to move TIDs around
○ Keeps the existing RMID, preventing the L3 occupancy desynchronization issue
○ 100% compatible with the existing resctrl abstraction
● CHALLENGE: with the existing system, we would run out of CLOSIDs very quickly
● SOLUTION: share CLOSIDs between resource control groups that have the same schemata
● Google-developed kernel patch for this functionality to be released soon
● Demonstrates need to make cgroup model a first class consideration for QoS
interfaces
A Better Approach for resctrl and cgroups
28
cgroups and resctrl After the Change
resctrl/
|- cgroupX/
| |- mon_groups/
| | |- mon_data/
| | |- ...
| |- schemata
| |- tasks
| |- ...
|- high_bw_cgroup/
| |- schemata
| |- ...
|- ...
<< #1
<< #4 ↻
Use case: dynamically apply memory BW throttling if
machine is in trouble
1. Create a resctrl group cgroupX
2. Write no throttling configuration to
cgroupX/schemata
3. Start cgroupX
4. Move TIDs into cgroupX/tasks
5. Rewrite the schemata of the high-BW cgroup to throttle it (sketch below)
<< #2
<< #5
29
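A hedged sketch of the same flow once a cgroup maps 1:1 to a resctrl group (this assumes the CLOSID-sharing patch described earlier; names, the TID variable, and percentages are illustrative):
mkdir /sys/fs/resctrl/cgroupX                              # one resctrl group per cgroup
echo "MB:0=100;1=100" > /sys/fs/resctrl/cgroupX/schemata   # start unthrottled
echo "$TID0" > /sys/fs/resctrl/cgroupX/tasks               # one-time move at start; no TID churn later
# ...later, if cgroupX turns out to be a heavy BW user:
echo "MB:0=20;1=20" > /sys/fs/resctrl/cgroupX/schemata     # throttle by rewriting schemata only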
● Measuring µArch impact is not a first class component of
most container runtimes.
○ Can’t manage what we can’t see...
● Most container runtimes expose isolation knobs per
container.
● Managing µArch isolation requires node and cluster level
feedback-loops.
○ Dual operating mode : admins & users.
○ Performance isolation not necessarily controllable by
end-users.
We would love to contribute to a standard framework around
performance management for container runtimes.
µArch Features & Container Runtimes
[Diagram: Efficiency / Availability / Performance]
30
Takeaways and Future work
● Memory bandwidth and low-level isolation issues are becoming more significant.
● Continuous monitoring is critical to run successful multi-tenant hosts.
● Defining requirements for h/w providers and s/w interfaces on QoS knobs.
○ Critical to have these solutions work for containers / process-groups.
● Increasing success rate with current approach:
○ Handling of minimum guaranteed membw usage
○ Handling logically related jobs - Borg allocs
● A general framework would help collaboration.
● Future : Memory BW scheduling (based on hints)
○ Based on membw usage
○ Based on membw sensitivity
31
Find us at the conf or reach out at :
davidlo@
dragoss@
google.com
jnagal@
eranian@
Thanks !
32
