This document provides an overview of cache partitioning techniques (CPTs) in multicore processors. It begins with the motivation for CPTs: cache contention grows as the number of cores increases. It then classifies CPTs along several axes, including granularity (way, set, block), static vs. dynamic, strict vs. pseudo, and hardware vs. software control. It discusses challenges of CPTs, such as profiling overhead and implementation complexity, and covers techniques for profiling cache usage and determining cache partitions. The goal of CPTs is to improve performance by reducing interference between applications sharing a cache.
2. Some acronyms
CP = cache partitioning, CPT = CP technique
HW = hardware, SW = software
BW = bandwidth
LLC = last level cache
RP = replacement policy
MLP/TLP = memory/thread level parallelism
IPC = instructions per cycle
NUCA = non-uniform cache architecture
QoS = quality-of-service
Perf = performance
N denotes number of cores
3. Motivation for Cache Management in Multicores
With increasing number of cores, performance of multicores does not scale linearly due to cache contention and other factors
Memory requirement of applications is increasing
=> Cache management has become extremely important in multicores
4. Private v/s shared cache
Private caches:
Avoid interference
Cannot account for inter- and intra-application variation in cache requirements
Limited capacity => cannot reduce miss-rate effectively
Shared cache:
Higher total capacity => can reduce miss-rate
Interference b/w apps under traditional cache management policies => performance loss, unfairness and lack of QoS
Use CP in a shared cache => capacity advantage of a shared cache, performance-isolation advantage of private caches
5. Examples of processors with shared LLC
IBM Power 7
Intel Core i7
AMD Phenom X4
Sun Niagara T2
We first provide background on CPTs and then discuss several CPTs
7. Potential of CP
Different apps, and different threads within a multithreaded app, have different cache demands and performance sensitivity
Further, performance of cores may differ due to
  differences in cache latencies due to NUCA design
  differences in core frequencies due to process variation
CP can compensate for these differences!
CP can also optimize for fairness and QoS
8. Potential of CP
CP avoids interference & provides higher effective cache capacity
Reduces miss-rate and bandwidth contention
This may benefit even those applications whose cache quotas are reduced!
Saves energy by
  reducing execution time
  allowing unused cache to be power-gated
9. Challenges of CP
Number of possible partitions increases exponentially with core-count
Simple schemes become ineffective
Finding the partitioning with minimum overall cache miss-rate (i.e., optimal partitioning) is NP-hard; moreover, the optimal partitioning may not be fair
Naive CPTs incur large profiling and reconfiguration overhead
Hardware support required for implementing CPTs (e.g., true LRU) may be too costly or unavailable
10. Challenges of CP
Reduction in miss-rate brought by a CPT may not translate into better performance when the bottleneck lies elsewhere, e.g., load-imbalance or BW congestion
CP is useful only for LLC-sensitive apps; unnecessary or harmful for small-footprint apps
CP is unnecessary for large caches
11. A Quick Background on Page coloring
(will be useful for understanding CP)
12. Page Coloring
[Figure: OS-controlled address translation maps a virtual address (virtual page number + page offset) to a physical address (physical page number + page offset); the physical address indexes a physically indexed cache as cache tag, set index and block offset. The page color bits are the bits shared by the physical page number and the set index, and are under OS control.]
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
Lin et al. HPCA’08
13. Summary of page coloring
Virtual address has: virtual page number and page offset
VA converted to PA by OS-controlled address translation; PA used in a physically indexed cache
Page color bits = common bits between physical page number and set index
Physically indexed cache is divided into multiple regions; each OS page will be cached in one of those regions, indexed by its page color
OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits); a small sketch of extracting the color bits follows this slide
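As a rough illustration of the bit manipulation involved, the following sketch extracts the page color from a physical address under an assumed cache geometry (64 B blocks, 4096 sets, 4 KB pages); the geometry and the sample addresses are illustrative assumptions, not tied to any particular processor.

```c
/* Minimal sketch: extract the page color of a physical address, assuming a
 * physically indexed cache with the bit layout described above. The geometry
 * (64 B blocks, 4096 sets, 4 KB pages) is illustrative. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6    /* 64 B cache blocks */
#define SET_INDEX_BITS    12   /* 4096 sets         */
#define PAGE_OFFSET_BITS  12   /* 4 KB pages        */

/* Color bits are the set-index bits that lie above the page offset. */
static unsigned page_color(uint64_t paddr) {
    unsigned color_bits = BLOCK_OFFSET_BITS + SET_INDEX_BITS - PAGE_OFFSET_BITS;
    return (unsigned)((paddr >> PAGE_OFFSET_BITS) & ((1u << color_bits) - 1));
}

int main(void) {
    uint64_t pages[] = { 0x0000, 0x1000, 0x40000, 0x41000 };
    for (unsigned i = 0; i < 4; i++)
        printf("paddr 0x%llx -> color %u\n",
               (unsigned long long)pages[i], page_color(pages[i]));
    return 0;
}
```

An OS implementing coloring would choose, for each virtual page of an application, a free physical frame whose color lies in that application's assigned set of colors.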
14. Computing # of Page Colors
#PageColor bits = #BlockOffset bits + #SetIndex bits - #PageOffset bits
=> #PageColors = (CacheBlockSize * NumberOfSets) / PageSize
              = CacheSize / (PageSize * CacheAssociativity)
[Figure: the physical address is split into page number and page offset, and also into cache tag, cache set index and block offset; the cache color bits are the overlap between the two splits.]
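To make the arithmetic concrete, here is a minimal standalone sketch that plugs in the parameter values used in the example on the next slide (16-way, 4 MB cache, 64 B blocks, 4 KB pages); the numbers are illustrative, not tied to any particular processor.

```c
/* Minimal sketch: compute the number of page colors from cache and page
 * parameters, using the formulas above. Parameter values are illustrative
 * (they match the 16-way, 4 MB, 64 B-block, 4 KB-page example). */
#include <stdio.h>

int main(void) {
    unsigned cache_size    = 4 * 1024 * 1024;  /* 4 MB   */
    unsigned block_size    = 64;               /* 64 B   */
    unsigned associativity = 16;               /* 16-way */
    unsigned page_size     = 4 * 1024;         /* 4 KB   */

    unsigned num_sets   = cache_size / (block_size * associativity);
    unsigned num_colors = (block_size * num_sets) / page_size;
    /* Equivalently: cache_size / (page_size * associativity). */

    printf("sets = %u, colors = %u, blocks = %u\n",
           num_sets, num_colors, cache_size / block_size);
    return 0;
}
```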
16. Classification 1. Based on granularity
Cache quota can be allocated in terms of ways, sets (colors) or blocks
A 16-way, 4MB cache with block size of 64B and system page size of 4KB => 16 ways, 64 colors, 65536 blocks
Granularity increases from way to set to block
17. Way-based CPT
Simple implementation
Flushing-free reconfiguration
Ease of obtaining way-level profiling information
Sufficient for small N (number of cores)
Harms associativity
Meaningful only if associativity >= 2N (at least one way needs to be allocated to each core)
Requires caches of high associativity => high access latency/power overheads
Requires additional bits to identify the owner core of each block
(A sketch of mask-restricted victim selection follows this slide.)
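To show what allocating ways means in practice, here is a minimal sketch of victim selection restricted by per-core way masks: on a miss, the victim is chosen only among the requesting core's ways, so cores never evict each other's blocks. The masks, the LRU ages and the two-core split are illustrative assumptions, not the mechanism of any specific processor.

```c
/* Minimal sketch of way-based partitioning: the victim on a miss is the
 * oldest block among the ways owned by the requesting core. */
#include <stdint.h>
#include <stdio.h>

#define NUM_WAYS 16

/* Example quota for a 2-core system: core 0 gets ways 0-5, core 1 gets 6-15. */
static const uint16_t way_mask[2] = { 0x003F, 0xFFC0 };

/* age[w] = how long ago way w in this set was used (bigger = older). */
static int pick_victim(const unsigned age[NUM_WAYS], int core) {
    int victim = -1;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!(way_mask[core] & (1u << w)))
            continue;                      /* not this core's way: skip     */
        if (victim < 0 || age[w] > age[victim])
            victim = w;                    /* oldest way owned by this core */
    }
    return victim;
}

int main(void) {
    unsigned age[NUM_WAYS] = { 3, 9, 1, 4, 0, 2, 7, 5, 8, 6, 11, 10, 12, 15, 14, 13 };
    printf("core 0 evicts way %d\n", pick_victim(age, 0));   /* expect way 1  */
    printf("core 1 evicts way %d\n", pick_victim(age, 1));   /* expect way 13 */
    return 0;
}
```

The extra per-block owner bits mentioned above are what lets hardware know which mask to consult (or, as here, the mask of the core issuing the miss).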
18. Set (color)-based CPT
Higher granularity than way-based CP
Amenable to SW control
Requires significant changes to the OS
May complicate virtual memory management
Changes set-indices of many blocks => these blocks need to be flushed or migrated to their new set-indices
To lower this overhead, reduce reconfiguration frequency or the number of recolored pages in each interval, or perform page-migration in a lazy manner
19. Block-based CPT
Provides highest granularity
Highly useful for large N
Obtaining profiling info for block-based allocation is challenging
Some CPTs obtain this info by linearly interpolating the miss-rate curve of way-level monitors => not accurate
May require changes to the RP and additional bits to identify the owner core of each block
20. Classification 2. Whether static or dynamic
Static CPT: determines cache partitions offline (i.e., before application execution)
Dynamic CPT: determines cache partitions at runtime (i.e., while the application is running)
21. Static v/s dynamic CPT
Static CPT:
Useful for testing all possible partitions for small core-counts to find an upper bound on gain
Not feasible with large N
Cannot account for temporal variation in cache behavior
Dynamic CPT:
Suitable for large N
Can account for temporal variation in behavior
Incurs runtime overhead
Unnecessary if app behavior is uniform over time
22. Classification 3. Whether strict or pseudo
Strict (hard) CPT: cache quota is strictly enforced
Pseudo (soft) CPT: cache quota is not strictly enforced; actual allocation may differ from the target quota
Ex.: 8-way cache, quota of App1 = 3 ways, App2 = 5 ways
Strict: enforce [3,5] in all intervals
Pseudo: quota = [3,5] in most intervals but [2,6] or [4,4] in other intervals
23. [Figure: comparison of way-based and block-based CP, contrasting strict partitioning (actual allocation stays close to the target) with pseudo-partitioning (actual allocation deviates from the target). Sanchez et al. ISCA’11]
24. Strict v/s pseudo CPT
Strict CPT:
Important to guarantee QoS and fairness
May lead to inefficient utilization of the cache, esp. when the allocation granularity is large: dead blocks of one core cannot be evicted by another core, even if that core could benefit from those blocks
Pseudo CPT:
May provide most benefits of strict CPT with a much simpler implementation
Allows cores to steal quota from other cores
Actual quota of a core can differ from its target; this problem is esp. severe with large N
25. Classification 4. Whether HW or SW-control
HW-based CPT: CPT is independent of OS
parameters and is implemented in HW
SW-based CPT: Partitioning decision is taken in
SW, CPT depends on OS features (e.g., system page
size)
26. HW-based v/s SW-based CPT
HW-based CPT:
Can be used at fine granularity (~10K cycles)
Reduces profiling and reconfiguration overhead
Adding the required HW support is challenging
SW-based CPT:
SW control is important to account for other processor components, management schemes and system-level goals, e.g., optimizing fairness (v/s cache-level goals, e.g., minimizing miss-rate)
Higher reconfiguration overhead => used at coarse granularity (>1M cycles)
27. Classification 5. Fully v/s partially partitioned
[Figure: in a fully partitioned cache, all capacity is partitioned between cores (private to each core); in a partially partitioned cache, part of the capacity is partitioned and the rest remains shared between cores.]
Fully partitioned: higher capacity and better granularity available for partitioning
Partially partitioned: may provide advantages of both shared and private caches
28. CPTs in real processors
Some Intel processors provide support for way-based
CP [Int16]
Page coloring-based CP [Lin08] in Linux kernel
Intel Xeon processor E5-2600 v3 family: support for
implementing shared cache QoS. It has
“cache monitoring technology” to track cache usage
“cache allocation technology” for allocating cache quotas,
e.g. to avoid cache starvation
AMD Opteron: pseudo-CPT to restrict cache quota of
cache-polluting apps
[Int16: Intel 64 and IA-32 Architectures Developer’s Manual: Vol. 3B http://goo.gl/sw24WL ]
[Lin08: Lin et al. HPCA’08]
30. How to perform CP
Profile apps to find their cache behavior/requirement
Classify apps based on their cache behavior
Determine cache quota of each app
31. Profiling techniques
Collect data about hits/misses to different ways.
Based on that, decide benefit from giving/taking-
away cache space to/from an app
Set-sampling: only a few sets need to be monitored
to estimate the properties of the entire cache
Data can be collected from actual cache or
separate profiling unit
Separate unit only needs tags, not data => size small
By using set-sampling, its size can be reduced greatly
32. [Figure: (a) a cache with 4 ways and K sets, holding tag and data directories (Tj and Dj denote the tag and data of set j); (b) an auxiliary tag directory (ATD), which stores only tags plus hit counters, ordered MRU to LRU; (c) a sampled ATD with sampling ratio 2. Profiling data can be collected from the actual cache (a) or from a separate profiling unit ((b) and (c)).]
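A minimal Python sketch of the monitoring idea above, under assumed data structures rather than the actual hardware: each sampled set keeps only tags in true-LRU order, a hit at recency position i increments a counter H[i], and set-sampling limits which sets are tracked.

```python
# Sketch of a sampled auxiliary tag directory (ATD) with per-way hit counters.

class SampledATD:
    def __init__(self, num_sets, ways, sampling_ratio):
        self.ways = ways
        # tags per monitored set, ordered MRU -> LRU
        self.sets = {s: [] for s in range(0, num_sets, sampling_ratio)}
        self.hits = [0] * ways          # H[i]: hits at recency position i
        self.misses = 0

    def access(self, set_index, tag):
        stack = self.sets.get(set_index)
        if stack is None:               # set not sampled
            return
        if tag in stack:
            pos = stack.index(tag)      # recency (stack) position of the hit
            self.hits[pos] += 1
            stack.remove(tag)
        else:
            self.misses += 1
            if len(stack) == self.ways:
                stack.pop()             # evict the LRU tag
        stack.insert(0, tag)            # move/insert at MRU

    def misses_with(self, ways):
        # Estimated misses if this core were given `ways` ways; counts cover
        # only the sampled sets, so scale by the sampling ratio if needed.
        return self.misses + sum(self.hits[ways:])
```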
34. How to Use Profiling Information (two-core system)
• Find misses avoided with each way in each core
• Find the partitioning which gives the least misses (highest hits)
Xie et al. HIPEAC’10
35. How to Use Profiling Information (four-core system)
Xie et al. HIPEAC’10
36. Cache-behavior Based Application Classification
Cache insensitive: not many accesses to L2
Cache friendly: miss-rate reduces with increasing
L2 quota
Cache fitting: miss-rate reduces with increasing L2
quota and becomes nearly-zero at some point
Streaming: very large working set; shows thrashing
with any RP due to inadequate cache reuse
Thrashing: working set larger than cache capacity;
thrashes an LRU-managed cache, but may benefit
from thrash-resistant RPs.
37. Cache-behavior Based Application Classification
• Reduce cache quota of thrashing/streaming app
• Give higher quota to friendly and fitting app till they
benefit from it
38. Cache-behavior Based Application Classification
Utility = change in miss-rate with cache quota
Low, high and saturating utility
39. Ideas for Reducing Overhead of CPTs
1. A few thrashing apps are responsible for most interference
in a shared cache
=> Restraining just their cache quotas can provide
performance benefits similar to exact (but complex)
CPTs
2. Some works extend CPTs proposed for true-LRU to
pseudo-LRU
3. Use Bloom Filter to reduce storage overhead
41. Utility Based Shared Cache Partitioning
Goal: Maximize system throughput
Observation: Not all threads/applications benefit
equally from caching => simple LRU replacement is not
good for system throughput
Idea: Allocate more cache space to applications that
obtain the most benefit from more space
Qureshi et al. MICRO’06
42. Utility Based Shared Cache Partitioning
Utility U_a^b = Misses with a ways - Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility and saturating-utility applications.]
43. Partitioning Algorithm
Evaluate all possible partitions and select the best
With a ways to core1 and (16-a) ways to core2:
Hitscore1 = H_0 + H_1 + … + H_{a-1} (from UMON1)
Hitscore2 = H_0 + H_1 + … + H_{16-a-1} (from UMON2)
Select a that maximizes (Hitscore1 + Hitscore2) (a sketch of this search follows)
Partitioning done once every 5 million cycles
Qureshi et al. MICRO’06
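A small sketch of this exhaustive two-core search, assuming H1 and H2 are the per-way hit counters (H[i] = hits at recency position i) reported by the two utility monitors; names are illustrative.

```python
# Sketch: score every split of the ways and keep the best one.

def best_two_core_partition(H1, H2, total_ways=16):
    best_a, best_score = 1, -1
    for a in range(1, total_ways):                       # at least one way each
        score = sum(H1[:a]) + sum(H2[:total_ways - a])   # Hitscore1 + Hitscore2
        if score > best_score:
            best_a, best_score = a, score
    return best_a, total_ways - best_a                   # ways for core1, core2
```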
44. Way Partitioning Support
1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the miss-causing app:
If ways_occupied < ways_given: victim is the LRU line of some other app
Else: victim is the LRU line of the miss-causing app
(A victim-selection sketch follows.)
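A minimal sketch of this victim-selection rule, assuming each line records its owner core and the lines of a set are kept in LRU order (index 0 = MRU); the data layout is illustrative, not the hardware structure.

```python
# Sketch: pick the index of the victim line in one set.

def choose_victim(set_lines, miss_core, ways_given):
    """set_lines: list of (core_id, tag) ordered MRU -> LRU."""
    occupied = sum(1 for core, _ in set_lines if core == miss_core)
    if occupied < ways_given[miss_core]:
        # Under quota: take the LRU line belonging to some other core.
        for i in range(len(set_lines) - 1, -1, -1):
            if set_lines[i][0] != miss_core:
                return i
    # At/over quota (or no other core's line present): evict own LRU line.
    for i in range(len(set_lines) - 1, -1, -1):
        if set_lines[i][0] == miss_core:
            return i
    return len(set_lines) - 1
```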
45. Greedy and Lookahead CP Algorithm
Greedy CP algorithm
Iteratively assign a way to an app with highest utility for
that way
Optimal if utility curves of all apps are convex
Lookahead algorithm
Works for general case when utility curves are not convex
In each iteration, for every app: Compute “maximum
marginal utility” (MMU) and least number of ways at
which MMU occurs
App with largest MMU is given # of ways required for
achieving MMU.
Stop iterating when all ways allocated
Qureshi et al. MICRO’06
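A rough Python sketch of the lookahead loop described above, assuming miss_curves[c][w] gives the misses of core c when it holds w ways (0 <= w <= total_ways); tie-breaking and stopping details are simplifications.

```python
# Sketch: lookahead allocation using maximum marginal utility (MMU).

def lookahead_partition(miss_curves, total_ways):
    n = len(miss_curves)
    alloc = [0] * n
    remaining = total_ways
    while remaining > 0:
        best = None                                  # (mmu, core, extra_ways)
        for c in range(n):
            cur = alloc[c]
            for extra in range(1, remaining + 1):
                gain = miss_curves[c][cur] - miss_curves[c][cur + extra]
                mmu = gain / extra                   # marginal utility per way
                # '>' keeps the least extra at which the MMU occurs
                if best is None or mmu > best[0]:
                    best = (mmu, c, extra)
        _, core, extra = best                        # app with largest MMU wins
        alloc[core] += extra
        remaining -= extra
    return alloc
```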
46. Machine-learning based CPT
Perform synergistic management of processor
resources (e.g., L2 cache quota, power budget and
off-chip bandwidth), instead of isolated management
Train a neural network to learn processor
performance as function of allocated resources
Use a stochastic hill-climbing based search heuristic
Use a way-based CPT, per-core DVFS to manage
chip-power distribution and a strategy for
distributing bandwidth between apps
Bitirgen et al. MICRO'08
47. Coloring-based CPT
CPT for performance
Run one interval with current partition and one interval each
with increasing/decreasing quotas of each core (total 2 cores)
Select partition with least misses
CPT for QoS
Target: perf. of app1 is not degraded by more than threshold1, and perf. of
app2 is maximized
if ((IPC of app1 - baselineIPC of app1)< threshold2)
Increase quota of app1 (if already maximum, stall app2)
Else if(IPC of app1 > baselineIPC of app1)
Resume app2 (if stalled) or increase its quota
On change in cache quota, perform lazy page-migration
Lin et al. HPCA'08
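A hedged sketch of the QoS control loop above; the function and argument names, and reading threshold2 as the allowed IPC shortfall of app1, are assumptions for illustration, not the original implementation.

```python
# Sketch: per-interval color-quota adjustment for QoS between two apps.

def qos_adjust(ipc1, baseline_ipc1, colors1, colors2, app2_stalled,
               threshold2, max_colors):
    """Return updated (colors1, colors2, app2_stalled) for the next interval."""
    if ipc1 - baseline_ipc1 < threshold2:
        # app1 degraded beyond the margin: grow its quota; once it holds all
        # colors, stall app2 instead.
        if colors1 < max_colors:
            colors1, colors2 = colors1 + 1, colors2 - 1
        else:
            app2_stalled = True
    elif ipc1 > baseline_ipc1:
        # app1 has slack: resume app2 if stalled, otherwise return a color.
        if app2_stalled:
            app2_stalled = False
        elif colors1 > 1:
            colors1, colors2 = colors1 - 1, colors2 + 1
    # After a quota change, recolored pages would be migrated lazily.
    return colors1, colors2, app2_stalled
```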
48. Coloring-based CPT with HW support
Estimate energy consumption of running apps for
different cache quota
Let #sets in LLC be X
Estimate miss-rate for caches of different numbers of sets,
viz., X, X/2, X/4, X/8, etc., using profiling units
From this, estimate energy consumption for different
cache partitions
Quota of app with small utility is reduced or is increased
only slightly
In some partitions, some colors may not be allocated to
any core
From these, select a partitioning with minimum energy
Power-gate unused cache colors
Mittal et al. TVLSI’14
49. Vantage: A Block-based CPT (1/2)
Divide cache into managed and unmanaged portion
(e.g., 85:15)
Only partition managed portion
Allows maintaining associativity of each partition
Sanchez et al. ISCA’11
50. Vantage: A Block-based CPT (2/2)
Preferentially evict blocks from unmanaged portion
~0 evictions from managed portion
Enforce quotas by matching demotion and
promotion rates
On any eviction, all candidates with eviction
priorities greater than a partition-specific threshold
are demoted
Use time-stamp based LRU to estimate eviction
priorities with low-overhead
Sanchez et al. ISCA’11
52. Partitioning by Controlling Insertion-priority (1/3)
Find cache quota of each app
Quota of an app decides its insertion priority location
Cache hit => block promoted by one step with probability
Z and not promoted with probability 1-Z
Blocks of apps with low priority experience high
competition
Thrashing apps get one way each; also, Z is kept very small
for them (a sketch follows)
Xie et al. ISCA'09
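A minimal sketch of quota-driven insertion and probabilistic promotion on one recency stack; the exact mapping from quota to insertion position is an assumption chosen to match the example on the next slide (8-way cache, quotas of 5 and 3 ways giving insertion locations 3 and 5).

```python
# Sketch: insertion-priority based partitioning on a single set.
import random

def pipp_insert(stack, ways, tag, quota):
    """stack: list ordered MRU -> LRU. Larger quota => insert closer to MRU."""
    if len(stack) == ways:
        stack.pop()                              # evict the LRU block
    pos = min(ways - quota, len(stack))          # assumed quota -> position mapping
    stack.insert(pos, tag)

def pipp_hit(stack, tag, z):
    """On a hit, promote the block by a single position with probability z."""
    i = stack.index(tag)
    if i > 0 and random.random() < z:
        stack[i - 1], stack[i] = stack[i], stack[i - 1]
```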
53. Partitioning by Controlling Insertion-priority (2/3)
[Figure (Xie et al. ISCA'09): worked example on the recency stack of an 8-way cache shared by Core 0 and Core 1. Core 0 inserts new blocks at location 3 and Core 1 at location 5 of the stack; on a hit a block is promoted by one position; the deviation of each core's actual occupancy from its quota is tracked across a sequence of accesses.]
54. Limitations:
many partitions may have low insertion positions =>
severe contention at near-LRU positions
blocks inserted at near-MRU positions become difficult to evict
Partitioning by Controlling Insertion-priority (3/3)
Xie et al. ISCA'09
55. Decay-interval based CPT
Decay interval: if a block is not accessed for one decay
interval, it becomes a candidate for replacement
irrespective of its LRU status.
Tune decay intervals of apps based on their cache
utility and priority
=> blocks of apps with high priority and locality stay
in cache for longer time
Choose decay interval which minimizes total misses
and increases cache usage efficiency
Petoumenos et al. MoBS'06
56. Reuse-distance based CPT
Keep a block in cache only until its expected reuse
happens
This reuse distance is called protecting distance (PD)
At insertion/promotion time, reuse distance of a
block set to PD
On each access to set, PD values for all its blocks
decreased by one; if value reaches 0, block becomes
replacement candidate.
Change PD to control cache quota of an app
In multicores, find PDs for the cores that maximize overall
cache hit rate
Duong et al. MICRO'12
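A rough sketch of protecting-distance bookkeeping, with assumed per-block fields rather than the original hardware counters; on a miss an expired (PD = 0) block is the preferred victim, falling back to the least-protected block in this simplification.

```python
# Sketch: protecting-distance (PD) based replacement within one set.

def pd_access(cache_set, ways, tag, app, pd_of):
    """cache_set: list of dicts {'tag','app','pd'}. Returns True on a hit."""
    for blk in cache_set:                      # age every block of the set
        if blk["pd"] > 0:
            blk["pd"] -= 1
    for blk in cache_set:
        if blk["tag"] == tag:
            blk["pd"] = pd_of[app]             # reset PD on hit (promotion)
            return True
    if len(cache_set) == ways:                 # miss with a full set
        victim = min(cache_set, key=lambda b: b["pd"])   # prefer expired blocks
        cache_set.remove(victim)
    cache_set.append({"tag": tag, "app": app, "pd": pd_of[app]})
    return False
```

Raising or lowering pd_of[app] changes how long that app's blocks are protected, which is how the per-app cache quota is controlled.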
57. Performance metric-based
CPTs
Next slide discusses limitations of miss-rate guided
CPTs. We then summarize CPTs which are guided
directly by some performance metric
58. Limitation of miss-rate guided CPTs
Latency of different misses may be different due to
instantaneous MLP and NUCA effects, however,
most CPTs treat different misses equally
[Figure: timeline of (a) an isolated L2 miss, where the pipeline stalls for the full memory latency before commit restarts, vs. (b) clustered L2 misses, whose memory latencies overlap; the isolated miss has a higher average latency, the clustered misses a lower average latency.]
Moreto et al. HiPEAC'08
59. Using MLP penalty of misses
Find cache misses for different # of ways
Assign higher MLP penalty to isolated misses than
clustered misses
Compute perf. impact of a cache miss converted into
hit and vice versa on an increase/decrease in cache
size, respectively
From this, find length of the miss-cluster
L2 instruction misses stall fetch => they have a fixed
miss latency and MLP penalty
From all possible partitions, select one with
minimum total "MLP penalty"
Moreto et al. HiPEAC'08
60. Using application slowdown model
This model measures app slowdown due to
interference at shared cache and main memory
Measure slowdown for every app at different # of
ways
Compute marginal slowdown utility (MSU) as
MSU = (slowdown with W+K ways - slowdown with W ways) / K
Partition using lookahead algorithm, except that use
MSU instead of marginal miss utility
Subramanian et al. MICRO'15
61. Using stall rate curves
Use "instruction retirement stall rate curves“ (SRC):
stall cycles due to memory latency at various L2 sizes
Get SRC directly from HW counters on real system
SRC is better than miss-rate curve in guiding CPT,
since SRC accounts for several factors, e.g.,
L2 miss-rate
impact of L2 misses on instruction retirement stall
memory bus contention
variable latencies of lower levels of memory hierarchy
(e.g., L3 and main memory)
Tam et al. WIOSCA'07
62. Using Memory Bandwidth
Apps with large miss-count may not consume largest
off-chip bandwidth if their memory accesses are not
clustered
=> Partitioning based on bandwidth can provide
better performance than based on misses
Through offline analysis, find partition with least
overall bandwidth requirement
Reduce cache quota of apps with low bandwidth
requirement
Yu et al. DAC'10
63. CPTs for Various Optimization Objectives
Example objectives:
• Fairness
• Load-balancing
• Implementing priorities
• Energy
• Limiting peak power of LLC
64. CPT for Ensuring Fairness
Iteratively perform two steps
1. Quota allocation: evaluate the fairness metric for all apps
If the gap between the apps with the least and the most
unfair impact of CP > threshold1:
Transfer some cache space from the app with lower
unfairness to the one with higher unfairness
Exclude these 2 apps and repeat the step for the remaining apps (see the sketch after this slide)
2. Adjustment: If reduction in miss-rate of app receiving
increased quota is more than threshold2
Commit decision made in quota allocation step
Else
Reverse the decision
Kim et al. PACT'04
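A hedged sketch of the quota-allocation step, assuming unfairness[a] is the chosen fairness metric for app a (e.g., its slowdown under sharing) and quota[a] its current cache share; the step size and tie handling are illustrative.

```python
# Sketch: iterative transfer of cache space from least- to most-affected apps.

def fairness_reallocate(unfairness, quota, threshold1, step=1):
    remaining = set(unfairness)
    while len(remaining) >= 2:
        worst = max(remaining, key=lambda a: unfairness[a])
        best = min(remaining, key=lambda a: unfairness[a])
        if unfairness[worst] - unfairness[best] <= threshold1:
            break                               # remaining apps are fair enough
        if quota[best] > step:                  # donor must keep some quota
            quota[best] -= step
            quota[worst] += step
        remaining -= {worst, best}              # exclude this pair, repeat
    return quota
```

The adjustment step described above would then commit or reverse each transfer depending on whether the receiving app's miss-rate actually improved by more than threshold2.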
65. Using Feedback-Control theory
Assume: IPC targets are given for all apps
Find new targets to maximize cache utilization
Find cache quota to achieve those targets
If total cache quota exceeds cache size
For QoS: reduce quota of low-priority apps
For fairness: reduce quota in proportion to current quota
App-level controller (a PID controller) finds quota
required for next epoch to achieve perf targets based
on perf in previous epoch with its quota
Srikantaiah et al. MICRO'09 (PID = proportional integral derivative)
66. Limiting LLC power and achieving fairness/QoS (1/2)
Goal: Limiting maximum power of LLC and
achieving fair or differentiated cache access latencies
between different apps
Use a two-level synergistic controller design
1. LLC power-controller (every 10M cycles)
Limits maximum LLC power for a given budget by
controlling number of active LLC banks
Remaining banks are power-gated
Wang et al. TC'2012
67. Limiting LLC power and achieving fairness/QoS (2/2)
2. Latency controller (every 1M cycles)
Controls ratio of cache access latencies between two
apps on every pair of neighboring cores.
For fairness: same latencies for all apps
For QoS: shorter latencies for high-priority apps
Finds cache-bank quota of each app
Their technique provides theoretical guarantee of
accurate control and system stability.
Controllers are designed as PI controllers
Wang et al. TC'2012
68. Changing quotas in different intervals (1/2)
Allocate different sized partitions to different apps in
different intervals
Cache quotas are expanded and contracted in
different intervals
[Figure: four apps (0-3) sharing the cache. (a) Fairness-oriented spatial partitioning: IPC=0.26, WS=1.23, FS=1.0. (b) Throughput-oriented spatial partitioning: IPC=0.52, WS=2.42, FS=1.22. (c) Multiple time-sharing partitioning (the proposed technique, which time-shares several spatial partitions for both throughput and fairness): IPC=0.52, WS=2.42, FS=1.97. WS/FS = weighted/fair speedup.]
69. Changing quotas in different intervals (2/2)
A thrashing app already has low throughput, so
reducing its quota in a contraction epoch does not
reduce its perf much…
but increasing its quota in an expansion epoch boosts
its perf greatly, which compensates for the slowdown in
contraction epochs.
Expansion opportunity is given to different apps
equally for fairness
in differentiated manner for QoS
70. CPTs for load-balancing in Multithreaded Apps (1/2)
[Figure: four cores C0-C3 sharing a 32-way L2. (a) Shared cache: one thread lies on the critical path. (b) Partitioned cache: threads 0-3 receive 3, 16, 8 and 5 L2 ways respectively, with the critical-path thread receiving the largest share.]
Critical thread can be accelerated by giving higher
cache quota to it => bottleneck removed
71. 1st CPT
Record CPIs of all threads
Allocate more ways to threads with higher CPIs
Limitation: thread's cache sensitivity is not taken into
account
2nd CPT
For each thread, build a model of how CPI varies with
cache quota
Do curve fitting by “cubic spline interpolation”
Repeatedly transfer one way from fastest thread to slowest
thread until some other thread becomes slowest
At this point, revert cache allocation by one-step and
accept this partitioning
Muralidhara et al. IPDPS'10
CPTs for load-balancing in Multithreaded Apps (2/2)
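A simplified sketch of the second scheme, assuming cpi_model[t] is a callable (e.g., the fitted spline) predicting thread t's CPI for a given number of ways, and alloc maps each thread to its current way count; names are illustrative.

```python
# Sketch: move ways from the fastest thread to the slowest (critical) thread.

def rebalance_ways(cpi_model, alloc):
    def cpi(t):
        return cpi_model[t](alloc[t])
    slowest = max(alloc, key=cpi)                  # current critical thread
    while True:
        fastest = min(alloc, key=cpi)
        if fastest == slowest or alloc[fastest] <= 1:
            break
        prev = dict(alloc)
        alloc[fastest] -= 1                        # take one way from the fastest
        alloc[slowest] += 1                        # give it to the critical thread
        if max(alloc, key=cpi) != slowest:         # another thread became slowest:
            alloc = prev                           # revert one step and accept
            break
    return alloc
```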
72. Removing imbalance due to process variation (1/2)
[Figure: frequencies of the cores of a 4-core processor, 1.8, 2.1, 2.4 and 2.6 GHz, differing due to process variation.]
For multithreaded programs
with synchronization barriers,
the slowest core will limit the
performance of other cores
Kozhikkottu et al. DAC'14
73. Removing imbalance due to process variation (2/2)
Use cache partitioning to give a higher cache quota to the slower cores
[Figure: the same 4-core processor (1.8, 2.1, 2.4, 2.6 GHz); PV-aware L2 cache partitioning allocates 20, 6, 4 and 2 L2 ways to these cores respectively, yielding higher throughput.]
Kozhikkottu et al. DAC'14
74. Saving leakage energy (1/2)
Locality of an app = (accesses to LRU blocks)/(accesses to
MRU blocks)
Most hits at MRU => app needs few ways to achieve high
hit rate and vice versa
Compare this ratio for two apps to decide cache quota
Kotera et al. HiPEAC'11
75. Saving leakage energy (2/2)
[Figure: the cache ways divided into a region allocated to core 0, a region allocated to core 1, and a power-gated region.]
• Compare the above locality ratio with thresholds to decide the number of
ways to power-gate
• Insight: if the total cache requirement of the cores < available cache
=> power-gate the remaining cache to save leakage energy
76. Saving dynamic energy
[Figure: an 8-set, 8-way cache shared by cores 0-3, shown as (a) a shared unpartitioned cache, (b) a shared partitioned cache, and (c) a shared partitioned cache with way-aligned data; only (c) saves dynamic energy.]
Ensure way-alignment of the data of each core: on an access by a core,
only that core's ways need to be accessed => dynamic energy saved
Sundararajan et al. HPCA’12
77. CPTs in various contexts
If cache is NUCA: try reducing both
misses and
hit latency (by allocating cache banks to closest
core)
If main memory is PCM: perform CP such that
both misses and writebacks are minimized
since PCM has high write energy/latency and low
write endurance
79. Integration with processor partitioning
• When the variation in degree of TLP between apps is high,
equally distributing processors between them is not optimal
• => Perform both
• processor partitioning (every 65M cycles) and
• cache partitioning (every 10M cycles)
Srikantaiah et al. SC'09
80. Integration with DRAM-bank partitioning
In the physical address, a few bits are common between the LLC set-index
bits and the DRAM bank-index bits
Thus, we can perform cache-only, bank-only or
combined partitioning, based on which is better
[Figure: physical frame number and page offset of a physical address, marking the bank-only bits (21-22), cache-only bits (16-18) and overlapped bits (14-15) on a processor with 8GB memory and 64 banks.]
Liu et al. ISCA'14
81. Integration with Bandwidth Partitioning
Whether BW partitioning can improve perf depends on
difference in miss frequencies between apps
With decreasing bandwidth, scope of perf improvement
increases
=> CP may lower the impact of BW partitioning on perf
By reducing difference in miss frequencies of apps and
By reducing total cache misses which relieves BW pressure
But, if CP increases difference in miss frequencies, it
increases impact of BW partitioning on performance.
E.g., for cache insensitive apps, CP cannot improve perf,
but by changing difference in miss frequencies, CP
enhances effectiveness of BW partitioning in boosting perf
Liu et al. HPCA'10
82. Integration with DVFS (1/3)
Model problem of dividing shared resource (chip
power budget and LLC capacity) between apps as a
dynamic distributed market
each app (core) is an agent
resource-prices change based on “demand” and “supply”
Initially:
each agent has a purchasing budget and builds a
performance model as function of allocated resource
A global arbiter fixes initial prices of all resources
Wang et al. HPCA'15
83. Integration with DVFS (2/3)
Iteratively: Each agent bids for the resources to
maximize its perf
Based on the bids, arbiter increases and reduces the price
of resources in high and low demand, respectively.
Agents bid again under new prices
Iteration stops when
change in price within iterations is very small or
a threshold # of iterations done or
no improvement in perf of an agent on changing the bid
At this point, perform resource-allocation
84. Integration with DVFS (3/3)
Agents work in decentralized manner and only
centralized function is pricing scheme => this
technique scales well to >64 cores
For throughput: assign larger budget to agents
with higher marginal utility
For fairness: assign equal budgets to all agents
Find cache utility by seeing miss-rate change and
power utility by changing frequency using DVFS
85. Integration with RP selection (1/2)
CPTs and thrash-resistant RPs are complementary
RPs temporally share LLC based on apps’ “locality”
CPTs spatially divide LLC based on apps’ “utility”
=> Thrash-resistant RPs good for workloads with
poor-locality apps
CPTs good for workloads with apps with widely
different utility values
Idea: Perform CPT with RP-selection to optimize
both locality and utility
Zhan et al. TC'14
86. Integration with RP selection (2/2)
[Figure: recency stack of an 8-way cache shared by Core 0 (Quota0 = 5 ways, RP0 = BIP) and Core 1 (Quota1 = 3 ways, RP1 = LRU); a decision module sets each core's insertion position and replacement point, choosing between LRU and BIP behavior with probabilities p and 1-p.]
Find hits at different way-counts for LRU and BIP
From this, find optimal CP and optimal RP
For CP use lookahead algorithm
RP chosen for a core is implemented in its cache
portion
87. References
S. Mittal, “A Survey of Techniques for Cache
Partitioning in Multicore Processors”, ACM
Computing Surveys, 2017 (link)