Cache Partitioning Techniques
SPARSH MITTAL
IIT HYDERABAD, INDIA
Some acronyms
 CP = cache partitioning, CPT = CP technique
 HW = hardware, SW = software
 BW = bandwidth
 LLC = last level cache
 RP = replacement policy
 MLP/TLP = memory/thread level parallelism
 IPC = instructions per cycle
 NUCA = non-uniform cache architecture
 QoS = quality-of-service
 Perf = performance
N denotes number of cores
Motivation for Cache Management in Multicores
 As the number of cores increases, multicore
performance does not scale linearly, due to cache
contention and other factors
 The memory requirement of applications is also increasing
=> Cache management has become extremely
important in multicores
Private v/s shared cache
Private Caches
 Avoid interference
 Cannot account for inter-
and intra-application
variation in cache
requirements
 Limited capacity =>
cannot reduce miss-rate
effectively
Shared cache
 Higher total capacity =>
can reduce miss-rate
 Interference b/w apps on
using traditional cache
management policies =>
performance loss,
unfairness and lack of
QoS
Use CP in shared cache => capacity advantage of shared
cache, performance isolation advantage of private cache
Examples of processors with shared LLC
 IBM POWER7
 Intel Core i7
 AMD Phenom X4
 Sun Niagara T2
We first provide background on CPTs and then discuss
several CPTs
Benefits and Challenges of CP
Potential of CP
 Different cache demand and performance sensitivity
 Of different apps
 Of different threads in a multithreaded app
 Further, performance of cores may differ due to
 differences in cache latencies due to NUCA design
 differences in core frequencies due to process variation
 CP can compensate for these differences!
 CP can also optimize for fairness and QoS
Potential of CP
 CP avoids interference & provides higher effective
cache capacity
 Reduces miss-rate and bandwidth contention
 This may benefit even those applications whose cache
quotas are reduced!
 Saves energy by
 Reducing execution time
 Allowing unused cache to be power-gated
Challenges of CP
 The number of possible partitions increases
exponentially with core-count
 Simple schemes become ineffective
 Finding partitioning with minimum overall cache
miss-rate (i.e., optimal partitioning) is NP-hard and
yet, optimal partitioning may not be fair.
 Naive CPTs: large profiling and reconfiguration
overhead
 Hardware support required for implementing CPTs
(e.g., true-LRU) may be too-costly or unavailable
Challenges of CP
 Reduction in miss-rate brought by CPT may not
translate into better performance
 When there is performance bottleneck due to load-
imbalance, BW congestion, etc.
 CP useful only for LLC-sensitive apps; unnecessary
or harmful for small-footprint apps
 CP unnecessary for large-sized caches
A Quick Background on Page coloring
(will be useful for understanding CP)
Page Coloring
[Figure: a virtual address (virtual page number + page offset) is translated to a physical address (physical page number + page offset), which indexes a physically indexed cache (cache tag, set index, block offset). The page color bits are the bits common to the physical page number and the set index, and are under OS control.]
•Physically indexed caches are divided into multiple regions (colors).
•All cache lines in a physical page are cached in one of those regions (colors).
Lin et al. HPCA’08
Summary of page coloring
 Virtual address has: Virtual page number and page offset
 VA converted to PA by OS-controlled address translation
 PA used in a physically indexed cache.
 Page color bits = common bits between physical page
number and set index
 Physically indexed cache is divided into multiple regions.
Each OS page will be cached in one of those regions, indexed
by the page color
 OS can control the page color of a virtual page through
address mapping (by selecting a physical page with a
specific value in its page color bits).
Computing # of Page colors
#PageColorBits = #BlockOffsetBits + #SetIndexBits - #PageOffsetBits
=> #PageColors = (CacheBlockSize * NumberOfSets)/PageSize
= CacheSize/(PageSize * CacheAssociativity)
[Figure: the physical address splits into page number and page offset, and into cache tag, cache set index, and block offset; the cache color bits lie in the overlap.]
Classification of CPTs
Classification 1. Based on granularity
 Cache quota can be allocated in terms of ways, sets
(colors) and blocks
 A 16-way, 4MB cache with block size of 64B and
system page size of 4KB => 16 ways, 64 colors,
65536 blocks
Granularity increases from way to set to block
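The worked example above (16 ways, 64 colors, 65536 blocks) follows directly from the formulas; a minimal sketch to check the arithmetic (function name is illustrative):

```python
def allocation_units(cache_size, assoc, block_size, page_size):
    """Number of allocation units at each granularity: ways, colors, blocks."""
    num_blocks = cache_size // block_size
    # Page colors = common bits between physical page number and set index:
    # (block_size * num_sets)/page_size = cache_size/(page_size * assoc)
    num_colors = cache_size // (page_size * assoc)
    return assoc, num_colors, num_blocks

# 16-way, 4MB cache, 64B blocks, 4KB pages
ways, colors, blocks = allocation_units(4 * 1024 * 1024, 16, 64, 4096)
print(ways, colors, blocks)  # 16 64 65536
```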
Way-based CPT
 Simple implementation
 Flushing-free reconfiguration
 Ease of obtaining way-level profiling information
 Sufficient for small N (number of cores)
 Harms associativity
 Meaningful only if associativity >= 2N (at least one
way needs to be allocated to each core)
 Requires caches of high associativity => high access
latency/power overheads
 Requires additional bits to identify owner core of
each block
Set (color)-based CPT
 Higher granularity than way-based CP
 Amenable to SW control.
 Requires significant changes to OS
 May complicate virtual memory management
 Changes set-indices of many blocks => these blocks
need to be flushed or migrated to new set-indices
 To lower this overhead: reduce reconfiguration
frequency, limit the number of recolored pages in each
interval, or perform page migration in a lazy manner
Block-based CPT
 Provides highest granularity
 Highly useful for large N
 Obtaining profiling info for block-based allocation is
challenging
 Some CPTs obtain this info by linearly interpolating
miss-rate curve of way-level monitors => not accurate
 May require changes to RP and additional bits to
identify owner core of each block
Classification 2. Whether static or dynamic
 Static CPT: determine cache partitions offline (i.e.,
before application execution)
 Dynamic CPT: determine cache partitions
dynamically (i.e., at runtime, while the application is
running)
Static v/s dynamic CPT
Static CPT
 Useful for testing all
possible partitions for
small core-count to find
upper bound on gain
 Not feasible with large N
 Cannot account for
temporal variation in
cache behavior
Dynamic CPT
 Suitable for large N
 Can account for temporal
variation in behavior
 Incurs runtime overhead
 Unnecessary if app
behavior uniform over
time
Classification 3. Whether strict or pseudo
 Strict (hard) CPT: cache quota is strictly enforced
 Pseudo (soft) CPT: cache quota not strictly
enforced, actual allocation may differ from target quota
 Ex.: 8-way cache, quota App1 =3 ways, App2 = 5 ways
 Strict: Enforce [3,5] in all intervals
 Pseudo: Quota = [3,5] in most intervals but [2,6] or
[4,4] in other intervals
[Figure (Sanchez et al. ISCA’11): strict partitioning keeps the actual allocation close to the target, while pseudo-partitioning lets the actual allocation deviate from the target; illustrated for way-based and block-based CP.]
Strict v/s pseudo CPT
Strict CPT
 Important to guarantee
QoS and fairness
 May lead to inefficient
utilization of cache, esp.
when allocation
granularity is large.
 Dead blocks of one core cannot
be evicted by another core,
even if it could benefit from
those blocks
Pseudo CPT
 May provide most
benefits of strict-CPT
with much simpler
implementation
 Allow cores to steal
quotas of other cores
 Actual quota of a core
can differ from target
 This problem esp. severe
with large N
Classification 4. Whether HW or SW-control
 HW-based CPT: CPT is independent of OS
parameters and is implemented in HW
 SW-based CPT: Partitioning decision is taken in
SW, CPT depends on OS features (e.g., system page
size)
HW-based v/s SW-based CPT
HW-based CPT
 Can be used at fine-
granularity (~10K cycles)
 Reduces profiling and
reconfiguration overhead
 Adding required HW
support is challenging
SW-based CPT
 SW control is important to
account for other
processor components,
management schemes and
system-level goals, e.g.
optimizing fairness (v/s
cache-level goals e.g.
minimizing miss-rate).
 Higher reconfig. overhead
=> can be used at coarse
granularity (>1M cycles)
Classification 5. Fully v/s partially partitioned
 Fully partitioned: the entire cache is partitioned between
cores (private to each core) => higher capacity and better
granularity available for partitioning
 Partially partitioned: part of the cache is partitioned
between cores and the rest remains shared between cores =>
may provide advantages of both shared and private caches
CPTs in real processors
 Some Intel processors provide support for way-based
CP [Int16]
 Page coloring-based CP [Lin08] in Linux kernel
 Intel Xeon processor E5-2600 v3 family: support for
implementing shared cache QoS. It has
 “cache monitoring technology” to track cache usage
 “cache allocation technology” for allocating cache quotas,
e.g. to avoid cache starvation
 AMD Opteron: pseudo-CPT to restrict cache quota of
cache-polluting apps
[Int16: Intel 64 and IA-32 Architectures Developer’s Manual: Vol. 3B http://goo.gl/sw24WL ]
[Lin08: Lin et al. HPCA’08]
Key Ideas and Strategies for
Performing CP
How to perform CP
 Profile apps to find their cache behavior/requirement
 Classify apps based on their cache behavior
 Determine cache quota of each app
Profiling techniques
 Collect data about hits/misses to different ways.
Based on that, decide benefit from giving/taking-
away cache space to/from an app
 Set-sampling: only a few sets need to be monitored
to estimate properties of the entire cache
 Data can be collected from actual cache or
separate profiling unit
 Separate unit only needs tags, not data => size small
 By using set-sampling, its size can be reduced greatly
[Figure: (a) a cache with 4 ways and K sets (tag and data directories; Tj and Dj denote the tag and data for set j); (b) an auxiliary tag directory (ATD), which stores only tags; (c) a sampled ATD with sampling ratio 2. Per-way hit counters are attached. Profiling data can be collected from the actual cache or from a separate profiling unit ((b) and (c)).]
Utility Monitors (Qureshi et al. MICRO’06)
• Find misses avoided with each way in each core
• Find the partitioning which gives least misses (highest hits)
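A minimal software sketch of such a utility monitor: each hit's position in a true-LRU stack increments a per-way counter, so hits[i] estimates the extra hits contributed by the (i+1)-th way (one set shown; set-sampling and hardware details omitted; names are illustrative):

```python
class UMON:
    """Per-core way-utility monitor over a true-LRU stack (one set)."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.stack = []          # index 0 = MRU position
        self.hits = [0] * assoc  # hits[i]: hits at LRU stack distance i
        self.misses = 0

    def access(self, tag):
        if tag in self.stack:
            self.hits[self.stack.index(tag)] += 1
            self.stack.remove(tag)
        else:
            self.misses += 1
            if len(self.stack) == self.assoc:
                self.stack.pop()       # evict LRU
        self.stack.insert(0, tag)      # move/insert at MRU

    def hits_with_ways(self, n):
        # Hits this core would see if given only n ways of this set
        return sum(self.hits[:n])

mon = UMON(4)
for tag in ['A', 'B', 'A', 'C', 'B', 'A']:
    mon.access(tag)
print(mon.hits, mon.misses)  # [0, 1, 2, 0] 3
```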
How to Use Profiling Information
[Figures (Xie et al. HIPEAC’10): using per-core miss-rate curves to pick partitions in a two-core and a four-core system.]
Cache-behavior Based Application Classification
 Cache insensitive: not many accesses to L2
 Cache friendly: miss-rate reduces with increasing
L2 quota
 Cache fitting: miss-rate reduces with increasing L2
quota and becomes nearly-zero at some point
 Streaming: very large working set; show thrashing
with any RP due to inadequate cache reuse
 Thrashing: working set larger than cache capacity;
they thrash LRU-managed cache, but may benefit
from thrash-resistant RPs.
Cache-behavior Based Application Classification
• Reduce cache quota of thrashing/streaming app
• Give higher quota to friendly and fitting app till they
benefit from it
Cache-behavior Based Application Classification
 Utility = change in miss-rate with cache quota
 Low, high and saturating utility
Ideas for Reducing Overhead of CPTs
1. A few thrashing apps are responsible for most interference
in a shared cache
=> Restraining just their cache quotas can provide
performance benefits similar to exact (but complex)
CPTs
2. Some works extend CPTs proposed for true-LRU to
pseudo-LRU
3. Use Bloom Filter to reduce storage overhead
Way, color and block-based
CPTs
Utility Based Shared Cache Partitioning
 Goal: Maximize system throughput
 Observation: Not all threads/applications benefit
equally from caching => simple LRU replacement is not
good for system throughput
 Idea: Allocate more cache space to applications that
obtain the most benefit from more space
Qureshi et al. MICRO’06
Utility Based Shared Cache Partitioning
Utility U(a→b) = Misses with a ways – Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications.]
Partitioning Algorithm
Evaluate all possible partitions and select the best
 With a ways to core1 and (16−a) ways to core2:
Hitscore1 = H0 + H1 + … + H(a−1) ---- from UMON1
Hitscore2 = H0 + H1 + … + H(16−a−1) ---- from UMON2
 Select a that maximizes (Hitscore1 + Hitscore2)
 Partitioning done once every 5 million cycles
Qureshi et al. MICRO’06
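For two cores this exhaustive search is tiny; a sketch using the per-way hit counters from the two UMONs (counter values are illustrative):

```python
def best_partition(h1, h2, total_ways=16, min_ways=1):
    """Return (a, score): a ways to core1, the rest to core2,
    maximizing Hitscore1 + Hitscore2 (exhaustive two-core search)."""
    best_a, best_score = None, -1
    for a in range(min_ways, total_ways - min_ways + 1):
        score = sum(h1[:a]) + sum(h2[:total_ways - a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a, best_score

# h[i] = hits contributed by way i+1, as reported by the utility monitors
h1 = [90, 50, 20, 10] + [1] * 12          # saturating utility
h2 = [60, 55, 50, 45, 40, 30] + [2] * 10  # high utility
print(best_partition(h1, h2))  # (4, 462)
```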
Way Partitioning Support
1. Each line has core-id bits
2. On a miss, count ways_occupied in the set by the
miss-causing app:
 If ways_occupied < ways_given: victim is the LRU line
from the other app
 Else: victim is the LRU line from the miss-causing app
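This victim-selection rule can be sketched as follows (a simplified software model: a set is a list of (core_id, tag) pairs ordered MRU→LRU, and `ways_given` holds the partitioning decision; names are illustrative):

```python
def find_victim(cache_set, miss_core, ways_given):
    """Pick the victim line for a miss by `miss_core` under way partitioning.
    `cache_set`: list of (core_id, tag), index 0 = MRU, last = LRU."""
    occupied = sum(1 for core, _ in cache_set if core == miss_core)
    if occupied < ways_given[miss_core]:
        # Under quota: evict the LRU line belonging to any OTHER core
        candidates = [i for i, (c, _) in enumerate(cache_set) if c != miss_core]
    else:
        # At/over quota: evict this core's own LRU line
        candidates = [i for i, (c, _) in enumerate(cache_set) if c == miss_core]
    return max(candidates)  # largest index = least recently used

s = [(0, 'A'), (1, 'X'), (0, 'B'), (1, 'Y')]
print(find_victim(s, 0, {0: 3, 1: 1}))  # core 0 under quota -> evicts (1,'Y'): 3
```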
Greedy and Lookahead CP Algorithm
Greedy CP algorithm
 Iteratively assign a way to an app with highest utility for
that way
 Optimal if utility curves of all apps are convex
Lookahead algorithm
 Works for general case when utility curves are not convex
 In each iteration, for every app: Compute “maximum
marginal utility” (MMU) and least number of ways at
which MMU occurs
 App with largest MMU is given # of ways required for
achieving MMU.
 Stop iterating when all ways allocated
Qureshi et al. MICRO’06
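The lookahead steps above can be sketched directly from per-app hit curves (a software sketch of the published algorithm; the curves here are illustrative):

```python
def lookahead_partition(hit_curves, total_ways):
    """Lookahead allocation: repeatedly give the app with the largest
    maximum marginal utility (MMU) the least number of ways that
    achieves it. hit_curves[app][w] = hits with w ways (w = 0..total_ways)."""
    alloc = [0] * len(hit_curves)
    remaining = total_ways
    while remaining > 0:
        best_app, best_mmu, best_k = 0, -1.0, 1
        for app, curve in enumerate(hit_curves):
            for k in range(1, remaining + 1):
                mu = (curve[alloc[app] + k] - curve[alloc[app]]) / k
                if mu > best_mmu:
                    best_app, best_mmu, best_k = app, mu, k
        alloc[best_app] += best_k
        remaining -= best_k
    return alloc

# App 0 has a non-convex curve (a "cliff" at 3 ways), where greedy would fail:
print(lookahead_partition([[0, 1, 2, 30, 31], [0, 10, 15, 18, 20]], 4))  # [3, 1]
```

With convex curves, greedy and lookahead coincide; the non-convex example shows why lookahead considers multi-way jumps.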
Machine-learning based CPT
 Perform synergistic management of processor
resources (e.g., L2 cache quota, power budget and
off-chip bandwidth), instead of isolated management
 Train a neural network to learn processor
performance as function of allocated resources
 Use a stochastic hill-climbing based search heuristic
 Use a way-based CPT, per-core DVFS to manage
chip-power distribution and a strategy for
distributing bandwidth between apps
Bitirgen et al. MICRO'08
Coloring-based CPT
 CPT for performance
 Run one interval with the current partition and one interval each
with increased/decreased quota of each core (for two cores)
 Select partition with least misses
 CPT for QoS
 Target: perf. of app1 is not degraded >= threshold1 and perf. of
app2 is maximized
 if ((IPC of app1 − baselineIPC of app1) < threshold2)
Increase quota of app1 (if already at maximum, stall app2)
 Else if (IPC of app1 > baselineIPC of app1)
Resume app2 (if stalled) or increase its quota
 On change in cache quota, perform lazy page-migration
Lin et al. HPCA'08
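The QoS policy above can be sketched as a control loop (a simplification after Lin et al. HPCA'08; helper names, the state dictionary, and the stall mechanism are illustrative assumptions):

```python
def qos_step(state, ipc1, baseline_ipc1, threshold2, max_colors):
    """One reconfiguration step: protect app1's performance target,
    give whatever cache app1 does not need to app2."""
    if ipc1 - baseline_ipc1 < threshold2:      # app1 below its target
        if state['quota1'] < max_colors:
            state['quota1'] += 1               # grow app1's color quota
        else:
            state['app2_stalled'] = True       # last resort: stall app2
    elif ipc1 > baseline_ipc1:                 # app1 has slack
        if state['app2_stalled']:
            state['app2_stalled'] = False      # resume app2
        elif state['quota1'] > 1:
            state['quota1'] -= 1               # return colors to app2
    return state

s = {'quota1': 4, 'app2_stalled': False}
s = qos_step(s, ipc1=0.8, baseline_ipc1=1.0, threshold2=-0.1, max_colors=8)
print(s)  # app1 degraded beyond threshold -> quota grows to 5
```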
Coloring-based CPT with HW support
 Estimate energy consumption of running apps for
different cache quota
 Let #sets in LLC be X
 Estimate miss-rate for caches with different numbers of sets,
viz., X, X/2, X/4, X/8, etc., using profiling units
 From this, estimate energy consumption for different
cache partitions
 Quota of app with small utility is reduced or is increased
only slightly
 In some partitions, some colors may not be allocated to
any core
 From these, select a partitioning with minimum energy
 Power-gate unused cache colors
Mittal et al. TVLSI’14
Vantage: A Block-based CPT (1/2)
 Divide cache into managed and unmanaged portion
(e.g., 85:15)
 Only partition managed portion
 Allows maintaining associativity of each partition
Sanchez et al. ISCA’11
Vantage: A Block-based CPT (2/2)
 Preferentially evict blocks from unmanaged portion
 ~0 evictions from managed portion
 Enforce quotas by matching demotion and
promotion rates
 On any eviction, all candidates with eviction
priorities greater than a partition-specific threshold
are demoted
 Use time-stamp based LRU to estimate eviction
priorities with low-overhead
Sanchez et al. ISCA’11
Pseudo-partitioning techniques
Partitioning by Controlling Insertion-priority (1/3)
 Find cache quota of each app
 Quota of an app decides its insertion priority location
 Cache hit => block promoted by one step with probability
Z and not promoted with probability 1−Z
 Blocks of apps with low-priority experience high
competition
 Thrashing apps get one way each. Also, Z is very small
for them
Xie et al. ISCA'09
[Figure (Xie et al. ISCA'09): worked example on an 8-way recency stack shared by Core 0 and Core 1, showing a sequence of insertions at each core's insertion location (e.g., location 3 for Core 0 and location 5 for Core 1) and single-step promotions on hits; the actual occupancy of each core deviates from its quota over the access sequence.]
Partitioning by Controlling Insertion-priority (2/3)
 Limitations:
 many partitions may have low insertion positions =>
 severe contention at near-LRU position
 difficult-to-evict blocks at near-MRU position
Partitioning by Controlling Insertion-priority (3/3)
Xie et al. ISCA'09
Decay-interval based CPT
 Decay interval: if a block is not accessed for the decay
interval, it becomes a candidate for replacement
irrespective of its LRU status
 Tune decay intervals of apps based on their cache
utility and priority
 => blocks of apps with high priority and locality stay
in cache for longer time
 Choose decay interval which minimizes total misses
and increases cache usage efficiency
Petoumenos et al. MoBS'06
Reuse-distance based CPT
 Keep a block in cache only until its expected reuse
happens
 This reuse distance is called protecting distance (PD)
 At insertion/promotion time, reuse distance of a
block set to PD
 On each access to set, PD values for all its blocks
decreased by one; if value reaches 0, block becomes
replacement candidate.
 Change PD to control cache quota of an app
 In multicore, find PDs for all cores to maximize overall
cache hit rate
Duong et al. MICRO'12
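The mechanism above can be sketched for a single set (a simplified model after Duong et al. MICRO'12; the real design uses sampled reuse-distance histograms to pick PD, which is omitted here):

```python
class PDSet:
    """One set under protecting-distance (PD) based replacement.
    Each block carries a remaining-PD counter; a block whose counter
    reaches 0 becomes a replacement candidate."""
    def __init__(self, assoc, pd):
        self.assoc, self.pd = assoc, pd
        self.blocks = {}  # tag -> remaining protecting distance

    def access(self, tag):
        # Age every resident block by one on each access to the set
        for t in self.blocks:
            self.blocks[t] = max(0, self.blocks[t] - 1)
        if tag in self.blocks:
            self.blocks[tag] = self.pd        # hit: re-protect at promotion
            return True
        if len(self.blocks) == self.assoc:    # miss in a full set
            # Prefer an unprotected victim (counter 0), else minimum counter
            victim = min(self.blocks, key=self.blocks.get)
            del self.blocks[victim]
        self.blocks[tag] = self.pd            # insert with full PD
        return False

s = PDSet(assoc=2, pd=3)
hits = [s.access(t) for t in ['A', 'B', 'A', 'C', 'A']]
print(hits)  # [False, False, True, False, True]
```

Raising an app's PD keeps its blocks protected longer, effectively enlarging its cache quota; lowering it shrinks the quota.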
Performance metric-based
CPTs
Next slide discusses limitations of miss-rate guided
CPTs. We then summarize CPTs which are guided
directly by some performance metric
Limitation of miss-rate guided CPTs
 Latency of different misses may be different due to
instantaneous MLP and NUCA effects; however,
most CPTs treat all misses equally
[Figure (Moreto et al. HiPEAC'08): (a) an isolated L2 miss stalls the pipeline for the full memory latency; (b) clustered L2 misses overlap their memory latencies before commit restarts, giving a lower average miss latency.]
Using MLP penalty of misses
 Find cache misses for different # of ways
 Assign higher MLP penalty to isolated misses than
clustered misses
 Compute the performance impact of a cache miss
converted into a hit (and vice versa) on an increase
(decrease) in cache size
 From this, find the length of the miss-cluster
 L2 instruction misses stall fetch => they have a fixed
miss latency and MLP penalty
 From all possible partitions, select one with
minimum total "MLP penalty"
Moreto et al. HiPEAC'08
Using application slowdown model
 This model measures app slowdown due to
interference at shared cache and main memory
 Measure slowdown for every app at different # of
ways
 Compute marginal slowdown utility (MSU) as
(Slowdown(W+K) − Slowdown(W)) / K
 Partition using lookahead algorithm, except that use
MSU instead of marginal miss utility
Subramanian et al. MICRO'15
Using stall rate curves
 Use "instruction retirement stall rate curves" (SRC):
stall cycles due to memory latency at various L2 sizes
 Get SRC directly from HW counters on real system
 SRC is better than miss-rate curve in guiding CPT,
since SRC accounts for several factors, e.g.,
 L2 miss-rate
 impact of L2 misses on instruction retirement stall
 memory bus contention
 variable latencies of lower levels of memory hierarchy
(e.g., L3 and main memory)
Tam et al. WIOSCA'07
Using Memory Bandwidth
 Apps with large miss-count may not consume largest
off-chip bandwidth if their memory accesses are not
clustered
 => Partitioning based on bandwidth can provide
better performance than based on misses
 Through offline analysis, find partition with least
overall bandwidth requirement
 Reduce cache quota of apps with low bandwidth
requirement
Yu et al. DAC'10
CPTs for Various Optimization Objectives
Example objectives:
• Fairness
• Load-balancing
• Implementing priorities
• Energy
• Limiting peak-power of LLC
CPT for Ensuring Fairness
 Iteratively perform two steps
1. Quota allocation: evaluate fairness metric for all apps
 If the difference in unfairness between the app least
unfairly affected and the app most unfairly affected
by CP > threshold1
 Transfer some cache space from the app with lower
unfairness to the one with larger unfairness
 Exclude these 2 apps. Repeat the step for the remaining apps
2. Adjustment: If reduction in miss-rate of app receiving
increased quota is more than threshold2
 Commit decision made in quota allocation step
 Else
 Reverse the decision
Kim et al. PACT'04
Using Feedback-Control theory
 Assume: IPC targets are given for all apps
 Find new targets to maximize cache utilization
 Find cache quota to achieve those targets
 If total cache quota exceeds cache size
 For QoS: reduce quota of low-priority apps
 For fairness: reduce quota in proportion to current quota
 App-level controller (a PID controller) finds quota
required for next epoch to achieve perf targets based
on perf in previous epoch with its quota
Srikantaiah et al. MICRO'09 (PID = proportional integral derivative)
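Such an app-level controller can be sketched as a textbook PID loop on the IPC error (the gains, quota mapping, and clamping here are illustrative assumptions, not values from the paper):

```python
class QuotaPID:
    """PID controller: compute the cache quota for the next epoch from the
    gap between target IPC and IPC measured in the previous epoch."""
    def __init__(self, kp, ki, kd, min_q, max_q):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.min_q, self.max_q = min_q, max_q
        self.integral = 0.0   # accumulated error (I term)
        self.prev_err = 0.0   # last error (for the D term)

    def next_quota(self, current_quota, target_ipc, measured_ipc):
        err = target_ipc - measured_ipc
        self.integral += err
        deriv = err - self.prev_err
        self.prev_err = err
        delta = self.kp * err + self.ki * self.integral + self.kd * deriv
        q = round(current_quota + delta)
        return max(self.min_q, min(self.max_q, q))  # clamp to legal quotas

pid = QuotaPID(kp=8.0, ki=1.0, kd=0.5, min_q=1, max_q=16)
q = pid.next_quota(current_quota=4, target_ipc=1.0, measured_ipc=0.8)
print(q)  # app below its IPC target -> quota grows: 6
```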
Limiting LLC power and achieving fairness/QoS (1/2)
 Goal: Limiting maximum power of LLC and
achieving fair or differentiated cache access latencies
between different apps
 Use two-level synergistic controllers design
1. LLC power-controller (every 10M cycles)
 Limits maximum LLC power for a given budget by
controlling number of active LLC banks
 Remaining banks are power-gated
Wang et al. TC'2012
Limiting LLC power and achieving fairness/QoS (2/2)
2. Latency controller (every 1M cycles)
 Controls ratio of cache access latencies between two
apps on every pair of neighboring cores.
 For fairness: same latencies for all apps
 For QoS: shorter latencies for high-priority apps
 Finds cache-bank quota of each app
 Their technique provides theoretical guarantee of
accurate control and system stability.
 Controllers are designed as PI controllers
Wang et al. TC'2012
Changing quotas in different intervals (1/2)
 Allocate different sized partitions to different apps in
different intervals
 Cache quotas are expanded and contracted in
different intervals
[Figure: partitioning of a cache among 4 cores over time, combining spatial partitioning with time-sharing. Example results: (a) fairness-oriented spatial partitioning: IPC=0.26, WS=1.23, FS=1.0; (b) throughput-oriented spatial partitioning: IPC=0.52, WS=2.42, FS=1.22; (c) multiple time-sharing partitioning (the proposed technique, for both throughput and fairness): IPC=0.52, WS=2.42, FS=1.97. WS/FS = weighted/fair speedup.]
Changing quotas in different intervals (2/2)
 A thrashing app already has low throughput,
reducing its quota in contraction epoch does not
reduce its perf much…
 but increasing its quota in expansion epoch boosts
its perf greatly which compensates slowdown in
contraction epochs.
 Expansion opportunity is given to different apps
 equally for fairness
 in differentiated manner for QoS
CPTs for load-balancing in Multithreaded Apps (1/2)
[Figure: four threads sharing a 32-way L2. (a) Shared cache: one thread lies on the critical path. (b) Partitioned cache: threads 0–3 receive 3, 16, 8, and 5 L2 ways, giving the critical thread a larger quota.]
The critical thread can be accelerated by giving it a higher
cache quota => bottleneck removed
1st CPT
 Record CPIs of all threads
 Allocate more ways to threads with higher CPIs
 Limitation: thread's cache sensitivity is not taken into
account
2nd CPT
 For each thread, build a model of how CPI varies with
cache quota
 Do curve fitting by “cubic spline interpolation”
 Repeatedly transfer one way from fastest thread to slowest
thread until some other thread becomes slowest
 At this point, revert cache allocation by one-step and
accept this partitioning
Muralidhara et al. IPDPS'10
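The repeated-transfer step of the 2nd CPT can be sketched as below (the cubic-spline fitting is omitted; `cpi_curves[t][w]` stands for the fitted model's predicted CPI of thread t with w ways, and all values are illustrative):

```python
def balance_ways(cpi_curves, alloc):
    """Repeatedly move one way from the fastest to the slowest thread while
    this lowers the worst (critical-path) CPI; revert the last transfer
    once it stops helping and accept that partitioning."""
    n = len(alloc)
    def worst():
        return max(cpi_curves[t][alloc[t]] for t in range(n))
    while True:
        cpis = [cpi_curves[t][alloc[t]] for t in range(n)]
        slowest = cpis.index(max(cpis))
        fastest = cpis.index(min(cpis))
        if fastest == slowest or alloc[fastest] <= 1:
            return alloc          # nothing left to transfer
        before = worst()
        alloc[fastest] -= 1
        alloc[slowest] += 1
        if worst() >= before:     # one step too far: revert and accept
            alloc[fastest] += 1
            alloc[slowest] -= 1
            return alloc

# Thread 0 is cache-sensitive and slow; thread 1 is fast and insensitive.
t0 = [9.0, 5.0, 3.0, 2.2, 1.8, 1.6, 1.5, 1.45, 1.4]
t1 = [2.0, 1.0, 0.95, 0.9, 0.88, 0.87, 0.86, 0.85, 0.85]
print(balance_ways([t0, t1], [4, 4]))  # ways shift to the slow thread: [7, 1]
```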
CPTs for load-balancing in Multithreaded Apps (2/2)
Removing imbalance due to process variation (1/2)
[Figure: frequencies of the cores of a 4-core processor differ due to process variation: 1.8, 2.1, 2.4, and 2.6 GHz.]
For multithreaded programs with synchronization barriers, the slowest core will limit the performance of the other cores.
Kozhikkottu et al. DAC'14
Removing imbalance due to process variation (2/2)
 Use cache partitioning to give a higher cache quota to the slower cores
[Figure: for core frequencies 1.8, 2.1, 2.4, and 2.6 GHz, PV-aware L2 cache partitioning allocates 20, 6, 4, and 2 L2 ways respectively => higher throughput.]
Kozhikkottu et al. DAC'14
Saving leakage energy (1/2)
 Locality of an app = (accesses to LRU blocks)/(accesses to
MRU blocks)
 Most hits at MRU => app needs few ways to achieve high
hit rate and vice versa
 Compare this ratio for two apps to decide cache quota
Kotera et al. HiPEAC'11
Saving leakage energy (2/2)
[Figure: cache ways divided into a region allocated to core 0, a region allocated to core 1, and power-gated ways.]
• Compare the above ratio with thresholds to decide the number of ways to power-gate
• Insight: if the total cache requirement of the cores < available cache => power-gate the remaining cache to save leakage energy
Saving dynamic energy
[Figure: an 8-set × 8-way shared cache used by cores 0–3: (a) shared unpartitioned cache — no dynamic energy saving; (b) shared partitioned cache; (c) shared partitioned cache with way-aligned data — dynamic energy saving.]
Ensure way-alignment of the data of each core: on an access by a core, only that core's way(s) need to be accessed => dynamic energy saved
Sundararajan et al. HPCA'12
CPTs in various contexts
 If cache is NUCA: try reducing both
 misses and
 hit latency (by allocating cache banks to closest
core)
 If main memory is PCM: perform CP such that
both misses and writebacks are minimized
 since PCM has high write energy/latency and low
write endurance
Integration of CP with other
techniques
Integration with processor partitioning
• When the variation in degree of TLP between apps is high,
equally distributing processors between them is not optimal
• => Perform both
• Processor partitioning (every 65M cycles) and
• Cache partitioning (every 10M cycles)
Srikantaiah et al. SC'09
Integration with DRAM-bank partitioning
 In the physical address, a few bits are common between the
LLC set-index bits and the bits used for computing the DRAM bank
 Thus, we can perform cache-only, bank-only or
combined partitioning, based on which is better
[Figure: bank-only (bits 21–22), cache-only (bits 16–18), and overlapped (bits 14–15) bits of the physical frame number on a processor with 8GB memory and 64 banks.]
Liu et al. ISCA'14
Integration with Bandwidth Partitioning
 Whether BW partitioning can improve perf depends on
difference in miss frequencies between apps
 With decreasing bandwidth, scope of perf improvement
increases
 => CP may lower the impact of BW partitioning on perf
 By reducing difference in miss frequencies of apps and
 By reducing total cache misses which relieves BW pressure
 But, if CP increases difference in miss frequencies, it
increases impact of BW partitioning on performance.
 E.g., for cache insensitive apps, CP cannot improve perf,
but by changing difference in miss frequencies, CP
enhances effectiveness of BW partitioning in boosting perf
Liu et al. HPCA'10
Integration with DVFS (1/3)
 Model problem of dividing shared resource (chip
power budget and LLC capacity) between apps as a
dynamic distributed market
 each app (core) is an agent
 resource-prices change based on “demand” and “supply”
 Initially:
 each agent has a purchasing budget and builds a
performance model as function of allocated resource
 A global arbiter fixes initial prices of all resources
Wang et al. HPCA'15
Integration with DVFS (2/3)
 Iteratively: Each agent bids for the resources to
maximize its perf
 Based on the bids, arbiter increases and reduces the price
of resources in high and low demand, respectively.
 Agents bid again under new prices
 Iteration stops when
 change in price within iterations is very small or
 a threshold # of iterations done or
 no improvement in perf of an agent on changing the bid
 At this point, perform resource-allocation
Integration with DVFS (3/3)
 Agents work in decentralized manner and only
centralized function is pricing scheme => this
technique scales well to >64 cores
 For throughput: assign larger budget to agents
with higher marginal utility
 For fairness: assign equal budgets to all agents
 Find cache utility by seeing miss-rate change and
power utility by changing frequency using DVFS
Integration with RP selection (1/2)
 CPTs and thrash-resistant RPs are complementary
 RPs temporally share LLC based on apps' "locality"
 CPTs spatially divide LLC based on apps' "utility"
 => Thrash-resistant RPs good for workloads with
poor-locality apps
 CPTs good for workloads with apps with widely
different utility values
 Idea: Perform CPT with RP-selection to optimize
both locality and utility
Zhan et al. TC'14
[Figure: recency stack of an 8-way cache shared by Core 0 (quota = 5 ways, RP0 = BIP) and Core 1 (quota = 3 ways, RP1 = LRU), marking each core's insertion position and replacement point; a decision module picks LRU or BIP insertion (with probabilities p and 1−p).]
Integration with RP selection (2/2)
 Find hits at different way-counts for LRU and BIP
 From this, find optimal CP and optimal RP
 For CP use lookahead algorithm
 RP chosen for a core is implemented in its cache
portion
References
 S. Mittal, “A Survey of Techniques for Cache
Partitioning in Multicore Processors”, ACM
Computing Surveys, 2017 (link)
CPN302 your-linux-ami-optimization-and-performanceCPN302 your-linux-ami-optimization-and-performance
CPN302 your-linux-ami-optimization-and-performance
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Study of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processorsStudy of various factors affecting performance of multi core processors
Study of various factors affecting performance of multi core processors
 
Memory Mapping Cache
Memory Mapping CacheMemory Mapping Cache
Memory Mapping Cache
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Caching fundamentals by Shrikant Vashishtha
Caching fundamentals by Shrikant VashishthaCaching fundamentals by Shrikant Vashishtha
Caching fundamentals by Shrikant Vashishtha
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
 
Aca lab project (rohit malav)
Aca lab project (rohit malav) Aca lab project (rohit malav)
Aca lab project (rohit malav)
 
Cache memory
Cache memoryCache memory
Cache memory
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
 
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
Your Linux AMI: Optimization and Performance (CPN302) | AWS re:Invent 2013
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
Unit 5Memory management.pptx
Unit 5Memory management.pptxUnit 5Memory management.pptx
Unit 5Memory management.pptx
 
IMDB_Scalability
IMDB_ScalabilityIMDB_Scalability
IMDB_Scalability
 
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORESWRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES
 
E24099025李宇洋_專題報告.pdf
E24099025李宇洋_專題報告.pdfE24099025李宇洋_專題報告.pdf
E24099025李宇洋_專題報告.pdf
 

More from Gnanavi2

computerforensicsppt-111006063922-phpapp01.pdf
computerforensicsppt-111006063922-phpapp01.pdfcomputerforensicsppt-111006063922-phpapp01.pdf
computerforensicsppt-111006063922-phpapp01.pdf
Gnanavi2
 
644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf
644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf
644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf
Gnanavi2
 
computerforensics-140212060522-phpapp02.pdf
computerforensics-140212060522-phpapp02.pdfcomputerforensics-140212060522-phpapp02.pdf
computerforensics-140212060522-phpapp02.pdf
Gnanavi2
 
computerforensics-140529094816-phpapp01 (1).pdf
computerforensics-140529094816-phpapp01 (1).pdfcomputerforensics-140529094816-phpapp01 (1).pdf
computerforensics-140529094816-phpapp01 (1).pdf
Gnanavi2
 
computerforensicppt-160201192341.pdf
computerforensicppt-160201192341.pdfcomputerforensicppt-160201192341.pdf
computerforensicppt-160201192341.pdf
Gnanavi2
 
Computer_forensics_ppt.ppt
Computer_forensics_ppt.pptComputer_forensics_ppt.ppt
Computer_forensics_ppt.ppt
Gnanavi2
 

More from Gnanavi2 (6)

computerforensicsppt-111006063922-phpapp01.pdf
computerforensicsppt-111006063922-phpapp01.pdfcomputerforensicsppt-111006063922-phpapp01.pdf
computerforensicsppt-111006063922-phpapp01.pdf
 
644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf
644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf
644205e3-8f85-43da-95ac-e4cbb6a7a406-150917105917-lva1-app6892.pdf
 
computerforensics-140212060522-phpapp02.pdf
computerforensics-140212060522-phpapp02.pdfcomputerforensics-140212060522-phpapp02.pdf
computerforensics-140212060522-phpapp02.pdf
 
computerforensics-140529094816-phpapp01 (1).pdf
computerforensics-140529094816-phpapp01 (1).pdfcomputerforensics-140529094816-phpapp01 (1).pdf
computerforensics-140529094816-phpapp01 (1).pdf
 
computerforensicppt-160201192341.pdf
computerforensicppt-160201192341.pdfcomputerforensicppt-160201192341.pdf
computerforensicppt-160201192341.pdf
 
Computer_forensics_ppt.ppt
Computer_forensics_ppt.pptComputer_forensics_ppt.ppt
Computer_forensics_ppt.ppt
 

Recently uploaded

Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
Hitesh Sikarwar
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 

Recently uploaded (20)

Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 

PPT_on_Cache_Partitioning_Techniques.pdf

  • 1. SPARSH MITTAL IIT HYDERABAD, INDIA Cache Partitioning Techniques
  • 2. Some acronyms  CP = cache partitioning, CPT = CP technique  HW = hardware, SW = software  BW = bandwidth  LLC = last level cache  RP = replacement policy  MLP/TLP = memory/thread level parallelism  IPC = instructions per cycle  NUCA = non-uniform cache architecture  QoS = quality-of-service  Perf = performance N denotes number of cores
  • 3. Motivation for Cache Management in Multicores  With increasing number of cores, performance of multicores does not scale linearly due to cache contention and other factors  Memory requirement of applications is increasing => Cache management has become extremely important in multicores
  • 4. Private v/s shared cache Private Caches  Avoid interference  Cannot account for inter- and intra-application variation in cache requirements  Limited capacity => cannot reduce miss-rate effectively Shared cache  Higher total capacity => can reduce miss-rate  Interference b/w apps on using traditional cache management policies => performance loss, unfairness and lack of QoS Use CP in shared cache => capacity advantage of shared cache, performance isolation advantage of private cache
  • 5. Examples of processors with shared LLC  IBM Power 7  Intel core i7  AMD Phenom X4  Sun Niagara T2 We first provide background on CPTs and then discuss several CPTs
  • 7. Potential of CP  Different cache demand and performance sensitivity  Of different apps  Of different threads in a multithreaded app  Further, performance of cores may differ due to  differences in cache latencies due to NUCA design  differences in core frequencies due to process variation  CP can compensate for these differences!  CP can also optimize for fairness and QoS
  • 8. Potential of CP  CP avoids interference & provides higher effective cache capacity  Reduces miss-rate and bandwidth contention  This may benefit even those applications whose cache quotas are reduced!  Saves energy by  Reducing execution time  Allowing unused cache to be power-gated
  • 9. Challenges of CP  Number of possible partitions increases exponentially with increasing core-count  Simple schemes become ineffective  Finding the partitioning with minimum overall cache miss-rate (i.e., optimal partitioning) is NP-hard and yet, optimal partitioning may not be fair.  Naive CPTs: large profiling and reconfiguration overhead  Hardware support required for implementing CPTs (e.g., true-LRU) may be too costly or unavailable
  • 10. Challenges of CP  Reduction in miss-rate brought by CPT may not translate into better performance  When there is a performance bottleneck due to load-imbalance, BW congestion, etc.  CP useful only for LLC-sensitive apps; unnecessary or harmful for small-footprint apps  CP unnecessary for large-sized caches
  • 11. A Quick Background on Page coloring (will be useful for understanding CP)
  • 12. Page Coloring [Diagram: the virtual address (virtual page number + page offset) is translated by the OS into a physical address (physical page number + page offset); in a physically indexed cache, the page color bits are the bits common to the physical page number and the cache set index, and are under OS control] •Physically indexed caches are divided into multiple regions (colors). •All cache lines in a physical page are cached in one of those regions (colors). Lin et al. HPCA’08
  • 13. Summary of page coloring  Virtual address has: Virtual page number and page offset  VA converted to PA by OS-controlled address translation  PA used in a physically indexed cache.  Page color bits = common bits between physical page number and set index  Physically indexed cache is divided into multiple regions. Each OS page will be cached in one of those regions, indexed by the page color  OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
  • 14. Computing # of Page colors #PageColor bits = #BlockOffset bits + #SetIndex bits - #PageOffset bits => #PageColors = (CacheBlockSize * NumberOfSets)/PageSize = CacheSize/(PageSize * CacheAssociativity) [Diagram: physical address viewed as (page number | page offset) and as (cache tag | cache set index | block offset); the cache color bits are the overlap of page number and set index]
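As a quick sanity check, the color-count and color-bit formulas above can be coded directly; the 16-way, 4 MB, 64 B-block, 4 KB-page configuration is the deck's running example, while the helper names are mine:

```python
import math

def num_page_colors(cache_size, page_size, associativity):
    # #PageColors = CacheSize / (PageSize * CacheAssociativity)
    return cache_size // (page_size * associativity)

def num_page_color_bits(block_size, num_sets, page_size):
    # #PageColorBits = #BlockOffsetBits + #SetIndexBits - #PageOffsetBits
    return (int(math.log2(block_size)) + int(math.log2(num_sets))
            - int(math.log2(page_size)))

cache_size = 4 * 1024 * 1024                     # 4 MB
block_size, assoc, page_size = 64, 16, 4 * 1024
num_sets = cache_size // (block_size * assoc)    # 4096 sets

print(num_page_colors(cache_size, page_size, assoc))         # 64 colors
print(num_page_color_bits(block_size, num_sets, page_size))  # 6 bits => 2^6 = 64
```

Both routes agree: 6 color bits give the 64 colors quoted for this cache on the next slide.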
  • 16. Classification 1. Based on granularity  Cache quota can be allocated in terms of ways, sets (colors) and blocks  A 16-way, 4MB cache with block size of 64B and system page size of 4KB => 16 ways, 64 colors, 65536 blocks Way Set Block Increasing granularity
  • 17. Way-based CPT  Simple implementation  Flushing-free reconfiguration  Ease of obtaining way-level profiling information  Sufficient for small N (number of cores)  Harms associativity  Meaningful only if associativity >= 2N (at least one way needs to be allocated to each core)  Requires caches of high associativity => high access latency/power overheads  Requires additional bits to identify owner core of each block
  • 18. Set (color)-based CPT  Higher granularity than way-based CP  Amenable to SW control.  Requires significant changes to OS  May complicate virtual memory management  Changes set-indices of many blocks => these blocks need to be flushed or migrated to new set-indices  To lower this, reduce reconfig. frequency or number of recolored pages in each interval or perform page- migration in lazy manner
  • 19. Block-based CPT  Provides highest granularity  Highly useful for large N  Obtaining profiling info for block-based allocation is challenging  Some CPTs obtain this info by linearly interpolating miss-rate curve of way-level monitors => not accurate  May require changes to RP and additional bits to identify owner core of each block
  • 20. Classification 2. Whether static or dynamic  Static CPT: determine cache partitions offline (i.e., before application execution)  Dynamic CPT: determine cache partitions dynamically (i.e., at runtime, i.e., when application is running)
  • 21. Static v/s dynamic CPT Static CPT  Useful for testing all possible partitions for small core-count to find upper bound on gain  Not feasible with large N  Cannot account for temporal variation in cache behavior Dynamic CPT  Suitable for large N  Can account for temporal variation in behavior  Incur runtime overhead  Unnecessary if app behavior uniform over time
  • 22. Classification 3. Whether strict or pseudo  Strict (hard) CPT: cache quota is strictly enforced  Pseudo (soft) CPT: cache quota not strictly enforced, actual allocation may differ from target quota  Ex.: 8-way cache, quota App1 =3 ways, App2 = 5 ways  Strict: Enforce [3,5] in all intervals  Pseudo: Quota = [3,5] in most intervals but [2,6] or [4,4] in other intervals
  • 23. [Figure: strict way-based CP keeps each core’s actual allocation close to its target quota, while block-based pseudo-partitioning lets the actual allocation deviate from the target] Sanchez et al. ISCA’11
  • 24. Strict v/s pseudo CPT Strict CPT  Important to guarantee QoS and fairness  May lead to inefficient utilization of cache, esp. when allocation granularity is large.  Dead blocks of one core cannot be evicted by another core, even if that core could benefit from those blocks Pseudo CPT  May provide most benefits of strict-CPT with much simpler implementation  Allow cores to steal quotas of other cores  Actual quota of a core can differ from target  This problem esp. severe with large N
  • 25. Classification 4. Whether HW or SW-control  HW-based CPT: CPT is independent of OS parameters and is implemented in HW  SW-based CPT: Partitioning decision is taken in SW, CPT depends on OS features (e.g., system page size)
  • 26. HW-based v/s SW-based CPT HW-based CPT  Can be used at fine- granularity (~10K cycles)  Reduces profiling and reconfiguration overhead  Adding required HW support is challenging SW-based CPT  SW control is important to account for other processor components, management schemes and system-level goals, e.g. optimizing fairness (v/s cache-level goals e.g. minimizing miss-rate).  Higher reconfig. overhead => can be used at coarse granularity (>1M cycles)
  • 27. Classification 5. Fully v/s partially partitioned Fully partitioned: entire cache partitioned b/w cores (private to each core) => higher capacity and better granularity available for partitioning Partially partitioned: part of cache partitioned b/w cores (private to each core), rest not partitioned (shared b/w cores) => may provide advantage of both shared and private cache
  • 28. CPTs in real processors  Some Intel processors provide support for way-based CP [Int16]  Page coloring-based CP [Lin08] in Linux kernel  Intel Xeon processor E5-2600 v3 family: support for implementing shared cache QoS. It has  “cache monitoring technology” to track cache usage  “cache allocation technology” for allocating cache quotas, e.g. to avoid cache starvation  AMD Opteron: pseudo-CPT to restrict cache quota of cache-polluting apps [Int16: Intel 64 and IA-32 Architectures Developer’s Manual: Vol. 3B http://goo.gl/sw24WL ] [Lin08: Lin et al. HPCA’08]
  • 29. Key Ideas and Strategies for Performing CP
  • 30. How to perform CP  Profile apps to find their cache behavior/requirement  Classify apps based on their cache behavior  Determine cache quota of each app
  • 31. Profiling techniques  Collect data about hits/misses to different ways. Based on that, decide benefit from giving/taking-away cache space to/from an app  Set-sampling: only a few sets need to be monitored to estimate a property of the entire cache  Data can be collected from actual cache or a separate profiling unit  Separate unit only needs tags, not data => size small  By using set-sampling, its size can be reduced greatly
  • 32. Collecting profiling data [Figure: (a) a 4-way, K-set cache with tag and data directories (Tj, Dj = tag and data for set j) and per-way counters ordered MRU to LRU; (b) an auxiliary tag directory (ATD) — a separate profiling unit that keeps only tags; (c) a sampled ATD with sampling ratio = 2, which monitors only alternate sets]
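The sampled auxiliary tag directory can be sketched in a few lines: a tag-only structure keeps a true-LRU stack for each sampled set and counts hits per LRU stack position, which is exactly the per-way utility information the following slides consume. This is an illustrative software model, not the hardware design; all names are mine.

```python
class SampledUMON:
    def __init__(self, num_sets, ways, sampling_ratio):
        self.ways = ways
        # LRU stacks (MRU at index 0), kept only for every sampling_ratio-th set
        self.stacks = {s: [] for s in range(0, num_sets, sampling_ratio)}
        self.hits = [0] * ways     # hits per LRU stack position
        self.misses = 0

    def access(self, set_index, tag):
        if set_index not in self.stacks:       # unsampled set: ignored
            return
        stack = self.stacks[set_index]
        if tag in stack:
            pos = stack.index(tag)
            self.hits[pos] += 1                # would hit with >= pos+1 ways
            stack.remove(tag)
        else:
            self.misses += 1
            if len(stack) == self.ways:
                stack.pop()                    # evict the LRU tag
        stack.insert(0, tag)                   # move/insert at MRU

mon = SampledUMON(num_sets=8, ways=4, sampling_ratio=2)
for tag in [1, 2, 1, 3, 2, 1]:
    mon.access(0, tag)
print(mon.hits, mon.misses)    # [0, 1, 2, 0] 3
```

`hits[i]` estimates the extra hits the (i+1)-th way would contribute, so prefix sums of `hits` yield the hit curve used by the partitioning algorithms below.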
  • 33. Utility Monitors [Figure] Qureshi et al. MICRO’06
  • 34. How to Use Profiling Information (two-core system) • Find misses avoided with each way in each core • Find the partitioning which gives least misses (highest hits) Xie et al. HIPEAC’10
  • 35. How to Use Profiling Information (four-core system) [Figure] Xie et al. HIPEAC’10
  • 36. Cache-behavior Based Application Classification  Cache insensitive: not many accesses to L2  Cache friendly: miss-rate reduces with increasing L2 quota  Cache fitting: miss-rate reduces with increasing L2 quota and becomes nearly-zero at some point  Streaming: very large working set; show thrashing with any RP due to inadequate cache reuse  Thrashing: working set larger than cache capacity; they thrash LRU-managed cache, but may benefit from thrash-resistant RPs.
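One illustrative way to automate this classification from profiled data is shown below. The thresholds are arbitrary assumptions, and the sketch lumps streaming and thrashing together, since a miss-rate curve under LRU alone cannot easily separate them:

```python
def classify_app(accesses_pki, mpki, min_accesses=1.0, near_zero=0.1,
                 min_gain=0.1):
    # accesses_pki: LLC accesses per kilo-instruction
    # mpki[w]: misses per kilo-instruction with w+1 ways allocated
    if accesses_pki < min_accesses:
        return "cache-insensitive"           # barely touches the LLC
    gain = (mpki[0] - mpki[-1]) / mpki[0]    # relative benefit of full cache
    if gain < min_gain:
        return "streaming/thrashing"         # extra capacity barely helps
    if mpki[-1] < near_zero:
        return "cache-fitting"               # misses vanish at some quota
    return "cache-friendly"                  # steady, partial benefit

print(classify_app(20.0, [10, 6, 3, 1.5]))       # cache-friendly
print(classify_app(20.0, [10, 4, 0.05, 0.05]))   # cache-fitting
print(classify_app(20.0, [10, 9.8, 9.7, 9.6]))   # streaming/thrashing
print(classify_app(0.2,  [0.1, 0.1, 0.1, 0.1]))  # cache-insensitive
```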
  • 37. Cache-behavior Based Application Classification • Reduce cache quota of thrashing/streaming app • Give higher quota to friendly and fitting app till they benefit from it
  • 38. Cache-behavior Based Application Classification  Utility = change in miss-rate with cache quota  Low, high and saturating utility
  • 39. Ideas for Reducing Overhead of CPTs 1. A few thrashing apps are responsible for interference in a shared cache => Restraining just their cache quotas can provide performance benefits similar to exact (but complex) CPTs 2. Some works extend CPTs proposed for true-LRU to pseudo-LRU 3. Use Bloom Filter to reduce storage overhead
  • 40. Way, color and block-based CPTs
  • 41. Utility Based Shared Cache Partitioning  Goal: Maximize system throughput  Observation: Not all threads/applications benefit equally from caching  simple LRU replacement not good for system throughput  Idea: Allocate more cache space to applications that obtain the most benefit from more space Qureshi et al. MICRO’06
  • 42. Utility Based Shared Cache Partitioning Utility U(a,b) = Misses with a ways – Misses with b ways [Plot: misses per 1000 instructions v/s number of ways from a 16-way 1MB L2, illustrating low, high and saturating utility]
  • 43. Partitioning Algorithm Evaluate all possible partitions and select the best  With a ways to core1 and (16-a) ways to core2: Hitscore1 = (H0 + H1 + … + Ha-1) ---- from UMON1 Hitscore2 = (H0 + H1 + … + H16-a-1) ---- from UMON2  Select a that maximizes (Hitscore1 + Hitscore2)  Partitioning done once every 5 million cycles Qureshi et al. MICRO’06
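A sketch of this exhaustive step, assuming H1[i] and H2[i] are the per-position UMON hit counters so that the first a entries sum to the hits a core would see with a ways; the counter values below are made up:

```python
def best_split(H1, H2, total_ways=16):
    # Try every split that gives each core at least one way and keep the
    # one that maximizes Hitscore1 + Hitscore2.
    best_a, best_score = None, -1
    for a in range(1, total_ways):
        score = sum(H1[:a]) + sum(H2[:total_ways - a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a, best_score

# Core 1 concentrates its hits near MRU; core 2 gains evenly from every way.
H1 = [90, 30, 5, 1] + [0] * 12
H2 = [20] * 16
a, score = best_split(H1, H2)
print(f"core1 gets {a} ways, core2 gets {16 - a}, total hits {score}")
```

With these curves the search settles on 2 ways for core 1 and 14 for core 2: core 1's utility saturates quickly, so the remaining ways earn more hits for core 2.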
  • 44. Way Partitioning Support 1. Each line has core-id bits 2. On a miss, count ways_occupied in the set by the miss-causing app: if ways_occupied < ways_given, victim is the LRU line from another app; otherwise, victim is the LRU line from the miss-causing app
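The victim-selection rule above can be sketched as follows (a software model with made-up quota values, not the hardware logic):

```python
def pick_victim(lines, miss_core, quota):
    # lines: core-id of each way's occupant, index 0 = MRU, last = LRU
    occupied = sum(1 for c in lines if c == miss_core)
    if occupied < quota[miss_core]:
        # under quota: evict the LRU line belonging to some other app
        candidates = [i for i, c in enumerate(lines) if c != miss_core]
    else:
        # at/over quota: evict the app's own LRU line
        candidates = [i for i, c in enumerate(lines) if c == miss_core]
    return max(candidates)      # largest index = closest to LRU

lines = [0, 1, 0, 1, 1, 0, 1, 1]    # 8-way set, MRU first
# core 0 already holds its quota of 3 ways, so its own LRU line (index 5) goes
print(pick_victim(lines, miss_core=0, quota={0: 3, 1: 5}))   # 5
```

This is what enforces the partition without any global bookkeeping: each core converges to its quota one replacement at a time.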
  • 45. Greedy and Lookahead CP Algorithm Greedy CP algorithm  Iteratively assign a way to an app with highest utility for that way  Optimal if utility curves of all apps are convex Lookahead algorithm  Works for general case when utility curves are not convex  In each iteration, for every app: Compute “maximum marginal utility” (MMU) and least number of ways at which MMU occurs  App with largest MMU is given # of ways required for achieving MMU.  Stop iterating when all ways allocated Qureshi et al. MICRO’06
  • 46. Machine-learning based CPT  Perform synergistic management of processor resources (e.g., L2 cache quota, power budget and off-chip bandwidth), instead of isolated management  Train a neural network to learn processor performance as function of allocated resources  Use a stochastic hill-climbing based search heuristic  Use a way-based CPT, per-core DVFS to manage chip-power distribution and a strategy for distributing bandwidth between apps Bitirgen et al. MICRO'08
  • 47. Coloring-based CPT  CPT for performance  Run one interval with current partition and one interval each with increasing/decreasing quotas of each core (total 2 cores)  Select partition with least misses  CPT for QoS  Target: perf. of app1 is not degraded >= threshold1 and perf. of app2 is maximized  if ((IPC of app1 - baselineIPC of app1)< threshold2) Increase quota of app1 (if already maximum, stall app2)  Else if(IPC of app1 > baselineIPC of app1) Resume app2 (if stalled) or increase its quota  On change in cache quota, perform lazy page-migration Lin et al. HPCA'08
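The QoS loop above can be sketched as a per-interval controller. The state layout, the single threshold (the slide uses two), and the one-color step size are my simplifying assumptions:

```python
def qos_step(state, ipc1, baseline_ipc1, threshold, max_colors):
    # state: {"quota1": ..., "quota2": ..., "app2_stalled": ...}
    if ipc1 - baseline_ipc1 < -threshold:        # app1 degraded too much
        if state["quota1"] < max_colors:
            state["quota1"] += 1                 # grow app1's quota by a color
            state["quota2"] = max_colors - state["quota1"]
            return "grow-app1"
        state["app2_stalled"] = True             # quota already maximal
        return "stall-app2"
    if ipc1 > baseline_ipc1:                     # app1 has slack
        if state["app2_stalled"]:
            state["app2_stalled"] = False
            return "resume-app2"
        if state["quota2"] < max_colors - 1:
            state["quota2"] += 1                 # give the slack to app2
            state["quota1"] = max_colors - state["quota2"]
            return "grow-app2"
    return "no-change"

state = {"quota1": 32, "quota2": 32, "app2_stalled": False}
print(qos_step(state, ipc1=0.8, baseline_ipc1=1.0,
               threshold=0.1, max_colors=64))    # grow-app1
```

After each decision the lazy page-migration mentioned on the slide would move recolored pages in the background.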
  • 48. Coloring-based CPT with HW support  Estimate energy consumption of running apps for different cache quota  Let #sets in LLC be X  Estimate miss-rate for caches of different number of sets, viz., X, X/2, X/4, X/8, etc. using profiling units  From this, estimate energy consumption for different cache partitions  Quota of app with small utility is reduced or is increased only slightly  In some partitions, some colors may not be allocated to any core  From these, select a partitioning with minimum energy  Power-gate unused cache colors Mittal et al. TVLSI’14
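The selection step can be sketched as a search over candidate quotas, where energy[app][colors] holds the profiled energy estimates (all numbers invented) and leftover colors are treated as power-gated:

```python
from itertools import product

def min_energy_partition(energy, total_colors):
    # energy[app]: {candidate color-count -> estimated energy}
    quotas = [list(e.keys()) for e in energy]
    best = None
    for combo in product(*quotas):
        if sum(combo) <= total_colors:     # unallocated colors: power-gated
            total = sum(energy[a][c] for a, c in enumerate(combo))
            if best is None or total < best[0]:
                best = (total, combo)
    return best

energy = [
    {16: 90, 32: 50, 64: 45},   # app0 benefits strongly up to 32 colors
    {16: 30, 32: 28, 64: 27},   # app1 is nearly cache-insensitive
]
print(min_energy_partition(energy, total_colors=64))    # (78, (32, 32))
```

Note the slide's pruning heuristic (only slightly growing low-utility apps) shows up here as the coarse candidate lists: only a few quota levels per app are profiled, which keeps the search cheap.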
  • 49. Vantage: A Block-based CPT (1/2)  Divide cache into managed and unmanaged portion (e.g., 85:15)  Only partition managed portion  Allows maintaining associativity of each partition Sanchez et al. ISCA’11
  • 50. Vantage: A Block-based CPT (2/2)
 Preferentially evict blocks from the unmanaged portion
  ~0 evictions from the managed portion
 Enforce quotas by matching demotion and promotion rates
  On any eviction, all candidates with eviction priorities greater than a partition-specific threshold are demoted
 Use timestamp-based LRU to estimate eviction priorities with low overhead
Sanchez et al. ISCA’11
  • 52. Partitioning by Controlling Insertion-priority (1/3)
 Find cache quota of each app
 Quota of an app decides its insertion-priority location
 Cache hit => block promoted by one step with probability Z, and not promoted with probability 1-Z
 Blocks of apps with low priority experience high competition
 Thrashing apps get one way each. Also, Z is very small for them
Xie et al. ISCA'09
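A minimal single-set sketch of this insertion/promotion scheme (our simplification; the exact mapping from quota to insertion position is more involved in the paper):

```python
import random

def on_access(stack, block, quota, z, rng=random.random):
    """One access to a recency stack (index 0 = LRU end, last = MRU).
    Hit: promote the block by one position with probability z.
    Miss: evict the LRU block and insert at a quota-based position
    (here quota-1 from the LRU end; a simplifying assumption)."""
    if block in stack:                               # hit
        i = stack.index(block)
        if i + 1 < len(stack) and rng() < z:
            stack[i], stack[i + 1] = stack[i + 1], stack[i]
        return True
    stack.pop(0)                                     # miss: evict LRU block
    stack.insert(min(quota - 1, len(stack)), block)  # quota-based insertion
    return False
```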
  • 53. Partitioning by Controlling Insertion-priority (2/3)
[Figure: worked example on the recency stack of an 8-way set shared by Core 0 and Core 1, showing quota-based insertion locations (e.g., at positions 3 and 5), one-step promotions on hits, and each core's deviation from its quota]
Xie et al. ISCA'09
  • 54. Partitioning by Controlling Insertion-priority (3/3)
 Limitations:
  many partitions may have low insertion positions =>
   severe contention at near-LRU positions
   difficult-to-evict blocks at near-MRU positions
Xie et al. ISCA'09
  • 55. Decay-interval based CPT
 Decay interval: if a block is not accessed for the decay interval, it becomes a candidate for replacement irrespective of its LRU status
 Tune decay intervals of apps based on their cache utility and priority
  => blocks of apps with high priority and locality stay in cache longer
 Choose the decay interval that minimizes total misses and increases cache usage efficiency
Petoumenos et al. MoBS'06
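A sketch of how per-app decay intervals bias replacement (the data layout and names are our assumptions):

```python
def replacement_candidates(blocks, now, decay_interval):
    """Blocks idle for at least their app's decay interval become
    replacement candidates irrespective of LRU status.
    blocks: {block: (owner_app, last_access_time)}
    decay_interval: {app: interval}  -- the per-app tuning knob."""
    return sorted(b for b, (app, last) in blocks.items()
                  if now - last >= decay_interval[app])
```

With a long interval for a high-priority app and a short one for a low-priority app, only the low-priority app's idle blocks become eviction candidates.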
  • 56. Reuse-distance based CPT
 Keep a block in cache only until its expected reuse happens
  This reuse distance is called the protecting distance (PD)
 At insertion/promotion time, the reuse distance of a block is set to PD
 On each access to a set, PD values of all its blocks are decreased by one; when a value reaches 0, the block becomes a replacement candidate
 Change PD to control the cache quota of an app
  In multicore, find PDs for all cores that maximize overall cache hit rate
Duong et al. MICRO'12
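A single-set sketch of protecting-distance replacement (our simplification of the mechanism described above):

```python
def pdp_access(set_blocks, tag, pd):
    """One access to a set under protecting-distance replacement.
    set_blocks: list of [tag, remaining_pd]. Every access decrements
    all counters; a block at 0 is unprotected. A hit (or insertion)
    resets the block's counter to its app's PD."""
    for b in set_blocks:
        b[1] = max(0, b[1] - 1)
    for b in set_blocks:
        if b[0] == tag:                          # hit: re-protect the block
            b[1] = pd
            return True
    # miss: replace the block with the lowest counter (0 = unprotected)
    victim = min(set_blocks, key=lambda b: b[1])
    victim[0], victim[1] = tag, pd
    return False
```

Raising an app's PD keeps its blocks protected longer, effectively growing its share of the cache.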
  • 57. Performance metric-based CPTs Next slide discusses limitations of miss-rate guided CPTs. We then summarize CPTs which are guided directly by some performance metric
  • 58. Limitation of miss-rate guided CPTs
 Latency of different misses may be different due to instantaneous MLP and NUCA effects; however, most CPTs treat all misses equally
[Figure: (a) an isolated L2 miss stalls the pipeline for the full memory latency before commit restarts — higher average latency; (b) clustered misses overlap their memory latencies — lower average latency per miss]
Moreto et al. HiPEAC'08
  • 59. Using MLP penalty of misses  Find cache misses for different # of ways  Assign higher MLP penalty to isolated misses than clustered misses  Compute perf. impact of a cache miss converted into hit and vice versa on an increase/decrease in cache size, respectively  From this, find length of the miss-cluster  L2 instruction misses stall fetch => they have a fixed miss latency and MLP penalty  From all possible partitions, select one with minimum total "MLP penalty" Moreto et al. HiPEAC'08
  • 60. Using application slowdown model
 This model measures app slowdown due to interference at shared cache and main memory
 Measure slowdown for every app at different # of ways
 Compute marginal slowdown utility (MSU) as (slowdown(W+K) – slowdown(W)) / K
 Partition using the lookahead algorithm, except that MSU is used instead of marginal miss utility
Subramanian et al. MICRO'15
  • 61. Using stall rate curves
 Use "instruction retirement stall rate curves" (SRC): stall cycles due to memory latency at various L2 sizes
 Get SRC directly from HW counters on a real system
 SRC is better than a miss-rate curve in guiding CPT, since SRC accounts for several factors, e.g.,
  L2 miss-rate
  impact of L2 misses on instruction retirement stalls
  memory bus contention
  variable latencies of lower levels of the memory hierarchy (e.g., L3 and main memory)
Tam et al. WIOSCA'07
  • 62. Using Memory Bandwidth  Apps with large miss-count may not consume largest off-chip bandwidth if their memory accesses are not clustered  => Partitioning based on bandwidth can provide better performance than based on misses  Through offline analysis, find partition with least overall bandwidth requirement  Reduce cache quota of apps with low bandwidth requirement Yu et al. DAC'10
  • 63. EXAMPLE OBJECTIVES: • FAIRNESS • LOAD-BALANCING • IMPLEMENTING PRIORITIES • ENERGY • LIMITING PEAK-POWER OF LLC CPTs for Various Optimization Objectives
  • 64. CPT for Ensuring Fairness
 Iteratively perform two steps
 1. Quota allocation: evaluate the fairness metric for all apps
  If the difference in unfairness between the least and most unfairly impacted apps > threshold1
   Transfer some cache space from the app with lower unfairness to the one with larger unfairness
   Exclude these 2 apps. Repeat the step for the remaining apps
 2. Adjustment: If the reduction in miss-rate of the app receiving increased quota is more than threshold2
  Commit the decision made in the quota-allocation step
  Else
  Reverse the decision
Kim et al. PACT'04
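The quota-allocation step can be sketched as follows (the adjustment/commit step is omitted; the names and the fixed transfer step are our assumptions):

```python
def rebalance_for_fairness(unfairness, quota, step, threshold):
    """One quota-allocation pass: repeatedly pair the least- and most-
    unfairly-impacted apps, move 'step' units of cache between them,
    then exclude the pair, while the unfairness gap exceeds threshold."""
    remaining = set(unfairness)
    moves = []
    while len(remaining) >= 2:
        lo = min(remaining, key=lambda a: unfairness[a])   # least unfairly impacted
        hi = max(remaining, key=lambda a: unfairness[a])   # most unfairly impacted
        if unfairness[hi] - unfairness[lo] <= threshold:
            break
        quota[lo] -= step        # donor gives up cache space
        quota[hi] += step        # receiver gains it
        moves.append((lo, hi))
        remaining -= {lo, hi}    # exclude the pair, repeat with the rest
    return moves
```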
  • 65. Using Feedback-Control theory  Assume: IPC targets are given for all apps  Find new targets to maximize cache utilization  Find cache quota to achieve those targets  If total cache quota exceeds cache size  For QoS: reduce quota of low-priority apps  For fairness: reduce quota in proportion to current quota  App-level controller (a PID controller) finds quota required for next epoch to achieve perf targets based on perf in previous epoch with its quota Srikantaiah et al. MICRO'09 (PID = proportional integral derivative)
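A minimal per-app controller of this kind (an illustrative PID sketch; the gains and quota units are our assumptions, not taken from the paper):

```python
class QuotaPID:
    """Per-app PID controller: each epoch, adjust the app's cache quota
    from the error between its target and measured IPC."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = 0.0

    def next_quota(self, quota, target_ipc, measured_ipc):
        err = target_ipc - measured_ipc      # positive: app is too slow
        self.integral += err                 # accumulated error (I term)
        deriv = err - self.prev_err          # error change (D term)
        self.prev_err = err
        return quota + self.kp * err + self.ki * self.integral + self.kd * deriv
```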
  • 66. Limiting LLC power and achieving fairness/QoS (1/2)  Goal: Limiting maximum power of LLC and achieving fair or differentiated cache access latencies between different apps  Use two-level synergistic controllers design 1. LLC power-controller (every 10M cycles)  Limits maximum LLC power for a given budget by controlling number of active LLC banks  Remaining banks are power-gated Wang et al. TC'2012
  • 67. Limiting LLC power and achieving fairness/QoS (2/2) 2. Latency controller (every 1M cycles)  Controls ratio of cache access latencies between two apps on every pair of neighboring cores.  For fairness: same latencies for all apps  For QoS: shorter latencies for high-priority apps  Finds cache-bank quota of each app  Their technique provides theoretical guarantee of accurate control and system stability.  Controllers are designed as PI controllers Wang et al. TC'2012
  • 68. Changing quotas in different intervals (1/2)
 Allocate different sized partitions to different apps in different intervals
 Cache quotas are expanded and contracted in different intervals
[Figure: (a) fairness-oriented spatial partitioning — IPC=0.26, WS=1.23, FS=1.0; (b) throughput-oriented spatial partitioning — IPC=0.52, WS=2.42, FS=1.22; (c) proposed multiple time-sharing partitioning, for both throughput and fairness — IPC=0.52, WS=2.42, FS=1.97. WS/FS = weighted/fair speedup]
  • 69. Changing quotas in different intervals (2/2)  A thrashing app already has low throughput, reducing its quota in contraction epoch does not reduce its perf much…  but increasing its quota in expansion epoch boosts its perf greatly which compensates slowdown in contraction epochs.  Expansion opportunity is given to different apps  equally for fairness  in differentiated manner for QoS
  • 70. CPTs for load-balancing in Multithreaded Apps (1/2)
[Figure: four threads of a multithreaded app on cores C0-C3 sharing a 32-way L2: (a) shared cache, where one thread lies on the critical path; (b) partitioned cache giving the critical thread the largest quota (16 ways; the others get 3, 8 and 5)]
 Critical thread can be accelerated by giving it a higher cache quota => bottleneck removed
  • 71. 1st CPT  Record CPIs of all threads  Allocate more ways to threads with higher CPIs  Limitation: thread's cache sensitivity is not taken into account 2nd CPT  For each thread, build a model of how CPI varies with cache quota  Do curve fitting by “cubic spline interpolation”  Repeatedly transfer one way from fastest thread to slowest thread until some other thread becomes slowest  At this point, revert cache allocation by one-step and accept this partitioning Muralidhara et al. IPDPS'10 CPTs for load-balancing in Multithreaded Apps (2/2)
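The second CPT's transfer loop can be sketched as follows (our simplification; the paper fits CPI-vs-quota curves by cubic spline interpolation, for which `cpi_model` stands in):

```python
def balance_ways(cpi_model, quota, min_ways=1):
    """Repeatedly move one way from the fastest thread to the slowest;
    when the slowest thread changes, revert the last move and stop.
    cpi_model(thread, ways) -> predicted CPI for that allocation."""
    def cpi_of(t):
        return cpi_model(t, quota[t])
    while True:
        slowest = max(quota, key=cpi_of)
        fastest = min(quota, key=cpi_of)
        if fastest == slowest or quota[fastest] <= min_ways:
            break
        quota[fastest] -= 1
        quota[slowest] += 1
        if max(quota, key=cpi_of) != slowest:   # bottleneck moved:
            quota[fastest] += 1                 # revert one step
            quota[slowest] -= 1
            break
    return quota
```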
  • 72. Removing imbalance due to process variation (1/2)
[Figure: a 4-core processor whose cores run at 1.8, 2.1, 2.4 and 2.6 GHz due to process variation]
 For multithreaded programs with synchronization barriers, the slowest core will limit the performance of other cores
Kozhikkottu et al. DAC'14
  • 73. Removing imbalance due to process variation (2/2)
 Use cache partitioning to give a higher cache quota to slower cores
[Figure: for cores at 1.8, 2.1, 2.4 and 2.6 GHz, PV-aware L2 cache partitioning assigns way quotas (e.g., 20, 6, 4 and 2) inversely to core frequency, yielding higher throughput]
Kozhikkottu et al. DAC'14
  • 74. Saving leakage energy (1/2)  Locality of an app = (accesses to LRU blocks)/(accesses to MRU blocks)  Most hits at MRU => app needs few ways to achieve high hit rate and vice versa  Compare this ratio for two apps to decide cache quota Kotera et al. HiPEAC'11
  • 75. Saving leakage energy (2/2)
[Figure: cache ways divided into those allocated to core 0, those allocated to core 1, and power-gated ways]
 Compare the locality ratio with thresholds to decide the number of ways to power-gate
 Insight: if the total cache requirement of the cores < available cache => power-gate the remaining cache to save leakage energy
  • 76. Saving dynamic energy
[Figure: an 8-set, 8-way cache shared by cores 0-3: (a) shared unpartitioned cache — no dynamic energy saving; (b) shared partitioned cache; (c) shared partitioned cache with way-aligned data — dynamic energy saved]
 Ensure way-alignment of the data of each core. On an access by a core, only that core's ways need to be accessed => dynamic energy saved
Sundararajan et al. HPCA’12
  • 77. CPTs in various contexts  If cache is NUCA: try reducing both  misses and  hit latency (by allocating cache banks to closest core)  If main memory is PCM: perform CP such that both misses and writebacks are minimized  since PCM has high write energy/latency and low write endurance
  • 78. Integration of CP with other techniques
  • 79. Integration with processor partitioning
 When variation in the degree of TLP between apps is high, equally distributing processors between them is not optimal
 => Perform both
  Processor partitioning (every 65M cycles) and
  Cache partitioning (every 10M cycles)
Srikantaiah et al. SC'09
  • 80. Integration with DRAM-bank partitioning
 In the physical address, a few bits are common between the LLC set-index bits and the bits used to compute the DRAM bank
 Thus, we can perform cache-only, bank-only or combined partitioning, based on whichever is better
[Figure: physical-address breakdown into page offset and physical frame number, with bank-only bits (21-22), cache-only bits (16-18) and overlapped bits (14-15), on a processor with 8GB memory and 64 banks]
Liu et al. ISCA'14
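Using the bit layout from the slide's example (8GB memory, 64 banks), the three bit classes can be extracted as follows (the masks follow the slide's figure; the helper name is ours):

```python
def partition_bits(addr):
    """Split a physical address into the three classes of index bits
    from the slide's example layout."""
    overlapped = (addr >> 14) & 0x3   # bits 14-15: affect both LLC set and bank
    cache_only = (addr >> 16) & 0x7   # bits 16-18: affect only the LLC set (color)
    bank_only  = (addr >> 21) & 0x3   # bits 21-22: affect only the DRAM bank
    return bank_only, cache_only, overlapped
```

The OS controls these bits through page-frame allocation, so it can partition colors, banks, or both at once.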
  • 81. Integration with Bandwidth Partitioning  Whether BW partitioning can improve perf depends on difference in miss frequencies between apps  With decreasing bandwidth, scope of perf improvement increases  => CP may lower the impact of BW partitioning on perf  By reducing difference in miss frequencies of apps and  By reducing total cache misses which relieves BW pressure  But, if CP increases difference in miss frequencies, it increases impact of BW partitioning on performance.  E.g., for cache insensitive apps, CP cannot improve perf, but by changing difference in miss frequencies, CP enhances effectiveness of BW partitioning in boosting perf Liu et al. HPCA'10
  • 82. Integration with DVFS (1/3)  Model problem of dividing shared resource (chip power budget and LLC capacity) between apps as a dynamic distributed market  each app (core) is an agent  resource-prices change based on “demand” and “supply”  Initially:  each agent has a purchasing budget and builds a performance model as function of allocated resource  A global arbiter fixes initial prices of all resources Wang et al. HPCA'15
  • 83. Integration with DVFS (2/3)  Iteratively: Each agent bids for the resources to maximize its perf  Based on the bids, arbiter increases and reduces the price of resources in high and low demand, respectively.  Agents bid again under new prices  Iteration stops when  change in price within iterations is very small or  a threshold # of iterations done or  no improvement in perf of an agent on changing the bid  At this point, perform resource-allocation
  • 84. Integration with DVFS (3/3)  Agents work in decentralized manner and only centralized function is pricing scheme => this technique scales well to >64 cores  For throughput: assign larger budget to agents with higher marginal utility  For fairness: assign equal budgets to all agents  Find cache utility by seeing miss-rate change and power utility by changing frequency using DVFS
  • 85. Integration with RP selection (1/2)
 CPTs and thrash-resistant RPs are complementary
  RPs temporally share LLC based on apps' "locality"
  CPTs spatially divide LLC based on apps' "utility"
 => Thrash-resistant RPs are good for workloads with poor-locality apps
  CPTs are good for workloads whose apps have widely different utility values
 Idea: Perform CPT with RP-selection to optimize both locality and utility
Zhan et al. TC'14
  • 86. Integration with RP selection (2/2)
[Figure: recency stack of an 8-way cache shared by two cores — Core 0 (quota 5 ways, RP0 = BIP) and Core 1 (quota 3 ways, RP1 = LRU) — each with its own insertion position and replacement point; a decision module selects LRU with probability p and BIP with probability 1-p]
 Find hits at different way-counts for LRU and BIP
 From this, find the optimal CP and the optimal RP
  For CP, use the lookahead algorithm
 The RP chosen for a core is implemented in its cache portion
  • 87. References  S. Mittal, “A Survey of Techniques for Cache Partitioning in Multicore Processors”, ACM Computing Surveys, 2017 (link)