ARI. HiPEAC 2014


  1. Viacheslav Fedorov, Sheng Qiu, Narasimha Reddy, Paul Gratz
     Texas A&M University
     ARI: Adaptive Replacement and Insertion
     HiPEAC 2013, Vienna, Austria
  2. Conventional Main Memory
     ● Usually we only care about speeding up the cache miss path
     [Diagram: Core 0 to Core 3, L2$, L2$, L3$, Main Memory]
  3. Main Memory: Trends
     ● New memories are emerging
     ● DRAM is not dense enough
     ● Replace or augment DRAM
     [Diagram: Core 0 to Core 3, L2$, L2$, L3$; memory options labeled DRAM, PCM, DRAM cache]
  4. PCM Technology
     ● Based on chalcogenide glass
     ● Exploits two phases
       ● Amorphous
       ● Crystalline
     ● Higher density than DRAM
     ● Non-volatile
     Image: Stanford NanoHeat Lab
  5. DRAM vs PCM
     ● DRAM is writeback-agnostic
     ● Write buffers cushion the impact of writebacks
     ● State-of-the-art policies target cache misses
     ● PCM
       ● High write latency – write buffers insufficient
       ● High write energy – mobile, embedded devices?
       ● Low cell endurance – limited write cycles?

     Parameter            DRAM      PCM       PCM/DRAM
     Power
       Row Read           210 mW    78 mW     0.3x
       Row Write          195 mW    773 mW    4x
       Activate           75 mW     25 mW     0.3x
       Standby            90 mW     45 mW     0.5x
       Refresh            4 mW      0 mW      0x
     Latency
       Initial Row Read   15 ns     28 ns     2x
       Row Write          22 ns     150 ns    7x
       Same Row R/W       15 ns     15 ns
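The multipliers shown next to the table can be rechecked from the DRAM and PCM columns. The short sketch below does that arithmetic; the dictionary layout and the function name `pcm_to_dram_ratio` are illustrative choices, not something taken from the slides.

```python
# Power (mW) and latency (ns) figures copied from the slide's table,
# stored as (DRAM, PCM) pairs. The structure itself is an assumption
# made for this example.
TABLE = {
    "Row Read power (mW)": (210, 78),
    "Row Write power (mW)": (195, 773),
    "Activate power (mW)": (75, 25),
    "Standby power (mW)": (90, 45),
    "Refresh power (mW)": (4, 0),
    "Initial Row Read latency (ns)": (15, 28),
    "Row Write latency (ns)": (22, 150),
    "Same Row R/W latency (ns)": (15, 15),
}


def pcm_to_dram_ratio(dram, pcm):
    """Return how many times larger (or smaller) the PCM figure is."""
    return pcm / dram


for name, (dram, pcm) in TABLE.items():
    print(f"{name}: PCM/DRAM = {pcm_to_dram_ratio(dram, pcm):.2f}x")
```

The two write rows stand out: a PCM row write costs roughly four times the power and roughly seven times the latency of a DRAM row write, which is the asymmetry the rest of the talk targets.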
  6. Outline
     ● Introduction
     ● Motivation
     ● ARI: Adaptive Replacement and Insertion
     ● Evaluation
     ● Summary
     ● Conclusion
  7. Motivation
     ● PCM is attractive as a Main Memory, but...
     ● PCM does not favor writes
       ● High energy
       ● High latency
       ● Low write cycle tolerance
     ● Solution: reduce writes into Main Memory
       ● Modify LLC policies to reduce writebacks
       ● Mind the miss rate!
  8. Application behavior in High-Associativity Caches
     ● Bipolar block distribution due to LRU policy
       ● 'Hot' blocks tend to group towards the MRU side
       ● 'Cold' blocks towards the LRU side in a set
     ● Hot blocks have a higher hit ratio
     ● Cold blocks tend to have similar hit ratios
     [Figure: Hit distribution in a high-associativity cache (16-way). Axes: % hit rate vs. position in LRU stack, MRU to LRU; regions marked 'Hot' and 'Cold']
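A hit distribution of this kind can be measured by recording, for every hit, the LRU stack position of the block that was hit. The following is a minimal sketch for a single set under plain LRU; the function name `hit_positions` and the trace format (a list of block tags) are assumptions made for illustration.

```python
from collections import Counter


def hit_positions(trace, ways=16):
    """Simulate one LRU-managed set and count hits per stack position.

    Position 0 is MRU, position ways-1 is LRU. `trace` is a sequence of
    block tags that all map to this set (a hypothetical input format).
    """
    stack = []        # index 0 is the MRU block
    hits = Counter()
    for tag in trace:
        if tag in stack:
            pos = stack.index(tag)
            hits[pos] += 1          # record where in the stack the hit landed
            stack.pop(pos)
        elif len(stack) == ways:
            stack.pop()             # set is full: evict the LRU block
        stack.insert(0, tag)        # accessed block becomes MRU
    return hits


print(hit_positions(["a", "b", "a", "a"]))
```

Dividing each count by the total number of accesses gives the per-position hit rate plotted on the slide.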
  9. Static LLC policies
     ● Based on the observed hot-cold distribution
     ● 16-way cache: 16 static policies, xH16
     ● Replace any clean block among the (16-x) Low-hit blocks
     ● Drawbacks:
       ● No single static policy is good for all applications
       ● Fewer writebacks => more cache misses
         – When replacing hot blocks
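Victim selection under a static xH16 policy might look like the sketch below. The slide only says that any clean block among the (16-x) Low-hit positions may be replaced; scanning from the LRU end and falling back to the plain LRU block when every candidate is dirty are assumptions of this example, as are the names `Block` and `choose_victim`.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    dirty: bool = False


def choose_victim(lru_stack, x):
    """Return the stack index to evict under static policy xH.

    lru_stack[0] is MRU and lru_stack[-1] is LRU. Positions 0..x-1 form
    the High-hit region and are never scanned; the remaining positions
    are scanned from the LRU end for a clean block (assumed scan order).
    """
    for i in range(len(lru_stack) - 1, x - 1, -1):
        if not lru_stack[i].dirty:
            return i
    # Assumed fallback: no clean candidate, so evict the LRU block and
    # accept the writeback.
    return len(lru_stack) - 1


stack = [Block(tag=t) for t in range(16)]
stack[15].dirty = True
stack[14].dirty = True
print(choose_victim(stack, x=4))
```

With x=4 and the two LRU-most blocks dirty, the clean block at position 13 is chosen, so no writeback is generated.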
  10. Enter ARI: Adaptive Replacement and Insertion
      ● Goal: Reduce LLC writebacks!
      ● Keep miss rate lower than conventional policies
      ● How?
        ● Do not replace dirty cache blocks (as long as possible)
        ● Place fresh incoming blocks into LLC smartly
        ● Dynamically choose the best policy
  11. ARI: Operation
      ● Evict clean blocks from Low-Hit region
      ● Insert new blocks into top of Low-Hit region
      [Figure: % hit rate vs. position in LRU stack, MRU to LRU; regions marked High-Hit and Low-Hit]
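The two rules on this slide can be combined into one fill routine for a set. This is a sketch under stated assumptions: the set is a list of `[tag, dirty]` pairs ordered MRU to LRU, the threshold `x` marks the top of the Low-Hit region, and the LRU block is evicted when no clean candidate exists. The function name `fill_on_miss` is hypothetical.

```python
def fill_on_miss(lru_stack, new_tag, x, ways=16):
    """Handle a miss: evict if needed, then insert at the top of Low-Hit.

    lru_stack is a list of [tag, dirty] pairs, MRU first. It is modified
    in place. Returns the evicted pair, or None if the set had room.
    """
    evicted = None
    if len(lru_stack) == ways:
        victim = len(lru_stack) - 1          # assumed fallback: LRU block
        for i in range(len(lru_stack) - 1, x - 1, -1):
            if not lru_stack[i][1]:          # first clean block from the LRU end
                victim = i
                break
        evicted = lru_stack.pop(victim)
    # The new block goes to position x, the top of the Low-Hit region,
    # instead of the MRU position used by plain LRU.
    lru_stack.insert(min(x, len(lru_stack)), [new_tag, False])
    return evicted


# Small 4-way example with the Low-Hit region starting at position 2.
cache_set = [[0, False], [1, False], [2, False], [3, True]]
print(fill_on_miss(cache_set, 99, x=2, ways=4), cache_set)
```

In the example the dirty LRU block (tag 3) survives, the clean block at position 2 is evicted, and the new block takes its place at the top of the Low-Hit region.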
  12. ARI: Operation
      ● Application hit-distributions are not static
      ● Dynamic policy adaptation based on epochs
        ● Emulate various static thresholds in LLC tags
        ● Pick the best one for next epoch (25k LLC accesses)
        ● Misses + Writebacks metric used
      [Figure: % hit rate vs. position in LRU stack, MRU to LRU]
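The epoch mechanism can be sketched as a small selector that accumulates misses plus writebacks for each emulated threshold and switches to the cheapest one every 25k LLC accesses. The class name `EpochSelector`, its methods, and the tie-breaking rule (the first threshold with the lowest cost wins) are assumptions for illustration; the slides do not describe the hardware at this level.

```python
EPOCH_LENGTH = 25_000  # LLC accesses per epoch, as stated on the slide


class EpochSelector:
    """Pick the emulated threshold with the lowest misses + writebacks."""

    def __init__(self, thresholds, initial):
        self.cost = {t: 0 for t in thresholds}  # per-threshold cost this epoch
        self.active = initial                   # threshold used by the real cache
        self.accesses = 0

    def record(self, threshold, misses=0, writebacks=0):
        """Add events observed while emulating `threshold`."""
        self.cost[threshold] += misses + writebacks

    def tick(self):
        """Call once per LLC access; returns True when an epoch ends."""
        self.accesses += 1
        if self.accesses < EPOCH_LENGTH:
            return False
        self.active = min(self.cost, key=self.cost.get)
        self.cost = {t: 0 for t in self.cost}   # start the next epoch from zero
        self.accesses = 0
        return True


selector = EpochSelector(thresholds=[4, 10, 14], initial=4)
selector.record(4, misses=5, writebacks=5)
selector.record(10, misses=3, writebacks=1)
selector.record(14, misses=2, writebacks=9)
for _ in range(EPOCH_LENGTH):
    selector.tick()
print(selector.active)
```

Summing misses and writebacks in one counter is what keeps the policy from trading a large miss-rate increase for a small writeback saving.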
  13. ARI: Implementation
      ● Emulate static thresholds in shadow tags
      ● Adapt to the hit-distribution dynamically
      [Diagram: Core 0 to Core 3, L2$, L2$, L3$ with Tag Array, Data Array, and Shadow Tag Array; shadow policies 4H16, 10H16, 14H16]
  14. Outline
      ● Introduction
      ● Motivation
      ● ARI: Adaptive Replacement and Insertion
      ● Evaluation
      ● Summary
      ● Conclusion
  15. Methodology
      ● gem5 + DRAMSim2 simulators
      ● NVIDIA Tegra-like out-of-order, dual-issue CPU
      ● SPEC2006 and PARSEC suites
      ● Compared against state-of-the-art policies
        ● ARI beats them in writeback reduction
        ● Nearly identical in total performance

      System        Single core                                  Multicore
      L1 cache      32KB I + 64KB D, 2-way, LRU, 64B block       32KB I + 64KB D, 2-way, LRU, 64B block
      L2 cache      256KB, 8-way, LRU, 64B block                 256KB, 8-way, LRU, 64B block (private)
      L3 cache      2MB, 16-way, LRU, 64B block                  16MB, 16-way, LRU, 64B block (shared)
      Main memory   4GB, DDR3-1333 DRAM, 32-entry write buffer   4GB, DDR3-1333 DRAM, 32-entry write buffer
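The number of sets in each cache follows directly from the table: capacity divided by associativity times block size. The helper below, with the hypothetical name `num_sets`, works through that for the L2 and both L3 configurations.

```python
KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, block_bytes):
    """Number of sets in a set-associative cache."""
    return capacity_bytes // (ways * block_bytes)


print(num_sets(256 * KB, 8, 64))   # L2 cache
print(num_sets(2 * MB, 16, 64))    # single-core L3
print(num_sets(16 * MB, 16, 64))   # multicore shared L3
```

The set count matters later in the talk because only a handful of sets are shadowed to emulate the candidate policies.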
  16. ARI: Writeback reduction
      ● ARI beats the competition: 33% WB reduction
      [Figure: Writeback improvement, normalized to LRU policy]
      DIP: M. Qureshi et al., ISCA '07
      DBLK: S. Khan et al., MICRO '10
      RRIP: A. Jaleel et al., ISCA '10
  17. ARI: Miss reduction
      ● ARI achieves a 4.7% miss reduction
      [Figure: Miss rate improvement, normalized to LRU policy]
      DIP: M. Qureshi et al., ISCA '07
      DBLK: S. Khan et al., MICRO '10
      RRIP: A. Jaleel et al., ISCA '10
  18. ARI: Performance improvement
      ● ARI yields a 5% IPC improvement on average
      [Figure: IPC improvement, normalized to LRU policy]
  19. ARI: Dynamic behavior
      ● ARI adapts to program phases
      ● Achieves lower WBs than the best static policy
      [Figure panels: soplex application, SPEC 2006; mcf application, SPEC 2006; label: Writebacks]
  20. ARI: Multicore applications
  21. ARI: PCM lifetime improvement
      ● ARI facilitates the use of PCM as Main Memory
      [Figure: % PCM lifetime improvement (axis 0% to 60%) for DIP, DBLK, RRIP, ARI; annotation: "Decrease lifetime for several apps"]
  22. ARI: PCM lifetime improvement
  23. ARI: Hardware overhead
      ● 8 sets shadowed per LLC bank (x8)
      ● p*2 shadow tags (we use p=9)
      ● 14kB storage overhead in a 16MB LLC
      ● Epoch counter – 15 bits
      ● Performance counters, adders
      ● Not on critical path
      ● Can be designed for low power
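Two of these figures can be sanity-checked with simple arithmetic: a 15-bit counter is the smallest that can count a 25k-access epoch, and 14kB is a very small fraction of a 16MB cache. The sketch below assumes 1kB = 1024 bytes and compares against the data capacity only; the function names are illustrative, and the breakdown of the 14kB itself is not given on the slide and is not reconstructed here.

```python
def counter_bits(max_count):
    """Smallest number of bits able to hold values up to max_count."""
    return max_count.bit_length()


def overhead_percent(overhead_bytes, cache_bytes):
    """Storage overhead as a percentage of the cache data capacity."""
    return 100.0 * overhead_bytes / cache_bytes


KB = 1024
MB = 1024 * KB

print(counter_bits(25_000))                          # epoch of 25k LLC accesses
print(f"{overhead_percent(14 * KB, 16 * MB):.3f}%")  # 14kB in a 16MB LLC
```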
  24. Outline
      ● Introduction
      ● Motivation
      ● ARI: Adaptive Replacement and Insertion
      ● Evaluation
      ● Summary
      ● Conclusion
  25. ARI: Summary
      ● 33% writeback reduction
      ● 4.7% cache miss rate reduction
      ● 9% less Main Memory traffic
      ● System IPC boost of 5%
      ● Enabling PCM as Main Memory
        ● 50% lifetime improvement
      Win – Win
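The writeback and lifetime figures are roughly consistent with each other under a simple back-of-envelope model. The sketch assumes that PCM lifetime is inversely proportional to the number of writes reaching memory (ideal wear leveling, all writes being LLC writebacks). That model is an assumption of this example, not the lifetime model used by the authors, and `lifetime_gain` is a hypothetical name.

```python
def lifetime_gain(writeback_reduction):
    """Relative lifetime increase if lifetime scales as 1 / writes.

    writeback_reduction is a fraction, e.g. 0.33 for 33% fewer writes.
    Assumes ideal wear leveling; this is a simplification.
    """
    return 1.0 / (1.0 - writeback_reduction) - 1.0


print(f"{lifetime_gain(0.33):.1%}")
```

Under that assumption, a 33% cut in writebacks gives a lifetime gain close to the 50% quoted in the summary.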
  26. Conclusion
      ● DRAM is hitting a scalability wall
      ● New memories/architectures proposed
      ● We target PCM as main memory
      ● Propose ARI: Adaptive Replacement and Insertion
        ● Simple scheme
        ● Reduce writebacks to main memory
        ● Boost the PCM performance and lifetime
  27. Thank you! Questions?
  28. Backup Slides
  29. Related Work: PCM
      ● G. Dhiman et al. PDRAM: A hybrid PRAM and DRAM main memory system. DAC ’09
      ● M. K. Qureshi et al. Enhancing Lifetime and Security of PCM-based Main Memory with Start-Gap Wear Leveling. MICRO ’09
      ● B. C. Lee et al. Architecting Phase Change Memory as a Scalable DRAM Alternative. ISCA ’09
      ● M. K. Qureshi et al. Scalable high performance main memory system using phase-change memory technology. ISCA ’09
      ● A. P. Ferreira et al. Increasing PCM main memory lifetime. DATE ’10
  30. Related Work: PCM
      ● N. H. Seong et al. Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. ISCA ’10
      ● H. Yoon et al. Row buffer locality aware caching policies for hybrid memories. ICCD ’12
      ● Stuecheli et al. The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. ISCA ’10
      ● M. K. Qureshi & G. H. Loh. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. MICRO ’12
  31. ARI: Insertion impact
  32. ARI: Total Memory Traffic
      [Figure: Total memory traffic, Misses + Writebacks, normalized to LRU (axis 0.6 to 1.4), for 4H16 and ARI. Benchmarks: gcc, bzip, bwaves, mcf, milc, zeus, gromacs, cactusADM, leslie3d, namd, gobmk, soplex, hmmer, sjeng, GemsFDTD, h264ref, astar, sphinx3, avg]