SlideShare a Scribd company logo
1 of 31
3-D Memory Stacking
3-D Stacked memory can provide large caches at high bandwidth
3D Stacking for low latency and high bandwidth memory system
- E.g. Half the latency, 8x the bandwidth [Loh&Hill, MICRO’11]
Stacked DRAM: Few hundred MB, not enough for main memory
Hardware-managed cache is desirable: Transparent to software
Source: Loh and Hill MICRO’11
Problems in Architecting Large Caches
Architecting tag-store for low-latency and low-storage is challenging
Organizing at cache line granularity (64 B) reduces wasted space and
wasted bandwidth
Problem: Cache of hundreds of MB needs tag-store of tens of MB
E.g. 256MB DRAM cache needs ~20MB tag store (5 bytes/line)
Option 1: SRAM Tags
Fast, But Impractical
(Not enough transistors)
Option 2: Tags in DRAM
Naïve design has 2x latency
(One access each for tag, data)
Loh-Hill Cache Design [Micro’11, TopPicks]
Recent work tries to reduce latency of Tags-in-DRAM approach
LH-Cache design similar to traditional set-associative cache
2KB row buffer = 32 cache lines
Speed-up cache miss detection:
A MissMap (2MB) in L3 tracks lines of pages resident in DRAM cache
Miss
Map
Data lines (29-ways)Tags
Cache organization: A 29-way set-associative DRAM (in 2KB row)
Keep Tag and Data in same DRAM row (tag-store & data store)
Data access guaranteed row-buffer hit (Latency ~1.5x instead of 2x)
Cache Optimizations Considered Harmful
Need to revisit DRAM cache structure given widely different constraints
DRAM caches are slow  Don’t make them slower
Many “seemingly-indispensable” and “well-understood” design
choices degrade performance of DRAM cache:
• Serial tag and data access
• High associativity
• Replacement update
Optimizations effective only in certain parameters/constraints
Parameters/constraints of DRAM cache quite different from SRAM
E.g. Placing one set in entire DRAM row  Row buffer hit rate ≈ 0%
Outline
 Introduction & Background
 Insight: Optimize First for Latency
 Proposal: Alloy Cache
 Memory Access Prediction
 Summary
Simple Example: Fast Cache (Typical)
Optimizing for hit-rate (at expense of hit latency) is effective
Consider a system with cache: hit latency 0.1 miss latency: 1
Base Hit Rate: 50% (base average latency: 0.55)
Opt A removes 40% misses (hit-rate:70%), increases hit latency by 40%
Base Cache Opt-A
Break Even
Hit-Rate=52%
Hit-Rate A=70%
Simple Example: Slow Cache (DRAM)
Base Cache Opt-A
Break Even
Hit-Rate=83%
Consider a system with cache: hit latency 0.5 miss latency: 1
Base Hit Rate: 50% (base average latency: 0.75)
Opt A removes 40% misses (hit-rate:70%), increases hit latency by 40%
Hit-Rate A=70%
Optimizations that increase hit latency start becoming ineffective
Overview of Different Designs
Our Goal: Outperform SRAM-Tags with a simple and practical design
For DRAM caches, critical to optimize first for latency, then hit-rate
What is the Hit Latency Impact?
Both SRAM-Tag and LH-Cache have much higher latency  ineffective
Consider Isolated accesses: X always gives row buffer hit, Y needs an row activation
How about Bandwidth?
LH-Cache reduces effective DRAM cache bandwidth by > 4x
Configuration Raw
Bandwidth
Transfer
Size on Hit
Effective
Bandwidth
Main Memory 1x 64B 1x
DRAM$(SRAM-Tag) 8x 64B 8x
DRAM$(LH-Cache) 8x 256B+16B 1.8x
DRAM$(IDEAL) 8x 64B 8x
For each hit, LH-Cache transfers:
• 3 lines of tags (3x64=192 bytes)
• 1 line for data (64 bytes)
• Replacement update (16 bytes)
Performance Potential
LH-Cache gives 8.7%, SRAM-Tag 24%, latency-optimized design 38%
8-core system with 8MB shared L3 cache at 24 cycles
DRAM Cache: 256MB (Shared), latency 2x lower than off-chip
0.6
0.8
1
1.2
1.4
1.6
1.8
Speedup(NoDRAM$)
LH-Cache SRAM-Tag IDEAL-Latency Optimized
De-optimizing for Performance
More benefits from optimizing for hit-latency than for hit-rate
LH-Cache uses LRU/DIP  needs update, uses bandwidth
LH-Cache can be configured as direct map  row buffer hits
Configuration Speedup Hit-Rate Hit-Latency
(cycles)
LH-Cache 8.7% 55.2% 107
LH-Cache + Random Repl. 10.2% 51.5% 98
LH-Cache (Direct Map) 15.2% 49.0% 82
IDEAL-LO (Direct Map) 38.4% 48.2% 35
Outline
 Introduction & Background
 Insight: Optimize First for Latency
 Proposal: Alloy Cache
 Memory Access Prediction
 Summary
Alloy Cache: Avoid Tag Serialization
Alloy Cache has low latency and uses less bandwidth
No dependent access for Tag and Data  Avoids Tag serialization
Consecutive lines in same DRAM row  High row buffer hit-rate
No need for separate “Tag-store” and “Data-Store”  Alloy Tag+Data
One “Tag+Data”
0.6
0.8
1
1.2
1.4
1.6
1.8
Performance of Alloy Cache
Alloy Cache with good predictor can outperform SRAM-Tag
Alloy+MissMap SRAM-TagAlloy+PerfectPredAlloy Cache
Speedup(NoDRAM$)
Alloy Cache with no early-miss detection gets 22%, close to SRAM-Tag
Outline
 Introduction & Background
 Insight: Optimize First for Latency
 Proposal: Alloy Cache
 Memory Access Prediction
 Summary
Cache Access Models
Each model has distinct advantage: lower latency or lower BW usage
Serial Access Model (SAM) and Parallel Access Model (PAM)
Higher Miss Latency
Needs less BW
Lower Miss Latency
Needs more BW
To Wait or Not to Wait?
Using Dynamic Access Model (DAM), we can get best latency and BW
Dynamic Access Model: Best of both SAM and PAM
When line likely to be present in cache use SAM, else use PAM
Memory Access
Predictor (MAP)
L3-miss
Address
Prediction =
Cache Hit
Prediction =
Memory Access
Use PAM
Use SAM
Memory Access Predictor (MAP)
Proposed MAP designs simple and low latency
We can use Hit Rate as proxy for MAP: High hit-rate SAM, low PAM
Accuracy improved with History-Based prediction
1. History-Based Global MAP (MAP-G)
• Single saturating counter per-core (3-bit)
• Increment on cache hit, decrement on miss
• MSB indicates SAM or PAM
Table
Of
Counters
(3-bit)
Miss PC
MAC
2. Instruction Based MAP (MAP-PC)
• Have a table of saturating counter
• Index table based on miss-causing PC
• Table of 256 entries sufficient (96 bytes)
0.6
0.8
1
1.2
1.4
1.6
1.8
Predictor Performance
Simple Memory Access Predictors obtain almost all potential gains
Speedup(NoDRAM$)
Alloy+MAP-Global Alloy +MAP-PC Alloy+PerfectMAPAlloy+NoPred
Accuracy of MAP-Global: 82% Accuracy of MAP-PC: 94%
Alloy Cache with MAP-PC gets 35%, Perfect MAP gets 36.5%
Hit-Latency versus Hit-Rate
Latency LH-Cache SRAM-Tag Alloy Cache
Average Latency (cycles) 107 67 43
Relative Latency 2.5x 1.5x 1.0x
Cache Size LH-Cache
(29-way)
Alloy Cache
(1-way)
Delta
Hit-Rate
256MB 55.2% 48.2% 7%
512MB 59.6% 55.2% 4.4%
1GB 62.6% 59.1% 2.5%
DRAM Cache Hit Rate
Alloy Cache reduces hit latency greatly at small loss of hit-rate
DRAM Cache Hit Latency
Outline
 Introduction & Background
 Insight: Optimize First for Latency
 Proposal: Alloy Cache
 Memory Access Prediction
 Summary
Summary
 DRAM Caches are slow, don’t make them slower
 Previous research: DRAM cache architected similar to SRAM cache
 Insight: Optimize DRAM cache first for latency, then hit-rate
 Latency optimized Alloy Cache avoids tag serialization
 Memory Access Predictor: simple, low latency, yet highly effective
 Alloy Cache + MAP outperforms SRAM-Tags (35% vs. 24%)
 Calls for new ways to manage DRAM cache space and bandwidth
Questions
Acknowledgement:
Work on “Memory Access Prediction” done while at IBM Research.
(Patent application filed Feb 2010, published Aug 2011)
Potential for Improvement
Design Performance
Improvement
Alloy Cache + MAP-PC 35.0%
Alloy Cache + Perfect Predictor 36.6%
IDEAL-LO Cache 38.4%
IDEAL-LO + No Tag Overhead 41.0%
Size Analysis
Simple Latency-Optimized design outperforms Impractical SRAM-Tags!
1.00
1.10
1.20
1.30
1.40
1.50
64MB 128MB 256MB 512MB 1GB
SRAM-Tags Alloy Cache + MAP-PCLH-Cache + MissMap
Proposed design provides 1.5x the benefit of SRAM-Tags
(LH-Cache provides about one-third the benefit)
Speedup(NoDRAM$)
How about Commercial Workloads?
Cache
Size
Hit-Rate
(1-way)
Hit-Rate
(32-way)
Hit-Rate
Delta
256MB 53.0% 60.3% 7.3%
512MB 58.6% 63.6% 5.0%
1GB 62.1% 65.1% 3.0%
Data averaged over 7 commercial workloads
Prediction Accuracy of MAP
MAP-PC
What about other SPEC benchmarks?
http://research.cs.wisc.edu/multifacet/papers/micro11_missmap_addendum.pdf
LH-Cache Addendum: Revised Results
SAM vs. PAM

More Related Content

Viewers also liked

Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonTony Nguyen
 
Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your siteHarry Potter
 
Abstract class
Abstract classAbstract class
Abstract classFraboni Ec
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsHarry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHarry Potter
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaLuis Goldster
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in pythonHarry Potter
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaTony Nguyen
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in pythonTony Nguyen
 
Object oriented programming-with_java
Object oriented programming-with_javaObject oriented programming-with_java
Object oriented programming-with_javaTony Nguyen
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with pythonTony Nguyen
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonLuis Goldster
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsTony Nguyen
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisFraboni Ec
 

Viewers also liked (20)

Poo java
Poo javaPoo java
Poo java
 
Api crash
Api crashApi crash
Api crash
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your site
 
Abstract class
Abstract classAbstract class
Abstract class
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Poo java
Poo javaPoo java
Poo java
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Poo java
Poo javaPoo java
Poo java
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 
Object oriented programming-with_java
Object oriented programming-with_javaObject oriented programming-with_java
Object oriented programming-with_java
 
Python basics
Python basicsPython basics
Python basics
 
Object model
Object modelObject model
Object model
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 

Similar to Hardware managed cache

CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesDilum Bandara
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Hsien-Hsin Sean Lee, Ph.D.
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonJAXLondon2014
 
2012 benjamin klenk-future-memory_technologies-presentation
2012 benjamin klenk-future-memory_technologies-presentation2012 benjamin klenk-future-memory_technologies-presentation
2012 benjamin klenk-future-memory_technologies-presentationSaket Vihari
 
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptxonur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptxsivasubramanianManic2
 
Unit I Memory technology and optimization
Unit I Memory technology and optimizationUnit I Memory technology and optimization
Unit I Memory technology and optimizationK Gowsic Gowsic
 
Memory technology and optimization in Advance Computer Architechture
Memory technology and optimization in Advance Computer ArchitechtureMemory technology and optimization in Advance Computer Architechture
Memory technology and optimization in Advance Computer ArchitechtureShweta Ghate
 
Memory_Unit Cache Main Virtual Associative
Memory_Unit Cache Main Virtual AssociativeMemory_Unit Cache Main Virtual Associative
Memory_Unit Cache Main Virtual AssociativeRNShukla7
 
An introduction to column store indexes and batch mode
An introduction to column store indexes and batch modeAn introduction to column store indexes and batch mode
An introduction to column store indexes and batch modeChris Adkin
 
Mba admission in india
Mba admission in indiaMba admission in india
Mba admission in indiaEdhole.com
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Виталий Стародубцев
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Cheng-Hsuan Li
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Sql server scalability fundamentals
Sql server scalability fundamentalsSql server scalability fundamentals
Sql server scalability fundamentalsChris Adkin
 

Similar to Hardware managed cache (20)

CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching Techniques
 
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
2012 benjamin klenk-future-memory_technologies-presentation
2012 benjamin klenk-future-memory_technologies-presentation2012 benjamin klenk-future-memory_technologies-presentation
2012 benjamin klenk-future-memory_technologies-presentation
 
Memoryhierarchy
MemoryhierarchyMemoryhierarchy
Memoryhierarchy
 
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptxonur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
onur-comparch-fall2018-lecture3b-memoryhierarchyandcaches-afterlecture.pptx
 
Unit I Memory technology and optimization
Unit I Memory technology and optimizationUnit I Memory technology and optimization
Unit I Memory technology and optimization
 
Memory technology and optimization in Advance Computer Architechture
Memory technology and optimization in Advance Computer ArchitechtureMemory technology and optimization in Advance Computer Architechture
Memory technology and optimization in Advance Computer Architechture
 
Memory_Unit Cache Main Virtual Associative
Memory_Unit Cache Main Virtual AssociativeMemory_Unit Cache Main Virtual Associative
Memory_Unit Cache Main Virtual Associative
 
Memory Mapping Cache
Memory Mapping CacheMemory Mapping Cache
Memory Mapping Cache
 
An introduction to column store indexes and batch mode
An introduction to column store indexes and batch modeAn introduction to column store indexes and batch mode
An introduction to column store indexes and batch mode
 
Mba admission in india
Mba admission in indiaMba admission in india
Mba admission in india
 
Memory Organization
Memory OrganizationMemory Organization
Memory Organization
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...
 
memory.ppt
memory.pptmemory.ppt
memory.ppt
 
memory.ppt
memory.pptmemory.ppt
memory.ppt
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Sql server scalability fundamentals
Sql server scalability fundamentalsSql server scalability fundamentals
Sql server scalability fundamentals
 

More from Fraboni Ec

Hardware multithreading
Hardware multithreadingHardware multithreading
Hardware multithreadingFraboni Ec
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreadingFraboni Ec
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceFraboni Ec
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningFraboni Ec
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data miningFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksFraboni Ec
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsFraboni Ec
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonFraboni Ec
 
Abstract data types
Abstract data typesAbstract data types
Abstract data typesFraboni Ec
 
Abstraction file
Abstraction fileAbstraction file
Abstraction fileFraboni Ec
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaFraboni Ec
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with pythonFraboni Ec
 
Learning python
Learning pythonLearning python
Learning pythonFraboni Ec
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in pythonFraboni Ec
 

More from Fraboni Ec (20)

Hardware multithreading
Hardware multithreadingHardware multithreading
Hardware multithreading
 
Lisp
LispLisp
Lisp
 
What is simultaneous multithreading
What is simultaneous multithreadingWhat is simultaneous multithreading
What is simultaneous multithreading
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Object model
Object modelObject model
Object model
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 
Inheritance
InheritanceInheritance
Inheritance
 
Api crash
Api crashApi crash
Api crash
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
Learning python
Learning pythonLearning python
Learning python
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 

Recently uploaded

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 

Recently uploaded (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 

Hardware managed cache

  • 1. 3-D Memory Stacking 3-D Stacked memory can provide large caches at high bandwidth 3D Stacking for low latency and high bandwidth memory system - E.g. Half the latency, 8x the bandwidth [Loh&Hill, MICRO’11] Stacked DRAM: Few hundred MB, not enough for main memory Hardware-managed cache is desirable: Transparent to software Source: Loh and Hill MICRO’11
  • 2. Problems in Architecting Large Caches Architecting tag-store for low-latency and low-storage is challenging Organizing at cache line granularity (64 B) reduces wasted space and wasted bandwidth Problem: Cache of hundreds of MB needs tag-store of tens of MB E.g. 256MB DRAM cache needs ~20MB tag store (5 bytes/line) Option 1: SRAM Tags Fast, But Impractical (Not enough transistors) Option 2: Tags in DRAM Naïve design has 2x latency (One access each for tag, data)
  • 3. Loh-Hill Cache Design [Micro’11, TopPicks] Recent work tries to reduce latency of Tags-in-DRAM approach LH-Cache design similar to traditional set-associative cache 2KB row buffer = 32 cache lines Speed-up cache miss detection: A MissMap (2MB) in L3 tracks lines of pages resident in DRAM cache Miss Map Data lines (29-ways)Tags Cache organization: A 29-way set-associative DRAM (in 2KB row) Keep Tag and Data in same DRAM row (tag-store & data store) Data access guaranteed row-buffer hit (Latency ~1.5x instead of 2x)
  • 4. Cache Optimizations Considered Harmful Need to revisit DRAM cache structure given widely different constraints DRAM caches are slow  Don’t make them slower Many “seemingly-indispensable” and “well-understood” design choices degrade performance of DRAM cache: • Serial tag and data access • High associativity • Replacement update Optimizations effective only in certain parameters/constraints Parameters/constraints of DRAM cache quite different from SRAM E.g. Placing one set in entire DRAM row  Row buffer hit rate ≈ 0%
  • 5. Outline  Introduction & Background  Insight: Optimize First for Latency  Proposal: Alloy Cache  Memory Access Prediction  Summary
  • 6. Simple Example: Fast Cache (Typical) Optimizing for hit-rate (at expense of hit latency) is effective Consider a system with cache: hit latency 0.1 miss latency: 1 Base Hit Rate: 50% (base average latency: 0.55) Opt A removes 40% misses (hit-rate:70%), increases hit latency by 40% Base Cache Opt-A Break Even Hit-Rate=52% Hit-Rate A=70%
  • 7. Simple Example: Slow Cache (DRAM) Base Cache Opt-A Break Even Hit-Rate=83% Consider a system with cache: hit latency 0.5 miss latency: 1 Base Hit Rate: 50% (base average latency: 0.75) Opt A removes 40% misses (hit-rate:70%), increases hit latency by 40% Hit-Rate A=70% Optimizations that increase hit latency start becoming ineffective
  • 8. Overview of Different Designs Our Goal: Outperform SRAM-Tags with a simple and practical design For DRAM caches, critical to optimize first for latency, then hit-rate
  • 9. What is the Hit Latency Impact? Both SRAM-Tag and LH-Cache have much higher latency  ineffective Consider Isolated accesses: X always gives row buffer hit, Y needs an row activation
  • 10. How about Bandwidth? LH-Cache reduces effective DRAM cache bandwidth by > 4x Configuration Raw Bandwidth Transfer Size on Hit Effective Bandwidth Main Memory 1x 64B 1x DRAM$(SRAM-Tag) 8x 64B 8x DRAM$(LH-Cache) 8x 256B+16B 1.8x DRAM$(IDEAL) 8x 64B 8x For each hit, LH-Cache transfers: • 3 lines of tags (3x64=192 bytes) • 1 line for data (64 bytes) • Replacement update (16 bytes)
  • 11. Performance Potential LH-Cache gives 8.7%, SRAM-Tag 24%, latency-optimized design 38% 8-core system with 8MB shared L3 cache at 24 cycles DRAM Cache: 256MB (Shared), latency 2x lower than off-chip 0.6 0.8 1 1.2 1.4 1.6 1.8 Speedup(NoDRAM$) LH-Cache SRAM-Tag IDEAL-Latency Optimized
  • 12. De-optimizing for Performance More benefits from optimizing for hit-latency than for hit-rate LH-Cache uses LRU/DIP  needs update, uses bandwidth LH-Cache can be configured as direct map  row buffer hits Configuration Speedup Hit-Rate Hit-Latency (cycles) LH-Cache 8.7% 55.2% 107 LH-Cache + Random Repl. 10.2% 51.5% 98 LH-Cache (Direct Map) 15.2% 49.0% 82 IDEAL-LO (Direct Map) 38.4% 48.2% 35
  • 13. Outline  Introduction & Background  Insight: Optimize First for Latency  Proposal: Alloy Cache  Memory Access Prediction  Summary
  • 14. Alloy Cache: Avoid Tag Serialization Alloy Cache has low latency and uses less bandwidth No dependent access for Tag and Data  Avoids Tag serialization Consecutive lines in same DRAM row  High row buffer hit-rate No need for separate “Tag-store” and “Data-Store”  Alloy Tag+Data One “Tag+Data”
  • 15. 0.6 0.8 1 1.2 1.4 1.6 1.8 Performance of Alloy Cache Alloy Cache with good predictor can outperform SRAM-Tag Alloy+MissMap SRAM-TagAlloy+PerfectPredAlloy Cache Speedup(NoDRAM$) Alloy Cache with no early-miss detection gets 22%, close to SRAM-Tag
  • 16. Outline  Introduction & Background  Insight: Optimize First for Latency  Proposal: Alloy Cache  Memory Access Prediction  Summary
  • 17. Cache Access Models Each model has distinct advantage: lower latency or lower BW usage Serial Access Model (SAM) and Parallel Access Model (PAM) Higher Miss Latency Needs less BW Lower Miss Latency Needs more BW
  • 18. To Wait or Not to Wait? Using Dynamic Access Model (DAM), we can get best latency and BW Dynamic Access Model: Best of both SAM and PAM When line likely to be present in cache use SAM, else use PAM Memory Access Predictor (MAP) L3-miss Address Prediction = Cache Hit Prediction = Memory Access Use PAM Use SAM
  • 19. Memory Access Predictor (MAP) Proposed MAP designs simple and low latency We can use Hit Rate as proxy for MAP: High hit-rate SAM, low PAM Accuracy improved with History-Based prediction 1. History-Based Global MAP (MAP-G) • Single saturating counter per-core (3-bit) • Increment on cache hit, decrement on miss • MSB indicates SAM or PAM Table Of Counters (3-bit) Miss PC MAC 2. Instruction Based MAP (MAP-PC) • Have a table of saturating counter • Index table based on miss-causing PC • Table of 256 entries sufficient (96 bytes)
  • 20. 0.6 0.8 1 1.2 1.4 1.6 1.8 Predictor Performance Simple Memory Access Predictors obtain almost all potential gains Speedup(NoDRAM$) Alloy+MAP-Global Alloy +MAP-PC Alloy+PerfectMAPAlloy+NoPred Accuracy of MAP-Global: 82% Accuracy of MAP-PC: 94% Alloy Cache with MAP-PC gets 35%, Perfect MAP gets 36.5%
  • 21. Hit-Latency versus Hit-Rate Latency LH-Cache SRAM-Tag Alloy Cache Average Latency (cycles) 107 67 43 Relative Latency 2.5x 1.5x 1.0x Cache Size LH-Cache (29-way) Alloy Cache (1-way) Delta Hit-Rate 256MB 55.2% 48.2% 7% 512MB 59.6% 55.2% 4.4% 1GB 62.6% 59.1% 2.5% DRAM Cache Hit Rate Alloy Cache reduces hit latency greatly at small loss of hit-rate DRAM Cache Hit Latency
  • 22. Outline  Introduction & Background  Insight: Optimize First for Latency  Proposal: Alloy Cache  Memory Access Prediction  Summary
  • 23. Summary  DRAM Caches are slow, don’t make them slower  Previous research: DRAM cache architected similar to SRAM cache  Insight: Optimize DRAM cache first for latency, then hit-rate  Latency optimized Alloy Cache avoids tag serialization  Memory Access Predictor: simple, low latency, yet highly effective  Alloy Cache + MAP outperforms SRAM-Tags (35% vs. 24%)  Calls for new ways to manage DRAM cache space and bandwidth
  • 24. Questions Acknowledgement: Work on “Memory Access Prediction” done while at IBM Research. (Patent application filed Feb 2010, published Aug 2011)
  • 25. Potential for Improvement Design Performance Improvement Alloy Cache + MAP-PC 35.0% Alloy Cache + Perfect Predictor 36.6% IDEAL-LO Cache 38.4% IDEAL-LO + No Tag Overhead 41.0%
  • 26. Size Analysis Simple Latency-Optimized design outperforms Impractical SRAM-Tags! 1.00 1.10 1.20 1.30 1.40 1.50 64MB 128MB 256MB 512MB 1GB SRAM-Tags Alloy Cache + MAP-PCLH-Cache + MissMap Proposed design provides 1.5x the benefit of SRAM-Tags (LH-Cache provides about one-third the benefit) Speedup(NoDRAM$)
  • 27. How about Commercial Workloads? Cache Size Hit-Rate (1-way) Hit-Rate (32-way) Hit-Rate Delta 256MB 53.0% 60.3% 7.3% 512MB 58.6% 63.6% 5.0% 1GB 62.1% 65.1% 3.0% Data averaged over 7 commercial workloads
  • 29. What about other SPEC benchmarks?

Editor's Notes

  1. Give a brief overview of the presentation. Describe the major focus of the presentation and why it is important. Introduce each of the major topics. To provide a road map for the audience, you can repeat this Overview slide throughout the presentation, highlighting the particular topic you will discuss next.
  2. Give a brief overview of the presentation. Describe the major focus of the presentation and why it is important. Introduce each of the major topics. To provide a road map for the audience, you can repeat this Overview slide throughout the presentation, highlighting the particular topic you will discuss next.
  3. Give a brief overview of the presentation. Describe the major focus of the presentation and why it is important. Introduce each of the major topics. To provide a road map for the audience, you can repeat this Overview slide throughout the presentation, highlighting the particular topic you will discuss next.
  4. Give a brief overview of the presentation. Describe the major focus of the presentation and why it is important. Introduce each of the major topics. To provide a road map for the audience, you can repeat this Overview slide throughout the presentation, highlighting the particular topic you will discuss next.