SlideShare a Scribd company logo
1 of 36
MASK: Redesigning the GPU
Memory Hierarchy to Support
Multi-Application Concurrency
Rachata Ausavarungnirun
Vance Miller Joshua Landgraf Saugata Ghose
Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu
Executive Summary
2
Problem: Address translation overheads
limit the latency hiding capability of a GPU
Key Idea
Prioritize address translation requests over data requests
MASK: a GPU memory hierarchy that
A. Reduces shared TLB contention
B. Improves L2 cache utilization
C. Reduces page walk latency
MASK improves system throughput by 57.8% on average
over state-of-the-art address translation mechanisms
Large performance loss vs. no translation
High contention at the shared TLB
Low L2 cache utilization
Outline
3
• Executive Summary
• Background, Key Challenges and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
Why Share Discrete GPUs?
4
• Enables multiple GPGPU applications to
run concurrently
• Better resource utilization
• An application often cannot utilize an entire GPU
• Different compute and bandwidth demands
• Enables GPU sharing in the cloud
• Multiple users spatially share each GPU
Key requirement: fine-grained memory protection
GPU Core
Private TLB
State-of-the-art Address Translation in GPUs
5
GPU Core
GPU Core
GPU Core
Shared TLB
Private TLB
Page Table
Walkers
Page Table
(in main memory)
Private TLB Private TLB
Private
Shared
App 1
App 2
GPU Core
Private TLB
A TLB Miss Stalls Multiple Warps
6
GPU Core
GPU Core
GPU Core
Shared TLB
Private TLB
Page Table
Walkers
Page Table
(in main memory)
Private TLB Private TLB
Private
Shared
App 1
App 2
Stalled
Stalled
Data in a page is
shared by many threads
All threads
access the same page
GPU Core
Private TLB
Multiple Page Walks Happen Together
7
GPU Core
GPU Core
GPU Core
Shared TLB
Private TLB
Page Table
Walkers
Page Table
(in main memory)
Private TLB Private TLB
Private
Shared
App 1
App 2
GPU’s parallelism creates
parallel page walks
Stalled Stalled
Stalled
Stalled
Data in a page is
shared by many threads
All threads
access the same page
0 0.2 0.4 0.6 0.8 1
Ideal SharedTLB PWCache
Effect of Translation on Performance
8
Normalized Performance
0 0.2 0.4 0.6 0.8 1
Ideal SharedTLB PWCache
Effect of Translation on Performance
9
Normalized Performance
What causes the large performance loss?
0 0.2 0.4 0.6 0.8 1
Ideal SharedTLB PWCache
Effect of Translation on Performance
10
37.4%
Normalized Performance
Problem 1: Contention at the Shared TLB
11
• Multiple GPU applications contend for the TLB
0.0
0.2
0.4
0.6
0.8
1.0
App 1 App 2 App 1 App 2 App 1 App 2 App 1 App 2
L2
TLB
Miss
Rate
(Lower
is
Better)
Alone Shared
3DS_HISTO CONS_LPS MUM_HISTO RED_RAY
Problem 1: Contention at the Shared TLB
12
• Multiple GPU applications contend for the TLB
0.0
0.2
0.4
0.6
0.8
1.0
App 1 App 2 App 1 App 2 App 1 App 2 App 1 App 2
L2
TLB
Miss
Rate
(Lower
is
Better)
Alone Shared
3DS_HISTO CONS_LPS MUM_HISTO RED_RAY
Contention at the shared TLB leads to lower performance
Problem 2: Thrashing at the L2 Cache
13
• L2 cache can be used to reduce page walk latency
 Partial translation data can be cached
• Thrashing Source 1: Parallel page walks
 Different address translation data evicts each other
• Thrashing Source 2: GPU memory intensity
 Demand-fetched data evicts address translation data
L2 cache is ineffective at reducing page walk latency
Observation: Address Translation Is Latency Sensitive
14
• Multiple warps share data from a single page
0
10
20
30
40
3DS
BFS2
BLK
BP
CFD
CONS
FFT
FWT
GUPS
HISTO
HS
JPEG
LIB
LPS
LUD
LUH
MM
MUM
NN
NW
QTC
RAY
RED
SAD
SC
SCAN
SCP
SPMV
SRAD
TRD
Average
Warps
Stalled
Per
One
TLB
Miss
A single TLB miss causes 8 warps to stall on average
Observation: Address Translation Is Latency Sensitive
15
• Multiple warps share data from a single page
• GPU’s parallelism causes multiple concurrent page walks
0
20
40
60
3DS
BFS2
BLK
BP
CFD
CONS
FFT
FWT
GUPS
HISTO
HS
JPEG
LIB
LPS
LUD
LUH
MM
MUM
NN
NW
QTC
RAY
RED
SAD
SC
SCAN
SCP
SPMV
SRAD
TRD
Average
Concurrent
Page
Walks
High address translation latency  More stalled warps
Our Goals
16
•Reduce shared TLB contention
•Improve L2 cache utilization
•Lower page walk latency
Outline
17
• Executive Summary
• Background, Key Challenges and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
MASK: A Translation-aware Memory Hierarchy
18
•Reduce shared TLB contention
A. TLB-fill Tokens
•Improve L2 cache utilization
B. Translation-aware L2 Bypass
•Lower page walk latency
C. Address-space-aware Memory Scheduler
A: TLB-fill Tokens
19
• Goal: Limit the number of warps that can fill the TLB
 A warp with a token fills the shared TLB
 A warp with no token fills a very small bypass cache
• Number of tokens changes based on TLB miss rate
 Updated every epoch
• Tokens are assigned based on warp ID
Benefit: Limits contention at the shared TLB
MASK: A Translation-aware Memory Hierarchy
20
•Reduce shared TLB contention
A. TLB-fill Tokens
•Improve L2 cache utilization
B. Translation-aware L2 Bypass
•Lower page walk latency
C. Address-space-aware Memory Scheduler
B: Translation-aware L2 Bypass
21
• L2 hit rate decreases for deep page walk levels
0 0.2 0.4 0.6 0.8 1
L2 Cache Hit Rate
Page Table Level 1
B: Translation-aware L2 Bypass
22
• L2 hit rate decreases for deep page walk levels
0 0.2 0.4 0.6 0.8 1
L2 Cache Hit Rate
Page Table Level 1
Page Table Level 2
B: Translation-aware L2 Bypass
23
• L2 hit rate decreases for deep page walk levels
0 0.2 0.4 0.6 0.8 1
L2 Cache Hit Rate
Page Table Level 1
Page Table Level 2
Page Table Level 3
B: Translation-aware L2 Bypass
24
• L2 hit rate decreases for deep page walk levels
Some address translation data does not benefit from caching
Only cache address translation data with high hit rate
0 0.2 0.4 0.6 0.8 1
L2 Cache Hit Rate
Page Table Level 1
Page Table Level 2
Page Table Level 3
Page Table Level 4
Cache
0 0.2 0.4 0.6 0.8 1
B: Translation-aware L2 Bypass
25
• Goal: Cache address translation data with high hit rate
L2 Cache Hit Rate
Page Table Level 1
Page Table Level 2
Page Table Level 3
Page Table Level 4 Bypass
Benefit 1: Better L2 cache utilization for translation data
Benefit 2: Bypassed requests  No L2 queuing delay
Average L2 Cache Hit Rate
MASK: A Translation-aware Memory Hierarchy
26
•Reduce shared TLB contention
A. TLB-fill Tokens
•Improve L2 cache utilization
B. Translation-aware L2 Bypass
•Lower page walk latency
C. Address-space-aware Memory Scheduler
C: Address-space-aware Memory Scheduler
27
• Cause: Address translation requests are treated similarly
to data demand requests
Idea: Lower address translation request latency
0
100
200
300
400
500
DRAM
Latency
0.0
0.2
0.4
0.6
0.8
1.0
DRAM
Bandwidth
Address Translation Requests Data Demand Requests
C: Address-space-aware Memory Scheduler
28
• Idea 1: Prioritize address translation requests
over data demand requests
To
DRAM
Memory Scheduler
Golden Queue
Address Translation Request
Normal Queue
Data Demand Request
High Priority
Low Priority
C: Address-space-aware Memory Scheduler
29
• Idea 1: Prioritize address translation requests
over data demand requests
• Idea 2: Improve quality-of-service using the Silver Queue
To
DRAM
Memory Scheduler
Golden Queue
Address Translation Request
Silver Queue
Data Demand Request
(Applications take turns)
Normal Queue
Data Demand Request
High Priority
Low Priority
Each application takes turn injecting into the Silver Queue
Outline
30
• Executive summary
• Background, Key Challenges and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
Methodology
31
• Mosaic simulation platform [MICRO ’17]
- Based on GPGPU-Sim and MAFIA [Jog et al., MEMSYS ’15]
- Models page walks and virtual-to-physical mapping
- Available at https://github.com/CMU-SAFARI/Mosaic
• NVIDIA GTX750 Ti
• Two GPGPU applications execute concurrently
• CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
- 3 workload categories based on TLB miss rate
Comparison Points
32
• State-of-the-art CPU–GPU memory management
[Power et al., HPCA ’14]
 PWCache: Page Walk Cache GPU MMU design
 SharedTLB: Shared TLB GPU MMU design
• Ideal: Every TLB access is an L1 TLB hit
0.0
0.5
1.0
1.5
2.0
2.5
0-HMR 1-HMR 2-HMR Average
Normalized
Performance
PWCache SharedTLB MASK Ideal
Performance
33
57.8%
52.0%
61.2%
58.7%
MASK outperforms state-of-the-art design for every workload
Other Results in the Paper
34
• MASK reduces unfairness
• Effectiveness of each individual component
• All three MASK components are effective
• Sensitivity analysis over multiple GPU architectures
• MASK improves performance on all evaluated architectures,
including CPU–GPU heterogeneous systems
• Sensitivity analysis to different TLB sizes
• MASK improves performance on all evaluated sizes
• Performance improvement over different memory
scheduling policies
• MASK improves performance over other
state-of-the-art memory schedulers
Conclusion
35
Problem: Address translation overheads
limit the latency hiding capability of a GPU
Key Idea
Prioritize address translation requests over data requests
MASK: a translation-aware GPU memory hierarchy
A. TLB-fill Tokens reduces shared TLB contention
B. Translation-aware L2 Bypass improves L2 cache utilization
C. Address-space-aware Memory Scheduler reduces page walk latency
MASK improves system throughput by 57.8% on average
over state-of-the-art address translation mechanisms
Large performance loss vs. no translation
High contention at the shared TLB
Low L2 cache utilization
MASK: Redesigning the GPU
Memory Hierarchy to Support
Multi-Application Concurrency
Rachata Ausavarungnirun
Vance Miller Joshua Landgraf Saugata Ghose
Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu

More Related Content

Similar to MASK-address-space-aware-gpu-memory-hierarchy_asplos18-talk-nobackup.pptx

Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
 
Speed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with AlluxioSpeed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with AlluxioAlluxio, Inc.
 
Navigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern DatabasesNavigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern DatabasesShivji Kumar Jha
 
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...Mydbops
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudCeph Community
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudPatrick McGarry
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantSwiss Data Forum Swiss Data Forum
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
CICS TS 5.1 Loader Domain enhancements
CICS TS 5.1 Loader Domain enhancementsCICS TS 5.1 Loader Domain enhancements
CICS TS 5.1 Loader Domain enhancementsLarry Lawler
 
Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014marvin herrera
 
Membase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
LDAP at Lightning Speed
 LDAP at Lightning Speed LDAP at Lightning Speed
LDAP at Lightning SpeedC4Media
 
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red_Hat_Storage
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Michael Zhang
 
Caching Methodology & Strategies
Caching Methodology & StrategiesCaching Methodology & Strategies
Caching Methodology & StrategiesTiệp Vũ
 

Similar to MASK-address-space-aware-gpu-memory-hierarchy_asplos18-talk-nobackup.pptx (20)

Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Speed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with AlluxioSpeed Up Uber's Presto with Alluxio
Speed Up Uber's Presto with Alluxio
 
Navigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern DatabasesNavigating Transactions: ACID Complexity in Modern Databases
Navigating Transactions: ACID Complexity in Modern Databases
 
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
Navigating Transactions: ACID Complexity in Modern Databases- Mydbops Open So...
 
RISC.ppt
RISC.pptRISC.ppt
RISC.ppt
 
13 risc
13 risc13 risc
13 risc
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
 
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack CloudJourney to Stability: Petabyte Ceph Cluster in OpenStack Cloud
Journey to Stability: Petabyte Ceph Cluster in OpenStack Cloud
 
Retour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenantRetour d'expérience d'un environnement base de données multitenant
Retour d'expérience d'un environnement base de données multitenant
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
CICS TS 5.1 Loader Domain enhancements
CICS TS 5.1 Loader Domain enhancementsCICS TS 5.1 Loader Domain enhancements
CICS TS 5.1 Loader Domain enhancements
 
Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014Colvin exadata mistakes_ioug_2014
Colvin exadata mistakes_ioug_2014
 
Membase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San FranciscoMembase Intro from Membase Meetup San Francisco
Membase Intro from Membase Meetup San Francisco
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
LDAP at Lightning Speed
 LDAP at Lightning Speed LDAP at Lightning Speed
LDAP at Lightning Speed
 
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
Red Hat Storage Day Seattle: Stabilizing Petabyte Ceph Cluster in OpenStack C...
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.
 
Lect15
Lect15Lect15
Lect15
 
Caching Methodology & Strategies
Caching Methodology & StrategiesCaching Methodology & Strategies
Caching Methodology & Strategies
 

Recently uploaded

(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...ranjana rawat
 
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311  Call Girls in Thane , Independent Escort Service ThanePallawi 9167673311  Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service ThanePooja Nehwal
 
Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...
Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...
Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...nagunakhan
 
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...Suhani Kapoor
 
如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查
如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查
如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查awo24iot
 
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Call Girls in Nagpur High Profile
 
Develop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power pointDevelop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power pointGetawu
 
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...Suhani Kapoor
 
Call Girls in Vashi Escorts Services - 7738631006
Call Girls in Vashi Escorts Services - 7738631006Call Girls in Vashi Escorts Services - 7738631006
Call Girls in Vashi Escorts Services - 7738631006Pooja Nehwal
 
Thane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsThane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsPooja Nehwal
 
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...Pooja Nehwal
 
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,Pooja Nehwal
 
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...Pooja Nehwal
 
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...Call Girls in Nagpur High Profile
 
(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Naicy mandal
 
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样qaffana
 

Recently uploaded (20)

(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
(MEGHA) Hinjewadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune E...
 
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311  Call Girls in Thane , Independent Escort Service ThanePallawi 9167673311  Call Girls in Thane , Independent Escort Service Thane
Pallawi 9167673311 Call Girls in Thane , Independent Escort Service Thane
 
Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...
Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...
Russian Escorts in lucknow 💗 9719455033 💥 Lovely Lasses: Radiant Beauties Shi...
 
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
VIP Call Girls Hitech City ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With R...
 
如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查
如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查
如何办理(Adelaide毕业证)阿德莱德大学毕业证成绩单Adelaide学历认证真实可查
 
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
 
Develop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power pointDevelop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power point
 
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
VIP Call Girls Kavuri Hills ( Hyderabad ) Phone 8250192130 | ₹5k To 25k With ...
 
Call Girls in Vashi Escorts Services - 7738631006
Call Girls in Vashi Escorts Services - 7738631006Call Girls in Vashi Escorts Services - 7738631006
Call Girls in Vashi Escorts Services - 7738631006
 
Thane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call GirlsThane Escorts, (Pooja 09892124323), Thane Call Girls
Thane Escorts, (Pooja 09892124323), Thane Call Girls
 
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
 
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
 
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Chikhali Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
Call Girls In Andheri East Call 9892124323 Book Hot And Sexy Girls,
 
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...High Profile Call Girls In Andheri 7738631006 Call girls in mumbai  Mumbai ...
High Profile Call Girls In Andheri 7738631006 Call girls in mumbai Mumbai ...
 
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
 
(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(ANIKA) Wanwadi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
 
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
 

MASK-address-space-aware-gpu-memory-hierarchy_asplos18-talk-nobackup.pptx

  • 1. MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu
  • 2. Executive Summary 2 Problem: Address translation overheads limit the latency hiding capability of a GPU Key Idea Prioritize address translation requests over data requests MASK: a GPU memory hierarchy that A. Reduces shared TLB contention B. Improves L2 cache utilization C. Reduces page walk latency MASK improves system throughput by 57.8% on average over state-of-the-art address translation mechanisms Large performance loss vs. no translation High contention at the shared TLB Low L2 cache utilization
  • 3. Outline 3 • Executive Summary • Background, Key Challenges and Our Goal • MASK: A Translation-aware Memory Hierarchy • Evaluation • Conclusion
  • 4. Why Share Discrete GPUs? 4 • Enables multiple GPGPU applications to run concurrently • Better resource utilization • An application often cannot utilize an entire GPU • Different compute and bandwidth demands • Enables GPU sharing in the cloud • Multiple users spatially share each GPU Key requirement: fine-grained memory protection
  • 5. GPU Core Private TLB State-of-the-art Address Translation in GPUs 5 GPU Core GPU Core GPU Core Shared TLB Private TLB Page Table Walkers Page Table (in main memory) Private TLB Private TLB Private Shared App 1 App 2
  • 6. GPU Core Private TLB A TLB Miss Stalls Multiple Warps 6 GPU Core GPU Core GPU Core Shared TLB Private TLB Page Table Walkers Page Table (in main memory) Private TLB Private TLB Private Shared App 1 App 2 Stalled Stalled Data in a page is shared by many threads All threads access the same page
  • 7. GPU Core Private TLB Multiple Page Walks Happen Together 7 GPU Core GPU Core GPU Core Shared TLB Private TLB Page Table Walkers Page Table (in main memory) Private TLB Private TLB Private Shared App 1 App 2 GPU’s parallelism creates parallel page walks Stalled Stalled Stalled Stalled Data in a page is shared by many threads All threads access the same page
  • 8. 0 0.2 0.4 0.6 0.8 1 Ideal SharedTLB PWCache Effect of Translation on Performance 8 Normalized Performance
  • 9. 0 0.2 0.4 0.6 0.8 1 Ideal SharedTLB PWCache Effect of Translation on Performance 9 Normalized Performance
  • 10. What causes the large performance loss? 0 0.2 0.4 0.6 0.8 1 Ideal SharedTLB PWCache Effect of Translation on Performance 10 37.4% Normalized Performance
  • 11. Problem 1: Contention at the Shared TLB 11 • Multiple GPU applications contend for the TLB 0.0 0.2 0.4 0.6 0.8 1.0 App 1 App 2 App 1 App 2 App 1 App 2 App 1 App 2 L2 TLB Miss Rate (Lower is Better) Alone Shared 3DS_HISTO CONS_LPS MUM_HISTO RED_RAY
  • 12. Problem 1: Contention at the Shared TLB 12 • Multiple GPU applications contend for the TLB 0.0 0.2 0.4 0.6 0.8 1.0 App 1 App 2 App 1 App 2 App 1 App 2 App 1 App 2 L2 TLB Miss Rate (Lower is Better) Alone Shared 3DS_HISTO CONS_LPS MUM_HISTO RED_RAY Contention at the shared TLB leads to lower performance
  • 13. Problem 2: Thrashing at the L2 Cache 13 • L2 cache can be used to reduce page walk latency  Partial translation data can be cached • Thrashing Source 1: Parallel page walks  Different address translation data evicts each other • Thrashing Source 2: GPU memory intensity  Demand-fetched data evicts address translation data L2 cache is ineffective at reducing page walk latency
  • 14. Observation: Address Translation Is Latency Sensitive 14 • Multiple warps share data from a single page 0 10 20 30 40 3DS BFS2 BLK BP CFD CONS FFT FWT GUPS HISTO HS JPEG LIB LPS LUD LUH MM MUM NN NW QTC RAY RED SAD SC SCAN SCP SPMV SRAD TRD Average Warps Stalled Per One TLB Miss A single TLB miss causes 8 warps to stall on average
  • 15. Observation: Address Translation Is Latency Sensitive 15 • Multiple warps share data from a single page • GPU’s parallelism causes multiple concurrent page walks 0 20 40 60 3DS BFS2 BLK BP CFD CONS FFT FWT GUPS HISTO HS JPEG LIB LPS LUD LUH MM MUM NN NW QTC RAY RED SAD SC SCAN SCP SPMV SRAD TRD Average Concurrent Page Walks High address translation latency  More stalled warps
  • 16. Our Goals 16 •Reduce shared TLB contention •Improve L2 cache utilization •Lower page walk latency
  • 17. Outline 17 • Executive Summary • Background, Key Challenges and Our Goal • MASK: A Translation-aware Memory Hierarchy • Evaluation • Conclusion
  • 18. MASK: A Translation-aware Memory Hierarchy 18 •Reduce shared TLB contention A. TLB-fill Tokens •Improve L2 cache utilization B. Translation-aware L2 Bypass •Lower page walk latency C. Address-space-aware Memory Scheduler
  • 19. A: TLB-fill Tokens 19 • Goal: Limit the number of warps that can fill the TLB  A warp with a token fills the shared TLB  A warp with no token fills a very small bypass cache • Number of tokens changes based on TLB miss rate  Updated every epoch • Tokens are assigned based on warp ID Benefit: Limits contention at the shared TLB
  • 20. MASK: A Translation-aware Memory Hierarchy 20 •Reduce shared TLB contention A. TLB-fill Tokens •Improve L2 cache utilization B. Translation-aware L2 Bypass •Lower page walk latency C. Address-space-aware Memory Scheduler
  • 21. B: Translation-aware L2 Bypass 21 • L2 hit rate decreases for deep page walk levels 0 0.2 0.4 0.6 0.8 1 L2 Cache Hit Rate Page Table Level 1
  • 22. B: Translation-aware L2 Bypass 22 • L2 hit rate decreases for deep page walk levels 0 0.2 0.4 0.6 0.8 1 L2 Cache Hit Rate Page Table Level 1 Page Table Level 2
  • 23. B: Translation-aware L2 Bypass 23 • L2 hit rate decreases for deep page walk levels 0 0.2 0.4 0.6 0.8 1 L2 Cache Hit Rate Page Table Level 1 Page Table Level 2 Page Table Level 3
  • 24. B: Translation-aware L2 Bypass 24 • L2 hit rate decreases for deep page walk levels Some address translation data does not benefit from caching Only cache address translation data with high hit rate 0 0.2 0.4 0.6 0.8 1 L2 Cache Hit Rate Page Table Level 1 Page Table Level 2 Page Table Level 3 Page Table Level 4
  • 25. Cache 0 0.2 0.4 0.6 0.8 1 B: Translation-aware L2 Bypass 25 • Goal: Cache address translation data with high hit rate L2 Cache Hit Rate Page Table Level 1 Page Table Level 2 Page Table Level 3 Page Table Level 4 Bypass Benefit 1: Better L2 cache utilization for translation data Benefit 2: Bypassed requests  No L2 queuing delay Average L2 Cache Hit Rate
  • 26. MASK: A Translation-aware Memory Hierarchy 26 •Reduce shared TLB contention A. TLB-fill Tokens •Improve L2 cache utilization B. Translation-aware L2 Bypass •Lower page walk latency C. Address-space-aware Memory Scheduler
  • 27. C: Address-space-aware Memory Scheduler 27 • Cause: Address translation requests are treated similarly to data demand requests Idea: Lower address translation request latency 0 100 200 300 400 500 DRAM Latency 0.0 0.2 0.4 0.6 0.8 1.0 DRAM Bandwidth Address Translation Requests Data Demand Requests
  • 28. C: Address-space-aware Memory Scheduler 28 • Idea 1: Prioritize address translation requests over data demand requests To DRAM Memory Scheduler Golden Queue Address Translation Request Normal Queue Data Demand Request High Priority Low Priority
  • 29. C: Address-space-aware Memory Scheduler 29 • Idea 1: Prioritize address translation requests over data demand requests • Idea 2: Improve quality-of-service using the Silver Queue To DRAM Memory Scheduler Golden Queue Address Translation Request Silver Queue Data Demand Request (Applications take turns) Normal Queue Data Demand Request High Priority Low Priority Each application takes turn injecting into the Silver Queue
  • 30. Outline 30 • Executive summary • Background, Key Challenges and Our Goal • MASK: A Translation-aware Memory Hierarchy • Evaluation • Conclusion
  • 31. Methodology 31 • Mosaic simulation platform [MICRO ’17] - Based on GPGPU-Sim and MAFIA [Jog et al., MEMSYS ’15] - Models page walks and virtual-to-physical mapping - Available at https://github.com/CMU-SAFARI/Mosaic • NVIDIA GTX750 Ti • Two GPGPU applications execute concurrently • CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites - 3 workload categories based on TLB miss rate
  • 32. Comparison Points 32 • State-of-the-art CPU–GPU memory management [Power et al., HPCA ’14]  PWCache: Page Walk Cache GPU MMU design  SharedTLB: Shared TLB GPU MMU design • Ideal: Every TLB access is an L1 TLB hit
  • 33. 0.0 0.5 1.0 1.5 2.0 2.5 0-HMR 1-HMR 2-HMR Average Normalized Performance PWCache SharedTLB MASK Ideal Performance 33 57.8% 52.0% 61.2% 58.7% MASK outperforms state-of-the-art design for every workload
  • 34. Other Results in the Paper 34 • MASK reduces unfairness • Effectiveness of each individual component • All three MASK components are effective • Sensitivity analysis over multiple GPU architectures • MASK improves performance on all evaluated architectures, including CPU–GPU heterogeneous systems • Sensitivity analysis to different TLB sizes • MASK improves performance on all evaluated sizes • Performance improvement over different memory scheduling policies • MASK improves performance over other state-of-the-art memory schedulers
  • 35. Conclusion 35 Problem: Address translation overheads limit the latency hiding capability of a GPU Key Idea Prioritize address translation requests over data requests MASK: a translation-aware GPU memory hierarchy A. TLB-fill Tokens reduces shared TLB contention B. Translation-aware L2 Bypass improves L2 cache utilization C. Address-space-aware Memory Scheduler reduces page walk latency MASK improves system throughput by 57.8% on average over state-of-the-art address translation mechanisms Large performance loss vs. no translation High contention at the shared TLB Low L2 cache utilization
  • 36. MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu

Editor's Notes

  1. Add the intro to running multiple application before jumping right in. Too micro level animation
  2. Add some spacing
  3. GPU can benefit from sharing. For example, a cloud provider can increase GPU utilization by allowing multiple users to share the GPU. Applications with heterogeneous resource demand can be co-schedule and executed concurrently to maximize all the available GPU resources. For example, application that demands high compute units can concurrently share GPU resource with applications that demand high memory bandwidth. However, to safely share GPU across multiple applications, the GPU needs to be able to provide memory protection, which can be enabled to virtual address translation.
  4. In this example, the GPU contains four cores and is shared among two applications, app 1 and app 2. In this scenario, when the GPU want to load or store data, it first needs to translate the virtual address into the physical address. To do this, the page table walk walk the page table using the virtual address. This process incur multiple reads to the main memory, and is a high latency operation. To alleviate this performance penalty, modern processor, including the state-of-the-art GPU design employs a private translation lookaside buffer to reduce the latency of address translation. This TLB can include the private per-core TLB, as shown in this example, and a second level shared TLB. With these structures, the page table walker only need to translate addresses that miss in both private and the shared TLB. However, we found that as the GPU is shared across multiple applications, each application generates contention at the shared TLB, decreasing the effectiveness of these components.
  5. Make the warp lines bigger
  6. High number of concurrent walks Get rid of the warp thingy Stalled have different color
  7. As a result, we found that the long latency page walk and the contention at the shared TLB leads to two phenomenon. First, because GPU execute instructions in a lockstep, each TLB miss stalls multiple warps, leading to significantly lower throughput. Furthermore, the high amount of parallelism generate high number of concurrent page walks. Both of these limits the latency hiding capability of the GPUs, leading to lower performance. In our evaluation, we found that address translation leads to 45.6% performance degradation even with the state-of-the-art GPU MMU design. So, we would like to identify any inefficiency in order to reduce the overhead of address translation.
  8. As a result, we found that the long latency page walk and the contention at the shared TLB leads to two phenomenon. First, because GPU execute instructions in a lockstep, each TLB miss stalls multiple warps, leading to significantly lower throughput. Furthermore, the high amount of parallelism generate high number of concurrent page walks. Both of these limits the latency hiding capability of the GPUs, leading to lower performance. In our evaluation, we found that address translation leads to 45.6% performance degradation even with the state-of-the-art GPU MMU design. So, we would like to identify any inefficiency in order to reduce the overhead of address translation.
  9. As a result, we found that the long latency page walk and the contention at the shared TLB leads to two phenomenon. First, because GPU execute instructions in a lockstep, each TLB miss stalls multiple warps, leading to significantly lower throughput. Furthermore, the high amount of parallelism generate high number of concurrent page walks. Both of these limits the latency hiding capability of the GPUs, leading to lower performance. In our evaluation, we found that address translation leads to 45.6% performance degradation even with the state-of-the-art GPU MMU design. So, we would like to identify any inefficiency in order to reduce the overhead of address translation.
  10. To further understand the performance overhead in more detail, we provide three key observations. First, we found that there is a significant amount of thrashing at the shared TLB when GPU is being shared across multiple applications. In this example, on the y-axis we show the TLB miss rate of each application pair assuming that it is running along on the GPU.
  11. And this red bar shows the TLB miss rate of the application pair when the two applications are running together. As shown here, when multiple applications are sharing the shared L2 TLB, the TLB miss rate significantly increases, and each individual applications on all four pairs observe higher than 40% shared TLB miss rate. As you can see here, we found this shared TLB contention leads to significant trashing and causing unnecessary high number of page table walks.
  12. If we have the data use the data, if not say thrashing from multiple page walk requests + normal data requests
  13. Say the average PER CORE
  14. Say the average
  15. To this end, our design goal are the following: we would like to be able to achieve high TLB reach with low demand paging latency. In addition, we want our design our mechanism to be application-transparent such that programmers do not need to modify GPGPU applications to take advantage of our design.
  16. But in order to achieve the best of both large and small pages, we first present key challenges and our design goal
  17. To this end, our design goal are the following: we would like to be able to achieve high TLB reach with low demand paging latency. In addition, we want our design our mechanism to be application-transparent such that programmers do not need to modify GPGPU applications to take advantage of our design.
  18. We limit the Prioritizing warps with a token
  19. To this end, our design goal are the following: we would like to be able to achieve high TLB reach with low demand paging latency. In addition, we want our design our mechanism to be application-transparent such that programmers do not need to modify GPGPU applications to take advantage of our design.
  20. Put in the radix tree Make it horizontal. Use it for the graph of the next slide, show what is cached and what is bypassed Move the title down to the first bullet and keep the title
  21. Put in the radix tree Make it horizontal. Use it for the graph of the next slide, show what is cached and what is bypassed Move the title down to the first bullet and keep the title
  22. Put in the radix tree Make it horizontal. Use it for the graph of the next slide, show what is cached and what is bypassed Move the title down to the first bullet and keep the title
  23. Put in the radix tree Make it horizontal. Use it for the graph of the next slide, show what is cached and what is bypassed Move the title down to the first bullet and keep the title
  24. Make it horizontal. Use it for the graph of the next slide, show what is cached and what is bypassed Banner: Same width
  25. To this end, our design goal are the following: we would like to be able to achieve high TLB reach with low demand paging latency. In addition, we want our design our mechanism to be application-transparent such that programmers do not need to modify GPGPU applications to take advantage of our design.
  26. This one should not have the silver queue Double check the term
  27. Motion path for blue (if have time)
  28. But in order to achieve the best of both large and small pages, we first present key challenges and our design goal
  29. We uses a modified version GPGPU-sim, which models a 30-core GTX750 Ti. We allows multiple GPGPU applications to be executed concurrently. Our infrastructure models the page table walk, page table as well as the virtual to physical mapping. We evaluate our framework using various benchmark suite and in total we run 235 workloads. Our framework is publicly available on Github.
  30. We compare Mosaic to two designs. The first baseline is the state-of-the-art CPU-GPU memory management proposed by Power et al in HPCA 2014. We call this baseline GPU-MMU. We also compare our mechanism against the Ideal TLB baseline with every single TLB access is a TLB hit.
  31. We compare Mosaic to two baselines. The first baseline is the state-of-the-art CPU-GPU memory management proposed by Power et al in HPCA 2014. We call this baseline GPU-MMU. We also compare our mechanism against the Ideal TLB baseline with every single TLB access is a TLB hit.
  32. Add the intro to running multiple application before jumping right in. Too micro level animation
  33. At the end: Thank you. I am happy to take any questions.