PL-4049, Cache Coherence for GPU Architectures, by Arvindh Shriraman and Tor Aamodt

Cache coherence for
GPU Architectures

Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, Tor M. Aamodt, Cache Coherence for GPU
Architectures, In Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture
(HPCA-19)

Agenda
Challenges with CPU coherence on GPUs.
Temporal Coherence: Rethinking coherence for GPUs
What is the cost of providing coherence?

Why provide coherence?
1. Inter-workgroup communication

2. Atomic operations

Characterizing and Evaluating a Key-value Store
Application on Heterogeneous CPU-GPU Systems, ISPASS 2012

3. Task queues


Cache Coherence
Programmer's view: processors (P P P P) on top of one Shared Memory
Appearance: One global copy of every location

Cache Coherence
[Figure: Multicores and GPUs side by side, each drawn as processors (P) with L1 caches, L2, and Memory.]

Cache Coherence
Heterogeneous Systems
[Figure: processors (P) with L1 and L2 caches, plus further L1 caches, all above one Memory.]

How to provide coherence?

Challenges with coherence
[Figure: two L1 caches and a Shared L2, with numbered steps 1 to 3.]

Challenge 1: Traffic
[Figure: L1 caches and a Shared L2.]
30% more traffic than current GPUs

Challenge 2: Buffer Overhead
[Figure: L1 caches and a Shared L2 with a Protocol Buffer.]
Coherence protocol buffers require 28% of total L2

Challenge 3: Complexity
[Figure: two L1 caches and a Shared L2, with numbered steps 1 to 3.]
Incoherent protocol: 4 states
Coherent protocol: 16 states

Coherence Overhead
Coherence messages
1. Traffic overhead
2. Area overhead
3. Protocol complexity

How to achieve coherence without messages?

Temporal Coherence
Time-based Approach
- trigger protocol events on timer alerts
[Figure: two L1 caches, a Shared L2, and a Clock. On a Load, the L1 copy carries a local timestamp (LT) and the L2 copy a global timestamp (GT).]
L1 copy: Valid if TIME < LT
L2 copy: Shared if TIME < GT
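
The two timestamp checks on this slide can be sketched in a few lines of Python. This is only an illustration: it reads the rules as strict comparisons (TIME < LT, TIME < GT), which is an assumption about the comparison sign, and the names `L1Copy`, `L2Line`, `l1_valid`, and `l2_shared` are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class L1Copy:
    value: int
    lt: int  # local timestamp (LT): time at which this copy expires


@dataclass
class L2Line:
    value: int
    gt: int = 0  # global timestamp (GT): latest expiry handed to any L1


def l1_valid(copy: L1Copy, now: int) -> bool:
    # Assumed reading of the slide: the L1 copy is valid while TIME < LT.
    return now < copy.lt


def l2_shared(line: L2Line, now: int) -> bool:
    # Assumed reading of the slide: the line may be cached in some L1 while TIME < GT.
    return now < line.gt


copy = L1Copy(value=7, lt=20)
line = L2Line(value=7, gt=20)
print(l1_valid(copy, 19), l1_valid(copy, 20))    # prints: True False
print(l2_shared(line, 19), l2_shared(line, 20))  # prints: True False
```

Both checks use only the local clock and a stored timestamp, so no message to another cache is needed to decide whether a copy may be used.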

Temporal Coherence
[Figure: worked example with L1 caches and the shared L2.]
TIME 0: Load. One L1 caches the line with timestamp 20; the L2 records 20. Line shared till 20.
TIME 5: Load. Another L1 caches the line with timestamp 25; the L2 records 25. Line shared till 25.
TIME 15: Write. The L1 copies (20 and 25) are still unexpired, so the Write waits.
TIME 20: Write still waiting.
TIME 25: The line is no longer shared; the Write proceeds.
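
The timeline in this example can be replayed with a small model of one L2 line. This is a sketch under stated assumptions, not the authors' implementation: the class `StrongL2Line`, its methods, and the fixed lifetime of 20 time units (chosen because it reproduces the timestamps 20 and 25 shown for loads at TIME 0 and TIME 5) are all hypothetical.

```python
class StrongL2Line:
    """One L2 line where a write waits for every L1 copy to expire (illustrative)."""

    def __init__(self, value=0):
        self.value = value
        self.global_timestamp = 0  # latest expiry handed to any L1

    def load(self, now, lifetime):
        """Serve an L1 load miss; return (value, expiry) for the new L1 copy."""
        expiry = now + lifetime
        self.global_timestamp = max(self.global_timestamp, expiry)
        return self.value, expiry

    def write(self, now, value):
        """Apply a write; return the time at which it is allowed to complete."""
        done = max(now, self.global_timestamp)
        self.value = value
        return done


line = StrongL2Line()
_, lt_a = line.load(now=0, lifetime=20)   # first L1 copy
_, lt_b = line.load(now=5, lifetime=20)   # second L1 copy
write_done = line.write(now=15, value=1)  # write arrives at TIME 15
stall = write_done - 15
print(lt_a, lt_b, write_done, stall)      # prints: 20 25 25 10
```

The write completes only when the line's global timestamp has passed, which is why a longer lifetime translates directly into a longer write stall.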

Temporal Coherence
No coherence messages
All transactions are 2-hop
Protocol complexity minimal
Supports strong and weak memory models
Enables optimized communication (ask me later...)

How to set the block lifetime?
• Longer = writes may stall
• Shorter = may not exploit temporal locality
• Lifetime predictor at L2, updated on three events:
-Load to expired block: lengthen the lifetime (for temporal locality)
-Store to unexpired block: shorten the lifetime (reduce write stalls)
-Eviction of unexpired block: shorten the lifetime (reduce L2 eviction stalls)
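
The three predictor events listed on this slide can be sketched as a simple saturating counter. The slide gives only the events and their goals; the starting lifetime, step size, and bounds below are assumptions picked for illustration, and `LifetimePredictor` is a hypothetical name.

```python
class LifetimePredictor:
    """Lifetime prediction kept at the L2 (illustrative; all constants are assumed)."""

    def __init__(self, initial=20, step=4, lowest=0, highest=64):
        self.lifetime = initial
        self.step = step
        self.lowest = lowest
        self.highest = highest

    def on_load_to_expired_block(self):
        # The copy expired before it was reused: lengthen to capture temporal locality.
        self.lifetime = min(self.highest, self.lifetime + self.step)

    def on_store_to_unexpired_block(self):
        # A write met a live copy: shorten to reduce write stalls.
        self.lifetime = max(self.lowest, self.lifetime - self.step)

    def on_eviction_of_unexpired_block(self):
        # The L2 had to evict a live line: shorten to reduce L2 eviction stalls.
        self.lifetime = max(self.lowest, self.lifetime - self.step)


predictor = LifetimePredictor()
predictor.on_load_to_expired_block()
after_load = predictor.lifetime
predictor.on_store_to_unexpired_block()
predictor.on_eviction_of_unexpired_block()
print(after_load, predictor.lifetime)  # prints: 24 16
```

Clamping at both ends keeps the prediction from running away when one kind of event dominates.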

Temporal Coherence (Weak)
[Figure: a Write reaches the Shared L2 while L1 copies with timestamps 20 and 25 are unexpired.]
Sensitive to misprediction
Resource stalls
Hurts GPU applications

Goal: eliminate Write Stalls!

[Figure: worked example with L1 copies of timestamps 20 and 25.]
TIME 15: Write. The Write is applied without waiting; the unexpired L1 copies now hold OLD data.
TIME 15: Fence. The Fence after the Write starts waiting (......).
TIME 20: Fence still waiting; timestamp 20 has been reached.
TIME 25: Timestamp 25 has been reached, no OLD copies remain, and the Fence completes.

No Access Stalls
Efficient GPU applications
Aggressive lifetime predictors
Supports weak memory models
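
The idea of moving the wait from the write to the fence can be sketched as follows. This is a rough model under assumptions, not the protocol as specified by the authors: `WeakL2Line`, `Thread`, and the field `writes_visible_at` are hypothetical names, and the lifetime of 20 is chosen only to reproduce the timestamps 20 and 25 from the example.

```python
class WeakL2Line:
    """One L2 line where writes never wait for L1 copies to expire (illustrative)."""

    def __init__(self, value=0):
        self.value = value
        self.global_timestamp = 0  # latest expiry handed to any L1

    def load(self, now, lifetime):
        expiry = now + lifetime
        self.global_timestamp = max(self.global_timestamp, expiry)
        return self.value, expiry

    def write(self, now, value):
        """Apply the write at once; return when all OLD L1 copies will have expired."""
        self.value = value
        return max(now, self.global_timestamp)


class Thread:
    """Tracks the time by which this thread's earlier writes are visible everywhere."""

    def __init__(self):
        self.writes_visible_at = 0

    def write(self, line, now, value):
        visible = line.write(now, value)
        self.writes_visible_at = max(self.writes_visible_at, visible)
        return now  # the write itself completes immediately

    def fence(self, now):
        # The fence, not the write, waits for stale copies to expire.
        return max(now, self.writes_visible_at)


line = WeakL2Line()
line.load(now=0, lifetime=20)  # L1 copy valid until 20
line.load(now=5, lifetime=20)  # L1 copy valid until 25
thread = Thread()
write_done = thread.write(line, now=15, value=1)
fence_done = thread.fence(now=15)
print(write_done, fence_done)  # prints: 15 25
```

Programs that issue many writes between fences pay the wait once per fence instead of once per write, which fits a weak memory model.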

Coherence Applications
• Lock-based programs
-Barnes Hut
-Cloth Physics
-Place-and-Route

• Stencil
-Max-Flow Min-cut
-3D equation solver

• Load balancing
-Octree Partitioning

Interconnect Traffic
GPU Applications (do not need coherence)
[Figure: bar chart of interconnect traffic (axis 0 to 2) for NO.CC, MESI, GPU-VI, and TC. Callouts on the chart: 2.3, Wr-Through, .8x, .3x, No msgs.]


Speedup
[Figure: bar chart of Speedup (axis 0 to 1.75) for NO L1, MESI, GPU-VI, and TC. Callout on the chart: Need a 32KB directory.]

Protocol Complexity

               NonCoherent   GPU-VI   Temporal Coherence
L1 Stable      2             2        2
L1 Transient   2             1        1
L2 Stable      2             5        5
L2 Transient   2             10       3

What did we learn
• Throughput and heterogeneous architectures require a more streamlined caching framework.
• Single-chip integration enables mechanisms that we can exploit to simplify communication protocols.
• Efficient coherence protocols enable programmers to deploy accelerators for wider purposes.

Contact: ashriram@cs.sfu.ca or aamodt@ece.ubc.ca
• Obtain GPGPU-Sim with coherence support
http://www.ece.ubc.ca/~isingh/gpgpusim-ruby.tar.gz

Interconnect Energy
[Figure: Normalized Energy and Normalized Power (axes 0.0 to 1.6), split into Router (Static), Router (Dynamic), Link (Static), and Link (Dynamic), for NO-COH, NO-L1, MESI, GPU-VI, GPU-Vini, and TCW. Panels: Interworkgroup, Intraworkgroup.]
Figure 8. Breakdown of interconnect traffic for coherent and non-coherent GPU memory systems.
[Figure: normalized Interconnect Traffic split by message type (RCL, INV, REQ, ATO, ST, LD) for NO-COH, NO-L1, MESI, GPU-VI, GPU-Vini, and TCW. Panels: (a) Inter-workgroup communication, (b) Intra-workgroup communication. Benchmark labels: BH, CC, CL, DLB, STN, VPR, HSP, KMN, LPS, NDL, RG, SR, AVG.]

TC-Strong vs TC-Weak
[Figure: Speedup over all applications for TCSUO, TCSOO, TCS, TCW, and TCW w/ predictor. Panels: Fixed lifetime for all applications; Best lifetime for each application.]
