Cache Optimization Techniques for General
Purpose Graphics Processing Units
D.R.V.L.B. Thambawita
Supervised By
Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering
Faculty of Engineering
University of Peradeniya
What is this GPU? Is it important?
(Diagram: example application domains such as AI and fluid dynamics)
Figure: CPU vs GPGPU
Why this research?
How do we optimize CPU cache access (at the programming stage)?
Use the available CPU cache optimization techniques (too many references to list...).
How do we optimize GPGPU cache access (at the programming stage)?
Do we have the resources???
Our contribution
Main: finding suitable cache optimization techniques for GPGPUs (the programmer's side).
Sub: giving GPGPU architecture designers an idea of the application-level cache behavior of GPGPU caches.
GPU configurable cache architecture
Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)
Outline
1 Related works
2 Conceptual level design
3 Selected CPU Cache Optimization Techniques
4 Experimental Setup
5 Adaptation Process + Results and Discussion
6 Findings and Conclusions
7 Case Study
Introduction - Aho-corasick algorithm
Results and Discussion
Conclusion about the case study
8 Publications
9 Q & A
Related works
J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann/Elsevier, 2012.
Identifying the main cache optimization techniques in computer architecture.
M. Kowarschik and C. Weiß, "An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms," LNCS, vol. 2625, pp. 213-232, 2003.
Selecting the basic cache optimization techniques.
CUDA Toolkit Documentation.
Finding the available GPGPU optimization techniques and background for the adaptation process.
C. H. Lin, C. H. Liu, L. S. Chien, and S. C. Chang, "Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs," IEEE Trans. Comput., vol. 62, no. 10, pp. 1906-1916, Oct. 2013.
Identifying a case study for our research.
Challenges!!!
Lack of information about the GPGPU cache architecture
Complexity of the SIMD architecture
No direct prior research on GPGPU cache optimization techniques for end users
Conceptual level design
(Diagram, revealed step by step in the talk:)
Starting questions: CPU cache optimizations? GPGPU cache optimizations?
Selecting CPU cache optimization techniques → Analyzing → Adapting from CPU to GPU → Developing GPU cache optimizations → Analyzing using the GPU → Identifying GPU cache optimizations → Case study
Selected CPU Cache Optimization Techniques
Common end-user cache optimization techniques
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
GPU cache complexity - in the adaptation process
GPGPU cache
SIMD → warps, blocks, grids
Complex memory architecture →
Shared memory: 32 banks
L1 and L2 (configurable): 16KB or 48KB L1, or L1 disabled; 128-byte and 32-byte cache lines
Texture memory: 2D spatial locality
Experimental Setup

Table: Intel Core(TM) i5-3230M CPU, 2.6 GHz, Ivy Bridge micro-architecture, 8GB RAM

            Cache size   Cache line size   Associativity   Description
L1 cache    32KB         64 bytes          8-way
L2 cache    256KB        64 bytes          8-way
L3 cache    3072KB       64 bytes          12-way          Shared memory

Table: Tesla C2075 GPGPU cache architecture, 6GB global memory

               Cache size   Cache line size      Associativity   Description
L1 cache       48KB/16KB    128 bytes            Not mentioned   Can be disabled with the -Xptxas -dlcm=cg compile flag
Shared memory  16KB/48KB    128 bytes            Not mentioned   Used manually by the programmer
L2 cache       768KB        128 bytes/32 bytes   Not mentioned   Unified cache
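These configurations are selected from host code. A minimal sketch using the CUDA runtime API (the kernel name myKernel is a placeholder):

// Prefer 48KB L1 / 16KB shared memory for one kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
// Prefer 16KB L1 / 48KB shared memory (the Fermi default)
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

Global loads bypass L1 entirely when the code is compiled with nvcc -Xptxas -dlcm=cg, which is how the "L1 disabled" results below were obtained.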
Adaptation Process + Results and Discussion
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
Stride-one memory access
Figure: Non-stride access vs stride access of GPGPU
Adaptation - From CPU to GPGPU
(Diagram: CPU loops; the changing parameter is the loop index; L1, L2 and L3 all use 64-byte cache lines)
Figure: Adaptation process
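A minimal sketch of the CPU micro-benchmark this diagram corresponds to (illustrative; data, n and stride are assumed to be set up elsewhere):

float sum = 0.0f;
for (int i = 0; i < n; i += stride)  // stride = 1 walks consecutive elements;
    sum += data[i];                  // larger strides waste most of each 64-byte line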
Adaptation - From CPU to GPGPU
(Diagram: GPGPU kernel; the changing parameter is blockDim * blockID + threadID; the L1 cache line is 128 bytes and L2 transactions are 128 or 32 bytes)
Figure: Adaptation process
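A minimal sketch of the GPGPU counterpart (illustrative names; one element per thread):

__global__ void strideAccess(const float *in, float *out, int stride, int n) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;  // stride = 1: a warp touches 32
    if (i < n)                                                 // consecutive floats, one 128-byte line
        out[i] = in[i] * 2.0f;
}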
Results: Effect of stride-one access on the CPU
(Plot: time [ms] vs. stride amount, two test runs)
Figure: Effect of stride amount on the CPU, input size = 2867200 (710.9375MB)
Execution time increases steadily with the stride amount.
Results: Effects of stride-one access on the GPGPU
(Plot: time [ms] vs. stride amount, two test runs)
Figure: Stride-access effect on the Fermi GPGPU, input = 2867200, L1 = 16KB (default settings)
Execution time increases with the stride amount.
As on the CPU, the best performance occurs at a stride of 1.
The effect of the stride amount is comparatively small once the stride spans a full cache line.
Results: Effects of stride-one access on the GPGPU
(Plot: time [ms] vs. stride amount, with L1 disabled, 48KB L1, and 16KB L1)
Figure: Stride-access effect on the Fermi GPGPU, input = 2867200, under different L1 configurations
With L1 disabled, large strides perform better, because the L2 cache provides a larger number of cache lines when L1 is bypassed.
A large L1 cache performs better than a small one, because the larger cache holds more cache lines.
One by one (next: Blocking)
Blocking technique
(Diagram: matrices A, B and C showing the two tiling patterns)
Figure: Two different blocking techniques from two different sources. The first technique uses small blocks from the first matrix and large blocks from the second matrix; the second method uses equal-size blocks from both matrices.
Adaptation
Figure: Adaptation process
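A minimal sketch of the shared-memory tiling this adaptation targets (illustrative; assumes n is a multiple of the 16x16 tile size):

#define TILE 16
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float tA[TILE][TILE];                 // tiles staged in the manually
    __shared__ float tB[TILE][TILE];                 // managed cache (shared memory)
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        tA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        tB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                             // tile fully loaded before use
        for (int k = 0; k < TILE; k++)
            sum += tA[threadIdx.y][k] * tB[k][threadIdx.x];
        __syncthreads();                             // done with this tile
    }
    C[row * n + col] = sum;
}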
Results: Effects of blocking technique on the CPU
(Plot: time [s] vs. matrix size, comparing the default method without tiling, the method from the Computer Architecture: A Quantitative Approach book, and a method equivalent to the GPGPU tiling method)
Figure: Effect of tiling on the CPU
The method equivalent to the GPGPU tiling method shows the best performance on the CPU as well.
Results: Effects of the blocking technique on the GPGPU
(Plot: time [ms] vs. matrix size, default vs. blocked, with L1 disabled, 16KB L1, 48KB L1, and blocking via shared memory)
Figure: Non-blocking vs. blocking with various cache configurations on the GPGPU
The blocking technique outperforms the non-blocked version under every cache configuration.
Blocking with shared memory shows the best performance among all the GPGPU cache options.
One by one (next: Loop fusion)
Loop fusion
For a fair comparison, the number of branching conditions was matched between the fused and non-fused loops.
The loops share common variables (common data accesses).
On the GPGPU, the corresponding loops are kernels: kernel fusion is the GPGPU counterpart of CPU loop fusion.
Example
for (int i = 0; i < n*n; i++) {
    h_array_c[i] = h_array_a[i] * h_array_b[i];
}
for (int i = 0; i < n*n; i++) {
    h_array_d[i] = h_array_c[i] * h_array_a[i];
}
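After fusion, the two loops above become one, so h_array_c[i] and h_array_a[i] are reused while still cached:

for (int i = 0; i < n*n; i++) {
    h_array_c[i] = h_array_a[i] * h_array_b[i];
    h_array_d[i] = h_array_c[i] * h_array_a[i];
}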
Adaptation
(Diagram: the common data elements shared by the two loops; loop unrolling is used to match the iteration counts)
Figure: Adaptation process
Adaptation
(Diagram: Kernel 1 and Kernel 2 are merged into a single kernel)
Figure: Adaptation process
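A minimal sketch of the corresponding kernel fusion (an illustrative device version of the two loops above, not the exact benchmark code):

__global__ void fusedKernel(const float *a, const float *b, float *c, float *d, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n * n) {
        float av = a[i];       // the common data element is loaded once
        float cv = av * b[i];  // Kernel 1's work
        c[i] = cv;
        d[i] = cv * av;        // Kernel 2's work, fused in while cv and av are in registers
    }
}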
Results: Effect of loop fusion on the CPU
(Plot: time [ms] vs. input size, for no fusion, no fusion with loop unrolling, and loop fusion)
Figure: Effect of loop fusion on the CPU with two common data elements
Loop fusion improves performance on the CPU.
The unrolled-but-unfused variant shows that this improvement is not merely an effect of executing fewer iterations.
Results: Effect of kernel fusion on the GPGPU
(Plot: time [ms] vs. input size, without kernel fusion at default settings, and with kernel fusion for L1 = 16KB, 48KB, and disabled)
Figure: Effect of kernel fusion on the GPGPU, with common data accesses
Kernel fusion improves performance when the fused kernels share common data accesses.
One by one (next: Array padding)
Array padding
(Diagram: arrays a, b, c and d whose elements all map to the same cache lines L[0], L[1], L[2])
Figure: Cache thrashing
Adaptation
(Diagram: cache thrashing is generated with nested loops; adding one additional padding element shifts arrays a, b, c and d so they no longer map to the same cache lines)
Figure: Adaptation to GPGPU
Adaptation
(Diagram: L1 cache thrashing, observed from one selected warp of 32 threads)
Figure: Adaptation to GPGPU
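A minimal sketch of the padding idea applied to shared memory (the classic extra-column pad; illustrative, launched with 32x32 threads per block):

__global__ void transposeTile(const float *in, float *out) {
    __shared__ float tile[32][33];   // 33, not 32: the pad shifts each row by one bank,
                                     // so a column access no longer hits one bank 32 times
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];     // row-wise write: conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];    // column-wise read: conflict-free only thanks to the pad
}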
Results: Effect of array padding on the CPU
(Plot: time [ms] vs. input size, with cache thrashing vs. without cache thrashing, i.e. with array padding)
Figure: Effect of array padding on cache thrashing on the CPU
Array padding gives a slight performance improvement on the CPU.
Results: Effect of array padding on the GPGPU
(Plot: time [ms] vs. stride amount)
Figure: Effect of shared memory bank conflicts on the GPGPU
Shared memory bank conflicts have a considerable effect on application performance.
Results: Effect of array padding on the GPGPU
(Plot: time [ms] vs. input size, for 8-way, 16-way and 32-way bank conflicts, each with and without padding)
Figure: Effect of padding on shared memory bank conflicts on the GPGPU
Array padding is an effective remedy for shared memory bank conflicts when the degree of conflict is high (e.g. 32-way).
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, without padding, for L1 = 16KB and 48KB)
Figure: L1 cache thrashing points with 16KB and 48KB L1
L1 cache accesses produce cache thrashing points at particular stride amounts.
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, without padding and with one or two padding elements, L1 = 48KB)
Figure: Effect of padding on the thrashing points of the L1 cache
Array padding shifts the cache thrashing points rather than removing them.
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, with padding, for L1 = 48KB vs. L1 disabled)
Figure: Cache thrashing with L1 and L2 vs. L2 only
The cache thrashing points that matter for performance are caused by the L1 cache, not by L2.
One by one (next: Array merging)
Array merging
(Diagram: a[1..n] + b[1..n] = one interleaved array a[1] b[1] a[2] b[2] ... b[n])
Figure: Basic idea behind the array merging technique.
Adaptation
(Diagram: two different arrays merged into one interleaved array, and where the merged array is located in memory)
Figure: Adaptation process
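A minimal sketch of the merged layout in code (illustrative names):

// Before: two separate arrays; a[i] and b[i] live in different cache lines
float *a, *b;
// After: one merged (interleaved) array; a[i] and b[i] share a cache line
struct Pair { float a; float b; };
Pair *merged;
// An access touching both fields of element i now causes one line fill, not two:
// out[i] = merged[i].a + merged[i].b;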
Results: Effect of array merging on the CPU
(Plot: time [ms] vs. input size, with vs. without array merging)
Figure: Effect of array merging on the CPU
Array merging improves the performance of non-stride accesses on the CPU.
Results: Effect of array merging on the GPGPU
(Plot: time [ms] vs. input size, with vs. without array merging, for L1 disabled, 16KB and 48KB)
Figure: Effect of array merging on the GPGPU
Array merging can improve performance on the GPGPU as well.
It needs a larger cache line size to show the improvement.
One by one (next: Array transpose)
Array transpose
Figure: The first diagram shows basic matrix multiplication; the second illustrates how the transposed matrix is used for matrix multiplication.
Adaptation
(Diagram: memory access pattern before and after transposing; the transposed array yields cache-friendly memory locations)
Figure: Adaptation process
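A minimal sketch of the transposed inner loop (C-style; bt holds B transposed, so bt[j*n + k] equals b[k*n + j]):

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)          // both operands are now walked row-wise,
            sum += a[i*n + k] * bt[j*n + k]; // i.e. stride-one instead of stride-n for B
        c[i*n + j] = sum;
    }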
Results: Effect of array transpose on the CPU
(Plot: time [s] vs. input size, for the basic method, the method with transpose excluding transpose overhead, and the transpose overhead itself)
Figure: Effect of array transpose on matrix multiplication on the CPU
Array transposition improves matrix multiplication performance on the CPU.
Results: Effect of array transpose on the GPGPU
(Plot: time [ms] vs. input size, basic matrix multiplication vs. matrix multiplication with array transpose)
Figure: Effect of array transpose on matrix multiplication on the GPGPU
Array transposition is not a good option on GPGPUs: it increases the number of memory accesses compared with the original access pattern.
Findings and Conclusions about GPGPU cache optimizations
1 Stride-one access is the best case for performance; however, large non-stride accesses perform better when the L1 cache is disabled.
2 Manually managed cache (shared memory) is the best option for gaining performance with the blocking technique.
3 Kernel fusion improves performance when the fused kernels have common data accesses.
4 Array padding has a positive effect for high-degree shared memory bank conflicts; L1 cache thrashing points can be shifted by array padding.
5 Array merging is a good option for improving overall memory-access performance on CPUs as well as GPGPUs.
6 Transposing 2D arrays is not a good option on GPGPUs for large data sets.
Case Study
Aho-Corasick algorithm - What is this?
The Aho-Corasick algorithm is a multiple-pattern searching algorithm.
Where can we see the Aho-Corasick algorithm?
(Diagram: an Aho-Corasick state machine consuming an input stream such as ABGABBEDG...GGABEDG, plus many instances of a DNA automaton over the alphabet A, C, G, T, illustrating application domains such as DNA sequence matching)
Figure: Applications of the Aho-Corasick algorithm
The parallel GPGPU version of the Aho-Corasick algorithm is the Parallel Failureless Aho-Corasick (PFAC) algorithm [Lin et al., IEEE Trans. Comput., 2013].
How did we test our findings?
Implemented our own PFAC for DNA sequence matching (starting from the available GPGPU Aho-Corasick implementation).
Analyzed the developed source code to find suitable locations for applying the GPGPU optimization techniques.
Analyzing...
1 Stride-one memory access → Not possible
2 Blocking → Compatible with the input text file → Loading the input text via shared memory (see the sketch after this list)
3 Kernel fusion → Only one kernel
4 Array padding → No cache thrashing points
5 Array merging → Compatible with the input pattern file → Two arrays of the input pattern file were merged via texture memory
6 Array transpose
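A minimal sketch of a failureless-matching kernel with the blocking idea applied (one thread per starting position; the shared-memory staging, the state encoding and all names are our illustrative assumptions, not the exact PFAC source):

__device__ int dnaIndex(char c) {                  // map A, C, G, T to 0..3
    return (c == 'A') ? 0 : (c == 'C') ? 1 : (c == 'G') ? 2 : 3;
}

__global__ void pfacKernel(const int *transitions, const char *text, int n,
                           int maxPatLen, int numFinal, int *match) {
    extern __shared__ char tile[];                 // blockDim.x + maxPatLen bytes, set at launch
    int base = blockIdx.x * blockDim.x;
    // Blocking: stage this block's window of the text (plus overlap) into shared memory
    for (int k = threadIdx.x; k < blockDim.x + maxPatLen; k += blockDim.x)
        if (base + k < n) tile[k] = text[base + k];
    __syncthreads();
    int gid = base + threadIdx.x;
    if (gid >= n) return;
    int state = 0;                                 // failureless: each thread matches from its own position only
    for (int k = threadIdx.x; k < blockDim.x + maxPatLen && base + k < n; k++) {
        state = transitions[state * 4 + dnaIndex(tile[k])];
        if (state < 0) break;                      // no transition: this thread is done
        if (state > 0 && state <= numFinal)        // assumed PFAC-style encoding: final states get the smallest IDs
            match[gid] = state;                    // record the pattern matched at position gid
    }
}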
Results: Comparison between the original PFAC and our PFAC (without cache optimization techniques)
(Plot: time [s] for pattern sets 1-5, available PFAC vs. our PFAC implementation without any optimizations)
Figure: Performance comparison between the original PFAC and our PFAC implementation
The performance gain from the application-specific adaptation is around 1.27X.
Results: Comparison between PFAC implementations (without and with cache optimizations)
(Plot: time [s] for pattern sets 1-5, our PFAC without any optimizations vs. with all optimizations)
Figure: Performance comparison of our PFAC, without vs. with cache optimizations
The performance gain from the application-specific solution to the cache-optimized solution is around 2X.
Conclusion about the case study
The application-specific techniques improved the performance of our PFAC implementation.
The cache memory optimization techniques improved the performance further.
Compared with the best available GPGPU solution (the original PFAC), our worst case (without any optimizations) shows a 1.27X average improvement, and our best case (with all optimizations) is 2.40X faster.
Publications (up to now)
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: Graphics processing units (GPUs) for pattern matching algorithms. In 7th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), pages 1-4, Dec 2014.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. An optimized parallel failure-less Aho-Corasick algorithm for DNA sequence matching. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: CPU's cache optimization techniques for GPGPUs. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
V. Thambawita, N. C. Ellepola, R. Ragel, and D. Elkaduwe. GPGPU: To use or not to use? In Peradeniya University Research Sessions (PURSE), 2013.
Q & A
