Cache Optimization Techniques for General
Purpose Graphics Processing Units
D.R.V.L.B. Thambawita
Supervised By
Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering
Faculty of Engineering
University of Peradeniya
What is this GPU? Is it important?
(Diagram: example application domains such as AI and fluid dynamics)
Figure: CPU vs GPGPU
Why this research?
How do we optimize CPU cache access (at the programming stage)?
Use the available CPU cache optimization techniques (too many references to list...).
How do we optimize GPGPU cache access (at the programming stage)?
Do we have the resources???
Our contribution
Main: finding suitable cache optimization techniques for GPGPUs (the programmer's side).
Sub: giving GPGPU architecture designers an idea of the application-level cache behavior of GPGPU caches.
GPU configurable cache architecture
Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)
Outline
1 Related works
2 Conceptual level design
3 Selected CPU Cache Optimization Techniques
4 Experimental Setup
5 Adaptation Process + Results and Discussion
6 Findings and Conclusions
7 Case Study
Introduction - Aho-corasick algorithm
Results and Discussion
Conclusion about the case study
8 Publications
9 Q & A
Related works
J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann/Elsevier, 2012.
Identifying the main cache optimization techniques in computer architecture.
M. Kowarschik and C. Weiß, "An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms," LNCS, vol. 2625, pp. 213-232, 2003.
Selecting the basic cache optimization techniques.
CUDA Toolkit Documentation.
Finding the available GPGPU optimization techniques and background for the adaptation process.
C. H. Lin, C. H. Liu, L. S. Chien, and S. C. Chang, "Accelerating Pattern Matching Using a Novel Parallel Algorithm on GPUs," IEEE Trans. Comput., vol. 62, no. 10, pp. 1906-1916, Oct. 2013.
Identifying a case study for our research.
Challenges!!!
Lack of information about the GPGPU cache architecture
Complexity of the SIMD architecture
No direct prior research on GPGPU cache optimization techniques for end users
Conceptual level design
(Diagram, revealed step by step in the talk:)
Starting questions: CPU cache optimizations? GPGPU cache optimizations?
Selecting CPU cache optimization techniques → Analyzing → Adapting from CPU to GPU → Developing GPU cache optimizations → Analyzing using the GPU → Identifying GPU cache optimizations → Case study
Selected CPU Cache Optimization Techniques
Common end-user cache optimization techniques
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
GPU cache complexity - in the adaptation process
GPGPU cache
SIMD → warps, blocks, grids
Complex memory architecture →
Shared memory: 32 banks
L1 and L2 (configurable): 16KB or 48KB L1, or L1 disabled; 128-byte and 32-byte cache lines
Texture memory: 2D spatial locality
Experimental Setup

Table: Intel Core(TM) i5-3230M CPU, 2.6 GHz, Ivy Bridge micro-architecture, 8GB RAM

            Cache size   Cache line size   Associativity   Description
L1 cache    32KB         64 bytes          8-way
L2 cache    256KB        64 bytes          8-way
L3 cache    3072KB       64 bytes          12-way          Shared memory

Table: Tesla C2075 GPGPU cache architecture, 6GB global memory

               Cache size   Cache line size      Associativity   Description
L1 cache       48KB/16KB    128 bytes            Not mentioned   Can be disabled with the -Xptxas -dlcm=cg compile flag
Shared memory  16KB/48KB    128 bytes            Not mentioned   Used manually by the programmer
L2 cache       768KB        128 bytes/32 bytes   Not mentioned   Unified cache
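These configurations are selected from host code. A minimal sketch using the CUDA runtime API (the kernel name myKernel is a placeholder):

// Prefer 48KB L1 / 16KB shared memory for one kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
// Prefer 16KB L1 / 48KB shared memory (the Fermi default)
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

Global loads bypass L1 entirely when the code is compiled with nvcc -Xptxas -dlcm=cg, which is how the "L1 disabled" results below were obtained.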
Adaptation Process + Results and Discussion
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
Stride-one memory access
Figure: Non-stride access vs stride access of GPGPU
Adaptation - From CPU to GPGPU
(Diagram: CPU loops; the changing parameter is the loop index; L1, L2 and L3 all use 64-byte cache lines)
Figure: Adaptation process
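A minimal sketch of the CPU micro-benchmark this diagram corresponds to (illustrative; data, n and stride are assumed to be set up elsewhere):

float sum = 0.0f;
for (int i = 0; i < n; i += stride)  // stride = 1 walks consecutive elements;
    sum += data[i];                  // larger strides waste most of each 64-byte line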
Adaptation - From CPU to GPGPU
(Diagram: GPGPU kernel; the changing parameter is blockDim * blockID + threadID; the L1 cache line is 128 bytes and L2 transactions are 128 or 32 bytes)
Figure: Adaptation process
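A minimal sketch of the GPGPU counterpart (illustrative names; one element per thread):

__global__ void strideAccess(const float *in, float *out, int stride, int n) {
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;  // stride = 1: a warp touches 32
    if (i < n)                                                 // consecutive floats, one 128-byte line
        out[i] = in[i] * 2.0f;
}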
Results: Effect of stride-one access on the CPU
(Plot: time [ms] vs. stride amount, two test runs)
Figure: Effect of stride amount on the CPU, input size = 2867200 (710.9375MB)
Execution time increases steadily with the stride amount.
Results: Effects of stride-one access on the GPGPU
(Plot: time [ms] vs. stride amount, two test runs)
Figure: Stride-access effect on the Fermi GPGPU, input = 2867200, L1 = 16KB (default settings)
Execution time increases with the stride amount.
As on the CPU, the best performance occurs at a stride of 1.
The effect of the stride amount is comparatively small once the stride spans a full cache line.
Results: Effects of stride-one access on the GPGPU
(Plot: time [ms] vs. stride amount, with L1 disabled, 48KB L1, and 16KB L1)
Figure: Stride-access effect on the Fermi GPGPU, input = 2867200, under different L1 configurations
With L1 disabled, large strides perform better, because the L2 cache provides a larger number of cache lines when L1 is bypassed.
A large L1 cache performs better than a small one, because the larger cache holds more cache lines.
One by one (next: Blocking)
Blocking technique
(Diagram: matrices A, B and C showing the two tiling patterns)
Figure: Two different blocking techniques from two different sources. The first technique uses small blocks from the first matrix and large blocks from the second matrix; the second method uses equal-size blocks from both matrices.
Adaptation
Figure: Adaptation process
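A minimal sketch of the shared-memory tiling this adaptation targets (illustrative; assumes n is a multiple of the 16x16 tile size):

#define TILE 16
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float tA[TILE][TILE];                 // tiles staged in the manually
    __shared__ float tB[TILE][TILE];                 // managed cache (shared memory)
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        tA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        tB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                             // tile fully loaded before use
        for (int k = 0; k < TILE; k++)
            sum += tA[threadIdx.y][k] * tB[k][threadIdx.x];
        __syncthreads();                             // done with this tile
    }
    C[row * n + col] = sum;
}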
Results: Effects of blocking technique on the CPU
(Plot: time [s] vs. matrix size, comparing the default method without tiling, the method from the Computer Architecture: A Quantitative Approach book, and a method equivalent to the GPGPU tiling method)
Figure: Effect of tiling on the CPU
The method equivalent to the GPGPU tiling method shows the best performance on the CPU as well.
Results: Effects of the blocking technique on the GPGPU
(Plot: time [ms] vs. matrix size, default vs. blocked, with L1 disabled, 16KB L1, 48KB L1, and blocking via shared memory)
Figure: Non-blocking vs. blocking with various cache configurations on the GPGPU
The blocking technique outperforms the non-blocked version under every cache configuration.
Blocking with shared memory shows the best performance among all the GPGPU cache options.
One by one (next: Loop fusion)
Loop fusion
For a fair comparison, the number of branching conditions was matched between the fused and non-fused loops.
The loops share common variables (common data accesses).
On the GPGPU, the corresponding loops are kernels: kernel fusion is the GPGPU counterpart of CPU loop fusion.
Example
for (int i = 0; i < n*n; i++) {
    h_array_c[i] = h_array_a[i] * h_array_b[i];
}
for (int i = 0; i < n*n; i++) {
    h_array_d[i] = h_array_c[i] * h_array_a[i];
}
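After fusion, the two loops above become one, so h_array_c[i] and h_array_a[i] are reused while still cached:

for (int i = 0; i < n*n; i++) {
    h_array_c[i] = h_array_a[i] * h_array_b[i];
    h_array_d[i] = h_array_c[i] * h_array_a[i];
}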
Adaptation
(Diagram: the common data elements shared by the two loops; loop unrolling is used to match the iteration counts)
Figure: Adaptation process
Adaptation
(Diagram: Kernel 1 and Kernel 2 are merged into a single kernel)
Figure: Adaptation process
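A minimal sketch of the corresponding kernel fusion (an illustrative device version of the two loops above, not the exact benchmark code):

__global__ void fusedKernel(const float *a, const float *b, float *c, float *d, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n * n) {
        float av = a[i];       // the common data element is loaded once
        float cv = av * b[i];  // Kernel 1's work
        c[i] = cv;
        d[i] = cv * av;        // Kernel 2's work, fused in while cv and av are in registers
    }
}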
Results: Effect of loop fusion on the CPU
(Plot: time [ms] vs. input size, for no fusion, no fusion with loop unrolling, and loop fusion)
Figure: Effect of loop fusion on the CPU with two common data elements
Loop fusion improves performance on the CPU.
The unrolled-but-unfused variant shows that this improvement is not merely an effect of executing fewer iterations.
Results: Effect of kernel fusion on the GPGPU
(Plot: time [ms] vs. input size, without kernel fusion at default settings, and with kernel fusion for L1 = 16KB, 48KB, and disabled)
Figure: Effect of kernel fusion on the GPGPU, with common data accesses
Kernel fusion improves performance when the fused kernels share common data accesses.
One by one (next: Array padding)
Array padding
(Diagram: arrays a, b, c and d whose elements all map to the same cache lines L[0], L[1], L[2])
Figure: Cache thrashing
Adaptation
(Diagram: cache thrashing is generated with nested loops; adding one additional padding element shifts arrays a, b, c and d so they no longer map to the same cache lines)
Figure: Adaptation to GPGPU
Adaptation
(Diagram: L1 cache thrashing, observed from one selected warp of 32 threads)
Figure: Adaptation to GPGPU
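A minimal sketch of the padding idea applied to shared memory (the classic extra-column pad; illustrative, launched with 32x32 threads per block):

__global__ void transposeTile(const float *in, float *out) {
    __shared__ float tile[32][33];   // 33, not 32: the pad shifts each row by one bank,
                                     // so a column access no longer hits one bank 32 times
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];     // row-wise write: conflict-free
    __syncthreads();
    out[y * 32 + x] = tile[x][y];    // column-wise read: conflict-free only thanks to the pad
}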
Results: Effect of array padding on the CPU
(Plot: time [ms] vs. input size, with cache thrashing vs. without cache thrashing, i.e. with array padding)
Figure: Effect of array padding on cache thrashing on the CPU
Array padding gives a slight performance improvement on the CPU.
Results: Effect of array padding on the GPGPU
(Plot: time [ms] vs. stride amount)
Figure: Effect of shared memory bank conflicts on the GPGPU
Shared memory bank conflicts have a considerable effect on application performance.
Results: Effect of array padding on the GPGPU
(Plot: time [ms] vs. input size, for 8-way, 16-way and 32-way bank conflicts, each with and without padding)
Figure: Effect of padding on shared memory bank conflicts on the GPGPU
Array padding is an effective remedy for shared memory bank conflicts when the degree of conflict is high (e.g. 32-way).
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, without padding, for L1 = 16KB and 48KB)
Figure: L1 cache thrashing points with 16KB and 48KB L1
L1 cache accesses produce cache thrashing points at particular stride amounts.
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, without padding and with one or two padding elements, L1 = 48KB)
Figure: Effect of padding on the thrashing points of the L1 cache
Array padding shifts the cache thrashing points rather than removing them.
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, with padding, for L1 = 48KB vs. L1 disabled)
Figure: Cache thrashing with L1 and L2 vs. L2 only
The cache thrashing points that matter for performance are caused by the L1 cache, not by L2.
One by one (next: Array merging)
Array merging
(Diagram: a[1..n] + b[1..n] = one interleaved array a[1] b[1] a[2] b[2] ... b[n])
Figure: Basic idea behind the array merging technique.
Adaptation
(Diagram: two different arrays merged into one interleaved array, and where the merged array is located in memory)
Figure: Adaptation process
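A minimal sketch of the merged layout in code (illustrative names):

// Before: two separate arrays; a[i] and b[i] live in different cache lines
float *a, *b;
// After: one merged (interleaved) array; a[i] and b[i] share a cache line
struct Pair { float a; float b; };
Pair *merged;
// An access touching both fields of element i now causes one line fill, not two:
// out[i] = merged[i].a + merged[i].b;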
Results: Effect of array merging on the CPU
(Plot: time [ms] vs. input size, with vs. without array merging)
Figure: Effect of array merging on the CPU
Array merging improves the performance of non-stride accesses on the CPU.
Results: Effect of array merging on the GPGPU
(Plot: time [ms] vs. input size, with vs. without array merging, for L1 disabled, 16KB and 48KB)
Figure: Effect of array merging on the GPGPU
Array merging can improve performance on the GPGPU as well.
It needs a larger cache line size to show the improvement.
One by one (next: Array transpose)
Array transpose
Figure: The first diagram shows basic matrix multiplication; the second illustrates how the transposed matrix is used for matrix multiplication.
Adaptation
(Diagram: memory access pattern before and after transposing; the transposed array yields cache-friendly memory locations)
Figure: Adaptation process
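A minimal sketch of the transposed inner loop (C-style; bt holds B transposed, so bt[j*n + k] equals b[k*n + j]):

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)          // both operands are now walked row-wise,
            sum += a[i*n + k] * bt[j*n + k]; // i.e. stride-one instead of stride-n for B
        c[i*n + j] = sum;
    }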
Results: Effect of array transpose on the CPU
(Plot: time [s] vs. input size, for the basic method, the method with transpose excluding transpose overhead, and the transpose overhead itself)
Figure: Effect of array transpose on matrix multiplication on the CPU
Array transposition improves matrix multiplication performance on the CPU.
Results: Effect of array transpose on the GPGPU
(Plot: time [ms] vs. input size, basic matrix multiplication vs. matrix multiplication with array transpose)
Figure: Effect of array transpose on matrix multiplication on the GPGPU
Array transposition is not a good option on GPGPUs: it increases the number of memory accesses compared with the original access pattern.
Findings and Conclusions about GPGPU cache optimizations
1 Stride-one access is the best case for performance; however, large non-stride accesses perform better when the L1 cache is disabled.
2 Manually managed cache (shared memory) is the best option for gaining performance with the blocking technique.
3 Kernel fusion improves performance when the fused kernels have common data accesses.
4 Array padding has a positive effect for high-degree shared memory bank conflicts; L1 cache thrashing points can be shifted by array padding.
5 Array merging is a good option for improving overall memory-access performance on CPUs as well as GPGPUs.
6 Transposing 2D arrays is not a good option on GPGPUs for large data sets.
Case Study
Aho-Corasick algorithm - What is this?
The Aho-Corasick algorithm is a multiple-pattern searching algorithm.
Where can we see the Aho-Corasick algorithm?
(Diagram: an Aho-Corasick state machine consuming an input stream such as ABGABBEDG...GGABEDG, plus many instances of a DNA automaton over the alphabet A, C, G, T, illustrating application domains such as DNA sequence matching)
Figure: Applications of the Aho-Corasick algorithm
The parallel GPGPU version of the Aho-Corasick algorithm is the Parallel Failureless Aho-Corasick (PFAC) algorithm [Lin et al., IEEE Trans. Comput., 2013].
How did we test our findings?
Implemented our own PFAC for DNA sequence matching (starting from the available GPGPU Aho-Corasick implementation).
Analyzed the developed source code to find suitable locations for applying the GPGPU optimization techniques.
Analyzing...
1 Stride-one memory access → Not possible
2 Blocking → Compatible with the input text file → Loading the input text via shared memory (see the sketch after this list)
3 Kernel fusion → Only one kernel
4 Array padding → No cache thrashing points
5 Array merging → Compatible with the input pattern file → Two arrays of the input pattern file were merged via texture memory
6 Array transpose
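A minimal sketch of a failureless-matching kernel with the blocking idea applied (one thread per starting position; the shared-memory staging, the state encoding and all names are our illustrative assumptions, not the exact PFAC source):

__device__ int dnaIndex(char c) {                  // map A, C, G, T to 0..3
    return (c == 'A') ? 0 : (c == 'C') ? 1 : (c == 'G') ? 2 : 3;
}

__global__ void pfacKernel(const int *transitions, const char *text, int n,
                           int maxPatLen, int numFinal, int *match) {
    extern __shared__ char tile[];                 // blockDim.x + maxPatLen bytes, set at launch
    int base = blockIdx.x * blockDim.x;
    // Blocking: stage this block's window of the text (plus overlap) into shared memory
    for (int k = threadIdx.x; k < blockDim.x + maxPatLen; k += blockDim.x)
        if (base + k < n) tile[k] = text[base + k];
    __syncthreads();
    int gid = base + threadIdx.x;
    if (gid >= n) return;
    int state = 0;                                 // failureless: each thread matches from its own position only
    for (int k = threadIdx.x; k < blockDim.x + maxPatLen && base + k < n; k++) {
        state = transitions[state * 4 + dnaIndex(tile[k])];
        if (state < 0) break;                      // no transition: this thread is done
        if (state > 0 && state <= numFinal)        // assumed PFAC-style encoding: final states get the smallest IDs
            match[gid] = state;                    // record the pattern matched at position gid
    }
}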
Results: Comparison between the original PFAC and our PFAC (without cache optimization techniques)
(Plot: time [s] for pattern sets 1-5, available PFAC vs. our PFAC implementation without any optimizations)
Figure: Performance comparison between the original PFAC and our PFAC implementation
The performance gain from the application-specific adaptation is around 1.27X.
Results: Comparison between PFAC implementations (without and with cache optimizations)
(Plot: time [s] for pattern sets 1-5, our PFAC without any optimizations vs. with all optimizations)
Figure: Performance comparison of our PFAC, without vs. with cache optimizations
The performance gain from the application-specific solution to the cache-optimized solution is around 2X.
Conclusion about the case study
The application-specific techniques improved the performance of our PFAC implementation.
The cache memory optimization techniques improved the performance further.
Compared with the best available GPGPU solution (the original PFAC), our worst case (without any optimizations) shows a 1.27X average improvement, and our best case (with all optimizations) is 2.40X faster.
Publications (up to now)
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: Graphics processing units (GPUs) for pattern matching algorithms. In 7th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), pages 1-4, Dec 2014.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. An optimized parallel failure-less Aho-Corasick algorithm for DNA sequence matching. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: CPU's cache optimization techniques for GPGPUs. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
V. Thambawita, N. C. Ellepola, R. Ragel, and D. Elkaduwe. GPGPU: To use or not to use? In Peradeniya University Research Sessions (PURSE), 2013.
Q & A
