Cache Optimization Techniques for General
Purpose Graphic Processing Units
D.R.V.L.B. Thambawita
Supervised By
Dr. Roshan G. Ragel and Dr. Dhammika Elkaduwe
Department of Computer Engineering
Faculty of Engineering
University of Peradeniya
What is this GPU? Is it important?
(Application examples: AI, fluid dynamics)
Figure: CPU vs GPGPU
2 / 61
Why this research?
How to optimize CPU cache access? (at the programming stage)
Use the available CPU optimization techniques (too many references...)
How to optimize GPGPU cache access? (at the programming stage)
Do we have resources???
Our Contribution....
Main: finding suitable cache optimization techniques for GPGPUs (programmer side)
Sub: giving GPGPU architecture designers an idea about the application-level cache behavior of GPGPU caches
3 / 61
GPU configurable cache architecture
Figure: Cache memory hierarchy of CPUs and GPGPUs (Fermi architecture)
4 / 61
Outline
1 Related works
2 Conceptual level design
3 Selected CPU Cache Optimization Techniques
4 Experimental Setup
5 Adaptation Process + Results and Discussion
6 Findings and Conclusions
7 Case Study
Introduction - Aho-Corasick algorithm
Results and Discussion
Conclusion about the case study
8 Publications
9 Q? and A
5 / 61
Related works
Related works
J. L. Hennessy and D. A. Patterson, “Computer architecture: a quantitative approach”.
Morgan Kaufmann/Elsevier, 2012.
Identifying main cache optimization techniques in computer architecture.
M. Kowarschik and C. Weiß, “An Overview of Cache Optimization Techniques and
Cache-Aware Numerical Algorithms,” LNCS, vol. 2625, pp. 213-232, 2003.
Selecting basic cache optimization techniques.
CUDA Toolkit Documentation
Finding available GPGPU optimization techniques and getting knowledge for adaptation
process.
C.H. Lin, C.H. Liu, L.S. Chien, and S.C. Chang, “Accelerating Pattern Matching Using
a Novel Parallel Algorithm on GPUs,” IEEE Trans. Comput., vol. 62, no. 10, pp.
1906-1916, Oct. 2013.
Identifying a case study for our research.
6 / 61
Related works
Challenges!!!
Lack of information about GPGPU cache architecture
Complexity of SIMD architecture
No direct research on GPGPU cache optimization
techniques for end users
7 / 61
Conceptual level design
Conceptual level design
(Diagram: GPGPU cache optimizations? / CPU cache optimizations? → Selecting CPU cache optimization techniques → Analyzing → Adapting from CPU to GPU → Developing GPU cache optimizations → Analyzing using GPU → Identifying GPU cache optimizations → Case Study)
8 / 61
Selected CPU Cache Optimization Techniques
Common end-user cache optimization techniques
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
9 / 61
Selected CPU Cache Optimization Techniques
GPU cache complexity - in the adaptation process
GPGPU cache
SIMD: warps, blocks, grids
Complex memory architecture:
Shared memory - 32 banks
L1 and L2 (configurable: 16KB, 48KB, or L1 disabled) - 128-byte and 32-byte cache lines
Texture memory - 2D spatial locality
10 / 61
Experimental Setup
Experimental setups
Table: Intel Core(TM) i5-3230M CPU, 2.6 GHz, Ivy Bridge micro-architecture, 8GB RAM

             Cache size   Cache line size   Associativity   Description
  L1 cache   32KB         64 bytes          8-way
  L2 cache   256KB        64 bytes          8-way
  L3 cache   3072KB       64 bytes          12-way          Shared memory

Table: Tesla C2075 GPGPU cache architecture, 6GB global memory

                 Cache size   Cache line size      Associativity   Description
  L1 cache       48KB/16KB    128 bytes            Not mentioned   Can be disabled with the
                                                                   -Xptxas -dlcm=cg compile flag
  Shared memory  16KB/48KB    128 bytes            Not mentioned   Can be used manually
  L2 cache       768KB        128 bytes/32 bytes   Not mentioned   Unified cache

11 / 61
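Aside: a minimal sketch (ours, not from the slides) of how these cache configurations are selected in practice. cudaFuncSetCacheConfig() and the -Xptxas -dlcm=cg flag are the real CUDA mechanisms; the kernel name myKernel is a placeholder.

#include <cuda_runtime.h>

__global__ void myKernel() {}   // placeholder kernel

int main() {
    // Prefer a 48KB L1 / 16KB shared split for this kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // Or prefer a 16KB L1 / 48KB shared split:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // Bypassing L1 for global loads is a compile-time flag instead:
    //   nvcc -Xptxas -dlcm=cg app.cu
    return 0;
}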
Adaptation Process + Results and Discussion
One by one
Data access optimization
Stride-one access
Blocking
Loop fusion
Data layout optimization
Array padding
Array merging
Array transpose
12 / 61
Adaptation Process + Results and Discussion Stride-one access
Stride-one memory access
Figure: Non-stride access vs stride access of GPGPU
13 / 61
Adaptation Process + Results and Discussion Stride-one access
Adaptation - From CPU to GPGPU
(Diagram: CPU loops; changing parameter = loop index; 64-byte cache lines in L1, L2 and L3)
Figure: Adaptation process
14 / 61
Adaptation Process + Results and Discussion Stride-one access
Adaptation - From CPU to GPGPU
(Diagram: GPGPU kernel; changing parameter = blockDim * blockID + threadID; 128-byte L1 cache lines, 128/32-byte L2 cache lines)
Figure: Adaptation process
15 / 61
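A minimal sketch (ours, not the slides' benchmark code) of the two access patterns being compared. On the CPU the stride is applied to a loop index; on the GPGPU the per-thread index blockDim * blockID + threadID takes the loop index's role, so stride one means adjacent threads touch adjacent elements and a warp's loads coalesce into whole 128-byte L1 lines.

// CPU version: the changing parameter is the loop index.
void strideSweepCPU(float *a, long n, int stride) {
    for (long i = 0; i * stride < n; i++)
        a[i * stride] += 1.0f;   // stride 1 walks one cache line at a time
}

// GPGPU version: the changing parameter is blockDim * blockID + threadID.
__global__ void strideSweepGPU(float *a, long n, int stride) {
    long i = blockDim.x * blockIdx.x + threadIdx.x;
    long idx = i * stride;       // stride 1: a warp's 32 accesses fall into
                                 // one 128-byte L1 line (fully coalesced)
    if (idx < n) a[idx] += 1.0f;
}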
Adaptation Process + Results and Discussion Stride-one access
Results: Effect of stride-one access on the CPU
(Plot: Time [ms] vs. stride amount, 0-70; series: Input Size = 2867200, Test 1 and Test 2)
Figure: Effect of stride amount on CPU, Input size = 2867200 (710.9375MB)
Execution time increases continuously with the stride amount.
16 / 61
Adaptation Process + Results and Discussion Stride-one access
Results: Effects of stride-one access on the GPGPU
(Plot: Time [ms] vs. stride amount, 0-70; series: Input Size = 2867200, Test 1 and Test 2)
Figure: Stride access effect on Fermi GPGPU, input = 2867200, L1 = 16KB (default settings)
Execution time increases with the stride amount.
The best performance occurs at a stride amount of 1, as on the CPU.
The effect of the stride amount is comparatively small once the stride exceeds a cache line.
17 / 61
Adaptation Process + Results and Discussion Stride-one access
Results: Effects of stride-one access on the GPGPU
(Plot: Time [ms] vs. stride amount, 0-70; series: Input Size = 2867200 with L1 disabled, 48KB L1, and 16KB L1)
Figure: Stride access effect on Fermi GPGPU, input = 2867200, under different L1 configurations
The disabled-L1 configuration performs better for large stride amounts because the L2 cache holds more cache lines when L1 is disabled.
The large L1 cache performs better than the small one because the larger cache holds more cache lines.
18 / 61
Adaptation Process + Results and Discussion Stride-one access
One by one - next: the blocking technique
19 / 61
Adaptation Process + Results and Discussion Blocking technique
Blocking technique
(Diagram: blocked access over matrices A, B, and C)
Figure: Two different blocking techniques from two different sources. The first
technique uses small blocks from the first matrix and large blocks from the second
matrix; the second method uses equal-size blocks from both matrices.
20 / 61
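For reference, a minimal sketch (ours) of the CPU-side blocking idea: the matrices are processed one B x B tile at a time so each tile stays cache-resident while it is reused. C is assumed zero-initialized; the names are illustrative.

void blockedMatMul(const float *A, const float *Bm, float *C, int n, int B) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int kk = 0; kk < n; kk += B)
                // work inside one tile: these accesses hit the cache
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        float r = A[i * n + k];
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i * n + j] += r * Bm[k * n + j];
                    }
}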
Adaptation Process + Results and Discussion Blocking technique
Adaptation
21 / 61
Adaptation Process + Results and Discussion Blocking technique
Adaptation
Figure: Adaptation process
22 / 61
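The GPGPU counterpart, sketched under our assumptions (n divisible by TILE, one thread per output element): equal-size tiles of both matrices are staged in shared memory, the manually managed cache.

#define TILE 16   // tile edge; each block runs TILE x TILE threads

__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        // stage one tile of each matrix in shared memory (coalesced loads)
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                    // tiles fully loaded
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                    // done with these tiles
    }
    C[row * n + col] = sum;
}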
Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the CPU
(Plot: Time [s] vs. matrix size, 512X512 to 2048X2048)
Default method - without the tiling technique
Method from the Computer Architecture: A Quantitative Approach book
Method equivalent to the GPGPU tiling method
Figure: Effect of tiling on CPU
The method equivalent to the GPGPU tiling method also performs best on the CPU.
23 / 61
Adaptation Process + Results and Discussion Blocking technique
Results: Effects of blocking technique on the GPGPU
(Plot: Time [ms] vs. matrix size, 512X512 to 3072X3072)
Default - L1 disabled / Blocked - L1 disabled
Default - L1 (16KB) / Blocked - L1 (16KB)
Default - L1 (48KB) / Blocked - L1 (48KB)
Blocked - Shared memory
Figure: Non-blocking vs blocking with various cache configurations on GPGPU
The blocking technique performs better than the non-blocked version under every cache configuration.
Blocking with shared memory shows the best performance among all the GPGPU cache options.
24 / 61
Adaptation Process + Results and Discussion Blocking technique
One by one - next: loop fusion
25 / 61
Adaptation Process + Results and Discussion Loop fusion
Loop fusion
The number of branching conditions must match in the fused and
non-fused loops.
Common variables within the for loops have been used.
On the GPGPU, the counterparts of these loops are kernels.
Kernel fusion is the GPGPU technique corresponding to loop fusion
on CPUs.
Example
for (int i = 0; i < n * n; i++) {
    h_array_c[i] = h_array_a[i] * h_array_b[i];
}
for (int i = 0; i < n * n; i++) {
    h_array_d[i] = h_array_c[i] * h_array_a[i];
}
26 / 61
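For comparison, the fused form (our sketch): one traversal computes both results, so h_array_c[i] is still cache-resident when h_array_d[i] needs it.

for (int i = 0; i < n * n; i++) {
    h_array_c[i] = h_array_a[i] * h_array_b[i];
    h_array_d[i] = h_array_c[i] * h_array_a[i];   // reuses c[i] while cached
}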
Adaptation Process + Results and Discussion Loop fusion
Adaptation
(Diagram: common data elements shared by the two loops; loop unrolling used to match iteration counts)
Figure: Adaptation process
27 / 61
Adaptation Process + Results and Discussion Loop fusion
Adaptation
(Diagram: Kernel 1 + Kernel 2 → one fused kernel)
Figure: Adaptation process
28 / 61
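A minimal sketch (ours; the kernel bodies are illustrative) of the same step in CUDA terms: two kernels with a common data access become one kernel, and the intermediate value is reused from a register instead of being written to and re-read from global memory by a second launch.

__global__ void mul(float *c, const float *a, const float *b, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) c[i] = a[i] * b[i];
}
__global__ void mulAgain(float *d, const float *c, const float *a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d[i] = c[i] * a[i];
}

// Fused: one launch; ci never leaves the register file.
__global__ void mulFused(float *c, float *d, const float *a, const float *b, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float ci = a[i] * b[i];
        c[i] = ci;
        d[i] = ci * a[i];
    }
}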
Adaptation Process + Results and Discussion Loop fusion
Results:Effect of loop fusion on the CPU
(Plot: Time [ms] vs. input size, 1024X1024 to 4096X4096)
Without loop fusion / Without loop fusion - with loop unrolling / With loop fusion
Figure: Effect of loop fusion on CPU with two common data elements
The loop fusion technique shows performance improvements on the CPU.
The improvement is not an effect of the smaller number of iterations.
29 / 61
Adaptation Process + Results and Discussion Loop fusion
Results:Effect of loop fusion on the GPGPU
(Plot: Time [ms] vs. input size, 1024X1024 to 4096X4096)
Without kernel fusion - default settings / With kernel fusion - L1 (16KB) / With kernel fusion - L1 (48KB) / With kernel fusion - L1 (disabled)
Figure: Effect of kernel fusion on GPGPU - with common data accesses
The kernel fusion technique improves performance for kernels with common data accesses.
30 / 61
Adaptation Process + Results and Discussion Loop fusion
One by one - next: array padding
31 / 61
Adaptation Process + Results and Discussion Array padding
Array padding
(Diagram: same-index elements of arrays a, b, c and d mapping to the same cache lines L[0], L[1], L[2], ...)
Figure: Cache thrashing
32 / 61
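A minimal sketch (ours; sizes are illustrative) of the setup above: with power-of-two array sizes laid out contiguously, same-index elements of the arrays map to the same cache set, so sweeping them together causes conflict misses once the number of arrays exceeds the associativity; padding each array by one cache line shifts the mapping.

#define N (8 * 1024)   // power-of-two element count: a[i], b[i], c[i], d[i]
                       // land in the same cache set (assuming contiguous layout)
#define PAD 16         // one 64-byte cache line of floats

float a[N], b[N], c[N], d[N];                               // thrashing-prone
float ap[N + PAD], bp[N + PAD], cp[N + PAD], dp[N + PAD];   // padded layout

void sweep(float *w, const float *x, const float *y, const float *z, int n) {
    for (int i = 0; i < n; i++)
        w[i] = x[i] + y[i] + z[i];   // touches the same index of four arrays
}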
Adaptation Process + Results and Discussion Array padding
Adaptation
(Diagram: one additional padding element inserted between the arrays; cache thrashing generated using nested loops)
Figure: Adaptation to GPGPU
33 / 61
Adaptation Process + Results and Discussion Array padding
Adaptation
(Diagram: L1 cache thrashing observed on one selected warp of 32 threads)
Figure: Adaptation to GPGPU
34 / 61
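On the GPGPU, the same trick applies to shared memory: a sketch (ours) using the classic padded transpose tile. With 32 four-byte banks, the column read below hits one bank 32 ways unless each row is shifted by one element. Assumes n is a multiple of 32 and 32 x 32 thread blocks.

__global__ void transposeTile(const float *in, float *out, int n) {
    __shared__ float tile[32][32 + 1];  // +1 pad column; without it the read
                                        // below is a 32-way bank conflict
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];      // coalesced load
    __syncthreads();
    int tx = blockIdx.y * 32 + threadIdx.x;              // transposed block
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];   // conflict-free with pad
}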
Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the CPU
(Plot: Time [ms] vs. input size, 0 to 10240)
With cache thrashing / Without cache thrashing (with array padding)
Figure: Effect of array padding for cache thrashing on CPU
The array padding technique shows a slight performance improvement on the CPU.
35 / 61
Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
(Plot: Time [ms] vs. stride amount, 0-70)
Figure: Effect of shared memory bank conflicts on GPGPU
Shared memory bank conflicts have a considerable effect on application performance.
36 / 61
Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
(Plot: Time [ms] vs. input size, 256X256 to 1536X1536)
8-way bank conflict - without padding / with padding
16-way bank conflict - without padding / with padding
32-way bank conflict - without padding / with padding
Figure: Effect of padding for shared memory bank conflicts on GPGPU
Array padding resolves shared memory bank conflicts, particularly when the conflict is 32-way (the highest degree of conflict).
37 / 61
Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, 0-500)
Without padding - L1 (16KB) / Without padding - L1 (48KB)
Figure: L1 cache thrashing points while the L1 size is 16KB and 48KB
L1 cache accesses produce cache thrashing points at particular stride amounts.
38 / 61
Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, 0-500)
Without padding - L1 (48KB) / With one padding element - L1 (48KB) / With two padding elements - L1 (48KB)
Figure: Effect of padding on the thrashing points of the L1 cache
Array padding shifts the cache thrashing points rather than removing them.
39 / 61
Adaptation Process + Results and Discussion Array padding
Results: Effect of array padding on the GPGPU
(Plot: clock cycles vs. stride amount, 0-500)
With padding - L1 (48KB) / With padding - L1 (disabled)
Figure: Cache thrashing with L1 and L2 vs. L2 only
The cache thrashing points that matter for performance are caused by the L1 cache, not by the L2 cache.
40 / 61
Adaptation Process + Results and Discussion Array padding
One by one - next: array merging
41 / 61
Adaptation Process + Results and Discussion Array merging
Array merging
(Diagram: a[1..n] + b[1..n] interleaved as a[1], b[1], a[2], b[2], ..., a[n], b[n])
Figure: Basic idea behind the array merging technique.
42 / 61
Adaptation Process + Results and Discussion Array merging
Adaptation
(Diagram: two different arrays + merged → one merged array)
Figure: Adaptation process
43 / 61
Adaptation Process + Results and Discussion Array merging
Adaptation
(Diagram: the merged array as located in memory)
Figure: Adaptation process
44 / 61
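A minimal sketch (ours) of the merged layout: when a[i] and b[i] are always consumed together, interleaving them puts both operands on the same cache line, halving the lines touched per element.

struct Pair { float a, b; };   // merged layout: a[i] and b[i] share a line

// before: float a[N], b[N];  c[i] = a[i] * b[i];  (two lines per element)
void combine(const Pair *ab, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = ab[i].a * ab[i].b;   // one line serves both operands
}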
Adaptation Process + Results and Discussion Array merging
Results: Effect of array merging on the CPU
(Plot: Time [ms] vs. input size, 1024X1024 to 4096X4096)
With Array Merging / Without Array Merging
Figure: Effect of array merging on CPU
The array merging technique improves the performance of non-stride accesses on the CPU.
45 / 61
Adaptation Process + Results and Discussion Array merging
Results: Effect of array merging on the GPGPU
(Plot: Time [ms] vs. input size, 512X512 to 3072X3072)
Without / with array merging - L1 disabled
Without / with array merging - L1 16KB
Without / with array merging - L1 48KB
Figure: Effect of array merging on GPGPU
The array merging technique can also improve performance on the GPGPU.
The GPGPU's larger cache line size is needed to realize the improvement.
46 / 61
Adaptation Process + Results and Discussion Array merging
One by one - next: array transpose
47 / 61
Adaptation Process + Results and Discussion Array transpose
Array transpose
Figure: The first diagram shows basic matrix multiplication; the second illustrates
how a transposed matrix is used for matrix multiplication
48 / 61
Adaptation Process + Results and Discussion Array transpose
Adaptation
(Diagram: memory access pattern before and after; the transposed array yields cache-friendly memory locations)
Figure: Adaptation process
49 / 61
Adaptation Process + Results and Discussion Array transpose
Adaptation
(Diagram: memory access pattern before and after, with the transposed array)
Figure: Adaptation process
50 / 61
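A minimal sketch (ours) of the CPU-side technique: B is transposed once, a one-off overhead, so the inner product then walks both operands with stride one.

#include <stdlib.h>

void matMulTransposed(const float *A, const float *B, float *C, int n) {
    float *Bt = (float *)malloc((size_t)n * n * sizeof(float));
    for (int i = 0; i < n; i++)          // one-off transpose overhead
        for (int j = 0; j < n; j++)
            Bt[j * n + i] = B[i * n + j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)  // both accesses are now stride-one
                sum += A[i * n + k] * Bt[j * n + k];
            C[i * n + j] = sum;
        }
    free(Bt);
}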
Adaptation Process + Results and Discussion Array transpose
Results: Effect of array transpose on the CPU
(Plot: Time [s] vs. input size, 512X512 to 2048X2048)
Basic Method / With Transpose (without transpose overhead) / Transpose Overhead
Figure: Effect of array transpose using matrix multiplication on CPU
The array transpose technique can be used to improve performance on the CPU.
51 / 61
Adaptation Process + Results and Discussion Array transpose
Results: Effect of array transpose on the GPGPU
(Plot: Time [ms] vs. input size, 1024X1024 to 5120X5120)
Matrix Multiplication (Basic) / Matrix Multiplication with Array Transpose
Figure: Effect of array transpose for matrix multiplication on GPGPU
The array transpose technique is not a good option on GPGPUs.
It increases the number of memory accesses compared with the original access pattern.
52 / 61
Findings and Conclusions
Findings and Conclusions about GPGPU cache
optimizations
1 Stride-one access is the best case for gaining better performance.
However, large non-stride accesses perform better when the L1
cache is disabled.
2 Manually using the cache memory (shared memory) is the best option
for gaining better performance with the blocking technique.
3 Better performance with kernel fusion can be achieved when multiple
kernels have common data accesses.
4 Array padding has a positive effect for high-degree shared memory bank
conflicts; L1 cache thrashing points can be shifted by applying array padding.
5 Array merging is a good option for improving the performance of
overall memory access on CPUs as well as GPGPUs.
6 Transposing 2D arrays is not a good option on GPGPUs for gaining
better performance on large data sets.
53 / 61
Case Study
54 / 61
Case Study Introduction - Aho-Corasick algorithm
Aho-Corasick algorithm - What is this?
The Aho-Corasick algorithm is a multiple-pattern searching
algorithm.
Where can we see the Aho-Corasick algorithm?
(Diagram: a pattern-matching automaton with states 0-9 consumes an input stream such as ABGABBEDG...GGABEDG; the same automaton, built over the DNA alphabet A, C, G, T, recurs across many application instances)
Figure: Applications of the Aho-Corasick algorithm
The parallel GPGPU version of the Aho-Corasick algorithm is called the
Parallel Failure-less Aho-Corasick (PFAC) algorithm [Lin et al., IEEE
Trans. Comput., 2013].
55 / 61
Case Study Introduction - Aho-corasick algorithm
How did we test our findings?
Implemented our own PFAC for DNA sequence matching, based on the
available GPGPU Aho-Corasick algorithm (a sketch of the kernel follows
this slide).
Analyzed the developed source code to find suitable locations to
apply the GPGPU optimization techniques
Analyzing...
1 Stride-one memory access → Not possible
2 Blocking → Found compatibility for the input text file → Loading the input
text via shared memory
3 Kernel fusion → Only one kernel
4 Array padding → No cache thrashing points
5 Array merging → Found compatibility for the input pattern file → Two
arrays of the input pattern file were merged via texture memory
6 Array transpose
56 / 61
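A hedged sketch (ours; the table layout and names are assumptions, not the PFAC authors' code) of the failure-less kernel referred to above: one thread per text position, each walking the goto table with no failure transitions and terminating at the first missing transition.

#define ALPHABET 4   // DNA alphabet, with the text pre-encoded as 0..3

__global__ void pfacKernel(const unsigned char *text, int n,
                           const int *gotoTable, const int *isFinal,
                           int *match) {
    int start = blockDim.x * blockIdx.x + threadIdx.x;
    if (start >= n) return;
    int state = 0;                                   // initial state
    for (int pos = start; pos < n; pos++) {
        state = gotoTable[state * ALPHABET + text[pos]];
        if (state < 0) break;                        // no transition: stop
        if (isFinal[state])
            match[start] = state;                    // pattern found at 'start'
    }
}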
Case Study Results and Discussion
Results: Comparison between original PFAC and our PFAC
(without cache optimization techniques)
(Plot: Time [s] across Pattern Sets 1-5)
Available PFAC / Our PFAC implementation - without any optimizations
Figure: Performance comparison between the original PFAC and our PFAC implementation
The performance gain from application-specific adaptation is around
1.27X.
57 / 61
Case Study Results and Discussion
Results: Comparison between PFAC implementations
(without and with cache optimization)
(Plot: Time [s] across Pattern Sets 1-5)
Our PFAC implementation - without any optimizations / with all optimizations
Figure: Performance comparison - our PFAC without and with optimizations
The performance gain from the developed application-specific solution to the
cache-optimized solution is around 2X.
58 / 61
Case Study Conclusion about the case study
Conclusion
The applied application-specific techniques improved the performance of
our PFAC implementation.
The applied cache memory optimization techniques improved the
performance of the PFAC implementation further.
Our worst case (without any optimizations) shows a 1.27X average
improvement, while the best case (with all optimizations in total) is
2.40X faster than the best available GPGPU solution (PFAC).
59 / 61
Publications
Publications (Up to now)
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe.
To use or not to use: Graphics Processing Units (GPUs) for pattern matching algorithms.
In 7th IEEE International Conference on Information and Automation for Sustainability,
pages 1-4, Dec 2014.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe.
An optimized Parallel Failure-less Aho-Corasick algorithm for DNA sequence matching.
In 8th IEEE International Conference on Information and Automation for Sustainability
(ICIAFS), Dec 2016.
D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe.
To use or not to use: CPU's cache optimization techniques for GPGPUs.
In 8th IEEE International Conference on Information and Automation for Sustainability
(ICIAFS), Dec 2016.
V. Thambawita, N. C. Ellepola, R. Ragel, and D. Elkaduwe.
GPGPU: To use or not to use?
In Peradeniya University Research Sessions (PURSE), 2013.
60 / 61
Q? and A
Q? and A
61 / 61

More Related Content

What's hot

Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDAprithan
 
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...IJCSIS Research Publications
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUSAVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUScsandit
 
Image Processing Application on Graphics processors
Image Processing Application on Graphics processorsImage Processing Application on Graphics processors
Image Processing Application on Graphics processorsCSCJournals
 
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place SolutionKaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place SolutionPreferred Networks
 
Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...IOSR Journals
 
Gpu based image segmentation using
Gpu based image segmentation usingGpu based image segmentation using
Gpu based image segmentation usingcsandit
 
Map SMAC Algorithm onto GPU
Map SMAC Algorithm onto GPUMap SMAC Algorithm onto GPU
Map SMAC Algorithm onto GPUZhengjie Lu
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 

What's hot (19)

Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
20120140505010
2012014050501020120140505010
20120140505010
 
Ultra Fast SOM using CUDA
Ultra Fast SOM using CUDAUltra Fast SOM using CUDA
Ultra Fast SOM using CUDA
 
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
GPU Parallel Computing of Support Vector Machines as applied to Intrusion Det...
 
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
[IJCT-V3I2P17] Authors: Sheng Lai, Xiaohua Meng, Dongqin Zheng
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUSAVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
 
Image Processing Application on Graphics processors
Image Processing Application on Graphics processorsImage Processing Application on Graphics processors
Image Processing Application on Graphics processors
 
Slide tesi
Slide tesiSlide tesi
Slide tesi
 
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place SolutionKaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
 
Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...
 
Gpu based image segmentation using
Gpu based image segmentation usingGpu based image segmentation using
Gpu based image segmentation using
 
Map SMAC Algorithm onto GPU
Map SMAC Algorithm onto GPUMap SMAC Algorithm onto GPU
Map SMAC Algorithm onto GPU
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
Cuda intro
Cuda introCuda intro
Cuda intro
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 

Similar to Cache Optimization Techniques for General Purpose Graphic Processing Units

EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...Bomm Kim
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAVLSICS Design
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUsiguazio
 
GPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print ImagingGPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print ImagingAMD
 
Summary Of Academic Projects
Summary Of Academic ProjectsSummary Of Academic Projects
Summary Of Academic Projectsawan2008
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Fisnik Kraja
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievVolodymyr Saviak
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsQuEST Global (erstwhile NeST Software)
 

Similar to Cache Optimization Techniques for General Purpose Graphic Processing Units (20)

slides
slidesslides
slides
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
 
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGAEFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
EFFICIENT ABSOLUTE DIFFERENCE CIRCUIT FOR SAD COMPUTATION ON FPGA
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
GPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print ImagingGPU Compute in Medical and Print Imaging
GPU Compute in Medical and Print Imaging
 
Ssbse10.ppt
Ssbse10.pptSsbse10.ppt
Ssbse10.ppt
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
Summary Of Academic Projects
Summary Of Academic ProjectsSummary Of Academic Projects
Summary Of Academic Projects
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
Performance Analysis and Optimizations of CAE Applications (Case Study: STAR_...
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Kindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 KievKindratenko hpc day 2011 Kiev
Kindratenko hpc day 2011 Kiev
 
Tridiagonal solver in gpu
Tridiagonal solver in gpuTridiagonal solver in gpu
Tridiagonal solver in gpu
 
High Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming ParadigmsHigh Performance Medical Reconstruction Using Stream Programming Paradigms
High Performance Medical Reconstruction Using Stream Programming Paradigms
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 

More from Vajira Thambawita

Lecture 4 principles of parallel algorithm design updated
Lecture 4   principles of parallel algorithm design updatedLecture 4   principles of parallel algorithm design updated
Lecture 4 principles of parallel algorithm design updatedVajira Thambawita
 
Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platformsVajira Thambawita
 
Lecture 2 more about parallel computing
Lecture 2   more about parallel computingLecture 2   more about parallel computing
Lecture 2 more about parallel computingVajira Thambawita
 
Lecture 1 introduction to parallel and distributed computing
Lecture 1   introduction to parallel and distributed computingLecture 1   introduction to parallel and distributed computing
Lecture 1 introduction to parallel and distributed computingVajira Thambawita
 
Lecture 12 localization and navigation
Lecture 12 localization and navigationLecture 12 localization and navigation
Lecture 12 localization and navigationVajira Thambawita
 
Lecture 11 neural network principles
Lecture 11 neural network principlesLecture 11 neural network principles
Lecture 11 neural network principlesVajira Thambawita
 
Lecture 10 mobile robot design
Lecture 10 mobile robot designLecture 10 mobile robot design
Lecture 10 mobile robot designVajira Thambawita
 
Lecture 08 robots and controllers
Lecture 08 robots and controllersLecture 08 robots and controllers
Lecture 08 robots and controllersVajira Thambawita
 
Lecture 06 pic programming in c
Lecture 06 pic programming in cLecture 06 pic programming in c
Lecture 06 pic programming in cVajira Thambawita
 
Lecture 05 pic io port programming
Lecture 05 pic io port programmingLecture 05 pic io port programming
Lecture 05 pic io port programmingVajira Thambawita
 
Lecture 04 branch call and time delay
Lecture 04  branch call and time delayLecture 04  branch call and time delay
Lecture 04 branch call and time delayVajira Thambawita
 
Lecture 02 mechatronics systems
Lecture 02 mechatronics systemsLecture 02 mechatronics systems
Lecture 02 mechatronics systemsVajira Thambawita
 
Lecture 1 - Introduction to embedded system and Robotics
Lecture 1 - Introduction to embedded system and RoboticsLecture 1 - Introduction to embedded system and Robotics
Lecture 1 - Introduction to embedded system and RoboticsVajira Thambawita
 
Lec 09 - Registers and Counters
Lec 09 - Registers and CountersLec 09 - Registers and Counters
Lec 09 - Registers and CountersVajira Thambawita
 
Lec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITS
Lec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITSLec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITS
Lec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITSVajira Thambawita
 
Lec 06 - Synchronous Sequential Logic
Lec 06 - Synchronous Sequential LogicLec 06 - Synchronous Sequential Logic
Lec 06 - Synchronous Sequential LogicVajira Thambawita
 

More from Vajira Thambawita (20)

Lecture 4 principles of parallel algorithm design updated
Lecture 4   principles of parallel algorithm design updatedLecture 4   principles of parallel algorithm design updated
Lecture 4 principles of parallel algorithm design updated
 
Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platforms
 
Lecture 2 more about parallel computing
Lecture 2   more about parallel computingLecture 2   more about parallel computing
Lecture 2 more about parallel computing
 
Lecture 1 introduction to parallel and distributed computing
Lecture 1   introduction to parallel and distributed computingLecture 1   introduction to parallel and distributed computing
Lecture 1 introduction to parallel and distributed computing
 
Lecture 12 localization and navigation
Lecture 12 localization and navigationLecture 12 localization and navigation
Lecture 12 localization and navigation
 
Lecture 11 neural network principles
Lecture 11 neural network principlesLecture 11 neural network principles
Lecture 11 neural network principles
 
Lecture 10 mobile robot design
Lecture 10 mobile robot designLecture 10 mobile robot design
Lecture 10 mobile robot design
 
Lecture 09 control
Lecture 09 controlLecture 09 control
Lecture 09 control
 
Lecture 08 robots and controllers
Lecture 08 robots and controllersLecture 08 robots and controllers
Lecture 08 robots and controllers
 
Lecture 07 more about pic
Lecture 07 more about picLecture 07 more about pic
Lecture 07 more about pic
 
Lecture 06 pic programming in c
Lecture 06 pic programming in cLecture 06 pic programming in c
Lecture 06 pic programming in c
 
Lecture 05 pic io port programming
Lecture 05 pic io port programmingLecture 05 pic io port programming
Lecture 05 pic io port programming
 
Lecture 04 branch call and time delay
Lecture 04  branch call and time delayLecture 04  branch call and time delay
Lecture 04 branch call and time delay
 
Lecture 03 basics of pic
Lecture 03 basics of picLecture 03 basics of pic
Lecture 03 basics of pic
 
Lecture 02 mechatronics systems
Lecture 02 mechatronics systemsLecture 02 mechatronics systems
Lecture 02 mechatronics systems
 
Lecture 1 - Introduction to embedded system and Robotics
Lecture 1 - Introduction to embedded system and RoboticsLecture 1 - Introduction to embedded system and Robotics
Lecture 1 - Introduction to embedded system and Robotics
 
Lec 09 - Registers and Counters
Lec 09 - Registers and CountersLec 09 - Registers and Counters
Lec 09 - Registers and Counters
 
Lec 08 - DESIGN PROCEDURE
Lec 08 - DESIGN PROCEDURELec 08 - DESIGN PROCEDURE
Lec 08 - DESIGN PROCEDURE
 
Lec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITS
Lec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITSLec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITS
Lec 07 - ANALYSIS OF CLOCKED SEQUENTIAL CIRCUITS
 
Lec 06 - Synchronous Sequential Logic
Lec 06 - Synchronous Sequential LogicLec 06 - Synchronous Sequential Logic
Lec 06 - Synchronous Sequential Logic
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 

Cache Optimization Techniques for General Purpose Graphic Processing Units

  • 16. Selected CPU Cache Optimization Techniques. Common end-user cache optimization techniques: data access optimization (stride-one access, blocking, loop fusion) and data layout optimization (array padding, array merging, array transpose). 9 / 61
  • 17-20. Selected CPU Cache Optimization Techniques. GPU cache complexity in the adaptation process: the GPGPU cache sits under a SIMD execution model and a complex memory architecture. SIMD: warps, blocks, grids. Memory architecture: shared memory (32 banks); L1 and L2 caches (configurable: 16KB/48KB, L1 can be disabled; 32-byte and 128-byte cache lines); texture memory (2D spatial locality). 10 / 61
  • 21. Experimental Setup.
  Table: Intel Core(TM) i5-3230M CPU, 2.6 GHz, Ivy Bridge micro-architecture, 8GB RAM
    Level    | Cache size | Cache line size | Associativity | Description
    L1 cache | 32KB       | 64 bytes        | 8-way         |
    L2 cache | 256KB      | 64 bytes        | 8-way         |
    L3 cache | 3072KB     | 64 bytes        | 12-way        | Shared
  Table: Tesla C2075 GPGPU's cache architecture, 6GB global memory
    Level         | Cache size | Cache line size    | Associativity | Description
    L1 cache      | 48KB/16KB  | 128 bytes          | not mentioned | can be disabled with the -Xptxas -dlcm=cg compile flag
    Shared memory | 16KB/48KB  | 128 bytes          | not mentioned | can be used manually
    L2 cache      | 768KB      | 128 bytes/32 bytes | not mentioned | unified cache
  11 / 61
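  A minimal sketch of how these cache configurations are selected in practice on Fermi-class hardware; the file, kernel and variable names here are placeholders, not from the thesis:

      # compile with L1 caching of global loads disabled (loads served by L2 only),
      # assuming a Fermi-class target
      nvcc -arch=sm_20 -Xptxas -dlcm=cg app.cu -o app

      // host code: choose the L1/shared-memory split (Fermi: 64KB total per SM)
      cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);         // 48KB L1, 16KB shared memory
      cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);     // 16KB L1, 48KB shared memory
      cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1); // per-kernel override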
  • 22. Adaptation Process + Results and Discussion. One by one: data access optimization (stride-one access, blocking, loop fusion); data layout optimization (array padding, array merging, array transpose). 12 / 61
  • 23. Stride-one memory access. Figure: Non-stride access vs stride-one access on the GPGPU. 13 / 61
  • 24. Adaptation from CPU to GPGPU (CPU side): loops; changing parameter = loop index; 64-byte L1, L2 and L3 cache lines. Figure: Adaptation process. 14 / 61
  • 25. Adaptation from CPU to GPGPU (GPGPU side): kernels; changing parameter = blockDim * blockID + threadID; 128-byte L1 cache lines, 128-byte or 32-byte L2 cache lines. Figure: Adaptation process. 15 / 61
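  To make the access pattern concrete, here is a minimal CUDA sketch in the spirit of these slides (our kernel, not the thesis benchmark): with stride = 1 the threads of a warp read consecutive words, so a warp's loads coalesce into a few 128-byte transactions, while larger strides spread each warp's reads over more cache lines.

      __global__ void stride_read(const float *in, float *out, int stride) {
          // thread tid touches element tid*stride, so the input must hold
          // at least gridDim.x * blockDim.x * stride floats
          int tid = blockDim.x * blockIdx.x + threadIdx.x;
          out[tid] = in[tid * stride] + 1.0f;
      }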
  • 26. Results: effect of stride-one access on the CPU. [Plot: time (ms) vs stride amount, two test runs, input size = 2867200 (710.9375MB).] Figure: Effect of stride amount on CPU. Execution time increases continuously with the stride amount. 16 / 61
  • 27. Results: effects of stride-one access on the GPGPU. [Plot: time (ms) vs stride amount, two test runs, input size = 2867200, L1 = 16KB (default settings).] Figure: Stride access effect on the Fermi GPGPU. Execution time increases with the stride amount, and as on the CPU the best performance occurs at stride 1; the additional effect of the stride amount is comparatively small once the stride spans a full cache line. 17 / 61
  • 28. Results: effects of stride-one access on the GPGPU. [Plot: time (ms) vs stride amount for L1 disabled, L1 = 48KB and L1 = 16KB, input size = 2867200.] Disabling the L1 cache gives better performance for large stride amounts because the number of cache lines served by the L2 cache is higher when L1 is disabled; likewise, a large L1 cache outperforms a small one because it holds more cache lines. 18 / 61
  • 29. Adaptation Process + Results and Discussion: stride-one access. One by one: data access optimization (stride-one access, blocking, loop fusion); data layout optimization (array padding, array merging, array transpose). 19 / 61
  • 30. Blocking technique. [Diagram: matrices A, B and C partitioned into blocks.] Figure: Two different blocking techniques from two different sources. The first technique uses small blocks from the first matrix and large blocks from the second matrix; the second method uses equal-size blocks from both matrices. 20 / 61
  • 31. Adaptation Process + Results and Discussion Blocking technique Adaptation 21 / 61
  • 32. Adaptation Process + Results and Discussion Blocking technique Adaptation Figure: Adaptation process 22 / 61
  • 33. Results: effects of the blocking technique on the CPU. [Plot: time (s) vs matrix size (512x512 to 2048x2048) for the default method without the tiling technique, the method from the Computer Architecture: A Quantitative Approach book, and the method equivalent to the GPGPU tiling method.] Figure: Effect of tiling on CPU. The method equivalent to the GPGPU method shows the best performance on the CPU as well. 23 / 61
  • 34-37. Results: effects of the blocking technique on the GPGPU. [Plots: time (ms) vs matrix size (512x512 to 3072x3072), default vs blocked, for L1 disabled, L1 = 16KB, L1 = 48KB, and blocked with shared memory.] Figure: Non-blocking vs blocking with various cache configurations on GPGPU. The blocking technique outperforms the non-blocking version in every cache configuration, and blocking with shared memory shows the best performance among all GPGPU cache options. 24 / 61
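  The shared-memory variant follows the standard CUDA tiled matrix multiplication pattern; a minimal sketch (our code, not the thesis implementation; assumes n is a multiple of TILE):

      #define TILE 16

      __global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
          // each block computes one TILE x TILE tile of C, staging tiles of
          // A and B in shared memory (the manually managed cache)
          __shared__ float As[TILE][TILE];
          __shared__ float Bs[TILE][TILE];
          int row = blockIdx.y * TILE + threadIdx.y;
          int col = blockIdx.x * TILE + threadIdx.x;
          float acc = 0.0f;
          for (int t = 0; t < n / TILE; ++t) {
              As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
              Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
              __syncthreads();
              for (int k = 0; k < TILE; ++k)
                  acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
              __syncthreads();  // finish using the tile before overwriting it
          }
          C[row * n + col] = acc;
      }

  Each element of A and B is loaded from global memory only n/TILE times instead of n times, which is where the speedup in the plots comes from.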
  • 38. Adaptation Process + Results and Discussion: blocking technique. One by one: data access optimization (stride-one access, blocking, loop fusion); data layout optimization (array padding, array merging, array transpose). 25 / 61
  • 39. Loop fusion. It is required to match the number of branching conditions in the fused and non-fused loops, and common variables are used within the for loops. On the GPGPU the loops are kernels, so kernel fusion is the GPGPU counterpart of CPU loop fusion. Example (the fused version is sketched below):
      for (int i = 0; i < n*n; i++) {
          h_array_c[i] = h_array_a[i] * h_array_b[i];
      }
      for (int i = 0; i < n*n; i++) {
          h_array_d[i] = h_array_c[i] * h_array_a[i];
      }
  26 / 61
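  The fused counterpart of this example (our rendering of the transformation) performs both updates in one pass, so h_array_c[i] and h_array_a[i] are still warm in the cache when h_array_d[i] is computed:

      for (int i = 0; i < n*n; i++) {
          h_array_c[i] = h_array_a[i] * h_array_b[i];  // both reads stay cached
          h_array_d[i] = h_array_c[i] * h_array_a[i];  // ...for the second statement
      }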
  • 40. Adaptation. The common data elements are here; loop unrolling is used to match the iteration counts. Figure: Adaptation process. 27 / 61
  • 41. Adaptation. Kernel 1 + Kernel 2: making one kernel. Figure: Adaptation process. 28 / 61
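  A minimal sketch of the fusion step using the arrays from the CPU example above (our kernel names, not the thesis code): the fused kernel keeps the intermediate product in a register instead of writing it to global memory in one kernel and reading it back in the next.

      __global__ void mul_ab(const float *a, const float *b, float *c, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) c[i] = a[i] * b[i];   // c goes out to global memory...
      }
      __global__ void mul_ca(const float *c, const float *a, float *d, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) d[i] = c[i] * a[i];   // ...and is read back here
      }

      // fused: a[i] is loaded once and the intermediate stays in a register
      __global__ void mul_fused(const float *a, const float *b,
                                float *c, float *d, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) {
              float ai = a[i];
              float ci = ai * b[i];
              c[i] = ci;
              d[i] = ci * ai;
          }
      }

  Launching one kernel instead of two also saves a launch overhead, but the cache benefit comes from the removed global-memory round trip.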
  • 42. Results: effect of loop fusion on the CPU. [Plot: time (ms) vs input size (1024x1024 to 4096x4096) without loop fusion, without loop fusion but with loop unrolling, and with loop fusion.] Figure: Effect of loop fusion on CPU with two common data elements. The loop fusion technique improves performance on the CPU, and the unrolled variant shows that this improvement is not merely an effect of the reduced number of iterations. 29 / 61
  • 43. Results: effect of loop fusion on the GPGPU. [Plot: time (ms) vs input size without kernel fusion (default settings) and with kernel fusion for L1 = 16KB, 48KB and disabled.] Figure: Effect of kernel fusion on GPGPU with common data accesses. The kernel fusion technique improves performance for kernels that share common data accesses. 30 / 61
  • 44. Adaptation Process + Results and Discussion: loop fusion. One by one: data access optimization (stride-one access, blocking, loop fusion); data layout optimization (array padding, array merging, array transpose). 31 / 61
  • 45. Array padding. [Diagram: elements of arrays a, b, c and d mapping onto the same cache lines L[0], L[1], L[2].] Figure: Cache thrashing. 32 / 61
  • 46. Adaptation. Cache thrashing is first generated using nested loops; adding an additional (padding) data element then shifts the arrays so they no longer collide on the same cache lines. Figure: Adaptation to GPGPU. 33 / 61
  • 47. Adaptation. L1 cache thrashing, observed with one selected warp (32 threads). Figure: Adaptation to GPGPU. 34 / 61
  • 48. Results: effect of array padding on the CPU. [Plot: time (ms) vs input size, with cache thrashing and without cache thrashing (with array padding).] Figure: Effect of array padding for cache thrashing on CPU. The array padding technique gives a slight performance improvement on the CPU. 35 / 61
  • 49. Results: effect of array padding on the GPGPU. [Plot: time (ms) vs stride amount.] Figure: Effect of shared memory bank conflicts on GPGPU. Shared memory bank conflicts have a considerable effect on application performance. 36 / 61
  • 50-52. Results: effect of array padding on the GPGPU. [Plots: time (ms) vs input size (256x256 to 1536x1536) for 8-way, 16-way and 32-way bank conflicts, each without and with padding.] Figure: Effect of padding for shared memory bank conflicts on GPGPU. Array padding is an effective solution for shared memory bank conflicts when the degree of conflict is high, e.g. 32-way. 37 / 61
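  A minimal sketch of the padding fix, using the classic shared-memory transpose pattern (our example, assuming 32x32 thread blocks and n a multiple of 32): without the extra column, the column-wise reads below would be 32-way bank conflicts.

      __global__ void transpose32(const float *in, float *out, int n) {
          // the +1 padding column shifts each row by one of the 32 banks,
          // so reading tile[threadIdx.x][threadIdx.y] is conflict-free
          __shared__ float tile[32][32 + 1];
          int x = blockIdx.x * 32 + threadIdx.x;
          int y = blockIdx.y * 32 + threadIdx.y;
          tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
          __syncthreads();
          int tx = blockIdx.y * 32 + threadIdx.x;
          int ty = blockIdx.x * 32 + threadIdx.y;
          out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced store
      }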
  • 53. Results: effect of array padding on the GPGPU. [Plot: clock cycles vs stride amount, without padding, for L1 = 16KB and L1 = 48KB.] Figure: L1 cache thrashing points while the L1 size is 16KB and 48KB. L1 cache accesses produce cache thrashing points at particular stride amounts. 38 / 61
  • 54. Results: effect of array padding on the GPGPU. [Plot: clock cycles vs stride amount for L1 = 48KB, without padding, with one padding element, and with two.] Figure: Effect of padding on the thrashing points of the L1 cache. Array padding shifts the cache thrashing points rather than removing them. 39 / 61
  • 55. Results: effect of array padding on the GPGPU. [Plot: clock cycles vs stride amount, with padding, for L1 = 48KB and L1 disabled.] Figure: Cache thrashing with L1 and L2 vs L2 only. The thrashing points that are significant for performance are caused by the L1 cache, not by the L2 cache. 40 / 61
  • 56. Adaptation Process + Results and Discussion: array padding. One by one: data access optimization (stride-one access, blocking, loop fusion); data layout optimization (array padding, array merging, array transpose). 41 / 61
  • 57. Array merging. [Diagram: arrays a[1..n] and b[1..n] combined into one interleaved array a[1] b[1] a[2] b[2] ... a[n] b[n].] Figure: Basic idea behind the array merging technique. 42 / 61
  • 58. Adaptation. Two different arrays become one merged array. Figure: Adaptation process. 43 / 61
  • 59. Adaptation. The merged array is then located in memory so that paired elements share cache lines. Figure: Adaptation process. 44 / 61
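  A minimal sketch of the layout change (the Pair name is ours; assumes a[i] and b[i] are always consumed together at the same index):

      // separate arrays: a[i] and b[i] live in different cache lines, so a
      // warp touching both generates two independent load streams;
      // merged layout: each pair sits in the same cache line
      struct Pair { float a; float b; };

      __global__ void mul_merged(const Pair *in, float *out, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) out[i] = in[i].a * in[i].b;  // one coalesced access stream
      }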
  • 60. Results: effect of array merging on the CPU. [Plot: time (ms) vs input size (1024x1024 to 4096x4096), with and without array merging.] Figure: Effect of array merging on CPU. The array merging technique improves the performance of non-stride accesses on the CPU. 45 / 61
  • 61-63. Results: effect of array merging on the GPGPU. [Plots: time (ms) vs input size (512x512 to 3072x3072), without and with array merging, for L1 disabled, L1 = 16KB and L1 = 48KB.] Figure: Effect of array merging on GPGPU. The array merging technique improves performance on the GPGPU as well; a larger cache line size is needed for the improvement to appear. 46 / 61
  • 64. Adaptation Process + Results and Discussion: array merging. One by one: data access optimization (stride-one access, blocking, loop fusion); data layout optimization (array padding, array merging, array transpose). 47 / 61
  • 65. Array transpose. Figure: The first diagram shows basic matrix multiplication; the second illustrates how a transposed matrix is used for matrix multiplication. 48 / 61
  • 66. Adaptation. Memory pattern before vs after: the transposed array turns the second operand's column accesses into cache-friendly memory locations. Figure: Adaptation process. 49 / 61
  • 67. Adaptation. Memory pattern before vs after transposing the array. Figure: Adaptation process. 50 / 61
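  A minimal host-side sketch of the technique (our code, not the thesis implementation): B is transposed once, so the innermost loop walks both operands with stride-one accesses; the transpose itself is the overhead measured separately on the next slide.

      #include <vector>

      void matmul_transposed(const float *A, const float *B, float *C, int n) {
          std::vector<float> Bt(static_cast<size_t>(n) * n);
          for (int i = 0; i < n; ++i)          // transpose overhead, O(n^2)
              for (int j = 0; j < n; ++j)
                  Bt[j * n + i] = B[i * n + j];
          for (int i = 0; i < n; ++i)
              for (int j = 0; j < n; ++j) {
                  float acc = 0.0f;
                  for (int k = 0; k < n; ++k)  // both accesses are stride-one
                      acc += A[i * n + k] * Bt[j * n + k];
                  C[i * n + j] = acc;
              }
      }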
  • 68. Results: effect of array transpose on the CPU. [Plot: time (s) vs input size (512x512 to 2048x2048) for the basic method, with transpose (excluding the transpose overhead), and the transpose overhead itself.] Figure: Effect of array transpose using matrix multiplication on CPU. The array transpose technique improves performance on the CPU. 51 / 61
  • 69. Results: effect of array transpose on the GPGPU. [Plot: time (ms) vs input size (1024x1024 to 5120x5120), basic matrix multiplication vs matrix multiplication with array transpose.] Figure: Effect of array transpose for matrix multiplication on GPGPU. The array transpose technique is not a good option on GPGPUs: it increases the number of memory accesses compared with the original access pattern. 52 / 61
  • 70. Findings and conclusions about GPGPU cache optimizations:
  1 Stride-one access is the best case for performance; however, large-stride (non-stride-one) accesses perform better when the L1 cache is disabled.
  2 Manually using the cache memory (shared memory) is the best option for gaining performance with the blocking technique.
  3 Kernel fusion improves performance when multiple kernels have common data accesses.
  4 Array padding has a positive effect for high-degree shared memory bank conflicts, and L1 cache conflicts can be avoided by applying array padding.
  5 Array merging is a good option for improving overall memory access performance on CPUs as well as GPGPUs.
  6 Transposing 2D arrays is not a good option on GPGPUs for large data sets.
  53 / 61
  • 72. Case Study: introduction to the Aho-Corasick algorithm. The Aho-Corasick algorithm is a multiple-pattern searching algorithm. Where can we see it? [Diagram: an Aho-Corasick state machine (states 0-9 with labelled transitions) matching an input stream, replicated across many parallel instances for DNA text.] Figure: Applications of the Aho-Corasick algorithm. The parallel GPGPU version of the Aho-Corasick algorithm is called the Parallel Failure-less Aho-Corasick (PFAC) algorithm [the well-known IEEE Trans. Comput. paper cited earlier]. 55 / 61
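  A minimal sketch of the PFAC idea (our illustration, with an assumed transition-table encoding; not the paper's or the thesis's code): one thread starts matching at every text position and follows goto transitions only; because there are no failure links, a thread simply terminates at the first missing transition.

      __global__ void pfac_match(const int *trans,  // flattened goto table: 256 entries per state
                                 const char *text, int n,
                                 int *match,        // match[pos]: pattern id matched from pos, 0 if none
                                 int init_state, int num_final) {
          int pos = blockDim.x * blockIdx.x + threadIdx.x;
          if (pos >= n) return;
          int state = init_state;
          for (int i = pos; i < n; ++i) {
              state = trans[state * 256 + (unsigned char)text[i]];
              if (state < 0) break;       // no transition: failureless, so just stop
              if (state < num_final)      // final states assumed numbered 0..num_final-1
                  match[pos] = state + 1; // record the pattern that ends here
          }
      }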
  • 73. Case Study: how did we test our findings? We implemented our own PFAC for DNA sequence matching (starting from the available GPGPU Aho-Corasick algorithm) and analyzed the source code to find suitable locations to apply the GPGPU optimization techniques:
  1 Stride-one memory access: not possible.
  2 Blocking: compatible with the input text file; the input text is loaded via shared memory.
  3 Kernel fusion: only one kernel.
  4 Array padding: no cache thrashing points.
  5 Array merging: compatible with the input pattern file; the two arrays of the input pattern file were merged via texture memory.
  6 Array transpose.
  56 / 61
  • 74. Results: comparison between the original PFAC and our PFAC (without cache optimization techniques). [Plot: time (s) for pattern sets 1-5, available PFAC vs our PFAC implementation without any optimizations.] Figure: Performance comparison between the original PFAC and our PFAC implementation. The performance gain from application-specific adaptation is around 1.27X. 57 / 61
  • 75. Results: comparison between our PFAC implementations without and with cache optimization. [Plot: time (s) for pattern sets 1-5, without any optimizations vs with all optimizations.] Figure: Performance comparison of our PFAC without and with optimizations. The performance gain from the application-specific solution to the cache-optimized solution is around 2X. 58 / 61
  • 76. Conclusion about the case study. The applied application-specific techniques improved the performance of our PFAC implementation, and the applied cache memory optimization techniques improved it further. In the worst case (without any optimizations) our implementation shows a 1.27X average improvement, while in the best case (with all optimizations in total) it is 2.40X faster than the best available GPGPU solution (the original PFAC). 59 / 61
  • 77. Publications (up to now):
  D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: Graphics processing units (GPUs) for pattern matching algorithms. In 7th IEEE International Conference on Information and Automation for Sustainability, pages 1-4, Dec 2014.
  D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. An optimized parallel failure-less Aho-Corasick algorithm for DNA sequence matching. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
  D. R. V. L. B. Thambawita, R. Ragel, and D. Elkaduwe. To use or not to use: CPU's cache optimization techniques for GPGPUs. In 8th IEEE International Conference on Information and Automation for Sustainability (ICIAFS), Dec 2016.
  V. Thambawita, N. C. Ellepola, R. Ragel, and D. Elkaduwe. GPGPU: To use or not to use? In Peradeniya University Research Sessions (PURSE), 2013.
  60 / 61
  • 78. Q & A. 61 / 61