1. GPU acceleration of image processing
Jan Lemeire
15/11/2012
2.
3. GPU vs CPU Peak Performance Trends
GPU peak performance has grown aggressively; hardware has kept up with Moore's law.
1995: 5,000 triangles/second, 800,000-transistor GPU
2010: 350 million triangles/second, 3-billion-transistor GPU
Source: NVIDIA
4. To the rescue: Graphical Processing Units (GPUs)
Many-core GPU: 1-3 TeraFlop/second, instead of 10-20 GigaFlop/second for a multi-core CPU
94 fps (AMD Tahiti Pro)
Figure 1.1. Enlarging Performance Gap between GPUs and CPUs. (Courtesy: John Owens)
5.
6. GPUs are an alternative to CPUs in offering processing power
7. Pixel rescaling, lens correction, pattern detection
The CPU delivers only 4 fps; next-generation machines need 50 fps.
9. Methodology
Application → identification of compute-intensive parts → feasibility study of GPU acceleration → GPU implementation → GPU optimization → Hardware
10. Obstacle 1: Hard(er) to implement
11. GPU Programming Concepts (OpenCL terminology)
Hardware model:
- Device/GPU: ~1 TFLOPS, containing several multiprocessors
- Per multiprocessor: Local Memory (16/48 KB; ~40 GB/s, a few cycles latency), private memory (16K/8), and scalar processors running at ~1 GHz
- Global Memory (1 GB): ~100 GB/s, ~200 cycles latency
- Constant Memory (64 KB)
- Texture Memory (resides in global memory)
- Host/CPU with its RAM, connected to the device at 4-8 GB/s
Execution model:
- A kernel is executed over a grid (1D, 2D or 3D) of work groups
- A work group of size Sx × Sy is identified by (get_group_id(0), get_group_id(1)); its dimensions are returned by get_local_size(0) and get_local_size(1)
- Within a work group, each work item is identified by (get_local_id(0), get_local_id(1))
Limits:
- Max work items per work group: 1024
- Executed in warps/wavefronts of 32/64 work items
- Max work groups simultaneously on a multiprocessor: 8
- Max active warps on a multiprocessor: 24/48
12. Semi-abstract scalable hardware model
More hardware details must be known than for a CPU, but code remains compatible and efficient across devices.
The model must be understood to write effective and efficient code.
On a CPU, the processor itself ensures efficient execution.
13. Increased code complexity
1. Complex index calculations
   - Mapping data elements onto processing elements (at least 2 levels)
   - Sometimes it is better to group elements
2. Optimizations
   - Their impact on performance needs to be tested
3. A lot of parameters:
   a. Algorithm, implementation
   b. Configuration of the mapping
   c. Hardware parameters (limits)
   d. Optimized versions
14. Methodology
Application → identification of compute-intensive parts → feasibility study of GPU acceleration → GPU implementation → GPU optimization → Hardware
Routes to a GPU implementation: parallelization by compiler, pragma-based, skeleton-based, or OpenCL
15. Obstacle 2: Hard(er) to get efficiency
16. We expect peak performance
A speedup of 100x is possible.
At least, we expect some speedup.
But what is a 5x speedup worth?
And what are the reasons for low efficiency?
23. Competence Center for Personal Supercomputing
- Offer training (overcome obstacle 1)
- Acquire expertise
- Take an independent, critical position
- Offer feasibility and performance studies (overcome obstacle 2)
Symposium: Brussels, December 13th 2012
http://parallel.vub.ac.be
Editor's Notes
First, we have to understand where the tremendous computational power of the GPU comes from. The CPU is capable of running any sequential program very fast. The GPU has a lot of processing units, but programming them requires more care: map part of the computational work onto a processing element, describe it by a kernel, and let the kernel be executed by a 'thread'. E.g. in image processing, a pixel is the work unit.
Case of KLA Tencor (ICOS – Leuven): inspection machines needing real-time image processing
Re-implementation of algorithms is required…
On the left the abstract hardware model and on the right the execution model. Both should be understood in order to write OpenCL programs. This contrasts with the simple Von Neumann model used for CPUs.
Our focus is on OpenCL programming and not high-level solutions that generate GPU programs. Those solutions are, in my opinion, not mature yet.
Is 5x worth the effort of porting to GPUs?
The Roofline model shows which resource bounds the overall performance.
After each waterfall follows calm water, but you have to accept the turbulences first. And you don't know when you're out of trouble.