2. CPU versus GPU
CPU
• Sophisticated Control
• Branch Prediction
• Out-of-Order Execution
• Large Cache
GPU
• Little Control
• No or Limited Branch Prediction
• Simple Execution
• Small or No Cache
• Lots of ALUs
4. Why OpenCL for CPU
Multi-core CPUs are already out there
E.g. MediaTek Tri-Cluster 10-core SoC
Mobile GPU is already busy
~25% occupied by system UI in Android
Not every program runs well on a GPU
E.g. heavy branch divergence
OpenCL makes it easy to exploit multi-core and SIMD
Imagine writing pthread + SIMD code in assembly or intrinsics instead
5. Running OpenCL Kernels on CPU
One thread per work-item?
Thousands of threads being created
Context-switching problems
How to synchronize threads?
How about running one work-group on a CPU thread?
6. Related Work
Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
Clover (http://people.freedesktop.org/~steckdenis/clover)
Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
7. What is pocl
POrtable Computing Language
An efficient implementation of the OpenCL standard that can be easily adapted to new targets
http://github.com/pocl/pocl
Main developer: Pekka Jääskeläinen from Tampere University of Technology
Supported architectures: CPU, tce, cellspu, HSA
Current version: 0.11
10. pocl Compilation Chain
1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc…
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and code generation
11. Work-group Function Generation
clEnqueueNDRangeKernel(…., group_size, ….)
work_group_function() {
    for (int i = 0; i < work_group_size; i++) {   /* WI-loop */
        /* kernel body (single work-item) */
    }
}
What if there are barriers?
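A minimal sketch in plain C of the idea on this slide: the kernel body is written for one work-item, and the work-group function wraps it in a WI-loop so that one CPU thread runs the whole group. The vector-add kernel and the names vec_add_work_item, vec_add_work_group and launch_ndrange are hypothetical illustrations, not code generated by pocl.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body as written for ONE work-item (hypothetical vector add).
   In OpenCL C, gid would come from get_global_id(0). */
static void vec_add_work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Work-group function: one CPU thread executes every work-item of the
   group by looping over the local ids (the WI-loop). */
static void vec_add_work_group(size_t group_id, size_t group_size,
                               const float *a, const float *b, float *c)
{
    assert(group_size > 0);
    for (size_t lid = 0; lid < group_size; lid++) {
        size_t gid = group_id * group_size + lid;
        vec_add_work_item(gid, a, b, c);
    }
}

/* Single-core stand-in for an NDRange launch: call the work-group
   function once per work-group. A real runtime would hand the groups
   to several CPU threads instead. */
static void launch_ndrange(size_t global_size, size_t group_size,
                           const float *a, const float *b, float *c)
{
    assert(group_size > 0 && global_size % group_size == 0);
    for (size_t g = 0; g < global_size / group_size; g++)
        vec_add_work_group(g, group_size, a, b, c);
}
```

Because the WI-loop has no dependencies between iterations in this example, it is also the kind of loop a backend auto-vectorizer can turn into SIMD code.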
12. Semantics of barrier Synchronization
OpenCL 1.2 rev19 p.30:
“… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…”
if (tid % 2) {
….
barrier();
…
}
13. Kernel Without barriers
• A node in a CFG is a basic block (BB)
• BB: a branchless sequence of instructions
• A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl kernel compiler generates a WI-loop around the CFG
14. Types of Barrier
Unconditional barriers
A barrier that dominates the exit node
Conditional barriers
Barriers placed inside
if-else
for-loop (b-loop)
15. Kernel with unconditional barriers
The pocl kernel compiler creates WI-loops before and after the barrier
This forms an algorithm:
Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.
Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region.
Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
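The steps of Algorithm 1 can be sketched in C roughly as follows. This is a simplified illustration under stated assumptions, not pocl's implementation: the Block and Region types and the functions collect and form_regions are hypothetical, a Region stands in for the result of CreateSubgraph, and Step 1 is assumed to be already done (entry and exit blocks are marked as barriers, single exit, no conditional barriers).

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 16
#define MAX_SUCCS 2

typedef struct {
    int is_barrier;          /* 1 if the block holds a (possibly implicit) barrier */
    int nsucc;
    int succ[MAX_SUCCS];
} Block;

typedef struct {
    int from_barrier;        /* barrier that opens the region */
    int to_barrier;          /* next barrier reached, or -1 */
    int nblocks;
    int blocks[MAX_NODES];   /* non-barrier blocks inside the region */
} Region;

/* Depth-first walk that stops at the next barrier. The visited flags make
   the walk skip back edges, so kernel loops stay inside the region. */
static void collect(const Block *cfg, int node, unsigned char *visited, Region *r)
{
    if (cfg[node].is_barrier) {
        r->to_barrier = node;
        return;
    }
    if (visited[node])
        return;
    visited[node] = 1;
    r->blocks[r->nblocks++] = node;
    for (int i = 0; i < cfg[node].nsucc; i++)
        collect(cfg, cfg[node].succ[i], visited, r);
}

/* Walks from barrier to barrier and records one region per pair.
   `out` must have room for MAX_NODES regions. Returns the region count. */
static int form_regions(const Block *cfg, int nblocks, int entry, Region *out)
{
    int count = 0;
    int cur = entry;
    assert(nblocks <= MAX_NODES && cfg[entry].is_barrier);
    while (cfg[cur].nsucc > 0) {
        unsigned char visited[MAX_NODES];
        Region r = { cur, -1, 0, {0} };
        memset(visited, 0, sizeof visited);
        for (int i = 0; i < cfg[cur].nsucc; i++)
            collect(cfg, cfg[cur].succ[i], visited, &r);
        assert(r.to_barrier >= 0 && count < MAX_NODES);
        out[count++] = r;
        cur = r.to_barrier;
    }
    return count;
}
```

Each resulting region is what would then be wrapped in its own WI-loop.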
16. A CFG with Two Conditional barriers
Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.
Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier successors.
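A rough C sketch of the core of Steps 2 and 3, the tail duplication itself. It is deliberately simplified: the tail-merging optimization of Step 3 and the restart of Step 4 are left out, and the Block type and the functions clone_tail and duplicate_tail are hypothetical stand-ins for CreateSubgraph and ReplicateCFG, not pocl code.

```c
#include <assert.h>

#define MAX_BLOCKS 16

typedef struct {
    int is_barrier;   /* 1 if the block holds a barrier */
    int nsucc;
    int succ[2];
} Block;

/* Clone every block reachable from `node` and return the index of the
   clone. map[] remembers the clones already made, so joins and back
   edges inside the tail point at the copy instead of being cloned twice. */
static int clone_tail(Block *cfg, int *n, int node, int *map)
{
    if (map[node] >= 0)
        return map[node];
    assert(*n < MAX_BLOCKS);
    int copy = (*n)++;
    map[node] = copy;
    cfg[copy] = cfg[node];
    for (int i = 0; i < cfg[node].nsucc; i++)
        cfg[copy].succ[i] = clone_tail(cfg, n, cfg[node].succ[i], map);
    return copy;
}

/* Give `barrier` a private copy of everything that follows it, so no
   other path can branch into the middle of its tail. `cfg` must have
   room for MAX_BLOCKS blocks; *n is the current block count. */
static void duplicate_tail(Block *cfg, int *n, int barrier)
{
    int map[MAX_BLOCKS];
    assert(cfg[barrier].is_barrier);
    for (int i = 0; i < MAX_BLOCKS; i++)
        map[i] = -1;
    for (int i = 0; i < cfg[barrier].nsucc; i++)
        cfg[barrier].succ[i] = clone_tail(cfg, n, cfg[barrier].succ[i], map);
}
```

After duplication, the blocks following each conditional barrier belong to exactly one path, which is what makes the later WI-loop creation straightforward.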
17. A CFG with Two Conditional barriers (After Tail Duplication)
Easier WI-loop creation!
19. Barriers in Kernel Loops
Insert implicit barriers:
1. At the end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
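A hedged C sketch of what a loop containing a barrier (b-loop) can look like after the implicit barriers are added: the kernel loop runs once per work-group, and each region inside it gets its own WI-loop. The kernel, the function name neighbor_sum_work_group and its parameters are hypothetical, and the sketch assumes every work-item runs the same number of loop iterations (which the barrier rule requires anyway).

```c
#include <assert.h>
#include <stddef.h>

/* Original per-work-item kernel loop (OpenCL C, shown as a comment):
 *
 *   for (int k = 0; k < steps; k++) {
 *       tmp[lid] = data[lid] + data[(lid + 1) % n];
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       data[lid] = tmp[lid];
 *   }
 */
static void neighbor_sum_work_group(float *data, float *tmp, size_t n, int steps)
{
    assert(n > 0);
    /* implicit barrier at the end of the pre-header: all work-items
       enter the loop together */
    for (int k = 0; k < steps; k++) {               /* b-loop */
        for (size_t lid = 0; lid < n; lid++)        /* WI-loop: region before barrier */
            tmp[lid] = data[lid] + data[(lid + 1) % n];
        /* explicit barrier was here */
        for (size_t lid = 0; lid < n; lid++)        /* WI-loop: region after barrier */
            data[lid] = tmp[lid];
        /* implicit barrier before the latch branch: nobody starts
           iteration k + 1 until everyone has finished iteration k */
    }
}
```

With data = {1, 2, 3, 4} and steps = 2, the first iteration yields {3, 5, 7, 5} and the second {8, 12, 12, 8}.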
21. Handling of Kernel Variables
1. There will be two parallel regions
2. a's lifetime is only within the first parallel region (it's a temporary variable)
3. B's lifetime spans both parallel regions, so it is kept in a context array
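A small C sketch of the context-array idea, under assumptions: the kernel shown on the slide is not part of this transcript, so the kernel below, the function scale_shift_work_group, the array b_ctx and the limit MAX_WG are hypothetical. The temporary a stays an ordinary local inside the first WI-loop, while b must survive the barrier and therefore gets one slot per work-item.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WG 64   /* assumed upper bound on the work-group size */

/* Original per-work-item kernel (OpenCL C, shown as a comment):
 *
 *   float a = in[lid] * 2.0f;          // used only before the barrier
 *   float b = a + 1.0f;                // used after the barrier
 *   scratch[lid] = a;
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = b + scratch[(lid + 1) % n];
 */
static void scale_shift_work_group(const float *in, float *out,
                                   float *scratch, size_t n)
{
    float b_ctx[MAX_WG];                   /* context array: one b per work-item */
    assert(n > 0 && n <= MAX_WG);

    for (size_t lid = 0; lid < n; lid++) { /* parallel region 1 */
        float a = in[lid] * 2.0f;          /* lives only in this region */
        scratch[lid] = a;
        b_ctx[lid] = a + 1.0f;             /* saved for region 2 */
    }
    /* barrier was here */
    for (size_t lid = 0; lid < n; lid++)   /* parallel region 2 */
        out[lid] = b_ctx[lid] + scratch[(lid + 1) % n];
}
```

For in = {1, 2, 3, 4} this produces out = {7, 11, 15, 11}.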
22. References
Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation", International Journal of Parallel Programming, Springer, August 2014.
http://github.com/pocl/pocl
Editor's Notes
A, B, and D form a parallel region, and from B there is a branch into the middle of the work-item loop of another parallel region (ABEHI).
If at least one work-item takes the branch after B that can lead to a barrier, the rest of the work-items must follow (peel the first loop iteration).