Architecture Aware Partitioning of Open-CL Programs

SUBMITTED BY:
ANKIT SINGH
ROLL NO.-15IT60R04
SUBMITTED TO:
PROF. SOUMYAJIT DEY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
IIT KHARAGPUR

Content:
Objective
Introduction to Open-CL
Open-CL Platform Model
Partitioning of Open-CL program in CPU-GPU
GPGPU Sim
Architecture Specific Training
Architecture Aware Training
Architecture Aware Partitioning
Future Work
Conclusion
References

Objective
Create an ML-based, architecture-aware (CPU and GPU) partitioning classifier that takes
an Open-CL program and a new architecture as input and generates the optimal partition
class value for that input.

Introduction to Open-CL
OpenCL is a data-parallel programming model for heterogeneous systems, which may
include CPUs, GPUs, or other accelerator devices.
Developer: Khronos Group

Partitioning in CPU-GPU (Matrix-Matrix multiplication example)
[Figure: matrix A is split into a 20% part and an 80% part, and each part is multiplied with matrix B; the two parts are divided between the CPU and the GPU.]
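The split in the figure can be sketched in plain C as a row partition of matrix A. This is only an illustration: the RowSplit type and split_rows function are hypothetical names, and handing the first block of rows to the CPU and the rest to the GPU is an assumption, not something the slide states.

```c
#include <stddef.h>

/* Hypothetical helper type: the rows of matrix A handled by each device. */
typedef struct {
    size_t cpu_first, cpu_count;  /* CPU rows: [cpu_first, cpu_first + cpu_count) */
    size_t gpu_first, gpu_count;  /* GPU rows: [gpu_first, gpu_first + gpu_count) */
} RowSplit;

/* cpu_percent is the share of rows given to the CPU, 0..100 (assumed convention). */
RowSplit split_rows(size_t rows, unsigned cpu_percent)
{
    RowSplit s;
    if (cpu_percent > 100)
        cpu_percent = 100;
    s.cpu_first = 0;
    s.cpu_count = rows * cpu_percent / 100;  /* integer division rounds down */
    s.gpu_first = s.cpu_count;               /* GPU takes the remaining rows */
    s.gpu_count = rows - s.cpu_count;
    return s;
}
```

Each device then multiplies its own rows of A with the whole of B, so the two partial results fill disjoint rows of the output matrix.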

Concept of Work Group and Work Item in Open-CL
A kernel is a function executed at each point of a problem domain (once for each work item).
Number of work items: 4096 (16 work-groups of 256 work-items each).
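The arithmetic behind these numbers can be sketched in plain C. The function names below are hypothetical; the indexing rule is the one OpenCL uses for a one-dimensional range with a zero global offset, where get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0).

```c
#include <stddef.h>

/* Total work-items = number of work-groups * work-items per group. */
size_t total_work_items(size_t num_groups, size_t local_size)
{
    return num_groups * local_size;
}

/* 1-D global id of a work-item, assuming a zero global offset. */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}
```

With 16 work-groups of 256 work-items, the last work-item of the last group (group 15, local id 255) gets global id 4095.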

Profiling Events in Open-CL
Kernel execution time must be measured to validate partitioning options. A wall clock
may be used for a very long-running kernel, but it is not an accurate method. The steps
to measure kernel execution time accurately are:
Create a command queue with profiling enabled
Create an event
Ensure that all previously enqueued tasks have executed
Launch the kernel linked to the event
Ensure that kernel execution has finished
Get the profiling data
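In the OpenCL host API these steps correspond to creating the queue with CL_QUEUE_PROFILING_ENABLE, calling clFinish, passing an event to clEnqueueNDRangeKernel, calling clWaitForEvents, and reading CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END with clGetEventProfilingInfo. Both values are timestamps in nanoseconds. The last step can be sketched in plain C as below; kernel_time_ms is a hypothetical helper, and the two parameters stand in for the values read from the event.

```c
#include <stdint.h>

/* start_ns and end_ns stand for the CL_PROFILING_COMMAND_START and
   CL_PROFILING_COMMAND_END timestamps, which OpenCL reports in nanoseconds. */
double kernel_time_ms(uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns < start_ns)
        return 0.0;  /* guard against inconsistent timestamps */
    return (double)(end_ns - start_ns) / 1.0e6;  /* ns -> ms */
}
```

Because both timestamps come from the device's own counter, the difference excludes host-side queueing and launch overhead that a wall clock would include.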

Kernel Code for Matrix Multiplication
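// C = A * B, where A has width wA and B (and therefore C) has width wB.
__kernel void matrixMul(__global float* C, __global float* A, __global float* B,
                        int wA, int wB)
{
    int tx = get_global_id(0);  // column of C
    int ty = get_global_id(1);  // row of C
    float value = 0;
    for (int k = 0; k < wA; ++k)
    {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    // C has wB columns, so its row stride is wB (the same as wA only for square matrices).
    C[ty * wB + tx] = value;
}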
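A sequential CPU reference is a common way to check the output of a kernel like this one. The sketch below is not part of the original slides: matrix_mul_ref and the hA parameter (the height of A) are hypothetical, and all matrices are assumed to be stored row-major. The two outer loops play the role of the work-item ids ty and tx.

```c
#include <stddef.h>

/* C (hA x wB) = A (hA x wA) * B (wA x wB), all stored row-major. */
void matrix_mul_ref(float *C, const float *A, const float *B,
                    size_t hA, size_t wA, size_t wB)
{
    for (size_t ty = 0; ty < hA; ++ty) {        /* row of C */
        for (size_t tx = 0; tx < wB; ++tx) {    /* column of C */
            float value = 0.0f;
            for (size_t k = 0; k < wA; ++k)
                value += A[ty * wA + k] * B[k * wB + tx];
            C[ty * wB + tx] = value;            /* C has wB columns */
        }
    }
}
```

Comparing the kernel result against this reference on a non-square input is a quick way to catch indexing mistakes that square matrices hide.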

GPGPU Sim
GPGPU-Sim is a simulator that models different GPU architectures.
GPGPU-Sim consumes mostly unmodified GPGPU source code that is linked to GPGPU-Sim's custom GPGPU runtime library.
The modified runtime library intercepts all GPGPU-specific function calls and emulates their effects.

Architecture Specific Partitioning

Architecture Specific Training Data Set
Training data set: each of the K programs is a row of static program features, labeled with its optimal partition class.
P1: < f1(P1), f2(P1), f3(P1), …, fn(P1) > opt(P1)
P2: < f1(P2), f2(P2), f3(P2), …, fn(P2) > opt(P2)
P3: < f1(P3), f2(P3), f3(P3), …, fn(P3) > opt(P3)
…
Pk: < f1(Pk), f2(Pk), f3(Pk), …, fn(Pk) > opt(Pk)
Static program partitioning in the context of heterogeneous multi-core architectures is identifying how a single program is to be partitioned across the varied processing elements of the architecture such that the program execution time is minimized.

Architecture Specific Partitioning Classifier Model

Execution Time of One Kernel on Different Architecture
Execution Profile Across Different Architectures

Problem with Architecture Specific Partitioning
Classifiers trained on one particular architecture cannot be used for predicting
partition class values of the same program on another architecture and thus must be
trained again.
It is more worthwhile to learn a richer relationship between static program features,
architectural features, and optimal program partitions.

Architecture Aware Partitioning
The ML-based architecture-aware partitioning model gives the optimal partition class value for a given new Open-CL program and an architecture.

Architecture Aware Training Data Set
Each of the k programs appears once for each of the j architectures D1, …, Dj. A row holds the n static program features followed by the m architecture features, labeled with the optimal partition class for that program on that architecture.
k programs on D1:
P1: < f1(P1), f2(P1), …, fn(P1), a1(D1), a2(D1), …, am(D1) > opt(P1,D1)
P2: < f1(P2), f2(P2), …, fn(P2), a1(D1), a2(D1), …, am(D1) > opt(P2,D1)
…
Pk: < f1(Pk), f2(Pk), …, fn(Pk), a1(D1), a2(D1), …, am(D1) > opt(Pk,D1)
…
k programs on Dj:
P1: < f1(P1), f2(P1), …, fn(P1), a1(Dj), a2(Dj), …, am(Dj) > opt(P1,Dj)
P2: < f1(P2), f2(P2), …, fn(P2), a1(Dj), a2(Dj), …, am(Dj) > opt(P2,Dj)
…
Pk: < f1(Pk), f2(Pk), …, fn(Pk), a1(Dj), a2(Dj), …, am(Dj) > opt(Pk,Dj)
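One row of this data set can be sketched in plain C as the concatenation of a program's feature vector with an architecture's feature vector. The function name build_sample and the use of double for feature values are assumptions made for illustration; the slides do not say how the features are stored.

```c
#include <stddef.h>
#include <string.h>

/* Concatenate n static program features and m architecture features into one
   training row. out must have room for n + m values. Returns the row length. */
size_t build_sample(double *out, const double *prog, size_t n,
                    const double *arch, size_t m)
{
    memcpy(out, prog, n * sizeof *prog);      /* f1(P) .. fn(P) */
    memcpy(out + n, arch, m * sizeof *arch);  /* a1(D) .. am(D) */
    return n + m;
}
```

Calling this once for every (program, architecture) pair yields k * j rows, each of which is paired with its opt(P, D) label for training.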

ML-Based Architecture Aware Partitioning Model

Simulation Results
[Figures: simulation results for three inputs: matrix of size 1024*1024 with vector of size 1024; matrix of size 4096*4096 with vector of size 4096; matrix of size 8192*8192 with vector of size 8192.]

Experimental Results
TARGET SYSTEM:
CPU: Intel Xeon E5260
GPU: 8 Architectures (4 Real, 4 Synthetic)
TRAINING DATA:
15 Kernels, 8 Architectures, 2-4 Problem Sizes, 400 training data points
MODEL: Logistic Regression
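How a trained logistic regression model turns a feature row into a partition class can be sketched in plain C. This is an illustrative sketch only: the weight layout, the function name predict_class, and the multinomial (one weight row per class) form are assumptions, since the slides name the model but give no further detail. Because softmax is monotonic, the most probable class is the one with the highest linear score, so the sketch skips the exponentials.

```c
#include <stddef.h>

/* weights is row-major: num_classes rows of (dim + 1) values,
   with the bias stored last in each row (assumed layout). */
size_t predict_class(const double *weights, size_t num_classes,
                     const double *x, size_t dim)
{
    size_t best = 0;
    double best_score = 0.0;
    for (size_t c = 0; c < num_classes; ++c) {
        const double *w = weights + c * (dim + 1);
        double score = w[dim];               /* bias term */
        for (size_t i = 0; i < dim; ++i)
            score += w[i] * x[i];            /* linear score w_c . x */
        if (c == 0 || score > best_score) {
            best = c;
            best_score = score;
        }
    }
    return best;  /* index of the predicted partition class */
}
```

Here x would be a concatenated row of program and architecture features, and the returned index would be mapped back to a partition class value.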

Conclusion & Future Work
We partitioned many PolyBench kernels using the Open-CL API on a heterogeneous platform (CPU and
GPU).
We trained an ML-based architecture-aware classifier model that predicts the optimal
partition class value for a given new Open-CL program and an architecture.
In future, a larger input data set can improve the accuracy of the architecture-aware
classifier model.

References
M. Scarpino, "OpenCL in Action: How to Accelerate Graphics and Computation," Manning, NY, USA, 2012.
D. Grewe and M. F. O'Boyle, "A Static Task Partitioning Approach for Heterogeneous Systems using OpenCL," in International Conference on Compiler Construction, 2011, pp. 286-305.
D. Grewe, Z. Wang, and M. F. O'Boyle, "OpenCL Task Partitioning in the Presence of GPU Contention," in Languages and Compilers for Parallel Computing, 2011, pp. 87-101.
P. Pandit and R. Govindarajan, "Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices," in International Symposium on Code Generation and Optimization, 2014, p. 273.
K.-C. Chen and C.-H. Chen, "An OpenCL runtime system for a heterogeneous many-core virtual platform," in 2014 IEEE International Symposium on Circuits and Systems (ISCAS).
