Architecture Aware Partitioning of OpenCL Programs
1. Architecture Aware Partitioning of OpenCL Programs
SUBMITTED BY:
ANKIT SINGH
ROLL NO.-15IT60R04
SUBMITTED TO:
PROF. SOUMYAJIT DEY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
IIT KHARAGPUR
2. Content:
Objective
Introduction to OpenCL
OpenCL Platform Model
Partitioning of OpenCL Programs across CPU and GPU
GPGPU Sim
Architecture Specific Training
Architecture Aware Training
Architecture Aware Partitioning
Future Work
Conclusion
References
3. Objective
Create an ML-based, architecture-aware (CPU and GPU) partitioning classifier that takes
an OpenCL program and a new architecture as input and predicts the optimal partition
class value for that input.
4. Introduction to OpenCL
OpenCL is a data-parallel programming model for heterogeneous system
architectures, which may include CPUs, GPUs, or other accelerator devices.
Developer: Khronos Group
9. Concept of Work-Group and Work-Item in OpenCL
A kernel is a function executed at each point of the problem domain
(once per work-item).
Example: 4096 work-items, organized as 16 work-groups of 256 work-items each.
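The numbers above can be checked with a small host-side sketch in plain C. It is only an illustration: the names `WorkItemPos`, `num_work_groups`, and `locate_work_item` are hypothetical, and it assumes a 1-D index space with zero global offset and a global size that is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item inside a 1-D index space (hypothetical helper type). */
typedef struct {
    size_t group_id;  /* value get_group_id(0) would report */
    size_t local_id;  /* value get_local_id(0) would report */
} WorkItemPos;

/* Number of work-groups when global_size is a multiple of local_size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Split a global id into (work-group id, local id), assuming zero global offset. */
WorkItemPos locate_work_item(size_t global_id, size_t local_size)
{
    WorkItemPos pos;
    assert(local_size > 0);
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}
```

With 4096 work-items and a work-group size of 256, `num_work_groups` gives 16, and work-item 1000 falls in work-group 3 at local position 232.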
10. Profiling Events in Open-CL
Kernel execution time must be measured accurately in order to compare partitioning
options. A wall-clock timer can approximate the time of a long-running kernel, but it
also counts enqueue and launch overhead, so OpenCL profiling events are the more
accurate method. The steps are:
Create a command queue with profiling enabled
Create an event
Ensure all previously enqueued tasks have finished
Launch the kernel linked to the event
Ensure kernel execution has finished
Get the profiling data
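The steps above can be sketched with the OpenCL host API as follows. This is a minimal sketch, not the code used in this work: the function name `time_kernel_ms` is hypothetical, the context, device, and kernel (with its arguments already set) are assumed to exist, and a 2-D index space is assumed. `clCreateCommandQueue` is the OpenCL 1.x call; OpenCL 2.0 deprecates it in favor of `clCreateCommandQueueWithProperties`.

```c
#include <CL/cl.h>

/* Returns the kernel execution time in milliseconds, or -1.0 on error.
   ctx, dev and kernel are assumed to be created elsewhere. */
double time_kernel_ms(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                      const size_t global[2], const size_t local[2])
{
    cl_int err;
    cl_event evt;
    cl_ulong start = 0, end = 0;

    /* Step 1: command queue with profiling enabled. */
    cl_command_queue q =
        clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);
    if (err != CL_SUCCESS) return -1.0;

    /* Step 3: drain anything already enqueued (a no-op on this fresh queue,
       but required when an existing queue is reused). */
    clFinish(q);

    /* Steps 2 and 4: launch the kernel; the runtime creates the event. */
    err = clEnqueueNDRangeKernel(q, kernel, 2, NULL, global, local,
                                 0, NULL, &evt);
    if (err != CL_SUCCESS) {
        clReleaseCommandQueue(q);
        return -1.0;
    }

    /* Step 5: block until the kernel has finished. */
    clWaitForEvents(1, &evt);

    /* Step 6: device timestamps, reported in nanoseconds. */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    clReleaseEvent(evt);
    clReleaseCommandQueue(q);
    return (double)(end - start) * 1e-6;  /* ns -> ms */
}
```

Because the timestamps come from the device, the result excludes host-side enqueue overhead, which is what makes it suitable for comparing partitions.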
11. Kernel Code for Matrix Multiplication
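__kernel void matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB)
{
    // Each work-item computes one element of C: column tx, row ty.
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    float value = 0;
    // Dot product of row ty of A (width wA) with column tx of B (width wB).
    for (int k = 0; k < wA; ++k)
    {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    // C has wB columns, so its row stride is wB (wA is only correct for square matrices).
    C[ty * wB + tx] = value;
}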
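A host-side reference in plain C is a convenient way to validate the kernel output. The sketch below is an assumption about how such a check could look, not part of the original slides; the name `matrix_mul_ref` and the extra parameter `hA` (number of rows of A) are hypothetical. The two outer loops play the role of the work-item indices `ty` and `tx`.

```c
#include <assert.h>

/* Reference C = A x B on the host.
   A is hA x wA, B is wA x wB, C is hA x wB, all stored row-major. */
void matrix_mul_ref(float *C, const float *A, const float *B,
                    int hA, int wA, int wB)
{
    assert(hA > 0 && wA > 0 && wB > 0);
    for (int ty = 0; ty < hA; ++ty) {        /* row index, like get_global_id(1) */
        for (int tx = 0; tx < wB; ++tx) {    /* column index, like get_global_id(0) */
            float value = 0.0f;
            for (int k = 0; k < wA; ++k)
                value += A[ty * wA + k] * B[k * wB + tx];
            C[ty * wB + tx] = value;         /* row stride of C is wB */
        }
    }
}
```

Comparing this result element by element (within a small floating-point tolerance) against the buffer read back from the device catches indexing mistakes such as a wrong row stride.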
12. GPGPU-Sim
GPGPU-Sim is a simulator that models different GPU architectures.
GPGPU-Sim consumes mostly unmodified GPGPU source code that is linked to GPGPU-Sim's custom GPGPU runtime library.
The modified runtime library intercepts all GPGPU-specific function calls and emulates their effects.
15. Architecture Specific Training Data Set
TRAINING DATA SET (K programs): static program features -> class
P1: < f1(P1), f2(P1), f3(P1), ..., fn(P1) > -> opt(P1)
P2: < f1(P2), f2(P2), f3(P2), ..., fn(P2) > -> opt(P2)
P3: < f1(P3), f2(P3), f3(P3), ..., fn(P3) > -> opt(P3)
...
Pk: < f1(Pk), f2(Pk), f3(Pk), ..., fn(Pk) > -> opt(Pk)
Static program partitioning for heterogeneous multi-core architectures means deciding
how a single program is split across the different processing elements of the
architecture so that the program execution time is minimized.
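One way to picture a training row and its label is the plain C sketch below. Everything named here is hypothetical: `NUM_FEATURES`, `TrainingSample`, `optimal_class`, and `gpu_share` are illustrative, and the class encoding (class c of N sends the fraction c/(N-1) of the work-items to the GPU and the rest to the CPU) is an assumption, since the slides do not define the partition classes.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FEATURES 4  /* hypothetical n: static features per program */

/* One row of the training set: <f1(P), ..., fn(P)> -> opt(P). */
typedef struct {
    double features[NUM_FEATURES];
    int    opt_class;
} TrainingSample;

/* opt(P): the partition class with the smallest measured execution time. */
int optimal_class(const double *exec_time, int num_classes)
{
    int best = 0;
    assert(num_classes > 0);
    for (int c = 1; c < num_classes; ++c)
        if (exec_time[c] < exec_time[best])
            best = c;
    return best;
}

/* Assumed encoding: class cls sends cls/(num_classes-1) of the work-items
   to the GPU; the remaining work-items run on the CPU. */
size_t gpu_share(size_t total_items, int cls, int num_classes)
{
    assert(num_classes > 1 && cls >= 0 && cls < num_classes);
    return total_items * (size_t)cls / (size_t)(num_classes - 1);
}
```

Under this assumed encoding with five classes, class 2 of a 4096 work-item kernel places 2048 work-items on the GPU and 2048 on the CPU.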
17. Execution Time of One Kernel on Different Architectures
Figure: Execution profile across different architectures
18. Problem with Architecture Specific Partitioning
A classifier trained on one particular architecture cannot be used to predict the
partition class of the same program on another architecture, so it must be
retrained for every new architecture.
It is therefore worthwhile to learn a richer relationship between static
program features, architectural features, and optimal program partitions.
19. Architecture Aware Partitioning
The ML-based architecture-aware partitioning model gives the optimal
partition class value for a new OpenCL program and a given architecture.
28. Experimental Results
TARGET SYSTEM:
CPU – Intel Xeon E5260
GPU – 8 Architectures (4 Real, 4 Synthetic)
TRAINING DATA:
15 Kernels, 8 Architectures, 2-4 Problem Sizes (400 training data points)
MODEL: Logistic Regression
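As a rough illustration of how a trained logistic regression model could be applied, the plain C sketch below scores each partition class on the concatenation of program features and architecture features and returns the highest-scoring class. It is a sketch under assumptions, not the model from this work: the function names, the weight layout, and all numbers used with it are hypothetical, and training is not shown. Because softmax is monotone, the class with the largest linear score is also the most probable class, so no exponentials are needed for prediction.

```c
#include <assert.h>

/* Linear score of one class: bias + w . [program features, arch features]. */
static double class_score(const double *w, double bias,
                          const double *prog, int n_prog,
                          const double *arch, int n_arch)
{
    double s = bias;
    for (int i = 0; i < n_prog; ++i) s += w[i] * prog[i];
    for (int j = 0; j < n_arch; ++j) s += w[n_prog + j] * arch[j];
    return s;
}

/* Predict the partition class for (program, architecture).
   weights is row-major: num_classes rows of (n_prog + n_arch) values. */
int predict_partition_class(const double *weights, const double *bias,
                            int num_classes,
                            const double *prog, int n_prog,
                            const double *arch, int n_arch)
{
    int n = n_prog + n_arch;
    int best = 0;
    double best_score;
    assert(num_classes > 0 && n_prog >= 0 && n_arch >= 0);
    best_score = class_score(weights, bias[0], prog, n_prog, arch, n_arch);
    for (int c = 1; c < num_classes; ++c) {
        double s = class_score(weights + c * n, bias[c],
                               prog, n_prog, arch, n_arch);
        if (s > best_score) {
            best_score = s;
            best = c;
        }
    }
    return best;
}
```

The same program features can map to different classes when only the architecture features change, which is the behavior an architecture-specific classifier cannot provide.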
29. Conclusion & Future Work
We partitioned many PolyBench kernels using the OpenCL API on a heterogeneous platform (CPU and
GPU).
We trained an ML-based architecture-aware classifier model that predicts the optimal
partition class value for a new OpenCL program and a given architecture.
In future, a larger training data set can be used to improve the accuracy of the architecture-aware
classifier model.
30. References
M. Scarpino, OpenCL in Action: How to Accelerate Graphics and Computation, Manning, NY, USA, 2012.
D. Grewe and M. F. O'Boyle, "A Static Task Partitioning Approach for Heterogeneous Systems using OpenCL," in International Conference on Compiler Construction, 2011, pp. 286-305.
D. Grewe, Z. Wang, and M. F. O'Boyle, "OpenCL Task Partitioning in the Presence of GPU Contention," in Languages and Compilers for Parallel Computing, 2011, pp. 87-101.
P. Pandit and R. Govindarajan, "Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices," in International Symposium on Code Generation and Optimization, 2014, p. 273.
K.-C. Chen and C.-H. Chen, "An OpenCL runtime system for a heterogeneous many-core virtual platform," in 2014 IEEE International Symposium on Circuits and Systems (ISCAS).