Architecture Aware Partitioning
of Open-CL Programs
SUBMITTED BY:
ANKIT SINGH
ROLL NO.-15IT60R04
SUBMITTED TO:
PROF. SOUMYAJIT DEY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
IIT KHARAGPUR
Content:
- Objective
- Introduction to Open-CL
- Open-CL Platform Model
- Partitioning of Open-CL program in CPU-GPU
- GPGPU Sim
- Architecture Specific Training
- Architecture Aware Training
- Architecture Aware Partitioning
- Future Work
- Conclusion
- References
Objective
Create an ML-based, architecture-aware (CPU and GPU) partitioning classifier that takes an
OpenCL program and a new architecture as input and generates the optimal partition class
value for that input.
Introduction to Open-CL
OpenCL is a data-parallel programming model for heterogeneous systems, which may
include CPUs, GPUs, or other accelerator devices.
Developer: Khronos Group
Partitioning in CPU-GPU (Matrix-Matrix multiplication example)
[Figure: matrix A is split into two parts, 20% and 80% of its data. Each part is multiplied with the whole of matrix B, one part on the CPU and the other on the GPU.]
Open-CL Platform Model
GPGPU Programming Model
Host Programming
Concept of Work Group and Work Item in Open-CL
- A kernel is a function executed at each point of the problem domain (once per work-item).
- Number of work-items: 4096 (16 work-groups of 256 work-items each).
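Profiling Events in Open-CL
Kernel execution time has to be measured accurately in order to validate the options being compared. For a very long-running kernel a wall-clock timer may be adequate, but it is not the right tool, because it also counts host-side overhead. The steps to measure kernel execution time accurately are:
1. Create the command queue with profiling enabled (CL_QUEUE_PROFILING_ENABLE).
2. Create an event (cl_event).
3. Ensure that all previously enqueued tasks have executed (clFinish).
4. Launch the kernel linked to the event.
5. Ensure that kernel execution has finished (for example, clWaitForEvents).
6. Get the profiling data (clGetEventProfilingInfo).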
Kernel Code for Matrix Multiplication
__kernel void matrixMul(__global float* C, __global float* A, __global float* B, int wA, int wB)
{
    // Each work-item computes one element of C: column tx, row ty.
    int tx = get_global_id(0);
    int ty = get_global_id(1);
    float value = 0;
    // Dot product of row ty of A with column tx of B.
    for (int k = 0; k < wA; ++k)
    {
        float elementA = A[ty * wA + k];
        float elementB = B[k * wB + tx];
        value += elementA * elementB;
    }
    // C has wB columns, so its row stride is wB (not wA).
    C[ty * wB + tx] = value;
}
GPGPU Sim
- GPGPU-Sim is a simulator that models different GPU architectures.
- GPGPU-Sim consumes mostly unmodified GPGPU source code that is linked to GPGPU-Sim's custom GPGPU runtime library.
- The modified runtime library intercepts all GPGPU-specific function calls and emulates their effects.
Architecture Specific Partitioning
Static Program Features:
Architecture Specific Training Data Set
Training data set (k programs):
Program   Static program features                    Class
P1        < f1(P1), f2(P1), f3(P1), ..., fn(P1) >    opt(P1)
P2        < f1(P2), f2(P2), f3(P2), ..., fn(P2) >    opt(P2)
P3        < f1(P3), f2(P3), f3(P3), ..., fn(P3) >    opt(P3)
...
Pk        < f1(Pk), f2(Pk), f3(Pk), ..., fn(Pk) >    opt(Pk)

Static program partitioning, in the context of heterogeneous multi-core architectures, means identifying how a single program is to be partitioned across the varied processing elements of the architecture so that program execution time is minimized.
Architecture Specific Partitioning Classifier Model
Execution Time of One Kernel on Different Architectures
Execution Profile Across Different Architectures
Problem with Architecture Specific Partitioning
A classifier trained on one particular architecture cannot be used to predict partition class values for the same program on another architecture, so it must be trained again for each new architecture.
It is therefore worthwhile to learn a richer relationship among static program features, architectural features, and optimal program partitions.
Architecture Aware Partitioning
The ML-based architecture-aware partitioning model gives the optimal partition class value for a given new OpenCL program and architecture.
GPU Architecture Features:
Architecture Aware Training Data Set
Each row combines the n static program features of a program with the m architecture features of a device and is labeled with the optimal partition class for that pair. The data set covers k programs on each of j architectures D1, ..., Dj.

k programs on D1:
P1: < f1(P1), f2(P1), ..., fn(P1), a1(D1), a2(D1), ..., am(D1) >  opt(P1,D1)
P2: < f1(P2), f2(P2), ..., fn(P2), a1(D1), a2(D1), ..., am(D1) >  opt(P2,D1)
...
Pk: < f1(Pk), f2(Pk), ..., fn(Pk), a1(D1), a2(D1), ..., am(D1) >  opt(Pk,D1)

...

k programs on Dj:
P1: < f1(P1), f2(P1), ..., fn(P1), a1(Dj), a2(Dj), ..., am(Dj) >  opt(P1,Dj)
P2: < f1(P2), f2(P2), ..., fn(P2), a1(Dj), a2(Dj), ..., am(Dj) >  opt(P2,Dj)
...
Pk: < f1(Pk), f2(Pk), ..., fn(Pk), a1(Dj), a2(Dj), ..., am(Dj) >  opt(Pk,Dj)
ML-Based Architecture Aware Partitioning Model
Simulation Results
[Figures: simulation result plots on four slides, two panels per slide, labeled by input size.]
- Matrix 1024*1024 with vector 1024; matrix 4096*4096 with vector 4096
- Matrix 4096*4096 with vector 4096; matrix 8192*8192 with vector 8192
- Matrix 4096*4096 with vector 4096; matrix 2048*2048 with vector 2048
- Matrix 1024*1024 with vector 1024; matrix 1024*1024 with vector 1024
Evaluation Methodology
Target system:
- CPU: Intel Xeon E5260
- GPU: 8 architectures (4 real, 4 synthetic)
Training data: 15 kernels, 8 architectures, 2-4 problem sizes; 400 training data points
Model: logistic regression
Experimental Results
Conclusion & Future Work
- We partitioned many PolyBench kernels using the OpenCL API on a heterogeneous platform (CPU and GPU).
- We trained an ML-based architecture-aware classifier model that predicts the optimal partition class value for a given new OpenCL program and architecture.
- In future work, a larger input data set can be used to improve the accuracy of the architecture-aware classifier model.
References
- M. Scarpino, "OpenCL in Action: How to Accelerate Graphics and Computation," NY, USA: Manning, 2012.
- D. Grewe and M. F. O'Boyle, "A Static Task Partitioning Approach for Heterogeneous Systems using OpenCL," in International Conference on Compiler Construction, 2011, pp. 286-305.
- D. Grewe, Z. Wang, and M. F. O'Boyle, "OpenCL Task Partitioning in the Presence of GPU Contention," in Languages and Compilers for Parallel Computing, 2011, pp. 87-101.
- P. Pandit and R. Govindarajan, "Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices," in International Symposium on Code Generation and Optimization, 2014, p. 273.
- K.-C. Chen and C.-H. Chen, "An OpenCL runtime system for a heterogeneous many-core virtual platform," in 2014 IEEE International Symposium on Circuits and Systems (ISCAS).
Thanking You
Queries??
