HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

Published in: Technology

Transcript

  • 1. Efficient Scheduling of OpenMP and OpenCL Workloads: Getting the most out of your APU
  • 2. Objective
      • software has a long life-span that exceeds the life-span of hardware
      • software is very expensive to write and maintain
      • next-generation hardware also needs to run legacy software
      • Example: IWAVE
          • procedural C code
          • no object orientation
          • tight integration between data structures and functions
      • What do I mean by efficient scheduling?
          • find ways to utilize GPU cores for code blocks
          • find ways to utilize all CPU cores and GPU units at the same time
      | OpenCL and OpenMP Workloads on Accelerated Processing Units |
  • 3. Historical Context: GPU Compute Timeline
      [Figure: timeline with the marked years 2002, 2008, 2010, and 2012 and the labels CUDA, Aparapi, and AMP C++]
  • 4. Accelerator Challenges: Technology Accessibility and Performance
      [Figure: chart of Performance versus Ease-of-Use, placing OpenCL & CUDA, CPU Multithread, and CPU Single Thread]
  • 5. APU Opportunities: One Die - Two Computational Devices
      Metric                 CPU                     APU
      Memory Size            large                   small
      Memory Bandwidth       small                   large
      Parallelism            small                   large
      General Purpose        yes                     no
      Performance            application dependent   application dependent
      Performance-per-Watt   application dependent   application dependent
      Programming            Traditional             OpenCL
  • 6. APU Opportunities: Performance and Performance-per-Watt
      • Example: Luxmark OpenCL Benchmark
      • Similar CPU and GPU performance
      • Best performance by using the APU
      • GPU has best performance-per-Watt
      • APU provides outstanding value
      Metric               CPU    GPU    APU
      Performance [Pts]    170    197    316
      Power [W]             50     37     58
      PPW [Pts/W]          3.4    5.3    5.4
      Combined [Pts²/W]    578   1049   1722
      Test setup: Luxmark OpenCL Benchmark, Ubuntu 12.10 x86_64, 4 Piledriver CPU cores @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
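The two derived rows on the Luxmark slide can be recomputed from the performance and power figures. The sketch below assumes that "PPW" is points divided by Watts and that the combined metric (Pts²/W) is performance multiplied by performance-per-Watt; that reading reproduces the slide's numbers, but the function names are hypothetical and not taken from the presentation.

```c
#include <assert.h>

/* Performance per Watt: benchmark points divided by power draw in Watts. */
double perf_per_watt(double points, double watts) {
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric in Pts^2/W (assumed reading): performance times performance-per-Watt. */
double combined_metric(double points, double watts) {
    return points * perf_per_watt(points, watts);
}
```

For the slide's first column of figures (170 Pts at 50 W), `perf_per_watt` gives 3.4 Pts/W and `combined_metric` gives 578 Pts²/W.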
  • 7. Example: Luxmark Renderer, Performance and Performance-per-Watt
      [Figure: chart with the annotations +64% and +81%]
      Test setup: Luxmark OpenCL Benchmark, rendering the “Sala” scene, Ubuntu 12.10 x86_64, 4 Piledriver cores @ 2.5GHz, 6 GPU CUs @ 720MHz, 16GB DDR3 1600MHz
  • 8. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • Know the problem you are trying to solve.
          • staggered rectangular grid in 3D
          • coupled first-order PDE
          • scalar pressure field p
          • vector velocity field v = {vx, vy, vz}
          • source term g
  • 9. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
        sgn_ts3d_210_p012_OpenMP(dom, pars);       // calculate pressure field
        sgn_ts3d_210_v0_OpenMP(dom, pars);         // calculate velocity x-axis
        sgn_ts3d_210_v1_OpenMP(dom, pars);         // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);         // calculate velocity z-axis
        …
      }
      [Figure: timeline of the OpenMP p, OpenMP vx, OpenMP vy, and OpenMP vz blocks over time]
  • 10. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • Measure the initial performance.
          • pressure and velocity fields simulated using OpenMP
          • average time T [ms] per iteration
          • OpenMP scales linearly with the number of threads
  • 11. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • find computational blocks
      • understand dependencies between blocks
      • identify sequential and parallel parts
      [Figure: causality diagram and timeline of the OpenMP p, OpenMP vx, OpenMP vy, and OpenMP vz blocks]
  • 12. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
        sgn_ts3d_210_p012_OpenMP(dom, pars);       // calculate pressure field p
        sgn_ts3d_210_v0_OpenCL(dom, pars);         // calculate velocity x-axis
        sgn_ts3d_210_v1_OpenMP(dom, pars);         // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenMP(dom, pars);         // calculate velocity z-axis
        …
      }
      [Figure: timeline of OpenMP p, OpenCL vx (with the CPU marked IDLE), OpenMP vy, and OpenMP vz]
  • 13. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • use the GPU to compute vx
      • the CPU is idle while the GPU is running
      • 42% improvement for 1 thread
      • 25% improvement for 2 threads
      • 9% improvement for 4 threads
  • 14. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                             // main simulation loop
        sgn_ts3d_210_p012_OpenMP(dom, pars);                 // calculate pressure field p

        int num_threads = atoi(getenv("OMP_NUM_THREADS"));   // save the current number of OpenMP threads
        omp_set_num_threads(2);                              // restrict the number of OpenMP threads to 2
        omp_set_nested(1);                                   // allow nested OpenMP threads
        #pragma omp parallel shared(…) private(…)            // start 2 OpenMP threads
        {
          switch ( omp_get_thread_num() ) {
            case 0:
              sgn_ts3d_210_v0_OpenCL(dom, pars);             // calculate velocity x-axis using OpenCL
              break;
            case 1:
              omp_set_num_threads(num_threads);              // increase number of OpenMP threads back
              sgn_ts3d_210_v1_OpenMP(dom, pars);             // calculate velocity y-axis
              sgn_ts3d_210_v2_OpenMP(dom, pars);             // calculate velocity z-axis
              break;
            default:
              break;
          }
        }                                                    // close OpenMP pragma
      }                                                      // close simulation while
      [Figure: timeline of OpenMP p, followed by OpenCL vx overlapping with OpenMP vy and OpenMP vz]
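The overlap idea on this slide can be tried without a GPU. The following is a minimal sketch, not the presenter's code: it swaps the slide's `switch` on `omp_get_thread_num()` for OpenMP `sections`, and `update_component`, `step_velocities`, and `demo_step` are hypothetical stand-ins for the IWAVE routines. The first section plays the role of the OpenCL vx call; the second runs vy and vz back to back. Each section writes only its own array, so the two can safely run at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one velocity update: v[i] += scale * p[i]. */
static void update_component(double *v, const double *p, size_t n, double scale) {
    for (size_t i = 0; i < n; ++i) {
        v[i] += scale * p[i];
    }
}

/* Runs the vx update concurrently with the vy and vz updates. */
void step_velocities(double *vx, double *vy, double *vz, const double *p, size_t n) {
    assert(vx != vy && vy != vz && vx != vz);   /* outputs must not alias */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            update_component(vx, p, n, 1.0);    /* stands in for the OpenCL vx call */
        }
        #pragma omp section
        {
            update_component(vy, p, n, 2.0);    /* stands in for the OpenMP vy call */
            update_component(vz, p, n, 3.0);    /* stands in for the OpenMP vz call */
        }
    }
}

/* Small driver: returns vx[0] + vy[0] + vz[0] after one step on a field of ones. */
double demo_step(void) {
    enum { N = 8 };
    double p[N], vx[N] = {0}, vy[N] = {0}, vz[N] = {0};
    for (int i = 0; i < N; ++i) {
        p[i] = 1.0;
    }
    step_velocities(vx, vy, vz, p, N);
    return vx[0] + vy[0] + vz[0];
}
```

The sections only run concurrently when the compiler has OpenMP enabled (for example `-fopenmp` with GCC); without it the pragmas are ignored and the code runs sequentially with the same result.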
  • 15. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • overlap vx and vy
      • the CPU is no longer idle
      • 50% improvement for 1 thread
      • 40% improvement for 2 threads
      • 38% improvement for 4 threads
  • 16. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
        sgn_ts3d_210_p012_OpenCL(dom, pars);       // calculate pressure field
        sgn_ts3d_210_v0_OpenCL(dom, pars);         // calculate velocity x-axis
        sgn_ts3d_210_v1_OpenCL(dom, pars);         // calculate velocity y-axis
        sgn_ts3d_210_v2_OpenCL(dom, pars);         // calculate velocity z-axis
        …
      }

      bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
        …
        clEnqueueWriteBuffer(queue, buffer, …);                  // copy data from host to device
        clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel on device
        clEnqueueReadBuffer(queue, buffer, …);                   // copy data from device to host
        …
      }
      [Figure: timeline of OpenCL p, OpenCL vx, OpenCL vy, and OpenCL vz]
  • 17. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • understand where performance gets lost
      • 98% of time spent on I/O
      • 2% of time spent on compute
      • reduce I/O
      Stage               Time
      OpenCL Upload       188 ms
      Kernel Execution      4 ms
      OpenCL Download      54 ms
      [Figure: timeline of OpenMP p, OpenCL vx, OpenMP vy, and OpenMP vz]
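The 98% and 2% figures follow from the three timings on this slide (188 ms upload, 4 ms kernel execution, 54 ms download). A short worked calculation, assuming upload and download together make up the I/O time:

```latex
\[
F_{\mathrm{I/O}} = \frac{188 + 54}{188 + 4 + 54} = \frac{242}{246} \approx 0.98,
\qquad
F_{\mathrm{compute}} = \frac{4}{246} \approx 0.02
\]
```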
  • 18. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • How does the speedup of an OpenCL application ($S_{OpenCL}$) depend on the speedup of the OpenCL kernel ($S_{Kernel}$) when the OpenCL I/O time is fixed?
      • Fraction of OpenCL I/O time: $F_{I/O}$
      • $S_{OpenCL} = \dfrac{S_{Kernel}}{(S_{Kernel} - 1)\,F_{I/O} + 1}$
      • 50% I/O time limits the maximal possible speedup to 2
      • Minimize OpenCL I/O, only then increase OpenCL kernel performance
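The slide's formula, read as S_OpenCL = S_Kernel / ((S_Kernel - 1) · F_I/O + 1), is easy to evaluate numerically. The sketch below is one way to do so; the function name `opencl_speedup` is hypothetical, and F_I/O is assumed to be a fraction between 0 and 1.

```c
#include <assert.h>

/* Application speedup for a kernel speedup s_kernel when a fraction f_io
 * of the runtime is fixed I/O that the faster kernel does not shorten. */
double opencl_speedup(double s_kernel, double f_io) {
    assert(s_kernel > 0.0);
    assert(f_io >= 0.0 && f_io <= 1.0);
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}
```

With `f_io = 0.5` the result stays below 2 however large `s_kernel` becomes, which is the limit quoted on the slide; with `f_io = 0` the application speedup equals the kernel speedup.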
  • 19. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
        sgn_ts3d_210_ALL_OpenCL(dom, pars);        // combine all OpenCL calculations
        …
      }

      bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
        …
        clEnqueueWriteBuffer(queue, buffer, …);                    // copy data from host to device

        while(…) {
          clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel for pressure
          clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);       // execute OpenCL kernel for velocity x
          clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);       // execute OpenCL kernel for velocity y
          clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);       // execute OpenCL kernel for velocity z
        }
        clEnqueueReadBuffer(queue, buffer, …);                     // copy data from device to host
        …
      }
      [Figure: timeline of OpenCL p, OpenCL vx, OpenCL vy, and OpenCL vz]
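Why hoisting the transfers out of the loop matters can be seen with a simple cost model. This is an illustrative sketch under stated assumptions, not a measurement: every kernel launch is assumed to cost the same, transfers and kernels are assumed not to overlap, and both function names are hypothetical.

```c
#include <assert.h>

/* Cost when every kernel launch is wrapped in its own upload and download. */
double time_transfer_every_launch(int iterations, int kernels_per_iter,
                                  double t_upload, double t_kernel, double t_download) {
    assert(iterations >= 0 && kernels_per_iter >= 0);
    return (double)iterations * kernels_per_iter * (t_upload + t_kernel + t_download);
}

/* Cost when data is uploaded once, stays on the device, and is downloaded once. */
double time_data_resident(int iterations, int kernels_per_iter,
                          double t_upload, double t_kernel, double t_download) {
    assert(iterations >= 0 && kernels_per_iter >= 0);
    return t_upload + (double)iterations * kernels_per_iter * t_kernel + t_download;
}
```

With made-up inputs of 10 iterations, 4 kernels per iteration, and 5, 1, and 2 time units for upload, kernel, and download, the first model gives 320 units and the second 47.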
  • 20. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • eliminate all but essential I/O
      • significant speedup over simple OpenCL
  • 21. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • measure real application performance
      • 3000 iterations using a 97x405x389 simulation grid
      • 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads
      [Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; vertical axis from 0 to 14]
  • 22. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • initial OpenCL performance measurements
      • 89 algorithms tested for an image size of 4MP
      • compare OpenCL I/O and execution time
      • 28% of all algorithms are compute bound
      • 72% of all algorithms are I/O bound
      Test setup: OpenCV Computer Vision Library Performance Tests v2.4, Ubuntu 12.10 x86_64, 1 Piledriver CPU core @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
  • 23. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • compare OpenCL and single-threaded performance
      • 89 algorithms tested for an image size of 4MP
      • realistic timing that includes I/O over PCIe
      • 59% of all algorithms execute faster on the GPU
      • 41% of all algorithms execute faster on the CPU(1)
      • significant speedup for only 15% of all algorithms
      Test setup: OpenCV Computer Vision Library Performance Tests v2.4, Ubuntu 12.10 x86_64, 1 Piledriver CPU core @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
  • 24. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • Task: batch process a large number of images using a single algorithm.
      • OpenCL performance depends on the algorithm and the image size
      • Either the CPU or the GPU processes the data, but not both
      • How to choose which algorithm and device to use depending on image size?
  • 25. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
  • 26. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • Better: create an input image queue that the CPU and GPU both query for new image tasks until the queue is empty.
      • all CPU cores are fully utilized at all times, even for single-threaded algorithms
      • all GPU compute units are fully utilized at all times
      • combined performance for a single algorithm is the sum of GPU and CPU performance for that algorithm: $P_i^{APU} = P_i^{CPU} + P_i^{GPU}$
      • combined performance for multiple algorithms is better than the sum of device performance: $P = \dfrac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$
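The shared-queue idea can be illustrated with a small deterministic simulation. This is a sketch under assumptions, not the presenter's implementation: each device is assumed to need a fixed time per image (`t_cpu`, `t_gpu`), whichever device becomes free first takes the next image, and ties go to the CPU. The type `QueueResult` and the function `drain_queue` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int cpu_items;    /* images processed by the CPU */
    int gpu_items;    /* images processed by the GPU */
    double makespan;  /* time at which the last image finishes */
} QueueResult;

/* Simulates two devices pulling images from one shared queue until it is empty. */
QueueResult drain_queue(int n_items, double t_cpu, double t_gpu) {
    assert(n_items >= 0 && t_cpu > 0.0 && t_gpu > 0.0);
    QueueResult r = {0, 0, 0.0};
    double cpu_free = 0.0;  /* time at which the CPU is next idle */
    double gpu_free = 0.0;  /* time at which the GPU is next idle */
    for (int i = 0; i < n_items; ++i) {
        if (cpu_free <= gpu_free) {   /* CPU is idle first (or tie): it takes the image */
            cpu_free += t_cpu;
            r.cpu_items++;
        } else {                      /* GPU is idle first: it takes the image */
            gpu_free += t_gpu;
            r.gpu_items++;
        }
    }
    r.makespan = cpu_free > gpu_free ? cpu_free : gpu_free;
    return r;
}
```

For 8 images with `t_cpu = 3` and `t_gpu = 1`, the CPU ends up with 2 images, the GPU with 6, and the queue is drained at time 6. That is 8/6 images per time unit, the sum of the individual rates 1/3 and 1, in line with the single-algorithm claim on the slide.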
  • 27. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
  • 28. Programming Strategies: Summary
      • next-generation hardware and legacy code require compromises
      • OpenCL performance is tied to Amdahl’s Law regarding OpenCL I/O and OpenCL execution time
      • application performance can be increased by overlapping OpenCL and OpenMP workloads
      • removing all but necessary OpenCL I/O can have a dramatic influence on performance
      • for loosely coupled high-throughput applications, the OpenCL and OpenMP performance add up for single algorithms
      • for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
      • APUs may provide greatest performance per Watt
      • GPUs may provide greatest performance
  • 29. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
 The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
 AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
 AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
 ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
