HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

Efficient Scheduling of OpenMP and OpenCL Workloads
Getting the most out of your APU
Objective
! software has a long life span that exceeds the life span of hardware
! software is very expensive to write and maintain
! next-generation hardware also needs to run legacy software
! Example: IWAVE
! procedural C-code
! no object orientation
! tight integration between data structures and functions
! What do I mean by efficient scheduling?
! find ways to utilize GPU cores for code blocks
! find ways to utilize all CPU cores and GPU units at the same time


| OpenCL and OpenMP Workloads on Accelerated Processing Units |
Historical Context
GPU Compute Timeline

[Figure: timeline with marks at 2002, 2008, 2010 and 2012, labelled Aparapi, CUDA and AMP C++]

Accelerator Challenges
Technology Accessibility and Performance

[Figure: Performance versus Ease-of-Use for OpenCL & CUDA, CPU Multithread and CPU Single Thread]

APU Opportunities
One Die - Two Computational Devices

Metric                CPU                    APU
Memory Size           large                  small
Memory Bandwidth      small                  large
Parallelism           small                  large
General Purpose       yes                    no
Programming           Traditional            OpenCL
Performance           application dependent  application dependent
Performance-per-Watt  application dependent  application dependent

APU Opportunities

Performance and Performance-per-Watt
! Example: Luxmark OpenCL Benchmark

! GPU has best performance-per-Watt
! Best performance by using the APU
! Similar CPU and GPU performance
! APU provides outstanding value

Metric             CPU    GPU    APU
Performance [Pts]  170    197    316
Power [W]           50     37     58
PPW [Pts/W]        3.4    5.3    5.4
Combined [Pts²/W]  578   1049   1722

Luxmark OpenCL Benchmark
Ubuntu 12.10 x86_64
4 Piledriver CPU cores @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz

Example: Luxmark Renderer

Performance and Performance-per-Watt

[Figure: bar chart annotated with +64% and +81%]


Luxmark OpenCL Benchmark
Render “Sala” Scene
Ubuntu 12.10 x86_64
4 Piledriver cores @ 2.5GHz
6 GPU CUs @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Know the problem you are trying to solve.
! staggered rectangular grid in 3D
! coupled first order PDE
! scalar pressure field p
! vector velocity field v = {vx, vy, vz}
! source term g
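
The bullets above name the fields but not the equations that couple them. A common way to write the acoustic system in first-order pressure-velocity form is sketched below. The bulk modulus \kappa and the density \rho are assumed symbols that do not appear on the slide, and IWAVE's exact scaling of the source term may differ.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa \, \nabla \cdot \mathbf{v} + g \\
\rho \, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p, \qquad \mathbf{v} = (v_x, v_y, v_z)
\end{aligned}
```

In this form the pressure update reads all three velocity components, while each velocity component reads only the pressure. That is the property the later slides rely on when they run the velocity updates independently of each other.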

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field
    sgn_ts3d_210_v0_OpenMP(dom, pars);      // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
    …
}

[Figure: OpenMP timeline showing OpenMP p, OpenMP vx, OpenMP vy and OpenMP vz along the time axis]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! Measure the initial performance.
! pressure and velocity field simulated using OpenMP
! average time T[ms] per iteration
! OpenMP linear scaling with threads
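
A baseline like the average time T[ms] per iteration can be collected with a small wall-clock harness. The sketch below is one possible way to do it, not the measurement code used for the talk. The callback type `step_fn`, the helper `now_ms` and the stand-in workload `count_step` are hypothetical names introduced for illustration, and C11 `timespec_get` is assumed to be available.

```c
#include <time.h>

/* Hypothetical callback type: runs one simulation iteration. */
typedef void (*step_fn)(void *state);

/* Wall-clock time in milliseconds, using C11 timespec_get. */
static double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Average wall-clock time T[ms] per iteration of `step`. */
double average_ms_per_iteration(step_fn step, void *state, int iterations) {
    if (iterations <= 0) return 0.0;
    double start = now_ms();
    for (int i = 0; i < iterations; ++i) step(state);
    return (now_ms() - start) / iterations;
}

/* Stand-in workload for illustration: counts how often it was called. */
void count_step(void *state) { ++*(int *)state; }
```

Wall-clock time is used rather than `clock()`, because `clock()` reports processor time summed over all threads and would hide any gain from adding OpenMP threads.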

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! find computational blocks
! understand dependencies between blocks
! identify sequential and parallel parts

[Figure: causality diagram relating OpenMP p, OpenMP vx, OpenMP vy and OpenMP vz, with an OpenMP timeline of the same four blocks]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);    // calculate pressure field p
    sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenMP(dom, pars);      // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenMP(dom, pars);      // calculate velocity z-axis
    …
}

[Figure: timeline with an OpenCL row (vx) and an OpenMP row (p, IDLE, vy, vz)]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! use the GPU to compute vx
! the CPU is idle while the GPU is running
! 42% improvement for 1 thread
! 25% improvement for 2 threads
! 9% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                                // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);                  // calculate pressure field p

    // save the current number of OpenMP threads (assumes OMP_NUM_THREADS is set)
    int num_threads = atoi(getenv("OMP_NUM_THREADS"));
    omp_set_num_threads(2);                               // restrict the number of OpenMP threads to 2
    omp_set_nested(1);                                    // allow nested OpenMP threads

    // start 2 OpenMP threads
    #pragma omp parallel shared(…) private(…)
    {
        switch ( omp_get_thread_num() ) {
        case 0:
            sgn_ts3d_210_v0_OpenCL(dom, pars);            // calculate velocity x-axis using OpenCL
            break;
        case 1:
            omp_set_num_threads(num_threads);             // increase number of OpenMP threads back
            sgn_ts3d_210_v1_OpenMP(dom, pars);            // calculate velocity y-axis
            sgn_ts3d_210_v2_OpenMP(dom, pars);            // calculate velocity z-axis
            break;
        default:
            break;
        }
    }                                                     // close OpenMP pragma
    …
}                                                         // close simulation while

[Figure: timeline with an OpenCL row (v) and an OpenMP row (p, vy, vz)]
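
The same overlap can also be expressed with OpenMP `sections` instead of a `switch` on the thread number. The sketch below is an alternative formulation, not the code from the talk. `compute_vx_device` and `compute_v_host` are hypothetical stand-ins that run on the CPU so the example stays self-contained; in a real port the first would enqueue the OpenCL work and wait for it.

```c
#include <stddef.h>

#define N 1024

/* Hypothetical stand-in for the OpenCL offload of vx. */
static void compute_vx_device(float *vx, const float *p) {
    for (size_t i = 0; i < N; ++i) vx[i] = 2.0f * p[i];
}

/* Hypothetical stand-in for an OpenMP velocity update. */
static void compute_v_host(float *v, const float *p, float scale) {
    #pragma omp parallel for
    for (long i = 0; i < N; ++i) v[i] = scale * p[i];
}

/* One velocity step: vx in one section, vy and vz in the other. */
void velocity_step(const float *p, float *vx, float *vy, float *vz) {
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        compute_vx_device(vx, p);          /* this thread drives the accelerator */

        #pragma omp section
        {
            compute_v_host(vy, p, 3.0f);   /* this thread runs the CPU work */
            compute_v_host(vz, p, 4.0f);
        }
    }   /* implicit barrier: all three fields are complete here */
}
```

The three outputs are written to separate arrays, so the two sections need no locking. As on the slide, nested parallelism has to be enabled for the inner `parallel for` to use more than one thread. Built without OpenMP support, the pragmas are ignored and the code runs sequentially with the same result.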

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! overlap vx and vy
! CPU not idle anymore
! 50% improvement for 1 thread
! 40% improvement for 2 threads
! 38% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                  // main simulation loop
    sgn_ts3d_210_p012_OpenCL(dom, pars);    // calculate pressure field
    sgn_ts3d_210_v0_OpenCL(dom, pars);      // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenCL(dom, pars);      // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenCL(dom, pars);      // calculate velocity z-axis
    …
}

bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                 // copy data from host to device
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel on device
    clEnqueueReadBuffer(queue, buffer, …);                  // copy data from device to host
    …
}

[Figure: OpenCL timeline showing OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz along the time axis]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! understand where performance gets lost
! 98% of time spent on I/O
! 2% of time spent on compute
! reduce I/O

OpenCL Upload     188ms
Kernel Execution    4ms
OpenCL Download    54ms

[Figure: timeline with an OpenCL row (vx) and an OpenMP row (p, vy, vz)]

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! How does the speedup of an OpenCL application ($S_{OpenCL}$) depend on the speedup of the OpenCL kernel ($S_{Kernel}$) when the OpenCL I/O time is fixed?
! Fraction of OpenCL I/O time: $F_{I/O}$
! An I/O fraction of 50% limits the maximum possible speedup to 2
! Minimize OpenCL I/O first; only then increase OpenCL kernel performance

$$S_{OpenCL} = \frac{S_{Kernel}}{(S_{Kernel} - 1)\, F_{I/O} + 1}$$
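
The speedup relation on this slide is easy to evaluate numerically. The sketch below is a direct transcription of it, plus a helper that derives the I/O fraction from upload, kernel and download times. The function names `io_fraction` and `opencl_speedup` are introduced here for illustration.

```c
/* Fraction of the total OpenCL time spent on I/O (upload + download). */
double io_fraction(double upload_ms, double kernel_ms, double download_ms) {
    double io = upload_ms + download_ms;
    return io / (io + kernel_ms);
}

/* Application speedup when only the kernel becomes s_kernel times faster
 * and the I/O fraction f_io (measured before the speedup) stays fixed. */
double opencl_speedup(double s_kernel, double f_io) {
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}
```

With `f_io = 0.5` the result approaches 2 as `s_kernel` grows, which matches the limit stated above. Using the timings reported earlier in the deck (188 ms upload, 4 ms kernel, 54 ms download), `io_fraction` gives roughly 0.98, so even an arbitrarily fast kernel would improve that call by less than 2%.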

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                  // main simulation loop
    sgn_ts3d_210_ALL_OpenCL(dom, pars);     // combine all OpenCL calculations
    …
}

bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                     // copy data from host to device
    while(…) {
        clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);    // execute OpenCL kernel for pressure
        clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);      // execute OpenCL kernel for velocity x
        clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);      // execute OpenCL kernel for velocity y
        clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);      // execute OpenCL kernel for velocity z
    }
    clEnqueueReadBuffer(queue, buffer, …);                      // copy data from device to host
    …
}

[Figure: OpenCL timeline showing OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz along the time axis]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! eliminate all but essential I/O
! significant speedup over simple OpenCL
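
A rough cost model shows why removing the per-call transfers matters so much. The sketch below compares paying upload and download on every kernel call with uploading once, keeping the data on the device, and downloading once. It is an illustration under loud assumptions, not a measurement: every kernel is assumed to cost the same, no intermediate results are read back, and the function names are hypothetical.

```c
/* Total time if every kernel call uploads, runs and downloads. */
double per_call_io_ms(long iterations, int kernels,
                      double up, double run, double down) {
    return (double)iterations * kernels * (up + run + down);
}

/* Total time if data is uploaded once, stays resident on the device
 * for all iterations, and is downloaded once at the end. */
double resident_data_ms(long iterations, int kernels,
                        double up, double run, double down) {
    return up + (double)iterations * kernels * run + down;
}
```

Plugging in the figures that appear elsewhere in the deck (3000 iterations, 4 kernels, 188 ms upload, 4 ms kernel, 54 ms download) the model yields 2,952,000 ms against 48,242 ms, a ratio of about 61. That ratio only illustrates the scaling under the assumptions above and should not be read as the speedup achieved in the talk.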

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! measure real application performance
! 3000 iterations using a 97x405x389 simulation grid
! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000, axis ticks from 0 to 14]

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! initial OpenCL performance measurements
! 89 Algorithms tested for image size of 4MP
! compare OpenCL I/O and execution time
! 28% of all algorithms are compute bound
! 72% of all algorithms are I/O bound

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! compare OpenCL and single-threaded performance
! 89 Algorithms tested for image size of 4MP
! realistic timing that includes I/O over PCIe
! 59% of all algorithms execute faster on the GPU
! 41% of all algorithms execute faster on the CPU(1)
! significant speedup for only 15% of all algorithms

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Task: Batch process a large amount of images using a single algorithm.
! OpenCL performance is algorithm and image size dependent
! Either the CPU will process data or the GPU, but not both
! How to choose which algorithm and device to use depending on image size?
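
One simple answer to the question above is to compare end-to-end times per algorithm and image size, with the OpenCL I/O included on the GPU side. The sketch below assumes such timings have already been measured. The `timing` record, the `device_kind` enum and `choose_device` are hypothetical names, not part of OpenCV.

```c
typedef enum { DEVICE_CPU, DEVICE_GPU } device_kind;

/* Measured times for one algorithm at one image size (hypothetical layout). */
typedef struct {
    double cpu_ms;         /* single-threaded CPU time */
    double gpu_kernel_ms;  /* OpenCL kernel execution time */
    double gpu_io_ms;      /* OpenCL upload plus download time */
} timing;

/* Pick the device with the lower end-to-end time, I/O included. */
device_kind choose_device(const timing *t) {
    return (t->gpu_kernel_ms + t->gpu_io_ms < t->cpu_ms) ? DEVICE_GPU
                                                         : DEVICE_CPU;
}
```

A fast kernel alone does not make the GPU the right choice: if the transfer time exceeds the CPU time, the CPU still wins. This rule uses only one device at a time, which is the limitation stated in the bullets above.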

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Better: create an input image queue that the CPU and the GPU both query for new image tasks until the queue is empty.
! all CPU cores are fully utilized at all times even for single-threaded algorithms
! all GPU compute units are fully utilized at all times
! combined performance for single algorithm is sum of GPU and CPU performance for that algorithm
! combined performance for multiple algorithms is better than sum of device performance

$$P^{i}_{APU} = P^{i}_{CPU} + P^{i}_{GPU}$$

$$P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$$
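
The claim that a shared queue makes CPU and GPU throughput add up for a single algorithm can be checked with a small simulation. The sketch below models two devices that each take the next image as soon as they are free. It is a timing model with hypothetical names, not a threaded implementation, and the per-image times passed to it are the caller's assumptions.

```c
/* Simulate two devices pulling images from one shared queue.
 * Each image goes to whichever device becomes free first
 * (ties go to the GPU). Returns the time in ms at which the
 * last image is finished. */
long drain_shared_queue(long images, long cpu_ms_per_image,
                        long gpu_ms_per_image) {
    long cpu_free = 0, gpu_free = 0;  /* when each device is next idle */
    for (long i = 0; i < images; ++i) {
        if (gpu_free <= cpu_free) gpu_free += gpu_ms_per_image;
        else                      cpu_free += cpu_ms_per_image;
    }
    return cpu_free > gpu_free ? cpu_free : gpu_free;
}
```

For example, with assumed costs of 30 ms per image on the CPU and 20 ms on the GPU, 100 images finish after 1200 ms, compared with 3000 ms on the CPU alone or 2000 ms on the GPU alone. The combined rate of 1/12 image per millisecond equals 1/30 + 1/20, the sum of the two device rates.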
Programming Strategies

Summary

! next generation hardware and legacy code require compromises
! OpenCL performance is tied to Amdahl’s Law regarding OpenCL I/O and OpenCL execution time
! application performance can be increased by overlapping OpenCL and OpenMP workloads
! removing all but necessary OpenCL I/O can have a dramatic influence on performance
! for loosely coupled high-throughput applications, OpenCL and OpenMP performance add up for single algorithms
! for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
! APUs may provide greatest performance per Watt
! GPUs may provide greatest performance

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation
Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array... by AMD Developer Central
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14 by AMD Developer Central
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14 by AMD Developer Central
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM... by AMD Developer Central
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle - Introducing a new API for Graphics - AMD at GDC14 by AMD Developer Central
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14

Recently uploaded

Ransomware is Knocking your Door_Final.pdf by
Ransomware is Knocking your Door_Final.pdfRansomware is Knocking your Door_Final.pdf
Ransomware is Knocking your Door_Final.pdfSecurity Bootcamp
90 views46 slides
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...ShapeBlue
88 views13 slides
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...ShapeBlue
123 views28 slides
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...ShapeBlue
79 views17 slides
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...ShapeBlue
138 views18 slides
Business Analyst Series 2023 - Week 4 Session 8 by
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8DianaGray10
86 views13 slides

Recently uploaded(20)

Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue88 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue123 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue79 views
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha... by ShapeBlue
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
Mitigating Common CloudStack Instance Deployment Failures - Jithin Raju - Sha...
ShapeBlue138 views
Business Analyst Series 2023 - Week 4 Session 8 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8
DianaGray1086 views
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And... by ShapeBlue
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
Enabling DPU Hardware Accelerators in XCP-ng Cloud Platform Environment - And...
ShapeBlue63 views
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue222 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue88 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash153 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10126 views
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue by ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue176 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue158 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc160 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue181 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue112 views

HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

  • 1. Efficient Scheduling of OpenMP and OpenCL Workloads: Getting the most out of your APU
  • 2. Objective
      - Software has a long life span that exceeds the life span of hardware.
      - Software is very expensive to write and maintain.
      - Next-generation hardware also needs to run legacy software.
      - Example: IWAVE
          - procedural C code
          - no object orientation
          - tight integration between data structures and functions
      - What do I mean by efficient scheduling?
          - find ways to utilize GPU cores for code blocks
          - find ways to utilize all CPU cores and GPU units at the same time
  • 3. Historical Context: GPU Compute Timeline
      [Figure: timeline with marks at 2002, 2008, 2010, and 2012; labels include Aparapi, CUDA, and AMP C++]
  • 4. Accelerator Challenges: Technology Accessibility and Performance
      [Figure: performance versus ease-of-use for OpenCL & CUDA, CPU multithread, and CPU single thread]
  • 5. APU Opportunities: One Die - Two Computational Devices
      Metric                 CPU                     APU
      Memory Size            large                   small
      Memory Bandwidth       small                   large
      Parallelism            small                   large
      General Purpose        yes                     no
      Performance            application dependent   application dependent
      Performance-per-Watt   application dependent   application dependent
      Programming            Traditional             OpenCL
  • 6. APU Opportunities: Performance and Performance-per-Watt
      Example: Luxmark OpenCL Benchmark
      Metric              CPU    GPU    APU
      Performance [Pts]   170    197    316
      Power [W]            50     37     58
      PPW [Pts/W]         3.4    5.3    5.4
      Combined [Pts²/W]   578   1049   1722
      - Similar CPU and GPU performance
      - Best performance by using the APU
      - GPU has best performance-per-Watt
      - APU provides outstanding value
      Test system: Luxmark OpenCL Benchmark, Ubuntu 12.10 x86_64, 4 Piledriver CPU cores @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
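The two derived rows of the Luxmark table can be reproduced from the measured points and power: PPW is points divided by Watts, and the combined metric, going by its unit Pts²/W, is points squared divided by Watts. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Performance-per-Watt in Pts/W: benchmark points divided by power draw. */
double perf_per_watt(double points, double watts)
{
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric in Pts^2/W: points squared divided by power draw.
   This reading is inferred from the unit given in the table. */
double combined_metric(double points, double watts)
{
    assert(watts > 0.0);
    return points * points / watts;
}
```

For the CPU column, `perf_per_watt(170.0, 50.0)` gives 3.4 and `combined_metric(170.0, 50.0)` gives 578, matching the table; the GPU and APU columns match after rounding.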
  • 7. Example: Luxmark Renderer: Performance and Performance-per-Watt
      [Figure: performance and performance-per-Watt comparison, annotated +64% and +81%]
      Test system: Luxmark OpenCL Benchmark, render “Sala” scene, Ubuntu 12.10 x86_64, 4 Piledriver cores @ 2.5GHz, 6 GPU CUs @ 720MHz, 16GB DDR3 1600MHz
  • 8. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - Know the problem you are trying to solve.
          - staggered rectangular grid in 3D
          - coupled first-order PDE
          - scalar pressure field p
          - vector velocity field v = {vx, vy, vz}
          - source term g
  • 9. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);     // calculate pressure field
          sgn_ts3d_210_v0_OpenMP(dom, pars);       // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenMP(dom, pars);       // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenMP(dom, pars);       // calculate velocity z-axis
          …
      }
      [Figure: OpenMP timeline showing p, vx, vy, and vz computed one after another]
  • 10. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - Measure the initial performance.
          - pressure and velocity field simulated using OpenMP
          - average time T[ms] per iteration
          - OpenMP scales linearly with the number of threads
  • 11. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - find computational blocks
      - understand dependencies between blocks
      - identify sequential and parallel parts
      [Figure: causality diagram and OpenMP timeline for the blocks p, vx, vy, and vz]
  • 12. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);     // calculate pressure field p
          sgn_ts3d_210_v0_OpenCL(dom, pars);       // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenMP(dom, pars);       // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenMP(dom, pars);       // calculate velocity z-axis
          …
      }
      [Figure: timeline with OpenCL and OpenMP rows: OpenMP p, then OpenCL vx while OpenMP is idle, then OpenMP vy and vz]
  • 13. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - use the GPU to compute vx
      - the CPU is idle while the GPU is running
      - 42% improvement for 1 thread
      - 25% improvement for 2 threads
      - 9% improvement for 4 threads
  • 14. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                                // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);                  // calculate pressure field p

          int num_threads = atoi(getenv("OMP_NUM_THREADS"));    // save the current number of OpenMP threads
          omp_set_num_threads(2);                               // restrict the number of OpenMP threads to 2
          omp_set_nested(1);                                    // allow nested OpenMP threads
          #pragma omp parallel shared(…) private(…)             // start 2 OpenMP threads
          {
              switch ( omp_get_thread_num() ) {
              case 0:
                  sgn_ts3d_210_v0_OpenCL(dom, pars);            // calculate velocity x-axis using OpenCL
                  break;
              case 1:
                  omp_set_num_threads(num_threads);             // increase number of OpenMP threads back
                  sgn_ts3d_210_v1_OpenMP(dom, pars);            // calculate velocity y-axis
                  sgn_ts3d_210_v2_OpenMP(dom, pars);            // calculate velocity z-axis
                  break;
              default:
                  break;
              }
          }                                                     // close OpenMP pragma
      }                                                         // close simulation while
      [Figure: timeline with OpenCL and OpenMP rows: OpenMP p, then OpenCL vx running alongside OpenMP vy and vz]
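The overlap on slide 14 can be tried without IWAVE or an OpenCL runtime. The sketch below is a simplified variant, not the author's code: it uses `#pragma omp sections` instead of a switch on the thread number, and `update_vx_device`, `update_v_host`, and `velocity_step` are hypothetical stand-ins for the device call and the OpenMP velocity kernels. It assumes a compiler with OpenMP 3.0 or later; built without OpenMP, the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for the OpenCL-backed vx update; a real version
   would enqueue a kernel and wait for it to finish. */
static void update_vx_device(double *vx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        vx[i] += 1.0;
}

/* Hypothetical stand-in for one OpenMP velocity update on the CPU cores. */
static void update_v_host(double *v, size_t n, int cpu_threads)
{
    #pragma omp parallel for num_threads(cpu_threads)
    for (long i = 0; i < (long)n; ++i)
        v[i] += 1.0;
}

/* One velocity step: vx on the "device" while vy and vz run on the host. */
void velocity_step(double *vx, double *vy, double *vz, size_t n, int cpu_threads)
{
    assert(cpu_threads > 0);
#ifdef _OPENMP
    omp_set_max_active_levels(2);   /* allow the nested parallel-for below */
#endif
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        update_vx_device(vx, n);    /* outer thread A: drives the device */

        #pragma omp section
        {                           /* outer thread B: fans out on the CPU */
            update_v_host(vy, n, cpu_threads);
            update_v_host(vz, n, cpu_threads);
        }
    }
}
```

The three arrays are independent within a step, which is what makes the overlap safe; the pressure update still has to finish before `velocity_step` is called.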
  • 15. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - overlap vx and vy
      - CPU not idle anymore
      - 50% improvement for 1 thread
      - 40% improvement for 2 threads
      - 38% improvement for 4 threads
  • 16. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
          sgn_ts3d_210_p012_OpenCL(dom, pars);     // calculate pressure field
          sgn_ts3d_210_v0_OpenCL(dom, pars);       // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenCL(dom, pars);       // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenCL(dom, pars);       // calculate velocity z-axis
          …
      }

      bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                    // copy data from host to device
          clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);       // execute OpenCL kernel on device
          clEnqueueReadBuffer(queue, buffer, …);                     // copy data from device to host
          …
      }
      [Figure: OpenCL timeline showing p, vx, vy, and vz computed one after another]
  • 17. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - understand where performance gets lost
      - 98% of time spent on I/O
      - 2% of time spent on compute
      - reduce I/O
      Stage              Time
      OpenCL Upload      188ms
      Kernel Execution   4ms
      OpenCL Download    54ms
      [Figure: timeline with OpenCL and OpenMP rows: OpenMP p, OpenCL vx, OpenMP vy and vz]
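The 98% figure on slide 17 follows from the three timings: upload and download together take 242 ms out of 246 ms. A minimal sketch of that calculation, with a hypothetical helper name:

```c
#include <assert.h>

/* Fraction of one OpenCL call spent moving data rather than computing.
   All three arguments are wall-clock times in the same unit. */
double io_fraction(double upload_ms, double kernel_ms, double download_ms)
{
    double total = upload_ms + kernel_ms + download_ms;
    assert(total > 0.0);
    return (upload_ms + download_ms) / total;
}
```

With the slide's numbers, `io_fraction(188.0, 4.0, 54.0)` is about 0.984, leaving roughly 1.6% for the kernel itself.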
  • 18. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
      - How does the speedup of an OpenCL application ($S_{\mathrm{OpenCL}}$) depend on the speedup of the OpenCL kernel ($S_{\mathrm{Kernel}}$) when the OpenCL I/O time is fixed?
      - Fraction of OpenCL I/O time: $F_{\mathrm{I/O}}$
      $$S_{\mathrm{OpenCL}} = \frac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\,F_{\mathrm{I/O}} + 1}$$
      - An I/O fraction of 50% limits the maximum possible speedup to 2.
      - Minimize OpenCL I/O first; only then increase OpenCL kernel performance.
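The relation on slide 18 is Amdahl's Law with the I/O time as the part that does not speed up: S_OpenCL = S_Kernel / ((S_Kernel - 1) * F_I/O + 1). The sketch below evaluates it so the limit can be checked numerically; the function name is hypothetical.

```c
#include <assert.h>

/* Application-level speedup when only the kernel gets faster.
   s_kernel: speedup of the kernel alone (1.0 means unchanged).
   f_io:     fraction of the original run time spent on OpenCL I/O, 0..1. */
double opencl_speedup(double s_kernel, double f_io)
{
    assert(s_kernel > 0.0);
    assert(f_io >= 0.0 && f_io <= 1.0);
    return s_kernel / ((s_kernel - 1.0) * f_io + 1.0);
}
```

With `f_io = 0.5` the result approaches 2 but never reaches it however large `s_kernel` becomes, and with `f_io = 0.98`, as measured for the IWAVE kernel, even an arbitrarily fast kernel gains only about 2%.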
  • 19. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      while(…) {                                   // main simulation loop
          sgn_ts3d_210_ALL_OpenCL(dom, pars);      // combine all OpenCL calculations
          …
      }

      bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                        // copy data from host to device

          while(…) {
              clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);       // execute OpenCL kernel for pressure
              clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);         // execute OpenCL kernel for velocity x
              clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);         // execute OpenCL kernel for velocity y
              clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);         // execute OpenCL kernel for velocity z
          }

          clEnqueueReadBuffer(queue, buffer, …);                         // copy data from device to host
          …
      }
      [Figure: OpenCL timeline showing p, vx, vy, and vz computed one after another]
  • 20. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - eliminate all but essential I/O
      - significant speedup over simple OpenCL
  • 21. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - measure real application performance
      - 3000 iterations using a 97x405x389 simulation grid
      - 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads
      [Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; vertical axis from 0 to 14]
  • 22. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
      - initial OpenCL performance measurements
      - 89 algorithms tested for an image size of 4MP
      - compare OpenCL I/O and execution time
      - 28% of all algorithms are compute bound
      - 72% of all algorithms are I/O bound
      Test system: OpenCV Computer Vision Library Performance Tests v2.4, Ubuntu 12.10 x86_64, 1 Piledriver CPU core @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
  • 23. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
      - compare OpenCL and single-threaded performance
      - 89 algorithms tested for an image size of 4MP
      - realistic timing that includes I/O over PCIe
      - 59% of all algorithms execute faster on the GPU
      - 41% of all algorithms execute faster on the CPU(1)
      - significant speedup for only 15% of all algorithms
      Test system: OpenCV Computer Vision Library Performance Tests v2.4, Ubuntu 12.10 x86_64, 1 Piledriver CPU core @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
  • 24. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
      - Task: batch process a large number of images using a single algorithm.
      - OpenCL performance is algorithm and image size dependent
      - Either the CPU or the GPU will process the data, but not both
      - How to choose which algorithm and device to use depending on image size?
  • 25. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
  • 26. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
      - Better: create an input image queue that the CPU and GPU query for new image tasks until the queue is empty.
      - all CPU cores are fully utilized at all times, even for single-threaded algorithms
      - all GPU compute units are fully utilized at all times
      - combined performance for a single algorithm is the sum of GPU and CPU performance for that algorithm
      - combined performance for multiple algorithms is better than the sum of device performance
      $$P^{i}_{\mathrm{APU}} = P^{i}_{\mathrm{CPU}} + P^{i}_{\mathrm{GPU}}, \qquad P = \frac{1}{\sum_{i=1}^{N} \frac{1}{P_{i}}}$$
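The shared image queue on slide 26 can be reduced to a single atomic counter that every worker increments to claim its next image. The sketch below illustrates that idea under stated assumptions and is not the author's implementation: `drain_image_queue` and `NUM_WORKERS` are hypothetical names, worker 0 merely stands for the thread that would drive the GPU, and the per-image work is a placeholder. It assumes OpenMP 3.1 or later for `atomic capture`; without OpenMP it runs as a single worker.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define NUM_WORKERS 2   /* worker 0 stands for the GPU, worker 1 for the CPU */

/* Drain a queue of n_images tasks. The caller zeroes processed[]; afterwards
   processed[i] counts how often image i was handled and per_worker[w] counts
   the images claimed by worker w. */
void drain_image_queue(int n_images, int *processed, int per_worker[NUM_WORKERS])
{
    int next = 0;                       /* index of the next unclaimed image */
    assert(n_images >= 0);
    per_worker[0] = 0;
    per_worker[1] = 0;

    #pragma omp parallel num_threads(NUM_WORKERS) shared(next)
    {
        int worker = 0;
#ifdef _OPENMP
        worker = omp_get_thread_num();
#endif
        for (;;) {
            int i;
            #pragma omp atomic capture
            i = next++;                 /* claim one image atomically */
            if (i >= n_images)
                break;                  /* queue is empty */
            processed[i] += 1;          /* placeholder for the real algorithm */
            per_worker[worker] += 1;
        }
    }
}
```

Because each worker returns for more work as soon as it finishes an image, the faster device simply claims more images, which is why the throughputs of the two devices add up for a single algorithm.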
  • 27. Programming Strategies
      Example: High Throughput Computer Vision with OpenCV
  • 28. Programming Strategies
      Summary
      - next-generation hardware and legacy code require compromises
      - OpenCL performance is tied to Amdahl’s Law regarding OpenCL I/O and OpenCL execution time
      - application performance can be increased by overlapping OpenCL and OpenMP workloads
      - removing all but necessary OpenCL I/O can have a dramatic influence on performance
      - for loosely coupled high-throughput applications, the OpenCL and OpenMP performance add up for single algorithms
      - for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
      - APUs may provide greatest performance per Watt
      - GPUs may provide greatest performance
  • 29. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
 The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
 AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
 AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
 ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.