
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

Efficient Scheduling of OpenMP and OpenCL Workloads
Getting the most out of your APU
Objective
! software has a long life span that exceeds the life span of hardware
! software is expensive to write and maintain
! next-generation hardware also needs to run legacy software
! Example: IWAVE
! procedural C code
! no object orientation
! tight integration between data structures and functions
! What do I mean by efficient scheduling?
! find ways to utilize GPU cores for code blocks
! find ways to utilize all CPU cores and GPU units at the same time


| OpenCL and OpenMP Workloads on Accelerated Processing Units |
Historical Context
GPU Compute Timeline

[Figure: timeline of GPU compute technologies with axis marks at 2002, 2008, 2010 and 2012, including CUDA, Aparapi and AMP C++.]

Accelerator Challenges
Technology Accessibility and Performance

[Figure: performance versus ease of use for three approaches: OpenCL & CUDA, CPU multithread, and CPU single thread.]

APU Opportunities
One Die - Two Computational Devices

Metric                 CPU                     APU
Memory Size            large                   small
Memory Bandwidth       small                   large
Parallelism            small                   large
General Purpose        yes                     no
Performance            application dependent   application dependent
Performance-per-Watt   application dependent   application dependent
Programming            Traditional             OpenCL

APU Opportunities

Performance and Performance-per-Watt
! Example: Luxmark OpenCL Benchmark
! Similar CPU and GPU performance
! Best performance by using the APU
! GPU has best performance-per-Watt
! APU provides outstanding value

Metric              CPU    GPU    APU
Performance [Pts]   170    197    316
Power [W]           50     37     58
PPW [Pts/W]         3.4    5.3    5.4
Combined [Pts²/W]   578    1049   1722

Luxmark OpenCL Benchmark
Ubuntu 12.10 x86_64
4 Piledriver CPU cores @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Example: Luxmark Renderer

Performance and Performance-per-Watt

[Figure: Luxmark results for performance and performance-per-Watt, annotated with +64% and +81%.]

Luxmark OpenCL Benchmark
Render “Sala” Scene
Ubuntu 12.10 x86_64
4 Piledriver cores @ 2.5GHz
6 GPU CUs @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
! Know the problem you are trying to solve.
! staggered rectangular grid in 3D
! coupled first-order PDEs
! scalar pressure field p
! vector velocity field v = {vx, vy, vz}
! source term g
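
For reference, the quantities listed above are commonly combined into the first-order pressure-velocity form of the acoustic wave equation. The block below is a sketch of that standard form, not an equation taken from the slides. The bulk modulus $\kappa$ and the density $\rho$ are assumed symbols that the slide does not name.

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa \, \nabla \cdot \mathbf{v} + g, \\
\rho \, \frac{\partial \mathbf{v}}{\partial t} &= -\nabla p,
\qquad \mathbf{v} = (v_x, v_y, v_z).
\end{aligned}
```

In this form each velocity component depends only on the pressure gradient along its own axis, so the updates of vx, vy and vz do not depend on one another once p is known.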

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                   // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);     // calculate pressure field
    sgn_ts3d_210_v0_OpenMP(dom, pars);       // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenMP(dom, pars);       // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenMP(dom, pars);       // calculate velocity z-axis
    …
}

[Figure: timeline showing OpenMP p, OpenMP vx, OpenMP vy and OpenMP vz executing one after another.]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! Measure the initial performance.
! pressure and velocity field simulated using OpenMP
! average time T[ms] per iteration
! OpenMP performance scales linearly with the number of threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! find computational blocks
! understand dependencies between blocks
! identify sequential and parallel parts

[Figure: causality diagram relating the OpenMP p, vx, vy and vz blocks, shown with a timeline of the same four blocks.]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                   // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);     // calculate pressure field p
    sgn_ts3d_210_v0_OpenCL(dom, pars);       // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenMP(dom, pars);       // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenMP(dom, pars);       // calculate velocity z-axis
    …
}

[Figure: timeline showing OpenMP p, then OpenCL vx while the OpenMP threads are IDLE, then OpenMP vy and OpenMP vz.]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! use the GPU to compute vx
! the CPU is idle while the GPU is running
! 42% improvement for 1 thread
! 25% improvement for 2 threads
! 9% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE
while(…) {                                                 // main simulation loop
    sgn_ts3d_210_p012_OpenMP(dom, pars);                   // calculate pressure field p

    int num_threads = atoi(getenv("OMP_NUM_THREADS"));     // save the current number of OpenMP threads
    omp_set_num_threads(2);                                // restrict the number of OpenMP threads to 2
    omp_set_nested(1);                                     // allow nested OpenMP threads

    #pragma omp parallel shared(…) private(…)              // start 2 OpenMP threads
    {
        switch ( omp_get_thread_num() ) {
        case 0:
            sgn_ts3d_210_v0_OpenCL(dom, pars);             // calculate velocity x-axis using OpenCL
            break;
        case 1:
            omp_set_num_threads(num_threads);              // increase number of OpenMP threads back
            sgn_ts3d_210_v1_OpenMP(dom, pars);             // calculate velocity y-axis
            sgn_ts3d_210_v2_OpenMP(dom, pars);             // calculate velocity z-axis
            break;
        default:
            break;
        }
    }                                                      // close OpenMP pragma
    …
}                                                          // close simulation while

[Figure: timeline showing OpenMP p, then OpenCL vx running concurrently with OpenMP vy and OpenMP vz.]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! overlap vx and vy
! CPU not idle anymore
! 50% improvement for 1 thread
! 40% improvement for 2 threads
! 38% improvement for 4 threads

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                   // main simulation loop
    sgn_ts3d_210_p012_OpenCL(dom, pars);     // calculate pressure field
    sgn_ts3d_210_v0_OpenCL(dom, pars);       // calculate velocity x-axis
    sgn_ts3d_210_v1_OpenCL(dom, pars);       // calculate velocity y-axis
    sgn_ts3d_210_v2_OpenCL(dom, pars);       // calculate velocity z-axis
    …
}

bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                  // copy data from host to device
    clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel on device
    clEnqueueReadBuffer(queue, buffer, …);                   // copy data from device to host
    …
}

[Figure: timeline showing OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz executing one after another.]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! understand where performance gets lost
! 98% of time spent on I/O
! 2% of time spent on compute
! reduce I/O

OpenCL Upload      188ms
Kernel Execution   4ms
OpenCL Download    54ms

[Figure: timeline showing OpenMP p, OpenCL vx, OpenMP vy and OpenMP vz.]

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! How does the speedup of an OpenCL application ($S_{\mathrm{OpenCL}}$) depend on the speedup of the OpenCL kernel ($S_{\mathrm{Kernel}}$) when the OpenCL I/O time is fixed?
! Fraction of OpenCL I/O time: $F_{\mathrm{I/O}}$
! $S_{\mathrm{OpenCL}} = \dfrac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\,F_{\mathrm{I/O}} + 1}$
! 50% I/O time limits the maximum possible speedup to 2
! Minimize OpenCL I/O first, only then increase OpenCL kernel performance

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

while(…) {                                                    // main simulation loop
    sgn_ts3d_210_ALL_OpenCL(dom, pars);                       // combine all OpenCL calculations
    …
}

bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
    …
    clEnqueueWriteBuffer(queue, buffer, …);                   // copy data from host to device

    while(…) {
        clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);  // execute OpenCL kernel for pressure
        clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);    // execute OpenCL kernel for velocity x
        clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);    // execute OpenCL kernel for velocity y
        clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);    // execute OpenCL kernel for velocity z
    }
    clEnqueueReadBuffer(queue, buffer, …);                    // copy data from device to host
    …
}

[Figure: timeline showing OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz executing one after another.]

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! eliminate all but essential I/O
! significant speedup over simple OpenCL

Programming Strategies

Example: Solving the Acoustic Wave Equation in 3D using IWAVE

! measure real application performance
! 3000 iterations using a 97x405x389 simulation grid
! 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads

[Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; vertical axis from 0 to 14.]

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! initial OpenCL performance measurements
! 89 Algorithms tested for image size of 4MP
! compare OpenCL I/O and execution time
! 28% of all algorithms are compute bound
! 72% of all algorithms are I/O bound

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! compare OpenCL and single-threaded performance
! 89 Algorithms tested for image size of 4MP
! realistic timing that includes I/O over PCIe
! 59% of all algorithms execute faster on the GPU
! 41% of all algorithms execute faster on the CPU(1)
! significant speedup for only 15% of all algorithms

OpenCV Computer Vision Library Performance Tests v2.4
Ubuntu 12.10 x86_64
1 Piledriver CPU core @ 2.5GHz
6 GPU Compute Units @ 720MHz
16GB DDR3 1600MHz
Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Task: Batch-process a large number of images using a single algorithm.
! OpenCL performance depends on the algorithm and the image size
! Either the CPU or the GPU processes the data, but not both
! How to choose which algorithm and device to use depending on image size?

Programming Strategies

Example: High Throughput Computer Vision with OpenCV

Programming Strategies

Example: High Throughput Computer Vision with OpenCV
! Better: create an input image queue that the CPU and the GPU both query for new image tasks until the queue is empty.
! all CPU cores are fully utilized at all times, even for single-threaded algorithms
! all GPU compute units are fully utilized at all times
! combined performance for a single algorithm is the sum of the GPU and CPU performance for that algorithm
! combined performance for multiple algorithms is better than the sum of the device performances

$P_i^{\mathrm{APU}} = P_i^{\mathrm{CPU}} + P_i^{\mathrm{GPU}}$

$P = \dfrac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$

Programming Strategies

Example: High Throughput Computer Vision with OpenCV

Programming Strategies

Summary

! next-generation hardware and legacy code require compromises
! OpenCL performance follows Amdahl’s Law with respect to OpenCL I/O and OpenCL execution time
! application performance can be increased by overlapping OpenCL and OpenMP workloads
! removing all but the necessary OpenCL I/O can have a dramatic influence on performance
! for loosely coupled high-throughput applications, the OpenCL and OpenMP performance add up for single algorithms
! for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
! APUs may provide greatest performance per Watt
! GPUs may provide greatest performance

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product
and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or
revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof
without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation
Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

!29

| OpenCL and OpenMP Workloads on Accelerated Processing Units |

Recommended

PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...AMD Developer Central
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...AMD Developer Central
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorAMD Developer Central
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...AMD Developer Central
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...AMD Developer Central
 
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosMM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary Demos
MM-4105, Realtime 4K HDR Decoding with GPU ACES, by Gary DemosAMD Developer Central
 

More Related Content

What's hot

PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...AMD Developer Central
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...AMD Developer Central
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornAMD Developer Central
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoAMD Developer Central
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...AMD Developer Central
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansAMD Developer Central
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
GS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-BilodeauGS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-BilodeauAMD Developer Central
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahAMD Developer Central
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauAMD Developer Central
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...AMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...AMD Developer Central
 
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-PoustyCC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-PoustyAMD Developer Central
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMDHSA Foundation
 
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 

What's hot (20)

PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
 
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey PavlenkoMM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
MM-4097, OpenCV-CL, by Harris Gasparakis, Vadim Pisarevsky and Andrey Pavlenko
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
 
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin CoumansGS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
GS-4150, Bullet 3 OpenCL Rigid Body Simulation, by Erwin Coumans
 
HSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben GasterHSA-4123, HSA Memory Model, by Ben Gaster
HSA-4123, HSA Memory Model, by Ben Gaster
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
GS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-BilodeauGS-4147, TressFX 2.0, by Bill-Bilodeau
GS-4147, TressFX 2.0, by Bill-Bilodeau
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
 
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill BilodeauGS-4108, Direct Compute in Gaming, by Bill Bilodeau
GS-4108, Direct Compute in Gaming, by Bill Bilodeau
 
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
PT-4058, Measuring and Optimizing Performance of Cluster and Private Cloud Ap...
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-PoustyCC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
CC-4010, Bringing Spatial Love to your Java Application, by Steven Citron-Pousty
 
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA  by Ben Sanders, AMDBolt C++ Standard Template Libary for HSA  by Ben Sanders, AMD
Bolt C++ Standard Template Libary for HSA by Ben Sanders, AMD
 
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 

Viewers also liked

Curriculum de professor_atual
Curriculum de professor_atualCurriculum de professor_atual
Curriculum de professor_atualWanderson Amaral
 
CURRICULUM VITAE Alexandra Damaso
CURRICULUM VITAE Alexandra DamasoCURRICULUM VITAE Alexandra Damaso
CURRICULUM VITAE Alexandra DamasoAlexandra Damaso
 
Modelo de currículo 1º emprego
Modelo de currículo 1º empregoModelo de currículo 1º emprego
Modelo de currículo 1º empregoCebracManaus
 
Curriculum vitae 2013
Curriculum vitae 2013Curriculum vitae 2013
Curriculum vitae 2013Ana Santos
 
Professor de musica curriculo - arnaldo alves
Professor de  musica   curriculo - arnaldo alvesProfessor de  musica   curriculo - arnaldo alves
Professor de musica curriculo - arnaldo alvesArnaldo Alves
 
Modelo de curriculo menor aprendiz
Modelo de curriculo menor aprendiz Modelo de curriculo menor aprendiz
Modelo de curriculo menor aprendiz CebracManaus
 
Modelo de-curriculum-1-preenchido
Modelo de-curriculum-1-preenchidoModelo de-curriculum-1-preenchido
Modelo de-curriculum-1-preenchidoJocileu Segundo
 
CurríCulo Luiz 2010
CurríCulo Luiz 2010CurríCulo Luiz 2010
CurríCulo Luiz 2010luizmarco
 
Curriculum Profª Elizete Arantes
Curriculum  Profª Elizete ArantesCurriculum  Profª Elizete Arantes
Curriculum Profª Elizete Aranteselizetearantes
 
Trabalho LPL
Trabalho LPLTrabalho LPL
Trabalho LPLTaissccp
 
Curriculo 850 Alternativo
Curriculo 850 AlternativoCurriculo 850 Alternativo
Curriculo 850 Alternativorpicorelli
 
PPP - E.B.M. Henrique Alfarth 2014
PPP - E.B.M. Henrique Alfarth 2014PPP - E.B.M. Henrique Alfarth 2014
PPP - E.B.M. Henrique Alfarth 2014Fernando Heringer
 
Manual blogger
Manual bloggerManual blogger
Manual bloggerBLAJEJS
 
Criar Um Blog -Blogger
Criar Um Blog -BloggerCriar Um Blog -Blogger
Criar Um Blog -BloggerLeny Cerqueira
 
Blog na-educacao
Blog na-educacaoBlog na-educacao
Blog na-educacaoNecy
 
Curriculum psicóloga educacional
Curriculum psicóloga educacionalCurriculum psicóloga educacional
Curriculum psicóloga educacionalcarolinaanabella
 

HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel

  • 1. Efficient Scheduling of OpenMP and OpenCL Workloads: Getting the most out of your APU
  • 2. Objective
      - Software has a long life-span that exceeds the life-span of hardware.
      - Software is very expensive to write and maintain.
      - Next-generation hardware also needs to run legacy software.
      - Example: IWAVE
          - procedural C code
          - no object orientation
          - tight integration between data structures and functions
      - What do I mean by efficient scheduling?
          - find ways to utilize GPU cores for code blocks
          - find ways to utilize all CPU cores and GPU units at the same time
  • 3. Historical Context: GPU Compute Timeline
      [Figure: timeline with axis marks at 2002, 2008, 2010 and 2012; labelled technologies: CUDA, Aparapi, C++ AMP]
  • 4. Accelerator Challenges: Technology Accessibility and Performance
      [Figure: chart with axes Performance and Ease-of-Use; plotted labels: OpenCL & CUDA, CPU Multithread, CPU Single Thread]
  • 5. APU Opportunities: One Die, Two Computational Devices

      Metric                 CPU                     APU
      Memory Size            large                   small
      Memory Bandwidth       small                   large
      Parallelism            small                   large
      General Purpose        yes                     no
      Performance            application dependent   application dependent
      Performance-per-Watt   application dependent   application dependent
      Programming            Traditional             OpenCL
  • 6. APU Opportunities: Performance and Performance-per-Watt
      Example: Luxmark OpenCL Benchmark

      Metric               CPU    GPU    APU
      Performance [Pts]    170    197    316
      Power [W]             50     37     58
      PPW [Pts/W]          3.4    5.3    5.4
      Combined [Pts²/W]    578   1049   1722

      - GPU has best performance-per-Watt
      - Best performance by using the APU
      - Similar CPU and GPU performance
      - APU provides outstanding value

      Test system: Luxmark OpenCL Benchmark, Ubuntu 12.10 x86_64, 4 Piledriver CPU cores @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
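The two derived rows of the Luxmark table follow from the first two: PPW is points divided by watts, and the combined metric multiplies performance by PPW, which gives Pts²/W. A minimal sketch in C; the function names `perf_per_watt` and `combined_metric` are illustrative and do not come from the presentation:

```c
#include <assert.h>

/* Performance-per-Watt in Pts/W. */
double perf_per_watt(double points, double watts)
{
    assert(watts > 0.0);
    return points / watts;
}

/* Combined metric in Pts^2/W: performance multiplied by performance-per-Watt. */
double combined_metric(double points, double watts)
{
    return points * perf_per_watt(points, watts);
}
```

For the CPU column (170 Pts at 50 W) this gives 3.4 Pts/W and 578 Pts²/W, in line with the table.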
  • 7. Example: Luxmark Renderer: Performance and Performance-per-Watt
      [Figure: chart annotated with +64% and +81%]
      Test system: Luxmark OpenCL Benchmark, render "Sala" scene, Ubuntu 12.10 x86_64, 4 Piledriver cores @ 2.5GHz, 6 GPU CUs @ 720MHz, 16GB DDR3 1600MHz
  • 8. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - Know the problem you are trying to solve.
          - staggered rectangular grid in 3D
          - coupled first-order PDE
          - scalar pressure field p
          - vector velocity field v = {vx, vy, vz}
          - source term g
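The slide lists the ingredients but not the equations. For reference, the acoustic wave equation is commonly written as a coupled first-order system in pressure and velocity. The form below is the standard textbook one, with bulk modulus $\kappa$ and density $\rho$ as assumed material parameters that the slide does not name, so IWAVE's exact scaling and sign conventions may differ:

```latex
\begin{aligned}
\frac{\partial p}{\partial t} &= -\kappa\,\nabla\cdot\mathbf{v} + g,\\
\rho\,\frac{\partial \mathbf{v}}{\partial t} &= -\nabla p .
\end{aligned}
```

In this form the pressure update needs all three velocity components, while each velocity component depends only on the pressure gradient along its own axis.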
  • 9. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE

      while(…) {                                   // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);     // calculate pressure field
          sgn_ts3d_210_v0_OpenMP(dom, pars);       // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenMP(dom, pars);       // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenMP(dom, pars);       // calculate velocity z-axis
          …
      }

      [Figure: blocks OpenMP p, OpenMP vx, OpenMP vy, OpenMP vz along the axis OpenMP Time]
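The slide does not show the bodies of the `sgn_ts3d_210_*_OpenMP` functions. As a rough idea of what such an update involves, here is a hypothetical one-dimensional staggered-grid velocity update parallelised with OpenMP. The name `update_velocity_1d`, its parameters and the constant-density assumption are illustrative and are not IWAVE code:

```c
#include <stddef.h>

/*
 * Hypothetical 1D staggered-grid velocity update (not IWAVE code).
 * v[i] sits halfway between p[i] and p[i+1]; n is the number of pressure
 * samples, so v has n - 1 entries. Implements rho * dv/dt = -dp/dx with an
 * explicit time step dt, grid spacing dx and constant density rho.
 */
void update_velocity_1d(double *v, const double *p, size_t n,
                        double dt, double dx, double rho)
{
    if (n < 2) {
        return;  /* no staggered point between two pressure samples */
    }
    const double c = dt / (rho * dx);
    const long count = (long)(n - 1);

    /* Each iteration writes a distinct v[i], so iterations are independent. */
    #pragma omp parallel for
    for (long i = 0; i < count; ++i) {
        v[i] -= c * (p[i + 1] - p[i]);
    }
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially with the same result.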
  • 10. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - Measure the initial performance.
          - pressure and velocity field simulated using OpenMP
          - average time T[ms] per iteration
          - OpenMP linear scaling with threads
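One way to obtain an average time T[ms] per iteration is to time a fixed number of iterations with a wall clock. The sketch below assumes a C11 toolchain; `step_fn`, `count_step` and `average_ms_per_iteration` are hypothetical names, and the presentation does not say how its timings were taken. Wall time is used because CPU time from `clock()` adds up the time of all threads and would hide OpenMP scaling:

```c
#include <time.h>

/* Signature of one simulation step; `state` is passed through unchanged. */
typedef void (*step_fn)(void *state);

/* Trivial stand-in for one simulation iteration: counts its own calls. */
void count_step(void *state)
{
    ++*(int *)state;
}

/* Wall-clock time in milliseconds (C11 timespec_get). */
static double wall_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Runs `step` `iterations` times and returns the average wall time in ms
 * per iteration, or -1.0 if `iterations` is not positive. */
double average_ms_per_iteration(step_fn step, void *state, int iterations)
{
    if (iterations <= 0) {
        return -1.0;
    }
    const double start = wall_ms();
    for (int i = 0; i < iterations; ++i) {
        step(state);
    }
    return (wall_ms() - start) / iterations;
}
```

When OpenMP is enabled, `omp_get_wtime()` is an alternative wall-clock source.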
  • 11. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - find computational blocks
      - understand dependencies between blocks
      - identify sequential and parallel parts
      [Figure: a diagram labelled Causality with blocks OpenMP p, OpenMP vx, OpenMP vy, OpenMP vz, and a timeline of the same blocks along the axis OpenMP Time]
  • 12. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE

      while(…) {                                   // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);     // calculate pressure field p
          sgn_ts3d_210_v0_OpenCL(dom, pars);       // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenMP(dom, pars);       // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenMP(dom, pars);       // calculate velocity z-axis
          …
      }

      [Figure: timeline along the axis OpenMP Time: OpenMP p, then OpenCL vx with the CPU marked IDLE, then OpenMP vy and OpenMP vz]
  • 13. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      - use the GPU to compute vx
      - the CPU is idle while the GPU is running
      - 42% improvement for 1 thread
      - 25% improvement for 2 threads
      - 9% improvement for 4 threads
  • 14. Programming Strategies
      Example: Solving the Acoustic Wave Equation in 3D using IWAVE

      while(…) {                                                 // main simulation loop
          sgn_ts3d_210_p012_OpenMP(dom, pars);                   // calculate pressure field p

          int num_threads = atoi(getenv("OMP_NUM_THREADS"));     // save the current number of OpenMP threads
          omp_set_num_threads(2);                                // restrict the number of OpenMP threads to 2
          omp_set_nested(1);                                     // allow nested OpenMP threads
          #pragma omp parallel shared(…) private(…)              // start 2 OpenMP threads
          {
              switch ( omp_get_thread_num() ) {
              case 0:
                  sgn_ts3d_210_v0_OpenCL(dom, pars);             // calculate velocity x-axis using OpenCL
                  break;
              case 1:
                  omp_set_num_threads(num_threads);              // increase number of OpenMP threads back
                  sgn_ts3d_210_v1_OpenMP(dom, pars);             // calculate velocity y-axis
                  sgn_ts3d_210_v2_OpenMP(dom, pars);             // calculate velocity z-axis
                  break;
              default:
                  break;
              }
          }                                                      // close OpenMP pragma
      }                                                          // close simulation while

      [Figure: timeline along the axis OpenMP Time: OpenMP p, then OpenCL vx running at the same time as OpenMP vy and OpenMP vz]
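The same overlap can also be written with OpenMP `sections` instead of a `switch` on the thread number. This is a different construct from the one on the slide, shown here as a self-contained sketch with hypothetical stand-ins (`task_fn`, `sum_task`, `run_overlapped`) in place of the IWAVE functions. It uses a `num_threads` clause instead of `omp_set_num_threads`, so the thread count outside the region is left unchanged, and `omp_set_max_active_levels`, which newer OpenMP versions provide as the replacement for the deprecated `omp_set_nested`:

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in type for the offloaded and the CPU-side work. */
typedef void (*task_fn)(void *arg);

/* Example task: stores the sum 1 + 2 + ... + 1000 in the long at `arg`. */
void sum_task(void *arg)
{
    long sum = 0;
    for (long i = 1; i <= 1000; ++i) {
        sum += i;
    }
    *(long *)arg = sum;
}

/*
 * Runs `device_task` and `host_task` concurrently when OpenMP is enabled:
 * one thread drives the device task while the other runs the host task,
 * which may itself open a nested parallel region. Without OpenMP the two
 * tasks run one after the other.
 */
void run_overlapped(task_fn device_task, void *device_arg,
                    task_fn host_task, void *host_arg)
{
#ifdef _OPENMP
    omp_set_max_active_levels(2);  /* allow one nested level of parallelism */
#endif
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        device_task(device_arg);

        #pragma omp section
        host_task(host_arg);
    }
}
```

In the setting of the slide, `device_task` would wrap the OpenCL call for vx and `host_task` the two OpenMP calls for vy and vz.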
  • 15. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • overlap vx and vy
      • the CPU is no longer idle
      • 50% improvement for 1 thread
      • 40% improvement for 2 threads
      • 38% improvement for 4 threads
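The gain from overlapping can be estimated with a simple timing model for one simulation step. The sketch below is an illustration and not code from IWAVE: the function names are hypothetical, and it assumes that the GPU work and the CPU work do not slow each other down when they run at the same time.

```c
/* Time for one simulation step when the CPU waits for the GPU:
   p on the CPU, vx on the GPU, then vy and vz on the CPU. */
double step_time_sequential(double t_p, double t_vx_gpu, double t_vy, double t_vz)
{
    return t_p + t_vx_gpu + t_vy + t_vz;
}

/* Time for one step when vx on the GPU overlaps vy and vz on the CPU.
   The overlapped phase lasts as long as the slower of the two sides. */
double step_time_overlapped(double t_p, double t_vx_gpu, double t_vy, double t_vz)
{
    double cpu_side = t_vy + t_vz;
    double overlapped = (t_vx_gpu > cpu_side) ? t_vx_gpu : cpu_side;
    return t_p + overlapped;
}
```

With made-up timings of 4, 3, 2 and 2 time units for p, vx, vy and vz, the sequential step takes 11 units and the overlapped step takes 8. The overlap pays off most when the GPU time for vx is close to the CPU time for vy plus vz.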
  • 16. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                   // main simulation loop
          sgn_ts3d_210_p012_OpenCL(dom, pars);       // calculate pressure field
          sgn_ts3d_210_v0_OpenCL(dom, pars);         // calculate velocity x-axis
          sgn_ts3d_210_v1_OpenCL(dom, pars);         // calculate velocity y-axis
          sgn_ts3d_210_v2_OpenCL(dom, pars);         // calculate velocity z-axis
          …
        }

        bool sgn_ts3d_210_p012_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                  // copy data from host to device
          clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel on device
          clEnqueueReadBuffer(queue, buffer, …);                   // copy data from device to host
          …
        }
      [Figure: timeline with OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz in sequence]
  • 17. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • understand where performance gets lost
      • 98% of time spent on I/O
      • 2% of time spent on compute
      • reduce I/O
        OpenCL Upload | Kernel Execution | OpenCL Download
        188 ms        | 4 ms             | 54 ms
      [Figure: timeline of OpenCL vx and OpenMP p, vy, vz]
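The 98% figure follows directly from the three timings on the slide: upload and download together take 188 ms + 54 ms = 242 ms out of 246 ms in total. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
/* Fraction of the total OpenCL call time spent on host/device transfers.
   All three arguments use the same time unit. */
double opencl_io_fraction(double upload_ms, double kernel_ms, double download_ms)
{
    double io_ms = upload_ms + download_ms;
    return io_ms / (io_ms + kernel_ms);
}
```

Calling `opencl_io_fraction(188.0, 4.0, 54.0)` gives about 0.984, which matches the 98% I/O and 2% compute split.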
  • 18. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • How does the speedup of an OpenCL application ($S_{\mathrm{OpenCL}}$) depend on the speedup of the OpenCL kernel ($S_{\mathrm{Kernel}}$) when the OpenCL I/O time is fixed?
      • Fraction of OpenCL I/O time: $F_{\mathrm{I/O}}$
        $S_{\mathrm{OpenCL}} = \dfrac{S_{\mathrm{Kernel}}}{(S_{\mathrm{Kernel}} - 1)\,F_{\mathrm{I/O}} + 1}$
      • 50% I/O time limits the maximum possible speedup to 2
      • Minimize OpenCL I/O first, only then increase OpenCL kernel performance
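The relation on this slide, S_OpenCL = S_Kernel / ((S_Kernel - 1) · F_I/O + 1), is Amdahl's Law with the I/O time as the part that does not get faster. The sketch below evaluates it; the function names are hypothetical, and F_I/O is taken to be the I/O share of the run time before the kernel is sped up.

```c
/* Application speedup for a given kernel speedup and a fixed I/O fraction:
   S_OpenCL = S_Kernel / ((S_Kernel - 1) * F_IO + 1) */
double opencl_app_speedup(double kernel_speedup, double io_fraction)
{
    return kernel_speedup / ((kernel_speedup - 1.0) * io_fraction + 1.0);
}

/* Limit of the application speedup as the kernel speedup grows without bound.
   Only meaningful for io_fraction > 0. */
double opencl_max_speedup(double io_fraction)
{
    return 1.0 / io_fraction;
}
```

For an I/O fraction of 0.5 the limit is 2, as stated on the slide: even an arbitrarily fast kernel cannot make the application more than twice as fast.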
  • 19. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
        while(…) {                                   // main simulation loop
          sgn_ts3d_210_ALL_OpenCL(dom, pars);        // combine all OpenCL calculations
          …
        }

        bool sgn_ts3d_210_ALL_OpenCL(RDOM* dom, void* pars) {
          …
          clEnqueueWriteBuffer(queue, buffer, …);                    // copy data from host to device

          while(…) {
            clEnqueueNDRangeKernel(queue, kernel_P012, dims, …);     // execute OpenCL kernel for pressure
            clEnqueueNDRangeKernel(queue, kernel_V0, dims, …);       // execute OpenCL kernel for velocity x
            clEnqueueNDRangeKernel(queue, kernel_V1, dims, …);       // execute OpenCL kernel for velocity y
            clEnqueueNDRangeKernel(queue, kernel_V2, dims, …);       // execute OpenCL kernel for velocity z
          }
          clEnqueueReadBuffer(queue, buffer, …);                     // copy data from device to host
          …
        }
      [Figure: timeline with OpenCL p, OpenCL vx, OpenCL vy and OpenCL vz in sequence]
  • 20. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • eliminate all but essential I/O
      • significant speedup over simple OpenCL
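Why removing the per-call transfers matters so much can be seen from a rough cost model. The sketch below is an assumption-laden illustration, not a measurement: the function names are hypothetical, every kernel is assumed to cost the same, and the kernels are assumed to run back to back on the device.

```c
/* Total time when every kernel call uploads its input and downloads its output. */
double total_time_per_call_io(unsigned steps, unsigned kernels_per_step,
                              double upload_ms, double kernel_ms, double download_ms)
{
    return (double)steps * kernels_per_step * (upload_ms + kernel_ms + download_ms);
}

/* Total time when the data is uploaded once, all kernels of all steps run
   on the device, and the result is downloaded once at the end. */
double total_time_single_io(unsigned steps, unsigned kernels_per_step,
                            double upload_ms, double kernel_ms, double download_ms)
{
    return upload_ms + (double)steps * kernels_per_step * kernel_ms + download_ms;
}
```

If all four kernels had the timings quoted earlier in the deck (188 ms upload, 4 ms kernel, 54 ms download), ten time steps would cost 9840 ms with per-call transfers and 402 ms with a single upload and download in this model. These are model outputs under the stated assumptions, not results from the presentation.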
  • 21. Programming Strategies. Example: Solving the Acoustic Wave Equation in 3D using IWAVE
      • measure real application performance
      • 3000 iterations using a 97x405x389 simulation grid
      • 8 GCN Compute Units achieve 70% more performance than 8 traditional OpenMP threads
      [Figure: bar chart comparing CPU (8T) "Piledriver" with GPU (8CU) AMD S9000; vertical axis from 0 to 14]
  • 22. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • initial OpenCL performance measurements
      • 89 algorithms tested for an image size of 4MP
      • compare OpenCL I/O and execution time
      • 28% of all algorithms are compute bound
      • 72% of all algorithms are I/O bound
      Test setup: OpenCV Computer Vision Library Performance Tests v2.4, Ubuntu 12.10 x86_64, 1 Piledriver CPU core @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
  • 23. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • compare OpenCL and single-threaded performance
      • 89 algorithms tested for an image size of 4MP
      • realistic timing that includes I/O over PCIe
      • 59% of all algorithms execute faster on the GPU
      • 41% of all algorithms execute faster on the CPU(1)
      • significant speedup for only 15% of all algorithms
      Test setup: OpenCV Computer Vision Library Performance Tests v2.4, Ubuntu 12.10 x86_64, 1 Piledriver CPU core @ 2.5GHz, 6 GPU Compute Units @ 720MHz, 16GB DDR3 1600MHz
  • 24. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • Task: batch process a large number of images using a single algorithm.
      • OpenCL performance depends on the algorithm and on the image size
      • Either the CPU or the GPU will process the data, but not both
      • How to choose which algorithm and device to use depending on image size?
  • 25. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
  • 26. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
      • Better: create an input image queue that the CPU and GPU query for new image tasks until the queue is empty.
      • all CPU cores are fully utilized at all times, even for single-threaded algorithms
      • all GPU compute units are fully utilized at all times
      • combined performance for a single algorithm is the sum of GPU and CPU performance for that algorithm
      • combined performance for multiple algorithms is better than the sum of device performance
        $P^{i}_{\mathrm{APU}} = P^{i}_{\mathrm{CPU}} + P^{i}_{\mathrm{GPU}}$
        $P = \dfrac{1}{\sum_{i=1}^{N} \frac{1}{P_i}}$
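The shared-queue idea can be checked with a small deterministic simulation. The sketch below is not the presenter's implementation: the type and function names are hypothetical, each device is modeled as a single worker with a fixed time per image, and whichever device becomes free first takes the next image (ties go to the CPU).

```c
#include <stddef.h>

typedef struct {
    size_t cpu_images;   /* images processed by the CPU */
    size_t gpu_images;   /* images processed by the GPU */
    double makespan;     /* time at which the last image is finished */
} queue_result;

/* Simulate two devices pulling images from one shared input queue. */
queue_result simulate_shared_queue(size_t num_images,
                                   double cpu_time_per_image,
                                   double gpu_time_per_image)
{
    queue_result r = {0, 0, 0.0};
    double cpu_free_at = 0.0;   /* time at which the CPU can take a new image */
    double gpu_free_at = 0.0;   /* time at which the GPU can take a new image */

    for (size_t i = 0; i < num_images; ++i) {
        if (cpu_free_at <= gpu_free_at) {
            cpu_free_at += cpu_time_per_image;
            r.cpu_images++;
        } else {
            gpu_free_at += gpu_time_per_image;
            r.gpu_images++;
        }
    }
    r.makespan = (cpu_free_at > gpu_free_at) ? cpu_free_at : gpu_free_at;
    return r;
}
```

With 300 images, 2 time units per image on the CPU and 1 on the GPU, the simulation finishes at time 200 with 100 images on the CPU and 200 on the GPU. That is 1.5 images per time unit, the sum of the two device rates (0.5 and 1.0), which is the additive behavior the slide describes for a single algorithm.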
  • 27. Programming Strategies. Example: High Throughput Computer Vision with OpenCV
  • 28. Programming Strategies. Summary
      • next generation hardware and legacy code require compromises
      • OpenCL performance is tied to Amdahl's Law regarding OpenCL I/O and OpenCL execution time
      • application performance can be increased by overlapping OpenCL and OpenMP workloads
      • removing all but necessary OpenCL I/O can have a dramatic influence on performance
      • for loosely coupled high-throughput applications, the OpenCL and OpenMP performance add up for single algorithms
      • for multiple algorithms, the combined performance across all algorithms is better than the sum of the device performances
      • APUs may provide greatest performance per Watt
      • GPUs may provide greatest performance
  • 29. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
 The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
 AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
 AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
 ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.