PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe
 

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe at the AMD Developer Summit (APU13) November 11-13, 2013.



    Presentation Transcript

    • OpenACC on AMD GPUs and APUs with the PGI Accelerator Compilers Michael Wolfe Michael.Wolfe@pgroup.com http://www.pgroup.com APU13 San Jose, November, 2013
    •  C, C++, Fortran compilers: optimizing, vectorizing, parallelizing  Graphical parallel tools: PGDBG debugger, PGPROF profiler  AMD, Intel, NVIDIA processors; PGI Unified Binary™ technology; Linux, MacOS, Windows; Visual Studio & Eclipse integration  PGI Accelerator support: OpenACC, CUDA Fortran  www.pgroup.com
    • SMP Parallel Programming
      for( i = 0; i < n; ++i )
          a[i] = sinf(b[i]) + cosf(c[i]);
    • SMP Parallel Programming
      #pragma omp parallel for private(i)
      for( i = 0; i < n; ++i )
          a[i] = sinf(b[i]) + cosf(c[i]);
      % pgcc -mp x.c …
    • AMD Radeon Block Diagram*  Multiple Compute Units  Vector Unit  Pipelining / Multithreading  Device Memory  Cache Hierarchy  SW-managed cache (LDS) *From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.
    • Heterogeneous Parallel Programming
      for( i = 0; i < n; ++i )
          a[i] = sinf(b[i]) + cosf(c[i]);
    • Heterogeneous Parallel Programming
      #pragma acc parallel loop private(i) pcopyin(b[0:n], c[0:n]) pcopyout(a[0:n])
      for( i = 0; i < n; ++i )
          a[i] = sinf(b[i]) + cosf(c[i]);
      % pgcc -acc -ta=radeon x.c
    • Overview  Parallel programming  GPU Architectural highlights  OpenACC 5 minute summary  PGI Implementation  Performance
    • Abstract CPU+Accelerator Target
    • Accelerator Architecture Features  Potentially separate memory (relatively small)  High bandwidth memory interface  Many degrees of parallelism  MIMD parallelism across many cores  SIMD parallelism within a core  Multithreading for latency tolerance  Asynchronous with host  Performance from Parallelism  slower clock, less ILP, simpler control unit, smaller caches
    • OpenACC Open Programming Standard for Parallel Computing “PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.” --Buddy Bland, Titan Project Director, Oak Ridge National Lab “OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.” --Michael Wong, CEO OpenMP Directives Board
    • OpenACC Overview  Directive-based  Parallel Computation  Data Management
      #pragma acc data copyin( a[0:n] ) copy( b[0:n] ) create( tmp[0:n] )
      {
          for( int i = 0; i < iters; ++i ){
              relax( a, b, tmp, n );
              relax( b, a, tmp, n );
          }
      }
      relax(float *x,float *y,float *t,int n){
          #pragma acc data present( x[0:n], y[0:n], t[0:n] )
          {
              #pragma acc parallel loop
              for( int j = 0; j < n; ++j )
                  t[j] = x[j];
              #pragma acc parallel loop
              for( int j = 1; j < n-1; ++j )
                  x[j] = 0.25f*(t[j-1]+t[j+1] + y[n-j+1] + y[n-j-1]);
          }
      }
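The data-region pattern on that slide is worth spelling out: the outer `acc data` region places the arrays on the device once, and the `present` clause inside `relax()` asserts they are already resident, so no transfer happens per call. The sketch below keeps that structure but substitutes a simple in-bounds stencil; the function names follow the slide, while the exact arithmetic is an illustrative assumption:

```c
/* Sketch of the slide's data-region structure. The outer data region in
   iterate() owns the device copies; 'present' inside relax() means
   "already on the device, do not transfer". Stencil arithmetic here is
   illustrative, not the slide's exact formula. */
void relax(float *restrict x, float *restrict y, float *restrict t, int n) {
    #pragma acc data present(x[0:n], y[0:n], t[0:n])
    {
        #pragma acc parallel loop
        for (int j = 0; j < n; ++j)
            t[j] = x[j];                 /* snapshot x into scratch t */
        #pragma acc parallel loop
        for (int j = 1; j < n - 1; ++j)  /* interior points only */
            x[j] = 0.25f * (t[j-1] + t[j+1] + y[j-1] + y[j+1]);
    }
}

void iterate(float *a, float *b, float *tmp, int n, int iters) {
    /* copyin: a to device only; copy: b both ways; create: tmp device-only */
    #pragma acc data copyin(a[0:n]) copy(b[0:n]) create(tmp[0:n])
    {
        for (int i = 0; i < iters; ++i) {
            relax(a, b, tmp, n);
            relax(b, a, tmp, n);
        }
    }
}
```

Without the enclosing data region, each `relax()` call would have to transfer all three arrays; hoisting the data movement out of the iteration loop is what makes directive-based offload competitive.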
    • OpenACC compared to OpenMP
      OpenACC: data parallelism              OpenMP: thread parallelism
      OpenACC: parallel per region           OpenMP: fixed number of threads
      OpenACC: flexible parallelism mapping  OpenMP: fixed parallel thread mapping
      OpenACC: structured parallelism        OpenMP: tasks and loops
      OpenACC: performance portability       OpenMP: ?
    • PGI OpenACC Implementation  C, C++, Fortran: pgcc, pgc++, pgfortran  Command line options: -acc, -ta=radeon, -ta=radeon,host, -ta=radeon,nvidia  Planner: maps program parallelism to hardware parallelism  Code Generator: OpenCL API  Runtime: initialization, data management, kernel launches
    • Planner  Maps parallel loops  OpenACC abstractions: gang, worker, vector  OpenCL abstractions: work group, work item  Hardware abstractions: wavefront
      #pragma acc parallel loop gang
      for( int j = 0; j < n; ++j ) t[j] = x[j];
      #pragma acc parallel loop gang vector
      for( int j = 0; j < n; ++j ) t[j] = x[j];
      #pragma acc kernels loop independent
      for( int j = 0; j < n; ++j ) t[j] = x[j];
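The mapping the planner performs can be illustrated with the index arithmetic OpenCL uses, assuming gang corresponds to a work-group and vector to a work-item within it; this toy sketch is an illustration, not PGI's actual code:

```c
/* Toy sketch of the gang/vector -> work-group/work-item correspondence:
   for a loop of n iterations scheduled 'gang vector' with vector length
   vlen, loop iteration j is handled by work-item local_id of work-group
   group_id such that j = group_id * vlen + local_id. */
int global_index(int group_id, int local_id, int vlen) {
    return group_id * vlen + local_id;   /* OpenCL's get_global_id(0) */
}

/* Gangs (work-groups) needed to cover n iterations: ceiling division */
int num_gangs(int n, int vlen) {
    return (n + vlen - 1) / vlen;
}
```

The same arithmetic explains why a `gang`-only schedule leaves vector lanes idle: every iteration then maps to a whole work-group rather than to individual work-items.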
    • Code Generator  Low-level OpenCL  “assembly code in C”  SPIR interface to AMD Radeon LLVM back-end  Uses non-standard features  device addresses
    • Runtime  Dynamically loads OpenCL library  Supports multiple devices  Multiple command queues  Host as a device (*)  Memory management  device addresses  bigbuffer(s) suballocation  Profiling support
    • Performance  AMD Piledriver 5800K: 4.0GHz, 2MB cache, 8 cores, single thread/core; OpenMP parallel; PGI 13.10 -fast -mp  AMD Radeon 7970: Tahiti, 925 MHz, 3GB memory, 32 compute units; OpenACC parallel; PGI 13.10 -fast -acc -ta=radeon:tahiti
    • Cloverleaf Mantevo Miniapp  Lagrangian-Eulerian hydrodynamics  compressible Euler equation solver in 2D  9500 lines of Fortran+C with OpenMP, OpenACC  Accelerating Hydrocodes with OpenACC, OpenCL and CUDA, Herdman et al, 2012 SC Companion DOI: 10.1109/SC.Companion.2012.66
    • Performance Results
      [Bar chart: results for Serial, OpenMP, R7970, and S10000 configurations across problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, and 1920^2x2955; y-axis 0 to 40]
    • OpenACC on AMD GPUs and APUs  OpenACC is designed for performance portability  PGI Accelerator compilers provide evidence  Target-specific tuning still underway  Open Beta compilers available now  Product version in January 2014
    • Copyright Notice © Contents copyright 2013, NVIDIA Corp. This material may not be reproduced in any manner without the expressed written permission of NVIDIA Corp.