PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by Michael Wolfe at the AMD Developer Summit (APU13) November 11-13, 2013.

  1. OpenACC on AMD GPUs and APUs with the PGI Accelerator Compilers
     Michael Wolfe, Michael.Wolfe@pgroup.com
     http://www.pgroup.com
     APU13, San Jose, November 2013
  2. - C, C++, Fortran compilers: optimizing, vectorizing, parallelizing
     - Graphical parallel tools: PGDBG debugger, PGPROF profiler
     - AMD, Intel, NVIDIA processors
     - PGI Unified Binary™ technology
     - Linux, MacOS, Windows
     - Visual Studio & Eclipse integration
     - PGI Accelerator support: OpenACC, CUDA Fortran
     www.pgroup.com
  3. SMP Parallel Programming

     for( i = 0; i < n; ++i )
         a[i] = sinf(b[i]) + cosf(c[i]);
  4. SMP Parallel Programming

     #pragma omp parallel for private(i)
     for( i = 0; i < n; ++i )
         a[i] = sinf(b[i]) + cosf(c[i]);

     % pgcc -mp x.c …
  5. AMD Radeon Block Diagram*
     - Multiple Compute Units
     - Vector Unit
     - Pipelining / Multithreading
     - Device Memory
     - Cache Hierarchy
     - SW-managed cache (LDS)

     *From "AMD Accelerated Parallel Processing – OpenCL Programming Guide",
     © 2012 Advanced Micro Devices, Inc.
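     The LDS mentioned above is exposed in OpenCL as __local memory. As an
     illustration (mine, not from the slides), here is a minimal OpenCL kernel
     that stages a tile of data in LDS before computing; the kernel name, the
     64-wide work-group, and the requirement that n be a multiple of the
     work-group size are all assumptions:

     /* Illustrative OpenCL kernel: stage data in __local memory (LDS).
        Assumes work-group size 64 (one Radeon wavefront) and that n is a
        multiple of the work-group size. */
     __kernel void smooth(__global const float *in, __global float *out, int n)
     {
         __local float tile[66];                 /* 64 elements + 2 halo slots */
         int gid = get_global_id(0);
         int lid = get_local_id(0);
         tile[lid+1] = in[gid];                  /* each work-item loads one element */
         if( lid == 0 )
             tile[0] = (gid > 0) ? in[gid-1] : 0.0f;        /* left halo */
         if( lid == get_local_size(0)-1 )
             tile[lid+2] = (gid < n-1) ? in[gid+1] : 0.0f;  /* right halo */
         barrier(CLK_LOCAL_MEM_FENCE);           /* wait until the tile is full */
         out[gid] = 0.25f*tile[lid] + 0.5f*tile[lid+1] + 0.25f*tile[lid+2];
     }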
  6. Heterogeneous Parallel Programming

     for( i = 0; i < n; ++i )
         a[i] = sinf(b[i]) + cosf(c[i]);
  7. Heterogeneous Parallel Programming

     #pragma acc parallel loop private(i) pcopyin(b[0:n], c[0:n]) pcopyout(a[0:n])
     for( i = 0; i < n; ++i )
         a[i] = sinf(b[i]) + cosf(c[i]);

     % pgcc -acc -ta=radeon x.c
  8. Overview
     - Parallel programming
     - GPU architectural highlights
     - OpenACC 5-minute summary
     - PGI implementation
     - Performance
  9. Abstract CPU+Accelerator Target
  10. Accelerator Architecture Features
      - Potentially separate memory (relatively small)
      - High-bandwidth memory interface
      - Many degrees of parallelism
        - MIMD parallelism across many cores
        - SIMD parallelism within a core
        - Multithreading for latency tolerance
      - Asynchronous with host (see the sketch after this list)
      - Performance from parallelism: slower clock, less ILP, simpler control unit, smaller caches
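      The "asynchronous with host" point is worth a concrete sketch. The
      following is my illustration, not from the talk: it uses the standard
      OpenACC async/wait clauses so the host overlaps its own work with the
      accelerator; do_host_work is a hypothetical placeholder:

      #include <math.h>
      #define N 1000000
      float a[N], b[N];
      void do_host_work(void){ /* placeholder for overlapped CPU-side work */ }

      void compute(void)
      {
          /* launch on async queue 1; control returns to the host immediately */
          #pragma acc parallel loop async(1) pcopyin(b[0:N]) pcopyout(a[0:N])
          for( int i = 0; i < N; ++i )
              a[i] = sinf(b[i]) + cosf(b[i]);
          do_host_work();       /* runs on the CPU while the device computes */
          #pragma acc wait(1)   /* rejoin before a[] is used on the host */
      }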
  11. OpenACC: Open Programming Standard for Parallel Computing

      "PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan."
      -- Buddy Bland, Titan Project Director, Oak Ridge National Lab

      "OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP."
      -- Michael Wong, CEO OpenMP Directives Board
  12. OpenACC Overview
      - Directive-based
      - Parallel computation
      - Data management

      #pragma acc data copyin( a[0:n] ) copy( b[0:n] ) create( tmp[0:n] )
      {
          for( int i = 0; i < iters; ++i ){
              relax( a, b, tmp, n );
              relax( b, a, tmp, n );
          }
      }

      void relax( float *x, float *y, float *t, int n ){
          #pragma acc data present( x[0:n], y[0:n], t[0:n] )
          {
              #pragma acc parallel loop
              for( int j = 0; j < n; ++j )
                  t[j] = x[j];
              #pragma acc parallel loop
              for( int j = 1; j < n-1; ++j )
                  x[j] = 0.25f*( t[j-1] + t[j+1] + y[j+1] + y[j-1] );
          }
      }
  13. OpenACC compared to OpenMP

      OpenACC                      OpenMP
      -------------------------    -----------------------------
      Data parallelism             Thread parallelism
      Parallel per region          Fixed number of threads
      Flexible parallel mapping    Fixed parallel thread mapping
      Structured parallelism       Tasks and loops
      Performance portability      ?
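      To make the comparison concrete, here is the same loop from slides 4
      and 7 under each model; the directives differ, and in the OpenACC case
      the compiler, not a fixed thread team, chooses the mapping:

      /* OpenMP: a fixed team of CPU threads splits the iterations */
      #pragma omp parallel for private(i)
      for( i = 0; i < n; ++i )
          a[i] = sinf(b[i]) + cosf(c[i]);

      /* OpenACC: the same loop, mapped flexibly onto gangs/workers/vectors */
      #pragma acc parallel loop pcopyin(b[0:n], c[0:n]) pcopyout(a[0:n])
      for( i = 0; i < n; ++i )
          a[i] = sinf(b[i]) + cosf(c[i]);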
  14. PGI OpenACC Implementation
      - C, C++, Fortran: pgcc, pgc++, pgfortran
      - Command-line options: -acc, -ta=radeon, -ta=radeon,host, -ta=radeon,nvidia
      - Planner: maps program parallelism to hardware parallelism
      - Code generator: OpenCL API
      - Runtime: initialization, data management, kernel launches
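      As an illustration of how the -ta targets above might combine, following
      the compile-line convention of slides 4 and 7 (the ACC_DEVICE_TYPE
      variable is from the OpenACC spec; exact PGI behavior may differ):

      % pgcc -acc -ta=radeon,host x.c    # unified binary: Radeon code plus host fallback
      % ACC_DEVICE_TYPE=host ./a.out     # select the host "device" at run time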
  15. Planner: maps parallel loops
      - OpenACC abstractions: gang, worker, vector
      - OpenCL abstractions: work-group, work-item
      - Hardware abstractions: wavefront

      #pragma acc parallel loop gang
      for( int j = 0; j < n; ++j )
          t[j] = x[j];

      #pragma acc parallel loop gang vector
      for( int j = 0; j < n; ++j )
          t[j] = x[j];

      #pragma acc kernels loop independent
      for( int j = 0; j < n; ++j )
          t[j] = x[j];
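      The planner normally chooses these mappings automatically, but OpenACC
      also lets the programmer pin them down. A hedged example with
      illustrative sizes (not PGI defaults): a 64-wide vector length matches
      the Radeon wavefront noted above:

      /* explicit mapping: 256 gangs of 64-wide vectors; 64 = one wavefront */
      #pragma acc parallel loop gang vector num_gangs(256) vector_length(64)
      for( int j = 0; j < n; ++j )
          t[j] = x[j];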
  16. Code Generator
      - Low-level OpenCL: "assembly code in C"
      - SPIR interface to the AMD Radeon LLVM back-end
      - Uses non-standard features: device addresses
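      For a sense of what "assembly code in C" means, here is a sketch of the
      kind of OpenCL kernel such a generator might emit for the loop on
      slide 7. This is my illustration, not actual PGI output; note that
      OpenCL C spells the single-precision math functions sin/cos:

      /* hypothetical generated kernel for: a[i] = sinf(b[i]) + cosf(c[i]) */
      __kernel void acc_kernel_7(__global float *a, __global const float *b,
                                 __global const float *c, int n)
      {
          int i = get_global_id(0);    /* one work-item per loop iteration */
          if( i < n )                  /* guard: global size is rounded up */
              a[i] = sin(b[i]) + cos(c[i]);
      }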
  17. Runtime
      - Dynamically loads the OpenCL library
      - Supports multiple devices: multiple command queues, host as a device (*)
      - Memory management: device addresses, bigbuffer(s) suballocation
      - Profiling support
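      "Dynamically loads the OpenCL library" means the compiled program has no
      hard link-time dependency on libOpenCL. A minimal sketch of that
      technique on Linux (my illustration; the real runtime resolves many more
      entry points), built with: cc x.c -ldl

      #include <dlfcn.h>
      #include <stdio.h>

      /* cl_int/cl_uint are 32-bit; plain int/unsigned match on common ABIs */
      typedef int (*clGetPlatformIDs_t)(unsigned, void *, unsigned *);

      int main(void)
      {
          void *ocl = dlopen("libOpenCL.so", RTLD_NOW);
          if( !ocl ){ fprintf(stderr, "no OpenCL: %s\n", dlerror()); return 1; }
          clGetPlatformIDs_t getPlatformIDs =
              (clGetPlatformIDs_t)dlsym(ocl, "clGetPlatformIDs");
          unsigned nplat = 0;
          if( getPlatformIDs )
              getPlatformIDs(0, NULL, &nplat);    /* query platform count */
          printf("OpenCL platforms: %u\n", nplat);
          dlclose(ocl);
          return 0;
      }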
  18. Performance
      - AMD Piledriver 5800K: 4.0GHz, 2MB cache, 8 cores, single thread/core
        OpenMP parallel: PGI 13.10 -fast -mp
      - AMD Radeon 7970 (Tahiti): 925 MHz, 3GB memory, 32 compute units
        OpenACC parallel: PGI 13.10 -fast -acc -ta=radeon:tahiti
  19. CloverLeaf Mantevo Mini-app
      - Lagrangian-Eulerian hydrodynamics
      - Compressible Euler equation solver in 2D
      - 9,500 lines of Fortran+C with OpenMP, OpenACC
      - "Accelerating Hydrocodes with OpenACC, OpenCL and CUDA", Herdman et al., 2012 SC Companion, DOI: 10.1109/SC.Companion.2012.66
  20. Performance Results
      [bar chart: CloverLeaf results for Serial, OpenMP, R7970, and S10000 configurations at problem sizes 960^2x87, 1920^2x87, 3840^2x87, 960^2x2955, and 1920^2x2955; y-axis scale 0-40]
  21. OpenACC on AMD GPUs and APUs
      - OpenACC is designed for performance portability
      - The PGI Accelerator compilers provide evidence
      - Target-specific tuning still underway
      - Open Beta compilers available now
      - Product version in January 2014
  22. Copyright Notice
      © Contents copyright 2013, NVIDIA Corp. This material may not be reproduced in any manner without the expressed written permission of NVIDIA Corp.
