PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, by Jean-Charles Vasnier

Presentation PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, by Jean-Charles Vasnier, at the AMD Developer Summit (APU13) November 11-13, 2013.


  1. PORTING AND OPTIMIZING OPENMP APPLICATIONS TO APU USING CAPS TOOLS
     Jean-Charles Vasnier, CAPS entreprise
  2. AGENDA
     • CAPS entreprise
     • OpenACC
     • CAPS Compilers
     • CAPS OpenMP Compiler for AMD APUs
       – Compiler analyses and code generation
       – Interactive report
     • Experimentations with benchmark applications
       – HydroC
     • Future work
  3. CAPS entreprise
  4. COMPANY PROFILE
     • Founded in 2002
       – Large expertise in processor micro-architecture and code generation
       – Spin-off of the French INRIA research lab
       – 30 employees
     • Mission: help customers leverage the performance of multi/manycore machines
       – Consulting & engineering services
       – CAPS OpenACC Compiler & toolchain
       – Training
     • Expanding sales worldwide
       – Resellers in the US and APAC (Exxact, Absoft, JCC Gimmick Ltd, Nodasys, …)
  5. CAPS ECOSYSTEM
     • Customers
     • Business partners
     • European R&D projects
  6. OpenACC
  7. OPENACC INITIATIVE
     • A CAPS, Cray, NVIDIA and PGI initiative
     • Open standard
     • A directive-based approach to programming heterogeneous manycore hardware for C and Fortran applications
     • http://www.openacc-standard.com
  8. DIRECTIVE-BASED PROGRAMMING (1)
     • Three ways of programming GPGPU applications:
       – Libraries: ready-to-use acceleration
       – Directives: quickly accelerate existing applications
       – Programming languages: maximum performance
  9. DIRECTIVE-BASED PROGRAMMING (2)
  10. EXECUTION MODEL
      • Among the bulk of computations executed by the CPU, some regions can be offloaded to hardware accelerators
        – Parallel regions
        – Kernels regions
      • The host is responsible for:
        – Allocating memory space on the accelerator
        – Initiating data transfers
        – Launching computations
        – Waiting for completion
        – Deallocating memory space
      • Accelerators execute the parallel regions:
        – Use work-sharing directives
        – Specify the level of parallelization
      (A minimal directive sketch follows below.)
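      The slide describes the host-directed model in the abstract; as a minimal sketch (written for this transcript, not taken from the deck), the single OpenACC kernels region below triggers all of the host duties listed above: device allocation, the copy-in of a and b, the kernel launch, the wait for completion, the copy-out of c, and the deallocation.

          /* Minimal OpenACC illustration of the execution model above. */
          void vector_add(const float *a, const float *b, float *c, int n)
          {
              #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
              for (int i = 0; i < n; ++i)
                  c[i] = a[i] + b[i];
          }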
  11. OPENACC EXECUTION MODEL
      • Host-controlled execution
      • Based on three parallelism levels:
        – Gangs: coarse grain
        – Workers: fine grain
        – Vectors: finest grain
      (Diagram: a device is made up of gangs, each gang containing workers operating on vectors. A loop-level sketch follows below.)
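      As a hand-written illustration (not from the deck) of how these three levels can be expressed on a loop nest, the clauses below spread the outer loop across gangs and the inner loop across workers and vector lanes; when the clauses are omitted, the compiler is free to pick its own mapping.

          /* Illustrative gang/worker/vector mapping of a loop nest. */
          void scale_matrix(float *m, int rows, int cols, float alpha)
          {
              #pragma acc parallel loop gang copy(m[0:rows*cols])
              for (int i = 0; i < rows; ++i) {
                  #pragma acc loop worker vector
                  for (int j = 0; j < cols; ++j)
                      m[i * cols + j] *= alpha;
              }
          }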
  12. CAPS Compilers
  13. OPENACC COMPILERS (1)
      • CAPS Compilers:
        – Source-to-source compilers
        – Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
      • PGI Accelerator:
        – Extension of the x86 PGI compiler
        – Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
      • Cray Compilers:
        – Provided with Cray systems only
  14. CAPS COMPILERS (2)
      Source-to-source compilers composed of three parts:
      • The directives (OpenACC or OpenHMPP)
        – Define the parts of the code to be accelerated
        – Indicate resource allocation and communication
        – Ensure portability
      • The toolchain
        – Helps build manycore applications
        – Includes compilers and target code generators
        – Insulates hardware-specific computations
        – Uses the hardware vendor's SDK
      • The runtime
        – Helps adapt to the platform configuration
        – Manages hardware resource availability
  15. CAPS COMPILERS (3)
      • Take the original application as input and generate another application source code as output
        – Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
      • Compile the entire hybrid application
      • Just prefix the original compilation line with capsmc to produce a hybrid application:
          $ capsmc gcc myprogram.c
          $ capsmc gfortran myprogram.f90
      • Compatible with:
        – GNU
        – Intel
        – Open64
        – Absoft
        – …
  16. CAPS COMPILERS (4)
      • CAPS Compilers drive all compilation passes
      • Host application compilation
        – Calls the traditional CPU compilers (gcc, ifort, …)
        – The CAPS Runtime is linked to the host part of the application
      • Device code production
        – According to the specified target (CUDA or OpenCL code generation)
        – A dynamic library is built
      (Diagram: C, C++ and Fortran frontends feed an extraction module that separates host code from codelets; an instrumentation module and the CPU compiler handle the host side, while CUDA/OpenCL generation and the vendor compilers handle the codelets, producing the executable (mybin.exe) plus the HWA code as a dynamic library, linked with the CAPS Runtime.)
  17. From OpenMP to OpenACC
  18. CAPS OPENMP COMPILER
      • Automatically turns OpenMP codes into OpenACC
      • Diagnoses compatibility issues and suggests code transformations
      • Builds accelerated versions based on CUDA or OpenCL
      • Works with all platforms:
        – AMD and NVIDIA GPUs
        – AMD APUs
        – Intel Xeon Phi
      (An illustrative before/after loop follows below.)
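      To make the OpenMP-to-OpenACC rewrite concrete, here is a hand-written sketch of the kind of transformation involved for a simple parallel loop; it is not output from the CAPS compiler, and the directives and data clauses the tool actually generates may differ.

          /* Original OpenMP loop (CPU threads). */
          void saxpy_omp(int n, float a, const float *x, float *y)
          {
              #pragma omp parallel for
              for (int i = 0; i < n; ++i)
                  y[i] = a * x[i] + y[i];
          }

          /* Equivalent OpenACC form of the same loop (accelerator offload). */
          void saxpy_acc(int n, float a, const float *x, float *y)
          {
              #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
              for (int i = 0; i < n; ++i)
                  y[i] = a * x[i] + y[i];
          }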
  19. CAPS OPENMP COMPILER OVERVIEW
      • Profiling → Analysis → Acceleration
  20. EXTENSION OF THE CAPS OPENACC COMPILER
      • Converts OpenMP codes into OpenACC
        – Examines OpenMP loop nests and checks their OpenACC compatibility
        – Diagnoses incompatibility issues and proposes advice
        – Builds an APU version based on OpenCL
      • Builds an interactive report
        – Based on the compiler's static and dynamic analyses
        – OpenMP-to-OpenACC kernels view and performance details of each region
        – Regions' in/out data and data dependencies between regions
        – Gives the user control over pushing kernels onto the GPU and managing data transfers
  21. OPENMP-BASED OPTIMIZATION PROCESS
      • Application with OpenMP directives → Instrumentation → traceable application
      • Execution → profiling report
      • Analysis → HTML interactive report
      • Generation → accelerated executable
  22. INSTRUMENTATION AND PROFILING PHASES
      • Code preprocessing and instrumentation
        – Identify the supported OpenMP regions: the parallel, for and parallel for constructs
        – Instrument the code to track data and measure kernel performance
      • Instrumented application execution
        – Based on the user's data set
        – Number of times an OpenMP region is executed
        – Region's reads and writes
        – Range of loop iterations
        – Region performance
  23. ANALYSIS PHASE
      • Generates an interactive HTML report based on the compiler's static and dynamic analyses
      • Metrics for each OpenMP region:
        – OpenACC compliance check
        – Computation density
        – Coalescing of data accesses
        – Estimated speed-up
        – Memory usage
        – Proposed GPU execution or native OpenMP execution
      • Data usage and data-dependency graph between regions
        – Determines when transfers are required between kernels
      • Lets the user modify the CPU/GPU execution and data-transfer policy
  24. HTML INTERACTIVE REPORT (1)
      • Get a regions overview in a snap!
      • Code view: from OpenMP to OpenACC directives
  25. HTML INTERACTIVE REPORT (2)
      • Performance details of each region
      • Analysis conclusions and portability diagnosis
  26. HTML INTERACTIVE REPORT (3)
      • Regions' inputs/outputs and data-dependency map
  27. HTML INTERACTIVE REPORT (4)
      • Take control!
        – Manually push kernels onto accelerators
        – Manage data transfers
  28. CODE GENERATION PHASE
      • Same as the CAPS OpenACC Compiler
        – Based on the analysis report
        – Generates OpenCL kernels from OpenACC
        – Automatic data updates to ensure memory coherency
      (An illustrative OpenCL kernel follows below.)
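      For intuition only, here is a hand-written OpenCL kernel corresponding to the SAXPY loop used in the earlier sketch; it is not what the CAPS compiler actually emits (generated kernels carry additional bookkeeping), just an illustration of the target form.

          /* Hand-written OpenCL kernel for the SAXPY loop shown earlier;
           * one work-item handles one loop iteration. */
          __kernel void saxpy_kernel(const int n, const float a,
                                     __global const float *x,
                                     __global float *y)
          {
              int i = get_global_id(0);
              if (i < n)
                  y[i] = a * x[i] + y[i];
          }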
  29. FEATURES
      • Diagnoses:
        – OpenACC compliance
        – Computational density
        – Data-access coalescing
        – Memory usage
        – Estimated speed-up
      • Automatic porting to AMD, NVIDIA or Intel accelerators
      • Accelerates execution or keeps the native OpenMP one
      • Gives users control over manual optimizations
  30. Application Experimentations
  31. HARDWARE AND SOFTWARE ENVIRONMENT
      • Linux system
        – AMD SDK 2.8
        – CAPS Compiler revision 50387
        – GCC 4.6.1
        – OpenMPI 1.6.4
      • Hardware
        – AMD A10-5800K APU with Radeon HD Graphics
  32. APPLICATIONS STATUS
      • The main objective is proof of concept, not performance
        – Performance limitations of the current version of the APU
      • HydroC
        – Most convincing demo
        – 1.3x speed-up by modifying the execution and transfer policy
  33. HYDROC HTML REPORT
  34. Future Work: C2PO
  35. C2PO MISSION STATEMENT
      • Guides you through the whole process of porting and tuning applications for manycore parallel systems
      • Combines various CAPS technologies in a modular toolchain:
        – Static and dynamic code analyzers
        – OpenMP-to-OpenACC code transformers
        – Kernel micro-bencher
        – Plugs into third-party tools: VTune, CUDA profiler
        – Uses the CAPS Compiler at the final stage to produce the manycore application
  36. C2PO PHASES
      1. Generation of an OpenACC skeleton from OpenMP or sequential code
      2. Hotspot detection and dataflow analysis; gives global and local advice on:
         – Data management/placement between kernels or regions
         – The first ten tips on kernel performance
         – Data coalescing, parallelism, gridification, loop order
      3. Lets you rapidly optimize kernel performance:
         – Extract functions, loops or annotated regions
         – Tune the kernel code following C2PO's advice
         – Replay it standalone with application data and measure the performance gain
         – Re-inject the optimized code into the application source code
      4. Use the CAPS Compilers to build for Intel Xeon Phi, NVIDIA or AMD GPUs
      (Diagram: user input → OpenACC skeleton generation → dataflow analysis → extract loops, functions, regions → fine-tune kernels)
  37. C2PO TOOL CHAIN
      (Diagram: sequential or OpenMP code goes through the OpenACC generator to produce OpenACC code (code skeleton generation); the data-movement analyzer and performance analyzer feed an interactive HTML report used for global tuning; kernels are extracted for local tuning with the µbencher and third-party profilers (VTune, CUDA profiler).)
  38. C2PO OPENACC GENERATION
      • From sequential or OpenMP code to a first parallelized code
        – Instrument the application and detect hotspots
        – Generate an OpenACC skeleton of kernels from the loops
        – Manage data transfers between kernels
      • A generated report contains:
        – Various performance metrics: kernel execution, memory reads and writes, potential performance gain
        – Data dependencies and usage between kernels
        – An OpenACC code view
  39. C2PO GLOBAL TUNING
      • Dynamic tracking of data in order to optimize its movement
        – Dynamically trace uploads and downloads at execution time
        – Detect potentially redundant data transfers
      • Example from the slide (it is difficult for the compiler to detect any CPU use of the data):
          #openacc data region
          // convergence loop
          for {
              Upload data()
              Kernels' calls()
              Download data()
          }
      • Possible advice: are the following parameters modified by the CPU between the downloads and uploads? If yes, insert an OpenACC data region with the non-modified parameters.
      (A C sketch of this transformation follows below.)
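      A minimal C sketch of this transformation (the solver, array names and update rule are hypothetical, chosen only to illustrate the advice): when the CPU does not touch u and f between iterations, hoisting an OpenACC data region around the convergence loop removes the per-iteration uploads and downloads.

          /* Hypothetical convergence loop: the data region keeps u[] and f[]
           * resident on the accelerator across all iterations instead of
           * re-transferring them every time around the loop. */
          void relax(float *u, const float *f, int n, int iters, float omega)
          {
              #pragma acc data copy(u[0:n]) copyin(f[0:n])
              {
                  for (int it = 0; it < iters; ++it) {
                      #pragma acc parallel loop present(u[0:n], f[0:n])
                      for (int i = 0; i < n; ++i)
                          u[i] = (1.0f - omega) * u[i] + omega * f[i];
                  }
              }
          }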
  40. C2PO TUNING PHASE
      • Microbenchmarking mechanism
        – Loops, functions and user-annotated regions are extracted into kernels
        – Apply optimizations
        – Replay the kernels with the original data set without running the whole application
        – Once tuned, inject the kernels back into the application source code
      • Apply performance analyzers from third-party tools (VTune, CUDA profiler)
        – Synthesizes raw metrics (hardware counters) linked to the source code
      (A hypothetical replay harness is sketched below.)
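      As a rough illustration of the replay idea (entirely hypothetical: the kernel body, capture file name and harness are assumptions made for this transcript, not C2PO's actual mechanism), an extracted kernel can be rebuilt as a standalone program fed with data captured from the original run and timed in isolation:

          #include <stdio.h>
          #include <stdlib.h>
          #include <time.h>

          /* Hypothetical extracted kernel: in the real workflow this body
           * would be the loop nest pulled out of the application. */
          static void kernel(float *u, int n)
          {
              for (int i = 0; i < n; ++i)
                  u[i] = 0.5f * u[i] + 1.0f;
          }

          /* Hypothetical replay harness: load the captured input, time the
           * kernel on its own, and report the elapsed time. */
          int main(void)
          {
              int n = 1 << 20;
              float *u = malloc(n * sizeof *u);
              FILE *f = fopen("kernel_input.bin", "rb");  /* assumed capture */
              if (!u || !f || fread(u, sizeof *u, n, f) != (size_t)n) {
                  fprintf(stderr, "missing or short capture file\n");
                  return 1;
              }
              fclose(f);

              clock_t t0 = clock();
              kernel(u, n);
              printf("kernel time: %.3f ms\n",
                     1000.0 * (clock() - t0) / CLOCKS_PER_SEC);
              free(u);
              return 0;
          }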
  41. C2PO OBJECTIVES AND BENEFITS
      • Keep a single OpenMP code base for various parallel manycore systems (GPUs, APUs, MIC)
      • Incrementally port and optimize codes in a modular way
      • Use an interactive compiler: advice from dynamic and static analyses at the source-code level
  42. THANK YOU FOR YOUR ATTENTION!
      Jean-Charles Vasnier, Sales Engineer, CAPS entreprise
      Phone: +1-865-227-6899
      Email: jvasnier@caps-entreprise.com
  43. GET PERFORMANCE IN NO TIME!
      • Bar chart: execution time in seconds for Hydro and Nbody, comparing the Original (OpenMP), Generated (auto) and Generated (tweaked) versions; reported values: 63.42, 45.698, 27.539, 23.417, 12.71 and 12.55 s.
      • Hydro: 2x speed-up (after user tuning)
      • Nbody: 6x speed-up in 3 clicks (fully automatic)
      • Measured on a dual Sandy Bridge E5-2687W with 32 GB RAM and a Kepler K20c driven by CUDA v5.0
