PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for Intel Knights Landing Processors

Ritu Arora, HPC Researcher at Texas Advanced Computing Center (TACC)

PRESENTED BY:
Interactive Code Adaptation Tool for Modernizing Applications for Intel Knights Landing Processors
PEARC Conference Series 2017
Wednesday, July 12th

Ritu Arora
Email: rauta@tacc.utexas.edu
Lars Koesterke
Email: lars@tacc.utexas.edu
Outline
Part 1: Motivation and a simple example of code adaptation using ICAT
    Memory management from within code
Part 2: Ritu will take over and show the process of adapting a real-world application using ICAT
    Decision-support and semi-automatic software reengineering
HPC is Constantly Evolving
The complexity of the architecture is increasing. We highlight two trends here:
1.  More cores, with more threads (Hyperthreads)
    •  Pure MPI is not enough. Threads are needed to
        1.  Make use of the Hyperthreads
        2.  Reduce the number of MPI tasks and the MPI communication
        3.  Allow for more memory per MPI task
2.  Knights Landing (KNL) with additional memory layers
    •  High-bandwidth memory (MCDRAM)
    •  May need additional memory management in code
Users are forced to constantly adapt their code. We are developing tools to help users:
1.  Parallelize code with MPI and OpenMP
    •  MPI and OpenMP for new users
    •  Upgrade to a hybrid setup (MPI + OpenMP)
    •  Memory management; compile and run code optimally (KNL-specific)
Stampede 2
Phase 1: 4200 KNL nodes
68 cores
Up to 272 Hyperthreads
96 GB DDR4 memory
16 GB MCDRAM
Most in Cache Mode (tag directories in quadrants)
Some in Flat Mode (tag directories in quadrants)
Special queues for other configurations
Phase 2: Skylake dual-socket nodes
40+ cores
80+ Hyperthreads
Fast interconnect (Fat-tree)
Fast file systems
We are in ‘Phase 1’. The KNLs will provide more than 50% of the available SUs.
Memory Hierarchy and Modes
•  Two main memory types
•  DDR4
•  MCDRAM
•  Three memory modes
•  Cache
•  Flat
•  Hybrid
•  Hybrid mode
•  Three choices
•  25% / 50% / 75%
•  4 GB / 8 GB / 12 GB

[Diagram: the three KNL memory modes. Cache mode: MCDRAM serves as a cache in front of DDR4. Flat mode: MCDRAM and DDR4 are both directly addressable. Hybrid mode: part of MCDRAM serves as a cache, the rest is directly addressable alongside DDR4.]
Memory Modes
Memory footprint           | Flat                                                                                             | Cache
< 16 GB                    | Optimal                                                                                          | Couple of percent lower
> 16 GB: many arrays       | May work well. May require code modifications (user identifies which arrays to store in MCDRAM) | May work well
> 16 GB: few large arrays  | Does not work if no array fits in the MCDRAM                                                     | May work well

Using the ‘Flat’ mode may require code modifications.
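For a working set below 16 GB, the ‘Flat’ mode can often be used without any code changes by binding the run to the MCDRAM NUMA node. A minimal sketch, assuming the common quadrant + flat configuration in which MCDRAM is exposed as NUMA node 1:

numactl --membind=1 ./a.out     # strict binding: allocations fail once the 16 GB of MCDRAM is exhausted
numactl --preferred=1 ./a.out   # prefer MCDRAM, spill over to DDR4 when it is full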
  
Cluster Modes
•  Each tile tracks an assigned range of memory addresses on behalf of all cores by maintaining a data structure called a tag directory
•  To maintain cache coherency, tile-to-tile and tile-to-memory communication is required
•  Cores that read or modify data must communicate with the tiles that
manage the memory associated with that data
•  Similarly, when cores need data from main memory, the tile(s) that
manage the associated addresses will communicate with the
memory controllers on behalf of those cores
•  The KNL can do this in multiple cluster modes: all-to-all, quadrant, and sub-NUMA-4 (or 2)
More information on clustering modes: https://portal.tacc.utexas.edu/user-guides/stampede#cluster-modes
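The cluster and memory mode a node is booted in can be checked from the command line; a hedged example (exact NUMA node counts depend on the configured modes):

numactl -H     # e.g., quadrant + flat typically exposes 2 NUMA nodes (DDR4 and MCDRAM), SNC-4 + flat exposes 8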
Why are Tools helpful?
Things we all need to understand and/or learn
1.  Concept of parallel execution or memory management
2.  Specific parallel paradigm:
1.  Shared vs. distributed memory - OpenMP vs. MPI
2.  When to manually modify the code for optimal memory placement
3.  Syntax and good coding practices
4.  Actual code modifications
Our tools help with syntax and the actual code modifications
1.  Select OpenMP or MPI and a simple toy code
2.  Modify the code interactively
3.  Test the modified code
4.  Inspect code to learn syntax and good coding practices
General Philosophy
We provide a tool and a number of simple code examples
Alternatively, the user may provide code
Profiling of code
    VTune and/or perf
    → list of routines, list of memory objects, etc.
Interaction with the user
    Identify what (parallelization, vectorization, or memory management) and where (routines, loops, allocation calls)
Code modification using bash scripts and the ROSE source-to-source compiler
Final code ready for compilation, testing, and inspection
Extra for KNL: recommendation for configuration modes
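As a hedged illustration of the profiling step above, hardware-counter data of the kind the tool relies on can be gathered with perf; the event list here is a generic example rather than the exact set used by the tool:

perf stat -e cache-misses,cache-references,instructions ./a.out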
Code Adaptation Using ICAT (1)
Example: Directing array allocation towards MCDRAM
Interactive tool
Works with example code or user code
Example shown in C and Fortran
The tool identifies statements like these:

Fortran:
real(8), dimension(:,:), allocatable :: x
…
allocate (x(n,m))

C:
double *x;
…
x = (double*)malloc(sizeof(double)*n*m);
Code Adaptation Using ICAT (2)
Modified code (pseudo code shown):

Fortran:
size_gb = 8 * n * m / (1024*1024*1024)
if (size_gb <= 16 .and. mcdram_is_available)
    !DEC$ ATTRIBUTES FASTMEM :: x
    allocate (x(n,m))

or, in C:
#include <hbwmalloc.h>
…
double *x;
// code for checking availability of HBM
if (fail)
    'normal allocate or malloc'   (reason: no HBM available)
else
    x = (double*)hbw_malloc(sizeof(double)*n*m);
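A minimal, self-contained C sketch of the fallback pattern in the pseudo code above, assuming the memkind library's hbwmalloc interface (hbw_check_available, hbw_malloc, hbw_free); the 16 GB threshold mirrors the pseudo code, the array dimensions are illustrative, and the program links with -lmemkind:

#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth memory interface */

/* Allocate n*m doubles in MCDRAM when it is present and the array fits,
   otherwise fall back to a regular DDR4 allocation. */
static double *alloc_array(size_t n, size_t m, int *in_mcdram)
{
    size_t bytes = sizeof(double) * n * m;
    /* hbw_check_available() returns 0 when high-bandwidth memory exists */
    if (hbw_check_available() == 0 && bytes <= 16UL * 1024 * 1024 * 1024) {
        *in_mcdram = 1;
        return (double *)hbw_malloc(bytes);   /* place the array in MCDRAM */
    }
    *in_mcdram = 0;
    return (double *)malloc(bytes);           /* no HBM available, or array too large */
}

static void free_array(double *x, int in_mcdram)
{
    if (in_mcdram)
        hbw_free(x);   /* must match hbw_malloc */
    else
        free(x);
}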
•  Users can easily test without ‘knowing’ all the syntax
•  Modified code ready for ‘cut-and-paste’
Switching Presenters
Common Questions Related to Code Modernization and Migration in the Context of the KNL Architecture
Related to Innovative Hardware and Supporting Software Stack
1.  How can I adapt my application to take advantage of the manycore processors and the multiple levels of the memory hierarchy?
2.  Which is the best memory mode for running an application on a KNL processor?
3.  What are the bandwidth-critical portions of an application's data?
4.  Which cluster mode is best for running an application on a KNL processor?
5.  How much effort is required to analyze and adapt an application for the KNL processors?
Interactive Code Adaptation Tool (ICAT) - Overview
•  Command-line tool that helps in decision-making, and in code modernization and migration
•  Runs the code, collects hardware-counter data, and determines some characteristics of the application (e.g., L1 cache misses)
•  Presents reports and, if needed, modifies source code
Test Case: Molecular Dynamics (MD) Code, C/OpenMP
ICAT in Action (1): Launching ICAT
ICAT in Action (2): Running in Different Advisor Modes
ICAT in Action (3): Memory Report
ICAT in Action (4): Cluster Mode Report
Locality of memory access: Quadrant, SNC-4, All-to-All
ICAT in Action (5): Code Adaptation Selectively Using MCDRAM
ICAT in Action (6): Code Adaptation Mode

Included header file:
#include <hbwmalloc.h>

Replaced the calls to malloc with hbw_malloc:
acc = (double*)hbw_malloc((nd*np)*sizeof(double));

Replaced the calls to free with hbw_free:
hbw_free(acc);
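After this transformation the code has to be compiled against the hbwmalloc header and linked with the memkind library; a hedged example build line for the C/OpenMP test case (the source file name is illustrative, and -I/-L paths may be needed if memkind is installed in a non-default location):

icc -qopenmp -O2 md_openmp.c -lmemkind -o md_openmp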
ICAT Future Work
•  Adding support for advanced vectorization
•  There are situations in which the static code analysis done by the compiler to find vectorization opportunities can be inconclusive. Users can add extra hints in the code to support explicit vectorization (a brief sketch follows this slide).
•  #pragma novector, #pragma ivdep
•  Supporting memory optimization
•  Improve memory alignment, scalar-to-vector conversions
•  Loop tiling, array-of-structures versus structure-of-arrays, recalculating versus accessing from memory
•  See references 3 and 4
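A minimal C sketch of the kind of hint involved, assuming the Intel compiler's #pragma ivdep; the function and variable names are illustrative:

#include <stddef.h>

/* Illustrative scatter update: the caller guarantees that idx[] contains no
   duplicate indices, so the loop iterations are independent. */
void scatter_add(double *y, const double *x, const int *idx, size_t n)
{
    /* ivdep tells the compiler to ignore assumed loop-carried dependences
       that it cannot disprove on its own, enabling vectorization. */
    #pragma ivdep
    for (size_t i = 0; i < n; i++)
        y[idx[i]] += x[i];
}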
Array of Structures Versus Structure of Arrays (AoS vs. SoA)
On architectures where performance can be significantly affected by the memory access pattern, it may be useful to consider whether AoS or SoA is the better layout
struct {
float r, g, b;
} AoS[N];
struct {
float r[N];
float g[N];
float b[N];
} SoA;
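A hedged C sketch of why the choice matters for the two declarations above: updating only the r component is a unit-stride, easily vectorized loop in the SoA layout, but a strided access in the AoS layout (N and the scale factor are illustrative):

#include <stddef.h>

#define N 1024

struct pixel  { float r, g, b; };
struct planes { float r[N]; float g[N]; float b[N]; };

/* AoS: consecutive r values are 3 floats apart (strided access). */
void scale_r_aos(struct pixel aos[N], float s)
{
    for (size_t i = 0; i < N; i++)
        aos[i].r *= s;
}

/* SoA: r values are contiguous, so the loop streams unit-stride data. */
void scale_r_soa(struct planes *soa, float s)
{
    for (size_t i = 0; i < N; i++)
        soa->r[i] *= s;
}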
References
1.  Knights Landing (KNL): 2nd Generation Intel® Xeon Phi™ Processor, Avinash Sodani: http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf
2.  Stampede User Guide: https://portal.tacc.utexas.edu/user-guides/stampede#cluster-modes
3.  Raghunandan Mathur, Hiroshi Matsuoka, Osamu Watanabe, Akihiro Musa, Ryusuke Egawa, and Hiroaki Kobayashi. 2015. A Case Study of Memory Optimization for Migration of a Plasmonics Simulation Application to SX-ACE. 2015 Third International Symposium on Computing and Networking (CANDAR), Sapporo, 2015, 521-527.
4.  Markus Kowarschik and Christian Weiß. 2002. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In Algorithms for Memory Hierarchies, Advanced Lectures, 213-232.
5.  Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors. 2016. Accessed March 10, 2017: https://colfaxresearch.com/knl-avx512/
Thank You!
We are grateful for the support received through:
•  NSF Grant # 1642396
•  NSF Grant # 1359304
•  TACC STAR Scholars program
•  Extreme Science and Engineering Discovery Environment
(XSEDE) - NSF grant # ACI-105357