Parallel Application Performance
Prediction of Using Analysis Based
Modeling
Mohammad Abu Obaida, Jason Liu,
Gopinath Chennupati, Nandakishore Santhi and Stephan Eidenbenz
SIGSIM-PADS’18, Rome, Italy, May 23, 2018
● Motivation and Related Work
● Automatic Performance Modeling
● Experiment Results
● Conclusions
Outline
2
HPC Performance Prediction
● HPC performance prediction provides insight about
○ Applications (e.g., scalability, performance variability)
○ Hardware/software (e.g., better design)
○ Workload behavior (present and future)
● Which is useful for —
○ Understanding application performance issues
○ Improving application and system, scalability
○ Budgeting, designing efficient systems (present and future)
3
Performance Prediction Challenges
4
1. New applications
2. New architectures (processors, GPUs, memory, interconnect)
3. New systems/tools
4. Scalability (applications/systems are getting larger)
Modeling techniques can help predict performance of
present and future applications, systems, and workloads
Existing Modeling Techniques
5
1. Analytical Performance Models (e.g., LogP, Loggp, Loggopsim)
○ Time expressed as mathematical formulas
○ Pros: simple, fast, flexible, scalable
○ Cons: low accuracy
2. Simulation (e.g., SST, TraceR, CODES, PPT)
○ Models detailed system and application behavior
○ Trace-driven and execution driven simulation
○ Pros: high accuracy, flexible, future architectures
○ Cons: slow, (typically) small in scale, complexity in building models
3. Hybrid Models (e.g., Aspen, Palm, Compass, Durango)
○ Combination of analytical, simulation and trace replaying
○ A choice of accuracy, flexibility vs speed
1. Aspen [Spafford and Vetter, SC’12]
○ Domain specific language to describe application and machine
○ Features analytical communication model
2. Compass [Lee, Meredith, Vetter, ICS’15]
○ Static analysis based automated model construction for Aspen
○ Built on Cetus compiler and OpenARC source transformation framework
○ High level communication abstraction functionality
3. Durango [Carothers et al., SIGSIM-PADS’17]
○ Combines Aspen and CODES
○ Parameterized application model w/ computation events
○ Simulates Aspen generated models or traces on CODES
4. PPT-AMM [Chennupati et al., WSC’17]
○ Simulation based performance prediction
○ Uses static analysis of source code to build data availability models
Most Related to Our Work
6
Our Approach - PyPassT
We use Static Analysis Based HPC Simulation framework
1. Automatically builds model from source code
2. CPU plus cache/memory models
3. Mid-level GPU Model
4. Detailed communication model
a. Point-to-point and collective MPI operations
b. Preserve locality or spatial features
c. Abstracts data volume properties
7
● Motivation and Related Work
● Automatic Performance Modeling
● Experiment Results
● Conclusions
Outline
8
PyPassT Framework
● Purpose is to maintain accuracy and performance, flexibility, and
scalability; so as to allow studies of large-scale applications
● PyPassT:
○ An application model in {Py}thon
○ Generated automatically in compiler {Pass}es
○ Executed in PP{T} HPC simulation.
● Steps of an application performance analysis
○ Start with an application program
○ Statically analyze the program to build an abstract model
○ Transform into an executable model (encompassing CPU, GPU, and communication)
○ Run model with HPC Simulation (for performance prediction)
9
PyPassT Framework
10
PyPassT Static Analysis
● Static Analysis
○ Built on Compass/OpenARC/Cetus Compilers
○ Derive an abstract model for the application:
○ CPU Computation:
■ Obtain workload (flops) with OpenARC
■ Extra pass through Byfl (built on LLVM) to calculate data availability reuse
profile for memory/cache performance
○ GPU Computation:
■ Identify GPU kernels using OpenARC
■ Obtain workload (flops and memory loads/stores)
○ MPI Communication:
■ Point-to-point ops (sender, receiver, transfer size, domain)
■ Collective ops (root, size, operation, domain)
11
OpenACC
12
1. Standardization effort, defines a set of directives (#pragma acc ….)
a. Compiler uses these for parallel kernel transformation
b. Compute-intensive parallel regions (work-sharing loops) offloaded to GPU
2. Generates architecture dependent parallel code
3. Helps porting codes to a wide-variety of heterogeneous
a. HPC hardware platforms and architectures
b. GPU/Accelerator
We use OpenACC to find parallel regions for CPU/GPU
OpenACC Annotated Program
#pragma acc data copy(A), create(Anew)
#pragma acc parallel num_gangs(16) num_workers(32) … private(j)
{
...
#pragma acc loop gang
for( j = 1; j < n-1; j++) {
#pragma acc loop worker ….
for( i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
….
}
….
}
}
13
CPU Tasklist
iALU integer operations
fALU floating point operations (add/multiply)
fDIV floating point divisions
INTVEC Integer vector operations
VECTOR Integer vector operations
intranode Transfers within node (as memory access)
MEM_ACCESS cycles to move data through memory and cache
HITRATES Direct input of cache level hitrates
L1 Direct input of L1 accesses
L2 Direct input of L2accesses
L3,L4,L5,
RAM, mem
Direct input of higher cache and
memory accesses
CPU ops CPU operations
14
Memory Hierarchy: PPT-AMM [Chennupati et al., WSC’17]
● A parameterized model for computation performance prediction
● Uses reuse distance and profiles to estimate data availability
○ Patterns of hardware architecture-independent virtual memory accesses
● The reuse profile models different cache hierarchies
We use PPT-AMM to model computation more accurately
15
CPU Model Building
1. Produce memory trace w/ Byfl*
2. Transition probability from a BB to another
○ Calculated using LLVM coverage analysis
3. Calculate Probability of executing a BB
4. Calculate conditional reuse profile of a BB
5. Create and evaluate a PPT-AMM computation tasklist
6. Mimic computation with sleep
16
#stack_dist probability_sd
#3.0 0.357978254355
#7.0 0.141925649545
#11.0 0.053872280264
#12.0 1.44216666613e-05
#15.0 0.0577797511131
#19.0 0.00610221191651
#23.0 0.00814826082
tasklist_example = [['iALU', 8469], ['fALU', 89800], ['fDIV',6400], 
['MEM_ACCESS', 4, 10, 1, 1, 1, 10, 80, False, 
stack_dist, probability_sd, block_size, total_bytes, data_bus_width]]
CPU Model Example
17
alloc [host] m em ory allocations (in # of bytes)
unalloc [host] m em ory de-allocate
DEVICE_ALLO C Device allocations
DEVICE_TRANSFER Device transfers
KERNEL_CALL Call a G PU kernel with block/grid
iALU Integer operations L1 Direct input of L1 accesses
diALU Double precision integer operations L2 Direct input of L2 accesses
fALU Floating point operations (add/m ultiply) G LO B_M EM _ACCESS Access G PU on-chip global
m em ory
dfALU Double precision flop DEVICE_SYNC Synchronize G PU threads
fDIV Floating point divisions THREAD_SYNC Synchronize G PU threads w/CPU
SFU Special function calls
18
GPU Tasklist
● OpenARC provides
○ Memory-GPU transfers and vice versa
○ Loads
○ Stores
○ Flops
○ GPU block-size, grid-size
● Build GPU-warp tasklist from Compass generated IR
○ Evaluate with the desired hardware model
○ MPI rank sleep for the duration computation
GPU Model Building
19
GPU Kernel Analysis
#pragma acc kernels loop gang(16) worker(32) copy(m, n)
present(A[0:4096][0:4096], Anew[0:4096][0:4096]) private(i_0, j_0)
#pragma aspen control label(block_main50) loop((-2+n)) parallelism((-2+n))
for (j_0=0; j_0<=(-3+n); j_0 ++ ){
#pragma acc loop gang(16) worker(32)
#pragma aspen control label(block_main51) loop((-2+m)) parallelism((-2+m))
for (i_0=0; i_0<=(-3+m); i_0 ++ ){
#pragma aspen control execute label(block_main52)
loads((1*aspen_param_sizeof_double):from(Anew):traits(stride(1)))
stores((1*aspen_param_sizeof_double):to(A):traits(stride(1)))
A[(1+j_0)][(1+i_0)]=Anew[(1+j_0)][(1+i_0)];
}
}
20
Computation (CPU+GPU)
# accelerator warp instructions
GPU_WARP = [['GLOB_MEM_ACCESS'],
['GLOB_MEM_ACCESS'], ['L1_ACCESS'],
['fALU'], ['GLOB_MEM_ACCESS'],
['GLOB_MEM_ACCESS'], ['L1_ACCESS'], ['dfALU']]
# calling the ward with block size and grid size
CPU_tasklist = [['KERNEL_CALL', 0, GPU_WARP,
blocksize, gridsize,regcount],['DEVICE_SYNC', 0]]
# evaluate with hardware model and collect statistics
now = mpi_wtime(mpi_comm_world)
(time, stats) = core.time_compute(CPU_tasklist, now, True)
# sleep for the duration
mpi_ext_sleep(time) 21
Communication
● Convert IR of static analysis to PPT communication calls
● Source
○ MPI_Sendrecv( A[start], M, MPI_FLOAT, top , 0, A[end], M, MPI_FLOAT,
bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
○ MPI_Barrier( MPI_COMM_WORLD );
● Target
if (my_rank==0) {
execute "block_stencil1d59" {
#mpi send/recv
messages [MAX_STRING] to (my_rank+1) as send
messages [MAX_STRING] from (my_rank+1) as recv
}
}
execute "block_stencil1d66" {
messages [MPI_COMM_WORLD] as barrier
} 22
● MPI_Send, MPI_Recv,MPI_Sendrecv
● MPI_Isend, MPI_Irecv
● MPI_Wait, MPI_Waitall,
● MPI_Reduce, MPI_Allreduce
● MPI_Bcast, MPI_Barrier
● MPI_Gather, MPI_Allgather
● MPI_Scatter
● MPI_Alltoall
● MPI_Alltoallv
Once the application model is ready, it is evaluated using HPC simulation.
23
Supported MPI Operations
● Launch application on the Performance Prediction Toolkit (PPT)
PPT is HPC system and application simulator allows rapid prototyping of parallel
applications at a sufficiently detailed level.
● Collaborative project between LANL and FIU
● Developed based on process oriented parallel simulator Simian
● PPT Features
1. Hardware models (processor, memory, GPU)
2. Full-fledged MPI model
3. Detailed interconnection model
4. Large-Scale Workload Model
Application Simulation
24
PPT Overview
25
Open-sourced at: https://github.com /lanl/PPT
[ppt-torus] Kishwar Ahm ed, M oham m ad Obaida, Jason Liu, Stephan Eidenbenz, Nandakishore Santhi, Guillaum e Chapuis, An Integrated Interconnection Network
M odel for Large-Scale Perform ance Prediction, ACM SIGSIM Conference on Principles of Advanced Discrete Sim ulation (PADS 2016), M ay 2016.
● Motivation and Related Work
● Automatic Performance Modeling
● Experiment Results
● Conclusions
Outline
26
Runtime Predictions
● Application
○ Laplace2D MM
○ 512x512 to 4096x4096
● System / PPT-CAMM config
○ Run with 1 core(taskset)
○ 2x 8-core E5-2450 @2.1GHz
○ 48GB shared memory
○ Optimized (O3) & unoptimized flags
○ L1/L2/L3=32K/256K/12M.
● Results
○ Collect runtime, flops, loads, stores
○ 7.08% -- average error optimized
○ 3.12% -- average error unoptimized
27
28
● Computation
○ Result – No error
● Memory
○ ~1% error
Resource Usage
GPU Performance Prediction
● Application : Laplace2D MM
● Mesh Sizes: 1024x1024 to 8192x8192
● BlockSize: 16x1x1, GridSize: 32x1x1
○ #pragma acc parallel num gangs(16)num
workers(32)
Machine Config
● Two 8-core Xeon E5-5645 @2.1GHz
● 48GB shared memory
● NVIDIA Geforce GM 204
○ 1050 Hz, 4GB GDDR5
Results:
● 0.16% error for 8192x8192
● 13.8% error for 1024x1024
● Lower error for larger GPU kernels
29
Communication Prediction
● Application
○ Jacobi Iterative Method
○ 2048x2048 matrices
● Expt. Cluster
○ Grizzly, LANL, 1.79 PFlops
○ 53K cores, Omni-path
○ Message size: 8KB
● PyPassT Machine Config
○ Two 8-core Xeon E5-5645 @2.1GHz
○ 48GB shared memory
Results
● Perfect representation of total bytes
between src-dest pair actual/predicted
30
Communication Prediction(2)
31
Results
● Perfect matching of number of
packets for every src-dst pair
between actual runs and
predicted
● PyPassT automatically builds application model
● Combining with rapid HPC modeling tools
● Producing fast predictions with accuracy
Summary of Contributions —
32
Future Work:
• Study scalability of applications
• Apply dynamic analysis and ML for irregular applications
Thank You

Parallel Application Performance Prediction of Using Analysis Based Modeling

  • 1.
    Parallel Application Performance Predictionof Using Analysis Based Modeling Mohammad Abu Obaida, Jason Liu, Gopinath Chennupati, Nandakishore Santhi and Stephan Eidenbenz SIGSIM-PADS’18, Rome, Italy, May 23, 2018
  • 2.
    ● Motivation andRelated Work ● Automatic Performance Modeling ● Experiment Results ● Conclusions Outline 2
  • 3.
    HPC Performance Prediction ●HPC performance prediction provides insight about ○ Applications (e.g., scalability, performance variability) ○ Hardware/software (e.g., better design) ○ Workload behavior (present and future) ● Which is useful for — ○ Understanding application performance issues ○ Improving application and system, scalability ○ Budgeting, designing efficient systems (present and future) 3
  • 4.
    Performance Prediction Challenges 4 1.New applications 2. New architectures (processors, GPUs, memory, interconnect) 3. New systems/tools 4. Scalability (applications/systems are getting larger) Modeling techniques can help predict performance of present and future applications, systems, and workloads
  • 5.
    Existing Modeling Techniques 5 1.Analytical Performance Models (e.g., LogP, Loggp, Loggopsim) ○ Time expressed as mathematical formulas ○ Pros: simple, fast, flexible, scalable ○ Cons: low accuracy 2. Simulation (e.g., SST, TraceR, CODES, PPT) ○ Models detailed system and application behavior ○ Trace-driven and execution driven simulation ○ Pros: high accuracy, flexible, future architectures ○ Cons: slow, (typically) small in scale, complexity in building models 3. Hybrid Models (e.g., Aspen, Palm, Compass, Durango) ○ Combination of analytical, simulation and trace replaying ○ A choice of accuracy, flexibility vs speed
  • 6.
    1. Aspen [Spaffordand Vetter, SC’12] ○ Domain specific language to describe application and machine ○ Features analytical communication model 2. Compass [Lee, Meredith, Vetter, ICS’15] ○ Static analysis based automated model construction for Aspen ○ Built on Cetus compiler and OpenARC source transformation framework ○ High level communication abstraction functionality 3. Durango [Carothers et al., SIGSIM-PADS’17] ○ Combines Aspen and CODES ○ Parameterized application model w/ computation events ○ Simulates Aspen generated models or traces on CODES 4. PPT-AMM [Chennupati et al., WSC’17] ○ Simulation based performance prediction ○ Uses static analysis of source code to build data availability models Most Related to Our Work 6
  • 7.
    Our Approach -PyPassT We use Static Analysis Based HPC Simulation framework 1. Automatically builds model from source code 2. CPU plus cache/memory models 3. Mid-level GPU Model 4. Detailed communication model a. Point-to-point and collective MPI operations b. Preserve locality or spatial features c. Abstracts data volume properties 7
  • 8.
    ● Motivation andRelated Work ● Automatic Performance Modeling ● Experiment Results ● Conclusions Outline 8
  • 9.
    PyPassT Framework ● Purposeis to maintain accuracy and performance, flexibility, and scalability; so as to allow studies of large-scale applications ● PyPassT: ○ An application model in {Py}thon ○ Generated automatically in compiler {Pass}es ○ Executed in PP{T} HPC simulation. ● Steps of an application performance analysis ○ Start with an application program ○ Statically analyze the program to build an abstract model ○ Transform into an executable model (encompassing CPU, GPU, and communication) ○ Run model with HPC Simulation (for performance prediction) 9
  • 10.
  • 11.
    PyPassT Static Analysis ●Static Analysis ○ Built on Compass/OpenARC/Cetus Compilers ○ Derive an abstract model for the application: ○ CPU Computation: ■ Obtain workload (flops) with OpenARC ■ Extra pass through Byfl (built on LLVM) to calculate data availability reuse profile for memory/cache performance ○ GPU Computation: ■ Identify GPU kernels using OpenARC ■ Obtain workload (flops and memory loads/stores) ○ MPI Communication: ■ Point-to-point ops (sender, receiver, transfer size, domain) ■ Collective ops (root, size, operation, domain) 11
  • 12.
    OpenACC 12 1. Standardization effort,defines a set of directives (#pragma acc ….) a. Compiler uses these for parallel kernel transformation b. Compute-intensive parallel regions (work-sharing loops) offloaded to GPU 2. Generates architecture dependent parallel code 3. Helps porting codes to a wide-variety of heterogeneous a. HPC hardware platforms and architectures b. GPU/Accelerator We use OpenACC to find parallel regions for CPU/GPU
  • 13.
    OpenACC Annotated Program #pragmaacc data copy(A), create(Anew) #pragma acc parallel num_gangs(16) num_workers(32) … private(j) { ... #pragma acc loop gang for( j = 1; j < n-1; j++) { #pragma acc loop worker …. for( i = 1; i < m-1; i++ ) { Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]); …. } …. } } 13
  • 14.
    CPU Tasklist iALU integeroperations fALU floating point operations (add/multiply) fDIV floating point divisions INTVEC Integer vector operations VECTOR Integer vector operations intranode Transfers within node (as memory access) MEM_ACCESS cycles to move data through memory and cache HITRATES Direct input of cache level hitrates L1 Direct input of L1 accesses L2 Direct input of L2accesses L3,L4,L5, RAM, mem Direct input of higher cache and memory accesses CPU ops CPU operations 14
  • 15.
    Memory Hierarchy: PPT-AMM[Chennupati et al., WSC’17] ● A parameterized model for computation performance prediction ● Uses reuse distance and profiles to estimate data availability ○ Patterns of hardware architecture-independent virtual memory accesses ● The reuse profile models different cache hierarchies We use PPT-AMM to model computation more accurately 15
  • 16.
    CPU Model Building 1.Produce memory trace w/ Byfl* 2. Transition probability from a BB to another ○ Calculated using LLVM coverage analysis 3. Calculate Probability of executing a BB 4. Calculate conditional reuse profile of a BB 5. Create and evaluate a PPT-AMM computation tasklist 6. Mimic computation with sleep 16
  • 17.
    #stack_dist probability_sd #3.0 0.357978254355 #7.00.141925649545 #11.0 0.053872280264 #12.0 1.44216666613e-05 #15.0 0.0577797511131 #19.0 0.00610221191651 #23.0 0.00814826082 tasklist_example = [['iALU', 8469], ['fALU', 89800], ['fDIV',6400], ['MEM_ACCESS', 4, 10, 1, 1, 1, 10, 80, False, stack_dist, probability_sd, block_size, total_bytes, data_bus_width]] CPU Model Example 17
  • 18.
    alloc [host] mem ory allocations (in # of bytes) unalloc [host] m em ory de-allocate DEVICE_ALLO C Device allocations DEVICE_TRANSFER Device transfers KERNEL_CALL Call a G PU kernel with block/grid iALU Integer operations L1 Direct input of L1 accesses diALU Double precision integer operations L2 Direct input of L2 accesses fALU Floating point operations (add/m ultiply) G LO B_M EM _ACCESS Access G PU on-chip global m em ory dfALU Double precision flop DEVICE_SYNC Synchronize G PU threads fDIV Floating point divisions THREAD_SYNC Synchronize G PU threads w/CPU SFU Special function calls 18 GPU Tasklist
  • 19.
    ● OpenARC provides ○Memory-GPU transfers and vice versa ○ Loads ○ Stores ○ Flops ○ GPU block-size, grid-size ● Build GPU-warp tasklist from Compass generated IR ○ Evaluate with the desired hardware model ○ MPI rank sleep for the duration computation GPU Model Building 19
  • 20.
    GPU Kernel Analysis #pragmaacc kernels loop gang(16) worker(32) copy(m, n) present(A[0:4096][0:4096], Anew[0:4096][0:4096]) private(i_0, j_0) #pragma aspen control label(block_main50) loop((-2+n)) parallelism((-2+n)) for (j_0=0; j_0<=(-3+n); j_0 ++ ){ #pragma acc loop gang(16) worker(32) #pragma aspen control label(block_main51) loop((-2+m)) parallelism((-2+m)) for (i_0=0; i_0<=(-3+m); i_0 ++ ){ #pragma aspen control execute label(block_main52) loads((1*aspen_param_sizeof_double):from(Anew):traits(stride(1))) stores((1*aspen_param_sizeof_double):to(A):traits(stride(1))) A[(1+j_0)][(1+i_0)]=Anew[(1+j_0)][(1+i_0)]; } } 20
  • 21.
    Computation (CPU+GPU) # acceleratorwarp instructions GPU_WARP = [['GLOB_MEM_ACCESS'], ['GLOB_MEM_ACCESS'], ['L1_ACCESS'], ['fALU'], ['GLOB_MEM_ACCESS'], ['GLOB_MEM_ACCESS'], ['L1_ACCESS'], ['dfALU']] # calling the ward with block size and grid size CPU_tasklist = [['KERNEL_CALL', 0, GPU_WARP, blocksize, gridsize,regcount],['DEVICE_SYNC', 0]] # evaluate with hardware model and collect statistics now = mpi_wtime(mpi_comm_world) (time, stats) = core.time_compute(CPU_tasklist, now, True) # sleep for the duration mpi_ext_sleep(time) 21
  • 22.
    Communication ● Convert IRof static analysis to PPT communication calls ● Source ○ MPI_Sendrecv( A[start], M, MPI_FLOAT, top , 0, A[end], M, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ); ○ MPI_Barrier( MPI_COMM_WORLD ); ● Target if (my_rank==0) { execute "block_stencil1d59" { #mpi send/recv messages [MAX_STRING] to (my_rank+1) as send messages [MAX_STRING] from (my_rank+1) as recv } } execute "block_stencil1d66" { messages [MPI_COMM_WORLD] as barrier } 22
  • 23.
    ● MPI_Send, MPI_Recv,MPI_Sendrecv ●MPI_Isend, MPI_Irecv ● MPI_Wait, MPI_Waitall, ● MPI_Reduce, MPI_Allreduce ● MPI_Bcast, MPI_Barrier ● MPI_Gather, MPI_Allgather ● MPI_Scatter ● MPI_Alltoall ● MPI_Alltoallv Once the application model is ready, it is evaluated using HPC simulation. 23 Supported MPI Operations
  • 24.
    ● Launch applicationon the Performance Prediction Toolkit (PPT) PPT is HPC system and application simulator allows rapid prototyping of parallel applications at a sufficiently detailed level. ● Collaborative project between LANL and FIU ● Developed based on process oriented parallel simulator Simian ● PPT Features 1. Hardware models (processor, memory, GPU) 2. Full-fledged MPI model 3. Detailed interconnection model 4. Large-Scale Workload Model Application Simulation 24
  • 25.
    PPT Overview 25 Open-sourced at:https://github.com /lanl/PPT [ppt-torus] Kishwar Ahm ed, M oham m ad Obaida, Jason Liu, Stephan Eidenbenz, Nandakishore Santhi, Guillaum e Chapuis, An Integrated Interconnection Network M odel for Large-Scale Perform ance Prediction, ACM SIGSIM Conference on Principles of Advanced Discrete Sim ulation (PADS 2016), M ay 2016.
  • 26.
    ● Motivation andRelated Work ● Automatic Performance Modeling ● Experiment Results ● Conclusions Outline 26
  • 27.
    Runtime Predictions ● Application ○Laplace2D MM ○ 512x512 to 4096x4096 ● System / PPT-CAMM config ○ Run with 1 core(taskset) ○ 2x 8-core E5-2450 @2.1GHz ○ 48GB shared memory ○ Optimized (O3) & unoptimized flags ○ L1/L2/L3=32K/256K/12M. ● Results ○ Collect runtime, flops, loads, stores ○ 7.08% -- average error optimized ○ 3.12% -- average error unoptimized 27
  • 28.
    28 ● Computation ○ Result– No error ● Memory ○ ~1% error Resource Usage
  • 29.
    GPU Performance Prediction ●Application : Laplace2D MM ● Mesh Sizes: 1024x1024 to 8192x8192 ● BlockSize: 16x1x1, GridSize: 32x1x1 ○ #pragma acc parallel num gangs(16)num workers(32) Machine Config ● Two 8-core Xeon E5-5645 @2.1GHz ● 48GB shared memory ● NVIDIA Geforce GM 204 ○ 1050 Hz, 4GB GDDR5 Results: ● 0.16% error for 8192x8192 ● 13.8% error for 1024x1024 ● Lower error for larger GPU kernels 29
  • 30.
    Communication Prediction ● Application ○Jacobi Iterative Method ○ 2048x2048 matrices ● Expt. Cluster ○ Grizzly, LANL, 1.79 PFlops ○ 53K cores, Omni-path ○ Message size: 8KB ● PyPassT Machine Config ○ Two 8-core Xeon E5-5645 @2.1GHz ○ 48GB shared memory Results ● Perfect representation of total bytes between src-dest pair actual/predicted 30
  • 31.
    Communication Prediction(2) 31 Results ● Perfectmatching of number of packets for every src-dst pair between actual runs and predicted
  • 32.
    ● PyPassT automaticallybuilds application model ● Combining with rapid HPC modeling tools ● Producing fast predictions with accuracy Summary of Contributions — 32 Future Work: • Study scalability of applications • Apply dynamic analysis and ML for irregular applications
  • 33.