Parallel application benchmarks are indispensable for evaluating and optimizing HPC software and hardware. However, it is challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, and reconfigure, and are often plainly inaccessible due to security or ownership concerns. This work contributes APPrime, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication and I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPrime benchmarks: they retain the original applications' performance characteristics, in particular their relative performance across platforms. The resulting benchmarks, already released online, are also much more compact and easier to port than the original applications.
http://dl.acm.org/citation.cfm?id=2745876
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
1. Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
Ye Jin, Xiaosong Ma, Mingliang Liu, Qing Liu, Jeremy Logan, Norbert Podhorszki, Jong Youl Choi, Scott Klasky
2. Systems and Applications More Complex
Powerful supercomputers
Large-scale codes from diverse scientific domains solving real-world problems
• Large number of nodes
• Deeper memory and I/O stack
• Heterogeneous architecture
• System-specific interconnect
3. More Choices for HPC Platforms
[Figure: spectrum of platform choices, from a single server, to a local cluster (10 Gb network), to an HPC center, to HPC in the cloud]
4. Performance Study Crucial
• Yet remains challenging
– Single-platform analysis resource-consuming
– Not to mention cross-platform
• Benchmarks more important than ever
– Evaluate machines
– Validate hardware/software design
– Select from candidate platforms
• Realistic benchmarks hard to find
5. Benchmarks and Generation Tools

Type | Example | Pros | Cons
Kernels | NPB, SPEC, Intel-MPI | Real, parametric | Simple, non-HPC, less flexible
Manually extracted | FlashIO, GTC-Bench | Realistic, parametric | Labor-intensive, easily obsolete
Trace-based | ScalaBenchGen | Automatic | Replay-based, non-parametric, platform-dependent
Specialized (I/O) | IOR, Skel | Parametric | I/O phase only
Automatic, full-application benchmark extraction?
6. Outline
• Motivation
• Our recent work related to automatic application benchmark extraction
– APPrime framework (SIGMETRICS '15 [21])
• Led by NCSU PhD student Ye Jin
• Collaboration with ORNL
– Cypress tool for communication trace compression (SC '14 [7])
• Closing remarks
7. Desired Features
• Based on real, large-scale applications
• Leveraging existing tracing tools
• Automatic source code generation
• Concise, configurable, portable benchmarks
• With relative performance retained
[Figure: application traces feed into our system, which emits the benchmark]
9. Sample Use Case 2: I/O Method Selection
[Figure: parallel I/O with multi-level data staging; a simulation job on compute nodes writes via the interconnection network through staging nodes (main memory, SSD) and I/O nodes to a SAN]
• Lots of I/O options available
– # of files
– Sync or async?
– # of staging nodes
– Use local or remote SSD?
– I/O library
– Stripe width?
– I/O frequency
– …
Realistic I/O benchmarks allow users and I/O system designers to
• consider the interplay between I/O and other activities
• evaluate I/O options with portable, light-weight benchmarks
• assess candidate I/O designs/configurations
10. APPrime Overview
• Automatic, whole-application benchmark generation
– Input: parallel execution traces of application A on one platform
– Output: “fake application” A’, simulating A’s behavior
• Computation, communication, I/O, scaling
• Portable, shorter source code using few libraries
• APPrime main idea
– Get information from traces, but do not replay
– Differentiate between regular and irregular behavior
• Be exact with regular activity (loops)
• Model any irregularity as a statistical distribution (histograms)
• Current status
– Ready: overall framework, communication, I/O
– To-do: computation kernel
11. Assumed Computation Model
• Iterative parallel applications have regular execution patterns, in the form I(C*W)*F [1]
– I: one-time initialization phase (head)
– F: one-time finalization phase (tail)
– C: timestep computation phase (with communication)
– W: periodic I/O phase
[Figure: I, then C phases repeated x times followed by a W phase, with that (C…C)W block repeated y times, then F; the gaps between consecutive events are “event bubbles”]
• APPrime automatically identifies phases from traces, without any involvement of the programmer/user
12. Complications from Real Large-scale Apps
• Challenges
– Noise (irregular activities)
• Found to be minor across all applications we studied
– Multiple I/O phases
– Heterogeneous C-phase communication behavior
• Identical event sequence, different parameters
• Solutions
– Extend C to C[0,a] D^{0|1} C[b,|C|]
• Allows a minor noise phase D
• Ignored in benchmark generation
– Extend W to W_i
• Multiple I/O phases, each with its own (fixed) frequency
– Use a Markov Chain Model (MCM) to simulate transitions between multiple C phases
13. APPrime Workflow
APPrime automatic benchmark generation framework:
• Input: DUMPI traces or ScalaTrace traces
• Trace Parser: a parser factory turns the traces into per-process event tables, then merges them across tables
• Phase Identifier: partitions the event tables into static phases (head I, noise D, major loops C, I/O phases W, tail F)
• Generator: an extractor handles the head and tail, an I/O translator handles the I/O phases, and an MCM builder turns the major loops into MC states
• Code Generator: emits the APPrime benchmark
• Output: a configuration parameter file plus source code
14. Trace Parsing: Trace to Event Table

Original ASCII DUMPI trace:
• MPI_Bcast entering at walltime 102625.244, int count=1, MPI_Datatype datatype=4 (MPI_INT), int root=0, MPI_Comm comm=4 (user-defined-comm), MPI_Bcast returning at walltime 102625.244.
• MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 (user-defined-comm), MPI_Barrier returning at walltime 102625.253.
• MPI_File_open entering at walltime 102627.269, MPI_Comm comm=5 (user-defined-comm), int amode=0 (CREATE), filename=“simple.out”, MPI_Info info=0 (MPI_INFO_NULL), MPI_File file=1 (user-file), MPI_File_open returning at walltime 102627.439.

Sample joint per-process event table (Phase ID and Phase type columns to be filled by the Phase Identifier):

MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | …
MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | | | …
MPI_Barrier | …5.245 | …5.253 | N/A | N/A | 5 | N/A | | | …
MPI_File_open | …7.269 | …7.439 | N/A | N/A | 5 | CREATE | | | …
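As a toy illustration of this parsing step, the minimal C sketch below pulls the function name and the entering/returning walltimes out of one ASCII DUMPI-style record. The struct layout and helper names are assumptions made for the example; real DUMPI parsing handles the full argument list and the native trace format.

#include <stdio.h>
#include <string.h>

/* One row of the per-process event table (subset of the columns above). */
typedef struct {
    char name[32];
    double start, end;
} event_row_t;

/* Toy parse of one ASCII DUMPI-style record: extract the function
 * name and the entering/returning walltimes. */
static int parse_record(const char *line, event_row_t *row)
{
    if (sscanf(line, "%31s entering at walltime %lf",
               row->name, &row->start) != 2)
        return -1;
    const char *ret = strstr(line, "returning at walltime");
    if (!ret || sscanf(ret, "returning at walltime %lf", &row->end) != 1)
        return -1;
    return 0;
}

int main(void)
{
    const char *line =
        "MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 "
        "(user-defined-comm), MPI_Barrier returning at walltime 102625.253.";
    event_row_t row;
    if (parse_record(line, &row) == 0)
        printf("%s: start %.3f, end %.3f\n", row.name, row.start, row.end);
    return 0;
}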
15. Event Table to Trace String

MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | …
MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | N/A | N/A | …
MPI_Barrier | …5.245 | …5.253 | N/A | N/A | 5 | N/A | N/A | N/A | …
MPI_File_open | …7.269 | …7.439 | N/A | N/A | 5 | CREATE | N/A | N/A | …

Event-to-character mapping:
MPI_Init => ‘a’, MPI_Barrier => ‘c’, MPI_Bcast => ‘d’, MPI_File_open => ‘f’, …, MPI_Finalize => ‘h’

Compact trace string: ab…ccd…ccd…ef…ccd…ccd…ef…gh

APPrime deploys a new string processing algorithm to automatically identify all phases, by searching for the partitioning that maximizes inter-iteration repetition (see the sketch below).
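The following minimal sketch illustrates the core idea of such a search, not APPrime's actual algorithm: it scans the compact trace string for the substring whose back-to-back repetitions cover the most events. That substring is the candidate timestep body; what precedes the repetitions is the head (I) and what follows is the tail (F). Function and variable names are illustrative only.

#include <stdio.h>
#include <string.h>

/* Toy phase search: find the (offset, length) of the substring whose
 * consecutive repetitions cover the largest part of s. The real
 * algorithm additionally tolerates noise (D) and separates multiple
 * phase types. */
static int find_major_loop(const char *s, int *best_off, int *best_len)
{
    int n = (int)strlen(s), best_cov = 0;
    *best_off = *best_len = 0;
    for (int off = 0; off < n; off++)
        for (int len = 1; off + 2 * len <= n; len++) {
            int cov = len;                       /* first occurrence */
            while (off + cov + len <= n &&
                   memcmp(s + off, s + off + cov, len) == 0)
                cov += len;                      /* one more repetition */
            if (cov > best_cov) {
                best_cov = cov;
                *best_off = off;
                *best_len = len;
            }
        }
    return best_cov;
}

int main(void)
{
    /* ab | (ccd ccd ef) x2 | gh : I (C C W)^2 F in condensed form */
    const char *trace = "abccdccdefccdccdefgh";
    int off, len;
    int cov = find_major_loop(trace, &off, &len);
    printf("loop body \"%.*s\" at offset %d covers %d of %d events\n",
           len, trace + off, off, cov, (int)strlen(trace));
    return 0;
}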
16. Computation Gaps across Timesteps

Event table of one process’s first timestep (C phase):

MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | …
MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | N/A | N/A | …
MPI_Barrier | …7.245 | …7.253 | N/A | N/A | 5 | N/A | N/A | N/A | …

Timestep 1: MPI_Bcast(…), Bubble 1.1, MPI_Barrier(…), Bubble 1.2, …
Timestep 2: MPI_Bcast(…), Bubble 2.1, MPI_Isend(…), Bubble 2.2, …
…
Timestep n: MPI_Bcast(…), Bubble n.1, MPI_Isend(…), Bubble n.2, …

Bubbles at the same position across timesteps are aggregated into histograms.
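Below is a minimal sketch of this histogram modeling, as slides 10 and 28 describe it; the bin count, helper names, and sampling scheme are assumptions for the example, not APPrime's code. It builds an equi-width histogram over observed bubble durations, from which the generated benchmark can draw synthetic bubble sizes.

#include <stdio.h>
#include <stdlib.h>

#define NBINS 16  /* illustrative bin count */

/* Equi-width histogram over observed bubble (compute gap) durations. */
typedef struct { double lo, width; int count[NBINS], total; } hist_t;

static void hist_build(hist_t *h, const double *x, int n, double lo, double hi)
{
    h->lo = lo; h->width = (hi - lo) / NBINS; h->total = n;
    for (int b = 0; b < NBINS; b++) h->count[b] = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)((x[i] - lo) / h->width);
        if (b < 0) b = 0;
        if (b >= NBINS) b = NBINS - 1;
        h->count[b]++;
    }
}

/* Draw a synthetic bubble: pick a bin with probability proportional
 * to its count, then a uniform duration inside that bin. The generated
 * benchmark would busy-wait or sleep for this long. */
static double hist_sample(const hist_t *h)
{
    int r = rand() % h->total, b = 0;
    while (r >= h->count[b]) r -= h->count[b++];
    return h->lo + (b + (double)rand() / RAND_MAX) * h->width;
}

int main(void)
{
    double gaps[] = {1.1, 1.2, 1.15, 5.0, 1.18, 1.22}; /* seconds, made up */
    hist_t h;
    hist_build(&h, gaps, 6, 1.0, 6.0);
    for (int i = 0; i < 4; i++)
        printf("synthetic bubble: %.3f s\n", hist_sample(&h));
    return 0;
}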
17. Inter-Process Event Table Merging

Per-process event tables with identical event sequences are merged; parameters that differ across processes (here, the MPI_Irecv source) become value sets.

Per-process event table (process receiving from rank 4):
MPI function name | Data count | Type | Dest. | Src. | Comm. rank | …
MPI_Irecv | 20 | MPI_INT | N/A | 4 | 4 | …
MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | …

Per-process event table (process receiving from rank 8):
MPI function name | Data count | Type | Dest. | Src. | Comm. rank | …
MPI_Irecv | 20 | MPI_INT | N/A | 8 | 4 | …
MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | …

Merged event table:
MPI function name | Data count | Type | Dest. | Src. | Comm. rank | …
MPI_Irecv | 20 | MPI_INT | N/A | {4, 8, …} | 4 | …
MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | …
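As a toy illustration of the merge, the sketch below collects the differing Src values of one event row into a value set; the struct layout and size cap are assumptions made for the example only.

#include <stdio.h>

#define MAX_VALS 64  /* illustrative cap on distinct parameter values */

/* One merged event-table cell: the set of values a parameter (e.g.
 * the MPI_Irecv source) takes across processes. */
typedef struct { int vals[MAX_VALS]; int n; } value_set_t;

static void set_add(value_set_t *s, int v)
{
    for (int i = 0; i < s->n; i++)
        if (s->vals[i] == v) return;        /* already recorded */
    if (s->n < MAX_VALS) s->vals[s->n++] = v;
}

int main(void)
{
    /* Src values of the same MPI_Irecv row, one per process. */
    int per_process_src[] = {4, 8, 4, 12, 8};
    value_set_t src = {.n = 0};
    for (int p = 0; p < 5; p++)
        set_add(&src, per_process_src[p]);
    printf("MPI_Irecv Src = {");
    for (int i = 0; i < src.n; i++)
        printf("%s%d", i ? ", " : "", src.vals[i]);
    printf("}\n");                          /* prints {4, 8, 12} */
    return 0;
}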
18. Markov Chain Model for C-Phase States

Each distinct C-phase behavior becomes an MC state, represented by its merged event table.

MC state rank 1 (merged timestep m):
No. | Name | Count | Type | Dest. | Src.
1 | MPI_Irecv | 20 | MPI_INT | N/A | {4, 8, …}
2 | MPI_Send | 20 | MPI_INT | 4 | N/A
… | … | … | … | … | …

MC state rank 2 (merged timestep m+n):
No. | Name | Count | Type | Dest. | Src.
1 | MPI_Irecv | 80 | MPI_INT | N/A | {1, 7, …}
2 | MPI_Send | 80 | MPI_INT | 1 | N/A
… | … | … | … | … | …

Transition probabilities matrix:
        | State 1 | State 2
State 1 | 0.3 | 0.7
State 2 | 0.7 | 0.3
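A minimal sketch of how a generated benchmark could walk this chain follows; it mirrors what trans_state() on the next slide must do, though APPrime's actual implementation and signature may differ.

#include <stdio.h>
#include <stdlib.h>

#define NSTATES 2

/* Transition probabilities from the matrix above: row = current
 * state, column = next state. */
static const double trans[NSTATES][NSTATES] = {
    {0.3, 0.7},   /* from state 1 */
    {0.7, 0.3},   /* from state 2 */
};

/* Roll a uniform random number and walk the row's cumulative
 * distribution to choose the next state rank. */
static int next_state(int state_rank)
{
    double r = (double)rand() / RAND_MAX, cum = 0.0;
    for (int next = 0; next < NSTATES; next++) {
        cum += trans[state_rank][next];
        if (r < cum) return next;
    }
    return NSTATES - 1;  /* guard against rounding */
}

int main(void)
{
    int state = 0;
    for (int ts = 0; ts < 10; ts++) {
        printf("timestep %d runs C-phase state %d\n", ts, state + 1);
        state = next_state(state);
    }
    return 0;
}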
19. Benchmark Code Generation

int main(int argc, char* argv[]) {
    apprime_init(argc, argv);
    init_phase();                  /* direct replay of the I phase */
    /* major loop */
    for (timestep = 0; timestep < total_timestep; timestep++) {
        run_state_for_C_phase(state_rank, event_tables);
        /* select the next MC state rank for the C phase */
        state_rank = trans_state(state_rank, timestep);
        /* periodic I/O phases W_i, here i = 1 */
        if ((timestep + 1) % restart_period_1 == 0)
            W_phase_1();
        …
    }
    final_phase();                 /* direct replay of the F phase */
    apprime_finalize();
    return 0;
}
20. Evaluation
• Platforms: Titan and Sith at ORNL
• Workloads:
– Real-world HPC applications:
• Quantum turbulence code: BEC2
• Gyrokinetic particle simulations: XGC and GTS
– NAS benchmarks: BTIO, LU, SP, CG

Name | # of Nodes | Cores per node | Mem. per node | OS | File System
Titan | 18,688 | 16 | 32 GB | Cray XK7 | Lustre
Sith | 40 | 32 | 64 GB | Linux x86_64 | Lustre
21. HPC Applications

Name | Domain | Typical prod. run scale (# of cores) | Open Source? | Status
XGC | Gyrokinetic | 225,280 | No | Done
GTS | Gyrokinetic | 262,144 | No | Done
BEC2 | Unitary qubit | 110,592 | No | Done
QMC-Pack | Electronic molecular | 256 – 16,000 | No | Applicable
S3D | Molecular physics | 96,000 – 180,000 | No | Applicable
AWP-ODC | Wave propagation | 223,074 | No | Applicable
NAMD | Molecular dynamics | 1,000 – 20,000 | No | Applicable
HFODD | Nuclear | 299,008 | Yes | Applicable
LAMMPS | Molecular dynamics | 12,500 – 130,000 | Yes | Applicable
SPEC | Multi-domain benchmarks | 150 – 600 | Yes | Applicable
NPB | Multi-domain benchmarks | N/A | Yes | Done
23. Results: A vs. A’
• Comparing target application A with APPrime-generated benchmark A’
– A’ much more compact and easier to build
– If A has multiple C-phase states, they take no more than dozens of timesteps to be discovered

Name | Lines of code (A) | Lines of code (A’) | Max # of TS tested | Max # of TS required
BEC2 | 1.5K | 856 | 1,000 | 1
XGC | 93.7K | 7.7K | 1,000 | 36
GTS | 178.4K | 13.7K | 200 | 2
26. Comparing with Other Profile-based Benchmark Generation Techniques

 | APPrime [21] | BenchMaker [12] | HBench [13]
Generated benchmark | Large-scale iterative parallel benchmark | Single-process (multi-threaded) benchmark | Java benchmark
Application-specific | Yes | Yes | Yes
Source of profile | Own processing of execution traces | User's input | JVM profilers
Target of profiling | Recurrent event sequence; event parameter / inter-arrival distribution | Instruction mix; branch probabilities; instruction-level parallelism; locality | Frequently invoked methods; function invocation counts; time cost
27. Other Related Work
• Communication trace collection
• TAU [3], DUMPI [4], ScalaTrace [5]
• Trace reduction
• Lossy: Xu’s work [6], Cypress [7]
• Lossless: ScalaTrace [5]
• Profiling
• HPCToolkit [8], Scalasca Performance Toolset [9]
• Trace-based application analysis
• ScalaExtrap [10], Casas’ work [11]
• Benchmark generation
• Trace-based: ScalaBenchGen [14]
• Source code slicing: FACT [15]
28. Ongoing Work
• Filling in computation kernel generation
– Currently using histograms to model “bubble size”
– Planned COMPrime tool
• Recursive step on single-process computation kernels
• Instruction mix, memory access (more challenging)
• Modeling scaling behavior
– Take input traces of app A collected at different scales
• Problem size, execution size
– Can we simulate weak/strong scaling behavior with A’?
• Connecting with collaborative work on scalable tracing
• Release full benchmarks!
30. APPrime References
1. L. T. Yang, X. Ma, and F. Mueller. Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution. In Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing), Nov. 2005.
2. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. Int. J. High Perform. Comput. Appl., 24(3):335-360, August 2010.
3. S. Shende and A. D. Malony. TAU: The TAU Parallel Performance System. International Journal of High Performance Computing Applications, 20(2), 2006.
4. J. P. Kenny, G. Hendry, B. Allan, and D. Zhang. DUMPI: The MPI Profiler from the SST Simulator Suite, 2011.
5. M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. ScalaTrace: Scalable
Compression and Replay of Communication Traces for High-Performance Computing. J. Parallel
Distrib. Comput., 2009.
6. Q. Xu, J. Subhlok, R. Zheng, and S. Voss. Logicalization of Communication Traces from Parallel
Execution. In IEEE IISWC, 2009.
7. J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen. Cypress: Combining static and dynamic analysis for
top-down communication trace compression. In SC14, 2014.
8. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. CCPE, 2010.
9. M. Geimer, F. Wolf, B. J. N. Wylie, E. Abraham, D. Becker, and B. Mohr. The Scalasca
Performance Toolset Architecture. In CCPE, 2010.
10. X. Wu and F. Mueller. ScalaExtrap: Trace-based Communication Extrapolation for SPMD
Programs. In ACM PPoPP, 2011.
31. APPrime References
11. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of
MPI Applications. IJHPCA, 2010.
12. J. Dujmovic. Automatic Generation of Benchmark and Test Workloads. In WOSP/SIPEW, 2010.
13. X. Zhang and M. Seltzer. HBench:Java: An Application-Specific Benchmarking Framework for Java Virtual Machines. In Proceedings of the ACM 2000 Conference on Java Grande (JAVA '00), 2000.
14. X. Wu, V. Deshpande, and F. Mueller. ScalaBenchGen: Auto-Generation of Communication Benchmark Traces. In IEEE IPDPS, 2012.
15. J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng. FACT: Fast Communication Trace
Collection for Parallel Applications Through Program Slicing. In SC09, 2009.
16. GTC-benchmark in NERSC-8 suite, 2013.
17. NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html, 2003.
18. A. M. Joshi, L. Eeckhout, and L. K. John. The Return of Synthetic Benchmarks. In SPEC
Benchmark Workshop, 2008.
19. J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian,
and M. Wolf. Understanding I/O Performance Using I/O Skeletal Applications. In Euro-Par.
Springer-Verlag, 2012.
20. H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. In CUG, 2007.
Editor's Notes
Profiling: takes a relatively light-weight approach, summarizing aggregate or statistical information about parallel job executions.
ScalaExtrap: identifies and extrapolates communication topologies, given traces from executions at different scales.
BenchMaker and HBench: generate benchmarks according to statistical models characterizing the original applications.
ScalaBenchGen: benchmark extraction based on compressed communication traces.
FACT: fast communication trace collection for parallel applications through program slicing.
Casas' work: phase identification with signal processing techniques.