Parallel application benchmarks are indispensable for evaluating and optimizing HPC software and hardware. However, it is challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, and reconfigure, and are often plainly inaccessible due to security or ownership concerns. This work contributes APPrime, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication and I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPrime benchmarks: they retain the original applications' performance characteristics, in particular their relative performance across platforms. The resulting benchmarks, already released online, are also much more compact and easier to port than the original applications.
http://dl.acm.org/citation.cfm?id=2745876
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
1. Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
Ye Jin, Xiaosong Ma, Mingliang Liu, Qing Liu, Jeremy Logan, Norbert Podhorszki, Jong Youl Choi, Scott Klasky
2. Systems and Applications More Complex
Powerful supercomputers
Large-scale codes from diverse scientific domains solving real-world problems
• Large number of nodes
• Deeper memory and I/O stack
• Heterogeneous architecture
• System-specific interconnect
3. More Choices for HPC Platforms
[Figure: spectrum of platform choices, from a single server, to a local cluster (10 Gb network), to an HPC center, to HPC in the cloud]
4. Performance Study Crucial
• Yet remains challenging
– Single-platform analysis resource-consuming
– Not to mention cross-platform
• Benchmarks more important than ever
– Evaluate machines
– Validate hardware/software design
– Select from candidate platforms
• Realistic benchmarks hard to find
5. Benchmarks and Generation Tools

Type | Example | Pros | Cons
Kernels | NPB, SPEC, Intel-MPI | Real, parametric | Simple, non-HPC, less flexible
Manually extracted | FlashIO, GTC-Bench | Realistic, parametric | Labor-intensive, easily obsolete
Trace-based | ScalaBenchGen | Automatic | Replay-based, non-parametric, platform-dependent
Specialized (I/O) | IOR, Skel | Parametric | I/O phase only
Automatic, full-application benchmark extraction?
6. Outline
• Motivation
• Our recent work related to automatic application benchmark extraction
– APPrime framework (SIGMETRICS '15 [21])
• Led by NCSU PhD student Ye Jin
• Collaboration with ORNL
– Cypress tool for communication trace compression (SC '14 [7])
• Closing remarks
7. Desired Features
• Based on real, large-scale applications
• Leveraging existing tracing tools
• Automatic source code generation
• Concise, configurable, portable benchmarks
• With relative performance retained
[Figure: application traces feed into our system, which emits the benchmark]
9. Sample Use Case 2: I/O Method Selection
[Figure: parallel I/O with multi-level data staging; a simulation job on compute nodes writes via the interconnection network through staging nodes (main memory, SSD) and I/O nodes to a SAN]
• Lots of I/O options available
– # of files
– Sync or async?
– # of staging nodes
– Use local or remote SSD?
– I/O library
– Stripe width?
– I/O frequency
– …
Realistic I/O benchmarks allow users and I/O system designers to
• consider the interplay between I/O and other activities
• evaluate I/O options with portable, light-weight benchmarks
• assess candidate I/O designs/configurations
10. APPrime Overview
• Automatic, whole-application benchmark generation
– Input: parallel execution traces of application A on one platform
– Output: “fake application” A’, simulating A’s behavior
• Computation, communication, I/O, scaling
• Portable, shorter source code using few libraries
• APPrime main idea
– Get information from traces, but do not replay
– Differentiate between regular and irregular behavior
• Be exact with regular activity (loops)
• Model any irregularity as a statistical distribution (histograms)
• Current status
– Ready: overall framework, communication, I/O
– To-do: computation kernel
11. Assumed Computation Model
• Iterative parallel applications have regular execution patterns, in the form I(C*W)*F [1]
– I: one-time initialization phase (head)
– F: one-time finalization phase (tail)
– C: timestep computation phase (with communication)
– W: periodic I/O phase
[Figure: I, then C phases repeated x times followed by a W phase, with that (C…C)W block repeated y times, then F; the gaps between consecutive events are “event bubbles”]
• APPrime automatically identifies phases from traces, without any involvement of the programmer/user
12. Complications from Real Large-scale Apps
• Challenges
– Noise (irregular activities)
• Found to be minor across all applications we studied
– Multiple I/O phases
– Heterogeneous C-phase communication behavior
• Identical event sequence, different parameters
• Solutions
– Extend C to C[0,a] D^{0|1} C[b,|C|]
• Allows a minor noise phase D
• Ignored in benchmark generation
– Extend W to W_i
• Multiple I/O phases, each with its own (fixed) frequency
– Use a Markov Chain Model (MCM) to simulate transitions between multiple C phases
13. APPrime Workflow
APPrime automatic benchmark generation framework:
• Input: DUMPI traces or ScalaTrace traces
• Trace Parser: a parser factory turns the traces into per-process event tables, then merges them across tables
• Phase Identifier: partitions the event tables into static phases (head I, noise D, major loops C, I/O phases W, tail F)
• Generator: an extractor handles the head and tail, an I/O translator handles the I/O phases, and an MCM builder turns the major loops into MC states
• Code Generator: emits the APPrime benchmark
• Output: a configuration parameter file plus source code
14. Trace Parsing: Trace to Event Table

Original ASCII DUMPI trace:
• MPI_Bcast entering at walltime 102625.244, int count=1, MPI_Datatype datatype=4 (MPI_INT), int root=0, MPI_Comm comm=4 (user-defined-comm), MPI_Bcast returning at walltime 102625.244.
• MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 (user-defined-comm), MPI_Barrier returning at walltime 102625.253.
• MPI_File_open entering at walltime 102627.269, MPI_Comm comm=5 (user-defined-comm), int amode=0 (CREATE), filename=“simple.out”, MPI_Info info=0 (MPI_INFO_NULL), MPI_File file=1 (user-file), MPI_File_open returning at walltime 102627.439.

Sample joint per-process event table (Phase ID and Phase type columns to be filled by the Phase Identifier):

MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | …
MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | | | …
MPI_Barrier | …5.245 | …5.253 | N/A | N/A | 5 | N/A | | | …
MPI_File_open | …7.269 | …7.439 | N/A | N/A | 5 | CREATE | | | …
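As a toy illustration of this parsing step, the minimal C sketch below pulls the function name and the entering/returning walltimes out of one ASCII DUMPI-style record. The struct layout and helper names are assumptions made for the example; real DUMPI parsing handles the full argument list and the native trace format.

#include <stdio.h>
#include <string.h>

/* One row of the per-process event table (subset of the columns above). */
typedef struct {
    char name[32];
    double start, end;
} event_row_t;

/* Toy parse of one ASCII DUMPI-style record: extract the function
 * name and the entering/returning walltimes. */
static int parse_record(const char *line, event_row_t *row)
{
    if (sscanf(line, "%31s entering at walltime %lf",
               row->name, &row->start) != 2)
        return -1;
    const char *ret = strstr(line, "returning at walltime");
    if (!ret || sscanf(ret, "returning at walltime %lf", &row->end) != 1)
        return -1;
    return 0;
}

int main(void)
{
    const char *line =
        "MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 "
        "(user-defined-comm), MPI_Barrier returning at walltime 102625.253.";
    event_row_t row;
    if (parse_record(line, &row) == 0)
        printf("%s: start %.3f, end %.3f\n", row.name, row.start, row.end);
    return 0;
}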
15. Event Table to Trace String

MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | …
MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | N/A | N/A | …
MPI_Barrier | …5.245 | …5.253 | N/A | N/A | 5 | N/A | N/A | N/A | …
MPI_File_open | …7.269 | …7.439 | N/A | N/A | 5 | CREATE | N/A | N/A | …

Event-to-character mapping:
MPI_Init => ‘a’, MPI_Barrier => ‘c’, MPI_Bcast => ‘d’, MPI_File_open => ‘f’, …, MPI_Finalize => ‘h’

Compact trace string: ab…ccd…ccd…ef…ccd…ccd…ef…gh

APPrime deploys a new string processing algorithm to automatically identify all phases, by searching for the partitioning that maximizes inter-iteration repetition (see the sketch below).
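The following minimal sketch illustrates the core idea of such a search, not APPrime's actual algorithm: it scans the compact trace string for the substring whose back-to-back repetitions cover the most events. That substring is the candidate timestep body; what precedes the repetitions is the head (I) and what follows is the tail (F). Function and variable names are illustrative only.

#include <stdio.h>
#include <string.h>

/* Toy phase search: find the (offset, length) of the substring whose
 * consecutive repetitions cover the largest part of s. The real
 * algorithm additionally tolerates noise (D) and separates multiple
 * phase types. */
static int find_major_loop(const char *s, int *best_off, int *best_len)
{
    int n = (int)strlen(s), best_cov = 0;
    *best_off = *best_len = 0;
    for (int off = 0; off < n; off++)
        for (int len = 1; off + 2 * len <= n; len++) {
            int cov = len;                       /* first occurrence */
            while (off + cov + len <= n &&
                   memcmp(s + off, s + off + cov, len) == 0)
                cov += len;                      /* one more repetition */
            if (cov > best_cov) {
                best_cov = cov;
                *best_off = off;
                *best_len = len;
            }
        }
    return best_cov;
}

int main(void)
{
    /* ab | (ccd ccd ef) x2 | gh : I (C C W)^2 F in condensed form */
    const char *trace = "abccdccdefccdccdefgh";
    int off, len;
    int cov = find_major_loop(trace, &off, &len);
    printf("loop body \"%.*s\" at offset %d covers %d of %d events\n",
           len, trace + off, off, cov, (int)strlen(trace));
    return 0;
}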
16. Computation Gaps across Timesteps

Event table of one process’s first timestep (C phase):

MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | …
MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | N/A | N/A | …
MPI_Barrier | …7.245 | …7.253 | N/A | N/A | 5 | N/A | N/A | N/A | …

Timestep 1: MPI_Bcast(…), Bubble 1.1, MPI_Barrier(…), Bubble 1.2, …
Timestep 2: MPI_Bcast(…), Bubble 2.1, MPI_Isend(…), Bubble 2.2, …
…
Timestep n: MPI_Bcast(…), Bubble n.1, MPI_Isend(…), Bubble n.2, …

Bubbles at the same position across timesteps are aggregated into histograms.
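Below is a minimal sketch of this histogram modeling, as slides 10 and 28 describe it; the bin count, helper names, and sampling scheme are assumptions for the example, not APPrime's code. It builds an equi-width histogram over observed bubble durations, from which the generated benchmark can draw synthetic bubble sizes.

#include <stdio.h>
#include <stdlib.h>

#define NBINS 16  /* illustrative bin count */

/* Equi-width histogram over observed bubble (compute gap) durations. */
typedef struct { double lo, width; int count[NBINS], total; } hist_t;

static void hist_build(hist_t *h, const double *x, int n, double lo, double hi)
{
    h->lo = lo; h->width = (hi - lo) / NBINS; h->total = n;
    for (int b = 0; b < NBINS; b++) h->count[b] = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)((x[i] - lo) / h->width);
        if (b < 0) b = 0;
        if (b >= NBINS) b = NBINS - 1;
        h->count[b]++;
    }
}

/* Draw a synthetic bubble: pick a bin with probability proportional
 * to its count, then a uniform duration inside that bin. The generated
 * benchmark would busy-wait or sleep for this long. */
static double hist_sample(const hist_t *h)
{
    int r = rand() % h->total, b = 0;
    while (r >= h->count[b]) r -= h->count[b++];
    return h->lo + (b + (double)rand() / RAND_MAX) * h->width;
}

int main(void)
{
    double gaps[] = {1.1, 1.2, 1.15, 5.0, 1.18, 1.22}; /* seconds, made up */
    hist_t h;
    hist_build(&h, gaps, 6, 1.0, 6.0);
    for (int i = 0; i < 4; i++)
        printf("synthetic bubble: %.3f s\n", hist_sample(&h));
    return 0;
}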
17. Inter-Process Event Table Merging

Per-process event tables with identical event sequences are merged; parameters that differ across processes (here, the MPI_Irecv source) become value sets.

Per-process event table (process receiving from rank 4):
MPI function name | Data count | Type | Dest. | Src. | Comm. rank | …
MPI_Irecv | 20 | MPI_INT | N/A | 4 | 4 | …
MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | …

Per-process event table (process receiving from rank 8):
MPI function name | Data count | Type | Dest. | Src. | Comm. rank | …
MPI_Irecv | 20 | MPI_INT | N/A | 8 | 4 | …
MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | …

Merged event table:
MPI function name | Data count | Type | Dest. | Src. | Comm. rank | …
MPI_Irecv | 20 | MPI_INT | N/A | {4, 8, …} | 4 | …
MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | …
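As a toy illustration of the merge, the sketch below collects the differing Src values of one event row into a value set; the struct layout and size cap are assumptions made for the example only.

#include <stdio.h>

#define MAX_VALS 64  /* illustrative cap on distinct parameter values */

/* One merged event-table cell: the set of values a parameter (e.g.
 * the MPI_Irecv source) takes across processes. */
typedef struct { int vals[MAX_VALS]; int n; } value_set_t;

static void set_add(value_set_t *s, int v)
{
    for (int i = 0; i < s->n; i++)
        if (s->vals[i] == v) return;        /* already recorded */
    if (s->n < MAX_VALS) s->vals[s->n++] = v;
}

int main(void)
{
    /* Src values of the same MPI_Irecv row, one per process. */
    int per_process_src[] = {4, 8, 4, 12, 8};
    value_set_t src = {.n = 0};
    for (int p = 0; p < 5; p++)
        set_add(&src, per_process_src[p]);
    printf("MPI_Irecv Src = {");
    for (int i = 0; i < src.n; i++)
        printf("%s%d", i ? ", " : "", src.vals[i]);
    printf("}\n");                          /* prints {4, 8, 12} */
    return 0;
}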
18. Markov Chain Model for C-Phase States

Each distinct C-phase behavior becomes an MC state, represented by its merged event table.

MC state rank 1 (merged timestep m):
No. | Name | Count | Type | Dest. | Src.
1 | MPI_Irecv | 20 | MPI_INT | N/A | {4, 8, …}
2 | MPI_Send | 20 | MPI_INT | 4 | N/A
… | … | … | … | … | …

MC state rank 2 (merged timestep m+n):
No. | Name | Count | Type | Dest. | Src.
1 | MPI_Irecv | 80 | MPI_INT | N/A | {1, 7, …}
2 | MPI_Send | 80 | MPI_INT | 1 | N/A
… | … | … | … | … | …

Transition probabilities matrix:
        | State 1 | State 2
State 1 | 0.3 | 0.7
State 2 | 0.7 | 0.3
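A minimal sketch of how a generated benchmark could walk this chain follows; it mirrors what trans_state() on the next slide must do, though APPrime's actual implementation and signature may differ.

#include <stdio.h>
#include <stdlib.h>

#define NSTATES 2

/* Transition probabilities from the matrix above: row = current
 * state, column = next state. */
static const double trans[NSTATES][NSTATES] = {
    {0.3, 0.7},   /* from state 1 */
    {0.7, 0.3},   /* from state 2 */
};

/* Roll a uniform random number and walk the row's cumulative
 * distribution to choose the next state rank. */
static int next_state(int state_rank)
{
    double r = (double)rand() / RAND_MAX, cum = 0.0;
    for (int next = 0; next < NSTATES; next++) {
        cum += trans[state_rank][next];
        if (r < cum) return next;
    }
    return NSTATES - 1;  /* guard against rounding */
}

int main(void)
{
    int state = 0;
    for (int ts = 0; ts < 10; ts++) {
        printf("timestep %d runs C-phase state %d\n", ts, state + 1);
        state = next_state(state);
    }
    return 0;
}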
19. Benchmark Code Generation

int main(int argc, char* argv[]) {
    apprime_init(argc, argv);
    init_phase();                  /* direct replay of the I phase */
    /* major loop */
    for (timestep = 0; timestep < total_timestep; timestep++) {
        run_state_for_C_phase(state_rank, event_tables);
        /* select the next MC state rank for the C phase */
        state_rank = trans_state(state_rank, timestep);
        /* periodic I/O phases W_i, here i = 1 */
        if ((timestep + 1) % restart_period_1 == 0)
            W_phase_1();
        …
    }
    final_phase();                 /* direct replay of the F phase */
    apprime_finalize();
    return 0;
}
20. Evaluation
• Platforms: Titan and Sith at ORNL
• Workloads:
– Real-world HPC applications:
• Quantum turbulence code: BEC2
• Gyrokinetic particle simulations: XGC and GTS
– NAS benchmarks: BTIO, LU, SP, CG

Name | # of Nodes | Cores per node | Mem. per node | OS | File System
Titan | 18,688 | 16 | 32 GB | Cray XK7 | Lustre
Sith | 40 | 32 | 64 GB | Linux x86_64 | Lustre
21. HPC Applications

Name | Domain | Typical prod. run scale (# of cores) | Open Source? | Status
XGC | Gyrokinetic | 225,280 | No | Done
GTS | Gyrokinetic | 262,144 | No | Done
BEC2 | Unitary qubit | 110,592 | No | Done
QMC-Pack | Electronic molecular | 256 – 16,000 | No | Applicable
S3D | Molecular physics | 96,000 – 180,000 | No | Applicable
AWP-ODC | Wave propagation | 223,074 | No | Applicable
NAMD | Molecular dynamics | 1,000 – 20,000 | No | Applicable
HFODD | Nuclear | 299,008 | Yes | Applicable
LAMMPS | Molecular dynamics | 12,500 – 130,000 | Yes | Applicable
SPEC | Multi-domain benchmarks | 150 – 600 | Yes | Applicable
NPB | Multi-domain benchmarks | N/A | Yes | Done
23. Results: A vs. A’
• Comparing target application A with APPrime-generated benchmark A’
– A’ much more compact and easier to build
– If A has multiple C-phase states, they take no more than dozens of timesteps to be discovered

Name | Lines of code (A) | Lines of code (A’) | Max # of TS tested | Max # of TS required
BEC2 | 1.5K | 856 | 1,000 | 1
XGC | 93.7K | 7.7K | 1,000 | 36
GTS | 178.4K | 13.7K | 200 | 2
26. Comparing with Other Profile-based Benchmark Generation Techniques

 | APPrime [21] | BenchMaker [12] | HBench [13]
Generated benchmark | Large-scale iterative parallel benchmark | Single-process (multi-threaded) benchmark | Java benchmark
Application-specific | Yes | Yes | Yes
Source of profile | Own processing of execution traces | User's input | JVM profilers
Target of profiling | Recurrent event sequence; event parameter / inter-arrival distribution | Instruction mix; branch probabilities; instruction-level parallelism; locality | Frequently invoked methods; function invocation counts; time cost
27. Other Related Work
• Communication trace collection
• TAU [3], DUMPI [4], ScalaTrace [5]
• Trace reduction
• Lossy: Xu’s work [6], Cypress [7]
• Lossless: ScalaTrace [5]
• Profiling
• HPCToolkit [8], Scalasca Performance Toolset [9]
• Trace-based application analysis
• ScalaExtrap [10], Casas’ work [11]
• Benchmark generation
• Trace-based: ScalaBenchGen [14]
• Source code slicing: FACT [15]
28. Ongoing Work
• Filling in computation kernel generation
– Currently using histograms to model “bubble size”
– Planned COMPrime tool
• Recursive step on single-process computation kernels
• Instruction mix, memory access (more challenging)
• Modeling scaling behavior
– Take input traces of app A collected at different scales
• Problem size, execution size
– Can we simulate weak/strong scaling behavior with A’?
• Connecting with collaborative work on scalable tracing
• Release full benchmarks!
30. APPrime References
1. L. T. Yang, X. Ma, and F. Mueller. Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution. In Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing), Nov. 2005.
2. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. Int. J. High Perform. Comput. Appl., 24(3):335-360, August 2010.
3. S. Shende and A. D. Malony. TAU: The TAU Parallel Performance System. International Journal of High Performance Computing Applications, 20(2), 2006.
4. J. P. Kenny, G. Hendry, B. Allan, and D. Zhang. DUMPI: The MPI Profiler from the SST Simulator Suite, 2011.
5. M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. ScalaTrace: Scalable
Compression and Replay of Communication Traces for High-Performance Computing. J. Parallel
Distrib. Comput., 2009.
6. Q. Xu, J. Subhlok, R. Zheng, and S. Voss. Logicalization of Communication Traces from Parallel
Execution. In IEEE IISWC, 2009.
7. J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen. Cypress: Combining static and dynamic analysis for
top-down communication trace compression. In SC14, 2014.
8. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. CCPE, 2010.
9. M. Geimer, F. Wolf, B. J. N. Wylie, E. Abraham, D. Becker, and B. Mohr. The Scalasca
Performance Toolset Architecture. In CCPE, 2010.
10. X. Wu and F. Mueller. ScalaExtrap: Trace-based Communication Extrapolation for SPMD
Programs. In ACM PPoPP, 2011.
31. APPrime References
11. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of
MPI Applications. IJHPCA, 2010.
12. J. Dujmovic. Automatic Generation of Benchmark and Test Workloads. In WOSP/SIPEW, 2010.
13. X. Zhang and M. Seltzer. HBench:Java: An Application-Specific Benchmarking Framework for Java Virtual Machines. In Proceedings of the ACM 2000 Conference on Java Grande (JAVA '00), 2000.
14. X. Wu, V. Deshpande, and F. Mueller. ScalaBenchGen: Auto-Generation of Communication Benchmark Traces. In IEEE IPDPS, 2012.
15. J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng. FACT: Fast Communication Trace
Collection for Parallel Applications Through Program Slicing. In SC09, 2009.
16. GTC-benchmark in NERSC-8 suite, 2013.
17. NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html, 2003.
18. A. M. Joshi, L. Eeckhout, and L. K. John. The Return of Synthetic Benchmarks. In SPEC
Benchmark Workshop, 2008.
19. J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian,
and M. Wolf. Understanding I/O Performance Using I/O Skeletal Applications. In Euro-Par.
Springer-Verlag, 2012.
20. H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. In CUG, 2007.
Editor's Notes
Profiling: takes a relatively light-weight approach, summarizing aggregate or statistical information about parallel job executions.
ScalaExtrap: identifies and extrapolates communication topologies, given traces from executions at different scales.
BenchMaker and HBench: generate benchmarks according to statistical models characterizing the original applications.
ScalaBenchGen: benchmark extraction based on compressed communication traces.
FACT: fast communication trace collection for parallel applications through program slicing.
Casas' work: phase identification with signal processing techniques.