Combining Phase Identification and
Statistic Modeling for Automated
Parallel Benchmark Generation
Ye Jin, Xiaosong Ma, Mingliang Liu,
Qing Liu, Jeremy Logan, Norbert Podhorszki,
Jong Youl Choi, Scott Klasky
Systems and Applications More Complex
Powerful supercomputers
• Large number of nodes
• Deeper memory and I/O stack
• Heterogeneous architecture
• System-specific interconnect
Large-scale codes from diverse scientific domains solving real-world problems
More Choices for HPC Platforms
• Single server
• Local cluster
• HPC center
• HPC in cloud
Performance Study Crucial
• Yet it remains challenging
  – Single-platform analysis is resource-consuming
  – Cross-platform analysis even more so
• Benchmarks more important than ever
  – Evaluate machines
  – Validate hardware/software designs
  – Select among candidate platforms
• Realistic benchmarks are hard to find
Benchmarks and Generation Tools

| Type | Example | Pros | Cons |
|---|---|---|---|
| Kernels | NPB, SPEC, Intel MPI | Real, parametric | Simple, non-HPC, less flexible |
| Manually extracted | FlashIO, GTC-Bench | Realistic, parametric | Labor-intensive, easily obsolete |
| Trace-based | ScalaBenchGen | Automatic | Replay-based, non-parametric, platform-dependent |
| Specialized (I/O) | IOR, Skel | Parametric | I/O phase only |

Automatic, full-application benchmark extraction?
Outline
• Motivation
• Our recent work related to automatic application benchmark extraction
  – APPrime framework (SIGMETRICS'15 [21])
    • Led by NCSU PhD student Ye Jin
    • Collaboration with ORNL
  – Cypress tool for communication trace compression (SC14 [7])
• Closing remarks
Desired Features
• Based on real, large-scale applications
• Leveraging existing tracing tools
• Automatic source code generation
• Concise, configurable, portable benchmarks
• With relative performance retained

[Figure: traces from real applications feed into our system.]
Sample Use Case 1: Cross-Platform Performance Estimation
• Estimate relative performance on candidate machines

[Figure: speed-up ratio, Titan supercomputer to Sith cluster (ORNL), for BTIO, CG, and SP at 64, 121, and 256 processes.]

Relative performance is highly case-dependent, varying across
• applications, execution scales, and tasks (computation, communication, I/O)
Sample Use Case 2: I/O Method Selection

[Figure: parallel I/O with multi-level data staging — a simulation job on compute nodes, staging nodes with main memory and SSD, I/O nodes, and a SAN, connected by the interconnection network.]

• Lots of I/O options available
  – # of files
  – Sync or async?
  – # of staging nodes
  – Use local or remote SSD?
  – I/O library
  – Stripe width?
  – I/O frequency
  – …
• Realistic I/O benchmarks allow users and I/O system designers to
  – consider the interplay between I/O and other activities
  – evaluate I/O options with portable, light-weight benchmarks
  – assess candidate I/O designs/configurations
APPrime Overview
• Automatic, whole-application benchmark generation
  – Input: parallel execution traces of application A on one platform
  – Output: "fake application" A', simulating A's behavior
    • Computation, communication, I/O, scaling
    • Portable, shorter source code using few libraries
• APPrime main idea
  – Get information from traces, but do not replay
  – Differentiate between regular and irregular behavior
    • Be exact with regular activity (loops)
    • Model any irregularity as a statistical distribution (histograms)
• Current status
  – Ready: overall framework, communication, I/O
  – To do: computation kernel
Assumed Computation Model
• Iterative parallel applications have regular execution patterns
• In the form I(C*W)*F [1]
  – I: one-time initialization phase (head)
  – F: one-time finalization phase (tail)
  – C: timestep computation phase (with communication)
  – W: periodic I/O phase

[Figure: execution timeline I (C...C W)(C...C W)... F, with C repeated x times before each W and the C*W group repeated y times; event "bubbles" (computation gaps) sit between events.]

APPrime automatically identifies phases from traces
– without any involvement of the programmer/user
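The assumed I(C*W)*F pattern can be sketched as an execution skeleton; the function names below are illustrative placeholders, not APPrime code:

```python
# Sketch of the assumed I(C*W)*F execution pattern (illustrative only).
calls = []

def initialize():               calls.append("I")  # one-time head phase
def compute_and_communicate():  calls.append("C")  # timestep phase (w. communication)
def write_checkpoint():         calls.append("W")  # periodic I/O phase
def finalize():                 calls.append("F")  # one-time tail phase

def run(total_timesteps, io_period):
    initialize()
    for t in range(total_timesteps):
        compute_and_communicate()
        if (t + 1) % io_period == 0:
            write_checkpoint()
    finalize()

run(total_timesteps=6, io_period=3)
# event sequence follows I (C C C W) (C C C W) F
print("".join(calls))
```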
Complications from Real Large-scale Apps
• Challenges
  – Noise (irregular activities)
    • Found to be minor across all applications we studied
  – Multiple I/O phases
  – Heterogeneous C-phase communication behavior
    • Identical event sequence, different parameters
• Solutions
  – Extend C to C[0, a]D0|1C[b, |C|]
    • Allow a minor noise phase D
    • Ignored in benchmark generation
  – Extend W to Wi
    • Multiple I/O phases, each with an individual (fixed) frequency
  – Use a Markov-Chain Model (MCM) to simulate transitions between multiple C phases
APPrime Workflow

[Figure: the APPrime automatic benchmark generation framework. Input DUMPI or ScalaTrace traces pass through a parser factory and trace parser into per-process event tables, which are merged across tables. The phase identifier separates static phases in each table — head (I), noise (D), major loops (C), I/O phases (W), and tail (F). The MCM builder turns C phases into MC states, and the I/O translator handles W phases. The code generator then emits the APPrime benchmark as output: a configuration parameter file plus source code.]
Trace Parsing: Trace to Event Table

Original ASCII DUMPI trace:
• MPI_Bcast entering at walltime 102625.244, int count=1, MPI_Datatype datatype=4 (MPI_INT), int root=0, MPI_Comm comm=4 (user-defined-comm), MPI_Bcast returning at walltime 102625.244.
• MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 (user-defined-comm), MPI_Barrier returning at walltime 102625.253.
• MPI_File_open entering at walltime 102627.269, MPI_Comm comm=5 (user-defined-comm), int amode=0 (CREATE), filename="simple.out", MPI_Info info=0 (MPI_INFO_NULL), MPI_File file=1 (user-file), MPI_File_open returning at walltime 102627.439.

Sample joint per-process event table (Phase ID and Phase type to be filled):

| MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | … |
|---|---|---|---|---|---|---|---|---|---|
| MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | | | … |
| MPI_Barrier | …5.245 | …5.253 | N/A | N/A | 5 | N/A | | | … |
| MPI_File_open | …7.269 | …7.439 | N/A | N/A | 5 | CREATE | | | … |
Event Table to Trace String

| MPI function name | Start | End | Data count | Root | Comm. rank | File access mode | Phase ID | Phase type | … |
|---|---|---|---|---|---|---|---|---|---|
| MPI_Bcast | …5.244 | …5.245 | 1 | 0 | 4 | N/A | N/A | N/A | … |
| MPI_Barrier | …5.245 | …5.253 | N/A | N/A | 5 | N/A | N/A | N/A | … |
| MPI_File_open | …7.269 | …7.439 | N/A | N/A | 5 | CREATE | N/A | N/A | … |

Event-to-character mapping:
MPI_Init => 'a', MPI_Barrier => 'c', MPI_Bcast => 'd', MPI_File_open => 'f', …, MPI_Finalize => 'h'

Compact trace string:
ab…ccd…ccd…ef…ccd…ccd…ef…gh

APPrime deploys a new string processing algorithm to
• automatically identify all phases
• based on searching for the partitioning that maximizes inter-iteration repetition
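A minimal sketch of the idea (not APPrime's actual algorithm): scan candidate partitionings of the compact trace string and keep the repeated body that covers the most events, leaving the head and tail around it. Recursing on the body then separates the C and W phases.

```python
# Sketch (illustrative only): find the substring whose consecutive
# repetitions cover the most of the trace string.
def find_repeated_body(s):
    best = ("", 0, 0)  # (body, start offset, repeat count)
    for start in range(len(s)):
        for length in range(1, (len(s) - start) // 2 + 1):
            body = s[start:start + length]
            reps, i = 0, start
            while s[i:i + length] == body:
                reps += 1
                i += length
            # prefer the partitioning that covers the most events
            if reps >= 2 and reps * length > len(best[0]) * best[2]:
                best = (body, start, reps)
    return best

# 'ab' = init, 'ccd' = timestep events, 'ef' = I/O, 'gh' = finalize
trace = "ab" + ("ccd" * 3 + "ef") * 2 + "gh"
body, start, reps = find_repeated_body(trace)
assert body == "ccdccdccdef"            # the (C*W) group
assert find_repeated_body(body)[0] == "ccd"  # recursing recovers the C phase
```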
Computation Gap across Timesteps

[Figure: the event table of one process's first timestep (C phase). Every timestep issues the same event sequence (e.g. MPI_Bcast, MPI_Isend, …), separated by computation gaps: bubbles 1.1, 1.2, … in timestep 1 through bubbles n.1, n.2, … in timestep n. The bubble sizes observed across timesteps are summarized as histograms.]
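The per-bubble histograms can be sketched as follows; the bin count and the sample gap values are made up for illustration, and this is not APPrime's code:

```python
import random

# Sketch: summarize per-timestep computation gaps ("bubbles") as a
# histogram, then sample replacement delays from it at benchmark run time.
def build_histogram(samples, num_bins=4):
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins or 1.0   # avoid zero-width bins
    counts = [0] * num_bins
    for s in samples:
        idx = min(int((s - lo) / width), num_bins - 1)
        counts[idx] += 1
    return lo, width, counts

def sample_delay(lo, width, counts, rng=random):
    # pick a bin weighted by its count, then a uniform point inside it
    idx = rng.choices(range(len(counts)), weights=counts)[0]
    return lo + (idx + rng.random()) * width

gaps_us = [102, 98, 101, 240, 99, 103, 97, 250]  # hypothetical bubble sizes
lo, width, counts = build_histogram(gaps_us)
delay = sample_delay(lo, width, counts)
assert lo <= delay <= lo + width * len(counts)
```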
Inter-Process Event Table Merging

Per-process event tables (two ranks; identical except for the Irecv source):

| MPI function name | Data count | Type | Dest. | Src. | Comm. rank | … |
|---|---|---|---|---|---|---|
| MPI_Irecv | 20 | MPI_INT | N/A | 4 | 4 | … |
| MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | … |

(another rank records Src. = 8 for the same MPI_Irecv)

Merged event table:

| MPI function name | Data count | Type | Dest. | Src. | Comm. rank | … |
|---|---|---|---|---|---|---|
| MPI_Irecv | 20 | MPI_INT | N/A | {4, 8, …} | 4 | … |
| MPI_Send | 20 | MPI_INT | 4 | N/A | 4 | … |
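As a rough illustration (the field names and dict layout are hypothetical, not APPrime's data structures), merging two ranks' tables keeps matching fields as scalars and collects differing fields into value sets:

```python
# Sketch: merge per-process event tables row by row. Fields identical
# across ranks stay scalars; differing fields become value sets.
def merge_tables(tables):
    merged = []
    for rows in zip(*tables):          # same event position in each rank
        out = {}
        for key in rows[0]:
            values = [r[key] for r in rows]
            out[key] = values[0] if len(set(values)) == 1 else set(values)
        merged.append(out)
    return merged

rank0 = [{"func": "MPI_Irecv", "count": 20, "src": 4},
         {"func": "MPI_Send",  "count": 20, "dest": 4}]
rank1 = [{"func": "MPI_Irecv", "count": 20, "src": 8},
         {"func": "MPI_Send",  "count": 20, "dest": 4}]

merged = merge_tables([rank0, rank1])
assert merged[0]["src"] == {4, 8}   # differing field becomes a value set
assert merged[1]["dest"] == 4       # identical field stays a scalar
```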
Markov Chain Model for C-Phase States

MC state 1 (merged timestep m):

| No. | Name | Count | Type | Dest. | Src. |
|---|---|---|---|---|---|
| 1 | MPI_Irecv | 20 | MPI_INT | N/A | {4, 8, …} |
| 2 | MPI_Send | 20 | MPI_INT | 4 | N/A |
| … | … | … | … | … | … |

MC state 2 (merged timestep m+n):

| No. | Name | Count | Type | Dest. | Src. |
|---|---|---|---|---|---|
| 1 | MPI_Irecv | 80 | MPI_INT | N/A | {1, 7, …} |
| 2 | MPI_Send | 80 | MPI_INT | 1 | N/A |
| … | … | … | … | … | … |

Transition probabilities matrix:

|         | State 1 | State 2 |
|---|---|---|
| State 1 | 0.3 | 0.7 |
| State 2 | 0.7 | 0.3 |
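A minimal sketch (assuming the 2-state chain with the 0.3/0.7 transition probabilities shown on this slide) of how a generated benchmark might pick the next C-phase state each timestep:

```python
import random

# Sketch: drive C-phase state selection with a 2-state transition
# matrix; P(stay) = 0.3, P(switch) = 0.7, as in the slide's example.
TRANS = [[0.3, 0.7],
         [0.7, 0.3]]

def next_state(state, rng=random):
    r, acc = rng.random(), 0.0
    for nxt, p in enumerate(TRANS[state]):
        acc += p
        if r < acc:
            return nxt
    return len(TRANS) - 1   # guard against floating-point round-off

rng = random.Random(42)
state, visits = 0, [0, 0]
for _ in range(10_000):
    state = next_state(state, rng)
    visits[state] += 1

# symmetric matrix => both states visited roughly half the time
print(visits)
```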
Benchmark Code Generation

int main(int argc, char* argv[]) {
    apprime_init(argc, argv);
    init_phase();                      /* direct replay of the I phase */
    /* major loop */
    for (timestep = 0; timestep < total_timestep; timestep++) {
        run_state_for_C_phase(state_rank, event_tables);
        /* select the next MC state for the C phase */
        state_rank = trans_state(state_rank, timestep);
        /* periodic I/O phases Wi, here i = 1 */
        if ((timestep + 1) % restart_period_1 == 0)
            W_phase_1();
        …
    }
    final_phase();                     /* direct replay of the F phase */
    apprime_finalize();
    return 0;
}
Evaluation
• Platforms: Titan and Sith at ORNL
• Workloads:
  – Real-world HPC applications:
    • Quantum turbulence code: BEC2
    • Gyrokinetic particle simulations: XGC and GTS
  – NAS benchmarks: BTIO, LU, SP, CG

| Name | # of nodes | Cores per node | Mem. per node | OS | File system |
|---|---|---|---|---|---|
| Titan | 18,688 | 16 | 32 GB | Cray XK7 | Lustre |
| Sith | 40 | 32 | 64 GB | Linux x86_64 | Lustre |
HPC Applications

| Name | Domain | Typical prod. run scale (# of cores) | Open source? | Status |
|---|---|---|---|---|
| XGC | Gyrokinetic | 225,280 | No | Done |
| GTS | Gyrokinetic | 262,144 | No | Done |
| BEC2 | Unitary qubit | 110,592 | No | Done |
| QMC-Pack | Electronic molecular | 256 – 16,000 | No | Applicable |
| S3D | Molecular physics | 96,000 – 180,000 | No | Applicable |
| AWP-ODC | Wave propagation | 223,074 | No | Applicable |
| NAMD | Molecular dynamics | 1,000 – 20,000 | No | Applicable |
| HFODD | Nuclear | 299,008 | Yes | Applicable |
| LAMMPS | Molecular dynamics | 12,500 – 130,000 | Yes | Applicable |
| SPEC | Multi-domain benchmarks | 150 – 600 | Yes | Applicable |
| NPB | Multi-domain benchmarks | N/A | Yes | Done |
Applications' Trace Features

| App | # of procs | # of TSs | Trace size | Table size | # events in one state | String size | # unique funcs | # of states | D% | TSV% | Profile size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BTIO | 64 | 250 | 832MB | 584MB | 183 | 44.4KB | 16 | 1 | 0% | 2.1% | 2.2MB |
| BTIO | 256 | 250 | 7.02GB | 4.75GB | 266 | 91.5KB | 16 | 1 | 0% | 4.3% | 8.1MB |
| CG | 64 | 100 | 1.42GB | 1.00GB | 783 | 77.8KB | 11 | 1 | 0% | 1.5% | 3.4MB |
| CG | 256 | 100 | 7.51GB | 5.50GB | 1478 | 101KB | 11 | 1 | 0% | 1.8% | 11MB |
| SP | 64 | 500 | 1.44GB | 960MB | 139 | 68.1KB | 15 | 1 | 0% | 1.4% | 1.4MB |
| SP | 256 | 500 | 11.7GB | 7.43GB | 278 | 138KB | 15 | 1 | 0% | 3.5% | 6.1MB |
| LU | 64 | 300 | 18GB | 12.3GB | 1604 | 471KB | 11 | 1 | 0% | 2.3% | 11MB |
| LU | 256 | 300 | 75GB | 51.3GB | 1604 | 471KB | 11 | 1 | 0% | 3.8% | 44MB |
| BEC2 | 64 | 100 | 142MB | 101MB | 74 | 7.5KB | 14 | 1 | 0% | 1.8% | 1.1MB |
| BEC2 | 256 | 200 | 1.08GB | 800MB | 74 | 14.7KB | 14 | 1 | 0% | 2.7% | 3.6MB |
| XGC | 64 | 100 | 262MB | 243MB | 73 | 11.5KB | 28 | 2 | 0.1% | 4.3% | 1.0MB |
| XGC | 256 | 200 | 2.1GB | 1.64GB | 103 | 15.3KB | 28 | 2 | 0.1% | 5.8% | 1.4MB |
| GTS | 64 | 50 | 213MB | 137MB | 391 | 11.6KB | 38 | 2 | 0.3% | 5.6% | 1.9MB |
| GTS | 256 | 100 | 1.83GB | 1.15GB | 391 | 24.9KB | 38 | 2 | 0.3% | 5.9% | 7.2MB |
Results: A vs. A'
• Comparing target application A with the APPrime-generated benchmark A'
  – A' is much more compact and easier to build
  – If A has multiple C-phase states, they take no more than dozens of timesteps to be discovered

| Name | Lines of code (A) | Lines of code (A') | Max # of TS tested | Max # of TS required |
|---|---|---|---|---|
| BEC2 | 1.5K | 856 | 1,000 | 1 |
| XGC | 93.7K | 7.7K | 1,000 | 36 |
| GTS | 178.4K | 13.7K | 200 | 2 |
Cross-Platform Relative Performance

[Figure: cross-platform relative performance results for BTIO, CG, SP, and LU.]
Asynchronous I/O Configuration Assessment

[Figure: application vs. APPrime benchmark results for BEC2, GTS, and XGC at 64, 256, and 512 processes across configurations 0, 1, 4, and 8; some application runs crash and abort.]
Comparing with Other Profile-based Benchmark Generation Techniques

| | APPrime [21] | BenchMaker [12] | HBench [13] |
|---|---|---|---|
| Generated benchmark | Large-scale iterative parallel benchmark | Single-process (multi-threaded) benchmark | Java benchmark |
| Application-specific | Yes | Yes | Yes |
| Source of profile | Own processing of execution traces | User's input | JVM profilers |
| Target of profiling | Recurrent event sequences; event parameter/inter-arrival distributions | Instruction mix; branch probabilities; instruction-level parallelism; locality | Frequently invoked methods; function invocation counts; time cost |
Other Related Work
26
• Communication trace collection
• TAU [3], DUMPI [4], ScalaTrace [5]
• Trace reduction
• Lossy: Xu’s work [6], Cypress [7]
• Lossless: ScalaTrace [5]
• Profiling
• HPCtoolkit [8], Scalasca Performance Toolset [9]
• Trace-based application analysis
• ScalaExtrap [10], Casas’ work [11]
• Benchmark generation
• Trace-based: ScalaBenchGen [14]
• Source code slicing: FACT [15]
Ongoing Work
• Filling in computation kernel generation
  – Currently using histograms to model "bubble size"
  – Planned COMPrime tool
    • Recursive step on single-process computation kernels
    • Instruction mix, memory access (more challenging)
• Modeling scaling behavior
  – Take input traces of app A collected at different scales
    • Problem size, execution size
  – Can we simulate weak/strong scaling behavior with A'?
• Connecting with collaborative work on scalable tracing
• Release full benchmarks!
Thanks!
APPrime References
1. L. T. Yang, X. Ma, and F. Mueller. Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution. In Proceedings of the ACM/IEEE SC 2005 Conference, Nov. 2005.
2. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. Int. J. High Perform. Comput. Appl., 24(3):335–360, August 2010.
3. S. Shende and A. D. Malony. TAU: The tau parallel performance system. International Journal of
High Performance Computing Applications, 20(2), 2006.
4. J. P. Kenny, G. Hendry, B. Allan, and D. Zhang. DUMPI: The MPI profiler from the SST simulator suite, 2011.
5. M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. ScalaTrace: Scalable
Compression and Replay of Communication Traces for High-Performance Computing. J. Parallel
Distrib. Comput., 2009.
6. Q. Xu, J. Subhlok, R. Zheng, and S. Voss. Logicalization of Communication Traces from Parallel
Execution. In IEEE IISWC, 2009.
7. J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen. Cypress: Combining static and dynamic analysis for
top-down communication trace compression. In SC14, 2014.
8. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. CCPE, 2010.
9. M. Geimer, F. Wolf, B. J. N. Wylie, E. Abraham, D. Becker, and B. Mohr. The Scalasca
Performance Toolset Architecture. In CCPE, 2010.
10. X. Wu and F. Mueller. ScalaExtrap: Trace-based Communication Extrapolation for SPMD
Programs. In ACM PPoPP, 2011.
APPrime References
11. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of
MPI Applications. IJHPCA, 2010.
12. J. Dujmovic. Automatic Generation of Benchmark and Test Workloads. In WOSP/ SIPEW, 2010.
13. Xiaolan Zhang and Margo Seltzer. 2000. HBench:Java: an application-specific benchmarking
framework for Java virtual machines. In Proceedings of the ACM 2000 conference on Java
Grande (JAVA '00).
14. X. Wu, V. Deshpande, and F. Mueller. ScalaBenchGen: Auto-Generation of Communication
Benchmarks Traces. In IEEE IPDPS, 2012.
15. J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng. FACT: Fast Communication Trace
Collection for Parallel Applications Through Program Slicing. In SC09, 2009.
16. GTC-benchmark in NERSC-8 suite, 2013.
17. NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html, 2003.
18. A. M. Joshi, L. Eeckhout, and L. K. John. The Return of Synthetic Benchmarks. In SPEC
Benchmark Workshop, 2008.
19. J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian,
and M. Wolf. Understanding I/O Performance Using I/O Skeletal Applications. In Euro-Par.
Springer-Verlag, 2012.
20. H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. CUG, 2007.
21. Y. Jin, X. Ma, M. Liu, Q. Liu, J. Logan, N. Podhorszki, J. Y. Choi, and S. Klasky. Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation. In ACM SIGMETRICS, 2015.
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 

Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation

  • 1. Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation Ye Jin, Xiaosong Ma, Mingliang Liu, Qing Liu, Jeremy Logan, Norbert Podhorszki, Jong Youl Choi, Scott Klasky
  • 2. Systems and Applications More Complex 1 Powerful supercomputers Large-scale codes from diverse scientific domains solving real-world problems • Large number of nodes • Deeper memory and I/O stack • Heterogeneous architecture • System-specific interconnect
  • 3. More Choices for HPC Platforms 2 10 Gb Single Server Local Cluster HPC Center HPC in Cloud
  • 4. Performance Study Crucial • Yet remains challenging – Single-platform analysis resource-consuming – Not to mention cross-platform • Benchmarks more important than ever – Evaluate machines – Validate hardware/software design – Select from candidate platforms • Realistic benchmarks hard to find 3
  • 5. Benchmarks and Generation Tools Type Example Pros Cons Kernels NPB, SPEC, Intel-MPI Real, parametric Simple, non-HPC, less flexible Manually extracted FlashIO, GTC-Bench Realistic, parametric Labor-intensive, easily obsolete Trace-based ScalaBenGen Automatic Replay-based, non-parametric, platform- dependent Specialized (I/O) IOR, Skel Parametric I/O phase only 4 Automatic, full-application benchmark extraction?
  • 6. Outline 5 • Motivation • Our recent work related to automatic application benchmark extraction • APPrime framework (SIGMETRICS '15 [21]) • Led by NCSU PhD student Ye Jin • Collaboration with ORNL • Cypress tool for communication trace compression (SC '14 [7]) • Closing remarks
  • 7. Desired Features 6 • Based on real, large-scale applications • Leveraging existing tracing tools • Automatic source code generation • Concise, configurable, portable benchmarks • With relative performance retained Traces Trace Our system
  • 8. Sample Use Case 1: Cross-Platform Performance Estimation 7 • Estimate relative performance on candidate machines [Chart: speed-up ratio, Titan supercomputer to Sith cluster (ORNL), for BTIO, CG, and SP at 64, 121, and 256 processes] Relative performance highly case-dependent, varying across • Applications, execution scales, tasks (computation, communication, I/O)
  • 9. 8 Sample Use Case 2: I/O Method Selection [Diagram: simulation job on compute nodes; staging nodes with main memory and SSD; I/O nodes and SAN, all connected by the interconnection network — parallel I/O with multi-level data staging] • Lots of I/O options available: # of files; sync or async?; # of staging nodes; use local or remote SSD?; I/O library; stripe width?; I/O frequency; … Realistic I/O benchmarks allow users and I/O system designers to • consider interplay between I/O and other activities • evaluate I/O options with portable, light-weight benchmarks • assess candidate I/O designs/configurations
  • 10. APPrime Overview 9 • Automatic, whole-application benchmark generation – Input: parallel execution traces of application A on one platform – Output: “fake application” A’, simulating A’s behavior • Computation, communication, I/O, scaling • Portable, shorter source code using few libraries • APPrime Main idea – Get information from traces, but do not replay – Differentiate between regular and irregular behavior • Be exact with regular activity (loops) • Model any irregularity as statistical distribution (histograms) • Current status – Ready: overall framework, communication, I/O – To-do: computation kernel
  • 11. Assumed Computation Model 10 • Iterative parallel applications have regular execution patterns in the form I(C*W)*F [1]: • I: one-time initialization phase (head) • F: one-time finalization phase (tail) • C: timestep computation phase (with communication) • W: periodic I/O phase [Diagram: I, then y repetitions of (x C-phase timesteps followed by a W phase), then F; computation gaps between events appear as “event bubbles”] • APPrime automatically identifies phases from traces, without any involvement of the programmer/user
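The assumed structure can be made concrete with a small illustrative sketch (not APPrime code): it emits one init event, y outer iterations of x compute timesteps each followed by an I/O phase, then one finalize event.

```python
# Illustrative sketch of the I(C*W)*F execution model assumed by APPrime:
# one init phase (I), y outer iterations of x compute timesteps (C)
# followed by a periodic I/O phase (W), then one finalize phase (F).
def synthesize_run(x, y):
    events = ["I"]
    for _ in range(y):
        events += ["C"] * x   # x timestep computation phases
        events.append("W")    # periodic I/O phase
    events.append("F")
    return events

print(synthesize_run(3, 2))
# ['I', 'C', 'C', 'C', 'W', 'C', 'C', 'C', 'W', 'F']
```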
  • 12. Complications from Real Large-scale Apps 11 • Challenges – Noise (irregular activities) • Found to be minor across all applications we studied – Multiple I/O phases – Heterogeneous C-phase communication behavior • Identical event sequence, different parameters • Solutions – Extend C to C[0, a]D0|1C[b, |C|] • Allow a minor noise phase D • Ignored in benchmark generation – Extend W to Wi • Multiple I/O phases, each with an individual (fixed) frequency – Use a Markov Chain Model (MCM) to simulate transitions between multiple C phases
  • 13. APPrime Workflow 12 [Workflow diagram: DUMPI/ScalaTrace traces (input) → Trace Parser (Parser Factory, per-process Event Tables, merging across tables) → Phases Identifier (static phases: Head I, Major Loops C, I/O W, Noise D, Tail F) → MCM Builder (MC states) and I/O Translator → Code Generator → benchmark source code and configuration parameter file (output)] APPrime Automatic Benchmark Generation Framework: Extractor + Generator
  • 14. Trace Parsing: Trace to Event Table … • MPI_Bcast entering at walltime 102625.244, int count=1, MPI_Datatype datatype=4 (MPI_INT), int root=0, MPI_Comm comm=4 (user-defined-comm), MPI_Bcast returning at walltime 102625.244. Sample Joint Per-process Event Table • MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 (user-defined- comm), MPI_Barrier returning at walltime 102625.253. • MPI_File_open entering at walltime 102627.269, MPI_Comm comm=5 (user- defined-comm), int amode=0 (CREATE), filename=“simple.out”, MPI_Info info=0 (MPI_INFO_NULL), MPI_File file=1 (user-file), MPI_File_open returning at walltime 102627.439 Original ASCII DUMPI Trace MPI function name Start End Data count Root Comm. rank File access mode Phase ID Phase type … … … … … … … … … … … MPI_Bcast …5.244 …5.245 1 0 4 N/A … MPI_Barrier …5.245 …5.253 N/A N/A 5 N/A … MPI_File_open …7.269 …7.439 N/A N/A 5 CREATE … … … … … … … … … … … To be filled 13
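The trace-to-event-table step might be sketched as follows; the input line paraphrases the slide's ASCII DUMPI example (the exact DUMPI text syntax may differ, and the field list here is abbreviated for illustration):

```python
import re

# Sketch: turn one ASCII trace line (format paraphrased from the slide's
# DUMPI example, not the exact DUMPI syntax) into an event-table row.
LINE = ("MPI_Bcast entering at walltime 102625.244, int count=1, "
        "MPI_Datatype datatype=4 (MPI_INT), int root=0, "
        "MPI_Comm comm=4 (user-defined-comm), "
        "MPI_Bcast returning at walltime 102625.245.")

def parse_event(line):
    row = {"name": line.split()[0]}
    row["start"] = float(re.search(r"entering at walltime (\d+\.\d+)", line).group(1))
    row["end"] = float(re.search(r"returning at walltime (\d+\.\d+)", line).group(1))
    # Optional fields: present for some MPI calls, N/A (None) for others.
    for field, pat in [("count", r"count=(\d+)"), ("root", r"root=(\d+)"),
                       ("comm", r"comm=(\d+)")]:
        m = re.search(pat, line)
        row[field] = int(m.group(1)) if m else None
    return row

print(parse_event(LINE))
```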
  • 15. Event Table to Trace String 14 MPI function name Start End Data count Root Comm. rank File access mode Phase ID Phase type … … … … … … … … … … … MPI_Bcast …5.244 …5.245 1 0 4 N/A N/A N/A … MPI_Barrier …5.245 …7.253 N/A N/A 5 N/A N/A N/A … MPI_File_open …7.269 …7.439 N/A N/A 5 CREATE N/A N/A … … … … … … … … … … … MPI_init => ‘a’, MPI_Barrier => ‘c’, MPI_Bcast => ‘d’, MPI_File_open => ‘f’, … MPI_Finalize => ‘h’ ab…ccd…ccd…ef…ccd…ccd…ef…gh Compact trace string APPrime deploys a new string-processing algorithm to • automatically identify all phases • based on searching for the partitioning that maximizes inter-iteration repetition
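The slide does not spell out APPrime's partitioning search; a naive brute-force version of the underlying idea — find the consecutively repeated substring that covers the largest portion of the trace string — might look like this:

```python
# Naive sketch of phase identification on a compact trace string:
# find the substring whose consecutive repetition covers the largest
# portion of the string. (APPrime's real algorithm is more sophisticated;
# this O(n^3) scan is for illustration only.)
def best_repeat(s):
    best_unit, best_cov = "", 0
    for start in range(len(s)):
        for p in range(1, (len(s) - start) // 2 + 1):
            unit = s[start:start + p]
            reps = 1
            while s[start + reps * p: start + (reps + 1) * p] == unit:
                reps += 1
            if reps >= 2 and reps * p > best_cov:
                best_unit, best_cov = unit, reps * p
    return best_unit, best_cov

# "ab" head, two outer iterations of (5 C-timesteps "ccd" + I/O "ef"), "gh" tail
trace = "ab" + ("ccd" * 5 + "ef") * 2 + "gh"
print(best_repeat(trace))  # the repeated unit is one full C*W iteration
```

On this toy string the maximally covering unit is the whole C*W iteration, matching the I(C*W)*F structure of the assumed computation model.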
  • 16. Computation Gaps across Timesteps 15 [Diagram: the event table of one process’s first timestep (C phase); every timestep issues the same sequence of calls (e.g. MPI_Bcast, MPI_Isend, MPI_Barrier), and the computation gaps between them (“Bubble 1.1”, “Bubble 1.2”, …, “Bubble n.1”, “Bubble n.2”, …) are aggregated across timesteps 1..n into histograms]
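The bubble-to-histogram step can be sketched as follows (an illustrative approximation, not APPrime's code; the gap values, units, and bin count are made up):

```python
import random

# Sketch: model computation "bubbles" (gaps between successive MPI events
# in a timestep) as an equi-width histogram, then sample bubble sizes when
# generating the benchmark. Values (microseconds) and bins are illustrative.
def build_histogram(gaps, nbins=4):
    lo, hi = min(gaps), max(gaps)
    width = (hi - lo) / nbins or 1.0   # guard against all-equal gaps
    counts = [0] * nbins
    for g in gaps:
        counts[min(int((g - lo) / width), nbins - 1)] += 1
    return lo, width, counts

def sample_bubble(hist, rng):
    lo, width, counts = hist
    b = rng.choices(range(len(counts)), weights=counts)[0]
    return lo + (b + rng.random()) * width  # uniform within the chosen bin

gaps = [100, 120, 110, 500, 130]   # observed bubbles in one C phase
hist = build_histogram(gaps)
print(hist)  # (100, 100.0, [4, 0, 0, 1])
print(sample_bubble(hist, random.Random(1)))  # a value drawn from [100, 500)
```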
  • 17. Inter-Process Event Table Merging 16 [Diagram: per-process event tables from different ranks — e.g. MPI_Irecv of 20 MPI_INT with Src = 4 on one process and Src = 8 on another — are merged into a single joint table in which any field that differs across processes becomes a value set, e.g. Src = {4, 8, …}, while identical fields (MPI_Send, Data Count = 20, Dest. = 4, Comm. rank = 4) stay scalar]
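The merge rule — keep matching fields scalar, collapse differing fields into value sets — can be sketched like this (illustrative only; APPrime's event tables carry more columns):

```python
# Sketch of inter-process event-table merging: rows describing the same
# event across processes are merged, and any field that differs
# (e.g. the Src rank of an MPI_Irecv) is collapsed into a value set.
def merge_rows(rows):
    merged = dict(rows[0])
    for row in rows[1:]:
        for k, v in row.items():
            if merged[k] != v:
                prev = merged[k]
                merged[k] = prev if isinstance(prev, set) else {prev}
                merged[k].add(v)
    return merged

p0 = {"name": "MPI_Irecv", "count": 20, "src": 4}   # process 0's row
p1 = {"name": "MPI_Irecv", "count": 20, "src": 8}   # process 1's row
print(merge_rows([p0, p1]))  # the differing field "src" becomes a value set
```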
  • 18. Markov Chain Model for C-Phase States 17 [Diagram: two MC states per rank, each derived from merged timesteps (m, m+n, …) — e.g. State 1: MPI_Irecv/MPI_Send of 20 MPI_INT, Src = {4, 8, …}, Dest. = 4; State 2: MPI_Irecv/MPI_Send of 80 MPI_INT, Src = {1, 7, …}, Dest. = 1] Transition probability matrix: State 1 → {State 1: 0.3, State 2: 0.7}; State 2 → {State 1: 0.7, State 2: 0.3}
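Using the slide's 2-state transition matrix, state selection per timestep can be sketched as a weighted draw from the current state's row (illustrative, not APPrime's generator code):

```python
import random

# Sketch of C-phase state selection with the slide's 2-state transition
# matrix: each timestep, the next state is drawn from the current row.
TRANS = [[0.3, 0.7],   # from State 1: stay 0.3, go to State 2 with 0.7
         [0.7, 0.3]]   # from State 2: go to State 1 with 0.7, stay 0.3

def next_state(state, rng):
    return rng.choices(range(len(TRANS[state])), weights=TRANS[state])[0]

rng = random.Random(42)
states = [0]
for _ in range(10):           # simulate 10 timesteps
    states.append(next_state(states[-1], rng))
print(states)
```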
  • 19. Benchmark Code Generation 18

int main(int argc, char* argv[]) {
    apprime_init(argc, argv);
    init_phase();                       /* direct replay of I phase */
    /* major loop */
    for (timestep = 0; timestep < total_timestep; timestep++) {
        run_state_for_C_phase(state_rank, event_tables);
        /* select the next MC state for the C phase */
        state_rank = trans_state(state_rank, timestep);
        /* periodic I/O phases (Wi, here i = 1) */
        if ((timestep + 1) % restart_period_1 == 0)
            W_phase_1();
        ...
    }
    final_phase();                      /* direct replay of F phase */
    apprime_finalize();
    return 0;
}
  • 20. Evaluation 19 • Platforms: Titan and Sith at ORNL • Workloads: • Real-world HPC applications: • Quantum turbulence code: BEC2 • Gyrokinetic particle simulations: XGC and GTS • NAS benchmarks: BTIO, LU, SP, CG Name # of Nodes Cores per node Mem. per node OS File System Titan 18,688 16 32 GB Cray xk7 Lustre Sith 40 32 64 GB Linux x86_64 Lustre
  • 21. HPC Applications 20 Name Domain Typical prod. run Scale (# of cores) Open Source? Status XGC Gyrokinetic 225,280 No Done GTS Gyrokinetic 262,144 No Done BEC2 Unitary qubit 110,592 No Done QMC-Pack Electronic molecular 256 – 16,000 No Applicable S3D Molecular physics 96,000 – 180,000 No Applicable AWP-ODC Wave propagation 223,074 No Applicable NAMD Molecular dynamics 1,000 – 20,000 No Applicable HFODD Nuclear 299,008 Yes Applicable LAMMPS Molecular dynamics 12,500 – 130,000 Yes Applicable SPEC Multi-domain Benchmarks 150 - 600 Yes Applicable NPB Multi-domain Benchmarks N/A Yes Done
  • 22. Applications’ Trace Features App # of procs # of TSs Trace size Table size # events in one state String size # unique funcs # of states D% TSV % Profile size BTIO 64 250 832MB 584MB 183 44.4KB 16 1 0% 2.1% 2.2MB BTIO 256 250 7.02GB 4.75GB 266 91.5KB 16 1 0% 4.3% 8.1MB CG 64 100 1.42GB 1.00GB 783 77.8KB 11 1 0% 1.5% 3.4MB CG 256 100 7.51GB 5.50GB 1478 101KB 11 1 0% 1.8% 11MB SP 64 500 1.44GB 960MB 139 68.1KB 15 1 0% 1.4% 1.4MB SP 256 500 11.7GB 7.43GB 278 138KB 15 1 0% 3.5% 6.1MB LU 64 300 18GB 12.3GB 1604 471KB 11 1 0% 2.3% 11MB LU 256 300 75GB 51.3GB 1604 471KB 11 1 0% 3.8% 44MB BEC2 64 100 142MB 101MB 74 7.5KB 14 1 0% 1.8% 1.1MB BEC2 256 200 1.08GB 800MB 74 14.7KB 14 1 0% 2.7% 3.6MB XGC 64 100 262MB 243MB 73 11.5KB 28 2 0.1% 4.3% 1.0MB XGC 256 200 2.1GB 1.64GB 103 15.3KB 28 2 0.1% 5.8% 1.4MB GTS 64 50 213MB 137MB 391 11.6KB 38 2 0.3% 5.6% 1.9MB GTS 256 100 1.83GB 1.15GB 391 24.9KB 38 2 0.3% 5.9% 7.2MB 21
  • 23. Results: A vs. A’ 22 • Comparing target application A with the APPrime-generated benchmark A’ – A’ much more compact and easier to build – If A has multiple C-phase states, they take no more than dozens of timesteps to be discovered Name Lines of code (A / A’) Max # of TS tested Max # of TS required BEC2 1.5K / 856 1,000 1 XGC 93.7K / 7.7K 1,000 36 GTS 178.4K / 13.7K 200 2
  • 25. Asynchronous I/O Configuration Assessment 24 [Chart: I/O performance of the original application vs. the APPrime benchmark for BEC2, GTS, and XGC at 64, 256, and 512 processes, each with 0, 1, 4, or 8 staging nodes; some configurations crash or abort in both application and benchmark]
  • 26. Comparing with Other Profile-based Benchmark Generation Techniques 25 (APPrime [21] / BenchMaker [12] / HBench [13]) Generated benchmark: large-scale iterative parallel benchmark / single-process (multi-threaded) benchmark / Java benchmark. Application-specific: Yes / Yes / Yes. Source of profile: own processing of execution traces / user’s input / JVM profilers. Target of profiling: • recurrent event sequence, event parameter/inter-arrival distribution / • instruction mix, branch probabilities, instruction-level parallelism, locality / • frequently invoked methods, function invocation counts, time cost
  • 27. Other Related Work 26 • Communication trace collection • TAU [3], DUMPI [4], ScalaTrace [5] • Trace reduction • Lossy: Xu’s work [6], Cypress [7] • Lossless: ScalaTrace [5] • Profiling • HPCtoolkit [8], Scalasca Performance Toolset [9] • Trace-based application analysis • ScalaExtrap [10], Casas’ work [11] • Benchmark generation • Trace-based: ScalaBenchGen [14] • Source code slicing: FACT [15]
  • 28. Ongoing Work 27 • Filling in computation kernel generation – Currently using histograms to model “bubble size” – Planned COMPrime tool • recursive step on single-process computation kernel • Instruction mix, memory access (more challenging) • Modeling scaling behavior – Take input traces of app A collected at different scales • Problem size, execution size – Can we simulate weak/strong scaling behavior with A’? • Connecting with collaborative work on scalable tracing • Release full benchmarks!
  • 30. APPrime References 1. L. T. Yang, X. Ma, and F. Mueller. Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution. In Proceedings of the ACM/IEEE SC 2005 Conference, Nov. 2005. 2. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. Int. J. High Perform. Comput. Appl., 24(3):335–360, August 2010. 3. S. Shende and A. D. Malony. TAU: The TAU Parallel Performance System. International Journal of High Performance Computing Applications, 20(2), 2006. 4. J. P. Kenny, G. Hendry, B. Allan, and D. Zhang. DUMPI: The MPI Profiler from the SST Simulator Suite, 2011. 5. M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. ScalaTrace: Scalable Compression and Replay of Communication Traces for High-Performance Computing. J. Parallel Distrib. Comput., 2009. 6. Q. Xu, J. Subhlok, R. Zheng, and S. Voss. Logicalization of Communication Traces from Parallel Execution. In IEEE IISWC, 2009. 7. J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen. Cypress: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression. In SC14, 2014. 8. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. CCPE, 2010. 9. M. Geimer, F. Wolf, B. J. N. Wylie, E. Abraham, D. Becker, and B. Mohr. The Scalasca Performance Toolset Architecture. CCPE, 2010. 10. X. Wu and F. Mueller. ScalaExtrap: Trace-based Communication Extrapolation for SPMD Programs. In ACM PPoPP, 2011. 29
  • 31. APPrime References 11. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. IJHPCA, 2010. 12. J. Dujmovic. Automatic Generation of Benchmark and Test Workloads. In WOSP/SIPEW, 2010. 13. X. Zhang and M. Seltzer. HBench:Java: An Application-Specific Benchmarking Framework for Java Virtual Machines. In Proceedings of the ACM 2000 Conference on Java Grande (JAVA '00), 2000. 14. X. Wu, V. Deshpande, and F. Mueller. ScalaBenchGen: Auto-Generation of Communication Benchmarks Traces. In IEEE IPDPS, 2012. 15. J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng. FACT: Fast Communication Trace Collection for Parallel Applications Through Program Slicing. In SC09, 2009. 16. GTC-benchmark in NERSC-8 suite, 2013. 17. NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html, 2003. 18. A. M. Joshi, L. Eeckhout, and L. K. John. The Return of Synthetic Benchmarks. In SPEC Benchmark Workshop, 2008. 19. J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian, and M. Wolf. Understanding I/O Performance Using I/O Skeletal Applications. In Euro-Par. Springer-Verlag, 2012. 20. H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. In CUG, 2007. 30

Editor's Notes

  1. Profiling: Take a relatively light-weight approach, by summarizing aggregate or statistical information of parallel job executions. ScalaExtrap: identifies and extrapolates communication topologies, given traces from executions of different scales. BenchMaker and Hbench: generate benchmark according to statistical models characterizing the original applications ScalaBenGen: Benchmark extraction based on compressed communication trace. FACT: Benchmark extraction based on compressed communication trace. Casas’ work: Phase identification with signal processing techniques