Assignment 1-mtat

    • Instrumentation and analysis of NPB. Zafar Gilani, EMDC 2012. Measurement Tools and Techniques, UPC.
    • Outline
      ● Introduction to benchmark app
      ● Testbeds
      ● Instrumentation
      ● Traces
      ● Measurement criteria
      ● Evaluation
      ● Anomalies
      ● Conclusions
    • 1 Introduction to benchmark app
      ● NPB = NAS Parallel Benchmarks.
      ● A small set of programs designed to evaluate the performance of parallel supercomputers.
      ● 5 kernels, 3 pseudo-applications.
      ● 3 versions: Serial, OpenMP, MPI.
      ● 8 problem classes:
        ○ S - small, for quick tests
        ○ W - workstation size
        ○ A, B, C - standard tests, ~4x size increase from each class to the next
        ○ D, E, F - large tests, ~16x size increase from each class to the next
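As a rough illustration of the ~4x step between the standard classes, the sketch below uses the key counts of the IS (Integer Sort) kernel. It assumes the commonly documented sizes of 2^23, 2^25 and 2^27 keys for classes A, B and C; the dictionary and function names are made up for this example.

```python
# log2 of the total number of keys sorted by the IS kernel per class.
# Assumption: commonly documented NPB IS sizes (A = 2^23, B = 2^25, C = 2^27).
IS_TOTAL_KEYS_LOG2 = {"A": 23, "B": 25, "C": 27}


def total_keys(npb_class: str) -> int:
    """Number of keys sorted by IS for the given class."""
    return 2 ** IS_TOTAL_KEYS_LOG2[npb_class]


def growth_factor(smaller: str, larger: str) -> float:
    """How many times larger the problem of `larger` is than that of `smaller`."""
    return total_keys(larger) / total_keys(smaller)


if __name__ == "__main__":
    for cls in "ABC":
        print(cls, total_keys(cls))
    print("A -> B:", growth_factor("A", "B"))
    print("B -> C:", growth_factor("B", "C"))
```

Under this assumption class C has 134,217,728 keys, which is consistent with a problem size of roughly 135 million values.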
    • 2 Testbeds
                      Local                         Remote
      Machine type    Laptop                        Server
      Processor       Intel Core i3-330M, 2.13GHz   Intel Xeon E5645, 2.40GHz
      Cores           2                             6
      Cache (MB)      3                             12
      Memory (GB)     3                             24
    • 3 Instrumentation
      ● Preload Extrae's MPI trace library "libmpitrace.so".
      ● The library intercepts all MPI calls and traces all MPI events.
      ● Instrumented and executed:
        ○ NPB version 3.3 stable
        ○ NPB3.3-MPI
        ○ IS (Integer Sort) kernel with 2, 4, 8, 16 and 32 procs
      ● Per experiment:
        ○ Size of problem: Class C, approx. 135 million values
        ○ Iterations: 10
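The preload step can be sketched as follows. This is a minimal illustration, not the exact setup used in the slides: the install prefix `/opt/extrae`, the configuration file name `extrae.xml`, the `mpirun` launcher and the `./bin/is.<class>.<nprocs>` binary path are all assumptions. The function only builds the command and environment; it does not run anything.

```python
import os


def extrae_launch(nprocs, npb_class="C", extrae_home="/opt/extrae", config="extrae.xml"):
    """Build (command, environment) for one traced IS run; nothing is executed.

    All paths and names are hypothetical defaults chosen for illustration.
    """
    env = dict(os.environ)
    # Tell Extrae where its configuration file lives.
    env["EXTRAE_CONFIG_FILE"] = config
    # Preload the MPI tracing library so that it intercepts the MPI calls.
    env["LD_PRELOAD"] = f"{extrae_home}/lib/libmpitrace.so"
    # Assumed NPB-MPI binary naming: <kernel>.<class>.<nprocs>
    cmd = ["mpirun", "-np", str(nprocs), f"./bin/is.{npb_class}.{nprocs}"]
    return cmd, env


if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        cmd, env = extrae_launch(n)
        print(" ".join(cmd), "with LD_PRELOAD =", env["LD_PRELOAD"])
```

The returned pair could be passed to `subprocess.run(cmd, env=env)`. Depending on the MPI launcher, it may be preferable to set the preload in a small wrapper script so that only the benchmark processes load the library.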
    • 4 Local traces [Figure; legend: Exec, Comm]
    • 5 Remote traces
    • Evaluation & Comparative Analysis
    • 6 Measurement criteria
      Metric               Relevance to NPB-MPI Integer Sort
      Computation time     General idea of speed-up.
      Communication time   Impact of increasing the number of processes on communication.
      Load imbalance       Which processes or threads do less work than the others.
      Bottlenecks          Performance bottlenecks, e.g. processes that take longer than the rest.
      L1 cache misses      How many times the CPU had to go to other memory to find data.
    • 7 Computation time
      ● Measured: thread processing time.
      ● Local:
        ○ time increases with nprocs
        ○ up to 32 processes
        ○ poor scalability
      ● Remote:
        ○ time decreases as nprocs increases
        ○ up to 32 processes
        ○ good scalability
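Scalability of this kind is usually summarised as speed-up and parallel efficiency. The sketch below computes both relative to the smallest process count measured. The timings are invented purely for illustration and are not taken from the traces; the function name is hypothetical.

```python
def speedup_table(times):
    """times: {nprocs: elapsed seconds}.

    Returns {nprocs: (speedup, efficiency)} relative to the smallest nprocs.
    """
    base_p = min(times)
    base_t = times[base_p]
    table = {}
    for p in sorted(times):
        speedup = base_t / times[p]
        # Ideal speed-up relative to the baseline is p / base_p.
        efficiency = speedup / (p / base_p)
        table[p] = (speedup, efficiency)
    return table


if __name__ == "__main__":
    # Hypothetical timings: one run that scales well, one that does not.
    good = {2: 8.0, 4: 4.0, 8: 2.0}
    poor = {2: 8.0, 4: 10.0}
    print(speedup_table(good))
    print(speedup_table(poor))
```

An efficiency close to 1 corresponds to the "good scalability" case, while a speed-up below 1 means adding processes made the run slower.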
    • 9 Communication time
      ● Overall communication time is determined by the process that takes the longest.
      ● Local:
        ○ rapid increase in time as the number of processes increases
      ● Remote:
        ○ only a small increase in time as the number of processes increases
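The rule that the slowest process sets the overall communication time amounts to taking a maximum over per-process times. A minimal sketch, with invented per-rank values in milliseconds and a hypothetical function name:

```python
def overall_comm_time(per_process_ms):
    """per_process_ms: {rank: communication time in ms}.

    Returns (rank, time) of the process that determines the overall time.
    """
    slowest = max(per_process_ms, key=per_process_ms.get)
    return slowest, per_process_ms[slowest]


if __name__ == "__main__":
    # Hypothetical measurements for a 4-process run.
    sample = {0: 120.0, 1: 95.0, 2: 310.0, 3: 101.0}
    rank, ms = overall_comm_time(sample)
    print(f"rank {rank} dominates with {ms} ms")
```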
    • 11 Load Imbalance
      ● On boada (the remote server):
        ○ For nprocs = 4, threads {2, 3} are lazy.
        ○ For nprocs = 16, threads {5, 6, 7, 8, 12} are lazy.
      [Figure legend: Exec, Wait, Comm]
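One way to put a number on load imbalance is the ratio of average to maximum execution time, together with a simple threshold for flagging "lazy" processes. The sketch below is an assumption about how this could be done, not the method used for the slides; the 0.8 threshold, the zero-based rank numbering and the sample times are all hypothetical.

```python
from statistics import mean


def load_balance(exec_ms):
    """Average over maximum execution time; 1.0 means perfectly balanced."""
    return mean(exec_ms) / max(exec_ms)


def lazy_processes(exec_ms, threshold=0.8):
    """Ranks whose execution time is below `threshold` times the average."""
    avg = mean(exec_ms)
    return [rank for rank, t in enumerate(exec_ms) if t < threshold * avg]


if __name__ == "__main__":
    # Hypothetical per-rank execution times in ms for a 4-process run.
    sample = [100.0, 100.0, 40.0, 40.0]
    print("balance:", load_balance(sample))
    print("lazy ranks:", lazy_processes(sample))
```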
    • 12 Bottlenecks
      ● For nprocs = {8, 16, 32}, one or more processes take more time than the rest.
        ○ Wait/Wait All signals.
        ○ Typical times on the local machine are around 1000 ms.
        ○ Typical times on the remote machine are around 250 ms.
          ■ 4x difference (threads on the remote machine have shorter wait times).
    • 13 [Figure legend: Wait, I/O]
    • 14 L1 cache misses
      ● Cache misses on the local machine are more expensive, typically costing 5x more time.
        ○ Cache size difference? Local has to "look" elsewhere more often.
          ■ i3 has 3 MB cache.
          ■ Xeon has 12 MB cache.
    • 16 Anomalies
      ● For 32 threads:
        ○ Time taken to spawn threads varies.
        ○ Remote takes less time to spawn 32 threads.
        ○ Possible reasons:
          ■ Acquiring locks and switching between resource acquisition and release is costly.
      ● Time taken by "other" jobs also varies:
        ○ But this generally varies from system to system.
    • 17 [Figure labels: Spawning, Others, ??]
    • 18 Conclusions
      ● Instrumentation is necessary to gain performance insights into parallel code.
      ● Extrae offers a handy procedure for automatic instrumentation.
      ● Some interesting observations:
        ○ IS does not scale properly on low-end machines beyond 16 procs.
        ○ IS scales nicely on a server such as boada.
        ○ IS code becomes communication-intensive as nprocs increases.
        ○ Some bottlenecks degrade performance.