Supercomputer Performance Characterization




  1. Supercomputer Performance Characterization (presented by IQxplorer)
  2. Here are some important computer performance questions:
     - What key computer system parameters determine performance?
     - What synthetic benchmarks can be used to characterize these system parameters?
     - How does performance on synthetics compare between computers?
     - How does performance on applications compare between computers?
     - How does performance scale (i.e., vary with processor count)?
  3. Comparative performance results have been obtained on six computers at NCSA & SDSC, all with > 1,000 processors.
  4. These computers have shared-memory nodes of widely varying size connected by different switch types:
     - Blue Gene
       - Massively parallel processor system with low-power, 2p nodes
       - Two custom switches for point-to-point and collective communication
     - Cobalt
       - Cluster of two large, 512p nodes (also called a constellation)
       - Custom switch within nodes & commodity switch between nodes
     - DataStar
       - Cluster of 8p nodes
       - Custom high-performance switch called Federation
     - Mercury, Tungsten, & T2
       - Clusters of 2p nodes
       - Commodity switches
  5. Performance can be better understood with a simple model:
     - Total run time can be split into three components:
       - t_tot = t_comp + t_comm + t_io
     - Overlap may exist. If so, it can be handled as follows:
       - t_comp = computation time
       - t_comm = communication time that can't be overlapped with t_comp
       - t_io = I/O time that can't be overlapped with t_comp & t_comm
     - Relative values vary depending upon computer, application, problem, & number of processors
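The three-component model above can be sketched in a few lines of Python. This is a hypothetical illustration with invented numbers, not measured data; `runtime_breakdown` is a name introduced here for clarity.

```python
# Illustration of the run-time model t_tot = t_comp + t_comm + t_io.
# All timing values below are made up for the example.

def runtime_breakdown(t_comp, t_comm, t_io):
    """Return total time and the fraction each component contributes.

    Per the model, t_comm and t_io are assumed to already exclude any
    portion that overlaps with computation.
    """
    t_tot = t_comp + t_comm + t_io
    return {
        "t_tot": t_tot,
        "comp_frac": t_comp / t_tot,
        "comm_frac": t_comm / t_tot,
        "io_frac": t_io / t_tot,
    }

# Example: a run that spends 80 s computing, 15 s in non-overlapped
# communication, and 5 s in non-overlapped I/O.
b = runtime_breakdown(80.0, 15.0, 5.0)
print(b)  # comp_frac = 0.80, comm_frac = 0.15, io_frac = 0.05
```

Comparing these fractions across machines and processor counts is exactly the kind of analysis the following slides perform.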
  6. Run-time components depend upon system parameters & code features. Differences between point-to-point & collective communication are important too.
  7. Compute, communication, & I/O speeds have been measured for many synthetic & application benchmarks:
     - Synthetic benchmarks
       - sloops (includes daxpy & dot)
       - HPL (Linpack)
       - HPC Challenge
       - NAS Parallel Benchmarks
       - IOR
     - Application benchmarks
       - Amber 9 PMEMD (biophysics: molecular dynamics)
       - …
       - WRF (atmospheric science: weather prediction)
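For readers unfamiliar with the daxpy and dot kernels named above, here is a minimal pure-Python sketch of the two operations. Real synthetics like sloops use compiled code and careful timing; this only shows what the kernels compute.

```python
# daxpy and dot: the two memory-bound kernels referenced by the
# "sloops" synthetic. Pure Python for illustration only; interpreted
# speeds are not comparable to a real benchmark.
import time

def daxpy(a, x, y):
    """y <- a*x + y (2 flops per element, three memory streams)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Inner product x.y (2 flops per element, two memory streams)."""
    return sum(xi * yi for xi, yi in zip(x, y))

n = 1_000_000
x = [1.0] * n
y = [2.0] * n

t0 = time.perf_counter()
y2 = daxpy(3.0, x, y)
elapsed = time.perf_counter() - t0
mflops = 2 * n / elapsed / 1e6  # 1 multiply + 1 add per element
print(f"daxpy: {mflops:.1f} MFLOP/s (interpreted Python)")
print(dot(x, y))  # prints 2000000.0
```

Because each element is touched only once, both kernels stress memory bandwidth rather than arithmetic units, which is why the next slide uses daxpy to probe memory access behavior.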
  8. Normalized memory access profiles for daxpy show better memory access, but more memory contention, on Blue Gene compared to DataStar.
  9. Each HPCC synthetic benchmark measures one or two system parameters in varying combinations.
  10. Relative speeds are shown for HPCC benchmarks on 6 computers at 1,024p; 4 different computers are fastest depending upon benchmark, and 2 of these are also slowest, depending upon benchmark. Data available soon at the CIP Web site.
  11. Absolute speeds are shown for HPCC & IOR benchmarks on SDSC computers; TG processors are fastest, BG & DS interconnects are fastest, & all three computers have similar I/O rates.
  12. Relative speeds are shown for 5 applications on 6 computers at various processor counts; Cobalt & DataStar are generally fastest.
  13. Good scaling is essential to take advantage of high processor counts:
     - Two types of scaling are of interest:
       - Strong: performance vs processor count (p) for fixed problem size
       - Weak: performance vs p for fixed work per processor
     - There are several ways of plotting scaling:
       - Run time (t) vs p
       - Speed (1/t) vs p
       - Speed/p vs p
     - Scaling depends significantly on the computer, application, & problem
     - Use a log-log plot to preserve ratios when comparing computers
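The strong-scaling metrics listed above can be computed directly from measured run times. The sketch below uses invented times t(p) to show speedup (relative 1/t), parallel efficiency (relative speed/p), and why a log-log plot is recommended.

```python
# Strong-scaling metrics from a hypothetical timing scan. The run times
# below are invented for illustration, not measurements from the slides.
import math

procs = [64, 128, 256, 512, 1024]
times = [100.0, 52.0, 28.0, 16.0, 11.0]  # hypothetical t(p), in seconds

base_p, base_t = procs[0], times[0]
for p, t in zip(procs, times):
    speedup = base_t / t               # speed (1/t) relative to the base run
    efficiency = speedup * base_p / p  # speed/p, normalized to the base run
    print(f"p={p:5d}  t={t:6.1f}s  speedup={speedup:5.2f}  eff={efficiency:4.2f}")

# On log-log axes, ideal strong scaling of t vs p is a straight line of
# slope -1, and equal performance ratios appear as equal vertical offsets,
# which is why log-log plots preserve ratios across machines.
slope = (math.log(times[-1]) - math.log(times[0])) / \
        (math.log(procs[-1]) - math.log(procs[0]))
print(f"fitted log-log slope: {slope:.2f} (ideal: -1.0)")
```

A slope noticeably shallower than -1, as here, is the signature of the scaling loss the following slides examine application by application.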
  14. The AWM 512^3 problem shows good strong scaling to 2,048p on Blue Gene & to 512p on DataStar, but not on the TeraGrid cluster. (Data from Yifeng Cui)
  15. The MILC medium problem shows superlinear speedup on Cobalt, Mercury, & DataStar at small processor counts; strong scaling ends for DataStar & Blue Gene above 2,048p.
  16. The NAMD ApoA1 problem scales best on DataStar & Blue Gene; Cobalt is fastest below 512p, but the same speed as DataStar at 512p.
  17. The WRF standard problem scales best on DataStar; Cobalt is fastest below 512p, but the same speed as DataStar at 512p.
  18. The communication fraction generally grows with processor count in strong-scaling scans, as for the WRF standard problem on DataStar.
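The growth of the communication fraction under strong scaling follows from the run-time model on slide 5: compute time shrinks like W/p while communication does not. A toy model with invented constants makes the trend concrete (the log2(p) collective cost is an assumption for illustration, not a measured property of WRF or DataStar).

```python
# Toy model of the growing communication fraction in a strong-scaling
# scan: t_comp ~ W/p shrinks with p, while a collective modeled as
# C*log2(p) grows. W and C are invented constants.
import math

W = 1000.0   # total compute work (arbitrary units)
C = 2.0      # per-step collective cost factor (assumed model)

for p in [64, 128, 256, 512, 1024]:
    t_comp = W / p
    t_comm = C * math.log2(p)
    frac = t_comm / (t_comp + t_comm)
    print(f"p={p:5d}  comm fraction = {frac:.2f}")
```

In this model the communication fraction rises monotonically with p, which is the behavior the slide reports for WRF on DataStar.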
  19. A more careful look at Blue Gene shows many pluses:
     + Hardware is more reliable than for other high-end systems installed at SDSC in recent years
     + Compute times are extremely reproducible
     + Networks scale well
     + I/O performance with GPFS is good at high p
     + Price per peak flop/s is low
     + Power per flop/s is low
     + Footprint is small
  20. But there are also some minuses:
     - Processors are relatively slow
       - Clock speed is 700 MHz
       - Compilers seldom use the second FPU in each processor (though optimized libraries do)
     - Applications must scale well to get high absolute performance
     - Memory is only 512 MB/node, so some problems don't fit
       - Coprocessor mode can be used (with 1p/node), but this is inefficient
       - Some problems still don't fit even in coprocessor mode
     - Cross-compiling complicates software development for complex codes
  21. Major applications ported and being run on BG at SDSC span various disciplines.
  22. The speed of BG relative to DataStar varies around the clock-speed ratio (0.47 = 0.7 GHz / 1.5 GHz) for applications on ≥ 512p; CO & VN modes perform similarly (per MPI process).
  23. DNS scaling on BG is generally better than on DataStar, but shows unusual variation; VN mode is somewhat slower than CO mode (per MPI process). (Data from Dmitry Pekurovsky)
  24. If the number of allocated processors is considered instead, then VN mode is faster than CO mode, and both modes show unusual variation. (Data from Dmitry Pekurovsky)
  25. IOR weak-scaling scans using GPFS-WAN show that BG in VN mode achieves 3.4 GB/s for writes (~DS) & 2.7 GB/s for reads (>DS).
  26. Blue Gene has more limited applicability than DataStar, but is a good choice if the application is right:
     + Some applications run relatively fast & scale well
     + Turnaround is good with only a few users
     + Hardware is reliable & easy to maintain
     - Other applications run relatively slowly and/or don't scale well
     - Some typical problems need to run in CO mode to fit in memory
     - Other typical problems won't fit at all