Gemini will represent a large improvement over SeaStar in terms of reliability and serviceability:
   Adaptive Routing – …
[Figure: cabinet airflow, alternating low-velocity and high-velocity airflow regions]
[Figure: liquid cooling loop, R134a piping with inlet and exit evaporators; cool air is released into the computer room]
 32 MB per OST (32 MB – 5 GB) and 32 MB Transfer Size
    Unable to take advantage of file system parallelism
 Single OST, 256 MB File Size
    Performance can be limited by the process (transfer size) or file system (stripe size)
 Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and
  possibly size
    lfs setstripe -c -1 -s …
PGI Compiler
Cray Compiler Environment
  Cray Scientific Libraries
 Cray XT/XE supercomputers come with compiler wrappers to simplify
  building parallel applications (similar to the mpicc/mp…
 Traditional (scalar) optimizations are controlled via -O# compiler flags
    Default: -O2
 More aggressive optimizatio...
 Compiler feedback is enabled with -Minfo and -Mneginfo
     This can provide valuable information about what optimizati...
Some compiler options may affect both performance and accuracy. Lower
accuracy is often higher performance, but it’s also …
 Cray has a long tradition of high performance compilers on Cray
  platforms (Traditional vector, T3E, X1, X2)
    Vecto...
Fortran Source              C and C++ Source   C and C++ Front End
                                                       ...
 Standard conforming languages and programming models
    Fortran 2003
    UPC & CoArray Fortran
       Fully optimize...
 Make sure it is available
    module avail PrgEnv-cray
 To access the Cray compiler
    module load PrgEnv-cray
 To ...
 Excellent Vectorization
    Vectorize more loops than other compilers
 OpenMP 3.0
   Task and Nesting
 PGAS: Functio...
 Loop Based Optimizations
    Vectorization
    OpenMP
       Autothreading
   Interchange
   Pattern Matching
   C...
 Cray compiler supports a full and growing set of directives
  and pragmas

!dir$ concurrent
!dir$ ivdep
!dir$ interchang...
 Compiler can generate a filename.lst file.
     Contains an annotated listing of your source code with letters indicating …
• ftn –rm …      or cc –hlist=m …
29. b-------<   do i3=2,n3-1
30. b b-----<      do i2=2,n2-1
31. b b Vr--<        do i1=...
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
 A loop starting at line 29 was not vectorized because a recurrence was fou...
 -hbyteswapio
   Link time option
   Applies to all unformatted fortran IO
 Assign command
   With the PrgEnv-cray mo...
 OpenMP is ON by default
   Optimizations controlled by –Othread#
   To shut off use –Othread0 or –xomp or –hnoomp



...
 Traditional model
    Tuned general purpose codes
      Only good for dense
      Not problem sensitive
      Not…
 Goal of scientific libraries
       Improve productivity at optimal performance
 Cray uses four concentrations to achiev…
 Three separate classes of standardization, each with a corresponding
  definition of productivity
   1. Standard interfa...
 Algorithmic tuning
    Increased performance by exploiting algorithmic improvements
        Sub-blocking, new algorith...
Dense:  BLAS, LAPACK, …
Sparse: CASK, PETSc, …
FFT:    CRAFFT, …
 Serial and Parallel versions of sparse iterative linear solvers
    Suites of iterative solvers
        CG, GMRES, BiC...
 Cray provides state-of-the-art scientific computing packages to strengthen
  the capability of PETSc
    Hypre: scalabl…
 The Trilinos Project http://trilinos.sandia.gov/
    “an effort to develop algorithms and enabling technologies within a...
 CASK is a product developed at Cray using the
  Cray Auto-tuning Framework (Cray ATF)
 The CASK Concept :
    Analyze ...
[Stack diagram: large-scale application (highly portable, user controlled) on top of PETSc / Trilinos / …]
[Figure: speedup of parallel SpMV on 8 cores over 60 different matrices; speedups range from 1.0x to about 1.4x]
[Figures: Block Jacobi preconditioning and SpMV, performance of CASK vs. …, in MFlops]
[Figure: geometric mean over 80 sparse matrix instances from the U. of Florida collection]
 In FFTs, the problems are
    Which library choice to use?
    How to use complicated interfaces (e.g., FFTW)


 Stan...
 CRAFFT is designed with simple-to-use interfaces
    Planning and execution stage can be combined into one function cal...
             128x128    256x256    512x512
FFTW plan         74        312       2758
FFTW exec      0.105          …
1.   Load module fftw/3.2.0 or higher.
2.   Add a Fortran statement “use crafft”
3.   call crafft_init()
4.   Call crafft ...
 As of December 2009, CRAFFT includes distributed parallel transforms
 Uses the CRAFFT interface prefixed by “p”, with o...
1.   Add “use crafft” to Fortran code
2.   Initialize CRAFFT using crafft_init
3.   Assume MPI initialized and data distri...
[Figure: 2D FFT (N x N, transposed) on 128 cores; performance in Mflops, up to ~140,000]
 Solves linear systems in single precision
 Obtaining solutions accurate to double precision
    For well conditioned p...
 “High Power Electromagnetic Wave
  Heating in the ITER Burning Plasma”
 rf heating in tokamak
 Maxwell-Boltzmann Eqns…
[Figure: performance relative to theoretical peak]
Decide if you want to use advanced API or benchmark API
    benchmark API :
         setenv IRT_USE_SOLVERS 1
    Advanced...
 LibSci 10.4.2 February 18th 2010
    OpenMP-aware LibSci
    Allows calling of BLAS inside or outside parallel region
...
CrayPAT
 Assist the user with application performance analysis and optimization
           Help user identify important and mean...
 Supports traditional post-mortem performance analysis
       Automatic identification of performance problems
         ...
 CrayPat
           Instrumentation of optimized code
           No source code modification required
           Data ...
 When performance measurement is triggered
         External agent (asynchronous)
               Sampling
             ...
 Millions of lines of code
           Automatic profiling analysis
                  Identifies top time consuming rout...
 Important performance statistics:
    Top time consuming routines
    Load balance across computing resource…
 No source code or makefile modification required
           Automatic instrumentation at group (function) level
       ...
       Analyze the performance data and direct the user to meaningful
            information

           Simplifies the...
 Performs data conversion


             Combines information from binary with raw performance
                  data

 ...
 Craypat / Cray Apprentice2 5.0 released September 10, 2009


           New internal data format
           FAQ
      ...
       Access performance tools software

                  % module load xt-craypat apprentice2

           Build appli...
      Generate report and .apa instrumentation file

       % pat_report –o my_sampling_report [<sdatafile>.xf |
        ...
# You can edit this file, if desired, and use it        # 43.37% 99659 bytes
# to reinstrument the prog…
   biolib   Cray Bioinformatics library routines
   blacs    Basic Linear Algebra c…
   omp      OpenMP API (not supported on …)
0   Summary with instruction metrics
1   Summary with …
11  Floating point operations mix (2)
 Regions, useful to break up long routines
    int PAT_region_begin (int id, const char *label)
    int PAT_region_end ...
      Instrument application for further analysis (a.out+apa)

       % pat_build –O <apafile>.apa

      Run applicatio...
 MUST run on Lustre ( /work/… , /lus/…, /scratch/…, etc.)


     Number of files used to store raw data


           1 ...
 Full trace files show transient events but are too large

 Current run-time summarization misses transient events

 Pl...
 Call graph profile
 Communication statistics
Cray Apprentice2 is targeted to help …
Switch Overview display




September 21-24, 2009   © Cray Inc.               108
[Screenshot: Min, Avg, and Max values; -1, +1]
Width  inclusive time

                                   Height  exclusive time


                                     ...
Right mouse click:
                                                           Node menu
                                  ...
 Cray Apprentice2 panel help


     pat_help – interactive help on the Cray Performance toolset


     FAQ available th...
 intro_craypat(1)
           Introduces the craypat performance tool
     pat_build
           Instrument a program fo...
pat_report: Help for -O option:

Available option values are in left column, a prefix can be specified:

  ct             ...
 Interactive by default, or use trailing '.' to just print a topic:


     New FAQ craypat 5.0.0.


     Has counter an...
The top level CrayPat/X help topics are listed below.
       A good place to start is:
               overview
       If a...
CPU Optimizations
Optimizing Communication
    I/O Best Practices
55. 1                 ii = 0
56. 1 2-----------< do b = abmin, abmax                           Poor loop order
57. 1 2 3--...
USER / #1.Original Loops
-----------------------------------------------------------------         Poor loop order
 Time% ...
75. 1 2-----------< do i = ijmin, ijmax
76. 1 2               jj = 0
77. 1 2 3---------<   do a = abmin, abmax            ...
USER / #2.Reordered Loops
-----------------------------------------------------------------       Improved striding
 Time%...
First loop, partially vectorized and                Second loop, vectorized and
unrolled by 4                             ...
USER / #3.Fissioned Loops
                                                                         Fissioning further
----...
 Cache blocking is a combination of strip mining and loop interchange, designed
  to increase data reuse.
     Takes adv...
[Figures: cache blocking a 2D Laplacian (do j = 1, 8 …) on an i×j grid; the unblocked loop incurs 120 cache misses; one-dimensional blocking reduces misses from 12…]
   Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
   Operations can be arranged to create multipl...
“I tried cache-blocking my code, but it didn’t help”


 You’re doing it wrong.
    Your block size is too small (too muc...
 Multigrid PDE solver
 Class D, 64 MPI ranks
    Global grid is …
       do i3 = 2, 257
 Block the inner two loops: Mop/s/proces…
 Block the outer two loops: Mop/s/proces…
(    53) void mat_mul_daxpy(double *a, double *b, double *c,
                            int rowa, int cola, int colb)
(    54) {
(    55)     int i…
(    53) void mat_mul_daxpy(double* restrict a, double* restrict b,
                            double* restrict c, int rowa, int cola, int colb)
…
66, Generated alternate loop with no peeling - executed if loop count <= 24
        Generated vector sse code for inner lo...
July 2009   Slide 150
 GNU malloc library
   malloc, calloc, realloc, free calls
      Fortran dynamic variables
 Malloc library system call...
   Detecting “bad” malloc behavior
     Profile data => “excessive system time”
   Correcting “bad” malloc behavior
   ...
 Google created a replacement “malloc” library
   “Minimal” TCMalloc replaces GNU malloc
 Limited testing indicates TCM...
 Linux has a “first touch policy” for memory allocation
    *alloc functions don’t actually allocate your memory
    Me...
 Short Message Eager Protocol
    The sending rank “pushes” the message to the receiving rank
    Used for messages MPI...
[Figure: match entries posted by MPI to handle incoming messages and unexpected messages]
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD Users Group 2010


This presentation gives basic information about optimizing applications for Cray XT6 and XE6 Supercomputers.
Published in: Technology
  • Planned times: Architecture 30-45 min; PGI 10-15 min; CCE 15-20 min; Libsci 15-20 min; CrayPAT 30-45 min; Optimization 60-90 min
  • CQ (Completion Queue): an event notification block used when the processor needs to be notified that BTE or FMA transactions have completed. NAT (Network Address Translation): responsible for validating and translating addresses from the network address format to an address on the local node. AMO (Atomic Memory Operation): responsible for AMO-type transactions. ORB (Outstanding Request Buffer): processes requests to the network and matches responses from the network to the original requests. RMT (Receive Message Table): tracks groups of packets, or sequences, transmitted from remote nodes of the network. SSID (Synchronization Sequence Identification): tracks all of the request packets that originate and all of the response packets that terminate at the NIC, in order to perform completion notifications for transactions; assists in the identification of SW operations and processes impacted by errors; monitors errors detected by other NIC blocks.
  • Figure 2: Logical and Physical views of striping. Four application processes write a variable amount of data sequentially within a shared file. This shared file is striped over 4 OSTs with 1 MB stripe sizes. This write operation is not stripe aligned therefore some processes write their data to stripes used by other processes. Some stripes are accessed by more than one process (which may cause contention). Additionally, OSTs are accessed by variable numbers of processes (3 OST0, 1 OST1, 2 OST2 and 2 OST3).
  • Figure 3: Write performance for serial I/O at various Lustre stripe counts. File size is 32 MB per OST utilized and write operations are 32 MB in size. Utilizing more OSTs does not increase write performance. The best performance is seen by utilizing a stripe size which matches the size of write operations.
  • Figure 4: Write Performance for serial I/O at various Lustre stripe sizes and I/O operation sizes. File utilized is 256 MB written to a single OST. Performance is limited by small operation sizes and small stripe sizes. Either can become a limiting factor in write performance. The best performance is obtained in each case when the I/O operation and stripe sizes are similar.
  • These loops were taken from the nuccor application and provided by Rebecca Hartman-Baker from ORNL. She originally began by comparing various compilers and optimization levels. The results of the following rewrites came at the suggestion of Vince Graziano of Cray.
  • This code better plays to the strengths of the CPU. More cache reuse, easier prefetching, better chance of vectorizing.
  • Original: 13.938244 s; Reordered: 7.955379 s
  • The code further improves on the last by allowing slightly better cache reuse, but significantly better opportunity to vectorize on both a and b. I asked the compiler team why the loop nest on the left was only partially vectorized and they said that their studies showed that it would probably not be profitable (probably due to the tmat7 array striding on the second dimension).
  • Original: 13.938244 s; Reordered: 7.955379 s; Fissioned: 2.481636 s
  • The following Cache Blocking example was created by Steve Whalen of Cray.
  • See http://en.wikipedia.org/wiki/Restrict for more information on “Restrict”
  • The following come from Kim McMahon (Cray)
  • Figure 5: Write performance of a file-per-process I/O pattern as a function of number of files/processes. The file size is 128 MB with 32 MB sized write operations. Performance increases as the number of processes/files increases until OST and metadata contention hinder performance improvements. Each file is subject to the limitations of serial I/O. Improved performance can be obtained from a parallel file system such as Lustre. However, at large process counts (large number of files) metadata operations may hinder overall performance. Additionally, at large process counts (large number of files) OSS and OST contention will hinder overall performance.
  • Figure 8: Write Performance of a single shared file as the number of processes increases. A file size of 32 MB per process is utilized with 32 MB write operations. For each I/O library (Posix, MPI-IO, and HDF5) performance levels off at high core counts.

    1. 1.  Review of XT6 Architecture  AMD Opteron  Cray Networks  Lustre Basics  Programming Environment  PGI Compiler Basics  The Cray Compiler Environment  Cray Scientific Libraries  Cray Performance Analysis Tools  Optimizations  CPU  Communication  I/O
    2. 2. AMD CPU Architecture Cray Architecture Lustre Filesystem Basics
    3. 3.
                        2003          2005          2007           2008          2009          2010
                        AMD           AMD           “Barcelona”    “Shanghai”    “Istanbul”    “Magny-Cours”
                        Opteron™      Opteron™
       Mfg. Process     130nm SOI     90nm SOI      65nm SOI       45nm SOI      45nm SOI      45nm SOI
       CPU Core         K8            K8            Greyhound      Greyhound+    Greyhound+    Greyhound+
       L2/L3            1MB/0         1MB/0         512kB/2MB      512kB/6MB     512kB/6MB     512kB/12MB
       HyperTransport™  3x 1.6GT/s    3x 1.6GT/s    3x 2GT/s       3x 4.0GT/s    3x 4.8GT/s    4x 6.4GT/s
       Memory           2x DDR1 300   2x DDR1 400   2x DDR2 667    2x DDR2 800   2x DDR2 800   4x DDR3 1333
    4. 4. 12 cores at 1.7-2.2 GHz: 105.6 Gflops; 8 cores at 1.8-2.4 GHz: 76.8 Gflops; Power (ACP): 80 Watts; Stream: 27.5 GB/s; Cache: 12x 64KB L1, 12x 512KB L2, 12MB L3
    5. 5. [Die diagram: twelve cores (Core 0 - Core 11), each with a private 512KB L2 cache, a shared L3 cache, two memory controllers, and HT links]
    6. 6.  A cache line is 64B  Cache is a “victim cache”  All references go to L1 immediately and get evicted down the caches  A cache line is usually only in one level of cache  Hardware prefetcher detects forward and backward strides through memory  Each core can perform a 128b add and 128b multiply per clock cycle  This requires SSE, packed instructions  “Stride-one vectorization”
    7. 7. SeaStar (XT-series) Gemini (XE-series)
    8. 8.  Microkernel on Compute PEs, full featured Linux on Service PEs.  Service PEs specialize by function (Login, Network, System, I/O); the service partition runs specialized Linux nodes  Software architecture eliminates OS “jitter”  Software architecture enables reproducible run times  Large machines boot in under 30 minutes, including filesystem
    9. 9. [System diagram: 3-D torus (X, Y, Z axes); compute nodes; login nodes; network nodes (GigE, 10 GigE); boot/syslog/database nodes; I/O and metadata nodes; SMW; Fibre Channel connections to a RAID subsystem]
    10. 10. Cray XT5 systems ship with the SeaStar2+ interconnect, now scaled to 225,000 cores:  Custom ASIC  Integrated NIC / Router  MPI offload engine  Connectionless protocol  Link level reliability  Proven scalability to 225,000 cores [Block diagram: DMA engine, HyperTransport interface, 6-port router, memory, blade control processor (PowerPC 440)]
    11. 11. Processor        Frequency   Peak (Gflops)   Bandwidth (GB/sec)   Balance (bytes/flop)
            Istanbul (XT5)   2.6         62.4            12.8                 0.21
            MC-8             2.0         64.0            42.6                 0.67
            MC-8             2.3         73.6            42.6                 0.58
            MC-8             2.4         76.8            42.6                 0.55
            MC-12            1.9         91.2            42.6                 0.47
            MC-12            2.1         100.8           42.6                 0.42
            MC-12            2.2         105.6           42.6                 0.40
            Cray Inc. Preliminary and Proprietary SC09 13
    12. 12. Characteristics: Number of Cores: 16 or 24 (MC), 32 (IL); Peak Performance MC-8 (2.4): 153 Gflops/sec; Peak Performance MC-12 (2.2): 211 Gflops/sec; Memory Size: 32 or 64 GB per node; Memory Bandwidth: 83.5 GB/sec. 6.4 GB/sec direct connect HyperTransport; 83.5 GB/sec direct connect memory; Cray SeaStar2+ interconnect. Cray Inc. Preliminary and Proprietary SC09 14
    13. 13. [Node diagram: four Opteron dies, each with six Greyhound cores and 6MB L3 cache, DDR3 channels, HT3 links between dies, HT1 / HT3 to the interconnect]  2 Multi-Chip Modules, 4 Opteron Dies  8 Channels of DDR3 Bandwidth to 8 DIMMs  24 (or 16) Computational Cores, 24 MB of L3 cache  Dies are fully connected with HT3  Snoop Filter Feature Allows 4 Die SMP to scale well Cray Inc. Preliminary and Proprietary SC09 15
    14. 14. Without snoop filter, a streams test shows 25 GB/sec out of a possible 51.2 GB/sec, or 48% of peak bandwidth Cray Inc. Preliminary and Proprietary SC09 16
    15. 15. With snoop filter, a streams test shows 42.3 GB/sec out of a possible 51.2 GB/sec, or 82% of peak bandwidth • This feature will be key for two-socket Magny Cours nodes, which are the same architecture-wise Cray Inc. Preliminary and Proprietary SC09 17
    16. 16.  New compute blade with 8 AMD Magny Cours processors  Plug-compatible with XT5 cabinets and backplanes  Initially will ship with SeaStar interconnect as the Cray XT6  Upgradeable to Gemini Interconnect or Cray XE6  Upgradeable to AMD’s “Interlagos” series  XT6 systems will continue to ship with the current SIO blade  First customer ship, March 31st Cray Inc. Preliminary and Proprietary SC09 18
    17. 17. Cray Inc. Preliminary and Proprietary SC09 19
    18. 18.  Supports 2 Nodes per ASIC  168 GB/sec routing capacity  Scales to over 100,000 network endpoints  Link Level Reliability and Adaptive Routing  Advanced Resiliency Features  Provides global address space  Advanced NIC designed to efficiently support  MPI  One-sided MPI  Shmem  UPC, Coarray FORTRAN [Block diagram: two HyperTransport 3 interfaces, NIC 0 and NIC 1, Netlink block, SB, LO processor, 48-port YARC router] Cray Inc. Preliminary and Proprietary SC09 20
    19. 19. Cray Baker Node Characteristics: Number of Cores: 16 or 24; Peak Performance: 140 or 210 Gflops/s; Memory Size: 32 or 64 GB per node; Memory Bandwidth: 85 GB/sec. 10 12X Gemini channels (each Gemini acts like two nodes on the 3-D torus); high-radix YARC router with adaptive routing; 168 GB/sec capacity. Cray Inc. Preliminary and Proprietary SC09 21
20. [Diagram: module with SeaStar vs. module with Gemini, shown on the X/Y/Z torus axes]
21. [Block diagram of the Gemini NIC: FMA, BTE, CQ, NAT, AMO, ORB, NPT, RMT, and RAT blocks between the HT3 cave and the router tiles]
 FMA (Fast Memory Access)
    Mechanism for most MPI transfers
    Supports tens of millions of MPI requests per second
 BTE (Block Transfer Engine)
    Supports asynchronous block transfers between local and remote memory, in either direction
    For use for large MPI transfers that happen in the background
22.  Two Gemini ASICs are packaged on a pin-compatible mezzanine card
 Topology is a 3-D torus
 Each lane of the torus is composed of 4 Gemini router "tiles"
 Systems with SeaStar interconnects can be upgraded by swapping this card
 100% of the 48 router tiles on each Gemini chip are used
23.  Like SeaStar, Gemini has a DMA offload engine allowing large transfers to proceed asynchronously
 Gemini provides low-overhead OS-bypass features for short transfers
    MPI latency targeted at ~1 us
    NIC provides for many millions of MPI messages per second
    "Hybrid" programming not a requirement for performance
 RDMA provides a much improved one-sided communication mechanism
 AMOs provide a faster synchronization method for barriers
 Gemini supports adaptive routing, which
    Reduces problems with network hot spots
    Allows MPI to survive link failures
24.  Globally addressable memory provides efficient support for UPC, Co-array Fortran, Shmem and Global Arrays
    The Cray Programming Environment will target this capability directly
 Pipelined global loads and stores
    Allows for fast irregular communication patterns
 Atomic memory operations
    Provide the fast synchronization needed for one-sided communication models
25. Gemini will represent a large improvement over SeaStar in terms of reliability and serviceability
 Adaptive routing: multiple paths to the same destination
    Allows mapping around bad links without rebooting
    Supports warm-swap of blades
    Prevents hot spots
 Reliable transport of messages
    Packet-level CRC carried from start to finish
    Large blocks of memory protected by ECC
    Can better handle failures on the HT link; discards packets instead of putting backpressure into the network
    Supports end-to-end reliable communication (used by MPI)
 Improved error reporting and handling
    The low-overhead error reporting allows the programming model to replay failed transactions
    Performance counters allow tracking of app-specific packets
28. [Airflow diagram: alternating low-velocity and high-velocity airflow regions through the cabinet]
29. Cool air is released into the computer room. Liquid in, liquid/vapor mixture out. The hot air stream passes through an evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation). R134a absorbs energy only in the presence of heated air. Phase change is 10x more efficient than pure water cooling.
30. [Photo: R134a piping, with inlet and exit evaporators]
31.  32 MB per OST (32 MB - 5 GB) and 32 MB transfer size
 Unable to take advantage of file system parallelism
 Access to multiple disks adds overhead, which hurts performance
[Chart: Single Writer Write Performance — write rate (MB/s, 0-120) vs. stripe count (1, 2, 4, 16, 32, 64, 128, 160) for 1 MB and 32 MB stripe sizes]
32.  Single OST, 256 MB file size
 Performance can be limited by the process (transfer size) or the file system (stripe size)
[Chart: Single Writer Transfer vs. Stripe Size — write rate (MB/s, 0-140) vs. stripe size (1-128 MB) for 32 MB, 8 MB, and 1 MB transfer sizes]
33.  Use the lfs command, libLUT, or MPI-IO hints to adjust your stripe count and possibly stripe size
    lfs setstripe -c -1 -s 4M <file or directory>   (160 OSTs, 4 MB stripe)
    lfs setstripe -c 1 -s 16M <file or directory>   (1 OST, 16 MB stripe)
    export MPICH_MPIIO_HINTS='*:striping_factor=160'
 Files inherit striping information from the parent directory; this cannot be changed once the file is written
 Set the striping before copying in files
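To see what the stripe settings mean, here is a toy model of Lustre's round-robin striping — which OST in the file's stripe set serves a given byte offset. This is an illustration of the layout policy only, not Lustre code:

```python
MB = 1 << 20

def ost_for_offset(offset, stripe_size, stripe_count):
    # Lustre lays the file out round-robin in stripe_size chunks
    # across stripe_count OSTs.
    return (offset // stripe_size) % stripe_count

# 4 MB stripes over 4 OSTs: bytes 0-4MB land on OST 0, 4-8MB on OST 1, ...
chunks = [ost_for_offset(i * 4 * MB, 4 * MB, 4) for i in range(6)]
print(chunks)  # [0, 1, 2, 3, 0, 1]
```

With `-c 1` every chunk maps to the same OST, which is why a single writer often does best with a small stripe count.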
    34. 34. PGI Compiler Cray Compiler Environment Cray Scientific Libraries
35.  Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90)
    Fortran compiler: ftn
    C compiler: cc
    C++ compiler: CC
 Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries
    Cray MPT (MPI, Shmem, etc.)
    Cray LibSci (BLAS, LAPACK, etc.)
    ...
 Choose the underlying compiler via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly
 Always load the appropriate xtpe-<arch> module for your machine
    Enables the proper compiler target
    Links optimized math libraries
36.  Traditional (scalar) optimizations are controlled via -O# compiler flags
    Default: -O2
 More aggressive optimizations (including vectorization) are enabled with the -fast or -fastsse metaflags
    These translate to: -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
 Interprocedural analysis allows the compiler to perform whole-program optimizations; it is enabled with -Mipa=fast
 See man pgf90, man pgcc, or man pgCC for more information about compiler options
37.  Compiler feedback is enabled with -Minfo and -Mneginfo
    This can provide valuable information about what optimizations were or were not done, and why
 To debug an optimized code, the -gopt flag will insert debugging information without disabling optimizations
 It is possible to disable individual optimizations included in -fast if you believe one is causing problems
    For example: -fast -Mnolre enables -fast and then disables loop-carried redundancy elimination
 To get more information about any compiler flag, add -help with the flag in question
    pgf90 -help -fast will give more information about the -fast flag
 OpenMP is enabled with the -mp flag
38. Some compiler options may affect both performance and accuracy. Lower accuracy often means higher performance, but the compiler can also be told to enforce accuracy:
 -Kieee: all FP math strictly conforms to IEEE 754 (off by default)
 -Ktrap: turns on processor trapping of FP exceptions
 -Mdaz: treat all denormalized numbers as zero
 -Mflushz: set SSE to flush-to-zero (on with -fast)
 -Mfprelaxed: allow the compiler to use relaxed (reduced) precision to speed up some floating point optimizations
    Some other compilers turn this on by default; PGI chooses to favor accuracy over speed by default
39.  Cray has a long tradition of high performance compilers on Cray platforms (traditional vector, T3E, X1, X2)
    Vectorization
    Parallelization
    Code transformation
    More...
 Investigated leveraging an open source compiler called LLVM
 First release December 2008
40. [Compiler architecture: a Cray-developed Fortran front end and a C/C++ front end (supplied by Edison Design Group, with Cray-developed code for extensions and interface support) feed interprocedural analysis and the Cray compiler optimization and parallelization stages; code generation targets x86 (from open source LLVM, with additional Cray-developed optimizations and interface support) or Cray X2, producing the object file]
41.  Standard-conforming languages and programming models
    Fortran 2003
    UPC & Co-Array Fortran
       Fully optimized and integrated into the compiler
       No preprocessor involved
       Target the network appropriately: GASNet with Portals, DMAPP with Gemini & Aries
 Ability and motivation to provide high-quality support for custom Cray network hardware
 Cray technology focused on scientific applications
    Takes advantage of Cray's extensive knowledge of automatic vectorization
    Takes advantage of Cray's extensive knowledge of automatic shared memory parallelization
 Supplements, rather than replaces, the available compiler choices
42.  Make sure it is available
    module avail PrgEnv-cray
 To access the Cray compiler
    module load PrgEnv-cray
 To target the various chips
    module load xtpe-[barcelona,shanghai,istanbul]
 Once you have loaded the module, "cc" and "ftn" are the Cray compilers
 Recommend just using the default options
 Use -rm (Fortran) and -hlist=m (C) to find out what happened
 man crayftn
43.  Excellent vectorization
    Vectorizes more loops than other compilers
 OpenMP 3.0
    Tasking and nesting
 PGAS: functional UPC and CAF available today
 C++ support
 Automatic parallelization
    Modernized version of the Cray X1 streaming capability
    Interacts with OMP directives
 Cache optimizations
    Automatic blocking
    Automatic management of what stays in cache
    Prefetching, interchange, fusion, and much more...
44.  Loop-based optimizations
    Vectorization
    OpenMP
    Autothreading
    Interchange
    Pattern matching
    Cache blocking / non-temporal / prefetching
 Fortran 2003 standard; working on 2008
 PGAS (UPC and Co-Array Fortran)
    Some performance optimizations available in 7.1
 Optimization feedback: Loopmark
45.  The Cray compiler supports a full and growing set of directives and pragmas:
    !dir$ concurrent
    !dir$ ivdep
    !dir$ interchange
    !dir$ unroll
    !dir$ loop_info [max_trips] [cache_na]
    !dir$ blockable
    ... many more
 man directives
 man loop_info
46.  The compiler can generate a filename.lst file
 Contains an annotated listing of your source code, with letters indicating important optimizations
%%% Loopmark Legend %%%
Primary Loop Type:
    A - Pattern matched      C - Collapsed         D - Deleted
    E - Cloned               I - Inlined           M - Multithreaded
    P - Parallel/Tasked      V - Vectorized        W - Unwound
Modifiers:
    a - vector atomic memory operation      b - blocked
    f - fused                               i - interchanged
    m - streamed but not partitioned        p - conditional, partial and/or computed
    r - unrolled                            s - shortloop
    t - array syntax temp used              w - unwound
    47. 47. • ftn –rm … or cc –hlist=m … 29. b-------< do i3=2,n3-1 30. b b-----< do i2=2,n2-1 31. b b Vr--< do i1=1,n1 32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3) 33. b b Vr > + u(i1,i2,i3-1) + u(i1,i2,i3+1) 34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1) 35. b b Vr > + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1) 36. b b Vr--> enddo 37. b b Vr--< do i1=2,n1-1 38. b b Vr r(i1,i2,i3) = v(i1,i2,i3) 39. b b Vr > - a(0) * u(i1,i2,i3) 40. b b Vr > - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) ) 41. b b Vr > - a(3) * ( u2(i1-1) + u2(i1+1) ) 42. b b Vr--> enddo 43. b b-----> enddo 44. b-------> enddo
    48. 48. ftn-6289 ftn: VECTOR File = resid.f, Line = 29 A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38. ftn-6049 ftn: SCALAR File = resid.f, Line = 29 A loop starting at line 29 was blocked with block size 4. ftn-6289 ftn: VECTOR File = resid.f, Line = 30 A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38. ftn-6049 ftn: SCALAR File = resid.f, Line = 30 A loop starting at line 30 was blocked with block size 4. ftn-6005 ftn: SCALAR File = resid.f, Line = 31 A loop starting at line 31 was unrolled 4 times. ftn-6204 ftn: VECTOR File = resid.f, Line = 31 A loop starting at line 31 was vectorized. ftn-6005 ftn: SCALAR File = resid.f, Line = 37 A loop starting at line 37 was unrolled 4 times. ftn-6204 ftn: VECTOR File = resid.f, Line = 37 A loop starting at line 37 was vectorized.
49.  -hbyteswapio
    Link-time option
    Applies to all unformatted Fortran I/O
 assign command
    With the PrgEnv-cray module loaded, do this:
      setenv FILENV assign.txt
      assign -N swap_endian g:su
      assign -N swap_endian g:du
    Can use assign to be more precise
50.  OpenMP is ON by default
    Optimizations controlled by -Othread#
    To shut it off, use -Othread0, -xomp, or -hnoomp
 Autothreading is NOT on by default
    -hautothread to turn it on
    Modernized version of the Cray X1 streaming capability
    Interacts with OMP directives
If you do not want to use OpenMP and have OMP directives in the code, make sure to make a run with OpenMP shut off at compile time
51.  Traditional model: tuned general-purpose codes
    Only good for dense problems
    Not problem sensitive
    Not architecture sensitive
52.  Goal of scientific libraries: improve productivity at optimal performance
 Cray uses four concentrations to achieve this
    Standardization: use standard or "de facto" standard interfaces whenever available
    Hand tuning: use extensive knowledge of the target processor and network to optimize common code patterns
    Auto-tuning: automate code generation and a huge number of empirical performance evaluations to configure software to the target platforms
    Adaptive libraries: make runtime decisions to choose the best kernel/library/routine
53.  Three separate classes of standardization, each with a corresponding definition of productivity
 1. Standard interfaces (e.g., dense linear algebra)
    Bend over backwards to keep everything the same despite increases in machine complexity; innovate behind the scenes
    Productivity -> innovation to keep things simple
 2. Adoption of near-standard interfaces (e.g., sparse kernels)
    Assume near-standards and promote those; out-mode alternatives; innovate behind the scenes
    Productivity -> innovation in the simplest areas (requires the same innovation as #1 also)
 3. Simplification of non-standard interfaces (e.g., FFT)
    Productivity -> innovation to make things simpler than they are
54.  Algorithmic tuning: increased performance by exploiting algorithmic improvements
    Sub-blocking, new algorithms
    LAPACK, ScaLAPACK
 Kernel tuning: improve numerical kernel performance in assembly language
    BLAS, FFT
 Parallel tuning: exploit Cray's custom network interfaces and MPT
    ScaLAPACK, P-CRAFFT
55. Cray LibSci components:
    Dense:  BLAS, LAPACK, ScaLAPACK, IRT
    Sparse: CASK, PETSc, Trilinos
    FFT:    CRAFFT, FFTW, P-CRAFFT
IRT - Iterative Refinement Toolkit; CASK - Cray Adaptive Sparse Kernels; CRAFFT - Cray Adaptive FFT
56.  Serial and parallel versions of sparse iterative linear solvers
    Suites of iterative solvers: CG, GMRES, BiCG, QMR, etc.
    Suites of preconditioning methods: IC, ILU, diagonal block (ILU/IC), Additive Schwarz, Jacobi, SOR
 Supports block sparse matrix data format for better performance
 Interface to external packages (ScaLAPACK, SuperLU_DIST)
 Fortran and C support
 Newton-type nonlinear solvers
 Large user community: DoE labs, PSC, CSCS, CSC, ERDC, AWE and more
 http://www-unix.mcs.anl.gov/petsc/petsc-as
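For orientation, the CG entry above is the textbook conjugate-gradient iteration. A minimal pure-Python version on a small SPD system (PETSc users configure solvers rather than hand-writing them like this):

```python
# Textbook conjugate gradient for a symmetric positive-definite system.
def cg(A, b, tol=1e-12, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                     # residual b - A x with x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol ** 2:
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)
print(x)  # converges to [1/11, 7/11]
```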
57.  Cray provides state-of-the-art scientific computing packages to strengthen the capability of PETSc
    Hypre: scalable parallel preconditioners
       AMG (very scalable and efficient for a specific class of problems)
       2 different ILUs (general purpose)
       Sparse approximate inverse (general purpose)
    ParMetis: parallel graph partitioning package
    MUMPS: parallel multifrontal sparse direct solver
    SuperLU: sequential version of SuperLU_DIST
 To use Cray-PETSc, load the appropriate module:
    module load petsc
    module load petsc-complex
    (no need to load a compiler-specific module)
 Treat the Cray distribution as your local PETSc installation
58.  The Trilinos Project (http://trilinos.sandia.gov/): "an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems"
 A unique design feature of Trilinos is its focus on packages
 Very large user base and growing rapidly; important to DOE
 Cray's optimized Trilinos released on January 21
    Includes 50+ Trilinos packages
    Optimized via CASK
    Any code that uses Epetra objects can access the optimizations
 Usage: module load trilinos
59.  CASK is a product developed at Cray using the Cray Auto-tuning Framework (Cray ATF)
 The CASK concept:
    Analyze the matrix at minimal cost
    Categorize the matrix against internal classes
    Based on offline experience, find the best CASK code for the particular matrix
    Previously assigned "best" compiler flags to CASK code
    Assign the best CASK kernel and perform Ax
 CASK silently sits beneath PETSc on Cray systems
    Trilinos support coming soon
 Released with PETSc 3.0 in February 2009
 Generic and blocked CSR formats
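The adaptive-selection idea above can be sketched as: inspect the matrix cheaply, bucket it against known classes, and dispatch to whichever kernel offline tuning found best for that bucket. All names and thresholds below are illustrative stand-ins, not CASK internals:

```python
# Toy adaptive kernel selection: classify a matrix by a cheap statistic,
# then look up the kernel that offline auto-tuning preferred for that class.
def classify(nnz_per_row):
    # Hypothetical cutoff separating scalar-CSR-friendly matrices
    # from block-friendly ones.
    return "very_sparse" if nnz_per_row < 8 else "block_friendly"

BEST_KERNEL = {            # filled in by offline tuning runs (illustrative)
    "very_sparse": "csr_scalar",
    "block_friendly": "bcsr_4x4",
}

def pick_kernel(nnz, nrows):
    return BEST_KERNEL[classify(nnz / nrows)]

print(pick_kernel(nnz=50_000, nrows=10_000))      # csr_scalar
print(pick_kernel(nnz=1_200_000, nrows=10_000))   # bcsr_4x4
```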
60. [Software stack: large-scale application (highly portable, user controlled) -> PETSc / Trilinos / Hypre (all systems; highly portable, user controlled) -> CASK (Cray only; XT4- and XT5-specific and tuned; invisible to the user)]
61. [Chart: speedup of parallel SpMV on 8 cores for 60 different matrices; y-axis 1.0-1.4 vs. matrix ID]
62. [Charts: performance of CASK vs. PETSc (Gflops) on up to 1024 cores, N = 65,536 to 67,108,864 — SpMV (MatMult-CASK vs. MatMult-PETSc) and Block Jacobi preconditioning (BlockJacobi-IC(0)-CASK vs. BlockJacobi-IC(0)-PETSc)]
63. [Chart: MFlops (0-2000) by matrix name]
64. [Chart: geometric mean over 80 sparse matrix instances from the U. of Florida collection — MFlops (0-5000) vs. number of vectors (1-8), CASK vs. original Trilinos]
65.  In FFTs, the problems are:
    Which library to choose
    How to use complicated interfaces (e.g., FFTW)
 Standard FFT practice:
    Do a plan stage
       Deduce machine and system information and run micro-kernels
       Select the best FFT strategy
    Do an execute
 Our system knowledge can remove some of this cost!
66.  CRAFFT is designed with simple-to-use interfaces
    Planning and execution stages can be combined into one function call
    Underneath the interfaces, CRAFFT calls the appropriate FFT kernel
 CRAFFT provides both offline and online tuning
    Offline tuning:
       Which FFT kernel to use
       Pre-computed plans for common-sized FFTs
       No expensive plan stages
    Online tuning is performed as necessary at runtime as well
 At runtime, CRAFFT will adaptively select the best FFT kernel to use based on both offline and online testing (e.g., FFTW, custom FFT)
67.
                128x128    256x256    512x512
FFTW plan        74         312        2758
FFTW exec        0.105      0.97       9.7
CRAFFT plan      0.00037    0.0009     0.00005
CRAFFT exec      0.139      1.2        11.4
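A back-of-envelope reading of the 512x512 column (units as printed on the slide): CRAFFT's near-zero plan cost wins outright until FFTW's slightly faster execute has paid back its expensive plan stage over many repeated transforms.

```python
# Total cost of k transforms = plan + k * exec, per library (512x512 column).
fftw_plan, fftw_exec = 2758.0, 9.7
crafft_plan, crafft_exec = 0.00005, 11.4

k = 0
while crafft_plan + k * crafft_exec <= fftw_plan + k * fftw_exec:
    k += 1
print(f"FFTW only pulls ahead after ~{k} transforms of the same size")
```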
68. To use CRAFFT:
1. Load module fftw/3.2.0 or higher
2. Add a Fortran statement "use crafft"
3. call crafft_init()
4. Call the crafft transform using none, some, or all optional arguments (highlighted in red on the original slide)
In-place, implicit memory management:
    call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign)
In-place, explicit memory management:
    call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,isign,work)
Out-of-place, explicit memory management:
    call crafft_z2z3d(n1,n2,n3,input,ld_in,ld_in2,output,ld_out,ld_out2,isign,work)
Note: the user can also control the planning strategy of CRAFFT using the CRAFFT_PLANNING environment variable and the do_exe optional argument; please see the intro_crafft man page.
69.  As of December 2009, CRAFFT includes distributed parallel transforms
    Uses the CRAFFT interface prefixed by "p", with optional arguments
    Can provide a performance improvement over FFTW 2.1.5
 Currently implemented:
    Complex-complex
    Real-complex and complex-real
    3-D and 2-D
    In-place and out-of-place
 Upcoming: C language support for serial and parallel
70. 1. Add "use crafft" to the Fortran code
2. Initialize CRAFFT using crafft_init
3. Assume MPI is initialized and data distributed (see manpage)
4. Call crafft, e.g. (optional arguments highlighted in red on the original slide):
2-D complex-complex, in-place, internal memory management:
    call crafft_pz2z2d(n1,n2,input,isign,flag,comm)
2-D complex-complex, in-place with no internal memory:
    call crafft_pz2z2d(n1,n2,input,isign,flag,comm,work)
2-D complex-complex, out-of-place, internal memory manager:
    call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm)
2-D complex-complex, out-of-place, no internal memory:
    call crafft_pz2z2d(n1,n2,input,output,isign,flag,comm,work)
Each routine above has a manpage. Also see the 3-D equivalent: man crafft_pz2z3d
71. [Chart: 2-D FFT (N x N, transposed) on 128 cores — Mflops vs. size N (128 to 65536), pcrafft vs. fftw2.5.1]
72.  Solves linear systems in single precision, obtaining solutions accurate to double precision
    For well-conditioned problems
 Serial and parallel versions of LU, Cholesky, and QR
 2 usage methods
    IRT benchmark routines
       Use IRT 'under the covers' without changing your code
       Simply set an environment variable
       Useful when you cannot alter source code
    Advanced IRT API, if greater control of the iterative refinement process is required; allows:
       Condition number estimation
       Error bounds return
       Minimization of either forward or backward error
       'Fall back' to full precision if the condition number is too high
       Max number of iterations can be altered by users
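The numerical idea behind IRT — factor and solve cheaply in single precision, then recover double-precision accuracy by iterating on the double-precision residual — can be shown on a toy 2x2 system. This sketch simulates single precision by rounding every intermediate through IEEE float32; LibSci's actual IRT applies the same idea to LU/Cholesky/QR at scale:

```python
import struct

def f32(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

def solve_single(A, b):
    """2x2 Cramer's rule with every intermediate rounded to single
    precision (a stand-in for a single-precision factor + solve)."""
    det = f32(f32(A[0][0] * A[1][1]) - f32(A[0][1] * A[1][0]))
    x0 = f32(f32(f32(b[0] * A[1][1]) - f32(A[0][1] * b[1])) / det)
    x1 = f32(f32(f32(A[0][0] * b[1]) - f32(b[0] * A[1][0])) / det)
    return [x0, x1]

A = [[4.0, 1.0], [1.0, 3.0]]         # well conditioned, as IRT requires
b = [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]

x = solve_single(A, b)               # cheap single-precision solve
for _ in range(5):                   # iterative refinement
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    d = solve_single(A, r)           # correction, again in single precision
    x = [x[i] + d[i] for i in range(2)]   # accumulate in double

err = max(abs(x[i] - exact[i]) for i in range(2))
print(f"error after refinement: {err:.2e}")
```

The residual is the only step that needs double precision; each iteration shrinks the error by roughly the condition number times single-precision roundoff.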
73. "High Power Electromagnetic Wave Heating in the ITER Burning Plasma"
 RF heating in a tokamak
 Maxwell-Boltzmann equations
 FFT
 Dense linear system
 Calculate the quasi-linear operator
Courtesy Richard Barrett
74. [Chart: performance relative to theoretical peak]
75. Decide if you want to use the advanced API or the benchmark API
Benchmark API:
    setenv IRT_USE_SOLVERS 1
Advanced API:
1. Locate the factor and solve in your code (LAPACK or ScaLAPACK)
2. Replace factor and solve with a call to the IRT routine
    e.g., dgesv -> irt_lu_real_serial
    e.g., pzgesv -> irt_lu_complex_parallel
    e.g., pzposv -> irt_po_complex_parallel
3. Set advanced arguments
    Forward error convergence for the most accurate solution
    Condition number estimate
    "Fall back" to full precision if the condition number is too high
76.  LibSci 10.4.2 (February 18th, 2010)
    OpenMP-aware LibSci
    Allows calling of BLAS inside or outside a parallel region
    Single library supported
       No separate multi-thread and single-thread libraries (-lsci and -lsci_mp)
       Performance not compromised (there were some usage restrictions with this version)
 LibSci 10.4.3 (April 2010)
    Parallel CRAFFT improvements
    Fixes the usage restrictions of 10.4.2
    OMP_NUM_THREADS required (not GOTO_NUM_THREADS)
 Upcoming:
    PETSc 3.1.0 (May 20)
    Trilinos 10.2 (May 20)
    77. 77. CrayPAT
78.  Assist the user with application performance analysis and optimization
    Help the user identify important and meaningful information from potentially massive data sets
    Help the user identify problem areas instead of just reporting data
    Bring optimization knowledge to a wider set of users
 Focus on ease of use and intuitive user interfaces
    Automatic program instrumentation
    Automatic analysis
 Target scalability issues in all areas of tool development
    Data management: storage, movement, presentation
September 21-24, 2009 © Cray Inc. 87
79.  Supports traditional post-mortem performance analysis
    Automatic identification of performance problems
    Indication of causes of problems
    Suggestions of modifications for performance improvement
 CrayPat
    pat_build: automatic instrumentation (no source code changes needed)
    Run-time library for measurements (transparent to the user)
    pat_report for performance analysis reports
    pat_help: online help utility
 Cray Apprentice2
    Graphical performance analysis and visualization tool
80.  CrayPat
    Instrumentation of optimized code
    No source code modification required
    Data collection transparent to the user
    Text-based performance reports
    Derived metrics
    Performance analysis
 Cray Apprentice2
    Performance data visualization tool
    Call tree view
    Source code mappings
81.  When performance measurement is triggered
    External agent (asynchronous): sampling
       Timer interrupt
       Hardware counter overflow
    Internal agent (synchronous): code instrumentation
       Event based
       Automatic or manual instrumentation
 How performance data is recorded
    Profile ::= summation of events over time
       Run-time summarization (functions, call sites, loops, ...)
    Trace file ::= sequence of events over time
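The profile/trace distinction above is simple to state in code: a trace keeps the full event sequence, while a profile sums time per function as events arrive. A minimal sketch with made-up events:

```python
# Each trace event: (function, start_time, end_time). Invented example data.
trace = [("main", 0.0, 1.0),
         ("solve", 1.0, 4.0),
         ("io", 4.0, 4.5),
         ("solve", 4.5, 7.5)]

# Run-time summarization: collapse the event stream into per-function totals.
profile = {}
for func, start, end in trace:
    profile[func] = profile.get(func, 0.0) + (end - start)

print(profile)  # {'main': 1.0, 'solve': 6.0, 'io': 0.5}
```

The profile is tiny and bounded by the number of functions; the trace grows with run time, which is exactly the trade-off the later slides discuss.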
82.  Millions of lines of code: automatic profiling analysis
    Identifies top time-consuming routines
    Automatically creates an instrumentation template customized to your application
 Lots of processes/threads: load imbalance analysis
    Identifies computational code regions and synchronization calls that could benefit most from load balance optimization
    Estimates savings if the corresponding section of code were balanced
 Long-running applications: detection of outliers
83.  Important performance statistics:
    Top time-consuming routines
    Load balance across computing resources
    Communication overhead
    Cache utilization
    FLOPS
    Vectorization (SSE instructions)
    Ratio of computation versus communication
84.  No source code or makefile modification required
    Automatic instrumentation at group (function) level
       Groups: mpi, io, heap, math SW, ...
 Performs link-time instrumentation
    Requires object files
    Instruments optimized code
    Generates a stand-alone instrumented program
    Preserves the original binary
 Supports sample-based and event-based instrumentation
85.  Analyzes the performance data and directs the user to meaningful information
 Simplifies the procedure to instrument and collect performance data for novice users
 Based on a two-phase mechanism
    1. Automatically detects the most time-consuming functions in the application and feeds this information back to the tool for further (and focused) data collection
    2. Provides performance information on the most significant parts of the application
86.  Performs data conversion
    Combines information from the binary with raw performance data
 Performs analysis on data
 Generates a text report of performance results
 Formats data for input into Cray Apprentice2
87.  CrayPat / Cray Apprentice2 5.0 released September 10, 2009
    New internal data format
    FAQ
    Grid placement support
    Better caller information (ETC group in pat_report)
    Support for larger numbers of processors
    Client/server version of Cray Apprentice2
    Panel help in Cray Apprentice2
88.  Access performance tools software
    % module load xt-craypat apprentice2
 Build application, keeping .o files (CCE: -h keepfiles)
    % make clean
    % make
 Instrument application for automatic profiling analysis
    % pat_build -O apa a.out
    You should get an instrumented program a.out+pat
 Run application to get the top time-consuming routines
    % aprun ... a.out+pat   (or qsub <pat script>)
    You should get a performance file ("<sdatafile>.xf") or multiple files in a directory <sdatadir>
89.  Generate the report and .apa instrumentation file
    % pat_report -o my_sampling_report [<sdatafile>.xf | <sdatadir>]
 Inspect the .apa file and sampling report
 Verify whether additional instrumentation is needed
90. Example .apa file:
# You can edit this file, if desired, and use it
# to reinstrument the program for tracing like this:
#
#     pat_build -O mhd3d.Oapa.x+4125-401sdt.apa
#
# These suggested trace options are based on data from:
#
#   /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.ap2,
#   /home/crayadm/ldr/mhd3d/run/mhd3d.Oapa.x+4125-401sdt.xf
# ----------------------------------------------------------------------
# HWPC group to collect by default.
-Drtenv=PAT_RT_HWPC=1   # Summary with instructions metrics.
# ----------------------------------------------------------------------
# Libraries to trace.
-g mpi
# ----------------------------------------------------------------------
# User-defined functions to trace, sorted by % of samples.
# Limited to top 200. A function is commented out if it has < 1%
# of samples, or if a cumulative threshold of 90% has been reached,
# or if it has size < 200 bytes.
# Note: -u should NOT be specified as an additional option.
# 43.37% 99659 bytes
-T mlwxyz_
# 16.09% 17615 bytes
-T half_
# 6.82% 6846 bytes
-T artv_
# 1.29% 5352 bytes
-T currenh_
# 1.03% 25294 bytes
-T bndbo_
# Functions below this point account for less than 10% of samples.
# 1.03% 31240 bytes
# -T bndto_
...
# ----------------------------------------------------------------------
-o mhd3d.x+apa                    # New instrumented program.
/work/crayadm/ldr/mhd3d/mhd3d.x   # Original program.
91. Predefined trace groups for pat_build -g:
 biolib: Cray Bioinformatics library routines
 blacs: Basic Linear Algebra Communication Subprograms
 blas: Basic Linear Algebra Subprograms
 caf: Co-Array Fortran (Cray X2 systems only)
 fftw: Fast Fourier Transform library (64-bit only)
 hdf5: manages extremely large and complex data collections
 heap: dynamic heap
 io: includes the stdio and sysio groups
 lapack: Linear Algebra Package
 lustre: Lustre file system
 math: ANSI math
 mpi: MPI
 netcdf: network common data form (manages array-oriented scientific data)
 omp: OpenMP API (not supported on Catamount)
 omp-rtl: OpenMP runtime library (not supported on Catamount)
 portals: lightweight message passing API
 pthreads: POSIX threads (not supported on Catamount)
 scalapack: Scalable LAPACK
 shmem: SHMEM
 stdio: all library functions that accept or return the FILE* construct
 sysio: I/O system calls
 system: system calls
 upc: Unified Parallel C (Cray X2 systems only)
92. PAT_RT_HWPC groups:
 0  Summary with instruction metrics
 1  Summary with TLB metrics
 2  L1 and L2 metrics
 3  Bandwidth information
 4  Hypertransport information
 5  Floating point mix
 6  Cycles stalled, resources idle
 7  Cycles stalled, resources full
 8  Instructions and branches
 9  Instruction cache
 10 Cache hierarchy
 11 Floating point operations mix (2)
 12 Floating point operations mix (vectorization)
 13 Floating point operations mix (SP)
 14 Floating point operations mix (DP)
 15 L3 (socket-level)
 16 L3 (core-level reads)
 17 L3 (core-level misses)
 18 L3 (core-level fills caused by L2 evictions)
 19 Prefetches
93.  Regions, useful to break up long routines
    int PAT_region_begin(int id, const char *label)
    int PAT_region_end(int id)
 Disable/enable profiling, useful for excluding initialization
    int PAT_record(int state)
 Flush buffer, useful when the program isn't exiting cleanly
    int PAT_flush_buffer(void)
94.  Instrument application for further analysis (a.out+apa)
    % pat_build -O <apafile>.apa
 Run application
    % aprun ... a.out+apa   (or qsub <apa script>)
 Generate text report and visualization file (.ap2)
    % pat_report -o my_text_report.txt [<datafile>.xf | <datadir>]
 View report in text and/or with Cray Apprentice2
    % app2 <datafile>.ap2
95.  MUST run on Lustre (/work/..., /lus/..., /scratch/..., etc.)
 Number of files used to store raw data:
    1 file created for a program with 1-256 processes
    √n files created for a program with 257-n processes
 Ability to customize with PAT_RT_EXPFILE_MAX
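The file-count rule above can be sketched directly; the exact rounding of √n is a guess here (the cap is adjustable via PAT_RT_EXPFILE_MAX):

```python
import math

def expfile_count(nprocs):
    """Raw-data files CrayPat creates: 1 up to 256 processes,
    roughly sqrt(n) beyond that (rounded up in this sketch)."""
    return 1 if nprocs <= 256 else math.isqrt(nprocs - 1) + 1  # ceil(sqrt(n))

print(expfile_count(256), expfile_count(1024), expfile_count(10000))
# 1 32 100
```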
96.  Full trace files show transient events but are too large
 Current run-time summarization misses transient events
 Plan to add the ability to record:
    Top N peak values (N small)
    Approximate std dev over time
    For time, memory traffic, etc.
    During tracing and sampling
97. Cray Apprentice2 is targeted to help identify and correct:
   Load imbalance
   Excessive communication
   Network contention
   Excessive serialization
   I/O problems
 Views include:
   Call graph profile
   Communication statistics
   Time-line view (communication and I/O)
   Activity view
   Pair-wise communication statistics
   Text reports
   Source code mapping
98. Switch Overview display
99-101. (Apprentice2 screenshots)
102. Min, Avg, and Max values; -1/+1 std dev marks
103. Call tree view:
   Width → inclusive time; height → exclusive time
   Filtered nodes or sub-trees are marked
   Load balance overview: height → max time, middle bar → average time, lower bar → min time; yellow represents imbalance time
   DUH button: provides hints for performance tuning
   Function List and Zoom controls
104.  Right mouse click on a node: Node menu, e.g., hide/unhide children
 Right mouse click: View menu, e.g., Filter
 Sort options: % Time, Time, Imbalance %, Imbalance time
 Function List can be toggled off
105-108. (Apprentice2 screenshots)
109. Min, Avg, and Max values; -1/+1 std dev marks
110. (Apprentice2 screenshot)
111.  Cray Apprentice2 panel help
 pat_help: interactive help on the Cray performance toolset
 FAQ available through pat_help
112.  intro_craypat(1): introduces the CrayPat performance tool
 pat_build: instrument a program for performance analysis
 pat_help: interactive online help utility
 pat_report: generate performance reports, in text and for use with the GUI
 hwpc(3): describes predefined hardware performance counter groups
 papi_counters(5): lists PAPI event counters
 Use the papi_avail or papi_native_avail utilities to get the list of events on a specific architecture
113. pat_report: help for the -O option. Available option values (a prefix can be specified):
   ct            -O calltree
   defaults      Tables that would appear by default
   heap          -O heap_program,heap_hiwater,heap_leaks
   io            -O read_stats,write_stats
   lb            -O load_balance
   load_balance  -O lb_program,lb_group,lb_function
   mpi           -O mpi_callers
   ---
   callers            Profile by Function and Callers
   callers+hwpc       Profile by Function and Callers
   callers+src        Profile by Function and Callers, with Line Numbers
   callers+src+hwpc   Profile by Function and Callers, with Line Numbers
   calltree           Function Calltree View
   calltree+hwpc      Function Calltree View
   calltree+src       Calltree View with Callsite Line Numbers
   calltree+src+hwpc  Calltree View with Callsite Line Numbers
   ...
114.  Interactive by default, or use a trailing '.' to just print a topic
 New FAQ in CrayPat 5.0.0
 Has counter and counter-group information:
   % pat_help counters amd_fam10h groups .
115. The top-level CrayPat/X help topics are listed below. A good place to start is: overview. If a topic has subtopics, they are displayed under the heading "Additional topics". To view a subtopic, enter as many initial letters as required to distinguish it from the other items in the list. To see a table of contents including subtopics of those subtopics, etc., enter: toc. To produce the full text corresponding to the table of contents, specify "all", preferably in a non-interactive invocation:
   pat_help all . > all_pat_help
   pat_help report all . > all_report_help
 Additional topics: API, balance, build, counters, demos, environment, execute, experiment, first_example, overview, report, run
   pat_help (.=quit ,=back ^=up /=top ~=search) =>
116. CPU Optimizations
   Optimizing Communication
   I/O Best Practices
117. Poor loop order results in poor striding: the inner-most loop strides on a slow dimension of each array. The best the compiler can do is unroll; there is little to no cache reuse.
  55. 1                  ii = 0
  56. 1 2-----------<    do b = abmin, abmax
  57. 1 2 3---------<      do j = ijmin, ijmax
  58. 1 2 3                  ii = ii+1
  59. 1 2 3                  jj = 0
  60. 1 2 3 4-------<        do a = abmin, abmax
  61. 1 2 3 4 r8----<          do i = ijmin, ijmax
  62. 1 2 3 4 r8                 jj = jj+1
  63. 1 2 3 4 r8                 f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
  64. 1 2 3 4 r8                 f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
  65. 1 2 3 4 r8                 f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
  66. 1 2 3 4 r8                 f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
  67. 1 2 3 4 r8---->          end do
  68. 1 2 3 4------->        end do
  69. 1 2 3--------->      end do
  70. 1 2----------->    end do
118. Poor loop order results in poor cache reuse: for every L1 cache hit there are two misses. Overall, only 2/3 of all references were satisfied in level 1 or 2 cache.
  USER / #1.Original Loops
  -----------------------------------------------------------------
  Time%                                      55.0%
  Time                                   13.938244 secs
  Imb.Time                                0.075369 secs
  Imb.Time%                                   0.6%
  Calls                     0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED 11.858M/sec   165279602 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:ALL
                           11.931M/sec   166291054 fills
  PAPI_L1_DCM              23.499M/sec   327533338 misses
  PAPI_L1_DCA              34.635M/sec   482751044 refs
  User time (approx)       13.938 secs 36239439807 cycles  100.0%Time
  Average Time per Call                  13.938244 sec
  CrayPat Overhead : Time      0.0%
  D1 cache hit,miss ratios     32.2% hits   67.8% misses
  D2 cache hit,miss ratio      49.8% hits   50.2% misses
  D1+D2 cache hit,miss ratio   66.0% hits   34.0% misses
119. Reordered loop nest: now the inner-most loop is stride-1 on both arrays. Memory accesses happen along the cache line, allowing reuse, and the compiler is able to vectorize and make better use of SSE instructions.
  75. 1 2-----------<    do i = ijmin, ijmax
  76. 1 2                  jj = 0
  77. 1 2 3---------<      do a = abmin, abmax
  78. 1 2 3 4-------<        do j = ijmin, ijmax
  79. 1 2 3 4                  jj = jj+1
  80. 1 2 3 4                  ii = 0
  81. 1 2 3 4 Vcr2--<          do b = abmin, abmax
  82. 1 2 3 4 Vcr2               ii = ii+1
  83. 1 2 3 4 Vcr2               f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
  84. 1 2 3 4 Vcr2               f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
  85. 1 2 3 4 Vcr2               f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
  86. 1 2 3 4 Vcr2               f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
  87. 1 2 3 4 Vcr2-->          end do
  88. 1 2 3 4------->        end do
  89. 1 2 3--------->      end do
  90. 1 2----------->    end do
120. Improved striding greatly improved cache reuse, and runtime was cut nearly in half. Still, some 20% of all references are cache misses.
  USER / #2.Reordered Loops
  -----------------------------------------------------------------
  Time%                                      31.4%
  Time                                    7.955379 secs
  Imb.Time                                0.260492 secs
  Imb.Time%                                   3.8%
  Calls                     0.1 /sec           1.0 calls
  DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED  0.419M/sec     3331289 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:ALL
                           15.285M/sec   121598284 fills
  PAPI_L1_DCM              13.330M/sec   106046801 misses
  PAPI_L1_DCA              66.226M/sec   526855581 refs
  User time (approx)        7.955 secs 20684020425 cycles  100.0%Time
  Average Time per Call                   7.955379 sec
  CrayPat Overhead : Time      0.0%
  D1 cache hit,miss ratios     79.9% hits   20.1% misses
  D2 cache hit,miss ratio       2.7% hits   97.3% misses
  D1+D2 cache hit,miss ratio   80.4% hits   19.6% misses
121. After fission, the first loop is partially vectorized and unrolled by 4; the second loop is vectorized and unrolled by 4.
 First loop:
   95. 1                ii = 0
   96. 1 2-----------<  do j = ijmin, ijmax
   97. 1 2 i---------<    do b = abmin, abmax
   98. 1 2 i                ii = ii+1
   99. 1 2 i                jj = 0
  100. 1 2 i i-------<      do i = ijmin, ijmax
  101. 1 2 i i Vpr4--<        do a = abmin, abmax
  102. 1 2 i i Vpr4             jj = jj+1
  103. 1 2 i i Vpr4             f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
  104. 1 2 i i Vpr4             f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
  105. 1 2 i i Vpr4-->        end do
  106. 1 2 i i------->      end do
  107. 1 2 i--------->    end do
  108. 1 2----------->  end do
 Second loop:
  109. 1                jj = 0
  110. 1 2-----------<  do i = ijmin, ijmax
  111. 1 2 3---------<    do a = abmin, abmax
  112. 1 2 3                jj = jj+1
  113. 1 2 3                ii = 0
  114. 1 2 3 4-------<      do j = ijmin, ijmax
  115. 1 2 3 4 Vr4---<        do b = abmin, abmax
  116. 1 2 3 4 Vr4              ii = ii+1
  117. 1 2 3 4 Vr4              f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
  118. 1 2 3 4 Vr4              f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
  119. 1 2 3 4 Vr4--->        end do
  120. 1 2 3 4------->      end do
  121. 1 2 3--------->    end do
  122. 1 2----------->  end do
122. Fissioning further improved cache reuse and resulted in better vectorization, and runtime was further reduced. The cache hit/miss ratio improved slightly, and the loopmark file shows better vectorization of the fissioned loops.
  USER / #3.Fissioned Loops
  -----------------------------------------------------------------
  Time%                                       9.8%
  Time                                    2.481636 secs
  Imb.Time                                0.045475 secs
  Imb.Time%                                   2.1%
  Calls                     0.4 /sec           1.0 calls
  DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:
    L2_EXCLUSIVE:L2_SHARED  1.175M/sec     2916610 fills
  DATA_CACHE_REFILLS_FROM_SYSTEM:ALL
                           34.109M/sec    84646518 fills
  PAPI_L1_DCM              26.424M/sec    65575972 misses
  PAPI_L1_DCA             156.705M/sec   388885686 refs
  User time (approx)        2.482 secs  6452279320 cycles  100.0%Time
  Average Time per Call                   2.481636 sec
  CrayPat Overhead : Time      0.0%
  D1 cache hit,miss ratios     83.1% hits   16.9% misses
  D2 cache hit,miss ratio       3.3% hits   96.7% misses
  D1+D2 cache hit,miss ratio   83.7% hits   16.3% misses
123.  Cache blocking is a combination of strip mining and loop interchange, designed to increase data reuse
 Takes advantage of temporal reuse: re-reference array elements already referenced
 Good blocking will also take advantage of spatial reuse: work with the cache lines!
 Many ways to block any given loop nest: which loops get blocked? what block size(s) to use?
 Analysis can reveal which ways are beneficial, but trial-and-error is probably faster
124.  2D Laplacian:
   do j = 1, 8
     do i = 1, 16
       a = u(i-1,j) + u(i+1,j) &
         - 4*u(i,j)            &
         + u(i,j-1) + u(i,j+1)
     end do
   end do
 Cache structure for this example: each line holds 4 array elements, and the cache can hold 12 lines of u data
 No cache reuse between outer-loop iterations
125.  Unblocked loop: 120 cache misses
 Block the inner loop:
   do IBLOCK = 1, 16, 4
     do j = 1, 8
       do i = IBLOCK, IBLOCK + 3
         a(i,j) = u(i-1,j) + u(i+1,j) &
                - 4*u(i,j)            &
                + u(i,j-1) + u(i,j+1)
       end do
     end do
   end do
 Now we have reuse of the "j+1" data
126.  One-dimensional blocking reduced misses from 120 to 80
 Iterate over 4 × 4 blocks:
   do JBLOCK = 1, 8, 4
     do IBLOCK = 1, 16, 4
       do j = JBLOCK, JBLOCK + 3
         do i = IBLOCK, IBLOCK + 3
           a(i,j) = u(i-1,j) + u(i+1,j) &
                  - 4*u(i,j)            &
                  + u(i,j-1) + u(i,j+1)
         end do
       end do
     end do
   end do
 Better use of spatial locality (cache lines)
127.  Matrix-matrix multiply (GEMM) is the canonical cache-blocking example
 Operations can be arranged to create multiple levels of blocking: block for register, block for cache (L1, L2, L3), block for TLB
 No further discussion here. Interested readers can see:
   Any book on code optimization; Sun's "Techniques for Optimizing Applications: High Performance Computing" contains a decent introductory discussion in Chapter 8; insert your favorite book here
   Gunnels, Henry, and van de Geijn. June 2001. High-performance matrix multiplication algorithms for architectures with hierarchical memories. FLAME Working Note #4, TR-2001-22, The University of Texas at Austin, Department of Computer Sciences. Develops algorithms and cost models for GEMM in hierarchical memories
   Goto and van de Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34, 3 (May), 1-25. Description of the GotoBLAS DGEMM
128. "I tried cache-blocking my code, but it didn't help"
  You're doing it wrong
  Your block size is too small (too much loop overhead)
  Your block size is too big (data is falling out of cache)
  You're targeting the wrong cache level (?)
  You haven't selected the correct subset of loops to block
  The compiler is already blocking that loop
  Prefetching is acting to minimize cache misses
  Computational intensity within the loop nest is very large, making blocking less important
129.  Multigrid PDE solver, Class D, 64 MPI ranks
 Global grid is 1024 × 1024 × 1024; local grid is 258 × 258 × 258
 Two similar loop nests account for >50% of run time
 27-point 3D stencil:
   do i3 = 2, 257
     do i2 = 2, 257
       do i1 = 2, 257
         ! update u(i1,i2,i3)
         ! using 27-point stencil
       end do
     end do
   end do
 There is good data reuse along the leading dimension, even without blocking
130.  Block the inner two loops, creating blocks that extend along the i3 direction:
   do I2BLOCK = 2, 257, BS2
     do I1BLOCK = 2, 257, BS1
       do i3 = 2, 257
         do i2 = I2BLOCK, &
                 min(I2BLOCK+BS2-1, 257)
           do i1 = I1BLOCK, &
                   min(I1BLOCK+BS1-1, 257)
             ! update u(i1,i2,i3)
             ! using 27-point stencil
           end do
         end do
       end do
     end do
   end do
 Block size vs. Mop/s/process:
   unblocked  531.50
   16 × 16    279.89
   22 × 22    321.26
   28 × 28    358.96
   34 × 34    385.33
   40 × 40    408.53
   46 × 46    443.94
   52 × 52    468.58
   58 × 58    470.32
   64 × 64    512.03
   70 × 70    506.92
131.  Block the outer two loops, preserving spatial locality along the i1 direction:
   do I3BLOCK = 2, 257, BS3
     do I2BLOCK = 2, 257, BS2
       do i3 = I3BLOCK, &
               min(I3BLOCK+BS3-1, 257)
         do i2 = I2BLOCK, &
                 min(I2BLOCK+BS2-1, 257)
           do i1 = 2, 257
             ! update u(i1,i2,i3)
             ! using 27-point stencil
           end do
         end do
       end do
     end do
   end do
 Block size vs. Mop/s/process:
   unblocked  531.50
   16 × 16    674.76
   22 × 22    680.16
   28 × 28    688.64
   34 × 34    683.84
   40 × 40    698.47
   46 × 46    689.14
   52 × 52    706.62
   58 × 58    692.57
   64 × 64    703.40
   70 × 70    693.87
132. C pointers don't carry the same rules as Fortran arrays: the compiler has no way to know whether *a, *b, and *c overlap or are referenced elsewhere, so it must assume the worst, a false data dependency.
  ( 53) void mat_mul_daxpy(double *a, double *b, double *c, int rowa, int cola, int colb)
  ( 54) {
  ( 55)    int i, j, k;              /* loop counters */
  ( 56)    int rowc, colc, rowb;     /* sizes not passed as arguments */
  ( 57)    double con;               /* constant value */
  ( 58)
  ( 59)    rowb = cola;
  ( 60)    rowc = rowa;
  ( 61)    colc = colb;
  ( 62)
  ( 63)    for(i=0;i<rowc;i++) {
  ( 64)       for(k=0;k<cola;k++) {
  ( 65)          con = *(a + i*cola +k);
  ( 66)          for(j=0;j<colc;j++) {
  ( 67)             *(c + i*colc + j) += con * *(b + k*colb + j);
  ( 68)          }
  ( 69)       }
  ( 70)    }
  ( 71) }
  mat_mul_daxpy:
    66, Loop not vectorized: data dependency
        Loop not vectorized: data dependency
        Loop unrolled 4 times
133. C99 introduces the restrict keyword, which lets the programmer promise not to reference the memory via another pointer. If you declare a restricted pointer and break that promise, behavior is undefined by the standard.
  ( 53) void mat_mul_daxpy(double* restrict a, double* restrict b, double* restrict c, int rowa, int cola, int colb)
  ( 54) {
  ( 55)    int i, j, k;              /* loop counters */
  ( 56)    int rowc, colc, rowb;     /* sizes not passed as arguments */
  ( 57)    double con;               /* constant value */
  ( 58)
  ( 59)    rowb = cola;
  ( 60)    rowc = rowa;
  ( 61)    colc = colb;
  ( 62)
  ( 63)    for(i=0;i<rowc;i++) {
  ( 64)       for(k=0;k<cola;k++) {
  ( 65)          con = *(a + i*cola +k);
  ( 66)          for(j=0;j<colc;j++) {
  ( 67)             *(c + i*colc + j) += con * *(b + k*colb + j);
  ( 68)          }
  ( 69)       }
  ( 70)    }
  ( 71) }
134. With restrict, the compiler now vectorizes the inner loop:
  66, Generated alternate loop with no peeling - executed if loop count <= 24
      Generated vector sse code for inner loop
      Generated 2 prefetch instructions for this loop
      Generated vector sse code for inner loop
      Generated 2 prefetch instructions for this loop
      Generated alternate loop with no peeling and more aligned moves - executed if loop count <= 24 and alignment test is passed
      Generated vector sse code for inner loop
      Generated 2 prefetch instructions for this loop
      Generated alternate loop with more aligned moves - executed if loop count >= 25 and alignment test is passed
      Generated vector sse code for inner loop
      Generated 2 prefetch instructions for this loop
 This can also be achieved with the PGI safe pragma and the -Msafeptr compiler option, or the PathScale -OPT:alias option
136.  GNU malloc library: malloc, calloc, realloc, free calls; Fortran dynamic variables
 Malloc library system calls:
   mmap, munmap => for larger allocations
   brk, sbrk => increase/decrease heap
 The malloc library is optimized for low system memory use, which can result in system calls and minor page faults
137.  Detecting "bad" malloc behavior: profile data => "excessive system time"
 Correcting "bad" malloc behavior: eliminate mmap use by malloc; increase the threshold for releasing heap memory
 Use environment variables to alter malloc:
   MALLOC_MMAP_MAX_ = 0
   MALLOC_TRIM_THRESHOLD_ = 536870912
 Possible downsides: heap fragmentation; the user process may call mmap directly; the user process may launch other processes
 PGI's -Msmartalloc does something similar for you at compile time
138.  Google created a replacement "malloc" library: "minimal" TCMalloc replaces GNU malloc
 Limited testing indicates TCMalloc is as good as or better than GNU malloc
 Environment variables are not required
 TCMalloc is almost certainly better for allocations in OpenMP parallel regions
 There's currently no pre-built tcmalloc for the Cray XT, but some users have successfully built it
139.  Linux has a "first touch" policy for memory allocation: the *alloc functions don't actually allocate your memory; it gets allocated when "touched"
 Problem: a code can allocate more memory than is available. Linux assumes "swap space", which we don't have, so applications won't fail from over-allocation until the memory is finally touched
 Problem: memory is placed on the NUMA node of the "touching" thread. This is only a problem if thread 0 allocates all memory for a node
 Solution: always initialize your memory immediately after allocating it. If you over-allocate, it fails immediately rather than at a strange place in your code, and if every thread touches its own memory, it is allocated on the proper socket
140.  Short-message eager protocol:
   The sending rank "pushes" the message to the receiving rank
   Used for messages of MPICH_MAX_SHORT_MSG_SIZE bytes or less
   The sender assumes the receiver can handle the message: a matching receive is posted, or there are available event queue entries (MPICH_PTL_UNEX_EVENTS) and buffer space (MPICH_UNEX_BUFFER_SIZE) to store the message
 Long-message rendezvous protocol:
   Messages are "pulled" by the receiving rank
   Used for messages greater than MPICH_MAX_SHORT_MSG_SIZE bytes
   The sender sends a small header packet with the information the receiver needs to pull over the data
   Data is sent only after the matching receive is posted by the receiving rank
141. (Diagram.) Portals match entries (MEs) posted by MPI handle incoming messages: an eager short-message ME, a rendezvous long-message ME, and MEs for unexpected messages. Step 1: the receiver (rank 1) calls MPI_RECV, and MPI posts an ME to Portals; here MPI_RECV is posted prior to the MPI_SEND call. Step 2: the sender (rank 0) calls MPI_SEND. Step 3: Portals delivers the data with a DMA PUT. Unexpected messages land in the unexpected buffers (MPICH_UNEX_BUFFER_SIZE) and are tracked in the unexpected event queue (MPICH_PTL_UNEX_EVENTS); other events use the other event queue (MPICH_PTL_OTHER_EVENTS).